## 1. Intent Classification Model (You are here!)

This Python code builds and trains a machine learning model to classify the intent behind text, such as customer support queries.

**Here's a breakdown of what the code does:**

1. **Loads and prepares a dataset:** It uses the `datasets` library to load a chatbot training dataset and converts it into a suitable format for processing.
2. **Preprocesses the text:** It defines a `TextPreprocessor` class to clean and normalize the input text by performing actions like:
    * Removing special characters and extra spaces.
    * Converting the text to lowercase.
    * Eliminating very short words.
3. **Creates a Machine Learning pipeline:** It uses the `ModelBuilder` class to configure a Scikit-learn pipeline consisting of:
    * A text preprocessing step.
    * Text vectorization using TF-IDF (Term Frequency-Inverse Document Frequency).
    * An `ExtraTreesClassifier` to predict the intent.
4. **Trains and evaluates the model:** It trains the pipeline using the training data and evaluates its performance on the test data.
5. **Defines a prediction function:** It creates the `predict_intent` function to take new text as input, apply the trained pipeline, and return the predicted intent.

**Essentially, this code creates a system that learns to categorize text, which can be used to:**

* Identify the intent behind customer queries in a chatbot.
* Automatically label support tickets based on their content.
* Classify text documents into different categories.

The code utilizes good programming practices like implementing design patterns (Abstract Factory and Factory Method), making it more organized, reusable, and maintainable.

In [16]:
import re
import joblib
from abc import ABC, abstractmethod

import pandas as pd
from datasets import load_dataset
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

**1. `TextPreprocessorBase (Abstract Class)`:**

* Defines the template for text preprocessing classes.
* **`preprocess_text(self, text: str) -> str` (Abstract Method):**
    * Defines the method signature that all derived classes must implement.
    * Takes a text string as input and returns the preprocessed text.
* **`fit(self, X, y=None)`:**
    * Default method for compatibility with the sklearn Pipeline, returns the instance itself (`self`).
* **`transform(self, X, y=None)`:**
    * Applies the `preprocess_text` method to each text in `X` (a list of text strings).

In [2]:
# Abstract Base Class for Text Preprocessors
class TextPreprocessorBase(ABC, BaseEstimator, TransformerMixin):
    @abstractmethod
    def preprocess_text(self, text: str) -> str:
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return [self.preprocess_text(text) for text in X]

**2. `TextPreprocessor (Concrete Class)`:**

* Implements the abstract class `TextPreprocessorBase` to perform text preprocessing.
* **`to_lowercase(self, text: str) -> str`:**
    * Converts the text to lowercase using the `lower()` method.
* **`remove_special_chars(self, text: str) -> str`:**
    * Removes special characters from the text using regular expressions (`re.sub`).
    * `r'[^\w\s]|_|[àèìòùñ°ª]'`: Regex that identifies special characters, including some accents.
* **`remove_single_char(self, text: str) -> str`:**
    * Removes isolated characters from the text using regular expressions (`re.sub`).
    * `r'\b[A-Za-z]\b'`: Regex that identifies isolated alphabetical characters.
* **`remove_spaces(self, text: str) -> str`:**
    * Removes extra whitespaces from the text using regular expressions (`re.sub`).
    * `r'(\s)+'`: Regex that identifies one or more consecutive whitespaces.
* **`preprocess_text(self, text: str) -> str`:**
    * Implements the abstract method of the base class.
    * Applies the preprocessing functions in sequence:
        1. `remove_special_chars`
        2. `remove_single_char`
        3. `to_lowercase`
        4. `remove_spaces`

In [3]:
# Concrete Text Preprocessor
class TextPreprocessor(TextPreprocessorBase):
    def to_lowercase(self, text: str) -> str:
        return text.lower()

    def remove_special_chars(self, text: str) -> str:
        special_chars_pattern = re.compile(r'[^\w\s]|_|[àèìòùñ°ª]', flags=re.UNICODE)
        return re.sub(special_chars_pattern, "", text)

    def remove_single_char(self, text: str) -> str:
        single_char_pattern = re.compile(r'\b[A-Za-z]\b')
        return re.sub(single_char_pattern, "", text)

    def remove_spaces(self, text: str) -> str:
        pattern = re.compile(r'(\s)+')
        return pattern.sub(' ', text).strip()

    def preprocess_text(self, text: str) -> str:
        text = self.remove_special_chars(text)
        text = self.remove_single_char(text)
        text = self.to_lowercase(text)
        text = self.remove_spaces(text)
        return text

**3. `ModelBuilder (Class)`:**

* Responsible for creating and configuring the machine learning pipeline.
* **`__init__(self)`:**
    * Initializes instances of the pipeline components:
        * `text_preprocessor`: Uses the `TextPreprocessor` class to preprocess the text.
        * `vectorizer`: Uses `TfidfVectorizer` to transform the text into numerical vectors.
            * `ngram_range=(2, 3)`: Considers bigrams and trigrams.
            * `max_features=1000`: Limits the vocabulary to 1000 features.
            * `min_df=5`: Ignores words that appear in less than 5 documents.
        * `classifier`: Uses `ExtraTreesClassifier` for classification.
            * `criterion='gini'`: Uses the Gini index to measure the quality of a split.
            * `max_depth=100`: Sets the maximum depth of the tree.
            * `n_jobs=-1`: Uses all processor cores.
* **`create_pipeline(self) -> Pipeline`:**
    * Creates and returns the sklearn pipeline with the following steps:
        1. `text_preprocessor`: Preprocesses the text.
        2. `tfidf`: Vectorizes the text using TF-IDF.
        3. `clf`: Classifies the data using the classifier.

In [4]:
# Model Builder (Factory Method Pattern)
class ModelBuilder:
    def __init__(self):
        self.text_preprocessor = TextPreprocessor()
        self.vectorizer = TfidfVectorizer(ngram_range=(2, 3),
                                         max_features=1000,
                                         min_df=5)
        self.classifier = ExtraTreesClassifier(criterion='gini',
                                               max_depth=100,
                                               n_jobs=-1)

    def create_pipeline(self) -> Pipeline:
        return Pipeline(steps=[
            ('text_preprocessor', self.text_preprocessor),
            ('tfidf', self.vectorizer),
            ('clf', self.classifier)
        ])


In [6]:
# Load and Prepare Dataset
dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")
df = pd.DataFrame(dataset['train'])[["instruction", "intent"]]
df = df.rename(columns={"instruction": "text", "intent": "category"})
df

Unnamed: 0,text,category
0,question about cancelling order {{Order Number}},cancel_order
1,i have a question about cancelling oorder {{Or...,cancel_order
2,i need help cancelling puchase {{Order Number}},cancel_order
3,I need to cancel purchase {{Order Number}},cancel_order
4,"I cannot afford this order, cancel purchase {{...",cancel_order
...,...,...
26867,I am waiting for a rebate of {{Refund Amount}}...,track_refund
26868,how to see if there is anything wrong with my ...,track_refund
26869,I'm waiting for a reimbjrsement of {{Currency ...,track_refund
26870,I don't know what to do to see my reimbursemen...,track_refund


In [7]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['category'], test_size=0.3, random_state=42)

In [8]:
# Build and Train Model
model_builder = ModelBuilder()
pipeline = model_builder.create_pipeline()
pipeline.fit(df['text'], df['category'])

In [19]:
joblib.dump(pipeline, "/home/rafael/Downloads/faq-chatbot/app/customer_support_pipeline.pkl")

['/home/rafael/Downloads/faq-chatbot/app/customer_support_pipeline.pkl']

In [9]:
# Evaluate Model
score = pipeline.score(X_test, y_test)
print(f"Model Accuracy: {score}")

Model Accuracy: 0.9119325229471595


**4. `predict_intent(text: str) -> str`:**

* Utility function to perform intent prediction from a text string.
* Takes a text string as input.
* Uses the trained pipeline to predict the category (intent) of the text.
* Returns the predicted category.

In [10]:
# Prediction Function
def predict_intent(text: str) -> str:
    return pipeline.predict([text])[0]

In [11]:
# Example Usage
example_text = df.text.iloc[0]
predicted_intent = predict_intent(example_text)

print(f"Question: {example_text}")
print(f"Answer: {predicted_intent}")

Question: question about cancelling order {{Order Number}}
Answer: cancel_order


**Other parts of the code:**

* Dataset loading and preparation using `datasets` and `pandas`.
* Data splitting into training and testing sets using `train_test_split`.
* Model training using `pipeline.fit` with the training data.
* Model evaluation using `pipeline.score` with the testing data.
* Example usage of the `predict_intent` function to predict the intent of a text string.

## 2. Response Design & Organization

To organize the chatbot's flow and register responses, you can opt for two main approaches:

**1. Direct Mapping (simpler):**

* Create a mapping between intents and responses.
* For each intent predicted by the model, the chatbot fetches the corresponding response from this mapping.
* Example: If the intent is "greeting", the mapped response could be "Hello! How can I assist you?".

**2. Dialogue Flow (advanced):**

* Define states and transitions that represent the flow of the conversation.
* Each state can have associated responses and conditions for transitioning to other states.
* Example: A "gathering_information" state could have the response "What is your name?" and transition to a "confirming_data" state after receiving the information.

Choosing the right approach depends on the chatbot's complexity and the need for personalized interactions. Direct mapping is simpler, while dialogue flow offers more flexibility and control over the conversation.


In [12]:
intent_responses = {
    "edit_account": "To edit your account information, you can go to the 'Account Settings' page.",
    "switch_account": "You can switch between multiple accounts using the account switcher in the top-right corner.",
    "check_invoice": "To check your invoices, visit the 'Billing' section of your account.",
    "complaint": "We appreciate your feedback. Please provide more details about your complaint so we can assist you better.",
    "contact_customer_service": "You can reach our customer service team at [phone number] or [email address].",
    "delivery_period": "Delivery times vary depending on your location and the shipping method chosen during checkout.",
    "registration_problems": "If you're having trouble registering, please ensure you're using a valid email address and have met the password requirements.",
    "check_payment_methods": "We accept the following payment methods: [List payment methods].",
    "contact_human_agent": "I can connect you with a human agent. Please hold while I transfer you.",
    "payment_issue": "We apologize for the payment issue. Please verify your payment information or contact your bank.",
    "newsletter_subscription": "To subscribe to our newsletter, enter your email address in the 'Subscribe' box at the bottom of the page.",
    "get_invoice": "You can download your invoices from the 'Order History' section.",
    "place_order": "To place an order, simply add the desired items to your cart and proceed to checkout.",
    "cancel_order": "You can cancel your order before it ships. Please contact customer support for assistance.",
    "track_refund": "Once your refund is processed, you'll receive a confirmation email with tracking information.",
    "change_order": "Order modifications may be possible depending on the order status. Please contact us.",
    "get_refund": "Refunds are eligible for returns within [number] days of purchase.",
    "create_account": "Creating an account is easy! Just click on 'Sign Up' and follow the instructions.",
    "check_refund_policy": "Our refund policy can be found on our website, under 'Terms and Conditions'.",
    "review": "We appreciate your feedback! Please leave your review on our product page.",
    "set_up_shipping_address": "You can add or edit shipping addresses in the 'Address Book' section of your account.",
    "delivery_options": "We offer various shipping options, including standard and express delivery.",
    "delete_account": "We're sorry to hear you'd like to delete your account. You can do so in the 'Account Settings'.",
    "recover_password": "To reset your password, click on 'Forgot Password' and follow the instructions sent to your email.",
    "track_order": "You can track your order status using the tracking number provided in your confirmation email.",
    "change_shipping_address": "Address changes are possible before shipment. Please contact us immediately.",
    "check_cancellation_fee": "Cancellation fees may apply depending on the order. Please review our cancellation policy."
}

**Why a pre-built setup is not ideal:**

- **Limitation:**  A system with predefined responses lacks flexibility and cannot handle new inputs or requests not included in the dictionary.
- **Lack of Context:** It cannot maintain the context of previous conversations, leading to irrelevant or repetitive responses.
- **No Personalization:** Every user gets the same generic responses, limiting the ability to provide a personalized experience.

In [13]:
def chatbot(text):

    intent = predict_intent(example_text)

    if intent:
        return intent_responses[intent]
    else:
        return "I'm sorry, I didn't understand your request. Can you please rephrase?"

# Example usage:
user_input = "I want to check my invoice"
response = chatbot(user_input)
print(response)  # Output: To check your invoices, visit the 'Billing' section of your account.

You can cancel your order before it ships. Please contact customer support for assistance.


## 3.  Dialogue Management (If Using Dialogue Flow)


## 4. User Interface (UI) Development


## 5.  Integration & Testing


## 6. Deployment & Continuous Improvement