# AI Workshop - Lab 2-2: Intent Classification

In this lab, we’ll build a system to classify customer text messages into different categories (called **intents**) using a powerful type of AI model called a transformer. Transformers are a key technology behind tools like ChatGPT and other modern language systems.

### Data Overview

We’re working with a dataset of customer text messages that has already been labeled with their intent (e.g., "Order Status", "Product Inquiry", "Account Help"). The goal is to teach the model to recognize these patterns so it can classify new messages correctly.

- **Number of Categories**: 27 different intents.

### What You’ll Learn
- **Transformers**: Get an introduction to these models and why they’re so powerful for language tasks.
- **Model Evaluation**: Understand how to measure a model’s performance and interpret its predictions.

In [None]:
!pip install -Uq datasets transformers accelerate evaluate sentencepiece

For this lab, it's essential that we have a GPU available to speed up training. On Google Colab, you can enable a GPU by going to **Runtime** > **Change runtime type** > **Hardware accelerator** > **GPU**.

The following code checks whether a GPU is available:

In [None]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print('GPU is available!')
else:
    print('GPU is not available. Enable a GPU runtime in Colab under "Runtime" > "Change runtime type".')

### Loading the Dataset

Now that we’ve set up our environment and imported the necessary packages, let’s begin by loading our dataset.

In this lab, we’ll work with a dataset of **customer text messages** that have been labeled with their **intent**. Each sample in the dataset includes a text message and a corresponding label indicating the intent behind the message (e.g., inquiry, complaint, order request). This dataset will allow us to build and evaluate models for intent classification.

#### Steps:
1. **Load the Dataset**:
   - Use the `load_dataset` function from the `datasets` library to download and load the dataset.
   - The dataset we’re using is hosted at `"alexwaolson/customer-intents"`.
2. **Inspect the Dataset**:
   - After loading, examine the training split (`intents['train']`) to understand its structure and the data it contains.

In [None]:
from datasets import load_dataset
import pandas as pd

# Load the customer intents dataset
intents = load_dataset("alexwaolson/customer-intents")

# Display the training split
pd.DataFrame(intents['train'])

The dataset consists of two key columns:
- **`message`**: Contains the text of the customer message.
- **`label`**: Contains the intent category for each message.

There are **27 possible intent categories** in this dataset. To understand the distribution of these categories, we can count the number of examples for each intent. This helps us determine whether the dataset is balanced (i.e., whether all categories have similar representation) or imbalanced (some categories have significantly more or fewer samples than others).

Run the code below to calculate the distribution of intent labels:

In [None]:
from collections import Counter
import matplotlib.pyplot as plt

# Count the occurrences of each intent label in the training data
label_counts = Counter(intents['train']['label'])
print(f'Number of unique intents: {len(label_counts)}')

# Plot the distribution of intent labels
plt.figure(figsize=(12, 6))
plt.bar(label_counts.keys(), label_counts.values())
plt.xlabel('Intent Label')
plt.ylabel('Number of Examples')
plt.title('Distribution of Intent Labels')
plt.xticks(rotation=90)
plt.show()
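
To put a number on what the bar chart shows, one option is the ratio between the largest and smallest category. The sketch below is illustrative only: `imbalance_ratio` is a hypothetical helper name, and the toy labels are made up rather than taken from the lab dataset.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (largest class size) / (smallest class size); 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy labels for illustration only; these names are assumptions, not the dataset's labels
toy_labels = ["cancel_order"] * 6 + ["track_order"] * 3 + ["get_refund"] * 3

print(Counter(toy_labels).most_common())
print(f"Imbalance ratio: {imbalance_ratio(toy_labels):.1f}")
```

On the real data, the same helper could be called as `imbalance_ratio(intents['train']['label'])`. A ratio close to 1 suggests the categories are roughly balanced.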

### Zero-Shot Learning

One of the most powerful features of large language models is their ability to perform **zero-shot learning**. Unlike traditional models that require task-specific training, a zero-shot learning model can classify text based on its general understanding of language, even if it hasn’t been explicitly trained on that specific task.

#### How It Works:
- Instead of fine-tuning the model, you provide it with a **prompt** that describes the task and possible labels (e.g., "What is the intent of this message?").
- The model uses its pre-trained knowledge to predict the most appropriate label.

This approach leverages the model's extensive training on a wide variety of text, making it flexible for many tasks.

#### Why Use Zero-Shot Learning?
- **Quick Prototyping**: No need to preprocess or fine-tune the model for every new task.
- **Versatility**: Works for tasks the model wasn’t explicitly trained on, as long as the task can be described in a prompt.

#### Model Selection:
For zero-shot classification, we’ll use the `flan-t5-large` model. Flan-T5 is a T5 model that has been further trained on a large collection of tasks phrased as instructions, which makes it well suited to following a classification prompt like ours. Since we don’t fine-tune it any further, we can focus on testing its performance directly.

### Zero-Shot Intent Classification with Flan-T5

We’ll now use the **Flan-T5 large** model to classify intents via zero-shot learning. This approach involves crafting a **prompt** that describes the task and provides the model with the possible labels. The model then uses its language understanding to predict the intent without task-specific training.

#### Prompt Construction
The prompt is key to zero-shot learning. For our task:
1. The prompt begins by instructing the model to classify the intent of the message.
2. It lists the available intent categories.
3. Finally, it appends the message to classify.

In [None]:
# Define the prompt
prompt = "Classify the intent of the following message using these categories:\n"
for label in label_counts.keys():
    prompt += f"- {label}\n"
prompt += "Message: "

print(prompt)

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="cuda" if torch.cuda.is_available() else "cpu")

In [None]:
# Function for zero-shot classification
def zero_shot_intent_classification(model, prompt, message):
    # Combine the prompt and the message
    input_text = prompt + message
    # Tokenize the input text
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
    # Generate a prediction
    output = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
    # Decode the prediction into text
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Test the function
zero_shot_intent_classification(model, prompt, "I need to cancel my order")

### Testing Zero-Shot Intent Classification

You can now test the zero-shot classification capabilities of the `flan-t5-large` model on a subset of messages from the test set. This will provide a sense of how well the model performs without task-specific training.

In [None]:
for message in intents['test']['message'][25:35]:
    print(f"Message: {message}")
    print(f"Predicted Intent: {zero_shot_intent_classification(model, prompt, message)}")
    print()

### Predicting Intent At Scale

To evaluate the performance of the `flan-t5-large` zero-shot model on the entire test dataset, we’ll:
1. **Generate Predictions**: Use the `zero_shot_intent_classification` function to predict intents for all test messages.
2. **Compare Predictions**: Compare the zero-shot predictions to the true labels in the test set.
3. **Examine mis-classified text**: Look at incorrectly classified examples to see if we can understand what went wrong.

In [None]:
from tqdm import tqdm

zero_shot_predictions = [zero_shot_intent_classification(model, prompt, message) for message in tqdm(intents['test']['message'])]
true_labels_text = intents['test']['label']
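
The list comprehension above calls the model once per message, which can be slow on the full test set. One common way to speed this up is to process several messages per call. The sketch below shows only the batching logic; `predict_in_batches` and `dummy_classify_batch` are hypothetical names, and the stand-in classifier exists so the example runs without loading a model.

```python
def predict_in_batches(messages, classify_batch, batch_size=16):
    """Apply classify_batch to successive slices of messages and collect results in order."""
    predictions = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        predictions.extend(classify_batch(batch))
    return predictions

# Stand-in for a real model call (assumption): it simply upper-cases each message
def dummy_classify_batch(batch):
    return [text.upper() for text in batch]

print(predict_in_batches(["a", "b", "c", "d", "e"], dummy_classify_batch, batch_size=2))
```

With the real model, `classify_batch` could tokenize `[prompt + m for m in batch]` with `padding=True`, pass both `input_ids` and `attention_mask` to `model.generate`, and decode the outputs with `tokenizer.batch_decode(..., skip_special_tokens=True)`. The best batch size depends on available GPU memory.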

In [None]:
from sklearn.metrics import accuracy_score

print(f'Accuracy: {accuracy_score(true_labels_text, zero_shot_predictions)}')

Our accuracy using zero-shot learning is around **70%**, even though the model was never trained on these categories. Note that a prediction only counts as correct if the generated text matches the label string exactly. Let's look at performance by category to see whether there are any that the model struggles with. The **classification report** breaks down the model's performance by category, which shows whether some categories are handled less well than others. For each category, it reports the following:

- **Precision**: This measures the proportion of correctly predicted positive observations to the total predicted positives. High precision indicates that the model makes few false positive errors.

  $$
  \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
  $$

  - **Example**: If the task is to classify emails as "spam," a **true positive** is an email correctly classified as spam, while a **false positive** is a legitimate email incorrectly classified as spam.

- **Recall**: This measures the proportion of correctly predicted positive observations to all the actual positives. High recall indicates that the model captures most of the true positive cases.

  $$
  \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
  $$

  - **Example**: In the same email classification task, a **true positive** is an email correctly classified as spam, while a **false negative** is a spam email incorrectly classified as legitimate.

- **F1 Score**: This is the harmonic mean of precision and recall, balancing the two metrics. A high F1 score indicates a good trade-off between precision and recall.

  $$
  \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
  $$

- **Support**: This refers to the number of actual occurrences of each category in the dataset. It helps us understand the distribution of the categories and whether any are underrepresented, which can impact performance metrics.
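
To make the formulas concrete, here is a small worked example for a single category. The counts are hypothetical, and `precision_recall_f1` is an illustrative helper rather than part of scikit-learn. Returning 0 when a denominator is 0 mirrors the `zero_division=0` setting used with the classification report.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, returning 0.0 on a zero denominator."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts for one category: 8 true positives, 2 false positives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
support = 8 + 4  # support = all actual examples of the category (TP + FN)

print(f"Precision: {p:.2f}, Recall: {r:.2f}, F1: {f1:.2f}, Support: {support}")
# Prints: Precision: 0.80, Recall: 0.67, F1: 0.73, Support: 12
```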

In [None]:
from sklearn.metrics import classification_report

print(classification_report(true_labels_text, zero_shot_predictions, zero_division=0))

Let's now look at some mis-classified examples to see if we can understand why they were not classified correctly.

In [None]:
# Display mis-classified examples
misclassified_examples = [(message, true_label, pred) for message, true_label, pred in zip(intents['test']['message'], true_labels_text, zero_shot_predictions) if true_label != pred]
pd.DataFrame(misclassified_examples, columns=['Message', 'True Label', 'Predicted Label'])
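
Because the model generates free text, some "misclassifications" may be outputs that are not valid labels at all, or that differ from a label only in case or spacing. A minimal sketch of one way to map such outputs back to the label set uses fuzzy string matching from Python's standard library. `snap_to_label`, the `cutoff` value, and the label names below are assumptions chosen for illustration.

```python
import difflib

def snap_to_label(prediction, valid_labels, cutoff=0.6):
    """Return the valid label closest to prediction, or None if nothing is similar enough."""
    cleaned = prediction.strip().lower()
    lookup = {label.lower(): label for label in valid_labels}
    # An exact match after normalizing case and whitespace needs no fuzzy matching
    if cleaned in lookup:
        return lookup[cleaned]
    matches = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

# Hypothetical label names, not necessarily those in the dataset
valid = ["cancel_order", "track_order", "get_refund"]

print(snap_to_label(" Cancel_Order ", valid))    # differs only in case and whitespace
print(snap_to_label("cancel order", valid))      # close to a valid label
print(snap_to_label("weather forecast", valid))  # nothing similar enough
```

Fuzzy matching can hide genuine errors, so it is worth counting how many predictions needed snapping and how many came back as `None` before reporting a revised accuracy.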

### Flagging Abuse

One of the challenges in customer service is identifying and handling abusive messages. Even in this dataset there are examples where customers have used inappropriate language in their requests.

In [None]:
# Display test messages containing the word "damn" (a simple keyword proxy for abusive language)
pd.DataFrame([(message, label) for message, label in zip(intents['test']['message'], true_labels_text) if 'damn' in message.lower()], columns=['Message', 'Label'])

Suppose we want to introduce a new task: classifying messages as abusive or not abusive. We can use the same zero-shot approach, with a prompt that describes the new task and its two labels.

In [None]:
# Define the prompt for abusive language classification
abuse_prompt = "Classify this message as abusive if it contains inappropriate language or not abusive if it does not.\n"

# Reuse the Flan-T5 tokenizer and model loaded earlier instead of loading a second copy into memory
abuse_tokenizer = tokenizer
abuse_model = model

# Function for zero-shot classification of abusive language
def zero_shot_abuse_classification(model, prompt, message):
    # Combine the prompt and the message
    input_text = prompt + message
    # Tokenize the input text
    input_ids = abuse_tokenizer(input_text, return_tensors="pt").input_ids.to(device)
    # Generate a prediction
    output = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
    # Decode the prediction into text
    return abuse_tokenizer.decode(output[0], skip_special_tokens=True)

# Test the function
zero_shot_abuse_classification(abuse_model, abuse_prompt, "I'm so damn frustrated with your service!")

Now we can use the model to flag messages as abusive or not abusive:

In [None]:
# Predict for samples that contain the word "damn"
for message in intents['test']['message']:
    if 'damn' in message.lower():
        print(f"Message: {message}")
        print(f"Predicted Label: {zero_shot_abuse_classification(abuse_model, abuse_prompt, message)}")
        print()
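
To act on these predictions in code, the generated text has to be turned into a yes/no flag. The sketch below assumes the model answers with phrases such as "abusive" or "not abusive"; the exact wording it produces is not guaranteed, and `is_flagged` is a hypothetical helper. A plain substring check is a pitfall here, because "not abusive" also contains "abusive".

```python
def is_flagged(model_output):
    """Map a free-text answer such as 'abusive' or 'Not abusive.' to True or False."""
    cleaned = model_output.strip().lower().rstrip(".!")
    # Check the negative form first: "not abusive" also contains the substring "abusive"
    if cleaned.startswith("not") or cleaned.startswith("non"):
        return False
    return "abusive" in cleaned

# Example outputs are assumptions about what the model might return
for output in ["abusive", "Not abusive.", "not abusive", "Abusive"]:
    print(f"{output!r} -> {is_flagged(output)}")
```

Any output that matches neither form is treated as not flagged, which is a design choice. In a real moderation setting it may be safer to route unrecognized outputs to a human reviewer.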

### Translation

In some cases, it may be useful to translate customer messages into a different language. This can help customer service teams understand and respond to messages in languages they don't speak. Let's use the `pipeline` function from the Hugging Face `transformers` library to translate a sample message from English to French.

In [None]:
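from transformers import pipeline

# Create an English-to-French translation pipeline (downloads a default model on first use)
en_fr_translator = pipeline("translation_en_to_fr")

# Define a sample message to translate
en_message = "I need help with my order."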

In [None]:
# Translate the message
fr_message = en_fr_translator(en_message)[0]['translation_text']
print(fr_message)

In [None]:
# Translate first 50 messages in the test set
en_messages = intents['test']['message'][:50]
fr_messages = en_fr_translator(en_messages)

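# Display the translations alongside the original messages
pd.DataFrame({'English': en_messages, 'French': [t['translation_text'] for t in fr_messages]})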
