# AI Workshop - Lab 2-2: Intent Classification

In this lab, we’ll build a system to classify customer text messages into different categories (called **intents**) using a powerful type of AI model called a transformer. Transformers are a key technology behind tools like ChatGPT and other modern language systems, but don’t worry if you’re new to them—we’ll break it down step by step.

### Data Overview

We’re working with a dataset of customer text messages that has already been labeled with their intent (e.g., "Order Status", "Product Inquiry", "Account Help"). The goal is to teach the model to recognize these patterns so it can classify new messages correctly.

- **Dataset**:
  - Provided as two files: one for training and one for testing.
  - Training data is used to teach the model, and testing data is used to see how well it learned.
- **Number of Categories**: 27 different intents.

### What We’ll Do in This Lab
1. **Load the Data**:
   - Open and inspect the dataset to understand its structure.
   - Check how many examples we have for each intent.
2. **Prepare the Data**:
   - Use a tool called a **tokenizer** to break down text messages into a format the model can understand.
   - Convert the intent labels into numbers so the model can learn from them.
3. **Use a Pre-Trained Model**:
   - Start with an existing model called `T5-small` that already knows a lot about language.
   - Customize (or fine-tune) it to focus on the intents in our dataset.
4. **Train the Model**:
   - Use the prepared data to train the model step by step.
   - Measure how well it’s doing along the way.
5. **Evaluate the Model**:
   - Test the model on new data it hasn’t seen before.
   - Check how accurate it is and where it might make mistakes.

### What You’ll Learn
- **Transformers**: Get an introduction to these models and why they’re so powerful for language tasks.
- **Fine-Tuning**: Learn how to take a pre-trained model and adapt it to solve a specific problem.
- **Model Evaluation**: Understand how to measure a model’s performance and interpret its predictions.

### HuggingFace Libraries

So far we have been working with Keras, a popular library for building neural networks. In this lab, we’ll use the HuggingFace libraries, which are designed specifically for working with transformers.

The main HuggingFace library is called `transformers`, and it provides tools for working with pre-trained models, tokenizers, and training pipelines. We’ll also use `datasets` to load and process our data. `accelerate` and `evaluate` are additional libraries that help speed up training and evaluate models, respectively. Install them below:

In [1]:
!pip install -Uq datasets transformers accelerate evaluate

For this lab, it's essential that we have a GPU available to speed up training. On Google Colab, you can enable a GPU by going to **Runtime** > **Change runtime type** > **Hardware accelerator** > **GPU**.

The following code checks whether a GPU is available:

In [22]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print('GPU is available!')
else:
    print('GPU is not available. Enable a GPU runtime in Colab under "Runtime" > "Change runtime type".')

GPU is available!


# Loading the Dataset

Great! Now that our packages are installed, we can load the dataset.

We will be working with a dataset of customer text messages that have been labeled with their intent. Let's load the dataset and inspect it to understand its structure.

In [23]:
from datasets import load_dataset

In [24]:
intents = load_dataset("alexwaolson/customer-intents")

In [25]:
intents['train']

Dataset({
    features: ['message', 'label'],
    num_rows: 1555
})

As you can see, the dataset has two columns: `message` and `label`. The `message` column contains the text of the customer message, and the `label` column contains the intent category. There are 27 possible intent categories in this dataset. We can count how many examples we have for each intent to see whether the dataset is balanced.

In [26]:
from collections import Counter

Counter(intents['train']['label'])

Counter({'edit account': 65,
         'delivery period': 65,
         'get refund': 63,
         'check payment methods': 62,
         'change shipping address': 62,
         'check cancellation fee': 62,
         'check invoice': 61,
         'payment issue': 60,
         'set up shipping address': 59,
         'create account': 58,
         'track refund': 58,
         'complaint': 58,
         'contact customer service': 58,
         'change order': 58,
         'switch account': 57,
         'cancel order': 57,
         'get invoice': 56,
         'track order': 56,
         'newsletter subscription': 56,
         'recover password': 55,
         'delete account': 54,
         'delivery options': 54,
         'place order': 53,
         'check refund policy': 53,
         'contact human agent': 52,
         'registration problems': 52,
         'review': 51})
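
One rough way to quantify balance is to compare the largest and smallest class counts. The sketch below does that for three of the 27 counts printed above, copied by hand to keep it short. The `imbalance_ratio` helper is a hypothetical name used for illustration, not part of `datasets`.

```python
from collections import Counter

# Three of the 27 counts printed above, copied by hand for illustration.
sample_counts = Counter({'edit account': 65, 'complaint': 58, 'review': 51})

def imbalance_ratio(counts):
    """Return the largest class count divided by the smallest (hypothetical helper)."""
    return max(counts.values()) / min(counts.values())

ratio = imbalance_ratio(sample_counts)
print(f'Largest class is {ratio:.2f}x the size of the smallest')
```

A ratio close to 1 suggests the classes are roughly balanced, in which case plain accuracy is a reasonable metric. Since 65 and 51 are also the largest and smallest counts in the full output, the ratio for the whole training set is the same.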

# Preparing the Data

Now that we have our dataset loaded, we need to prepare it for training. This involves **tokenizing** the inputs and converting the labels into a format the model can understand.

### Tokenization

Up until now we have been working with data that's easily converted into numbers (like images or tabular data). But with text, we need to do some extra work to convert it into a format the model can understand.

The first step is to **tokenize** the text. Tokenization is the process of breaking down text into smaller parts called **tokens**. Depending on the tokenizer, tokens can be whole words, **subwords**, or individual **characters**. T5's tokenizer works at the subword level, so common words usually stay whole while rarer words are split into pieces. For example, the sentence "Hello, how are you?" might be tokenized into `['Hello', ',', 'how', 'are', 'you', '?']`.
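
To make the three granularities concrete, here is a minimal pure-Python sketch. It does not reproduce what T5's tokenizer does: the regular expression and the hand-picked subword split are assumptions chosen only for illustration.

```python
import re

text = 'Hello, how are you?'

# Word-level: runs of word characters, plus each punctuation mark on its own.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level: every character (including spaces) is its own token.
char_tokens = list(text)

# Subword-level: a hand-written split of one longer word, for illustration only.
# A real subword tokenizer learns its pieces from data.
subword_example = {'tokenization': ['token', 'ization']}

print(word_tokens)
print(len(char_tokens))
print(subword_example['tokenization'])
```

Word-level splitting gives short sequences but a huge vocabulary, and character-level splitting gives a tiny vocabulary but long sequences. Subword tokenization sits between the two.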

We are going to be working with a pre-trained model called `T5-small`. This model expects its input in a specific format, so we need to use its tokenizer to convert our text into tokens.

In [27]:
from transformers import AutoTokenizer

# Load our tokenizer
model_name = 't5-small'
# The AutoTokenizer class will automatically select the correct tokenizer class for the model!
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [28]:
tokenizer('Hello, how are you?')

{'input_ids': [8774, 6, 149, 33, 25, 58, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Let's inspect the tokenizer a bit more closely to understand how it works. We can use the `show_tokenization` function below to see how a sentence gets tokenized by the model.

**Task**: Run the `show_tokenization` function with a sentence of your choice to see how it gets tokenized.

In [29]:
def show_tokenization(tokenizer, text):
    print(f'Original text: {text}')
    tokens = tokenizer(text, truncation=True)['input_ids']
    for token in tokens:
        print(f'{tokenizer.decode([token]):10} -> {token}')

# Write any sentence and see how it gets tokenized by the model:
show_tokenization(tokenizer, 'your sentence here')

Original text: your sentence here
your       -> 39
sentence   -> 7142
here       -> 270
</s>       -> 1


Hopefully you can see how the tokenizer works now! Notice the extra `</s>` token at the end: it marks the end of the sequence, and the tokenizer adds it automatically. We also glossed over a term which you might spot later on too: the `attention_mask`. This is a sequence of 1s and 0s that tells the model which tokens to pay attention to and which to ignore. It's a crucial part of how transformers work, but we don't need to worry about it too much for now.

### Padding

Another important step in preparing the data is **padding**. When we tokenize text, we end up with sequences of different lengths. But the model expects every sequence in a batch to have the same length, so the shorter sequences are extended with a special padding token (`<pad>` for T5). The function below tokenizes the full dataset. The padding itself is applied later, one batch at a time, by the `DataCollatorWithPadding` defined after it.

In [30]:
def preprocess_function(examples):
    return tokenizer(examples["message"], truncation=True)

In [31]:
tokenized_intents = intents.map(preprocess_function, batched=True)

In [32]:
tokenized_intents['train']

Dataset({
    features: ['message', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1555
})

In [33]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
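
To see what padding and the `attention_mask` look like in practice, here is a pure-Python sketch that pads a batch by hand. It reuses the two `input_ids` lists printed earlier and assumes a padding ID of 0 (T5's `<pad>` ID). `pad_batch` is a hypothetical helper written for this illustration; during training, `DataCollatorWithPadding` does the equivalent work for us.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([
    [8774, 6, 149, 33, 25, 58, 1],  # 'Hello, how are you?'
    [39, 7142, 270, 1],             # 'your sentence here'
])
for ids, mask in zip(batch['input_ids'], batch['attention_mask']):
    print(ids, mask)
```

Padding per batch rather than to one global length keeps batches of short messages small, which saves computation.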

In [13]:
import evaluate

accuracy = evaluate.load("accuracy")

In [14]:
import numpy as np


def compute_metrics(eval_pred):
    # Unpack the predictions and labels
    predictions, labels = eval_pred.predictions, eval_pred.label_ids

    # Handle tuple predictions
    if isinstance(predictions, tuple):
        predictions = predictions[0]  # Take the first element, assuming it's the logits

    # Convert to NumPy array if necessary
    predictions = np.array(predictions)

    # Compute class predictions
    predictions = np.argmax(predictions, axis=1)

    # Return computed metrics
    return accuracy.compute(predictions=predictions, references=labels)
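
The two key steps in `compute_metrics`, taking the argmax over each row of logits and comparing the result against the labels, can be sketched without any libraries. The logits below are made-up numbers for three classes, not model output.

```python
# Made-up logits for 4 examples and 3 classes (illustrative values only).
logits = [
    [2.0, 0.1, -1.0],
    [0.3, 0.2, 1.5],
    [0.0, 3.0, 0.5],
    [1.2, 1.1, 0.9],
]
labels = [0, 2, 1, 1]

# argmax along the class axis: the index of the largest logit in each row.
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]

# Accuracy is the fraction of predictions that match the reference labels.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy_value = correct / len(labels)

print(predictions)
print(accuracy_value)
```

Here three of the four predictions match, so the accuracy is 0.75. The last example is misclassified because class 0 has a slightly larger logit than the true class 1.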


In [15]:
# Convert labels to integers
label2id = {label: i for i, label in enumerate(intents['train'].unique('label'))}
id2label = {i: label for label, i in label2id.items()}

def encode_label(example):
    example['label'] = label2id[example['label']]
    return example

tokenized_intents = tokenized_intents.map(encode_label)
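
As a quick illustration of the mapping, the sketch below builds the same two dictionaries for a three-label toy list. The label names come from the counts above, but their order and the resulting IDs are assumptions. The real IDs depend on the order returned by `unique('label')`.

```python
toy_labels = ['track order', 'get refund', 'complaint']

# Same construction as above, applied to a toy list instead of the dataset.
toy_label2id = {label: i for i, label in enumerate(toy_labels)}
toy_id2label = {i: label for label, i in toy_label2id.items()}

# Encode two labels to integers, then decode them back to strings.
encoded = [toy_label2id[label] for label in ['complaint', 'track order']]
decoded = [toy_id2label[i] for i in encoded]

print(encoded)
print(decoded)
```

Keeping both directions is useful: the model trains on the integer IDs, while `id2label` turns its predictions back into readable intent names.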

In [16]:
tokenized_intents['train']

Dataset({
    features: ['message', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1555
})

In [17]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,                # 't5-small', the same checkpoint as the tokenizer
    num_labels=len(label2id),  # 27 intents
    id2label=id2label,
    label2id=label2id
)

Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir='logs',
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=10,
    save_strategy="no",
    output_dir='output'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_intents['train'],
    eval_dataset=tokenized_intents['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [20]:
trainer.train()

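| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 10   | 3.3758        | 3.307129        | 0.061697 |
| 20   | 3.3479        | 3.284916        | 0.07455  |
| 30   | 3.2969        | 3.266773        | 0.079692 |
| 40   | 3.334         | 3.250174        | 0.100257 |
| 50   | 3.3456        | 3.234811        | 0.125964 |
| 60   | 3.2715        | 3.222997        | 0.125964 |
| 70   | 3.2823        | 3.214721        | 0.120823 |
| 80   | 3.2322        | 3.203785        | 0.118252 |
| 90   | 3.2507        | 3.191883        | 0.113111 |
| 100  | 3.2314        | 3.177467        | 0.118252 |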


TrainOutput(global_step=585, training_loss=2.9015784663012902, metrics={'train_runtime': 66.6871, 'train_samples_per_second': 69.954, 'train_steps_per_second': 8.772, 'total_flos': 20767034735856.0, 'train_loss': 2.9015784663012902, 'epoch': 3.0})
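
Two quick sanity checks help when reading this output. Assuming training runs on a single device, the step count should equal the number of batches per epoch times the number of epochs. A model that assigns equal probability to all 27 classes would have a cross-entropy loss of ln(27) and an accuracy of 1/27. The sketch below works through that arithmetic; it is a back-of-the-envelope calculation, not something the `Trainer` reports.

```python
import math

num_train_rows = 1555   # from the dataset summary above
batch_size = 8          # per_device_train_batch_size
num_epochs = 3          # num_train_epochs
num_classes = 27

# The last batch of each epoch is smaller than the rest, so round up.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

# Baselines for a classifier that assigns equal probability to every class.
chance_loss = math.log(num_classes)
chance_accuracy = 1 / num_classes

print(total_steps)
print(round(chance_loss, 4), round(chance_accuracy, 4))
```

The computed step count, 585, matches `global_step` in the output. The first logged losses (about 3.3) sit close to the chance-level loss of roughly 3.2958, which suggests the model is only just starting to learn during the first 100 steps shown.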