# AI Workshop - Lab 2-2: Intent Classification

In this lab, we’ll build a system to classify customer text messages into different categories (called **intents**) using a powerful type of AI model called a transformer. Transformers are a key technology behind tools like ChatGPT and other modern language systems, but don’t worry if you’re new to them—we’ll break it down step by step.

### Data Overview

We’re working with a dataset of customer text messages that has already been labeled with their intent (e.g., "Order Status", "Product Inquiry", "Account Help"). The goal is to teach the model to recognize these patterns so it can classify new messages correctly.

- **Dataset**:
  - Provided as two files: one for training and one for testing.
  - Training data is used to teach the model, and testing data is used to see how well it learned.
- **Number of Categories**: 27 different intents.

### What We’ll Do in This Lab
1. **Load the Data**:
   - Open and inspect the dataset to understand its structure.
   - Check how many examples we have for each intent.
2. **Prepare the Data**:
   - Use a tool called a **tokenizer** to break down text messages into a format the model can understand.
   - Convert the intent labels into numbers so the model can learn from them.
3. **Use a Pre-Trained Model**:
   - Start with an existing model called `T5-small` that already knows a lot about language.
   - Customize (or fine-tune) it to focus on the intents in our dataset.
4. **Train the Model**:
   - Use the prepared data to train the model step by step.
   - Measure how well it’s doing along the way.
5. **Evaluate the Model**:
   - Test the model on new data it hasn’t seen before.
   - Check how accurate it is and where it might make mistakes.

### What You’ll Learn
- **Transformers**: Get an introduction to these models and why they’re so powerful for language tasks.
- **Fine-Tuning**: Learn how to take a pre-trained model and adapt it to solve a specific problem.
- **Model Evaluation**: Understand how to measure a model’s performance and interpret its predictions.

### HuggingFace Libraries

So far we have been working with Keras, a popular library for building neural networks. In this lab, we’ll use the HuggingFace libraries, which are designed specifically for working with transformers.

The main HuggingFace library is called `transformers`, and it provides tools for working with pre-trained models, tokenizers, and training pipelines. We’ll also use `datasets` to load and process our data. `accelerate` and `evaluate` are additional libraries that help speed up training and evaluate models, respectively. Install them below:

In [1]:
!pip install -Uq datasets transformers accelerate evaluate

For this lab, it's essential that we have a GPU available to speed up training. On Google Colab, you can enable a GPU by going to **Runtime** > **Change runtime type** > **Hardware accelerator** > **GPU**.

The following cell checks whether a GPU is available:

In [22]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print('GPU is available!')
else:
    print('GPU is not available. Enable a GPU runtime in Colab under "Runtime" > "Change runtime type".')

GPU is available!


# Loading the Dataset

Great! Now that we have our packages installed and imported, we can get going with loading the dataset.

We will be working with a dataset of customer text messages that have been labeled with their intent. Let's load the dataset and inspect it to understand its structure.

In [23]:
from datasets import load_dataset

In [24]:
intents = load_dataset("alexwaolson/customer-intents")

In [25]:
intents['train']

Dataset({
    features: ['message', 'label'],
    num_rows: 1555
})

As you can see, the dataset consists of `message` and `label` columns. The `message` column contains the text of the customer message, and the `label` column contains the intent category. There are 27 possible intent categories in this dataset. We can count how many examples we have for each intent to see whether the dataset is balanced.

In [26]:
from collections import Counter

Counter(intents['train']['label'])

Counter({'edit account': 65,
         'delivery period': 65,
         'get refund': 63,
         'check payment methods': 62,
         'change shipping address': 62,
         'check cancellation fee': 62,
         'check invoice': 61,
         'payment issue': 60,
         'set up shipping address': 59,
         'create account': 58,
         'track refund': 58,
         'complaint': 58,
         'contact customer service': 58,
         'change order': 58,
         'switch account': 57,
         'cancel order': 57,
         'get invoice': 56,
         'track order': 56,
         'newsletter subscription': 56,
         'recover password': 55,
         'delete account': 54,
         'delivery options': 54,
         'place order': 53,
         'check refund policy': 53,
         'contact human agent': 52,
         'registration problems': 52,
         'review': 51})
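
To put a number on how balanced the classes are, one option is to compare the largest and smallest class counts. The sketch below uses a few counts copied from the output above (including the largest and the smallest); `imbalance_ratio` is a hypothetical helper introduced for illustration.

```python
from collections import Counter

# A few counts copied from the output above (largest, one mid-range, smallest).
label_counts = Counter({
    "edit account": 65,
    "delivery period": 65,
    "change order": 58,
    "review": 51,
})

def imbalance_ratio(counts):
    """Return the largest class count divided by the smallest."""
    return max(counts.values()) / min(counts.values())

ratio = imbalance_ratio(label_counts)
print(f"Largest / smallest class: {ratio:.2f}")
```

A ratio close to 1 suggests the classes are roughly balanced, in which case plain accuracy is a reasonable first metric.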

## Preprocessing the Dataset

Now that we have our dataset loaded, we need to preprocess it for training. This involves tokenizing the input messages, deciding how to pad and truncate them, and setting up the data collator for training. We'll walk through each of these steps together in the next few cells.

### Tokenization

Almost all language models don't work directly with text but with _tokenized_ inputs. Tokenization is the process of splitting a text into individual words, subwords, or characters, which are then mapped to unique IDs (integers) by the model's tokenizer. This is then a format that can slot directly into the linear algebra underpinning any deep learning model.

For most LLMs, there are tokens corresponding to most words you'd expect to find, but also for common word pieces such as prefixes and suffixes. This lets the model handle words it hasn't seen before by breaking them into familiar parts. For example, a tokenizer might split "running" into "run" and "ning", allowing the model to combine what it knows about "run" with what it has learned about that ending. The exact splits and notation depend on the tokenizer: BERT-style WordPiece tokenizers mark a continuation piece with a "##" prefix (e.g. "##ning"), while the SentencePiece tokenizer used by T5 instead marks the start of each word with "▁".
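
To make the idea of subword splitting concrete, here is a toy greedy longest-match splitter over a tiny hand-written vocabulary. This is an illustration only: the vocabulary and the `split_into_subwords` function are made up for this example, and real tokenizers (including the one used in this lab) learn their vocabularies from data and use more sophisticated algorithms.

```python
# Toy vocabulary, invented for this example (not taken from any real tokenizer).
TOY_VOCAB = {"run", "walk", "jump", "ing", "ed", "s", "un", "do"}

def split_into_subwords(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Characters that cannot be matched fall back to single-character pieces.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matched: emit one character and move on.
            pieces.append(word[start])
            start += 1
    return pieces

print(split_into_subwords("walking", TOY_VOCAB))   # ['walk', 'ing']
print(split_into_subwords("jumped", TOY_VOCAB))    # ['jump', 'ed']
print(split_into_subwords("undoing", TOY_VOCAB))   # ['un', 'do', 'ing']
```

Even though "undoing" is not in the vocabulary as a whole word, it can still be represented from pieces the vocabulary does contain.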

We'll use the `AutoTokenizer` class from the `transformers` library to load a tokenizer that matches the model we're using. In this case, we'll use the `t5-small` model, which is a smaller version of the T5 model developed by Google. T5 stands for "Text-to-Text Transfer Transformer", and it's a versatile model that can be fine-tuned for many different NLP tasks.

In [27]:
from transformers import AutoTokenizer

# Load our tokenizer
model_name = 't5-small'
# The AutoTokenizer class will automatically select the correct tokenizer class for the model!
tokenizer = AutoTokenizer.from_pretrained(model_name)

Let's inspect the tokenizer a bit more closely to understand what it's capable of. First, we can look at the vocabulary size and the special tokens that the tokenizer uses. For T5, the special tokens mark the end of a sequence (`</s>`), pad sequences to a fixed length (`<pad>`), and stand in for text the tokenizer cannot represent (`<unk>`).

In [None]:
print(f'Vocab size: {tokenizer.vocab_size}')
print(f'Special tokens: {tokenizer.special_tokens_map}')

We can also see how the tokenizer encodes and decodes text. The `encode` method takes a string and converts it into a list of token IDs, while the `decode` method takes a list of token IDs and converts it back into a string. Let's write a function that shows what a given input looks like once encoded, and what each token ID decodes back to.

**Your Turn**: Write a sentence in the `show_tokenization` function and see how it gets tokenized by the model. Try inserting a sentence containing a made up word, or a word that you think might be tokenized into multiple tokens.

In [29]:
def show_tokenization(tokenizer, text):
    print(f'Original text: {text}')
    tokens = tokenizer(text, truncation=True)['input_ids']
    for token in tokens:
        print(f'{tokenizer.decode([token]):10} -> {token}')

# Write any sentence and see how it gets tokenized by the model:
show_tokenization(tokenizer, 'your sentence here')

Original text: your sentence here
your       -> 39
sentence   -> 7142
here       -> 270
</s>       -> 1


Hopefully you can see how the tokenizer works now! We also glossed over a term which you may spot later on: the `attention_mask`. This is a vector that tells the model which tokens are part of the input and which are left over at the end (padding tokens).

### Padding

When training a model, it's common to train on batches of data. However, each sequence in a batch needs to be the same length. This is where padding comes in: we add special padding tokens to the end of sequences that are shorter than the maximum length in the batch. This ensures that all sequences are the same length and can be processed in parallel.

For example, if we wanted all of our batches to have 20 tokens, and we put in the string "Hello, world!", we would translate that into five tokens (don't forget that there's punctuation and an end of string). We would then pad the rest of the sequence with padding tokens until we reach 20 tokens.
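
The padding step described above can be sketched in plain Python, together with the matching `attention_mask`. The token IDs below are illustrative stand-ins for the five tokens of "Hello, world!" rather than verified tokenizer output, and `pad_to_length` is a hypothetical helper. The padding ID of 0 matches T5's `<pad>` token.

```python
PAD_ID = 0  # T5's <pad> token has ID 0

def pad_to_length(token_ids, target_length, pad_id=PAD_ID):
    """Right-pad `token_ids` and build the matching attention mask."""
    n_padding = target_length - len(token_ids)
    if n_padding < 0:
        raise ValueError("sequence is longer than target_length")
    padded = token_ids + [pad_id] * n_padding
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_padding
    return padded, attention_mask

# Illustrative IDs for the five tokens: Hello , world ! </s>
hello_ids = [8774, 6, 296, 55, 1]
padded_ids, mask = pad_to_length(hello_ids, target_length=20)

print(padded_ids)
print(mask)
print(f"{sum(mask)} real tokens, {len(mask) - sum(mask)} padding tokens")
```

The mask is what lets the model tell the 5 real tokens apart from the 15 padding tokens that follow them.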

By default, the T5-small tokenizer allows sequences of up to 512 tokens. This is a lot, and we don't need that many for our task. We'll set the maximum length to 40 tokens for the input. This means that any sequence longer than 40 tokens will be truncated to 40 tokens.

In [None]:
# Define maximum sequence lengths
max_input_length = 40

tokenizer.model_max_length = max_input_length

So what happens if we input a sequence that's _longer_ than the maximum length? The tokenizer will _truncate_ the sequence to the max length, which really just means chopping off any excess. Depending on the task at hand, this can either be done by cutting the start (left truncation) or the end (right truncation) of the sequence.
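
Left and right truncation can be illustrated with list slicing. This is a simplified sketch: `truncate` is a hypothetical helper, the IDs are arbitrary, and it ignores special tokens such as `</s>`, which the real tokenizer handles separately.

```python
def truncate(token_ids, max_length, side="right"):
    """Keep at most `max_length` tokens, dropping from the given side."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "right":
        # Drop the end of the sequence.
        return token_ids[:max_length]
    if side == "left":
        # Drop the start of the sequence.
        return token_ids[len(token_ids) - max_length:]
    raise ValueError("side must be 'left' or 'right'")

ids = list(range(10))  # arbitrary stand-in token IDs 0..9
print(truncate(ids, 4, side="right"))  # [0, 1, 2, 3]
print(truncate(ids, 4, side="left"))   # [6, 7, 8, 9]
```

Right truncation keeps the opening of a message, while left truncation keeps its most recent part. Which one is preferable depends on where the useful information tends to sit.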

**Your Turn**: Try changing the `truncation_side` in the cell below to see how the tokenizer behaves when you input a long sentence. You can also try lowering `max_input_length` (and re-running the cell that sets `tokenizer.model_max_length`) to see what happens when a sentence is longer than the maximum length.

In [None]:
tokenizer.truncation_side = 'right'  # Truncate from the right side, i.e. the end of the sequence
# tokenizer.truncation_side = 'left'  # Truncate from the left side, i.e. the start of the sequence

show_tokenization(tokenizer, "write a REALLY long sentence in here and see what happens")

With all this set up, we are nearly ready to pre-process our data. The last step is to define a function that will take in a batch of examples and tokenize them. This function will be used to process our dataset before training.

In [30]:
def preprocess_function(examples):
    return tokenizer(examples["message"], truncation=True)

In [31]:
tokenized_intents = intents.map(preprocess_function, batched=True)

In [32]:
tokenized_intents['train']

Dataset({
    features: ['message', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1555
})

### Data Collation

We have a few small steps left before we can start training our model. One of these is to set up a data collator. This is a function that takes a list of examples and collates them into a batch that can be fed into the model. The data collator will also ensure that the input sequences are padded to the same length.
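
One reason to pad inside the collator rather than up front is that each batch only needs to be as long as its longest member. The sketch below compares the number of padding tokens under the two strategies. The sequence lengths are made up and `count_padding` is a hypothetical helper.

```python
def count_padding(lengths, target_length):
    """Total padding tokens needed to bring every sequence to target_length."""
    return sum(target_length - n for n in lengths)

# Made-up token counts for one batch of 8 messages.
batch_lengths = [7, 12, 9, 15, 6, 11, 8, 10]

# Strategy 1: pad everything to the global maximum of 40 tokens.
fixed = count_padding(batch_lengths, target_length=40)
# Strategy 2: pad only to the longest sequence in this batch.
dynamic = count_padding(batch_lengths, target_length=max(batch_lengths))

print(f"Fixed-length padding: {fixed} tokens")
print(f"Dynamic padding:      {dynamic} tokens")
```

With these made-up lengths, padding to the batch maximum needs far fewer padding tokens, which means less wasted computation per batch.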

In [33]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

### Evaluation Metric

During evaluation, the `Trainer` passes the model's predictions and the true labels to a metric function that we supply. We'll use this to compute accuracy, which tells us how often the model's predictions match the true labels.

In [13]:
import evaluate

accuracy = evaluate.load("accuracy")

In [14]:
import numpy as np


def compute_metrics(eval_pred):
    # Unpack the predictions and labels
    predictions, labels = eval_pred.predictions, eval_pred.label_ids

    # Handle tuple predictions
    if isinstance(predictions, tuple):
        predictions = predictions[0]  # Take the first element, assuming it's the logits

    # Convert to NumPy array if necessary
    predictions = np.array(predictions)

    # Compute class predictions
    predictions = np.argmax(predictions, axis=1)

    # Return computed metrics
    return accuracy.compute(predictions=predictions, references=labels)
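
To see what the argmax-then-compare step computes, here is a tiny worked example in plain Python with made-up logits for three classes. The `argmax` and `simple_accuracy` helpers are written for this illustration and are not part of the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def simple_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits: 4 examples, 3 classes.
logits = [
    [2.1, 0.3, -1.0],   # highest score at index 0
    [0.2, 0.1, 1.7],    # highest score at index 2
    [-0.5, 0.9, 0.4],   # highest score at index 1
    [1.2, 1.5, 0.0],    # highest score at index 1
]
labels = [0, 2, 0, 1]

print(simple_accuracy(logits, labels))  # 0.75
```

Three of the four predictions match their labels, giving an accuracy of 0.75.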


### Encoding Labels

Before we can train our model, we need to convert the intent labels into numbers, because the model predicts a class index rather than a text label. We'll create a mapping from the text labels to numbers, and then use this mapping to convert the labels in our dataset.

In [15]:
# Convert labels to integers
label2id = {label: i for i, label in enumerate(intents['train'].unique('label'))}
id2label = {i: label for label, i in label2id.items()}

def encode_label(example):
    example['label'] = label2id[example['label']]
    return example

tokenized_intents = tokenized_intents.map(encode_label)

In [16]:
tokenized_intents['train']

Dataset({
    features: ['message', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1555
})
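
The two mappings are inverses of each other, which is what later lets us turn a predicted class index back into a readable intent. Here is a small sketch using three label names from the dataset. The `toy_` variables are introduced here for illustration and do not replace the mappings built above.

```python
toy_labels = ["cancel order", "track order", "get refund", "track order"]

# Assign an ID to each distinct label in order of first appearance.
toy_label2id = {}
for label in toy_labels:
    if label not in toy_label2id:
        toy_label2id[label] = len(toy_label2id)
toy_id2label = {i: label for label, i in toy_label2id.items()}

# Encode to integers, then decode back to text.
encoded = [toy_label2id[label] for label in toy_labels]
decoded = [toy_id2label[i] for i in encoded]

print(encoded)                # [0, 1, 2, 1]
print(decoded == toy_labels)  # True
```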

# Loading the Model

Now that we've preprocessed our data, we can load the pre-trained model that we'll be fine-tuning. We'll use the `AutoModelForSequenceClassification` class from the `transformers` library to load a model that's already set up for sequence classification.

In [17]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "t5-small",
    num_labels=len(label2id),  # 27 intents in this dataset
    id2label=id2label,
    label2id=label2id
)

Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at t5-small and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Training

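Finally, we're ready to start training our model! We'll use the `Trainer` class from the `transformers` library to handle the training process. We'll also define some training arguments, such as the number of epochs, the batch size, and how often to log and evaluate. We don't set a learning rate here, so the library default is used. The purpose of each argument is noted in the comments below.

In [19]:
training_args = TrainingArguments(
    per_device_train_batch_size=8,   # examples per training step on each device
    per_device_eval_batch_size=8,    # examples per evaluation batch on each device
    num_train_epochs=3,              # full passes over the training set
    logging_dir='logs',              # where TensorBoard logs are written
    logging_steps=10,                # log the training loss every 10 steps
    eval_strategy="steps",           # evaluate on a step schedule rather than per epoch
    eval_steps=10,                   # run evaluation every 10 steps
    save_strategy="no",              # don't save checkpoints during training
    output_dir='output'              # directory for any output files
)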

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_intents['train'],
    eval_dataset=tokenized_intents['test'],
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
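
As a rough sanity check on how long training will run, the total number of optimizer steps can be estimated from the dataset size and the arguments above. This sketch assumes a single GPU and no gradient accumulation (neither is configured above). The variable names are illustrative.

```python
import math

num_train_rows = 1555   # from the dataset summary earlier in the lab
batch_size = 8          # per_device_train_batch_size
num_epochs = 3          # num_train_epochs
eval_every = 10         # eval_steps

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evaluations)  # 195 585 58
```

The estimate of 585 steps matches the `global_step` reported once training finishes.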

In [20]:
trainer.train()

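| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 10   | 3.3758        | 3.307129        | 0.061697 |
| 20   | 3.3479        | 3.284916        | 0.07455  |
| 30   | 3.2969        | 3.266773        | 0.079692 |
| 40   | 3.334         | 3.250174        | 0.100257 |
| 50   | 3.3456        | 3.234811        | 0.125964 |
| 60   | 3.2715        | 3.222997        | 0.125964 |
| 70   | 3.2823        | 3.214721        | 0.120823 |
| 80   | 3.2322        | 3.203785        | 0.118252 |
| 90   | 3.2507        | 3.191883        | 0.113111 |
| 100  | 3.2314        | 3.177467        | 0.118252 |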


TrainOutput(global_step=585, training_loss=2.9015784663012902, metrics={'train_runtime': 66.6871, 'train_samples_per_second': 69.954, 'train_steps_per_second': 8.772, 'total_flos': 20767034735856.0, 'train_loss': 2.9015784663012902, 'epoch': 3.0})
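
A useful reference point for reading the training log above: a classifier that assigns equal probability to all 27 intents would have a cross-entropy loss of ln(27) and, on roughly balanced classes, an accuracy of about 1/27. The short calculation below shows these baselines. It assumes the reported loss is the standard cross-entropy over the 27 classes.

```python
import math

num_classes = 27

# Cross-entropy of a uniform guess over all classes.
uniform_loss = math.log(num_classes)
# Expected accuracy of random guessing when classes are roughly balanced.
chance_accuracy = 1 / num_classes

print(f"Uniform-guess loss: {uniform_loss:.4f}")
print(f"Chance accuracy:    {chance_accuracy:.4f}")
```

The first logged losses (around 3.3) sit close to this baseline, which is what we would expect from a freshly initialized classification head. The average training loss of about 2.90 over the full run indicates the model is gradually moving away from guessing.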