# Training Transformer Models for Text Classification

## Introduction
This notebook provides a hands-on guide to using Hugging Face transformer models for text classification tasks.

You will learn how to leverage pre-trained transformer models like DistilBERT to build an emotion classification system that can identify emotions from Twitter messages.

Throughout this tutorial, we'll explore two main approaches to text classification with transformers: feature extraction (using transformer embeddings with a simple classifier) and fine-tuning (training the entire model end-to-end). You'll gain practical experience working with the Hugging Face ecosystem, including the `transformers` and `datasets` libraries.

⚠️ If your computer is slow, you can download this notebook and run it on Google Colab instead.

## Learning Goals
By the end of this notebook, you will be able to:

- Load and explore datasets from the Hugging Face Hub for NLP tasks
- Understand tokenization strategies including character, word, and subword tokenization
- Use pre-trained transformers as feature extractors to generate embeddings for downstream tasks
- Fine-tune transformer models for custom classification problems using the Trainer API
- Evaluate model performance using appropriate metrics and confusion matrices
- Perform error analysis to identify model weaknesses and dataset issues
- Visualize high-dimensional embeddings using dimensionality reduction techniques (UMAP)
- Save and share models on the Hugging Face Hub for deployment
- Compare trade-offs between feature-based and fine-tuning approaches

## Prerequisites
- Basic knowledge of Python and machine learning concepts
- Familiarity with PyTorch or TensorFlow (helpful but not required)
- Understanding of classification tasks and evaluation metrics

## The Dataset
To build our emotion detector we'll use a dataset from an article that explored how emotions are represented in English Twitter messages (E. Saravia et al., "CARER: Contextualized Affect Representations for Emotion Recognition," Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (Oct–Nov 2018): 3687–3697, http://dx.doi.org/10.18653/v1/D18-1404). Unlike most sentiment analysis datasets that involve just "positive" and "negative" polarities, this dataset contains six basic emotions: anger, fear, joy, love, sadness, and surprise. Given a tweet, our task will be to train a model that can classify it into one of these emotions.

### A First Look at Hugging Face Datasets
We will use `datasets` to download the data from the Hugging Face Hub. We can use the `list_datasets()` function to see what datasets are available on the Hub:

In [None]:
from datasets import list_datasets

all_datasets = list_datasets()
print(f"There are {len(all_datasets)} datasets currently available on the Hub")
print(f"The first 10 are: {all_datasets[:10]}")

We see that each dataset is given a name, so let's load the emotion dataset with the `load_dataset()` function:

In [None]:
from datasets import load_dataset

emotions = load_dataset("emotion")

If we look inside our `emotions` object we can inspect the available splits:

In [None]:
emotions

The dataset behaves similarly to a Python dictionary, with each key corresponding to a different split. We can use the usual dictionary syntax to access an individual split:

In [None]:
train_ds = emotions["train"]
train_ds

The `Dataset` object behaves like an array, so we can query its length or access rows:

In [None]:
len(train_ds)

In [None]:
train_ds[0]

Column names and features can also be inspected:

In [None]:
train_ds.column_names

In [None]:
print(train_ds.features)

We can retrieve multiple rows or entire columns as lists:

In [None]:
print(train_ds[:5])

In [None]:
print(train_ds["text"][:5])

🎯 **Exercise 1: Dataset Exploration**

Now that you've seen how to load and explore the emotion dataset, try the following:

1. Browse the Hugging Face Datasets Hub and find another text classification dataset (e.g., `imdb`, `ag_news`, or `yelp_review_full`). Load this dataset and explore its structure. How many classes does it have? How is it different from the emotion dataset?
2. The current dataset is imbalanced. Research and implement at least one strategy to handle class imbalance in a Pandas `DataFrame` (e.g., `sklearn.utils.resample()`, or `DataFrame.sample()` with `weights`). What effect do you expect this to have on model performance?

💡 Hint: Check the Hugging Face Datasets documentation for loading different datasets and the imbalanced-learn documentation for sampling strategies.

### Sidebar: What If My Dataset Is Not on the Hub?
We'll be using the Hugging Face Hub to download datasets for most of the examples in this notebook. But in many cases, you'll find yourself working with data that is either stored on your laptop or on a remote server in your organization. The `datasets` library provides several loading scripts to handle local and remote datasets. Examples for the most common data formats are shown below.

| Data format | Loading script | Example |
| --- | --- | --- |
| CSV | `csv` | `load_dataset("csv", data_files="my_file.csv")` |
| Text | `text` | `load_dataset("text", data_files="my_file.txt")` |
| JSON | `json` | `load_dataset("json", data_files="my_file.jsonl")` |

As you can see, for each data format we just need to pass the relevant loading script to the `load_dataset()` function, along with a `data_files` argument that specifies the path or URL to one or more files.

In [None]:
dataset_url = "https://huggingface.co/datasets/transformersbook/emotion-train-split/raw/main/train.txt"
!wget {dataset_url}

In [None]:
!head -n 1 train.txt

In [None]:
emotions_local = load_dataset("csv", data_files="train.txt", sep=";", 
                              names=["text", "label"])

In [None]:
dataset_url = "https://huggingface.co/datasets/transformersbook/emotion-train-split/raw/main/train.txt"
emotions_remote = load_dataset("csv", data_files=dataset_url, sep=";", 
                               names=["text", "label"])

### From Datasets to DataFrames
Although `datasets` provides a lot of low-level functionality to slice and dice our data, it is often convenient to convert a `Dataset` object to a Pandas `DataFrame` so we can access high-level APIs for data visualization.

In [None]:
import pandas as pd

emotions.set_format(type="pandas")
df = emotions["train"][:]
df.head()

In [None]:
def label_int2str(row):
    return emotions["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()

### Looking at the Class Distribution
Whenever you are working on text classification problems, it is a good idea to examine the distribution of examples across the classes.

In [None]:
import matplotlib.pyplot as plt

(df["label_name"].value_counts(ascending=True)
    .plot.barh())
plt.title("Frequency of Classes")
plt.show()

In this case, we can see that the dataset is heavily imbalanced; the `joy` and `sadness` classes appear frequently, whereas `love` and `surprise` are about 5–10 times rarer.
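
One common way to compensate for this kind of imbalance is to weight each class inversely to its frequency. The sketch below computes such weights with the `n_samples / (n_classes * count)` heuristic. The helper name `balanced_class_weights` and the toy label counts are illustrative assumptions, not values taken from the emotion dataset.

```python
from collections import Counter


def balanced_class_weights(label_list):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(label_list)
    n_samples = len(label_list)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy labels (NOT the real class counts of the emotion dataset)
toy_labels = ["joy"] * 6 + ["sadness"] * 3 + ["surprise"] * 1
toy_weights = balanced_class_weights(toy_labels)

for cls, weight in sorted(toy_weights.items()):
    print(f"{cls}: {weight:.2f}")
# joy: 0.56
# sadness: 1.11
# surprise: 3.33
```

Rare classes receive weights above 1 and frequent classes below 1. Weights like these could then be passed to a loss function or classifier that supports per-class weighting.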

### How Long Are Our Tweets?
Transformer models have a maximum input sequence length that is referred to as the maximum context size. For applications using DistilBERT, the maximum context size is 512 tokens.
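
Texts that exceed the context size have to be truncated before they reach the model. As a rough sketch of what that means for a sequence of token IDs, the hypothetical helper `truncate_ids` below keeps the first `max_length - 1` IDs and re-attaches the end-of-sequence marker. The IDs 101 and 102 are assumed to be the `[CLS]` and `[SEP]` IDs of the uncased BERT vocabulary; in practice the tokenizer takes care of this when `truncation=True` is set.

```python
# Assumed special-token IDs ([CLS] and [SEP] in the uncased BERT vocabulary)
CLS_ID, SEP_ID = 101, 102


def truncate_ids(token_ids, max_length):
    """Keep at most max_length IDs while preserving the trailing [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    if len(token_ids) <= max_length:
        return list(token_ids)
    # Drop tokens from the end of the sequence, then re-attach [SEP]
    return list(token_ids[:max_length - 1]) + [SEP_ID]


toy_ids = [CLS_ID, 7, 8, 9, 10, 11, SEP_ID]
toy_truncated = truncate_ids(toy_ids, max_length=5)
print(toy_truncated)  # [101, 7, 8, 9, 102]
```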

In [None]:
df["Words Per Tweet"] = df["text"].str.split().apply(len)
df.boxplot("Words Per Tweet", by="label_name", grid=False, showfliers=False,
           color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()

In [None]:
emotions.reset_format()

## From Text to Tokens
Transformer models like DistilBERT cannot receive raw strings as input; instead, they assume the text has been tokenized and encoded as numerical vectors.

### Character Tokenization

In [None]:
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
print(tokenized_text)

In [None]:
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx)

In [None]:
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)

In [None]:
import torch
import torch.nn.functional as F

input_ids = torch.tensor(input_ids)
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))
one_hot_encodings.shape

In [None]:
print(f"Token: {tokenized_text[0]}")
print(f"Tensor index: {input_ids[0]}")
print(f"One-hot: {one_hot_encodings[0]}")

### Word Tokenization

In [None]:
tokenized_text = text.split()
print(tokenized_text)
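
Word-level vocabularies can grow very large, so a common compromise is to keep only the most frequent words and map everything else to a single unknown token. The following is a minimal sketch of that idea; `build_vocab`, `encode_words`, the `[UNK]` placeholder, and the toy corpus are hypothetical names chosen for illustration.

```python
from collections import Counter

UNK = "[UNK]"  # hypothetical placeholder for out-of-vocabulary words


def build_vocab(texts, max_size):
    """Map the max_size most frequent words to IDs; ID 0 is reserved for UNK."""
    counts = Counter(word for t in texts for word in t.split())
    word2idx = {UNK: 0}
    for word, _ in counts.most_common(max_size):
        word2idx[word] = len(word2idx)
    return word2idx


def encode_words(sentence, word2idx):
    """Replace each word with its ID, falling back to the UNK ID."""
    return [word2idx.get(word, word2idx[UNK]) for word in sentence.split()]


toy_corpus = ["the cat sat", "the dog sat", "the bird flew"]
toy_vocab = build_vocab(toy_corpus, max_size=2)
print(toy_vocab)                               # {'[UNK]': 0, 'the': 1, 'sat': 2}
print(encode_words("the cat sat", toy_vocab))  # [1, 0, 2]
```

The word `cat` falls outside the two-word vocabulary and is mapped to the unknown ID, which shows the information loss that subword tokenization tries to avoid.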

### Subword Tokenization
We'll use the tokenizer associated with DistilBERT.

In [None]:
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
encoded_text = tokenizer(text)
print(encoded_text)

In [None]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

In [None]:
print(tokenizer.convert_tokens_to_string(tokens))

In [None]:
print(tokenizer.vocab_size)
print(tokenizer.model_max_length)
print(tokenizer.model_input_names)

## Tokenizing the Whole Dataset

In [None]:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

print(tokenize(emotions["train"][:2]))
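
With `padding=True`, shorter sequences in a batch are filled up to the length of the longest one, and the attention mask records which positions hold real tokens. A minimal pure-Python sketch of that behavior, assuming a padding ID of 0 (the real value is available as `tokenizer.pad_token_id`) and a hypothetical helper `pad_batch`:

```python
PAD_ID = 0  # assumed padding ID; check tokenizer.pad_token_id for the real value


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    padded_ids, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded_ids, "attention_mask": masks}


toy_batch = pad_batch([[101, 7, 8, 102], [101, 7, 102]])
print(toy_batch["input_ids"])       # [[101, 7, 8, 102], [101, 7, 102, 0]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```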

In [None]:
tokens2ids = list(zip(tokenizer.all_special_tokens, tokenizer.all_special_ids))
data = sorted(tokens2ids, key=lambda x: x[-1])
df_special = pd.DataFrame(data, columns=["Special Token", "Special Token ID"])
df_special.T

In [None]:
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

In [None]:
print(emotions_encoded["train"].column_names)

## Training a Text Classifier
Models like DistilBERT are pretrained to predict masked words in text. To use them for text classification we need to modify them slightly.

### Transformers as Feature Extractors

In [None]:
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

🎯 **Exercise 2: Tokenization Experiments**

1. Load a different pre-trained model tokenizer (e.g., `bert-base-cased`, `roberta-base`, or `albert-base-v2`) and compare its tokenization output with DistilBERT. What differences do you notice? How does cased vs uncased tokenization affect the output?
2. Experiment with the tokenizer's padding and truncation parameters. What happens if you set `max_length=10` with `truncation=True`? Read the tokenizer documentation to understand different padding strategies (`max_length`, `longest`, `do_not_pad`).

### Extracting the Last Hidden States

In [None]:
text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")
print(f"Input tensor shape: {inputs['input_ids'].size()}")

In [None]:
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)
print(outputs)

In [None]:
outputs.last_hidden_state.size()

In [None]:
outputs.last_hidden_state[:, 0].size()

Now let's extract the hidden states for the whole dataset.

In [None]:
def extract_hidden_states(batch):
    inputs = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}

In [None]:
emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)

In [None]:
emotions_hidden["train"].column_names

### Creating a Feature Matrix

In [None]:
import numpy as np

X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])
X_train.shape, X_valid.shape

### Visualizing the Training Set

In [None]:
from umap import UMAP
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(X_train)
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train
df_emb.head()

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(7, 5))
axes = axes.flatten()
cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]
labels = emotions["train"].features["label"].names

for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    df_emb_sub = df_emb.query(f"label == {i}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap,
                   gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([])
    axes[i].set_yticks([])

plt.tight_layout()
plt.show()

### Training a Simple Classifier

In [None]:
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)
lr_clf.score(X_valid, y_valid)

In [None]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_valid, y_valid)
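
The `most_frequent` strategy ignores the features entirely and always predicts the most common training label, so its accuracy is simply the share of that label in the evaluation set. A small sketch of the same computation, using a hypothetical helper and toy labels:

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return sum(y == majority_label for y in eval_labels) / len(eval_labels)


# Hypothetical toy labels: class 1 is the majority class in training
toy_train_labels = [1, 1, 1, 0, 0, 2]
toy_eval_labels = [1, 0, 1, 2]
toy_baseline = majority_baseline_accuracy(toy_train_labels, toy_eval_labels)
print(toy_baseline)  # 0.5
```

Any classifier worth keeping should clearly beat this number on the validation set.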

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
    
y_preds = lr_clf.predict(X_valid)
plot_confusion_matrix(y_preds, y_valid, labels)

## Fine-Tuning Transformers

In [None]:
from transformers import AutoModelForSequenceClassification

num_labels = 6
model = (AutoModelForSequenceClassification
         .from_pretrained(model_ckpt, num_labels=num_labels)
         .to(device))

In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
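
To make the `average="weighted"` setting more concrete, here is a from-scratch sketch of accuracy and weighted F1: the per-class F1-scores are averaged with weights proportional to how often each class appears in the true labels. The helper names and toy labels are illustrative assumptions; in the notebook itself scikit-learn does this work.

```python
from collections import Counter


def f1_for_class(true_labels, pred_labels, cls):
    """F1 = 2*TP / (2*TP + FP + FN) for a single class."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(true_labels, pred_labels):
    """Average the per-class F1-scores, weighted by class support."""
    support = Counter(true_labels)
    total = len(true_labels)
    return sum(f1_for_class(true_labels, pred_labels, c) * n / total
               for c, n in support.items())


def simple_accuracy(true_labels, pred_labels):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)


toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 0]
print(f"accuracy: {simple_accuracy(toy_true, toy_pred):.3f}")  # accuracy: 0.667
print(f"weighted F1: {weighted_f1(toy_true, toy_pred):.3f}")   # weighted F1: 0.600
```

In this toy case class 2 is never predicted correctly, which pulls the weighted F1 below the accuracy.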

🎯 **Exercise 3: Feature Extraction and Model Selection**

1. Try using different classifiers from scikit-learn instead of logistic regression. Test at least two of the following: `RandomForestClassifier`, `SVC`, or `MLPClassifier`. Compare their performance with logistic regression. Which one works best and why?
2. Experiment with the UMAP configuration (`n_components`, `metric`, `n_neighbors`). How do these changes affect the visualization and what insights can you gain?

Log in to your account on the Hugging Face Hub to push the fine-tuned model:

In [None]:
from huggingface_hub import notebook_login

# notebook_login()  # Uncomment to log in from a notebook environment

Define the training arguments and initialize the `Trainer`.

In [None]:
from transformers import Trainer, TrainingArguments

batch_size = 64
logging_steps = len(emotions_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-emotion"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  # Named eval_strategy in newer transformers releases
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=True,
                                  log_level="error")

In [None]:
trainer = Trainer(model=model, args=training_args, 
                  compute_metrics=compute_metrics,
                  train_dataset=emotions_encoded["train"],
                  eval_dataset=emotions_encoded["validation"],
                  tokenizer=tokenizer)

In [None]:
trainer.train()

In [None]:
preds_output = trainer.predict(emotions_encoded["validation"])

In [None]:
preds_output.metrics

In [None]:
y_preds = np.argmax(preds_output.predictions, axis=1)
plot_confusion_matrix(y_preds, y_valid, labels)

🎯 **Exercise 4: Fine-Tuning Hyperparameters**

1. Modify the `TrainingArguments` to experiment with different hyperparameters (e.g., learning rate, epochs, batch size, weight decay, warmup steps). Which combination gives the best F1-score?
2. Try fine-tuning a different pre-trained model (e.g., `bert-base-uncased`, `roberta-base`, or `albert-base-v2`). How does the training time and final performance compare?

### Error Analysis

In [None]:
from torch.nn.functional import cross_entropy

def forward_pass_with_label(batch):
    inputs = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    with torch.no_grad():
        output = model(**inputs)
        pred_label = torch.argmax(output.logits, axis=-1)
        loss = cross_entropy(output.logits, batch["label"].to(device), reduction="none")
    return {"loss": loss.cpu().numpy(), 
            "predicted_label": pred_label.cpu().numpy()}
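
Passing `reduction="none"` makes `cross_entropy` return one loss value per example instead of a batch average, which is what allows sorting examples by loss. As an illustration of the quantity being computed, this sketch evaluates `-log softmax(logits)[label]` row by row; the function name and the toy logits are assumptions made for the example.

```python
import math


def cross_entropy_per_example(logit_rows, target_ids):
    """Return -log softmax(logits)[target] for each row, without averaging."""
    losses = []
    for row, target in zip(logit_rows, target_ids):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        losses.append(log_sum_exp - row[target])
    return losses


# Hypothetical logits: confident and right, undecided, confident and wrong
toy_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
toy_targets = [0, 2, 0]
toy_losses = cross_entropy_per_example(toy_logits, toy_targets)

for loss in toy_losses:
    print(f"{loss:.3f}")
```

The confidently wrong prediction gets the largest loss, which is why sorting the validation set by loss surfaces mislabeled or ambiguous examples first.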

In [None]:
emotions_encoded.set_format("torch", 
                            columns=["input_ids", "attention_mask", "label"])
emotions_encoded["validation"] = emotions_encoded["validation"].map(
    forward_pass_with_label, batched=True, batch_size=16)

In [None]:
emotions_encoded.set_format("pandas")
cols = ["text", "label", "predicted_label", "loss"]
df_test = emotions_encoded["validation"][:][cols]
df_test["label"] = df_test["label"].apply(label_int2str)
df_test["predicted_label"] = df_test["predicted_label"].apply(label_int2str)

In [None]:
df_test.sort_values("loss", ascending=False).head(10)

In [None]:
df_test.sort_values("loss", ascending=True).head(10)

🎯 **Exercise 6: Advanced Evaluation Metrics**

1. Implement additional evaluation metrics (e.g., per-class precision/recall, macro vs micro vs weighted F1-score, Cohen's Kappa, MCC). Which metric would you prioritize for this task and why?
2. Create a function that extracts the top 10 most misclassified examples (where the model was most confident but wrong), identifies patterns, and suggests data augmentation strategies to improve performance.

## Saving and Sharing the Model

In [None]:
trainer.push_to_hub(commit_message="Training completed!")

Use the fine-tuned model with the `pipeline()` API:

In [None]:
from transformers import pipeline

model_id = "transformersbook/distilbert-base-uncased-finetuned-emotion"
classifier = pipeline("text-classification", model=model_id)

In [None]:
custom_tweet = "I saw a movie today and it was really good."
preds = classifier(custom_tweet, return_all_scores=True)
preds

In [None]:
preds_df = pd.DataFrame(preds[0])
plt.bar(labels, 100 * preds_df["score"], color='C0')
plt.title(f'"{custom_tweet}"')
plt.ylabel("Class probability (%)")
plt.show()

🎯 **Exercise 5: Model Deployment and Inference**

1. Read about the Hugging Face Inference API and test your deployed model via HTTP requests. Write a Python function using the `requests` library to send text to your model and receive predictions. How would you integrate this into a web application?
2. Explore different parameters of the `pipeline()` API such as `top_k`, `truncation`, and `max_length`. Try creating a pipeline for another task (e.g., `sentiment-analysis` or `zero-shot-classification`).

## Conclusion
Congratulations, you now know how to train a transformer model to classify the emotions in tweets! We have seen two complementary approaches based on features and fine-tuning, and investigated their strengths and weaknesses. Continue exploring by deploying models, speeding them up, and expanding to multilingual or low-resource settings.