# Module 10: Text Classification with Transformers

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 100 minutes  
**Prerequisites**: [Module 09: Fine-Tuning Transformers](09_fine_tuning_transformers.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Build production-ready text classifiers using transformer models
2. Implement sentiment analysis systems with BERT
3. Handle multi-class and multi-label classification problems
4. Evaluate classification models with appropriate metrics
5. Fine-tune pre-trained models on custom classification tasks

## Why Text Classification?

Text classification is one of the most common NLP tasks with wide-ranging applications:

- **Sentiment Analysis**: Product reviews, social media monitoring
- **Topic Classification**: News categorization, document routing
- **Intent Detection**: Chatbots, customer service automation
- **Spam Detection**: Email filtering, content moderation
- **Content Moderation**: Identifying toxic or inappropriate content

Because transformers are pre-trained on large text corpora, fine-tuning them often yields strong accuracy from only a few thousand labeled examples.

## Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Hugging Face
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    pipeline
)
from datasets import load_dataset  # load_metric was unused here and is gone from recent datasets releases

# Sklearn for metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_recall_fscore_support
)

# Visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Random seed
np.random.seed(42)
torch.manual_seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("✓ All libraries imported successfully!")

## 1. Quick Start: Sentiment Analysis with Pipelines

The easiest way to get started is using Hugging Face pipelines - they handle tokenization, inference, and post-processing automatically.

In [None]:
# Load a pre-trained sentiment analysis pipeline
# Naming the checkpoint explicitly (it is also the task default) keeps results reproducible
sentiment_classifier = pipeline(
    'sentiment-analysis',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device=0 if torch.cuda.is_available() else -1
)

print("✓ Sentiment classifier loaded!")

In [None]:
# Test on various examples
test_reviews = [
    "I absolutely love this product! Best purchase ever!",
    "This is terrible. Waste of money.",
    "It's okay, nothing special. Gets the job done.",
    "Exceeded my expectations! Highly recommend.",
    "Disappointed. Poor quality and overpriced."
]

print("Sentiment Analysis Results:\n")
print(f"{'Review':<50} {'Sentiment':<12} {'Confidence':<10}")
print("-" * 72)

for review in test_reviews:
    result = sentiment_classifier(review)[0]
    label = result['label']
    score = result['score']
    print(f"{review:<50} {label:<12} {score:.2%}")
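
The `label` and `score` fields come from a small post-processing step: the model emits one raw logit per class, a softmax turns the logits into probabilities, and the top class is looked up in the model's id-to-label mapping. The pure-Python sketch below imitates that step under stated assumptions: the logit values are made up, `postprocess` is a hypothetical helper rather than a library function, and the two-entry `id2label` mapping mirrors the NEGATIVE/POSITIVE labels an SST-2 model reports.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Imitate the last pipeline step: pick the top class and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical values for illustration only
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [-2.0, 3.0]

result = postprocess(example_logits, id2label)
print(result["label"], f"{result['score']:.2%}")
```

Because the two probabilities sum to 1, the reported score of a binary classifier is never below 50%. A neutral sentence therefore still receives a label, which is worth keeping in mind for the edge cases in Exercise 1.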

**Exercise 1**: Test edge cases

Test the sentiment classifier on these challenging cases:
1. Sarcasm: "Oh great, another software update that breaks everything!"
2. Negation: "This is not bad at all"
3. Mixed sentiment: "The food was great but the service was terrible"
4. Neutral: "The package arrived on Tuesday"

Discuss where the model succeeds and where it struggles.

In [None]:
# YOUR CODE HERE
edge_cases = [
    "Oh great, another software update that breaks everything!",
    "This is not bad at all",
    "The food was great but the service was terrible",
    "The package arrived on Tuesday"
]

# Test and analyze results

## 2. Fine-Tuning BERT for Multi-Class Classification

Now let's fine-tune BERT on a multi-class classification task: categorizing news articles into topics.

### 2.1 Load and Explore the AG News Dataset

AG News is a collection of news articles classified into 4 categories: World, Sports, Business, Sci/Tech.

In [None]:
# Load AG News dataset
# We'll use a subset for faster training
dataset = load_dataset('ag_news')

# Explore the data
print("Dataset structure:")
print(dataset)
print(f"\nClasses: {dataset['train'].features['label'].names}")
print(f"\nTraining samples: {len(dataset['train'])}")
print(f"Test samples: {len(dataset['test'])}")

In [None]:
# Look at some examples
print("Sample news articles:\n")
class_names = dataset['train'].features['label'].names

for i in range(5):
    example = dataset['train'][i]
    label = class_names[example['label']]
    text = example['text'][:150] + "..."  # Truncate for display
    print(f"Category: {label}")
    print(f"Text: {text}\n")

In [None]:
# Class distribution
train_labels = list(dataset['train']['label'])  # column access is faster than looping over rows
label_counts = pd.Series(train_labels).value_counts().sort_index()

plt.figure(figsize=(10, 5))
plt.bar(range(len(class_names)), label_counts.values, color='skyblue')
plt.xticks(range(len(class_names)), class_names)
plt.xlabel('Category')
plt.ylabel('Number of Articles')
plt.title('AG News Dataset - Class Distribution')
plt.tight_layout()
plt.show()

print("Class distribution:")
for i, name in enumerate(class_names):
    print(f"{name}: {label_counts[i]:,} articles")

### 2.2 Prepare Data for Training

We need to tokenize the text and prepare it in the format BERT expects.

In [None]:
# Load BERT tokenizer
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"✓ Loaded tokenizer: {model_name}")

In [None]:
def tokenize_function(examples):
    """
    Tokenize text for BERT.
    
    Parameters:
    -----------
    examples : dict
        Batch of examples with 'text' field
        
    Returns:
    --------
    dict : Tokenized inputs (input_ids, attention_mask)
    """
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128  # Shorter for faster training
    )

# For faster training, use a subset
# Remove this for production use
small_train = dataset['train'].shuffle(seed=42).select(range(5000))
small_test = dataset['test'].shuffle(seed=42).select(range(1000))

# Tokenize datasets
print("Tokenizing datasets...")
tokenized_train = small_train.map(tokenize_function, batched=True)
tokenized_test = small_test.map(tokenize_function, batched=True)

print("✓ Tokenization complete!")
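
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch of the length handling only. It assumes the special-token IDs of `bert-base-uncased` (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0); the other token IDs are arbitrary placeholders, and `pad_or_truncate` is a hypothetical helper. The real tokenizer also splits text into word pieces and returns `token_type_ids`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Reserve two positions for [CLS] and [SEP], then cut the body to fit
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Pad on the right; padded positions get mask 0 so attention ignores them
    pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * pad, attention_mask + [0] * pad

# Placeholder token IDs, not real word pieces
short_ids, short_mask = pad_or_truncate([2001, 2002], max_length=6)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to a fixed length is simple but spends compute on padding tokens for short texts. Padding each batch to its longest member (for example with `DataCollatorWithPadding`) is a common alternative.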

### 2.3 Load Pre-trained BERT for Classification

We'll load BERT and add a classification head on top (this is done automatically).
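
As a rough picture of that head (a sketch of the standard BERT sequence-classification design, with symbols introduced here for illustration): the encoder's pooled `[CLS]` representation $\mathbf{h} \in \mathbb{R}^{768}$ passes through dropout and a single linear layer with weights $W$ and bias $\mathbf{b}$, producing one logit per class. Training minimizes the cross-entropy against the true label $y$.

```latex
\begin{aligned}
\mathbf{z} &= W\,\mathrm{dropout}(\mathbf{h}) + \mathbf{b},
  \qquad W \in \mathbb{R}^{4 \times 768},\ \mathbf{b} \in \mathbb{R}^{4} \\
p_k &= \frac{e^{z_k}}{\sum_{j=1}^{4} e^{z_j}},
  \qquad \mathcal{L} = -\log p_y
\end{aligned}
```

Because $W$ and $\mathbf{b}$ are newly initialized rather than pre-trained, loading the model typically prints a warning that some weights were not initialized from the checkpoint. That is expected before fine-tuning.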

In [None]:
# Load BERT for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4  # 4 classes in AG News
)

model.to(device)
print(f"✓ Model loaded with {model.num_parameters():,} parameters")
print(f"✓ Model moved to {device}")

### 2.4 Define Training Configuration

Hugging Face Trainer makes training very simple - we just need to specify hyperparameters.

In [None]:
# Define metrics for evaluation
def compute_metrics(eval_pred):
    """
    Compute accuracy and F1 score for evaluation.
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='weighted')
    
    return {
        'accuracy': accuracy,
        'f1': f1
    }

print("✓ Metrics function defined")
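
The `average='weighted'` setting matters most when classes are imbalanced. The pure-Python sketch below computes per-class F1 by hand and then averages it two ways, so the difference from macro averaging is visible. The labels are made-up toy data and `per_class_f1` and `f1_averages` are hypothetical helpers. In practice `sklearn.metrics.f1_score` does this for you.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one class, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_averages(y_true, y_pred):
    """Return (macro_f1, weighted_f1)."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    scores = {c: per_class_f1(y_true, y_pred, c) for c in labels}
    macro = sum(scores.values()) / len(labels)                         # every class counts equally
    weighted = sum(scores[c] * support[c] for c in labels) / len(y_true)  # classes count by support
    return macro, weighted

# Toy imbalanced data: 8 examples of class 0, 2 of class 1
y_true_toy = [0] * 8 + [1] * 2
y_pred_toy = [0] * 8 + [0, 1]  # one minority example is missed

macro_f1, weighted_f1 = f1_averages(y_true_toy, y_pred_toy)
print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

On this toy data the weighted score (about 0.89) sits well above the macro score (about 0.80), because the majority class dominates the weighted average. If minority-class performance matters, reporting macro F1 alongside it is a reasonable choice.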

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,  # 500 fixed warmup steps would cover over half of this short run
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy='epoch',  # named evaluation_strategy in transformers < 4.41
    save_strategy='epoch',
    load_best_model_at_end=True,
)

print("Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
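
A useful sanity check on any Trainer configuration is to work out how many optimizer steps it implies, since step-based settings such as warmup and logging intervals only make sense relative to that total. The sketch below assumes a single device, no gradient accumulation, and the 5,000-example training subset; `training_schedule` is a hypothetical helper, not a library function.

```python
import math

def training_schedule(num_examples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps) for single-device training
    without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

steps_per_epoch, total_steps = training_schedule(5000, 16, 3)
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")

# Express a few candidate warmup lengths as a share of the whole run
for warmup_steps in (50, 100, 500):
    print(f"warmup_steps={warmup_steps}: {warmup_steps / total_steps:.0%} of training")
```

With these settings the run has 939 optimizer steps, so a 500-step warmup would occupy more than half of it. A warmup of roughly 10% of the run is a common starting point.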

In [None]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    compute_metrics=compute_metrics,
)

print("✓ Trainer initialized!")

### 2.5 Train the Model

Now we can start training! With the 5,000-example subset this takes a few minutes on a GPU; on a CPU, expect it to take considerably longer.

In [None]:
# Train the model
print("Starting training...\n")
train_result = trainer.train()

print("\n✓ Training complete!")
print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")
print(f"Average training loss: {train_result.metrics['train_loss']:.4f}")  # train_loss is averaged over the whole run

### 2.6 Evaluate the Model

In [None]:
# Evaluate on test set
eval_result = trainer.evaluate()

print("Test Set Performance:")
print(f"  Accuracy: {eval_result['eval_accuracy']:.2%}")
print(f"  F1 Score: {eval_result['eval_f1']:.4f}")
print(f"  Loss: {eval_result['eval_loss']:.4f}")

In [None]:
# Get predictions for detailed analysis
predictions = trainer.predict(tokenized_test)
y_pred = np.argmax(predictions.predictions, axis=1)
y_true = predictions.label_ids

# Classification report
print("\nDetailed Classification Report:\n")
print(classification_report(y_true, y_pred, target_names=class_names))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=class_names,
    yticklabels=class_names
)
plt.title('Confusion Matrix - AG News Classification')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# Analyze confusion
print("\nMost confused pairs:")
for i in range(len(class_names)):
    for j in range(len(class_names)):
        if i != j and cm[i,j] > 10:
            print(f"  {class_names[i]} → {class_names[j]}: {cm[i,j]} errors")
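
Beyond raw error counts, the confusion matrix also yields per-class precision and recall: with rows as true labels and columns as predictions (the `confusion_matrix` convention used above), recall divides the diagonal by the row sum and precision divides it by the column sum. The sketch below shows this on a made-up 2x2 matrix; `per_class_rates` is a hypothetical helper.

```python
def per_class_rates(cm):
    """Return a list of (precision, recall) per class.

    cm[i][j] counts examples whose true class is i and predicted class is j.
    """
    n = len(cm)
    rates = []
    for k in range(n):
        row_total = sum(cm[k])                       # all examples truly in class k
        col_total = sum(cm[i][k] for i in range(n))  # all examples predicted as class k
        recall = cm[k][k] / row_total if row_total else 0.0
        precision = cm[k][k] / col_total if col_total else 0.0
        rates.append((precision, recall))
    return rates

# Made-up counts for illustration only
toy_cm = [
    [45, 5],   # true class 0: 45 correct, 5 predicted as class 1
    [10, 40],  # true class 1: 10 predicted as class 0, 40 correct
]

toy_rates = per_class_rates(toy_cm)
for k, (precision, recall) in enumerate(toy_rates):
    print(f"class {k}: precision={precision:.3f}, recall={recall:.3f}")
```

Reading the matrix this way separates two kinds of mistakes: low recall means a class is often missed, while low precision means other classes are often mistaken for it.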

**Exercise 2**: Error analysis

Find and analyze misclassified examples:
1. Extract 5 examples where the model was most confident but wrong
2. Identify patterns in the errors
3. Suggest improvements to reduce these errors

In [None]:
# YOUR CODE HERE
# Hint: Use predictions.predictions to get confidence scores
# Find examples with high confidence but incorrect predictions

### 2.7 Use the Trained Model for Inference

In [None]:
# Create pipeline from our fine-tuned model
classifier = pipeline(
    'text-classification',
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Test on new articles
new_articles = [
    "Apple announces new iPhone with revolutionary features",
    "Stock market hits record high amid economic recovery",
    "Scientists discover new exoplanet in habitable zone",
    "Manchester United wins Champions League final"
]

print("Predictions on new articles:\n")
for article in new_articles:
    result = classifier(article)[0]
    label_id = int(result['label'].split('_')[1])  # Default labels look like 'LABEL_2'; extract the integer ID
    predicted_class = class_names[label_id]
    confidence = result['score']
    print(f"Article: {article}")
    print(f"Category: {predicted_class} (confidence: {confidence:.2%})\n")

## 3. Multi-Label Classification

Sometimes a text can belong to multiple categories. For example, a movie synopsis might be tagged as both "Action" and "Sci-Fi".

**Key Difference**: 
- **Multi-class**: One label per example (softmax)
- **Multi-label**: Multiple labels per example (sigmoid)
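
The difference is easiest to see by applying both activations to the same logits. This pure-Python sketch uses made-up logit values, and the 0.5 decision threshold is an assumption that is often tuned per label in practice.

```python
import math

def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, -1.0, 1.5, -3.0]  # hypothetical model outputs for 4 labels

# Multi-class: exactly one winner
class_probs = softmax(logits)
predicted_class = max(range(len(class_probs)), key=class_probs.__getitem__)

# Multi-label: every label is decided on its own
label_probs = [sigmoid(z) for z in logits]
predicted_labels = [i for i, p in enumerate(label_probs) if p >= 0.5]  # assumed threshold

print("softmax:", [round(p, 3) for p in class_probs], "->", predicted_class)
print("sigmoid:", [round(p, 3) for p in label_probs], "->", predicted_labels)
```

Softmax forces the labels to share one unit of probability, so the second-strongest label is suppressed. With sigmoid the probabilities need not sum to 1, which lets two labels be active at once.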

In [None]:
# Example: Movie genre classification
# Each movie can have multiple genres, encoded as a multi-hot vector
# whose positions follow the order of genre_names
genre_names = ['Action', 'Romance', 'Thriller', 'Comedy', 'Sci-Fi']

sample_data = {
    'text': [
        "A thrilling space adventure with amazing visual effects",
        "Romantic comedy set in New York City",
        "Dark psychological thriller with unexpected twists",
        "Action-packed superhero origin story"
    ],
    'labels': [
        [1, 0, 0, 0, 1],  # Action, Sci-Fi
        [0, 1, 0, 1, 0],  # Romance, Comedy
        [0, 0, 1, 0, 0],  # Thriller
        [1, 0, 0, 0, 0],  # Action
    ]
}

# For multi-label, we use BCEWithLogitsLoss instead of CrossEntropyLoss
print("Multi-label classification requires:")
print("1. Sigmoid activation (not softmax)")
print("2. BCEWithLogitsLoss")
print("3. Independent probability for each label")

**Exercise 3**: Implement multi-label classification

Create a multi-label text classifier:
1. Modify the model to use sigmoid instead of softmax
2. Train on a multi-label dataset
3. Evaluate using multi-label metrics (Hamming loss, Jaccard score)
4. Predict multiple labels for new examples

Hint: Use `problem_type="multi_label_classification"` in the model config
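
For reference, the two metrics named in step 3 can be defined in a few lines. This is a pure-Python sketch on made-up multi-hot vectors; `hamming_loss_manual` and `sample_jaccard` are hypothetical helpers, and scoring an empty-versus-empty pair as 1.0 is a convention chosen here (libraries may handle that case differently). scikit-learn provides `hamming_loss` and `jaccard_score` for real use.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total

def sample_jaccard(y_true, y_pred):
    """Mean over examples of |true AND pred| / |true OR pred|."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        inter = sum(t == 1 and p == 1 for t, p in zip(true_row, pred_row))
        union = sum(t == 1 or p == 1 for t, p in zip(true_row, pred_row))
        scores.append(inter / union if union else 1.0)  # assumed convention for empty sets
    return sum(scores) / len(scores)

# Made-up targets and predictions for three examples and four labels
toy_true = [[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
toy_pred = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0]]

print(f"Hamming loss:  {hamming_loss_manual(toy_true, toy_pred):.4f}")
print(f"Jaccard score: {sample_jaccard(toy_true, toy_pred):.4f}")
```

Hamming loss is lower-is-better and counts every label slot equally, while the Jaccard score is higher-is-better and ignores labels that are absent from both the target and the prediction.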

In [None]:
# YOUR CODE HERE

## 4. Tips for Production Text Classification

### 4.1 Handling Imbalanced Data

Real-world datasets often have imbalanced classes. Solutions:
- **Weighted loss**: Give more weight to minority classes
- **Oversampling**: Duplicate minority class examples
- **Undersampling**: Reduce majority class examples
- **Data augmentation**: Create synthetic examples

In [None]:
# Example: Weighted loss for imbalanced data
from torch.nn import CrossEntropyLoss

# Calculate class weights (inverse frequency)
class_counts = np.bincount(train_labels)
class_weights = 1.0 / class_counts
class_weights = class_weights / class_weights.sum()  # Normalize

print("Class weights:")
for i, name in enumerate(class_names):
    print(f"{name}: {class_weights[i]:.4f}")

# Use weighted loss in training (weights must be float32 to match the model's logits)
# loss_fn = CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float32).to(device))
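
AG News has 30,000 training articles in each of its four classes, so inverse-frequency weights come out identical there. To see what weighting does on skewed data, the sketch below applies the "balanced" heuristic `n_samples / (n_classes * count)` to made-up class counts. `balanced_class_weights` is a hypothetical helper, and the scaling is chosen so that a balanced dataset gets a weight of 1.0 for every class.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights scaled so a balanced dataset gets 1.0 everywhere."""
    n_samples = sum(class_counts)
    n_classes = len(class_counts)
    return [n_samples / (n_classes * count) for count in class_counts]

# AG News training split: four classes of equal size
ag_news_weights = balanced_class_weights([30000, 30000, 30000, 30000])

# Made-up skewed counts for illustration only
skewed_weights = balanced_class_weights([900, 90, 10])

print("AG News:", ag_news_weights)
print("Skewed: ", [round(w, 3) for w in skewed_weights])
```

In the skewed case the rarest class receives a weight roughly 90 times larger than the most common one, so each of its errors contributes correspondingly more to the loss. Very large weights can make training unstable, so capping or smoothing them is worth considering.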

### 4.2 Model Selection

Different models have different trade-offs:

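| Model | Parameters | Speed | Accuracy | Best For |
|-------|------------|-------|----------|----------|
| DistilBERT | 66M | Fast | Good | Production, limited resources |
| BERT-base | 110M | Medium | Very Good | Balanced accuracy/speed |
| RoBERTa-base | 125M | Medium | Excellent | Maximum accuracy |
| ALBERT-base | 12M | Medium | Good | Memory-constrained deployment |
| DeBERTa-base | 140M | Slow | Excellent | Research, competitions |

Note that ALBERT's parameter sharing shrinks the model on disk and in memory but not the compute per forward pass, so its inference speed is close to BERT-base rather than DistilBERT.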

**Exercise 4**: Model comparison

Compare 3 different models on the same task:
1. Fine-tune DistilBERT, BERT, and RoBERTa on AG News
2. Measure accuracy, inference time, and model size
3. Create a comparison table
4. Recommend which to use for different scenarios
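
For the inference-time measurement in step 2, one possible approach is a small timing helper that runs a few warmup calls and then reports the median of several timed calls. The sketch below is illustrative: `median_latency` and `dummy_predict` are hypothetical names, and the dummy function merely stands in for a real classifier such as a fine-tuned pipeline.

```python
import statistics
import time

def median_latency(predict_fn, inputs, warmup=2, repeats=5):
    """Median wall-clock seconds for one call of predict_fn(inputs)."""
    for _ in range(warmup):  # warmup calls absorb one-time setup costs
        predict_fn(inputs)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(inputs)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def dummy_predict(texts):
    """Stand-in for a real model: returns a fake class ID per text."""
    return [len(text) % 4 for text in texts]

batch = ["Stock market hits record high"] * 100
latency = median_latency(dummy_predict, batch)
print(f"Median latency: {latency * 1000:.3f} ms per batch of {len(batch)}")
```

When timing a model on a GPU, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously and the timer would otherwise stop too early. For model size, `model.num_parameters()` (used earlier in this notebook) gives the parameter count.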

In [None]:
# YOUR CODE HERE

## Summary

### Key Concepts Covered:

1. **Quick Classification with Pipelines**:
   - Inference in a few lines of code with pre-trained models
   - Sentiment analysis out of the box
   - Limitations of pre-trained models

2. **Fine-Tuning for Custom Tasks**:
   - AG News topic classification
   - Hugging Face Trainer API
   - Evaluation metrics (accuracy, F1, confusion matrix)
   - Error analysis and interpretation

3. **Advanced Techniques**:
   - Multi-label classification
   - Handling imbalanced data
   - Model selection criteria
   - Production considerations

### Important Takeaways:

- **Transformers reach strong accuracy with little task-specific training**: pre-training does most of the work, though exact numbers depend on the task and data
- **Fine-tuning is fast**: Even 1-2 epochs can work well
- **Use pipelines for quick experiments**, Trainer for fine-tuning on your own data
- **Monitor both accuracy and F1**, especially for imbalanced data where accuracy alone can mislead
- **Error analysis is crucial** for improving models

### What's Next?

In **Module 11: Named Entity Recognition**, we'll learn:
- Token-level classification (vs sequence-level)
- BIO tagging scheme
- Extracting entities from text (people, organizations, locations)
- Building custom NER systems
- Combining NER with text classification

### Additional Resources:

- **Hugging Face Tasks**: [huggingface.co/tasks/text-classification](https://huggingface.co/tasks/text-classification)
- **GLUE Benchmark**: [gluebenchmark.com](https://gluebenchmark.com/)
- **Best Practices**: [hf.co/docs/transformers/training](https://huggingface.co/docs/transformers/training)