# 🔄 Transfer Learning with Transformers

**Module 02 | Notebook 1 of 3**

Transfer learning is the foundation of modern NLP. Instead of training from scratch, we start with a pre-trained model and adapt it to our specific task.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand when and why to use transfer learning
2. Use the Hugging Face Trainer API
3. Prepare datasets for fine-tuning
4. Monitor training progress

---

In [None]:
%%capture
!pip install transformers datasets accelerate evaluate scikit-learn

In [None]:
import torch
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset
import evaluate
import numpy as np
import warnings
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

---

## 1️⃣ What is Transfer Learning?

### The Traditional Approach (Training from Scratch)
```
Random Weights → Train on YOUR data → Task-specific Model
                    ↑
            Requires: Millions of examples
                     Weeks of training
                     Expensive GPUs
```

### Transfer Learning Approach
```
Pre-trained Model → Fine-tune on YOUR data → Task-specific Model
     (BERT)              ↑
                 Requires: Hundreds to thousands of examples
                          Minutes to hours of training
                          Single GPU
```

### Why Transfer Learning Works

Pre-trained models have already learned:
- **Grammar and syntax** from billions of sentences
- **Word meanings** and relationships
- **Common patterns** in language

You only need to teach them your specific task!

### When to Use Transfer Learning

| Scenario | Recommendation |
|----------|----------------|
| Limited labeled data (<10k examples) | ✅ Transfer Learning |
| Standard NLP task (classification, NER, QA) | ✅ Transfer Learning |
| Limited compute budget | ✅ Transfer Learning |
| Very domain-specific data (legal, medical) | ✅ Transfer Learning + Domain Pre-training |
| Massive dataset (>1M examples) | Training from scratch becomes viable; compare it against fine-tuning |

---

## 2️⃣ Dataset Preparation

We'll use the Rotten Tomatoes movie review dataset for sentiment classification.

In [None]:
# Load a small dataset for quick training
dataset = load_dataset("rotten_tomatoes")

print("Dataset structure:")
print(dataset)
print(f"\nTrain examples: {len(dataset['train'])}")
print(f"Validation examples: {len(dataset['validation'])}")
print(f"Test examples: {len(dataset['test'])}")

In [None]:
# Explore the data
print("Sample examples:")
print("-" * 60)
for i in range(3):
    example = dataset['train'][i]
    label = "Positive" if example['label'] == 1 else "Negative"
    print(f"Label: {label}")
    print(f"Text: {example['text'][:100]}...")
    print()

In [None]:
# Load tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function
def tokenize_function(examples):
    # No padding here: DataCollatorWithPadding pads each batch dynamically later
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=256
    )

# Tokenize the dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

print("Tokenized dataset columns:", tokenized_dataset['train'].column_names)
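To see why the padding strategy matters, the following sketch compares the number of token slots produced when every sequence is padded to the longest one in the whole dataset against padding each batch to its own longest sequence (the approach `DataCollatorWithPadding` takes later in this notebook). The token lengths and the helper names `padded_slots_global` and `padded_slots_per_batch` are made up for illustration; no tokenizer is involved.

```python
# Hypothetical token counts for eight reviews (illustrative, not real data).
lengths = [12, 48, 9, 31, 15, 22, 7, 40]
batch_size = 4


def padded_slots_global(lengths):
    """Token slots used when every sequence is padded to the global maximum."""
    return max(lengths) * len(lengths)


def padded_slots_per_batch(lengths, batch_size):
    """Token slots used when each batch is padded only to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


real_tokens = sum(lengths)
global_slots = padded_slots_global(lengths)
dynamic_slots = padded_slots_per_batch(lengths, batch_size)

print(f"Real tokens:          {real_tokens}")
print(f"Global padding slots: {global_slots}")
print(f"Per-batch slots:      {dynamic_slots}")
```

The gap between the two totals is pure padding that the model would otherwise have to process, and it tends to grow when a dataset contains a few unusually long examples.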

In [None]:
# Create a smaller subset for quick training (optional - use full dataset for better results)
small_train = tokenized_dataset['train'].shuffle(seed=42).select(range(1000))
small_val = tokenized_dataset['validation'].shuffle(seed=42).select(range(200))

print(f"Training samples: {len(small_train)}")
print(f"Validation samples: {len(small_val)}")

---

## 3️⃣ Model Setup

In [None]:
# Load pre-trained model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # Binary classification
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}
)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"\nModel architecture:")
print(model)

### Understanding the Model Structure

```
┌─────────────────────────────────────────────┐
│           DistilBERT Base Model             │  ← Pre-trained weights
│     (learned language understanding)        │    (66M parameters)
└─────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────┐
│         Classification Head                 │  ← Randomly initialized
│     (Linear: 768 → 2 classes)               │    (learns during fine-tuning)
└─────────────────────────────────────────────┘
                      │
                      ▼
              [NEGATIVE, POSITIVE]
```
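As a rough check on the diagram, the size of the new head can be worked out by hand: a linear layer with `n_in` inputs and `n_out` outputs holds `n_in * n_out` weights plus `n_out` biases. The helper `linear_params` below is an illustrative name, not a library function. The diagram is simplified: the Hugging Face DistilBERT classification model appears to also place a 768 → 768 `pre_classifier` layer before the final projection, which is worth confirming against the `print(model)` output above.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Parameters in a fully connected layer: one weight per input/output pair plus one bias per output."""
    return n_in * n_out + n_out


hidden_size = 768   # DistilBERT hidden size, as in the diagram
num_labels = 2      # NEGATIVE / POSITIVE

classifier = linear_params(hidden_size, num_labels)
pre_classifier = linear_params(hidden_size, hidden_size)  # assumed extra layer, see note above

print(f"Final classifier layer: {classifier:,} parameters")
print(f"Pre-classifier layer:   {pre_classifier:,} parameters")
print(f"Head total:             {classifier + pre_classifier:,} parameters")
```

Either way, the randomly initialized head is a small fraction of the roughly 66M pre-trained parameters, so most of the model starts fine-tuning with useful weights.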

---

## 4️⃣ Training with the Trainer API

The Hugging Face `Trainer` handles:
- Training loop
- Gradient accumulation
- Mixed precision training
- Logging and checkpointing
- Evaluation
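Gradient accumulation is the least obvious item on this list. The idea is to split a large batch into smaller micro-batches, sum their suitably scaled gradients, and only then take an optimizer step. The toy example below uses a one-parameter model `y = w * x` with mean squared error and plain Python rather than PyTorch; the data, the function names, and the micro-batch size are illustrative assumptions. It is a sketch of the concept, not of how `Trainer` implements it internally.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over the given examples."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Average gradient built up from equally sized micro-batches."""
    num_micro_batches = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_micro_batches):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient so the sum matches the full-batch mean.
        total += mse_gradient(w, xs[lo:hi], ys[lo:hi]) / num_micro_batches
    return total


# Made-up data roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_gradient(w, xs, ys)
accumulated = accumulated_gradient(w, xs, ys, micro_batch_size=2)

print(f"Full-batch gradient:  {full:.6f}")
print(f"Accumulated gradient: {accumulated:.6f}")
```

The two values agree up to floating-point rounding, which is why accumulation lets a memory-limited GPU imitate a larger batch. In `TrainingArguments` this is controlled by `gradient_accumulation_steps`, which this notebook leaves at its default of 1.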

In [None]:
# Define evaluation metrics
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_metric.compute(predictions=predictions, references=labels)
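To make the metric function concrete, here is what it does to a tiny hand-written batch. During evaluation, `Trainer` passes the model's raw logits and the true labels; `argmax` picks the highest-scoring class per row, and accuracy is the fraction of rows where that pick matches the label. The logits below are invented, and the sketch uses plain Python lists in place of NumPy arrays and the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Invented logits for four reviews: columns are [NEGATIVE, POSITIVE].
logits = [
    [2.1, -1.3],   # predicts NEGATIVE
    [-0.4, 0.9],   # predicts POSITIVE
    [0.2, 0.1],    # predicts NEGATIVE
    [-1.5, 1.8],   # predicts POSITIVE
]
labels = [0, 1, 1, 1]

metrics = accuracy(logits, labels)
print(metrics)
```

Three of the four predictions match their labels here; the third row is the miss, since the model leans slightly negative on a review labelled positive.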

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    
    # Training hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    
    # Evaluation
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    
    # Logging
    logging_dir="./logs",
    logging_steps=50,
    
    # Performance
    fp16=torch.cuda.is_available(),  # Use mixed precision if GPU available
    
    # Misc
    report_to="none",  # Disable wandb/tensorboard for this demo
    push_to_hub=False
)

print("Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Mixed precision: {training_args.fp16}")
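The interaction between `num_train_epochs`, the batch size, and `warmup_ratio` is easier to see with numbers. The sketch below assumes the 1,000-example training subset used in this notebook, a single device, no gradient accumulation, and a linear-decay scheduler (assumed here to be the `Trainer` default). Under those assumptions the learning rate ramps up over roughly the first 10% of steps and then decays linearly to zero. The function `linear_schedule_lr` is an illustrative re-implementation, not the library's code.

```python
import math

num_examples = 1000      # size of the training subset
batch_size = 16
num_epochs = 3
base_lr = 2e-5
warmup_ratio = 0.1

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))


print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")
print(f"Warmup steps:    {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"  step {step:>3}: lr = {linear_schedule_lr(step):.2e}")
```

Because the step count depends on the dataset size, switching from the subset to the full training split changes the number of warmup steps even though `warmup_ratio` stays the same.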

In [None]:
# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_val,
    tokenizer=tokenizer,  # newer transformers releases name this argument processing_class
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print("Trainer created successfully!")

In [None]:
# Train the model
print("Starting training...")
print("=" * 50)

train_result = trainer.train()

print("\n" + "=" * 50)
print("Training complete!")
print(f"Training time: {train_result.metrics['train_runtime']:.1f}s")
print(f"Samples per second: {train_result.metrics['train_samples_per_second']:.1f}")

In [None]:
# Evaluate on validation set
eval_results = trainer.evaluate()

print("\nEvaluation Results:")
print(f"  Loss: {eval_results['eval_loss']:.4f}")
print(f"  Accuracy: {eval_results['eval_accuracy']:.2%}")

---

## 5️⃣ Testing the Fine-tuned Model

In [None]:
from transformers import pipeline

# Create a pipeline with our fine-tuned model
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Test on new examples
test_texts = [
    "This movie was absolutely fantastic! A masterpiece.",
    "Terrible film. Complete waste of time.",
    "It was okay, nothing special but watchable.",
    "The acting was superb and the plot kept me engaged.",
    "I fell asleep halfway through. So boring."
]

print("Predictions:")
print("=" * 60)
for text in test_texts:
    result = sentiment_pipeline(text)[0]
    print(f"Text: {text[:50]}...")
    print(f"  → {result['label']} ({result['score']:.2%})\n")
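The `score` printed above is a probability, not a raw model output. For a multi-class head like this one, the text-classification pipeline is assumed here to apply a softmax to the logits and report the largest probability together with its `id2label` name. The logits below are invented for illustration; the sketch mirrors the idea rather than the pipeline's actual code.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    """Return the most probable label and its probability, in a pipeline-like dict shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


# Invented logits for one review: [NEGATIVE, POSITIVE].
prediction = to_prediction([-1.2, 2.3])
print(f"{prediction['label']} ({prediction['score']:.2%})")
```

A score close to 50%, as you might see for a lukewarm review such as "It was okay, nothing special but watchable.", means the two logits are nearly equal and the model is uncertain.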

---

## 6️⃣ Saving and Loading the Model

In [None]:
# Save the model
save_path = "./fine_tuned_sentiment_model"
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model saved to {save_path}")

In [None]:
# Load the saved model
loaded_model = AutoModelForSequenceClassification.from_pretrained(save_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_path)

# Verify it works
loaded_pipeline = pipeline(
    "sentiment-analysis",
    model=loaded_model,
    tokenizer=loaded_tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

result = loaded_pipeline("This is a great course!")[0]
print(f"Loaded model prediction: {result['label']} ({result['score']:.2%})")

---

## 🎯 Student Challenge

### Challenge: Fine-tune on AG News

Fine-tune a model for **4-class text classification** using the AG News dataset.

In [None]:
# TODO: Your code here
# 1. Load the AG News dataset: load_dataset("ag_news")
# 2. The dataset has 4 classes: World, Sports, Business, Sci/Tech
# 3. Modify num_labels and id2label/label2id
# 4. Train for 2 epochs on a subset
# 5. Report accuracy

# Hint: AG News uses 'text' and 'label' columns

# Your solution:


---

## 📝 Key Takeaways

1. **Transfer learning** leverages pre-trained models to reduce training time and data requirements
2. **The Trainer API** simplifies training with built-in best practices
3. **Classification heads** are added on top of pre-trained models for specific tasks
4. **Hyperparameters** like learning rate and batch size significantly impact results
5. **Save and load** models for deployment using `save_pretrained`/`from_pretrained`

---

## ➡️ Next Steps

Continue to `02_sentiment_analysis.ipynb` for a deeper dive into classification!