# A Full Training Loop in PyTorch üîÑ

This notebook provides a comprehensive guide to implementing custom training loops for transformer models using PyTorch. While the Hugging Face `Trainer` API abstracts away most training details, understanding the low-level training loop gives you:

- **Full control** over the training process
- **Custom optimization** strategies
- **Advanced techniques** like gradient accumulation and mixed precision
- **Better debugging** capabilities

## Table of Contents
1. [Data Preparation](#prepare-for-training)
2. [Basic Training Loop](#the-training-loop)
3. [Modern Optimizations](#modern-training-optimizations)
4. [Evaluation Loop](#the-evaluation-loop)
5. [Accelerate Library](#supercharge-your-training-loop-with--accelerate)
6. [Practice: SST-2 Fine-tuning](#modify-the-previous-training-loop-to-fine-tune-your-model-on-the-sst-2-dataset)

In [None]:
# Step 1: Load and Tokenize the Dataset
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Load the MRPC (Microsoft Research Paraphrase Corpus) dataset from GLUE benchmark
# MRPC: Binary classification - determine if two sentences are paraphrases
raw_datasets = load_dataset("glue", "mrpc")

# Load the BERT tokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenization function for sentence pairs
def tokenize_function(example):
    """
    Tokenize pairs of sentences for paraphrase detection.
    - truncation=True: Truncate sequences longer than model's max length
    - The tokenizer automatically handles [SEP] token between sentences
    """
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# Apply tokenization to all splits (train, validation, test)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# DataCollator handles dynamic padding - pads each batch to max length in that batch
# This is more efficient than padding all sequences to the same length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

## Prepare for Training

Before training, we need to prepare our dataset for PyTorch:

| Step | Action | Reason |
|------|--------|--------|
| 1 | Remove unused columns | Model only needs `input_ids`, `attention_mask`, `token_type_ids` |
| 2 | Rename `label` ‚Üí `labels` | Hugging Face models expect `labels` argument |
| 3 | Set format to PyTorch | Convert from Python lists to PyTorch tensors |

In [2]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [None]:
# Step 2: Prepare Dataset for PyTorch

# Remove columns the model doesn't need (raw text and index)
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])

# Rename 'label' to 'labels' (HuggingFace models expect 'labels')
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Convert to PyTorch tensors
tokenized_datasets.set_format("torch")

# Verify the columns are correct
print("‚úÖ Final columns:", tokenized_datasets["train"].column_names)

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [None]:
# Verify data is now in tensor format
example = tokenized_datasets["train"][0]
print(f"Type of input_ids: {type(example['input_ids'])}")
print(f"Shape of input_ids: {example['input_ids'].shape}")

<class 'torch.Tensor'>


In [5]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [None]:
# Step 3: Create DataLoaders
from torch.utils.data import DataLoader

# Training DataLoader with shuffling for better generalization
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,              # Shuffle training data each epoch
    batch_size=8,              # Process 8 examples at a time
    collate_fn=data_collator   # Dynamic padding per batch
)

# Evaluation DataLoader (no shuffling needed)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"],
    batch_size=8,
    collate_fn=data_collator
)

# Verify batch structure
for batch in train_dataloader:
    break
print("üì¶ Batch shapes:")
for k, v in batch.items():
    print(f"  {k}: {v.shape}")

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 78]),
 'token_type_ids': torch.Size([8, 78]),
 'attention_mask': torch.Size([8, 78])}

In [None]:
# Step 4: Load the Pre-trained Model
from transformers import AutoModelForSequenceClassification

# Load BERT with a classification head (2 labels: paraphrase or not)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Quick sanity check - forward pass should work
outputs = model(**batch)
print(f"‚úÖ Loss: {outputs.loss.item():.4f}")
print(f"‚úÖ Logits shape: {outputs.logits.shape}  # (batch_size, num_labels)")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor(0.5146, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


## Optimizer and Learning Rate Scheduler

### Choosing an Optimizer

| Optimizer | Use Case | Memory |
|-----------|----------|--------|
| `AdamW` | Standard choice for transformers | Normal |
| `AdamW8bit` | Memory-constrained training | ~50% less |
| `SGD` | Very large batch sizes | Lower |

### Learning Rate Guidelines
- **BERT-base**: `2e-5` to `5e-5`
- **Large models**: `1e-5` to `3e-5`
- **Always use warmup** for stability

```python
# Standard AdamW with weight decay (recommended)
AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# 8-bit Adam for memory efficiency (requires bitsandbytes)
import bitsandbytes as bnb
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=5e-5)
```

In [None]:
# Step 5: Configure Optimizer and Scheduler
from torch.optim import AdamW
from transformers import get_scheduler

# AdamW optimizer - the standard for transformer fine-tuning
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training configuration
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)

# Linear learning rate scheduler with warmup
# - Warmup: Gradually increase LR from 0 to target (prevents early instability)
# - Decay: Linearly decrease LR to 0 (helps convergence)
lr_scheduler = get_scheduler(
    name="linear",                           # Linear decay after warmup
    optimizer=optimizer,
    num_warmup_steps=0,                      # No warmup (can set to ~10% of total steps)
    num_training_steps=num_training_steps
)

print(f"üìä Training Configuration:")
print(f"   Epochs: {num_epochs}")
print(f"   Batches per epoch: {len(train_dataloader)}")
print(f"   Total training steps: {num_training_steps}")

1377


## The Training Loop

The basic training loop follows these steps for each batch:

```
‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ  1. Forward Pass    ‚Üí  Compute predictions and loss        ‚îÇ
‚îÇ  2. Backward Pass   ‚Üí  Calculate gradients                 ‚îÇ
‚îÇ  3. Optimizer Step  ‚Üí  Update model weights                ‚îÇ
‚îÇ  4. Scheduler Step  ‚Üí  Adjust learning rate                ‚îÇ
‚îÇ  5. Zero Gradients  ‚Üí  Clear gradients for next iteration  ‚îÇ
‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
```

In [None]:
# Step 6: Setup Device (GPU/CPU)
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

print(f"üñ•Ô∏è Training on: {device}")
if device.type == "cuda":
    print(f"   GPU: {torch.cuda.get_device_name(0)}")

device(type='cuda')

In [None]:
# Step 7: Basic Training Loop
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps), desc="Training")

model.train()  # Set model to training mode (enables dropout, etc.)

for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move batch to device (GPU/CPU)
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass - compute predictions and loss
        outputs = model(**batch)
        loss = outputs.loss
        
        # Backward pass - compute gradients
        loss.backward()
        
        # Optimizer step - update weights using gradients
        optimizer.step()
        
        # Scheduler step - adjust learning rate
        lr_scheduler.step()
        
        # Zero gradients - clear for next iteration
        optimizer.zero_grad()
        
        # Update progress bar
        progress_bar.update(1)
        progress_bar.set_postfix({"loss": f"{loss.item():.4f}"})

  0%|          | 0/1377 [00:00<?, ?it/s]

In [None]:
# Step 8: Evaluation Loop
import evaluate

# Load the GLUE MRPC metric (F1 and Accuracy)
metric = evaluate.load("glue", "mrpc")

model.eval()  # Set model to evaluation mode (disables dropout)

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    
    # No gradient computation needed for evaluation
    with torch.no_grad():
        outputs = model(**batch)
    
    # Get predictions from logits
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    
    # Add batch results to metric
    metric.add_batch(predictions=predictions, references=batch["labels"])

# Compute final metrics
results = metric.compute()
print(f"üìà Evaluation Results:")
print(f"   Accuracy: {results['accuracy']:.4f}")
print(f"   F1 Score: {results['f1']:.4f}")

{'accuracy': 0.8602941176470589, 'f1': 0.9025641025641026}

---

## Modern Training Optimizations üöÄ

To make training more efficient and stable, we can add these techniques:

| Technique | Benefit | When to Use |
|-----------|---------|-------------|
| **Gradient Clipping** | Prevents exploding gradients | Always recommended |
| **Mixed Precision (FP16)** | 2x faster, 50% less memory | GPU with Tensor Cores |
| **Gradient Accumulation** | Simulate larger batch sizes | Limited GPU memory |
| **Checkpointing** | Resume interrupted training | Long training runs |

### How They Work Together:

```
‚îå‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îê
‚îÇ  for step, batch in enumerate(dataloader):                        ‚îÇ
‚îÇ      with autocast('cuda'):           # ‚Üê Mixed Precision         ‚îÇ
‚îÇ          loss = model(**batch).loss                                ‚îÇ
‚îÇ          loss = loss / accum_steps    # ‚Üê Gradient Accumulation   ‚îÇ
‚îÇ                                                                    ‚îÇ
‚îÇ      scaler.scale(loss).backward()    # ‚Üê Scaled Backward         ‚îÇ
‚îÇ                                                                    ‚îÇ
‚îÇ      if (step + 1) % accum_steps == 0:                            ‚îÇ
‚îÇ          scaler.unscale_(optimizer)                                ‚îÇ
‚îÇ          clip_grad_norm_(params, 1.0) # ‚Üê Gradient Clipping       ‚îÇ
‚îÇ          scaler.step(optimizer)                                    ‚îÇ
‚îÇ          scaler.update()                                           ‚îÇ
‚îÇ          optimizer.zero_grad()                                     ‚îÇ
‚îî‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îÄ‚îò
```

In [None]:
# Advanced Training Loop with All Optimizations
from tqdm.auto import tqdm
import torch
from torch.amp import GradScaler, autocast

# Configuration
gradient_accumulation_steps = 4              # Accumulate gradients over 4 batches
effective_batch_size = 8 * gradient_accumulation_steps  # = 32

# Mixed Precision: GradScaler handles loss scaling to prevent underflow in FP16
scaler = GradScaler('cuda')

# Adjust progress bar for accumulated steps
total_optimization_steps = num_training_steps // gradient_accumulation_steps
progress_bar = tqdm(range(total_optimization_steps), desc="Training (Optimized)")

model.train()

for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Mixed Precision: autocast automatically uses FP16 where safe
        with autocast('cuda'):
            outputs = model(**batch)
            # Scale loss by accumulation steps for correct gradient magnitude
            loss = outputs.loss / gradient_accumulation_steps
        
        # Backward pass with gradient scaling (prevents FP16 underflow)
        scaler.scale(loss).backward()
        
        # Only update weights after accumulating enough gradients
        if (step + 1) % gradient_accumulation_steps == 0:
            # Unscale gradients before clipping
            scaler.unscale_(optimizer)
            
            # Gradient Clipping: Prevent exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            
            # Optimizer step with scaled gradients
            scaler.step(optimizer)
            scaler.update()
            
            # Learning rate scheduler step
            lr_scheduler.step()
            
            # Clear gradients for next accumulation cycle
            optimizer.zero_grad()
            
            # Update progress
            progress_bar.update(1)
            progress_bar.set_postfix({
                "loss": f"{loss.item() * gradient_accumulation_steps:.4f}",
                "lr": f"{lr_scheduler.get_last_lr()[0]:.2e}"
            })

print(f"\n‚úÖ Training complete! Effective batch size: {effective_batch_size}")

  0%|          | 0/344 [00:00<?, ?it/s]

---

## The Evaluation Loop

After training, we evaluate the model on the validation set:

```python
model.eval()                    # Disable dropout, batch norm in eval mode
with torch.no_grad():           # No gradient computation needed
    outputs = model(**batch)
predictions = torch.argmax(outputs.logits, dim=-1)
```

In [None]:
# Install the evaluate library if not already installed
!pip install evaluate -q



In [None]:
# Evaluation after optimized training
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

results = metric.compute()
print(f"üìà Final Evaluation Results:")
print(f"   Accuracy: {results['accuracy']:.4f}")
print(f"   F1 Score: {results['f1']:.4f}")

{'accuracy': 0.8602941176470589, 'f1': 0.9025641025641026}

---

## Supercharge Your Training Loop with ü§ó Accelerate

The **Accelerate** library provides a simple way to run the same training code on:
- Single GPU
- Multiple GPUs (Data Parallel)
- TPUs
- Mixed precision

### Key Benefits:
| Feature | Benefit |
|---------|---------|
| `accelerator.prepare()` | Automatically handles device placement |
| `accelerator.backward()` | Handles distributed gradients |
| `accelerator.gather_for_metrics()` | Collects predictions across devices |

### Minimal Code Changes:
```python
from accelerate import Accelerator
accelerator = Accelerator()

# Wrap your objects
model, optimizer, train_dl, eval_dl = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

# Replace loss.backward() with:
accelerator.backward(loss)
```

In [None]:
# Complete Training Script with Accelerate
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from torch.utils.data import DataLoader
import torch
import evaluate
from tqdm.auto import tqdm

# ============ Data Preparation ============
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets = tokenized_datasets.with_format("torch")

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

# ============ Accelerator Setup ============
accelerator = Accelerator()  # Automatically detects available hardware

# ============ Model & Optimizer ============
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# Prepare for distributed training (handles device placement automatically)
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

# ============ Scheduler ============
num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

# ============ Training Loop ============
progress_bar = tqdm(range(num_training_steps), desc="Training with Accelerate")

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        
        # Use accelerator.backward() instead of loss.backward()
        accelerator.backward(loss)
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

# ============ Evaluation ============
metric = evaluate.load("glue", "mrpc")
model.eval()

for batch in eval_dl:
    with torch.no_grad():
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    
    # Gather predictions from all devices (important for distributed training)
    predictions, references = accelerator.gather_for_metrics(
        (predictions, batch["labels"])
    )
    metric.add_batch(predictions=predictions, references=references)

eval_metric = metric.compute()
print(f"\nüìà Evaluation Results: Accuracy={eval_metric['accuracy']:.4f}, F1={eval_metric['f1']:.4f}")

Map:   0%|          | 0/408 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1377 [00:00<?, ?it/s]

Evaluation Results: {'accuracy': 0.8651960784313726, 'f1': 0.9066213921901528}


In [None]:
# Install required packages
!pip install datasets evaluate accelerate -q

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting dill<0.4.1,>=0.3.0 (from datasets)
  Downloading dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.6.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)
Collecting multiprocess<0.70.19 (from datasets)
  Downloading multiprocess-0.70.18-py312-none-any.whl.metadata (7.5 kB)
Collecting fsspec<=2025.10.0,>=2023.1.0 (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.10.0-py3-none-any.whl.metadata (10 kB)
Collecting aiohttp!=4.0.0a0,!=4.0.0a1 (from fsspec[http]<=2025.10.0,>=2023.1.0->datasets)
  Downloading aiohttp-3.13.2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (8.1 kB)
Collecting aiohappyeyeballs>=2.5.0 (from aiohttp!=4.0.0a0,!=4.0.0a1->fsspe

In [None]:
%%writefile train_script.py
"""
Distributed Training Script with Accelerate

Run with: accelerate launch train_script.py
Configure with: accelerate config
"""
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import (
    AutoModelForSequenceClassification,
    get_scheduler,
    AutoTokenizer,
    default_data_collator
)
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch
import evaluate
from tqdm.auto import tqdm

def main():
    # ============ Data Preparation ============
    raw_datasets = load_dataset("glue", "mrpc")
    checkpoint = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    
    def tokenize_function(example):
        return tokenizer(
            example["sentence1"],
            example["sentence2"],
            truncation=True,
            padding="max_length",
            max_length=128
        )
    
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    tokenized_datasets = tokenized_datasets.with_format("torch")
    
    train_dataloader = DataLoader(
        tokenized_datasets["train"],
        shuffle=True,
        batch_size=8,
        collate_fn=default_data_collator
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"],
        batch_size=8,
        collate_fn=default_data_collator
    )
    
    # ============ Accelerator Setup ============
    accelerator = Accelerator()
    
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
    optimizer = AdamW(model.parameters(), lr=3e-5)
    
    train_dl, eval_dl, model, optimizer = accelerator.prepare(
        train_dataloader, eval_dataloader, model, optimizer
    )
    
    # ============ Training Configuration ============
    num_epochs = 3
    num_training_steps = num_epochs * len(train_dl)
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )
    
    # Only show progress on main process
    if accelerator.is_main_process:
        progress_bar = tqdm(range(num_training_steps), desc="Training")
    
    # ============ Training Loop ============
    model.train()
    for epoch in range(num_epochs):
        for batch in train_dl:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            
            if accelerator.is_main_process:
                progress_bar.update(1)
    
    # ============ Evaluation ============
    if accelerator.is_main_process:
        print("\nüîç Starting evaluation...")
    
    metric = evaluate.load("glue", "mrpc")
    model.eval()
    
    for batch in eval_dl:
        with torch.no_grad():
            outputs = model(**batch)
        
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        predictions, references = accelerator.gather_for_metrics(
            (predictions, batch["labels"])
        )
        metric.add_batch(predictions=predictions, references=references)
    
    if accelerator.is_main_process:
        eval_metric = metric.compute()
        print(f"üìà Results: Accuracy={eval_metric['accuracy']:.4f}, F1={eval_metric['f1']:.4f}")

if __name__ == "__main__":
    main()

Overwriting train_script.py


### Running Distributed Training

```bash
# Configure accelerate (choose hardware: CPU, GPU, multi-GPU, TPU)
accelerate config

# Launch training
accelerate launch train_script.py
```

In [5]:
!accelerate launch train_script.py

	`--num_processes` was set to a value of `0`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
README.md: 35.3kB [00:00, 93.5MB/s]
mrpc/train-00000-of-00001.parquet: 100% 649k/649k [00:01<00:00, 404kB/s]  
mrpc/validation-00000-of-00001.parquet: 100% 75.7k/75.7k [00:00<00:00, 104kB/s]
mrpc/test-00000-of-00001.parquet: 100% 308k/308k [00:00<00:00, 760kB/s]
Generating train split: 100% 3668/3668 [00:00<00:00, 654389.92 examples/s]
Generating validation split: 100% 408/408 [00:00<00:00, 248731.98 examples/s]
Generating test split: 100% 1725/1725 [00:00<00:00, 629912.45 examples/s]
tokenizer_config.json: 100% 48.0/48.0 [00:00<00:00, 682kB/s]
config.json: 100% 570/570 [00:00<00:00, 6.46MB/s]
vocab.txt: 100% 232k/232k [00:00<00:00, 16.0MB/s]
tokenizer.json: 100% 466k/466k [00:00<00:00, 3.63MB/s]
Map: 100% 3668/3668 [00:00<00:00, 13716.58 examples/s]
Map: 100% 408/408 [00:00<00:00, 14470.70 examples/s

---

## üìù Exercise: Fine-tune on SST-2 Dataset

Now it's your turn! Modify the training loop to fine-tune BERT on the **SST-2** (Stanford Sentiment Treebank) dataset:

- **Task**: Binary sentiment classification (positive/negative)
- **Dataset**: Single sentences with sentiment labels
- **Key difference**: Only one sentence per example (not sentence pairs like MRPC)

In [None]:
# Solution: Complete SST-2 Fine-tuning Pipeline
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm.auto import tqdm
import evaluate

# ============ Setup ============
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"üñ•Ô∏è Using device: {device}")

# ============ Data Preparation ============
# SST-2: Stanford Sentiment Treebank (binary sentiment classification)
raw_datasets = load_dataset("glue", "sst2")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Note: SST-2 has single sentences (unlike MRPC which has sentence pairs)
def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Prepare dataset
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

print(f"üìä Dataset: SST-2")
print(f"   Training samples: {len(tokenized_datasets['train'])}")
print(f"   Validation samples: {len(tokenized_datasets['validation'])}")

# ============ Model Setup ============
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.to(device)

# ============ Training Configuration ============
optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 1  # SST-2 is larger, so 1 epoch is often enough for demo
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

# ============ Training Loop ============
progress_bar = tqdm(range(num_training_steps), desc="Fine-tuning on SST-2")

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

print("\n‚úÖ Training complete! Starting evaluation...")

# ============ Evaluation ============
metric = evaluate.load("glue", "sst2")
model.eval()

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

final_score = metric.compute()
print(f"\nüìà SST-2 Evaluation Results:")
print(f"   Accuracy: {final_score['accuracy']:.4f}")

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Map:   0%|          | 0/1821 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/8419 [00:00<?, ?it/s]

Training Finished! Starting Evaluation...
{'accuracy': 0.9162844036697247}
