# Week 2 — Hugging Face & Transformers: Using Pretrained Models

This notebook walks through the essentials of **using Hugging Face**. We cover:
- Selecting checkpoints and using pipelines
- Manual tokenization and model forward passes
- Generation parameters and devices
- Batching with `datasets`
- Caching, offline mode, and revisions
- Optional: Hosted Inference API

Take the opportunity to experiment: vary the models' parameters to get a feel for how they influence the output.


## Setup
Install the core libraries (CPU by default). If you have a GPU, install the appropriate PyTorch build and optionally `bitsandbytes`.

- Definition: An HF token is a personal key for accessing gated/private repos or hosted inference.
- Why: Some models require accepting a license; hosted endpoints need to know who is calling.

- Terminal: `pip install -U transformers datasets huggingface_hub accelerate safetensors`
- GPU (optional): `pip install bitsandbytes` and a CUDA-enabled torch wheel.

Authentication is only needed for gated/private repos or the hosted Inference API. You can either run `huggingface-cli login` or set `HF_TOKEN` in your environment.


In [None]:
import os
import torch
import accelerate
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
)
from datasets import load_dataset

HF_TOKEN = os.getenv('HF_TOKEN') or os.getenv('HUGGINGFACEHUB_API_TOKEN')
DEVICE = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')
DEVICE


## 1) Pipelines: Quick Inference
- Definition: A pipeline bundles the right tokenizer, model, and postprocessing for a task.
- Why: It reduces moving parts so you can confirm the model works before customizing.


In [None]:
# Sentiment analysis (binary SST-2)
sent_clf = pipeline(
    'text-classification',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device_map='auto'
)
sent_clf(['I love data!', 'This is terrible...'])


In [None]:
# Fill-mask
mlm = pipeline('fill-mask', model='bert-base-uncased', device_map='auto')
mlm('Marseille is the real [MASK] of France.')


In [None]:
# Text generation (small model for speed)
gen = pipeline('text-generation', model='gpt2', device_map='auto')
gen('What is the capital of France?', max_new_tokens=40, do_sample=True, temperature=0.7)

### Exercise
Rerun the example above with the version of `gpt2` that was committed on Nov 23, 2022, on the Hugging Face Hub.

### Exercise
- Try `zero-shot-classification` with `facebook/bart-large-mnli`.
- Try `summarization` with `facebook/bart-large-cnn` on a paragraph.
- Try `feature-extraction` with `sentence-transformers/all-MiniLM-L6-v2` and compute cosine similarity between two sentences.
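
For the last exercise, cosine similarity can be computed in a few lines. The sketch below uses made-up 4-dimensional vectors in place of real sentence embeddings; `emb_a`, `emb_b`, and `emb_c` are illustrative names, not pipeline output. Note that the `feature-extraction` pipeline typically returns one vector per token, so some pooling (for example, averaging over tokens) is usually needed before comparing sentences.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (made-up values).
emb_a = [0.2, 0.1, 0.7, 0.0]
emb_b = [0.4, 0.2, 1.4, 0.0]  # same direction as emb_a, twice the length
emb_c = [0.0, 0.9, 0.0, 0.3]  # points in a mostly different direction

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Because cosine similarity ignores vector length, `emb_a` and `emb_b` score as identical in direction even though their magnitudes differ.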


## 2) Manual Tokenization + Model Forward

### Understanding AutoModelForSequenceClassification
- **What**: A model architecture for text classification with:
  - Transformer backbone (e.g., BERT, DistilBERT)
  - Classification head (linear layer) on top
  - Output logits for each class (here: positive/negative sentiment)
- **Why**: Perfect for tasks like sentiment analysis, topic classification, or intent detection

In this example, we're doing binary sentiment classification:
- Input: Text (e.g., movie review)
- Output: Binary prediction (0 = negative, 1 = positive)
- Model: DistilBERT fine-tuned on SST-2 (Stanford Sentiment Treebank)
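
To make the step from logits to a binary prediction concrete, here is a minimal sketch of softmax in plain Python. The logit values and the `[negative, positive]` ordering are assumptions for illustration; in practice the ordering comes from the model's `config.id2label`.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one review, ordered as [negative, positive].
example_logits = [-1.5, 2.0]
example_probs = softmax(example_logits)

# The predicted class is the index with the highest probability.
predicted_id = max(range(len(example_probs)), key=example_probs.__getitem__)
id2label = {0: "negative", 1: "positive"}
print(id2label[predicted_id], round(example_probs[predicted_id], 3))
```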



In [None]:
# Set up model and tokenizer for manual inference
model_id = 'distilbert-base-uncased-finetuned-sst-2-english'
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSequenceClassification.from_pretrained(model_id)

# Manual inference
texts = [
    'I absolutely loved this movie!',
    'The movie was okay; yet I cannot say anything positive about it.',
    'The plot was weak and boring.',
    'The actors were neither good nor bad, just average.',
]
# Create batch
batch = tok(texts, padding=True, truncation=True, return_tensors='pt')

# Forward pass
with torch.no_grad():  # disable gradient tracking to save memory
    out = mdl(**batch)

# Convert logits to probabilities (one row per text, one column per class)
probs = out.logits.softmax(-1)
print(probs)

# Print results with class labels
labels = ['negative', 'positive']
for text, prob in zip(texts, probs):
    pred_class = labels[prob.argmax().item()]
    confidence = prob.max().item()
    print(f"\nText: {text}\nPrediction: {pred_class} ({confidence:.2%})")

### Understanding Text Generation with `generate()`
- **What**: A method for auto-regressive text generation that:
  - Takes an input prompt
  - Generates new tokens one by one
  - Uses different decoding strategies (greedy, beam, sampling)
- **Why**: Control the trade-off between:
  - Creativity vs determinism
  - Diversity vs coherence
  - Speed vs quality

Key parameters explained:
- `max_new_tokens`: Maximum number of tokens to generate, not counting the prompt
- `temperature`: Controls randomness when sampling (higher = more creative); only takes effect with `do_sample=True`
- `top_p`: Nucleus sampling threshold (cumulative probability)
- `top_k`: Limits vocabulary to k most likely tokens
- `num_beams`: Number of parallel sequences to explore
- `repetition_penalty`: Discourages word repetition
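
The effect of `temperature`, `top_k`, and `top_p` on a single decoding step can be sketched without any model. The function below is a simplified illustration on made-up logits; `next_token_distribution` and `toy_logits` are hypothetical names, and the library's own implementation differs in its details.

```python
import math

def _normalise(pairs):
    """Rescale (token, weight) pairs so the weights sum to 1."""
    total = sum(p for _, p in pairs)
    return [(tok, p / total) for tok, p in pairs]

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before softmax (<1 sharpens, >1 flattens).
    scaled = [(tok, score / temperature) for tok, score in logits.items()]
    peak = max(s for _, s in scaled)
    ranked = _normalise([(tok, math.exp(s - peak)) for tok, s in scaled])
    ranked.sort(key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = _normalise(ranked[:top_k])
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = _normalise(kept)
    return dict(ranked)

# Made-up scores for the token following "The capital of France is".
toy_logits = {"Paris": 3.0, "Lyon": 2.0, "Marseille": 1.5, "banana": -1.0}

baseline = next_token_distribution(toy_logits)
flatter = next_token_distribution(toy_logits, temperature=2.0)
top2 = next_token_distribution(toy_logits, top_k=2)
nucleus = next_token_distribution(toy_logits, top_p=0.5)

for name, dist in [("baseline", baseline), ("T=2.0", flatter),
                   ("top_k=2", top2), ("top_p=0.5", nucleus)]:
    print(name, {tok: round(p, 3) for tok, p in dist.items()})
```

Sampling then draws the next token from the filtered distribution, whereas greedy decoding always takes the most likely token.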



In [None]:
# Set up model and tokenizer
lm_id = 'gpt2'  # small demo model
tok_lm = AutoTokenizer.from_pretrained(lm_id)
lm = AutoModelForCausalLM.from_pretrained(lm_id)

# Prepare input
prompt = 'What is so special about Marseille?'
inputs = tok_lm(prompt, return_tensors='pt')

# Generate with different settings
with torch.no_grad():  # inference mode
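    # Conservative settings (more focused): deterministic beam search.
    # temperature is omitted because it only takes effect when do_sample=True.
    conservative = lm.generate(
        **inputs,
        max_new_tokens=64,
        num_beams=5,
        repetition_penalty=1.2,
        pad_token_id=tok_lm.eos_token_id,  # GPT-2 defines no pad token
    )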
    
    # Creative settings (more diverse)
    creative = lm.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=1.2,
        top_p=0.92,
        top_k=50
    )

# Print results
print("Conservative output:")
print(tok_lm.decode(conservative[0], skip_special_tokens=True))
print("\nCreative output:")
print(tok_lm.decode(creative[0], skip_special_tokens=True))

## 3) Devices and Memory
### Understanding Device Placement and Memory Options
- **What**: Ways to optimize model loading and inference:
  - `device_map="auto"`: Smart layer distribution across hardware
  - `dtype`: Numeric precision (float32, float16, bfloat16)
  - Quantization: 8-bit or 4-bit weights (via `bitsandbytes`)
- **Why**: Critical for:
  - Fitting larger models in limited memory
  - Speeding up inference
  - Balancing speed vs accuracy

### Key Concepts
1. **Device Mapping**:
   - `auto`: Automatic placement based on available memory
   - `cpu`/`cuda`/`mps`: Manual device selection
   - Split placement for very large models

2. **Precision Options**:
   - `float32`: Full precision, highest accuracy
   - `float16`: Half precision, 2x memory savings
   - `bfloat16`: Better numeric stability than float16

3. **Quantization**:
   - 8-bit: Good balance of speed/memory/accuracy
   - 4-bit: Maximum memory savings
   - Requires: `bitsandbytes` library
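
A back-of-the-envelope estimate shows why precision matters for memory. The sketch below multiplies a parameter count by the bytes each format needs per weight. The 66-million figure is a rough, assumed size for a DistilBERT-class model, and the estimate covers weights only: activations, and the extra scaling metadata that quantized formats carry, are not included.

```python
# Storage cost per weight for each format discussed above.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "4-bit": 0.5,
}

def weight_memory_mb(n_params, precision):
    """Approximate memory for the weights alone, in megabytes (1 MB = 1e6 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e6

n_params = 66_000_000  # assumed, rough parameter count for illustration
for precision in BYTES_PER_PARAM:
    print(f"{precision:>8}: {weight_memory_mb(n_params, precision):7.1f} MB")
```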



In [None]:
import time, torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def run_inference_benchmark(name, texts, devices, dtypes):
    """Benchmark model performance across devices and dtypes"""
    rows = []
    
    for device in devices:
        for dtype in dtypes:
            # Skip unsupported combinations
            if device == 'cpu' and dtype is torch.float16:
                continue
                
            # Load model with specified precision
            print(f"\nTesting {device} with {dtype}")
            tok = AutoTokenizer.from_pretrained(name)
            mdl = AutoModelForSequenceClassification.from_pretrained(
                name, 
                torch_dtype=dtype,
                device_map='auto' if device == 'cuda' else None
            )
            if device != 'cuda':  # Only move to device if not using device_map='auto'
                mdl.to(device)
            
            # Prepare input batch
            batch = tok(texts, padding=True, truncation=True, return_tensors='pt')
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Memory measurements
            param_mb = sum(p.numel() * p.element_size() for p in mdl.parameters()) / 1e6
            if device == 'cuda':
                torch.cuda.reset_peak_memory_stats()
                
            # Warmup run
            with torch.no_grad():
                mdl(**batch)
                
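            # Timed inference; wait for queued GPU work so timings are not under-reported
            if device == 'cuda':
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            with torch.no_grad():
                for _ in range(10):
                    mdl(**batch)
            if device == 'cuda':
                torch.cuda.synchronize()
            dt = time.perf_counter() - t0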
            
            # Record peak GPU memory if applicable
            peak_mb = None
            if device == 'cuda':
                peak_mb = torch.cuda.max_memory_allocated() / 1e6
                
            rows.append((
                device, 
                str(dtype).replace('torch.', ''),
                round(param_mb, 1),
                round(dt, 3),
                None if peak_mb is None else round(peak_mb, 1)
            ))
            
    return rows

# Run benchmark
name = 'distilbert-base-uncased-finetuned-sst-2-english'
texts = ['STUDYING DATA SCIENCE IS FUN!'] * 32  # Multiple instances for better timing

# Detect available devices
devices = []
if torch.cuda.is_available(): devices.append('cuda')
if torch.backends.mps.is_available(): devices.append('mps')
devices.append('cpu')

# Test different precisions
dtypes = [torch.float32, torch.float16]

# Run and display results
rows = run_inference_benchmark(name, texts, devices, dtypes)
print('\nBenchmark Results:')
print(f"{'Device':>8} {'Dtype':>8} {'ParamMB':>8} {'Seconds':>8} {'PeakMB':>8}")
print('-' * 45)
for r in rows:
    print(f"{r[0]:>8} {r[1]:>8} {r[2]:>8.1f} {r[3]:>8.3f} {str(r[4]):>8}")

## 4) Understanding Batch Processing with Hugging Face Transformers

### Key Concepts
- **What**: Processing multiple inputs simultaneously through the model
- **Why**: Trade-off between throughput and memory usage
  - ✓ Faster: Parallel processing of multiple inputs
  - ✓ Efficient: Better GPU utilization
  - × Memory: Higher RAM/VRAM usage due to padding
  - × Latency: Must wait for entire batch to complete

### Memory Considerations
1. **Input Padding**:
   - Batches must have uniform length
   - Shorter sequences get padded to longest in batch
   - More padding = more wasted memory

2. **Activation Memory**:
   - Scales with batch size
   - Depends on model architecture
   - Key limiting factor for large models

3. **Optimal Batch Size**:
   - Hardware dependent (GPU memory)
   - Task dependent (sequence length)
   - Model dependent (parameters)
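
The padding cost described above is easy to quantify. This sketch counts how many token slots are pure padding for a list of hypothetical sequence lengths, first in their original order and then sorted by length so that similar lengths share a batch. `padding_stats` and `token_lengths` are illustrative names.

```python
def padding_stats(lengths, batch_size):
    """Return (total_slots, padded_slots) when batching lengths in the given order."""
    total = padded = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        longest = max(chunk)  # every sequence is padded to the longest in its batch
        total += longest * len(chunk)
        padded += sum(longest - n for n in chunk)
    return total, padded

# Hypothetical token counts for eight reviews.
token_lengths = [12, 480, 35, 20, 510, 18, 40, 15]

unsorted_total, unsorted_pad = padding_stats(token_lengths, batch_size=4)
sorted_total, sorted_pad = padding_stats(sorted(token_lengths), batch_size=4)

print(f"original order: {unsorted_pad}/{unsorted_total} slots are padding")
print(f"sorted by length: {sorted_pad}/{sorted_total} slots are padding")
```

Grouping sequences of similar length is one common way to cut padding; the trade-off is that predictions must be mapped back to the original order afterwards.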

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import time
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
import psutil

def measure_memory_cpu():
    """Get current memory usage in MB"""
    process = psutil.Process()
    return process.memory_info().rss / 1024 / 1024

def process_with_detailed_metrics(dataset, batch_size, model, tokenizer):
    """Process dataset with detailed per-batch metrics"""
    predictions = []
    memory_usage = []
    batch_times = []
    batch_accuracies = []
    
    # Record initial memory state
    base_memory = measure_memory_cpu()
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        base_gpu_memory = torch.cuda.memory_allocated()
    
    with tqdm(range(0, len(dataset), batch_size), desc='Processing') as pbar:
        for i in pbar:
            batch = dataset[i:i + batch_size]
            batch_t0 = time.perf_counter()
            
            # Process batch
            inputs = tokenizer(
                batch['text'],  # Using 'text' field from dataset
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors='pt'
            ).to(model.device)
            
            with torch.no_grad():
                outputs = model(**inputs)
                preds = outputs.logits.argmax(-1).cpu().tolist()
                predictions.extend(preds)
            
            # Metrics
            batch_time = time.perf_counter() - batch_t0
            batch_times.append(batch_time)
            
            current_cpu_mem = measure_memory_cpu() - base_memory
            if torch.cuda.is_available():
                current_gpu_mem = torch.cuda.max_memory_allocated() / 1e6
                total_mem = current_cpu_mem + current_gpu_mem
            else:
                total_mem = current_cpu_mem
                
            memory_usage.append(total_mem)
            
            # Update progress
            pbar.set_postfix({
                'Memory (MB)': f'{total_mem:.0f}',
                'Time (s)': f'{batch_time:.3f}'
            })
    
    return {
        'predictions': predictions,
        'batch_times': batch_times,
        'memory_usage': memory_usage,
        'avg_batch_time': np.mean(batch_times),
        'avg_memory': np.mean(memory_usage),
        'peak_memory': max(memory_usage)
    }

# Load test dataset
dataset = load_dataset('imdb', split='test[:100]')  # Small subset for demo

# Load model and tokenizer
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.cuda()

# Test different batch sizes
batch_sizes = [1, 4, 16, 32]
results = {}

for bs in batch_sizes:
    print(f"\nTesting batch_size={bs}")
    results[bs] = process_with_detailed_metrics(dataset, bs, model, tokenizer)

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Plot timing
times = [r['avg_batch_time'] for r in results.values()]
ax1.plot(batch_sizes, times, marker='o')
ax1.set_xlabel('Batch Size')
ax1.set_ylabel('Average Time per Batch (s)')
ax1.set_title('Processing Time vs Batch Size')

# Plot memory
memory = [r['peak_memory'] for r in results.values()]
ax2.plot(batch_sizes, memory, marker='o', color='orange')
ax2.set_xlabel('Batch Size')
ax2.set_ylabel('Peak Memory Usage (MB)')
ax2.set_title('Memory Usage vs Batch Size')

plt.tight_layout()
plt.show()

## 5) Revisions, Caching, and Offline
- Definition: A checkpoint is a released model; a revision is an exact commit/tag.
- Why: Pinning revisions and controlling caches ensures reproducibility on different machines.
- Pin exact versions with `revision=` when calling `from_pretrained`.
- Set the cache directory with `HF_HOME`; `TRANSFORMERS_CACHE` is an older variable that recent `transformers` releases deprecate.
- Force offline mode with `HF_HUB_OFFLINE=1` (uses only local cache).
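
A minimal sketch of the environment setup, assuming it runs in a fresh Python process before `transformers` or `huggingface_hub` is imported (the libraries read these variables at import time). The cache path and the `PINNED` dictionary are hypothetical; any writable directory and any model IDs would do.

```python
import os

# Hypothetical cache location; any writable directory works.
cache_dir = os.path.join(os.path.expanduser("~"), ".cache", "hf_week2_demo")
os.environ["HF_HOME"] = cache_dir

# Offline mode: only files already in the local cache are used.
os.environ["HF_HUB_OFFLINE"] = "1"

# Keep pinned revisions in one place so tokenizer and model always match.
# Replace "main" with a commit SHA or tag for real reproducibility.
PINNED = {
    "distilbert-base-uncased-finetuned-sst-2-english": "main",
}

for model_id, revision in PINNED.items():
    print(f"{model_id} @ {revision}")
print("HF_HOME =", os.environ["HF_HOME"])
print("HF_HUB_OFFLINE =", os.environ["HF_HUB_OFFLINE"])
```

With offline mode on, loading a model that is not yet cached fails instead of downloading it, so run once online first and remove the variable to go online again.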


In [None]:
# Example: pinning a revision (replace 'main' with a real commit SHA or tag for production)
# Tokenizer and model are loaded from the same repo so they are guaranteed to match.
pinned_id = 'distilbert-base-uncased-finetuned-sst-2-english'
tok_pinned = AutoTokenizer.from_pretrained(pinned_id, revision='main')
mdl_pinned = AutoModelForSequenceClassification.from_pretrained(pinned_id, revision='main')

# Where is the cache?
print('HF_HOME =', os.getenv('HF_HOME'))
print('TRANSFORMERS_CACHE =', os.getenv('TRANSFORMERS_CACHE'))


## 6) Optional: Hosted Inference API
- Definition: The Inference API is a managed endpoint for common tasks.
- Why: Zero setup for quick demos or when you lack local GPU resources.

Run inference on HF-hosted models (requires a token; rate limits apply).


In [None]:
try:
    from huggingface_hub import InferenceClient
    if HF_TOKEN:
        client = InferenceClient(model='facebook/bart-large-cnn', token=HF_TOKEN)
        s = client.summarization('''Large Language Models (LLMs) are a transformative technology in artificial intelligence, powering applications like chatbots, text generation, and automated analysis. These models, built on deep learning architectures, excel at understanding and generating human-like text by learning patterns from vast datasets. At their core, LLMs are neural networks trained on billions of words from diverse sources, such as books, websites, and social media, enabling them to capture the nuances of language, from grammar to context.
The foundation of LLMs lies in the transformer architecture, introduced in 2017 with the paper "Attention is All You Need." Transformers use mechanisms like self-attention to process input text, allowing the model to weigh the importance of each word relative to others in a sentence. This enables LLMs to handle long-range dependencies, making them adept at tasks like translation, summarization, and question-answering. Models like BERT, GPT, and Llama have pushed the boundaries of what machines can achieve, with each iteration scaling up in size and capability.
Training an LLM involves feeding it massive text corpora and optimizing billions of parameters using techniques like supervised learning and reinforcement learning. The process is computationally intensive, requiring powerful GPUs or TPUs and significant energy resources. Once trained, LLMs can perform zero-shot or few-shot learning, meaning they can tackle tasks with little to no task-specific training, relying on their prelearned knowledge. For example, an LLM can generate a story from a prompt like "Once upon a time" or classify sentiment in a sentence without explicit retraining.
LLMs are accessible through platforms like the Hugging Face Hub, which hosts pretrained models, datasets, and tools like the transformers library. This democratizes AI, allowing developers to use models like GPT-2 or DistilBERT for tasks such as text classification or generation without building from scratch. However, using LLMs responsibly requires understanding their limitations, including biases in training data, high computational costs, and potential security risks when loading untrusted models.
Applications of LLMs span industries: they power virtual assistants, automate customer support, assist in coding, and enhance research by summarizing complex texts. Yet, challenges remain, including ensuring fairness, reducing environmental impact, and managing the risk of generating misleading information. As LLMs evolve, they promise to reshape how we interact with technology, making it critical to approach their development and use with care.
In summary, LLMs represent a leap forward in AI, driven by transformers and massive datasets. Their ability to process and generate language has broad implications, but careful management is essential to harness their potential effectively.''')
        print(s)
    else:
        print('HF_TOKEN not set; skipping hosted inference.')
except Exception as e:
    print('InferenceClient not available or error:', e)


## Exercise — Model comparison and benchmarking


### Task Description
Create a comprehensive comparison of two sentiment classifiers using the IMDB dataset. Your analysis should include:

### Step 1: Model Selection and Setup

Complete the setup and log the details of the models used.

In [None]:
# TODO import the necessary packages

# Model configurations
models = {
    'distilbert': {
        'name': 'distilbert-base-uncased-finetuned-sst-2-english',
        'revision': 'main',  # Pin for reproducibility
    },
    'bert': {
        'name': 'textattack/bert-base-uncased-SST-2',
        'revision': 'main',  # Pin for reproducibility
    }
}

# Print model architectures and sizes
for model_key, config in models.items():
    model = #TODO
    n_params = sum(p.numel() for p in model.parameters())
    print(f"\n{model_key.upper()} Architecture:")
    print(f"Parameters: {n_params:,}")
    print(f"Label mapping: {model.config.id2label}")

### Step 2: Data Loading and Processing
Complete the dataset loading and preprocessing.
Hint: Use `load_dataset('imdb')` and implement proper text cleaning if needed.
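
The balancing logic itself (separate the classes, draw equally from each, combine, shuffle) can be sketched on plain Python dictionaries before translating it to the `datasets` API. Everything here is a toy stand-in: `toy_records` is made-up data and `balanced_sample` is a hypothetical helper, not part of any library.

```python
import random

def balanced_sample(records, max_samples, seed=42):
    """Draw an equal number of positive and negative records, then shuffle."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    positives = [r for r in records if r["label"] == 1]
    negatives = [r for r in records if r["label"] == 0]
    # Never ask for more than the smaller class can supply.
    per_class = min(max_samples // 2, len(positives), len(negatives))
    chosen = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(chosen)
    return chosen

# Toy data: 10 positive and 20 negative records.
toy_records = [
    {"text": f"review {i}", "label": 1 if i % 3 == 0 else 0}
    for i in range(30)
]

toy_balanced = balanced_sample(toy_records, max_samples=10)
toy_pos_ratio = sum(r["label"] for r in toy_balanced) / len(toy_balanced)
print(len(toy_balanced), f"{toy_pos_ratio:.0%} positive")
```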

In [None]:
from datasets import concatenate_datasets
# Data loading template
def load_and_preprocess_data(split='test', max_samples=1000):
    """Load and preprocess IMDB data"""
    # Load dataset
    dataset = load_dataset('imdb', split=split)

   #TODO Separate positive and negative samples
    

    #TODO Take equal samples from both classes
   
    #TODO  Create a balanced dataset
    balanced_dataset = #TODO

    #TODO  Shuffle the balanced dataset
    balanced_dataset = balanced_dataset.shuffle(seed=42)
    return balanced_dataset

# Load evaluation data
eval_dataset = load_and_preprocess_data(split='test', max_samples=1000)

# Verify the proportion of positive/negative samples
labels = [example['label'] for example in eval_dataset]
pos_ratio = sum(labels) / len(labels)
print(f"Positive samples: {pos_ratio:.2%}, Negative samples: {1 - pos_ratio:.2%}")

### Step 3: Evaluation Function
Reuse the benchmarking code from earlier and adapt it.
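
The evaluation function reports precision, recall, and F1 per class. As a reminder of what those numbers mean, here is a minimal sketch that derives them from true and predicted labels in plain Python. The labels are made up and `binary_metrics` is a hypothetical helper; undefined ratios fall back to 0, mirroring the `zero_division=0` setting used in the cell below.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 for one class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many were right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1-score": f1, "accuracy": accuracy}

# Made-up labels: 1 = positive, 0 = negative.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]

toy_metrics = binary_metrics(toy_true, toy_pred)
for key, value in toy_metrics.items():
    print(f"{key}: {value:.2f}")
```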

In [None]:
def evaluate_model(model_name, dataset, batch_size=32, device='auto'):
    """Evaluate model performance and efficiency"""
    # Determine device
    if device == 'auto':
        device = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')
    
    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        device_map='auto' if device == 'cuda' else None
    )
    if device != 'cuda':
        model.to(device)
    
    # Initialize metrics
    predictions = []
    labels = []
    batch_times = []
    
    # Process batches
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i:i + batch_size]
        batch_start = time.perf_counter()
        
        #TODO  Tokenize and prepare input
        inputs = tokenizer(#TODO)
        
        # Get predictions
        with torch.no_grad():
            outputs = model(**inputs)
            preds = #TODO
        
        # Record metrics
        predictions.extend(preds)
        labels.extend(batch['label'])
        batch_times.append(time.perf_counter() - batch_start)
    
    # Calculate metrics with full classification report
    report = classification_report(
            labels, 
            predictions, 
            output_dict=True,
            zero_division=0,
            target_names=['negative', 'positive']
    )
    
    # Extract all relevant metrics
    metrics = {
        'accuracy': report['accuracy'],
        'negative': report['negative'],  # Contains precision, recall, f1-score
        'positive': report['positive'],  # Contains precision, recall, f1-score
        'avg_time_per_batch': sum(batch_times) / len(batch_times),
        'total_time': sum(batch_times)
    }
    
    return metrics

def print_detailed_results(results):
    """Print detailed results for each model"""
    for model_key, metrics in results.items():
        print(f"\nResults for {model_key.upper()}:")
        print(f"Accuracy: {metrics['accuracy']:.4f}")
        print("Class-wise Metrics:")
        for cls in ['negative', 'positive']:
            print(f"  {cls.capitalize()}:")
            print(f"    Precision: {metrics[cls]['precision']:.4f}")
            print(f"    Recall:    {metrics[cls]['recall']:.4f}")
            print(f"    F1-Score:  {metrics[cls]['f1-score']:.4f}")
        print(f"Average Time per Batch: {metrics['avg_time_per_batch']:.4f} seconds")
        print(f"Total Evaluation Time: {metrics['total_time']:.4f} seconds")

# Test both models
results = {}
for model_key, config in models.items():
    print(f"\nEvaluating {model_key}...")
    metrics = evaluate_model(
        config['name'],
        eval_dataset,
        batch_size=32,
        device='auto'
    )
    results[model_key] = metrics

# Display results
print_detailed_results(results)