# Week 2 — Hugging Face & Transformers: Using Pretrained Models

This notebook walks through the essentials for **using Hugging Face**. It covers:
- Selecting checkpoints and using pipelines
- Manual tokenization and model forward passes
- Generation parameters and devices
- Batching with `datasets`
- Caching, offline mode, and revisions
- Optional: Hosted Inference API

Take the opportunity to experiment: vary the model parameters and observe how they influence the output.


## Setup
Install the core libraries (CPU by default). If you have a GPU, install the appropriate PyTorch build and optionally `bitsandbytes`.

- Definition: An HF token is a personal key for accessing gated/private repos or hosted inference.
- Why: Some models require accepting a license; hosted endpoints need to know who is calling.

- Terminal: `pip install -U transformers datasets huggingface_hub accelerate safetensors`
- GPU (optional): `pip install bitsandbytes` and a CUDA-enabled torch wheel.

Authentication is only needed for gated/private repos or the hosted Inference API. You can either run `huggingface-cli login` or set `HF_TOKEN` in your environment.


In [None]:
import os
import torch
import accelerate
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
)
from datasets import load_dataset

HF_TOKEN = os.getenv('HF_TOKEN') or os.getenv('HUGGINGFACEHUB_API_TOKEN')
DEVICE = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')
DEVICE


## 1) Pipelines: Quick Inference
- Definition: A pipeline bundles the right tokenizer, model, and postprocessing for a task.
- Why: It reduces moving parts so you can confirm the model works before customizing.
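
To make the "postprocessing" step concrete, here is a minimal sketch of how raw classification logits could be turned into the `label`/`score` dictionaries that a text-classification pipeline returns. The logits and the `ID2LABEL` mapping are made-up illustration values, and `postprocess` is a hypothetical helper, not the library's implementation.

```python
import math

# Hypothetical label mapping, shaped like a model config's id2label.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label=ID2LABEL):
    """Pick the most probable class and report it with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Made-up logits, one row per input text.
fake_logits = [[-2.0, 3.0], [1.5, -1.0]]
toy_results = [postprocess(row) for row in fake_logits]
print(toy_results)
```

The real pipeline also handles tokenization, batching, and device placement before this step; the sketch only covers the final conversion.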


In [None]:
# Sentiment analysis (binary SST-2)
sent_clf = pipeline(
    'text-classification',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device_map='auto'
)
sent_clf(['I love data!', 'This is terrible...'])


In [None]:
# Fill-mask
mlm = pipeline('fill-mask', model='bert-base-uncased', device_map='auto')
mlm('Marseille is the real [MASK] of France.')


In [None]:
# Text generation (small model for speed)
gen = pipeline('text-generation', model='gpt2', device_map='auto')
gen('Tell me a joke about the people of Marseille:', max_new_tokens=40, do_sample=True, temperature=0.1)

### Exercise
Use the version of `gpt2` that was committed on Nov 23, 2022, on Hugging Face for the example above.

### Exercise
- Try `zero-shot-classification` with `facebook/bart-large-mnli`.
- Try `summarization` with `facebook/bart-large-cnn` on a paragraph.
- Try `feature-extraction` with `sentence-transformers/all-MiniLM-L6-v2` and compute cosine similarity between two sentences.
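
For the last item, `feature-extraction` typically returns one vector per token, so a common approach is to average the token vectors into a single sentence vector before comparing. The sketch below shows that arithmetic on tiny made-up vectors that stand in for real embeddings; `mean_pool` and `cosine_similarity` are illustrative helper names.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional token embeddings (real models use hundreds of dimensions).
sent_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # two tokens
sent_b = [[1.0, 1.0, 2.0]]                   # one token, same direction as mean(sent_a)
sent_c = [[-1.0, -1.0, -2.0]]                # one token, opposite direction

vec_a, vec_b, vec_c = mean_pool(sent_a), mean_pool(sent_b), mean_pool(sent_c)
print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

With padded batches, the average should only cover positions where the attention mask is 1; this sketch assumes a single unpadded sentence.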


## 2) Manual Tokenization + Model Forward
- Definition: A tokenizer maps text to token IDs and attention masks; a model head is a task-specific layer (e.g., classification).
- Why: Manual control lets you batch, pad, and inspect outputs precisely for downstream evaluation.
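
As a rough illustration of what `padding=True` and `truncation=True` produce, the sketch below pads a batch to its longest sequence and builds the matching attention mask. It uses a toy whitespace tokenizer with a hypothetical vocabulary (`VOCAB`); real tokenizers use subword vocabularies and add special tokens, but the shape of the result is the same.

```python
# Hypothetical toy vocabulary; id 0 is reserved for padding.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "i": 2, "loved": 3, "this": 4, "movie": 5, "boring": 6}

def encode(text):
    """Map each whitespace-separated word to an id, falling back to [UNK]."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]

def pad_batch(texts, max_length=None):
    """Encode texts, optionally truncate, then pad to the longest sequence."""
    encoded = [encode(t) for t in texts]
    if max_length is not None:
        encoded = [ids[:max_length] for ids in encoded]  # truncation
    longest = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        pad = longest - len(ids)
        input_ids.append(ids + [VOCAB["[PAD]"]] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = pad_batch(["I loved this movie", "Boring"])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

The attention mask tells the model which positions to ignore, so padded and unpadded versions of the same sentence give the same prediction.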


In [None]:
model_id = 'distilbert-base-uncased-finetuned-sst-2-english'
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSequenceClassification.from_pretrained(model_id)

texts = [
    'I absolutely loved this movie!',
    'The plot was weak and boring.'
]
batch = tok(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    out = mdl(**batch)

probs = out.logits.softmax(-1)
probs


### Generation with `generate()`
- Definition: Decoding chooses next tokens (sampling vs. beam search).
- Why: Tuning decoding trades off creativity vs. determinism and repetition.
- Key parameters: `max_new_tokens`, `temperature`, `top_p`, `top_k`, `num_beams`, `repetition_penalty`.
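
To see what `temperature` and `top_p` do, here is a minimal sketch of temperature scaling and nucleus (top-p) filtering on made-up logits for a four-token vocabulary. The helper names are illustrative and the code is a simplified stand-in for what `generate()` does internally, not its actual implementation.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_next(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id from the filtered, renormalized distribution."""
    candidates = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores
for t in (0.1, 0.7, 1.5):
    print(t, [round(p, 3) for p in softmax(toy_logits, t)])
print(top_p_filter(softmax(toy_logits), top_p=0.9))
print(sample_next(toy_logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Low temperatures concentrate probability on the top token (close to greedy decoding), while higher temperatures flatten the distribution; `top_p` then removes the unlikely tail before sampling.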


In [None]:
lm_id = 'gpt2'  # small demo model
tok_lm = AutoTokenizer.from_pretrained(lm_id)
lm = AutoModelForCausalLM.from_pretrained(lm_id)
inputs = tok_lm('Tell me a joke about the people of Marseille', return_tensors='pt')
out = lm.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9)
print(tok_lm.decode(out[0], skip_special_tokens=True))


## 3) Devices and Memory
- Definition: `device_map="auto"` automatically places layers across CPU/GPU/MPS; `torch_dtype` sets numeric precision (recent `transformers` releases also accept this argument as `dtype`); quantization loads 8/4-bit weights.
- Why: Fit models in memory and run them faster on your hardware.
- How: Use `device_map="auto"` to place weights on available accelerators. Reduce memory via `torch_dtype=torch.float16` or 8/4-bit loading (requires `bitsandbytes`).
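
A quick back-of-the-envelope calculation shows why precision matters: weight memory is roughly the parameter count times the bytes per parameter. The parameter counts below are rounded approximations used as assumptions for illustration, and the estimate covers weights only (no activations, and no extra constants that quantized formats store).

```python
# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}

def weight_memory_mb(num_params, dtype):
    """Approximate memory for the weights alone, in MB (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6

# Rounded parameter counts (assumptions for illustration).
MODEL_SIZES = {"distilbert (~66M params)": 66_000_000, "gpt2 (~124M params)": 124_000_000}

for model_name, n_params in MODEL_SIZES.items():
    row = ", ".join(f"{dt}: {weight_memory_mb(n_params, dt):.0f} MB" for dt in BYTES_PER_PARAM)
    print(f"{model_name} -> {row}")
```

Halving the precision halves the weight memory, which is the main reason float16 and 8/4-bit loading let larger models fit on a given device.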

### Experiment: benchmark dtype & device
We compare runtime and memory across the available devices and dtypes. On CPU, float16 is often slow and some operations may not be implemented for it, so we skip that combination.


In [None]:
import time, torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = 'distilbert-base-uncased-finetuned-sst-2-english'
texts = ['STUDYING DATA SCIENCE IS FUN!'] * 16

devices = []
if torch.cuda.is_available(): devices.append('cuda')
if getattr(torch.backends, 'mps', None) and torch.backends.mps.is_available():
    devices.append('mps')
devices.append('cpu')

rows = []
for device in devices:
    for dtype in (torch.float32, torch.float16):
        if device == 'cpu' and dtype is torch.float16:
            continue
        tok = AutoTokenizer.from_pretrained(name)
        mdl = AutoModelForSequenceClassification.from_pretrained(name, dtype=dtype)
        mdl.to(device)
        batch = tok(texts, padding=True, truncation=True, return_tensors='pt')
        batch = {k: v.to(device) for k, v in batch.items()}
        # Estimate parameter memory
        param_mb = sum(p.numel() * p.element_size() for p in mdl.parameters()) / 1e6
        # Optional: GPU peak memory
        if device == 'cuda':
            torch.cuda.reset_peak_memory_stats()
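        # Warmup + timed run. CUDA kernels launch asynchronously, so synchronize
        # before reading the clock; otherwise only the kernel launches are timed.
        with torch.no_grad(): mdl(**batch)
        if device == 'cuda': torch.cuda.synchronize()
        t0 = time.perf_counter()
        with torch.no_grad():
            for _ in range(10): mdl(**batch)
        if device == 'cuda': torch.cuda.synchronize()
        dt = time.perf_counter() - t0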
        peak_mb = None
        if device == 'cuda':
            peak_mb = torch.cuda.max_memory_allocated() / 1e6
        rows.append((device, str(dtype).replace('torch.', ''), round(param_mb,1), round(dt,3), None if peak_mb is None else round(peak_mb,1)))

print('device\tdtype\tparamMB\tsec(10 iters)\tpeakMB(cuda)')
for r in rows:
    print('\t'.join(map(str, r)))


## 4) Understanding Batch Processing with Hugging Face Transformers
The following code processes a labeled text dataset in batches with Hugging Face's `transformers` library and records time, memory, and accuracy for several batch sizes.
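
One detail worth keeping in mind when reading the results: the cell in this section reports `avg_accuracy` as the mean of per-batch accuracies, which differs from the overall accuracy whenever the last batch is smaller than the others. The sketch below illustrates the effect on made-up labels and predictions; `iter_batches` and `accuracy` are illustrative helpers.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def accuracy(labels, preds):
    """Fraction of positions where the prediction matches the label."""
    return sum(int(l == p) for l, p in zip(labels, preds)) / len(labels)

# Made-up data: 10 examples with batch size 4 give batches of 4, 4, and 2.
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
toy_preds = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

label_batches = list(iter_batches(toy_labels, 4))
pred_batches = list(iter_batches(toy_preds, 4))
per_batch = [accuracy(l, p) for l, p in zip(label_batches, pred_batches)]

mean_of_batches = sum(per_batch) / len(per_batch)  # every batch counts equally
overall = accuracy(toy_labels, toy_preds)          # every example counts equally
print([len(b) for b in label_batches], per_batch)
print(mean_of_batches, overall)
```

Here the two errors fall in the short final batch, so the unweighted mean understates accuracy. Computing accuracy once over all predictions avoids this and makes results comparable across batch sizes.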

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import time
import numpy as np
from tqdm.auto import tqdm

import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
import psutil  # for RAM monitoring

def measure_memory_cpu():
    """Get current memory usage in MB"""
    process = psutil.Process()
    return process.memory_info().rss / 1024 / 1024

def process_with_detailed_metrics(dataset, batch_size):
    """Process with detailed per-batch metrics"""
    t0 = time.perf_counter()
    predictions = []
    memory_usage = []
    batch_times = []
    batch_accuracies = []
    
    for i in tqdm(range(0, len(dataset), batch_size)):
        batch = dataset[i:i + batch_size]
        batch_t0 = time.perf_counter()
        
        # Record memory before batch
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
        mem_before = measure_memory_cpu()
        
        # Process batch
        inputs = tokenizer(batch['sentence'], 
                         padding=True,
                         truncation=True, 
                         return_tensors='pt').to(model.device)
        
        with torch.no_grad():
            output = model(**inputs)
            preds = output.logits.argmax(-1).cpu().tolist()
            predictions.extend(preds)
        
        # Record metrics
        batch_time = time.perf_counter() - batch_t0
        batch_times.append(batch_time)
        
        # Memory delta
        mem_after = measure_memory_cpu()
        memory_delta = mem_after - mem_before
        if torch.cuda.is_available():
            memory_delta += torch.cuda.max_memory_allocated() / 1024 / 1024
        memory_usage.append(memory_delta)
        
        # Batch accuracy
        batch_acc = accuracy_score(batch['label'], preds)
        batch_accuracies.append(batch_acc)
    
    total_time = time.perf_counter() - t0
    
    return {
        'predictions': predictions,
        'total_time': total_time,
        'batch_times': batch_times,
        'memory_usage': memory_usage,
        'batch_accuracies': batch_accuracies,
        'avg_batch_time': np.mean(batch_times),
        'avg_memory': np.mean(memory_usage),
        'avg_accuracy': np.mean(batch_accuracies)
    }

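# Data and model setup. Assumption: GLUE SST-2 validation, whose `sentence` and
# `label` columns match the names used in process_with_detailed_metrics, and the
# SST-2 classifier from the earlier sections.
ds = load_dataset('glue', 'sst2', split='validation')
model_id = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(model_id)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForSequenceClassification.from_pretrained(model_id).to(device).eval()

# Test different batch sizes
batch_sizes = [1, 8, 32, 64]
results = {}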

for bs in batch_sizes:
    print(f"\nProcessing with batch_size={bs}")
    results[bs] = process_with_detailed_metrics(ds, bs)

# Print summary
print("\nDetailed Comparison:")
print(f"{'Batch Size':>10} {'Memory (MB)':>12} {'Time (s)':>10} {'Accuracy':>10}")
print("-" * 45)
for bs in batch_sizes:
    r = results[bs]
    print(f"{bs:>10} {r['avg_memory']:>12.1f} {r['avg_batch_time']:>10.3f} {r['avg_accuracy']:>10.3f}")

## 5) Revisions, Caching, and Offline
- Definition: A checkpoint is a released model; a revision is an exact commit/tag.
- Why: Pinning revisions and controlling caches ensures reproducibility on different machines.
- Pin exact versions with `revision=` when calling `from_pretrained`.
- Set the cache directory with `HF_HOME` (preferred); `TRANSFORMERS_CACHE` is an older variable that recent `transformers` releases deprecate.
- Force offline mode with `HF_HUB_OFFLINE=1` (uses only local cache).
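
To check where downloads will land, the sketch below mirrors the documented lookup order for the Hub cache: `HF_HUB_CACHE` if set, otherwise `$HF_HOME/hub`, otherwise `~/.cache/huggingface/hub`. `resolve_hub_cache` is a hypothetical helper written for illustration (it ignores `XDG_CACHE_HOME`), not a function of the library, and `/data/hf` is a made-up path.

```python
import os

def resolve_hub_cache(env):
    """Return the Hub cache directory implied by a mapping of environment variables."""
    if env.get("HF_HUB_CACHE"):
        return env["HF_HUB_CACHE"]
    default_home = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
    hf_home = env.get("HF_HOME") or default_home
    return os.path.join(hf_home, "hub")

print(resolve_hub_cache(os.environ))             # what this machine would use
print(resolve_hub_cache({"HF_HOME": "/data/hf"}))  # hypothetical override
```

These variables, including `HF_HUB_OFFLINE`, are read from the environment when the libraries are imported, so to be safe set them before importing `transformers` or `huggingface_hub` (for example in the shell, or at the very top of the notebook).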


In [None]:
# Example: pinning a revision. 'main' is a moving branch, so replace it with a
# commit SHA or tag to actually pin a version.
tok_pinned = AutoTokenizer.from_pretrained('bert-base-uncased', revision='main')
mdl_pinned = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english', revision='main')

# Where is the cache? (None means the variable is unset and the default location is used)
print('HF_HOME =', os.getenv('HF_HOME'))
print('TRANSFORMERS_CACHE =', os.getenv('TRANSFORMERS_CACHE'))


## 6) Optional: Hosted Inference API
- Definition: The Inference API is a managed endpoint for common tasks.
- Why: Zero setup for quick demos or when you lack local GPU resources.
- How: Run inference on HF-hosted models (requires a token; rate limits apply).


In [None]:
try:
    from huggingface_hub import InferenceClient
    if HF_TOKEN:
        client = InferenceClient(model='facebook/bart-large-cnn', token=HF_TOKEN)
        s = client.summarization('''Large Language Models (LLMs) are a transformative technology in artificial intelligence, powering applications like chatbots, text generation, and automated analysis. These models, built on deep learning architectures, excel at understanding and generating human-like text by learning patterns from vast datasets. At their core, LLMs are neural networks trained on billions of words from diverse sources, such as books, websites, and social media, enabling them to capture the nuances of language, from grammar to context.
The foundation of LLMs lies in the transformer architecture, introduced in 2017 with the paper "Attention is All You Need." Transformers use mechanisms like self-attention to process input text, allowing the model to weigh the importance of each word relative to others in a sentence. This enables LLMs to handle long-range dependencies, making them adept at tasks like translation, summarization, and question-answering. Models like BERT, GPT, and Llama have pushed the boundaries of what machines can achieve, with each iteration scaling up in size and capability.
Training an LLM involves feeding it massive text corpora and optimizing billions of parameters using techniques like supervised learning and reinforcement learning. The process is computationally intensive, requiring powerful GPUs or TPUs and significant energy resources. Once trained, LLMs can perform zero-shot or few-shot learning, meaning they can tackle tasks with little to no task-specific training, relying on their prelearned knowledge. For example, an LLM can generate a story from a prompt like "Once upon a time" or classify sentiment in a sentence without explicit retraining.
LLMs are accessible through platforms like the Hugging Face Hub, which hosts pretrained models, datasets, and tools like the transformers library. This democratizes AI, allowing developers to use models like GPT-2 or DistilBERT for tasks such as text classification or generation without building from scratch. However, using LLMs responsibly requires understanding their limitations, including biases in training data, high computational costs, and potential security risks when loading untrusted models.
Applications of LLMs span industries: they power virtual assistants, automate customer support, assist in coding, and enhance research by summarizing complex texts. Yet, challenges remain, including ensuring fairness, reducing environmental impact, and managing the risk of generating misleading information. As LLMs evolve, they promise to reshape how we interact with technology, making it critical to approach their development and use with care.
In summary, LLMs represent a leap forward in AI, driven by transformers and massive datasets. Their ability to process and generate language has broad implications, but careful management is essential to harness their potential effectively.''')
        print(s)
    else:
        print('HF_TOKEN not set; skipping hosted inference.')
except Exception as e:
    print('InferenceClient not available or error:', e)


## Exercise — Model comparison and benchmarking


### Task Description
Create a comprehensive comparison of two sentiment classifiers using the IMDB dataset. Your analysis should include:

#### Model Selection
- Compare `distilbert-base-uncased-finetuned-sst-2-english` and `textattack/bert-base-uncased-SST-2`
- Document the model architectures and sizes

#### Implementation Requirements
- Use the Hugging Face datasets library to load IMDB data
- Implement batch processing for memory efficiency
- Include proper error handling and device management
- Record and compare inference times

#### Evaluation Metrics
- Calculate and compare accuracy scores
- Generate classification reports
- Record processing time per batch
- Document memory usage where applicable

#### Technical Requirements
- Pin model revisions for reproducibility 
- Record label mappings from `config.id2label`
- Use appropriate batch sizes (suggest starting with 32)
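
Recording label mappings matters because checkpoints name their classes differently: some use descriptive names, others generic ones such as `LABEL_0`. The sketch below is one possible way to translate a model's `id2label` into the dataset's integer labels (IMDB uses 0 = negative, 1 = positive). The example dictionaries are hypothetical stand-ins for real configs, and treating `LABEL_0` as negative is an assumption you should verify for each checkpoint.

```python
def build_label_normalizer(id2label, canonical):
    """Map each model class index to a dataset integer label via the label name."""
    mapping = {}
    for idx, label_name in id2label.items():
        key = label_name.strip().lower()
        if key not in canonical:
            raise ValueError(f"Unknown label name {label_name!r}; extend `canonical` explicitly")
        mapping[int(idx)] = canonical[key]
    return mapping

# Dataset convention: 0 = negative, 1 = positive.
# The generic-name entries are an assumption to verify per checkpoint.
CANONICAL = {"negative": 0, "positive": 1, "label_0": 0, "label_1": 1}

named_config = {0: "NEGATIVE", 1: "POSITIVE"}   # hypothetical id2label
generic_config = {0: "LABEL_0", 1: "LABEL_1"}   # hypothetical id2label

print(build_label_normalizer(named_config, CANONICAL))
print(build_label_normalizer(generic_config, CANONICAL))
```

Failing loudly on an unknown label name is deliberate: a silently wrong mapping would invert the accuracy without raising any error.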

#### Deliverables
- Working code implementation
- Performance comparison table
- Brief analysis of tradeoffs (speed vs. accuracy)
- Documentation of label mappings and any data preprocessing

#### Bonus Tasks
- Experiment with different dataset sizes
- Add visualization of results
- Compare memory usage across models
- Analyze misclassified examples
