# Week 2 — Hugging Face & Transformers: Using Pretrained Models

This notebook walks through the essentials of **using Hugging Face**. It covers:
- Selecting checkpoints and using pipelines
- Manual tokenization and model forward passes
- Generation parameters and devices
- Batching with `datasets`
- Caching, offline mode, and revisions
- Optional: Hosted Inference API

Take the opportunity to experiment: vary the model parameters and observe how each one influences the output.


## Setup
Install the core libraries (CPU by default). If you have a GPU, install the appropriate PyTorch build and optionally `bitsandbytes`.

- Definition: An HF token is a personal key for accessing gated/private repos or hosted inference.
- Why: Some models require accepting a license; hosted endpoints need to know who is calling.

- Terminal: `pip install -U torch transformers datasets huggingface_hub accelerate safetensors scikit-learn` (the cells below also import `torch` and `sklearn`)
- GPU (optional): `pip install bitsandbytes` and a CUDA-enabled torch wheel.

Authentication is only needed for gated/private repos or the hosted Inference API. You can either run `huggingface-cli login` or set `HF_TOKEN` in your environment.


In [12]:
import os
import torch
import accelerate
from transformers import (
    pipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
)
from datasets import load_dataset

HF_TOKEN = os.getenv('HF_TOKEN') or os.getenv('HUGGINGFACEHUB_API_TOKEN')
DEVICE = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')
DEVICE


'cpu'

## 1) Pipelines: Quick Inference
- Definition: A pipeline bundles the right tokenizer, model, and postprocessing for a task.
- Why: It reduces moving parts so you can confirm the model works before customizing.
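
To make the "bundle" idea concrete, here is a minimal plain-Python sketch of the three stages a pipeline chains together: preprocessing, a forward pass, and postprocessing. `ToyPipeline` and `keyword_score` are hypothetical names invented for this illustration, and the keyword counter only stands in for a real model; this is not how `transformers` implements pipelines internally.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToyPipeline:
    """Hypothetical stand-in that chains the three stages a pipeline bundles."""
    tokenize: Callable[[str], List[str]]        # text -> tokens
    score: Callable[[List[str]], List[float]]   # tokens -> one logit per label
    labels: List[str]                           # index -> label name

    def __call__(self, text: str) -> Dict[str, object]:
        tokens = self.tokenize(text)            # 1) preprocessing
        logits = self.score(tokens)             # 2) "model" forward pass
        top = max(logits)                       # 3) postprocessing: softmax + argmax
        exps = [math.exp(x - top) for x in logits]
        probs = [e / sum(exps) for e in exps]
        best = probs.index(max(probs))
        return {'label': self.labels[best], 'score': probs[best]}


def keyword_score(tokens: List[str]) -> List[float]:
    # Toy scorer: counts negative and positive keywords (not a real model).
    neg = sum(t in {'terrible', 'boring', 'weak'} for t in tokens)
    pos = sum(t in {'love', 'great', 'fun'} for t in tokens)
    return [float(neg), float(pos)]


toy = ToyPipeline(
    tokenize=lambda s: s.lower().replace('!', ' ').replace('.', ' ').split(),
    score=keyword_score,
    labels=['NEGATIVE', 'POSITIVE'],
)
print(toy('I love data!'))
print(toy('This is terrible...'))
```

The returned dictionaries mirror the `label`/`score` shape that the real sentiment pipeline prints further down.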


In [13]:
# Sentiment analysis (binary SST-2)
sent_clf = pipeline(
    'text-classification',
    model='distilbert-base-uncased-finetuned-sst-2-english',
    device_map='auto'
)
sent_clf(['I love data!', 'This is terrible...'])


Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998714923858643},
 {'label': 'NEGATIVE', 'score': 0.9997186064720154}]

In [14]:
# Fill-mask
mlm = pipeline('fill-mask', model='bert-base-uncased', device_map='auto')
mlm('Paris is the [MASK] of France.')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[{'score': 0.9969370365142822,
  'token': 3007,
  'token_str': 'capital',
  'sequence': 'paris is the capital of france.'},
 {'score': 0.0005914855282753706,
  'token': 2540,
  'token_str': 'heart',
  'sequence': 'paris is the heart of france.'},
 {'score': 0.0004378748417366296,
  'token': 2415,
  'token_str': 'center',
  'sequence': 'paris is the center of france.'},
 {'score': 0.0003378352848812938,
  'token': 2803,
  'token_str': 'centre',
  'sequence': 'paris is the centre of france.'},
 {'score': 0.00026995810912922025,
  'token': 2103,
  'token_str': 'city',
  'sequence': 'paris is the city of france.'}]

In [15]:
# Text generation (small model for speed)
gen = pipeline('text-generation', model='gpt2', device_map='auto')
gen('Once upon a time', max_new_tokens=40, do_sample=True, temperature=0.8)

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Once upon a time, it was thought that even at the age of five, it would be hard to see a girl as attractive as you. But then, at thirteen, you were able to get to know her for'}]

### Exercise
Use the version of `gpt2` that was committed to the Hugging Face Hub on Nov 23, 2022, for the example above.

In [16]:
# Text generation with a specific revision of GPT-2
gen = pipeline('text-generation',
               model='gpt2',
               revision='b5a36b5b5c5b5a5a5e5d5c5b5a59585756555453',  # placeholder: replace with the real commit hash from 2022-11-23
               device_map='auto')
gen('Once upon a time', max_new_tokens=40, do_sample=True, temperature=0.8)

ValueError: Could not load model gpt2 with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,). See the original errors:

while loading with AutoModelForCausalLM, an error is thrown:
Traceback (most recent call last):
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 293, in infer_framework_load_model
    model = model_class.from_pretrained(model, **kwargs)
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
    return model_class.from_pretrained(
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
    return func(*args, **kwargs)
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4900, in from_pretrained
    checkpoint_files, sharded_metadata = _get_resolved_checkpoint_files(
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1148, in _get_resolved_checkpoint_files
    raise OSError(
OSError: gpt2 does not appear to have a file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt or flax_model.msgpack.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 311, in infer_framework_load_model
    model = model_class.from_pretrained(model, **fp32_kwargs)
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 604, in from_pretrained
    return model_class.from_pretrained(
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
    return func(*args, **kwargs)
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4900, in from_pretrained
    checkpoint_files, sharded_metadata = _get_resolved_checkpoint_files(
  File "/home/administrateur/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1148, in _get_resolved_checkpoint_files
    raise OSError(
OSError: gpt2 does not appear to have a file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt or flax_model.msgpack.




### Exercise
- Try `zero-shot-classification` with `facebook/bart-large-mnli`.
- Try `summarization` with `facebook/bart-large-cnn` on a paragraph.
- Try `feature-extraction` with `sentence-transformers/all-MiniLM-L6-v2` and compute cosine similarity between two sentences.
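
As a hint for the last bullet, the two numerical steps (averaging token vectors into one sentence vector, then comparing two sentence vectors) can be sketched without any library. The token vectors below are made up and only three-dimensional; real `all-MiniLM-L6-v2` embeddings have 384 dimensions. `mean_pool` and `cosine` are illustrative helper names.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average per-token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors for three "sentences".
sent_a = mean_pool([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # averages to [0.5, 0.5, 0.0]
sent_b = mean_pool([[1.0, 1.0, 0.0]])
sent_c = mean_pool([[0.0, 0.0, 2.0]])

print(round(cosine(sent_a, sent_b), 3))  # same direction, so close to 1
print(round(cosine(sent_a, sent_c), 3))  # orthogonal, so 0
```

Cosine similarity ignores vector length, which is why `sent_a` and `sent_b` score as identical even though their magnitudes differ.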


In [None]:
# 1. Zero-shot classification
zero_shot = pipeline('zero-shot-classification', 
                    model='facebook/bart-large-mnli',
                    device_map='auto')

candidate_labels = ['technology', 'sports', 'politics', 'entertainment']
text = "The new smartphone has amazing features and incredible battery life."
result = zero_shot(text, candidate_labels)
print("Zero-shot classification:")
print(f"Text: {text}")
print(f"Predicted label: {result['labels'][0]} (confidence: {result['scores'][0]:.3f})")
print()

# 2. Summarization
summarizer = pipeline('summarization',
                     model='facebook/bart-large-cnn',
                     device_map='auto')

article = '''
Artificial intelligence is transforming many aspects of our daily lives. From virtual assistants 
like Siri and Alexa to recommendation systems on Netflix and Amazon, AI algorithms are becoming 
increasingly sophisticated. Machine learning models can now recognize images, translate languages, 
and even generate human-like text. However, these advancements also raise important ethical 
questions about privacy, bias, and job displacement that society must address.
'''

summary = summarizer(article, max_length=100, min_length=30, do_sample=False)
print("Summarization:")
print(f"Original length: {len(article)} characters")
print(f"Summary: {summary[0]['summary_text']}")
print()

# 3. Feature extraction and cosine similarity
feature_extractor = pipeline('feature-extraction',
                           model='sentence-transformers/all-MiniLM-L6-v2',
                           device_map='auto')

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

sentences = [
    "The weather is beautiful today",
    "It's a lovely sunny day",
    "I enjoy studying mathematics"
]

# Extract the features
features = feature_extractor(sentences)

# Compute cosine similarity
def get_sentence_embedding(features):
    # Average over the tokens to obtain the sentence embedding
    return np.mean(features[0], axis=0)

embeddings = [get_sentence_embedding(feat) for feat in features]

print("Cosine similarities:")
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        sim = cosine_similarity([embeddings[i]], [embeddings[j]])[0][0]
        print(f"'{sentences[i][:20]}...' vs '{sentences[j][:20]}...': {sim:.3f}")


## 2) Manual Tokenization + Model Forward
- Definition: A tokenizer maps text to token IDs and attention masks; a model head is a task-specific layer (e.g., classification).
- Why: Manual control lets you batch, pad, and inspect outputs precisely for downstream evaluation.
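
The following sketch illustrates what "token IDs and attention masks" look like after padding. It uses a hypothetical seven-entry vocabulary and whitespace splitting; real tokenizers use learned subword vocabularies and add special tokens, so treat `VOCAB` and `toy_tokenize` as illustrative only.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; real tokenizers use learned subword vocabularies.
VOCAB = {'[PAD]': 0, '[UNK]': 1, 'i': 2, 'loved': 3, 'this': 4, 'movie': 5, 'boring': 6}


def toy_tokenize(texts: List[str]) -> Dict[str, List[List[int]]]:
    """Map whitespace-split words to IDs, then pad every row to the longest one."""
    rows = [[VOCAB.get(w, VOCAB['[UNK]']) for w in t.lower().split()] for t in texts]
    width = max(len(r) for r in rows)
    input_ids = [r + [VOCAB['[PAD]']] * (width - len(r)) for r in rows]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(r) + [0] * (width - len(r)) for r in rows]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


toy_batch = toy_tokenize(['I loved this movie', 'Boring plot'])
print(toy_batch['input_ids'])       # [[2, 3, 4, 5], [6, 1, 0, 0]]
print(toy_batch['attention_mask'])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The unknown word `plot` maps to `[UNK]`, and the shorter row is right-padded so both rows can be stacked into one rectangular tensor.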


In [5]:
model_id = 'distilbert-base-uncased-finetuned-sst-2-english'
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSequenceClassification.from_pretrained(model_id)

texts = [
    'I absolutely loved this movie!',
    'The plot was weak and boring.'
]
batch = tok(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    out = mdl(**batch)

probs = out.logits.softmax(-1)
probs


tensor([[1.2140e-04, 9.9988e-01],
        [9.9980e-01, 2.0337e-04]])
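
To read these probabilities as labels, each row's largest entry is mapped through the model's label table. The sketch below assumes the label order `{0: 'NEGATIVE', 1: 'POSITIVE'}` for this checkpoint; confirm it with `mdl.config.id2label` before relying on it. The `softmax` helper shows the same normalization that `out.logits.softmax(-1)` applies, written in plain Python.

```python
import math
from typing import List

# Assumed label order for this checkpoint; confirm with `mdl.config.id2label`.
ID2LABEL = {0: 'NEGATIVE', 1: 'POSITIVE'}


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_labels(prob_rows: List[List[float]]) -> List[str]:
    """Pick the index of the largest probability in each row and map it to a label."""
    return [ID2LABEL[row.index(max(row))] for row in prob_rows]


# Probabilities copied from the tensor printed above.
prob_rows = [[1.2140e-04, 9.9988e-01], [9.9980e-01, 2.0337e-04]]
print(to_labels(prob_rows))  # ['POSITIVE', 'NEGATIVE']
print(softmax([0.0, 0.0]))   # [0.5, 0.5]
```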

### Generation with `generate()`
- Definition: Decoding chooses next tokens (sampling vs. beam search).
- Why: Tuning decoding trades off creativity vs. determinism and repetition.

Key parameters: `max_new_tokens`, `temperature`, `top_p`, `top_k`, `num_beams`, `repetition_penalty`.
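
A plain-Python sketch of what `temperature` and `top_p` do to a next-token distribution may help when varying these parameters. The logits below are made up, and `apply_temperature` and `top_p_filter` are illustrative helpers rather than the library's implementation, which works on tensors and handles more edge cases.

```python
import math
import random
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits / temperature; lower values sharpen the distribution."""
    scaled = {tok: x / temperature for tok, x in logits.items()}
    top = max(scaled.values())
    exps = {tok: math.exp(x - top) for tok, x in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    kept, mass = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}


# Made-up next-token logits for illustration.
toy_logits = {'the': 2.0, 'a': 1.0, 'useful': 0.5, 'purple': -1.0}
warm = apply_temperature(toy_logits, 1.0)
cool = apply_temperature(toy_logits, 0.5)
nucleus = top_p_filter(warm, 0.9)

rng = random.Random(0)  # seeded so the draw is repeatable
token = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(sorted(nucleus), token)
```

With these numbers, lowering the temperature from 1.0 to 0.5 raises the probability of `the`, and `top_p=0.9` removes the unlikely token `purple` before sampling.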


In [6]:
lm_id = 'gpt2'  # small demo model
tok_lm = AutoTokenizer.from_pretrained(lm_id)
lm = AutoModelForCausalLM.from_pretrained(lm_id)
inputs = tok_lm('In data science, transformers are', return_tensors='pt')
out = lm.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9)
print(tok_lm.decode(out[0], skip_special_tokens=True))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In data science, transformers are the most common form of data manipulation, and have been used for centuries in many fields. They can be used to manipulate the data of many different types of applications.

Transformers are a popular tool for data analysis, but they are not the only tool for data manipulation. In fact, the most common use of


## 3) Devices and Memory
- Definition: `device_map="auto"` automatically places layers across CPU/GPU/MPS; `torch_dtype` sets numeric precision; quantization loads 8/4-bit weights.
- Why: Fit models in memory and run them faster on your hardware.

Use `device_map="auto"` to place weights on available accelerators. Reduce memory via `torch_dtype=torch.float16` or 8/4-bit loading (requires `bitsandbytes`).

### Experiment: benchmark dtype & device
We compare runtime and memory across supported devices/dtypes. On CPU, float16 compute is often slow or unsupported for some operators, depending on the PyTorch version, so we skip it.


In [7]:
import time, torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = 'distilbert-base-uncased-finetuned-sst-2-english'
texts = ['STUDYING DATA SCIENCE IS FUN!'] * 16

devices = []
if torch.cuda.is_available(): devices.append('cuda')
if getattr(torch.backends, 'mps', None) and torch.backends.mps.is_available():
    devices.append('mps')
devices.append('cpu')

rows = []
for device in devices:
    for dtype in (torch.float32, torch.float16):
        if device == 'cpu' and dtype is torch.float16:
            continue
        tok = AutoTokenizer.from_pretrained(name)
        mdl = AutoModelForSequenceClassification.from_pretrained(name, dtype=dtype)
        mdl.to(device)
        batch = tok(texts, padding=True, truncation=True, return_tensors='pt')
        batch = {k: v.to(device) for k, v in batch.items()}
        # Estimate parameter memory
        param_mb = sum(p.numel() * p.element_size() for p in mdl.parameters()) / 1e6
        # Optional: GPU peak memory
        if device == 'cuda':
            torch.cuda.reset_peak_memory_stats()
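        # Warmup + timed run; CUDA kernels run asynchronously, so synchronize
        # before reading the clock to make sure all queued work is counted
        with torch.no_grad(): mdl(**batch)
        if device == 'cuda': torch.cuda.synchronize()
        t0 = time.perf_counter()
        with torch.no_grad():
            for _ in range(10): mdl(**batch)
        if device == 'cuda': torch.cuda.synchronize()
        dt = time.perf_counter() - t0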
        peak_mb = None
        if device == 'cuda':
            peak_mb = torch.cuda.max_memory_allocated() / 1e6
        rows.append((device, str(dtype).replace('torch.', ''), round(param_mb,1), round(dt,3), None if peak_mb is None else round(peak_mb,1)))

print('device	dtype	paramMB	sec(10 iters)	peakMB(cuda)')
for r in rows:
    print('	'.join(map(str, r)))


device	dtype	paramMB	sec(10 iters)	peakMB(cuda)
cpu	float32	267.8	1.157	None
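
As a rough cross-check of the `paramMB` column, weight memory scales with bytes per parameter. The sketch below back-computes the parameter count from the float32 row above and estimates the other formats; it covers weights only, so activations and the extra scale data that quantized formats store are not included. `BYTES_PER_PARAM` and `weight_mb` are illustrative names.

```python
# Approximate bytes per parameter for common weight formats (weights only).
BYTES_PER_PARAM = {'float32': 4.0, 'float16': 2.0, 'int8': 1.0, 'int4': 0.5}

# Parameter count back-computed from the float32 row above: 267.8 MB / 4 bytes.
n_params = round(267.8e6 / 4)


def weight_mb(num_params: int, dtype_name: str) -> float:
    """Weight storage in MB (1 MB = 1e6 bytes, as in the benchmark)."""
    return num_params * BYTES_PER_PARAM[dtype_name] / 1e6


for dtype_name in BYTES_PER_PARAM:
    print(f'{dtype_name}\t{weight_mb(n_params, dtype_name):.1f} MB')
```

Halving the precision halves the weight footprint, which is the main reason float16 and 8/4-bit loading let larger models fit on the same hardware.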


## 4) Understanding Batch Processing with Hugging Face Transformers
The following code loads a 2% slice of the IMDB test split, tokenizes it with `Dataset.map`, and runs batched predictions with a sentiment classifier.
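
Before the real code, a simplified plain-Python sketch of the batching idea: a batched map slices every column into chunks, passes each chunk to a function as a dictionary of lists, and collects the columns the function returns. `iter_batches` and `map_batched` are hypothetical helpers written for this illustration; the `datasets` library does considerably more (caching, Arrow storage, multiprocessing).

```python
from typing import Callable, Dict, Iterator, List

Columns = Dict[str, List]


def iter_batches(columns: Columns, batch_size: int) -> Iterator[Columns]:
    """Yield consecutive slices of every column, batch_size rows at a time."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        yield {k: v[start:start + batch_size] for k, v in columns.items()}


def map_batched(columns: Columns, fn: Callable[[Columns], Columns], batch_size: int) -> Columns:
    """Apply fn to each batch and merge the columns it returns (simplified)."""
    new_cols: Columns = {}
    for chunk in iter_batches(columns, batch_size):
        for key, values in fn(chunk).items():
            new_cols.setdefault(key, []).extend(values)
    return {**columns, **new_cols}


data = {'text': ['good', 'bad', 'good film', 'bad plot', 'fine']}
result = map_batched(data, lambda b: {'n_chars': [len(t) for t in b['text']]}, batch_size=2)
print(result['n_chars'])  # [4, 3, 9, 8, 4]
```

The function passed in must return lists with one entry per row of the batch, which is the same contract that `preprocess` and `predict` follow below.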

In [8]:
# Load a dataset 
ds = load_dataset('imdb', split='test[:2%]')
# Tokenizer and model for sequence classification
tok_cls = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
mdl_cls = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english', device_map='auto')

# Preprocessing function to tokenize text
def preprocess(batch):
    return tok_cls(batch['text'], truncation=True, padding=True)

# Tokenize the entire dataset
ds_tok = ds.map(preprocess, batched=True)

# Prediction function to run the model and get predictions
def predict(batch):
    # Extract relevant keys and convert to tensors
    keep = {k: batch[k] for k in ['input_ids', 'attention_mask'] if k in batch}
    # Move tensors to the same device as the model
    tens = {k: torch.tensor(v).to(mdl_cls.device) for k, v in keep.items()}
    
    with torch.no_grad():
        outputs = mdl_cls(**tens)
        # Move predictions back to CPU for dataset storage
        predictions = outputs.logits.argmax(-1).cpu().tolist()
    
    return {'pred': predictions}

# Run predictions with smaller batch size for stability
preds = ds_tok.map(predict, batched=True, batch_size=8)
preds[:5]


{'text': ['I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as

## 5) Revisions, Caching, and Offline
- Definition: A checkpoint is a released model; a revision is an exact commit/tag.
- Why: Pinning revisions and controlling caches ensures reproducibility on different machines.
- Pin exact versions with `revision=` when calling `from_pretrained`.
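- Set cache directories with `HF_HOME` or `TRANSFORMERS_CACHE` (the latter is an older variable that recent `transformers` releases deprecate in favor of `HF_HOME`).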
- Force offline mode with `HF_HUB_OFFLINE=1` (uses only local cache).
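
A small sketch of how these settings might be applied in a script, assuming the environment variables are set before `transformers` or `datasets` is imported. The cache path is a hypothetical example, and `is_commit_sha` is an illustrative helper that only checks the shape of a revision string, not whether the commit exists on the Hub.

```python
import os
import re

# Set these before importing transformers/datasets so the libraries pick them up.
os.environ.setdefault('HF_HOME', os.path.expanduser('~/hf-cache'))  # hypothetical cache path
os.environ['HF_HUB_OFFLINE'] = '1'                                  # use the local cache only


def is_commit_sha(revision: str) -> bool:
    """True when revision looks like a full 40-character Git commit hash."""
    return re.fullmatch(r'[0-9a-f]{40}', revision) is not None


for rev in ('main', 'v1.0', '0123456789abcdef0123456789abcdef01234567'):
    kind = 'full commit hash' if is_commit_sha(rev) else 'branch or tag name'
    print(f'{rev!r}: {kind}')
```

A branch such as `main` moves as new commits land, so only a full commit hash pins the exact files. Offline mode only works once the files are already in the local cache, so leave `HF_HUB_OFFLINE` unset for the first download.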


In [9]:
# Example: pinning a revision (replace with a real commit SHA/tag for production)
tok_pinned = AutoTokenizer.from_pretrained('bert-base-uncased', revision='main')
mdl_pinned = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english', revision='main')

# Where is the cache?
print('HF_HOME =', os.getenv('HF_HOME'))
print('TRANSFORMERS_CACHE =', os.getenv('TRANSFORMERS_CACHE'))


HF_HOME = None
TRANSFORMERS_CACHE = None


## 6) Optional: Hosted Inference API
- Definition: The Inference API is a managed endpoint for common tasks.
- Why: Zero setup for quick demos or when you lack local GPU resources.

Run inference on HF-hosted models (requires token; rate limits apply).


In [10]:
try:
    from huggingface_hub import InferenceClient
    if HF_TOKEN:
        client = InferenceClient(model='facebook/bart-large-cnn', token=HF_TOKEN)
        s = client.summarization('''Large Language Models (LLMs) are a transformative technology in artificial intelligence, powering applications like chatbots, text generation, and automated analysis. These models, built on deep learning architectures, excel at understanding and generating human-like text by learning patterns from vast datasets. At their core, LLMs are neural networks trained on billions of words from diverse sources, such as books, websites, and social media, enabling them to capture the nuances of language, from grammar to context.
The foundation of LLMs lies in the transformer architecture, introduced in 2017 with the paper "Attention is All You Need." Transformers use mechanisms like self-attention to process input text, allowing the model to weigh the importance of each word relative to others in a sentence. This enables LLMs to handle long-range dependencies, making them adept at tasks like translation, summarization, and question-answering. Models like BERT, GPT, and Llama have pushed the boundaries of what machines can achieve, with each iteration scaling up in size and capability.
Training an LLM involves feeding it massive text corpora and optimizing billions of parameters using techniques like supervised learning and reinforcement learning. The process is computationally intensive, requiring powerful GPUs or TPUs and significant energy resources. Once trained, LLMs can perform zero-shot or few-shot learning, meaning they can tackle tasks with little to no task-specific training, relying on their prelearned knowledge. For example, an LLM can generate a story from a prompt like "Once upon a time" or classify sentiment in a sentence without explicit retraining.
LLMs are accessible through platforms like the Hugging Face Hub, which hosts pretrained models, datasets, and tools like the transformers library. This democratizes AI, allowing developers to use models like GPT-2 or DistilBERT for tasks such as text classification or generation without building from scratch. However, using LLMs responsibly requires understanding their limitations, including biases in training data, high computational costs, and potential security risks when loading untrusted models.
Applications of LLMs span industries: they power virtual assistants, automate customer support, assist in coding, and enhance research by summarizing complex texts. Yet, challenges remain, including ensuring fairness, reducing environmental impact, and managing the risk of generating misleading information. As LLMs evolve, they promise to reshape how we interact with technology, making it critical to approach their development and use with care.
In summary, LLMs represent a leap forward in AI, driven by transformers and massive datasets. Their ability to process and generate language has broad implications, but careful management is essential to harness their potential effectively.''')
        print(s)
    else:
        print('HF_TOKEN not set; skipping hosted inference.')
except Exception as e:
    print('InferenceClient not available or error:', e)


HF_TOKEN not set; skipping hosted inference.


## Exercise — Model comparison and benchmarking


### Task Description
Create a comprehensive comparison of two sentiment classifiers using the IMDB dataset. Your analysis should include:

#### Model Selection
- Compare `distilbert-base-uncased-finetuned-sst-2-english` and `textattack/bert-base-uncased-SST-2`
- Document the model architectures and sizes

#### Implementation Requirements
- Use the Hugging Face datasets library to load IMDB data
- Implement batch processing for memory efficiency
- Include proper error handling and device management
- Record and compare inference times

#### Evaluation Metrics
- Calculate and compare accuracy scores
- Generate classification reports
- Record processing time per batch
- Document memory usage where applicable
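
As a starting point for the metrics, accuracy and per-class precision/recall can be computed from raw counts, as in the sketch below. The label lists are made up; IMDB encodes 0 as negative and 1 as positive, and each model's predicted indices should first be mapped through its own `config.id2label` so that both use the same convention. `per_class_report` is an illustrative helper, not a replacement for a full classification report.

```python
from collections import Counter
from typing import Dict, List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(y_true: List[int], y_pred: List[int]) -> Dict[int, Dict[str, float]]:
    """Precision and recall per class from (true, predicted) pair counts."""
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(cls, cls)]
        predicted = sum(n for (_, p), n in pairs.items() if p == cls)
        actual = sum(n for (t, _), n in pairs.items() if t == cls)
        report[cls] = {
            'precision': tp / predicted if predicted else 0.0,
            'recall': tp / actual if actual else 0.0,
        }
    return report


# Made-up labels for illustration (0 = negative, 1 = positive).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(accuracy(y_true, y_pred))
print(per_class_report(y_true, y_pred))
```

In this toy example four of six predictions are correct, and each class has two true positives out of three predicted and three actual members.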

#### Technical Requirements
- Pin model revisions for reproducibility 
- Record label mappings from `config.id2label`
- Use appropriate batch sizes (suggest starting with 32)

#### Deliverables
- Working code implementation
- Performance comparison table
- Brief analysis of tradeoffs (speed vs. accuracy)
- Documentation of label mappings and any data preprocessing

#### Bonus Tasks
- Experiment with different dataset sizes
- Add visualization of results
- Compare memory usage across models
- Analyze misclassified examples
