In [1]:
!uv pip install transformers torch sentencepiece

Audited 3 packages in 31ms


## Question 1: Understanding Pipelines

### 1. What is a `pipeline` in Hugging Face Transformers? What does it abstract away from the user?

A pipeline is a high-level inference API that makes it easy to use pretrained models. It handles all the complexity behind the scenes: tokenizing the input, running the model, and converting outputs to readable results. You just specify the task and provide data.
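
To make those three hidden stages concrete, here is a minimal, library-free sketch. `ToyPipeline`, its word lists, and its scoring rule are hypothetical stand-ins invented for illustration, not the Transformers implementation; only the preprocess, forward, postprocess shape is the point.

```python
import math


class ToyPipeline:
    """Hypothetical stand-in that mimics the three stages a pipeline hides."""

    POSITIVE = {"love", "great", "good"}
    NEGATIVE = {"hate", "bad", "awful"}

    def preprocess(self, text):
        # Stage 1: turn raw text into model inputs (here: lowercase word tokens).
        return [word.strip(".,!?").lower() for word in text.split()]

    def forward(self, tokens):
        # Stage 2: run the "model" (here: count sentiment words to get raw logits).
        pos = sum(token in self.POSITIVE for token in tokens)
        neg = sum(token in self.NEGATIVE for token in tokens)
        return {"POSITIVE": float(pos), "NEGATIVE": float(neg)}

    def postprocess(self, logits):
        # Stage 3: convert raw scores into a readable label and probability.
        total = sum(math.exp(value) for value in logits.values())
        label = max(logits, key=logits.get)
        return [{"label": label, "score": math.exp(logits[label]) / total}]

    def __call__(self, text):
        return self.postprocess(self.forward(self.preprocess(text)))


toy_classifier = ToyPipeline()
toy_result = toy_classifier("I love this product!")
print(toy_result)
```

A real pipeline swaps in a trained tokenizer and neural network for stages 1 and 2, but the caller still only sees the final list of label and score dictionaries.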

### 2. List at least 3 other tasks (besides text-classification) available in pipelines:

- `text-generation` for generating text from a prompt
- `automatic-speech-recognition` for transcribing audio
- `image-classification` for classifying images
- `question-answering` for extracting answers from context

### 3. What happens when you don't specify a model? How can you specify a specific model?

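If you don't specify a model, the pipeline falls back to a default pretrained checkpoint for the task. For `text-classification`, that default is `distilbert/distilbert-base-uncased-finetuned-sst-2-english`. The library warns that relying on the default, without naming a model and revision, is not recommended in production.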

To specify a model, pass the `model` parameter with a Hugging Face Hub model identifier:

In [2]:
import torch
from transformers import pipeline

classifier_default = pipeline("text-classification")

classifier_custom = pipeline("text-classification", model="nlptown/bert-base-multilingual-uncased-sentiment")

text = "I love this product!"
print("Default model:", classifier_default(text))
print("Custom model:", classifier_custom(text))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Device set to use mps:0


Default model: [{'label': 'POSITIVE', 'score': 0.9998855590820312}]
Custom model: [{'label': '5 stars', 'score': 0.9135047197341919}]


---

## Question 2: Text Classification Deep Dive

### 1. What is the default model used for text-classification?

`distilbert/distilbert-base-uncased-finetuned-sst-2-english`

You can find it on the [Hugging Face Hub](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).

### 2. What dataset was this model fine-tuned on? What kind of text does it work best with?

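The model was fine-tuned on SST-2, the binary sentiment task from the Stanford Sentiment Treebank, which is part of the GLUE benchmark. SST-2 is built from movie-review sentences, so the model works best with short to medium-length English text that expresses clear positive or negative sentiment (it was fine-tuned with a maximum sequence length of 128 tokens).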

### 3. What does the `score` field represent? What range of values can it have?

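The score is the probability the model assigns to the predicted label, ranging from 0 to 1. A score of 0.90 means the model puts 90% of its probability on that label. It measures the model's confidence, not whether the prediction is actually correct.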
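
For a single-label classifier like this one, the score is typically a softmax over the model's raw outputs (logits). The sketch below shows that conversion in plain Python; the logit values are made up for illustration and do not come from the real model.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for the two SST-2 labels; real values come from the model.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-1.2, 1.0]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
print({"label": labels[best], "score": round(probs[best], 4)})
```

Because the probabilities sum to 1, the reported score for the winning label in a two-class model is always at least 0.5.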

### 4. Challenge: Emotion classification model

I found `j-hartmann/emotion-english-distilroberta-base`, which classifies text into 7 emotions: anger, disgust, fear, joy, neutral, sadness, and surprise.

Link: https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

In [3]:
emotion_classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base")

test_texts = [
    "I'm so happy today!",
    "That made me super angry.",
    "Damn, that is so scary!",
    "I feel so sad and lonely."
]

for text in test_texts:
    result = emotion_classifier(text)
    print(f"'{text}' -> {result[0]['label']} ({result[0]['score']:.2%})")

Device set to use mps:0


'I'm so happy today!' -> joy (95.24%)
'That made me super angry.' -> anger (97.02%)
'Damn, that is so scary!' -> fear (92.46%)
'I feel so sad and lonely.' -> sadness (98.68%)


## Question 3: Named Entity Recognition (NER)

### 1. What does the `aggregation_strategy="simple"` parameter do?

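It merges adjacent tokens that the model assigns to the same entity into a single result, based on their BIO tags. Without it, you get one prediction per token. With `"simple"`, a sequence like B-PER, I-PER, I-PER becomes one entity, and subword pieces are joined back together when they share a label. If the pieces of one word receive different labels, the word can still come out split; the `first`, `average`, and `max` strategies exist to resolve that at the word level.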
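
The grouping idea can be sketched in a few lines. This is a simplified illustration, not the library's implementation: `group_entities` is a hypothetical helper, and the tagged tokens are written by hand rather than predicted by a model.

```python
def group_entities(tagged_tokens):
    """Merge consecutive (word, BIO tag) pairs into entity groups.

    Simplified illustration: a new group starts at a B- tag or when the
    entity type changes; O tags close the current group.
    """
    groups = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if current is None or prefix == "B" or current["entity_group"] != entity_type:
            current = {"entity_group": entity_type, "words": [word]}
            groups.append(current)
        else:
            current["words"].append(word)
    return [
        {"entity_group": g["entity_group"], "word": " ".join(g["words"])}
        for g in groups
    ]


# Hypothetical per-token predictions, written by hand for illustration.
tagged = [
    ("Optimus", "B-MISC"), ("Prime", "I-MISC"), ("from", "O"),
    ("Amazon", "B-ORG"), ("in", "O"), ("Germany", "B-LOC"),
]
grouped = group_entities(tagged)
print(grouped)
```

The real pipeline also averages the token scores within each group and tracks character offsets, which this sketch leaves out.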

### 2. What do the entity types mean? (ORG, MISC, LOC, PER)

PER is for person names, LOC for locations, ORG for organizations, and MISC for things that don't fit the other categories (like product names or events).

### 3. Why do some words appear with `##` prefix (like `##tron` and `##icons`)?

The `##` prefix indicates subword tokens. When a word isn't in the tokenizer's vocabulary, it gets split into smaller pieces. For example, "Megatron" becomes "Mega" + "##tron". The `##` means it's a continuation of the previous token.
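
A greedy longest-match-first split, the general approach WordPiece-style tokenizers use, can be sketched as follows. The vocabulary here is a toy one invented for the example, and `wordpiece_tokenize` is a hypothetical helper rather than the tokenizer's actual code.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, invented for this example.
toy_vocab = {"Mega", "##tron", "Decept", "##icons", "Amazon"}

print(wordpiece_tokenize("Megatron", toy_vocab))     # ['Mega', '##tron']
print(wordpiece_tokenize("Decepticons", toy_vocab))  # ['Decept', '##icons']
print(wordpiece_tokenize("Amazon", toy_vocab))       # ['Amazon']
```

Words that are in the vocabulary stay whole, while rarer words are covered by the longest known prefix followed by `##` continuation pieces.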

### 4. Why are "Megatron" and "Decepticons" split incorrectly?

The model was trained on Reuters news data from 1996-1997 (CoNLL-2003 dataset). Fictional character names from the Transformers franchise weren't common in news articles back then, so the model doesn't recognize them as coherent entities. This shows that NER models work best on entity types similar to their training data.

### 5. Challenge: What is the CoNLL-2003 dataset?

CoNLL-2003 is a benchmark NER dataset built from Reuters news stories (Aug 1996 - Aug 1997). It uses IOB2 tagging format with four entity types: PER, LOC, ORG, and MISC. It's available in English and German and is the standard benchmark for evaluating NER models.

Model card: https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english

In [4]:
ner = pipeline("ner", aggregation_strategy="simple")
text = "Last week I ordered an Optimus Prime action figure from Amazon in Germany."
results = ner(text)
for entity in results:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2%})")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining

Optimus Prime -> MISC (98.33%)
Amazon -> ORG (99.39%)
Germany -> LOC (99.96%)


## Question 4: Question Answering Systems

### 1. What type of question answering is this? (Extractive vs. Generative)

This is extractive question answering. The model finds and returns a span of text directly from the context, rather than generating new text.

### 2. What do `start` and `end` indices represent? Why are they important?

The `start` and `end` indices are character positions marking where the answer begins and ends in the context. They let you extract the exact answer span, highlight it in a UI, or verify where the answer came from.
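
As a small illustration, the indices can be used directly as a Python slice. The dictionary below is hand-written in the shape the pipeline returns (score omitted), using the sentence and positions from this notebook's question-answering cell.

```python
# Same sentence as the context in this notebook's question-answering cell.
context = (
    "Hugging Face was founded in 2016 by Cl\u00e9ment Delangue, "
    "Julien Chaumond, and Thomas Wolf in New York City."
)

# Hand-written result in the shape the pipeline returns (score omitted).
result = {"answer": "New York City", "start": 90, "end": 103}

# `start` is inclusive and `end` is exclusive, so plain slicing recovers the span.
span = context[result["start"]:result["end"]]
print(span)

# Highlight the answer in place, e.g. for display in a UI.
highlighted = (
    context[:result["start"]] + "**" + span + "**" + context[result["end"]:]
)
print(highlighted)
```

Checking that the slice equals the returned `answer` is a cheap way to verify that the offsets and the context string are still in sync.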

### 3. What is the SQuAD dataset?

SQuAD (Stanford Question Answering Dataset) is a benchmark for extractive QA with over 100,000 question-answer pairs based on Wikipedia articles. Answers are always text spans from the articles. SQuAD 2.0 also includes unanswerable questions.

Model card: https://huggingface.co/distilbert-base-cased-distilled-squad

### 4. Question the model CANNOT answer

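Questions that require inference, or information that is not in the text, cannot be answered correctly. For example, asking "What is the CEO's opinion?" when no opinion is stated, or "What will happen next?", since extractive QA can only return text that exists in the context. By default the pipeline still returns some span from the context in these cases, often with a low score, so the score is worth checking.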

### 5. Challenge: Extractive vs Generative QA

Extractive QA finds and returns exact text spans from the context. Models like BERT and DistilBERT do this. Generative QA creates new text as the answer, so it can rephrase or synthesize information. Models like T5 and BART do generative QA. The key difference is that extractive can only answer if the answer is literally in the text, while generative can produce answers even when they're not explicitly stated.

The next two code cells show both approaches: the first uses the default extractive `question-answering` pipeline, and the second uses a generative model, `google/flan-t5-base`:

In [5]:
qa = pipeline("question-answering")
context = "Hugging Face was founded in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf in New York City."
question = "Where was Hugging Face founded?"
result = qa(question=question, context=context)
print(f"Answer: {result['answer']}")
print(f"Score: {result['score']:.2%}")
print(f"Position: {result['start']} to {result['end']}")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


Answer: New York City
Score: 95.34%
Position: 90 to 103


In [6]:
generator = pipeline("text2text-generation", model="google/flan-t5-base")
result = generator("question: What is the capital of France? context: France is a country in Western Europe. Its capital is Paris, known for the Eiffel Tower.")
print(result)

Device set to use mps:0


[{'generated_text': 'Paris'}]


## Question 5: Text Summarization

### 1. What is the difference between extractive and abstractive summarization?

Extractive summarization picks and combines existing sentences from the source text. Abstractive summarization generates new text that captures the meaning, potentially rephrasing or using different words. The default model uses abstractive summarization.
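
To show what "picking existing sentences" means in practice, here is a deliberately naive extractive summarizer. The word-frequency scoring rule is an assumption chosen for brevity (it favors long sentences), and `extractive_summary` is a hypothetical helper, not something the Transformers library provides.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Score = total document frequency of the words in the sentence.
        return sum(word_counts[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "Cats sleep a lot. Cats also hunt mice and cats climb trees. "
    "Dogs bark."
)
summary = extractive_summary(doc, num_sentences=1)
print(summary)
```

Every word of the output appears verbatim in the input, which is the defining property of extractive summarization. An abstractive model has no such constraint and generates its summary token by token.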

### 2. Default model analysis: `sshleifer/distilbart-cnn-12-6`

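This is a DistilBART model (a distilled version of BART) fine-tuned for summarization on CNN/DailyMail news articles; the XSum-trained variants are separate `distilbart-xsum-*` checkpoints. Its model card reports it as about 1.24x faster than `bart-large-cnn` with similar quality.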

Model card: https://huggingface.co/sshleifer/distilbart-cnn-12-6

### 3. What do `max_length` and `min_length` control?

`max_length` sets the maximum number of tokens in the generated summary, and `min_length` sets the minimum. If `min_length` is larger than `max_length`, you get a warning and the model just generates up to `max_length`.

### 4. What does `clean_up_tokenization_spaces=True` do?

It removes extra spaces around punctuation that appear during tokenization. Without it you might get "Hello , world ." instead of "Hello, world."
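
The effect can be approximated with a handful of string replacements. This is a simplified sketch: `clean_up_spaces` is a hypothetical helper that covers only four punctuation marks, whereas the library's own cleanup also handles contractions.

```python
def clean_up_spaces(text):
    """Remove the spaces that token-by-token decoding leaves before punctuation."""
    for spaced, fixed in [(" .", "."), (" ,", ","), (" !", "!"), (" ?", "?")]:
        text = text.replace(spaced, fixed)
    return text


cleaned = clean_up_spaces("Hello , world .")
print(cleaned)  # Hello, world.
```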

### 5. Challenge: Two different summarization models

For short texts like news articles, `facebook/bart-large-xsum` is optimized for single-sentence summaries. For long documents like research papers, `allenai/led-base-16384` can handle up to 16,384 tokens using Longformer attention, compared to ~1024 for standard BART.

In [7]:
summarizer = pipeline("summarization")
article = """
Artificial intelligence has transformed many industries in recent years.
Machine learning models can now understand natural language, recognize images,
and even generate creative content. Companies are investing billions of dollars
in AI research and development. However, concerns about ethics and job displacement
continue to be debated by experts and policymakers around the world.
"""
result = summarizer(article, max_length=50, min_length=20)
print(result[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


 Machine learning models can now understand natural language, recognize images, and even generate creative content . Companies are investing billions of dollars in AI research and development .


## Question 6: Machine Translation

### 1. What is the architecture behind `Helsinki-NLP/opus-mt-en-de`?

It uses the Marian architecture, a transformer encoder-decoder with 6 encoder layers and 6 decoder layers. OPUS is the open parallel corpus the model was trained on (a collection of translated texts from the web), and MT stands for Machine Translation. It uses SentencePiece tokenization.

Model card: https://huggingface.co/Helsinki-NLP/opus-mt-en-de

### 2. How to find an English to French translation model?

Two options: `Helsinki-NLP/opus-mt-en-fr` for standard translation, or `Helsinki-NLP/opus-mt-tc-big-en-fr` for higher quality. See the code cell below.

### 3. Bilingual vs Multilingual translation models

Bilingual models like `opus-mt-en-de` handle one language pair and are smaller, faster, and usually higher quality for that specific pair. Multilingual models like `facebook/m2m100_418M` handle many language pairs (100 languages, 9,900 directions) with one model, but trade off some quality for flexibility.
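
The 9,900 figure is simply the number of ordered (source, target) pairs among 100 languages, which a short calculation confirms. `translation_directions` is a hypothetical helper written for this check.

```python
def translation_directions(num_languages):
    """Number of ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)


directions = translation_directions(100)
print(directions)  # 9900
```

Covering the same directions with bilingual checkpoints would take one model per direction, which is the flexibility a single multilingual model buys.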

### 4. How does `"translation_en_to_de"` relate to the model?

The task name specifies the translation direction: `en` is the source language (English), `de` is the target (German). This must match what the model was trained for. For French, use `translation_en_to_fr` with `Helsinki-NLP/opus-mt-en-fr`.
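
As a rough sketch of how the direction is encoded in the name, the task string can be split into its source and target codes. `parse_translation_task` is a hypothetical helper, and the `opus-mt-{src}-{tgt}` naming pattern is inferred only from the two Helsinki-NLP checkpoints named in this notebook.

```python
import re


def parse_translation_task(task):
    """Split a task name like 'translation_en_to_de' into (source, target)."""
    match = re.fullmatch(r"translation_([a-z]+)_to_([a-z]+)", task)
    if match is None:
        raise ValueError(f"Not a translation_XX_to_YY task name: {task!r}")
    return match.group(1), match.group(2)


src, tgt = parse_translation_task("translation_en_to_fr")
print(src, tgt)
# Checkpoints named in this notebook follow the opus-mt-{src}-{tgt} pattern.
print(f"Helsinki-NLP/opus-mt-{src}-{tgt}")
```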

### 5. What is `sacremoses` used for?

Sacremoses is a Python port of the text-processing scripts from the Moses statistical MT toolkit. It provides tokenization, detokenization, punctuation normalization, and truecasing for text preprocessing in NLP pipelines.

### 6. Challenge: Multilingual model

`facebook/m2m100_418M` supports 100 languages and can translate directly between any pair without going through English as a pivot. There's also a larger 1.2B parameter version for better quality.

Model card: https://huggingface.co/facebook/m2m100_418M

In [8]:
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you today?")
print(result)

Device set to use mps:0


[{'translation_text': "Bonjour, comment allez-vous aujourd'hui ?"}]


In [9]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "fr"
text = "Bonjour, comment allez-vous?"
encoded = tokenizer(text, return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("de"))
result = tokenizer.decode(generated[0], skip_special_tokens=True)

print(f"French: {text}")
print(f"German: {result}")

French: Bonjour, comment allez-vous?
German: Hallo, wie geht es dir?


## Question 7: Text Generation

### 1. What is the default model for text-generation? What architecture does GPT-2 use?

The default model is `openai-community/gpt2`. GPT-2 uses a decoder-only transformer architecture with 124M parameters in the base version. It performs autoregressive generation, meaning it predicts the next token based on all previous tokens. Larger variants include gpt2-medium (355M), gpt2-large (774M), and gpt2-xl (1.5B).

### 2. Why do we use `set_seed(42)` before generation?

Setting a random seed ensures reproducibility. When using sampling-based generation (`do_sample=True`), the model randomly selects tokens from a probability distribution. Without a fixed seed, you get different outputs each run. With `set_seed(42)`, the same prompt produces the same output every time.
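
The same principle can be demonstrated with Python's standard `random` module. The vocabulary and probabilities below are made up, and `sample_tokens` is a hypothetical helper; `set_seed` applies the same idea to the random number generators that generation relies on.

```python
import random


def sample_tokens(seed, steps=5):
    """Draw `steps` tokens from a fixed toy distribution using a seeded RNG."""
    rng = random.Random(seed)
    vocab = ["uncertain", "bright", "here", "unknown"]
    weights = [0.4, 0.3, 0.2, 0.1]  # made-up next-token probabilities
    return rng.choices(vocab, weights=weights, k=steps)


run_a = sample_tokens(seed=42)
run_b = sample_tokens(seed=42)
run_c = sample_tokens(seed=7)

print(run_a == run_b)  # True: same seed, same draws
print(run_a, run_c)
```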

### 3. What parameters control text generation?

- `temperature`: controls randomness. Lower values (e.g., 0.7) make output more focused/deterministic, higher values (e.g., 1.5) make it more random/creative
- `top_k`: only samples from the k most likely tokens at each step
- `do_sample`: when False, uses greedy decoding (always picks most likely token). When True, enables sampling
- `top_p` (nucleus sampling): samples from the smallest set of tokens whose cumulative probability exceeds p
- `max_new_tokens`: limits how many tokens to generate
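
The effect of these parameters on a next-token distribution can be sketched without a model. The logits and four-word vocabulary below are invented, and the helper functions are simplified illustrations of the ideas rather than the library's implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature rescales the logits first."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


# Made-up logits over a four-token vocabulary.
vocab = ["uncertain", "bright", "here", "banana"]
logits = [2.0, 1.0, 0.5, -1.0]

greedy = vocab[max(range(len(logits)), key=lambda i: logits[i])]  # do_sample=False
sharp = softmax(logits, temperature=0.7)   # more mass on the top token
flat = softmax(logits, temperature=1.5)    # mass spread more evenly
top2 = top_k_filter(softmax(logits), k=2)
nucleus = top_p_filter(softmax(logits), p=0.9)

print(greedy)
print([round(x, 3) for x in sharp])
print([round(x, 3) for x in flat])
print([round(x, 3) for x in top2])
print([round(x, 3) for x in nucleus])
```

With sampling enabled, the next token is drawn from whichever filtered distribution results. Greedy decoding skips the draw and always takes the top entry.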

### 4. What does the truncation warning mean?

GPT-2 has a maximum context length of 1024 tokens. If your input prompt exceeds this, it has to be truncated to fit, so part of the prompt is dropped and that context is lost. Which end is cut depends on the tokenizer's `truncation_side` setting, which defaults to `"right"`.

### 5. What does `pad_token_id` being set to `eos_token_id` mean?

GPT-2 wasn't trained with a padding token, so it doesn't have one defined. The pipeline automatically uses the end-of-sequence token (ID 50256) as padding. This is needed for batch processing where sequences have different lengths.
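
Here is a minimal sketch of why a padding ID is needed at all when batching. The token IDs are made up and `pad_batch` is a hypothetical helper. It pads on the right for simplicity, although left padding is usually preferred when generating with decoder-only models.

```python
EOS_TOKEN_ID = 50256  # GPT-2's end-of-sequence ID, as shown in the log message


def pad_batch(sequences, pad_token_id):
    """Right-pad token ID lists to equal length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_token_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)  # 0 marks padding
    return input_ids, attention_mask


# Made-up token IDs for two prompts of different lengths.
batch = [[464, 2003, 286], [15496]]
ids, mask = pad_batch(batch, pad_token_id=EOS_TOKEN_ID)
print(ids)
print(mask)
```

The attention mask is what tells the model to ignore the padded positions, so reusing the end-of-sequence ID as padding does not change what the model attends to.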

### 6. What are the trade-offs between model size and generation quality?

Larger models like gpt2-xl (1.5B parameters) produce more coherent and contextual text, but are slower and require more memory. Smaller models like gpt2 (124M parameters) are faster and lighter but produce lower quality output. You can balance this with techniques like distillation or quantization.

In [10]:
generator_small = pipeline("text-generation", model="gpt2")
prompt = "The future of artificial intelligence is"

print("Greedy:")
print(generator_small(prompt, max_new_tokens=50, do_sample=False, return_full_text=False)[0]['generated_text'])

print("\nSampling (temperature=0.7):")
print(generator_small(prompt, max_new_tokens=50, do_sample=True, temperature=0.7, return_full_text=False)[0]['generated_text'])

print("\nTop-k (k=50):")
print(generator_small(prompt, max_new_tokens=50, do_sample=True, top_k=50, return_full_text=False)[0]['generated_text'])

Device set to use mps:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Greedy:


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 uncertain.

"We're not sure what the future will look like," said Dr. Michael S. Schoenfeld, a professor of computer science at the University of California, Berkeley. "But we're not sure what the future will look

Sampling (temperature=0.7):


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 uncertain. In the meantime, we need to be more patient and more patient with AI.

The first thing to know about artificial intelligence is that it is not new. The world has already been doing with it, with the same sort of ideas

Top-k (k=50):
 uncertain, although it is an area that could be explored soon in a new way.

"It is a very, very, very, very interesting world," said Chris Stapel, executive director of the AI Society, a nonprofit that advises
