In [1]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

# Question 1

###What is a pipeline in Hugging Face Transformers? What does it abstract away from the user?

A pipeline is a high-level interface in the Transformers library that bundles everything needed for a given task: the tokenizer, the model, and the post-processing of the output.


For the user, this means there is no need to handle manually:

--tokenizing the input text,

--passing the tensors to the model,

--decoding and formatting the output (labels, scores, generated text, etc.).

You simply pass in a text and get back a ready-to-use result (label + score, answer, summary, etc.).
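To make the three hidden stages concrete, here is a rough sketch of what a text-classification pipeline does on your behalf. It assumes the default SST-2 checkpoint named later in this notebook; the function name classify_manually is illustrative, and the real pipeline also handles batching, device placement, and other details.

```python
def classify_manually(text, model_id="distilbert/distilbert-base-uncased-finetuned-sst-2-english"):
    """Run the three stages that pipeline("text-classification") wraps."""
    # Lazy imports: these packages are only needed when the function is called.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)

    # 1. Tokenization: text -> tensors of token ids.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # 2. Forward pass: tensors -> raw logits, one per label.
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # 3. Post-processing: logits -> probabilities -> best label and its score.
    probs = torch.softmax(logits, dim=-1)
    best = int(probs.argmax())
    return {"label": model.config.id2label[best], "score": float(probs[best])}
```

Calling classify_manually(text) downloads the checkpoint on first use and should return a label/score dictionary comparable to the pipeline's own output.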

###Visit the pipeline documentation and list at least 3 other tasks (besides text-classification) that are available.

question-answering

token-classification

summarization

translation

text-generation

###What happens when you don't specify a model in the pipeline? How can you specify a specific model?

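If you don’t specify a model, the pipeline automatically loads the default model associated with the task.
For example, for "text-classification" it loads distilbert/distilbert-base-uncased-finetuned-sst-2-english at revision 714eb0f.
To use a specific model, pass its Hub identifier through the model argument, e.g. pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base").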
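Since the log message recommends naming both model and revision, one way to pin them is sketched here. The identifiers are copied from that log message; the PINNED dictionary and the load_pinned_classifier helper are illustrative names, not part of the library.

```python
# Task, checkpoint, and revision as reported in the pipeline's log message.
PINNED = {
    "task": "text-classification",
    "model": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "714eb0f",
}

def load_pinned_classifier(config=PINNED):
    """Build a pipeline with an explicit model and revision (downloads on first use)."""
    from transformers import pipeline  # lazy import, only needed when called
    return pipeline(**config)
```

Pinning the revision means a later update to the checkpoint on the Hub does not silently change the results.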

In [2]:
from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


# Question 2: Text Classification Deep Dive

###What is the default model used for text-classification?

The default model used is
distilbert/distilbert-base-uncased-finetuned-sst-2-english.

###What dataset was this model fine-tuned on? What kind of text does it work best with?

This model was fine-tuned on the SST-2 (Stanford Sentiment Treebank v2) dataset.
SST-2 contains short movie review sentences labelled as positive or negative.
Therefore, the model works best with short, opinionated sentences expressing a clear positive or negative sentiment (e.g. reviews, comments, short statements).

###The output includes a score field. What does this score represent? What range of values can it have?

The score field represents the model’s confidence in its predicted label, i.e. the softmax probability assigned to that label.
It always lies between 0 and 1, where values closer to 1 indicate higher confidence.
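As a small illustration of where the score comes from, the sketch below applies a softmax to two made-up logits. The logit values are hypothetical and not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the labels (NEGATIVE, POSITIVE).
probs = softmax([1.5, -0.7])
score = max(probs)                               # what the pipeline reports as `score`
print(probs, score)
```

With only two labels, the reported score of the winning label can never fall below 0.5, because the two probabilities sum to 1.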

###Challenge: Find a different text-classification model on the Hub that classifies emotions (not just positive/negative). What is its name?

Example of an emotion-classification model on the Hugging Face Hub:

j-hartmann/emotion-english-distilroberta-base

In [3]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |


# Named Entity Recognition

In [4]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.87901 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.55657 | Mega | 208 | 212 |
| 4 | PER | 0.590256 | ##tron | 212 | 216 |
| 5 | ORG | 0.669692 | Decept | 253 | 259 |
| 6 | MISC | 0.498349 | ##icons | 259 | 264 |
| 7 | MISC | 0.775362 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812096 | Bumblebee | 502 | 511 |


# Question 3: Named Entity Recognition (NER)

###What does the aggregation_strategy="simple" parameter do in the NER pipeline?

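aggregation_strategy="simple" merges adjacent tokens that share the same predicted entity type into a single entity span, with one group label and an averaged score. Because grouping follows the predicted tag rather than word boundaries, subwords of one word that receive different tags (such as Mega and ##tron above) still end up in separate groups.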
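The grouping idea can be sketched in a few lines. This is a simplified illustration, not the library's implementation: it ignores B-/I- tag prefixes and scores, and the tagged tokens are hand-written to mirror the output above.

```python
from itertools import groupby

def group_simple(tagged_tokens):
    """Merge adjacent (token, tag) pairs that share a tag; "O" marks non-entities."""
    spans = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == "O":
            continue
        text = ""
        for token, _ in run:
            if token.startswith("##") and text:
                text += token[2:]          # glue a continuation piece to its word
            elif text:
                text += " " + token        # new word inside the same entity
            else:
                text = token               # first token of the group, kept as-is
        spans.append((tag, text))
    return spans

tokens = [("Optimus", "MISC"), ("Prime", "MISC"), ("action", "O"),
          ("Mega", "MISC"), ("##tron", "PER")]
print(group_simple(tokens))
```

Optimus and Prime share a tag and are merged, while Mega and ##tron carry different tags and stay apart, which matches the pattern in the table.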

###Looking at the output above, what do the entity types mean? (ORG, MISC, LOC, PER)



ORG → organisations (companies, institutions, groups)

PER → persons (people, characters)

LOC → locations (cities, countries, places)

MISC → miscellaneous entities (names that don’t fit in the other categories: events, products, nationalities, etc.)

###Why do some words appear with prefix (like ##tron and ##icons)? What does this indicate about tokenization?

The ## prefix indicates that the token is a subword produced by WordPiece tokenization.

Megatron → Mega + ##tron

Decepticons → Decept + ##icons
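WordPiece splits a word greedily, always taking the longest piece it can find in the vocabulary. The sketch below shows that procedure on a tiny made-up vocabulary; the real BERT vocabulary has roughly 30,000 entries, and toy_vocab is purely hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter candidate
        if piece is None:
            return [unk]                       # no piece matches at this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary: the full names are absent, their pieces are present.
toy_vocab = {"Mega", "##tron", "Decept", "##icons", "Germany"}
for w in ["Megatron", "Decepticons", "Germany"]:
    print(w, wordpiece(w, toy_vocab))
```

A word that is in the vocabulary as a whole, like Germany here, stays in one piece; anything else is broken into the longest known fragments.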

###The model seems to have split "Megatron" and "Decepticons" incorrectly. Why might this happen? What does this tell you about the model's training data?

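Two things are happening. First, “Megatron” and “Decepticons” are not in the tokenizer's WordPiece vocabulary as whole words, so they are split into subwords. Second, the model was fine-tuned on news text that is unlikely to contain Transformers characters or other fictional sci-fi names, so it is unsure how to tag those pieces.

NER models trained on CoNLL-2003 (or similar corpora) mostly see:

-news articles

-real people

-real organizations

-real locations

As a result, the subwords of a rare or fictional name can receive different, low-confidence labels (Mega → MISC, ##tron → PER), and the "simple" aggregation strategy then reports them as separate entities.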

###Challenge: Find the model card for dbmdz/bert-large-cased-finetuned-conll03-english. What is the CoNLL-2003 dataset?

The CoNLL-2003 dataset is a benchmark corpus used for Named Entity Recognition.
It contains English news articles from the Reuters RCV1 dataset and includes four entity types:

PER (persons)

ORG (organizations)

LOC (locations)

MISC (miscellaneous)

In [5]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 335 | 358 | an exchange of Megatron |


# Question 4: Question Answering Systems

###What type of question answering is this? (Extractive vs. Generative)

This is extractive question answering.
The model selects a span directly from the input text rather than generating new text.

###The model outputs start and end indices. What do they represent? Why are they important?

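The start index is the character offset in the context where the answer begins (335 here).

The end index is the character offset just past the end of the answer (358 here), so the answer is the slice of the context between the two.

They are important because an extractive QA model does not generate new words; it predicts which span of the input text answers the question. Internally the model scores start and end tokens, and the pipeline maps the best token span back to character positions.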
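A quick way to see what the offsets mean is to slice a context string with them. The sketch below uses one sentence from the letter as a shortened context and computes the offsets with str.find rather than a model, so the numbers differ from the 335 and 358 in the table.

```python
context = "I demand an exchange of Megatron for the Optimus Prime figure I ordered."
answer = "an exchange of Megatron"

# Character offsets of the answer inside the context (end is exclusive).
start = context.find(answer)
end = start + len(answer)

print(start, end, repr(context[start:end]))
```

The same relation holds in the notebook output: 358 - 335 = 23, which is the length of "an exchange of Megatron".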

###What is the SQuAD dataset?

SQuAD (Stanford Question Answering Dataset) is a large dataset of reading comprehension questions based on Wikipedia articles.
The model must answer questions by extracting a span from the article, not generating new text.

###Think of a question this model CANNOT answer based on the text. Why would it fail?

“Is Amazon a good company overall?”

The model would fail because:

extractive QA can only return text that literally appears in the passage,

it cannot reason, infer motivations, or use external knowledge.

If the answer is not explicitly present as a text span, the model cannot find it.

###Challenge: Difference between extractive and generative QA + example of a generative model

Extractive QA

Selects an answer by extracting text directly from the input.

Output is always a substring of the context.


vs


Generative QA

Generates an answer in natural language, even if the exact words are not in the context.

Can rephrase, summarise, or add reasoning.

Example of a generative QA model on the Hub:
google/flan-t5-base
(or any T5 / FLAN-T5 / GPT-style model).

# Question 5: Text Summarization

###What is the difference between extractive and abstractive summarization? Check the summarization documentation.

####Extractive summarization

Selects and concatenates existing sentences or phrases from the text. No new wording is produced. Works like “smart copy-paste”.

####Abstractive summarization

Generates new sentences, paraphrases, and reformulates the content. Similar to how a human writes a summary. Requires deeper language understanding.

###Looking at the code in the next cell, what is the default model used for summarization? Search for it on the Hugging Face Model Hub and determine:

###Is it an extractive or abstractive model?
###What architecture does it use?(Hint: look at the model name)
###What dataset was it trained on?

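**a) Is it extractive or abstractive?**

Abstractive.
The default model, sshleifer/distilbart-cnn-12-6, is a sequence-to-sequence model that generates the summary token by token.

**b) What architecture does it use?**

It uses the BART architecture (Bidirectional and Auto-Regressive Transformers), which is an encoder–decoder model. DistilBART is a smaller, distilled version of BART; the 12-6 in the name refers to 12 encoder layers and 6 decoder layers.

**c) What dataset was it trained on?**

It is derived from facebook/bart-large-cnn and is fine-tuned on the CNN/DailyMail news summarization dataset (articles + bullet-point highlights).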

###What do the max_length and min_length parameters control? What happens if min_length > max_length?

max_length → the maximum number of tokens in the generated summary.

min_length → the minimum number of tokens the summary must contain.

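If min_length > max_length, the two constraints conflict. In the run below, only max_length=45 is passed while the default min_length is 56; the pipeline logs the warning "Your min_length=56 must be inferior than your max_length=45." and still returns a summary. Set both values explicitly so that min_length < max_length.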

###The parameter clean_up_tokenization_spaces=True is used. What does this parameter do? Why might it be useful for summarization?

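This option cleans up spacing artifacts left over when tokens are decoded back into text:

extra spaces before punctuation (e.g. "word ." → "word."),

odd spacing around contractions (e.g. "do n't" → "don't"),

unnecessary whitespace before commas, question marks, and exclamation marks.

It makes the final summary more readable and natural, which matters because the summary is generated token by token and only then decoded into a string.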
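The effect can be approximated with a handful of string replacements. This is a simplified sketch of that kind of clean-up, not the tokenizer's actual code, and the sample sentence is made up.

```python
def clean_up_spaces(text):
    """Remove the stray spaces that joining tokens leaves before punctuation."""
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

raw = "Bumblebee did n't get the figure he ordered , unfortunately ."
print(clean_up_spaces(raw))
```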

###Challenge: Find two different summarization models on the Hub:

###One optimized for short texts (like news articles)
###One that can handle longer documents
###Compare their architectures and training data.

**Model for short texts (news-style)**

--> facebook/bart-large-cnn

Architecture: BART (encoder–decoder)

Trained on CNN/DailyMail

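Suited to news-length inputs: the encoder accepts at most 1,024 tokens

**Model for long documents**

--> allenai/led-base-16384 (google/pegasus-large is another abstractive summarizer, but it is not designed specifically for long inputs)

Architecture: LED (Longformer Encoder-Decoder), a BART-style encoder–decoder with sparse attention

Accepts inputs of up to 16,384 tokens

Initialized from BART; fine-tuned variants such as allenai/led-large-16384-arxiv are trained on long scientific papers from arXiv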

In [6]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
Your min_length=56 must be inferior than your max_length=45.


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.


###What is the architecture behind the Helsinki-NLP/opus-mt-en-de model? Look it up on the Model Hub.
###What does "OPUS" stand for?
###What does "MT" stand for?

Helsinki-NLP/opus-mt-en-de is a MarianMT model:

a Transformer encoder–decoder architecture trained with the Marian framework, specifically designed for neural machine translation.

OPUS = Open Parallel Corpus (a large open collection of parallel texts used to train these MT models).

MT = Machine Translation.

###How would you find a model to translate from English to French? Visit the translation documentation and the Model Hub to find at least 2 different models.

Filter by Task = Translation and search en-fr.

Examples of English→French models:

Helsinki-NLP/opus-mt-en-fr (MarianMT, bilingual EN→FR).

facebook/mbart-large-50-many-to-many-mmt (mBART-50, multilingual; can do EN→FR among many others).

###What is the difference between bilingual and multilingual translation models? What are the advantages and disadvantages of each?

Bilingual models (e.g. opus-mt-en-fr)

Trained on one language pair.

Often simpler and more accurate on that specific pair.

You need one model per pair, less flexible.

Multilingual models (e.g. mBART-50, M2M100)

Trained on many languages / directions.

Can handle lots of pairs with a single model, allow transfer learning from high-resource to low-resource languages.

Capacity is shared → sometimes slightly worse per pair; more complex and heavier.

###In the code, we specify the task as "translation_en_to_de". How does this relate to the model we're loading?

The task string (translation_en_to_de) tells the pipeline:

this is a translation task,

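from English to German.

It has to agree with the checkpoint: Helsinki-NLP/opus-mt-en-de is a single-direction model that only translates English to German, so the direction is fixed by the model itself and the task string should simply match it.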
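One way to keep the task string and the checkpoint consistent is a small sanity check like the one below. Both helpers are hypothetical and rely only on the naming pattern visible in the identifiers used in this notebook (translation_XX_to_YY and opus-mt-XX-YY).

```python
import re

def parse_translation_task(task):
    """Split a task string like 'translation_en_to_de' into (source, target)."""
    match = re.fullmatch(r"translation_([a-z]+)_to_([a-z]+)", task)
    if match is None:
        raise ValueError(f"not a translation task string: {task!r}")
    return match.group(1), match.group(2)

def matches_checkpoint(task, model_id):
    """Check that an OPUS-MT style model name ends with the same language pair."""
    src, tgt = parse_translation_task(task)
    return model_id.endswith(f"-{src}-{tgt}")

print(parse_translation_task("translation_en_to_de"))
print(matches_checkpoint("translation_en_to_de", "Helsinki-NLP/opus-mt-en-de"))
```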

###The output shows a warning about sacremoses. What is this library used for in NLP?

Sacremoses is a Python port of the Moses tokenizer / normalizer / truecaser.

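It provides classic text pre- and post-processing utilities:

tokenization,

punctuation normalization,

detokenization.

The MarianMT tokenizer uses it to normalize punctuation before SentencePiece tokenization, which is why installing it is recommended for OPUS-MT models.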

###Challenge: Find a multilingual model (like mBART or M2M100) that can translate between multiple language pairs. How many language pairs does it support?

facebook/mbart-large-50-many-to-many-mmt

Supports 50 languages and can translate between any pair of them.

facebook/m2m100_418M

Supports 100 languages and 9,900 translation directions (all ordered pairs of 100 languages).
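The 9,900 figure follows from counting ordered pairs: each of the n languages can be translated into each of the other n - 1. A minimal check, assuming every ordered pair is supported:

```python
def translation_directions(num_languages):
    """Number of ordered (source, target) pairs with source != target."""
    return num_languages * (num_languages - 1)

print(translation_directions(100))  # M2M100
print(translation_directions(50))   # mBART-50, if every pair is supported
```

By the same count, 50 languages give 2,450 directions.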

In [7]:
translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Device set to use cpu


Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.


###What is the default model used for text generation in the code below? Look it up on the Hub and answer:
###What architecture does GPT-2 use? (decoder-only, encoder-decoder, or encoder-only?)
###How many parameters does the base GPT-2 model have?
###What type of generation does it perform? (autoregressive, non-autoregressive, etc.)

When you run pipeline("text-generation"), the default model is:

--> openai-community/gpt2 (the base GPT-2 model from OpenAI, also reachable as gpt2).

GPT-2 uses a decoder-only Transformer architecture. It has no encoder. It generates text token by token, conditioning on previous tokens.

The base GPT-2 model has 124 million parameters.

GPT-2 performs autoregressive generation:

-It predicts the next token based on all previously generated tokens.

-This continues until a stopping criterion (length limit or EOS token).

###Why do we use set_seed(42) before generation? What would happen without it?

set_seed(42) makes the generation deterministic by fixing the random seed.

With it → you always get exactly the same output for the same prompt.

Without it → the output will be different each time, because sampling involves randomness.
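The same principle can be shown with Python's own random number generator. This is only an analogy: set_seed seeds the generators the library relies on (Python, NumPy, PyTorch), while the sketch below uses a local random.Random instance and a made-up vocabulary.

```python
import random

def sample_tokens(seed, n=5):
    """Draw n words at random; the same seed always yields the same draw."""
    rng = random.Random(seed)
    vocab = ["the", "a", "robot", "figure", "order"]
    return [rng.choice(vocab) for _ in range(n)]

first = sample_tokens(42)
second = sample_tokens(42)
print(first)
print(first == second)
```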

###The code uses max_length=200. What other parameters can control text generation? Research and explain:
###-temperature
###-top_k
###-do_sample

• temperature

Controls randomness:

High temperature (>1.0) → more diverse, creative output

Low temperature (<1.0) → safer, more predictable output

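temperature → 0 approaches greedy decoding; Transformers requires a strictly positive temperature, so use do_sample=False for fully greedy output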

• top_k

Top-k sampling keeps only the k most likely tokens at each step.

Small k → more controlled, less creative

Large k → more diversity and riskier generations

• do_sample

Controls whether sampling is used (temperature and top_k only take effect when it is True):

do_sample=True → random sampling (creative)

do_sample=False → greedy decoding (deterministic, less diverse)
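How the three parameters interact at a single decoding step can be sketched in plain Python. This is an illustrative re-implementation under simplifying assumptions, not the library's code; the function name next_token and the logits are made up.

```python
import math
import random

def next_token(logits, do_sample=False, temperature=1.0, top_k=None, rng=random):
    """Pick the next token id from a list of logits."""
    if not do_sample:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive when sampling")
    # Top-k: keep only the k highest-scoring candidates.
    ids = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    # Temperature: divide logits before the softmax (low T sharpens, high T flattens).
    scaled = [logits[i] / temperature for i in ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ids, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0]
print(next_token(logits))                                   # greedy
print(next_token(logits, do_sample=True, temperature=0.7, top_k=2))
```

With top_k=2, token 2 can never be chosen, however high the temperature is set.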

###Looking at the output, you can see a warning about truncation. What does this mean? Why is the input being truncated?

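The warning refers to truncating the input so that it fits within a maximum length.

GPT-2 has a maximum context length of 1,024 tokens, and a prompt longer than that has to be cut to fit.

The prompt in this notebook is far shorter than 1,024 tokens, so the warning most likely appears because max_length=200 was passed without an explicit truncation setting, not because text was actually removed.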

###What does pad_token_id being set to eos_token_id mean? Why is this necessary for GPT-2?

GPT-2 has no pad token in its vocabulary because it was trained without padding.

Generation still needs a pad_token_id, so it falls back to the end-of-sequence token (eos_token_id, 50256 for GPT-2).

This avoids errors during batch generation where padding is required.
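Reusing the EOS id as padding is harmless as long as an attention mask tells the model which positions to ignore. The sketch below pads a batch by hand; the token ids other than 50256 are made up, and left padding is used here because it is the usual choice when generating with decoder-only models.

```python
GPT2_EOS_ID = 50256  # GPT-2's end-of-sequence token id, reused as the pad id

def left_pad(batch, pad_id=GPT2_EOS_ID):
    """Pad token-id lists on the left to equal length and build attention masks."""
    width = max(len(seq) for seq in batch)
    input_ids = [[pad_id] * (width - len(seq)) + seq for seq in batch]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return input_ids, attention_mask

ids, mask = left_pad([[10, 11, 12], [20]])
print(ids)
print(mask)
```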

###What are the trade-offs between model size and generation quality?

####Small models (e.g., GPT-2 small)

**Pros :**

Fast to run

Lower memory use

Suitable for lightweight applications

**Cons :**

Less coherent long-range reasoning

More repetition

More likely to go off-topic

####Large models (e.g., GPT-2 XL, GPT-3, LLaMA-70B)

**Pros :**

More fluent, coherent text

Better factual correctness

Richer vocabulary and style

**Cons :**

Slower, needs GPU

Higher memory cost

More expensive to train/run

In [9]:
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results

In [None]:
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

