# Hello world Transformers

This notebook explores the basics of Hugging Face Transformers using pre-trained models to perform various NLP tasks. We'll use pipelines to classify text, recognize named entities, answer questions, summarize text, translate, and generate text.

**⚠️ Do not forget to install the transformers library to run this notebook.**

## Quick overview of Transformer applications

In [None]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. However, when the package arrived, I discovered to my horror that you had sent me a Megatron action figure instead. As a lifelong fan of the Transformers franchise, I cannot express how disappointed I am. Optimus Prime is the noble leader of the Autobots, while Megatron is the treacherous leader of the Decepticons. I specifically ordered Optimus Prime, and receiving Megatron instead is simply unacceptable. I demand a full refund and a replacement with the correct action figure. I hope Bumblebee can help resolve this matter quickly."""

### Question 1: Understanding Pipelines

1. **What is a `pipeline` in Hugging Face Transformers?**  
   A `pipeline` is a high-level abstraction that combines a model, tokenizer, and preprocessing/postprocessing steps into a single, easy-to-use interface. It abstracts away:
   - Model loading and initialization
   - Tokenization (converting text to tokens)
   - Model inference (running the model)
   - Post-processing (converting model outputs to human-readable format)
   - Device management (CPU/GPU/MPS)
   - Batch processing

2. **Other tasks besides text-classification:**  
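   - `"sentiment-analysis"` (an alias for `"text-classification"`)
   - `"ner"` (an alias for `"token-classification"`; Named Entity Recognition)
   - `"question-answering"`
   - `"summarization"`
   - `"translation"` or `"translation_xx_to_yy"`, e.g. `"translation_en_to_fr"` (language-specific)
   - `"text-generation"`
   - `"zero-shot-classification"`
   - `"fill-mask"`
   - `"feature-extraction"`
   - `"conversational"` (deprecated and removed in recent releases; chat-style prompts now go through `"text-generation"`)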

3. **What happens when you don't specify a model?**  
   When no model is specified, the pipeline falls back to a default checkpoint that the library associates with that task, downloads it from the Hub, and logs a warning recommending that you name a model explicitly. Defaults can change between library versions, so for reproducible results pin the model by passing the `model` parameter:
   ```python
   classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
   ```
   Any other model identifier from the Hub works the same way:
   ```python
   classifier = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
   ```
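
To make the stages from item 1 concrete, here is a minimal sketch of the preprocess, forward, postprocess flow that a pipeline wraps. `ToyPipeline`, `fake_model`, and the tiny vocabulary are hypothetical stand-ins invented for illustration; they are not part of the Transformers API, and a real pipeline does considerably more (batching, device placement, framework tensors).

```python
from typing import Callable, Dict, List


class ToyPipeline:
    """Hypothetical stand-in for the preprocess -> forward -> postprocess flow."""

    def __init__(self, vocab: Dict[str, int], model: Callable[[List[int]], float]):
        self.vocab = vocab
        self.model = model

    def preprocess(self, text: str) -> List[int]:
        # "Tokenization": lowercase, split on whitespace, map to ids (0 = unknown).
        return [self.vocab.get(tok, 0) for tok in text.lower().split()]

    def postprocess(self, positive_prob: float) -> Dict[str, object]:
        # Turn the raw model output into a human-readable label and score.
        label = "POSITIVE" if positive_prob >= 0.5 else "NEGATIVE"
        score = positive_prob if label == "POSITIVE" else 1.0 - positive_prob
        return {"label": label, "score": round(score, 4)}

    def __call__(self, text: str) -> Dict[str, object]:
        ids = self.preprocess(text)
        return self.postprocess(self.model(ids))


# Made-up vocabulary: id 1 marks positive words, id 2 marks negative words.
vocab = {"great": 1, "love": 1, "terrible": 2, "disappointed": 2}


def fake_model(ids: List[int]) -> float:
    # Fake "inference": share of sentiment-bearing tokens that are positive.
    pos = sum(1 for i in ids if i == 1)
    neg = sum(1 for i in ids if i == 2)
    if pos + neg == 0:
        return 0.5
    return pos / (pos + neg)


toy = ToyPipeline(vocab, fake_model)
result = toy("I am disappointed and this is terrible")
print(result)
```

The point of the sketch is only the shape of the call: one object accepts raw text and returns a labelled result, hiding the intermediate ids.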

In [None]:
from transformers import pipeline

classifier = pipeline("text-classification")

### Question 2: Text Classification Deep Dive

1. **Default model used:**  
   The default model is typically `distilbert-base-uncased-finetuned-sst-2-english`. This is a DistilBERT model (a smaller, faster version of BERT) that has been fine-tuned on the SST-2 (Stanford Sentiment Treebank v2) dataset.

2. **Dataset and text type:**  
   - **Dataset:** SST-2 (Stanford Sentiment Treebank v2) - a binary sentiment classification dataset with movie reviews
   - **Works best with:** English text, particularly reviews, opinions, and subjective text. It's trained on movie reviews, so it performs well on similar opinionated text.

3. **What does `score` represent?**  
   The `score` is the probability the model assigns to the predicted label. It's a value between 0 and 1, where:
   - Values closer to 1 indicate higher confidence
   - For a binary classifier, values closer to 0.5 indicate uncertainty
   - The scores across all labels sum to 1 when they come from a softmax; by default the pipeline returns only the top label (in recent versions, passing `top_k=None` returns every label's score)

4. **Emotion classification model:**  
   One example is `j-hartmann/emotion-english-distilroberta-base` which classifies text into 7 emotions: anger, disgust, fear, joy, neutral, sadness, and surprise. Another popular one is `bhadresh-savani/bert-base-uncased-emotion` which classifies into 6 emotions.
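
As a rough illustration of where a `score` comes from, the sketch below turns raw logits into softmax probabilities and reports the top label. The logit values are made up, and this assumes a single-label classifier; multi-label setups usually apply a sigmoid per label instead.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
logits = [2.0, -1.0]  # made-up model outputs for one input text

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

Because the probabilities sum to 1, a top score near 0.5 in the two-label case means the two labels were almost equally likely.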

In [None]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)

## Named Entity Recognition

### Question 3: Named Entity Recognition (NER)

1. **What does `aggregation_strategy="simple"` do?**  
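   `aggregation_strategy="simple"` merges adjacent tokens that carry the same entity tag into a single entity and averages their scores. For example, if "Optimus" and "Prime" are both tagged as MISC, they are combined into the single entity "Optimus Prime". Without aggregation, the pipeline returns one row per token, including `##` subword pieces. Note that `"simple"` does not force the pieces of one word to share a tag, so a word can still come back split when its subwords are labeled differently.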

2. **Entity types meaning:**  
   - **ORG** (Organization): Companies, institutions, groups (e.g., "Amazon", "Autobots")
   - **MISC** (Miscellaneous): Other named entities that don't fit other categories (e.g., "Optimus Prime", product names)
   - **LOC** (Location): Geographic locations (e.g., "Germany", "New York")
   - **PER** (Person): Person names (e.g., "Bumblebee" - though this is actually a character name)

3. **Why `##` prefix?**  
   The `##` prefix indicates that this token is a **subword continuation** produced by WordPiece tokenization, which BERT-style models use (BPE tokenizers such as GPT-2's mark word boundaries differently). When a word is split into multiple tokens, the first token keeps the original form, and subsequent tokens get the `##` prefix. For example:
   - "Megatron" → ["Mega", "##tron"]
   - "Decepticons" → ["Decept", "##icons"]
   This is how tokenizers handle out-of-vocabulary words by breaking them into smaller pieces.

4. **Why are "Megatron" and "Decepticons" split incorrectly?**  
   These are fictional names from the Transformers franchise that are not in the tokenizer's vocabulary as whole words, so WordPiece breaks them into subwords. The NER model was fine-tuned on CoNLL-2003, which consists of news articles about real-world entities, so it has seen little or no text like this and can assign different tags to the pieces of one word. With `aggregation_strategy="simple"`, pieces with different tags are not merged, which is why the names appear fragmented. This tells us that the model's training domain (news) does not cover fictional content.

5. **CoNLL-2003 dataset:**  
   CoNLL-2003 is a Named Entity Recognition dataset from the Conference on Computational Natural Language Learning. It contains English news articles annotated with four entity types: PER, LOC, ORG, and MISC. The dataset is split into training, validation, and test sets. Tokenizer choice affects NER because:
   - Better tokenization preserves entity boundaries
   - Subword tokenization can fragment entities
   - Case-sensitive tokenizers (like `bert-large-cased`) preserve capitalization cues important for NER
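
The `##` convention from item 3 can be sketched with a greedy longest-match-first splitter, which is the strategy WordPiece uses at tokenization time. The vocabulary below is a tiny invented one chosen to reproduce the splits shown above; a real BERT vocabulary has tens of thousands of entries, and `wordpiece_tokenize` is a hypothetical helper rather than a library function.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Invented toy vocabulary, for illustration only.
toy_vocab = {"Mega", "##tron", "Decept", "##icons", "Germany"}

for w in ["Germany", "Megatron", "Decepticons", "Bumblebee"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

Words that exist whole in the vocabulary stay intact, which is why common real-world names tend to survive tokenization while rare fictional ones get fragmented.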

In [None]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)
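
The grouping requested with `aggregation_strategy="simple"` in the cell above can be approximated as follows. This is a simplified sketch under stated assumptions: the tags, words, and scores are made up, `group_entities` is a hypothetical helper, and the `B-`/`I-` prefix handling of the real implementation is left out.

```python
def group_entities(tagged_tokens):
    """Merge adjacent tokens that share an entity tag; average their scores."""
    groups = []
    current = None
    for tok in tagged_tokens:
        label = tok["entity"]
        if label == "O":
            current = None  # a non-entity token ends the running group
            continue
        if current is not None and current["entity_group"] == label:
            piece = tok["word"]
            if piece.startswith("##"):
                current["word"] += piece[2:]  # glue subword onto the previous piece
            else:
                current["word"] += " " + piece
            current["scores"].append(tok["score"])
        else:
            current = {"entity_group": label, "word": tok["word"], "scores": [tok["score"]]}
            groups.append(current)
    return [
        {
            "entity_group": g["entity_group"],
            "word": g["word"],
            "score": sum(g["scores"]) / len(g["scores"]),
        }
        for g in groups
    ]


# Made-up per-token predictions.
tagged = [
    {"word": "Optimus", "entity": "MISC", "score": 0.98},
    {"word": "Prime", "entity": "MISC", "score": 0.99},
    {"word": "action", "entity": "O", "score": 0.99},
    {"word": "Mega", "entity": "MISC", "score": 0.60},
    {"word": "##tron", "entity": "PER", "score": 0.50},
]

entities = group_entities(tagged)
for e in entities:
    print(e)
```

In this sketch "Mega" and "##tron" stay separate because their tags differ, which mirrors the fragmented rows that can show up in the NER output for unfamiliar names.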

### Question 4: Question Answering Systems

1. **Type of QA:**  
   This is **Extractive Question Answering**. The model extracts a span of text directly from the input context that answers the question, rather than generating a new answer.

2. **What do start/end indices represent?**  
   The `start` and `end` values are character offsets into the original context string, so slicing the context with `[start:end]` reproduces the answer. They're important because:
   - They allow you to extract the exact answer substring from the original text
   - They enable highlighting/visualization of answers
   - They provide interpretability (you can see exactly what text the model selected)
   - They tie the answer to one specific occurrence in the context, which matters when the same words appear more than once (SQuAD-style exact match and F1 are then computed on the extracted answer text)

3. **What is the SQuAD dataset?**  
   SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text (span) from the corresponding reading passage. SQuAD 1.1 has 100,000+ question-answer pairs. SQuAD 2.0 includes unanswerable questions.

4. **Questions the model cannot answer:**  
   - Questions requiring information not in the text: "What is the customer's email address?"
   - Questions requiring reasoning beyond the text: "Why is the customer more upset about receiving Megatron than Optimus Prime?" (requires knowledge about Transformers lore)
   - Questions requiring numerical calculations: "How many days ago did the customer order?"
   - Questions about future events or hypotheticals: "What will Amazon do next?"
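   The model fails because it can only extract spans from the given context and cannot perform external reasoning or access world knowledge. A model trained only on answerable questions (such as SQuAD 1.1) will still return some span in these cases, often with a low `score`.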

5. **Extractive vs Generative QA:**  
   - **Extractive QA:** Selects a span from the context (faster, more accurate for factual questions, but limited to what's in the text)
   - **Generative QA:** Generates new text as an answer (can synthesize information, answer questions not directly in text, but may hallucinate)
   
   **Example generative QA model:** `google/flan-t5-base` or `google/flan-t5-large` can be used for generative QA. Models like `microsoft/DialoGPT-medium` or `facebook/blenderbot-400M-distill` are conversational models that can generate answers.
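
The role of the `start` and `end` offsets from item 2 is easiest to see by slicing. The output dictionary below is hand-built to mimic the shape of a question-answering result; the score is made up and the offsets are computed with `str.find` rather than by a model.

```python
context = "I demand a full refund and a replacement with the correct action figure."
answer_text = "a full refund"

# Build a result in the same shape as a question-answering output (made-up score).
start = context.find(answer_text)
end = start + len(answer_text)
output = {"score": 0.63, "start": start, "end": end, "answer": answer_text}

# The offsets index characters, so a plain slice recovers the answer.
extracted = context[output["start"]:output["end"]]

# The same offsets make it easy to highlight the answer in place.
highlighted = (
    context[:output["start"]] + "[[" + extracted + "]]" + context[output["end"]:]
)
print(highlighted)
```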

## Question Answering

In [None]:
reader = pipeline("question-answering")

question = "What does the customer want?"
outputs = reader(question=question, context=text)

pd.DataFrame([outputs])
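
Behind a result like the one above, an extractive reader scores every token as a possible answer start and as a possible answer end, then picks the best valid pair. The sketch below shows that selection step with made-up logits over a handful of tokens; `best_span` and the `max_answer_len` default are illustrative assumptions, and a real pipeline additionally maps token positions back to character offsets.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = (None, None, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


tokens = ["I", "demand", "a", "full", "refund", "and", "a", "replacement"]
# Made-up logits, one per token.
start_logits = [0.1, 0.2, 3.0, 1.0, 0.3, 0.0, 0.5, 0.2]
end_logits = [0.0, 0.1, 0.2, 0.5, 2.5, 0.1, 0.3, 1.9]

s, e, span_score = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])
print(answer, span_score)
```

Because the answer is always a slice of the input, this approach cannot produce text that is absent from the context, which is the limitation discussed in Question 4.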

### Question 5: Text Summarization

1. **Extractive vs Abstractive summarization:**  
   - **Extractive:** Selects and concatenates important sentences/phrases directly from the source text (like highlighting key sentences). Preserves original wording.
   - **Abstractive:** Generates new sentences that capture the meaning, potentially using different words than the source (like writing a summary in your own words). More flexible but harder.

2. **Default model:**  
   The default is typically `sshleifer/distilbart-cnn-12-6`, a distilled BART checkpoint. This is an **abstractive** summarization model using the **BART architecture** (an encoder-decoder transformer that pairs a BERT-like bidirectional encoder with a GPT-like autoregressive decoder). It's trained on the **CNN/DailyMail** dataset (news articles with summaries).

3. **`max_length` and `min_length`:**  
   - `max_length`: Maximum number of tokens in the generated summary
   - `min_length`: Minimum number of tokens in the generated summary
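   - If `min_length > max_length`: the two constraints cannot both be met. `max_length` is the hard stop, so generation ends at `max_length` tokens and the `min_length` constraint is effectively ignored (recent versions log a warning about unfeasible length constraints). The summarization cell in this notebook does exactly this with `max_length=45` and `min_length=56`.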

4. **`clean_up_tokenization_spaces=True`:**  
   This parameter removes extra spaces that can be introduced during tokenization. For example, tokenizers might add spaces around punctuation or split words in ways that create awkward spacing. Setting this to `True` produces cleaner, more readable output by normalizing whitespace.

5. **Two summarization models comparison:**  
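   - **For short texts (news):** `facebook/bart-large-cnn` - BART architecture, fine-tuned on CNN/DailyMail, good for news articles. Its input is limited to 1,024 tokens, so longer articles are truncated.
   - **For longer documents:** `allenai/led-large-16384-arxiv` (Longformer Encoder-Decoder, accepts up to 16,384 input tokens) or `google/bigbird-pegasus-large-arxiv` (BigBird-Pegasus, up to 4,096 input tokens), both fine-tuned on arXiv papers and built on sparse attention to handle long inputs. By contrast, `google/pegasus-xsum` is tuned for very short, single-sentence summaries of news articles rather than for long documents.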
   
   **Why summarization is harder than classification:**  
   - Requires understanding the entire document, not just local patterns
   - Must identify what's important vs. what's detail
   - Needs to maintain coherence and fluency in generated text
   - Abstractive summarization requires generation capabilities (more complex than classification)
   - Must handle variable-length outputs
   - Evaluation is more subjective (multiple valid summaries possible)
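
To contrast with the abstractive models discussed above, here is a minimal extractive summarizer: it scores each sentence by the frequency of its longer words and returns the top sentences verbatim. This is a deliberately naive sketch (the scoring rule, the length-3 cutoff, and `extractive_summary` itself are assumptions made for illustration), not how any of the listed models work.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Ignore short words as a crude stand-in for stop-word removal.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in toks if len(t) > 3)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)


sample = (
    "I ordered Optimus Prime. "
    "The package contained Megatron instead. "
    "I want Optimus Prime, not Megatron."
)
summary = extractive_summary(sample, num_sentences=1)
print(summary)
```

Every word of the result is copied from the source, so nothing can be invented, but the summary also cannot rephrase or merge ideas the way an abstractive model can. Summing frequencies also favors longer sentences, one of several reasons real extractive systems use more careful scoring.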

### Question 6: Machine Translation

1. **Architecture of `Helsinki-NLP/opus-mt-en-de`:**  
   - **Architecture:** MarianMT (Marian Neural Machine Translation) - an encoder-decoder transformer architecture optimized for translation
   - **OPUS:** Open Parallel Corpus - a collection of translated texts from various sources, used for training translation models
   - **MT:** Machine Translation

2. **English→French translation models:**  
   - `Helsinki-NLP/opus-mt-en-fr` - MarianMT model for English to French
   - `facebook/mbart-large-50-many-to-many-mmt` - Multilingual BART model supporting English↔French
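   - `t5-base` or `t5-large` - the original T5 checkpoints were trained on a task mixture that includes English→French translation, selected with the prefix `translate English to French: `
   - `Helsinki-NLP/opus-mt-en-roa` (Romance languages including French; as a multi-target model it expects a target-language token such as `>>fra<<` at the start of the input)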

3. **Bilingual vs Multilingual models:**  
   - **Bilingual:** Trained on one language pair (e.g., English↔German). Pros: Often better quality for that specific pair, smaller model size. Cons: Need separate model for each language pair.
   - **Multilingual:** Trained on many language pairs simultaneously. Pros: One model handles many languages, can leverage cross-lingual transfer. Cons: Larger models, may have lower quality per language pair, requires more training data.

4. **How `"translation_en_to_de"` relates to model:**  
   With a task name of the form `"translation_xx_to_yy"`, the pipeline reads the source (`en`) and target (`de`) languages from the name. That matters mainly when no model is given (a default exists only for a few directions) or when the model supports several directions and needs a task prefix or language codes. When you specify `model="Helsinki-NLP/opus-mt-en-de"`, the direction is already fixed by the checkpoint itself, and the task name simply has to agree with it.

5. **What is `sacremoses`?**  
   `sacremoses` is a Python port of the Moses text-processing scripts that are widely used in machine translation. It provides:
   - Tokenization and detokenization (splitting text into tokens and joining them back)
   - Truecasing
   - Punctuation normalization
   MarianMT tokenizers use it for punctuation normalization before applying their SentencePiece model. The warning appears because `sacremoses` is recommended but not strictly required: without it the model still runs, just without that normalization step.

6. **Multilingual translation models:**  
   - **mBART (multilingual BART):** `facebook/mbart-large-50` is the pretrained model covering 50 languages; its fine-tuned variant `facebook/mbart-large-50-many-to-many-mmt` translates directly between any pair of them
   - **M2M-100:** `facebook/m2m100_418M` or `facebook/m2m100_1.2B` supports 100 languages and can translate between any pair of those 100 languages (9,900 possible language pairs!)
   - **OPUS-MT multilingual:** Various models covering multiple language families
   
   **Low-resource language challenges:**  
   - Limited training data available
   - Fewer parallel corpora
   - May need to use transfer learning from high-resource languages
   - Code-switching and dialect variations
   - Evaluation metrics may not work well
   - Need for specialized tokenization
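
The "9,900 possible language pairs" figure for M2M-100 is simple arithmetic: with n languages there are n × (n − 1) ordered source/target pairs. The short sketch below reproduces it and shows why covering the same ground with bilingual models scales poorly; the function name is hypothetical.

```python
def translation_directions(num_languages):
    """Number of ordered (source, target) pairs among num_languages languages."""
    return num_languages * (num_languages - 1)


# One multilingual model versus one bilingual model per direction.
for name, n in [("mBART-50", 50), ("M2M-100", 100)]:
    print(f"{name}: {n} languages -> {translation_directions(n)} directions")
```

A single many-to-many model covers all of these directions, whereas a purely bilingual setup would need one checkpoint per direction.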

In [None]:
summarizer = pipeline("summarization")
# Note: min_length (56) exceeds max_length (45) here, so the two constraints conflict.
outputs = summarizer(text, max_length=45, min_length=56, clean_up_tokenization_spaces=True)
print(outputs[0]["summary_text"])

### Question 7: Text Generation

1. **Default model and architecture:**  
   - **Default model:** `gpt2` (the smallest GPT-2 checkpoint)
   - **Architecture:** **Decoder-only transformer** (not encoder-decoder, not encoder-only). GPT-2 uses a stack of transformer decoder blocks with masked self-attention.
   - **Parameters:** `gpt2` has about **124 million parameters** (the GPT-2 paper originally reported this size as 117M). The larger checkpoints are `gpt2-medium` (about 355M), `gpt2-large` (about 774M), and `gpt2-xl` (about 1.5B).
   - **Generation type:** **Autoregressive generation** - generates tokens one at a time, using previously generated tokens as context for the next token.

2. **Why `set_seed(42)`?**  
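   `set_seed(42)` seeds the random number generators used during sampling (Python's `random`, NumPy, and the deep learning backend) for reproducibility. Without it:
   - Each run produces different outputs (due to random sampling)
   - Results are not reproducible
   - Comparing results or debugging becomes difficult

   With the seed, repeated runs give the same output, provided the library versions, hardware, and generation settings stay the same.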

3. **Other generation parameters:**  
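   - **`temperature`:** Rescales the logits before sampling. Values below 1.0 (e.g. 0.1-0.5) sharpen the distribution and make output more focused; values above 1.0 flatten it and make output more random. Default is 1.0. It only has an effect when `do_sample=True`.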
   - **`top_k`:** Limits sampling to the top K most likely tokens. Reduces chance of low-probability tokens. `top_k=50` means only consider the 50 most likely next tokens.
   - **`do_sample`:** If `True`, uses sampling (random selection based on probabilities). If `False`, uses greedy decoding (always picks most likely token). `do_sample=True` with `temperature=1.0` gives diverse outputs.

4. **Truncation warning:**  
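   The truncation warning comes from the tokenizer call inside the pipeline. In many versions it is triggered simply because `max_length` is passed without an explicit `truncation` setting, so it does not by itself mean that text was cut. Real truncation happens when:
   - The prompt exceeds the model's context window (1,024 tokens for GPT-2)
   - The model can only process a fixed maximum number of tokens
   - Truncation ensures the model can process the input, but you lose information from the truncated part

   The complaint letter used here is far shorter than 1,024 tokens, so the prompt fits comfortably. Keep in mind that `max_length` counts prompt tokens plus generated tokens.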

5. **Setting `pad_token_id` to `eos_token_id`:**  
   GPT-2 doesn't have a padding token by default (it wasn't trained with one). When batching sequences of different lengths, you need a padding token, and `generate` otherwise falls back to the EOS token and logs a message saying so. Setting `pad_token_id=generator.tokenizer.eos_token_id` makes that choice explicit:
   - It silences the "Setting `pad_token_id` to `eos_token_id`" message
   - Using EOS as padding is a common workaround for models without a pad token
   - Padded positions should still be masked out with the attention mask so they don't influence the output

6. **Trade-offs between model size and generation quality:**  
   - **Larger models (GPT-2 XL, GPT-3):** Better coherence, more knowledge, better at following instructions, but require more memory, slower inference, higher cost
   - **Smaller models (GPT-2 base):** Faster, less memory, lower cost, but may produce less coherent text, less knowledge, more repetition
   - **Sweet spot:** Depends on use case - for simple tasks, smaller models suffice; for complex generation, larger models are worth the cost
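
The interaction of `temperature`, `top_k`, and the random seed from items 2 and 3 can be sketched on a toy next-token distribution. The logits below are made up and `sample_next_token` is a hypothetical helper; real generation applies the same ideas over the model's full vocabulary at every step.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=50, rng=None):
    """Sample one token after top-k filtering and temperature scaling."""
    if rng is None:
        rng = random.Random()
    # Keep only the top_k highest-scoring candidates.
    ranked = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Divide by the temperature: below 1 sharpens, above 1 flattens.
    scaled = [score / temperature for _, score in ranked]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    candidates = [tok for tok, _ in ranked]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Made-up logits for the token following "I demand a full ...".
logits = {"refund": 2.0, "replacement": 1.5, "apology": 0.5, "banana": -3.0}

# Same seed, same sequence of samples.
run_a = [sample_next_token(logits, rng=r) for r in [random.Random(42)] for _ in range(5)]
run_b = [sample_next_token(logits, rng=r) for r in [random.Random(42)] for _ in range(5)]
print(run_a)
print(run_b)

# top_k=1 leaves a single candidate, which is equivalent to greedy decoding.
greedy = sample_next_token(logits, top_k=1, rng=random.Random(0))
print(greedy)
```

Seeding makes the sampled sequence repeatable, while `top_k=1` removes the randomness altogether by leaving only the most likely token.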

In [None]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True)
print(outputs[0]["translation_text"])
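
The task name used in the cell above follows a `translation_XX_to_YY` convention. The sketch below shows one way such a name could be split into source and target codes, and how those codes line up with the naming pattern of the OPUS-MT checkpoints. Both helpers are hypothetical, and a checkpoint is not guaranteed to exist on the Hub for every pair that the pattern can produce.

```python
def parse_translation_task(task):
    """Return (source, target) for names like 'translation_en_to_de'."""
    parts = task.split("_")
    if len(parts) == 4 and parts[0] == "translation" and parts[2] == "to":
        return parts[1], parts[3]
    raise ValueError(f"Expected 'translation_XX_to_YY', got {task!r}")


def marian_checkpoint(source, target):
    # Naming pattern only; availability on the Hub must be checked separately.
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


src, tgt = parse_translation_task("translation_en_to_de")
print(src, tgt, marian_checkpoint(src, tgt))
```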

In [None]:
from transformers import pipeline, set_seed

generator = pipeline("text-generation")
set_seed(42)

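outputs = generator(
    text,
    max_length=200,  # total length in tokens: prompt plus generated continuation
    do_sample=True,  # sample instead of greedy decoding, so the seed matters
    pad_token_id=generator.tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS
)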

print(outputs[0]["generated_text"])
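
The `pad_token_id` argument in the cell above matters once several prompts of different lengths are batched together. The sketch below pads a batch with GPT-2's end-of-text id (50256) and builds the matching attention mask. The token ids in the batch are made up, `pad_batch` is a hypothetical helper, and left padding is used here because it is the usual choice for decoder-only generation.

```python
EOS_ID = 50256  # GPT-2's end-of-text token id, reused as the padding id


def pad_batch(sequences, pad_id=EOS_ID):
    """Left-pad token id lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding the model should ignore, 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Made-up token ids for two prompts of different lengths.
batch = [[15496, 995], [464, 6491, 3382, 257, 1336, 12929]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Since the padding id is also the EOS id, the attention mask is what tells the model which positions are filler rather than a genuine end of sequence.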