<a href="https://colab.research.google.com/github/Octave-Horlin/NLP/blob/main/LabTransformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hello world Transformers 👐

In this notebook we will explore the basics of the Hugging Face library by using a pre-trained model to classify text.


## Quick overview of Transformer applications


Let's start by defining a text that we will use to test the model.


In [None]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure
from your online store in Germany. Unfortunately, when I opened the package,
I discovered to my horror that I had been sent an action figure of Megatron
instead! As a lifelong enemy of the Decepticons, I hope you can understand my
dilemma. To resolve the issue, I demand an exchange of Megatron for the
Optimus Prime figure I ordered. Enclosed are copies of my records concerning
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""


## Text Classification

### Question 1: Understanding Pipelines

Before we start using the models, let's understand what we're working with:

1. What is a `pipeline` in Hugging Face Transformers? What does it abstract away from the user?

2. Visit the `pipeline` documentation and list at least 3 other tasks (besides text-classification) that are available.

3. What happens when you don't specify a model in the `pipeline`? How can you specify a specific model?


**Answers:**

1. A `pipeline` in Hugging Face Transformers is a high-level abstraction that encapsulates the entire process necessary for an NLP task. It abstracts away the complexity of:
   - Managing the tokenizer (text tokenization)
   - Loading and initializing the model
   - Handling input/output formats
   - Post-processing the results
   - Managing the device (CPU/GPU)

2. Other tasks available through `pipeline` (besides text-classification, for which `sentiment-analysis` is only an alias) include:
   - `token-classification` (alias `ner`, named entity recognition)
   - `question-answering`
   - `summarization`
   - `translation`
   - `text-generation`
   - `fill-mask`

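3. When you don't specify a model, `pipeline` falls back to a default checkpoint for the requested task and logs a warning that names it. For `text-classification`, the run below defaults to `distilbert/distilbert-base-uncased-finetuned-sst-2-english`. Because defaults can change between library versions, the warning recommends naming the model (and revision) explicitly in production. To choose a model yourself, pass its Hub identifier as `model`: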
   ```python
   classifier = pipeline("text-classification", model="model-name")
   ```
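
To make the abstraction in answer 1 more concrete, here is a toy sketch of the three stages a pipeline chains together: preprocessing (tokenization), the forward pass, and post-processing. Everything in it is hypothetical: the `ToyPipeline` class, its tiny vocabulary, and its hand-picked weights stand in for a real tokenizer and model, and none of it is the Transformers implementation.

```python
import math

class ToyPipeline:
    """Hypothetical stand-in for a text-classification pipeline (illustration only)."""

    # Made-up vocabulary and per-token weights playing the role of a trained model.
    VOCAB = {"[UNK]": 0, "great": 1, "love": 2, "horror": 3, "unfortunately": 4}
    WEIGHTS = {0: 0.0, 1: 2.0, 2: 1.5, 3: -2.5, 4: -1.5}
    LABELS = ["NEGATIVE", "POSITIVE"]

    def preprocess(self, text):
        # Tokenizer step: lowercase, strip punctuation, map words to integer ids.
        words = [w.strip(".,!?") for w in text.lower().split()]
        return [self.VOCAB.get(w, self.VOCAB["[UNK]"]) for w in words]

    def forward(self, token_ids):
        # Model step: produce one logit per label.
        total = sum(self.WEIGHTS[i] for i in token_ids)
        return [-total, total]

    def postprocess(self, logits):
        # Post-processing step: softmax, then keep the best label and its score.
        shift = max(logits)
        exps = [math.exp(x - shift) for x in logits]
        norm = sum(exps)
        probs = [e / norm for e in exps]
        best = max(range(len(probs)), key=lambda i: probs[i])
        return [{"label": self.LABELS[best], "score": probs[best]}]

    def __call__(self, text):
        # A single call hides all three stages from the user.
        return self.postprocess(self.forward(self.preprocess(text)))

toy = ToyPipeline()
result = toy("Unfortunately, I discovered to my horror a mistake!")
print(result)
```

The return value has the same shape as the real pipeline output (a list with one `label`/`score` dictionary), which is why it can be passed straight to `pd.DataFrame`.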


In [None]:
from transformers import pipeline
classifier = pipeline("text-classification")





No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

Device set to use cpu


### Question 2: Text Classification Deep Dive

Now that you've seen text classification in action, explore further:

1. What is the default model used for text-classification? Look at the output above to find its name, then search for it on the Hugging Face Model Hub.

2. What dataset was this model fine-tuned on? What kind of text does it work best with?

3. The output includes a `score` field. What does this score represent? What range of values can it have?

4. Challenge: Find a different text-classification model on the Hub that classifies emotions (not just positive/negative). What is its name?


**Answers:**

1. The default model used for text-classification is `distilbert-base-uncased-finetuned-sst-2-english`. This is a DistilBERT model that has been fine-tuned specifically for sentiment analysis.

2. This model was fine-tuned on the **SST-2 (Stanford Sentiment Treebank v2)** dataset. This dataset contains movie reviews labeled as positive or negative. The model works best with English text that expresses sentiment or opinion, such as reviews, feedback, or opinionated statements.

3. The `score` field is the **probability** the model assigns to the predicted label, obtained by applying a softmax to the model's output logits. It lies between 0 and 1, and values closer to 1 indicate higher confidence. Because softmax normalizes across labels, the scores of all labels sum to 1; for this binary model, a NEGATIVE score of 0.90 implies a POSITIVE score of about 0.10. By default the pipeline returns only the top label.

4. One example of a text-classification model for emotions is `j-hartmann/emotion-english-distilroberta-base`, which classifies text into seven classes: anger, disgust, fear, joy, neutral, sadness, and surprise. Another popular one is `bhadresh-savani/bert-base-uncased-emotion`.
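
The relationship between raw model outputs and the `score` field can be sketched in a few lines. The logits below are hypothetical values chosen for illustration, not numbers taken from the notebook run.

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["NEGATIVE", "POSITIVE"]
logits = [1.8, -0.4]  # hypothetical raw model outputs

probs = softmax(logits)
best = max(range(len(probs)), key=lambda i: probs[i])

# Mimic the pipeline's default output: only the top label and its probability.
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
print("sum of all label scores:", sum(probs))
```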


In [None]:
import pandas as pd
outputs = classifier(text)
pd.DataFrame(outputs)


| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |


## Named Entity Recognition


In [None]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)


No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/998 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/1.33G [00:00<?, ?B/s]

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/60.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

Device set to use cpu


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.87901 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.55657 | Mega | 208 | 212 |
| 4 | PER | 0.590256 | ##tron | 212 | 216 |
| 5 | ORG | 0.669693 | Decept | 253 | 259 |
| 6 | MISC | 0.498349 | ##icons | 259 | 264 |
| 7 | MISC | 0.775362 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812096 | Bumblebee | 502 | 511 |


### Question 3: Named Entity Recognition (NER)

Let's understand NER better:

1. What does the `aggregation_strategy="simple"` parameter do in the NER pipeline? Check the token classification documentation.

2. Looking at the output above, what do the entity types mean? (ORG, MISC, LOC, PER)

3. Why do some words appear with `##` prefix (like `##tron` and `##icons`)? What does this indicate about tokenization?

4. The model seems to have split "Megatron" and "Decepticons" incorrectly. Why might this happen? What does this tell you about the model's training data?

5. **Challenge:** Find the model card for `dbmdz/bert-large-cased-finetuned-conll03-english`. What is the CoNLL-2003 dataset?

🤔 How might the choice of tokenizer affect NER performance?


**Answers:**

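1. The `aggregation_strategy="simple"` parameter groups consecutive tokens that carry the same entity label into a single entity. Without it (the default is `"none"`), the pipeline returns one prediction per token (subword), which is fragmented and harder to read. Note that `"simple"` does not force the pieces of one word to share a label: if two subwords of the same word receive different labels, they stay separate, as with `Mega`/`##tron` above. The word-level strategies `"first"`, `"average"`, and `"max"` resolve such disagreements.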

2. The entity types represent different categories:
   - **ORG**: Organization (e.g., companies, institutions)
   - **MISC**: Miscellaneous (other named entities that don't fit other categories)
   - **LOC**: Location (geographical locations like cities, countries)
   - **PER**: Person (names of people)

3. The `##` prefix indicates that these are **subword tokens** that are part of a larger word. This happens with WordPiece tokenization (used by BERT models), where words are split into smaller pieces. The `##` prefix means this token is a continuation of the previous token, not the start of a new word.

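4. Two separate effects are at play. First, "Megatron" and "Decepticons" are rare, fictional names that are not in the tokenizer's WordPiece vocabulary as whole words, so each is broken into subwords (`Mega` + `##tron`, `Decept` + `##icons`). Second, the model was fine-tuned on CoNLL-2003 news text, where such names are unlikely to occur, so it is unsure about them and assigns different labels to the pieces of the same word (note the low scores, roughly 0.50 to 0.67). With `aggregation_strategy="simple"`, pieces with different labels are not merged, so the word appears split. Context matters too: the second mention of "Megatron" is tagged as a single MISC entity.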

5. The **CoNLL-2003** dataset is a standard benchmark dataset for Named Entity Recognition. It contains news articles annotated with four entity types (PER, LOC, ORG, MISC). The dataset is in English and German, with training, validation, and test sets. It was created for the CoNLL-2003 shared task on language-independent named entity recognition.
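
The grouping behavior discussed in answers 1 and 4 can be sketched as follows. This is a simplified illustration, not the library's code: the real pipeline also handles `B-`/`I-` tag prefixes, scores, and character offsets. The per-token labels are hypothetical inputs modeled on the table above, and the split of "Bumblebee" into `Bumble` + `##bee` is assumed for the example.

```python
def merge_words(pieces):
    # Join WordPiece tokens: a "##" piece attaches to the previous piece without a space.
    text = pieces[0]
    for piece in pieces[1:]:
        text += piece[2:] if piece.startswith("##") else " " + piece
    return text

def group_simple(tokens):
    # "simple"-style grouping: merge runs of adjacent tokens with the same label.
    groups = []
    previous_label = "O"
    for word, label in tokens:
        if label == "O":               # "O" means "not an entity" and ends any run
            previous_label = "O"
            continue
        if label == previous_label:
            groups[-1][1].append(word)      # extend the current entity
        else:
            groups.append((label, [word]))  # start a new entity
        previous_label = label
    return [(label, merge_words(pieces)) for label, pieces in groups]

def relabel_first(tokens):
    # "first"-style fix: every "##" continuation inherits the label of its word's first piece.
    relabeled = []
    for word, label in tokens:
        if word.startswith("##") and relabeled:
            label = relabeled[-1][1]
        relabeled.append((word, label))
    return relabeled

# Hypothetical per-token predictions (word piece, predicted label).
tokens = [("figure", "O"), ("of", "O"), ("Mega", "MISC"), ("##tron", "PER"),
          ("instead", "O"), ("Bumble", "PER"), ("##bee", "PER")]

simple = group_simple(tokens)
first = group_simple(relabel_first(tokens))
print(simple)
print(first)
```

With the label disagreement on `Mega`/`##tron`, the simple grouping keeps two fragments, while the word-level variant yields a single "Megatron" entity.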

The choice of tokenizer can significantly affect NER performance because:
- Different tokenizers split words differently, which can break entity boundaries
- Subword tokenization (WordPiece, BPE) can fragment multi-word entities
- The tokenizer should match the one used during training for best results
- Some tokenizers preserve more word-level information, which can help with entity recognition
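
To see why the tokenizer's vocabulary determines where words get split, here is a minimal sketch of greedy longest-match-first subword splitting, the matching rule WordPiece uses at inference time. The vocabulary is a tiny hypothetical one; a real BERT vocabulary has roughly 30,000 entries, and real tokenizers add normalization steps omitted here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    # Repeatedly take the longest vocabulary entry that matches at the current position.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, chosen to reproduce the splits seen in the NER output.
vocab = {"Germany", "Mega", "##tron", "Decept", "##icons", "Bumble", "##bee"}

for word in ["Germany", "Megatron", "Decepticons", "Optimus"]:
    print(word, "->", wordpiece(word, vocab))
```

A word that is in the vocabulary stays whole, a rare word is cut at whatever boundaries the vocabulary happens to allow, and entity labels then have to be predicted for each piece.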


## Question Answering


In [None]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cpu


| | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 335 | 358 | an exchange of Megatron |


### Question 4: Question Answering Systems

Explore how question answering works:

1. What type of question answering is this? (Extractive vs. Generative) Check the question answering documentation.

2. The model outputs `start` and `end` indices. What do these represent? Why are they important?

3. What is the SQuAD dataset? (Look up the model `distilbert-base-cased-distilled-squad` on the Hub)

4. Try to think of a question this model CANNOT answer based on the text. Why would it fail?

5. **Challenge:** What's the difference between extractive and generative question answering? Find an example of a generative QA model on the Hub.

💡 **Hint:** Try asking questions that require reasoning or information not in the text. What happens?


**Answers:**

1. This is **extractive question answering**. The model extracts a span of text directly from the provided context document to answer the question, rather than generating a new answer. The model identifies the start and end positions in the context where the answer can be found.

2. The `start` and `end` indices represent the **character positions** in the original context text where the answer begins and ends. They are important because:
   - They allow you to locate the exact answer within the context
   - They enable highlighting or extracting the answer span
   - They provide precise positioning for downstream applications
   - They help verify that the model found a valid answer within the provided context

3. **SQuAD (Stanford Question Answering Dataset)** is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answers to these questions are spans of text from the corresponding reading passage. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles. The model `distilbert-base-cased-distilled-squad` was fine-tuned on this dataset to perform extractive question answering.

4. This model cannot answer questions that:
   - Require information not present in the context text (e.g., "What is the customer's email address?" if it's not in the text)
   - Need reasoning beyond simple extraction (e.g., "Why did the customer choose Amazon?" requires inference)
   - Ask for opinions, predictions, or hypotheticals (e.g., "What should Amazon do next?")
   - Require combining information from multiple parts of the text with complex logic

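   It would fail because extractive QA can only copy a span from the context; it cannot generate new information or reason beyond it. Note that, by default, the pipeline still returns some span for such questions, often with a low `score`, because a model fine-tuned on SQuAD 1.1 has no way to say "no answer".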

5. **Extractive QA** finds and extracts a span of text from the context document. The answer must exist verbatim in the context. Examples include BERT-based models fine-tuned on SQuAD.

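   **Generative QA** generates a new answer that may not appear verbatim in the context, using a text-generation model. For example, an instruction-tuned sequence-to-sequence model such as `google/flan-t5-base` can be prompted with a question and a context and will write the answer in its own words. Conversational models such as `microsoft/DialoGPT-medium` and `facebook/blenderbot-400M-distill` also generate free-form replies, but they are designed for open-ended dialogue rather than for answering questions about a given context.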
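
The span extraction described in answers 1 and 2 can be sketched as follows: the model scores every token as a possible start and as a possible end, the best valid (start, end) pair is chosen, and token positions are mapped back to character offsets in the context. The scores are hypothetical, tokens are split on whitespace for simplicity, and the real pipeline combines probabilities rather than adding raw scores.

```python
import re

def best_span(start_scores, end_scores, max_answer_len=8):
    # Pick the token pair (i, j), with i <= j and a bounded length, that maximizes the combined score.
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_scores):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

context = "I demand an exchange of Megatron for the Optimus Prime figure I ordered."

# Character offsets (start, end) of each whitespace-separated token.
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]

# Hypothetical per-token scores, one per token in `offsets`.
start_scores = [0.1, 0.2, 4.0, 1.0, 0.1, 0.5, 0.1, 0.3, 0.8, 0.1, 0.1, 0.1, 0.1]
end_scores = [0.1, 0.1, 0.2, 0.6, 0.1, 3.5, 0.1, 0.1, 0.3, 2.0, 0.4, 0.1, 0.2]

first_token, last_token = best_span(start_scores, end_scores)
start = offsets[first_token][0]  # character index where the answer begins
end = offsets[last_token][1]     # character index just past the answer
answer = context[start:end]
print({"start": start, "end": end, "answer": answer})
```

Because `start` and `end` index into the context string, `context[start:end]` always reproduces the answer exactly, which is what makes the approach extractive.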


## Summarization


### Question 5: Text Summarization

Before running the summarization code, let's understand how it works:

1. What is the difference between extractive and abstractive summarization? Check the summarization documentation.

2. Looking at the code in the next cell, what is the default model used for summarization? Search for it on the Hugging Face Model Hub and determine:
   - Is it an extractive or abstractive model?
   - What architecture does it use? (Hint: look at the model name)
   - What dataset was it trained on?

3. What do the `max_length` and `min_length` parameters control? What happens if `min_length > max_length`?

4. The parameter `clean_up_tokenization_spaces=True` is used. What does this parameter do? Why might it be useful for summarization?

5. **Challenge:** Find two different summarization models on the Hub:
   - One optimized for short texts (like news articles)
   - One that can handle longer documents
   - Compare their architectures and training data.

💡 Why might summarization be more challenging than text classification? What linguistic capabilities does the model need?


**Answers:**

1. **Extractive summarization** selects and combines existing sentences or phrases directly from the source text to create a summary. It doesn't generate new text, just extracts the most important parts.

   **Abstractive summarization** generates new sentences that may not appear in the original text, paraphrasing and condensing the information. It requires understanding the content and expressing it in a different way.

2. The default model is `sshleifer/distilbart-cnn-12-6`:
   - It is an **abstractive** model (generates new text)
   - It uses the **BART** architecture (Bidirectional and Auto-Regressive Transformers), specifically a distilled version
   - It was trained on the **CNN/DailyMail** dataset, which contains news articles and their summaries
   - The "12-6" in the name refers to 12 encoder layers and 6 decoder layers

3. The `max_length` parameter sets the **maximum number of tokens** in the generated summary, and `min_length` sets the **minimum number of tokens**. If `min_length > max_length`, the constraints contradict each other: no summary can be at least `min_length` tokens long and at most `max_length` tokens long. In recent versions of Transformers, generation typically emits a warning about the unfeasible length constraints and stops at `max_length`, so `min_length` is effectively ignored; the exact behavior may differ between versions.

4. The `clean_up_tokenization_spaces=True` parameter removes extra spaces that can be introduced during tokenization. For example, subword tokenization might add spaces around punctuation or between tokens. This parameter ensures the output text is clean and readable, which is especially important for summarization where the output should be natural and well-formatted.

5. Examples of summarization models:
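   - **Short texts (news articles):** `facebook/bart-large-cnn` - BART fine-tuned on CNN/DailyMail news articles; inputs are limited to 1,024 tokens. `google/pegasus-xsum` is another short-text option: PEGASUS fine-tuned on XSum (BBC articles) to produce one-sentence summaries.
   - **Longer documents:** `allenai/led-large-16384` - LED (Longformer Encoder-Decoder), which accepts up to 16,384 input tokens. This base checkpoint is meant to be fine-tuned; variants fine-tuned on long scientific papers, such as `allenai/led-large-16384-arxiv`, are also on the Hub.
   - **Comparison:** BART and PEGASUS use full self-attention, whose cost grows quadratically with input length, which is why their inputs are short. LED replaces the encoder's full attention with local windowed attention plus a few global tokens, so it scales to much longer inputs.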
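
The effect of `clean_up_tokenization_spaces=True` (answer 4) can be illustrated with a simplified re-creation of the clean-up step. This is a sketch of the kind of replacements applied after decoding, written from memory of the library's behavior; treat the exact replacement list as an assumption rather than the authoritative implementation.

```python
def clean_up_tokenization(text):
    # Remove spaces that decoding tends to leave before punctuation and contractions.
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

# Hypothetical decoder output with tokenization artifacts.
raw = "I do n't think it 's the right figure , Amazon !"
cleaned = clean_up_tokenization(raw)
print(cleaned)
```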

Summarization is more challenging than text classification because:
- It requires **generation** of new text, not just classification
- It needs to understand the **entire document** and identify key information
- It must maintain **coherence** and **fluency** in the generated summary
- It requires **compression** skills to condense information while preserving meaning
- It needs to handle **long-range dependencies** across the entire document
- It must balance between being concise and informative


In [None]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=40, min_length=10, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json: 0.00B [00:00, ?B/s]



pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

Exception ignored in: <function tqdm.__del__ at 0x000001FA5095C180>
Traceback (most recent call last):
  File "c:\Users\octav\AppData\Local\Programs\Python\Python311\Lib\site-packages\tqdm\std.py", line 1148, in __del__
    self.close()
  File "c:\Users\octav\AppData\Local\Programs\Python\Python311\Lib\site-packages\tqdm\notebook.py", line 279, in close
    self.disp(bar_style='danger', check_delay=False)
    ^^^^^^^^^
AttributeError: 'tqdm' object has no attribute 'disp'


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

Device set to use cpu


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure


## Machine Translation

### Question 6: Machine Translation

Let's explore how translation models work:

1. What is the architecture behind the `Helsinki-NLP/opus-mt-en-de` model? Look it up on the Model Hub.
   - What does 'OPUS' stand for?
   - What does 'MT' stand for?

2. How would you find a model to translate from English to French? Visit the translation documentation and the Model Hub to find at least 2 different models.

3. What is the difference between **bilingual** and **multilingual** translation models? What are the advantages and disadvantages of each?

4. In the code, we specify the task as `translation_en_to_de`. How does this relate to the model we're loading?

5. The output shows a warning about `sacremoses`. What is this library used for in NLP? Check the MarianMT documentation.

6. **Challenge:** Find a multilingual model (like mBART or M2M100) that can translate between multiple language pairs. How many language pairs does it support?

🌍 What challenges exist for low-resource languages?


**Answers:**

1. The `Helsinki-NLP/opus-mt-en-de` model uses the **MarianMT** architecture, a Transformer encoder-decoder originally trained with Marian, a neural machine translation framework written in C++:
   - **OPUS** stands for "Open Parallel Corpus" - it's a collection of parallel texts (texts in multiple languages) used for training translation models
   - **MT** stands for "Machine Translation"

2. To find a model for English to French translation, you can:
   - Filter the Hugging Face Model Hub by the Translation task and by language (English and French), or search for names containing "en-fr"
   - Examples include:
     - `Helsinki-NLP/opus-mt-en-fr` (bilingual, MarianMT architecture)
     - `facebook/mbart-large-50-many-to-many-mmt` (multilingual, mBART architecture)
     - `t5-base` (multi-task T5; English to French is one of the translation tasks it was trained on, and the `translation_en_to_fr` pipeline task selects it)

3. **Bilingual models** are trained for a single language pair, often in one direction only (e.g., `opus-mt-en-de` translates English → German):
   - Advantages: Usually higher quality for that specific pair, faster inference, smaller model size
   - Disadvantages: Need separate models for each language pair, cannot handle multiple languages

   **Multilingual models** can translate between multiple language pairs with a single model:
   - Advantages: One model handles many languages, can leverage shared representations across languages
   - Disadvantages: Larger model size, potentially lower quality for individual pairs, more complex training

4. The task `translation_en_to_de` specifies that we want to translate from English (en) to German (de). This task name helps Hugging Face select the appropriate default model and ensures the pipeline knows the translation direction. When we also specify `model="Helsinki-NLP/opus-mt-en-de"`, we're explicitly choosing a model trained for this specific language pair.

5. **Sacremoses** is a Python port of the Moses text-processing tools (tokenizer, detokenizer, truecaser, and punctuation normalizer). The Marian tokenizer uses it as an optional preprocessing step:
   - It normalizes punctuation (for example, unifying quote and dash variants) before the text is split into SentencePiece subwords
   - If the package is missing, the tokenizer still works but skips this normalization and warns that installing `sacremoses` is recommended
   - Cleaner, more consistent input can slightly improve translation quality

6. Examples of multilingual translation models:
   - **mBART-50**: `facebook/mbart-large-50-many-to-many-mmt` - Supports 50 languages and can translate between any pair of those languages
   - **M2M-100**: `facebook/m2m100_418M` - Supports 100 languages and can translate between any pair
   - **NLLB-200**: `facebook/nllb-200-3.3B` - Meta's No Language Left Behind model supporting 200 languages
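
The challenge question asks how many language pairs such models support. If a model can translate between any two of its N languages, the number of translation directions (ordered source/target pairs) is N × (N − 1). The sketch below applies that formula to the three models listed; it gives a theoretical count under that assumption and says nothing about the quality of each direction.

```python
def translation_directions(num_languages):
    # Ordered (source, target) pairs with source != target.
    return num_languages * (num_languages - 1)

models = [("mBART-50", 50), ("M2M-100", 100), ("NLLB-200", 200)]
counts = {name: translation_directions(n) for name, n in models}

for name, total in counts.items():
    print(f"{name}: {total} translation directions")
```

By comparison, covering the same 100 languages with one-direction bilingual models would require 9,900 separate checkpoints.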

Challenges for low-resource languages:
- Limited training data available
- Fewer parallel corpora (texts translated into those languages)
- Lack of quality benchmarks and evaluation datasets
- Models often underperform compared to high-resource languages
- Domain-specific terminology may be missing
- Dialectal variations and informal language are poorly covered


In [None]:
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True)
print(outputs[0]['translation_text'])


config.json: 0.00B [00:00, ?B/s]



pytorch_model.bin:   0%|          | 0.00/298M [00:00<?, ?B/s]



generation_config.json:   0%|          | 0.00/293 [00:00<?, ?B/s]



tokenizer_config.json:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

source.spm:   0%|          | 0.00/768k [00:00<?, ?B/s]

target.spm:   0%|          | 0.00/797k [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

Device set to use cpu


Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt.


## Text Generation


### Question 7: Text Generation

Understand how language models generate text:

1. What is the default model used for text generation in the code below? Look it up on the Hub and answer:
   - What architecture does GPT-2 use? (decoder-only, encoder-decoder, or encoder-only?)
   - How many parameters does the base GPT-2 model have?
   - What type of generation does it perform? (autoregressive, non-autoregressive, etc.)

2. Why do we use `set_seed(42)` before generation? What would happen without it? Check the generation documentation.

3. The code uses `max_length=200`. What other parameters can control text generation? Research and explain:
   - `temperature`
   - `top_k`
   - `do_sample`

4. Looking at the output, you can see a warning about truncation. What does this mean? Why is the input being truncated?

5. What does `pad_token_id` being set to `eos_token_id` mean? Why is this necessary for GPT-2?

6. What are the trade-offs between model size and generation quality?


**Answers:**

1. The default model is **GPT-2** (`openai-community/gpt2`):
   - **Architecture:** GPT-2 uses a **decoder-only** Transformer architecture. It has only decoder layers (no encoder), with self-attention mechanisms and feed-forward networks.
   - **Parameters:** The base GPT-2 model has **124 million parameters**. There are also larger variants: GPT-2 Medium (355M), GPT-2 Large (774M), and GPT-2 XL (1.5B).
   - **Generation type:** GPT-2 performs **autoregressive generation**, meaning it generates text one token at a time, with each new token being conditioned on all previously generated tokens.

2. `set_seed(42)` ensures **reproducibility** by fixing the random number generator seed. Without it:
   - Each run would produce different outputs (non-deterministic)
   - Results would be difficult to reproduce for debugging or comparison
   - The generation would be stochastic, making it harder to test and validate
   - Setting a seed allows for consistent, reproducible results across runs

3. Parameters that control text generation:
   - **`temperature`**: Controls randomness in generation. Lower values (0.1-0.7) make output more deterministic and focused, higher values (0.8-1.5) make it more creative/random. Default is 1.0.
   - **`top_k`**: Limits sampling to the k most likely next tokens. Reduces randomness by ignoring low-probability tokens. Common values: 50, 100.
   - **`do_sample`**: Boolean flag that enables/disables sampling. If `False`, uses greedy decoding (always picks most likely token). If `True`, uses sampling based on probability distribution.

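4. A truncation warning means the input text is being **cut off** because it exceeds a length limit. GPT-2 has a maximum context length of 1,024 tokens, which the prompt and the generated tokens must share, and `truncation=True` tells the tokenizer to shorten over-long inputs. (The run below actually shows a related length warning instead: both `max_new_tokens` (=256, apparently a default, since the code does not set it) and `max_length` (=200) are set, and `max_new_tokens` takes precedence.) Truncation is necessary because: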
   - Transformers have fixed maximum context windows
   - The model needs to fit within memory constraints
   - Truncation prevents errors from sequences that are too long

5. Setting `pad_token_id` to `eos_token_id` is necessary because:
   - GPT-2 doesn't have a dedicated padding token in its vocabulary
   - During generation, a padding token is needed to fill sequences that finish early or are shorter than others in a batch
   - Reusing the end-of-sequence (EOS) token as padding gives `generate` a valid token id for this purpose
   - If it is not set, `generate` falls back to the EOS token anyway but logs a notice on each call; passing it explicitly silences that notice

6. Trade-offs between model size and generation quality:
   - **Larger models** (e.g., GPT-2 XL, GPT-3): Better quality, more coherent text, better understanding of context, but require more memory, slower inference, higher cost
   - **Smaller models** (e.g., GPT-2 base): Faster inference, less memory, lower cost, but may produce less coherent or less contextually appropriate text
   - Generally, quality improves with size up to a point, but with diminishing returns
   - Context length does not necessarily grow with size: all GPT-2 variants share the same 1,024-token window
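
How `do_sample`, `temperature`, `top_k`, and the random seed interact (answers 2 and 3) can be sketched with a toy next-token picker. The vocabulary and logits are hypothetical, and `random.seed` stands in for `set_seed`, which additionally seeds NumPy and PyTorch; a real model recomputes the logits after every generated token.

```python
import math
import random

def next_token(logits, do_sample=False, temperature=1.0, top_k=None):
    if not do_sample:
        # Greedy decoding: always pick the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature rescales logits: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # top_k keeps only the k highest-scoring candidates.
    candidates = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Sample among the candidates in proportion to their softmax weights.
    shift = max(scaled[i] for i in candidates)
    weights = [math.exp(scaled[i] - shift) for i in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]

vocab = ["Dear", "Hello", "Thanks", "Sorry", "Ugh"]  # hypothetical vocabulary
logits = [2.0, 1.5, 0.5, 0.2, -1.0]                  # hypothetical next-token scores

greedy = vocab[next_token(logits)]

random.seed(42)
first_run = [vocab[next_token(logits, do_sample=True, temperature=0.8, top_k=3)] for _ in range(5)]
random.seed(42)
second_run = [vocab[next_token(logits, do_sample=True, temperature=0.8, top_k=3)] for _ in range(5)]

print("greedy:", greedy)
print("sampled:", first_run)
print("same seed, same output:", first_run == second_run)
```

Resetting the seed before each run reproduces the same samples, and with `top_k=3` the two least likely tokens can never be drawn.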


In [None]:
from transformers import set_seed

generator = pipeline("text-generation", model="openai-community/gpt2")
set_seed(42)  # fix the random seed so the sampled text is reproducible
prompt = "Customer service response: "
# Prompt the model with the first 100 characters of the complaint
outputs = generator(prompt + text[:100], max_length=200, num_return_sequences=1,
                   pad_token_id=generator.tokenizer.eos_token_id, truncation=True)
# Drop the prompt prefix and print only what follows it
print("Customer service response:", outputs[0]['generated_text'].split("Customer service response: ")[1])


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

Device set to use cpu
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Customer service response: Dear Amazon, last week I ordered an Optimus Prime action figure
from your online store in Germany. Ugh, I was not able to pick up my order, i am working hard and would like to make an order here. I am so happy because I am looking forward to getting my picture taken with my new figure. Thank you.

Hi, I have ordered a lot of Optimus Prime action figures so I'm going to try to order some more. But, I'm not sure what to say, I just hope you will give me a discount on some of the products in my order.

Thank you again for your time and understanding.

Best regards,

S.

Dear Santa,

Thank you so much for the great thank you! I was told that you would send me another order for my order, but I was told that you would not. I've ordered this figure for my friend to get to see his favorite movie. She is a fan, and I was the only one who could see the movie in the movie theater (she has no idea my boyfriend is in the movie theater). So I am really looking forward to y