# Hello world transformers ✨

In this notebook we will explore the basics of the Hugging Face Transformers library by applying pre-trained models to a sample text. Here is the example text we will use to test our models.

In [3]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

## Question 1: Understanding Pipelines
### Question 1.1 — What is a pipeline in Hugging Face Transformers? What does it abstract away from the user?

A **pipeline** in Hugging Face Transformers is a high-level API that provides an easy way to use pre-trained models for common NLP (and multimodal) tasks.

It abstracts away many technical details, including:
- Loading the correct **pre-trained model** for a given task
- Loading and applying the appropriate **tokenizer**
- Handling **preprocessing** (tokenization, padding, truncation)
- Running **inference** with the model
- Applying **postprocessing** to convert raw model outputs into human-readable results (labels, scores, generated text, etc.)

This allows users to perform complex tasks with just a few lines of code, without needing to understand the low-level model architecture or training details.

---

### Question 1.2 — List at least 3 other tasks available in pipelines (besides text-classification)

Besides `text-classification` (for which `sentiment-analysis` is just an alias), Hugging Face pipelines support many other tasks. Examples include:

- **question-answering** – extract answers from a context given a question
- **text-generation** – generate text using language models (e.g., GPT-style models)
- **summarization** – generate concise summaries of long texts
- **translation** – translate text between languages
- **token-classification** (alias `ner`) – identify entities such as names, locations, and organizations
- **fill-mask** – predict missing words in masked sentences

---

### Question 1.3 — What happens when you don’t specify a model? How can you specify a specific model?

When you do **not** specify a model in a pipeline, Hugging Face automatically loads a **default pre-trained model** that is considered suitable for the chosen task.

For example:
- `pipeline("text-classification")` loads a default sentiment-analysis model.

To specify a **specific model**, you can pass its name explicitly using the `model` argument:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
```


In [4]:
from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


## Question 2 — Text Classification Deep Dive

### Question 2.1 What is the default model used for text-classification?

When you run a Hugging Face `pipeline("text-classification")` **without specifying a model**, it automatically loads a default pre-trained model.  
In most recent versions of the Transformers library, this default is:

**`distilbert-base-uncased-finetuned-sst-2-english`**, a DistilBERT model fine-tuned specifically for sentiment classification. The log output above shows its full Hub name, `distilbert/distilbert-base-uncased-finetuned-sst-2-english`.

---

### Question 2.2 What dataset was this model fine-tuned on? What kind of text does it work best with?

The model `distilbert-base-uncased-finetuned-sst-2-english` was fine-tuned on the **Stanford Sentiment Treebank (SST-2)** dataset, a benchmark dataset for sentiment analysis.  
This dataset consists of sentences from English movie reviews labeled as **positive** or **negative**, so the model is best suited to binary sentiment classification of short English text.

So it works best with **general English text where sentiment (positive/negative) is clear**, such as reviews, tweets, or customer feedback.

---

### Question 2.3 What does the score field represent? What range of values can it have?

In the output of the classification pipeline, the `"score"` field represents the model’s **confidence (probability)** in its prediction for that label.  
- It is the **softmax probability** assigned to the predicted label.
- The value ranges from **0.0 to 1.0**.
- A **higher score** means the model is more confident in that label.
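
The softmax step mentioned above can be sketched in plain Python. This is only an illustration: the logits are made-up values, not outputs of the actual model, and the label order is an assumption.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two sentiment classes
labels = ["NEGATIVE", "POSITIVE"]
logits = [1.8, -0.4]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

# Same shape as one entry of the pipeline output: the top label and its probability
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

With only two labels, the score of the winning label can never drop below 0.5, because the two probabilities sum to 1.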

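By default the pipeline returns only the top label. If it is asked to return the scores of all labels (for instance with `top_k=None`), the output looks like this (illustrative values):

```python
[{'label': 'NEGATIVE', 'score': 0.95},
 {'label': 'POSITIVE', 'score': 0.05}]
```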


In [5]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)    

Unnamed: 0,label,score
0,NEGATIVE,0.901546


## Question 3 — Named Entity Recognition (NER)


In [6]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)    

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


Unnamed: 0,entity_group,score,word,start,end
0,ORG,0.879009,Amazon,5,11
1,MISC,0.990859,Optimus Prime,36,49
2,LOC,0.999755,Germany,90,97
3,MISC,0.556567,Mega,208,212
4,PER,0.590258,##tron,212,216
5,ORG,0.669693,Decept,253,259
6,MISC,0.498349,##icons,259,264
7,MISC,0.775361,Megatron,350,358
8,MISC,0.987854,Optimus Prime,367,380
9,PER,0.812096,Bumblebee,502,511



### Question 3.1 What does `aggregation_strategy="simple"` do in the NER pipeline?

In a Hugging Face NER (token-classification) pipeline, `aggregation_strategy="simple"` groups together consecutive tokens that belong to the **same entity type** into a single entity span.

Without aggregation, the model outputs predictions **per token**, which can be difficult to interpret when words are split into sub-tokens.  
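With `"simple"` aggregation:
- adjacent tokens that received the same entity label are merged into one span,
- the merged span is reported with that label in the `entity_group` field,
- a single confidence score (the mean of the token scores) is reported for the span.

Because grouping relies only on the predicted labels, sub-tokens of one word that receive different labels stay separate. This is why `Mega` (MISC) and `##tron` (PER) appear as two rows in the table above. The `"first"`, `"average"`, and `"max"` strategies resolve such conflicts at word level.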

This makes the output more readable and closer to how humans think about named entities.
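
The grouping logic can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: the real pipeline also handles `B-`/`I-` tag prefixes and character offsets, and the per-token labels and scores below are hypothetical.

```python
def aggregate_simple(tokens):
    """Merge consecutive tokens that share the same predicted label.

    `tokens` is a list of (piece, label, score) tuples in reading order;
    the label "O" marks tokens that are not part of any entity.
    """
    groups = []
    open_group = None
    for piece, label, score in tokens:
        if label == "O":
            open_group = None  # a non-entity token ends the current span
            continue
        if open_group is not None and open_group["entity_group"] == label:
            # "##" pieces continue the previous piece, other pieces start a new word
            open_group["word"] += piece[2:] if piece.startswith("##") else " " + piece
            open_group["scores"].append(score)
        else:
            open_group = {"entity_group": label, "word": piece, "scores": [score]}
            groups.append(open_group)
    # One score per span: the mean of its token scores
    return [
        {"entity_group": g["entity_group"], "word": g["word"],
         "score": sum(g["scores"]) / len(g["scores"])}
        for g in groups
    ]

# Hypothetical per-token predictions
token_predictions = [
    ("Opt", "MISC", 0.99), ("##imus", "MISC", 0.99), ("Prime", "MISC", 0.98),
    ("figure", "O", 0.99),
    ("Mega", "MISC", 0.56), ("##tron", "PER", 0.59),
]

entities = aggregate_simple(token_predictions)
for entity in entities:
    print(entity)
```

The three `MISC` pieces collapse into `Optimus Prime`, while `Mega` and `##tron` remain two separate rows because their labels differ.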

---

### Question 3.2 What do the entity types mean? (ORG, MISC, LOC, PER)

The entity types come from the **CoNLL-2003 annotation scheme**:

- **PER**: Person  
  → names of people (e.g. *Bumblebee*)
- **ORG**: Organization  
  → companies, institutions, groups (e.g. *Amazon*)
- **LOC**: Location  
  → geographical locations such as cities, countries, regions
- **MISC**: Miscellaneous  
  → entities that do not fit the other categories, such as products, events, nationalities, or fictional groups

---

### Question 3.3 Why do some words appear with a `##` prefix (e.g. `##tron`, `##icons`)?

The `##` prefix indicates that the tokenizer uses **subword tokenization** (specifically WordPiece tokenization).

- A token **without** `##` marks the **start of a word**.
- A token **with** `##` means “this piece continues the previous token”.

For example:
- `Megatron` → `Mega` + `##tron`
- `Decepticons` → `Decept` + `##icons`

This allows the model to handle:
- rare or unknown words,
- new words formed from known subwords,
- large vocabularies efficiently.
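
To make the splitting concrete, here is a minimal sketch of greedy longest-match-first WordPiece tokenization for a single word. The vocabulary is a tiny hypothetical one chosen for illustration; the real BERT vocabulary contains about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"Amazon", "Me", "Mega", "##ga", "##tron", "Decept", "##icons"}

for word in ["Amazon", "Megatron", "Decepticons", "Bumblebee"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

A word that is in the vocabulary stays whole, a rare word is covered by several known pieces, and a word with no matching pieces falls back to the unknown token.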

---

### Question 3.4 Why were "Megatron" and "Decepticons" split incorrectly? What does this say about the training data?

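Two things happen here. First, the WordPiece tokenizer splits both words into sub-tokens because they are not in its vocabulary. Second, the model assigns **different labels** to the sub-tokens (`Mega` → MISC, `##tron` → PER; `Decept` → ORG, `##icons` → MISC), so `"simple"` aggregation cannot merge them back into one entity.

Reasons:
- They are **fictional names** from popular culture.
- The model was fine-tuned mainly on **news articles**, not on movie or toy-related text.
- As a result, the model has likely seen these words rarely or never, and its predictions for their sub-tokens are inconsistent and low-confidence (scores between roughly 0.50 and 0.67 in the table above).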

This tells us that:
- NER models perform best on text **similar to their training domain**.
- Out-of-domain entities (fictional characters, slang, brand-new names) are harder to recognize correctly.

---

### Question 3.5 What is the CoNLL-2003 dataset?

The **CoNLL-2003** dataset is a standard benchmark dataset for Named Entity Recognition.

It consists of:
- English newswire articles (from Reuters)
- Manually annotated named entities
- Four entity types: **PER, ORG, LOC, MISC**

The model  
**`dbmdz/bert-large-cased-finetuned-conll03-english`**  
is a BERT model fine-tuned specifically on this dataset, which explains why it performs well on **formal English news text**.

---

### How might the choice of tokenizer affect NER performance?

The tokenizer directly affects how text is split into tokens, which in turn affects entity recognition.

- **Over-splitting words** can make it harder to correctly assign entity labels.
- **Cased tokenizers** (that preserve uppercase letters) usually perform better for NER, since capitalization carries important information (e.g. names).
- Tokenizers trained on **domain-specific data** (medical, legal, social media) can significantly improve NER performance in those domains.

In short, a tokenizer that better matches the **language style and vocabulary** of the task will usually lead to better NER results.


## Question 4 — Question Answering Systems


In [7]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])    

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


Unnamed: 0,score,start,end,answer
0,0.631292,335,358,an exchange of Megatron


### Question 4.1 What type of question answering is this? (Extractive vs. Generative)

This is **extractive question answering**.

In extractive QA, the model:
- is given a **context** (a passage of text),
- selects a **span of text directly from that context** as the answer.

It does **not generate new text**; it only extracts what already exists in the input.  
This is exactly how the Hugging Face `question-answering` pipeline works by default.

---

### Question 4.2 What do the start and end indices represent? Why are they important?

The `start` and `end` indices represent the **character positions** in the input context that delimit the extracted answer span.

They are important because:
- they specify *where* in the original text the answer was found,
- they allow precise extraction of the answer,
- they make the model’s decision **traceable and interpretable**.

In extractive QA, predicting the correct start and end positions is the core learning task.
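
As a rough sketch of that idea, the snippet below picks the span whose start and end scores sum to the highest value, then recovers character offsets by locating the answer in the context. The tokens and scores are hypothetical, and the real pipeline adds further details (for example, ignoring question tokens), so this is an illustration rather than the library's algorithm.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) token indices with the highest combined score."""
    best = None
    best_score = float("-inf")
    for i, start_score in enumerate(start_scores):
        # The end may not come before the start, and the span length is capped
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best

# Hypothetical tokens and per-token scores
tokens = ["I", "demand", "an", "exchange", "of", "Megatron"]
start_scores = [0.1, 0.2, 2.5, 0.3, 0.1, 0.4]
end_scores = [0.0, 0.1, 0.2, 0.5, 0.3, 3.0]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])

# Character offsets play the role of the `start` and `end` columns in the output table
context = " ".join(tokens)
char_start = context.find(answer)
char_end = char_start + len(answer)
print(answer, char_start, char_end)
```

Slicing the context with the character offsets, `context[char_start:char_end]`, gives back exactly the answer string.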

---

### Question 4.3 What is the SQuAD dataset?

**SQuAD (Stanford Question Answering Dataset)** is a benchmark dataset for extractive QA.

It consists of:
- Wikipedia passages (contexts),
- human-written questions,
- answers that are **exact spans from the context**.

The model  
`distilbert-base-cased-distilled-squad`  
is a DistilBERT model fine-tuned on SQuAD, meaning it is optimized to:
- read factual text,
- locate short, precise answers within a passage.

---

### Question 4.4 Try to think of a question this model CANNOT answer. Why would it fail?

Example of a question the model cannot answer:

> **"Why does Bumblebee dislike the Decepticons?"**

It would fail because:
- the answer requires **reasoning and background knowledge**,
- the context does not explicitly state the reason,
- extractive models cannot infer or invent explanations.

Another failing example:

> **"What happened after Amazon replied?"**

This information is **not present in the text**, so the model has nothing to extract.

---

### Question 4.5 Challenge — Difference between extractive and generative QA

| Extractive QA | Generative QA |
|---------------|---------------|
| Selects answers directly from the context | Generates new answers in natural language |
| Answer must exist verbatim in the text | Answer may not appear in the text |
| Uses start/end span prediction | Uses text generation |
| More reliable and factual | More flexible but can hallucinate |

---

### Question 4.6 Example of a generative QA model

An example of a **generative QA model** on Hugging Face is:

- `google/flan-t5-base`

This model can:
- reason over text,
- generate full-sentence answers,
- answer questions even if the exact wording is not in the context.

---

### Asking questions the extractive model cannot answer



In [8]:
from transformers import pipeline
import pandas as pd

reader = pipeline("question-answering")

# A question answerable from the text
question_ok = "What does the customer want?"
out_ok = reader(question=question_ok, context=text)

# A question that requires reasoning or external knowledge
question_fail = "Why does Bumblebee hate the Decepticons?"
out_fail = reader(question=question_fail, context=text)

pd.DataFrame([out_ok, out_fail])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


Unnamed: 0,score,start,end,answer
0,0.631292,335,358,an exchange of Megatron
1,0.151185,266,302,I hope you can understand my dilemma


We see that the model always returns a span of text, even when the question cannot truly be answered from the context.
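
One simple way to handle this is to treat low-scoring predictions as "no answer". The sketch below applies a cutoff to the two predictions shown in the table; the threshold value of 0.3 is an arbitrary assumption that would need tuning on real data.

```python
NO_ANSWER_THRESHOLD = 0.3  # hypothetical cutoff, not a library default

def answer_or_none(prediction, threshold=NO_ANSWER_THRESHOLD):
    """Return the extracted answer only if the model is confident enough."""
    if prediction["score"] >= threshold:
        return prediction["answer"]
    return None

# Predictions copied from the output table above
pred_ok = {"score": 0.631292, "start": 335, "end": 358,
           "answer": "an exchange of Megatron"}
pred_fail = {"score": 0.151185, "start": 266, "end": 302,
             "answer": "I hope you can understand my dilemma"}

print(answer_or_none(pred_ok))
print(answer_or_none(pred_fail))
```

A score threshold is only a heuristic: a confident but wrong span would still pass, so it reduces bad answers rather than eliminating them.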

## Question 5 — Text Summarization

### Question 5.1 What is the difference between extractive and abstractive summarization?

- **Extractive summarization** selects sentences or phrases **directly from the original text** and concatenates them to form a summary.  
  The summary only contains text that already exists in the document and does not generate new wording.

- **Abstractive summarization** generates a **new summary in natural language** by paraphrasing, compressing, or reformulating the original content.  
  The summary may use words or sentence structures that do not appear in the source text.

The Hugging Face `summarization` pipeline relies on **abstractive summarization models**.

---

### Question 5.2 What is the default model used for summarization?

When the summarization pipeline is used without specifying a model, the log output of the summarization cell below shows that the default model is:

- **sshleifer/distilbart-cnn-12-6**

From the Hugging Face Model Hub:

- The model is **abstractive**
- It is a distilled version of **facebook/bart-large-cnn**, which uses the **BART (Bidirectional and Auto-Regressive Transformers)** architecture
- It follows an **encoder–decoder (sequence-to-sequence)** design
- It was trained on the **CNN/DailyMail** dataset, composed of news articles and human-written summaries

This makes the model particularly effective for **news-style summarization**.

---

### Question 5.3 What do the `max_length` and `min_length` parameters control? What happens if `min_length > max_length`?

- **max_length** defines the maximum number of tokens allowed in the generated summary  
- **min_length** defines the minimum number of tokens the summary must contain  

These parameters control the level of **compression versus detail** in the summary.

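If `min_length` is greater than `max_length`, the constraints are inconsistent and cannot both be satisfied. In the run below (`max_length=45` against the default `min_length=56`), the pipeline only logs the warning `Your min_length=56 must be inferior than your max_length=45.` and still returns a summary, so it is safer to set both values explicitly.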

---

### Question 5.4 What does `clean_up_tokenization_spaces=True` do? Why is it useful for summarization?

This parameter removes tokenization artifacts such as:
- unnecessary spaces before punctuation
- awkward spacing between words

It is useful for summarization because summaries are intended to be **human-readable**, and cleaning up spacing improves readability and grammatical quality.

---

### Question 5.5 Challenge — Two different summarization models on the Hub

- **Model optimized for short texts (e.g. news articles):**
  - **facebook/bart-large-cnn**
  - Architecture: BART (encoder–decoder, sequence-to-sequence)
  - Training data: CNN/DailyMail
  - Best suited for short to medium-length news content

- **Model that can handle longer documents:**
  - **google/pegasus-arxiv** (or google/pegasus-pubmed)
  - Architecture: PEGASUS (encoder–decoder, sequence-to-sequence)
  - Training data: ArXiv research papers (or PubMed biomedical articles)
  - Designed for long, structured documents

**Comparison:**  
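Both models use encoder–decoder architectures. BART-large-CNN is fine-tuned on journalistic text, while the PEGASUS checkpoints above are fine-tuned on scientific papers, which are much longer and more structured. Both still have a bounded input length, so very long documents may need to be truncated or split before summarization.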

---

### Why might summarization be more challenging than text classification? What linguistic capabilities does the model need?

Summarization is more challenging because it requires the model to:
- understand the **global meaning** of a document
- identify **important versus secondary information**
- model discourse structure and coherence
- paraphrase and compress content
- generate fluent and grammatically correct natural language

Text classification only assigns a label, whereas summarization requires **deep understanding and language generation**, making it a significantly more complex task.


In [9]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Your min_length=56 must be inferior than your max_length=45.


 Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead.


## Question 6 — Machine Translation

### Question 6.1 What is the architecture behind the Helsinki-NLP/opus-mt-en-de model?

The model **Helsinki-NLP/opus-mt-en-de** is based on the **MarianMT** architecture.

MarianMT is a **transformer-based encoder–decoder (sequence-to-sequence)** neural machine translation architecture.  
- The encoder processes the source language (English).
- The decoder generates the target language (German).

This architecture is specifically optimized for machine translation tasks.

---

### Question 6.2 What does "OPUS" stand for?

**OPUS** stands for **Open Parallel Corpus**.

It is a large, open collection of parallel texts in many languages, gathered from sources such as:
- subtitles,
- parliamentary proceedings,
- news,
- official documents.

OPUS is widely used to train machine translation models.

---

### Question 6.3 What does "MT" stand for?

**MT** stands for **Machine Translation**.

It indicates that the model is designed to automatically translate text from one language to another.

---

### Question 6.4 How would you find a model to translate from English to French?

To find an English-to-French translation model, you can:
- search the Hugging Face Model Hub for models with names containing `en-fr`, or filter by the Translation task and the English and French language tags,
- check the translation task documentation.

Examples of English-to-French models include:
- **Helsinki-NLP/opus-mt-en-fr**  
- **facebook/m2m100_418M** (multilingual, supports English–French among many pairs)

---

### Question 6.5 What is the difference between bilingual and multilingual translation models?

- **Bilingual models**
  - Trained on a single language pair (e.g. English → German).
  - Often achieve higher performance for that specific pair.
  - Require one model per language pair.

- **Multilingual models**
  - Trained on many language pairs within a single model.
  - Can translate between multiple languages, sometimes even unseen pairs.
  - More flexible, but may perform slightly worse than specialized bilingual models for high-resource languages.

**Advantages and disadvantages:**
- Bilingual models: higher accuracy, less flexible.
- Multilingual models: more scalable, but sometimes less specialized.

---

### Question 6.6 How does the task "translation_en_to_de" relate to the model being loaded?

The task name **"translation_en_to_de"** specifies:
- the **source language** (English),
- the **target language** (German).

This task matches the model **Helsinki-NLP/opus-mt-en-de**, which was trained specifically to translate from English to German.  
The pipeline uses this information to apply the correct preprocessing and decoding steps.

---

### Question 6.7 What is the sacremoses warning about? What is this library used for?

**sacremoses** is a library used for:
- text normalization,
- tokenization and detokenization,
- handling punctuation and special characters.

In MarianMT models, sacremoses is often used to **preprocess and postprocess text** so that translations are cleaner and more consistent with the training data.

The warning indicates that this optional dependency is not installed, which may slightly affect text formatting but not the core translation quality.

---

### Question 6.8 Challenge — Find a multilingual translation model

An example of a multilingual translation model is:

- **facebook/m2m100_418M**

This model:
- supports **100 languages**,
- can translate between **thousands of language pairs**,
- does not require English as an intermediate language.

Another example is:
- **facebook/mbart-large-50-many-to-many-mmt**
  - supports 50 languages,
  - enables many-to-many translation.

---

### What challenges exist for low-resource languages?

Low-resource languages face several challenges:
- limited availability of parallel training data,
- lack of high-quality written corpora,
- dialectal variation and non-standardized spelling,
- poorer model performance compared to high-resource languages,
- higher risk of translation errors or omissions.

These challenges make multilingual models and transfer learning especially important for improving translation quality in low-resource settings.


In [10]:
# Here we changed the model because the OPUS model currently cannot be found within the library
translator = pipeline("translation", model="facebook/m2m100_418M", src_lang="en", tgt_lang="ko")
outputs = translator(text)
print(outputs[0]["translation_text"])


Device set to use mps:0


사랑하는 아마존, 지난 주에 나는 독일에서 온라인 상점에서 Optimus Prime 액션 숫자를 주문했습니다. 불행히도, 패키지를 열었을 때, 나는 나의 공포에 메가트론 액션 숫자를 보냈다는 것을 발견했습니다! Decepticons의 평생 적으로서, 나는 당신이 나의 딜레마를 이해할 수 있기를 바랍니다. 문제를 해결하기 위해, 나는 내가 주문 한 Optimus Prime 숫자에 대한 메가트론 교환을 요구합니다.


## Question 7 — Text Generation

### Question 7.1.1 What is the default model used for text generation in the code below?

In the Transformers `text-generation` pipeline, if you do not specify a model, it typically loads **GPT-2** by default (listed on the Hub as `openai-community/gpt2`).  
In practice, you can confirm the exact model name by looking at the pipeline logs/output when the model is loaded.

---

### Question 7.1.2 What architecture does GPT-2 use? (decoder-only, encoder-decoder, or encoder-only?)

GPT-2 is a **decoder-only Transformer** (a causal language model). It predicts the next token given previous tokens.

---

### Question 7.1.3 How many parameters does the base GPT-2 model have?

The base/small GPT-2 model has **~124M parameters**.

---

### Question 7.1.4 What type of generation does it perform? (autoregressive, non-autoregressive, etc.)

GPT-2 performs **autoregressive generation**: it generates text **token by token**, each new token conditioned on the previously generated tokens.
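
The loop behind autoregressive generation can be sketched with a toy stand-in for the model. Everything here is hypothetical: the lookup table replaces a real language model, and unlike GPT-2 (which conditions on the whole prefix) it only looks at the last token to keep the example short.

```python
# Hypothetical next-token scores, keyed by the last token generated so far
TOY_MODEL = {
    "<bos>": {"Dear": 2.0, "Hello": 1.0},
    "Dear": {"Bumblebee": 3.0, "customer": 1.5},
    "Bumblebee": {",": 2.5, "<eos>": 0.5},
    ",": {"sorry": 2.0, "<eos>": 1.0},
    "sorry": {"<eos>": 3.0},
}

def generate_greedy(prompt, max_new_tokens=10):
    """Append one token at a time, always picking the highest-scoring candidate."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        candidates = TOY_MODEL.get(tokens[-1], {"<eos>": 0.0})
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # the end-of-sequence token stops generation
        tokens.append(next_token)  # the new token becomes part of the next input
    return tokens

generated = generate_greedy(["<bos>"])
print(generated)
```

Each iteration feeds the sequence produced so far back into the model, which is what "conditioned on the previously generated tokens" means in practice.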

---

### Question 7.2 Why do we use `set_seed(42)` before generation? What would happen without it?

`set_seed(42)` sets random seeds (Python `random`, NumPy, and PyTorch if installed) to make results **reproducible**.  
Without setting a seed, sampling-based generation can produce **different outputs** each time you run it (even with the same prompt and parameters), because randomness influences which tokens are sampled.

---

### Question 7.3 The code uses `max_length=200`. What other parameters can control text generation?

- **temperature**  
  Controls randomness by scaling the logits before sampling.  
  - Low temperature (< 1): more conservative, more repetitive/deterministic  
  - High temperature (> 1): more diverse, but higher risk of incoherence

- **top_k**  
  Restricts sampling to the **K most probable tokens** at each step.  
  - Smaller top_k: safer/more focused output  
  - Larger top_k: more variety

- **do_sample**  
  Controls whether the model **samples** tokens or uses deterministic decoding.  
  - `do_sample=False`: typically uses greedy decoding (more deterministic)  
  - `do_sample=True`: enables sampling (more diverse, seed matters more)

(These are part of the standard text generation controls in Transformers.)
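
To illustrate how temperature, top-k, and the random seed interact, here is a small plain-Python sketch of sampling one token. The logits are hypothetical and the function is a simplified stand-in, not the Transformers implementation.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token from `logits`, a dict mapping token -> raw score."""
    # Keep only the top_k highest-scoring tokens, if requested
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    # Temperature divides the logits before the softmax
    scaled = [score / temperature for _, score in items]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalised softmax weights
    tokens = [token for token, _ in items]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token logits
logits = {"sorry": 2.0, "glad": 1.0, "confused": 0.5, "banana": -1.0}

# Two generators with the same seed produce the same sequence of samples
rng_a = random.Random(42)
rng_b = random.Random(42)
run_a = [sample_next(logits, temperature=0.8, top_k=3, rng=rng_a) for _ in range(5)]
run_b = [sample_next(logits, temperature=0.8, top_k=3, rng=rng_b) for _ in range(5)]
print(run_a)
print(run_a == run_b)
```

With `top_k=3` the token `banana` can never be drawn, and with `top_k=1` the function always returns the most likely token, which matches greedy decoding.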

---

### Question 7.4 What does the truncation warning mean? Why is the input being truncated?

The truncation warning is emitted because `max_length` was passed without an explicit truncation setting: the tokenizer was not told whether to truncate, so it falls back to the `'longest_first'` strategy.  
With that setting, an input longer than the maximum length is **truncated** (cut off) so that it fits. Passing `truncation=True` makes the choice explicit.

This matters because truncation can remove important context, changing what the model “sees” and therefore changing the generated output.

---

### Question 7.5 What does `pad_token_id` being set to `eos_token_id` mean? Why is this necessary for GPT-2?

GPT-2 does not have a dedicated padding token by default. During generation (especially in batched settings), the library may need a pad token to align sequences.  
Setting `pad_token_id = eos_token_id` tells the model to use the **end-of-sequence token** as padding, which avoids errors/warnings during open-ended generation.

Trade-off: using EOS as padding is a practical workaround, but it can be conceptually odd (padding becomes “end of text”), so it’s mainly used for compatibility.
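
The sketch below shows what padding with the EOS id looks like for a small batch. The token ids are hypothetical, and padding on the left is an assumption that reflects common practice for decoder-only generation; the attention mask is what tells the model to ignore the padded positions.

```python
EOS_TOKEN_ID = 50256  # GPT-2's end-of-sequence id, as reported in the pipeline log

def left_pad(batch, pad_id=EOS_TOKEN_ID):
    """Pad token-id sequences to equal length and build the matching attention mask."""
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding (ignored by the model), 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

# Hypothetical token ids for two prompts of different lengths
batch = [[101, 102, 103, 104], [201, 202]]
input_ids, attention_mask = left_pad(batch)

print(input_ids)
print(attention_mask)
```

Because the mask hides the padded positions, reusing the EOS id for padding does not make the model read those positions as "end of text".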

---

### Question 7.6 What are the trade-offs between model size and generation quality?

- **Larger models**  
  - Typically produce more coherent, fluent, and context-aware text  
  - Better at long-range dependencies and instruction following (in modern instruction-tuned variants)  
  - Require more compute (slower generation, more memory)

- **Smaller models**  
  - Faster and cheaper to run  
  - More likely to produce repetitive, shallow, or off-topic generations  
  - Shorter effective context handling and weaker world knowledge

In general: increasing model size often improves quality, but with diminishing returns and higher hardware cost.


In [11]:
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results

In [12]:
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Dear Amazon, last week I ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead! As a lifelong enemy of the Decepticons, I hope you can understand my dilemma. To resolve the issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered. Enclosed are copies of my records concerning this purchase. I expect to hear from you soon. Sincerely, Bumblebee.

Customer service response:
Dear Bumblebee, I am sorry to hear that your order was mixed up. I did purchase the right size Optimus Prime figure for me, but I did not receive the correct size shipment. I appreciate your patience as I have been working hard to resolve this issue. I have made many attempts at sorting out this issue, but I have not had a single success.

Message for my readers with questions and concerns:

Dear Customer,

In case you need help sorting out the problem of

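-> This output makes sense considering we are using GPT-2, a decoder-only Transformer: it simply continues the prompt token by token, so the reply is fluent but drifts off topic and breaks off mid-sentence.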