# Hello world Transformers 🤗

In this notebook we will explore the basics of the Hugging Face Transformers library by using pre-trained models on a sample text, starting with text classification.

⚠️ Do not forget to install the `transformers` library before running this notebook.
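
If you are not sure whether the library is available in your environment, a small check like the one below can help. This is only a sketch: the helper name `is_installed` is made up for this notebook, and the suggested `pip` command assumes a standard Python setup.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate the given top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Print a reminder instead of failing later with an ImportError.
if not is_installed("transformers"):
    print("transformers is missing; try: pip install transformers")
```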

## Quick overview of Transformer applications
Let's start by defining a text that we will use to test the model.

For testing purposes, we will use a text that is a complaint about a product. You can write your own text or change this one to test the models with different inputs 🤓

In [None]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

# Text Classification

## 📚 Question 1: Understanding Pipelines
Before we start using the models, let's understand what we're working with:

1. What is a pipeline in Hugging Face Transformers? What does it abstract away from the user?
2. Visit the pipeline documentation and list at least 3 other tasks (besides text-classification) that are available.
3. What happens when you don't specify a model in the pipeline? How can you specify a specific model?

💡 Hint: Check the official documentation to answer these questions!

The first thing we will do is classify the text into one of two categories: positive or negative.

To do this, we will use a pre-trained model from the Hugging Face library.

We will use the `pipeline` function, which loads a model for the `text-classification` task.

See the documentation for more details: https://huggingface.co/docs/transformers/main/en/pipeline_tutorial

In [None]:
from transformers import pipeline

classifier = pipeline("text-classification")

## 📚 Question 2: Text Classification Deep Dive
Now that you've seen text classification in action, explore further:

1. What is the default model used for text-classification? Look at the output above to find its name, then search for it on the Hugging Face Model Hub.
2. What dataset was this model fine-tuned on? What kind of text does it work best with?
3. The output includes a score field. What does this score represent? What range of values can it have?
4. Challenge: Find a different text-classification model on the Hub that classifies emotions (not just positive/negative). What is its name?

💡 Click on the model card in the Hub to see detailed information about training data and performance!

In [None]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)
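
To build some intuition for the `score` column, here is a minimal sketch of how a classifier can turn raw outputs (logits) into a label and a score. It assumes the model applies a softmax over two classes; the logits are invented for illustration and do not come from the model above.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]
hypothetical_logits = [2.3, -1.9]  # made-up values, not model output

probs = softmax(hypothetical_logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

The reported score is the probability of the winning label only, which is why a single row is enough to describe a two-class prediction.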

# Named Entity Recognition

In [None]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

## 📚 Question 3: Named Entity Recognition (NER)
Let's understand NER better:

1. What does the `aggregation_strategy="simple"` parameter do in the NER pipeline? Check the token classification documentation.
2. Looking at the output above, what do the entity types mean? (ORG, MISC, LOC, PER)
3. Why do some words appear with `##` prefix (like `##tron` and `##icons`)? What does this indicate about tokenization?
4. The model seems to have split "Megatron" and "Decepticons" incorrectly. Why might this happen? What does this tell you about the model's training data?
5. Challenge: Find the model card for `dbmdz/bert-large-cased-finetuned-conll03-english`. What is the CoNLL-2003 dataset?

🤔 How might the choice of tokenizer affect NER performance?
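
As a starting point for thinking about the `##` pieces, the sketch below glues sub-word tokens back into words. It assumes WordPiece-style tokens where `##` marks a continuation of the previous piece; the token list is written by hand for illustration and is not tokenizer output.

```python
def merge_wordpieces(tokens):
    """Join continuation pieces (prefixed with '##') onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append
        else:
            words.append(token)
    return words


# Hand-written example pieces, not produced by the NER model's tokenizer.
pieces = ["Mega", "##tron", "and", "the", "Decept", "##icons"]
print(merge_wordpieces(pieces))
```

Merging the strings is the easy part. Deciding which entity label the merged word should get when its pieces disagree is the harder problem.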

# Question Answering

In [None]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

## 📚 Question 4: Question Answering Systems
Explore how question answering works:

1. What type of question answering is this? (Extractive vs. Generative) Check the question answering documentation.
2. The model outputs start and end indices. What do these represent? Why are they important?
3. What is the SQuAD dataset? (Look up the model `distilbert-base-cased-distilled-squad` on the Hub)
4. Try to think of a question this model CANNOT answer based on the text. Why would it fail?
5. Challenge: What's the difference between extractive and generative question answering? Find an example of a generative QA model on the Hub.

💡 Try asking questions that require reasoning or information not in the text. What happens?
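
To see how the `start` and `end` fields relate to the context, here is a small sketch. It assumes they are character offsets into the context string with `end` exclusive; the offsets are computed with `str.find` for illustration instead of being taken from the model.

```python
context = (
    "To resolve the issue, I demand an exchange of Megatron for the "
    "Optimus Prime figure I ordered."
)

# Hypothetical answer span, located by plain string search.
answer = "an exchange of Megatron"
start = context.find(answer)
end = start + len(answer)

hypothetical_output = {"start": start, "end": end, "answer": answer}

# Slicing the context with the offsets recovers the answer text.
recovered = context[hypothetical_output["start"]:hypothetical_output["end"]]
print(recovered)
```

Because the answer is always a slice of the context, a span-based reader has nothing to return when the information is not literally in the text.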

# Summarization

## 📚 Question 5: Text Summarization
Before running the summarization code, let's understand how it works:

1. What is the difference between extractive and abstractive summarization? Check the summarization documentation.
2. Looking at the code in the next cell, what is the default model used for summarization? Search for it on the Hugging Face Model Hub and determine:
   - Is it an extractive or abstractive model?
   - What architecture does it use? (Hint: look at the model name)
   - What dataset was it trained on?
3. What do the `max_length` and `min_length` parameters control? What happens if `min_length > max_length`?
4. The parameter `clean_up_tokenization_spaces=True` is used. What does this parameter do? Why might it be useful for summarization?
5. Challenge: Find two different summarization models on the Hub, then compare their architectures and training data:
   - One optimized for short texts (like news articles)
   - One that can handle longer documents

💡 Why might summarization be more challenging than text classification? What linguistic capabilities does the model need?

In [None]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

# Translation

## 📚 Question 6: Machine Translation
Let's explore how translation models work:

1. What is the architecture behind the `Helsinki-NLP/opus-mt-en-de` model? Look it up on the Model Hub.
2. What does "OPUS" stand for?
3. What does "MT" stand for?
4. How would you find a model to translate from English to French? Visit the translation documentation and the Model Hub to find at least 2 different models.
5. What is the difference between bilingual and multilingual translation models? What are the advantages and disadvantages of each?
6. In the code, we specify the task as `"translation_en_to_de"`. How does this relate to the model we're loading?
7. The output shows a warning about `sacremoses`. What is this library used for in NLP? Check the MarianMT documentation.
8. Challenge: Find a multilingual model (like mBART or M2M100) that can translate between multiple language pairs. How many language pairs does it support?

🌍 What challenges exist for low-resource languages?

In [None]:
translator = pipeline("translation_en_to_de", 
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

# Text Generation

## 📚 Question 7: Text Generation
Understand how language models generate text:

1. What is the default model used for text generation in the code below? Look it up on the Hub and answer:
   - What architecture does GPT-2 use? (decoder-only, encoder-decoder, or encoder-only?)
   - How many parameters does the base GPT-2 model have?
   - What type of generation does it perform? (autoregressive, non-autoregressive, etc.)
2. Why do we use `set_seed(42)` before generation? What would happen without it? Check the generation documentation.
3. The code uses `max_length=200`. What other parameters can control text generation? Research and explain:
   - `temperature`
   - `top_k`
   - `do_sample`
4. Looking at the output, you can see a warning about truncation. What does this mean? Why is the input being truncated?
5. What does `pad_token_id` being set to `eos_token_id` mean? Why is this necessary for GPT-2?
6. What are the trade-offs between model size and generation quality?
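
To experiment with the ideas behind `temperature`, `top_k`, and seeding, here is a simplified sketch of sampling one next token. It is not the library's implementation: the function `sample_next_token` and the logits are hypothetical, and real models work on vocabulary-sized tensors, not small dictionaries.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token from a dict of token -> raw score.

    temperature rescales the scores before softmax (must be > 0);
    top_k keeps only the k highest-scoring tokens before sampling.
    """
    ranked = sorted(logits.items(), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    scaled = [score / temperature for _, score in ranked]
    largest = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - largest) for s in scaled]
    tokens = [token for token, _ in ranked]
    return random.choices(tokens, weights=weights, k=1)[0]


# Made-up scores for four candidate tokens.
hypothetical_logits = {"sorry": 3.0, "happy": 1.5, "robot": 0.2, "banana": -2.0}

random.seed(42)  # same seed, same sequence of random draws
print(sample_next_token(hypothetical_logits, temperature=0.7, top_k=2))
```

Try changing the temperature and `top_k` values, and remove the seed to see how the sampled token varies between runs.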

In [None]:
from transformers import set_seed
set_seed(42) # Set the seed to get reproducible results
generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])
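
Keep in mind that `max_length` counts the prompt tokens as well as the newly generated ones. The sketch below shows the arithmetic with a hypothetical helper, `new_token_budget`; the token counts are invented, since the real count depends on the model's tokenizer.

```python
def new_token_budget(prompt_token_count: int, max_length: int) -> int:
    """Tokens left for generation when max_length covers prompt plus output."""
    return max(0, max_length - prompt_token_count)


# Invented prompt sizes for illustration only.
print(new_token_budget(prompt_token_count=150, max_length=200))
print(new_token_budget(prompt_token_count=250, max_length=200))
```

A long prompt can use up most of the budget, leaving the model little room to continue the text.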

Change the model inside the pipeline to try other models. Also try other languages 🌍