# Chapter 1: Transformer Models

- pipeline() function for tasks such as text generation and classification
- Transformer architecture
- encoder, decoder, and encoder-decoder architectures and use cases


## Introduction 


NLP focuses on **understanding everything related to human language**. The aim is not only to understand **single words individually**, but also to understand the **context of those words**.

**Use case examples:**

- **Classifying whole sentences**:
    - sentiment analysis, spam detection, checking whether a sentence is grammatically correct
- **Classifying each word in a sentence**:
    - part-of-speech (POS) tagging: identifying the grammatical components of a sentence (such as nouns, verbs, and adjectives) and assigning the appropriate grammatical tag to each word.
    - named entity recognition (NER): identifying and classifying named entities such as persons, locations, organizations, and other proper nouns in a text.

- **Generating text content**:
    This covers completing a prompt with auto-generated text and filling in the blanks in a text. The second variant is known as masked language modeling or cloze-style language modeling: the model is given a text with certain words or tokens masked or removed, and its objective is to predict the missing words or tokens. Both are often used to assess a model's ability to generate coherent and contextually appropriate text.

- **Extracting an answer from a text**:   
    Question answering, which can be extractive or abstractive:
    - In extractive question answering, the model identifies and selects a span of text from the context that directly answers the question. The selected span is typically a contiguous sequence of words or tokens from the context.

    - In abstractive question answering, the model generates a concise and coherent answer to the question based on the information in the context. The generated answer may not be an exact span of text from the context but rather a paraphrased or synthesized response.

- **Generating a new sentence from an input text**: machine translation (MT) and text summarization


NLP doesn't only deal with written text. It also works on understanding and solving difficult problems related to speech recognition and computer vision. For example, it can generate a written version of an audio recording or describe what's happening in an image.

## Working with pipelines

Transformer models are used to solve all kinds of NLP tasks. The HF Transformers library provides the functionality to **create** and **use the models** that have been shared by researchers. 

The **pipeline() function** is the most basic object in the library. It connects a model with its necessary preprocessing and postprocessing steps.

We can use the pipeline directly to input any text and get an output. The following code shows this with an example sentence.

In [5]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228}]

The code **creates a pipeline** for **sentiment analysis** using the **pipeline function** from the **transformers module**.

We can do the same with multiple sentences:

In [6]:
classifier(
    ["I've been feeling sick latley.", "This NLP project is sick!"]
)

[{'label': 'NEGATIVE', 'score': 0.9996962547302246},
 {'label': 'NEGATIVE', 'score': 0.9997852444648743}]

Note: the model didn't pick up the meaning of the word "sick" in the second sentence, where it is used as slang with a positive sense.

**What happens when we pass some text to the pipeline?**

- The input is preprocessed, that is, the text is converted into a format the model can understand.
- The preprocessed inputs are passed to the model.
- The predictions of the model are post-processed, so we can understand them.
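The three steps can be pictured with a toy stand-in. The sketch below is not the library's implementation: `toy_tokenize`, `toy_model`, the tiny vocabulary, and the weights are all hypothetical and exist only to show how preprocessing, the model call, and postprocessing chain together.

```python
import math

# Hypothetical vocabulary and weights; a real pipeline loads these from a checkpoint.
VOCAB = {"<unk>": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "course": 5}
# One (negative, positive) logit contribution per token id.
WEIGHTS = [(0.0, 0.0), (0.0, 0.0), (-2.0, 2.0), (2.0, -2.0), (0.0, 0.0), (0.0, 0.5)]
LABELS = ["NEGATIVE", "POSITIVE"]


def toy_tokenize(text):
    """Preprocessing: turn raw text into the integer ids the model expects."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def toy_model(token_ids):
    """Model: map token ids to one raw score (logit) per label."""
    neg = sum(WEIGHTS[i][0] for i in token_ids)
    pos = sum(WEIGHTS[i][1] for i in token_ids)
    return [neg, pos]


def postprocess(logits):
    """Postprocessing: softmax the logits and report the most likely label."""
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline does behind a single call."""
    return postprocess(toy_model(toy_tokenize(text)))


result = toy_pipeline("I love this course")
print(result)
```

The returned dictionary has the same shape as the sentiment-analysis output shown earlier (a label and a score), which is the part of the real pipeline this toy is meant to mirror.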


## More on Pipeline

https://huggingface.co/docs/transformers/pipeline_tutorial 


There are two categories of pipeline abstractions to be aware of:

1. **The pipeline()**, which is the most powerful object, encapsulating all other pipelines.
2. **Task-specific pipelines**, which are available for audio, computer vision, natural language processing, and multimodal tasks.

For NLP tasks there are several pipelines available. We will look at some of them:

## Zero-shot classification 

**Objective:** classify texts that haven’t been labelled.
- The zero-shot-classification pipeline allows you to **classify text into multiple candidate labels even if those labels are not present in the training data**.
- This pipeline is called zero-shot because you don’t need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want!

In [7]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445996642112732, 0.11197379231452942, 0.04342653229832649]}

Note: as you can see, when no model is specified the pipeline uses a default model, which in this case is facebook/bart-large-mnli.          
We can use the following models:

1. **facebook/bart-large-mnli**: This model is a **BART model fine-tuned on the MNLI** (Multi-Genre Natural Language Inference) dataset. It is capable of zero-shot classification tasks.

2. **roberta-large-mnli**: This model is a **RoBERTa model fine-tuned on the MNLI dataset**. It is also suitable for zero-shot classification.

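3. **distilbert-base-uncased**: This model is a **DistilBERT model** trained on uncased English text. It is a **smaller and faster variant of BERT**. Because this base checkpoint has not been fine-tuned on an NLI dataset such as MNLI, it would need that kind of fine-tuning (or an NLI-fine-tuned variant from the Hub) before its zero-shot predictions are meaningful.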

Now let's do the same, but this time with a specified model, roberta-large-mnli:

In [8]:
classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.9562345743179321, 0.026972180232405663, 0.016793299466371536]}

**What does the scores output mean here?**      
The scores are the **confidence scores** assigned to each of the candidate labels provided in the classifier call. Each score indicates how likely the model considers the input text to belong to that label. With the default settings the scores are normalized probabilities, meaning they add up to 1 across all labels.
In this example, the highest score of 0.9562345743179321 is assigned to the label 'education', indicating that the model believes the input text is most likely related to education compared to the other candidate labels. The lower scores for the remaining labels indicate lower confidence in those classifications.

The scores can be useful for understanding the model's level of certainty in its predictions and can be utilized to make decisions based on the classification confidence thresholds that best suit any specific application.
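As a small illustration, the snippet below takes the output of the roberta-large-mnli run above (copied in by hand, so no model is needed), checks that the scores sum to roughly 1, and filters labels with a confidence threshold. The value of `THRESHOLD` is an arbitrary assumption; a suitable value depends on the application.

```python
# Output copied from the roberta-large-mnli run above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.9562345743179321, 0.026972180232405663, 0.016793299466371536],
}

# The scores behave like a probability distribution over the candidate labels.
total = sum(result["scores"])

# Hypothetical application-specific threshold: keep only confident predictions.
THRESHOLD = 0.8
accepted = [
    label
    for label, score in zip(result["labels"], result["scores"])
    if score >= THRESHOLD
]

print(f"sum of scores: {total:.6f}")
print(f"labels above {THRESHOLD}: {accepted}")
```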

## Text generation

**Objective:** given a prompt, the model auto-completes it by generating the remaining text.

In [10]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this news article, ")

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this news article, \xa0Jai and other contributors discuss at length the reasons for the increased demand for medical facilities in Indonesia. The majority blame the price of the medications used to treat the disease for the increased demand and increase in the number of'}]


The pipeline defaulted to gpt2 because no model was given. The text-generation pipeline can use various pretrained models, depending on the version of the transformers library. A few examples are gpt2 (Generative Pretrained Transformer 2), gpt2-medium, gpt2-large, and distilgpt2 (DistilGPT-2, a smaller and faster variant of GPT-2).

In [11]:
generator = pipeline("text-generation", model="distilgpt2", num_return_sequences=5, max_length=30)
generator("In this news article, ")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this news article, izar is a freelance reporter. He writes about global politics in the Washington Post for the Center for American Progress and has'},
 {'generated_text': 'In this news article, \xa0 is only one of his recent claims that it is true that the U.S. Constitution actually prohibits the practice of'},
 {'generated_text': 'In this news article, ______________________________________________\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'},
 {'generated_text': 'In this news article, 今漀便楮人, which came out after the World War I ended, would'},
 {'generated_text': 'In this news article, \ue601 \ue60a \ue60a \ue60a \ue60a \ue60a '}]

In [1]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to install and modify the Arduino IDE using Arduino on an Arduino computer.\n\n\n\nIn any case'},
 {'generated_text': 'In this course, we will teach you how to start your own family life and the future with a personal story.'}]

## Using any model from the Hub in a pipeline

To find other models to use with the pipeline, go to the Hugging Face **Model Hub** and click the corresponding tag (here, text generation) on the left to display only the supported models for that task.

## Mask filling

**Objective:** filling in the blanks in a given text:

In [3]:
unmasker = pipeline("fill-mask")
unmasker("life is all about <mask> .", top_k=2)

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'score': 0.020203422755002975,
  'token': 7967,
  'token_str': ' survival',
  'sequence': 'life is all about survival.'},
 {'score': 0.01898636482656002,
  'token': 657,
  'token_str': ' love',
  'sequence': 'life is all about love.'}]

The **top_k** argument controls how many possibilities are displayed. The special `<mask>` word is often referred to as a mask token.

## Named entity recognition

**Objective:** finding which parts of the input text correspond to entities such as persons, locations, or organizations. 

In [4]:

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

**grouped_entities=True**: tells the pipeline to regroup the parts of the sentence that correspond to the same entity. Here, for example, the model correctly grouped "Hugging" and "Face" as a single organization, even though the name consists of multiple words. (Newer versions of the library expose this behaviour through the `aggregation_strategy` argument.)
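The idea behind grouping can be sketched in a few lines. This is a simplified illustration, not the library's grouping logic: the token-level predictions and the `group_entities` helper are hypothetical, and only the sentence and the final character offsets match the output above.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Hypothetical token-level predictions (entity type plus character offsets).
token_predictions = [
    {"entity": "PER", "start": 11, "end": 18},
    {"entity": "ORG", "start": 33, "end": 40},
    {"entity": "ORG", "start": 41, "end": 45},
    {"entity": "LOC", "start": 49, "end": 57},
]


def group_entities(text, predictions, max_gap=1):
    """Merge neighbouring tokens that share an entity type into one span."""
    groups = []
    for pred in predictions:
        last = groups[-1] if groups else None
        if (
            last is not None
            and last["entity_group"] == pred["entity"]
            and pred["start"] - last["end"] <= max_gap
        ):
            last["end"] = pred["end"]  # extend the current span
        else:
            groups.append(
                {"entity_group": pred["entity"], "start": pred["start"], "end": pred["end"]}
            )
    # Recover the surface text of each span from its character offsets.
    for group in groups:
        group["word"] = text[group["start"]:group["end"]]
    return groups


grouped = group_entities(text, token_predictions)
for group in grouped:
    print(group["entity_group"], repr(group["word"]))
```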

## Question answering

**Objective:** answering questions using information from a given context.

In [6]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I do?",
    context="My name is Mahnaz and I work as a ML engineer.",
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.6483435034751892, 'start': 34, 'end': 45, 'answer': 'ML engineer'}

Note that this pipeline **does not generate the answer**; it only works by **extracting information** from the **provided context**.
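The `start` and `end` values in the output make the extractive behaviour easy to check: they are character offsets into the context, so slicing the context with them gives back the answer. The snippet copies the output above by hand and does not call the model.

```python
context = "My name is Mahnaz and I work as a ML engineer."

# Output copied from the question-answering run above.
result = {"score": 0.6483435034751892, "start": 34, "end": 45, "answer": "ML engineer"}

# The answer is exactly the slice of the context between the two offsets.
span = context[result["start"]:result["end"]]
print(span)  # ML engineer
```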

## Summarization

**Objective:** reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text.

In [7]:
summarizer = pipeline("summarization")
summarizer(
    """A key feature of Transformer models is that they are built with special layers called attention layers. In fact, the title 
    of the paper introducing the Transformer architecture was “Attention Is All You Need”! We will explore the details of attention 
    layers later in the course; for now, all you need to know is that this layer will tell the model to pay specific attention to
      certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.
      To put this into context, consider the task of translating text from English to French. Given the input “You like this course”, a 
      translation model will need to also attend to the adjacent word “You” to get the proper translation for the word “like”, because 
      in French the verb “like” is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for
        the translation of that word. In the same vein, when translating “this” the model will also need to pay attention to the word “course”,
          because “this” translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in 
          the sentence will not matter for the translation of “this”. With more complex sentences (and more complex grammar rules), the model
            would need to pay special attention to words that might appear farther away in the sentence to properly translate each word.
The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected
 by the context, which can be any other word (or words) before or after the word being studied."
""")


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'summary_text': ' A key feature of Transformer models is that they are built with special layers called attention layers . This layer tells the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word . To put this into context, consider the task of translating text from English to French .'}]

We can specify a max_length or a min_length for the result.

## Translation

In [8]:

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")



[{'translation_text': 'This course is produced by Hugging Face.'}]

The reason these pipelines don't always work very well is that the underlying models were trained for specific tasks and cannot perform variations of them. To get better results, we should customize the behaviour of the pipeline.

# How do Transformers work?

Transformers go back to 2017, when the paper **Attention Is All You Need**, focused on translation tasks, was published. 
This was followed by the introduction of several influential models:

- **GPT-like** (aka auto-regressive Transformer models)
- **BERT-like** (aka auto-encoding Transformer models)
- **BART/T5-like** (aka sequence-to-sequence Transformer models)

Some of the highlights:

- June 2018: **GPT**, the **first pretrained Transformer model**, used for fine-tuning on **various NLP tasks** and obtained state-of-the-art results

- October 2018: **BERT**, another large pretrained model, this one designed to produce better **summaries of sentences** (more on this in the next chapter!)

- February 2019: **GPT-2**, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns

- October 2019: **DistilBERT**, a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT’s performance

- October 2019: **BART** and **T5**, two large pretrained models using the same architecture as the original Transformer model (the first to do so)

- May 2020: **GPT-3**, an even **bigger version of GPT-2** that is able to perform well on a variety of tasks **without the need for fine-tuning** (called zero-shot learning)


Transformer models are language models, meaning they have been **trained on large amounts of raw text** in a **self-supervised** fashion. In self-supervised learning, the **objective is automatically computed** **from the inputs** of the model. That means that humans are not needed to label the data!

- Such a model develops a **statistical understanding of the language** it is trained on, but it is not very useful for specific practical tasks.
- **Transfer learning** addresses this issue by **fine-tuning a pretrained model on a specific task**.
- Fine-tuning involves supervised learning using human-annotated labels.
- The process helps the model become more practical and task-oriented.

Two examples of pretraining objectives:    

1. **Causal language modeling** is an example of a task where the goal is to **predict the next word** in a sentence based on the preceding n words. The output of this task depends on the past and present inputs, but not on future inputs.

2. **Masked language modeling** in which the model predicts a masked word in the sentence.
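The difference between the two objectives shows up in how training examples are built from the same text. The sketch below is illustrative only: the word-level tokens, the helper names, and the `[MASK]` placeholder are assumptions (real models work on subword tokens, and the mask symbol depends on the model).

```python
tokens = ["transformers", "are", "language", "models"]
MASK = "[MASK]"  # placeholder symbol; the actual mask token depends on the model


def causal_lm_examples(tokens):
    """Each example pairs the tokens seen so far with the next token to predict."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, position):
    """Hide one token; the model sees both sides and must recover it."""
    masked = tokens[:position] + [MASK] + tokens[position + 1:]
    return masked, tokens[position]


# Causal: only the left context is available for each prediction.
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

# Masked: context on both sides of the hidden word is available.
print(masked_lm_example(tokens, 2))
```

In both cases the targets come from the text itself, which is why no human labelling is needed.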

Apart from a few outliers such as DistilBERT, the general strategy for achieving better performance with Transformers is:

1. increasing the models' sizes, and
2. increasing the amount of data they are pretrained on.

This becomes very costly in terms of time, compute resources, and environmental impacts.

## Transfer Learning

Some definitions:

- **Pretraining**:    
    - training a model from scratch.
    - training starts without any prior knowledge; the weights are randomly initialized.
    - usually done on a very large corpus of data.
    - training can take up to several weeks.

- **Fine-tuning**:
    - training done after a model has been pretrained. 
    - get the pretrained model, then do additional training with a dataset specific to your task.

Why not simply train directly for the final task, instead of fine-tuning?     
- Since the pretrained model was trained on a dataset that shares similarities with the fine-tuning dataset, the fine-tuning process can leverage the knowledge gained by the initial model during pretraining.
- Fine-tuning requires far less data to get decent results.
- Fine-tuning requires far less time and fewer resources.

Fine-tuning a model therefore has **lower time, data, financial, and environmental costs**. It is also easier to iterate over different fine-tuning schemes, and fine-tuning typically achieves better results than training from scratch.

## General architecture

The model is primarily composed of two blocks:

- **Encoder**: The encoder **receives an input** and **builds a representation of it** (its features). This means that the model is **optimized to acquire understanding from the input.**       

- **Decoder**: The decoder **uses the encoder’s representation** (features) along with other inputs to **generate a target sequence.** This means that the model is **optimized for generating outputs**.

Each of these parts can be used independently, depending on the task:

- Encoder-only models: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.       
- Decoder-only models: Good for generative tasks such as text generation.     
- Encoder-decoder models or sequence-to-sequence models: Good for generative tasks that require an input, such as translation or summarization.        

## Attention layers

An attention layer tells the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word.
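One common formulation is the scaled dot-product attention from the original Transformer paper. The pure-Python sketch below is a simplification under stated assumptions: the 2-dimensional word vectors are made up, and the same vectors are used directly as queries, keys, and values, whereas a real layer first applies learned projections.

```python
import math


def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    shifted = [math.exp(v - max(values)) for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How relevant is each position to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # New representation: weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional vectors for a three-word sentence.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence; larger weights mean "pay more attention", smaller ones mean "more or less ignore".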


## The original architecture

The Transformer architecture was **originally designed for translation**, and it works as follows:

During **training**: 

 - **encoder**: receives inputs (sentences) in a **certain language.**
    - its attention layers can use all the words in a sentence (since the translation of a given word can depend on what comes after it as well as before it in the sentence).
 - **decoder**: receives the **same sentences** in the desired **target language.** 
    - it works sequentially: its attention layers can only use the words it has already translated (so, only the words before the word currently being generated).

To speed things up during training (when the model has access to the target sentences), the decoder is fed the whole target at once, but it is not allowed to use future words. If it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!
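The restriction on future words is usually implemented as a lower-triangular (causal) mask over the decoder's input. A minimal sketch, assuming a made-up French target sentence and a hypothetical `<s>` start symbol, with the decoder input taken to be the target shifted right by one position:

```python
def causal_mask(length):
    """mask[i][j] is True when position i is allowed to look at position j."""
    return [[j <= i for j in range(length)] for i in range(length)]


START = "<s>"  # hypothetical start-of-sequence symbol
target = ["Vous", "aimez", "ce", "cours"]  # hypothetical target sentence
decoder_input = [START] + target[:-1]  # target shifted right by one

mask = causal_mask(len(decoder_input))
for i, row in enumerate(mask):
    visible = [tok for tok, allowed in zip(decoder_input, row) if allowed]
    print(f"predicting {target[i]!r} from {visible}")
```

All positions are processed in one pass, yet the word being predicted at each position never appears among the tokens that position can see.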

**attention layers in decoder**
- First attention layer in decoder: Considers all past inputs
- Second attention layer in decoder: Utilizes encoder's output
- Second attention layer accesses whole input sentence for accurate predictions
- Helpful for languages with different word orders or when context is important

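**attention mask**:
can be used in the encoder/decoder to **prevent the model from paying attention to some special words**, for example the padding word used to make all inputs the same length when sentences are batched together.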
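A common case of such special words is padding. The sketch below pads a batch of sentences to the same length and builds a matching mask of ones (attend) and zeros (ignore). The `[PAD]` symbol and the `pad_batch` helper are hypothetical; in practice a tokenizer typically produces both the padded inputs and the mask.

```python
PAD = "[PAD]"  # hypothetical padding symbol


def pad_batch(sentences):
    """Pad every sentence to the longest length and build matching attention masks."""
    longest = max(len(s) for s in sentences)
    padded, masks = [], []
    for s in sentences:
        n_pad = longest - len(s)
        padded.append(s + [PAD] * n_pad)
        masks.append([1] * len(s) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return padded, masks


batch = [["I", "like", "this", "course"], ["Hello"]]
padded, masks = pad_batch(batch)
print(padded)
print(masks)
```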


## Architectures vs. checkpoints

Some terminology in Transformer models:

- **Architecture**: skeleton of the model — the definition of each **layer** and each **operation** that happens within the model.
- **Checkpoints**: These are the **weights** that will be loaded in a given architecture.
- **Model**: This is an **umbrella term** that isn’t as precise as “architecture” or “checkpoint”: it can mean both.     
For example, BERT is an architecture while bert-base-cased, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say “the BERT model” and “the bert-base-cased model.”
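The distinction can be made concrete with a deliberately tiny analogy. `TinyLinear` and the two weight dictionaries below are hypothetical and have nothing to do with BERT itself: the class plays the role of the architecture, and each dictionary plays the role of a checkpoint loaded into it.

```python
class TinyLinear:
    """Architecture: defines the operation, independent of any particular weights."""

    def __init__(self, weights):
        self.w = weights["w"]
        self.b = weights["b"]

    def __call__(self, x):
        return self.w * x + self.b


# Two hypothetical checkpoints: different weights for the same architecture.
checkpoint_a = {"w": 2.0, "b": 0.0}
checkpoint_b = {"w": -1.0, "b": 3.0}

model_a = TinyLinear(checkpoint_a)
model_b = TinyLinear(checkpoint_b)
print(model_a(1.5), model_b(1.5))
```

Both objects share one architecture but behave differently because they were built from different checkpoints, and either could loosely be called "the model".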

