# Introduction 

Goal of the Hugging Face Transformers library:

Its goal is to provide a **single API** through which any **Transformer model can be loaded, trained, and saved**.       
The library’s main features:

- Downloading, loading, and using a state-of-the-art NLP model for inference is made easy with just **two lines of code**.
- The models are built using PyTorch nn.Module or TensorFlow tf.keras.Model classes, making them flexible and compatible with their respective ML frameworks.
- The library emphasizes simplicity by **minimizing abstractions**.
- The **"All in one file" concept** is central, where a model's forward pass is defined in a single file, ensuring code readability and modifiability.


🤗 Transformers has a distinctive feature: unlike traditional ML libraries, where models are built from modules shared across different files, in 🤗 Transformers **each model has its own distinct layers**.       

This approach offers several advantages:
 
- Firstly, it makes the models more **accessible** and **comprehensible**, as the code for each model is self-contained and easier to understand. 
- Secondly, it provides the **flexibility to experiment** and **modify specific models without impacting others**. You can make changes to one model's layers or configurations without affecting the functionality or performance of other models.

By adopting this approach, 🤗 Transformers enhances the **modularity**, ease of **experimentation**, and overall **usability** of the library.

# Behind the pipeline

A pipeline groups together three steps:

1. preprocessing, done with a tokenizer,
2. passing the inputs through the model,
3. postprocessing.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

## Preprocessing with a tokenizer

Models can’t process raw text directly, so we use a tokenizer to convert the text inputs into numbers that the model can make sense of. The tokenizer does the following:

- **Splitting** the input into **words, subwords, or symbols** (like punctuation) that are **called tokens**
- **Mapping** each token to an **integer**
- Adding **additional inputs** that may be useful to the model     
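
The three steps above can be sketched with a toy tokenizer. This is only a minimal illustration, assuming a tiny hand-written vocabulary and a simple word/punctuation split. The names `TOY_VOCAB` and `toy_tokenize` are hypothetical, and real tokenizers rely on a learned subword vocabulary instead.

```python
import re

# Hypothetical toy vocabulary; real checkpoints ship a learned vocabulary
# with tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}


def toy_tokenize(text):
    """Split text into tokens, map them to IDs, and add special tokens."""
    # Step 1: split into lowercase words and punctuation symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: map each token to an integer, falling back to [UNK].
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Step 3: add the extra inputs the model expects (start/end markers).
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]


print(toy_tokenize("I hate this so much!"))
```

Words missing from the toy vocabulary map to `[UNK]`, which is one reason the preprocessing has to match the one used during pretraining.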

Important note: all this **preprocessing** must be done in **exactly the same way as when the model was pretrained**.
To do so we use **AutoTokenizer** class and its **from_pretrained()** method to download that information. Using the **checkpoint name** of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it. 

In [1]:
from transformers import AutoTokenizer

# the default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


Once we have the tokenizer, we can directly pass our sentences to it and **we’ll get back a dictionary** that’s ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors.

**Transformer** models **only accept tensors as input**. To **specify the type of tensors** we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument.

We can pass one **sentence or a list of sentences,** as well as specifying the type of tensors we want to get back (if no type is passed, you will get a list of lists as a result).    

In [11]:
raw_inputs = [
    "I've been waiting for a HuggingFace.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(type(inputs))
print(len(inputs))  
print(inputs)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
2
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}


The output of tokenizer is a **dictionary** containing **two keys**:    
1. **input_ids**: contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. 
2. **attention_mask**: we'll see what this is later.
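
As a rough preview of how the zeros in the printed output arise, the sketch below pads a batch to a common length and builds a matching mask. It assumes a padding ID of 0, as seen in the output above; `pad_batch` is a hypothetical helper, not part of the library.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, without the padding.
batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sentence gains four padding IDs, and its mask marks exactly those four positions with 0.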

## Going through the model

We download our pretrained model using the **AutoModel** class, which also has a from_pretrained() method:

In [12]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


- This architecture **contains only the base Transformer module**: given some inputs, it **outputs what we’ll call hidden states**, also known as **features**. For each model input, we’ll retrieve a **high-dimensional vector** representing the **contextual understanding** of that input by the Transformer model.

- The hidden states, although valuable on their own, typically serve as inputs to another component of the model called the head. 
- Various tasks are accomplished using a **shared architecture**, but **each task has its own unique associated head**.

### A high-dimensional vector?

The **vector output** by the Transformer module is usually large. It generally has three dimensions:   

- **Batch size**: The **number of sequences processed** at a time (2 in our example).
- **Sequence length**: The **length of the numerical representation** of the sequence (12 in our example).
- **Hidden size**: The **vector dimension** of each model input.

It is said to be **“high dimensional” because of the last value**. 
The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

In [14]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 12, 768])


Outputs of 🤗 Transformers models behave like namedtuples or **dictionaries**. You can access the elements **by attributes** (like we did) or **by key** (outputs["last_hidden_state"]), or even **by index** if you know exactly where the thing you are looking for is (outputs[0]).

In [16]:
print(outputs.keys())

odict_keys(['last_hidden_state'])
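
The three access styles can be mimicked with a small stand-in class. This is only a sketch: `ToyOutput` is a hypothetical name, not the class 🤗 Transformers uses, and a placeholder string stands in for the tensor.

```python
class ToyOutput:
    """Hypothetical container reachable by attribute, key, or index."""

    def __init__(self, **fields):
        # Regular dicts keep insertion order, which gives index access meaning.
        self._fields = dict(fields)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        fields = self.__dict__.get("_fields", {})
        if name in fields:
            return fields[name]
        raise AttributeError(name)

    def __getitem__(self, key):
        # Integers select by position, anything else selects by key.
        if isinstance(key, int):
            return list(self._fields.values())[key]
        return self._fields[key]

    def keys(self):
        return self._fields.keys()


toy = ToyOutput(last_hidden_state="hidden-states-placeholder")
same = toy.last_hidden_state == toy["last_hidden_state"] == toy[0]
print(same)
```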


### Model heads: Making sense out of numbers

Model heads take high-dimensional hidden states as input and project them onto a different dimension.
Model heads typically consist of one or a few linear layers.      

The full model looks as follows:    

Model input --> Embeddings --> Layers --> Hidden states --> Head --> Model output

- Full model includes:  Embeddings --> Layers --> Hidden states --> Head       
- Transformer network: Embeddings --> Layers       

- The **output of the Transformer** model is **directly sent to the model head** for processing.
- The **embeddings** layer converts **input IDs into token** vectors and the subsequent layers use attention mechanisms to manipulate the vectors and generate the final sentence representation.

- There are many **different architectures** available in 🤗 Transformers, with **each one designed around tackling a specific task.** Here is a non-exhaustive list:

- *Model (retrieve the hidden states)     
- *ForCausalLM      
- *ForMaskedLM     
- *ForMultipleChoice     
- *ForQuestionAnswering       
- *ForSequenceClassification      
- *ForTokenClassification       
- and others

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:


In [17]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

The model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [19]:
print(outputs.logits.shape)

torch.Size([2, 2])


Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.
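
To make the shapes concrete, here is a minimal sketch of a head that projects hidden states onto two label scores. It assumes a single linear layer applied to the first token's hidden state, with made-up weights and a toy hidden size of 4; the head inside a real checkpoint may contain more layers and uses learned parameters.

```python
def linear(vector, weights, bias):
    """Compute weights @ vector + bias for one vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


# Toy hidden states: batch of 2 sequences, 3 tokens each, hidden size 4.
hidden_states = [
    [[0.5, -1.0, 0.0, 2.0], [0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0]],
    [[-0.5, 1.0, 0.0, -2.0], [0.4, 0.3, 0.2, 0.1], [0.0, 0.0, 0.0, 0.0]],
]

# Made-up head parameters: 2 labels x 4 hidden units.
weights = [[1.0, 0.0, 0.0, -1.0],
           [-1.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]

# Assumption: classify from the first token's hidden state of each sequence.
logits = [linear(seq[0], weights, bias) for seq in hidden_states]
print(len(logits), len(logits[0]))
```

The batch dimension is preserved while the sequence and hidden dimensions collapse into one score per label, which is why the result has one row per sentence and one column per label.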

## Postprocessing the output

Values we get as output from our model don’t necessarily make sense by themselves:

In [20]:
print(outputs.logits)

tensor([[ 1.9743, -1.6959],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


Our model predicted [1.9743, -1.6959] for the first sentence and [4.1692, -3.3464] for the second one. Those are not probabilities but **logits**, the **raw, unnormalized scores output by the last layer of the model**. To be **converted to probabilities,** they need to go through a **SoftMax layer** (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [21]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[9.7516e-01, 2.4838e-02],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted [0.9752, 0.0248] for the first sentence and [0.9995, 0.0005] for the second one. These are recognizable probability scores.
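
For readers who want to see the arithmetic, the sketch below recomputes the probabilities in plain Python from the logits printed earlier. It also shows, under the same assumptions, how a cross-entropy loss can be computed straight from logits, which is the kind of fusion mentioned above. The function names are illustrative and not library APIs.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """Cross-entropy computed from logits via log-sum-exp, no softmax call."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Logits copied from the model output printed above.
batch_logits = [[1.9743, -1.6959], [4.1692, -3.3464]]
probs = [softmax(row) for row in batch_logits]
print([[round(p, 4) for p in row] for row in probs])
```

Computing the loss directly from logits gives the same value as taking the negative log of the softmax probability, while avoiding an intermediate division.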

To get the labels corresponding to each position, we can inspect the id2label attribute of the model config (more on this in the next section):

In [22]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:      

First sentence: NEGATIVE: 0.9752, POSITIVE: 0.0248         
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005        
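
This last step can be sketched as picking the highest probability in each row and looking it up in the label mapping. The values are copied from the softmax and id2label outputs shown above; `top_label` is a hypothetical helper.

```python
# Values copied from the outputs shown above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [[9.7516e-01, 2.4838e-02], [9.9946e-01, 5.4418e-04]]


def top_label(probs, id2label):
    """Return the label with the highest probability and its score."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return id2label[best], probs[best]


for row in probabilities:
    label, score = top_label(row, id2label)
    print(f"{label}: {score:.4f}")
```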

# Models

# Tokenizers

# Handling multiple sequences

# Putting it all together