# Introduction 

Goal of the Hugging Face Transformers library:    

Its goal is to provide a **single API** through which any **Transformer model can be loaded, trained, and saved**.       
The library’s main features:

- Downloading, loading, and using a state-of-the-art NLP model for inference is made easy with just **two lines of code**.
- The models are built using PyTorch nn.Module or TensorFlow tf.keras.Model classes, making them flexible and compatible with their respective ML frameworks.
- The library emphasizes simplicity by **minimizing abstractions**.
- The **"All in one file" concept** is central, where a model's forward pass is defined in a single file, ensuring code readability and modifiability.


🤗 Transformers has a distinctive feature: unlike traditional ML libraries, where models are constructed using shared modules across different files, in 🤗 Transformers **each model has its own distinct layers**.       

This approach offers several advantages:
 
- Firstly, it makes the models more **accessible** and **comprehensible**, as the code for each model is self-contained and easier to understand. 
- Secondly, it provides the **flexibility to experiment** and **modify specific models without impacting others**. You can make changes to one model's layers or configurations without affecting the functionality or performance of other models.

By adopting this approach, 🤗 Transformers enhances the **modularity**, ease of **experimentation**, and overall **usability** of the library.

# Behind the pipeline

Pipeline groups together three steps:

1. preprocessing, done with tokenizer,
2. passing the inputs through the model,
3. postprocessing.

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

## Preprocessing with a tokenizer

Models can’t process raw text directly, so we use a tokenizer to convert the text inputs into numbers that the model can make sense of. The tokenizer does the following:

- **Splitting** the input into **words, subwords, or symbols** (like punctuation) that are **called tokens**
- **Mapping** each token to an **integer**
- Adding **additional inputs** that may be useful to the model     

Important note: all this **preprocessing** must be done in **exactly the same way as when the model was pretrained**.
To do so we use **AutoTokenizer** class and its **from_pretrained()** method to download that information. Using the **checkpoint name** of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it. 

In [1]:
from transformers import AutoTokenizer

# the default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



Once we have the tokenizer, we can directly pass our sentences to it and **we’ll get back a dictionary** that’s ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors.

**Transformer** models **only accept tensors as input**. To **specify the type of tensors** we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument.

We can pass one **sentence or a list of sentences,** as well as specifying the type of tensors we want to get back (if no type is passed, you will get a list of lists as a result).    

In [11]:
raw_inputs = [
    "I've been waiting for a HuggingFace.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(type(inputs))
print(len(inputs))  
print(inputs)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
2
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}


The output of the tokenizer is a **dictionary** containing **two keys**:    
1. **input_ids**: contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. 
2. **attention_mask**: we'll see what this is later.
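The printed output above suggests how the two keys relate: the shorter sentence is filled up with zeros in `input_ids`, and the same positions are zero in `attention_mask`. Below is a minimal pure-Python sketch of that behaviour, not the tokenizer's actual implementation. The helper name `pad_batch` is hypothetical, and the padding ID `0` is taken from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their IDs; padding positions get pad_id.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded IDs copied from the tokenizer output above.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Both rows end up with the same length, which is what allows them to be stacked into a single tensor.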

## Going through the model

We download our pretrained model using **AutoModel class** which also has a from_pretrained() method:

In [12]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


- This architecture **contains only the base Transformer module**: given some inputs, it **outputs what we’ll call hidden states**, also known as **features**. For each model input, we’ll retrieve a **high-dimensional vector** representing the **contextual understanding** of that input by the Transformer model.

- The hidden states, although valuable on their own, typically serve as inputs to another component of the model called the head. 
- Various tasks are accomplished using a **shared architecture**, but **each task has its own unique associated head**.

### A high-dimensional vector?

The **vector output** by the Transformer module is usually large. It generally has three dimensions:   

- **Batch size**: The **number of sequences processed** at a time (2 in our example).
- **Sequence length**: The **length of the numerical representation** of the sequence (12 in our example).
- **Hidden size**: The **vector dimension** of each model input.

It is said to be **“high dimensional” because of the last value**. 
The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

In [14]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 12, 768])


Outputs of 🤗 Transformers models behave like namedtuples or **dictionaries**. You can access the elements **by attributes** (like we did) or **by key** (outputs["last_hidden_state"]), or even **by index** if you know exactly where the thing you are looking for is (outputs[0]).

In [16]:
print(outputs.keys())

odict_keys(['last_hidden_state'])


### Model heads: Making sense out of numbers

Model heads take high-dimensional hidden states as input and project them onto a different dimension.
Model heads typically consist of one or a few linear layers.      
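To make the projection concrete, here is a minimal sketch of a single linear layer written in plain Python. Everything in it is an assumption chosen for readability: the hidden size is 4 instead of 768, and the weights, bias, and the helper name `linear` are invented for illustration. A real head is a trained module and may contain more than one layer.

```python
def linear(vector, weights, bias):
    """Apply y = W x + b for one input vector."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


# One hidden-state vector with a toy hidden size of 4 (hypothetical values).
hidden_state = [0.5, -1.0, 2.0, 0.0]

# Toy head parameters: 2 output labels x 4 hidden features (hypothetical values).
weights = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, -0.5, 1.0],
]
bias = [0.1, -0.1]

logits = linear(hidden_state, weights, bias)
print(len(hidden_state), "->", len(logits))
print(logits)
```

The input has as many entries as the hidden size, while the output has one entry per label.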

The model is as follows:    

Model input --> Embeddings --> Layers --> Hidden states --> Head --> Model output

- Full model includes:  Embeddings --> Layers --> Hidden states --> Head       
- Transformer network: Embeddings --> Layers       

- The **output of the Transformer** model is **directly sent to the model head** for processing.
- The **embeddings** layer converts **input IDs into token** vectors and the subsequent layers use attention mechanisms to manipulate the vectors and generate the final sentence representation.

- There are many **different architectures** available in 🤗 Transformers, with **each one designed around tackling a specific task.** Here is a non-exhaustive list:

- *Model (retrieve the hidden states)     
- *ForCausalLM      
- *ForMaskedLM     
- *ForMultipleChoice     
- *ForQuestionAnswering       
- *ForSequenceClassification      
- *ForTokenClassification       
- and others

For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:


In [17]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

The model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [19]:
print(outputs.logits.shape)

torch.Size([2, 2])


Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2.

## Postprocessing the output

The values we get as output from our model don’t necessarily make sense by themselves:

In [20]:
print(outputs.logits)

tensor([[ 1.9743, -1.6959],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


Our model predicted [1.9743, -1.6959] for the first sentence and [4.1692, -3.3464] for the second one. Those are not probabilities but **logits**, the **raw, unnormalized scores output by the last layer of the model**. To be **converted to probabilities,** they need to go through a **SoftMax layer** (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy):

In [21]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[9.7516e-01, 2.4838e-02],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted [0.9752, 0.0248] for the first sentence and [0.9995, 0.0005] for the second one. These are recognizable probability scores.
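The SoftMax step can also be written out by hand, which shows that it only exponentiates each logit and normalizes each row so that it sums to 1. This is a plain-Python sketch, not the PyTorch implementation. The logits are copied from the printed tensor above, and subtracting the row maximum is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(row):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(row)
    # Shifting by the maximum avoids overflow and leaves the result unchanged.
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output printed earlier.
logits = [
    [1.9743, -1.6959],
    [4.1692, -3.3464],
]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```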

To get the labels corresponding to each position, we can inspect the id2label attribute of the model config (more on this in the next section):

In [22]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:      

First sentence: NEGATIVE: 0.9752, POSITIVE: 0.0248         
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005        
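Reading off the label can be automated by taking the position of the largest probability in each row and looking it up in `id2label`. The sketch below uses plain Python. The probabilities are rounded from the SoftMax tensor printed above, the mapping is the one returned by `model.config.id2label`, and `top_label` is a hypothetical helper name.

```python
# Mapping copied from model.config.id2label above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities rounded from the printed SoftMax output.
probabilities = [
    [0.9752, 0.0248],
    [0.9995, 0.0005],
]


def top_label(row, id2label):
    """Return the label with the highest probability and that probability."""
    best = max(range(len(row)), key=row.__getitem__)
    return id2label[best], row[best]


predicted = [top_label(row, id2label) for row in probabilities]
for label, score in predicted:
    print(label, score)
```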

# Models

- The AutoModel class and all of its relatives are simple **wrappers** over the wide variety of models available in the library.
- They can automatically guess the appropriate model architecture for your checkpoint and then instantiate a model with this architecture.

However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let’s take a look at how this works with a BERT model. The first thing we’ll need to do to initialize a BERT model is load a configuration object:

In [1]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)



The configuration contains many attributes that are used to build the model:

In [2]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



The **hidden_size** attribute defines the **size of the hidden_states vector**, and 
**num_hidden_layers** defines the **number of layers** the Transformer model has.

In [3]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!

The model can be used in this state, but it will output gibberish; it needs to be trained first. Rather than training from scratch, we usually reuse models that have already been trained.      

Loading a Transformer model that is already trained is simple — we can do this using the **from_pretrained()** method:

In [4]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Downloading model.safetensors: 100%|██████████| 436M/436M [00:08<00:00, 53.8MB/s] 
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


This model is now initialized with all the weights of the checkpoint. It can be used directly for **inference on the tasks it was trained on**, and it can also be **fine-tuned on a new task.** By training with pretrained weights rather than from scratch, we can quickly achieve good results.   

The identifier used to load the model can be the identifier of any model on the Model Hub, as long as it is compatible with the BERT architecture.

### Saving methods

In [None]:
model.save_pretrained("directory_on_my_computer")

This **saves two files** to your disk:     

ls directory_on_my_computer        

**config.json** **pytorch_model.bin**

1. **config.json** includes:
- the **attributes** necessary to build the model architecture.   
- some **metadata**, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.       
2. **pytorch_model.bin** 
- known as the state dictionary; 
- it contains all your **model’s weights**. 

**The two files go hand in hand; the configuration is necessary to know your model’s architecture, while the model weights are your model’s parameters.**
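As a rough illustration of why the configuration file is enough to rebuild the architecture, the sketch below writes a handful of attributes to a `config.json` file and reads them back with the standard library. It is a toy round trip, not what `save_pretrained()` does internally. The attribute values are copied from the `BertConfig` printed earlier, and the temporary directory stands in for `directory_on_my_computer`.

```python
import json
import os
import tempfile

# A few attributes copied from the BertConfig printed earlier.
config = {
    "model_type": "bert",
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "vocab_size": 30522,
}

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "config.json")

    # Saving: the configuration is plain JSON, readable in any editor.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)

    # Loading: the same attributes come back and can drive model construction.
    with open(path, encoding="utf-8") as handle:
        restored = json.load(handle)

print(restored["hidden_size"], restored["num_hidden_layers"])
```

The weights file is different in kind: it stores the numeric parameters, which are only meaningful once the architecture described by the configuration has been built.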

### Using a Transformer model for inference

After loading a model, we can make some predictions. Transformer models can only process the numbers that the tokenizer generates.      
Tokenizers can take care of casting the inputs to the appropriate framework’s tensors. Suppose we have the following sequences:

In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]

The tokenizer **converts these to vocabulary indices** which are typically called **input IDs**. Each sequence is now a list of numbers! The resulting output is:

In [7]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]


This is a list of encoded sequences: a list of lists. Tensors only accept rectangular shapes (think matrices). This “array” is already of rectangular shape, so converting it to a tensor is easy:

In [8]:
import torch

model_inputs = torch.tensor(encoded_sequences)
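Whether a list of encoded sequences is rectangular can be checked before attempting the conversion. The sketch below is plain Python. `is_rectangular` is a hypothetical helper, and the `ragged` list is an invented counterexample showing the case that would need padding first.

```python
def is_rectangular(sequences):
    """True when every inner list has the same length."""
    return len({len(seq) for seq in sequences}) <= 1


# The encoded sequences from above: three rows of four IDs each.
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

# Hypothetical counterexample: rows of different lengths.
ragged = [
    [101, 7592, 999, 102],
    [101, 102],
]

print(is_rectangular(encoded_sequences))
print(is_rectangular(ragged))
```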

### Using the tensors as inputs to the model     

Making use of the tensors with the model is extremely simple — we just call the model with the inputs:

In [9]:
output = model(model_inputs)

# Tokenizers


- Tokenizers are essential for NLP. 
- They serve one purpose: **converting text to numerical data** for models. 
- They enable processing by translating text into numbers. 
- The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation.

### Some examples of tokenization algorithms:

### Word-based     

- It’s generally very **easy to set up** and use with only a **few rules**, 
- often yields decent results.     
- the goal is to **split the raw text into words** and find a **numerical representation** for each of them:

e.g: Let's do this!    
    Split on spaces: Let's + do + this!
    Split on punctuation: Let + 's + do + this + !


In [10]:
# use whitespace to tokenize the text into words
tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
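The punctuation-aware split from the "Let's do this!" example can be sketched with a regular expression. The pattern below is one possible rule set chosen for this illustration, not the rule used by any particular library: it keeps an apostrophe together with the letters that follow it, and treats every other punctuation mark as its own token.

```python
import re

# Order matters: try "'s"-style suffixes first, then words, then punctuation.
TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")


def split_words_and_punctuation(text):
    """Split text into words and standalone punctuation marks."""
    return TOKEN_PATTERN.findall(text)


tokens = split_words_and_punctuation("Let's do this!")
print(tokens)
```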



- Word tokenizers can have variations that include **extra rules for punctuation**.        
- A **vocabulary** is defined by the **total number of independent tokens in a corpus**.          
- **Each word** in the tokenizer **is assigned a unique ID** within the vocabulary.       
- Completely covering a language with a word-based tokenizer **requires a large number of tokens**.      
- **Words with similar meanings or variations** (e.g., "dog" and "dogs") may be **initially treated as unrelated** by the model.
- A **custom token**, often represented as **"[UNK]"** or **`<unk>`**, is used to represent **words not in the vocabulary**.
- An excessive number of unknown tokens indicates a loss of information, so the goal is to minimize the occurrence of such tokens by carefully crafting the vocabulary.
- One way to reduce the amount of unknown tokens is to go one level deeper, using a character-based tokenizer.
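The points above can be sketched with a tiny word-level vocabulary. The corpus, the helper names `build_vocab` and `encode`, and the choice of ID 0 for "[UNK]" are all assumptions made for this illustration. The example shows two of the listed effects: "dogs" is not recognized even though "dog" is in the vocabulary, and it falls back to the unknown token.

```python
def build_vocab(corpus, unk_token="[UNK]"):
    """Assign a unique ID to every distinct word, reserving 0 for unknowns."""
    vocab = {unk_token: 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab, unk_token="[UNK]"):
    """Map each word to its ID, falling back to the unknown-token ID."""
    return [vocab.get(word, vocab[unk_token]) for word in sentence.split()]


# Toy corpus, invented for this example.
corpus = ["the dog barks", "the cat sleeps"]
vocab = build_vocab(corpus)

ids = encode("the dogs sleeps", vocab)
print(vocab)
print(ids)
```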

### Character-based

Character-based tokenizers **split the text into characters**, rather than words. This has two primary benefits:

- The vocabulary is much smaller.
- There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.


- Character-based tokenization may be less meaningful compared to word-based tokenization, as individual characters carry less information.            
- The significance of character-based tokenization varies by language, with Chinese characters holding more information than Latin characters.        
- Character-based tokenization leads to a larger number of tokens compared to word-based tokenization.
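The trade-off can be seen on a single sentence: splitting into characters yields many more tokens than splitting into words, while the set of distinct symbols stays small. This is a plain-Python sketch. The sentence is the one used in the word-based example, and spaces are counted as characters here, which is a simplifying assumption.

```python
text = "Let's do this!"

# Word-level: split on whitespace.
word_tokens = text.split()

# Character-level: every character, including spaces, is a token.
char_tokens = list(text)

# The character vocabulary is the set of distinct characters seen.
char_vocab = sorted(set(char_tokens))

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters")
```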

### Subword tokenization

Subword tokenization combines the advantages of both word-based and character-based approaches.

Summary:

- Subword tokenization algorithms follow the principle of:
    - **frequently used words** should **not be split** into smaller subwords,       
    - but **rare words** **should be decomposed** into meaningful subwords.                  
    e.g.: "annoyingly" is a rare word that can be split into "annoying" and "ly"; both are likely to appear more often as standalone subwords, and the meaning of "annoyingly" is kept by combining them.           
- Subword tokenization aims to strike a **balance between preserving word meaning and efficiently representing the vocabulary**.           

- An example: "Let's do tokenization!"      
tokenized:  ["Let's", "do", "token", "ization", "!"]

- Subwords obtained through tokenization carry significant **semantic meaning**.        
- For example, "tokenization" can be split into "token" and "ization," where both subwords have semantic significance.         
- By using subwords, long words can be represented efficiently with fewer tokens.
- Subword tokenization enables **good coverage with small vocabularies** and **minimizes the occurrence of unknown tokens.**
- This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.
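One simple way to split a word into subwords is greedy longest-match against a fixed vocabulary, similar in spirit to how WordPiece applies its vocabulary at tokenization time. The sketch below is a simplified illustration, not the algorithm of any specific tokenizer. The vocabulary is invented, the `##` prefix for word-internal pieces follows the BERT convention, and `subword_tokenize` is a hypothetical helper.

```python
def subword_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily take the longest vocabulary entry at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: give up on the whole word.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"do", "token", "##ization", "annoying", "##ly"}

for word in ["do", "tokenization", "annoyingly", "xyz"]:
    print(word, "->", subword_tokenize(word, toy_vocab))
```

With only five vocabulary entries, the frequent word stays whole, the two long words are covered by reusable pieces, and only the word with no matching piece becomes unknown.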

There are many more techniques out there. To name a few:        

- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models

## Loading and saving

- Loading and saving tokenizers is based on the same two methods as loading and saving models: **from_pretrained()** and **save_pretrained()**.     
- These methods will load or save: 
    - the **algorithm** used by the tokenizer (a bit like the architecture of the model),
    - as well as **its vocabulary** (a bit like the weights of the model).

To **load the BERT tokenizer** that is **trained with the same checkpoint as BERT**, you can use the BertTokenizer class. The process is similar to loading the model.

In [11]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

We can now use the tokenizer:

In [12]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

**Saving a tokenizer** is identical to saving a model:

In [None]:
tokenizer.save_pretrained("directory_on_my_computer")

We want to know how the input_ids (in the tokenizer output) are generated.

## Encoding

- Encoding is translating **text into numerical** representations.    
- Encoding consists of two steps: **tokenization** and **conversion to input IDs**.        
- Tokenization **splits the text into tokens** (words, parts of words, punctuation symbols, etc.).       
-  There are multiple rules that can govern that process, which is why we need to **instantiate the tokenizer using the name of the model,** to make sure we use the same rules that were used when the model was pretrained.            
- Conversion of tokens to numbers involves mapping them to the **tokenizer's vocabulary**.        
- The **tokenizer's vocabulary is downloaded during instantiation** using the `from_pretrained()` method.      


To get a better understanding of the two steps, we’ll explore them separately. Note that we will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs.

### Tokenization

The tokenization process is done by the tokenize() method of the tokenizer:

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


This tokenizer is a **subword tokenizer**: it splits the words until it obtains tokens that can be represented by its vocabulary. That’s the case here with Transformer, which is split into two tokens: Trans and ##former.

### From tokens to input IDs

The conversion to input IDs is handled by the convert_tokens_to_ids() tokenizer method:

In [14]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
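These seven IDs are the middle part of the `input_ids` list that the tokenizer returned when it was called directly on the same sentence. The direct call also placed 101 at the start and 102 at the end, which in BERT vocabularies are the special tokens `[CLS]` and `[SEP]`. The sketch below rebuilds that full output by hand in plain Python. It mirrors the printed result for a single sentence and is not the tokenizer's own code.

```python
# IDs of the special tokens, as seen at both ends of the tokenizer output.
CLS_ID, SEP_ID = 101, 102

# IDs produced by convert_tokens_to_ids() above.
ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]

input_ids = [CLS_ID] + ids + [SEP_ID]
encoding = {
    "input_ids": input_ids,
    # A single sentence: every position belongs to segment 0.
    "token_type_ids": [0] * len(input_ids),
    # No padding: every position is a real token.
    "attention_mask": [1] * len(input_ids),
}
print(encoding)
```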


## Decoding

Decoding is going the other way around: **from vocabulary indices**, we want to **get a string**. This can be done with the decode() method as follows:

In [15]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)

Using a transformer network is simple


- The `decode` method not only converts indices back to tokens but also groups together tokens that were part of the same words, resulting in a readable sentence.
- This behavior is particularly useful when working with models that generate new text, such as language generation or sequence-to-sequence tasks like translation or summarization.
- The `decode` method helps in producing coherent and meaningful output by properly combining the tokens into words and sentences.
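The grouping behaviour described above can be sketched in a few lines of plain Python. The toy vocabulary contains only the seven token/ID pairs printed in the tokenization and conversion steps, and `toy_decode` is a hypothetical helper. The real `decode()` method handles many more cases, such as special tokens and spacing around punctuation.

```python
# Token/ID pairs taken from the tokenize() and convert_tokens_to_ids() outputs.
vocab = {
    "Using": 7993,
    "a": 170,
    "Trans": 13809,
    "##former": 23763,
    "network": 2443,
    "is": 1110,
    "simple": 3014,
}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def toy_decode(ids):
    """Map IDs back to tokens and glue '##' pieces onto the previous word."""
    words = []
    for token in (id_to_token[i] for i in ids):
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


decoded = toy_decode([7993, 170, 13809, 23763, 2443, 1110, 3014])
print(decoded)
```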

# Handling multiple sequences

Some questions we want to answer:      

1. How do we handle multiple sequences?
2. How do we handle multiple sequences of different lengths?
3. Are vocabulary indices the only inputs that allow a model to work well?
4. Is there such a thing as too long a sequence?


## Models Expect a Batch of Inputs

## Padding the inputs

## Attention masks

## Longer sequences

# Putting it all together