<a href="https://colab.research.google.com/github/Priscilla97/llm-rag-foundations/blob/main/01_NLP_basics/3_Models_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Models (PyTorch)

In this section, we’ll take a closer look at **creating and using models**. We’ll use the *AutoModel* class, which is handy when you want to instantiate any model from a checkpoint.

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

# Creating a Transformer
Let’s begin by examining what happens when we instantiate an AutoModel:

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

The **checkpoint name** identifies both:
- a specific model architecture
- a set of pretrained weights

In this case, it is a **BERT model** with the base architecture:
- 12 layers
- a hidden size of 768
- 12 attention heads

It takes cased inputs, meaning that the distinction between uppercase and lowercase letters matters.

The **AutoModel** class and its relatives are simple wrappers designed to fetch the appropriate model architecture for a given checkpoint. It’s an *“auto” class*, meaning it will guess the appropriate model architecture for you and instantiate the correct model class.

However, if you know the type of model you want to use, you can use the class that defines its architecture directly:

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
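
To make the “auto” lookup concrete, here is a minimal sketch that avoids any download by building the model from a configuration object instead of a checkpoint. The tiny sizes below are made-up values chosen only to keep the example fast; they are not the settings of bert-base-cased, and the resulting weights are random.

```python
from transformers import AutoModel, BertConfig, BertModel

# Hypothetical, deliberately tiny configuration (NOT bert-base-cased).
tiny_config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

# The auto class looks at the type of the config and picks the
# matching architecture class for us.
auto_model = AutoModel.from_config(tiny_config)

print(type(auto_model).__name__)
print(isinstance(auto_model, BertModel))
```

Under these assumptions, the object returned by the auto class should be an instance of the same BertModel class you would otherwise import directly.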

# Loading and saving

Saving a model is as simple as saving a tokenizer: models have the same *save_pretrained()* method, which saves the model’s weights and architecture configuration:

In [None]:
model.save_pretrained("directory_on_my_computer")

This will save two files to your disk:

ls directory_on_my_computer
config.json model.safetensors

- the **config.json file** contains all the necessary *attributes* needed to build the model architecture. This file also contains some *metadata*, such as where the *checkpoint* originated.

- the **model.safetensors** file is known as the state dictionary; it contains all your *model’s* *weights*.

The two files work together: the **configuration file** is needed to know about the model architecture, while the **model weights** are the parameters of the model.

To reuse a saved model, use the from_pretrained() method again:

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("directory_on_my_computer")
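
The save and reload round trip can be sketched end to end without downloading anything. This is only an illustration: the tiny configuration is a made-up stand-in for a real checkpoint, and a temporary directory replaces directory_on_my_computer.

```python
import json
import os
import tempfile

import torch
from transformers import AutoModel, BertConfig, BertModel

# Hypothetical tiny model with random weights, used only for illustration.
tiny_config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
original = BertModel(tiny_config)

with tempfile.TemporaryDirectory() as save_dir:
    # Writes the configuration file and the weights file.
    original.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))

    # config.json is plain JSON describing the architecture.
    with open(os.path.join(save_dir, "config.json")) as f:
        saved_config = json.load(f)

    # Rebuild the architecture from the config and load the saved weights.
    reloaded = AutoModel.from_pretrained(save_dir)

# Compare every tensor in the two state dictionaries.
reloaded_state = reloaded.state_dict()
weights_match = all(
    torch.equal(tensor, reloaded_state[name])
    for name, tensor in original.state_dict().items()
)

print(saved_files)
print(saved_config["model_type"], saved_config["hidden_size"])
print(weights_match)
```

The name of the weights file may differ depending on the installed library version, which is why the sketch lists the directory rather than assuming a filename.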

To share a model on the Hugging Face Hub, you first need to log in to your account. If you’re using a notebook, you can easily log in with this:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Then you can push the model to the Hub with the push_to_hub() method:

In [None]:
model.push_to_hub("my-awesome-model")

This will upload the model files to the Hub, in a repository under your namespace named my-awesome-model. Then, anyone can load your model with the from_pretrained() method!

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("your-username/my-awesome-model")

# Encoding text
Tokenizers split the text into tokens and then **convert** these tokens into numbers.

We can see this conversion through a simple tokenizer:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

OUTPUT: <BR>
{'input_ids': [101, 8667, 117, 1000, 1045, 1005, 1049, 2235, 17662, 12172, 1012, 102], <br>
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], <br>
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

We get a **dictionary** with the following fields:

- **input_ids**: numerical representations of your tokens
- **token_type_ids**: these tell the model which part of the input is sentence A and which is sentence B
- **attention_mask**: this indicates which tokens should be attended to and which should not (discussed more in a bit)

We can decode the input IDs to get back the original text:

In [None]:
tokenizer.decode(encoded_input["input_ids"])

OUTPUT <br>
"[CLS] Hello, I'm a single sentence! [SEP]"

You’ll notice that the tokenizer has added special tokens — **[CLS] and [SEP]** — required by the model. Not all models need special tokens; they’re utilized when a model was pretrained with them, in which case the tokenizer needs to add them as that model expects these tokens.
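
The encode and decode steps can be illustrated with a toy tokenizer written in plain Python. Everything here is an assumption made for illustration: the vocabulary, the IDs, and the whitespace splitting are invented, and the real BERT tokenizer uses a far larger vocabulary and a subword algorithm.

```python
# Hypothetical toy vocabulary; the IDs do not match any real model.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def toy_encode(text):
    """Split on whitespace, add special tokens, and map tokens to IDs."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    input_ids = [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # a single sentence: all zeros
        "attention_mask": [1] * len(input_ids),  # every token is real
    }


def toy_decode(input_ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(id_to_token[token_id] for token_id in input_ids)


encoded = toy_encode("Hello world")
print(encoded["input_ids"])  # [2, 4, 5, 3]
print(toy_decode(encoded["input_ids"]))  # [CLS] hello world [SEP]
```

As with the real tokenizer, decoding returns the special tokens along with the text, because they are part of the encoded sequence.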

You can encode **multiple sentences** at once by passing them as a list. (Passing two strings as separate arguments would instead encode them as a single sentence pair.)

Note that when passing multiple sentences, each dictionary value becomes a **list of lists**, with one inner list per sentence.

In [None]:
encoded_input = tokenizer(["How are you?", "I'm fine, thank you!"])
print(encoded_input)

OUTPUT: <br>
{'input_ids': [[101, 1731, 1132, 1128, 136, 102], <br>
         [101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]], <br>
 'token_type_ids': [[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], <br>
 'attention_mask': [[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

But there’s a **problem**: the two lists don’t have the same length! <br>**Arrays and tensors need to be rectangular**, so we can’t simply convert these lists to a PyTorch tensor (or NumPy array). The tokenizer provides an option for that: **padding**.
        

# Padding
If we ask the tokenizer to pad the inputs by setting *padding=True*, it will make all sentences the same length by adding a special padding token to the sentences that are shorter than the longest one:

In [None]:
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)

OUTPUT: <br>
{'input_ids': tensor([[  101,  1731,  1132,  1128,   136,   102,     0,     0,     0,     0], <br>
         [  101,  1045,  1005,  1049,  2503,   117,  5763,  1128,   136,   102]]), <br>
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), <br>
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

         

Now we have rectangular tensors! Note that the padding tokens have been encoded into input IDs with **ID 0**, and they have an **attention mask value of 0** as well. This is because those padding **tokens shouldn’t be analyzed by the model**: they’re not part of the actual sentence.
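
What padding does can be sketched by hand. The helper below is a hypothetical illustration, not the tokenizer’s actual implementation; it reuses the input IDs printed above and assumes a padding ID of 0.

```python
import torch

# Input IDs of the two sentences, copied from the output above.
ragged = [
    [101, 1731, 1132, 1128, 136, 102],
    [101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102],
]

# Lists of different lengths cannot be turned into a tensor directly.
try:
    torch.tensor(ragged)
    ragged_ok = True
except ValueError:
    ragged_ok = False


def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return torch.tensor(input_ids), torch.tensor(attention_mask)


input_ids, attention_mask = pad_batch(ragged)
print(ragged_ok)
print(input_ids.shape)
print(attention_mask[0].tolist())
```

The attention mask is built alongside the padding so that every added position is marked with 0 and every real token with 1.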

# Truncating inputs
The tensors might get **too big** to be processed by the model.

For instance, BERT was only pretrained with sequences up to 512 tokens, so *it cannot process longer sequences*. If you have sequences longer than the model can handle, you’ll need to **truncate** them with the truncation parameter:

In [None]:
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)
print(encoded_input["input_ids"])

OUTPUT: <br>
[101, 1188, 1110, 170, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1179, 5650, 119, 102]

By combining the *padding* and *truncation* arguments with *max_length*, you can control the size of your tensors. Here, both sentences are cut down to 5 tokens:

In [None]:
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

OUTPUT: <br>
{'input_ids': tensor([[  101,  1731,  1132,  1128,   102],
         [  101,  1045,  1005,  1049,   102]]), <br>
 'token_type_ids': tensor([[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]), <br>
 'attention_mask': tensor([[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]])}
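
The output above suggests that truncation drops content tokens while keeping the closing [SEP]. A plain-Python sketch of that behavior follows; the helper is hypothetical and assumes the special-token IDs seen in this notebook (101, 102) and a padding ID of 0. Unlike *padding=True*, it always pads up to max_length, which should correspond to passing *padding="max_length"* to the tokenizer.

```python
def truncate_and_pad(content_ids, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Fit a sequence of content token IDs into exactly max_length positions."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    budget = max_length - 2
    kept = content_ids[:budget]
    input_ids = [cls_id] + kept + [sep_id]
    attention_mask = [1] * len(input_ids)

    # Fill any remaining positions with padding, masked out with 0.
    missing = max_length - len(input_ids)
    return input_ids + [pad_id] * missing, attention_mask + [0] * missing


# Content tokens of "How are you?" taken from the earlier output.
long_ids, long_mask = truncate_and_pad([1731, 1132, 1128, 136], max_length=5)
print(long_ids)   # [101, 1731, 1132, 1128, 102]
print(long_mask)  # [1, 1, 1, 1, 1]

# A shorter input gets padded instead of truncated.
short_ids, short_mask = truncate_and_pad([1731], max_length=5)
print(short_ids)   # [101, 1731, 102, 0, 0]
print(short_mask)  # [1, 1, 1, 0, 0]
```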

# Adding special tokens
Special tokens are added to better represent the sentence boundaries, such as the **beginning** of a sentence **([CLS])** or **separator** between sentences **([SEP])**. Let’s look at a simple example:

In [None]:
encoded_input = tokenizer("How are you?")
print(encoded_input["input_ids"])
tokenizer.decode(encoded_input["input_ids"])

OUTPUT: <br>
[101, 1731, 1132, 1128, 136, 102] <br>
'[CLS] How are you? [SEP]'

These special tokens are automatically added by the tokenizer. Not all models need special tokens; they are primarily used when a model was pretrained with them, in which case the tokenizer will add them since the model expects them.

# Summary

In [None]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

In [None]:
print(config)

BertConfig {
  [...]
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  [...]
}

In [None]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!
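
One way to see the random initialization is to build two models from the same configuration under different seeds and compare their weights. This is a sketch under assumptions: the tiny configuration is made up to keep it fast, and the embedding matrix is just one convenient weight to inspect.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny configuration, used only for illustration.
tiny_config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

torch.manual_seed(0)
model_a = BertModel(tiny_config)

torch.manual_seed(1)
model_b = BertModel(tiny_config)

weights_a = model_a.embeddings.word_embeddings.weight
weights_b = model_b.embeddings.word_embeddings.weight

# Same architecture, hence same shapes, but different random values.
same_shape = weights_a.shape == weights_b.shape
same_values = torch.equal(weights_a, weights_b)
print(same_shape, same_values)
```

A model built this way has the right architecture but has learned nothing yet, which is why loading pretrained weights with from_pretrained() is usually preferable.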

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

In [None]:
model.save_pretrained("directory_on_my_computer")

In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]

In [None]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [None]:
import torch

model_inputs = torch.tensor(encoded_sequences)

Now we can use these tensors as inputs to the **model**. <br>
While the model accepts a lot of different arguments, only the input IDs are necessary.

In [None]:
output = model(model_inputs)
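
To inspect what the model returns, here is a minimal sketch using a made-up tiny configuration with random weights, so it runs without downloading a checkpoint. The name tiny_model is hypothetical and chosen to avoid overwriting the pretrained model above; the values it produces are meaningless, but the shapes illustrate the structure of the output.

```python
import torch
from transformers import BertConfig, BertModel

# Hypothetical tiny model with random weights (NOT bert-base-cased).
tiny_config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(tiny_config)
tiny_model.eval()  # disable dropout for inference

encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]
model_inputs = torch.tensor(encoded_sequences)

# No gradients are needed when we only want the outputs.
with torch.no_grad():
    output = tiny_model(model_inputs)

# One hidden vector per token: (batch size, sequence length, hidden size).
print(output.last_hidden_state.shape)
```

With the pretrained bert-base-cased model, the last dimension would be 768 instead of 32, matching the hidden size shown in the configuration.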