# What are transformers
Transformers are language models. 

Transformer models such as GPT, BERT, BART, and T5 have all been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion.

# Self-supervised learning 
Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data!

This type of model develops a **statistical** understanding of the language it has been trained on, but it’s not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning. During this process, the model is fine-tuned in a supervised way — that is, using **human-annotated labels** — on a given task.

Two examples of pretraining objectives are:

- Causal language modeling: the model predicts the next word, so the output depends on the past and present inputs, but not on future ones.
- Masked language modeling: the model predicts a masked word in the sentence.
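
Both objectives compute their training labels from the raw text itself. The following is a minimal sketch, assuming a whitespace split as a stand-in for real tokenization and a hypothetical `[MASK]` placeholder string. The function names are illustrative and do not come from any library:

```python
# Toy illustration of how both objectives derive labels from raw text alone.
# MASK_TOKEN and the whitespace split are simplifications made for this sketch.
MASK_TOKEN = "[MASK]"


def causal_lm_examples(tokens):
    """Pair every prefix with the token that follows it (next-token prediction)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, position):
    """Hide one token; the context on both sides stays visible."""
    masked = list(tokens)
    label = masked[position]
    masked[position] = MASK_TOKEN
    return masked, label


tokens = "the cat sat down".split()

causal = causal_lm_examples(tokens)
masked_input, masked_label = masked_lm_example(tokens, position=2)

for context, target in causal:
    print(context, "->", target)
print(masked_input, "->", masked_label)
```

In the causal case the context never contains tokens to the right of the target, while in the masked case the context includes both sides of the hidden word.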


# Fine-tuning
Fine-tuning is the training done after a model has been pretrained. To perform fine-tuning, you first **acquire a pretrained language model**, then perform **additional training** with a dataset specific to your task.
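
The two steps can be illustrated with a deliberately tiny stand-in: a one-parameter model instead of a Transformer. The starting weight plays the role of the pretrained model, and `task_data`, `fine_tune`, and the learning rate are all invented for this sketch, so it shows the idea only, not how a real language model is fine-tuned:

```python
# Toy stand-in for fine-tuning: a one-parameter linear model y = w * x.
# The "pretrained" weight and the task data are invented for this sketch.


def mse(w, data):
    """Mean squared error of y = w * x over (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def fine_tune(w, data, lr=0.05, epochs=50):
    """Continue training from an existing weight with plain gradient descent."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


pretrained_w = 1.0                                  # step 1: start from existing weights
task_data = [(1.0, 1.5), (2.0, 3.0), (3.0, 4.5)]    # step 2: task-specific data (y = 1.5 * x)

fine_tuned_w = fine_tune(pretrained_w, task_data)

print(f"loss before fine-tuning: {mse(pretrained_w, task_data):.4f}")
print(f"loss after fine-tuning:  {mse(fine_tuned_w, task_data):.4f}")
```

The point of the sketch is that training resumes from the existing weight rather than from a random one.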

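# pipeline

```
from transformers import pipeline

# With no model specified, a default sentiment-analysis checkpoint is downloaded and cached.
classifier = pipeline("sentiment-analysis")
```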

`pipeline()` groups together three steps:
- preprocessing (raw text --> IDs)
- passing the inputs through the model (IDs --> logits)
- postprocessing (logits --> predictions/probabilities)
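
The three stages can be sketched as three plain functions chained together. Everything here is a hypothetical stand-in: the vocabulary, the hand-written scoring "model", and the names `preprocess`, `model`, `postprocess`, and `toy_pipeline` are assumptions for illustration, not the Transformers implementation:

```python
import math

# Hypothetical stand-ins for the three stages; none of this is the Transformers API.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4}
TOKEN_SCORES = {2: 2.0, 3: -2.0}   # made-up "model": "love" is positive, "hate" is negative
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Raw text -> IDs."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def model(input_ids):
    """IDs -> logits, one unnormalized score per label."""
    score = sum(TOKEN_SCORES.get(i, 0.0) for i in input_ids)
    return [-score, score]


def postprocess(logits):
    """Logits -> probabilities (softmax) -> predicted label."""
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = probs.index(max(probs))
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, as a pipeline does."""
    return postprocess(model(preprocess(text)))


print(toy_pipeline("I love this"))
print(toy_pipeline("I hate this"))
```

Logits are raw scores, so the softmax in the last stage is what turns them into probabilities that sum to 1.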


# Tokenizer
```
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```
Using the checkpoint name of our model, `from_pretrained()` will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only **downloaded** the first time you run the code above).


A `tokenizer` is responsible for:
- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an `integer`
- Adding `additional inputs` that may be useful to the model

All this `preprocessing` needs to be done in exactly the same way as when the `model` was pretrained.
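
The three responsibilities can be sketched with a toy word-level tokenizer. The vocabulary, its IDs, and the `tokenize` and `encode` helpers are invented for this example. A real tokenizer loads its vocabulary from the checkpoint and typically splits into subwords, which is not attempted here:

```python
import re

# Toy vocabulary; the entries and their IDs are invented for this sketch.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}


def tokenize(text):
    """Step 1: split the input into words and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def encode(text):
    """Run all three steps and return a dictionary of model inputs."""
    tokens = ["[CLS]"] + tokenize(text) + ["[SEP]"]                # special tokens
    input_ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]     # step 2: token -> integer
    attention_mask = [1] * len(input_ids)                          # step 3: additional input
    return {"tokens": tokens, "input_ids": input_ids, "attention_mask": attention_mask}


encoded = encode("I hate this so much!")
print(encoded["tokens"])
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because the IDs only mean something relative to one vocabulary, encoding with a different vocabulary from the one used in pretraining would feed the model meaningless numbers.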


# `AutoTokenizer` class and its `from_pretrained()` method.

```
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```
Once we have the tokenizer, we can directly pass our sentences to it and we’ll get back a `dictionary` that’s ready to feed to our model! 

The only thing left to do is to convert the `list of input IDs` to `tensors`.

Transformer models only accept **`tensors` as input**.
If this is your first time hearing about tensors, you can think of them as **NumPy arrays** instead.

## `return_tensors`
To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument:

```
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
```

Here’s what the results look like as PyTorch tensors:

```
{
    'input_ids': tensor([
        [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0]
    ]), 
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}
```
The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. 
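
The zeros at the end of the second row come from padding: the shorter sentence is extended to the length of the longer one, and the attention mask marks which positions hold real tokens (1) and which hold padding (0). A minimal sketch in plain Python lists, reusing the IDs from the output above; `pad_batch` is a hypothetical helper, not a library function:

```python
PAD_ID = 0  # padding ID, as seen in the second row of input_ids above


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded IDs of the two sentences, copied from the tokenizer output above.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
         17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Both rows end up with 16 entries, which is what allows them to be stacked into a single rectangular tensor.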

# the model

We can download our pretrained model the same way we did with our tokenizer.



```
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

model = AutoModel.from_pretrained(checkpoint)
```

In this code snippet, we have **downloaded the same checkpoint** we used in our pipeline before (it should actually have been cached already) and **instantiated a model** with it.


Given some inputs, this architecture outputs what we’ll call **hidden states**, also known as features.

For each model **input**, we’ll retrieve a **high-dimensional vector** representing the contextual understanding of that input by the Transformer model.

# A high-dimensional vector

The vector output by the Transformer module is usually large. It generally has three dimensions:

- Batch size: The number of sequences processed at a time (2 in our example).
- Sequence length: The length of the numerical representation of the sequence (16 in our example).
- Hidden size: The vector dimension of each model input.
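
To make the three dimensions concrete, the sketch below builds a nested list with the same layout and reads its shape back. The zeros are placeholders rather than real model output, the `shape` helper is hypothetical, and the hidden size of 768 is taken from the output shape reported for this checkpoint in these notes:

```python
def shape(nested):
    """Return the dimensions of a rectangular nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE = 2, 16, 768

# One vector of HIDDEN_SIZE numbers per token, SEQ_LEN tokens per sequence,
# BATCH_SIZE sequences per batch. Zeros stand in for real hidden states.
hidden_states = [[[0.0] * HIDDEN_SIZE for _ in range(SEQ_LEN)]
                 for _ in range(BATCH_SIZE)]

print(shape(hidden_states))             # (2, 16, 768)
print(len(hidden_states[0][0]))         # size of the vector for one token: 768
```

Indexing follows the same order: `hidden_states[b][t]` is the vector for token `t` of sequence `b`.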

### head 

While these hidden states can be useful on their own, they’re usually inputs to another part of the model, known as the **head**.
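
As a rough illustration of what a head does, the sketch below projects hidden states down to one score per label. It assumes the simplest possible design, a single linear layer applied to the first token's vector. The tiny sizes, weights, and function names are made up for readability, and real heads are learned and may contain more layers:

```python
# Toy classification head: one linear layer applied to the first token's hidden vector.
# Sizes and weights are invented for this sketch; real heads are learned during training.


def linear(vector, weights, bias):
    """Compute weights @ vector + bias for one input vector."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def classification_head(hidden_states, weights, bias):
    """hidden_states: [batch][seq_len][hidden] -> logits: [batch][num_labels]."""
    return [linear(sequence[0], weights, bias) for sequence in hidden_states]


# 2 sequences, 3 tokens each, hidden size 4 (tiny on purpose).
hidden_states = [
    [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]],
    [[1.0, 0.0, -1.0, 0.0], [0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.0, 0.0]],
]
weights = [[1.0, 0.0, 0.0, 0.0],    # one row per label, one column per hidden dimension
           [0.0, 0.0, 1.0, 0.0]]
bias = [0.0, 0.5]

logits = classification_head(hidden_states, weights, bias)
print(logits)
```

The output has one row per sequence and one column per label, so the head turns a three-dimensional tensor of features into a much smaller table of scores.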

Passing the preprocessed inputs to the base model loaded with `AutoModel` returns these hidden states, without any head applied:

```
outputs = model(**inputs)
```

```
print(outputs.last_hidden_state.shape)
```

The result:
```
torch.Size([2, 16, 768])
```