# Intro
* HFT - Hugging face transformers
* API - Application Programming Interface
* HFT models are large (millions to tens of billions of params), so training & deploying them is complicated
* HFT library was created to provide a single API for loading, training & saving any Transformer model
    * Ease of use - downloading, loading & using an NLP model for inference takes just two lines of code
    * Flexibility - all models are simple pytorch (nn.Module) or TensorFlow (tf.keras.Model) classes, handled same as other models in respective ML frameworks
    * Simplicity - each model's forward pass is defined entirely in one file, so the code is understandable & hackable
* HFT different from other libraries
    * Other ML libs built on modules shared across files
    * HFT models have their own layers, allowing you to experiment w/ one model w/o affecting other models
* Chapter Contents
    * End-to-end example using a model & tokenizer together to replicate the pipeline() function
    * Discuss model API
        * Model & Config classes
        * Load a model & show how it processes numerical inputs to output predictions
    * Tokenizer API - Main component of Pipeline function
        * First & Last processing steps, handling:
            * conversion from text to numerical inputs for neural network
            * Conversion back to text when needed
    * Batching multiple sentences through model
    * High-level tokenizer() function
# Behind the Pipeline
**Reminder**
![Image](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)
* Tokenization: Raw Text to Input IDs
    * Raw text input: "This course is amazing!"
    * Split into tokens [this, course, is, amazing, !]
    * Tokenizer adds the special tokens the model expects [[CLS], this, course, is, amazing, !, [SEP]]
        * CLS stands for "classify", placed at beginning of input sequence for sequence classification or next sentence prediction
            * Summary token absorbing contents of sentence
        * SEP is "separator" used to separate two segments or sentences in the same input or mark the end of a sentence/segment
    * Input IDs: Each token is matched to its unique ID in the vocabulary of the pretrained model
        * [101, 2023, 2607, 2003, 6429, 999, 102]
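
The three steps above (split, add special tokens, map to IDs) can be sketched without any library. This is only a toy illustration, not how the HFT tokenizer is implemented: `TOY_VOCAB`, `toy_tokenize` and `toy_encode` are hypothetical names, and the vocabulary holds just the seven tokens and IDs from the example above.

```python
import re

# Hypothetical mini-vocabulary: only the tokens from the example above,
# with the IDs listed there. A real vocabulary has tens of thousands of entries.
TOY_VOCAB = {
    "[CLS]": 101,
    "[SEP]": 102,
    "this": 2023,
    "course": 2607,
    "is": 2003,
    "amazing": 6429,
    "!": 999,
}


def toy_tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Tokenize, add the special tokens, then look up each token's ID."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return tokens, [TOY_VOCAB[token] for token in tokens]


tokens, input_ids = toy_encode("This course is amazing!")
print(tokens)
print(input_ids)
```

A real tokenizer also handles words missing from the vocabulary (by splitting them into subwords), which this sketch does not attempt.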
## Chapter 1
The code:
```Python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
```
Obtained:
```bash
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]
 ```
### Preprocessing W/ Tokenizer
* HFT models can't process raw text, so a tokenizer first converts the inputs to numbers. The tokenizer will:
    * Split input into words, subwords, symbols called tokens
    * Map each to integer
    * Add additional inputs that may be useful to the model
* All of this needs to match the way the model was pretrained
    * That info can be downloaded from model hub
    * Use ```AutoTokenizer``` Class & its ```from_pretrained()``` method.
    * Pass the model's checkpoint name; it automatically fetches the data associated w/ the model's tokenizer & caches it

```Python
from transformers import AutoTokenizer # Note that AutoTokenizer is case-sensitive

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
```

* First time will download tokenizer
* Once we have the tokenizer, we can pass our sentence to it & get a dictionary to be fed to our model
    * Don't worry about ML framework used as backend when using HFT, may be PyTorch or TensorFlow or Flax
    * Transformer models only accept tensors as inputs
        * Tensors are like NumPy arrays: can be a scalar (0D), vector (1D), matrix (2D) or have more dimensions
* To specify tensor type (PyTorch, TensorFlow, or plain NumPy) we use ```return_tensors``` argument
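
To make the dimension terminology concrete, here is a small sketch that uses plain nested lists as stand-ins for tensors. `shape_of` is a hypothetical helper written for these notes (not part of PyTorch, TensorFlow or NumPy) and it assumes the nested lists are rectangular.

```python
def shape_of(value):
    """Return the shape of a rectangular nested list as a tuple."""
    if not isinstance(value, list):
        return ()  # a bare number has no dimensions
    if not value:
        return (0,)
    return (len(value),) + shape_of(value[0])


scalar = 5                        # 0D
vector = [1, 2, 3]                # 1D
matrix = [[1, 2, 3], [4, 5, 6]]   # 2D

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    shape = shape_of(value)
    print(f"{name}: shape={shape}, dimensions={len(shape)}")
```

A list of sentences w/ different lengths would not be rectangular, which is why padding is needed before a tensor can be built.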

```Python
from transformers import AutoTokenizer # Note that AutoTokenizer is case-sensitive

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
```

Response:
```Bash
{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0, 0,     0,     0,     0,     0,     0]]),
        'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
```
* The attention mask details where padding was applied, so that the model pays it no attention
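
As a rough sketch of how padding and the attention mask relate, the snippet below pads two lists of IDs (copied from the output above) to the length of the longest one. `pad_batch` and `PAD_ID` are hypothetical names for this illustration; the padding ID 0 is taken from the output above and may differ for other tokenizers.

```python
PAD_ID = 0  # padding ID seen in the output above (assumption: other tokenizers may differ)


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The second sentence has 8 real tokens, so its mask is eight 1s followed by eight 0s, the same pattern as in the tokenizer output.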

### Going Through the model
* HFT provides an AutoModel class which also has a from_pretrained() method

```Python
from transformers import AutoModel # AutoModel only instantiates the BODY of the model, not the head

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```
Response:
```bash
torch.Size([2, 16, 768])
```


* from_pretrained() downloads the same checkpoint used in the pipeline before (or reuses the cached copy)
* This architecture contains only the base Transformer module: given some inputs, it outputs hidden states (aka features)
* For each model input, we get a high-dimensional vector representing the **contextual understanding of that input by the Transformer model**
    * These hidden states are usually inputs to another part of the model, the head
### High Dimensional Vector
* Vector output by the Transformer module is usually large w/ 3 dimensions:
    * **Batch Size:** Number of sequences processed at a time (2 in our example)
    * **Sequence Length:** Length of numerical representation of the sequence (16 in our example)
        * Number of tokens in each sequence **after tokenization & padding**
        * All sequences in a batch must have the **same length** so they can be stacked into one rectangular tensor
        * Shorter sentences are padded (e.g. "I love NLP")
            * Simplified tokenization [I, love, NLP] -> Padded [I, love, NLP, Pad, Pad, Pad]
        * Longer sentences are truncated (e.g. "This sentence is way too long and must be cut")
            * Tokenized [This, sentence, is, way, too, long, and, must, be, cut] -> Truncated [This, sentence, is, way, too, long]
    * **Hidden Size:** Vector dimension of each model input
        * Size of each individual token embedding/vector
        * Defines richness of the model's internal representation of each token
        * E.g. in bert-base-uncased, hidden size 768, meaning each token is a 768 dimensional vector after processing
        * Larger models can reach 3072 or more
    * **Overall Example**
        * With the checkpoint used above (hidden size 768, same as bert-base-uncased), model output shape is:
        ```torch.Size([2,16,768])```
        * We can see this by running:
        ```python
        outputs = model(**inputs)
        print(outputs.last_hidden_state.shape)
        ```
* Outputs of HFT models behave like namedtuples or dictionaries
* Access elements by:
    * attribute (like the python code above)
    * key ```(outputs["last_hidden_state"])```
    * index (if you know exactly where the thing you're looking for is) ```(outputs[0])```
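
Going back to the Sequence Length bullets above, padding and truncating to a fixed length can be sketched as follows. `pad_or_truncate` is a hypothetical helper, the whitespace split stands in for real tokenization, and the target length of 6 comes from the simplified example in these notes.

```python
def pad_or_truncate(tokens, max_length, pad_token="[PAD]"):
    """Cut sequences that are too long, pad the ones that are too short."""
    if len(tokens) >= max_length:
        return tokens[:max_length]
    return tokens + [pad_token] * (max_length - len(tokens))


short = "I love NLP".split()
long = "This sentence is way too long and must be cut".split()

batch = [pad_or_truncate(seq, max_length=6) for seq in (short, long)]
for row in batch:
    print(row)

# Every row now has the same length, so the batch is rectangular
print((len(batch), len(batch[0])))
```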
### Model Heads: Making Sense of Numbers
* Output of the Transformer model is sent directly to the model head for processing
* Heads take the high-dim vectors of hidden states as input & project them onto a different dimension
* Usually composed of one or a few linear layers:
![Heads](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg)
* In diagram:
    * Model represented by embeddings layer & subsequent layers
        * Embeddings converts each input ID in the tokenized input into a vector representing associated token
        * Subsequent layers manipulate those vectors using attention mechanisms to produce final representation of sentences
* Many architectures in HFT, each one designed around a specific task. Some are:
    * Model (retrieve the hidden states)
        * E.g. BertModel or GPT2Model
        * Only returns hidden states - raw internal representations from transformer layers
            * Input words➡️Converted to numbers (tokens)➡️Processed through Transformer Layers➡️Outputs a vector (long list of numbers) for each token at each layer
            * Output vectors are called hidden states; think of them as the model's internal understanding of each word in the sentence - a "thought cloud" containing meaning, grammar & context
        * Useful if you want to build something custom, only provides base model
    * ForCausalLM
        * Used for text generation in GPT style models to predict next word
    * ForMaskedLM
        * Fill-in-the-blank to predict missing words
    * ForMultipleChoice
        * Choose best answer out of several options
    * ForQuestionAnswering
        * Extract Answer span - given context & question, finds start and end position of answer inside context
    * ForSequenceClassification
        * Sentiment or intent
    * ForTokenClassification
        * Tag each word, like NER to find names
* EX: Using sequence classification head, we use AutoModelForSequenceClassification instead of just AutoModel

```Python
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.logits.shape)
```
Response:
```bash
torch.Size([2, 2])
```


* Output is ```torch.Size([2, 2])```
    * The shape has 2 numbers because the logits are a 2D tensor (matrix) with
        * 2 Rows - one for each input sentence
        * 2 Columns - one for each class label
    * In general shape is: [batch_size, num_labels]
* Compared w/ the hidden states, dimensionality is much lower
    * Model head takes the high-dimensional vectors as inputs
    * Outputs vectors containing two values (one per label)
    * Since we have two sentences & two labels, result we get is 2x2
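
To illustrate how a head reduces dimensionality, here is a toy linear layer in plain Python. All numbers are made up: the hidden size is 4 instead of 768, and the sketch assumes each sentence has already been reduced to one summary vector. `linear_head` is a hypothetical helper, not the head used by the checkpoint.

```python
def linear_head(hidden, weights, bias):
    """Project one hidden vector onto len(bias) logits: W @ hidden + b."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]


# Made-up values: 2 sentences, hidden size 4, 2 labels
hidden_states = [
    [0.5, -1.0, 0.25, 2.0],   # summary vector for sentence 1
    [-0.5, 1.5, 0.0, -1.0],   # summary vector for sentence 2
]
weights = [
    [1.0, 0.0, 0.0, -0.5],    # weights for label 0
    [0.0, -1.0, 2.0, 0.5],    # weights for label 1
]
bias = [0.0, 0.1]

logits = [linear_head(h, weights, bias) for h in hidden_states]
print(logits)
print((len(logits), len(logits[0])))  # (batch_size, num_labels)
```

The input is [batch_size, hidden_size] and the output is [batch_size, num_labels], which is the same reduction seen in the real output shape.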
### Post Processing Output

```Python
print(outputs.logits)
```
Response:
```bash
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)
```


* Our model predicted [-1.5607, 1.6123] for the first sentence & [ 4.1692, -3.3464] for the second
    * These aren't probabilities but **Logits**
        * Raw, unnormalized scores outputted by the last layer of the model
        * To be converted to probabilities, they need to go through a SoftMax layer
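
As a rough illustration of what the softmax step computes, here is a dependency-free sketch applied to the logits printed above. `softmax` here is a hand-written helper for these notes, not the PyTorch function; it exponentiates each score and divides by the row total so each row sums to 1.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    top = max(logits)
    # Subtracting the max does not change the result but avoids overflow
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 4))
```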

```Python
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
```
Response:
```bash
tensor([[4.0195e-02, 9.5981e-01],
        [9.9946e-01, 5.4419e-04]], grad_fn=<SoftmaxBackward0>)
```


* **Probability Scores:** Model predicted [4.0195e-02, 9.5981e-01] for first & [9.9946e-01, 5.4419e-04] for second
* To get labels corresponding to each position, we can inspect ```id2label``` attribute of model config

```Python
model.config.id2label
```
Response:
```bash
{0: 'NEGATIVE', 1: 'POSITIVE'}
```
* Now we can conclude that the model predicted:
    * First sentence - Negative: 0.0402, Positive: 0.9598
    * Second sentence - Negative: 0.9995, Positive: 0.0005
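
The last step, picking the most likely label for each sentence, can be sketched as below. The probabilities and the label mapping are copied from the outputs above; `to_predictions` is a hypothetical helper that mimics the shape of the Chapter 1 pipeline output and is not how pipeline() itself is implemented.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [4.0195e-02, 9.5981e-01],   # first sentence
    [9.9946e-01, 5.4419e-04],   # second sentence
]


def to_predictions(probabilities, id2label):
    """Keep the highest-scoring label and its probability for each row."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest value
        results.append({"label": id2label[best], "score": row[best]})
    return results


predictions = to_predictions(probabilities, id2label)
for prediction in predictions:
    print(prediction)
```

The labels and scores agree w/ the pipeline result from Chapter 1 up to rounding.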