# Using HF Transformers

A deep dive into the HF Transformers `pipeline()` function.

It’s important to understand how it works under the hood. This notebook covers:

- how to use tokenizers and models to **replicate the behavior of `pipeline()`**
- how to **load and save models and tokenizers**
- different **tokenization approaches**
- how to handle multiple sentences of varying lengths

Context.

Transformer models are usually very large. With millions to tens of billions of parameters, training and deploying these models is a complicated undertaking. Furthermore, with new models being released on a near-daily basis and each having its own implementation, trying them all out is no easy task.

The HF Transformers library was created to solve this problem. Its goal is to **provide a single API** through which any Transformer model can be loaded, trained, and saved. 
The library’s main features are:

- **Ease of use:** Downloading, loading, and using a state-of-the-art NLP model for inference can be done in just two lines of code.
- **Flexibility:** At their core, all models are simple PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes and can be handled like any other models in their respective machine learning (ML) frameworks.
- **Simplicity:** Hardly any abstractions are made across the library. **“All in one file” is a core concept**: a model’s forward pass is **entirely defined in a single file**, so that the code itself is understandable and hackable.

This last feature makes 🤗 Transformers quite different from other ML libraries. The models are not built on modules that are shared across files; instead, each model has its own layers. In addition to making the models more approachable and understandable, this allows you to easily experiment on one model without affecting others.



## End-to-end example 

We start with a model and a tokenizer and use them together to replicate the `pipeline()` function.

- Component 1: Model API
- Component 2: Tokenizer API

Tokenizers take care of the first and last processing steps, handling the conversion from text to numerical inputs for the neural network, and the conversion back to text when it is needed. 

**What happens inside the `pipeline()` function?**

<img src="https://huggingface.co/course/static/chapter2/full_nlp_pipeline.png" width="50%">

In [1]:
### What was the pipeline?

from transformers import pipeline

classifier = pipeline('sentiment-analysis')
raw_inputs = [
    "I wanted to got to othe Tacos place last Friday but there's was a queue as long as I would expect in front of the Berghain. So I skipped it.",
    "The dinner was pretty tasty ;-)"
    ]
classifier(raw_inputs)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'NEGATIVE', 'score': 0.9987136125564575},
 {'label': 'POSITIVE', 'score': 0.9998186230659485}]

### Preprocessing with a tokenizer

Like other neural networks, Transformer models can’t process raw text directly, so the first step of our pipeline is to **convert the text inputs into numbers** that the model can make sense of. To do this we use a tokenizer, which is responsible for:

- **Splitting** the input into words, subwords, or symbols (like punctuation) that are called tokens
- **Mapping** each token to an integer
- Adding additional inputs that may be useful to the model
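
The three steps above can be sketched with a toy tokenizer. This is only an illustration under simplifying assumptions: the vocabulary `TOY_VOCAB` and the helpers `toy_tokenize()` and `toy_encode()` are hypothetical names invented here, and the plain word/punctuation split stands in for the subword algorithm that a real checkpoint uses.

```python
import re

# Hypothetical toy vocabulary; a real checkpoint ships tens of thousands of entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "the": 4, "dinner": 5, "was": 6, "tasty": 7, "!": 8,
}

def toy_tokenize(text):
    """Step 1: split lowercased text into words and punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    """Steps 2 and 3: add special tokens, then map every token to an integer."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]

print(toy_tokenize("The dinner was pretty tasty!"))
print(toy_encode("The dinner was pretty tasty!"))  # [2, 4, 5, 6, 1, 7, 8, 3]
```

The word "pretty" is not in the toy vocabulary, so it maps to the `[UNK]` id. A real subword tokenizer would instead break an unknown word into smaller known pieces.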

All this **preprocessing needs to be done in exactly the same way as when the model was pretrained**, so we first need to download that information from the Model Hub. To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only downloaded the first time you run the code below).

Since the default checkpoint of the `sentiment-analysis` pipeline is `distilbert-base-uncased-finetuned-sst-2-english`, we run the following:

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # Get the tokenizer specific to this model

print(tokenizer)

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})


- Once we have the tokenizer, we can directly pass our sentences to it and we’ll get back a dictionary that’s ready to feed to our model! 
- The only thing left to do is to **convert the list of input IDs to tensors**.

Using Transformers we do not need to worry about which ML framework is used as a backend; it might be PyTorch or TensorFlow, or Flax for some models. 

However, **Transformer models only accept tensors** as input. 

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument:

In [3]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='tf')
print(inputs)

{'input_ids': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
array([[  101,  1045,  2359,  2000,  2288,  2000, 27178,  5369, 11937,
        13186,  2173,  2197,  5958,  2021,  2045,  1005,  1055,  2001,
         1037, 24240,  2004,  2146,  2004,  1045,  2052,  5987,  1999,
         2392,  1997,  1996, 15214, 10932,  2078,  1012,  2061,  1045,
        16791,  2009,  1012,   102],
       [  101,  1996,  4596,  2001,  3492, 11937, 21756,  1025,  1011,
         1007,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     



The output itself is a dictionary containing two keys, **input_ids** and **attention_mask**. 

- `input_ids` contains two rows of integers (one for each sentence) that are the **unique identifiers of the tokens** in each sentence.
- We’ll explain what the `attention_mask` is later in this chapter.
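
Ahead of that fuller explanation, the zeros in the printed output already suggest the pattern: the shorter sentence is filled up with a padding ID, and the mask marks real tokens with 1 and padding with 0. A minimal sketch of that idea follows. The helper `pad_and_truncate()` is a hypothetical name, not the library's implementation, and the IDs are made up.

```python
def pad_and_truncate(sequences, pad_id=0, max_length=None):
    """Pad (and optionally truncate) lists of token IDs to one common length."""
    if max_length is not None:
        # Naive truncation: keep the first max_length IDs.
        # (A real tokenizer also keeps the closing special token; this sketch does not.)
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Padding on the right matches the `padding_side='right'` setting shown when the tokenizer was printed.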

### Going through the model

We can download our pretrained model the same way we did with our tokenizer. HF Transformers provides a `TFAutoModel` class, which also has a `from_pretrained()` method:

- This architecture contains only the **base Transformer module**: given some inputs, it outputs what we’ll call hidden states, also known as features. 
- For each model input, we’ll retrieve a **high-dimensional vector** representing the contextual understanding of that input by the Transformer model.

More details:

- While these **hidden states** can be useful on their own, they’re **usually inputs to another part of the model**, known as the **head**.
- In Chapter 1, the different tasks could have been performed with the same architecture, but **each of these tasks will have a different head** associated with it.

In [4]:
from transformers import TFAutoModel

# Reuse the `checkpoint` defined above (same model as the tokenizer)
model = TFAutoModel.from_pretrained(checkpoint)

2021-11-16 11:51:50.943673: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertModel: ['pre_classifier', 'dropout_19', 'classifier']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the tas

The **vector output** by the Transformer module is usually large. It generally has **three dimensions**:

- **Batch size**: The number of sequences processed at a time (2 in this example).
- **Sequence length**: The length of the numerical representation of the sequence (40 in this example).
- **Hidden size**: The vector dimension of each model input.

It is said to be “high dimensional” because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).

In [5]:
outputs = model(inputs)  # Feeding tokenized input into model 

print(
    "Batch Size:", outputs.last_hidden_state.shape[0], "\n",
    "Sequence Length:", outputs.last_hidden_state.shape[1], "\n",
    "Hidden Size (Vector Dimension):", outputs.last_hidden_state.shape[2])

Batch Size: 2 
 Sequence Length: 40 
 Hidden Size (Vector Dimension): 768


- The outputs of HF Transformers models behave like namedtuples or dictionaries.
- We can access the elements
  - by attribute (`outputs.last_hidden_state`),
  - by key (`outputs["last_hidden_state"]`),
  - or by index, if you know exactly where the thing you are looking for is (`outputs[0]`).
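
To illustrate the three access styles without loading a model, here is a rough imitation of such an output object. `ToyModelOutput` is a hypothetical class written for this note. It is not the library's own output class, which has additional behavior.

```python
class ToyModelOutput(dict):
    """A dict whose entries can also be read as attributes or by position."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods keep working.
        try:
            return dict.__getitem__(self, name)
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, key):
        if isinstance(key, int):
            # Positional access follows insertion order.
            return list(self.values())[key]
        return dict.__getitem__(self, key)

toy_outputs = ToyModelOutput(
    last_hidden_state=[[0.1, 0.2, 0.3]],    # made-up values
    attentions="(toy attention weights)",   # made-up second field
)

by_attribute = toy_outputs.last_hidden_state
by_key = toy_outputs["last_hidden_state"]
by_index = toy_outputs[0]
print(by_attribute is by_key is by_index)  # True: all three reach the same object
```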

#### Model heads: Making sense out of numbers

- The model heads take the **high-dimensional vector of hidden states as input and project them onto a different dimension**. 
- They are usually composed of one or a few linear layers:


<img src="https://huggingface.co/course/static/chapter2/transformer_and_head.png" width="50%">

The output of the Transformer model is sent directly to the model head to be processed.

In the diagram above, the model is represented by its embeddings layer and the subsequent layers (the Transformer network):

- The embeddings layer **converts** each input ID in the tokenized input into a **vector** that represents the associated token.
- The subsequent layers manipulate those vectors using the **attention mechanism** to produce the **final representation** of the sentences.
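
The two ends of this diagram, the embedding lookup and the head's projection, can be sketched in plain Python. Everything here is an assumption made for illustration: the sizes (4 tokens, hidden size 3, 2 labels), the numbers in `EMBEDDINGS`, `HEAD_WEIGHTS` and `HEAD_BIAS`, and the helper names. The attention layers that sit in between in a real model are skipped entirely.

```python
# Hypothetical embedding table: one vector of size 3 per token ID.
EMBEDDINGS = [
    [0.0, 0.0, 0.0],  # id 0
    [1.0, 0.0, 2.0],  # id 1
    [0.0, 1.0, 0.0],  # id 2
    [2.0, 1.0, 1.0],  # id 3
]

# Hypothetical linear head: 2 labels x hidden size 3, plus one bias per label.
HEAD_WEIGHTS = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
HEAD_BIAS = [0.5, -0.5]

def embed(input_ids):
    """Embeddings layer: look up one vector per input ID."""
    return [EMBEDDINGS[token_id] for token_id in input_ids]

def linear_head(hidden):
    """Head: project one hidden vector (size 3) down to one score per label (size 2)."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(HEAD_WEIGHTS, HEAD_BIAS)
    ]

vectors = embed([1, 3, 2])        # 3 tokens, each now a vector of size 3
logits = linear_head(vectors[0])  # apply the head to the first token's vector
print(logits)                     # [3.5, -0.5]
```

The point of the sketch is the change in shape: each token goes from a single integer to a vector of hidden size, and the head reduces such a vector to as many values as there are labels.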

There are many different architectures available in the HF Transformers library, each designed around a specific task. Some examples:

- `*Model` (retrieve the hidden states)
- `*ForCausalLM`
- `*ForMaskedLM`
- `*ForMultipleChoice`
- `*ForQuestionAnswering`
- `*ForSequenceClassification`
- `*ForTokenClassification`
- and others

Sentiment Example:

In our running example we need a model with a **sequence classification head** (to be able to classify the sentences as positive or negative). So, we won’t actually use the `TFAutoModel` class, but `TFAutoModelForSequenceClassification`:

In [14]:
from transformers import TFAutoModelForSequenceClassification

print("Remember our model was: ", checkpoint)  # get the model name
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)  # get the model weights
outputs = model(inputs)  # input the tokenized raw_input

Remember our model was:  distilbert-base-uncased-finetuned-sst-2-english


Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_58']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now if we look at the shape of our outputs, the dimensionality will be **much lower**: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing **two values** (one per label):

In [18]:
print("Outputs shape (Batch Size, Labels)", outputs.logits.shape)
                                                                                   

Outputs shape (Batch Size, Labels) (2, 2)


### Postprocessing the output

The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:

In [20]:
print(outputs.logits)

tf.Tensor(
[[ 3.6506321 -3.0039456]
 [-4.1607122  4.454075 ]], shape=(2, 2), dtype=float32)


- Our model predicted **[ 3.6506321 -3.0039456]** for the first sentence and **[-4.1607122  4.454075 ]** for the second one. 
- Those are **not probabilities but logits**, the raw, unnormalized scores output by the last layer of the model.
- To be **converted to probabilities**, they need to go through a **SoftMax** layer.

All HF Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy:

In [21]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis = -1)
print(predictions)

tf.Tensor(
[[9.9871361e-01 1.2864548e-03]
 [1.8137053e-04 9.9981862e-01]], shape=(2, 2), dtype=float32)


Now we can see that the **model predicted**

1. [9.9871361e-01 1.2864548e-03] for the first sentence and
2. [1.8137053e-04 9.9981862e-01] for the second one.

These are recognizable probability scores.
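
To see where these numbers come from, the SoftMax step can be reproduced by hand on the logits printed earlier. This is a plain-Python sketch of the standard formula, not the TensorFlow implementation. The helper name `softmax()` and the max-subtraction trick for numerical stability are choices made here.

```python
import math

def softmax(logits):
    """Exponentiate each score and normalize so the row sums to 1."""
    largest = max(logits)
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above.
logits = [[3.6506321, -3.0039456], [-4.1607122, 4.454075]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print(row, "sum =", sum(row))
```

Each row sums to 1, and the values agree with the TensorFlow result to several decimal places. Small differences in the last digits are expected because Python floats are 64-bit while the tensors above are `float32`.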

To get the labels corresponding to each position, we can inspect the **id2label** attribute of the model config:

In [23]:
print(model.config.id2label, "\n", "Remember results from the full pipeline classifier: ", classifier(raw_inputs))

{0: 'NEGATIVE', 1: 'POSITIVE'} 
 Remember results from the full pipeline classifier:  [{'label': 'NEGATIVE', 'score': 0.9987136125564575}, {'label': 'POSITIVE', 'score': 0.9998186230659485}]


The pipeline results match the replicated results, so we successfully reproduced the pipeline.
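
The last small step, turning a row of probabilities into a `{'label': ..., 'score': ...}` entry, is not shown in this notebook. One plausible way to do it is sketched below. The helper `to_pipeline_format()` is a hypothetical name and not necessarily how the pipeline does it internally. The probabilities and the label mapping are copied from the outputs above.

```python
# Copied from model.config.id2label above.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def to_pipeline_format(probabilities, id2label):
    """For each row, keep the most probable label and its score."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest value
        results.append({"label": id2label[best], "score": row[best]})
    return results

# Probabilities copied from the SoftMax output above.
probabilities = [[9.9871361e-01, 1.2864548e-03], [1.8137053e-04, 9.9981862e-01]]
results = to_pipeline_format(probabilities, ID2LABEL)
print(results)
```

The labels and scores line up with what `classifier(raw_inputs)` returned, up to the number of digits printed.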

**Preprocessing with a tokenizer -> passing the inputs through the model -> postprocessing the output**

Full code:

In [28]:
import tensorflow as tf
from transformers import AutoTokenizer
from transformers import TFAutoModel
from transformers import TFAutoModelForSequenceClassification

# Select Model
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
print("Select Model", checkpoint)

print("######################### \n", "Step 1: Tokenization")
print("Raw Inputs: ", raw_inputs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # Get the tokenizer specific to this model
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='tf')
print("Tokenized Inputs: ", inputs)

print("######################### \n", "Step 2: Modeling")
# Option 1: the base model, which returns only the hidden states
model = TFAutoModel.from_pretrained(checkpoint)
outputs = model(inputs)  # Feeding tokenized input into model 

print(
    "Hidden States: \n",
    "Batch Size:", outputs.last_hidden_state.shape[0], "\n",
    "Sequence Length:", outputs.last_hidden_state.shape[1], "\n",
    "Hidden Size (Vector Dimension):", outputs.last_hidden_state.shape[2])

# Option 2: the model with a sequence classification head, which returns logits
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)  # get the model weights
outputs = model(inputs)  # input the tokenized raw_input

print("Outputs shape \n (Batch Size, Labels)", outputs.logits.shape)

print("######################### \n", "Step 3: Postprocessing")
print("Logits: ", outputs.logits)

predictions = tf.math.softmax(outputs.logits, axis = -1)
print("Prediction: ", predictions)
print(model.config.id2label)
                                                                                   

Select Model distilbert-base-uncased-finetuned-sst-2-english
######################### 
 Step 1: Tokenization
Raw Inputs:  ["I wanted to got to othe Tacos place last Friday but there's was a queue as long as I would expect in front of the Berghain. So I skipped it.", 'The dinner was pretty tasty ;-)']
Tokenized Inputs:  {'input_ids': <tf.Tensor: shape=(2, 40), dtype=int32, numpy=
array([[  101,  1045,  2359,  2000,  2288,  2000, 27178,  5369, 11937,
        13186,  2173,  2197,  5958,  2021,  2045,  1005,  1055,  2001,
         1037, 24240,  2004,  2146,  2004,  1045,  2052,  5987,  1999,
         2392,  1997,  1996, 15214, 10932,  2078,  1012,  2061,  1045,
        16791,  2009,  1012,   102],
       [  101,  1996,  4596,  2001,  3492, 11937, 21756,  1025,  1011,
         1007,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertModel: ['pre_classifier', 'dropout_19', 'classifier']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Hidden States: 
 Batch Size: 2 
 Sequence Length: 40 
 Hidden Size (Vector Dimension): 768


Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_253']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Outputs shape 
 (Batch Size, Labels) (2, 2)
######################### 
 Step 3: Postprocessing
Logits:  tf.Tensor(
[[ 3.6506321 -3.0039456]
 [-4.1607122  4.454075 ]], shape=(2, 2), dtype=float32)
Prediction:  tf.Tensor(
[[9.9871361e-01 1.2864548e-03]
 [1.8137053e-04 9.9981862e-01]], shape=(2, 2), dtype=float32)
{0: 'NEGATIVE', 1: 'POSITIVE'}
