## Chapter 2

In [1]:
from platform import python_version

print(python_version())

3.9.13


In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

## Preprocessing

![Image of NLP pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

### Notes  
* Transformers can't process raw text directly; the text must first be converted into a numerical format
* A tokenizer performs this conversion from text to numbers

* Tokenizers:  
    * text -> tokens
        * the text is split into words, sub-words, or punctuation marks (e.g. `!`), which are called tokens
    * token -> integer
        * each token in the tokenizer's vocabulary is mapped to its own integer ID
    * may also add extra inputs that are useful for the model

* This preprocessing must be done in exactly the same way as when the model was pre-trained
* HuggingFace's Transformers library handles this for you when you use a pre-trained model
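
The two tokenizer steps can be sketched in plain Python. This is a toy illustration only: the vocabulary `toy_vocab`, the splitting rule, and the helper names are made up for these notes, and the real DistilBERT tokenizer uses a learned sub-word vocabulary instead.

```python
import re

# Hypothetical miniature vocabulary (NOT the real DistilBERT vocabulary).
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}


def toy_tokenize(text):
    """Step 1: lowercase the text and split it into words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Step 2: add start/end markers and map every token to an integer ID."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the unknown-token ID.
    ids = [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]
    return tokens, ids


tokens, ids = toy_encode("I hate this so much!")
print(tokens)
print(ids)
```

The start and end markers play the same role as the IDs 101 and 102 that wrap each sentence in the real tokenizer output further down.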

* "...we use the *AutoTokenizer* class and its *from_pretrained()* method. Using the *checkpoint name* of our model, it will automatically fetch the data associated with the model’s tokenizer and cache it (so it’s only downloaded the first time you run the code below)."

"...The default checkpoint of the *sentiment-analysis* pipeline is *distilbert-base-uncased-finetuned-sst-2-english*"

In [1]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



* "Transformer models only accept *tensors* as input. If this is your first time hearing about tensors, you can think of them as *NumPy arrays* instead. A NumPy array can be a scalar (0D), a vector (1D), a matrix (2D), or have more dimensions. It’s *effectively* a tensor; other ML frameworks’ tensors behave similarly, and are usually as simple to instantiate as NumPy arrays."
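
A quick way to see the 0D/1D/2D idea is to build a few NumPy arrays and check their `ndim` and `shape`. The variable names below are only for illustration; the last array just mimics the kind of 3D shape a model returns later in these notes.

```python
import numpy as np

scalar = np.array(5)                        # 0D: a single number
vector = np.array([1, 2, 3])                # 1D: a row of numbers
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2D: rows and columns
batch = np.zeros((2, 16, 768))              # 3D: a stack of matrices

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("batch", batch)]:
    print(f"{name}: ndim={arr.ndim}, shape={arr.shape}")
```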

In [2]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")
print(inputs)



{'input_ids': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102],
       [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,
            0,     0,     0,     0,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}


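* "The output itself is a dictionary containing two keys, *input_ids* and *attention_mask*. input_ids contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence."
* attention_mask has the same shape as input_ids: 1 marks a real token and 0 marks padding (the shorter second sentence is padded with 0s up to length 16)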
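
What `padding=True` does to the batch can be sketched by hand. The token IDs below are copied from the output above with the padding stripped; `pad_batch` is a hypothetical helper written for these notes, not a function from the Transformers library.

```python
# Token IDs from the tokenizer output above, without the trailing padding.
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

PAD_ID = 0  # the padding ID visible in the output above


def pad_batch(seqs, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in seqs)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in seqs]
    # 1 = real token, 0 = padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in seqs]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences)
print(input_ids[1])
print(attention_mask[1])
```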

## The Model

In [3]:
from transformers import TFAutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModel.from_pretrained(checkpoint)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


* "This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features. For each model input, we’ll retrieve a high-dimensional vector representing *the contextual understanding of that input by the Transformer model.*"  

* "The vector output by the Transformer module is usually large. It generally has three dimensions:
    * Batch size: The number of sequences processed at a time (2 in our example).
    * Sequence length: The length of the numerical representation of the sequence (16 in our example).
    * Hidden size: The vector dimension of each model input."
        * "The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more)."
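
To get a feel for these three dimensions, here is a small NumPy sketch. The values are random placeholders standing in for real hidden states; only the shape (2, 16, 768) is taken from this example.

```python
import numpy as np

batch_size, sequence_length, hidden_size = 2, 16, 768

# Placeholder data: random numbers in place of real hidden states.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(batch_size, sequence_length, hidden_size))

print(hidden_states.shape)       # one vector per token, per sentence

# The vector for the first token of the first sentence.
first_token_vector = hidden_states[0, 0]
print(first_token_vector.shape)

# Total number of values produced for this small batch.
print(hidden_states.size)
```

Even for two short sentences this is 2 * 16 * 768 = 24576 numbers, which is why the output is described as large.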

In [4]:
inputs

{'input_ids': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102],
       [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,
            0,     0,     0,     0,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(2, 16), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)>}

In [5]:
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

(2, 16, 768)


## Model Heads

"The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:"

![Image of Transformer head](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/transformer_and_head.svg)

"The output of the Transformer model is sent directly to the model head to be processed."

"In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences."
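
As a rough sketch of what "projecting onto a different dimension" means, the snippet below applies a single linear layer to placeholder hidden states. This is an assumption-laden simplification: the weights are random, and the real sequence classification head for this checkpoint has more than one layer (the loading message above lists both `pre_classifier` and `classifier` weights).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder hidden states with the shape from this example.
hidden_states = rng.normal(size=(2, 16, 768))

num_labels = 2
# Hypothetical, randomly initialised weights of one linear layer.
weights = rng.normal(scale=0.02, size=(768, num_labels))
bias = np.zeros(num_labels)

# Use the first token's vector as a summary of each sentence: (2, 768).
sentence_summary = hidden_states[:, 0, :]

# Linear projection from the hidden size down to one score per label: (2, 2).
logits = sentence_summary @ weights + bias
print(logits.shape)
```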

In [6]:
# For our example, we will need a model with a sequence classification head 
# (to be able to classify the sentences as positive or negative). 
# So, we won’t actually use the TFAutoModel class, but TFAutoModelForSequenceClassification:

from transformers import TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(inputs)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [7]:
print(outputs.logits.shape)

(2, 2)


The shape is (2, 2) because we have 2 sentences as the initial input, and 2 labels/classes.  

Reminder:  
* Raw inputs = the sentences
* inputs = the tokenized raw inputs, i.e. the tokenized sentences
* Pre-processing turns the raw inputs into a numerical representation that the model understands
* The (pre-trained) model is a model architecture initialized with specific weights from a checkpoint (in this case, "distilbert-base-uncased-finetuned-sst-2-english")  
* The model takes this numerical, tokenized input

### Post-process the model output

In [8]:
print(outputs.logits)

tf.Tensor(
[[-1.5606977  1.6122826]
 [ 4.169232  -3.346448 ]], shape=(2, 2), dtype=float32)


"Our model predicted [-1.5607, 1.6123] for the first sentence and [ 4.1692, -3.3464] for the second one."  
* "Those are not probabilities but logits, the raw, unnormalized scores outputted by the last layer of the model. To be converted to probabilities, they need to go through a SoftMax layer."
    * "(all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy)"
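
To make the conversion concrete, here is a hand-written softmax in plain Python applied to the logits printed above. It is a minimal sketch for understanding the arithmetic (exponentiate each score, then divide by the row total); the `softmax` helper is written for these notes and is not the library implementation.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow.
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Logits copied from the model output above.
logits = [
    [-1.5606977, 1.6122826],
    [4.169232, -3.346448],
]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
# prints [0.0402, 0.9598]
# prints [0.9995, 0.0005]
```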

In [9]:
import tensorflow as tf

predictions = tf.math.softmax(outputs.logits, axis=-1)
print(predictions)

tf.Tensor(
[[4.0195279e-02 9.5980471e-01]
 [9.9945587e-01 5.4418348e-04]], shape=(2, 2), dtype=float32)


"Now we can see that the model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second one. These are recognizable probability scores."  
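* Each row gives the probability that the sentence belongs to the first label/class (index 0) vs. the second label/class (index 1)
    * The first sentence is predicted to be the second class (index 1) with about 96% probability
    * The second sentence is predicted to be the first class (index 0) with about 99.9% probability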

In [10]:
# To get the labels corresponding to each position,
# we can inspect the id2label attribute of the model config (more on this in the next section):
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
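
Combining this mapping with the probabilities gives the final labels. The sketch below hard-codes the rounded probabilities and the `id2label` dictionary from the outputs above, so it runs without loading the model; the variable names are just for illustration.

```python
# Values copied (rounded) from the notebook outputs above.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [0.0402, 0.9598],  # first sentence
    [0.9995, 0.0005],  # second sentence
]

results = []
for row in probabilities:
    # Index of the highest probability in this row.
    best_index = max(range(len(row)), key=row.__getitem__)
    results.append({"label": id2label[best_index], "score": row[best_index]})

for result in results:
    print(result)
```

The first sentence comes out as POSITIVE and the second as NEGATIVE.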

All of the above manually reproduces the HuggingFace pipeline:  
* Pre-process input according to the pre-processing of the model to be used
* Get and load in a pretrained model
* Get output from passing the pretrained model the pre-processed inputs
* Post-process the outputs (logits) to get the final predictions
    * SoftMax the logits output to get probabilities