<a href="https://colab.research.google.com/github/Gibsdevops/learn_large_language_models-NLP_huggingface/blob/main/using_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Preprocessing with a tokenizer

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Since the default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english (you can see its model card here), we run the following:

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
tokenizer

In [None]:
# Transformer models only accept tensors as input

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

print(inputs)  # returns PyTorch tensors
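
To make the effect of padding=True more concrete, here is a rough pure-Python sketch of what padding a batch involves. It is only an illustration under stated assumptions: the token IDs, the PAD_ID value, and the pad_batch helper are made up for this example, and the real tokenizer additionally handles truncation, special tokens, and tensor conversion.

```python
# Toy illustration of padding and attention masks (hypothetical token IDs).
PAD_ID = 0  # assumed padding ID for this sketch only


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that should be ignored
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Both rows end up with the same length, which is what allows a batch to be stacked into a single rectangular tensor.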

Going through the model

We can download our pretrained model the same way we did with our tokenizer.

Transformers provides an AutoModel class which also has a from_pretrained() method:

In [6]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [8]:
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)

torch.Size([2, 16, 768])


In [9]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [11]:
print(outputs.logits.shape)
# Since we have just two sentences and two labels,
# the result we get from our model is of shape 2 x 2.

torch.Size([2, 2])
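
The drop in dimensionality can be sketched with a toy linear layer in pure Python. This is a simplified stand-in, not the head of this checkpoint: the weights, bias, input vectors, and the linear_head helper are hypothetical, the hidden size is shrunk from 768 to 4 for readability, and a real head may contain additional layers.

```python
# Simplified stand-in for a classification head (pure Python, toy sizes).
HIDDEN_SIZE = 4  # the real model uses 768; shrunk here for readability
NUM_LABELS = 2

# Hypothetical weights and bias; a trained head learns these values.
WEIGHTS = [
    [0.5, -0.25, 0.0, 1.0],  # row producing the score for label 0
    [-0.5, 0.25, 1.0, 0.0],  # row producing the score for label 1
]
BIAS = [0.1, -0.1]


def linear_head(vector):
    """Map one hidden vector to NUM_LABELS raw scores (logits)."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(WEIGHTS, BIAS)
    ]


# One made-up hidden vector per sentence.
sentence_vectors = [
    [1.0, 2.0, 0.0, -1.0],
    [0.0, 0.0, 1.0, 1.0],
]
logits = [linear_head(v) for v in sentence_vectors]
print(len(logits), len(logits[0]))  # batch size x number of labels
```

Whatever the hidden size, the output has one row per sentence and one column per label, which matches the 2 x 2 shape printed above.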


Postprocessing the output

The values we get as output from our model don’t necessarily make sense by themselves. Let’s take a look:

In [12]:
print(outputs.logits)

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


Our model predicted [-1.5607, 1.6123] for the first sentence and [4.1692, -3.3464] for the second one. Those are not probabilities but logits, the raw, unnormalized scores output by the last layer of the model. All 🤗 Transformers models output logits, because the loss function used for training generally fuses the last activation function, such as SoftMax, with the actual loss function, such as cross entropy. To convert the logits to probabilities, we pass them through a SoftMax layer:

In [13]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


Now we can see that the model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second one. These are recognizable probability scores.
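
For intuition, the same conversion can be reproduced by hand. The sketch below is a minimal pure-Python softmax applied to the logits printed earlier; the softmax helper is written for this illustration and is not the PyTorch implementation.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output printed earlier
batch_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probabilities = [softmax(row) for row in batch_logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, and the rounded values agree with the probabilities quoted above.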

To get the labels corresponding to each position, we can inspect the id2label attribute of the model config (more on this in the next section):

In [14]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
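
Picking the final label for each sentence amounts to taking the position with the highest probability and looking it up in the mapping. A minimal sketch in plain Python, using the rounded values above; the top_label helper is a hypothetical name introduced for this example.

```python
# Mapping copied from model.config.id2label shown above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Rounded probabilities from the softmax step
probabilities = [[0.0402, 0.9598], [0.9995, 0.0005]]


def top_label(probs, mapping):
    """Return the label with the highest probability and that probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return mapping[best], probs[best]


results = [top_label(row, id2label) for row in probabilities]
for label, score in results:
    print(label, score)
```

This yields one (label, score) pair per sentence, similar in spirit to what the sentiment-analysis pipeline reports.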

We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing! Now let’s take some time to dive deeper into each of those steps.

In [19]:
sentences = [
    "I hate website development",
    "I love machine learning"
]

In [24]:
# Classify the two sentences by reproducing the three steps of the pipeline
# function: tokenizing, passing the inputs through the model, and postprocessing.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1: load the tokenizer that matches the checkpoint, because inputs must be
# preprocessed the same way as when the model was pretrained.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer_2 = AutoTokenizer.from_pretrained(checkpoint)
# tokenizer_2

# Step 2: Transformer models only accept tensors, so use the tokenizer
# to turn the sentences into PyTorch tensors, then run the model.
inputs_2 = tokenizer_2(sentences, padding=True, truncation=True, return_tensors="pt")
# print(inputs_2)  # PyTorch tensors

model_2 = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs_2 = model_2(**inputs_2)

# Step 3: postprocess, because the raw logits are not probabilities.
predictions_2 = torch.nn.functional.softmax(outputs_2.logits, dim=-1)

print(predictions_2)


tensor([[9.9967e-01, 3.2979e-04],
        [2.1920e-04, 9.9978e-01]], grad_fn=<SoftmaxBackward0>)
