Understanding the Transformers library

In this notebook we look at how the transformers library works and how you can effectively build solutions using pre-trained models

We need to understand three main components:

• Tokenizers

• AutoModels

• Choosing the right model(s)

A raw input text such as "Generative AI is fun" cannot be processed directly by a model

This is true for all machine learning models: the raw text must first be converted into numbers

In [1]:
! pip install transformers



Tokenizers convert the raw text into sub-tokens (where multiple sub-tokens could join to form a word)

e.g. Generative --> Generat, ive

Each of these tokens is then assigned an integer ID from the tokenizer's vocabulary,
which is learned from the training data corpus

Tokenizers thus convert the raw text into sub-tokens, and the sub-tokens into integers
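
To make the text --> sub-tokens --> integers pipeline concrete, here is a toy sketch in plain Python. The vocabulary and its IDs are made up for illustration, and the greedy longest-match splitting is a simplification: it is not the byte-pair-encoding procedure that distilgpt2's real tokenizer uses.

```python
# Toy vocabulary: sub-token -> integer ID (hypothetical values, not distilgpt2's).
TOY_VOCAB = {"Generat": 0, "ive": 1, "AI": 2, "is": 3, "fun": 4, "<unk>": 5}


def split_word(word, vocab):
    """Greedily split one word into the longest sub-tokens found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No sub-token matches: the whole word becomes the unknown token.
            return ["<unk>"]
    return pieces


def encode(text, vocab):
    """Text -> sub-tokens -> integer IDs."""
    tokens = [piece for word in text.split() for piece in split_word(word, vocab)]
    ids = [vocab[token] for token in tokens]
    return tokens, ids


tokens, ids = encode("Generative AI is fun", TOY_VOCAB)
print(tokens)  # ['Generat', 'ive', 'AI', 'is', 'fun']
print(ids)     # [0, 1, 2, 3, 4]
```

The model only ever sees the list of integers; the sub-token strings are for our benefit.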

Tokenizers also add special tokens such as <eos> and <cls>; the exact special tokens are specific
to each transformer model

To use a pre-trained model we therefore need to use the tokenizer that corresponds to
that model

In [4]:
from transformers import AutoTokenizer
model_id = "distilgpt2"
# pad on the left so that, for this decoder-only model, generated tokens directly follow the real text
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
# since we are passing batch input, it is necessary to pad them to the same length
if tokenizer.pad_token is None:
    # reuse the eos token for padding: a newly added token would not have a defined embedding and might raise errors
    tokenizer.pad_token = tokenizer.eos_token


In [5]:
raw_inputs = [
"I've been waiting for a Hugging Face course my whole life.",
"I am very excited about training the model !!",
"I hate this weather which makes me feel irritated !"
]

In [8]:
inputs = tokenizer(raw_inputs, truncation=True, padding=True, return_tensors="pt")

In [13]:
inputs

{'input_ids': tensor([[   40,  1053,   587,  4953,   329,   257, 12905,  2667, 15399,  1781,
           616,  2187,  1204,    13],
        [50256, 50256, 50256, 50256, 50256,    40,   716,   845,  6568,   546,
          3047,   262,  2746, 37867],
        [50256, 50256, 50256, 50256,    40,  5465,   428,  6193,   543,  1838,
           502,  1254, 38635,  5145]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
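
The padding and attention mask in the output above can be reproduced by hand. The following is a minimal sketch in plain Python, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the IDs are copied from the output above with the padding removed.

```python
PAD_ID = 50256  # the <|endoftext|> id that fills the padded positions above


def pad_batch(sequences, pad_id, padding_side="left"):
    """Pad every sequence to the longest length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, zeros, ones = [pad_id] * n_pad, [0] * n_pad, [1] * len(seq)
        if padding_side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


# Unpadded IDs of the three sentences.
batch = [
    [40, 1053, 587, 4953, 329, 257, 12905, 2667, 15399, 1781, 616, 2187, 1204, 13],
    [40, 716, 845, 6568, 546, 3047, 262, 2746, 37867],
    [40, 5465, 428, 6193, 543, 1838, 502, 1254, 38635, 5145],
]
ids, mask = pad_batch(batch, PAD_ID)
print(ids[1])   # five pad IDs followed by the nine real IDs
print(mask[1])  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

A 0 in the attention mask tells the model to ignore that position, so the padding does not influence the real tokens.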

In [9]:
print(inputs[0].tokens)

['I', "'ve", 'Ġbeen', 'Ġwaiting', 'Ġfor', 'Ġa', 'ĠHug', 'ging', 'ĠFace', 'Ġcourse', 'Ġmy', 'Ġwhole', 'Ġlife', '.']


In [10]:
print(inputs[1].tokens)

['<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', 'I', 'Ġam', 'Ġvery', 'Ġexcited', 'Ġabout', 'Ġtraining', 'Ġthe', 'Ġmodel', 'Ġ!!']


In [14]:
from transformers import AutoModel
model = AutoModel.from_pretrained(model_id)


Similar to the tokenizer, the model is also downloaded and cached for further usage.

When the above code is executed, the base model without any head is
loaded, i.e. for any input the model returns a high-dimensional vector for each token, representing the Transformer model's contextual understanding of that input

In [15]:
inputs['input_ids'].size()

torch.Size([3, 14])

In [16]:
output = model(**inputs)

In [17]:
output


BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-0.0389,  0.3338,  0.0144,  ..., -0.1871,  0.1037, -0.1066],
         [ 0.0090, -0.1319, -1.0859,  ..., -0.3795,  0.1762, -0.0579],
         [-0.7782, -0.0481, -0.4733,  ..., -0.3067, -0.1036, -0.0747],
         ...,
         [-0.4948,  0.7663, -1.1026,  ..., -0.4272,  0.3760,  0.1030],
         [ 0.2029, -0.2114, -0.9762,  ..., -0.1133,  0.0679, -0.0477],
         [-0.0938, -0.4075, -0.4138,  ..., -0.0477, -0.3055, -0.2257]],

        [[-0.0891,  0.2588, -0.1452,  ..., -0.0186,  0.1003,  0.1174],
         [-0.4293,  0.5818, -0.0191,  ...,  0.3211,  0.8908,  0.5298],
         [-0.3452,  0.5230,  0.0466,  ...,  0.3604,  0.7849,  0.4683],
         ...,
         [-0.8571,  0.7226, -0.3196,  ...,  0.1861,  1.1892,  0.2926],
         [-1.0412,  0.5586, -0.6740,  ..., -0.1755,  1.3704,  0.1132],
         [-0.4473,  0.5372,  0.3923,  ..., -0.1490,  0.1542,  0.8195]],

        [[-0.0594,  0.3128, -0.2585,  ..., -0.1458,  0.1

**Text generation**

In [20]:
from transformers import AutoModelForCausalLM
causal_model = AutoModelForCausalLM.from_pretrained(model_id)


In [21]:
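output = causal_model.generate(**inputs, max_new_tokens=10, temperature=0.75)
# the generate method is available for models that are trained for text generation, such as GPT or T5
# note: temperature only takes effect when sampling is enabled (do_sample=True); without it decoding is greedy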

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
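
To see what a `temperature` value does when sampling is enabled, here is a small sketch in plain Python. It assumes the usual definition (divide the logits by the temperature, then apply softmax); the logits are made-up numbers, not scores from distilgpt2.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalise with softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores
for t in (1.0, 0.75):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 sharpens the distribution, so the most likely token is picked more often; a temperature above 1 flattens it.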


In [22]:
output

tensor([[   40,  1053,   587,  4953,   329,   257, 12905,  2667, 15399,  1781,
           616,  2187,  1204,    13,   314,  1053,   587,  4953,   329,   257,
         12905,  2667, 15399,  1781],
        [50256, 50256, 50256, 50256, 50256,    40,   716,   845,  6568,   546,
          3047,   262,  2746, 37867,   198,   198,    40,   716,   845,  6568,
           546,  3047,   262,  2746],
        [50256, 50256, 50256, 50256,    40,  5465,   428,  6193,   543,  1838,
           502,  1254, 38635,  5145,   198,   198,    40,   716,   257,  4336,
           286,   262,  6193,   290]])
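
Each row above has 24 IDs: the 14 input IDs followed by 10 newly generated ones. The loop behind this can be sketched in plain Python as follows. `toy_next_token_scores` is a hypothetical stand-in for the model, and the sketch only covers greedy decoding (always take the highest-scoring token), which is one of several strategies `generate` supports.

```python
def toy_next_token_scores(ids, vocab_size):
    """Hypothetical stand-in for a causal LM: favours (last id + 1) % vocab_size."""
    favourite = (ids[-1] + 1) % vocab_size
    return [1.0 if token == favourite else 0.0 for token in range(vocab_size)]


def greedy_generate(input_ids, max_new_tokens, vocab_size=5):
    """Append the highest-scoring next token, one step at a time."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(ids, vocab_size)
        next_id = max(range(vocab_size), key=lambda token: scores[token])
        ids.append(next_id)  # the new token becomes part of the next step's input
    return ids


prompt = [3, 4]
generated = greedy_generate(prompt, max_new_tokens=10)
print(generated)                     # [3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(len(generated) - len(prompt))  # 10
```

Because new tokens are always appended on the right, padding the batch on the left keeps each prompt directly next to its continuation.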

In [23]:
raw_inputs[0]

"I've been waiting for a Hugging Face course my whole life."

In [24]:
tokenizer.decode(output[0])

"I've been waiting for a Hugging Face course my whole life. I've been waiting for a Hugging Face course"

In [25]:
tokenizer.decode(output[1])

'<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>I am very excited about training the model!!\n\nI am very excited about training the model'

**Text classification**

In [31]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

cls_repo_id = "distilbert-base-uncased-finetuned-sst-2-english"
cls_tokenizer = AutoTokenizer.from_pretrained(cls_repo_id)
cls_model = AutoModelForSequenceClassification.from_pretrained(cls_repo_id)



In [33]:
inputs = cls_tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

In [34]:
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,  2227,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  2572,  2200,  7568,  2055,  2731,  1996,  2944,   999,
           999,   102,     0,     0,     0,     0],
        [  101,  1045,  5223,  2023,  4633,  2029,  3084,  2033,  2514, 15560,
           999,   102,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
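
Note how this tokenizer behaves differently from the distilgpt2 one: every row starts with 101 ([CLS]) and ends its real tokens with 102 ([SEP]), and padding uses ID 0 on the right. A minimal plain-Python sketch of that wrapping and padding is shown here; `encode_batch` is a hypothetical helper, and the word IDs are copied from the output above with the special and padding IDs removed.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs as they appear in the output above


def encode_batch(sequences):
    """Wrap each sequence as [CLS] ... [SEP], then pad on the right."""
    wrapped = [[CLS_ID] + list(seq) + [SEP_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD_ID] * (max_len - len(seq)) for seq in wrapped]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in wrapped]
    return input_ids, attention_mask


# Word-piece IDs of the first two sentences.
word_ids = [
    [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 2227, 2607, 2026, 2878, 2166, 1012],
    [1045, 2572, 2200, 7568, 2055, 2731, 1996, 2944, 999, 999],
]
bert_ids, bert_mask = encode_batch(word_ids)
print(bert_ids[1])   # 101, the ten word IDs, 102, then four 0s
print(bert_mask[1])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

This is why a pre-trained model must be paired with its own tokenizer: the IDs, special tokens and padding conventions all differ between models.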

In [35]:
outputs = cls_model(**inputs)

In [36]:
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-3.1071,  3.2654],
        [-3.9628,  4.2892],
        [ 4.0951, -3.3116]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [37]:
import torch
outputs_processed = torch.nn.functional.softmax(outputs.logits, dim=-1)
outputs_processed

tensor([[1.7051e-03, 9.9829e-01],
        [2.6066e-04, 9.9974e-01],
        [9.9939e-01, 6.0683e-04]], grad_fn=<SoftmaxBackward0>)

In [38]:
cls_model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [39]:
torch.argmax(outputs_processed, dim=1)

tensor([1, 1, 0])
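
The last steps (softmax, argmax, label lookup) can be checked by hand. Here is a small sketch in plain Python that uses the logits and the id2label mapping printed above; it mirrors what the torch calls compute but is not how the library implements them.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # copied from cls_model.config.id2label
LOGITS = [[-3.1071, 3.2654], [-3.9628, 4.2892], [4.0951, -3.3116]]


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


predictions = []
for sentence_logits in LOGITS:
    probs = softmax(sentence_logits)
    best = max(range(len(probs)), key=lambda i: probs[i])  # argmax
    predictions.append((ID2LABEL[best], probs[best]))

for label, prob in predictions:
    print(label, round(prob, 4))
```

The three labels come out as POSITIVE, POSITIVE, NEGATIVE, matching tensor([1, 1, 0]): the first two sentences are classified as positive and "I hate this weather ..." as negative.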