
# Tokenizers

### Loading and Saving

In [1]:
!pip install "transformers[sentencepiece]"  # installs the latest release from PyPI

Collecting transformers[sentencepiece]
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers[sentencepiece])
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[sentencepiece])
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers[sentencepiece])
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
from transformers import BertTokenizer

Loading the BERT tokenizer trained with the same checkpoint as BERT works the same way as loading the model, except that we use the BertTokenizer class.

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


In [4]:
from transformers import AutoTokenizer

In [5]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')


In [6]:
tokenizer("I'm still learning how to use transformers")

{'input_ids': [101, 146, 112, 182, 1253, 3776, 1293, 1106, 1329, 11303, 1468, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [7]:
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

### Encoding
Translating text to numbers is known as encoding. Encoding is a two-step process: tokenization, followed by conversion of the tokens to input IDs (numericalization).

* The first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens. Several different rules can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained.

* The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. Again, we need to use the same vocabulary used when the model was pretrained.
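The two steps above can be sketched without any library. This is a simplified illustration and not the actual BertTokenizer implementation: it assumes the text is already split on whitespace, uses a greedy longest-match-first subword search in the style of WordPiece, and relies on a tiny hypothetical vocabulary. The IDs are copied from the outputs in this notebook, except the `[UNK]` entry, which is a placeholder added for illustration. The names `TOY_VOCAB`, `wordpiece_split`, `toy_tokenize` and `toy_convert_tokens_to_ids` are made up for this sketch.

```python
# Toy vocabulary: token -> ID. IDs are copied from the notebook outputs;
# the [UNK] entry is a placeholder for illustration only.
TOY_VOCAB = {
    "[UNK]": 100,
    "still": 1253,
    "learning": 3776,
    "how": 1293,
    "to": 1106,
    "use": 1329,
    "transform": 11303,
    "##ers": 1468,
}


def wordpiece_split(word, vocab):
    """Greedy longest-match-first split of one word into subword tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Step 1: split on whitespace, then split each word into subwords."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_split(word, vocab))
    return tokens


def toy_convert_tokens_to_ids(tokens, vocab):
    """Step 2: look each token up in the vocabulary."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


toy_tokens = toy_tokenize("still learning how to use transformers", TOY_VOCAB)
toy_ids = toy_convert_tokens_to_ids(toy_tokens, TOY_VOCAB)
print(toy_tokens)  # ['still', 'learning', 'how', 'to', 'use', 'transform', '##ers']
print(toy_ids)     # [1253, 3776, 1293, 1106, 1329, 11303, 1468]
```

Because the splitting rules and the vocabulary both decide the final IDs, loading a tokenizer from a different checkpoint than the model would feed the model IDs it was never trained on.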

In [8]:
sequence = "I'm still learning how to use transformers"

In [9]:
tokens = tokenizer.tokenize(sequence)

In [10]:
print(tokens)

['I', "'", 'm', 'still', 'learning', 'how', 'to', 'use', 'transform', '##ers']


In [11]:
ids = tokenizer.convert_tokens_to_ids(tokens)

In [12]:
print(ids)

[146, 112, 182, 1253, 3776, 1293, 1106, 1329, 11303, 1468]


In [16]:
import torch  # needed here when the cells are run top to bottom

torch.tensor(ids)

tensor([  146,   112,   182,  1253,  3776,  1293,  1106,  1329, 11303,  1468])

### Decoding
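Decoding goes the other way around: from vocabulary indices, we want to get a string. Note that decode() does not just map each index back to its token; it also merges tokens that belong to the same word into a readable sentence.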

In [13]:
decoded_string = tokenizer.decode([146, 112, 182, 1253, 3776, 1293, 1106, 1329, 11303, 1468])

In [14]:
print(decoded_string)

I'm still learning how to use transformers
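To make the merging behaviour concrete, here is a minimal sketch of decoding with a toy reverse vocabulary. It is not what `tokenizer.decode` runs internally: the real method also handles special tokens and the spacing around punctuation, which this sketch leaves out. `TOY_ID_TO_TOKEN` and `toy_decode` are hypothetical names, and the IDs are copied from the outputs above.

```python
# Toy reverse vocabulary: ID -> token, using IDs from the notebook outputs.
TOY_ID_TO_TOKEN = {
    1253: "still",
    3776: "learning",
    1293: "how",
    1106: "to",
    1329: "use",
    11303: "transform",
    1468: "##ers",
}


def toy_decode(ids, id_to_token):
    """Map IDs back to tokens and glue '##' continuation pieces onto the previous word."""
    words = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the "##" marker and append to the last word
        else:
            words.append(token)
    return " ".join(words)


decoded = toy_decode([1253, 3776, 1293, 1106, 1329, 11303, 1468], TOY_ID_TO_TOKEN)
print(decoded)  # still learning how to use transformers
```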


### Handling multiple sequences


In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [23]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sentence2 = "This is the best course in the world!"
tokens = tokenizer.tokenize(sentence2)
ids = tokenizer.convert_tokens_to_ids(tokens)


In [24]:
input_ids = torch.tensor(ids)
input_ids

tensor([2023, 2003, 1996, 2190, 2607, 1999, 1996, 2088,  999])

In [25]:
model(input_ids) # Will fail

IndexError: ignored

The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a sequence. But if you look closely, you’ll see that the tokenizer didn’t just convert the list of input IDs into a tensor, it added a dimension on top of it:

In [27]:
tokenizer(sentence2, return_tensors="pt")['input_ids']

tensor([[ 101, 2023, 2003, 1996, 2190, 2607, 1999, 1996, 2088,  999,  102]])

Let's try this again, this time adding a new dimension:

In [28]:
input_ids = torch.tensor([ids])
input_ids

tensor([[2023, 2003, 1996, 2190, 2607, 1999, 1996, 2088,  999]])

In [31]:
model(input_ids), model(input_ids).logits # Will run

(SequenceClassifierOutput(loss=None, logits=tensor([[-4.1817,  4.5285]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None),
 tensor([[-4.1817,  4.5285]], grad_fn=<AddmmBackward0>))
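The manually built input still differs from the tokenizer's own output in one respect: the tokenizer wrapped the sequence in two extra IDs, 101 at the start and 102 at the end, which are the `[CLS]` and `[SEP]` special tokens for this checkpoint. The sketch below reproduces the tokenizer's output with plain Python lists. `to_model_batch` is a hypothetical helper written for this illustration; in real code the two IDs are available as `tokenizer.cls_token_id` and `tokenizer.sep_token_id` rather than being hard-coded.

```python
# IDs produced by the manual tokenize + convert_tokens_to_ids steps above.
manual_ids = [2023, 2003, 1996, 2190, 2607, 1999, 1996, 2088, 999]

# The extra IDs visible at both ends of the tokenizer's own output.
CLS_ID, SEP_ID = 101, 102


def to_model_batch(ids, cls_id, sep_id):
    """Wrap one sequence in special tokens and add an outer batch dimension."""
    return [[cls_id] + list(ids) + [sep_id]]


batch = to_model_batch(manual_ids, CLS_ID, SEP_ID)
print(batch)
print(len(batch), len(batch[0]))  # batch size 1, sequence length 11
```

Passing `torch.tensor(batch)` to the model would then match what the tokenizer produces with `return_tensors="pt"`, so the logits may differ slightly from the ones obtained without special tokens.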

Batching is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence, as we did above. Here we build a batch of two by repeating the same sequence:

In [32]:
batched_ids = [ids, ids]

In [33]:
input_batch_ids = torch.tensor(batched_ids)
input_batch_ids

tensor([[2023, 2003, 1996, 2190, 2607, 1999, 1996, 2088,  999],
        [2023, 2003, 1996, 2190, 2607, 1999, 1996, 2088,  999]])

In [34]:
model(input_batch_ids)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.1817,  4.5285],
        [-4.1817,  4.5285]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths. If you’ve ever worked with tensors before, you know that they need to be of rectangular shape, so you won’t be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually pad the inputs.

### Padding the inputs


In [37]:
# A ragged nested list cannot be converted to a tensor
try:
    torch.tensor([
        [200, 200, 200],
        [200, 200]
    ])
except ValueError as err:
    print(f"ValueError: {err}")

ValueError: expected sequence of length 3 at dim 1 (got 2)


In [43]:
# Padding: fill the shorter row with a placeholder ID so the batch is rectangular
padding_id = 100
torch.tensor([
    [200, 200, 200],
    [200, 200, padding_id],
])

tensor([[200, 200, 200],
        [200, 200, 100]])
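Padding by hand does not scale past a couple of rows, so here is a minimal sketch of a helper that right-pads every sequence to the length of the longest one. `pad_batch` is a hypothetical function written for this illustration, not part of 🤗 Transformers. It also returns a list of 1s and 0s marking which positions hold real tokens and which hold padding, in the same form as the `attention_mask` the tokenizer returned earlier.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build a matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


padded_ids, attention_mask = pad_batch([[200, 200, 200], [200, 200]], pad_id=100)
print(padded_ids)      # [[200, 200, 200], [200, 200, 100]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```

With a real tokenizer, `tokenizer.pad_token_id` is the value to pass as `pad_id`, since the model was trained with that specific ID as padding.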

In [44]:
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)


There’s something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we’ve got completely different values!

This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.

This is exactly what the warning printed above says: we strongly recommend passing in an `attention_mask` since your input_ids may be padded.
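The effect of the mask can be shown with a toy attention computation. This is a sketch of the idea only, not the code that runs inside the model: the scores are made-up numbers, and `softmax` and `masked_softmax` are hypothetical helpers. Attention weights come from a softmax over scores, so a padded position takes a share of the weight unless its score is pushed to negative infinity first. The sketch assumes at least one position is left unmasked.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(scores, mask):
    """Softmax that gives zero weight to positions whose mask value is 0."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    return softmax(masked)


unpadded_scores = [2.0, 1.0]     # two real tokens (made-up scores)
padded_scores = [2.0, 1.0, 0.5]  # same two tokens plus one padding position
attention_mask = [1, 1, 0]       # 0 marks the padding position

print(softmax(unpadded_scores))                       # weights without padding
print(softmax(padded_scores))                         # padding takes some weight
print(masked_softmax(padded_scores, attention_mask))  # padding gets weight 0.0
```

With the mask applied, the weights on the two real tokens match the unpadded case, which is why the batched logits line up with the single-sequence logits once a mask is supplied. With the real model, the mask goes in through the `attention_mask` argument of the model call, for example `attention_mask=torch.tensor([[1, 1, 1], [1, 1, 0]])` for the padded batch above.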