**Execute** the three code cells below before switching to presentation mode

In [1]:
from ipywidgets import Layout, interact, interactive, fixed, interact_manual, widgets
from IPython.display import display


In [2]:
import pandas as pd
from pprint import pprint

In [3]:
# to display a pair of subtokens to be merged in a slider
def get_pairs(pair:int):
    """
    pair: index of the pair. 
    """
    if pair>0:
        left, right = lines[pair].strip('\n').split(' ')
        print(f'{left} , {right}')
        
# to display token ids  in a slider
def display_token_id(id):
    token,id = vocab_sorted[id]
    print(f'id:{id} \t token:{token}')

<h1 style="color:Tomato;"> Dataset </h1>

In [4]:
from datasets import load_dataset

We will use the `bookcorpus` dataset to train a tokenizer. It may take about 15 minutes or more to download and generate the split for the first time.

It requires approximately 5 GB of memory to load the dataset. Please ensure you have sufficient memory available. If not, consider loading only a fraction of the dataset.
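If memory is tight, one option is the split-slicing syntax of the `datasets` library, which accepts split strings such as `train[:10%]`. The sketch below only builds that string. The helper name `fraction_split` is hypothetical, and it assumes the dataset exposes a `train` split.

```python
def fraction_split(percent: int, split: str = "train") -> str:
    """Build a split string that selects the first `percent` percent of a split."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"{split}[:{percent}%]"


print(fraction_split(10))

# Usage (left commented out because it triggers the download):
# from datasets import load_dataset
# ds = load_dataset('bookcorpus', split=fraction_split(10))
```

The rest of the notebook only relies on `len(ds)` and slicing, so it should work unchanged on a smaller split, although the learned merges and token IDs will differ from the ones shown here.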

In [5]:
ds = load_dataset('bookcorpus',split='all')
pprint(ds)

Dataset({
    features: ['text'],
    num_rows: 74004228
})


The dataset contains 74 million sentences of varying lengths.

The length of sentences does not affect the training of a tokenizer.

Let's take a look at a few examples from the dataset.

In [7]:
num_samples = 6
for idx,sample in enumerate(ds[0:num_samples]['text']):
    print(f'{idx} : {sample}')

0 : usually , he would be tearing around the living room , playing with his toys .
1 : but just one look at a minion sent him practically catatonic .
2 : that had been megan 's plan when she got him dressed earlier .
3 : he 'd seen the movie almost by mistake , considering he was a little young for the pg cartoon , but with older cousins , along with her brothers , mason was often exposed to things that were older .
4 : she liked to think being surrounded by adults and older kids was one reason why he was a such a good talker for his age .
5 : `` are n't you being a good boy ? ''


<h1 style="color:Tomato;"> Tokenization </h1>

Recall that a `tokenization` pipeline consists of a Normalizer, a Pre-Tokenizer, a Model and a Post-Processor, as shown below.

<img src=https://raw.githubusercontent.com/Arunprakash-A/Modern-NLP-with-Hugging-Face/refs/heads/main/Notebooks/images/pipeline.png align='left'> <br> <br> <br> <br> <br> Let us build this pipeline by importing the `Tokenizer` class. 

In [8]:
from tokenizers import Tokenizer

Let's use a simple tokenizer with the following choices for each component

|**Component** |**Choice**  |
|:------------:|:----------:|
|normalizer    |Lowercase   |
|pre-tokenizer |Whitespace  |
|model         | BPE        |
|postprocessor | None       |

In [9]:
from tokenizers.normalizers import Lowercase 
from tokenizers.pre_tokenizers import Whitespace 
from tokenizers.models import BPE 

Initialize the tokenizer with the BPE model and the special tokens ("[UNK]" in this case) that the model will use during **prediction**.

In [10]:
model = BPE(unk_token="[UNK]")
tokenizer = Tokenizer(model)

Add the normalizer and pre-tokenizer to the pipeline

In [11]:
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()

Create a trainer by setting `vocab_size` and `special_tokens`

In [12]:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(vocab_size=32000,special_tokens=["[PAD]","[UNK]"],continuing_subword_prefix='##')

The pipeline is ready. Next, we need to pass a **list** of strings as input to the tokenizer.

However, creating an additional **list** containing all 74 million samples will require an extra 5 GB of memory.

To reduce the memory usage, let us create a generator that returns a batch of samples as a list

In [13]:
def get_examples(batch_size=1000):
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]['text']    
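To see what such a generator yields without touching the full dataset, here is a minimal sketch that applies the same slicing pattern to a plain Python list. The name `batched` and the toy sentences are illustrative only.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


toy_sentences = [f"sentence {i}" for i in range(5)]
batches = list(batched(toy_sentences, batch_size=2))
for batch in batches:
    print(batch)
```

Only one batch exists in memory at a time while the generator is being consumed, which is why this pattern avoids building a second full copy of the data.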

Let us train the tokenizer

<img src=https://raw.githubusercontent.com/Arunprakash-A/Modern-NLP-with-Hugging-Face/refs/heads/main/Notebooks/images/trainer.png>

In [14]:
from multiprocessing import cpu_count
print(cpu_count())

12


In [15]:
tokenizer.train_from_iterator(get_examples(batch_size=10000),trainer=trainer,length=len(ds))






The training took about 5 minutes to complete and required an additional 1 GB of memory.

It would be interesting to see the subtokens that were merged to create the final vocabulary.

In [16]:
tokenizer.model.save('model',prefix='hopper')

['model/hopper-vocab.json', 'model/hopper-merges.txt']

In [17]:
with open('model/hopper-merges.txt','r') as file:
    row = 0
    num_lines = 10
    for line in file.readlines():
        print(line)
        row+=1 
        if row >= num_lines:
            break   
    

#version: 0.2

##h ##e

t ##he

##i ##n

##e ##r

##e ##d

##o ##u

##n ##d

##in ##g

t ##o
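To make the merge list above less abstract, here is a toy sketch of how BPE could learn merges from a small word-frequency table, using the same `##` continuing-subword prefix. It is a simplified illustration, not the implementation used by the `tokenizers` library; the function name `learn_merges` and the toy counts are made up, and words are assumed to be non-empty.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_merges(word_counts: Dict[str, int], num_merges: int,
                 prefix: str = "##") -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merges from a word-frequency table."""
    # Start from characters; every non-initial character carries the prefix.
    corpus = Counter()
    for word, count in word_counts.items():
        symbols = (word[0],) + tuple(prefix + ch for ch in word[1:])
        corpus[symbols] += count

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        (left, right), _count = pair_counts.most_common(1)[0]
        merges.append((left, right))
        # The right-hand piece is never word-initial, so it always has the prefix.
        merged = left + right[len(prefix):]

        # Rewrite the corpus, replacing each occurrence of the pair.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == (left, right):
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return merges


toy_counts = {"the": 5, "she": 4, "he": 3}
toy_merges = learn_merges(toy_counts, num_merges=3)
for left, right in toy_merges:
    print(left, right)
```

On this toy table the first two merges are `##h ##e` and `t ##he`, the same pattern that opens the real merges file. Ties between equally frequent pairs are broken arbitrarily in this sketch.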



Display last `n` merges

In [18]:
with open('model/hopper-merges.txt','r') as file:
    row = 0
    num_lines = 10
    for line in reversed(file.readlines()):
        print(line)
        row+=1
        if row >= num_lines:
            break   

mel ##anthe

black ##er

ad ##ject

v ##ang

betroth ##al

tiptoe ##ing

restroom ##s

consol ##ing

esp ##ionage

influ ##x



<h1 style="color:Tomato;"> Vocabulary </h1>

Let's view the number of merges

In [19]:
with open('model/hopper-merges.txt','r') as file:    
    lines = file.readlines()        

In [20]:
print(f'Number of merges:{len(lines)}')

Number of merges:31871


In [21]:
print(f'vocab size:{tokenizer.get_vocab_size()}') 

vocab size:32000


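The merges file has 31,871 lines, but the first line is the `#version` header, so the tokenizer learned 31,870 merges. This is fewer than the vocabulary size because the vocabulary also contains 130 tokens that no merge produced: the two special tokens and the initial alphabet of single-character tokens (letters, digits and symbols, with and without the `##` prefix).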
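When counting merges from the file, it helps to skip the `#version` header that appears as the first line of the output shown earlier. A minimal sketch, assuming each remaining line holds exactly two space-separated sub-tokens (the helper `parse_merges` and the sample lines are illustrative):

```python
from typing import List, Tuple


def parse_merges(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (left, right) pairs, skipping the header and blank lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#version"):
            continue
        left, right = line.split(" ")
        pairs.append((left, right))
    return pairs


sample_lines = ["#version: 0.2\n", "##h ##e\n", "t ##he\n", "##i ##n\n"]
pairs = parse_merges(sample_lines)
print(f"lines: {len(sample_lines)}, merges: {len(pairs)}")
```

Applied to the `lines` list read from `hopper-merges.txt`, the same function would return one pair fewer than `len(lines)`.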

We can view the final learned vocabulary either from the saved `hopper-vocab.json` file or by using the `get_vocab` method.

In [22]:
vocab = tokenizer.get_vocab()

For convenience, let's print the vocabulary sorted by token IDs.

In [23]:
vocab_sorted = sorted(vocab.items(), key=lambda item: item[1])

Let's adjust the sliders below to view the merged subwords and their corresponding tokens in the vocabulary.

In [24]:
_ = interact(get_pairs,pair=widgets.IntSlider(min=1, max=len(lines)-1, step=1, value=1,layout=Layout(width='900px')))
_ = interact(display_token_id,id=widgets.IntSlider(min=0, max=31999, step=1, value=130,layout=Layout(width='900px')))

interactive(children=(IntSlider(value=1, description='pair', layout=Layout(width='900px'), max=31870, min=1), …

interactive(children=(IntSlider(value=130, description='id', layout=Layout(width='900px'), max=31999), Output(…

Note that ID 130 is the first token created by a merge (`##h` and `##e` form `##he`). Which two tokens are merged next? You can explore this by moving the sliders.

The sub-words are `t` and `##he`. After the merge, they will form the token `the`.

Similarly, the last two merges are `mel ##anthe` and `black ##er`. (Move the slider to the far right to see them.)

<h1 style="color:Tomato;"> Encoding </h1>

Pass a single sample to the `encode` method of the tokenizer object.

In [25]:
sample = ds[0]['text']
print(f'sample: {sample}')
encoding = tokenizer.encode(sample)
print(encoding)

sample: usually , he would be tearing around the living room , playing with his toys .
Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


It returns an `Encoding` object that contains useful attributes such as the token IDs (`ids`), `type_ids` and so on.

We need to access these attributes to get their respective values

In [26]:
token_ids = encoding.ids
tokens = encoding.tokens
type_ids = encoding.type_ids
attention_mask = encoding.attention_mask

In [27]:
from tokenizers.tools import EncodingVisualizer
visualizer = EncodingVisualizer(tokenizer=tokenizer)
visualizer(text=sample)

In [28]:
out_dict = {'tokens':tokens,'ids':token_ids,'type_ids':type_ids,'attention_mask':attention_mask}
df = pd.DataFrame.from_dict(out_dict)
df

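|   |tokens  |ids  |type_ids|attention_mask|
|:-:|:------:|:---:|:------:|:------------:|
|0  |usually |2462 |0       |1             |
|1  |,       |19   |0       |1             |
|2  |he      |149  |0       |1             |
|3  |would   |277  |0       |1             |
|4  |be      |162  |0       |1             |
|5  |tearing |6456 |0       |1             |
|6  |around  |422  |0       |1             |
|7  |the     |131  |0       |1             |
|8  |living  |1559 |0       |1             |
|9  |room    |536  |0       |1             |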


`type_ids` are model-specific. For instance, BERT-like models use type IDs of 0 and 1.

Another important attribute is the `attention_mask`, which is used in nearly all transformer-based architectures. In this mask, the value is 1 for tokens to be attended to and 0 for masked tokens (which may seem counterintuitive given the term "masking").

<h1 style="color:Tomato;">  Batch Encoding </h1>

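When encoding a batch of samples, we need to pad the shorter sequences with `[PAD]` so that all sequences in the batch have the same length. Padding only up to the longest sequence in each batch is commonly called dynamic padding.

To encode a batch, we use the `encode_batch` method of the tokenizer object.

The `token_id` for the `[PAD]` token is 0.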

In [29]:
samples = ds[0:4]['text']

In [30]:
batch_encoding = tokenizer.encode_batch(samples)
pprint(batch_encoding)

[Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=14, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=42, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]


The maximum length of the sequence in the batch is 42. Clearly, padding is not applied to the remaining samples. 

In general, it is also possible for the length of a sequence to exceed the model's context length (or window size).

Assuming the model's context length is 512, enable padding and truncation while batching the samples.

In [31]:
# all default args
tokenizer.enable_padding(direction = 'right',
                         pad_id = 0,
                         pad_type_id = 0,
                         pad_token = '[PAD]',
                         length = None, # None default to max_len in the batch
                         pad_to_multiple_of = None) 

tokenizer.enable_truncation(max_length=512)


In [None]:
batch_encoding = tokenizer.encode_batch(samples)
pprint(batch_encoding)

Now we can see that all samples in the batch are of the same length.
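The effect of truncation followed by right padding, together with the matching attention mask, can be sketched in plain Python. This is an illustration of the idea under simple assumptions, not what the library does internally; `pad_and_truncate` and the toy IDs are made up for the example.

```python
from typing import Dict, List


def pad_and_truncate(batch_ids: List[List[int]], pad_id: int = 0,
                     max_length: int = 512) -> Dict[str, List[List[int]]]:
    """Truncate to `max_length`, then right-pad to the longest remaining sequence."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target = max((len(ids) for ids in truncated), default=0)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13]]
padded = pad_and_truncate(toy_batch, pad_id=0, max_length=4)
for row, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(row, mask)
```

With `max_length=4`, the five-token sequence is cut to four tokens and the two shorter ones are padded up to that length.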

<h1 style="color:Tomato;">  Quick test </h1>

Let's pass a test sequence that contains two tokens not present in the vocabulary.

In [32]:
text = "All this is so simple to do in HF இ😊."
encoded = tokenizer.encode(text).tokens
print(encoded)

['all', 'this', 'is', 'so', 'simple', 'to', 'do', 'in', 'h', '##f', '[UNK]', '[UNK]', '##.']


Normalization worked as expected (All $\rightarrow$ all, HF $\rightarrow$ hf).

Pre-tokenization followed by model tokenization has been applied correctly.

The continuing prefix has been used appropriately (`##f`, `##.`).
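The final period appears as `##.` rather than `.` most likely because the emoji and the period end up in the same pre-token. The sketch below approximates the pre-tokenization step with Python's `re` module, assuming the `Whitespace` pre-tokenizer splits on the pattern `\w+|[^\w\s]+` as described in the `tokenizers` documentation. The function `pre_tokenize` is an illustration, not the library's code.

```python
import re

# Assumed to mirror the pattern documented for the `Whitespace` pre-tokenizer.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text: str) -> list:
    """Lowercase the text, then split it into word and punctuation runs."""
    return WHITESPACE_PATTERN.findall(text.lower())


pieces = pre_tokenize("All this is so simple to do in HF இ😊.")
print(pieces)
```

The Tamil letter counts as a word character and forms its own piece, while the emoji and the period form a single run of non-word characters. Within that run the emoji is unknown to the model and the period is a continuation, which matches the `[UNK]`, `##.` pair in the output above.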

<h1 style="color:Tomato;">  Saving and Loading Tokenizer </h1>

Let us save the tokenizer and load it in a single line of code in the model training script

In [33]:
tokenizer.save('hopper.json')

It saves all the required information, such as `added_tokens`, the details of the `model` (vocabulary, merges, and so on), the `normalizer` and the `pre_tokenizer`.

In [34]:
import json
with open('hopper.json','r') as file:
    json_data = json.load(file)

In [35]:
pprint(json_data, depth=1)

{'added_tokens': [...],
 'decoder': None,
 'model': {...},
 'normalizer': {...},
 'padding': {...},
 'post_processor': None,
 'pre_tokenizer': {...},
 'truncation': {...},
 'version': '1.0'}


It is now easy to load the tokenizer back

In [36]:
trained_tokenizer = Tokenizer(BPE())

In [37]:
# from_file is a static method, so it can be called on the class directly
trained_tokenizer = Tokenizer.from_file('hopper.json')

In [38]:
tokens = trained_tokenizer.encode(text).tokens
print(tokens)

['all', 'this', 'is', 'so', 'simple', 'to', 'do', 'in', 'h', '##f', '[UNK]', '[UNK]', '##.']


<h1 style="color:Tomato;"> BERT Tokenizer </h1>

Let’s quickly build a BERT-like tokenizer (don’t worry about what BERT-like models are for now)

The input to BERT models generally follows the template below, with slight variations across different BERT implementations.

[ `[CLS]`, token-A_1, $\cdots$, token-A_n, `[SEP]`, token-B_1, $\cdots$, token-B_m ]

Here `A` denotes sentence `A` with `n` tokens and `B` denotes  sentence `B` with `m` tokens. 

The special tokens are : `[CLS]`,`[SEP]`,`[PAD]`,`[MASK]` and `[UNK]`

In [39]:
bert_tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
bert_tokenizer.normalizer = Lowercase()
bert_tokenizer.pre_tokenizer = Whitespace()
bert_trainer = BpeTrainer(vocab_size=32000,
                          special_tokens=["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]"],
                          continuing_subword_prefix='##')

We just need to add a post-processing step in which the special tokens are inserted according to the template.

In [40]:
from tokenizers.processors import TemplateProcessing

* If we pass a single sentence, the tokenizer should output `"[CLS] $0 [SEP]"` where 0 denotes the `type_id` (which defaults to zero if there is a single sentence) <br>

* If we pass a pair of sentences, the tokenizer should output `"[CLS] $A:0 [SEP] $B:1"`

In [41]:
bert_tokenizer.post_processor = TemplateProcessing(single="[CLS] $0 [SEP]",
                                                   pair="[CLS] $A [SEP] $B:1",
                                                   special_tokens=[("[CLS]", 2), ("[SEP]", 3)],
                                                  )
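What the template does to the token IDs and type IDs can be sketched in plain Python. This is only an illustration of the expected result, not the library's implementation; `apply_template` and the toy IDs are hypothetical, and the special tokens are assumed to receive type ID 0.

```python
from typing import Dict, List, Optional

CLS_ID, SEP_ID = 2, 3  # same IDs as in the `special_tokens` argument above


def apply_template(ids_a: List[int],
                   ids_b: Optional[List[int]] = None) -> Dict[str, List[int]]:
    """Build `[CLS] A [SEP]` or `[CLS] A [SEP] B` with matching type IDs."""
    ids = [CLS_ID] + list(ids_a) + [SEP_ID]
    type_ids = [0] * len(ids)          # first segment and special tokens
    if ids_b is not None:
        ids += list(ids_b)
        type_ids += [1] * len(ids_b)   # second segment
    return {"ids": ids, "type_ids": type_ids}


single = apply_template([10, 11])
pair_out = apply_template([10, 11], [20])
print(single)
print(pair_out)
```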

In [42]:
bert_tokenizer.train_from_iterator(get_examples(batch_size=10000),trainer=bert_trainer,length=len(ds))






Pass a single sentence to the tokenizer

In [43]:
text = "All these are so simple to do in HF. Let's do more"
encoded = bert_tokenizer.encode(text)
tokens = encoded.tokens
ids = encoded.ids
out_dict = {'tokens':tokens,'ids':ids}
pprint(out_dict,depth=2,compact=True)

{'ids': [2, 270, 956, 336, 231, 2534, 141, 206, 157, 56, 95, 24, 462, 17, 67,
         206, 387, 3],
 'tokens': ['[CLS]', 'all', 'these', 'are', 'so', 'simple', 'to', 'do', 'in',
            'h', '##f', '.', 'let', "'", 's', 'do', 'more', '[SEP]']}


Pass a pair of sentences

In [44]:
text = "All these are so simple to do in HF. Let's do more"
pair = "We have a long way to go!"
encoded = bert_tokenizer.encode(text,pair)
tokens = encoded.tokens
ids = encoded.ids
out_dict = {'tokens':tokens,'ids':ids}
pprint(out_dict,depth=2,compact=True)

{'ids': [2, 270, 956, 336, 231, 2534, 141, 206, 157, 56, 95, 24, 462, 17, 67,
         206, 387, 3, 214, 250, 49, 490, 415, 141, 260, 12],
 'tokens': ['[CLS]', 'all', 'these', 'are', 'so', 'simple', 'to', 'do', 'in',
            'h', '##f', '.', 'let', "'", 's', 'do', 'more', '[SEP]', 'we',
            'have', 'a', 'long', 'way', 'to', 'go', '!']}


<h1 style="color:Tomato;"> Decoding </h1>

The special tokens (for example, `[PAD]`) need to be removed, and sub-words have to be merged, before the final result is shown to the end user.

Let us decode the encoded `token_ids` from the previous step.

In [45]:
plain_tokens = bert_tokenizer.decode(ids)
print(plain_tokens)

all these are so simple to do in h ##f . let ' s do more we have a long way to go !


The special tokens were removed, but the subwords `h` and `##f` were not merged into a complete word.

We need to use appropriate decoders based on the type of tokenizer. See the list of decoders [here](https://huggingface.co/docs/tokenizers/v0.13.4.rc2/en/api/decoders)

In [46]:
from tokenizers.decoders import WordPiece

In [47]:
bert_tokenizer.decoder = WordPiece(prefix='##')

In [48]:
plain_tokens = bert_tokenizer.decode(ids)
print(plain_tokens)

all these are so simple to do in hf. let ' s do more we have a long way to go!
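The core of this decoding step, gluing `##` pieces onto the preceding token, can be sketched as follows. It is a simplified stand-in: the function `join_wordpieces` is made up for illustration, and unlike the output above it does not remove the space before punctuation, which the library's decoder appears to handle as an extra clean-up step.

```python
from typing import List


def join_wordpieces(tokens: List[str], prefix: str = "##") -> str:
    """Merge pieces that start with `prefix` into the preceding token."""
    words: List[str] = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # glue onto the previous piece
        else:
            words.append(token)
    return " ".join(words)


decoded = join_wordpieces(["do", "in", "h", "##f", ".", "let"])
print(decoded)
```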


<h1 style="color:Tomato;">  Pretrained Tokenizer </h1>

Finally, we wrap everything in a `PreTrainedTokenizerFast` object, the `transformers` wrapper for tokenizers built with the `tokenizers` library.

Recall that `tokenizer.encode` returns an `Encoding` with a list of attributes that includes `ids`, `type_ids`, `tokens`, `attention_mask` and many more.

In [49]:
encoding = tokenizer.encode(text)
print(encoding)

Encoding(num_tokens=16, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])


In general, the model requires the `input_ids`,`attention_mask` and other model specific attributes. To get these from the `Encoding` object we need to iterate over all the `Encoding` objects corresponding to each sample in a batch. 
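The manual iteration described above might look like the sketch below. `ToyEncoding` is a hypothetical stand-in that exposes only three of the attributes of a real `Encoding`, and `collate` is an illustrative helper, not part of any library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyEncoding:
    """Hypothetical stand-in exposing three attributes of `Encoding`."""
    ids: List[int]
    type_ids: List[int]
    attention_mask: List[int]


def collate(encodings: List[ToyEncoding]) -> Dict[str, List[List[int]]]:
    """Gather per-sample attributes into one dictionary of lists."""
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [], "token_type_ids": [], "attention_mask": []
    }
    for enc in encodings:
        batch["input_ids"].append(enc.ids)
        batch["token_type_ids"].append(enc.type_ids)
        batch["attention_mask"].append(enc.attention_mask)
    return batch


toy_encodings = [
    ToyEncoding(ids=[54, 281], type_ids=[0, 0], attention_mask=[1, 1]),
    ToyEncoding(ids=[772, 0], type_ids=[0, 0], attention_mask=[1, 0]),
]
model_batch = collate(toy_encodings)
print(model_batch)
```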

Here the `PreTrainedTokenizerFast` class comes to the rescue!

In [50]:
from transformers import PreTrainedTokenizerFast

In [51]:
pt_tokenizer = PreTrainedTokenizerFast(tokenizer_file='hopper.json',
                                      unk_token='[UNK]',
                                      pad_token='[PAD]',
                                      model_input_names=["input_ids","token_type_ids","attention_mask"],
                                      )



Now we can simply call the `pt_tokenizer` with an input. See the call signature [here](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__)

In [52]:
model_inputs = pt_tokenizer(text)
pprint(model_inputs,compact=True)

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [267, 953, 333, 228, 2531, 138, 203, 154, 53, 96, 21, 459, 14, 64,
               203, 384],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}


In [53]:
model_inputs = pt_tokenizer(text,text_pair=pair)
pprint(model_inputs,compact=True)

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1],
 'input_ids': [267, 953, 333, 228, 2531, 138, 203, 154, 53, 96, 21, 459, 14, 64,
               203, 384, 211, 247, 46, 487, 412, 138, 257, 9],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
                    1, 1, 1, 1]}


Note that the tokens of the second sentence are assigned type ID 1. There are no special tokens (`[CLS]`: 2, `[SEP]`: 3) in the output because `pt_tokenizer` was built from `hopper.json`, not from `bert_tokenizer`.

We can also pass a **batch** of samples and it works!

In [54]:
batch_text = ['I like the book The Psychology of Money','I enjoyed watching the Transformers movie','oh! thanks for this']

In [55]:
model_inputs = pt_tokenizer(batch_text)
pprint(model_inputs,compact=True)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1]],
 'input_ids': [[54, 281, 131, 1701, 131, 19478, 153, 1564],
               [54, 4096, 1443, 131, 7744, 307, 3760],
               [772, 9, 1767, 200, 254]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0]]}


Padding is not applied by default, so let's enable it.

In [56]:
model_inputs = pt_tokenizer(batch_text,padding=True)
pprint(model_inputs,compact=True)

{'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0],
                    [1, 1, 1, 1, 1, 0, 0, 0]],
 'input_ids': [[54, 281, 131, 1701, 131, 19478, 153, 1564],
               [54, 4096, 1443, 131, 7744, 307, 3760, 0],
               [772, 9, 1767, 200, 254, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]]}


We can save the pre-trained tokenizer by calling `pt_tokenizer.save_pretrained('hopper')`. It will create a directory named `hopper` and store all the required files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`).

The `PreTrainedTokenizerFast` class implements additional methods that are useful for the model during training.

We just need to pass the `tokenizer_file` and a few other arguments.

We will see how to use it for training a model in the next experiment.