# **Analysing The Python Tokenizer**

The tokenizer is built on top of GPT2's default tokenizer.
Note: the tokenizer is built and pushed to the HF Hub by `src/tokenizer.py`; in the following section I am only analysing its performance.

The tokenizer uses a byte-level BPE algorithm that operates on Unicode strings. The 256 possible byte values form the base vocabulary, but many of them are control or whitespace characters, e.g. newline, tab, escape, and other non-printable characters. The GPT-2 tokenizer therefore maps each of the 256 byte values to a standard printable Unicode character.
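
The mapping described above can be sketched in plain Python. This is a minimal re-implementation of the idea, not the library code itself; the function name `byte_to_unicode_sketch` is made up for illustration, and `bytes_to_unicode` from `transformers` remains the reference.

```python
def byte_to_unicode_sketch():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}

    # Every other byte (control characters, space, ...) is shifted to the
    # next unused code point starting at 256, so it also becomes printable.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


mapping = byte_to_unicode_sketch()
print(len(mapping))                            # 256
print(mapping[ord(" ")], mapping[ord("\n")])   # Ġ Ċ
```

The space and newline bytes end up as `Ġ` and `Ċ`, which is why these two symbols show up in the tokenized output further down.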

In [33]:
from transformers import  AutoTokenizer
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

In [34]:
byte_to_unicode_map = bytes_to_unicode()
unicode_to_byte_map = dict((v, k) for k, v in byte_to_unicode_map.items())
base_vocab = list(unicode_to_byte_map.keys())
print(f'Size of our base vocabulary: {len(base_vocab)}')
print(f'First element: `{base_vocab[0]}`, last element: `{base_vocab[-1]}`')

Size of our base vocabulary: 256
First element: `!`, last element: `Ń`


In [35]:
tokenizer = AutoTokenizer.from_pretrained("razhan/codeqmul")

## Longest words in the vocabulary

In [36]:

tokens = sorted(tokenizer.vocab.items(), key=lambda x: len(x[0]), reverse=True)
print([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[:10]]);

['\n                                                                                                  ', '\n                                                                                                ', '################################################################################################', '\n                                                                                              ', '\n                                                                                            ', '\n                                                                                          ', '\n                                                                                        ', '\n                                                                                      ', '\n                                                                                    ', '\n                                                                                  ']


That makes sense. As we can see, the longest tokens are either a newline followed by a long run of spaces (deep indentation) or a long line of hashes, which is used as a separator in code comments.

## Least common words in the vocabulary

In [37]:
tokens = sorted(tokenizer.vocab.items(), key=lambda x: x[1], reverse=True)
print([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[:15]])

['658', ' uptime', " '>',", ' RPN', ' fiscal', 'TracedValue', 'Sale', 'Finds', 'MORE', 'fen', '%",', 'correctly', 'Metaclass', ' Consumer', 'arena']


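Sorting by token ID in descending order shows the last tokens added to the vocabulary. BPE adds merges in order of decreasing frequency, so these are the least commonly occurring tokens in the corpus.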
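
The link between merge order and frequency can be illustrated with a toy BPE trainer. This is only a sketch under simplifying assumptions (a hypothetical three-word corpus, no pre-tokenization, no minimum frequency); it is not the trainer used by `src/tokenizer.py`.

```python
from collections import Counter


def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy version)."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        merges.append((best, count))

        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges


# Hypothetical word frequencies, chosen only for illustration.
toy_corpus = {"def": 10, "return": 6, "nonlocal": 1}
for pair, count in learn_bpe_merges(toy_corpus, num_merges=10):
    print(pair, count)
```

In this toy run the pair counts never increase from one merge to the next, and the final merges come from the rare word, which mirrors why the highest token IDs hold the rarest tokens.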

## First tokens after the first 256 bytes

In [38]:
tokens = sorted(tokenizer.vocab.items(), key=lambda x: x[1], reverse=False)
print([f'{tokenizer.convert_tokens_to_string(t)}' for t, _ in tokens[257:290]])

[' p', 'ge', ' re', 'ur', '--', 'ce', ' "', ' n', '):', 'mp', 'it', ' s', 'lo', 'ue', ' in', 'ame', 'ut', 'ing', ' o', 'ct', 'def', 'pe', 'ate', "',", '\n                ', ' a', 'el', 'id', '\n                  ', 'ser', '##', '\n\n   ', 'fi']


If we skip the first 256 bytes of the vocabulary, we can see various levels of indentation. This makes sense, since Python is an indentation-based programming language and we don't want to lose that information, because it is important for our model to generate correct programs.

## The last words in the vocabulary

In [39]:
print([f'{tokenizer.convert_tokens_to_string(t)}' for t,_ in tokens[-15:]])

['arena', ' Consumer', 'Metaclass', 'correctly', '%",', 'fen', 'MORE', 'Finds', 'Sale', 'TracedValue', ' fiscal', ' RPN', " '>',", ' uptime', '658']


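We can see a few operator-like tokens (`%",` and `'>',`) alongside ordinary identifiers. These are the same 15 tokens as in the least-common listing above, in reverse order, because both cells look at the highest token IDs. The special tokens we added for beginning of sentence, end of sentence, padded text, and unknown text do not appear among these 15 entries.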

## Test cases

In [40]:
python_code ="""def is_prime(n):
    for i in range(2,int(n**0.5)+1):
        if n%i==0:
            return False
    return True"""

print(tokenizer(python_code).tokens())

['def', 'Ġis', '_', 'prime', '(', 'n', '):', 'ĊĠĠĠ', 'Ġfor', 'Ġi', 'Ġin', 'Ġrange', '(', '2', ',', 'int', '(', 'n', '**', '0', '.', '5', ')+', '1', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġif', 'Ġn', '%', 'i', '==', '0', ':', 'ĊĠĠĠĠĠĠĠĠĠĠĠ', 'Ġreturn', 'ĠFalse', 'ĊĠĠĠ', 'Ġreturn', 'ĠTrue']


It's working as expected. The symbol `Ġ` stands for a space and `Ċ` stands for a newline. Where the code is indented by 4, 8, or 12 spaces, the newline and the spaces that follow it are merged into a single token such as `ĊĠĠĠ`, and the last space of the indent is attached to the next word, as in `Ġfor`. This makes our tokenizer much more efficient, since it does not have to treat each space separately.
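
To check this reading of the output, the tokens can be turned back into source code by hand. The sketch below assumes ASCII-only input, so replacing `Ġ` and `Ċ` is enough, and it assumes the indentation tokens contain 3, 7, and 11 `Ġ` symbols, as read from the printed output. The helper name `decode_ascii_tokens` is hypothetical.

```python
# Tokens copied from the output above; runs of Ġ are written with `*`
# so that their widths are explicit.
tokens = [
    "def", "Ġis", "_", "prime", "(", "n", "):", "Ċ" + "Ġ" * 3,
    "Ġfor", "Ġi", "Ġin", "Ġrange", "(", "2", ",", "int", "(", "n", "**",
    "0", ".", "5", ")+", "1", "):", "Ċ" + "Ġ" * 7,
    "Ġif", "Ġn", "%", "i", "==", "0", ":", "Ċ" + "Ġ" * 11,
    "Ġreturn", "ĠFalse", "Ċ" + "Ġ" * 3, "Ġreturn", "ĠTrue",
]


def decode_ascii_tokens(tokens):
    """Undo the byte-to-Unicode mapping for space and newline only."""
    return "".join(tokens).replace("Ġ", " ").replace("Ċ", "\n")


decoded = decode_ascii_tokens(tokens)
print(decoded)
print(f"{len(decoded)} characters -> {len(tokens)} tokens")
```

Each indentation token carries one space fewer than the indent itself, because the final space travels with the following word. For real use, `tokenizer.convert_tokens_to_string` handles the full byte mapping.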

## Checking if all the python reserved keywords are in the vocabulary

In [41]:
import keyword

print(f'There are in total {len(keyword.kwlist)} reserved keywords in the python language.')
for keyw in keyword.kwlist:
    if keyw not in tokenizer.vocab:
        print(f'`{keyw}` is not in the vocabulary')

There are in total 35 reserved keywords in the python language.
`nonlocal` is not in the vocabulary


Only `nonlocal` is not in the vocabulary. That's okay: the keyword is rarely used, so leaving it out of the vocabulary should not noticeably affect performance. I also tried training the tokenizer on a portion of the dataset twice as large, and the vocabulary still did not contain the `nonlocal` keyword. The tokenizer is trained on 20% of the corpus, which is a good representation of the whole corpus.
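
The check above only looks for the bare keyword. Since tokens that follow a space are stored with a leading `Ġ` (as in `Ġfor` and `Ġreturn` in the test case), a slightly broader check could look for both forms. The sketch below is an assumption about how one might do that; `missing_keywords` and `toy_vocab` are hypothetical names, and the toy vocabulary stands in for `tokenizer.vocab`.

```python
import keyword


def missing_keywords(vocab):
    """Return keywords found neither as `kw` nor as `Ġkw` in the vocabulary."""
    missing = []
    for kw in keyword.kwlist:
        if kw not in vocab and "Ġ" + kw not in vocab:
            missing.append(kw)
    return missing


# Tiny stand-in vocabulary, for illustration only.
toy_vocab = {"def": 0, "Ġreturn": 1, "Ġfor": 2}
missing = missing_keywords(toy_vocab)
print("def" in missing, "return" in missing, "nonlocal" in missing)
```

With the real tokenizer, `missing_keywords(tokenizer.vocab)` would report a keyword as missing only if both spellings are absent.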

## Conclusion of the tokenizer

In comparison, our brand-new tokenizer trained on the code corpus is at least twice as efficient as the original tokenizer provided by GPT-2: the sequences generated by our tokenizer are half the length of the sequences generated by the default tokenizer. This effectively lets us fit twice as much code into the model context as before. In my case the effective context is 4 times that of GPT-2, since I used GPT-Neo, whose window size is increased to 2048 from 1024.
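
The arithmetic behind this estimate can be written down as a small sketch. The two helper functions are hypothetical, and the stand-in tokenizers (character-level and whitespace-level) are placeholders chosen only so the example runs without downloading any model.

```python
def mean_length_ratio(baseline_tokenize, new_tokenize, samples):
    """How many baseline tokens are needed per token of the new tokenizer."""
    baseline_total = sum(len(baseline_tokenize(s)) for s in samples)
    new_total = sum(len(new_tokenize(s)) for s in samples)
    return baseline_total / new_total


def effective_context_gain(length_ratio, new_window, old_window):
    """Combine shorter sequences with a larger context window."""
    return length_ratio * new_window / old_window


# Sequences half as long (ratio 2) and a window of 2048 instead of 1024.
print(effective_context_gain(2.0, new_window=2048, old_window=1024))  # 4.0

# Stand-in tokenizers, for illustration only.
samples = ["return True", "def f(x): return x"]
print(mean_length_ratio(list, str.split, samples))
```

With real tokenizers, each callable could be something like `lambda s: tok(s)["input_ids"]`, evaluated on a held-out sample of the code corpus.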