# Lab 1: Tokenization

Splitting a block of text into meaningful subunits is an essential part of processing text. Text can be split into individual characters, into words, or into something in between. The very basic approach shown below splits text on whitespace. It already has a shortcoming: the final word 'dog' still has punctuation attached to it.

In [1]:
'The quick brown fox jumps over the lazy dog.'.split(' ')

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
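One possible way around the attached punctuation is a small regular expression. This is only a sketch of a rule-based alternative using Python's standard `re` module; it is not the method used in the rest of this lab, and the variable name `simple_tokens` is our own.

```python
import re

text = 'The quick brown fox jumps over the lazy dog.'

# \w+ matches a run of word characters (a word);
# [^\w\s] matches a single character that is neither word nor whitespace (punctuation).
simple_tokens = re.findall(r"\w+|[^\w\s]", text)
print(simple_tokens)
```

Here the full stop becomes its own token instead of staying glued to 'dog'.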

With Transformer models, we use subword tokenization and split the text with a prebuilt tokenizer. The tokenizer has been trained on a large amount of text, from which it has learned which words are common enough to keep whole and which are rarer and should be split into parts (that often look like syllables).
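To give a feel for how a trained subword tokenizer might split a word, here is a toy sketch of byte-pair encoding (BPE), the family of algorithm that GPT-2 style tokenizers are based on. The function `bpe_tokenize` and the merge list `toy_merges` are invented for illustration; the real `distilgpt2` merges are learned from data and are far more numerous.

```python
def bpe_tokenize(word, merges):
    """Apply BPE merges to one word, highest-priority merge first."""
    # A merge that appears earlier in the list has a lower rank and is applied first.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float('inf')))
        if best not in ranks:
            break  # no learned merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, NOT taken from the real distilgpt2 tokenizer.
toy_merges = [('v', 'o'), ('vo', 'l'), ('a', 'n'), ('c', 'an'), ('can', 'o')]
print(bpe_tokenize('volcano', toy_merges))
```

A word whose pieces were never merged during training stays split into smaller parts, which is why rare words end up as several subword tokens.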

First, let's load the tokenizer for a common Transformer model, `distilgpt2`, with the code below. The `distilgpt2` model is a smaller model based upon `gpt2`, which is a predecessor to the language model that underpins ChatGPT.

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

The tokenizer has a function `tokenizer.tokenize` that splits up text. Run it on the string "I visited Glasgow.".

In [3]:
tokenizer.tokenize("I visited Glasgow.")

['I', 'Ġvisited', 'ĠGlasgow', '.']

You should see four tokens, some of which start with an odd character, 'Ġ'. The 'Ġ' marks a token that begins a new word (it stands in for the preceding space). Try tokenizing "volcano" below with `tokenizer.tokenize`. It should be split into two subword tokens.


In [4]:
tokenizer.tokenize('volcano')

['vol', 'cano']
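The 'Ġ' marker seen in the "I visited Glasgow." output can be undone by hand, which helps show what it means. The sketch below rebuilds the original string from the token list copied from that output; `marker` and `rebuilt` are our own names. It also checks that 'Ġ' is the character with code point 256 + 32, where 32 is the code of the space character, which is consistent with the marker standing in for a space.

```python
marker = '\u0120'  # the 'Ġ' character

# Token list copied from the tokenizer output for "I visited Glasgow."
tokens = ['I', marker + 'visited', marker + 'Glasgow', '.']

# Concatenate the tokens, then turn each marker back into a space.
rebuilt = ''.join(tokens).replace(marker, ' ')
print(rebuilt)

# The marker's code point is the space character's code point shifted by 256.
print(ord(marker), ord(' ') + 256)
```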


Besides splitting the text into tokens, we want each token mapped to a number, because Transformer models take token indices as input. For example, the cell below looks up the indices for the tokens 'ĠGlasgow', 'Ġvisited' and 'vol'.

In [6]:
tokenizer.vocab['ĠGlasgow'], tokenizer.vocab['Ġvisited'], tokenizer.vocab['vol']

(23995, 8672, 10396)

`tokenizer.vocab` is a big dictionary mapping subword tokens to their indices. Let's see how big the `distilgpt2` tokenizer's vocabulary is:

In [7]:
len(tokenizer.vocab)

50257

We could map the tokenized output to token indices manually, as the first cell below does with `tokenizer.vocab`. But the tokenizer can do it for us: pass "I visited Glasgow." into the `tokenizer.encode` function.

In [15]:
tokenizer.vocab['I'], tokenizer.vocab['Ġvisited'], tokenizer.vocab['ĠGlasgow'], tokenizer.vocab['.']

(40, 8672, 23995, 13)

In [10]:
tokenizer.encode("I visited Glasgow.")

[40, 8672, 23995, 13]

You should get a list of indices (`[40, 8672, 23995, 13]`).
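Conceptually, this step is a dictionary lookup per token. The sketch below reproduces it without the library, using a hypothetical four-entry dictionary `mini_vocab` whose indices are copied from the outputs above; the real vocabulary has 50257 entries.

```python
# Hypothetical miniature vocabulary; indices copied from the outputs above.
mini_vocab = {'I': 40, 'Ġvisited': 8672, 'ĠGlasgow': 23995, '.': 13}

# Tokens as produced by tokenizer.tokenize("I visited Glasgow.")
tokens = ['I', 'Ġvisited', 'ĠGlasgow', '.']

# Look each token up in the vocabulary to get its index.
ids = [mini_vocab[token] for token in tokens]
print(ids)
```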

You can use the `tokenizer.decode` function to convert from a list of indices back to text. Try it out with this list: `[464, 7850, 46922, 4539, 832, 23995, 13]`

In [12]:
tokenizer.decode([464, 7850, 46922, 4539, 832, 23995, 13])

'The river Clyde runs through Glasgow.'

You should have decoded a message about Glasgow's main river.
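Decoding can be pictured as the reverse lookup followed by joining the pieces. This is a simplified sketch, not the library's implementation: `toy_decode` and `id_to_token` are our own names, and the small vocabulary only covers the four tokens whose indices appeared earlier.

```python
# Hypothetical miniature vocabulary; indices copied from earlier outputs.
mini_vocab = {'I': 40, 'Ġvisited': 8672, 'ĠGlasgow': 23995, '.': 13}

# Invert the mapping so we can go from index back to token.
id_to_token = {index: token for token, index in mini_vocab.items()}


def toy_decode(ids):
    """Map ids back to tokens, join them, and restore spaces from 'Ġ'."""
    return ''.join(id_to_token[i] for i in ids).replace('Ġ', ' ')


print(toy_decode([40, 8672, 23995, 13]))
```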

The tokenizer has a lot of parameters to give extra control. For instance, you sometimes need to truncate very long strings (as there is a limit on the length of input to Transformer models). Use the `tokenizer.encode` function to tokenize "Kelvingrove is a beautiful park in Glasgow." and also trim it to only 5 tokens using `truncation=True` and `max_length=5`.

In [14]:
tokenizer.encode("Kelvingrove is a beautiful park in Glasgow.", truncation=True, max_length=5)

[42, 417, 1075, 305, 303]

That should have given you a list of only five token indices.
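At the level of token indices, truncation amounts to keeping the first `max_length` entries. The sketch below assumes tokens are dropped from the right-hand end, which is consistent with the output above; `truncate_ids` is a hypothetical helper, and the id list is copied from the `input_ids` shown further down for the slightly shorter sentence 'Kelvingrove is a park in Glasgow.'.

```python
def truncate_ids(ids, max_length):
    """Keep at most max_length ids, dropping the rest from the right."""
    return ids[:max_length]


# Ids for 'Kelvingrove is a park in Glasgow.' (copied from a later output).
full_ids = [42, 417, 1075, 305, 303, 318, 257, 3952, 287, 23995, 13]

print(truncate_ids(full_ids, max_length=5))
```

Note that the five surviving tokens here all belong to the single word 'Kelvingrove', so truncation can cut a sentence off very early when it starts with a rare word.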

The most common way to use a tokenizer is shown below: calling it directly returns output in a format ready to pass into a Transformer model. It uses `return_tensors='pt'`, which tells it to return PyTorch tensors, a data structure used for deep learning.

The output has the `input_ids`, which are the token indices, as well as an `attention_mask`, which can be used to tell a Transformer to ignore certain tokens. This matters when padding is used to make shorter sequences the same length as longer ones. That's not the case here, so the attention values are all ones.

In [3]:
tokenizer('Kelvingrove is a park in Glasgow.', return_tensors='pt')

{'input_ids': tensor([[   42,   417,  1075,   305,   303,   318,   257,  3952,   287, 23995,
            13]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
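To see when the attention mask would contain zeros, here is a plain-Python sketch of padding two sequences to a common length. The helper `pad_batch` is hypothetical, the padding index `0` is a placeholder (a real tokenizer uses its own configured padding token), and padding on the right is an assumption made for simplicity.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad sequences to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


short_ids = [40, 8672, 23995, 13]  # "I visited Glasgow."
long_ids = [42, 417, 1075, 305, 303, 318, 257, 3952, 287, 23995, 13]

batch = pad_batch([short_ids, long_ids])
print(batch['attention_mask'][0])
print(batch['attention_mask'][1])
```

The shorter sequence gets zeros in its mask for the padded positions, while the longer one keeps a mask of all ones.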

It should be noted that each tokenizer is very specific to the text it was trained on. For instance, below is a tokenizer that was trained on Spanish text.

In [4]:
from transformers import AutoTokenizer

spanish_tokenizer = AutoTokenizer.from_pretrained('datificate/gpt2-small-spanish')

If we give it the earlier English sentence, it tokenizes it very differently and splits common English words into multiple parts.

In [5]:
spanish_tokenizer.tokenize('The river Clyde runs through Glasgow.')

['The', 'Ġri', 'ver', 'ĠClyde', 'Ġr', 'uns', 'Ġth', 'rough', 'ĠGlasgow', '.']

Furthermore, the underlying token indices will also be very different.

In [6]:
spanish_tokenizer.encode('The river Clyde runs through Glasgow.')

[1667, 1316, 778, 44417, 474, 4133, 15848, 26603, 23554, 14]

However, it will tokenize Spanish effectively:

In [7]:
spanish_tokenizer.tokenize('Que te vaya bien')

['Que', 'Ġte', 'Ġvaya', 'Ġbien']

By contrast, our `distilgpt2` tokenizer, which was trained on English text, splits up common Spanish words.

In [8]:
tokenizer.tokenize('Que te vaya bien')

['Que', 'Ġte', 'Ġv', 'aya', 'Ġb', 'ien']
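One rough way to put a number on this difference is the average number of tokens per whitespace-separated word. The sketch below uses the two token lists copied from the outputs above; `tokens_per_word` is our own helper, not part of the library.

```python
def tokens_per_word(tokens, text):
    """Average number of subword tokens per whitespace-separated word."""
    return len(tokens) / len(text.split())


spanish_text = 'Que te vaya bien'

# Token lists copied from the two outputs above.
spanish_tok_output = ['Que', 'Ġte', 'Ġvaya', 'Ġbien']
english_tok_output = ['Que', 'Ġte', 'Ġv', 'aya', 'Ġb', 'ien']

print(tokens_per_word(spanish_tok_output, spanish_text))
print(tokens_per_word(english_tok_output, spanish_text))
```

A value close to 1 suggests the tokenizer treats most words as single tokens; larger values mean more splitting and therefore longer input sequences for the same text.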

That's the end of this mini-lab.

## Optional Extra:
 - Find an English word that is tokenized into 3, 4, 5 or even 6 subword tokens with the `distilgpt2` tokenizer.

In [13]:
tokenizer.tokenize('microservice'), tokenizer.tokenize('cryptocurrency'), tokenizer.tokenize('onboarding')

(['micro', 'service'], ['crypt', 'oc', 'urrency'], ['on', 'boarding'])

In [15]:
tokenizer.tokenize('kubernetes'), tokenizer.tokenize('station'), tokenizer.tokenize('preprocessing')

(['k', 'uber', 'net', 'es'], ['station'], ['pre', 'processing'])
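To search more systematically, candidate words can be ranked by how many subword tokens they produce. The helper `rank_by_token_count` below is a hypothetical sketch that accepts any tokenizing function; so that it runs without the library, the demonstration uses a lookup table of the outputs recorded above in place of the real tokenizer.

```python
def rank_by_token_count(words, tokenize):
    """Return (word, token_count) pairs, most heavily split words first."""
    counts = [(word, len(tokenize(word))) for word in words]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Outputs recorded in the two cells above, standing in for the tokenizer.
recorded = {
    'microservice': ['micro', 'service'],
    'cryptocurrency': ['crypt', 'oc', 'urrency'],
    'onboarding': ['on', 'boarding'],
    'kubernetes': ['k', 'uber', 'net', 'es'],
    'station': ['station'],
    'preprocessing': ['pre', 'processing'],
}

ranking = rank_by_token_count(recorded, recorded.get)
for word, count in ranking:
    print(word, count)
```

With the tokenizer loaded earlier in the lab, passing `tokenizer.tokenize` as the second argument should rank any word list of your choosing in the same way.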