# Tokenization

Task: Convert text to numbers; interpret subword tokenization.

There are many ways to convert text to numbers. This assignment works with one popular approach, subword tokenization: assign a number to each part of a word.

## Setup

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

[Documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for the model and tokenizer.

The `transformers` library is pre-installed on many systems, but in case you need to install it, you can run the following cell.

In [None]:
# Run this cell only if the transformers library is not already installed
!pip install -q transformers

In [None]:
import torch
from torch import tensor

### Download and load the model

This cell downloads the model and tokenizer, and loads them into memory.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
# We'll use this smaller version of GPT-2
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
# Alternative to add_prefix_space is to use `is_split_into_words=True`
# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

In [None]:
token_to_id_dict = tokenizer.get_vocab()
print(f"The tokenizer has {len(token_to_id_dict)} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")

The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.


In [None]:
# warning: this assumes that there are no gaps in the token ids, which happens to be true for this tokenizer.
id_to_token = [token for token, id in sorted(token_to_id_dict.items(), key=lambda x: x[1])]
print(f"The first 10 tokens are: {id_to_token[:10]}")
print(f"The last 10 tokens are: {id_to_token[-10:]}")

The first 10 tokens are: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
The last 10 tokens are: ['Ġ(/', 'âĢ¦."', 'Compar', 'Ġamplification', 'ominated', 'Ġregress', 'ĠCollider', 'Ġinformants', 'Ġgazed', '<|endoftext|>']
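
The "no gaps" assumption in the cell above can be checked rather than trusted. The following is a minimal sketch, not part of the original notebook: `build_id_to_token` and `toy_vocab` are hypothetical names, and the three-entry vocabulary is made up for illustration.

```python
def build_id_to_token(token_to_id):
    """Return a list where position i holds the token whose id is i."""
    ids = sorted(token_to_id.values())
    # Ids must be exactly 0, 1, ..., n-1 for list indexing to be valid.
    if ids != list(range(len(ids))):
        raise ValueError("token ids are not contiguous starting from 0")
    id_to_token = [None] * len(ids)
    for token, token_id in token_to_id.items():
        id_to_token[token_id] = token
    return id_to_token


toy_vocab = {"!": 0, '"': 1, "#": 2}  # hypothetical three-entry vocabulary
print(build_id_to_token(toy_vocab))
```

Passing `tokenizer.get_vocab()` instead of `toy_vocab` should run the same check on the real vocabulary.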


## Task

Consider the following phrase:

In [None]:
phrase = "I visited Muskegon"
# Another one to try later. This was a famous early example of the GPT-2 model:
#phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

### Getting familiar with tokens

1: Use `tokenizer.tokenize` to convert the phrase into a list of tokens. (What do you think the `Ġ` means?)

In [None]:
tokens = tokenizer.tokenize(phrase)
tokens

['ĠI', 'Ġvisited', 'ĠMus', 'ke', 'gon']

2: Use `tokenizer.convert_tokens_to_string` to convert the tokens back into a string.


In [None]:
tokenizer.convert_tokens_to_string(tokens)

' I visited Muskegon'

3: Use `tokenizer.encode` to convert the original phrase into token ids. (*Note: this is equivalent to `tokenize` followed by `convert_tokens_to_ids`*.) Call the result `input_ids`.


In [None]:
input_ids = tokenizer.encode(phrase)
input_ids

[314, 8672, 2629, 365, 14520]

4: Turn `input_ids` back into a readable string. Try this two ways: (1) using `convert_ids_to_tokens` and (2) using `tokenizer.decode`.

In [None]:
# using convert_ids_to_tokens
newTokens = tokenizer.convert_ids_to_tokens(input_ids)
tokenizer.convert_tokens_to_string(newTokens)

' I visited Muskegon'

In [None]:
# using tokenizer.decode
tokenizer.decode(input_ids)

' I visited Muskegon'

### Applying what you learned

5: Use `model.generate(tensor([input_ids]))` to generate a completion of this phrase. (Note that we needed to add `[]`s to give a "batch" dimension to the input.) Call the result `output_ids`.


In [None]:
output_ids = model.generate(tensor([input_ids]), max_length=60, top_k=50, do_sample=True)
output_ids

tensor([[  314,  8672,  2629,   365, 14520,   422,  3426,  1160,   400,    11,
          5878,    11,   475,  1201,   788,   617,   423,  2077,   262,  3663,
           284,  1826,   290,  2740,   351,  1111,  4671,    13,  2102,    11,
           340,   318,   262,  2551,   286,   262,  4097,  2346,   355,   880,
            13,   198,   198,   198,  1639,   743,  3505,   326,   262,   717,
           640,   314,  1138,   319,   262,  1700,   351,  2629,   365, 14520]])

6: Convert your `output_ids` into a readable form. (Note: it has an extra "batch" dimension, so you'll need to use `output_ids[0]`.)

In [None]:
newTokens = tokenizer.convert_ids_to_tokens(output_ids[0])
tokenizer.convert_tokens_to_string(newTokens)

' I visited Muskegon from December 20th, 2001, but since then some have taken the opportunity to meet and speak with both parties. However, it is the decision of the band itself as well.\n\n\nYou may remember that the first time I met on the record with Muskegon'

Note: `generate` uses greedy decoding by default, but it's highly customizable. We'll play more with it in later exercises. For now, if you want more interesting results, try:

- Turn on `do_sample=True`. Run it a few times to see what it gives.
- Set `top_k=5`. Or 50.
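
To see what these two options do conceptually, here is a small pure-Python sketch of top-k sampling over made-up scores. It illustrates the idea only and is not how `generate` is implemented; `top_k_sample` and `toy_logits` are hypothetical names.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)

# With k=2, only the two best indices (0 and 2) can ever be drawn.
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(20)]
print(samples)

# With k=1 there is only one candidate, so sampling reduces to greedy decoding.
greedy = top_k_sample(toy_logits, k=1, rng=rng)
print(greedy)
```

A larger `k` admits more candidates at each step, which is why `top_k=50` tends to produce more varied text than `top_k=5`.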

7: What is the largest possible token id for this tokenizer? What token does it correspond to?

In [None]:
# the largest id in the vocabulary, and the token it maps to
max_id = max(token_to_id_dict.values())
max_id, tokenizer.convert_ids_to_tokens(max_id)

(50256, '<|endoftext|>')

The largest possible token id is 50256, one less than the vocabulary size of 50257 because ids start at 0. It corresponds to `<|endoftext|>`, the last entry in the vocabulary.

## Analysis

Q1: Write a brief explanation of what a tokenizer does. Note that we worked with two parts of a tokenizer in this exercise (one that deals only with strings, and another that deals with numbers); make sure your explanation addresses both parts.

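A tokenizer converts text into the sequence of integers that the model takes as input, and converts the model's output integers back into text. The string part (`tokenize`, `convert_tokens_to_string`) splits text into subword tokens drawn from a fixed vocabulary and joins tokens back into text. The numeric part (`convert_tokens_to_ids`, `convert_ids_to_tokens`) looks each token up in the vocabulary to get its id, and maps ids back to tokens. `encode` and `decode` run both parts in one call.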
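
The numeric half of a tokenizer can be pictured as a dictionary lookup. The sketch below uses only the five tokens and ids printed earlier in this notebook. The helper names are hypothetical stand-ins, not the library's functions, and the string-splitting half is deliberately left out.

```python
# Tokens and ids copied from the notebook outputs for "I visited Muskegon".
toy_vocab = {"ĠI": 314, "Ġvisited": 8672, "ĠMus": 2629, "ke": 365, "gon": 14520}
toy_id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def tokens_to_ids(tokens):
    """Numeric part, forward direction: look each token up in the vocabulary."""
    return [toy_vocab[token] for token in tokens]


def ids_to_tokens(ids):
    """Numeric part, reverse direction: map each id back to its token."""
    return [toy_id_to_token[token_id] for token_id in ids]


def tokens_to_string(tokens):
    """String part, reverse direction: join tokens and restore spaces."""
    return "".join(tokens).replace("Ġ", " ")


tokens = ["ĠI", "Ġvisited", "ĠMus", "ke", "gon"]
ids = tokens_to_ids(tokens)
text = tokens_to_string(ids_to_tokens(ids))
print(ids)
print(repr(text))
```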

Q2: What do you think the `Ġ` means? (Hint: it replaces a single well-known character.)

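It represents a space character. For example, `Ġvisited` is the token for " visited", which is why decoding the tokens gives back ' I visited Muskegon' with a leading space.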
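
Where the odd character comes from: GPT-2's tokenizer works on bytes, and bytes that do not print cleanly are shown as stand-in Unicode characters. The sketch below reconstructs that byte-to-character table as we understand it. Treat it as an illustration under that assumption rather than the library's own code; `bytes_to_unicode_sketch` is a name chosen here.

```python
def bytes_to_unicode_sketch():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that already print cleanly keep their own character:
    # '!'..'~' (0x21-0x7E), plus 0xA1-0xAC and 0xAE-0xFF.
    keep = (
        list(range(0x21, 0x7E + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Every other byte (space, newline, control bytes, ...) is assigned
    # the next unused code point starting at 256.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


table = bytes_to_unicode_sketch()
print(table[ord(" ")])   # Ġ
print(table[ord("\n")])  # Ċ
```

The space byte (32) is the 33rd byte without a printable form, so it lands on code point 256 + 32 = 288, which is `Ġ`.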



Q3: Suppose you add some personal flair to your writing by doubling some letters. Explain what the tokenizer we have loaded up in this notebook will do with your embellished writing.

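The tokenizer will not change the spelling; it only splits the text. A word with doubled letters is unlikely to be in the vocabulary as a single token, so the tokenizer will most likely break it into several shorter tokens, much as it split "Muskegon" into `ĠMus`, `ke`, and `gon`. The embellished text therefore takes more tokens than the plain spelling would.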
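
To make that concrete, here is a minimal sketch of how a byte-pair-encoding tokenizer merges characters into tokens. The merge list `TOY_MERGES` is invented for this example (it is not GPT-2's `merges.txt`), and `bpe_split` is a simplified illustration, not the library's implementation.

```python
# Hypothetical merge rules, in priority order (earlier = applied first).
TOY_MERGES = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]


def bpe_split(word, merges):
    """Split `word` into tokens by repeatedly applying the best-ranked merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # Pick the adjacent pair with the lowest rank (highest priority).
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe_split("hello", TOY_MERGES))   # ['hello']
print(bpe_split("helllo", TOY_MERGES))  # ['hell', 'l', 'o']
```

The plain spelling collapses into one token because every merge it needs is in the list, while the extra `l` blocks the final merges and leaves three pieces. To see the real tokenizer's behavior, try `tokenizer.tokenize` on a phrase with and without doubled letters and compare the token counts.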