# Tokenizer test

This notebook tests how a tokenizer trained on English text behaves on Portuguese text.

In [None]:
# %%cmd
# conda install --yes pytorch transformers

In [1]:
from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from src.utils import compute_perplexity, train_tokenizer


DATA_DIR = Path("../data")
MODEL = "microsoft/phi-1_5"

In [3]:
llm = AutoModelForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [4]:
en_tkns = tokenizer.tokenize("Hello, I'm a single sentence!")
en_tkns

['Hello', ',', 'ĠI', "'m", 'Ġa', 'Ġsingle', 'Ġsentence', '!']

Spaces are converted into a special character (`Ġ`) by the tokenizer prior to BPE splitting, mostly to avoid losing them, since the standard BPE algorithm uses spaces as word boundaries in its process. [link](https://discuss.huggingface.co/t/bpe-tokenizers-and-spaces-before-words/475/2?u=joaogante)
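
The `Ġ` symbol comes from the byte-to-character table used by GPT-2 style byte-level BPE tokenizers, which this tokenizer appears to follow (an assumption based on the `Ġ` prefix and the `vocab.json`/`merges.txt` files). The sketch below rebuilds that table using only the standard library. `to_visible` is a hypothetical helper name, not part of `transformers`.

```python
def bytes_to_unicode() -> dict:
    """Map every byte value (0-255) to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (control characters, space, ...) are shifted above 255.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TABLE = bytes_to_unicode()


def to_visible(text: str) -> str:
    """Show a string the way a byte-level BPE vocabulary would spell it."""
    return "".join(BYTE_TABLE[b] for b in text.encode("utf-8"))


print(to_visible(" I"))   # ĠI
print(to_visible("Olá"))  # OlÃ¡
```

The same table also accounts for accented characters showing up as two symbols: `á` is two bytes in UTF-8, and each byte is mapped to its own printable character.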

In [5]:
pt_tkns = tokenizer.tokenize("Olá, eu sou uma frase simples!")
pt_tkns

['Ol',
 'Ã¡',
 ',',
 'Ġe',
 'u',
 'Ġsou',
 'Ġu',
 'ma',
 'Ġfr',
 'ase',
 'Ġsim',
 'ples',
 '!']

Note that the tokenizer split the word `"eu"` into `"Ġe"` and `"u"`, which is striking since `"eu"` is a very common word in Portuguese. Also note that the word `"sentence"` is kept as a single token, while its Portuguese equivalent `"frase"` is split into two tokens, `"Ġfr"` and `"ase"`.

In [6]:
print(f"Number of tokens in English: {len(en_tkns)}")
print(f"Number of tokens in Portuguese: {len(pt_tkns)}")

Number of tokens in English: 8
Number of tokens in Portuguese: 13


As a last remark, note that the Portuguese sentence produces about 60% more tokens than the English one (13 versus 8). This hurts the efficiency of the system, since generating text in Portuguese requires considerably more compute than generating the equivalent text in English.
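
To make the comparison less dependent on sentence length, the counts can be normalised by the number of whitespace-separated words. The sketch below is only illustrative: `tokens_per_word` is a hypothetical helper, and the token counts are the ones printed above.

```python
def tokens_per_word(n_tokens: int, sentence: str) -> float:
    """Average number of tokens per whitespace-separated word."""
    return n_tokens / len(sentence.split())


en = tokens_per_word(8, "Hello, I'm a single sentence!")
pt = tokens_per_word(13, "Olá, eu sou uma frase simples!")

print(f"English:    {en:.2f} tokens/word")
print(f"Portuguese: {pt:.2f} tokens/word")
print(f"Token ratio PT/EN: {13 / 8:.3f}")
```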

Is there any way to limit this phenomenon?

## Compute Perplexity

In [7]:
ppl_en = compute_perplexity(llm, tokenizer, "Hello, I'm a single sentence!")
ppl_pt = compute_perplexity(llm, tokenizer, "Olá, eu sou uma frase simples!")
print(f"Perplexity of English: {ppl_en}")
print(f"Perplexity of Portuguese: {ppl_pt}")

Perplexity of English: 28.975732803344727
Perplexity of Portuguese: 160.80287170410156


The Portuguese sentence has a much higher perplexity than the English sentence (about 160.8 versus 29.0), meaning that the model finds the token sequence of the Portuguese sentence more surprising than that of the English sentence. This is expected: perplexity measures how well a language model predicts a text, and since the phi model was trained on English text, it is normal for the Portuguese text to have a much higher perplexity. The question is: can we maintain or even lower the perplexity for Portuguese text while reducing the number of tokens generated?
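
`compute_perplexity` is imported from `src.utils` and its implementation is not shown here. The sketch below assumes the usual definition, the exponential of the mean negative log-likelihood per token, and works on a plain list of log-probabilities. `perplexity_from_log_probs` is a hypothetical name.

```python
import math


def perplexity_from_log_probs(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives each token probability 0.5 vs. probability 0.01.
confident = [math.log(0.5)] * 4
uncertain = [math.log(0.01)] * 4

print(perplexity_from_log_probs(confident))  # close to 2
print(perplexity_from_log_probs(uncertain))  # close to 100
```

Under this definition the value is averaged per token, so perplexities obtained with tokenizers that split the same text into different numbers of tokens are not strictly like-for-like.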

As a first attempt, let's test the following approach. First, we select a Portuguese corpus (Lusa news, probably). Second, we compute the perplexity of the phi model on that corpus, which gives us a baseline to take as a reference. Third, we train a tokenizer on the Portuguese corpus. Then, we check which tokens in the new vocabulary were missing from the original one. The final step is to assess the best way to create the embeddings for these new tokens, once they are added to the original tokenizer, so that the perplexity of the model on the Portuguese corpus gets lower.

The strategy to create the new embeddings might be to aggregate existing embeddings or to train the model.
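
One possible aggregation strategy is to average the embeddings of the sub-tokens that the original tokenizer produces for the new token. The sketch below shows the idea on plain Python lists with made-up vectors. `mean_embedding` is a hypothetical helper, and averaging is only one of several possible choices.

```python
def mean_embedding(vectors):
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


# Toy embeddings for two sub-tokens (illustrative values only).
sub_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
]
print(mean_embedding(sub_token_vectors))  # [2.0, 2.0, 1.0]
```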

## Train the tokenizer on Portuguese text

### Read data

In [None]:
# Read as UTF-8 explicitly so accented characters survive on any platform.
corpus = (DATA_DIR / "sample.txt").read_text(encoding="utf-8")
print(corpus[:1000])

In [None]:
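lines = corpus.split("\n")
print(f"Number of lines: {len(lines)}")

# Strip whitespace and drop empty lines first, then deduplicate.
# dict.fromkeys keeps the original order, unlike set().
lines = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
print(f"Number of unique lines: {len(lines)}")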

In [None]:
for line in lines[:10]:
    print(line)

### Train the tokenizer

In [None]:
tokenizer_pt = train_tokenizer(tokenizer, lines)
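
`train_tokenizer` comes from `src.utils` and is not shown here. To give an intuition of why retraining helps, the sketch below implements a toy character-level BPE learner in pure Python. It is not the author's implementation, and all names (`learn_bpe_merges`, `apply_merges`, `merge_pair`) are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` by the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    vocab = Counter(tuple(word) for word in words)  # word -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken deterministically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Tokenize a word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


toy_corpus = ["eu", "eu", "eu", "frase", "frase", "sou"]
merges = learn_bpe_merges(toy_corpus, num_merges=5)

print(apply_merges("eu", merges))     # ['eu']
print(apply_merges("frase", merges))  # ['frase']
print(apply_merges("sou", merges))    # ['s', 'o', 'u']
```

Words that are frequent in the training corpus end up as single tokens, while rarer ones stay split into smaller pieces.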

Check the number of tokens with this tokenizer.

In [None]:
pt_tkns = tokenizer.tokenize("Olá, eu sou uma frase simples!")
print(f"Number of tokens with original tokenizer: {len(pt_tkns)}")

pt_tkns = tokenizer_pt.tokenize("Olá, eu sou uma frase simples!")
print(f"Number of tokens with new tokenizer: {len(pt_tkns)}")


This is good: the number of tokens with the new tokenizer is lower than with the original one.

Which tokens in the new tokenizer are not in the original one?

In [12]:
vocab_org = tokenizer.vocab.keys()
vocab_new = tokenizer_pt.vocab.keys()

In [None]:
# Sort the difference so the sample printed below is the same on every run.
new_tokens = sorted(set(vocab_new) - set(vocab_org))
print(f"Number of new tokens: {len(new_tokens)}")
print(f"(some) New tokens:\n{new_tokens[:10]}")

These are fairly frequent Portuguese words that were missing from the original vocabulary. Let's now try to add the new tokens to the original vocabulary.

In [None]:
print(f"A sample of the tokens to be added:\n{new_tokens[:15]}")

Let's first take one token as an example and see how it would be tokenized by the original tokenizer.

In [None]:
example = new_tokens[1]
# Override the picked token with a fixed, readable example (note the leading space).
example = " contrato"
example

In [None]:
tokens = tokenizer.tokenize(example)
print(f"Previous tokens: {tokens}")

new_token = "".join(tokens)
print(f"New token: {new_token}")


Let's now get the embeddings for these tokens.

In [None]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

In [None]:
model = llm.base_model
token_embs = model.embed_tokens(torch.tensor(token_ids))
token_embs

In [None]:
token_embs_agg = token_embs.mean(dim=0)
token_embs_agg

Let's add this new token to the tokenizer and the new embedding to the model.

In [None]:
print(f"Number of tokens before adding the token: {len(tokenizer)}")

In [None]:
tokenizer.add_tokens([new_token])
new_token_id = tokenizer.vocab[new_token]
print(f"New token id: {new_token_id}")


In [None]:
tokenizer.tokenize(example)

In [None]:
tokenizer.tokenize("O tipo nao tem contrato")

In [None]:
len(tokenizer)

The new token has been added with token id 50295. Now we need to add an embedding for that id to the model.

In [None]:
embed = model.embed_tokens
type(embed)

The mismatch between the size of the embedding matrix and the vocabulary size is explained in this [discussion](https://huggingface.co/bigscience/bloom/discussions/120).
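
Whether the embedding matrix has to grow before the new token id can be used depends on how many rows it already has. The sketch below makes that check explicit. `rows_to_add` is a hypothetical helper, and the row count of 51200 is an assumption used for illustration only (check it against the shape printed by the notebook); 50295 is the token id reported above.

```python
def rows_to_add(num_embedding_rows: int, token_id: int) -> int:
    """Rows that must be appended so that `token_id` is a valid row index."""
    return max(0, token_id + 1 - num_embedding_rows)


# Matrix larger than the vocabulary: the row for the new id already exists.
print(rows_to_add(51200, 50295))  # 0

# Matrix exactly as large as the old vocabulary: one extra row is needed.
print(rows_to_add(50295, 50295))  # 1
```

When the result is 0, writing the aggregated embedding into the existing row at the new token id is enough.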

In [None]:
weight = embed.weight.data
print(f"Shape of weight matrix: {weight.shape}")

In [None]:
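# Append the aggregated embedding as an extra row at the end of the matrix.
# Note: the row at index new_token_id is set explicitly in the next cell.
weight = torch.cat([weight, token_embs_agg.unsqueeze(0)], dim=0)
print(f"Shape of weight matrix: {weight.shape}")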

In [28]:
weight[new_token_id] = token_embs_agg

In [29]:
embed.weight.data = weight

In [30]:
assert torch.equal(llm.model.embed_tokens(torch.tensor(new_token_id)), token_embs_agg)

Let's now test whether this reduces the perplexity of the model.

In [31]:
llm_original = AutoModelForCausalLM.from_pretrained(MODEL)


In [32]:
tokenizer_original = AutoTokenizer.from_pretrained(MODEL)

In [None]:
compute_perplexity(llm_original, tokenizer_original, "Olá, eu sou uma frase simples com a palavra incapacidade!")

In [None]:
compute_perplexity(llm, tokenizer, "Olá, eu sou uma frase simples com a palavra incapacidade!")

In [None]:
test_sentence = "Olá, eu sou uma frase simples com a palavra incapacidade!"
tokenizer_original.tokenize(test_sentence)

In [None]:
tokenizer.tokenize(test_sentence)