# From Text to Tokens

Machine learning models, including transformers, cannot receive raw strings as input; instead, the text must first be `tokenized` and `encoded` as `numerical vectors`.

In this notebook, we will review several ways to achieve this.

References:
- "Natural Language Processing with Transformers" by Lewis Tunstall, Leandro von Werra and Thomas Wolf.

## Dependencies

In [5]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from datasets import load_dataset

## Character Tokenization

This tokenization method treats each `character` as a separate token, which is later mapped to an `integer`.

In [6]:
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
print(tokenized_text)

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ', 'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o', 'f', ' ', 'N', 'L', 'P', '.']


Then we map each distinct token to a `unique integer`, a process called `numericalization`:

In [7]:
token2idx = { ch: idx for idx, ch in enumerate(sorted(set(tokenized_text))) }
print(token2idx)

{' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9, 'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18, 'z': 19}


With this vocabulary, we convert the tokenized text to a list of integers:

In [8]:
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)

[5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7, 14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]
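
Numericalization is reversible as long as the vocabulary is kept. The following is a minimal sketch of the round trip, assuming a hypothetical inverse mapping called `idx2token` that is not part of the original notebook:

```python
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)

# Same vocabulary construction as above: one id per unique character
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[token] for token in tokenized_text]

# Hypothetical inverse mapping (id -> character), introduced for illustration
idx2token = {idx: ch for ch, idx in token2idx.items()}
decoded_text = "".join(idx2token[i] for i in input_ids)

print(decoded_text == text)  # True
```

Because every character has its own id, decoding recovers the original string exactly, including case and punctuation.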


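Finally, the `ids` are only labels, like the categories of a dummy variable: their order and magnitude are arbitrary, so adding or subtracting them gives meaningless results. To avoid this, we give each token its own dimension, so that two or more tokens can co-occur in one vector without being mistaken for a different token.

We do that by converting the list of ids to a tensor and creating one-hot encodings from it: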

In [9]:
# Use a distinct name for the tensor so it is not confused with the input_ids list
input_ids_tensor = torch.tensor(input_ids)
one_hot_encodings = F.one_hot(input_ids_tensor, num_classes=len(token2idx))
one_hot_encodings.shape

torch.Size([38, 20])

By examining the first vector, we can verify that a 1 appears at the position given by the token's index:

In [10]:
print(f"Token: {tokenized_text[0]}")
print(f"Tensor index: {input_ids[0]}")
print(f"One-hot: {one_hot_encodings[0]}")

Token: T
Tensor index: 5
One-hot: tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
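
To see why raw ids are a poor input on their own, the sketch below adds two ids and then adds the two corresponding one-hot vectors. It uses plain Python lists instead of tensors so that it has no dependencies; `one_hot` and `idx2token` are hypothetical helpers introduced here for illustration, and the vocabulary is copied from the output above.

```python
token2idx = {' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7,
             'e': 8, 'f': 9, 'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14,
             'r': 15, 's': 16, 't': 17, 'x': 18, 'z': 19}
idx2token = {idx: ch for ch, idx in token2idx.items()}

def one_hot(index, num_classes):
    """Hypothetical helper: a list of zeros with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(num_classes)]

# Adding raw ids produces the id of an unrelated token
id_sum = token2idx['T'] + token2idx['o']
print(id_sum, repr(idx2token[id_sum]))  # 19 'z'

# Adding one-hot vectors keeps both tokens visible in the result
vec_T = one_hot(token2idx['T'], len(token2idx))
vec_o = one_hot(token2idx['o'], len(token2idx))
vec_sum = [a + b for a, b in zip(vec_T, vec_o)]
active = [idx2token[i] for i, v in enumerate(vec_sum) if v == 1]
print(active)  # ['T', 'o']
```

With ids, `T + o` collides with `z`; with one-hot vectors, the sum simply marks that both `T` and `o` are present.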


This method requires a lot of computation and greatly reduces the amount of text that fits in the model's context, because even a short sentence becomes a long sequence of tokens:

In [11]:
len(input_ids)

38

Next, we will review better ways to tokenize.

## Word Tokenization

To reduce the number of tokens in a text, we can split the text into words and then convert each word to an integer.

In [12]:
tokenized_text = text.split()
tokenized_text

['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']
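
Note that splitting on whitespace leaves punctuation attached to the last word (`'NLP.'`). One possible refinement, shown here only as a sketch, is to split with a regular expression; the pattern below is an illustrative assumption, not a standard rule set.

```python
import re

text = "Tokenizing text is a core task of NLP."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
word_tokens = re.findall(r"\w+|[^\w\s]", text)
print(word_tokens)
# ['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP', '.']
```

Real word tokenizers need many more rules (contractions, hyphens, numbers), which is part of the reason why simple word splitting does not scale well.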

The number of tokens has been significantly reduced, from 38 to 8. The vocabulary, however, can now become very large: a big corpus can easily contain 1 million unique words.

If we want to compress the 1-million-dimensional input vectors to 1-thousand-dimensional vectors in the first layer of our neural network, the number of parameters in this layer will be:

`parameters = 1 million x 1 thousand = 1 billion weights`

This single layer is comparable in size to the largest GPT-2 model, which has `1.5 billion parameters` in total.
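
The arithmetic above can be checked with a few lines of code. This is a minimal sketch: `dense_layer_weights` is a hypothetical helper, biases are ignored, and the 100,000-word cap is an illustrative figure rather than a value from this notebook.

```python
def dense_layer_weights(input_dim, output_dim):
    """Number of weights in a dense layer without bias (hypothetical helper)."""
    return input_dim * output_dim

vocab_size = 1_000_000   # unique words, as assumed in the text
hidden_size = 1_000      # target dimension of the first layer

word_level = dense_layer_weights(vocab_size, hidden_size)
print(f"{word_level:,}")  # 1,000,000,000

# Capping the vocabulary shrinks the layer proportionally (illustrative figure)
capped = dense_layer_weights(100_000, hidden_size)
print(f"{capped:,}")  # 100,000,000
```

The layer size grows linearly with the vocabulary, so any method that keeps the vocabulary small directly reduces the parameter count.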

## Subword Tokenization

Subword tokenization combines the best of character and word tokenization. Splitting rare words into smaller units allows us to deal with complex words and misspellings, while keeping frequent words as single tokens keeps the vocabulary, and therefore the number of parameters, at a manageable size.

Subword tokenization is learned from the pretraining corpus using a mix of statistical rules and algorithms.
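
To give an idea of what "learned from the corpus" can mean, the sketch below runs two merge steps of byte-pair encoding (BPE), one common subword algorithm. It is not necessarily the algorithm used by the tokenizer loaded in this notebook, and the toy corpus, the word frequencies, and the helper names `count_pairs` and `merge_pair` are all made up for illustration.

```python
from collections import Counter

def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (made up): each word split into characters, with its frequency
word_freqs = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
              tuple("bun"): 4, tuple("hugs"): 5}

merges = []
for _ in range(2):
    pairs = count_pairs(word_freqs)
    best_pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best_pair)
    word_freqs = merge_pair(word_freqs, best_pair)

print(merges)  # [('u', 'g'), ('u', 'n')]
```

Each merge adds one new symbol to the vocabulary, so frequent character sequences end up as single tokens while rare words remain split into smaller pieces.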

In [13]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded_text = tokenizer(text)
encoded_text

{'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953, 2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## Attention Mask

As you can see, we get the token `input_ids` and also an `attention_mask`.

The `attention_mask` is used to handle inputs that are shorter than the length the model receives in a batch: shorter sequences are padded, and the mask tells the model which positions hold real tokens (1) and which hold padding that should be ignored (0). A simplified example, where `PAD` stands for the id of the padding token:

`(Text input 1) ids: [231, 634, 132, PAD, PAD], mask: [1, 1, 1, 0, 0]`

`(Text input 2) ids: [831, 645, 232, 358, 729], mask: [1, 1, 1, 1, 1]`

`(Text input 3) ids: [835, 724, PAD, PAD, PAD], mask: [1, 1, 0, 0, 0]`
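
The padding logic in the example above can be sketched in a few lines. `pad_batch` is a hypothetical helper, not the tokenizer's own implementation, and the padding id of 0 is an assumption made for this illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad id lists to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real, 0 = padding
    return input_ids, attention_mask

batch = [[231, 634, 132], [831, 645, 232, 358, 729], [835, 724]]
input_ids, attention_mask = pad_batch(batch)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

After padding, every sequence in the batch has the same length, so the batch can be stacked into a single rectangular tensor.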

We can convert the `input_ids` back to string tokens with:

In [14]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl', '##p', '.', '[SEP]']


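`[CLS]` and `[SEP]` are special tokens that mark the start and end of the sequence. The `##` prefix indicates that the preceding string is not whitespace, so the token should be merged with the previous one when converting back to a string. Note also that the text has been lowercased, because this is an `uncased` tokenizer.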
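
The way a word is split into a first piece plus `##` continuation pieces can be illustrated with a greedy longest-match-first search, the general idea behind WordPiece-style tokenizers. This is only a sketch: `wordpiece_tokenize` is a hypothetical function, the toy vocabulary is made up, and a real tokenizer also normalizes and pre-tokenizes the text.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (made up for this example)
toy_vocab = {"token", "##izing", "##ize", "nl", "##p", "text"}

print(wordpiece_tokenize("tokenizing", toy_vocab))  # ['token', '##izing']
print(wordpiece_tokenize("nlp", toy_vocab))         # ['nl', '##p']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

The first two results match the pieces shown in the output above for `tokenizing` and `nlp`.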

Tokens can be converted back to text with:

In [15]:
print(tokenizer.convert_tokens_to_string(tokens))

[CLS] tokenizing text is a core task of nlp. [SEP]


You can inspect some properties of the tokenizer with:

In [16]:
print(f"Tokenizer vocabulary size: {tokenizer.vocab_size}")
print(f"Max context size: {tokenizer.model_max_length}")

Tokenizer vocabulary size: 30522
Max context size: 512


*NOTE: when using pretrained models, it is very important to use the same tokenizer that was used to train the model.*

## Tokenize an Entire Dataset

In [17]:
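# Load the emotion dataset from the Hugging Face Hub
emotions = load_dataset("emotion")

def tokenize(batch):
    # padding=True pads to the longest example in the batch;
    # truncation=True cuts examples to the model's maximum context size
    return tokenizer(batch["text"], padding=True, truncation=True)

# batched=True passes groups of examples to tokenize() instead of one at a time
emotions_encoded = emotions.map(tokenize, batched=True)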

Map: 100%|██████████| 2000/2000 [00:00<00:00, 35905.06 examples/s]
