# Tokenization: From text to numerical tokens


## Libraries

In [22]:
!pip install datasets


In [23]:
from transformers import DistilBertTokenizer, AutoTokenizer
from datasets import load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# PyTorch
import torch
import torch.nn.functional as F

## Tokenization

Before a transformer can process natural language, the text must be tokenized. Tokenization splits text into small units called tokens (for example words or characters) that can then be converted into numbers for the model to process.

Two common types of tokenization are:
- Character Tokenization
- Word Tokenization

### Character Tokenization

Character tokenization is a very simple tokenization scheme. As the name suggests, the text is split into individual characters, and each character becomes a token. The tokens then undergo *numericalization*, the process of converting each token into an integer so that a machine can work with it. One way to achieve this is to build an encoding: a lookup table from each unique token to a unique integer. We'll walk through character tokenization below.

In [3]:
# tokenizing text
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
print(tokenized_text)

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ', 'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o', 'f', ' ', 'N', 'L', 'P', '.']


In [4]:
# encoding text for numericalization
token2idx = {
    # Place into a set to extract unique values, then sort
    # idx gives us a mapping for each character used in the text
    ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))
}

print(token2idx)

{' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9, 'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18, 'z': 19}


In [5]:
# make a list of indices mapped to tokens
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids)

[5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7, 14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]
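
The mapping can also be inverted to check that numericalization loses no information. The following is a minimal sketch; the names idx2token and decoded_text are introduced here for illustration and are not part of the original notebook.

```python
# Rebuild the character-level mapping from the cells above so this block runs on its own
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2idx[token] for token in tokenized_text]

# Invert the mapping: integer ID -> character (illustrative helper name)
idx2token = {idx: ch for ch, idx in token2idx.items()}

# Decode the IDs back into characters and join them into a string
decoded_text = "".join(idx2token[idx] for idx in input_ids)
print(decoded_text)
print(decoded_text == text)  # True
```

Because every character has exactly one ID, the round trip reproduces the input text exactly.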


#### Encoding
Next we'll convert the token IDs to a 2D tensor of one-hot encoded vectors, because text is categorical data. Raw integer IDs are a poor input representation because their ascending values impose a fictitious ordering on the tokens. This is problematic because a neural network can learn spurious relationships between the magnitude of an ID and the output. One-hot encoding solves this problem by giving every token its own dimension, which removes the ordering. This process is shown below first using *Pandas*, followed by *PyTorch*.

In [6]:
# example: encoding names of the Transformers from the amazon review
cat_df = pd.DataFrame(
    {"Name": ["Bumblebee", "Optimus Prime", "Megatron"],
     "id": [0, 1, 2]
     }
)
cat_df

| | Name | id |
|---|---|---|
| 0 | Bumblebee | 0 |
| 1 | Optimus Prime | 1 |
| 2 | Megatron | 2 |


In [7]:
# Making the one-hot encoded data
pd.get_dummies(cat_df["Name"])    # specify name as feature of interest

| | Bumblebee | Megatron | Optimus Prime |
|---|---|---|---|
| 0 | True | False | False |
| 1 | False | False | True |
| 2 | False | True | False |


PyTorch Implementation

In [8]:
# Convert the list of token IDs to a 1D tensor
input_ids = torch.tensor(input_ids)

# One-hot encode; the result is 2D with shape (number of tokens, number of classes)
# Set num_classes explicitly: otherwise it is inferred from the largest ID present
# in the input, which can be smaller than the vocabulary
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2idx))

# displaying the shape of the tensor
print(f"Token Id Count:\t{len(token2idx)}")
print(f"One-hot encoded tensor shape:\t{one_hot_encodings.shape}")

Token Id Count:	20
One-hot encoded tensor shape:	torch.Size([38, 20])


In [9]:
# verifying ids
i = 0
print(f"Token: {tokenized_text[i]}")
print(f"Tensor index: {input_ids[i]}")
print(f"One-hot: {one_hot_encodings[i]}")

# Confirmation
if 1 == one_hot_encodings[i][input_ids[i]]:
  print("\nTokens are correctly mapped!")
else:
  print("\nTokens are incorrectly mapped!")

Token: T
Tensor index: 5
One-hot: tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Tokens are correctly mapped!
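
To make the one-hot operation concrete, it can also be written in a few lines of plain Python. This is only a sketch of the idea, not how PyTorch implements it; the function name one_hot and the variable names are chosen here for illustration.

```python
def one_hot(ids, num_classes):
    """Return one row per ID, with a 1 at that ID's position and 0 elsewhere."""
    rows = []
    for idx in ids:
        row = [0] * num_classes
        row[idx] = 1
        rows.append(row)
    return rows

# First three IDs from the character example: 'T' -> 5, 'o' -> 14, 'k' -> 12
ids = [5, 14, 12]
encodings = one_hot(ids, num_classes=20)

print(encodings[0])
print(len(encodings), len(encodings[0]))  # 3 20
```

Each row contains exactly one 1, so no token is numerically larger or smaller than another.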


#### Drawbacks

Character tokenization ignores word structure, which helps it deal with misspellings and rare words, but it means the model has to *learn* words from raw characters. Doing so requires significant computation, memory, and data.
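
One visible part of that cost is sequence length. The short sketch below compares how many tokens the example sentence produces at the character level and at the word level; the variable names are illustrative.

```python
text = "Tokenizing text is a core task of NLP."

char_tokens = list(text)     # one token per character, spaces included
word_tokens = text.split()   # one token per whitespace-separated word

print(len(char_tokens))  # 38
print(len(word_tokens))  # 8
```

The same sentence is several times longer as a character sequence, and the model must process every one of those positions.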

### Word Tokenization

Word tokenization follows a similar process to character tokenization, but it preserves more of the text's structure by mapping integers to **words** rather than individual **characters**. The model no longer has to learn words from characters, which reduces the complexity and cost of training.

We'll start by tokenizing the same text used in character tokenization, using whitespace to separate words.

In [10]:
# split the text by whitespace
tokenized_text = text.split()
print(tokenized_text)

['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']


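As we can see here, punctuation isn't accounted for: 'NLP.' keeps its trailing period, so it would be treated as a different token from 'NLP'. Together with variants such as conjugations and misspellings, this can cause the size of the vocabulary to balloon into the millions. A large vocabulary is a problem because the neural network needs a correspondingly large number of parameters.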
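
One simple way to stop punctuation from sticking to words is to split with a regular expression instead of on whitespace. The sketch below is an illustration under that assumption; the pattern and the name word2idx are chosen here and are not taken from any particular library.

```python
import re

text = "Tokenizing text is a core task of NLP."

# Whitespace splitting keeps the period attached to the last word
print(text.split()[-1])  # NLP.

# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)

# Word-level vocabulary: one integer per unique token
word2idx = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
print(word2idx)
```

This handles punctuation, but every distinct word form still needs its own vocabulary entry, so the vocabulary growth problem remains.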

A compromise between character and word tokenization, one that preserves some input information and some input structure, would reduce the number of parameters needed while retaining important information. Luckily, such a method exists: **Subword Tokenization**.

## Subword Tokenization

The idea behind subword tokenization is to combine the best aspects of word and character tokenization. We want to split rare words into smaller units so that the model can deal with misspellings, while keeping frequent words as single tokens to keep the input length manageable.

A distinguishing feature of subword tokenization is that it is *learned* from the pretraining corpus using statistical methods.

### WordPiece

WordPiece is a commonly used subword tokenization algorithm. Training starts from a small vocabulary containing the special tokens used by the model and the initial alphabet. Because WordPiece marks subwords that continue a word with a prefix (like ## for BERT), each word is initially split into characters, with the prefix added to every character except the first. When tokenizing a word, WordPiece finds the longest subword at the start of the word that is in the vocabulary, splits there, and repeats the search on the remainder.
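
The longest-match step can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name wordpiece_tokenize and the toy vocabulary are invented for this example, and a real tokenizer also normalizes and pre-splits the text first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word against a fixed vocabulary."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # pieces that continue a word carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary, invented for this example; a real vocabulary is far larger
toy_vocab = {"token", "##izing", "##ize", "text", "nl", "##p", "t", "##e"}

print(wordpiece_tokenize("tokenizing", toy_vocab))  # ['token', '##izing']
print(wordpiece_tokenize("nlp", toy_vocab))         # ['nl', '##p']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

The greedy search prefers the longest piece available at each position, which is why "tokenizing" becomes two pieces here rather than a sequence of single characters.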

In [11]:
# load distilbert tokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# manual loading of distilbert tokenizer
# distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)


In [12]:
encoded_text = tokenizer(text)
print(encoded_text)

{'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953, 2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [13]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))

['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl', '##p', '.', '[SEP]']
[CLS] tokenizing text is a core task of nlp. [SEP]


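[CLS] and [SEP] are special tokens that mark the start and end of a sequence. The output also shows that the text has been lowercased, since this is an uncased checkpoint, and that the ## prefix marks a piece that continues the preceding token, as in 'token' followed by '##izing'.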

In [14]:
print(f"Model vocabulary size: {tokenizer.vocab_size}")    # number of words/subwords/characters in the vocabulary
print(f"The maximum token length for model: {tokenizer.model_max_length}")

Model vocabulary size: 30522
The maximum token length for model: 512


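When using a pretrained model, always use the tokenizer it was trained with. Switching tokenizers changes which integer ID each token maps to, so the model would receive IDs that point to different tokens than the ones it learned.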
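
A toy example makes the mismatch visible. Both vocabularies below are invented for illustration; they contain the same tokens but assign them different IDs.

```python
# Two toy vocabularies over the same tokens, invented for illustration
vocab_a = {"[UNK]": 0, "core": 1, "task": 2, "text": 3}
vocab_b = {"[UNK]": 0, "text": 1, "core": 2, "task": 3}

tokens = ["core", "task"]

# Encode with vocabulary A
ids = [vocab_a[token] for token in tokens]

# A model trained with vocabulary B interprets those IDs through B's mapping
id_to_token_b = {idx: token for token, idx in vocab_b.items()}
seen_by_model = [id_to_token_b[idx] for idx in ids]

print(ids)            # [1, 2]
print(seen_by_model)  # ['text', 'core']
```

The IDs are valid in both vocabularies, so nothing fails loudly; the model simply sees the wrong words.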

## Tokenizing The Whole Dataset

To tokenize the entire corpus (a corpus is a body of text, in this case the emotion dataset loaded below), we will use the map() method.

In [30]:
# function to tokenize a batch of text
def tokenize(batch):
  return tokenizer(batch['text'],
                   padding=True,      # pads with zeros to the longest example in the batch
                   truncation=True)   # truncates inputs to the model's maximum length


In [38]:
# see the tokenizer in action
emotions = load_dataset("emotion")
# emotions.set_format(type = "pandas")
print(tokenize(emotions["train"][:2]))

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
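
In the output above, the shorter example is filled with zeros up to the length of the longer one, and its attention mask is 0 at those positions. The sketch below reproduces that behaviour by hand on two short ID lists. The helper name pad_batch is hypothetical, and the padding ID of 0 is an assumption taken from the output shown here.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real IDs, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 1045, 102], [101, 1045, 2064, 2175, 102]])
print(batch["input_ids"])       # [[101, 1045, 102, 0, 0], [101, 1045, 2064, 2175, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```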


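The attention mask tells the model which positions are padding and can be ignored. Now that we have defined a processing function, we can use map() to apply it across the entire corpus. Setting batched = True passes the examples to the function in batches, and batch_size = None treats each split as a single batch, so every example in a split is padded to the same length.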

In [39]:
emotions_encoded = emotions.map(tokenize, batched = True, batch_size = None)

# verify the addition of 'input_ids' and 'attention_mask' columns for encoding
print(emotions_encoded["train"].column_names)


['text', 'label', 'input_ids', 'attention_mask']
