# Tokenization: Evolution, Types, and Analysis

# Introduction



Tokenization is a fundamental process in natural language processing (NLP) that breaks text into smaller units called tokens. These tokens can be words, subwords, characters, or bytes. Tokenization has evolved significantly over the years, adapting to the needs of various NLP tasks and models. This post explores the history of tokenization, the main types of tokenization, their pros and cons, and the compression ratio (num_bytes/num_tokens) that each type achieves on a sample sentence.

## History of Tokenization

Tokenization began with simple word-based approaches, where text was split into individual words. As NLP models grew in complexity, more sophisticated methods were developed to handle nuances in language, such as subword tokenization and character-level tokenization.

### Important metrics of Tokenization

#### Compression Ratio Analysis
The compression ratio, num_bytes/num_tokens, is an important metric in tokenization. It measures how many bytes of text each token represents on average. A higher ratio is generally beneficial: the same text is covered by fewer tokens, so sequences are shorter and cheaper to store and process. The trade-off is that a higher ratio usually requires a larger vocabulary.

In [17]:
def get_compression_ratio(string: str, indices: list[int]) -> float:
    """Given `string` that has been tokenized into `indices`, return the number of UTF-8 bytes per token."""
    num_bytes = len(bytes(string, encoding="utf-8"))  # @inspect num_bytes
    num_tokens = len(indices)                       # @inspect num_tokens
    return num_bytes / num_tokens

In [1]:
from abc import ABC
class Tokenizer(ABC):
    """Abstract interface for a tokenizer."""
    def encode(self, string: str) -> list[int]:
        raise NotImplementedError
    def decode(self, indices: list[int]) -> str:
        raise NotImplementedError

# Types of Tokenization

# Character Tokenization


Description: Character tokenization splits text into individual characters.

Example Sentence: "Tokenization is essential for NLP."

Tokenized Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n", " ", "i", "s", " ", "e", "s", "s", "e", "n", "t", "i", "a", "l", " ", "f", "o", "r", " ", "N", "L", "P", "."] (each token corresponds to a vocabulary entry that is assigned an index)

Number of Bytes: 34 (UTF-8 encoding) 
Number of Tokens: 34 

$ 
\text{Compression Ratio: } \frac{34 \text{ bytes}}{34 \text{ tokens}} = 1.0
$

### Pros:
    1. Simplest form of tokenization.
    2. Handles any text without out-of-vocabulary issues.

### Cons:
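    1. Produces very long token sequences, which is inefficient for long texts.
    2. The vocabulary is large (one entry per Unicode code point), and many of those characters are rare.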

In [12]:
from dataclasses import dataclass

@dataclass(frozen=True)
class CharacterTokenizer(Tokenizer):
    """Represent a string as a sequence of Unicode code points."""
    def encode(self, string: str) -> list[int]:
        return list(map(ord, string))
    def decode(self, indices: list[int]) -> str:
        return "".join(map(chr, indices))
    def raw_decode(self, indices: list[int]) -> list[str]:
        # Return the individual character tokens instead of the joined string.
        return list(map(chr, indices))

In [13]:
tokenizer = CharacterTokenizer()
string = "Tokenization is essential for NLP."  # @inspect string

In [14]:
indices = tokenizer.encode(string)  # @inspect indices
print(indices)

[84, 111, 107, 101, 110, 105, 122, 97, 116, 105, 111, 110, 32, 105, 115, 32, 101, 115, 115, 101, 110, 116, 105, 97, 108, 32, 102, 111, 114, 32, 78, 76, 80, 46]


In [16]:
tokens = tokenizer.raw_decode(indices)
print(tokens)

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 'e', 's', 's', 'e', 'n', 't', 'i', 'a', 'l', ' ', 'f', 'o', 'r', ' ', 'N', 'L', 'P', '.']


In [6]:
reconstructed_string = tokenizer.decode(indices)  # @inspect reconstructed_string
print(reconstructed_string)
assert string == reconstructed_string

Tokenization is essential for NLP.


In [18]:
compression_ratio = get_compression_ratio(string, indices)
print(compression_ratio)

1.0
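
The ratio of 1.0 above holds because the example sentence is pure ASCII, where every character is one UTF-8 byte. The short sketch below repeats the calculation on a string with non-ASCII characters; the sample string is an assumption chosen for illustration, and the code inlines the same logic as `CharacterTokenizer.encode` so it runs on its own.

```python
# Hypothetical sample: 'é' (U+00E9) and a globe emoji (U+1F30D), written as
# escapes so the exact code points are unambiguous.
text = "h\u00e9llo \U0001F30D"

code_points = [ord(ch) for ch in text]   # one index per Unicode character
num_bytes = len(text.encode("utf-8"))    # UTF-8 byte length of the same text

print(len(code_points))                  # number of character tokens
print(num_bytes)                         # 'é' takes 2 bytes, the emoji takes 4
print(num_bytes / len(code_points))      # bytes per token, now above 1.0
print(max(code_points))                  # largest index, far beyond the ASCII range
```

For this string there are 7 character tokens and 11 bytes, so the ratio rises above 1.0, while the largest index (127757) shows how large a character-level vocabulary has to be.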


# Byte-Based Tokenization

Description: Byte-based tokenization treats each byte of the UTF-8 encoded text as a token, so every index falls in the range 0 to 255 and the vocabulary has only 256 entries. It can represent any text, including non-standard characters, and even arbitrary binary data.

Example Sentence: "Tokenization is essential for NLP."

Tokenized Output: [84, 111, 107, 101, 110, 105, 122, 97, 116, 105, 111, 110, 32, 105, 115, 32, 101, 115, 115, 101, 110, 116, 105, 97, 108, 32, 102, 111, 114, 32, 78, 76, 80, 46]

Number of Bytes: 34 (UTF-8 encoding)

Number of Tokens: 34

$ 
\text{Compression Ratio: } \frac{34 \text{ bytes}}{34 \text{ tokens}} = 1.0
$

### Pros:
    1. Handles any text data, including non-standard characters.
    2. No out-of-vocabulary issues.

### Cons:
    1. Produces very large token sequences.
    2. Less interpretable tokens.

In [19]:
from dataclasses import dataclass

@dataclass(frozen=True)
class ByteTokenizer(Tokenizer):
    """Represent a string as a sequence of bytes."""
    def encode(self, string: str) -> list[int]:
        string_bytes = string.encode("utf-8")  # @inspect string_bytes
        indices = list(map(int, string_bytes))  # @inspect indices
        return indices
    def decode(self, indices: list[int]) -> str:
        string_bytes = bytes(indices)  # @inspect string_bytes
        string = string_bytes.decode("utf-8")  # @inspect string
        return string

In [20]:
tokenizer = ByteTokenizer()
string = "Tokenization is essential for NLP."  # @inspect string

In [21]:
indices = tokenizer.encode(string)  # @inspect indices
print(indices)

[84, 111, 107, 101, 110, 105, 122, 97, 116, 105, 111, 110, 32, 105, 115, 32, 101, 115, 115, 101, 110, 116, 105, 97, 108, 32, 102, 111, 114, 32, 78, 76, 80, 46]


In [None]:
reconstructed_string = tokenizer.decode(indices)  # @inspect reconstructed_string
print(reconstructed_string)
assert string == reconstructed_string

Tokenization is essential for NLP.
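
The byte-level compression ratio can be checked the same way as for the character tokenizer. The sketch below is self-contained: `byte_encode` is a hypothetical stand-in that mirrors `ByteTokenizer.encode`, and the second test string is an assumption added to show the non-ASCII case.

```python
def byte_encode(text: str) -> list[int]:
    """Stand-in for ByteTokenizer.encode: one index per UTF-8 byte."""
    return list(text.encode("utf-8"))

samples = [
    "Tokenization is essential for NLP.",  # ASCII only
    "h\u00e9llo \U0001F30D",               # contains 2-byte and 4-byte characters
]

for text in samples:
    indices = byte_encode(text)
    num_bytes = len(text.encode("utf-8"))
    # Every index fits in a vocabulary of 256 entries.
    assert all(0 <= i < 256 for i in indices)
    # Characters, tokens, and bytes per token.
    print(len(text), len(indices), num_bytes / len(indices))
```

By construction the ratio is always exactly 1.0. For non-ASCII text the token sequence is longer than the character count, which is the price of the tiny vocabulary.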



# Word Tokenization


Description: Word tokenization splits text into individual words based on spaces and punctuation.

Example Sentence: "Tokenization is essential for NLP."

Tokenized Output: ["Tokenization", "is", "essential", "for", "NLP"] (whitespace and the trailing period are discarded in this example; each token corresponds to a vocabulary entry that is assigned an index)

Number of Bytes: 34 (UTF-8 encoding) 
Number of Tokens: 5 

$
\text{Compression Ratio: } \frac{34 \text{ bytes}}{5 \text{ tokens}} = 6.8
$

### Pros: 
    1. Simple and intuitive.
    2. Works well for languages with clear word boundaries.

### Cons:
    1. Struggles with out-of-vocabulary words.
    2. Inefficient for morphologically rich languages.
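
A minimal sketch of a word tokenizer, assuming a simple regular expression for the split and a single reserved index for unknown words. The names `word_tokenize`, `vocab`, `UNK`, and `encode` are hypothetical, and this version keeps punctuation as its own token, so the example sentence yields 6 tokens rather than the 5 listed above.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Split into runs of word characters and single punctuation marks; whitespace is dropped."""
    return re.findall(r"\w+|[^\w\s]", text)

# Build a toy vocabulary from one "training" sentence (insertion order gives the indices).
train_tokens = word_tokenize("Tokenization is essential for NLP.")
vocab = {token: index for index, token in enumerate(dict.fromkeys(train_tokens))}
UNK = len(vocab)  # assumed convention: one extra index shared by all unseen words

def encode(text: str) -> list[int]:
    """Map each word to its index, falling back to UNK for out-of-vocabulary words."""
    return [vocab.get(token, UNK) for token in word_tokenize(text)]

print(train_tokens)                    # ['Tokenization', 'is', 'essential', 'for', 'NLP', '.']
print(encode("Tokenization is fun."))  # [0, 1, 6, 5]
```

The word "fun" was never seen, so it collapses to the `UNK` index and cannot be recovered when decoding. This is the out-of-vocabulary problem listed in the cons.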

# Subword Tokenization (e.g., Byte-Pair Encoding - BPE)


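Description: Subword tokenization breaks words into smaller units. Byte-Pair Encoding (BPE) builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols in a training corpus, so frequent words stay whole while rare words split into several pieces.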

Example Sentence: "Tokenization is essential for NLP."

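Tokenized Output: ["Token", "ization", "is", "essential", "for", "N", "L", "P"] (an illustrative split with whitespace and the period omitted; the actual pieces depend on the merges learned from the training corpus, and each token corresponds to a vocabulary entry that is assigned an index)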

Number of Bytes: 34 (UTF-8 encoding) 
Number of Tokens: 8

$ 
\text{Compression Ratio: } \frac{34 \text{ bytes}}{8 \text{ tokens}} = 4.25
$

### Pros:
    1. Handles out-of-vocabulary words better.
    2. Efficient for morphologically rich languages.

### Cons:
    1. More complex implementation.
    2. May produce less interpretable tokens.
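
The merge procedure can be sketched in a few lines. This is a minimal byte-level BPE trainer under simplifying assumptions (no pre-tokenization, no special tokens, a tiny toy corpus), not a production implementation; the function names `merge`, `train_bpe`, and `decode` are hypothetical.

```python
from collections import Counter

def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_index`."""
    out, i = [], 0
    while i < len(indices):
        if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
            out.append(new_index)
            i += 2
        else:
            out.append(indices[i])
            i += 1
    return out

def train_bpe(text: str, num_merges: int) -> tuple[list[int], dict[tuple[int, int], int]]:
    """Start from UTF-8 bytes and repeatedly merge the most frequent adjacent pair."""
    indices = list(text.encode("utf-8"))
    merges: dict[tuple[int, int], int] = {}
    for step in range(num_merges):
        counts = Counter(zip(indices, indices[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)  # ties go to the pair seen first
        new_index = 256 + step              # new tokens follow the 256 byte values
        merges[pair] = new_index
        indices = merge(indices, pair, new_index)
    return indices, merges

def decode(indices: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand merged tokens back to bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_index in merges.items():
        vocab[new_index] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in indices).decode("utf-8")

text = "the cat in the hat"
indices, merges = train_bpe(text, num_merges=3)
print(len(text.encode("utf-8")), len(indices))  # 18 bytes -> 12 tokens
assert decode(indices, merges) == text
```

On this toy string the three merges build up the token for "the " step by step, raising the ratio from 1.0 to 18/12 = 1.5 while keeping the encoding lossless.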

# SentencePiece Tokenization

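Description: SentencePiece is a language-agnostic subword tokenizer that is trained directly on raw text rather than on pre-split words. It treats the input as a sequence of Unicode characters, encodes whitespace as the visible marker "▁", and learns a subword model (BPE or a unigram language model) to segment the text.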

Example Sentence: "Tokenization is essential for NLP."

Tokenized Output: ["▁Token", "ization", "▁is", "▁essential", "▁for", "▁NLP"] (an illustrative split with the trailing period omitted; the actual pieces depend on the trained model, and each token corresponds to a vocabulary entry that is assigned an index)

Number of Bytes: 34 (UTF-8 encoding) 
Number of Tokens: 6 

$ 
\text{Compression Ratio: } \frac{34 \text{ bytes}}{6 \text{ tokens}} = 5.67
$

### Pros:
    1. Language-agnostic.
    2. Efficient for various languages and domains.

### Cons:
    1. Requires training a model.
    2. May produce less interpretable tokens.
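
The "▁" marker is what lets the pieces be joined back into the original text without guessing where the spaces were. The sketch below illustrates only that whitespace convention in plain Python; it does not call the SentencePiece library, the helper names are hypothetical, and a "." piece has been appended to the example pieces (an assumption) so the round trip covers the full sentence.

```python
WS = "\u2581"  # the "▁" marker that stands in for a space

def escape_whitespace(text: str) -> str:
    """Add a leading marker and replace each space with the marker."""
    return WS + text.replace(" ", WS)

def pieces_to_text(pieces: list[str]) -> str:
    """Concatenate pieces, turn markers back into spaces, and drop the leading one."""
    return "".join(pieces).replace(WS, " ").removeprefix(" ")

text = "Tokenization is essential for NLP."
pieces = [WS + "Token", "ization", WS + "is", WS + "essential", WS + "for", WS + "NLP", "."]

# The pieces are just a segmentation of the escaped string ...
assert "".join(pieces) == escape_whitespace(text)
# ... so detokenizing is a plain join plus marker replacement.
print(pieces_to_text(pieces))
```

Because whitespace is part of the pieces themselves, no language-specific rules are needed to detokenize, which is one reason the approach is described as language-agnostic.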

### Summary of Compression Ratios
    Word Tokenization: 6.8
    SentencePiece Tokenization: 5.67
    Subword Tokenization (BPE): 4.25
    Character Tokenization: 1.0
    Byte-Based Tokenization: 1.0
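
The figures above can be reproduced from the token counts quoted in each section. The snippet below is a small sketch that assumes those illustrative counts (they are taken from the examples in this post, not from trained tokenizers) and recomputes num_bytes/num_tokens for the example sentence.

```python
text = "Tokenization is essential for NLP."
num_bytes = len(text.encode("utf-8"))  # 34 bytes, since the sentence is pure ASCII

# Token counts as quoted in the sections above (illustrative, not measured).
token_counts = {
    "Character": 34,
    "Byte-based": 34,
    "Word": 5,
    "Subword (BPE)": 8,
    "SentencePiece": 6,
}

# Sort from the highest ratio (fewest tokens) to the lowest.
ratios = {name: num_bytes / count for name, count in token_counts.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<15} {token_counts[name]:>2} tokens  {ratio:.2f} bytes/token")
```

These ratios describe a single short sentence; on a real corpus the ordering may differ, and the numbers depend heavily on the vocabulary each tokenizer was trained with.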