### 1. Tokenization

![Tokenizer](https://stanford-cs336.github.io/spring2025-lectures/images/tokenized-example.png)

In [3]:
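from abc import ABC, abstractmethod

class Tokenizer(ABC):
  """Abstract interface for a tokenizer."""
  @abstractmethod
  def encode(self, string: str) -> list[int]:
    raise NotImplementedError
  @abstractmethod
  def decode(self, indices: list[int]) -> str:
    raise NotImplementedError

def get_compression_ratio(string: str, indices: list[int]) -> float:
  """Return the number of UTF-8 bytes per token for the given encoding."""
  num_bytes = len(string.encode("utf-8"))
  num_tokens = len(indices)
  return num_bytes / num_tokens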

#### Character Tokenization

In [4]:
class CharacterTokenizer(Tokenizer):
  def encode(self, string: str):
    return list(map(ord, string))
  def decode(self, indices: list[int]):
    return ''.join(map(chr, indices))

tokenizer = CharacterTokenizer()
string = "Hello, 🌍! 你好!"
indices = tokenizer.encode(string)
reconstructed_string = tokenizer.decode(indices)

print(indices)
print(reconstructed_string)
print(get_compression_ratio(string, indices))

[72, 101, 108, 108, 111, 44, 32, 127757, 33, 32, 20320, 22909, 33]
Hello, 🌍! 你好!
1.5384615384615385


- **Problem 1**: the vocabulary is very large (Unicode defines roughly 150K characters).
- **Problem 2**: many characters are quite rare (e.g., 🌍), which is an inefficient use of the vocabulary.
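To put rough numbers on these two problems, here is a minimal sketch. It only assumes the same example string as above; the variable names are illustrative and not part of the original notebook.

```python
import sys

# A character tokenizer uses the Unicode code point as the token ID,
# so the ID space has to cover every possible code point (0 .. 0x10FFFF).
num_code_points = sys.maxunicode + 1

string = "Hello, 🌍! 你好!"
indices = [ord(ch) for ch in string]

print(num_code_points)    # size of the ID space the model would need to cover
print(max(indices))       # largest ID actually used by this string
print(len(set(indices)))  # number of distinct IDs used by this string
```

The string touches only a handful of distinct IDs, yet one of them (the emoji) sits far above the ASCII range, which illustrates how sparsely such a vocabulary is used.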

#### Bytes Tokenization

In [5]:
class BytesTokenizer(Tokenizer):
  def encode(self, string: str):
    string_bytes = string.encode("utf-8")
    return list(map(int, string_bytes))
  def decode(self, indices: list[int]):
    string_bytes = bytes(indices)
    return string_bytes.decode("utf-8")

tokenizer = BytesTokenizer()
string = "Hello, 🌍! 你好!"
indices = tokenizer.encode(string)
reconstructed_string = tokenizer.decode(indices)

print(indices)
print(reconstructed_string)
print(get_compression_ratio(string, indices))

[72, 101, 108, 108, 111, 44, 32, 240, 159, 140, 141, 33, 32, 228, 189, 160, 229, 165, 189, 33]
Hello, 🌍! 你好!
1.0


While **byte-level tokenization** avoids the **out-of-vocabulary** problem faced by word-level tokenizers, tokenizing text into bytes produces extremely **long input sequences**. This slows down model training: a sentence with 10 words might be only 10 tokens long in a word-level language model, but 50 or more tokens long in a character- or byte-level model (depending on the length of the words).
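The sequence-length gap can be checked on a concrete sentence. This is a rough sketch: whitespace splitting stands in for a word-level tokenizer (an assumption made only for illustration), and the sentence is invented for this example.

```python
# A 10-word English sentence (ASCII only, so one byte per character).
sentence = "Tokenization choices strongly affect how long model input sequences become"

# Stand-in for a word-level tokenizer: split on whitespace.
num_words = len(sentence.split())

# Byte-level tokenization: one token per UTF-8 byte.
num_bytes = len(sentence.encode("utf-8"))

print(num_words)              # word-level sequence length
print(num_bytes)              # byte-level sequence length
print(num_bytes / num_words)  # how many times longer the byte-level sequence is
```

For ASCII text the byte-level length equals the character count, so the ratio is simply the average word length plus the separating spaces.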

#### Subword Tokenization

### regex

In [1]:
import regex as re

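def remove_special_tokens(text: str, specials_tokens: list[str] | None, keep_specials_tokens: bool = True) -> list[str]:
  """Split text on special tokens; optionally keep the tokens as separate chunks."""
  # No special tokens (None or an empty list): there is nothing to split on.
  if not specials_tokens:
    return [text]

  # Sort longest first so that overlapping tokens match the longest alternative.
  specials_tokens = sorted(specials_tokens, key=len, reverse=True)
  delimiter = "|".join(re.escape(token) for token in specials_tokens)
  if not keep_specials_tokens:
    # Drop the special tokens from the output.
    chunks = re.split(delimiter, text)
  else:
    # A capturing group makes re.split return the delimiters as well.
    chunks = re.split(f"({delimiter})", text)
  return chunks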

In [2]:
text = " <BOS>Hello <BOS><EOS> World<EOS> "
tokens = ["<BOS>", "<EOS>"]
result = remove_special_tokens(text, tokens)
print(result)

[' ', '<BOS>', 'Hello ', '<BOS>', '', '<EOS>', ' World', '<EOS>', ' ']
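Note the empty string in the output above: it is the (empty) text between the adjacent `<BOS>` and `<EOS>` tokens. The sketch below illustrates two further details of the splitting logic: why the special tokens are sorted longest-first, and what dropping them looks like. `split_on_special_tokens` is a hypothetical helper written for this illustration, and it uses the standard-library `re` module instead of the third-party `regex` package, on the assumption that both behave the same for this simple pattern.

```python
import re  # standard library; assumed equivalent to `regex` for this pattern

def split_on_special_tokens(text, special_tokens, keep=True):
    """Hypothetical helper mirroring the splitting logic of the cell above."""
    if not special_tokens:
        return [text]
    # Longest first, so "<EOS><EOS>" is tried before "<EOS>".
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "|".join(re.escape(token) for token in ordered)
    if keep:
        pattern = f"({pattern})"  # capturing group keeps the delimiters
    return re.split(pattern, text)

text = "a<EOS><EOS>b<EOS>c"
tokens = ["<EOS>", "<EOS><EOS>"]  # the second token contains the first

kept = split_on_special_tokens(text, tokens)
dropped = split_on_special_tokens(text, tokens, keep=False)

# Without sorting, the shorter alternative wins and the long token is broken up.
unsorted_pattern = "(" + "|".join(re.escape(token) for token in tokens) + ")"
unsorted_result = re.split(unsorted_pattern, text)

print(kept)             # ['a', '<EOS><EOS>', 'b', '<EOS>', 'c']
print(dropped)          # ['a', 'b', 'c']
print(unsorted_result)  # ['a', '<EOS>', '', '<EOS>', 'b', '<EOS>', 'c']
```

Regex alternation tries alternatives left to right, so placing longer tokens first is what keeps a token like `<EOS><EOS>` intact.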
