# Exercise 11

## Group ID: 
*   Person1
*   Person2
*   Person3

## Exercise day: Tuesday/Wednesday






## Description
In this exercise you will implement a Tokenizer class that tokenizes a given text.
A token is a short sequence of bytes. Tokenization splits the text into such short substrings and represents it as a list of token ids.
This gives language models a better text representation than the character-level model from the last exercise.
We will use the so-called "Byte Pair Encoding" (BPE) algorithm to tokenize the text. BPE is a simple algorithm that iteratively merges the most frequent pair of adjacent bytes/ids in the text into a new id. This notebook will guide you through the implementation of the Tokenizer class. To make the text easier to handle, we represent its bytes as integer ids.

Note: The tokenizer is not part of the model; it is a preprocessing step applied to the text before it is fed to the model. Therefore, it can be trained on a different dataset than the model.
Once trained, the tokenizer is capable of encoding and decoding any given text to/from a list of tokens.
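
To make the merge step concrete, here is a minimal sketch of a single BPE iteration on a toy string. The example text and the new id 256 are arbitrary choices for illustration, not values prescribed by the exercise.

```python
from collections import Counter

# Toy input (an assumption for illustration); bytes become integer ids in 0..255
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))

# Count every pair of adjacent ids and pick the most frequent one
pair_counts = Counter(zip(ids, ids[1:]))
best_pair, best_count = pair_counts.most_common(1)[0]

# Replace each non-overlapping occurrence of the pair, scanning left to right
new_id = 256  # any id outside 0..255 works; 256 is chosen only for this sketch
merged = []
i = 0
while i < len(ids):
    if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best_pair:
        merged.append(new_id)
        i += 2
    else:
        merged.append(ids[i])
        i += 1

print(best_pair, best_count)  # (97, 97) 4
print(merged)                 # [256, 97, 98, 100, 256, 97, 98, 97, 99]
```

Repeating this step on `merged` with a fresh id each time yields the sequence of merges that the tokenizer learns.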

## Tasks
1. Implement the `get_stats` method that will return a dictionary containing the frequency of each pair of adjacent ids in the text. (0.5 points)
1. Implement the `merge` method that will replace every occurrence of a given pair of ids with a new id. (0.5 points)
1. Implement the `fit` method that will train the tokenizer on the given text. (1 point)
1. Implement the `encode` method that will encode the given text to a list of tokens. (1 point)
1. Implement the `decode` method that will decode the given list of tokens to a text. (1 point)
1. Implement the `BPE` class. This class should contain the previously implemented methods. (1 point)

In [6]:
def get_stats(ids):
    """ Returns a dictionary with the number of times each pair of adjacent ids appears in the input list """
    # named counts rather than dict to avoid shadowing the built-in type
    counts = {}
    for i in range(len(ids) - 1):
        # create a tuple with the pair of ids
        pair = (ids[i], ids[i + 1])
        # if the pair is already in the dictionary, increment the count
        if pair in counts:
            counts[pair] += 1
        # if the pair is not in the dictionary, add it
        else:
            counts[pair] = 1

    return counts

In [None]:
ids = [1, 2, 3, 1, 2, 3, 1, 2, 3]
expected = {(1, 2): 3, (2, 3): 3, (3, 1): 2}
assert get_stats(ids) == expected

{(1, 2): 3, (2, 3): 3, (3, 1): 2}
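
The same pair counting can be sketched more compactly with `collections.Counter`. The name `get_stats_counter` is hypothetical and only serves to keep this variant apart from the solution above.

```python
from collections import Counter

def get_stats_counter(ids):
    """Hypothetical alternative to get_stats: count adjacent id pairs with Counter."""
    # zip(ids, ids[1:]) yields (ids[0], ids[1]), (ids[1], ids[2]), ...
    return dict(Counter(zip(ids, ids[1:])))

print(get_stats_counter([1, 2, 3, 1, 2, 3, 1, 2, 3]))
# {(1, 2): 3, (2, 3): 3, (3, 1): 2}
```

`Counter` keeps keys in first-seen order, so the order of the pairs matches the loop-based version.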


In [53]:
def merge(ids, pair, new_id):
    """Merge every pair of elements in ids that are equal to pair into a single element new_id"""
    _ids = ids.copy()
    # the length of the list will change as we merge elements, so we need to check the length of the list at each iteration
    i = 0
    while i < len(_ids) - 1:
        if _ids[i] == pair[0] and _ids[i+1] == pair[1]:
            _ids[i] = new_id
            _ids.pop(i+1)
        i += 1

    return _ids

In [54]:
ids = [0,1,2,3,0,1,2,3]
pair = (0,1)
new_id = 4
new_ids = merge(ids, pair, new_id)
expected = [4,2,3,4,2,3]
assert new_ids == expected, f"Merge failed new_ids actual: {new_ids} expected: {expected}"
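
Calling `list.pop` inside the loop shifts all later elements, so `merge` can become slow on long inputs such as the dataset used in the evaluation. A possible linear-time variant builds a new list instead. This is a sketch and the name `merge_linear` is hypothetical.

```python
def merge_linear(ids, pair, new_id):
    """Hypothetical variant of merge that builds a new list instead of popping."""
    out = []
    i = 0
    while i < len(ids):
        # a match consumes two input ids and emits one new id
        if i + 1 < len(ids) and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

print(merge_linear([0, 1, 2, 3, 0, 1, 2, 3], (0, 1), 4))  # [4, 2, 3, 4, 2, 3]
print(merge_linear([0, 0, 0], (0, 0), 1))                 # [1, 0]
```

The second call shows how overlapping matches are resolved: occurrences are replaced left to right and never overlap.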

In [62]:
def fit(ids, max_iter=1000):
    """Fit the model to the data by iteratively merging the most common pair of ids, max_iter times or until no pair appears more than once. If two pairs have the same frequency, the one that appears first is chosen.
    Returns the merged list of ids and a dictionary with the merges containing the new_id for each pair.
    To ensure each new_id is unique (and our results are comparable), it is set to the maximum id in the list plus one.
    """
    merges = {}
    _ids = ids.copy()
    for _ in range(max_iter):
        stats = get_stats(_ids)
        # if there are no pairs, we are done
        if not stats:
            break

        # get the most common pair (max returns the first one on ties)
        pair = max(stats, key=stats.get)
        # stop once no pair appears more than once, as stated in the docstring
        if stats[pair] < 2:
            break

        # assign a new id to the pair
        new_id = max(_ids) + 1

        # merge the pair
        _ids = merge(_ids, pair, new_id)

        # store the pair and its id
        merges[pair] = new_id

    return _ids, merges

In [63]:
ids = [0,1,2,3,0,1,2,3]
num_iterations = 1
new_ids, merges = fit(ids,num_iterations)
assert new_ids == [4,2,3,4,2,3], f"Wrong ids after fit {new_ids}"
assert merges == {(0,1):4}, f"Wrong merges after fit {merges}"

In [64]:
ids = [0,1,2,3,0,1,2,3]
num_iterations = 2
new_ids, merges = fit(ids,num_iterations)
assert new_ids == [5,3,5,3], f"Wrong ids after fit {new_ids}"
assert merges == {(0,1):4, (4,2):5}, f"Wrong merges after fit {merges}"
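
The tie-breaking rule from the `fit` docstring relies on two Python behaviours: dictionaries keep insertion order, and `max` returns the first maximal element it encounters. A small check, using the pair counts from the `get_stats` test as example data:

```python
# (1, 2) and (2, 3) both occur three times; (1, 2) was inserted first
stats = {(1, 2): 3, (2, 3): 3, (3, 1): 2}

best = max(stats, key=stats.get)
print(best)  # (1, 2)
```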

In [69]:
def encode(ids, merges):
    """Encode the input list of ids using the merges dictionary"""
    _ids = ids.copy()
    # apply the merges in the order they were learned, i.e. by ascending new id
    for pair, pair_id in sorted(merges.items(), key=lambda item: item[1]):
        _ids = merge(_ids, pair, pair_id)
    return _ids

In [70]:
ids = [0,1,2,3,0,1,2,3]
merges = {(0,1):4}
encoded_ids = encode(ids, merges)
expected = [4,2,3,4,2,3]
assert encoded_ids == expected, f"Wrong encoded ids {encoded_ids} expected: {expected}"
assert ids == [0,1,2,3,0,1,2,3], f"Input ids should not be modified"

In [73]:
def decode(ids, merges):
    """Decode the input list of ids using the merges dictionary"""
    _merges = merges.copy()
    _ids = ids.copy()
    for _ in range(len(merges)):
        # undo the merges in reverse order: the largest id was created last
        pair = max(_merges, key=_merges.get)
        pair_id = _merges.pop(pair)
        j = 0
        while j < len(_ids):
            if _ids[j] == pair_id:
                # expand the merged id back into its two components
                _ids[j] = pair[0]
                _ids.insert(j+1, pair[1])
            j += 1

    return _ids

In [74]:
ids = [6,6]
merges = {(0,1):4, (4,2):5, (5,3):6}
decoded_ids = decode(ids, merges)
expected = [0,1,2,3,0,1,2,3]
assert decoded_ids == expected, f"Wrong decoded ids {decoded_ids} expected: {expected}"

In [191]:
class BPE:
    def __init__(self, max_iter:int=1000):
        self.max_iter = max_iter
        self.vocab = {idx: bytes([idx]).decode("utf-8", errors="replace") for idx in range(256)}
        self.merges = None

    def fit(self, text):
        # convert the input text to a list of ids using utf-8 encoding,
        # so every initial id is a byte value in 0..255
        max_iter = self.max_iter

        merges = {}
        ids = list(text.encode("utf-8"))
        _ids = ids.copy()
        for _ in range(max_iter):
            stats = get_stats(_ids)
            if not stats:
                break
            pair = max(stats, key=stats.get)
            # assign a new id to the pair, ensuring it is unique;
            # 256 acts as the floor here, so the first merge gets id 257 and id 256 stays unused
            new_id = max(_ids + [256]) + 1
            # merge the pair
            _ids = merge(_ids, pair, new_id)
            # store the pair and its id
            merges[pair] = new_id

        self.merges = merges
        # map each new id to the pair of ids it was merged from
        self.vocab.update({v: k for k, v in merges.items()})
        return _ids

    def encode(self, text):
        if self.merges is None:
            raise ValueError("Model has not been fitted")
        ids = list(text.encode("utf-8"))
        return encode(ids, self.merges)

    def decode(self, ids):
        if self.merges is None:
            raise ValueError("Model has not been fitted")
        ids = decode(ids, self.merges)
        # convert the list of ids back to a string
        text = bytes(ids).decode("utf-8", errors="replace")
        return text

In [192]:
# testing the BPE class
text = "abracadabra abracadabra"
bpe = BPE(max_iter=4)
bpe.fit(text)
encoded_text = bpe.encode(text)
print(encoded_text)
print(bpe.merges)
decoded_text = bpe.decode(encoded_text)
assert decoded_text == text, f"Decoded text {decoded_text} does not match original text {text}"

[260, 97, 100, 259, 32, 260, 97, 100, 259]
{(97, 98): 257, (257, 114): 258, (258, 97): 259, (259, 99): 260}
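
An alternative way to decode is to precompute the byte string behind every token id and then join those byte strings. The sketch below uses the merges printed above. The helper name `build_byte_vocab` is hypothetical and not part of the exercise.

```python
def build_byte_vocab(merges):
    """Hypothetical helper: map every token id to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    # process merges by ascending id so both components are already known
    for (left, right), new_id in sorted(merges.items(), key=lambda item: item[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab

# merges and encoded ids copied from the output of the test cell above
merges = {(97, 98): 257, (257, 114): 258, (258, 97): 259, (259, 99): 260}
encoded = [260, 97, 100, 259, 32, 260, 97, 100, 259]

byte_vocab = build_byte_vocab(merges)
decoded = b"".join(byte_vocab[i] for i in encoded).decode("utf-8", errors="replace")
print(byte_vocab[260])  # b'abrac'
print(decoded)          # abracadabra abracadabra
```

With such a table, decoding is a single pass over the ids instead of one pass per merge.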


In [193]:
text = "A large language model (LLM) is a type of computational model designed for natural language processing tasks such as language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process. The largest and most capable LLMs are artificial neural networks built with a decoder-only transformer-based architecture, enabling efficient processing and generation of large-scale text data. Modern models can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained in." # cleaned version of the first paragraph of the Wikipedia page on LLMs
ids = list(map(int, text.encode('utf-8')))

In [194]:
bpe = BPE(max_iter=1000)
bpe.fit(text)
encoded_text = bpe.encode(text)
decoded_text = bpe.decode(encoded_text)
assert text == decoded_text, "Decoding failed"

## Evaluation: 

Let us check what a tokenizer gives us when training a simple language model.

We will use the tiny Shakespeare dataset, train one model with and one without tokenization, and compare the generated text.


In [195]:
!wget 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

--2025-01-08 23:32:09--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2025-01-08 23:32:10 (5.05 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [196]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
print(device)

cuda


In [197]:
# Fit the tokenizer on the dataset
with open('input.txt', encoding='utf-8') as f:
    text = f.read()
bpe = BPE(max_iter=5)
bpe.fit(text)
enc = bpe.encode(text)

In [198]:
print(enc[:1000])

[70, 105, 114, 115, 259, 67, 105, 116, 105, 122, 101, 110, 58, 10, 66, 101, 102, 111, 114, 257, 119, 257, 112, 114, 111, 99, 101, 101, 261, 97, 110, 121, 32, 102, 117, 114, 258, 101, 114, 44, 32, 104, 101, 97, 114, 32, 109, 257, 115, 112, 101, 97, 107, 46, 10, 10, 65, 108, 108, 58, 10, 83, 112, 101, 97, 107, 44, 32, 115, 112, 101, 97, 107, 46, 10, 10, 70, 105, 114, 115, 259, 67, 105, 116, 105, 122, 101, 110, 58, 10, 89, 111, 117, 32, 97, 114, 257, 97, 108, 108, 32, 114, 101, 115, 111, 108, 118, 101, 261, 114, 97, 258, 101, 114, 32, 116, 111, 32, 100, 105, 257, 258, 97, 110, 32, 116, 111, 32, 102, 97, 109, 105, 115, 104, 63, 10, 10, 65, 108, 108, 58, 10, 82, 101, 115, 111, 108, 118, 101, 100, 46, 32, 114, 101, 115, 111, 108, 118, 101, 100, 46, 10, 10, 70, 105, 114, 115, 259, 67, 105, 116, 105, 122, 101, 110, 58, 10, 70, 105, 114, 115, 116, 44, 32, 121, 111, 117, 32, 107, 110, 111, 119, 32, 67, 97, 105, 117, 260, 77, 97, 114, 99, 105, 117, 260, 105, 260, 99, 104, 105, 101, 102, 32, 101, 

In [199]:
print(bpe.decode(enc[:200]))

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Ma
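
One way to quantify what the tokenizer buys us is the ratio of UTF-8 bytes to tokens. The sketch below is an assumption about a useful metric, not part of the exercise. The function name `compression_ratio` is hypothetical, and the example values are taken from the small "abracadabra" test earlier in the notebook.

```python
def compression_ratio(text, token_ids):
    """Hypothetical metric: average number of UTF-8 bytes covered by one token."""
    return len(text.encode("utf-8")) / len(token_ids)

# text and encoded ids from the BPE test cell with max_iter=4
sample_text = "abracadabra abracadabra"
sample_ids = [260, 97, 100, 259, 32, 260, 97, 100, 259]

ratio = compression_ratio(sample_text, sample_ids)
print(f"{ratio:.2f}")  # 2.56
```

A higher ratio means that a fixed block size covers more of the original text.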


In [200]:
class TextDataset(Dataset):
    def __init__(self, enc_text, blocksize):
        self.data = enc_text
        self.blocksize = blocksize

    def __len__(self):
        return len(self.data)-self.blocksize-1

    def __getitem__(self, idx):
        # x is a sequence of blocksize tokens
        x = self.data[idx:idx+self.blocksize]
        # y is the token that follows x and is used as the target
        y = self.data[idx+self.blocksize]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)
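
The sliding-window scheme used by the dataset can be illustrated with plain lists. The data and block size below are made-up values for illustration.

```python
# toy token ids and block size (assumptions for illustration)
data = [10, 11, 12, 13, 14, 15]
block = 3

# each sample pairs a context window with the token that follows it
windows = [(data[i:i + block], data[i + block]) for i in range(len(data) - block)]

for context, target in windows:
    print(context, "->", target)
# [10, 11, 12] -> 13
# [11, 12, 13] -> 14
# [12, 13, 14] -> 15
```

There are `len(data) - block` valid windows. The class above returns one fewer from `__len__`, which skips the final window but is otherwise harmless.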

In [201]:
class MLPNextTokenPredictor(nn.Module):
    def __init__(self, vocab_size, block_size, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.vocab_size = vocab_size
        self.block_size = block_size
        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim

        self.emb = nn.Embedding(vocab_size, embed_dim)

        self.fc1 = nn.Linear(block_size * embed_dim, hidden_dim)
        self.bn1 = nn.BatchNorm1d(hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.bn2 = nn.BatchNorm1d(hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        emb = self.emb(x)
        emb_flat = emb.view(-1, self.block_size * self.embed_dim)

        h = torch.relu(self.bn1(self.fc1(emb_flat)))
        h = torch.relu(self.bn2(self.fc2(h)))
        logits = self.fc3(h)

        return logits

In [202]:
# generate new text by continuing the provided one
def generate(model, block_size, starting_text):
    model.eval()
    assert len(starting_text) >= block_size
    x = torch.tensor(starting_text, dtype=torch.long).to(device)

    with torch.no_grad():
        # sample 100 new tokens, one at a time
        for _ in range(100):
            # the model only sees the last block_size tokens; logits has shape (1, vocab_size)
            logits = model(x[-block_size:])
            # sample the next token from the predicted distribution
            next_token = torch.multinomial(F.softmax(logits, dim=-1), 1)
            starting_text = starting_text + [next_token.item()]
            x = torch.cat([x, next_token.squeeze(0)])
    return starting_text

In [203]:
def produce_example_text(model, block_size, tokenizer):
    starting_text = """We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance;'
"""
    tokenized_starting_text = tokenizer.encode(starting_text)
    continuation = generate(model, block_size, tokenized_starting_text)
    return tokenizer.decode(continuation[len(tokenized_starting_text):])

In [204]:
# Compare with untrained MLP:
vocab_size = max(bpe.vocab)+1
block_size = 32
mlp_untrained = MLPNextTokenPredictor(vocab_size, block_size)
mlp_untrained.to(device)
continuation = produce_example_text(mlp_untrained, block_size, bpe)
print(continuation)

�����k�o�B�EG�d �+m���Me �3>0�F����h��xWe��2e s ce�����+��E���L{��>c�s6th���]��HXm>�G


In [205]:
def train(model, dataloader, optimizer, nr_epochs):
    model.train()
    for epoch in range(nr_epochs):
        avg_loss = 0.0
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, model.vocab_size), y.view(-1))
            loss.backward()
            optimizer.step()
            avg_loss += loss.item() / len(dataloader)
        print(f"Epoch {epoch} loss: {avg_loss}")

In [206]:
# training on tokenized text
block_size = 32
vocab_size = max(bpe.vocab)+1
nr_epochs = 15

dataloader = DataLoader(TextDataset(enc, block_size), batch_size=32, shuffle=True, drop_last=True)
mlp_tokenizer = MLPNextTokenPredictor(vocab_size, block_size)
mlp_tokenizer.to(device)
optimizer = optim.AdamW(mlp_tokenizer.parameters(), lr=0.001, weight_decay=0.01)

train(mlp_tokenizer, dataloader, optimizer, nr_epochs)


Epoch 0 loss: 2.1924059386978287
Epoch 1 loss: 1.9405911034747791


KeyboardInterrupt: 

In [207]:
# generate new text using tokenization based MLP
continuation = produce_example_text(mlp_tokenizer, block_size, bpe)
print(continuation)

Whear her boy parth holour! speakd up the o
phonoure quigle-by the swrut. Camite;
But whose fromisints masbuse 


In [208]:
# training on original text
block_size = 64 # let us give more context because of no tokenization
# convert text to numbers
char_enc = [ord(c) for c in text]
vocab_size = max(char_enc) + 1
nr_epochs = 15

dataloader = DataLoader(TextDataset(char_enc, block_size), batch_size=32, shuffle=True, drop_last=True)
mlp_wo_tokenizer = MLPNextTokenPredictor(vocab_size, block_size)
mlp_wo_tokenizer.to(device)
optimizer = optim.AdamW(mlp_wo_tokenizer.parameters(), lr=0.001, weight_decay=0.01)

train(mlp_wo_tokenizer, dataloader, optimizer, nr_epochs)

Epoch 0 loss: 2.062389275977341
Epoch 1 loss: 1.815641518650402
Epoch 2 loss: 1.7578229227564515
Epoch 3 loss: 1.7307396914719837
Epoch 4 loss: 1.715839851543524
Epoch 5 loss: 1.7057468593120597
Epoch 6 loss: 1.6994272638220285
Epoch 7 loss: 1.694440029821013


KeyboardInterrupt: 

In [209]:
# generate new text using non-tokenization based MLP
class no_tok:
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return ''.join([chr(i) for i in ids])

t = produce_example_text(mlp_wo_tokenizer, block_size, no_tok())
print(t)

K vad sTriciusid,
ory full.
Forthy:
He wish exter's my cloot thou mither.

KING EDWARD IV:
Now, new;
