Exercise 2.1 Byte pair encoding of unknown words
Try the BPE tokenizer from the tiktoken library on the unknown words “Akwirw ier” and
print the individual token IDs. Then, call the decode function on each of the resulting
integers in this list to reproduce the mapping shown in figure 2.11. Lastly, call the
decode method on the token IDs to check whether it can reconstruct the original
input, “Akwirw ier.”

In [4]:
import tiktoken

In [None]:


encoding = tiktoken.get_encoding("gpt2")
text = "Akwirw ier"

token_ids = encoding.encode(text)
print("Token IDs:", token_ids)

print("Individual token decodings:")
for token_id in token_ids:
    # Decode one ID at a time; repr() keeps whitespace-only tokens visible
    decoded_token = encoding.decode([token_id])
    print(f"{token_id}: {decoded_token!r}")

# Decoding the full list should reproduce the original string
decoded_text = encoding.decode(token_ids)
print("Decoded text:", decoded_text)

Token IDs: [33901, 86, 343, 86, 220, 959]
Individual token decodings:
33901: 'Ak'
86: 'w'
343: 'ir'
86: 'w'
220: ' '
959: 'ier'
Decoded text: Akwirw ier


Exercise 2.2 Data loaders with different strides and context sizes
To develop more intuition for how the data loader works, try to run it with different
settings such as max_length=2 and stride=2, and max_length=8 and stride=2.
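
Before running the real data loader, it can help to see which windows each setting produces on a toy sequence. The following is only a minimal sketch: it assumes the loader builds its samples with a sliding window and shifts the targets by one position, and the names `sliding_windows` and `toy_ids` are illustrative, not part of any library.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (inputs, targets) pairs from a sliding window over token_ids.

    Hypothetical helper for illustration; targets are the inputs
    shifted one position to the right.
    """
    pairs = []
    # Stop early enough that every target window is full length
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i : i + max_length]
        targets = token_ids[i + 1 : i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(12))  # stand-in for real token IDs (assumption)

for max_length, stride in [(2, 2), (8, 2)]:
    print(f"max_length={max_length}, stride={stride}")
    for inputs, targets in sliding_windows(toy_ids, max_length, stride):
        print("  ", inputs, "->", targets)
```

When stride equals max_length, consecutive inputs do not overlap. When stride is smaller than max_length, consecutive inputs share max_length - stride tokens, so with max_length=8 and stride=2 neighboring inputs have six tokens in common.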

In [None]:
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

In [None]:


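class GPTDatasetV1(Dataset):
    """Sliding-window dataset of (input, target) token ID tensors."""

    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text once
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Slide a window of max_length tokens over the text in steps of stride;
        # the target is the input shifted one position to the right
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        # Number of (input, target) pairs
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


def create_dataloader(txt, batch_size, max_length, stride):
    """Build a DataLoader over GPTDatasetV1 using the GPT-2 BPE tokenizer."""
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    # shuffle is left at its default (False) so batches follow text order
    return DataLoader(dataset, batch_size=batch_size)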

# Short test string
sample_text = "This is a sample text for our data loader test."

# Non-overlapping inputs: the stride equals the context size
print("Test: max_length=2, stride=2")
loader1 = create_dataloader(sample_text, batch_size=2, max_length=2, stride=2)
for batch in loader1:
    inputs, targets = batch
    print("Inputs:", inputs)
    print("Targets:", targets)
    break  # show only the first batch

# Overlapping inputs: the stride is smaller than the context size
print("\nTest: max_length=8, stride=2")
loader2 = create_dataloader(sample_text, batch_size=2, max_length=8, stride=2)
for batch in loader2:
    inputs, targets = batch
    print("Inputs:", inputs)
    print("Targets:", targets)
    break  # show only the first batch