# Week 2: Turning Words into Tokens

In this notebook, we will explore the process of converting text into tokens, a fundamental step in NLP tasks.

## 0. Setup

We will begin by importing the necessary libraries.

In [None]:
# Import necessary libraries
import re
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from typing import List, Dict, Tuple, Union

In [None]:
# Import the helper that checks whether the TODO markers have been removed
from helpers.check_todo import check_implementation

In [None]:
# Create the src directory if it doesn't exist
import os
os.makedirs('src', exist_ok=True)

## 1. Running Simple Tokenization

This section demonstrates a basic approach to tokenization using Python's built-in `re` module. We will implement a function that splits a text into individual tokens, namely words and punctuation marks.
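
As a minimal sketch of the idea, assuming a token is either a run of word characters or a single punctuation mark, `re.findall` with an alternation pattern is one possible approach. The function name `split_words_and_punctuation` and the example sentence are illustrative only and are not part of the exercise.

```python
import re
from typing import List


def split_words_and_punctuation(text: str) -> List[str]:
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


example_tokens = split_words_and_punctuation("Don't panic, Arthur!")
print(example_tokens)
# ['Don', "'", 't', 'panic', ',', 'Arthur', '!']
```

Note that this pattern splits a contraction such as "Don't" into three tokens and discards all whitespace, which matters for code, where indentation carries meaning.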

In [None]:
sample_text = "Hello, how are you doing today?"

In [None]:
code_text = """
def calculate_llm_perplexity(model, text, max_length=1024):
    tokens = tokenizer.encode(text, max_length=max_length, truncation=True)
    input_ids = torch.tensor([tokens]).to(device)
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss
    return math.exp(loss.item())

# Example usage
perplexity = calculate_llm_perplexity(gpt2_model, "Hello, world!")
print(f"Perplexity: {perplexity:.2f}")
"""

In [None]:
def tokenize(text: str) -> List[str]:
    # TODO: Implement a basic tokenization function    
    # Hint: Use regex to split the text into words and punctuation
    pass  # Temporary placeholder to avoid syntax errors

Time to check whether you have completed the 'TODO' in the first function we implement together. Remove the 'TODO' comment once you're done implementing, and no error message will appear.

In [None]:
try:
    check_implementation(tokenize)
except NotImplementedError as e:
    print(e)

In [None]:
print("Tokenized text:", tokenize(sample_text))

In [None]:
print("Tokenized code:", tokenize(code_text))

## 2. Creating a Vocabulary

In this section, we will create a function that takes a list of texts as input and returns a dictionary in which each key is a unique word (or token) from the texts and each value is a unique integer index. The function should also reserve the special token `<UNK>` at index 0 to represent unknown words that may appear in future texts.
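
The mapping described above can be sketched on already-tokenized input. This is only an illustration under the assumption that tokens are sorted to make the indices reproducible; the name `toy_vocabulary` and the toy token lists are hypothetical.

```python
from typing import Dict, List


def toy_vocabulary(token_lists: List[List[str]]) -> Dict[str, int]:
    # Collect every distinct token across all texts.
    unique_tokens = set()
    for tokens in token_lists:
        unique_tokens.update(tokens)

    # Index 0 is reserved for unknown tokens; sorting keeps the mapping reproducible.
    vocabulary = {"<UNK>": 0}
    for index, token in enumerate(sorted(unique_tokens), start=1):
        vocabulary[token] = index
    return vocabulary


toy_vocab = toy_vocabulary([["the", "cat"], ["the", "dog"]])
print(toy_vocab)
# {'<UNK>': 0, 'cat': 1, 'dog': 2, 'the': 3}
```

Without the sort, the iteration order of a set of strings can differ between Python runs, so the same token could receive a different index each time.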

In [None]:
def build_vocabulary(texts: List[str]) -> Dict[str, int]:

    # TODO: Create a function to build a word-level vocabulary from a list of texts
    # Hint: Use a set to collect unique tokens, then convert to a dictionary
    # with enumerated indices
    
    # Do not forget to reserve a slot for unknown tokens as {'<UNK>': 0}

    pass

In [None]:
try:
    check_implementation(build_vocabulary)
except NotImplementedError as e:
    print(e)

In [None]:
# TODO: Use your examples for a sample dataset
# We won't be checking whether you have removed TODO here
# But using your own sentences is encouraged!

sample_dataset = [
    "42 is the Ultimate answer for Life, the Universe, and Everything.",
    "Hello, world of LLM Trailblazers! This is another example.",
    "What is the weather like today in Munich?"
]

In [None]:
vocab = build_vocabulary(sample_dataset)
print("Vocabulary:", vocab)

## 3. Implementing a Custom Dataloader

Our texts have different lengths, but a model expects every batch to be a rectangular tensor. To bridge that gap, we'll use two PyTorch building blocks:

1. A `Dataset` class: This prepares our text data for the model. It tokenizes each text and converts the tokens into IDs using the vocabulary.
2. A `DataLoader`: This feeds the prepared data to the model in batches. With a custom collate function, it pads the sequences in each batch to the same length and creates a mask so that the padding can be ignored.

Together, these two helpers turn raw text into batches that a model can consume directly.
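
To make the token-to-ID step concrete, here is a small sketch of how `dict.get` can map out-of-vocabulary tokens to the `<UNK>` index. The vocabulary and token list are made up for illustration.

```python
import torch

toy_vocab = {"<UNK>": 0, "hello": 1, "world": 2}
tokens = ["hello", "there", "world"]

# dict.get returns the fallback (here the <UNK> index) for tokens missing from the vocabulary.
token_ids = torch.tensor([toy_vocab.get(token, toy_vocab["<UNK>"]) for token in tokens])
print(token_ids)
# tensor([1, 0, 2])
```

The word "there" is not in the toy vocabulary, so it is encoded as 0.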

In [None]:
class TextDataset(Dataset):
    def __init__(self, texts: List[str], vocab: Dict[str, int]):
        """
        Initialize the dataset with texts and vocabulary.

        :param texts: A list of text samples.
        :param vocab: A dictionary representing the vocabulary, where keys are tokens and values are their corresponding IDs.
        """
        self.texts = texts
        self.vocab = vocab
    
    def __len__(self) -> int:
        
        # TODO: Return the number of samples in the input data
        
        pass
    
    def __getitem__(self, idx: int) -> torch.Tensor:
        
        # TODO: Convert a text sample to token IDs using the vocabulary
        # Hint: dictionary.get(keyname, value if a certain key doesn't exist) can be helpful
        
        pass

In [None]:
try:
    check_implementation(TextDataset)
except NotImplementedError as e:
    print(e)

In [None]:
# Create a dataset instance
dataset = TextDataset(sample_dataset, vocab)

In [None]:
batch_size = 2
simple_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

In [None]:
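# Try to display a batch of data. With the default collate function this is
# expected to raise a RuntimeError when the samples in a batch differ in length.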
for batch in simple_dataloader:
    print("Batch shape:", batch.shape)
    print("Sample batch:", batch)
    break

In [None]:
print("Attempting to iterate through the dataloader:")
try:
    for batch in simple_dataloader:
        print("Processed batch:", batch)
        break
except RuntimeError as e:
    print(f"Caught an error: {e}")
    print("\nThis error occurs because we're trying to batch sequences of different lengths.")

Now, let's implement a custom `collate_fn` to handle variable-length sequences.
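
For intuition, padding and masking can also be written by hand, without `pad_sequence`. The sketch below assumes each sample is a 1-D tensor of token IDs and that 0 is the padding value; the name `pad_manually` is hypothetical.

```python
import torch
from typing import List, Tuple


def pad_manually(sequences: List[torch.Tensor], padding_value: int = 0) -> Tuple[torch.Tensor, torch.Tensor]:
    max_length = max(len(sequence) for sequence in sequences)

    # Start from a tensor filled with the padding value and an all-zero mask.
    padded = torch.full((len(sequences), max_length), padding_value, dtype=torch.long)
    mask = torch.zeros((len(sequences), max_length), dtype=torch.long)

    # Copy each sequence into the left part of its row and mark those positions as real.
    for row, sequence in enumerate(sequences):
        padded[row, : len(sequence)] = sequence
        mask[row, : len(sequence)] = 1
    return padded, mask


padded, mask = pad_manually([torch.tensor([5, 6, 7]), torch.tensor([8])])
print(padded.tolist())  # [[5, 6, 7], [8, 0, 0]]
print(mask.tolist())    # [[1, 1, 1], [1, 0, 0]]
```

Building the mask from the sequence lengths, rather than from the padded values, avoids confusing padding with the `<UNK>` token when both use index 0.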

In [None]:
# TODO: analyse this function and consider how to implement it without pad_sequence()
def collate_fn(batch: List[torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
    # Each item in the batch is a 1-D tensor of token IDs from TextDataset
    lengths = torch.tensor([len(sequence) for sequence in batch])

    # Pad every sequence with 0 up to the length of the longest one in the batch
    padded_sequences = pad_sequence(batch, batch_first=True, padding_value=0)

    # Build the mask from the lengths: 1 marks a real token, 0 marks padding.
    # Index 0 is also <UNK>, so the mask cannot be derived from the padded values.
    positions = torch.arange(padded_sequences.size(1))
    mask = (positions.unsqueeze(0) < lengths.unsqueeze(1)).long()

    return padded_sequences, mask

In [None]:
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

In [None]:
print("Iterating through the dataloader with custom collate_fn:")
for batch, mask in dataloader:
    print("Processed batch shape:", batch.shape)
    print("Mask shape:", mask.shape)
    print("Sample batch:")
    print(batch)
    print("Sample mask:")
    print(mask)
    break

# TODO: Experiment with setting DataLoader with shuffle=False

The DataLoader now successfully handles variable-length sequences!

## 4. Putting It All Together

Time to combine tokenization, vocabulary creation and data preparation in batches. That's where our `TextProcessor` will help.

In [None]:
class TextProcessor:
    def __init__(self):
        self.vocab: Dict[str, int] = None
    
    def tokenize(self, text: str) -> List[str]:

        # TODO: Implement tokenization
        
        pass
    
    def build_vocab(self, texts: List[str]) -> None:
        
        # TODO: Build vocabulary from a list of texts
        
        pass
    
    def create_dataloader(self, texts: List[str], batch_size: int) -> DataLoader:
        
        # TODO: Create a DataLoader with TextDataset from a list of texts
        
        pass

In [None]:
try:
    check_implementation(TextProcessor)
except NotImplementedError as e:
    print(e)

In [None]:
# Test the TextProcessor
processor = TextProcessor()
processor.build_vocab(sample_dataset)
dataloader = processor.create_dataloader(sample_dataset, batch_size=2)

In [None]:
for batch in dataloader:
    print("Processed batch:", batch)
    break

#### Congratulations! You've implemented a basic text processing pipeline. This will be useful for handling input data in your LLM projects.

## Extra: Reviewing Tokenization Libraries

We'll use `tiktoken` for tokenization at a later stage, so let's see what it does and compare it to `NLTK`, another library that offers simple tokenization.

### Using NLTK

In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

In [None]:
nltk_tokens = word_tokenize(sample_text)
print("NLTK Tokens:", nltk_tokens)

In [None]:
nltk_code_tokens = word_tokenize(code_text)
print("NLTK Tokens for Code:", nltk_code_tokens)

### Using Tiktoken

In [None]:
import tiktoken

In [None]:
enc = tiktoken.get_encoding("cl100k_base")
tiktoken_tokens = enc.encode(sample_text)
print("Tiktoken Tokens:", tiktoken_tokens)
print("Decoded Tiktoken Tokens:", enc.decode(tiktoken_tokens))

In [None]:
print(f"NLTK token count: {len(nltk_tokens)}")
print(f"Tiktoken token count: {len(tiktoken_tokens)}")

In [None]:
tiktoken_code_tokens = enc.encode(code_text)
print("\nTiktoken Tokens (decoded for readability):")
print(enc.decode_tokens_bytes(tiktoken_code_tokens))
print(f"Tiktoken token count: {len(tiktoken_code_tokens)}")