# Understanding Tokenization

In this notebook, we will explore how text is converted into tokens, a fundamental step in working with LLMs.
The same approach applies to many kinds of data, e.g. code, conversations, Q&A, and mathematical problems.

## Importing Libraries

In [1]:
import re
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence


## 1. Running Simple Tokenization

This section demonstrates a basic approach to tokenization using Python's built-in `re` module. We will implement a simple function that splits text into individual tokens and try it on two inputs:

i. `sample_text`: a simple English sentence.

ii. `code_text`: a Python code example.

Together they show how both prose and code are converted into tokens.

Note: A model only understands numerical values, so each token must later be mapped to a unique numerical ID.

In [2]:
sample_text = "Hi! I am excited to take my first step at LLM."

In [39]:
code_text = """
def calculate_llm_perplexity(model, text, max_length=1024):
    tokens = tokenizer.encode(text, max_length=max_length, truncation=True)
    input_ids = torch.tensor([tokens]).to(device)
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
    loss = outputs.loss
    return math.exp(loss.item())

# Example usage
perplexity = calculate_llm_perplexity(gpt2_model, "Hello, world!")
print(f"Perplexity: {perplexity:.2f}")
"""

In [4]:
def tokenize(text):
    # Split on punctuation, double dashes, and whitespace.
    # The capturing group makes re.split keep the delimiters in the result.
    result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Remove empty strings and whitespace-only tokens
    result = [token for token in result if token.strip() != '']
    return result

In [None]:
print("Tokenized text:", tokenize(sample_text))

In [None]:
print("Tokenized code:", tokenize(code_text))
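Two details of the regex are easy to miss: the capturing group makes `re.split` return the delimiters as well, and `_` is itself a delimiter, so identifiers in code are broken apart. The following minimal sketch reuses the same pattern on two short strings chosen purely for illustration (they are not part of the notebook's data).

```python
import re

# Same pattern as in tokenize() above
PATTERN = r'([,.:;?_!"()\']|--|\s)'

def tokenize(text):
    parts = re.split(PATTERN, text)
    return [p for p in parts if p.strip() != '']

# Raw split output: delimiters are kept, along with empty and whitespace strings
raw = re.split(PATTERN, "Hi! I am")
print(raw)      # ['Hi', '!', '', ' ', 'I', ' ', 'am']

# After filtering, only meaningful tokens remain
tokens = tokenize("Hi! I am")
print(tokens)   # ['Hi', '!', 'I', 'am']

# '_' is a delimiter but '=' is not, so code identifiers split in uneven ways
code_tokens = tokenize("max_length=1024")
print(code_tokens)  # ['max', '_', 'length=1024']
```

This is why the filtering step matters: without it, every space would become a token of its own.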

## 2. Creating a Vocabulary

In this section we will create a function that takes a list of texts as input and returns a dictionary in which each key is a unique token from the texts and each value is a unique integer index. The function also reserves a special token, `<|unk|>`, appended at the end of the vocabulary, to represent unknown words that may appear in future texts.

Note: We add two special tokens:

i. `"<|endoftext|>"`: used to separate two unrelated text sources.

ii. `"<|unk|>"`: used for unknown tokens that were not in the training data and are therefore not in the vocabulary.
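As a rough sketch of how the two special tokens might be used at encoding time: the helper names `encode` and `encode_documents` and the tiny hand-built `toy_vocab` below are hypothetical and exist only for this illustration. The real vocabulary is built by `build_vocabulary` in the next cell.

```python
import re

def tokenize(text):
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p for p in parts if p.strip() != '']

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"!": 0, "Hello": 1, "world": 2, "<|endoftext|>": 3, "<|unk|>": 4}

def encode(text, vocab):
    # Tokens missing from the vocabulary fall back to the <|unk|> ID
    unk_id = vocab["<|unk|>"]
    return [vocab.get(tok, unk_id) for tok in tokenize(text)]

def encode_documents(docs, vocab):
    # Insert the <|endoftext|> ID between unrelated documents
    eot_id = vocab["<|endoftext|>"]
    ids = []
    for i, doc in enumerate(docs):
        if i > 0:
            ids.append(eot_id)
        ids.extend(encode(doc, vocab))
    return ids

print(encode("Hello there world!", toy_vocab))                  # [1, 4, 2, 0]
print(encode_documents(["Hello world", "world!"], toy_vocab))   # [1, 2, 3, 2, 0]
```

Here `there` is not in the vocabulary, so it is encoded as 4, the ID of `<|unk|>`.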

In [7]:
def build_vocabulary(texts):
    ## Create a function to build a word-level vocabulary from a list of texts
    # Use a set to collect unique tokens, then convert to a dictionary
    all_tokens = set()
    
    for text in texts:
        preprocessed = tokenize(text)
        all_tokens.update(preprocessed)
    
    all_tokens = sorted(list(all_tokens))
    all_tokens.extend(["<|endoftext|>", "<|unk|>"])

    vocab = {token:integer for integer,token in enumerate(all_tokens)}

    return vocab


In [None]:
sample_dataset = [
"Dr. Ava Chen yawned, rubbing her tired eyes as she stared at the lines of code scrolling across her monitor. For months, she had been immersed in the cutting-edge world of Large Language Models, pushing the boundaries of artificial intelligence.",
"Her latest project aimed to create an LLM that could understand and generate complex scientific theories. As she fine-tuned the model's parameters, Ava couldn't help but wonder about the ethical implications of her work.",
"Suddenly, an alert flashed on her screen. The model had produced something unprecedented—a novel theory in quantum mechanics. Ava's heart raced as she read through the output, her mind struggling to grasp the implications.",
"Was this a breakthrough or a clever combination of existing knowledge? As dawn broke outside her lab, Ava realized her journey into the depths of LLMs had only just begun, with countless questions still unanswered."
]



text = sample_dataset[0]
print("text: ", text)

# Whitespace-only split, for comparison with the regex-based tokenize() above
tokens = text.split()
print(tokens)

In [None]:
vocab = build_vocabulary(sample_dataset)
print("Vocabulary:", vocab)
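A vocabulary is only useful if IDs can be mapped back to text. The sketch below adds hypothetical `encode` and `decode` helpers (neither is defined in this notebook) around a copy of `build_vocabulary`, and runs a round trip on a one-sentence toy corpus. Rejoining tokens with spaces and then removing the space before punctuation is a simplifying assumption, and it does not restore the original whitespace in general.

```python
import re

def tokenize(text):
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p for p in parts if p.strip() != '']

def build_vocabulary(texts):
    all_tokens = set()
    for text in texts:
        all_tokens.update(tokenize(text))
    all_tokens = sorted(all_tokens)
    all_tokens.extend(["<|endoftext|>", "<|unk|>"])
    return {token: idx for idx, token in enumerate(all_tokens)}

def encode(text, vocab):
    unk_id = vocab["<|unk|>"]
    return [vocab.get(tok, unk_id) for tok in tokenize(text)]

def decode(ids, vocab):
    # Invert the token -> ID mapping, then join tokens with spaces
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    text = " ".join(id_to_token[i] for i in ids)
    # Drop the space that the join placed before punctuation marks
    return re.sub(r'\s+([,.:;?!])', r'\1', text)

toy_vocab = build_vocabulary(["Hi! I am excited."])
ids = encode("Hi! I am excited.", toy_vocab)
print(ids)                                              # [2, 0, 3, 4, 5, 1]
print(decode(ids, toy_vocab))                           # Hi! I am excited.
print(decode(encode("Hi! You", toy_vocab), toy_vocab))  # Hi! <|unk|>
```

The last line shows that decoding cannot recover a word that was mapped to `<|unk|>`.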

## 3. Implementing a Custom Dataloader

We have a lot of text data, but the samples have different lengths, so they cannot be batched as they are. To handle this, we'll use two helpers:

1. A `Dataset` class: This prepares our text data for the model. It tokenizes each text and converts the tokens into a tensor of IDs that the model can consume.
2. A `DataLoader`: This feeds the prepared data to the model in batches. We'll add padding so that all sequences in a batch have the same size. Sorting by length and masking out the padding are common refinements; the code in this section implements padding only.

With these two helpers, the data reaches the model in uniform batches, which keeps the training loop simple.
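One way to sort by length before batching, so that each batch needs less padding, can be sketched without PyTorch. The function names `batches_sorted_by_length` and `padding_needed` are hypothetical, and the sequences are made-up ID lists. This is an illustration of the idea, not the method used later in this notebook.

```python
def make_batches(sequences, batch_size):
    # Batch sequences in their given order
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]

def batches_sorted_by_length(sequences, batch_size):
    # Sort first so that sequences of similar length end up in the same batch
    ordered = sorted(sequences, key=len)
    return make_batches(ordered, batch_size)

def padding_needed(batches):
    # Count the padding slots required to bring every sequence up to the
    # length of the longest sequence in its batch
    return sum(max(len(s) for s in batch) - len(s) for batch in batches for s in batch)

# Made-up token ID sequences with lengths 5, 1, 4, 2
sequences = [[1, 2, 3, 4, 5], [6], [7, 8, 9, 10], [11, 12]]

unsorted_padding = padding_needed(make_batches(sequences, batch_size=2))
sorted_padding = padding_needed(batches_sorted_by_length(sequences, batch_size=2))
print(unsorted_padding)  # 6
print(sorted_padding)    # 2
```

The trade-off is that sorting removes randomness from the batch order, which is why it is usually combined with some form of shuffling in practice.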

In [21]:
class TextDataset(Dataset):
    def __init__(self, texts, vocab):

        """
        Initialize the dataset with texts and vocabulary.

        :param texts: A list of text samples.
        :param vocab: A dictionary representing the vocabulary, where keys are tokens and values are their corresponding IDs.
        """
        self.texts = texts
        self.vocab = vocab
    
    def __len__(self):
    
        return len(self.texts)

    
    def __getitem__(self, idx):
        # Get the text sample at index idx
        text = self.texts[idx]

        # Tokenize with the same function used to build the vocabulary, so that
        # empty strings and whitespace are dropped instead of becoming <|unk|>
        tokens = tokenize(text)

        # Map tokens to IDs; tokens missing from the vocabulary fall back to <|unk|>
        unk_id = self.vocab.get('<|unk|>')
        token_ids = [self.vocab.get(token, unk_id) for token in tokens]

        return torch.tensor(token_ids)

In [None]:
example = [
    "Dr. Ava Chen yawned, wondering if 42 truly was the answer to Life, the Universe, and Everything.",
    "Aspiring AI researchers gather excitedly at the conference, ready to push the boundaries of language models.",
    "A traveler checks their phone anxiously, hoping Munich's notoriously unpredictable weather won't spoil their vacation plans."
]
# Create a dataset instance
dataset = TextDataset(example, vocab)
print(dataset[0])



In [28]:
batch_size = 2
simple_dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

# An error will appear because the samples have different lengths. Let's resolve it!

In [None]:
# Display a batch of data
for batch in simple_dataloader:
    print("Batch shape:", batch.shape)
    print("Sample batch:", batch)
    break

In [None]:
print("Attempting to iterate through the dataloader:")
try:
    for batch in simple_dataloader:
        print("Processed batch:", batch)
        break
except RuntimeError as e:
    print(f"Caught an error: {e}")
    print("\nThis error occurs because we're trying to batch sequences of different lengths.")

Now, let's implement a custom collate_fn to handle variable-length sequences. 

In [7]:
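def collate_fn(batch):
    # batch is a list of 1-D tensors of token IDs with varying lengths
    sequences = batch

    # Pad every sequence with 0 up to the length of the longest one in the batch.
    # Note: 0 is also the ID of a real token in this vocabulary, so padding cannot be
    # told apart from that token; a dedicated padding token would avoid this.
    padded_sequences = pad_sequence(sequences, batch_first=True, padding_value=0)

    return padded_sequences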

In [32]:
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

In [None]:
print("Iterating through the dataloader with custom collate_fn:")
for batch in dataloader:
    print("Processed batch shape:", batch.shape)
    print("Sample batch:")
    print(batch)
    break

The `DataLoader` now successfully handles variable-length sequences!
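Padding alone does not tell the model which positions are real tokens and which are filler. A common companion is a mask with 1 for real tokens and 0 for padding. The sketch below is written with plain Python lists to show the logic only; `pad_and_mask` is a hypothetical helper, not part of this notebook or of PyTorch, and `pad_id=0` mirrors the `padding_value=0` used in `collate_fn`.

```python
def pad_and_mask(sequences, pad_id=0):
    # Pad each sequence to the length of the longest one and build a matching mask
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

padded, mask = pad_and_mask([[7, 2, 5], [9, 65]])
print(padded)  # [[7, 2, 5], [9, 65, 0]]
print(mask)    # [[1, 1, 1], [1, 1, 0]]
```

Because the mask records the original lengths, it stays correct even when the padding ID coincides with a real token ID.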

## 4. Putting It All Together

Time to combine tokenization, vocabulary creation and data preparation in batches. That's where our `TextProcessor` will help.

In [31]:
class TextProcessor:
    def __init__(self):
        self.vocab = None
    
    def tokenize(self, text):


        #  Implement tokenization
        result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        result = [token for token in result if token.strip() != '']
        
        return result
    
    def build_vocab(self, texts):
        
        #  Build vocabulary from a list of texts
        all_tokens = set()
        
        for text in texts:
            preprocessed = self.tokenize(text)
            all_tokens.update(preprocessed)
        
        all_tokens = sorted(list(all_tokens))
        all_tokens.extend(["<|endoftext|>", "<|unk|>"])

        self.vocab = {token:integer for integer,token in enumerate(all_tokens)}

        return self.vocab
    
    def create_dataloader(self, texts, batch_size):
        # Convert each text into a tensor of token IDs (build_vocab must be called first)
        token_ids = []
        for text in texts:
            preprocessed = self.tokenize(text)
            # Map tokens to IDs; tokens missing from the vocabulary fall back to <|unk|>
            ids = [self.vocab.get(token, self.vocab.get('<|unk|>')) for token in preprocessed]
            token_ids.append(torch.tensor(ids))

        # A list of tensors works as a map-style dataset; collate_fn pads each batch
        dataloader = DataLoader(token_ids, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

        return dataloader

In [32]:
sample_dataset = [
"Dr. Ava Chen yawned, rubbing her tired eyes as she stared at the lines of code scrolling across her monitor. For months, she had been immersed in the cutting-edge world of Large Language Models, pushing the boundaries of artificial intelligence.",
"Her latest project aimed to create an LLM that could understand and generate complex scientific theories. As she fine-tuned the model's parameters, Ava couldn't help but wonder about the ethical implications of her work.",
"Suddenly, an alert flashed on her screen. The model had produced something unprecedented—a novel theory in quantum mechanics. Ava's heart raced as she read through the output, her mind struggling to grasp the implications.",
"Was this a breakthrough or a clever combination of existing knowledge? As dawn broke outside her lab, Ava realized her journey into the depths of LLMs had only just begun, with countless questions still unanswered."
]

# Test the TextProcessor
processor = TextProcessor()
vocab = processor.build_vocab(sample_dataset)

dataloader = processor.create_dataloader(sample_dataset, batch_size=2)



[tensor([  7,   2,   5,   6, 114,   1,  88,  55, 105,  47,  26,  93,  95,  27,
        100,  66,  73,  35,  92,  20,  55,  70,   2,   8,  71,   1,  93,  52,
         28,  56,  58, 100,  42, 113,  73,  13,  12,  14,   1,  82, 100,  30,
         73,  25,  59,   2]), tensor([  9,  65,  81,  21, 106,  41,  23,  10,  99,  38, 108,  24,  50,  37,
         90, 101,   2,   4,  93,  48, 100,  69,   0,  89,  79,   1,   5,  39,
          0,  98,  54,  33, 111,  19, 100,  45,  57,  73,  55, 112,   2]), tensor([ 15,   1,  23,  22,  49,  74,  55,  91,   2,  16,  69,  52,  80,  94,
        109,  72, 102,  58,  83,  67,   2,   5,   0,  89,  53,  85,  26,  93,
         86, 104, 100,  77,   1,  55,  68,  97, 106,  51, 100,  57,   2]), tensor([ 17, 103,  18,  31,  76,  18,  34,  36,  73,  46,  63,   3,   4,  43,
         32,  78,  55,  64,   1,   5,  87,  55,  61,  60, 100,  44,  73,  11,
         52,  75,  62,  29,   1, 110,  40,  84,  96, 107,   2])]
4


In [33]:
for batch in dataloader:
    print("Processed batch:", batch)
    break


Processed batch: tensor([[  7,   2,   5,   6, 114,   1,  88,  55, 105,  47,  26,  93,  95,  27,
         100,  66,  73,  35,  92,  20,  55,  70,   2,   8,  71,   1,  93,  52,
          28,  56,  58, 100,  42, 113,  73,  13,  12,  14,   1,  82, 100,  30,
          73,  25,  59,   2],
        [  9,  65,  81,  21, 106,  41,  23,  10,  99,  38, 108,  24,  50,  37,
          90, 101,   2,   4,  93,  48, 100,  69,   0,  89,  79,   1,   5,  39,
           0,  98,  54,  33, 111,  19, 100,  45,  57,  73,  55, 112,   2,   0,
           0,   0,   0,   0]])


#### Congratulations! You've implemented a basic text processing pipeline. This will be useful for handling input data in your LLM projects.

## Reviewing Tokenization Libraries

We'll use `tiktoken` at a later stage for tokenization, so let's see what it does and compare it to another simple tokenization library, `NLTK`.

### Using NLTK

In [None]:

!pip install nltk

In [34]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/abdulrehman/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [38]:
sample_text = "Hi! I am excited to take my first step at LLM."
nltk_tokens = word_tokenize(sample_text)
print("NLTK Tokens:", nltk_tokens)

NLTK Tokens: ['Hi', '!', 'I', 'am', 'excited', 'to', 'take', 'my', 'first', 'step', 'at', 'LLM', '.']


In [42]:
nltk_code_tokens = word_tokenize(code_text)
print("NLTK Tokens for Code:", nltk_code_tokens)

NLTK Tokens for Code: ['def', 'calculate_llm_perplexity', '(', 'model', ',', 'text', ',', 'max_length=1024', ')', ':', 'tokens', '=', 'tokenizer.encode', '(', 'text', ',', 'max_length=max_length', ',', 'truncation=True', ')', 'input_ids', '=', 'torch.tensor', '(', '[', 'tokens', ']', ')', '.to', '(', 'device', ')', 'with', 'torch.no_grad', '(', ')', ':', 'outputs', '=', 'model', '(', 'input_ids', ',', 'labels=input_ids', ')', 'loss', '=', 'outputs.loss', 'return', 'math.exp', '(', 'loss.item', '(', ')', ')', '#', 'Example', 'usage', 'perplexity', '=', 'calculate_llm_perplexity', '(', 'gpt2_model', ',', '``', 'Hello', ',', 'world', '!', "''", ')', 'print', '(', 'f', "''", 'Perplexity', ':', '{', 'perplexity', ':', '.2f', '}', "''", ')']


### Using Tiktoken

In [43]:
import tiktoken

In [44]:
enc = tiktoken.get_encoding("cl100k_base")
tiktoken_tokens = enc.encode(sample_text)
print("Tiktoken Tokens:", tiktoken_tokens)
print("Decoded Tiktoken Tokens:", enc.decode(tiktoken_tokens))

Tiktoken Tokens: [13347, 0, 358, 1097, 12304, 311, 1935, 856, 1176, 3094, 520, 445, 11237, 13]
Decoded Tiktoken Tokens: Hi! I am excited to take my first step at LLM.


In [45]:
print(f"NLTK token count: {len(nltk_tokens)}")
print(f"Tiktoken token count: {len(tiktoken_tokens)}")

NLTK token count: 13
Tiktoken token count: 14
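The counts differ because the two libraries split text at different levels: NLTK returns whole words and punctuation marks, while `tiktoken` uses byte pair encoding (BPE), which can break a word into several subword pieces. The extra token here suggests that one word (likely `LLM`) was split in two. The following is a toy, character-level sketch of the BPE idea on a made-up corpus. It is not tiktoken's implementation, and all function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    # Replace each occurrence of the pair with a single merged symbol
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(corpus_words, num_merges):
    # Start from single characters and repeatedly merge the most frequent pair
    words = [list(w) for w in corpus_words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]
    return merges

def apply_merges(word, merges):
    # Tokenize a new word by replaying the learned merges in order
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = learn_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(apply_merges("lowly", merges))  # ['low', 'l', 'y']
```

A frequent fragment such as `low` becomes a single token, while the rarer remainder of an unseen word falls back to smaller pieces, so no `<|unk|>` token is needed.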


In [46]:
tiktoken_code_tokens = enc.encode(code_text)
print("\nTiktoken Tokens (decoded for readability):")
print(enc.decode_tokens_bytes(tiktoken_code_tokens))
print(f"Tiktoken token count: {len(tiktoken_code_tokens)}")


Tiktoken Tokens (decoded for readability):
[b'\n', b'def', b' calculate', b'_ll', b'm', b'_per', b'plex', b'ity', b'(model', b',', b' text', b',', b' max', b'_length', b'=', b'102', b'4', b'):\n', b'   ', b' tokens', b' =', b' tokenizer', b'.encode', b'(text', b',', b' max', b'_length', b'=max', b'_length', b',', b' trunc', b'ation', b'=True', b')\n', b'   ', b' input', b'_ids', b' =', b' torch', b'.tensor', b'([', b'tokens', b']).', b'to', b'(device', b')\n', b'   ', b' with', b' torch', b'.no', b'_grad', b'():\n', b'       ', b' outputs', b' =', b' model', b'(input', b'_ids', b',', b' labels', b'=input', b'_ids', b')\n', b'   ', b' loss', b' =', b' outputs', b'.loss', b'\n', b'   ', b' return', b' math', b'.exp', b'(loss', b'.item', b'())\n\n', b'#', b' Example', b' usage', b'\n', b'per', b'plex', b'ity', b' =', b' calculate', b'_ll', b'm', b'_per', b'plex', b'ity', b'(g', b'pt', b'2', b'_model', b',', b' "', b'Hello', b',', b' world', b'!")\n', b'print', b'(f', b'"', b'Per', b'plex