# Step 1: Creating Tokens

The print command prints the total number of characters followed by the first 99 characters of this file for illustration purposes.

In [3]:
with open("alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 71263
TITLE: Alice's Adventures in Wonderland
AUTHOR: Lewis Carroll

= CHAPTER I = 
=( Down the Rabbit-Ho


Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace characters:

In [4]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
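The parentheses in the pattern matter here. As a minimal sketch (using a shorter, made-up sentence), the following compares `re.split` with and without a capturing group:

```python
import re

sample = "Hello, world."  # made-up example text

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', sample)

# With a capturing group, each separator is kept as its own list item.
with_group = re.split(r'(\s)', sample)

print(without_group)
print(with_group)
```

Keeping the separators is what later allows punctuation marks to survive as tokens once they are added to the pattern.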



Let's modify the regular expression so that it splits on whitespace (\s), commas, and periods ([,.]):




In [6]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


A small remaining issue is that the list still includes whitespace characters and empty strings. Optionally, we can remove these redundant items safely as follows:


In [7]:
result = [item for item in result if item.strip()]  # keep only items that are non-empty after stripping whitespace
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


Let's modify it a bit further so that it can also handle other types of punctuation, such as question marks, quotation marks, and the double dashes that appear in the text of Alice's Adventures in Wonderland, along with additional special characters:

In [8]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
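For reuse, the two steps above (split, then strip and filter) can be wrapped in a small helper. This is only a sketch: the names `TOKEN_PATTERN` and `simple_tokenize` are hypothetical and do not appear elsewhere in this notebook; the pattern itself is the one from the cell above.

```python
import re

# Hypothetical helper names; the pattern is copied from the cell above.
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')

def simple_tokenize(text):
    """Split text on punctuation, double dashes, and whitespace, keeping punctuation as tokens."""
    pieces = TOKEN_PATTERN.split(text)
    # Drop empty strings and whitespace-only pieces.
    return [piece.strip() for piece in pieces if piece.strip()]

tokens = simple_tokenize("Hello, world. Is this-- a test?")
print(tokens)

# A single hyphen is not part of the pattern, so hyphenated words stay whole.
print(simple_tokenize("Rabbit-Hole"))
```

Note that only the double dash is treated as a separator, which is why a word such as "Rabbit-Hole" remains a single token.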


In [9]:
# Filter out any remaining empty or whitespace-only strings (a no-op here, since the items were already stripped above).
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


## Now that we have a basic tokenizer working, let's apply it to the entire text

In [10]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['TITLE', ':', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'AUTHOR', ':', 'Lewis', 'Carroll', '=', 'CHAPTER', 'I', '=', '=', '(', 'Down', 'the', 'Rabbit-Hole', ')', '=', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired']


In [11]:
print(len(preprocessed))


16362


In [12]:
# If we want to remove numerical values from the dataset, we can do it here, before creating the token IDs.
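One possible way to drop numerical tokens is sketched below. It assumes that "numerical values" means tokens made up entirely of digits, and it uses a made-up token list; in the notebook the same filter would be applied to `preprocessed`.

```python
# Made-up token list for illustration; in the notebook this would be `preprocessed`.
tokens = ["Chapter", "12", ":", "Alice", "was", "7", "years", "old", "."]

# str.isdigit() is True only for non-empty strings made up entirely of digits,
# so a mixed token such as "7th" would be kept by this filter.
tokens_without_numbers = [tok for tok in tokens if not tok.isdigit()]

print(tokens_without_numbers)
```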

# Step 2: Creating Token IDs

In the previous section, we tokenized the text of Alice's Adventures in Wonderland and assigned the result to a Python variable called preprocessed. Let's now create a list of all unique tokens and sort them alphabetically to determine the vocabulary size:

In [13]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

2096


After determining that the vocabulary size is 2096 via the above code, we create the vocabulary and print its first 51 entries for illustration purposes:

In [14]:
vocab = {token:integer for integer,token in enumerate(all_words)}   # assign a token ID to each vocabulary entry

In [15]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
('*', 5)
(',', 6)
('--', 7)
('.', 8)
(':', 9)
(';', 10)
('=', 11)
('?', 12)
('A', 13)
('ALICE', 14)
('ARE', 15)
('AUTHOR', 16)
('Ada', 17)
('Adventures', 18)
('Advice', 19)
('After', 20)
('Ah', 21)
('Alas', 22)
('Alice', 23)
('Allow', 24)
('An', 25)
('And', 26)
('Ann', 27)
('Antipathies', 28)
('As', 29)
('At', 30)
('Atheling', 31)
('Australia', 32)
('BEE', 33)
('BUSY', 34)
('Be', 35)
('Because', 36)
('Besides', 37)
('Bill', 38)
('Brandy', 39)
('But', 40)
('By', 41)
('C', 42)
('CAN', 43)
('CHAPTER', 44)
('CHORUS', 45)
('COULD', 46)
('CURTSEYING', 47)
('Canary', 48)
('Canterbury', 49)
('Carroll', 50)


Later in this book, when we want to convert the outputs of an LLM from numbers back into text, we also need a way to turn token IDs into text.

For this, we can create an inverse version of the vocabulary that maps token IDs back to corresponding text tokens.
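As a minimal sketch of this idea, the following builds a toy vocabulary and its inverse from a made-up token list and checks that the round trip recovers the tokens. The names `toy_vocab` and `toy_inverse_vocab` are hypothetical and chosen so they do not clash with the notebook's `vocab`.

```python
# Made-up token list for illustration.
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Forward vocabulary: unique tokens, sorted, numbered from 0.
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}

# Inverse vocabulary: swap keys and values.
toy_inverse_vocab = {idx: token for token, idx in toy_vocab.items()}

toy_ids = [toy_vocab[token] for token in toy_tokens]
round_trip = [toy_inverse_vocab[i] for i in toy_ids]

print(toy_vocab)
print(toy_ids)
print(round_trip)
```

The inversion is safe because every token receives a distinct ID, so no two keys collide when keys and values are swapped.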

### The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary.
### In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.

In [16]:
# Step 1: Store the vocabulary as a class attribute for access in the encode and decode methods

# Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

# Step 3: Process input text into token IDs

# Step 4: Convert token IDs back into text

# Step 5: Replace spaces before the specified punctuation

In [17]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class and try it out in practice:

In [18]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"Alice's Adventures in Wonderland"""
ids = tokenizer.encode(text)
print(ids)

[1, 23, 2, 1603, 18, 1117, 264]


The code above prints the token IDs for the input text. Next, let's see if we can turn these token IDs back into text using the decode method:

In [19]:
tokenizer.decode(ids)

'" Alice\' s Adventures in Wonderland'
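The decoded string is close to the input but not identical, because the tokenizer throws away the original whitespace and decode only removes spaces in front of a fixed set of punctuation marks. A small sketch of that decoding step, using a made-up token list and the same substitution pattern as SimpleTokenizerV1.decode:

```python
import re

tokens = ["Hello", ",", "world", ".", "Is", "this", "--", "a", "test", "?"]

# Joining with single spaces puts a space before every token, including punctuation.
joined = " ".join(tokens)

# Same substitution as in SimpleTokenizerV1.decode: drop whitespace before these marks.
cleaned = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)

print(joined)
print(cleaned)
```

The double dash is not in the substitution pattern, so it keeps its leading space, and an opening quotation mark is always followed by a space. Decoding is therefore lossy with respect to spacing.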

In [20]:
text = "Hello, do you like tea?"   # "Hello" does not occur in the text file, so it is missing from the vocabulary and encode raises a KeyError; this motivates adding special context tokens
print(tokenizer.encode(text))

KeyError: 'Hello'

## ADDING SPECIAL CONTEXT TOKENS

In [21]:
# In the previous section, we implemented a simple tokenizer and applied it to a passage from the training set.

# In this section, we will modify this tokenizer to handle unknown words.

# In particular, we will modify the vocabulary and the tokenizer from the previous section to create SimpleTokenizerV2,
# which supports two new tokens, <|unk|> and <|endoftext|>.

Let's now modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique words that we created in the previous section:

In [22]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [23]:
len(vocab.items())

2098

In [24]:
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('yourself', 2093)
('youth', 2094)
('zigzag', 2095)
('<|endoftext|>', 2096)
('<|unk|>', 2097)


Step 1: Replace unknown words by <|unk|> tokens

Step 2: Replace spaces before the specified punctuations

In [25]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int           # <===#
            else "<|unk|>" for item in preprocessed   # <===#
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [26]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [27]:
tokenizer.encode(text)

[2097,
 6,
 780,
 2090,
 1222,
 2097,
 12,
 2096,
 119,
 1860,
 2097,
 2097,
 1377,
 1860,
 2097,
 8]

In [51]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like <|unk|>? <|endoftext|> In the <|unk|> <|unk|> of the <|unk|>.'

# BPE (Byte Pair Encoding)

This section covers a more sophisticated tokenization scheme based on a concept called byte pair encoding (BPE).
The BPE tokenizer covered in this section was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.

In [52]:
# https://github.com/openai/tiktoken

In [53]:
# tiktoken is OpenAI's tokenizer library; it is also used for newer models such as GPT-4 and GPT-4o.

# It is based on byte-level BPE and is optimized for speed and memory.

# It ships several encodings, each with its own vocabulary; the "gpt2" encoding used below is the GPT-2 vocabulary.


In [54]:
# Three broad types of tokenizers:
# 1. Word-based tokenizers
# 2. Subword-based tokenizers
# 3. Character-based tokenizers

The usage of this tokenizer is similar to that of the SimpleTokenizerV2 we implemented previously, via an encode method: https://github.com/PrashantTakale369/Transformer-Basics/blob/b0eb0d70b16f09b11f4c5e8bd99e803c8e51771e/Tokanizer/Tokanizer.ipynb

In [55]:
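import importlib.metadata  # importing the submodule explicitly; `import importlib` alone does not guarantee it is loaded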
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [56]:
tokenizer = tiktoken.get_encoding("gpt2")

In [57]:
text = (
    "Hello, do you like tea? <|endoftext|> My name is Prashant someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 2011, 1438, 318, 1736, 1077, 415, 617, 34680, 27271, 13]


We can then convert the token IDs back into text using the decode method, similar to our SimpleTokenizerV2 earlier:

In [58]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> My name is Prashant someunknownPlace.


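Two observations stand out in the token IDs and decoded text above. First, the <|endoftext|> token is assigned a relatively large token ID, namely 50256. In fact, the BPE tokenizer, which was used to train models such as GPT-2, GPT-3, and the original model used in ChatGPT, has a total vocabulary size of 50,257, with <|endoftext|> being assigned the largest token ID.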

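Second, the BPE tokenizer above encodes and decodes unknown words, such as "someunknownPlace", correctly. The BPE tokenizer can handle any unknown word. How does it achieve this without using <|unk|> tokens? It breaks words that are not in its vocabulary down into smaller subword units, or even individual characters, so any word can be represented as a sequence of known tokens. In the output above, "someunknownPlace" is split into three tokens (IDs 617, 34680, 27271) rather than mapped to a single ID.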

*Let's try it on some different, meaningless words:*

In [59]:
text = (
    "jjnd difn"
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[41098, 358, 288, 361, 77]


In [60]:
strings = tokenizer.decode(integers)
print(strings)

jjnd difn
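To make the subword idea more concrete, here is a toy sketch of the BPE merge procedure. It is a simplification and not tiktoken's implementation: it works on characters rather than bytes, skips the pre-tokenization step that GPT-2 uses, and all function names (`most_frequent_pair`, `merge_pair`, `learn_merges`, `apply_merges`) are hypothetical.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most common one."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_merges(words, num_merges):
    """Start from single characters and repeatedly merge the most frequent adjacent pair."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges

def apply_merges(word, merges):
    """Tokenize a word by applying the learned merges in the order they were learned."""
    seq = list(word)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

corpus = ["low", "lower", "lowest", "newer", "wider"]  # made-up training words
merges = learn_merges(corpus, num_merges=5)
pieces = apply_merges("slowest", merges)  # a word that is not in the corpus

print(merges)
print(pieces)
```

Whatever merges are learned, an unseen word always falls back to smaller pieces, in the worst case single characters, and concatenating the pieces gives back the original word. That is why no <|unk|> token is needed.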


# Data Loader and Input Output Pairs

### CREATING INPUT-TARGET PAIRS

In [1]:
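# In this section we implement a data loader that fetches the input-target pairs using a sliding window approach.
# To get started, we first tokenize the whole Alice in Wonderland text we worked with earlier using the BPE tokenizer introduced in the previous section.
# Note: the outputs recorded in the next few cells (16362 tokens, token IDs below 2098, whole-word targets) match the
# word-level SimpleTokenizerV2 vocabulary rather than GPT-2 BPE, so they appear to come from a run in which `tokenizer`
# was still the SimpleTokenizerV2 instance; with the tiktoken tokenizer the numbers will differ.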

In [29]:
with open("alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print(len(enc_text))

16362


In [31]:
enc_sample = enc_text[50:]

One of the easiest and most intuitive ways to create the input-target pairs for the next-word prediction task is to create two variables, x and y, where x contains the input tokens and y contains the targets, which are the inputs shifted by 1:

In [32]:
context_size = 4  # length of the input
# A context_size of 4 means that the model looks at a sequence of 4 tokens
# to predict the next token in the sequence.
# If the tokens are [1, 2, 3, 4, 5], the input x is the first 4 tokens [1, 2, 3, 4] and the target y is [2, 3, 4, 5], the same window shifted by one.

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [1024, 1434, 1131, 1860]
y:      [1434, 1131, 1860, 557]


Processing the inputs along with the targets, which are the inputs shifted by one position, we can then create the next-word prediction tasks as follows:

In [None]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(context, "---->", desired)

For understanding purposes, let's repeat the previous code but convert the token IDs into text:

In [33]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

had ----> peeped
had peeped ----> into
had peeped into ----> the
had peeped into the ----> book


Next, we implement an efficient data loader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays.

### IMPLEMENTING A DATA LOADER

In [None]:
# Step 1: Tokenize the entire text
# Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length
# Step 3: Return the total number of rows in the dataset
# Step 4: Return a single row from the dataset

In [34]:
import torch  # needed for torch.tensor in GPTDatasetV1
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

Each row consists of a number of token IDs (based on a max_length) assigned to an input_chunk tensor.

The target_chunk tensor contains the corresponding targets.
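The chunking logic can be tried out without PyTorch. The following sketch reuses the loop from GPTDatasetV1 on plain Python lists; the function name `sliding_window_pairs` is hypothetical, and the token IDs are simply the numbers 0 to 9 so the effect of `stride` is easy to see.

```python
def sliding_window_pairs(token_ids, max_length, stride):
    """Return (input_chunk, target_chunk) pairs; each target is its input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

# stride=1: consecutive windows overlap in all but one position.
overlapping = sliding_window_pairs(token_ids, max_length=4, stride=1)

# stride=max_length: windows do not overlap, and no input token is reused.
disjoint = sliding_window_pairs(token_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[:2])
print(len(disjoint), disjoint)
```

A smaller stride yields more training rows from the same text at the cost of more overlap between rows; setting the stride equal to max_length uses each input token once.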

In [None]:
# Step 1: Initialize the tokenizer

# Step 2: Create dataset

# Step 3: drop_last=True drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training

# Step 4: The number of CPU processes to use for preprocessing

In [43]:
pip install tiktoken



In [44]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

In [45]:
with open("alice_in_wonderland.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

Convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function:

In [47]:
import tiktoken

In [48]:
import torch
print("PyTorch version:", torch.__version__)
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

PyTorch version: 2.6.0+cu124
[tensor([[49560,  2538,    25, 14862]]), tensor([[ 2538,    25, 14862,   338]])]


The first_batch variable contains two tensors: the first tensor stores the input token IDs, and the second tensor stores the target token IDs.
Since the max_length is set to 4, each of the two tensors contains 4 token IDs.

In [49]:
# In practice, LLMs are usually trained with a larger max_length, such as 256: each input row then contains 256 tokens, and the model predicts the next token from that context.

Before creating the embedding vectors from the token IDs, let's look at how the data loader samples with a batch size greater than 1:

In [50]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[49560,  2538,    25, 14862],
        [  338, 15640,   287, 42713],
        [  198,    32, 24318,  1581],
        [   25, 10174, 21298,   198],
        [  198,    28,  5870, 29485],
        [  314,   796,   220,   198],
        [16193,  5588,   262, 25498],
        [   12,    39,  2305,  1267]])

Targets:
 tensor([[ 2538,    25, 14862,   338],
        [15640,   287, 42713,   198],
        [   32, 24318,  1581,    25],
        [10174, 21298,   198,   198],
        [   28,  5870, 29485,   314],
        [  796,   220,   198, 16193],
        [ 5588,   262, 25498,    12],
        [   39,  2305,  1267,    28]])
