# Importing The Dataset

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Harry Potter And The Chamber of Secrets - By JK Rowling
</div>

### Step 1: Reading The Document

In [40]:
with open("datasets/harryPotter.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("The total number of characters: ", len(raw_text))
print(raw_text[:99])

The total number of characters:  506149
CHAPTER   ONE
 
 
THE WORST BIRTHDAY
 
 
Not for the first time, an argument had broken out over br


**The document contains 506,149 characters; the first 99 are displayed.**

### Step 2: Converting the characters into tokens [Example]

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Using an example dataset before converting the whole book.
</div>

In [58]:
# Importing the regular expression package
import re

In [59]:
text = "Hello, World!"
text = re.split(r'(\s)', text)
print(text)

['Hello,', ' ', 'World!']


**Splitting on whitespace alone gives three items: 'Hello,', a single space, and 'World!'. The punctuation stays attached to the words.**

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Modifying the regular expression so that it also splits on commas and periods
</div>

In [60]:
text = "Hello, World. This, is a test."
text = re.split(r'([.,]|\s)', text)
print(text)

['Hello', ',', '', ' ', 'World', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


**Now the sentence is split on whitespace, commas, and periods. The empty strings appear wherever two delimiters sit next to each other.**
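The empty strings in the output come from how `re.split` treats capturing groups. A minimal sketch on a shorter, made-up sentence (the variable names are illustrative) compares the same pattern with and without the group:

```python
import re

sample = "Hello, World."

# With a capturing group, re.split keeps the delimiters in the result.
with_group = re.split(r'([.,]|\s)', sample)

# Without the group, the delimiters are discarded.
without_group = re.split(r'[.,]|\s', sample)

print(with_group)     # ['Hello', ',', '', ' ', 'World', '.', '']
print(without_group)  # ['Hello', '', 'World', '']
```

The `''` entries mark the zero-length text between two adjacent delimiters (the comma followed by the space) and after the final period.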

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Removing the whitespace entries. Some tokenizers keep whitespace as tokens, but dropping it reduces the number of tokens and the memory the text data needs.
</div>

In [61]:
result = [item for item in text if item.strip()]
print(result)

['Hello', ',', 'World', '.', 'This', ',', 'is', 'a', 'test', '.']


**This removes the empty strings and the whitespace-only entries.**

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Extending the regular expression to cover the other punctuation marks in the text
</div>

In [62]:
text= "Hello, world. Is this-- a test?"
text = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in text if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


**This is the tokenization scheme that is applied to the full dataset in the next step.**
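The split-and-filter steps can be wrapped in one reusable function. This is a small sketch; the names `TOKEN_PATTERN` and `tokenize` are illustrative and do not appear elsewhere in the notebook:

```python
import re

# Same pattern as in the cell above: punctuation, double dashes, and whitespace.
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = TOKEN_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]

tokens = tokenize("Hello, world. Is this-- a test?")
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```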

### Step 3: Converting the characters into tokens [Actual Data]

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Now creating tokens for the entire Harry Potter dataset
</div>

In [63]:
preProcessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
# Cleaning the entire data
preProcessed = [item.strip() for item in preProcessed if item.strip()]
print('The total number of tokens: ', len(preProcessed))
print(preProcessed[:99])

The total number of tokens:  107171
['CHAPTER', 'ONE', 'THEWORSTBIRTHDAY', 'Not', 'for', 'the', 'first', 'time', ',', 'an', 'argument', 'had', 'broken', 'out', 'over', 'breakfast', 'at', 'number', 'four', ',', 'Privet', 'Drive', '.', 'Mr', '.', 'Vernon', 'Dursley', 'had', 'been', 'woken', 'in', 'the', 'early', 'hours', 'of', 'the', 'morning', 'by', 'a', 'loud', ',', 'hooting', 'noise', 'from', 'his', 'nephewHarry', "'", 'sroom', '.', '"', 'Thirdtimethisweek', '!', '"', 'heroaredacrossthetable', '.', '"', 'Ifyoucan', "'", 't', 'controlthatowl', ',', 'it', "'", 'llhavetogo', '!', '"', 'Harry', 'tried', ',', 'yet', 'again', ',', 'to', 'explain', '.', '"', 'She', "'", 's', 'bored', ',', '"', 'he', 'said', '.', '"', 'She', "'", 's', 'used', 'to', 'flying', 'aroun', 'd', 'outside', '.', 'If', 'I', 'could']


**After cleaning, the text yields 107,171 tokens; the first 99 are displayed. Some entries in this output, such as 'THEWORSTBIRTHDAY', are several words fused together.**

### Step 4: Creating Vocabulary

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    The vocabulary is the sorted list of unique tokens
</div>

In [64]:
all_words = sorted(set(preProcessed))
print(len(all_words))

10686


**The total number of unique tokens (words and punctuation marks) in this book: 10,686**

In [66]:
vocab = {token: integer for integer, token in enumerate(all_words)}
for i,item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
('*148*', 5)
('*247*', 6)
('*300*', 7)
('*339*', 8)
('*light', 9)
(',', 10)
('-', 11)
('--', 12)
('.', 13)
('1', 14)
('13', 15)
('1492', 16)
('1875', 17)
('2', 18)
('3', 19)
('30P', 20)
('51', 21)
('54', 22)
(':', 23)
(';', 24)
('=', 25)
('?', 26)
('A', 27)
('ABNORMALITY', 28)
('ABOUTSAYINGTHE`M', 29)
('ABSOLUTELYDISGUSTED', 30)
('AM', 31)
('AN', 32)
('AND', 33)
('ANGLIA', 34)
('ANOTHER', 35)
('AT', 36)
('ATFL0VRISHANDBLOTTS', 37)
('ATTAAAACK', 38)
('ATTACK', 39)
('Aaargh', 40)
('About', 41)
('Abyssinian', 42)
('According', 43)
('Act', 44)
('Add', 45)
('Adrian', 46)
('Adventures', 47)
('After', 48)
('Afterahurriedlunch', 49)
('Aftertennoisy', 50)
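Punctuation and digits receive the lowest IDs because Python sorts strings by Unicode code point, which places `!` and `,` before digits, digits before uppercase letters, and uppercase before lowercase. A toy sketch with made-up tokens:

```python
# Made-up tokens, including a duplicate and mixed case.
tokens = ["the", "Owl", ",", "the", "owl", "!", "1492"]

# set() removes duplicates; sorted() orders by Unicode code point.
unique_sorted = sorted(set(tokens))
vocab = {token: idx for idx, token in enumerate(unique_sorted)}

print(unique_sorted)  # ['!', ',', '1492', 'Owl', 'owl', 'the']
print(vocab["Owl"], vocab["owl"])  # 3 4
```

Since the comparison is case-sensitive, `Owl` and `owl` are separate vocabulary entries.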


### Step 5: Creating Tokenizer Class

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Tokenizer Class Includes Encoder And Decoder Logic
</div>

In [69]:
class simple_tokenizer_v1:
    # Constructor
    def __init__ (self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {integer:string for string,integer in vocab.items()}
    # Encoder
    def encode (self, text):
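        # The capture group must wrap the whole alternation so that '--' is kept
        # as a token, matching the pattern that was used to build the vocabulary
        preProcessed = re.split(r'([,.;:?!_"()\']|--|\s)', text)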
        preProcessed = [item.strip() for item in preProcessed if item and item.strip()]
        ids = [self.str_to_int[s] for s in preProcessed]
        return ids
    # Decoder
    def decode (self, ids):
        text = (" ").join([self.int_to_str[i] for i in ids])
        # Drop the space that join() inserts before punctuation marks
        text = re.sub(r'\s+([,.;:?!_"()\'])', r'\1', text)
        return text
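In a split pattern like the one used in `encode`, the position of the parentheses decides which delimiters survive as tokens. The sketch below uses a shortened pattern and a made-up sentence to compare the two placements:

```python
import re

text = "Is this-- a test?"

# Group around the whole alternation: '--' is captured and kept.
grouped_all = re.split(r'([,.?]|--|\s)', text)

# Group around the character class only: '--' and whitespace still split
# the text, but they are not captured, so re.split puts None in their place.
grouped_class = re.split(r'([,.?])|--|\s', text)

kept_all = [t for t in grouped_all if t and t.strip()]
kept_class = [t for t in grouped_class if t and t.strip()]

print(kept_all)    # ['Is', 'this', '--', 'a', 'test', '?']
print(kept_class)  # ['Is', 'this', 'a', 'test', '?']
```

With the second placement the `--` token disappears from the output, so the encoder and the vocabulary would no longer agree on what counts as a token.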

In [70]:
tokenizer = simple_tokenizer_v1(vocab)
text = """Professor Lockhart would be unable to return next y
ear, because he needed to go away and get his memory back. Quite a few
 
of the teachers joined in the cheering that greeted this news."""
tokenizer.encode(text)

[1594,
 1196,
 10578,
 2952,
 9988,
 9729,
 7881,
 6861,
 10611,
 4376,
 10,
 2975,
 5465,
 6842,
 9729,
 5233,
 2874,
 2534,
 5133,
 5605,
 6577,
 2890,
 13,
 1623,
 2352,
 4776,
 6973,
 9494,
 9410,
 6013,
 5826,
 9494,
 3564,
 9481,
 5300,
 9611,
 6858,
 13]

**The encoder splits the paragraph into tokens and maps each token to its ID in the vocabulary.**

In [71]:
tokenizer.decode(tokenizer.encode(text))

'Professor Lockhart would be unable to return next y ear, because he needed to go away and get his memory back. Quite a few of the teachers joined in the cheering that greeted this news.'

**The decoder rebuilds the paragraph from the list of token IDs. Line breaks and repeated spaces from the original text are not restored.**
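Decoding is therefore not an exact inverse of encoding. The following self-contained sketch shows this with a toy vocabulary; the standalone `encode` and `decode` functions here are simplified stand-ins for the class methods, not the class itself:

```python
import re

PATTERN = r'([,.;:?!_"()\']|--|\s)'

def encode(text, vocab):
    tokens = [t.strip() for t in re.split(PATTERN, text) if t.strip()]
    return [vocab[t] for t in tokens]

def decode(ids, vocab):
    inverse = {i: t for t, i in vocab.items()}
    text = " ".join(inverse[i] for i in ids)
    # Remove the space that join() put in front of punctuation.
    return re.sub(r'\s+([,.;:?!_"()\'])', r'\1', text)

# Toy vocabulary built the same way as in Step 4.
vocab = {t: i for i, t in enumerate(sorted({",", ".", "Quite", "a", "few", "of"}))}

original = "Quite a few\n \nof a few."
restored = decode(encode(original, vocab), vocab)

print(repr(restored))        # 'Quite a few of a few.'
print(restored == original)  # False
```

The tokens survive the round trip, but every run of whitespace collapses to a single space.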

### Step 6: The Problem of Tokenizer Class

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Every token in the text to be encoded must already exist in the vocabulary. If the text contains a word the vocabulary has never seen, the class fails to encode it.
</div>

In [75]:
text = "The robotic llama danced gracefully across the moonlit ice rink, humming an old jazz tune from Mars."
tokenizer.encode(text)

KeyError: 'robotic'

**To resolve this error, we add two special tokens to the vocabulary: <|unk|> and <|endoftext|>**

<div style="background-color: blue; color: white; padding: 10px; border-radius: 5px;">
    unk: Unknown token, substituted for any word that is not in the vocabulary, for example "Tea".
    endoftext: This token marks the end of one text source and the start of the next. GPT-style models use it to separate unrelated documents that are concatenated for training.
    GPT models do not use an unk token. They use a Byte Pair Encoding (BPE) tokenizer, which splits words into subword units, so a word outside the vocabulary is encoded as several smaller tokens.
</div>
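To give an intuition for how byte pair encoding arrives at subword units, here is a toy character-level sketch. It is a simplification and not the implementation used by GPT-2 or tiktoken, which work on bytes with a pre-trained merge table; the helper names are illustrative:

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences; return the most common."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from single characters and apply two merge steps.
sequences = [list(word) for word in ["low", "lower", "lowest"]]
for _ in range(2):
    pair = most_frequent_pair(sequences)
    if pair is None:
        break
    sequences = [merge_pair(seq, pair) for seq in sequences]

print(sequences)
# [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

After two merges the shared stem `low` has become one symbol, while the rarer endings remain split into characters. This is why a BPE tokenizer can encode an unseen word as a sequence of known pieces.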

### Step 7: Special Context Tokens

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Adding two new tokens to the vocabulary
</div>

In [73]:
all_tokens = sorted(list(set(preProcessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {string:integer for integer, string in enumerate(all_tokens)}
len(vocab.items())
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('zing', 10683)
('zoo', 10684)
('zoomed', 10685)
('<|endoftext|>', 10686)
('<|unk|>', 10687)


**The vocabulary now has 10,688 entries, including the two new special tokens.**

In [1]:
class simple_tokenizer_v2:
    def __init__ (self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    # Modified Encoder Logic
    def encode (self, text):
        preProcessed = re.split(r'([,.;:?!_"()\']|--|\s)', text)
        preProcessed = [item.strip() for item in preProcessed if item and item.strip()]
        preProcessed = [item if item in self.str_to_int
                       else "<|unk|>" for item in preProcessed
                       ]
        ids = [self.str_to_int[s] for s in preProcessed]
        return ids
    # Same Decoder Logic
    def decode (self, ids):
        text = (" ").join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.;:?!_"()\'])', r'\1', text)
        return text

In [76]:
tokenizer = simple_tokenizer_v2(vocab)
text = "The robotic llama danced gracefully across the moonlit ice rink, humming an old jazz tune from Mars."
tokenizer.encode(text)

[1940,
 10687,
 10687,
 10687,
 10687,
 2396,
 9494,
 6690,
 5755,
 10687,
 10,
 5724,
 2531,
 6995,
 10687,
 10687,
 5040,
 10687,
 13]

In [77]:
tokenizer.decode(tokenizer.encode(text))

'The <|unk|> <|unk|> <|unk|> <|unk|> across the moonlit ice <|unk|>, humming an old <|unk|> <|unk|> from <|unk|>.'
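The membership test in `encode` can also be written with `dict.get`, which returns a default when a key is missing. The sketch below uses a made-up four-entry vocabulary and also measures what share of the input was replaced:

```python
# Hypothetical miniature vocabulary for illustration only.
vocab = {"The": 0, "across": 1, "the": 2, "<|unk|>": 3}
tokens = ["The", "llama", "danced", "across", "the", "rink"]

unk_id = vocab["<|unk|>"]
# dict.get falls back to unk_id for any token missing from the vocabulary.
ids = [vocab.get(token, unk_id) for token in tokens]

unknown_share = ids.count(unk_id) / len(ids)
print(ids)            # [0, 3, 3, 1, 2, 3]
print(unknown_share)  # 0.5
```

The mapping is lossy: once a word has been replaced by the unknown ID, the decoder cannot recover it.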

**Using The <|endoftext|> token**

In [80]:
# First Source Of Text
text1 = """Professor Lockhart would be unable to return next y
ear, owing to the 
fact that he needed to go away and get his memory back. Quite a few 
of the teachers joined in the cheering that greeted this news."""
# Second Source of Text
text2 = """Shame, said Ron, helping himself to a jam doughnut. He was 
starting to grow on me."""
# Joining both texts with the <|endoftext|> separator
text = "<|endoftext|> ".join((text1, text2))
print(text)

Professor Lockhart would be unable to return next y
ear, owing to the 
fact that he needed to go away and get his memory back. Quite a few 
of the teachers joined in the cheering that greeted this news.<|endoftext|> Shame, said Ron, helping himself to a jam doughnut. He was 
starting to grow on me.


**This is the intended use of the token: it marks the boundary between two unrelated text sources.**

In [82]:
tokenizer.encode(text)

[1594,
 1196,
 10578,
 2952,
 9988,
 9729,
 7881,
 6861,
 10611,
 4376,
 10,
 7146,
 9729,
 9494,
 4674,
 9481,
 5465,
 6842,
 9729,
 5233,
 2874,
 2534,
 5133,
 5605,
 6577,
 2890,
 13,
 1623,
 2352,
 4776,
 6973,
 9494,
 9410,
 6013,
 5826,
 9494,
 3564,
 9481,
 5300,
 9611,
 6858,
 13,
 10686,
 1769,
 10,
 8076,
 1665,
 10,
 5533,
 5596,
 9729,
 2352,
 5993,
 4236,
 13,
 829,
 10267,
 9026,
 9729,
 5328,
 7011,
 6547,
 13]

**The list of token IDs contains the ID of the <|endoftext|> token (10686).**

In [83]:
tokenizer.decode(tokenizer.encode(text))

'Professor Lockhart would be unable to return next y ear, owing to the fact that he needed to go away and get his memory back. Quite a few of the teachers joined in the cheering that greeted this news. <|endoftext|> Shame, said Ron, helping himself to a jam doughnut. He was starting to grow on me.'

**The decoded sentence includes the endoftext token**
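Because the separator has its own ID, document boundaries can be recovered from a flat list of token IDs. A minimal sketch, assuming the ID 10686 from the vocabulary above and a shortened ID list; `split_on_boundary` is a hypothetical helper:

```python
def split_on_boundary(ids, boundary_id):
    """Split a flat list of token IDs into one list per document."""
    documents, current = [], []
    for token_id in ids:
        if token_id == boundary_id:
            documents.append(current)
            current = []
        else:
            current.append(token_id)
    documents.append(current)
    return documents

END_OF_TEXT_ID = 10686  # ID of <|endoftext|> in the vocabulary built above
ids = [1594, 1196, 13, END_OF_TEXT_ID, 1769, 10, 13]

docs = split_on_boundary(ids, END_OF_TEXT_ID)
print(docs)  # [[1594, 1196, 13], [1769, 10, 13]]
```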

## Implementing the BPE tokenizer

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Using the tiktoken library developed by OpenAI
</div>

In [1]:
!pip3 install tiktoken
import importlib.metadata
import tiktoken
print('TikToken Version: ', importlib.metadata.version('tiktoken'))

TikToken Version:  0.9.0


**Instantiating the BPE tokenizer and encoding the entire dataset**

In [7]:
tokenizer = tiktoken.get_encoding('gpt2')
encode_text = tokenizer.encode(raw_text, allowed_special={'<|endoftext|>'})
print(encode_text[:50])
print(len(encode_text))

[41481, 220, 220, 16329, 198, 220, 198, 220, 198, 10970, 21881, 2257, 347, 4663, 4221, 26442, 198, 220, 198, 220, 198, 3673, 329, 262, 717, 640, 11, 281, 4578, 550, 5445, 503, 625, 12607, 379, 198, 220, 198, 17618, 1440, 11, 4389, 16809, 9974, 13, 1770, 13, 27820, 360, 1834]
163347


**The total number of tokens: 163,347**
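The counts printed in this notebook allow a rough comparison of the two tokenizers. The figures below are copied from the outputs above. The comparison is approximate because the regex tokenizer discards whitespace while the BPE output contains whitespace tokens:

```python
num_characters = 506149   # len(raw_text) reported in Step 1
num_word_tokens = 107171  # tokens from the regex tokenizer in Step 3
num_bpe_tokens = 163347   # tokens from the GPT-2 BPE tokenizer

chars_per_word_token = num_characters / num_word_tokens
chars_per_bpe_token = num_characters / num_bpe_tokens

print(f"{chars_per_word_token:.2f} characters per regex token")  # 4.72
print(f"{chars_per_bpe_token:.2f} characters per BPE token")     # 3.10
```

BPE produces more, shorter tokens, but it draws them from a fixed vocabulary and never needs an unknown token.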

## Creating DataSet and DataLoader Using PyTorch

**So, how do input-target pairs work?**

In [8]:
import tiktoken
# Studying the data chunking done by the dataset, practically
context_size = 4
# Importing The Data
with open('datasets/harryPotter.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()
# Sample Data For The Concept
tokenizer = tiktoken.get_encoding('gpt2')
encoded_text = tokenizer.encode(raw_text, allowed_special={'<|endoftext|>'})
sample_text = encoded_text[:50]
print(len(sample_text))

# The target sequence is the input sequence shifted by one token
x = sample_text[:context_size]
y = sample_text[1:context_size+1]
print(f"Input Cluster: {x}")
print(f"Output Cluster:       {y}")

50
Input Cluster: [41481, 16329, 3336, 21881]
Output Cluster:       [16329, 3336, 21881, 2257]



**These loops show the prediction task step by step: each growing prefix of the input is paired with the token that follows it.**

In [10]:
# The first loop to understand why the Input-Target Pair is necessary
for i in range(1, context_size):
    input_ids = sample_text[:i]
    output_id = sample_text[i]
    print(f'{input_ids} -------->  {output_id}')
# The second loop to understand it in a real implementation
for i in range (1, context_size):
    input_ids = sample_text[:i]
    input_decoded = tokenizer.decode(input_ids)
    output_id = sample_text[i]
    output_decoded = tokenizer.decode([output_id])
    print(f'{input_decoded} --------> {output_decoded}')

[41481] -------->  16329
[41481, 16329] -------->  3336
[41481, 16329, 3336] -------->  21881
CHAPTER -------->  ONE
CHAPTER ONE -------->  THE
CHAPTER ONE THE -------->  WOR
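The same idea can be shown with readable word-level tokens instead of BPE IDs. In this sketch the token list is made up for illustration, and one input-target pair of length `context_size` is unpacked into all of its prediction tasks:

```python
tokens = ["CHAPTER", "ONE", "THE", "WORST", "BIRTHDAY"]
context_size = 4

inputs = tokens[:context_size]
targets = tokens[1:context_size + 1]

# Position i of `targets` is the token that follows the prefix inputs[:i + 1].
pairs = [(inputs[:i + 1], targets[i]) for i in range(context_size)]
for prefix, target in pairs:
    print(prefix, "->", target)
```

A single pair of shifted sequences therefore carries `context_size` next-token predictions, one per position.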


**Now, Creating DataSet and DataLoader Using PyTorch**

In [11]:
!pip3 install torch
import importlib.metadata
import torch
from torch.utils.data import Dataset, DataLoader

print('Torch Version: ', importlib.metadata.version('torch'))

Torch Version:  2.7.0


**Creating a Dataset class that builds input-target tensors according to stride and max_length**

In [14]:
class GPTDatasetV1(Dataset): 
    # Constructor
    def __init__(self, text, tokenizer, max_length, stride):
        self.input_ids = []
        self.output_ids = []
        # Encode the entire dataset
        encoded_data = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
        # Sliding Mechanism:
        for i in range (0, len(encoded_data) - max_length, stride):
            x = encoded_data[i:i + max_length]
            y = encoded_data[i+1: i + max_length + 1]
            self.input_ids.append(torch.tensor(x))
            self.output_ids.append(torch.tensor(y))
    # Must override the interface method:
    def __len__ (self):
        return (len(self.input_ids))
    # To Get Specific Row
    def __getitem__ (self, idx):
        return self.input_ids[idx], self.output_ids[idx]
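The sliding mechanism can be studied without PyTorch. This sketch repeats the loop from `__init__` on plain lists; `sliding_windows` is a hypothetical helper and the token IDs are just the numbers 0 to 9:

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks the same way the loop in __init__ does."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        x = token_ids[start:start + max_length]
        y = token_ids[start + 1:start + max_length + 1]
        pairs.append((x, y))
    return pairs

token_ids = list(range(10))

overlapping = sliding_windows(token_ids, max_length=4, stride=1)
disjoint = sliding_windows(token_ids, max_length=4, stride=4)

print(len(overlapping))  # 6
print(disjoint)
# [([0, 1, 2, 3], [1, 2, 3, 4]), ([4, 5, 6, 7], [5, 6, 7, 8])]
```

The loop stops at `len(token_ids) - max_length` so that every input chunk still has a complete target chunk one position to its right. A smaller stride yields more samples that overlap more heavily.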

**Now, defining a helper function that wraps the dataset in PyTorch's DataLoader class**

In [15]:
def create_dataloader_v1 (text, max_length = 256, stride = 128, batch_size = 4, num_workers = 0, shuffle = True, drop_last = True):
    #Creating the tokenizer to pass to the dataset
    tokenizer = tiktoken.get_encoding('gpt2')
    #Creating the object of the dataset
    dataset = GPTDatasetV1(text, tokenizer = tokenizer, max_length = max_length, stride = stride)
    #Creating the object of the dataloader
    dataloader = DataLoader(dataset, batch_size = batch_size, shuffle = shuffle, drop_last = drop_last, num_workers = num_workers)
    #returning the dataloader
    return dataloader

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
  Testing the dataloader, which instantiates the dataset. We use batch_size = 1 and max_length = 4 (the context size)
</div>

In [17]:
# Importing the dataset
with open("datasets/harryPotter.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
# Invoking the function    
dataloader = create_dataloader_v1(raw_text, max_length = 4, stride = 1, batch_size = 1, shuffle = False)
#For Iteration
data_iter = iter(dataloader)
first_batch = next(data_iter)
print(f'First batch of input-target pair: {first_batch}')
second_batch = next(data_iter)
print(f'Second batch of input-target pair: {second_batch}')

First batch of input-target pair: [tensor([[41481, 16329,  3336, 21881]]), tensor([[16329,  3336, 21881,  2257]])]
Second batch of input-target pair: [tensor([[16329,  3336, 21881,  2257]]), tensor([[ 3336, 21881,  2257,   347]])]


<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
  Using a batch size of 8
</div>

In [20]:
# When max_length and stride are equal, the input chunks do not overlap
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs: \n", inputs)
print("Outputs: \n", targets)

Inputs: 
 tensor([[41481, 16329,  3336, 21881],
        [ 2257,   347,  4663,  4221],
        [26442,  1892,   329,   262],
        [  717,   640,    11,   281],
        [ 4578,   550,  5445,   503],
        [  625, 12607,   379,  1271],
        [ 1440,    11,  4389, 16809],
        [ 9974,    13,  1770,    13]])
Outputs: 
 tensor([[16329,  3336, 21881,  2257],
        [  347,  4663,  4221, 26442],
        [ 1892,   329,   262,   717],
        [  640,    11,   281,  4578],
        [  550,  5445,   503,   625],
        [12607,   379,  1271,  1440],
        [   11,  4389, 16809,  9974],
        [   13,  1770,    13, 27820]])
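Each row of the batch is one sample from the dataset, and `create_dataloader_v1` sets `drop_last = True` by default, which discards a final batch that has fewer than `batch_size` rows. A plain-Python sketch of that grouping for the unshuffled case; `batch_indices` is a hypothetical helper, not part of PyTorch:

```python
def batch_indices(num_samples, batch_size, drop_last):
    """Group sample indices 0..num_samples-1 into consecutive batches."""
    batches = [
        list(range(start, min(start + batch_size, num_samples)))
        for start in range(0, num_samples, batch_size)
    ]
    # Mirror drop_last: remove the final batch if it is incomplete.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

kept = batch_indices(num_samples=10, batch_size=4, drop_last=True)
full = batch_indices(num_samples=10, batch_size=4, drop_last=False)

print(kept)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(full)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Dropping the incomplete batch keeps every batch tensor the same shape, at the cost of skipping a few samples per epoch.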
