# Reading a Short Story as a Text Sample in Python

## Step 1: Creating tokens

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
  The print calls show the total number of characters, followed by the first 99 characters of the file for illustration.
</div>


In [10]:
with open("datasets/theVerdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of characters: ", len(raw_text))
print(raw_text[:99])

Total number of characters:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Our goal is to tokenize the 20,479-character short story into individual words and special characters that we can then turn into vector embeddings for LLM training.
</div>

**Python's regular expression library (`re`) is used to split the text into individual tokens.**

In [11]:
import re
text = "Hello, world. This, is a test"
result = re.split(r'(\s)', text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test']


<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    <code>\s</code> is the regular expression for a whitespace character, so this splits the sentence at every whitespace character it finds.
</div>
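
The parentheses around `\s` matter as much as `\s` itself. As a minimal sketch (the names `without_group` and `with_group` are illustrative and not part of the notebook), the following compares splitting with and without a capturing group:

```python
import re

text = "Hello, world. This, is a test"

# Without a capturing group, the whitespace separators are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched separator is kept in the result.
with_group = re.split(r'(\s)', text)

print(without_group)
print(with_group)
```

Keeping the separators becomes useful once punctuation is added to the pattern, because punctuation marks should survive as tokens of their own.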

**We can extend the pattern to also split on commas and periods by adding the character class `[,.]`**

In [12]:
result = re.split(r'([,.]|\s)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test']


**Removing Whitespace Safely**

In [13]:
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test']


<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    The list comprehension iterates over every item in result. item.strip() returns an empty string for empty or whitespace-only items, and an empty string is falsy, so those items are dropped. Every other item is kept, and the new list is assigned back to result.
</div>
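
To see why the filter works, it may help to look at the truthiness of `item.strip()` for a few representative items. This small sketch uses a made-up list called `samples`:

```python
samples = ['Hello', '', ' ', '\n', ',']

for item in samples:
    # strip() removes leading/trailing whitespace; an empty result is falsy.
    print(repr(item), '->', bool(item.strip()))

# Only items that still contain something after stripping are kept.
kept = [item for item in samples if item.strip()]
print(kept)
```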

**Modifying the tokenizer further to treat quotation marks, question marks, double dashes, and other punctuation as separate tokens.**

In [1]:
import re
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
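    item.strip() appears twice with different roles: the one in the if clause filters out empty and whitespace-only pieces, while the one at the front trims whitespace from each piece that is kept. With this pattern every whitespace character is already a split point, so the front strip acts only as a safeguard.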
</div>
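
As a quick check, assuming the same pattern as in the cell above, the sketch below compares filtering alone against filtering plus stripping. The names `filtered` and `filtered_and_stripped` are illustrative.

```python
import re

text = "Hello, world. Is this-- a test?"
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Filter only: drop empty and whitespace-only pieces.
filtered = [item for item in pieces if item.strip()]

# Filter and strip: additionally trim whitespace from the kept pieces.
filtered_and_stripped = [item.strip() for item in pieces if item.strip()]

print(filtered)
print(filtered == filtered_and_stripped)
```

If the pattern were changed so that some whitespace no longer triggers a split, the two lists could differ, which is when the extra strip starts to matter.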

**Applying the tokenizer scheme to the entire raw text of the story**

In [29]:
preProcessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preProcessed = [item.strip() for item in preProcessed if item.strip()]
print(preProcessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [16]:
print(len(preProcessed))

4690


## Step 2: Assigning Token IDs

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Now we collect all the unique tokens and sort them. sorted() orders strings by character code, so punctuation comes first, then capitalized words, then lowercase words.
</div>

In [17]:
all_words = sorted(set(preProcessed))
vocab_size = len(all_words)
print(vocab_size)

1130


**Creating our own vocabulary by assigning an integer ID to each unique token**

In [18]:
vocab = {token: integer for integer, token in enumerate(all_words)}
for i,item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)
('Hermia', 50)


**Now creating a Python class for encoding and decoding**

In [39]:
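class simpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                             # token string -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # ID -> token string

    def encode(self, text):
        # Split on punctuation, double dashes, and whitespace, keeping the separators.
        preprocessed = re.split(r'([.,:;?_!"()\']|--|\s)', text)
        # Drop empty and whitespace-only pieces.
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Raises KeyError for any token that is not in the vocabulary.
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decoder(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the whitespace that join() inserted before the listed punctuation.
        # ';' is not in this character class, so a semicolon keeps its leading space.
        text = re.sub(r'\s+([,.:_?!"()\'])', r'\1', text)
        return text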

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Now we instantiate the tokenizer class and pass it some text. The encode method converts a string into a list of integer token IDs according to our vocabulary; the decoder method takes a list of IDs and converts it back into a string.
</div>
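
The round trip can be tried in isolation on a tiny example. The sketch below is not the notebook's class: `toy_vocab`, `toy_encode`, and `toy_decode` are hypothetical stand-ins that reuse the same two regular expressions on a six-token vocabulary.

```python
import re

# Toy vocabulary (hypothetical; the notebook builds its vocabulary from the story).
toy_tokens = sorted({'a', 'is', 'test', 'this', ',', '.'})
toy_vocab = {token: idx for idx, token in enumerate(toy_tokens)}
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

def toy_encode(text):
    # Same split pattern as the tokenizer class; whitespace-only pieces are dropped.
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [toy_vocab[p] for p in pieces if p.strip()]

def toy_decode(ids):
    joined = " ".join(toy_inverse[i] for i in ids)
    # Remove the space that join() put in front of punctuation.
    return re.sub(r'\s+([,.:_?!"()\'])', r'\1', joined)

ids = toy_encode("this is a test, is this a test.")
print(ids)
print(toy_decode(ids))
```

The round trip is exact here only because the input uses single spaces and punctuation that the decoding pattern handles. Text with other spacing would come back normalized rather than identical.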

In [42]:
tokenizer = simpleTokenizerV1(vocab)
text = "\"Money's only excuse is to put beauty into circulation,\" was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo; and Mrs. Gisburn, beaming on him, added for my enlightenment: \"Jack is so morbidly sensitive to every form of beauty.\""
ids = tokenizer.encode(text)
print(ids)

[1, 63, 2, 850, 731, 406, 584, 1016, 806, 203, 579, 265, 5, 1, 1077, 729, 722, 988, 189, 533, 598, 362, 127, 988, 87, 157, 890, 722, 156, 411, 168, 651, 5, 1090, 5, 727, 115, 604, 315, 5, 53, 514, 140, 849, 741, 477, 64, 24, 9, 157, 67, 7, 38, 5, 199, 727, 546, 5, 130, 456, 697, 391, 8, 1, 57, 584, 908, 684, 868, 1016, 403, 464, 722, 203, 7, 1]


In [25]:
print(tokenizer.decoder(ids))

" Money' s only excuse is to put beauty into circulation," was one of the axioms he laid down across the Sevres and silver of an exquisitely appointed luncheon-table, when, on a later day, I had again run over from Monte Carlo ; and Mrs. Gisburn, beaming on him, added for my enlightenment:" Jack is so morbidly sensitive to every form of beauty."


<div style="background-color: lightyellow; color: red; padding: 10px; border-radius: 5px;">
   What if we pass a sentence containing words that are not in the vocabulary?
</div>

In [26]:
text = "Hello, do you like some tea?"
ids = tokenizer.encode(text)
print(ids)

KeyError: 'Hello'

## Adding Special Context Tokens to Handle Text Beyond the Training Set

**We add two new tokens: `<|unk|>`, which stands in for any token that is not part of the vocabulary, and `<|endoftext|>`, which was used in GPT pre-training to mark the boundary between unrelated text sources.**
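
The fallback idea can be sketched on a miniature vocabulary before it is applied to the real one. Everything named here (`mini_vocab`, `encode_with_unk`) is hypothetical, and `dict.get` is just one way to express the fallback.

```python
import re

# Hypothetical miniature vocabulary with the two special tokens appended last.
tokens = sorted({'do', 'like', 'tea', 'you', ',', '?'})
tokens.extend(["<|endoftext|>", "<|unk|>"])
mini_vocab = {token: idx for idx, token in enumerate(tokens)}

def encode_with_unk(text):
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    pieces = [p for p in pieces if p.strip()]
    unk_id = mini_vocab["<|unk|>"]
    # dict.get falls back to the <|unk|> ID instead of raising KeyError.
    return [mini_vocab.get(p, unk_id) for p in pieces]

print(encode_with_unk("Hello, do you like tea?"))
```

Here `Hello` is the only word missing from the miniature vocabulary, so it is the only one mapped to the `<|unk|>` ID.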

In [33]:
all_tokens = sorted(set(preProcessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: idx for idx, token in enumerate(all_tokens)}  # idx avoids shadowing the built-in id()
print(len(vocab))

1132


<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    This increases the size of the vocabulary from 1130 to 1132.
</div>

**Quickly printing the last 5 entries in the vocabulary**

In [35]:
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [44]:
class simpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab                             # token string -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}  # ID -> token string

    def encode(self, text):
        preprocessed = re.split(r'([.,:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace every token that is not in the vocabulary with <|unk|>.
        preprocessed = [item if item in self.str_to_int
                        else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the whitespace that join() inserted before the listed punctuation.
        text = re.sub(r'\s+([,.:_?!"()\'])', r'\1', text)
        return text

In [47]:
tokenizer = simpleTokenizerV2(vocab)
text = "Hello, do you like some tea?"
ids = tokenizer.encode(text)
print(ids)

[1131, 5, 355, 1126, 628, 910, 975, 10]


In [50]:
text1 = "Hello, do you like some tea?"
text2 = "In the sunlit terraces of the palace."
text = "<|endoftext|> ".join((text1, text2))
print(text)

Hello, do you like some tea?<|endoftext|> In the sunlit terraces of the palace.


In [51]:
tokenizer.encode(text)

[1131,
 5,
 355,
 1126,
 628,
 910,
 975,
 10,
 1130,
 55,
 988,
 956,
 984,
 722,
 988,
 1131,
 7]

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Both the unknown-token and end-of-text tokens work as expected: IDs 1131 and 1130 appear in the output.
</div>

In [52]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like some tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

**Here, `Hello` and `palace` are the two words that do not occur in the training text, so both decode to `<|unk|>`.**

## Byte Pair Encoding [Sub-Word Tokenizer]

<div style="background-color: lightyellow; color: red; padding: 10px; border-radius: 5px;">
  BPE is a fairly involved algorithm, so we use tiktoken, an existing Python library developed by OpenAI that provides the tokenizer used for GPT models.
</div>
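
To give a feel for the core idea without the full algorithm, here is a minimal sketch of the merge step that BPE training repeats: count adjacent symbol pairs, then fuse the most frequent pair into a single symbol. This is an assumption-laden toy that works on characters of a made-up string. tiktoken itself works on bytes with a pre-trained merge table, and the helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair in a list of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("aaabdaaabac")
for _ in range(2):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
    print(pair, symbols)
```

Each merge shortens the sequence while keeping the underlying text unchanged, which is why frequent character runs end up as single tokens and rare words fall back to smaller pieces.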

In [2]:
! pip3 install tiktoken






**Checking the installed version of tiktoken**

In [5]:
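# Import the submodule explicitly; a bare "import importlib" does not guarantee that importlib.metadata is loaded.
import importlib.metadata
import tiktoken
print("Tiktoken Version: ", importlib.metadata.version("tiktoken"))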

Tiktoken Version:  0.9.0


<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
    Once installed we can instantiate the BPE tokenizer from tiktoken as follows:
</div>

In [6]:
tokenizer = tiktoken.get_encoding("gpt2")

**Now testing its methods, just as we did with simpleTokenizerV2**

In [7]:
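# Note: the two adjacent string literals are joined with no space, so the text contains "terracesof".
text = ("Do you like some tea? <|endoftext|> In the sunlit terraces""of someunknownPlace.")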
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[5211, 345, 588, 617, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [8]:
strings = tokenizer.decode(integers)
print(strings)

Do you like some tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


**Another example: BPE encodes and decodes an unknown word by splitting it into subword units**

In [9]:
integers = tokenizer.encode("Akwirw ier")
print(integers)
text = tokenizer.decode(integers)
print(text)

[33901, 86, 343, 86, 220, 959]
Akwirw ier


# Input-Target Pairs in Python

<div style = "color:green; background-color:lightgreen; padding: 10px; border-radius: 5px">
    In this lecture, we learn how to implement a data loader that fetches the input-target pairs using a sliding-window approach.
</div>

**We will first encode the entire dataset using the BPE tokenizer from tiktoken**

In [18]:
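!pip3 install tiktoken

# Import the submodule explicitly; a bare "import importlib" does not guarantee that importlib.metadata is loaded.
import importlib.metadata
import tiktoken
import re

# Installed version of tiktoken:
print("Tiktoken Version: ", importlib.metadata.version("tiktoken"))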

Tiktoken Version:  0.9.0


**Instantiate the BPE tokenizer**

In [11]:
tokenizer = tiktoken.get_encoding('gpt2')
text = "Hello, World!"
tokenizer.encode(text)

[15496, 11, 2159, 0]

**Passing The Dataset To The BPE Encoder**

In [22]:
with open('datasets/theVerdict.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

clean_text = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
clean_text = [item.strip() for item in clean_text if item.strip()]
enc_text = tokenizer.encode(" ".join(clean_text), allowed_special={"<|endoftext|>"})
print(len(enc_text))


5110


<div style="background-color: lightyellow; color: red; padding: 10px; border-radius: 5px;">
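  The encoded dataset contains 5,110 tokens. This is the length of the token sequence, not the size of the vocabulary: the GPT-2 BPE vocabulary has 50,257 entries. Encoding raw_text directly, without the regex pre-splitting step, gives a slightly different count of around 5,145.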
</div>

In [13]:
enc_sample = enc_text[50:]

<div style="background-color: lightgreen; color: green; padding: 10px; border-radius: 5px;">
  The context size is the number of input tokens provided to the LLM to predict the next token.
</div>

In [14]:
context_size = 4 # Number of tokens in each input sequence
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


**With the targets being the inputs shifted by one position, we can set up the next-word prediction task as follows:**

In [15]:
for i in range (1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, '------->',  desired)

[290] -------> 4920
[290, 4920] -------> 2241
[290, 4920, 2241] -------> 287
[290, 4920, 2241, 287] -------> 257


**Now, converting the token IDs back into text to make this clearer.**

In [23]:
for i in range (1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), '-------->', tokenizer.decode([desired]))

 and -------->  established
 and established -------->  himself
 and established himself -------->  in
 and established himself in -------->  a
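
The same shift-by-one idea extends to a sliding window over the whole token sequence. The generator below is a minimal sketch and not the data loader itself: `sliding_window_pairs` and its `stride` parameter are hypothetical names, and it works on plain Python lists so it runs without tiktoken.

```python
def sliding_window_pairs(token_ids, context_size, stride=1):
    """Yield (inputs, targets) lists; targets are the inputs shifted by one token."""
    # Stop early enough that the target window never runs past the end.
    for start in range(0, len(token_ids) - context_size, stride):
        inputs = token_ids[start:start + context_size]
        targets = token_ids[start + 1:start + context_size + 1]
        yield inputs, targets

# The five token IDs that appear in the x and y outputs above.
sample_ids = [290, 4920, 2241, 287, 257]
pairs = list(sliding_window_pairs(sample_ids, context_size=4))
print(pairs)

# A stride equal to the context size gives non-overlapping input windows.
strided = list(sliding_window_pairs(list(range(9)), context_size=4, stride=4))
print(strided)
```

A stride of 1 produces heavily overlapping windows, while a larger stride trades overlap for fewer training pairs.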
