# Reading a short story into Python as a text sample

# Step 1: Creating Tokens

In [1]:
with open("the-verdict.txt","r",encoding="utf-8") as f:
    raw_text=f.read()
print("Total number of characters:",len(raw_text))
print(raw_text[:99])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


This Python code reads a text file named the-verdict.txt and gives you:

The total number of characters in the file.

The first 99 characters from the file's content.



In [2]:
import re 
text="Hello, world. This,is a test"
result= re.split(r'(\s)',text)
print(result)

['Hello,', ' ', 'world.', ' ', 'This,is', ' ', 'a', ' ', 'test']


🔍 What's Happening?

re.split(): Splits a string wherever the pattern matches.


r'(\s)':


\s → Matches any whitespace (space, tab, newline).


The pattern is inside parentheses (), which creates a capturing group.


📌 When you use a capturing group, the delimiter (matched part) is also included in the result list.


🔧 So re.split(r'(\s)', text) will:

Split text at each space.


Include the space itself in the result list.

🔎 Output:

['Hello,', ' ', 'world.', ' ', 'This,is', ' ', 'a', ' ', 'test']
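
To make the effect of the capturing group concrete, the following small sketch runs the same split twice, once without and once with parentheses around \s. The variable names without_group and with_group are only illustrative.

```python
import re

text = "Hello, world. This,is a test"

# Without a capturing group, the whitespace delimiters are discarded.
without_group = re.split(r'\s', text)

# With a capturing group, each matched delimiter is kept as its own list item.
with_group = re.split(r'(\s)', text)

print(without_group)  # ['Hello,', 'world.', 'This,is', 'a', 'test']
print(with_group)     # ['Hello,', ' ', 'world.', ' ', 'This,is', ' ', 'a', ' ', 'test']
```

Keeping the delimiters matters later, when punctuation is added to the pattern: punctuation marks should survive as tokens, while whitespace can be filtered out afterwards.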





#### Separate commas and periods as well

In [3]:
result=re.split(r'([,.]|\s)',text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', 'is', ' ', 'a', ' ', 'test']


#### Our list still includes whitespace and empty strings

In [4]:
result=[item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test']


In [5]:
import re

text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)


['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [6]:
preprocessed = re.split(r'([,.:;?_!()\']|--|\s)', raw_text)
preprocessed=[item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [7]:
print(len(preprocessed))

4606


# Step 2: Converting Tokens into Token IDs

In [8]:
# We tokenized the short story and assigned the result to the Python variable preprocessed


In [9]:
all_words=sorted(set(preprocessed))
vocab_size=len(all_words)
print(vocab_size)

1158


# Now we create the vocabulary from the tokens


In [10]:
vocab={token:integer for integer,token in enumerate(all_words)}


In [11]:
for i,item in enumerate(vocab.items()):
    print(item)
    if i>=50:
        break

('!', 0)
('"', 1)
('"Ah', 2)
('"Be', 3)
('"Begin', 4)
('"By', 5)
('"Come', 6)
('"Destroyed', 7)
('"Don', 8)
('"Gisburns"', 9)
('"Grindles', 10)
('"Hang', 11)
('"Has', 12)
('"How', 13)
('"I', 14)
('"If', 15)
('"It', 16)
('"Jack', 17)
('"Money', 18)
('"Moon-dancers"', 19)
('"Mr', 20)
('"Mrs', 21)
('"My', 22)
('"Never', 23)
('"Of', 24)
('"Oh', 25)
('"Once', 26)
('"Only', 27)
('"Or', 28)
('"That', 29)
('"The', 30)
('"Then', 31)
('"There', 32)
('"This', 33)
('"We', 34)
('"Well', 35)
('"What', 36)
('"When', 37)
('"Why', 38)
('"Yes', 39)
('"You', 40)
('"but', 41)
('"deadening', 42)
('"dragged', 43)
('"effects"', 44)
('"interesting"', 45)
('"lift', 46)
('"obituary"', 47)
('"strongest', 48)
('"strongly"', 49)
('"sweetly"', 50)


In [12]:
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


# 💡 What is this class?
SimpleTokenizer is a custom class made to:


Convert text (like "Hello, world!") into numbers using a given dictionary (called vocab).

Convert those numbers back into text.

# 🧱 Class Structure Breakdown

## 1. __init__(self, vocab)
This is the constructor — it runs when you create a new object from the class.
vocab is a dictionary that maps words or characters to numbers.

Example

vocab = {"Hello": 1, ",": 2, "world": 3, "!": 4}
self.str_to_int = vocab
→ Stores the dictionary for encoding.

self.int_to_str = {i: s for s, i in vocab.items()}
→ Creates a reverse dictionary so we can decode (convert numbers back to text).

## 2. encode(self, text)
Takes a string as input (like "Hello, world!").

Splits the text into small parts: words, spaces, and punctuation using a regex.

re.split(r'([,.:;?_!"()\']|--|\s)', text)
This separates:

Words → "Hello"

Punctuation → ","

Spaces → " " (also removed later)

Removes empty or space-only parts using:

[item.strip() for item in preprocessed if item.strip()]
Then converts each word or symbol to its number using the vocabulary.

Example: 
"Hello, world!" → ["Hello", ",", "world", "!"] → [1, 2, 3, 4]

Note that this pattern also splits on the double quote ("), unlike the pattern used in In [6] to build the vocabulary. A quoted word such as "Ah is therefore encoded as the two tokens " and Ah, and the lookup raises a KeyError if the bare word is not in the vocabulary.


## 3. decode(self, ids)
Takes a list of numbers as input (like [1, 2, 3, 4]).

Uses self.int_to_str to convert numbers back to words or punctuation.

Joins them into a string using spaces:

"Hello , world !"
Cleans up spaces before punctuation using:

re.sub(r'\s+([,.?!"()\'])', r'\1', text)
Final result: "Hello, world!"
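
The walkthrough above can be checked end to end with the four-entry example vocabulary. This is a minimal sketch: it repeats the class from In [12] so that it runs on its own, and the names toy_vocab, toy_tokenizer, toy_ids and toy_text are only illustrative.

```python
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # ID -> token

    def encode(self, text):
        # Split on punctuation, double dashes and whitespace, keeping delimiters.
        parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        # Drop empty and whitespace-only items.
        parts = [p.strip() for p in parts if p.strip()]
        return [self.str_to_int[p] for p in parts]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that the join put in front of punctuation.
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

toy_vocab = {"Hello": 1, ",": 2, "world": 3, "!": 4}
toy_tokenizer = SimpleTokenizer(toy_vocab)

toy_ids = toy_tokenizer.encode("Hello, world!")
toy_text = toy_tokenizer.decode(toy_ids)

print(toy_ids)   # [1, 2, 3, 4]
print(toy_text)  # Hello, world!
```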



In [13]:
tokenizer=SimpleTokenizer(vocab)

text=""""It's the last he painted,you know,"
      Mrs.Gisburn said with pardonable pride"""
ids=tokenizer.encode(text)
print(ids)

[1, 95, 51, 880, 1015, 633, 564, 776, 54, 1154, 627, 54, 1, 104, 56, 82, 881, 1136, 784, 823]


🔍 item.strip() — What is it?
strip() is a Python string method that removes leading and trailing whitespace from a string.
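
As a small illustration of why the filter `if item.strip()` works: stripping an empty or whitespace-only string returns the empty string, which Python treats as false. The list parts below is a made-up example in the shape that re.split produces.

```python
parts = ['Hello', ',', '', ' ', 'world', '.', '', '\n', 'test']

# "".strip(), " ".strip() and "\n".strip() all return "", which is falsy,
# so the condition drops empty and whitespace-only items.
cleaned = [item.strip() for item in parts if item.strip()]

print(cleaned)  # ['Hello', ',', 'world', '.', 'test']
```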



In [14]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride'

✅ Does sorting the vocabulary affect decoding?
Sorting the vocabulary before assigning IDs doesn't matter, as long as:

The tokenizer remembers which token got which ID during encoding.

It uses the same mapping during decoding.

So, sorting only affects how IDs are assigned, not the actual sentence meaning, because:

🧠 The tokenizer doesn't rely on word order in the vocab; it relies on the ID-token mapping it created and stored.

📌 Analogy:
Imagine you assign numbers to words alphabetically:

'Apple' = 0, 'Banana' = 1, 'Cat' = 2

Your sentence: "Banana Cat Apple" → [1, 2, 0]

While decoding: [1, 2, 0] → "Banana Cat Apple"

✅ It works because you use the same mapping in reverse.
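
The analogy can be sketched in code. The helper names build_maps and round_trip are hypothetical. The point is only that any one-to-one assignment of IDs, alphabetical or shuffled, decodes back to the original sentence as long as the same mapping is used in both directions.

```python
import random

words = ["Apple", "Banana", "Cat"]
sentence = ["Banana", "Cat", "Apple"]

def build_maps(ordered_words):
    """Assign IDs in the given order and build the reverse mapping."""
    to_id = {w: i for i, w in enumerate(ordered_words)}
    to_word = {i: w for w, i in to_id.items()}
    return to_id, to_word

def round_trip(ordered_words, tokens):
    """Encode tokens to IDs and decode them again with the same mapping."""
    to_id, to_word = build_maps(ordered_words)
    ids = [to_id[t] for t in tokens]
    return ids, [to_word[i] for i in ids]

# Alphabetical assignment, as in the analogy above.
sorted_ids, sorted_decoded = round_trip(sorted(words), sentence)
print(sorted_ids)      # [1, 2, 0]
print(sorted_decoded)  # ['Banana', 'Cat', 'Apple']

# A shuffled assignment produces different IDs but the same decoded sentence.
shuffled = words[:]
random.Random(0).shuffle(shuffled)
shuffled_ids, shuffled_decoded = round_trip(shuffled, sentence)
print(shuffled_decoded == sentence)  # True
```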



In [15]:
text="Hello, do you like tea?"
print(tokenizer.encode(text))   # "Hello" is not in the vocabulary, so this raises a KeyError


KeyError: 'Hello'

# Adding special tokens to deal with unknown words

In [None]:
# Add two special tokens: one marking the end of a text and one for unknown words
all_token=sorted(list(set(preprocessed)))
all_token.extend(["<|endoftext|>","<|unk|>"])

vocab={token:integer for integer,token in enumerate (all_token)}

In [16]:
len(vocab.items())

1158

The count is still 1158 because the cell that appends the two special tokens is marked In [None], meaning it was not executed in this session. Once it runs, the vocabulary has 1160 entries and the two special tokens receive the IDs 1158 and 1159, after 'yourself' (1157).

In [17]:
for i,item in enumerate (list(vocab.items())[-5:]):
    print(item)

('yet', 1153)
('you', 1154)
('younger', 1155)
('your', 1156)
('yourself', 1157)
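
With an unknown-word token in the vocabulary, the KeyError from In [15] can be avoided by replacing every out-of-vocabulary token before the lookup. The following is a minimal sketch, not a definitive implementation. The class name SimpleTokenizerWithUnk and the toy vocabulary are made up for illustration, and the special tokens are assumed to be named <|endoftext|> and <|unk|>.

```python
import re

class SimpleTokenizerWithUnk:
    """Hypothetical variant of SimpleTokenizer that tolerates unknown tokens."""

    def __init__(self, vocab, unk_token="<|unk|>"):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}
        self.unk_token = unk_token  # assumed name of the unknown-word token

    def encode(self, text):
        parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        parts = [p.strip() for p in parts if p.strip()]
        # Replace anything missing from the vocabulary with the unknown token.
        parts = [p if p in self.str_to_int else self.unk_token for p in parts]
        return [self.str_to_int[p] for p in parts]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Toy vocabulary: a few sorted tokens plus the two special tokens at the end.
toy_tokens = sorted({"do", "you", "like", "tea", ",", "?"})
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {tok: i for i, tok in enumerate(toy_tokens)}

unk_tokenizer = SimpleTokenizerWithUnk(toy_vocab)
unk_ids = unk_tokenizer.encode("Hello, do you like tea?")
unk_text = unk_tokenizer.decode(unk_ids)

print(unk_ids)   # [7, 0, 2, 5, 3, 4, 1]
print(unk_text)  # <|unk|>, do you like tea?
```

The round trip is lossy by design: "Hello" comes back as the unknown token, which is the price of not failing on unseen words.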


# BYTE PAIR ENCODING

Byte Pair Encoding (BPE) is a simple and effective data compression and tokenization technique, especially popular in Natural Language Processing (NLP). It works by iteratively replacing the most frequent pair of bytes (or characters) in a sequence with a single, unused byte or symbol.
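
The merge loop described above can be sketched on a toy string. This is a simplified character-level illustration, not the byte-level procedure used by GPT-2. Merged pairs are represented by concatenating the two symbols rather than by a new unused byte, and the function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most frequent adjacent pair, or None if there is none."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of pair with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def toy_bpe(text, num_merges):
    """Run up to num_merges merge steps, starting from single characters."""
    symbols = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        symbols = merge_pair(symbols, pair)
        merges.append(pair)
    return symbols, merges

bpe_symbols, bpe_merges = toy_bpe("aaabdaaabac", 3)
print(bpe_symbols)  # ['aaab', 'd', 'aaab', 'a', 'c']
print(bpe_merges)
```

After three merges the eleven characters are represented by five symbols, and the frequent substring "aaab" has become a single token.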

In [21]:
! pip install tiktoken






### 🧠 What is tiktoken?

tiktoken is a fast tokenizer library developed by OpenAI for encoding text into tokens — specifically designed to work with OpenAI models like GPT-3, GPT-3.5, and GPT-4.


### 🚀 Key Purpose:

To convert text into tokens (smallest units like words or subwords) and back from tokens to text, which is essential for LLMs like GPT to process input.

In [24]:
import importlib.metadata
import tiktoken

print("Tiktoken version ", importlib.metadata.version("tiktoken"))


Tiktoken version  0.9.0


In [25]:
tokenizer = tiktoken.get_encoding("gpt2")   # the function is get_encoding, not get_encoded