🔹 OpenWebText
An open-source effort to replicate the dataset used to train GPT-2. It includes web pages linked from high-karma Reddit posts, excluding spam and low-quality content.

Size: ~40 GB

License: MIT

Notes: Emphasizes quality text from reputable online sources.

Link: https://www.kaggle.com/datasets/himonsarkar/openwebtext-dataset

A smaller subset of the full OpenWebText dataset:
Dataset (stas/openwebtext-10k): https://huggingface.co/datasets/stas/openwebtext-10k
- OpenWebText is open source.
- This subset is small: about 15 MB compressed and 50 MB uncompressed.

In [None]:
# #openweb 
# from datasets import load_dataset

# ds = load_dataset("stas/openwebtext-10k")

### Step: 1 Loading text, splitting into tokens and cleaning

(For now we use the-verdict.txt, a book, to learn the basics. In the second half we work on custom data, i.e. "openwebtext", and apply what we learned.)

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f: #reading the file 
    raw_text = f.read() #storing in the variable raw_text

print("Total number of characters: ", len(raw_text)) #print total number of characters
print(raw_text[:99])  #print the first 99 characters (characters, not words; spaces count too). If this works, the data is loaded and we're ready for the next step: splitting the text into words and subwords

Total number of characters:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 




#### The regular expression library (`re`) is used to split text on specific keywords or patterns.
- A regular expression (regex) is a pattern-matching tool used to search, extract, replace, or manipulate strings/text data using specific character patterns.
- `re`: Python's built-in module that provides functions and tools for using regular expressions.

| Function                               | Description                            |
| -------------------------------------- | -------------------------------------- |
| `re.search(pattern, string)`           | Finds **first match**                  |
| `re.findall(pattern, string)`          | Returns **all matches**                |
| `re.sub(pattern, replacement, string)` | **Replaces** all matches               |
| `re.split(pattern, string)`            | Splits string by pattern               |
| `re.match(pattern, string)`            | Matches pattern **from the beginning** |

-> There are several functions above, but we'll only use `re.split` to split our text data as needed.
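To make the table above concrete, here is a small sketch that calls each of the five functions on one string. The sample sentence and the variable names are made up for illustration; only the `re` calls themselves come from the table.

```python
import re

# Hypothetical sample text, used only to illustrate the five functions
sample = "Order 66 shipped on 2024-05-01, order 67 is pending."

first = re.search(r"\d+", sample)            # first match only
all_numbers = re.findall(r"\d+", sample)     # every non-overlapping match
masked = re.sub(r"\d", "#", sample)          # replace each digit with '#'
parts = re.split(r",\s*", sample)            # split on a comma plus optional spaces
at_start = re.match(r"Order", sample)        # succeeds: the pattern is at position 0
not_at_start = re.match(r"shipped", sample)  # None: "shipped" is not at position 0

print(first.group())
print(all_numbers)
print(masked)
print(parts)
print(at_start is not None, not_at_start is None)
```

Note that `re.search` and `re.match` return a match object (or `None`), while `re.findall` and `re.split` return plain lists.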

#### Common Use Cases of Regex:
| Task                   | Example Regex  | What it Does                                |
| ---------------------- | -------------- | ------------------------------------------- |
| **Find numbers**       | `\d+`          | Matches one or more digits                  |
| **Find words**         | `\w+`          | Matches one or more word characters         |
| **Whitespace**         | `\s+`          | Matches spaces, tabs, newlines              |
| **Email extraction**   | `\w+@\w+\.\w+` | Matches a simple email pattern              |
| **Remove punctuation** | `[^\w\s]`      | Matches anything that’s NOT a word or space |


##### 🧠 Why Is Regex Useful?
- Data cleaning (e.g., remove HTML tags, punctuation)
- Information extraction (e.g., extract dates, names)
- Preprocessing for NLP (e.g., splitting sentences)
- Validating input (e.g., form inputs like passwords or emails)
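As a rough illustration of the cleaning and extraction use cases listed above, the sketch below applies patterns from the earlier table to a made-up HTML snippet. The snippet and the tag-stripping pattern `<[^>]+>` are my own assumptions for the example, and the email pattern is the simplified one from the table, not a full validator.

```python
import re

# Hypothetical input, for illustration only
html = "<p>Contact: jane@example.com, call 555-0100!</p>"

no_tags = re.sub(r"<[^>]+>", "", html)          # data cleaning: strip HTML tags
emails = re.findall(r"\w+@\w+\.\w+", no_tags)   # information extraction: simple email pattern
numbers = re.findall(r"\d+", no_tags)           # information extraction: runs of digits
no_punct = re.sub(r"[^\w\s]", "", no_tags)      # data cleaning: drop punctuation

print(no_tags)
print(emails)
print(numbers)
print(no_punct)
```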

In [14]:
import re

# The following is a regex pattern designed to split a string on punctuation, whitespace, and dashes, while keeping the delimiters.
result = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
print(result[:99]) #this will still have spaces as individual tokens

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '--', 'though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough', '--', 'so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that', ',', '', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory', ',', '', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting', ',', '', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow', ',', '', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a']


What's happening in the above code?
1. r'...': The r prefix makes it a raw string, so backslashes (\) are not interpreted as escape characters by Python.

2. Outer parentheses (...): the parentheses that wrap the whole pattern, right after re.split(r'
- This is a capturing group, which means re.split() keeps the delimiters in the result (instead of discarding them).

3. Inside the group:

| Part           | Meaning                                                                                                                                                           |
| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `[,.:;?_!"()\']` | Matches **any single character** that is a comma, dot, colon, semicolon, question mark, underscore, exclamation mark, double quote, **open or close parenthesis**, or single quote (written as `\'` inside the pattern) |
| `--`           | Matches **double dash** (not a character class — it's a separate alternative)                                                                                     |
| `\s`           | Matches **any whitespace** (space, tab, newline, etc.)                                                                                                            |
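To see what the capturing group changes, the sketch below splits the same string twice: once with a capturing group `( ... )` and once with a non-capturing group `(?: ... )`. The shortened pattern and the sample string are assumptions made for the example.

```python
import re

# Hypothetical sample string; the pattern is a shortened version of the one above
text = "Hello, world--again."

with_group = re.split(r'([,.]|--|\s)', text)       # capturing group: delimiters are kept
without_group = re.split(r'(?:[,.]|--|\s)', text)  # non-capturing group: delimiters are dropped

print(with_group)
print(without_group)
```

Both results contain empty strings wherever two delimiters are adjacent or a delimiter ends the text, and the first also contains `' '` items. That is why the next cell filters the list with `strip()`.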


In [15]:
#to remove the spaces
# item.strip() is a built-in string method that removes leading and trailing whitespace.
# Each item in result is stripped; if the stripped string is non-empty it goes into the new list,
# and if it is empty (the item was empty or contained only whitespace) it is skipped.
# The result is a new list containing only the cleaned, non-empty strings.
result = [item.strip() for item in result if item.strip()]

print(result[:25])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me']


In [16]:
print(len(result))

4690


### Step 2: Creating Token IDs

In [None]:
unique_words = sorted(set(result)) #set() creates a set of unique values from result, sorted() produces a new sorted list, which we store in unique_words
print(len(unique_words))

1130


#### -> Our vocab is very small here because our dataset is small. LLMs are trained on far larger data (billions of tokens or more), so their vocabularies are much larger too, typically tens of thousands to a few hundred thousand unique tokens, each with its own ID.

In [None]:
#creating vocabulary
vocab = {token: integer for integer, token in enumerate(unique_words)}

# enumerate(unique_words) returns (index, token) pairs for each token in the unique_words list.
# for integer, token in enumerate(unique_words) loops through those pairs.
# token: integer is the key-value pair for the dictionary.
# { ... } wraps it into a dictionary comprehension.

In [25]:
for i, (key, value) in enumerate(vocab.items()):
    if i >= 10:
        break
    print(f"{key}: {value}")

!: 0
": 1
': 2
(: 3
): 4
,: 5
--: 6
.: 7
:: 8
;: 9


#### => Now the main task starts: we'll build a simple Tokenizer class that automates the whole process, from raw text to tokens to looking up token IDs in the vocab dictionary, and back again.
(This will give us a solid understanding of tokenizers and help later when building more advanced ones.)

In [31]:
class SimpleTokenizer:
    def __init__(self, vocab): 
        self.str_to_int = vocab #vocab will be used to extract ID for input text tokens
        self.int_to_str = {i:s for s,i in vocab.items()} #extracting words associated with tokenID produced by LLM 

    def encode(self, text): 
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) #creating tokens
        preprocessed = [item.strip() for item in preprocessed if item.strip()] # removing spaces
        
        ids = [self.str_to_int[s] for s in preprocessed] #as we already have a dictionary of words, we'll extract id from vocab for the input we get. 
        return ids
        

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

In [None]:
#let's test our tokenizer 
testText = "It had always been his fate to have women say such things of him: the fact should be set down in extenuation." 
# this text is from our training data; any token that is not among the unique tokens in our vocab raises a KeyError, which is why we need much more data.

tokenizer = SimpleTokenizer(vocab) #creating an instance/object of our class

ids = tokenizer.encode(testText)
print(ids)

[56, 514, 149, 208, 549, 431, 1016, 530, 1112, 856, 949, 997, 722, 546, 8, 988, 420, 879, 198, 871, 362, 568, 412, 7]


In [34]:
#let's convert these IDs back to text to check our decoder
print(tokenizer.decode(ids))

It had always been his fate to have women say such things of him: the fact should be set down in extenuation.
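The round trip above works because every token in the test sentence is in the vocab. As a small sketch of the failure case, the code below uses a made-up three-entry vocab (`toy_vocab`) and a standalone function (`encode_strict`, a hypothetical name) that mirrors the lookup in `encode`, to show what happens with an unseen word.

```python
import re

# Hypothetical toy vocab, for illustration only
toy_vocab = {"Hello": 0, ",": 1, "world": 2}

def encode_strict(text, vocab):
    # same split-and-strip steps as SimpleTokenizer.encode
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [vocab[t] for t in tokens]  # plain dict lookup: raises KeyError for unseen tokens

print(encode_strict("Hello, world", toy_vocab))

try:
    encode_strict("Hello, there", toy_vocab)
except KeyError as err:
    missing = err.args[0]
    print("Unknown token:", missing)
```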


#### Special Context Tokens
- To handle tokens that don't exist in our dataset/vocab, we add special tokens to the vocab, such as <|unk|> (stands in for unknown tokens) and <|endoftext|> (can mark the boundary between separate texts).

In [None]:
newtokens = sorted(set(result)) 
print(newtokens[:10])

newtokens.extend(["<|unk|>", "<|endoftext|>"])

vocab = {token: integer for integer, token in enumerate(newtokens)} #adding new tokens to vocab

['!', '"', "'", '(', ')', ',', '--', '.', ':', ';']


In [None]:
print(newtokens[-3:]) 

['yourself', '<|unk|>', '<|endoftext|>']


In [40]:
print(len(vocab))

1132


In [51]:
for i, (key, value) in enumerate(list(vocab.items())[-3:]):
    print(f"{key}: {value}")

yourself: 1129
<|unk|>: 1130
<|endoftext|>: 1131


In [52]:
class SimpleTokenizerV2:
    def __init__(self, vocab): 
        self.str_to_int = vocab #vocab will be used to extract ID for input text tokens
        self.int_to_str = {i:s for s,i in vocab.items()} #extracting words associated with tokenID produced by LLM 

    def encode(self, text): 
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text) #creating tokens
        preprocessed = [item.strip() for item in preprocessed if item.strip()] # removing spaces
        
        #to handle tokens which are not in our dict, and avoid crashing our program
        preprocessed = [
            item if item in self.str_to_int #item represents a token in our preprocessed list; if that token exists in vocab it stays as it is
            else "<|unk|>" for item in preprocessed # else it is converted to <|unk|>
        ]

        ids = [self.str_to_int[s] for s in preprocessed] #as we already have a dictionary of words, we'll extract id from vocab for the input we get. 
        return ids
        

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)
        return text

In [None]:
#testing the new implementation
tokenizer2 = SimpleTokenizerV2(vocab)
testText2 = "Hi, what are you doing? I found the couple at tea beneath their palm-trees"

print(tokenizer2.encode(testText2))

[1130, 5, 1089, 169, 1126, 357, 10, 53, 469, 988, 296, 180, 975, 215, 989, 751]


In [54]:
tokenizer2.decode(tokenizer2.encode(testText2))

'<|unk|>, what are you doing? I found the couple at tea beneath their palm-trees'

-> When an out-of-vocab word is encountered, encoding it as <|unk|> doesn't really help: the word itself is lost, so it cannot be recovered when decoding, and this affects the model output.

-> To deal with this, "byte pair encoding" is used.

### Short summary of all the steps performed so far.
1. Split the text into individual tokens using a regular expression.
2. Removed whitespace.
3. Created a sorted set of unique tokens.
4. Built a vocab from this sorted set, giving each token a unique token ID.
5. Created a tokenizer class that uses this vocab of tokens and their IDs to encode text into token IDs and decode token IDs back into text.
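The five steps above can be sketched end to end on a tiny made-up sentence. This is a condensed illustration, not a replacement for the classes above; the names `demo_text`, `demo_vocab`, and `demo_id_to_token` are hypothetical and chosen so they don't overwrite the notebook's own `vocab`.

```python
import re

demo_text = "The cat sat. The cat ran!"  # hypothetical sample text

# Steps 1-2: split on punctuation/whitespace, then drop empty and whitespace-only pieces
demo_tokens = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', demo_text) if t.strip()]

# Steps 3-4: sorted unique tokens -> token-to-ID mapping (and the reverse mapping)
demo_vocab = {tok: idx for idx, tok in enumerate(sorted(set(demo_tokens)))}
demo_id_to_token = {idx: tok for tok, idx in demo_vocab.items()}

# Step 5: encode to IDs, then decode back and remove the space before punctuation
demo_ids = [demo_vocab[t] for t in demo_tokens]
decoded = " ".join(demo_id_to_token[i] for i in demo_ids)
decoded = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', decoded)

print(demo_tokens)
print(demo_vocab)
print(demo_ids)
print(decoded)
```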

 ____

# Working with a custom dataset, i.e. openwebtext-10k
- Importing a smaller version of the OpenWebText data.
- This is a much larger dataset than the "the-verdict" text file.
- Therefore, it will require a better regular expression to separate the various types of tokens.

In [None]:
#openweb 
from datasets import load_dataset

#loading dataset
ds = load_dataset("stas/openwebtext-10k", cache_dir="./data")

README.md:   0%|          | 0.00/951 [00:00<?, ?B/s]

openwebtext-10k.py: 0.00B [00:00, ?B/s]

0000.parquet:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [None]:
#exploring data
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 10000
    })
})


In [60]:
print(ds.column_names)

{'train': ['text']}


In [81]:
print(ds['train'])

Dataset({
    features: ['text'],
    num_rows: 10000
})


In [None]:
#dataset has one feature and 10K entries/rows
#printing to see what it looks like
print(ds['train'][0])

{'text': "A magazine supplement with an image of Adolf Hitler and the title 'The Unreadable Book' is pictured in Berlin. No law bans “Mein Kampf” in Germany, but the government of Bavaria, holds the copyright and guards it ferociously. (Thomas Peter/REUTERS)\n\nThe city that was the center of Adolf Hitler’s empire is littered with reminders of the Nazi past, from the bullet holes that pit the fronts of many buildings to the hulking Luftwaffe headquarters that now house the Finance Ministry.\n\nWhat it doesn’t have, nor has it since 1945, are copies of Hitler’s autobiography and political manifesto, “Mein Kampf,” in its bookstores. The latest attempt to publish excerpts fizzled this week after the Bavarian government challenged it in court, although an expurgated copy appeared at newspaper kiosks around the country.\n\nBut in Germany — where keeping a tight lid on Hitler’s writings has become a rich tradition in itself — attitudes toward his book are slowly changing, and fewer people ar

In [None]:
train_ds = ds["train"]

In [69]:
print(train_ds.column_names)

['text']


In [None]:
print(len(train_ds)) #length of data             
train_ds.features  #to see the format/structure of data

10000


{'text': Value(dtype='string', id=None)}

#### "train_ds" (i.e. the ds["train"] part of the whole dataset) is a Dataset object with:
- 10,000 dictionaries, where each dictionary is one data sample (row), with 'text' as the key and the sample's text (abbreviated as 'sentence N' below) as the value
[
 
  {'text': 'sentence 1'},
 
  {'text': 'sentence 2'},
 
  ...
 
  {'text': 'sentence 10000'}

]


In [75]:
print(train_ds['text'][0])

A magazine supplement with an image of Adolf Hitler and the title 'The Unreadable Book' is pictured in Berlin. No law bans “Mein Kampf” in Germany, but the government of Bavaria, holds the copyright and guards it ferociously. (Thomas Peter/REUTERS)

The city that was the center of Adolf Hitler’s empire is littered with reminders of the Nazi past, from the bullet holes that pit the fronts of many buildings to the hulking Luftwaffe headquarters that now house the Finance Ministry.

What it doesn’t have, nor has it since 1945, are copies of Hitler’s autobiography and political manifesto, “Mein Kampf,” in its bookstores. The latest attempt to publish excerpts fizzled this week after the Bavarian government challenged it in court, although an expurgated copy appeared at newspaper kiosks around the country.

But in Germany — where keeping a tight lid on Hitler’s writings has become a rich tradition in itself — attitudes toward his book are slowly changing, and fewer people are objecting to i

In [67]:
for i in range(3):
    print(f"Row {i}:", train_ds[i]["text"])
    print("-" * 40)


Row 0: A magazine supplement with an image of Adolf Hitler and the title 'The Unreadable Book' is pictured in Berlin. No law bans “Mein Kampf” in Germany, but the government of Bavaria, holds the copyright and guards it ferociously. (Thomas Peter/REUTERS)

The city that was the center of Adolf Hitler’s empire is littered with reminders of the Nazi past, from the bullet holes that pit the fronts of many buildings to the hulking Luftwaffe headquarters that now house the Finance Ministry.

What it doesn’t have, nor has it since 1945, are copies of Hitler’s autobiography and political manifesto, “Mein Kampf,” in its bookstores. The latest attempt to publish excerpts fizzled this week after the Bavarian government challenged it in court, although an expurgated copy appeared at newspaper kiosks around the country.

But in Germany — where keeping a tight lid on Hitler’s writings has become a rich tradition in itself — attitudes toward his book are slowly changing, and fewer people are objecti

#### Now, applying tokenization in a way I've learned till now
- Splitting texts to token.
- Creating a new vocab for token ID's and creating a tokeinzer class based on this new bigger vocab.

In [None]:
#building a function to split text to tokens and remove whitespaces
#as we have multiple rows of data, we'll call this function for each row and convert texts in those rows to tokens

def text_to_token(text):
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text["text"]) #using the regular expression to split the text into tokens
    tokens = [t.strip() for t in tokens if t.strip()] #removing whitespaces
    return {"tokens": tokens} #returning a key value pair, with key="tokens" and tokenized text as value

tokenized_ds = train_ds.map(text_to_token) #.map applies our text_to_token function to each row and returns a new Dataset that includes the returned "tokens" column

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

In [96]:
print(tokenized_ds)
print("\nFollowing is the raw text from 1st row :")
print(tokenized_ds['text'][1]) #prints the text column at index 1 (indexing is 0-based, so this is the second row)
print("\nFollowing are the tokens of the raw text of 1st row :")
print(tokenized_ds['tokens'][1]) #prints the tokens column at the same index; the tokens are stored as a list

Dataset({
    features: ['text', 'tokens'],
    num_rows: 10000
})

Following is the raw text from 1st row :
For today’s post, I’d like to take a look at California’s voter initiative to legalize pot. If the measure passes, and the sky doesn’t fall, many other states will probably be looking at similar law changes in the near future. Our drug policy of the last century has simply not worked, and it’s heartening to see a state attempting to legalize marijuana.

The statistics on marijuana arrests are really shocking. According to the Drug Policy Alliance, which is in favor of legalization, blacks are arrested for marijuana possession between four and twelve times more than whites in California, even though studies have consistently shown that whites smoke more pot than blacks. In the last ten years, around 500,000 people have been arrested for possession. That’s absurd! Think about how expensive that is for the criminal justice system. California spends $216,000 for each juvenile inmate

- What happened: we tokenized the text and stored it as a key:value pair in the new dataset variable "tokenized_ds".
- For each text, we now have its tokens as a new feature/column in the dataset.

-> But there is a problem in the tokens above: the token "today’s" is not split into three separate tokens. When I looked into why, I found that the ’ in "today’s" is a curly apostrophe (Unicode: U+2019), not a straight ' (ASCII: 39), so our pattern does not match it.

=> To fix this there are 2 options:
    
    -> Update the regular expression to handle Unicode apostrophes like ’ and ‘ (and possibly other Unicode punctuation) as follows: 
        re.split(r'([,.:;?_!"()\']|--|\s|’|‘)', text["text"])
    
    -> Go broader using Unicode categories (requires the third-party regex module instead of re):
        import regex  # install with `pip install regex`
        regex.split(r'(\p{P}|\s)', text)  # splits on any Unicode punctuation or whitespace
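Here is a small sketch of the first option using only the standard `re` module. The sample sentence and the helper name `split_tokens` are made up for illustration, and the curly quotes are added inside the character class rather than as separate `|’|‘` alternatives, which matches the same characters.

```python
import re

# Hypothetical sample with one curly apostrophe (U+2019) and one straight apostrophe
sample = "For today’s post, it's fine."

ascii_only = r'([,.:;?_!"()\']|--|\s)'
with_curly = r'([,.:;?_!"()\'’‘]|--|\s)'  # U+2019 and U+2018 added to the character class

def split_tokens(pattern, text):
    # split, then drop empty and whitespace-only pieces
    return [t.strip() for t in re.split(pattern, text) if t.strip()]

print(hex(ord("’")), hex(ord("'")))       # the two apostrophes are different code points
print(split_tokens(ascii_only, sample))   # "today’s" stays in one piece
print(split_tokens(with_curly, sample))   # "today", "’", "s" are separate tokens
```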



In [None]:
#update-2
import regex  # install with `pip install regex`

def text_to_token(text):
    tokens = regex.split(r'(\p{P}|\s)', text["text"])  # splits on any Unicode punctuation or whitespace
    tokens = [t.strip() for t in tokens if t.strip()] #removing whitespaces
    return {"tokens": tokens}

tokenized_ds = train_ds.map(text_to_token)
print(tokenized_ds)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'tokens'],
    num_rows: 10000
})


In [100]:
print(tokenized_ds['tokens'][1])

['For', 'today', '’', 's', 'post', ',', 'I', '’', 'd', 'like', 'to', 'take', 'a', 'look', 'at', 'California', '’', 's', 'voter', 'initiative', 'to', 'legalize', 'pot', '.', 'If', 'the', 'measure', 'passes', ',', 'and', 'the', 'sky', 'doesn', '’', 't', 'fall', ',', 'many', 'other', 'states', 'will', 'probably', 'be', 'looking', 'at', 'similar', 'law', 'changes', 'in', 'the', 'near', 'future', '.', 'Our', 'drug', 'policy', 'of', 'the', 'last', 'century', 'has', 'simply', 'not', 'worked', ',', 'and', 'it', '’', 's', 'heartening', 'to', 'see', 'a', 'state', 'attempting', 'to', 'legalize', 'marijuana', '.', 'The', 'statistics', 'on', 'marijuana', 'arrests', 'are', 'really', 'shocking', '.', 'According', 'to', 'the', 'Drug', 'Policy', 'Alliance', ',', 'which', 'is', 'in', 'favor', 'of', 'legalization', ',', 'blacks', 'are', 'arrested', 'for', 'marijuana', 'possession', 'between', 'four', 'and', 'twelve', 'times', 'more', 'than', 'whites', 'in', 'California', ',', 'even', 'though', 'studies',

In [None]:
#collecting all tokens in a single list
all_tokens = [token for row in tokenized_ds 
                        for token in row["tokens"]]

#creating a sorted unique tokens list
unique_tokens = sorted(set(all_tokens))

print(unique_tokens[:250])

['!', '"', '#', '$', '$$', '$$$$', '$$3222111000', '$0', '$0$', '$0`', '$0x0', '$0x1', '$0x1020', '$0x20', '$0x5', '$0x6', '$1', '$1$', '$10', '$100', '$100+', '$1000', '$100K', '$100M', '$100NL']


In [107]:
print(len(unique_tokens))
print(unique_tokens[:1000])

184468
['!', '"', '#', '$', '$$', '$$$$', '$$3222111000', '$0', '$0$', '$0`', '$0x0', '$0x1', '$0x1020', '$0x20', '$0x5', '$0x6', '$1', '$1$', '$10', '$100', '$100+', '$1000', '$100K', '$100M', '$100NL', '$100k', '$100m', '$100s', '$101', '$102', '$103', '$104', '$104$', '$105', '$105$', '$106', '$107', '$108', '$108MM', '$109', '$10Highlight', '$10This', '$10m', '$10million', '$11', '$110', '$110M', '$110million', '$111', '$112', '$113', '$114', '$115', '$116', '$117', '$118', '$119', '$11m', '$11million', '$12', '$120', '$1200', '$121', '$122', '$123', '$124', '$1240', '$125', '$126', '$127M', '$127m', '$128', '$129', '$12It', '$12M', '$12million', '$12tn', '$13', '$130', '$1300', '$130m', '$132', '$133', '$134', '$135', '$135M', '$137', '$138', '$139', '$13bn', '$13m', '$14', '$140', '$1400', '$140B', '$140bn', '$141', '$142', '$142M', '$143', '$144', '$145', '$146', '$147', '$147bn', '$148', '$149', '$1499', '$14bn', '$15', '$150', '$150M', '$150k', '$151', '$153', '$153M', '$154',

In [None]:
'+2' in unique_tokens  # checking for token existence in the list

True

- The tokens are not what we want: mixed strings of numbers, symbols, and letters (e.g. '$100K' or '+2') are not split apart, because `\p{P}` matches punctuation but not symbols such as `$` or `+`. So the regular expression needs another update.

(I'm sure there are more advanced ways to build better tokenizers, but here I'm trying to build as robust a tokenizer as possible using the basic tools.)

In [None]:
#updating the tokenizer function further, to improve robustness 
#update-3

def text_to_token(text):
    tokens = regex.findall(r'\p{L}+|\p{N}+|[\p{S}\p{P}]', text["text"]) #regex.findall(...) to match only the desired tokens directly, giving cleaner token boundaries.
    return {"tokens": tokens}

tokenized_ds = train_ds.map(text_to_token)
print(tokenized_ds)

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'tokens'],
    num_rows: 10000
})


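Explanation of the pattern in the regular expression above:

\p{L}+ → one or more Unicode letters (e.g., words)

\p{N}+ → one or more Unicode numbers

[\p{S}\p{P}] → any single symbol or punctuation character (each one becomes its own token)

Whitespace matches none of these alternatives, so findall skips it and no separate strip() step is needed.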
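To try the idea without installing `regex`, the sketch below uses a standard-library approximation of the pattern. This is an assumption on my part and not an exact equivalent: `[^\W\d_]+` stands in for `\p{L}+`, `\d+` for `\p{N}+`, and `[^\w\s]|_` for `[\p{S}\p{P}]`, and the two versions can differ on less common Unicode characters (for example fractions or combining marks).

```python
import re

# Standard-library approximation of \p{L}+|\p{N}+|[\p{S}\p{P}] (not an exact equivalent)
approx_pattern = r'[^\W\d_]+|\d+|[^\w\s]|_'

sample = "It’s $100K, +2 more!"  # hypothetical sample text
approx_tokens = re.findall(approx_pattern, sample)
print(approx_tokens)
```

Letters, digits, and symbols now land in separate tokens, so strings like `$100K` and `+2` no longer survive as single vocab entries.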

In [116]:
all_tokens = [token for row in tokenized_ds 
                        for token in row["tokens"]]

unique_tokens = sorted(set(all_tokens))

print(unique_tokens[:250])

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '00', '000', '0000', '00000', '000000', '00000000', '000000000000', '0000000000000', '00000000000000', '0000000000000000', '0000000000000030', '00000000001', '00000001', '00000003', '00000008', '00000009', '0000001', '000000125', '00000016', '0000004', '000001', '000002', '000004', '000005', '000006', '00001', '0001', '000142', '00017', '0002', '00021', '000213', '00025', '0003', '000387', '0004', '0005', '0006', '00070', '00075', '00076125', '00081', '001', '0010', '0011', '001132', '0012', '00120', '00125', '0019', '00190', '002', '00221', '0023', '00238', '0026', '0026358', '00276319', '0029', '003', '0033', '0037', '004', '004123', '00416', '0042', '0044', '0048', '005', '005901', '005930', '006', '00658', '0067', '00684', '0069', '007', '0070', '00701532', '0073', '0075', '0079', '008', '00814037', '0087', '00872', '0089', '009', '00957359', '00996', '01', '010', '0100', '0101', '0101000020', '01014',

'0000000000000030' etc. might come from:
- Code dumps

- Malformed content

- Padding or spammy text

- Machine-generated text

In [117]:
print(len(unique_tokens))

175292


In [None]:
# checking for previously generated unwanted token existence in the current unique_tokens list
unwanted_tokens = ['<','$$', '$$$$', '$$3222111000', '$0', '$0$', '$0`', '$2MM', '$2^3$', '$2^6$', '$0x0', '$0x1', '$0x1020', '$0x20', '+2', '$0x5', '$0x6', '$1', '$105', '$105$', '$106', '$107', '$108', '$108MM', '$109', '$10Highlight', '$10This', '$10m', '$10million', '$19', '$190', '$190k']
found = [token for token in unwanted_tokens if token in unique_tokens]
print("Tokens found in vocab:", found)

Tokens found in vocab: []


In [130]:
print(unique_tokens[-50:])

['ｍＷ程度と小さいので', 'ｐＦでＩＢＳに同調しました', '～', '｡', '･', 'ｲｨ', 'ﾟ', '￼', '�', '🇨', '🇳', '🇸', '🇺', '🌐', '🌹', '🎄', '🎉', '🏁', '🏻', '🏼', '🏽', '🏿', '🐒', '👀', '👋', '👏', '👑', '💀', '💃', '📊', '🔍', '🔥', '🖕', '🖖', '🗽', '😀', '😉', '😎', '😩', '😮', '😱', '😺', '🙁', '🙂', '🙌', '🤓', '🤔', '🤷', '🥖', '🦆']


Everything looks good, so proceeding further

In [165]:
# run this cell only once: re-running it appends the special tokens again, which shifts their IDs
unique_tokens.extend(['<|unk|>', '<|endoftext|>'])

In [166]:
#creating new vocabulary with more tokens
new_vocab = {token: idx for idx, token in enumerate(unique_tokens)}

In [167]:
print(len(new_vocab))

175294


### Creating a Tokenizer class, which will convert the input to Tokens and TokenIDs, and vice versa.

In [176]:
class Tokenizer_1:
    def __init__(self, new_vocab, unk_token="<|unk|>"): 
        self.str_to_int = new_vocab
        self.int_to_str = {i:s for s,i in new_vocab.items()} #extracting words associated with tokenID produced by LLM 
        self.unk_token = unk_token

    def encode(self, text): 
        preprocessed = regex.findall(r'\p{L}+|\p{N}+|[\p{S}\p{P}]', text) #creating tokens
        # preprocessed = [item.strip() for item in preprocessed if item.strip()] # removing spaces
        
        #handling tokens not present in vocab
        # preprocessed = [
        #     item if item in self.str_to_int #item represent a token in our list preprocessed, if that token exist in vocab it stays as it is
        #     else self.unk_token for item in preprocessed # else it is converted to <|unk|>
        # ]

        #dict.get with the <|unk|> ID as the default makes the commented-out list comprehension above unnecessary
        ids = [self.str_to_int.get(t, self.str_to_int[self.unk_token]) for t in preprocessed] #look up each token's ID in the vocab, falling back to the <|unk|> ID
        return ids
        

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # text = regex.findall(r'\p{L}+|\p{N}+|[\p{S}\p{P}]', text)
        return text

In [183]:
#testing tokenizer
adv_tokenizer = Tokenizer_1(new_vocab)

print(adv_tokenizer.encode("Hey man! How are you doing?"))

[40637, 135488, 0, 41745, 99696, 171544, 115331, 4481]


In [184]:
print(adv_tokenizer.decode(adv_tokenizer.encode("Hey man! How are you doing?")))

Hey man ! How are you doing ?
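The decoded text has a space before "!" and "?" because `decode` joins every token with a space. One possible clean-up, sketched below, reuses the idea from the earlier `SimpleTokenizer.decode`. The helper name `tidy_spacing` and the choice of punctuation marks are my own assumptions; this only handles a few common closing marks and does not restore the original spacing in general.

```python
import re

def tidy_spacing(text):
    # remove the whitespace that " ".join() placed before common closing punctuation
    return re.sub(r'\s+([,.:;?!])', r'\1', text)

tidied = tidy_spacing("Hey man ! How are you doing ?")
print(tidied)
```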


In [185]:
print(adv_tokenizer.encode("Hello baccho, aaj hum padhenge Physics"))

[40279, 175294, 11, 175294, 127275, 175294, 67348]


In [None]:
#testing for some possible unknown words
print(adv_tokenizer.decode(adv_tokenizer.encode("Hello baccho, aaj hum padhenge Physics")))

Hello <|unk|> , <|unk|> hum <|unk|> Physics


- This shows that even with a vocab of 175,292 unique tokens (plus the two special tokens), there are many possible words that aren't present in our vocab.
- It also gives a sense of the scale of data and tokens required to train a real-world LLM.
- Also, as the output above shows, mapping every unknown word to <|unk|> is not the best way to handle unknown tokens.
- To overcome these limitations with tokens and unknown tokens, byte-pair encoding will help us make the system more robust.
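As a preview of the idea, here is a toy sketch of the merge step at the heart of byte-pair encoding: start from single characters and repeatedly merge the most frequent adjacent pair into one symbol. This is a simplified character-level illustration under my own assumptions (the function names and the three-word corpus are made up); real BPE tokenizers such as GPT-2's work on bytes and learn tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# Toy corpus: each word starts as a list of single characters
bpe_words = [list("low"), list("lower"), list("lowest")]
for _ in range(2):  # perform two merges
    bpe_words = merge_pair(bpe_words, most_frequent_pair(bpe_words))

print(bpe_words)
```

After two merges the shared piece "low" has become a single symbol, while the rarer endings stay as characters. Because any word can always fall back to smaller pieces, an unseen word no longer has to become <|unk|>.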


(on that next) :)

### Some Important points:

#### 1. Is removing whitespace a good practice in tokenization?
It depends on your tokenizer’s philosophy.
- In simple tokenizers like the one implemented here, it's okay to do so.

🔹 In modern GPT-style LLM tokenizers:
Whitespace is kept as meaningful context using special handling (BERT's WordPiece works differently: it splits on whitespace and marks word-internal pieces with ##):

GPT's byte pair encoding tokenizer (available through the tiktoken library) often treats " hello" and "hello" as different tokens.

It may encode the leading space into the token (e.g., " hello" becomes one token that includes the space).

Spaces can influence token boundaries and are not always discarded.
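To illustrate the leading-space idea without any extra library, the sketch below uses a simplified stand-in for a GPT-style splitting pattern. The pattern is my own assumption written for the standard `re` module, not the exact pattern any real tokenizer uses, and it only shows the splitting step, not the byte-pair merges that follow.

```python
import re

# Simplified stand-in for a GPT-style splitting pattern (an assumption, for illustration):
# an optional leading space is attached to the word, number, or punctuation run after it
space_aware = r' ?[^\W\d_]+| ?\d+| ?[^\s\w]+|\s+'

sample = "hello hello, world"  # hypothetical sample text
pieces = re.findall(space_aware, sample)

print(pieces)
print("".join(pieces) == sample)  # the pieces can be joined back into the input
```

Here "hello" and " hello" come out as different pieces, and because the spaces are kept inside the pieces, joining them restores the original text exactly.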


#### 2. Are capitalized and lowercase words treated the same? (“Hello” vs “hello”)
No, they are treated as different tokens unless you normalize them.

🔹 Examples:
"Hello" and "hello" will be tokenized differently in GPT, cased BERT models, and most other tokenizers (uncased BERT models lowercase the text first).

If you want case-insensitive tokens (e.g., for small LLMs or your own experiments), you can preprocess by lowercasing: text = text.lower()

But most modern LLMs are case-sensitive, because:

- Case carries semantic weight (e.g., "Apple" the company vs "apple" the fruit)

- Preserving case helps capture nuances in proper nouns, titles, etc.
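A tiny sketch of the trade-off: lowercasing shrinks the vocab, but it also merges words that differ only by case. The token list is made up for illustration.

```python
# Hypothetical token list, for illustration only
case_tokens = ["Apple", "sells", "apple", "juice", "Hello", "hello"]

cased_vocab = sorted(set(case_tokens))                     # case-sensitive: every spelling is distinct
uncased_vocab = sorted(set(t.lower() for t in case_tokens))  # case-insensitive: lowercase first

print(len(cased_vocab), cased_vocab)
print(len(uncased_vocab), uncased_vocab)
```

With lowercasing, "Apple" (the company) and "apple" (the fruit) share one ID, which is exactly the information a case-sensitive vocab keeps.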
