## **Tokenizer**
### **Step 1 : Splitting the Text into Individual Tokens (Words and Subwords)**

In [1]:
import re

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as txt_file:
       raw_data = txt_file.read()

print("total characters:", len(raw_data))
print(raw_data[:250])  # print the first 250 characters

total characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on


**We now have our dataset, a short story, and we want to convert it into tokens.**
<div class="alert alert-block alert-success">

How can we best split this text to obtain a list of tokens? For this, we go on a small
excursion and use Python's regular expression library re for illustration purposes. (Note
that you don't have to learn or memorize any regular expression syntax since we will
transition to a pre-built tokenizer later in this chapter.) </div>

In [3]:
toknized_text = re.split(r'([,.:;?_!"()\']|--|\s)' , raw_data )
print(toknized_text[:100])

['I', ' ', 'HAD', ' ', 'always', ' ', 'thought', ' ', 'Jack', ' ', 'Gisburn', ' ', 'rather', ' ', 'a', ' ', 'cheap', ' ', 'genius', '--', 'though', ' ', 'a', ' ', 'good', ' ', 'fellow', ' ', 'enough', '--', 'so', ' ', 'it', ' ', 'was', ' ', 'no', ' ', 'great', ' ', 'surprise', ' ', 'to', ' ', 'me', ' ', 'to', ' ', 'hear', ' ', 'that', ',', '', ' ', 'in', ' ', 'the', ' ', 'height', ' ', 'of', ' ', 'his', ' ', 'glory', ',', '', ' ', 'he', ' ', 'had', ' ', 'dropped', ' ', 'his', ' ', 'painting', ',', '', ' ', 'married', ' ', 'a', ' ', 'rich', ' ', 'widow', ',', '', ' ', 'and', ' ', 'established', ' ', 'himself', ' ', 'in', ' ', 'a', ' ']
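
To see why the pattern is wrapped in parentheses, here is a small sketch on a made-up sentence (not taken from the story). It compares splitting with and without a capturing group. With the group, `re.split` keeps the delimiters in the result, which is what lets punctuation become tokens of its own.

```python
import re

sample = "Hello, world. Is this-- a test?"  # illustrative sentence, not from the dataset

# Without a capturing group the delimiters are discarded
without_group = re.split(r'[,.:;?_!"()\']|--|\s', sample)

# With a capturing group the delimiters are returned as list items too
with_group = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only items
tokens = [item.strip() for item in with_group if item.strip()]

print(without_group)
print(tokens)
```

The first list has lost every comma, period, and dash, while the second keeps them as separate tokens.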


<div class="alert alert-block alert-success">

REMOVING WHITESPACES OR NOT


When developing a simple tokenizer, whether we should encode whitespaces as
separate characters or just remove them depends on our application and its
requirements. Removing whitespaces reduces the memory and computing
requirements. However, keeping whitespaces can be useful if we train models that
are sensitive to the exact structure of the text (for example, Python code, which is
sensitive to indentation and spacing). Here, we remove whitespaces for simplicity
and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme
that includes whitespaces.

</div>

In [4]:
#remove white spaces
toknized_text = [word.strip() for word in toknized_text if word.strip()]
print(toknized_text[:100])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter', '--']


In [5]:
print(len(toknized_text))

4690


### **Step 2 : Creating Token IDs** 

<div class="alert alert-block alert-warning">

In the previous section, we tokenized the story and assigned the result to a
Python variable called `toknized_text`. Let's now create a list of all unique tokens and sort
them alphabetically to determine the vocabulary size:</div>

In [6]:
unique_words = sorted(set(toknized_text)) 
number_of_unique=len(unique_words)
number_of_unique

1130

#### **Creating the Vocabulary Itself**

In [7]:
vocabulary = {token: ID for ID, token in enumerate(unique_words)}  # maps each token to its integer ID


In [8]:
simple_show={}
for key,val in vocabulary.items():
    simple_show[key]=val
    if val ==30:
        break
print(simple_show)

{'!': 0, '"': 1, "'": 2, '(': 3, ')': 4, ',': 5, '--': 6, '.': 7, ':': 8, ';': 9, '?': 10, 'A': 11, 'Ah': 12, 'Among': 13, 'And': 14, 'Are': 15, 'Arrt': 16, 'As': 17, 'At': 18, 'Be': 19, 'Begin': 20, 'Burlington': 21, 'But': 22, 'By': 23, 'Carlo': 24, 'Chicago': 25, 'Claude': 26, 'Come': 27, 'Croft': 28, 'Destroyed': 29, 'Devonshire': 30}


<div class="alert alert-block alert-success">

Later in this book, when we want to convert the outputs of an LLM from numbers back into
text, we also need a way to turn token IDs into text. 

For this, we can create an inverse
version of the vocabulary that maps token IDs back to corresponding text tokens.

</div>

**Let's implement a complete tokenizer class in Python.**

**The class will have an encode method that splits text into tokens and carries out the string-to-integer mapping to produce token IDs via the vocabulary.**

**In addition, we implement a decode method that carries out the reverse integer-to-string mapping to convert the token IDs back into text.**
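
Before the full class, the inverse mapping can be sketched on its own. This is a minimal illustration that uses a made-up four-token vocabulary (`toy_vocab` is an assumption, not the vocabulary built from the story).

```python
# Toy vocabulary for illustration only (not the one built from the story)
toy_vocab = {"!": 0, ",": 1, "hello": 2, "world": 3}

# Swap keys and values to obtain the ID -> token mapping
inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}

# Encode: tokens -> IDs, then decode: IDs -> tokens
ids = [toy_vocab[token] for token in ["hello", ",", "world", "!"]]
tokens = [inverse_vocab[token_id] for token_id in ids]

print(ids)
print(tokens)
```

Because every token has exactly one ID, the round trip returns the original token list.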

In [9]:
class SimpleTokenizar:
    def __init__(self,vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encoder(self,raw_text):
        toknized_text = re.split(r'([,.:;?_!"()\']|--|\s)' , raw_text )
        toknized_text = [word.strip() for word in toknized_text if word.strip()]
        #the last part is to convert tokens to IDs
        token_IDs = [self.str_to_int[token] for token in toknized_text ]

        return token_IDs
          
    def decoder(self,IDs):
        text = " ".join([self.int_to_str[ID] for ID in IDs])
        # remove spaces before punctuation and other special characters
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        
        return text  

In [10]:
# let's try the class with our vocabulary
tokenizer = SimpleTokenizar(vocabulary)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encoder(text)
print(ids)


[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [11]:
# turn the IDs back into text
decoded_text = tokenizer.decoder(ids)
print("the original text :",text)
print("the decoded text :",decoded_text)

the original text : "It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride.
the decoded text : " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


<div class="alert alert-block alert-success">

We implemented a tokenizer capable of tokenizing and de-tokenizing
text based on a snippet from the training set. 

Let's now apply it to a new text sample that
is not contained in the training set:
</div>

In [12]:
try:
    text = "Hello, do you like tea?"
    # the method is named `encoder`; a token missing from the vocabulary raises KeyError
    print(tokenizer.encoder(text))
except KeyError:
    print("sorry the tokenizer didn't see this before!")

sorry the tokenizer didn't see this before!


<div class="alert alert-block alert-info">
    
The problem is that the word "Hello" was not used in The Verdict short story. 

Hence, it
is not contained in the vocabulary. 

This highlights the need to consider large and diverse
training sets to extend the vocabulary when working on LLMs.

</div>

### ADDING SPECIAL CONTEXT TOKENS

In the previous section, we implemented a simple tokenizer and applied it to a passage
from the training set. 

In this section, we will modify this tokenizer to handle unknown
words.


In particular, we will extend the vocabulary and turn the tokenizer from the
previous section into a new class, `SimpleTokenizarV2`, that supports two new tokens, `<|unk|>` and
`<|endoftext|>`.

<div class="alert alert-block alert-warning">

We can modify the tokenizer to use an <|unk|> token if it
encounters a word that is not part of the vocabulary. 

Furthermore, we add a token between
unrelated texts. 

For example, when training GPT-like LLMs on multiple independent
documents or books, it is common to insert a token before each document or book that
follows a previous text source.

</div>



In [13]:
all_tokens = sorted(set(toknized_text))
# add the 2 new tokens <|unk|> , <|endoftext|>
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
full_vocab = {token:ID for ID,token in enumerate(all_tokens)}

len(full_vocab.values())

1132

<div class="alert alert-block alert-info">
    
Based on the output above, the new vocabulary size is 1132 (the
vocabulary size in the previous section was 1130).

</div>

In [14]:
for item in list(full_vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


<div class="alert alert-block alert-success">

A simple text tokenizer that handles unknown words
</div>

In [15]:
class SimpleTokenizarV2:
    def __init__(self,vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encoder(self,raw_text):
        toknized_text = re.split(r'([,.:;?_!"()\']|--|\s)' , raw_text )
        toknized_text = [word.strip() for word in toknized_text if word.strip()]
        
        # replace tokens that are not in the vocabulary with <|unk|>
        toknized_text = [
            token if token in self.str_to_int 
            else "<|unk|>" for token in toknized_text 
        ]
        
        #the last part is to convert tokens to IDs
        token_IDs = [self.str_to_int[token] for token in toknized_text ]

        return token_IDs
          
    def decoder(self,IDs):
        text = " ".join([self.int_to_str[ID] for ID in IDs])
        # remove spaces before punctuation and other special characters
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        
        return text

In [16]:
tokenizer = SimpleTokenizarV2(full_vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [17]:
ids=tokenizer.encoder(text)
ids

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 984, 722, 988, 1131, 7]

In [18]:
tokenizer.decoder(ids)

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'


<div class="alert alert-block alert-info">
    
Based on comparing the de-tokenized text above with the original input text, we know that
the training dataset, Edith Wharton's short story The Verdict, did not contain the words
"Hello" and "palace."

</div>

# Byte-Pair Encoding (BPE) 

## What is BPE?
Byte-Pair Encoding (BPE) is a **subword tokenization algorithm** used in NLP to split text into smaller units.  
It helps handle **rare words**, **out-of-vocabulary words**, and reduces the vocabulary size.

---

## Key Steps
1. **Add end-of-word marker** to each word (e.g., `</w>`).  
2. **Count character frequencies** in the dataset.  
3. **Find the most frequent adjacent pair** of symbols.  
4. **Merge the pair** into a new symbol.  
5. **Repeat steps 3-4** until reaching the desired number of merges or vocabulary size.
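
The steps above can be sketched in a few lines of Python. This is a simplified teaching version on a made-up toy corpus. It is not the byte-level implementation used by GPT tokenizers, and the helper names (`get_pair_counts`, `merge_pair`, `train_bpe`) are illustrative.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(words, num_merges):
    # Step 1: split each word into characters and append the end-of-word marker
    word_freqs = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)          # steps 2-3: count pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)             # most frequent pair
        word_freqs = merge_pair(best, word_freqs)    # step 4: merge it
        merges.append(best)
    return merges, word_freqs

toy_corpus = ["low", "low", "lower", "lowest", "low"]  # made-up example data
merges, segmented = train_bpe(toy_corpus, num_merges=2)
print(merges)
print(segmented)
```

On this toy corpus the first two merges combine `l` + `o` and then `lo` + `w`, so the frequent stem `low` becomes a single symbol while the rarer endings stay split into characters.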

**The tokenizer used for GPT models does not use an `<|unk|>` token for out-of-vocabulary words. Instead, GPT models use a byte-pair encoding tokenizer, which breaks unknown words down into subword units.**


In [19]:
# use OpenAI's tiktoken library instead of implementing BPE from scratch
import tiktoken
# cl100k_base is the encoding used by GPT-3.5 / GPT-4
tokenizer = tiktoken.get_encoding("cl100k_base")

In [20]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[9906, 11, 656, 499, 1093, 15600, 30, 220, 100257, 763, 279, 7160, 32735, 7317, 2492, 1073, 1063, 16476, 17826, 13]


**Convert the token IDs back into text**


In [21]:
string = tokenizer.decode(integers)

print(string)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


**As we can see, it handles unknown words much better than our simple tokenizer, because BPE can fall back to smaller subword units (down to single bytes) for anything it has not seen as a whole word.**

In [22]:
# example of an unknown word
integers = tokenizer.encode("mohamedwaleed elmasry")
print(integers)

strings = tokenizer.decode(integers)
print(strings)

[76, 2319, 3690, 86, 1604, 291, 658, 7044, 894]
mohamedwaleed elmasry
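
One reason no `<|unk|>` token is needed is that the BPE tokenizers used for GPT models are byte-level. In the worst case a string can be represented by its individual UTF-8 bytes, and there are only 256 possible byte values. The sketch below uses only the standard library to show that fallback idea. It does not reproduce tiktoken's merge rules, and the sample string is made up.

```python
text = "elmasry café"  # illustrative string containing a non-ASCII character

# Every string can be turned into a sequence of byte values
byte_values = list(text.encode("utf-8"))
print(byte_values)

# All values fit into a base vocabulary of 256 entries
print(all(0 <= b < 256 for b in byte_values))

# The byte sequence can be decoded back without loss
restored = bytes(byte_values).decode("utf-8")
print(restored == text)
```

A byte-level BPE tokenizer starts from these 256 base symbols and learns merges on top of them, so frequent words end up as one token and rare words as several.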


## Input-Target Pairs in LLM Training

Input-target pairs are fundamental training examples for Large Language Models (LLMs).
Each pair consists of:
- An input prompt or context provided to the model
- The target output the model should generate in response

These pairs form the basis of LLM training. For next-token prediction, the target is
simply the input shifted by one token, so the labels come from the text itself
(self-supervised learning) and the model is trained with teacher forcing.


In [23]:
with open("the-verdict.txt", "r", encoding="utf-8") as txt_file:
       raw_text = txt_file.read()

# apply the BPE encoder
full_data = tokenizer.encode(raw_text)
# number of tokens in the encoded text (not the vocabulary size)
print(len(full_data))

4943


**Let's create the input-target pair variables x and y**

In [24]:
x=full_data[:-1]
y=full_data[1:]

for i in range(1,6):
    print(x[:i] ,' --> ' ,y[i-1])

[40]  -->  473
[40, 473]  -->  1846
[40, 473, 1846]  -->  2744
[40, 473, 1846, 2744]  -->  3463
[40, 473, 1846, 2744, 3463]  -->  7762


In [25]:
# a more readable example: x (input) --> y (the token the LLM should predict)
x = full_data[50:-1]
y = full_data[51:]

for i in range(1, 6):
 
    print(tokenizer.decode(x[:i]), ' --> ', tokenizer.decode([y[i-1]]))

 and  -->   established
 and established  -->   himself
 and established himself  -->   in
 and established himself in  -->   a
 and established himself in a  -->   villa


### Implementing a Data Loader for Efficient Input-Target Pair Feeding
**We will use the PyTorch DataLoader.**

In [26]:
from torch.utils.data import Dataset, DataLoader
import torch

- **Step 1: tokenize the whole dataset**
- **Step 2: chunk the dataset into overlapping sequences of max_length tokens**

In [27]:
# Dataset is the PyTorch base class for custom datasets
class LLMDataset(Dataset):
    # stride: how far the window moves between chunks; with stride=1 the input shifts by one token each time
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids  = []
        self.target_ids = []

        # tokenize the whole dataset with the tokenizer passed in, allowing <|endoftext|>
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # split the token IDs into input and target chunks, stored as tensors rather than lists
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk= token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]

    def __len__(self):
        return len(self.input_ids)
        

In [28]:
def Create_Dataloader(txt, batch_size=4, max_length=256, 
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # initialize the tokenizer; cl100k_base matches the tokenizer used earlier and the token IDs shown below
    tokenizer = tiktoken.get_encoding("cl100k_base")

    dataset = LLMDataset(txt,tokenizer,max_length,stride)

    dataloader = DataLoader(dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last,
                             num_workers=num_workers)
    return dataloader
    
    

**Now let's try this on our dataset to see the result**

In [29]:
with open("the-verdict.txt", "r", encoding="utf-8") as txt_file:
       raw_data = txt_file.read()

dataloader = Create_Dataloader(raw_data ,batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[  40,  473, 1846, 2744]]), tensor([[ 473, 1846, 2744, 3463]])]


**These small batch sizes are for illustration only. In practice the batch size is a hyperparameter, and real training runs typically use much larger batches.**

In [30]:
# use a larger batch size
# also increase the stride to reduce the overlap between chunks, which can help reduce overfitting
dataloader = Create_Dataloader(raw_data ,batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("inputs:",'\n',inputs,'\n')
print("targets:",'\n',targets)

inputs: 
 tensor([[   40,   473,  1846,  2744],
        [ 3463,  7762,   480,   285],
        [22464,  4856,   264, 12136],
        [35201,   313,  4636,   264],
        [ 1695, 12637,  3403,   313],
        [  708,   433,   574,   912],
        [ 2294, 13051,   311,   757],
        [  311,  6865,   430,    11]]) 

targets: 
 tensor([[  473,  1846,  2744,  3463],
        [ 7762,   480,   285, 22464],
        [ 4856,   264, 12136, 35201],
        [  313,  4636,   264,  1695],
        [12637,  3403,   313,   708],
        [  433,   574,   912,  2294],
        [13051,   311,   757,   311],
        [ 6865,   430,    11,   304]])
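
To make the effect of `stride` concrete without PyTorch, the following sketch applies the same windowing loop as `LLMDataset` to a made-up list of ten token IDs. The helper name `sliding_windows` is hypothetical.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i : i + max_length]
        target_chunk = token_ids[i + 1 : i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

for stride in (1, 2, 4):
    windows = sliding_windows(toy_ids, max_length=4, stride=stride)
    print(f"stride={stride}: {len(windows)} chunks, first two: {windows[:2]}")
```

With `stride=1` consecutive chunks overlap in three of their four tokens. With `stride` equal to `max_length` the inputs do not overlap at all, which yields fewer but more distinct training examples.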


## How Embeddings Work

Token embeddings are the bridge between text and numbers. They allow machine learning models to understand and operate on language by turning tokens (words, subwords, characters) into vectors.

### 1. **Tokenizer Produces Token IDs**

* Text is split using a tokenizer (like BPE).
* Each token is mapped to an integer ID.

### 2. **Embedding Matrix**

* A matrix of size:

  **vocab_size × embedding_dim**

  Example: `50,257 × 768`.
* Each row corresponds to the vector representation of a token.
* Initialized randomly.

### 3. **Lookup Step**

* When the model receives token IDs, it uses them to index into the embedding matrix.
* Output: A dense vector for each token.

### 4. **Learning the Embeddings**

* Embeddings update **only when a training objective exists**.

* Common objectives:

  * Next-token prediction (causal LM)
  * Masked-token prediction (BERT)
  * Sequence classification (sentiment, etc.)

* The embedding vectors change during training because:

  * The model predicts something.
  * It computes error (loss).
  * Backpropagation computes gradients for **all trainable weights**, including the embedding rows of the tokens in the batch, and the optimizer then updates them.

### 5. **Embeddings Alone Don’t Train**

* If you only feed input tokens and do not ask the model to predict anything, the embeddings never update.
* Embeddings need a **supervision signal** or **self-supervised objective**.

### 6. **Why Embeddings Matter**

* They capture relationships:

  * Similar tokens → similar vectors
  * Semantic meaning encoded in geometry

Embedding quality depends entirely on **training objective + data quantity + model architecture**.

---

**In summary:**
Token embeddings are random at first. They become meaningful **only because the model is forced to make predictions**, and errors from those predictions update the embedding matrix during training.
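
A minimal sketch of this idea, assuming a toy setup that is not part of the notebook: a 6-token vocabulary, a linear output head (`head`), plain SGD, and made-up input and target IDs. After one training step, only the embedding rows of tokens that appeared in the input have changed.

```python
import torch

torch.manual_seed(0)

emb = torch.nn.Embedding(6, 3)   # 6 tokens, 3-dimensional vectors
head = torch.nn.Linear(3, 6)     # illustrative head that predicts the next token ID
params = list(emb.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

inputs = torch.tensor([2, 3, 5])   # made-up input token IDs
targets = torch.tensor([3, 5, 4])  # made-up next-token targets

before = emb.weight.detach().clone()

# One training step: predict, compute the loss, backpropagate, update
logits = head(emb(inputs))
loss = torch.nn.functional.cross_entropy(logits, targets)
loss.backward()
optimizer.step()

# Which rows of the embedding matrix were modified?
changed_rows = (emb.weight.detach() != before).any(dim=1)
print(changed_rows)
```

Rows 2, 3, and 5 receive a gradient because those tokens were looked up. The other rows get a zero gradient and stay as they were in this step.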


**A very simple example with only 4 token IDs**

In [31]:
input_ids = torch.tensor([2,3,5,4])
input_ids

tensor([2, 3, 5, 4])

In [32]:
# suppose we have a vocabulary of 6 tokens
vocab_size = 6
# and we want embedding vectors of dimension 3
output_dim = 3

torch.manual_seed(123)
# create the layer that produces our vectors
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [33]:
# show the whole embedding matrix: vocabulary size x vector dimension
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


**These weights are optimized during training.**

In [34]:
#now lets convert our input ids into token embeddings
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [-1.1589,  0.3255, -0.6315]], grad_fn=<EmbeddingBackward0>)
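
The lookup above is just row selection. As a hedged sketch that rebuilds the same small layer, the following code checks that calling the layer gives the same result as indexing the weight matrix directly, and as multiplying one-hot vectors by the weight matrix (the less efficient formulation that the embedding layer replaces).

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)   # 6 tokens, 3-dimensional vectors
input_ids = torch.tensor([2, 3, 5, 4])

# 1) Normal lookup through the layer
looked_up = embedding_layer(input_ids)

# 2) Direct row indexing into the weight matrix
by_index = embedding_layer.weight[input_ids]

# 3) One-hot encoding followed by a matrix multiplication
one_hot = torch.nn.functional.one_hot(input_ids, num_classes=6).float()
by_matmul = one_hot @ embedding_layer.weight

print(torch.equal(looked_up, by_index))
print(torch.allclose(looked_up, by_matmul))
```

All three give the same vectors, which is why an embedding layer can be thought of as an efficient replacement for a one-hot input followed by a linear layer without bias.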


## Absolute Positional Embeddings

Absolute positional embeddings give the model information about **the position of each token in the sequence**. Since token embeddings alone do not contain any notion of order, positional embeddings are added to preserve sequence structure.

---

### 1. **Why Do We Need Positional Embeddings?**

Transformers process tokens in parallel (unlike RNNs). Therefore:

* The model sees all tokens at once.
* It has **no natural sense of order**.

Positional embeddings inject the ordering information needed for:

* understanding sentences
* recognizing the difference between "Alice loves Bob" and "Bob loves Alice"
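
A tiny illustration of the "Alice loves Bob" case, using hand-picked 2-dimensional vectors (all numbers and the helper `embed` are made up for this sketch). Without positional vectors both sentences consist of exactly the same token vectors. Once position vectors are added, they no longer do.

```python
# Hand-picked toy vectors, for illustration only
token_emb = {"Alice": [1.0, 0.0], "loves": [0.0, 1.0], "Bob": [1.0, 1.0]}
pos_emb = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.3]]  # one vector per position 0, 1, 2

def embed(sentence, use_positions):
    """Return one vector per token, optionally adding the position vector."""
    vectors = []
    for i, token in enumerate(sentence.split()):
        vec = token_emb[token]
        if use_positions:
            vec = [t + p for t, p in zip(vec, pos_emb[i])]
        vectors.append(vec)
    return vectors

a = embed("Alice loves Bob", use_positions=False)
b = embed("Bob loves Alice", use_positions=False)
same_without_positions = sorted(a) == sorted(b)

a_pos = embed("Alice loves Bob", use_positions=True)
b_pos = embed("Bob loves Alice", use_positions=True)
same_with_positions = sorted(a_pos) == sorted(b_pos)

print(same_without_positions)
print(same_with_positions)
```

The first comparison is true and the second is false: with positions added, "Alice" at position 0 is represented differently from "Alice" at position 2.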

---

### 2. **Embedding Shape**

If:

* sequence length = *N*
* embedding dimension = *d*

Then positional embedding matrix has size:

**N × d**

Example:

* maximum sequence length = 1024
* embedding dimension = 768

Positional embeddings shape:

**1024 × 768**

---

### 3. **How Absolute Positional Embeddings Work**

Each position *i* (0, 1, 2, ..., N−1) has a unique vector:

```text
pos_embedding[i] → vector of size d
```

For each token embedding:

```text
final_embedding[i] = token_embedding[i] + pos_embedding[i]
```

Token embedding = meaning of the word
Positional embedding = where it appears

---

### 4. **Initialization and Learning**

Absolute positional embeddings are typically:

* initialized randomly
* updated during training via backpropagation

They learn patterns like:

* beginnings of sentences
* typical word orders
* location-based semantics (e.g., punctuation at sequence end)

---

### 5. **Summary**

* Absolute positional embeddings provide one vector per position index.
* Added directly to token embeddings.
* Learn during training.
* Simple but limited for long or flexible sequences.

The original **Transformer** used fixed sinusoidal absolute positional encodings, while **GPT-2** used learned absolute positional embeddings like the ones described here.


In [35]:
# let's try a more realistic example
vocab_size = 50257  # GPT-2 vocabulary size; must be larger than the largest token ID the tokenizer can produce
output_dim = 256

embedding_layer = torch.nn.Embedding(vocab_size , output_dim)

In [36]:
# use the data loader to create the token IDs
max_length = 4
batch_size = 8

data_loader = Create_Dataloader(raw_data,batch_size=batch_size,max_length=max_length,stride=max_length,shuffle=False)

first_data = iter(data_loader)
inputs, targets =next(first_data)

In [37]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   473,  1846,  2744],
        [ 3463,  7762,   480,   285],
        [22464,  4856,   264, 12136],
        [35201,   313,  4636,   264],
        [ 1695, 12637,  3403,   313],
        [  708,   433,   574,   912],
        [ 2294, 13051,   311,   757],
        [  311,  6865,   430,    11]])

Inputs shape:
 torch.Size([8, 4])


In [38]:
token_embeddings = embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


**Now the positional embedding layer**

In [39]:
# created only once, with shape context size x output vector dimension
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [40]:
# look up the embeddings for positions 0, 1, ..., context_length - 1
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


In [41]:
# add the same positional embeddings to every sequence in the batch (via broadcasting) to get the final LLM input embeddings
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])
