In [1]:
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/"
"LLMs-from-scratch/main/ch02/01_main-chapter-code/"
"the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

('the-verdict.txt', <http.client.HTTPMessage at 0x13827ca4eb0>)

In [2]:
with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()
print("total no. of characters :", len(text))
print(text[:99])

total no. of characters : 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


How can we best split this text to obtain a list of tokens? For this, we go on a small
excursion and use Python’s regular expression library re for illustration purposes.

In [7]:
import re 
test= "Hello, world . This is a test ?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', test)
print(result)

['Hello', ',', '', ' ', 'world', ' ', '', '.', '', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', ' ', '', '?', '']


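The pattern `r'([,.:;?_!"()\']|--|\s)'` is a regular expression that splits text on common punctuation characters, double dashes (`--`), and whitespace. Because the whole pattern sits inside a capturing group, `re.split` returns the delimiters as list items too, which is why punctuation marks, spaces, and empty strings appear in the output above.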
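
To see why the parentheses in the pattern matter, the following sketch compares a capturing group with a non-capturing one. The shortened pattern, the sample string, and the variable names are illustrative and not part of the original notebook.

```python
import re

sample = "Hello, world."

# Capturing group: the delimiters are returned as list items
with_delims = re.split(r'([,.]|\s)', sample)

# Non-capturing group: the delimiters are discarded
without_delims = re.split(r'(?:[,.]|\s)', sample)

print(with_delims)     # ['Hello', ',', '', ' ', 'world', '.', '']
print(without_delims)  # ['Hello', '', 'world', '']
```

Keeping the delimiters is what allows punctuation marks to become tokens of their own in the cells that follow.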

In [8]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
# Strip surrounding whitespace from each item and drop items that are empty afterwards
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4690


In [9]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [10]:
all_words = sorted(set(preprocessed))
vocab_size = len (all_words)
print(vocab_size)


1130


In [21]:
vocab = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(all_words):
    print(item)
    if i > 20:
        break

!
"
'
(
)
,
--
.
:
;
?
A
Ah
Among
And
Are
Arrt
As
At
Be
Begin
Burlington


Let's break down how the line  

```python
vocab = {token: integer for integer, token in enumerate(all_words)}
```

works step by step.

---

### **Understanding the Components**
This is a **dictionary comprehension** that constructs a dictionary (`vocab`) by iterating over `all_words` using `enumerate()`. The structure is:

```python
{key: value for value, key in enumerate(iterable)}
```
where:
- **`token`** (key) is each item (word or symbol) in `all_words`.
- **`integer`** (value) is the index assigned to each `token` by `enumerate()`.

---

### **Step-by-Step Execution**

#### **1. `enumerate(all_words)`**
The `enumerate()` function assigns an index (starting from 0) to each item in `all_words`.  

For example, if:

```python
all_words = ['!', ',', '--', 'Hello', 'example', 'world']
```
Then:
```python
list(enumerate(all_words))
# Output:
[(0, '!'), (1, ','), (2, '--'), (3, 'Hello'), (4, 'example'), (5, 'world')]
```
Each word is paired with a unique index.

#### **2. Dictionary Comprehension**
The comprehension:

```python
{token: integer for integer, token in enumerate(all_words)}
```
- Iterates over each `(index, token)` pair from `enumerate(all_words)`.
- Assigns each **token** as a dictionary **key** and its corresponding **index** as the **value**.

---

### **Final Output**
For `all_words = ['!', ',', '--', 'Hello', 'example', 'world']`,  
the dictionary `vocab` will be:

```python
{
    '!': 0,
    ',': 1,
    '--': 2,
    'Hello': 3,
    'example': 4,
    'world': 5
}
```

---

### **How the Code Works Line by Line**
1. `enumerate(all_words)` → Generates `(index, word)` pairs.
2. `{token: integer for integer, token in enumerate(all_words)}` → Constructs a dictionary.
3. Each **token** (word/symbol) becomes a **key** and its **index** becomes a **value**.

---

### **Why is this Useful?**
- Converts words into numerical IDs (important for NLP, text processing, machine learning).
- Ensures a **consistent mapping** of words to numbers.
- Efficient, as it avoids multiple loops and extra variables.
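
As a small illustration of the mapping in both directions, the sketch below builds a vocabulary from a toy token list and then inverts it. The names `tokens`, `vocab_demo`, and `inverse_vocab` are hypothetical and used only for this example.

```python
tokens = ["world", "Hello", ",", "Hello", "!"]

# sorted(set(...)) removes duplicates and fixes a deterministic order
unique_tokens = sorted(set(tokens))
vocab_demo = {token: idx for idx, token in enumerate(unique_tokens)}

# Swap keys and values to map IDs back to tokens
inverse_vocab = {idx: token for token, idx in vocab_demo.items()}

ids = [vocab_demo[t] for t in tokens]
print(vocab_demo)                       # {'!': 0, ',': 1, 'Hello': 2, 'world': 3}
print(ids)                              # [3, 2, 1, 2, 0]
print([inverse_vocab[i] for i in ids])  # ['world', 'Hello', ',', 'Hello', '!']
```

Because the token list is sorted before enumeration, the same text always produces the same IDs.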


In [18]:
class simpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        # Inverse mapping: token ID -> token string
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Use the same split pattern (including ':' and ';') that built the vocabulary
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() puts before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text
        


In [20]:
tokenizer = simpleTokenizerV1(vocab)
text = """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [22]:
dec= tokenizer.decode(ids)
print(dec)

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


This Python code defines a class called `simpleTokenizerV1`, which functions as a **basic tokenizer**. It converts text into numerical representations (**encoding**) and converts numerical representations back into text (**decoding**). This is commonly used in **Natural Language Processing (NLP)** tasks.

Let's break down the code **line by line** with an **example**.

---

## **1. Class Definition**
```python
class simpleTokenizerV1:
```
- Defines a **class** named `simpleTokenizerV1`.
- A **class** is a blueprint for creating objects that can store data and perform operations on it.

---

## **2. Initializing the Tokenizer**
```python
def __init__(self, vocab):
```
- `__init__` is a **constructor** that runs when an object of this class is created.
- `vocab` is a **dictionary** mapping words (tokens) to unique integer IDs.

Example:
```python
vocab = {'Hello': 0, ',': 1, 'world': 2, '!': 3}
tokenizer = simpleTokenizerV1(vocab)
```
This initializes the tokenizer with a vocabulary where:
- `"Hello"` → `0`
- `","` → `1`
- `"world"` → `2`
- `"!"` → `3`

---

## **3. Creating String-to-Integer and Integer-to-String Mappings**
```python
self.str_to_int = vocab
```
- **Stores the input vocabulary (`vocab`)** in an instance variable called `str_to_int`.
- This dictionary is used to convert words into numbers (**string → integer**).

```python
self.int_to_str = {i: s for s, i in vocab.items()}
```
- **Reverses the dictionary (`vocab.items()`)** to create a mapping from numbers to words (**integer → string**).
- This allows us to convert numbers back to text during **decoding**.

### Example Execution:
```python
vocab = {'Hello': 0, ',': 1, 'world': 2, '!': 3}
tokenizer = simpleTokenizerV1(vocab)

print(tokenizer.str_to_int)  
# {'Hello': 0, ',': 1, 'world': 2, '!': 3}

print(tokenizer.int_to_str)  
# {0: 'Hello', 1: ',', 2: 'world', 3: '!'}
```

---

## **4. Encoding: Converting Text to Numbers**
```python
def encode(self, text):
```
- This function takes a **string (`text`)** and **converts it into a list of integers**.

### **Step 1: Tokenizing the Text**
```python
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
```
- Uses `re.split()` (regular expressions) to **split text into words and punctuation**.
- **Pattern**: `([,.:;?_!"()\']|--|\s)`
  - This splits on **whitespace, the listed punctuation characters, and double dashes (`--`)**.
  - Because the pattern is a capturing group, the **punctuation is kept as separate tokens**.

Example:
```python
text = "Hello, world!"
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
print(preprocessed)
# Output: ['Hello', ',', '', ' ', 'world', '!', '']
```

### **Step 2: Cleaning Up Tokens**
```python
preprocessed = [item.strip() for item in preprocessed if item.strip()]
```
- Removes unnecessary spaces and empty strings from the token list.

Example:
```python
print(preprocessed)
# Output: ['Hello', ',', 'world', '!']
```

### **Step 3: Converting Words to Numbers**
```python
ids = [self.str_to_int[s] for s in preprocessed]
```
- Replaces **each word/token with its corresponding number** using `self.str_to_int`.

Example:
```python
ids = [0, 1, 2, 3]
```
- `"Hello"` → `0`
- `","` → `1`
- `"world"` → `2`
- `"!"` → `3`

### **Step 4: Returning the Encoded Output**
```python
return ids
```
- Returns the list of numbers.

#### **Final Encoding Example**
```python
vocab = {'Hello': 0, ',': 1, 'world': 2, '!': 3}
tokenizer = simpleTokenizerV1(vocab)

text = "Hello, world!"
encoded = tokenizer.encode(text)

print(encoded)  
# Output: [0, 1, 2, 3]
```

---

## **5. Decoding: Converting Numbers to Text**
```python
def decode(self, ids):
```
- This function **takes a list of numbers** and **converts them back into text**.

### **Step 1: Convert Numbers to Words**
```python
text = " ".join([self.int_to_str[i] for i in ids])
```
- Uses `self.int_to_str` to **replace each number with its corresponding word/token**.
- Joins the words with **spaces**.

Example:
```python
ids = [0, 1, 2, 3]
text = " ".join(['Hello', ',', 'world', '!'])
print(text)
# Output: "Hello , world !"
```
Notice the extra space before each punctuation mark. We fix this in the next step.

### **Step 2: Remove Extra Spaces Before Punctuation**
```python
text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
```
- Uses **regular expressions (`re.sub`)** to remove **extra spaces before punctuation**.
- **Pattern**: `\s+([,.?!"()\'])`
  - Matches **one or more spaces (`\s+`)** before **punctuation**.
  - **Replaces it with just the punctuation (`\1`)**.

Example:
```python
text = "Hello , world !"
text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
print(text)
# Output: "Hello, world!"
```

### **Step 3: Return the Decoded Text**
```python
return text
```
- Returns the cleaned-up text.

---

## **6. Final Decoding Example**
```python
decoded_text = tokenizer.decode([0, 1, 2, 3])
print(decoded_text)
# Output: "Hello, world!"
```

---

## **Final Summary**
### **What This Class Does**
✅ **Encodes text** (converts words into numbers).  
✅ **Decodes numbers** (converts numbers back into text).  
✅ **Handles basic punctuation** (keeps it as separate tokens and removes the space before it when decoding).

### **Example Usage**
```python
vocab = {'Hello': 0, ',': 1, 'world': 2, '!': 3}
tokenizer = simpleTokenizerV1(vocab)

# Encoding
text = "Hello, world!"
encoded = tokenizer.encode(text)
print(encoded)  
# Output: [0, 1, 2, 3]

# Decoding
decoded = tokenizer.decode(encoded)
print(decoded)
# Output: "Hello, world!"
```

This is a **basic tokenizer**, useful in NLP applications like text processing and machine learning. 🚀



We can modify the tokenizer to use an <|unk|> token if it encounters a word that is
not part of the vocabulary. We also add an <|endoftext|> token that is placed between
unrelated text sources. This helps the LLM understand that although these text sources
are concatenated for training, they are, in fact, unrelated.
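
The need for an unknown-token fallback can be seen in a minimal sketch with a toy vocabulary. The dictionary `toy_vocab` and the token list are made up for illustration and are not the vocabulary built from the text.

```python
toy_vocab = {"Hello": 0, ",": 1, "world": 2, "<|unk|>": 3}

tokens = ["Hello", ",", "palace"]  # "palace" is not in toy_vocab

# Direct indexing, as in a tokenizer without an unknown-token fallback
try:
    ids_strict = [toy_vocab[t] for t in tokens]
except KeyError as err:
    ids_strict = None
    print("KeyError for token:", err)

# Fallback: map anything outside the vocabulary to <|unk|>
ids_with_unk = [toy_vocab.get(t, toy_vocab["<|unk|>"]) for t in tokens]
print(ids_with_unk)  # [0, 1, 3]
```

The first lookup fails as soon as it meets an unseen word, while the second one keeps encoding and marks the unseen word with the ID of <|unk|>.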

In [38]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab.items()))

1132


In [39]:
for i, item in enumerate(list(vocab.items())[-5:]):
  print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


In [40]:
class simpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Use the same split pattern (including ':' and ';') that built the vocabulary
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace tokens that are not in the vocabulary with <|unk|>
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() puts before punctuation
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [41]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text3 = " <|endoftext|> ".join((text1, text2))
print(text3)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [42]:
tokenizer = simpleTokenizerV2(vocab)
print(tokenizer.encode(text))

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [43]:
print(tokenizer.decode(tokenizer.encode(text)))

" It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.


In [44]:
import tiktoken 

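# Note: the two string literals are joined without a space, so the text contains
# "terracesof", which is not in the vocabulary and is encoded as <|unk|>.
ext = (
"Hello, do you like tea? <|endoftext|> In the sunlit terraces"
"of someunknownPlace."
)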
integers = tokenizer.encode(ext)
print(integers)

[1131, 5, 355, 1126, 628, 975, 10, 1130, 55, 988, 956, 1131, 1131, 7]


In [45]:
strings = tokenizer.decode(integers)
print(strings)

<|unk|>, do you like tea? <|endoftext|> In the sunlit <|unk|> <|unk|>.


In [46]:
import tiktoken  

# Load and tokenize the text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

tokenizer = tiktoken.get_encoding("cl100k_base")  
enc_text = tokenizer.encode(raw_text)

In [47]:
enc_sample = enc_text[50:]
print(enc_sample)

[323, 9749, 5678, 304, 264, 47625, 389, 279, 51768, 26919, 13, 320, 27831, 358, 4856, 3463, 433, 1053, 617, 1027, 22463, 477, 48606, 9456, 10227, 2673, 315, 813, 27025, 75857, 9210, 574, 1148, 279, 3278, 2663, 433, 13, 358, 649, 6865, 18083, 13, 480, 100242, 666, 24510, 313, 26301, 1566, 10780, 2503, 466, 313, 451, 501, 5620, 813, 653, 4711, 481, 671, 67, 20901, 13, 330, 2173, 3388, 433, 596, 2133, 311, 3708, 279, 907, 315, 856, 6945, 364, 3195, 709, 26, 719, 358, 1541, 956, 1781, 315, 430, 11, 4491, 13, 23194, 5721, 313, 1820, 4814, 311, 18925, 83, 374, 682, 358, 1781, 315, 1210, 578, 3492, 11, 389, 18083, 13, 666, 24510, 596, 23726, 11, 56016, 1202, 721, 5544, 62, 439, 3582, 814, 1051, 27000, 304, 459, 26762, 40136, 315, 41585, 13, 1628, 433, 574, 539, 1193, 279, 18083, 13, 666, 86, 826, 889, 60234, 291, 13, 24805, 539, 279, 59708, 32565, 689, 25611, 728, 11, 520, 279, 1566, 480, 3017, 263, 19853, 1501, 11, 10717, 757, 1603, 480, 285, 22464, 596, 330, 77119, 1773, 32842, 1, 311, 2019

In [48]:
context_size= 4
x= enc_sample[:context_size]
y=enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y: {y}")

x: [323, 9749, 5678, 304]
y: [9749, 5678, 304, 264]


In [51]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[323] ----> 9749
[323, 9749] ----> 5678
[323, 9749, 5678] ----> 304
[323, 9749, 5678, 304] ----> 264


In [53]:
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text once
        token_ids = tokenizer.encode(txt)

        # Slide a window of max_length tokens over the text in steps of stride
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            # The target is the input shifted one token to the right
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]


### **Explanation of the Code:**

This code defines a custom `Dataset` class (`GPTDatasetV1`) that prepares data for training a model like GPT (a language model). It uses a **sliding window approach** to create input–target pairs for next-word prediction.

Let’s break it down step by step, especially focusing on how the **for loop** works.

---

### **1. Dataset Class Definition:**
```python
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)
```

- `txt`: The raw text data that you want to use for training.
- `tokenizer`: A tokenizer that converts the text into token IDs (integers).
- `max_length`: The **maximum length** of each input sequence.
- `stride`: The **stride** or step size by which the sliding window moves.

The tokenizer converts the raw text (`txt`) into a list of **token IDs** (`token_ids`) using the `tokenizer.encode(txt)` function.

For example, if `txt = "The quick brown fox"`, the resulting `token_ids` might look like this (the IDs are illustrative):
```python
[101, 2000, 303, 4567, 2345]
```

---

### **2. The For Loop - Creating Input-Target Pairs**
```python
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i:i + max_length]
    target_chunk = token_ids[i + 1: i + max_length + 1]
    self.input_ids.append(torch.tensor(input_chunk))
    self.target_ids.append(torch.tensor(target_chunk))
```

Let’s break this loop down:

- **Range for the loop:**  
  The loop iterates over the `token_ids` list in steps determined by `stride`.  
  `range(0, len(token_ids) - max_length, stride)` means the loop will start at index `0` and go up to `len(token_ids) - max_length`, incrementing by `stride` each time.

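- **Why `len(token_ids) - max_length`?**  
  The target chunk ends one token later than the input chunk, at index `i + max_length`. Stopping before `len(token_ids) - max_length` guarantees that this index still exists, so every input and target chunk contains exactly `max_length` tokens.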

#### **Example:**  
Let’s say `token_ids = [101, 2000, 303, 4567, 2345, 102]`, `max_length = 3`, and `stride = 1`.

- The loop will start at index `0` and continue until `len(token_ids) - max_length` (i.e., `len(token_ids) - 3 = 3`). So, the loop will go over the indices `0, 1, 2`.

---

### **3. Inside the Loop:**
Within the loop, two chunks are created: **input** and **target**.

#### **a. Input Chunk:**
```python
input_chunk = token_ids[i:i + max_length]
```

- **Explanation:**  
  This takes a slice of `token_ids` starting from index `i` to `i + max_length`. This slice represents the **input** for the model.
  
  For example, if `i = 0` and `max_length = 3`, `input_chunk` will be:
  ```python
  input_chunk = [101, 2000, 303]
  ```

#### **b. Target Chunk:**
```python
target_chunk = token_ids[i + 1: i + max_length + 1]
```

- **Explanation:**  
  This takes a slice of `token_ids` from `i + 1` to `i + max_length + 1`. This slice represents the **target** (next token prediction) for the model.
  
  If `i = 0` and `max_length = 3`, `target_chunk` will be:
  ```python
  target_chunk = [2000, 303, 4567]
  ```
  Here, each position in the **target** holds the token that follows the corresponding position in the input.

---

### **4. Storing the Chunks:**
```python
self.input_ids.append(torch.tensor(input_chunk))
self.target_ids.append(torch.tensor(target_chunk))
```

- After creating the `input_chunk` and `target_chunk` for each iteration of the loop, both are converted into **PyTorch tensors** and **added to the lists** `self.input_ids` and `self.target_ids`.

#### **Example:**  
After the first iteration (if `i = 0`), the lists will look like this:
```python
self.input_ids = [tensor([101, 2000, 303])]
self.target_ids = [tensor([2000, 303, 4567])]
```

The loop continues to generate new input–target pairs as it slides through the `token_ids` list.

---

### **5. Final Length of Dataset**
```python
def __len__(self):
    return len(self.input_ids)
```

- This function returns the total number of input–target pairs (or data samples) in the dataset. The length is simply the length of `self.input_ids`, since each `input_ids` corresponds to a `target_ids`.

---

### **6. Get Item**
```python
def __getitem__(self, idx):
    return self.input_ids[idx], self.target_ids[idx]
```

- This function defines how to access individual samples from the dataset. For a given index `idx`, it returns the corresponding **input** and **target** tensors.

---

### **Example of How This Works:**

Given the example:
- `token_ids = [101, 2000, 303, 4567, 2345, 102]`
- `max_length = 3`
- `stride = 1`

**The dataset will be built as follows:**

1. **First Iteration (`i = 0`):**
   - `input_chunk = [101, 2000, 303]`
   - `target_chunk = [2000, 303, 4567]`
   
   `input_ids = [tensor([101, 2000, 303])]`
   `target_ids = [tensor([2000, 303, 4567])]`

2. **Second Iteration (`i = 1`):**
   - `input_chunk = [2000, 303, 4567]`
   - `target_chunk = [303, 4567, 2345]`
   
   `input_ids = [tensor([101, 2000, 303]), tensor([2000, 303, 4567])]`
   `target_ids = [tensor([2000, 303, 4567]), tensor([303, 4567, 2345])]`

3. **Third Iteration (`i = 2`):**
   - `input_chunk = [303, 4567, 2345]`
   - `target_chunk = [4567, 2345, 102]`
   
   `input_ids = [tensor([101, 2000, 303]), tensor([2000, 303, 4567]), tensor([303, 4567, 2345])]`
   `target_ids = [tensor([2000, 303, 4567]), tensor([303, 4567, 2345]), tensor([4567, 2345, 102])]`

The dataset will be created as a sequence of input–target pairs that can then be used for training a language model.

---

### **Key Points:**
- The **for loop** is used to create input–target pairs using a **sliding window** approach.
- **`stride`** controls how much the window moves (in our case, by 1 token each time).
- The **input** is the current sequence of tokens, and the **target** is the sequence shifted by 1 position, predicting the next token.


Let me explain the line:
```python
for i in range(0, len(token_ids) - max_length, stride):
```
This line is part of the loop that iterates over the `token_ids` list using a sliding window approach. Let's break it down and explain exactly what it does, step by step.

### **What the range does:**

- **Start (`0`)**: The loop starts from index `0`, which is the beginning of the `token_ids` list.
- **Stop (`len(token_ids) - max_length`)**: The loop stops before the index `len(token_ids) - max_length`. The target chunk is shifted one position to the right, so its last token sits at index `i + max_length`. If `i` reached `len(token_ids) - max_length`, the input chunk would still be complete, but the target chunk would be one token short.
- **Step (`stride`)**: The loop moves forward by `stride` positions in each iteration. A stride larger than 1 reduces the overlap between consecutive chunks; with `stride = max_length`, the input chunks do not overlap at all.

### **Why subtract `max_length` from `len(token_ids)`?**
Every sample needs `max_length + 1` consecutive tokens: `max_length` for the input plus one extra token for the shifted target. The last start index that satisfies this is `len(token_ids) - max_length - 1`, which is the largest value `range` can produce here.

### **Example:**

Let’s say we have the following `token_ids` and settings:
```python
token_ids = [101, 2000, 303, 4567, 2345, 102]
max_length = 3
stride = 1
```

### **Step-by-Step Explanation:**
1. **Length of `token_ids`:**  
   `len(token_ids)` is `6` (since there are 6 tokens).
   
   We need to stop at `len(token_ids) - max_length`, which is `6 - 3 = 3`. So, the loop will go up to index `3`, but not include it.

2. **Range Function:**  
   The range function will be `range(0, 3, 1)`. This means the loop will iterate over the indices `0, 1, 2`.

3. **How the loop works:**  
   - **First iteration (`i = 0`):**
     - `input_chunk = token_ids[0:3] = [101, 2000, 303]`
     - `target_chunk = token_ids[1:4] = [2000, 303, 4567]`
   
   - **Second iteration (`i = 1`):**
     - `input_chunk = token_ids[1:4] = [2000, 303, 4567]`
     - `target_chunk = token_ids[2:5] = [303, 4567, 2345]`
   
   - **Third iteration (`i = 2`):**
     - `input_chunk = token_ids[2:5] = [303, 4567, 2345]`
     - `target_chunk = token_ids[3:6] = [4567, 2345, 102]`
   
   The loop ends here. At `i = 3` the input chunk `token_ids[3:6]` would still have three tokens, but the target chunk `token_ids[4:7]` would contain only two, so `range(0, 3, 1)` excludes that index.

So, the `stride = 1` means the sliding window moves forward **1 token at a time**.

---

### **General Formula:**

- The loop is designed to create a sliding window of `max_length` tokens, and at each step, the `stride` determines how many tokens you skip forward before creating the next chunk. If `stride = 1`, the window moves forward by one token, while if `stride = 2`, it would skip ahead by two tokens after each iteration.
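
The effect of different strides can be checked with a short sketch that uses plain Python lists instead of tensors. The helper `sliding_windows` is a hypothetical function written for this illustration; it reuses the `range` expression from `GPTDatasetV1`.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same range as the dataset loop."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

token_ids = [101, 2000, 303, 4567, 2345, 102]

pairs_stride_1 = sliding_windows(token_ids, max_length=3, stride=1)
pairs_stride_2 = sliding_windows(token_ids, max_length=3, stride=2)

print(len(pairs_stride_1))  # 3
print(len(pairs_stride_2))  # 2
for inp, tgt in pairs_stride_2:
    print(inp, "---->", tgt)
```

With `stride=2` the window starts at indices 0 and 2, so the dataset holds fewer samples and consecutive inputs share fewer tokens.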

---

### **Summary:**

- The **for loop** iterates over `token_ids` using a sliding window approach.
- The range function ensures that the loop doesn't go past the point where a full chunk of `max_length` tokens can be taken.
- **Stride** controls how far the sliding window moves in each iteration, and the loop generates pairs of **input** and **target** chunks for model training.

In [52]:
def create_dataloader_v1(txt, batch_size, max_length=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")  # GPT-2 BPE tokenizer
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                            drop_last=drop_last, num_workers=num_workers)
    return dataloader

In [59]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
dataloader= create_dataloader_v1(raw_text, max_length=4,stride=2,shuffle=False,batch_size=1)
data_iter = iter(dataloader)
first_batch = next(data_iter)
second_batch = next(data_iter)
print(first_batch)
print(second_batch)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
[tensor([[2885, 1464, 1807, 3619]]), tensor([[1464, 1807, 3619,  402]])]


### Converting Token IDs into Embedding Vectors:

This section describes how **token IDs** are converted into **embedding vectors** for training a GPT-like (decoder-only transformer) large language model (LLM). This step is essential in NLP models, as it lets the model work with dense, continuous vector representations instead of discrete token IDs. Here's a breakdown of the key concepts and steps involved:

---

### 1. **Tokenization & Token IDs**:
First, you start with some **raw text** that is tokenized using a **tokenizer** (such as BPE - Byte Pair Encoding). For example, let's say the sentence "This is an example." is tokenized into token IDs, and the result might look like this:
```
Tokenized Text: ["This", "is", "an", "example", "."]
Token IDs: [40134, 2052, 133, 389, 12]
```

- The tokenizer converts each word or symbol into a **unique integer ID**.
  
Now, these token IDs need to be converted into **embedding vectors** to allow the model to work with them effectively during training.

---

### 2. **What is an Embedding?**:
An **embedding** is a continuous vector representation of a discrete object (in this case, a token). Instead of representing each token with just an integer ID (which is discrete), each token is represented as a vector in a continuous space. These vectors help the model capture semantic relationships between tokens.

For example:
- A token ID `3` could map to an embedding vector like `[ -0.4015, 0.9666, -1.1481 ]`.
- A token ID `2` might map to `[ 1.2753, -0.2010, -0.1606 ]`.

The key idea is that these embeddings provide a way to represent the token in a space where similar tokens are close to each other (in terms of their embedding vectors).

---

### 3. **Embedding Layer Initialization**:
The process starts by initializing a **random embedding layer**. This layer is typically a matrix where:
- Each **row** corresponds to the embedding of a specific token ID.
- Each **column** corresponds to a dimension of the embedding vector (e.g., 3 dimensions, 5 dimensions, 12,288 dimensions, etc.).

#### Example:
Suppose you have a vocabulary of **6 tokens** and you want to create embeddings of size **3** (3-dimensional embeddings). The embedding layer would look like this:
```
Embedding Weight Matrix (size: 6 x 3):
[ [ 0.3374, -0.1778, -0.1690],
  [ 0.9178,  1.5810,  1.3010],
  [ 1.2753, -0.2010, -0.1606],
  [ -0.4015,  0.9666, -1.1481],
  [ -1.1589,  0.3255, -0.6315],
  [ -2.8400, -0.7849, -1.4096] ]
```
Here, we have **6 rows** (one for each token in the vocabulary), and each row has **3 columns** (embedding dimensions).

---

### 4. **Token ID Lookup**:
When we want to look up the embedding of a token ID, we essentially perform a **lookup** in the embedding matrix.

For example, to get the embedding for token ID `3`, we go to the row at index 3 of the matrix (the fourth row, since indexing is zero-based), which is:
```
[ -0.4015, 0.9666, -1.1481 ]
```
This is the **embedding vector** corresponding to token ID `3`.

The lookup is equivalent to multiplying a **one-hot vector** (such as `[0, 0, 0, 1, 0, 0]` for token ID `3`) by the embedding matrix, but instead of building that sparse vector explicitly, we directly retrieve the **dense, continuous vector** representing the token.

---

### 5. **Looking Up Embeddings for Multiple Token IDs**:
You can look up the embeddings for multiple token IDs in one step. For example, if you have a sequence of token IDs `[2, 3, 5, 1]`, and you pass them through the embedding layer:
```python
input_ids = torch.tensor([2, 3, 5, 1])
embedding_layer(input_ids)
```
The result is a **4 x 3 matrix**:
```
tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]])
```
Each **row** in this matrix corresponds to the embedding vector for each token ID in the input sequence.

---

### 6. **Embedding Layer as a Lookup**:
The embedding layer in PyTorch acts as a **lookup table** for token IDs. Instead of encoding the tokens manually into vectors, we use this layer to retrieve the corresponding embedding for each token ID efficiently. As mentioned, this is essentially a more optimized form of one-hot encoding followed by a matrix multiplication in a fully connected layer.
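
The equivalence between an embedding lookup and a one-hot matrix multiplication can be checked with the 6 x 3 weight matrix from the example above. This sketch uses plain Python lists rather than `nn.Embedding` so that it runs without PyTorch; `one_hot` and `matmul` are hypothetical helpers written only for this illustration.

```python
# Weight matrix copied from the 6 x 3 example (one row per token ID)
weights = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position index."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def matmul(rows, matrix):
    """Multiply a list of row vectors by a matrix."""
    return [[sum(r[k] * matrix[k][c] for k in range(len(matrix)))
             for c in range(len(matrix[0]))] for r in rows]

input_ids = [2, 3, 5, 1]

# Direct row lookup, which is what an embedding layer does
via_lookup = [weights[i] for i in input_ids]

# One-hot encoding followed by a matrix multiplication
via_matmul = matmul([one_hot(i, len(weights)) for i in input_ids], weights)

matches = all(abs(a - b) < 1e-9
              for row_a, row_b in zip(via_lookup, via_matmul)
              for a, b in zip(row_a, row_b))
print(matches)  # True
```

Both routes select the same rows; the lookup simply skips the multiplications by zero.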

### Key Takeaways:
1. **Embedding Vectors**: The embedding process converts discrete token IDs into continuous vectors that represent tokens in a high-dimensional space, capturing their semantic meaning.
2. **Embedding Layer**: The embedding layer holds a matrix of token embeddings, with each row corresponding to a token's embedding vector.
3. **Lookup Process**: Token IDs are mapped to their respective embedding vectors by looking up the corresponding row in the embedding matrix.
4. **Training**: These embeddings are **randomly initialized** at the start of training and are updated during the backpropagation process to improve the model's ability to predict and understand relationships between tokens.

---

### Why is this Important?
Embedding vectors are an essential part of training LLMs like GPT, as they allow the model to process text input more effectively. These embeddings help the model understand semantic relationships between tokens, allowing it to generate text and make predictions more accurately. During training, the embedding weights are optimized to capture better relationships between words, which improves the model's performance over time.

In [60]:
import torch
import torch.nn as nn

# Define vocabulary size and embedding dimensions
vocab_size = 50257  # Vocabulary size from BPE tokenizer
embedding_dim = 256  # Embedding dimension (realistic but smaller than GPT-3)
context_length = 4   # Maximum sequence length (number of tokens per input)
batch_size = 8       # Number of text samples per batch

# Create the token embedding layer
token_embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Create the positional embedding layer
pos_embedding_layer = nn.Embedding(context_length, embedding_dim)

# Simulated tokenized input (batch of token IDs)
inputs = torch.randint(0, vocab_size, (batch_size, context_length))  # Random token IDs

# Convert token IDs to embeddings
token_embeddings = token_embedding_layer(inputs)  # Shape: (8, 4, 256)

# Create positional encodings (same for all samples in batch)
positional_indices = torch.arange(context_length).unsqueeze(0).repeat(batch_size, 1)
pos_embeddings = pos_embedding_layer(positional_indices)  # Shape: (8, 4, 256)

# Add positional embeddings to token embeddings
input_embeddings = token_embeddings + pos_embeddings  # Shape: (8, 4, 256)

# Print output shapes to verify
print("Token Embeddings Shape:", token_embeddings.shape)
print("Positional Embeddings Shape:", pos_embeddings.shape)
print("Final Input Embeddings Shape:", input_embeddings.shape)


Token Embeddings Shape: torch.Size([8, 4, 256])
Positional Embeddings Shape: torch.Size([8, 4, 256])
Final Input Embeddings Shape: torch.Size([8, 4, 256])
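
As a side note, the explicit `.repeat(batch_size, 1)` is not strictly required. The sketch below, which uses a random stand-in tensor in place of real token embeddings (an assumption made only to keep the example self-contained), suggests that PyTorch broadcasting gives the same sum when the positional embeddings have shape `(context_length, embedding_dim)`.

```python
import torch

torch.manual_seed(0)
batch_size, context_length, embedding_dim = 8, 4, 256

# Stand-in for real token embeddings (random values, illustration only)
token_embeddings = torch.randn(batch_size, context_length, embedding_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

# Variant 1: positional embeddings of shape (4, 256), broadcast over the batch
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
broadcast_sum = token_embeddings + pos_embeddings

# Variant 2: indices repeated for every sample, as in the cell above
repeated_indices = torch.arange(context_length).unsqueeze(0).repeat(batch_size, 1)
explicit_sum = token_embeddings + pos_embedding_layer(repeated_indices)

print("Broadcast result shape:", broadcast_sum.shape)
print("Both variants match:", torch.allclose(broadcast_sum, explicit_sum))
```

Either form works; broadcasting simply avoids materializing eight identical copies of the position indices.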


### **Summary of This Section: Encoding Word Positions and Creating Input Embeddings**  

1. **Token Embeddings**  
   - The text is first tokenized into **token IDs**.  
   - Each token ID is mapped to a **fixed-size vector** using an **embedding layer** in PyTorch.  
   - This is like looking up a word in a dictionary and getting a unique numerical representation.

2. **The Problem: No Positional Awareness**  
   - The transformer model does not **naturally** understand the order of words.  
   - For example, without position information, "the cat sat" and "sat cat the" would map to the same set of token vectors, so the model could not tell the two orderings apart.  
   - This happens because each word gets the same embedding regardless of its position in the sentence.

3. **Solution: Positional Embeddings**  
   - To fix this, we add **positional embeddings** to token embeddings.  
   - Each position in a sentence (1st, 2nd, 3rd, etc.) has a **unique** vector.  
   - These vectors are learned during training.  
   - By summing positional embeddings with token embeddings, the model now understands the order of words.

4. **Implementation in Code**  
   - We create a **token embedding layer** that maps token IDs to vectors.  
   - We create a **positional embedding layer** that generates position-specific vectors.  
   - The token embeddings and positional embeddings are **added together** to create **input embeddings**.  
   - The final result is a **3D tensor (batch_size, sequence_length, embedding_size)** that is ready to be fed into a **GPT model**.
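
The problem and its fix can be illustrated with a small sketch. The sizes and token IDs here are made up for illustration: the same token ID is placed at two positions, and the vectors are compared before and after the positional embeddings are added.

```python
import torch

torch.manual_seed(0)

# Tiny illustrative sizes (assumptions, not the GPT values)
vocab_size, context_length, embedding_dim = 10, 4, 8
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

# Token ID 7 appears at positions 0 and 2
token_ids = torch.tensor([[7, 3, 7, 1]])
token_embeddings = token_embedding_layer(token_ids)
input_embeddings = token_embeddings + pos_embedding_layer(torch.arange(context_length))

# Without positions, both occurrences of ID 7 share one vector
same_without_positions = torch.equal(token_embeddings[0, 0], token_embeddings[0, 2])
# With positions added, the two occurrences become distinguishable
same_with_positions = torch.equal(input_embeddings[0, 0], input_embeddings[0, 2])

print("Identical before adding positions:", same_without_positions)
print("Identical after adding positions:", same_with_positions)
```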

### **Key Takeaways**
✅ Token embeddings give meaning to words.  
✅ Positional embeddings add order information.  
✅ The sum of both embeddings becomes the **final input representation** for LLMs.  

This ensures the model understands **both** the words and their **order in a sentence**. 🚀

# Final Code

In [1]:
import urllib.request
import torch
import tiktoken  
from torch.utils.data import Dataset, DataLoader

# Download the text file
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# Load and tokenize the text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

tokenizer = tiktoken.get_encoding("gpt2")  # Same BPE tokenizer the DataLoader uses (50,257 tokens)
enc_text = tokenizer.encode(raw_text)

# Define a custom dataset class
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# Function to create a DataLoader
def create_dataloader_v1(txt, batch_size, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
    return dataloader 

# Create DataLoader
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=2, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
second_batch = next(data_iter)
print("First batch:", first_batch)
print("Second batch:", second_batch)

# --- Adding Token and Positional Embeddings ---

# Define token embedding layer
vocab_size = 50257  # Vocabulary size of the GPT-2 BPE tokenizer
embedding_dim = 256  # Example embedding size (the largest GPT-3 model uses 12,288)
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

# Convert token IDs into token embeddings
token_embeddings = token_embedding_layer(first_batch[0])  # first_batch[0] = input IDs, first_batch[1] = target IDs
print("Token Embeddings Shape:", token_embeddings.shape)

# Define positional embedding layer
max_length = 4  # Same as the max sequence length
pos_embedding_layer = torch.nn.Embedding(max_length, embedding_dim)

# Generate position embeddings
positions = torch.arange(max_length).unsqueeze(0)  # Create position indices
pos_embeddings = pos_embedding_layer(positions)
print("Positional Embeddings Shape:", pos_embeddings.shape)

# Combine token and positional embeddings
input_embeddings = token_embeddings + pos_embeddings
print("Final Input Embeddings Shape:", input_embeddings.shape)


First batch: [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Second batch: [tensor([[2885, 1464, 1807, 3619]]), tensor([[1464, 1807, 3619,  402]])]
Token Embeddings Shape: torch.Size([1, 4, 256])
Positional Embeddings Shape: torch.Size([1, 4, 256])
Final Input Embeddings Shape: torch.Size([1, 4, 256])



### **Explanation of What’s Happening in This Code**

#### **1. Loading and Tokenizing the Text**
- The text is downloaded from the internet and read into a variable.
- A tokenizer converts the text into **token IDs** (numerical representations of words or subwords).

#### **2. Creating a Dataset and DataLoader**
- A custom `GPTDatasetV1` class is created, which:
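  - Splits the tokenized text into **fixed-length sequences** with a sliding window: each window holds `max_length` tokens and starts `stride` tokens after the previous one.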
  - Generates **input** and **target** sequences by shifting tokens by one position.
- The `create_dataloader_v1()` function organizes these sequences into **batches**.
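
To make the windowing concrete, here is a small sketch that reuses the loop from `GPTDatasetV1` on a made-up list of token IDs (the values 10 to 19 are placeholders, not real tokenizer output), with the same `max_length=4` and `stride=2` used above.

```python
# Placeholder token IDs (illustration only, not real tokenizer output)
token_ids = list(range(10, 20))
max_length, stride = 4, 2

pairs = []
for i in range(0, len(token_ids) - max_length, stride):
    input_chunk = token_ids[i:i + max_length]
    target_chunk = token_ids[i + 1:i + max_length + 1]  # inputs shifted by one
    pairs.append((input_chunk, target_chunk))

for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
```

Each target is its input shifted by one token, and consecutive windows overlap by `max_length - stride` tokens, which matches the pattern in the first and second batch printed above.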

#### **3. Generating Token Embeddings**
- A **token embedding layer** is defined using `torch.nn.Embedding`.
- This layer maps **token IDs** to **256-dimensional vectors**.
- The first batch of token IDs is converted into embeddings.

#### **4. Generating Positional Embeddings**
- A **positional embedding layer** is created.
- It maps **each position** (1st, 2nd, 3rd, etc.) in the sequence to a **256-dimensional vector**.
- This ensures the model understands the order of tokens.

#### **5. Combining Token and Positional Embeddings**
- The **token embeddings** and **positional embeddings** are added together.
- This forms the **final input embeddings**, which will be passed into a GPT model.

---

### **Key Takeaways**
✅ **Token embeddings** give words a numerical representation.  
✅ **Positional embeddings** add order information to the sequence.  
✅ The sum of both creates **input embeddings**, which are the **final processed input** for an LLM.  

This setup ensures that the model can take into account **both the meaning of words** and **their order in a sentence**. 🚀

The shape **`torch.Size([1, 4, 256])`** represents the dimensions of the **token embeddings tensor**, which consists of three parts:  

1️⃣ **Batch size (`1`)**  
   - This represents the number of sequences processed at once.  
   - In this case, we are processing **one sequence** in the batch.  

2️⃣ **Sequence length (`4`)**  
   - This represents the number of tokens in the input sequence.  
   - Each input sequence consists of **4 tokens** (since `max_length=4`).  

3️⃣ **Embedding dimension (`256`)**  
   - This represents the size of each token's embedding vector.  
   - Each token is mapped to a **256-dimensional vector** in the embedding layer.  

### **Example Breakdown**
If the input token IDs were:
```python
tensor([[40, 367, 2885, 1464]])
```
- The batch contains **one sequence** (`1`).
- There are **four tokens** in the sequence (`4`).
- Each token is represented by a **256-dimensional vector** (`256`).
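
The three axes can also be inspected by indexing. This is a minimal sketch that assumes a freshly initialized embedding layer, so the values are random; only the shapes matter here.

```python
import torch

torch.manual_seed(0)
token_embedding_layer = torch.nn.Embedding(50257, 256)

token_ids = torch.tensor([[40, 367, 2885, 1464]])
token_embeddings = token_embedding_layer(token_ids)

first_sequence = token_embeddings[0]         # all 4 tokens of sequence 0
third_token_vector = token_embeddings[0, 2]  # vector for token ID 2885

print("Full tensor:", token_embeddings.shape)
print("One sequence:", first_sequence.shape)
print("One token vector:", third_token_vector.shape)
```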

---

### **Intuition**  
The first axis of the tensor indexes the **sequences** in the batch. The second axis indexes the **token positions** within a sequence. The third axis holds the **256 components** of the embedding vector that each token is mapped to.  

For a larger batch size (e.g., `batch_size=8`), the shape would be **`(8, 4, 256)`**, meaning we are processing 8 sequences simultaneously.