# Load the tokenizer


In [1]:
import sys
sys.path.append('..')

In [2]:
from minbpe import BasicTokenizer

tokenizer=BasicTokenizer()
tokenizer.load(model_file="../output/tokenizer/my_tokenizer.model")

def get_vocab_size(tokenizer:BasicTokenizer)->int:
    vocab=tokenizer.vocab
    special_tokens = tokenizer.special_tokens

    return len(vocab) + len(special_tokens)

# Create the model

In [3]:
import torch
torch.manual_seed(3647)

<torch._C.Generator at 0x2067f3f6dd0>


### 🧠 Explanation: Setting up Random Seed in PyTorch

#### 1️⃣ Importing the PyTorch Library
```python
import torch
```

* **`torch`** is the main PyTorch library used for building and training neural networks.
* It provides tools for:

  * Creating tensors (multi-dimensional arrays)
  * Performing mathematical operations
  * Running models on CPU or GPU
  * Handling deep learning workflows like automatic differentiation and optimization.

---

#### 2️⃣ Setting a Manual Seed

```python
torch.manual_seed(3647)
```

* **What this does:**
  It sets the *random seed* for PyTorch’s random number generator.
  Random numbers are used in many parts of AI/ML — for example, when:

  * Initializing model weights
  * Shuffling training data
  * Performing random augmentations

* **Why set a seed?**
  To make your results **reproducible**.
  Without a fixed seed, random operations produce different values on every run, so two runs of the same code can give different losses and outputs.

* **Parameter Explanation:**

  * `3647` → An arbitrary integer chosen as the *seed value*.
    You can use any integer (like 0, 42, or 9999).
    Using the same seed makes PyTorch’s generator produce the same sequence of random numbers each time you run the program.

---

✅ **In simple words:**
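This code imports PyTorch and fixes the seed of its random number generator, so repeated runs of the notebook on the same setup start from the same random numbers. Some GPU operations can still be nondeterministic, so results are not guaranteed to match bit for bit.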
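The reproducibility claim can be checked directly. The sketch below re-seeds the generator and compares the numbers drawn afterwards; the helper name `sample_after_seed` is made up for this illustration and is not part of the notebook.

```python
import torch

def sample_after_seed(seed: int) -> torch.Tensor:
    """Seed PyTorch's default generator, then draw three random numbers."""
    torch.manual_seed(seed)
    return torch.rand(3)

first = sample_after_seed(3647)
second = sample_after_seed(3647)   # same seed, so the same draws
other = sample_after_seed(42)      # different seed, so different draws

print(torch.equal(first, second))  # True
print(torch.equal(first, other))   # False
```

The same logic applies to weight initialization and `DataLoader` shuffling, which draw from the same default generator unless a separate one is passed in.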




In [4]:
from transformer.model import GPTLanguageModel

block_size = 256
n_embd = 512
n_head = 8
n_layer = 4
dropout = 0.2
batch_size = 64
vocab_size = get_vocab_size(tokenizer)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = GPTLanguageModel(
    vocab_size=vocab_size,
    block_size=block_size,
    n_embd=n_embd,
    n_head=n_head,
    n_layer=n_layer,
    dropout=dropout,
    device=device
).to(device)
model = torch.compile(model)

print(sum(p.numel() for p in model.parameters())/1e6, 'M parameters')

13.795338 M parameters
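To see where a figure of this size can come from, here is a rough parameter count. The source of `transformer.model` is not shown, so the layer layout below is an assumption: a nanoGPT-style decoder with bias-free query/key/value projections, a biased attention output projection, a feed-forward layer four times wider than `n_embd`, two LayerNorms per block, a final LayerNorm, and an untied output head. The vocabulary size is not printed anywhere in the notebook either; `1034` is a hypothetical value chosen because it makes this assumed layout match the printed total.

```python
def count_params(vocab_size: int, block_size: int, n_embd: int, n_layer: int) -> int:
    """Parameter count for an ASSUMED nanoGPT-style layout (not confirmed by the notebook)."""
    d = n_embd
    attention = 3 * d * d            # query, key, value projections, no bias (assumed)
    attention += d * d + d           # attention output projection with bias
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two linear layers, 4x hidden width
    layer_norms = 2 * (2 * d)        # two LayerNorms per block (weight + bias each)
    per_block = attention + feed_forward + layer_norms

    embeddings = vocab_size * d + block_size * d   # token table + position table
    final_norm = 2 * d
    lm_head = d * vocab_size + vocab_size          # untied output layer with bias
    return n_layer * per_block + embeddings + final_norm + lm_head

# vocab_size=1034 is hypothetical: it is the value that reproduces the printed total.
total = count_params(vocab_size=1034, block_size=256, n_embd=512, n_layer=4)
print(total / 1e6, "M parameters")
```

Under these assumptions the four transformer blocks account for roughly 12.6M of the parameters, and `n_head` does not change the count because the heads only split `n_embd` between them.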


# Data preparation

### 1. Load the data

In [5]:
with open("../output/combined_text.txt","r",encoding='utf-8') as f:
    text_sequence=f.read()

encoded_text_sequence=tokenizer.encode(text_sequence)
len(encoded_text_sequence)

167

### 2. Split into train and validation

In [6]:
data = torch.tensor(encoded_text_sequence,dtype=torch.long)
split_index=int(0.9*len(data))
train_data=data[:split_index]
val_data =data[split_index:]

### 3. Data loader

In [7]:
from typing import Tuple
from torch.utils.data import Dataset,DataLoader

class TextDataset(Dataset):
    def __init__(self,data: torch.Tensor,block_size:int) -> None:
        if len(data) <= block_size:
            raise ValueError(
                f"The length of the data ({len(data)}) must be greater than the block_size ({block_size})."
            )
        
        self.data=data
        self.block_size=block_size

    def __len__(self)->int:
        return len(self.data) - self.block_size
    
    def __getitem__(self, index:int)->Tuple[torch.Tensor,torch.Tensor]:
        x=self.data[index : index + self.block_size]
        y=self.data[index + 1: index + self.block_size+1]
        return x,y
    

def get_dataloaders(
    train_data: torch.Tensor,
    val_data: torch.Tensor,
    block_size: int,
    batch_size: int,
    device: torch.device,
) -> Tuple[DataLoader, DataLoader]:
    train_dataset = TextDataset(train_data.to(device), block_size)
    val_dataset = TextDataset(val_data.to(device), block_size)

    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
    )

    return train_loader, val_loader


### 🧠 Explanation: Creating a Custom Text Dataset and Data Loaders in PyTorch

---

#### 1️⃣ Importing Required Modules
```python
from typing import Tuple
from torch.utils.data import Dataset, DataLoader
```

* **`typing.Tuple`** → Used for type hints.
  It tells readers (and some tools) that a function will return a tuple — for example, `(x, y)`.
* **`torch.utils.data.Dataset`** → Base class for creating custom datasets in PyTorch.
* **`torch.utils.data.DataLoader`** → Helper class that loads data from a `Dataset` in **batches**, optionally **shuffling** them for training.

---

#### 2️⃣ Creating the Custom Dataset Class

```python
class TextDataset(Dataset):
```

* You’re defining a new class `TextDataset` that inherits from PyTorch’s `Dataset`.
* This lets you store and organize your data in a format that PyTorch can easily use.

---

#### 3️⃣ Initializing the Dataset

```python
def __init__(self, data: torch.Tensor, block_size: int) -> None:
```

* **`__init__`**: This is the constructor that runs when you create a `TextDataset` object.
* **Parameters:**

  * `data`: The text data converted into a tensor of numbers (each number could represent a token or character).
  * `block_size`: The number of tokens in each input sequence (like the length of a sentence chunk).

---

#### 4️⃣ Checking Data Length

```python
if len(data) <= block_size:
    raise ValueError(
        f"The length of the data ({len(data)}) must be greater than the block_size ({block_size})."
    )
```

* Ensures the dataset is large enough to form at least one valid input-target pair.
* If not, it raises an error message.

---

#### 5️⃣ Storing Variables

```python
self.data = data
self.block_size = block_size
```

* Saves both values for later use in other functions of this class.

---

#### 6️⃣ Getting the Length of the Dataset

```python
def __len__(self) -> int:
    return len(self.data) - self.block_size
```

* This tells PyTorch **how many samples** your dataset contains.
* Each sample is made up of:

  * An input (`x`) of length `block_size`
  * A target (`y`) of length `block_size`
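* Sample `index` reads tokens up to position `index + block_size` (the last target token), so the largest valid start is `len(data) - block_size - 1`. That leaves `len(data) - block_size` samples, which is why we subtract `block_size`.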

---

#### 7️⃣ Getting an Item by Index

```python
def __getitem__(self, index: int) -> Tuple[torch.Tensor, torch.Tensor]:
    x = self.data[index : index + self.block_size]
    y = self.data[index + 1 : index + self.block_size + 1]
    return x, y
```

* **What happens here:**
  For each `index`, we create:

  * **`x` (input)** → A slice of data of size `block_size`
  * **`y` (target/output)** → The same slice but shifted by one position to the right
* Example:
  If `data = [1, 2, 3, 4, 5]` and `block_size = 3`, then:

  * At `index = 0`:

    * `x = [1, 2, 3]`
    * `y = [2, 3, 4]`

  Each element of `y` is the token that follows the corresponding element of `x`, so the model learns **next-token prediction**.
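The slicing can be tried on the small example above without PyTorch. This sketch uses plain lists, and `make_pairs` is an illustrative helper rather than part of the notebook.

```python
from typing import List, Tuple

def make_pairs(data: List[int], block_size: int) -> List[Tuple[List[int], List[int]]]:
    """Return every (x, y) pair that TextDataset-style slicing would produce."""
    n_samples = len(data) - block_size   # same formula as __len__
    pairs = []
    for index in range(n_samples):
        x = data[index : index + block_size]
        y = data[index + 1 : index + block_size + 1]
        pairs.append((x, y))
    return pairs

pairs = make_pairs([1, 2, 3, 4, 5], block_size=3)
for x, y in pairs:
    print(x, "->", y)
# [1, 2, 3] -> [2, 3, 4]
# [2, 3, 4] -> [3, 4, 5]
```

Five tokens with `block_size = 3` give exactly two samples, matching `len(data) - block_size`.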

---

### ⚙️ Function: Creating DataLoaders

```python
def get_dataloaders(
    train_data: torch.Tensor,
    val_data: torch.Tensor,
    block_size: int,
    batch_size: int,
    device: torch.device,
) -> Tuple[DataLoader, DataLoader]:
```

This function creates two **DataLoaders** — one for **training** and one for **validation**.

---

#### Step 1: Create Datasets

```python
train_dataset = TextDataset(train_data.to(device), block_size)
val_dataset = TextDataset(val_data.to(device), block_size)
```

* Moves the data to the chosen **device** (CPU or GPU).
* Creates `TextDataset` objects for both training and validation data.

---

#### Step 2: Wrap Datasets into DataLoaders

```python
train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
)
val_loader = DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=False,
)
```

* **`batch_size`** → How many samples to load at once.
  Example: If `batch_size = 32`, each iteration gives 32 input–target pairs.
* **`shuffle=True`** → Randomizes the order of samples in each epoch (good for training).
* **`shuffle=False`** → Keeps order consistent for validation/testing.
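How many batches a loader yields follows from the dataset size and `batch_size`. The sketch below assumes `drop_last` is left at its default of `False`, as in the code above, so a smaller final batch is kept; `batch_sizes` is an illustrative helper.

```python
import math

def batch_sizes(n_samples: int, batch_size: int) -> list:
    """Sizes of the batches a DataLoader yields when drop_last is False."""
    n_batches = math.ceil(n_samples / batch_size)
    sizes = [batch_size] * n_batches
    remainder = n_samples % batch_size
    if remainder:
        sizes[-1] = remainder  # the final batch holds whatever is left over
    return sizes

sizes = batch_sizes(n_samples=100, batch_size=32)
print(len(sizes), sizes)   # 4 [32, 32, 32, 4]
```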

---

#### Step 3: Return Both Loaders

```python
return train_loader, val_loader
```

* Returns a tuple containing:

  * `train_loader`: for model training
  * `val_loader`: for model evaluation

---

✅ **In simple words:**
This code:

1. Defines how to slice text data into small training examples (`TextDataset`).
2. Packs them into batches with PyTorch’s `DataLoader`.
3. Makes it easy to feed text data into a neural network for training and validation.




In [8]:
train_loader,val_loader= get_dataloaders(
    train_data=train_data,
    val_data=val_data,
    block_size=block_size,
    batch_size=batch_size,
    device=device
)

x,y=next(iter(train_loader))
x.shape,y.shape

ValueError: The length of the data (150) must be grater than the block_size (256).
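The error follows from the numbers already printed: the encoded corpus has 167 tokens, the 90% split leaves 150 of them for training, and `block_size` is 256. As a sketch, the helper below (an illustrative name, not part of the notebook) computes the largest `block_size` that both splits could support.

```python
def largest_block_size(n_tokens: int, train_fraction: float = 0.9) -> int:
    """Largest block_size for which both splits pass TextDataset's length check."""
    split_index = int(train_fraction * n_tokens)
    n_train = split_index
    n_val = n_tokens - split_index
    # TextDataset requires len(data) > block_size for each split.
    return min(n_train, n_val) - 1

n_tokens = 167                              # length printed after encoding the corpus
split_index = int(0.9 * n_tokens)
print(split_index, n_tokens - split_index)  # 150 17
print(largest_block_size(n_tokens))         # 16
```

With so little text, the practical remedies are a much larger corpus or a much smaller `block_size`; the validation split, at 17 tokens, is the tighter constraint.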


### 🧠 Explanation: Loading and Inspecting a Batch of Training Data

---

#### 1️⃣ Creating Training and Validation DataLoaders
```python
train_loader, val_loader = get_dataloaders(
    train_data=train_data,
    val_data=val_data,
    block_size=block_size,
    batch_size=batch_size,
    device=device
)
```

* This line **calls** the `get_dataloaders()` function (which we defined earlier).
* It prepares two **DataLoaders** — one for training and one for validation.

Let’s break down each argument 👇

| Parameter        | Meaning                                                                       |
| ---------------- | ----------------------------------------------------------------------------- |
| **`train_data`** | Tensor containing all training data (numerical representation of text).       |
| **`val_data`**   | Tensor for validation data (used to check model performance).                 |
| **`block_size`** | The number of tokens in one input sequence (length of each training example). |
| **`batch_size`** | Number of such input–output pairs processed at once during training.          |
| **`device`**     | Tells PyTorch whether to store data on the **CPU** or **GPU**.                |

* The function returns two loaders:

  * **`train_loader`** → Randomized batches for training.
  * **`val_loader`** → Sequential batches for validation.

---

#### 2️⃣ Getting One Batch from the Training DataLoader

```python
x, y = next(iter(train_loader))
```

* **`iter(train_loader)`** → Converts the DataLoader into an **iterator**, meaning we can go through it batch by batch.
* **`next(...)`** → Gets the **first batch** from that iterator.
* Each batch returns a **tuple**:

  * **`x`** → Batch of input sequences (each of length `block_size`).
  * **`y`** → Batch of target sequences (the same as `x` but shifted by one position).

---

#### 3️⃣ Checking the Shape of Batches

```python
x.shape, y.shape
```

* This prints the **dimensions (shape)** of both tensors.

* The expected shape is usually:

  ```
  (batch_size, block_size)
  ```

* Example:
  If `batch_size = 64` and `block_size = 8`, then:

  ```
  x.shape = torch.Size([64, 8])
  y.shape = torch.Size([64, 8])
  ```

  → meaning: **64 examples per batch**, each with **8 tokens**.
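The shape can be confirmed on toy data. This sketch repeats the slicing logic in a small stand-in class (`TinyTextDataset`, a name used only here) so that it runs on its own, and uses `torch.arange(100)` in place of real encoded tokens.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class TinyTextDataset(Dataset):
    """Same slicing logic as TextDataset, repeated so the example is self-contained."""
    def __init__(self, data, block_size):
        self.data = data
        self.block_size = block_size

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, index):
        x = self.data[index : index + self.block_size]
        y = self.data[index + 1 : index + self.block_size + 1]
        return x, y

data = torch.arange(100)   # stand-in for encoded tokens
loader = DataLoader(TinyTextDataset(data, block_size=8), batch_size=4, shuffle=False)

x, y = next(iter(loader))
print(x.shape, y.shape)                   # torch.Size([4, 8]) torch.Size([4, 8])
print(torch.equal(x[:, 1:], y[:, :-1]))   # True: y is x shifted by one token
```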

---

✅ **In simple words:**
This code takes your text data, divides it into mini-groups (batches) of input and target sequences, and shows the size of one batch that will be fed into the model during training.



# 4. Training

In [None]:
from typing import Dict
@torch.no_grad()
def estimata_loss(
    model: torch.nn.Module,
    train_loader:DataLoader,
    val_loader: DataLoader,
    eval_iters: int 
)-> Dict[str,float]:
    output={}
    model.eval()

    for split , loader in [('train',train_loader),('val',val_loader)]:
        losses = torch.zeros(eval_iters)
        for i ,(x,y) in enumerate(loader):
            if i>= eval_iters:
                break
            with torch.no_grad():
                _,loss=model(x,y)
            losses[i]=loss.item()
        output[split]=losses.mean().item()
    
    model.train()
    return output
    


### 🧠 Explanation: Estimating Loss for Training and Validation Sets

---

#### 1️⃣ Importing Required Module
```python
from typing import Dict
```

* **`Dict`** from the `typing` module is used for **type hints**.
* It indicates that a function will return a **dictionary** — here, `{ "train": value, "val": value }`.

---

#### 2️⃣ Disabling Gradient Calculation

```python
@torch.no_grad()
```

* This **decorator** tells PyTorch **not to track gradients** during the function call.
* Why?

  * When we are **evaluating** (not training), we don’t need gradients.
  * It **saves memory** and makes computations **faster**.
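A minimal sketch of the decorator's effect: the same computation is run once normally and once inside a function decorated with `@torch.no_grad()`. The function name `evaluate` is only for illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = (w * 2).sum()          # recorded in autograd's graph

@torch.no_grad()
def evaluate(weights: torch.Tensor) -> torch.Tensor:
    return (weights * 2).sum()   # same math, but no graph is recorded

untracked = evaluate(w)

print(tracked.requires_grad)     # True
print(untracked.requires_grad)   # False
```

Because no graph is stored for `untracked`, calling `.backward()` on it would raise an error, which is the expected trade-off during evaluation.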

---

#### 3️⃣ Defining the Function

```python
def estimata_loss(
    model: torch.nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    eval_iters: int
) -> Dict[str, float]:
```

* This function computes the **average loss** for both training and validation datasets over a few batches.

| Parameter          | Type               | Description                                               |
| ------------------ | ------------------ | --------------------------------------------------------- |
| **`model`**        | `torch.nn.Module`  | The neural network to evaluate.                           |
| **`train_loader`** | `DataLoader`       | DataLoader providing training data batches.               |
| **`val_loader`**   | `DataLoader`       | DataLoader providing validation data batches.             |
| **`eval_iters`**   | `int`              | Number of batches to use for estimating the average loss. |
| **Returns**        | `Dict[str, float]` | Dictionary with average losses for `"train"` and `"val"`. |

---

#### 4️⃣ Creating an Empty Output Dictionary

```python
output = {}
```

* This will store the results like:

  ```python
  {'train': 0.45, 'val': 0.50}
  ```

---

#### 5️⃣ Setting the Model to Evaluation Mode

```python
model.eval()
```

* Switches the model to **evaluation mode**, which:

  * Turns off dropout layers.
  * Makes batch normalization layers use their stored running statistics instead of updating them.
* This ensures consistent behavior while testing or validating.
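The difference between the two modes is easy to observe on a single dropout layer. This is a small sketch using a standalone `nn.Dropout`, not the GPT model itself.

```python
import torch
from torch import nn

layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()
train_out = layer(x)   # roughly half the entries are zeroed, the rest scaled by 1 / (1 - p)

layer.eval()
eval_out = layer(x)    # dropout does nothing in evaluation mode

print((train_out == 0).any().item())   # True
print(torch.equal(eval_out, x))        # True
```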

---

#### 6️⃣ Iterating Through Both Loaders

```python
for split, loader in [('train', train_loader), ('val', val_loader)]:
```

* Loops over **two datasets**: one for training and one for validation.
* Each iteration sets:

  * `split` → either `"train"` or `"val"`.
  * `loader` → the corresponding DataLoader.

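⚠️ **Note:**
The second tuple must hold the variable `val_loader`, not the string `'val_loader'`. With quotes, the loop would iterate over the characters of that string and fail when unpacking each one into `(x, y)`.
The correct line is: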

```python
for split, loader in [('train', train_loader), ('val', val_loader)]:
```

---

#### 7️⃣ Initializing a Tensor for Loss Storage

```python
losses = torch.zeros(eval_iters)
```

* Creates a tensor (array) to store the loss value from each batch.
* Size = number of evaluation iterations (`eval_iters`).

---

#### 8️⃣ Looping Through Batches

```python
for i, (x, y) in enumerate(loader):
    if i >= eval_iters:
        break
```

* Iterates through the DataLoader, one batch at a time.
* **`i`** = batch index, **`(x, y)`** = input and target tensors.
* Stops once it has processed `eval_iters` batches.

---

#### 9️⃣ Calculating Loss (Without Gradients)

```python
with torch.no_grad():
    _, loss = model(x, y)
losses[i] = loss.item()
```

* **`with torch.no_grad()`** → Redundant here, since the `@torch.no_grad()` decorator already disables gradient tracking for the whole function, but harmless.
* Calls the model with input `x` and target `y`.

  * The model returns a tuple like `(logits, loss)`.
  * We only need the `loss` here.
* **`.item()`** converts a 1-element tensor to a regular Python number and stores it in `losses[i]`.

---

#### 🔟 Calculating the Average Loss

```python
output[split] = losses.mean().item()
```

* Computes the **mean of all stored losses** for that dataset (`train` or `val`).
* Stores it in the `output` dictionary.
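One detail worth knowing: `losses` is pre-filled with `eval_iters` zeros, so if a loader yields fewer batches than `eval_iters`, the unused zeros pull the mean down. The sketch below shows one possible alternative, averaging only the losses actually seen. It uses plain floats and an illustrative helper name, `mean_of_seen`, and is not how the notebook's function is written.

```python
from typing import Iterable, List

def mean_of_seen(batch_losses: Iterable[float], eval_iters: int) -> float:
    """Average at most eval_iters losses, dividing by the number actually seen."""
    seen: List[float] = []
    for i, loss in enumerate(batch_losses):
        if i >= eval_iters:
            break
        seen.append(loss)
    if not seen:
        raise ValueError("no batches were available for evaluation")
    return sum(seen) / len(seen)

# Three batches are available, but five are requested.
losses = [2.0, 4.0, 6.0]
print(mean_of_seen(losses, eval_iters=5))   # 4.0
# A zero-filled buffer of length 5 would report (2 + 4 + 6 + 0 + 0) / 5 = 2.4 instead.
```

Capping `eval_iters` at the loader's length before calling the function avoids the same problem.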

---

#### 11️⃣ Switching Model Back to Training Mode

```python
model.train()
```

* Puts the model back into **training mode** (enables dropout, etc.)
* Important because later training steps require it.

---

#### 12️⃣ Returning the Result

```python
return output
```

* Returns a dictionary like:

  ```python
  {'train': 0.42, 'val': 0.50}
  ```

  which contains the average loss for both datasets.

---

✅ **In simple words:**
This function quickly checks how well your model is performing on both **training** and **validation** data — without updating weights — and returns their average loss values.




In [None]:
def save_checkout(
        model:GPTLanguageModel,
        optimizer:torch.optim.Optimizer,
        epoch:int,
        loss:float,
        file_path:str="checkpoint.pth"

)-> None:
    checkpoint={
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss
    }
    torch.save(checkpoint,file_path)

NameError: name 'GPTLanguageModel' is not defined


### 💾 Explanation: Saving a Model Checkpoint in PyTorch

---

#### 1️⃣ Defining a Function to Save Model State
```python
def save_checkout(
        model: GPTLanguageModel,
        optimizer: torch.optim.Optimizer,
        epoch: int,
        loss: float,
        file_path: str = "checkpoint.pth"
) -> None:
```

Let’s break down every part 👇

##### 🧩 Function Purpose

* This function **saves the training progress** — including model weights, optimizer state, current epoch, and loss — into a file called a **checkpoint**.
* Checkpoints let you:

  * Stop training midway and **resume later** without losing progress.
  * **Restore** a trained model anytime for testing or inference.
  * **Avoid retraining** the model from scratch if something goes wrong.

##### 🧠 Parameters Explained

| Parameter       | Type                    | Description                                                                                        |
| --------------- | ----------------------- | -------------------------------------------------------------------------------------------------- |
| **`model`**     | `GPTLanguageModel`      | The neural network whose weights (parameters) you want to save.                                    |
| **`optimizer`** | `torch.optim.Optimizer` | The optimizer (like Adam, SGD) that updates model parameters.                                      |
| **`epoch`**     | `int`                   | The current epoch number — i.e., how many complete passes through the training data have occurred. |
| **`loss`**      | `float`                 | The loss value at this stage of training, useful for tracking performance.                         |
| **`file_path`** | `str`                   | Path and filename where checkpoint will be saved. Default: `"checkpoint.pth"`.                     |

##### 🧩 Return Type

* **`-> None`** → The function doesn’t return anything; it just saves data to disk.

---

#### 2️⃣ Creating the Checkpoint Dictionary

```python
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss
}
```

##### 💡 What is a “checkpoint”?

* A **checkpoint** is a dictionary that stores everything needed to **rebuild your model** and **continue training later**.

##### 🧱 Explanation of Each Key–Value Pair

| Key                      | Value                    | Description                                                                        |
| ------------------------ | ------------------------ | ---------------------------------------------------------------------------------- |
| `'epoch'`                | `epoch`                  | Saves the current epoch number, so you know from where to continue training later. |
| `'model_state_dict'`     | `model.state_dict()`     | Saves all learnable parameters (weights and biases) of the model.                  |
| `'optimizer_state_dict'` | `optimizer.state_dict()` | Saves optimizer settings (like learning rate, momentum, and running averages).     |
| `'loss'`                 | `loss`                   | Records the last loss value for reference when resuming training.                  |

---

#### 🔍 Deep Dive: What is `state_dict()`?

Both models and optimizers in PyTorch maintain internal **state dictionaries** — i.e., Python dictionaries containing their parameters.

```python
model.state_dict()
```

* Returns something like:

  ```python
  {
      'layer1.weight': tensor([...]),
      'layer1.bias': tensor([...]),
      'layer2.weight': tensor([...]),
      ...
  }
  ```
* Each entry represents one **parameter tensor** (weights or biases).

```python
optimizer.state_dict()
```

* Stores internal data such as:

  * The **current learning rate**.
  * **Momentum** buffers (used in SGD, Adam, etc.).
  * **Parameter groups** (if different parts of the model use different settings).
* This ensures that when you **reload the checkpoint**, the optimizer resumes exactly from where it left off (so training dynamics continue smoothly).
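To make the two dictionaries concrete, here is a sketch that inspects them for a tiny stand-in model (a single `nn.Linear` layer, used here instead of the GPT model so the output stays short).

```python
import torch
from torch import nn

tiny_model = nn.Linear(in_features=4, out_features=2)
tiny_optimizer = torch.optim.AdamW(tiny_model.parameters(), lr=3e-4)

model_state = tiny_model.state_dict()
print(list(model_state.keys()))               # ['weight', 'bias']
print(tuple(model_state["weight"].shape))     # (2, 4)

optim_state = tiny_optimizer.state_dict()
print(list(optim_state.keys()))               # ['state', 'param_groups']
print(optim_state["param_groups"][0]["lr"])   # 0.0003
```

The optimizer's `'state'` entry stays empty until the first `optimizer.step()`, after which it holds the per-parameter running averages.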

---

#### 3️⃣ Saving the Checkpoint to Disk

```python
torch.save(checkpoint, file_path)
```

##### 💾 What happens here

* **`torch.save()`** serializes the Python dictionary (`checkpoint`) and saves it as a binary file at the specified path.
* The file extension `.pth` or `.pt` is the common PyTorch format (both work the same).

##### ⚙️ Internal process:

* PyTorch uses **`pickle`** (Python’s object serialization library) under the hood to convert tensors and metadata into a storable format.
* This allows later restoration using:

  ```python
  checkpoint = torch.load("checkpoint.pth")
  ```

##### 🧠 Why it’s useful:

* You can later **resume training** like this:

  ```python
  checkpoint = torch.load("checkpoint.pth")
  model.load_state_dict(checkpoint['model_state_dict'])
  optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
  epoch = checkpoint['epoch']
  loss = checkpoint['loss']
  ```

  → This restores the model and optimizer **exactly** to their saved states.
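The save-and-restore cycle can be exercised end to end on a small model. In this sketch, `make_pair` is a hypothetical helper that builds a tiny `nn.Linear` and its optimizer as stand-ins for the GPT model and AdamW, and the checkpoint goes to a temporary folder.

```python
import os
import tempfile

import torch
from torch import nn

def make_pair():
    """Build a tiny model and optimizer (stand-ins for the real ones)."""
    net = nn.Linear(4, 2)
    opt = torch.optim.AdamW(net.parameters(), lr=3e-4)
    return net, opt

net, opt = make_pair()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save(
        {
            "epoch": 3,
            "model_state_dict": net.state_dict(),
            "optimizer_state_dict": opt.state_dict(),
            "loss": 1.25,
        },
        path,
    )

    # A fresh model starts with different random weights ...
    restored_net, restored_opt = make_pair()
    checkpoint = torch.load(path)
    # ... until the saved state is loaded into it.
    restored_net.load_state_dict(checkpoint["model_state_dict"])
    restored_opt.load_state_dict(checkpoint["optimizer_state_dict"])

print(checkpoint["epoch"], checkpoint["loss"])        # 3 1.25
print(torch.equal(net.weight, restored_net.weight))   # True
```

Note that the model object has to be constructed first; the checkpoint stores parameter values, not the class that defines the architecture.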

---

#### ✅ In simple words:

This function takes a snapshot of your model’s current state — including:

* its learned parameters,
* optimizer progress,
* current training step (epoch),
* and the latest loss value —

and **saves it all to a file** so that you can continue training later or deploy the model without starting over.

---

🧩 **Example Usage**

```python
save_checkout(model, optimizer, epoch=10, loss=0.032, file_path="gpt_checkpoint.pth")
```

➡️ This will save a file `gpt_checkpoint.pth` containing everything needed to restore the model at **epoch 10** with a **loss of 0.032**.




In [None]:
max_iters=1
eval_interval=100
eval_iters = 200
learning_rate=3e-4

optimizer=torch.optim.AdamW(model.parameters(),lr=learning_rate)
train_loader , val_loader = get_dataloaders(
    train_data=train_data,
    val_data=val_data,
    block_size=block_size,
    batch_size=batch_size,
    device=device
)

train_losses=[]
val_losses=[]

for iteration in range(max_iters):
    for batch_idx, (x_batch,y_batch) in enumerate(train_loader):
        # Evaluation
        if batch_idx % eval_interval ==0 or batch_idx == len (train_loader) -1:
            losses = estimata_loss(
                model=model,
                train_loader=train_loader,
                val_loader=val_loader,
                eval_iters=min(eval_iters,len(val_loader))

            )
            train_losses.append(losses['train'])
            val_losses.append(losses['val'])

            print(
                f"iteration {iteration} / step {batch_idx}: "
                f"train loss {losses['train']:.4f}, "
                f"val loss {losses['val']:.4f}"

            )

        # training step (runs for every batch, not only on evaluation steps)
        logits,loss=model(x_batch,y_batch)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    # save checkpoint once per epoch
    save_checkpoint(
        model=model,
        optimizer=optimizer,
        epoch=iteration,
        loss=loss.item(),
        file_path=f"../output/pre_training/run_4/checkpoint_{iteration}.pth"
    )

SyntaxError: incomplete input (692356186.py, line 20)


### 🧠 Explanation: Full Training Loop for GPT Language Model

This code defines how the model is **trained**, **evaluated**, and **saved** after every few steps.  
It combines all the building blocks we defined earlier — dataset, dataloaders, optimizer, and checkpoint functions.

---

#### 1️⃣ Setting Hyperparameters
```python
max_iters = 1
eval_interval = 100
eval_iters = 200
learning_rate = 3e-4
```

| Variable            | Meaning                                                                                               |
| ------------------- | ----------------------------------------------------------------------------------------------------- |
| **`max_iters`**     | Total number of epochs (complete passes over the dataset). Here only `1` for a quick test.            |
| **`eval_interval`** | How often (in batches) to evaluate model performance during training.                                 |
| **`eval_iters`**    | Number of batches to use for calculating average loss during evaluation.                              |
| **`learning_rate`** | How big each weight update step is; smaller = slower but more stable learning. Here `3e-4` = 0.0003.  |

---

#### 2️⃣ Creating the Optimizer

```python
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
```

* **`AdamW`** is a variant of the Adam optimizer that applies **weight decay** directly to the weights (decoupled from the gradient-based update), which helps regularize the model and reduce overfitting.
* **`model.parameters()`** → Gives all the trainable weights (tensors) in the model.
* **`lr=learning_rate`** → Sets the learning rate to `0.0003`.

🔍 Internally:

* During training, the optimizer:

  1. Reads gradients from each parameter (`param.grad`).
  2. Updates each parameter slightly to reduce the loss function.
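For reference, the update that AdamW applies to a parameter $\theta$ with gradient $g_t$ at step $t$ can be written as below. This is the standard formulation with decoupled weight decay, given as a summary rather than a description of PyTorch's exact implementation. Here $\eta$ is the learning rate (`lr`), $\lambda$ is the weight-decay coefficient, and $\beta_1$, $\beta_2$, $\epsilon$ are the remaining optimizer hyperparameters, which the notebook leaves at their defaults.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
$$

The running averages $m_t$ and $v_t$ are the per-parameter state that `optimizer.state_dict()` stores, which is why saving the optimizer matters when resuming training.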

---

#### 3️⃣ Getting the DataLoaders

```python
train_loader, val_loader = get_dataloaders(
    train_data=train_data,
    val_data=val_data,
    block_size=block_size,
    batch_size=batch_size,
    device=device
)
```

* Loads the training and validation datasets into batches.
* Each batch gives `(x_batch, y_batch)` tensors for the model to train on.
* `block_size` = how long each input sequence is.
* `batch_size` = how many such sequences per batch.
* `device` = CPU or GPU.

---

#### 4️⃣ Lists to Store Loss Values

```python
train_losses = []
val_losses = []
```

* Empty lists used to **track loss values** over time so you can later plot training progress.

---

#### 5️⃣ Starting the Training Loop

```python
for iteration in range(max_iters):
```

* The **outer loop** runs for a number of epochs (`max_iters` times).
* Each “iteration” here corresponds to one epoch (a complete pass through the training set).

---

#### 6️⃣ Inner Loop: Going Through Each Batch

```python
for batch_idx, (x_batch, y_batch) in enumerate(train_loader):
```

* Loops through the batches of training data.
* **`batch_idx`** → Index of the batch (0, 1, 2, …).
* **`x_batch`** → Input text sequences.
* **`y_batch`** → Target text sequences (shifted by one token).

---

#### 7️⃣ Performing Evaluation at Intervals

```python
if batch_idx % eval_interval == 0 or batch_idx == len(train_loader) - 1:
```

* Runs evaluation:

  * Every `eval_interval` batches (e.g., every 100 steps).
  * OR at the **end** of the epoch (last batch).
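To see which steps trigger an evaluation, the condition can be applied to a range of batch indices. The batch counts below are made-up examples, and `evaluation_steps` is an illustrative helper.

```python
def evaluation_steps(n_batches: int, eval_interval: int) -> list:
    """Batch indices at which the loop's `if` condition is true."""
    return [
        batch_idx
        for batch_idx in range(n_batches)
        if batch_idx % eval_interval == 0 or batch_idx == n_batches - 1
    ]

print(evaluation_steps(n_batches=250, eval_interval=100))   # [0, 100, 200, 249]
print(evaluation_steps(n_batches=201, eval_interval=100))   # [0, 100, 200]
```

Step 0 always qualifies, so the first recorded loss reflects the untrained model.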

---

#### 8️⃣ Estimate Loss for Train and Validation

```python
losses = estimata_loss(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    eval_iters=min(eval_iters, len(val_loader))
)
```

* Calls the **`estimata_loss()`** function we defined earlier.
* Evaluates both **train** and **validation** average loss.
* **`eval_iters=min(eval_iters, len(val_loader))`** → Ensures we don’t exceed available batches.

The result looks like:

```python
{'train': 0.4251, 'val': 0.5023}
```

---

#### 9️⃣ Record the Losses

```python
train_losses.append(losses['train'])
val_losses.append(losses['val'])
```

* Adds current losses to the tracking lists for future plotting or analysis.

---

#### 🔟 Print Progress to Console

```python
print(
    f"iteration {iteration} / step {batch_idx}: "
    f"train loss {losses['train']:.4f}, "
    f"val loss {losses['val']:.4f}"
)
```

* Displays the current progress in human-readable form:

  ```
  iteration 0 / step 100: train loss 0.4213, val loss 0.5072
  ```

---

#### 11️⃣ Forward Pass (Training Step)

```python
logits, loss = model(x_batch, y_batch)
```

* **`model(x_batch, y_batch)`** runs the input through the neural network.
* Returns:

  * **`logits`** → Model’s predicted outputs before applying softmax.
  * **`loss`** → Computed difference between predictions and actual targets.

---

#### 12️⃣ Resetting Gradients

```python
optimizer.zero_grad(set_to_none=True)
```

* Before computing new gradients, old ones must be cleared.
* If not cleared, PyTorch accumulates them (adds new to old).
* **`set_to_none=True`** sets gradients to `None` instead of `0`, which saves memory and improves speed.
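The accumulation behaviour is visible on a single parameter. This sketch resets the gradient by hand, which is the effect `zero_grad(set_to_none=True)` has on each parameter.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()         # tensor([3.])

(w * 3).sum().backward()       # no reset in between, so the new gradient is added
accumulated = w.grad.clone()   # tensor([6.])

w.grad = None                  # reset, as zero_grad(set_to_none=True) would
(w * 3).sum().backward()
fresh = w.grad.clone()         # tensor([3.])

print(first.item(), accumulated.item(), fresh.item())   # 3.0 6.0 3.0
```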

---

#### 13️⃣ Backpropagation (Compute Gradients)

```python
loss.backward()
```

* This computes the **gradients** of the loss with respect to each model parameter.
* PyTorch automatically calculates these using the **autograd** engine.

---

#### 14️⃣ Update Model Parameters

```python
optimizer.step()
```

* The optimizer uses the computed gradients to **update model weights**.
* Each weight moves slightly in the direction that **reduces loss**.
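The four calls of a training step can be run on a toy problem to confirm that the weights change. In this sketch a small `nn.Linear` with a mean-squared-error loss stands in for the GPT model, which computes its own loss internally.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(2, 1)
optimizer = torch.optim.AdamW(net.parameters(), lr=3e-4)

x_batch = torch.randn(8, 2)
y_batch = torch.randn(8, 1)

before = net.weight.detach().clone()

loss = nn.functional.mse_loss(net(x_batch), y_batch)  # forward pass
optimizer.zero_grad(set_to_none=True)                 # clear stale gradients
loss.backward()                                       # compute fresh gradients
optimizer.step()                                      # apply the update

print(torch.equal(before, net.weight.detach()))       # False: the weights moved
```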

---

#### 15️⃣ Save Model Checkpoint

```python
save_checkpoint(
    model=model,
    optimizer=optimizer,
    epoch=iteration,
    loss=loss.item(),
    file_path=f"../output/pre_training/run_4/checkpoint_{iteration}.pth"
)
```

* Saves the current model state to disk after each epoch.
* **`loss.item()`** converts the tensor loss to a plain Python float.
* The file path dynamically includes the iteration number so multiple checkpoints are saved separately:

  ```
  checkpoint_0.pth
  checkpoint_1.pth
  ...
  ```
* Each checkpoint contains:

  * Model weights
  * Optimizer state
  * Current epoch
  * Loss value
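One practical point: `torch.save` writes to the path it is given and does not create missing parent folders, so the run directory has to exist before the first checkpoint is written. A minimal sketch using `pathlib`, with a temporary folder in place of the real `../output` tree and an illustrative helper name `checkpoint_path`:

```python
import tempfile
from pathlib import Path

def checkpoint_path(run_dir: Path, epoch: int) -> Path:
    """Create run_dir if needed and return the checkpoint file path for this epoch."""
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir / f"checkpoint_{epoch}.pth"

with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "output" / "pre_training" / "run_4"
    names = [checkpoint_path(run_dir, epoch).name for epoch in range(2)]
    dir_created = run_dir.is_dir()

print(names)         # ['checkpoint_0.pth', 'checkpoint_1.pth']
print(dir_created)   # True
```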

---

### 🧩 Summary of the Whole Loop

| Step  | Purpose                                                  |
| ----- | -------------------------------------------------------- |
| **1** | Load data and create model/optimizer                     |
| **2** | Loop over epochs                                         |
| **3** | Loop through mini-batches                                |
| **4** | Periodically evaluate model (training + validation loss) |
| **5** | Do forward pass, compute loss                            |
| **6** | Backpropagate and update model weights                   |
| **7** | Save checkpoint after each epoch                         |

---

✅ **In simple words:**
This code trains your model batch-by-batch, evaluates it at regular steps, updates the weights using backpropagation, and saves checkpoints so you can pause or continue training anytime.

---

⚠️ **Note:**
Your function name here is written as `save_checkpoint`, but in your earlier code, it was defined as `save_checkout`.
You should rename either one for consistency:

```python
def save_checkpoint(...):
    ...
```

so that your training loop calls the correct function.



In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(train_losses, label="Train Loss", marker='o')
plt.plot(val_losses, label="Validation Loss", marker='o')
plt.xlabel("Evaluation Step")
plt.ylabel("Loss")
plt.title("Training and Validation Loss Over Time")
plt.legend()
plt.grid()
plt.show()


### 📊 Visualizing Training & Validation Loss Curves

#### 1️⃣ Importing the Plotting Library
```python
import matplotlib.pyplot as plt
```

* **`matplotlib.pyplot`** is a popular Python library for creating plots and graphs.
* We use it to **visually track model performance** over time.
* The alias `plt` is just a shorthand (common convention).

---

#### 2️⃣ Creating a New Figure

```python
plt.figure(figsize=(10, 5))
```

* Creates a **new blank figure** (like a canvas) for plotting.
* **`figsize=(10, 5)`** sets the **width = 10 inches** and **height = 5 inches** — a rectangular graph layout.
* This makes the plot more readable.

---

#### 3️⃣ Plotting the Training Loss

```python
plt.plot(train_losses, label="Train Loss", marker='o')
```

* **`train_losses`**: A Python list that stores the loss values recorded during training (from earlier loops).
* **`label="Train Loss"`** gives a name to this line for the legend.
* **`marker='o'`** draws small circles at each data point — helps visualize individual points clearly.
* The line connects those points to show the trend of loss over time.

---

#### 4️⃣ Plotting the Validation Loss

```python
plt.plot(val_losses, label="Validation Loss", marker='o')
```

* **`val_losses`**: Stores validation loss values measured during training.
* This helps compare how well the model generalizes to unseen data.
* Same styling: a line with circle markers and a legend label “Validation Loss”.

---

#### 5️⃣ Adding Axis Labels

```python
plt.xlabel("Evaluation Step")
plt.ylabel("Loss")
```

* **`xlabel`** → X-axis label: “Evaluation Step”

  * The index of each evaluation (0 for the first, 1 for the second, and so on), not the training iteration number.
* **`ylabel`** → Y-axis label: “Loss”

  * Shows the numerical loss value (lower = better model).
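
If you would rather see training iterations on the X-axis, the evaluation indices can be converted. A small sketch, assuming the model is evaluated every `eval_interval` iterations starting at iteration 0; the value `100` and the loss numbers are made up for illustration.

```python
# Hypothetical setting: evaluate every 100 training iterations.
eval_interval = 100
train_losses = [4.2, 3.1, 2.6, 2.4]  # made-up numbers for illustration

# Position i in the list corresponds to training iteration i * eval_interval,
# assuming the first evaluation happens at iteration 0.
iterations = [step * eval_interval for step in range(len(train_losses))]
print(iterations)  # → [0, 100, 200, 300]
```

Passing `iterations` as the first argument, as in `plt.plot(iterations, train_losses, ...)`, would then place each point at its iteration number.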

---

#### 6️⃣ Adding a Title

```python
plt.title("Training and Validation Loss Over Time")
```

* Adds a title at the top of the graph.
* Clearly explains that the plot shows how loss changes as training progresses.

---

#### 7️⃣ Adding Legend and Grid

```python
plt.legend()
plt.grid()
```

* **`plt.legend()`** → Displays the labels (“Train Loss”, “Validation Loss”) in a small box to identify lines.
* **`plt.grid()`** → Adds light grid lines for easier reading of values.

---

#### 8️⃣ Displaying the Plot

```python
plt.show()
```

* Renders and displays the complete plot.
* In a plain Python script, the figure window does not open without this call. Jupyter Notebooks usually render the figure inline even without it.

---

### ✅ Summary

This code plots **two lines**:

* 🟦 **Train Loss**: how well the model fits the training data (blue by default, as the first line plotted).
* 🟧 **Validation Loss**: how well the model performs on unseen data (orange by default).

If both lines decrease steadily, the model is learning and still generalizing.
If validation loss starts increasing while training loss keeps decreasing, the model may be **overfitting**.
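
The same check can be done on the recorded lists instead of by eye. This is a rough heuristic sketch, not a standard test; the function names and the loss values are made up for illustration.

```python
def best_eval_step(val_losses):
    """Index of the evaluation step with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


def looks_overfit(train_losses, val_losses):
    """Heuristic: since the best validation step, validation loss rose
    while training loss kept falling."""
    best = best_eval_step(val_losses)
    last = len(val_losses) - 1
    if best == last:
        return False  # validation loss is still at its minimum
    return val_losses[last] > val_losses[best] and train_losses[last] < train_losses[best]


# Made-up loss values for illustration only.
train_losses = [4.0, 3.0, 2.4, 2.0, 1.7]
val_losses = [4.1, 3.2, 2.9, 3.0, 3.3]

print(best_eval_step(val_losses))               # → 2
print(looks_overfit(train_losses, val_losses))  # → True
```

The best evaluation step is also a natural candidate for which checkpoint to keep.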





In [None]:
input_tokens = tokenizer.encode("Salam labas ")
input_tokens = torch.tensor(
    input_tokens, dtype=torch.long).unsqueeze(0).to(device)

model.eval()
with torch.no_grad():
    output = model.generate(input_tokens=input_tokens, max_new_tokens=100)

print(tokenizer.decode(output[0].tolist()))


### 🧠 Explanation: Generating Text from a Trained GPT Model

---

#### 1️⃣ Tokenizing the Input Text
```python
input_tokens = tokenizer.encode("Salam labas ")
```

* **`tokenizer.encode()`** converts the raw text `"Salam labas "` into **numerical token IDs** that the model can understand.
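* Illustrative example (these IDs are made up; the real ones depend on the trained tokenizer):

  ```
  "Salam labas " → [101, 1234, 5678, 0]
  ```
* `BasicTokenizer` from `minbpe` implements byte-level BPE, so each number stands for a single byte or a merged sequence of bytes in the tokenizer's vocabulary, not necessarily a whole word.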
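
To make the idea concrete, here is a toy byte-level BPE encoder. It is a simplified sketch, not the code inside `minbpe`; the function name `encode_with_merges` and the merge table are hypothetical.

```python
def encode_with_merges(text, merges):
    """Toy byte-level BPE encoder.

    `merges` maps a pair of token IDs to the new ID that replaces it.
    Merges are applied in the order they were learned (lowest new ID first).
    """
    ids = list(text.encode("utf-8"))  # start from raw bytes: IDs 0..255
    for pair, new_id in sorted(merges.items(), key=lambda item: item[1]):
        merged = []
        i = 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)  # replace the pair with its merged ID
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids


# Hypothetical merge table: "a" + "l" -> 256, then 256 + "a" -> 257.
toy_merges = {(97, 108): 256, (256, 97): 257}

print(list("Salam".encode("utf-8")))            # → [83, 97, 108, 97, 109]
print(encode_with_merges("Salam", toy_merges))  # → [83, 257, 109]
```

IDs below 256 are raw bytes, and every learned merge adds one new ID, which is why the vocabulary size grows with the number of merges.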

---

#### 2️⃣ Converting Tokens to a Tensor

```python
input_tokens = torch.tensor(input_tokens, dtype=torch.long).unsqueeze(0).to(device)
```

* **`torch.tensor(input_tokens, dtype=torch.long)`**
  Converts the Python list of token IDs into a **PyTorch tensor** of type `long` (64-bit integers), the dtype normally used for token IDs that index an embedding table.

* **`.unsqueeze(0)`**
  Adds an extra dimension at the beginning to represent **batch size**.

  * Model expects input shape: `(batch_size, sequence_length)`
  * Example: `[101, 1234, 5678]` → `[[101, 1234, 5678]]`

* **`.to(device)`**
  Moves the tensor to the **CPU or GPU** (`device`) so the model can process it.
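
The effect of `.unsqueeze(0)` is easiest to see from the tensor shapes. A minimal sketch with made-up token IDs:

```python
import torch

token_ids = [83, 97, 108]  # made-up token IDs

flat = torch.tensor(token_ids, dtype=torch.long)
batched = flat.unsqueeze(0)  # insert a batch dimension at position 0

print(flat.shape)     # → torch.Size([3])
print(batched.shape)  # → torch.Size([1, 3])
```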

---

#### 3️⃣ Switching Model to Evaluation Mode

```python
model.eval()
```

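* Puts the model into **evaluation mode**:

  * Turns off dropout layers, so no activations are randomly zeroed.
  * Makes batch normalization layers, if a model has any, use their stored running statistics instead of updating them.
* This matters for generation because dropout would otherwise add random noise to every forward pass. Note that `eval()` alone does not make the output deterministic: if `generate()` samples the next token from a probability distribution, the text can still differ between runs.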
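
The difference between the two modes can be checked on a single dropout layer. A small sketch using a standalone `nn.Dropout` rather than the notebook's model:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.2)  # same rate as `dropout = 0.2` in this notebook
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # random entries are zeroed, the rest are scaled up

drop.eval()
eval_out = drop(x)   # identity: the input passes through unchanged

print(torch.equal(eval_out, x))  # → True
print((train_out == 0).float().mean().item())  # fraction zeroed, close to 0.2
```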

---

#### 4️⃣ Disabling Gradient Tracking

```python
with torch.no_grad():
```

* Tells PyTorch not to track operations for gradient computation inside the block.
* **Why?**

  * Generation is **inference**, not training.
  * Saves memory and speeds up computation.
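
A quick way to see what `torch.no_grad()` changes is to compare `requires_grad` on the results. A minimal sketch in which a small tensor stands in for a model parameter:

```python
import torch

weight = torch.ones(3, requires_grad=True)  # stands in for a model parameter
x = torch.tensor([1.0, 2.0, 3.0])

tracked = (weight * x).sum()        # autograd records this operation
with torch.no_grad():
    untracked = (weight * x).sum()  # same value, but no graph is recorded

print(tracked.requires_grad)    # → True
print(untracked.requires_grad)  # → False
```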

---

#### 5️⃣ Generating Text from the Model

```python
output = model.generate(input_tokens=input_tokens, max_new_tokens=100)
```

* **`model.generate()`** produces **new tokens** one by one, starting from the given input.

* Parameters:

  | Parameter        | Description                                                               |
  | ---------------- | ------------------------------------------------------------------------- |
  | `input_tokens`   | The starting sequence (tensor) from which the model continues generating. |
  | `max_new_tokens` | Maximum number of **new tokens** to generate. Here `100` tokens.          |

* The output is a tensor of **token IDs** representing the generated text sequence, including both the original input and the new tokens.
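
The token-by-token loop can be sketched in plain Python. This is not the notebook's `GPTLanguageModel.generate()`: the uniform `toy_next_token_probs` function stands in for the model, and cropping the context to `block_size` is an assumption based on common GPT-style implementations.

```python
import random


def toy_next_token_probs(context, vocab_size):
    """Stand-in for a model forward pass: one probability per token ID.
    It is uniform here; a real model would condition on `context`."""
    return [1.0 / vocab_size] * vocab_size


def generate(input_tokens, max_new_tokens, vocab_size=10, block_size=4, seed=0):
    rng = random.Random(seed)
    tokens = list(input_tokens)
    for _ in range(max_new_tokens):
        context = tokens[-block_size:]  # keep only the most recent tokens
        probs = toy_next_token_probs(context, vocab_size)
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)       # the new token joins the context
    return tokens


prompt = [3, 1, 4]
output = generate(prompt, max_new_tokens=5)
print(len(output))           # → 8
print(output[:len(prompt)])  # → [3, 1, 4]
```

The result starts with the prompt and is exactly `max_new_tokens` longer, which matches the description of the output above.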

---

#### 6️⃣ Decoding the Generated Tokens

```python
print(tokenizer.decode(output[0].tolist()))
```

* **`output[0]`** → Gets the generated sequence for the first (and only) item in the batch.
* **`.tolist()`** → Converts the tensor into a Python list of token IDs.
* **`tokenizer.decode()`** → Converts token IDs back into **human-readable text**.
* Illustrative example (made-up IDs and text; the real output depends on the trained tokenizer and model):

  ```
  [101, 1234, 5678, 3456, 7890] → "Salam labas! How are you today?"
  ```
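
For a byte-level BPE tokenizer, decoding amounts to looking up the bytes behind each ID and joining them. A toy sketch, not the `minbpe` source; the helper names and the merge table are hypothetical.

```python
def build_vocab(merges):
    """Map every token ID to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}  # base vocabulary: single bytes
    for (left, right), new_id in sorted(merges.items(), key=lambda item: item[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def decode(ids, vocab):
    data = b"".join(vocab[i] for i in ids)
    # errors="replace" keeps decoding from failing on incomplete UTF-8 sequences
    return data.decode("utf-8", errors="replace")


toy_merges = {(97, 108): 256, (256, 97): 257}  # hypothetical merge table
vocab = build_vocab(toy_merges)
print(decode([83, 257, 109], vocab))  # → Salam
```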

---

### ✅ In simple words:

1. Convert your starting text into token IDs the model understands.
2. Move tokens to the device (CPU/GPU) and add batch dimension.
3. Put the model in evaluation mode and disable gradients.
4. Ask the model to generate a sequence of new tokens.
5. Decode the generated tokens back into **readable text** and print it.

---

📝 **Note:**

* You can adjust `max_new_tokens` to generate longer or shorter text.
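* The call above passes only `input_tokens` and `max_new_tokens`. Sampling controls such as `temperature`, `top_k`, or `top_p` can make the output more varied or more focused, but they are available only if `GPTLanguageModel.generate()` implements them.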
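
If the model's `generate()` does not expose such options, they could be added at the step where the next token is sampled. The sketch below shows temperature and top-k in plain Python; `sample_next_token` is a hypothetical helper and the scores are made up.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token ID from raw scores (logits).

    temperature < 1 sharpens the distribution, > 1 flattens it.
    top_k keeps only the k highest-scoring tokens before sampling.
    """
    candidates = list(range(len(logits)))
    if top_k is not None:
        candidates = sorted(candidates, key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)

picks = [sample_next_token(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(20)]
print(set(picks) <= {0, 1})  # → True: only the two best tokens can be chosen
```

With `top_k=1` the function always returns the highest-scoring token, which is equivalent to greedy decoding.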

