# Prepare the data

### 1. Load the fine-tuning data

In [None]:
import json

file_path = "../output/fine_tuning/data/fine_tuning.json"
with open(file_path, "r") as file:
    data = json.load(file)

### 2. Load the tokenizer

In [None]:
import sys
sys.path.append('..')

In [None]:
from minbpe import RegexTokenizer
tokenizer=RegexTokenizer()
tokenizer.load(model_file="../output/tokenizer/darija_tokenizer.model")

def get_vocab_size(tokenizer: RegexTokenizer)-> int:
    vocab=tokenizer.vocab
    special_tokens = tokenizer.special_tokens

    return len(vocab)+len(special_tokens)




## 📌 Purpose

This script:

1. Loads a trained tokenizer from a `.model` file (presumably trained on Darija).
2. Defines a function to calculate the **total vocabulary size**, including:

   * Regular tokens
   * Special tokens (e.g. `<|padding|>`, `<|startoftext|>`)

---

## 🔍 Step-by-Step Breakdown

### 1. ✅ Import and Instantiate

```python
from minbpe import RegexTokenizer
tokenizer = RegexTokenizer()
```

* `RegexTokenizer` is a customizable tokenizer from the `minbpe` library.
* It uses regular expressions to tokenize text and can be trained or loaded from a saved model.

---

### 2. 📦 Load Pretrained Tokenizer

```python
tokenizer.load(model_file="../output/tokenizer/darija_tokenizer.model")
```

* Loads a previously trained tokenizer model from file.
* This file contains token patterns, vocab, merges, and possibly special tokens.

---

### 3. 🧮 Define Vocabulary Size Function

```python
def get_vocab_size(tokenizer: RegexTokenizer) -> int:
    vocab = tokenizer.vocab
    special_tokens = tokenizer.special_tokens

    return len(vocab) + len(special_tokens)
```

#### 🔸 `tokenizer.vocab`

* A dictionary of learned tokens (subwords, character groups, etc.)

#### 🔸 `tokenizer.special_tokens`

* A dictionary mapping each predefined special-token string to its integer ID, for example:

  * `<|padding|>`
  * `<|startoftext|>`
  * `<|endoftext|>`
  * `<|separator|>`

#### ✅ Return Value:

* Total vocabulary size = number of normal tokens + number of special tokens

---

## 🧪 Example Output

If the tokenizer contains:

* 10,000 normal tokens
* 5 special tokens

Then:

```python
get_vocab_size(tokenizer)  # Returns 10005
```
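The arithmetic above can be checked without a trained model. The sketch below uses a hypothetical `StubTokenizer` stand-in (not part of `minbpe`) that exposes only the two attributes the function reads; the token counts and special-token names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class StubTokenizer:
    """Hypothetical stand-in exposing only `vocab` and `special_tokens`."""
    vocab: dict = field(default_factory=dict)
    special_tokens: dict = field(default_factory=dict)


def get_vocab_size(tokenizer) -> int:
    # Same arithmetic as the notebook function: regular plus special tokens.
    return len(tokenizer.vocab) + len(tokenizer.special_tokens)


stub = StubTokenizer(
    # 10,000 made-up regular tokens keyed by ID.
    vocab={i: bytes([i % 256]) for i in range(10_000)},
    # 5 made-up special tokens keyed by string.
    special_tokens={f"<|special_{i}|>": 10_000 + i for i in range(5)},
)
total = get_vocab_size(stub)
print(total)
```

On a real tokenizer, it may be worth comparing the result with the largest token ID plus one: if `vocab` already contained the special tokens, this sum would count them twice.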

---

## ✅ Summary

This setup:

* Loads a BPE tokenizer trained on a specific dataset
* Calculates the full vocabulary size including special tokens
* Useful when configuring models (e.g., `vocab_size` parameter in transformer models)

---



### 3. Tokenize the sequence

In [None]:
tokenized_data=[]
for item in data:
    tokenized_item=tokenizer.encode(item,allowed_special="all")
    tokenized_data.append(tokenized_item)
    
len(tokenized_data[0])


## 📌 Purpose

This script takes a list of text entries (likely training samples) and **tokenizes** them using your previously loaded `RegexTokenizer`. It then checks how many tokens are in the **first item** of the tokenized dataset.

---

## 🧱 Step-by-Step Breakdown

### 1. 📋 Initialize List

```python
tokenized_data = []
```

* A list to store tokenized versions of each string from `data`.

---

### 2. 🔄 Loop Through `data`

```python
for item in data:
```

* `data` is assumed to be a list of strings — probably the same kind of entries as in `fine_tuning_data`, like:

  ```
  <|startoftext|>Alice<|separator|>Hello\nHow are you?<|endoftext|>
  ```

---

### 3. 🧩 Tokenize Each String

```python
tokenized_item = tokenizer.encode(item, allowed_special="all")
```

* `tokenizer.encode()` breaks the string into tokens using rules from your BPE model.
* `allowed_special="all"` tells the tokenizer to **recognize and preserve special tokens**, rather than splitting or ignoring them.
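To make "recognize and preserve" concrete, here is a toy encoder. It is not the `minbpe` implementation: it splits the text on the special-token strings, maps each special token to a single ID, and falls back to raw UTF-8 bytes for everything else. The IDs (1000 to 1002) and the helper names are made up for illustration.

```python
import re

# Hypothetical special-token table; real IDs come from the trained tokenizer.
SPECIAL_TOKENS = {
    "<|startoftext|>": 1000,
    "<|separator|>": 1001,
    "<|endoftext|>": 1002,
}


def split_on_special(text, special_tokens):
    # A capturing group makes re.split keep the special tokens in the output.
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    return [part for part in re.split(pattern, text) if part]


def toy_encode(text, special_tokens):
    ids = []
    for part in split_on_special(text, special_tokens):
        if part in special_tokens:
            # A special token becomes exactly one ID.
            ids.append(special_tokens[part])
        else:
            # Ordinary text falls back to its raw bytes (no merges in this toy).
            ids.extend(part.encode("utf-8"))
    return ids


encoded = toy_encode("<|startoftext|>Hi<|endoftext|>", SPECIAL_TOKENS)
print(encoded)
```

Without the special-token step, `<|startoftext|>` would be encoded as ordinary characters and spread across several IDs.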

---

### 4. ➕ Append Tokenized Output

```python
tokenized_data.append(tokenized_item)
```

* Adds the tokenized version of each string to the `tokenized_data` list.

---

### 5. 📏 Check Token Count of First Item

```python
len(tokenized_data[0])
```

* Returns the **number of tokens** in the first tokenized string (i.e., how long the first training sample is after tokenization).
* This is useful for:

  * Debugging token lengths
  * Preparing inputs for models that have max token limits (e.g., 512 for many transformers)
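Looking only at the first sample says little about the rest of the dataset. A small sketch, using made-up token lists in place of `tokenized_data`, that summarizes lengths and counts how many samples exceed a limit (the `summarize_lengths` helper and the limit of 8 are illustrative, not part of the notebook):

```python
def summarize_lengths(sequences, max_length):
    """Return basic length statistics for a list of token sequences."""
    lengths = [len(seq) for seq in sequences]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        # Number of sequences longer than the allowed context length.
        "over_limit": sum(length > max_length for length in lengths),
    }


# Made-up token IDs standing in for tokenized_data.
toy_sequences = [[1, 5, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [4, 4, 4, 4, 4]]
stats = summarize_lengths(toy_sequences, max_length=8)
print(stats)
```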

---

## 🧪 Example

If your first string is:

```text
<|startoftext|>Alice<|separator|>Hello<|endoftext|>
```

The tokenizer might return something like:

```python
[1, 482, 3, 1025, 2]  # (Token IDs)
```

Then:

```python
len(tokenized_data[0])  # Returns 5
```

---

## ✅ Summary

* This code tokenizes a list of strings using a pretrained tokenizer.
* Special tokens are preserved.
* The length of the first tokenized item is measured — useful for understanding sequence sizes in training data.

---


### 4. Splitting the data

Multi-turn conversations must stay complete within each part.

Both the training and validation sets should start with a `You` message and end with an `Assistant` message.

In [None]:
inititial_split_index=int(0.95 * len(data))

# Adjust the index so that the training set ends with an Assistant message
# and the validation set starts with a "You" message.

# Scan backward to find an Assistant message.
split_index=inititial_split_index
while split_index>0 and not data[split_index-1].startswith('<|startoftext|>Assistant'):
    split_index -=1

train_data = data[:split_index]
val_data=data[split_index:]

print("Training set: ")
print(f"Start message: {train_data[0].split('<|separator|>')[0]}")
print(f"End message: {train_data[-1].split('<|separator|>')[0]}")

print("\n Validation Set")
print(f"Start message: {val_data[0].split('<|separator|>')[0]}")
print(f"End message: {val_data[-1].split('<|separator|>')[0]}")


# 🧠 Explanation: Smart Dataset Splitting to Preserve Dialogue Integrity

This logic is designed to split a dataset of formatted conversation messages (stored in `data`) into training and validation sets **without breaking the natural flow of dialogue** — particularly ensuring the training set ends with an Assistant response and the validation set begins with a User message.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Step 1: Define Initial Split Point

```python
inititial_split_index = int(0.95 * len(data))
```

* Calculates the initial split index at 95% of the dataset.
* A 95/5 split keeps most of the data for training while holding out a small validation set.

---

### 🔹 Step 2: Backtrack to the Last Assistant Message

```python
split_index = inititial_split_index

while split_index > 0 and not data[split_index - 1].startswith('<|startoftext|>Assistant'):
    split_index -= 1
```

* The code **backtracks from the 95% mark** to ensure the last training message is from the Assistant.

* This is important because many models learn from alternating patterns — e.g., User message → Assistant response.

* If training data ends in the middle of a user input, the model may struggle to learn proper turn-taking.

* The loop checks each message (going backward) to find the first one that starts with:

  ```
  <|startoftext|>Assistant
  ```

* The loop stops when `data[split_index - 1]` is an Assistant message, so `split_index` ends up pointing at the message right after it.

---

### 🔹 Step 3: Slice the Dataset

```python
train_data = data[:split_index]
val_data = data[split_index:]
```

* `train_data` contains all messages **up to and including** the last complete Assistant response.
* `val_data` contains the remaining messages, **starting from the next user prompt**.
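The backward scan and the slicing can be tried on a toy conversation. In this sketch the `find_split_index` helper and the ten made-up messages are illustrative; the loop body mirrors the notebook code.

```python
def find_split_index(data, ratio=0.95, marker="<|startoftext|>Assistant"):
    """Start at ratio * len(data) and walk back until the previous message is from the Assistant."""
    split_index = int(ratio * len(data))
    while split_index > 0 and not data[split_index - 1].startswith(marker):
        split_index -= 1
    return split_index


# Five alternating You/Assistant pairs (10 messages in total).
toy_data = []
for turn in range(5):
    toy_data.append(f"<|startoftext|>You<|separator|>question {turn}<|endoftext|>")
    toy_data.append(f"<|startoftext|>Assistant<|separator|>answer {turn}<|endoftext|>")

split_index = find_split_index(toy_data)
train, val = toy_data[:split_index], toy_data[split_index:]

print(split_index)
print(train[-1].split("<|separator|>")[0])
print(val[0].split("<|separator|>")[0])
```

Here the initial index is 9, which would cut a pair in half, so the scan moves back to 8. If no Assistant message is found at all, the loop runs down to 0 and the training set ends up empty, so an explicit check may be worth adding.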

---

### 🔹 Step 4: Print Metadata About the Split

```python
print("Training set: ")
print(f"Start message: {train_data[0].split('<|separator|>')[0]}")
print(f"End message: {train_data[-1].split('<|separator|>')[0]}")

print("\n Validation Set")
print(f"Start message: {val_data[0].split('<|separator|>')[0]}")
print(f"End message: {val_data[-1].split('<|separator|>')[0]}")
```

* Each message is formatted like:

  ```
  <|startoftext|>Assistant<|separator|>message content<|endoftext|>
  ```
* `split('<|separator|>')[0]` extracts the **sender** part of the message.
* This printout confirms:

  * The training set starts and ends with the right roles.
  * The validation set starts correctly with a user message, following a full Assistant response in the training set.

---

## ✅ Why This Matters

* **Maintains conversational context**: Models trained on structured conversations benefit when training and validation sets reflect complete message pairs.
* **Avoids broken samples**: Prevents the training set from ending mid-dialogue, which could degrade model quality.
* **Ensures natural flow**: Validation accuracy is more meaningful when the validation set starts with a user message and follows a realistic conversation thread.

---


Split the tokenized data at the same `split_index`.

In [None]:
train_data = tokenized_data[:split_index]
val_data = tokenized_data[split_index:]

Combine each `You` and `Assistant` turn into one sequence, making sure the resulting sequence does not exceed `block_size`.

In [None]:
block_size=256

def combined_turns(data: list[list[int]],should_trim_long_sequence: bool) -> list[list[int]]:
    combined_turns_data = []
    for i in range(0,len(data)-1,2):
        you_message=data[i]
        assistant_message=data[i+1]
        if not you_message or not assistant_message:
            continue 

        final_message=you_message+assistant_message
        if len(final_message)>block_size and should_trim_long_sequence:
            final_message=final_message[-block_size:]

        combined_turns_data.append(final_message)
    return combined_turns_data
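
# Combine the training turns; later cells read `combined_train_data`.
combined_train_data = combined_turns(
    data=train_data,
    should_trim_long_sequence=True
)

# Combine the validation turns.
combined_val_data = combined_turns(
    data=val_data,
    should_trim_long_sequence=True
)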

---

# 🧠 Explanation: Combining Paired Turns into Single Token Sequences for Model Training

This logic is designed to **combine user and assistant messages** into a single token sequence that fits within a defined `block_size`. This is a critical preprocessing step for training transformer models that expect inputs as flat sequences of tokens.

---

## ⚙️ Key Concepts

* **Paired Turns**: User and Assistant messages are expected to alternate. Each "turn" consists of one user message followed by one assistant reply.
* **Block Size**: The maximum number of tokens that a sequence should contain (set here to `256`).
* **Trimming**: Sequences longer than the block size can be trimmed from the beginning, preserving the most recent tokens (typically more relevant in conversations).

---

## 🧱 Step-by-Step Breakdown

### 🔹 `block_size = 256`

Defines the maximum length for a single input sequence (in tokens). This is often constrained by model architecture.

---

### 🔹 Function Definition: `combined_turns(...)`

```python
def combined_turns(data: list[list[int]], should_trim_long_sequence: bool) -> list[list[int]]:
```

* **Input**:

  * `data`: A list of tokenized messages, where each message is a list of integers (token IDs).
  * `should_trim_long_sequence`: A boolean flag indicating whether to trim sequences that exceed the block size.

* **Output**:

  * Returns a list of **combined input sequences**, where each item includes both user and assistant messages in a single list of tokens.

---

### 🔸 Loop Through Paired Messages

```python
for i in range(0, len(data)-1, 2):
```

* Iterates through the data **two messages at a time** (step of 2).
* This assumes the data is **ordered as User, Assistant, User, Assistant, ...**

---

### 🔸 Assign Messages

```python
you_message = data[i]
assistant_message = data[i+1]
```

* Picks the current user message and the immediate next assistant message.

---

### 🔸 Skip Incomplete Pairs

```python
if not you_message or not assistant_message:
    continue
```

* If either message is missing (e.g., empty or `None`), skip this pair.

---

### 🔸 Combine and Trim If Needed

```python
final_message = you_message + assistant_message
if len(final_message) > block_size and should_trim_long_sequence:
    final_message = final_message[-block_size:]
```

* Combines the two messages into a single sequence.
* If the result exceeds the `block_size` and trimming is allowed:

  * Trims from the **beginning**, keeping only the last `block_size` tokens.
  * This favors more recent dialogue, which is often more contextually relevant.
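A minimal sketch of the combine-and-trim step on made-up token IDs. The `combine_pair` helper is illustrative and takes `block_size` as a parameter instead of reading the notebook's global.

```python
def combine_pair(you_message, assistant_message, block_size, trim=True):
    """Concatenate one You/Assistant pair and keep at most the last block_size tokens."""
    combined = you_message + assistant_message
    if trim and len(combined) > block_size:
        # Negative slicing keeps the tail, so the oldest tokens are dropped.
        combined = combined[-block_size:]
    return combined


short_pair = combine_pair([1, 2], [3, 4], block_size=6)
long_pair = combine_pair([1, 2, 3, 4], [5, 6, 7, 8], block_size=6)

print(short_pair)
print(long_pair)
```

Note what the trim removes in the second case: the first tokens of the user message, which in real data would include the start-of-text marker.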

---

### 🔸 Append to Output

```python
combined_turns_data.append(final_message)
```

* Adds the finalized token sequence to the output list.

---

### 🔹 Apply to Training and Validation Sets

```python
combined_train_data = combined_turns(data=train_data, should_trim_long_sequence=True)
combined_val_data = combined_turns(data=val_data, should_trim_long_sequence=True)
```

* **First call** processes the `train_data` token sequences and stores the result in `combined_train_data`, the name later cells read.
* **Second call** processes the `val_data` token sequences and stores the result in `combined_val_data`.

---

## ❗ Important Note

* Each result needs its own name: assigning both calls to `combined_val_data` would overwrite the training result and leave `combined_train_data` undefined.
* The second call must receive `val_data`; a misspelling such as `val_dal` raises a `NameError`.

---

## ✅ Summary

* This process merges alternating User and Assistant messages into training samples of token sequences.
* Ensures that each input sample fits within a maximum length (`block_size`).
* Preserves recent context by trimming from the start if needed.
* Prepares data in a format compatible with autoregressive transformer models like GPT.

---


In [None]:
print("Train data")
print(f"Length before: {len(train_data)}")
print(f"Length after: {len(combined_train_data)}")

print("\nValidation data")
print(f"Length before: {len(val_data)}")
print(f"Length after: {len(combined_val_data)}")

Convert the token sequences into tensors.

In [None]:
import torch

train_data=torch.tensor(combined_train_data)
val_data=torch.tensor(combined_val_data)

---

# 🧠 Explanation: Converting Tokenized Data into PyTorch Tensors

This snippet converts your preprocessed token sequences into **PyTorch tensors**, which are the primary data structure used for model training in PyTorch.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Convert Training Data

```python
train_data = torch.tensor(combined_train_data)
```

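* Takes `combined_train_data` (a list of lists of integers representing token IDs).
* Produces a **PyTorch tensor** of shape `[num_samples, sequence_length]`, but only if every inner list has the same length; with sequences of different lengths `torch.tensor` raises an error.
* Once the data is a tensor, it can be moved to a GPU for accelerated operations during training.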

---

### 🔹 Convert Validation Data

```python
val_data = torch.tensor(combined_val_data)
```

* Similarly converts validation sequences to a tensor.
* Allows efficient evaluation of the model during training.

---

## ✅ Why Use PyTorch Tensors?

* Tensors are optimized for mathematical operations, including backpropagation.
* They seamlessly integrate with PyTorch’s DataLoaders and neural network modules.
* Support GPU acceleration for faster training.

---

## ⚠️ Assumptions and Tips

* `torch.tensor` requires the sequences in `combined_train_data` and `combined_val_data` to be **uniform in length**. Trimming only caps them at `block_size`; shorter sequences keep their own length, so this direct conversion fails on the data as prepared so far.
* Pad the sequences to a common length before conversion, or use PyTorch’s `PackedSequence` utilities.
* Using tensors is a necessary step before feeding data into most PyTorch models.
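The length requirement is easy to reproduce. This sketch uses two made-up lists: one with equal-length rows, one ragged.

```python
import torch

uniform = [[1, 2, 3], [4, 5, 6]]
ragged = [[1, 2, 3], [4, 5]]  # second row is shorter

# Equal-length rows convert to a 2-D tensor.
uniform_tensor = torch.tensor(uniform)
print(uniform_tensor.shape)

# Rows of different lengths cannot form a rectangular tensor.
try:
    torch.tensor(ragged)
    ragged_converted = True
except ValueError as error:
    ragged_converted = False
    print(f"conversion failed: {error}")
```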

---

The remaining step before the data is tensor-ready is to bring all sequences to the same length.


The token sequences do not all have the same length, so they cannot be converted into a tensor all at once.
They first need to be brought to the same length.

Padding fixes the problem. It can be added at the start or at the end of each sequence; the code below pads at the end.

In [None]:
import torch
torch.manual_seed(3647)

# The <|padding|> token fills the unused positions of short sequences.
# Padding positions are masked, meaning the model ignores them during training,
# i.e. no loss is calculated for them.

padding_token = tokenizer.special_tokens["<|padding|>"]

def apply_padding_to_data(data: list[list[int]], block_size: int, padding_token: int) -> torch.Tensor:
    tensors = []
    for i in range(len(data)):
        tensor = torch.tensor(data[i])
        padded_tensor = torch.nn.functional.pad(
            input=tensor,
            # Right padding: (left, right) amounts for the last dimension.
            pad=(0, block_size - len(tensor)),
            # Left padding would be: pad=(block_size - len(tensor), 0)
            value=padding_token
        )
        tensors.append(padded_tensor)

    return torch.stack(tensors)

train_data_tensor = apply_padding_to_data(
    data=combined_train_data,
    block_size=block_size,
    padding_token=padding_token
)

val_data_tensor = apply_padding_to_data(
    data=combined_val_data,
    block_size=block_size,
    padding_token=padding_token 
)

val_data=tensor=apply_padding_to_data(
    data=combined_val_data,
    block_size=block_size,
    padding_token=padding_token
)

train_data_tensor.shape,val_data_tensor.shape

---

# 🧠 Explanation: Padding Token Sequences for Uniform Length Using PyTorch

This code snippet prepares your tokenized sequences for model training by **right-padding** them to a fixed `block_size`. Padding ensures that all sequences have the same length, which is necessary for batch processing in neural networks.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Set Random Seed

```python
torch.manual_seed(3647)
```

* Fixes randomness for reproducibility (important in experiments).

---

### 🔹 Identify Padding Token ID

```python
padding_token = tokenizer.special_tokens["<|padding|>"]
```

* Retrieves the integer ID for the special padding token.
* This token will be used to fill sequences that are shorter than `block_size`.

> **Note:** The key must match the tokenizer’s special-token key exactly. The model setup later uses `"<|padding|>"`; a misplaced bar, as in `"|<padding|>"`, raises a `KeyError`.

---

### 🔹 Define Padding Function

```python
def apply_padding_to_data(data: list[list[int]], block_size: int, padding_token: int) -> torch.Tensor:
```

* **Inputs**:

  * `data`: List of token sequences (lists of ints).
  * `block_size`: Desired fixed sequence length.
  * `padding_token`: Token ID to use for padding.

* **Process**:

  * Converts each sequence to a PyTorch tensor.
  * Pads sequences on the **right side** to reach `block_size` tokens.
  * Uses `torch.nn.functional.pad` with parameters:

    * `pad=(0, block_size - len(tensor))`: adds no elements on the left and `block_size - len(tensor)` elements on the right.
    * `value=padding_token`: fills padding with the padding token ID.
  * Collects all padded tensors.

* **Returns**:

  * A stacked tensor of shape `[num_sequences, block_size]`.
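The process above can be sketched on two short made-up sequences. The padding ID `99`, the block size of 5, and the `pad_right` helper are placeholders chosen for readability.

```python
import torch
import torch.nn.functional as F

PAD = 99  # hypothetical padding ID; the notebook reads it from the tokenizer


def pad_right(sequence, block_size, padding_token):
    tensor = torch.tensor(sequence, dtype=torch.long)
    # pad=(left, right) applies to the last dimension: nothing on the left,
    # and just enough on the right to reach block_size.
    return F.pad(tensor, pad=(0, block_size - len(tensor)), value=padding_token)


sequences = [[7, 8, 9], [1, 2, 3, 4, 5]]
batch = torch.stack([pad_right(seq, 5, PAD) for seq in sequences])

print(batch)
print(batch.shape)
```

A sequence that already has `block_size` tokens passes through unchanged. The sketch assumes no sequence is longer than `block_size`, which the earlier trimming step is meant to guarantee.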

---

### 🔹 Pad Training and Validation Data

```python
train_data_tensor = apply_padding_to_data(
    data=combined_train_data,
    block_size=block_size,
    padding_token=padding_token
)

val_data_tensor = apply_padding_to_data(
    data=combined_val_data,
    block_size=block_size,
    padding_token=padding_token 
)
```

* Applies the padding function to both training and validation datasets.
* Converts variable-length sequences into uniform-length tensors suitable for batching.

---

### 🔹 Duplicate Line to Pad Validation Data Again

```python
val_data = tensor = apply_padding_to_data(
    data=combined_val_data,
    block_size=block_size,
    padding_token=padding_token
)
```

* This line appears to redundantly pad `combined_val_data` again and assign it to `val_data` and `tensor`.
* This can be simplified or removed unless there's a reason to have a separate variable.

---

### 🔹 Final Shape Output

```python
train_data_tensor.shape, val_data_tensor.shape
```

* Returns the shape of the padded tensors.
* Should output something like `(num_train_samples, block_size)` and `(num_val_samples, block_size)` confirming successful padding.

---

## ✅ Summary

* Padding is essential to ensure fixed-length inputs.
* Padding tokens are masked during training, so they don’t affect loss computation.
* Right-padding preserves the original token order and sequence start.
* The output tensors are ready for batch feeding into a transformer model.

---

# ⚠️ Notes and Potential Fixes

* The padding token key must be `"<|padding|>"`, not `"|<padding|>"`, and the result must be stored in `padding_token`, the name the later calls use.
* The padding call is `torch.nn.functional.pad`, a function in the `torch.nn.functional` module; tensors have no `nn` attribute, so `tensor.nn.functional.pad` fails.
* The second padding of `combined_val_data` (assigned to `val_data` and `tensor`) is redundant and could be removed to avoid confusion.

---

This ensures your data is properly padded and formatted for model training with PyTorch!


In [None]:
train_data_tensor[0]

In [None]:
val_data_tensor[0]

### 5. Create the data loaders

In [None]:
train_data_tensor.shape

In [None]:
from typing import Tuple
from torch.utils.data import Dataset,DataLoader

class FineTuningDataset(Dataset):
    def __init__(self,data: torch.Tensor,device:torch.device , padding_token: int):
        self.data=data #shape: (num_samples, block_size)
        self.device=device
        self.padding_token=padding_token

    def __len__(self) -> int:
        return len(self.data)
    
    def __getitem__(self,index: int)->Tuple[torch.Tensor,torch.Tensor]:
        sample=self.data[index]
        x=sample.to(self.device)
        y=sample[1:].to(self.device)
        padding_tensor=torch.tensor([self.padding_token],device=self.device)
        y=torch.cat((y,padding_tensor))
        return x,y
    
batch_size=64
device=torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_dataset=FineTuningDataset(
    data=train_data_tensor,
    device=device,
    padding_token=padding_token

)
train_loader=DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True
)

val_dataset=FineTuningDataset(
    data=val_data_tensor,
    device=device,
    padding_token=padding_token
)
val_loader=DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    shuffle=False
)


---

# 🧠 Explanation: Creating a Custom Dataset and DataLoader for Fine-Tuning with PyTorch

This code defines a **custom PyTorch Dataset class** for handling tokenized conversation data and prepares `DataLoader`s for efficient batch training and validation.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 `FineTuningDataset` Class

* Inherits from `torch.utils.data.Dataset`, enabling seamless integration with PyTorch’s data pipeline.

---

### 🔸 `__init__` Method

```python
def __init__(self, data: torch.Tensor, device: torch.device, padding_token: int):
    self.data = data  # shape: (num_samples, block_size)
    self.device = device
    self.padding_token = padding_token
```

* Stores the tokenized and padded dataset as a tensor.
* Stores the device (CPU or GPU) for efficient data transfer.
* Keeps track of the padding token ID, used later for target sequence alignment.

---

### 🔸 `__len__` Method

```python
def __len__(self) -> int:
    return len(self.data)
```

* Returns the number of samples in the dataset.
* Enables use of `len()` on the dataset.

---

### 🔸 `__getitem__` Method

```python
def __getitem__(self, index: int) -> Tuple[torch.Tensor, torch.Tensor]:
    sample = self.data[index]
    x = sample.to(self.device)
    y = sample[1:].to(self.device)
    padding_tensor = torch.tensor([self.padding_token], device=self.device)
    y = torch.cat((y, padding_tensor))
    return x, y
```

* Retrieves one token sequence sample by index.
* **Inputs** (`x`):

  * The full token sequence.
* **Targets** (`y`):

  * The token sequence shifted one step **to the left**, i.e., from token 1 to end.
  * Padding token appended at the end to keep `y` the same length as `x`.
* This setup prepares sequences for **autoregressive training** where the model predicts the next token.
* Moves tensors to the appropriate device.
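The shift is easier to see on a concrete sample. This sketch uses a made-up padded sequence and a hypothetical padding ID of `0`; it mirrors the `x`/`y` construction above without the device handling.

```python
import torch

PAD = 0  # hypothetical padding ID
sample = torch.tensor([11, 12, 13, 14, PAD, PAD])

# Input: the full sequence.
x = sample
# Target: the same sequence shifted left by one, padded back to full length.
y = torch.cat((sample[1:], torch.tensor([PAD])))

# At every position, y holds the token that follows x at that position.
for input_token, target_token in zip(x.tolist(), y.tolist()):
    print(f"{input_token:>3} -> {target_token}")
```

The trailing positions pair a token with the padding ID, which is why the loss needs to ignore that ID.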

---

### 🔹 Dataset and DataLoader Instantiation

---

### 🔸 Batch Size and Device

```python
batch_size = 64
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

* Sets a batch size of 64 samples per batch.
* Chooses GPU if available; otherwise, CPU.

---

### 🔸 Create Train Dataset and Loader

```python
train_dataset = FineTuningDataset(
    data=train_data_tensor,
    device=device,
    padding_token=padding_token
)

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True
)
```

* Wraps training data tensor in the custom dataset.
* Creates a DataLoader to provide shuffled batches during training.
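How `batch_size` carves up a dataset can be checked on a small made-up tensor. The sketch uses PyTorch's built-in `TensorDataset` instead of `FineTuningDataset` to stay self-contained, and turns shuffling off so the result is predictable.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 made-up samples of length 4.
data = torch.arange(10 * 4).reshape(10, 4)
loader = DataLoader(TensorDataset(data), batch_size=4, shuffle=False)

# TensorDataset yields one-element tuples, so batch[0] is the data tensor.
batch_shapes = [tuple(batch[0].shape) for batch in loader]
print(batch_shapes)
```

The last batch is smaller because 10 is not a multiple of 4; `DataLoader` accepts `drop_last=True` if partial batches are unwanted.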

---

### 🔸 Create Validation Dataset and Loader

```python
val_dataset = FineTuningDataset(
    data=val_data_tensor,
    device=device,
    padding_token=padding_token
)

val_loader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    shuffle=False
)
```

* Wraps validation data similarly.
* Validation loader does **not shuffle** to keep evaluation deterministic.

> **Note:** `device` is a required argument of `FineTuningDataset.__init__`, so `val_dataset` must receive `device=device` as well; omitting it raises a `TypeError`.

---

## ✅ Summary

* This setup allows easy batching and feeding of tokenized sequences to a transformer model.
* The target sequences are shifted versions of inputs for predicting the next token.
* Using PyTorch `DataLoader` provides efficient iteration, batching, and shuffling.
* Device handling within the dataset helps with minimizing manual tensor transfers during training.

---

This class and loaders form the backbone for training and validating autoregressive language models on conversation data.


In [None]:
x,y=next(iter(train_loader))

---

# 🧠 Explanation: Retrieving a Batch from the DataLoader

This line of code:

```python
x, y = next(iter(train_loader))
```

performs a **single batch extraction** from the training data loader.

---

## What Happens Here?

* `iter(train_loader)` creates an **iterator** over the `train_loader`.
* `next(...)` fetches the **first batch** from that iterator.
* The batch is unpacked into:

  * `x`: a tensor containing the input sequences (shape: `[batch_size, block_size]`).
  * `y`: a tensor containing the target sequences (shifted inputs for next-token prediction).

---

## Why Use This?

* Useful for **quick inspection** or debugging to check shapes, data types, and content of your batches.
* Ensures your data pipeline is working as expected before starting training.
* Can be used to test model input/output compatibility.

---

## Typical Use Case

```python
print(x.shape)  # Expected: (batch_size, block_size)
print(y.shape)  # Expected: (batch_size, block_size)
```

This confirms batch dimensions and helps verify padding and shifting are correct.

---

This step is a straightforward way to peek into your training batches before feeding them into the model.


# Fine-tuning

### 1. Load the saved checkpoint

In [None]:
from transformer.model import GPTLanguageModel

block_size=256
n_embd=512
n_head = 8
n_layer = 4
dropout = 0.2
batch_size = 64
vocab_size = get_vocab_size(tokenizer)

model = GPTLanguageModel(
    vocab_size=vocab_size,
    block_size=block_size,
    n_embd=n_embd,
    n_head=n_head,
    n_layer=n_layer,
    dropout=dropout,
    device=device,
    ignore_index=tokenizer.special_tokens["<|padding|>"],

).to(device)
model=torch.compile(model)
print(sum(p.numel() for p in model.parameters())/1e6, 'M parameters')

---

# 🧠 Explanation: Initializing and Compiling a GPT Language Model for Fine-Tuning

This snippet sets up a GPT-based language model with specified hyperparameters, prepares it for training on your dataset, and prints out the total number of model parameters.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Define Model Hyperparameters

```python
block_size = 256      # Maximum context length (sequence length)
n_embd = 512          # Embedding dimension size
n_head = 8            # Number of attention heads in each Transformer block
n_layer = 4           # Number of Transformer layers (depth)
dropout = 0.2         # Dropout rate for regularization
batch_size = 64       # Batch size for training
vocab_size = get_vocab_size(tokenizer)  # Vocabulary size based on tokenizer
```

* These hyperparameters configure the model's architecture and capacity.
* `block_size` matches the input sequence length you prepared.
* `vocab_size` corresponds to the tokenizer's vocabulary size plus special tokens.

---

### 🔹 Instantiate the Model

```python
model = GPTLanguageModel(
    vocab_size=vocab_size,
    block_size=block_size,
    n_embd=n_embd,
    n_head=n_head,
    n_layer=n_layer,
    dropout=dropout,
    device=device,
    ignore_index=tokenizer.special_tokens["<|padding|>"],
).to(device)
```

* Creates a GPT model with the specified architecture.
* Passes the device (CPU or GPU) for proper allocation.
* Sets `ignore_index` to the padding token so the loss function ignores padded tokens during training.
* The model is then moved to the selected device using `.to(device)`.
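The notebook does not show how `GPTLanguageModel` uses `ignore_index`. Assuming it is forwarded to a cross-entropy loss, the effect can be sketched with PyTorch's `F.cross_entropy` on random logits; the padding ID `0` and the tiny vocabulary are made up.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

PAD = 0          # hypothetical padding ID
vocab_size = 6   # tiny vocabulary for illustration

# Five positions, two of which are padding targets.
logits = torch.randn(5, vocab_size)
targets = torch.tensor([3, 1, 4, PAD, PAD])

# Loss with padding positions ignored.
masked_loss = F.cross_entropy(logits, targets, ignore_index=PAD)

# Equivalent: drop the padding positions by hand, then average.
keep = targets != PAD
manual_loss = F.cross_entropy(logits[keep], targets[keep])

print(masked_loss.item(), manual_loss.item())
```

Both values agree, which shows that ignored positions are excluded from the average and contribute no gradient.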

---

### 🔹 Compile the Model

```python
model = torch.compile(model)
```

* Uses PyTorch 2.0's `torch.compile()` feature to optimize the model's execution.
* Compilation can speed up training and inference by leveraging backend optimizations.
* Note: Requires PyTorch 2.0+ and compatible hardware.

---

### 🔹 Print Model Size

```python
print(sum(p.numel() for p in model.parameters()) / 1e6, 'M parameters')
```

* Calculates the total number of parameters in millions.
* Useful for understanding model complexity and estimating training resources.
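The same count can be verified by hand on a model small enough to add up mentally. The two-layer model and the `count_parameters` helper below are illustrative only.

```python
import torch


def count_parameters(model: torch.nn.Module) -> int:
    # numel() is the number of scalar values in each parameter tensor.
    return sum(p.numel() for p in model.parameters())


tiny_model = torch.nn.Sequential(
    torch.nn.Embedding(10, 4),  # 10 * 4 = 40 weights
    torch.nn.Linear(4, 3),      # 4 * 3 weights + 3 biases = 15
)
total_parameters = count_parameters(tiny_model)
print(total_parameters)
```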

---

## ✅ Summary

* This sets up a moderately sized GPT model tailored to your tokenizer and dataset.
* Incorporates device allocation, padding token handling, and runtime optimizations.
* The printed parameter count gives insight into model scale before training begins.

---

This prepares the core model architecture ready for fine-tuning on your dialogue data.


In [None]:
checkpoint_path = "../output/pre_training/base/epoch_5.pth"
checkpoint = torch.load(checkpoint_path, weights_only=True)
model_state_dict = checkpoint["model_state_dict"]
model.load_state_dict(model_state_dict)

---

# 🧠 Explanation: Loading a Pre-Trained Checkpoint into the Model

This snippet demonstrates how to **load pre-trained weights** from a saved checkpoint into your GPT model before fine-tuning.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Define Checkpoint Path

```python
checkpoint_path = "../output/pre_training/base/epoch_5.pth"
```

* Specifies the file path where the pre-trained model checkpoint is saved.
* Judging by the filename, this checkpoint presumably holds the weights saved after 5 epochs of pre-training.

---

### 🔹 Load Checkpoint File

```python
checkpoint = torch.load(checkpoint_path, weights_only=True)
```

* Loads the checkpoint file from disk into memory.
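* `weights_only=True` restricts unpickling to tensors and other basic types (such as dictionaries and numbers) instead of running arbitrary pickled code, which makes loading a checkpoint safer. It does not filter the file's contents: the model weights are selected in the next step via the `model_state_dict` key.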

---

### 🔹 Extract Model State Dictionary

```python
model_state_dict = checkpoint["model_state_dict"]
```

* Retrieves the saved dictionary of model parameters (`state_dict`) from the checkpoint.
* `state_dict` maps parameter names to their tensor values.

---

### 🔹 Load Weights into Model

```python
model.load_state_dict(model_state_dict)
```

* Updates the model’s parameters with the pre-trained weights.
* Ensures that the model starts from a trained initialization rather than random weights.
* Essential for transfer learning or continued training.
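The save-and-restore round trip can be sketched with a tiny layer and an in-memory buffer in place of a `.pth` file. The checkpoint layout (`model_state_dict` plus an `epoch` entry) follows the key used above; everything else is made up.

```python
import io

import torch

torch.manual_seed(0)
source = torch.nn.Linear(4, 2)  # stands in for the pre-trained model
target = torch.nn.Linear(4, 2)  # same architecture, different random weights

# Save a checkpoint dictionary into an in-memory buffer.
buffer = io.BytesIO()
torch.save({"model_state_dict": source.state_dict(), "epoch": 5}, buffer)
buffer.seek(0)

# Load it back and copy the weights into the second model.
checkpoint = torch.load(buffer, weights_only=True)
result = target.load_state_dict(checkpoint["model_state_dict"])

print(result)
print(torch.equal(source.weight, target.weight))
```

`load_state_dict` reports missing and unexpected keys, which makes an architecture mismatch visible right away.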

---

## ✅ Summary

* Loading a pre-trained checkpoint accelerates convergence and can improve performance.
* This step prepares your GPT model to fine-tune on your specific data with previously learned knowledge.
* Make sure the model architecture matches the checkpoint to avoid errors when loading weights.

---

This sets your model up with pre-trained knowledge before beginning fine-tuning on your dataset.


Generate from the model to make sure that the weights were loaded correctly.

In [None]:
input_tokens = tokenizer.encode("Salam labas ", allowed_special="all")
input_tokens = torch.tensor(
    input_tokens, dtype=torch.long).unsqueeze(0).to(device)

model.eval()
with torch.no_grad():
    output = model.generate(input_tokens=input_tokens, max_new_tokens=100)

print(tokenizer.decode(output[0].tolist()))

---

# 🧠 Explanation: Generating Text with the Pre-Trained GPT Model

This code snippet generates text from the model right after the checkpoint is loaded, as a sanity check that the pre-trained weights were restored correctly. Fine-tuning has not started yet.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Tokenize Input Prompt

```python
input_tokens = tokenizer.encode("Salam labas ", allowed_special="all")
input_tokens = torch.tensor(input_tokens, dtype=torch.long).unsqueeze(0).to(device)
```

* Encodes the string `"Salam labas "` into token IDs using your tokenizer.
* `allowed_special="all"` lets any special-token strings that appear in the text be encoded as their special-token IDs.
* Converts the token list into a PyTorch tensor with shape `[1, sequence_length]` (`unsqueeze(0)` adds batch dimension).
* Moves the tensor to the appropriate device (CPU or GPU).
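
The shape change from `unsqueeze(0)` is easy to check in isolation. In this small sketch the token IDs are made up rather than real tokenizer output, and `device` is defined locally as an assumption about how the notebook sets it up.

```python
import torch

# Assumed device selection; the notebook defines `device` elsewhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

token_ids = [12, 7, 305]  # made-up token IDs for illustration
flat = torch.tensor(token_ids, dtype=torch.long)   # shape [3]
batch = flat.unsqueeze(0).to(device)               # shape [1, 3]

print(flat.shape, batch.shape)
```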

---

### 🔹 Set Model to Evaluation Mode

```python
model.eval()
```

* Switches the model to evaluation mode.
* Disables dropout and other training-specific behaviors for consistent output.
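
The effect of `eval()` on dropout can be observed directly on a standalone `nn.Dropout` layer. This is a minimal sketch, separate from the GPT model:

```python
import torch
from torch import nn

dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # entries are randomly zeroed, survivors scaled by 1 / (1 - p)

dropout.eval()
eval_out = dropout(x)   # identity: the input passes through unchanged

print(train_out[:8])
print(torch.equal(eval_out, x))
```

In training mode the output changes from call to call; in evaluation mode it is deterministic, which is why `model.eval()` precedes generation.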

---

### 🔹 Generate Text Without Gradient Tracking

```python
with torch.no_grad():
    output = model.generate(input_tokens=input_tokens, max_new_tokens=100)
```

* Disables gradient computation to save memory and speed up inference.
* Calls the model’s `.generate()` method to produce up to 100 new tokens following the input.
* The generation method uses the model’s autoregressive property to predict next tokens one-by-one.
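
The model's `generate()` implementation is not shown in this section, so the following is only a sketch of how autoregressive sampling commonly works, not the notebook's actual method. `ToyBigramModel` and `sample_tokens` are hypothetical names, and the toy model returns only logits (unlike the notebook's model, which also returns a loss).

```python
import torch
from torch import nn
import torch.nn.functional as F


class ToyBigramModel(nn.Module):
    """Hypothetical stand-in: maps each token ID directly to next-token logits."""

    def __init__(self, vocab_size: int) -> None:
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.table(tokens)  # shape [batch, seq_len, vocab_size]


@torch.no_grad()
def sample_tokens(model: nn.Module, input_tokens: torch.Tensor,
                  max_new_tokens: int, block_size: int) -> torch.Tensor:
    tokens = input_tokens
    for _ in range(max_new_tokens):
        context = tokens[:, -block_size:]      # crop to the context window
        logits = model(context)[:, -1, :]      # logits for the last position
        probs = F.softmax(logits, dim=-1)      # convert logits to probabilities
        next_token = torch.multinomial(probs, num_samples=1)  # sample one ID
        tokens = torch.cat((tokens, next_token), dim=1)       # append and repeat
    return tokens


model = ToyBigramModel(vocab_size=16).eval()
prompt = torch.tensor([[1, 2, 3]], dtype=torch.long)
output = sample_tokens(model, prompt, max_new_tokens=5, block_size=8)
print(output.shape)
```

Each iteration feeds the sequence so far back into the model, which is what "autoregressive" means here. The returned tensor starts with the prompt and ends with the sampled tokens.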

---

### 🔹 Decode and Print Output

```python
print(tokenizer.decode(output[0].tolist()))
```

* Converts the generated token IDs back into a human-readable string.
* Prints the full sequence: the input prompt followed by the generated continuation.

---

## ✅ Summary

* This is the standard inference pattern for transformer language models: encode the prompt, generate under `eval()` and `no_grad()`, then decode.
* The input prompt seeds the generation, and the model continues from it.
* At this stage the model has only been pre-trained, so expect a plain continuation of the prompt rather than a chat-style answer.

---

A plausible-looking continuation confirms that the pre-trained weights were loaded correctly before fine-tuning begins.


### 2. Estimate loss

In [None]:
from typing import Dict


@torch.no_grad()
def estimate_loss(
    model: torch.nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
) -> Dict[str, float]:
    output = {}
    model.eval()

    for split, loader in [('train', train_loader), ('val', val_loader)]:
        losses = []
        for x, y in loader:
            with torch.no_grad():
                _, loss = model(x, y)
            losses.append(loss.item())
        output[split] = sum(losses) / len(losses)

    model.train()
    return output

---

# 🧠 Explanation: Function to Estimate Average Training and Validation Loss

This function computes the **average loss** of the model over the entire training and validation datasets, without updating the model weights. It helps monitor performance and overfitting during training.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Function Signature and Decorator

```python
@torch.no_grad()
def estimate_loss(
    model: torch.nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
) -> Dict[str, float]:
```

* Decorated with `@torch.no_grad()` to disable gradient calculation throughout the function, reducing memory usage and speeding up evaluation.
* Accepts:

  * `model`: The PyTorch model to evaluate.
  * `train_loader` and `val_loader`: DataLoaders for training and validation sets.
* Returns a dictionary with average losses, e.g., `{"train": 0.12, "val": 0.15}`.

---

### 🔹 Set Model to Evaluation Mode

```python
model.eval()
```

* Switches the model to evaluation mode to deactivate training behaviors like dropout.

---

### 🔹 Compute Loss for Each Split

```python
for split, loader in [('train', train_loader), ('val', val_loader)]:
    losses = []
    for x, y in loader:
        with torch.no_grad():
            _, loss = model(x, y)
        losses.append(loss.item())
    output[split] = sum(losses) / len(losses)
```

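* Loops over the training and validation loaders.
* For each batch `(x, y)`:

  * Calls the model to get the loss (assuming the model returns a tuple with loss as the second element).
  * Collects the scalar loss value with `loss.item()`.
* The inner `with torch.no_grad():` is redundant, because the `@torch.no_grad()` decorator already covers the whole function; it is harmless.
* Calculates the **mean of the per-batch losses** for each split. Assuming each batch loss is itself a mean, this differs slightly from the per-sample mean when the last batch is smaller than the others.
* Stores the results in the output dictionary.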
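
`estimate_loss` averages the per-batch losses with equal weight. If each batch loss is itself a mean over the batch (an assumption about the model's loss), a smaller final batch gets more weight per sample than the others. The sketch below contrasts the two averages on made-up numbers; `mean_of_batch_means` and `sample_weighted_mean` are hypothetical helper names.

```python
from typing import Sequence


def mean_of_batch_means(batch_losses: Sequence[float]) -> float:
    # What estimate_loss computes: every batch counts equally.
    return sum(batch_losses) / len(batch_losses)


def sample_weighted_mean(batch_losses: Sequence[float],
                         batch_sizes: Sequence[int]) -> float:
    # Weight each batch loss by the number of samples it covers.
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Made-up numbers: a full batch of 3 samples and a final batch of 1 sample.
losses = [1.0, 3.0]
sizes = [3, 1]
print(mean_of_batch_means(losses))          # 2.0
print(sample_weighted_mean(losses, sizes))  # 1.5
```

With many batches of equal size the two values are nearly identical, so the simpler average is usually adequate for monitoring.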

---

### 🔹 Restore Model to Training Mode

```python
model.train()
```

* Switches the model back to training mode after evaluation to resume normal training behavior.
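
`estimate_loss` always leaves the model in training mode, which suits the training loop. If the function might also be called on a model that is in evaluation mode, one option is to restore whatever mode was active before. The following is a sketch of that variant with a stand-in model; `evaluate_preserving_mode` is a hypothetical name.

```python
import torch
from torch import nn


@torch.no_grad()
def evaluate_preserving_mode(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    was_training = model.training  # remember the mode the caller was in
    model.eval()
    try:
        return model(x)
    finally:
        model.train(was_training)  # restore it, even if the forward pass fails


model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))

model.eval()
evaluate_preserving_mode(model, torch.randn(2, 4))
stayed_in_eval = not model.training

model.train()
evaluate_preserving_mode(model, torch.randn(2, 4))
returned_to_train = model.training

print(stayed_in_eval, returned_to_train)
```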

---

### 🔹 Return Average Losses

```python
return output
```

* Returns a dictionary with average training and validation losses, useful for logging and early stopping.

---

## ✅ Summary

* Provides a clean way to evaluate the model’s performance on both datasets without affecting gradients.
* Helps track progress during training and identify overfitting or underfitting.
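* Disabling gradients keeps evaluation light on memory, but every call still makes a full pass over both loaders, so on large datasets frequent evaluation can dominate training time.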

---

This function is a key utility to monitor model learning quality during fine-tuning.


### 3. Save checkpoints

In [None]:
def save_checkpoint(
    model: GPTLanguageModel,
    optimizer: torch.optim.Optimizer,
    epoch: int,
    loss: float,
    file_path: str = "checkpoint.pth"
) -> None:
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss
    }
    torch.save(checkpoint, file_path)

---

# 🧠 Explanation: Saving a Training Checkpoint

This function saves the current state of the training process into a file, enabling you to **pause and resume training** later without losing progress.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Function Signature

```python
def save_checkpoint(
    model: GPTLanguageModel,
    optimizer: torch.optim.Optimizer,
    epoch: int,
    loss: float,
    file_path: str = "checkpoint.pth"
) -> None:
```

* Accepts:

  * `model`: The GPT model being trained.
  * `optimizer`: The optimizer used during training.
  * `epoch`: The current epoch number.
  * `loss`: A loss value to record alongside the weights (the training loop below passes the loss of the last training batch).
  * `file_path`: Where to save the checkpoint file (default: `"checkpoint.pth"`).

---

### 🔹 Prepare Checkpoint Dictionary

```python
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss
}
```

* Stores:

  * The current epoch for resuming training.
  * The model’s parameters (`state_dict`) to restore weights.
  * The optimizer's `state_dict`, which holds its hyperparameters (such as the learning rate) and its per-parameter state (for AdamW, the running moment estimates).
  * The loss value for monitoring or bookkeeping.

---

### 🔹 Save to Disk

```python
torch.save(checkpoint, file_path)
```

* Serializes and writes the checkpoint dictionary to the specified file path.
* Allows future loading via `torch.load()` for resuming or evaluation.
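
The notebook does not define a loading counterpart, so here is a minimal sketch of one under stated assumptions: `load_checkpoint` is a hypothetical helper, the model is a stand-in `nn.Linear` rather than `GPTLanguageModel`, and the file lives in a temporary directory. The dictionary keys match those written by `save_checkpoint`.

```python
import os
import tempfile

import torch
from torch import nn


def load_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer,
                    file_path: str) -> int:
    """Hypothetical counterpart to save_checkpoint; returns the saved epoch."""
    checkpoint = torch.load(file_path, weights_only=True)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]


# Round trip with a stand-in model.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pth")
    torch.save({
        "epoch": 3,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss": 1.23,
    }, path)

    # A fresh model and optimizer, as after restarting the notebook.
    restored = nn.Linear(4, 2)
    restored_optimizer = torch.optim.AdamW(restored.parameters(), lr=6e-5)
    start_epoch = load_checkpoint(restored, restored_optimizer, path) + 1

print(start_epoch)
```

Training would then continue from `start_epoch`, with the restored model and optimizer picking up where the saved run stopped.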

---

## ✅ Summary

* Checkpointing is essential for:

  * Recovering from interruptions (e.g., crashes, power failures).
  * Performing model evaluation at certain training milestones.
  * Experimenting with different training regimes without losing progress.
* This function captures the model and optimizer state needed to resume training; it does not record random-number-generator state or the position within an epoch.

---

This function provides a reliable way to persist training progress during your fine-tuning workflow.


### 4. Training loop

In [None]:
max_iters = 20
eval_interval = 20
learning_rate = 6e-5

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
train_losses = []
val_losses = []

for iteration in range(max_iters):
    for batch_idx, (x_batch, y_batch) in enumerate(train_loader):
        # Evaluation
        if batch_idx % eval_interval == 0 or batch_idx == len(train_loader) - 1:
            losses = estimate_loss(
                model=model,
                train_loader=train_loader,
                val_loader=val_loader,
            )
            train_losses.append(losses['train'])
            val_losses.append(losses['val'])

            print(
                f"iteration {iteration} / step {batch_idx}: "
                f"train loss {losses['train']:.4f}, "
                f"val loss {losses['val']:.4f}"
            )

        # Training step
        logits, loss = model(x_batch, y_batch)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    # Save checkpoint
    save_checkpoint(
        model=model,
        optimizer=optimizer,
        epoch=iteration,
        loss=loss.item(),
        file_path=f"../output/fine_tuning/run_3/checkpoint_{iteration}.pth"
    )

---

# 🧠 Explanation: Fine-Tuning Loop with Periodic Evaluation and Checkpointing

This code snippet runs the **training loop** for your GPT model with regular evaluation on both training and validation datasets, and saves checkpoints at the end of each epoch.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Define Training Parameters and Optimizer

```python
max_iters = 20         # Number of training epochs
eval_interval = 20     # Frequency (in batches) to evaluate model loss
learning_rate = 6e-5   # Learning rate for AdamW optimizer

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
train_losses = []
val_losses = []
```

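* `max_iters` sets the number of passes over `train_loader` (epochs, despite the name), and `eval_interval` sets how many batches pass between evaluations.
* Uses the AdamW optimizer, a common choice for transformer models.
* `train_losses` and `val_losses` collect the losses from each evaluation for plotting later.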

---

### 🔹 Outer Loop: Iterate Over Epochs

```python
for iteration in range(max_iters):
```

* Runs training for `max_iters` epochs.

---

### 🔹 Inner Loop: Iterate Over Training Batches

```python
for batch_idx, (x_batch, y_batch) in enumerate(train_loader):
```

* Processes the training dataset batch by batch.

---

### 🔹 Periodic Evaluation

```python
if batch_idx % eval_interval == 0 or batch_idx == len(train_loader) - 1:
    losses = estimate_loss(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
    )
    train_losses.append(losses['train'])
    val_losses.append(losses['val'])

    print(
        f"iteration {iteration} / step {batch_idx}: "
        f"train loss {losses['train']:.4f}, "
        f"val loss {losses['val']:.4f}"
    )
```

* Every `eval_interval` batches (and on the last batch), the model is evaluated on full training and validation sets using the previously defined `estimate_loss` function.
* Records and prints average losses for monitoring training progress.
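
To make the evaluation schedule concrete, this small sketch lists the batch indices at which the condition above is true. The loader length of 50 is a made-up value, and `evaluation_steps` is a hypothetical helper, not part of the notebook.

```python
from typing import List


def evaluation_steps(num_batches: int, eval_interval: int) -> List[int]:
    """Batch indices at which the loop's evaluation condition is true."""
    return [
        batch_idx
        for batch_idx in range(num_batches)
        if batch_idx % eval_interval == 0 or batch_idx == num_batches - 1
    ]


# Made-up loader length of 50 batches with the notebook's eval_interval of 20.
print(evaluation_steps(num_batches=50, eval_interval=20))  # [0, 20, 40, 49]
```

Each epoch appends one entry per listed index to `train_losses` and `val_losses`, so the x-axis of a plot of those lists counts evaluations, not batches.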

---

### 🔹 Training Step

```python
logits, loss = model(x_batch, y_batch)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
```

* Computes the forward pass, yielding model outputs and loss.
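* Clears the gradients from the previous step so they do not accumulate; `set_to_none=True` replaces them with `None` instead of zero-filled tensors, which saves memory.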
* Backpropagates the loss to compute gradients.
* Updates model weights using optimizer step.
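
Why gradients need clearing is easiest to see on a tiny example. The sketch below uses a stand-in `nn.Linear` model (an assumption for illustration) to show that `zero_grad(set_to_none=True)` leaves `.grad` as `None`, and that two backward passes without clearing add their gradients together.

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5)
x = torch.randn(8, 4)

# One backward pass populates .grad; clearing with set_to_none drops it.
model(x).pow(2).mean().backward()
optimizer.zero_grad(set_to_none=True)
grad_is_none = model.weight.grad is None

# Without clearing, a second backward pass adds to the stored gradients.
model(x).pow(2).mean().backward()
first = model.weight.grad.clone()
model(x).pow(2).mean().backward()
accumulated = model.weight.grad

print(grad_is_none, torch.allclose(accumulated, 2 * first))
```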

---

### 🔹 Checkpoint Saving

```python
save_checkpoint(
    model=model,
    optimizer=optimizer,
    epoch=iteration,
    loss=loss.item(),
    file_path=f"../output/fine_tuning/run_3/checkpoint_{iteration}.pth"
)
```

* At the end of each epoch, saves the current state of the model and optimizer, together with the training loss of the epoch's last batch.
* Checkpoints allow resuming training or evaluating intermediate models.
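
`torch.save` does not create missing parent directories, so the `run_3` folder has to exist before the first checkpoint is written. One way to guarantee that is sketched here; `checkpoint_path` is a hypothetical helper, and the example runs in a temporary directory rather than the notebook's output folder.

```python
import tempfile
from pathlib import Path


def checkpoint_path(run_dir: str, epoch: int) -> Path:
    """Hypothetical helper: build the checkpoint path and ensure its folder exists."""
    directory = Path(run_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"checkpoint_{epoch}.pth"


# Demonstrated in a temporary directory; in the notebook, run_dir would be
# "../output/fine_tuning/run_3".
with tempfile.TemporaryDirectory() as tmp_dir:
    path = checkpoint_path(f"{tmp_dir}/fine_tuning/run_3", epoch=0)
    parent_existed = path.parent.is_dir()

print(path.name, parent_existed)
```

The returned path can be passed as `file_path` to `save_checkpoint` (converted with `str(path)` to match its type hint).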

---

## ✅ Summary

* This loop alternates between training on batches and periodically evaluating performance.
* Evaluation helps detect issues like overfitting early.
* Checkpointing provides safety and flexibility during long training runs.
* The learning rate here (`6e-5`) is small, which is typical for fine-tuning: large updates could overwrite what the model learned during pre-training.

---

This structure represents a standard, effective fine-tuning workflow for transformer models.


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Validation Loss")
plt.xlabel("Evaluation Step")
plt.ylabel("Loss")
plt.title("Training and Validation Loss Over Time")
plt.legend()
plt.grid()
plt.show()

In [None]:
def get_input_tokens(message: str) -> torch.Tensor:
    input_tokens = tokenizer.encode(
        f"<|startoftext|>{message}<|separator|>", allowed_special="all")
    input_tokens = torch.tensor(
        input_tokens, dtype=torch.long).unsqueeze(0).to(device)
    return input_tokens


user_message = "Salam labas"
input_tokens = get_input_tokens(message=user_message)
model_answer = ""

model.eval()
while True:
    output_tokens = model.generate(input_tokens=input_tokens, max_new_tokens=1)
    last_generated_token = output_tokens[0, -1].item()
    if last_generated_token == tokenizer.special_tokens["<|endoftext|>"]:
        break

    input_tokens = torch.cat((input_tokens, output_tokens[:, -1:]), dim=1)
    model_answer += tokenizer.decode([last_generated_token])

    if len(output_tokens[0]) > block_size:
        input_tokens = input_tokens[:, -block_size:]

print(f"You: {user_message}")
print(f"Assistant: {model_answer}")

---

# 🧠 Explanation: Token-by-Token Text Generation Loop

This code generates a reply to a fixed user message **one token at a time**, stopping when the model emits the special end-of-text token.

---

## ⚙️ Step-by-Step Breakdown

### 🔹 Function to Prepare Input Tokens

```python
def get_input_tokens(message: str) -> torch.Tensor:
    input_tokens = tokenizer.encode(
        f"<|startoftext|>{message}<|separator|>", allowed_special="all")
    input_tokens = torch.tensor(
        input_tokens, dtype=torch.long).unsqueeze(0).to(device)
    return input_tokens
```

* Takes a raw user string and:

  * Adds the special start and separator tokens to mark input boundaries.
  * Encodes the string into token IDs using the tokenizer.
  * Converts it into a batch tensor with shape `[1, seq_len]`.
  * Moves it to the correct device (CPU/GPU).
* This prepares input format expected by the GPT model.

---

### 🔹 Initialize Variables for Generation

```python
user_message = "Salam labas"
input_tokens = get_input_tokens(message=user_message)
model_answer = ""
```

* Sets a user prompt.
* Encodes it into input tokens.
* Initializes an empty string to accumulate the generated response.

---

### 🔹 Switch Model to Evaluation Mode

```python
model.eval()
```

* Disables dropout and other training behaviors for consistent inference.

---

### 🔹 Token-by-Token Generation Loop

```python
while True:
    output_tokens = model.generate(input_tokens=input_tokens, max_new_tokens=1)
    last_generated_token = output_tokens[0, -1].item()

    if last_generated_token == tokenizer.special_tokens["<|endoftext|>"]:
        break

    input_tokens = torch.cat((input_tokens, output_tokens[:, -1:]), dim=1)
    model_answer += tokenizer.decode([last_generated_token])

    if len(output_tokens[0]) > block_size:
        input_tokens = input_tokens[:, -block_size:]
```

* Generates **one token at a time** by calling `model.generate()` with `max_new_tokens=1`.
* Extracts the last generated token ID.
* Stops generation if the special end-of-text token is produced.
* Otherwise:

  * Appends the newly generated token to the current input tokens for the next generation step.
  * Decodes and accumulates the generated token into a human-readable string.
* Keeps the input tokens length within the model's `block_size` by trimming from the left if necessary, maintaining the context window.
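
The loop above runs until `<|endoftext|>` appears and does not disable gradient tracking itself. A more defensive variant is sketched below under stated assumptions: `generate_reply` and its `max_reply_tokens` parameter are hypothetical, `CountingModel` is a toy stand-in whose `generate()` returns the prompt plus new tokens, and the token IDs are made up.

```python
from typing import List

import torch


class CountingModel:
    """Hypothetical stand-in whose generate() appends (last token + 1) each step."""

    def generate(self, input_tokens: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
        tokens = input_tokens
        for _ in range(max_new_tokens):
            tokens = torch.cat((tokens, tokens[:, -1:] + 1), dim=1)
        return tokens


def generate_reply(model, input_tokens: torch.Tensor, end_token_id: int,
                   block_size: int, max_reply_tokens: int) -> List[int]:
    reply: List[int] = []
    with torch.no_grad():                     # inference only, no gradients
        while len(reply) < max_reply_tokens:  # hard cap on reply length
            output_tokens = model.generate(input_tokens=input_tokens, max_new_tokens=1)
            next_token = output_tokens[0, -1].item()
            if next_token == end_token_id:
                break
            reply.append(next_token)
            # Keep only the most recent block_size tokens as context.
            input_tokens = output_tokens[:, -block_size:]
    return reply


prompt = torch.tensor([[0, 1, 2]], dtype=torch.long)
stopped = generate_reply(CountingModel(), prompt, end_token_id=6,
                         block_size=4, max_reply_tokens=10)
capped = generate_reply(CountingModel(), prompt, end_token_id=99,
                        block_size=4, max_reply_tokens=5)
print(stopped)  # [3, 4, 5]
print(capped)   # [3, 4, 5, 6, 7]
```

The first call stops at the end token; the second never sees it and stops at the cap instead. Collecting token IDs and decoding them once at the end, instead of decoding token by token, is a design choice of this sketch.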

---

### 🔹 Print Conversation

```python
print(f"You: {user_message}")
print(f"Assistant: {model_answer}")
```

* Displays the original user prompt and the model's generated response.

---

## ✅ Summary

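* Generating one token per call lets the loop inspect each token, which is what makes stopping at `<|endoftext|>` possible.
* Manages input context length to fit the model's constraints.
* The loop has no upper bound on length: if the model never emits `<|endoftext|>`, it runs forever, so a maximum token count is a sensible safeguard.
* The loop is not wrapped in `torch.no_grad()`; unless `generate()` disables gradients internally, adding it saves memory.
* Useful for building chatbots or dialogue agents where incremental token generation is needed.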

---

This snippet provides a practical example of how to implement fine-grained control over autoregressive text generation with your fine-tuned GPT model.
