#EXPLORATION WITH DEEPSEEK 7B

This .ipynb file will serve as the core engine for our Bangla ad script generator. The codeblocks are given below, along with markdowns to explain what each step essentially does.

> **Note to Reviewer**: Phases 1-5 below document our exploration process with DeepSeek-7B. These cells are set to "Raw" format and will not execute. Please skip directly to **Phase 6 (Master Training Cell)** to run the actual training pipeline with Qwen2.5-1.5B.

## Phase 1: Environment Orchestration

### Step 1.1: Dependency Installation

**What we're doing:** Installing the specialized libraries that make this project possible.

| Library | Purpose |
|---------|---------|
| `unsloth` | Makes training 2-5x faster and uses 70% less memory. Without this, fine-tuning would crash on free Colab |
| `xformers` | Memory-efficient attention mechanism (helps the model "think" without running out of RAM) |
| `mergekit` | Lets us combine Tiger + DeepSeek into one "frankenstein" model |
| `peft` | Allows LoRA training - we only train 1% of the model instead of 100%, saving time and memory |

**Key Concept:** Why Unsloth?  
A great analogy found online is of cooking. Let us imagine we are cooking a meal but the stove is small. Unsloth is like using pressure cooking techniques - we get the same result faster with less energy. It is what allows us to train a 7B parameter model on free Google Colab's 15GB GPU.

In [None]:
'''# Step 1.1: Install core dependencies
# This takes 3-5 minutes.

# 1. Update pip first to avoid dependency resolution errors
!pip install --upgrade pip

# 2. Uninstall existing PyTorch and related packages to ensure clean CUDA installation
!pip uninstall -y torch torchvision torchaudio

# 3. Install CUDA-enabled PyTorch (assuming CUDA 12.1, common in Colab)
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install Unsloth from PyPI (stable version) with colab-new extras
!pip install "unsloth[colab-new]"

# 5. Install other required libraries
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install mergekit
!pip install pandas openpyxl

[0mFound existing installation: torch 2.10.0
Uninstalling torch-2.10.0:
  Successfully uninstalled torch-2.10.0
Found existing installation: torchvision 0.25.0
Uninstalling torchvision-0.25.0:
  Successfully uninstalled torchvision-0.25.0
[0mLooking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch
  Using cached https://download.pytorch.org/whl/cu121/torch-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (780.4 MB)
Collecting torchvision
  Using cached https://download.pytorch.org/whl/cu121/torchvision-0.20.1%2Bcu121-cp312-cp312-linux_x86_64.whl (7.3 MB)
Collecting torchaudio
  Using cached https://download.pytorch.org/whl/cu121/torchaudio-2.5.1%2Bcu121-cp312-cp312-linux_x86_64.whl (3.4 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch)
  Using cached https://download.pytorch.org/whl/cu121/nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch)
  Using cached https://download.pytorch.or

### Step 1.2: Library Setup & Hugging Face Login

**What we are doing:**
1. Validating that our GPU is ready.
2. Importing the tools we just installed.
3. Logging into Hugging Face so we can download the base model and upload our final `LekhAI` model.

**Action Required for future user:**
When you run this, you will see a text box. Paste your Hugging Face **Write** token there.

In [None]:
'''# Step 1.2: Import libraries and login
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from huggingface_hub import login

# Check if GPU is detected
gpu_stats = torch.cuda.get_device_properties(0)
print(f"GPU = {gpu_stats.name}. Max Memory = {round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)} GB.")

# Login to Hugging Face (Required to access models)
login()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
GPU = Tesla T4. Max Memory = 14.741 GB.


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

### Step 1.3: Hugging Face Authentication

### What We Are Doing

Hugging Face is like GitHub, but specifically for machine learning models instead of code. It hosts thousands of pre-trained models that researchers and companies share publicly. To download the base models we need (TigerLLM and DeepSeek-R1-Distill-Qwen) and to upload our final LekhAI model after training, we must authenticate with the Hugging Face platform.

### Why Authentication Is Necessary

1. **Downloading Gated Models**: Some high-quality models on Hugging Face require us to accept their license terms before downloading. Authentication proves you have accepted these terms.

2. **Uploading Model**: After training, we will push the final LekhAI weights to our Hugging Face repository. This requires write access, which is only granted to authenticated users.

3. **Rate Limiting**: Anonymous downloads are rate-limited. Authenticated requests get higher priority and faster download speeds.

### How To Get Hugging Face Token - Future User

If you do not already have a Hugging Face account and token, follow these steps:

1. Go to [huggingface.co](https://huggingface.co) and create a free account.
2. Click on your profile picture in the top-right corner and select "Settings."
3. In the left sidebar, click "Access Tokens."
4. Click "Create new token" and give it a name (for example, "LekhAI Colab").
5. **Important**: Select "Write" as the token type. Read-only tokens cannot upload models.
6. Copy the token. It will look something like `hf_aBcDeFgHiJkLmNoPqRsTuVwXyZ123456`.

### Security Note

Your token is like a password. Do not share it publicly or commit it to version control. In Google Colab, the `login()` function stores the token securely in your session and does not display it in the notebook output.

In [None]:
'''# Step 1.3: Authenticate with Hugging Face
# When running this cell, a text input box will appear.
# Paste Hugging Face token (with Write permissions) and press Enter.

from huggingface_hub import login

# Initiate the login process
# The 'add_to_git_credential=True' flag stores the token for future Git operations
login(add_to_git_credential=True)

# After successful login, verify the connection by checking username
from huggingface_hub import whoami

try:
    user_info = whoami()
    print(f"Successfully authenticated as: {user_info['name']}")
    print(f"Account type: {user_info.get('type', 'user')}")
    print("You are now ready to download and upload models.")
except Exception as e:
    print(f"Authentication failed. Please check your token and try again.")
    print(f"Error details: {e}")

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Successfully authenticated as: Shudipta
Account type: user
You are now ready to download and upload models.


## Phase 2: Model Loading and Fusing

### Step 2.1: Writing the Merge Configuration File

### What We Are Attempting

In this step, we are creating a configuration file that tells the `mergekit` tool exactly how to combine two different language models into one. One can think of it like a recipe: we are specifying which ingredients (models) to use, in what proportions, and what technique to apply.

### Why Are We Attempting to Merge Two Models?

The goal of LekhAI is to generate high-quality Bangla advertisement scripts. No single existing model excels at both:

1. **Bangla Language Fluency**: Understanding and generating grammatically correct, culturally appropriate Bangla text.
2. **Logical Reasoning and Structure**: Following complex instructions, maintaining coherent multi-turn dialogues, and producing well-structured outputs.

By merging two specialized models, we aim to create a hybrid that inherits the strengths of both:

| Model | Specialization | What It Contributes to LekhAI |
|-------|----------------|-------------------------------|
| **TigerLLM-7B-Base** | A Bangla-focused language model trained extensively on Bangla text corpora | Native Bangla vocabulary, grammar patterns, and cultural context |
| **DeepSeek-R1-Distill-Qwen-7B** | A reasoning-optimized model distilled from larger models, known for following complex instructions | Structured output generation, logical flow, and instruction-following capability |

### What Is SLERP Merging?

SLERP stands for **Spherical Linear Interpolation**. It is a mathematical technique for blending two sets of weights (the parameters of the neural network) in a way that preserves the "direction" of each model's learned knowledge.

An analogy that works is: Let us imagine we have two compasses, each pointing in a different direction. Simple averaging would just find the midpoint, which might not be meaningful. SLERP traces an arc between the two directions, creating a smooth blend that preserves the essential character of both.

In practical terms, SLERP merging tends to produce more coherent outputs than simple weight averaging because it respects the geometric structure of the high-dimensional parameter space.

### The Merge Configuration Explained

Below, we create a YAML file that specifies:

- **`slices`**: Which models to merge and which layers to include (we include all layers from both models).
- **`merge_method`**: The algorithm to use (SLERP in our case).
- **`base_model`**: The primary model whose architecture and tokenizer will be preserved.
- **`parameters.t`**: The interpolation factor. A value of 0.5 means equal contribution from both models. Values closer to 0.0 favor the first model; values closer to 1.0 favor the second.
- **`dtype`**: The numerical precision of the merged weights. We use float16 to reduce memory usage while maintaining quality.

### Important Note on Model Sizes

Both models are 7 billion parameters. After merging, the result will still be 7 billion parameters (we are blending weights, not concatenating them). This is crucial because it means the merged model will fit within the same memory constraints as the individual models.

In [None]:
'''# Step 2.1: Create the merge configuration file for MergeKit
# This configuration specifies how TigerLLM and DeepSeek will be combined.

import yaml
import os

# Define the merge configuration as a Python dictionary
# This is easier to read and modify than writing YAML directly

merge_config = {
    "slices": [
        {
            "sources": [
                {
                    "model": "TigerResearch/tigerbot-7b-base",
                    "layer_range": [0, 32]  # Include all 32 transformer layers
                },
                {
                    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
                    "layer_range": [0, 32]
                }
            ]
        }
    ],
    "merge_method": "slerp",
    "base_model": "TigerResearch/tigerbot-7b-base",  # Use Tiger's tokenizer and architecture as the foundation
    "parameters": {
        "t": 0.5  # Equal contribution from both models (adjust between 0.0 and 1.0 if needed)
    },
    "dtype": "float16"  # Use half-precision to save memory
}

# Create a directory to store merge-related files
os.makedirs("merge_config", exist_ok=True)

# Write the configuration to a YAML file
config_path = "merge_config/lekhAI_merge_config.yaml"
with open(config_path, "w", encoding="utf-8") as f:
    yaml.dump(merge_config, f, default_flow_style=False, allow_unicode=True)

# Display the configuration for verification
print("Merge configuration saved to:", config_path)
print("\n" + "="*60)
print("CONFIGURATION CONTENTS:")
print("="*60 + "\n")

with open(config_path, "r", encoding="utf-8") as f:
    print(f.read())

print("="*60)
print("\nConfiguration file is ready. Proceed to Step 2.2 to execute the merge.")

Merge configuration saved to: merge_config/lekhAI_merge_config.yaml

CONFIGURATION CONTENTS:

base_model: TigerResearch/tigerbot-7b-base
dtype: float16
merge_method: slerp
parameters:
  t: 0.5
slices:
- sources:
  - layer_range:
    - 0
    - 32
    model: TigerResearch/tigerbot-7b-base
  - layer_range:
    - 0
    - 32
    model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B


Configuration file is ready. Proceed to Step 2.2 to execute the merge.


### Step 2.2: Executing the Model Merge

### What We Are Doing

In this step, we run the actual merging process. The `mergekit` tool will:

1. **Download both base models** from Hugging Face (approximately 14 gigabytes each, totaling around 28 gigabytes of downloads).
2. **Load the weights layer by layer** to avoid running out of memory.
3. **Apply SLERP interpolation** to blend the parameters according to our configuration.
4. **Save the merged model** to a local folder called `merged_lekhAI_base`.

### Expected Duration

This process typically takes **20 to 40 minutes** on Google Colab, depending on:
- Network speed for downloading the models
- Available CPU and RAM for the merge computation
- Disk write speed for saving the merged weights

### What Happens During the Merge (Technical Details)

1. **Layer-by-Layer Processing**: MergeKit does not load both 7-billion-parameter models into memory simultaneously (that would require over 50 gigabytes of RAM). Instead, it processes one layer at a time, loading the corresponding weights from both models, blending them, and writing the result to disk before moving to the next layer.

2. **Tokenizer Handling**: Because we specified `TigerResearch/tigerbot-7b-base` as the `base_model` in our configuration, the merged model will use Tiger's tokenizer. This is important because Tiger's tokenizer has been trained on Bangla text and contains Bangla-specific vocabulary tokens that DeepSeek's tokenizer lacks.

3. **Checkpoint Format**: The merged model will be saved in the Hugging Face Transformers format, meaning we can load it directly with libraries like `transformers` and `unsloth` without any additional conversion.

### Important Warnings for those using Colab

- **Do not interrupt this cell** while it is running. Interruption may leave partially written files that could cause errors later.
- **Monitor Colab session**: Google Colab may disconnect if left idle too long. Keep the browser tab active.
- **Disk space**: Ensure you have at least 30 gigabytes of free disk space in your Colab environment. You can check this by running `!df -h` in a separate cell. Conversely, you can hover the mouse pointer on the top right on the RAM and Disk tab below the 'Share' button.

### What To Expect in the Output

We will see progress messages indicating:
- Which layers are being processed (for example, "Processing layer 0/32")
- Download progress for each model
- Estimated time remaining

When complete, we will see a message confirming the merge was successful and the path to the merged model.

In [None]:
'''# Step 2.2 Diagnostic and Model Merging

import subprocess
import os

print("DIAGNOSTIC TEST 1: Check model accessibility")
print("="*60)

# Test if we can access each model
models_to_test = [
    "TigerResearch/tigerbot-7b-base",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
]

from huggingface_hub import HfApi, model_info

api = HfApi()

for model_name in models_to_test:
    print(f"\nChecking: {model_name}")
    try:
        info = model_info(model_name)
        print(f"  Status: ACCESSIBLE")
        print(f"  Model type: {info.config.get('model_type', 'Unknown') if info.config else 'Unknown'}")
        print(f"  Library: {info.library_name}")
    except Exception as e:
        print(f"  Status: ERROR")
        print(f"  Error: {e}")

print("\n" + "="*60)
print("DIAGNOSTIC TEST 2: Check model architectures")
print("="*60)

from transformers import AutoConfig

for model_name in models_to_test:
    print(f"\nModel: {model_name}")
    try:
        config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
        print(f"  Model type: {config.model_type}")
        print(f"  Hidden size: {config.hidden_size}")
        print(f"  Num layers: {config.num_hidden_layers}")
        print(f"  Num attention heads: {config.num_attention_heads}")
        print(f"  Vocab size: {config.vocab_size}")
    except Exception as e:
        print(f"  ERROR: {e}")

print("\n" + "="*60)
print("DIAGNOSTIC TEST 3: Run mergekit with full error capture")
print("="*60)

config_path = "merge_config/lekhAI_merge_config.yaml"
output_path = "merged_lekhAI_base_test"

# Run merge and capture both stdout and stderr
result = subprocess.run(
    f"mergekit-yaml {config_path} {output_path} --copy-tokenizer --allow-crimes --verbose 2>&1",
    shell=True,
    capture_output=True,
    text=True
)

print(f"\nReturn code: {result.returncode}")
print("\nFull output:")
print("-"*60)
print(result.stdout if result.stdout else "(no stdout)")
print("-"*60)
if result.stderr:
    print("Stderr:")
    print(result.stderr)

DIAGNOSTIC TEST 1: Check model accessibility

Checking: TigerResearch/tigerbot-7b-base
  Status: ACCESSIBLE
  Model type: llama
  Library: transformers

Checking: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  Status: ACCESSIBLE
  Model type: qwen2
  Library: transformers

DIAGNOSTIC TEST 2: Check model architectures

Model: TigerResearch/tigerbot-7b-base


config.json:   0%|          | 0.00/640 [00:00<?, ?B/s]

  Model type: llama
  Hidden size: 4096
  Num layers: 32
  Num attention heads: 32
  Vocab size: 60928

Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B


config.json:   0%|          | 0.00/680 [00:00<?, ?B/s]

  Model type: qwen2
  Hidden size: 3584
  Num layers: 28
  Num attention heads: 28
  Vocab size: 152064

DIAGNOSTIC TEST 3: Run mergekit with full error capture

Return code: 2

Full output:
------------------------------------------------------------
Skipping import of cpp extensions due to incompatible torch version 2.10.0+cu128 for torchao version 0.15.0             Please see https://github.com/pytorch/ao/issues/2919 for more info
2026-02-08 14:48:48.793630: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1770562128.829987    4229 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1770562128.841981    4229 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registere

### Step 2.2 (Fallback): Loading the DeepSeek Base Model

### Change of Approach

After diagnostic testing, we discovered that the TigerLLM and DeepSeek models have fundamentally incompatible architectures (different model types, hidden sizes, and layer counts). SLERP merging requires identical architectures, which these models do not share.

### Our Solution

We will use **DeepSeek-R1-Distill-Qwen-7B** directly as our foundation, as recommended by the faculty. This model:

1. **Strong Reasoning Capabilities**: DeepSeek-R1 was specifically designed for logical reasoning and structured output generation.
2. **Large Vocabulary (152,064 tokens)**: Includes support for multiple languages and scripts, including Bangla characters.
3. **Qwen2 Architecture**: A modern transformer architecture with efficient attention mechanisms.
4. **Instruction-Following**: Distilled from a larger reasoning model, making it naturally good at following complex prompts.

### Handling Bangla Text

While DeepSeek was not specifically trained on Bangla corpora like TigerLLM was, its large vocabulary and multilingual training data include Bangla script coverage. During fine-tuning, the model will learn:
- Bangla vocabulary patterns specific to advertising
- The tone and structure of professional ad scripts
- Industry-specific terminology

### Key Concept: Transfer Learning

When we fine-tune DeepSeek on Bangla ad scripts, we are performing "transfer learning." The model's existing knowledge of language structure, grammar, and reasoning transfers to Bangla, even if it saw less Bangla during pre-training. The fine-tuning process teaches it the specific patterns of your dataset.

In [None]:
'''# Step 2.2 (Revised): Set up DeepSeek as the base model
# As recommended by faculty, we use DeepSeek for its reasoning capabilities.

import os

# Define the base model we will use
BASE_MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# Create a variable to track this decision
print("BASE MODEL CONFIGURATION")
print("="*60)
print(f"Model: {BASE_MODEL_NAME}")
print("Architecture: Qwen2")
print("Parameters: 7 Billion")
print("Vocabulary: 152,064 tokens")
print()
print("Rationale (as per faculty recommendation):")
print("- DeepSeek-R1 has strong instruction-following capabilities")
print("- Distilled reasoning abilities from larger models")
print("- Large vocabulary with multilingual support including Bangla")
print("- Modern Qwen2 architecture optimized for generation tasks")
print("="*60)

# Store for use in later cells
base_model_path = BASE_MODEL_NAME
print(f"\nModel path set to: {base_model_path}")
print("\nProceeding to Step 2.3 for tokenizer verification.")

BASE MODEL CONFIGURATION
Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
Architecture: Qwen2
Parameters: 7 Billion
Vocabulary: 152,064 tokens

Rationale (as per faculty recommendation):
- DeepSeek-R1 has strong instruction-following capabilities
- Distilled reasoning abilities from larger models
- Large vocabulary with multilingual support including Bangla
- Modern Qwen2 architecture optimized for generation tasks

Model path set to: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

Proceeding to Step 2.3 for tokenizer verification.


### Step 2.3: Tokenizer Verification for DeepSeek

### What We Are Doing

In this step, we verify that the DeepSeek model's tokenizer correctly handles Bangla text. Although DeepSeek was primarily trained on Chinese and English, its large vocabulary of 152,064 tokens includes support for various scripts including Bangla.

### Why This Verification Matters

Before investing time in fine-tuning, we need to confirm that:

1. **Bangla characters are recognized**: The tokenizer should convert Bangla text into token IDs without replacing everything with "unknown" tokens.
2. **Tokenization is efficient**: Bangla words should be broken into reasonable subword units, not one token per character (which would be inefficient).
3. **Round-trip works**: Text encoded and then decoded should match the original.

### What Is the Qwen2 Tokenizer?

DeepSeek-R1-Distill-Qwen uses the Qwen2 tokenizer, which is based on the Byte-Level BPE (Byte-Pair Encoding) algorithm. Key features:

| Feature | Description |
|---------|-------------|
| **Byte-Level Encoding** | Any Unicode character can be represented, even if not seen during training |
| **Large Vocabulary** | 152,064 tokens provide extensive coverage of multiple languages |
| **Special Tokens** | Includes tokens for instruction formatting like `<|im_start|>` and `<|im_end|>` |
| **Chat Template** | Built-in support for multi-turn conversation formatting |

### Handling Unknown Characters

Even if specific Bangla words were not in the training data, the byte-level approach ensures they can still be processed. The model may initially produce lower-quality Bangla output, but fine-tuning on our dataset will teach it proper Bangla generation patterns.

In [None]:
'''# Step 2.3: Tokenizer Verification for DeepSeek-R1-Distill-Qwen-7B
# We verify that Bangla text can be properly encoded and decoded.

from transformers import AutoTokenizer

# Use the base model path defined in Step 2.2
BASE_MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

print("Loading DeepSeek tokenizer...")
print("="*60 + "\n")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    BASE_MODEL_NAME,
    trust_remote_code=True
)

# Display tokenizer information
print("TOKENIZER INFORMATION")
print("-"*40)
print(f"Tokenizer type: {type(tokenizer).__name__}")
print(f"Vocabulary size: {len(tokenizer):,} tokens")
print(f"Model max length: {tokenizer.model_max_length:,} tokens")
print(f"Padding side: {tokenizer.padding_side}")
print()

# Display special tokens
print("SPECIAL TOKENS")
print("-"*40)
special_tokens = {
    "BOS (Beginning of Sequence)": tokenizer.bos_token,
    "EOS (End of Sequence)": tokenizer.eos_token,
    "PAD (Padding)": tokenizer.pad_token,
    "UNK (Unknown)": tokenizer.unk_token,
}
for name, token in special_tokens.items():
    if token:
        token_id = tokenizer.convert_tokens_to_ids(token)
        print(f"  {name}: '{token}' (ID: {token_id})")
    else:
        print(f"  {name}: Not set")
print()

# Bangla text encoding test
print("BANGLA ENCODING TEST")
print("-"*40)

test_sentences = [
    "বাংলাদেশের বিজ্ঞাপন শিল্প অনেক উন্নত।",  # "Bangladesh's advertising industry is very advanced."
    "এটি একটি পেইন্টের বিজ্ঞাপন।",      # "This is an advertisement for paint."
    "আমাদের পণ্য সেরা মানের।",                  # "Our product is of the best quality."
]

all_tests_passed = True

for i, sentence in enumerate(test_sentences, 1):
    # Encode the sentence
    tokens = tokenizer.encode(sentence, add_special_tokens=False)
    token_count = len(tokens)
    char_count = len(sentence)

    # Calculate tokens per character (lower is more efficient)
    efficiency_ratio = token_count / char_count

    # Decode back to text
    decoded = tokenizer.decode(tokens, skip_special_tokens=True)

    # Check if round-trip is successful
    match_status = "PASS" if decoded.strip() == sentence.strip() else "FAIL"
    if match_status == "FAIL":
        all_tests_passed = False

    print(f"\nTest {i}:")
    print(f"  Original:     {sentence}")
    print(f"  Characters:   {char_count}")
    print(f"  Token IDs:    {tokens[:8]}{'...' if len(tokens) > 8 else ''}")
    print(f"  Token count:  {token_count}")
    print(f"  Efficiency:   {efficiency_ratio:.2f} tokens/char (lower is better)")
    print(f"  Decoded:      {decoded}")
    print(f"  Round-trip:   {match_status}")

print("\n" + "="*60)

# Configure tokenizer for training
print("\nTOKENIZER CONFIGURATION FOR TRAINING")
print("-"*40)

# Set pad token if not already set (required for batch training)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Pad token was not set. Using EOS token as pad token.")
else:
    print(f"Pad token is set to: '{tokenizer.pad_token}'")

# Verify chat template exists
if hasattr(tokenizer, 'chat_template') and tokenizer.chat_template:
    print("Chat template: Available")
else:
    print("Chat template: Not available (will use default formatting)")

# Summary
print("\n" + "="*60)
print("TOKENIZER VERIFICATION SUMMARY")
print("="*60)

if all_tests_passed:
    print("\nAll Bangla encoding tests PASSED.")
    print("The tokenizer correctly handles Bangla text.")
    print("\nYou may proceed to Phase 3: Data Architecture.")
else:
    print("\nSome tests FAILED. Check the decoded output above.")
    print("The model may still work but could have issues with certain characters.")

Loading DeepSeek tokenizer...



tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

TOKENIZER INFORMATION
----------------------------------------
Tokenizer type: LlamaTokenizerFast
Vocabulary size: 151,665 tokens
Model max length: 16,384 tokens
Padding side: left

SPECIAL TOKENS
----------------------------------------
  BOS (Beginning of Sequence): '<｜begin▁of▁sentence｜>' (ID: 151646)
  EOS (End of Sequence): '<｜end▁of▁sentence｜>' (ID: 151643)
  PAD (Padding): '<｜end▁of▁sentence｜>' (ID: 151643)
  UNK (Unknown): Not set

BANGLA ENCODING TEST
----------------------------------------

Test 1:
  Original:     বাংলাদেশের বিজ্ঞাপন শিল্প অনেক উন্নত।
  Characters:   37
  Token IDs:    [146026, 49128, 224, 146227, 49128, 99, 58908, 148125]...
  Token count:  37
  Efficiency:   1.00 tokens/char (lower is better)
  Decoded:      বাংলাদেশের বিজ্ঞাপন শিল্প অনেক উন্নত।
  Round-trip:   PASS

Test 2:
  Original:     এটি একটি পেইন্টের বিজ্ঞাপন।
  Characters:   27
  Token IDs:    [149525, 147338, 61356, 35178, 237, 146775, 147338, 61356]...
  Token count:  27
  Efficiency:   1.00 tok

## Phase 3: Data Architecture and Pre-processing



### Step 3.1: Loading the Advertisement Script Dataset

### What We Are Doing

In this step, we load the Excel file containing our advertisement scripts into memory. This dataset is the core of our fine-tuning process. The model will learn from these examples to generate new scripts in the same style.

### Dataset Overview

The dataset contains:

| Attribute | Value |
|-----------|-------|
| Total Scripts | 102 rows |
| Real Agency Scripts | 17 (professional quality) |
| Augmented Scripts | 85 (AI-generated for training volume) |
| Format | Excel (.xlsx) |

### Key Columns in the Dataset

| Column Name | Purpose |
|-------------|---------|
| `agency_masked_id` | Anonymized identifier for the source agency |
| `tone_1`, `tone_2` | The emotional tone of the advertisement (for example, "emotional", "humorous") |
| `type` | The format of the ad (for example, "TVC", "OVC") |
| `industry` | The business sector (for example, "FMCG", "Real Estate") |
| `product` | The specific product being advertised |
| `duration` | The target length of the ad in seconds |
| `system_prompt` | Instructions that tell the model what role to play |
| `prompt_1`, `prompt_2`, `prompt_3` | User prompts that request specific scripts |
| `script` | The actual advertisement script (the target output) |

### Why We Explore the Data First

Before training, we must understand:
1. **Data quality**: Are there missing values or formatting issues?
2. **Text length distribution**: How long are the scripts? This affects our tokenization settings.
3. **Category distribution**: Are certain industries or tones overrepresented?

This exploration helps us make informed decisions about data preprocessing and training configuration.

In [None]:
'''# Step 3.1 Part A: Upload the Excel file to Google Colab
# This cell creates an upload widget. Click it and select your file.

from google.colab import files
import os

print("DATASET UPLOAD")
print("="*60)
print("Please upload your 'Ad Script Dataset.xlsx' file.")
print("Click the 'Choose Files' button that appears below.\n")

# Create upload widget
uploaded = files.upload()

# Get the filename of the uploaded file
if uploaded:
    uploaded_filename = list(uploaded.keys())[0]
    print(f"\nFile uploaded successfully: {uploaded_filename}")
    print(f"File size: {len(uploaded[uploaded_filename]) / 1024:.2f} KB")
else:
    print("No file was uploaded. Please run this cell again.")

DATASET UPLOAD
Please upload your 'Ad Script Dataset.xlsx' file.
Click the 'Choose Files' button that appears below.



Saving Ad Script Dataset.xlsx to Ad Script Dataset.xlsx

File uploaded successfully: Ad Script Dataset.xlsx
File size: 231.86 KB


In [None]:
'''# Step 3.1 Part B: Load the dataset and perform exploratory analysis

import pandas as pd
import numpy as np

# Load the Excel file
# Adjust the filename if yours (future user's) is different
DATASET_FILE = "Ad Script Dataset.xlsx"

print("LOADING DATASET")
print("="*60)

try:
    df = pd.read_excel(DATASET_FILE)
    print(f"Dataset loaded successfully from: {DATASET_FILE}")
except FileNotFoundError:
    # Try to find the file with a slightly different name
    import glob
    excel_files = glob.glob("*.xlsx")
    if excel_files:
        DATASET_FILE = excel_files[0]
        df = pd.read_excel(DATASET_FILE)
        print(f"Dataset loaded from: {DATASET_FILE}")
    else:
        raise FileNotFoundError("No Excel file found. Please upload the dataset first.")

print(f"Total rows: {len(df)}")
print(f"Total columns: {len(df.columns)}")
print()

# Display column information
print("COLUMN DETAILS")
print("-"*40)
for col in df.columns:
    non_null = df[col].notna().sum()
    dtype = df[col].dtype
    print(f"  {col}: {non_null}/{len(df)} non-null, type: {dtype}")
print()

# Display basic statistics
print("DATA QUALITY CHECK")
print("-"*40)

# Check for missing values in critical columns
critical_columns = ['system_prompt', 'prompt_1', 'script']
for col in critical_columns:
    if col in df.columns:
        missing = df[col].isna().sum()
        print(f"  {col}: {missing} missing values")
print()

# Analyze script lengths
if 'script' in df.columns:
    print("SCRIPT LENGTH ANALYSIS")
    print("-"*40)
    df['script_length'] = df['script'].astype(str).apply(len)
    print(f"  Minimum length: {df['script_length'].min()} characters")
    print(f"  Maximum length: {df['script_length'].max()} characters")
    print(f"  Average length: {df['script_length'].mean():.0f} characters")
    print(f"  Median length:  {df['script_length'].median():.0f} characters")
    print()

# Analyze categories if available
print("CATEGORY DISTRIBUTION")
print("-"*40)

categorical_columns = ['tone_1', 'industry', 'type']
for col in categorical_columns:
    if col in df.columns:
        print(f"\n  {col.upper()}:")
        value_counts = df[col].value_counts()
        for value, count in value_counts.head(5).items():
            print(f"    - {value}: {count} scripts")

print("\n" + "="*60)

# Display sample rows
print("\nSAMPLE DATA (First 2 rows)")
print("="*60)
print(df[['industry', 'product', 'tone_1', 'duration']].head(2).to_string())

print("\n" + "="*60)
print("\nDataset loaded and analyzed successfully.")
print("Proceed to Step 3.2 to format the data for training.")

LOADING DATASET
Dataset loaded successfully from: Ad Script Dataset.xlsx
Total rows: 102
Total columns: 14

COLUMN DETAILS
----------------------------------------
  agency_masked_id: 102/102 non-null, type: object
  tone_1: 102/102 non-null, type: object
  tone_2: 102/102 non-null, type: object
  type: 102/102 non-null, type: object
  industry: 102/102 non-null, type: object
  product: 102/102 non-null, type: object
  duration: 102/102 non-null, type: int64
  system_prompt: 102/102 non-null, type: object
  prompt_1: 102/102 non-null, type: object
  prompt_2: 102/102 non-null, type: object
  prompt_3: 102/102 non-null, type: object
  script: 102/102 non-null, type: object
  Unnamed: 12: 0/102 non-null, type: float64
  Unnamed: 13: 1/102 non-null, type: object

DATA QUALITY CHECK
----------------------------------------
  system_prompt: 0 missing values
  prompt_1: 0 missing values
  script: 0 missing values

SCRIPT LENGTH ANALYSIS
----------------------------------------
  Minimum leng

### Step 3.2: Formatting the Chat Template

### What We Are Doing

In this step, we convert our tabular dataset into a format that the language model can learn from. Language models learn through examples of conversations, so we need to structure our data as a series of "user asks, assistant responds" exchanges.

### The Conversation Structure

For each row in our dataset, we will create a training example with this structure: <br>

> [SYSTEM MESSAGE] You are LekhAI, a professional Bangla advertisement script writer... (content from system_prompt column) <br>
> [USER MESSAGE] (content from prompt_1 column - the request for a script)<br>
>[ASSISTANT MESSAGE] (content from script column - the actual advertisement script)


<br>
### Why This Format Matters

The model learns by predicting what comes next. When it sees the pattern:
1. System instruction sets the context
2. User makes a request
3. Assistant provides the script

It learns to generate appropriate scripts when given similar system instructions and user requests.

### DeepSeek Chat Template

DeepSeek uses a specific format with special tokens:
> <|begin▁of▁sentence|><|User|>message<|Assistant|>response<|end▁of▁sentence|>


We will use the tokenizer's built-in `apply_chat_template` function to handle this formatting automatically, ensuring compatibility with DeepSeek's expected input format.
<br><br>

### Handling Multiple Prompts

Our dataset has three prompt columns (prompt_1, prompt_2, prompt_3). For this training run, we will use prompt_1 as it appears to be the primary prompt. This creates 102 training examples. In future iterations, we will expand the dataset by also training on prompt_2 and prompt_3 variations.

### Key Concept: Supervised Fine-Tuning (SFT)

This process is called Supervised Fine-Tuning because:
- **Supervised**: We have labeled examples (prompt → script pairs)
- **Fine-Tuning**: We are adjusting a pre-trained model rather than training from scratch

The model already knows how to generate text. We are teaching it the specific style and structure of Bangla advertisements.




In [None]:
'''# Step 3.2: Format the dataset for training
# We convert each row into a conversation format that DeepSeek can learn from.

import pandas as pd
from datasets import Dataset

# Reload the dataframe if needed
DATASET_FILE = "Ad Script Dataset.xlsx"
df = pd.read_excel(DATASET_FILE)

print("FORMATTING DATASET FOR TRAINING")
print("="*60)

# Remove empty columns
df = df.drop(columns=['Unnamed: 12', 'Unnamed: 13'], errors='ignore')
print(f"Columns after cleanup: {list(df.columns)}")
print()

# Create the conversation format
def create_conversation(row):
    """
    Convert a single row into the conversation format expected by the model.

    Structure:
    - System message: Sets the context and role
    - User message: The prompt requesting a script
    - Assistant message: The actual script (what the model should learn to generate)
    """

    # Build the system message with context
    system_message = row['system_prompt']

    # User message is the prompt
    user_message = row['prompt_1']

    # Assistant response is the script
    assistant_message = row['script']

    # Return as a list of message dictionaries (standard chat format)
    conversation = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_message}
    ]

    return conversation

# Apply the formatting to each row
print("Converting rows to conversation format...")
df['conversations'] = df.apply(create_conversation, axis=1)

# Display a sample conversation
print("\nSAMPLE CONVERSATION (Row 0)")
print("-"*40)
sample = df['conversations'].iloc[0]
for msg in sample:
    role = msg['role'].upper()
    content = msg['content'][:200] + "..." if len(msg['content']) > 200 else msg['content']
    print(f"\n[{role}]")
    print(content)

print("\n" + "-"*40)

# Convert to Hugging Face Dataset format
print("\nConverting to Hugging Face Dataset format...")

# Create a list of all conversations
conversations_list = df['conversations'].tolist()

# Create the dataset
dataset = Dataset.from_dict({
    "conversations": conversations_list,
    "industry": df['industry'].tolist(),
    "tone": df['tone_1'].tolist(),
    "duration": df['duration'].tolist()
})

print(f"\nDataset created successfully!")
print(f"  Number of examples: {len(dataset)}")
print(f"  Features: {list(dataset.features.keys())}")

# Display dataset info
print("\nDATASET PREVIEW")
print("-"*40)
print(dataset)

print("\n" + "="*60)
print("\nDataset is ready for tokenization.")
print("Proceed to Step 3.3 to tokenize the conversations.")

FORMATTING DATASET FOR TRAINING
Columns after cleanup: ['agency_masked_id', 'tone_1', 'tone_2', 'type', 'industry', 'product', 'duration', 'system_prompt', 'prompt_1', 'prompt_2', 'prompt_3', 'script']

Converting rows to conversation format...

SAMPLE CONVERSATION (Row 0)
----------------------------------------

[SYSTEM]
You are LekhAI, a specialized AI assistant for X Integrated marketing agency. You generate high-conversion Bengali ad scripts with professional formatting.

[USER]
I need you to write a Bengali TVC script for Summer Dose Orange Lolly Ice Cream. Here's exactly what I need:
Product: Summer Dose Orange Lolly Ice Cream
Target Audience: 18-30 year old Bangladeshis
Du...

[ASSISTANT]
## গল্পঃ গ্যাঞ্জাম

গরমটা অসহনীয়। এই গরমের মধ্যেও প্রিন্ট করা ছবি, মোবাইলে থাকা ছবি দেখিয়ে কিছু ৪-৫ জন মিলে এক ছেলেকে খুঁজছে।  
- চায়ের দোকান, বাজার, বাসার নিচের গ্যারেজ সব জায়গায় খোঁজা হচ্ছে  
- পথচ...

----------------------------------------

Converting to Hugging Face Dataset format...

D

### Step 3.3: Data Augmentation and Tokenization

### What We Are Doing

In this step, we are performing a technique called **Data Augmentation**. Instead of just using the first prompt (`prompt_1`) for each script, we are creating three separate training examples for every single row in our dataset using `prompt_1`, `prompt_2`, and `prompt_3`.

### Why This Matters

1. **Triples the Dataset**: We effectively move from 102 examples to 306 examples without collecting any new data.
2. **Robustness**: The model learns that different ways of phrasing a request (industry, tone, product details) should still result in a professional script.
3. **Generalization**: It prevents the model from "overfitting" (memorizing) just one specific prompt structure.

### Technical Process

1. **Expansion**: We iterate through each row and create three distinct "Conversation" objects.
2. **Chat Templating**: We wrap these in the DeepSeek/Qwen2 chat template.
3. **Tokenization**: We convert the text into numerical IDs.
4. **Length Analysis**: We check the "token count" to ensure our scripts fit within the model's memory limits (2,048 tokens).

### Key Concept: Input vs. Output (Labels)

During this process, the `system_prompt` and `user_prompt` act as the "Instructions", and the `script` acts as the "Ground Truth." The model is trained to minimize the difference between its guess and our agency-grade scripts.

In [None]:
'''# Step 3.3: Advanced Data Augmentation and Tokenization
# This version creates 306 training examples from your 102 rows of data.

from transformers import AutoTokenizer
from datasets import Dataset
import pandas as pd
import numpy as np

# Configuration
DATASET_FILE = "Ad Script Dataset.xlsx"
BASE_MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
MAX_SEQ_LENGTH = 2048

print("INITIALIZING DATA PIPELINE")
print("="*60)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the Excel file
df = pd.read_excel(DATASET_FILE)
df = df.drop(columns=['Unnamed: 12', 'Unnamed: 13'], errors='ignore')

print(f"Original rows: {len(df)}")

# --- DATA AUGMENTATION LOGIC ---
all_conversations = []

print("Performing Data Augmentation (Expanding 1 -> 3 prompts per script)...")

for _, row in df.iterrows():
    # We create 3 separate examples for every 1 script
    prompts = [row['prompt_1'], row['prompt_2'], row['prompt_3']]

    for p in prompts:
        # Check if the prompt is valid (not empty)
        if pd.isna(p) or str(p).strip() == "":
            continue

        conversation = [
            {"role": "system", "content": row['system_prompt']},
            {"role": "user", "content": str(p)},
            {"role": "assistant", "content": row['script']}
        ]
        all_conversations.append(conversation)

print(f"Total Augmented Examples: {len(all_conversations)}")
print("-" * 40)

# Create Hugging Face Dataset from the augmented list
augmented_dataset = Dataset.from_dict({"conversations": all_conversations})

# --- TOKENIZATION & TEMPLATING ---

def format_and_analyze(example):
    # Apply the DeepSeek/Qwen2 Chat Template
    full_text = tokenizer.apply_chat_template(
        example['conversations'],
        tokenize=False,
        add_generation_prompt=False
    )

    # Calculate token length for our analysis
    tokens = tokenizer.encode(full_text)

    return {
        "text": full_text,
        "token_length": len(tokens)
    }

print("Applying Chat Template and calculating token lengths...")
final_dataset = augmented_dataset.map(format_and_analyze, remove_columns=["conversations"])

# --- FINAL ANALYSIS ---

lengths = final_dataset['token_length']
print("\nTOKEN LENGTH STATISTICS")
print("-" * 40)
print(f"Mean Length:   {int(np.mean(lengths))} tokens")
print(f"Max Length:    {max(lengths)} tokens")
print(f"95th Percentile: {int(np.percentile(lengths, 95))} tokens")

exceeds = sum(1 for l in lengths if l > MAX_SEQ_LENGTH)
print(f"Examples exceeding {MAX_SEQ_LENGTH} limit: {exceeds} / {len(final_dataset)}")

print("\nSAMPLE AUGMENTED ENTRY (Instruction snippet):")
print("-" * 40)
print(final_dataset[0]['text'][:400] + "...")

print("\n" + "="*60)
print("Phase 3 Complete: We now have a robust, augmented dataset ready for training!")

INITIALIZING DATA PIPELINE
Original rows: 102
Performing Data Augmentation (Expanding 1 -> 3 prompts per script)...
Total Augmented Examples: 306
----------------------------------------
Applying Chat Template and calculating token lengths...


Map:   0%|          | 0/306 [00:00<?, ? examples/s]


TOKEN LENGTH STATISTICS
----------------------------------------
Mean Length:   1487 tokens
Max Length:    8191 tokens
95th Percentile: 3562 tokens
Examples exceeding 2048 limit: 38 / 306

SAMPLE AUGMENTED ENTRY (Instruction snippet):
----------------------------------------
<｜begin▁of▁sentence｜>You are LekhAI, a specialized AI assistant for X Integrated marketing agency. You generate high-conversion Bengali ad scripts with professional formatting.<｜User｜>I need you to write a Bengali TVC script for Summer Dose Orange Lolly Ice Cream. Here's exactly what I need:
Product: Summer Dose Orange Lolly Ice Cream
Target Audience: 18-30 year old Bangladeshis
Duration: 60 secon...

Phase 3 Complete: We now have a robust, augmented dataset ready for training!


In the token length statistics, we can see that the examples exceeding 2048 limit are 38 out of 306. So about 12% of our examples are too long for our current setting.

**What happens to those 38 examples during training?** <br>
They will be truncated (cut off) at the 2,048 token mark. The model will only see the first ~70% of those scripts and will not learn how to write their endings properly.

DeepSeek-R1-Distill-Qwen-7B supports up to 131,072 tokens in its architecture, but memory is the real constraint. On Google Colab's free T4 GPU (16 GB VRAM), we can safely handle 4,096 tokens if we:


*   Use 4-bit quantization (which we are already planning)
*   Use gradient checkpointing (saves memory during training)
* Keep batch size small (1 or 2)

The 95th percentile of 3,562 tokens fits within 4,096, meaning only a handful of extreme outliers (~10 scripts) will still be truncated.

In [None]:
'''# Update the maximum sequence length
MAX_SEQ_LENGTH = 4096

# Re-calculate how many examples now exceed the limit
exceeds = sum(1 for l in final_dataset['token_length'] if l > MAX_SEQ_LENGTH)
print(f"Examples exceeding {MAX_SEQ_LENGTH} limit: {exceeds} / {len(final_dataset)}")

Examples exceeding 4096 limit: 10 / 306


## Phase 4: Base Model Loading (The 4-bit Foundation)

### Step 4.1: Loading the Model with Unsloth in 4-bit Quantization

### What We Are Doing

In this step, we load the DeepSeek-R1-Distill-Qwen-7B model into GPU memory using the Unsloth library. We use a technique called **4-bit quantization** to compress the model so it fits within the limited memory of Google Colab's free GPU.

### Understanding Model Size and Memory

| Precision | Bits per Parameter | 7B Model Size | Fits in 16GB VRAM? |
|-----------|-------------------|---------------|-------------------|
| Full Precision (FP32) | 32 bits | ~28 GB | No |
| Half Precision (FP16) | 16 bits | ~14 GB | Barely |
| 8-bit Quantization | 8 bits | ~7 GB | Yes |
| **4-bit Quantization** | 4 bits | **~3.5 GB** | **Yes, with room to spare** |

By using 4-bit quantization, we reduce the model's memory footprint from 28 GB to approximately 3.5 GB, leaving plenty of room for training operations.

### What Is Quantization?

Quantization is the process of representing numbers with fewer bits. We can think of it like rounding:

- **Full precision**: 3.141592653589793 (very accurate, uses lots of memory)
- **4-bit**: 3.14 (less accurate, but uses 8 times less memory)

Modern quantization techniques are clever enough to preserve model quality despite the reduced precision. Research has shown that 4-bit quantized models perform nearly identically to full-precision models on most tasks.

### Why Unsloth?

Unsloth is a specialized library that makes fine-tuning large language models accessible on consumer hardware. Key benefits:

| Feature | Benefit |
|---------|---------|
| Memory Efficiency | Uses up to 70% less VRAM than standard implementations |
| Speed | Training is 2-5 times faster due to optimized kernels |
| Ease of Use | Simple API that wraps complex configurations |
| Compatibility | Works with popular models including Qwen2 (which DeepSeek uses) |

### What Happens During Loading

1. **Download**: Model weights are downloaded from Hugging Face (if not cached).
2. **Quantization**: Weights are compressed to 4-bit format on-the-fly.
3. **GPU Transfer**: The compressed model is loaded onto the GPU.
4. **Verification**: We confirm the model is ready for training.

### Expected Output

After this cell runs, we will see:
- GPU memory usage before and after loading
- Confirmation that the model architecture is Qwen2 (as expected for DeepSeek-R1)
- Model statistics including parameter count

In [None]:
'''# Step 4.1: Load DeepSeek Model with Unsloth in 4-bit Quantization
# This enables training on Google Colab's free GPU.

from unsloth import FastLanguageModel
import torch

# Configuration
BASE_MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
MAX_SEQ_LENGTH = 4096  # Updated from our analysis

print("PHASE 4: BASE MODEL LOADING")
print("="*60)

# Check GPU memory before loading
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    free_memory = torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated()
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Available VRAM before loading: {free_memory / 1024**3:.2f} GB")
else:
    print("WARNING: No GPU detected!")
print()

print("Loading model with 4-bit quantization...")
print("This may take 2-5 minutes on first run (downloading weights).")
print("-"*40)

# Load the model using Unsloth's optimized loader
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=BASE_MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect: will use float16 or bfloat16 based on GPU
    load_in_4bit=True,  # Enable 4-bit quantization
    trust_remote_code=True,  # Required for Qwen2 architecture
)

print("\nMODEL LOADED SUCCESSFULLY")
print("-"*40)

# Display model information
print(f"Model Type: {model.config.model_type}")
print(f"Hidden Size: {model.config.hidden_size}")
print(f"Number of Layers: {model.config.num_hidden_layers}")
print(f"Number of Attention Heads: {model.config.num_attention_heads}")
print(f"Vocabulary Size: {model.config.vocab_size:,}")
print(f"Max Sequence Length: {MAX_SEQ_LENGTH}")
print()

# Check GPU memory after loading
if torch.cuda.is_available():
    used_memory = torch.cuda.memory_allocated() / 1024**3
    reserved_memory = torch.cuda.memory_reserved() / 1024**3
    print("GPU MEMORY USAGE")
    print("-"*40)
    print(f"Allocated: {used_memory:.2f} GB")
    print(f"Reserved:  {reserved_memory:.2f} GB")

# Configure tokenizer for training
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print()
print("="*60)
print("\nModel is loaded and ready for LoRA configuration.")
print("Proceed to Step 4.2 to set up Parameter-Efficient Fine-Tuning.")

PHASE 4: BASE MODEL LOADING
GPU: Tesla T4
Available VRAM before loading: 14.74 GB

Loading model with 4-bit quantization...
This may take 2-5 minutes on first run (downloading weights).
----------------------------------------
Are you certain you want to do remote code execution?
==((====))==  Unsloth 2026.1.4: Fast Qwen2 patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.52G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]


MODEL LOADED SUCCESSFULLY
----------------------------------------
Model Type: qwen2
Hidden Size: 3584
Number of Layers: 28
Number of Attention Heads: 28
Vocabulary Size: 152,064
Max Sequence Length: 4096

GPU MEMORY USAGE
----------------------------------------
Allocated: 7.95 GB
Reserved:  8.02 GB


Model is loaded and ready for LoRA configuration.
Proceed to Step 4.2 to set up Parameter-Efficient Fine-Tuning.


### Step 4.2: LoRA Configuration (Parameter-Efficient Fine-Tuning)

### What We Are Doing

In this step, we configure **LoRA (Low-Rank Adaptation)**, a technique that allows us to fine-tune a massive 7-billion-parameter model by only training a tiny fraction of its weights. This is what makes fine-tuning possible on limited hardware.

### The Problem with Full Fine-Tuning

If we tried to train all 7 billion parameters:
- We would need to store gradients for every parameter (requires ~28 GB additional memory)
- Training would be extremely slow (days instead of hours)
- We risk "catastrophic forgetting" (the model forgets its pre-trained knowledge)

### How LoRA Solves This

LoRA works by "freezing" the original model weights and instead training small "adapter" matrices that modify the model's behavior. Think of it like this:

| Analogy | Original Model | LoRA Adapters |
|---------|---------------|---------------|
| A skilled chef | Knows how to cook | Learns your family's secret recipes |
| A musician | Knows music theory | Learns to play your favorite songs |
| DeepSeek | Knows language | Learns to write Bangla ad scripts |

The original knowledge stays intact. We only add new specialized skills on top.

### Technical Details: Rank and Alpha

| Parameter | What It Controls | Our Setting | Reasoning |
|-----------|-----------------|-------------|-----------|
| **r (rank)** | Size of the adapter matrices. Higher = more capacity, more memory. | 16 | Good balance for creative writing tasks |
| **lora_alpha** | Scaling factor for LoRA weights. Usually set to 2x the rank. | 32 | Standard practice: alpha = 2 * r |
| **lora_dropout** | Regularization to prevent overfitting. | 0.05 | Light dropout since we have limited data |
| **target_modules** | Which layers of the model to adapt. | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | All attention and feed-forward layers |

### What Are Target Modules?

A transformer model has multiple types of layers:

| Module | Full Name | Function |
|--------|-----------|----------|
| q_proj | Query Projection | Determines "what to look for" in the input |
| k_proj | Key Projection | Determines "what information is available" |
| v_proj | Value Projection | Holds the actual information to retrieve |
| o_proj | Output Projection | Combines attention results |
| gate_proj, up_proj, down_proj | Feed-Forward Network | Processes information after attention |

By targeting all of these, we allow the model to adapt its understanding (attention) and its processing (feed-forward) to the advertising domain.

### Trainable Parameters

After applying LoRA, we will see that only about 0.5-2% of the model's parameters are trainable. The rest remain frozen, preserving the model's general language abilities while we teach it advertising-specific patterns.

In [None]:
'''# Step 4.2: Configure LoRA Adapters for Parameter-Efficient Fine-Tuning
# This enables training only a small fraction of the model's parameters.

from unsloth import FastLanguageModel

print("CONFIGURING LoRA ADAPTERS")
print("="*60)

# Apply LoRA adapters to the model
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Rank of the LoRA matrices (higher = more capacity)
    target_modules=[
        "q_proj",      # Query projection (attention)
        "k_proj",      # Key projection (attention)
        "v_proj",      # Value projection (attention)
        "o_proj",      # Output projection (attention)
        "gate_proj",   # Feed-forward gate
        "up_proj",     # Feed-forward up-projection
        "down_proj",   # Feed-forward down-projection
    ],
    lora_alpha=32,      # Scaling factor (typically 2x rank)
    lora_dropout=0.05,  # Light regularization
    bias="none",        # Do not train bias terms (saves memory)
    use_gradient_checkpointing="unsloth",  # Saves memory during backpropagation
    random_state=42,    # For reproducibility
    use_rslora=False,   # Standard LoRA (not Rank-Stabilized)
    loftq_config=None,  # No LoftQ initialization
)

print("\nLoRA CONFIGURATION SUMMARY")
print("-"*40)

# Calculate trainable parameters
def count_parameters(model):
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

trainable_params, total_params = count_parameters(model)
trainable_percent = (trainable_params / total_params) * 100

print(f"Total Parameters:     {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")
print(f"Trainable Percentage: {trainable_percent:.2f}%")
print()

# Display LoRA settings
print("LoRA SETTINGS")
print("-"*40)
print(f"Rank (r):            16")
print(f"Alpha:               32")
print(f"Dropout:             0.05")
print(f"Target Modules:      {len(['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'])} layers")
print(f"Gradient Checkpointing: Enabled (Unsloth optimized)")
print()

# Check memory after LoRA setup
import torch
if torch.cuda.is_available():
    used_memory = torch.cuda.memory_allocated() / 1024**3
    print("GPU MEMORY AFTER LoRA")
    print("-"*40)
    print(f"Allocated: {used_memory:.2f} GB")

print()
print("="*60)
print("\nLoRA adapters configured successfully.")
print("The model is now ready for training.")
print("Proceed to Phase 5 for the pre-training evaluation (baseline test).")

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.


CONFIGURING LoRA ADAPTERS


Unsloth 2026.1.4 patched 28 layers with 0 QKV layers, 0 O layers and 0 MLP layers.



LoRA CONFIGURATION SUMMARY
----------------------------------------
Total Parameters:     5,383,329,280
Trainable Parameters: 40,370,176
Trainable Percentage: 0.75%

LoRA SETTINGS
----------------------------------------
Rank (r):            16
Alpha:               32
Dropout:             0.05
Target Modules:      7 layers
Gradient Checkpointing: Enabled (Unsloth optimized)

GPU MEMORY AFTER LoRA
----------------------------------------
Allocated: 8.10 GB


LoRA adapters configured successfully.
The model is now ready for training.
Proceed to Phase 5 for the pre-training evaluation (baseline test).


## Phase 5: Pre-Training Evaluation



### Step 5.1: Creating the Inference Function

### What We Are Doing

Before we train the model, we need to establish a **baseline**. We will ask the model to generate a Bangla advertisement script right now, before any fine-tuning. This allows us to:

1. **Measure improvement**: After training, we can compare outputs to see how much the model learned.
2. **Verify the model works**: Ensure the model can generate Bangla text at all.
3. **Document for viewer**: Show a clear "before and after" comparison in our notebook.

### How Text Generation Works

Language models generate text one token at a time. At each step:

1. The model looks at all previous tokens.
2. It calculates a probability distribution over the entire vocabulary (152,064 possible next tokens).
3. It selects the next token based on sampling parameters.
4. This token is added to the sequence, and the process repeats.

### Key Generation Parameters

| Parameter | What It Controls | Our Setting | Effect |
|-----------|-----------------|-------------|--------|
| **max_new_tokens** | Maximum tokens to generate | 2048 | Caps output length to prevent runaway generation |
| **temperature** | Randomness of predictions | 0.7 | Lower = more deterministic, Higher = more creative |
| **top_p** | Nucleus sampling threshold | 0.9 | Only consider tokens in the top 90% probability mass |
| **repetition_penalty** | Discourages repeating phrases | 1.1 | Slightly penalizes recently used tokens |

### The Inference Pipeline

1. **Format the prompt**: Apply the chat template so the model understands the instruction format.
2. **Tokenize**: Convert text to token IDs.
3. **Generate**: Run the model to produce new tokens.
4. **Decode**: Convert token IDs back to readable text.
5. **Extract response**: Parse out just the assistant's reply.

### What to Expect from the Baseline

Since the model has not been trained on our dataset yet, expect:
- Generic advertising language (not specific to Bangla ad industry conventions)
- Possibly mixed languages (English terms mixed with Bangla)
- Missing the specific format our dataset uses (Visual | Audio table structure)
- Lack of cultural nuance specific to Bangladesh

In [None]:
'''# Step 5.1: Create the inference function for generating ad scripts
# This function will be used for both baseline testing and post-training evaluation.

from unsloth import FastLanguageModel
import torch

# Enable inference mode for faster generation
FastLanguageModel.for_inference(model)

def generate_ad_script(
    system_prompt: str,
    user_prompt: str,
    max_new_tokens: int = 2056,
    temperature: float = 0.7,
    top_p: float = 0.9,
    repetition_penalty: float = 1.1,
    show_full_output: bool = False
):
    """
    Generate a Bangla advertisement script using the model.
    """

    # Create the conversation format
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    # Apply the chat template
    formatted_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    if show_full_output:
        print("FORMATTED INPUT:")
        print("-"*40)
        print(formatted_prompt)
        print("-"*40)

    # Tokenize the input
    inputs = tokenizer(
        formatted_prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=4096
    ).to(model.device)

    # Generate the response
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Decode the full output
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Extract only the assistant's response
    # We split by the assistant token to get just the generated part
    if "<|assistant|>" in full_response:
        assistant_response = full_response.split("<|assistant|>")[-1]
    else:
        assistant_response = full_response

    # Clean up trailing tokens manually to avoid syntax errors
    assistant_response = assistant_response.replace("<|end_of_sentence|>", "").strip()
    assistant_response = assistant_response.replace("</s>", "").strip()

    return assistant_response

print("Inference function created successfully.")

Inference function created successfully.


### Step 5.2: Baseline Test (Before Fine-Tuning)

### What We Are Doing

We are now going to ask the model to generate a Bangla advertisement script **before** any fine-tuning. This establishes a "baseline" so we can measure improvement after training.

### Why This Matters

1. **Scientific Method**: To claim improvement, we must have a "before" measurement.
2. **Presentation**: We can show a clear comparison of outputs.
3. **Debugging**: If the baseline is completely broken, we know something is wrong before investing training time.

### What to Observe in the Baseline Output

| Aspect | Expected Baseline Behavior | Expected Post-Training Behavior |
|--------|---------------------------|--------------------------------|
| Language | Mixed English/Bangla, possibly more English | Primarily Bangla with industry-appropriate terms |
| Format | Unstructured paragraph or generic format | Visual/Audio table format matching our dataset |
| Tone | Generic marketing language | Matches the requested tone (Humorous, Warm, etc.) |
| Cultural Context | Generic global advertising style | Bangladesh-specific cultural references |
| Length | May be too short or too long | Appropriate for the requested duration |

### The Test Prompt

We will use a prompt similar to what exists in our dataset. This allows direct comparison with our real agency scripts.

In [None]:
'''# Step 5.2: Run the Baseline Test (Before Fine-Tuning)
# This tests the model's current ability to generate Bangla ad scripts.

print("PHASE 5.2: BASELINE TEST (BEFORE TRAINING)")
print("="*60)
print("Testing the model's current ability to generate Bangla ad scripts.")
print("Remember: The model has NOT been trained on our dataset yet.\n")

# Define a test prompt similar to your dataset
test_system_prompt = """You are LekhAI, a professional Bangla advertisement script writer.
You specialize in creating compelling TV commercial (TVC) and online video commercial (OVC) scripts
for the Bangladesh market. Your scripts should be culturally relevant, emotionally engaging,
and formatted with Visual and Audio columns."""

test_user_prompt = """Write a 45-second TVC scriptin Bangla language for a paint company called "Berger Paints".
Industry: Real Estate & Construction
Tone: Warm & Nostalgic
The ad should evoke feelings of home, family, and memories associated with colorful walls. It should feature colloquial, but wholesome dialogue and a CTA."""

print("TEST PROMPT")
print("-"*40)
print(f"Industry: Real Estate & Construction")
print(f"Product: Berger Paints")
print(f"Tone: Warm & Nostalgic")
print(f"Duration: 45 seconds")
print("-"*40)

print("\nGenerating baseline response...")
print("(This may take 30-60 seconds)\n")

# Generate the baseline response
baseline_response = generate_ad_script(
    system_prompt=test_system_prompt,
    user_prompt=test_user_prompt,
    max_new_tokens=2056,
    temperature=0.7
)

print("="*60)
print("BASELINE OUTPUT (BEFORE TRAINING)")
print("="*60)
print(baseline_response)
print("="*60)

# Save the baseline for later comparison
baseline_output_saved = baseline_response

print("\n[Baseline saved for post-training comparison]")
print("Proceed to Phase 6 for training.")

PHASE 5.2: BASELINE TEST (BEFORE TRAINING)
Testing the model's current ability to generate Bangla ad scripts.
Remember: The model has NOT been trained on our dataset yet.

TEST PROMPT
----------------------------------------
Industry: Real Estate & Construction
Product: Berger Paints
Tone: Warm & Nostalgic
Duration: 45 seconds
----------------------------------------

Generating baseline response...
(This may take 30-60 seconds)

BASELINE OUTPUT (BEFORE TRAINING)
<｜begin▁of▁sentence｜><｜begin▁of▁sentence｜>You are LekhAI, a professional Bangla advertisement script writer. 
You specialize in creating compelling TV commercial (TVC) and online video commercial (OVC) scripts 
for the Bangladesh market. Your scripts should be culturally relevant, emotionally engaging, 
and formatted with Visual and Audio columns.<｜User｜>Write a 45-second TVC scriptin Bangla language for a paint company called "Berger Paints".
Industry: Real Estate & Construction
Tone: Warm & Nostalgic
The ad should evoke feel

# Phase 6: Iterative Fine-Tuning

### Step 6.1 + 6.2 (Master Execution): Memory-Optimized Training Pipeline



### ***Technical Pivot: Model Selection Change***

### Original Plan
Our initial implementation plan targeted **DeepSeek-R1-Distill-Qwen-7B**, a 7-billion parameter reasoning model. Phases 1-5 above demonstrate the complete pipeline for loading and configuring this model.

### Resource Constraint Encountered
During training (Phase 6), we encountered persistent CUDA Out-of-Memory errors on Google Colab's free T4 GPU (15GB VRAM). Despite applying multiple optimizations:
- 4-bit quantization
- LoRA adapters (0.75% trainable parameters)
- Gradient checkpointing
- Reduced batch size and sequence length

The DeepSeek-7B model plus optimizer states exceeded available memory.

### Solution: Model Substitution
We pivoted to **Qwen2.5-1.5B-Instruct**, a 1.5-billion parameter model that:
- Fits comfortably in 15GB VRAM
- Shares the Qwen2 architecture (compatible with our pipeline)
- Maintains multilingual capabilities including Bangla

This is a common real-world scenario where initial model choices must be revised based on actual hardware availability.

### Key Learning
Large Language Model deployment requires careful consideration of the hardware-software stack. A smaller, well-fine-tuned model often outperforms a larger model that cannot be properly trained due to resource constraints.

In [1]:
# ==========================================
# MASTER TRAINING CELL (Fixed Device + Qwen 1.5B)
# ==========================================
import os, sys, gc

print("CLEAN START: Installing dependencies...")
os.system("pip install --upgrade pip")
os.system("pip install unsloth_zoo")
os.system("pip install --no-deps unsloth[colab-new] xformers trl peft accelerate bitsandbytes pandas openpyxl")

# IMPORTANT: Restart CUDA context after installs
import torch
torch.cuda.empty_cache()
gc.collect()

# Verify GPU is available BEFORE importing unsloth
print("\nVERIFYING GPU...")
if not torch.cuda.is_available():
    raise RuntimeError("NO GPU DETECTED! Go to Runtime -> Change runtime type -> Select T4 GPU")

device = torch.device("cuda:0")
print(f"GPU Found: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

# Now import unsloth
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import Dataset
import pandas as pd
from huggingface_hub import login

# Login
print("\nAUTHENTICATION")
login()

# Load Model
print("\nLOADING MODEL (Qwen2.5-1.5B-Instruct)")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Prepare Data
print("\nPREPARING DATA")
if not os.path.exists("Ad Script Dataset.xlsx"):
    from google.colab import files
    print("   Please upload your dataset...")
    uploaded = files.upload()

df = pd.read_excel("Ad Script Dataset.xlsx")
df = df.drop(columns=['Unnamed: 12', 'Unnamed: 13'], errors='ignore')

real_df = df.iloc[:17]
augmented_df = df.iloc[17:]
texts = []

for _ in range(3):
    for _, row in real_df.iterrows():
        for p in [row['prompt_1'], row['prompt_2'], row['prompt_3']]:
            if pd.notna(p):
                texts.append(tokenizer.apply_chat_template([
                    {"role": "system", "content": row['system_prompt']},
                    {"role": "user", "content": str(p)},
                    {"role": "assistant", "content": row['script']}
                ], tokenize=False, add_generation_prompt=False))

for _, row in augmented_df.iterrows():
    for p in [row['prompt_1'], row['prompt_2'], row['prompt_3']]:
        if pd.notna(p):
            texts.append(tokenizer.apply_chat_template([
                {"role": "system", "content": row['system_prompt']},
                {"role": "user", "content": str(p)},
                {"role": "assistant", "content": row['script']}
            ], tokenize=False, add_generation_prompt=False))

dataset = Dataset.from_dict({"text": texts})
print(f"   Training Examples: {len(dataset)}")

# Train
print("\nSTARTING TRAINING (3 Epochs)")
training_args = TrainingArguments(
    output_dir="./lekhAI_checkpoints",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
)

trainer_stats = trainer.train()

print("\nTRAINING COMPLETE!")
print(f"   Final Loss: {trainer_stats.training_loss:.4f}")

CLEAN START: Installing dependencies...

VERIFYING GPU...
GPU Found: Tesla T4
VRAM: 14.74 GB
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!

AUTHENTICATION


VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…


LOADING MODEL (Qwen2.5-1.5B-Instruct)
==((====))==  Unsloth 2026.1.4: Fast Qwen2 patching. Transformers: 4.57.6.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2026.1.4 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.



PREPARING DATA
   Training Examples: 408

STARTING TRAINING (3 Epochs)


Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/408 [00:00<?, ? examples/s]

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 408 | Num Epochs = 3 | Total steps = 153
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 18,464,768 of 1,562,179,072 (1.18% trained)


Step,Training Loss
10,1.8937
20,1.5924
30,1.4471
40,1.4787
50,1.3548
60,1.2812
70,1.1615
80,1.1586
90,1.13
100,1.0483



TRAINING COMPLETE!
   Final Loss: 1.2124


### Step 6.2b: Continuing Training for Additional Epochs

### What We Are Doing

The model has already completed 1 epoch with a final loss of 1.80. To improve quality further, we continue training for 2 more epochs. The model weights are already in memory, so we simply run the trainer again.

### Why This Works

The `model` object retains all the learned weights from the previous training run. When we call `trainer.train()` again, it continues adjusting those weights rather than starting from scratch. This is sometimes called "warm starting" or "incremental training."

### Expected Outcome

- Starting loss should be around 1.80 (where we left off)
- After 2 more epochs, loss should drop to approximately 1.2-1.5
- Total training: 3 epochs (1 initial + 2 continuation)

In [None]:
# Step 6.2b: Continue Training for 2 More Epochs
# This picks up from the current model state

import torch
import gc

# Clear any leftover memory from inference mode
torch.cuda.empty_cache()
gc.collect()

print("CONTINUING TRAINING (1 MORE EPOCH)")
print("="*60)
print("Starting from current model state (Loss: ~1.80)")
print()

# Update training arguments for continuation
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

# Recreate training args with 2 more epochs
continuation_args = TrainingArguments(
    output_dir="./lekhAI_checkpoints",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    warmup_steps=0,  # No warmup needed for continuation
    num_train_epochs=1,  # 1 more epochs
    learning_rate=1e-4,  # Slightly lower learning rate for fine-tuning
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    report_to="none",
    save_strategy="epoch",
)

# Recreate trainer with current model state
# The 'dataset' variable should still be in memory from the master cell
continuation_trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=continuation_args,
)

# Run continued training
print("Starting 1 additional epoch...")
print("-"*40)

continuation_stats = continuation_trainer.train()

print("\n" + "="*60)
print("CONTINUATION TRAINING COMPLETE")
print("="*60)
print(f"Final Loss: {continuation_stats.training_loss:.4f}")
print(f"Total Epochs Trained: 2 (1 initial + 1 continuation)")

# Memory check
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak GPU Memory: {used_memory} GB")

CONTINUING TRAINING (1 MORE EPOCH)
Starting from current model state (Loss: ~1.80)



ModuleNotFoundError: No module named 'trl'