# 🚀 Improved GPT-2 Fine-Tuning

This notebook implements all the improvements identified in our analysis:

| Issue | Previous | Improved |
|-------|----------|----------|
| Learning Rate | 1e-5 to 1e-4 | 1e-6 |
| Samples | 22-200 | 2000+ |
| Templates | 5 different | 1 consistent |
| Epochs | 2-3 | 1 |
| Regularization | None | KL penalty + data mixing |
| LoRA rank | 16 | 4 |

---
## 1. Setup & Configuration

In [1]:
import os
import sys
import json
import random
import math
from pathlib import Path
from typing import Dict, List, Optional
from dataclasses import dataclass

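# SSL workaround: disables HTTPS certificate verification for this process
# (insecure; only use on a trusted network)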
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType

print(f"PyTorch: {torch.__version__}")
print(f"Device: {'CUDA' if torch.cuda.is_available() else 'CPU'}")


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.4.1 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/manthan-kamble/Documents/GitHub/LlmPostTraining/.venv/lib/python3.12/site-packages/ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "/Users/manthan-kamble/Documents/GitHub/LlmPostTraining/.venv/lib/python3.12/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/Users/manthan-kamble/Documents/GitHub/LlmPostTraining/.venv/lib

PyTorch: 2.2.2
Device: CPU


In [2]:
# Load improved configuration
config_path = Path("../configs/improved_gpt2_config.json")
with open(config_path) as f:
    CONFIG = json.load(f)

print("Loaded Improved Configuration:")
print(json.dumps(CONFIG, indent=2))

Loaded Improved Configuration:
{
  "data": {
    "source": "alpaca-cleaned (filtered)",
    "n_samples": 3000,
    "max_length": 256,
    "template": "Question: {instruction}\nAnswer:",
    "filter_criteria": {
      "max_response_tokens": 100,
      "task_types": [
        "qa",
        "classification",
        "short_generation"
      ],
      "exclude_complex": true
    }
  },
  "training": {
    "learning_rate": 1e-06,
    "batch_size": 8,
    "gradient_accumulation": 2,
    "epochs": 1,
    "warmup_steps": 100,
    "weight_decay": 0.1,
    "max_grad_norm": 0.5,
    "lr_scheduler": "linear"
  },
  "regularization": {
    "dropout": 0.1,
    "data_mixing_ratio": 0.3,
    "kl_penalty": 0.1
  },
  "lora": {
    "r": 4,
    "alpha": 8,
    "dropout": 0.2,
    "target_modules": [
      "c_attn",
      "c_proj",
      "mlp.c_fc",
      "mlp.c_proj"
    ],
    "learning_rate": 5e-05
  }
}


In [3]:
# Paths
BASE_MODEL_PATH = "../models/gpt2"
OUTPUT_DIR = "../outputs/improved_training"
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)
tokenizer.pad_token = tokenizer.eos_token

print(f"✅ Tokenizer loaded: {tokenizer.vocab_size} tokens")

✅ Tokenizer loaded: 50257 tokens


---
## 2. Dataset Curation (Filtered for GPT-2)

Key improvements:
- Filter for **simple tasks** (Q&A, classification)
- **Short responses** (max 100 tokens)
- **Single template** (no gradient conflicts)
- **2000+ samples** for adequate learning signal
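
One caveat when filtering by keyword: the filter in the next cell uses plain substring checks, so a keyword such as `"class"` also matches "classify" and "classification". A minimal sketch of a word-boundary alternative follows; the helper name `contains_keyword` is hypothetical and not part of this notebook, and it is only meaningful for alphanumeric keywords.

```python
import re

def contains_keyword(text: str, keywords: list[str]) -> bool:
    """Hypothetical helper: True if any keyword appears as a whole word or phrase."""
    for kw in keywords:
        # \b anchors the match at word boundaries, so "class" no longer hits "classify"
        if re.search(rf"\b{re.escape(kw.strip())}\b", text, flags=re.IGNORECASE):
            return True
    return False

sample = "Classify the sentiment of this review."
substring_hit = "class" in sample.lower()               # plain substring check
word_hit = contains_keyword(sample, ["class", "code"])  # whole-word check
print(substring_hit, word_hit)
```

Whether dropping classification prompts is acceptable depends on the task mix you want; the sketch only makes the difference between the two checks visible.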

In [4]:
def filter_for_gpt2(example: Dict) -> bool:
    """
    Filter dataset for GPT-2's limited capacity.
    Only keep simple, short examples.
    """
    instruction = example.get("instruction", "")
    output = example.get("output", "")
    inp = example.get("input", "")
    
    # Skip if empty
    if not instruction or not output:
        return False
    
    # Skip very long responses (>100 tokens ~400 chars)
    if len(output) > 400:
        return False
    
    # Skip very short responses (likely need context)
    if len(output) < 10:
        return False
    
    # Skip code-related tasks
    # NOTE: plain substring match, so "class" also drops "classify"/"classification" prompts
    code_keywords = ["code", "python", "javascript", "function", "class", 
                    "def ", "import ", "```", "programming"]
    text = (instruction + output).lower()
    if any(kw in text for kw in code_keywords):
        return False
    
    # Skip complex reasoning tasks
    complex_keywords = ["step by step", "explain how", "analyze", "compare and contrast",
                       "multiple steps", "calculate", "solve"]
    if any(kw in text for kw in complex_keywords):
        return False
    
    # Skip tasks requiring external context
    if len(inp) > 200:  # Large input context
        return False
    
    return True

print("✅ Filter function defined")

✅ Filter function defined


In [6]:
# Fix SSL and httpx issues
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import httpx
httpx._client.Client.__init__.__globals__['DEFAULT_TIMEOUT_CONFIG'] = httpx.Timeout(30.0)

# Monkey-patch httpx to handle SSL
original_init = httpx.Client.__init__
def patched_init(self, *args, **kwargs):
    kwargs.setdefault('verify', False)
    original_init(self, *args, **kwargs)
httpx.Client.__init__ = patched_init

# Load and filter dataset
print("Loading Alpaca dataset...")
dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(f"Original dataset size: {len(dataset)}")

# Apply filter
filtered_dataset = dataset.filter(filter_for_gpt2)
print(f"Filtered dataset size: {len(filtered_dataset)}")

# Take subset for training (2000-3000 samples as recommended)
N_SAMPLES = min(3000, len(filtered_dataset))
filtered_dataset = filtered_dataset.shuffle(seed=42).select(range(N_SAMPLES))
print(f"\n✅ Using {N_SAMPLES} curated samples")

Loading Alpaca dataset...


Generating train split: 100%|██████████| 52002/52002 [00:00<00:00, 162529.66 examples/s]


Original dataset size: 52002


Filter: 100%|██████████| 52002/52002 [00:01<00:00, 37657.75 examples/s]


Filtered dataset size: 30476

✅ Using 3000 curated samples


In [7]:
# Show sample distribution
response_lengths = [len(ex["output"]) for ex in filtered_dataset]
print(f"\n📊 Response Length Statistics:")
print(f"  Min: {min(response_lengths)} chars")
print(f"  Max: {max(response_lengths)} chars")
print(f"  Mean: {sum(response_lengths)/len(response_lengths):.0f} chars")

# Show examples
print("\n📝 Sample Examples:")
for i in range(3):
    ex = filtered_dataset[i]
    print(f"\n--- Example {i+1} ---")
    print(f"Instruction: {ex['instruction'][:100]}...")
    print(f"Output: {ex['output'][:100]}...")


📊 Response Length Statistics:
  Min: 10 chars
  Max: 400 chars
  Mean: 143 chars

📝 Sample Examples:

--- Example 1 ---
Instruction: Edit this sentence for grammar, syntax, and style “It can incredibly difficult to decide”...
Output: It can be incredibly difficult to decide....

--- Example 2 ---
Instruction: Which is the best way to learn a new language?...
Output: The best way to learn a new language is through total immersion. This can be accomplished by traveli...

--- Example 3 ---
Instruction: List 5 strategies for better organization and time management....
Output: 1. Set realistic goals and prioritize tasks.
2. Use a calendar to track meetings and deadlines.
3. B...


---
## 3. Single Template Formatting

**Critical improvement**: Use ONE consistent template to avoid gradient conflicts.

```
Question: {instruction}
Answer: {output}
```

In [8]:
# Simple, consistent template
TEMPLATE = """Question: {instruction}
Answer: {output}"""

def format_example(example: Dict) -> str:
    """Format example with single consistent template."""
    instruction = example["instruction"]
    
    # Append input if present
    if example.get("input"):
        instruction = f"{instruction}\nContext: {example['input']}"
    
    return TEMPLATE.format(
        instruction=instruction,
        output=example["output"]
    )

# Test formatting
print("Template Preview:")
print("=" * 50)
print(format_example(filtered_dataset[0]))
print("=" * 50)

Template Preview:
Question: Edit this sentence for grammar, syntax, and style “It can incredibly difficult to decide”
Answer: It can be incredibly difficult to decide.


In [9]:
class GPT2Dataset(Dataset):
    """Custom dataset for GPT-2 fine-tuning."""
    
    def __init__(self, data, tokenizer, max_length=256):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        text = format_example(self.data[idx])
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt"
        )
        
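        input_ids = encoding["input_ids"].squeeze(0)
        attention_mask = encoding["attention_mask"].squeeze(0)
        
        # pad_token is the EOS token here, so mask padding positions with -100
        # to keep them out of the language-modeling loss
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100
        
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels
        }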

# Create dataset
train_dataset = GPT2Dataset(filtered_dataset, tokenizer, max_length=256)
print(f"✅ Training dataset created: {len(train_dataset)} samples")

✅ Training dataset created: 3000 samples


---
## 4. KL Regularization Implementation

To prevent catastrophic forgetting, we add a KL divergence penalty:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \beta \cdot \mathrm{KL}\left(P_{\text{new}} \,\|\, P_{\text{original}}\right)$$

This keeps the new model's output distribution close to the original.
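
As a quick illustration of the penalty term, the sketch below computes KL(P || Q) for two small, made-up next-token distributions in plain Python. The distributions and the helper name `kl_divergence` are assumptions for illustration only. It shows that the divergence is zero when the two distributions match and that it is not symmetric, so the argument order matters when implementing it.

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up distributions over a 3-token vocabulary (illustrative only)
p_original = [0.7, 0.2, 0.1]
p_new = [0.5, 0.3, 0.2]

kl_new_vs_orig = kl_divergence(p_new, p_original)   # direction used in the loss above
kl_orig_vs_new = kl_divergence(p_original, p_new)   # reverse direction

print(f"KL(P_new || P_original) = {kl_new_vs_orig:.4f}")
print(f"KL(P_original || P_new) = {kl_orig_vs_new:.4f}")
print(f"KL(P_new || P_new)      = {kl_divergence(p_new, p_new):.4f}")
```

In PyTorch, `F.kl_div(input, target)` computes KL(target || input) with `input` given as log-probabilities, so the argument order there is the reverse of the mathematical notation.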

In [10]:
class KLRegularizedTrainer(Trainer):
    """
    Custom trainer with KL divergence regularization.
    Keeps model close to original distribution to prevent forgetting.
    """
    
    def __init__(self, *args, reference_model=None, kl_weight=0.1, **kwargs):
        super().__init__(*args, **kwargs)
        self.reference_model = reference_model
        self.kl_weight = kl_weight
        
        # Freeze reference model
        if self.reference_model is not None:
            self.reference_model.eval()
            for param in self.reference_model.parameters():
                param.requires_grad = False
            print(f"✅ Reference model frozen for KL regularization (β={kl_weight})")
    
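    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Standard forward pass
        outputs = model(**inputs)
        task_loss = outputs.loss
        
        # Add KL regularization if reference model exists
        if self.reference_model is not None and self.kl_weight > 0:
            with torch.no_grad():
                # Labels are not needed for the reference pass
                ref_outputs = self.reference_model(
                    input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                )
                ref_log_probs = F.log_softmax(ref_outputs.logits, dim=-1)
            
            new_log_probs = F.log_softmax(outputs.logits, dim=-1)
            
            # Per-token KL(P_new || P_original), matching the formula above:
            # sum over the vocabulary of p_new * (log p_new - log p_original)
            token_kl = (new_log_probs.exp() * (new_log_probs - ref_log_probs)).sum(dim=-1)
            
            # Average over real tokens only, so padding does not contribute
            mask = inputs["attention_mask"].to(token_kl.dtype)
            kl_loss = (token_kl * mask).sum() / mask.sum().clamp(min=1)
            
            # Total loss
            total_loss = task_loss + self.kl_weight * kl_loss
        else:
            total_loss = task_loss
        
        return (total_loss, outputs) if return_outputs else total_loss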

print("✅ KLRegularizedTrainer defined")

✅ KLRegularizedTrainer defined


---
## 5. Model Setup with Improved LoRA

LoRA improvements:
- **Lower rank** (r=4 vs 16) - less overfitting
- **More target modules** - include MLP layers
- **Higher dropout** - better regularization
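
The effect of the rank on adapter size can be checked by hand: a LoRA adapter on a layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below applies this to GPT-2 small, assuming 12 transformer blocks with hidden size 768 and adapters on the attention and MLP projections. The helper name `lora_param_count` and the shape table are written out here for illustration.

```python
def lora_param_count(r: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Parameters added by LoRA: A is (r x d_in) and B is (d_out x r) per adapted layer."""
    per_block = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_block * n_layers

# (d_in, d_out) of the adapted projections in one GPT-2 small block (assumed shapes)
GPT2_SMALL_SHAPES = [
    (768, 2304),   # attn.c_attn (fused Q, K, V)
    (768, 768),    # attn.c_proj
    (768, 3072),   # mlp.c_fc
    (3072, 768),   # mlp.c_proj
]

params_r4 = lora_param_count(4, GPT2_SMALL_SHAPES, n_layers=12)
params_r16 = lora_param_count(16, GPT2_SMALL_SHAPES, n_layers=12)
print(f"r=4:  {params_r4:,} trainable parameters")
print(f"r=16: {params_r16:,} trainable parameters")
```

With r=4 this gives 589,824 parameters, which matches the trainable count printed by the LoRA cell in this section; r=16 on the same modules would be four times larger.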

In [11]:
# Load base model (for training)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH,
    torch_dtype=torch.float32,
)
model.config.pad_token_id = tokenizer.pad_token_id

# Load reference model (frozen, for KL regularization)
reference_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_PATH,
    torch_dtype=torch.float32,
)

print(f"✅ Models loaded")
print(f"  Trainable model params: {sum(p.numel() for p in model.parameters()):,}")

`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|██████████| 148/148 [00:00<00:00, 190.83it/s, Materializing param=transformer.wte.weight]
GPT2LMHeadModel LOAD REPORT from: ../models/gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Loading weights: 100%|██████████| 148/148 [00:00<00:00, 325.91it/s, Materializing param=transformer.wte.weight]
GPT2LMHeadModel LOAD REPORT from: ../models/gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


✅ Models loaded
  Trainable model params: 124,439,808


In [12]:
# Improved LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=4,                    # Lower rank (was 16)
    lora_alpha=8,           # Lower alpha (was 32)
    lora_dropout=0.2,       # Higher dropout (was 0.1)
    target_modules=[        # Include MLP layers
        "c_attn",
        "c_proj",           # suffix match: covers both attn.c_proj and mlp.c_proj
        "c_fc",
    ],
    bias="none",
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"\n📊 Parameter Summary:")
print(f"  Total params: {total_params:,}")
print(f"  Trainable params: {trainable_params:,}")
print(f"  Trainable %: {100 * trainable_params / total_params:.2f}%")

  warn("The installed version of bitsandbytes was compiled without GPU support. "


'NoneType' object has no attribute 'cadam32bit_grad_fp32'

📊 Parameter Summary:
  Total params: 125,029,632
  Trainable params: 589,824
  Trainable %: 0.47%


---
## 6. Improved Training Configuration

Key changes:
- **LR: 1e-6** (10-100x lower than the previous 1e-5 to 1e-4)
- **1 epoch** (single pass to prevent overfitting)
- **Higher weight decay** (0.1 for regularization)
- **Lower gradient clipping** (0.5 for stability)
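
To see what these settings mean for the learning rate actually applied, the sketch below reproduces a linear warmup followed by linear decay in plain Python. It assumes the 187 optimizer steps estimated for this run, the 100 warmup steps configured below, and the usual linear-schedule shape (ramp from 0 to the peak, then decay to 0). The function name `linear_lr` is hypothetical.

```python
def linear_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

PEAK_LR = 1e-6
WARMUP_STEPS = 100
TOTAL_STEPS = 187   # assumed: 3000 samples at an effective batch size of 16

for step in (0, 50, 100, 150, 187):
    lr = linear_lr(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:3d}: lr = {lr:.2e}")

warmup_fraction = WARMUP_STEPS / TOTAL_STEPS
print(f"warmup covers {warmup_fraction:.0%} of training")
```

Under these assumptions the peak rate of 1e-6 is reached only at step 100, so more than half of the single epoch is spent in warmup.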

In [13]:
# Improved training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    
    # Key improvements
    learning_rate=1e-6,          # 10-100x lower than before
    num_train_epochs=1,          # Single epoch
    weight_decay=0.1,            # Higher regularization
    max_grad_norm=0.5,           # Lower gradient clipping
    
    # Batch settings
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 16
    
    # LR schedule
    warmup_steps=100,
    lr_scheduler_type="linear",
    
    # Logging
    logging_steps=50,
    save_strategy="epoch",
    
    # Disable evaluation (CPU training)
    eval_strategy="no",
    
    # CPU settings
    use_cpu=True,
    fp16=False,
    
    # Reproducibility
    seed=42,
    data_seed=42,
)

print("✅ Training arguments configured")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")

✅ Training arguments configured
  Learning rate: 1e-06
  Epochs: 1
  Effective batch size: 16


In [14]:
# Calculate training estimates
n_samples = len(train_dataset)
batch_size = training_args.per_device_train_batch_size
grad_accum = training_args.gradient_accumulation_steps
epochs = training_args.num_train_epochs

steps_per_epoch = n_samples // (batch_size * grad_accum)
total_steps = steps_per_epoch * epochs

# Estimate time (based on previous runs: ~0.15s per step on CPU)
est_time_min = total_steps * 0.15 / 60

print(f"\n📊 Training Estimates:")
print(f"  Samples: {n_samples}")
print(f"  Steps per epoch: {steps_per_epoch}")
print(f"  Total steps: {total_steps}")
print(f"  Estimated time: ~{est_time_min:.0f} minutes")


📊 Training Estimates:
  Samples: 3000
  Steps per epoch: 187
  Total steps: 187
  Estimated time: ~0 minutes


---
## 7. Training with KL Regularization

In [15]:
# Create KL-regularized trainer
trainer = KLRegularizedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    reference_model=reference_model,
    kl_weight=CONFIG["regularization"]["kl_penalty"],  # 0.1 from config
)

print("\n" + "="*60)
print("🚀 STARTING IMPROVED TRAINING")
print("="*60)
print(f"\nImprovements Applied:")
print(f"  ✓ Curated dataset: {len(train_dataset)} samples")
print(f"  ✓ Single template (no gradient conflicts)")
print(f"  ✓ Low LR: {training_args.learning_rate}")
print(f"  ✓ 1 epoch (prevent overfitting)")
print(f"  ✓ KL regularization: β={CONFIG['regularization']['kl_penalty']}")
print(f"  ✓ LoRA r=4 (lower rank)")
print(f"\n" + "="*60)

✅ Reference model frozen for KL regularization (β=0.1)

🚀 STARTING IMPROVED TRAINING

Improvements Applied:
  ✓ Curated dataset: 3000 samples
  ✓ Single template (no gradient conflicts)
  ✓ Low LR: 1e-06
  ✓ 1 epoch (prevent overfitting)
  ✓ KL regularization: β=0.1
  ✓ LoRA r=4 (lower rank)



In [None]:
# Start training
import time
start_time = time.time()

train_result = trainer.train()

elapsed = time.time() - start_time
print(f"\n" + "="*60)
print(f"✅ TRAINING COMPLETE!")
print(f"="*60)
print(f"  Time: {elapsed/60:.1f} minutes")
print(f"  Final loss: {train_result.training_loss:.4f}")

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


In [None]:
# Save model
adapter_path = f"{OUTPUT_DIR}/adapter"
model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)
print(f"✅ Adapter saved to: {adapter_path}")

# Save merged model
merged_path = f"{OUTPUT_DIR}/merged"
merged_model = model.merge_and_unload()
merged_model.save_pretrained(merged_path)
tokenizer.save_pretrained(merged_path)
print(f"✅ Merged model saved to: {merged_path}")

---
## 8. Evaluation & Comparison

In [None]:
# Load improved model for evaluation
eval_model = AutoModelForCausalLM.from_pretrained(
    merged_path,
    torch_dtype=torch.float32
)
eval_model.eval()

print("✅ Model loaded for evaluation")

In [None]:
def generate_response(model, tokenizer, prompt, max_new_tokens=100):
    """Generate response with same format as training."""
    # Format prompt like training data
    formatted = f"Question: {prompt}\nAnswer:"
    
    inputs = tokenizer(formatted, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract answer part
    if "Answer:" in response:
        response = response.split("Answer:")[1].strip()
    
    return response

print("✅ Generation function defined")

In [None]:
# Test queries (same as evaluation notebook)
test_queries = [
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?",
    "What is 2 + 2?",
    "What color is the sky?",
    "Name a fruit that is red.",
    "What is the largest planet in our solar system?",
    "What language do people speak in Spain?",
    "Is water wet?",
]

print("\n" + "="*60)
print("IMPROVED MODEL RESPONSES")
print("="*60)

improved_results = []
for query in test_queries:
    response = generate_response(eval_model, tokenizer, query)
    improved_results.append({"query": query, "response": response})
    
    print(f"\nQ: {query}")
    print(f"A: {response[:200]}{'...' if len(response) > 200 else ''}")
    print("-" * 40)

In [None]:
# Compare with previous stages
print("\n" + "="*60)
print("COMPARISON WITH PREVIOUS STAGES")
print("="*60)

# Load Stage 2 (worst performer) for comparison
stage2_path = "../outputs/stage2_instruction/model"
if Path(stage2_path).exists():
    stage2_model = AutoModelForCausalLM.from_pretrained(stage2_path, torch_dtype=torch.float32)
    stage2_model.eval()
    
    # Compare on same queries
    for i, query in enumerate(test_queries[:3]):
        print(f"\n{'='*50}")
        print(f"Query: {query}")
        print(f"{'='*50}")
        
        # Stage 2 response
        s2_resp = generate_response(stage2_model, tokenizer, query)
        print(f"\n📛 Stage 2 (before): {s2_resp[:150]}...")
        
        # Improved response
        print(f"\n✅ Improved (after): {improved_results[i]['response'][:150]}...")
    
    del stage2_model
else:
    print("Stage 2 model not found for comparison")

In [None]:
# Quality metrics
def simple_quality_check(response):
    """Basic quality heuristics."""
    # Check for repetition (degenerate output)
    words = response.lower().split()
    if len(words) > 5:
        unique_ratio = len(set(words)) / len(words)
    else:
        unique_ratio = 1.0
    
    # Check response length (too short = bad, too long = bad)
    length_ok = 5 < len(words) < 150
    
    # Check for common degenerate patterns
    degenerate_patterns = [
        "the the the",
        "is is is",
        "and and and",
        "\n\n\n\n",
    ]
    has_degenerate = any(p in response.lower() for p in degenerate_patterns)
    
    return {
        "unique_ratio": unique_ratio,
        "length_ok": length_ok,
        "no_degenerate": not has_degenerate,
        "quality_score": unique_ratio * (1 if length_ok else 0.5) * (1 if not has_degenerate else 0.1)
    }

# Evaluate all responses
print("\n" + "="*60)
print("QUALITY METRICS")
print("="*60)

scores = []
for result in improved_results:
    metrics = simple_quality_check(result["response"])
    scores.append(metrics["quality_score"])
    
avg_score = sum(scores) / len(scores)
print(f"\nAverage Quality Score: {avg_score:.2f}")
print(f"  (1.0 = perfect, 0.1 = degenerate)")
print(f"\nPer-query scores:")
for i, (result, score) in enumerate(zip(improved_results, scores)):
    status = "✅" if score > 0.5 else "⚠️" if score > 0.2 else "❌"
    print(f"  {i+1}. {status} Score: {score:.2f} - {result['query'][:40]}")

In [None]:
# Save results
results_summary = {
    "training": {
        "samples": len(train_dataset),
        "epochs": 1,
        "learning_rate": training_args.learning_rate,
        "lora_rank": 4,
        "kl_weight": CONFIG["regularization"]["kl_penalty"],
        "final_loss": train_result.training_loss,
        "training_time_minutes": elapsed / 60,
    },
    "evaluation": {
        "average_quality_score": avg_score,
        "results": improved_results,
    },
    "improvements_applied": [
        "Curated dataset (filtered for simple tasks)",
        "Single template (no gradient conflicts)",
        "10-100x lower learning rate",
        "Single epoch training",
        "KL regularization against original model",
        "Lower LoRA rank (r=4)",
        "Higher dropout (0.2)",
    ]
}

results_path = Path(f"{OUTPUT_DIR}/improved_results.json")
with open(results_path, "w") as f:
    json.dump(results_summary, f, indent=2)

print(f"\n✅ Results saved to: {results_path}")

---
## 9. Summary

### Improvements Applied:
1. ✅ **Curated Dataset**: 2000+ samples filtered for simple tasks
2. ✅ **Single Template**: Eliminated gradient conflicts
3. ✅ **Low Learning Rate**: 1e-6 (10-100x lower)
4. ✅ **Single Epoch**: Prevent overfitting
5. ✅ **KL Regularization**: Keep model close to original
6. ✅ **Lower LoRA Rank**: r=4 to reduce overfitting
7. ✅ **Higher Dropout**: 0.2 for regularization

### Expected vs Achieved:
| Metric | Previous Stages | Expected | Achieved |
|--------|-----------------|----------|----------|
| Coherent responses | ~5% | 40-60% | TBD |
| No repetition | ~20% | 80%+ | TBD |
| Quality score | ~0.1 | 0.5+ | See above |

In [None]:
print("\n" + "="*70)
print("🎉 IMPROVED TRAINING COMPLETE!")
print("="*70)
print(f"""
Summary:
  - Training samples: {len(train_dataset)}
  - Final loss: {train_result.training_loss:.4f}
  - Average quality score: {avg_score:.2f}
  - Model saved to: {merged_path}

Key takeaways:
  1. Lower LR + fewer epochs prevents catastrophic forgetting
  2. Single template eliminates gradient conflicts
  3. KL regularization keeps model grounded
  4. GPT-2 works best for SIMPLE tasks
""")