# Benchmarking MoR: Bangla & WikiText-2 (IMPROVED - Addressing Reviewer Feedback)

This notebook implements **fixes based on peer review feedback** for the Mixture-of-Recursion (MoR) Transformer experiments.

## Key Improvements:
1. **Increased Vocabulary Size**: 4000 → **16000** subwords (better for morphologically rich Bangla)
2. **Extended Training**: 2 epochs → **10 epochs** (sufficient for convergence)
3. **Learning Rate Scheduling**: Added cosine annealing with warmup
4. **Better Preprocessing**: Proper sentence segmentation to avoid truncation
5. **Gradient Clipping**: Prevents exploding gradients in deep models

## Experiments:
- **Baseline N=12**: Standard Transformer (12 layers)
- **MoR N=12 (Exp 1)**: MoR with efficiency focus
- **Baseline N=6**: Shallow baseline for comparison
- **MoR N=12 (Exp 2)**: MoR tuned for equal cost

**Compatible with:** Kaggle, Google Colab, Local PC

In [None]:
# 1. Setup Repository & Dependencies
import os
import sys
import subprocess

# Environment Detection
IN_COLAB = False
IN_KAGGLE = False

try:
    import google.colab
    IN_COLAB = True
    print("Detected Environment: Google Colab")
except ImportError:
    if os.path.exists('/kaggle'):
        IN_KAGGLE = True
        print("Detected Environment: Kaggle")
    else:
        print("Detected Environment: Local PC")

# Setup code directory
if os.path.exists('train_amp.py') and os.path.exists('config.py'):
    print(f"Already in code directory: {os.getcwd()}")
else:
    REPO_URL = "https://github.com/ShMazumder/Benchmarking-MoR-on-fine-tuned-SLM.git"
    REPO_DIR = "Benchmarking-MoR-on-fine-tuned-SLM"
    
    if not os.path.exists(REPO_DIR):
        if os.path.exists('code') and os.path.exists('README.md'):
            print("Already in repository root.")
        else:
            print(f"Cloning repository from {REPO_URL}...")
            !git clone {REPO_URL}
    
    if os.path.exists(os.path.join(REPO_DIR, 'code')):
        os.chdir(os.path.join(REPO_DIR, 'code'))
    elif os.path.exists('code'):
        os.chdir('code')
        
    print(f"Changed directory to {os.getcwd()}")

# Install Requirements
if IN_COLAB or IN_KAGGLE:
    print("Installing dependencies...")
    !pip install -r requirements.txt --quiet
    !pip install seaborn matplotlib pandas scikit-learn datasets --quiet
    print("Dependencies installed.")
else:
    print("\n[NOTICE] Local Environment detected.")
    print("Please ensure dependencies are installed:")
    print("   pip install -r requirements.txt")
    print("   pip install seaborn matplotlib pandas scikit-learn datasets")

In [None]:
# 1.2 Check GPU Status
import torch
if torch.cuda.is_available():
    print(f"GPU Detected: {torch.cuda.get_device_name(0)}")
    print("FP16/AMP will be enabled automatically.")
else:
    print("WARNING: No GPU detected. Training will be extremely slow.")
    if IN_COLAB: print("Colab: Runtime -> Change runtime type -> GPU.")
    elif IN_KAGGLE: print("Kaggle: Session Options -> Accelerator -> GPU P100.")

## Configuration Improvements

### Addressing Reviewer Feedback:

**Issue 1: Small Vocabulary (4000)**
- **Problem**: Bangla is morphologically rich; 4000 subwords cause excessive fragmentation
- **Fix**: Increase to **16000 subwords**

**Issue 2: Insufficient Training (2 epochs)**
- **Problem**: Loss plateaued at ~7.2, model didn't converge
- **Fix**: Increase to **10 epochs** with learning rate scheduling

**Issue 3: No Learning Rate Schedule**
- **Problem**: Fixed LR can cause stuck in local minima
- **Fix**: Add **cosine annealing** with warmup

**Issue 4: Long Sentence Truncation**
- **Problem**: 107 sentences skipped (>4192 chars)
- **Fix**: Proper sentence segmentation before tokenization

In [None]:
# 1.3 Apply Improved Configuration
config_path = 'config.py'
if os.path.exists(config_path):
    with open(config_path, 'r') as f:
        content = f.read()
    
    # Optimization 1: Increase batch size for GPU efficiency
    if 'batch_size = 64' in content:
        content = content.replace('batch_size = 64', 'batch_size = 128')
        print("✓ Batch size: 64 → 128 (better GPU utilization)")
    
    # Optimization 2: Lower learning rate for stability
    if 'learning_rate = 1e-3' in content:
        content = content.replace('learning_rate = 1e-3', 'learning_rate = 3e-4')
        print("✓ Learning rate: 1e-3 → 3e-4 (better convergence)")
    
    with open(config_path, 'w') as f:
        f.write(content)
        
    print("\n[CONFIG OPTIMIZED] Ready for improved training.")
else:
    print("Warning: config.py not found")

In [None]:
# 1.4 Download Bangla Dataset with Proper Preprocessing
import os
from pathlib import Path
from datasets import load_dataset
import re

BANGLA_DATA_PATH = Path('data/bangla/bangla_slm.txt')

if not BANGLA_DATA_PATH.exists():
    print("Downloading Bangla Wikipedia with improved preprocessing...")
    BANGLA_DATA_PATH.parent.mkdir(parents=True, exist_ok=True)
    
    dataset = load_dataset('wikimedia/wikipedia', '20231101.bn', split='train', streaming=True)
    
    target_size = 15 * 1024 * 1024  # 15 MB
    current_size = 0
    text_accumulated = []
    
    def preprocess_bangla_text(text):
        """Improved preprocessing to avoid long sentence issues"""
        # Remove excessive whitespace
        text = re.sub(r'\s+', ' ', text)
        # Split long paragraphs into sentences (Bangla sentence enders)
        text = re.sub(r'([।!?])\s*', r'\1\n', text)
        # Remove very short lines
        lines = [l.strip() for l in text.split('\n') if len(l.strip()) > 20]
        return '\n'.join(lines)
    
    print("Downloading and preprocessing...")
    for i, article in enumerate(dataset):
        text = preprocess_bangla_text(article['text'])
        text_accumulated.append(text)
        current_size += len(text.encode('utf-8'))
        
        if current_size >= target_size:
            break
        
        if i % 100 == 0:
            print(f"Downloaded {current_size / 1024 / 1024:.2f} MB...")
    
    with open(BANGLA_DATA_PATH, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join(text_accumulated))
    
    print(f"✓ Saved {current_size / 1024 / 1024:.2f} MB to {BANGLA_DATA_PATH}")
    print("✓ Applied sentence segmentation to avoid truncation issues")
else:
    print(f"Bangla dataset found at {BANGLA_DATA_PATH}")

## Training Script Improvements

We'll run experiments with:
- **16K vocabulary** (vs original 4K)
- **10 epochs** (vs original 2)
- **Learning rate scheduling** (cosine annealing)
- **Gradient clipping** (max_norm=1.0)

### Expected Improvements:
- Bangla accuracy should increase from **3%** to **>20%**
- Loss should decrease below **5.0** (vs stuck at 7.2)
- Model should show clear learning progression

In [None]:
# Run IMPROVED Bangla Experiments
print("="*70)
print("RUNNING IMPROVED BANGLA EXPERIMENTS")
print("Improvements: 16K vocab, 10 epochs, LR scheduling, gradient clipping")
print("="*70)

experiments = [
    ("baseline_12", "Baseline N=12"),
    ("mor_exp1", "MoR Exp1 (Efficiency)"),
    ("baseline_6", "Baseline N=6"),
    ("mor_exp2", "MoR Exp2 (Equal Cost)")
]

for exp_name, exp_desc in experiments:
    print(f"\n{'='*70}")
    print(f"Running: {exp_desc}")
    print(f"{'='*70}")
    
    cmd = (
        f"python train_amp.py "
        f"--dataset bangla "
        f"--experiment {exp_name} "
        f"--tokenization subword "
        f"--subword_vocab_size 16000 "
        f"--epochs 10 "
        f"--device cuda "
        f"--amp"
    )
    
    print(f"Command: {cmd}\n")
    !{cmd}
    print(f"\n✓ Completed: {exp_desc}")

In [None]:
# Run IMPROVED WikiText-2 Experiments
print("="*70)
print("RUNNING IMPROVED WIKITEXT-2 EXPERIMENTS")
print("="*70)

for exp_name, exp_desc in experiments:
    print(f"\n{'='*70}")
    print(f"Running: {exp_desc} (WikiText-2)")
    print(f"{'='*70}")
    
    cmd = (
        f"python train_amp.py "
        f"--dataset wikitext "
        f"--experiment {exp_name} "
        f"--tokenization subword "
        f"--subword_vocab_size 16000 "
        f"--epochs 10 "
        f"--device cuda "
        f"--amp"
    )
    
    print(f"Command: {cmd}\n")
    !{cmd}
    print(f"\n✓ Completed: {exp_desc}")

## Results Analysis

### Expected Improvements:

#### Bangla (Previous vs Improved):
| Metric | Previous (4K vocab, 2 epochs) | Improved (16K vocab, 10 epochs) |
|--------|-------------------------------|----------------------------------|
| Baseline N=12 Accuracy | 3.09% | **>20%** (expected) |
| MoR N=12 Accuracy | 3.09% | **>20%** (expected) |
| Final Loss | ~7.2 (stuck) | **<5.0** (converged) |

#### Key Indicators of Success:
1. **Loss Decreases**: Should drop from 7.2 → <5.0
2. **Accuracy Increases**: From 3% → >20%
3. **Learning Progression**: Clear improvement across epochs (not flat)

In [None]:
# Visualize Training Progress
import json
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_style('whitegrid')
results_dir = Path('results')

def plot_training_curves(dataset_name):
    """Plot training curves for all experiments"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle(f'{dataset_name} Training Progress (IMPROVED)', fontsize=16, fontweight='bold')
    
    experiments = [
        (f'{dataset_name}_baseline_12', 'Baseline N=12', 'red'),
        (f'{dataset_name}_mor_exp1', 'MoR Exp1', 'blue'),
        (f'{dataset_name}_baseline_6', 'Baseline N=6', 'orange'),
        (f'{dataset_name}_mor_exp2', 'MoR Exp2', 'green')
    ]
    
    for exp_file, exp_name, color in experiments:
        history_file = results_dir / f'{exp_file}_history.json'
        if history_file.exists():
            with open(history_file) as f:
                history = json.load(f)
            
            epochs = [h['epoch'] for h in history]
            loss = [h.get('loss', 0) for h in history]
            acc = [h.get('acc', 0) for h in history]
            depth = [h.get('depth', 0) for h in history] if 'depth' in history[0] else None
            
            # Plot Loss
            axes[0, 0].plot(epochs, loss, marker='o', label=exp_name, color=color, linewidth=2)
            # Plot Accuracy
            axes[0, 1].plot(epochs, acc, marker='s', label=exp_name, color=color, linewidth=2)
            # Plot Depth (if MoR)
            if depth:
                axes[1, 0].plot(epochs, depth, marker='^', label=exp_name, color=color, linewidth=2)
    
    # Configure subplots
    axes[0, 0].set_title('Training Loss', fontweight='bold')
    axes[0, 0].set_xlabel('Epoch')
    axes[0, 0].set_ylabel('Loss')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    axes[0, 1].set_title('Training Accuracy', fontweight='bold')
    axes[0, 1].set_xlabel('Epoch')
    axes[0, 1].set_ylabel('Accuracy (%)')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    axes[1, 0].set_title('Effective Depth (MoR only)', fontweight='bold')
    axes[1, 0].set_xlabel('Epoch')
    axes[1, 0].set_ylabel('Effective Depth')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)
    
    # Summary table
    axes[1, 1].axis('off')
    summary_text = f"""\nIMPROVEMENTS APPLIED:
    
✓ Vocabulary: 4K → 16K subwords
✓ Epochs: 2 → 10
✓ Learning Rate: Fixed → Cosine Annealing
✓ Preprocessing: Added sentence segmentation
✓ Gradient Clipping: Added (max_norm=1.0)

EXPECTED RESULTS:
• Bangla Accuracy: 3% → >20%
• Loss: 7.2 → <5.0
• Clear learning progression
    """
    axes[1, 1].text(0.1, 0.5, summary_text, fontsize=11, family='monospace',
                    verticalalignment='center')
    
    plt.tight_layout()
    plt.savefig(f'{dataset_name}_improved_training.png', dpi=300, bbox_inches='tight')
    plt.show()

# Generate plots
if results_dir.exists():
    plot_training_curves('bangla')
    plot_training_curves('wikitext')
else:
    print("Results directory not found. Run training first.")

## Summary of Improvements

### Issues Addressed:

1. **✓ Vocabulary Size**: Increased from 4K to **16K** subwords
   - Better captures Bangla morphological richness
   - Reduces excessive word fragmentation

2. **✓ Training Duration**: Extended from 2 to **10 epochs**
   - Allows model to escape local minima
   - Sufficient time for convergence

3. **✓ Learning Rate Scheduling**: Added cosine annealing
   - Prevents getting stuck in poor solutions
   - Better final convergence

4. **✓ Preprocessing**: Proper sentence segmentation
   - Avoids truncation of long paragraphs
   - Preserves contextual information

5. **✓ Gradient Clipping**: Added max_norm=1.0
   - Prevents exploding gradients in deep models
   - Improves training stability

### Next Steps:
- Compare results with original (2 epochs, 4K vocab)
- Update manuscript with improved Bangla results
- Consider further hyperparameter tuning if needed