# LexiLingo: Unified LoRA Adapter Fine-tuning (Kaggle Edition)

**Platform:** Kaggle Notebooks (GPU: P100/T4)  
**Version:** 1.0.1 - CUDA Error Fixed (Jan 2026)

**Purpose:** Fine-tune Qwen2.5-1.5B-Instruct with **one unified LoRA adapter** that handles 4 tasks at once:
1. **Fluency Scoring** (0.0-1.0)
2. **Vocabulary Level Classification** (A1, A2, B1, B2, C1, C2)
3. **Grammar Error Correction** (GEC)
4. **Dialogue Generation** (conversational responses)

---

## CRITICAL BUG FIX - Version 1.0.1

### Issue Resolved:
**Error:** `AcceleratorError: CUDA error: an illegal memory access was encountered`

**Root Cause:**
- PyTorch DataParallel wrapper was being applied to 4-bit quantized model
- bitsandbytes quantization is NOT compatible with DataParallel
- Multi-GPU operations cause memory access violations with quantized layers

**Solution Applied:**
1. Set `CUDA_VISIBLE_DEVICES=0` BEFORE importing torch/transformers
2. Explicit device pinning: `device_map={"": 0}` (no "auto")
3. Disabled all distributed training parameters
4. Added verification checks to prevent DataParallel wrapping
5. Optimized memory allocation settings

**Result:** Stable single-GPU training with 4-bit quantization
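
The ordering rule in step 1 can be sketched as below. This is a minimal, standard-library-only illustration: `force_single_gpu` is a hypothetical helper name that the notebook does not define. It sets the variable and reports whether `torch` had already been imported. The variable is read when CUDA is initialized, so setting it before the import, as the notebook does, is the conservative choice.

```python
import os
import sys


def force_single_gpu(device_index: int = 0) -> bool:
    """Expose only one GPU to this process.

    Returns True when torch had not been imported yet, i.e. the setting
    was applied early enough to follow the notebook's ordering rule.
    (Hypothetical helper, for illustration only.)
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return not torch_already_imported


applied_early = force_single_gpu(0)
print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
if not applied_early:
    print("torch was already imported; restart the kernel and run this first")
```

If the helper reports that `torch` was already imported, restarting the kernel and running the environment cell first is the safest way to be sure the setting takes effect.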

---

## Kaggle-Specific Features

### Storage:
- **Input datasets:** `/kaggle/input/` (read-only, uploaded as Kaggle dataset)
- **Output/Checkpoints:** `/kaggle/working/` (auto-saves as output after session ends)
- **Model cache:** `/kaggle/working/.cache/` (Hugging Face models)

### Auto-Save Checkpoints:
- Every **100 steps** (periodic save)
- All saved to `/kaggle/working/unified_model/`
- Safe shutdown handling on session interruption

### Resume Training:
1. Upload previous session's output as input dataset
2. Set `resume_from_checkpoint="auto"` (default)
3. Training continues from saved state
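
As a rough sketch of what locating a previous session's state could look like, the snippet below searches a directory tree for `checkpoint-<step>` folders (the naming that the checkpoint manager later in this notebook globs for) and returns the one with the highest step. The function name and the temporary demo tree are assumptions made for illustration; on Kaggle the root would be the input dataset that holds the earlier output.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(root: Path) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step under root."""
    best_step, best_path = -1, None
    for candidate in root.rglob("checkpoint-*"):
        suffix = candidate.name.split("-")[-1]
        # Compare steps numerically so that 1000 ranks above 200
        if candidate.is_dir() and suffix.isdigit() and int(suffix) > best_step:
            best_step, best_path = int(suffix), candidate
    return best_path


# Demo on a temporary tree standing in for a previous session's output
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "unified_model"
    for step in (100, 200, 1000):
        (base / f"checkpoint-{step}").mkdir(parents=True)
    latest = find_latest_checkpoint(Path(tmp))
    latest_name = latest.name
    print(latest_name)
```

The numeric comparison matters: a plain string sort would rank `checkpoint-200` above `checkpoint-1000`.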

---

**GPU Allocation:** Kaggle provides 30hrs/week GPU (P100/T4)  
**Expected Training Time:** 6-10 hours (HIGH QUALITY config)

## 1. Setup Environment & Install Dependencies

 **CRITICAL FIXES APPLIED** - Version 1.0.1 (Jan 2026)

### Fixed Issues:
1. **CUDA Illegal Memory Access Error** - RESOLVED 
   - Root cause: DataParallel wrapper conflicting with 4-bit quantization
   - Fix: Force single GPU mode via environment variables BEFORE imports
   - Added explicit device pinning (device_map={"": 0})

2. **Multi-GPU Conflicts** - RESOLVED 
   - Set CUDA_VISIBLE_DEVICES=0 at notebook start
   - Disabled all distributed training parameters
   - Added verification checks throughout pipeline

3. **Quantization Stability** - ENHANCED 
   - Optimized BitsAndBytesConfig for single GPU
   - Added memory allocator configuration
   - Disabled dataloader multi-processing

### Must Enable Internet First:

Kaggle blocks internet by default. TRL library is REQUIRED but not pre-installed.

**How to Enable Internet:**
1. Right sidebar -> Settings
2. Scroll to Internet section
3. Toggle switch to ON (blue)
4. Click Save
5. Re-run cells below

Without internet, this notebook CANNOT run. TRL is required for SFTTrainer.

**Expected Training Time:** 6-10 hours on P100/T4 (HIGH QUALITY config)

In [None]:
# Check internet connectivity
import urllib.request
import sys

print("\n" + "="*70)
print("INTERNET CONNECTION CHECK")
print("="*70)

try:
    urllib.request.urlopen('https://pypi.org', timeout=5)
    INTERNET_AVAILABLE = True
    print("\nInternet: ENABLED")
    print("  Ready to install packages from PyPI")
except Exception:
    INTERNET_AVAILABLE = False
    print("\nInternet: DISABLED")
    print("\n" + "!"*70)
    print("ERROR: Cannot proceed without internet!")
    print("!"*70)
    print("\nTRL library is REQUIRED but not pre-installed on Kaggle.")
    print("\nYou MUST enable internet to continue:")
    print("  1. Right sidebar -> Settings ")
    print("  2. Scroll to 'Internet' section")
    print("  3. Toggle to ON (blue)")
    print("  4. Click Save")
    print("  5. Re-run this cell")
    print("\n" + "!"*70)

print("="*70 + "\n")

In [None]:
# CRITICAL: Set environment variables BEFORE importing torch/transformers
# This prevents DataParallel and multi-GPU issues with quantized models
import os

print("\n" + "="*70)
print("ENVIRONMENT SETUP - SINGLE GPU MODE")
print("="*70)

# Force single GPU execution (prevents DataParallel with quantization)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"

# Disable tokenizers parallelism warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Set memory allocator for better stability
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print("Environment variables set:")
print(f"  CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'not set')}")
print(f"  WORLD_SIZE: {os.environ.get('WORLD_SIZE', 'not set')}")
print(f"  PYTORCH_CUDA_ALLOC_CONF: {os.environ.get('PYTORCH_CUDA_ALLOC_CONF', 'not set')}")
print("\nSingle GPU mode enforced BEFORE library import")
print("="*70 + "\n")

# Install packages (REQUIRES INTERNET)
if not INTERNET_AVAILABLE:
    print("\n" + "!" * 70)
    print("CANNOT INSTALL PACKAGES - INTERNET IS DISABLED")
    print("!" * 70)
    print("\nPlease enable internet in Settings (see instructions above)")
    print("\nNotebook cannot proceed without TRL library.")
    print("!" * 70 + "\n")

    print("Checking pre-installed packages:\n")
    try:
        import transformers, accelerate, datasets, peft
        print(f"transformers: {transformers.__version__}")
        print(f"accelerate: {accelerate.__version__}")
        print(f"datasets: {datasets.__version__}")
        print(f"peft: {peft.__version__}")
    except ImportError as e:
        print(f"Some packages missing: {e}")

    try:
        import trl
        print(f"trl: {trl.__version__}")
    except ImportError:
        print("trl: NOT INSTALLED (REQUIRED)")
        print("\n" + "=" * 70)
        print("STOPPING: Enable internet and re-run from the top")
        print("=" * 70)
        raise
else:
    print("Installing required packages...\n")
    print("This may take 2-3 minutes on first run.\n")

    # Kaggle supports shell-style pip installs.
    # NOTE: TensorFlow/XLA cuDNN/cuBLAS log spam can appear in Kaggle sometimes; it is not fatal.
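    # Quote each requirement: an unquoted ">=" is parsed by the shell as output redirection.
    !pip install -q -U \
      "transformers>=4.41.0" \
      "accelerate>=0.29.0" \
      "datasets>=2.18.0" \
      "peft>=0.10.0" \
      "trl>=0.9.6" \
      "bitsandbytes>=0.43.1" \
      sentencepiece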

    print("\n" + "=" * 70)
    print("ALL PACKAGES INSTALLED")
    print("=" * 70)

    import transformers, accelerate, datasets, peft, trl
    print("\nInstalled versions:")
    print(f"  transformers: {transformers.__version__}")
    print(f"  accelerate: {accelerate.__version__}")
    print(f"  datasets: {datasets.__version__}")
    print(f"  peft: {peft.__version__}")
    print(f"  trl: {trl.__version__}")
    print("=" * 70 + "\n")

In [None]:
# Verify Kaggle environment
import os
from pathlib import Path

print("\n" + "="*70)
print("KAGGLE ENVIRONMENT CHECK")
print("="*70)

# Check paths
kaggle_working = Path("/kaggle/working")
kaggle_input = Path("/kaggle/input")

print(f"\nWorking directory: {kaggle_working}")
print(f"  Exists: {kaggle_working.exists()}")
print(f"  Writable: {os.access(kaggle_working, os.W_OK)}")

print(f"\nInput directory: {kaggle_input}")
print(f"  Exists: {kaggle_input.exists()}")

if kaggle_input.exists():
    # Named input_dirs to avoid shadowing the imported `datasets` module
    input_dirs = list(kaggle_input.iterdir())
    print(f"  Datasets found: {len(input_dirs)}")
    for ds in input_dirs:
        print(f"    - {ds.name}")

# Check GPU
import torch
print(f"\nGPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

print("="*70 + "\n")

## 2. Configuration - Kaggle Optimized Paths

In [None]:
import os  # used below for os.environ
import torch
from pathlib import Path

# ============================================================================
# DATASET CONFIGURATION - Set this BEFORE running
# ============================================================================
# Option 1: Use real dataset (RECOMMENDED for production)
USE_TEST_DATA = False  # Set to True to use synthetic test data for debugging

# Option 2: Custom dataset path (if auto-detection fails)
CUSTOM_DATASET_PATH = None  # Example: "/kaggle/input/your-dataset-name"
# ============================================================================

# Verify CUDA setup AFTER imports
print("\n" + "="*70)
print("CUDA VERIFICATION (Post-Import)")
print("="*70)

if torch.cuda.is_available():
    print(f"\nCUDA Available: Yes")
    print(f"  Device count: {torch.cuda.device_count()}")
    print(f"  Current device: {torch.cuda.current_device()}")
    print(f"  Device name: {torch.cuda.get_device_name(0)}")
    print(f"  Device capability: {torch.cuda.get_device_capability(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    
    # Verify environment variables are still set
    print(f"\nEnvironment check:")
    print(f"  CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'NOT SET')}")
    print(f"  WORLD_SIZE: {os.environ.get('WORLD_SIZE', 'NOT SET')}")
    
    if torch.cuda.device_count() != 1:
        print(f"\nWARNING: {torch.cuda.device_count()} devices visible!")
        print("  This may cause DataParallel issues with quantization")
else:
    print("CUDA not available - will use CPU (slow)")

print("="*70 + "\n")

# KAGGLE PATHS - output is saved automatically
KAGGLE_WORKING = Path("/kaggle/working")
KAGGLE_INPUT = Path("/kaggle/input")

# Output directory - saved automatically as the output of the Kaggle session
OUTPUT_DIR = KAGGLE_WORKING / "unified_model"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Cache directory for Hugging Face models
CACHE_DIR = KAGGLE_WORKING / ".cache"
CACHE_DIR.mkdir(parents=True, exist_ok=True)
os.environ["HF_HOME"] = str(CACHE_DIR)

print("="*70)
print("KAGGLE STORAGE CONFIGURATION")
print("="*70)
print(f"\nOutput directory: {OUTPUT_DIR}")
print(f"  Checkpoints will be saved here")
print(f"  Auto-uploaded as Kaggle output after session")
print(f"\nCache directory: {CACHE_DIR}")
print(f"  Hugging Face models cached here")
print("="*70 + "\n")

# Dataset Configuration
print("="*70)
print("DATASET CONFIGURATION")
print("="*70)

# Check if user wants to use test data
if USE_TEST_DATA:
    print("\nMode: TEST DATA (synthetic)")
    print("WARNING: This is for debugging only, not for production training!")
    DATASET_PATH = None  # Will trigger test data creation
else:
    print("\nMode: REAL DATA (production)")
    
    # Use custom path if provided
    if CUSTOM_DATASET_PATH:
        DATASET_PATH = Path(CUSTOM_DATASET_PATH)
        print(f"Using custom path: {DATASET_PATH}")
    else:
        # Auto-detect dataset path
        DATASET_PATH = None
        if KAGGLE_INPUT.exists():
            # Look for common dataset names
            possible_names = [
                "lexilingo-training-data",
                "unified-training-data", 
                "training-data",
                "lexilingo-dataset",
            ]
            
            for name in possible_names:
                candidate = KAGGLE_INPUT / name
                if candidate.exists():
                    # Check if it has train.jsonl
                    if (candidate / "train.jsonl").exists():
                        DATASET_PATH = candidate
                        print(f"\nAuto-detected dataset: {DATASET_PATH.name}")
                        break
                    # Check subdirectories
                    for subdir in candidate.iterdir():
                        if subdir.is_dir() and (subdir / "train.jsonl").exists():
                            DATASET_PATH = subdir
                            print(f"\nAuto-detected dataset: {DATASET_PATH}")
                            break
                    # The inner break only leaves the subdirectory loop, so stop
                    # scanning further names once a match has been found.
                    if DATASET_PATH is not None:
                        break
            
            if DATASET_PATH is None:
                # List all available datasets
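                # Named available_datasets to avoid shadowing the `datasets` module
                available_datasets = list(KAGGLE_INPUT.iterdir())
                if available_datasets:
                    print(f"\nNo dataset auto-detected")
                    print(f"\nAvailable datasets ({len(available_datasets)}):")
                    for i, ds in enumerate(available_datasets, 1):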
                        print(f"  {i}. {ds.name}")
                        # Check for JSONL files
                        jsonl_files = list(ds.rglob("*.jsonl"))
                        if jsonl_files:
                            print(f"     Found {len(jsonl_files)} .jsonl files:")
                            for jf in jsonl_files[:5]:
                                print(f"       - {jf.relative_to(ds)}")
                    
                    print(f"\nTo use a dataset, set CUSTOM_DATASET_PATH at the top of this cell")
                else:
                    print("\nNo datasets found in /kaggle/input/")
        else:
            print("\nNot running on Kaggle - using local paths")
            # For local testing, use project datasets
            local_dataset = Path("../scripts/downloaded_datasets")
            if local_dataset.exists():
                DATASET_PATH = local_dataset
                print(f"Using local dataset: {DATASET_PATH}")

print("="*70 + "\n")

## 3. Import Libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')

import torch
import pandas as pd
import numpy as np
import json
import signal
import atexit
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Union, Any
from dataclasses import dataclass

# Transformers & PEFT
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig,
    TrainerCallback,
)
from peft import (
    LoraConfig, 
    get_peft_model, 
    TaskType,
    PeftModel,
)
from datasets import Dataset, DatasetDict
from trl import SFTTrainer

print("All libraries imported successfully")

## 4. Checkpoint Management System

In [None]:
class CheckpointManager:
    """Automatic checkpoint management for Kaggle"""
    def __init__(self, output_dir: str):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.state_file = self.output_dir / "training_state.json"
        print(f"CheckpointManager initialized: {self.output_dir}")
    
    def find_latest_checkpoint(self):
        """Find the most recent checkpoint"""
        checkpoints = sorted(
            [d for d in self.output_dir.glob("checkpoint-*") if d.is_dir()],
            key=lambda x: int(x.name.split("-")[-1])
        )
        return str(checkpoints[-1]) if checkpoints else None
    
    def list_all_checkpoints(self):
        """List all checkpoints"""
        checkpoints = sorted(
            [d for d in self.output_dir.glob("checkpoint-*") if d.is_dir()],
            key=lambda x: int(x.name.split("-")[-1])
        )
        return [{"path": str(cp), "step": int(cp.name.split("-")[-1])} for cp in checkpoints]
    
    def save_training_state(self, **kwargs):
        """Save the training state"""
        state = {
            "last_update": datetime.now().isoformat(),
            "platform": "kaggle",
            **kwargs
        }
        with open(self.state_file, 'w') as f:
            json.dump(state, f, indent=2)
        print(f"Saved training state: {self.state_file}")
    
    def load_training_state(self):
        """Load training state"""
        if self.state_file.exists():
            with open(self.state_file, 'r') as f:
                return json.load(f)
        return None
    
    def print_status(self):
        """Print checkpoint status"""
        checkpoints = self.list_all_checkpoints()
        state = self.load_training_state()
        
        print("\n" + "="*70)
        print("CHECKPOINT STATUS (Kaggle)")
        print("="*70)
        
        if checkpoints:
            print(f"Found {len(checkpoints)} checkpoint(s)")
            print(f"\nLatest: {checkpoints[-1]['path']}")
            
            if state:
                print(f"\nTraining State:")
                for key, value in state.items():
                    print(f"  - {key}: {value}")
        else:
            print("No checkpoints found - training from scratch")
        
        print("="*70 + "\n")

# Initialize checkpoint manager
checkpoint_mgr = CheckpointManager(OUTPUT_DIR)
checkpoint_mgr.print_status()

## 5. Graceful Shutdown Handler

In [None]:
class GracefulShutdownHandler:
    """Automatically save a checkpoint when the session is interrupted"""
    def __init__(self):
        self.trainer = None
        self.model = None
        self.emergency_save_path = OUTPUT_DIR / "emergency_checkpoint"
        
        # Register handlers
        signal.signal(signal.SIGINT, self._signal_handler)
        signal.signal(signal.SIGTERM, self._signal_handler)
        atexit.register(self._emergency_save)
        print("GracefulShutdownHandler activated")
    
    def register_trainer(self, trainer, model):
        """Register the trainer so it can be saved when needed"""
        self.trainer = trainer
        self.model = model
    
    def _signal_handler(self, signum, frame):
        """Handle SIGINT/SIGTERM by saving an emergency checkpoint"""
        print(f"\n\nReceived signal {signum} - saving checkpoint...")
        if self.trainer is not None:
            try:
                self.trainer.save_model(str(self.emergency_save_path))
                print(f"Emergency checkpoint saved: {self.emergency_save_path}")
            except Exception as e:
                # Report the failure instead of swallowing it silently
                print(f"Emergency save failed: {e}")
        raise SystemExit(0)
    
    def _emergency_save(self):
        """Save the model on abnormal exit"""
        if self.trainer is not None and self.model is not None:
            if not self.emergency_save_path.exists():
                print("\nEmergency exit - saving final checkpoint...")
                try:
                    self.emergency_save_path.mkdir(parents=True, exist_ok=True)
                    self.model.save_pretrained(str(self.emergency_save_path))
                    print(f"Final checkpoint saved: {self.emergency_save_path}")
                except Exception as e:
                    # Report the failure instead of swallowing it silently
                    print(f"Final checkpoint save failed: {e}")

# Initialize shutdown handler
shutdown_handler = GracefulShutdownHandler()

## 6. Load Dataset

### Dataset Preparation Guide

**IMPORTANT: You must have a dataset before training!**

---

### OPTION 1: Use Test Data (Quick Start - Debugging Only)

**Fastest way to test the pipeline:**

1. Scroll up to **Cell 7** (Configuration - Kaggle Optimized Paths)
2. Find the line: `USE_TEST_DATA = False`
3. Change to: `USE_TEST_DATA = True`
4. Re-run Cell 7
5. Run this cell (Cell 15)

**WARNING:** Test data is synthetic and only for debugging. Not suitable for production!

---

### OPTION 2: Upload Real Dataset (Recommended for Production)

**Required Format:** JSONL files with `input` and `output` fields

Example lines in train.jsonl / val.jsonl:
```json
{"input": "Rate fluency: The cat sat on the mat.", "output": "Fluency Score: 0.85"}
{"input": "Classify level: I am happy today.", "output": "Vocabulary Level: A1"}
{"input": "Correct: He go to school yesterday.", "output": "Corrected: He went to school yesterday."}
{"input": "User: What is your name?", "output": "Assistant: I am LexiLingo, an AI assistant."}
```
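
Before uploading, it can help to check that every line matches this format. The sketch below is one way to do that with the standard library only; `validate_jsonl_lines` is a hypothetical helper, not part of the notebook, and it assumes that both `input` and `output` must be strings, as in the example lines above.

```python
import json
from typing import List, Tuple


def validate_jsonl_lines(lines: List[str]) -> Tuple[int, List[str]]:
    """Count valid records and collect human-readable problems."""
    valid, problems = 0, []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # blank lines are ignored
        try:
            record = json.loads(text)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [k for k in ("input", "output") if not isinstance(record.get(k), str)]
        if missing:
            problems.append(f"line {number}: missing or non-string field(s): {', '.join(missing)}")
        else:
            valid += 1
    return valid, problems


sample = [
    '{"input": "Classify level: I am happy today.", "output": "Vocabulary Level: A1"}',
    '{"input": "Correct: He go to school yesterday."}',
    'not json',
]
valid_count, issues = validate_jsonl_lines(sample)
print(f"{valid_count} valid record(s), {len(issues)} problem(s)")
for issue in issues:
    print(" ", issue)
```

For a real file, the lines could be read with `open(path, encoding="utf-8")` and passed to the same function.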

**Steps to Upload:**

1. **Create Dataset on Kaggle:**
   - Go to [kaggle.com/datasets](https://www.kaggle.com/datasets)
   - Click "New Dataset"
   - Upload `train.jsonl` and `val.jsonl`
   - Name it (e.g., "lexilingo-training-data")
   - Click "Create"

2. **Add to This Notebook:**
   - In this notebook: Right sidebar -> Settings (gear icon)
   - Scroll to "Data" section  
   - Click "+ Add Data"
   - Search your dataset name
   - Click "Add"

3. **Run This Cell:**
   - Notebook will auto-detect your dataset
   - Check output to confirm successful loading

---

### Troubleshooting

**If auto-detection fails:**
- Set `CUSTOM_DATASET_PATH` in Cell 7
- Example: `CUSTOM_DATASET_PATH = "/kaggle/input/your-dataset-name"`

In [None]:
def create_test_dataset(num_train=100, num_val=20):
    """Create a small test dataset for debugging/testing"""
    print("\n" + "="*70)
    print("CREATING TEST DATASET")
    print("="*70)
    print("\nWARNING: Using synthetic test data for demonstration")
    print("For real training, upload your actual dataset to Kaggle\n")
    
    # Sample data for 4 tasks
    test_data = {
        "fluency": [
            {"input": "Rate fluency: The cat sits on mat.", "output": "Fluency Score: 0.85"},
            {"input": "Rate fluency: Me go store buy things.", "output": "Fluency Score: 0.45"},
        ],
        "vocabulary": [
            {"input": "Classify level: I am happy.", "output": "Vocabulary Level: A1"},
            {"input": "Classify level: The implementation demonstrates sophisticated algorithms.", "output": "Vocabulary Level: C2"},
        ],
        "grammar": [
            {"input": "Correct: He go to school yesterday.", "output": "Corrected: He went to school yesterday."},
            {"input": "Correct: She have been working here since 2020.", "output": "Corrected: She has been working here since 2020."},
        ],
        "dialogue": [
            {"input": "User: What's the weather like?", "output": "Assistant: I'd be happy to help, but I don't have access to real-time weather data. Please check a weather service."},
            {"input": "User: How are you?", "output": "Assistant: I'm functioning well, thank you for asking! How can I assist you today?"},
        ],
    }
    
    # Generate training data by cycling through each task's examples
    train_samples = []
    for i in range(num_train):
        for examples in test_data.values():
            train_samples.append(examples[i % len(examples)])
    
    # Generate validation data the same way
    val_samples = []
    for i in range(num_val):
        for examples in test_data.values():
            val_samples.append(examples[i % len(examples)])
    
    train_dataset = Dataset.from_list(train_samples)
    val_dataset = Dataset.from_list(val_samples)
    
    print(f"Test dataset created:")
    print(f"  Train: {len(train_dataset)} samples")
    print(f"  Val: {len(val_dataset)} samples")
    print(f"\nSample entry:")
    print(f"  Input: {train_samples[0]['input']}")
    print(f"  Output: {train_samples[0]['output']}")
    print("="*70 + "\n")
    
    return train_dataset, val_dataset


def load_kaggle_dataset(dataset_path: Path):
    """Load the dataset from Kaggle input"""
    
    print("\n" + "="*70)
    print("LOADING DATASET")
    print("="*70)
    
    # Try to find JSONL files
    train_file = dataset_path / "train.jsonl"
    val_file = dataset_path / "val.jsonl"
    
    if not train_file.exists() or not val_file.exists():
        # Try alternative paths
        for possible_path in dataset_path.rglob("train.jsonl"):
            train_file = possible_path
            val_file = possible_path.parent / "val.jsonl"
            break
    
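    # Both files are opened below, so fail early with a clear message if either is missing
    if not train_file.exists() or not val_file.exists():
        raise FileNotFoundError(
            f"Cannot find train.jsonl and val.jsonl in {dataset_path}\n"
            "Please upload your dataset to Kaggle and update DATASET_PATH"
        )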
    
    print(f"\nLoading from:")
    print(f"  Train: {train_file}")
    print(f"  Val: {val_file}")
    
    # Load datasets
    train_data = []
    val_data = []
    
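    with open(train_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, which json.loads would reject
                train_data.append(json.loads(line))
    
    with open(val_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, which json.loads would reject
                val_data.append(json.loads(line))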
    
    train_dataset = Dataset.from_list(train_data)
    val_dataset = Dataset.from_list(val_data)
    
    print(f"\nDataset loaded:")
    print(f"  Train: {len(train_dataset)} samples")
    print(f"  Val: {len(val_dataset)} samples")
    print("="*70 + "\n")
    
    return train_dataset, val_dataset


# ============================================================================
# LOAD DATASET
# ============================================================================
train_dataset = None
val_dataset = None

# Check if user enabled test data mode (set at top of config cell)
if USE_TEST_DATA:
    print("\n" + "="*70)
    print("USING TEST DATA MODE")
    print("="*70)
    print("Generating synthetic test dataset for debugging...")
    train_dataset, val_dataset = create_test_dataset(num_train=200, num_val=40)
    print("WARNING: This is synthetic data - not suitable for production training!")
    print("="*70 + "\n")

# Otherwise try to load real dataset
elif DATASET_PATH is not None:
    try:
        train_dataset, val_dataset = load_kaggle_dataset(DATASET_PATH)
    except Exception as e:
        print(f"\nERROR loading dataset: {e}\n")
        train_dataset = None
        val_dataset = None

# If no dataset available, show instructions
if train_dataset is None or val_dataset is None:
    print("\n" + "="*70)
    print("NO DATASET AVAILABLE")
    print("="*70)
    print("\nYou have 2 options:")
    print()
    print("OPTION 1: Use real dataset (RECOMMENDED)")
    print("  1. Go to Kaggle.com -> Datasets -> New Dataset")
    print("  2. Upload your train.jsonl and val.jsonl files")
    print("  3. Add dataset to this notebook (Settings -> Data)")
    print("  4. Re-run this cell")
    print()
    print("OPTION 2: Use test data for debugging")
    print("  1. Go to the Configuration cell (Cell 7)")
    print("  2. Find the line: USE_TEST_DATA = False")
    print("  3. Change it to: USE_TEST_DATA = True")
    print("  4. Re-run Configuration cell and this cell")
    print("  WARNING: Test data is synthetic - only for debugging!")
    print()
    print("="*70 + "\n")

## 7. Model Configuration

In [None]:
# HIGH QUALITY config (stable on Kaggle P100/T4)
# Uses FP16 on pre-Ampere GPUs to avoid bitsandbytes crashes; BF16 only on Ampere+

# Verify single-GPU mode and configure matmul kernels
if torch.cuda.is_available():
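    # Pin this process to device 0 (the only GPU left visible by the environment setup)
    torch.cuda.set_device(0)
    # TF32 only takes effect on Ampere or newer GPUs; the flag is harmless on P100/T4
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True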
    
    print("\n" + "="*70)
    print("CUDA CONFIGURATION")
    print("="*70)
    print(f"Single GPU mode active (device 0)")
    print(f"Available devices: {torch.cuda.device_count()}")
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    print("="*70 + "\n")

def _detect_compute_dtype():
    if not torch.cuda.is_available():
        return torch.float32, False
    major, minor = torch.cuda.get_device_capability(0)
    supports_bf16 = major >= 8  # Ampere (8.x) or newer
    return (torch.bfloat16 if supports_bf16 else torch.float16), supports_bf16

COMPUTE_DTYPE, USE_BF16 = _detect_compute_dtype()

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # Full quality model

# Quantization config (4-bit) - optimized for single GPU
QUANTIZATION_CONFIG = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=COMPUTE_DTYPE,
    bnb_4bit_use_double_quant=True,
    # Additional stability settings
    llm_int8_skip_modules=None,
    llm_int8_enable_fp32_cpu_offload=False,
)

# Unified LoRA configuration - HIGH QUALITY (balanced)
UNIFIED_LORA_CONFIG = {
    "task_type": TaskType.CAUSAL_LM,
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "inference_mode": False,
}

# Training configuration - stability first with EXPLICIT single-GPU settings
if torch.cuda.is_available():
    TRAINING_CONFIG = {
        "output_dir": str(OUTPUT_DIR),
        "num_train_epochs": 5,
        "per_device_train_batch_size": 1,
        "per_device_eval_batch_size": 1,
        "gradient_accumulation_steps": 24,
        "learning_rate": 2e-4,
        "weight_decay": 0.01,
        "max_grad_norm": 1.0,
        "lr_scheduler_type": "cosine",
        "warmup_ratio": 0.1,
        "logging_steps": 10,
        "eval_strategy": "steps",
        "eval_steps": 100,
        "save_strategy": "steps",
        "save_steps": 100,
        "save_total_limit": 3,
        "load_best_model_at_end": True,
        "metric_for_best_model": "eval_loss",
        "greater_is_better": False,
        "fp16": (not USE_BF16),
        "bf16": USE_BF16,
        "gradient_checkpointing": True,
        "optim": "adamw_8bit",
        "report_to": "none",
        "seed": 42,
        # CRITICAL: Disable all distributed/parallel training
        "ddp_find_unused_parameters": False,
        "dataloader_pin_memory": False,  # Reduces memory pressure
        "dataloader_num_workers": 0,  # Single-threaded data loading for stability
        "local_rank": -1,  # Disable distributed training
        "no_cuda": False,
    }
else:
    # CPU fallback
    TRAINING_CONFIG = {
        "output_dir": str(OUTPUT_DIR),
        "num_train_epochs": 2,
        "per_device_train_batch_size": 1,
        "per_device_eval_batch_size": 1,
        "gradient_accumulation_steps": 24,
        "learning_rate": 2e-4,
        "logging_steps": 10,
        "eval_strategy": "steps",
        "eval_steps": 100,
        "save_strategy": "steps",
        "save_steps": 100,
        "save_total_limit": 2,
        "report_to": "none",
        "seed": 42,
        "dataloader_num_workers": 0,
    }

gpu_info = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print("\n" + "="*70)
print("HIGH QUALITY MODEL CONFIGURATION - STABLE")
print("="*70)
print(f"Model: {MODEL_NAME}")
print(f"GPU: {gpu_info}")
print(f"Compute dtype: {COMPUTE_DTYPE}")
print(f"bf16 enabled: {USE_BF16}")
print(f"Batch size: {TRAINING_CONFIG['per_device_train_batch_size']}")
print(f"Grad accumulation: {TRAINING_CONFIG['gradient_accumulation_steps']}")
print(f"Effective batch: {TRAINING_CONFIG['per_device_train_batch_size'] * TRAINING_CONFIG['gradient_accumulation_steps']}")
print(f"Epochs: {TRAINING_CONFIG['num_train_epochs']}")
print(f"Learning rate: {TRAINING_CONFIG['learning_rate']}")
print(f"\nSINGLE GPU MODE ENFORCED - No DataParallel")
print("="*70 + "\n")

In [None]:
# Performance estimator for HIGH QUALITY config
def estimate_training_time():
    """Estimate training time for current config"""
    
    # Model parameters
    model_size = 1.5e9  # 1.5B params
    lora_rank = UNIFIED_LORA_CONFIG['r']
    
    # Training params
    train_size = len(train_dataset) if train_dataset else 10000
    batch_size = TRAINING_CONFIG['per_device_train_batch_size']
    grad_accum = TRAINING_CONFIG['gradient_accumulation_steps']
    epochs = TRAINING_CONFIG['num_train_epochs']
    
    # Calculate steps
    effective_batch = batch_size * grad_accum
    steps_per_epoch = train_size // effective_batch
    total_steps = steps_per_epoch * epochs
    
    # Estimate time per step (seconds)
    # 1.5B model with LoRA r=32 on P100/T4
    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(0)
        if 'P100' in gpu_name:
            time_per_step = 3.5  # seconds
            gpu_type = "P100"
        elif 'T4' in gpu_name:
            time_per_step = 4.5  # seconds
            gpu_type = "T4"
        else:
            time_per_step = 5.0  # conservative estimate
            gpu_type = "Unknown GPU"
    else:
        time_per_step = 60  # CPU is very slow
        gpu_type = "CPU"
    
    total_time_sec = total_steps * time_per_step
    total_time_hours = total_time_sec / 3600
    
    print("\n" + "="*70)
    print("TRAINING TIME ESTIMATION - HIGH QUALITY CONFIG")
    print("="*70)
    print(f"\nDataset & Configuration:")
    print(f"  Training samples: {train_size:,}")
    print(f"  Batch size: {batch_size}")
    print(f"  Gradient accumulation: {grad_accum}")
    print(f"  Effective batch size: {effective_batch}")
    print(f"  Epochs: {epochs}")
    print(f"\nTraining Steps:")
    print(f"  Steps per epoch: {steps_per_epoch:,}")
    print(f"  Total steps: {total_steps:,}")
    print(f"\nHardware:")
    print(f"  Device: {gpu_type}")
    print(f"  Estimated time per step: {time_per_step:.2f}s")
    print(f"\nEstimated Total Time:")
    print(f"  {total_time_hours:.1f} hours ({total_time_hours/24:.1f} days)")
    print(f"\nCheckpoints:")
    print(f"  Saved every {TRAINING_CONFIG['save_steps']} steps")
    print(f"  Total checkpoints: ~{total_steps // TRAINING_CONFIG['save_steps']}")
    print(f"  Keep best {TRAINING_CONFIG['save_total_limit']} checkpoints")
    print("="*70 + "\n")
    
    # Quality notes
    print("QUALITY vs SPEED COMPARISON:")
    print("-" * 70)
    print("Config          | Model  | Rank | Time   | Quality")
    print("-" * 70)
    print("FAST (0.5B)     | 0.5B   | 16   | ~3-4h  | Good")
    print("HIGH QUALITY    | 1.5B   | 32   | ~6-10h | Excellent (CURRENT)")
    print("MAX QUALITY     | 1.5B   | 48   | ~8-12h | Best")
    print("-" * 70)
    print("\nCurrent config: Balanced HIGH QUALITY for production use")
    print("="*70 + "\n")

# Run estimator if dataset is loaded
if 'train_dataset' in globals() and train_dataset is not None:
    estimate_training_time()
else:
    print("\nTraining estimator will run after dataset is loaded")
    print("Expected training time: 6-10 hours on P100/T4")
    print("This is HIGH QUALITY configuration for production deployment\n")

## Configuration Notes - HIGH QUALITY

### Current Configuration Summary

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Model | Qwen2.5-1.5B | Full quality, 3x larger than 0.5B |
| LoRA rank | 32 | Balanced (not too low, not too high) |
| LoRA alpha | 64 | 2x rank ratio |
| Dropout | 0.05 | Lower for production quality |
| Batch size | 2 | Fits in 15-16GB GPU |
| Grad accum | 12 | Effective batch = 24 |
| Epochs | 5 | Better convergence |
| Learning rate | 2e-4 | Stable training |
| Training time (P100) | ~6-8h | High quality |
| Training time (T4) | ~8-10h | High quality |
| GPU Memory | ~8-10 GB | 4-bit quantization |
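
The step count behind these time figures follows directly from the batch settings in the table. The sketch below mirrors the floor-division arithmetic of the estimator cell; the 10,000-sample dataset size and the 3.5 s/step figure are placeholder assumptions taken from the estimator's fallback values, not measurements.

```python
def estimate_training(train_size: int, batch_size: int, grad_accum: int,
                      epochs: int, seconds_per_step: float) -> dict:
    """Return step counts and a rough wall-clock estimate in hours."""
    effective_batch = batch_size * grad_accum        # samples per optimizer step
    steps_per_epoch = train_size // effective_batch  # floor division, as in the estimator
    total_steps = steps_per_epoch * epochs
    hours = total_steps * seconds_per_step / 3600
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "hours": hours,
    }


# Table values: batch size 2, gradient accumulation 12, 5 epochs.
# train_size=10_000 and 3.5 s/step are placeholder assumptions.
est = estimate_training(10_000, 2, 12, 5, 3.5)
print(est)
```

The estimate scales linearly with dataset size, so the actual training time depends mostly on how many samples are loaded.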

### Why This Configuration?

1. **Model Size (1.5B)**
   - 3x more parameters than 0.5B
   - Significantly better quality
   - Still fits in Kaggle GPU with 4-bit quantization
   - Production-ready accuracy

2. **LoRA Rank 32**
   - Sweet spot between quality and efficiency
   - Better than r=16 (fast config)
   - More efficient than r=48 (max config)
   - A common middle ground for LoRA fine-tuning

3. **Training Strategy**
   - 5 epochs for good convergence
   - Learning rate of 2e-4 for stable training
   - More frequent checkpoints (every 100 steps)
   - Keep 3 best checkpoints

4. **Quality vs Speed**
   - 2-3x slower than fast config
   - ~15-20% better accuracy
   - Better response quality
   - Suitable for production
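
To make the rank trade-off concrete: LoRA adds two low-rank matrices to each adapted linear layer, so the number of trainable parameters grows linearly with the rank. A rough sketch follows; the layer width, layer count, and number of adapted projections are illustrative assumptions, not values read from `UNIFIED_LORA_CONFIG`.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape d_in x d_out.

    LoRA learns A (rank x d_in) and B (d_out x rank), so the count is
    rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Hypothetical setup: 28 transformer blocks, each with two adapted
# square projections of width 1536 (assumed for illustration only).
HIDDEN = 1536
NUM_LAYERS = 28
ADAPTED_PER_LAYER = 2

for rank in (16, 32, 48):
    total = NUM_LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, rank)
    print(f"r={rank}: {total:,} trainable parameters")
```

Doubling the rank doubles the adapter size, which is why r=48 costs more memory and time than r=32.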

### Kaggle GPU Requirements

**P100 (16GB) - Recommended:**
- Fits comfortably with 4-bit quantization
- Training time: ~6-8h
- Can handle batch_size=2 easily

**T4 (15GB) - Works well:**
- Slightly tighter memory
- Training time: ~8-10h
- Keep batch_size=2 (stable)

**Memory Usage Breakdown:**
- Base model (4-bit): ~3-4 GB
- LoRA adapters: ~1-2 GB
- Optimizer states: ~2-3 GB
- Activations: ~2-3 GB
- Total: ~8-10 GB

### Quality Comparison

**Fast Config (0.5B, r=16):**
- Training: 3-4h
- Quality: Good for development
- Use case: Quick iterations, testing

**High Quality Config (1.5B, r=32) - CURRENT:**
- Training: 6-10h
- Quality: Excellent for production
- Use case: Final deployment, customer-facing

**Max Quality Config (1.5B, r=48):**
- Training: 8-12h
- Quality: Best possible
- Use case: Research, maximum accuracy needed

### Production Deployment

This HIGH QUALITY config is optimized for:
- Real-world applications
- Customer-facing products
- Production environments
- Balance of quality and efficiency

The model will be:
- More accurate in fluency scoring
- Better at vocabulary classification
- More reliable in grammar correction
- More natural in dialogue generation

### Tips for Success

1. **Enable Internet** in Kaggle Settings (required to install TRL and download the base model)
2. **Upload dataset** as Kaggle dataset
3. **Monitor training** - checkpoints saved every 100 steps
4. **Resume capability** - can continue from any checkpoint
5. **Keep best model** - automatically saves 3 best checkpoints

### Estimated Results

Based on 1.5B model with r=32:
- Fluency scoring: ~85-90% accuracy
- Vocabulary classification: ~80-85% accuracy
- Grammar correction: ~75-80% GLEU score
- Dialogue quality: Excellent coherence

These are estimates rather than measured results; confirm them on held-out data before treating the model as production-ready.
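
As a minimal sketch of how the vocabulary classification figure could be checked after training, accuracy is the share of predictions that match the reference label. The labels below are toy values made up for illustration; they are not results from this model.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Toy CEFR labels, made up for illustration only.
expected = ["A1", "B2", "C1", "A2", "B1"]
predicted = ["A1", "B2", "B2", "A2", "B1"]
print(f"accuracy: {accuracy(predicted, expected):.0%}")
```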

## 8. Load Model & Tokenizer

## CRITICAL FIX: Vocab Size & Special Tokens

Problem Solved: CUDA device-side assert caused by special tokens exceeding vocabulary size.

### Root Cause:
- Tokenizer vocab_size: 151643
- Model vocab_size: 151936
- Special tokens: pad_token_id=151643, eos_token_id=151645
- Both special token IDs fall outside the tokenizer's base range [0, 151642], because `tokenizer.vocab_size` does not count added special tokens

### Solution Applied:
The code now automatically:
1. Detects the maximum special token ID
2. Calculates required vocab size: max(tokenizer_vocab, model_vocab, max_special_token + 1)
3. Resizes the model embeddings if they are smaller than the required size
4. Validates all special tokens are within range

### Result:
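- With the values above, the required size is max(151643, 151936, 151646) = 151936, so the model's embeddings already cover eos_token_id=151645 and no resize is triggered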
- All token IDs now within valid range
- No more CUDA errors during generation

After restarting the kernel, run the cells in order; the fix is applied automatically.
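
The size calculation described under Solution Applied can be sketched in isolation. The function name is hypothetical; the numbers are the ones listed under Root Cause, with the BOS token assumed to be unset.

```python
from typing import Iterable, Optional


def required_vocab_size(tokenizer_vocab: int, model_vocab: int,
                        special_token_ids: Iterable[Optional[int]]) -> int:
    """Smallest embedding size that covers every token ID the model may see."""
    ids = [i for i in special_token_ids if i is not None]
    max_special = max(ids, default=-1)
    return max(tokenizer_vocab, model_vocab, max_special + 1)


size = required_vocab_size(
    tokenizer_vocab=151643,
    model_vocab=151936,
    special_token_ids=[151643, 151645, None],  # pad, eos, bos (assumed unset)
)
print(size)
```

A resize is needed only when this value differs from the model's current embedding size.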

In [None]:
print("Loading model and tokenizer...")
print(f"Model: {MODEL_NAME}")
print(f"Cache: {CACHE_DIR}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    cache_dir=str(CACHE_DIR),
)

# Ensure pad token exists
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

print("\nTokenizer loaded:")
print(f"  vocab_size: {tokenizer.vocab_size}")
print(f"  pad_token_id: {tokenizer.pad_token_id}")
print(f"  eos_token_id: {tokenizer.eos_token_id}")

# CRITICAL: For 4-bit quantization, MUST use explicit device pinning (no 'auto')
# This prevents DataParallel from being used which causes CUDA errors with quantization
if torch.cuda.is_available():
    device_map = {"": 0}  # Pin to GPU 0 explicitly
    print(f"\nDevice map: {device_map} (pinned to GPU 0)")
else:
    device_map = {"": "cpu"}
    print(f"\nDevice map: {device_map}")

# Load base model (4-bit). Keep torch_dtype consistent with bitsandbytes compute dtype.
print("\nLoading model with 4-bit quantization...")
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=QUANTIZATION_CONFIG,
    device_map=device_map,  # Explicit device pinning
    trust_remote_code=True,
    cache_dir=str(CACHE_DIR),
    torch_dtype=COMPUTE_DTYPE if torch.cuda.is_available() else None,
    # Lowers peak CPU RAM during loading; single-GPU placement comes from device_map above
    low_cpu_mem_usage=True,
)

print("\nModel loaded:")
print(f"  model vocab_size: {base_model.config.vocab_size}")
print(f"  embedding rows: {base_model.get_input_embeddings().weight.shape[0]}")
print(f"  device: {next(base_model.parameters()).device}")

# CRITICAL FIX: resize embeddings to include ALL special token IDs
special_token_ids = [
    tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 0,
    tokenizer.bos_token_id if tokenizer.bos_token_id is not None else 0,
    tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0,
]
max_special_token_id = max(special_token_ids)
required_vocab_size = max(tokenizer.vocab_size, base_model.config.vocab_size, max_special_token_id + 1)

print("\nVocab alignment:")
print(f"  max_special_token_id: {max_special_token_id}")
print(f"  required_vocab_size: {required_vocab_size}")

if base_model.config.vocab_size != required_vocab_size:
    print(f"Resizing embeddings: {base_model.config.vocab_size} -> {required_vocab_size}")
    base_model.resize_token_embeddings(required_vocab_size)
    base_model.config.vocab_size = required_vocab_size

# Final validation
if tokenizer.pad_token_id >= base_model.config.vocab_size or tokenizer.eos_token_id >= base_model.config.vocab_size:
    raise ValueError(
        "Special tokens out of range after resize. "
        f"pad_token_id={tokenizer.pad_token_id}, eos_token_id={tokenizer.eos_token_id}, model_vocab={base_model.config.vocab_size}"
    )

# Gradient checkpointing + required grads for 4-bit LoRA
base_model.gradient_checkpointing_enable()
base_model.config.use_cache = False
base_model.enable_input_require_grads()

print("\nModel configured:")
print(f"  device: {next(base_model.parameters()).device}")
print(f"  compute dtype: {COMPUTE_DTYPE}")
print(f"  model vocab_size: {base_model.config.vocab_size}")
print(f"  tokenizer vocab_size: {tokenizer.vocab_size}")
print(f"  gradient checkpointing: enabled")
print(f"  use_cache: {base_model.config.use_cache}")

# Verify model is NOT wrapped in DataParallel
if hasattr(base_model, 'module'):
    raise RuntimeError("Model is wrapped in DataParallel/DDP - this will cause errors with quantization!")
    
print("\nModel verified: Single GPU mode, no DataParallel wrapper")

## 9. Apply LoRA Adapter

In [None]:
# Create LoRA config
lora_config = LoraConfig(**UNIFIED_LORA_CONFIG)

# Apply LoRA to base model
model = get_peft_model(base_model, lora_config)

# Ensure model is in training mode
model.train()

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print("\n" + "="*70)
print("LoRA ADAPTER APPLIED")
print("="*70)
print(f"\nTrainable params: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
print(f"Total params: {total_params:,}")
print(f"Frozen params: ~{100 * (1 - trainable_params / total_params):.1f}% of total")
print("="*70 + "\n")

## 10. Training Function

In [None]:
def finetune_unified_adapter(
    train_dataset: Dataset,
    eval_dataset: Dataset,
    lora_config: dict,
    resume_from_checkpoint: Union[str, bool] = "auto",
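):
    """Fine-tune the unified adapter on Kaggle.

    Trains the module-level `model` (LoRA already applied); the
    `lora_config` argument is currently unused.
    """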
    
    print("\n" + "="*70)
    print("STARTING TRAINING")
    print("="*70)
    
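    # CRITICAL: Verify single GPU mode before training
    if torch.cuda.is_available():
        if torch.cuda.device_count() > 1:
            print(f"\nWARNING: {torch.cuda.device_count()} GPUs detected")
            print("CUDA_VISIBLE_DEVICES must be set before torch initializes CUDA;")
            print("if training fails, restart the kernel and re-run the setup cell first")
            # Setting it here only affects child processes, not this session
            os.environ["CUDA_VISIBLE_DEVICES"] = "0"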
        
        print(f"\nCUDA Device Check:")
        print(f"  Available devices: {torch.cuda.device_count()}")
        print(f"  Current device: {torch.cuda.current_device()}")
        print(f"  Device name: {torch.cuda.get_device_name(0)}")
    
    # Check resume
    if resume_from_checkpoint == "auto":
        resume_checkpoint = checkpoint_mgr.find_latest_checkpoint()
        if resume_checkpoint:
            print(f"\nResuming from: {resume_checkpoint}")
        else:
            print("\nNo checkpoint found - training from scratch")
            resume_checkpoint = None
    elif resume_from_checkpoint:
        resume_checkpoint = resume_from_checkpoint
        print(f"\nResuming from: {resume_checkpoint}")
    else:
        resume_checkpoint = None
        print("\nTraining from scratch")
    
    # Pre-format dataset to "text" column (required by SFTTrainer)
    def format_example(example):
        return {"text": f"{example['input']}\n\n{example['output']}"}
    
    print("\nFormatting datasets...")
    formatted_train = train_dataset.map(format_example, remove_columns=train_dataset.column_names)
    formatted_eval = eval_dataset.map(format_example, remove_columns=eval_dataset.column_names)
    print(f"  Train: {len(formatted_train)} samples")
    print(f"  Eval: {len(formatted_eval)} samples")
    
    # Training arguments
    training_args = TrainingArguments(**TRAINING_CONFIG)
    
    # CRITICAL: Verify training args prevent multi-GPU
    if hasattr(training_args, 'local_rank') and training_args.local_rank != -1:
        print(f"\nWARNING: local_rank={training_args.local_rank}, forcing to -1")
        training_args.local_rank = -1
    
    # Create trainer with pre-formatted dataset
    print("\nInitializing SFTTrainer...")
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=formatted_train,
        eval_dataset=formatted_eval,
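        # Packing is disabled so every example remains its own sequence.
        # Note: older TRL releases accept these keyword arguments here;
        # newer releases expect them on SFTConfig instead.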
        packing=False,
        max_seq_length=512,  # Explicit max length
        dataset_text_field="text",
    )
    
    # CRITICAL: Verify model is not wrapped in DataParallel
    if hasattr(trainer.model, 'module'):
        raise RuntimeError(
            "CRITICAL ERROR: Model is wrapped in DataParallel!\n"
            "This causes 'illegal memory access' errors with 4-bit quantization.\n"
            "Please restart kernel and ensure CUDA_VISIBLE_DEVICES=0 is set."
        )
    
    # Register with the shutdown handler
    shutdown_handler.register_trainer(trainer, model)
    
    # Train
    print("\n" + "="*70)
    print("TRAINING IN PROGRESS")
    print("="*70)
    print(f"Output dir: {OUTPUT_DIR}")
    print(f"Checkpoints saved every {TRAINING_CONFIG['save_steps']} steps")
    print(f"Model device: {next(model.parameters()).device}")
    print(f"Single GPU mode: enabled")
    print("="*70 + "\n")
    
    trainer.train(resume_from_checkpoint=resume_checkpoint)
    
    # Save final model
    final_path = OUTPUT_DIR / "unified_lora_adapter"
    final_path.mkdir(parents=True, exist_ok=True)
    
    print("\n" + "="*70)
    print("SAVING FINAL MODEL")
    print("="*70)
    model.save_pretrained(str(final_path))
    tokenizer.save_pretrained(str(final_path))
    print(f"\nFinal model saved: {final_path}")
    
    # Save training state
    checkpoint_mgr.save_training_state(
        final_model=str(final_path),
        completed=True,
        total_steps=trainer.state.global_step,
    )
    
    print("="*70 + "\n")
    
    return model, trainer

print("Training function ready - Single GPU mode enforced")

## 11. RUN TRAINING

**BEFORE RUNNING:** Make sure you have loaded a dataset!

### Quick Checklist:
- Internet enabled in Kaggle settings  
- Dataset uploaded to Kaggle and added to notebook  
- Dataset loaded successfully (check cell 6 output)  
- Model and LoRA adapter configured  

### What Happens:
- **Checkpoints:** Auto-saved every 100 steps to `/kaggle/working/unified_model/`
- **Output:** All files in `/kaggle/working/` saved as Kaggle output after session
- **Resume:** Upload previous output as input, notebook auto-resumes from latest checkpoint
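
Auto-resume depends on locating the newest checkpoint folder. The notebook delegates this to `checkpoint_mgr.find_latest_checkpoint()`; the function below is only a sketch of the idea, assuming the `checkpoint-<step>` directory names that the Hugging Face `Trainer` writes.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step, if any."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            # Compare steps numerically so checkpoint-1000 beats checkpoint-900
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Returns None when the directory is missing or holds no checkpoints.
print(latest_checkpoint("/kaggle/working/unified_model"))
```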

### Expected Time:
- **P100/T4:** 6-10 hours (HIGH QUALITY config)
- **CPU:** Not recommended (very slow)

### If Dataset Not Loaded:
Cell will show detailed instructions on how to upload and configure your dataset.
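
Before uploading, it can help to confirm that every line of the JSONL files parses and carries the `input` and `output` fields that the training function reads. A minimal sketch (the helper name is hypothetical):

```python
import json
from typing import Iterable, List

REQUIRED_KEYS = ("input", "output")


def find_bad_lines(lines: Iterable[str]) -> List[str]:
    """Describe each line that is not a valid {"input": ..., "output": ...} record."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [k for k in REQUIRED_KEYS if not isinstance(record.get(k), str)]
        if missing:
            problems.append(f"line {number}: missing or non-string {missing}")
    return problems


sample = [
    '{"input": "Correct: She go to school.", "output": "She goes to school."}',
    '{"input": "No output field"}',
]
print(find_bad_lines(sample))
```

To check a real file, pass an open file handle, for example `find_bad_lines(open("train.jsonl", encoding="utf-8"))`.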

In [None]:
# Verify dataset is loaded
if train_dataset is None or val_dataset is None:
    print("\n" + "="*70)
    print("ERROR: DATASET NOT LOADED")
    print("="*70)
    print("\nTraining cannot start without a dataset!")
    print("\n" + "="*70)
    print("QUICK FIX - Choose ONE option:")
    print("="*70)
    
    print("\nOPTION A: Use Test Data (FASTEST - for debugging only)")
    print("-" * 70)
    print("1. Scroll up to Cell 7 (Configuration)")
    print("2. Find the line: USE_TEST_DATA = False")
    print("3. Change to:     USE_TEST_DATA = True")
    print("4. Re-run Cell 7 (Configuration)")
    print("5. Re-run Cell 15 (Load Dataset)")
    print("6. Come back here and run this cell")
    print()
    print("WARNING: Test data is synthetic - only for testing the pipeline!")
    
    print("\n" + "-" * 70)
    print("\nOPTION B: Upload Real Dataset (RECOMMENDED for production)")
    print("-" * 70)
    print("1. Prepare your data:")
    print("   - Format: JSONL files (train.jsonl, val.jsonl)")
    print('   - Structure: {"input": "...", "output": "..."}')
    print()
    print("2. Upload to Kaggle:")
    print("   - Go to: https://www.kaggle.com/datasets")
    print("   - Click: New Dataset")
    print("   - Upload: train.jsonl and val.jsonl")
    print("   - Name it: lexilingo-training-data (or any name)")
    print()
    print("3. Add to this notebook:")
    print("   - Right sidebar -> Settings (gear icon)")
    print("   - Scroll to: Data section")
    print("   - Click: + Add Data")
    print("   - Search: your dataset name")
    print("   - Click: Add")
    print()
    print("4. Re-run cells:")
    print("   - Re-run Cell 7 (Configuration)")
    print("   - Re-run Cell 15 (Load Dataset)")
    print("   - Come back here and run this cell")
    
    print("\n" + "="*70)
    print("Current Status: NO DATASET")
    print("="*70 + "\n")
    
    raise RuntimeError(
        "\nDataset not loaded! Choose one option above:\n"
        "  A) Set USE_TEST_DATA = True in Cell 7 (for debugging)\n"
        "  B) Upload real dataset to Kaggle (for production)\n"
        "\nThen re-run the configuration and dataset loading cells."
    )

# Dataset info
print("\n" + "="*70)
print("DATASET READY FOR TRAINING")
print("="*70)
print(f"\nTrain samples: {len(train_dataset):,}")
print(f"Val samples: {len(val_dataset):,}")

if USE_TEST_DATA:
    print(f"\nMode: TEST DATA (synthetic)")
    print("=" * 70)
    print("WARNING: This is for debugging only!")
    print("For production training, use real dataset!")
    print("=" * 70)
else:
    print(f"\nMode: REAL DATA (production)")
    print("Dataset ready for training")

print("="*70 + "\n")

# Run training
print("Starting training process...\n")
trained_model, trainer = finetune_unified_adapter(
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    lora_config=UNIFIED_LORA_CONFIG,
    resume_from_checkpoint="auto",  # Auto-resume from a checkpoint if one exists
)

print("\n" + "="*70)
print("TRAINING COMPLETED!")
print("="*70)
print(f"\nOutput location: {OUTPUT_DIR}")
print("This will be saved as Kaggle output automatically")
print("\nTo resume in new session:")
print("1. Add this notebook's output as input dataset")
print('2. Keep resume_from_checkpoint="auto" or pass a specific checkpoint path')
print("="*70 + "\n")

## 12. Evaluate Model (Optional)

In [None]:
# Run evaluation on validation set
print("Running evaluation...")
eval_results = trainer.evaluate()

print("\n" + "="*70)
print("EVALUATION RESULTS")
print("="*70)
for key, value in eval_results.items():
    print(f"{key}: {value}")
print("="*70 + "\n")

## 13. Visualize Training Results

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json

def plot_training_metrics(trainer):
    """Visualize training and validation metrics"""
    
    # Extract metrics from trainer history
    history = trainer.state.log_history
    
    train_loss = []
    eval_loss = []
    learning_rates = []
    steps_train = []
    steps_eval = []
    
    for entry in history:
        if 'loss' in entry:
            train_loss.append(entry['loss'])
            steps_train.append(entry['step'])
            # Keep learning_rates aligned with steps_train; NaN marks a missing value
            learning_rates.append(entry.get('learning_rate', float('nan')))
        if 'eval_loss' in entry:
            eval_loss.append(entry['eval_loss'])
            steps_eval.append(entry['step'])
    
    # Check if we have data to plot
    if not train_loss:
        print("\nNo training data available yet. Train for at least a few steps first.")
        return
    
    # Create figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Training Progress Overview', fontsize=16, fontweight='bold')
    
    # Plot 1: Training Loss
    axes[0, 0].plot(steps_train, train_loss, 'b-', linewidth=2, alpha=0.7, label='Training Loss')
    axes[0, 0].set_xlabel('Steps', fontsize=12)
    axes[0, 0].set_ylabel('Loss', fontsize=12)
    axes[0, 0].set_title('Training Loss Over Time', fontsize=14, fontweight='bold')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].legend()
    
    # Plot 2: Evaluation Loss
    if eval_loss:
        axes[0, 1].plot(steps_eval, eval_loss, 'r-', linewidth=2, alpha=0.7, label='Validation Loss')
        axes[0, 1].set_xlabel('Steps', fontsize=12)
        axes[0, 1].set_ylabel('Loss', fontsize=12)
        axes[0, 1].set_title('Validation Loss Over Time', fontsize=14, fontweight='bold')
        axes[0, 1].grid(True, alpha=0.3)
        axes[0, 1].legend()
    else:
        axes[0, 1].text(0.5, 0.5, 'No validation data yet', 
                       ha='center', va='center', fontsize=14, transform=axes[0, 1].transAxes)
        axes[0, 1].set_title('Validation Loss', fontsize=14, fontweight='bold')
    
    # Plot 3: Learning Rate
    if learning_rates:
        axes[1, 0].plot(steps_train, learning_rates, 'g-', linewidth=2, alpha=0.7)
        axes[1, 0].set_xlabel('Steps', fontsize=12)
        axes[1, 0].set_ylabel('Learning Rate', fontsize=12)
        axes[1, 0].set_title('Learning Rate Schedule', fontsize=14, fontweight='bold')
        axes[1, 0].grid(True, alpha=0.3)
        axes[1, 0].ticklabel_format(style='scientific', axis='y', scilimits=(0,0))
    else:
        axes[1, 0].text(0.5, 0.5, 'No learning rate data', 
                       ha='center', va='center', fontsize=14, transform=axes[1, 0].transAxes)
        axes[1, 0].set_title('Learning Rate Schedule', fontsize=14, fontweight='bold')
    
    # Plot 4: Combined Loss Comparison
    if eval_loss:
        axes[1, 1].plot(steps_train, train_loss, 'b-', linewidth=2, alpha=0.7, label='Train Loss')
        axes[1, 1].plot(steps_eval, eval_loss, 'r-', linewidth=2, alpha=0.7, label='Val Loss')
        axes[1, 1].set_xlabel('Steps', fontsize=12)
        axes[1, 1].set_ylabel('Loss', fontsize=12)
        axes[1, 1].set_title('Train vs Validation Loss', fontsize=14, fontweight='bold')
        axes[1, 1].grid(True, alpha=0.3)
        axes[1, 1].legend()
    else:
        axes[1, 1].plot(steps_train, train_loss, 'b-', linewidth=2, alpha=0.7, label='Train Loss')
        axes[1, 1].set_xlabel('Steps', fontsize=12)
        axes[1, 1].set_ylabel('Loss', fontsize=12)
        axes[1, 1].set_title('Training Loss', fontsize=14, fontweight='bold')
        axes[1, 1].grid(True, alpha=0.3)
        axes[1, 1].legend()
    
    plt.tight_layout()
    
    # Save plot
    plot_path = OUTPUT_DIR / "training_metrics.png"
    plt.savefig(plot_path, dpi=300, bbox_inches='tight')
    print(f"\nPlot saved: {plot_path}")
    
    plt.show()
    
    # Print summary statistics
    print("\n" + "="*70)
    print("TRAINING STATISTICS")
    print("="*70)
    print(f"\nTraining Loss:")
    print(f"  Initial: {train_loss[0]:.4f}")
    print(f"  Final: {train_loss[-1]:.4f}")
    print(f"  Best: {min(train_loss):.4f}")
    
    # Safe improvement calculation
    if train_loss[0] > 0:
        improvement = (train_loss[0] - train_loss[-1]) / train_loss[0] * 100
        print(f"  Improvement: {improvement:.2f}%")
    else:
        print(f"  Improvement: N/A (initial loss is 0)")
    
    if eval_loss:
        print(f"\nValidation Loss:")
        print(f"  Initial: {eval_loss[0]:.4f}")
        print(f"  Final: {eval_loss[-1]:.4f}")
        print(f"  Best: {min(eval_loss):.4f}")
        print(f"  Best at step: {steps_eval[eval_loss.index(min(eval_loss))]}")
    else:
        print(f"\nValidation Loss:")
        print(f"  No evaluation data yet (will be available after first eval_steps)")
    
    print(f"\nTotal training steps: {steps_train[-1] if steps_train else 0}")
    print("="*70 + "\n")

# Generate visualization
print("Generating training visualizations...")
try:
    plot_training_metrics(trainer)
except Exception as e:
    print(f"\nError generating plots: {e}")
    print("This may happen if training hasn't completed enough steps yet.")
    print("Try running this cell again after training progresses further.")

## 14. Detailed Evaluation Metrics

## 14a. Diagnostic - Model & Tokenizer Validation

**Important**: Run this diagnostic before evaluation to check for issues.
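
The core of the diagnostic is a range check: every token ID fed to the model must index an existing embedding row. A plain-Python sketch of that check, using illustrative numbers:

```python
from typing import Iterable, List


def out_of_range_ids(token_ids: Iterable[int], embedding_rows: int) -> List[int]:
    """Return the token IDs that do not index a valid embedding row."""
    return [t for t in token_ids if t < 0 or t >= embedding_rows]


# Illustrative values: an embedding matrix with 151936 rows.
print(out_of_range_ids([9707, 151645, 151936], embedding_rows=151936))
```

An ID at or above the row count is the kind of input that triggers a CUDA device-side assert during the embedding lookup.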

In [None]:
import torch

print("\n" + "="*70)
print("MODEL & TOKENIZER DIAGNOSTIC")
print("="*70)

# 1. Check vocabulary sizes
print("\n1. VOCABULARY SIZE CHECK:")
print(f"   Tokenizer vocab_size: {tokenizer.vocab_size}")
print(f"   Model config vocab_size: {model.config.vocab_size}")

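# A model with MORE embedding rows than the tokenizer is harmless (padding rows).
# Only a tokenizer larger than the model can produce out-of-range IDs.
# len(tokenizer) counts added special tokens; tokenizer.vocab_size does not.
print(f"   Tokenizer size incl. added tokens: {len(tokenizer)}")
if len(tokenizer) > model.config.vocab_size:
    print("   MISMATCH DETECTED! Token IDs can exceed the embedding size and cause CUDA errors!")
    print(f"   Difference: {len(tokenizer) - model.config.vocab_size}")
else:
    print("   Model embeddings cover the full tokenizer vocabulary")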

# 2. Check special tokens
print("\n2. SPECIAL TOKENS CHECK:")
print(f"   pad_token: {tokenizer.pad_token} (id={tokenizer.pad_token_id})")
print(f"   eos_token: {tokenizer.eos_token} (id={tokenizer.eos_token_id})")
print(f"   bos_token: {tokenizer.bos_token} (id={tokenizer.bos_token_id})")

# Validate special token IDs against the model's embedding size
# (tokenizer.vocab_size excludes added special tokens, so it is the wrong bound)
embedding_rows = model.get_input_embeddings().weight.shape[0]
invalid_tokens = []
for name in ("pad_token_id", "eos_token_id", "bos_token_id"):
    token_id = getattr(tokenizer, name)
    if token_id is not None and token_id >= embedding_rows:
        invalid_tokens.append(f"{name}={token_id}")

if invalid_tokens:
    print(f"   INVALID TOKEN IDs: {', '.join(invalid_tokens)}")
    print(f"   These exceed the embedding size ({embedding_rows} rows)")
else:
    print("   All special tokens within valid range")

# 3. Check model state
print("\n3. MODEL STATE CHECK:")
print(f"   Model device: {next(model.parameters()).device}")
print(f"   Model dtype: {next(model.parameters()).dtype}")
print(f"   Training mode: {model.training}")
print(f"   Gradient checkpointing: {model.is_gradient_checkpointing}")

# 4. Test simple tokenization
print("\n4. TOKENIZATION TEST:")
test_text = "Hello, this is a test."
tokens = tokenizer(test_text, return_tensors="pt")
print(f"   Input text: '{test_text}'")
print(f"   Token IDs shape: {tokens['input_ids'].shape}")
print(f"   Token IDs: {tokens['input_ids'][0].tolist()}")
print(f"   Max token ID: {tokens['input_ids'].max().item()}")
print(f"   Min token ID: {tokens['input_ids'].min().item()}")

max_token_id = tokens['input_ids'].max().item()
embedding_rows = model.get_input_embeddings().weight.shape[0]
if max_token_id >= embedding_rows:
    print(f"   ERROR: Max token ID ({max_token_id}) >= embedding size ({embedding_rows})")
else:
    print(f"   All token IDs within valid range [0, {embedding_rows - 1}]")

# 5. Test simple generation (CRITICAL TEST)
print("\n5. SIMPLE GENERATION TEST:")
print("   Testing with 'Hello' input...")

try:
    # Clear any previous CUDA errors
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
    
    # Set model to eval mode
    model.eval()
    
    # Simple input
    simple_input = tokenizer("Hello", return_tensors="pt", padding=True, return_attention_mask=True)
    device = next(model.parameters()).device
    simple_input = {k: v.to(device) for k, v in simple_input.items()}
    
    print(f"   Input token IDs: {simple_input['input_ids'][0].tolist()}")
    
    with torch.no_grad():
        # Try minimal generation
        output = model.generate(
            input_ids=simple_input['input_ids'],
            attention_mask=simple_input['attention_mask'],
            max_new_tokens=5,  # Just 5 tokens
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    print(f"   Output token IDs: {output[0].tolist()}")
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"   Decoded output: '{decoded}'")
    print("   Generation successful")
except RuntimeError as e:
    error_msg = str(e)
    print("   GENERATION FAILED")
    print(f"   Error: {error_msg[:200]}")
    
    if 'device-side assert' in error_msg or 'CUDA' in error_msg:
        print("\n" + "="*70)
        print("ROOT CAUSE IDENTIFIED: Model is generating invalid token IDs")
        print("="*70)
        print("\nPossible causes:")
        print("1. Model training corrupted the vocabulary embeddings")
        print("2. LoRA adapter is incompatible with base model")
        print("3. Quantization issue with 4-bit model")
        print("4. Model config doesn't match tokenizer")
        print("\nSuggested fixes:")
        print("1. Reload the base model (without LoRA)")
        print("2. Check if checkpoint is corrupted")
        print("3. Try without quantization")
        print("4. Verify model and tokenizer are from same checkpoint")
        print("="*70)

# 6. Check dataset sample
print("\n6. DATASET SAMPLE CHECK:")
if val_dataset:
    sample = val_dataset[0]
    print(f"   Sample keys: {sample.keys()}")
    print(f"   Input preview: {sample['input'][:100]}...")
    
    # Tokenize dataset sample
    tokens = tokenizer(sample['input'], return_tensors="pt", truncation=True, max_length=512)
    print(f"   Tokenized shape: {tokens['input_ids'].shape}")
    print(f"   Max token in sample: {tokens['input_ids'].max().item()}")
    
    max_sample_id = tokens['input_ids'].max().item()
    embedding_rows = model.get_input_embeddings().weight.shape[0]
    if max_sample_id >= embedding_rows:
        print("   Dataset contains invalid token IDs")
    else:
        print("   Dataset tokens are valid")

print("\n" + "="*70)
print("DIAGNOSTIC COMPLETE")
print("="*70)
print("\nNext steps:")
print("  If all checks pass -> Proceed to evaluation")
print("  If generation test fails -> See suggested fixes above")
print("  If vocab mismatch -> Reload model and tokenizer together")

## 14b. Model Recovery (Run only if diagnostic fails)

**Only run this if the diagnostic test failed!** This will reload the model from the last checkpoint.

In [None]:
# RECOVERY: Reload model from checkpoint
print("="*70)
print("MODEL RECOVERY - Loading from checkpoint")
print("="*70)

# Clear CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

# Find the latest checkpoint; fall back to the final adapter directory
best_checkpoint = checkpoint_mgr.find_latest_checkpoint()
if not best_checkpoint:
    final_model_path = OUTPUT_DIR / "unified_lora_adapter"
    if final_model_path.exists():
        best_checkpoint = str(final_model_path)

if best_checkpoint:
    print(f"\nLoading from: {best_checkpoint}")
    
    # Reload tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        best_checkpoint,
        trust_remote_code=True,
    )
    
    # Reload base model
    device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else {"": "cpu"}
    
    base_model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=QUANTIZATION_CONFIG,
        device_map=device_map,
        trust_remote_code=True,
        cache_dir=str(CACHE_DIR)
    )
    
    # FIX: Calculate required vocab size including special tokens
    special_token_ids = [
        tokenizer.eos_token_id if tokenizer.eos_token_id is not None else 0,
        tokenizer.bos_token_id if tokenizer.bos_token_id is not None else 0,
        tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0,
    ]
    max_special_token_id = max(special_token_ids)
    required_vocab_size = max(tokenizer.vocab_size, base_model.config.vocab_size, max_special_token_id + 1)
    
    # Resize embeddings to accommodate all tokens
    if base_model.config.vocab_size != required_vocab_size:
        print(f"  -> Resizing embeddings: {base_model.config.vocab_size} -> {required_vocab_size}")
        base_model.resize_token_embeddings(required_vocab_size)
        base_model.config.vocab_size = required_vocab_size
    
    # Set pad token AFTER resizing
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    
    base_model.enable_input_require_grads()
    
    # Load LoRA adapter from checkpoint
    from peft import PeftModel
    model = PeftModel.from_pretrained(base_model, best_checkpoint)
    model.eval()
    
    print(f"\nModel reloaded successfully from checkpoint")
    print(f"  Device: {next(model.parameters()).device}")
    print(f"  Vocab size: {model.config.vocab_size}")
    print(f"  Tokenizer vocab: {len(tokenizer)} (base vocab without added tokens: {tokenizer.vocab_size})")
    
    # Test generation again
    print("\nTesting generation after reload...")
    try:
        test_input = tokenizer("Hello", return_tensors="pt")
        test_input = {k: v.to(next(model.parameters()).device) for k, v in test_input.items()}
        
        with torch.no_grad():
            test_output = model.generate(
                **test_input,
                max_new_tokens=5,
                do_sample=False,
                pad_token_id=tokenizer.pad_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )
        
        print(f"Generation test passed")
        print(f"  Output: {tokenizer.decode(test_output[0], skip_special_tokens=True)}")
    except Exception as e:
        print(f"Generation still failing: {str(e)[:100]}")
        print("\nThe model checkpoint may be corrupted.")
        print("You may need to restart training from an earlier checkpoint.")
else:
    print("\nNo checkpoint found to reload from")
    print("Training may not have saved any checkpoints yet")

print("="*70 + "\n")

In [None]:
import os

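# For debugging CUDA errors, uncomment the line below.
# Note: it only takes effect if set before CUDA is initialized, so restart the
# kernel and set it before the first CUDA call.
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'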

def detailed_evaluation(trainer, dataset, sample_size=100):
    """Run a detailed evaluation with sample predictions."""
    
    print("\n" + "="*70)
    print("DETAILED EVALUATION")
    print("="*70)
    
    # Get evaluation metrics
    try:
        eval_results = trainer.evaluate()
        print("\nOverall Metrics:")
        for key, value in eval_results.items():
            if isinstance(value, (int, float)):
                print(f"  {key}: {value:.4f}")
            else:
                print(f"  {key}: {value}")
    except Exception as e:
        print(f"\nEvaluation failed: {e}")
        print("Continuing with sample predictions...")
        eval_results = {}
    
    # Sample predictions
    print(f"\n" + "="*70)
    print(f"SAMPLE PREDICTIONS (showing up to 10 of {min(sample_size, len(dataset))} sampled)")
    print("="*70)
    
    sample_indices = np.random.choice(len(dataset), min(sample_size, len(dataset)), replace=False)
    
    predictions_data = []
    errors_count = 0
    
    # Get device explicitly
    device = next(model.parameters()).device
    print(f"\nModel device: {device}")
    
    for i, idx in enumerate(sample_indices[:10], 1):  # Show first 10
        try:
            sample = dataset[int(idx)]
            
            # Generate prediction
            input_text = sample['input']
            expected_output = sample['output']
            
            # Validate input text
            if not input_text or len(input_text.strip()) == 0:
                print(f"  Sample {i}: Skipping empty input")
                continue
            
            # Tokenize with explicit parameters
            inputs = tokenizer(
                input_text, 
                return_tensors="pt", 
                truncation=True, 
                max_length=512,
                padding=True,
                return_attention_mask=True
            )
            
            # Move to device explicitly
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
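            # Validate token IDs against the embedding matrix size.
            # tokenizer.vocab_size excludes added special tokens, so it is not a safe bound.
            num_embeddings = model.get_input_embeddings().num_embeddings
            max_token_id = int(inputs['input_ids'].max())
            if max_token_id >= num_embeddings:
                print(f"  Sample {i}: Invalid token IDs (max={max_token_id}, embedding rows={num_embeddings})")
                continue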
            
            with torch.no_grad():
                # Use greedy decoding for evaluation (deterministic, reproducible)
                outputs = model.generate(
                    input_ids=inputs['input_ids'],
                    attention_mask=inputs['attention_mask'],
                    max_new_tokens=128,  # Reduced from 256 for safety
                    do_sample=False,  # Greedy decoding - deterministic
                    pad_token_id=tokenizer.pad_token_id,
                    eos_token_id=tokenizer.eos_token_id,
                )
            
            # Decode only the newly generated tokens (everything after the prompt)
            prompt_length = inputs['input_ids'].shape[1]
            predicted = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
            
            predictions_data.append({
                'input': input_text[:100] + "..." if len(input_text) > 100 else input_text,
                'expected': expected_output[:100] + "..." if len(expected_output) > 100 else expected_output,
                'predicted': predicted[:100] + "..." if len(predicted) > 100 else predicted,
            })
            
            print(f"  Sample {i}: processed successfully")
            
        except RuntimeError as e:
            errors_count += 1
            error_msg = str(e)
            if 'CUDA' in error_msg or 'device-side assert' in error_msg:
                print(f"\n{'='*70}")
                print(f"  CUDA ERROR on sample {i}")
                print(f"{'='*70}")
                print(f"Error: {error_msg[:200]}")
                print(f"\nThis usually means:")
                print(f"  1. Invalid token IDs in the dataset")
                print(f"  2. Corrupted input text")
                print(f"  3. Memory issues")
                print(f"\nTo debug:")
                print(f"  1. Uncomment CUDA_LAUNCH_BLOCKING=1 at the top of this cell")
                print(f"  2. Re-run to get exact error location")
                print(f"  3. Check the input text: {input_text[:50]}...")
                print(f"{'='*70}\n")
                
                if errors_count > 2:
                    print(f"\n  Too many CUDA errors ({errors_count}), stopping evaluation")
                    print(f"Please check your dataset for corrupted samples")
                    break
            else:
                print(f"  Sample {i}: {error_msg[:100]}")
            continue
        except Exception as e:
            errors_count += 1
            print(f"  Sample {i}: Unexpected error - {str(e)[:100]}")
            if errors_count > 3:
                print(f"\n  Too many errors ({errors_count}), stopping evaluation")
                break
            continue
    
    # Display predictions
    if predictions_data:
        print(f"\n{'='*70}")
        print(f"PREDICTION RESULTS ({len(predictions_data)} successful)")
        print(f"{'='*70}")
        for i, pred in enumerate(predictions_data, 1):
            print(f"\n--- Sample {i} ---")
            print(f"Input: {pred['input']}")
            print(f"Expected: {pred['expected']}")
            print(f"Predicted: {pred['predicted']}")
            print("-" * 70)
    else:
        print(f"\n  No predictions generated - all samples failed")
        print(f"This indicates a serious issue with the model or dataset")
        return None
    
    # Save predictions to file
    predictions_file = OUTPUT_DIR / "sample_predictions.json"
    with open(predictions_file, 'w', encoding='utf-8') as f:
        json.dump(predictions_data, f, indent=2, ensure_ascii=False)
    
    print(f"\n Predictions saved: {predictions_file}")
    
    # Summary
    print(f"\n{'='*70}")
    print(f"EVALUATION SUMMARY")
    print(f"{'='*70}")
    print(f"Successful predictions: {len(predictions_data)}")
    print(f"Failed samples: {errors_count}")
    print(f"Success rate: {len(predictions_data)/(len(predictions_data)+errors_count)*100:.1f}%")
    print("="*70 + "\n")
    
    return eval_results

# Run detailed evaluation with better error handling
print("Running detailed evaluation...")
print("\nNote: If you encounter CUDA errors:")
print("  1. Uncomment CUDA_LAUNCH_BLOCKING=1 at top of this cell")
print("  2. Re-run for detailed error location")
print("  3. Check dataset for corrupted samples\n")

try:
    eval_metrics = detailed_evaluation(trainer, val_dataset, sample_size=100)
    if eval_metrics is not None:
        print("\n Evaluation completed successfully!")
    else:
        print("\n  Evaluation completed with errors - check output above")
except Exception as e:
    print(f"\n{'='*70}")
    print(f"EVALUATION FAILED")
    print(f"{'='*70}")
    print(f"Error: {e}")
    print(f"\nPossible causes:")
    print(f"  1. Training hasn't completed yet")
    print(f"  2. Model or trainer in invalid state")
    print(f"  3. Dataset contains corrupted samples")
    print(f"  4. GPU memory issues")
    print(f"\nTry:")
    print(f"  1. Wait for training to complete fully")
    print(f"  2. Restart kernel and reload checkpoint")
    print(f"  3. Check dataset integrity")
    print(f"{'='*70}\n")
    eval_metrics = None
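
A device-side assert during generation can come from a token ID that indexes past the embedding matrix. A minimal sketch of a CPU-side check that can run before any tensor reaches the GPU follows. `find_out_of_range_ids` is a hypothetical helper, and `num_embeddings` is assumed to be the row count of the model's input embedding.

```python
from typing import List, Sequence, Tuple


def find_out_of_range_ids(token_ids: Sequence[int], num_embeddings: int) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that cannot be looked up in the embedding."""
    return [
        (position, token_id)
        for position, token_id in enumerate(token_ids)
        if token_id < 0 or token_id >= num_embeddings
    ]


# Illustrative IDs: valid indices for 1000 rows are 0..999, so 1000 is rejected
bad = find_out_of_range_ids([5, 17, 1000, 3], num_embeddings=1000)
print(bad)  # [(2, 1000)]
```

Reporting the position alongside the ID makes it easier to trace a bad token back to the text that produced it.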

## 15. Test Inference on Custom Prompts

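**Note:** Warnings such as `"generation flags are not valid and may be ignored"` are harmless. They typically appear when sampling-only settings (for example `temperature` or `top_p`) are present in the generation config while `do_sample=False`, in which case transformers ignores those settings.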
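
For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so the response can be isolated by position rather than by string matching. The sketch below uses plain lists in place of tensors to show the idea. `split_generated` is a hypothetical helper, not part of transformers.

```python
from typing import List, Sequence


def split_generated(output_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the tokens produced after the prompt."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the number of output tokens")
    return list(output_ids[prompt_length:])


prompt_ids = [11, 12, 13]
output_ids = prompt_ids + [40, 41]  # stand-in for what a decoder-only generate call returns
print(split_generated(output_ids, len(prompt_ids)))  # [40, 41]
```

String replacement can miss when the decoded text does not reproduce the prompt character for character, and it also deletes any repetition of the prompt inside the answer. Slicing by token count avoids both cases.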

In [None]:
def test_inference(prompt: str, max_length: int = 256):
    """Generate a response for a prompt; max_length caps the number of new tokens."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the newly generated tokens (everything after the prompt)
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    return response

# Test with the different tasks
test_prompts = {
    "Fluency Assessment": "Task: Assess the fluency of the following text.\n\nText: The cat sat on the mat and looked out the window.\n\nFluency score:",
    "Vocabulary Level": "Task: Classify the vocabulary level of this text.\n\nText: The ubiquitous smartphone has revolutionized communication paradigms.\n\nVocabulary level:",
    "Grammar Correction": "Task: Correct any grammar errors in the following text.\n\nText: She don't like apples and he have three dogs.\n\nCorrected text:",
    "Dialogue Generation": "Task: Continue the conversation naturally.\n\nUser: Hello, how are you today?\nAssistant:",
}

print("\n" + "="*70)
print("INTERACTIVE INFERENCE TESTING")
print("="*70)

results_table = []

for task_name, prompt in test_prompts.items():
    print(f"\n{'='*70}")
    print(f"TASK: {task_name}")
    print(f"{'='*70}")
    print(f"\nPrompt:\n{prompt}\n")
    
    response = test_inference(prompt, max_length=200)
    print(f"Response:\n{response}\n")
    
    results_table.append({
        'task': task_name,
        'prompt': prompt,
        'response': response
    })

# Save test results
test_results_file = OUTPUT_DIR / "inference_tests.json"
with open(test_results_file, 'w', encoding='utf-8') as f:
    json.dump(results_table, f, indent=2, ensure_ascii=False)

print(f"\n{'='*70}")
print(f"Test results saved: {test_results_file}")
print(f"{'='*70}\n")

## 16. Task-Specific Performance Analysis

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def analyze_task_performance(dataset, num_samples=50):
    """Group sampled examples by task type and plot their distribution."""
    
    print("\n" + "="*70)
    print("TASK-SPECIFIC PERFORMANCE ANALYSIS")
    print("="*70)
    
    # Group samples by task type (detect from input)
    task_groups = {
        'fluency': [],
        'vocabulary': [],
        'grammar': [],
        'dialogue': []
    }
    
    # Sample and classify
    sample_indices = np.random.choice(len(dataset), min(num_samples, len(dataset)), replace=False)
    
    for idx in sample_indices:
        sample = dataset[int(idx)]
        input_lower = sample['input'].lower()
        
        if 'fluency' in input_lower or 'score' in input_lower:
            task_groups['fluency'].append(sample)
        elif 'vocabulary' in input_lower or 'level' in input_lower or 'classify' in input_lower:
            task_groups['vocabulary'].append(sample)
        elif 'grammar' in input_lower or 'correct' in input_lower or 'error' in input_lower:
            task_groups['grammar'].append(sample)
        elif 'conversation' in input_lower or 'dialogue' in input_lower or 'respond' in input_lower:
            task_groups['dialogue'].append(sample)
    
    # Visualize distribution
    task_counts = {k: len(v) for k, v in task_groups.items()}
    
    plt.figure(figsize=(10, 6))
    colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']
    bars = plt.bar(task_counts.keys(), task_counts.values(), color=colors, alpha=0.7)
    
    plt.title('Sample Distribution by Task Type', fontsize=14, fontweight='bold')
    plt.xlabel('Task Type', fontsize=12)
    plt.ylabel('Number of Samples', fontsize=12)
    plt.grid(axis='y', alpha=0.3)
    
    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}',
                ha='center', va='bottom', fontweight='bold')
    
    plt.tight_layout()
    dist_plot_path = OUTPUT_DIR / "task_distribution.png"
    plt.savefig(dist_plot_path, dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"\nTask distribution plot saved: {dist_plot_path}")
    
    # Print statistics
    print("\nTask Distribution:")
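    total_classified = sum(task_counts.values())
    for task, count in task_counts.items():
        # Guard against division by zero when no sample matched any keyword
        percentage = (count / total_classified * 100) if total_classified else 0.0
        print(f"  {task.capitalize()}: {count} samples ({percentage:.1f}%)")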
    
    print("="*70 + "\n")
    
    return task_groups

# Run analysis
task_analysis = analyze_task_performance(val_dataset, num_samples=100)
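
Grouping samples by task is only the first step. To score each group, the generated text has to be turned into comparable values. A minimal sketch for two of the four tasks follows, assuming fluency outputs contain a decimal score in [0.0, 1.0] and vocabulary outputs contain a CEFR label (A1 to C2). `parse_fluency_score` and `parse_cefr_level` are hypothetical helpers, not functions defined elsewhere in this notebook.

```python
import re
from typing import Optional


def parse_fluency_score(text: str) -> Optional[float]:
    """Return the first number in the text if it lies in [0.0, 1.0], else None."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 1.0 else None


def parse_cefr_level(text: str) -> Optional[str]:
    """Return the first CEFR label (A1..C2) found in the text, else None."""
    match = re.search(r"\b([ABC][12])\b", text.upper())
    return match.group(1) if match else None


print(parse_fluency_score("Fluency score: 0.85"))           # 0.85
print(parse_cefr_level("Vocabulary level: c1 (advanced)"))  # C1
```

With parsed values, fluency could be compared to the expected score by absolute error and vocabulary level by exact match. Outputs that return `None` are worth counting separately as format failures.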

## 17. Save Final Artifacts & Summary

In [None]:
# Create comprehensive training summary
# Handle eval_metrics if evaluation was skipped
if 'eval_metrics' not in globals() or eval_metrics is None:
    eval_metrics = {"note": "Evaluation was not run or returned None"}

summary = {
    "model": MODEL_NAME,
    "lora_config": UNIFIED_LORA_CONFIG,
    "training_config": TRAINING_CONFIG,
    "dataset_size": {
        "train": len(train_dataset),
        "val": len(val_dataset),
    },
    "training_completed": datetime.now().isoformat(),
    "platform": "kaggle",
    "output_dir": str(OUTPUT_DIR),
    "final_metrics": eval_metrics,
    "total_steps": trainer.state.global_step if 'trainer' in globals() else "N/A",
    "best_checkpoint": trainer.state.best_model_checkpoint if 'trainer' in globals() else "N/A",
}

summary_file = OUTPUT_DIR / "training_summary.json"
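with open(summary_file, 'w', encoding='utf-8') as f:
    # default=str keeps the dump from failing on values such as Path objects
    json.dump(summary, f, indent=2, default=str)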

print("\n" + "="*70)
print("TRAINING ARTIFACTS SUMMARY")
print("="*70)
print(f"\nOutput directory: {OUTPUT_DIR}")
print(f"\nGenerated files:")
print(f"  - unified_lora_adapter/ (final model)")
print(f"  - checkpoint-*/ (training checkpoints)")
print(f"  - training_metrics.png (visualization)")
print(f"  - sample_predictions.json (evaluation samples)")
print(f"  - inference_tests.json (test results)")
print(f"  - task_distribution.png (task analysis)")
print(f"  - training_summary.json (complete summary)")
print(f"  - training_state.json (resume info)")

print(f"\nAll files in /kaggle/working/ will be saved as Kaggle output")
print("="*70 + "\n")

# List all output files with sizes
print("Output files details:")
total_size = 0
for item in sorted(OUTPUT_DIR.rglob("*")):
    if item.is_file():
        size_mb = item.stat().st_size / (1024 * 1024)
        total_size += size_mb
        rel_path = item.relative_to(OUTPUT_DIR)
        print(f"  {rel_path} ({size_mb:.2f} MB)")

print(f"\nTotal output size: {total_size:.2f} MB")
print("="*70 + "\n")

In [None]:
import os
from pathlib import Path
from datetime import datetime

# Define OUTPUT_DIR in case this cell is run standalone
if 'OUTPUT_DIR' not in globals():
    OUTPUT_DIR = Path("/kaggle/working/unified_model")

print("\n" + "="*70)
print("OUTPUT SUMMARY FOR DOWNLOAD")
print("="*70)

print(f"\nOutput location: {OUTPUT_DIR}")
print(f"   (Automatically saved as Kaggle output)")

# List all files with sizes
total_size = 0
file_list = []

if OUTPUT_DIR.exists():
    print(f"\nFiles to be downloaded:\n")
    
    for item in sorted(OUTPUT_DIR.rglob("*")):
        if item.is_file():
            size_mb = item.stat().st_size / (1024 * 1024)
            total_size += size_mb
            rel_path = item.relative_to(OUTPUT_DIR)
            file_list.append((str(rel_path), size_mb))
            
            # Print with proper formatting
            if size_mb < 1:
                print(f"   {rel_path} ({size_mb*1024:.1f} KB)")
            else:
                print(f"   {rel_path} ({size_mb:.1f} MB)")
    
    print(f"\n" + "-"*70)
    print(f"   Total output size: {total_size:.2f} MB ({total_size/1024:.2f} GB)")
    print("-"*70)
    
    # Check for important files
    important_files = [
        OUTPUT_DIR / "unified_lora_adapter",
        OUTPUT_DIR / "training_summary.json",
        OUTPUT_DIR / "training_metrics.png",
    ]
    
    print(f"\nImportant files check:")
    for file_path in important_files:
        if file_path.exists():
            if file_path.is_dir():
                count = len(list(file_path.iterdir()))
                print(f"   [OK] {file_path.name}/ ({count} files)")
            else:
                print(f"   [OK] {file_path.name}")
        else:
            print(f"   [MISSING] {file_path.name}")
else:
    print(f"\nOutput directory not found: {OUTPUT_DIR}")

print("\n" + "="*70)
print("DOWNLOAD INSTRUCTIONS")
print("="*70)

print("""
After this notebook session ends:

1. **Download from Kaggle UI (Easiest):**
   - Right side panel -> "Output" section
   - Click "Download" button
   - Extract .zip file locally

2. **Copy to your local project:**
   ```bash
   # On your local machine:
   cd ~/Documents/RepoGitHub/LexiLingo/DL-Model-Support
   unzip ~/Downloads/archive.zip -d ./temp/
   cp -r temp/unified_model/* model/outputs/unified/
   ```

3. **Verify locally:**
   ```bash
   ls -lh model/outputs/unified/unified_lora_adapter/
   ```

4. **Alternative - Create Dataset (for sharing/backup):**
   - Output -> "New Dataset" button
   - Set title: "lexilingo-unified-model"
   - Make public or private
   - Download anytime via Kaggle API or web

""")

print("="*70)
print("TRAINING COMPLETE - Ready to download")
print("="*70 + "\n")

# Save file list to JSON for reference
import json
file_manifest = {
    "total_size_mb": round(total_size, 2),
    "total_files": len(file_list),
    "files": [{"path": p, "size_mb": round(s, 2)} for p, s in file_list],
    "created": datetime.now().isoformat(),
}

manifest_file = OUTPUT_DIR / "file_manifest.json"
with open(manifest_file, 'w') as f:
    json.dump(file_manifest, f, indent=2)

print(f"File manifest saved: {manifest_file}")
print(f"   (Includes list of all {len(file_list)} files with sizes)\n")
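
File sizes alone do not show whether a download arrived intact. One option is to record a checksum per file before the session ends and recompute it locally. The sketch below is an assumption-laden illustration: `sha256_of_file` and `checksum_manifest` are hypothetical helpers, and the demo runs against a temporary directory instead of the real output folder.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large adapter files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(root: Path) -> Dict[str, str]:
    """Map each file path (relative to root, POSIX style) to its SHA-256 digest."""
    return {
        item.relative_to(root).as_posix(): sha256_of_file(item)
        for item in sorted(root.rglob("*"))
        if item.is_file()
    }


# Demo on a throwaway directory (stand-in for the real output folder)
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "adapter_config.json").write_text("{}", encoding="utf-8")
    demo_manifest = checksum_manifest(demo_root)
    print(demo_manifest)
```

Running the same function on the extracted archive and comparing the two dictionaries would flag any file that changed or went missing in transit.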