# Marco LoRA Fine-Tuning Training Notebook

Step-by-step LoRA fine-tuning of Qwen2.5-7B-Instruct for Italian teaching.

## Training Pipeline Overview
1. **Environment Setup** - Check GPU, install dependencies
2. **Data Preprocessing** - Load and validate training data
3. **Model Initialization** - Configure LoRA and load base model
4. **Training Setup** - Verify configuration and memory usage
5. **Fine-Tuning** - Execute training with validation monitoring
6. **Testing** - Quick inference tests with trained model
7. **Evaluation** - Generate plots, examples, and quality metrics

**Estimated Training Time (3 epochs, 10K samples):**
- **T4**: ~6-8 hours (memory-optimized)
- **L4**: ~2-3 hours (high-performance) ⭐ **RECOMMENDED**  
- **A100**: ~1.5-2.5 hours (maximum performance)

## 1. Environment Setup & GPU Detection

In [None]:
import os
import sys
import torch
import logging
from pathlib import Path
import pandas as pd


# mount
from google.colab import drive
drive.mount('/content/drive')

# Add project root to path
project_root = Path('/content/drive/MyDrive/Colab Notebooks/italian_teacher')
if project_root.exists():
    sys.path.append(str(project_root))
    os.chdir(project_root)
    print(f"✅ Working directory: {os.getcwd()}")
else:
    print("❌ Project directory not found. Update path for your setup.")

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

print("📦 Environment setup complete")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Working directory: /content/drive/MyDrive/Colab Notebooks/italian_teacher
📦 Environment setup complete


In [None]:
# GPU Detection and Memory Info
print("🔍 GPU Detection:")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)

    print(f"GPU: {gpu_name}")
    print(f"GPU Memory: {gpu_memory:.1f} GB")

    # Clear GPU cache
    torch.cuda.empty_cache()

    # Memory usage
    allocated = torch.cuda.memory_allocated(0) / (1024**3)
    cached = torch.cuda.memory_reserved(0) / (1024**3)
    print(f"Current GPU usage: {allocated:.2f} GB allocated, {cached:.2f} GB cached")

    # Determine optimal configuration based on GPU
    if "T4" in gpu_name:
        print("🔧 Detected T4 - Using memory-optimized settings")
        recommended_batch_size = 1
        recommended_eval_batch_size = 1
        gradient_accumulation = 8
        pin_memory = False
        training_speed = "~6-8 hours for 3 epochs"
    elif "L4" in gpu_name:
        print("🚀 Detected L4 - Using high-performance settings")
        recommended_batch_size = 3
        recommended_eval_batch_size = 4
        gradient_accumulation = 3
        pin_memory = True
        training_speed = "~2-3 hours for 3 epochs"
    elif "A100" in gpu_name:
        print("🏎️  Detected A100 - Using maximum performance settings")
        recommended_batch_size = 2
        recommended_eval_batch_size = 3
        gradient_accumulation = 4
        pin_memory = True
        training_speed = "~1.5-2.5 hours for 3 epochs"
    else:
        print(f"❓ Unknown GPU ({gpu_name}) - Using conservative settings")
        recommended_batch_size = 1
        recommended_eval_batch_size = 1
        gradient_accumulation = 8
        pin_memory = False
        training_speed = "~6-10 hours for 3 epochs (estimated)"

    effective_batch_size = recommended_batch_size * gradient_accumulation
    print(f"\n📊 Optimized Settings:")
    print(f"   Train batch size: {recommended_batch_size}")
    print(f"   Eval batch size: {recommended_eval_batch_size}")
    print(f"   Gradient accumulation: {gradient_accumulation}")
    print(f"   Effective batch size: {effective_batch_size}")
    print(f"   Pin memory: {pin_memory}")
    print(f"   Estimated training time: {training_speed}")
else:
    print("❌ No GPU detected. Training will be extremely slow.")

🔍 GPU Detection:
CUDA Available: True
GPU: NVIDIA L4
GPU Memory: 22.2 GB
Current GPU usage: 0.00 GB allocated, 0.00 GB cached
🚀 Detected L4 - Using high-performance settings

📊 Optimized Settings:
   Train batch size: 3
   Eval batch size: 4
   Gradient accumulation: 3
   Effective batch size: 9
   Pin memory: True
   Estimated training time: ~2-3 hours for 3 epochs


In [None]:
print(os.getcwd())

# Install required packages first
print("📦 Installing required packages...")
!pip install -q accelerate>=0.24.0 peft>=0.7.0 bitsandbytes>=0.41.0 transformers>=4.36.0 datasets>=2.14.0 wandb>=0.16.0

# Standalone import approach - avoids src package issues
import sys
from pathlib import Path

# Add fine_tuning directory directly to path
fine_tuning_path = Path.cwd() / "src" / "fine_tuning"
if str(fine_tuning_path) not in sys.path:
    sys.path.insert(0, str(fine_tuning_path))

print(f"✅ Added to Python path: {fine_tuning_path}")

# Direct imports from fine_tuning directory
try:
    from lora_trainer import MarcoLoRATrainer
    from config import get_default_config
    from inference import MarcoInference
    print("✅ All training modules imported successfully")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Installing additional packages from requirements...")
    !pip install -r src/fine_tuning/requirements.txt
    print("🔄 Please restart runtime (Runtime → Restart Session) and run this cell again")

    # Try import again after installation
    try:
        from lora_trainer import MarcoLoRATrainer
        from config import get_default_config
        from inference import MarcoInference
        print("✅ Packages installed and imported")
    except ImportError as e2:
        print(f"❌ Still failing: {e2}")
        print("🔄 Please restart runtime (Runtime → Restart Session) and run this cell again")

/content/drive/MyDrive/Colab Notebooks/italian_teacher
📦 Installing required packages...
✅ Added to Python path: /content/drive/MyDrive/Colab Notebooks/italian_teacher/src/fine_tuning
✅ All training modules imported successfully


# Data validation passed - ready for training setup

In [None]:
# Check training data availability
import json
from pathlib import Path

data_dir = Path("data/processed_llm_improved")
train_file = data_dir / "train.jsonl"
val_file = data_dir / "validation.jsonl"
test_file = data_dir / "test.jsonl"

print("📊 Training Data Status:")
print(f"Data directory exists: {data_dir.exists()}")
print(f"Train file exists: {train_file.exists()}")
print(f"Validation file exists: {val_file.exists()}")
print(f"Test file exists: {test_file.exists()}")

# Initialize variables
train_samples = 0
val_samples = 0
test_samples = 0

if train_file.exists():
    # Count samples
    with open(train_file, 'r', encoding='utf-8') as f:
        train_samples = sum(1 for line in f)
    print(f"Training samples: {train_samples:,}")
else:
    print("❌ Training file not found")

if val_file.exists():
    with open(val_file, 'r', encoding='utf-8') as f:
        val_samples = sum(1 for line in f)
    print(f"Validation samples: {val_samples:,}")
else:
    print("❌ Validation file not found")

if test_file.exists():
    with open(test_file, 'r', encoding='utf-8') as f:
        test_samples = sum(1 for line in f)
    print(f"Test samples: {test_samples:,}")
else:
    print("❌ Test file not found")

total_samples = train_samples + val_samples + test_samples
print(f"\n📈 Total samples: {total_samples:,}")
if total_samples > 0:
    print(f"Train/Val/Test split: {train_samples}/{val_samples}/{test_samples}")
else:
    print("⚠️  No training data found. Check data path or run data preparation first.")

📊 Training Data Status:
Data directory exists: True
Train file exists: True
Validation file exists: True
Test file exists: True
Training samples: 8,104
Validation samples: 1,519
Test samples: 507

📈 Total samples: 10,130
Train/Val/Test split: 8104/1519/507


In [None]:
# Sample data inspection
print("🔍 Sample Training Data:")

if train_file.exists() and train_samples > 0:
    with open(train_file, 'r', encoding='utf-8') as f:
        # Read first sample
        sample = json.loads(f.readline())

    print("Sample structure:")
    for key in sample.keys():
        print(f"  - {key}: {type(sample[key])}")

    print("\n💬 Sample conversation:")
    # Handle both 'messages' and 'conversation' formats
    if 'messages' in sample:
        conversation = sample['messages']
    elif 'conversation' in sample:
        conversation = sample['conversation']
    else:
        print("❌ Unknown conversation format in sample")
        conversation = []

    for i, msg in enumerate(conversation[:4]):  # Show first 4 messages
        role = msg.get('role', 'unknown')
        content = msg.get('content', '')
        content_preview = content[:100] + "..." if len(content) > 100 else content
        print(f"  {i+1}. {role}: {content_preview}")

    if 'metadata' in sample:
        print(f"\n📋 Metadata: {sample['metadata']}")

    print("\n✅ Data structure looks good for training")
else:
    print("❌ No training data available for inspection")
    print("Please ensure data files are in the correct location:")
    print(f"  Expected: {train_file}")
    print("  Or run data preparation pipeline first")

🔍 Sample Training Data:
Sample structure:
  - messages: <class 'list'>
  - metadata: <class 'dict'>

💬 Sample conversation:
  1. user: What's 'We try.' in Italian?
  2. assistant: Well done! The translation is 'Ci proviamo.'.

📋 Metadata: {'conversation_id': 'translation_5261', 'source': 'tatoeba', 'level': 'A1', 'topic': 'general'}

✅ Data structure looks good for training


In [None]:
# Get default configuration and customize for detected GPU
config = get_default_config()

# Override with detected optimal settings
if torch.cuda.is_available():
    # Use the recommended settings from GPU detection
    if 'recommended_batch_size' in locals():
        config.training.per_device_train_batch_size = recommended_batch_size
    if 'recommended_eval_batch_size' in locals():
        config.training.per_device_eval_batch_size = recommended_eval_batch_size
    if 'gradient_accumulation' in locals():
        config.training.gradient_accumulation_steps = gradient_accumulation
    if 'pin_memory' in locals():
        config.training.dataloader_pin_memory = pin_memory

# Customize training settings
config.training.num_train_epochs = 3  # Start with 3 epochs
config.training.output_dir = "./marco_lora_checkpoints"

# Set run name based on GPU
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    config.training.run_name = f"marco-lora-{gpu_name.lower().replace(' ', '-')}"
else:
    config.training.run_name = "marco-lora-cpu"

# Enable experiment tracking (optional)
config.experiment.use_wandb = False  # Set to True if you want wandb
config.experiment.experiment_name = f"marco-italian-teacher-{pd.Timestamp.now().strftime('%Y%m%d-%H%M')}"

print("⚙️  Training Configuration:")
print(f"Model: {config.training.model_name}")
print(f"Train batch size: {config.training.per_device_train_batch_size}")
print(f"Eval batch size: {config.training.per_device_eval_batch_size}")
print(f"Gradient accumulation: {config.training.gradient_accumulation_steps}")
print(f"Effective batch size: {config.training.per_device_train_batch_size * config.training.gradient_accumulation_steps}")
print(f"Pin memory: {config.training.dataloader_pin_memory}")
print(f"Learning rate: {config.training.learning_rate}")
print(f"Epochs: {config.training.num_train_epochs}")
print(f"LoRA rank: {config.lora.r}")
print(f"LoRA alpha: {config.lora.lora_alpha}")
print(f"Max sequence length: {config.data.max_length}")
print(f"Output directory: {config.training.output_dir}")
print(f"Experiment tracking: {'Enabled' if config.experiment.use_wandb else 'Disabled'}")
if 'training_speed' in locals():
    print(f"Estimated training time: {training_speed}")

⚙️  Training Configuration:
Model: Qwen/Qwen2.5-7B-Instruct
Train batch size: 3
Eval batch size: 4
Gradient accumulation: 3
Effective batch size: 9
Pin memory: True
Learning rate: 0.0002
Epochs: 3
LoRA rank: 16
LoRA alpha: 32
Max sequence length: 1024
Output directory: ./marco_lora_checkpoints
Experiment tracking: Disabled
Estimated training time: ~2-3 hours for 3 epochs


## 3. Model Configuration & Initialization

In [None]:
# Initialize trainer (this will load the model)
print("🚀 Initializing Marco LoRA Trainer...")
print("This will download and load Qwen2.5-7B-Instruct (may take a few minutes)")

# Check if we have training data before proceeding
if total_samples == 0:
    print("❌ No training data found. Cannot proceed with training.")
    print("Please ensure your data files are available at:")
    print(f"  Train: {train_file}")
    print(f"  Validation: {val_file}")
    print(f"  Test: {test_file}")
    print("\nTo fix this:")
    print("1. Check if the data path is correct")
    print("2. Run data preparation pipeline if needed")
    print("3. Or update the data paths in the config")
else:
    print(f"✅ Found {total_samples:,} training samples")

    try:
        trainer = MarcoLoRATrainer(config=config)
        print("✅ Trainer initialized successfully")
        print(f"GPU memory after model loading: {torch.cuda.memory_allocated(0) / (1024**3):.2f} GB")
    except Exception as e:
        print(f"❌ Failed to initialize trainer: {e}")
        print("This might be due to:")
        print("1. Missing packages (restart runtime after installing)")
        print("2. Insufficient GPU memory")
        print("3. Internet connection issues for model download")

🚀 Initializing Marco LoRA Trainer...
This will download and load Qwen2.5-7B-Instruct (may take a few minutes)
✅ Found 10,130 training samples
🚀 L4 GPU detected: Using high-performance settings
   Effective batch size: 9
   Memory optimization: Enabled
✅ Trainer initialized successfully
GPU memory after model loading: 0.00 GB


In [None]:
# Setup model components (tokenizer, LoRA, data)
print("🔧 Setting up model components...")

# Check if trainer was successfully initialized
if 'trainer' not in locals():
    print("❌ Trainer not initialized. Please run the previous cell successfully first.")
    print("Cannot proceed with model setup without trainer.")
else:
    try:
        print("Loading tokenizer and model...")
        trainer.setup_model_and_tokenizer()
        print(f"GPU memory after base model: {torch.cuda.memory_allocated(0) / (1024**3):.2f} GB")

        print("Applying LoRA configuration...")
        trainer.setup_lora()
        print(f"GPU memory after LoRA: {torch.cuda.memory_allocated(0) / (1024**3):.2f} GB")

        print("Preparing training datasets...")
        trainer.setup_data()

        print("\n✅ All components ready for training")
    except Exception as e:
        print(f"❌ Setup failed: {e}")
        print("This might be due to:")
        print("1. GPU memory issues (try smaller batch size)")
        print("2. Data loading problems (check file paths)")
        print("3. Network issues (model download interrupted)")

🔧 Setting up model components...
Loading tokenizer and model...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

GPU memory after base model: 7.21 GB
Applying LoRA configuration...
GPU memory after LoRA: 7.36 GB
Preparing training datasets...

✅ All components ready for training


In [None]:
# Verify training setup
print("🔍 Training Setup Verification:")

# Check if trainer exists and has datasets
if 'trainer' not in locals():
    print("❌ Trainer not initialized. Please run the previous cells successfully.")
elif not hasattr(trainer, 'datasets') or trainer.datasets is None:
    print("❌ Datasets not loaded. Please run the setup cell above successfully first.")
    print("   The setup cell loads the model, applies LoRA, and prepares datasets.")
else:
    # Check datasets
    print(f"Training samples: {len(trainer.datasets['train']):,}")
    if 'validation' in trainer.datasets:
        print(f"Validation samples: {len(trainer.datasets['validation']):,}")

    # Memory check
    if torch.cuda.is_available():
        memory_used = torch.cuda.memory_allocated(0) / (1024**3)
        memory_total = torch.cuda.get_device_properties(0).total_memory / (1024**3)
        memory_percent = (memory_used / memory_total) * 100

        print(f"\n💾 Memory Usage:")
        print(f"Used: {memory_used:.2f} GB / {memory_total:.1f} GB ({memory_percent:.1f}%)")

        if memory_percent > 85:
            print("⚠️  High memory usage. Consider reducing batch size.")
        elif memory_percent < 50:
            print("✅ Good memory usage. Could potentially increase batch size.")
        else:
            print("✅ Optimal memory usage for training.")

        # Estimate training time
        total_samples = len(trainer.datasets['train'])
        effective_batch_size = config.training.per_device_train_batch_size * config.training.gradient_accumulation_steps
        steps_per_epoch = total_samples // effective_batch_size
        total_steps = steps_per_epoch * config.training.num_train_epochs

        print(f"\n⏱️  Training Estimates:")
        print(f"Steps per epoch: {steps_per_epoch}")
        print(f"Total training steps: {total_steps}")

        # GPU-specific time estimates
        gpu_name = torch.cuda.get_device_name(0)
        if "T4" in gpu_name:
            estimated_hours = total_steps * 0.8 / 60  # ~0.8 min per step on T4
            performance_note = "Memory-optimized for T4"
        elif "L4" in gpu_name:
            estimated_hours = total_steps * 0.4 / 60  # ~0.4 min per step on L4
            performance_note = "High-performance on L4 🚀"
        elif "A100" in gpu_name:
            estimated_hours = total_steps * 0.3 / 60  # ~0.3 min per step on A100
            performance_note = "Maximum performance on A100"
        else:
            estimated_hours = total_steps * 1.0 / 60  # Conservative estimate
            performance_note = "Conservative estimate for unknown GPU"

        print(f"Estimated training time: {estimated_hours:.1f} hours")
        print(f"Performance profile: {performance_note}")

        print("\n🚦 Ready to start training!")
    else:
        print("\n❌ No GPU detected - training will be extremely slow")

🔍 Training Setup Verification:
Training samples: 8,104
Validation samples: 1,519

💾 Memory Usage:
Used: 7.36 GB / 22.2 GB (33.2%)
✅ Good memory usage. Could potentially increase batch size.

⏱️  Training Estimates:
Steps per epoch: 900
Total training steps: 2700
Estimated training time: 18.0 hours
Performance profile: High-performance on L4 🚀

🚦 Ready to start training!


## 5. Fine-Tuning Execution

In [None]:
# Optional: Setup Weights & Biases for tracking
if config.experiment.use_wandb:
    try:
        import wandb

        # You may need to login to wandb first
        # wandb.login()  # Uncomment if needed

        trainer.setup_wandb()
        print("✅ Weights & Biases tracking enabled")
        print(f"Experiment: {config.experiment.experiment_name}")
    except Exception as e:
        print(f"⚠️  W&B setup failed: {e}")
        print("Training will continue without experiment tracking")
        config.experiment.use_wandb = False
else:
    print("📊 Training without experiment tracking")

📊 Training without experiment tracking


In [None]:
# Start training!
print("🚀 Starting Marco LoRA Fine-Tuning...")
print("This will take ~2-3 hours on L4 GPU. Monitor the progress below.")
print("\n" + "="*50)

# Run training
try:
    # Note: This calls the complete training pipeline
    # The trainer handles all setup internally
    trainer.train()

    print("\n" + "="*50)
    print("🎉 Training completed successfully!")
    print(f"📁 Model saved to: {config.training.output_dir}")

except KeyboardInterrupt:
    print("\n⏹️  Training interrupted by user")
    print("Partial model may be saved in checkpoints")

except Exception as e:
    print(f"\n❌ Training failed with error: {e}")
    print("Check the error details above")
    raise

🚀 Starting Marco LoRA Fine-Tuning...
This will take ~2-3 hours on L4 GPU. Monitor the progress below.



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

  self.trainer = Trainer(
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mari-katzir[0m ([33mariel-katzir[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss
600,0.4403,0.588308
700,0.5264,0.579639
800,0.4244,0.575169
900,0.4167,0.568732
1000,0.3328,0.57322
1100,0.3655,0.571508
1200,0.4496,0.566036
1300,0.3888,0.557911
1400,0.4243,0.554573
1500,0.4327,0.558335


You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a Qwen2TokenizerFast tokenizer. Please note that with a fast tokenizer, using t


🎉 Training completed successfully!
📁 Model saved to: ./marco_lora_checkpoints


In [None]:
# Verify training setup
print("🔍 Training Setup Verification:")

# Check if trainer exists and has datasets
if 'trainer' not in locals():
    print("❌ Trainer not initialized. Please run previous cells successfully.")
elif not hasattr(trainer, 'datasets') or trainer.datasets is None:
    print("❌ Datasets not loaded. Please run the setup cell successfully.")
else:
    # Check datasets
    print(f"Training samples: {len(trainer.datasets['train']):,}")
    if 'validation' in trainer.datasets:
        print(f"Validation samples: {len(trainer.datasets['validation']):,}")

    # Memory check
    if torch.cuda.is_available():
        memory_used = torch.cuda.memory_allocated(0) / (1024**3)
        memory_total = torch.cuda.get_device_properties(0).total_memory / (1024**3)
        memory_percent = (memory_used / memory_total) * 100

        print(f"\n💾 Memory Usage:")
        print(f"Used: {memory_used:.2f} GB / {memory_total:.1f} GB ({memory_percent:.1f}%)")

        if memory_percent > 85:
            print("⚠️  High memory usage. Consider reducing batch size.")
        elif memory_percent < 50:
            print("✅ Good memory usage. Could potentially increase batch size.")
        else:
            print("✅ Optimal memory usage for training.")

        # Estimate training time
        total_samples = len(trainer.datasets['train'])
        effective_batch_size = config.training.per_device_train_batch_size * config.training.gradient_accumulation_steps
        steps_per_epoch = total_samples // effective_batch_size
        total_steps = steps_per_epoch * config.training.num_train_epochs

        print(f"\n⏱️  Training Estimates:")
        print(f"Steps per epoch: {steps_per_epoch}")
        print(f"Total training steps: {total_steps}")

        # GPU-specific time estimates
        gpu_name = torch.cuda.get_device_name(0)
        if "T4" in gpu_name:
            estimated_hours = total_steps * 0.8 / 60  # ~0.8 min per step on T4
            performance_note = "Memory-optimized for T4"
        elif "L4" in gpu_name:
            estimated_hours = total_steps * 0.4 / 60  # ~0.4 min per step on L4
            performance_note = "High-performance on L4 🚀"
        elif "A100" in gpu_name:
            estimated_hours = total_steps * 0.3 / 60  # ~0.3 min per step on A100
            performance_note = "Maximum performance on A100"
        else:
            estimated_hours = total_steps * 1.0 / 60  # Conservative estimate
            performance_note = "Conservative estimate for unknown GPU"

        print(f"Estimated training time: {estimated_hours:.1f} hours")
        print(f"Performance profile: {performance_note}")

        print("\n🚦 Ready to start training!")
    else:
        print("\n❌ No GPU detected - training will be extremely slow")

🔍 Training Setup Verification:
Training samples: 8,104
Validation samples: 1,519

💾 Memory Usage:
Used: 15.35 GB / 22.2 GB (69.3%)
✅ Optimal memory usage for training.

⏱️  Training Estimates:
Steps per epoch: 900
Total training steps: 2700
Estimated training time: 18.0 hours
Performance profile: High-performance on L4 🚀

🚦 Ready to start training!


## 5. Fine-Tuning Execution

In [None]:
# Optional: Setup Weights & Biases for tracking
if config.experiment.use_wandb:
    try:
        import wandb

        # You may need to login to wandb first
        # wandb.login()  # Uncomment if needed

        trainer.setup_wandb()
        print("✅ Weights & Biases tracking enabled")
        print(f"Experiment: {config.experiment.experiment_name}")
    except Exception as e:
        print(f"⚠️  W&B setup failed: {e}")
        print("Training will continue without experiment tracking")
        config.experiment.use_wandb = False
else:
    print("📊 Training without experiment tracking")

📊 Training without experiment tracking


In [None]:
# Start training!
print("🚀 Starting Marco LoRA Fine-Tuning...")
print("This will take several hours. Monitor the progress below.")
print("\n" + "="*50)

# Run training
try:
    # Note: This calls the complete training pipeline
    # The trainer handles all setup internally
    trainer.train()

    print("\n" + "="*50)
    print("🎉 Training completed successfully!")
    print(f"📁 Model saved to: {config.training.output_dir}")

except KeyboardInterrupt:
    print("\n⏹️  Training interrupted by user")
    print("Partial model may be saved in checkpoints")

except Exception as e:
    print(f"\n❌ Training failed with error: {e}")
    print("Check the error details above")
    raise

🚀 Starting Marco LoRA Fine-Tuning...
This will take several hours. Monitor the progress below.



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]


⏹️  Training interrupted by user
Partial model may be saved in checkpoints


## 6. Model Testing & Quick Inference

In [None]:
# Test the trained model
print("🧪 Testing trained Marco model...")

# Initialize inference with trained LoRA adapter
marco = MarcoInference(
    lora_adapter_path=config.training.output_dir
)

print("✅ Trained Marco model loaded for testing")

🧪 Testing trained Marco model...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]



✅ Trained Marco model loaded for testing


In [None]:
# Quick conversation tests
test_questions = [
    "What does 'Buongiorno' mean?",
    "Can you explain the grammar in 'Ho mangiato una pizza'?",
    "Help me practice Italian greetings at A1 level",
    "What's the difference between 'essere' and 'stare'?",
    "Translate and explain: 'Sto studiando l'italiano da due anni'"
]

print("💬 Quick Inference Tests:")
print("="*60)

for i, question in enumerate(test_questions, 1):
    print(f"\n🙋 Test {i}: {question}")

    try:
        response = marco.chat(question)
        print(f"🤖 Marco: {response}")
    except Exception as e:
        print(f"❌ Error: {e}")

    print("-" * 40)

print("\n✅ Quick testing complete!")

💬 Quick Inference Tests:

🙋 Test 1: What does 'Buongiorno' mean?
🤖 Marco: It's 'Good morning'.
----------------------------------------

🙋 Test 2: Can you explain the grammar in 'Ho mangiato una pizza'?
🤖 Marco: Great question! This translates to 'I ate a pizza.'. This sentence uses role-based morphology. The subject is "Io" (I), which is in the first person singular form and agrees with the verb "ho mangiato" (I have eaten). The object of the sentence is "una pizza" (a pizza), which is also in the singular form and agrees with the verb. The verb "mangiare" (to eat) is conjugated in the present tense, third person singular, to agree with the subject "Io". 

So in summary:
- Subject: Io (first person singular)
- Verb: ho mangiato (present tense, third person singular)
- Object: una pizza (singular)

The word order follows standard Italian syntax, with the subject followed by the verb and then the object.
----------------------------------------

🙋 Test 3: Help me practice Italian greeti

## 7. Evaluation & Analysis

In [None]:
# Training metrics analysis
import matplotlib.pyplot as plt
import pandas as pd
import json
from pathlib import Path

# Check for training logs
log_file = Path(config.training.output_dir) / "trainer_state.json"

if log_file.exists():
    print("📊 Analyzing training metrics...")

    with open(log_file, 'r') as f:
        trainer_state = json.load(f)

    # Extract training history
    log_history = trainer_state.get('log_history', [])

    if log_history:
        # Create DataFrames for analysis
        train_logs = [log for log in log_history if 'train_loss' in log]
        eval_logs = [log for log in log_history if 'eval_loss' in log]

        if train_logs:
            train_df = pd.DataFrame(train_logs)

            # Plot training loss
            plt.figure(figsize=(12, 5))

            plt.subplot(1, 2, 1)
            plt.plot(train_df['step'], train_df['train_loss'], 'b-', linewidth=2)
            plt.title('Training Loss')
            plt.xlabel('Step')
            plt.ylabel('Loss')
            plt.grid(True, alpha=0.3)

            # Plot learning rate
            plt.subplot(1, 2, 2)
            if 'learning_rate' in train_df.columns:
                plt.plot(train_df['step'], train_df['learning_rate'], 'g-', linewidth=2)
                plt.title('Learning Rate Schedule')
                plt.xlabel('Step')
                plt.ylabel('Learning Rate')
                plt.grid(True, alpha=0.3)

            plt.tight_layout()
            plt.show()

            # Training summary
            final_loss = train_df['train_loss'].iloc[-1]
            initial_loss = train_df['train_loss'].iloc[0]
            improvement = ((initial_loss - final_loss) / initial_loss) * 100

            print(f"\n📈 Training Summary:")
            print(f"Initial loss: {initial_loss:.4f}")
            print(f"Final loss: {final_loss:.4f}")
            print(f"Improvement: {improvement:.1f}%")

        if eval_logs:
            eval_df = pd.DataFrame(eval_logs)
            print(f"\n📊 Validation Results:")
            print(f"Final validation loss: {eval_df['eval_loss'].iloc[-1]:.4f}")

else:
    print("📋 No training logs found for analysis")

📋 No training logs found for analysis


In [19]:
# Compare with base model (before fine-tuning)
print("🔄 Comparing Fine-tuned vs Base Model...")

# Load base model for comparison
base_marco = MarcoInference()  # No LoRA adapter = base model

comparison_questions = [
    "Explain the grammar in 'Sono andato al mare'",
    "What's the difference between 'molto' and 'troppo'?",
    "Help me understand when to use the subjunctive mood"
]

print("\n" + "="*80)
for i, question in enumerate(comparison_questions, 1):
    print(f"\n🔍 Comparison Test {i}: {question}")
    print("-" * 60)

    # Base model response
    print("🤖 Base Model:")
    try:
        base_response = base_marco.chat(question)
        print(f"{base_response[:300]}{'...' if len(base_response) > 300 else ''}")
    except Exception as e:
        print(f"❌ Error: {e}")

    print("\n🎓 Fine-tuned Marco:")
    try:
        tuned_response = marco.chat(question)
        print(f"{tuned_response[:300]}{'...' if len(tuned_response) > 300 else ''}")
    except Exception as e:
        print(f"❌ Error: {e}")

    print("\n" + "="*80)

print("\n✅ Model comparison complete!")

🔄 Comparing Fine-tuned vs Base Model...


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]





🔍 Comparison Test 1: Explain the grammar in 'Sono andato al mare'
------------------------------------------------------------
🤖 Base Model:
The phrase "Sono andato al mare" is a common Italian sentence that translates to English as "I went to the sea" or "I went to the beach." Let's break down the grammar:

1. **Sono**: This is the first-person singular form of the verb "essere" (to be) in the present tense. It means "I am," but in this...

🎓 Fine-tuned Marco:
Great question! This translates to 'I went to the beach.'. This sentence uses role reversal, which is a grammatical structure where the subject and object are switched. In this case, "Sono" (I am) is the subject, and "andato al mare" (went to the sea) is the object.

The verb "andare" (to go) is con...


🔍 Comparison Test 2: What's the difference between 'molto' and 'troppo'?
------------------------------------------------------------
🤖 Base Model:
In Italian, both "molto" and "troppo" can be used to express the degree of som

In [20]:
# Generate example conversations for different CEFR levels
print("🎯 Testing Marco across different CEFR levels...")

cefr_tests = {
    "A1": "Help me learn basic Italian greetings",
    "A2": "Explain how to talk about daily routines in Italian",
    "B1": "What's the difference between passato prossimo and imperfetto?",
    "B2": "Explain the use of the conditional mood in Italian"
}

for level, question in cefr_tests.items():
    print(f"\n📚 {level} Level Test:")
    print(f"Question: {question}")
    print("-" * 50)

    try:
        response = marco.chat(f"At {level} level: {question}")
        print(f"Marco: {response}")
    except Exception as e:
        print(f"❌ Error: {e}")

    print("\n")

print("✅ CEFR level testing complete!")

🎯 Testing Marco across different CEFR levels...

📚 A1 Level Test:
Question: Help me learn basic Italian greetings
--------------------------------------------------
Marco: Great question! The translation is 'Salve! Io sono Stella.'.



📚 A2 Level Test:
Question: Explain how to talk about daily routines in Italian
--------------------------------------------------
Marco: Bravissimo! Here's how you would say it: 'Parliamo delle routine quotidiane.'.



📚 B1 Level Test:
Question: What's the difference between passato prossimo and imperfetto?
--------------------------------------------------
Marco: Great question! In Italian, "passato prossimo" is used to describe actions that happened in the past and have ended. It's formed by adding the auxiliary verb "essere" or "essere" (for verbs like "andare" and "venire") to the past participle of the main verb.

On the other hand, "imperfetto" is used to describe ongoing actions in the past. It's formed by adding the endings "-avo/-evi/-iva/-ivamo

## 8. Final Summary & Next Steps

In [21]:
# Training completion summary
print("🎉 Marco LoRA Fine-Tuning Complete!")
print("="*50)

# Model info
print(f"📁 Model Location: {config.training.output_dir}")
print(f"🤖 Base Model: {config.training.model_name}")
print(f"⚙️  LoRA Configuration: r={config.lora.r}, alpha={config.lora.lora_alpha}")
print(f"📊 Training Data: {total_samples:,} total samples")
print(f"⏱️  Training Duration: {config.training.num_train_epochs} epochs")

# File sizes
checkpoint_dir = Path(config.training.output_dir)
if checkpoint_dir.exists():
    total_size = sum(f.stat().st_size for f in checkpoint_dir.glob('**/*') if f.is_file())
    print(f"💾 Model Size: {total_size / (1024**2):.1f} MB")

print("\n🚀 Next Steps:")
print("1. ✅ Test the model with your own Italian questions")
print("2. 📊 Run more comprehensive evaluation if needed")
print("3. 🔄 Integrate with your Italian Teacher application")
print("4. 📈 Consider training for more epochs if performance needs improvement")
print("5. 🎯 Add specialized question generation training")

print("\n💡 To use this model in your app:")
print(f'marco = MarcoInference(lora_adapter_path="{config.training.output_dir}")')
print('response = marco.chat("Your Italian question here")')

print("\n🎊 Congratulations on completing Marco's fine-tuning!")

🎉 Marco LoRA Fine-Tuning Complete!
📁 Model Location: ./marco_lora_checkpoints
🤖 Base Model: Qwen/Qwen2.5-7B-Instruct
⚙️  LoRA Configuration: r=16, alpha=32
📊 Training Data: 8,104 total samples
⏱️  Training Duration: 3 epochs
💾 Model Size: 1124.5 MB

🚀 Next Steps:
1. ✅ Test the model with your own Italian questions
2. 📊 Run more comprehensive evaluation if needed
3. 🔄 Integrate with your Italian Teacher application
4. 📈 Consider training for more epochs if performance needs improvement
5. 🎯 Add specialized question generation training

💡 To use this model in your app:
marco = MarcoInference(lora_adapter_path="./marco_lora_checkpoints")
response = marco.chat("Your Italian question here")

🎊 Congratulations on completing Marco's fine-tuning!
