# 🧠 EEG-to-Text Training Pipeline (Optimized)

## 📋 Overview
- **Model**: Supervised CNN + HMM
- **Expected Accuracy**: 20-40% (with ~100-150 classes)
- **Training Time**: 2-4 hours
- **Memory**: RAM-safe for Colab (5-6 GB peak)

## ✅ Key Optimizations Applied
- Fixed augmentation bug (was returning wrong number of samples)
- Reduced classes to most common sentences (better accuracy)
- Optimized chunk size for Colab RAM limits
- Fixed learning rate scheduler
- 6x data augmentation
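
The sample-count fix is easiest to reason about as an invariant. Here is a minimal sketch, assuming each input sample produces the original plus `num_aug` noisy copies. The repository's actual augmentation method, and whether "6x" counts the original, are not shown in this notebook, so `augment_batch` and the Gaussian noise are hypothetical stand-ins:

```python
import random

def augment_batch(samples, labels, num_aug=6, noise_std=0.01, seed=0):
    """Hypothetical helper: keep each original and add num_aug noisy copies."""
    rng = random.Random(seed)
    out_samples, out_labels = [], []
    for sample, label in zip(samples, labels):
        # Keep the untouched original first
        out_samples.append(list(sample))
        out_labels.append(label)
        # Then add num_aug copies with small Gaussian noise (assumed technique)
        for _ in range(num_aug):
            out_samples.append([x + rng.gauss(0.0, noise_std) for x in sample])
            out_labels.append(label)
    # Invariant: samples and labels stay aligned and match the expected count
    expected = len(samples) * (1 + num_aug)
    assert len(out_samples) == len(out_labels) == expected
    return out_samples, out_labels

samples = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
labels = [0, 1]
aug_x, aug_y = augment_batch(samples, labels, num_aug=6)
print(len(aug_x), len(aug_y))
```

Checking the count inside the function catches a mismatch between samples and labels at the point it is introduced, not later during training.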

## 📝 Before You Start
**You need to upload your dataset to Google Drive first!**

1. Create a folder in Google Drive: `ML_Project_Dataset`
2. Upload your dataset files there
3. Make sure you have:
   - `rawdata_*.csv` files (~5,915 files)
   - `sentence_mapping.csv` file

---

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')
print("\n✅ Google Drive mounted successfully!")

## Step 2: Clone Repository from GitHub

In [None]:
import os

# Clone repository (or pull latest changes if already exists)
if not os.path.exists('/content/ML-Project-Data'):
    print("📥 Cloning repository from GitHub...")
    !git clone https://github.com/Tejas-Chakkarwar/ML-Project-Data.git
    print("✅ Repository cloned!")
else:
    print("✅ Repository already exists")
    print("📥 Pulling latest changes...")
    !cd /content/ML-Project-Data && git pull origin main

# Change to project directory
os.chdir('/content/ML-Project-Data')
print(f"\n✅ Working directory: {os.getcwd()}")

# Verify files
print("\n📋 Verifying code files:")
print(f"  main_streaming_supervised.py: {'✅' if os.path.exists('main_streaming_supervised.py') else '❌ MISSING'}")
print(f"  src/ directory: {'✅' if os.path.exists('src') else '❌ MISSING'}")
print(f"  src/config.py: {'✅' if os.path.exists('src/config.py') else '❌ MISSING'}")

## Step 3: Install Dependencies

In [None]:
# Install required packages
!pip install -q torch numpy pandas scikit-learn

# Check GPU availability
import torch

print("\n📊 System Information:")
print(f"  GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"  GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print("\n  ✅ GPU enabled - Training will use GPU!")
else:
    print("\n  ⚠️ No GPU detected!")
    print("  Go to: Runtime → Change runtime type → GPU")

print("\n✅ Dependencies installed!")

## Step 4: Copy Dataset from Google Drive to Local Storage

⚠️ **IMPORTANT**: Update the `SOURCE` path below to match where you uploaded your dataset!

Common paths:
- `/content/drive/MyDrive/ML_Project_Dataset`
- `/content/drive/MyDrive/Colab Notebooks/dataset`
- `/content/drive/MyDrive/dataset`

In [None]:
import time
import shutil
import os
import glob

print("=" * 70)
print("COPYING DATASET FROM GOOGLE DRIVE")
print("=" * 70)

# ⚠️ UPDATE THIS PATH TO YOUR ACTUAL DATASET LOCATION!
SOURCE = '/content/drive/MyDrive/ML_Project_Dataset'  # ← Change this!
DEST = '/content/ML-Project-Data/processed_data'

# Verify source exists
if not os.path.exists(SOURCE):
    print(f"\n❌ ERROR: Source path not found!")
    print(f"   Path: {SOURCE}")
    print("\n📁 Available folders in MyDrive:")
    for item in os.listdir('/content/drive/MyDrive'):
        item_path = os.path.join('/content/drive/MyDrive', item)
        if os.path.isdir(item_path):
            print(f"   📁 {item}/")
    print("\n⚠️ Please update the SOURCE path at the top of this cell!")
else:
    # Create destination
    os.makedirs(DEST, exist_ok=True)
    
    # Get list of files
    all_files = sorted(os.listdir(SOURCE))
    total_files = len(all_files)
    
    print(f"\n✅ Source found: {SOURCE}")
    print(f"📊 Total files to copy: {total_files:,}")
    
    # Check what's already copied (count only files that also exist in SOURCE)
    already_copied = set(os.listdir(DEST)) & set(all_files)
    print(f"📊 Already copied: {len(already_copied):,}")
    print(f"📊 Remaining: {total_files - len(already_copied):,}\n")
    
    # Copy files with progress
    copied = 0
    failed = []
    start_time = time.time()
    
    for i, filename in enumerate(all_files, 1):
        # Skip if already copied
        if filename in already_copied:
            continue
        
        src_path = os.path.join(SOURCE, filename)
        dst_path = os.path.join(DEST, filename)

        # Skip subdirectories; only regular files are copied
        if not os.path.isfile(src_path):
            continue

        # Try to copy, retrying up to 3 times on transient Drive I/O errors
        max_retries = 3
        for attempt in range(max_retries):
            try:
                shutil.copy2(src_path, dst_path)
                copied += 1
                break
            except OSError:
                if attempt < max_retries - 1:
                    time.sleep(2)  # brief pause before retrying
                else:
                    failed.append(filename)
        
        # Progress update every 1000 files
        if (i % 1000 == 0) or (i == total_files):
            elapsed = time.time() - start_time
            print(f"   [{i:,}/{total_files:,}] Progress ({elapsed/60:.1f} min, {len(failed)} failed)")
    
    elapsed = time.time() - start_time
    
    print("\n" + "=" * 70)
    print("COPY COMPLETE")
    print("=" * 70)
    print(f"⏱️  Time: {elapsed/60:.1f} minutes")
    print(f"✅ Copied: {copied:,} files")
    print(f"📊 Already existed: {len(already_copied):,} files")
    print(f"❌ Failed: {len(failed)} files")
    
    # Verify
    csv_files = glob.glob(f'{DEST}/rawdata_*.csv')
    mapping_exists = os.path.exists(f'{DEST}/sentence_mapping.csv')
    
    print(f"\n📋 Final Verification:")
    print(f"   CSV files: {len(csv_files):,}")
    print(f"   Mapping file: {'✅' if mapping_exists else '❌'}")
    
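    if len(csv_files) >= 5900 and mapping_exists:
        print("\n🎉 SUCCESS! Data is ready for training!")
    else:
        # Report each problem separately so a missing mapping file is not hidden
        if len(csv_files) < 5900:
            print(f"\n⚠️  Warning: Only {len(csv_files):,} files (expected ~5,915)")
        if not mapping_exists:
            print("\n⚠️  Warning: sentence_mapping.csv is missing")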
    
    print("=" * 70)
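
The copy loop skips any filename that already exists in `DEST`, so a file left truncated by an interrupted session would also be skipped on the next run. Below is a minimal sketch of one way to detect that, assuming a size mismatch is a good enough signal. The helper name `find_incomplete_copies` is hypothetical and not part of the repository; the demo uses temporary directories rather than Drive:

```python
import os
import tempfile

def find_incomplete_copies(source_dir, dest_dir):
    """Return filenames missing from dest_dir or whose size differs from the source."""
    incomplete = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            continue  # ignore subdirectories
        dst = os.path.join(dest_dir, name)
        if not os.path.isfile(dst) or os.path.getsize(dst) != os.path.getsize(src):
            incomplete.append(name)
    return incomplete

# Demo: the second destination file is deliberately truncated
with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    for name, payload in [("rawdata_0001.csv", "a,b\n1,2\n"),
                          ("rawdata_0002.csv", "a,b\n3,4\n")]:
        with open(os.path.join(src_dir, name), "w") as f:
            f.write(payload)
    with open(os.path.join(dst_dir, "rawdata_0001.csv"), "w") as f:
        f.write("a,b\n1,2\n")
    with open(os.path.join(dst_dir, "rawdata_0002.csv"), "w") as f:
        f.write("a,b\n")
    demo_result = find_incomplete_copies(src_dir, dst_dir)

print(demo_result)
```

In the notebook you would call it as `find_incomplete_copies(SOURCE, DEST)`, delete any files it reports from `DEST`, and re-run the copy cell.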

## Step 5: Verify Data and Configuration

In [None]:
import os

os.chdir('/content/ML-Project-Data')

print("=" * 70)
print("PRE-TRAINING VERIFICATION")
print("=" * 70)

# Check data
data_dir = 'processed_data'
if os.path.exists(data_dir):
    csv_files = [f for f in os.listdir(data_dir) if f.endswith('.csv') and f.startswith('rawdata')]
    mapping_file = os.path.join(data_dir, 'sentence_mapping.csv')
    
    print(f"\n✅ Data Status:")
    print(f"   Directory: {data_dir}")
    print(f"   CSV files: {len(csv_files):,}")
    print(f"   Mapping file: {'✅' if os.path.exists(mapping_file) else '❌'}")
    
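    if len(csv_files) >= 5900 and os.path.exists(mapping_file):
        print("\n   🎉 All data ready!")
    else:
        # Report each problem separately so a missing mapping file is not hidden
        if len(csv_files) < 5900:
            print(f"\n   ⚠️  Only {len(csv_files):,} files found (expected ~5,915)")
        if not os.path.exists(mapping_file):
            print("\n   ⚠️  sentence_mapping.csv is missing")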
else:
    print(f"\n❌ Data directory not found: {data_dir}")
    print("   Please run Step 4 to copy data first!")

# Check GPU
import torch
print(f"\n✅ GPU Status:")
print(f"   Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   Device: {torch.cuda.get_device_name(0)}")

# Check config
print(f"\n✅ Configuration:")
with open('src/config.py', 'r') as f:
    for line in f:
        if 'CNN_DEVICE' in line or 'MIN_SAMPLES_PER_SENTENCE' in line or 'NUM_AUGMENTATIONS' in line:
            print(f"   {line.strip()}")

print("\n" + "=" * 70)
print("✅ Verification complete! Ready to train.")
print("=" * 70)

## Step 6: Run Training (Optimized Settings)

### 🎯 Configuration:
- **Classes**: ~100-150 (only sentences with at least 25 samples, `--min-samples 25`)
- **Augmentation**: 6x (more training data)
- **Chunk size**: 400 (RAM-safe for Colab)
- **Batch size**: 32 (optimized)
- **Epochs**: 12 (good balance)
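
The link between `--min-samples` and the class count can be sketched as follows. This assumes filtering simply keeps sentences that have at least `min_samples` recordings. `select_classes` and the toy labels are illustrative and not taken from the repository:

```python
from collections import Counter

def select_classes(sentence_labels, min_samples=25):
    """Keep only labels that occur at least min_samples times (assumed rule)."""
    counts = Counter(sentence_labels)
    return sorted(label for label, n in counts.items() if n >= min_samples)

# Toy data: three sentences with 30, 25, and 10 recordings
toy_labels = ["s1"] * 30 + ["s2"] * 25 + ["s3"] * 10

kept = select_classes(toy_labels, min_samples=25)
stricter = select_classes(toy_labels, min_samples=30)
print(kept, stricter)
```

Raising the threshold leaves fewer classes, each with more examples. That is the trade-off behind choosing 25.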

### ⏱️ Expected:
- **Training time**: 2-4 hours
- **Target accuracy**: 20-40%
- **Memory usage**: ~5-6 GB RAM (safe for Colab)

### 📊 What to Watch:
- Training accuracy should rise from ~1% (chance level) toward 25-40%
- Loss should decrease steadily
- No "Out of Memory" errors

---

**⚠️ This cell will take 2-4 hours to complete. Don't close your browser!**

In [None]:
import os
import time

os.chdir('/content/ML-Project-Data')

print("=" * 70)
print("🚀 EEG-TO-TEXT TRAINING - OPTIMIZED")
print("=" * 70)
print("\n🎯 Configuration:")
print("  ✅ Min samples per class: 25 (~100-150 classes)")
print("  ✅ Augmentation: 6x (more training data)")
print("  ✅ Chunk size: 400 (RAM-safe)")
print("  ✅ Batch size: 32")
print("  ✅ Epochs: 12")
print("  ✅ HMM states: 5")
print()
print("💾 Expected Memory: ~5-6 GB RAM (safe for Colab)")
print("⏱️  Expected Time: 2-4 hours")
print("🎯 Target Accuracy: 20-40%")
print("=" * 70 + "\n")

start = time.time()

!python main_streaming_supervised.py \
  --cnn-epochs 12 \
  --cnn-batch-size 32 \
  --hmm-states 5 \
  --num-aug 6 \
  --min-samples 25 \
  --chunk-size 400 \
  --save-models \
  --verbose

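elapsed = time.time() - start

# IPython stores the exit status of the last `!` command in `_exit_code`;
# fall back to 0 if the variable is not set
exit_code = globals().get('_exit_code', 0)

print("\n" + "=" * 70)
if exit_code == 0:
    print(f"🎉 TRAINING COMPLETED IN {elapsed/60:.1f} MINUTES ({elapsed/3600:.1f} HOURS)")
else:
    print(f"❌ TRAINING FAILED (exit code {exit_code}) AFTER {elapsed/60:.1f} MINUTES")
print("=" * 70)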
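
An alternative to the `!python` shell escape is `subprocess.run`, which returns the exit code directly so the final banner can depend on it. This is a minimal sketch; `run_and_time` is a hypothetical helper, and the demo commands are stand-ins for the real training command:

```python
import subprocess
import sys
import time

def run_and_time(cmd):
    """Run cmd (a list of arguments) and return (return_code, elapsed_seconds)."""
    start = time.time()
    result = subprocess.run(cmd)
    return result.returncode, time.time() - start

# Stand-in commands: one succeeds, one exits with status 3
ok_code, ok_elapsed = run_and_time([sys.executable, "-c", "print('training stub')"])
bad_code, _ = run_and_time([sys.executable, "-c", "import sys; sys.exit(3)"])

for code in (ok_code, bad_code):
    print("completed" if code == 0 else f"failed with exit code {code}")
```

For the real run, the command list would start with `[sys.executable, "main_streaming_supervised.py", "--cnn-epochs", "12", ...]` and continue with the same flags as the cell above.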

## Step 7: Save Models to Google Drive

Copy trained models to Google Drive so they persist after the session ends.

In [None]:
import os
import shutil

# Create destination in Google Drive
DRIVE_MODELS_DIR = '/content/drive/MyDrive/ML_Project_Models_Final'
os.makedirs(DRIVE_MODELS_DIR, exist_ok=True)

# Copy checkpoints
LOCAL_CHECKPOINT_DIR = '/content/ML-Project-Data/checkpoints'

if os.path.exists(LOCAL_CHECKPOINT_DIR):
    print("📦 Saving models to Google Drive...\n")
    
    for filename in os.listdir(LOCAL_CHECKPOINT_DIR):
        if filename.endswith(('.pth', '.pkl')):
            src = os.path.join(LOCAL_CHECKPOINT_DIR, filename)
            dst = os.path.join(DRIVE_MODELS_DIR, filename)
            
            shutil.copy2(src, dst)
            size_mb = os.path.getsize(dst) / 1e6
            print(f"✅ {filename} ({size_mb:.1f} MB)")
    
    print(f"\n✅ Models saved to: {DRIVE_MODELS_DIR}")
    print("   These will persist even after session ends!")
else:
    print("⚠️  No checkpoints found. Did training complete successfully?")
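
Writes to a mounted Drive folder can fail silently if the session disconnects, so it may be worth confirming that each copy matches its source. Below is a minimal sketch using SHA-256 checksums. `sha256_of` and `copy_verified` are hypothetical helpers, and the demo copies a stub file inside a temporary directory rather than to Drive:

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path, block_size=1 << 20):
    """Hash a file in blocks so large checkpoints do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def copy_verified(src, dst):
    """Copy src to dst and report whether the two files hash identically."""
    shutil.copy2(src, dst)
    return sha256_of(src) == sha256_of(dst)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "cnn_encoder.pth")
    dst = os.path.join(tmp, "backup_cnn_encoder.pth")
    with open(src, "wb") as f:
        f.write(b"\x00" * 4096)  # stub content, not a real checkpoint
    verified = copy_verified(src, dst)

print(verified)
```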

## Step 8 (Optional): Download Models to Local Machine

In [None]:
from google.colab import files
import os

checkpoint_dir = '/content/ML-Project-Data/checkpoints'

if os.path.exists(checkpoint_dir):
    print("📥 Downloading models...\n")
    
    for filename in os.listdir(checkpoint_dir):
        if filename.endswith(('.pth', '.pkl')):
            filepath = os.path.join(checkpoint_dir, filename)
            print(f"Downloading {filename}...")
            files.download(filepath)
    
    print("\n✅ Downloads started! Check your browser's download folder.")
else:
    print("❌ No checkpoints to download.")
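
Browsers sometimes block several downloads triggered in quick succession. One workaround is to bundle the checkpoints into a single archive first. Below is a minimal sketch with `shutil.make_archive`; `bundle_checkpoints` is a hypothetical helper, and the demo uses stub files in a temporary directory:

```python
import os
import shutil
import tempfile

def bundle_checkpoints(checkpoint_dir, archive_base):
    """Zip checkpoint_dir into archive_base + '.zip' and return the archive path."""
    # archive_base should live outside checkpoint_dir so the zip does not include itself
    return shutil.make_archive(archive_base, "zip", root_dir=checkpoint_dir)

with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = os.path.join(tmp, "checkpoints")
    os.makedirs(ckpt_dir)
    for name in ("cnn_encoder.pth", "hmm_models.pkl"):
        with open(os.path.join(ckpt_dir, name), "wb") as f:
            f.write(b"stub")  # placeholder bytes, not real model weights
    archive_path = bundle_checkpoints(ckpt_dir, os.path.join(tmp, "models"))
    archive_name = os.path.basename(archive_path)
    archive_exists = os.path.isfile(archive_path)

print(archive_name, archive_exists)
```

In Colab, the returned path can then be passed to a single `files.download(...)` call.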

---

## 📊 Understanding Your Results

### What's Good Performance?

With **~100-150 classes**:
- **Random guessing**: ~0.67-1.0%
- **Poor**: 5-10%
- **Decent**: 15-25%
- **Good**: 25-35% ⭐
- **Excellent**: 35-45%
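
The random-guessing figures above are 1 divided by the number of classes. A small sketch of that arithmetic follows, assuming balanced classes; `chance_accuracy` and `lift_over_chance` are illustrative helper names:

```python
def chance_accuracy(num_classes):
    """Accuracy (%) of uniform random guessing over balanced classes."""
    return 100.0 / num_classes

def lift_over_chance(accuracy_pct, num_classes):
    """How many times better than random guessing a given accuracy is."""
    return accuracy_pct / chance_accuracy(num_classes)

for n in (100, 150):
    print(f"{n} classes: {chance_accuracy(n):.2f}% chance")

print(f"25% on 100 classes is {lift_over_chance(25.0, 100):.0f}x chance")
```

Measured this way, an accuracy that looks low in absolute terms can still be many times better than guessing.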

### If Accuracy is Low (<15%):
1. Reduce classes further: Try `--min-samples 40` (top 50-80 classes)
2. Train longer: Try `--cnn-epochs 20`
3. Increase augmentation: Try `--num-aug 10`

### If You Get "Out of Memory":
1. Reduce chunk size: Try `--chunk-size 300`
2. Reduce batch size: Try `--cnn-batch-size 16`
3. Reduce augmentation: Try `--num-aug 4`
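
The first and third suggestions help for the same reason. As a rough sketch, assume each chunk is held in memory as float32 together with all of its augmented copies. That is an assumption about the pipeline, not something this notebook confirms. The channel and timestep counts below are hypothetical placeholders, not the dataset's real shape:

```python
def chunk_memory_gb(chunk_size, num_aug, channels, timesteps, bytes_per_value=4):
    """Rough size of one chunk plus its augmented copies, in GB (float32 by default)."""
    samples = chunk_size * (1 + num_aug)  # originals plus augmented copies
    return samples * channels * timesteps * bytes_per_value / 1e9

# Hypothetical recording shape: 64 channels x 1000 timesteps
baseline = chunk_memory_gb(400, 6, channels=64, timesteps=1000)
smaller = chunk_memory_gb(300, 4, channels=64, timesteps=1000)

print(f"baseline: {baseline:.3f} GB, reduced: {smaller:.3f} GB")
print(f"reduced settings use {smaller / baseline:.0%} of the baseline")
```

Under this model, memory scales linearly with both the chunk size and the number of augmented copies, so lowering either one shrinks the peak proportionally.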

---

## 🎉 Training Complete!

**Your trained models are saved in:**
- Google Drive: `/content/drive/MyDrive/ML_Project_Models_Final/`
- Local (temporary): `/content/ML-Project-Data/checkpoints/`

**Files:**
- `cnn_encoder.pth` - Trained CNN feature extractor
- `hmm_models.pkl` - Trained HMM models for each sentence

---