# EEG-to-Text HMM Pipeline - Improved Version (Google Colab)

This notebook runs the **improved** EEG-to-text pipeline on Google Colab with GPU support.

## 🎯 Key Improvements
- **Supervised CNN** (classification loss instead of reconstruction)
- **Diagonal Covariance HMMs** (more stable with limited data)
- **Feature Normalization** (better HMM convergence)
- **Enhanced Data Augmentation** (6 techniques)
- **Better Hyperparameters** (5 HMM states, 5 CNN epochs)

## 📊 Expected Results
- **Baseline**: ~36% accuracy
- **With improvements**: **50-70% accuracy**
- **Training time**: 30-60 minutes with GPU

## 🚀 Quick Start
1. Upload your data folder to Google Drive
2. Update `DRIVE_PATH` in Cell 3
3. Run all cells
4. Models auto-save to Google Drive

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("✓ Google Drive mounted")

## Step 2: Install Dependencies

In [None]:
# Install required packages
!pip install -q pandas numpy scikit-learn

# Check GPU availability
import torch
print(f"\nGPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print("\n✓ GPU enabled - CNN training will be 5-10x faster!")
else:
    print("\n⚠️  GPU not available. Go to Runtime > Change runtime type > GPU")

## Step 3: Set Up Project Directory

**⚠️ IMPORTANT: Update `DRIVE_PATH` to your folder location in Google Drive**

Your folder should contain:
```
ML_Project/
├── processed_data/
│   ├── rawdata_0001.csv
│   ├── rawdata_0002.csv
│   ├── ...
│   └── sentence_mapping.csv
├── src/
│   ├── config.py
│   ├── data_loader.py
│   ├── feature_extractor.py
│   ├── hmm_model.py
│   ├── predictor.py
│   └── utils.py
└── main.py
```

In [None]:
import os

# ========================================
# UPDATE THIS PATH!
# ========================================
DRIVE_PATH = '/content/drive/MyDrive/ML_Project'

# Change to project directory
os.chdir(DRIVE_PATH)
print(f"Working directory: {os.getcwd()}\n")

# Verify directory structure
print("Checking project structure...")
required_files = [
    'main.py',
    'src/config.py',
    'src/feature_extractor.py',
    'src/hmm_model.py',
    'processed_data/sentence_mapping.csv'
]

all_good = True
for file in required_files:
    if os.path.exists(file):
        print(f"✓ {file}")
    else:
        print(f"✗ {file} NOT FOUND")
        all_good = False

if all_good:
    print("\n✓ All required files found!")
else:
    print("\n⚠️  Some files are missing. Please check your DRIVE_PATH.")
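
If `DRIVE_PATH` is wrong, `os.chdir` fails before the structure check above ever runs, and the error does not mention Drive. A minimal sketch of a guard is shown below; `enter_project_dir` is a hypothetical helper introduced here for illustration and is not part of the project code.

```python
import os

def enter_project_dir(path):
    """Change into `path`, failing with an actionable message if it is missing.

    Hypothetical helper for illustration; not part of the project's code.
    """
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"Project folder not found: {path!r}. "
            "Check that Google Drive is mounted and DRIVE_PATH is correct."
        )
    os.chdir(path)
    return os.getcwd()
```

It could replace the bare `os.chdir(DRIVE_PATH)` call, for example `enter_project_dir(DRIVE_PATH)`.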

## Step 4: Configure GPU Device

This updates the config file to use GPU if available.

In [None]:
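import re
import torch

# Read current config
with open('src/config.py', 'r') as f:
    config_content = f.read()

# Rewrite the CNN_DEVICE line whatever its current value, so re-running
# this cell (or switching runtime type) still updates the config
device = 'cuda' if torch.cuda.is_available() else 'cpu'
config_content, n_updated = re.subn(
    r"^([ \t]*)CNN_DEVICE[ \t]*=.*$",
    rf"\g<1>CNN_DEVICE = '{device}'",
    config_content,
    flags=re.MULTILINE,
)

if n_updated:
    # Write back
    with open('src/config.py', 'w') as f:
        f.write(config_content)
    print(f"✓ Config updated to use: {device}")
    if device == 'cuda':
        print("  CNN training will be ~5-10x faster!")
else:
    print("⚠️  No CNN_DEVICE line found in src/config.py; config left unchanged")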

## Step 5: Verify Dataset

Check how many files and sentences we have.

In [None]:
import pandas as pd
import glob

# Count CSV files
csv_files = glob.glob('processed_data/rawdata_*.csv')
print(f"Total CSV files: {len(csv_files)}")

# Load mapping
mapping = pd.read_csv('processed_data/sentence_mapping.csv')
print(f"Mapping entries: {len(mapping)}")
print(f"Unique sentences: {mapping['Content'].nunique()}")

# Distribution
counts = mapping['Content'].value_counts()
print(f"\nSamples per sentence:")
print(f"  Min: {counts.min()}")
print(f"  Max: {counts.max()}")
print(f"  Mean: {counts.mean():.1f}")
print(f"  Median: {counts.median():.0f}")

print(f"\nSentences with >= 3 samples: {(counts >= 3).sum()}")
print(f"Sentences with >= 5 samples: {(counts >= 5).sum()}")

# Sample data
print("\nSample sentences:")
for sent in mapping['Content'].unique()[:3]:
    print(f"  - {sent[:70]}...")
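
Beyond counting files, it can help to spot gaps in the recording numbering. The sketch below assumes the files follow the `rawdata_NNNN.csv` pattern shown in Step 3 and that indices are meant to be contiguous; that second point is an assumption, and `find_missing_indices` is a hypothetical helper.

```python
import re

def find_missing_indices(filenames):
    """Return the recording indices absent from a rawdata_NNNN.csv listing.

    Assumes files are numbered contiguously between the lowest and highest
    index present; that numbering scheme is an assumption, not confirmed.
    """
    pattern = re.compile(r"rawdata_(\d+)\.csv$")
    found = set()
    for name in filenames:
        match = pattern.search(name)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    expected = range(min(found), max(found) + 1)
    return [i for i in expected if i not in found]

# Toy listing with one gap
names = [
    "processed_data/rawdata_0001.csv",
    "processed_data/rawdata_0002.csv",
    "processed_data/rawdata_0004.csv",
]
print(find_missing_indices(names))  # [3]
```

In the notebook, the `csv_files` list from the cell above could be passed in directly.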

## Step 6: Quick Test (Optional)

Run a quick test with 100 files to verify everything works (~2-3 minutes).

**Skip this cell if you want to go directly to full training.**

In [None]:
# Quick test with 100 files
!python main.py --quick-test

## Step 7: Full Training with Improvements 🚀

This runs the complete improved pipeline:
- ✅ **Supervised CNN** with classification loss
- ✅ **5 HMM states** (increased from 3)
- ✅ **5 CNN epochs** (increased from 3)
- ✅ **2x augmentation** with 6 techniques
- ✅ **Feature normalization**
- ✅ **Diagonal covariance** HMMs

**Expected time:** 30-60 minutes

**Expected accuracy:** 50-70% (vs 36% baseline)

**Note:** These improvements are already built into the updated code, so `main.py` applies them automatically.

In [None]:
# Run full training
!python main.py --num-aug 2 --save-models --verbose

## Step 8: Alternative - Memory Efficient Version

**Only run this if Step 7 failed due to memory errors.**

This version processes data in batches to use less RAM.

In [None]:
# Memory-efficient version (only if needed)
!python main_memory_efficient.py --num-aug 2 --save-models --verbose
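
The general idea of batch processing can be sketched as follows. This is an illustration of the technique only, not the actual contents of `main_memory_efficient.py`; `iter_batches` and the batch size are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy file list; in practice this would be the rawdata_*.csv paths
file_list = [f"rawdata_{i:04d}.csv" for i in range(1, 8)]
for batch in iter_batches(file_list, batch_size=3):
    # Load, process, and release one batch here before moving on,
    # so only `batch_size` recordings are held in RAM at a time.
    print(len(batch), batch[0])
```

Because only one batch is materialized at a time, peak memory scales with the batch size rather than with the full dataset.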

## Step 9: Check Results

View the saved models and their sizes.

In [None]:
import os

print("Saved Models:")
print("=" * 50)

checkpoint_dir = 'checkpoints'
if os.path.exists(checkpoint_dir):
    for file in os.listdir(checkpoint_dir):
        filepath = os.path.join(checkpoint_dir, file)
        size_mb = os.path.getsize(filepath) / 1e6
        print(f"✓ {file}: {size_mb:.2f} MB")
else:
    print("⚠️  No checkpoints directory found")

print("\nModels are automatically saved to your Google Drive!")
print(f"Location: {DRIVE_PATH}/checkpoints/")

## Step 10: Download Models (Optional)

Download the trained models to your local machine.

In [None]:
import os
from google.colab import files

# Download CNN encoder
if os.path.exists('checkpoints/cnn_encoder.pth'):
    files.download('checkpoints/cnn_encoder.pth')
    print("✓ Downloaded CNN encoder")

# Download HMM models
if os.path.exists('checkpoints/hmm_models.pkl'):
    files.download('checkpoints/hmm_models.pkl')
    print("✓ Downloaded HMM models")
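
As an alternative to downloading each file separately, the checkpoint folder could be packed into a single archive first. This is a sketch using the standard library; `zip_checkpoints` and the archive name `trained_models` are hypothetical.

```python
import os
import shutil

def zip_checkpoints(checkpoint_dir="checkpoints", archive_base="trained_models"):
    """Pack every file in `checkpoint_dir` into `<archive_base>.zip`.

    Returns the path of the archive. `archive_base` is a hypothetical name
    chosen for this example.
    """
    if not os.path.isdir(checkpoint_dir):
        raise FileNotFoundError(f"No checkpoint directory at {checkpoint_dir!r}")
    return shutil.make_archive(archive_base, "zip", root_dir=checkpoint_dir)

# In Colab, the archive could then be downloaded in one step:
# from google.colab import files
# files.download(zip_checkpoints())
```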

## Step 11: Test Inference

Load the trained models and test on sample data.

In [None]:
import sys
sys.path.append('src')

import torch
import numpy as np
from feature_extractor import SupervisedCNNEncoder
from predictor import SentencePredictor
from data_loader import DataLoader
from sklearn.preprocessing import StandardScaler

print("Loading models...")

# Load data loader
loader = DataLoader('processed_data')
loader.load_mapping()

# Load the CNN checkpoint on CPU
checkpoint = torch.load('checkpoints/cnn_encoder.pth', map_location='cpu')

# Rebuild the encoder; these arguments must match the values used in training
encoder = SupervisedCNNEncoder(
    input_channels=105,
    hidden_channels=32,
    num_classes=344,  # Hardcoded; update if your dataset has a different number of sentences
    sequence_length=5500
)
encoder.load_state_dict(checkpoint['model_state_dict'])
encoder.eval()
print("✓ CNN encoder loaded")

# Load HMM predictor
predictor = SentencePredictor(n_states=5, n_features=32)
predictor.load('checkpoints/hmm_models.pkl')
print(f"✓ Loaded {len(predictor.models)} HMM models")

# Test on a few random files
print("\nTesting on sample files...\n")
import random
test_files = random.sample(loader.get_all_files(), 5)

for i, test_file in enumerate(test_files, 1):
    # Load data
    test_data = loader.load_padded_data(test_file, target_length=5500)
    true_text = loader.get_text_for_file(test_file)
    
    # Extract features
    with torch.no_grad():
        X_tensor = torch.tensor(test_data[np.newaxis, :, :], dtype=torch.float32)
        features = encoder.get_features(X_tensor)
        features_np = features.numpy()[0].T
    
    # Note: In production, you should use the same scaler from training
    # For demo, we'll predict without normalization (may be less accurate)
    pred_text, score = predictor.predict(features_np)
    
    # Display
    is_correct = (pred_text == true_text)
    result = "✓ CORRECT" if is_correct else "✗ WRONG"
    
    print(f"Sample {i}:")
    print(f"  True: {true_text[:70]}...")
    print(f"  Pred: {pred_text[:70]}...")
    print(f"  {result}\n")
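
The demo above predicts without normalization, so its features do not match what the HMMs saw during training. One way to close that gap is to save the normalization statistics at training time and reapply them at inference. The sketch below assumes per-feature standardization (zero mean, unit variance) on arrays shaped `(n_frames, n_features)`; the project's own scaler may work differently, and all function names and the file name `feature_scaler.npz` are hypothetical.

```python
import numpy as np

def fit_scaler(train_features):
    """Compute per-feature mean and std from an (n_frames, n_features) array."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def save_scaler(path, mean, std):
    """Store the statistics in a .npz file (path should end in '.npz')."""
    np.savez(path, mean=mean, std=std)

def load_scaler(path):
    """Read back the statistics written by save_scaler."""
    with np.load(path) as data:
        return data["mean"], data["std"]

def apply_scaler(features, mean, std):
    """Standardize features with statistics fitted on the training set."""
    return (features - mean) / std
```

At inference, `features_np = apply_scaler(features_np, mean, std)` would run just before `predictor.predict(features_np)`, using the statistics loaded from the training run.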

## 📊 Understanding Results

### What to Look For

**CNN Training:**
- Training accuracy should reach **70-90%** by epoch 5
- This shows features are discriminative

**HMM Training:**
- Log-likelihood should increase (become less negative)
- All 344 models should train successfully
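
EM training should not lower the log-likelihood from one iteration to the next, so a drop usually points to a numerical problem. If the per-iteration values are available, a check like the one below can flag it. How the pipeline exposes its training log is not shown here, so the `history` list is made-up data and `first_decrease` is a hypothetical helper.

```python
def first_decrease(log_likelihoods, tol=1e-6):
    """Return the index of the first iteration whose log-likelihood drops by
    more than `tol` relative to the previous one, or None if there is none."""
    for i in range(1, len(log_likelihoods)):
        if log_likelihoods[i] < log_likelihoods[i - 1] - tol:
            return i
    return None

history = [-1520.4, -1310.9, -1288.2, -1287.9]  # made-up values for illustration
print(first_decrease(history))  # None
```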

**Final Accuracy:**
- **Baseline**: ~36% (124x better than random 0.29%)
- **With improvements**: **50-70%** (172-241x better than random)
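
The "better than random" multipliers follow from the chance level for 344 sentence classes. A short worked check, assuming uniform random guessing over the classes:

```python
n_sentences = 344          # number of sentence classes stated above
chance = 1 / n_sentences   # accuracy of uniform random guessing

def times_better_than_chance(accuracy):
    """Ratio of a given accuracy to the chance level."""
    return accuracy / chance

print(f"chance: {chance:.2%}")
for acc in (0.36, 0.50, 0.70):
    print(f"{acc:.0%} -> {times_better_than_chance(acc):.0f}x")
```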

### Improvements Summary

| Component | Improvement | Impact |
|-----------|-------------|--------|
| Supervised CNN | Classification loss | +15-25% |
| Diagonal Covariance | 32x fewer covariance params | +5-10% |
| Feature Normalization | Stability | +2-5% |
| Enhanced Augmentation | 6 techniques | +5-10% |
| Better Hyperparameters | 5 states, 5 epochs | +3-5% |
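| **Total** | **All combined** | **+14-34 points** |

The per-component figures should not be summed: adding them would give +30-55 points, whereas the expected 50-70% accuracy is +14 to +34 points over the ~36% baseline.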
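
The "32x" figure for diagonal covariance can be reproduced by counting covariance parameters per HMM state. This sketch assumes the figure refers to the 32-dimensional features used in Step 11 (`n_features=32`) and counts every entry of the full matrix; counting only the unique entries of a symmetric matrix gives a smaller ratio.

```python
def covariance_param_counts(n_features):
    """Covariance parameters per Gaussian state for each way of counting."""
    return {
        "diag": n_features,                                 # variances only
        "full_entries": n_features * n_features,            # all matrix entries
        "full_unique": n_features * (n_features + 1) // 2,  # symmetric matrix
    }

counts = covariance_param_counts(32)
print(counts)
print(counts["full_entries"] // counts["diag"])  # 32
```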

### Troubleshooting

**Out of Memory:**
- Use `main_memory_efficient.py`
- Reduce augmentation: `--num-aug 1`
- Reduce batch size: `--cnn-batch-size 4`

**Low CNN Accuracy (<50%):**
- Check GPU is enabled
- Increase epochs: `--cnn-epochs 10`

**Session Timeout:**
- Models are saved to Google Drive automatically
- Can resume with: `--resume checkpoints/cnn_encoder.pth`

### Next Steps

1. Review per-sentence accuracy to identify difficult sentences
2. Experiment with hyperparameters (6 HMM states, 10 epochs, etc.)
3. Use trained models for inference on new data
4. Download models for local use

---

**Note:** All improvements are already integrated in the code. The supervised CNN, diagonal covariance HMMs, normalization, and enhanced augmentation are applied automatically when you run `main.py`.