# EEG-to-Text HMM Pipeline - Google Colab

This notebook runs the complete EEG-to-text pipeline on Google Colab with GPU support.

## Setup Instructions

1. Upload your dataset to Google Drive
2. Run the cells in order
3. Training will take ~30-45 minutes

**Expected Results:**
- ~344 unique sentences
- ~20-40% accuracy
- Models saved to Google Drive

## 1. Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 2. Install Dependencies

In [None]:
# Install required packages (PyTorch is pre-installed on Colab)
!pip install -q pandas numpy

# Check GPU availability
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 3. Set Up Project Directory

**IMPORTANT:** Update the `DRIVE_PATH` below to point to your uploaded dataset folder in Google Drive.

In [None]:
import os

# UPDATE THIS PATH to where you uploaded your data in Google Drive
DRIVE_PATH = '/content/drive/MyDrive/ML Project Data'

# Change to project directory
os.chdir(DRIVE_PATH)

# Verify the directory structure
print("Project directory contents:")
!ls -la

print("\nProcessed data directory:")
!ls processed_data/*.csv | head -5
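
Before running later cells, it can help to confirm that the folder actually has the layout they rely on. The sketch below assumes only the paths referenced elsewhere in this notebook (`main.py`, `src/config.py`, `processed_data/sentence_mapping.csv`); the helper name `missing_paths` is hypothetical and not part of the project.

```python
import os

# Paths referenced by later cells in this notebook; adjust if your layout differs.
EXPECTED_PATHS = (
    "main.py",
    "src/config.py",
    "processed_data/sentence_mapping.csv",
)

def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root."""
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]
```

Calling `missing_paths(DRIVE_PATH)` and printing the result before training gives a quick list of anything that still needs to be uploaded.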

## 4. Update Config for GPU

This cell appends a device override to `src/config.py` so that `CNN_DEVICE` is set to `cuda` when a GPU is available and `cpu` otherwise.

In [None]:
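# Update config to use GPU
config_updates = '''
# Update CNN_DEVICE to use GPU if available
import torch
CNN_DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {CNN_DEVICE}")
'''

# Append to the config file only once: the file lives on Google Drive,
# so re-running this cell would otherwise add a duplicate override each time
with open('src/config.py', 'r+') as f:
    if config_updates.strip() not in f.read():
        # After read(), the file position is at the end, so this appends
        f.write('\n' + config_updates)

print("✓ Config updated for GPU")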

## 5. Check Data Files

In [None]:
# Count data files
import glob

csv_files = glob.glob('processed_data/rawdata_*.csv')
print(f"Total CSV files: {len(csv_files)}")

# Check mapping file
import pandas as pd
mapping = pd.read_csv('processed_data/sentence_mapping.csv')
print(f"Mapping entries: {len(mapping)}")
print(f"Unique sentences: {mapping['Content'].nunique()}")

# Show sample
print("\nSample entries:")
print(mapping.head())
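
With many files mapping to far fewer unique sentences, it may be useful to see how evenly recordings are spread across sentences before training. This is a minimal sketch: `recordings_per_sentence` is a hypothetical helper, and it accepts any iterable of sentence strings, so passing `mapping['Content']` should work.

```python
from collections import Counter

def recordings_per_sentence(sentences):
    """Summarize how many entries share each unique sentence."""
    counts = Counter(sentences)
    per_sentence = list(counts.values())
    return {
        "unique_sentences": len(counts),
        # default=0 keeps the helper safe on an empty mapping
        "min_per_sentence": min(per_sentence, default=0),
        "max_per_sentence": max(per_sentence, default=0),
    }

# On Colab (assumed usage): print(recordings_per_sentence(mapping['Content']))
```

A very small `min_per_sentence` would suggest that some sentences have few recordings to learn from.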

## 6. Run Training Pipeline

This will run the full training with:
- All 5,915 files
- 2x augmentation
- 3 CNN epochs
- GPU acceleration

**Expected time:** 30-45 minutes

In [None]:
# Run the training pipeline with 2x augmentation
!python main.py --num-aug 2 --save-models

## 7. Alternative: Memory-Efficient Version

If the cell above fails with an out-of-memory error, run this one instead:

In [None]:
# Run memory-efficient version
!python main_memory_efficient.py --num-aug 2 --save-models

## 8. Check Saved Models

In [None]:
# List saved models
print("Saved models:")
!ls -lh checkpoints/

# Check model sizes
import os
if os.path.exists('checkpoints/cnn_encoder.pth'):
    size = os.path.getsize('checkpoints/cnn_encoder.pth') / 1e6
    print(f"\nCNN Encoder: {size:.2f} MB")

if os.path.exists('checkpoints/hmm_models.pkl'):
    size = os.path.getsize('checkpoints/hmm_models.pkl') / 1e6
    print(f"HMM Models: {size:.2f} MB")

## 9. Download Models (Optional)

Download the trained models to your local machine.

In [None]:
import os
from google.colab import files

# Download CNN encoder
if os.path.exists('checkpoints/cnn_encoder.pth'):
    files.download('checkpoints/cnn_encoder.pth')
    print("✓ Downloaded CNN encoder")

# Download HMM models
if os.path.exists('checkpoints/hmm_models.pkl'):
    files.download('checkpoints/hmm_models.pkl')
    print("✓ Downloaded HMM models")
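
Browsers sometimes block several downloads triggered in quick succession, so bundling the checkpoints into one archive first can be more convenient. The sketch below uses only the standard library; `bundle_checkpoints` is a hypothetical helper, and the output path in the usage comment is an assumption.

```python
import os
import zipfile

def bundle_checkpoints(ckpt_dir, out_zip):
    """Zip every regular file in ckpt_dir into out_zip; return the archived names.

    Write out_zip outside ckpt_dir so the archive does not include itself.
    """
    names = []
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(ckpt_dir)):
            path = os.path.join(ckpt_dir, name)
            if os.path.isfile(path):
                zf.write(path, arcname=name)
                names.append(name)
    return names

# On Colab (assumed usage):
# bundle_checkpoints('checkpoints', '/content/models.zip')
# files.download('/content/models.zip')
```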

## 10. Test Inference (Optional)

Load the trained models and test on a sample.

In [None]:
import sys
sys.path.append('src')

import torch
import numpy as np
from feature_extractor import CNNEEGEncoder
from predictor import SentencePredictor
from data_loader import DataLoader

# Load models
print("Loading models...")
encoder = CNNEEGEncoder(input_channels=105, hidden_channels=32, sequence_length=5500)
checkpoint = torch.load('checkpoints/cnn_encoder.pth', map_location='cpu')
encoder.load_state_dict(checkpoint['model_state_dict'])
encoder.eval()

predictor = SentencePredictor(n_states=3, n_features=32)
predictor.load('checkpoints/hmm_models.pkl')

print(f"✓ Loaded {len(predictor.models)} sentence models")

# Test on the first file returned by the loader
loader = DataLoader('processed_data')
loader.load_mapping()
test_file = loader.get_all_files()[0]
test_data = loader.load_padded_data(test_file, target_length=5500)
true_text = loader.get_text_for_file(test_file)

# Extract features and predict
with torch.no_grad():
    X_tensor = torch.tensor(test_data[np.newaxis, :, :], dtype=torch.float32)
    features = encoder.get_features(X_tensor)
    features_np = features.numpy()[0].T

pred_text, score = predictor.predict(features_np)

print("\nTest Prediction:")
print(f"True: {true_text[:80]}...")
print(f"Pred: {pred_text[:80]}...")
print(f"Score: {score:.2f}")
print(f"Match: {pred_text == true_text}")
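
A single sample says little about overall accuracy. To score several files, the true and predicted sentences can be collected in two lists and compared. This sketch assumes accuracy means exact string match, as in the `Match` line above; the pipeline's own metric may differ, and `exact_match_accuracy` is a hypothetical helper.

```python
def exact_match_accuracy(true_texts, pred_texts):
    """Fraction of predictions that are identical to the reference sentence."""
    if len(true_texts) != len(pred_texts):
        raise ValueError("true_texts and pred_texts must have the same length")
    if not true_texts:
        return 0.0
    hits = sum(t == p for t, p in zip(true_texts, pred_texts))
    return hits / len(true_texts)
```

Note that evaluating on files that were also used for training will overstate accuracy compared with held-out data.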

## Notes

### Memory Usage
- Colab provides ~12-15 GB RAM
- Full dataset with 2x augmentation needs ~12-15 GB
- If you get memory errors, use `main_memory_efficient.py` instead
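
The choice between the two scripts can be made up front by checking the runtime's RAM. This is a rough sketch under stated assumptions: `total_ram_gb` relies on `os.sysconf`, which works on Linux hosts such as Colab VMs, and the 15 GB default threshold is simply the upper end of the estimate above, not a measured requirement. Both function names are hypothetical.

```python
import os

def total_ram_gb():
    """Total physical RAM in GB (Linux hosts, which includes Colab VMs)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

def choose_script(ram_gb, required_gb=15.0):
    """Pick the training script based on total RAM.

    required_gb defaults to the upper end of the ~12-15 GB estimate,
    which is a deliberately conservative assumption.
    """
    return "main.py" if ram_gb >= required_gb else "main_memory_efficient.py"

# On Colab (assumed usage): print(choose_script(total_ram_gb()))
```

Lowering `required_gb` toward 12 makes the check less conservative, at the risk of running out of memory partway through training.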

### GPU Acceleration
- CNN training will be much faster on GPU (~5-10x speedup)
- HMM training runs on CPU (no GPU implementation)

### Runtime Limits
- Free Colab sessions can run for up to about 12 hours and may disconnect earlier when left idle
- Training should complete in 30-45 minutes
- Models are saved to your Google Drive

### Troubleshooting
- **Out of memory**: Use `main_memory_efficient.py` or reduce `--num-aug`
- **Session timeout**: Models are saved incrementally, so you can resume after reconnecting
- **Slow training**: Make sure GPU is enabled (Runtime > Change runtime type > GPU)