# 🧠 EEG-to-Text Seq2Seq LITE (Memory-Optimized for Colab CPU)

## ⚡ MEMORY-OPTIMIZED VERSION
**Use this if the regular version runs out of RAM!**

- **Memory usage**: ~8GB (instead of 51GB+)
- **Training time**: 4-6 hours on CPU
- **Slightly lower accuracy** but much more stable

## 🔧 Optimizations Applied:
1. ✅ Smaller model (128 hidden units, 1 layer)
2. ✅ Smaller batch size (4 instead of 16)
3. ✅ Less augmentation (4x instead of 6x)
4. ✅ Shorter max sentence length (40 words)
5. ✅ Aggressive garbage collection
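
The garbage-collection step presumably lives inside `train_seq2seq_lite.py`, which is not shown in this notebook. As a rough sketch of the idea only, the hypothetical helper below runs a caller-supplied step over each batch, drops the reference to the batch, and calls `gc.collect()` at a fixed interval. The names `run_epoch`, `step_fn`, and `gc_every` are assumptions for illustration, not taken from the script.

```python
import gc


def run_epoch(batches, step_fn, gc_every=10):
    """Run step_fn on every batch and force a GC pass every gc_every batches.

    Hypothetical helper for illustration; the real training loop may differ.
    """
    total_loss = 0.0
    n_batches = 0
    for n_batches, batch in enumerate(batches, 1):
        total_loss += float(step_fn(batch))
        del batch  # drop our reference so the batch can be freed
        if n_batches % gc_every == 0:
            gc.collect()  # reclaim unreachable objects, including reference cycles
    gc.collect()
    return total_loss / max(n_batches, 1)


# Toy usage: the "loss" is just the mean of each batch.
toy_batches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
avg_loss = run_epoch(toy_batches, lambda b: sum(b) / len(b), gc_every=1)
print(f"average loss: {avg_loss:.2f}")
```

Calling `gc.collect()` very often costs time, so the interval is a trade-off between peak memory and speed.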

## 🎯 Expected Results (95 classes):
- **Exact Match**: 8-15%
- **Word Error Rate**: 50-70%
- **Still better than classification** (4-7%)
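
For reference, the two metrics above can be sketched with their standard definitions: exact match compares whole sentences word by word, and word error rate is the word-level edit distance divided by the number of reference words. This is a minimal illustration that assumes whitespace tokenization and no case or punctuation normalization; the training script may compute its metrics differently.

```python
def exact_match(reference: str, hypothesis: str) -> bool:
    """True only if both sentences contain the same words in the same order."""
    return reference.split() == hypothesis.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


ref = "the cat sat on the mat"
hyp = "the cat sat on a mat"
print(exact_match(ref, hyp))                # False
print(f"{word_error_rate(ref, hyp):.3f}")   # one substitution in six words: 0.167
```

The example shows why WER gives partial credit: one wrong word fails exact match, yet the WER is only about 17%.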

---

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive

drive.mount('/content/drive')
print("\n✅ Google Drive mounted!")

## Step 2: Clone Repository

In [None]:
import os

if not os.path.exists('/content/ML-Project-Data'):
    print("📥 Cloning repository...")
    !git clone https://github.com/Tejas-Chakkarwar/ML-Project-Data.git
else:
    print("📥 Pulling latest changes...")
    !cd /content/ML-Project-Data && git pull origin main

os.chdir('/content/ML-Project-Data')
print(f"\n✅ Working directory: {os.getcwd()}")

# Verify lite version exists
print("\n📋 Verifying LITE version:")
print(f"  train_seq2seq_lite.py: {'✅' if os.path.exists('lstm_approach/train_seq2seq_lite.py') else '❌ MISSING'}")

## Step 3: Install Dependencies

In [None]:
!pip install -q torch numpy pandas scikit-learn tqdm

import torch
print("\n📊 System Info:")
print(f"  GPU: {torch.cuda.is_available()}")
print("  Using: CPU (memory-optimized)")
print("\n✅ Ready!")

## Step 4: Copy Dataset from Google Drive

⚠️ **UPDATE THE PATH BELOW!**

In [None]:
import glob
import os
import shutil
import time

SOURCE = '/content/drive/MyDrive/ML_Project_Dataset'  # ← CHANGE THIS!
DEST = '/content/ML-Project-Data/processed_data'

if not os.path.exists(SOURCE):
    print(f"❌ ERROR: {SOURCE} not found!")
    print("\n📁 Available folders:")
    for item in os.listdir('/content/drive/MyDrive'):
        if os.path.isdir(os.path.join('/content/drive/MyDrive', item)):
            print(f"   📁 {item}/")
else:
    os.makedirs(DEST, exist_ok=True)
    all_files = sorted(os.listdir(SOURCE))
    already_copied = set(os.listdir(DEST))
    # Only copy files that are not in DEST yet, so the cell can be re-run safely
    to_copy = [f for f in all_files if f not in already_copied]

    print(f"Copying {len(to_copy)} files ({len(all_files) - len(to_copy)} already present)...")

    copied = 0
    failed = []
    start = time.time()

    for i, filename in enumerate(to_copy, 1):
        try:
            shutil.copy2(os.path.join(SOURCE, filename), os.path.join(DEST, filename))
            copied += 1
        except OSError as e:
            # Report the failure instead of silently skipping the file
            failed.append(filename)
            print(f"  ⚠️ Could not copy {filename}: {e}")
        if i % 1000 == 0:
            print(f"  [{i}/{len(to_copy)}] ({(time.time()-start)/60:.1f} min)")

    csv_files = glob.glob(f'{DEST}/rawdata_*.csv')
    print(f"\n✅ Copied {copied} new files ({len(failed)} failed); {len(csv_files)} rawdata CSV files in {DEST}")
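
Before starting a multi-hour training run, it can help to confirm that the copy produced usable files. The sketch below counts files matching the `rawdata_*.csv` pattern used above and flags zero-byte files, which can be left behind by an interrupted copy. `summarize_dataset` is a hypothetical helper, and the demo runs on a temporary folder rather than the real `DEST`.

```python
import glob
import os
import tempfile


def summarize_dataset(dest, pattern="rawdata_*.csv"):
    """Count matching files in dest and flag zero-byte ones (hypothetical helper)."""
    paths = sorted(glob.glob(os.path.join(dest, pattern)))
    sizes = {p: os.path.getsize(p) for p in paths}
    return {
        "count": len(paths),
        "empty": [os.path.basename(p) for p, s in sizes.items() if s == 0],
        "total_bytes": sum(sizes.values()),
    }


# Self-contained demo on a temporary folder; in the notebook you would pass DEST.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "rawdata_0001.csv"), "w") as fh:
        fh.write("ch1,ch2\n0.1,0.2\n")
    open(os.path.join(tmp, "rawdata_0002.csv"), "w").close()  # simulates a truncated copy
    open(os.path.join(tmp, "notes.txt"), "w").close()          # ignored by the pattern
    summary = summarize_dataset(tmp)

print(summary)
```

If `empty` is not empty, deleting those files from `DEST` and re-running the copy cell should fetch them again, since the cell skips only files that already exist.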

## Step 5: Train Seq2Seq LITE (Memory-Optimized)

### ⚡ Memory-Optimized Settings:
- **Batch size**: 4 (saves RAM)
- **Model size**: 128 hidden, 1 layer
- **Augmentation**: 4x (instead of 6x)
- **Max sentence**: 40 words (instead of 60)

### ⏱️ Expected:
- **Time**: 4-6 hours on Colab CPU
- **Memory**: ~8GB RAM (safe!)
- **Results**: 8-15% exact match, 50-70% WER

---

**⚠️ This will take 4-6 hours. Keep the browser tab open!**

In [None]:
import time

os.chdir('/content/ML-Project-Data/lstm_approach')

print("=" * 70)
print("🚀 SEQ2SEQ LITE TRAINING (MEMORY-OPTIMIZED)")
print("=" * 70)
print("\n⚡ Optimizations:")
print("  ✅ Small model: 128 hidden, 1 layer")
print("  ✅ Batch size: 4")
print("  ✅ Augmentation: 4x")
print("  ✅ Max length: 40 words")
print("  ✅ Aggressive memory cleanup")
print()
print("💾 Memory: ~8GB (safe for Colab)")
print("⏱️  Time: 4-6 hours")
print("=" * 70 + "\n")

start = time.time()

!python train_seq2seq_lite.py \
  --min-samples 18 \
  --num-aug 4 \
  --batch-size 4 \
  --epochs 30 \
  --device cpu

elapsed = time.time() - start

print("\n" + "=" * 70)
print(f"🎉 COMPLETED IN {elapsed/60:.1f} MIN ({elapsed/3600:.1f} HRS)")
print("=" * 70)

## Step 6: Save Models to Google Drive

In [None]:
import os
import shutil

DRIVE_DIR = '/content/drive/MyDrive/ML_Project_Seq2Seq_LITE'
os.makedirs(DRIVE_DIR, exist_ok=True)

LOCAL_DIR = '/content/ML-Project-Data/checkpoints'

if os.path.exists(LOCAL_DIR):
    print("📦 Saving model...\n")
    saved = 0
    for f in os.listdir(LOCAL_DIR):
        if f.endswith('.pth'):
            shutil.copy2(os.path.join(LOCAL_DIR, f), os.path.join(DRIVE_DIR, f))
            print(f"✅ {f}")
            saved += 1
    if saved:
        print(f"\n✅ Saved {saved} file(s) to: {DRIVE_DIR}")
    else:
        print(f"⚠️  No .pth files found in {LOCAL_DIR}")
else:
    print(f"⚠️  No models found: {LOCAL_DIR} does not exist")
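
Because the checkpoints are the only output of a 4-6 hour run, it may be worth confirming that each copy is intact. The sketch below copies `.pth` files and compares SHA-256 digests of source and destination. `sha256_of` and `copy_and_verify` are hypothetical helpers, and the demo uses a temporary folder with a fake checkpoint instead of the real directories.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src_dir, dst_dir, suffix=".pth"):
    """Copy files ending in suffix and raise if any copy differs from its source."""
    os.makedirs(dst_dir, exist_ok=True)
    verified = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(suffix):
            continue
        src, dst = os.path.join(src_dir, name), os.path.join(dst_dir, name)
        shutil.copy2(src, dst)
        if sha256_of(src) != sha256_of(dst):
            raise IOError(f"checksum mismatch for {name}")
        verified.append(name)
    return verified


# Demo with a fake checkpoint; in the notebook you would pass LOCAL_DIR and DRIVE_DIR.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "checkpoints")
    dst_dir = os.path.join(tmp, "drive_copy")
    os.makedirs(src_dir)
    with open(os.path.join(src_dir, "model.pth"), "wb") as fh:
        fh.write(b"fake weights")
    with open(os.path.join(src_dir, "notes.txt"), "w") as fh:
        fh.write("ignored")
    verified = copy_and_verify(src_dir, dst_dir)

print(verified)
```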

---

## 📊 Understanding Results

### LITE vs Regular Version:

| Metric | Regular (out of memory) | LITE (works) |
|--------|-------------------------|--------------|
| Memory | 51GB+ (crash) | ~8GB ✅ |
| Time | n/a (did not finish) | 4-6 hours |
| Exact Match | n/a | 8-15% |
| WER | n/a | 50-70% |

### Why Lower Accuracy?
- Smaller model = less capacity
- Less augmentation = less training data
- **BUT**: Still better than classification (4-7%)!

### What's Good:
- ✅ Actually finishes training
- ✅ Shows partial word understanding
- ✅ Better than classification baseline
- ✅ Proves Seq2Seq concept works

---

## 🎯 For Presentation:

1. **Show the challenge**: Limited data (18 samples/class)
2. **Show LITE results**: e.g. 12% exact match, 60% WER (substitute your measured numbers)
3. **Explain partial correctness**: Unlike classification, Seq2Seq gets credit for partially correct sentences
4. **Compare baselines**:
   - Random guessing: ~1% (1 in 95 classes)
   - Classification: 4-7%
   - Seq2Seq LITE: 8-15% ✨
5. **Future work**: With a GPU, a larger model can be trained for better results
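
The baseline comparison can be backed by simple arithmetic: with 95 equally likely classes, random guessing is right 1 time in 95. The sketch below expresses each expected range as a multiple of that chance level. It uses the expected ranges quoted in this notebook, not measured results, and `times_chance` is a hypothetical helper.

```python
NUM_CLASSES = 95
chance = 1 / NUM_CLASSES  # accuracy of uniform random guessing


def times_chance(accuracy, num_classes=NUM_CLASSES):
    """How many times better than uniform random guessing an accuracy is."""
    return accuracy * num_classes


print(f"Chance level: {chance:.2%}")  # 1.05%

# Expected ranges quoted in this notebook (placeholders for measured values).
expected = {"Classification": (0.04, 0.07), "Seq2Seq LITE": (0.08, 0.15)}
for name, (low, high) in expected.items():
    print(f"{name}: {times_chance(low):.1f}x to {times_chance(high):.1f}x chance")
```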

---