# Seq2Seq LSTM vs VAE for Learning Behavior Prediction

This notebook compares two sequence generation models on the Open University Learning Analytics Dataset (OULAD):
- **Seq2Seq LSTM**: Deterministic single-path prediction
- **Seq2Seq VAE**: Probabilistic multi-path generation
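
The difference between the two modes can be illustrated with a toy decoder. This is a minimal NumPy sketch, not the repository's `Seq2SeqLSTM` or `Seq2SeqVAE`: the `decode` function, its weights, and the latent size are hypothetical stand-ins. A deterministic model maps one context to one output, while a VAE-style model draws a latent `z = mu + sigma * eps` and so yields a different path per draw.

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(context, z=None):
    """Hypothetical toy decoder: linear map from context (plus optional latent) to 2 future steps."""
    w_ctx = np.array([[0.5, 0.2], [0.1, 0.4]])      # (context_dim, horizon)
    out = context @ w_ctx
    if z is not None:
        w_z = np.array([[0.3, -0.1], [0.0, 0.2]])   # (latent_dim, horizon)
        out = out + z @ w_z
    return out

context = np.array([1.0, 2.0])

# Deterministic: the same input always gives the same single path.
single_path = decode(context)

# Probabilistic: sample z with the reparameterization trick, one path per draw.
mu, log_var = np.zeros(2), np.zeros(2)
eps = rng.standard_normal((5, 2))                   # 5 independent draws
z = mu + np.exp(0.5 * log_var) * eps
multi_path = np.stack([decode(context, zi) for zi in z])  # shape (5, 2)
```

Averaging `multi_path` over its first axis gives a single mean prediction, which can then be compared directly with a deterministic model's output.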

## Objectives
1. Train both models on the same dataset
2. Compare single-path prediction accuracy (MSE)
3. Analyze diversity and coverage of VAE generations
4. Visualize and interpret results

## 0. Download OULAD Dataset

This section automatically downloads the OULAD dataset from Kaggle and extracts the required CSV files.

**Required files:**
- `studentInfo.csv`: Student demographics and registration info
- `studentVle.csv`: Student VLE interaction logs (clicks)
- `studentAssessment.csv`: Assessment submissions and scores
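
To make the role of `studentVle.csv` concrete, here is a minimal sketch of turning daily click logs into weekly totals per student. It assumes the columns `id_student`, `date` (day offset relative to the module start, which can be negative), and `sum_click`. The helper `weekly_clicks` is hypothetical and is not the preprocessing performed by `load_and_preprocess_data`.

```python
import csv
import io
from collections import defaultdict

def weekly_clicks(rows):
    """Sum clicks per (student, week), where week = date // 7."""
    totals = defaultdict(int)
    for row in rows:
        week = int(row["date"]) // 7
        totals[(row["id_student"], week)] += int(row["sum_click"])
    return dict(totals)

# Tiny in-memory stand-in for studentVle.csv (values are made up).
sample = io.StringIO(
    "id_student,date,sum_click\n"
    "1001,0,3\n"
    "1001,6,2\n"
    "1001,7,5\n"
    "1002,-3,4\n"
)
totals = weekly_clicks(csv.DictReader(sample))
```

Floor division places days before the module start in negative weeks, so pre-course activity stays separate from week 0.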

In [4]:
# Check if data already exists
import os
import zipfile
from pathlib import Path

# Define required files
data_dir = Path('data/raw')
required_files = [
    'studentInfo.csv',
    'studentVle.csv',
    'studentAssessment.csv'
]

# Create data directory if it doesn't exist
data_dir.mkdir(parents=True, exist_ok=True)

# Check if all required files exist
all_files_exist = all((data_dir / f).exists() for f in required_files)

if all_files_exist:
    print("✓ All required data files already exist!")
    for f in required_files:
        file_path = data_dir / f
        size_mb = file_path.stat().st_size / (1024 * 1024)
        print(f"  - {f}: {size_mb:.2f} MB")
else:
    print("⚠ Some data files are missing. Downloading from Kaggle...")

    # Download dataset from Kaggle
    zip_path = 'data/oulad_dataset.zip'

    print("\nDownloading OULAD dataset...")
    print("This may take a few minutes depending on your internet connection.")

    # Use curl to download (works on most systems)
    !curl -L -o {zip_path} \
        https://www.kaggle.com/api/v1/datasets/download/anlgrbz/student-demographics-online-education-dataoulad

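    # Check that the download succeeded and produced a valid zip archive
    # (a failed or unauthenticated request may save an error page instead)
    if os.path.exists(zip_path) and zipfile.is_zipfile(zip_path):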
        print(f"\n✓ Download complete: {os.path.getsize(zip_path) / (1024*1024):.2f} MB")

        # Extract the zip file
        print("\nExtracting files...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            # List all files in the zip
            all_files = zip_ref.namelist()
            print(f"Found {len(all_files)} files in archive")

            # Extract only the required files
            for file in all_files:
                # Get just the filename without path
                filename = os.path.basename(file)

                if filename in required_files:
                    # Read file from zip and write to data/raw/
                    with zip_ref.open(file) as source:
                        target_path = data_dir / filename
                        with open(target_path, 'wb') as target:
                            target.write(source.read())
                    print(f"  ✓ Extracted: {filename}")

        # Clean up zip file
        os.remove(zip_path)
        print("\n✓ Cleanup complete. Zip file removed.")

        # Verify all files now exist
        all_files_exist = all((data_dir / f).exists() for f in required_files)
        if all_files_exist:
            print("\n✓ All required files successfully extracted!")
            for f in required_files:
                file_path = data_dir / f
                size_mb = file_path.stat().st_size / (1024 * 1024)
                print(f"  - {f}: {size_mb:.2f} MB")
        else:
            print("\n⚠ Warning: Some files are still missing!")
            missing = [f for f in required_files if not (data_dir / f).exists()]
            print(f"Missing files: {missing}")
    else:
        print("\n✗ Download failed!")
        print("\nAlternative: Download manually from:")
        print("https://www.kaggle.com/datasets/anlgrbz/student-demographics-online-education-dataoulad")
        print(f"Then extract the required files to: {data_dir.absolute()}")

⚠ Some data files are missing. Downloading from Kaggle...

Downloading OULAD dataset...
This may take a few minutes depending on your internet connection.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 42.1M  100 42.1M    0     0  67.5M      0 --:--:-- --:--:-- --:--:--  208M

✓ Download complete: 42.15 MB

Extracting files...
Found 7 files in archive
  ✓ Extracted: studentAssessment.csv
  ✓ Extracted: studentInfo.csv
  ✓ Extracted: studentVle.csv

✓ Cleanup complete. Zip file removed.

✓ All required files successfully extracted!
  - studentInfo.csv: 3.30 MB
  - studentVle.csv: 432.81 MB
  - studentAssessment.csv: 5.43 MB


## 1. Setup and Installation

In [5]:
# ============================================================================
# SETUP FOR GOOGLE COLAB AND LOCAL ENVIRONMENTS
# ============================================================================

import sys
import os
from pathlib import Path

# Check if running on Colab
try:
    import google.colab
    IN_COLAB = True
    print("🔧 Running on Google Colab")
except ImportError:
    IN_COLAB = False
    print("🔧 Running locally")

# ============================================================================
# INSTALL DEPENDENCIES
# ============================================================================
if IN_COLAB:
    print("\n📦 Installing dependencies...")
    !pip install -q torch numpy pandas matplotlib seaborn scikit-learn tqdm
    print("✓ Dependencies installed")

# ============================================================================
# CLONE REPOSITORY (if needed on Colab)
# ============================================================================
if IN_COLAB:
    current_dir = os.getcwd()
    print(f"\n📂 Current directory: {current_dir}")

    # Check if we're already in the repository (src exists in current dir)
    if not os.path.exists('src'):
        print("⚠️  src/ not found. Cloning repository...")
        !git clone https://github.com/Qmo37/Seq2Seq_LSTM_VAE.git /content/Seq2Seq_LSTM_VAE
        os.chdir('/content/Seq2Seq_LSTM_VAE')
        print(f"✓ Repository cloned and changed to: {os.getcwd()}")
    else:
        print("✓ Repository already present")

# ============================================================================
# ADD SRC TO PYTHON PATH - FIX MODULE PRIORITY
# ============================================================================
print("\n🔍 Setting up Python path...")

# Get current working directory
cwd = Path.cwd()
print(f"Working directory: {cwd}")

# Find src directory
src_candidates = [
    cwd / 'src',                    # In current directory
    cwd.parent / 'src',             # One level up
]

src_path = None
for candidate in src_candidates:
    candidate_resolved = candidate.resolve()
    if candidate_resolved.exists() and candidate_resolved.is_dir():
        src_path = str(candidate_resolved)
        print(f"✓ Found src at: {src_path}")
        break

if src_path:
    # IMPORTANT: Remove any conflicting paths first!
    # The 'data' directory in project root conflicts with src/data module
    paths_to_remove = []
    for p in sys.path:
        # Remove current directory if it contains 'data' folder
        if p in [str(cwd), str(cwd.parent), '', '/content', '/content/Seq2Seq_LSTM_VAE']:
            if Path(p, 'data').exists():
                paths_to_remove.append(p)

    for p in paths_to_remove:
        if p in sys.path:
            sys.path.remove(p)
            print(f"⚠️  Removed from path (conflicts with src/data): {p}")

    # Add src/ to the BEGINNING of path
    if src_path not in sys.path:
        sys.path.insert(0, src_path)
        print(f"✓ Added to Python path (priority #1): {src_path}")

    # Verify required modules exist
    required_modules = ['data', 'models', 'utils']
    all_found = True
    print("\nVerifying modules:")
    for module in required_modules:
        module_path = Path(src_path) / module
        if module_path.exists():
            print(f"  ✓ {module}/")
        else:
            print(f"  ✗ {module}/ MISSING!")
            all_found = False

    if all_found:
        print("\n✅ Setup complete! Ready to import custom modules.")
        print("\nPython path priority:")
        for i, p in enumerate(sys.path[:3]):
            print(f"  {i+1}. {p}")
    else:
        print("\n⚠️  Warning: Some modules are missing!")
else:
    print("\n❌ ERROR: Could not find src/ directory!")

🔧 Running on Google Colab

📦 Installing dependencies...
✓ Dependencies installed

📂 Current directory: /content/Seq2Seq_LSTM_VAE
✓ Repository already present

🔍 Setting up Python path...
Working directory: /content/Seq2Seq_LSTM_VAE
✓ Found src at: /content/Seq2Seq_LSTM_VAE/src
⚠️  Removed from path (conflicts with src/data): 

Verifying modules:
  ✓ data/
  ✓ models/
  ✓ utils/

✅ Setup complete! Ready to import custom modules.

Python path priority:
  1. /content/Seq2Seq_LSTM_VAE/src
  2. /env/python
  3. /usr/lib/python312.zip


In [6]:
# Import standard libraries
import sys
import os
from pathlib import Path
print("--- sys.path ---")
for p in sys.path:
    print(p)
print("\n--- CWD ---")
print(os.getcwd())
print("\n--- Resolved CWD ---")
print(Path.cwd().resolve())

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
print("✓ Standard libraries imported")

# Import custom modules (from src/)
print("\nImporting custom modules...")
try:
    from data import load_and_preprocess_data, LearningBehaviorDataset
    print("✓ data module imported")
except ImportError as e:
    print(f"✗ Failed to import data: {e}")
    print("\nDEBUG: Checking if 'data' module exists...")
    for path in sys.path:
        data_path = os.path.join(path, 'data')
        if os.path.exists(data_path):
            print(f"  Found data at: {data_path}")
            print(f"  Contents: {os.listdir(data_path)}")
    raise

try:
    from models import Seq2SeqLSTM, Seq2SeqVAE
    print("✓ models module imported")
except ImportError as e:
    print(f"✗ Failed to import models: {e}")
    raise

try:
    from utils import (
        train_lstm, train_vae, set_seed,
        evaluate_model,
        plot_training_curves, plot_comparison, plot_diversity_analysis
    )
    print("✓ utils module imported")
except ImportError as e:
    print(f"✗ Failed to import utils: {e}")
    raise

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')

print("\n✅ All imports successful!")

Python path:
  /content/Seq2Seq_LSTM_VAE/src
  /env/python
  /usr/lib/python312.zip

✓ Standard libraries imported

Importing custom modules...
✓ data module imported
✓ models module imported
✗ Failed to import utils: cannot import name 'evaluate_model' from 'utils' (/content/Seq2Seq_LSTM_VAE/src/utils/__init__.py)


ImportError: cannot import name 'evaluate_model' from 'utils' (/content/Seq2Seq_LSTM_VAE/src/utils/__init__.py)

## 2. Configuration

In [None]:
# Hyperparameters: entries marked "Adjustable" may be tuned; the rest are fixed (as per requirements)
CONFIG = {
    # Data
    'data_path': 'data/raw',
    'input_weeks': 4,
    'output_weeks': 2,

    # Training
    'batch_size': 128,
    'learning_rate': 1e-3,
    'epochs': 20,  # Adjustable: 5-30+
    'random_seed': 42,

    # Model architecture
    'hidden_size': 64,  # Adjustable
    'latent_dim': 16,   # Adjustable (VAE only)
    'num_layers': 1,
    'dropout': 0.0,

    # VAE specific
    'beta': 1.0,  # Weight for KL divergence

    # Evaluation
    'n_samples': 20,  # Number of VAE samples for evaluation

    # Device
    'device': 'cuda' if torch.cuda.is_available() else 'cpu'
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

## 3. Data Loading and Preprocessing
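
With `input_weeks=4` and `output_weeks=2`, each example pairs four observed weeks with the two weeks that follow. Below is a minimal sketch of that split, assuming one row of features per week with clicks in column 0. The helper `split_window` and the feature layout are hypothetical; the actual logic inside `load_and_preprocess_data` may differ.

```python
def split_window(weekly_features, input_weeks=4, output_weeks=2, target_col=0):
    """Split one student's weekly feature rows into (X, y).

    X holds all features for the first `input_weeks` weeks;
    y holds only the target column for the following `output_weeks` weeks.
    """
    needed = input_weeks + output_weeks
    if len(weekly_features) < needed:
        raise ValueError(f"need at least {needed} weeks, got {len(weekly_features)}")
    x = [list(row) for row in weekly_features[:input_weeks]]
    y = [[row[target_col]] for row in weekly_features[input_weeks:needed]]
    return x, y

# Six weeks of made-up [clicks, active_days] values for one student.
weeks = [[10, 3], [12, 4], [8, 2], [15, 5], [9, 3], [11, 4]]
X, y = split_window(weeks)
```

Stacking such pairs across students would give arrays shaped `(n_students, input_weeks, n_features)` and `(n_students, output_weeks, 1)`, consistent with the `output_size=1` used for both models.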

In [None]:
# Set random seed for reproducibility
set_seed(CONFIG['random_seed'])

# Load and preprocess data
print("Loading and preprocessing OULAD dataset...")
data = load_and_preprocess_data(
    data_path=CONFIG['data_path'],
    input_weeks=CONFIG['input_weeks'],
    output_weeks=CONFIG['output_weeks'],
    random_seed=CONFIG['random_seed']
)

print("\nData shapes:")
print(f"  Train: X={data['X_train'].shape}, y={data['y_train'].shape}")
print(f"  Val:   X={data['X_val'].shape}, y={data['y_val'].shape}")
print(f"  Test:  X={data['X_test'].shape}, y={data['y_test'].shape}")

In [None]:
# Create PyTorch datasets and dataloaders
train_dataset = LearningBehaviorDataset(data['X_train'], data['y_train'])
val_dataset = LearningBehaviorDataset(data['X_val'], data['y_val'])
test_dataset = LearningBehaviorDataset(data['X_test'], data['y_test'])

train_loader = DataLoader(train_dataset, batch_size=CONFIG['batch_size'], shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=CONFIG['batch_size'], shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=CONFIG['batch_size'], shuffle=False)

print(f"Number of batches: Train={len(train_loader)}, Val={len(val_loader)}, Test={len(test_loader)}")

## 4. Model Training

### 4.1 Seq2Seq LSTM

In [None]:
# Initialize LSTM model
input_size = data['X_train'].shape[2]  # Number of features

lstm_model = Seq2SeqLSTM(
    input_size=input_size,
    hidden_size=CONFIG['hidden_size'],
    output_size=1,  # Predicting clicks only
    num_layers=CONFIG['num_layers'],
    dropout=CONFIG['dropout']
)

print("LSTM Model Architecture:")
print(lstm_model)
print(f"\nTotal parameters: {sum(p.numel() for p in lstm_model.parameters()):,}")

In [None]:
# Train LSTM
print("Training Seq2Seq LSTM...")
lstm_history = train_lstm(
    model=lstm_model,
    train_loader=train_loader,
    val_loader=val_loader,
    epochs=CONFIG['epochs'],
    lr=CONFIG['learning_rate'],
    device=CONFIG['device'],
    output_weeks=CONFIG['output_weeks']
)

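# Save model (create the checkpoint directory first so torch.save does not fail)
os.makedirs('results/checkpoints', exist_ok=True)
torch.save(lstm_model.state_dict(), 'results/checkpoints/lstm_model.pt')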
print("\nLSTM model saved.")

### 4.2 Seq2Seq VAE
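
The `beta` entry in the configuration weights the KL divergence term of the VAE objective. Below is a minimal sketch of such a loss, assuming a diagonal Gaussian posterior and a standard normal prior. The helper `vae_loss` is hypothetical, and `train_vae` may compute or reduce the terms differently.

```python
import math

def vae_loss(pred, target, mu, log_var, beta=1.0):
    """Return (total, recon, kl) for one example.

    recon: mean squared error between prediction and target.
    kl:    KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.
    """
    recon = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl, recon, kl

total, recon, kl = vae_loss(
    pred=[1.0, 2.0], target=[1.0, 4.0],
    mu=[0.0, 1.0], log_var=[0.0, 0.0],
    beta=1.0,
)
```

With `beta=0` the total reduces to the reconstruction error alone, while larger values pull the posterior toward the prior at the cost of reconstruction accuracy.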

In [None]:
# Initialize VAE model
vae_model = Seq2SeqVAE(
    input_size=input_size,
    hidden_size=CONFIG['hidden_size'],
    latent_dim=CONFIG['latent_dim'],
    output_size=1,
    num_layers=CONFIG['num_layers'],
    dropout=CONFIG['dropout']
)

print("VAE Model Architecture:")
print(vae_model)
print(f"\nTotal parameters: {sum(p.numel() for p in vae_model.parameters()):,}")

In [None]:
# Train VAE
print("Training Seq2Seq VAE...")
vae_history = train_vae(
    model=vae_model,
    train_loader=train_loader,
    val_loader=val_loader,
    epochs=CONFIG['epochs'],
    lr=CONFIG['learning_rate'],
    device=CONFIG['device'],
    output_weeks=CONFIG['output_weeks'],
    beta=CONFIG['beta']
)

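# Save model (create the checkpoint directory first so torch.save does not fail)
os.makedirs('results/checkpoints', exist_ok=True)
torch.save(vae_model.state_dict(), 'results/checkpoints/vae_model.pt')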
print("\nVAE model saved.")

### 4.3 Training Curves

In [None]:
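# Plot training curves (make sure the figures directory exists before saving)
os.makedirs('results/figures', exist_ok=True)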
plot_training_curves(
    lstm_history=lstm_history,
    vae_history=vae_history,
    save_path='results/figures/training_curves.png'
)

## 5. Model Evaluation

In [None]:
# Evaluate LSTM
print("Evaluating LSTM on test set...")
lstm_results = evaluate_model(
    model=lstm_model,
    data_loader=test_loader,
    device=CONFIG['device'],
    output_weeks=CONFIG['output_weeks'],
    is_vae=False
)

print("\nLSTM Results:")
print(f"  MSE: {lstm_results['mse']:.6f}")

In [None]:
# Evaluate VAE
print("Evaluating VAE on test set...")
vae_results = evaluate_model(
    model=vae_model,
    data_loader=test_loader,
    device=CONFIG['device'],
    output_weeks=CONFIG['output_weeks'],
    n_samples=CONFIG['n_samples'],
    is_vae=True
)

print("\nVAE Results:")
print(f"  MSE (mean prediction): {vae_results['mse']:.6f}")
print(f"  Best-of-N MSE: {vae_results['best_of_n_mse']:.6f}")
print(f"  Diversity (std): {vae_results['diversity']:.6f}")
print(f"  Coverage (95% CI): {vae_results['coverage']:.4f}")

## 6. Results Visualization

In [None]:
# Comprehensive comparison plot
plot_comparison(
    lstm_results=lstm_results,
    vae_results=vae_results,
    save_path='results/figures/model_comparison.png'
)

In [None]:
# Diversity analysis
plot_diversity_analysis(
    vae_samples=vae_results['samples'],
    targets=vae_results['targets'],
    save_path='results/figures/diversity_analysis.png'
)

## 7. Summary and Analysis

In [None]:
# Create summary table
summary = pd.DataFrame({
    'Metric': ['MSE', 'Best-of-N MSE', 'Diversity', 'Coverage'],
    'LSTM': [
        f"{lstm_results['mse']:.6f}",
        'N/A',
        '0 (deterministic)',
        'N/A'
    ],
    'VAE': [
        f"{vae_results['mse']:.6f}",
        f"{vae_results['best_of_n_mse']:.6f}",
        f"{vae_results['diversity']:.6f}",
        f"{vae_results['coverage']:.4f}"
    ]
})

print("\n" + "="*60)
print("MODEL COMPARISON SUMMARY")
print("="*60)
print(summary.to_string(index=False))
print("="*60)

### Key Findings

**Single-Path Prediction Accuracy:**
- Compare LSTM MSE vs VAE MSE (mean prediction)
- Which model is more accurate for deterministic prediction?

**Diversity and Multi-Path Generation:**
- VAE Best-of-N MSE: How much better can VAE do when allowed multiple attempts?
- VAE Diversity: How varied are the generated sequences?
- VAE Coverage: Do the generated samples capture the true uncertainty?
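
The three VAE metrics above can be defined in several ways, and the notebook does not show how `evaluate_model` computes them. The sketch below is one plausible set of definitions, assuming samples of shape `(n_samples, batch, horizon)`. The helper `multi_path_metrics` is hypothetical, and coverage here uses the empirical 2.5th to 97.5th percentile band of the samples.

```python
import numpy as np

def multi_path_metrics(samples, targets):
    """samples: (n_samples, batch, horizon); targets: (batch, horizon)."""
    # MSE of the mean prediction across samples.
    mean_mse = float(np.mean((samples.mean(axis=0) - targets) ** 2))
    # Best-of-N: per sequence, keep the sample with the lowest MSE, then average.
    per_sample_mse = np.mean((samples - targets[None]) ** 2, axis=2)  # (n_samples, batch)
    best_of_n_mse = float(per_sample_mse.min(axis=0).mean())
    # Diversity: standard deviation across samples, averaged over all positions.
    diversity = float(samples.std(axis=0).mean())
    # Coverage: fraction of targets inside the 2.5-97.5 percentile band.
    lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
    coverage = float(np.mean((targets >= lo) & (targets <= hi)))
    return {"mse": mean_mse, "best_of_n_mse": best_of_n_mse,
            "diversity": diversity, "coverage": coverage}

# Synthetic data for illustration only: targets plus Gaussian noise.
rng = np.random.default_rng(0)
targets = rng.normal(size=(8, 2))
samples = targets[None] + rng.normal(scale=0.5, size=(20, 8, 2))
metrics = multi_path_metrics(samples, targets)
```

Under these definitions, Best-of-N MSE can never exceed the average per-sample MSE, so it is best read alongside the mean-prediction MSE rather than in place of it.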

**Trade-offs:**
- LSTM: Simple, fast, deterministic → Good for point predictions
- VAE: Complex, slower, probabilistic → Good for uncertainty quantification and exploring multiple futures

**Practical Implications:**
- Use LSTM when a single best prediction is needed and computational efficiency matters
- Use VAE when the goal is to quantify uncertainty, explore multiple scenarios, or support risk assessment

## 8. Save Results for Report

In [None]:
# Save numerical results
import json

results_summary = {
    'config': CONFIG,
    'lstm': {
        'mse': float(lstm_results['mse']),
        'n_parameters': sum(p.numel() for p in lstm_model.parameters())
    },
    'vae': {
        'mse': float(vae_results['mse']),
        'best_of_n_mse': float(vae_results['best_of_n_mse']),
        'diversity': float(vae_results['diversity']),
        'coverage': float(vae_results['coverage']),
        'n_parameters': sum(p.numel() for p in vae_model.parameters())
    }
}

os.makedirs('results', exist_ok=True)  # ensure the output directory exists
with open('results/results_summary.json', 'w') as f:
    json.dump(results_summary, f, indent=2)

print("Results saved to results/results_summary.json")

In [None]:
# Save summary table
summary.to_csv('results/comparison_table.csv', index=False)
print("Comparison table saved to results/comparison_table.csv")

## 9. Conclusion

This notebook demonstrated:
1. Implementation of Seq2Seq LSTM and VAE for learning behavior prediction
2. Training and evaluation on OULAD dataset
3. Comprehensive comparison of single-path vs multi-path generation
4. Visualization and analysis of model strengths and weaknesses

**Next Steps for Report:**
- Copy figures from `results/figures/` to your Word document
- Use `results/results_summary.json` and `results/comparison_table.csv` for metrics
- Discuss advantages and disadvantages of each model
- Analyze when to use each model in practice
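
For pulling metrics into the report, here is a minimal sketch that renders a saved summary as a Markdown table. It assumes the JSON layout written in Section 8 (`lstm` and `vae` entries keyed by metric name). The helper `to_markdown_table` is hypothetical, and the numbers below are placeholders, not results from this notebook.

```python
import json

def to_markdown_table(results):
    """Render the 'lstm' and 'vae' metric dicts as a Markdown table."""
    metrics = sorted(set(results["lstm"]) | set(results["vae"]))
    lines = ["| Metric | LSTM | VAE |", "|---|---|---|"]
    for name in metrics:
        lstm = results["lstm"].get(name, "N/A")
        vae = results["vae"].get(name, "N/A")
        lines.append(f"| {name} | {lstm} | {vae} |")
    return "\n".join(lines)

# Placeholder values standing in for the contents of results/results_summary.json.
raw = '{"lstm": {"mse": 0.5}, "vae": {"mse": 0.6, "diversity": 0.1}}'
table = to_markdown_table(json.loads(raw))
```

In practice, `raw` would be replaced by the text read from `results/results_summary.json`; metrics that exist for only one model are shown as `N/A` for the other.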