# DiffSTOCK India - Training on Google Colab

This notebook:
1. Clones the DiffSTOCK repository
2. Installs dependencies
3. Uploads or downloads the dataset
4. Trains the model
5. Saves checkpoints and metrics to Google Drive
6. Evaluates on the validation and test sets
7. Runs a backtest on the test set
8. Generates a comprehensive evaluation report

**Runtime**: GPU recommended (training takes 2-4 hours on GPU, 12-20 hours on CPU)

**Author**: Siddhartha Koppaka  
**Model**: DiffSTOCK (ICASSP 2024) adapted for Indian markets

## Setup

In [None]:
# Check if GPU is available
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Mount Google Drive to save outputs
from google.colab import drive
drive.mount('/content/drive')

# Create output directory in Google Drive
import os
OUTPUT_DIR = '/content/drive/MyDrive/DiffSTOCK_Outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Outputs will be saved to: {OUTPUT_DIR}")

## Clone Repository

In [None]:
# Clone the repository (absolute paths keep this cell safe to re-run)
%cd /content
!test -d stock_model || git clone https://github.com/SiddarthaKoppaka/stock_model.git
%cd /content/stock_model/diffstock_india
!ls -la

## Install Dependencies

In [None]:
# Install requirements
!pip install -r requirements.txt -q

# Verify installation
!python verify_installation.py

## Data Setup

### Option 1: Upload Pre-downloaded Dataset
If you have already scraped the data locally, upload it here.

In [None]:
# Upload dataset files
from google.colab import files
import os

print("Please upload the following files from your local machine:")
print("1. nifty500_10yr.npz (from data/dataset/)")
print("2. relation_matrices.npz (from data/dataset/)")
print("\nUploading...")

uploaded = files.upload()

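# Move files to the expected location (shutil.move also handles names with spaces)
import shutil
os.makedirs('data/dataset', exist_ok=True)
for filename in uploaded.keys():
    shutil.move(filename, os.path.join('data/dataset', filename))
    print(f"Moved {filename} to data/dataset/")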

print("\nDataset files uploaded successfully!")

### ⚠️ CRITICAL: Rebuild Dataset with 16 Features

**The code now expects 16 features per stock instead of 15.**

If you uploaded a dataset built before this change, or scraped data before it, you MUST rebuild the dataset by running the build command in the Option 2 cell below (it is commented out by default).
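
A quick way to tell whether an uploaded dataset predates the change is to compare its feature count against 16. The sketch below is illustrative only: `check_feature_count` is a hypothetical helper, not a repository function, and it assumes that features occupy the last axis of `X_train`, which this notebook does not confirm.

```python
EXPECTED_FEATURES = 16  # taken from the note above


def check_feature_count(feature_names, x_shape, expected=EXPECTED_FEATURES):
    """Return a list of problems; an empty list means the dataset looks current.

    feature_names: sequence of feature labels stored in the dataset
    x_shape:       shape tuple of X_train
    """
    problems = []
    if len(feature_names) != expected:
        problems.append(
            f"feature_names has {len(feature_names)} entries, expected {expected}"
        )
    # Assumption (not confirmed by the notebook): features are the last axis.
    if x_shape[-1] != expected:
        problems.append(
            f"last axis of X_train has length {x_shape[-1]}, expected {expected}"
        )
    return problems


# Example with made-up shapes: a 15-feature dataset is flagged twice.
old_problems = check_feature_count(["f"] * 15, (100, 20, 50, 15))
new_problems = check_feature_count(["f"] * 16, (100, 20, 50, 16))
print(old_problems)
print(new_problems)
```

Once the Verify Data cell has loaded `data`, the equivalent call would be `check_feature_count(data['feature_names'], data['X_train'].shape)`.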

### Option 2: Download Data on Colab (Slow - 30-60 minutes)
**Warning**: This will take 30-60 minutes. Only use if you don't have pre-downloaded data.

In [None]:
# Uncomment to run data scraping on Colab (slow!)
# !python scripts/run_scrape.py

# Build dataset
# !python -c "from src.data.dataset_builder import build_dataset; build_dataset(run_scraping=False)"

## Verify Data

In [None]:
import numpy as np

# Load and inspect dataset
data = np.load('data/dataset/nifty500_10yr.npz', allow_pickle=True)

print("Dataset Contents:")
print("=" * 80)
print(f"Train samples: {len(data['X_train'])}")
print(f"Val samples: {len(data['X_val'])}")
print(f"Test samples: {len(data['X_test'])}")
print(f"\nStocks: {len(data['stock_symbols'])}")
print(f"Features: {len(data['feature_names'])}")
print(f"\nX_train shape: {data['X_train'].shape}")
print(f"y_train shape: {data['y_train'].shape}")

# Load relation matrices
relations = np.load('data/dataset/relation_matrices.npz')
print(f"\nRelation mask density: {relations['R_mask'].mean():.2%}")
print("=" * 80)

## Configuration

In [None]:
import yaml
from pathlib import Path

# Load config
with open('config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Update paths for Colab
config['paths']['root'] = '/content/stock_model/diffstock_india'
config['paths']['checkpoints'] = 'checkpoints'
config['paths']['logs'] = 'logs'
config['paths']['results'] = 'results'

# Create directories
for path in ['checkpoints', 'logs', 'results']:
    os.makedirs(path, exist_ok=True)

print("Configuration:")
print("=" * 80)
print(f"Model: d_model={config['model']['d_model']}, T={config['model']['diffusion_T']}")
print(f"Training: epochs={config['training']['max_epochs']}, batch_size={config['training']['batch_size']}")
print(f"Learning rate: {config['training']['learning_rate']}")
print("=" * 80)

## Create Model

In [None]:
import sys
sys.path.insert(0, '/content/stock_model/diffstock_india')

from src.model.diffstock import create_diffstock_model
from src.utils.seed import set_seed
from src.utils.logger import setup_logger

# Set seed for reproducibility
set_seed(config['seed'])

# Setup logger
setup_logger(log_dir=Path('logs'), log_level='INFO')

# Create model
n_stocks = len(data['stock_symbols'])
model = create_diffstock_model(config, n_stocks)

# Print model summary
model.print_model_summary()

# Move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
print(f"\nModel moved to device: {device}")

## Training

In [None]:
from src.training.trainer import DiffSTOCKTrainer

# Load relation mask as a float32 tensor
R_mask = torch.as_tensor(relations['R_mask'], dtype=torch.float32)

# Create trainer
trainer = DiffSTOCKTrainer(
    model=model,
    config=config,
    R_mask=R_mask,
    device=device
)

print("Trainer initialized!")
print(f"Device: {trainer.device}")
print(f"Mixed precision: {trainer.use_amp}")

In [None]:
# Train the model
print("Starting training...")
print("=" * 80)

history = trainer.train(
    train_data=(data['X_train'], data['y_train']),
    val_data=(data['X_val'], data['y_val'])
)

print("\n" + "=" * 80)
print("Training completed!")
print(f"Best validation IC: {trainer.best_val_ic:.4f}")
print("=" * 80)

## Save Results to Google Drive

In [None]:
import shutil
import json
from datetime import datetime

# Create timestamped directory
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
run_dir = os.path.join(OUTPUT_DIR, f'run_{timestamp}')
os.makedirs(run_dir, exist_ok=True)

print(f"Saving results to: {run_dir}")

# Copy checkpoints
checkpoint_dir = os.path.join(run_dir, 'checkpoints')
os.makedirs(checkpoint_dir, exist_ok=True)
if os.path.exists('checkpoints/best_model.pt'):
    shutil.copy('checkpoints/best_model.pt', checkpoint_dir)
    print("✓ Saved best_model.pt")
if os.path.exists('checkpoints/final_model.pt'):
    shutil.copy('checkpoints/final_model.pt', checkpoint_dir)
    print("✓ Saved final_model.pt")

# Copy logs
log_dir = os.path.join(run_dir, 'logs')
os.makedirs(log_dir, exist_ok=True)
if os.path.exists('logs/training_history.json'):
    shutil.copy('logs/training_history.json', log_dir)
    print("✓ Saved training_history.json")

# Save training summary
summary = {
    'timestamp': timestamp,
    'device': str(device),
    'best_val_ic': float(trainer.best_val_ic),
    'total_epochs': trainer.current_epoch,
    'config': config,
    'model_parameters': model.count_parameters(),
    'dataset_info': {
        'train_samples': len(data['X_train']),
        'val_samples': len(data['X_val']),
        'test_samples': len(data['X_test']),
        'n_stocks': len(data['stock_symbols']),
        'n_features': len(data['feature_names'])
    }
}

summary_path = os.path.join(run_dir, 'training_summary.json')
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=2)
print("✓ Saved training_summary.json")

print(f"\n✅ All results saved to Google Drive: {run_dir}")

## Visualize Training

In [None]:
import matplotlib.pyplot as plt
import json

# Load training history
with open('logs/training_history.json', 'r') as f:
    history = json.load(f)

# Plot training loss
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Training Loss
axes[0, 0].plot(history['train_loss'])
axes[0, 0].set_title('Training Loss', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].grid(True, alpha=0.3)

# Validation IC
val_epochs = [i * config['evaluation']['report_every'] for i in range(len(history['val_metrics']))]
val_ics = [m['IC'] for m in history['val_metrics']]
axes[0, 1].plot(val_epochs, val_ics, marker='o')
axes[0, 1].axhline(y=trainer.best_val_ic, color='r', linestyle='--', label=f'Best: {trainer.best_val_ic:.4f}')
axes[0, 1].set_title('Validation IC', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('IC')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Learning Rate
axes[1, 0].plot(history['learning_rates'])
axes[1, 0].set_title('Learning Rate Schedule', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('LR')
axes[1, 0].set_yscale('log')
axes[1, 0].grid(True, alpha=0.3)

# Validation Metrics
val_acc = [m['Accuracy'] for m in history['val_metrics']]
axes[1, 1].plot(val_epochs, val_acc, marker='s', label='Accuracy')
if history['val_metrics'] and 'ICIR' in history['val_metrics'][0]:
    val_icir = [m.get('ICIR', 0) for m in history['val_metrics']]
    axes[1, 1].plot(val_epochs, val_icir, marker='^', label='ICIR')
axes[1, 1].set_title('Validation Metrics', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Epoch')
axes[1, 1].set_ylabel('Value')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(run_dir, 'training_curves.png'), dpi=150, bbox_inches='tight')
plt.show()

print("✓ Training curves saved")

## Validation Evaluation

In [None]:
from src.evaluation.metrics import compute_all_metrics
from torch.utils.data import DataLoader, TensorDataset

# Load best model
checkpoint = torch.load('checkpoints/best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
print(f"Loaded best model from epoch {checkpoint['epoch']}")

# Apply EMA weights
trainer.ema.shadow = checkpoint['ema_shadow']
trainer.ema.apply_shadow()

# Generate predictions on validation set
model.eval()
val_predictions = []
val_targets = []

val_dataset = TensorDataset(
    torch.FloatTensor(data['X_val']),
    torch.FloatTensor(data['y_val'])
)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

print("Generating validation predictions...")
with torch.no_grad():
    for X_batch, y_batch in val_loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        
        pred, unc = model(X_batch, R_mask.to(device), n_samples=50)
        
        val_predictions.append(pred.cpu().numpy())
        val_targets.append(y_batch.cpu().numpy())

val_predictions = np.concatenate(val_predictions, axis=0)
val_targets = np.concatenate(val_targets, axis=0)

# Compute metrics
val_metrics = compute_all_metrics(val_predictions, val_targets)

print("\n" + "=" * 80)
print("Validation Results")
print("=" * 80)
for metric, value in val_metrics.items():
    print(f"{metric:.<30} {value:.4f}")
print("=" * 80)

# Save validation results
np.savez(
    os.path.join(run_dir, 'validation_results.npz'),
    predictions=val_predictions,
    targets=val_targets,
    dates=data['dates_val']
)

with open(os.path.join(run_dir, 'validation_metrics.json'), 'w') as f:
    json.dump({k: float(v) for k, v in val_metrics.items()}, f, indent=2)

print("\n✓ Validation results saved")

## Test Evaluation
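Both evaluation cells call the model with `n_samples=50` and receive a prediction plus an uncertainty. How DiffSTOCK combines its draws is not shown in this notebook. One common convention for sampling-based models, assumed here purely for illustration, is to report the per-stock mean of the draws as the prediction and their standard deviation as the uncertainty. `aggregate_samples` is a hypothetical helper, not a repository function.

```python
from statistics import mean, pstdev


def aggregate_samples(samples):
    """Collapse repeated draws into a point prediction and a spread.

    samples: list of draws, each a list with one predicted return per stock.
    Returns (means, stds), each with one entry per stock.
    """
    per_stock = list(zip(*samples))  # transpose: one tuple of draws per stock
    means = [mean(draws) for draws in per_stock]
    stds = [pstdev(draws) for draws in per_stock]  # population std of the draws
    return means, stds


# Two draws for two stocks: the draws disagree on stock 0 and agree on stock 1.
means, stds = aggregate_samples([[0.01, 0.05], [0.03, 0.05]])
print(means, stds)
```

If the model follows a convention like this, a larger `n_samples` gives smoother estimates at a roughly proportional increase in inference time.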

In [None]:
# Generate predictions on test set
test_predictions = []
test_targets = []
test_uncertainties = []

test_dataset = TensorDataset(
    torch.FloatTensor(data['X_test']),
    torch.FloatTensor(data['y_test'])
)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

print("Generating test predictions...")
with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        
        pred, unc = model(X_batch, R_mask.to(device), n_samples=50)
        
        test_predictions.append(pred.cpu().numpy())
        test_targets.append(y_batch.cpu().numpy())
        test_uncertainties.append(unc.cpu().numpy())

test_predictions = np.concatenate(test_predictions, axis=0)
test_targets = np.concatenate(test_targets, axis=0)
test_uncertainties = np.concatenate(test_uncertainties, axis=0)

# Compute metrics
test_metrics = compute_all_metrics(test_predictions, test_targets)

print("\n" + "=" * 80)
print("Test Results")
print("=" * 80)
for metric, value in test_metrics.items():
    print(f"{metric:.<30} {value:.4f}")
print("=" * 80)

# Save test results
np.savez(
    os.path.join(run_dir, 'test_results.npz'),
    predictions=test_predictions,
    targets=test_targets,
    uncertainties=test_uncertainties,
    dates=data['dates_test'],
    stock_symbols=data['stock_symbols']
)

with open(os.path.join(run_dir, 'test_metrics.json'), 'w') as f:
    json.dump({k: float(v) for k, v in test_metrics.items()}, f, indent=2)

print("\n✓ Test results saved")

## Backtesting
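The cell below delegates to the repository's `IndianMarketBacktester.run_topk_strategy`. As a rough picture of what a top-K strategy does, the sketch that follows buys the K highest-scored stocks each day with equal weights and charges a proportional cost on the part of the portfolio that changes. It is a simplified stand-in, not the repository's implementation: it rebalances every day, and `cost_rate` is a hypothetical single number, whereas the structure of `config['evaluation']['transaction_costs']` is not shown here.

```python
def topk_backtest(predictions, actuals, k, cost_rate=0.0, initial_capital=1_000_000.0):
    """Equal-weight, long-only top-k strategy with daily rebalancing.

    predictions, actuals: lists of per-day lists (one score / realized return per stock)
    cost_rate: proportional cost applied to the fraction of holdings replaced
    Returns (portfolio_values, daily_returns).
    """
    value = initial_capital
    values, daily_returns = [], []
    prev_holdings = set()
    for scores, rets in zip(predictions, actuals):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        holdings = set(ranked[:k])
        # Fraction of the k slots that changed; the first day counts as full turnover.
        turnover = len(holdings - prev_holdings) / k
        gross = sum(rets[i] for i in holdings) / k
        net = gross - cost_rate * turnover
        value *= 1.0 + net
        values.append(value)
        daily_returns.append(net)
        prev_holdings = holdings
    return values, daily_returns


# Two days, three stocks, k=1: the strategy holds stock 0, then switches to stock 1.
preds = [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5]]
rets = [[0.10, -0.05, 0.02], [0.03, 0.04, -0.01]]
values, daily = topk_backtest(preds, rets, k=1)
values_cost, daily_cost = topk_backtest(preds, rets, k=1, cost_rate=0.001)
print(values[-1], values_cost[-1])
```

Comparing the two runs shows why turnover matters: the same signal earns less once every switch between stocks is charged.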

In [None]:
from src.evaluation.backtester import IndianMarketBacktester

# Run backtest on test set
backtester = IndianMarketBacktester(
    predictions=test_predictions,
    actuals=test_targets,
    dates=data['dates_test'],
    stock_symbols=data['stock_symbols'].tolist(),
    transaction_costs=config['evaluation']['transaction_costs']
)

print("Running backtest...")
results = backtester.run_topk_strategy(
    K=config['evaluation']['top_k'],
    rebalance_freq=config['evaluation']['rebalance_freq']
)

# Print summary
backtester.print_backtest_summary(results)

# Save backtest results
np.savez(
    os.path.join(run_dir, 'backtest_results.npz'),
    portfolio_values=results['portfolio_values'],
    daily_returns=results['daily_returns']
)

backtest_summary = {
    'total_return': float(results['total_return']),
    'annualized_return': float(results['annualized_return']),
    'sharpe_ratio': float(results['sharpe_ratio']),
    'max_drawdown': float(results['max_drawdown']),
    'win_rate': float(results['win_rate']),
    'avg_turnover': float(results['avg_turnover'])
}

with open(os.path.join(run_dir, 'backtest_summary.json'), 'w') as f:
    json.dump(backtest_summary, f, indent=2)

print("\n✓ Backtest results saved")

## Visualize Backtest
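The portfolio-value curve plotted below is what the summary's `max_drawdown` and `sharpe_ratio` condense into single numbers. The sketch here shows one conventional way to compute both from the backtest arrays. It assumes 252 trading days per year and a zero risk-free rate, and it reports drawdown as a negative fraction. The repository's backtester may use different conventions.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumption: annualization factor for daily returns


def sharpe_ratio(daily_returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio from daily returns (0.0 if returns are constant)."""
    excess = [r - risk_free_daily for r in daily_returns]
    sd = stdev(excess)
    if sd == 0:
        return 0.0
    return mean(excess) / sd * math.sqrt(TRADING_DAYS)


def max_drawdown(portfolio_values):
    """Largest peak-to-trough decline, as a fraction <= 0."""
    peak = portfolio_values[0]
    worst = 0.0
    for v in portfolio_values:
        peak = max(peak, v)
        worst = min(worst, v / peak - 1.0)
    return worst


# Toy series: the portfolio peaks at 120, falls to 90, then recovers.
dd = max_drawdown([100.0, 120.0, 90.0, 130.0])
sr = sharpe_ratio([0.01, -0.01, 0.02, 0.0])
print(f"max drawdown={dd:.2%} sharpe={sr:.2f}")
```

In the plot, the drawdown corresponds to the deepest dip below any earlier high of the portfolio-value line.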

In [None]:
# Plot portfolio performance
fig, axes = plt.subplots(2, 1, figsize=(15, 10))

# Portfolio value
dates = data['dates_test']
axes[0].plot(dates, results['portfolio_values'], linewidth=2, color='steelblue')
axes[0].axhline(y=1000000, color='gray', linestyle='--', alpha=0.5, label='Initial Capital')
axes[0].set_title('Portfolio Value Over Time', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Value (₹)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim(dates[0], dates[-1])

# Daily returns
axes[1].bar(dates, results['daily_returns'], width=1, color=['green' if r > 0 else 'red' for r in results['daily_returns']], alpha=0.6)
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=0.8)
axes[1].set_title('Daily Returns', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Return')
axes[1].grid(True, alpha=0.3, axis='y')
axes[1].set_xlim(dates[0], dates[-1])

plt.tight_layout()
plt.savefig(os.path.join(run_dir, 'backtest_performance.png'), dpi=150, bbox_inches='tight')
plt.show()

print("✓ Backtest visualization saved")

## Final Report

In [None]:
# Generate comprehensive report
report = f"""
{'='*80}
DIFFSTOCK INDIA - TRAINING & EVALUATION REPORT
{'='*80}

Run ID: {timestamp}
Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

DATASET
{'-'*80}
Train samples: {len(data['X_train']):,}
Val samples: {len(data['X_val']):,}
Test samples: {len(data['X_test']):,}
Stocks: {len(data['stock_symbols'])}
Features: {len(data['feature_names'])}

MODEL
{'-'*80}
Architecture: DiffSTOCK (MaTCHS + Adaptive DDPM)
Parameters: {model.count_parameters()['Total']:,}
d_model: {config['model']['d_model']}
Diffusion steps: {config['model']['diffusion_T']}

TRAINING
{'-'*80}
Device: {device}
Epochs trained: {trainer.current_epoch}
Best validation IC: {trainer.best_val_ic:.4f}
Batch size: {config['training']['batch_size']}
Learning rate: {config['training']['learning_rate']}

VALIDATION RESULTS
{'-'*80}
IC: {val_metrics['IC']:.4f}
ICIR: {val_metrics.get('ICIR', 0):.4f}
Accuracy: {val_metrics['Accuracy']:.4f}
MCC: {val_metrics['MCC']:.4f}

TEST RESULTS
{'-'*80}
IC: {test_metrics['IC']:.4f}
ICIR: {test_metrics.get('ICIR', 0):.4f}
Accuracy: {test_metrics['Accuracy']:.4f}
MCC: {test_metrics['MCC']:.4f}

BACKTEST RESULTS (Top-{config['evaluation']['top_k']} Strategy)
{'-'*80}
Total Return: {results['total_return']:.2%}
Annualized Return: {results['annualized_return']:.2%}
Sharpe Ratio: {results['sharpe_ratio']:.2f}
Max Drawdown: {results['max_drawdown']:.2%}
Win Rate: {results['win_rate']:.2%}
Avg Turnover: {results['avg_turnover']:.2%}

FILES SAVED
{'-'*80}
✓ checkpoints/best_model.pt
✓ checkpoints/final_model.pt
✓ training_history.json
✓ training_summary.json
✓ validation_metrics.json
✓ validation_results.npz
✓ test_metrics.json
✓ test_results.npz
✓ backtest_summary.json
✓ backtest_results.npz
✓ training_curves.png
✓ backtest_performance.png

LOCATION
{'-'*80}
Google Drive: {run_dir}

{'='*80}
Report generated successfully!
{'='*80}
"""

print(report)

# Save report
with open(os.path.join(run_dir, 'REPORT.txt'), 'w') as f:
    f.write(report)

print(f"\n✅ Complete report saved to: {os.path.join(run_dir, 'REPORT.txt')}")

## Download Results (Optional)

In [None]:
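# Create zip file of all results
import zipfile
from google.colab import files  # re-import in case the Option 1 upload cell was skipped

zip_path = f'/content/diffstock_results_{timestamp}.zip'

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Loop names must not shadow the `files` module used for the download below
    for root, _dirs, filenames in os.walk(run_dir):
        for filename in filenames:
            file_path = os.path.join(root, filename)
            arcname = os.path.relpath(file_path, run_dir)
            zipf.write(file_path, arcname)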

print(f"Created zip file: {zip_path}")
print(f"Size: {os.path.getsize(zip_path) / 1e6:.2f} MB")

# Download
files.download(zip_path)
print("\n✓ Download started!")

## Hyperparameter Tuning Suggestions

Based on the results, consider tuning:

### If IC is low (<0.03):
- Increase `d_model` (128 → 192 or 256)
- Increase `n_layers_mrt` (3 → 4 or 5)
- Reduce `dropout` (0.25 → 0.15)
- Increase `lookback_window` (20 → 30)

### If overfitting (val IC >> test IC):
- Increase `dropout` (0.25 → 0.35)
- Increase `weight_decay` (0.005 → 0.01)
- Increase `noise_augmentation` (0.03 → 0.05)
- Reduce model size

### If training is unstable:
- Reduce `learning_rate` (0.0003 → 0.0001)
- Increase `warmup_steps` (1000 → 2000)
- Reduce `batch_size` (32 → 16)

### Edit `config/config.yaml` and retrain!
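Because this notebook passes the loaded `config` dictionary straight to `create_diffstock_model` and `DiffSTOCKTrainer`, an alternative to editing the YAML file is to override values in memory before re-running the model and training cells. The sketch below uses a hypothetical helper, `apply_overrides`, with dotted keys. Only `model.d_model`, `training.learning_rate`, and `training.batch_size` are key paths confirmed by the Configuration cell. Where the other parameters live in the config is not shown here.

```python
import copy


def apply_overrides(config, overrides):
    """Return a deep copy of `config` with dotted-key overrides applied.

    Raises KeyError for unknown keys so that typos do not silently add new entries.
    """
    new_config = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = new_config
        for key in parents:
            node = node[key]  # KeyError if a section is missing
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        node[leaf] = value
    return new_config


# Stand-in for the loaded YAML; values follow the suggestions listed above.
base = {
    "model": {"d_model": 128},
    "training": {"learning_rate": 0.0003, "batch_size": 32},
}
tuned = apply_overrides(base, {"model.d_model": 192, "training.learning_rate": 0.0001})
print(tuned)
```

Working on a copy leaves the original configuration available for comparison, and the tuned dictionary is what would get written into `training_summary.json` if it is assigned back to `config` before training.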

## Next Steps

1. **Analyze Results**: Review metrics in Google Drive
2. **Tune Hyperparameters**: Adjust config and retrain
3. **Ensemble Models**: Train multiple models with different seeds
4. **Feature Engineering**: Add more technical indicators
5. **Production Deployment**: Integrate with broker API

**All results are saved in your Google Drive!**

---

**Notebook created by**: Siddhartha Koppaka  
**Model**: DiffSTOCK (ICASSP 2024) for Indian Markets  
**Repository**: https://github.com/SiddarthaKoppaka/stock_model