# 🧬 Geometry-Complete Equivariant Diffusion Model
## De Novo Drug Design Training Notebook

This notebook trains a diffusion model for structure-based drug design on Google Colab.

**Requirements:**
- GPU Runtime (T4 recommended)
- ~10GB disk space for code + data

## Cell 1: Check GPU

In [None]:
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ No GPU! Go to Runtime > Change runtime type > GPU")

## Cell 2: Install Dependencies

In [None]:
# Install required packages
!pip install -q torch-geometric rdkit scipy numpy pyyaml tqdm wandb

# Verify RDKit
try:
    from rdkit import Chem
    print("✅ RDKit installed successfully")
except ImportError:
    print("❌ RDKit failed, trying alternative...")
    !pip install rdkit-pypi
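
A quick way to confirm that every package from the install step is present is to look each one up by its import name. The sketch below is only an illustration: the `REQUIRED_MODULES` list and the `find_missing_modules` helper are names introduced here, not part of the repository. The check confirms that a module can be located, not that it imports without error.

```python
import importlib.util

# Import names differ from some pip names:
# torch-geometric -> torch_geometric, pyyaml -> yaml.
REQUIRED_MODULES = ["torch_geometric", "rdkit", "scipy", "numpy", "yaml", "tqdm", "wandb"]


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located in this runtime."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {', '.join(missing)}")
else:
    print("All required modules were found")
```

If the list is not empty, rerunning the install cell for the named packages is usually enough.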

## Cell 3: Clone Repository

In [None]:
# Clone the repository
!git clone https://github.com/Nethrananda21/geom_diffusion.git
%cd geom_diffusion

# Pull latest changes
!git pull origin master

## Cell 4: Download Dataset (Choose ONE option)

### Option A: Synthetic Data (No download needed)
Skip the download cell below. The code auto-generates synthetic data for testing when no dataset is present.

### Option B: Real CrossDocked2020 Data

In [None]:
# OPTION B: Download CrossDocked2020 types file (~3.5GB)
# Uncomment below to download real data

# !mkdir -p data/crossdocked
# !wget -q http://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_v1.3_types.tgz
# !tar -xzf CrossDocked2020_v1.3_types.tgz -C data/
# print("✅ Dataset downloaded")
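
Before uncommenting the download, it can help to confirm the runtime has enough free disk. The sketch below uses the ~10GB figure from the requirements at the top of this notebook as the threshold. The `has_free_space` helper is a name introduced here for illustration, and the real space needed may be higher while the archive and its extracted contents coexist on disk.

```python
import shutil


def has_free_space(path=".", required_gb=10.0):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb, free_gb


ok, free_gb = has_free_space(".", required_gb=10.0)
status = "enough" if ok else "not enough"
print(f"Free disk space: {free_gb:.1f} GB ({status} for ~10 GB)")
```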

## Cell 5: Configure Training

Adjust settings for your GPU memory:

In [None]:
# View the T4 config
!cat configs/debug_t4.yaml

In [None]:
# Optional: lower memory-heavy settings if training runs out of GPU memory (OOM)
# This cell rewrites configs/debug_t4.yaml in place

import yaml

with open('configs/debug_t4.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Reduce if you get OOM errors:
# config['training']['batch_size'] = 2
# config['model']['egnn']['hidden_dim'] = 128

# For faster testing:
config['training']['max_epochs'] = 10  # Reduce epochs for quick test
config['hardware']['num_workers'] = 2  # Colab has 2 CPUs

with open('configs/debug_t4.yaml', 'w') as f:
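    # sort_keys=False keeps the original key order; YAML comments are still lost on rewrite
    yaml.dump(config, f, sort_keys=False)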

print("✅ Config updated")
print(f"   Batch size: {config['training']['batch_size']}")
print(f"   Epochs: {config['training']['max_epochs']}")
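
The same edits can be written as dotted-key overrides, which fail loudly on a mistyped key instead of silently adding a new setting. This is a minimal sketch, not code from the repository: `apply_overrides` is a hypothetical helper, and the values in `example` are placeholders that only mirror the key layout used in the cell above.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted-key overrides applied.

    Raises KeyError if a key path does not already exist, so typos
    do not silently create new settings.
    """
    updated = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for key in parents:
            node = node[key]
        if leaf not in node:
            raise KeyError(dotted_key)
        node[leaf] = value
    return updated


# Placeholder config with the same shape as the keys edited above (values are assumptions).
example = {
    "training": {"batch_size": 4, "max_epochs": 100},
    "model": {"egnn": {"hidden_dim": 256}},
}
smaller = apply_overrides(example, {
    "training.batch_size": 2,
    "model.egnn.hidden_dim": 128,
})
print(smaller)
```

Because the helper works on a deep copy, the loaded config stays unchanged until the result is written back with `yaml.dump`.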

## Cell 6: Start Training 🚀

In [None]:
# Train the model
!python train.py --config configs/debug_t4.yaml --checkpoint_dir ./checkpoints

## Cell 7: Monitor Training (Optional)

In [None]:
# List saved checkpoints. Colab runs one cell at a time, so this cell
# executes only after the training cell finishes or is interrupted.
from pathlib import Path

checkpoints = list(Path('checkpoints').glob('*.pt')) if Path('checkpoints').exists() else []
print(f"Checkpoints saved: {len(checkpoints)}")
for ckpt in checkpoints:
    print(f"  - {ckpt.name} ({ckpt.stat().st_size / 1e6:.1f} MB)")
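
When several checkpoints accumulate, the most recently written one is often the one of interest. The sketch below picks it by file modification time. `newest_checkpoint` is a helper introduced here for illustration, and it assumes checkpoints are `.pt` files directly inside the checkpoint directory, as in the listing above.

```python
from pathlib import Path


def newest_checkpoint(checkpoint_dir="checkpoints", pattern="*.pt"):
    """Return the most recently modified checkpoint path, or None if there is none."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    candidates = list(directory.glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


latest = newest_checkpoint("checkpoints")
print(f"Newest checkpoint: {latest.name if latest else 'none yet'}")
```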

## Cell 8: Resume Training (If Interrupted)

In [None]:
# Resume from checkpoint if training was interrupted
# !python train.py --config configs/debug_t4.yaml --resume ./checkpoints/best_model.pt
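
To avoid passing `--resume` with a path that does not exist, the command can be assembled conditionally. The sketch below is illustrative only: `build_train_command` is a hypothetical helper, and it assumes `train.py` accepts `--checkpoint_dir` and `--resume` together, which this notebook only shows used separately.

```python
from pathlib import Path


def build_train_command(config_path, checkpoint_dir="./checkpoints", resume_name="best_model.pt"):
    """Build the train.py argument list, adding --resume only if the checkpoint exists."""
    command = ["python", "train.py", "--config", config_path, "--checkpoint_dir", checkpoint_dir]
    resume_path = Path(checkpoint_dir) / resume_name
    if resume_path.is_file():
        # Assumption: --resume can be combined with --checkpoint_dir.
        command += ["--resume", str(resume_path)]
    return command


command = build_train_command("configs/debug_t4.yaml")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.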

## Cell 9: Download Trained Model

In [None]:
# Download the best checkpoint to your local machine
from pathlib import Path  # imported here so this cell works even if Cell 7 was skipped

from google.colab import files

if Path('checkpoints/best_model.pt').exists():
    files.download('checkpoints/best_model.pt')
    print("✅ Model downloaded!")
else:
    print("❌ No checkpoint found yet. Complete training first.")

## Cell 10: Generate Molecules (After Training)

In [None]:
# Generate molecules for a target pocket
# Uncomment after training completes

# !python generate.py \
#     --checkpoint checkpoints/best_model.pt \
#     --pocket_pdb /path/to/pocket.pdb \
#     --n_samples 100 \
#     --output_dir ./generated