# 🧬 Geometry-Complete Equivariant Diffusion Model
## De Novo Drug Design Training (Google Drive Storage)

**Data stored on Google Drive. Code runs from Colab local disk.**

## Cell 1: Mount Google Drive & Setup Paths

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os

# All data goes to Google Drive
DRIVE_BASE = '/content/drive/MyDrive/geom_diffusion_data'
os.makedirs(f'{DRIVE_BASE}/crossdocked', exist_ok=True)
os.makedirs(f'{DRIVE_BASE}/checkpoints', exist_ok=True)

print('✅ Drive mounted')
print(f'📁 Data path: {DRIVE_BASE}')
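
Because Cell 4 writes a large archive to this path, it can help to check free space first. The following is a minimal sketch, not part of the repository: `free_gb` and `has_room` are hypothetical helper names, and the figure reported for a mounted Drive folder may not match the account quota exactly, so treat it as a rough guide.

```python
import shutil


def free_gb(path: str) -> float:
    """Return the free space, in GB, of the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str, required_gb: float) -> bool:
    """True if at least `required_gb` GB are reported free at `path`."""
    return free_gb(path) >= required_gb


# Example (in Colab, after mounting Drive):
#   has_room('/content/drive/MyDrive', 50)
```

If the check returns `False`, free up Drive space before running the download cell.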

## Cell 2: Install Dependencies

In [None]:
import torch
print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None!"}')

!pip install -q torch-geometric rdkit scipy numpy pyyaml tqdm wandb
print('✅ Dependencies installed')
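
The cell above prints a success message even if `pip` failed quietly. As a rough sanity check, the sketch below reports which packages cannot be found. `missing_modules` is a hypothetical helper. The import names are the ones these packages normally install under (`torch_geometric` for `torch-geometric`, `yaml` for `pyyaml`).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Import names for the packages installed in Cell 2.
REQUIRED = ['torch', 'torch_geometric', 'rdkit', 'scipy',
            'numpy', 'yaml', 'tqdm', 'wandb']

missing = missing_modules(REQUIRED)
print('Missing:', missing if missing else 'none')
```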

## Cell 3: Clone Repository

In [None]:
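import os

# Clone into /content explicitly so that re-running this cell from another
# working directory does not create a nested copy of the repository.
if not os.path.exists('/content/geom_diffusion'):
    !git clone https://github.com/Nethrananda21/geom_diffusion.git /content/geom_diffusion
%cd /content/geom_diffusion
!git pull origin master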

## Cell 4: Download Dataset to Google Drive

⚠️ Downloads ~50 GB to Google Drive. **Skip if already downloaded.**

In [None]:
import os

DRIVE_BASE = '/content/drive/MyDrive/geom_diffusion_data'
RAW_DATA = f'{DRIVE_BASE}/CrossDocked2020'

# Note: this only checks that the folder exists. An interrupted download
# also passes the check, so delete a partial folder before retrying.
if os.path.exists(RAW_DATA):
    print(f'✅ Dataset already exists on Drive: {RAW_DATA}')
    !du -sh {RAW_DATA}
else:
    print('📥 Downloading CrossDocked2020 to Google Drive...')
    print('   Takes 30-60 min. Progress bar shows status.')
    !curl -L --progress-bar http://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_v1.3.tgz | tar -xzf - -C {DRIVE_BASE}/
    print('\n✅ Download complete!')
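
A folder-exists check cannot tell a finished download from one cut off by a disconnect. One common workaround, sketched here as an assumption rather than something the repository does, is to write a marker file only after extraction succeeds and to test for that marker instead. `MARKER_NAME`, `is_complete`, and `mark_complete` are hypothetical names.

```python
from pathlib import Path

# Hypothetical marker file name; any name not used by the dataset works.
MARKER_NAME = '.download_complete'


def is_complete(dataset_dir) -> bool:
    """True only if the marker file was written after a full extraction."""
    return (Path(dataset_dir) / MARKER_NAME).is_file()


def mark_complete(dataset_dir) -> None:
    """Record that extraction into `dataset_dir` finished."""
    (Path(dataset_dir) / MARKER_NAME).touch()


# Intended use in the download cell:
#   if is_complete(RAW_DATA): skip the download
#   else: run the curl | tar command, then call mark_complete(RAW_DATA)
```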

## Cell 5: Create Symlink (Drive → Local data folder)

In [None]:
import os

DRIVE_BASE = '/content/drive/MyDrive/geom_diffusion_data'

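# Replace the repo's local data folder with a symlink to Drive.
# If ./data is already a symlink, rm removes only the link, not the Drive contents.
# If ./data is a real folder, its local contents are deleted.
%cd /content/geom_diffusion
!rm -rf data
!ln -s {DRIVE_BASE} data

print('✅ Symlink created: ./data → Google Drive')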
!ls -la data/
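
For a version that is safe to re-run, the link can also be created from Python. This is a minimal sketch under the assumption that a real (non-symlink) `data` folder should never be deleted automatically. `ensure_symlink` is a hypothetical helper, not part of the repository.

```python
import os
from pathlib import Path


def ensure_symlink(target, link) -> bool:
    """Point `link` at `target`. Return True if the link was (re)created."""
    link = Path(link)
    if link.is_symlink():
        if os.readlink(link) == str(target):
            return False          # already correct, nothing to do
        link.unlink()             # stale link: remove the link only
    elif link.exists():
        # Refuse to delete a real folder or file.
        raise FileExistsError(f'{link} exists and is not a symlink')
    link.symlink_to(target, target_is_directory=True)
    return True


# Example (in Colab):
#   ensure_symlink('/content/drive/MyDrive/geom_diffusion_data',
#                  '/content/geom_diffusion/data')
```

Unlike `rm -rf data`, this raises an error instead of removing a real folder, which leaves the decision to the user.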

## Cell 6: Preprocess Dataset

In [None]:
import os
from pathlib import Path

%cd /content/geom_diffusion

# Check if already preprocessed
train_pkl = Path('data/crossdocked/train_data.pkl')
if train_pkl.exists():
    print('✅ Already preprocessed!')
    print(f'   {train_pkl}: {train_pkl.stat().st_size / 1e6:.1f} MB')
else:
    # Find raw data
    raw = Path('data/CrossDocked2020')
    if raw.exists():
        print('⏳ Preprocessing (10-20 min)...')
        !python preprocess_crossdocked.py \
            --data_dir data/CrossDocked2020 \
            --output_dir data/crossdocked \
            --config configs/debug_t4.yaml
        print('\n✅ Preprocessing complete!')
    else:
        print('❌ Raw data not found at data/CrossDocked2020')
        !ls -la data/
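
File size alone says little about what preprocessing produced. The sketch below loads a pickle and reports its top-level type and length. The internal layout of `train_data.pkl` is not documented here, so this makes no assumption about it. `summarize_pickle` is a hypothetical helper. Unpickling may need the repository's classes to be importable, and should only be done on files you created yourself.

```python
import pickle


def summarize_pickle(path):
    """Load a pickle and return (type name, length or None)."""
    with open(path, 'rb') as f:
        obj = pickle.load(f)
    size = len(obj) if hasattr(obj, '__len__') else None
    return type(obj).__name__, size


# Example (run from /content/geom_diffusion after preprocessing):
#   summarize_pickle('data/crossdocked/train_data.pkl')
```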

## Cell 7: Delete Cache & Verify Data

In [None]:
import shutil
from pathlib import Path

%cd /content/geom_diffusion

# Delete synthetic cache
cache = Path('data/cache')
if cache.exists():
    shutil.rmtree(cache)
    print('🗑️ Deleted old cache')

# Verify
train = Path('data/crossdocked/train_data.pkl')
val = Path('data/crossdocked/val_data.pkl')
if train.exists() and val.exists():
    print('✅ Ready to train on REAL data!')
    print(f'   Train: {train.stat().st_size / 1e6:.1f} MB')
    print(f'   Val: {val.stat().st_size / 1e6:.1f} MB')
else:
    print('⚠️ Real data not found - will use synthetic')

## Cell 8: Start Training 🚀

In [None]:
%cd /content/geom_diffusion

# data/checkpoints resolves to Drive through the ./data symlink,
# so checkpoints survive a Colab disconnect.
!python train.py --config configs/debug_t4.yaml --checkpoint_dir data/checkpoints

## Cell 9: Resume Training (After Disconnect)

In [None]:
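# A fresh runtime loses the Drive mount, installed packages, the clone, and the symlink.
# Run Cells 1, 2, 3, and 5 first to restore them, then uncomment: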
# %cd /content/geom_diffusion
# !python train.py --config configs/debug_t4.yaml --resume data/checkpoints/best_model.pt
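
If `best_model.pt` is missing, or the most recent state is preferred, the newest file in the checkpoint folder can be located by modification time. This is a sketch under the assumption that checkpoints use a `.pt` extension, as `best_model.pt` suggests. `latest_checkpoint` is a hypothetical helper, and the newest checkpoint is not necessarily the best-scoring one.

```python
from pathlib import Path


def latest_checkpoint(ckpt_dir, pattern='*.pt'):
    """Return the most recently modified checkpoint, or None if there is none."""
    files = [p for p in Path(ckpt_dir).glob(pattern) if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


# Example (run from /content/geom_diffusion):
#   ckpt = latest_checkpoint('data/checkpoints')
#   then pass str(ckpt) to --resume in place of best_model.pt
```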

## Cell 10: Check Checkpoints on Drive

In [None]:
print('📁 Checkpoints on Google Drive:')
!ls -la /content/drive/MyDrive/geom_diffusion_data/checkpoints/