# 🧬 Geometry-Complete Equivariant Diffusion Model
## De Novo Drug Design Training (Everything on Google Drive)

**All code and data are stored on Google Drive, so they persist across Colab sessions.**

## Cell 1: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print('✅ Drive mounted')

## Cell 2: Clone Repository to Google Drive

In [None]:
import os

# Everything goes to Drive
DRIVE_PATH = '/content/drive/MyDrive/geom_diffusion'

if not os.path.exists(DRIVE_PATH):
    print('📥 Cloning repository to Google Drive...')
    !git clone https://github.com/Nethrananda21/geom_diffusion.git {DRIVE_PATH}
else:
    print('✅ Repository already exists on Drive')

%cd {DRIVE_PATH}
!git pull origin master
print(f'\n📁 Working directory: {os.getcwd()}')

## Cell 3: Install Dependencies

In [None]:
import torch
print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None!"}')

!pip install -q torch-geometric rdkit scipy numpy pyyaml tqdm wandb
print('✅ Dependencies installed')
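
A quiet `pip install -q` can hide a failed install, so it may help to confirm that each package is importable before moving on. The following is a minimal sketch, not part of the repository: `missing_modules` and `REQUIRED_MODULES` are hypothetical names, and the list simply mirrors the packages in the `pip install` line above.

```python
import importlib.util

# Import names differ from pip names for some packages
# (torch-geometric -> torch_geometric, pyyaml -> yaml).
REQUIRED_MODULES = ["torch_geometric", "rdkit", "scipy", "numpy", "yaml", "tqdm", "wandb"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, re-running the install cell without `-q` shows the full pip output.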

## Cell 4: Download Dataset to Google Drive

⚠️ Downloads ~50 GB. **Skip this cell if the dataset is already on Drive.**

In [None]:
import os

%cd /content/drive/MyDrive/geom_diffusion

RAW_DATA = 'data/CrossDocked2020'

if os.path.exists(RAW_DATA):
    print(f'✅ Dataset already exists: {RAW_DATA}')
    !du -sh {RAW_DATA}
else:
    print('📥 Downloading CrossDocked2020 to Google Drive...')
    print('   Takes 30-60 min. Progress bar shows status.')
    !mkdir -p data
    !curl -L --progress-bar http://bits.csb.pitt.edu/files/crossdock2020/CrossDocked2020_v1.3.tgz | tar -xzf - -C data/
    print('\n✅ Download complete!')
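
Since the download is around 50 GB, checking free space before starting can save a failed run. The sketch below uses `shutil.disk_usage` from the standard library. `free_gb` and `has_room` are hypothetical helpers, the 50 GB threshold is only the approximate figure quoted above (the extracted data may need more), and the number reported for a mounted Drive folder may not match the Drive quota exactly.

```python
import shutil


def free_gb(path="."):
    """Free space, in GB (10**9 bytes), on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=".", required_gb=50.0):
    """True if at least `required_gb` GB are free at `path`."""
    return free_gb(path) >= required_gb


print(f"Free space here: {free_gb('.'):.1f} GB")
print("Enough room for ~50 GB:", has_room(".", 50.0))
```

On Colab, the path to check would be the Drive folder used in this notebook, for example `has_room('/content/drive/MyDrive', 50.0)`.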

## Cell 5: Preprocess Dataset

In [None]:
import os
from pathlib import Path

%cd /content/drive/MyDrive/geom_diffusion

train_pkl = Path('data/crossdocked/train_data.pkl')
if train_pkl.exists():
    print(f'✅ Already preprocessed: {train_pkl}')
else:
    raw = Path('data/CrossDocked2020')
    if raw.exists():
        print('⏳ Preprocessing (10-20 min)...')
        !python preprocess_crossdocked.py \
            --data_dir data/CrossDocked2020 \
            --output_dir data/crossdocked \
            --config configs/debug_t4.yaml
        print('\n✅ Done!')
    else:
        print('❌ Run Cell 4 first to download data')

## Cell 6: Delete Cache & Verify

In [None]:
import shutil
from pathlib import Path

%cd /content/drive/MyDrive/geom_diffusion

cache = Path('data/cache')
if cache.exists():
    shutil.rmtree(cache)
    print('🗑️ Deleted cache')

train = Path('data/crossdocked/train_data.pkl')
val = Path('data/crossdocked/val_data.pkl')
if train.exists() and val.exists():
    print(f'✅ Ready! Train: {train.stat().st_size/1e6:.1f}MB, Val: {val.stat().st_size/1e6:.1f}MB')
else:
    print('⚠️ Using synthetic data')
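
The size check above confirms only that the files exist. To also confirm that they load, something like the following sketch could be run. `summarize_pickle` is a hypothetical helper, and the structure of the pickled object is an assumption: the notebook does not say what `preprocess_crossdocked.py` writes, so the helper reports only the type and, where available, the length. Unpickling may also require the repository's own classes to be importable, and large files can take a while to load.

```python
import pickle
from pathlib import Path


def summarize_pickle(path):
    """Load a pickle file and return (type name, length or None)."""
    with Path(path).open("rb") as f:
        obj = pickle.load(f)
    length = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, length


# Only inspect files that are actually present.
for name in ("train_data.pkl", "val_data.pkl"):
    p = Path("data/crossdocked") / name
    if p.exists():
        print(name, summarize_pickle(p))
```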

## Cell 7: Start Training 🚀

In [None]:
%cd /content/drive/MyDrive/geom_diffusion
!python train.py --config configs/debug_t4.yaml --checkpoint_dir checkpoints

## Cell 8: Resume Training (After Reconnect)

In [None]:
# After reconnecting: run Cell 1 (mount Drive), then uncomment and run the two lines below
# %cd /content/drive/MyDrive/geom_diffusion
# !python train.py --config configs/debug_t4.yaml --resume checkpoints/best_model.pt
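
Before resuming, it can be useful to see which checkpoint file was written most recently. This is a minimal sketch under assumptions: `newest_checkpoint` is a hypothetical helper, and the `*.pt` pattern is a guess, since the only checkpoint name this notebook mentions is `best_model.pt`.

```python
from pathlib import Path


def newest_checkpoint(checkpoint_dir="checkpoints", pattern="*.pt"):
    """Return the most recently modified file matching `pattern`, or None."""
    files = list(Path(checkpoint_dir).glob(pattern))
    return max(files, key=lambda p: p.stat().st_mtime) if files else None


ckpt = newest_checkpoint()
print(ckpt if ckpt else "No checkpoint found; start with Cell 7 instead.")
```

The returned path can then be passed to `--resume` in place of `checkpoints/best_model.pt`.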

## Cell 9: Check Checkpoints

In [None]:
!ls -la /content/drive/MyDrive/geom_diffusion/checkpoints/