# 🧬 Geometry-Complete Equivariant Diffusion Model
## Training with Pre-Downloaded Dataset

**Dataset location:** `/content/drive/MyDrive/CrossDock2020/CrossDocked2020_v1.3.tgz`

## Cell 1: Mount Drive & Install Dependencies

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import torch
print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU!"}')

!pip install -q torch-geometric rdkit scipy numpy pyyaml tqdm wandb
print('✅ Dependencies installed')

## Cell 2: Clone/Update Repository

In [None]:
import os

REPO_PATH = '/content/drive/MyDrive/geom_diffusion'

if not os.path.exists(REPO_PATH):
    !git clone https://github.com/Nethrananda21/geom_diffusion.git {REPO_PATH}

%cd {REPO_PATH}
!git pull origin master
print(f'✅ Working dir: {os.getcwd()}')

## Cell 3: Extract Dataset (from your downloaded .tgz)

In [None]:
import os

%cd /content/drive/MyDrive/geom_diffusion

TGZ_PATH = '/content/drive/MyDrive/CrossDock2020/CrossDocked2020_v1.3.tgz'
RAW_DATA = 'data/CrossDocked2020'

if os.path.exists(RAW_DATA):
    print(f'✅ Already extracted: {RAW_DATA}')
    !du -sh {RAW_DATA}
else:
    print('📦 Extracting dataset...')
    !mkdir -p data
    !tar -xzf {TGZ_PATH} -C data/
    print('✅ Extraction complete!')
    !ls data/
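
The check in this cell assumes the archive unpacks into `data/CrossDocked2020`. If the top-level folder inside the `.tgz` has a different name, the "Already extracted" branch never triggers and the preprocessing step will not find its input. A minimal sketch for inspecting the archive before extracting follows. The helper `top_level_entries` and its `max_members` cutoff are hypothetical and not part of the repository.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(tgz_path, max_members=2000):
    """Return the top-level names seen among the first `max_members` members.

    Hypothetical helper: it reads the gzip stream sequentially and stops
    early, so it does not have to scan the whole archive.
    """
    names = set()
    with tarfile.open(tgz_path, "r:gz") as tar:
        for i, member in enumerate(tar):
            if i >= max_members:
                break
            # PurePosixPath drops a leading "./", so "./a/b" yields ("a", "b").
            parts = PurePosixPath(member.name).parts
            if parts:
                names.add(parts[0])
    return names
```

Calling `top_level_entries(TGZ_PATH)` before extraction shows which folder name `RAW_DATA` should point to. Because only the first members are read, the result can be incomplete for archives with several top-level folders.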

## Cell 4: Preprocess Dataset (10-20 min)

In [None]:
from pathlib import Path

%cd /content/drive/MyDrive/geom_diffusion

train_pkl = Path('data/crossdocked/train_data.pkl')

if train_pkl.exists():
    print(f'✅ Already preprocessed: {train_pkl}')
else:
    print('⏳ Preprocessing (10-20 minutes)...')
    !python preprocess_crossdocked.py \
        --data_dir data/CrossDocked2020 \
        --output_dir data/crossdocked \
        --config configs/debug_t4.yaml
    print('✅ Done!')

## Cell 5: Verify Data & Delete Cache

In [None]:
import shutil
from pathlib import Path

%cd /content/drive/MyDrive/geom_diffusion

# Delete old cache
cache = Path('data/cache')
if cache.exists():
    shutil.rmtree(cache)
    print('🗑️ Deleted old cache')

# Check data
train = Path('data/crossdocked/train_data.pkl')
val = Path('data/crossdocked/val_data.pkl')

if train.exists() and val.exists():
    print('✅ Real data ready!')
    print(f'   Train: {train.stat().st_size/1e6:.1f} MB')
    print(f'   Val: {val.stat().st_size/1e6:.1f} MB')
else:
    print('⚠️ Data not found')
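
File size alone does not show whether preprocessing produced usable samples. The sketch below opens a pickle and reports its type and entry count. The helper `summarize_pickle` is hypothetical, and the structure of `train_data.pkl` is an assumption: the source does not say whether it holds a list, a dict, or a custom object. Unpickling runs arbitrary code, so use it only on files you generated yourself. If the pickle stores repository classes, run it from the repository directory so those classes can be imported.

```python
import pickle
from pathlib import Path


def summarize_pickle(path):
    """Load a pickle and return its type name, entry count, and size in MB.

    Hypothetical helper. `entries` is None when the object has no length.
    """
    path = Path(path)
    with path.open("rb") as f:
        obj = pickle.load(f)  # loads the whole object into memory
    entries = len(obj) if hasattr(obj, "__len__") else None
    return {
        "type": type(obj).__name__,
        "entries": entries,
        "size_mb": path.stat().st_size / 1e6,
    }
```

For example, `summarize_pickle('data/crossdocked/val_data.pkl')` is a cheaper first check than loading the training split, which may need a lot of RAM.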

## Cell 6: Start Training 🚀

In [None]:
%cd /content/drive/MyDrive/geom_diffusion
!python train.py --config configs/debug_t4.yaml --checkpoint_dir checkpoints

## Cell 7: Resume Training (If Disconnected)

In [None]:
# Run Cells 1 and 2 first, then uncomment the lines below:
# %cd /content/drive/MyDrive/geom_diffusion
# !python train.py --config configs/debug_t4.yaml --resume checkpoints/best_model.pt
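
The resume command above points at `checkpoints/best_model.pt`, which may not be the most recent state if the session dropped mid-epoch. As a rough sketch, the hypothetical helper below returns the most recently modified checkpoint file. It assumes checkpoints use the `.pt` extension, as `best_model.pt` does. Whether `train.py` writes any other checkpoint files is not confirmed by this notebook.

```python
from pathlib import Path


def latest_checkpoint(checkpoint_dir="checkpoints", pattern="*.pt"):
    """Return the most recently modified file matching `pattern`, or None.

    Hypothetical helper: the default directory and pattern are assumptions
    based on the `--checkpoint_dir checkpoints` and `best_model.pt` names
    used in this notebook.
    """
    files = sorted(
        Path(checkpoint_dir).glob(pattern),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return files[-1] if files else None
```

If it returns a path, that path can be passed to `--resume` in place of `checkpoints/best_model.pt`. A `None` result means no checkpoint was saved, so training has to start again from Cell 6.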

## Cell 8: (Optional) Delete .tgz After Extraction

In [None]:
# Frees about 50 GB on Drive. Run only after a successful extraction!
# !rm /content/drive/MyDrive/CrossDock2020/CrossDocked2020_v1.3.tgz
# print('🗑️ Deleted .tgz file')