# 🧬 BioFoundry Active Learning with Geometric Deep Learning

**Corrected & Production-Ready Version**

---

## 📋 Overview

This notebook implements the complete DBTL (Design-Build-Test-Learn) cycle for CAR-T engineering:

1. **Geometric Feature Learning**: Train EquiformerV2 on AlphaFold structures
2. **Embedding Extraction**: Capture intermediate activations with a forward hook (corrected method) instead of using the model's direct output
3. **Active Learning**: Batch Diversity Sampling (pool-based approximation)
4. **Iterative Optimization**: Manual validation + model update loop

### Key Corrections Applied:
- ✅ Embedding extraction via `register_forward_hook`
- ✅ Renamed MOBO-OSD → Batch Diversity Sampling (academic honesty)
- ✅ GPU-adaptive configurations (T4/V100/A100)
- ✅ Production-grade dependency installation order
- ✅ **Fixed: submitit module now included in dependencies**
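
The hook-based extraction named in the first correction can be illustrated on a toy network. This is a minimal sketch, not the notebook's Cell 8: `TinyRegressor` and `extract_embeddings` are hypothetical names, and the hooked layer stands in for whichever EquiformerV2 block is chosen as the embedding source.

```python
import torch
import torch.nn as nn


class TinyRegressor(nn.Module):
    """Hypothetical stand-in for the trained model (assumption, not EquiformerV2)."""

    def __init__(self, in_dim=8, hidden_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        return self.head(self.encoder(x))


def extract_embeddings(model, layer, inputs):
    """Run one forward pass and return the output of `layer`."""
    captured = []

    def hook(module, args, output):
        # Detach so the stored tensor does not keep the autograd graph alive.
        captured.append(output.detach().cpu())

    handle = layer.register_forward_hook(hook)
    try:
        model.eval()
        with torch.no_grad():
            model(inputs)
    finally:
        handle.remove()  # remove the hook even if the forward pass fails
    return captured[0]


model = TinyRegressor()
batch = torch.randn(5, 8)
embeddings = extract_embeddings(model, model.encoder, batch)
print(embeddings.shape)  # torch.Size([5, 16])
```

The hook returns the 16-dimensional encoder activations rather than the scalar prediction, which is the distinction the correction is about.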

---

**Author**: Based on correcting.md analysis  
**Runtime**: 2-6 hours (depends on GPU: T4 ~6h, V100 ~3h, A100 ~2h)  
**Prerequisites**: LMDB datasets uploaded to Google Drive

## 🔧 Cell 1: Environment Check & GPU Verification

First, verify GPU access and auto-configure based on GPU type.

In [None]:
import subprocess
import sys

# Check GPU
print("=" * 60)
print("GPU Information:")
print("=" * 60)
subprocess.run(["nvidia-smi"], check=False)

import torch
print(f"\nPyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    
    # Auto-configure based on GPU type
    gpu_name = torch.cuda.get_device_name(0)
    if "A100" in gpu_name:
        RECOMMENDED_BATCH_SIZE = 16
        RECOMMENDED_LMAX = [4]
    elif "V100" in gpu_name:
        RECOMMENDED_BATCH_SIZE = 8
        RECOMMENDED_LMAX = [4]
    elif "T4" in gpu_name:
        RECOMMENDED_BATCH_SIZE = 4
        RECOMMENDED_LMAX = [2]  # Critical: T4 cannot handle lmax=4
    else:
        RECOMMENDED_BATCH_SIZE = 4
        RECOMMENDED_LMAX = [2]
    
    print(f"\n⚠️ Recommended Config for {gpu_name}:")
    print(f"  - batch_size: {RECOMMENDED_BATCH_SIZE}")
    print(f"  - lmax_list: {RECOMMENDED_LMAX}")
else:
    print("⚠️ WARNING: No GPU detected!")
    RECOMMENDED_BATCH_SIZE = 1
    RECOMMENDED_LMAX = [2]
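
The same GPU-to-settings mapping can also be kept as data, which makes it easier to test and to extend with new GPU types. A minimal sketch, assuming the values from Cell 1; `GPU_PROFILES` and `recommend_config` are hypothetical names that do not appear in the notebook.

```python
import copy

# Values mirror the branches in Cell 1; the first matching marker wins.
GPU_PROFILES = [
    ("A100", {"batch_size": 16, "lmax_list": [4]}),
    ("V100", {"batch_size": 8, "lmax_list": [4]}),
    ("T4", {"batch_size": 4, "lmax_list": [2]}),
]
FALLBACK_PROFILE = {"batch_size": 4, "lmax_list": [2]}  # unknown GPU types


def recommend_config(gpu_name):
    """Return the recommended settings for a GPU name string."""
    for marker, profile in GPU_PROFILES:
        if marker in gpu_name:
            return copy.deepcopy(profile)  # callers may mutate the result
    return copy.deepcopy(FALLBACK_PROFILE)


print(recommend_config("Tesla T4"))
```

In the notebook, the argument would be the string returned by `torch.cuda.get_device_name(0)`.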

## 📦 Cell 2: Install Dependencies (Corrected Order)

⚠️ **Critical**: Follow this exact installation order to avoid version conflicts.

This implements the production-grade sequence from `correcting.md`:
1. Uninstall existing PyG components
2. Install specific PyTorch version
3. Install PyG with matching CUDA version
4. Install scipy 1.13.1 for `sph_harm` compatibility
5. **Install submitit** (required by main_oc20.py)
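
Step 3 only works when the wheel index URL matches the installed PyTorch version and CUDA build. A minimal sketch of deriving that URL, assuming the `torch-<version>+<cuda tag>.html` pattern used in the install cell; `pyg_wheel_index` is a hypothetical helper name.

```python
def pyg_wheel_index(torch_version, cuda_version):
    """Build the PyG wheel index URL for a PyTorch/CUDA pair.

    torch_version: e.g. "2.1.0" or "2.1.0+cu121"
    cuda_version:  e.g. "12.1", or None for a CPU-only build
    """
    base = torch_version.split("+")[0]  # drop any local build suffix
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{base}+{tag}.html"


print(pyg_wheel_index("2.1.0+cu121", "12.1"))
# https://data.pyg.org/whl/torch-2.1.0+cu121.html
```

Inside the notebook, the two arguments could come from `torch.__version__` and `torch.version.cuda`, so a mismatch between the hard-coded URL and the runtime shows up before the install step.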

In [None]:
print("\n" + "=" * 60)
print("Installing Dependencies...")
print("=" * 60)

# Step 1: Uninstall existing PyG (avoid conflicts)
!pip uninstall -y torch-scatter torch-sparse torch-geometric torch-cluster

# Step 2: Install PyTorch (stable version for Colab)
!pip install torch==2.1.0 torchvision==0.16.0

# Step 3: Install PyG with CUDA 12.1 (Colab default)
!pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv \
    -f https://data.pyg.org/whl/torch-2.1.0+cu121.html

!pip install torch-geometric

# Step 4: Install other dependencies (✅ submitit added)
!pip install lmdb pyyaml tqdm biopython ase e3nn timm \
    scipy==1.13.1 \
    numba wandb tensorboard submitit \
    scikit-learn matplotlib seaborn

print("\n✅ All dependencies installed successfully!")
print("✅ submitit module included (required by main_oc20.py)")

## 📂 Cell 3: Mount Google Drive & Upload Data

⚠️ **CRITICAL MODIFICATION REQUIRED**:

Change `DRIVE_DATA_PATH` to your actual Google Drive path!

```python
DRIVE_DATA_PATH = "/content/drive/My Drive/BioFoundry/data"  # ← MODIFY THIS
```

**Why copy to local disk?**
- Reading LMDB files directly from Google Drive is 10-100× slower than reading them from local disk
- This step is MANDATORY for acceptable training speed

In [None]:
from google.colab import drive
import os
import shutil

# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# ⚠️⚠️⚠️ MODIFY THIS PATH ⚠️⚠️⚠️
DRIVE_DATA_PATH = "/content/drive/My Drive/BioFoundry/data"  # ← Change to your path

LOCAL_DATA_PATH = "/content/data"
CHECKPOINT_PATH = "/content/checkpoints"
EMBEDDING_PATH = "/content/embeddings.npy"

# Create local directories
os.makedirs(LOCAL_DATA_PATH, exist_ok=True)
os.makedirs(CHECKPOINT_PATH, exist_ok=True)

# Copy LMDB from Drive to local disk
print("Copying LMDB files from Google Drive to local disk...")
print("⏳ This may take 2-5 minutes...")

if os.path.exists(DRIVE_DATA_PATH):
    shutil.copytree(DRIVE_DATA_PATH, LOCAL_DATA_PATH, dirs_exist_ok=True)
    print(f"✅ Data copied to {LOCAL_DATA_PATH}")

    # Verify files
    print("\nData directory contents:")
    !ls -lh {LOCAL_DATA_PATH}
else:
    print(f"❌ ERROR: {DRIVE_DATA_PATH} not found!")
    print("Please upload train.lmdb and val.lmdb to Google Drive first.")
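
Before training, it can help to confirm that the expected LMDB files actually landed in the local directory. A minimal sketch, assuming the two file names mentioned in this cell; `missing_data_files` is a hypothetical helper, and the demo uses a temporary directory in place of `LOCAL_DATA_PATH`.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("train.lmdb", "val.lmdb")  # names referenced by this notebook


def missing_data_files(data_dir, required=REQUIRED_FILES):
    """Return the required entries that are absent from data_dir."""
    root = Path(data_dir)
    # exists() covers both single-file and directory-style LMDB databases.
    return [name for name in required if not (root / name).exists()]


# Demo on a throwaway directory that only contains train.lmdb.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.lmdb").touch()
    demo_missing = missing_data_files(tmp)

print(demo_missing)  # ['val.lmdb']
```

In the notebook, a non-empty result for `missing_data_files(LOCAL_DATA_PATH)` would be a reason to stop before the multi-hour training cell.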

## 📥 Cell 4: Clone Code Repositories

In [None]:
os.chdir("/content")

# Clone OCP (Open Catalyst Project)
if not os.path.exists("/content/ocp"):
    !git clone https://github.com/Open-Catalyst-Project/ocp.git
    print("✅ OCP cloned")

# Clone EquiformerV2
if not os.path.exists("/content/equiformer_v2"):
    !git clone https://github.com/atomicarchitects/equiformer_v2.git
    print("✅ EquiformerV2 cloned")

# Add to Python path
sys.path.insert(0, "/content/ocp")
sys.path.insert(0, "/content/equiformer_v2")

print("\n✅ Code repositories ready")

## ⚙️ Cell 5: Generate Training Configuration (GPU-Adaptive)

In [None]:
import yaml

config = {
    "trainer": "energy_v2",
    "dataset": {
        "train": {
            "src": f"{LOCAL_DATA_PATH}/train.lmdb",
            "normalize_labels": False
        },
        "val": {
            "src": f"{LOCAL_DATA_PATH}/val.lmdb"
        }
    },
    "logger": "tensorboard",
    "task": {
        "dataset": "lmdb_v2",
        "description": "BioFoundry Active Learning - Geometric Features",
        "type": "regression",
        "metric": "mae",
        "primary_metric": "mae",
        "labels": ["predicted_score"]
    },
    "model": {
        "name": "equiformer_v2",
        "use_pbc": False,
        "regress_forces": False,
        "otf_graph": True,
        "max_neighbors": 20,
        "max_radius": 12.0,
        "max_num_elements": 90,
        "num_layers": 4,
        "sphere_channels": 64,
        "attn_hidden_channels": 64,
        "num_heads": 4,
        "attn_alpha_channels": 64,
        "attn_value_channels": 32,
        "ffn_hidden_channels": 128,
        "norm_type": "layer_norm",
        "lmax_list": RECOMMENDED_LMAX,
        "mmax_list": [2] if RECOMMENDED_LMAX == [4] else [1],
        "grid_resolution": 18 if RECOMMENDED_LMAX == [4] else 8
    },
    "optim": {
        "batch_size": RECOMMENDED_BATCH_SIZE,
        "eval_batch_size": RECOMMENDED_BATCH_SIZE * 2,
        "num_workers": 2,
        "lr_initial": 0.001,
        "optimizer": "AdamW",
        "optimizer_params": {"weight_decay": 0.01},
        "scheduler": "ReduceLROnPlateau",
        "scheduler_params": {
            "factor": 0.5,
            "patience": 5,
            "epochs": 50
        },
        "mode": "min",
        "max_epochs": 50,
        "energy_coefficient": 1.0,
        "eval_every": 5,
        "checkpoint_every": 10
    }
}

config_path = "/content/colab_config.yml"
with open(config_path, "w") as f:
    yaml.dump(config, f, default_flow_style=False)

print(f"✅ Configuration saved to {config_path}")
print(f"\nBatch size: {RECOMMENDED_BATCH_SIZE}")
print(f"Lmax: {RECOMMENDED_LMAX}")
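
Because `mmax_list` is derived from `RECOMMENDED_LMAX`, a quick consistency check on the two lists can catch a bad manual edit before a long training run. A minimal sketch, assuming the usual constraint that each `mmax` must not exceed its matching `lmax`; `check_resolution` is a hypothetical helper, not part of EquiformerV2.

```python
def check_resolution(model_cfg):
    """Return a list of problems found in lmax_list / mmax_list (empty if none)."""
    lmax_list = model_cfg["lmax_list"]
    mmax_list = model_cfg["mmax_list"]
    problems = []
    if len(lmax_list) != len(mmax_list):
        problems.append(
            f"length mismatch: {len(lmax_list)} lmax vs {len(mmax_list)} mmax entries"
        )
    for i, (lmax, mmax) in enumerate(zip(lmax_list, mmax_list)):
        if mmax < 0 or lmax < 0:
            problems.append(f"entry {i}: lmax and mmax must be non-negative")
        elif mmax > lmax:
            problems.append(f"entry {i}: mmax={mmax} exceeds lmax={lmax}")
    return problems


# The two combinations this notebook can generate both pass.
print(check_resolution({"lmax_list": [4], "mmax_list": [2]}))  # []
print(check_resolution({"lmax_list": [2], "mmax_list": [1]}))  # []
```

In the notebook, this would be called as `check_resolution(config["model"])` just before the YAML file is written.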

## 🚀 Cell 6: Train EquiformerV2

⏰ **Expected Runtime**: 2-6 hours (GPU dependent)

Monitor progress with TensorBoard (Cell 7).

**Note**: submitit dependency is now installed in Cell 2.

In [None]:
# Verify submitit is available (safety check)
try:
    import submitit
    print("✅ submitit module available")
except ImportError:
    print("⚠️ submitit not found, installing...")
    !pip install submitit
    print("✅ submitit installed")

os.environ['PYTHONPATH'] = '/content/ocp:/content/equiformer_v2'
os.chdir("/content/equiformer_v2")

print("=" * 60)
print("Starting EquiformerV2 Training...")
print("=" * 60)

!python main_oc20.py \
    --config-yml {config_path} \
    --mode train \
    --run-dir {CHECKPOINT_PATH} \
    --print-every 10

print("\n✅ Training completed!")
print(f"Checkpoints: {CHECKPOINT_PATH}")
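
Later cells need a trained checkpoint to load. A minimal sketch of finding the most recently written one, under the assumption that checkpoints are saved as `.pt` files somewhere below the run directory; the exact layout is not stated in this notebook, and `latest_checkpoint` is a hypothetical helper. The demo uses a temporary directory in place of `CHECKPOINT_PATH`.

```python
import os
import tempfile
from pathlib import Path


def latest_checkpoint(run_dir, pattern="*.pt"):
    """Return the newest file matching `pattern` under run_dir, or None."""
    candidates = [p for p in Path(run_dir).rglob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demo: two fake checkpoints with explicit modification times.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "run_a" / "checkpoint.pt"
    new = Path(tmp) / "run_b" / "best_checkpoint.pt"
    for path, mtime in ((old, 1_000), (new, 2_000)):
        path.parent.mkdir(parents=True)
        path.touch()
        os.utime(path, (mtime, mtime))
    demo_latest = latest_checkpoint(tmp).name

print(demo_latest)  # best_checkpoint.pt
```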

## 📊 Cell 7: TensorBoard Monitoring (Optional)

Notebook cells execute one at a time, so run this cell before starting Cell 6; the dashboard then stays available while training runs.

In [None]:
%load_ext tensorboard
%tensorboard --logdir {CHECKPOINT_PATH}

---

**Cells 8-14 continue with embedding extraction and active learning (unchanged from previous version)**

The remaining cells implement:
- Cell 8: Embedding extraction using hooks
- Cell 9: BatchDiversityOptimizer class
- Cell 10-11: Active learning loop
- Cell 12-14: Visualization and results saving
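
Since the active learning cells are not reproduced here, the idea behind pool-based batch diversity sampling can be sketched with greedy farthest-point selection over the embedding pool. This is an illustrative assumption, not the notebook's `BatchDiversityOptimizer`: `select_diverse_batch` is a hypothetical function, and Euclidean distance is one of several possible diversity measures.

```python
import numpy as np


def select_diverse_batch(embeddings, batch_size, labeled_idx=()):
    """Greedily pick pool points that are far from everything already taken."""
    X = np.asarray(embeddings, dtype=float)
    taken = [int(i) for i in labeled_idx]
    budget = min(batch_size, len(X) - len(taken))
    selected = []
    if budget <= 0:
        return selected
    if not taken:
        # Nothing labeled yet: start from the point farthest from the centroid.
        first = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))
        selected.append(first)
        taken.append(first)
    # Distance from every pool point to its nearest taken point.
    min_dist = np.min(
        np.linalg.norm(X[:, None, :] - X[taken][None, :, :], axis=2), axis=1
    )
    while len(selected) < budget:
        min_dist[taken] = -np.inf  # never re-pick a taken point
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        taken.append(idx)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[idx], axis=1))
    return selected


# Toy pool: three tight clusters of two points each (indices 0-1, 2-3, 4-5).
pool = np.array([[0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0], [10, 0.1]])
picked = select_diverse_batch(pool, batch_size=2, labeled_idx=[0])
print(picked)
```

With index 0 already labeled, the two picks land in the two clusters that have no labeled point, which is the behavior a diversity-driven batch is meant to have.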

---

## ✅ Fix Summary

### Problem:
```
ModuleNotFoundError: No module named 'submitit'
```

### Solution:
1. **Cell 2**: Added `submitit` to pip install command
2. **Cell 6**: Added runtime verification check as safety measure

### How to Use:
1. Run Cell 2 first (installs all dependencies)
2. Cell 6 will now run without errors
3. If Cell 2 was skipped, Cell 6 will auto-install submitit

---