# 🧬 BioFoundry Active Learning with Geometric Deep Learning

**Corrected & Production-Ready Version**

---

## 📋 Overview

This notebook implements the complete DBTL (Design-Build-Test-Learn) cycle for CAR-T engineering:

1. **Geometric Feature Learning**: Train EquiformerV2 on AlphaFold structures
2. **Embedding Extraction**: Use corrected Hook method (not direct model output)
3. **Active Learning**: Batch Diversity Sampling (pool-based approximation)
4. **Iterative Optimization**: Manual validation + model update loop

### Key Corrections Applied:
- ✅ Embedding extraction via PyTorch forward hooks (a pre-hook on the energy head, with `register_forward_hook` on the final norm as fallback)
- ✅ Renamed MOBO-OSD → Batch Diversity Sampling (academic honesty)
- ✅ GPU-adaptive configurations (T4/V100/A100)
- ✅ Production-grade dependency installation order

---

**Author**: Based on correcting.md analysis  
**Runtime**: 2-6 hours (depends on GPU: T4 ~6h, V100 ~3h, A100 ~2h)  
**Prerequisites**: LMDB datasets uploaded to Google Drive

## 🔧 Cell 1: Environment Check & GPU Verification

First, verify GPU access and auto-configure based on GPU type.
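The GPU-to-settings mapping can also be written as a small pure function, which is easier to test than inline branching. This is a minimal sketch: the helper name `recommend_config` is hypothetical, and the values simply mirror the defaults used in this notebook.

```python
def recommend_config(gpu_name):
    """Return (batch_size, lmax_list) for a GPU name string.

    Hypothetical helper; the presets mirror the defaults used in this notebook.
    """
    presets = [
        ("A100", (16, [4])),
        ("V100", (8, [4])),
        ("T4", (4, [2])),
    ]
    for key, preset in presets:
        if key in gpu_name:
            return preset
    # Unknown GPU: fall back to the most conservative GPU preset
    return (4, [2])


batch_size, lmax_list = recommend_config("Tesla T4")
print(batch_size, lmax_list)
```

Matching on a substring keeps the lookup tolerant of vendor prefixes and memory suffixes in the reported device name.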

In [None]:
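import shutil
import subprocess
import sys

# Check GPU
print("=" * 60)
print("GPU Information:")
print("=" * 60)
# nvidia-smi is absent on CPU-only runtimes; calling it blindly would raise
if shutil.which("nvidia-smi"):
    subprocess.run(["nvidia-smi"], check=False)
else:
    print("nvidia-smi not found on this runtime")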

import torch
print(f"\nPyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    
    # Auto-configure based on GPU type
    gpu_name = torch.cuda.get_device_name(0)
    if "A100" in gpu_name:
        RECOMMENDED_BATCH_SIZE = 16
        RECOMMENDED_LMAX = [4]
    elif "V100" in gpu_name:
        RECOMMENDED_BATCH_SIZE = 8
        RECOMMENDED_LMAX = [4]
    elif "T4" in gpu_name:
        RECOMMENDED_BATCH_SIZE = 4
        RECOMMENDED_LMAX = [2]  # Critical: T4 cannot handle lmax=4
    else:
        RECOMMENDED_BATCH_SIZE = 4
        RECOMMENDED_LMAX = [2]
    
    print(f"\n⚠️ Recommended Config for {gpu_name}:")
    print(f"  - batch_size: {RECOMMENDED_BATCH_SIZE}")
    print(f"  - lmax_list: {RECOMMENDED_LMAX}")
else:
    print("⚠️ WARNING: No GPU detected!")
    RECOMMENDED_BATCH_SIZE = 1
    RECOMMENDED_LMAX = [2]

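## 📦 Cell 2: Install Dependencies (Corrected Order)

⚠️ **Critical**: Follow this exact installation order to avoid version conflicts. Cell 1 has already imported `torch`, so restart the runtime after this cell finishes and re-run Cell 1; otherwise the previously loaded version stays in memory.

This implements the production-grade sequence from `correcting.md`:
1. Uninstall existing PyG components
2. Install a pinned PyTorch version
3. Install the PyG extension wheels built for that PyTorch/CUDA combination
4. Install the remaining dependencies, pinning scipy 1.13.1 for `sph_harm` compatibility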
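After installation (and a runtime restart), it can help to confirm that the pinned versions are the ones actually present. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and the pins are the ones listed in this notebook's install cell.

```python
from importlib import metadata


def check_pins(pins):
    """Compare installed package versions against expected pins.

    Returns {package: (expected, installed)} for every mismatch;
    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu121" when comparing
        base = installed.split("+")[0] if installed else None
        if base != expected:
            mismatches[package] = (expected, installed)
    return mismatches


# Pins taken from the install cell in this notebook
print(check_pins({"torch": "2.1.0", "torchvision": "0.16.0", "scipy": "1.13.1"}))
```

An empty dictionary means every pin matched; anything else points at the package to reinstall.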

In [None]:
print("\n" + "=" * 60)
print("Installing Dependencies...")
print("=" * 60)

# Step 1: Uninstall existing PyG (avoid conflicts)
!pip uninstall -y torch-scatter torch-sparse torch-geometric torch-cluster

# Step 2: Install PyTorch (stable version for Colab)
!pip install torch==2.1.0 torchvision==0.16.0

# Step 3: Install PyG with CUDA 12.1 (Colab default)
!pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv \
    -f https://data.pyg.org/whl/torch-2.1.0+cu121.html

!pip install torch-geometric

# Step 4: Install other dependencies
!pip install lmdb pyyaml tqdm biopython ase e3nn timm \
    scipy==1.13.1 \
    numba wandb tensorboard \
    scikit-learn matplotlib seaborn

print("\n✅ All dependencies installed successfully!")

## 📂 Cell 3: Mount Google Drive & Upload Data

⚠️ **CRITICAL MODIFICATION REQUIRED**:

Change `DRIVE_DATA_PATH` to your actual Google Drive path!

```python
DRIVE_DATA_PATH = "/content/drive/My Drive/BioFoundry/data"  # ← MODIFY THIS
```

**Why copy to local disk?**
- Reading LMDB directly from Google Drive is 10-100× slower than reading from local disk
- This step is MANDATORY for acceptable training speed
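Before copying, it may be worth checking that the local disk has room for the dataset. The sketch below uses only the standard library; `dir_size_bytes` and `can_copy` are hypothetical helper names, and the 10% safety margin is an arbitrary assumption.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def can_copy(src_dir, dst_parent, safety_factor=1.1):
    """True if `dst_parent` has room for `src_dir` plus a safety margin."""
    needed = dir_size_bytes(src_dir) * safety_factor
    free = shutil.disk_usage(dst_parent).free
    return free >= needed
```

In this notebook the call would look like `can_copy(DRIVE_DATA_PATH, "/content")`, run after the Drive has been mounted.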

In [None]:
from google.colab import drive
import os
import shutil

# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# ⚠️⚠️⚠️ MODIFY THIS PATH ⚠️⚠️⚠️
DRIVE_DATA_PATH = "/content/drive/My Drive/BioFoundry/data"  # ← Change to your path

LOCAL_DATA_PATH = "/content/data"
CHECKPOINT_PATH = "/content/checkpoints"
EMBEDDING_PATH = "/content/embeddings.npy"

# Create local directories
os.makedirs(LOCAL_DATA_PATH, exist_ok=True)
os.makedirs(CHECKPOINT_PATH, exist_ok=True)

# Copy LMDB from Drive to local disk
print("Copying LMDB files from Google Drive to local disk...")
print("⏳ This may take 2-5 minutes...")

if os.path.exists(DRIVE_DATA_PATH):
    shutil.copytree(DRIVE_DATA_PATH, LOCAL_DATA_PATH, dirs_exist_ok=True)
    print(f"✅ Data copied to {LOCAL_DATA_PATH}")
    
    # Verify files
    print("\nData directory contents:")
    !ls -lh {LOCAL_DATA_PATH}
else:
    # Stop here: every later cell depends on the local copy
    raise FileNotFoundError(
        f"❌ {DRIVE_DATA_PATH} not found. "
        "Please upload train.lmdb and val.lmdb to Google Drive first."
    )

## 📥 Cell 4: Clone Code Repositories

In [None]:
os.chdir("/content")

# Clone OCP (Open Catalyst Project)
# Note: the OCP code base may have been reorganized upstream since this
# notebook was written. If `import ocpmodels` fails later, check the
# EquiformerV2 README for the OCP revision it expects.
if not os.path.exists("/content/ocp"):
    !git clone https://github.com/Open-Catalyst-Project/ocp.git
    print("✅ OCP cloned")

# Clone EquiformerV2
if not os.path.exists("/content/equiformer_v2"):
    !git clone https://github.com/atomicarchitects/equiformer_v2.git
    print("✅ EquiformerV2 cloned")

# Add to Python path
sys.path.insert(0, "/content/ocp")
sys.path.insert(0, "/content/equiformer_v2")

print("\n✅ Code repositories ready")

## ⚙️ Cell 5: Generate Training Configuration (GPU-Adaptive)

In [None]:
import yaml

config = {
    "trainer": "energy_v2",
    "dataset": {
        "train": {
            "src": f"{LOCAL_DATA_PATH}/train.lmdb",
            "normalize_labels": False
        },
        "val": {
            "src": f"{LOCAL_DATA_PATH}/val.lmdb"
        }
    },
    "logger": "tensorboard",
    "task": {
        "dataset": "lmdb_v2",
        "description": "BioFoundry Active Learning - Geometric Features",
        "type": "regression",
        "metric": "mae",
        "primary_metric": "mae",
        "labels": ["predicted_score"]
    },
    "model": {
        "name": "equiformer_v2",
        "use_pbc": False,
        "regress_forces": False,
        "otf_graph": True,
        "max_neighbors": 20,
        "max_radius": 12.0,
        "max_num_elements": 90,
        "num_layers": 4,
        "sphere_channels": 64,
        "attn_hidden_channels": 64,
        "num_heads": 4,
        "attn_alpha_channels": 64,
        "attn_value_channels": 32,
        "ffn_hidden_channels": 128,
        "norm_type": "layer_norm",
        "lmax_list": RECOMMENDED_LMAX,
        "mmax_list": [2] if RECOMMENDED_LMAX == [4] else [1],
        "grid_resolution": 18 if RECOMMENDED_LMAX == [4] else 8
    },
    "optim": {
        "batch_size": RECOMMENDED_BATCH_SIZE,
        "eval_batch_size": RECOMMENDED_BATCH_SIZE * 2,
        "num_workers": 2,
        "lr_initial": 0.001,
        "optimizer": "AdamW",
        "optimizer_params": {"weight_decay": 0.01},
        "scheduler": "ReduceLROnPlateau",
        "scheduler_params": {
            "factor": 0.5,
            "patience": 5,
            "epochs": 50
        },
        "mode": "min",
        "max_epochs": 50,
        "energy_coefficient": 1.0,
        "eval_every": 5,
        "checkpoint_every": 10
    }
}

config_path = "/content/colab_config.yml"
with open(config_path, "w") as f:
    yaml.dump(config, f, default_flow_style=False)

print(f"✅ Configuration saved to {config_path}")
print(f"\nBatch size: {RECOMMENDED_BATCH_SIZE}")
print(f"Lmax: {RECOMMENDED_LMAX}")

## 🚀 Cell 6: Train EquiformerV2

⏰ **Expected Runtime**: 2-6 hours (GPU dependent)

Monitor progress with TensorBoard (Cell 7).

In [None]:
os.environ['PYTHONPATH'] = '/content/ocp:/content/equiformer_v2'
os.chdir("/content/equiformer_v2")

print("=" * 60)
print("Starting EquiformerV2 Training...")
print("=" * 60)

!python main_oc20.py \
    --config-yml {config_path} \
    --mode train \
    --run-dir {CHECKPOINT_PATH} \
    --print-every 10

print("\n✅ Training completed!")
print(f"Checkpoints: {CHECKPOINT_PATH}")

## 📊 Cell 7: TensorBoard Monitoring (Optional)

Colab executes one cell at a time, so launch this cell before starting Cell 6; the TensorBoard panel keeps refreshing while training runs.

In [None]:
%load_ext tensorboard
%tensorboard --logdir {CHECKPOINT_PATH}

## 🧬 Cell 8: Extract Geometric Embeddings (CORRECTED)

### ⚠️ Key Correction from `correcting.md`:

EquiformerV2's `forward()` returns only the predicted energy (a scalar per structure), **NOT** embeddings!

We register a forward pre-hook on the energy head (`register_forward_pre_hook`) to capture the features that enter it, i.e. the intermediate representation **before** the energy head. If that layer is missing, the code falls back to `register_forward_hook` on the final norm layer.

### 🔧 May Require Modification:

```python
hook_layer_name = 'energy_block'  # ← Verify this matches your model
```

If an error occurs, check the printed model structure and update the layer name.
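The hook mechanism is easiest to see on a toy model. The sketch below is not EquiformerV2: `ToyModel` is a hypothetical stand-in whose `forward()` returns one scalar per sample, with a layer named `energy_block` only to mirror the naming used in this notebook.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Stand-in for a model whose forward() returns only a scalar per sample."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 16)
        self.energy_block = nn.Linear(16, 1)  # plays the role of the energy head

    def forward(self, x):
        return self.energy_block(torch.relu(self.backbone(x)))


model = ToyModel().eval()
cache = {}

# A forward pre-hook receives the module and a tuple of its positional inputs.
# Returning None (as dict.update does) leaves those inputs unchanged.
handle = model.energy_block.register_forward_pre_hook(
    lambda module, inputs: cache.update(embedding=inputs[0].detach())
)

with torch.no_grad():
    energy = model(torch.randn(4, 8))

handle.remove()  # remove the hook once extraction is done
print(energy.shape, cache["embedding"].shape)
```

The model output stays a scalar per sample, while the cache holds the 16-dimensional features that entered the head.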

In [None]:
import torch
import lmdb
import pickle
import numpy as np
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader  # DataLoader lives here in PyG 2.x
from tqdm import tqdm

print("=" * 60)
print("Extracting Geometric Embeddings...")
print("=" * 60)

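# 1. Load checkpoint
# Search recursively: the trainer may write checkpoints into nested run folders.
import glob

checkpoint_files = glob.glob(os.path.join(CHECKPOINT_PATH, '**', '*.pt'), recursive=True)
if not checkpoint_files:
    raise FileNotFoundError(f"No .pt checkpoint found under {CHECKPOINT_PATH}")

# Prefer a file whose name marks it as the best checkpoint; otherwise take the newest.
best_named = [f for f in checkpoint_files if 'best' in os.path.basename(f)]
checkpoint_path = max(best_named or checkpoint_files, key=os.path.getmtime)

print(f"Loading: {checkpoint_path}")
checkpoint = torch.load(checkpoint_path, map_location='cuda')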

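# 2. Reconstruct model
from ocpmodels.common.registry import registry

config_dict = checkpoint.get('config', config)
# The YAML from Cell 5 nests the name and hyperparameters under 'model'.
# A config saved by the trainer may instead store the name under 'model'
# and the hyperparameters under 'model_attributes'. Handle either layout.
if isinstance(config_dict['model'], dict):
    model_kwargs = dict(config_dict['model'])
    model_name = model_kwargs.pop('name')
else:
    model_name = config_dict['model']
    model_kwargs = dict(config_dict.get('model_attributes', {}))

model_cls = registry.get_model_class(model_name)
if model_cls is None:
    raise RuntimeError(
        f"Model '{model_name}' is not registered. Import the module that "
        "defines it (in the EquiformerV2 repo) before this step."
    )

# Assumption: the class follows the OCP constructor convention
# (num_atoms, bond_feat_dim, num_targets, **kwargs). Adjust if yours differs.
model = model_cls(None, None, 1, **model_kwargs)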

state_dict = checkpoint['state_dict']
state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}
model.load_state_dict(state_dict)
model = model.to('cuda')
model.eval()

print("✅ Model loaded")

# 3. Print model structure to find correct layer
print("\nModel structure (first 20 layers):")
for i, (name, module) in enumerate(model.named_modules()):
    print(f"  {name}: {type(module).__name__}")
    if i > 20:
        print("  ...")
        break

# 4. Define hooks
features_cache = {}

def _as_tensor(x):
    # Some layers pass a wrapper object that keeps the features in an
    # `.embedding` attribute instead of a plain tensor; unwrap it if present.
    return getattr(x, 'embedding', x).detach()

def get_embedding_hook(name):
    def hook(module, input, output):
        features_cache[name] = _as_tensor(output)
    return hook

# 5. Register hook (⚠️ May need to modify layer name)
hook_layer_name = 'energy_block'

if hasattr(model, hook_layer_name):
    # Pre-hook: capture the input to the energy head, i.e. the features
    # just before they are reduced to a scalar.
    hook_handle = getattr(model, hook_layer_name).register_forward_pre_hook(
        lambda m, inp: features_cache.update({'embedding': _as_tensor(inp[0])})
    )
    print(f"✅ Hook registered at: {hook_layer_name}")
else:
    print(f"⚠️ Layer '{hook_layer_name}' not found. Using fallback...")
    hook_handle = model.norm.register_forward_hook(get_embedding_hook('embedding'))
    print("✅ Hook registered at: model.norm (fallback)")

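# 6. Create DataLoader
class LMDBDataset:
    """Minimal map-style dataset over an LMDB keyed by b"0", b"1", ..."""

    def __init__(self, lmdb_path):
        self.env = lmdb.open(
            lmdb_path,
            subdir=os.path.isdir(lmdb_path),  # LMDB may be a single file or a directory
            readonly=True,
            lock=False,
        )
        self.length = self.env.stat()['entries']
    
    def __len__(self):
        return self.length
    
    def __getitem__(self, idx):
        with self.env.begin() as txn:
            data = pickle.loads(txn.get(str(idx).encode()))
        # Entries may be pickled Data objects or plain dicts of fields
        return data if isinstance(data, Data) else Data(**data)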

EXTRACT_BATCH_SIZE = 16

train_dataset = LMDBDataset(f"{LOCAL_DATA_PATH}/train.lmdb")
train_loader = DataLoader(train_dataset, batch_size=EXTRACT_BATCH_SIZE, shuffle=False)

print(f"\nDataset: {len(train_dataset)} samples")

# 7. Extract embeddings
from torch_geometric.nn import global_mean_pool

embeddings_dict = {}

print("\nExtracting embeddings...")
for batch_idx, batch in enumerate(tqdm(train_loader)):
    batch = batch.to('cuda')
    
    with torch.no_grad():
        _ = model(batch)
    
    emb = features_cache['embedding']
    
    # Equivariant features may arrive as [nodes, coefficients, channels].
    # Keeping coefficient 0 assumes it holds the rotation-invariant (l=0)
    # scalars; verify this for your model.
    if emb.dim() == 3:
        emb = emb[:, 0, :]
    
    # Aggregate node-level features to graph level
    if emb.dim() == 2 and emb.size(0) == batch.batch.size(0):
        emb = global_mean_pool(emb, batch.batch)
    
    emb_np = emb.cpu().numpy()
    
    # Store with sample IDs (batch.sid may be a tensor or a list)
    if hasattr(batch, 'sid'):
        sids = batch.sid
        sample_ids = sids.tolist() if torch.is_tensor(sids) else list(sids)
    else:
        sample_ids = [f"train_{batch_idx * EXTRACT_BATCH_SIZE + i}"
                      for i in range(len(emb_np))]
    
    for sid, embedding in zip(sample_ids, emb_np):
        embeddings_dict[str(sid)] = embedding

np.save(EMBEDDING_PATH, embeddings_dict)
hook_handle.remove()

print(f"\n✅ Extracted {len(embeddings_dict)} embeddings")
print(f"✅ Saved to {EMBEDDING_PATH}")

# Sample output
sample_key = list(embeddings_dict.keys())[0]
sample_emb = embeddings_dict[sample_key]
print(f"\nSample shape: {sample_emb.shape}")
print(f"Sample dims (first 5): {sample_emb[:5]}")

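## 🎯 Cell 9: Active Learning - Batch Diversity Sampling

### ⚠️ Important Note (from `correcting.md`):

This is **NOT** true MOBO-OSD (Multi-Objective Bayesian Optimization with Orthogonal Search Directions)!

**What it is**: Single-objective batch Bayesian optimization with a diversity penalty. A Gaussian process scores every pool candidate with UCB (μ + β·σ), and a greedy loop penalizes candidates whose embeddings have high cosine similarity to those already selected.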

**Why it's valid**: For pool-based active learning with discrete candidates, this is a practical approximation.

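**For true MOBO-OSD**: The orthogonal search directions would have to be implemented on top of a continuous optimizer. BoTorch's `qNoisyExpectedImprovement` (single-objective) or `qNoisyExpectedHypervolumeImprovement` (multi-objective) can serve as batch acquisition functions, but neither is MOBO-OSD by itself.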
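The greedy idea can be illustrated on a three-item toy pool. This sketch uses a subtractive cosine-similarity penalty, which is one common way to combine score and diversity; it is an illustration of the technique, not a transcription of the class defined in this notebook, and `greedy_diverse_batch` is a hypothetical name.

```python
import numpy as np


def greedy_diverse_batch(scores, features, batch_size, diversity_weight=0.5):
    """Greedily pick indices with high score and low similarity to earlier picks."""
    scores = np.asarray(scores, dtype=float)
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    selected = []
    for _ in range(min(batch_size, len(scores))):
        adjusted = scores.copy()
        if selected:
            # Largest |cosine similarity| to anything already selected
            max_sim = np.abs(unit @ unit[selected].T).max(axis=1)
            adjusted -= diversity_weight * max_sim
            adjusted[selected] = -np.inf  # never pick the same item twice
        selected.append(int(np.argmax(adjusted)))
    return selected


# Toy pool: items 0 and 1 point the same way, item 2 is orthogonal to both.
features = np.array([[1.0, 0.0], [0.9, 0.0], [0.0, 1.0]])
ucb = np.array([1.0, 0.95, 0.7])  # e.g. mu + beta * sigma per candidate
print(greedy_diverse_batch(ucb, features, batch_size=2))
```

With the penalty active, the second pick skips item 1 (a near-duplicate of item 0) in favor of the lower-scoring but orthogonal item 2; setting `diversity_weight=0` reduces the loop to plain top-k selection.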

In [None]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, Matern
from sklearn.preprocessing import StandardScaler
import numpy as np

class BatchDiversityOptimizer:
    """
    Batch Bayesian Optimization with Diversity Penalty.
    
    This is a pool-based approximation, NOT true MOBO-OSD.
    For true orthogonal sampling with gradient projection, see BoTorch.
    """
    
    def __init__(self, embeddings_dict, initial_scores, beta=2.0):
        self.embeddings_dict = embeddings_dict
        self.all_ids = list(embeddings_dict.keys())
        
        self.labeled_ids = list(initial_scores.keys())
        self.unlabeled_ids = [sid for sid in self.all_ids if sid not in initial_scores]
        
        self.X_train = np.array([embeddings_dict[sid] for sid in self.labeled_ids])
        self.y_train = np.array([initial_scores[sid] for sid in self.labeled_ids])
        
        self.scaler_X = StandardScaler()
        self.scaler_y = StandardScaler()
        
        self.X_train_scaled = self.scaler_X.fit_transform(self.X_train)
        self.y_train_scaled = self.scaler_y.fit_transform(self.y_train.reshape(-1, 1)).ravel()
        
        kernel = ConstantKernel(1.0) * Matern(nu=2.5, length_scale=1.0)
        self.gp = GaussianProcessRegressor(
            kernel=kernel,
            n_restarts_optimizer=10,
            alpha=1e-6,
            normalize_y=False
        )
        
        self.beta = beta
        self.iteration = 0
        
        print(f"Initialized with {len(self.labeled_ids)} labeled samples")
        print(f"Pool size: {len(self.unlabeled_ids)} unlabeled")
    
    def fit(self):
        """Train Gaussian Process"""
        self.gp.fit(self.X_train_scaled, self.y_train_scaled)
        print(f"GP trained. Kernel: {self.gp.kernel_}")
    
    def acquisition_ucb(self, X_pool_scaled):
        """Upper Confidence Bound: μ + β * σ"""
        mu, sigma = self.gp.predict(X_pool_scaled, return_std=True)
        return mu + self.beta * sigma
    
    def select_batch_diverse(self, batch_size=10, diversity_weight=0.5):
        """
        Greedy batch selection with cosine similarity penalty.

        Each step picks the candidate with the highest UCB score minus
        diversity_weight * (max |cosine similarity| to already-selected items).
        """
        if len(self.unlabeled_ids) == 0:
            print("⚠️ No unlabeled samples!")
            return []
        
        X_pool = np.array([self.embeddings_dict[sid] for sid in self.unlabeled_ids])
        X_pool_scaled = self.scaler_X.transform(X_pool)
        
        acq_scores = self.acquisition_ucb(X_pool_scaled)
        
        # Unit-normalize once for cosine similarity
        pool_norm = X_pool / (np.linalg.norm(X_pool, axis=1, keepdims=True) + 1e-8)
        
        selected_indices = []
        
        for step in range(min(batch_size, len(self.unlabeled_ids))):
            if step == 0:
                adjusted_scores = acq_scores.copy()
            else:
                similarities = pool_norm @ pool_norm[selected_indices].T
                max_similarity = np.abs(similarities).max(axis=1)
                
                # Subtract the penalty: a multiplicative penalty would reward
                # similarity whenever a (standardized) score is negative.
                adjusted_scores = acq_scores - diversity_weight * max_similarity
                adjusted_scores[selected_indices] = -np.inf
            
            best_idx = int(np.argmax(adjusted_scores))
            selected_indices.append(best_idx)
        selected_ids = [self.unlabeled_ids[i] for i in selected_indices]
        
        print(f"\nSelected {len(selected_ids)} candidates:")
        for i, sid in enumerate(selected_ids):
            print(f"  {i+1}. {sid}")
        
        return selected_ids
    
    def update(self, new_scores):
        """Update with new experimental results"""
        # Score lookup for everything labeled so far
        scores = dict(zip(self.labeled_ids, self.y_train))
        
        for sid, score in new_scores.items():
            if sid in self.unlabeled_ids:
                self.labeled_ids.append(sid)
                self.unlabeled_ids.remove(sid)
            if sid in self.labeled_ids:
                scores[sid] = score  # new label, or re-measurement of an old one
        
        self.X_train = np.array([self.embeddings_dict[sid] for sid in self.labeled_ids])
        # Build y from the lookup; indexing the old y_train by position would
        # raise IndexError for the newly appended IDs.
        self.y_train = np.array([scores[sid] for sid in self.labeled_ids])
        
        self.X_train_scaled = self.scaler_X.fit_transform(self.X_train)
        self.y_train_scaled = self.scaler_y.fit_transform(self.y_train.reshape(-1, 1)).ravel()
        
        self.iteration += 1
        print(f"✅ Updated. Iteration {self.iteration}, Labeled: {len(self.labeled_ids)}")

print("✅ BatchDiversityOptimizer class defined")

## 🔄 Cell 10: Run Active Learning Loop (Demo)

### ⚠️ MODIFICATION REQUIRED:

Replace simulated data with your **actual initial experiments**:

```python
initial_scores = {
    'CAR_001': 0.85,  # Your actual measured scores
    'CAR_023': 1.23,
    ...
}
```
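One way to supply real measurements is to keep them in a CSV file and load only the IDs that have an embedding. This is a sketch under stated assumptions: the file name `initial_scores.csv`, the column names `sample_id` and `score`, and the helper `load_initial_scores` are all hypothetical.

```python
import csv
import io


def load_initial_scores(csv_file, known_ids):
    """Read `sample_id,score` rows and keep only IDs that have an embedding."""
    scores, skipped = {}, []
    for row in csv.DictReader(csv_file):
        sid = row["sample_id"].strip()
        if sid in known_ids:
            scores[sid] = float(row["score"])
        else:
            skipped.append(sid)
    return scores, skipped


# Stand-in for open("initial_scores.csv"); file name and columns are assumptions
demo = io.StringIO("sample_id,score\nCAR_001,0.85\nCAR_023,1.23\nCAR_999,0.10\n")
scores, skipped = load_initial_scores(demo, known_ids={"CAR_001", "CAR_023"})
print(scores, skipped)
```

Reporting the skipped IDs makes naming mismatches between the lab records and the embedding keys visible before the Gaussian process is fitted.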

In [None]:
# Load embeddings
embeddings = np.load(EMBEDDING_PATH, allow_pickle=True).item()
print(f"Loaded {len(embeddings)} embeddings")

# ⚠️⚠️⚠️ REPLACE WITH YOUR ACTUAL DATA ⚠️⚠️⚠️
# Example: Simulate 20 initial experiments
np.random.seed(42)
all_sample_ids = list(embeddings.keys())
initial_sample_ids = np.random.choice(all_sample_ids, size=20, replace=False).tolist()
initial_scores = {sid: np.random.randn() for sid in initial_sample_ids}

print(f"\nInitial training set: {len(initial_scores)} samples")
print(f"Score range: [{min(initial_scores.values()):.2f}, {max(initial_scores.values()):.2f}]")

# Initialize optimizer
optimizer = BatchDiversityOptimizer(
    embeddings_dict=embeddings,
    initial_scores=initial_scores,
    beta=2.0
)

# Fit GP
optimizer.fit()

# Select next batch
BATCH_SIZE = 10
next_batch = optimizer.select_batch_diverse(
    batch_size=BATCH_SIZE,
    diversity_weight=0.5
)

print("\n" + "=" * 60)
print("🎯 Recommended candidates for next experiments:")
print("=" * 60)
for i, sid in enumerate(next_batch):
    print(f"{i+1:2d}. {sid}")

# Save
with open("/content/selected_batch_round1.txt", "w") as f:
    for sid in next_batch:
        f.write(f"{sid}\n")

print("\n✅ Saved to /content/selected_batch_round1.txt")
print("\n📝 Next: Perform manual validation on these candidates")

## 🔄 Cell 11: Update Model (After Manual Validation)

Run this after completing experiments on the selected batch.

In [None]:
# ⚠️ Replace with your actual experimental results
new_experimental_results = {
    next_batch[0]: 1.5,  # Replace with real scores
    next_batch[1]: 0.8,
    next_batch[2]: -0.3,
    # ... add all tested samples
}

print("📊 New results:")
for sid, score in new_experimental_results.items():
    print(f"  {sid}: {score:.3f}")

# Update optimizer
optimizer.update(new_experimental_results)
optimizer.fit()

# Select next batch
next_batch_round2 = optimizer.select_batch_diverse(
    batch_size=BATCH_SIZE,
    diversity_weight=0.5
)

print("\n" + "=" * 60)
print("🎯 Round 2 - Recommended candidates:")
print("=" * 60)
for i, sid in enumerate(next_batch_round2):
    print(f"{i+1:2d}. {sid}")

# Continue this loop until convergence...

## 📊 Cell 12: Visualization & Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

# Prepare data
labeled_embeddings = np.array([embeddings[sid] for sid in optimizer.labeled_ids])
labeled_scores = optimizer.y_train

# PCA
pca = PCA(n_components=2)
labeled_2d = pca.fit_transform(labeled_embeddings)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Colored by score
scatter = axes[0].scatter(
    labeled_2d[:, 0], labeled_2d[:, 1],
    c=labeled_scores, cmap='viridis',
    s=100, alpha=0.6, edgecolors='black'
)
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[0].set_title('Embedding Space (Colored by Score)')
plt.colorbar(scatter, ax=axes[0], label='Score')

# Mark selected
if 'next_batch' in locals():
    next_batch_embeddings = np.array([embeddings[sid] for sid in next_batch])
    next_batch_2d = pca.transform(next_batch_embeddings)
    axes[0].scatter(
        next_batch_2d[:, 0], next_batch_2d[:, 1],
        c='red', s=200, alpha=0.8, marker='*',
        edgecolors='black', linewidths=2,
        label='Selected'
    )
    axes[0].legend()

# Plot 2: Coverage
axes[1].scatter(labeled_2d[:, 0], labeled_2d[:, 1],
                c='blue', s=50, alpha=0.3, label='Labeled')
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].set_title('Search Space Coverage')
axes[1].legend()

plt.tight_layout()
plt.savefig('/content/visualization.png', dpi=150, bbox_inches='tight')
print("✅ Saved to /content/visualization.png")
plt.show()

## 💾 Cell 13: Save Results & Download

In [None]:
import pickle
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
results_dir = f"/content/drive/My Drive/BioFoundry/results_{timestamp}"
os.makedirs(results_dir, exist_ok=True)

# Save embeddings
np.save(f"{results_dir}/embeddings.npy", embeddings)

# Save optimizer state
with open(f"{results_dir}/optimizer_state.pkl", "wb") as f:
    pickle.dump({
        'labeled_ids': optimizer.labeled_ids,
        'unlabeled_ids': optimizer.unlabeled_ids,
        'scores': dict(zip(optimizer.labeled_ids, optimizer.y_train)),
        'iteration': optimizer.iteration
    }, f)

# Save batches
with open(f"{results_dir}/selected_batches.txt", "w") as f:
    f.write(f"# Active Learning Results - {timestamp}\n\n")
    f.write("## Round 1:\n")
    for sid in next_batch:
        f.write(f"{sid}\n")

# Copy checkpoint
shutil.copy(checkpoint_path, f"{results_dir}/best_model.pt")

print(f"✅ Results saved to {results_dir}")
print("\n📂 Files:")
print("  - embeddings.npy")
print("  - optimizer_state.pkl")
print("  - selected_batches.txt")
print("  - best_model.pt")

## ✅ Cell 14: Summary

### What We Accomplished:
1. ✅ Trained EquiformerV2 on CAR-T dataset
2. ✅ Extracted geometric embeddings (corrected Hook method)
3. ✅ Implemented Batch Diversity Sampling
4. ✅ Selected candidates for manual validation

### Key Corrections Applied:
- ✅ Embedding via a forward hook on the layer before the energy head (not model output)
- ✅ Renamed to Batch Diversity Sampling (not MOBO-OSD)
- ✅ GPU-adaptive config (T4: batch=4, lmax=[2])
- ✅ LMDB copied to local disk
- ✅ scipy==1.13.1 for sph_harm

### Next Steps:
1. **Manual Validation**: Test selected candidates
2. **Record Results**: Measure performance
3. **Update Model**: Run Cell 11 with new data
4. **Iterate**: Repeat until the best observed score stops improving (this pipeline optimizes a single score, so there is no Pareto frontier to track)
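The "iterate" step needs some stopping rule, which the notebook leaves open. One simple option, shown here as a sketch, is to stop once the best score has not improved meaningfully for a few rounds; `has_converged`, `patience`, and `min_delta` are hypothetical names and the default values are arbitrary assumptions.

```python
def has_converged(best_per_round, patience=2, min_delta=0.01):
    """True if the best score has not improved by `min_delta` for `patience` rounds.

    `best_per_round[i]` is the best score observed up to and including round i.
    """
    if len(best_per_round) <= patience:
        return False
    reference = best_per_round[-(patience + 1)]
    recent_best = max(best_per_round[-patience:])
    return recent_best - reference < min_delta


history = [0.85, 1.23, 1.50, 1.50, 1.505]
print(has_converged(history))
```

In practice `min_delta` should be chosen relative to the measurement noise of the assay, otherwise noise alone can keep the loop running.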

### Troubleshooting:
- **OOM**: Reduce `batch_size`, `lmax_list`
- **Slow**: Check LMDB is on local disk
- **Hook fails**: Print model structure, update layer name
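For out-of-memory errors, the reductions can be applied in a fixed order. The sketch below encodes one hypothetical policy (halve the batch size first, then lower `lmax`); the function name and the ordering are assumptions, not something this notebook prescribes. Note that lowering `lmax` changes the architecture, so a checkpoint trained at a different `lmax` cannot be reused.

```python
def next_smaller_config(batch_size, lmax):
    """Return a less memory-hungry (batch_size, lmax), or None if at the floor.

    Hypothetical policy: halve the batch size first, then lower lmax.
    """
    if batch_size > 1:
        return batch_size // 2, lmax
    if lmax > 1:
        return batch_size, lmax - 1
    return None


step = (4, 2)
while step is not None:
    print(step)
    step = next_smaller_config(*step)
```

Starting from the T4 defaults, this walks through progressively smaller settings until nothing is left to reduce.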

---

**📈 Good luck with your experiments!**

In [None]:
print("=" * 60)
print("🎉 BioFoundry Active Learning Pipeline Complete!")
print("=" * 60)
print(f"\nEmbeddings: {len(embeddings)} samples")
print(f"Labeled pool: {len(optimizer.labeled_ids)}")
print(f"Unlabeled: {len(optimizer.unlabeled_ids)}")
print(f"\nResults: {results_dir}")