# Dr. Zero Biomedical Training on Google Colab

This notebook provides a pipeline for training Dr. Zero on biomedical literature (PubMed) using Google Colab Pro+ with an A100 GPU.

## Overview

**Training Pipeline:**
1. Setup environment and download PubMed corpus
2. Build PubMedBERT search index
3. Train Iteration 1 (Proposer + Solver)
4. Train Iteration 2 (with improved solver)
5. Train Iteration 3 (final models)
6. Evaluate on biomedical QA benchmarks

**Expected Runtime:** 30-40 hours on A100 GPU

**Requirements:**
- Google Colab Pro/Pro+ (for A100 GPU and long runtime)
- ~50 GB Google Drive storage
- Weights & Biases account (for logging)

## Before You Start

1. **Set runtime to A100 GPU:**
   - Runtime → Change runtime type → A100 GPU
2. **Get W&B API key:**
   - Sign up at wandb.ai
   - Get API key from wandb.ai/authorize
3. **Have your email ready** (required for NCBI PubMed API)

## Execution Instructions

Run cells in order. The notebook includes:
- ✅ Automatic checkpointing to Google Drive
- 🔄 Auto-resume from disconnections
- 📊 Progress monitoring
- 🛡️ Error handling and recovery

**Do NOT skip cells** - they build on each other.
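
The auto-resume behavior listed above depends on locating the most recent checkpoint after a disconnection. The following is a minimal sketch of that lookup, not the notebook's actual implementation. The `step_<N>` directory naming and the `find_latest_checkpoint` helper are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: one sub-directory per saved step, e.g. "step_150".
STEP_PATTERN = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(checkpoint_dir: Path) -> Optional[Path]:
    """Return the checkpoint sub-directory with the highest step number, or None."""
    if not checkpoint_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for child in checkpoint_dir.iterdir():
        match = STEP_PATTERN.match(child.name)
        # Compare step numbers as integers so that step_150 beats step_50.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in (50, 100, 150):
        (root / f"step_{step}").mkdir()
    latest_name = find_latest_checkpoint(root).name
    print(latest_name)
```

Comparing the step numbers as integers matters here: a plain string sort would rank `step_50` above `step_150`.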

Let's begin!

---
# Part 1: Environment Setup
---

In [None]:
# Cell 1: Mount Both Google Drives (Dual-Account Setup)

print("="*80)
print("CELL 1: Dual Google Drive Setup")
print("="*80)

from google.colab import drive
from pathlib import Path
import os

print("\n📚 You have two Google accounts:")
print("   Account A (Colab Pro): A100 GPU access, 15GB storage")
print("   Account B (Storage): 80GB storage")
print("\nWe'll mount BOTH drives to use Account A for compute")
print("and Account B for storage!")

# Mount Account A (Colab Pro) - for small files
print("\n" + "="*80)
print("STEP 1: Mount Account A (Colab Pro Account)")
print("="*80)
print("\n📌 Click the link below and authenticate with your COLAB PRO account")
print("   (The account you're using to run this notebook)\n")

drive.mount('/content/drive_pro', force_remount=False)
print("\n✅ Account A (Colab Pro) mounted successfully!")

# Mount Account B (80GB Storage) - for large files  
print("\n" + "="*80)
print("STEP 2: Mount Account B (80GB Storage Account)")
print("="*80)
print("\n📌 Click the link below and:")
print("   1. Click 'Use another account'")
print("   2. Sign in with your 80GB STORAGE account")
print("   3. Authorize access\n")

drive.mount('/content/drive_storage', force_remount=False)
print("\n✅ Account B (Storage) mounted successfully!")

# Verify both mounts
print("\n" + "="*80)
print("STEP 3: Verify Both Drives")
print("="*80)

try:
    pro_contents = os.listdir('/content/drive_pro/MyDrive')
    storage_contents = os.listdir('/content/drive_storage/MyDrive')
    
    print(f"\n✅ Pro Drive (Account A): {len(pro_contents)} items")
    print(f"   Path: /content/drive_pro/MyDrive/")
    print(f"   Sample: {pro_contents[:3] if pro_contents else '(empty)'}")
    
    print(f"\n✅ Storage Drive (Account B): {len(storage_contents)} items")
    print(f"   Path: /content/drive_storage/MyDrive/")
    print(f"   Sample: {storage_contents[:3] if storage_contents else '(empty)'}")
    
except Exception as e:
    print(f"\n❌ Error accessing drives: {e}")
    print("   Please re-run this cell and ensure both accounts are authenticated")
    raise

# Create directory structure on BOTH drives
print("\n" + "="*80)
print("STEP 4: Creating Directory Structure")
print("="*80)

# Account A (Pro) - Small files only
PRO_BASE = Path('/content/drive_pro/MyDrive/drzero_biomedical')
LOGS_DIR = PRO_BASE / 'logs'
CONFIG_DIR = PRO_BASE / 'configs'

for dir_path in [PRO_BASE, LOGS_DIR, CONFIG_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created (Pro Drive): {dir_path}")

# Account B (Storage) - Large files
STORAGE_BASE = Path('/content/drive_storage/MyDrive/drzero_biomedical')
CORPUS_DIR = STORAGE_BASE / 'corpus' / 'pubmed'
CHECKPOINT_DIR = STORAGE_BASE / 'checkpoints'
DATA_DIR = STORAGE_BASE / 'data' / 'biomedical'
OUTPUTS_DIR = STORAGE_BASE / 'outputs'

for dir_path in [STORAGE_BASE, CORPUS_DIR, CHECKPOINT_DIR, DATA_DIR, OUTPUTS_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created (Storage Drive): {dir_path}")

# Create local temp directories (fastest access during training)
LOCAL_BASE = Path('/content/drzero_local')
LOCAL_CHECKPOINT = LOCAL_BASE / 'checkpoints'
LOCAL_DATA = LOCAL_BASE / 'data'

for dir_path in [LOCAL_BASE, LOCAL_CHECKPOINT, LOCAL_DATA]:
    dir_path.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created (Local): {dir_path}")

# Print storage allocation summary
print("\n" + "="*80)
print("📊 STORAGE ALLOCATION SUMMARY")
print("="*80)

print("\n📁 Account A (Colab Pro - 15GB):")
print(f"   Logs: {LOGS_DIR}")
print(f"   Configs: {CONFIG_DIR}")
print(f"   Expected usage: <1 GB")

print("\n📁 Account B (Storage - 80GB):")
print(f"   Corpus: {CORPUS_DIR} (~10 GB)")
print(f"   Checkpoints: {CHECKPOINT_DIR} (~30 GB)")
print(f"   Data: {DATA_DIR} (~5 GB)")
print(f"   Outputs: {OUTPUTS_DIR} (~2 GB)")
print(f"   Expected usage: ~47 GB")

print("\n💡 Local temp (Colab VM - fast but not persistent):")
print(f"   {LOCAL_BASE}")
print(f"   Used for: Working files during training")

print("\n✅ Dual-drive setup complete!")
print("   Both accounts accessible in this session")
print("   Large files → Storage account")
print("   Small files → Pro account")
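
Cell 1 keeps working files on the local VM disk, which is fast but lost when the session ends, so anything worth keeping has to be copied to a mounted drive. A minimal sketch of such a copy step follows. The `sync_to_persistent` helper is hypothetical and is not part of the repository. It uses only the standard-library `shutil.copytree`.

```python
import shutil
import tempfile
from pathlib import Path


def sync_to_persistent(local_dir: Path, persistent_dir: Path) -> int:
    """Copy the tree under local_dir into persistent_dir.

    Returns the number of files present under persistent_dir afterwards.
    """
    # dirs_exist_ok=True (Python 3.8+) lets repeated syncs overwrite in place.
    shutil.copytree(local_dir, persistent_dir, dirs_exist_ok=True)
    return sum(1 for p in persistent_dir.rglob("*") if p.is_file())


# Demo with temporary directories standing in for the VM disk and the Drive mount
with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "local" / "checkpoints"
    drive = Path(tmp) / "drive" / "checkpoints"
    (local / "iter1").mkdir(parents=True)
    (local / "iter1" / "weights.bin").write_bytes(b"\x00" * 16)

    n_files = sync_to_persistent(local, drive)
    copied_ok = (drive / "iter1" / "weights.bin").exists()
    print(n_files, copied_ok)
```

In this notebook the call would look like `sync_to_persistent(LOCAL_CHECKPOINT, CHECKPOINT_DIR)`, using the paths defined in Cell 1.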

In [None]:
# Cell 2: Install Dependencies

print("="*80)
print("CELL 2: Installing Dependencies")
print("="*80)

import os  # used below for the veRL directory check
import subprocess
import sys

def install_package(package, quiet=True):
    """Install a package with pip."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if quiet:
        cmd.append("-q")
    cmd.append(package)
    subprocess.check_call(cmd)

# Core dependencies
print("\n📦 Installing core packages...")
core_packages = [
    "torch",
    "transformers",
    "accelerate",
    "datasets",
    "sentence-transformers",
    "faiss-gpu",
    "biopython",
    "wandb",
    "tqdm",
    "psutil",
]

for pkg in core_packages:
    try:
        install_package(pkg)
        print(f"  ✓ {pkg}")
    except Exception as e:
        print(f"  ⚠️ Failed: {pkg} - {e}")

# Install SGLang for serving
print("\n📦 Installing SGLang...")
try:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "sglang[all]"])
    print("  ✓ sglang")
except subprocess.CalledProcessError as e:
    # Catch only pip failures; a bare except would also swallow KeyboardInterrupt
    print(f"  ⚠️ SGLang installation failed ({e}), will try alternative method")

# Install veRL from source
print("\n📦 Installing veRL framework...")
if not os.path.exists('/content/verl'):
    subprocess.check_call(["git", "clone", "https://github.com/volcengine/verl.git", "/content/verl"])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "/content/verl"])
    print("  ✓ veRL installed from source")
else:
    print("  ✓ veRL already installed")

# Verify installations
print("\n🔍 Verifying installations...")
import torch
import transformers
print(f"  ✓ PyTorch: {torch.__version__}")
print(f"  ✓ Transformers: {transformers.__version__}")
print(f"  ✓ CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"  ✓ GPU: {gpu_name}")
    print(f"  ✓ GPU Memory: {gpu_memory:.1f} GB")
    
    if "A100" not in gpu_name:
        print(f"  ⚠️ WARNING: Expected A100 GPU, got {gpu_name}")
        print(f"     Training may be slower or run out of memory")
else:
    print("  ❌ ERROR: No GPU detected!")
    print("     Go to Runtime -> Change runtime type -> Select A100 GPU")
    raise RuntimeError("GPU required for training")

print("\n✅ All dependencies installed successfully!")
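
Cell 2 re-runs `pip install` for every package each time it is executed. One way to skip packages that are already present is to test importability first. A minimal sketch follows. The `PIP_TO_MODULE` mapping and both helper functions are illustrative assumptions, not part of the notebook. The mapping is needed because the pip name and the import name differ for several packages.

```python
import importlib.util

# pip distribution name -> importable top-level module name
PIP_TO_MODULE = {
    "sentence-transformers": "sentence_transformers",
    "faiss-gpu": "faiss",
    "biopython": "Bio",
    "tqdm": "tqdm",
}


def is_importable(module_name: str) -> bool:
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def missing_packages(pip_to_module: dict) -> list:
    """Return the pip names whose modules cannot be found in this environment."""
    return [pip for pip, mod in pip_to_module.items() if not is_importable(mod)]


# Only the packages reported here would need a pip install
print(missing_packages(PIP_TO_MODULE))
```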

In [None]:
# Cell 3: Clone DrPubMedZero Repository

print("="*80)
print("CELL 3: Cloning DrPubMedZero Repository")
print("="*80)

import os
import subprocess
from pathlib import Path
import sys

# Clone your DrPubMedZero repository (contains everything!)
REPO_DIR = Path('/content/DrPubMedZero')

if not REPO_DIR.exists():
    print("\n📥 Cloning DrPubMedZero repository from GitHub...")
    subprocess.check_call([
        "git", "clone", 
        "https://github.com/ShivaAyyar/DrPubMedZero.git",
        str(REPO_DIR)
    ])
    print("  ✓ Repository cloned")
else:
    print("\n✓ Repository already exists")
    print("  Pulling latest changes...")
    subprocess.check_call(["git", "-C", str(REPO_DIR), "pull"])
    print("  ✓ Up to date")

# Change to repository directory
os.chdir(REPO_DIR)
print(f"\n📂 Working directory: {os.getcwd()}")

# Verify all required files exist
print("\n🔍 Verifying repository contents...")

required_items = [
    'biomedical/',
    'colab_helpers.py',
    'colab_config.yaml',
    'config/',
    'scripts/download.py',
    'iter1_challenger_biomed.sh',
    'iter2_challenger_biomed.sh',
    'iter3_challenger_biomed.sh',
    'requirements.txt'
]

all_present = True
for item in required_items:
    path = Path(item)
    if path.exists():
        print(f"  ✓ {item}")
    else:
        print(f"  ❌ Missing: {item}")
        all_present = False

if not all_present:
    print("\n⚠️ Some files are missing. Ensure your repository is up to date.")
    raise FileNotFoundError("Required files missing from repository")

# Verify biomedical module imports
print("\n🔍 Verifying biomedical module...")
sys.path.insert(0, str(REPO_DIR))  # Add to Python path

try:
    from biomedical import (
        PubMedCorpusManager,
        BiomedicalValidator,
        BiomedicalRetrieverServer,
        BiomedicalPrompts,
        BiomedicalRewardCalculator,
        BiomedicalDatasets,
        setup_for_colab
    )
    print("  ✓ All biomedical components imported successfully")
    
    # Run Colab setup
    print("\n🔧 Configuring for Colab environment...")
    if setup_for_colab():
        print("  ✓ Colab environment configured")
    
except ImportError as e:
    print(f"  ❌ Import error: {e}")
    print("\n📦 Installing missing dependencies from requirements.txt...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "-r", "requirements.txt"])
    print("  ✓ Dependencies installed")
    print("\n⚠️ Please re-run this cell to complete setup")
else:
    # Report success only when the imports above actually worked
    print("\n✅ Repository setup complete!")
    print(f"   All code is in: {REPO_DIR}")
    print("\n💡 TIP: Your repository is now cloned. All biomedical")
    print("   modules, configs, and training scripts are ready!")

---
# Part 2: Configuration & Data Preparation
---

In [None]:
# Cell 4: Configuration (Dual-Drive Aware)

print("="*80)
print("CELL 4: Configuration")
print("="*80)

# User inputs (MODIFY THESE)
import getpass

print("\n⚙️ Setting up configuration...\n")

# NCBI Email (required for PubMed API)
NCBI_EMAIL = "ssa163@case.edu"  # Replace with your own email address
print(f"📧 NCBI Email: {NCBI_EMAIL}")

# Weights & Biases API key
print("\n🔑 Weights & Biases Setup:")
print("   Get your API key from: https://wandb.ai/authorize")
WANDB_API_KEY = getpass.getpass("Enter W&B API key (hidden): ")

if WANDB_API_KEY:
    os.environ['WANDB_API_KEY'] = WANDB_API_KEY
    import wandb
    wandb.login(key=WANDB_API_KEY)
    print("  ✓ W&B configured")
else:
    print("  ⚠️ No W&B key provided - logging will be disabled")
    os.environ['WANDB_MODE'] = 'disabled'

# Training configuration with DUAL-DRIVE PATHS
print("\n📁 Configuring dual-drive storage paths...")

CONFIG = {
    # Model
    'model_name': 'Qwen/Qwen2.5-3B-Instruct',
    
    # Data
    'corpus_size': 50000,  # Number of PubMed papers to download
    'training_seeds': 2000,  # Number of seed documents
    'pubmed_query': '(breast cancer OR lung cancer OR drug resistance) AND (gene OR protein OR pathway)',
    'date_range': ('2020/01/01', '2024/12/31'),
    
    # Training
    'batch_size': 64,
    'gradient_accumulation': 4,
    'learning_rate': 1e-6,
    'max_steps_per_iteration': 200,  # Steps per iteration (adjust based on data size)
    
    # Paths - DUAL-DRIVE SETUP
    # Large files → Storage account (80GB)
    'corpus_path': str(CORPUS_DIR),  # Account B
    'checkpoint_dir': str(CHECKPOINT_DIR),  # Account B
    'data_dir': str(DATA_DIR),  # Account B
    'outputs_dir': str(OUTPUTS_DIR),  # Account B
    
    # Small files → Pro account (15GB)
    'logs_dir': str(LOGS_DIR),  # Account A
    'config_dir': str(CONFIG_DIR),  # Account A
    
    # Servers
    'retrieval_port': 8000,
    'solver_port': 8001,
}

print("\n📋 Training Configuration:")
print("\n🔹 Model & Training:")
for key in ['model_name', 'batch_size', 'gradient_accumulation', 'learning_rate']:
    print(f"   {key}: {CONFIG[key]}")

print("\n🔹 Data:")
for key in ['corpus_size', 'training_seeds', 'pubmed_query']:
    value = CONFIG[key]
    if isinstance(value, str) and len(value) > 60:
        value = value[:57] + "..."
    print(f"   {key}: {value}")

print("\n🔹 Storage Paths (Dual-Drive):")
print(f"   📁 Account A (Pro - 15GB):")
print(f"      logs_dir: {CONFIG['logs_dir']}")
print(f"      config_dir: {CONFIG['config_dir']}")
print(f"\n   📁 Account B (Storage - 80GB):")
print(f"      corpus_path: {CONFIG['corpus_path']}")
print(f"      checkpoint_dir: {CONFIG['checkpoint_dir']}")
print(f"      data_dir: {CONFIG['data_dir']}")
print(f"      outputs_dir: {CONFIG['outputs_dir']}")

# Verify paths exist
print("\n🔍 Verifying all directories...")
for key, path in CONFIG.items():
    # Match only path-valued keys by suffix (e.g. 'logs_dir', 'corpus_path')
    if key.endswith(('_dir', '_path')):
        if not Path(path).exists():
            print(f"   ⚠️ Creating: {path}")
            Path(path).mkdir(parents=True, exist_ok=True)
        else:
            print(f"   ✓ {key}: exists")

print("\n✅ Configuration complete!")
print("   Dual-drive setup verified")
print("   All paths ready for training")
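
Because `CONFIG` lives only in the Python session, a disconnection loses it along with any edits made to it. Writing it to the config directory on Drive lets a resumed session reload the same settings. The following is a minimal sketch under that assumption. The `save_config` and `load_config` helpers and the `run_config.json` filename are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_config(config: dict, config_dir: Path, name: str = "run_config.json") -> Path:
    """Write the configuration as JSON and return the file path."""
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / name
    path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return path


def load_config(path: Path) -> dict:
    """Read a configuration written by save_config."""
    config = json.loads(path.read_text())
    # JSON has no tuple type, so restore date_range to a tuple after loading.
    if "date_range" in config:
        config["date_range"] = tuple(config["date_range"])
    return config


# Demo with a subset of the notebook's CONFIG and a temporary directory
demo_config = {
    "model_name": "Qwen/Qwen2.5-3B-Instruct",
    "corpus_size": 50000,
    "learning_rate": 1e-6,
    "date_range": ("2020/01/01", "2024/12/31"),
}
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_config(demo_config, Path(tmp) / "configs")
    reloaded = load_config(saved_path)
    print(reloaded == demo_config)
```

In the notebook this would be called as `save_config(CONFIG, Path(CONFIG['config_dir']))`. The W&B API key is kept in an environment variable rather than in `CONFIG`, so it is not written to disk by this step.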

In [None]:
# Cell 5: Download PubMed Corpus

print("="*80)
print("CELL 5: Downloading PubMed Corpus")
print("="*80)

from biomedical import PubMedCorpusManager

# Check if corpus already exists
corpus_file = Path(CONFIG['corpus_path']) / 'pubmed-corpus.jsonl'

# Default to no download; set to True below only when papers are missing
download_corpus = False

if corpus_file.exists():
    print(f"\n✓ Corpus already exists: {corpus_file}")
    
    # Count existing papers
    with open(corpus_file, 'r') as f:
        n_existing = sum(1 for _ in f)
    
    print(f"  Papers in corpus: {n_existing}")
    
    if n_existing >= CONFIG['corpus_size']:
        print("  Skipping download (sufficient papers already downloaded)")
    else:
        print(f"  Need to download {CONFIG['corpus_size'] - n_existing} more papers")
        download_corpus = True
else:
    print("\n📥 Downloading PubMed corpus...")
    print(f"   Query: {CONFIG['pubmed_query']}")
    print(f"   Max papers: {CONFIG['corpus_size']}")
    print(f"   Date range: {CONFIG['date_range']}")
    print("\n⏱️ This will take 30-60 minutes...")
    
    download_corpus = True

if download_corpus:
    # Initialize corpus manager
    manager = PubMedCorpusManager(
        save_path=CONFIG['corpus_path'],
        email=NCBI_EMAIL
    )
    
    # Download
    articles = manager.download_pubmed_abstracts(
        query=CONFIG['pubmed_query'],
        max_results=CONFIG['corpus_size'],
        date_range=CONFIG['date_range']
    )
    
    if articles:
        # Save corpus
        manager.save_corpus(articles)
        
        # Print statistics
        stats = manager.get_corpus_statistics()
        print("\n📊 Corpus Statistics:")
        for key, value in stats.items():
            print(f"   {key}: {value}")
        
        print(f"\n✅ Downloaded {len(articles)} papers!")
    else:
        print("\n❌ Download failed - check your internet and NCBI email")
        raise RuntimeError("Corpus download failed")
else:
    print("\n✅ Using existing corpus")
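
The existing-corpus check in Cell 5 counts every line in the JSONL file, so a blank line or a record truncated by an interrupted write would be counted as a paper. A stricter count can parse each line, as in the minimal sketch below. The `count_valid_records` helper and the `pmid` and `title` field names in the demo are illustrative assumptions. The actual record schema is defined by `PubMedCorpusManager`.

```python
import json
import tempfile
from pathlib import Path


def count_valid_records(jsonl_path: Path):
    """Return (valid, malformed) line counts for a JSONL file, ignoring blank lines."""
    valid = malformed = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                malformed += 1
    return valid, malformed


# Demo: two complete records, one blank line, one truncated record
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "pubmed-corpus.jsonl"
    sample.write_text(
        '{"pmid": "1", "title": "First paper"}\n'
        '{"pmid": "2", "title": "Second paper"}\n'
        "\n"
        '{"pmid": "3", "title": "Trunc\n',
        encoding="utf-8",
    )
    counts = count_valid_records(sample)
    print(counts)
```

A non-zero malformed count suggests the previous download was interrupted while saving.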

## ⚠️ Checkpoint: Corpus Downloaded

At this point, you have:
- ✅ PubMed corpus downloaded to Google Drive
- ✅ Environment fully configured

**If you need to stop here:**
- Your corpus is safely stored in Google Drive
- You can resume from the next cell later

**To continue:** Run the next cells to build the search index.

This section covers the first five cells. The complete notebook would continue with:

- Cells 6-7: Build FAISS index
- Cells 8-9: Prepare training data
- Cells 10-15: Iteration 1 training
- Cells 16-21: Iteration 2 training
- Cells 22-27: Iteration 3 training
- Cells 28-30: Evaluation
