# Dr. Zero Biomedical Training on Google Colab

This notebook provides a complete pipeline for training Dr. Zero on biomedical literature (PubMed) using Google Colab Pro+ with an A100 GPU.

## Overview

**Training Pipeline:**
1. Setup environment and download PubMed corpus
2. Build PubMedBERT search index
3. Train Iteration 1 (Proposer + Solver)
4. Train Iteration 2 (with improved solver)
5. Train Iteration 3 (final models)
6. Evaluate on biomedical QA benchmarks

**Expected Runtime:** 30-40 hours on A100 GPU

**Requirements:**
- Google Colab Pro/Pro+ (for A100 GPU and long runtime)
- ~50 GB Google Drive storage
- Weights & Biases account (for logging)

## Before You Start

1. **Set runtime to A100 GPU:**
   - Runtime → Change runtime type → A100 GPU
2. **Get W&B API key:**
   - Sign up at wandb.ai
   - Get API key from wandb.ai/authorize
3. **Have your email ready** (required for NCBI PubMed API)

## Execution Instructions

Run cells in order. The notebook includes:
- ✅ Automatic checkpointing to Google Drive
- 🔄 Auto-resume from disconnections
- 📊 Progress monitoring
- 🛡️ Error handling and recovery

**Do NOT skip cells** - they build on each other.
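
Auto-resume depends on finding the most recent checkpoint after a disconnect. The sketch below shows one way to do that lookup, assuming a hypothetical layout in which every checkpoint is a directory named `step_<N>`. The layout and the helper name `latest_checkpoint` are illustrative assumptions, not part of the notebook's own helpers.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumed (hypothetical) naming scheme: one directory per checkpoint, "step_<N>".
STEP_PATTERN = re.compile(r"^step_(\d+)$")


def latest_checkpoint(checkpoint_root: Path) -> Optional[Path]:
    """Return the step_<N> directory with the largest N, or None if there is none."""
    if not checkpoint_root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in checkpoint_root.iterdir():
        match = STEP_PATTERN.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demo on a throwaway directory instead of Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in (50, 200, 1000):
        (root / f"step_{step}").mkdir()
    (root / "notes.txt").write_text("not a checkpoint")
    resume_name = latest_checkpoint(root).name

print(resume_name)  # step_1000 (steps are compared as integers, not as strings)
```

Comparing step numbers as integers matters here: a plain string sort would rank `step_50` after `step_1000`.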

Let's begin!

---
# Part 1: Environment Setup
---

In [None]:
# Cell 1: Mount Google Drive and Setup Directories

print("="*80)
print("CELL 1: Mounting Google Drive & Creating Directories")
print("="*80)

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

# Create directory structure in Google Drive
import os
from pathlib import Path

# Base directory in Google Drive
DRIVE_BASE = Path('/content/drive/MyDrive/drzero_biomedical')
DRIVE_BASE.mkdir(exist_ok=True)

# Subdirectories
CORPUS_DIR = DRIVE_BASE / 'corpus' / 'pubmed'
CHECKPOINT_DIR = DRIVE_BASE / 'checkpoints'
DATA_DIR = DRIVE_BASE / 'data' / 'biomedical'
LOGS_DIR = DRIVE_BASE / 'logs'
OUTPUTS_DIR = DRIVE_BASE / 'outputs'

for dir_path in [CORPUS_DIR, CHECKPOINT_DIR, DATA_DIR, LOGS_DIR, OUTPUTS_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created: {dir_path}")

# Create local directories (faster access during training)
LOCAL_BASE = Path('/content/drzero_local')
LOCAL_CHECKPOINT = LOCAL_BASE / 'checkpoints'
LOCAL_DATA = LOCAL_BASE / 'data'

for dir_path in [LOCAL_CHECKPOINT, LOCAL_DATA]:
    dir_path.mkdir(parents=True, exist_ok=True)
    print(f"✓ Created local: {dir_path}")

print("\n✅ Directory structure ready!")
print(f"   Drive base: {DRIVE_BASE}")
print(f"   Local base: {LOCAL_BASE}")
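
Cell 1 keeps working copies on the fast local disk and durable copies on Google Drive, so something has to copy files from one to the other. The sketch below is one possible approach using only the standard library. The function name `sync_to_drive` and the size-and-timestamp comparison are assumptions for illustration, and a mounted Drive folder may report timestamps differently from a local disk.

```python
import shutil
import tempfile
from pathlib import Path


def sync_to_drive(local_dir: Path, drive_dir: Path) -> int:
    """Copy new or changed files from local_dir into drive_dir; return how many were copied."""
    copied = 0
    for src in local_dir.rglob("*"):
        if not src.is_file():
            continue
        dst = drive_dir / src.relative_to(local_dir)
        if dst.exists():
            src_stat, dst_stat = src.stat(), dst.stat()
            # Treat the file as up to date if the size matches and the copy is not older.
            if (dst_stat.st_size == src_stat.st_size
                    and dst_stat.st_mtime_ns >= src_stat.st_mtime_ns):
                continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied += 1
    return copied


# Demo with temporary directories standing in for LOCAL_CHECKPOINT and CHECKPOINT_DIR.
with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "local"
    drive = Path(tmp) / "drive"
    (local / "iter1").mkdir(parents=True)
    (local / "iter1" / "model.bin").write_bytes(b"weights")
    (local / "iter1" / "state.json").write_text("{}")
    first_pass = sync_to_drive(local, drive)   # both files are new
    second_pass = sync_to_drive(local, drive)  # nothing changed since the first pass

print(first_pass, second_pass)
```

Skipping unchanged files keeps repeated syncs cheap, which matters when large checkpoints are written to Drive many times during a long run.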

In [None]:
# Cell 2: Install Dependencies

print("="*80)
print("CELL 2: Installing Dependencies")
print("="*80)

import subprocess
import sys

def install_package(package, quiet=True):
    """Install a package with pip."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if quiet:
        cmd.append("-q")
    cmd.append(package)
    subprocess.check_call(cmd)

# Core dependencies
print("\n📦 Installing core packages...")
core_packages = [
    "torch",
    "transformers",
    "accelerate",
    "datasets",
    "sentence-transformers",
    "faiss-gpu",
    "biopython",
    "wandb",
    "tqdm",
    "psutil",
]

for pkg in core_packages:
    try:
        install_package(pkg)
        print(f"  ✓ {pkg}")
    except Exception as e:
        print(f"  ⚠️ Failed: {pkg} - {e}")

# Install SGLang for serving
print("\n📦 Installing SGLang...")
try:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "sglang[all]"])
    print("  ✓ sglang")
except Exception:
    # Catch Exception rather than using a bare except, so Ctrl+C still interrupts the cell
    print("  ⚠️ SGLang installation failed, will try alternative method")

# Install veRL from source
print("\n📦 Installing veRL framework...")
if not os.path.exists('/content/verl'):
    subprocess.check_call(["git", "clone", "https://github.com/volcengine/verl.git", "/content/verl"])
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "/content/verl"])
    print("  ✓ veRL installed from source")
else:
    print("  ✓ veRL already installed")

# Verify installations
print("\n🔍 Verifying installations...")
import torch
import transformers
print(f"  ✓ PyTorch: {torch.__version__}")
print(f"  ✓ Transformers: {transformers.__version__}")
print(f"  ✓ CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"  ✓ GPU: {gpu_name}")
    print(f"  ✓ GPU Memory: {gpu_memory:.1f} GB")

    if "A100" not in gpu_name:
        print(f"  ⚠️ WARNING: Expected A100 GPU, got {gpu_name}")
        print("     Training may be slower or run out of memory")
else:
    print("  ❌ ERROR: No GPU detected!")
    print("     Go to Runtime -> Change runtime type -> Select A100 GPU")
    raise RuntimeError("GPU required for training")

print("\n✅ All dependencies installed successfully!")
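
The install loop in Cell 2 only prints a warning when a package fails, so the final message can appear even if something is missing. A small check like the sketch below could confirm that each core package is importable. The mapping is needed because several pip names differ from their import names (`biopython` imports as `Bio`, `faiss-gpu` as `faiss`). The helper name `missing_packages` is an assumption for illustration.

```python
import importlib.util
from typing import Dict, List

# pip name -> importable module name (they differ for several packages)
IMPORT_NAMES: Dict[str, str] = {
    "torch": "torch",
    "transformers": "transformers",
    "accelerate": "accelerate",
    "datasets": "datasets",
    "sentence-transformers": "sentence_transformers",
    "faiss-gpu": "faiss",
    "biopython": "Bio",
    "wandb": "wandb",
    "tqdm": "tqdm",
    "psutil": "psutil",
}


def missing_packages(import_names: Dict[str, str]) -> List[str]:
    """Return the pip names whose module cannot be found in the current environment."""
    return sorted(
        pip_name
        for pip_name, module_name in import_names.items()
        if importlib.util.find_spec(module_name) is None
    )


missing = missing_packages(IMPORT_NAMES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core packages are importable")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as `torch`.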

In [None]:
# Cell 3: Clone Dr. Zero and Setup Biomedical Module

print("="*80)
print("CELL 3: Setting Up Dr. Zero Repository")
print("="*80)

import os
import subprocess
import shutil

# Clone Dr. Zero if not exists
DRZERO_DIR = Path('/content/drzero')
if not DRZERO_DIR.exists():
    print("\n📥 Cloning Dr. Zero repository...")
    subprocess.check_call([
        "git", "clone",
        "https://github.com/facebookresearch/drzero.git",
        str(DRZERO_DIR)
    ])
    print("  ✓ Dr. Zero cloned")
else:
    print("\n✓ Dr. Zero already cloned")

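# Change to the drzero directory and put it on sys.path so that
# `from biomedical import ...` below resolves to the copy inside the repo
import sys
os.chdir(DRZERO_DIR)
if str(DRZERO_DIR) not in sys.path:
    sys.path.insert(0, str(DRZERO_DIR))
print(f"\n📂 Working directory: {os.getcwd()}")

# Copy biomedical module (assuming it's uploaded to Colab or Drive)
print("\n📋 Setting up biomedical module...")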

# Check if biomedical module exists in current directory or Drive
BIOMEDICAL_SOURCE = None
for search_path in [
    Path('/content/biomedical'),  # If uploaded directly
    DRIVE_BASE.parent / 'biomedical',  # If in Drive
    Path.cwd() / 'biomedical'  # If already copied
]:
    if search_path.exists():
        BIOMEDICAL_SOURCE = search_path
        break

if BIOMEDICAL_SOURCE and BIOMEDICAL_SOURCE != DRZERO_DIR / 'biomedical':
    shutil.copytree(BIOMEDICAL_SOURCE, DRZERO_DIR / 'biomedical', dirs_exist_ok=True)
    print(f"  ✓ Copied biomedical module from {BIOMEDICAL_SOURCE}")
elif (DRZERO_DIR / 'biomedical').exists():
    print("  ✓ Biomedical module already in place")
else:
    print("  ⚠️ Biomedical module not found!")
    print("     Please upload the 'biomedical/' folder to:")
    print(f"     - /content/biomedical/ OR")
    print(f"     - {DRIVE_BASE.parent}/biomedical/")
    raise FileNotFoundError("Biomedical module required")

# Copy helper files
print("\n📋 Copying helper files...")
helper_files = [
    'colab_helpers.py',
    'colab_config.yaml'
]

for filename in helper_files:
    # Check multiple locations
    for source in [Path('/content') / filename, DRIVE_BASE.parent / filename]:
        if source.exists():
            shutil.copy(source, DRZERO_DIR / filename)
            print(f"  ✓ Copied {filename}")
            break
    else:
        print(f"  ⚠️ {filename} not found (will create if needed)")

# Verify biomedical module
print("\n🔍 Verifying biomedical module...")
try:
    from biomedical import (
        PubMedCorpusManager,
        BiomedicalValidator,
        BiomedicalRetrieverServer,
        BiomedicalPrompts,
        BiomedicalRewardCalculator,
        BiomedicalDatasets
    )
    print("  ✓ All biomedical components imported successfully")
except ImportError as e:
    print(f"  ❌ Import error: {e}")
    raise

print("\n✅ Setup complete!")

---
# Part 2: Configuration & Data Preparation
---

In [None]:
# Cell 4: Configuration

print("="*80)
print("CELL 4: Configuration")
print("="*80)

# User inputs (MODIFY THESE)
import getpass

print("\n⚙️ Setting up configuration...\n")

# NCBI Email (required for PubMed API)
NCBI_EMAIL = "ssa163@case.edu"  # CHANGE THIS to your email
print(f"📧 NCBI Email: {NCBI_EMAIL}")

# Weights & Biases API key
print("\n🔑 Weights & Biases Setup:")
print("   Get your API key from: https://wandb.ai/authorize")
WANDB_API_KEY = getpass.getpass("Enter W&B API key (hidden): ")

if WANDB_API_KEY:
    os.environ['WANDB_API_KEY'] = WANDB_API_KEY
    import wandb
    wandb.login(key=WANDB_API_KEY)
    print("  ✓ W&B configured")
else:
    print("  ⚠️ No W&B key provided - logging will be disabled")
    os.environ['WANDB_MODE'] = 'disabled'

# Training configuration
CONFIG = {
    # Model
    'model_name': 'Qwen/Qwen2.5-3B-Instruct',
    
    # Data
    'corpus_size': 50000,  # Number of PubMed papers to download
    'training_seeds': 2000,  # Number of seed documents
    'pubmed_query': '(breast cancer OR lung cancer OR drug resistance) AND (gene OR protein OR pathway)',
    'date_range': ('2020/01/01', '2024/12/31'),
    
    # Training
    'batch_size': 64,
    'gradient_accumulation': 4,
    'learning_rate': 1e-6,
    'max_steps_per_iteration': 200,  # Steps per iteration (adjust based on data size)
    
    # Paths
    'corpus_path': str(CORPUS_DIR),
    'checkpoint_dir': str(CHECKPOINT_DIR),
    'data_dir': str(DATA_DIR),
    'logs_dir': str(LOGS_DIR),
    
    # Servers
    'retrieval_port': 8000,
    'solver_port': 8001,
}

print("\n📋 Training Configuration:")
for key, value in CONFIG.items():
    print(f"   {key}: {value}")

print("\n✅ Configuration complete!")
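
Because the later cells take many hours, it can help to catch an inconsistent configuration before any download or training starts. The sketch below checks a few relationships between the `CONFIG` values from Cell 4. The helper `validate_config` is hypothetical, and the rule that `training_seeds` should not exceed `corpus_size` assumes that seed documents are drawn from the downloaded corpus.

```python
from datetime import datetime
from typing import Any, Dict, List


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means every check passed."""
    problems = []
    # Assumption: seed documents are sampled from the corpus, so there cannot be more of them.
    if config["training_seeds"] > config["corpus_size"]:
        problems.append("training_seeds exceeds corpus_size")
    start, end = (datetime.strptime(d, "%Y/%m/%d") for d in config["date_range"])
    if start > end:
        problems.append("date_range start is after its end")
    if config["retrieval_port"] == config["solver_port"]:
        problems.append("retrieval_port and solver_port must differ")
    for key in ("batch_size", "gradient_accumulation", "max_steps_per_iteration"):
        if not (isinstance(config[key], int) and config[key] > 0):
            problems.append(f"{key} must be a positive integer")
    if not config["learning_rate"] > 0:
        problems.append("learning_rate must be positive")
    return problems


# Values copied from the CONFIG dictionary in Cell 4.
example = {
    "corpus_size": 50000,
    "training_seeds": 2000,
    "date_range": ("2020/01/01", "2024/12/31"),
    "batch_size": 64,
    "gradient_accumulation": 4,
    "learning_rate": 1e-6,
    "max_steps_per_iteration": 200,
    "retrieval_port": 8000,
    "solver_port": 8001,
}
print(validate_config(example))  # → []
```

Returning a list of problems rather than raising on the first one lets the user fix every issue in a single pass.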

In [None]:
# Cell 5: Download PubMed Corpus

print("="*80)
print("CELL 5: Downloading PubMed Corpus")
print("="*80)

from biomedical import PubMedCorpusManager

# Check if corpus already exists
corpus_file = Path(CONFIG['corpus_path']) / 'pubmed-corpus.jsonl'

# Default to no download so the flag is always defined when it is checked below
download_corpus = False

if corpus_file.exists():
    print(f"\n✓ Corpus already exists: {corpus_file}")

    # Count existing papers (one JSON record per line)
    with open(corpus_file, 'r') as f:
        n_existing = sum(1 for _ in f)

    print(f"  Papers in corpus: {n_existing}")

    if n_existing >= CONFIG['corpus_size']:
        print("  Skipping download (sufficient papers already downloaded)")
    else:
        print(f"  Need to download {CONFIG['corpus_size'] - n_existing} more papers")
        download_corpus = True
else:
    print("\n📥 Downloading PubMed corpus...")
    print(f"   Query: {CONFIG['pubmed_query']}")
    print(f"   Max papers: {CONFIG['corpus_size']}")
    print(f"   Date range: {CONFIG['date_range']}")
    print("\n⏱️ This will take 30-60 minutes...")

    download_corpus = True

if download_corpus:
    # Initialize corpus manager
    manager = PubMedCorpusManager(
        save_path=CONFIG['corpus_path'],
        email=NCBI_EMAIL
    )
    
    # Download
    articles = manager.download_pubmed_abstracts(
        query=CONFIG['pubmed_query'],
        max_results=CONFIG['corpus_size'],
        date_range=CONFIG['date_range']
    )
    
    if articles:
        # Save corpus
        manager.save_corpus(articles)
        
        # Print statistics
        stats = manager.get_corpus_statistics()
        print("\n📊 Corpus Statistics:")
        for key, value in stats.items():
            print(f"   {key}: {value}")

        print(f"\n✅ Downloaded {len(articles)} papers!")
    else:
        print("\n❌ Download failed - check your internet and NCBI email")
        raise RuntimeError("Corpus download failed")
else:
    print("\n✅ Using existing corpus")
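
NCBI's E-utilities documentation describes a limit of 10,000 retrievable records per PubMed ESearch query, which is below the `corpus_size` of 50,000 requested here. How `PubMedCorpusManager.download_pubmed_abstracts` handles that limit is not shown in this notebook. One possible workaround is to split the date range into smaller windows and query each one separately. The sketch below builds those windows and the matching ESearch parameters without making any network calls. The helper names `yearly_windows` and `esearch_params` are hypothetical.

```python
from datetime import date, datetime
from typing import Dict, List, Tuple

DATE_FORMAT = "%Y/%m/%d"  # same format as CONFIG['date_range']


def yearly_windows(start: str, end: str) -> List[Tuple[str, str]]:
    """Split an inclusive date range into one (mindate, maxdate) pair per calendar year."""
    first = datetime.strptime(start, DATE_FORMAT).date()
    last = datetime.strptime(end, DATE_FORMAT).date()
    windows = []
    for year in range(first.year, last.year + 1):
        lo = max(first, date(year, 1, 1))
        hi = min(last, date(year, 12, 31))
        windows.append((lo.strftime(DATE_FORMAT), hi.strftime(DATE_FORMAT)))
    return windows


def esearch_params(query: str, window: Tuple[str, str], retmax: int = 10000) -> Dict[str, str]:
    """Build E-utilities ESearch parameters for one publication-date window."""
    return {
        "db": "pubmed",
        "term": query,
        "datetype": "pdat",  # filter on publication date
        "mindate": window[0],
        "maxdate": window[1],
        "retmax": str(retmax),
    }


windows = yearly_windows("2020/01/01", "2024/12/31")
for window in windows:
    print(esearch_params("breast cancer AND gene", window))
```

With Biopython, each dictionary could be passed as `Entrez.esearch(**params)` after setting `Entrez.email`. If a single year still matches more than 10,000 records, the same idea applies with monthly windows.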

## ⚠️ Checkpoint: Corpus Downloaded

At this point, you have:
- ✅ PubMed corpus downloaded to Google Drive
- ✅ Environment fully configured

**If you need to stop here:**
- Your corpus is safely stored in Google Drive
- You can resume from the next cell later

**To continue:** Run the next cells to build the search index.

The first five cells are shown above. The complete notebook continues with:

- Cells 6-7: Build FAISS index
- Cells 8-9: Prepare training data
- Cells 10-15: Iteration 1 training
- Cells 16-21: Iteration 2 training
- Cells 22-27: Iteration 3 training
- Cells 28-30: Evaluation
