# Modular pipeline for protein structure prediction, pocket detection, and ligand docking

**Purpose:**
- Parse `.gbk` files to extract protein CDS sequences
- Use `esm3-sm-open-v1` (ESM3 small) to predict protein structure (PDB)
- Run **P2Rank** for pocket detection
- Prepare ligand (SMILES or upload) and run **AutoDock Vina** docking
- Generate a simple PDF summary report and provide downloadable PDB/PDBQT files

**Before you start (read me):**
- Runtime: Designed for Google Colab with GPU enabled (Runtime → Change runtime type → GPU). Works locally too with minor adjustments.
- Hugging Face: You need a token for `esm3-sm-open-v1`. Create one at https://huggingface.co/settings/tokens and, if you have not already, request access to the gated `EvolutionaryScale/esm3-sm-open-v1` repo.
- System deps: Colab step installs Java/OpenBabel/Vina and tools; can be toggled off if you prefer Conda.
- ESM3 size: We use the small model to keep memory low on Colab.
- Python version: 3.12+ recommended (the validation in Cell 1b accepts 3.9 or newer)

**For antiSMASH analysis:**
- Use the separate `AntiSMASH_Colab.ipynb` notebook for BGC annotation

Quick switches (optional toggles):
- Set booleans in the "Cell 0a - Config" cell to enable/disable steps without editing code elsewhere:
  - `INSTALL_SYSTEM_DEPS`: Install system packages (apt) on Colab. Default: True on Colab.
  - `INSTALL_MINICONDA`: Install Miniconda via wget (Linux) and prepare Conda. Default: False.
  - `ACCEPT_CONDA_TOS`: Automatically accept Conda ToS for required channels if installing Miniconda. Default: True.
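
The toggles are plain Python booleans, so they could also be driven from outside the notebook. A minimal sketch, assuming you want environment-variable overrides; the helper `env_flag` and the `PROTFLOW_*` variable names are hypothetical and not part of the notebook or the repository:

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Read a boolean toggle from the environment, falling back to `default`.

    Hypothetical helper: 1/true/yes/on (case-insensitive) mean True,
    0/false/no/off mean False, anything else keeps the default.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in {"1", "true", "yes", "on"}:
        return True
    if value in {"0", "false", "no", "off"}:
        return False
    return default

# Same defaults as described in the list above; the PROTFLOW_* names are illustrative only.
ON_COLAB = os.path.exists("/content")
INSTALL_SYSTEM_DEPS = env_flag("PROTFLOW_INSTALL_SYSTEM_DEPS", ON_COLAB)
INSTALL_MINICONDA = env_flag("PROTFLOW_INSTALL_MINICONDA", False)
ACCEPT_CONDA_TOS = env_flag("PROTFLOW_ACCEPT_CONDA_TOS", True)
```

Unrecognized values fall back to the default rather than raising, so a typo in the environment cannot silently enable an install step.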

Conda installation (Linux/Colab):
- If you want to use Conda on Linux/Colab, you can install Miniconda and accept the Conda ToS for Anaconda channels.
- These steps are available as a runnable cell below (Cell 1a). The core commands are:

```bash
# Download Miniconda (Linux x86_64)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Non-interactive install to a prefix (example)
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda"
# Accept Conda ToS for Anaconda main and R channels
"$HOME/miniconda/bin/conda" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
"$HOME/miniconda/bin/conda" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
```

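Tip: To use Conda on Linux/Colab, first install Miniconda with the wget command above, then accept the terms of the two Anaconda channels (main and r) via `conda tos accept`. Alternatively, turn on the `INSTALL_MINICONDA` switch in this notebook and Cell 1a will handle the installation and ToS acceptance automatically.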

> Colab note: we do not use `requirements.txt` here to avoid potential RDKit/NumPy resolver conflicts on Colab. We install the minimal Python packages directly below (or you can toggle on Conda install).
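
Because packages are installed without a pinned requirements file, it can help to record which versions the resolver actually picked. A minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the names passed in are distribution names (as given to pip), which may differ from import names:

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    # Package list chosen for illustration; adjust to whatever you installed.
    for pkg, ver in installed_versions(["torch", "esm", "rdkit", "numpy", "biopython"]).items():
        print(f"{pkg}: {ver or 'not installed'}")
```

Running this before and after a runtime restart makes it easier to spot a silently upgraded NumPy or RDKit.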


In [None]:
# Cell 0 - Get code (Colab)
# If running on Colab (/content exists), clone the repo into /content/ProtFlow if not present,
# change directory, and remove any shadowing data folder.
from pathlib import Path as __Path
if __Path('/content').exists():
    # Check if already cloned
    if not __Path('/content/ProtFlow').exists():
        print('📥 Cloning ProtFlow repository...')
        !git clone https://github.com/AsagiriBeta/ProtFlow.git /content/ProtFlow
        print('✅ Repository cloned successfully')
    else:
        print('✅ Repository already exists at /content/ProtFlow')

    # Change to repo directory
    %cd /content/ProtFlow

    # Remove any shadowing folders that might cause import issues
    if __Path('/content/esm3_pipeline').exists():
        print('🧹 Removing shadowing /content/esm3_pipeline folder...')
        !rm -rf /content/esm3_pipeline

    print('Working directory:', __Path.cwd())
    print('Ready to proceed!')
else:
    print('Not running on Colab; using current directory.')


In [None]:
# Cell 0a - Config: optional toggles
from pathlib import Path as ___Path
ON_COLAB = ___Path('/content').exists()
# Toggle whether to install system packages (apt) on Colab
INSTALL_SYSTEM_DEPS = ON_COLAB
# Toggle whether to install Miniconda and accept Conda ToS (Linux only)
INSTALL_MINICONDA = False
# Toggle whether to auto-accept Conda TOS (only used if INSTALL_MINICONDA=True)
ACCEPT_CONDA_TOS = True
# Where to install Miniconda if enabled
CONDA_PREFIX_DIR = '/content/miniconda' if ON_COLAB else str(___Path.home() / 'miniconda')
print('ON_COLAB =', ON_COLAB)
print('INSTALL_SYSTEM_DEPS =', INSTALL_SYSTEM_DEPS)
print('INSTALL_MINICONDA =', INSTALL_MINICONDA)
print('ACCEPT_CONDA_TOS =', ACCEPT_CONDA_TOS)
print('CONDA_PREFIX_DIR =', CONDA_PREFIX_DIR)


In [None]:
# Cell 1 - Install dependencies (run once)
# Install Python packages with error handling
import subprocess
import sys

def _pip_install(*args):
    # Run pip in the current interpreter; unlike `!pip`, this raises
    # CalledProcessError on failure, so the except block below can catch it.
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', *args])

try:
    print('📦 Installing Python packages...')
    _pip_install('--upgrade', 'pip')

    # Install packages in a specific order to avoid conflicts on Colab
    _pip_install('torch', 'torchvision', 'torchaudio', '--index-url', 'https://download.pytorch.org/whl/cu121')
    _pip_install('esm==3.2.1.post1')
    _pip_install('rdkit', 'biopython', 'py3Dmol', 'tqdm', 'requests', 'reportlab', 'pandas', 'matplotlib', 'huggingface_hub')

    print('✅ Python packages installed successfully.')
except subprocess.CalledProcessError as e:
    print(f'⚠️ Warning during package installation: {e}')
    print('Continuing anyway...')

# Install system dependencies on Colab
from pathlib import Path as __P0
if INSTALL_SYSTEM_DEPS and __P0('/content').exists():
    print('📦 Installing system packages (Colab)...')
    !apt-get -qq update > /dev/null 2>&1
    !apt-get -qq install -y default-jre openbabel unzip wget python3-tk > /dev/null 2>&1
    !apt-get -qq install -y autodock-vina fpocket 2>/dev/null || echo "Note: autodock-vina/fpocket may not be in default repos"
    print('✅ System packages installed (Colab).')
else:
    print('ℹ️ Skipping system package install (set INSTALL_SYSTEM_DEPS=True on Colab to enable).')

print('✅ Dependencies install finished. If you see any errors, restart runtime and re-run.')


In [None]:
# Cell 1b - Validate environment (optional but recommended)
print('=' * 60)
print('🔍 Environment Validation')
print('=' * 60)

validation_results = {}

# Check Python version
import sys
py_version = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
print(f'Python: {py_version}', end=' ')
if sys.version_info >= (3, 9):
    print('✅')
    validation_results['python'] = True
else:
    print('❌ (Need 3.9+)')
    validation_results['python'] = False

# Check PyTorch and CUDA
try:
    import torch
    print(f'PyTorch: {torch.__version__}', end=' ')
    if torch.cuda.is_available():
        print(f'✅ (CUDA {torch.version.cuda})')
        validation_results['pytorch'] = True
    else:
        print('⚠️ (CPU only, no GPU)')
        validation_results['pytorch'] = 'cpu'
except ImportError:
    print('PyTorch: ❌ Not installed')
    validation_results['pytorch'] = False

# Check ESM
try:
    import esm
    print(f'ESM: {esm.__version__} ✅')
    validation_results['esm'] = True
except ImportError:
    print('ESM: ❌ Not installed')
    validation_results['esm'] = False

# Check BioPython
try:
    import Bio
    print(f'BioPython: {Bio.__version__} ✅')
    validation_results['biopython'] = True
except ImportError:
    print('BioPython: ❌ Not installed')
    validation_results['biopython'] = False

# Check system tools
import shutil
tools = {
    'Java': 'java',
    'OpenBabel': 'obabel',
    'AutoDock Vina': 'vina',
    'wget': 'wget',
    'unzip': 'unzip'
}

print()
print('System tools:')
for name, cmd in tools.items():
    if shutil.which(cmd):
        print(f'  {name}: ✅')
        validation_results[name.lower()] = True
    else:
        print(f'  {name}: ❌')
        validation_results[name.lower()] = False

print()
all_critical = validation_results.get('python', False) and \
               validation_results.get('pytorch', False) and \
               validation_results.get('esm', False) and \
               validation_results.get('biopython', False)

if all_critical:
    print('✅ Environment is ready!')
else:
    print('⚠️ Some critical packages are missing. Please re-run Cell 1.')

print()
print('Note: Missing system tools will be needed for their respective steps.')
print('=' * 60)


In [None]:
# Cell 1a - Optional: Install Miniconda and accept Conda ToS
import sys as __sys, os as __os, shutil as __shutil, subprocess as __sp, pathlib as __pl
if INSTALL_MINICONDA:
    ON_LINUX = __sys.platform.startswith('linux')
    prefix = CONDA_PREFIX_DIR
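    # The prefix is not pre-created here: the installer creates it, and it refuses
    # an existing directory unless -u (update) is passed.
    if ON_LINUX:
        # Download the Miniconda installer; check=True raises if wget fails
        __sp.run(['wget', '-q', '-O', '/tmp/miniconda.sh', 'https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh'], check=True)
        # Run installer non-interactively; -u lets a re-run update an existing prefix
        __sp.run(['bash', '/tmp/miniconda.sh', '-b', '-u', '-p', prefix], check=True)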
        conda_bin = str(__pl.Path(prefix) / 'bin' / 'conda')
        if ACCEPT_CONDA_TOS and __os.path.exists(conda_bin):
            __sp.run([conda_bin, 'tos', 'accept', '--override-channels', '--channel', 'https://repo.anaconda.com/pkgs/main'], check=True)
            __sp.run([conda_bin, 'tos', 'accept', '--override-channels', '--channel', 'https://repo.anaconda.com/pkgs/r'], check=True)
        print('✅ Miniconda installed at', prefix)
    else:
        print('⚠️ Non-Linux platform detected; please use the appropriate Miniconda installer for your OS:')
        print('   macOS (Intel): https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh')
        print('   macOS (Apple Silicon): https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh')
else:
    print('ℹ️ Miniconda installation skipped (INSTALL_MINICONDA=False).')


In [None]:
# Cell 2 - Hugging Face login (enter your token when prompted)
from huggingface_hub import login
import os

print('=' * 60)
print('🔑 Hugging Face Authentication Required')
print('=' * 60)
print('To use ESM3 model, you need a Hugging Face token.')
print('1. Go to: https://huggingface.co/settings/tokens')
print('2. Create a token with READ access')
print('3. Request access to the gated repo: EvolutionaryScale/esm3-sm-open-v1')
print('4. Paste the token below (or set HF_TOKEN env variable)')
print('=' * 60)

HF_TOKEN = os.getenv('HF_TOKEN')
try:
    if HF_TOKEN:
        print('Using token from HF_TOKEN environment variable...')
        login(token=HF_TOKEN)
        print('✅ Logged in successfully!')
    else:
        login()  # interactive prompt on Colab / local
        print('✅ Logged in successfully!')
except Exception as e:
    print(f'❌ Login failed: {e}')
    print('Please check your token and try again.')
    raise


In [None]:
# Cell 3 - Setup directories and imports
from pathlib import Path
import os, subprocess, shutil
import pandas as pd
from Bio import SeqIO
from tqdm.auto import tqdm
import torch

# Use a separate run directory to avoid clashing with the package name
BASE = Path('/content/protflow_runs') if Path('/content').exists() else (Path.cwd() / 'runs')
GBK_DIR = BASE / 'gbk_input'
PDB_DIR = BASE / 'pdbs'

for d in [BASE, GBK_DIR, PDB_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print('Base dir setup at', BASE)


In [None]:
# Cell 3b - Import modular helpers (new)
# Ensure the repository root is at the front of sys.path so imports resolve to the package, not a shadowing data folder.
from pathlib import Path as _Path  # local import to avoid ordering issues if user skips cells
import sys as _sys, importlib as _importlib, os as _os

_repo_candidates = [_Path.cwd(), _Path('/content/ProtFlow')]
_repo_root = None
for _cand in _repo_candidates:
    if (_cand / 'esm3_pipeline' / '__init__.py').exists():
        _repo_root = _cand
        break

if _repo_root is not None:
    _repo_root_str = str(_repo_root)
    # Ensure repo is first on sys.path
    if _repo_root_str in _sys.path:
        _sys.path.remove(_repo_root_str)
    _sys.path.insert(0, _repo_root_str)
    # Purge stale cached modules from previous failed imports
    for _m in list(_sys.modules):
        if _m == 'esm3_pipeline' or _m.startswith('esm3_pipeline.'):
            _sys.modules.pop(_m, None)
    _importlib.invalidate_caches()
else:
    print('⚠️ Could not find esm3_pipeline package; ensure the repo is cloned (e.g., /content/ProtFlow).')

# Warn if a top-level shadowing folder exists on Colab
_shadow = _Path('/content/esm3_pipeline')
if _shadow.exists():
    print('ℹ️ Note: /content/esm3_pipeline exists; using repo package from', _repo_root)

from esm3_pipeline.seq_parser import extract_proteins_from_gbk, filter_and_select
from esm3_pipeline.esm3_predict import load_esm3_small, predict_pdbs
from esm3_pipeline.p2rank import ensure_p2rank, run_p2rank_on_pdbs
from esm3_pipeline.ligand_prep import smiles_or_file_to_pdbqt
from esm3_pipeline.vina_dock import run_vina
from esm3_pipeline.reporting import build_report


In [None]:
# Cell 3c - Optional: Download sample GenBank file for testing
print('=' * 60)
print('📥 Sample Data (Optional)')
print('=' * 60)
print('Download a sample GenBank file for testing?')
print('This will download a small bacterial genome for demonstration.')
print()

download_sample = input('Download sample data? (y/N): ').strip().lower() == 'y'

if download_sample:
    import urllib.request
    sample_url = 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gbff.gz'
    sample_file = GBK_DIR / 'sample.gbff.gz'

    try:
        print('Downloading sample file...')
        urllib.request.urlretrieve(sample_url, sample_file)

        # Decompress in chunks instead of reading the whole file into memory
        import gzip, shutil
        with gzip.open(sample_file, 'rb') as f_in, open(sample_file.with_suffix(''), 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

        sample_file.unlink()  # Remove .gz
        print(f'✅ Sample data downloaded to: {GBK_DIR}')
        print('Note: This is a full genome - Cell 5 will help you select a subset.')

    except Exception as e:
        print(f'❌ Download failed: {e}')
        print('You can manually upload .gbk files to:', GBK_DIR)
else:
    print('Skipped. You can upload your own .gbk/.gbff files to:', GBK_DIR)

print()


In [None]:
# Cell 4 - Parse GenBank files and extract protein translations (modular)
print('=' * 60)
print('📄 Step 1: Extract Protein Sequences from GenBank Files')
print('=' * 60)
print(f'Looking for .gbk/.gbff files in: {GBK_DIR}')
print()

fasta_all = BASE / 'all_proteins.faa'
try:
    count = extract_proteins_from_gbk(GBK_DIR, fasta_all)
    print(f'✅ Wrote {count} protein sequences to {fasta_all}')
except Exception as e:
    print(f'⚠️ Error extracting proteins: {e}')
    count = 0

if count == 0:
    print()
    print('⚠️ No sequences found in GenBank files.')
    print('You can:')
    print('  1. Upload .gbk or .gbff files to:', GBK_DIR)
    print('  2. Paste a single amino acid sequence below')
    print()
    seq = input('Paste amino acid sequence (or press Enter to skip): ').strip()
    if seq:
        with open(fasta_all, 'w') as f:
            f.write('>user_sequence\n' + seq + '\n')
        print(f'✅ Saved single sequence to {fasta_all}')
        count = 1
    else:
        print('⚠️ No sequences provided. Please add GenBank files and re-run.')
else:
    print()
    print('Next: Run Cell 5 to select candidates for structure prediction.')
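
Cell 4 writes a pasted sequence to FASTA as-is. A minimal sketch of a sanity check that could run before writing, assuming the 20 standard amino-acid letters plus `X` (unknown) are acceptable input; the helper `clean_protein_sequence` is hypothetical and not part of `esm3_pipeline`:

```python
import re

# 20 standard amino acids plus X for unknown residues (an assumption;
# widen the set if your sequences use other codes such as U or B).
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")

def clean_protein_sequence(raw: str) -> str:
    """Normalize a pasted sequence: drop FASTA headers, whitespace and trailing '*'.

    Raises ValueError if the result is empty or contains unexpected letters.
    """
    lines = [ln for ln in raw.splitlines() if not ln.startswith(">")]
    seq = re.sub(r"\s+", "", "".join(lines)).upper().rstrip("*")
    if not seq:
        raise ValueError("empty sequence")
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError(f"unexpected residue letters: {''.join(bad)}")
    return seq
```

Rejecting bad input here gives a clearer error than a failure later during structure prediction.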


In [None]:
# Cell 5 - Quick filter and selection of candidates (modular)
print('=' * 60)
print('🔍 Step 2: Filter and Select Candidates')
print('=' * 60)

try:
    records = list(SeqIO.parse(str(fasta_all), 'fasta'))
    print(f'Total sequences loaded: {len(records)}')

    if len(records) == 0:
        print('❌ No sequences found. Please run Cell 4 first.')
        raise ValueError("No sequences to process")

    # Show length distribution
    lengths = [len(r.seq) for r in records]
    print(f'Sequence length range: {min(lengths)} - {max(lengths)} aa')
    print()

    # Get filter parameters with defaults
    print('Filter parameters (press Enter to use defaults):')
    min_len = input('  Min length (aa) [default 50]: ').strip()
    min_len = int(min_len) if min_len else 50

    max_len = input('  Max length (aa) [default 1200]: ').strip()
    max_len = int(max_len) if max_len else 1200

    num = input('  Number of candidates to predict [default 10]: ').strip()
    num = int(num) if num else 10

    print()
    print(f'Filtering: length {min_len}-{max_len} aa, selecting up to {num} sequences...')

    selected_fasta = BASE / 'selected.faa'
    selected = filter_and_select(fasta_all, min_len, max_len, num, selected_fasta)

    print(f'✅ Selected {len(selected)} candidates saved to {selected_fasta}')

    if len(selected) == 0:
        print('⚠️ No sequences match the filter criteria. Try adjusting the length range.')
    else:
        print()
        print('Next: Run Cell 6 to load the ESM3 model.')

except Exception as e:
    print(f'❌ Error during selection: {e}')
    raise
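
The implementation of `filter_and_select` lives in `esm3_pipeline.seq_parser` and is not shown in this notebook. To illustrate the kind of length filter the prompts in Cell 5 describe, here is a minimal sketch on plain `(id, sequence)` tuples. It assumes the first `num` matches are kept in input order, which may differ from what the real function does; `select_by_length` is a hypothetical name:

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (sequence id, amino-acid sequence)

def select_by_length(records: Iterable[Record], min_len: int, max_len: int, num: int) -> List[Record]:
    """Keep up to `num` records whose length lies within [min_len, max_len]."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    selected: List[Record] = []
    for rec_id, seq in records:
        if len(selected) >= num:
            break
        if min_len <= len(seq) <= max_len:
            selected.append((rec_id, seq))
    return selected

demo = [("short", "MK"), ("ok1", "M" * 60), ("long", "M" * 2000), ("ok2", "M" * 100)]
print([rid for rid, _ in select_by_length(demo, 50, 1200, 10)])  # ['ok1', 'ok2']
```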


In [None]:
# Cell 6 - Load ESM3-sm model (modular)
print('=' * 60)
print('🧬 Step 3: Load ESM3 Structure Prediction Model')
print('=' * 60)

# Check GPU availability
import torch as _torch
if _torch.cuda.is_available():
    print(f'✅ GPU detected: {_torch.cuda.get_device_name(0)}')
    print(f'   Memory: {_torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
else:
    print('⚠️ No GPU detected. Model will run on CPU (slower).')
    print('   Tip: In Colab, enable GPU via Runtime → Change runtime type → GPU')

print()
print('Loading ESM3-sm model (this may take a few minutes)...')

try:
    model, device = load_esm3_small()
    print(f'✅ Model loaded successfully on {device}')
    print()
    print('Next: Run Cell 7 to predict protein structures.')
except Exception as e:
    print(f'❌ Failed to load model: {e}')
    print('Common issues:')
    print('  - Insufficient GPU memory (try restarting runtime)')
    print('  - Network timeout (try running the cell again)')
    print('  - Missing HuggingFace token (re-run Cell 2)')
    raise


In [None]:
# Cell 7 - Predict structures with ESM3-sm (modular)
print('=' * 60)
print('🔮 Step 4: Predict Protein Structures')
print('=' * 60)

try:
    selected_records = list(SeqIO.parse(str(selected_fasta), 'fasta'))
    print(f'Predicting structures for {len(selected_records)} proteins...')
    print('⏳ This may take several minutes depending on sequence length and GPU.')
    print()

    for i, rec in enumerate(selected_records, 1):
        print(f'  [{i}/{len(selected_records)}] {rec.id} ({len(rec.seq)} aa)')

    print()
    predict_pdbs(model, selected_records, PDB_DIR)

    # Count successful predictions
    pdb_files = list(PDB_DIR.glob('*.pdb'))
    print()
    print(f'✅ Generated {len(pdb_files)} PDB files in {PDB_DIR}')
    print()
    print('Next: Run Cell 8 to download and setup P2Rank for pocket detection.')

except FileNotFoundError:
    print('❌ Selected sequences file not found. Please run Cell 5 first.')
    raise
except Exception as e:
    print(f'❌ Structure prediction failed: {e}')
    print('Tip: If you see CUDA out of memory, try:')
    print('  - Reducing the number of sequences in Cell 5')
    print('  - Restarting the runtime to free GPU memory')
    raise


In [None]:
# Cell 8 - Download and setup P2Rank 2.5.x (modular)
print('=' * 60)
print('🎯 Step 5: Setup P2Rank for Pocket Detection')
print('=' * 60)

try:
    # Check Java first
    print('Checking Java installation...')
    !java -version
    print()

    print('Downloading P2Rank (this may take a moment)...')
    P2_JAR = ensure_p2rank(BASE)

    if P2_JAR is None:
        print('❌ Failed to locate p2rank.jar under', BASE / 'p2rank')
        print('Please check your internet connection and try again.')
    else:
        print(f'✅ P2Rank ready at: {P2_JAR}')
        print()
        print('Next: Run Cell 9 to detect binding pockets.')

except Exception as e:
    print(f'❌ P2Rank setup failed: {e}')
    print('Common issues:')
    print('  - Java not installed (should be installed in Cell 1 on Colab)')
    print('  - Network timeout during download')
    print('  - Insufficient disk space')
    raise



In [None]:
# Cell 9 - Run P2Rank on predicted PDBs and extract top pocket centers (modular)
print('=' * 60)
print('🎯 Step 6: Detect Binding Pockets')
print('=' * 60)

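POCKET_RESULTS = []
# Fall back to None if Cell 8 was skipped, so the hint below is shown instead of a NameError
P2_JAR = globals().get('P2_JAR')
if P2_JAR is not None: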
    try:
        print(f'Running P2Rank on PDB files in {PDB_DIR}...')
        print('⏳ This may take a few minutes...')
        print()

        results = run_p2rank_on_pdbs(P2_JAR, PDB_DIR)
        import pandas as _pd
        pockets_df = _pd.DataFrame(results)

        if not pockets_df.empty:
            pockets_df.to_csv(BASE / 'pockets_summary.csv', index=False)
            print(f'✅ Found {len(pockets_df)} pockets across {pockets_df["pdb"].nunique()} structures')
            print(f'   Results saved to: {BASE / "pockets_summary.csv"}')
            print()
            print('Next: Run Cell 10 to prepare a ligand for docking.')
        else:
            print('⚠️ No pockets detected. Check PDB files and P2Rank output.')

    except Exception as e:
        print(f'❌ Pocket detection failed: {e}')
        pockets_df = pd.DataFrame()
        raise
else:
    pockets_df = pd.DataFrame()
    print('⚠️ Skipping P2Rank: p2rank.jar not found (run Cell 8 first)')
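
`run_p2rank_on_pdbs` hides how pocket centers are read from the P2Rank output. A minimal sketch of that step, assuming a predictions CSV with columns named `rank`, `center_x`, `center_y` and `center_z` whose headers and values may be padded with spaces. That layout is my understanding of P2Rank's per-structure predictions file and should be checked against your actual output; `top_pocket_centers` is a hypothetical helper:

```python
import csv
import io

def top_pocket_centers(csv_text: str, top_n: int = 3):
    """Return [(rank, (x, y, z)), ...] for the best-ranked pockets.

    Assumes columns rank, center_x, center_y, center_z (possibly space-padded).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    header = [h.strip() for h in header]
    pockets = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        rec = dict(zip(header, (cell.strip() for cell in row)))
        center = (float(rec["center_x"]), float(rec["center_y"]), float(rec["center_z"]))
        pockets.append((int(rec["rank"]), center))
    pockets.sort(key=lambda p: p[0])  # rank 1 is the best pocket
    return pockets[:top_n]

# Made-up sample rows for illustration only.
sample = """name     ,  rank,   score, center_x, center_y, center_z
pocket2  ,     2,    3.10,   -4.5,     0.0,     7.25
pocket1  ,     1,    9.80,   12.0,     3.5,    -1.0
"""
print(top_pocket_centers(sample, top_n=1))  # [(1, (12.0, 3.5, -1.0))]
```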


In [None]:
# Cell 10 - Prepare ligand: SMILES or uploaded file (modular)
print('=' * 60)
print('💊 Step 7: Prepare Ligand for Docking')
print('=' * 60)
print('You can provide either:')
print('  - A SMILES string (e.g., "CCO" for ethanol)')
print('  - A path to a ligand file (SDF, MOL, PDB)')
print('  - Leave blank to skip docking')
print()

lig_in = input('Enter ligand SMILES or file path (or press Enter to skip): ').strip()

if lig_in:
    try:
        lig_pdbqt = smiles_or_file_to_pdbqt(lig_in, BASE)
        if lig_pdbqt:
            print(f'✅ Ligand prepared successfully: {lig_pdbqt}')
            print()
            print('Next: Run Cell 11 to perform docking.')
        else:
            print('❌ Failed to prepare ligand. Check input and try again.')
    except Exception as e:
        print(f'❌ Ligand preparation failed: {e}')
        print('Tips:')
        print('  - Ensure SMILES is valid')
        print('  - Ensure file exists and is in a supported format')
        print('  - OpenBabel must be installed (should be in Cell 1)')
        lig_pdbqt = None
        raise
else:
    lig_pdbqt = None
    print('ℹ️ No ligand provided; docking will be skipped.')


In [None]:
# Cell 11 - Vina docking into top pocket centers (modular)
print('=' * 60)
print('🔬 Step 8: Molecular Docking with AutoDock Vina')
print('=' * 60)

if lig_pdbqt is not None and not pockets_df.empty:
    try:
        print(f'Running docking into {len(pockets_df)} pockets...')
        print('⏳ This may take several minutes...')
        print()

        dfg = run_vina(lig_pdbqt, pockets_df, BASE)
        dfg.to_csv(BASE / 'vina_results.csv', index=False)

        # Show results summary
        successful = dfg['affinity'].notna().sum()
        print(f'✅ Docking completed: {successful}/{len(dfg)} successful')

        if successful > 0:
            best_idx = dfg['affinity'].idxmin()
            best_row = dfg.loc[best_idx]
            print(f'   Best affinity: {best_row["affinity"]:.2f} kcal/mol')
            print(f'   PDB: {Path(best_row["pdb"]).name}')
            print(f'   Pocket: {best_row["pocket_rank"]}')

        print(f'   Results saved to: {BASE / "vina_results.csv"}')
        print()
        print('Next: Run Cell 12 to generate the final report.')

    except Exception as e:
        print(f'❌ Docking failed: {e}')
        print('Tips:')
        print('  - Ensure AutoDock Vina is installed (should be in Cell 1)')
        print('  - Check that pockets were detected successfully')
        raise
else:
    if lig_pdbqt is None:
        print('ℹ️ Skipping docking: no ligand provided')
    elif pockets_df.empty:
        print('⚠️ Skipping docking: no pockets detected')
    else:
        print('ℹ️ Skipping docking: missing requirements')
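
`run_vina` wraps the docking call and its internals are not shown here. A minimal sketch of how a search box around a pocket center maps onto AutoDock Vina's command-line options. The helper `build_vina_command`, the 20 Å box edge, the exhaustiveness value and the file names are illustrative assumptions, and the receptor is assumed to be in PDBQT format already:

```python
from typing import List, Sequence

def build_vina_command(receptor_pdbqt: str, ligand_pdbqt: str, center: Sequence[float],
                       out_pdbqt: str, box_size: float = 20.0,
                       exhaustiveness: int = 8) -> List[str]:
    """Build an argument list for the `vina` CLI with a cubic search box.

    box_size (in Å) and exhaustiveness are illustrative defaults, not values
    taken from the notebook.
    """
    if len(center) != 3:
        raise ValueError("center must have exactly three coordinates")
    cmd = ["vina", "--receptor", receptor_pdbqt, "--ligand", ligand_pdbqt]
    for axis, value in zip("xyz", center):
        cmd += [f"--center_{axis}", f"{float(value):.3f}"]
    for axis in "xyz":
        cmd += [f"--size_{axis}", f"{float(box_size):.3f}"]
    cmd += ["--exhaustiveness", str(exhaustiveness), "--out", out_pdbqt]
    return cmd

print(" ".join(build_vina_command("protein.pdbqt", "ligand.pdbqt", (12.0, 3.5, -1.0), "docked.pdbqt")))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with file paths.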


In [None]:
# Cell 12 - Generate PDF report (modular)
print('=' * 60)
print('📊 Step 9: Generate Summary Report')
print('=' * 60)

try:
    from pathlib import Path as _Path
    report_path = BASE / 'esm3_results_report.pdf'

    print('Building PDF report...')
    build_report(BASE, PDB_DIR, report_path)

    print(f'✅ Report generated successfully: {report_path}')
    print()
    print('=' * 60)
    print('🎉 Pipeline Complete!')
    print('=' * 60)
    print('Generated files:')
    print(f'  - Report: {report_path}')
    print(f'  - PDB structures: {PDB_DIR}')
    print(f'  - Pockets: {BASE / "pockets_summary.csv"}')
    if lig_pdbqt is not None:
        print(f'  - Docking results: {BASE / "vina_results.csv"}')
    print()
    print('On Colab: Use the file browser (left panel) to download files.')
    print('Locally: Check the output directory:', BASE)

except Exception as e:
    print(f'❌ Report generation failed: {e}')
    print('Note: Some output files may still be available in:', BASE)
    raise


In [None]:
# Cell 13 - Optional: Visualize structures (interactive)
print('=' * 60)
print('👁️ 3D Structure Visualization (Optional)')
print('=' * 60)

visualize = input('Visualize a structure in 3D? (y/N): ').strip().lower() == 'y'

if visualize:
    try:
        import py3Dmol

        # List available PDB files
        pdb_files = sorted(PDB_DIR.glob('*.pdb'))
        if not pdb_files:
            print('❌ No PDB files found to visualize.')
        else:
            print(f'Available PDB files ({len(pdb_files)}):')
            for i, p in enumerate(pdb_files[:10], 1):
                print(f'  {i}. {p.name}')

            if len(pdb_files) > 10:
                print(f'  ... and {len(pdb_files) - 10} more')

            print()
            idx = input(f'Select file number (1-{min(10, len(pdb_files))}) or press Enter for first: ').strip()

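            if idx:
                try:
                    choice = int(idx)
                    # Reject 0 and negatives explicitly: pdb_files[-1] would silently pick the last file
                    if not 1 <= choice <= len(pdb_files):
                        raise IndexError(choice)
                    selected_pdb = pdb_files[choice - 1]
                except (ValueError, IndexError):
                    print('Invalid selection, using first file')
                    selected_pdb = pdb_files[0]
            else:
                selected_pdb = pdb_files[0]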

            print(f'Visualizing: {selected_pdb.name}')

            # Create viewer
            view = py3Dmol.view(width=800, height=600)

            # Load structure
            with open(selected_pdb) as f:
                pdb_data = f.read()

            view.addModel(pdb_data, 'pdb')
            view.setStyle({'cartoon': {'color': 'spectrum'}})

            # Mark pocket centers if a pocket summary exists
            pock_csv = BASE / 'pockets_summary.csv'
            if pock_csv.exists():
                import pandas as pd
                import ast
                dfp = pd.read_csv(pock_csv)
                # Filter for this PDB; regex=False treats the file stem as a literal string
                pdb_pockets = dfp[dfp['pdb'].astype(str).str.contains(selected_pdb.stem, regex=False)]

                if not pdb_pockets.empty:
                    # Show pocket centers as spheres
                    for _, row in pdb_pockets.head(3).iterrows():
                        try:
                            center = ast.literal_eval(row['center']) if isinstance(row['center'], str) else row['center']
                            cx, cy, cz = center
                            view.addSphere({
                                'center': {'x': cx, 'y': cy, 'z': cz},
                                'radius': 5.0,
                                'color': 'red',
                                'alpha': 0.5
                            })
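                        except (KeyError, ValueError, SyntaxError, TypeError):
                            # Skip rows whose center is missing or cannot be parsed
                            pass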

            view.zoomTo()
            view.show()

            print('✅ Structure loaded. Red spheres (if any) mark predicted pocket centers.')

    except ImportError:
        print('❌ py3Dmol not installed. Install with: pip install py3Dmol')
    except Exception as e:
        print(f'❌ Visualization failed: {e}')
else:
    print('Skipped visualization.')



### Final: Download files
Use the left file browser in Colab to download `esm3_results_report.pdf`, the PDBs in `/content/protflow_runs/pdbs`, and docking outputs in `/content/protflow_runs`.
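
To grab everything in one download, the run directory can be bundled into a single archive first. A minimal sketch using only the standard library; the helper `bundle_outputs`, the archive name and the choice of file extensions are assumptions, not part of the notebook:

```python
import zipfile
from pathlib import Path

def bundle_outputs(base_dir, archive_name="protflow_results.zip",
                   suffixes=(".pdf", ".pdb", ".pdbqt", ".csv")):
    """Zip all result files under base_dir (recursively) and return the archive path."""
    base = Path(base_dir)
    archive = base / archive_name
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(base.rglob("*")):
            # The .zip suffix is not in `suffixes`, so the archive never includes itself
            if path.is_file() and path.suffix.lower() in suffixes:
                zf.write(path, path.relative_to(base).as_posix())
    return archive
```

Calling `bundle_outputs('/content/protflow_runs')` on Colab would leave a single zip in the run directory, which then appears in the file browser like any other output.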
