# Modular pipeline for protein structure prediction, pocket detection, ligand docking, and BGC annotation

**Purpose:**
- Parse `.gbk` files to extract protein CDS sequences
- Use `esm3-sm-open-v1` (ESM3 small) to predict protein structure (PDB)
- Run **P2Rank** for pocket detection
- Prepare ligand (SMILES or upload) and run **AutoDock Vina** docking
- Generate a simple PDF summary report and provide downloadable PDB/PDBQT files
- Optional: Run antiSMASH on GBK/GBFF (if available)

**Notes before running:**
1. This notebook is designed for Google Colab with GPU enabled. Go to `Runtime → Change runtime type → GPU`.
2. You must have a Hugging Face token with access to `esm3-sm-open-v1`. Create one at https://huggingface.co/settings/tokens and grant access to the gated repo if needed.
3. The notebook installs some system packages (OpenBabel, Vina, Java). It may take several minutes.
4. This pipeline uses ESM3-sm (small) to avoid excessive memory requirements on Colab.
5. antiSMASH is optional and not required for the main pipeline. To run the antiSMASH step (Cell 8a), install it first, either in a conda environment or via the Docker wrapper:
   - Bioconda (recommended):
     ```zsh
     conda config --add channels conda-forge
     conda config --add channels bioconda
     conda create -y -n antismash antismash
     conda activate antismash
     download-antismash-databases
     # optional warm-up caches
     antismash --prepare-data
     ```
   - Docker (standalone full image; large download):
     ```zsh
     mkdir -p ~/bin
     curl -q https://dl.secondarymetabolites.org/releases/latest/docker-run_antismash-full > ~/bin/run_antismash
     chmod a+x ~/bin/run_antismash
     # Use absolute paths; args order: input first, then output dir
     run_antismash /abs/path/input.gbk /abs/path/out
     ```
   The notebook will automatically skip the antiSMASH step if it is not installed/available.

> Colab note: we do not use `requirements.txt` here to avoid potential RDKit/NumPy resolver conflicts on Colab. We install the minimal Python packages directly below.


In [None]:
# Cell 1 - Install dependencies (run once)
# Note: `rdkit-pypi` is the deprecated PyPI name; the maintained wheel is published as `rdkit`
!pip install -q esm rdkit biopython py3Dmol tqdm requests reportlab pandas matplotlib huggingface_hub
!apt-get -qq update
!apt-get -qq install -y default-jre openbabel unzip wget python3-tk
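# Install separately so that one unavailable package does not block the other
!apt-get -qq install -y autodock-vina || true
!apt-get -qq install -y fpocket || true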

print('✅ Install finished. Restart the runtime if prompted.')


In [None]:
# Cell 2 - Hugging Face login (enter your token when prompted)
from huggingface_hub import login
import os
HF_TOKEN = os.getenv('HF_TOKEN')
if HF_TOKEN:
    # Non-interactive login when the token is provided via the environment
    login(token=HF_TOKEN)
else:
    print('Paste your Hugging Face token (it will not be stored here).')
    login()  # interactive prompt on Colab / local


In [None]:
# Cell 3 - Setup directories and imports
from pathlib import Path
import os, subprocess, shutil
import pandas as pd
from Bio import SeqIO
from tqdm.auto import tqdm
import torch

# Use /content on Colab, otherwise use current working directory
BASE = Path('/content/esm3_pipeline') if Path('/content').exists() else (Path.cwd() / 'esm3_pipeline')
GBK_DIR = BASE / 'gbk_input'
PDB_DIR = BASE / 'pdbs'

for d in [BASE, GBK_DIR, PDB_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print('Base dir setup at', BASE)


In [None]:
# Cell 3b - Import modular helpers (new)
from esm3_pipeline.seq_parser import extract_proteins_from_gbk, filter_and_select
from esm3_pipeline.esm3_predict import load_esm3_small, predict_pdbs
from esm3_pipeline.p2rank import ensure_p2rank, run_p2rank_on_pdbs
from esm3_pipeline.ligand_prep import smiles_or_file_to_pdbqt
from esm3_pipeline.vina_dock import run_vina
from esm3_pipeline.reporting import build_report
from esm3_pipeline.antismash import is_antismash_available, run_antismash


In [None]:
# Cell 4 - Parse GenBank files and extract protein translations (modular)
fasta_all = BASE / 'all_proteins.faa'
count = extract_proteins_from_gbk(GBK_DIR, fasta_all)
print('Wrote', count, 'protein sequences to', fasta_all)

if count == 0:
    seq = input('No sequences found. Paste a single amino acid sequence (or press Enter to skip): ').strip()
    if seq:
        with open(fasta_all, 'w') as f:
            f.write('>user_sequence\n' + seq + '\n')
        print('Saved single sequence to', fasta_all)
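
Pasted sequences often carry whitespace, line breaks, or a trailing stop `*`. A minimal sketch of a cleanup step that could run before the sequence is written to the FASTA file; `clean_protein_sequence` is a hypothetical helper, not part of `esm3_pipeline`, and the accepted alphabet (20 standard residues plus X, B, Z, U, O) is an assumption:

```python
# Hypothetical helper; the accepted alphabet is an assumption.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWYXBZUO")


def clean_protein_sequence(raw: str) -> str:
    """Normalize a pasted protein sequence or raise ValueError."""
    # Drop all whitespace (including line breaks) and upper-case the rest.
    seq = "".join(raw.split()).upper()
    # A trailing '*' marks a stop codon in many translations.
    seq = seq.rstrip("*")
    if not seq:
        raise ValueError("empty sequence")
    bad = sorted(set(seq) - VALID_AA)
    if bad:
        raise ValueError(f"unexpected characters: {''.join(bad)}")
    return seq
```

In Cell 4 this would wrap the `input(...)` result, so that a malformed paste fails early rather than during structure prediction.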


In [None]:
# Cell 5 - Quick filter and selection of candidates (modular)
records = list(SeqIO.parse(str(fasta_all), 'fasta'))
print('Total sequences loaded:', len(records))

min_len = int(input('Min length (aa) [default 50]: ') or 50)
max_len = int(input('Max length (aa) [default 1200]: ') or 1200)
num = int(input('How many candidates to predict (default 10): ') or 10)
selected_fasta = BASE / 'selected.faa'
selected = filter_and_select(fasta_all, min_len, max_len, num, selected_fasta)
print('Selected', len(selected), 'candidates saved to', selected_fasta)
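
The implementation of `filter_and_select` is not shown in this notebook. As a rough illustration of the length filter it is called with, here is a sketch that works on plain `(id, sequence)` pairs; the name `select_by_length` and the rule "keep the first `num` matches in file order" are assumptions, and the real helper may rank candidates differently:

```python
from typing import Iterable, List, Tuple


def select_by_length(
    records: Iterable[Tuple[str, str]],
    min_len: int,
    max_len: int,
    num: int,
) -> List[Tuple[str, str]]:
    """Return up to `num` (id, sequence) pairs with min_len <= length <= max_len."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    selected: List[Tuple[str, str]] = []
    for rec_id, seq in records:
        # Stop as soon as enough candidates have been collected.
        if len(selected) >= num:
            break
        if min_len <= len(seq) <= max_len:
            selected.append((rec_id, seq))
    return selected
```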


In [None]:
# Cell 6 - Load ESM3-sm model (modular)
model, device = load_esm3_small()
print('Model loaded to', device)


In [None]:
# Cell 7 - Predict structures with ESM3-sm (modular)
selected_records = list(SeqIO.parse(str(selected_fasta), 'fasta'))
# List the queued candidates; prediction itself starts in predict_pdbs below
for rec in selected_records:
    print('Queued for prediction:', rec.id, 'len', len(rec.seq))
predict_pdbs(model, selected_records, PDB_DIR)
print('Saved PDB files to', PDB_DIR)
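
Record IDs taken from GenBank or FASTA headers can contain characters such as `|` or `/` that are awkward in file names. How `predict_pdbs` names its output files is not shown here; the sketch below is one possible way to derive a safe file name per record. `safe_pdb_name` and its 80-character cap are assumptions:

```python
import re


def safe_pdb_name(record_id: str, max_len: int = 80) -> str:
    """Map a sequence record ID to a filesystem-safe PDB file name."""
    # Replace every run of characters other than letters, digits, '.', '-', '_'.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", record_id).strip("._")
    # Fall back to a fixed stem if nothing usable remains.
    stem = stem[:max_len] or "protein"
    return stem + ".pdb"
```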


In [None]:
# Cell 8 - Download and setup P2Rank 2.5.x (modular)
P2_JAR = ensure_p2rank(BASE)
if P2_JAR is None:
    print('❌ Failed to locate p2rank.jar under', BASE / 'p2rank')
else:
    print('P2Rank jar at', P2_JAR)
    !java -version


In [None]:
# Cell 8a - Optional: antiSMASH analysis (if available)
run_as = input('Run antiSMASH on a GBK/GBFF? (y/N): ').strip().lower() == 'y'
if run_as:
    if not is_antismash_available():
        print('⚠️ antiSMASH is not installed in this environment; skipping')
        print('   Install via Bioconda (recommended):')
        print('     conda config --add channels conda-forge && conda config --add channels bioconda')
        print('     conda create -y -n antismash antismash && conda activate antismash')
        print('     download-antismash-databases && antismash --prepare-data')
        print('   Or use the Docker wrapper (full image):')
        print('     curl -q https://dl.secondarymetabolites.org/releases/latest/docker-run_antismash-full > ~/bin/run_antismash && chmod a+x ~/bin/run_antismash')
    else:
        # Choose first GBK/GBFF by default
        gbk_files = sorted(GBK_DIR.glob('*.gbk')) + sorted(GBK_DIR.glob('*.gbff'))
        default_inp = str(gbk_files[0]) if gbk_files else ''
        as_inp = input(f'Path to GBK/GBFF [default {default_inp}]: ').strip() or default_inp
        if as_inp and Path(as_inp).exists():
            out_as = BASE / 'antismash_out'
            res = run_antismash(Path(as_inp), out_as)
            print('antiSMASH results at', res)
        else:
            print('No valid input file; skipped antiSMASH')
else:
    print('antiSMASH step skipped')
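
The availability check in `is_antismash_available` is not shown. A common way to implement such a check is to look for a launcher on `PATH`; the sketch below assumes the two launcher names used in the install notes (`antismash` from Bioconda, `run_antismash` from the Docker wrapper) and uses a hypothetical function name, `find_antismash`:

```python
import shutil
from typing import Optional, Sequence


def find_antismash(
    candidates: Sequence[str] = ("antismash", "run_antismash"),
) -> Optional[str]:
    """Return the path of the first antiSMASH launcher found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

Note that the Docker wrapper is only found this way if `~/bin` is on `PATH`.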


In [None]:
# Cell 9 - Run P2Rank on predicted PDBs and extract top pocket centers (modular)
pockets_df = pd.DataFrame()
if P2_JAR is not None:
    results = run_p2rank_on_pdbs(P2_JAR, PDB_DIR)
    pockets_df = pd.DataFrame(results)
    if not pockets_df.empty:
        pockets_df.to_csv(BASE / 'pockets_summary.csv', index=False)
        print('✅ Saved pockets summary to', BASE / 'pockets_summary.csv')
    else:
        # Only report a saved file when one was actually written
        print('⚠️ P2Rank returned no pockets; nothing saved')
else:
    print('⚠️ Skipping P2Rank: p2rank.jar not found')
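
For orientation, P2Rank writes one `*_predictions.csv` file per structure, with whitespace-padded column names. The sketch below shows how a top pocket center could be read from such a file using only the standard library. The column names (`rank`, `score`, `center_x`, `center_y`, `center_z`) reflect my understanding of P2Rank 2.x output and should be checked against your own files; `top_pocket_center` is a hypothetical name, not the helper used above:

```python
import csv
import io
from typing import Dict, Optional


def top_pocket_center(csv_text: str) -> Optional[Dict[str, float]]:
    """Return center and score of the best-ranked pocket, or None if there is none."""
    reader = csv.reader(io.StringIO(csv_text))
    try:
        # P2Rank pads header cells with spaces, so strip them.
        header = [h.strip() for h in next(reader)]
    except StopIteration:
        return None
    rows = [dict(zip(header, (v.strip() for v in row))) for row in reader if row]
    if not rows:
        return None
    # Rank 1 is the highest-scoring pocket.
    best = min(rows, key=lambda r: int(r["rank"]))
    return {k: float(best[k]) for k in ("center_x", "center_y", "center_z", "score")}
```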


In [None]:
# Cell 10 - Prepare ligand: SMILES or uploaded file (modular)
lig_in = input('Enter ligand SMILES or local path to ligand file (SDF/MOL/PDB). Leave blank to skip docking: ').strip()
lig_pdbqt = smiles_or_file_to_pdbqt(lig_in, BASE) if lig_in else None
if lig_pdbqt:
    print('Ligand prepared at', lig_pdbqt)
else:
    print('No ligand provided; skipping docking')
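
Cell 10 accepts either a SMILES string or a file path. One way to tell them apart and build the corresponding Open Babel call is sketched below. This is not the code of `smiles_or_file_to_pdbqt`; `obabel_command` is a hypothetical name, and the rule "an existing file with a known suffix is a structure file, anything else is SMILES" is an assumption:

```python
from pathlib import Path
from typing import List

LIGAND_SUFFIXES = {".sdf", ".mol", ".pdb"}


def obabel_command(lig_in: str, out_pdbqt: Path) -> List[str]:
    """Build an Open Babel command that converts a ligand to PDBQT."""
    candidate = Path(lig_in)
    if candidate.suffix.lower() in LIGAND_SUFFIXES and candidate.is_file():
        # Structure file: convert and add hydrogens.
        return ["obabel", str(candidate), "-O", str(out_pdbqt), "-h"]
    # Otherwise treat the input as SMILES: '-:' passes the string inline,
    # '--gen3d' generates 3D coordinates, '-h' adds hydrogens.
    return ["obabel", f"-:{lig_in}", "-O", str(out_pdbqt), "--gen3d", "-h"]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes it easy to inspect before running.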


In [None]:
# Cell 11 - Vina docking into top pocket centers (modular)
if lig_pdbqt is not None and not pockets_df.empty:
    dfg = run_vina(lig_pdbqt, pockets_df, BASE)
    dfg.to_csv(BASE / 'vina_results.csv', index=False)
    print('✅ Vina docking completed; results at', BASE / 'vina_results.csv')
else:
    print('Skipping docking: ligand or pockets missing')
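
To make the docking step more concrete, the sketch below builds an AutoDock Vina command for a cubic search box centered on a pocket and reads the best affinity from Vina's output PDBQT. It is not the code of `run_vina`: the function names, the 20 Å box edge, and the exhaustiveness of 8 are assumptions, and the receptor is assumed to have been converted to PDBQT already:

```python
import re
from pathlib import Path
from typing import List, Optional, Sequence


def vina_command(
    receptor: Path,
    ligand: Path,
    center: Sequence[float],
    out: Path,
    box_size: float = 20.0,  # assumed cubic box edge in Angstrom
    exhaustiveness: int = 8,
) -> List[str]:
    """Build a Vina command for a cubic box centered on a pocket."""
    cx, cy, cz = center
    cmd = ["vina", "--receptor", str(receptor), "--ligand", str(ligand)]
    for axis, value in zip("xyz", (cx, cy, cz)):
        cmd += [f"--center_{axis}", f"{value:.3f}"]
    for axis in "xyz":
        cmd += [f"--size_{axis}", f"{box_size:.3f}"]
    cmd += ["--exhaustiveness", str(exhaustiveness), "--out", str(out)]
    return cmd


# Each docked pose is annotated with a 'REMARK VINA RESULT:' line.
_RESULT = re.compile(r"^REMARK VINA RESULT:\s+(-?\d+(?:\.\d+)?)", re.MULTILINE)


def best_affinity(pdbqt_text: str) -> Optional[float]:
    """Return the lowest (best) affinity in kcal/mol, or None if no poses exist."""
    scores = [float(s) for s in _RESULT.findall(pdbqt_text)]
    return min(scores) if scores else None
```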


In [None]:
# Cell 12 - Generate PDF report (modular)
report_path = BASE / 'esm3_results_report.pdf'
build_report(BASE, PDB_DIR, report_path)
print('✅ Report built at', report_path)


### Final: Download files
Use the left file browser in Colab to download `esm3_results_report.pdf`, the PDBs in `/content/esm3_pipeline/pdbs`, and docking outputs in `/content/esm3_pipeline`.
