# Modular pipeline for protein structure prediction, pocket detection, ligand docking, and BGC annotation

**Purpose:**
- Parse `.gbk` files to extract protein CDS sequences
- Use `esm3-sm-open-v1` (ESM3 small) to predict protein structure (PDB)
- Run **P2Rank** for pocket detection
- Prepare ligand (SMILES or upload) and run **AutoDock Vina** docking
- Generate a simple PDF summary report and provide downloadable PDB/PDBQT files
- Optional: Run antiSMASH on GBK/GBFF (if available)

**Before you start (read me):**
- Runtime: Designed for Google Colab with GPU enabled (Runtime → Change runtime type → GPU). Works locally too with minor adjustments.
- Hugging Face: You need a token for `esm3-sm-open-v1`. Create one at https://huggingface.co/settings/tokens and request access to the gated model repo if prompted.
- System deps: The Colab install cell (Cell 1) installs Java, Open Babel, AutoDock Vina, and fpocket via apt. Set `INSTALL_SYSTEM_DEPS = False` in Cell 0a to skip this if you prefer Conda.
- ESM3 size: We use the small model to keep memory low on Colab.
- antiSMASH: Optional. You can use a Conda env or Docker. Notebook will skip if not available.
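
Before running the later cells it can help to confirm that the external tools are actually on `PATH`. The check below is a minimal sketch, not part of the repository: the executable names `java`, `obabel`, and `vina` are assumptions about how the packages are installed on your system.

```python
import shutil

# Executable names are assumptions: `java` (needed by P2Rank), `obabel` (Open Babel),
# and `vina` (AutoDock Vina). Adjust them if your installation uses different names.
REQUIRED_TOOLS = ["java", "obabel", "vina"]

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools found.")
```

If a tool is reported missing, rerun the install cell or install it through Conda before continuing.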

Quick switches (optional toggles):
- Set booleans in the "Cell 0a - Config" cell to enable/disable steps without editing code elsewhere:
  - `INSTALL_SYSTEM_DEPS`: Install system packages (apt) on Colab. Default: True on Colab.
  - `INSTALL_MINICONDA`: Install Miniconda via wget (Linux) and prepare Conda. Default: False.
  - `ACCEPT_CONDA_TOS`: Automatically accept Conda ToS for required channels if installing Miniconda. Default: True.
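
If you would rather not edit the config cell on every run, one possible pattern is to read the switches from environment variables and fall back to the defaults. This is a sketch only; the notebook itself does not do this, and `env_flag` is a hypothetical helper.

```python
import os

def env_flag(name, default):
    """Read a boolean from the environment; fall back to `default` when unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

# Hypothetical usage: environment variables named after the switches above.
INSTALL_MINICONDA = env_flag("INSTALL_MINICONDA", False)
ACCEPT_CONDA_TOS = env_flag("ACCEPT_CONDA_TOS", True)
print("INSTALL_MINICONDA =", INSTALL_MINICONDA)
print("ACCEPT_CONDA_TOS =", ACCEPT_CONDA_TOS)
```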

Conda installation (Linux/Colab):
- If you want to use Conda on Linux/Colab, you can install Miniconda and accept the Conda ToS for Anaconda channels.
- These steps are available as a runnable cell below (Cell 1a). The core commands are:

```bash
# Download Miniconda (Linux x86_64)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Non-interactive install to a prefix (example)
bash Miniconda3-latest-Linux-x86_64.sh -b -p "$HOME/miniconda"
# Accept Conda ToS for Anaconda main and R channels
"$HOME/miniconda/bin/conda" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
"$HOME/miniconda/bin/conda" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
```

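Tip: To use Conda on Linux/Colab, first install Miniconda with the wget command above, then run `conda tos accept` to accept the terms for the two Anaconda channels listed (main and r). Alternatively, turn on the `INSTALL_MINICONDA` switch in this notebook and Cell 1a will handle both the installation and the ToS acceptance automatically.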

> Colab note: we do not use `requirements.txt` here to avoid potential RDKit/NumPy resolver conflicts on Colab. We install the minimal Python packages directly below (or you can toggle on Conda install).


In [None]:
# Cell 0 - Get code (Colab)
# If running on Colab (/content exists), clone the repo into /content/ProtFlow if not present,
# change directory, and remove any shadowing data folder.
from pathlib import Path as __Path
if __Path('/content').exists():
    # shell and magic are available in IPython/Colab
    !test -d /content/ProtFlow || git clone https://github.com/AsagiriBeta/ProtFlow.git /content/ProtFlow
    %cd /content/ProtFlow
    !rm -rf /content/esm3_pipeline || true
    print('Repo ready at /content/ProtFlow')
else:
    print('Not running on Colab; skipping clone step.')


In [None]:
# Cell 0a - Config: optional toggles
from pathlib import Path as ___Path
ON_COLAB = ___Path('/content').exists()
# Toggle whether to install system packages (apt) on Colab
INSTALL_SYSTEM_DEPS = ON_COLAB
# Toggle whether to install Miniconda and accept Conda ToS (Linux only)
INSTALL_MINICONDA = False
# Toggle whether to auto-accept Conda TOS (only used if INSTALL_MINICONDA=True)
ACCEPT_CONDA_TOS = True
# Where to install Miniconda if enabled
CONDA_PREFIX_DIR = '/content/miniconda' if ON_COLAB else str(___Path.home() / 'miniconda')
print('ON_COLAB =', ON_COLAB)
print('INSTALL_SYSTEM_DEPS =', INSTALL_SYSTEM_DEPS)
print('INSTALL_MINICONDA =', INSTALL_MINICONDA)
print('ACCEPT_CONDA_TOS =', ACCEPT_CONDA_TOS)
print('CONDA_PREFIX_DIR =', CONDA_PREFIX_DIR)


In [None]:
# Cell 1 - Install dependencies (run once)
!pip install -q esm rdkit biopython py3Dmol tqdm requests reportlab pandas matplotlib huggingface_hub

from pathlib import Path as __P0
if INSTALL_SYSTEM_DEPS and __P0('/content').exists():
    !apt-get -qq update
    !apt-get -qq install -y default-jre openbabel unzip wget python3-tk
    !apt-get -qq install -y autodock-vina fpocket || true
    print('✅ System packages installed (Colab).')
else:
    print('ℹ️ Skipping system package install (set INSTALL_SYSTEM_DEPS=True on Colab to enable).')

print('✅ Python deps install finished. Restart the runtime if prompted.')


In [None]:
# Cell 1a - Optional: Install Miniconda and accept Conda ToS
import sys as __sys, os as __os, subprocess as __sp, pathlib as __pl
if INSTALL_MINICONDA:
    ON_LINUX = __sys.platform.startswith('linux')
    prefix = CONDA_PREFIX_DIR
    __os.makedirs(prefix, exist_ok=True)
    if ON_LINUX:
        # Download the Miniconda installer (Linux x86_64); check=True stops on a failed download
        __sp.run(['wget', '-q', 'https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh', '-O', '/tmp/miniconda.sh'], check=True)
        # Run installer non-interactively. The -u flag is needed because the prefix
        # directory already exists (created above), which batch mode otherwise rejects;
        # it also lets this cell be rerun safely.
        __sp.run(['bash', '/tmp/miniconda.sh', '-b', '-u', '-p', prefix], check=True)
        conda_bin = str(__pl.Path(prefix) / 'bin' / 'conda')
        if ACCEPT_CONDA_TOS and __os.path.exists(conda_bin):
            __sp.run([conda_bin, 'tos', 'accept', '--override-channels', '--channel', 'https://repo.anaconda.com/pkgs/main'], check=True)
            __sp.run([conda_bin, 'tos', 'accept', '--override-channels', '--channel', 'https://repo.anaconda.com/pkgs/r'], check=True)
        print('✅ Miniconda installed at', prefix)
    else:
        print('⚠️ Non-Linux platform detected; please use the appropriate Miniconda installer for your OS:')
        print('   macOS (Intel): https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh')
        print('   macOS (Apple Silicon): https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh')
else:
    print('ℹ️ Miniconda installation skipped (INSTALL_MINICONDA=False).')


In [None]:
# Cell 2 - Hugging Face login (enter your token when prompted)
from huggingface_hub import login
import os
HF_TOKEN = os.getenv('HF_TOKEN')
if HF_TOKEN:
    # Non-interactive login using the HF_TOKEN environment variable
    login(token=HF_TOKEN)
else:
    print('Paste your Hugging Face token when prompted (it is not written into this notebook).')
    login()  # interactive prompt on Colab / local


In [None]:
# Cell 3 - Setup directories and imports
from pathlib import Path
import os, subprocess, shutil
import pandas as pd
from Bio import SeqIO
from tqdm.auto import tqdm
import torch

# Use a separate run directory to avoid clashing with the package name
BASE = Path('/content/protflow_runs') if Path('/content').exists() else (Path.cwd() / 'runs')
GBK_DIR = BASE / 'gbk_input'
PDB_DIR = BASE / 'pdbs'

for d in [BASE, GBK_DIR, PDB_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print('Base dir setup at', BASE)


In [None]:
# Cell 3b - Import modular helpers (new)
# Ensure the repository root is at the front of sys.path so imports resolve to the package, not a shadowing data folder.
from pathlib import Path as _Path  # local import to avoid ordering issues if user skips cells
import sys as _sys, importlib as _importlib, os as _os

_repo_candidates = [_Path.cwd(), _Path('/content/ProtFlow')]
_repo_root = None
for _cand in _repo_candidates:
    if (_cand / 'esm3_pipeline' / '__init__.py').exists():
        _repo_root = _cand
        break

if _repo_root is not None:
    _repo_root_str = str(_repo_root)
    # Ensure repo is first on sys.path
    if _repo_root_str in _sys.path:
        _sys.path.remove(_repo_root_str)
    _sys.path.insert(0, _repo_root_str)
    # Purge stale cached modules from previous failed imports
    for _m in list(_sys.modules):
        if _m == 'esm3_pipeline' or _m.startswith('esm3_pipeline.'):
            _sys.modules.pop(_m, None)
    _importlib.invalidate_caches()
else:
    print('⚠️ Could not find esm3_pipeline package; ensure the repo is cloned (e.g., /content/ProtFlow).')

# Warn if a top-level shadowing folder exists on Colab
_shadow = _Path('/content/esm3_pipeline')
if _shadow.exists():
    print('ℹ️ Note: /content/esm3_pipeline exists; using repo package from', _repo_root)

from esm3_pipeline.seq_parser import extract_proteins_from_gbk, filter_and_select
from esm3_pipeline.esm3_predict import load_esm3_small, predict_pdbs
from esm3_pipeline.p2rank import ensure_p2rank, run_p2rank_on_pdbs
from esm3_pipeline.ligand_prep import smiles_or_file_to_pdbqt
from esm3_pipeline.vina_dock import run_vina
from esm3_pipeline.reporting import build_report
from esm3_pipeline.antismash import is_antismash_available, run_antismash, get_runner


In [None]:
# Cell 4 - Parse GenBank files and extract protein translations (modular)
fasta_all = BASE / 'all_proteins.faa'
count = extract_proteins_from_gbk(GBK_DIR, fasta_all)
print('Wrote', count, 'protein sequences to', fasta_all)

if count == 0:
    seq = input('No sequences found. Paste a single amino acid sequence (or press Enter to skip): ').strip()
    if seq:
        with open(fasta_all, 'w') as f:
            f.write('>user_sequence\n' + seq + '\n')
        print('Saved single sequence to', fasta_all)
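

For orientation, protein sequences in a GenBank file are stored in the `/translation` qualifier of each CDS feature. The sketch below pulls them out with a regular expression from the standard library. It is an illustration only and not how `extract_proteins_from_gbk` is implemented; a full parser such as Biopython's `SeqIO` is the more robust route. The `cds_N` IDs and the sample record are made up for the example.

```python
import re

# Matches a /translation="..." qualifier, which may wrap across several lines.
_TRANSLATION_RE = re.compile(r'/translation="([^"]+)"')

def translations_from_genbank_text(text):
    """Return protein sequences found in /translation qualifiers, in file order."""
    # Line breaks and indentation inside the quotes are layout, not sequence.
    return ["".join(m.split()) for m in _TRANSLATION_RE.findall(text)]

def to_fasta(seqs, prefix="cds"):
    """Format sequences as FASTA text with hypothetical IDs cds_1, cds_2, ..."""
    return "".join(f">{prefix}_{i}\n{s}\n" for i, s in enumerate(seqs, start=1))

# Made-up feature-table fragment for illustration.
sample = '''     CDS             1..48
                     /locus_tag="demo_0001"
                     /translation="MKVLAAGIV
                     ALLTRK"
'''
print(to_fasta(translations_from_genbank_text(sample)), end="")
```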


In [None]:
# Cell 5 - Quick filter and selection of candidates (modular)
records = list(SeqIO.parse(str(fasta_all), 'fasta'))
print('Total sequences loaded:', len(records))

min_len = int(input('Min length (aa) [default 50]: ') or 50)
max_len = int(input('Max length (aa) [default 1200]: ') or 1200)
num = int(input('How many candidates to predict (default 10): ') or 10)
selected_fasta = BASE / 'selected.faa'
selected = filter_and_select(fasta_all, min_len, max_len, num, selected_fasta)
print('Selected', len(selected), 'candidates saved to', selected_fasta)
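

The length filter can be pictured as follows. This is a standard-library sketch under the assumption that candidates are kept in file order; `filter_and_select` in the repository may rank or select differently. `read_fasta` and `select_by_length` are hypothetical names.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (id, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            parts = line[1:].split()  # the ID is the first token of the header
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def select_by_length(records, min_len, max_len, num):
    """Keep records with min_len <= length <= max_len, then take the first `num`."""
    kept = [(rid, seq) for rid, seq in records if min_len <= len(seq) <= max_len]
    return kept[:num]

# Made-up sequences for illustration.
demo = ">a\nMKV\n>b desc\nMKVLAA\nGIVALL\n>c\nMKVLAAG\n"
print(select_by_length(read_fasta(demo), min_len=5, max_len=20, num=1))
```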


In [None]:
# Cell 6 - Load ESM3-sm model (modular)
model, device = load_esm3_small()
print('Model loaded to', device)


In [None]:
# Cell 7 - Predict structures with ESM3-sm (modular)
selected_records = list(SeqIO.parse(str(selected_fasta), 'fasta'))
for rec in selected_records:
    print('Queued for prediction:', rec.id, 'len', len(rec.seq))
predict_pdbs(model, selected_records, PDB_DIR)
print('Saved PDB files to', PDB_DIR)


In [None]:
# Cell 8 - Download and setup P2Rank 2.5.x (modular)
P2_JAR = ensure_p2rank(BASE)
if P2_JAR is None:
    print('❌ Failed to locate p2rank.jar under', BASE / 'p2rank')
else:
    print('P2Rank jar at', P2_JAR)
    !java -version


In [None]:
# Cell 8a0 - Optional: Install antiSMASH via micromamba (Colab)
from pathlib import Path as __P
import os as __os, shutil as __shutil
if __P('/content').exists():
    do_install = input('Install antiSMASH env via micromamba now? (y/N): ').strip().lower() == 'y'
    if do_install:
        # Install micromamba into /content/bin if missing
        if not __shutil.which('micromamba'):
            !mkdir -p /content/bin
            !curl -Ls "https://micro.mamba.pm/api/micromamba/linux-64/latest" -o /content/micromamba.tar.bz2
            # The archive stores the binary as bin/micromamba, so this extracts it directly to /content/bin/micromamba
            !tar -xjf /content/micromamba.tar.bz2 -C /content bin/micromamba
            __os.environ['PATH'] = '/content/bin:' + __os.environ.get('PATH','')
        # Create env and install antiSMASH
        !micromamba create -y -n antismash -c conda-forge -c bioconda antismash
        # Download databases
        !micromamba run -n antismash download-antismash-databases
        print('antiSMASH env ready. Proceed to Cell 8a to run it.')
else:
    print('Skipping micromamba setup (not Colab).')


In [None]:
# Cell 8a - Optional: antiSMASH analysis (if available)
# You can specify a conda env name via ANTISMASH_ENV (default 'antismash').
import os as __os
__os.environ.setdefault('ANTISMASH_ENV', 'antismash')
run_as = input('Run antiSMASH on a GBK/GBFF? (y/N): ').strip().lower() == 'y'
if run_as:
    runner = get_runner()
    if not is_antismash_available():
        print('⚠️ antiSMASH is not available in this environment; skipping')
        if runner is None:
            print('   Tips:')
            print('     - Use Bioconda to create an env: conda create -n antismash -c conda-forge -c bioconda antismash, then run download-antismash-databases inside it')
            print('     - Then set ANTISMASH_ENV=antismash and ensure conda/mamba is on PATH')
            print('     - Or install the Docker wrapper: run_antismash (full image)')
        else:
            print('   Detected runner but failed to probe with --help:', ' '.join(runner))
    else:
        print('antiSMASH runner:', ' '.join(runner))
        # Choose first GBK/GBFF by default
        gbk_files = sorted(GBK_DIR.glob('*.gbk')) + sorted(GBK_DIR.glob('*.gbff'))
        default_inp = str(gbk_files[0]) if gbk_files else ''
        as_inp = input(f'Path to GBK/GBFF [default {default_inp}]: ').strip() or default_inp
        if as_inp and Path(as_inp).exists():
            out_as = BASE / 'antismash_out'
            res = run_antismash(Path(as_inp), out_as)
            print('antiSMASH results at', res)
        else:
            print('No valid input file; skipped antiSMASH')
else:
    print('antiSMASH step skipped')


In [None]:
# Cell 9 - Run P2Rank on predicted PDBs and extract top pocket centers (modular)
if P2_JAR is not None:
    results = run_p2rank_on_pdbs(P2_JAR, PDB_DIR)
    pockets_df = pd.DataFrame(results)
    if not pockets_df.empty:
        pockets_df.to_csv(BASE / 'pockets_summary.csv', index=False)
        print('✅ Saved pockets summary to', BASE / 'pockets_summary.csv')
    else:
        # Only report success when a file was actually written
        print('⚠️ P2Rank returned no pockets; no summary file written')
else:
    pockets_df = pd.DataFrame()
    print('⚠️ Skipping P2Rank: p2rank.jar not found')
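

To see what "top pocket center" means in practice, the sketch below reads a P2Rank-style predictions table and returns the coordinates of the best-ranked pocket. It assumes the column names `rank`, `center_x`, `center_y`, and `center_z` and space-padded headers, as in P2Rank's `*_predictions.csv` output; the sample rows are invented, and `run_p2rank_on_pdbs` may extract this differently.

```python
import csv
import io

def top_pocket_center(predictions_csv_text):
    """Return (x, y, z) of the best-ranked pocket, or None if no pockets are listed."""
    reader = csv.reader(io.StringIO(predictions_csv_text))
    header = next(reader, None)
    if header is None:
        return None
    cols = [h.strip() for h in header]  # column names are padded with spaces
    rows = [dict(zip(cols, (v.strip() for v in row))) for row in reader if row]
    if not rows:
        return None
    best = min(rows, key=lambda r: int(r["rank"]))
    return tuple(float(best[k]) for k in ("center_x", "center_y", "center_z"))

# Invented values for illustration only.
sample = (
    "name     ,  rank,   score, probability,   center_x,   center_y,   center_z\n"
    "pocket1  ,     1,   12.50,       0.720,    10.5000,    -3.2500,     7.0000\n"
    "pocket2  ,     2,    4.10,       0.150,     1.0000,     2.0000,     3.0000\n"
)
print(top_pocket_center(sample))
```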


In [None]:
# Cell 10 - Prepare ligand: SMILES or uploaded file (modular)
lig_in = input('Enter ligand SMILES or local path to ligand file (SDF/MOL/PDB). Leave blank to skip docking: ').strip()
lig_pdbqt = smiles_or_file_to_pdbqt(lig_in, BASE) if lig_in else None
if lig_pdbqt:
    print('Ligand prepared at', lig_pdbqt)
else:
    print('No ligand provided; skipping docking')


In [None]:
# Cell 11 - Vina docking into top pocket centers (modular)
if lig_pdbqt is not None and not pockets_df.empty:
    dfg = run_vina(lig_pdbqt, pockets_df, BASE)
    dfg.to_csv(BASE / 'vina_results.csv', index=False)
    print('✅ Vina docking completed; results at', BASE / 'vina_results.csv')
else:
    print('Skipping docking: ligand or pockets missing')
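

Docking into a pocket amounts to centering Vina's search box on the pocket coordinates. The sketch below builds the text of a Vina config file for one box. The 20 Å box edge, the file names, and the helper name `vina_config` are assumptions for illustration; `run_vina` in the repository may choose different settings.

```python
def vina_config(receptor, ligand, center, size=(20.0, 20.0, 20.0), exhaustiveness=8):
    """Build the text of an AutoDock Vina config file for one search box."""
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {cx:.3f}",
        f"center_y = {cy:.3f}",
        f"center_z = {cz:.3f}",
        f"size_x = {sx:.1f}",  # box edge lengths in angstroms
        f"size_y = {sy:.1f}",
        f"size_z = {sz:.1f}",
        f"exhaustiveness = {exhaustiveness}",
    ]
    return "\n".join(lines) + "\n"

# File names and the pocket center below are placeholders.
print(vina_config("protein.pdbqt", "ligand.pdbqt", (10.5, -3.25, 7.0)), end="")
```

Saved to a file, this text can be passed to Vina with `vina --config <file> --out <output.pdbqt>`. Note that the receptor must already be in PDBQT format.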


In [None]:
# Cell 12 - Generate PDF report (modular)
report_path = BASE / 'esm3_results_report.pdf'
build_report(BASE, PDB_DIR, report_path)
print('✅ Report built at', report_path)


### Final: Download files
Use the left file browser in Colab to download `esm3_results_report.pdf`, the PDBs in `/content/protflow_runs/pdbs`, and docking outputs in `/content/protflow_runs`. When running locally, the same files are written under `./runs` in the working directory.
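
If you prefer a single download, the run directory can be bundled into one zip archive. This is a sketch using only the standard library; `bundle_outputs` is a hypothetical helper, and the demo runs on a throwaway directory rather than the real run directory.

```python
import shutil
import tempfile
from pathlib import Path

def bundle_outputs(run_dir, archive_stem):
    """Zip everything under `run_dir`; returns the path of the created .zip file."""
    return Path(shutil.make_archive(str(archive_stem), "zip", root_dir=str(run_dir)))

# Demo on a temporary directory; in the notebook, `run_dir` would be BASE.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "runs"
    (run_dir / "pdbs").mkdir(parents=True)
    (run_dir / "pdbs" / "demo.pdb").write_text("END\n")
    # Write the archive outside run_dir so it does not end up inside itself.
    archive = bundle_outputs(run_dir, Path(tmp) / "protflow_results")
    print(archive.name, archive.exists())
```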
