A generative reasoning system for protein functional annotation — SeqStudio orchestrates heterogeneous evidence from sequence homology, structural similarity, domain architecture, and membrane topology within an LLM framework, semantically integrating outputs from BLAST, InterProScan, Foldseek, and TMHMM into structured annotations and natural-language narratives (with confidence scores and provenance), serializable as JSON / JSONL.
Overview · Methods & figures · Capabilities · Quick start · Installation · Documentation · Citation
The rapid growth of sequencing and structure prediction has made functional annotation a persistent bottleneck. Classical pattern-matching pipelines (similarity search, domain signatures, rule propagation) excel at detecting isolated signals, yet they rarely reconcile heterogeneous or conflicting evidence into mechanistic, curator-style narratives as in UniProtKB/Swiss-Prot.
SeqStudio organizes these computational signals into a unified, multi-view evidence object and performs conditional generation under an LLM with explicit tool outputs and a constrained schema. Reported evaluations include >91% consistency with Swiss-Prot across six functional dimensions, stable behaviour on 2025 newly deposited sequences held out from foundation-model training corpora, and large-scale enrichment on UniProtKB/TrEMBL–like data (median filled functional fields from roughly 1–3 to 5–6, with confidence-based triage). See the manuscript linked below for full benchmarks and methodology.
This repository provides a reproducible annotation stack you can run on a workstation or cluster: evidence collection, GO/Pfam semantic enrichment, prompt orchestration, structured prediction, and JSONL export. Use the two CLI entry points below to annotate FASTA + PDB bundles or UniProt-style JSON.gz shards.
Traditional approaches (BLAST, InterProScan, UniRule/ARBA, etc.) retrieve fragmented signals by similarity or rules; SeqStudio integrates these sources in a generative reasoning framework aimed at expert-like judgment.
Multi-source evidence acquisition → context-driven reasoning → structured report (confidence + provenance) → optional human–AI dialogue.
Evidence sources
| View | Typical tools / sources | Role |
|---|---|---|
| Evolution & homology | BLAST (e.g. vs Swiss-Prot) | Whole-sequence homology and transfer hints |
| Domains & rules | InterProScan, UniRule, … | Modular units and expert-encoded logic |
| 3D fold | Foldseek | Fold-level neighbors when sequence similarity is weak |
| Topology & membranes | TMHMM | TM segment count and topology; constraints on localization and class |
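As a rough illustration of how the four views in the table could sit side by side in one multi-view evidence object, here is a minimal Python sketch. The class name `EvidenceBundle`, its field names, and the example values are all hypothetical and are not taken from the SeqStudio code base; the sketch only shows the idea of keeping each tool's output in its own slot so that downstream reasoning can see which views actually contributed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceBundle:
    """Hypothetical multi-view evidence container (illustrative names only)."""

    protein_id: str
    blast_hits: list = field(default_factory=list)        # evolution & homology view
    interpro_domains: list = field(default_factory=list)  # domains & rules view
    foldseek_hits: list = field(default_factory=list)     # 3D fold view
    tm_helix_count: Optional[int] = None                  # topology view; None = not run

    def views_with_signal(self):
        """Return the names of the views that contributed any evidence."""
        views = []
        if self.blast_hits:
            views.append("homology")
        if self.interpro_domains:
            views.append("domains")
        if self.foldseek_hits:
            views.append("fold")
        if self.tm_helix_count is not None:
            views.append("topology")
        return views


# A protein with no sequence-level hits but a fold neighbor and a topology call.
bundle = EvidenceBundle(
    protein_id="example_protein",
    foldseek_hits=[{"target": "hypothetical_structure", "score": 0.71}],
    tm_helix_count=7,
)
print(bundle.views_with_signal())
```

Keeping an explicit "not run" state for the topology slot distinguishes a missing TMHMM result from a prediction of zero transmembrane segments, which is itself informative.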
Structured outputs cover multiple functional dimensions (e.g. protein family, molecular function, enzyme information, pathways, subcellular location, structural class). Each field includes a confidence score and support provenance (motifs, GO IDs, tool sources) for downstream databases and review.
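Because every field carries its own confidence score, downstream review can be triaged per field rather than per protein. The sketch below shows one way to do that over JSONL records. The record layout (`protein_id`, `fields`, `value`, `confidence`) and the 0.8 threshold are assumptions made for illustration; check the actual output schema in docs/GUIDE.md before relying on any of these names.

```python
import json


def triage(jsonl_lines, threshold=0.8):
    """Split (protein_id, field_name) pairs into accepted vs. needs-review.

    Assumes each line is a JSON object with a 'protein_id' and a 'fields'
    mapping whose values contain a numeric 'confidence' (hypothetical schema).
    """
    accepted, review = [], []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        record = json.loads(line)
        for name, entry in record["fields"].items():
            bucket = accepted if entry["confidence"] >= threshold else review
            bucket.append((record["protein_id"], name))
    return accepted, review


# Two made-up records in the assumed layout.
sample_lines = [
    json.dumps({
        "protein_id": "example_1",
        "fields": {
            "molecular_function": {"value": "placeholder", "confidence": 0.93},
            "subcellular_location": {"value": "placeholder", "confidence": 0.55},
        },
    }),
    json.dumps({
        "protein_id": "example_2",
        "fields": {"protein_family": {"value": "placeholder", "confidence": 0.80}},
    }),
]

accepted, review = triage(sample_lines)
print("accepted:", accepted)
print("review:", review)
```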
| Input | Command | Notes |
|---|---|---|
| FASTA + PDB (one subdirectory per protein, or flat pairs under data/datasets/) | python -m seq_annotation | Optional helper: bash scripts/run_fasta_pdb_annotation.sh |
| UniProt-style JSON.gz (gzip JSON with results array, etc.) | python -m seqstudio_pipeline | Sharded runs: scripts/run_uniprot_json_gz_shards.example.sh |
Code lives in packages seq_annotation/ and seqstudio_pipeline/, with shared utilities under utils/ and tools/. Use python -m seqstudio_pipeline (there is no standalone seqstudio_pipeline.py at the repo root).
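To check that a UniProt-style shard has the expected shape before handing it to the pipeline, a short standalone reader can help. The sketch below relies only on what is stated above, namely that the input is gzip-compressed JSON with a top-level results array. The function name `iter_results` and the entry contents in the demo are hypothetical, and this is not the loader used by seqstudio_pipeline.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_results(path):
    """Yield entries from the top-level 'results' array of a gzip JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        payload = json.load(handle)  # loads the whole shard into memory
    yield from payload.get("results", [])


# Demo: write a tiny made-up shard, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "demo_shard.json.gz"
    with gzip.open(shard, "wt", encoding="utf-8") as handle:
        json.dump({"results": [{"id": "entry_1"}, {"id": "entry_2"}]}, handle)
    entries = list(iter_results(shard))
    print(len(entries), "entries read")
```

This version loads the full file at once, so it is only suitable for shards that fit comfortably in memory.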
git clone https://github.com/OpenRaiser/SeqStudio.git
cd SeqStudio
conda env create -f environment.yml
conda activate bioanalysis
pip install -r requirements.txt
bash setup.sh

The first setup.sh run downloads large archives (InterProScan, BLAST Swiss-Prot database, Foldseek database, etc.) and can take substantial time and disk space. Then finish the steps in Installation (Java, TMHMM, API keys, shell environment). Command-line flags, directory layout, and data layout are documented in docs/GUIDE.md.
SeqStudio is not a pure Python library: full runs expect BLAST, InterProScan, Foldseek, TMHMM, local sequence/structure databases, three auxiliary JSON files (Pfam / GO / InterPro metadata, shipped in-repo), and typically an LLM HTTP API. Treat setup as layers: Python environment → setup.sh → manual configuration → verification.
| Layer | What it provides |
|---|---|
| Host | Linux recommended; tens of GB free disk for InterProScan, Swiss-Prot BLAST DB, and Foldseek DB; stable network for first-time downloads. |
| Python | environment.yml + requirements.txt — orchestration, HTTP clients, and glue code. |
| setup.sh | OpenJDK 11 in conda, InterProScan download/extract (version pinned in the script), BLAST 2.16 + Swiss-Prot blastp DB under blast_db/, Foldseek + structure DB under foldseek_db/, and may append BLASTDB / FOLDSEEK_DB to your shell rc. Does not install TMHMM, LLM credentials, or your own FASTA/PDB/UniProt JSON.gz inputs. |
| Manual | Java 11 for InterProScan; TMHMM (license from the vendor); EXTERNAL_API_KEY (optional EXTERNAL_API_URL, EXTERNAL_MODEL_NAME); reload shell or export BLASTDB / FOLDSEEK_DB in job scripts. |
| Data | data/raw_data/*.json (~145 MB) ship with the repo; override paths via CLI if needed. UniProt-style json.gz for seqstudio_pipeline is supplied by you. |
For tool paths (--interproscan_path, --tmhmm_path, …), optional ijson for huge JSON.gz, and repository data policy, see docs/GUIDE.md.
- Reload shell config (source ~/.bashrc or equivalent) so BLASTDB and FOLDSEEK_DB are set, or export them in your job script.
- Confirm Java 11 is what InterProScan sees (java -version).
- InterProScan: default layout is under interproscan/ at the repo root after setup.sh; use --interproscan_path if you install elsewhere.
- TMHMM: install under its license; add tmhmm to PATH or pass --tmhmm_path.
- Auxiliary JSON: data/raw_data/all_pfam_descriptions.json, go.json, interpro_data.json; updates from Hugging Face (OpenRaiser/SeqStudio).
- LLM API: export EXTERNAL_API_KEY="..."; keep secrets out of git.
- Optional: pip install ijson for streaming very large JSON.gz (see GUIDE).

source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate bioanalysis
which python blastp foldseek interproscan.sh tmhmm 2>/dev/null || true
java -version 2>&1 | head -1
echo "BLASTDB=${BLASTDB:-<unset>}"
echo "FOLDSEEK_DB=${FOLDSEEK_DB:-<unset>}"
test -f data/raw_data/go.json && echo "go.json OK" || echo "missing go.json"

| Goal | Command |
|---|---|
| Smoke test (no full tool chain) | python -m seq_annotation --proteins_root examples --dry_run |
| UniProt pipeline CLI | python -m seqstudio_pipeline --help |
| FASTA + PDB annotation | python -m seq_annotation --proteins_root … |
| UniProt JSON.gz shard | python -m seqstudio_pipeline --input_json_gz … --output_dir … |
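For sharded runs, the per-shard command from the table can be generated programmatically. The following is a minimal sketch, not a replacement for scripts/run_uniprot_json_gz_shards.example.sh: it only builds argument lists using the --input_json_gz and --output_dir flags shown above. The helper name `build_shard_commands`, the `*.json.gz` naming pattern, and the choice of one output directory per shard are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def build_shard_commands(shard_dir, output_root):
    """Return one seqstudio_pipeline argument list per *.json.gz shard.

    Assumed layout: shards sit directly under shard_dir, and each shard
    writes to its own subdirectory of output_root (illustrative choice).
    """
    commands = []
    for shard in sorted(Path(shard_dir).glob("*.json.gz")):
        stem = shard.name[: -len(".json.gz")]
        commands.append([
            "python", "-m", "seqstudio_pipeline",
            "--input_json_gz", str(shard),
            "--output_dir", str(Path(output_root) / stem),
        ])
    return commands


# Demo with empty placeholder files; nothing is executed here.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("shard_b.json.gz", "shard_a.json.gz"):
        (Path(tmp) / name).touch()
    for cmd in build_shard_commands(tmp, Path(tmp) / "out"):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` or written into a cluster job script, depending on how the shards are scheduled.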
| File | Contents |
|---|---|
| docs/GUIDE.md | Install, environment variables, both pipelines, data, release notes |
| docs/LAYOUT.md | Directory tree and command map |
| docs/README.md | Doc index |

Example inputs: examples/README.md.
If you use SeqStudio or the following manuscript, please cite:
Liu, Y., Shen, L., Xu, T., Wu, Y., Xu, J., Pan, C., Liu, Q., Pan, J., Zhang, Y., Wei, J., Li, S., Chen, J., Zhou, X., He, C., Ni, J., & Tan, C. Generative reasoning emulating expert curation moves protein functional annotation beyond pattern matching at scale. Manuscript (2026).
@misc{seqstudio2026manuscript,
title = {Generative reasoning emulating expert curation moves protein functional annotation beyond pattern matching at scale},
author = {Liu, Yumou and Shen, Lingdong and Xu, Tongyue and Wu, Yujun and Xu, Jinhang and Pan, Chenkai and Liu, Qi and Pan, Jiabao and Zhang, Yijie and Wei, Jingxuan and Li, Siyuan and Chen, Jintao and Zhou, Xuanhe and He, Conghui and Ni, Jinren and Tan, Cheng},
year = {2026},
howpublished = {\url{https://github.com/OpenRaiser/SeqStudio}}
}

Data and extensions: Hugging Face (OpenRaiser/SeqStudio).
This project is released under the MIT License; see LICENSE.
SeqStudio — tool-grounded evidence integration for generative protein annotation