SeqStudio

SeqStudio is a generative reasoning system for protein functional annotation. It orchestrates heterogeneous evidence from sequence homology, structural similarity, domain architecture, and membrane topology within an LLM framework, semantically integrating outputs from BLAST, InterProScan, Foldseek, and TMHMM into structured annotations and natural-language narratives (with confidence scores and provenance) that serialize to JSON / JSONL.

Overview · Methods & figures · Capabilities · Quick start · Installation · Documentation · Citation


Overview

The rapid growth of sequencing and structure prediction has made functional annotation a persistent bottleneck. Classical pattern-matching pipelines (similarity search, domain signatures, rule propagation) excel at detecting isolated signals, yet they rarely reconcile heterogeneous or conflicting evidence into mechanistic, curator-style narratives as in UniProtKB/Swiss-Prot.

SeqStudio organizes these computational signals into a unified, multi-view evidence object and performs conditional generation with an LLM, using explicit tool outputs and a constrained schema. Reported evaluations include >91% consistency with Swiss-Prot across six functional dimensions, stable behaviour on sequences newly deposited in 2025 and held out from foundation-model training corpora, and large-scale enrichment on UniProtKB/TrEMBL-like data (the median number of filled functional fields rises from roughly 1–3 to 5–6, with confidence-based triage). See the manuscript listed under Citation for full benchmarks and methodology.

This repository provides a reproducible annotation stack you can run on a workstation or cluster: evidence collection, GO/Pfam semantic enrichment, prompt orchestration, structured prediction, and JSONL export. Use the two CLI entry points below to annotate FASTA + PDB bundles or UniProt-style JSON.gz shards.



Methods and figures

Teaser: from pattern matching to integrative curation

Figure: SeqStudio teaser, traditional tools vs SeqStudio.

Traditional approaches (BLAST, InterProScan, UniRule/ARBA, etc.) retrieve fragmented signals by similarity or rules; SeqStudio integrates these sources in a generative reasoning framework aimed at expert-like judgment.

Framework: four-stage annotation pipeline

Figure: SeqStudio framework overview.

Multi-source evidence acquisition → context-driven reasoning → structured report (confidence + provenance) → optional human–AI dialogue.

Evidence sources

| View | Typical tools / sources | Role |
| --- | --- | --- |
| Evolution & homology | BLAST (e.g. vs Swiss-Prot) | Whole-sequence homology and transfer hints |
| Domains & rules | InterProScan, UniRule, … | Modular units and expert-encoded logic |
| 3D fold | Foldseek | Fold-level neighbors when sequence similarity is weak |
| Topology & membranes | TMHMM | TM segment count and topology; constraints on localization and class |

Structured outputs cover multiple functional dimensions (e.g. protein family, molecular function, enzyme information, pathways, subcellular location, structural class). Each field includes a confidence score and support provenance (motifs, GO IDs, tool sources) for downstream databases and review.
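As a rough illustration of what "a confidence score and support provenance per field" could look like once serialized, here is a minimal Python sketch. The class name, the keys (protein_id, annotations, value, confidence, provenance), the 0.7 threshold, and the example values are all assumptions made for this example; the actual schema is defined by the code and docs/GUIDE.md, not by this snippet.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AnnotatedField:
    """One functional dimension of a report (hypothetical shape, not the real schema)."""
    value: str
    confidence: float                                # assumed to lie in [0, 1]
    provenance: list = field(default_factory=list)   # e.g. GO IDs, tool names


def to_jsonl_line(protein_id, fields):
    """Serialize one protein's annotations as a single JSONL line."""
    record = {
        "protein_id": protein_id,
        "annotations": {name: asdict(f) for name, f in fields.items()},
    }
    return json.dumps(record, sort_keys=True)


def needs_review(fields, threshold=0.7):
    """Names of fields whose confidence falls below an illustrative threshold."""
    return sorted(name for name, f in fields.items() if f.confidence < threshold)


# Made-up values, for illustration only.
example = {
    "molecular_function": AnnotatedField("ATP binding", 0.92, ["GO:0005524", "InterProScan"]),
    "subcellular_location": AnnotatedField("Membrane", 0.55, ["TMHMM"]),
}
line = to_jsonl_line("P_EXAMPLE", example)

if __name__ == "__main__":
    print(line)
    print("review:", needs_review(example))
```

Keeping one JSON object per line lets downstream tools filter or triage records without loading a whole file.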



Core capabilities

| Input | Command | Notes |
| --- | --- | --- |
| FASTA + PDB (one subdirectory per protein, or flat pairs under data/datasets/) | python -m seq_annotation | Optional helper: bash scripts/run_fasta_pdb_annotation.sh |
| UniProt-style JSON.gz (gzip JSON with results array, etc.) | python -m seqstudio_pipeline | Sharded runs: scripts/run_uniprot_json_gz_shards.example.sh |

Code lives in packages seq_annotation/ and seqstudio_pipeline/, with shared utilities under utils/ and tools/. Use python -m seqstudio_pipeline (there is no standalone seqstudio_pipeline.py at the repo root).
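To make the "gzip JSON with results array" input concrete, the following Python sketch writes and then reads a tiny shard of that shape. It is not the pipeline's own loader: the function name iter_results and the entry key primaryAccession are assumptions for illustration, and the whole shard is loaded into memory (the optional ijson dependency mentioned under Installation exists for files too large for that).

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_results(path):
    """Yield entries from the top-level "results" array of a gzip-compressed JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        payload = json.load(handle)
    yield from payload.get("results", [])


def write_demo_shard(path):
    """Write a two-entry shard; the entry layout here is hypothetical."""
    payload = {"results": [{"primaryAccession": "DEMO1"}, {"primaryAccession": "DEMO2"}]}
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)


with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "demo.json.gz"
    write_demo_shard(shard)
    accessions = [entry["primaryAccession"] for entry in iter_results(shard)]

if __name__ == "__main__":
    print(accessions)
```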



Quick start

git clone https://github.com/OpenRaiser/SeqStudio.git
cd SeqStudio

conda env create -f environment.yml
conda activate bioanalysis
pip install -r requirements.txt

bash setup.sh

The first setup.sh run downloads large archives (InterProScan, BLAST Swiss-Prot database, Foldseek database, etc.) and can take substantial time and disk space. Then finish the steps in Installation (Java, TMHMM, API keys, shell environment). Command-line flags, directory layout, and data layout are documented in docs/GUIDE.md.



Installation

SeqStudio is not a pure Python library: full runs expect BLAST, InterProScan, Foldseek, TMHMM, local sequence/structure databases, three auxiliary JSON files (Pfam / GO / InterPro metadata, shipped in-repo), and typically an LLM HTTP API. Treat setup as layers: Python environment → setup.sh → manual configuration → verification.

What you need

| Layer | What it provides |
| --- | --- |
| Host | Linux recommended; tens of GB free disk for InterProScan, Swiss-Prot BLAST DB, and Foldseek DB; stable network for first-time downloads. |
| Python | environment.yml + requirements.txt: orchestration, HTTP clients, and glue code. |
| setup.sh | OpenJDK 11 in conda, InterProScan download/extract (version pinned in the script), BLAST 2.16 + Swiss-Prot blastp DB under blast_db/, and Foldseek + structure DB under foldseek_db/. It may also append BLASTDB / FOLDSEEK_DB to your shell rc. It does not install TMHMM, LLM credentials, or your own FASTA/PDB/UniProt JSON.gz inputs. |
| Manual | Java 11 for InterProScan; TMHMM (license from the vendor); EXTERNAL_API_KEY (optional EXTERNAL_API_URL, EXTERNAL_MODEL_NAME); reload shell or export BLASTDB / FOLDSEEK_DB in job scripts. |
| Data | data/raw_data/*.json (~145 MB) ship with the repo; override paths via CLI if needed. UniProt-style json.gz for seqstudio_pipeline is supplied by you. |

For tool paths (--interproscan_path, --tmhmm_path, …), optional ijson for huge JSON.gz, and repository data policy, see docs/GUIDE.md.

After setup.sh

  1. Reload shell config (source ~/.bashrc or equivalent) so BLASTDB and FOLDSEEK_DB are set, or export them in your job script.
  2. Confirm Java 11 is what InterProScan sees (java -version).
  3. InterProScan: the default layout is under interproscan/ at the repo root after setup.sh; use --interproscan_path if you install elsewhere.
  4. TMHMM: install under its license; add tmhmm to PATH or pass --tmhmm_path.
  5. Auxiliary JSON: data/raw_data/all_pfam_descriptions.json, go.json, interpro_data.json; updates are available from Hugging Face (OpenRaiser/SeqStudio).
  6. LLM API: export EXTERNAL_API_KEY="..."; keep secrets out of git.
  7. Optional: pip install ijson for streaming very large JSON.gz (see GUIDE).

Verify

source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate bioanalysis
which python blastp foldseek interproscan.sh tmhmm 2>/dev/null || true
java -version 2>&1 | head -1
echo "BLASTDB=${BLASTDB:-<unset>}"
echo "FOLDSEEK_DB=${FOLDSEEK_DB:-<unset>}"
test -f data/raw_data/go.json && echo "go.json OK" || echo "missing go.json"
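
The same checks can be scripted in Python, for example at the top of a cluster job. This is a sketch rather than a utility shipped with the repository: the function name preflight and its parameters are assumptions, and a tool reported as missing from PATH is not necessarily an error, since InterProScan and TMHMM can also be supplied via --interproscan_path and --tmhmm_path.

```python
import os
import shutil
from pathlib import Path

TOOLS = ("blastp", "foldseek", "interproscan.sh", "tmhmm", "java")
ENV_VARS = ("BLASTDB", "FOLDSEEK_DB")
AUX_FILES = (
    "data/raw_data/all_pfam_descriptions.json",
    "data/raw_data/go.json",
    "data/raw_data/interpro_data.json",
)


def preflight(repo_root=".", env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    for tool in TOOLS:
        if which(tool) is None:
            problems.append(f"tool not on PATH: {tool}")
    for var in ENV_VARS:
        if not env.get(var):
            problems.append(f"environment variable unset: {var}")
    for rel in AUX_FILES:
        if not (Path(repo_root) / rel).is_file():
            problems.append(f"missing file: {rel}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print(issue)
    if not issues:
        print("all checks passed")
```

The env and which parameters exist only so the checks can be exercised without touching the real environment.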

Example commands

| Goal | Command |
| --- | --- |
| Smoke test (no full tool chain) | python -m seq_annotation --proteins_root examples --dry_run |
| UniProt pipeline CLI | python -m seqstudio_pipeline --help |
| FASTA + PDB annotation | python -m seq_annotation --proteins_root … |
| UniProt JSON.gz shard | python -m seqstudio_pipeline --input_json_gz … --output_dir … |
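
For sharded runs, one option is to generate one command per shard and hand the list to a scheduler. The sketch below only builds the command lines from the two flags shown in the table above; the function name shard_commands, the *.json.gz glob, and the one-output-directory-per-shard layout are assumptions, so compare it with scripts/run_uniprot_json_gz_shards.example.sh before relying on it.

```python
import shlex
import sys
import tempfile
from pathlib import Path


def shard_commands(shard_dir, output_root, pattern="*.json.gz"):
    """Build one seqstudio_pipeline invocation per shard, without running anything."""
    commands = []
    for shard in sorted(Path(shard_dir).glob(pattern)):
        # Assumed layout: each shard writes to its own directory under output_root.
        out_dir = Path(output_root) / shard.name.removesuffix(".json.gz")
        commands.append([
            sys.executable, "-m", "seqstudio_pipeline",
            "--input_json_gz", str(shard),
            "--output_dir", str(out_dir),
        ])
    return commands


if __name__ == "__main__":
    # Demo on empty placeholder files; real shards are supplied by you.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("shard_000.json.gz", "shard_001.json.gz"):
            (Path(tmp) / name).touch()
        for cmd in shard_commands(tmp, "outputs"):
            print(shlex.join(cmd))
```

Each returned list can be passed to subprocess.run or written into a job array script.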



Documentation

| File | Contents |
| --- | --- |
| docs/GUIDE.md | Install, environment variables, both pipelines, data, release notes |
| docs/LAYOUT.md | Directory tree and command map |
| docs/README.md | Doc index |

Example inputs: examples/README.md.



Citation

If you use SeqStudio or the following manuscript, please cite:

Liu, Y., Shen, L., Xu, T., Wu, Y., Xu, J., Pan, C., Liu, Q., Pan, J., Zhang, Y., Wei, J., Li, S., Chen, J., Zhou, X., He, C., Ni, J., & Tan, C. Generative reasoning emulating expert curation moves protein functional annotation beyond pattern matching at scale. Manuscript (2026).

@misc{seqstudio2026manuscript,
  title        = {Generative reasoning emulating expert curation moves protein functional annotation beyond pattern matching at scale},
  author       = {Liu, Yumou and Shen, Lingdong and Xu, Tongyue and Wu, Yujun and Xu, Jinhang and Pan, Chenkai and Liu, Qi and Pan, Jiabao and Zhang, Yijie and Wei, Jingxuan and Li, Siyuan and Chen, Jintao and Zhou, Xuanhe and He, Conghui and Ni, Jinren and Tan, Cheng},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenRaiser/SeqStudio}}
}

Data and extensions: Hugging Face — OpenRaiser/SeqStudio.



License

This project is released under the MIT License; see LICENSE.


SeqStudio — tool-grounded evidence integration for generative protein annotation
