SeqStudio

SeqStudio is a generative reasoning system for protein functional annotation. It orchestrates heterogeneous evidence from sequence homology, structural similarity, domain architecture, and membrane topology within an LLM framework, integrating outputs from BLAST, InterProScan, Foldseek, and TMHMM into structured annotations and natural-language narratives with confidence scores and provenance, serializable as JSON or JSONL.

Overview · Methods & figures · Capabilities · Quick start · Installation · Documentation · Citation

Overview

The rapid growth of sequencing and structure prediction has made functional annotation a persistent bottleneck. Classical pattern-matching pipelines (similarity search, domain signatures, rule propagation) excel at detecting isolated signals, yet they rarely reconcile heterogeneous or conflicting evidence into mechanistic, curator-style narratives as in UniProtKB/Swiss-Prot.

SeqStudio organizes these computational signals into a unified, multi-view evidence object and performs conditional generation under an LLM with explicit tool outputs and a constrained schema. Reported evaluations include >91% consistency with Swiss-Prot across six functional dimensions, stable behaviour on 2025 newly deposited sequences held out from foundation-model training corpora, and large-scale enrichment on UniProtKB/TrEMBL–like data (median filled functional fields from roughly 1–3 to 5–6, with confidence-based triage). See the manuscript linked below for full benchmarks and methodology.

This repository provides a reproducible annotation stack you can run on a workstation or cluster: evidence collection, GO/Pfam semantic enrichment, prompt orchestration, structured prediction, and JSONL export. Use the two CLI entry points below to annotate FASTA + PDB bundles or UniProt-style JSON.gz shards.


Methods and figures

Teaser: from pattern matching to integrative curation

SeqStudio teaser: traditional tools vs SeqStudio

Traditional approaches (BLAST, InterProScan, UniRule/ARBA, etc.) retrieve fragmented signals by similarity or rules; SeqStudio integrates these sources in a generative reasoning framework aimed at expert-like judgment.

Framework: four-stage annotation pipeline

Multi-source evidence acquisition → context-driven reasoning → structured report (confidence + provenance) → optional human–AI dialogue.

Evidence sources

| View | Typical tools / sources | Role |
| --- | --- | --- |
| Evolution & homology | BLAST (e.g. vs Swiss-Prot) | Whole-sequence homology and transfer hints |
| Domains & rules | InterProScan, UniRule, … | Modular units and expert-encoded logic |
| 3D fold | Foldseek | Fold-level neighbors when sequence similarity is weak |
| Topology & membranes | TMHMM | TM segment count and topology; constraints on localization and class |
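
As a rough illustration of the four views in the table, the sketch below groups tool outputs into a single evidence object. The class and field names (`EvidenceBundle`, `blast_hits`, `interpro_domains`, `foldseek_hits`, `tm_helices`) are hypothetical and are not taken from the SeqStudio code base; the internal representation used by the pipeline may differ.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class EvidenceBundle:
    """Hypothetical container for the four evidence views (not SeqStudio's real class)."""
    protein_id: str
    blast_hits: list = field(default_factory=list)        # evolution & homology
    interpro_domains: list = field(default_factory=list)  # domains & rules
    foldseek_hits: list = field(default_factory=list)     # 3D fold
    tm_helices: int = 0                                    # topology & membranes

    def views_with_signal(self) -> list:
        """Return the names of the views that carry at least one signal."""
        views = []
        if self.blast_hits:
            views.append("homology")
        if self.interpro_domains:
            views.append("domains")
        if self.foldseek_hits:
            views.append("fold")
        if self.tm_helices > 0:
            views.append("topology")
        return views


bundle = EvidenceBundle(
    protein_id="example_protein",   # placeholder identifier
    interpro_domains=["PF00001"],   # illustrative Pfam accession
    tm_helices=7,
)
print(bundle.views_with_signal())
print(json.dumps(asdict(bundle)))
```

Keeping all views in one object makes it easy to see which evidence is absent for a given protein, for example when no BLAST hit exists and only fold-level or topology signals remain.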

Structured outputs cover multiple functional dimensions (e.g. protein family, molecular function, enzyme information, pathways, subcellular location, structural class). Each field includes a confidence score and support provenance (motifs, GO IDs, tool sources) for downstream databases and review.
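
To make the idea of per-field confidence and provenance concrete, here is a minimal sketch of what JSONL records and a confidence-based triage step could look like. The record layout, the key names, and the `REVIEW_THRESHOLD` value are assumptions made for this example only; consult `docs/GUIDE.md` for the schema the pipeline actually emits.

```python
import json

# Hypothetical records: key names and values are illustrative only.
records = [
    {
        "protein_id": "example_1",
        "molecular_function": {
            "value": "placeholder description",
            "confidence": 0.92,
            "provenance": {"go_ids": ["GO:0004930"], "tools": ["InterProScan", "TMHMM"]},
        },
    },
    {
        "protein_id": "example_2",
        "molecular_function": {
            "value": "placeholder description",
            "confidence": 0.41,
            "provenance": {"go_ids": [], "tools": ["BLAST"]},
        },
    },
]

REVIEW_THRESHOLD = 0.7  # assumed cut-off, chosen only for this example


def to_jsonl(rows):
    """Serialize one JSON object per line (the JSONL convention)."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


def needs_review(row, field_name="molecular_function", threshold=REVIEW_THRESHOLD):
    """Flag a record whose confidence for the given field is below the threshold."""
    return row[field_name]["confidence"] < threshold


jsonl_text = to_jsonl(records)
flagged = [r["protein_id"] for r in records if needs_review(r)]
print(flagged)
```

A downstream database could ingest high-confidence fields directly and route the flagged records to a human reviewer.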


Core capabilities

| Input | Command | Notes |
| --- | --- | --- |
| FASTA + PDB (one subdirectory per protein, or flat pairs under `data/datasets/`) | `python -m seq_annotation` | Optional helper: `bash scripts/run_fasta_pdb_annotation.sh` |
| UniProt-style JSON.gz (gzip JSON with `results` array, etc.) | `python -m seqstudio_pipeline` | Sharded runs: `scripts/run_uniprot_json_gz_shards.example.sh` |

Code lives in the packages `seq_annotation/` and `seqstudio_pipeline/`, with shared utilities under `utils/` and `tools/`. Use `python -m seqstudio_pipeline` (there is no standalone `seqstudio_pipeline.py` at the repo root).
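
For readers preparing their own shards, the following sketch shows one way to inspect a gzip-compressed JSON file that has a top-level `results` array, as described in the table above. The helper name `iter_results` and the demo entry fields are hypothetical; the sketch says nothing about how `seqstudio_pipeline` itself parses its input.

```python
import gzip
import json
import os
import tempfile


def iter_results(path):
    """Yield entries from a gzip-compressed JSON file with a top-level `results` array."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        payload = json.load(handle)  # loads the whole shard into memory
    yield from payload.get("results", [])


# Build a tiny demo shard; the entry fields are placeholders, not a real schema.
demo = {"results": [{"id": "entry_1"}, {"id": "entry_2"}]}
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "demo_shard.json.gz")
    with gzip.open(shard, "wt", encoding="utf-8") as handle:
        json.dump(demo, handle)
    entries = list(iter_results(shard))

print(len(entries))
```

Because `json.load` reads the full document at once, this approach suits small shards; for very large files a streaming parser is the safer choice.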


Quick start

git clone https://github.com/OpenRaiser/SeqStudio.git
cd SeqStudio

conda env create -f environment.yml
conda activate bioanalysis
pip install -r requirements.txt

bash setup.sh

The first `setup.sh` run downloads large archives (InterProScan, BLAST Swiss-Prot database, Foldseek database, etc.) and can take substantial time and disk space. Then finish the steps in Installation (Java, TMHMM, API keys, shell environment). Command-line flags, directory layout, and data layout are documented in `docs/GUIDE.md`.


Installation

SeqStudio is not a pure Python library: full runs expect BLAST, InterProScan, Foldseek, TMHMM, local sequence/structure databases, three auxiliary JSON files (Pfam / GO / InterPro metadata, shipped in-repo), and typically an LLM HTTP API. Treat setup as layers: Python environment → `setup.sh` → manual configuration → verification.

What you need

| Layer | What it provides |
| --- | --- |
| Host | Linux recommended; tens of GB free disk for InterProScan, Swiss-Prot BLAST DB, and Foldseek DB; stable network for first-time downloads. |
| Python | `environment.yml` + `requirements.txt`: orchestration, HTTP clients, and glue code. |
| `setup.sh` | OpenJDK 11 in conda, InterProScan download/extract (version pinned in the script), BLAST 2.16 + Swiss-Prot `blastp` DB under `blast_db/`, Foldseek + structure DB under `foldseek_db/`, and may append `BLASTDB` / `FOLDSEEK_DB` to your shell rc. Does not install TMHMM, LLM credentials, or your own FASTA/PDB/UniProt JSON.gz inputs. |
| Manual | Java 11 for InterProScan; TMHMM (license from the vendor); `EXTERNAL_API_KEY` (optional `EXTERNAL_API_URL`, `EXTERNAL_MODEL_NAME`); reload shell or export `BLASTDB` / `FOLDSEEK_DB` in job scripts. |
| Data | **`data/raw_data/*.json`** (~145 MB) ship with the repo; override paths via CLI if needed. **UniProt-style `json.gz`** for `seqstudio_pipeline` is supplied by you. |

For tool paths (`--interproscan_path`, `--tmhmm_path`, …), optional `ijson` for huge JSON.gz, and repository data policy, see `docs/GUIDE.md`.

After `setup.sh`

1. Reload shell config (`source ~/.bashrc` or equivalent) so `BLASTDB` and `FOLDSEEK_DB` are set, or export them in your job script.
2. Confirm Java 11 is what InterProScan sees (`java -version`).
3. InterProScan: the default layout is under `interproscan/` at the repo root after `setup.sh`; use `--interproscan_path` if you install elsewhere.
4. TMHMM: install under its license; add `tmhmm` to `PATH` or pass `--tmhmm_path`.
5. Auxiliary JSON: `data/raw_data/all_pfam_descriptions.json`, `go.json`, `interpro_data.json`; updates are available from Hugging Face (OpenRaiser/SeqStudio).
6. LLM API: `export EXTERNAL_API_KEY="..."`; keep secrets out of git.
7. Optional: `pip install ijson` for streaming very large JSON.gz (see GUIDE).
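
The LLM settings named above are plain environment variables, so a wrapper script can validate them before launching a long run. The sketch below is one possible way to do that; `load_llm_config` is a hypothetical helper, not a function provided by SeqStudio, and it deliberately supplies no default URL or model name because this README does not state any.

```python
import os


def load_llm_config(env=None):
    """Read the LLM settings from the environment variables named in this README."""
    env = os.environ if env is None else env
    api_key = env.get("EXTERNAL_API_KEY")
    if not api_key:
        raise RuntimeError("EXTERNAL_API_KEY is not set")
    return {
        "api_key": api_key,
        "api_url": env.get("EXTERNAL_API_URL"),        # optional; None if unset
        "model_name": env.get("EXTERNAL_MODEL_NAME"),  # optional; None if unset
    }


# Demo with an explicit mapping so that no real secret is needed.
config = load_llm_config({"EXTERNAL_API_KEY": "dummy-key"})
print(config["api_url"])
```

Failing early on a missing key avoids discovering the problem only after the evidence-collection tools have already run.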

Verify

source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate bioanalysis
which python blastp foldseek interproscan.sh tmhmm 2>/dev/null || true
java -version 2>&1 | head -1
echo "BLASTDB=${BLASTDB:-<unset>}"
echo "FOLDSEEK_DB=${FOLDSEEK_DB:-<unset>}"
test -f data/raw_data/go.json && echo "go.json OK" || echo "missing go.json"
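
The same checks can be run from Python, which may be convenient inside a job script that already imports the pipeline. This is a sketch under the assumption that the tool names, environment variables, and file path match the shell snippet above; `preflight` is a hypothetical helper and is not part of the repository.

```python
import os
import shutil

REQUIRED_TOOLS = ["blastp", "foldseek", "interproscan.sh", "tmhmm", "java"]
REQUIRED_ENV = ["BLASTDB", "FOLDSEEK_DB"]
REQUIRED_FILES = ["data/raw_data/go.json"]


def preflight(env=None, which=shutil.which, exists=os.path.isfile):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"tool not on PATH: {tool}")
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"environment variable unset: {name}")
    for path in REQUIRED_FILES:
        if not exists(path):
            problems.append(f"missing file: {path}")
    return problems


if __name__ == "__main__":
    for line in preflight() or ["all checks passed"]:
        print(line)
```

A missing `interproscan.sh` or `tmhmm` on `PATH` is not necessarily an error, since both can be supplied through `--interproscan_path` and `--tmhmm_path` instead.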

Example commands

| Goal | Command |
| --- | --- |
| Smoke test (no full tool chain) | `python -m seq_annotation --proteins_root examples --dry_run` |
| UniProt pipeline CLI | `python -m seqstudio_pipeline --help` |
| FASTA + PDB annotation | `python -m seq_annotation --proteins_root …` |
| UniProt JSON.gz shard | `python -m seqstudio_pipeline --input_json_gz … --output_dir …` |
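
When many shards need processing, the `--input_json_gz` and `--output_dir` flags shown above can be assembled programmatically. The sketch below only builds the argument lists and does not execute anything; the shard paths, the `runs` output root, and the one-directory-per-shard naming are assumptions for illustration, and the repository's own `scripts/run_uniprot_json_gz_shards.example.sh` remains the reference for sharded runs.

```python
import sys
from pathlib import Path


def build_shard_commands(shard_paths, output_root):
    """Build one `python -m seqstudio_pipeline` argv list per shard (nothing is executed)."""
    commands = []
    for shard in shard_paths:
        shard = Path(shard)
        # One output directory per shard; this naming scheme is an assumption.
        out_dir = Path(output_root) / shard.name.replace(".json.gz", "")
        commands.append([
            sys.executable, "-m", "seqstudio_pipeline",
            "--input_json_gz", str(shard),
            "--output_dir", str(out_dir),
        ])
    return commands


# Hypothetical shard paths used only to show the resulting command lines.
commands = build_shard_commands(
    ["shards/part_000.json.gz", "shards/part_001.json.gz"], "runs"
)
for argv in commands:
    print(" ".join(argv))
```

Each list can then be passed to `subprocess.run(argv, check=True)` or written into a cluster job file, one shard per job.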


Documentation

| File | Contents |
| --- | --- |
| `docs/GUIDE.md` | Install, environment variables, both pipelines, data, release notes |
| `docs/LAYOUT.md` | Directory tree and command map |
| `docs/README.md` | Doc index |

Example inputs: examples/README.md.


Citation

If you use SeqStudio or the following manuscript, please cite:

Liu, Y., Shen, L., Xu, T., Wu, Y., Xu, J., Pan, C., Liu, Q., Pan, J., Zhang, Y., Wei, J., Li, S., Chen, J., Zhou, X., He, C., Ni, J., & Tan, C. Generative reasoning emulating expert curation moves protein functional annotation beyond pattern matching at scale. Manuscript (2026).

@misc{seqstudio2026manuscript,
  title        = {Generative reasoning emulating expert curation moves protein functional annotation beyond pattern matching at scale},
  author       = {Liu, Yumou and Shen, Lingdong and Xu, Tongyue and Wu, Yujun and Xu, Jinhang and Pan, Chenkai and Liu, Qi and Pan, Jiabao and Zhang, Yijie and Wei, Jingxuan and Li, Siyuan and Chen, Jintao and Zhou, Xuanhe and He, Conghui and Ni, Jinren and Tan, Cheng},
  year         = {2026},
  howpublished = {\url{https://github.com/OpenRaiser/SeqStudio}}
}

Data and extensions: Hugging Face — OpenRaiser/SeqStudio.


License

This project is released under the MIT License; see LICENSE.

SeqStudio: tool-grounded evidence integration for generative protein annotation
