PhADS

PhADS is a multimodal model for annotating phage anti-defense systems (ADS), built on the bilingual ProstT5 protein language model. Given a FASTA file of protein sequences, it automatically runs the following workflow:

Sequence Embedding Generation: Convert protein sequences into embedding vectors using the ProstT5 pre-trained model
HMM Alignment: Align sequences against a pre-built ADS HMM database using HMMER hmmscan
HMM Feature Matrix Construction: Transform alignment results into a fixed-dimension feature matrix (row-wise Min-Max normalization)
SDH-ProtoNet Prediction: Fuse sequence embeddings with evolutionary features and perform cluster classification and candidate ranking in latent space
Quality Control (QC) Report: Write quality control reports in text and JSON formats
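
The row-wise Min-Max normalization named in the HMM Feature Matrix Construction step can be sketched as follows. This is a minimal illustration, assuming each row holds one query's scores against the HMM profiles and that a row with no score range (for example, no hits) maps to all zeros. The function name and the zero-range handling are assumptions, not taken from the PhADS code.

```python
def minmax_rows(matrix):
    """Scale each row to [0, 1] independently: (x - min) / (max - min)."""
    normalized = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        if span == 0:
            # Assumption: a constant row (e.g. no HMM hits) becomes all zeros.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([(x - lo) / span for x in row])
    return normalized


# Toy bit scores for two queries against three profiles (made-up numbers).
scores = [[0.0, 50.0, 100.0], [0.0, 0.0, 0.0]]
print(minmax_rows(scores))
```

Because every row is scaled on its own, a query's strongest profile hit always becomes 1.0 regardless of how strong that hit is in absolute terms.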

Usage Guide

Environment Setup

conda env create -f environment.yml
conda activate PhADS

Database Integrity Self-Check (Run First)

python main.py -v

Successful output:

PhADS version 0.1

If files are missing, all missing file paths will be listed.

Minimal Run Example

python main.py \
  -i test/test.faa \
  -db /path/to/prost_model \
  -o result

Windows PowerShell example:

python main.py -i .\test\test.faa -db F:\models\ProstT5 -o .\result

Full Parameter Run Example

python main.py \
  -i input.faa \
  -db /path/to/prost_model \
  -o result \
  -n 16 \
  --device cuda \
  --predict-mode prototype \
  --topk 10 \
  --filter-mode moderate \
  -temp temp_work \
  --print-topk
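
When the pipeline is launched from another Python program, the same options can be assembled as an argument list. The sketch below only builds that list from flags documented in this README, with defaults mirroring the parameter reference. The helper name `build_phads_command` and the file paths are hypothetical.

```python
import sys


def build_phads_command(fasta, prost_model, out_dir, cpu=8, device="auto",
                        predict_mode="prototype", topk=5,
                        filter_mode="moderate", print_topk=False):
    """Return an argv list for main.py using the documented options."""
    cmd = [
        sys.executable, "main.py",
        "-i", str(fasta),
        "-db", str(prost_model),
        "-o", str(out_dir),
        "-n", str(cpu),
        "--device", device,
        "--predict-mode", predict_mode,
        "--topk", str(topk),
        "--filter-mode", filter_mode,
    ]
    if print_topk:
        cmd.append("--print-topk")
    return cmd


cmd = build_phads_command("input.faa", "/path/to/prost_model", "result",
                          cpu=16, device="cuda", topk=10, print_topk=True)
print(" ".join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True)
# from the repository root.
```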

Running Sub-scripts Individually (Advanced)

Embedding generation only:

python scripts/translate_to_embedding.py \
  -i input.faa \
  -o embeddings_out \
  --model-path /path/to/prost_model \
  --device cuda

HMM feature matrix construction only:

python scripts/hmm_to_npy.py \
  -d hmm_results.domtblout \
  -i input.faa \
  -k 211
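
To make the input of this step more concrete, the sketch below parses a HMMER `--domtblout` table and keeps the best full-sequence bit score for each query and profile pair. It relies on the standard domtblout column order (target name first, query name fourth, full-sequence score eighth). The function name is hypothetical and the sample lines are invented. How `hmm_to_npy.py` actually aggregates hits is not described in this README, so this is an illustration rather than a reimplementation.

```python
def best_scores(domtblout_lines):
    """Map (query, profile) -> highest full-sequence bit score."""
    best = {}
    for line in domtblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HMMER comment lines
        fields = line.split()
        profile, query = fields[0], fields[3]  # target name, query name
        score = float(fields[7])               # full-sequence bit score
        key = (query, profile)
        if score > best.get(key, float("-inf")):
            best[key] = score
    return best


# Invented hmmscan rows: two hits of prot1 to ProfA, one hit to ProfB.
sample = [
    "# target name  accession  tlen  query name ...",
    "ProfA - 120 prot1 - 150 1e-20 75.3 0.1 1 2 1e-22 1e-20 74.9 0.1 5 110 10 118 8 120 0.95 example",
    "ProfA - 120 prot1 - 150 1e-15 60.0 0.1 2 2 1e-17 1e-15 59.1 0.1 5 110 10 118 8 120 0.90 example",
    "ProfB - 200 prot1 - 150 1e-05 20.5 0.0 1 1 1e-07 1e-05 19.8 0.0 3 90 12 100 10 105 0.80 example",
]
hits = best_scores(sample)
print(hits)
```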

Prediction only:

python scripts/predict.py \
  --query-emb query_embeddings.npy \
  --query-hmm query_hmm_features_L2.npy \
  --query-ids protein_ids.txt \
  --model-path database/PhADS_model/sdh_protonet_best.pth \
  --map-path database/PhADS_model/family_map.pth \
  --mode prototype \
  --topk 5 \
  --filter-mode moderate

Output File Descriptions

Main Prediction Results `prediction_results.tsv`

Column	Description
`query_id`	Input sequence ID
`mode`	Prediction mode (prototype/instance)
`pred_id`	Best candidate ID
`pred_cluster`	Predicted cluster name
`pred_cluster_rep`	Predicted cluster representative sequence
`pred_distance2`	Squared Euclidean distance to predicted cluster
`confidence`	Softmax confidence score
`filter_mode`	Filter mode used
`filter_status`	Filter status (Pass/Fail)
`threshold_limit`	Applied threshold value
`nearest_sequence_id`	Nearest training instance ID
`nearest_sequence_label`	Nearest training instance label
`nearest_sequence_distance2`	Distance to nearest training instance
`nearest_sequence_function_summary`	Functional summary of nearest training instance
`cluster_func_*`	Functional annotation expanded columns
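
A short sketch of reading this file and keeping only the rows that passed the filter is given below. It assumes the file is tab-separated with a header row that uses the column names listed above, and that `filter_status` contains the literal values Pass or Fail. The function name and the sample rows are invented for the demonstration.

```python
import csv
import io


def passing_predictions(handle):
    """Yield rows of prediction_results.tsv whose filter_status is Pass."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["filter_status"].strip().lower() == "pass":
            yield row


# Invented two-row sample with a subset of the documented columns.
sample = io.StringIO(
    "query_id\tpred_cluster\tconfidence\tfilter_status\n"
    "protA\tcluster_1\t0.97\tPass\n"
    "protB\tcluster_7\t0.41\tFail\n"
)
kept = list(passing_predictions(sample))
for row in kept:
    print(row["query_id"], row["pred_cluster"], row["confidence"])
```

For a real run, replace the sample with `open("result/prediction_results.tsv", newline="")`, adjusting the path to the chosen output directory.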

Script Details

`main.py` — Main Pipeline Entry (Recommended)

Purpose: One-click orchestration of embedding generation → HMM alignment → feature construction → prediction → QC reporting.

Auto-resolved built-in database paths (relative to main.py location):

Database Item	Path
HMM Database	`database/hmm_model/anti_defense_system.hmm`
CNN Weights	`database/cnn_chkpnt/model.pt`
PhADS Model	`database/PhADS_model/sdh_protonet_best.pth`
Cluster Map	`database/PhADS_model/family_map.pth`
Threshold Matrix	`database/PhADS_model/family_thresholds.tsv`
Cluster Annotations	`database/anno_database/cluster_annotation.txt`
ADS Functions	`database/anno_database/ADS_function.txt`
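
The kind of check performed by the integrity self-check can be approximated with the standard library. The sketch below tests whether each path from the table exists under a base directory and returns the missing ones. It is an illustration of the idea, not the code behind `python main.py -v`, and the function name is hypothetical.

```python
from pathlib import Path
import tempfile

# Relative paths as listed in the database table.
REQUIRED_FILES = [
    "database/hmm_model/anti_defense_system.hmm",
    "database/cnn_chkpnt/model.pt",
    "database/PhADS_model/sdh_protonet_best.pth",
    "database/PhADS_model/family_map.pth",
    "database/PhADS_model/family_thresholds.tsv",
    "database/anno_database/cluster_annotation.txt",
    "database/anno_database/ADS_function.txt",
]


def find_missing(base_dir):
    """Return the required relative paths that are not files under base_dir."""
    base = Path(base_dir)
    return [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]


# Demonstration in a throwaway directory containing only the HMM database.
with tempfile.TemporaryDirectory() as tmp:
    hmm = Path(tmp) / REQUIRED_FILES[0]
    hmm.parent.mkdir(parents=True)
    hmm.touch()
    missing = find_missing(tmp)
print(missing)
```

Pointing `find_missing` at the directory that contains `main.py` gives a quick way to see which files still need to be downloaded or restored.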

User must provide:

-i / --input-fasta: Input FASTA file (supports multiple sequences)
-db / --mode-path: Path to the ProstT5 model directory

Full parameter reference:

Parameter	Description	Default
`-i, --input-fasta`	Input FASTA file path	Required
`-o, --work-dir`	Final output directory	`run_auto`
`-n, --cpu`	Number of CPU threads	8
`--device`	Inference device	`auto` (auto-select cpu/cuda)
`-temp, --temp`	Temporary directory for intermediate files	`./temp/` in the current working directory
`-db, --mode-path`	ProstT5 model directory	Required
`--predict-mode`	Prediction mode: `prototype` / `instance`	`prototype`
`--topk`	Number of Top-k candidates	5
`--filter-mode`	Filter control mode	`moderate`
`--predict-output`	Main prediction output filename	`prediction_results.tsv`
`--predict-topk-output`	Top-k detail output filename	`prediction_topk.tsv`
`--print-topk`	Print Top-k candidates to stdout	Off
`--qc-txt`	Text QC report filename	`qc_report.txt`
`--qc-json`	JSON QC report filename	`qc_report.json`
`-v, --version`	Database integrity self-check	—
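
The `--predict-mode` and `--topk` options can be illustrated with a generic prototypical-network sketch: the query is compared with reference points by squared Euclidean distance, candidates are ranked by distance, and a softmax over negative distances gives a confidence. The sketch assumes that `prototype` mode compares against one point per cluster and `instance` mode against individual training sequences, which this README does not spell out. All names and vectors are made up, and the actual SDH-ProtoNet scoring may differ.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_candidates(query, references, topk=5):
    """Return the top-k (label, distance2, confidence) tuples, nearest first."""
    dists = {label: squared_distance(query, vec)
             for label, vec in references.items()}
    # Softmax over negative distances; shift by the minimum for stability.
    d_min = min(dists.values())
    weights = {label: math.exp(-(d - d_min)) for label, d in dists.items()}
    total = sum(weights.values())
    ranked = sorted(dists, key=dists.get)
    return [(label, dists[label], weights[label] / total)
            for label in ranked[:topk]]


# Assumed prototype mode: one reference vector per cluster.
prototypes = {"cluster_A": [0.0, 0.0], "cluster_B": [3.0, 4.0]}
top = rank_candidates([0.0, 1.0], prototypes, topk=2)
for label, dist2, conf in top:
    print(label, dist2, round(conf, 4))
```

Under the same assumption, instance mode would pass a dictionary with one entry per training sequence instead of one entry per cluster.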

Filter control modes (--filter-mode):

Mode	Description
`strict`	Tight firewall limit (P90 training bounds, minimizes false positives)
`moderate`	Balanced firewall limit (P95 training bounds, recommended standard mode)
`loose`	Relaxed firewall limit (outlier boundaries, aims for divergent discovery)
`none`	Disable online filtering (backward compatible, preserves raw results)
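
One way to read the P90 and P95 bounds is as percentiles of the distances between training sequences and their own cluster, with a query passing when its distance does not exceed the bound. The sketch below follows that reading. The nearest-rank percentile, the function names, and the numbers are assumptions for illustration, and the shipped `family_thresholds.tsv` may have been derived differently.

```python
import math


def percentile_bound(train_distances, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data <= it."""
    ordered = sorted(train_distances)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def filter_status(distance2, threshold):
    """Pass when the query distance is within the threshold."""
    return "Pass" if distance2 <= threshold else "Fail"


# Made-up squared distances of training sequences to their cluster.
train = [0.5, 0.8, 1.0, 1.1, 1.3, 1.6, 2.0, 2.4, 3.0, 9.0]
strict = percentile_bound(train, 90)    # tighter bound
moderate = percentile_bound(train, 95)  # looser bound

query_distance2 = 5.0
print(filter_status(query_distance2, strict))
print(filter_status(query_distance2, moderate))
```

In this toy data the same query fails the tighter bound and passes the looser one, which is the trade-off between false positives and discovery that the mode descriptions refer to.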

Intermediate files (temp directory):

protein_ids.txt — Query sequence ID list
<seq_id>_embedding.pt — Per-sequence embedding files
query_embeddings.npy — Aggregated embedding matrix
hmm_results.domtblout — Raw hmmscan alignment results
query_hmm_features_L2.npy — HMM feature matrix (binary)
query_hmm_features_L2.txt — HMM feature matrix (text)
missing_embedding_ids.log — IDs of sequences with missing embeddings (if any)
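
The relationship between `protein_ids.txt`, the per-sequence embedding files, and `missing_embedding_ids.log` can be checked with a few lines of standard-library Python. The sketch assumes that `protein_ids.txt` holds one ID per line and that embedding files are named exactly `<seq_id>_embedding.pt`. The helper name is hypothetical, and the demo directory is created on the fly.

```python
from pathlib import Path
import tempfile


def missing_embeddings(temp_dir):
    """Return IDs from protein_ids.txt that have no <seq_id>_embedding.pt file."""
    temp = Path(temp_dir)
    lines = (temp / "protein_ids.txt").read_text().splitlines()
    ids = [line.strip() for line in lines if line.strip()]
    return [sid for sid in ids if not (temp / f"{sid}_embedding.pt").is_file()]


# Demonstration: two IDs listed, only one embedding file present.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "protein_ids.txt").write_text("protA\nprotB\n")
    (Path(tmp) / "protA_embedding.pt").touch()
    missing_ids = missing_embeddings(tmp)
print(missing_ids)
```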
