PhADS is a multimodal model for annotating phage anti-defense systems (ADS), built on the bilingual ProstT5 protein language model. Given a FASTA file of protein sequences, it automatically runs the following workflow:
- Sequence Embedding Generation: Convert protein sequences into embedding vectors using the ProstT5 pre-trained model
- HMM Alignment: Align sequences against a pre-built ADS HMM database using HMMER hmmscan
- HMM Feature Matrix Construction: Transform alignment results into a fixed-dimension feature matrix (row-wise Min-Max normalization)
- SDH-ProtoNet Prediction: Fuse sequence embeddings with evolutionary features and perform cluster classification and candidate ranking in latent space
- Quality Control (QC) Report: Write quality control reports in text and JSON formats
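The row-wise Min-Max normalization named in the feature matrix step can be sketched as follows. This is a minimal illustration and not the code in `scripts/hmm_to_npy.py`; the function name `minmax_rows` and the handling of constant rows are assumptions.

```python
from typing import List


def minmax_rows(matrix: List[List[float]]) -> List[List[float]]:
    """Scale each row independently so its values span [0, 1]."""
    scaled = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        if span == 0:
            # Assumption: a constant row (for example, no HMM hits) maps to zeros.
            scaled.append([0.0 for _ in row])
        else:
            scaled.append([(value - lo) / span for value in row])
    return scaled


# Illustrative values only: two proteins scored against three HMM profiles.
features = [[0.0, 25.0, 100.0], [0.0, 0.0, 0.0]]
print(minmax_rows(features))
```

Normalizing per row keeps each protein's profile on the same scale regardless of how strong its best hit is.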
conda env create -f environment.yml
conda activate PhADS
python main.py -v
Successful output:
PhADS version 0.1
If files are missing, all missing file paths will be listed.
python main.py \
-i test/test.faa \
-db /path/to/prost_model \
-o result
Windows PowerShell example:
python main.py -i .\test\test.faa -db F:\models\ProstT5 -o .\result
python main.py \
-i input.faa \
-db /path/to/prost_model \
-o result \
-n 16 \
--device cuda \
--predict-mode prototype \
--topk 10 \
--filter-mode moderate \
-temp temp_work \
--print-topk
Embedding generation only:
python scripts/translate_to_embedding.py \
-i input.faa \
-o embeddings_out \
--model-path /path/to/prost_model \
--device cuda
HMM feature matrix construction only:
python scripts/hmm_to_npy.py \
-d hmm_results.domtblout \
-i input.faa \
-k 211
Prediction only:
python scripts/predict.py \
--query-emb query_embeddings.npy \
--query-hmm query_hmm_features_L2.npy \
--query-ids protein_ids.txt \
--model-path database/PhADS_model/sdh_protonet_best.pth \
--map-path database/PhADS_model/family_map.pth \
--mode prototype \
--topk 5 \
--filter-mode moderate

| Column | Description |
|---|---|
| `query_id` | Input sequence ID |
| `mode` | Prediction mode (prototype/instance) |
| `pred_id` | Best candidate ID |
| `pred_cluster` | Predicted cluster name |
| `pred_cluster_rep` | Predicted cluster representative sequence |
| `pred_distance2` | Squared Euclidean distance to predicted cluster |
| `confidence` | Softmax confidence score |
| `filter_mode` | Filter mode used |
| `filter_status` | Filter status (Pass/Fail) |
| `threshold_limit` | Applied threshold value |
| `nearest_sequence_id` | Nearest training instance ID |
| `nearest_sequence_label` | Nearest training instance label |
| `nearest_sequence_distance2` | Distance to nearest training instance |
| `nearest_sequence_function_summary` | Functional summary of nearest training instance |
| `cluster_func_*` | Functional annotation expanded columns |
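To make `pred_distance2` and `confidence` concrete, here is a minimal sketch of prototype-based scoring. It assumes the standard prototypical-network formulation, in which confidence is a softmax over negative squared Euclidean distances; whether PhADS computes it exactly this way is not stated here. The cluster names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_prototypes(
    query: Sequence[float],
    prototypes: Dict[str, Sequence[float]],
    topk: int = 5,
) -> List[Tuple[str, float, float]]:
    """Return (cluster, squared distance, confidence) for the top-k clusters."""
    dists = {name: squared_distance(query, proto) for name, proto in prototypes.items()}
    # Softmax over negative distances, shifted by the minimum for stability.
    best = min(dists.values())
    weights = {name: math.exp(-(d - best)) for name, d in dists.items()}
    total = sum(weights.values())
    ranked = sorted(dists, key=lambda name: dists[name])
    return [(name, dists[name], weights[name] / total) for name in ranked[:topk]]


# Hypothetical 2-D latent space with two cluster prototypes.
prototypes = {"cluster_a": [0.0, 0.0], "cluster_b": [3.0, 4.0]}
for name, dist2, conf in rank_prototypes([0.0, 1.0], prototypes):
    print(name, dist2, round(conf, 4))
```

Under this formulation, a smaller squared distance always yields a higher confidence, so the two columns rank candidates identically.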
Purpose: One-click orchestration of embedding generation → HMM alignment → feature construction → prediction → QC reporting.
Auto-resolved built-in database paths (relative to main.py location):
| Database Item | Path |
|---|---|
| HMM Database | database/hmm_model/anti_defense_system.hmm |
| CNN Weights | database/cnn_chkpnt/model.pt |
| PhADS Model | database/PhADS_model/sdh_protonet_best.pth |
| Cluster Map | database/PhADS_model/family_map.pth |
| Threshold Matrix | database/PhADS_model/family_thresholds.tsv |
| Cluster Annotations | database/anno_database/cluster_annotation.txt |
| ADS Functions | database/anno_database/ADS_function.txt |
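A check that the built-in files are present, in the spirit of what `python main.py -v` reports, might look like the sketch below. The function name `find_missing` is hypothetical, and the actual self-check in `main.py` may differ.

```python
from pathlib import Path
from typing import Dict, List

# Relative paths copied from the table above.
BUILTIN_FILES: Dict[str, str] = {
    "HMM Database": "database/hmm_model/anti_defense_system.hmm",
    "CNN Weights": "database/cnn_chkpnt/model.pt",
    "PhADS Model": "database/PhADS_model/sdh_protonet_best.pth",
    "Cluster Map": "database/PhADS_model/family_map.pth",
    "Threshold Matrix": "database/PhADS_model/family_thresholds.tsv",
    "Cluster Annotations": "database/anno_database/cluster_annotation.txt",
    "ADS Functions": "database/anno_database/ADS_function.txt",
}


def find_missing(base_dir: Path) -> List[Path]:
    """Return every built-in file that does not exist under base_dir."""
    candidates = [base_dir / rel for rel in BUILTIN_FILES.values()]
    return [path for path in candidates if not path.is_file()]


# Assumption: run from the directory that contains main.py.
for path in find_missing(Path(".")):
    print(f"missing: {path}")
```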
User must provide:
- `-i/--input-fasta`: Input FASTA file (supports multiple sequences)
- `-db/--mode-path`: Path to the ProstT5 model directory
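As an illustration of how a multi-sequence FASTA input maps to a list of query IDs, the sketch below takes the first whitespace-delimited token of each header line as the ID. That is the common convention and the one HMMER uses for sequence names, but it is an assumption about how PhADS derives its IDs. The function name `read_fasta_ids` is hypothetical.

```python
from typing import Iterable, List


def read_fasta_ids(lines: Iterable[str]) -> List[str]:
    """Collect sequence IDs from FASTA header lines, in file order."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                # Keep only the token before the first whitespace.
                ids.append(header.split()[0])
    return ids


# Illustrative records only.
records = [">protein_1 hypothetical protein\n", "MKVLA\n", ">protein_2\n", "MSTNP\n"]
print(read_fasta_ids(records))
```

In practice the same function can be given an open file handle, for example `read_fasta_ids(open("input.faa"))`.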
Full parameter reference:
| Parameter | Description | Default |
|---|---|---|
| `-i, --input-fasta` | Input FASTA file path | Required |
| `-o, --work-dir` | Final output directory | run_auto |
| `-n, --cpu` | Number of CPU threads | 8 |
| `--device` | Inference device | auto (auto-select cpu/cuda) |
| `-temp, --temp` | Temporary directory for intermediate files | ./temp/ in cwd |
| `-db, --mode-path` | ProstT5 model directory | Required |
| `--predict-mode` | Prediction mode: prototype / instance | `prototype` |
| `--topk` | Number of Top-k candidates | 5 |
| `--filter-mode` | Filter control mode | moderate |
| `--predict-output` | Main prediction output filename | prediction_results.tsv |
| `--predict-topk-output` | Top-k detail output filename | prediction_topk.tsv |
| `--print-topk` | Print Top-k candidates to stdout | Off |
| `--qc-txt` | Text QC report filename | qc_report.txt |
| `--qc-json` | JSON QC report filename | qc_report.json |
| `-v, --version` | Database integrity self-check | - |
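Once a run finishes, the main prediction table can be post-processed with the standard library. The sketch below assumes the TSV has a header row using the output column names documented earlier and that `filter_status` is spelled exactly `Pass` or `Fail`. The function name `passing_predictions` and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List


def passing_predictions(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Return the rows whose filter_status column equals 'Pass'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row.get("filter_status") == "Pass"]


# Placeholder content standing in for prediction_results.tsv.
sample = io.StringIO(
    "query_id\tpred_cluster\tfilter_status\n"
    "query_1\tcluster_a\tPass\n"
    "query_2\tcluster_b\tFail\n"
)
for row in passing_predictions(sample):
    print(row["query_id"], row["pred_cluster"])
```

For a real run, pass an open file instead, for example `open("result/prediction_results.tsv", newline="")`.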
Filter control modes (--filter-mode):
| Mode | Description |
|---|---|
| `strict` | Tight firewall limit (P90 training bounds, minimizes false positives) |
| `moderate` | Balanced firewall limit (P95 training bounds, recommended standard mode) |
| `loose` | Relaxed firewall limit (outlier boundaries, aims for divergent discovery) |
| `none` | Disable online filtering (backward compatible, preserves raw results) |
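The percentile-based limits can be illustrated with a small sketch. It assumes that a threshold is the P90 or P95 of the squared distances observed for a cluster during training, and that a query passes when its distance does not exceed that threshold. PhADS ships precomputed thresholds in `family_thresholds.tsv`, so the names and data below are hypothetical. The `loose` mode is left out because its outlier boundary is not specified here.

```python
from typing import Dict, Sequence

# Percentiles taken from the mode descriptions above.
PERCENTILE_BY_MODE: Dict[str, float] = {"strict": 90.0, "moderate": 95.0}


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def filter_status(distance2: float, training_distances: Sequence[float], mode: str) -> str:
    """Return 'Pass' when distance2 is within the mode's threshold, else 'Fail'."""
    threshold = percentile(training_distances, PERCENTILE_BY_MODE[mode])
    return "Pass" if distance2 <= threshold else "Fail"


# Illustrative training distances for a single cluster.
training = [float(i) for i in range(1, 22)]
print(filter_status(19.5, training, "strict"))
print(filter_status(19.5, training, "moderate"))
```

The example shows why `strict` rejects more queries: the same distance fails at P90 and passes at P95.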
Intermediate files (temp directory):
- `protein_ids.txt`: Query sequence ID list
- `<seq_id>_embedding.pt`: Per-sequence embedding files
- `query_embeddings.npy`: Aggregated embedding matrix
- `hmm_results.domtblout`: Raw hmmscan alignment results
- `query_hmm_features_L2.npy`: HMM feature matrix (binary)
- `query_hmm_features_L2.txt`: HMM feature matrix (text)
- `missing_embedding_ids.log`: IDs of sequences with missing embeddings (if any)
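To show how `hmm_results.domtblout` could be turned into a fixed-width matrix, the sketch below parses the HMMER domain table format, where for hmmscan the first field is the HMM profile name, the fourth is the query sequence name, and the eighth is the full-sequence bit score. Using the best bit score as the feature value, filling missing hits with zero, and the function names are assumptions and not a description of `scripts/hmm_to_npy.py`.

```python
from typing import Dict, Iterable, List, Sequence


def best_scores(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Map query ID -> {HMM name: best full-sequence bit score}."""
    scores: Dict[str, Dict[str, float]] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        hmm_name, query_id, score = fields[0], fields[3], float(fields[7])
        per_query = scores.setdefault(query_id, {})
        per_query[hmm_name] = max(score, per_query.get(hmm_name, float("-inf")))
    return scores


def to_matrix(
    scores: Dict[str, Dict[str, float]],
    query_ids: Sequence[str],
    hmm_names: Sequence[str],
) -> List[List[float]]:
    """Build one row per query and one column per HMM; 0.0 means no hit."""
    return [[scores.get(q, {}).get(h, 0.0) for h in hmm_names] for q in query_ids]


# Illustrative domtblout row: profile fam1 hits query prot1 with score 75.5.
row = "fam1 - 120 prot1 - 300 1e-20 75.5 0.1 1 1 1e-22 1e-20 75.0 0.1 1 120 5 125 1 130 0.95 desc"
matrix = to_matrix(best_scores(["# header", row]), ["prot1", "prot2"], ["fam1", "fam2"])
print(matrix)
```

Fixing the column order through `hmm_names` gives every query a row of the same width, which is what a fixed-dimension feature matrix requires.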