George-nsn/PhADS

PhADS

PhADS is a multimodal model, built on the bilingual ProstT5 protein language model, for annotating phage anti-defense systems (ADS). Given a protein sequence FASTA file, it automatically runs the following workflow:

  1. Sequence Embedding Generation: Convert protein sequences into embedding vectors using the ProstT5 pre-trained model
  2. HMM Alignment: Align sequences against a pre-built ADS HMM database using HMMER hmmscan
  3. HMM Feature Matrix Construction: Transform alignment results into a fixed-dimension feature matrix (row-wise Min-Max normalization)
  4. SDH-ProtoNet Prediction: Fuse sequence embeddings with evolutionary features and perform cluster classification and candidate ranking in latent space
  5. Quality Control (QC) Report: Output text + JSON format quality control reports
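Step 3 mentions row-wise Min-Max normalization. The following is a minimal plain-Python sketch of that operation, not the code used in scripts/hmm_to_npy.py. The function name `minmax_rows` and the choice to map a constant row (for example, a query with no HMM hits) to zeros are assumptions made for illustration.

```python
def minmax_rows(matrix):
    """Scale each row independently to the range [0, 1]."""
    normalized = []
    for row in matrix:
        lo, hi = min(row), max(row)
        span = hi - lo
        if span == 0:
            # Assumption: a constant row (e.g. no HMM hits) becomes all zeros.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(x - lo) / span for x in row])
    return normalized


scores = [
    [0.0, 50.0, 100.0],  # query with hits of different strength
    [0.0, 0.0, 0.0],     # query with no hits
]
normalized_scores = minmax_rows(scores)
print(normalized_scores)
```

Because each row is scaled on its own, the strongest hit of every query maps to 1.0 regardless of how strong it is in absolute terms.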

Usage Guide

Environment Setup

conda env create -f environment.yml
conda activate PhADS

Database Integrity Self-Check (Run First)

python main.py -v

Successful output:

PhADS version 0.1

If files are missing, all missing file paths will be listed.

Minimal Run Example

python main.py \
  -i test/test.faa \
  -db /path/to/prost_model \
  -o result

Windows PowerShell example:

python main.py -i .\test\test.faa -db F:\models\ProstT5 -o .\result

Full Parameter Run Example

python main.py \
  -i input.faa \
  -db /path/to/prost_model \
  -o result \
  -n 16 \
  --device cuda \
  --predict-mode prototype \
  --topk 10 \
  --filter-mode moderate \
  -temp temp_work \
  --print-topk
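When many FASTA files need to be processed, the same command can be assembled from Python. This is only a sketch: `build_phads_command` and `run_phads` are hypothetical helper names, and the argument list uses only flags documented in this README.

```python
import subprocess
import sys


def build_phads_command(input_fasta, model_dir, out_dir, cpu=8,
                        device=None, topk=5, filter_mode="moderate"):
    """Assemble an argument list for main.py using flags from this README."""
    cmd = [sys.executable, "main.py",
           "-i", str(input_fasta),
           "-db", str(model_dir),
           "-o", str(out_dir),
           "-n", str(cpu),
           "--topk", str(topk),
           "--filter-mode", filter_mode]
    if device is not None:
        # Omit --device to keep the documented default (auto).
        cmd += ["--device", device]
    return cmd


def run_phads(cmd):
    """Run the pipeline; must be called from the directory containing main.py."""
    subprocess.run(cmd, check=True)


cmd = build_phads_command("input.faa", "/path/to/prost_model", "result",
                          cpu=16, device="cuda", topk=10)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, which matters on Windows in particular.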

Running Sub-scripts Individually (Advanced)

Embedding generation only:

python scripts/translate_to_embedding.py \
  -i input.faa \
  -o embeddings_out \
  --model-path /path/to/prost_model \
  --device cuda

HMM feature matrix construction only:

python scripts/hmm_to_npy.py \
  -d hmm_results.domtblout \
  -i input.faa \
  -k 211
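To make the idea behind this step concrete, here is a rough sketch of how a HMMER domtblout table could be turned into a query-by-profile score matrix. It is not the logic of scripts/hmm_to_npy.py. It assumes the standard domtblout column order (target name in column 1, query name in column 4, full-sequence bit score in column 8) and keeps the best score per query/profile pair. The profile names `profA` and `profB`, the function names, and the caller-supplied column order are hypothetical. The meaning of `-k` is not documented here, so the sketch does not model it.

```python
def parse_domtblout(lines):
    """Return {(query_id, profile_name): best full-sequence bit score}."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # With hmmscan, the target is the HMM profile and the query is the protein.
        profile, query, score = fields[0], fields[3], float(fields[7])
        key = (query, profile)
        if score > best.get(key, float("-inf")):
            best[key] = score
    return best


def build_matrix(best, query_ids, profile_names):
    """One row per query, one column per profile; missing hits stay 0.0."""
    col = {name: j for j, name in enumerate(profile_names)}
    row = {qid: i for i, qid in enumerate(query_ids)}
    matrix = [[0.0] * len(profile_names) for _ in query_ids]
    for (query, profile), score in best.items():
        if query in row and profile in col:
            matrix[row[query]][col[profile]] = score
    return matrix


example = [
    "# target name  accession  tlen  query name  accession  qlen  E-value  score ...",
    "profA - 87 prot1 - 90 1e-30 100.5 0.1 1 1 1e-32 1e-30 100.2 0.1 1 87 2 88 1 90 0.98 -",
    "profB - 120 prot2 - 110 1e-10 42.0 0.0 1 1 1e-12 1e-10 41.8 0.0 5 110 3 105 1 110 0.95 -",
]
matrix = build_matrix(parse_domtblout(example), ["prot1", "prot2"], ["profA", "profB"])
print(matrix)
```

Fixing the column order up front is what keeps the matrix dimension constant across runs, even when a query hits only a few profiles.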

Prediction only:

python scripts/predict.py \
  --query-emb query_embeddings.npy \
  --query-hmm query_hmm_features_L2.npy \
  --query-ids protein_ids.txt \
  --model-path database/PhADS_model/sdh_protonet_best.pth \
  --map-path database/PhADS_model/family_map.pth \
  --mode prototype \
  --topk 5 \
  --filter-mode moderate

Output File Descriptions

Main Prediction Results: prediction_results.tsv

| Column | Description |
| --- | --- |
| query_id | Input sequence ID |
| mode | Prediction mode (prototype/instance) |
| pred_id | Best candidate ID |
| pred_cluster | Predicted cluster name |
| pred_cluster_rep | Predicted cluster representative sequence |
| pred_distance2 | Squared Euclidean distance to predicted cluster |
| confidence | Softmax confidence score |
| filter_mode | Filter mode used |
| filter_status | Filter status (Pass/Fail) |
| threshold_limit | Applied threshold value |
| nearest_sequence_id | Nearest training instance ID |
| nearest_sequence_label | Nearest training instance label |
| nearest_sequence_distance2 | Distance to nearest training instance |
| nearest_sequence_function_summary | Functional summary of nearest training instance |
| cluster_func_* | Functional annotation expanded columns |
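The pred_distance2 and confidence columns can be illustrated with a small sketch of prototype-based ranking. It assumes that confidence is a softmax over negative squared Euclidean distances, as in standard prototypical networks. Whether PhADS applies a temperature or any other scaling is not stated here, and the cluster names and vectors below are made up.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_prototypes(query, prototypes, topk=5):
    """Rank clusters by squared distance; confidence = softmax(-distance)."""
    d2 = {name: squared_distance(query, vec) for name, vec in prototypes.items()}
    # Subtract the smallest distance before exponentiating for numerical stability.
    d_min = min(d2.values())
    weights = {name: math.exp(-(d - d_min)) for name, d in d2.items()}
    total = sum(weights.values())
    ranked = sorted(d2, key=d2.get)[:topk]
    return [(name, d2[name], weights[name] / total) for name in ranked]


# Hypothetical 2-dimensional prototypes; real latent vectors are much longer.
prototypes = {"cluster_A": [0.0, 0.0], "cluster_B": [3.0, 4.0]}
ranking = rank_prototypes([0.0, 1.0], prototypes, topk=2)
for name, dist2, conf in ranking:
    print(f"{name}\t{dist2:.1f}\t{conf:.4f}")
```

The first entry of the ranking corresponds to the best candidate. The remaining entries are what a Top-k detail file would list.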

Script Details

main.py — Main Pipeline Entry (Recommended)

Purpose: Orchestrates the full pipeline in a single command: embedding generation → HMM alignment → feature construction → prediction → QC reporting.

Auto-resolved built-in database paths (relative to main.py location):

| Database Item | Path |
| --- | --- |
| HMM Database | database/hmm_model/anti_defense_system.hmm |
| CNN Weights | database/cnn_chkpnt/model.pt |
| PhADS Model | database/PhADS_model/sdh_protonet_best.pth |
| Cluster Map | database/PhADS_model/family_map.pth |
| Threshold Matrix | database/PhADS_model/family_thresholds.tsv |
| Cluster Annotations | database/anno_database/cluster_annotation.txt |
| ADS Functions | database/anno_database/ADS_function.txt |
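The kind of check performed by the integrity self-check can be approximated as below. This is a sketch rather than the code in main.py: `find_missing` is a hypothetical helper, and the demo runs in a throwaway directory that contains only the HMM database, so six of the seven files are reported as missing.

```python
import tempfile
from pathlib import Path

# Relative paths copied from the list of built-in database files.
REQUIRED_FILES = [
    "database/hmm_model/anti_defense_system.hmm",
    "database/cnn_chkpnt/model.pt",
    "database/PhADS_model/sdh_protonet_best.pth",
    "database/PhADS_model/family_map.pth",
    "database/PhADS_model/family_thresholds.tsv",
    "database/anno_database/cluster_annotation.txt",
    "database/anno_database/ADS_function.txt",
]


def find_missing(base_dir, relative_paths=REQUIRED_FILES):
    """Return the required paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [str(base / rel) for rel in relative_paths if not (base / rel).is_file()]


# Demo: create only the first required file, then list what is missing.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / REQUIRED_FILES[0]
    present.parent.mkdir(parents=True)
    present.touch()
    missing = find_missing(tmp)

print(f"{len(missing)} of {len(REQUIRED_FILES)} files missing")
```

In a real checkout, `base_dir` would be the directory that contains main.py.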

The user must provide:

  • -i / --input-fasta: Input FASTA file (supports multiple sequences)
  • -db / --mode-path: Path to the ProstT5 model directory
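Since the input may contain multiple sequences, a quick pre-flight look at the FASTA file can catch empty or malformed records before a long run. The reader below is a minimal sketch. It assumes the sequence ID is the first whitespace-delimited token of the header line, which may differ from how PhADS derives IDs.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the ID is the first token after '>'.
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if seq_id is not None:
        yield seq_id, "".join(chunks)


example = """>prot1 hypothetical protein
MKT
LLV
>prot2
MSE
""".splitlines()

records = list(read_fasta(example))
print(records)
```

For a file on disk, pass an open file handle instead of the list, for example `read_fasta(open("input.faa"))`.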

Full parameter reference:

| Parameter | Description | Default |
| --- | --- | --- |
| -i, --input-fasta | Input FASTA file path | Required |
| -o, --work-dir | Final output directory | run_auto |
| -n, --cpu | Number of CPU threads | 8 |
| --device | Inference device | auto (auto-select cpu/cuda) |
| -temp, --temp | Temporary directory for intermediate files | ./temp/ in the current working directory |
| -db, --mode-path | ProstT5 model directory | Required |
| --predict-mode | Prediction mode: prototype / instance | prototype |
| --topk | Number of Top-k candidates | 5 |
| --filter-mode | Filter control mode | moderate |
| --predict-output | Main prediction output filename | prediction_results.tsv |
| --predict-topk-output | Top-k detail output filename | prediction_topk.tsv |
| --print-topk | Print Top-k candidates to stdout | Off |
| --qc-txt | Text QC report filename | qc_report.txt |
| --qc-json | JSON QC report filename | qc_report.json |
| -v, --version | Database integrity self-check | |

Filter control modes (--filter-mode):

| Mode | Description |
| --- | --- |
| strict | Tight firewall limit (P90 training bounds, minimizes false positives) |
| moderate | Balanced firewall limit (P95 training bounds, recommended standard mode) |
| loose | Relaxed firewall limit (outlier boundaries, aims for divergent discovery) |
| none | Disable online filtering (backward compatible, preserves raw results) |
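The percentile-based modes can be pictured with the sketch below. It is not the PhADS implementation. PhADS reads precomputed limits from family_thresholds.tsv, whose format is not described here, so the sketch derives the limits from a hypothetical list of training distances instead. It also assumes that a prediction passes when its squared distance does not exceed the limit, and that `none` always reports Pass. The `loose` boundary is not specified precisely enough to reproduce, so it is left out.

```python
PERCENTILES = {"strict": 90, "moderate": 95}


def percentile(values, q):
    """Percentile (0 <= q <= 100) with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values given")
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def filter_status(distance2, mode, train_distances):
    """Return (status, limit) for one prediction under the given mode."""
    if mode == "none":
        return "Pass", None  # assumption: no filtering means nothing fails
    if mode not in PERCENTILES:
        raise ValueError(f"mode not covered by this sketch: {mode}")
    limit = percentile(train_distances, PERCENTILES[mode])
    # Assumption: distances at or below the limit pass.
    return ("Pass" if distance2 <= limit else "Fail"), limit


# Hypothetical training distances 1..100 for a single cluster.
train = list(range(1, 101))
for mode in ("strict", "moderate", "none"):
    print(mode, filter_status(93.0, mode, train))
```

With these made-up distances, a query at squared distance 93 fails under `strict` but passes under `moderate`, which is the trade-off the two modes are meant to expose.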

Intermediate files (temp directory):

  • protein_ids.txt — Query sequence ID list
  • <seq_id>_embedding.pt — Per-sequence embedding files
  • query_embeddings.npy — Aggregated embedding matrix
  • hmm_results.domtblout — Raw hmmscan alignment results
  • query_hmm_features_L2.npy — HMM feature matrix (binary)
  • query_hmm_features_L2.txt — HMM feature matrix (text)
  • missing_embedding_ids.log — IDs of sequences with missing embeddings (if any)
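When debugging a run, it can help to check which query IDs have a per-sequence embedding file in the temp directory. The sketch below follows the `<seq_id>_embedding.pt` naming from the list above. The helper names are hypothetical, and it assumes IDs are used in file names unchanged, which may not hold for IDs containing characters such as `/`.

```python
import tempfile
from pathlib import Path


def split_by_embedding(ids, temp_dir):
    """Separate IDs into those with and without a <seq_id>_embedding.pt file."""
    temp = Path(temp_dir)
    found, missing = [], []
    for seq_id in ids:
        target = found if (temp / f"{seq_id}_embedding.pt").is_file() else missing
        target.append(seq_id)
    return found, missing


def write_missing_log(missing, temp_dir):
    """Write missing IDs, one per line, only when there is something to report."""
    if missing:
        log = Path(temp_dir) / "missing_embedding_ids.log"
        log.write_text("\n".join(missing) + "\n")


# Demo in a throwaway directory: prot1 has an embedding file, prot2 does not.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "prot1_embedding.pt").touch()
    found, missing = split_by_embedding(["prot1", "prot2"], tmp)
    write_missing_log(missing, tmp)
    log_text = (Path(tmp) / "missing_embedding_ids.log").read_text()

print("found:", found, "missing:", missing)
```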
