This repository contains a modified PHACE pipeline that can run on a single concatenated MSA composed of multiple protein MSAs (protein “blocks”), while discarding false signal introduced by artificial gap blocks (i.e., blocks filled with gaps because a species lacks the ortholog for that protein).
In the original PHACE workflow, gaps (-) in the alignment can be interpreted as real evolutionary events. In a concatenated multi‑protein MSA, however, entire protein blocks can be gaps purely due to missing orthologs (not biological indels). This version prevents those artificial gaps from contributing to tolerance scoring and coevolution scoring.
You must provide:
- Ortholog presence table (ortholog_selection_table.tsv)
  - Rows: species names (must match tree tip labels)
  - Columns: protein identifiers (must match boundary file protein names)
  - Cells: non‑empty = ortholog present; empty/NA = ortholog missing (artificial gap block)
- Protein boundary map (proteinBoundriesForMSA.csv/.txt)
  - Maps concatenated MSA coordinates to protein blocks
  - Must include (at minimum): protein_name (or protein_id), start, end
  - Coordinates are 1‑based, inclusive, in the concatenated alignment.
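The two input tables can be read and queried with a few lines of Python. The sketch below is illustrative only and is not part of the pipeline. It assumes a comma-separated boundary file with a header row and a tab-separated ortholog table whose first column holds the species names; the function names and the sample data are hypothetical.

```python
import csv
import io

def load_boundaries(handle):
    """Return [(protein, start, end)] with 1-based, inclusive coordinates."""
    boundaries = []
    for rec in csv.DictReader(handle):
        name = rec.get("protein_name") or rec.get("protein_id")
        boundaries.append((name, int(rec["start"]), int(rec["end"])))
    return boundaries

def load_presence(handle):
    """Return {species: {protein: True/False}} from the ortholog table."""
    reader = csv.reader(handle, delimiter="\t")
    proteins = next(reader)[1:]  # assumed header: species label, then protein names
    presence = {}
    for row in reader:
        if not row:
            continue
        cells = row[1:] + [""] * (len(proteins) - len(row[1:]))
        # Empty or NA cells mean the ortholog is missing.
        presence[row[0]] = {p: c.strip() not in ("", "NA") for p, c in zip(proteins, cells)}
    return presence

def protein_at(position, boundaries):
    """Map a 1-based concatenated-MSA position to its protein block (or None)."""
    for name, start, end in boundaries:
        if start <= position <= end:
            return name
    return None

# Tiny in-memory example (hypothetical data).
boundaries = load_boundaries(io.StringIO("protein_name,start,end\nP1,1,3\nP2,4,6\n"))
presence = load_presence(io.StringIO("species\tP1\tP2\nhuman\tENSP1\tENSP2\nmouse\tENSP3\t\n"))
print(protein_at(4, boundaries))  # P2
print(presence["mouse"]["P2"])    # False
```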
When a species lacks the ortholog for the protein block containing a position, this version encodes that position as ? (missing) in MSA1 and MSA2 encodings for that species only (dynamic, per position / per pair).
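As a rough illustration of this masking rule, the sketch below replaces every column of a protein block with ? for the species that lack that ortholog, while leaving real gaps untouched. It does not reproduce the MSA1/MSA2 encodings themselves; the function name and the toy data are assumptions made for the example.

```python
def mask_missing_blocks(seqs, boundaries, presence, missing="?"):
    """Replace whole protein blocks with `missing` where the ortholog is absent."""
    masked = {}
    for species, seq in seqs.items():
        chars = list(seq)
        for protein, start, end in boundaries:
            # A protein without a column in the table cannot be masked,
            # so it is treated as present.
            if presence.get(species, {}).get(protein, True):
                continue
            # Boundaries are 1-based and inclusive.
            for i in range(start - 1, min(end, len(chars))):
                chars[i] = missing
        masked[species] = "".join(chars)
    return masked

seqs = {"human": "MK-LVA", "mouse": "MKT---"}
boundaries = [("P1", 1, 3), ("P2", 4, 6)]
presence = {
    "human": {"P1": True, "P2": True},
    "mouse": {"P1": True, "P2": False},
}
masked = mask_missing_blocks(seqs, boundaries, presence)
print(masked["human"])  # MK-LVA  (real gap kept)
print(masked["mouse"])  # MKT???  (artificial gap block masked)
```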
In coev_diff_MSA1.R and coev_diff_MSA2.R, if either parent or child ancestral state is ?, the branch is forced to “no change” for that position (diff=0, change label "--", etc.). This prevents spurious change labels like C?, A?, etc.
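The rule can be pictured with a simplified Python sketch. The real scripts work on ASR output in R; here the diff is reduced to a 0/1 indicator, and giving ordinary unchanged branches the same "--" label is an assumption made for the example.

```python
MISSING = "?"

def branch_change(parent_state, child_state):
    """Return (diff, label) for one branch at one position."""
    # A missing state on either end forces "no change".
    if MISSING in (parent_state, child_state):
        return 0, "--"
    if parent_state == child_state:
        return 0, "--"  # assumed label for an ordinary unchanged branch
    return 1, parent_state + child_state

print(branch_change("C", "?"))  # (0, '--')
print(branch_change("C", "A"))  # (1, 'CA')
```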
For a position pair (i1, i2) mapping to proteins (P1, P2):
- Species missing P1 or P2 are masked (they contribute zero).
- Additionally, any internal branch whose descendant subtree contains zero species that have both orthologs is masked (branch weight set to 0 and per‑branch diffs zeroed).
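A minimal sketch of both masking rules, assuming the tree is available as a plain parent-to-children dictionary (a hypothetical representation; the pipeline itself works on R tree objects). A branch is identified by the node it leads to, and it keeps weight 1 only if at least one tip below it has both orthologs.

```python
def branch_mask(children, root, has_both):
    """Return {node: 0 or 1} for the branch leading to each non-root node."""
    counts = {}
    stack = [(root, False)]
    while stack:  # iterative post-order traversal
        node, expanded = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # Tip: counts only if the species has both orthologs.
            counts[node] = 1 if node in has_both else 0
        elif expanded:
            counts[node] = sum(counts[k] for k in kids)
        else:
            stack.append((node, True))
            stack.extend((k, False) for k in kids)
    return {n: int(c > 0) for n, c in counts.items() if n != root}

# Toy tree ((A,B)n1,C)root for a pair of positions in proteins P1 and P2.
children = {"root": ["n1", "C"], "n1": ["A", "B"]}
presence = {
    "A": {"P1": True, "P2": False},
    "B": {"P1": False, "P2": True},
    "C": {"P1": True, "P2": True},
}
has_both = {s for s, row in presence.items() if row["P1"] and row["P2"]}
mask = branch_mask(children, "root", has_both)
print(mask["A"], mask["B"], mask["n1"], mask["C"])  # 0 0 0 1
```

Here n1 is masked even though each of its descendants has one of the two proteins, because no single species below it has both.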
Whenever a script reads both an MSA and a tree, the MSA is reordered to match tree$tip.label and the pipeline errors if the sets differ.
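The same check is easy to mirror outside the pipeline. The Python sketch below shows the idea with a hypothetical helper; the actual scripts do this in R against tree$tip.label.

```python
def reorder_to_tips(seqs, tip_labels):
    """Return the sequences in tree-tip order; fail if the name sets differ."""
    not_in_msa = sorted(set(tip_labels) - set(seqs))
    not_in_tree = sorted(set(seqs) - set(tip_labels))
    if not_in_msa or not_in_tree:
        raise ValueError(
            f"MSA/tree name mismatch; not in MSA: {not_in_msa}; not in tree: {not_in_tree}"
        )
    # Dicts keep insertion order, so the result follows the tip order.
    return {tip: seqs[tip] for tip in tip_labels}

ordered = reorder_to_tips({"mouse": "MKT", "human": "MKL"}, ["human", "mouse"])
print(list(ordered))  # ['human', 'mouse']
```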
The scripts follow the same folder conventions as original PHACE. Typical layout:
PHACE_Codes/
  load_metadata.R
  ToleranceScore.R
  MSA1.R
  MSA2.R
  Part1_MSA1.R
  Part1_MSA2.R
  coev_diff_MSA1.R
  coev_diff_MSA2.R
  GetTotalChangeMatrix.R
  PHACE_parallel.R
  merge_PHACE_scores.py
Data/
  vals_MSA1.txt
  vals_MSA2.txt
ToleranceScores/
MSA1/
MSA2/
MSA_ASRs/
Part1_AC/
Part1_Gap/
totalChanges/
PHACE_scores/<id>/
(If a folder does not exist, create it before running the corresponding step.)
At minimum:
- Concatenated AA alignment (FASTA)
- Tree (Newick / IQ‑TREE .treefile)
- ASR output (IQ‑TREE .state)
- Ortholog presence table (.tsv)
- Protein boundaries (.csv/.txt)
Partitioned tree search and ASR are optional but recommended for concatenated multi-protein alignments. If partitioned ASR outputs are provided, the pipeline recognizes the partition structure and maps sites accordingly.
For ASR of MSA1 and MSA2 you will also use:
- Data/vals_MSA1.txt
- Data/vals_MSA2.txt
Below, <id> is your analysis identifier (used as filename prefix), and paths are examples.
mkdir -p ToleranceScores MSA1 MSA2 MSA_ASRs Part1_AC Part1_Gap totalChanges PHACE_scores/<id>
Command
Rscript PHACE_Codes/ToleranceScore.R \
<id> \
<concatenatedMSA.fasta> \
<tree.treefile> \
<ASR.state> \
<boundaries.csv> \
<ortholog_selection_table.tsv>
Output
ToleranceScores/<id>.csv
Note: This step is the only one that needs the ASR .state file.
Build MSA1
Rscript PHACE_Codes/MSA1.R \
<id> \
<concatenatedMSA.fasta> \
<boundaries.csv> \
<ortholog_selection_table.tsv>
Output
MSA1/<id>_MSA1.fasta
Run IQ‑TREE2 ASR on MSA1
iqtree2 -s MSA1/<id>_MSA1.fasta -te <tree.treefile> -blfix \
-m Data/vals_MSA1.txt -asr --prefix MSA_ASRs/<id>_MSA1 --safe
This produces:
- MSA_ASRs/<id>_MSA1.state
- MSA_ASRs/<id>_MSA1.treefile (and other IQ‑TREE outputs)
Build MSA2
Rscript PHACE_Codes/MSA2.R \
<id> \
<concatenatedMSA.fasta> \
<boundaries.csv> \
<ortholog_selection_table.tsv>
Output
MSA2/<id>_MSA2.fasta
Run IQ‑TREE2 ASR on MSA2
iqtree2 -s MSA2/<id>_MSA2.fasta -te <tree.treefile> -blfix \
-m Data/vals_MSA2.txt -asr --prefix MSA_ASRs/<id>_MSA2 --safe

MSA1 (amino‑acid change mapping)
Rscript PHACE_Codes/Part1_MSA1.R <id>
MSA2 (gap change mapping)
Rscript PHACE_Codes/Part1_MSA2.R <id>
Outputs go to:
- Part1_AC/
- Part1_Gap/

Rscript PHACE_Codes/GetTotalChangeMatrix.R <id>
Output:
totalChanges/<id>_TotalChange.RData
PHACE_parallel.R is designed to be run as a job array. Arguments:
- <id>
- <array_task_id> (1‑based)
- <num_jobs> (total array size)
- <ortholog_selection_table.tsv>
- <boundaries.csv>
- <concatenatedMSA.fasta> (original MSA fasta)
Example (single task):
Rscript PHACE_Codes/PHACE_parallel.R \
<id> \
1 \
100 \
<ortholog_selection_table.tsv> \
<boundaries.csv> \
<concatenatedMSA.fasta>
Outputs:
PHACE_scores/<id>/<id>_PHACE_part<task>.RData
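If no scheduler is available, the per-task command lines can be generated from the documented argument order. The Python helper below is a hypothetical convenience, not part of the repository; the run identifier and file names in the example are placeholders.

```python
def array_commands(run_id, num_jobs, ortholog_table, boundaries, msa,
                   script="PHACE_Codes/PHACE_parallel.R"):
    """Build one Rscript argument list per array task (task ids are 1-based)."""
    return [
        ["Rscript", script, run_id, str(task), str(num_jobs),
         ortholog_table, boundaries, msa]
        for task in range(1, num_jobs + 1)
    ]

commands = array_commands(
    "run1", 3, "ortholog_selection_table.tsv", "boundaries.csv", "concat.fasta"
)
print(len(commands))  # 3
print(" ".join(commands[0]))
```

Each list can be passed to subprocess.run(cmd, check=True), or joined into a line of a scheduler submission script.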
This repo includes merge_PHACE_scores.py, which merges the PHACE_part*.RData outputs into a single file and writes:
- <id>_PHACE_internalBranchEffectRemoved_scores.feather
- <id>_PHACE_internalBranchEffectRemoved_scores.npz
Usage:
python merge_PHACE_scores.py <id> <num_parts>
Dependencies: numpy, pandas, pyreadr.
If you see NameError: name 'sys' is not defined, add import sys at the top of merge_PHACE_scores.py.
- MSA sequence names must match tree tip labels exactly.
- Species names in ortholog_selection_table.tsv must match those as well.
- This version will error out if the sets differ.
- Boundary protein_name values must match the ortholog table column names.
- If a protein in boundaries is not present as a column in the ortholog table, masking for that protein cannot be applied.
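These naming rules can be checked before starting a long run. The Python sketch below is a hypothetical pre-flight helper that takes the names as plain lists, however they were read, and reports every mismatch it finds.

```python
def naming_problems(tip_labels, table_species, table_proteins, boundary_proteins):
    """Return a list of human-readable naming problems (empty if all is well)."""
    problems = []
    species_diff = sorted(set(tip_labels) ^ set(table_species))
    if species_diff:
        problems.append(f"species differ between tree and ortholog table: {species_diff}")
    unmasked = sorted(set(boundary_proteins) - set(table_proteins))
    if unmasked:
        problems.append(f"boundary proteins without an ortholog-table column: {unmasked}")
    return problems

issues = naming_problems(
    tip_labels=["human", "mouse"],
    table_species=["human", "Mouse"],   # case mismatch on purpose
    table_proteins=["P1", "P2"],
    boundary_proteins=["P1", "P2", "P3"],
)
for line in issues:
    print(line)
```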
Your boundary file contains a protein name not present in ortholog_selection_table.tsv header. Fix by making names identical in both files.
Your FASTA rownames and tree tip labels differ. Fix naming (or reorder/rename sequences) so they match exactly.
Internal‑branch masking requires that totalChanges branch labels can be mapped to tree nodes. In typical runs, mat_info[,2] contains tip labels and internal node labels. If your tree lacks internal node labels, you must either label internal nodes or change the mapping strategy.
If you use PHACE in published research, cite the original PHACE paper / repository, and describe these modifications (multi‑protein concatenation and artificial‑gap masking) in Methods.