This script identifies positive selection on a specific branch for batches of gene families using the PAML branch-site model.
Biopython == 1.79
paml == 4.9j
macse == v2.04
Pandas == 1.24
PS: Other versions should work as well.
python BatchPAML.py -h
For general use, you need to provide a file containing two columns without a header:
Family_name | MSA_file |
---|---|
Family1 | Family1_aligned.fasta |
Family2 | Family2_aligned.fasta |
I recommend using MACSE to build the alignments.
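The two-column input file can be generated rather than typed by hand. The sketch below is only an illustration: the helper name `write_family_list`, the `_aligned.fasta` suffix convention, and the tab delimiter are assumptions, not something BatchPAML.py prescribes, so check `python BatchPAML.py -h` for the delimiter the script actually expects.

```python
import csv
from pathlib import Path


def write_family_list(msa_dir, out_path, suffix="_aligned.fasta"):
    """Write one 'family<TAB>MSA path' row per alignment found in msa_dir.

    Assumptions (hypothetical, not taken from BatchPAML.py): the family name
    is the file name without `suffix`, and the columns are tab-separated.
    """
    rows = []
    for msa in sorted(Path(msa_dir).glob(f"*{suffix}")):
        family = msa.name[: -len(suffix)]  # Family1_aligned.fasta -> Family1
        rows.append((family, str(msa)))
    with open(out_path, "w", newline="") as handle:
        # No header row is written, matching the format described above.
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
    return rows
```

Calling `write_family_list("alignments", "families.tsv")` would list every `*_aligned.fasta` file under the hypothetical `alignments` directory.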
This mode is designed for the single-copy families identified by OrthoFinder.
You only need to prepare a FASTA file containing the CDS sequences that correspond to the protein sequences used in OrthoFinder, and to specify some result files from OrthoFinder.
PS: The species tree constructed by OrthoFinder is sufficient. There is no need to use a separate gene tree for each gene family.
The result file contains two columns without a header:
Family_name | p-value |
---|---|
Family1 | 1.0 |
Family2 | 0.5921 |
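A result file in this shape is easy to post-process. The following is a minimal sketch, assuming the two columns are tab-separated (an assumption; adjust `delimiter` if your file differs). The function name `significant_families` and the `alpha` threshold are illustrative and not part of BatchPAML.py.

```python
import csv


def significant_families(result_path, alpha=0.05, delimiter="\t"):
    """Return (family, p_value) pairs with p_value < alpha, smallest first.

    Assumes a header-less, two-column file: family name, then p-value.
    """
    hits = []
    with open(result_path, newline="") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            family, p_value = row[0], float(row[1])
            if p_value < alpha:
                hits.append((family, p_value))
    return sorted(hits, key=lambda item: item[1])
```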
- Converts MSAs from FASTA format to PAML format automatically
- Unroots rooted trees automatically
- Runs in parallel with multiple threads
- Allows the codon table to be specified manually (by NCBI table number)
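To make the FASTA-to-PAML conversion concrete, here is a rough standalone sketch of what such a step can look like. It is not the code used inside BatchPAML.py; the function names `read_fasta` and `fasta_to_paml` are hypothetical. It writes the sequential format PAML reads: a first line with the number of sequences and the alignment length, then one line per sequence with the name separated from the sequence by two spaces.

```python
def read_fasta(path):
    """Parse a FASTA file into an ordered list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                # Keep only the first word of the header as the name.
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def fasta_to_paml(fasta_path, paml_path):
    """Write an aligned FASTA file in PAML's sequential format."""
    records = read_fasta(fasta_path)
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences in an alignment must have equal length")
    with open(paml_path, "w") as out:
        out.write(f" {len(records)} {lengths.pop()}\n")
        for name, seq in records:
            out.write(f"{name}  {seq}\n")  # two spaces end the name
    return len(records)
```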
- You need to mark the foreground branch in the tree manually using "#1". Please do not insert a space between the species name and the marker.
  example:
  (((Human#1, chimpanzee),Fish),Fly);✓
  (((Human #1, chimpanzee),Fish),Fly);✗
- Although this script searches for the PAML binary in PATH and should be cross-platform, I recommend specifying the path to the binary manually.
- The multiple sequence alignment (MSA) must be in FASTA format.
- If the MSA has a potential frameshift, the family will be skipped.
- The intermediate files are stored in 'BatchPAML_Results' in the working directory.
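The foreground-marker rule above can be checked before starting a long batch run. The snippet below is a small sketch of such a pre-flight check; the function `check_foreground_marker` is hypothetical and not part of BatchPAML.py. It only looks at the Newick text for a missing marker or for whitespace in front of a marker.

```python
import re

# Whitespace directly before a branch marker, e.g. "Human #1".
SPACED_MARKER = re.compile(r"\s+#\d+")


def check_foreground_marker(newick, marker="#1"):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if marker not in newick:
        problems.append(f"no {marker} marker found")
    if SPACED_MARKER.search(newick):
        problems.append("whitespace before a branch marker")
    return problems
```

With the two example trees above, the first returns an empty list and the second reports the whitespace problem.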
- Allow MSA file in paml format directly
- Fix frameshift problem
- More flexible file storage path
- Maybe a part of a comparative genome analysis pipeline
If you have any problems or suggestions, please feel free to contact me at njbxhzy at hotmail.com