-
Notifications
You must be signed in to change notification settings - Fork 0
MSA combining
In MetaClock, we deviced a cutting-edge feature for merging multiple genome alignment reconstructed from different references in order to guarantee the reliability of phylogenetic and molecular clocking analysis.
Reconstructing single-reference genome alignment has been elaborated in the section of automated genome reconstruction. Here, for enhancing the tutorial emjoyment, we provided four M. orali genome alignments, GCF_902384065.fna, GCF_900289035.fna and GCA_001639275.fna, reconstructed using three different reference genomes. Note: the file name indicates the reference genome used.
metaclock_combiner
here is utilized for combing:
metaclock_combiner three_alignments --nproc 3 --output_folder three_alns_combined
Output file called multiple_alignments_combination.fna
in the three_alns_combined
folder.
usage: metaclock_combiner [-h] [-l LENGTH] [-i IDENTITY] [-p NPROC]
[-hf HOMO_SITE_IN_REFS] [-d OUTPUT_FOLDER]
[-int INTERMEDIATE_FOLDER] [-vt VOTING_THRESHOLD]
[-cds]
[alignments_folder]
positional arguments:
alignments_folder Specify the folder which contains alignments built
using single reference with metaclock_mac. Note:
alignment file name should be consistent with
reference genome name inside fasta
optional arguments:
-h, --help show this help message and exit
-l LENGTH, --length LENGTH
The minimum length of homologous sequence alignment to
keep as a hit. default: [500]
-i IDENTITY, --identity IDENTITY
The minium identity of homologous sequences to keep as
a hit. default: [95.0]
-p NPROC, --nproc NPROC
Number of processors to use
-hf HOMO_SITE_IN_REFS, --homo_site_in_refs HOMO_SITE_IN_REFS
The number of references site must be in to be
homologous. default: [The total number of references
used]
-d OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
Specify a name for output folder. default: [outputs]
in current directory
-int INTERMEDIATE_FOLDER, --intermediate_folder INTERMEDIATE_FOLDER
Specify a name for intermediate folder. default:
[intermediates] in current directory
-vt VOTING_THRESHOLD, --voting_threshold VOTING_THRESHOLD
Specify a threshold for majority rule in merging
columns. default: [0.5]
-cds, --CDS_region Consider sites only within coding sequences. Note:
make sure prokka is installed in the system path.
metaclock_combiner
aims at reducing the bias by deciding a consensus alignment based on multiple reference genomes and its effect on reducing missing information might be limited, thus it is highly recommended to inspect the genome-wide missing information before moving to next steps.
metaclock_visualizer multiple_alignments_combination.fna combined_alignment.png -w 10000 -s 2000 -opt_text combined_aln_stats.txt
cat combined_aln_stats.txt
SeqHeader GapRatio AlignmentLength SequenceLength GCcontent
ERR3003614.fna 0.08 2122303 1943556 0.28
ERR3003615.fna 0.06 2122303 1992500 0.28
ERR3003619.fna 0.08 2122303 1959961 0.28
ERR3003621.fna 0.07 2122303 1978467 0.28
ERR3003622.fna 0.1 2122303 1915707 0.28
ERR3003623.fna 0.47 2122303 1130730 0.27
ERR3003632.fna 0.08 2122303 1955626 0.28
ERR3003637.fna 0.1 2122303 1906953 0.28
ERR3003640.fna 0.07 2122303 1976226 0.28
ERR3003644.fna 0.07 2122303 1972801 0.28
ERR3003646.fna 0.1 2122303 1919899 0.28
ERR3003647.fna 0.4 2122303 1282825 0.29
ERR3003651.fna 0.02 2122303 2071666 0.28
GCA_001639275.fna 0.0 2122303 2121707 0.28
GCF_000529525.fna 0.01 2122303 2110723 0.28
GCF_900289035.fna 0.0 2122303 2121269 0.28
GCF_902384065.fna 0.01 2122303 2110723 0.28
SRR6877288.fna 0.09 2122303 1924785 0.28
SRR6877339.fna 0.42 2122303 1241040 0.3
SRR6877399.fna 0.31 2122303 1464368 0.3
Based on missing information inspection, we can keep all reconstructed genomes and gap threshold setting as 0.05 for each column will do ! So we use metaclock_tailor basic
to remove missing information, generate stats after tailoring, and infer a ML tree.
metaclock_tailor basic multiple_alignments_combination.fna combined_alignment_tailored.fna -s tailored_stats.txt -tt genomes_kept.txt,0.05 --raxml_output ML_tree_combined_tailored -nproc 5
Strain originated from ancient samples are marked as blue and those from modern dataset are marked as red.