Skip to content

MSA combining

Kun Huang edited this page Mar 18, 2021 · 1 revision

Combining multiple reference alignments

In MetaClock, we deviced a cutting-edge feature for merging multiple genome alignment reconstructed from different references in order to guarantee the reliability of phylogenetic and molecular clocking analysis.

Step 1. Reconstruct individual genome alignments from different references

Reconstructing single-reference genome alignment has been elaborated in the section of automated genome reconstruction. Here, for enhancing the tutorial emjoyment, we provided four M. orali genome alignments, GCF_902384065.fna, GCF_900289035.fna and GCA_001639275.fna, reconstructed using three different reference genomes. Note: the file name indicates the reference genome used.

Step 2. Combing three genome alignments

metaclock_combiner here is utilized for combing:

metaclock_combiner three_alignments --nproc 3 --output_folder three_alns_combined

Output file called multiple_alignments_combination.fna in the three_alns_combined folder.

usage: metaclock_combiner [-h] [-l LENGTH] [-i IDENTITY] [-p NPROC]
                          [-hf HOMO_SITE_IN_REFS] [-d OUTPUT_FOLDER]
                          [-int INTERMEDIATE_FOLDER] [-vt VOTING_THRESHOLD]
                          [-cds]
                          [alignments_folder]

positional arguments:
  alignments_folder     Specify the folder which contains alignments built
                        using single reference with metaclock_mac. Note:
                        alignment file name should be consistent with
                        reference genome name inside fasta

optional arguments:
  -h, --help            show this help message and exit
  -l LENGTH, --length LENGTH
                        The minimum length of homologous sequence alignment to
                        keep as a hit. default: [500]
  -i IDENTITY, --identity IDENTITY
                        The minium identity of homologous sequences to keep as
                        a hit. default: [95.0]
  -p NPROC, --nproc NPROC
                        Number of processors to use
  -hf HOMO_SITE_IN_REFS, --homo_site_in_refs HOMO_SITE_IN_REFS
                        The number of references site must be in to be
                        homologous. default: [The total number of references
                        used]
  -d OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Specify a name for output folder. default: [outputs]
                        in current directory
  -int INTERMEDIATE_FOLDER, --intermediate_folder INTERMEDIATE_FOLDER
                        Specify a name for intermediate folder. default:
                        [intermediates] in current directory
  -vt VOTING_THRESHOLD, --voting_threshold VOTING_THRESHOLD
                        Specify a threshold for majority rule in merging
                        columns. default: [0.5]
  -cds, --CDS_region    Consider sites only within coding sequences. Note:
                        make sure prokka is installed in the system path.

Step 3. Inspecting the merged alignment quality

metaclock_combiner aims at reducing the bias by deciding a consensus alignment based on multiple reference genomes and its effect on reducing missing information might be limited, thus it is highly recommended to inspect the genome-wide missing information before moving to next steps.

metaclock_visualizer multiple_alignments_combination.fna combined_alignment.png -w 10000 -s 2000 -opt_text combined_aln_stats.txt
cat combined_aln_stats.txt

SeqHeader	GapRatio	AlignmentLength	SequenceLength	GCcontent
ERR3003614.fna	0.08	2122303	1943556	0.28
ERR3003615.fna	0.06	2122303	1992500	0.28
ERR3003619.fna	0.08	2122303	1959961	0.28
ERR3003621.fna	0.07	2122303	1978467	0.28
ERR3003622.fna	0.1	2122303	1915707	0.28
ERR3003623.fna	0.47	2122303	1130730	0.27
ERR3003632.fna	0.08	2122303	1955626	0.28
ERR3003637.fna	0.1	2122303	1906953	0.28
ERR3003640.fna	0.07	2122303	1976226	0.28
ERR3003644.fna	0.07	2122303	1972801	0.28
ERR3003646.fna	0.1	2122303	1919899	0.28
ERR3003647.fna	0.4	2122303	1282825	0.29
ERR3003651.fna	0.02	2122303	2071666	0.28
GCA_001639275.fna	0.0	2122303	2121707	0.28
GCF_000529525.fna	0.01	2122303	2110723	0.28
GCF_900289035.fna	0.0	2122303	2121269	0.28
GCF_902384065.fna	0.01	2122303	2110723	0.28
SRR6877288.fna	0.09	2122303	1924785	0.28
SRR6877339.fna	0.42	2122303	1241040	0.3
SRR6877399.fna	0.31	2122303	1464368	0.3

Combined alignment visualization

Step 4. Trimming missing information and inferring a phylogenetic tree

Based on missing information inspection, we can keep all reconstructed genomes and gap threshold setting as 0.05 for each column will do ! So we use metaclock_tailor basic to remove missing information, generate stats after tailoring, and infer a ML tree.

metaclock_tailor basic multiple_alignments_combination.fna combined_alignment_tailored.fna -s tailored_stats.txt -tt genomes_kept.txt,0.05 --raxml_output ML_tree_combined_tailored -nproc 5

ML Tree

Strain originated from ancient samples are marked as blue and those from modern dataset are marked as red.