Skip to content

User manual

Kun Huang edited this page Mar 18, 2021 · 25 revisions

MetaClock

MetaClock is an integrated framework for reconstructing time-revolved evolutionary history for microbiome species using large-scale (meta)genomic data from ancient and contemporary populations.

Overview

Installation

While there are three possible installation solutions, we highly recommend to install MetaClock through Conda in order to avoid the complexity of managing dependencies.

1. Conda environment (Thumb-up)

Note: we strongly suggest that MetaClock should be installed in a new and isolated conda environment thus dependencies can be resolved automatically by Conda.

conda create -n "metaclock" -c bioconda metaclock

2. PyPi

pip install metaclock

3. Repository from GitHub

git clone https://github.com/SegataLab/metaclock.git
cd metaclock
python setup.py install 

Both solution 2 and 3 require install dependencies independently:

Constructing genome alignment

metaclock_mac <configuration_file> -r <reference_genome> \
    --ancient_metagenomes <ancient_metagenomic_data> \
    --modern_metagenomes <modern_metagenomic_data>
    --genome_assemblies <assembled_contigs>
    --output_dir <output_directory>
    --intermediate_dir <intermediate_directory>
  • <configuration_file> is a mandatory input which is used to config parameters (please check the detailed description below). The template configuration file - configs.json - can be automatically generated by metaclock_mac_template_configs and customized directly by users.
  • <reference_genome> is a mandatory input which is usually a single-amplified genome (SAG) or high-quality metagenome assembled genome (MAG) whose whole sequence is stored in a FASTA file. It represents a species from a microbiome. The input path can either be specified in command line or in configuration file.
  • <ancient_metagenomic_data> is an optional input which is a folder contains sub-folders, with each sub-folder containing metagenomic reads, in fastq files (or compressed in .gz or .bzip2, or mixed), from an ancient sample. The input path can be specified in the respective section in the configuration file.
  • <modern_metagenomic_data> is an optional input which is a folder contains sub-folders, with each sub-folder containing metagenomic reads, in fastq files (or compressed in .gz or .bzip2, or mixed), from an contemporary sample. The input path can be specified in the respective section in the configuration file.
  • <genome_assemblies> is an optional input which is a folder contains multiple assembled genomes in FASTA files, each file containing nucleotide sequences from one genome. The assembled genomes are mostly from contemporary samples but in rare case ancient assembled genomes are also available. If ancient assembled genomes were used as input here, please be cautious of genome quality. The input path can be specified in the respective section in the configuration file.
  • <output_directory> is an optional argument which stores the outputs from metaclock_mac. If no specific folder path was given, a new folder will be created in the working directory by default.
  • <intermediate_directory> is an optional argument which stores the intermediate files generated in the process. If no specific folder path was given, a new folder will be created in the working directory by default.

Note: While <ancient_metagenomic_data>, <modern_metagenomic_data> and <genome_assemblies> are three independent and optional inputs, at least one of three kinds must be given in order to construct a genome alignment.

Other optional settings

  • --clean is an option for rerunning analysis from intermediate bam files. Flagging this argument will save you a good amount of time from re-mapping reads, particularly time-efficient when metagenomic data is enormous.
  • --authentication is an option for authenticating ancient origin of metagenomic reads from ancient specimens.
  • --SNV_rate is an option for estimating pairwise SNV rates based on the genome sequences in the reconstructed alignment.

Configuration description

The configuration file configs.json is divided into three sections: ancient_reads, modern_reads, and contigs whose internal parameter settings correspond to input <ancient_metagenomic_data>, <modern_metagenomic_data> and <genome_assemblies> respectively.

{
  "ancient_reads": {
    "input_type":"reads",
    "reference_genome":"",
    "age_type":1,
    "intermediate":"",
    "samples":"",
    "parameter_set":{
      "search_report_mode":"-k,5",
      "bowtie2_threads":10,
      "minimum_mapping_quality":30,
      "minimum_mapping_length":30,
      "maximum_snp_edit_distance":0.03,
      "nproc":5,
      "minimum_coverage":5,
      "trim_distance":"5:5",
      "dominant_allele_frequency":0.8,
      "output_trimmed_reads":0
    }
  },
  "modern_reads": {
    "input_type":"reads",
    "reference_genome":"",
    "age_type":2,
    "intermediate":"",
    "samples":"",
    "parameter_set":{
      "search_report_mode":"-k,1",
      "bowtie2_threads":15,
      "minimum_mapping_quality":30,
      "minimum_mapping_length":30,
      "maximum_snp_edit_distance":0.03,
      "nproc":5,
      "minimum_coverage":5,
      "dominant_allele_frequency":0.8
    }
  },
  "contigs": {
    "input_type":"contigs",
    "reference_genome":"",
    "intermediate":"",
    "samples":"",
    "parameter_set":{
      "homolog_length":500,
      "homolog_identity":95,
      "blastn_threads":10
    }
  }
}

1. Ancient data input as reads (ancient_reads):

This section is specific to the configuration of processing sequencing reads from ancient (meta)genomic samples.

  • input_type <string> [optional]:   The type of genomic information. Default: "reads".
  • reference_genome <string> [required]   The reference genome representative of a microbial species, in FASTA format. This parameter is equivalent to --reference in the --help menu.
  • age_type <int> [optional]:   Age type of sample ( 1 indicates ancient type and 2 indicates contemporary type) . Default: 1.
  • intermediate <string> [optional]:   The directory for storing intermediate files. Default: "intermediates" in current directory. This parameter is equivalent to --intermediate_dir in the --help menu.
  • samples <string> [required]: The directory holding ancient metagenomics reads folders. This parameter is equivalent to --ancient_metagenomes in the --help menu.
  • search_report_mode <"-k,int"> [optional]:   The upper limit on the number of alignments Bowtie 2 should report in the search of alignments. Default: "-k,5". (The higher the value is set the more computational time required , but more accurate alignment returned.)
  • bowtie2_threads <int> [optional]:   The thread number for bowtie2 to process each sample. Default: [1].
  • minimum_mapping_quality <int> [optional]:   Bases with quality score lower than the specified value will be ignored in reconstructing genome alignment. Default: [30].
  • minimum_mapping_length <int> [optional]:   Aligned reads with length shorter than the specified value will be ignored in reconstructing genome alignment. Default: [30].
  • maximum_snp_edit_distance <float> [optional]:   Reads with SNP edit distance greater than the specified value will be ignored in reconstructing genome alignment. Default: [0.03].
  • nproc <int> [optional]:   The number of processors for handling multiple samples in parallel. Default: [1].
  • minimum_coverage <int> [optional]:   A position with a coverage depth lower than the specified value will be ignored in reconstructing genome alignment. Default: [5].
  • trim_distance <"int:int"> [optional]:   The number of nucleotides to trim at two ends of ancient reads. These positions are likely post-mortem damages. Default: "5:5".
  • dominant_allele_frequency <float> [optional]:   A position with the degree of dominant allele lower than the specified value will be ignored in reconstructing genome alignment. Default: [0.8].
  • output_trimmed_reads <int> [optional]: Specify 1 if you need to output aligned trimmed reads which are used in reconstructing genome alignment; Specify 0 if you want to skip this feature. Default: [0].


2. Modern data input as reads(modern_reads):

This section is specific to the configuration of processing sequencing reads from modern (meta)genomic samples.

  • input_type <string> [optional]:   The type of genomic information. Default: "reads".
  • reference_genome <string> [required]   The reference genome representative of a microbial species, in FASTA format. This parameter is equivalent to --reference in the --help menu.
  • age_type <int> [optional]:   Age type of sample ( 1 indicates ancient type and 2 indicates contemporary type) . Default: 2.
  • intermediate <string> [optional]:   The directory for storing intermediate files. Default: "intermediates" in current directory. This parameter is equivalent to --intermediate_dir in the --help menu.
  • samples <string> [required]: The directory holding modern metagenomics reads folders. This parameter is equivalent to --modern_metagenomes in the --help menu.
  • search_report_mode <"-k,int"> [optional]:   The upper limit on the number of alignments Bowtie 2 should report in the search of alignments. Default: "-k,5". (The higher the value is set the more computational time required , but more accurate alignment returned.)
  • bowtie2_threads <int> [optional]:   The thread number for bowtie2 to process each sample. Default: [1].
  • minimum_mapping_quality <int> [optional]:   Bases with quality score lower than the specified value will be ignored in reconstructing genome alignment. Default: [30].
  • minimum_mapping_length <int> [optional]:   Aligned reads with length shorter than the specified value will be ignored in reconstructing genome alignment. Default: [30].
  • maximum_snp_edit_distance <float> [optional]:   Reads with SNP edit distance greater than the specified value will be ignored in reconstructing genome alignment. Default: [0.03].
  • nproc <int> [optional]:   The number of processors for handling multiple samples in parallel. Default: [1].
  • minimum_coverage <int> [optional]:   A position with a coverage depth lower than the specified value will be ignored in reconstructing genome alignment. Default: [5].
  • dominant_allele_frequency <float> [optional]:   A position with the degree of dominant allele lower than the specified value will be ignored in reconstructing genome alignment. Default: [0.8].


3. Assembled genomes input as contigs (contigs):

This section is specific to the configuration of processing assembled genomes as input.

  • input_type <string> [optional]:   The type of genomic information. Default: "contigs".
  • reference_genome <string> [required]:   The reference genome representative of a microbial species, in FASTA format. This parameter is equivalent to --reference in the --help menu.
  • intermediate <string> [optional]:   directory for storing intermediate files. Default: "". This parameter is equivalent to --intermediate_dir in the --help menu.
  • samples <string> [required]: Specify the directory holding assembled contigs files. Default: "". This parameter is equivalent to --genome_assemblies in the --help menu.
  • homolog_length <int> [optional]:   The minimum length for the aligned part of a contig to be considered as a homolog in order to be used in reconstructing genome alignment. Default: [500].
  • homolog_identity <float> [optional]:   The minimum identity for the aligned part of a contig to be considered as a homolog in order to be used in reconstructing genome alignment. Default: [95.0]
  • blastn_threads <int> [optional]:   Specify the number of threads used for blastn. Note: please use < 8 CPUs if you are using blastn >2.7.0.


Output

The results of metaclock_mac are available in the <mac_output>folder (or in the folder specified with --output_dir) and include:

  • <output_dir>.fna The reconstructed genome alignment for a given species.
  • mac_stats.tsv The basic statistics of the reconstructed genome alignment, including GC content, missing data proportion, alignment length, and reconstructed sequence length of each strain.
  • damage_pattern.png (if --authentication was flagged) The estimated damage pattern for ancient metagenomic reads for authenticating their ancient origin.
  • snv_rates_distribution.png (if --SNV_rate was flagged) The histogram of pairwise SNV rates between reconstructed strain genomes.
  • snv_rates_heatmap.png (if --SNV_rate was flagged) The heatmap of SNV rates between reconstructed strain genomes

Genome alignment visualization

metaclock_visualizer <genome_alignment> \
<output_figure> --window_size <sliding_window_size> \
--step_size <moving_step_size> \
--label_distance <label_distance_to_figure> \
--output_text <output_text>

Where:

  • <genome_alignment> is a mandatory input which is the reconstructed genome alignment generated by metaclock_mac.
  • <output_figure> is the visualization result whose name (with suffix specifying its format, e.g .png, svg and so on) should be specified by users.
  • <sliding_window_size> is an integer specifying the size of sliding window when assessing missing data for different genomic regions.
  • <moving_step_size> is an integer specifying the step size when sliding windows move along the genome.
  • <output_text> is an optional output which store the result in txt format.

Genome alignment tailoring

In this section, we elaborate two tailoring strategies in MetaClock (metaclock_tailor basic and metaclock landscape) for enhancing the quality of downstream analysis.

Basic tailoring

Basic tailoring aims to exclude missing data from the whole genome alignment using threshold specified by users or simply by automatic trimming algorithm.

If using automatic trimming algorithm:

metaclock_tailor basic <genome_alignment> \
<tailored_alignment> --automated_tailoring \
--stats <stats_tailored_output>

Where:

  • <genome_alignment> is a mandatory input which is the reconstructed genome alignment generated by metaclock_mac. Or the multiple nucleotide sequence alignment generated by other tools.
  • <tailored_alignment> is the output which is the multiple sequence alignment after being tailored.
  • <stats_tailored_output> is a text file which details basic statistics of the tailored genome alignment.

if using user-defined tailoring strategy:

metaclock_tailor basic <genome_alignment> \
<tailored_alignment> \ 
--target_tailoring <target_genomes>,<missing_data_threshold> \
--stats <stats_tailored_output>

Where:

  • <genome_alignment> is a mandatory input which is the reconstructed genome alignment generated by metaclock_mac. Or the multiple nucleotide sequence alignment generated by other tools.
  • <tailored_alignment> is the output which is the multiple sequence alignment after being tailored.
  • <target_genome> is a text which contains a list of genomes to tailor on (i.e. the genomes not included in the file will be removed when tailoring and generating processed alignment, mostly due to that those genomes contain too much missing information). Each row is spared for one genome. For example:
Genome_1
Genome_2
.
.
Genome_n
  • <missing_data_threshold> is a float (<= 1.0) which specifies the maximum missing data allowed in each column when only considering the target genomes in <target_genome> file

Optional settings

  • --raxml_output A maximum-likelihood tree will be inferred based on the tailored alignment if users specify the output folder with this parameter.
  • --number_of_processors Specifying the number of CPUs to use in tree inferring.

Landscape tailoring

Landscape tailoring aims to curate the genome alignment based on the genome annotation profile, by selecting and concatenating sub-alignment blocks with rich estimates such as missing data, SNV density, sub-alignment length and many other.

metaclock_tailor landscape <whole_genome_alignment> --opt_dir <output_directory> 
  • <genome_alignment> is a mandatory input which must be the reconstructed genome alignment generated by metaclock_mac. Notice: the alignment has to be built on the whole genome sequence (preferably by reliable reference genome) for generating an accurate annotation profile.
  • <output_directory> is an output folder which contains all resulted files.

Optional settings

  • --gff3_annotation Input the gff3 annotation file. (Please use gff file from the same reference sequence in reconstructing whole genome alignment in order to maintain the consistency). If no gff3 file was provided automatic annotation would be performed.
  • --select Flag this option to select and concatenate your interesting regions.
  • --number_of_processors Specify the number of processor to be used if you want parallel computation.
  • --number_of_samples Specify the minimum number of samples with <10 % missing data to select the sub-alignment. This parameter is coupled with --select.
  • --minimum_length Specify the minimum length of a sub-alignment to be selected. This parameter is coupled with --select.
  • --maximum_missing_information Specify the minimum missing data of a sub-alignment to be selected. This parameter is coupled with --select.
  • --maximum_snv_density Specify the maximum SNV density for selecting a sub-alignment in order to exclude hypervariable regions. This parameter is coupled with --select.
  • --minimum_snv_densit Specify the minimum SNV density for selecting a sub-alignment in order to exclude less variable regions. This parameter is coupled with --select.
  • --average_pairwise_distances Specify the maximum average pairwise distances for selecting a sub-alignment. This parameter is coupled with --select.
  • --pairwise_distances_stdv Specify the maximum standard deviation of pairwise distances for selecting a sub-alignment. This parameter is coupled with --select.
  • --feature Specify the types [CDS, tRNA, nCDS] to select. If more than one type comma should be used to delimit. e.g. CDS,tRNA,nCDS: Output all kinds of alignment. This parameter is coupled with --select.
  • --fast_temporal_signal_test Inputting a mapping file containing sample names and sample dates will allow a fast estimation for temporal signal. Note: this option has to be used together with --raxml_tree. For example:
Genome_1  2000
Genome_2  1500
.
.
Genome_n  5000

Genome name and dates (Years Before Present) are delimited by tab.

  • --raxml_tree Reconstruct a maximum likelihood tree using raxmlHPC.

Multiple alignments combining

Among many other features in MetaClock, we introduce a novel algorithm to improve phylogeny accuracy and reliability combing multiple alignments from different references.

metaclock_combiner <genome_alignments> 

Where <genome_alignments> is a mandatory input folder which contains multiple genome alignments built in preceding steps.

Note: To make software run smoothly, alignment file names should use a consistent name. See examples [here](superlink to tutorial of combining multiple alignments).

Optional settings

  • --length The minimum length for blastn hits to be kept. A longer length parameter setting might result in a shorter combined alignment length.
  • --identity The minimum percentage identity for a blastn hits to be kept. A higher identity (used when reference genomes are highly similar) might result in a shorter combined alignment length.
  • --nproc The number of CPUs to use for parallel computation.
  • --homo_site_in_refs The number of references a site must be in to be homologous. A lower number might result in redundancy of the combined alignment.
  • --output_folde The name for the output folder.
  • --intermediate_folder The name for the intermediate file folder.
  • --voting_threshold A voting threshold for merging homologous columns.
  • --CDS_region Only considering sites from coding sequence regions if this parameter is flagged.

metaclock_mac

usage: metaclock_mac [-h] [-r REFERENCE] [-a_ipt ANCIENT_METAGENOMES]
                     [-m_ipt MODERN_METAGENOMES] [-g_ipt GENOME_ASSEMLIES]
                     [-o OUTPUT_DIR] [-int INTERMEDIATE_DIR] [-c] [-a]
                     [-snv_rate] config_file

Reconstruct whole-genome-level MSA from large-scale datasets.

positional arguments:

    config_file Input the configuration file.

optional arguments:

    -h, --help             show this help message and exit
    -r REFERENCE, --reference  REFERENCE
                           Input reference genome sequence in the fasta format.
    -a_ipt ANCIENT_METAGENOMES, --ancient_metagenomes ANCIENT_METAGENOMES
                           Input an ancient-metagenome folder containing sub-folders,
                           each sub-folder contains sequencing reads from one ancient
                           sample. [Support suffix of .fastq/.fastq.gz/.fastq.bz2]
    -m_ipt MODERN_METAGENOMES, --modern_metagenomes MODERN_METAGENOMES
                           Input a modern-metagenome folder containing sub-folders, 
                           each sub-folder contains sequencing reads from one modern sample.
                           [Support suffix of .fastq/.fastq.gz/.fastq.bz2]
    -g_ipt GENOME_ASSEMLIES, --genome_assemlies GENOME_ASSEMLIES
                           Input a genome-assemblies folder containing assembled genomes, 
                           each assembled genome is stored in a fasta file. [Support FASTA format]
    -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                           Specify an output folder for storing results. Default: 
                           [Mac_output] in the working directory.
    -int INTERMEDIATE_DIR, --intermediate_dir INTERMEDIATE_DIR
                           Specify an intermediate folder for storing intermediate files.
                           Default: [intermediates] in the working directory.
    -c, --clean Clean intermediate files and rerun from the beginning, otherwise rerun from the
                           intermediate files
    -a, --authentication   Autheticate the anicent origin of genomic information used in genome 
                           reconstruction.
    -snv_rate, --SNV_rate  Estimate pairwise SNV rates between samples.

metaclock_visualizer

usage: metaclock_visualizer [-h] [-w [WINDOW_SIZE]] [-s [STEP_SIZE]]
                            [-l_dist [LABEL_DISTANCE]]
                            [-opt_text [OUTPUT_TEXT]]
                            [input_alignment] [output_name]

positional arguments:
    input_alignment        Input a MSA alignment in fasta
    output_name            Specify the output name for resulted figure

optional arguments:
    -h, --help             show this help message and exit
    -w [WINDOW_SIZE], --window_size [WINDOW_SIZE]
                           Sliding window size. default:[200bp]
    -s [STEP_SIZE], --step_size [STEP_SIZE]
                           Moving steps of sliding window. default:[50bp]
    -l_dist [LABEL_DISTANCE], --label_distance [LABEL_DISTANCE]
                           The distance betwen text label and the plotting.
                           default: [0.15]
    -opt_text [OUTPUT_TEXT], --output_text [OUTPUT_TEXT]
                           Output a statistics text.

metaclock_tailor basic

usage: landscape basic [-h] [-a] [-s STATS] [-tt TARGET_TAILORING]
                       [whole_genome_alignment] [output_file]

Assess mutational landscape of whole genome alignment and select interested
regions.

positional arguments:
    whole_genome_alignment
                           Input the whole genome alignment generated by
                           metaclock_mac.
    output_file            Specify output name for the tailored genome alignment
                           in fasta format.

optional arguments:
    -h, --help             show this help message and exit
    -a, --automated_tailoring
                           This option allows automated tailoring using trimAl
                           for minimizing missing information in the alignment
    -s STATS, --stats STATS
                           Specify a file name for statistic report of tailored
                           alignment.
    -tt TARGET_TAILORING, --target_tailoring TARGET_TAILORING
                           Input a list of samples to keep for alignment
                           tailoring. The limit for missing info in each column
                           can be given and it is delimited by comma (e.g.
                           shortlist.txt,0.1), or column tailoring will be
                           performed automatically using trimAl (e.g. list.txt).

metaclock_tailor landscape

usage: landscape landscape [-h] [-o OPT_DIR] [-gff3 GFF3_ANNOTATION] [-s]
                           [-nproc NUMBER_OF_PROCESSORS]
                           [-ns NUMBER_OF_SAMPLES] [-l MINIMUM_LENGTH]
                           [-mi MAXIMUM_MISSING_INFORMATION]
                           [-snv_density_max MAXIMUM_SNV_DENSITY]
                           [-snv_density_min MINIMUM_SNV_DENSITY]
                           [-Pdist_mean AVERAGE_PAIRWISE_DISTANCES]
                           [-Pdist_stdv PAIRWISE_DISTANCES_STDV] [-f FEATURE]
                           [-fast_TempEst FAST_TEMPORAL_SIGNAL_TEST] [-raxml]
                           [whole_genome_alignment]

Assess mutational landscape of whole genome alignment and select interesting
regions.

positional arguments:
  whole_genome_alignment
                        Input the whole genome alignment generated by
                        metaclock_mac.

optional arguments:
  -h, --help            show this help message and exit
  -o OPT_DIR, --opt_dir OPT_DIR
                        Sepecify a directory name for outputs. default:
                        "opt_dir" in current working directory.
  -gff3 GFF3_ANNOTATION, --gff3_annotation GFF3_ANNOTATION
                        Input the gff3 annotation file. If no gff3 file was
                        provided automatic annotation would be performed.
                        (Please use gff file from the same reference sequence
                        in resconstructing whole genome alignment in order to
                        maintain the consistency.)
  -s, --select          Flag this option to select your interesting regions.
                        Note: other options need to be specificed.
  -nproc NUMBER_OF_PROCESSORS, --number_of_processors NUMBER_OF_PROCESSORS
                        Specify the number of processor to be used. default:
                        [1]
  -ns NUMBER_OF_SAMPLES, --number_of_samples NUMBER_OF_SAMPLES
                        Specify the minimum number of samples with <10 percent
                        missing information to select the child alignment.
                        Default: [1]
  -l MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
                        Specify the minimum length of a child alignment to be
                        selected. default: [100]
  -mi MAXIMUM_MISSING_INFORMATION, --maximum_missing_information MAXIMUM_MISSING_INFORMATION
                        Specify the minimum missing information of a child
                        alignment to be selected. default: [1.0]
  -snv_density_max MAXIMUM_SNV_DENSITY, --maximum_snv_density MAXIMUM_SNV_DENSITY
                        Specify the maximum SNV density for selecting a child
                        alignment in order to exclude hypervariable regions.
                        default: [1.0]
  -snv_density_min MINIMUM_SNV_DENSITY, --minimum_snv_density MINIMUM_SNV_DENSITY
                        Specify the minimum SNV density for selecting a child
                        alignment in order to exclude hypervariable regions.
                        default: [0.0]
  -Pdist_mean AVERAGE_PAIRWISE_DISTANCES, --average_pairwise_distances AVERAGE_PAIRWISE_DISTANCES
                        Specify the maximum average pairwise distances for
                        selecting a child alignment. default: [1.0]
  -Pdist_stdv PAIRWISE_DISTANCES_STDV, --pairwise_distances_stdv PAIRWISE_DISTANCES_STDV
                        Specify the maximum standard deviation of pairwise
                        distances. default: [1.0]
  -f FEATURE, --feature FEATURE
                        Specify the types [CDS, tRNA, nCDS] to select. If more
                        than one type comma should be used to delimit. e.g.
                        CDS,tRNA,nCDS: Output all kinds of alignment.
  -fast_TempEst FAST_TEMPORAL_SIGNAL_TEST, --fast_temporal_signal_test FAST_TEMPORAL_SIGNAL_TEST
                        Inputting a mapping file containing sample names and
                        sample dates will allow a fast estimation for temporal
                        signal. Note: this option has to be used together with
                        -raxml.
  -raxml, --raxml_tree  Reconstruct a maximum likelihood tree using raxmlHPC.

metaclock_combiner

usage: metaclock_combiner [-h] [-l LENGTH] [-i IDENTITY] [-p NPROC]
                          [-hf HOMO_SITE_IN_REFS] [-d OUTPUT_FOLDER]
                          [-int INTERMEDIATE_FOLDER] [-vt VOTING_THRESHOLD]
                          [-cds]
                          [alignments_folder]

positional arguments:
  alignments_folder     Specify the folder which contains alignments built
                        using single reference with metaclock_mac. Note:
                        alignment file name should be consistent with
                        reference genome name inside fasta

optional arguments:
  -h, --help            show this help message and exit
  -l LENGTH, --length LENGTH
                        minimum length of homologous sequence alignment to
                        keep as a hit. default: [500]
  -i IDENTITY, --identity IDENTITY
                        minium identity of homologous sequences to keep as a
                        hit. default: [95.0]
  -p NPROC, --nproc NPROC
                        Number of processors to use
  -hf HOMO_SITE_IN_REFS, --homo_site_in_refs HOMO_SITE_IN_REFS
                        The number of references site must be in to be homologous. 
                        default: [The total number of references used]
  -d OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Specify a name for output folder. default: [outputs]
                        in current directory
  -int INTERMEDIATE_FOLDER, --intermediate_folder INTERMEDIATE_FOLDER
                        Specify a name for intermediate folder. default:
                        [intermediates] in current directory
  -vt VOTING_THRESHOLD, --voting_threshold VOTING_THRESHOLD
                        Specify a threshold for majority rule in merging
                        columns. default: [0.5]
  -cds, --CDS_region    Consider sites only within coding sequences. Note:
                        make sure prokka is installed in the system path.