Details about configuration parameters

Lior Glick edited this page Nov 2, 2022 · 14 revisions

This page contains detailed descriptions of all parameters found in the pipelines' main (YAML) configurations and the annotation configuration.
Note: when using a parameter value containing spaces, wrap it in double quotes (parameter: "value with spaces").

De-novo configuration

name - This will be used as the job name when running on an HPC cluster.

Input/output

samples_info_file - full path to the LQ samples TSV
hq_genomes_info_file - full path to the HQ samples TSV
out_dir - full path to the output directory

Reference

reference_name - sample name of the reference genome (e.g. HG38, TAIR10)
reference_genome - full path to the fasta file of the reference genome
reference_proteins - full path to the fasta file containing reference protein sequences
reference_annotation - full path to the GFF3 file containing reference gene models
reference_cds - full path to the fasta file containing reference CDS sequences
reference_transcripts - full path to the fasta file containing reference transcript sequences

Reads pre processing (RPP)

trimming_modules - a string describing the various trimming and QA modules to be applied by Trimmomatic. See the Trimmomatic manual for details or use "SLIDINGWINDOW:5:15 MINLEN:40"
merge_min_overlap - minimal number of overlapping bases required to perform PE read merging. With most standard Illumina libraries, 10 should be a reasonable value.
merge_max_mismatch_ratio - maximal ratio of mismatches allowed in the overlapping region for PE read merging to be performed. With most standard Illumina libraries, 0.2 should be a reasonable value.
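
The two merging thresholds can be read as a pair of conditions on the overlap between the mates of a read pair. The following is a minimal sketch of that decision, not the code of the merging tool used by the pipeline; the function name `can_merge` and the exact comparison operators are assumptions made for illustration.

```python
def can_merge(overlap_len: int, mismatches: int,
              merge_min_overlap: int = 10,
              merge_max_mismatch_ratio: float = 0.2) -> bool:
    """Return True if the overlap between two mates passes both thresholds."""
    # An overlap shorter than merge_min_overlap (or no overlap at all) is rejected.
    if overlap_len == 0 or overlap_len < merge_min_overlap:
        return False
    # The fraction of mismatching bases in the overlap must not exceed the ratio.
    return mismatches / overlap_len <= merge_max_mismatch_ratio


print(can_merge(overlap_len=25, mismatches=3))  # True: ratio 0.12, overlap >= 10
print(can_merge(overlap_len=8, mismatches=0))   # False: overlap shorter than 10
print(can_merge(overlap_len=20, mismatches=6))  # False: ratio 0.30 exceeds 0.2
```

Raising merge_min_overlap or lowering merge_max_mismatch_ratio makes merging stricter, so fewer read pairs are merged.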

Assembly

assembler - the assembly software to use. One of: 'spades', 'megahit', or 'minia'. SPAdes usually produces the most contiguous assemblies but is memory-intensive, so it is practical for small genomes only. MEGAHIT is faster, and Minia is very memory-efficient and thus useful for large genomes.
min_length - assembled contigs shorter than this value (in bp) are discarded and not annotated.
busco_set - name of the BUSCO lineage to be used when assessing assembly completeness. You can find the list of available lineages here. Copy and paste the one relevant to your organism.
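
The effect of min_length can be illustrated with a small FASTA filter. This is a sketch under the assumption that contig length is simply the number of bases in the record; `read_fasta` and `filter_contigs` are hypothetical helper names, not functions of the pipeline.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(records: Iterable[Tuple[str, str]],
                   min_length: int) -> List[Tuple[str, str]]:
    """Keep contigs whose length is at least min_length."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


example = """>contig_1
ACGTACGTAC
>contig_2
ACG
""".splitlines()

kept = filter_contigs(read_fasta(example), min_length=5)
print([name for name, _ in kept])  # ['contig_1']
```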

Annotation

transcripts - paths to all transcripts fasta files, separated by commas (including ref). These will be used as annotation evidence.
proteins - paths to all proteins fasta files, separated by commas (including ref). These will be used as annotation evidence.
annotation_yml_template - full path to the annotation configuration yaml.

Annotation filtration

min_protein - proteins shorter than this value (in number of residues) will be discarded. Very short proteins are often a result of erroneous or partial prediction.

Environment

queue - name of the queue to which jobs will be submitted (when using an HPC cluster).
priority - submitted job priority (when using an HPC cluster).
ppn - processes per node. This is the maximal number of CPUs to be used by a single job. When using an HPC cluster, set this value to the number of CPUs available on a single machine in your cluster. When running locally, this is the number of CPUs to be used on the local machine.
max_ram - maximal RAM to be used by a single job. This is usually on the order of gigabytes, so values like 20g or even 100g are reasonable, depending on the resources available on the cluster machines (when using an HPC cluster).
max_jobs - when running on an HPC cluster, this is the maximal number of jobs the pipeline may submit at a time. When running locally, this is the maximal number of processes the pipeline will use in parallel.
cluster_wrapper - same as the value for the --cluster parameter, when running on an HPC cluster (e.g. python cluster_wrapper.py).

Map-to-pan configuration

Input/output

samples_info_file - full path to the LQ samples TSV
hq_genomes_info_file - full path to the HQ samples TSV
out_dir - full path to the output directory

Reference

reference_name - sample name of the reference genome (e.g. HG38, TAIR10)
reference_genome - full path to the fasta file of the reference genome
reference_proteins - full path to the fasta file containing reference protein sequences
reference_annotation - full path to the GFF3 file containing reference gene models
id_simplify_function - a Python function, written in lambda syntax, that is applied to the gene IDs of the reference GFF. If no simplification is needed, leave the value as "lambda x: x"
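
One way to check a candidate id_simplify_function value before putting it in the configuration is to evaluate it in Python and apply it to a few gene IDs. The sketch below does this with `eval`; the gene IDs and the `gene:` prefix are hypothetical, and how the pipeline itself evaluates the value is not described here.

```python
# Hypothetical gene IDs; real IDs depend on the reference GFF3.
example_ids = ["gene:AT1G01010", "gene:AT1G01020"]

# Candidate values, written as they would appear (quoted) in the YAML file.
identity_value = "lambda x: x"
strip_prefix_value = "lambda x: x.replace('gene:', '')"

# Evaluating the strings only serves to test them locally.
identity = eval(identity_value)
strip_prefix = eval(strip_prefix_value)

print([identity(i) for i in example_ids])      # IDs are left unchanged
print([strip_prefix(i) for i in example_ids])  # ['AT1G01010', 'AT1G01020']
```

Because the value contains spaces and a colon, it should be wrapped in double quotes in the configuration, as noted at the top of this page.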

Reads pre processing (RPP)

trimming_modules - a string describing the various trimming and QA modules to be applied by Trimmomatic. See the Trimmomatic manual for details or use "SLIDINGWINDOW:5:15 MINLEN:40"
merge_min_overlap - minimal number of overlapping bases required to perform PE read merging. With most standard Illumina libraries, 10 should be a reasonable value.
merge_max_mismatch_ratio - maximal ratio of mismatches allowed in the overlapping region for PE read merging to be performed. With most standard Illumina libraries, 0.2 should be a reasonable value.

Assembly

assembler - the assembly software to use. One of: 'spades', 'megahit', or 'minia'. SPAdes usually produces the most contiguous assemblies but is memory-intensive, so it is practical for small genomes only. MEGAHIT is faster, and Minia is very memory-efficient and thus useful for large genomes.
min_length - assembled contigs shorter than this value (in bp) are discarded and not annotated.
busco_set - name of the BUSCO lineage to be used when assessing assembly completeness. You can find the list of available lineages here. Copy and paste the one relevant to your organism.

Annotation

transcripts - paths to all transcripts fasta files, separated by commas (including ref). These will be used as annotation evidence.
proteins - paths to all proteins fasta files, separated by commas (including ref). These will be used as annotation evidence.
annotation_yml_template - full path to the annotation configuration yaml.
chunk_size - novel sequences will be broken into chunks of this size (in bp) for parallel processing. Chunks should be much larger than the expected gene size, but small enough to allow efficient parallelization.
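
The trade-off behind chunk_size can be seen in a simple chunking routine. This is a sketch that assumes non-overlapping, fixed-size chunks with a shorter final chunk; the pipeline's actual chunking may differ, and `split_into_chunks` is a hypothetical name.

```python
from typing import List, Tuple


def split_into_chunks(sequence: str, chunk_size: int) -> List[Tuple[int, str]]:
    """Split a sequence into consecutive chunks, returning (start, chunk) pairs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive number of bases")
    return [(start, sequence[start:start + chunk_size])
            for start in range(0, len(sequence), chunk_size)]


novel_sequence = "ACGT" * 6 + "A"  # 25 bp toy sequence
chunks = split_into_chunks(novel_sequence, chunk_size=10)
print([(start, len(chunk)) for start, chunk in chunks])
# [(0, 10), (10, 10), (20, 5)]
```

A gene spanning a chunk boundary would be cut in this scheme, which is why chunks should be much larger than the expected gene size. Larger chunks also mean fewer units of work to distribute across jobs.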

Annotation filtration

min_protein - proteins shorter than this value (in number of residues) will be discarded. Very short proteins are often a result of erroneous or partial prediction.
similarity_threshold_proteins - to avoid redundant proteins in the pan-genome, all novel proteins are clustered based on sequence similarity using CD-HIT. The value is passed on to the -c option of CD-HIT and defines the sequence identity threshold for clustering.
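
To make the role of similarity_threshold_proteins concrete, the sketch below builds a CD-HIT command line in which the value is passed to -c. The file names are hypothetical, and the pipeline may pass additional CD-HIT options that are not shown here.

```python
import shlex
from typing import List


def cd_hit_command(input_fasta: str, output_fasta: str,
                   similarity_threshold_proteins: float) -> List[str]:
    """Build a CD-HIT argument list with the identity threshold passed to -c."""
    if not 0 < similarity_threshold_proteins <= 1:
        raise ValueError("the identity threshold must be a fraction in (0, 1]")
    return ["cd-hit",
            "-i", input_fasta,    # input protein fasta
            "-o", output_fasta,   # output fasta with cluster representatives
            "-c", str(similarity_threshold_proteins)]


cmd = cd_hit_command("novel_proteins.fasta",
                     "novel_proteins.clustered.fasta", 0.9)
print(shlex.join(cmd))
# cd-hit -i novel_proteins.fasta -o novel_proteins.clustered.fasta -c 0.9
```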

Gene loss detection

HQ_min_cov - the fraction of a gene (between 0 and 1) that's required to be covered by an HQ sample sequence in order for it to be considered present.
LQ_min_cov - the fraction of a gene (between 0 and 1) that's required to be covered by LQ sample reads in order for it to be considered present.
min_read_depth - the minimal number of LQ reads that must cover a position in order for it to be considered covered.
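
The interplay between min_read_depth and LQ_min_cov can be sketched as a two-step calculation: first decide which positions are covered, then compare the covered fraction with the threshold. The function names are hypothetical, and the sketch assumes that every position of the gene counts equally, which may not match the pipeline's exact definition.

```python
from typing import Sequence


def covered_fraction(depths: Sequence[int], min_read_depth: int) -> float:
    """Fraction of gene positions whose read depth is at least min_read_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_read_depth)
    return covered / len(depths)


def gene_is_present(depths: Sequence[int], min_read_depth: int,
                    lq_min_cov: float) -> bool:
    """A gene is called present if enough of its positions are covered."""
    return covered_fraction(depths, min_read_depth) >= lq_min_cov


# Per-position read depth along a toy 10 bp gene.
depths = [0, 0, 3, 5, 6, 4, 2, 1, 0, 7]
print(covered_fraction(depths, min_read_depth=2))                 # 0.6
print(gene_is_present(depths, min_read_depth=2, lq_min_cov=0.5))  # True
print(gene_is_present(depths, min_read_depth=2, lq_min_cov=0.8))  # False
```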

Environment

queue - name of the queue to which jobs will be submitted (when using an HPC cluster).
priority - submitted job priority (when using an HPC cluster).
ppn - processes per node. This is the maximal number of CPUs to be used by a single job. When using an HPC cluster, set this value to the number of CPUs available on a single machine in your cluster.
max_ram - maximal RAM to be used by a single job. This is usually on the order of gigabytes, so values like 20g or even 100g are reasonable, depending on the resources available on the cluster machines (when using an HPC cluster).
max_jobs - when running on an HPC cluster, this is the maximal number of jobs the pipeline may submit at a time. When running locally, this is the maximal number of processes the pipeline will use in parallel.
cluster_wrapper - same as the value for the --cluster parameter, when running on an HPC cluster (e.g. python cluster_wrapper.py).

Iterative assembly configuration

Input/output

samples_info_file - full path to the LQ samples TSV
hq_genomes_info_file - full path to the HQ samples TSV
out_dir - full path to the output directory

Reference

reference_name - sample name of the reference genome (e.g. HG38, TAIR10)
reference_genome - full path to the fasta file of the reference genome
reference_proteins - full path to the fasta file containing reference protein sequences
reference_annotation - full path to the GFF3 file containing reference gene models
id_simplify_function - a Python function, written in lambda syntax, that is applied to the gene IDs of the reference GFF. If no simplification is needed, leave the value as "lambda x: x"

Reads pre processing (RPP)

trimming_modules - a string describing the various trimming and QA modules to be applied by Trimmomatic. See the Trimmomatic manual for details or use "SLIDINGWINDOW:5:15 MINLEN:40"
merge_min_overlap - minimal number of overlapping bases required to perform PE read merging. With most standard Illumina libraries, 10 should be a reasonable value.
merge_max_mismatch_ratio - maximal ratio of mismatches allowed in the overlapping region for PE read merging to be performed. With most standard Illumina libraries, 0.2 should be a reasonable value.

Detect unmapped reads

max_mapq - reads mapped with a MAPQ (mapping quality) smaller than or equal to this value will be considered unmapped.
min_mismatch - reads mapped with a number of mismatches larger than or equal to this value will be considered unmapped.
max_qlen - reads mapped with an alignment length smaller than or equal to this value will be considered unmapped.
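
The three thresholds above can be read as independent conditions, any one of which marks a read as unmapped. The sketch below encodes that reading; combining the conditions with a logical OR is an assumption, and the `Alignment` class and the threshold values in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    mapq: int            # mapping quality reported by the aligner
    mismatches: int      # number of mismatches in the alignment
    aligned_length: int  # length of the aligned part of the read


def is_unmapped(aln: Alignment, max_mapq: int,
                min_mismatch: int, max_qlen: int) -> bool:
    """Treat a read as unmapped if it fails any one of the three criteria."""
    return (aln.mapq <= max_mapq
            or aln.mismatches >= min_mismatch
            or aln.aligned_length <= max_qlen)


# Hypothetical thresholds, chosen only for this example.
thresholds = dict(max_mapq=10, min_mismatch=8, max_qlen=80)

print(is_unmapped(Alignment(60, 1, 150), **thresholds))   # False: passes all three
print(is_unmapped(Alignment(5, 1, 150), **thresholds))    # True: low MAPQ
print(is_unmapped(Alignment(60, 12, 150), **thresholds))  # True: many mismatches
print(is_unmapped(Alignment(60, 1, 40), **thresholds))    # True: short alignment
```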

Assembly

assembler - the assembly software to use. One of: 'spades', 'megahit', or 'minia'. SPAdes usually produces the most contiguous assemblies but is memory-intensive, so it is practical for small genomes only. MEGAHIT is faster, and Minia is very memory-efficient and thus useful for large genomes. Since only unmapped reads are assembled, SPAdes is usually a safe choice.
min_length - assembled contigs shorter than this value (in bp) are discarded and not annotated.
busco_set - name of the BUSCO lineage to be used when assessing assembly completeness. You can find the list of available lineages here. Copy and paste the one relevant to your organism.

Annotation

transcripts - paths to all transcripts fasta files, separated by commas (including ref). These will be used as annotation evidence.
proteins - paths to all proteins fasta files, separated by commas (including ref). These will be used as annotation evidence.
annotation_yml_template - full path to the annotation configuration yaml.
chunk_size - novel sequences will be broken into chunks of this size (in bp) for parallel processing. Chunks should be much larger than the expected gene size, but small enough to allow efficient parallelization.

Annotation filtration

min_protein - proteins shorter than this value (in number of residues) will be discarded. Very short proteins are often a result of erroneous or partial prediction.
similarity_threshold_proteins - to avoid redundant proteins in the pan-genome, all novel proteins are clustered based on sequence similarity using CD-HIT. The value is passed on to the -c option of CD-HIT and defines the sequence identity threshold for clustering.

Gene loss detection

HQ_min_cov - the fraction of a gene (between 0 and 1) that's required to be covered by an HQ sample sequence in order for it to be considered present.
LQ_min_cov - the fraction of a gene (between 0 and 1) that's required to be covered by LQ sample reads in order for it to be considered present.
min_read_depth - the minimal number of LQ reads that must cover a position in order for it to be considered covered.

Environment

queue - name of the queue to which jobs will be submitted (when using an HPC cluster).
priority - submitted job priority (when using an HPC cluster).
ppn - processes per node. This is the maximal number of CPUs to be used by a single job. When using an HPC cluster, set this value to the number of CPUs available on a single machine in your cluster.
max_ram - maximal RAM to be used by a single job. This is usually on the order of gigabytes, so values like 20g or even 100g are reasonable, depending on the resources available on the cluster machines (when using an HPC cluster).
max_jobs - when running on an HPC cluster, this is the maximal number of jobs the pipeline may submit at a time. When running locally, this is the maximal number of processes the pipeline will use in parallel.
cluster_wrapper - same as the value for the --cluster parameter, when running on an HPC cluster (e.g. python cluster_wrapper.py).

Annotation configuration

Do not modify parameter values enclosed in <> (e.g. input_genome: <INPUT_GENOME>) as these will be automatically completed by Panoramic.
The relevant parameters are:
mask_genome - 0 or 1. Defines whether or not to perform repeat masking on genomic sequences using EDTA.
augustus_species - if specified, Augustus will be used for ab-initio gene prediction. Augustus must be trained for a specific organism. However, pre-trained sets are available for some species. You can check to see if your organism is on the list. If not, there are several tutorials about training Augustus, e.g. this one.
glimmerhmm_species - if specified, GlimmerHMM will be used for ab-initio gene prediction. However, there are only a few pre-trained data sets: arabidopsis, Celegans, human, rice, and zebrafish. Otherwise, you'll need to train the software yourself.
snap_species - if specified, SNAP will be used for ab-initio gene prediction. The value should be the full path to the relevant SNAP .hmm file. A list of pre-trained species can be found here. Otherwise, you'll need to train SNAP yourself.
ab-initio_weight - weight of ab-initio gene predictions when synthesizing gene models using EvidenceModeler. This can be any number, and is relative to other weight parameters defined in the configuration file.
transcripts_weight - weight of transcript evidence-based gene predictions (derived from PASA) when synthesizing gene models using EvidenceModeler. This can be any number, and is relative to other weight parameters defined in the configuration file.
proteins_weight - weight of protein evidence-based gene predictions (derived from GenomeThreader) when synthesizing gene models using EvidenceModeler. This can be any number, and is relative to other weight parameters defined in the configuration file.
segment_size - genome window size (in bp) used by EvidenceModeler for parallel computing.
overlap_size - overlap between windows (in bp).
split_chimeras - 0 or 1. Whether or not to apply a step for detection and splitting of chimeric gene models. This is an experimental feature.
chimeraBuster_dir - if you choose to split chimeric genes, you'll need to clone the chimeraBuster repo and use this parameter to specify the path to the code.
min_protein - gene models coding for proteins shorter than this value (number of amino acids) will be discarded.
max_AED - each gene model generated by EvidenceModeler is assigned an Annotation Edit Distance (AED) between 0 and 1, indicating how different a model is from the available evidence. An AED score of 1 means there is no evidence support for the model and it is purely based on ab-initio prediction. Models with an AED larger than this value will be discarded. The value depends on how much you want to rely on the evidence. When solid evidence is available, a value of 0.3-0.5 is recommended. If the quality and/or quantity of evidence data is limited, you may want to increase the value.
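
The min_protein and max_AED parameters above both act as filters on the final gene models. The following sketch shows how the two thresholds combine; the `GeneModel` record, the gene IDs, and the values are hypothetical, and the order in which the pipeline applies the filters is not specified here.

```python
from typing import List, NamedTuple


class GeneModel(NamedTuple):
    gene_id: str
    protein_length: int  # number of amino acids
    aed: float           # Annotation Edit Distance, between 0 and 1


def filter_gene_models(models: List[GeneModel], min_protein: int,
                       max_aed: float) -> List[GeneModel]:
    """Discard models with short proteins or with an AED above the cutoff."""
    return [m for m in models
            if m.protein_length >= min_protein and m.aed <= max_aed]


models = [
    GeneModel("g1", 320, 0.10),  # long protein, well supported
    GeneModel("g2", 45, 0.05),   # protein shorter than min_protein
    GeneModel("g3", 210, 0.80),  # weak evidence support
    GeneModel("g4", 150, 0.40),  # AED equal to the cutoff is kept
]

kept = filter_gene_models(models, min_protein=50, max_aed=0.4)
print([m.gene_id for m in kept])  # ['g1', 'g4']
```

Raising max_AED toward 1 keeps more models that rest mostly on ab-initio prediction, which matches the advice to increase the value when evidence is limited.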