dengueQA - a pipeline for quality assurance of consensus dengue virus genomes

This pipeline processes dengue virus sequencing data to perform quality assurance of genomic data. To do so it runs steps of genotyping, sequence alignment, genome coverage calculation, and molecular clock analysis.

Installation

Clone the repository:

git clone https://github.com/InstitutoTodosPelaSaude/dengueQA.git
cd dengueQA

Install dependencies:

conda env create -f config/dengueQA.yaml
conda activate dengueQA

Usage

Edit the Snakefile to specify your input files and parameters.
Run the pipeline:
```
snakemake all --cores all
```

Workflow

The pipeline consists of the following steps:

BLAST searches: Perform BLAST searches for serotyping, genotyping, and lineage assignment according to reference files provided by 'dengue-lineages' nomenclature system.
Dataset splitting: Split datasets based on dengue serotypes.
Sequence alignment: Align newly sequenced genomes with reference genomes using augur align.
Coverage calculation: Calculate coverage per coding region (CDS) and whole genomes.
Contextual genome selection: Select contextual genomes for root-to-tip analysis.
Dataset compilation: Join new sequences with contextual genomes.
Metadata processing: Add missing columns and process metadata.
Phylogenetic analysis: Align genomes (augur align), mask 5' and 3' untranslated regions (augur mask), build phylogenetic trees (iqtree), and perform root-to-tip analysis (treetime).
Quality assurance: Generate quality assurance report containing information about genomic coverage (flagging low coverage coding regions (CDS) and whole genomes) and sequence quality (flagging molecular clock outliers).

Sequence quality criteria

Using this pipeline, the quality assurance report reveals important statistics to evaluate genomes under the following criteria:

Evolutionary rates with strong deviations from the molecular clock
Minimum coverage of both whole genome and specific coding regions.

The molecular clock quality analysis (1) aims to identify genomes whose evolutionary rates deviate more than 10 interquartile ranges from the trend line produced by high-quality genomes. This cutoff is rather permissive, and was set on purpose to avoid exclusion of genomes that may have naturally higher evolutionary rates.

In this example, the correlation between genetic distance (y axis) and time (x axis) of the root-to-tip molecular clock analysis identified one sequence with evolutionary rate the deviates from the expected, being flagged as a potentially low quality sequence.

The genome coverage analysis (2) aims to identify genomes with less than 70% coverage, the minimum cutoff used for submitting the sequence as 'complete genome' or 'near-complete genome' to public databases. When sequences have less than 70% coverage, specific genomic fragments are evaluated for possible submission as partial sequences, thery are: the C-prM-E region (~2300 bp) and the viral envelope region (E, ~1500 bp) (see image below). In this order, if any of these regions have at least 95% coverage, it could be submitted to databases as genomic fragments, still relevant for specific analyses.

Examples of sequences of interest for submission to public genomic databases, according to their levels of completeness.

Author

Anderson Brito, Instituto Todos pela Saúde (ITpS) - Website - anderson.brito@itps.org.br

License

This project is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
assets		assets
config		config
input		input
references		references
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Snakefile		Snakefile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

dengueQA - a pipeline for quality assurance of consensus dengue virus genomes

Installation

Usage

Workflow

Sequence quality criteria

Author

License

About

Releases

Packages

Languages

License

InstitutoTodosPelaSaude/dengueQA

Folders and files

Latest commit

History

Repository files navigation

dengueQA - a pipeline for quality assurance of consensus dengue virus genomes

Installation

Usage

Workflow

Sequence quality criteria

Author

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages