CDPHE_SARS-CoV-2 Workflows

Disclaimer

Next generation sequencing, bioinformatic analysis, and genomic analysis at the Colorado Department of Public Health and Environment (CDPHE) are not CLIA validated at this time. These workflows and their outputs are not to be used for diagnostic purposes and should only be used for public health action and surveillance purposes. CDPHE is not responsible for the incorrect or inappropriate use of these workflows or their results.

Overview

The following documentation describes the Colorado Department of Public Health and Environment's workflows for the assembly and analysis of SARS-CoV-2 whole genome sequencing data on GCP's Terra.bio platform. Workflows are written in WDL and can be imported into a Terra.bio workspace through Dockstore (https://dockstore.org/); see the Setup section below.

Our SARS-CoV-2 whole genome reference-based assembly workflows are highly adaptable and facilitate the assembly and analysis of tiled-amplicon-based SARS-CoV-2 sequencing data. The workflows accommodate several amplicon primer schemes, including Artic V3, Artic V4, Artic V4.1, Artic V5.3.2, and Midnight, and two sequencing platforms: Illumina and Oxford Nanopore Technologies (ONT).

Workflows

Below is a list of available and maintained workflows with a brief description of each. A full description can be found on each workflow's readme page.

Workflow Name	Description
`SC2_illumina_pe_assembly`	SARS-CoV-2 reference-based assembly of Illumina paired-end data.
`SC2_ont_assembly`	SARS-CoV-2 reference-based assembly of Oxford Nanopore Technologies (ONT) data.
`SC2_lineage_calling_and_results`	Performs lineage calling on SARS-CoV-2 consensus sequences using Pangolin and Nextclade and generates a summary report of assembly metrics. Should be run following an assembly workflow.
`SC2_wastewater_variant_calling`	Uses Freyja to recover relative lineage abundances from wastewater samples, which are considered mixed SARS-CoV-2 samples.
`SC2_novel_mutations`	Uses mutation outputs from Freyja to detect novel and recurrent mutations in wastewater samples.
`SC2_multifasta_lineage_calling`	Performs lineage calling on SARS-CoV-2 consensus sequences using Pangolin.

Process

Clinical SC2 sequence assembly and lineage calling

Sequence assembly and lineage calling for clinical SC2 samples require two workflows (see Figure 1). We first run either the SC2_illumina_pe_assembly or the SC2_ont_assembly workflow, which performs quality control, trimming, and filtering of raw reads, followed by reference-guided whole genome assembly, and finally transfers intermediate files and consensus sequences to a local GCP bucket for storage. Next, we run the SC2_lineage_calling_and_results workflow, which uses Pangolin and Nextclade to perform clade and lineage assignment on the consensus assemblies and produces a results summary file for the set of sequences analyzed.

If you already have a multifasta, you can use the SC2_multifasta_lineage_calling workflow for clade and lineage assignment.
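If you have individual consensus FASTA files rather than a multifasta, they can be combined before running the workflow. The sketch below is illustrative only and is not part of this repository; the function name `build_multifasta` and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def build_multifasta(fasta_paths, out_path):
    """Concatenate FASTA files into one multifasta file.

    Returns the number of sequence records written.
    Hypothetical helper for illustration; not part of the CDPHE workflows.
    """
    record_count = 0
    with open(out_path, "w") as out:
        for path in fasta_paths:
            text = Path(path).read_text().strip()
            if not text:
                continue  # skip empty files
            if not text.startswith(">"):
                raise ValueError(f"{path} does not look like a FASTA file")
            out.write(text + "\n")
            # Each record begins with a '>' header at the start of a line.
            record_count += 1 + text.count("\n>")
    return record_count
```

For example, `build_multifasta(["sample1.fa", "sample2.fa"], "combined.fasta")` writes both records to `combined.fasta`, which could then be uploaded to a GCP bucket and supplied to SC2_multifasta_lineage_calling.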

Wastewater SC2 sequence assembly and variant calling

Sequence assembly, variant calling, and mutation analysis for wastewater SC2 samples require four workflows (see Figure 1). As with clinical SC2 samples, we first run either the SC2_illumina_pe_assembly or the SC2_ont_assembly workflow, which performs quality control, trimming, and filtering of raw reads, followed by reference-guided whole genome assembly, and finally transfers intermediate files and consensus sequences to a local GCP bucket for storage. Next, we run the SC2_lineage_calling_and_results workflow, which uses Pangolin and Nextclade to perform clade and lineage assignment on the consensus assemblies and produces a results summary file for the set of sequences analyzed. Then we run the SC2_wastewater_variant_calling workflow, which uses Freyja to recover relative lineage abundances from wastewater samples, which are considered mixed SC2 samples. Finally, we run the SC2_novel_mutations workflow, which uses the mutation outputs from Freyja to detect novel and recurrent mutations in wastewater samples and to track each mutation over time.
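The run order for the clinical and wastewater processes can be summarized in a small lookup. This is only a sketch to make the sequence explicit; the workflow names come from this document, while the function name `workflow_sequence` and the `"illumina"`/`"ont"` and `"clinical"`/`"wastewater"` labels are assumptions made for illustration.

```python
# Assembly workflow to run first, chosen by sequencing platform.
ASSEMBLY_WORKFLOWS = {
    "illumina": "SC2_illumina_pe_assembly",
    "ont": "SC2_ont_assembly",
}

# Workflows run after assembly, in order, by sample type.
DOWNSTREAM_WORKFLOWS = {
    "clinical": ["SC2_lineage_calling_and_results"],
    "wastewater": [
        "SC2_lineage_calling_and_results",
        "SC2_wastewater_variant_calling",
        "SC2_novel_mutations",
    ],
}


def workflow_sequence(sample_type, platform):
    """Return the ordered list of workflows for a sample type and platform."""
    if sample_type not in DOWNSTREAM_WORKFLOWS:
        raise ValueError(f"unknown sample type: {sample_type}")
    if platform not in ASSEMBLY_WORKFLOWS:
        raise ValueError(f"unknown platform: {platform}")
    return [ASSEMBLY_WORKFLOWS[platform]] + DOWNSTREAM_WORKFLOWS[sample_type]
```

Calling `workflow_sequence("clinical", "ont")` gives the two-workflow clinical process, and `workflow_sequence("wastewater", "illumina")` gives the four-workflow wastewater process.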

Figure 1. High-level overview of the workflow process for clinical and wastewater SC2 samples.

Setup

Input Workflow from Dockstore

To use a workflow on the Terra platform, you first need to import it from Dockstore. All workflows can be found under our Dockstore organization, CDPHE-bioinformatics.

1. Go to Dockstore (https://dockstore.org/).
2. Along the top search bar, click on Organizations and search for "CDPHE".
3. Select the workflow.
4. On the right-hand side of the workflow description, select "Launch with Terra".
5. Select the destination workspace and select "Import".
6. The workflow will now be displayed as a card under the workflows tab in your Terra workspace.

Workspace Data

Prior to running any of the workflows, you must set up the Terra workspace data with the correct reference files and custom Python scripts. The reference files can be found in this repository in the data/workspace_data directory. Python scripts can be found in the scripts directory. Workspace variables are named using the format {organism}_{description}_{file_type}, except for the primer bed files, which are named {description}_{file_type}. Reference files and Python scripts should be copied from this repo into a GCP bucket. The GCP bucket path to each file will serve as the "value" when adding data to the Terra workspace data table.

To add data to the Terra workspace data:

1. Navigate to the Data tab in your Terra workspace.
2. In the left-hand list of data tables, under "Other Data", select "Workspace Data".
3. Click on the "+" button in the lower right-hand corner of the workspace data table.
4. Fill in the "Key" column with the workspace variable name, the "Value" column with the GCP bucket path to the file, and the "Description" column with a brief description if desired.
5. Once complete, click the check mark to the right.
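The "Key" and "Value" entries for the steps above can be prepared ahead of time. The following is a minimal sketch, assuming the files were copied into a single bucket folder; the bucket path `my-bucket/workspace_data` and the function name `workspace_rows` are hypothetical, and the two file names are taken from the workspace data table in this section.

```python
def workspace_rows(bucket_folder, files):
    """Pair each workspace variable name with its GCP bucket path.

    bucket_folder: bucket name plus optional folder, without the gs:// prefix.
    files: dict mapping workspace variable name -> file name in that folder.
    """
    prefix = "gs://" + bucket_folder.strip("/")
    return [(key, f"{prefix}/{name}") for key, name in sorted(files.items())]


if __name__ == "__main__":
    rows = workspace_rows(
        "my-bucket/workspace_data",  # hypothetical bucket folder
        {
            "covid_genome_fa": "MN908947-2_reference.fasta",
            "artic_v4-1_bed": "artic_V4-1_nCoV-2019.primer.bed",
        },
    )
    for key, value in rows:
        print(f"{key}\t{value}")
```

Each printed pair corresponds to one row to enter in the workspace data table.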

Below is a data table detailing the workspace data you will need to set up in order to run the SC2 workflows.

workspace variable name	workflow	file name	description
`adapters_and_contaminants_fa`	`SC2_illumina_pe_assembly`	Adapters_plus_PhiX_174.fasta	adapter sequences and contaminant sequences removed during fastq cleaning and filtering using SeqyClean. Thanks to Erin Young at Utah Public Health Laboratory for providing this file!
`artic_v3_bed`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	artic_V3_nCoV-2019.primer.bed	primer bed file for the Artic V3 tiled amplicon primer set. Thanks to Theiagen Genomics for providing this file!
`artic_v4_bed`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	artic_V4_nCoV-2019.primer.bed	primer bed file for the Artic V4 tiled amplicon primer set. Thanks to Theiagen Genomics for providing this file!
`artic_v4-1_bed`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	artic_V4-1_nCoV-2019.primer.bed	primer bed file for the Artic V4.1 tiled amplicon primer set. Thanks to Theiagen Genomics for providing this file!
`artic_v4-1_s_gene_amplicons`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	artic_v4_1_s_gene_amplicons.tsv	coordinate positions of S gene amplicons using the Artic V4.1 primers
`artic_v4-1_s_gene_primer_bed`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	S_gene_V4-1_nCoV-2021.primer.bed	primer sequences and coordinate positions of the primer binding regions of the Artic V4.1 primers
`artic_v5-3-2_bed`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	artic_v5-3-2_nCoV-2023.primer.bed	primer bed file for the Artic V5.3.2 tiled amplicon primer set.
`artic_v5-3-2_s_gene_amplicons`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	artic_v5-3-2_s_gene_amplicons.tsv	coordinate positions of S gene amplicons using the Artic V5.3.2 primers
`artic_v5-3-2_s_gene_primer_bed`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	S_gene_V5-3-2_nCoV-2021.primer.bed	primer sequences and coordinate positions of the primer binding regions of the Artic V5.3.2 primers
`midnight_bed`	`SC2_ont_assembly`	Midnight_Primers_SARS-CoV-2.scheme.bed	primer bed file for the Midnight tiled amplicon primer set. Thanks to Theiagen Genomics for providing this file!
`covid_genome_fa`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`, `SC2_wastewater_variant_calling`	MN908947-2_reference.fasta	SARS-CoV-2 whole genome reference sequence in fasta format (we use NCBI GenBank ID MN908947.3)
`covid_genome_gff`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`, `SC2_wastewater_variant_calling`	NC_045512-2_reference.gff	whole genome reference sequence annotation file in gff format (we use NCBI GenBank ID MN908947.3)
`covid_genome_gff_mutations`	`SC2_novel_mutations`	novel_mutations_gff.tsv	tsv formatted version of `covid_genome_gff` for use with `novel_mutations_append_py`
`covid_voc_annotations_tsv`	`SC2_wastewater_variant_calling`	SC2_voc_annotations_20220711.tsv	For wastewater only. List of amino acid (AA) substitutions and the lineages containing those AA substitutions; for a lineage to be associated with a given AA substitution, 90% of publicly available sequences must contain the AA substitution (the 90% cutoff was determined using outbreak.info)
`covid_voc_bed_tsv`	`SC2_wastewater_variant_calling`	SC2_voc_mutations_20220711.tsv	For wastewater only. List of nucleotide genome positions of known mutations, relative to the MN908947.3 reference genome
`covid_calc_per_cov_py`	`SC2_illumina_pe_assembly`, `SC2_ont_assembly`	calc_percent_coverage.py	see detailed description in the readme file found in `./python_scripts/` repo directory
`covid_nextclade_json_parser_py`	`SC2_lineage_calling_and_results`	nextclade_json_parser.py	see detailed description in the readme file found in `./python_scripts/` repo directory
`covid_concat_results_py`	`SC2_lineage_calling_and_results`	concat_seq_metrics_and_lineages_results.py	see detailed description in the readme file found in `./python_scripts/` repo directory
`covid_novel_mutations_append_py`	`SC2_novel_mutations`	novel_mutations_append.py	see detailed description in the readme file found in `./python_scripts/` repo directory
`covid_version_capture_illumina_pe_assembly_py`	`SC2_illumina_pe_assembly`	version_capture_illumina_pe_assembly.py	generates version capture output file for software versions used in the SC2_illumina_pe_assembly workflow
`covid_version_capture_ont_assembly_py`	`SC2_ont_assembly`	version_capture_ont_assembly.py	generates version capture output file for software versions used in the SC2_ont_assembly workflow
`covid_version_capture_lineage_calling_py`	`SC2_lineage_calling_and_results`	version_capture_lineage_calling_and_results.py	generates version capture output file for software versions used in the SC2_lineage_calling_and_results workflow
`covid_version_capture_wastewater_variant_calling_py`	`SC2_wastewater_variant_calling`	version_capture_wastewater_variant_calling.py	generates version capture output file for software versions used in the SC2_wastewater_variant_calling workflow
`covid_version_capture_multifasta_lineage_calling_py`	`SC2_multifasta_lineage_calling`	version_capture_multifasta_lineage_calling.py	generates version capture output file for software versions used in the SC2_multifasta_lineage_calling workflow
`novel_mutations_historical_full`	`SC2_novel_mutations`	novel_mutations.py	For wastewater only. See detailed description in the readme file found in `./python_scripts/` repo directory
`novel_mutations_historical_unique`	`SC2_novel_mutations`	novel_mutations.py	For wastewater only. See detailed description in the readme file found in `./python_scripts/` repo directory
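As the table shows, the Artic primer bed files are listed for both assembly workflows, while the Midnight bed file is listed only for `SC2_ont_assembly`. The sketch below encodes that lookup. It is illustrative only: the function name `primer_bed_reference` and the platform labels are hypothetical, and the `workspace.` prefix reflects how Terra workflow inputs are generally pointed at workspace data, which you should confirm in your own workspace.

```python
# Primer scheme -> (workspace variable name, platforms listed in the table).
PRIMER_BED_VARIABLES = {
    "Artic V3": ("artic_v3_bed", {"illumina", "ont"}),
    "Artic V4": ("artic_v4_bed", {"illumina", "ont"}),
    "Artic V4.1": ("artic_v4-1_bed", {"illumina", "ont"}),
    "Artic V5.3.2": ("artic_v5-3-2_bed", {"illumina", "ont"}),
    "Midnight": ("midnight_bed", {"ont"}),
}


def primer_bed_reference(scheme, platform):
    """Return the workspace reference for a primer scheme's bed file."""
    if scheme not in PRIMER_BED_VARIABLES:
        raise ValueError(f"unknown primer scheme: {scheme}")
    key, platforms = PRIMER_BED_VARIABLES[scheme]
    if platform not in platforms:
        raise ValueError(f"{scheme} is not listed for platform {platform}")
    # Assumed Terra convention for referencing a workspace data key.
    return f"workspace.{key}"
```

For instance, `primer_bed_reference("Midnight", "ont")` returns the reference for `midnight_bed`, while requesting Midnight with `"illumina"` raises an error because the table lists no such pairing.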
