Skip to content

Latest commit

 

History

History
156 lines (120 loc) · 5.17 KB

metagenomic_triage.rst

File metadata and controls

156 lines (120 loc) · 5.17 KB

TaxTriage (Metagenomics) (Under Construction)

Warning

This module is under construction and is in alpha-release. Scheduled full release of v1.0 in Oct. 2022

Standard diagram for deployment and pipeline development

../assets/img/APHLPipeline.png

The pipeline consists of a variety of alignment/classification steps as well as QC and pre-filtering processes. It is designed to be serve as the initial triage step for identifying unknown organisms present in one or more sample types and supports both Illumina or Oxford Nanopore-generated NGS data.

The pipeline is packaged to include basic quality control to making a (potential) de-novo assembly for each organism that is detected in the sample from a filtering a hierarchical perspective. That is, the most prevalent taxonomic IDs at various ranks in the hierarchical chain are reported, binned, and run through a variety of alignment and assembly steps (for lower levels like species). Finally, a set of flags are generated for each taxonomic map that is the most prevalent per sample.

A list of tools used are listed below for each step

  1. Demultiplex and Gather OPTIONAL, Oxford Nanopore Only
  1. Quality Control OPTIONAL
  • PycoQC - Computes metrics and generates interactive QC plots for Oxford Nanopore technologies sequencing data
  1. Trimming
  1. Filtering
  1. QC Plotting
  1. Classification (K-mer approach)
  1. Alignent Stats
  1. Report Generation

Please see relevant links in the listed modules for more information on the underlying mechanisms and corresponding papers (if existent)

Parameters

  • Samplesheet (.csv): file

    Contains a mapping of metadata and a single sample per row. Explanations of the possible columns for Basestack are seen below:

Samplesheet Description
Column Name Description
sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (_).
single_end Is the data single or paired end
fastq_1 Full path to FastQ file for Illumina short reads 1 OR OXFORD reads. File has to be gzipped and have the extension ".fastq.gz" or ".fq.
fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.
barcode TRUE/FALSE, is the row attributed to a demultiplexed barcode folder of 1 or more fastq files or is it a single file that is .
from Directory path of the barcode, only used with the column being set as TRUE in the barcode column
trim TRUE/FALSE, do you want to run trimming on the sample?
platform Platform used, [ILLUMINA, OXFORD]
sequencing_summary If detected, output plots based on the the sequencing summary file for that sample
Example Samplesheet
sample fastq_1 fastq_2 platform from trim sequencing_summary single_end barcode
Sample_1 AEG588A1_S1_L001_R1_001.fastq.gz AEG588A1_S1_L001_R2_001.fastq.gz ILLUMINA NULL (or leave blank) FALSE NULL (or leave blank) FALSE FALSE
Sample_2 ecoli_reads.fastq NULL OXFORD NULL FALSE sequencing_summary.txt TRUE FALSE
Sample_3 NULL NULL OXFORD barcode01 TRUE FALSE TRUE TRUE

For the samples shown above:

  1. A paired-end run of Illumina data where we DON'T trim anything (no Trimgalore)
  2. A single-end Oxford Nanopore run where all reads are concatenated to a single fastq file. No barcode. There is a sequencing summary file we want to plot for run statistics/plots
  3. A single-end Oxford Nanopore run where reads have NOT been demultiplexed and/or aggregated to a single fastq file (like row 2). This will run artic guppyplex as well to concatenate all to one fastq file

Returns

  1. MultiQC report HTML file
  2. Variety of intermediate and output results files for the MultiQC report
  • Examples: - SAM/BAM alignment - Filtered FASTQ Files (for downstream use) - Assembly (de novo) - WIP and is not ready just yet - Kraken2 Report(s)