An optimized workflow for processing 16S rRNA gene amplicon data with the DADA2 package in R. It infers exact amplicon sequence variants (ASVs), which resolve finer sequence differences than traditional OTU clustering, and adds optimizations for performance, reliability, and downstream analysis.
This repository contains two main components:
- dada2_workflow_optimize.Rmd - An enhanced, performance-optimized RMarkdown workflow that implements the complete DADA2 pipeline
- dashboard.R - An interactive Shiny dashboard for visualizing and exploring results
Note: A basic DADA2 workflow (dada2_workflow.Rmd) is also included for users who want the simplest possible implementation.
- Automatic sequencing platform detection - Identifies platform based on read length and quality patterns
- Parameter optimization - Tunes filtering parameters to your specific sequence data characteristics
- Adaptive truncation lengths - Dynamically adjusts to optimize read quality and overlap
- Expected error threshold optimization - Balances quality control and read retention
- Primer detection - Automatically identifies primer sequences for amplicon size calculations
- Memory-optimized processing - Efficient batched operations with automatic memory management
- Enhanced checkpointing system - Robust recovery from interruptions with comprehensive tracking
- Parallelized execution - Automatic multi-core utilization with adaptive worker allocation
- Reference-based taxonomy confidence scoring - Bootstrap confidence values for all taxonomic assignments
- Multi-method taxonomy assignment - Combines results from multiple classifiers for improved accuracy
- Phylogenetic tree construction - Integrates phylogenetic information using optimized alignment methods
- Rarefaction analysis - Depth optimization with saturation detection for proper diversity comparisons
- Detailed quality visualization - Enhanced plots with quality interpretation zones
- Multi-run support - Process and integrate data from multiple sequencing runs with batch effect analysis
- Comprehensive reporting - Generate detailed HTML/PDF reports with code hiding option
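The expected-error and truncation-length features above rest on two small calculations. The sketch below is not the workflow's own optimizer. It only illustrates, in base R, the standard expected-error definition behind DADA2's `maxEE` filter (the sum of per-base error probabilities, 10^(-Q/10)) and an overlap check against the 12-base default of `mergePairs(minOverlap = 12)`. The function names and the example numbers are hypothetical.

```r
# Expected errors for one read: the sum of per-base error probabilities
# implied by the Phred quality scores, p = 10^(-Q / 10).
expected_errors <- function(phred) {
  sum(10^(-phred / 10))
}

# Bases of overlap left after truncating forward and reverse reads.
# amplicon_len is the length of the sequenced amplicon (an input you supply).
overlap_after_truncation <- function(trunc_f, trunc_r, amplicon_len) {
  trunc_f + trunc_r - amplicon_len
}

# TRUE when the truncated reads still overlap by at least min_overlap bases.
has_enough_overlap <- function(trunc_f, trunc_r, amplicon_len, min_overlap = 12) {
  overlap_after_truncation(trunc_f, trunc_r, amplicon_len) >= min_overlap
}

# Hypothetical read: ten bases at Q30 followed by five bases at Q20.
q <- c(rep(30, 10), rep(20, 5))
ee <- expected_errors(q)  # 10 * 0.001 + 5 * 0.01 = 0.06

# Hypothetical truncation lengths for a 380 bp amplicon.
ok <- has_enough_overlap(trunc_f = 240, trunc_r = 160, amplicon_len = 380)
```

A read passes a `maxEE = 2` filter when its expected errors are at most 2, so shortening the truncation length lowers expected errors at the cost of overlap. That trade-off is what any tuning of these two parameters has to balance.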
The interactive dashboard (dashboard.R) provides advanced visualization and analysis capabilities:
- Overview Panel - Sample metrics, read statistics, and ASV summaries
- Quality Control - Filtering performance, read tracking, and quality metric distributions
- Alpha Diversity - Multiple indices with statistical comparisons between groups
- Beta Diversity - Multiple ordination methods (PCoA, NMDS, t-SNE, UMAP) with statistical tests
- Taxonomy Explorer - Interactive hierarchical visualization of taxonomic composition
- ASV Browser - Searchable ASV table with sequence information and abundance patterns
- Differential Abundance - Multiple testing methods for identifying biomarkers between groups
- Batch Effect Analysis - For multi-run studies, quantifies and visualizes run effects
- Normalization Methods - Compare various count normalization approaches
- Export Options - Download plots, tables, and processed data in multiple formats
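For readers unfamiliar with the diversity panels, here is a minimal base-R sketch of one alpha-diversity index (Shannon) and one beta-diversity dissimilarity (Bray-Curtis). It is an illustration of the quantities involved, not the dashboard's code, which may rely on packages such as vegan or phyloseq. The function names and count vectors are hypothetical.

```r
# Shannon index: H = -sum(p * log(p)) over taxa with non-zero counts.
shannon <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Bray-Curtis dissimilarity between two samples' count vectors:
# sum(|x - y|) / sum(x + y), ranging from 0 (identical) to 1 (no shared taxa).
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# Hypothetical ASV counts for two samples.
sample_a <- c(10, 10, 10, 10)
sample_b <- c(40, 0, 0, 0)

h_a <- shannon(sample_a)              # equals log(4): four equally abundant ASVs
h_b <- shannon(sample_b)              # 0: a single ASV
d_ab <- bray_curtis(sample_a, sample_b)  # 60 / 80 = 0.75
```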
- R ≥ 4.0.0
- Required R packages:
  - dada2
  - ggplot2
  - phyloseq
  - Biostrings
  - ShortRead
  - tidyverse
  - future (for parallelization)
  - DECIPHER (for improved taxonomy & phylogeny)
  - vegan (for diversity analyses)
- Clone this repository:
git clone https://github.com/yourusername/dada2-workflow.git
cd dada2-workflow
- Install required R packages:
install.packages(c("ggplot2", "tidyverse", "argparse", "future", "future.apply", "vegan"))
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("dada2", "phyloseq", "Biostrings", "ShortRead", "DECIPHER"))
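Before running the workflow it can help to confirm that every dependency is actually installed. The following is a small sketch using only base R. The helper name `missing_packages` is hypothetical, and the package vector simply repeats the names listed in this README.

```r
# Return the subset of pkgs that is not installed in the current library paths.
# system.file(package = p) returns "" when package p cannot be found.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  pkgs[!found]
}

required <- c("dada2", "ggplot2", "phyloseq", "Biostrings", "ShortRead",
              "tidyverse", "future", "DECIPHER", "vegan")

not_installed <- missing_packages(required)
if (length(not_installed) > 0) {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Unlike `library()`, this check does not attach or load anything, so it is cheap to run at the top of a script.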
- Run the optimized workflow using one of the following methods:
Method 1: RStudio
- Open dada2_workflow_optimize.Rmd in RStudio
- Update parameters in the YAML header if needed
- Execute by running all code chunks
Method 2: Command Line
- For a single sequencing run:
Rscript run_dada2_workflow.R
- For multi-run analysis with batch effect correction:
Rscript run_dada2_workflow.R --multi-run --run-dir path/to/run_directory
- View results in the interactive dashboard:
Rscript run_dashboard.R
Or for advanced options:
Rscript run_dashboard.R --optimize --cores 4 --multi-run
Place your FASTQ files in the data/ directory:
data/
├── sample1_R1.fastq.gz
├── sample1_R2.fastq.gz
├── sample2_R1.fastq.gz
├── sample2_R2.fastq.gz
└── ...
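The layout above implies that forward and reverse reads are paired by the `_R1` / `_R2` suffix. As a hedged illustration of that convention (the workflow's own file discovery may differ), the base-R sketch below pairs files by sample name and fails early if a mate is missing. The function name `pair_fastq` and the demo directory are hypothetical.

```r
# Pair forward and reverse FASTQ files by sample name, assuming the
# <sample>_R1.fastq.gz / <sample>_R2.fastq.gz naming shown above.
pair_fastq <- function(data_dir) {
  fwd <- list.files(data_dir, pattern = "_R1\\.fastq\\.gz$", full.names = TRUE)
  rev <- list.files(data_dir, pattern = "_R2\\.fastq\\.gz$", full.names = TRUE)
  sample_f <- sub("_R1\\.fastq\\.gz$", "", basename(fwd))
  sample_r <- sub("_R2\\.fastq\\.gz$", "", basename(rev))
  if (!setequal(sample_f, sample_r)) {
    stop("Every sample needs both an _R1 and an _R2 file")
  }
  data.frame(
    sample = sample_f,
    forward = fwd,
    reverse = rev[match(sample_f, sample_r)],  # align mates by sample name
    stringsAsFactors = FALSE
  )
}

# Demo on empty placeholder files in a temporary directory.
demo_dir <- file.path(tempdir(), "single_run_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c(
  "sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
  "sample2_R1.fastq.gz", "sample2_R2.fastq.gz"
))))
pairs <- pair_fastq(demo_dir)
print(pairs$sample)
```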
Organize your data with each run in a separate subdirectory:
data/
├── run1/
│ ├── sample1_R1.fastq.gz
│ ├── sample1_R2.fastq.gz
│ └── ...
├── run2/
│ ├── sampleA_R1.fastq.gz
│ ├── sampleA_R2.fastq.gz
│ └── ...
└── run3/
  ├── sampleX_R1.fastq.gz
  ├── sampleX_R2.fastq.gz
  └── ...
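To make the multi-run layout concrete, the sketch below walks a run directory and records which sample belongs to which run. It is an illustrative base-R helper under the naming assumptions shown above, not the code used by run_dada2_workflow.R. The name `list_run_samples` and the demo paths are hypothetical.

```r
# Build a run/sample table from a directory with one subdirectory per run,
# assuming forward reads are named <sample>_R1.fastq.gz.
list_run_samples <- function(run_dir) {
  runs <- sort(list.dirs(run_dir, full.names = TRUE, recursive = FALSE))
  per_run <- lapply(runs, function(r) {
    fwd <- list.files(r, pattern = "_R1\\.fastq\\.gz$")
    data.frame(
      run = rep(basename(r), length(fwd)),
      sample = sub("_R1\\.fastq\\.gz$", "", fwd),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, per_run)
}

# Demo on empty placeholder files in a temporary directory.
demo <- file.path(tempdir(), "multi_run_demo")
dir.create(file.path(demo, "run1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo, "run2"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo, "run1",
                                c("sample1_R1.fastq.gz", "sample1_R2.fastq.gz"))))
invisible(file.create(file.path(demo, "run2",
                                c("sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"))))
samples <- list_run_samples(demo)
print(samples)
```

Keeping the run label next to each sample from the start makes it available later as a covariate for batch effect tests.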
The run_dada2_workflow.R script provides the following options:
usage: run_dada2_workflow.R [-h] [-m] [-d RUN_DIR] [-b] [-r] [-f FORMAT]
                            [-o OUTPUT_DIR] [-n OUTPUT_FILE] [--cores CORES]
                            [--optimize]

Run optimized DADA2 workflow for 16S rRNA amplicon sequence processing

optional arguments:
  -h, --help            show this help message and exit
  -m, --multi-run       Enable multi-run processing mode
  -d RUN_DIR, --run-dir RUN_DIR
                        Directory containing run subdirectories (for multi-run
                        mode)
  -b, --big-data        Enable big data mode with optimized memory management
  -r, --report          Generate HTML report
  -f FORMAT, --format FORMAT
                        Output format for report (e.g., html_document,
                        pdf_document)
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        Output directory for reports
  -n OUTPUT_FILE, --output-file OUTPUT_FILE
                        Base output filename for reports
  --cores CORES         Number of CPU cores to use for parallelization
  --optimize            Enable additional performance optimizations
For studies with samples across multiple sequencing runs, the workflow:
- Processes each run separately through the sample inference step with run-specific error models
- Merges sequence tables from all runs while preserving run information
- Performs chimera removal and taxonomic assignment on the combined data
- Provides batch effect detection and correction methods:
  - PERMANOVA to test for significant run effects
  - Beta dispersion analysis to check homogeneity across runs
  - Batch effect visualization with ordination methods
  - Optional normalization methods specifically for batch correction
This approach gives you:
- More accurate error models specific to each sequencing run
- Detection of potential batch effects that could bias results
- Methods to correct or account for batch effects in downstream analyses
- Better integration of data from different sequencing platforms or centers
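The merge step described above can be pictured as taking the union of ASVs across runs and filling in zeros where a run did not observe a sequence. DADA2 provides `mergeSequenceTables()` for this. The base-R sketch below only illustrates the idea on toy matrices and is not the workflow's implementation. The function name `merge_run_tables`, the short sequences, and the counts are hypothetical.

```r
# Merge per-run count matrices (samples in rows, ASV sequences in columns)
# into one table and keep a run label for every sample.
merge_run_tables <- function(tables) {
  stopifnot(is.list(tables), !is.null(names(tables)))
  sample_names <- unlist(lapply(tables, rownames), use.names = FALSE)
  if (anyDuplicated(sample_names)) {
    stop("Sample names must be unique across runs")
  }
  all_asvs <- unique(unlist(lapply(tables, colnames), use.names = FALSE))
  padded <- lapply(tables, function(tab) {
    # Start from zeros so ASVs unseen in this run get a count of 0.
    out <- matrix(0, nrow = nrow(tab), ncol = length(all_asvs),
                  dimnames = list(rownames(tab), all_asvs))
    out[, colnames(tab)] <- tab
    out
  })
  list(
    counts = do.call(rbind, unname(padded)),
    run = rep(names(tables), vapply(tables, nrow, integer(1)))
  )
}

# Toy tables: one sample per run, one shared ASV and one run-specific ASV each.
run1 <- matrix(c(10, 5), nrow = 1, dimnames = list("sample1", c("ACGT", "TTGA")))
run2 <- matrix(c(7, 3), nrow = 1, dimnames = list("sampleA", c("ACGT", "GGCC")))
merged <- merge_run_tables(list(run1 = run1, run2 = run2))
print(merged$counts)
print(merged$run)
```

The `run` vector returned alongside the counts is the kind of grouping variable a PERMANOVA or beta dispersion test would take when checking for run effects.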
The workflow produces comprehensive output files in the results/ directory:
- seqtab_nochim.csv: ASV count table
- taxonomy.csv: Taxonomic assignments for each ASV
- taxonomy_with_confidence.csv: Taxonomy with bootstrap confidence scores
- phyloseq_object.rds: R object for downstream analysis
- ASVs.fasta: FASTA file containing ASV sequences
- filter_summary.csv: Quality filtering statistics
- chimera_summary.csv: Chimera detection statistics
- read_tracking_detailed.csv: Read counts through each pipeline step
- rarefaction_curves.rds: Data for rarefaction analysis
- workflow_summary.rds: Complete statistics about the analysis run
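After a run, a quick way to confirm that the pipeline finished is to check that the files listed above exist. The following is a minimal base-R sketch. The helper name `missing_outputs` is hypothetical, and the demo uses a temporary directory with placeholder files rather than real results.

```r
# File names taken from the output list in this README.
expected_outputs <- c(
  "seqtab_nochim.csv", "taxonomy.csv", "taxonomy_with_confidence.csv",
  "phyloseq_object.rds", "ASVs.fasta", "filter_summary.csv",
  "chimera_summary.csv", "read_tracking_detailed.csv",
  "rarefaction_curves.rds", "workflow_summary.rds"
)

# Return the expected files that are absent from results_dir.
missing_outputs <- function(results_dir, files = expected_outputs) {
  files[!file.exists(file.path(results_dir, files))]
}

# Demo: a temporary results directory that contains only two of the files.
demo_results <- file.path(tempdir(), "results_demo")
dir.create(demo_results, showWarnings = FALSE)
invisible(file.create(file.path(demo_results,
                                c("seqtab_nochim.csv", "taxonomy.csv"))))
absent <- missing_outputs(demo_results)
print(absent)
```

For a real run, calling `missing_outputs("results")` from the repository root should return an empty character vector when every output was written.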
Contributions to improve this workflow are welcome. Please feel free to submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016). "DADA2: High-resolution sample inference from Illumina amplicon data." Nature Methods, 13, 581-583. doi: 10.1038/nmeth.3869
- McMurdie PJ, Holmes S (2013). "phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data." PLoS ONE, 8(4), e61217. doi: 10.1371/journal.pone.0061217
- Murali A, Bhargava A, Wright ES (2018). "IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences." Microbiome, 6, 140. doi: 10.1186/s40168-018-0521-5