FlowMeta: Automated End-to-End Metagenomic & Amplicon Profiling Pipeline

FlowMeta packages two end-to-end bioinformatics workflows into a single Python package that can be deployed on any Linux/WSL environment:

  • flowmeta base (shotgun metagenomics): 10-step pipeline covering fastp → Bowtie2 → Kraken2/Bracken → host-taxid filtering → OTU/MPA matrix merge.
  • flowmeta amplicon (16S amplicon): 5-step pipeline covering QIIME2 import → QC → denoise (DADA2/VSEARCH) → taxonomy classification → export.

Each processing phase writes checkpoint flags so you can resume or re-run individual stages with confidence.

Highlights

  • End-to-end orchestration for both shotgun metagenomic and 16S amplicon workflows.
  • Auto-detection of paired-end vs single-end reads, with a matching default trunc_len (140 for single-end, 250 for paired-end).
  • Supports both DADA2 (ASV) and VSEARCH (97% OTU) denoising for amplicon data.
  • Smart checkpointing (.task.complete per sample / stepN.done per step) for incremental runs.
  • Optional shared-memory caching for Kraken2 databases.
  • Configurable project prefix applied to merged outputs.
  • Detailed per-step logs stored alongside generated data.
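
The checkpoint markers named above can be inspected from Python before deciding whether to resume a run. The following is a minimal sketch, not FlowMeta's own code: the helper name find_markers is hypothetical, and because this README does not say where the markers are written, the sketch simply searches the whole output directory for files named stepN.done or ending in .task.complete.

```python
import re
from pathlib import Path

# Marker names taken from the Highlights list; their location is an assumption.
STEP_MARKER = re.compile(r"^step(\d+)\.done$")


def find_markers(output_dir):
    """Return (sorted completed step numbers, sorted per-sample marker paths)."""
    steps = set()
    samples = []
    for path in Path(output_dir).rglob("*"):
        if not path.is_file():
            continue
        match = STEP_MARKER.match(path.name)
        if match:
            steps.add(int(match.group(1)))
        elif path.name.endswith(".task.complete"):
            samples.append(path)
    return sorted(steps), sorted(samples)
```

Calling find_markers("/mnt/data/flowmeta-out") on a partially finished run would show which steps are already flagged as done before you choose a value for --step.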

Graphical Abstract

[Figure: FlowMeta graphical abstract]

Installation

Option A: From environment.yml (recommended)

conda env create -f environment.yml
conda activate flowmeta
pip install flowmeta

Option B: Manual setup

conda create -n flowmeta python=3.9 -y
conda activate flowmeta

# Shotgun metagenomics tools (flowmeta base)
conda install -c bioconda fastp bowtie2 samtools kraken2 bracken -y
conda install -c conda-forge scipy pandas pigz seqkit -y

# 16S amplicon tools — QIIME2 + all plugins (flowmeta amplicon)
# Note: q2-dada2 automatically pulls in R (4.3.x) and the dada2 R package
conda install -c qiime2 -c conda-forge -c bioconda \
    qiime2 q2cli q2-types q2-quality-filter q2-dada2 \
    q2-feature-classifier q2-vsearch q2-taxa biom-format -y

# Install flowmeta
pip install flowmeta

Post-install: verify DADA2 R dependencies

DADA2 requires R packages (dada2, GenomeInfoDbData) that may not install correctly via conda. Verify:

R -e 'library(dada2); cat("dada2 OK\n")'

If you see Error in library(dada2) or DADA2 fails with "Error matrix is NULL" at runtime:

# Fix: reinstall GenomeInfoDbData from Bioconductor
R -e 'if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org"); BiocManager::install("GenomeInfoDbData", ask=FALSE, update=FALSE)'

Validate

flowmeta -h

Verified dependency versions

Component               Version
Python                  3.9.x
R                       4.3.3
QIIME2 plugins (q2-*)   2024.5.0
DADA2 (R)               1.30.0
Rcpp / RcppParallel     1.1.0 / 5.1.9
GenomeInfoDbData        1.2.11

External executables for shotgun mode: fastp, bowtie2, samtools, kraken2, bracken, pigz, seqkit.
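
A quick way to confirm that these executables are on PATH is to probe each one with shutil.which. This is a small sketch for convenience; the helper name missing_tools is illustrative and not part of FlowMeta.

```python
import shutil

# Executables listed in this README for shotgun mode.
SHOTGUN_TOOLS = ["fastp", "bowtie2", "samtools", "kraken2", "bracken", "pigz", "seqkit"]


def missing_tools(tools=SHOTGUN_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


print("Missing tools:", ", ".join(missing_tools()) or "none")
```

Run it inside the activated flowmeta environment; an empty result means every listed tool resolves on PATH, although it does not check tool versions.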

Quick Start

Shotgun Metagenomics (flowmeta base)

# Subcommand mode (recommended)
flowmeta base \
    --input_dir /mnt/data/01-raw \
    --output_dir /mnt/data/flowmeta-out \
    --db_bowtie2 /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as \
    --db_kraken /mnt/db/k2ppf \
    --threads 8 \
    --project_prefix SMOOTH-

# Legacy mode (backward compatible)
flowmeta_base \
    --input_dir /mnt/data/01-raw \
    --output_dir /mnt/data/flowmeta-out \
    --db_bowtie2 /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as \
    --db_kraken /mnt/db/k2ppf \
    --threads 8 \
    --project_prefix SMOOTH-
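
If you launch runs from a Python script rather than a shell, the same command can be assembled as an argument list. The sketch below only uses flags shown in the Quick Start above; the function name build_base_command is hypothetical and the default of 8 threads simply mirrors the example.

```python
import shlex


def build_base_command(input_dir, output_dir, db_bowtie2, db_kraken,
                       threads=8, project_prefix=None):
    """Assemble the argv list for `flowmeta base` from documented flags."""
    cmd = [
        "flowmeta", "base",
        "--input_dir", str(input_dir),
        "--output_dir", str(output_dir),
        "--db_bowtie2", str(db_bowtie2),
        "--db_kraken", str(db_kraken),
        "--threads", str(threads),
    ]
    if project_prefix:
        cmd += ["--project_prefix", project_prefix]
    return cmd


cmd = build_base_command(
    "/mnt/data/01-raw",
    "/mnt/data/flowmeta-out",
    "/mnt/db/GRCh38_noalt_as/GRCh38_noalt_as",
    "/mnt/db/k2ppf",
    project_prefix="SMOOTH-",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with paths that contain spaces.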

16S Amplicon (flowmeta amplicon)

# DADA2 denoising (auto-detects paired/single-end, auto trunc_len)
flowmeta amplicon \
    --input_dir /mnt/data/16S-raw \
    --output_dir /mnt/data/16S-out \
    --method dada2 \
    --silva_db /mnt/db/silva16s/silva138_classifier.qza \
    --threads 16

# VSEARCH denoising (OTU clustering at 97%)
flowmeta amplicon \
    --input_dir /mnt/data/16S-raw \
    --output_dir /mnt/data/16S-out \
    --method vsearch \
    --silva_db /mnt/db/silva16s/silva138_classifier.qza \
    --threads 16

# Override trunc_len manually (0 = auto: 140 for single-end, 250 for paired-end)
flowmeta amplicon \
    --input_dir /mnt/data/16S-raw \
    --output_dir /mnt/data/16S-out \
    --method dada2 \
    --silva_db /mnt/db/silva16s/silva138_classifier.qza \
    --threads 16 \
    --trunc_len 200

# Standalone command (also available)
flowmeta_amplicon \
    --input_dir /mnt/data/16S-raw \
    --output_dir /mnt/data/16S-out \
    --method dada2 \
    --silva_db /mnt/db/silva16s/silva138_classifier.qza \
    --threads 16

CLI Reference

flowmeta base — Shotgun Metagenomics

Flag                      Description
--input_dir               Directory containing raw FASTQ files (paired _1.fastq.gz/_2.fastq.gz).
--output_dir              Target workspace; creates 02-qc through 09-mpa.
--db_bowtie2              Path prefix of the host Bowtie2 index.
--db_kraken               Kraken2 database directory (must include hash.k2d, opts.k2d, taxo.k2d).
--threads                 Worker threads per sample (default 32).
--batch                   Samples processed in parallel for fastp/Kraken (default 2).
--min_count               Bracken minimum count during host-taxid filtering (default 4).
--skip_integrity_checks   Skip FASTQ integrity checks.
--check_result            Enable integrity checks (Steps 2 and 4).
--enable_bracken_step7    Run Bracken during Step 7 (default: Kraken2 only).
--project_prefix          Label prepended to merged outputs (e.g. SMOOTH-).
--force                   Recompute even if checkpoint markers exist.
--step                    Resume from step N (1–10).
--step_only               With --step, run only that step and stop.
--skip_host_extract       Skip samtools host FASTQ extraction (Step 5).
--no_shm/--shm_path       Control Kraken2 shared-memory caching.
--se                      Treat input as single-end reads.
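
Since --db_kraken must point at a directory containing hash.k2d, opts.k2d and taxo.k2d, it can be worth checking that before starting a long run. A minimal sketch, with the helper name missing_kraken_files chosen here for illustration:

```python
from pathlib import Path

# File names required by the --db_kraken description above.
REQUIRED_K2_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_kraken_files(db_dir):
    """Return the required Kraken2 database files absent from db_dir."""
    db = Path(db_dir)
    return [name for name in REQUIRED_K2_FILES if not (db / name).is_file()]
```

For example, missing_kraken_files("/mnt/db/k2ppf") returns an empty list when the database directory is complete.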

flowmeta amplicon — 16S Amplicon

Flag              Description
--input_dir       Directory containing FASTQ files (paired or single-end, auto-detected).
--output_dir      Target workspace; creates 02-qc through 06-export.
--method          Denoising method: dada2 or vsearch (required).
--silva_db        Path to SILVA classifier QZA file (required).
--threads         Number of threads (default 16).
--trunc_len       DADA2 truncation length. 0 = auto: 140 for single-end, 250 for paired-end (default 0).
--n_reads_learn   Number of reads for DADA2 error model training (default 1000000).
--force           Ignore step completion markers and rerun.
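
The interaction between read layout and --trunc_len can be illustrated with a short sketch. It is not FlowMeta's implementation: the function names are hypothetical, and the pairing rule (every *_1.fastq.gz file has a *_2.fastq.gz mate) is an assumption borrowed from the file naming documented for flowmeta base. Only the 0 = auto rule and the 140/250 defaults come from the table above.

```python
from pathlib import Path


def detect_layout(input_dir):
    """Return 'paired' if every *_1.fastq.gz has a *_2.fastq.gz mate, else 'single'."""
    forward = sorted(Path(input_dir).glob("*_1.fastq.gz"))
    if not forward:
        return "single"
    for fwd in forward:
        stem = fwd.name[: -len("_1.fastq.gz")]
        if not fwd.with_name(stem + "_2.fastq.gz").exists():
            return "single"
    return "paired"


def resolve_trunc_len(trunc_len, layout):
    """Apply the documented rule: 0 means auto (250 paired-end, 140 single-end)."""
    if trunc_len != 0:
        return trunc_len
    return 250 if layout == "paired" else 140
```

With these definitions, resolve_trunc_len(0, "paired") gives 250, while an explicit value such as 200 is passed through unchanged.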

Step Maps

Shotgun Metagenomics (10 Steps)

Step  Purpose
1     fastp trimming and QC.
2     fastp integrity verification (requires --check_result).
3     Bowtie2 host depletion.
4     Host-removed FASTQ integrity check (requires --check_result).
5     Optional samtools host-read export.
6     Stage Kraken2 DB to shared memory.
7     Kraken2/Bracken classification.
8     Kraken report validation.
9     Remove host taxa + rerun Bracken.
10    Merge OTU/MPA/Bracken matrices.
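
To see which steps a given flag combination is expected to run, the step map can be combined with the --step, --step_only, --check_result and --skip_host_extract descriptions. The sketch below is an interpretation of this README, not FlowMeta's scheduler: plan_steps is a hypothetical helper, and it ignores --no_shm and --force, whose effect on the step list is not spelled out here.

```python
def plan_steps(start=1, step_only=False, check_result=False, skip_host_extract=False):
    """Return the shotgun steps expected to run, per the step map in this README."""
    if not 1 <= start <= 10:
        raise ValueError("start must be between 1 and 10")
    candidates = [start] if step_only else list(range(start, 11))
    skipped = set()
    if not check_result:
        skipped.update({2, 4})  # integrity checks require --check_result
    if skip_host_extract:
        skipped.add(5)          # samtools host-read export is optional
    return [step for step in candidates if step not in skipped]
```

Under these assumptions a default run covers steps 1, 3 and 5 through 10, and plan_steps(7, step_only=True) returns only step 7.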

16S Amplicon

Step  DADA2                               VSEARCH                             Output
1     Import FASTQ into QIIME2 artifact   Import FASTQ into QIIME2 artifact   02-qc/01-demux.qza
2     (skipped: DADA2 has built-in QC)    Quality filtering (q-score)         03-trim/03-trimmed-seqs.qza
3     DADA2 denoise-paired/single         VSEARCH dereplicate + 97% OTU       04-denoise/06-table.qza, 07-rep-seqs.qza
4     Taxonomy classification (sklearn)   Taxonomy classification (sklearn)   05-taxonomy/taxonomy.qza
5     Export OTU table + taxonomy TSV     Export OTU table + taxonomy TSV     06-export/otu_table.tsv, taxonomy.tsv

Note (DADA2): Input FASTQ files must have real, variable quality scores. Data with uniform/flat quality scores (e.g. some SRA downloads where all Q=30) will cause DADA2's error rate estimation to fail. VSEARCH is not affected by this limitation.
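
One way to screen for flat quality scores before choosing DADA2 is to collect the distinct quality characters in the first reads of a file. This is a rough heuristic sketch, not a reproduction of DADA2's own checks: the function names are hypothetical, it assumes gzipped FASTQ with exactly four lines per record, and it treats "at most one distinct quality character" as flat.

```python
import gzip
from itertools import islice


def distinct_quality_chars(fastq_gz, max_reads=10000):
    """Collect the distinct quality characters seen in the first max_reads records."""
    chars = set()
    with gzip.open(fastq_gz, "rt") as handle:
        # Quality strings are the 4th line of each record: indices 3, 7, 11, ...
        for line in islice(handle, 3, max_reads * 4, 4):
            chars.update(line.rstrip("\n"))
    return chars


def looks_flat(fastq_gz, max_reads=10000):
    """True when the sampled reads carry at most one distinct quality value."""
    return len(distinct_quality_chars(fastq_gz, max_reads)) <= 1
```

If looks_flat returns True for your samples, running with --method vsearch is the safer choice according to the note above.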

Output Layout

Shotgun Metagenomics

02-qc/            # fastp outputs + integrity checks
03-hr/            # host-removed FASTQ
04-bam/           # Bowtie2 BAM/indices
05-host/          # optional host reads (samtools)
06-ku/            # Kraken2 reports/outputs
07-bracken/       # Bracken abundance tables
08-ku2/           # Host-filtered rerun (reports + diversity)
09-mpa/           # Merged OTU/MPA/summary matrices

16S Amplicon

02-qc/            # QIIME2 demux artifact + manifest
03-trim/          # Quality-filtered sequences
04-denoise/       # Feature table + representative sequences
05-taxonomy/      # Taxonomy classification
06-export/        # otu_table.tsv + taxonomy.tsv
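
After an amplicon run, the expected artifacts can be checked against the step map. The sketch below is illustrative only: missing_outputs is a hypothetical helper, the file list is copied from the Output column of the 16S step table, and it assumes that 07-rep-seqs.qza sits in 04-denoise/ and that the 03-trim/ artifact is produced only with VSEARCH.

```python
from pathlib import Path

# Relative paths taken from the 16S step table; directory placement is assumed.
COMMON_OUTPUTS = [
    "02-qc/01-demux.qza",
    "04-denoise/06-table.qza",
    "04-denoise/07-rep-seqs.qza",
    "05-taxonomy/taxonomy.qza",
    "06-export/otu_table.tsv",
    "06-export/taxonomy.tsv",
]
VSEARCH_ONLY_OUTPUTS = ["03-trim/03-trimmed-seqs.qza"]


def missing_outputs(output_dir, method="dada2"):
    """Return expected amplicon outputs that are not present under output_dir."""
    expected = list(COMMON_OUTPUTS)
    if method == "vsearch":
        expected += VSEARCH_ONLY_OUTPUTS
    root = Path(output_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from missing_outputs("/mnt/data/16S-out", method="dada2") indicates that all five steps left their documented files behind.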

Reference Databases

Kraken2 (shotgun)

Bowtie2 (shotgun host reference)

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz
gunzip GCA_000001405.28_GRCh38.p13_genomic.fna.gz
seqkit grep -rvp 'alt|PATCH' GCA_000001405.28_GRCh38.p13_genomic.fna > GRCh38_noalt.fna
mkdir -p GRCh38_noalt_as   # bowtie2-build does not create the output directory
bowtie2-build GRCh38_noalt.fna GRCh38_noalt_as/GRCh38_noalt_as

SILVA (amplicon taxonomy)

  • Download SILVA reference sequences and taxonomy from https://www.arb-silva.de/
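  • Train the classifier in the same QIIME2 environment you will run FlowMeta in; a Naive Bayes classifier only loads under the scikit-learn version it was trained with: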
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138-99-seqs.qza \
    --i-reference-taxonomy silva-138-99-tax.qza \
    --o-classifier silva138_classifier.qza

Claude Code Skill

FlowMeta ships with a Claude Code skill that lets you run both pipelines through natural language. After installing the package:

flowmeta install-skill

This copies the skill to ~/.claude/skills/flowmeta/. Then in Claude Code, type /flowmeta to invoke it. Example prompts:

  • "I have paired-end FASTQ files in /data/raw, run shotgun metagenomics with Kraken2"
  • "Analyze my 16S amplicon data using DADA2, the FASTQ files are in /data/16S"
  • "My DADA2 run failed with Error matrix is NULL, what should I do?"

The skill guides Claude through environment verification, command construction, troubleshooting, and result interpretation.

Build & Publish

pip install build
python -m build --wheel
ls dist/

Documentation

Support

Dongqiang Zeng · interlaken@smu.edu.cn

GitHub: https://github.com/SkinMicrobe/FlowMeta
