FlowMeta: Automated End-to-End Metagenomic & Amplicon Profiling Pipeline

FlowMeta bundles two end-to-end bioinformatics workflows into a single Python package that runs on Linux or WSL:

flowmeta base (shotgun metagenomics): 10-step pipeline covering fastp → Bowtie2 → Kraken2/Bracken → host-taxid filtering → OTU/MPA matrix merge.
flowmeta amplicon (16S amplicon): 5-step pipeline covering QIIME2 import → QC → denoise (DADA2/VSEARCH) → taxonomy classification → export.

Each processing phase writes checkpoint flags, so an interrupted run can be resumed and individual stages can be re-run without repeating completed work.

Highlights

End-to-end orchestration for both shotgun metagenomic and 16S amplicon workflows.
Auto-detection of paired-end vs. single-end amplicon reads, with matching trunc_len defaults (140 for single-end, 250 for paired-end).
Supports both DADA2 (ASV) and VSEARCH (97% OTU) denoising for amplicon data.
Smart checkpointing (.task.complete per sample / stepN.done per step) for incremental runs.
Optional shared-memory caching for Kraken2 databases.
Configurable project prefix applied to merged outputs.
Detailed per-step logs stored alongside generated data.
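
The checkpointing idea can be pictured with a short sketch. This is an illustration only, not FlowMeta's actual implementation: the helper names `marker_path` and `run_step`, and the choice to keep markers directly in the working directory, are assumptions. Only the `stepN.done` naming comes from this README.

```python
from pathlib import Path
from typing import Callable


def marker_path(workdir: Path, step: int) -> Path:
    """Return the marker file for a step, following the stepN.done naming."""
    # Assumption: markers live directly in the working directory.
    return workdir / f"step{step}.done"


def run_step(workdir: Path, step: int, action: Callable[[], None],
             force: bool = False) -> bool:
    """Run `action` unless its marker exists; return True if it actually ran."""
    marker = marker_path(workdir, step)
    if marker.exists() and not force:
        return False  # completed in an earlier run, so skip it
    action()
    marker.touch()  # written only after the action finished without raising
    return True
```

Because the marker is written after the action returns, a step that crashes midway leaves no marker and is picked up again on the next run. Passing `force=True` mirrors the documented `--force` behaviour of recomputing even when a marker exists.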

Graphical Abstract

Installation

Option A: From environment.yml (recommended)

conda env create -f environment.yml
conda activate flowmeta
pip install flowmeta

Option B: Manual setup

conda create -n flowmeta python=3.9 -y
conda activate flowmeta

# Shotgun metagenomics tools (flowmeta base)
conda install -c bioconda fastp bowtie2 samtools kraken2 bracken -y
conda install -c conda-forge scipy pandas pigz seqkit -y

# 16S amplicon tools — QIIME2 + all plugins (flowmeta amplicon)
# Note: q2-dada2 automatically pulls in R (4.3.x) and the dada2 R package
conda install -c qiime2 -c conda-forge -c bioconda \
    qiime2 q2cli q2-types q2-quality-filter q2-dada2 \
    q2-feature-classifier q2-vsearch q2-taxa biom-format -y

# Install flowmeta
pip install flowmeta

Post-install: verify DADA2 R dependencies

DADA2 requires R packages (dada2, GenomeInfoDbData) that may not install correctly via conda. Verify:

R -e 'library(dada2); cat("dada2 OK\n")'

If you see Error in library(dada2) or DADA2 fails with "Error matrix is NULL" at runtime:

# Fix: reinstall GenomeInfoDbData from Bioconductor
R -e 'if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org"); BiocManager::install("GenomeInfoDbData", ask=FALSE, update=FALSE)'

Validate

flowmeta -h

Verified dependency versions

| Component | Version |
| --- | --- |
| Python | 3.9.x |
| R | 4.3.3 |
| QIIME2 plugins (q2-*) | 2024.5.0 |
| DADA2 (R) | 1.30.0 |
| Rcpp / RcppParallel | 1.1.0 / 5.1.9 |
| GenomeInfoDbData | 1.2.11 |

External executables for shotgun mode: fastp, bowtie2, samtools, kraken2, bracken, pigz, seqkit.
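
Before a long shotgun run, it can help to confirm that these executables are on `PATH`. The sketch below uses only the Python standard library. The function name `missing_tools` is hypothetical and not part of FlowMeta.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executables listed in this README for shotgun mode.
SHOTGUN_TOOLS = ["fastp", "bowtie2", "samtools", "kraken2", "bracken", "pigz", "seqkit"]


def missing_tools(tools: Iterable[str] = SHOTGUN_TOOLS,
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the tools that cannot be found on PATH, in input order."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be swapped out in tests. In normal use the default `shutil.which` is enough.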

Quick Start

Shotgun Metagenomics (`flowmeta base`)

# Subcommand mode (recommended)
flowmeta base \
    --input_dir /mnt/data/01-raw \
    --output_dir /mnt/data/flowmeta-out \
    --db_bowtie2 /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as \
    --db_kraken /mnt/db/k2ppf \
    --threads 8 \
    --project_prefix SMOOTH-

# Legacy mode (backward compatible)
flowmeta_base \
    --input_dir /mnt/data/01-raw \
    --output_dir /mnt/data/flowmeta-out \
    --db_bowtie2 /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as \
    --db_kraken /mnt/db/k2ppf \
    --threads 8 \
    --project_prefix SMOOTH-
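
When launching the pipeline from a Python script or notebook, the same invocation can be assembled as an argument list. This is a minimal sketch: `build_base_command` is a hypothetical helper, and it uses only the flags and example paths shown above.

```python
import shlex
from typing import List


def build_base_command(input_dir: str, output_dir: str, db_bowtie2: str,
                       db_kraken: str, threads: int = 8,
                       project_prefix: str = "") -> List[str]:
    """Assemble the argv list for `flowmeta base` from the documented flags."""
    cmd = [
        "flowmeta", "base",
        "--input_dir", input_dir,
        "--output_dir", output_dir,
        "--db_bowtie2", db_bowtie2,
        "--db_kraken", db_kraken,
        "--threads", str(threads),
    ]
    if project_prefix:  # the prefix is optional, so add it only when given
        cmd += ["--project_prefix", project_prefix]
    return cmd


cmd = build_base_command(
    "/mnt/data/01-raw", "/mnt/data/flowmeta-out",
    "/mnt/db/GRCh38_noalt_as/GRCh38_noalt_as", "/mnt/db/k2ppf",
    threads=8, project_prefix="SMOOTH-",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To launch it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.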

16S Amplicon (`flowmeta amplicon`)

# DADA2 denoising (auto-detects paired/single-end, auto trunc_len)
flowmeta amplicon \
    --input_dir /mnt/data/16S-raw \
    --output_dir /mnt/data/16S-out \
    --method dada2 \
    --silva_db /mnt/db/silva16s/silva138_classifier.qza \
    --threads 16

# VSEARCH denoising (OTU clustering at 97%)
flowmeta amplicon \
    --input_dir /mnt/data/16S-raw \
    --output_dir /mnt/data/16S-out \
    --method vsearch \
    --silva_db /mnt/db/silva16s/silva138_classifier.qza \
    --threads 16

# Override trunc_len manually (0 = auto: 140 for single-end, 250 for paired-end)
flowmeta amplicon \
    --input_dir /mnt/data/16S-raw \
    --output_dir /mnt/data/16S-out \
    --method dada2 \
    --silva_db /mnt/db/silva16s/silva138_classifier.qza \
    --threads 16 \
    --trunc_len 200

# Standalone command (also available)
flowmeta_amplicon \
    --input_dir /mnt/data/16S-raw \
    --output_dir /mnt/data/16S-out \
    --method dada2 \
    --silva_db /mnt/db/silva16s/silva138_classifier.qza \
    --threads 16

CLI Reference

`flowmeta base` — Shotgun Metagenomics

| Flag | Description |
| --- | --- |
| `--input_dir` | Directory containing raw FASTQ files (paired `_1.fastq.gz`/`_2.fastq.gz`). |
| `--output_dir` | Target workspace; creates `02-qc` … `09-mpa`. |
| `--db_bowtie2` | Path prefix of the host Bowtie2 index. |
| `--db_kraken` | Kraken2 database directory (must include `hash.k2d`, `opts.k2d`, `taxo.k2d`). |
| `--threads` | Worker threads per sample (default 32). |
| `--batch` | Samples processed in parallel for fastp/Kraken (default 2). |
| `--min_count` | Bracken minimum count during host-taxid filtering (default 4). |
| `--skip_integrity_checks` | Skip FASTQ integrity checks. |
| `--check_result` | Enable integrity checks (Steps 2 and 4). |
| `--enable_bracken_step7` | Run Bracken during Step 7 (default: Kraken2 only). |
| `--project_prefix` | Label prepended to merged outputs (e.g. `SMOOTH-`). |
| `--force` | Recompute even if checkpoint markers exist. |
| `--step` | Resume from step N (1–10). |
| `--step_only` | With `--step`, run only that step and stop. |
| `--skip_host_extract` | Skip samtools host FASTQ extraction (Step 5). |
| `--no_shm`/`--shm_path` | Control Kraken2 shared-memory caching. |
| `--se` | Treat input as single-end reads. |
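
The interaction between `--step` and `--step_only` can be summarised as a small function. This sketch reflects the descriptions in the table above. The function name `steps_to_run` is hypothetical, and FlowMeta's internal logic may differ.

```python
from typing import List, Optional

TOTAL_STEPS = 10  # the shotgun pipeline has 10 steps


def steps_to_run(step: Optional[int] = None, step_only: bool = False) -> List[int]:
    """Return the step numbers selected by --step / --step_only."""
    if step is None:
        return list(range(1, TOTAL_STEPS + 1))  # no --step: run everything
    if not 1 <= step <= TOTAL_STEPS:
        raise ValueError(f"step must be between 1 and {TOTAL_STEPS}, got {step}")
    if step_only:
        return [step]  # run just this step and stop
    return list(range(step, TOTAL_STEPS + 1))  # resume from step N to the end
```

For example, `steps_to_run(7)` selects steps 7 through 10, while `steps_to_run(7, step_only=True)` selects only step 7.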

`flowmeta amplicon` — 16S Amplicon

| Flag | Description |
| --- | --- |
| `--input_dir` | Directory containing FASTQ files (paired or single-end, auto-detected). |
| `--output_dir` | Target workspace; creates `02-qc` … `06-export`. |
| `--method` | Denoising method: `dada2` or `vsearch` (required). |
| `--silva_db` | Path to SILVA classifier QZA file (required). |
| `--threads` | Number of threads (default 16). |
| `--trunc_len` | DADA2 truncation length. `0` = auto: 140 for single-end, 250 for paired-end (default 0). |
| `--n_reads_learn` | Number of reads for DADA2 error model training (default 1000000). |
| `--force` | Ignore step completion markers and rerun. |
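
The `--trunc_len` rule is simple enough to express directly. In the sketch below, `resolve_trunc_len` follows the documented defaults (140 and 250). `looks_paired` is a guess at how layout detection could work, assuming the `_1.fastq.gz`/`_2.fastq.gz` naming used by `flowmeta base`. This README does not say how the amplicon mode actually detects layout, and both function names are hypothetical.

```python
from typing import Iterable


def looks_paired(filenames: Iterable[str]) -> bool:
    """Assumed heuristic: every *_1.fastq.gz file has a matching *_2.fastq.gz."""
    names = set(filenames)
    suffix = "_1.fastq.gz"
    forward = [n for n in names if n.endswith(suffix)]
    return bool(forward) and all(
        n[: -len(suffix)] + "_2.fastq.gz" in names for n in forward
    )


def resolve_trunc_len(trunc_len: int, paired: bool) -> int:
    """Apply the documented rule: 0 means auto (250 paired-end, 140 single-end)."""
    if trunc_len < 0:
        raise ValueError("trunc_len must be >= 0")
    if trunc_len == 0:
        return 250 if paired else 140
    return trunc_len  # an explicit value always wins
```

With these definitions, `resolve_trunc_len(0, looks_paired(files))` gives the automatic value for a directory listing, and `resolve_trunc_len(200, ...)` returns 200 regardless of layout.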

Step Maps

Shotgun Metagenomics (10 Steps)

| Step | Purpose |
| --- | --- |
| 1 | fastp trimming and QC. |
| 2 | fastp integrity verification (requires `--check_result`). |
| 3 | Bowtie2 host depletion. |
| 4 | Host-removed FASTQ integrity check (requires `--check_result`). |
| 5 | Optional samtools host-read export. |
| 6 | Stage Kraken2 DB to shared memory. |
| 7 | Kraken2/Bracken classification. |
| 8 | Kraken report validation. |
| 9 | Remove host taxa + rerun Bracken. |
| 10 | Merge OTU/MPA/Bracken matrices. |

16S Amplicon

| Step | DADA2 | VSEARCH | Output |
| --- | --- | --- | --- |
| 1 | Import FASTQ into QIIME2 artifact | Import FASTQ into QIIME2 artifact | `02-qc/01-demux.qza` |
| 2 | (skipped: DADA2 has built-in QC) | Quality filtering (q-score) | `03-trim/03-trimmed-seqs.qza` |
| 3 | DADA2 denoise-paired/single | VSEARCH dereplicate + 97% OTU | `04-denoise/06-table.qza`, `07-rep-seqs.qza` |
| 4 | Taxonomy classification (sklearn) | Taxonomy classification (sklearn) | `05-taxonomy/taxonomy.qza` |
| 5 | Export OTU table + taxonomy TSV | Export OTU table + taxonomy TSV | `06-export/otu_table.tsv`, `taxonomy.tsv` |

Note (DADA2): Input FASTQ files must have real, variable quality scores. Data with uniform/flat quality scores (e.g. some SRA downloads where all Q=30) will cause DADA2's error rate estimation to fail. VSEARCH is not affected by this limitation.
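
A quick pre-flight check can flag files with flat quality scores before DADA2 is run. The sketch below is a heuristic, not part of FlowMeta. It assumes standard four-line FASTQ records and reports a file as flat when only one distinct quality character appears in the sampled records. The function names and the `max_records` limit are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, Set


def quality_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield the quality string (4th line) of each four-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


def has_flat_quality(lines: Iterable[str], max_records: int = 10000) -> bool:
    """True if the sampled records use exactly one distinct quality character."""
    seen: Set[str] = set()
    for count, qual in enumerate(quality_lines(lines)):
        if count >= max_records:
            break
        seen.update(qual)
        if len(seen) > 1:
            return False  # variable qualities found, stop early
    return len(seen) == 1  # an empty input is not reported as flat


def file_has_flat_quality(path: str, max_records: int = 10000) -> bool:
    """Convenience wrapper for a gzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return has_flat_quality(handle, max_records)
```

If the check returns `True` for the input files, `--method vsearch` is the safer choice, since the note above says VSEARCH is not affected.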

Output Layout

Shotgun Metagenomics

02-qc/            # fastp outputs + integrity checks
03-hr/            # host-removed FASTQ
04-bam/           # Bowtie2 BAM/indices
05-host/          # optional host reads (samtools)
06-ku/            # Kraken2 reports/outputs
07-bracken/       # Bracken abundance tables
08-ku2/           # Host-filtered rerun (reports + diversity)
09-mpa/           # Merged OTU/MPA/summary matrices

16S Amplicon

02-qc/            # QIIME2 demux artifact + manifest
03-trim/          # Quality-filtered sequences
04-denoise/       # Feature table + representative sequences
05-taxonomy/      # Taxonomy classification
06-export/        # otu_table.tsv + taxonomy.tsv
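
After a run, the layouts above can be compared against what is actually on disk. This is a minimal sketch with a hypothetical helper name, `missing_dirs`. Some directories may legitimately be absent, for example `05-host` when host extraction is skipped, so treat the result as informational.

```python
from pathlib import Path
from typing import Dict, List

# Directory names copied from the output layouts in this README.
EXPECTED_DIRS: Dict[str, List[str]] = {
    "base": ["02-qc", "03-hr", "04-bam", "05-host",
             "06-ku", "07-bracken", "08-ku2", "09-mpa"],
    "amplicon": ["02-qc", "03-trim", "04-denoise", "05-taxonomy", "06-export"],
}


def missing_dirs(output_dir: str, mode: str) -> List[str]:
    """Return expected sub-directories that do not exist under output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_DIRS[mode] if not (root / name).is_dir()]
```

For example, `missing_dirs("/mnt/data/16S-out", "amplicon")` returns an empty list when all five amplicon directories are present.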

Reference Databases

Kraken2 (shotgun)

Download pre-built libraries: https://benlangmead.github.io/aws-indexes/k2
Extract and point --db_kraken to the directory containing hash.k2d, opts.k2d, taxo.k2d.
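
To catch a wrong `--db_kraken` path early, the three required files can be checked before starting a run. This sketch uses only the standard library. The helper name `missing_kraken_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# Files this README says a Kraken2 database directory must contain.
REQUIRED_K2_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_kraken_files(db_dir: str) -> List[str]:
    """Return the required .k2d files that are absent from db_dir."""
    root = Path(db_dir)
    return [name for name in REQUIRED_K2_FILES if not (root / name).is_file()]
```

An empty list means the directory has the expected files. It does not verify that the database itself is intact.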

Bowtie2 (shotgun host reference)

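wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz
gunzip GCA_000001405.28_GRCh38.p13_genomic.fna.gz
# Drop alt-locus and patch scaffolds. -n matches the full header line;
# by default seqkit grep compares the pattern against the sequence ID only.
seqkit grep -n -rvp 'alt|PATCH' GCA_000001405.28_GRCh38.p13_genomic.fna > GRCh38_noalt.fna
# Make sure the index directory exists before building
mkdir -p GRCh38_noalt_as
bowtie2-build GRCh38_noalt.fna GRCh38_noalt_as/GRCh38_noalt_as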

SILVA (amplicon taxonomy)

Download SILVA reference sequences and taxonomy from https://www.arb-silva.de/
Train a classifier compatible with your sklearn version:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138-99-seqs.qza \
    --i-reference-taxonomy silva-138-99-tax.qza \
    --o-classifier silva138_classifier.qza

Claude Code Skill

FlowMeta ships with a Claude Code skill that lets you run both pipelines through natural language. After installing the package:

flowmeta install-skill

This copies the skill to ~/.claude/skills/flowmeta/. Then in Claude Code, type /flowmeta to invoke it. Example prompts:

"I have paired-end FASTQ files in /data/raw, run shotgun metagenomics with Kraken2"
"Analyze my 16S amplicon data using DADA2, the FASTQ files are in /data/16S"
"My DADA2 run failed with Error matrix is NULL, what should I do?"

The skill guides Claude through environment verification, command construction, troubleshooting, and result interpretation.

Build & Publish

pip install build
python -m build --wheel
ls dist/

Documentation

Companion README (Chinese user guide): README.zh.md
HTML tutorial: docs/tutorial.html

Support

Dongqiang Zeng · interlaken@smu.edu.cn

GitHub: https://github.com/SkinMicrobe/FlowMeta
