FlowMeta packages two end-to-end bioinformatics workflows into a single Python package that can be deployed on any Linux/WSL environment:

- flowmeta base (shotgun metagenomics): 10-step pipeline covering fastp → Bowtie2 → Kraken2/Bracken → host-taxid filtering → OTU/MPA matrix merge.
- flowmeta amplicon (16S amplicon): 5-step pipeline covering QIIME2 import → QC → denoise (DADA2/VSEARCH) → taxonomy classification → export.

Each processing phase writes checkpoint flags so you can resume or re-run individual stages with confidence.
- End-to-end orchestration for both shotgun metagenomic and 16S amplicon workflows.
- Auto-detection of paired-end vs single-end reads with optimal trunc_len defaults.
- Supports both DADA2 (ASV) and VSEARCH (97% OTU) denoising for amplicon data.
- Smart checkpointing (.task.complete per sample / stepN.done per step) for incremental runs.
- Shared-memory (optional) caching for Kraken2 databases.
- Configurable project prefix applied to merged outputs.
- Detailed per-step logs stored alongside generated data.
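
The checkpointing idea can be pictured with a minimal Python sketch. This is not FlowMeta's actual implementation: the helper names `step_marker` and `run_step`, and the assumption that `stepN.done` sits directly in the output directory, are hypothetical. Only the marker naming pattern and the notion of a force override come from this README.

```python
from pathlib import Path
from typing import Callable


def step_marker(output_dir: Path, step: int) -> Path:
    # Assumed location: the marker sits directly in the output directory.
    return output_dir / f"step{step}.done"


def run_step(output_dir: Path, step: int, work: Callable[[], None],
             force: bool = False) -> bool:
    """Run `work` unless its marker exists; return True if work was executed."""
    marker = step_marker(output_dir, step)
    if marker.exists() and not force:
        return False  # checkpoint found, skip this step
    work()
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()  # reached only if work() returned without raising
    return True
```

Because the marker is written only after the work finishes, an interrupted step leaves no marker behind and is picked up again on the next run.
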

conda env create -f environment.yml
conda activate flowmeta
pip install flowmeta

# Alternative: create the environment manually
conda create -n flowmeta python=3.9 -y
conda activate flowmeta
# Shotgun metagenomics tools (flowmeta base)
conda install -c bioconda fastp bowtie2 samtools kraken2 bracken -y
conda install -c conda-forge scipy pandas pigz seqkit -y
# 16S amplicon tools — QIIME2 + all plugins (flowmeta amplicon)
# Note: q2-dada2 automatically pulls in R (4.3.x) and the dada2 R package
conda install -c qiime2 -c conda-forge -c bioconda \
qiime2 q2cli q2-types q2-quality-filter q2-dada2 \
q2-feature-classifier q2-vsearch q2-taxa biom-format -y
# Install flowmeta
pip install flowmeta

DADA2 requires R packages (dada2, GenomeInfoDbData) that may not install correctly via conda. Verify:

R -e 'library(dada2); cat("dada2 OK\n")'

If you see Error in library(dada2) or DADA2 fails with "Error matrix is NULL" at runtime:

# Fix: reinstall GenomeInfoDbData from Bioconductor
R -e 'if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org"); BiocManager::install("GenomeInfoDbData", ask=FALSE, update=FALSE)'

flowmeta -h

| Component | Version |
|---|---|
| Python | 3.9.x |
| R | 4.3.3 |
| QIIME2 plugins (q2-*) | 2024.5.0 |
| DADA2 (R) | 1.30.0 |
| Rcpp / RcppParallel | 1.1.0 / 5.1.9 |
| GenomeInfoDbData | 1.2.11 |

External executables for shotgun mode: fastp, bowtie2, samtools, kraken2, bracken, pigz, seqkit.

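
Before a long run it can help to confirm that these executables are on PATH. The following is a small sketch using only the Python standard library; `SHOTGUN_TOOLS` and `missing_tools` are illustrative names, not part of FlowMeta.

```python
import shutil

# Executable names taken from the list above.
SHOTGUN_TOOLS = ["fastp", "bowtie2", "samtools", "kraken2", "bracken", "pigz", "seqkit"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH, in input order."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools(SHOTGUN_TOOLS)` inside the activated conda environment should return an empty list when everything is installed. The `which` parameter exists only so the lookup can be swapped out in tests.
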
# Subcommand mode (recommended)
flowmeta base \
--input_dir /mnt/data/01-raw \
--output_dir /mnt/data/flowmeta-out \
--db_bowtie2 /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as \
--db_kraken /mnt/db/k2ppf \
--threads 8 \
--project_prefix SMOOTH-
# Legacy mode (backward compatible)
flowmeta_base \
--input_dir /mnt/data/01-raw \
--output_dir /mnt/data/flowmeta-out \
--db_bowtie2 /mnt/db/GRCh38_noalt_as/GRCh38_noalt_as \
--db_kraken /mnt/db/k2ppf \
--threads 8 \
--project_prefix SMOOTH-

# DADA2 denoising (auto-detects paired/single-end, auto trunc_len)
flowmeta amplicon \
--input_dir /mnt/data/16S-raw \
--output_dir /mnt/data/16S-out \
--method dada2 \
--silva_db /mnt/db/silva16s/silva138_classifier.qza \
--threads 16
# VSEARCH denoising (OTU clustering at 97%)
flowmeta amplicon \
--input_dir /mnt/data/16S-raw \
--output_dir /mnt/data/16S-out \
--method vsearch \
--silva_db /mnt/db/silva16s/silva138_classifier.qza \
--threads 16
# Override trunc_len manually (0 = auto: 140 for single-end, 250 for paired-end)
flowmeta amplicon \
--input_dir /mnt/data/16S-raw \
--output_dir /mnt/data/16S-out \
--method dada2 \
--silva_db /mnt/db/silva16s/silva138_classifier.qza \
--threads 16 \
--trunc_len 200
# Standalone command (also available)
flowmeta_amplicon \
--input_dir /mnt/data/16S-raw \
--output_dir /mnt/data/16S-out \
--method dada2 \
--silva_db /mnt/db/silva16s/silva138_classifier.qza \
--threads 16

Options for flowmeta base:

| Flag | Description |
|---|---|
| --input_dir | Directory containing raw FASTQ files (paired _1.fastq.gz/_2.fastq.gz). |
| --output_dir | Target workspace; creates 02-qc … 09-mpa. |
| --db_bowtie2 | Path prefix of the host Bowtie2 index. |
| --db_kraken | Kraken2 database directory (must include hash.k2d, opts.k2d, taxo.k2d). |
| --threads | Worker threads per sample (default 32). |
| --batch | Samples processed in parallel for fastp/Kraken (default 2). |
| --min_count | Bracken minimum count during host-taxid filtering (default 4). |
| --skip_integrity_checks | Skip FASTQ integrity checks. |
| --check_result | Enable integrity checks (Steps 2 and 4). |
| --enable_bracken_step7 | Run Bracken during Step 7 (default: Kraken2 only). |
| --project_prefix | Label prepended to merged outputs (e.g. SMOOTH-). |
| --force | Recompute even if checkpoint markers exist. |
| --step | Resume from step N (1–10). |
| --step_only | With --step, run only that step and stop. |
| --skip_host_extract | Skip samtools host FASTQ extraction (Step 5). |
| --no_shm / --shm_path | Control Kraken2 shared-memory caching. |
| --se | Treat input as single-end reads. |
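
To check an input directory against the paired naming convention (_1.fastq.gz/_2.fastq.gz) before launching, a sketch like the one below may be useful. It is not how FlowMeta discovers samples internally; `group_fastqs` and `unpaired_samples` are hypothetical helpers, and the treatment of files without a _1/_2 suffix is an assumption.

```python
from collections import defaultdict
from pathlib import Path


def group_fastqs(filenames):
    """Map sample name -> sorted FASTQ file names, using the _1/_2 suffix convention."""
    samples = defaultdict(list)
    for name in filenames:
        base = Path(name).name
        if not base.endswith(".fastq.gz"):
            continue  # ignore anything that is not a gzipped FASTQ
        stem = base[: -len(".fastq.gz")]
        if stem.endswith(("_1", "_2")):
            stem = stem[:-2]  # strip the mate suffix to get the sample name
        samples[stem].append(base)
    return {sample: sorted(files) for sample, files in samples.items()}


def unpaired_samples(groups):
    """Samples that do not have exactly two mate files."""
    return sorted(s for s, files in groups.items() if len(files) != 2)
```

Passing the names from `os.listdir(input_dir)` to `group_fastqs` and then inspecting `unpaired_samples` gives a quick list of samples that would need --se or a missing mate file.
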

Options for flowmeta amplicon:

| Flag | Description |
|---|---|
| --input_dir | Directory containing FASTQ files (paired or single-end, auto-detected). |
| --output_dir | Target workspace; creates 02-qc … 06-export. |
| --method | Denoising method: dada2 or vsearch (required). |
| --silva_db | Path to SILVA classifier QZA file (required). |
| --threads | Number of threads (default 16). |
| --trunc_len | DADA2 truncation length. 0 = auto: 140 for single-end, 250 for paired-end (default 0). |
| --n_reads_learn | Number of reads for DADA2 error model training (default 1000000). |
| --force | Ignore step completion markers and rerun. |
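
The --trunc_len rule (0 means auto, 140 for single-end, 250 for paired-end) can be written out as a tiny function. This is only a sketch of the documented behaviour; `resolve_trunc_len` is a hypothetical name, and rejecting negative values is an assumption.

```python
def resolve_trunc_len(trunc_len: int, paired_end: bool) -> int:
    """Return the truncation length to use for DADA2."""
    if trunc_len < 0:
        # Assumption: negative lengths are treated as invalid input.
        raise ValueError("trunc_len must be >= 0")
    if trunc_len == 0:
        # Auto mode, defaults as documented in the table above.
        return 250 if paired_end else 140
    return trunc_len  # an explicit value always wins
```
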

Pipeline steps for flowmeta base:

| Step | Purpose |
|---|---|
| 1 | fastp trimming and QC. |
| 2 | fastp integrity verification (requires --check_result). |
| 3 | Bowtie2 host depletion. |
| 4 | Host-removed FASTQ integrity check (requires --check_result). |
| 5 | Optional samtools host-read export. |
| 6 | Stage Kraken2 DB to shared memory. |
| 7 | Kraken2/Bracken classification. |
| 8 | Kraken report validation. |
| 9 | Remove host taxa + rerun Bracken. |
| 10 | Merge OTU/MPA/Bracken matrices. |
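
How --step and --step_only select from these ten steps can be sketched as follows. The function name `steps_to_run` is hypothetical, and raising an error when --step_only is given without --step is an assumption. The sketch also ignores flags that skip individual steps (for example --skip_host_extract).

```python
def steps_to_run(step=None, step_only=False, total=10):
    """Return the ordered list of step numbers a run would cover."""
    if step is None:
        if step_only:
            # Assumption: --step_only is meaningless without --step.
            raise ValueError("--step_only requires --step")
        return list(range(1, total + 1))
    if not 1 <= step <= total:
        raise ValueError(f"step must be between 1 and {total}")
    if step_only:
        return [step]  # run just this step and stop
    return list(range(step, total + 1))  # resume from step N to the end
```

For instance, `--step 7` corresponds to steps 7 through 10, while `--step 7 --step_only` corresponds to step 7 alone.
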

Pipeline steps for flowmeta amplicon:

| Step | DADA2 | VSEARCH | Output |
|---|---|---|---|
| 1 | Import FASTQ into QIIME2 artifact | Import FASTQ into QIIME2 artifact | 02-qc/01-demux.qza |
| 2 | (skipped — DADA2 has built-in QC) | Quality filtering (q-score) | 03-trim/03-trimmed-seqs.qza |
| 3 | DADA2 denoise-paired/single | VSEARCH dereplicate + 97% OTU | 04-denoise/06-table.qza, 07-rep-seqs.qza |
| 4 | Taxonomy classification (sklearn) | Taxonomy classification (sklearn) | 05-taxonomy/taxonomy.qza |
| 5 | Export OTU table + taxonomy TSV | Export OTU table + taxonomy TSV | 06-export/otu_table.tsv, taxonomy.tsv |

Note (DADA2): Input FASTQ files must have real, variable quality scores. Data with uniform (flat) quality scores, such as some SRA downloads where every base is Q=30, will cause DADA2's error rate estimation to fail. VSEARCH is not affected by this limitation.
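
A quick way to spot flat quality scores before choosing DADA2 is to sample the quality lines of a FASTQ file. The sketch below is not part of FlowMeta; `distinct_quality_chars` and `looks_flat` are hypothetical helpers, and the code assumes standard four-line FASTQ records in a gzip-compressed file.

```python
import gzip
from itertools import islice


def distinct_quality_chars(fastq_gz_path, max_reads=10000):
    """Collect the distinct quality characters in the first `max_reads` records."""
    seen = set()
    with gzip.open(fastq_gz_path, "rt") as handle:
        # In a four-line FASTQ record the quality string is the 4th line,
        # i.e. zero-based line indexes 3, 7, 11, ...
        for line in islice(handle, 3, max_reads * 4, 4):
            seen.update(line.rstrip("\r\n"))
    return seen


def looks_flat(fastq_gz_path, max_reads=10000):
    """True if every sampled base carries the same quality character."""
    return len(distinct_quality_chars(fastq_gz_path, max_reads)) == 1
```

If `looks_flat` returns True for your files, `--method vsearch` is the safer choice according to the note above.
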

Output layout for flowmeta base:

02-qc/ # fastp outputs + integrity checks
03-hr/ # host-removed FASTQ
04-bam/ # Bowtie2 BAM/indices
05-host/ # optional host reads (samtools)
06-ku/ # Kraken2 reports/outputs
07-bracken/ # Bracken abundance tables
08-ku2/ # Host-filtered rerun (reports + diversity)
09-mpa/ # Merged OTU/MPA/summary matrices

Output layout for flowmeta amplicon:

02-qc/ # QIIME2 demux artifact + manifest
03-trim/ # Quality-filtered sequences
04-denoise/ # Feature table + representative sequences
05-taxonomy/ # Taxonomy classification
06-export/ # otu_table.tsv + taxonomy.tsv

- Download a pre-built Kraken2 database: https://benlangmead.github.io/aws-indexes/k2
- Extract it and point --db_kraken to the directory containing hash.k2d, opts.k2d, taxo.k2d.

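
After extracting the archive, the directory can be checked for the three required files. This is a minimal sketch; `missing_kraken_files` is a hypothetical helper, not a FlowMeta function.

```python
from pathlib import Path

# File names required by --db_kraken, as listed above.
REQUIRED_K2_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_kraken_files(db_dir):
    """Return the required Kraken2 files that are absent from `db_dir`."""
    db = Path(db_dir)
    return [name for name in REQUIRED_K2_FILES if not (db / name).is_file()]
```

An empty list from `missing_kraken_files("/mnt/db/k2ppf")` means the directory is ready to pass to --db_kraken.
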
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz
gunzip GCA_000001405.28_GRCh38.p13_genomic.fna.gz
seqkit grep -rvp 'alt|PATCH' GCA_000001405.28_GRCh38.p13_genomic.fna > GRCh38_noalt.fna
mkdir -p GRCh38_noalt_as  # make sure the index directory exists
bowtie2-build GRCh38_noalt.fna GRCh38_noalt_as/GRCh38_noalt_as

- Download SILVA reference sequences and taxonomy from https://www.arb-silva.de/
- Train a classifier compatible with your sklearn version:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads silva-138-99-seqs.qza \
--i-reference-taxonomy silva-138-99-tax.qza \
--o-classifier silva138_classifier.qza

FlowMeta ships with a Claude Code skill that lets you run both pipelines through natural language. After installing the package:

flowmeta install-skill

This copies the skill to ~/.claude/skills/flowmeta/. Then in Claude Code, type /flowmeta to invoke it. Example prompts:
- "I have paired-end FASTQ files in /data/raw, run shotgun metagenomics with Kraken2"
- "Analyze my 16S amplicon data using DADA2, the FASTQ files are in /data/16S"
- "My DADA2 run failed with Error matrix is NULL, what should I do?"

The skill guides Claude through environment verification, command construction, troubleshooting, and result interpretation.

pip install build
python -m build --wheel
ls dist/

- Companion README (Chinese-language user guide): README.zh.md
- HTML tutorial: docs/tutorial.html

Dongqiang Zeng · interlaken@smu.edu.cn
