A small, readable Snakemake migration of the paired-end RNA-seq pipeline
originally written in Bash (see the sibling bioinformatics/ repository).
The goal of this first version is to be simple but functional: every
rule maps one-to-one to a step in the legacy run_workflow.sh, so it is
easy to compare the two and to grow the workflow incrementally.
```
                    +---------------------+
                    | reference download  |
                    |   + HISAT2 index    |
                    +----------+----------+
                               |
                               v
raw FASTQs --> FastQC
     |
     +-------> fastp ---> HISAT2 | samtools sort ---> samtools index
                 |                        |
                 v                        v
            fastp report       samtools stats + qualimap
                                          |
                                          v
                   featureCounts (all BAMs) -> gene_counts.tsv
                                          |
                                          v
                 MultiQC roll-ups (fastp / alignment / qualimap)
```
```
bioinformatics_snakemake/
├── environment.yml        # conda env spec: snakemake + bioinformatics tools
├── logs/                  # gitignored: one log file per run (see Quickstart)
├── config/
│   └── config.yaml        # all editable knobs (paths, threads, tool args)
├── workflow/
│   ├── Snakefile          # entry point: includes + final targets (rule all)
│   └── rules/
│       ├── reference.smk  # download FASTA/GTF + build HISAT2 index
│       ├── qc.smk         # FastQC on raw reads + MultiQC roll-ups
│       ├── trim.smk       # fastp paired-end trimming
│       ├── align.smk      # HISAT2 + samtools sort/index/stats + qualimap
│       └── count.smk      # featureCounts -> gene x sample matrix
└── data/                  # gitignored; created on first run
    ├── input/
    │   ├── raw/           # YOUR INPUT: <sample>_1.fastq.gz / <sample>_2.fastq.gz
    │   └── reference/     # downloaded by the workflow
    ├── work/              # intermediate trimmed FASTQs + sorted BAMs
    └── results/           # final reports + gene_counts.tsv
```
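To show how the entry point ties the rule files together, a Snakefile of the shape described above could look roughly like the sketch below. This is an illustration, not the actual `workflow/Snakefile`: the `glob_wildcards` sample detection and the exact target paths in `rule all` are assumptions.

```python
# Illustrative sketch only; not the actual workflow/Snakefile.
configfile: "config/config.yaml"

# Assumed sample auto-detection: one entry per *_1.fastq.gz in data/input/raw/.
SAMPLES = sorted(glob_wildcards("data/input/raw/{sample}_1.fastq.gz").sample)

include: "rules/reference.smk"
include: "rules/qc.smk"
include: "rules/trim.smk"
include: "rules/align.smk"
include: "rules/count.smk"

# Final targets: Snakemake walks back from these and runs whatever is missing or stale.
rule all:
    input:
        "data/results/featurecounts/gene_counts.tsv",
        expand("data/work/bam/{sample}_sorted.bam.bai", sample=SAMPLES),
        "data/results/multiqc/multiqc_report.html",   # assumed MultiQC output path
```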
Everything (Snakemake + the bioinformatics CLI tools) lives in a single
conda environment defined by `environment.yml`. Conda is the de facto
standard in bioinformatics and the path Snakemake itself recommends.

We use Miniforge: a minimal conda installer that ships mamba (a faster
drop-in replacement for conda) and uses the conda-forge channel by default,
which is what bioinformatics workflows want.
Check first whether you already have it:
```bash
command -v mamba || command -v conda
```

If both come back empty, install Miniforge. On Linux x86_64:

```bash
curl -L -o /tmp/miniforge.sh \
  https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash /tmp/miniforge.sh -b -p "$HOME/miniforge3"
rm /tmp/miniforge.sh

# Make conda + mamba available in the current shell...
source "$HOME/miniforge3/etc/profile.d/conda.sh"
# ...and in every future shell.
conda init bash
```

Open a new shell (or `exec bash`) so the `conda init` changes take
effect. Confirm `mamba --version` now works.
For macOS or different architectures, grab the matching installer from the Miniforge releases page.
`conda init bash` wires conda into your shell; it does not
install mamba's shell hook. If you run `mamba activate` without that
hook, you will see errors like `Shell not initialized` or
`'mamba' is running as a subprocess and can't modify the parent shell`.
Either:

- Use `conda activate rnaseq` after creating the environment (works with no extra setup), or
- Run this once so `mamba activate` works the same way:

  ```bash
  mamba shell init --shell bash --root-prefix="$HOME/miniforge3"
  exec bash   # or open a new terminal
  ```

Use `--root-prefix="$HOME/miniforge3"` (your Miniforge install path), not the default `~/.local/share/mamba` that mamba may suggest in its error text; that would point mamba at a separate prefix and your `rnaseq` environment would not appear there.
From the repository root (the directory that contains `environment.yml`):

```bash
mamba env create -f environment.yml
mamba activate rnaseq   # or: conda activate rnaseq
```

(`conda env create -f environment.yml` works too; mamba is just faster.)

If `environment.yml` changes later, update the existing environment in place:

```bash
mamba env update -f environment.yml --prune
```

Sanity-check the tools:

```bash
snakemake --version
multiqc --version
fastqc --version && fastp --version && hisat2 --version | head -1
samtools --version | head -1
featureCounts -v 2>&1 | head -1
qualimap --help 2>&1 | head -1   # only if qualimap is enabled
```
- Drop your paired-end FASTQs into `data/input/raw/`. They must be named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`. Sample names are auto-detected from the `_1.fastq.gz` files.
- (Optional) Edit `config/config.yaml` to change reference URLs, threading, tool extras, or toggle qualimap.
- Dry-run to inspect the DAG without executing anything:

  ```bash
  snakemake --cores 8 --dry-run
  ```
-
Run for real:
snakemake --cores 8
  On each real run (not `--dry-run`), the Snakefile opens a single file `logs/snakemake_<UTC-timestamp>_<id>.log`, and every job's shell tees its stdout/stderr into that file (via `shell.prefix`), so tool errors and Snakemake-printed job output land in one place. Parallel jobs may interleave lines. `onsuccess`/`onerror` append a footer; on failure, the footer also points at Snakemake's internal engine log path. The active log path is recorded in `logs/.active_workflow_log`, so avoid two concurrent runs from the same repo directory. A sketch of this logging pattern appears after this list.

  Snakemake decides what to run by walking back from the targets in `rule all` and skipping any step whose output already exists and is newer than its inputs.
- Force a full rerun (ignore cached outputs):

  ```bash
  snakemake --cores 8 --forceall
  ```
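For reference, the tee-into-one-log pattern described in the run step can be wired up near the top of a Snakefile roughly as below. This is a sketch, not the actual implementation: the variable names, timestamp format, and footer wording are assumptions, and the dry-run detection plus the `logs/.active_workflow_log` pointer file are omitted.

```python
# Illustrative sketch of the shared-run-log pattern; names and formats are assumptions.
import os
import uuid
from datetime import datetime, timezone

os.makedirs("logs", exist_ok=True)
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
RUN_LOG = f"logs/snakemake_{stamp}_{uuid.uuid4().hex[:6]}.log"

# Prefix every job's shell command so its stdout/stderr also stream into RUN_LOG
# (assumes bash; parallel jobs may interleave lines).
shell.prefix(f"exec > >(tee -a {RUN_LOG}) 2>&1; ")

onsuccess:
    with open(RUN_LOG, "a") as fh:
        fh.write("=== workflow finished successfully ===\n")

onerror:
    # `log` is Snakemake's own engine log path, available inside onerror.
    with open(RUN_LOG, "a") as fh:
        fh.write(f"=== workflow failed; see Snakemake engine log: {log} ===\n")
```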
Other handy invocations:

```bash
# Print the DAG of jobs as Graphviz (pipe into `| dot -Tpng > dag.png`)
snakemake --cores 1 --dag
# Print just the rule graph (one node per rule, regardless of sample count)
snakemake --cores 1 --rulegraph
# Build a single intermediate file
snakemake --cores 4 data/work/bam/<sample>_sorted.bam
# Lint the Snakefile for common issues
snakemake --lint
```

These are present in the legacy Bash pipeline but were left out to keep this first Snakemake version small and easy to read end-to-end:
- Raw FASTQ download from Macrogen. The legacy `download_raw_reads.R` script handles parallel downloads, basic auth credentials, MD5 validation, and resume. That belongs in a separate, dedicated rule (or stays as a one-shot script) and can be added later without changing the analytical core.
- Atomic-write `.tmp`-then-swap pattern for partial outputs. Snakemake already deletes incomplete outputs of failed jobs by default, so the manual rename dance from the Bash version is not needed.
- Resource-projection banner (detected cores, projected RAM per phase). Snakemake exposes this via `--resources` and per-rule `resources:` directives if you want to add it later (a rule-level sketch follows this list).
- FIFO-based qualimap semaphore. Snakemake schedules concurrency for you based on `--cores` and per-rule `threads:`.
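If you do add per-rule resource declarations later, they are ordinary `threads:` and `resources:` directives on a rule. The rule below is written purely for illustration; the rule name, paths, index prefix, and numbers are placeholders, not copies of `workflow/rules/align.smk`.

```python
# Placeholder rule to illustrate threads: and resources:; values and paths are assumptions.
rule hisat2_align_example:
    input:
        r1="data/work/trimmed/{sample}_1.trimmed.fastq.gz",
        r2="data/work/trimmed/{sample}_2.trimmed.fastq.gz",
    output:
        bam="data/work/bam/{sample}_sorted.bam",
    threads: 8                    # Snakemake caps the running total of threads at --cores
    resources:
        mem_mb=16000              # enforced only when invoked with --resources mem_mb=<total>
    shell:
        "hisat2 -p {threads} -x data/input/reference/hisat2_index/genome "
        "-1 {input.r1} -2 {input.r2} "
        "| samtools sort -@ {threads} -o {output.bam} -"
```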
The natural progression for environment management goes:

- Now: a single conda env (`environment.yml`). Simplest possible setup; one env, one activation, every rule sees every tool.
- Later: per-rule conda envs. Drop small files under `workflow/envs/<tool>.yaml` (e.g. `fastp.yaml`, `hisat2.yaml`), add `conda: "envs/<tool>.yaml"` to the matching rule, and run `snakemake --use-conda`. Each rule gets its own pinned environment that Snakemake materializes on demand. Strictly more reproducible than the single-env approach.
- Later still: containers. Add `container: "docker://<image>"` to each rule (or globally in the Snakefile) and run `snakemake --use-singularity` (or `--use-apptainer`). Combine with `--use-conda` for conda-inside-container reproducibility. A rule carrying both directives is sketched after this list.
Other useful directions:
- Differential expression in R: add a rule that calls a small `scripts/deseq2.R`, taking `data/results/featurecounts/gene_counts.tsv` as input (one possible shape is sketched below).
- Cluster / cloud execution: Snakemake supports SLURM, Kubernetes, AWS Batch, etc. via profiles. Drop a profile under `~/.config/snakemake/<name>/config.yaml` and run `snakemake --profile <name>`.
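As a sketch of the first bullet, a DE rule could hand the counts matrix to an R script through Snakemake's `script:` directive; inside the R script the paths arrive as `snakemake@input` and `snakemake@output`. Everything except the counts-matrix path (the sample sheet, env file, output path, and the script itself) is an assumption.

```python
# Hypothetical differential-expression rule; only the counts path comes from this README.
rule deseq2:
    input:
        counts="data/results/featurecounts/gene_counts.tsv",
        samples="config/samples.tsv",      # assumed sample/condition sheet
    output:
        "data/results/deseq2/deseq2_results.tsv",
    conda:
        "envs/deseq2.yaml"                 # hypothetical env with R + DESeq2
    script:
        "scripts/deseq2.R"                 # resolved relative to the Snakefile, i.e. workflow/scripts/
```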