A small, readable Snakemake migration of the paired-end RNA-seq pipeline
originally written in Bash (see the sibling bioinformatics/ repository).
The goal of this first version is to be simple but functional: every
rule maps one-to-one to a step in the legacy run_workflow.sh, so it is
easy to compare the two and to grow the workflow incrementally.
```
                    +---------------------+
                    | reference download  |
                    |   + HISAT2 index    |
                    +----------+----------+
                               |
                               v
raw FASTQs --> FastQC
     |
     +-------> fastp ---> HISAT2 | samtools sort ---> samtools index
                 |                        |
                 v                        v
            fastp report       samtools stats + qualimap
                                          |
                                          v
                   featureCounts (all BAMs) -> gene_counts.tsv
                                          |
                                          v
                 MultiQC roll-ups (fastp / alignment / qualimap)
```
```
bioinformatics_snakemake/
├── environment.yml        # conda env spec: snakemake + bioinformatics tools
├── logs/                  # gitignored: one log file per run (see Quickstart)
├── config/
│   └── config.yaml        # all editable knobs (paths, threads, tool args)
├── workflow/
│   ├── Snakefile          # entry point: includes + final targets (rule all)
│   └── rules/
│       ├── reference.smk  # download FASTA/GTF + build HISAT2 index
│       ├── qc.smk         # FastQC on raw reads + MultiQC roll-ups
│       ├── trim.smk       # fastp paired-end trimming
│       ├── align.smk      # HISAT2 + samtools sort/index/stats + qualimap
│       └── count.smk      # featureCounts -> gene x sample matrix
└── data/                  # gitignored; created on first run
    ├── input/
    │   ├── raw/           # YOUR INPUT: <sample>_1.fastq.gz / <sample>_2.fastq.gz
    │   └── reference/     # downloaded by the workflow
    ├── work/              # intermediate trimmed FASTQs + sorted BAMs
    └── results/           # final reports + gene_counts.tsv
```
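To show how the entry point ties the rule files together, a Snakefile of the shape described above could look roughly like the sketch below. This is an illustration, not the actual `workflow/Snakefile`: the `glob_wildcards` sample detection and the exact target paths in `rule all` are assumptions.

```python
# Illustrative sketch only; not the actual workflow/Snakefile.
configfile: "config/config.yaml"

# Assumed sample auto-detection: one entry per *_1.fastq.gz in data/input/raw/.
SAMPLES = sorted(glob_wildcards("data/input/raw/{sample}_1.fastq.gz").sample)

include: "rules/reference.smk"
include: "rules/qc.smk"
include: "rules/trim.smk"
include: "rules/align.smk"
include: "rules/count.smk"

# Final targets: Snakemake walks back from these and runs whatever is missing or stale.
rule all:
    input:
        "data/results/featurecounts/gene_counts.tsv",
        expand("data/work/bam/{sample}_sorted.bam.bai", sample=SAMPLES),
        "data/results/multiqc/multiqc_report.html",   # assumed MultiQC output path
```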
Everything (Snakemake + the bioinformatics CLI tools) lives in a single
conda environment defined by `environment.yml`. Conda is the de facto
standard in bioinformatics and the path Snakemake itself recommends.

We use Miniforge: a minimal conda installer that ships mamba (a faster
drop-in replacement for conda) and uses the conda-forge channel by default,
which is what bioinformatics workflows want.
Check first whether you already have it:
```bash
command -v mamba || command -v conda
```

If both come back empty, install Miniforge. On Linux x86_64:

```bash
curl -L -o /tmp/miniforge.sh \
  https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash /tmp/miniforge.sh -b -p "$HOME/miniforge3"
rm /tmp/miniforge.sh

# Make conda + mamba available in the current shell...
source "$HOME/miniforge3/etc/profile.d/conda.sh"
# ...and in every future shell.
conda init bash
```

Open a new shell (or `exec bash`) so the `conda init` changes take
effect. Confirm `mamba --version` now works.
For macOS or different architectures, grab the matching installer from the Miniforge releases page.
`conda init bash` wires conda into your shell; it does not
install mamba's shell hook. If you run `mamba activate` without that
hook, you will see errors like `Shell not initialized` or
`'mamba' is running as a subprocess and can't modify the parent shell`.
Either:

- Use `conda activate rnaseq` after creating the environment (works with no extra setup), or
- Run this once so `mamba activate` works the same way:

  ```bash
  mamba shell init --shell bash --root-prefix="$HOME/miniforge3"
  exec bash   # or open a new terminal
  ```

Use `--root-prefix="$HOME/miniforge3"` (your Miniforge install path), not the default `~/.local/share/mamba` that mamba may suggest in its error text; that would point mamba at a separate prefix and your `rnaseq` environment would not appear there.
From the repository root (the directory that contains `environment.yml`):

```bash
mamba env create -f environment.yml
mamba activate rnaseq   # or: conda activate rnaseq
```

(`conda env create -f environment.yml` works too; mamba is just faster.)

If `environment.yml` changes later, update the existing environment in place:

```bash
mamba env update -f environment.yml --prune
```

Sanity-check the tools:

```bash
snakemake --version
multiqc --version
fastqc --version && fastp --version && hisat2 --version | head -1
samtools --version | head -1
featureCounts -v 2>&1 | head -1
qualimap --help 2>&1 | head -1   # only if qualimap is enabled
```
- Drop your paired-end FASTQs into `data/input/raw/`. They must be named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`. Sample names are auto-detected from the `_1.fastq.gz` files.
- (Optional) Edit `config/config.yaml` to change reference URLs, threading, tool extras, or toggle qualimap.
- Dry-run to inspect the DAG without executing anything:

  ```bash
  snakemake --cores 8 --dry-run
  ```
-
Run for real:
snakemake --cores 8
  On each real run (not `--dry-run`), the Snakefile opens a single file `logs/snakemake_<UTC-timestamp>_<id>.log`, and every job's shell tees its stdout/stderr into that file (via `shell.prefix`), so tool errors and Snakemake-printed job output land in one place. Parallel jobs may interleave lines. `onsuccess`/`onerror` append a footer; on failure, the footer also points at Snakemake's internal engine log path. The active log path is recorded in `logs/.active_workflow_log`, so avoid two concurrent runs from the same repo directory. A sketch of this logging pattern appears after this list.

  Snakemake decides what to run by walking back from the targets in `rule all` and skipping any step whose output already exists and is newer than its inputs.
- Force a full rerun (ignore cached outputs):

  ```bash
  snakemake --cores 8 --forceall
  ```
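For reference, the tee-into-one-log pattern described in the run step can be wired up near the top of a Snakefile roughly as below. This is a sketch, not the actual implementation: the variable names, timestamp format, and footer wording are assumptions, and the dry-run detection plus the `logs/.active_workflow_log` pointer file are omitted.

```python
# Illustrative sketch of the shared-run-log pattern; names and formats are assumptions.
import os
import uuid
from datetime import datetime, timezone

os.makedirs("logs", exist_ok=True)
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
RUN_LOG = f"logs/snakemake_{stamp}_{uuid.uuid4().hex[:6]}.log"

# Prefix every job's shell command so its stdout/stderr also stream into RUN_LOG
# (assumes bash; parallel jobs may interleave lines).
shell.prefix(f"exec > >(tee -a {RUN_LOG}) 2>&1; ")

onsuccess:
    with open(RUN_LOG, "a") as fh:
        fh.write("=== workflow finished successfully ===\n")

onerror:
    # `log` is Snakemake's own engine log path, available inside onerror.
    with open(RUN_LOG, "a") as fh:
        fh.write(f"=== workflow failed; see Snakemake engine log: {log} ===\n")
```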
Other handy invocations:

```bash
# Print the DAG of jobs as Graphviz (pipe into `| dot -Tpng > dag.png`)
snakemake --cores 1 --dag
# Print just the rule graph (one node per rule, regardless of sample count)
snakemake --cores 1 --rulegraph
# Build a single intermediate file
snakemake --cores 4 data/work/bam/<sample>_sorted.bam
# Lint the Snakefile for common issues
snakemake --lint
```

These are present in the legacy Bash pipeline but were left out to keep this first Snakemake version small and easy to read end-to-end:
- Raw FASTQ download from Macrogen. The legacy `download_raw_reads.R` script handles parallel downloads, basic auth credentials, MD5 validation, and resume. That belongs in a separate, dedicated rule (or stays as a one-shot script) and can be added later without changing the analytical core.
- Atomic-write `.tmp`-then-swap pattern for partial outputs. Snakemake already deletes incomplete outputs of failed jobs by default, so the manual rename dance from the Bash version is not needed.
- Resource-projection banner (detected cores, projected RAM per phase). Snakemake exposes this via `--resources` and per-rule `resources:` directives if you want to add it later (a rule-level sketch follows this list).
- FIFO-based qualimap semaphore. Snakemake schedules concurrency for you based on `--cores` and per-rule `threads:`.
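If you do add per-rule resource declarations later, they are ordinary `threads:` and `resources:` directives on a rule. The rule below is written purely for illustration; the rule name, paths, index prefix, and numbers are placeholders, not copies of `workflow/rules/align.smk`.

```python
# Placeholder rule to illustrate threads: and resources:; values and paths are assumptions.
rule hisat2_align_example:
    input:
        r1="data/work/trimmed/{sample}_1.trimmed.fastq.gz",
        r2="data/work/trimmed/{sample}_2.trimmed.fastq.gz",
    output:
        bam="data/work/bam/{sample}_sorted.bam",
    threads: 8                    # Snakemake caps the running total of threads at --cores
    resources:
        mem_mb=16000              # enforced only when invoked with --resources mem_mb=<total>
    shell:
        "hisat2 -p {threads} -x data/input/reference/hisat2_index/genome "
        "-1 {input.r1} -2 {input.r2} "
        "| samtools sort -@ {threads} -o {output.bam} -"
```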
The natural progression for environment management goes:

- Now: a single conda env (`environment.yml`). Simplest possible setup; one env, one activation, every rule sees every tool.
- Later: per-rule conda envs. Drop small files under `workflow/envs/<tool>.yaml` (e.g. `fastp.yaml`, `hisat2.yaml`), add `conda: "envs/<tool>.yaml"` to the matching rule, and run `snakemake --use-conda`. Each rule gets its own pinned environment that Snakemake materializes on demand. Strictly more reproducible than the single-env approach.
- Later still: containers. Add `container: "docker://<image>"` to each rule (or globally in the Snakefile) and run `snakemake --use-singularity` (or `--use-apptainer`). Combine with `--use-conda` for conda-inside-container reproducibility. A rule carrying both directives is sketched after this list.
Other useful directions:
- Differential expression in R: add a rule that calls a small `scripts/deseq2.R`, taking `data/results/featurecounts/gene_counts.tsv` as input (one possible shape is sketched below).
- Cluster / cloud execution: Snakemake supports SLURM, Kubernetes, AWS Batch, etc. via profiles. Drop a profile under `~/.config/snakemake/<name>/config.yaml` and run `snakemake --profile <name>`.
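As a sketch of the first bullet, a DE rule could hand the counts matrix to an R script through Snakemake's `script:` directive; inside the R script the paths arrive as `snakemake@input` and `snakemake@output`. Everything except the counts-matrix path (the sample sheet, env file, output path, and the script itself) is an assumption.

```python
# Hypothetical differential-expression rule; only the counts path comes from this README.
rule deseq2:
    input:
        counts="data/results/featurecounts/gene_counts.tsv",
        samples="config/samples.tsv",      # assumed sample/condition sheet
    output:
        "data/results/deseq2/deseq2_results.tsv",
    conda:
        "envs/deseq2.yaml"                 # hypothetical env with R + DESeq2
    script:
        "scripts/deseq2.R"                 # resolved relative to the Snakefile, i.e. workflow/scripts/
```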