bioinformatics_snakemake

A small, readable Snakemake migration of the paired-end RNA-seq pipeline originally written in Bash (see the sibling bioinformatics/ repository).

The goal of this first version is to be simple but functional: every rule maps one-to-one to a step in the legacy run_workflow.sh, so it is easy to compare the two and to grow the workflow incrementally.

Pipeline at a glance

                +---------------------+
                |  reference download |
                |  + HISAT2 index     |
                +----------+----------+
                           |
raw FASTQs --> FastQC      |
     |                     v
     +--> fastp -------> HISAT2 | samtools sort --> samtools index
            |                                              |
            v                                              v
       fastp report                            samtools stats + qualimap
                                                           |
                                                           v
                           featureCounts (all BAMs) -> gene_counts.tsv
                                                           |
                                                           v
                           MultiQC roll-ups (fastp / alignment / qualimap)

Project layout

bioinformatics_snakemake/
├── environment.yml        # conda env spec: snakemake + bioinformatics tools
├── logs/                  # gitignored: one log file per run (see Quickstart)
├── config/
│   └── config.yaml        # all editable knobs (paths, threads, tool args)
├── workflow/
│   ├── Snakefile          # entry point: includes + final targets (rule all)
│   └── rules/
│       ├── reference.smk  # download FASTA/GTF + build HISAT2 index
│       ├── qc.smk         # FastQC on raw reads + MultiQC roll-ups
│       ├── trim.smk       # fastp paired-end trimming
│       ├── align.smk      # HISAT2 + samtools sort/index/stats + qualimap
│       └── count.smk      # featureCounts -> gene x sample matrix
└── data/                  # gitignored; created on first run
    ├── input/
    │   ├── raw/           # YOUR INPUT: <sample>_1.fastq.gz / <sample>_2.fastq.gz
    │   └── reference/     # downloaded by the workflow
    ├── work/              # intermediate trimmed FASTQs + sorted BAMs
    └── results/           # final reports + gene_counts.tsv

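To see how the pieces fit together, here is a rough sketch of the shape such an entry-point Snakefile takes: load the config, include the rule files, and declare the final targets in rule all. The include paths follow the layout above; the target names (especially the MultiQC report path) are illustrative assumptions, not copied from the real file.

# Sketch only -- not the repository's actual Snakefile.
configfile: "config/config.yaml"

include: "rules/reference.smk"
include: "rules/qc.smk"
include: "rules/trim.smk"
include: "rules/align.smk"
include: "rules/count.smk"

rule all:
    input:
        "data/results/featurecounts/gene_counts.tsv",   # final gene x sample matrix
        "data/results/multiqc/multiqc_report.html",     # roll-up report (name assumed)
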
Requirements

Everything (Snakemake + the bioinformatics CLI tools) lives in a single conda environment defined by environment.yml. Conda is the de facto standard in bioinformatics and the path Snakemake itself recommends.

Step 1 — install Miniforge (skip if you already have conda or mamba)

We use Miniforge — it's a minimal conda installer that ships mamba (a faster drop-in for conda) and uses the conda-forge channel by default, which is what bioinformatics workflows want.

Check first whether you already have it:

command -v mamba || command -v conda

If that prints nothing, install Miniforge. On Linux x86_64:

curl -L -o /tmp/miniforge.sh \
    https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash /tmp/miniforge.sh -b -p "$HOME/miniforge3"
rm /tmp/miniforge.sh

# Make conda + mamba available in the current shell...
source "$HOME/miniforge3/etc/profile.d/conda.sh"
# ...and in every future shell.
conda init bash

Open a new shell (or exec bash) so the conda init changes take effect. Confirm mamba --version now works.

For macOS or different architectures, grab the matching installer from the Miniforge releases page.

Initialize mamba for mamba activate (one-time)

conda init bash wires conda into your shell; it does not install mamba's shell hook. If you run mamba activate without that hook, you will see errors like Shell not initialized or 'mamba' is running as a subprocess and can't modify the parent shell.

Either:

  • Use conda activate rnaseq after creating the environment (works with no extra setup), or

  • Run this once so mamba activate works the same way:

    mamba shell init --shell bash --root-prefix="$HOME/miniforge3"
    exec bash   # or open a new terminal

    Use --root-prefix="$HOME/miniforge3" (your Miniforge install path), not the default ~/.local/share/mamba that mamba may suggest in its error text — that would point mamba at a separate prefix and your rnaseq environment would not appear there.

Step 2 — create the workflow environment

From the repository root (the directory that contains environment.yml):

mamba env create -f environment.yml
mamba activate rnaseq         # or: conda activate rnaseq

(conda env create -f environment.yml works too; mamba is just faster.)

After pulling changes that touch environment.yml

mamba env update -f environment.yml --prune

Verifying the install

snakemake --version
multiqc --version
fastqc --version && fastp --version && hisat2 --version | head -1
samtools --version | head -1
featureCounts -v 2>&1 | head -1
qualimap --help 2>&1 | head -1   # only if qualimap is enabled

Quickstart

  1. Drop your paired-end FASTQs into data/input/raw/. They must be named <sample>_1.fastq.gz and <sample>_2.fastq.gz. Sample names are auto-detected from the _1.fastq.gz files (see the detection sketch after this list).

  2. (Optional) Edit config/config.yaml — change reference URLs, threading, tool extras, or toggle qualimap.

  3. Dry-run to inspect the DAG without executing anything:

    snakemake --cores 8 --dry-run

  4. Run for real:

    snakemake --cores 8

    On each real run (not --dry-run), the Snakefile opens a single file logs/snakemake_<UTC-timestamp>_<id>.log and every job's shell tees its stdout/stderr into that file (via shell.prefix), so tool errors and Snakemake-printed job output land in one place; parallel jobs may interleave lines. onsuccess / onerror append a footer, and on failure the footer also points at Snakemake's internal engine log path. The active log path is recorded in logs/.active_workflow_log, so avoid two concurrent runs from the same repo directory. A sketch of this prefix mechanism follows the list.

    Snakemake decides what to run by walking back from the targets in rule all and skipping any step whose output already exists and is newer than its input.

  5. Force a full rerun (ignore cached outputs):

    snakemake --cores 8 --forceall
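
The sample auto-detection from step 1 is normally a single glob_wildcards call; a rough sketch assuming the raw-read layout above (the repository's actual code may differ):

# Sketch: derive sample names from data/input/raw/<sample>_1.fastq.gz
SAMPLES = glob_wildcards("data/input/raw/{sample}_1.fastq.gz").sample

# Downstream targets can then be expanded per sample, e.g. the sorted BAMs:
# expand("data/work/bam/{sample}_sorted.bam", sample=SAMPLES)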

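The per-run log from step 4 boils down to a shell.prefix set once at the top of the Snakefile. Below is a simplified sketch of the idea; the file naming, the tee prefix, and the footers are illustrative, not the repository's exact code.

# Simplified sketch of the shared run log; not the repo's exact implementation.
import uuid
from datetime import datetime, timezone
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
RUN_LOG = f"logs/snakemake_{stamp}_{uuid.uuid4().hex[:6]}.log"
Path("logs/.active_workflow_log").write_text(RUN_LOG + "\n")

shell.executable("/bin/bash")                        # process substitution needs bash
shell.prefix(f"exec > >(tee -a {RUN_LOG}) 2>&1; ")   # every job tees stdout/stderr into the run log

onsuccess:
    with open(RUN_LOG, "a") as fh:
        fh.write("=== workflow finished successfully ===\n")

onerror:
    with open(RUN_LOG, "a") as fh:
        fh.write("=== workflow failed; check Snakemake's own log for the failing job ===\n")
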
Useful one-liners

# Print the DAG of jobs as Graphviz (pipe into `| dot -Tpng > dag.png`)
snakemake --cores 1 --dag

# Print just the rule graph (one node per rule, regardless of sample count)
snakemake --cores 1 --rulegraph

# Build a single intermediate file
snakemake --cores 4 data/work/bam/<sample>_sorted.bam

# Lint the Snakefile for common issues
snakemake --lint

What's intentionally NOT here yet

These are present in the legacy Bash pipeline but were left out to keep this first Snakemake version small and easy to read end-to-end:

  • Raw FASTQ download from Macrogen. The legacy download_raw_reads.R script handles parallel downloads, basic auth credentials, MD5 validation, and resume. That belongs in a separate, dedicated rule (or stays as a one-shot script) and can be added later without changing the analytical core.
  • Atomic-write (write to .tmp, then rename) pattern for partial outputs. Snakemake already deletes incomplete outputs of failed jobs by default, so the manual rename dance from the Bash version is not needed.
  • Resource-projection banner (detected cores, projected RAM per phase). Snakemake exposes this via --resources and per-rule resources: directives if you want to add it later (see the sketch after this list).
  • FIFO-based qualimap semaphore. Snakemake schedules concurrency for you based on --cores and per-rule threads:.
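
If you do want that resource awareness back, the Snakemake-native route is per-rule threads: and resources: plus matching CLI budgets, as in this sketch (the paths, index prefix, and numbers are placeholders; the real alignment rule in workflow/rules/align.smk will differ):

# Illustrative only: paths, index prefix, and numbers are placeholders.
rule hisat2_align_example:
    input:
        r1="data/work/trimmed/{sample}_1.fastq.gz",
        r2="data/work/trimmed/{sample}_2.fastq.gz",
    output:
        "data/work/bam/{sample}_sorted.bam",
    threads: 8                        # capped by --cores at runtime
    resources:
        mem_mb=8000                   # counted against --resources mem_mb=...
    shell:
        "hisat2 -p {threads} -x data/input/reference/index -1 {input.r1} -2 {input.r2} "
        "| samtools sort -@ {threads} -o {output} -"

With snakemake --cores 16 --resources mem_mb=32000, Snakemake only starts jobs while their combined threads and mem_mb stay within those budgets.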

Where to extend it

The natural progression for environment management goes:

  1. Now — single conda env (environment.yml). Simplest possible setup; one env, one activation, every rule sees every tool.
  2. Later — per-rule conda envs. Drop small files under workflow/envs/<tool>.yaml (e.g. fastp.yaml, hisat2.yaml), add conda: "envs/<tool>.yaml" to the matching rule, and run snakemake --use-conda. Each rule gets its own pinned environment that Snakemake materializes on demand. Strictly more reproducible than the single-env approach.
  3. Later still — containers. Add container: "docker://<image>" to each rule (or globally in the Snakefile) and run snakemake --use-singularity (or --use-apptainer). Combine with --use-conda for conda-inside-container reproducibility. A sketch showing both the conda: and container: hooks on one rule follows this list.
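
A rough sketch of what steps 2 and 3 look like on a single rule; the paths, env-file contents, and container image are placeholders, and conda:/container: paths resolve relative to the file that defines the rule:

# Illustrative rule showing the conda: and container: hooks side by side.
# workflow/envs/fastp.yaml would pin the tool, e.g.
#   channels: [conda-forge, bioconda]
#   dependencies: [fastp=0.23]
rule fastp_trim_example:
    input:
        r1="data/input/raw/{sample}_1.fastq.gz",
        r2="data/input/raw/{sample}_2.fastq.gz",
    output:
        r1="data/work/trimmed/{sample}_1.fastq.gz",
        r2="data/work/trimmed/{sample}_2.fastq.gz",
    conda:
        "envs/fastp.yaml"        # picked up only with --use-conda
    container:
        "docker://<image>"       # placeholder image; picked up only with --use-apptainer
    shell:
        "fastp -i {input.r1} -I {input.r2} -o {output.r1} -O {output.r2}"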

Other useful directions:

  • Differential expression in R: add a rule that calls a small scripts/deseq2.R, taking data/results/featurecounts/gene_counts.tsv as input (a sketch follows this list).
  • Cluster / cloud execution: Snakemake supports SLURM, Kubernetes, AWS Batch, etc. via profiles — drop a profile under ~/.config/snakemake/<name>/config.yaml and run snakemake --profile <name>.
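
For the DESeq2 direction, the new rule would look roughly like this; the R script, the sample sheet, and the output path are hypothetical, and only the counts file comes from the existing workflow:

# Hypothetical downstream rule: scripts/deseq2.R and the sample sheet do not exist yet.
rule deseq2:
    input:
        counts="data/results/featurecounts/gene_counts.tsv",
        samples="config/samples.tsv",                       # hypothetical condition sheet
    output:
        "data/results/deseq2/differential_expression.tsv",  # hypothetical output
    script:
        "scripts/deseq2.R"    # inside R, inputs/outputs appear as snakemake@input / snakemake@output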
