Skip to content

Preprocessing

Heewon Seo edited this page Dec 12, 2025 · 1 revision

SLURM scripts

Step 01. Basecalling with Dorado

Due to limited access to GPU resources on the ARC (Advanced Research Computing) cluster at UCalgary, we configured MinKNOW to write a POD5 file every hour and submitted jobs requesting one GPU node per run. While users can feed all POD5 files together to Dorado, the queue wait time increases significantly when requesting more resources. Therefore, basecalling each POD5 file individually and subsequently merging all BAM files into a single output is more efficient. Specifically, the following script was used to perform POD5 basecalling:

#!/bin/bash
#SBATCH --job-name=Dorado
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --time=2:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-v100

PLATE="Plate10"
KIT_INFO="SQK-NBD-96"

POD5_FOLDER="$HOME/POD5/$PLATE"
BAM_FOLDER="$HOME/BAM/$PLATE"
mkdir -p $BAM_FOLDER

dorado basecaller sup $POD5_FOLDER/<INPUT>.pod5 --kit-name $KIT_INFO --barcode-both-ends > $BAM_FOLDER/<OUTPUT>.bam

Verify that the number of POD5 files matches the number of corresponding BAM files by running:

ls *.pod5 | wc -l
ls *.bam | wc -l

Step 02. Merge BAM files

Merging all BAM files can be performed in the /scratch directory to take advantage of local fast I/O, and the resulting merged BAM file can then be moved to the designated BAM folder:

#! /bin/bash
#SBATCH --job-name=Merge_BAMs
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=248GB
#SBATCH --time=24:00:00
#SBATCH --partition=bigmem

PLATE="Plate10"

BAM_FOLDER="$HOME/BAM/$PLATE"
SCRATCH=/scratch/$SLURM_JOBID

samtools merge $SCRATCH/Merged.bam $BAM_FOLDER/*.bam

mv $SCRATCH/Merged.bam $BAM_FOLDER/

Step 03. Demultiplex reads by barcode

After generating the Merged.bam file, demultiplexing reads by barcode can be performed using Dorado. In this step, using CPU resources minimizes queue wait time, which is advantageous given the limited GPU availability on ARC. We used the following script to submit the demultiplexing job:

#! /bin/bash
#SBATCH --job-name=Demux
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=124GB
#SBATCH --time=24:00:00
#SBATCH --partition=bigmem

PLATE="Plate10"

BAM_FOLDER="$HOME/BAM/$PLATE"
DEMUX_FOLDER="$HOME/DEMUX/$PLATE"
mkdir -p $DEMUX_FOLDER

dorado demux --output-dir $DEMUX_FOLDER --no-classify $BAM_FOLDER/Merged.bam

Step 04. Generate sequencing summary

The merged sequencing output Merged.bam can be summarized using Dorado. For this step, requesting CPU resources is more efficient and avoids extended GPU queue wait times. We used the following script to generate the summary:

#! /bin/bash
#SBATCH --job-name=BAM_Summary
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=256GB
#SBATCH --time=24:00:00
#SBATCH --partition=bigmem

PLATE="Plate10"

BAM_FOLDER="$HOME/BAM/$PLATE"
SUMMARY_FOLDER="$HOME/Summary"
mkdir -p $SUMMARY_FOLDER

dorado summary $BAM_FOLDER/Merged.bam > $SUMMARY_FOLDER/$PLATE.tsv

Step 05. Convert BAM to FASTQ

Each barcode-level BAM file should be converted to a FASTQ file, which is the required input format for most downstream tools. Since each barcode is processed independently, up to 96 jobs (for 96 barcodes) can be parallelized. The following script provides an example of a single job submission:

#!/bin/bash
#SBATCH --job-name=BAM2FQ_96
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2GB
#SBATCH --time=2:00:00
#SBATCH --partition=single

PLATE="Plate10"

DEMUX_FOLDER="$HOME/DEMUX/$PLATE"
FASTQ_FOLDER="$HOME/FASTQ/$PLATE"
mkdir -p $FASTQ_FOLDER

samtools fastq -T"*" $DEMUX_FOLDER/<INPUT>.bam | gzip > $FASTQ_FOLDER/<OUTPUT>.fastq.gz

Step 06. Summarize read quality using NanoPlot

Whereas Step 04 provides a summary of the sequencing output at the aggregate level, this section summarizes per-barcode statistics, including read length distributions, read quality, and other relevant metrics. Unclassified reads are excluded from this analysis. We used the following script:

#!/bin/bash
#SBATCH --job-name=NanoPlot
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=4:00:00
#SBATCH --partition=single

PLATE="Plate10"

FASTQ_FOLDER="$HOME/FASTQ/$PLATE"
STAT_FOLDER="$HOME/Stats/$PLATE"
mkdir -p $STAT_FOLDER

NanoPlot -t 4 --raw --fastq $FASTQ_FOLDER/<INPUT>.fastq.gz --no_static --tsv_stats --outdir $STAT_FOLDER/<OUTPUT>

Step 07. Evaluate per-base quality using FastQC

We used FastQC to assess per-base quality for each barcode-level FASTQ file. Because read-length distributions vary substantially across barcodes, it is more informative to generate FastQC results stratified by read length. Therefore, each FASTQ file was partitioned into five read-length bins: (1) 0–99 bp, (2) 100–999 bp, (3) 1000–9999 bp, (4) 10000–99999 bp, and (5) >100000 bp. This binning enables more accurate visualization of quality trends across read-length ranges. The following script was used to run FastQC:

#!/bin/bash
#SBATCH --job-name=FastQC
#SBATCH --mem=12G
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --partition=single
#SBATCH --time=12:00:00

PLATE="Plate10"

FASTQ_FOLDER="$HOME/FASTQ/$PLATE"
FASTQC_FOLDER="$HOME/FASTQC/$PLATE"
mkdir -p $FASTQC_FOLDER/<OUTPUT>

fastqc -o $FASTQC_FOLDER/<OUTPUT> --extract -f fastq $FASTQ_FOLDER/<INPUT>.fastq.gz

Step 08. Aggregate reports with MultiQC

Finally, we summarized the FastQC outputs using MultiQC to enable comparative visualization of per-base quality across batches, barcodes, and/or read-length bins. For this purpose, individual .folders files were generated to specify sets of FastQC result directories for targeted comparison. The following bash script was used to enumerate all FastQC directories:

#!/bin/bash

find "$HOME"/FASTQC/Plate?? -type d -name "barcode*_len_*" -exec realpath {} \; | sort > "$HOME/FASTQC/all_dirs.txt"
cat "$HOME/FASTQC/all_dirs.txt" | grep "_fastqc" > "$HOME/FASTQC/fastqc.folders"
rm "$HOME/FASTQC/all_dirs.txt"

Users may then subset directories of interest and create a .folders file to supply to MultiQC. The following script was used to run MultiQC:

#!/bin/bash
#SBATCH --mem=12G
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=single
#SBATCH --time=4:00:00

PLATE="Plate10"

FASTQC_FOLDER="$HOME/FASTQC/$PLATE"
MULTIQC_FOLDER="$HOME/MULTIQC/$PLATE"
mkdir -p $MULTIQC_FOLDER

multiqc -o $MULTIQC_FOLDER/$PLATE -n $PLATE -z --file-list $FASTQC_FOLDER/$PLATE.folders

Note

The variable $HOME should be predefined, or users may explicitly define it within each script to specify the working directory. For example, it can be exported as follows:

HOME=/bulk/your_storage/project_name/

We expect users to parallelize these scripts, where <INPUT> and <OUTPUT> represent either each POD5 file (generated hourly) or each barcode-specific sample. We also benchmarked approximate runtimes using 290 POD5 files under the configurations described above. These values are provided for reference only, as actual runtimes will depend on queue load and resource availability.

  • Approximate runtimes:
    • Step 01 (basecalling): 2 days
    • Step 02 (merge): 5 hours 37 minutes
    • Step 03 (demultiplexing): 1 hour 22 minutes
    • Step 05 (BAM-to-FASTQ conversion): 1 hour 03 minutes
    • Steps involving summary generation are completed in a comparatively short time (typically five minutes to less than one hour).

Clone this wiki locally