-
Notifications
You must be signed in to change notification settings - Fork 0
Preprocessing
Due to limited access to GPU resources on the ARC (Advanced Research Computing) cluster at UCalgary, we configured MinKNOW to write a POD5 file every hour and submitted jobs requesting one GPU node per run. While users can feed all POD5 files together to Dorado, the queue wait time increases significantly when requesting more resources. Therefore, basecalling each POD5 file individually and subsequently merging all BAM files into a single output is more efficient. Specifically, the following script was used to perform POD5 basecalling:
#!/bin/bash
#SBATCH --job-name=Dorado
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=32G
#SBATCH --time=2:00:00
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu-v100
PLATE="Plate10"
KIT_INFO="SQK-NBD-96"
POD5_FOLDER="$HOME/POD5/$PLATE"
BAM_FOLDER="$HOME/BAM/$PLATE"
mkdir -p $BAM_FOLDER
dorado basecaller sup $POD5_FOLDER/<INPUT>.pod5 --kit-name $KIT_INFO --barcode-both-ends > $BAM_FOLDER/<OUTPUT>.bamVerify that the number of POD5 files matches the number of corresponding BAM files by running:
ls *.pod5 | wc -l
ls *.bam | wc -lMerging all BAM files can be performed in the /scratch directory to take advantage of local fast I/O, and the resulting merged BAM file can then be moved to the designated BAM folder:
#! /bin/bash
#SBATCH --job-name=Merge_BAMs
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=248GB
#SBATCH --time=24:00:00
#SBATCH --partition=bigmem
PLATE="Plate10"
BAM_FOLDER="$HOME/BAM/$PLATE"
SCRATCH=/scratch/$SLURM_JOBID
samtools merge $SCRATCH/Merged.bam $BAM_FOLDER/*.bam
mv $SCRATCH/Merged.bam $BAM_FOLDER/After generating the Merged.bam file, demultiplexing reads by barcode can be performed using Dorado. In this step, using CPU resources minimizes queue wait time, which is advantageous given the limited GPU availability on ARC. We used the following script to submit the demultiplexing job:
#! /bin/bash
#SBATCH --job-name=Demux
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=124GB
#SBATCH --time=24:00:00
#SBATCH --partition=bigmem
PLATE="Plate10"
BAM_FOLDER="$HOME/BAM/$PLATE"
DEMUX_FOLDER="$HOME/DEMUX/$PLATE"
mkdir -p $DEMUX_FOLDER
dorado demux --output-dir $DEMUX_FOLDER --no-classify $BAM_FOLDER/Merged.bamThe merged sequencing output Merged.bam can be summarized using Dorado. For this step, requesting CPU resources is more efficient and avoids extended GPU queue wait times. We used the following script to generate the summary:
#! /bin/bash
#SBATCH --job-name=BAM_Summary
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=256GB
#SBATCH --time=24:00:00
#SBATCH --partition=bigmem
PLATE="Plate10"
BAM_FOLDER="$HOME/BAM/$PLATE"
SUMMARY_FOLDER="$HOME/Summary"
mkdir -p $SUMMARY_FOLDER
dorado summary $BAM_FOLDER/Merged.bam > $SUMMARY_FOLDER/$PLATE.tsvEach barcode-level BAM file should be converted to a FASTQ file, which is the required input format for most downstream tools. Since each barcode is processed independently, up to 96 jobs (for 96 barcodes) can be parallelized. The following script provides an example of a single job submission:
#!/bin/bash
#SBATCH --job-name=BAM2FQ_96
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2GB
#SBATCH --time=2:00:00
#SBATCH --partition=single
PLATE="Plate10"
DEMUX_FOLDER="$HOME/DEMUX/$PLATE"
FASTQ_FOLDER="$HOME/FASTQ/$PLATE"
mkdir -p $FASTQ_FOLDER
samtools fastq -T"*" $DEMUX_FOLDER/<INPUT>.bam | gzip > $FASTQ_FOLDER/<OUTPUT>.fastq.gzWhereas Step 04 provides a summary of the sequencing output at the aggregate level, this section summarizes per-barcode statistics, including read length distributions, read quality, and other relevant metrics. Unclassified reads are excluded from this analysis. We used the following script:
#!/bin/bash
#SBATCH --job-name=NanoPlot
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=4:00:00
#SBATCH --partition=single
PLATE="Plate10"
FASTQ_FOLDER="$HOME/FASTQ/$PLATE"
STAT_FOLDER="$HOME/Stats/$PLATE"
mkdir -p $STAT_FOLDER
NanoPlot -t 4 --raw --fastq $FASTQ_FOLDER/<INPUT>.fastq.gz --no_static --tsv_stats --outdir $STAT_FOLDER/<OUTPUT>We used FastQC to assess per-base quality for each barcode-level FASTQ file. Because read-length distributions vary substantially across barcodes, it is more informative to generate FastQC results stratified by read length. Therefore, each FASTQ file was partitioned into five read-length bins: (1) 0–99 bp, (2) 100–999 bp, (3) 1000–9999 bp, (4) 10000–99999 bp, and (5) >100000 bp. This binning enables more accurate visualization of quality trends across read-length ranges. The following script was used to run FastQC:
#!/bin/bash
#SBATCH --job-name=FastQC
#SBATCH --mem=12G
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --partition=single
#SBATCH --time=12:00:00
PLATE="Plate10"
FASTQ_FOLDER="$HOME/FASTQ/$PLATE"
FASTQC_FOLDER="$HOME/FASTQC/$PLATE"
mkdir -p $FASTQC_FOLDER/<OUTPUT>
fastqc -o $FASTQC_FOLDER/<OUTPUT> --extract -f fastq $FASTQ_FOLDER/<INPUT>.fastq.gzFinally, we summarized the FastQC outputs using MultiQC to enable comparative visualization of per-base quality across batches, barcodes, and/or read-length bins. For this purpose, individual .folders files were generated to specify sets of FastQC result directories for targeted comparison. The following bash script was used to enumerate all FastQC directories:
#!/bin/bash
find "$HOME"/FASTQC/Plate?? -type d -name "barcode*_len_*" -exec realpath {} \; | sort > "$HOME/FASTQC/all_dirs.txt"
cat "$HOME/FASTQC/all_dirs.txt" | grep "_fastqc" > "$HOME/FASTQC/fastqc.folders"
rm "$HOME/FASTQC/all_dirs.txt"Users may then subset directories of interest and create a .folders file to supply to MultiQC. The following script was used to run MultiQC:
#!/bin/bash
#SBATCH --mem=12G
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=single
#SBATCH --time=4:00:00
PLATE="Plate10"
FASTQC_FOLDER="$HOME/FASTQC/$PLATE"
MULTIQC_FOLDER="$HOME/MULTIQC/$PLATE"
mkdir -p $MULTIQC_FOLDER
multiqc -o $MULTIQC_FOLDER/$PLATE -n $PLATE -z --file-list $FASTQC_FOLDER/$PLATE.foldersThe variable $HOME should be predefined, or users may explicitly define it within each script to specify the working directory. For example, it can be exported as follows:
HOME=/bulk/your_storage/project_name/We expect users to parallelize these scripts, where <INPUT> and <OUTPUT> represent either each POD5 file (generated hourly) or each barcode-specific sample. We also benchmarked approximate runtimes using 290 POD5 files under the configurations described above. These values are provided for reference only, as actual runtimes will depend on queue load and resource availability.
- Approximate runtimes:
- Step 01 (basecalling): 2 days
- Step 02 (merge): 5 hours 37 minutes
- Step 03 (demultiplexing): 1 hour 22 minutes
- Step 05 (BAM-to-FASTQ conversion): 1 hour 03 minutes
- Steps involving summary generation are completed in a comparatively short time (typically five minutes to less than one hour).