Skip to content

JakobSkouPedersenLab/pairbase

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

66 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pairbase

Fast mismatch consensus calling from paired-end reads

Mismatch rates: pairbase counts mismatches in the consensus positions of paired-end reads from whole genome sequenced cell-free DNA and normalizes by the consensus-position opportunities in the same sequencing data.

Positional statistics: Quantify the effect of adding/changing filtering steps via positional statistics about consensus bases.


Key features

capability details
Specifiable k-mer size allows k=2 (DBS) and odd sizes from k=3 up to k=27 (SBS; limited by available memory).
Fast and multithreaded implemented in rust and fully parallelized (up to one thread per chromosome)
Human-readable output outputs TSV file with the mismatch motif, mismatch count, opportunity count, and mismatch rate
Multiple windowing modes fixed length (--by-size 10_000), BED intervals (--by-bed sites.bed), or single genome‑wide (default)
Blacklist masking exclude repeats/SNPs/artefacts with one or several BEDs
Strandagnostic motifs are collapsed to their reverse-complements

Installation

Compile from source

You may need a few dependencies that can be installed as a conda environment with:

conda create -n pairbase rust=1.87.0 zstandard perl conda-forge::llvmdev conda-forge::clangdev
conda activate pairbase

Compile and install:

cargo install --git https://github.com/JakobSkouPedersenLab/pairbase
pairbase --help
# or clone + build
git clone https://github.com/JakobSkouPedersenLab/pairbase
cd pairbase && cargo build --release
target/release/pairbase --help

Quick‑start example

Runtime depends on the size of the bam file. The below example ran in ~10min with 12 cores (~6min without the 11-mers) for a ~25x WGS file:

pairbase \
  --bam sample.bam \                      # bam file with paired-end cfDNA
  --ref-2bit hg38.2bit \                  # reference genome in 2‑bit
  --output-dir results \                  # where to write files
  --kmer-sizes 3,5,11 \                   # count 3-, 5-, and 11-mers
  --n-threads 12 \                        # use 12 CPU cores (max. one per chromosome)
  --blacklist encode_blacklist.bed        # exclude ENCODE blacklist intervals
  --blacklist germline_snps.bed           # exclude germline SNPs

Extract positional statistics for quantifying the effect of adding/removing filtering steps, etc.:

pos_stats \
  --bam sample.bam \                      # bam file with paired-end cfDNA
  --ref-2bit hg38.2bit \                  # reference genome in 2‑bit
  --output-dir results \                  # where to write files
  --n-threads 12 \                        # use 12 CPU cores (max. one per chromosome)
  --blacklist encode_blacklist.bed        # exclude ENCODE blacklist intervals
  --blacklist germline_snps.bed           # exclude germline SNPs

Command‑line reference

pairbase --help
pos_stats --help
option purpose
-i, --bam <path> bam file with paired-end reads
-r, --ref-2bit <path> two‑bit reference genome
-k, --kmer-sizes <list> k values (2, 3, 5, 7, ..., 27)
-t, --n-threads <N> CPU threads
--strand-aware <flag> do not collapse motifs to reverse-complements
Window selection
--by-size <bp> fixed‑length windows
--by-bed <BED> custom intervals
--global <flag> one big window for the genome
Filtering
-b, --blacklist <BED> mask repeats/artefacts
--blacklist-min-size <bp> drop tiny blacklist entries
--min-overlap <bp> minimum size of overlap
--min-base-quality minimum base quality of the position on both reads
--min-mapq minimum mapping quality of the read
--min-seq-len <bp> minimum size read sequence
--max-fragment-length <bp> maximum fragment length
--min-nm <bp> minimum number of recorded mismatches in a read (default: 0)
--max-nm <bp> maximum number of recorded mismatches in a read (default: 5)
Chromosome selection
--chromosomes <list> chromosomes to process (default: chr1-22)
--chromosomes-file <path> file with chromosomes to process

FAQ

What bases are recognized?

We only consider A, C, G, T, N. Any other bases are converted to 'N'.

What happens to kmers with 'N's in them?

We discard all motifs with 'N' in them.

Where do I get the 2-bit reference genome file?

We use the hg38.2bit file from UCSC: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit

Although the code is agnostic to the genome assembly, so hg19.2bit or similar should work, assuming the BAM files are aligned to the same assembly.

Can I make suggestions for the tool?

Of course! Open an issue at https://github.com/JakobSkouPedersenLab/pairbase/issues/new/choose.


Implemented by @ludvigolsen. Logic by @gualpo.

About

Blazingly fast calculation of consensus mismatch rates in whole genome sequenced cell-free DNA

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages