pairbase

Fast mismatch consensus calling from paired-end reads

Mismatch rates: pairbase counts mismatches in the consensus positions of paired-end reads from whole genome sequenced cell-free DNA and normalizes by the consensus-position opportunities in the same sequencing data.

Positional statistics: Quantify the effect of adding/changing filtering steps via positional statistics about consensus bases.

Key features

capability	details
Specifiable k-mer size	allows k=2 (DBS) and odd sizes from k=3 up to k=27 (SBS; limited by available memory).
Fast and multithreaded	implemented in rust and fully parallelized (up to one thread per chromosome)
Human-readable output	outputs TSV file with the mismatch motif, mismatch count, opportunity count, and mismatch rate
Multiple windowing modes	fixed length (`--by-size 10_000`), BED intervals (`--by-bed sites.bed`), or single genome‑wide (default)
Blacklist masking	exclude repeats/SNPs/artefacts with one or several BEDs
Strandagnostic	motifs are collapsed to their reverse-complements

Installation

Compile from source

You may need a few dependencies that can be installed as a conda environment with:

conda create -n pairbase rust=1.87.0 zstandard perl conda-forge::llvmdev conda-forge::clangdev
conda activate pairbase

Compile and install:

cargo install --git https://github.com/JakobSkouPedersenLab/pairbase
pairbase --help
# or clone + build
git clone https://github.com/JakobSkouPedersenLab/pairbase
cd pairbase && cargo build --release
target/release/pairbase --help

Quick‑start example

Runtime depends on the size of the bam file. The below example ran in ~10min with 12 cores (~6min without the 11-mers) for a ~25x WGS file:

pairbase \
  --bam sample.bam \                      # bam file with paired-end cfDNA
  --ref-2bit hg38.2bit \                  # reference genome in 2‑bit
  --output-dir results \                  # where to write files
  --kmer-sizes 3,5,11 \                   # count 3-, 5-, and 11-mers
  --n-threads 12 \                        # use 12 CPU cores (max. one per chromosome)
  --blacklist encode_blacklist.bed        # exclude ENCODE blacklist intervals
  --blacklist germline_snps.bed           # exclude germline SNPs

Extract positional statistics for quantifying the effect of adding/removing filtering steps, etc.:

pos_stats \
  --bam sample.bam \                      # bam file with paired-end cfDNA
  --ref-2bit hg38.2bit \                  # reference genome in 2‑bit
  --output-dir results \                  # where to write files
  --n-threads 12 \                        # use 12 CPU cores (max. one per chromosome)
  --blacklist encode_blacklist.bed        # exclude ENCODE blacklist intervals
  --blacklist germline_snps.bed           # exclude germline SNPs

Command‑line reference

pairbase --help
pos_stats --help

option	purpose
`-i`, `--bam <path>`	bam file with paired-end reads
`-r`, `--ref-2bit <path>`	two‑bit reference genome
`-k`, `--kmer-sizes <list>`	k values (2, 3, 5, 7, ..., 27)
`-t`, `--n-threads <N>`	CPU threads
`--strand-aware <flag>`	do not collapse motifs to reverse-complements
Window selection
`--by-size <bp>`	fixed‑length windows
`--by-bed <BED>`	custom intervals
`--global <flag>`	one big window for the genome
Filtering
`-b`, `--blacklist <BED>`	mask repeats/artefacts
`--blacklist-min-size <bp>`	drop tiny blacklist entries
`--min-overlap <bp>`	minimum size of overlap
`--min-base-quality`	minimum base quality of the position on both reads
`--min-mapq`	minimum mapping quality of the read
`--min-seq-len <bp>`	minimum size read sequence
`--max-fragment-length <bp>`	maximum fragment length
`--min-nm <bp>`	minimum number of recorded mismatches in a read (default: 0)
`--max-nm <bp>`	maximum number of recorded mismatches in a read (default: 5)
Chromosome selection
`--chromosomes <list>`	chromosomes to process (default: chr1-22)
`--chromosomes-file <path>`	file with chromosomes to process

FAQ

What bases are recognized?

We only consider A, C, G, T, N. Any other bases are converted to 'N'.

What happens to kmers with 'N's in them?

We discard all motifs with 'N' in them.

Where do I get the 2-bit reference genome file?

We use the hg38.2bit file from UCSC: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit

Although the code is agnostic to the genome assembly, so hg19.2bit or similar should work, assuming the BAM files are aligned to the same assembly.

Can I make suggestions for the tool?

Of course! Open an issue at https://github.com/JakobSkouPedersenLab/pairbase/issues/new/choose.

Implemented by @ludvigolsen. Logic by @gualpo.

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
pairbase_logo_230x280_250dpi.png		pairbase_logo_230x280_250dpi.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pairbase

Fast mismatch consensus calling from paired-end reads

Key features

Installation

Compile from source

Quick‑start example

Command‑line reference

FAQ

What bases are recognized?

What happens to kmers with 'N's in them?

Where do I get the 2-bit reference genome file?

Can I make suggestions for the tool?

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pairbase

Fast mismatch consensus calling from paired-end reads

Key features

Installation

Compile from source

Quick‑start example

Command‑line reference

FAQ

What bases are recognized?

What happens to kmers with 'N's in them?

Where do I get the 2-bit reference genome file?

Can I make suggestions for the tool?

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages