Mismatch rates: pairbase counts mismatches in the consensus positions of paired-end reads from whole genome sequenced cell-free DNA and normalizes by the consensus-position opportunities in the same sequencing data.
Positional statistics: Quantify the effect of adding/changing filtering steps via positional statistics about consensus bases.
| capability | details |
|---|---|
| Specifiable k-mer size | allows k=2 (DBS) and odd sizes from k=3 up to k=27 (SBS; limited by available memory). |
| Fast and multithreaded | implemented in rust and fully parallelized (up to one thread per chromosome) |
| Human-readable output | outputs TSV file with the mismatch motif, mismatch count, opportunity count, and mismatch rate |
| Multiple windowing modes | fixed length (--by-size 10_000), BED intervals (--by-bed sites.bed), or single genome‑wide (default) |
| Blacklist masking | exclude repeats/SNPs/artefacts with one or several BEDs |
| Strandagnostic | motifs are collapsed to their reverse-complements |
You may need a few dependencies that can be installed as a conda environment with:
conda create -n pairbase rust=1.87.0 zstandard perl conda-forge::llvmdev conda-forge::clangdev
conda activate pairbaseCompile and install:
cargo install --git https://github.com/JakobSkouPedersenLab/pairbase
pairbase --help
# or clone + build
git clone https://github.com/JakobSkouPedersenLab/pairbase
cd pairbase && cargo build --release
target/release/pairbase --helpRuntime depends on the size of the bam file. The below example ran in ~10min with 12 cores (~6min without the 11-mers) for a ~25x WGS file:
pairbase \
--bam sample.bam \ # bam file with paired-end cfDNA
--ref-2bit hg38.2bit \ # reference genome in 2‑bit
--output-dir results \ # where to write files
--kmer-sizes 3,5,11 \ # count 3-, 5-, and 11-mers
--n-threads 12 \ # use 12 CPU cores (max. one per chromosome)
--blacklist encode_blacklist.bed # exclude ENCODE blacklist intervals
--blacklist germline_snps.bed # exclude germline SNPsExtract positional statistics for quantifying the effect of adding/removing filtering steps, etc.:
pos_stats \
--bam sample.bam \ # bam file with paired-end cfDNA
--ref-2bit hg38.2bit \ # reference genome in 2‑bit
--output-dir results \ # where to write files
--n-threads 12 \ # use 12 CPU cores (max. one per chromosome)
--blacklist encode_blacklist.bed # exclude ENCODE blacklist intervals
--blacklist germline_snps.bed # exclude germline SNPspairbase --help
pos_stats --help| option | purpose |
|---|---|
-i, --bam <path> |
bam file with paired-end reads |
-r, --ref-2bit <path> |
two‑bit reference genome |
-k, --kmer-sizes <list> |
k values (2, 3, 5, 7, ..., 27) |
-t, --n-threads <N> |
CPU threads |
--strand-aware <flag> |
do not collapse motifs to reverse-complements |
| Window selection | |
--by-size <bp> |
fixed‑length windows |
--by-bed <BED> |
custom intervals |
--global <flag> |
one big window for the genome |
| Filtering | |
-b, --blacklist <BED> |
mask repeats/artefacts |
--blacklist-min-size <bp> |
drop tiny blacklist entries |
--min-overlap <bp> |
minimum size of overlap |
--min-base-quality |
minimum base quality of the position on both reads |
--min-mapq |
minimum mapping quality of the read |
--min-seq-len <bp> |
minimum size read sequence |
--max-fragment-length <bp> |
maximum fragment length |
--min-nm <bp> |
minimum number of recorded mismatches in a read (default: 0) |
--max-nm <bp> |
maximum number of recorded mismatches in a read (default: 5) |
| Chromosome selection | |
--chromosomes <list> |
chromosomes to process (default: chr1-22) |
--chromosomes-file <path> |
file with chromosomes to process |
We only consider A, C, G, T, N. Any other bases are converted to 'N'.
We discard all motifs with 'N' in them.
We use the hg38.2bit file from UCSC: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit
Although the code is agnostic to the genome assembly, so hg19.2bit or similar should work, assuming the BAM files are aligned to the same assembly.
Of course! Open an issue at https://github.com/JakobSkouPedersenLab/pairbase/issues/new/choose.
Implemented by @ludvigolsen. Logic by @gualpo.
