Enumerate co-existing haplotypes at a user-defined window in a multi-copy or repeat-rich locus from paired-end short-read data.
CluSTR is a small pipeline aimed at loci where standard variant callers collapse heterogeneity into single positions (most prominently the human rDNA repeat array, but also other multi-copy or tandemly repeated regions). Instead of calling per-position variants, CluSTR:
- trims adapters,
- orients reads by primer,
- maps to a reference,
- keeps only reads whose alignment fully spans a user-defined window (CIGAR-aware geometric filter),
- clusters the survivors,
- projects each cluster centroid back onto reference coordinates using its CIGAR,
- deduplicates identical projections, sums their read support, and
- writes a depth-aware-thresholded list of haplotypes.
This repository is refactored from a diploma-thesis-stage bash + Python pipeline
into a packaged, tested, reproducible tool. The original thesis entry-point
(main.sh, main.py, setup scripts, and the vendored FASTX splitter) is
preserved on the legacy/thesis-pipeline
branch. The current implementation is in src/clustr/ and is
what the rest of this README describes.
CluSTR is a small Python package that shells out to four external tools:
cutadapt, bwa, samtools, and one of cd-hit-est or vsearch. The
recommended way to get all of them is the conda environment in
env/environment.yml.
mamba env create -f env/environment.yml
mamba activate clustr
clustr check # confirms all external tools are on $PATHOr via Docker:
docker build -t clustr:latest -f env/Dockerfile .
docker run --rm -v "$PWD":/work clustr:latest run --helpFor development:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytestclustr run \
--r1 sample_R1.fq.gz \
--r2 sample_R2.fq.gz \
--reference data/U13369.1.fasta \
--primers data/primers.tsv \
--window 4500:7800 \
--adapter-5 CTGTCTCTTATACACATCT \
--adapter-3 CTGTCTCTTATACACATCT \
--threads 8 \
--outdir runs/sampleOutput of interest:
| File | Contents |
|---|---|
final.tsv |
Ranked haplotype table (rank, sequence, count, insertions, label). Row 0 is the reference. |
haplotypes.fasta |
Same haplotypes as FASTA, count/insertions in the header. |
summary.json |
Per-stage statistics and the exact run config. |
logs/external_tools.log |
All cutadapt / BWA / samtools / cluster invocations + their stderr. |
snakemake -s workflow/Snakefile \
--configfile workflow/config.example.yaml \
--use-conda --cores 32| Input | Format |
|---|---|
--r1, --r2 |
Paired-end FASTQ, .fq/.fastq/.fq.gz/.fastq.gz. |
--reference |
Indexed FASTA. Index is built on first use. |
--primers |
Two-column TSV: label<TAB>sequence. Same as the legacy file. |
--window |
START:END, 1-based inclusive, on the reference contig. |
--contig |
Optional contig name in a multi-contig reference. Defaults to the first contig. |
clustr run # full pipeline on one paired-end sample
clustr check # verify external tools on $PATH
clustr report <outdir> # print summary.json of a finished run
clustr version
clustr run --help lists every flag, including the depth-aware
--min-haplotype-reads auto threshold and the choice of clustering backend
(--cluster-backend cdhit|vsearch).
| Tool | What it outputs | Where CluSTR differs |
|---|---|---|
| DADA2 / Deblur | ASV table (denoised per-sample sequences) | CluSTR uses a window-spanning filter + sequence clustering tuned for indel-heavy loci, not amplicon denoising. |
| ExpansionHunter / HipSTR / STRipy / LUSTR | STR length genotypes | CluSTR enumerates per-haplotype window sequences, not lengths. |
| rDNAcaller / GATK on a masked CHM13 ref | VCF of positional variants | CluSTR reports complete haplotype sequences (variants in their natural co-occurring combinations) rather than per-position calls. |
Legacy bash pipeline on legacy/thesis-pipeline |
A single final.txt |
The new implementation is packaged, tested, idempotent, parametrised, parallel, and fixes three CIGAR-handling bugs (see CHANGELOG.md). |
See benchmarks/ for a Snakemake harness that simulates
mixtures with art_illumina, runs CluSTR / DADA2 / rDNAcaller side-by-side
on the same input, and scores each against the ground truth. The DADA2 and
rDNAcaller rules are deliberate stubs — wire them up to your local installs.
If you use CluSTR in a publication, please cite the (forthcoming) application
note. A CITATION.cff is provided.
MIT. See LICENSE.