CluSTR

Enumerate co-existing haplotypes at a user-defined window in a multi-copy or repeat-rich locus from paired-end short-read data.

CluSTR is a small pipeline aimed at loci where standard variant callers collapse heterogeneity into single positions (most prominently the human rDNA repeat array, but also other multi-copy or tandemly repeated regions). Instead of calling per-position variants, CluSTR:

trims adapters,
orients reads by primer,
maps to a reference,
keeps only reads whose alignment fully spans a user-defined window (CIGAR-aware geometric filter),
clusters the survivors,
projects each cluster centroid back onto reference coordinates using its CIGAR,
deduplicates identical projections, sums their read support, and
writes a depth-aware-thresholded list of haplotypes.

Status

This repository is refactored from a diploma-thesis-stage bash + Python pipeline into a packaged, tested, reproducible tool. The original thesis entry-point (main.sh, main.py, setup scripts, and the vendored FASTX splitter) is preserved on the legacy/thesis-pipeline branch. The current implementation is in src/clustr/ and is what the rest of this README describes.

Install

CluSTR is a small Python package that shells out to four external tools: cutadapt, bwa, samtools, and one of cd-hit-est or vsearch. The recommended way to get all of them is the conda environment in env/environment.yml.

mamba env create -f env/environment.yml
mamba activate clustr
clustr check     # confirms all external tools are on $PATH

Or via Docker:

docker build -t clustr:latest -f env/Dockerfile .
docker run --rm -v "$PWD":/work clustr:latest run --help

For development:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest

Quickstart

clustr run \
    --r1 sample_R1.fq.gz \
    --r2 sample_R2.fq.gz \
    --reference data/U13369.1.fasta \
    --primers data/primers.tsv \
    --window 4500:7800 \
    --adapter-5 CTGTCTCTTATACACATCT \
    --adapter-3 CTGTCTCTTATACACATCT \
    --threads 8 \
    --outdir runs/sample

Output of interest:

File	Contents
`final.tsv`	Ranked haplotype table (rank, sequence, count, insertions, label). Row 0 is the reference.
`haplotypes.fasta`	Same haplotypes as FASTA, count/insertions in the header.
`summary.json`	Per-stage statistics and the exact run config.
`logs/external_tools.log`	All cutadapt / BWA / samtools / cluster invocations + their stderr.

Many samples in parallel

snakemake -s workflow/Snakefile \
    --configfile workflow/config.example.yaml \
    --use-conda --cores 32

Inputs

Input	Format
`--r1`, `--r2`	Paired-end FASTQ, `.fq`/`.fastq`/`.fq.gz`/`.fastq.gz`.
`--reference`	Indexed FASTA. Index is built on first use.
`--primers`	Two-column TSV: `label<TAB>sequence`. Same as the legacy file.
`--window`	`START:END`, 1-based inclusive, on the reference contig.
`--contig`	Optional contig name in a multi-contig reference. Defaults to the first contig.

CLI

clustr run               # full pipeline on one paired-end sample
clustr check             # verify external tools on $PATH
clustr report <outdir>   # print summary.json of a finished run
clustr version

clustr run --help lists every flag, including the depth-aware --min-haplotype-reads auto threshold and the choice of clustering backend (--cluster-backend cdhit|vsearch).

How CluSTR differs from existing tools

Tool	What it outputs	Where CluSTR differs
DADA2 / Deblur	ASV table (denoised per-sample sequences)	CluSTR uses a window-spanning filter + sequence clustering tuned for indel-heavy loci, not amplicon denoising.
ExpansionHunter / HipSTR / STRipy / LUSTR	STR length genotypes	CluSTR enumerates per-haplotype window sequences, not lengths.
rDNAcaller / GATK on a masked CHM13 ref	VCF of positional variants	CluSTR reports complete haplotype sequences (variants in their natural co-occurring combinations) rather than per-position calls.
Legacy bash pipeline on `legacy/thesis-pipeline`	A single `final.txt`	The new implementation is packaged, tested, idempotent, parametrised, parallel, and fixes three CIGAR-handling bugs (see `CHANGELOG.md`).

Benchmarks

See benchmarks/ for a Snakemake harness that simulates mixtures with art_illumina, runs CluSTR / DADA2 / rDNAcaller side-by-side on the same input, and scores each against the ground truth. The DADA2 and rDNAcaller rules are deliberate stubs — wire them up to your local installs.

Citing

If you use CluSTR in a publication, please cite the (forthcoming) application note. A CITATION.cff is provided.

License

MIT. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.github/workflows		.github/workflows
benchmarks		benchmarks
env		env
paper		paper
src/clustr		src/clustr
testing		testing
tests		tests
workflow		workflow
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CluSTR

Status

Install

Quickstart

Many samples in parallel

Inputs

CLI

How CluSTR differs from existing tools

Benchmarks

Citing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CluSTR

Status

Install

Quickstart

Many samples in parallel

Inputs

CLI

How CluSTR differs from existing tools

Benchmarks

Citing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages