AdamNogell/CluSTR

CluSTR

Enumerate co-existing haplotypes at a user-defined window in a multi-copy or repeat-rich locus from paired-end short-read data.

CluSTR is a small pipeline aimed at loci where standard variant callers collapse heterogeneity into single positions (most prominently the human rDNA repeat array, but also other multi-copy or tandemly repeated regions). Instead of calling per-position variants, CluSTR:

  1. trims adapters,
  2. orients reads by primer,
  3. maps to a reference,
  4. keeps only reads whose alignment fully spans a user-defined window (CIGAR-aware geometric filter),
  5. clusters the survivors,
  6. projects each cluster centroid back onto reference coordinates using its CIGAR,
  7. deduplicates identical projections, sums their read support, and
  8. writes a list of haplotypes filtered by a depth-aware read-support threshold.
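
Step 4 can be illustrated with a minimal sketch. It is not CluSTR's actual code: the function names are hypothetical, and it assumes the filter compares the reference interval implied by the SAM POS and CIGAR fields against the window, so soft-clipped and inserted bases do not count toward coverage.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_end(pos: int, cigar: str) -> int:
    """Return the 1-based inclusive reference end of an alignment.

    pos is the 1-based leftmost mapped position (SAM POS field).
    """
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos + ref_len - 1


def spans_window(pos: int, cigar: str, start: int, end: int) -> bool:
    """True if the aligned reference interval covers [start, end] (1-based, inclusive)."""
    return pos <= start and reference_end(pos, cigar) >= end


# Hypothetical alignments tested against two hypothetical windows.
keep = spans_window(95, "10S100M2I20M5D30M", 100, 240)  # reference end is 249
drop = spans_window(95, "150M", 100, 250)               # reference end is 244
```

A read that starts before the window but ends inside it is dropped rather than padded, which is what lets every surviving read contribute a complete window sequence to clustering.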

Status

This repository has been refactored from a diploma-thesis-stage bash + Python pipeline into a packaged, tested, reproducible tool. The original thesis entry points (main.sh, main.py, setup scripts, and the vendored FASTX splitter) are preserved on the legacy/thesis-pipeline branch. The current implementation lives in src/clustr/ and is what the rest of this README describes.

Install

CluSTR is a small Python package that shells out to four external tools: cutadapt, bwa, samtools, and one of cd-hit-est or vsearch. The recommended way to get all of them is the conda environment in env/environment.yml.

mamba env create -f env/environment.yml
mamba activate clustr
clustr check     # confirms all external tools are on $PATH

Or via Docker:

docker build -t clustr:latest -f env/Dockerfile .
docker run --rm -v "$PWD":/work clustr:latest run --help

For development:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest

Quickstart

clustr run \
    --r1 sample_R1.fq.gz \
    --r2 sample_R2.fq.gz \
    --reference data/U13369.1.fasta \
    --primers data/primers.tsv \
    --window 4500:7800 \
    --adapter-5 CTGTCTCTTATACACATCT \
    --adapter-3 CTGTCTCTTATACACATCT \
    --threads 8 \
    --outdir runs/sample

Output of interest:

| File | Contents |
| --- | --- |
| final.tsv | Ranked haplotype table (rank, sequence, count, insertions, label). Row 0 is the reference. |
| haplotypes.fasta | Same haplotypes as FASTA, count/insertions in the header. |
| summary.json | Per-stage statistics and the exact run config. |
| logs/external_tools.log | All cutadapt / BWA / samtools / cluster invocations + their stderr. |
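
For downstream analysis, a table like final.tsv can be read with the standard library. The sketch below rests on assumptions not confirmed here: that the file is tab-separated with a header row naming the five columns listed above, and that rank and count are integers. The example rows, including the labels, are made up.

```python
import csv
import io


def load_haplotypes(handle):
    """Parse a final.tsv-style table into dicts with integer rank and count."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["rank"] = int(row["rank"])
        row["count"] = int(row["count"])
        rows.append(row)
    return rows


def support_fractions(rows):
    """Fraction of total read support per non-reference haplotype (rank > 0)."""
    observed = [r for r in rows if r["rank"] > 0]
    total = sum(r["count"] for r in observed)
    return {r["label"]: r["count"] / total for r in observed} if total else {}


# Made-up content for illustration; real sequences span the whole window.
example = io.StringIO(
    "rank\tsequence\tcount\tinsertions\tlabel\n"
    "0\tACGTACGT\t0\t\tref\n"
    "1\tACGTACGT\t750\t\thap1\n"
    "2\tACGAACGT\t250\t\thap2\n"
)
fractions = support_fractions(load_haplotypes(example))
```

For a real run, replace the in-memory example with `open("runs/sample/final.tsv", newline="")` and check the header against the actual file first.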

Many samples in parallel

snakemake -s workflow/Snakefile \
    --configfile workflow/config.example.yaml \
    --use-conda --cores 32

Inputs

| Input | Format |
| --- | --- |
| --r1, --r2 | Paired-end FASTQ (.fq, .fastq, .fq.gz, or .fastq.gz). |
| --reference | Reference FASTA. The index is built on first use. |
| --primers | Two-column TSV: label<TAB>sequence. Same format as the legacy file. |
| --window | START:END, 1-based inclusive, on the reference contig. |
| --contig | Optional contig name in a multi-contig reference. Defaults to the first contig. |
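
Because the window is 1-based and inclusive, converting it to Python's 0-based half-open slices is an easy place for an off-by-one error. The following sketch shows one way to parse and convert a window; `parse_window` and `to_slice` are hypothetical helpers, not functions exported by CluSTR.

```python
def parse_window(spec: str) -> tuple[int, int]:
    """Parse 'START:END' (1-based, inclusive) and validate it."""
    start_text, sep, end_text = spec.partition(":")
    if not sep:
        raise ValueError(f"expected START:END, got {spec!r}")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid window {spec!r}: need 1 <= START <= END")
    return start, end


def to_slice(start: int, end: int) -> slice:
    """Convert a 1-based inclusive interval to a 0-based half-open Python slice."""
    return slice(start - 1, end)


start, end = parse_window("4500:7800")
window_length = end - start + 1  # both endpoints are included
```

With the Quickstart value `4500:7800`, the window covers 3301 reference positions, not 3300.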

CLI

clustr run               # full pipeline on one paired-end sample
clustr check             # verify external tools on $PATH
clustr report <outdir>   # print summary.json of a finished run
clustr version

clustr run --help lists every flag, including the depth-aware read-support threshold (--min-haplotype-reads auto) and the clustering backend (--cluster-backend cdhit|vsearch).

How CluSTR differs from existing tools

| Tool | What it outputs | Where CluSTR differs |
| --- | --- | --- |
| DADA2 / Deblur | ASV table (denoised per-sample sequences) | CluSTR uses a window-spanning filter + sequence clustering tuned for indel-heavy loci, not amplicon denoising. |
| ExpansionHunter / HipSTR / STRipy / LUSTR | STR length genotypes | CluSTR enumerates per-haplotype window sequences, not lengths. |
| rDNAcaller / GATK on a masked CHM13 ref | VCF of positional variants | CluSTR reports complete haplotype sequences (variants in their natural co-occurring combinations) rather than per-position calls. |
| Legacy bash pipeline on legacy/thesis-pipeline | A single final.txt | The new implementation is packaged, tested, idempotent, parametrised, parallel, and fixes three CIGAR-handling bugs (see CHANGELOG.md). |
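
To make "complete haplotype sequences" concrete, here is a hedged sketch of projecting a cluster centroid onto reference coordinates with its CIGAR and then merging identical projections (steps 6 and 7 of the pipeline). It is an illustration under stated assumptions, not the code in src/clustr/: deletions are written as `-`, insertions are kept as separate (position, bases) pairs, and all names and inputs are hypothetical.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def project(seq, pos, cigar, start, end):
    """Project an aligned sequence onto reference window [start, end] (1-based, inclusive).

    Returns (projected, insertions): projected has one character per reference
    position ('-' for deleted bases); insertions is a tuple of
    (reference position preceding the insertion, inserted bases).
    """
    projected = {}
    insertions = []
    ref, qry = pos, 0  # next reference position (1-based), next query index (0-based)
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":  # consumes query and reference
            for i in range(n):
                if start <= ref + i <= end:
                    projected[ref + i] = seq[qry + i]
            ref += n
            qry += n
        elif op == "I":  # consumes query only
            if start <= ref - 1 < end:
                insertions.append((ref - 1, seq[qry:qry + n]))
            qry += n
        elif op in "DN":  # consumes reference only
            for i in range(n):
                if start <= ref + i <= end:
                    projected[ref + i] = "-"
            ref += n
        elif op == "S":  # soft clip: consumes query only
            qry += n
        # H and P consume neither query nor reference
    if any(p not in projected for p in range(start, end + 1)):
        raise ValueError("alignment does not span the window")
    return "".join(projected[p] for p in range(start, end + 1)), tuple(insertions)


def merge_projections(centroids, start, end):
    """Sum read support over centroids whose projections are identical."""
    support = Counter()
    for seq, pos, cigar, count in centroids:
        support[project(seq, pos, cigar, start, end)] += count
    return support.most_common()


# Hypothetical centroids: (sequence, POS, CIGAR, read count); window is 3..8.
centroids = [
    ("ACGTACGTAC", 1, "10M", 40),
    ("TTACGTACGTAC", 1, "2S10M", 10),    # same projection once the clip is skipped
    ("ACGTAACGTAC", 1, "4M1I6M", 5),     # same window bases plus an insertion
]
merged = merge_projections(centroids, 3, 8)
```

The first two centroids collapse into one haplotype with combined support, while the third stays separate because its insertion is part of the deduplication key.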

Benchmarks

See benchmarks/ for a Snakemake harness that simulates mixtures with art_illumina, runs CluSTR / DADA2 / rDNAcaller side-by-side on the same input, and scores each against the ground truth. The DADA2 and rDNAcaller rules are deliberate stubs; wire them up to your local installs before running the comparison.

Citing

If you use CluSTR in a publication, please cite the (forthcoming) application note. A CITATION.cff is provided.

License

MIT. See LICENSE.
