This pipeline checks a whole genome sequencing file for a small set of forensic
DNA markers, such as TPOX, D3S1358, vWA, and related STR markers. These
are the same kind of marker names that can appear in lab reports from forensic
or ancestry-style DNA testing. Together, these marker values are often called a
DNA profile. The markers are useful for identification because their repeat
lengths are highly variable between individuals.
The input is a sequencing alignment file, usually a BAM or CRAM, produced
from whole genome sequencing. The pipeline uses ExpansionHunter to estimate the
two allele values at each marker, writes a simple CSV report, and can compare
those values with a lab reference CSV from capillary electrophoresis (CE).
The default example sample is NA12878, a public reference genome sample, using
30x whole genome sequencing data. Low-coverage NA12878 data is intentionally not
supported because it did not produce usable allele calls for this panel.
One practical use case is a sample identity sanity check: if you have a whole genome sequencing file and an independent wet-lab DNA profile for the same person, this pipeline can compare forensic marker values between them. A strong match supports that the sequencing file belongs to the expected person, while many mismatches are a warning to investigate sample mix-up, file mix-up, reference/report formatting, or assay differences.
If you want to try this with your own DNA and do not have access to a laboratory, there are direct-to-consumer services that can provide the two kinds of input used here:
- Whole genome sequencing data: services such as Sequencing.com offer at-home
whole genome sequencing kits and provide downloadable genome data. This
pipeline needs an aligned
BAMorCRAMfile, not only a health report. - Wet-lab DNA profile / CE reference: services such as AlphaBiolabs offer
direct-to-consumer DNA profile testing based on STR markers. Their report can
be converted into the CE reference CSV format used by
make compare.
These are examples of service categories, not endorsements. Check current file download options, marker lists, consent rules, privacy terms, and local legal requirements before ordering any DNA test.
make report writes:
locus,allele1,allele2make compare writes:
locus,allele1,allele2,ce_allele1,ce_allele2,matchThe match column is true when the two called alleles match the CE alleles,
ignoring allele order.
ExpansionHunter also writes VCF/JSON outputs and a small _realigned.bam
sidecar in results/. The CSV files are the primary outputs for this workflow.
Create a project folder and clone this repository:
mkdir my_project
cd my_project
git clone https://github.com/SpikeTreeLab/forensic_loci.gitYou can also add this repository as a folder inside an existing project if you want to keep the pipeline next to your own BAM/CRAM files and CE references.
Install these tools in the active environment:
ExpansionHuntersamtoolsawkwgetgzipcolumn
Example conda setup:
conda create -n bioinfo
conda activate bioinfo
conda install -c conda-forge -c bioconda expansionhunter samtools coreutils gawk wget gzipCheck software availability:
make -C forensic_loci depsThe Makefile detects local CPU/RAM and defaults to using most CPU cores while leaving some headroom for the OS.
Fetch the hg38 reference and the default NA12878 30x WGS CRAM:
make -C forensic_loci dataRun ExpansionHunter and generate the report:
make -C forensic_loci reportCompare against the included NA12878 CE reference:
make -C forensic_loci compareView the comparison as an aligned table:
make -C forensic_loci viewDefault output paths:
forensic_loci/results/NA12878.wgs30.forensic_loci.report.csv
forensic_loci/results/NA12878.wgs30.forensic_loci.ce_compare.csv
For another hg38-aligned BAM with a CE reference CSV:
make -C forensic_loci compare \
SAMPLE=JohnDoe \
BAM=/path/to/bam/JohnDoe-XXXXXXXX-30x-WGS-Sequencing_com.bam \
CE_REFERENCE=/path/to/ce_reference/forensic_loci_alpha_biolabs_pom_XXXXXX.csvThis writes:
forensic_loci/results/JohnDoe.wgs30.forensic_loci.report.csv
forensic_loci/results/JohnDoe.wgs30.forensic_loci.ce_compare.csv
If the BAM index is missing, the Makefile will try to create
/path/to/sample.bam.bai with samtools index. If that directory is not
writable, create the index in a writable location and pass it explicitly:
make -C forensic_loci compare \
SAMPLE=my_sample \
BAM=/path/to/my_sample.bam \
READS_INDEX=/writable/path/my_sample.bam.bai \
CE_REFERENCE=/path/to/my_sample_ce_reference.csvFor CRAM input, provide the CRAM index and a compatible hg38 reference FASTA if needed:
make -C forensic_loci compare \
SAMPLE=my_sample \
BAM=/path/to/my_sample.cram \
READS_INDEX=/path/to/my_sample.cram.crai \
REF=/path/to/hg38.fa \
CE_REFERENCE=/path/to/my_sample_ce_reference.csvThe CE reference CSV must have this shape:
locus,ce_allele1,ce_allele2
TPOX,8,8
D3S1358,16,17
vWA,15,17The locus names should match the names in
catalog/forensic_loci.hg38.expansionhunter.json.
The included NA12878 CE reference lives in ce_reference/. That directory also
documents the publisher source link and how the calls were read from the
electropherogram image in Supplementary Dataset 7.
- Input BAM/CRAM coordinates must be compatible with hg38.
- The ExpansionHunter catalog is a targeted forensic-loci catalog, not a full forensic interpretation system.
- CE comparison is allele-label comparison, not a legal identity test.
- Some loci with complex repeat structure can disagree between short-read STR genotyping and CE sizing.
These directories are generated/downloaded and ignored by git:
alignments/
logs/
refs/
results/
Clean current sample outputs:
make -C forensic_loci cleanRemove generated outputs, downloaded reads, and downloaded references:
make -C forensic_loci realcleanThis project is released under the MIT License. See LICENSE.
The included NA12878 CE reference-call documentation is kept for provenance and
auditability. The third-party source image/PDF is not redistributed because its
reuse license is not clearly open-source compatible. See NOTICE and
ce_reference/README.md for source attribution and verification notes.
This workflow was informed by:
Direct-to-consumer examples mentioned above:
- Sequencing.com whole genome sequencing:
https://sequencing.com/ - AlphaBiolabs single genetic profile testing:
https://www.alphabiolabs.co.uk/public-testing-services/single-genetic-profile-testing/
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment
and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37,
907-915 (2019). DOI: 10.1038/s41587-019-0201-4.
Kim et al. used PowerPlex Fusion wet-lab DNA fingerprinting data for 17 Platinum Genomes and compared those results with HISAT-genotype calls. The NA12878 CE reference calls in this repository were transcribed from their Supplementary Dataset 7, described as "PowerPlex Fusion results for 17 PG genomes (raw signal image data)".
Oketch O, et al. A comparison of software for analysis of rare and common short
tandem repeat (STR) variation using human genome sequences from clinical and
population-based samples. PLOS ONE (2024). DOI:
10.1371/journal.pone.0300545.
Oketch et al. compared STR analysis tools, including ExpansionHunter, GangSTR, HipSTR, STRetch, STRling, and ExpansionHunter Denovo, and used the Kim et al. PowerPlex Fusion data as the CE reference for forensic STR comparisons.
This pipeline and documentation were developed with assistance from OpenAI Codex, an AI coding assistant. Codex helped with Makefile structure, result parsing scripts, documentation wording, and literature/source tracing. The pipeline should still be reviewed, tested, and interpreted by a human operator, especially before using results for any decision involving personal identity, privacy, health, or legal matters.
The Makefile-centered workflow and directory organization were inspired by the
example project structure from The Biostar Handbook:
https://www.biostarhandbook.com/. The Biostar Handbook emphasizes practical,
reproducible bioinformatics workflows using command-line tools, Makefiles, and
clear separation of inputs, references, scripts, and results.