Forensic Loci Pipeline

This pipeline checks a whole genome sequencing file for a small set of forensic DNA markers, such as TPOX, D3S1358, vWA, and related STR markers. These are the same kind of marker names that can appear in lab reports from forensic or ancestry-style DNA testing. Together, these marker values are often called a DNA profile. The markers are useful for identification because their repeat lengths are highly variable between individuals.

The input is a sequencing alignment file, usually a BAM or CRAM, produced from whole genome sequencing. The pipeline uses ExpansionHunter to estimate the two allele values at each marker, writes a simple CSV report, and can compare those values with a lab reference CSV from capillary electrophoresis (CE).

The default example sample is NA12878, a public reference genome sample, using 30x whole genome sequencing data. Low-coverage NA12878 data is intentionally not supported because it did not produce usable allele calls for this panel.

One practical use case is a sample identity sanity check: if you have a whole genome sequencing file and an independent wet-lab DNA profile for the same person, this pipeline can compare forensic marker values between them. A strong match supports that the sequencing file belongs to the expected person, while many mismatches are a warning to investigate sample mix-up, file mix-up, reference/report formatting, or assay differences.

Getting Your Own Data

If you want to try this with your own DNA and do not have access to a laboratory, there are direct-to-consumer services that can provide the two kinds of input used here:

Whole genome sequencing data: services such as Sequencing.com offer at-home whole genome sequencing kits and provide downloadable genome data. This pipeline needs an aligned BAM or CRAM file, not only a health report.
Wet-lab DNA profile / CE reference: services such as AlphaBiolabs offer direct-to-consumer DNA profile testing based on STR markers. Their report can be converted into the CE reference CSV format used by make compare.

These are examples of service categories, not endorsements. Check current file download options, marker lists, consent rules, privacy terms, and local legal requirements before ordering any DNA test.

What It Produces

make report writes:

locus,allele1,allele2

make compare writes:

locus,allele1,allele2,ce_allele1,ce_allele2,match

The match column is true when the two called alleles match the CE alleles, ignoring allele order.

ExpansionHunter also writes VCF/JSON outputs and a small _realigned.bam sidecar in results/. The CSV files are the primary outputs for this workflow.

Setup

Create a project folder and clone this repository:

mkdir my_project
cd my_project
git clone https://github.com/SpikeTreeLab/forensic_loci.git

You can also add this repository as a folder inside an existing project if you want to keep the pipeline next to your own BAM/CRAM files and CE references.

Requirements

Install these tools in the active environment:

ExpansionHunter
samtools
awk
wget
gzip
column

Example conda setup:

conda create -n bioinfo
conda activate bioinfo
conda install -c conda-forge -c bioconda expansionhunter samtools coreutils gawk wget gzip

Check software availability:

make -C forensic_loci deps

The Makefile detects local CPU/RAM and defaults to using most CPU cores while leaving some headroom for the OS.

Default NA12878 Run

Fetch the hg38 reference and the default NA12878 30x WGS CRAM:

make -C forensic_loci data

Run ExpansionHunter and generate the report:

make -C forensic_loci report

Compare against the included NA12878 CE reference:

make -C forensic_loci compare

View the comparison as an aligned table:

make -C forensic_loci view

Default output paths:

forensic_loci/results/NA12878.wgs30.forensic_loci.report.csv
forensic_loci/results/NA12878.wgs30.forensic_loci.ce_compare.csv

Running Another Sample

For another hg38-aligned BAM with a CE reference CSV:

make -C forensic_loci compare \
  SAMPLE=JohnDoe \
  BAM=/path/to/bam/JohnDoe-XXXXXXXX-30x-WGS-Sequencing_com.bam \
  CE_REFERENCE=/path/to/ce_reference/forensic_loci_alpha_biolabs_pom_XXXXXX.csv

This writes:

forensic_loci/results/JohnDoe.wgs30.forensic_loci.report.csv
forensic_loci/results/JohnDoe.wgs30.forensic_loci.ce_compare.csv

If the BAM index is missing, the Makefile will try to create /path/to/sample.bam.bai with samtools index. If that directory is not writable, create the index in a writable location and pass it explicitly:

make -C forensic_loci compare \
  SAMPLE=my_sample \
  BAM=/path/to/my_sample.bam \
  READS_INDEX=/writable/path/my_sample.bam.bai \
  CE_REFERENCE=/path/to/my_sample_ce_reference.csv

For CRAM input, provide the CRAM index and a compatible hg38 reference FASTA if needed:

make -C forensic_loci compare \
  SAMPLE=my_sample \
  BAM=/path/to/my_sample.cram \
  READS_INDEX=/path/to/my_sample.cram.crai \
  REF=/path/to/hg38.fa \
  CE_REFERENCE=/path/to/my_sample_ce_reference.csv

CE Reference CSV Format

The CE reference CSV must have this shape:

locus,ce_allele1,ce_allele2
TPOX,8,8
D3S1358,16,17
vWA,15,17

The locus names should match the names in catalog/forensic_loci.hg38.expansionhunter.json.

The included NA12878 CE reference lives in ce_reference/. That directory also documents the publisher source link and how the calls were read from the electropherogram image in Supplementary Dataset 7.

Important Assumptions

Input BAM/CRAM coordinates must be compatible with hg38.
The ExpansionHunter catalog is a targeted forensic-loci catalog, not a full forensic interpretation system.
CE comparison is allele-label comparison, not a legal identity test.
Some loci with complex repeat structure can disagree between short-read STR genotyping and CE sizing.

Generated Files

These directories are generated/downloaded and ignored by git:

alignments/
logs/
refs/
results/

Clean current sample outputs:

make -C forensic_loci clean

Remove generated outputs, downloaded reads, and downloaded references:

make -C forensic_loci realclean

License

This project is released under the MIT License. See LICENSE.

The included NA12878 CE reference-call documentation is kept for provenance and auditability. The third-party source image/PDF is not redistributed because its reuse license is not clearly open-source compatible. See NOTICE and ce_reference/README.md for source attribution and verification notes.

Papers And Data Sources

This workflow was informed by:

Direct-to-consumer examples mentioned above:

Sequencing.com whole genome sequencing: https://sequencing.com/
AlphaBiolabs single genetic profile testing: https://www.alphabiolabs.co.uk/public-testing-services/single-genetic-profile-testing/

Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907-915 (2019). DOI: 10.1038/s41587-019-0201-4.

Kim et al. used PowerPlex Fusion wet-lab DNA fingerprinting data for 17 Platinum Genomes and compared those results with HISAT-genotype calls. The NA12878 CE reference calls in this repository were transcribed from their Supplementary Dataset 7, described as "PowerPlex Fusion results for 17 PG genomes (raw signal image data)".

Oketch O, et al. A comparison of software for analysis of rare and common short tandem repeat (STR) variation using human genome sequences from clinical and population-based samples. PLOS ONE (2024). DOI: 10.1371/journal.pone.0300545.

Oketch et al. compared STR analysis tools, including ExpansionHunter, GangSTR, HipSTR, STRetch, STRling, and ExpansionHunter Denovo, and used the Kim et al. PowerPlex Fusion data as the CE reference for forensic STR comparisons.

AI Usage Acknowledgement

This pipeline and documentation were developed with assistance from OpenAI Codex, an AI coding assistant. Codex helped with Makefile structure, result parsing scripts, documentation wording, and literature/source tracing. The pipeline should still be reviewed, tested, and interpreted by a human operator, especially before using results for any decision involving personal identity, privacy, health, or legal matters.

Workflow Structure Acknowledgement

The Makefile-centered workflow and directory organization were inspired by the example project structure from The Biostar Handbook: https://www.biostarhandbook.com/. The Biostar Handbook emphasizes practical, reproducible bioinformatics workflows using command-line tools, Makefiles, and clear separation of inputs, references, scripts, and results.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Forensic Loci Pipeline

Getting Your Own Data

What It Produces

Setup

Requirements

Default NA12878 Run

Running Another Sample

CE Reference CSV Format

Important Assumptions

Generated Files

License

Papers And Data Sources

AI Usage Acknowledgement

Workflow Structure Acknowledgement

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
catalog		catalog
ce_reference		ce_reference
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
NOTICE		NOTICE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Forensic Loci Pipeline

Getting Your Own Data

What It Produces

Setup

Requirements

Default NA12878 Run

Running Another Sample

CE Reference CSV Format

Important Assumptions

Generated Files

License

Papers And Data Sources

AI Usage Acknowledgement

Workflow Structure Acknowledgement

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages