Local Java CLI for ranking Y-STR haplotypes against a selected base haplotype and appending TMRCA-related research metrics to a semicolon-separated CSV file.
| Area | Details |
|---|---|
| Runtime | Java 21, Maven |
| Entry point | ranking.Main |
| Artifact | Maven artifactId ranking, final jar target/ranking.jar |
| Data engine | Spark 4.1.2 with Scala 2.13, Hadoop 3.5.0 |
| CLI parser | JCommander 1.82 |
| Logging | Log4j 2.26.0 |
| Test and QA setup | JUnit 6.1.0, AssertJ 3.27.7, JaCoCo 0.8.14, SpotBugs, FindSecBugs, Spotless |
| Sample data | assets/DataSet.csv, assets/RankedData.csv |
The project was built for a genetics research workflow. It reads Y-STR haplotype rows, compares each row with a uniquely selected base haplotype, computes research metrics, and writes a ranked CSV next to the input file.
This is a local research tool, not a clinical or diagnostic system.
The code implements a linear Y-STR TMRCA calculation based on the Klyosov (2009a) method with back-mutation correction.
- `ASD = sum((ref_i - comp_i)^2) / n`, where `n` is the number of comparable loci.
- `TMRCA = averageAge * ASD / mutationRate`, reported in years.
- `lambda = mutationRate * T_generations`, where `T_generations = TMRCA / averageAge`.
- `k = (lambda / 2) * (1 + exp(-lambda))`, the Klyosov forward formula for observed mutation steps with back-mutation correction.
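The four formulas above can be sketched in plain Java as follows. This is a minimal illustration, not the repository's code: the class name `TmrcaSketch`, its method names, and the choice to throw when no loci are comparable are all assumptions made for the example.

```java
import java.util.Locale;

/** Hypothetical helper; names and error handling are illustrative assumptions. */
final class TmrcaSketch {

    /** ASD over loci where both values are present; null entries are skipped pairwise. */
    static double asd(Integer[] ref, Integer[] comp) {
        int comparable = 0;
        long sumOfSquares = 0;
        int length = Math.min(ref.length, comp.length);
        for (int i = 0; i < length; i++) {
            if (ref[i] == null || comp[i] == null) {
                continue; // locus missing on one side, so it is not comparable
            }
            long diff = ref[i] - comp[i];
            sumOfSquares += diff * diff;
            comparable++;
        }
        if (comparable == 0) {
            // Assumption: the real tool may handle this case differently.
            throw new IllegalArgumentException("no comparable loci");
        }
        return (double) sumOfSquares / comparable;
    }

    /** TMRCA in years: averageAge * ASD / mutationRate. */
    static double tmrcaYears(double asd, double averageAge, double mutationRate) {
        return averageAge * asd / mutationRate;
    }

    /** lambda = mutationRate * T_generations, with T_generations = TMRCA / averageAge. */
    static double lambda(double tmrcaYears, double averageAge, double mutationRate) {
        return mutationRate * (tmrcaYears / averageAge);
    }

    /** k = (lambda / 2) * (1 + exp(-lambda)). */
    static double k(double lambda) {
        return (lambda / 2.0) * (1.0 + Math.exp(-lambda));
    }

    public static void main(String[] args) {
        Integer[] ref = {13, 24, 14};
        Integer[] comp = {13, 25, null}; // third locus is skipped
        double averageAge = 25.0;        // example value only
        double mu = 0.0026;              // default rate stated in this README
        double asd = asd(ref, comp);
        double tmrca = tmrcaYears(asd, averageAge, mu);
        double lambda = lambda(tmrca, averageAge, mu);
        System.out.printf(Locale.ROOT, "ASD=%.4f TMRCA=%.1f lambda=%.4f k=%.4f%n",
                asd, tmrca, lambda, k(lambda));
    }
}
```

Substituting the second formula into the third shows that `lambda` equals `ASD` algebraically, so under these definitions the mutation rate and generation age affect only the reported `TMRCA`.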
The default mutation rate is 0.0026 per locus per generation, based on Ballantyne et al. 2010 as a weighted average across 186 Y-STR markers. It can be changed with --mu.
Build the executable jar:

    mvn clean package

Run with required flags:

    java -jar target/ranking.jar -p assets/DataSet.csv -i indexOfHaplotype -a averageAge

Run with a custom mutation rate:

    java -jar target/ranking.jar -p assets/DataSet.csv -i indexOfHaplotype -a averageAge --mu 0.0024

Show help:

    java -jar target/ranking.jar -h

Supported flags:
| Flag | Meaning |
|---|---|
| `-p`, `--path` | Path to the input CSV file |
| `-i`, `--index` | Index value of the base haplotype |
| `-a`, `--age` | Average generation age used in the TMRCA formula |
| `--mu` | Mutation rate per locus per generation, default 0.0026 |
| `-h`, `--help` | Print CLI help |
The expected local build output is:
target/ranking.jar
The jar is executable because the Maven assembly configuration points to ranking.Main, and the project final name is ranking.
The application starts Spark in local[1] mode, disables the Spark UI, reads CSV rows through Spark, then loads all rows to the driver for ranking.
Sample files:
- Input sample: `assets/DataSet.csv`
- Ranked sample: `assets/RankedData.csv`
Screenshots:
The filenames use the spelling currently present in the repository.
Input requirements:
- CSV separator is a semicolon.
- The first column must be exactly `Index`.
- `Index` values must be unique. Duplicate values are rejected.
- The base haplotype passed through `-i` or `--index` must exist exactly once.
- Locus values must parse as integers when present.
- Null or blank locus values are skipped pairwise during comparison.
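The input rules above could be checked along these lines. This is a sketch in plain Java rather than the Spark-based reader the tool uses. The class `InputCheckSketch`, the exception types, and the assumption that every column after `Index` is a locus are hypothetical, and the locus names in the usage note are sample data only.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical validator; not the repository's implementation. */
final class InputCheckSketch {

    /**
     * Validates semicolon-separated lines (header first) against the stated rules.
     * Assumption: every column after the first holds a locus value.
     */
    static void validate(List<String> lines, String baseIndex) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("input has no header row");
        }
        String[] header = lines.get(0).split(";", -1);
        if (!"Index".equals(header[0])) {
            throw new IllegalArgumentException("first column must be exactly Index");
        }
        Set<String> seen = new HashSet<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(";", -1); // -1 keeps trailing blank cells
            if (!seen.add(cells[0])) {
                throw new IllegalArgumentException("duplicate Index: " + cells[0]);
            }
            for (int c = 1; c < cells.length; c++) {
                String cell = cells[c].trim();
                if (!cell.isEmpty()) {
                    // Throws NumberFormatException (an IllegalArgumentException) on bad input.
                    Integer.parseInt(cell);
                }
            }
        }
        // Uniqueness is already enforced, so presence implies "exactly once".
        if (!seen.contains(baseIndex)) {
            throw new IllegalArgumentException("base haplotype not found: " + baseIndex);
        }
    }
}
```

For example, `InputCheckSketch.validate(List.of("Index;DYS393;DYS390", "H1;13;24", "H2;;25"), "H1")` returns normally, while a second row that also starts with `H1` is rejected as a duplicate.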
Output behavior:
- The output file is named `RankedData.csv`.
- It is written to the same directory as the input file.
- Existing input columns are preserved.
- The code appends these metrics: `TMRCA`, `Average number of actual mutations(lambda)`, `Average number of mutation steps(k)`.
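As a rough illustration of the output rules above, the sketch below derives the output location from the input path and extends a header row with the three metric columns. The class `OutputSketch` and its methods are hypothetical, and it assumes the output keeps the semicolon separator of the input.

```java
import java.nio.file.Path;

/** Hypothetical helper showing where the ranked file lands and how the header grows. */
final class OutputSketch {

    static final String OUTPUT_NAME = "RankedData.csv";

    static final String[] METRIC_COLUMNS = {
        "TMRCA",
        "Average number of actual mutations(lambda)",
        "Average number of mutation steps(k)"
    };

    /** Resolves RankedData.csv in the directory that contains the input file. */
    static Path outputPath(Path input) {
        Path absolute = input.toAbsolutePath();
        Path parent = absolute.getParent();
        if (parent == null) {
            // Only a filesystem root has no parent; an input file path always has one.
            throw new IllegalArgumentException("input path has no parent directory: " + input);
        }
        return parent.resolve(OUTPUT_NAME);
    }

    /** Keeps the existing header and appends the metric columns (semicolon separator assumed). */
    static String extendHeader(String inputHeader) {
        return inputHeader + ";" + String.join(";", METRIC_COLUMNS);
    }
}
```

With the sample input `assets/DataSet.csv`, `outputPath` resolves to `RankedData.csv` inside the same `assets` directory, which matches the sample file names listed in this README.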
- The formulas are research assumptions and are not clinically validated for diagnostic use.
- Formula changes should be reviewed by a domain expert.
- Spark is used in local mode (`local[1]`) for this CLI workflow.
- Ranking loads all rows to the driver, so available JVM memory limits dataset size.
- All CSV columns are read as strings before parsing metrics, which helps preserve exact `Index` values.
- Back-mutation correction assumes low mutation rates. Accuracy degrades for larger `lambda` values.
- TMRCA
- Poisson distribution
- Y-STR haplotypes
- Klyosov AN. "Haplogroup R1a" (2009a)
- Ballantyne KN et al. "Mutability of Y-chromosomal microsatellites" (2010)
Research/educational project. Results, dependencies, and runtime assumptions are documented for reproducibility, but the repository is not maintained as a packaged product.


