Mark1708/ranking-str-data

ranking-str-data

Russian version

Local Java CLI for ranking Y-STR haplotypes against a selected base haplotype and appending TMRCA-related research metrics to a semicolon-separated CSV file.

| Area | Details |
| --- | --- |
| Runtime | Java 21, Maven |
| Entry point | ranking.Main |
| Artifact | Maven artifactId ranking, final jar target/ranking.jar |
| Data engine | Spark 4.1.2 with Scala 2.13, Hadoop 3.5.0 |
| CLI parser | JCommander 1.82 |
| Logging | Log4j 2.26.0 |
| Test and QA setup | JUnit 6.1.0, AssertJ 3.27.7, JaCoCo 0.8.14, SpotBugs, FindSecBugs, Spotless |
| Sample data | assets/DataSet.csv, assets/RankedData.csv |

Goal

The project was built for a genetics research workflow. It reads Y-STR haplotype rows, compares each row with a uniquely selected base haplotype, computes research metrics, and writes a ranked CSV next to the input file.

This is a local research tool, not a clinical or diagnostic system.

Research model

The code implements a linear Y-STR TMRCA calculation based on the Klyosov (2009a) method with back-mutation correction.

  1. ASD = sum((ref_i - comp_i)^2) / n, where n is the number of comparable loci.
  2. TMRCA = averageAge * ASD / mutationRate, reported in years.
  3. lambda = mutationRate * T_generations, where T_generations = TMRCA / averageAge.
  4. k = (lambda / 2) * (1 + exp(-lambda)), the Klyosov forward formula for observed mutation steps with back-mutation correction.

The default mutation rate is 0.0026 per locus per generation, based on Ballantyne et al. 2010 as a weighted average across 186 Y-STR markers. It can be changed with --mu.
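The four steps above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the repository's code: the class name TmrcaSketch, the method names, and the sample locus values are all hypothetical, and missing loci are represented as null so they can be skipped pairwise.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the formulas in this section; names are illustrative only.
class TmrcaSketch {

    /** Step 1: average squared distance over loci where both values are present. */
    static double asd(List<Integer> ref, List<Integer> comp) {
        int comparable = 0;
        double sumSquares = 0.0;
        int loci = Math.min(ref.size(), comp.size());
        for (int i = 0; i < loci; i++) {
            Integer r = ref.get(i);
            Integer c = comp.get(i);
            if (r == null || c == null) {
                continue; // missing value on either side: locus is not comparable
            }
            double d = r - c;
            sumSquares += d * d;
            comparable++;
        }
        if (comparable == 0) {
            throw new IllegalArgumentException("no comparable loci");
        }
        return sumSquares / comparable;
    }

    /** Step 2: TMRCA in years. */
    static double tmrcaYears(double asd, double averageAge, double mutationRate) {
        return averageAge * asd / mutationRate;
    }

    /** Step 3: lambda = mutationRate * T_generations. */
    static double lambda(double tmrcaYears, double averageAge, double mutationRate) {
        double generations = tmrcaYears / averageAge;
        return mutationRate * generations;
    }

    /** Step 4: k = (lambda / 2) * (1 + exp(-lambda)). */
    static double k(double lambda) {
        return (lambda / 2.0) * (1.0 + Math.exp(-lambda));
    }

    public static void main(String[] args) {
        // Made-up locus values; the fourth locus is missing in the reference row.
        List<Integer> ref = Arrays.asList(13, 24, 14, null, 11);
        List<Integer> comp = Arrays.asList(13, 25, 14, 12, 10);
        double averageAge = 25.0;     // assumed generation length in years
        double mutationRate = 0.0026; // default documented above

        double asd = asd(ref, comp);  // 2 squared steps over 4 comparable loci = 0.5
        double tmrca = tmrcaYears(asd, averageAge, mutationRate);
        double lam = lambda(tmrca, averageAge, mutationRate);
        System.out.printf("ASD=%.4f TMRCA=%.1f lambda=%.4f k=%.4f%n", asd, tmrca, lam, k(lam));
    }
}
```

Substituting step 2 into step 3 shows that lambda is numerically equal to ASD in this model, so averageAge affects only the TMRCA value in years.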

CLI usage

Build the executable jar:

mvn clean package

Run with required flags:

java -jar target/ranking.jar -p assets/DataSet.csv -i indexOfHaplotype -a averageAge

Run with a custom mutation rate:

java -jar target/ranking.jar -p assets/DataSet.csv -i indexOfHaplotype -a averageAge --mu 0.0024

Show help:

java -jar target/ranking.jar -h

Supported flags:

| Flag | Meaning |
| --- | --- |
| -p, --path | Path to the input CSV file |
| -i, --index | Index value of the base haplotype |
| -a, --age | Average generation age used in the TMRCA formula |
| --mu | Mutation rate per locus per generation, default 0.0026 |
| -h, --help | Print CLI help |

Reproducibility

The expected local build output is:

target/ranking.jar

The jar is executable because the Maven assembly configuration points to ranking.Main, and the project final name is ranking.

The application starts Spark in local[1] mode, disables the Spark UI, reads CSV rows through Spark, then loads all rows to the driver for ranking.

Sample files: assets/DataSet.csv (sample input) and assets/RankedData.csv (sample ranked output).

The filenames use the spelling currently present in the repository.

Inputs and outputs

Input requirements:

  • CSV separator is a semicolon.
  • The first column must be exactly Index.
  • Index values must be unique. Duplicate values are rejected.
  • The base haplotype passed through -i or --index must exist exactly once.
  • Locus values must parse as integers when present.
  • Null or blank locus values are skipped pairwise during comparison.
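The input rules above can be illustrated with a small stdlib-only check. This is a sketch under assumptions, not the project's validation code (which reads the CSV through Spark): the class InputCheckSketch, its method names, and the error messages are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the documented input requirements.
class InputCheckSketch {

    /**
     * Checks the header and Index uniqueness, then returns the position
     * (1-based data row within the line list) of the base haplotype.
     */
    static int findBaseRow(List<String> lines, String baseIndex) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("empty input");
        }
        // Limit -1 keeps trailing empty cells, so blank loci are not dropped.
        String[] header = lines.get(0).split(";", -1);
        if (!"Index".equals(header[0])) {
            throw new IllegalArgumentException("first column must be exactly Index");
        }
        Set<String> seen = new HashSet<>();
        int baseRow = -1;
        for (int row = 1; row < lines.size(); row++) {
            String index = lines.get(row).split(";", -1)[0];
            if (!seen.add(index)) {
                throw new IllegalArgumentException("duplicate Index: " + index);
            }
            if (index.equals(baseIndex)) {
                baseRow = row;
            }
        }
        if (baseRow < 0) {
            throw new IllegalArgumentException("base haplotype not found: " + baseIndex);
        }
        return baseRow;
    }

    /** Parses one locus cell; null or blank means the locus is missing. */
    static Integer parseLocus(String cell) {
        if (cell == null || cell.isBlank()) {
            return null;
        }
        return Integer.valueOf(cell.trim()); // throws NumberFormatException for non-integers
    }
}
```

Because duplicates are rejected before the lookup finishes, a base Index that is found is guaranteed to occur exactly once.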

Output behavior:

  • The output file is named RankedData.csv.
  • It is written to the same directory as the input file.
  • Existing input columns are preserved.
  • The code appends these metrics: TMRCA, Average number of actual mutations(lambda), Average number of mutation steps(k).
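One way to derive the output location and header described above is sketched below. The class and method names are hypothetical, and the order of the appended columns is assumed to match the order in which they are listed here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the documented output naming and header behavior.
class OutputPathSketch {

    /** RankedData.csv in the same directory as the input file. */
    static Path rankedOutputFor(Path input) {
        return input.toAbsolutePath().resolveSibling("RankedData.csv");
    }

    /** Keeps the existing header and appends the three metric columns. */
    static String appendMetricHeaders(String inputHeader) {
        return String.join(";",
                inputHeader,
                "TMRCA",
                "Average number of actual mutations(lambda)",
                "Average number of mutation steps(k)");
    }

    public static void main(String[] args) {
        System.out.println(rankedOutputFor(Paths.get("assets/DataSet.csv")));
        System.out.println(appendMetricHeaders("Index;DYS393;DYS390"));
    }
}
```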

Limitations

  • The formulas are research assumptions and are not clinically validated for diagnostic use.
  • Formula changes should be reviewed by a domain expert.
  • Spark runs in local[1] mode for this CLI workflow.
  • Ranking loads all rows to the driver, so available JVM memory limits dataset size.
  • All CSV columns are read as strings before parsing metrics, which helps preserve exact Index values.
  • Back-mutation correction assumes low mutation rates. Accuracy degrades for larger lambda values.
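As a rough illustration of the last point, the formula for k from the research model can be expanded for small lambda. This is only a worked expansion of the stated formula; it shows how far k moves away from lambda and does not by itself quantify the accuracy of the correction.

```latex
k = \frac{\lambda}{2}\left(1 + e^{-\lambda}\right)
  = \lambda - \frac{\lambda^{2}}{2} + \frac{\lambda^{3}}{4} + O(\lambda^{4}),
\qquad
\frac{\lambda - k}{\lambda} = \frac{1 - e^{-\lambda}}{2}
```

For lambda = 0.1 the relative gap between lambda and k is about 4.8%, while for lambda = 1 it is about 31.6%.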

References

Status

Research/educational project. Results, dependencies, and runtime assumptions are documented for reproducibility, but the repository is not maintained as a packaged product.