Similarity or Distance Metrics, e.g. Levenshtein, for Java
Java

SimMetrics

A Java library of similarity and distance metrics, e.g. Levenshtein distance and cosine similarity. All similarity metrics return normalized values between 0 and 1 rather than unbounded similarity scores. Distance metrics return non-negative, unbounded scores.
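To make the difference between a distance and a normalized similarity concrete, here is a minimal plain-Java sketch. It does not use SimMetrics and is not the library's implementation. The class and method names are hypothetical, and the normalization (one minus the distance divided by the longer string's length) is an assumption chosen for illustration.

```java
// Illustrative sketch only: plain-Java Levenshtein, not SimMetrics' own code.
final class DistanceVsSimilarity {

    /** Unbounded, non-negative edit distance (insert, delete, substitute; cost 1 each). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the last completed row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1 - distance / max length, which lies in [0, 1]. */
    static float similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0f : 1.0f - (float) levenshtein(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
        System.out.println(similarity("kitten", "sitting"));  // about 0.5714
    }
}
```

The distance grows with the length of the inputs, while the similarity stays between 0 (nothing in common) and 1 (identical), which makes scores comparable across string pairs of different lengths.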

Usage

For quick and easy use, StringMetrics and StringDistances provide a collection of well-known similarity and distance metrics.

    String str1 = "This is a sentence. It is made of words";
    String str2 = "This sentence is similar. It has almost the same words";

    StringMetric metric = StringMetrics.cosineSimilarity();

    float result = metric.compare(str1, str2); //0.4767

The StringMetricBuilder and StringDistanceBuilder are convenience tools to build string similarity and distance metrics. Any class implementing Metric or Distance respectively can be used to build a metric. The builders support simplification, tokenization, token-filtering, token-transformation, and caching. For usage see the examples section.

For a terser syntax, use import static org.simmetrics.builders.StringMetricBuilder.with;

    String str1 = "This is a sentence. It is made of words";
    String str2 = "This sentence is similar. It has almost the same words";

    StringMetric metric =
            with(new CosineSimilarity<String>())
            .simplify(Simplifiers.toLowerCase(Locale.ENGLISH))
            .simplify(Simplifiers.replaceNonWord())
            .tokenize(Tokenizers.whitespace())
            .build();

    float result = metric.compare(str1, str2); //0.5720
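To show what such a pipeline computes, the following plain-Java sketch mimics the steps by hand: lowercase, replace non-word characters with spaces, split on whitespace, then take the cosine of the two token-count vectors. It is not SimMetrics code. The class and method names are hypothetical, and using the regex `\W` as a stand-in for Simplifiers.replaceNonWord() is an assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a hand-rolled stand-in for the builder pipeline.
final class CosinePipelineSketch {

    /** Optionally simplifies the input, then counts whitespace-separated tokens. */
    static Map<String, Integer> tokenCounts(String s, boolean simplify) {
        if (simplify) {
            // Assumed equivalent of toLowerCase(Locale.ENGLISH) + replaceNonWord()
            s = s.toLowerCase(Locale.ENGLISH).replaceAll("\\W", " ");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Euclidean length of a token-count vector. */
    static double norm(Map<String, Integer> counts) {
        double sum = 0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity of two token-count vectors: dot product over product of norms. */
    static float cosine(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f; // two empty inputs are treated as identical
        }
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0f;
        }
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += (double) e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        return (float) (dot / (norm(a) * norm(b)));
    }

    static float compare(String a, String b, boolean simplify) {
        return cosine(tokenCounts(a, simplify), tokenCounts(b, simplify));
    }

    public static void main(String[] args) {
        String str1 = "This is a sentence. It is made of words";
        String str2 = "This sentence is similar. It has almost the same words";
        System.out.println(compare(str1, str2, true));  // about 0.572
        System.out.println(compare(str1, str2, false)); // about 0.477
    }
}
```

With simplification the two strings share six token occurrences ("is" counts twice), giving 6 / sqrt(11 * 10), or about 0.572. Without it, "sentence." and "sentence" no longer match, and the score drops to 5 / sqrt(110), or about 0.477, in line with the results quoted in the two examples above.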