Skip to content

A small Java package for fuzzy string matching using Levenshtein distance.

License

Notifications You must be signed in to change notification settings

SpinningVinyl/FuzzyStrings

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FuzzyStrings

A small Java package for simple string matching. It was born when I needed something similar to The Fuzz, but for Java.

In my testing, the results it produces are extremely similar to results produced by The Fuzz (not surprising since both are based on the same underlying concepts). It also performed extremely well compared to the matcher built into a popular CAT software package.

Features

  • 100% Java;
  • does not have any dependency outside of the Java Class Library;
  • fully compatible with Java 11 and up.

Documentation

The source code is extensively documented using JavaDoc, but here is a quick recap.

Private methods

levenshtein()

private static int levenshtein(String s1, String s2)

Returns the Levenshtein distance between strings s1 and s2.

levenshteinToken()

private static int levenshteinToken(String[] tokens1, String[] tokens2)

Similar to levenshtein(), but based on tokens instead of individual characters. A token is an uninterrupted sequence of word characters (\w in regex parlance).

Public methods

ratio()

public static int ratio(String s1, String s2, boolean ignoreCase)

Returns similarity between s1 and s2 based on their Levenshtein distance. If ignoreCase is true, the comparison will be case-insensitive.

ratioToken()

public static int ratioToken(String s1, String s2, boolean ignoreCase)

Returns similarity between s1 and s2 based on levenshteinToken(). Returns -1 if both s1 and s2 do not contain any tokens.

ratioTokenSet()

public static int ratioTokenSet(String s1, String s2, boolean ignoreCase)

This method treats s1 and s2 as sets of tokens and returns their similarity based on the difference between those two sets. Returns -1 if both s1 and s2 do not contain any tokens.

Example:

int score = FuzzyStrings.ratioTokenSet("It is a beautiful day", "A beautiful day it is", true);
System.out.println(score); // 100

complexRatio()

public static int complexRatio(String s1, String s2, boolean ignoreCase)

Returns similarity between s1 and s2 based on ratio(), ratioToken() and ratioTokenSet().

matchOne()

public static StringMatch matchOne(String s, Collection<String> candidates,
                                       StringCompareFunction compareFunction, boolean ignoreCase)

Compares all strings in candidates to s and returns the best match.

FuzzyStrings::ratio, FuzzyStrings::ratioToken, FuzzyStrings::ratioTokenSet and FuzzyStrings::complexRatio can be used as compareFunction.

Example:

String s = "this is a test";
List<String> candidates = new ArrayList<>();
candidates.add("this is a pest");
candidates.add("not a match at all");
candidates.add("this is a bird's nest");
candidates.add("this is a test test");
candidates.add("this is a test!");
candidates.add("this is a Test");
candidates.add("this is a quick test");
StringMatch m = FuzzyStrings.matchOne(s, candidates, FuzzyStrings::complexRatio, true);
System.out.println(m.getScore() + ": " + m.getText()); // 100: this is a Test

matchAndSort()

public static List<StringMatch> matchAndSort(String s, Collection<String> candidates,
                                                 StringCompareFunction compareFunction, boolean ignoreCase) 

Compares all strings in candidates to s and returns matches sorted by score (in descending order).

FuzzyStrings::ratio, FuzzyStrings::ratioToken, FuzzyStrings::ratioTokenSet and FuzzyStrings::complexRatio can be used as compareFunction.

Example:

String s = "this is a test";
List<String> candidates = new ArrayList<>();
candidates.add("this is a pest");
candidates.add("not a match at all");
candidates.add("this is a bird's nest");
candidates.add("this is a test test");
candidates.add("this is a test!");
candidates.add("this is a Test");
candidates.add("this is a quick test");
List<StringMatch> results = FuzzyStrings.matchAndSort(s, candidates, FuzzyStrings::complexRatio, false);
for (StringMatch m : results) {
    System.out.println(m.getScore() + ": " + m.getText());
}

Output:

99: this is a test!
91: this is a test test
87: this is a quick test
86: this is a pest
86: this is a Test
69: this is a bird's nest
40: not a match at all

Helper class – StringMatch

public class StringMatch implements Comparable<StringMatch>

This class implements a string match. It contains both the text that has been matched and its score. The score shows how similar the text is to the string it has been compared to.

To get the text, use the getText() method:

StringMatch match = FuzzyStrings.matchOne(s, candidates, FuzzyStrings::complexRatio, false);
String matchedString = match.getText();

Similarly, to get the score, use the getScore() method:

int score = match.getScore();

Functional interface – StringCompareFunction

@FunctionalInterface
public interface StringCompareFunction {
    int compare(String s1, String s2, boolean ignoreCase);
}

License

This project is licensed under the terms of the MIT license. See LICENSE for details.

About

A small Java package for fuzzy string matching using Levenshtein distance.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages