Fuzzy matching library for scientific names with emphasis on performance and scalability

Global Names Matcher

Global Names Matcher, or gnmatcher, is a Scala 2.10.3+ library for very fast fuzzy matching of a query string against a given set of strings.

Dependency Declaration for Java or Scala

The artifacts for gnmatcher live on Maven Central and can be declared as a dependency in the following ways:

SBT:

libraryDependencies += "org.globalnames" %% "gnmatcher" % "0.1.0"

Maven (use the artifact that matches your Scala version, 2.11 or 2.10):

<dependency>
    <groupId>org.globalnames</groupId>
    <artifactId>gnmatcher_2.11</artifactId>
    <version>0.1.0</version>
</dependency>

<dependency>
    <groupId>org.globalnames</groupId>
    <artifactId>gnmatcher_2.10</artifactId>
    <version>0.1.0</version>
</dependency>

Fuzzy Matching

To match a query string against a sequence of input strings, run the following:

$ sbt matcher/console
console> import org.globalnames.matcher.Matcher
console> val matcher = Matcher(Seq("Abdf", "Abce", "Dddd"), maxDistance = 2)
console> matcher.transduce("Abc")
res0: Seq[org.globalnames.Candidate] = Vector(Candidate(Abce,1), Candidate(Abdf,1))

The result contains only the Candidates whose edit distance to the query, counting merges and splits, is not greater than maxDistance.
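To illustrate why "Abdf" is reported at distance 1 from "Abc", here is a minimal standalone sketch of an edit distance with merges and splits. It is not gnmatcher's implementation. It assumes the common definition in which insertion, deletion, and substitution each cost 1, merging two characters into one costs 1, and splitting one character into two costs 1. The object name MergeSplitDistance is hypothetical.

```scala
// Illustrative sketch only; NOT the gnmatcher implementation.
// Assumption: merge (two chars -> one) and split (one char -> two) each cost 1,
// alongside the usual insert, delete, and substitute operations.
object MergeSplitDistance {
  def distance(a: String, b: String): Int = {
    // d(i)(j) = cost of turning the first i chars of a into the first j chars of b
    val d = Array.ofDim[Int](a.length + 1, b.length + 1)
    for (i <- 0 to a.length) d(i)(0) = i // delete everything
    for (j <- 0 to b.length) d(0)(j) = j // insert everything
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val subst = d(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
      var best = math.min(subst, math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1))
      if (i >= 2) best = math.min(best, d(i - 2)(j - 1) + 1) // merge
      if (j >= 2) best = math.min(best, d(i - 1)(j - 2) + 1) // split
      d(i)(j) = best
    }
    d(a.length)(b.length)
  }
}

object MergeSplitDemo extends App {
  val names = Seq("Abdf", "Abce", "Dddd")
  // Brute-force filter, shown only to mirror the maxDistance = 2 cut-off
  val close = names.filter(n => MergeSplitDistance.distance("Abc", n) <= 2)
  println(MergeSplitDistance.distance("Abc", "Abce")) // 1 (insert "e")
  println(MergeSplitDistance.distance("Abc", "Abdf")) // 1 (split "c" into "df")
  println(MergeSplitDistance.distance("Abc", "Dddd")) // 3, so it is filtered out
  println(close)
}
```

The brute-force filter compares the query with every name, which is exactly the work a prebuilt Matcher is meant to avoid; the sketch only clarifies what the reported distances mean.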

Dump and Restore

Fuzzy matching is very fast: it is theoretically proven to take constant time with respect to the query string. However, building the internal data structures for the input strings can take a long time. To avoid rebuilding the Matcher on every run, dump it to a file and restore it later as follows:

$ sbt matcher/console
console> import org.globalnames.matcher.Matcher
console> val matcher = Matcher(Seq("Abdf", "Abce", "Dddd"), maxDistance = 2)
console> matcher.dump(dumpPath = "matcher.ser")
console> val matcherRestored = Matcher.restore(dumpPath = "matcher.ser")
console> matcherRestored.transduce("Abc")
res0: Seq[org.globalnames.Candidate] = Vector(Candidate(Abce,1), Candidate(Abdf,1))
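In an application, the dump and restore steps above are often wrapped in a "restore if a dump exists, otherwise build and dump" helper. The following is a minimal sketch of that pattern. MatcherCache and loadOrBuild are hypothetical names and not part of gnmatcher. The helper is generic, so it makes no assumptions about the library beyond the three calls shown in the console session.

```scala
import java.io.File

// Illustrative sketch only; MatcherCache and loadOrBuild are hypothetical names.
object MatcherCache {
  /** Restores a value from dumpPath if that file exists;
    * otherwise builds it, dumps it to dumpPath, and returns it. */
  def loadOrBuild[M](dumpPath: String)(build: => M)(
      dump: (M, String) => Unit)(restore: String => M): M =
    if (new File(dumpPath).isFile) restore(dumpPath)
    else {
      val built = build      // the expensive step, run only when no dump exists
      dump(built, dumpPath)
      built
    }
}

// Possible wiring with gnmatcher, assuming the calls from the session above
// and that Matcher.restore returns a Matcher:
//
//   import org.globalnames.matcher.Matcher
//   val matcher = MatcherCache.loadOrBuild("matcher.ser")(
//     Matcher(Seq("Abdf", "Abce", "Dddd"), maxDistance = 2))(
//     (m, path) => m.dump(dumpPath = path))(
//     path => Matcher.restore(dumpPath = path))
```

A dump file produced this way goes stale when the input strings or maxDistance change, so delete it whenever either of them changes.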

Contributors

License

Released under the MIT license.