This project is a C# wrapper and set of SQL Server installation scripts to make the SimMetrics string matching algorithms available in SQL Server.
- SimMetrics was originally released at SourceForge. This project uses version 1.5 of that library. Subsequent versions were migrated to Java.
- The C# wrapper was inspired by this blog post.
Descriptions of the supported fuzzy string matching functions are provided on the wiki home page.
This project was motivated by the frequent need for fuzzy matching (approximate string matching) algorithms in data analytics and data science work. These algorithms are missing from SQL Server. Many projects do not have the time, licensing, or budget to install additional SQL Server packages such as SSIS. Furthermore, it is best to do as much data science work as possible through program code rather than manual graphical wizards, as outlined in the Guerrilla Analytics Principles. You can read more about Guerrilla Analytics in the book.
The project has minimal dependencies:
- C# compiler (csc.exe) version 3.5, to rebuild the project's C# files. This may come with your Microsoft SQL Server installation.
- Microsoft SQL Server, to install the functions into.
- Apache Ant, to build and install the library.
Installation, Configuration, Examples and How to Contribute
Installation and configuration are controlled by an Apache Ant build file. Configure your database settings and you should be good to go.
Please see the GitHub wiki page for details.
Simple Code Example
You can find the functions under a schema named after the Similarity library version, e.g.
Similarity_<Major version>_<minor version>_<patch version>.
To use these functions in SQL code, call the function by its fully qualified name, including the schema. For example:
SELECT SIMILARITY_1_1_0.Levenshtein('THE QUICK BROWN FOX','THE QUICK FOX')
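To give a feel for what a Levenshtein comparison of these two strings measures, here is a minimal Python sketch. It is not the library's code: the helper names `levenshtein_distance` and `levenshtein_similarity` are hypothetical, and the normalization to a 0 to 1 score (one minus the edit distance divided by the longer string's length) is an assumption about a common convention. Check the wiki for the scale the SQL function actually returns.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for
    strings with nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


if __name__ == "__main__":
    left, right = "THE QUICK BROWN FOX", "THE QUICK FOX"
    print(levenshtein_distance(left, right))              # 6
    print(round(levenshtein_similarity(left, right), 4))  # 0.6842
```

The distance is 6 because removing the six characters of "BROWN " from the longer string yields the shorter one. Under the assumed normalization the score is 1 - 6/19.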
For more detailed examples, please see the Quick Guide on the wiki.
The project as a whole is released under the GPLv3.