Skip to content

Distance related functions (Damerau-Levenshtein, Jaro-Winkler , longest common substring & subsequence) implemented as SQLite run-time loadable extension. Any UTF-8 strings are supported.

License

Notifications You must be signed in to change notification settings

GerHobbelt/distlib

 
 

Repository files navigation

distlib

Alt text

Distance related functions implemented as SQLite run-time loadable extension. Any UTF-8 strings are supported.

Functions provided:

SQLite function Description
ldist(x,y) Levenshtein distance as integer >= 0
lsim(x,y) Levenshtein similarity as double between 0.0 and 1.0
dldist(x,y) Damerau–Levenshtein distance as integer >= 0
dlsim(x,y) Damerau–Levenshtein similarity as double between 0.0 and 1.0
jsim(x,y) Jaro similarity as double between 0.0 and 1.0
jwsim(x,y) Jaro-Winkler similarity as double between 0.0 and 1.0
lcstr(x,y) Longest common substring
lcstrl(x,y) Longest common substring length
lcseq(x,y) Longest common subsequence
lcseql(x,y) Longest common subsequence length
perm(x) Permutation table-valued function for strings up to len(x)>=10
subseq(x) Subsequences table-valued function for strings up to len(x)>=16

Examples:

Query example Result example
select ldist('kitten','sitting'); 3
select lsim('kitten','sitting'); 0.571428571428571
select ldist('a cat','an act'); 3
select dldist('a cat','an act'); 2
select lsim('a cat','an act'); 0.5
select lsim('a cät','an äct'); 0.5
select dlsim('a cat','an act'); 0.666666666666667
select ldist('fish', 'ifsh'); 2
select dldist('fish', 'ifsh'); 1
select ldist('kitten','sitting'); 3
select jsim('Clark Kent','Claire Kent'); 0.869023569023569
select jwsim('Peter Parker','Pete Parker'); 0.928787878787879
select jwsim('Péter Pärker','Péte Pärker'); 0.928787878787879
select lcstr('carpenter', 'sharpener'); arpen
select lcstrl('carpenter', 'sharpener'); 5
select lcseq('cärpenter', 'shärpener'); ärpener
select lcseql('cärpenter', 'shärpener'); 7
select * from perm('ABCDEFGH'); ABCDEFGH,ABCDEFHG, ... n! rows
select * from perm ('Smørrebrød'); Smørrebrød,Smørrebødr, ... n! rows
select * from subseq('ABCDEFGH'); A,AB,ABC, ... max. 2^n-1 rows

Bulk examples:

WITH orig_ AS (SELECT 'Geräte' AS orig),
perm_ AS (SELECT orig, permut FROM perm(orig) JOIN orig_ ON 1=1)
SELECT * FROM (
SELECT orig, permut, lcstr(orig, permut) lcstr, lcstrl(orig, permut) lcstrl,
lcseq(orig, permut) lcseq, lcseql(orig, permut) lcseql FROM perm_
) ORDER BY lcstrl desc, lcseql desc;

DROP TABLE IF EXISTS perm_gen;
CREATE TABLE IF NOT EXISTS perm_gen AS
WITH orig_ AS (SELECT 'Z''Hansrüed' AS orig),
perm_ AS (SELECT orig, permut FROM perm(orig) JOIN orig_ ON 1=1)
SELECT * FROM perm_;

DROP TABLE IF EXISTS perm_test;
CREATE TABLE IF NOT EXISTS perm_test AS
SELECT orig, permut, lcstr(orig, permut) lcstr, lcstrl(orig, permut) lcstrl, lsim(orig, permut) lsim,
dlsim(orig, permut) dlsim, jsim(orig, permut) jsim, jwsim(orig, permut) jwsim FROM perm_gen;

DROP TABLE IF EXISTS subseq_gen;
CREATE TABLE IF NOT EXISTS subseq_gen AS
WITH orig_ AS (SELECT 'Lückenbüßer' AS orig),
subseq_ AS (SELECT orig, subseq FROM subseq(orig) JOIN orig_ ON 1=1)
SELECT * FROM subseq_;

DROP TABLE IF EXISTS subseq_test;
CREATE TABLE IF NOT EXISTS subseq_test AS
SELECT orig, subseq, lcstr(orig, subseq) lcstr, lcstrl(orig, subseq) lcstrl, lsim(orig, subseq) lsim,
dlsim(orig, subseq) dlsim, jsim(orig, subseq) jsim, jwsim(orig, subseq) jwsim from subseq_gen;

How to build the .dll/.so library?

The library was primary built on Windows 10 64-bit with Visual Studio Community 2019 and tested with DB Browser for SQLite.
Smoke testing was also done on Linux Mint 19.3 "Tricia" and the latest Raspberry Pi OS on a Raspberry Pi 400.

Compile on Windows by reviewing & calling win64build_extensions_generic.bat :

cl /EHsc /FojaroWinkler.obj /c jaroWinkler.cpp
cl /EHsc /Fopylcs.obj /c pylcs.cpp
cl /EHsc /Fdldist.obj /c dldist.cpp
cl /EHsc /Flcsubstr.obj /c lcsubstr.cpp
cl /EHsc /Foperm.obj /c perm.cpp
cl /EHsc /Fosubseq.obj /c subseq.cpp
cl /EHsc /FoRegistExt.obj /c RegistExt.cpp
cl /EHsc /Foutf8_unicode.obj /c utf8_unicode.cpp
link /DLL /OUT:distlib_64.dll utf8_unicode.obj RegistExt.obj perm.obj subseq.obj jaroWinkler.obj pylcs.obj dldist.obj lcsubstr.obj

Compile on Linux/Raspberry Pi OS in your default shell :

chmod 755 linux_build_extensions.sh
./linux_build_extensions.sh

About

Distance related functions (Damerau-Levenshtein, Jaro-Winkler , longest common substring & subsequence) implemented as SQLite run-time loadable extension. Any UTF-8 strings are supported.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C++ 98.0%
  • Batchfile 1.7%
  • Shell 0.3%