
Improvements to string comparison efficiency. #38

merged 2 commits into from Oct 6, 2017



@joelbecker joelbecker commented Jul 28, 2017

This pull request changes the data structures used within the recordlinkage string comparison algorithms. Instead of concatenating two series into a pandas.DataFrame and applying a function to its rows, these algorithms now apply the same function to a pandas.Series of tuples. These changes were minimally invasive, but appear to drastically improve efficiency for some algorithms.
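The data-structure change can be illustrated with a small, self-contained sketch (the `similarity` function and the data below are hypothetical stand-ins, not the actual recordlinkage internals):

```python
import pandas as pd

def similarity(pair):
    # Toy pairwise comparison function: exact-match indicator.
    # The real recordlinkage algorithms compute string distances here.
    s1, s2 = pair
    return 1.0 if s1 == s2 else 0.0

left = pd.Series(["alice", "bob", "carol"])
right = pd.Series(["alice", "bobby", "carol"])

# Original approach: concatenate both series into a DataFrame and
# apply the comparison function row-wise (axis=1), which builds a
# Series object for every row.
df = pd.concat([left, right], axis=1, ignore_index=True)
row_result = df.apply(lambda row: similarity((row[0], row[1])), axis=1)

# Modified approach: zip the two series into a single Series of
# (left, right) tuples and apply the same function element-wise.
pairs = pd.Series(list(zip(left, right)))
tuple_result = pairs.apply(similarity)

# Both approaches produce identical results.
assert row_result.tolist() == tuple_result.tolist() == [1.0, 0.0, 1.0]
```

The element-wise apply avoids the per-row Series construction that makes `DataFrame.apply(..., axis=1)` expensive, which is consistent with the speedups reported below.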

The table below shows the results of an informal performance experiment run on my 2012 MacBook Pro (Intel Core i7-3615QM @ 2.30GHz, 16 GB RAM). Each comparison algorithm was applied to a set of ~200,000 candidate links using both the original and the modified implementation. Each trial was repeated five times, with very little variance between replications. The results show substantial performance improvements in every string comparison method that was modified (note that the qgram and cosine similarities were not changed and were therefore not assessed).

This table shows the mean time in seconds (i.e. the average observed across five trials) to compare 200,000 candidate links with the original and modified algorithms:

| Metric | Trial | Mean (seconds) |
| --- | --- | ---: |
| damerau_levenshtein | new | 4.35562 |
| damerau_levenshtein | original | 43.4045 |
| jaro | new | 0.334586 |
| jaro | original | 29.488 |
| jaro_winkler | new | 0.315651 |
| jaro_winkler | original | 30.9617 |
| lcs | new | 14.0373 |
| lcs | original | 71.3062 |
| levenshtein | new | 4.09868 |
| levenshtein | original | 42.8332 |
| smith_waterman | new | 23.0192 |
| smith_waterman | original | 47.7344 |
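An informal timing harness like the one behind the table above could look roughly as follows (this is a hypothetical reconstruction with a toy comparison function and a scaled-down pair count, not the script actually used):

```python
import timeit

import pandas as pd

def similarity(pair):
    # Toy stand-in for a recordlinkage string comparison algorithm.
    s1, s2 = pair
    return 1.0 if s1 == s2 else 0.0

n = 10_000  # scaled down from the ~200,000 pairs in the experiment
left = pd.Series(["record%d" % i for i in range(n)])
right = pd.Series(["record%d" % (i % 2) for i in range(n)])

# Original data structure: a two-column DataFrame, compared row-wise.
df = pd.concat([left, right], axis=1, ignore_index=True)
# Modified data structure: a Series of (left, right) tuples.
pairs = pd.Series(list(zip(left, right)))

# Repeat each trial five times and keep the best run, mirroring the
# five-replication setup described above.
t_original = min(timeit.repeat(
    lambda: df.apply(lambda row: similarity((row[0], row[1])), axis=1),
    repeat=5, number=1))
t_new = min(timeit.repeat(
    lambda: pairs.apply(similarity),
    repeat=5, number=1))

print("original: %.4fs, new: %.4fs" % (t_original, t_new))
```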

Note that due to technical problems, I was unable to run formal benchmarks using asv on my own machine. @J535D165, if these changes are merged, it would be good to get a more formal / comparable benchmark using asv on whatever machine you use for benchmarking.


@codecov-io codecov-io commented Jul 28, 2017

Codecov Report

Merging #38 into master will decrease coverage by 0.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #38      +/-   ##
- Coverage   86.68%   86.66%   -0.02%     
  Files          20       20              
  Lines        1449     1447       -2     
  Branches      265      265              
- Hits         1256     1254       -2     
  Misses        127      127              
  Partials       66       66
Impacted Files Coverage Δ
recordlinkage/algorithms/ 88.65% <100%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 30312c0...9d0c88a. Read the comment docs.


@J535D165 J535D165 commented Jul 30, 2017

You are going to make a lot of users very happy with this PR!!! Thanks for finding this bottleneck.

By the end of this week (I am on holiday without a laptop and with poor internet), I will do some performance testing myself and then merge the PR.

(Good catch that the store parameter isn't working well. I am removing it in the new version of the Compare class.)


@joelbecker joelbecker commented Sep 25, 2017

@J535D165 Have you had a chance to review/benchmark these changes?

@J535D165 J535D165 merged commit f465a94 into J535D165:master Oct 6, 2017
1 check was pending
3 participants