
Speed improvement for gp_levenshtein(). #1408

Merged
merged 6 commits into GlotPress:develop on Apr 26, 2022

Conversation

@dd32 (Contributor) commented on Apr 14, 2022

gp_levenshtein():

  • Avoid the use of mb_substr() for significant speed improvements
  • Improve code readability; don't recalculate string length
  • Improve code readability; use a foreach loop rather than for + assignments

On longer strings (~500 chars), splitting them into an array before processing them character-by-character results in a 2x speed increase.

On my random WordPress.org page dataset, this takes gp_string_similarity() from roughly 700ms to about 310ms per call (gp_string_similarity() is basically just gp_levenshtein()).
This is a non-scientific measurement: re-running a script with ~200 calls to gp_string_similarity() multiple times while switching the code back and forth.

Some of these changes are not guaranteed to be faster in every scenario, but at least here foreach appears to be more efficient than a for loop with an additional assignment.
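Roughly, the idea behind the first and third bullets looks like this (illustrative sketch with made-up function names, not the exact diff in this PR):

```php
<?php
// Illustrative only; function names are made up, not the exact GlotPress diff.

// Before (sketch): every character access re-scans the multibyte string.
function chars_via_mb_substr( $str ) {
	$chars = array();
	for ( $i = 0, $len = mb_strlen( $str ); $i < $len; $i++ ) {
		$chars[] = mb_substr( $str, $i, 1 );
	}
	return $chars;
}

// After (sketch): split once up front, then use plain array access / foreach.
function chars_via_split( $str ) {
	return preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY );
}

foreach ( chars_via_split( 'héllo wörld' ) as $i => $char ) {
	// $i and $char come straight from the array; no per-character mb_*() calls.
}
```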

This was found when a .po import took 100% CPU for several minutes, caused by a project having a significant number of obsoleted originals, most of which were long strings, so the C implementation wasn't being used.

Note: I also went looking for more optimized variants of this algorithm. There's probably something more that could be done by removing gp_string_similarity()'s percentage calculation and passing the raw distance through to future calls, so they can abort early when a string is a long way away from the source string and a closer match has already been found.
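If someone picks that up later, the caller side could hypothetically look something like the sketch below. gp_levenshtein_capped() is an imagined variant (it does not exist in GlotPress) that would take the best distance found so far and abort early once a DP row minimum exceeds it; the built-in levenshtein() stands in for it here so the sketch runs.

```php
<?php
// Hypothetical sketch of the early-abort idea above; not part of this PR.
function find_closest_original( $translation, array $originals ) {
	$best_distance = PHP_INT_MAX;
	$best_original = null;

	foreach ( $originals as $original ) {
		// With a capped variant, the best distance so far would be passed down:
		// $distance = gp_levenshtein_capped( $translation, $original, $best_distance );
		$distance = levenshtein( $translation, $original );

		if ( $distance < $best_distance ) {
			$best_distance = $distance;
			$best_original = $original;
		}
	}

	return $best_original;
}

echo find_closest_original( 'color', array( 'colour', 'flavour', 'col' ) ); // "colour"
```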

The algorithm in use here is the iterative two-matrix-rows variant: https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows
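For reference, a minimal PHP sketch of that two-rows variant over pre-split character arrays (illustrative, not the exact gp_levenshtein() implementation):

```php
<?php
// Minimal sketch of the two-matrix-rows variant linked above, working on
// pre-split character arrays. Illustrative only, not the exact gp_levenshtein().
function levenshtein_two_rows( array $chars1, array $chars2 ) {
	$len2 = count( $chars2 );
	$prev = range( 0, $len2 ); // Distances from the empty prefix of string 1.

	foreach ( $chars1 as $i => $char1 ) {
		$curr = array( $i + 1 ); // Deleting the whole prefix processed so far.

		foreach ( $chars2 as $j => $char2 ) {
			$cost           = ( $char1 === $char2 ) ? 0 : 1;
			$curr[ $j + 1 ] = min(
				$curr[ $j ] + 1,     // Insertion.
				$prev[ $j + 1 ] + 1, // Deletion.
				$prev[ $j ] + $cost  // Substitution (or match).
			);
		}

		$prev = $curr; // The current row becomes the previous row.
	}

	return $prev[ $len2 ];
}

$a = preg_split( '//u', 'kitten', -1, PREG_SPLIT_NO_EMPTY );
$b = preg_split( '//u', 'sitting', -1, PREG_SPLIT_NO_EMPTY );
echo levenshtein_two_rows( $a, $b ); // 3
```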

…improvements

On longer strings (~500 chars), splitting them into an array before processing them character-by-character results in a 2x speed increase.

Takes the average run in my testing from 705ms to 341ms on my real-world random data set.
…at's not required. Might be slightly faster.
… loop to a foreach, which again is slightly faster. 341ms => 310ms
@dd32 (Contributor, Author) commented on Apr 14, 2022

Just noting that there are limited unit tests for this code branch currently: https://github.com/GlotPress/GlotPress/blob/develop/tests/phpunit/testcases/test_strings.php

No logical changes were made here, though, and the before/after return values still appear to be the same, so I haven't dug into adding proper unit tests, mostly due to personal time constraints.

@dd32 added the [Type] Performance and php labels on Apr 14, 2022
gp-includes/strings.php: 2 review threads (outdated, resolved)
@ocean90 added this to the 3.1 milestone on Apr 20, 2022
ocean90 and others added 2 commits on April 22, 2022
@ocean90 removed the php label on Apr 22, 2022
@ocean90 (Member) left a comment

🚀

@ocean90 merged commit 741b4dd into GlotPress:develop on Apr 26, 2022