
Speed improvement for gp_levenshtein(). #1408

Merged
merged 6 commits into GlotPress:develop on Apr 26, 2022

Conversation

@dd32 (Contributor) commented on Apr 14, 2022

gp_levenshtein():

  • Avoid the use of mb_substr() for significant speed improvements
  • Improve code readability; don't recalculate string length
  • Improve code readability; use a foreach loop rather than for + assignments

On longer strings (~500 chars), splitting them into an array before processing them character-by-character results in a 2x speed increase.

On my random WordPress.org page dataset, this takes gp_string_similarity() from roughly 700ms to about 310ms per call (gp_string_similarity() is basically just gp_levenshtein()).
This is a non-scientific measurement: re-running a script with ~200 calls to gp_string_similarity() multiple times while switching the code back and forth.

Some of these changes are not guaranteed to be faster in every scenario, but at least here foreach appears to be more efficient than a for loop with an additional assignment.
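Roughly, the idea behind the first and third bullets looks like this (illustrative sketch with made-up function names, not the exact diff in this PR):

```php
<?php
// Illustrative only; function names are made up, not the exact GlotPress diff.

// Before (sketch): every character access re-scans the multibyte string.
function chars_via_mb_substr( $str ) {
	$chars = array();
	for ( $i = 0, $len = mb_strlen( $str ); $i < $len; $i++ ) {
		$chars[] = mb_substr( $str, $i, 1 );
	}
	return $chars;
}

// After (sketch): split once up front, then use plain array access / foreach.
function chars_via_split( $str ) {
	return preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY );
}

foreach ( chars_via_split( 'héllo wörld' ) as $i => $char ) {
	// $i and $char come straight from the array; no per-character mb_*() calls.
}
```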

This was found when a .po import took 100% CPU for several minutes, caused by a project having a significant number of obsoleted originals, most of which were long strings, so the C implementation wasn't being used.

Note: I also went looking for more optimized variants of this algorithm. There's probably something more that could be done by removing gp_string_similarity()'s percentage calculation and passing the raw distance through to future calls, so they can abort early when a string is a long way away from the source string and a closer match has already been found.
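If someone picks that up later, the caller side could hypothetically look something like the sketch below. gp_levenshtein_capped() is an imagined variant (it does not exist in GlotPress) that would take the best distance found so far and abort early once a DP row minimum exceeds it; the built-in levenshtein() stands in for it here so the sketch runs.

```php
<?php
// Hypothetical sketch of the early-abort idea above; not part of this PR.
function find_closest_original( $translation, array $originals ) {
	$best_distance = PHP_INT_MAX;
	$best_original = null;

	foreach ( $originals as $original ) {
		// With a capped variant, the best distance so far would be passed down:
		// $distance = gp_levenshtein_capped( $translation, $original, $best_distance );
		$distance = levenshtein( $translation, $original );

		if ( $distance < $best_distance ) {
			$best_distance = $distance;
			$best_original = $original;
		}
	}

	return $best_original;
}

echo find_closest_original( 'color', array( 'colour', 'flavour', 'col' ) ); // "colour"
```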

The algorithm in use here is the iterative two-matrix-rows variant: https://en.wikipedia.org/wiki/Levenshtein_distance#Iterative_with_two_matrix_rows
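For reference, a minimal PHP sketch of that two-rows variant over pre-split character arrays (illustrative, not the exact gp_levenshtein() implementation):

```php
<?php
// Minimal sketch of the two-matrix-rows variant linked above, working on
// pre-split character arrays. Illustrative only, not the exact gp_levenshtein().
function levenshtein_two_rows( array $chars1, array $chars2 ) {
	$len2 = count( $chars2 );
	$prev = range( 0, $len2 ); // Distances from the empty prefix of string 1.

	foreach ( $chars1 as $i => $char1 ) {
		$curr = array( $i + 1 ); // Deleting the whole prefix processed so far.

		foreach ( $chars2 as $j => $char2 ) {
			$cost           = ( $char1 === $char2 ) ? 0 : 1;
			$curr[ $j + 1 ] = min(
				$curr[ $j ] + 1,     // Insertion.
				$prev[ $j + 1 ] + 1, // Deletion.
				$prev[ $j ] + $cost  // Substitution (or match).
			);
		}

		$prev = $curr; // The current row becomes the previous row.
	}

	return $prev[ $len2 ];
}

$a = preg_split( '//u', 'kitten', -1, PREG_SPLIT_NO_EMPTY );
$b = preg_split( '//u', 'sitting', -1, PREG_SPLIT_NO_EMPTY );
echo levenshtein_two_rows( $a, $b ); // 3
```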

…improvements

On longer strings (~500 chars), splitting them into an array before processing them character-by-character results in a 2x speed increase.

Takes the average run in my testing from 705ms to 341ms on my real-world random data set.
…at's not required. Might be slightly faster.
… loop to a foreach, which again is slightly faster. 341ms => 310ms
@dd32 (Contributor, Author) commented on Apr 14, 2022

Just noting that there are limited unit tests for this code branch currently: https://github.com/GlotPress/GlotPress/blob/develop/tests/phpunit/testcases/test_strings.php

No logical changes were made here, though, and the before/after return values still appear to be the same, so I haven't dug into adding proper unit tests, mostly due to personal time constraints.

@dd32 added the [Type] Performance and php labels on Apr 14, 2022
gp-includes/strings.php: 2 review threads (outdated, resolved)
@ocean90 added this to the 3.1 milestone on Apr 20, 2022
ocean90 and others added 2 commits on April 22, 2022
@ocean90 removed the php label on Apr 22, 2022
@ocean90 (Member) left a comment

🚀

@ocean90 merged commit 741b4dd into GlotPress:develop on Apr 26, 2022