Improve performance of STARTING WITH with insensitive collations #7038

Closed
asfernandes opened this issue Nov 4, 2021 · 3 comments

asfernandes commented Nov 4, 2021

To evaluate STARTING WITH under an insensitive collation, the engine must first generate the canonical bytes of the matching string.

If the matching string is much longer than the pattern string, time is wasted generating canonical bytes that are never compared.

It is only necessary to generate canonical bytes for the initial substring that has the same length as the pattern string.
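
The idea can be sketched outside the engine. This is only an illustration in Python, not the Firebird implementation: `canonical` is a hypothetical stand-in for a collation's canonical form (here it just lowercases ASCII characters, one canonical element per character), and both function names are made up for this example.

```python
def canonical(text: str) -> list[int]:
    """Hypothetical canonical form: one element per character,
    with ASCII letters lowercased (an assumption, not a real collation)."""
    return [ord(c.lower()) if c.isascii() else ord(c) for c in text]


def starts_with_full(s: str, pattern: str) -> bool:
    # Baseline: canonicalize the whole matching string, then compare its prefix.
    canon_p = canonical(pattern)
    return canonical(s)[:len(canon_p)] == canon_p


def starts_with_prefix_only(s: str, pattern: str) -> bool:
    # Proposed idea: canonicalize only as many leading characters
    # as the pattern has, since the rest is never compared.
    canon_p = canonical(pattern)
    return canonical(s[:len(pattern)]) == canon_p
```

Both functions return the same result; the second one does work proportional to the pattern length instead of the matching string length, which is where the saving would come from when the matching string is much longer than the pattern.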

In my tests with character set WIN1252 collate WIN_PTBR, a matching string of length 60 and a pattern string of length 1, I see a performance improvement of ~30%.

With character set UTF8 collate UNICODE_CI, I see a performance improvement of ~50% in the same test.

Test:

execute block
as
    declare p varchar(1) character set win1252 collate win_ptbr = 'x';
    declare s varchar(60) character set win1252 collate win_ptbr = 'x12345678901234567890123456789012345678901234567890123456789';
    declare n integer = 0;
    declare b boolean;
begin
    while (n < 1000000)
    do
    begin
        b = s starting with p;
        n = n + 1;
    end
end!

execute block
as
    declare p varchar(1) character set utf8 collate unicode_ci = 'x';
    declare s varchar(60) character set utf8 collate unicode_ci = 'x12345678901234567890123456789012345678901234567890123456789';
    declare n integer = 0;
    declare b boolean;
begin
    while (n < 1000000)
    do
    begin
        b = s starting with p;
        n = n + 1;
    end
end!
@asfernandes

I updated the performance improvement figures after re-running the test against the changed implementation.

@asfernandes

After small changes to the INTL API description of the canonical function and small changes in the engine, I measured an improvement of more than 50% with UTF8 and UNICODE_CI in the same test, so I'm changing this issue to also optimize multi-byte character sets.

asfernandes changed the title from "Improve performance of STARTING WITH of fixed-byte charsets with insensitive collations" to "Improve performance of STARTING WITH with insensitive collations" on Nov 5, 2021
asfernandes added a commit that referenced this issue Nov 5, 2021
@pavel-zotov

Currently implemented only for Windows: the 'psutil' package cannot be installed on Python 2.7 when it is running on Linux (Debian).
