To process STARTING WITH with insensitive collations, it is first necessary to generate the canonical bytes of the strings being matched.
If the matched string is much longer than the pattern string, time is wasted generating canonical bytes that are never compared.
It is only necessary to generate canonical bytes for the initial substring that has the same length as the pattern string.
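The idea can be illustrated with a minimal sketch. This is not the engine code: `canonical`, `starts_with_full` and `starts_with_prefix_only` are hypothetical names, and the toy canonical form (one integer per character, with case folded away) only stands in for what a real insensitive collation produces. It assumes each character maps to exactly one canonical element, which is what lets the string be cut at the pattern's length before conversion.

```python
def canonical(text: str) -> list[int]:
    """Toy canonical form: one integer per character, ignoring case.

    Hypothetical stand-in for a collation's canonical conversion.
    """
    out = []
    for ch in text:
        low = ch.lower()
        # Keep the one-element-per-character property even for the rare
        # characters whose lowercase form is longer than one character.
        out.append(ord(low) if len(low) == 1 else ord(ch))
    return out


def starts_with_full(s: str, p: str) -> bool:
    """Unoptimized: canonicalize the whole string, then compare the prefix."""
    cs = canonical(s)
    cp = canonical(p)
    return cs[:len(cp)] == cp


def starts_with_prefix_only(s: str, p: str) -> bool:
    """Optimized: canonicalize only as many characters as the pattern has."""
    cp = canonical(p)
    cs = canonical(s[:len(p)])
    return cs == cp


if __name__ == "__main__":
    s = "X12345678901234567890123456789012345678901234567890123456789"
    print(starts_with_full(s, "x"), starts_with_prefix_only(s, "x"))
```

Both functions return the same result; the second converts one character of the 60-character string instead of all 60, which is where the saving in the measurements would come from under these assumptions.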
In my tests with character set WIN1252 collate WIN_PTBR, using a matched string of length 60 and a pattern string of length 1, I see a performance improvement of ~30%.
With character set UTF8 collate UNICODE_CI I see a performance improvement of ~50% in the same test.
Test:
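-- WIN1252 / WIN_PTBR: 1,000,000 STARTING WITH evaluations (60-character string, 1-character pattern).
-- Assumes "!" has been set as the statement terminator, e.g. with SET TERM in isql.
execute block
as
    declare p varchar(1) character set win1252 collate win_ptbr = 'x';
    declare s varchar(60) character set win1252 collate win_ptbr = 'x12345678901234567890123456789012345678901234567890123456789';
    declare n integer = 0;
    declare b boolean;
begin
    while (n < 1000000)
    do
    begin
        b = s starting with p;
        n = n + 1;
    end
end!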
-- UTF8 / UNICODE_CI: the same loop with a multi-byte character set.
execute block
as
    declare p varchar(1) character set utf8 collate unicode_ci = 'x';
    declare s varchar(60) character set utf8 collate unicode_ci = 'x12345678901234567890123456789012345678901234567890123456789';
    declare n integer = 0;
    declare b boolean;
begin
    while (n < 1000000)
    do
    begin
        b = s starting with p;
        n = n + 1;
    end
end!
After small changes to the INTL API description of the canonical function and small changes in the engine, I verified an improvement of more than 50% with UTF8 and UNICODE_CI in the same test, so I'm changing this issue to also optimize multi-byte character sets.
asfernandes changed the title from "Improve performance of STARTING WITH of fixed-byte charsets with insensitive collations" to "Improve performance of STARTING WITH with insensitive collations" on Nov 5, 2021