Add new function startsWithUTF8 and endsWithUTF8 #52555
This is an automated comment for commit d15ae5e with description of existing statuses. It's updated for the latest CI running
|
Can you please share the motivation to have these functions? |
Yes, of course. When a string contains UTF-8 characters, the current function |
BTW, can you also review PR #51472? It has been blocked for quite a while. |
Let me first ask Yakov in DM to continue review. UPD: he will check the PR today |
I've run your test with startsWith/endsWith, and the only difference to startsWithUTF8/endsWithUTF8 is in the cases where the result is documented as undefined:
It means no logical difference, and the motivation of these functions remains unclear. I cannot imagine a case where there will be a difference. |
startsWith matches two strings byte by byte, but startsWithUTF8 matches them by UTF-8 character. That's the difference, which is also the difference between
:) select startsWith('富强民主文明和谐', '\xe5');
SELECT startsWith('富强民主文明和谐', '\xe5')
Query id: 5c0b30b1-877b-443b-b427-6fb48ac017d5
┌─startsWith('富强民主文明和谐', '\xe5')─┐
│                                      1 │
└────────────────────────────────────────┘
1 row in set. Elapsed: 0.014 sec.
:) select startsWithUTF8('富强民主文明和谐', '\xe5');
SELECT startsWithUTF8('富强民主文明和谐', '\xe5')
Query id: f522aa30-a747-459e-b5ac-c91169f2a4a5
┌─startsWithUTF8('富强民主文明和谐', '\xe5')─┐
│                                          0 │
└────────────────────────────────────────────┘
1 row in set. Elapsed: 0.001 sec. |
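The difference shown in the two queries above can be modelled outside ClickHouse. The following is a minimal Python sketch, not ClickHouse's implementation: the helper names `starts_with_bytes`, `starts_with_chars`, and `decode_lossy` are hypothetical, and mapping invalid bytes to U+FFFD is an assumption made only so that the lone byte `\xe5` can be compared at all.

```python
def starts_with_bytes(haystack: bytes, prefix: bytes) -> bool:
    """Byte-wise prefix test, modelling the startsWith behaviour described in the thread."""
    return haystack[:len(prefix)] == prefix


def decode_lossy(data: bytes) -> str:
    # Assumption of this sketch: invalid bytes become U+FFFD instead of raising.
    return data.decode("utf-8", errors="replace")


def starts_with_chars(haystack: bytes, prefix: bytes) -> bool:
    """Character-wise prefix test: compare decoded code points, not raw bytes."""
    return decode_lossy(haystack).startswith(decode_lossy(prefix))


text = "富强民主文明和谐".encode("utf-8")
lone_byte = b"\xe5"  # first byte of "富" (E5 AF 8C), not a complete character

print(starts_with_bytes(text, lone_byte))             # True
print(starts_with_chars(text, lone_byte))             # False
print(starts_with_chars(text, "富".encode("utf-8")))  # True
```

The byte-wise test succeeds because `\xe5` happens to be the lead byte of the first character, while the character-wise test fails because `\xe5` alone does not decode to any character of the string.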
@taiyang-li but you state in the docs that this particular case is undefined behavior. |
|
But startsWith/endsWith don't deal with positions, and every Unicode code point is unambiguously represented by a sequence of bytes in UTF-8, assuming UTF-8 is valid*, so there is no difference. * there is "overlong encoding" and "CESU-8" as variants of invalid UTF-8, but we don't care. |
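The claim that byte-wise and code-point-wise matching agree for valid UTF-8 can be spot-checked. This is a small Python sketch over a hand-picked sample set (the `samples` list and the `agree` helper are illustrative assumptions, not part of the PR); it is a sanity check, not a proof.

```python
from itertools import product

# Mixed ASCII, 2-byte, 3-byte and 4-byte characters; all valid UTF-8 once encoded.
samples = ["", "a", "ab", "é", "éa", "富", "富强", "富强民主", "😀", "😀富", "强富"]


def agree(haystack: str, needle: str) -> bool:
    """True when byte-wise and code-point-wise prefix/suffix tests give the same answer."""
    h, n = haystack.encode("utf-8"), needle.encode("utf-8")
    same_prefix = h.startswith(n) == haystack.startswith(needle)
    same_suffix = h.endswith(n) == haystack.endswith(needle)
    return same_prefix and same_suffix


mismatches = [(h, n) for h, n in product(samples, repeat=2) if not agree(h, n)]
print(mismatches)  # []
```

No pair disagrees because a valid UTF-8 needle always begins with a lead byte, so it can only match the haystack at a character boundary.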
@alexey-milovidov I see. What I originally wanted to solve is the issue below. I expect
select startsWith('富强民主文明和谐', '\xe5');
┌─startsWith('富强民主文明和谐', '\xe5')─┐
│                                      1 │
└────────────────────────────────────────┘ |
…ickHouse into starts_ends_with_utf8
@Avogar I don't know how to solve those failed tests, do you have any ideas? |
Let's merge it. The failed tests are not related, and I will investigate them later. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add new functions startsWithUTF8 and endsWithUTF8