forked from apache/spark
-
Notifications
You must be signed in to change notification settings - Fork 4
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-48283][SQL] Modify string comparison for UTF8_BINARY_LCASE
### What changes were proposed in this pull request? String comparison and hashing in UTF8_BINARY_LCASE is now context-unaware, and uses ICU root locale rules to convert string to lowercase at code point level, taking into consideration special cases for one-to-many case mapping. For example: comparing "ΘΑΛΑΣΣΙΝΟΣ" and "θαλασσινοσ" under UTF8_BINARY_LCASE now returns true, because Greek final sigma is special-cased in the new comparison implementation. ### Why are the changes needed? 1. UTF8_BINARY_LCASE should use ICU root locale rules (instead of JVM) 2. comparing strings under UTF8_BINARY_LCASE should be context-insensitive ### Does this PR introduce _any_ user-facing change? Yes, comparing strings under UTF8_BINARY_LCASE will now give different results in two kinds of special cases (Turkish dotted letter "i" and Greek final letter "sigma"). ### How was this patch tested? Unit tests in `CollationSupportSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#46700 from uros-db/lcase-casing. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
- Loading branch information
Showing
5 changed files
with
244 additions
and
54 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters