Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sharp-S character treated incorrectly in UNICODE_CI_AI collation [CORE4136] #4463

Closed
firebird-issue-importer opened this issue Jul 1, 2013 · 12 comments

Comments

@firebird-issue-importer
Copy link

@firebird-issue-importer firebird-issue-importer commented Jul 1, 2013

Submitted by: Stefan Heymann (stefanheymann)

Is related to QA529

Votes: 1

The UNICODE_CI_AI collation treats the Sharp-s character (U+00DF) incorrectly.
This character (used in German language text) is special in that there is only a lower-case form, no upper-case (having derived from a ligature between a long and a round lowercase "s". Forget about U+1E9E, which is an abstract invention by the Unicode consortium that has no practical use in German language).

To reproduce the bug, try this on a UTF8 database:

select
case when 'Übergeek' collate unicode_ci_ai like 'ÜB%' collate unicode_ci_ai
then '=' else '<>' end as test_1,
case when 'Übergeek' collate unicode_ci_ai like 'üb%' collate unicode_ci_ai
then '=' else '<>' end as test_2,
case when 'Fußball' collate unicode_ci_ai like 'fu%' collate unicode_ci_ai
then '=' else '<>' end as test_3,
case when 'Fußball' collate unicode_ci_ai like 'fuß%' collate unicode_ci_ai
then '=' else '<>' end as test_4,
case when upper ('Fußball') like upper ('fuß%')
then '=' else '<>' end as test_5
from rdb$database

TEST_4 will show a mismatch where it should show a match.

Commits: 0e7302f fb41d66 FirebirdSQL/fbt-repository@9b97921 FirebirdSQL/fbt-repository@41851cd

====== Test Details ======

Have question about different results of comparison, see issue 29-may-2015.
Perhaps, it also related to CORE4739.
See also sample in CORE857 ( 19/Apr/15 08:56 AM )

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jul 1, 2013

Commented by: Stefan Heymann (stefanheymann)

I should add that UNICODE_CI (wihout AI) works correctly on this character:

select
case when 'Übergeek' collate unicode_ci like 'ÜB%' collate unicode_ci_ai
then '=' else '<>' end as test_1,
case when 'Übergeek' collate unicode_ci like 'üb%' collate unicode_ci_ai
then '=' else '<>' end as test_2,
case when 'Fußball' collate unicode_ci like 'fu%' collate unicode_ci_ai
then '=' else '<>' end as test_3,
case when 'Fußball' collate unicode_ci like 'fuß%' collate unicode_ci_ai
then '=' else '<>' end as test_4,
case when upper ('Fußball') like upper ('fuß%')
then '=' else '<>' end as test_5
from rdb$database

will show a match for TEST_4, which is correct.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jul 5, 2013

Modified by: @asfernandes

assignee: Adriano dos Santos Fernandes [ asfernandes ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jul 7, 2013

Modified by: @asfernandes

status: Open [ 1 ] => Resolved [ 5 ]

resolution: Fixed [ 1 ]

Fix Version: 2.5.3 [ 10461 ]

Fix Version: 3.0 Alpha 2 [ 10560 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jul 8, 2013

Commented by: Stefan Heymann (stefanheymann)

It's working now (tested 2013-07-08 with Firebird 2.5.3.26671). Thanks!

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jul 22, 2013

Modified by: @dyemanov

Fix Version: 3.0 Alpha 1 [ 10331 ]

Fix Version: 3.0 Alpha 2 [ 10560 ] =>

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Mar 29, 2014

Modified by: @pcisar

Link: This issue is related to QA529 [ QA529 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 29, 2015

Commented by: @pavel-zotov

Can anyone who knows German language explain following results:

set names utf8;
create database 'localhost/3333:C:\MIX\firebird\QA\fbt-repo\tmp\c4136_a.fdb' default character set utf8;
commit;

recreate table test(text varchar(10) collate unicode_ci_ai, patt varchar(10) collate unicode_ci_ai);
commit;
show table test;

insert into test values('ß','s');
insert into test values('ß','ss');
insert into test values('ss','ß');
commit;

set list on;
select
text
,patt
,text = patt as "equal result"
,text is NOT distinct from patt as "is NOT_distinct result"
,text starting with patt as "starting_with result"
,text like patt as "like result"
,text containing patt as "containing result"
,text similar to patt as "similar_to result"
from test
;

Output:

TEXT ß
PATT s
equal result <false>
is NOT_distinct result <false>
starting_with result <false> ------------ WHY ?
like result <false>
containing result <false> ------------- WHY ?
similar_to result <false>

TEXT ß
PATT ss
equal result <true>
is NOT_distinct result <true>
starting_with result <false> -------- WHY ??? (this and subsequent three)
like result <false>
containing result <false>
similar_to result <false>

TEXT ss
PATT ß
equal result <true>
is NOT_distinct result <true>
starting_with result <false> -------- WHY ??? (this and subsequent three)
like result <false>
containing result <false>
similar_to result <false>

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 29, 2015

Modified by: @pavel-zotov

status: Resolved [ 5 ] => Resolved [ 5 ]

QA Status: Done successfully

Test Details: Have question about different results of comparison, see issue 29-may-2015.
Perhaps, it also related to CORE4739.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented May 29, 2015

Commented by: Stefan Heymann (stefanheymann)

Historically, the sharp-s character (ß) is a typographical ligature of a "long" and a "round" lowercase s. Today it is a complete character of its own, so "ß" and "ss" are not the same (you can, however, treat them as the same in sorting). The same is true for "ß" and "s". There is no such thing as an uppercase sharp-S, so the Unicode code point U+1E9E LATIN CAPITAL LETTER SHARP S is completely pointless.
In uppercase writing, a sharp-s is replaced by "SS" or (to avoid amiguities) by "SZ" (so Masse gets MASSE, Maße gets MASSE or MASZE, Fuß gets FUSS).
A "ß" is not starting with "s" or "ss" and it doesn't contain them.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Sep 25, 2015

Commented by: @pcisar

Test created.

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Sep 25, 2015

Modified by: @pcisar

status: Resolved [ 5 ] => Closed [ 6 ]

@firebird-issue-importer
Copy link
Author

@firebird-issue-importer firebird-issue-importer commented Jun 25, 2018

Modified by: @pavel-zotov

status: Closed [ 6 ] => Closed [ 6 ]

Test Details: Have question about different results of comparison, see issue 29-may-2015.
Perhaps, it also related to CORE4739.

=>

Have question about different results of comparison, see issue 29-may-2015.
Perhaps, it also related to CORE4739.
See also sample in CORE857 ( 19/Apr/15 08:56 AM )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants