Remove ruby compat hacks #259

c42f · 2024-01-04T00:51:24Z

Follow on from #258

There's three minor cleanups here arranged into two patches so you can see the effect on utf8proc_data.c. (We might want to squash the patches together when merging because they churn utf8proc_data.c a bit due to two of them affecting the encoding tables in a non-meaningful but global way which is why I've put them all together here in this PR.)

Only one of these fixes is observable via the API as a minor bug fix for properties of U+200C and U+200D.

The patch notes:

[PATCH 1/2] Fix two minor bugs from the Ruby code

First, categroy rather than code was used in constructing the
control_boundary property as related to the characters U+200C and
U+200D. This seemed incorrect and should be fixed. This could be an
observable bugfix for any C code which inspects the control_boundary
property.

Second, when reading composition exclusions, Ruby's String hex method
produces zero rather than nil if no number is found. For example

$ ruby -e 'puts "# blah".hex'
0

This led to the character '\0' being included in the exclusions
and excl_versions sets which is incorrect. However this seems
asymptomatic because '\0' is never part of a composition. (In terms of
the C code, the use of comp_exclusion is guarded by the comb_index
property which is UINT16_MAX for '\0'.)

[PATCH 2/2] Cleanup: Remove sequence ordering hack

This hack changed the ordering of sequences encoded in the sequences
table and was added so we could easily prove equivalence to the Ruby
data generator code.

However, it's no longer needed and removing it shouldn't result in any
functional change.

First, `categroy` rather than `code` was used in constructing the `control_boundary` property as related to the characters U+200C and U+200D. This seemed incorrect and should be fixed. This could be an observable bugfix for any C code which inspects the `control_boundary` property. Second, when reading composition exclusions, Ruby's String hex method produces zero rather than nil if no number is found. For example $ ruby -e 'puts "# blah".hex' 0 This led to the character `'\0'` being included in the `exclusions` and `excl_versions` sets which is incorrect. However this seems asymptomatic because `'\0'` is never part of a composition. (In terms of the C code, the use of `comp_exclusion` is guarded by the `comb_index` property which is `UINT16_MAX` for `'\0'`.)

This hack changed the ordering of sequences encoded in the sequences table and was added so we could easily prove equivalence to the Ruby data generator code. However, it's no longer needed and removing it shouldn't result in any functional change.

stevengj · 2024-01-04T01:04:57Z

Also fixes #226?

stevengj · 2024-01-04T01:06:37Z

This hack changed the ordering of sequences encoded in the sequences table and was added so we could easily prove equivalence to the Ruby data generator code.

Is the sequence ordering still deterministic independent of the Julia version? I wouldn't want the output to change just because Julia changed its Dict iteration order.

c42f · 2024-01-04T04:44:00Z

Is the sequence ordering still deterministic independent of the Julia version? I wouldn't want the output to change just because Julia changed its Dict iteration order.

Yes I've kept iteration order separate from Dict - it's consistent with ruby which iterates hash in the order of insertion. (In the port one extra array comb2nd_indices_sorted_keys was introduced to ensure this.)

The only ordering-related thing which changes here is that we insert the decomposition mapping and case folding sequences into the cache interleaved with uppercase, lowercase and titlecase mappings, rather than doing them first. But that's independent of Julia version. The ordering in the Ruby code was less obvious because the mutable cache of sequences is a global variable.

c42f added 2 commits January 4, 2024 10:42

c42f mentioned this pull request Jan 4, 2024

A couple of potential issues in data_generator.rb #226

Closed

stevengj merged commit 1fe43f5 into JuliaStrings:master Jan 4, 2024
12 checks passed

c42f deleted the caf/remove-ruby-compat-hacks branch January 5, 2024 21:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove ruby compat hacks #259

Remove ruby compat hacks #259

c42f commented Jan 4, 2024 •

edited

Loading

stevengj commented Jan 4, 2024

stevengj commented Jan 4, 2024 •

edited

Loading

c42f commented Jan 4, 2024

Remove ruby compat hacks #259

Remove ruby compat hacks #259

Conversation

c42f commented Jan 4, 2024 • edited Loading

[PATCH 1/2] Fix two minor bugs from the Ruby code

[PATCH 2/2] Cleanup: Remove sequence ordering hack

stevengj commented Jan 4, 2024

stevengj commented Jan 4, 2024 • edited Loading

c42f commented Jan 4, 2024

c42f commented Jan 4, 2024 •

edited

Loading

stevengj commented Jan 4, 2024 •

edited

Loading