Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ruby compat hacks #259

Merged
merged 2 commits into from
Jan 4, 2024

Conversation

c42f
Copy link
Contributor

@c42f c42f commented Jan 4, 2024

Follow on from #258

There's three minor cleanups here arranged into two patches so you can see the effect on utf8proc_data.c. (We might want to squash the patches together when merging because they churn utf8proc_data.c a bit due to two of them affecting the encoding tables in a non-meaningful but global way which is why I've put them all together here in this PR.)

Only one of these fixes is observable via the API as a minor bug fix for properties of U+200C and U+200D.

The patch notes:

[PATCH 1/2] Fix two minor bugs from the Ruby code

First, categroy rather than code was used in constructing the
control_boundary property as related to the characters U+200C and
U+200D. This seemed incorrect and should be fixed. This could be an
observable bugfix for any C code which inspects the control_boundary
property.

Second, when reading composition exclusions, Ruby's String hex method
produces zero rather than nil if no number is found. For example

$ ruby -e 'puts "# blah".hex'
0

This led to the character '\0' being included in the exclusions
and excl_versions sets which is incorrect. However this seems
asymptomatic because '\0' is never part of a composition. (In terms of
the C code, the use of comp_exclusion is guarded by the comb_index
property which is UINT16_MAX for '\0'.)

[PATCH 2/2] Cleanup: Remove sequence ordering hack

This hack changed the ordering of sequences encoded in the sequences
table and was added so we could easily prove equivalence to the Ruby
data generator code.

However, it's no longer needed and removing it shouldn't result in any
functional change.

First, `categroy` rather than `code` was used in constructing the
`control_boundary` property as related to the characters U+200C and
U+200D. This seemed incorrect and should be fixed. This could be an
observable bugfix for any C code which inspects the `control_boundary`
property.

Second, when reading composition exclusions, Ruby's String hex method
produces zero rather than nil if no number is found. For example

    $ ruby -e 'puts "# blah".hex'
    0

This led to the character `'\0'` being included in the `exclusions`
and `excl_versions` sets which is incorrect. However this seems
asymptomatic because `'\0'` is never part of a composition. (In terms of
the C code, the use of `comp_exclusion` is guarded by the `comb_index`
property which is `UINT16_MAX` for `'\0'`.)
This hack changed the ordering of sequences encoded in the sequences
table and was added so we could easily prove equivalence to the Ruby
data generator code.

However, it's no longer needed and removing it shouldn't result in any
functional change.
@stevengj
Copy link
Member

stevengj commented Jan 4, 2024

Also fixes #226?

@stevengj
Copy link
Member

stevengj commented Jan 4, 2024

This hack changed the ordering of sequences encoded in the sequences table and was added so we could easily prove equivalence to the Ruby data generator code.

Is the sequence ordering still deterministic independent of the Julia version? I wouldn't want the output to change just because Julia changed its Dict iteration order.

@c42f
Copy link
Contributor Author

c42f commented Jan 4, 2024

Is the sequence ordering still deterministic independent of the Julia version? I wouldn't want the output to change just because Julia changed its Dict iteration order.

Yes I've kept iteration order separate from Dict - it's consistent with ruby which iterates hash in the order of insertion. (In the port one extra array comb2nd_indices_sorted_keys was introduced to ensure this.)

The only ordering-related thing which changes here is that we insert the decomposition mapping and case folding sequences into the cache interleaved with uppercase, lowercase and titlecase mappings, rather than doing them first. But that's independent of Julia version. The ordering in the Ruby code was less obvious because the mutable cache of sequences is a global variable.

@stevengj stevengj merged commit 1fe43f5 into JuliaStrings:master Jan 4, 2024
12 checks passed
@c42f c42f deleted the caf/remove-ruby-compat-hacks branch January 5, 2024 21:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants