Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode 9 updates #70

Merged
merged 4 commits into from
Jun 28, 2016
Merged

Unicode 9 updates #70

merged 4 commits into from
Jun 28, 2016

Conversation

Keno
Copy link
Member

@Keno Keno commented Jun 24, 2016

I believe the only substantiative changes to the code required are the new rules in TR29, but please do check the Unicode change log (http://www.unicode.org/versions/Unicode9.0.0/) for other changes that may affect this library.

Keno added 2 commits June 23, 2016 23:53
- New rules GB10/(12/13) are used to combine emoji-zwj sequences/
  (force grapheme breaks every two RI codepoints). Unfortunately this
  breaks statelessness of grapheme-boundary determination. Deal with
  this by ignoring the problem in utf8proc_grapheme_break, and by
  hacking in a special case in decompose

- ZWJ moved to its own boundclass, update what is now GB9 accordingly.

- Add comments to indicate which rule a given case implements

- The Number of bound classes Now exceeds 4 bits, expand to 8 and
  reorganize fields
@Keno Keno changed the title WIP: Unicode 9 updates Unicode 9 updates Jun 24, 2016
Please note that evaluation of GB10 (grapheme breaks between emoji zwj sequences)
and GB 12/13 (regional indicator code points) require knowledge of previous characters
and are thus not handled by this function. This may result in an incorrect break before
and E_Modifier class codepoint and an incorrectly missing break between two
Copy link
Contributor

@tkelman tkelman Jun 24, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an E_Modifier?

@tkelman
Copy link
Contributor

tkelman commented Jun 24, 2016

This almost certainly needs version number bumps

@stevengj stevengj mentioned this pull request Jun 24, 2016
tbc == UTF8PROC_BOUNDCLASS_EXTEND)
*last_boundclass = UTF8PROC_BOUNDCLASS_E_BASE;
else
*last_boundclass = tbc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to export this logic somehow so that we can use it in the Julia grapheme iterator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhm, I believe just adding an extra out parameter to give you the override class should be fine.

@Keno
Copy link
Member Author

Keno commented Jun 25, 2016

Ok, I've updated the API to expose the state override. Since that's an ABI incompatible change, I've bumped the MAJOR version accordingly. Please review.

@Keno
Copy link
Member Author

Keno commented Jun 25, 2016

We might want to do this at the same time as #68 (which may also require a rebase for this). cc @benibela

@@ -19,9 +19,9 @@ UCFLAGS = $(CFLAGS) $(PICFLAG) $(C99FLAG) $(WCFLAGS) -DUTF8PROC_EXPORTS
# not API compatibility: MAJOR should be incremented whenever *binary*
# compatibility is broken, even if the API is backward-compatible
# Be sure to also update these in MANIFEST and CMakeLists.txt!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you forget to update CMakeLists.txt too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you also need to update utf8proc.h for the API version change.

@stevengj
Copy link
Member

Maybe better to do this first, then update #68?

* matching the rules in Unicode 8.0.0.
*
* @warning If the state parameter is used, `utf8proc_grapheme_break` must be called
* IN ORDER on ALL potentital breaks in a string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potential

@Keno
Copy link
Member Author

Keno commented Jun 27, 2016

Do we need to hold off on doing the version bump if we want to do #68 right after?

@tkelman
Copy link
Contributor

tkelman commented Jun 27, 2016

I think we can do a version bump for this (definitely needed), and if we want to merge further big changes we don't need to change the version again until we do the next release.

@Keno
Copy link
Member Author

Keno commented Jun 27, 2016

Ok.

@Keno
Copy link
Member Author

Keno commented Jun 28, 2016

Good to merge?

@stevengj stevengj merged commit 41c6b23 into master Jun 28, 2016
@stevengj stevengj deleted the kf/unicode9 branch June 28, 2016 20:04
stevengj added a commit that referenced this pull request Jun 28, 2016
@benibela
Copy link
Contributor

benibela commented Jul 2, 2016

There is a warning about an unused variable: lbc_override

@Keno
Copy link
Member Author

Keno commented Jul 2, 2016

Oops, thanks.

@Keno
Copy link
Member Author

Keno commented Jul 2, 2016

I also realized I made this static. Will fix.

{
int lbc_override = lbc;
if (state && *state != UTF8PROC_BOUNDCLASS_START)
lbc_override = *state;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Keno, I just noticed that lbc_override seems to never be used. Did you mean to pass lbc_override to grapheme_break_simple?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a while, but yes, that looks right, esp by looking at the first commit in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already seem to have found that bug though: https://github.com/JuliaLang/utf8proc/blob/master/utf8proc.c#L290

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#77

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants