Update graphemes for Unicode 7 #20

Merged
merged 2 commits into from
Dec 14, 2014

Conversation

stevengj
Member

This fixes #19, and also makes it much easier to implement grapheme iterators in Julia (JuliaLang/julia#9261) by adding a bool utf8proc_grapheme_break(int32_t c1, int32_t c2) function to check for a grapheme break between two codepoints. This allows us to iterate over graphemes in-place, without mapping to a separate string with 0xFF grapheme separators.
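To illustrate the in-place iteration this enables, here is a minimal sketch in C. It does not link against utf8proc: `toy_grapheme_break` is a hypothetical stand-in with the same shape as `utf8proc_grapheme_break(int32_t c1, int32_t c2)`, and it assumes (far too simply) that only U+0300..U+036F attach to the preceding codepoint. The helper names `next_grapheme_end` and `count_graphemes` are also made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the break test; NOT utf8proc's rules.
   Assumption: only combining diacritical marks (U+0300..U+036F)
   extend the preceding codepoint; everything else starts a new grapheme. */
bool toy_grapheme_break(int32_t c1, int32_t c2) {
    (void)c1; /* the toy rule only inspects the second codepoint */
    return !(c2 >= 0x0300 && c2 <= 0x036F);
}

/* Returns the index one past the last codepoint of the grapheme
   that starts at index `start`. Requires start < len. */
size_t next_grapheme_end(const int32_t *cps, size_t len, size_t start) {
    assert(start < len);
    size_t i = start + 1;
    /* Advance while there is no break between neighbouring codepoints. */
    while (i < len && !toy_grapheme_break(cps[i - 1], cps[i]))
        i++;
    return i;
}

/* Counts graphemes by walking the codepoint array in place,
   without building a copy that contains separator markers. */
size_t count_graphemes(const int32_t *cps, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i = next_grapheme_end(cps, len, i))
        count++;
    return count;
}
```

With the real library, the call to `toy_grapheme_break` would presumably be replaced by `utf8proc_grapheme_break`; the loop structure stays the same because only a pairwise break test is needed.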

Unfortunately, I had to break backwards compatibility by changing the utf8proc_property_t struct to replace the extend:1 field with a boundclass:4 field, where the latter is now read from Unicode's GraphemeBreakProperty.txt file by the updated generator script. I took this opportunity to rearrange the struct to put the bitfields at the end, so that C will not insert alignment padding into the struct; as a consequence, the struct actually got smaller by several bytes.
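The padding effect can be seen with a small sketch. The two structs below are hypothetical and are not the actual `utf8proc_property_t` layout; the field names are only illustrative. Bitfield layout is implementation-defined, so the exact sizes depend on the compiler and ABI, but on common platforms grouping the bitfields at the end gives the smaller struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with bitfields scattered between wider members.
   Each wider member forces the compiler to pad up to its alignment. */
struct props_interleaved {
    int16_t category;
    unsigned bidi_mirrored : 1;
    const int32_t *decomp_mapping;
    unsigned comp_exclusion : 1;
    int32_t uppercase_mapping;
    unsigned extend : 1;
};

/* Hypothetical layout with wide members first and all bitfields at the
   end, so the bitfields can share one storage unit. */
struct props_grouped {
    const int32_t *decomp_mapping;
    int32_t uppercase_mapping;
    int16_t category;
    unsigned bidi_mirrored : 1;
    unsigned comp_exclusion : 1;
    unsigned boundclass : 4; /* wider than the 1-bit field it replaces */
};

/* Size of each layout in bytes, as computed by the compiler in use. */
size_t interleaved_size(void) { return sizeof(struct props_interleaved); }
size_t grouped_size(void) { return sizeof(struct props_grouped); }
```

Comparing `interleaved_size()` with `grouped_size()` on a given platform shows how much padding the reordering saves there, even though the grouped version carries a 4-bit field in place of a 1-bit one.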

Once this is merged, I will submit the corresponding patch to the utf8proc folks.

@jiahao, does it look okay to you? cc @StefanKarpinski

@StefanKarpinski
Member

👍

@stevengj
Member Author

Going to go ahead and merge, then submit upstream.

stevengj added a commit that referenced this pull request Dec 14, 2014
Update graphemes for Unicode 7
@stevengj stevengj merged commit 4f70bbe into master Dec 14, 2014
@stevengj stevengj mentioned this pull request Mar 8, 2015
@stevengj stevengj deleted the graphemes branch June 27, 2015 14:07
Development

Successfully merging this pull request may close these issues.

incorrect extended grapheme segmentation