Update data tables to Unicode 7.0.0 #6

jiahao · 2014-07-17T22:37:26Z

Updates:

Updates the data_generator.rb script. The script now runs on a modern version of Ruby (> 1.8), and its hard-coded data tables are replaced with reads from the appropriate Unicode data (UNIDATA) files.
Provides a new Makefile target, update, which automatically downloads the relevant UNIDATA and runs data_generator.rb to produce the file utf8proc_data.c.new.
Updates utf8proc_data.c to the output generated by running make update against UNIDATA v7.0.0.
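
For readers unfamiliar with the UNIDATA layout, here is a minimal sketch of what "file reads instead of hard-coded tables" can look like. It is not the code in data_generator.rb: the method name parse_unicode_data, the SAMPLE constant, and the hash keys are all hypothetical, and the three sample records stand in for a downloaded UnicodeData.txt. It assumes only the standard UnicodeData.txt layout of 15 semicolon-separated fields per line.

```ruby
# Hypothetical sketch, not the actual data_generator.rb logic.
# Three records in UnicodeData.txt format stand in for the downloaded file.
SAMPLE = [
  '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;',
  '00A8;DIAERESIS;Sk;0;ON;<compat> 0020 0308;;;;N;SPACING DIAERESIS;;;;',
  '00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9'
].join("\n")

# Convert a hex field to an Integer, or nil when the field is empty.
def hex_or_nil(field)
  field.nil? || field.empty? ? nil : field.to_i(16)
end

# Parse UnicodeData.txt-style text into an array of hashes.
def parse_unicode_data(text)
  text.each_line.map do |line|
    # The -1 limit keeps trailing empty fields, so every record has 15 fields.
    fields = line.chomp.split(';', -1)

    # Field 5 holds the decomposition mapping, optionally prefixed by a
    # compatibility tag such as <compat>.
    tokens = fields[5].split
    tag = tokens.first && tokens.first.start_with?('<') ? tokens.shift : nil

    {
      code: fields[0].to_i(16),
      name: fields[1],
      category: fields[2],
      combining_class: fields[3].to_i,
      decomp_tag: tag,
      decomposition: tokens.map { |cp| cp.to_i(16) },
      uppercase: hex_or_nil(fields[12]),
      lowercase: hex_or_nil(fields[13]),
      titlecase: hex_or_nil(fields[14])
    }
  end
end

parse_unicode_data(SAMPLE).each do |rec|
  puts format('U+%04X %s %s', rec[:code], rec[:category], rec[:name])
end
```

In a real run the input would presumably come from something like File.read on the downloaded UnicodeData.txt rather than from an inline sample.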

Observations:

There are #defined constants in utf8proc.c which may in principle have changed from v5.0 to v7.0, such as the constants marking the location of Hangul, Unihan, etc. I haven't checked them, and it's probably not worth recomputing them for each new Unicode version.
It looks like utf8proc implements an internal processing mode called LUMP, which is briefly described in lump.txt. As far as I can tell, this is a custom normalization mode which is separate from the Unicode standard, but I think we'll want to keep using it.
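
On the first observation: if someone does want to check the block-location constants, one cheap way is to read the range markers out of UnicodeData.txt, which lists large blocks as a pair of "First"/"Last" records. The sketch below is only an illustration. The method name range_markers and both constants are hypothetical, the two sample lines stand in for the real file, and the EXPECTED values are the Hangul syllable bounds as I understand them from the Unicode standard, not values copied out of utf8proc.c.

```ruby
# Hypothetical sketch: extract "<Name, First>" / "<Name, Last>" range markers
# from UnicodeData.txt-style text and compare them with expected bounds.
RANGE_SAMPLE = [
  'AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;',
  'D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;'
].join("\n")

# Assumed bounds of the Hangul syllable block, to be checked against the data.
EXPECTED = { 'Hangul Syllable' => { first: 0xAC00, last: 0xD7A3 } }

# Return a hash mapping each range name to its first and last code points.
def range_markers(text)
  ranges = {}
  text.each_line do |line|
    # Only the first two fields matter here: the code point and the name.
    code, name = line.chomp.split(';', 3)
    match = /\A<(.+), (First|Last)>\z/.match(name)
    next unless match

    entry = (ranges[match[1]] ||= {})
    entry[match[2] == 'First' ? :first : :last] = code.to_i(16)
  end
  ranges
end

found = range_markers(RANGE_SAMPLE)
EXPECTED.each do |name, bounds|
  status = found[name] == bounds ? 'unchanged' : 'CHANGED'
  puts "#{name}: #{status}"
end
```

Run against each new UnicodeData.txt, a check like this would flag a moved or resized block without recomputing anything by hand.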

Ref: #1

jiahao · 2014-07-18T14:16:27Z

I managed to bork this PR.

jiahao · 2014-07-18T14:17:24Z

Replaced by #9.

jiahao added 5 commits July 17, 2014 15:32

Mark location of CaseFolding.txt data

f0943b4

Remove utf8proc_data.c (generated by data_generator.rb)

76b96f1

Mark Default_Ignorable_Code_Point data

ba5d970

Mark Grapheme_Extend data

d78ced6

Mark composition exclusion characters

e55defc

jiahao mentioned this pull request Jul 18, 2014

Update data_generator script #8

Closed

jiahao changed the title ~~XXX Marking data locations~~ Update data tables to Unicode 7.0.0 Jul 18, 2014

jiahao closed this Jul 18, 2014
