New data file generator with support for UCD 13 & 14 #227

chris0e3 · 2021-09-03T21:36:41Z

Attached is data_make.py, a Python 3 script designed to combine & replace data/data_generator.rb & data/charwidths.jl and support both UCD 13 & 14. Also attached is utf8proc.c.patch, a small change to utf8proc.c needed to support UCD 14.

Here are some of its features:

Written in Python (easier to read & support?), only uses (a little) sed. Tested on Python 3.7.4.
Doesn’t use an unspecified version of Ruby.
Doesn’t use an unspecified version of Julia.
Doesn’t require a previously built, unspecified, version of libutf8proc.
Runs to completion in 5-6 seconds (about 10x as fast as data_generator.rb).
Passes all utf8proc tests.
No changes to the public API.
Can generate a byte-for-byte identical utf8proc_data.c file compared to that contained in utf8proc 2.6.1.
Can generate an equivalent utf8proc_data.c source file that is over 1.1 MB smaller.
Writes informative header comments to the generated file.
Can process the latest UCD 14-dev data files and generate a utf8proc_data.c that passes all current tests.
[Due to the increased size of UCD 14 data I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires a small patch to utf8proc.c.]
Can halve the size of utf8proc_stage1table. (Saves 4352 bytes.)
Can be used to create a utf8proc_properties table that is > 64,000 bytes smaller.
Doesn’t need data/Uppercase.txt, data/Lowercase.txt or data/CharWidths.txt files.

Building with the (still in development) UCD 14 requires a new Makefile. I haven’t supplied one here as UCD 14 is still in a state of flux & the URLs are changing. (I can supply one if requested.)
UCD 14 has increased the size of the generated data. I have had to split utf8proc_sequences & add utf8proc_casemap to prevent index overflow. This requires the small patch to utf8proc.c contained in utf8proc.c.patch. With the patch applied, utf8proc.c still works with the original utf8proc_data.c as well as with the new-format UCD 13 & 14 data.
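
The index overflow mentioned above can be pictured with a small sketch. It assumes, purely for illustration, that each table entry refers to its sequence by an offset of a fixed bit width (13 bits here); the actual field layout in utf8proc.c is not described in this post, and the table sizes below are synthetic rather than real UCD counts.

```python
# Illustrative only: INDEX_BITS and the table sizes are assumptions,
# not values read from utf8proc.c or from the UCD.
INDEX_BITS = 13
MAX_OFFSET = (1 << INDEX_BITS) - 1  # largest offset the index can hold


def pack(sequences):
    """Concatenate sequences into one pool; return (pool, start offsets)."""
    pool, offsets = [], []
    for seq in sequences:
        offsets.append(len(pool))
        pool.extend(seq)
    return pool, offsets


def fits(offsets):
    """True if every start offset is representable in INDEX_BITS bits."""
    return all(off <= MAX_OFFSET for off in offsets)


# Synthetic stand-ins for decomposition and case-mapping sequences.
decomp = [[i, i + 1] for i in range(3000)]  # 6000 units in total
casemap = [[i] for i in range(3000)]        # 3000 units in total

_, one_pool = pack(decomp + casemap)  # single shared table
_, decomp_only = pack(decomp)         # split: sequences table
_, casemap_only = pack(casemap)       # split: separate case-map table

print("single table fits:", fits(one_pool))
print("split tables fit: ", fits(decomp_only) and fits(casemap_only))
```

With one shared pool the later offsets outgrow the index, while two pools each restart from offset 0, which is the general idea behind moving the case mappings into their own table.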

To use:

Download & unpack a clean copy of utf8proc-2.6.1.tar.gz.
Unpack & copy the attached data_make.py & utf8proc.c.patch into the utf8proc-2.6.1 dir.
Run make -kC data to download the UCD 13 data files. [It’s OK if CharWidths.txt is not made.]
Run patch < utf8proc.c.patch.
Run ./data_make.py --verbose --format=1 --output=utf8proc_data.c
Run make check.
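
The command steps above (everything after unpacking and copying the files) could be scripted roughly as follows. This is a convenience sketch and not part of the attachment: run_steps and its dry_run flag are hypothetical names, and the commands are copied from the list above without any extra options.

```python
import subprocess
from pathlib import Path

# (argv, file fed to stdin or None, whether a failure should stop the run)
STEPS = [
    # Downloads the UCD 13 data; a failure to make CharWidths.txt is OK.
    (["make", "-kC", "data"], None, False),
    # Equivalent to: patch < utf8proc.c.patch
    (["patch"], "utf8proc.c.patch", True),
    (["./data_make.py", "--verbose", "--format=1",
      "--output=utf8proc_data.c"], None, True),
    (["make", "check"], None, True),
]


def run_steps(srcdir, dry_run=False):
    """Run the steps inside srcdir; with dry_run, only list them."""
    srcdir = Path(srcdir)
    listed = []
    for argv, stdin_name, must_succeed in STEPS:
        shown = " ".join(argv)
        if stdin_name:
            shown += " < " + stdin_name
        listed.append(shown)
        if dry_run:
            continue
        if stdin_name:
            with open(srcdir / stdin_name, "rb") as stdin:
                subprocess.run(argv, cwd=srcdir, stdin=stdin,
                               check=must_succeed)
        else:
            subprocess.run(argv, cwd=srcdir, check=must_succeed)
    return listed


# Dry run: prints the commands without executing anything.
for line in run_steps("utf8proc-2.6.1", dry_run=True):
    print(line)
```

Calling run_steps("utf8proc-2.6.1") without dry_run would execute the commands in that directory.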

Usage is:

data_make.py [-v|--verbose] [-f#|--format=#] [--fix26] [--cmap] [-o ‹out-file›|--output=‹out-file›] [‹data-dir›]

If unspecified the output file is utf8proc_data.out.c.
If unspecified the input data-dir is ./data.
If --format=0 alone is used (the default) then the output file should be identical to the original utf8proc_data.c file.
If --fix26 is used then the fixes described in issue #226 are applied to the tables.
If --cmap is used then the utf8proc_sequences table is split & the utf8proc_casemap table added. This requires the utf8proc.c.patch to be applied.
If --format=1 is used then --fix26 & --cmap are implied and the output file uses the new compact source form.
Using UCD 14 automatically forces --format=1 (thus --fix26 & --cmap too).
Using --verbose reports the options in effect & successful generation of the output file.
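
The way these options interact can be summarised in a short argparse sketch. It is not the parsing code in data_make.py, only a restatement of the rules above; parse_options and the ucd_major parameter (a stand-in for however the script detects the UCD version of the data files) are hypothetical.

```python
import argparse


def parse_options(argv, ucd_major=13):
    """Parse the documented options and apply the implication rules.

    ucd_major is a hypothetical stand-in for the UCD major version
    detected from the data files.
    """
    p = argparse.ArgumentParser(prog="data_make.py")
    p.add_argument("-v", "--verbose", action="store_true")
    p.add_argument("-f", "--format", type=int, default=0)
    p.add_argument("--fix26", action="store_true")
    p.add_argument("--cmap", action="store_true")
    p.add_argument("-o", "--output", default="utf8proc_data.out.c")
    p.add_argument("data_dir", nargs="?", default="./data")
    opts = p.parse_args(argv)

    # Using UCD 14 automatically forces --format=1.
    if ucd_major >= 14:
        opts.format = 1
    # --format=1 implies --fix26 and --cmap.
    if opts.format == 1:
        opts.fix26 = True
        opts.cmap = True
    return opts


opts = parse_options(["--verbose", "--format=1", "--output=utf8proc_data.c"])
print(opts)
```

With no arguments the sketch yields format 0, output utf8proc_data.out.c and data-dir ./data, matching the defaults listed above.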

data_make.zip

chris0e3 · 2021-09-16T02:18:27Z

With the release of Unicode 14.0 I have now also updated the make files.
I’ve attached the changed files below.

I also updated data_make.py to add a UNICODE_VERSION macro to utf8proc_data.c, and changed the utf8proc_unicode_version API to return it if defined.

To build & test: Copy utf8proc-2.6.1.tar.gz & the attached utf8proc-2.6.1-changes.tar.gz into ‹your-work-dir› and:

cd ‹your-work-dir›
tar -xf utf8proc-2.6.1.tar.gz
tar -xf utf8proc-2.6.1-changes.tar.gz
make -C utf8proc-2.6.1 update check UNICODE_VERSION=14.0.0

This will download the UCD data files, generate a utf8proc_data.c, compile the code & run the tests.

Alternatively, make -C utf8proc-2.6.1 update check UNICODE_VERSION=13.0.0 will build utf8proc using the older UCD 13 data. Also if you don’t specify any UNICODE_VERSION=… it defaults to 14.0.0.

utf8proc-2.6.1-changes.tar.gz

stevengj · 2021-09-16T12:13:26Z

This sounds great, I'll try to take a look at it later.

stevengj · 2021-09-16T12:14:05Z

Can you convert this into a pull request? A PR is much easier to review than a tarball of changes.

chris0e3 · 2021-09-16T14:04:58Z

> Can you convert this into a pull request? A PR is much easier to review than a tarball of changes.

I’m sorry, I’m not a git user. I don’t know how to do that.
I could probably attach a patch file here, if that would help.
[The python script is nearly 600 lines, but the other changes are very small.]

stevengj · 2021-09-16T17:08:41Z

> I’m sorry, I’m not a git user. I don’t know how to do that.

There are hundreds of tutorials online — it's pretty indispensable for participating in any free/open-source software projects these days, not to mention a lot of commercial projects.

(If you can write Python code with all of the features listed above, I'm sure you can learn git!)

In a pinch, I can take the .tar.gz file you posted and make a pull request for you, though.

chris0e3 · 2021-09-16T21:26:33Z

> (If you can write Python code with all of the features listed above, I'm sure you can learn git!)

I could, probably, learn git. I really don’t want to 🤡.
I’m not a Python programmer. I used it because I thought it would be acceptable, and I posted here because I was just trying to help out. I solved a problem and thought it could help others.

> In a pinch, I can take the .tar.gz file you posted and make a pull request for you, though.

Did the above commits give you what you wanted?
[I followed the instructions for ‘Linking a pull request …’, but just noticed that at the end it states “… will not be listed as a linked pull request”. So perhaps I have to do/should have done something else.]

Also, I appear to have missed the changes in 610730f.

stevengj · 2021-09-16T21:45:56Z

You have to create a new pull request based on the commits: https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

chris0e3 · 2021-09-17T00:40:34Z

> You have to create a new pull request based on the commits: https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

I read that and thought that “To open a pull request in a public repository, you must have write access …” meant it wasn’t what I wanted. So I followed this https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue. But apparently you have to push a pull request!
Anyway, I re-merged the changes from 610730f plus 1 additional warning.
[Of course I had based my changes on the released 2.6.1 code.]
And I also tweaked the Makefile so it still builds with the original 2.6.1 utf8proc_data.c as well as the newly generated ones for UCD 13 & 14.

[All done without git 🤓.]

stevengj · 2024-01-04T00:29:13Z

Closed in favor of #258

chris0e3 added a commit to chris0e3/utf8proc that referenced this issue Sep 16, 2021

Resolves JuliaStrings#227

772eca6

chris0e3 added a commit to chris0e3/utf8proc that referenced this issue Sep 16, 2021

Missing file. Resolves JuliaStrings#227

a9bee2f

chris0e3 mentioned this issue Sep 17, 2021

New data file generator with support for UCD 13 & 14 #228

Closed

Seelengrab mentioned this issue Oct 29, 2021

Support Unicode 14.0 JuliaLang/julia#42843

Closed

stevengj closed this as completed Jan 4, 2024
