New data file generator with support for UCD 13 & 14 #227

Closed
chris0e3 opened this issue Sep 3, 2021 · 9 comments
@chris0e3

chris0e3 commented Sep 3, 2021

Attached is data_make.py, a python3 script designed to combine & replace data/data_generator.rb & data/charwidths.jl and support both UCD 13 & 14. Also utf8proc.c.patch, a small change to utf8proc.c needed to support UCD 14.

Here are some of its features:

  • Written in Python (easier to read & support?); only uses (a little) sed. Tested on Python 3.7.4.
  • Doesn’t use an unspecified version of Ruby.
  • Doesn’t use an unspecified version of Julia.
  • Doesn’t require a previously built, unspecified, version of libutf8proc.
  • Runs to completion in 5-6 seconds (about 10x as fast as data_generator.rb).
  • Passes all utf8proc tests.
  • No changes to the public API.
  • Can generate a byte-for-byte identical utf8proc_data.c file compared to that contained in utf8proc 2.6.1.
  • Can generate an equivalent utf8proc_data.c source file that is over 1.1 MB smaller.
  • Writes informative header comments to the generated file.
  • Can process the latest UCD 14-dev data files and generate a utf8proc_data.c that passes all current tests.
    [Due to the increased size of UCD 14 data I have had to split utf8proc_sequences & added utf8proc_casemap to prevent index overflow. This requires a small patch to utf8proc.c.]
  • Can halve the size of utf8proc_stage1table. (Saves 4352 bytes.)
  • Can be used to create a utf8proc_properties table that is > 64,000 bytes smaller.
  • Doesn’t need data/Uppercase.txt, data/Lowercase.txt or data/CharWidths.txt files.
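
For readers unfamiliar with the stage1/stage2 tables mentioned in the list above, here is a minimal sketch of how a two-stage lookup table can be built. It is not taken from data_make.py: the function names and the toy property are hypothetical, and the 256-code-point block size is an assumption based on the usual utf8proc lookup. With that block size, stage 1 has 0x110000 / 256 = 4352 entries, so a saving of 4352 bytes would correspond to one byte per entry, though the thread does not say how the halving is actually done.

```python
BLOCK = 256          # assumption: one stage-1 entry per 256 code points
MAX_CP = 0x110000    # number of Unicode code points


def build_two_stage(prop_index):
    """Build (stage1, stage2) from a function mapping code point -> small int."""
    stage1, stage2, seen = [], [], {}
    for start in range(0, MAX_CP, BLOCK):
        block = tuple(prop_index(cp) for cp in range(start, start + BLOCK))
        if block not in seen:
            # Identical blocks are stored once; stage1 holds their offset.
            seen[block] = len(stage2)
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, cp):
    """Return the property index for code point cp."""
    return stage2[stage1[cp >> 8] + (cp & 0xFF)]


def toy_property(cp):
    """Hypothetical property: 1 for ASCII uppercase letters, else 0."""
    return 1 if 0x41 <= cp <= 0x5A else 0
```

Because most 256-code-point blocks are identical (unassigned ranges, for example), stage 2 stays small while stage 1 remains a fixed-size array.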

Building with the (still in development) UCD 14 requires a new Makefile. I haven’t supplied one here because UCD 14 is still in a state of flux & the URLs are changing. (I can supply one if requested.)
UCD 14 has increased the size of the generated data. I have had to split utf8proc_sequences & added utf8proc_casemap to prevent index overflow. This requires the small patch to utf8proc.c contained in utf8proc.c.patch. With the patch applied utf8proc.c still works with the original utf8proc_data.c, and the new format UCD 13 & 14 data.
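
The index overflow described above can be illustrated with a small sketch. It is not the code from data_make.py: `SequenceTable` and `pack` are hypothetical names, the 16-bit index limit is an assumption, and which sequences end up in utf8proc_casemap is guessed from the table's name.

```python
class SequenceTable:
    """Append-only table of code point sequences with a hard index limit."""

    def __init__(self, name, limit=1 << 16):
        # Assumption: indices into the table are stored in unsigned 16-bit fields.
        self.name = name
        self.limit = limit
        self.data = []
        self._starts = {}

    def add(self, seq):
        """Return the start index of seq, appending it if it is new."""
        key = tuple(seq)
        if key not in self._starts:
            start = len(self.data)
            if start >= self.limit:
                raise OverflowError(f"{self.name}: index {start} does not fit")
            self._starts[key] = start
            self.data.extend(key)
        return self._starts[key]


def pack(decompositions, case_mappings, split, limit=1 << 16):
    """Pack both kinds of sequence into one table, or into two when split is true."""
    sequences = SequenceTable("sequences", limit)
    casemap = SequenceTable("casemap", limit) if split else sequences
    decomp_idx = [sequences.add(s) for s in decompositions]
    case_idx = [casemap.add(s) for s in case_mappings]
    return sequences, casemap, decomp_idx, case_idx
```

With a single table every new sequence pushes later indices upward, so growth in either kind of data can overflow the shared index space. With two tables each kind gets its own index space, at the cost of the reader having to know which table an index refers to.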

To use:

  1. Download & unpack a clean copy of utf8proc-2.6.1.tar.gz.
  2. Unpack & copy the attached data_make.py & utf8proc.c.patch into the utf8proc-2.6.1 dir.
  3. Run make -kC data to download the UCD 13 data files. [It’s OK if CharWidths.txt is not made.]
  4. Run patch < utf8proc.c.patch.
  5. Run ./data_make.py --verbose --format=1 --output=utf8proc_data.c
  6. Run make check.
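
Steps 3 to 6 above could be scripted roughly as follows. This is only a sketch: `build_steps` and `run_steps` are hypothetical helpers, and the directory name assumes the tarball was unpacked as utf8proc-2.6.1. The first command is allowed to fail because the steps note that CharWidths.txt may not be made.

```python
import subprocess
from pathlib import Path


def build_steps(src_dir="utf8proc-2.6.1"):
    """Return (argv, stdin_path, must_succeed) tuples mirroring steps 3 to 6."""
    root = Path(src_dir)
    return [
        # make -k keeps going on errors, so a non-zero exit is tolerated here.
        (["make", "-kC", "data"], None, False),
        # Equivalent of: patch < utf8proc.c.patch
        (["patch"], root / "utf8proc.c.patch", True),
        (["./data_make.py", "--verbose", "--format=1",
          "--output=utf8proc_data.c"], None, True),
        (["make", "check"], None, True),
    ]


def run_steps(src_dir="utf8proc-2.6.1"):
    """Run each step inside src_dir, stopping at the first required failure."""
    for argv, stdin_path, must_succeed in build_steps(src_dir):
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            subprocess.run(argv, cwd=src_dir, stdin=stdin, check=must_succeed)
        finally:
            if stdin:
                stdin.close()
```

Calling `run_steps()` from the directory that contains utf8proc-2.6.1 would run the whole sequence. It needs make, patch, network access for the UCD download, and an executable data_make.py.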

Usage is:

data_make.py [-v|--verbose] [-f#|--format=#] [--fix26] [--cmap] [-o ‹out-file›|--output=‹out-file›] [‹data-dir›]

If unspecified the output file is utf8proc_data.out.c.
If unspecified the input data-dir is ./data.
If --format=0 alone is used (the default) then the output file should be identical to the original utf8proc_data.c file.
If --fix26 is used then the fixes described in issue #226 are applied to the tables.
If --cmap is used then the utf8proc_sequences table is split & the utf8proc_casemap table added. This requires the utf8proc.c.patch to be applied.
If --format=1 is used then --fix26 & --cmap are implied and the output file uses the new compact source form.
Using UCD 14 automatically forces --format=1 (thus --fix26 & --cmap too).
Using --verbose reports the options in effect & successful generation of the output file.
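
The option rules above can be summarised in a short argparse sketch. This is an illustration, not the parser from data_make.py: `parse_args` and its `ucd_major` parameter are hypothetical (the thread does not say how the script detects the UCD version), and restricting --format to 0 and 1 is an assumption.

```python
import argparse


def parse_args(argv, ucd_major=13):
    """Parse the documented options and apply the implication rules."""
    p = argparse.ArgumentParser(prog="data_make.py")
    p.add_argument("-v", "--verbose", action="store_true")
    # Assumption: only formats 0 and 1 exist.
    p.add_argument("-f", "--format", type=int, choices=(0, 1), default=0)
    p.add_argument("--fix26", action="store_true")
    p.add_argument("--cmap", action="store_true")
    p.add_argument("-o", "--output", default="utf8proc_data.out.c")
    p.add_argument("data_dir", nargs="?", default="./data")
    opts = p.parse_args(argv)
    if ucd_major >= 14:
        opts.format = 1                  # UCD 14 forces --format=1
    if opts.format == 1:
        opts.fix26 = opts.cmap = True    # --format=1 implies both
    return opts
```

For example, `parse_args(["-f1", "-o", "utf8proc_data.c"])` yields format 1 with fix26 and cmap both enabled, while `parse_args([])` leaves everything at the documented defaults.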

data_make.zip

@chris0e3
Author

With the release of Unicode 14.0 I have now also updated the make files.
I’ve attached the changed files below.

I also updated data_make.py to add a UNICODE_VERSION macro to utf8proc_data.c, and changed the utf8proc_unicode_version API to return it if defined.

To build & test: Copy utf8proc-2.6.1.tar.gz & the attached utf8proc-2.6.1-changes.tar.gz into ‹your-work-dir› and:

cd ‹your-work-dir›
tar -xf utf8proc-2.6.1.tar.gz
tar -xf utf8proc-2.6.1-changes.tar.gz
make -C utf8proc-2.6.1 update check UNICODE_VERSION=14.0.0

This will download the UCD data files, generate a utf8proc_data.c, compile the code & run the tests.

Alternatively, make -C utf8proc-2.6.1 update check UNICODE_VERSION=13.0.0 will build utf8proc using the older UCD 13 data. Also if you don’t specify any UNICODE_VERSION=… it defaults to 14.0.0.

utf8proc-2.6.1-changes.tar.gz

@stevengj
Member

This sounds great, I'll try to take a look at it later.

@stevengj
Member

Can you convert this into a pull request? A PR is much easier to review than a tarball of changes.

@chris0e3
Author

> Can you convert this into a pull request? A PR is much easier to review than a tarball of changes.

I’m sorry, I’m not a git user. I don’t know how to do that.
I could probably attach a patch file here, if that would help.
[The python script is nearly 600 lines, but the other changes are very small.]

@stevengj
Member

> I’m sorry, I’m not a git user. I don’t know how to do that.

There are hundreds of tutorials online — it's pretty indispensable for participating in any free/open-source software projects these days, not to mention a lot of commercial projects.

(If you can write Python code with all of the features listed above, I'm sure you can learn git!)

In a pinch, I can take the .tar.gz file you posted and make a pull request for you, though.

chris0e3 added a commit to chris0e3/utf8proc that referenced this issue Sep 16, 2021
chris0e3 added a commit to chris0e3/utf8proc that referenced this issue Sep 16, 2021
@chris0e3
Author

chris0e3 commented Sep 16, 2021

> (If you can write Python code with all of the features listed above, I'm sure you can learn git!)

I could, probably, learn git. I really don’t want to 🤡.
I’m not a Python programmer. I used it because I thought it would be acceptable, and I posted here because I was just trying to help out. I solved a problem and thought it could help others.

> In a pinch, I can take the .tar.gz file you posted and make a pull request for you, though.

Did the above commits give you what you wanted?
[I followed the instructions for ‘Linking a pull request …’, but just noticed that at the end it states “… will not be listed as a linked pull request”. So perhaps I have to do/should have done something else.]

Also, I appear to have missed the changes in 610730f.

@stevengj
Member

@chris0e3
Author

> You have to create a new pull request based on the commits: https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

I read that and thought that “To open a pull request in a public repository, you must have write access …” meant it wasn’t what I wanted. So I followed this https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue. But apparently you have to push a pull request!
Anyway, I re-merged the changes from 610730f plus 1 additional warning.
[Of course I had based my changes on the released 2.6.1 code.]
And I also tweaked the Makefile so it still builds with the original 2.6.1 utf8proc_data.c as well as the newly generated ones for UCD 13 & 14.

[All done without git 🤓.]

@stevengj
Member

stevengj commented Jan 4, 2024

Closed in favor of #258

stevengj closed this as completed Jan 4, 2024