Update data to Unicode 8.0.0 standard #45

Merged
jiahao merged 3 commits into master from cjh/unicode8
Jun 23, 2015

jiahao
Collaborator

@jiahao jiahao commented Jun 20, 2015

Closes #41

Some minor modifications to the Julia data-generation script were needed.

There's some trailing whitespace in the generated file but I really don't feel like fixing it.

@jiahao
Collaborator Author

jiahao commented Jun 20, 2015

The test failure in this PR is because @staticfloat's cached data is not updated to 8.0.0, so the data generation script on Travis fails the comparison with the committed data file.
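
For readers unfamiliar with this kind of check, here is a minimal sketch of a "regenerate and compare" test, assuming the generator output and the committed data file are both available as text. The function names `data_matches` and `changed_lines` are hypothetical and are not part of utf8proc's build scripts.

```python
import difflib
from typing import List


def data_matches(generated: str, committed: str) -> bool:
    """Return True when the regenerated table text equals the committed text."""
    return generated == committed


def changed_lines(generated: str, committed: str, limit: int = 5) -> List[str]:
    """Return up to `limit` added/removed lines, useful in a failure message."""
    diff = list(difflib.unified_diff(
        committed.splitlines(), generated.splitlines(),
        fromfile="committed", tofile="generated", lineterm="", n=0,
    ))
    # The first two entries are the '---'/'+++' file headers; skip them,
    # then keep only added ('+') and removed ('-') lines, not '@@' hunk headers.
    body = diff[2:]
    return [line for line in body if line.startswith(("+", "-"))][:limit]
```

A test built this way fails whenever the upstream input changes under it, which is the behaviour being debated here: the failure is noisy on a Unicode update, but it also proves the generator still runs.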

Should we really have a test that will fail every time we update the Unicode tables? Granted it won't happen so often, but I'm not sure this particular test is that valuable.

@staticfloat
Contributor

If I clear out the cached data, will that work for older versions of utf8proc?

@jiahao
Collaborator Author

jiahao commented Jun 20, 2015

I don't think so. We'd have a break where older versions have Unicode 7 tables while newer ones have Unicode 8 data.

There's also the minor issue that I had to explicitly edit out the use of the cached Unicode data in order to actually update the tables, so the update procedure no longer works out of the box.

@tkelman
Contributor

tkelman commented Jun 21, 2015

Is there a version-dependent URL where we can get the data from, so it won't look like the same file to the caching server? I thought there was some checksum validation for staticfloat/cache.julialang.org#3

@jiahao
Collaborator Author

jiahao commented Jun 21, 2015

Actually, yes. http://www.unicode.org/Public/UNIDATA looks like it is a symlink to the latest version, http://www.unicode.org/Public/8.0.0. The previous versions are accessible at http://www.unicode.org/Public/7.0.0, etc.
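
As an illustration of pinning the download to a release, here is a minimal sketch that builds a versioned URL from the base shown above. The helper name `versioned_url` and the default `ucd/UnicodeData.txt` relative path are assumptions made for the example, not something taken from the generator script.

```python
UNICODE_PUBLIC = "http://www.unicode.org/Public"


def versioned_url(version: str, relpath: str = "ucd/UnicodeData.txt") -> str:
    """Build a URL under a fixed Unicode release directory such as 8.0.0.

    `relpath` is an assumed location of the data file inside the release
    directory; adjust it to whatever the generator actually downloads.
    """
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError("expected a version like '8.0.0', got %r" % version)
    return "%s/%s/%s" % (UNICODE_PUBLIC, version, relpath)


print(versioned_url("8.0.0"))
print(versioned_url("7.0.0"))
```

Unlike the UNIDATA path, a URL built this way should keep pointing at the same release after the next Unicode version is published.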

@staticfloat
Contributor

Since the caching server only really pays attention to the basename, the versioned URLs unfortunately won't help us all that much. We'll have to come up with an elegant solution to that; maybe special-casing those URLs to have unique names on our S3 server (so they can both live side-by-side, and we know which one to serve)?
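
To make the basename collision concrete, here is a minimal sketch contrasting a basename-only cache key with one that also includes the version directory. Both functions are hypothetical and do not reflect how the caching server is actually implemented; the example URLs assume a `ucd/UnicodeData.txt` path under each release directory.

```python
import posixpath
import re
from urllib.parse import urlsplit


def basename_key(url: str) -> str:
    """Cache key that looks only at the last path component."""
    return posixpath.basename(urlsplit(url).path)


def versioned_key(url: str) -> str:
    """Cache key that prefixes the basename with an x.y.z path segment, if any."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return ""
    name = segments[-1]
    for seg in segments[:-1]:
        if re.fullmatch(r"\d+\.\d+\.\d+", seg):
            return seg + "-" + name
    return name


old = "http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt"
new = "http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt"
print(basename_key(old) == basename_key(new))    # same key, so the files collide
print(versioned_key(old), versioned_key(new))    # distinct keys, can live side by side
```

With keys of the second kind, both releases could be stored next to each other, which is the side-by-side arrangement suggested above.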

It looks like the caching server notices the file has changed, attempts to update it, and thinks it has succeeded (the ETag and MD5 of the file get saved correctly), but doesn't actually change the file. I'll look into it.
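
One way to check for that kind of mismatch is to recompute the digest of the stored bytes and compare it with the recorded one. The sketch below assumes the cache keeps an MD5 hex digest alongside each file; the function names are hypothetical and are not the caching server's code.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def cache_is_consistent(stored: bytes, recorded_md5: str) -> bool:
    """True when the bytes actually stored match the digest saved for them."""
    return md5_hex(stored) == recorded_md5


old_file = b"Unicode 7 table"
new_file = b"Unicode 8 table"
recorded = md5_hex(new_file)  # metadata was updated for the new file

print(cache_is_consistent(old_file, recorded))  # stale bytes, new metadata
print(cache_is_consistent(new_file, recorded))  # bytes and metadata agree
```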

@staticfloat
Contributor

Nope, I was wrong, it is actually changing the file, and it gets the changes right the very first time. I'm not sure what the problems here are, but the caching server is serving the correct files out.

@stevengj
Member

I don't want the Unicode-table generation code to bitrot, so having it break on the rare cases where we update the Unicode support seems a small price to pay...

@jiahao
Collaborator Author

jiahao commented Jun 23, 2015

@stevengj my original diagnosis doesn't seem to be correct, since the caching server apparently did notice the update to the Unicode data files.

@staticfloat
Contributor

It should be noted, however, that since the contents at the URL we've been requesting have changed, older versions of utf8proc are likely failing now.

@jiahao
Collaborator Author

jiahao commented Jun 23, 2015

I just restarted the build; if this still doesn't work, I'll have to force the test to print more output.

@jiahao
Collaborator Author

jiahao commented Jun 23, 2015

Rerunning the data generator seems to have fixed the problem. Will squash and commit.

jiahao added a commit that referenced this pull request Jun 23, 2015
Update data to Unicode 8.0.0 standard
@jiahao jiahao merged commit 327bf10 into master Jun 23, 2015
@jiahao jiahao deleted the cjh/unicode8 branch June 23, 2015 21:19
jiahao added a commit that referenced this pull request Jun 23, 2015
Link to Lua-mojibake (closes #44)

Bump Unicode version (ref: #45)
@jiahao
Collaborator Author

jiahao commented Jun 23, 2015

Merged and README updated. Is it time to tag a new version?

@nalimilan
Member

A new stable version would be very useful for me to get Julia 0.4.0-dev in the Fedora development version.

@stevengj
Member

Yes, it seems like it's time for a new version.

@ScottPJones
Contributor

I remember you had loads of fun getting a new version of utf8proc used by master. Is there a PR yet to do that (along with an update to NEWS.md, and tests on at least some new characters)?

Successfully merging this pull request may close these issues.

Update for Unicode 8
6 participants