Move Unicode-related functions to new Unicode stdlib package #25021

nalimilan · 2017-12-10T22:01:19Z

As decided at #14347. It turns out most character category predicates and similar functions are used somewhere in Base, so it does not seem possible to move them to the standard library (except for a few particular functions). So I kept these in Base but unexported them, and reexported them from Unicode (which is otherwise an empty module). I've also moved some functions to utf8proc.jl, both for clarity and because leaving them where they were caused problems during bootstrap: UTF8proc tried to import functions which had not been defined yet in Base.

I wasn't sure how many functions we want to move to the stdlib. For example, should isascii, isdigit and isxdigit be moved, given how simple they are, and given that they are not really Unicode-related?

StefanKarpinski · 2017-12-10T22:22:18Z

Nice, thanks for tackling this!

StefanKarpinski · 2017-12-10T22:25:27Z

stdlib/Unicode/docs/src/index.md

+
+```@docs
+Unicode.is_assigned_char
+Unicode.normalize_string


How about renaming is_assigned_char to isassigned and generalizing it to strings the same way the rest are, i.e. by an all reduction over the characters in the string? Similarly, we could rename normalize_string to normalize now that it's in the Unicode module.

nalimilan · 2017-12-11

I've added them in the commit deprecating isnumber in favor of isnumeric. Though note that the names clash with corresponding functions in Base, so they always need to be qualified. I guess that's OK.

isassigned was already restricted to Char arguments. But I noticed isascii accepts strings directly, which is inconsistent with other functions. Time to deprecate it too?

oscardssmith · 2017-12-10T23:42:53Z

While I agree that having a small base is valuable, I'm not sure that these changes make sense. Unicode is really complicated, and if people have to go to stdlib to get something that works, while base has things that work ~most of the time with ascii, I think the result will likely be a bunch of code that doesn't deal with unicode well.

StefanKarpinski · 2017-12-11T01:32:41Z

@oscardssmith, I think you're misinterpreting the situation here. Base strings fully support Unicode and always have. This is a collection of utility functions to query Unicode properties about characters and strings. The functions all have to do with querying character classes, computing text widths, string normalization, and changing character cases. These are all the things that might change from one version of Unicode to another whereas what's in base is not going to change – i.e. the basic mechanics of UTF-8. So it makes sense to separate it out since you might want to change the version of Unicode that your program supports independent of the version of Julia you're using, whereas nothing in base is specific to any particular version of Unicode.
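To make the distinction concrete, here is a minimal sketch of the kind of utility functions under discussion. It assumes a Julia 1.x installation, where `normalize` has to be qualified as `Unicode.normalize` and `graphemes` is exported by the `Unicode` stdlib; the variable names are illustrative only.

```julia
using Unicode

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = Unicode.normalize(decomposed, :NFC)

println(length(decomposed))                      # 2 code points
println(length(composed))                        # 1 code point
println(composed == "\u00e9")                    # true
println(length(collect(graphemes(decomposed))))  # 1 grapheme cluster
```

Plain string indexing, iteration, and UTF-8 encoding work without any of these calls; only the property queries and normalization depend on Unicode data tables.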

stevengj · 2017-12-11T13:35:35Z

Missing deprecations?

nalimilan · 2017-12-11T17:29:45Z

I've managed to get deprecations to work. It's somewhat tricky:

- The new deprecations conflict with the old deprecations from pred(s::AbstractString) to all(pred, s). I had to move the old deprecations to Unicode to fix warnings.
- If we export Unicode-related functions from Base.UTF8proc to Base to use them internally, we cannot import them into Unicode (to reexport them) without triggering deprecation warnings during the build. Therefore, I had to fully qualify all uses of these functions in Base (apart from submodules, where using Base.UTF8proc: ... is OK) so that the deprecated functions appear as totally different from the actual ones under Base.UTF8proc, which are reexported by Unicode. That's quite verbose but I couldn't find any other solution. EDIT: Note that this is only temporary, until the deprecations are removed.
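For readers unfamiliar with the older deprecation mentioned in the first item, the replacement form `all(pred, s)` applies a character predicate to every character of a string. A small sketch, assuming a Julia 1.x session where `isdigit` and `islowercase` are available from Base:

```julia
# Character predicates take a single character; strings go through `all`.
println(all(isdigit, "2017"))      # true: every character is an ASCII digit
println(all(isdigit, "20x7"))      # false: 'x' is not a digit
println(all(islowercase, "abc"))   # true
println(all(isdigit, ""))          # true: `all` over an empty string is vacuously true
```

The last line is worth keeping in mind when porting code from the old string methods.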

Finally, there's a problem with the manual: functions appear as Base.UTF8proc.* instead of Unicode.*. I don't know how to fix that. ~~I also haven't found how to make a reference to graphemes from Base docs to Unicode docs.~~ EDIT: Actually it was due to the problem described in the previous sentence: need to use the Base.UTF8proc.* name that appears in the manual.

StefanKarpinski · 2017-12-11T19:23:00Z

Can't we get rid of the old deprecations? They're either from a previous release, in which case they're due to be deleted anyway, or they're from this release cycle, in which case we don't need to keep them if they're superseded by newer deprecations.

nalimilan · 2017-12-11T21:45:07Z

> Can't we get rid of the old deprecations? They're either from a previous release, in which case they're due to be deleted anyway, or they're from this release cycle, in which case we don't need to keep them if they're superseded by newer deprecations.

Indeed, this deprecation was introduced in 0.6 so it should be fine to remove it. I've added a commit, at the same time as new deprecations for isnumeric, isassigned and normalize.

BTW, I'm not sure what the convention is for documenting functions in the stdlib. I've added the Unicode. prefix only to the signatures of normalize and isassigned, since they conflict with functions from Base. But I've added using Unicode to all examples, since it's not obvious when reading the manual online that it's needed.

Keep them under Base.Unicode since they are needed inside Base, but stop exporting them to Base since they would conflict with the deprecations. Base.Unicode is the new name for Base.UTF8proc, but including a few more functions.

To match the new name of the module.

isnumeric() is consistent with Python and Rust (but not Go), and harder to confuse with isdigit(). Improve documentation to make confusion less likely. Also fix a few uses where isdigit() is more appropriate than isnumber().
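The difference between the two predicates is easy to see on a character that is numeric but not a decimal digit. A short sketch, assuming a Julia 1.x session where both functions are available from Base:

```julia
# isdigit only accepts the ASCII digits '0' through '9'.
# isnumeric accepts any character in a Unicode numeric category.
println(isdigit('7'))     # true
println(isnumeric('7'))   # true
println(isdigit('¾'))     # false: U+00BE is a fraction, not a decimal digit
println(isnumeric('¾'))   # true
println(isnumeric('α'))   # false: a letter
```

Code that parses integers usually wants `isdigit`, which is the kind of fix the commit message refers to.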

maleadt · 2017-12-13T10:08:56Z

Would have been nice to have Compat functionality (i.e. using Compat.Unicode) at the time of merging this PR, to avoid temporary VERSION >= v"0.7.0-DEV.2915" checks.

nalimilan · 2017-12-13T10:14:31Z

I don't think we generally add Compat support before merging PRs, especially since the priority is to merge breaking PRs this week before the feature freeze. But indeed it's going to be needed.
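The temporary check mentioned above would look roughly like the following in a package that has to load on both sides of the change. This is only a sketch: the version bound is the one quoted in the comment, and the `graphemes` call is an illustrative use, not taken from any particular package.

```julia
# Load the new stdlib only where it exists; before the move,
# the same functions were still exported from Base.
@static if VERSION >= v"0.7.0-DEV.2915"
    using Unicode
end

println(length(collect(graphemes("ab"))))   # 2
```

A Compat shim replaces this conditional with a single unconditional `using` line, which is why it was requested.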

martinholters · 2017-12-14T11:00:56Z

julia> using Unicode

julia> isassigned
WARNING: both Unicode and Base export "isassigned"; uses of it in module Main must be qualified
ERROR: UndefVarError: isassigned not defined

I don't think exporting isassigned from Unicode is particularly helpful. (using Unicode: isassigned is possible without an export.)

nalimilan · 2017-12-14T14:45:48Z

Yeah, probably. I'm not sure either what exporting isassigned and normalize from Unicode allows that is not possible without exporting them, since the conflict with Base.isassigned makes them impossible to use unqualified anyway. See #25079.
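In practice, qualified access sidesteps the clash entirely. A minimal sketch, assuming a Julia 1.x installation in which `Unicode` does not export these two names:

```julia
import Unicode   # brings only the module name into scope, no exported names

# Unicode property query on a character.
println(Unicode.isassigned('a'))                     # true

# Base.isassigned is unrelated: it checks whether an array index holds a value.
println(isassigned([10, 20], 2))                     # true

# normalize also needs the module prefix.
println(Unicode.normalize("JuLiA", casefold=true))   # julia
```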

KristofferC · 2018-01-07T15:52:06Z

This should probably have defined the functions in the Unicode stdlib instead of just reexporting existing functions. Otherwise e.g. the documentation will write these functions as Base.Unicode.isnumeric.

nalimilan · 2018-01-07T15:54:51Z

I've mentioned that problem above, and couldn't find a solution. At least Base.Unicode.isnumeric is less confusing than Base.UTF8proc.isnumeric (as it appeared in the original version of the PR). Help welcome.

nalimilan · 2018-01-07T15:55:46Z

Ah, and the functions are defined in Base not because I liked that, but because they are used all over the place in Base.

fredrikekre · 2018-01-07T15:56:44Z

We could do

module Unicode
isnumeric(args...) = Base.isnumeric(args...)
...
end

and move the docs to document the functions in the Unicode module.
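Spelled out a little further, the wrapper idea could look like the sketch below. The module name `UnicodeWrapperSketch` is hypothetical, chosen so as not to shadow the real stdlib, and the code assumes a Julia 1.x session where `Base.isnumeric` is accessible.

```julia
module UnicodeWrapperSketch   # hypothetical name, not the real stdlib module

"""
    isnumeric(c::AbstractChar) -> Bool

Forward to the implementation in Base. Because the function and its docstring
are defined here, tools that report a function's module see this module.
"""
isnumeric(c::AbstractChar) = Base.isnumeric(c)

end # module

println(UnicodeWrapperSketch.isnumeric('5'))            # true
println(parentmodule(UnicodeWrapperSketch.isnumeric))   # the wrapper module, not Base
```

The cost is one forwarding definition per function, which is why the approach is easier for a small module than for a large one.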

KristofferC · 2018-01-07T15:58:28Z

This is how I did it with CRC32c: https://github.com/JuliaLang/julia/blob/master/stdlib/CRC32c/src/CRC32c.jl (but that was fewer functions so easier to do).

nalimilan · 2018-02-05T21:56:22Z

OK, see #25902.

nalimilan added the domain:strings and domain:unicode labels Dec 10, 2017

nalimilan mentioned this pull request Dec 10, 2017

Replace isnumber(), etc. with a single function #14347

Open

StefanKarpinski approved these changes Dec 10, 2017

ararslan added the kind:excision and stdlib labels Dec 10, 2017

nalimilan force-pushed the nl/unicode branch from 66b5e9c to 4a804b4 Compare December 11, 2017 17:26

nalimilan force-pushed the nl/unicode branch from 4a804b4 to 8070033 Compare December 11, 2017 19:36

nalimilan mentioned this pull request Dec 11, 2017

Deprecate isnumber() in favor of isnumeric() #25008

Closed

nalimilan force-pushed the nl/unicode branch from 8070033 to 189e2a7 Compare December 11, 2017 21:41

nalimilan force-pushed the nl/unicode branch 2 times, most recently from c63baff to e2d1088 Compare December 12, 2017 08:37

nalimilan changed the title ~~WIP: Move Unicode-related functions to new Unicode stdlib package~~ Move Unicode-related functions to new Unicode stdlib package Dec 12, 2017

nalimilan added 3 commits December 12, 2017 22:28

Rename base/strings/utf8proc.jl to base/strings/unicode.jl

635826b

To match the new name of the module.

nalimilan force-pushed the nl/unicode branch from e2d1088 to 756936a Compare December 12, 2017 21:32

nalimilan merged commit 295b098 into master Dec 13, 2017

nalimilan deleted the nl/unicode branch December 13, 2017 08:29

StefanKarpinski mentioned this pull request Dec 13, 2017

titlecase: chars not starting a word can be converted to lowercase #23393

Merged

ararslan mentioned this pull request Dec 13, 2017

Fix for Base Unicode functions moving to stdlib JuliaIO/JSON.jl#229

Merged

StefanKarpinski mentioned this pull request Dec 13, 2017

lpad, rpad use textwidth/char count incoherently #25016

Closed

nalimilan mentioned this pull request Dec 14, 2017

Move back isascii() from Unicode to Base #25076

Merged

nalimilan mentioned this pull request Dec 14, 2017

Stop exporting isassigned() and normalize() from Unicode #25079

Merged

omus mentioned this pull request Dec 15, 2017

Add Compat.Unicode for standard library Unicode JuliaLang/Compat.jl#432

Merged

stevengj mentioned this pull request Mar 1, 2018

Compat.Unicode.normalize refers to wrong functions JuliaLang/Compat.jl#506

Closed

cormullion mentioned this pull request Mar 18, 2018

NEWS.md is getting a bit untidy #26508

Closed

jmkuhn mentioned this pull request May 15, 2018

Add isletter JuliaLang/Compat.jl#542

Merged
