Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Unicode-related functions to new Unicode stdlib package #25021

Merged
merged 3 commits into from Dec 13, 2017

Conversation

nalimilan
Copy link
Member

As decided at #14347. It turns out most character category predicates and similar functions are used somewhere in Base, so it does not sound possible to move them to the standard library (except for a few particular functions). So I kept these in Base but unexported them, and reexported them from Unicode (which is otherwise an empty module). I've also moved some functions to utf8proc.jl for clarity and because it created problems during bootstrap due to attempts to import functions in UTF8proc which had not been defined yet in Base.

I wasn't sure how many functions we want to move to the stdlib. For example, should isascii, isdigit and isxdigit be moved, given how simple they are, and given that they are not really Unicode-related?

@nalimilan nalimilan added domain:strings "Strings!" domain:unicode Related to unicode characters and encodings labels Dec 10, 2017
@StefanKarpinski
Copy link
Sponsor Member

Nice, thanks for tackling this!


```@docs
Unicode.is_assigned_char
Unicode.normalize_string
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming is_assigned_char to isassigned and generalizing it to strings the same way the rest are – i.e. by all reduction over the characters in the string. Similarly, we could rename normalize_string to normalize now that it's in the Unicode module.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added them in the commit deprecating isnumber in favor of isnumeric. Though note that the names clash with corresponding functions in Base, so they always need to be qualified. I guess that's OK.

isassigned was already restricted to Char arguments. But I noticed isascii accepts strings directly, which is inconsistent with other functions. Time to deprecate it too?

@ararslan ararslan added kind:excision Removal of code from Base or the repository stdlib Julia's standard library labels Dec 10, 2017
@oscardssmith
Copy link
Member

While I agree that having a small base is valuable, I'm not sure that these changes make sense. Unicode is really complicated, and if people have to go to stdlib to get something that works, while base has things that work ~most of the time with ascii, I think the result will likely be a bunch of code that doesn't deal with unicode well.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Dec 11, 2017

@oscardssmith, I think you're misinterpreting the situation here. Base strings fully support Unicode and always have. This is a collection of utility functions to query Unicode properties about characters and strings. The functions all have to do with querying character classes, computing text widths, string normalization, and changing character cases. These are all the things that might change from one version of Unicode to another whereas what's in base is not going to change – i.e. the basic mechanics of UTF-8. So it makes sense to separate it out since you might want to change the version of Unicode that your program supports independent of the version of Julia you're using, whereas nothing in base is specific to any particular version of Unicode.

@stevengj
Copy link
Member

Missing deprecations?

@nalimilan
Copy link
Member Author

nalimilan commented Dec 11, 2017

I've managed to get deprecations to work. It's somewhat tricky:

  • The new deprecations conflict with the old deprecations from pred(s::AbstractString) to all(pred, s). I had to move the old deprecations to Unicode to fix warnings.
  • If we export Unicode-related functions from Base.UTF8proc to Base to use them internally, we cannot import them into Unicode (to reexport them) without triggering deprecation warnings during the build. Therefore, I had to fully qualify all uses of these functions in Base (apart from submodules, where using Base.UTF8proc: ... is OK) so that the deprecated functions appear as totally different from the actual ones under Base.UTF8proc, which are reexported by Unicode. That's quite verbose but I couldn't find any other solution. EDIT: Note that this is only temporary, until the deprecations are removed.

Finally, there's a problem with the manual: functions appear as Base.UTF8proc.* instead of Unicode.*. I don't know how to fix that. I also haven't found how to make a reference to graphemes from Base docs to Unicode docs. EDIT: Actually it was due to the problem described in the previous sentence: need to use the Base.UTF8proc.* name that appears in the manual.

@StefanKarpinski
Copy link
Sponsor Member

Can't we get rid of the old deprecations? They're either from a previous release, in which case they're due to be deleted anyway, or they're from this release cycle, in which case we don't need to keep them if they're superseded by newer deprecations.

@nalimilan
Copy link
Member Author

Can't we get rid of the old deprecations? They're either from a previous release, in which case they're due to be deleted anyway, or they're from this release cycle, in which case we don't need to keep them if they're superseded by newer deprecations.

Indeed, this deprecation was introduced in 0.6 so it should be fine to remove it. I've added a commit, at the same time as new deprecations for isnumeric, isassigned and normalize.

BTW, I'm not sure what's the convention to document functions in the stdlib. I've added the Unicode. prefix only to the signatures of normalize and isassigned, since they conflict with functions from Base. But I've added using Unicode to all examples, since it's not obvious when reading the manual online that it's needed.

@nalimilan nalimilan force-pushed the nl/unicode branch 2 times, most recently from c63baff to e2d1088 Compare December 12, 2017 08:37
@nalimilan nalimilan changed the title WIP: Move Unicode-related functions to new Unicode stdlib package Move Unicode-related functions to new Unicode stdlib package Dec 12, 2017
Keep them under Base.Unicode since they are needed inside Base,
but stop exporting them to Base since they would conflict with
the deprecations. Base.Unicode is the new name for Base.UTF8proc,
but including a few more functions.
isnumeric() is consistent with Python and Rust (but not Go), and less easy to confuse
with isdigit(). Improve documentation to make confusion less easy. Also fix a few uses
where isdigit() is more appropriate than isnumber().
@maleadt
Copy link
Member

maleadt commented Dec 13, 2017

Would have been nice to have Compat functionality (ie. using Compat.Unicode) at the time of merging this PR, to avoid temporary temporary VERSION >= v"0.7.0-DEV.2915" checks.

@nalimilan
Copy link
Member Author

I don't think we generally add Compat support before merging PRs, especially since the priority is to merge breaking PRs this week before the feature freeze. But indeed it's going to be needed.

@martinholters
Copy link
Member

julia> using Unicode

julia> isassigned
WARNING: both Unicode and Base export "isassigned"; uses of it in module Main must be qualified
ERROR: UndefVarError: isassigned not defined

I don't think export isassigned from Unicode is particularly helpful. (using Unicode: isassigned is possible without an export.)

@nalimilan
Copy link
Member Author

Yeah, probably. I'm not sure either what exporting isassigned and normalize from Unicode allow which is not possible without exporting them, since the conflict with Base.isassigned makes them impossible to use unqualified anyway. See #25079.

@KristofferC
Copy link
Sponsor Member

This should probably have defined the functions in the Unicode stdlib instead of just reexporting existing functions. Otherwise e.g. the documentation will write these functions as Base.Unicode.isnumeric.

@nalimilan
Copy link
Member Author

I've mentioned that problem above, and couldn't find a solution. At least Base.Unicode.isnumeric is less confusing than Base.UTF8proc.isnumeric (as it appeared in the original version of the PR). Help welcome.

@nalimilan
Copy link
Member Author

Ah, and the functions are not defined in Base because I liked that, it's because they are used all over the place in Base.

@fredrikekre
Copy link
Member

We could do

module Unicode
isnumeric(args...) = Base.isnumeric(args...)
...
end

and move the docs to document the functions in the Unicode module

@KristofferC
Copy link
Sponsor Member

This is how I did it with CRC32c: https://github.com/JuliaLang/julia/blob/master/stdlib/CRC32c/src/CRC32c.jl (but that was fewer functions so easier to do).

@nalimilan
Copy link
Member Author

OK, see #25902.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!" domain:unicode Related to unicode characters and encodings kind:excision Removal of code from Base or the repository stdlib Julia's standard library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants