Make Char no longer a subtype of Integer #8816

Merged
merged 1 commit into from Oct 27, 2014

Projects

None yet
@jakebolewski
Member

I feel that this is surprising behavior, and this issue has come up in #5844 and #8569.

All the tests pass except for readlm and parallel (which hangs forever). I guess I'm looking for a up / down vote on this (breaking) behavior before fixing the last remaining issues.

@johnmyleswhite
Member

+1 to this idea

@johnmyleswhite
Member

If it helps to argue for this idea, I'd say that codepoints are a convenience numeric encoding for what is clearly a categorical variable that doesn't possess any meaningful numerical properties.

@StefanKarpinski
Member

I agree but I also don't see why at least comparisons with integers can't work.

@jakebolewski
Member

If it helps to argue for this idea, I'd say that codepoints are a convenience numeric encoding for what is clearly a categorical variable that doesn't possess any meaningful numerical properties.

I agree but we don't really have a well defined interface to support this concept. @JeffBezanson has pointed out that Ptr's are similar in this regard.

I also don't see why at least comparisons with integers can't work.

I wanted to be conservative, integer comparison can be trivially added back now.

@StefanKarpinski
Member

So having thought about this a bit more I think that chars should not be numbers but support these very specific operations:

  • comparisons with numbers
  • Char - Char = Int
  • Char + Int = Char

That covers all the useful operations you generally want to do with characters as integers but is specific enough to avoid too much confusion.

@StefanKarpinski
Member

Oh, in see that you have exactly those in here.

@jakebolewski
Member

I left out comparisons in the beginning as it was useful for finding what parts of Base this change touched.

I'll point out thatPtr does not allow comparisons with Integers (the other two are defined). From my experience, character equality to integers (==) is a source of bugs, but we have egal as a fallback in this case.

@StefanKarpinski
Member

Why do you think it's a source of bugs? There's far less use for comparisons of pointers than for characters.

@jakebolewski
Member

Off the top of my head, I can't think of any two types in Base that define an equality relationship and do not share a common abstract super-type.

@StefanKarpinski
Member

They do – Any!

@quinnj
Member
quinnj commented Oct 26, 2014

The grisu code uses Char == Int comparisons. I wouldn't be surprised if other printing/showing code utilized it somehow.

Also Period types in the Dates module can be equal to Real.

Just a couple of data points.

@jakebolewski
Member

Ok fair enough :-)

@quinnj are you sure? If you turn off character / integer comparison then all the grisu tests still pass. That might signal that are parts that are not being tested at the moment but I find that hard to believe.

@quinnj
Member
quinnj commented Oct 26, 2014

Ah, perhaps they don't. They originally did to mirror the C++ implementation, but I did end up changing over to using raw Array{Uint8,1} buffers instead. I still would have thought there'd have been a few comparisons in there at some point. I guess if not, all the better, right?

@jakebolewski jakebolewski make Char no longer a subtype of Integer
Char now supports a limited set of "integer like" behavior.

* comparisons with integers
* Char - Char = Int
* Char + Int = Char

update docs and add a NEWS.md entry
ac512af
@jakebolewski jakebolewski changed the title from WIP: Make Char no longer a subtype of Integer to Make Char no longer a subtype of Integer Oct 27, 2014
@jakebolewski
Member

Travis is passing now and I've updated the docs. Should comparison / equality operations be defined for all Numbers or just Integers?

@ivarne
Contributor
ivarne commented Oct 27, 2014

Nice to get rid of +(::Char, ::Char) with this PR!

We'll get bug reports if people think '\n' == 10.0 makes sense. Char isn't a Number, so it shouldn't behave like one if people don't complain about missing methods that they need for some reason. #7512 will also help.

@StefanKarpinski
Member

I agree with @ivarne's take – let's stick with just allowing comparisons to integers for now.

@jakebolewski
Member

That's what I have now. Good to merge then?

@nalimilan
Contributor

I support @jakebolewski's position that comparing ints and chars is more a source of bugs than anything. Inequalities like '0' != 0 can be particularly confusing. Note that @quinnj stated above he used it, before realizing he had found a better way of doing that: Uint8. I think it may be better to force users to choose between char or int behavior, by using Char or Uint8.

@StefanKarpinski
Member

Perhaps, but we can make this less disruptive change first and then consider whether to do that. I'm good with merging this now.

@jakebolewski
Member

Actually, no code in Base relies on Char / Integer comparison to be defined after this PR, so we can trivially switch to that behavior in the future if we so choose.

@jakebolewski jakebolewski merged commit 47dac56 into master Oct 27, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@jakebolewski jakebolewski deleted the jcb/char_not_integer branch Oct 27, 2014
@Jutho Jutho commented on the diff Oct 28, 2014
@@ -29,6 +29,13 @@ Language changes
dicts are synchronized. As part of this change, `=>` is parsed as a normal
operator, and `Base` defines it to construct `Pair` objects ([#6739]).
+ * `Char` is no longer a subtype of `Integer`. ([#8816])
+ Char now supports a more limited set of operations with `Integer` types:
+
+ * comparison / equality
+ * `Char` + `Int` = `Int`
@Jutho
Jutho Oct 28, 2014 Contributor

Wasn't this supposed to be Char + Int = Char ?

@ivarne
ivarne Oct 28, 2014 Contributor

Seems like that's how it is implemented

@ivarne
ivarne Oct 28, 2014 Contributor

Fixed in 816ee25

@ivarne ivarne commented on the diff Oct 28, 2014
base/char.jl
@@ -40,14 +56,10 @@ promote_rule(::Type{Char}, ::Type{Uint128}) = Uint128
@ivarne
ivarne Oct 28, 2014 Contributor

Do we want to keep these bitwise operations for Char?

@kmsquire
kmsquire Oct 28, 2014 Member

Might be useful for masking.

@ivarne ivarne referenced this pull request in JuliaCollections/DataStructures.jl Oct 28, 2014
Closed

[PkgEval] DataStructures may have a testing issue on Julia 0.4 (2014-10-28) #65

@IainNZ
Member
IainNZ commented Oct 28, 2014

PkgEval: This unsurprisingly broke a few things, including people doing something involving getindex, Char and an Array{Bool}

@ivarne ivarne added a commit to JuliaIO/JSON.jl that referenced this pull request Oct 28, 2014
@ivarne ivarne Convert to Int before indexing with Char in Array
Regression fix from JuliaLang/julia#8816
Fixes #84
59c6fd6
@IainNZ
Member
IainNZ commented Oct 28, 2014

TextPlot's error is ERROR:|has no method matching |(::Char, ::Int64), for reference.

@ivarne ivarne referenced this pull request in JuliaIO/JSON.jl Oct 28, 2014
Merged

Convert to Int before indexing with Char in Array #85

@IainNZ
Member
IainNZ commented Oct 28, 2014

DataStructures:

ERROR: test error in expression: sort(collect(keys(d))) == ['a':'z']
`isless` has no method matching isless(::Char, ::Char)

(Does this mean you can't sort strings?)

@kmsquire
Member

No, you just can't sort Chars.

@IainNZ
Member
IainNZ commented Oct 28, 2014

If you can write 'a':'z' and it produces the letters 'a','b',...,'z', it implies (to me) there is some sort of sensible ordering to apply

@Jutho
Contributor
Jutho commented Oct 28, 2014

if Char - `Char produces an integer, I would also argue that there is some order, given by the sign of that integer.

On 28 Oct 2014, at 15:47, Iain Dunning notifications@github.com wrote:

If you can write 'a':'z' and it produces the letters 'a','b',...,'z', it implies (to me) there is some sort of sensible ordering to apply


Reply to this email directly or view it on GitHub #8816 (comment).

@jiahao
Member
jiahao commented Oct 28, 2014
julia> isless('a', 'c')
ERROR: `isless` has no method matching isless(::Char, ::Char)

julia> 'a' < 'c'
true

I forget how isless and < are supposed to work again (is one a partial order and the other a total order?), but both comparisons probably should work.

@ivarne
Contributor
ivarne commented Oct 28, 2014

According to the documentation we should rather implement isless(::Char, ::Char) than <(::Char, ::Char) now that Char isn't a numeric type anymore. There is also a default for < that calls isless.

help?> isless
Base.isless(x, y)

   Test whether "x" is less than "y", according to a canonical
   total order. Values that are normally unordered, such as "NaN",
   are ordered in an arbitrary but consistent fashion. This is the
   default comparison used by "sort". Non-numeric types with a
   canonical total order should implement this function. Numeric types
   only need to implement it if they have special values such as
   "NaN".
@ivarne ivarne added a commit that referenced this pull request Oct 28, 2014
@ivarne ivarne Make Char implement isless instead of <
See discussion in #8816
b802443
@andreasnoack
Member

Is "canonical total order" CS 101? I know what a total order is, but the canonical part is not obvious to me and Wikipedia's definition:

in the context of sorting, a standard order, typically a total order, obeying certain rules, e.g. lexicographic order for strings

didn't really help as "standard order" seems to be a non-standard term and "obeying certain rules" is rather vague.

@jakebolewski
Member

@ivarne these errors are gaps in test coverage, so any fix should include tests.

@ivarne
Contributor
ivarne commented Oct 28, 2014

Definitely true that. Seems like we have a pretty glaring hole in test coverage for Char. Char only occurs 10 times in /tests/*.

@garborg garborg commented on the diff Oct 28, 2014
base/datafmt.jl
asci = true
d = sbuff.data
for idx in 1:length(d)
- (d[idx] < 0x80) ? continue : (asci = false; break)
+ (char(d[idx]) < char(0x80)) ? continue : (asci = false; break)
@garborg
garborg Oct 28, 2014 Member

I wouldn't have thought to do this, but now I'm wondering if I should -- what's the benefit?

@JeffBezanson
JeffBezanson Oct 28, 2014 Member

Yeah, this doesn't make sense. This code is really comparing bytes.

@jakebolewski
jakebolewski Oct 28, 2014 Member

I think this is because for UTF32 strings this is a Char / Integer comparison, and I removed those definitions some intermediate point.

@garborg
garborg Oct 28, 2014 Member

I see -- good to know that's what UTF32String looks like under the hood. Thanks.

@JeffBezanson
JeffBezanson Oct 28, 2014 Member

Ok, I got confused because this code is kind of wrong. If it's supposed to work for arbitrary Strings it shouldn't be accessing .data.

@tonyhffong
Member

What should be the replacement for the logic for using ncurses where we need to test

c == char(-1)

It's now giving me InexactError()

@ivarne
Contributor
ivarne commented Nov 5, 2014

Maybe c == '\Uffffffff' or the somewhat convoluted c == reinterpret(Char,Int32(-1))?

I also have trouble understanding why we think of Char as unsigned. Go makes interprets its Char a signed quantity and all the negative numbers are invalid code points anyway.

@tonyhffong
Member

yeah just successfully tested that 5 secs ago. It's not a big deal. Thanks.

@ivarne
Contributor
ivarne commented Nov 5, 2014

typemax(Char) will also work (as long as Char is unsigned).

@tkelman
Member
tkelman commented Nov 5, 2014

Re: signedness of char, I'll bring up #7472 and #7303 in the spirit of unfun, do-not-want, things to watch out for, where char being (sometimes? depending on platform?) a signed fixed-width type caused nastiness like introducing unicode into the parser resulting in the windows binaries no longer working on XP...

@ivarne
Contributor
ivarne commented Nov 5, 2014

@tkelman How is issues with the signdness of the C 8 bit datatype char relevant for how we should treat integer conversion to the Julia 32 bit type Char?

I don't say that the signdness matters, I just want there to be reason why we behave differently than Go, which have very similar Unicode semantics with Julia.

@tkelman
Member
tkelman commented Nov 5, 2014

Just precedent for unforeseen consequences of a related, but not identical, issue.

@StefanKarpinski
Member

I'd want to know what the sign bit would mean. One option would be to serve as an "invalid" flag while allowing some payload.

@pao
Member
pao commented Nov 5, 2014

Not sure if it would come into play here, but Java rejects (valid) byte patterns expressed as hex literals because it treats its byte type as signed:

private static final byte[] sync = {0xFF, 0x12};

is not permitted. This is annoying and I'd appreciate its Julia equivalent working independent of char's signedness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment