
hashing of ranges is awful #5778

Closed
StefanKarpinski opened this issue Feb 12, 2014 · 36 comments
Labels: kind:breaking (This change will break code), performance (Must go faster)
Milestone: 0.3

Comments

@StefanKarpinski (Sponsor Member)

julia> @which hash(1:10)
hash(a::AbstractArray{T,N}) at dict.jl:247

The first issue is that this is incredibly slow:

julia/base/dict.jl

Lines 246 to 252 in 72a65be

function hash(a::AbstractArray)
h::Uint = hash(size(a))+1
for i=1:length(a)
h = bitmix(h,int(hash(a[i])))
end
return h
end

It's unclear to me why we're using indexing here instead of just iterating, but the code is very old, so maybe it made sense at the time.
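For context, the fix is conceptually simple: a range is fully determined by a few fields, so a specialized hash can run in O(1) by hashing those fields instead of every element. A minimal sketch in modern Julia syntax (my_range_hash is a hypothetical name, not what Base defines):

```julia
# Hypothetical O(1) hash for a range: mix only the defining fields
# (first, step, length) instead of walking every element.
# Note: this does NOT produce the same value as hashing the equivalent
# dense vector, which is exactly the compatibility problem discussed
# in this thread.
my_range_hash(r::AbstractRange, h::UInt = zero(UInt)) =
    hash(length(r), hash(step(r), hash(first(r), h)))
```

Runtime is independent of the range's length, which is the whole point of specializing.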

@jakebolewski (Member)

I'm curious, why do you have to iterate through the entire range?

@JeffBezanson (Sponsor Member)

That's the point, you don't. It's just that ranges lack a specific implementation of hash and fall back on a generic array method.

@JeffBezanson (Sponsor Member)

Oh I know how this happened: ranges are isequal to vectors with the same contents.

@StefanKarpinski (Sponsor Member, Author)

Add this to the list of isequal issues. We should maybe have a label for this.

@JeffBezanson (Sponsor Member)

There might not be anything we can do here; I expect 1:10 == [1:10] is uncontroversial. Making isequal different in that case is possible, but that doesn't seem to be the direction we're moving with it.

@jakebolewski (Member)

I didn't realize that 1:10 == [1:10]. Personally, this doesn't make much sense to me, but I realize that it follows Matlab's behavior.

@JeffBezanson (Sponsor Member)

We also have A == sparse(A).

@staticfloat (Sponsor Member)

I feel like the isequal issue is separate from this one though; we can just write a new hash() function that specializes on ranges and this performance issue disappears, right?

@JeffBezanson (Sponsor Member)

In theory; but there would need to be an O(1) algorithm that gives the same answer as the O(n) hash of a dense array of the same values. Or we could check whether a given dense vector is equal to some range, but that doesn't sound very easy either.
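The second option mentioned here — checking whether a given dense vector is equal to some range — can at least be sketched for the integer case (as_range is a hypothetical helper; the float case is far messier, as the rest of the thread shows):

```julia
# Hypothetical helper: return the range a dense Int vector is equal to,
# or nothing if it has no constant stride. Still O(n) to check, but it
# would let a vector hash defer to an O(1) range hash when possible.
function as_range(v::AbstractVector{Int})
    isempty(v) && return 1:0                  # any empty range works
    length(v) == 1 && return v[1]:v[1]
    s = v[2] - v[1]
    s == 0 && return nothing                  # StepRange can't have a zero step
    for i in 3:length(v)
        v[i] - v[i-1] == s || return nothing  # stride must be constant
    end
    return v[1]:s:v[end]
end
```

This handles increasing and decreasing strides uniformly, which foreshadows the edge cases reported further down.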

@stevengj (Member)

We already decided in #3385 to make isequal(1,1.0) == false even though 1 == 1.0, precisely because it was impractical to hash different numeric types to the same value.

Why doesn't the same logic apply here? Have 1:10 == [1:10] but isequal(1:10, [1:10]) == false, and let them hash to different values.

@StefanKarpinski (Sponsor Member, Author)

That decision has been an almost unmitigated disaster. We really need to change it back.

@stevengj (Member)

What disastrous consequences have ensued? I just don't see how any other decision could possibly be practical if you want (a) isequal(a,b) to imply hash(a) == hash(b) and (b) hashing to be fast.

@JeffBezanson (Sponsor Member)

I wouldn't describe it as an unmitigated disaster, but as sometimes mildly inconvenient.

@StefanKarpinski (Sponsor Member, Author)

This, among other things:

julia> d = Dict{Int32,Any}()
Dict{Int32,Any}()

julia> d[1] = "foo"
ERROR: 1 is not a valid key for type Int32
 in setindex! at dict.jl:502

I don't have time to give a comprehensive list of the problems, but it does not work well. It makes dictionaries super finicky and brittle.

@stevengj (Member)

That seems like an orthogonal issue. Why not just define getindex{K,V}(d::Dict{K,V}, v) = getindex(d, convert(K, v))?

@simonster (Member)

@stevengj Because then getindex would throw when convert(K, v) is not implemented or throws an InexactError. Some previous discussion in #4166.

@stevengj (Member)

@simonster, what's wrong with throwing an error in that case? Seems like that's what I would want: if convert is not implemented, or if convert(K, v) != v (not isequal, except that NaN would need special handling), then throw an error. Similarly for setindex! of course. But that discussion should be in the other issue.

@staticfloat (Sponsor Member)

Well, this has been fun to play around with. I went ahead and wrote something to calculate the equivalent Range hash for a congruent dense vector, and even made sure it gives roughly the same performance. The only problem is with float ranges: the way I'm ensuring that a dense vector is, in fact, congruent to a range is by looking at the differences between elements, and with float ranges, of course, that's a problem.

I'm solving it right now by using isapprox() inside of my makeshift hash() function for floating-point ranges. Since posting IJulia notebooks seems to be the thing to do nowadays, here's mine. I'm sure there are plenty of problems with this approach, but I thought it might be fun to do it anyway. ;)
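To illustrate why a differences-between-elements check breaks down for float ranges, here is a standalone demo in modern Julia (where collect materializes a range as a vector):

```julia
# The elements of a float range, once rounded to Float64, do not have
# exactly equal spacing, so a constant-stride check on the dense vector
# fails even though the vector came from a range.
v = collect(0.0:0.1:1.0)
diffs = [v[i+1] - v[i] for i in 1:length(v)-1]
inconsistent = any(d -> d != 0.1, diffs)   # true: e.g. 0.3 - 0.2 != 0.1
```

This is exactly the rounding noise that an isapprox()-style tolerance tries to paper over.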

@jakebolewski (Member)

@staticfloat, you need to special-case zero-element ranges:

julia> my_hash(1:-1:10)
0x90cc5625201d9479

julia> my_hash([1:-1:10])
0xdb33fc5eee865dad

@StefanKarpinski (Sponsor Member, Author)

NaN throws a wrench into the works again because NaN != NaN32. See also #5314. Fundamentally, checking whether two things are equal — considering NaNs to be equal, considering things like 0.0 and -0.0 to be unequal, yet not caring about type — is something that one sometimes wants to do. Currently none of our equality predicates does this.

@jakebolewski (Member)

Also, decreasing ranges don't work:

julia> my_hash([10:-1:1])
0xbd06ea18dc0adc1d

julia> my_hash(10:-1:1)
0x0158e13259077bd8

@staticfloat (Sponsor Member)

@jakebolewski Thanks for those, I've updated it, and added checks in that second cell.

@JeffBezanson (Sponsor Member)

The problem with calling convert in getindex is that you can't rely on the dictionary type: you'd be able to look up an int32(1) key using int64(1) in an Int32 Dict, but not in an Any Dict.

@jakebolewski (Member)

@staticfloat, did you try fuzz-testing the floating-point implementation? It fails the majority of the time for me.

julia> n_failures = 0
0

julia> for _ in 1:1_000_000
       start = 0.0          
       step  = rand()
       stop  = 10.0
       r = start:step:stop; rv = [r]
       if my_hash(r) != my_hash(rv)
           n_failures += 1
       end
       end

julia> n_failures / float(1_000_000)
0.875346

@StefanKarpinski (Sponsor Member, Author)

The problem with calling convert in getindex is that you can't rely on the dictionary type: you'd be able to look up an int32(1) key using int64(1) in an Int32 Dict, but not in an Any Dict.

Sigh. I mean, I really wish this weren't such a problem, but it is.

@staticfloat (Sponsor Member)

@jakebolewski I think that's because you were using my_hash() instead of my_float_hash(). I named it differently only so that the tests would continue to fail in that one cell designed to show the failure. Otherwise, after the float version "patches" my_hash(), running that cell would no longer fail.

I've updated the notebook again with a fuzz test, which shows all 10,000 cases passing.

@jakebolewski (Member)

Cool! I didn't catch the name change.

@staticfloat (Sponsor Member)

Fair enough, there really shouldn't be a name change if something like this ever gets used.

@stevengj (Member)

I don't see the problem with relying on the dictionary type. Why shouldn't a Dict with typed keys behave differently in this respect from a Dict with Any keys?

@JeffBezanson (Sponsor Member)

The more general problem is that you lose some invariants that most people would expect to hold. There can be a value x that works as a key for d, and yet is not isequal to anything in collect(keys(d)).

JeffBezanson added this to the 0.3 milestone Feb 19, 2014
@JeffBezanson (Sponsor Member)

It's looking like we will change isequal(a:b, [a:b]) to false, making this a breaking change that should be done soon.

@JeffBezanson (Sponsor Member)

Just had another thought: if isequal could be true for numbers of different types, comparing ranges would be harder. Having == components does not guarantee equal ranges if the types are different. In fact our current == for ranges gets this wrong:

julia> x = 0.23f0;

julia> 0.0f0:x:1.0f0 == 0.0:x:1.0
true

julia> [0.0f0:x:1.0f0] == [0.0:x:1.0]
false
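A correct (if O(n)) cross-type comparison would have to fall back to elementwise ==, along the lines of this hypothetical helper (ranges_equal is our name, not the Base definition):

```julia
# Elementwise equality for two ranges: correct across element types,
# but O(n) — the open question in the thread is doing this in O(1)
# without walking the elements.
function ranges_equal(r, s)
    length(r) == length(s) || return false
    for (x, y) in zip(r, s)
        x == y || return false   # == handles mixed numeric types
    end
    return true
end
```

The comparison Jeff shows goes wrong precisely because the fast field-by-field shortcut disagrees with this elementwise definition once the element types differ.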

@GlenHertz (Contributor)

Whatever may happen here, isequal seems like a name that is too generic. Maybe we need to split this into two different functions.

@StefanKarpinski (Sponsor Member, Author)

@JeffBezanson – that particular example would be fixed by my float range PR. There remain, however, other cases that are still broken with the current logic. This just means our equality predicate for ranges is currently wrong. This stuff is very subtle when you mix types. But we already have to face that with == in any case.

@JeffBezanson (Sponsor Member)

@GlenHertz adding more equality functions probably won't help (many people already think there are too many). You can never have enough; at a certain point the kind of equality needed is application-specific.

@StefanKarpinski true, I just don't yet know a fast algorithm for == on ranges of different types.

JeffBezanson added a commit that referenced this issue Mar 11, 2014
make ranges and arrays unequal, faster hashing of ranges, fixes #5778
@toivoh (Contributor) commented Aug 9, 2014

Regarding #7867, it could be possible to hash ranges the same as vectors if hashing of vectors was changed so that it looks at a few elements to figure out what range the vector would correspond to, if any. The hash for the vector could be created from the hash of that range, plus deltas between the predicted and actual elements. But it's a pretty brittle solution, and I'm not even sure it's possible given the number of range types that we have by now.
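That scheme might look roughly like the following (a hypothetical sketch, not anything that was merged; predictive_hash and range_seed are invented names). The vector's hash is seeded with the range its first two elements predict, then mixed with only the non-zero deltas, so a vector that really is a range contributes no deltas and the matching range can hash itself in O(1) from the same seed:

```julia
# Hypothetical "predict a range, hash the deltas" scheme. A vector whose
# elements exactly match the predicted range produces zero deltas, so
# its hash collapses to the seed — which a range can compute in O(1).
range_seed(first, step, len, h::UInt = zero(UInt)) = hash((first, step, len), h)

function predictive_hash(v::AbstractVector{Float64}, h::UInt = zero(UInt))
    isempty(v) && return range_seed(0.0, 0.0, 0, h)
    s = length(v) > 1 ? v[2] - v[1] : 0.0
    h = range_seed(v[1], s, length(v), h)
    for (i, x) in enumerate(v)
        delta = x - (v[1] + (i - 1) * s)
        delta == 0 || (h = hash((i, delta), h))   # mix only mispredictions
    end
    return h
end

# O(1) counterpart for an actual range with the same element values.
predictive_hash(r::AbstractRange{Float64}) = range_seed(first(r), step(r), length(r))
```

The brittleness toivoh mentions shows up immediately for floats: the elements of a Float64 range are generally not exactly first + (i-1)*step after rounding, so the "deltas" are rarely zero, and the scheme only lines up cleanly for exactly representable strides.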
