
hashing of ranges is awful #5778

Closed
StefanKarpinski opened this issue Feb 12, 2014 · 36 comments
Labels: kind:breaking (This change will break code), performance (Must go faster)
Milestone: 0.3

Comments

@StefanKarpinski (Sponsor Member)

julia> @which hash(1:10)
hash(a::AbstractArray{T,N}) at dict.jl:247

The first issue is that this is incredibly slow:

julia/base/dict.jl

Lines 246 to 252 in 72a65be

function hash(a::AbstractArray)
h::Uint = hash(size(a))+1
for i=1:length(a)
h = bitmix(h,int(hash(a[i])))
end
return h
end

It's unclear to me why we're using indexing here instead of just iterating, but the code is very old, so maybe it made sense at the time.
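For context, the fix is conceptually simple: a range is fully determined by a few fields, so a specialized hash can run in O(1) by hashing those fields instead of every element. A minimal sketch in modern Julia syntax (my_range_hash is a hypothetical name, not what Base defines):

```julia
# Hypothetical O(1) hash for a range: mix only the defining fields
# (first, step, length) instead of walking every element.
# Note: this does NOT produce the same value as hashing the equivalent
# dense vector, which is exactly the compatibility problem discussed
# in this thread.
my_range_hash(r::AbstractRange, h::UInt = zero(UInt)) =
    hash(length(r), hash(step(r), hash(first(r), h)))
```

Runtime is independent of the range's length, which is the whole point of specializing.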

@jakebolewski (Member)

I'm curious, why do you have to iterate through the entire range?

@JeffBezanson (Sponsor Member)

That's the point, you don't. It's just that ranges lack a specific implementation of hash and fall back on a generic array method.

@JeffBezanson (Sponsor Member)

Oh I know how this happened: ranges are isequal to vectors with the same contents.

@StefanKarpinski (Sponsor Member, Author)

Add this to the list of isequal issues. We should maybe have a label for this.

@JeffBezanson (Sponsor Member)

There might not be anything we can do here; I expect 1:10 == [1:10] is uncontroversial. Making isequal different in that case is possible, but that doesn't seem to be the direction we're moving with it.

@jakebolewski (Member)

I didn't realize that 1:10 == [1:10]. Personally, this doesn't make much sense to me, but I realize that it follows Matlab's behavior.

@JeffBezanson (Sponsor Member)

We also have A == sparse(A).

@staticfloat (Sponsor Member)

I feel like the isequal issue is separate from this one though; we can just write a new hash() function that specializes on ranges and this performance issue disappears, right?

@JeffBezanson (Sponsor Member)

In theory; but there would need to be an O(1) algorithm that gives the same answer as the O(n) hash of a dense array of the same values. Or we could check whether a given dense vector is equal to some range, but that doesn't sound very easy either.
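The second option mentioned here — checking whether a given dense vector is equal to some range — can at least be sketched for the integer case (as_range is a hypothetical helper; the float case is far messier, as the rest of the thread shows):

```julia
# Hypothetical helper: return the range a dense Int vector is equal to,
# or nothing if it has no constant stride. Still O(n) to check, but it
# would let a vector hash defer to an O(1) range hash when possible.
function as_range(v::AbstractVector{Int})
    isempty(v) && return 1:0                  # any empty range works
    length(v) == 1 && return v[1]:v[1]
    s = v[2] - v[1]
    s == 0 && return nothing                  # StepRange can't have a zero step
    for i in 3:length(v)
        v[i] - v[i-1] == s || return nothing  # stride must be constant
    end
    return v[1]:s:v[end]
end
```

This handles increasing and decreasing strides uniformly, which foreshadows the edge cases reported further down.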

@stevengj (Member)

We already decided in #3385 to make isequal(1,1.0) == false even though 1 == 1.0, precisely because it was impractical to hash different numeric types to the same value.

Why doesn't the same logic apply here? Have 1:10 == [1:10] but isequal(1:10, [1:10]) == false, and let them hash to different values.

@StefanKarpinski (Sponsor Member, Author)

That decision has been an almost unmitigated disaster. We really need to change it back.

@stevengj (Member)

What disastrous consequences have ensued? I just don't see how any other decision could possibly be practical if you want (a) isequal(a,b) to imply hash(a) == hash(b) and (b) hashing to be fast.

@JeffBezanson (Sponsor Member)

I wouldn't describe it as an unmitigated disaster, but as sometimes mildly inconvenient.

@StefanKarpinski (Sponsor Member, Author)

This, among other things:

julia> d = Dict{Int32,Any}()
Dict{Int32,Any}()

julia> d[1] = "foo"
ERROR: 1 is not a valid key for type Int32
 in setindex! at dict.jl:502

I don't have time to give a comprehensive list of the problems, but it does not work well. It makes dictionaries super finicky and brittle.

@stevengj (Member)

That seems like an orthogonal issue. Why not just define getindex{K,V}(d::Dict{K,V}, v) = getindex(d, convert(K, v))?

@simonster (Member)

@stevengj Because then getindex would throw when convert(K, v) is not implemented or throws an InexactError. Some previous discussion in #4166.

@stevengj (Member)

@simonster, what's wrong with throwing an error in that case? Seems like that's what I would want: if convert is not implemented, or if convert(K, v) != v (not isequal, except that NaN would need special handling), then throw an error. Similarly for setindex! of course. But that discussion should be in the other issue.

@staticfloat (Sponsor Member)

Well, this has been fun to play around with. I went ahead and wrote something to calculate the equivalent Range hash for a congruent dense vector, and even made sure it gives roughly the same performance. The only problem is with float ranges: the way I'm ensuring that a dense vector is, in fact, congruent to a range is by looking at the differences between elements, and with float ranges, of course, that's a problem.

I'm solving it right now by using isapprox() inside of my makeshift hash() function for floating-point ranges. Since posting IJulia notebooks seems to be the thing to do nowadays, here's mine. I'm sure there are plenty of problems with this approach, but I thought it might be fun to do it anyway. ;)
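To illustrate why a differences-between-elements check breaks down for float ranges, here is a standalone demo in modern Julia (where collect materializes a range as a vector):

```julia
# The elements of a float range, once rounded to Float64, do not have
# exactly equal spacing, so a constant-stride check on the dense vector
# fails even though the vector came from a range.
v = collect(0.0:0.1:1.0)
diffs = [v[i+1] - v[i] for i in 1:length(v)-1]
inconsistent = any(d -> d != 0.1, diffs)   # true: e.g. 0.3 - 0.2 != 0.1
```

This is exactly the rounding noise that an isapprox()-style tolerance tries to paper over.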

@jakebolewski (Member)

@staticfloat, you need to special-case zero-element ranges:

julia> my_hash(1:-1:10)
0x90cc5625201d9479

julia> my_hash([1:-1:10])
0xdb33fc5eee865dad

@StefanKarpinski (Sponsor Member, Author)

NaN throws a wrench into the works again because NaN != NaN32. See also #5314. Fundamentally, checking whether two things are equal — considering NaNs to be equal, considering things like 0.0 and -0.0 to be unequal, yet not caring about type — is something that one sometimes wants to do. Currently none of our equality predicates does this.

@jakebolewski (Member)

Also, decreasing ranges don't work:

julia> my_hash([10:-1:1])
0xbd06ea18dc0adc1d

julia> my_hash(10:-1:1)
0x0158e13259077bd8

@staticfloat (Sponsor Member)

@jakebolewski Thanks for those, I've updated it, and added checks in that second cell.

@JeffBezanson (Sponsor Member)

The problem with calling convert in getindex is that you can't rely on the dictionary type: you'd be able to look up an int32(1) key using int64(1) in an Int32 Dict, but not in an Any Dict.

@jakebolewski (Member)

@staticfloat, did you try fuzz-testing the floating-point implementation? It fails the majority of the time for me.

julia> n_failures = 0
0

julia> for _ in 1:1_000_000
       start = 0.0          
       step  = rand()
       stop  = 10.0
       r = start:step:stop; rv = [r]
       if my_hash(r) != my_hash(rv)
           n_failures += 1
       end
       end

julia> n_failures / float(1_000_000)
0.875346

@StefanKarpinski (Sponsor Member, Author)

The problem with calling convert in getindex is that you can't rely on the dictionary type: you'd be able to look up an int32(1) key using int64(1) in an Int32 Dict, but not in an Any Dict.

Sigh. I mean, I really wish this weren't such a problem, but it is.

@staticfloat (Sponsor Member)

@jakebolewski I think that's because you were using my_hash() instead of my_float_hash(). I named it differently only so that the tests would continue to fail in that one cell designed to show the failure. Otherwise, after the float version "patches" my_hash(), running that cell would no longer fail.

I've updated the notebook again with a fuzz test, which shows all 10,000 cases passing.

@jakebolewski (Member)

Cool! I didn't catch the name change.

@staticfloat (Sponsor Member)

Fair enough, there really shouldn't be a name change if something like this ever gets used.

@stevengj (Member)

I don't see the problem with relying on the dictionary type. Why shouldn't a Dict with typed keys behave differently in this respect from a Dict with Any keys?

@JeffBezanson (Sponsor Member)

The more general problem is that you lose some invariants that most people would expect to hold. There can be a value x that works as a key for d, and yet is not isequal to anything in collect(keys(d)).

JeffBezanson added this to the 0.3 milestone Feb 19, 2014
@JeffBezanson (Sponsor Member)

It's looking like we will change isequal(a:b, [a:b]) to false, making this a breaking change that should be done soon.

@JeffBezanson (Sponsor Member)

Just had another thought: if isequal could be true for numbers of different types, comparing ranges would be harder. Having == components does not guarantee equal ranges if the types are different. In fact our current == for ranges gets this wrong:

julia> x = 0.23f0;

julia> 0.0f0:x:1.0f0 == 0.0:x:1.0
true

julia> [0.0f0:x:1.0f0] == [0.0:x:1.0]
false
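A correct (if O(n)) cross-type comparison would have to fall back to elementwise ==, along the lines of this hypothetical helper (ranges_equal is our name, not the Base definition):

```julia
# Elementwise equality for two ranges: correct across element types,
# but O(n) — the open question in the thread is doing this in O(1)
# without walking the elements.
function ranges_equal(r, s)
    length(r) == length(s) || return false
    for (x, y) in zip(r, s)
        x == y || return false   # == handles mixed numeric types
    end
    return true
end
```

The comparison Jeff shows goes wrong precisely because the fast field-by-field shortcut disagrees with this elementwise definition once the element types differ.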

@GlenHertz (Contributor)

Whatever may happen here, isequal seems like a name that is too generic. Maybe we need to split this into two different functions.

@StefanKarpinski (Sponsor Member, Author)

@JeffBezanson – that particular example would be fixed by my float range PR. There remain, however, other cases that are still broken with the current logic. This just means our equality predicate for ranges is currently wrong. This stuff is very subtle when you mix types. But we already have to face that with == in any case.

@JeffBezanson (Sponsor Member)

@GlenHertz adding more equality functions probably won't help (many people already think there are too many). You can never have enough; at a certain point the kind of equality needed is application-specific.

@StefanKarpinski true, I just don't yet know a fast algorithm for == on ranges of different types.

JeffBezanson added a commit that referenced this issue Mar 11, 2014
make ranges and arrays unequal, faster hashing of ranges, fixes #5778
@toivoh (Contributor) commented Aug 9, 2014

Regarding #7867, it could be possible to hash ranges the same as vectors if hashing of vectors was changed so that it looks at a few elements to figure out what range the vector would correspond to, if any. The hash for the vector could be created from the hash of that range, plus deltas between the predicted and actual elements. But it's a pretty brittle solution, and I'm not even sure it's possible given the number of range types that we have by now.
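That scheme might look roughly like the following (a hypothetical sketch, not anything that was merged; predictive_hash and range_seed are invented names). The vector's hash is seeded with the range its first two elements predict, then mixed with only the non-zero deltas, so a vector that really is a range contributes no deltas and the matching range can hash itself in O(1) from the same seed:

```julia
# Hypothetical "predict a range, hash the deltas" scheme. A vector whose
# elements exactly match the predicted range produces zero deltas, so
# its hash collapses to the seed — which a range can compute in O(1).
range_seed(first, step, len, h::UInt = zero(UInt)) = hash((first, step, len), h)

function predictive_hash(v::AbstractVector{Float64}, h::UInt = zero(UInt))
    isempty(v) && return range_seed(0.0, 0.0, 0, h)
    s = length(v) > 1 ? v[2] - v[1] : 0.0
    h = range_seed(v[1], s, length(v), h)
    for (i, x) in enumerate(v)
        delta = x - (v[1] + (i - 1) * s)
        delta == 0 || (h = hash((i, delta), h))   # mix only mispredictions
    end
    return h
end

# O(1) counterpart for an actual range with the same element values.
predictive_hash(r::AbstractRange{Float64}) = range_seed(first(r), step(r), length(r))
```

The brittleness toivoh mentions shows up immediately for floats: the elements of a Float64 range are generally not exactly first + (i-1)*step after rounding, so the "deltas" are rarely zero, and the scheme only lines up cleanly for exactly representable strides.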
