Move away from hashing #78

Closed
karanveerm opened this issue May 12, 2015 · 5 comments · Fixed by #504

Comments

@karanveerm
Contributor

Hashing can produce collisions, and since we use hashes to decide whether two expressions are equal, we should probably move away from this.

@madeleineudell
Contributor

Here's a proposal for how to make our hashing collision-proof.

We've been caching hashes, rather than just defining a hash function recursively and relying on a dictionary structure to resolve collisions, so that we avoid traversing the whole tree on every call to hash when we've constructed deeply nested expressions, e.g. (+, (x[1], (+, x[2], (+, x[3], ...)))).

Consider the following procedure. Maintain a global dictionary unique_exprs of (hash, expr) pairs for each expression constructed so far. When a new expression expr is formed, compute a hash myhash of it as we do now and check:

while true
    if haskey(unique_exprs, myhash)
        other = unique_exprs[myhash]
        if other.head == expr.head &&
           length(other.children) == length(expr.children) &&
           all(other.children[i].id_hash == expr.children[i].id_hash
               for i in 1:length(expr.children))
            # we've already formed this expression
            break
        else
            # hash collision --- rehash and try again
            myhash = hash(myhash) # there are lots of ways to do this
        end
    else
        # no collision; add the new expression to the global dictionary
        unique_exprs[myhash] = expr
        break
    end
end
return myhash

You can prove that this works by induction. (Variables and constants have hashes that are unique because we use their object_id. Assume all hashes of expressions formed so far are unique, form a new expression, and hash it. If there's a collision, and the heads and the hashes of the children match, then the arguments of the colliding expressions must also be the same by the inductive hypothesis, so the two expressions are equal. If there's no collision then all is well.)
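
For concreteness, a minimal sketch of the base case, with a hypothetical leaf type (names are illustrative, not Convex.jl's actual internals; objectid was spelled object_id in 2015-era Julia):

mutable struct Var
    size::Tuple{Int,Int}
end

# leaves hash by object identity, so two distinct Var objects never share
# an id_hash --- this is the base case of the induction
id_hash(v::Var) = objectid(v)

x = Var((3, 1)); y = Var((3, 1))
@assert id_hash(x) != id_hash(y) # distinct objects, distinct hashes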

@mlubin
Member

mlubin commented May 13, 2015

If you have a dictionary where the keys are the expressions themselves and not the hashes, the internal logic of the dictionary will handle hash collisions. That could be a simpler solution.
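A sketch of that idea with a hypothetical node type (names are illustrative, not Convex.jl's API): define hash and == on expressions, and the Dict resolves any collision through the equality check.

struct Node
    head::Symbol
    children::Vector{Any}
end

# Dict compares keys with == whenever their hashes match, so a collision
# between structurally different expressions is caught by this equality check
Base.:(==)(a::Node, b::Node) = a.head == b.head && a.children == b.children
Base.hash(n::Node, h::UInt) = hash(n.children, hash(n.head, h))

unique_exprs = Dict{Node,Node}()
e = Node(:+, Any[1, 2])
unique_exprs[e] = e # any collision is handled inside Dict
@assert haskey(unique_exprs, Node(:+, Any[1, 2])) # structural lookup works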

@madeleineudell
Contributor

Yes, but we'll still need to define hash on expressions to put them into the dictionary, I presume. It would be nice to cache the hashes to avoid traversing the whole syntax tree every time the same expression is formed. Is there a good way to do this? The following fails, for example:

hash(expr) = (myhash = hash((expr.head, map(c -> unique_exprs[c], expr.children)...)); unique_exprs[expr] = myhash)
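
One hedged way to get caching without that circularity: store the hash in a field computed once at construction, from the children's already-cached hashes. All names below are hypothetical, not Convex.jl's actual internals.

struct CachedExpr
    head::Symbol
    children::Vector{CachedExpr}
    id_hash::UInt
end

function make_expr(head::Symbol, children::Vector{CachedExpr})
    # hash only the children's cached hashes; never recurse into the tree
    h = hash(head, hash([c.id_hash for c in children]))
    return CachedExpr(head, children, h)
end

# Dict operations on expressions then cost O(1) per node, not O(tree size)
Base.hash(e::CachedExpr, h::UInt) = hash(e.id_hash, h)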

@mlubin
Member

mlubin commented May 13, 2015

I don't know enough about the Convex.jl internals to understand the performance effects of hashing the same thing multiple times. Does caching hashes change the algorithmic complexity of how many times the nodes in an expression need to be traversed? Don't you have to do it at least once for every expression, and at most a constant number of times?

@madeleineudell
Contributor

The only time it seems likely to happen is in deeply nested expressions involving indexing. For example, the expression x[1] + x[2] + x[3] + ... creates a deeply nested expression like (+, (x[1], (+, x[2], (+, x[3], ...)))). To form this expression with operator overloading and without caching hashes, we'd end up recursing down the whole expression every time, for a total quadratic cost.

In fact, we try to be mildly more clever to avoid making deeply nested expressions. The expression is condensed every time, so e.g. at the last step (+, x[1], (+, x[2], x[3], ...)) becomes (+, x[1], x[2], x[3], ...). In other words, addition is extended to general summation; it's no longer a binary operator. But because we're blind to how many terms are going to show up, we hash the tuple of children at every iteration, still resulting in quadratic cost.
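
A hedged sketch of that flattening step, reusing the hypothetical CachedExpr and make_expr from the sketch above: + returns a single n-ary sum node instead of nesting binary nodes, but still rehashes all the children on every call.

function Base.:+(a::CachedExpr, b::CachedExpr)
    kids = CachedExpr[]
    a.head == :+ ? append!(kids, a.children) : push!(kids, a)
    b.head == :+ ? append!(kids, b.children) : push!(kids, b)
    # make_expr hashes all n child hashes here, so building
    # x[1] + x[2] + ... + x[n] term by term costs O(1 + 2 + ... + n) = O(n^2)
    return make_expr(:+, kids)
end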

[Actually, looking through the code it's somewhat concerning to me that we've ended up with two constructs for the same thing: sum and +.]
