Move away from hashing #78

Closed
karanveerm opened this issue May 12, 2015 · 5 comments · Fixed by #504

Comments

@karanveerm
Contributor

Hashing can produce collisions, and since we use hashes to decide whether two expressions are equal, we should probably move away from this.

@madeleineudell
Contributor

Here's a proposal for how to make our hashing collision-proof.

We've been caching hashes, rather than just defining a hash function recursively and relying on a dictionary structure to resolve collisions, so that we avoid traversing the whole tree on every call to hash when we've constructed deeply nested expressions, e.g. (+, (x[1], (+, x[2], (+, x[3], ...)))).

Consider the following procedure. Maintain a global dictionary unique_exprs of (hash, expr) pairs for each expression constructed so far. When a new expression expr is formed, compute a hash myhash of it as we do now and check:

while true
    if haskey(unique_exprs, myhash)
        other = unique_exprs[myhash]
        if other.head == expr.head &&
           length(other.children) == length(expr.children) &&
           all(other.children[i].id_hash == expr.children[i].id_hash
               for i in 1:length(expr.children))
            # we've already formed this expression
            break
        else
            # hash collision --- rehash and try again
            myhash = hash(myhash) # there are lots of ways to do this
        end
    else
        # no collision; add the new expression to the global dictionary
        unique_exprs[myhash] = expr
        break
    end
end
return myhash

You can prove that this works by induction. (Variables and constants have hashes that are unique because we use their object_id. Assume all hashes of expressions formed so far are unique, form a new expression, and hash it. If there's a collision, and the heads and the hashes of the children match, then the arguments of the colliding expressions must also be the same by the inductive hypothesis, so the two expressions are equal. If there's no collision then all is well.)
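
For concreteness, a minimal sketch of the base case, with a hypothetical leaf type (names are illustrative, not Convex.jl's actual internals; objectid was spelled object_id in 2015-era Julia):

mutable struct Var
    size::Tuple{Int,Int}
end

# leaves hash by object identity, so two distinct Var objects never share
# an id_hash --- this is the base case of the induction
id_hash(v::Var) = objectid(v)

x = Var((3, 1)); y = Var((3, 1))
@assert id_hash(x) != id_hash(y) # distinct objects, distinct hashes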

@mlubin
Member

mlubin commented May 13, 2015

If you have a dictionary where the keys are the expressions themselves and not the hashes, the internal logic of the dictionary will handle hash collisions. That could be a simpler solution.
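A sketch of that idea with a hypothetical node type (names are illustrative, not Convex.jl's API): define hash and == on expressions, and the Dict resolves any collision through the equality check.

struct Node
    head::Symbol
    children::Vector{Any}
end

# Dict compares keys with == whenever their hashes match, so a collision
# between structurally different expressions is caught by this equality check
Base.:(==)(a::Node, b::Node) = a.head == b.head && a.children == b.children
Base.hash(n::Node, h::UInt) = hash(n.children, hash(n.head, h))

unique_exprs = Dict{Node,Node}()
e = Node(:+, Any[1, 2])
unique_exprs[e] = e # any collision is handled inside Dict
@assert haskey(unique_exprs, Node(:+, Any[1, 2])) # structural lookup works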

@madeleineudell
Contributor

Yes, but we'll still need to define hash on expressions to put them into the dictionary, I presume. It would be nice to cache the hashes to avoid traversing the whole syntax tree every time the same expression is formed. Is there a good way to do this? The following fails, for example:

hash(expr) = (myhash = hash((expr.head, map(c -> unique_exprs[c], expr.children)...)); unique_exprs[expr] = myhash)
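
One hedged way to get caching without that circularity: store the hash in a field computed once at construction, from the children's already-cached hashes. All names below are hypothetical, not Convex.jl's actual internals.

struct CachedExpr
    head::Symbol
    children::Vector{CachedExpr}
    id_hash::UInt
end

function make_expr(head::Symbol, children::Vector{CachedExpr})
    # hash only the children's cached hashes; never recurse into the tree
    h = hash(head, hash([c.id_hash for c in children]))
    return CachedExpr(head, children, h)
end

# Dict operations on expressions then cost O(1) per node, not O(tree size)
Base.hash(e::CachedExpr, h::UInt) = hash(e.id_hash, h)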

@mlubin
Member

mlubin commented May 13, 2015

I don't know enough about the Convex.jl internals to understand the performance effects of hashing the same thing multiple times. Does caching hashes change the algorithmic complexity of how many times the nodes in an expression need to be traversed? Don't you have to do it at least once for every expression, and at most a constant number of times?

@madeleineudell
Contributor

The only time it seems likely to happen is in deeply nested expressions involving indexing. For example, the expression x[1] + x[2] + x[3] + ... creates a deeply nested expression like (+, (x[1], (+, x[2], (+, x[3], ...)))). To form this expression with operator overloading and without caching hashes, we'd end up recursing down the whole expression every time, for a total quadratic cost.

In fact, we try to be mildly more clever to avoid making deeply nested expressions. The expression is condensed every time, so e.g. at the last step (+, x[1], (+, x[2], x[3], ...)) becomes (+, x[1], x[2], x[3], ...). In other words, addition is extended to general summation; it's no longer a binary operator. But because we're blind to how many terms are going to show up, we hash the tuple of children at every iteration, still resulting in quadratic cost.
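
A hedged sketch of that flattening step, reusing the hypothetical CachedExpr and make_expr from the sketch above: + returns a single n-ary sum node instead of nesting binary nodes, but still rehashes all the children on every call.

function Base.:+(a::CachedExpr, b::CachedExpr)
    kids = CachedExpr[]
    a.head == :+ ? append!(kids, a.children) : push!(kids, a)
    b.head == :+ ? append!(kids, b.children) : push!(kids, b)
    # make_expr hashes all n child hashes here, so building
    # x[1] + x[2] + ... + x[n] term by term costs O(1 + 2 + ... + n) = O(n^2)
    return make_expr(:+, kids)
end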

[Actually, looking through the code it's somewhat concerning to me that we've ended up with two constructs for the same thing: sum and +.]
