
Profile-based optimization of build_by_features #48

Merged
merged 1 commit into from
Aug 20, 2020

Conversation

jcushman
Contributor

I did some profiling to try to speed up Simhash(features), while sticking to pure Python with no additional dependencies. This PR is the fastest option I found; it speeds up Simhash(features) by about 20%.

This became sort of a rabbit hole for my own entertainment, so here are details if they're interesting to anyone else:

All profiling was done on a Core i9 running macOS 10.15 with Python 3.7.6. To profile, I used IPython:

In [1]: import simhash
   ...: features = [str(i) for i in range(1000)]
   ...: assert simhash.Simhash(features).value == 7994783356231524236

In [2]: %timeit simhash.Simhash(features)
7.5 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I also used py-spy to find hotspots. It turns out that the hotspot of the current Simhash code is the part that updates v:

for i in range(self.f):
    v[i] += w if h & masks[i] else -w

Those lines take over 80% of the total runtime! Actually computing all the md5s takes < 15%. Apparently iterating the bits of an integer is slow in Python.
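For context, here's a self-contained sketch of that hotspot with the surrounding state stubbed in. `f`, `w`, `masks`, and `v` mirror the names in the snippet above; `h` stands in for a single feature's 64-bit hash (derived from md5 here purely for illustration, which is an assumption about Simhash's internals):

```python
# Minimal, self-contained reproduction of the hotspot above.
import hashlib

f = 64                              # number of simhash bits
masks = [1 << i for i in range(f)]  # masks[i] selects bit i of h
w = 1                               # feature weight

v = [0] * f
# h: one feature's 64-bit hash, stubbed from md5 for illustration
h = int.from_bytes(hashlib.md5(b"example").digest()[:8], "big")
for i in range(f):
    v[i] += w if h & masks[i] else -w
```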

My fastest alternative truncates the hash to the desired length, converts it to a binary string, and then iterates the characters of the string to create a new list:

h_bits = format(h & truncate_mask, bitstring_format)
v = [x + (w if bit == "1" else -w) for bit, x in zip(h_bits, v)]

This runs in about 75% the time of the current hotspot lines, so ends up speeding up the whole Simhash(features) operation by about 20%.
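Under the same stubbed assumptions (f = 64, one feature of weight `w`, `h` derived from md5 for illustration), the format-string version can be sketched end to end; `truncate_mask` and `bitstring_format` are precomputed once:

```python
# Sketch of the format-string approach. truncate_mask keeps the low
# f bits of h; bitstring_format zero-pads the binary string to f chars.
import hashlib

f = 64
truncate_mask = 2 ** f - 1
bitstring_format = f"0{f}b"
w = 1

v = [0] * f
h = int.from_bytes(hashlib.md5(b"example").digest()[:8], "big")
h_bits = format(h & truncate_mask, bitstring_format)
v = [x + (w if bit == "1" else -w) for bit, x in zip(h_bits, v)]
```

One subtlety: `h_bits` iterates most-significant bit first, while `masks[0]` selects bit 0, so the bit-to-position mapping is reversed relative to the loop version. That's harmless as long as it's applied consistently when the final fingerprint is assembled from `v`.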

I tried some other avenues that didn't work as well, including:

  • v = [x + (w if h & m else -w) for m, x in zip(masks, v)] (slower than original)
  • updating the list in place with the format-string approach (slightly slower than final choice)
  • Avoiding the if-else with a lookup table like l = {'1': w, '0': -w} (slightly slower than the if-else)
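To make the lookup-table variant concrete, here's a sketch on stubbed values; both variants produce identical deltas, the if-else was just marginally faster:

```python
# Two of the compared variants, on illustrative stub values.
w = 1
f = 64
h = 0x9ABCDEF012345678           # arbitrary 64-bit example hash
h_bits = format(h, f"0{f}b")

# variant: dict lookup instead of if-else
delta = {"1": w, "0": -w}
v_lookup = [delta[bit] for bit in h_bits]

# baseline: inline if-else (the version kept in the PR)
v_ifelse = [w if bit == "1" else -w for bit in h_bits]
assert v_lookup == v_ifelse      # same result; only the timing differs
```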

I also tried the BitVector and numpy libraries to see if getting away from bare python would help:

  • bv = BitVector(intVal=h, size = 64); v = [x + (w if b else -w) for b, x in zip(bv, v)] (much slower)
  • np_v = np.zeros(64); bits = np.unpackbits(np.ndarray(shape=(8,),dtype='>B', buffer=h.to_bytes(length=8, byteorder='big'))); np_v += bits * w + (bits - 1) * w (somewhat slower)
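The numpy attempt can be sketched as below (with stubbed values). One caveat: np.unpackbits yields uint8 values, so the sketch widens `bits` to int64 first; otherwise `bits - 1` wraps around to 255 for zero bits, a pitfall the one-liner above is exposed to:

```python
# Sketch of the numpy variant: maps bit 1 -> +w and bit 0 -> -w.
import numpy as np

w = 1
h = 0x9ABCDEF012345678           # arbitrary 64-bit example hash
np_v = np.zeros(64)
bits = np.unpackbits(
    np.frombuffer(h.to_bytes(8, "big"), dtype=np.uint8)
).astype(np.int64)               # widen so (bits - 1) can't wrap as uint8
np_v += bits * w + (bits - 1) * w
```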

This is all a little frustrating -- intuitively, one would think most of the runtime would go to md5 rather than to the 64 additions. If that were true (and I'm just missing the way to do it), Simhash() might run several times as fast as it does with this PR.

@1e0ng
Owner

1e0ng commented Aug 20, 2020

Hi, @jcushman thanks for taking the time to optimize the performance. 👍

I tested your change in my local environment (a Core i7 running macOS 10.15 with Python 3.8.5), and I do see the performance improvement. Great work, @jcushman, thanks!

One thing I noticed is that with Python 2, this change seems to make things slightly slower; not sure if you've seen the same. Since Python 2 is deprecated, this isn't a big deal. Just curious.

@1e0ng 1e0ng merged commit 66a9929 into 1e0ng:master Aug 20, 2020
@jcushman
Contributor Author

Hmm, I didn't try Python 2, but I can believe it! So much is implemented differently there, e.g. strings and zip(), and the performance benefit on Python 3 isn't huge, so I can see it not working out as a win on 2.
