
Profile-based optimization of build_by_features #48

Merged
merged 1 commit into from
Aug 20, 2020

Conversation

jcushman
Contributor

I did some profiling to try to speed up Simhash(features), while sticking to pure Python with no additional dependencies. This PR is the fastest option I found; it speeds up Simhash(features) by about 20%.

This became sort of a rabbit hole for my own entertainment, so here are details if they're interesting to anyone else:

All profiling was done on a Core i9 running macOS 10.15 with Python 3.7.6. To profile, I used IPython:

In [1]: import simhash
   ...: features = [str(i) for i in range(1000)]
   ...: assert simhash.Simhash(features).value == 7994783356231524236

In [2]: %timeit simhash.Simhash(features)
7.5 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I also used py-spy to find hotspots. It turns out that the hotspot of the current Simhash code is the part that updates v:

for i in range(self.f):
    v[i] += w if h & masks[i] else -w

Those lines take over 80% of the total runtime! Actually computing all the md5s takes < 15%. Apparently iterating the bits of an integer is slow in Python.
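For context, here's a self-contained sketch of that hotspot with the surrounding state stubbed in. `f`, `w`, `masks`, and `v` mirror the names in the snippet above; `h` stands in for a single feature's 64-bit hash (derived from md5 here purely for illustration, which is an assumption about Simhash's internals):

```python
# Minimal, self-contained reproduction of the hotspot above.
import hashlib

f = 64                              # number of simhash bits
masks = [1 << i for i in range(f)]  # masks[i] selects bit i of h
w = 1                               # feature weight

v = [0] * f
# h: one feature's 64-bit hash, stubbed from md5 for illustration
h = int.from_bytes(hashlib.md5(b"example").digest()[:8], "big")
for i in range(f):
    v[i] += w if h & masks[i] else -w
```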

My fastest alternative truncates the hash to the desired length, converts it to a binary string, and then iterates the characters of the string to create a new list:

h_bits = format(h & truncate_mask, bitstring_format)
v = [x + (w if bit == "1" else -w) for bit, x in zip(h_bits, v)]

This runs in about 75% the time of the current hotspot lines, so ends up speeding up the whole Simhash(features) operation by about 20%.
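Under the same stubbed assumptions (f = 64, one feature of weight `w`, `h` derived from md5 for illustration), the format-string version can be sketched end to end; `truncate_mask` and `bitstring_format` are precomputed once:

```python
# Sketch of the format-string approach. truncate_mask keeps the low
# f bits of h; bitstring_format zero-pads the binary string to f chars.
import hashlib

f = 64
truncate_mask = 2 ** f - 1
bitstring_format = f"0{f}b"
w = 1

v = [0] * f
h = int.from_bytes(hashlib.md5(b"example").digest()[:8], "big")
h_bits = format(h & truncate_mask, bitstring_format)
v = [x + (w if bit == "1" else -w) for bit, x in zip(h_bits, v)]
```

One subtlety: `h_bits` iterates most-significant bit first, while `masks[0]` selects bit 0, so the bit-to-position mapping is reversed relative to the loop version. That's harmless as long as it's applied consistently when the final fingerprint is assembled from `v`.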

I tried some other avenues that didn't work as well, including:

  • v = [x + (w if h & m else -w) for m, x in zip(masks, v)] (slower than original)
  • updating the list in place with the format-string approach (slightly slower than final choice)
  • Avoiding the if-else with a lookup table like l = {'1': w, '0': -w} (slightly slower than the if-else)
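To make the lookup-table variant concrete, here's a sketch on stubbed values; both variants produce identical deltas, the if-else was just marginally faster:

```python
# Two of the compared variants, on illustrative stub values.
w = 1
f = 64
h = 0x9ABCDEF012345678           # arbitrary 64-bit example hash
h_bits = format(h, f"0{f}b")

# variant: dict lookup instead of if-else
delta = {"1": w, "0": -w}
v_lookup = [delta[bit] for bit in h_bits]

# baseline: inline if-else (the version kept in the PR)
v_ifelse = [w if bit == "1" else -w for bit in h_bits]
assert v_lookup == v_ifelse      # same result; only the timing differs
```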

I also tried the BitVector and numpy libraries to see if getting away from bare python would help:

  • bv = BitVector(intVal=h, size = 64); v = [x + (w if b else -w) for b, x in zip(bv, v)] (much slower)
  • np_v = np.zeros(64); bits = np.unpackbits(np.ndarray(shape=(8,),dtype='>B', buffer=h.to_bytes(length=8, byteorder='big'))); np_v += bits * w + (bits - 1) * w (somewhat slower)
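The numpy attempt can be sketched as below (with stubbed values). One caveat: np.unpackbits yields uint8 values, so the sketch widens `bits` to int64 first; otherwise `bits - 1` wraps around to 255 for zero bits, a pitfall the one-liner above is exposed to:

```python
# Sketch of the numpy variant: maps bit 1 -> +w and bit 0 -> -w.
import numpy as np

w = 1
h = 0x9ABCDEF012345678           # arbitrary 64-bit example hash
np_v = np.zeros(64)
bits = np.unpackbits(
    np.frombuffer(h.to_bytes(8, "big"), dtype=np.uint8)
).astype(np.int64)               # widen so (bits - 1) can't wrap as uint8
np_v += bits * w + (bits - 1) * w
```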

This is all a little frustrating -- intuitively, one would think most of the runtime would go to md5 rather than to the 64 additions. If that were true (and I'm just missing the way to do it), Simhash() might run several times as fast as it does with this PR.

@1e0ng
Owner

1e0ng commented Aug 20, 2020

Hi, @jcushman thanks for taking the time to optimize the performance. 👍

I tested your change in my local environment (a Core i7 running macOS 10.15 with Python 3.8.5), and I do see the performance improvement. Great work, @jcushman, thanks!

One thing I noticed is that with Python 2, this change seems to make things slightly slower; not sure if you've seen the same. Since Python 2 is deprecated, this isn't a big deal. Just curious.

@1e0ng 1e0ng merged commit 66a9929 into 1e0ng:master Aug 20, 2020
@jcushman
Contributor Author

Hmm, I didn't try Python 2, but I can believe it! So much is implemented differently there, e.g. strings and zip(), and the performance benefit on Python 3 isn't huge, so I can see it not working out as a win on 2.
