Profile-based optimization of build_by_features #48
I did some profiling to try to speed up `Simhash(features)`, while sticking to pure Python with no additional dependencies. This PR is the fastest option I found; it speeds up `Simhash(features)` by about 20%.

This became sort of a rabbit hole for my own entertainment, so here are the details, in case they're interesting to anyone else:
All profiling was done on a Core i9 running macOS 10.15 with Python 3.7.6. To profile, I used IPython:
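The original description showed a profiling session here. As a stand-in, here is a self-contained sketch using `cProfile` against a simplified re-implementation of the fingerprint loop (the `build` function and the synthetic `features` list are illustrative, not the library's actual code):

```python
import cProfile
import hashlib

def build(features, f=64):
    # Stand-in for Simhash(features): md5 each feature, then accumulate
    # per-bit votes (+1 if the bit is set, -1 otherwise) across features.
    masks = [1 << i for i in range(f)]
    v = [0] * f
    for feat in features:
        h = int(hashlib.md5(feat.encode('utf-8')).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += 1 if h & masks[i] else -1
    # Positive vote totals become set bits in the fingerprint.
    ans = 0
    for i in range(f):
        if v[i] > 0:
            ans |= masks[i]
    return ans

features = ['feature-%d' % i for i in range(200)]
profiler = cProfile.Profile()
fingerprint = profiler.runcall(build, features)
profiler.print_stats('cumtime')
```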
I also used py-spy to find hotspots. It turns out that the hotspot of the current Simhash code is the part that updates `v`: those lines take over 80% of the total runtime! Actually computing all the md5s takes < 15%. Apparently iterating the bits of an integer is slow in Python.
My fastest alternative truncates the hash to the desired length, converts it to a binary string, and then iterates the characters of the string to create a new list:
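A sketch of that approach (names here are illustrative; the actual PR diff is the authoritative version):

```python
f = 64

def update_fast(v, h, w):
    # Truncate the hash to f bits, render it as a zero-padded binary
    # string, and iterate characters instead of masking bits.
    # [::-1] puts bit 0 first so string indices line up with v's indices.
    b = bin(h & ((1 << f) - 1))[2:].zfill(f)[::-1]
    return [x + (w if c == '1' else -w) for c, x in zip(b, v)]

update_fast([0] * f, 0b1011, 1)  # same votes as the mask-based loop
```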
This runs in about 75% of the time of the current hotspot lines, so it ends up speeding up the whole `Simhash(features)` operation by about 20%.

I tried some other avenues that didn't work as well, including:
- `v = [x + (w if h & m else -w) for m, x in zip(masks, v)]` (slower than the original)
- a lookup table `l = {'1': w, '0': -w}` instead of the conditional (slightly slower than not using one)

I also tried the BitVector and numpy libraries to see if getting away from bare Python would help:

- `bv = BitVector(intVal=h, size=64); v = [x + (w if b else -w) for b, x in zip(bv, v)]` (much slower)
- `np_v = np.zeros(64); bits = np.unpackbits(np.ndarray(shape=(8,), dtype='>B', buffer=h.to_bytes(length=8, byteorder='big'))); np_v += bits * w + (bits - 1) * w` (somewhat slower)

This is all a little frustrating -- intuitively, one would think most of the runtime would go to md5 rather than to the 64 additions. If that were true (and I'm just missing the way to do it), Simhash() might run several times as fast as it does with this PR.
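The two inner-loop styles can be compared in isolation with a small synthetic benchmark (a sketch with made-up inputs, not the original measurement -- timings will vary by machine and Python version):

```python
import random
import timeit

f = 64
masks = [1 << i for i in range(f)]
random.seed(0)
# Synthetic (hash, weight) pairs standing in for real feature hashes.
hashes = [(random.getrandbits(f), 1) for _ in range(1000)]

def mask_loop():
    # Original style: test each bit with a mask.
    v = [0] * f
    for h, w in hashes:
        for i in range(f):
            v[i] += w if h & masks[i] else -w
    return v

def string_loop():
    # PR style: iterate the characters of a binary-string rendering.
    v = [0] * f
    for h, w in hashes:
        b = bin(h & ((1 << f) - 1))[2:].zfill(f)[::-1]
        v = [x + (w if c == '1' else -w) for c, x in zip(b, v)]
    return v

assert mask_loop() == string_loop()  # both produce identical vote vectors
t_mask = timeit.timeit(mask_loop, number=5)
t_str = timeit.timeit(string_loop, number=5)
print('mask loop: %.4fs  string loop: %.4fs' % (t_mask, t_str))
```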