Improve performance with GMP library. #44

bebound · 2020-04-18T09:01:26Z

As #43 mentions, simhash.value will change, but the distance should stay the same.
I've tested on Python 3.7; Python 2 should work too.

import time

from simhash import Simhash

a = Simhash('kk rocks')
print(a.value)

s = time.time()
for i in range(10000):
    a = Simhash('kk rocks')
print(time.time() - s, a.value)

s = time.time()
a = Simhash('kk rocks')
b = Simhash('kk really rocks')
for i in range(1000000):
    dis = a.distance(b)
print(time.time() - s, dis)

output:

new:
864692470817131398
0.8506567478179932 864692470817131398
0.5275230407714844 19

old:
7048486158682030128
0.8147940635681152 7048486158682030128
3.1718358993530273 19

1e0ng · 2020-04-18T11:58:55Z

simhash/__init__.py

-            ans += 1
-            x &= x - 1
-        return ans
+        return popcount(self.value ^ another.value)


Hi, thanks for the pull request.
I'm trying to understand the popcount here.
Have you run any performance tests showing that popcount is more efficient?
Or do you have the source code of popcount?
I've been looking for it but haven't found anything about its implementation yet.
:-)

popcount returns the number of 1 bits in the binary representation of a number.

gmpy2 uses the GMP library's popcount function. There is also a CPU instruction named POPCNT, and I guess GMP is able to use it. A CPU instruction should be much faster than iterating over the bits to find the 1s.

I've shown the performance test in #43; it's 10x faster.
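To make the comparison concrete, here is a minimal sketch of the two ways of counting bits discussed above. `popcount_loop` mirrors the loop removed in the diff (clear the lowest set bit until nothing is left); `popcount_builtin` is a stand-in for a C-level popcount such as `gmpy2.popcount`, used here only so the example runs without extra dependencies. The function names are made up for illustration and are not part of simhash.

```python
def popcount_loop(x):
    """Count set bits by clearing the lowest set bit until none remain."""
    count = 0
    while x:
        x &= x - 1  # drops the lowest set bit
        count += 1
    return count


def popcount_builtin(x):
    """Count set bits via the binary string; the counting itself runs in C."""
    return bin(x).count("1")


def hamming_distance(a, b, popcount=popcount_builtin):
    """Number of bit positions in which a and b differ."""
    return popcount(a ^ b)


a = 0b101101
b = 0b100110
# Both counting strategies agree on the XOR of the two values.
assert popcount_loop(a ^ b) == popcount_builtin(a ^ b)
print(hamming_distance(a, b))  # 3
```

The loop runs once per set bit in the Python interpreter, while a single popcount call does the same work in native code, which is presumably where the speedup in the benchmark comes from.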

Thanks! That helps a lot and it makes sense now. Just curious why the two methods produce different results, since the old method also counts the number of 1 bits. 🙂

By the way, one more thing: in case the CPU doesn't have the popcount instruction, can it fall back to a less efficient way to get the same result? 🙂

The distance function produces the same result, 19.
But the two versions produce different values, 864692470817131398 and 7048486158682030128 respectively, because the new method treats the first bit as the most significant bit. For example, when v is 001, the original method produces 4, but the new method produces 1.
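The bit-order point can be checked with a small sketch. It assumes the original code uses masks of the form `1 << i` and that the new code parses the joined string with `int(..., 2)`; neither detail is visible in the diff excerpt, so treat both as assumptions. The helper names are hypothetical.

```python
def value_lsb_first(v):
    """Original approach: position i sets bit i (mask 1 << i)."""
    ans = 0
    for i, weight in enumerate(v):
        if weight > 0:
            ans |= 1 << i
    return ans


def value_msb_first(v):
    """Proposed approach: position 0 becomes the most significant bit."""
    binary_str = "".join("1" if weight > 0 else "0" for weight in v)
    return int(binary_str, 2)


def distance(x, y):
    """Hamming distance between two fingerprints."""
    return bin(x ^ y).count("1")


v1 = [-1, -2, 3]  # sign pattern 0 0 1
v2 = [4, -1, 5]   # sign pattern 1 0 1

print(value_lsb_first(v1), value_msb_first(v1))  # 4 1

# Reversing the bit order changes the value but not the distance.
assert distance(value_lsb_first(v1), value_lsb_first(v2)) == \
    distance(value_msb_first(v1), value_msb_first(v2))
```

Both encodings set a bit for exactly the same positions of v, only mirrored, so the number of differing positions between two fingerprints is unchanged.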

Don't worry, the GMP library should always return the right result regardless of which CPU is in use.
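The question above is about CPU support, which GMP handles internally. A different kind of fallback, at the Python level, would cover environments where gmpy2 itself cannot be installed. The sketch below is not part of this pull request; it only assumes that `gmpy2.popcount` is importable when gmpy2 is present, and the `distance` helper is a hypothetical stand-in for the library's method.

```python
try:
    from gmpy2 import popcount  # optional C-backed implementation
except ImportError:
    def popcount(x):
        """Pure-Python fallback: count '1' characters in the binary form."""
        return bin(x).count("1")


def distance(a, b):
    """Hamming distance between two integer fingerprints."""
    return int(popcount(a ^ b))


print(distance(0b1100, 0b1010))  # 2
```

With this shape the result is the same either way; only the speed differs depending on whether gmpy2 is available.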

1e0ng · 2020-04-18T13:19:41Z

simhash/__init__.py

-            if v[i] > 0:
-                ans |= masks[i]
-        self.value = ans
+        binary_str = ''.join(['0' if i <= 0 else '1' for i in v])


If we don't change this, the popcount will still work, right? The join here is O(n), the same complexity as the old implementation. May I ask why you changed this?

Hi KK, thanks again for replying.

I see this is another improvement you are trying to make.
Could you move this change into a separate pull request?
I just want to make this change minimal so that it's easy to test and I'll have much higher confidence to merge if so.
Thanks.

Okay, I've reverted the value calculation.

Thanks, bebound. Can you update the unit tests as well? Right now the build is failing. I'm not sure why the build result is not showing here, but I'll check later. (Normally, each commit triggers a CI build and the result shows here.)

Sorry, I forgot to roll back the value in the test. It should work now.

Travis said I forgot to add gmpy2 as a dependency; I'll fix that.

Travis has been fixed.

bebound · 2020-04-20T02:21:40Z

It should work. I changed this to make the code cleaner and more intuitive. In general, the binary string 0001 represents 1.


1e0ng · 2020-04-25T14:39:26Z

@bebound Thanks for your work! 👍
Merged and published to PyPI: https://pypi.org/project/simhash/#history

This reverts commit 9409b67.

bebound added 3 commits April 18, 2020 16:54

Use ^ and popcount calculate distance

6525de2

Update version number

a0e8101

Clean code

c2d3881

1e0ng reviewed Apr 18, 2020


1e0ng reviewed Apr 19, 2020


bebound added 2 commits April 22, 2020 21:53

Revert value calculation to previous method

af576c6

Rollback test

a1903a4

1e0ng approved these changes Apr 25, 2020


bebound added 4 commits April 25, 2020 16:32

Fix gmpy2 dependency

ff5e043

Fix travis test

d9efcbe

Update .travis.yml

7262f67

Update .travis.yml

96244a8

1e0ng changed the title ~~Use ^ and popcount calculate distance~~ Improve performance with GMP library. Apr 25, 2020

1e0ng merged commit 9409b67 into 1e0ng:master Apr 25, 2020

bebound deleted the performance branch April 25, 2020 15:13

ajdapretnar mentioned this pull request Apr 28, 2020

Gmpy2 causes installation issues downstream #46

Open

1e0ng added a commit that referenced this pull request May 1, 2020

Revert "Improve performance with GMP library. (#44)"

d88d21b

This reverts commit 9409b67.
