Improve performance with GMP library. #44

Merged
merged 9 commits into from
Apr 25, 2020

Conversation

bebound
Contributor

@bebound bebound commented Apr 18, 2020

As #43 mentions, simhash.value will change, but the distance should stay the same.
I've tested this on Python 3.7; Python 2 should work too.

import time

from simhash import Simhash

a = Simhash('kk rocks')
print(a.value)

s = time.time()
for i in range(10000):
    a = Simhash('kk rocks')
print(time.time() - s, a.value)

s = time.time()
a = Simhash('kk rocks')
b = Simhash('kk really rocks')
for i in range(1000000):
    dis = a.distance(b)
print(time.time() - s, dis)

output:

new:
864692470817131398
0.8506567478179932 864692470817131398
0.5275230407714844 19

old:
7048486158682030128
0.8147940635681152 7048486158682030128
3.1718358993530273 19

ans += 1
x &= x - 1
return ans
return popcount(self.value ^ another.value)
Owner

Hi, thanks for the pull request.
I'm trying to understand the popcount here.
Have you run any performance tests that show popcount is more efficient?
Or do you have a link to the source code of popcount?
I've been looking for it, but haven't found anything about its implementation yet.
:-)

Contributor Author

@bebound bebound Apr 18, 2020

popcount returns the number of 1 bits in the binary representation of an integer.

gmpy2 uses the GMP library's popcount function. There is also a CPU instruction named POPCNT, and I guess GMP is able to use it. A CPU instruction should be much faster than iterating over the bits to find the 1s.

I showed the performance test in #43; it's 10x faster.
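
For reference, here is a minimal pure-Python sketch of the loop-based bit count that appears in the diff above, next to the XOR-based distance. The function names `popcount_loop` and `hamming_distance` are hypothetical and not part of simhash or gmpy2; `bin(x).count('1')` is used only as a cross-check.

```python
def popcount_loop(x):
    """Count set bits by clearing the lowest set bit on each iteration."""
    ans = 0
    while x:
        ans += 1
        x &= x - 1  # drops the lowest 1 bit of x
    return ans


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return popcount_loop(a ^ b)


if __name__ == "__main__":
    x = 0b1011
    print(popcount_loop(x))                   # 3
    print(hamming_distance(0b1011, 0b0010))   # 0b1001 has two 1 bits -> 2
    print(popcount_loop(x) == bin(x).count('1'))
```

The loop runs once per set bit, all in interpreted Python, which is the overhead a native popcount avoids.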

Owner

Thanks! That helps a lot and it makes sense now. Just curious: why do the two methods produce different results, given that the old method also counts the number of 1s in the binary representation? 🙂

Owner

By the way, one more thing: if the CPU doesn't have the popcount instruction, can it fall back to a less efficient way of getting the same result? 🙂

Contributor Author

The distance function produces the same result, 19, in both versions.
But the two versions produce different values, 864692470817131398 (new) and 7048486158682030128 (old), because the new method treats the first bit as the most significant bit. For example, when v is 001, the original method produces 4, but the new method produces 1.
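
A small sketch of the two bit orderings, assuming the original code uses `masks[i] = 1 << i` (an assumption based on the `ans |= masks[i]` line in the diff). The function names are hypothetical. It reproduces the 001 example and checks that the distance is unchanged when both values are built with the same ordering.

```python
def value_lsb_first(v):
    """Original approach: a positive v[i] sets bit i (mask 1 << i)."""
    ans = 0
    for i, weight in enumerate(v):
        if weight > 0:
            ans |= 1 << i
    return ans


def value_msb_first(v):
    """Proposed approach: join v into a binary string, so v[0] is the most significant bit."""
    return int(''.join('1' if weight > 0 else '0' for weight in v), 2)


def distance(a, b):
    """Hamming distance between two non-negative integers."""
    return bin(a ^ b).count('1')


if __name__ == "__main__":
    v = [0, 0, 1]
    print(value_lsb_first(v))  # 4
    print(value_msb_first(v))  # 1

    # Two made-up weight vectors that differ in sign at positions 0, 1 and 3.
    u = [1, 0, 1, -1]
    w = [-2, 3, 1, 1]
    print(distance(value_lsb_first(u), value_lsb_first(w)))  # 3
    print(distance(value_msb_first(u), value_msb_first(w)))  # 3
```

Reversing the bit order only permutes the positions, so the number of differing bits is the same either way. Values computed with different orderings are not comparable with each other, though.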

Contributor Author

Don't worry, the GMP library should always return the right result regardless of which CPU is in use.
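
A related but separate concern is the gmpy2 package itself being unavailable. The sketch below shows one possible import-time fallback; it is an illustration only, not what this pull request implements, and the standalone `distance` function is hypothetical. It assumes non-negative values, since the string-based fallback counts differently for negative integers.

```python
try:
    from gmpy2 import popcount  # fast path backed by GMP
except ImportError:
    def popcount(x):
        """Pure-Python fallback: count the '1' characters in bin(x)."""
        return bin(x).count('1')


def distance(value_a, value_b):
    """Hamming distance between two non-negative fingerprint values."""
    return int(popcount(value_a ^ value_b))


if __name__ == "__main__":
    print(distance(0b1011, 0b0010))  # 2
```

Both paths return the same count for non-negative inputs, so only speed differs.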

if v[i] > 0:
ans |= masks[i]
self.value = ans
binary_str = ''.join(['0' if i <= 0 else '1' for i in v])
Owner

If we don't change this, popcount will still work, right? The join here is O(n), the same complexity as the old implementation. May I ask why you changed it?

Owner

Hi KK, thanks again for replying.

I see this is another improvement you are trying to make.
Could you move this change into a separate pull request?
I want to keep this change minimal so that it's easy to test; that way I'll have much more confidence merging it.
Thanks.

Contributor Author

Okay, I've reverted the value calculation.

Owner

Thanks, bebound. Can you update the unit tests as well? Right now the build is failing. I'm not sure why the build result is not showing here, but I'll check later. (Normally, each commit triggers a CI build and the result shows up here.)

Contributor Author

Sorry, I forgot to roll back the value in the tests. It should work now.

Contributor Author

Travis says I forgot to add gmpy2 as a dependency. I'll fix that.

Contributor Author

@bebound bebound Apr 25, 2020

Travis has been fixed.

@bebound
Contributor Author

bebound commented Apr 20, 2020 via email

@1e0ng 1e0ng changed the title from "Use ^ and popcount calculate distance" to "Improve performance with GMP library." Apr 25, 2020
@1e0ng 1e0ng merged commit 9409b67 into 1e0ng:master Apr 25, 2020
@1e0ng
Owner

1e0ng commented Apr 25, 2020

@bebound Thanks for your work! 👍
Merged and published to PyPI: https://pypi.org/project/simhash/#history

@bebound bebound deleted the performance branch April 25, 2020 15:13
1e0ng added a commit that referenced this pull request May 1, 2020