Improve performance with GMP library. #44

Merged
merged 9 commits into from
Apr 25, 2020
Changes from 5 commits
3 changes: 2 additions & 1 deletion setup.py
@@ -4,7 +4,7 @@

 setup(
     name = 'simhash',
-    version = '1.9.1',
+    version = '1.10',
     keywords = ('simhash'),
     description = 'A Python implementation of Simhash Algorithm',
     license = 'MIT License',
@@ -22,6 +22,7 @@
         'numpy',
         'scipy',
         'scikit-learn',
+        'gmpy2'
     ],
     test_suite = "nose.collector",
 )
17 changes: 7 additions & 10 deletions simhash/__init__.py
@@ -1,14 +1,16 @@
 # Created by 1e0n in 2013
 from __future__ import division, unicode_literals

-import re
-import sys
+import collections
 import hashlib
 import logging
 import numbers
-import collections
+import re
+import sys
 from itertools import groupby

+from gmpy2 import popcount
+
 if sys.version_info[0] >= 3:
     basestring = str
     unicode = str
@@ -82,7 +84,7 @@ def _tokenize(self, content):

     def build_by_text(self, content):
         features = self._tokenize(content)
-        features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))}
+        features = {k: sum(1 for _ in g) for k, g in groupby(sorted(features))}
         return self.build_by_features(features)

     def build_by_features(self, features):
@@ -113,12 +115,7 @@ def build_by_features(self, features):

     def distance(self, another):
         assert self.f == another.f
-        x = (self.value ^ another.value) & ((1 << self.f) - 1)
-        ans = 0
-        while x:
-            ans += 1
-            x &= x - 1
-        return ans
+        return popcount(self.value ^ another.value)
Owner

Hi, thanks for the pull request.
I'm trying to understand the popcount here.
Have you done any performance tests to show that popcount is more efficient?
Or do you have a link to the source code of popcount?
I've been looking for it, but haven't found anything about its implementation yet.
:-)

Contributor Author

@bebound Apr 18, 2020

popcount returns the number of 1 bits in the binary representation of an integer.

gmpy2 uses the GMP library's popcount function. There is also a CPU instruction named POPCNT, and I guess GMP should be able to call it. A CPU instruction should be much faster than iterating over the bits to find the 1s.

I've shown the performance test in #43; it's 10x faster.
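
For readers following along, here is a minimal sketch of the two counting approaches being compared. The function names `popcount_loop` and `popcount_builtin` are hypothetical and not part of simhash. The loop mirrors the code removed in this PR, and `bin(x).count("1")` stands in for a library popcount so the snippet runs without gmpy2 installed.

```python
def popcount_loop(x):
    """Count set bits by clearing the lowest set bit until none remain."""
    count = 0
    while x:
        x &= x - 1  # drops the lowest 1 bit
        count += 1
    return count


def popcount_builtin(x):
    """Count set bits using only the standard library."""
    return bin(x).count("1")


# Both approaches agree on the XOR of two example fingerprints.
a, b = 0b101101, 0b100111
assert popcount_loop(a ^ b) == popcount_builtin(a ^ b)
```

With gmpy2 installed, `gmpy2.popcount(a ^ b)` should return the same count for non-negative inputs. The loop runs once per set bit in Python bytecode, which is where a native implementation is expected to gain its advantage.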

Owner

Thanks! That helps a lot and it makes sense now. Just curious why the two methods produced different results, because the old method also counts the number of 1 bits. 🙂

Owner

By the way, one more thing: in case the CPU doesn't have the POPCNT instruction, can it fall back to a less efficient way to get the same result? 🙂

Contributor Author

The distance function produces the same result, 19.
But the two versions produce different values, 864692470817131398 and 7048486158682030128 respectively, because the new method treats the first bit as the most significant bit. For example, when v is 001, the original method produces 4, but the new method produces 1.
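
The bit-order point can be sketched as follows. This is an illustration only, not the library's code: `pack_first_bit_lowest`, `pack_first_bit_highest`, and `hamming` are hypothetical helpers that pack a list of 0/1 bits into an integer in the two orders described above.

```python
def pack_first_bit_lowest(bits):
    """Index 0 of the vector becomes the least significant bit."""
    value = 0
    for i, bit in enumerate(bits):
        if bit:
            value |= 1 << i
    return value


def pack_first_bit_highest(bits):
    """Index 0 of the vector becomes the most significant bit."""
    value = 0
    for bit in bits:
        value = (value << 1) | (1 if bit else 0)
    return value


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


v = [0, 0, 1]
assert pack_first_bit_lowest(v) == 4   # binary 100
assert pack_first_bit_highest(v) == 1  # binary 001

# Reversing the bit order changes the values but not the distance.
u, w = [1, 0, 1, 1], [0, 0, 1, 0]
assert hamming(pack_first_bit_lowest(u), pack_first_bit_lowest(w)) == \
    hamming(pack_first_bit_highest(u), pack_first_bit_highest(w))
```

Because both fingerprints are reordered the same way, every differing position stays a differing position, so the distance is unchanged. Fingerprints stored by the old version would not be comparable with fingerprints computed by the new one, though.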

Contributor Author

Don't worry, the GMP library should always return the right result regardless of which CPU is in use.
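
As an illustration of what a software-only fallback can look like, here is a sketch of the well-known parallel bit-counting trick for 64-bit values. It is not GMP's actual implementation, which this thread does not show; `popcount64_swar` is a hypothetical name, and the 64-bit width is an assumption chosen to match the example values above.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def popcount64_swar(x):
    """Count set bits in the low 64 bits of x without a POPCNT instruction."""
    x &= MASK64
    # Sum adjacent bits into 2-bit fields.
    x = x - ((x >> 1) & 0x5555555555555555)
    # Sum adjacent 2-bit fields into 4-bit fields.
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    # Sum adjacent 4-bit fields into bytes.
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    # Add all bytes together; the total lands in the top byte.
    return ((x * 0x0101010101010101) & MASK64) >> 56


# The result matches a straightforward count of 1 characters.
for n in (0, 1, 0xFF, 864692470817131398, 7048486158682030128, MASK64):
    assert popcount64_swar(n) == bin(n).count("1")
```

The point is only that the same count can be computed with ordinary shifts, masks, and adds, so the result does not depend on a particular CPU instruction being available.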



class SimhashIndex(object):