-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance with GMP library. #44
Changes from 3 commits
6525de2
a0e8101
c2d3881
af576c6
a1903a4
ff5e043
d9efcbe
7262f67
96244a8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,16 @@ | ||
# Created by 1e0n in 2013 | ||
from __future__ import division, unicode_literals | ||
|
||
import re | ||
import sys | ||
import collections | ||
import hashlib | ||
import logging | ||
import numbers | ||
import collections | ||
import re | ||
import sys | ||
from itertools import groupby | ||
|
||
from gmpy2 import popcount | ||
|
||
if sys.version_info[0] >= 3: | ||
basestring = str | ||
unicode = str | ||
|
@@ -82,7 +84,7 @@ def _tokenize(self, content): | |
|
||
def build_by_text(self, content): | ||
features = self._tokenize(content) | ||
features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))} | ||
features = {k: sum(1 for _ in g) for k, g in groupby(sorted(features))} | ||
return self.build_by_features(features) | ||
|
||
def build_by_features(self, features): | ||
|
@@ -105,20 +107,12 @@ def build_by_features(self, features): | |
w = f[1] | ||
for i in range(self.f): | ||
v[i] += w if h & masks[i] else -w | ||
ans = 0 | ||
for i in range(self.f): | ||
if v[i] > 0: | ||
ans |= masks[i] | ||
self.value = ans | ||
binary_str = ''.join(['0' if i <= 0 else '1' for i in v]) | ||
self.value = int(binary_str, 2) | ||
|
||
def distance(self, another): | ||
assert self.f == another.f | ||
x = (self.value ^ another.value) & ((1 << self.f) - 1) | ||
ans = 0 | ||
while x: | ||
ans += 1 | ||
x &= x - 1 | ||
return ans | ||
return popcount(self.value ^ another.value) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi, thanks for the pull request. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I've show the performance test in #43, it's 10x faster. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks!, that helps a lot and it makes sense now. Just curious why the 2 methods produced different results because the old method is to get the number of 1 in binary too. 🙂 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. By the way, one more thing, in case cpu doesn't have that instruction popcount, can it fall back to less efficient ways to get the same result? 🙂 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't worry, |
||
|
||
|
||
class SimhashIndex(object): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't change this, the popcount will still work, right? The join command here is O(n) the same complexity as the old implementation. May I ask why you change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi KK, thanks again for replying.
I see this is another improvement you are trying to make.
Could you move this change into a separate pull request?
I just want to make this change minimal so that it's easy to test and I'll have much higher confidence to merge if so.
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I've revert the value calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, bebound. Can you update the unit tests as well? Because right now the build is failing. I'm not sure why the build result is not showing here, but I'll check later. (By right, each commit will trigger a CI build and the result will show here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I forget to rollback the value in test. It should work now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Travis said I forget to add
gmpy2
as dependency, I'll fix that.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Travis has been fixed.