Skip to content
Branch: master
Go to file
Code

Latest commit

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 

README.rst

pybktree

pybktree is a generic, pure Python implementation of a BK-tree data structure, which allows fast querying of "close" matches (for example, matches with small hamming distance or Levenshtein distance). This module is based on the algorithm by Nick Johnson in his blog article on BK-trees.

The library is on the Python Package Index (PyPI) and works on both Python 3 and Python 2.7. To install it, fire up a command prompt, activate your virtual environment if you're using one, and type:

pip install pybktree

Example usage:

>>> tree = pybktree.BKTree(pybktree.hamming_distance, [0, 4, 5, 14])
>>> tree.add(15)              # add element 15
>>> sorted(tree)              # BKTree instances are iterable
[0, 4, 5, 14, 15]
>>> sorted(tree.find(13, 1))  # find elements at most 1 bit away from element 13
[(1, 5), (1, 15)]

If you need to track the ID, key, or filename of the original item, use a tuple or namedtuple. Repeating the above example with an Item namedtuple:

>>> import collections
>>> Item = collections.namedtuple('Item', 'bits id')
>>> def item_distance(x, y):
...     return pybktree.hamming_distance(x.bits, y.bits)
>>> tree = pybktree.BKTree(item_distance, [Item(0, 'a'), Item(4, 'b'),
                                           Item(5, 'c'), Item(14, 'd')])
>>> tree.add(Item(15, 'e'))
>>> sorted(tree.find(Item(13, 'x'), 1))
[(1, Item(bits=5, id='c')), (1, Item(bits=15, id='e'))]

For large trees and fairly small N when calling find(), using a BKTree is much faster than doing a linear search. This is especially good when you're de-duping a few hundred thousand photos -- with a linear search that would become a very slow, O(N) for every photo, so O(N²) in total.

A lookup in a BKTree is much faster than linear for small distance thresholds -- though it goes up to O(N) far large distance thresholds, so won't be valuable in those cases. See Maximilian Knespel's detailed analysis.

Read the code in pybktree.py for more details – it's pretty small!

Other BK-tree modules I found on GitHub while writing this one:

  • ahupp/bktree: this one is pretty good, but it's not on PyPI, and it's recursive
  • ryanfox/bktree: this one is hard to customize, search() doesn't return distances, it's slower, and was buggy (though I think he fixed it recently)

pybktree was written by Ben Hoyt for Jetsetter and is licensed with a permissive MIT license (see LICENSE.txt).

About

Python BK-tree data structure to allow fast querying of "close" matches

Topics

Resources

License

Languages

You can’t perform that action at this time.