Efficient indexing for NN queries #91

rth · 2017-02-11T18:18:21Z

Currently, nearest neighbor (NN) search is used in the following places:

categorization
DBSCAN
Semantic search

Previous optimization attempts in #15 concluded that, for NN search in the LSI space (approximate data size of [700k documents, 150 LSI dimensions]), the brute-force method (with multi-threaded BLAS) outperforms the KDTree and BallTree structures from scikit-learn.
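For illustration, a minimal sketch of what brute-force NN search boils down to, using NumPy only. Cosine similarity, the random data, and the function name `brute_force_knn` are assumptions made for this example; this is not the code benchmarked in #15. The single dense matrix product is the part that multi-threaded BLAS speeds up.

```python
import numpy as np


def brute_force_knn(X, Q, k=5):
    """Exact k-NN by cosine similarity (hypothetical helper, for illustration)."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    # One dense product of shape (n_queries, n_documents); dispatched to BLAS.
    sims = Qn @ Xn.T
    # Select the k most similar documents per query without a full sort.
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    # Order those k candidates by decreasing similarity.
    row = np.arange(Q.shape[0])[:, None]
    order = np.argsort(-sims[row, top], axis=1)
    idx = top[row, order]
    return idx, sims[row, idx]


# Toy data standing in for the LSI space (150 dimensions, far fewer documents).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 150))
Q = X[:3]  # query with the first three documents themselves
idx, sims = brute_force_knn(X, Q, k=5)
```

The similarity matrix has one entry per (query, document) pair, so memory and time grow linearly with the collection size for each query batch.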

However, for larger datasets brute-force NN won't be feasible (due to time and memory constraints), and some index in the semantic space would have to be constructed for NN queries. A good benchmark of exact and approximate NN algorithms (mostly in C++ with Python wrappers) can be found at erikbern/ann-benchmarks. Using approximate NN adds another layer of complexity, though: estimating what level of precision would be acceptable...
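One possible way to quantify "acceptable" is recall@k: the fraction of the exact k nearest neighbors that the approximate search also returns. The sketch below is only an assumption about how this could be measured here; the helper name `recall_at_k` and the toy index arrays are made up for the example.

```python
import numpy as np


def recall_at_k(exact_idx, approx_idx):
    """Mean fraction of exact neighbors recovered by the approximate search."""
    exact_idx = np.asarray(exact_idx)
    approx_idx = np.asarray(approx_idx)
    k = exact_idx.shape[1]
    # Count, per query, how many exact neighbors also appear in the approximate result.
    hits = [len(set(e) & set(a)) for e, a in zip(exact_idx, approx_idx)]
    return float(np.mean(hits)) / k


# Toy neighbor indices for two queries with k=3 (illustrative values only).
exact = [[0, 5, 9], [1, 4, 7]]
approx = [[0, 9, 3], [1, 4, 7]]
score = recall_at_k(exact, approx)  # 5 of 6 exact neighbors recovered
```

In practice the exact indices would come from a brute-force search on a sample of queries, and the approximate ones from whichever index is being evaluated.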


rth added the large scale label Feb 11, 2017

rth added this to the v2.0 milestone Feb 12, 2017
