Efficient indexing for NN queries #91

Open
rth opened this issue Feb 11, 2017 · 0 comments

rth commented Feb 11, 2017

Currently, nearest neighbor (NN) search is used in the following places:

  • categorization
  • DBSCAN
  • Semantic search

Previous optimization attempts in #15 concluded that for NN search in the LSI space (approximate data size of [700k documents, 150 LSI dimensions]), the brute-force method (with multi-threaded BLAS) outperforms the KDTree and BallTree structures from scikit-learn.
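
For reference, a minimal sketch of what brute-force NN looks like when expressed as a single matrix product, which is the part that multi-threaded BLAS accelerates. This is not the project's actual implementation: the function name `brute_force_nn`, the random data, and the assumption that vectors are L2-normalized (so the dot product equals cosine similarity) are all illustrative.

```python
import numpy as np


def brute_force_nn(X, Q, k=5):
    """Indices and cosine similarities of the k nearest rows of X for each row of Q.

    Assumes (hypothetically) that rows of X and Q are L2-normalized,
    so the dot product is the cosine similarity.
    """
    # One dense matrix product; NumPy hands this to the underlying BLAS.
    sims = Q @ X.T  # shape (n_queries, n_docs)
    # Select the k largest similarities per query without a full sort.
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    top_sims = np.take_along_axis(sims, top, axis=1)
    # Sort only the k selected candidates, most similar first.
    order = np.argsort(-top_sims, axis=1)
    idx = np.take_along_axis(top, order, axis=1)
    return idx, np.take_along_axis(top_sims, order, axis=1)


# Synthetic stand-in for an LSI matrix (much smaller than 700k documents).
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 150))
X /= np.linalg.norm(X, axis=1, keepdims=True)

Q = X[:3]  # query with the first three documents
idx, sims = brute_force_nn(X, Q, k=5)
```

Note that the intermediate `sims` array has shape (n_queries, n_docs), so queries would need to be processed in batches to keep memory bounded as the collection grows.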

However, for larger datasets brute-force NN won't be feasible (due to time and memory constraints), and an index in the semantic space would have to be constructed for NN queries. A good benchmark for exact and approximate NN algorithms (mostly in C++ with Python wrappers) can be found at erikbern/ann-benchmarks. This adds the further complication of deciding what precision is acceptable for approximate NN results.
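
One possible way to quantify "acceptable" is to measure recall against the exact brute-force result on a sample of queries. The sketch below is only illustrative: `recall_at_k` and `top_k` are hypothetical helpers, the data is random, and the "approximate" search (scoring on only the leading dimensions) is a toy stand-in for a real approximate index, not a method proposed here.

```python
import numpy as np


def top_k(X, Q, k):
    """Exact top-k by dot product (full sort, fine for a small sample)."""
    return np.argsort(-(Q @ X.T), axis=1)[:, :k]


def recall_at_k(exact_idx, approx_idx):
    """Mean fraction of the true k neighbors that the approximate search returned."""
    hits = sum(
        len(set(e.tolist()) & set(a.tolist()))
        for e, a in zip(exact_idx, approx_idx)
    )
    return hits / exact_idx.size


rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 150))
X /= np.linalg.norm(X, axis=1, keepdims=True)
Q = X[:50]
k = 10

exact = top_k(X, Q, k)

# Toy "approximate" search: score using only the first n_dims dimensions.
# This only stands in for a real approximate index.
recalls = {}
for n_dims in (25, 75, 150):
    approx = top_k(X[:, :n_dims], Q[:, :n_dims], k)
    recalls[n_dims] = recall_at_k(exact, approx)
```

Plotting such a recall figure against query time for each candidate index would give a concrete basis for choosing a precision threshold.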

rth added this to the v2.0 milestone Feb 12, 2017