Currently, Nearest Neighbor (NN) search is used in the following places:
- categorization
- DBSCAN
- semantic search
Previous optimization attempts in #15 concluded that for NN search in the LSI space (approximate data size of [700k documents, 150 LSI dimensions]), the brute-force method (with multi-threaded BLAS) outperforms the KDTree and BallTree structures from scikit-learn.
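For reference, a minimal timing sketch of that comparison via scikit-learn's `NearestNeighbors` (this is an assumed setup, downsized so it runs quickly, not the actual benchmark code from #15; `n_neighbors=10` and the query count are placeholders):

```python
# Minimal sketch: compare brute-force vs. tree-based NN back-ends in
# scikit-learn on data shaped like the LSI vectors (downsized here).
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 150)).astype(np.float32)  # stand-in for ~700k x 150 LSI vectors
queries = X[:100]

for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=10, algorithm=algorithm).fit(X)
    t0 = time.perf_counter()
    nn.kneighbors(queries)  # brute uses BLAS-backed distance computation
    print(f"{algorithm:>9}: {time.perf_counter() - t0:.3f} s for 100 queries")
```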
However, for larger dataset sizes brute-force NN won't be feasible (due to time and memory constraints), and some index over the semantic space would have to be constructed for NN queries. A good benchmark of exact and approximate NN algorithms (mostly in C++ with Python wrappers) can be found at erikbern/ann-benchmarks. This does add the extra complexity of deciding what level of approximate-NN precision would be acceptable...
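As a hedged sketch of how that precision could be estimated: build an approximate index, then measure recall@k against exact brute-force results. Annoy is used below purely for illustration (it is one of the libraries covered by ann-benchmarks, not a choice made here), and the angular metric is an assumption based on LSI similarity typically being cosine:

```python
# Hypothetical sketch: recall of an approximate NN index vs. exact results.
# Annoy and all parameters are illustrative assumptions, not decisions.
import numpy as np
from annoy import AnnoyIndex
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
dim, k, n_queries = 150, 10, 100
X = rng.standard_normal((50_000, dim)).astype(np.float32)  # stand-in for LSI vectors

# Exact ground truth (cosine, assuming LSI similarity is cosine-based).
exact = NearestNeighbors(n_neighbors=k, algorithm="brute", metric="cosine").fit(X)
_, true_ids = exact.kneighbors(X[:n_queries])

# Approximate index; "angular" is Annoy's cosine-like metric.
index = AnnoyIndex(dim, "angular")
for i, v in enumerate(X):
    index.add_item(i, v)
index.build(50)  # more trees -> better recall, larger index

# recall@k: fraction of true neighbors recovered by the approximate index.
hits = sum(
    len(set(index.get_nns_by_vector(X[q], k)) & set(true_ids[q]))
    for q in range(n_queries)
)
print(f"recall@{k}: {hits / (n_queries * k):.3f}")
```

A recall threshold (e.g. how close to 1.0 it must stay for categorization and semantic search to behave acceptably) would still need to be decided per use case.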