This code implements relatively fast cosine similarity computation and kNN classification for large matrices, so I never have to worry about it again.
It's also possible to batch the computations to save space.
Vectors are classified using a kNN majority-vote approach.
The main algorithm is implemented in `src/knn.py`.
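The approach described above can be sketched roughly as follows. This is a minimal illustration using NumPy, not the code in `src/knn.py`; the function names `normalize_rows` and `knn_classify` and the `batch_size` parameter are hypothetical. Rows are normalized so that a matrix product yields cosine similarities, and queries are processed in batches so that only a `batch_size x n_train` similarity block is held in memory at a time.

```python
from collections import Counter

import numpy as np


def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def knn_classify(train, labels, queries, k=3, batch_size=1024):
    """Label each query row by majority vote among its k most similar training rows.

    Hypothetical helper for illustration only.
    """
    train_n = normalize_rows(np.asarray(train, dtype=float))
    queries_n = normalize_rows(np.asarray(queries, dtype=float))
    k = min(k, len(train_n))
    predictions = []
    for start in range(0, len(queries_n), batch_size):
        batch = queries_n[start:start + batch_size]
        # Dot products of unit vectors are cosine similarities.
        sims = batch @ train_n.T  # shape: (rows in batch, n_train)
        # Indices of the k largest similarities per row (in no particular order).
        top_k = np.argpartition(-sims, k - 1, axis=1)[:, :k]
        for row in top_k:
            votes = Counter(labels[i] for i in row)
            predictions.append(votes.most_common(1)[0][0])
    return predictions


train = [[1, 0], [0.9, 0.1], [0.8, 0.2], [0, 1], [0.1, 0.9], [0.2, 0.8]]
labels = ["a", "a", "a", "b", "b", "b"]
queries = [[1, 0.05], [0.05, 1]]
print(knn_classify(train, labels, queries, k=3))  # prints ['a', 'b']
```

A smaller `batch_size` lowers peak memory at the cost of more matrix products. Ties in the vote are broken arbitrarily in this sketch.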
You can test locally in a Docker container with an ES index, if you have credentials for an ElasticSearch cluster:
- Port-forward the ElasticSearch cluster to port `9200`.
- Set the ElasticSearch environment variables in `src/local.env`: `ES_USERNAME` and `ES_PASSWORD` if the cluster requires authentication, and `ES_INDEX` with the name of the index that you want to process. It should contain documents with a field named `full_text`.
- Run `make run`. This will train the model, and add a `similar_docs` field to the documents in the index. Note that this Make command limits the CPU and memory usage; you can adjust this with the variables set in the `Makefile`.
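
As a rough illustration of the environment variables step, `src/local.env` might look like the following. This assumes the file uses the plain `KEY=value` format that Docker env files accept; the values are placeholders, not real settings.

```
ES_USERNAME=your-username
ES_PASSWORD=your-password
ES_INDEX=your-index-name
```

If the cluster does not require authentication, the first two lines can presumably be omitted.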