
docvecs.most_similar() does not return cosine similarity = 1 for same document vector #2915

Closed
PredictedLife opened this issue Aug 14, 2020 · 4 comments

PredictedLife commented Aug 14, 2020

If you call the most_similar() method with an inferred vector, a top-n list of the most similar documents, including cosine similarities, is returned.

If the document was also present in the training set, the most similar document returned should be that same document.

Surprisingly, the cosine similarity to the same document is approximately 0.86, but never 1.

Is this a bug? Or is there any explanation for this behaviour?

piskvorky (Owner) commented Aug 14, 2020

Check the FAQ #12.

PredictedLife (Author) commented
Hi Radim, thank you very much for your quick response and for providing gensim!

I am aware of FAQ #12, but the difference seems too big:

  1. Call most_similar() with an inferred vector:
d1 = model.infer_vector(tokenized_text_document_1)
model.docvecs.most_similar(positive=[d1], topn=1)

returns: [('doc1', 0.8511...)]

  2. Infer the vector of the same document multiple times and calculate the cosine similarity:
from sklearn.metrics.pairwise import cosine_similarity  # assuming scikit-learn's cosine_similarity
d1_inferred_1 = model.infer_vector(tokenized_text_document_1)
d1_inferred_2 = model.infer_vector(tokenized_text_document_1)
cosine_similarity([d1_inferred_1], [d1_inferred_2])

returns: 0.98
I inferred the vectors multiple times, but the variance was never bigger than 0.005.

But why is there such a big difference between the cosine similarities from model.docvecs.most_similar(positive=[d1], topn=1) and cosine_similarity([d1_inferred_1], [d1_inferred_2])?
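For reference, the ~0.85 number can also be reproduced directly, without most_similar(), by comparing a re-inferred vector against the vector stored in the model during bulk training (a hedged sketch; it assumes the document was trained under the tag 'doc1', and reuses the model and tokenized_text_document_1 from above):

import numpy as np

d1_inferred = model.infer_vector(tokenized_text_document_1)
d1_trained = model.docvecs['doc1']  # vector learned for this document during bulk training

# plain cosine similarity between the re-inferred and the trained vector;
# this should roughly match the ~0.85 that most_similar() reports
print(np.dot(d1_inferred, d1_trained)
      / (np.linalg.norm(d1_inferred) * np.linalg.norm(d1_trained)))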

piskvorky (Owner) commented
FAQ #12 explains that: training uses different parameters than inference does.
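In code terms, a minimal hedged sketch of that point (the parameter values are illustrative assumptions, not recommendations) is to pass explicit epochs/alpha to infer_vector(), so that inference runs with settings closer to those used during bulk training:

# reuses the model and tokenized document from above
d1 = model.infer_vector(
    tokenized_text_document_1,
    epochs=model.epochs,  # run as many inference passes as there were training epochs
    alpha=0.025,          # illustrative starting learning rate
)
model.docvecs.most_similar(positive=[d1], topn=1)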

gojomo (Collaborator) commented Aug 15, 2020

That your re-inferred vectors are close to each other is good, but if you think the difference from the vector in the model is "too big", maybe there were other problems with the adequacy of the data/training/parameters (about which you provide no details) during the initial bulk training.

Note, especially, that while training's N epochs occur on a model which is, for many of those epochs, less than half trained, all N inference epochs happen on a model that's already fully trained. So one thing immediately worth trying, to reduce the effect, could be more initial training epochs.
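A minimal sketch of that suggestion (the corpus variable tagged_docs and all parameter values here are assumptions for illustration, not a recommendation):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# assumed corpus: a list of TaggedDocument objects, one per training document,
# e.g. tagged_docs = [TaggedDocument(words=tokens, tags=[doc_id]), ...]
model = Doc2Vec(vector_size=100, min_count=2, epochs=40)  # more training epochs than the default
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)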

There are other tips in the project discussion mailing list archives – https://groups.google.com/forum/#!forum/gensim – or providing more details about your setup (data size/quality/type, parameters, goals, etc.) may generate more tips specific to your situation.
