
docvecs.most_similar() does not return cosine similarity = 1 for same document vector #2915

Closed
PredictedLife opened this issue Aug 14, 2020 · 4 comments

PredictedLife commented Aug 14, 2020

If you call the most_similar() method with an inferred vector, a top-n list of the most similar documents, including cosine similarities, is returned.

If the document was also present in the training set, the most similar document returned should be that same document.

Surprisingly, the cosine similarity to the same document is approximately 0.86, but never 1.

Is this a bug? Or is there any explanation for this behaviour?

piskvorky (Owner) commented Aug 14, 2020

Check the FAQ #12.

PredictedLife (Author) commented
Hi Radim, thank you very much for your quick response and for providing gensim!

I am aware of FAQ #12, but the difference seems too big:

  1. Call most_similar() with an inferred vector:
d1 = model.infer_vector(tokenized_text_document_1)
model.docvecs.most_similar(positive=[d1], topn=1)

returns: [('doc1', 0.8511...)]

  2. Infer the vector of the same document multiple times and calculate the cosine similarity:
from sklearn.metrics.pairwise import cosine_similarity  # assuming scikit-learn's cosine_similarity
d1_inferred_1 = model.infer_vector(tokenized_text_document_1)
d1_inferred_2 = model.infer_vector(tokenized_text_document_1)
cosine_similarity([d1_inferred_1], [d1_inferred_2])

returns: 0.98
I inferred the vectors multiple times, but the variance was never bigger than 0.005.

But why is there such a big difference between the cosine similarities from model.docvecs.most_similar(positive=[d1], topn=1) and cosine_similarity([d1_inferred_1], [d1_inferred_2])?
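For reference, the ~0.85 number can also be reproduced directly, without most_similar(), by comparing a re-inferred vector against the vector stored in the model during bulk training (a hedged sketch; it assumes the document was trained under the tag 'doc1', and reuses the model and tokenized_text_document_1 from above):

import numpy as np

d1_inferred = model.infer_vector(tokenized_text_document_1)
d1_trained = model.docvecs['doc1']  # vector learned for this document during bulk training

# plain cosine similarity between the re-inferred and the trained vector;
# this should roughly match the ~0.85 that most_similar() reports
print(np.dot(d1_inferred, d1_trained)
      / (np.linalg.norm(d1_inferred) * np.linalg.norm(d1_trained)))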

piskvorky (Owner) commented
FAQ #12 explains that: training uses different parameters than inference does.
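In code terms, a minimal hedged sketch of that point (the parameter values are illustrative assumptions, not recommendations) is to pass explicit epochs/alpha to infer_vector(), so that inference runs with settings closer to those used during bulk training:

# reuses the model and tokenized document from above
d1 = model.infer_vector(
    tokenized_text_document_1,
    epochs=model.epochs,  # run as many inference passes as there were training epochs
    alpha=0.025,          # illustrative starting learning rate
)
model.docvecs.most_similar(positive=[d1], topn=1)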

gojomo (Collaborator) commented Aug 15, 2020

That your re-inferred vectors are close to each other is good, but if you think the difference from the vector in the model is "too big", maybe there were other problems with the adequacy of the data/training/parameters (about which you provide no details) during the initial bulk training.

Note, especially, that while training's N epochs occur on a model which is, for many of those epochs, less than half trained, all N inference epochs happen on a model that's already fully trained. So one thing immediately worth trying, to reduce the effect, could be more initial training epochs.
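A minimal sketch of that suggestion (the corpus variable tagged_docs and all parameter values here are assumptions for illustration, not a recommendation):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# assumed corpus: a list of TaggedDocument objects, one per training document,
# e.g. tagged_docs = [TaggedDocument(words=tokens, tags=[doc_id]), ...]
model = Doc2Vec(vector_size=100, min_count=2, epochs=40)  # more training epochs than the default
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)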

There are other tips in the project discussion mailing list archives – https://groups.google.com/forum/#!forum/gensim – or providing more details about your setup (data size/quality/type, parameters, goals, etc.) may generate more tips specific to your situation.
