Doc2Vec getting back the same vector from infer_vector #374

Closed · AbdealiLoKo opened this issue Jun 30, 2015 · 8 comments
@AbdealiLoKo

Hi,

I tried the newly pushed doc2vec (#356) and used the training data to test the model itself, to check how good it was. I don't seem to be getting good results when using infer_vector().

If I have 100,000 paragraphs of text and run Doc2Vec's train() on them, I can see that vectors are created internally and can be accessed with docmodel.docvecs[tag]. Now, when I run infer_vector() on one of the TaggedDocuments I trained with, what values of alpha, min_alpha and steps do I need to give to get back the same vector as docmodel.docvecs[tag]?

Is it possible to get back the same vector?

When I use the same alpha and min_alpha, and the value of iter as steps, I get a completely different vector, and their cosine distance is around 0.2.

I get good results when I find similar terms using docmodel.docvecs[tag], and pretty bad results with infer_vector().
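
Roughly, the comparison being described looks like this (a sketch only; docmodel, tag and words – that document's token list – are assumed, and the parameter values are just placeholders):

import numpy as np

bulk_vec = docmodel.docvecs[tag]
inferred_vec = docmodel.infer_vector(words, alpha=0.025, min_alpha=0.0001,
                                     steps=docmodel.iter)

# cosine distance between the bulk-trained and the inferred vector
cos_dist = 1.0 - np.dot(bulk_vec, inferred_vec) / (
    np.linalg.norm(bulk_vec) * np.linalg.norm(inferred_vec))
print(cos_dist)  # the value reported as ~0.2 above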

@gojomo
Collaborator

gojomo commented Jun 30, 2015

The infer_vector() process won't find the same vector.

Even a vector from the bulk training is the product of a random process, which on subsequent runs can be quite different based on slight differences in setup. A different seed, or ordering of the training examples, or even just random scheduling jitter of a multi-thread process will all result in different end model/vector states. Each end-state will be about equally 'good' at the training goal – corpus word prediction – and thus should be roughly equally good at other outside tasks. But equally 'good' end states might have the vectors in very different coordinates.

The inference tries to fit a later example into a frozen model, and so if you re-present the same document, it should wind up 'close' to the vector that the same document induced in multi-pass bulk training. But how 'close' would depend on a lot of things. The information in the PV paper about parameter choices is limited.

Alpha/iteration choices similar to the original training seem reasonable, but on the IMDB dataset I've seen inferred vectors for doc X become closer to the bulk-trained vector for X than to any other bulk-trained vector, with just a few steps at a larger alpha. (Perhaps a coarse approach is OK given the far fewer free parameters?)

If inference isn't coming close on your document set, maybe something else is wrong. Note that infer_vector takes a list of tokens (words) – not the same TaggedDocument instances as train().
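
For example (a rough sketch – model is assumed to be an already-trained Doc2Vec, and the alpha/steps values are only illustrative):

tokens = ['this', 'is', 'a', 'test', 'document']
vec = model.infer_vector(tokens)                          # correct: a plain list of words
# vec = model.infer_vector(TaggedDocument(tokens, [0]))   # wrong: not a token list
vec = model.infer_vector(tokens, alpha=0.1, steps=5)      # a few steps at a larger alpha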

I also have the vague hunch that if the model is very large compared to the dataset size – e.g. thousands of dimensions, or the very large input/hidden layers of DM/concat mode – maybe there are many coordinate regions equally good at the train/inference prediction task, so any one inference doesn't necessarily match the earlier result. But I'm not sure of this.

You can see examples of inference more-or-less working to match bulk-trained vectors in the demo IPython notebook (in /docs/notebooks) or the /gensim/tests/test_doc2vec.py unit tests.

@AbdealiLoKo
Author

Hi @gojomo , thanks for the reply :)

So, first off – I did test on words, not the TaggedDocument class. And I am creating 300-element vectors from 80,000 documents. Is there any rule of thumb to make sure I do not create too many dimensions?

So, as far as I see it, either:

  • my dataset is weird and is incapable of being used with doc2vec (for some reason), or
  • the parameters I am using with the dataset are invalid, or
  • my preprocessing is bad.

Is there any logic behind what sort of datasets will work well with doc2vec? Or is the only way of knowing by running it?
Is there any method of estimating a good alpha and so on, other than trial and error?

I tried doing the following: I took 1 sentence from my corpus and trained doc2vec on it about 50,000 times. When I gave the same sentence to infer_vector, it still gave me a completely different vector. Is this a good test to check that the vectors should be nearly the same, or could the randomness change them drastically?

@gojomo
Collaborator

gojomo commented Jul 2, 2015

Nothing seems out-of-the-ordinary with your document count or chosen dimensionality. Only you would know if your text/vocabulary is so unique it could confuse this technique. Most of this is trial and error, starting with data preparation or parameters similar to published results, then trying and evaluating variants.

To be clear, your bulk training should use TaggedDocument instances, with 'words' a list of strings (the content) and 'tags' a list of non-word identifiers for the document (usually just a list with one item, a unique int ID). When that's done – after at least a few full passes, but often 10-20 – you can present lists of words to infer_vector() and get back vectors that should 'fit in' with the rest. If the vector you're getting back is nothing at all like the vector for those same words from the bulk batch, there's probably something wrong with one of your steps.
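
A minimal sketch of that setup (raw_texts is assumed to be your list of 80K sentence strings; parameter names follow the gensim API of this era, e.g. size/iter rather than the later vector_size/epochs):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# each document: words = a list of token strings, tags = [unique int ID]
corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(raw_texts)]

# bulk training; the constructor runs `iter` passes over the corpus
model = Doc2Vec(corpus, size=300, min_count=2, iter=20, workers=4)

# afterwards, present a plain list of tokens to infer_vector()
new_vec = model.infer_vector(corpus[0].words)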

How are you prepping your 80K sentences? What training mode and setup parameters are you using?

When you finish initial training and pick one of the 80K sentences, and request something like model.docvecs.most_similar(sentence_ID) – does the list of returned IDs make sense, as being topically similar?
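
Something like this (a sketch, reusing sentence_ID and the corpus/model names from the snippet above):

# topical sanity check among the bulk-trained vectors
print(model.docvecs.most_similar(sentence_ID)[:5])        # [(other_id, similarity), ...]

# inference check: re-inferring the same words should rank the document's
# own bulk-trained vector at or near the top
inferred = model.infer_vector(corpus[sentence_ID].words)
print(model.docvecs.most_similar([inferred], topn=3))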

You'd not want to train excessively on a single sentence, as the overall point is to learn a general model that can represent a range of texts. The sanity checks around inferred-vectors in the test_doc2vec.py or IPynb notebook are better models.

@guy4261

guy4261 commented Nov 22, 2015

Hi,

Subsequent calls to infer_vector with the same doc return different vectors:

features1 = model.infer_vector(doc)
features2 = model.infer_vector(doc)
assert (features1 == features2).all()  # this will fail!

How can I set my model to return the same results? A known seed and a single thread?

@gojomo
Collaborator

gojomo commented Nov 22, 2015

Some steps that might achieve that result are discussed in issue #447.
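
For reference, the kind of setup guessed at above looks roughly like this (a hedged sketch, not a guaranteed recipe – whether it is sufficient depends on the gensim version; corpus is assumed to be the TaggedDocument list used for training):

# fixed seed and a single worker thread for the bulk training
model = Doc2Vec(corpus, size=100, seed=42, workers=1, iter=20)

model.random.seed(42)                 # reset the model's internal RNG before inferring
features1 = model.infer_vector(doc)
model.random.seed(42)
features2 = model.infer_vector(doc)
assert (features1 == features2).all()
# on Python 3, a fixed PYTHONHASHSEED for the interpreter may also be needed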

@piskvorky
Owner

@gojomo @tmylk if you see a duplicate issue, feel free to close it as "duplicate" and point to the authoritative ticket in a comment. Just so we don't get lost.

@tmylk can you go through open tickets and issues and close those that are not relevant any more (already fixed / irrelevant)?

@menshikh-iv
Contributor

dup #447

@emankhadom

I have a problem with inferred_vector:
inferred_vector = model.infer_vector(test_corpus[doc_id], steps=20, alpha=0.025)
word_locks = model.syn0_lockf
AttributeError: 'Doc2Vec' object has no attribute 'syn0_lockf'
