
Added save method for doc2vec #1256

Merged · 9 commits · Apr 19, 2017

Conversation

parulsethi (Contributor)

This PR adds a method save_doc2vec_format to save document vectors in a format similar to the one save_word2vec_format uses for word vectors.

Another PR (#699) was opened earlier for this issue. To address @gojomo's comment there about

“mixing word & doc vectors into the same flattened file on save”

I've added a flag parameter that indicates whether to save word vectors along with doc vectors. For now, I append word vectors after doc vectors; let me know if some other ordering is more favorable.

And if someone would like to keep doc and word vectors disentangled, as pointed out by @gojomo, they can now save them separately to different files using save_doc2vec_format (with the flag word_vec=False) and then save_word2vec_format.

@tmylk @gojomo If this seems OK, I'll add the corresponding load method.
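For reference, a minimal usage sketch of the API as proposed in this draft; the corpus and file names are illustrative assumptions, and the method is renamed later in this thread:

# A minimal usage sketch of the draft API described above; corpus and
# file names are illustrative, and the method is renamed later on.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(['human', 'interface', 'computer'], ['doc_0']),
        TaggedDocument(['survey', 'user', 'computer', 'system'], ['doc_1'])]
model = Doc2Vec(docs, min_count=1)

# doc-vectors and word-vectors in separate files, kept disentangled
model.save_doc2vec_format('doc_vectors.txt', word_vec=False)
model.save_word2vec_format('word_vectors.txt')

# or both flattened into one file, doc-vectors first in this draft
model.save_doc2vec_format('all_vectors.txt', word_vec=True)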

fout.write(utils.to_utf8("%s %s\n" % (total_vec, self.vector_size)))
# store as in input order
for i in range(len(self.docvecs)):
    doctag = self.docvecs.index_to_doctag(i)
parulsethi (Author):

This will work in case the user's model.docvecs.doctags is empty; the vector index will then be assigned as the doctag.

gojomo (Collaborator) commented Apr 1, 2017

Cool! Some thoughts:

  • since the format written is still the "word2vec_format" used by word2vec.c, the method name should still reflect that
  • potentially this could even be a save_word2vec_format() method on the DocvecsArray, rather than on the Doc2Vec model itself – though that leaves open the question of how to get both word-and-doc vectors in the same file
  • if I understand correctly, if strings are repeated between the word vocabulary and doctags, the resulting file will have repeating keys? A convention for avoiding those collisions, such as prefixing all doctags in a unique way (perhaps with a default that the user could override) would be helpful.
  • a new load_… may not be necessary; there's not enough info in the format to rebuild a trainable model. But, since the format is compatible with KeyedVectors.load_word2vec_format(), anything written here can be reloaded, for read-only access, using that (see the sketch after this list).
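A sketch of that read-only reload path, assuming doctags were saved with an illustrative dt_ prefix into a file like the one above:

# Read-only reload sketch: anything written in word2vec_format can be
# loaded back with KeyedVectors; the file name and the 'dt_' doctag
# prefix are assumptions for illustration.
from gensim.models.keyedvectors import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('all_vectors.txt', binary=False)
word_vector = vectors['computer']  # ordinary word key
doc_vector = vectors['dt_doc_0']   # doctag key, kept distinct by its prefix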

parulsethi (Author)

@gojomo I made changes according to the 1st and 3rd points, and agree on the 4th point.
And yes, the Word2Vec instance is used to directly access vocab and syn0 when saving word vectors in Doc2Vec's save_word2vec_format(); I'm doubtful about how that could be done in DocvecsArray.

tmylk (Contributor) commented Apr 3, 2017

Could you please add an ipynb with images showing how to visualise doc2vec in Tensorboard?

parulsethi (Author)

@tmylk Sure, I'll prepare a notebook tutorial for that
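As a rough sketch of what such a notebook might do (not the final tutorial): the TensorBoard Embedding Projector loads two tab-separated files, one with the vectors and one with per-row metadata. File names here are illustrative, and model is assumed to be a trained Doc2Vec instance.

# Export doc-vectors to the two TSV files the TensorBoard Embedding
# Projector expects: one row of tab-separated floats per doc-vector, plus
# a metadata file whose i-th line labels the i-th vector.
with open('doc_tensor.tsv', 'w') as tensors, open('doc_metadata.tsv', 'w') as metadata:
    for i in range(len(model.docvecs)):
        doctag = model.docvecs.index_to_doctag(i)
        metadata.write('%s\n' % doctag)
        tensors.write('%s\n' % '\t'.join('%f' % val for val in model.docvecs.doctag_syn0[i]))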

gojomo (Collaborator) left a comment

Suggestions on naming, reuse.

@@ -808,6 +809,50 @@ def delete_temporary_training_data(self, keep_doctags_vectors=True, keep_inferen
        if self.docvecs and hasattr(self.docvecs, 'doctag_syn0_lockf'):
            del self.docvecs.doctag_syn0_lockf

    def save_word2vec_format(self, fname, doc_vec=True, word_vec=False, prefix='dt_', fvocab=None, binary=False):
gojomo (Collaborator):

To be consistent with behavior before this change, the default should be to just write word-vectors (as when the implementation was simply inherited from Word2Vec). Let the new capability require explicit activation via a new parameter.

gojomo (Collaborator):

Thinking more about names:

  • to acknowledge the doc-vectors aren't necessarily one-per-doc, but one-per-doctag, and to follow the convention elsewhere, let's enable with doctag_vec=True, rather than doc_vec=True.
  • let's make the default prefix even weirder and less at risk of collision with any real tokens. In Mikolov's example sentence-vectors scripts, he prefixed those vector-keys with *_. So let's use *dt_ as the default prefix. (A sketch of the resulting file layout follows this list.)
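As a sketch, with both doctag- and word-vectors enabled and the *dt_ default prefix, the saved file's keys would look roughly like this (counts and values illustrative; the ordering of the two sections is settled later in this thread):

15 100                        (header: total vector count, vector size)
computer 0.0123 -0.4567 ...
system 0.2345 0.0789 ...
*dt_doc_0 0.9876 -0.0123 ...
*dt_doc_1 0.5432 0.1111 ...

The *dt_ prefix keeps a doctag key distinct from a word key even when the same string appears in both vocabularies.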

parulsethi (Author):

Done

# save document vectors
if doc_vec:
    logger.info("storing %sx%s projection weights into %s" % (total_vec, self.vector_size, fname))
    with utils.smart_open(fname, 'wb') as fout:
gojomo (Collaborator):

If all writing is done in this method, it seems it should be possible, and would be cleaner, to only open the file once, rather than once and then a second time in append mode. BUT, see the later comment for a way this method might offload some of the writing, in which case it would deliberately choose to re-open for append.

`doc_vec` is an optional boolean indicating whether to store document vectors
`word_vec` is an optional boolean indicating whether to store word vectors
(if both doc_vec and word_vec are True, then both vectors are stored in the same file)
`prefix` to uniquely indentify doctags from word vocab, and avoid collision
gojomo (Collaborator) commented Apr 4, 2017:

Typo: 'indentify'

parulsethi (Author):

Done

(if both doc_vec and word_vec are True, then both vectors are stored in the same file)
`prefix` to uniquely indentify doctags from word vocab, and avoid collision
in case of repeated string in doctag and word vocab
`fvocab` is an optional file used to save the vocabulary
gojomo (Collaborator):

The potential to save the vocabulary, with particular index-positions that correspond to the word-vectors only, makes me think that when both word+doc vectors are stored, the word-vectors should go first. Then, at least, any vocab written aligns one-for-one with the word-vectors portion of the save file. (Also: does fvocab currently do anything in the save-both case?)

parulsethi (Author):

Done, word-vectors go first now.

“(Also: does fvocab currently do anything in the save-both case?)”

It didn't, earlier. But now that only KeyedVectors.save_word2vec_format is used for both save-only-word-vectors and save-both, the vocab is saved in both cases.

    else:
        fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join("%f" % val for val in row))))
else:
    KeyedVectors.save_word2vec_format(self.wv, fname, fvocab=fvocab, binary=binary)
gojomo (Collaborator):

A thought, in combination with my other comments: perhaps code duplication can be reduced by, if word-vecs are enabled, just calling KeyedVectors on the word-vectors first, then appending doc-vecs if necessary. This would reuse the word-writing (and vocab-writing) code from KeyedVectors, but then re-open the file for append to add doc-vectors (if enabled). To make sure the front-of-file vector-count was correct, KeyedVectors.save_word2vec_format() would need a new parameter telling it to boost the count by some factor the caller knows.
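A condensed sketch of that reuse, under the assumption that the count-boosting parameter is named total_vec (other helpers as in the surrounding diff):

# Sketch of the reuse suggested above: let KeyedVectors write the header
# (with a boosted vector count) and the word-vectors, then re-open the
# file in append mode and add the prefixed doctag-vectors.
total_vec = len(self.wv.vocab) + len(self.docvecs)
KeyedVectors.save_word2vec_format(self.wv, fname, fvocab=fvocab,
                                  binary=binary, total_vec=total_vec)
with utils.smart_open(fname, 'ab') as fout:
    for i in range(len(self.docvecs)):
        doctag = prefix + str(self.docvecs.index_to_doctag(i))
        row = self.docvecs.doctag_syn0[i]
        if binary:
            fout.write(utils.to_utf8(doctag) + b" " + row.tostring())
        else:
            fout.write(utils.to_utf8("%s %s\n" % (doctag, ' '.join("%f" % val for val in row))))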

parulsethi (Author):

Done

gojomo (Collaborator) commented Apr 6, 2017

Looks good!

Maybe just needs a note in CHANGELOG.md about new doctag_vec, word_vec options on Doc2Vec.save_word2vec_format().

parulsethi (Author)

Sure, and a test too.

parulsethi changed the title from “[WIP] Added save method for doc2vec” to “Added save method for doc2vec” on Apr 6, 2017
tmylk (Contributor) left a comment

Please expand test coverage

model = doc2vec.Doc2Vec(DocsLeeCorpus(), min_count=1)
model.save_word2vec_format(testfile(), doctag_vec=True, binary=True)
binary_model_dv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.wv.vocab) + len(model.docvecs), len(binary_model_dv.vocab))
tmylk (Contributor):

Please add tests for more combinations of word_vec/doctag_vec True/False
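A sketch of those extra combinations, reusing the DocsLeeCorpus and testfile() helpers from the snippet above:

# More word_vec/doctag_vec combinations, as requested.
model = doc2vec.Doc2Vec(DocsLeeCorpus(), min_count=1)

# word-vectors only (matches the pre-change default behavior)
model.save_word2vec_format(testfile(), doctag_vec=False, word_vec=True, binary=True)
kv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.wv.vocab), len(kv.vocab))

# doctag-vectors only
model.save_word2vec_format(testfile(), doctag_vec=True, word_vec=False, binary=True)
kv = keyedvectors.KeyedVectors.load_word2vec_format(testfile(), binary=True)
self.assertEqual(len(model.docvecs), len(kv.vocab))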
