
Add distributed document representations #204

Closed
dhammack opened this issue May 26, 2014 · 10 comments

@dhammack

commented May 26, 2014

The paper here (http://arxiv.org/abs/1405.4053) shows how to learn distributed document representations similar to word2vec. Several state-of-the-art results are demonstrated. It would be a great addition to gensim, if feasible.

@piskvorky piskvorky added the wishlist label May 26, 2014

@piskvorky piskvorky changed the title from "[wishlist] Add distributed document representations" to "Add distributed document representations" May 26, 2014

@kemaswill


commented Jun 8, 2014

Is anyone interested in working on this?

@seanlindsey


commented Aug 12, 2014

In the following fork from ccri, under gensim.models, there are "doc2vec" implementations that seem to label the document along the lines of the paragraph-vector approach from the paper. I'm playing with it myself at the moment: it runs on small sets, and I'm now crossing my fingers and running it on a larger set. Instead of plain sentences it requires an iterator of "LabeledText" objects, each of which is a sentence plus document ID(s), so I imagine it's always looking at the label during the learning phase.

https://github.com/ccri/gensim

For an iterator, I made a CSV of tokenized documents with a corresponding ID for each; there is also another LabeledText iterator called DocSet in the code.
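
For illustration, a couple of rows of that CSV could be produced like this (the file name and document IDs are made up; column 0 is the token list as a Python literal, column 1 the label):

import csv
with open('labeled_docs.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow([str(['the', 'quick', 'brown', 'fox']), 'doc_001'])
    w.writerow([str(['jumps', 'over', 'the', 'lazy', 'dog']), 'doc_002'])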

import ast, csv
from gensim.models.doc2vec import Doc2Vec, LabeledText  # assuming this is where the ccri fork keeps them

class LTIterator(object):
    def __init__(self, fname):
        self.fname = fname
    def __iter__(self):
        rcsv = csv.reader(open(self.fname))
        for lt_row in rcsv:
            # column 0 holds the token list as a Python literal, column 1 the document ID
            yield LabeledText(ast.literal_eval(lt_row[0]), [lt_row[1]])

it = LTIterator(some_filename)
model = Doc2Vec(it, size=400, window=10, min_count=5, workers=11, sg=0)  # sg=1 should be fine
your_docs_paragraph_vec = model[your_docs_label]  # I imagine

I would love to hear any comments, input, or elaboration.

As for Cython/threading, I think that's what it's going for, but I'm not sure.

Haven't tested it, but it seems neat.

@piskvorky

Member

commented Aug 13, 2014

Well, for comments, it would probably be best to CC its author: @temerick.

@temerick

Contributor

commented Aug 13, 2014

Yes, the doc2vec implementation is intended to model the algorithm from the Distributed Representations of Sentences and Documents paper. @seanlindsey has the right idea about how to use it. All you need is an iterator of LabeledText elements, which are made up of two lists: 1) the text of the sentence, as in the current gensim word2vec implementation and 2) a list of labels for the text. The goal is for you to be able to add as many or as few labels as you want, although I've mostly experimented with a single label, as in the paper. (The idea being, hopefully, to enable labels at multiple levels of granularity: ['doc1', 'para2', 'sent4'], for example.)
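
For example, a single training item with labels at several levels might look roughly like this (the import path and the label names are just illustrative):

from gensim.models.doc2vec import LabeledText  # assumed location in the ccri fork

lt = LabeledText(['the', 'cat', 'sat', 'on', 'the', 'mat'],  # the tokenized text
                 ['doc1', 'para2', 'sent4'])                 # doc/paragraph/sentence labels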

One of the known differences between this code and the paper is that, in the paper, they pad short blocks of text with a null character to make them long enough, whereas here we throw them out of the vocabulary if their length is below the min_count threshold. This design choice makes sense for our context, but may not make sense for yours (if you're looking at twitter data, for instance).

Since it's largely copied and modified code from @piskvorky's excellent word2vec implementation, threading should go through without a hitch. The cython code should also work, if you want your jobs to get done before the heat death of the universe.

Depending on what you define as a large amount of data, this code may scale reasonably well to what you're looking for. I've successfully run this over a collection of over 2 million paragraphs in less than 10 mins. However, I tried to run it on 20x that much data and my box ran out of RAM since it needed to create a new vector for each paragraph.
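
As a rough back-of-the-envelope check on that (assuming 400-dimensional float32 vectors; the dimensionality I actually used may differ):

n_paragraphs = 40 * 1000 * 1000  # roughly 20x 2 million
dims, bytes_per_float = 400, 4
print(n_paragraphs * dims * bytes_per_float / 1e9)  # ~64 GB for the paragraph vectors alone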

Here's a picture I threw together from paragraphs extracted from some popular Project Gutenberg documents. It shows the distributions of paragraphs for each of the documents after running a couple of epochs of training using CBOW (called PV-DM in the paper): [image: sample3x3]

I hope that helps clear things up a little bit. Feel free to let me know if you have any comments or questions, or if you find any bugs.

@piskvorky

Member

commented Aug 14, 2014

@temerick thanks for the info! What are your plans with this implementation -- did you consider a pull request?

(cc @gojomo )

@seanlindsey


commented Aug 14, 2014

@temerick
A couple of questions.
Would using the paper's PV-DM be achieved by initializing Doc2Vec with sg=0, and would setting sg=1 use the PV-DBOW method?
Do you suppose the paper uses hierarchical softmax as opposed to negative sampling?

To get the paper's sentence padding, what I'm trying (haven't tested it) is prepending a bunch of null characters to sentence.text while building the vocab, then resetting the sentence length so the vocab filter keeps the label later on; that way the vocab also won't lose out on having these null characters. Then I repeat the process in the prepare_sentences function defined in Doc2Vec's train function, so we don't lose the null characters in the learning phase. Sound right?
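
Roughly, something like the sketch below (untested; NULL_WORD and min_length are placeholders, and sentence.text is the token list of a LabeledText):

NULL_WORD = '\0'  # placeholder padding token

def pad_text(sentence, min_length):
    # prepend null tokens so short texts still reach the required length
    shortfall = min_length - len(sentence.text)
    if shortfall > 0:
        sentence.text = [NULL_WORD] * shortfall + sentence.text
    return sentence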

@temerick

Contributor

commented Aug 15, 2014

@piskvorky I'd be happy to make a PR once I get the code cleaned up a little bit. At the moment there is too much duplicated code between my doc2vec file and your word2vec file, which is no good.

@seanlindsey

  1. Your assumption about the sg=0/1 toggle is correct (see the short sketch after this list). It gets a bit confusing, since the term "bag of words" is used to refer to "many in, one out" in the original word2vec paper, but refers to "one in, many out" in the document/paragraph paper.
  2. My understanding is that the paper just uses the hierarchical softmax algorithm.
  3. Your implementation with padding also sounds correct; I'd be interested in hearing about differences in vector quality with vs. without padding.
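
For instance, reusing the constructor call from earlier in the thread (parameter values are placeholders):

model_dm = Doc2Vec(it, size=400, window=10, min_count=5, sg=0)    # PV-DM: context + label in, one word out
model_dbow = Doc2Vec(it, size=400, window=10, min_count=5, sg=1)  # PV-DBOW: label in, context words out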

@piskvorky piskvorky removed the wishlist label Sep 5, 2014

@piskvorky

Member

commented Sep 5, 2014

@temerick any progress on the PR? CC @gojomo

@temerick

Contributor

commented Sep 5, 2014

@piskvorky Yes, I think I should be able to make it by the middle of next week at the latest.

@piskvorky

Member

commented Sep 9, 2014

Brilliant, thanks. Closing this, to be continued in #231.

@piskvorky piskvorky closed this Sep 9, 2014
