Add distributed document representations #204
In the following fork from ccri, under gensim.models, there is a "doc2vec" implementation that seems to attempt to label documents along the lines of the paragraph vector approach from the paper. I'm playing with it myself at the moment; it runs on small sets, and now I'm crossing my fingers and running it on a larger set. Instead of sentences it requires an iterator of "LabeledText" objects, each a sentence plus document ID(s), so I imagine it's always looking at the label during the learning phase.
For an iterator, I made a csv of tokenized documents with a corresponding ID for each; there is also another LabeledText iterator called DocSet in the code.
```python
import ast, csv
it = LTIterator(some_filename)  # yields one LabeledText per csv row
```
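For reference, here's roughly what my LTIterator looks like (a minimal, untested sketch: the two-column csv layout, the ast-parsed token column, and the LabeledText import path and argument order are just my setup, so adjust to taste):

```python
import ast, csv

from gensim.models.doc2vec import LabeledText  # location as in the ccri fork

class LTIterator(object):
    """Stream LabeledText items from a csv of (doc_id, tokens) rows."""
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename) as f:
            for doc_id, tokens in csv.reader(f):
                # the tokens column stores a Python list literal,
                # e.g. "['some', 'tokenized', 'document']"
                yield LabeledText(ast.literal_eval(tokens), [doc_id])
```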
I would love to hear any comments, input, or elaboration.
As for Cython/threading, I think that's what it's going for, but I'm not sure.
Haven't tested it, but it seems neat.
Yes, the doc2vec implementation is intended to model the algorithm from the Distributed Representations of Sentences and Documents paper. @seanlindsey has the right idea about how to use it. All you need is an iterator of LabeledText elements, which are made up of two lists: 1) the text of the sentence, as in the current gensim word2vec implementation and 2) a list of labels for the text. The goal is for you to be able to add as many or as few labels as you want, although I've mostly experimented with a single label, as in the paper. (The idea being, hopefully, to enable labels at multiple levels of granularity: ['doc1', 'para2', 'sent4'], for example.)
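Concretely, a single element might look like this (just a sketch; the import path and argument order are assumed from the fork):

```python
from gensim.models.doc2vec import LabeledText

# one text, with labels at several levels of granularity
lt = LabeledText(['distributed', 'representations', 'of', 'sentences'],
                 ['doc1', 'para2', 'sent4'])
```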
One of the known differences between this code and the paper is that, in the paper, they pad short blocks of text with a null character to make them long enough, whereas here we throw them out of the vocabulary if their length is below the min_count threshold. This design choice makes sense for our context, but may not make sense for yours (if you're looking at Twitter data, for instance).
Since it's largely copied and modified from @piskvorky's excellent word2vec implementation, threading should go through without a hitch. The Cython code should also work, if you want your jobs to finish before the heat death of the universe.
Depending on what you define as a large amount of data, this code may scale reasonably well to what you're looking for. I've successfully run it on a collection of more than 2 million paragraphs in under 10 minutes. However, when I tried to run it on 20x that much data, my box ran out of RAM, since the model needs to allocate a new vector for each paragraph.
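Back-of-envelope, assuming the default 100-dimensional float32 vectors (my guess at the configuration):

```python
# rough RAM needed for the paragraph vectors alone
dims, bytes_per_float = 100, 4
two_million = 2 * 10**6 * dims * bytes_per_float  # ~0.8 GB
twenty_x = 20 * two_million                       # ~16 GB, easily exhausting RAM
```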
Here's a picture I threw together from paragraphs extracted from some popular Project Gutenberg documents. It shows the distributions of paragraphs for each of the documents after running a couple of epochs of training using CBOW (called PV-DM in the paper):
I hope that helps clear things up a little bit. Feel free to let me know if you have any comments or questions, or if you find any bugs.
To get the paper's sentence padding, what I'm trying is prepending a bunch of null characters to sentence.text while building the vocab (haven't tested it), then resetting the sentence length so the vocab filter keeps the label later on. That way the vocab also won't lose out on these null characters. Then I repeat the process in the prepare_sentences function defined in Doc2Vec's train function, so we don't lose the null characters in the learning phase. Sound right?
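In code, roughly this (an untested sketch; NULL_WORD and the exact hook points are my own guesses):

```python
NULL_WORD = '\0'

def pad_text(sentence, min_count):
    # prepend null tokens so a short text (and hence its label) survives
    # the min_count vocab filter; I'd call this while building the vocab
    # and again in prepare_sentences inside Doc2Vec.train
    deficit = min_count - len(sentence.text)
    if deficit > 0:
        sentence.text[:0] = [NULL_WORD] * deficit
    return sentence
```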
@piskvorky I'd be happy to make a PR once I get the code cleaned up a little bit. At the moment there is too much duplicated code between my doc2vec file and your word2vec file, which is no good.