Modifying train_cbow_pair #920
Comments
Note that in practice, the 'input' word vectors are in model.syn0, while word.point holds the indexes (into model.syn1) of the hierarchical-softmax output nodes along that word's Huffman path. The 'propagate hidden -> output' step activates those output nodes against l1; it never touches other words' input vectors.
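For reference, a paraphrase of the hierarchical-softmax part of the pure-Python train_cbow_pair of that era (the wrapper function name here is made up; the body follows gensim's word2vec.py from memory, so treat the exact lines as approximate):

    import numpy as np

    def cbow_pair_hs_sketch(model, word, l1, alpha):
        # Hierarchical softmax, paraphrased: word.point lists the indexes
        # (into model.syn1) of the output nodes on this word's Huffman path,
        # and word.code holds the corresponding 0/1 turns along that path.
        l2a = model.syn1[word.point]                   # shape: codelen x layer1_size
        fa = 1.0 / (1.0 + np.exp(-np.dot(l1, l2a.T)))  # propagate hidden -> output
        ga = (1.0 - word.code - fa) * alpha            # error gradient times learning rate
        model.syn1[word.point] += np.outer(ga, l1)     # learn hidden -> output (syn1, not syn0)
        return np.dot(ga, l2a)                         # neu1e: error fed back to input vectors

So fa is the output-layer activation, and the only place the input word vectors enter is through l1, which is assembled by the caller.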
@gojomo I started to notice that the Python path is dramatically slower than the Cython path. Is there a way for me to replace the compiled code with my own? The transformations I'd like to apply are actually simple scaling operations - I just want to apply weights to the input vectors dynamically.
The Cython code is not compiled Python - it's alternate code, in the Cython python-like language, written under more constraints that allow better performance (and greater multithreaded parallelism). All the Cython source is available for you to modify just like the Python code - see the files ending in .pyx. If the weights are the same for all examples, you could directly modify model.syn0 before training.
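If the weights really were global, a minimal sketch of that approach might look like this (assumes the old build_vocab/train API of that era; documents and global_weights are hypothetical - a TaggedDocument corpus and one scale factor per vocabulary row):

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec(size=100, min_count=1)
    model.build_vocab(documents)                   # documents: hypothetical TaggedDocument corpus
    model.syn0 *= global_weights[:, np.newaxis]    # hypothetical per-word scales, applied once
    model.train(documents)                         # the fast Cython path trains on scaled vectors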
@gojomo Thank you for explaining that to me - I never had anything to do with Cython so far, but it's good to know what I'm actually dealing with here :) Unfortunately, since the weights are not the same for all words, I cannot apply them to model.syn0 once up front. For now I changed the pure-Python train_document_dm to build l1 as a weighted sum:

    l1 = np.sum((word_vectors[word2_indexes].T * word_weights[idx]).T, axis=0) + \
         np.sum(doctag_vectors[doctag_indexes], axis=0)

but, as you already said, this is very slow now. The only way I can potentially see around this, without having to alter the Cython implementation, would be the word_vectors input parameter:

    def train_document_dm(model, doc_words, # ...
                          word_vectors=None,
                          # ...
                          ):
        if word_vectors is None:
            word_vectors = model.syn0
        # ...

One could simply set word_vectors to a pre-scaled copy of the vectors. The only thing I don't know here is where it gets decided whether the Python or the Cython path is going to be used.
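A sketch of that idea, calling the routine directly with a pre-scaled copy (windices and word_weights are placeholders, and model, doc, and the other arguments are assumed to be in scope as in the _do_train_job code later in this thread; as the later comments show, the optimized train_document_dm exposes the same word_vectors parameter):

    import numpy as np
    from gensim.models.doc2vec import train_document_dm

    scaled = model.syn0.copy()                       # copy, so the real vectors stay untouched
    scaled[windices] *= word_weights[:, np.newaxis]  # hypothetical per-word scale factors
    train_document_dm(model, doc.words, doctag_indexes, alpha, work, neu1,
                      word_vectors=scaled,
                      doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)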
I'm thinking about something like this:

    elif self.dm_weighted:
        windices = [i for i in range(len(doc.words)) if doc.words[i] in self.vocab and
                    self.vocab[doc.words[i]].sample_int > self.random.rand() * 2 ** 32]
        word_weights = np.asarray([doc.weights[i] for i in windices])
        # Make copy of affected vectors for later restoration
        word_vectors_copy = self.syn0[windices].copy()
        # Apply weights
        self.syn0 = (self.syn0[windices].T * word_weights).T
        tally += train_document_dm(self, doc.words, doctag_indexes, alpha, work, neu1,
                                   # Pass those word vectors directly
                                   word_vectors=self.syn0,
                                   doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
        # Restore unscaled vectors
        self.syn0[windices] = word_vectors_copy

Unfortunately it does not run - it crashes. It appears that changing self.syn0 itself is the problem.
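For what it's worth, the crash is consistent with the line self.syn0 = (self.syn0[windices].T * word_weights).T: it rebinds syn0 to an array with only len(windices) rows, so any later lookup by a full-vocabulary index fails. A minimal illustration:

    import numpy as np

    syn0 = np.random.rand(100, 50)              # 100-word vocab, 50-dimensional vectors
    windices = [3, 17, 42]
    word_weights = np.array([0.5, 2.0, 1.0])

    syn0 = (syn0[windices].T * word_weights).T  # syn0 now has shape (3, 50)
    print(syn0.shape)                           # (3, 50); syn0[99] would raise IndexError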
You could transform the relevant rows of syn0 before each training call, then restore them afterwards - that keeps the fast path usable. When the code loads, either the Cython methods are imported, or the Python alternatives are defined. So it's not decided each time - only one or the other implementation of each method is present. See https://github.com/RaRe-Technologies/gensim/blob/a84f64e7b617a3983c2b332c8383e1a30b14db5d/gensim/models/doc2vec.py#L60 I'm not sure where your example snippet is intended to appear, but a crash likely indicates your new syn0 no longer has the shape the rest of the code expects.
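A quick way to check which implementation ended up active (in gensim of this era, FAST_VERSION is set at import time and is -1 when the pure-Python fallback is in use):

    from gensim.models import doc2vec

    # -1: the slow pure-Python routines are active;
    # >= 0: the compiled Cython routines were imported successfully.
    print(doc2vec.FAST_VERSION)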
I'm aware of that fact, but thank you for pointing it out :) I was able to make it work. The issue was that I didn't use an augmented assignment statement for self.syn0[windices].
My snippet appears in _do_train_job of a Doc2Vec subclass:

    class WeightedDoc2Vec(Doc2Vec):
        # ...
        def _do_train_job(self, job, alpha, inits):
            work, neu1 = inits
            tally = 0
            for doc in job:
                indexed_doctags = self.docvecs.indexed_doctags(doc.tags)
                doctag_indexes, doctag_vectors, doctag_locks, ignored = indexed_doctags
                if self.sg:
                    # ...
                elif self.dm_concat:
                    # ...
                elif self.dm_weighted:
                    # Get word indices
                    windices = [self.vocab[doc.words[i]].index for i in range(len(doc.words))]
                    # Grab the weights
                    word_weights = np.asarray(doc.weights)
                    # Make copy of affected vectors
                    word_vectors_copy = self.syn0[windices].copy()
                    # Apply weights (important to use augmented assignment here)
                    self.syn0[windices] *= (np.ones((len(word_weights), self.syn0.shape[1])).T * word_weights).T
                    # Call the optimized 'train_document_dm' function
                    tally += train_document_dm(self, doc.words, doctag_indexes, alpha, work, neu1,
                                               word_vectors=self.syn0,
                                               doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
                    # Restore unscaled vectors
                    self.syn0[windices] = word_vectors_copy

This way the overhead shouldn't be that bad, since I "only" copy the affected vectors and restore their values after train_document_dm returns.
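As an aside (not from the thread), the weighting line can be written with plain NumPy broadcasting, avoiding the intermediate np.ones matrix; a drop-in equivalent for that line inside _do_train_job:

    # Equivalent to the np.ones(...) construction, given a 1-D word_weights:
    self.syn0[windices] *= word_weights[:, np.newaxis]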
Please excuse me for asking this question here, since it's not really an actual issue regarding gensim.

TL;DR: I'd like to know how I can get at the word vectors before they are propagated, in order to apply transformations to them while training paragraph/document vectors.

What I'm trying to do is make a modification to train_cbow_pair in gensim.models.Word2Vec. However, I struggle a bit to understand what exactly is happening there. I get that l1 is the sum of the current context window of word vectors plus the sum of the document-tag vectors, and that it is passed to train_cbow_pair.

From there on I'm not sure what I'm looking at. In particular, I struggle to understand the line where word.point is used. I don't know what word.point is describing, or why this input is getting propagated. Does this provide the word vectors for activating the hidden layer - which would appear to be fa in that case? But this can't actually be the case, since word is just the current word of the context window, if I get that right.

So what I'd like to know is how I can get at the word vectors before they are propagated, in order to apply transformations to them beforehand.
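For context, the pure-Python train_document_dm of that era builds l1 and calls train_cbow_pair roughly as below (paraphrased from memory, with the reduced-window detail omitted): word is the target word being predicted, while word2_indexes are the surrounding context words whose vectors, together with the doctag vectors, are summed into l1.

    # Approximate paraphrase of the inner loop of the pure-Python train_document_dm:
    for pos, word in enumerate(word_vocabs):       # 'word' is the target to predict
        start = max(0, pos - model.window)
        window_pos = enumerate(word_vocabs[start:(pos + model.window + 1)], start)
        word2_indexes = [word2.index for pos2, word2 in window_pos if pos2 != pos]
        # l1 sums the context word vectors and the doctag vectors
        l1 = np.sum(word_vectors[word2_indexes], axis=0) + \
             np.sum(doctag_vectors[doctag_indexes], axis=0)
        neu1e = train_cbow_pair(model, word, word2_indexes, l1, alpha, learn_vectors=False)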