New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

online word2vec #435

Closed
wants to merge 15 commits into
base: develop
from

Conversation

Projects
None yet
10 participants
@rutum
Copy link

rutum commented Aug 17, 2015

Usage:

model = Word2Vec() # sg and hs are the default parameters
model.build_vocab(sentences)
model.train(sentences)
model.save("base_model")

model.build_vocab(new_sentences, update=True)
model.train(new_sentences)
model.save("updated_model")

Then you can compare the 2 models to see whether the new vocabulary is learning the way it is supposed to.

I tried an experiment with learning a model without "queen", and adding it in the subsequent set of sentences. The updated model learned "queen" as being similar to "king", "duke" etc. So that was a huge success. I would love to hear of any other ideas you might have to test this.

Accompanying blogpost: http://rutumulkar.com/blog/2015/word2vec/

@piskvorky piskvorky changed the title Rm online online word2vec Aug 19, 2015

@piskvorky piskvorky referenced this pull request Aug 19, 2015

Closed

Online Word2Vec #365

@rutum

This comment has been minimized.

Copy link

rutum commented Aug 21, 2015

Here is the accompanying blog post about how to use Online Word2Vec:
http://rutumulkar.com/blog/2015/word2vec/

@robhawkins

This comment has been minimized.

Copy link

robhawkins commented Aug 24, 2015

I see you consolidated update_vocab into build_vocab with another parameter. I think that makes sense to keep code concise but I liked your original description. For people new to gensim like myself, I think it'd be nice to have an update_vocab that wraps build_vocab(update=true). Great work!

@rutum rutum force-pushed the rutum:rm_online branch from baeac80 to a64fc7e Aug 24, 2015

@rutum

This comment has been minimized.

Copy link

rutum commented Aug 24, 2015

Any ideas why this branch is failing on Python 3.3 only?

@rutum

This comment has been minimized.

Copy link

rutum commented Aug 24, 2015

@robhawkins Sure! But lets wait for the PR process to see what others recommend

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Aug 24, 2015

The fail is unrelated; @gojomo is working on fixing that.

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Aug 24, 2015

Btw why six commit for resolving conflicts? That's scary. What's going on there?

Can you at least squash all these commits into one? We don't want such verbose git history.

@@ -1444,7 +1510,6 @@ def __init__(self, init_fn, job_fn):
def put(self, job):
self.job_fn(job, self.inits)


This comment has been minimized.

@piskvorky

piskvorky Aug 24, 2015

Member

There should be two empty lines between classes; why did you delete it?

@@ -1409,6 +1474,7 @@ def save(self, *args, **kwargs):

save.__doc__ = utils.SaveLoad.save.__doc__


This comment has been minimized.

@piskvorky

piskvorky Aug 24, 2015

Member

Only one newline between methods.

drop_unique += 1
drop_total += v
original_total += v
elif update:

This comment has been minimized.

@piskvorky

piskvorky Aug 24, 2015

Member

The condition is superfluous: if not update above evaluates to false, then update here is true for sure.

@@ -1 +0,0 @@
make_wikicorpus.py

This comment has been minimized.

@piskvorky

piskvorky Aug 24, 2015

Member

This is not good: you replaced all the symlinks by regular files. Can you put the symlinks back? This PR shouldn't touch the make_wiki functionality at all IMO.

This comment has been minimized.

@rutum

rutum Aug 24, 2015

Hmm! This is very strange. I have touched 3 files only, but it shows that I have updated 13 files. Possible merge issue?

This comment has been minimized.

@piskvorky

piskvorky Aug 24, 2015

Member

What is your merge process? I don't understand how all these whitespace, newline and unrelated file change issues can happen.

This comment has been minimized.

@rutum

rutum Aug 24, 2015

I merge all conflicts manually, using Sublime, and then commit my changes. I have not come across any wiki related merge conflicts, so it happened automatically? The other space related ones are probably me, because of the manual merge conflict removal, but they seem harmless.

This comment has been minimized.

@piskvorky

piskvorky Aug 25, 2015

Member

When there's a conflict, git clearly shows you which parts are in conflict: it puts the pieces of code in a file between ===== and >>>>> and <<<<<< markers.

I'm not sure what happened here or why you replaced the symlinks, but we definitely want to keep the original symlinks (and not full files, as in this PR).

This is also connected to debugging -- the fewer the changes, the smaller the surface where the bug may have been introduced :) The segfault will be easier to debug with a smaller PR. There should be no changes besides what is necessary for the PR.

This comment has been minimized.

@rutum

rutum Aug 25, 2015

Agreed. However, I don't know how this changed happened with the sym links. I have not touched these files, and I am happy to reverse this change, if I knew how to.

This comment has been minimized.

@piskvorky

piskvorky Aug 25, 2015

Member

OK, let's do that after fixing the segfault. Can you reproduce & debug the segfault? Compared to that, this symlink is a trivial fix.

This comment has been minimized.

@zachmayer

zachmayer Feb 22, 2016

See my update to this PR: #615

@rutum Something weird happened when you merged the dev branch into your branch, and you ended up replacing all the symlinks with their actual files.

@rutum rutum force-pushed the rutum:rm_online branch from a64fc7e to 68aa0d4 Aug 24, 2015

rutum and others added some commits Aug 17, 2015

update summarization tutorial image
resolving merge conflicts

resolving merge conflicts

resolving merge conflicts

resolving merge conflicts

resolving merge conflicts

resolving merge conflicts

@rutum rutum force-pushed the rutum:rm_online branch from 68aa0d4 to 8df818b Aug 24, 2015

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Aug 24, 2015

The failing Travis test is different this time: it looks like the online learning caused a segfault on Python3.4. That is a serious problem.

@rutum

This comment has been minimized.

Copy link

rutum commented Aug 24, 2015

Yeah! Looks like Python 3.4 is doing something differently. Any ideas?

@rutum

This comment has been minimized.

Copy link

rutum commented Aug 24, 2015

All is well on Travis in the online Word2Vec land! Review away @piskvorky, @gojomo and others

@piskvorky

This comment has been minimized.

Copy link

piskvorky commented on gensim/models/word2vec.py in e886d90 Aug 24, 2015

Was this reason for the segfault? Can you explain a bit more? Why did it fail on py3.4 and not the others?

This comment has been minimized.

Copy link

gojomo replied Aug 25, 2015

Coming in late, but if the segfault was in one of the 'dmc' tests (DM with concatenative input context), then I could understand this being implicated. That's currently the only mode where layer1_size is different that vector_size. Segfaults can easily be intermittent, between runs or configurations, based on the luck-of-the-mallocs – sometimes the same wrong-writes land somewhere a write is still allowed, and cause subtle corruption rather than a hard failure. (For any recurrent hard-to-diagnose similar problem, it may help to turn cython bounds-checking back on temporarily, and try to reproduce, to better identify the offending write.)

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Aug 24, 2015

I re-ran Travis and it still segfaults during the online training test. It looks like there is some subtle bug that causes a non-deterministic segfault.

rutum added some commits Aug 26, 2015

@rutum

This comment has been minimized.

Copy link

rutum commented Aug 26, 2015

I got some time to debug this, and it is a non-deterministic error. The syn[0] array ends up with nan values when training after new vocabulary is added.

word2vec.py:174: RuntimeWarning: overflow encountered in exp fa = 1.0 / (1.0 + exp(-dot(l1, l2a.T))) # propagate hidden -> output
So clearly, there was something going on with the way the numpy arrays are referenced. It looks like I wrote some code to mane a shallow copy of the NP arrays, instead of a deep copy. I updated and tested it, and have not hit upon the segfault yet.

@gojomo

This comment has been minimized.

Copy link
Member

gojomo commented Aug 26, 2015

In the cython work I did, intermittent nan-related crashes of this type were sometimes due to earlier mis-addressed writes, which corrupted other arrays.

If deepcopy has really prevented the errors, I'm concerned it may be covering them up rather than fixing the real error. (Perhaps it's coercing nan to 0?) The pre-deepcopy property-reassignment should have worked fine, and avoids both a large extra-allocation and a large memory-copy – so I'd revert to that so to be sure to trigger the problem until it's definitively fixed.

I would focus on earlier steps, especially the initial expand/update-weights. (I see one potential problem L967 – will add line comment there.)

# randomize the remaining words
for i in xrange(len(self.vocab), len(newsyn0)):
# construct deterministic seed from word AND seed argument
self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))

This comment has been minimized.

@gojomo

gojomo Aug 26, 2015

Member

Seems this should start at the old vocab length, and the newly-seeded vectors should be landing in the newsyn0? (Doubt this could cause segfaults/nan-issues as out-of-bounds access should be caught in this pure Python code, but true cause of those errors might be something similar elsewhere...)

@rutum

This comment has been minimized.

Copy link

rutum commented Aug 26, 2015

@gojomo The error is non-deterministic, which is why I think there was some malloc issue which was hard to reproduce. If I do not do a deep copy, then an index in syn0 will still reference to an index in newsyn0, but I think python3 handles numpy references and numpy copies differently than 2.6 and 2.7. Anyway, this is what I think the issue was. Let me know if you can reproduce this.

@gojomo

This comment has been minimized.

Copy link
Member

gojomo commented Aug 26, 2015

If the issue is corruption from earlier bad writes, whether that corruption is fatal or harmless can be affected by the (effectively) arbitrary locations chosen by earlier mallocs, yes. So that could explain the indeterminism. But if that's the case, I don't yet see any way a deepcopy can fix the real problem.

All indexed-accesses to syn0, eg syn0[i], will always reach into the current value of syn0, so there's no risk of misdirected accesses from doing self.syn0 = newsyn0. I'm pretty confident that's the right assignment to make, for speed and memory efficiency.

If the deepcopy is seemingly resolving the problem (or just making it much less frequent), it may be by simply hiding it. The big reallocation might guarantee the array is further from other arrays, so out-of-bounds writes are damaging other memory.

The tactic I mentioned the other day, to try the cython code with bounds-checkling back on (a directive atop the file), might help catch a real corrupting write when it happens, rather than when later calculations 'step' on it again. If you're not set up to modify/recompile the cython parts, I'll have a chance to try that in the next day or so...

@tmylk

This comment has been minimized.

Copy link
Contributor

tmylk commented Jan 10, 2016

@rutum Are you still working on this? Or should this be closed?
Thanks

@skillachie

This comment has been minimized.

Copy link

skillachie commented Feb 3, 2016

This feature looks good. Do you guys plan to merge it into the main dev/main branch soon ?

@danintheory

This comment has been minimized.

Copy link

danintheory commented Feb 7, 2016

Hi,

I am very interested in this feature. What would be needed to get this or a new pull request accepted?

Thanks!
Dan

@si74

This comment has been minimized.

Copy link

si74 commented Feb 11, 2016

Hi I also wanted to know if you were planning on merging this soon! :)

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Feb 12, 2016

This PR is not mergeable yet:

  • segfaults
  • contains changes in unrelated files (symlinks)
  • merge conflicts against the current develop branch
  • needs a deeper review that it works -- see code comments by @gojomo in this PR

@danintheory the most important one is fixing the segfault. I think starting from a current develop checkout, and re-applying relevant @rutum 's changes, may be easier than resolving the conflicts, due to the commit history.

@zachmayer

This comment has been minimized.

Copy link

zachmayer commented Feb 22, 2016

FYI: I rebased this PR and fixed merge conflicts here: #615.

I suggest we close this PR, and move the discussion over there.

@piskvorky

This comment has been minimized.

Copy link
Member

piskvorky commented Feb 23, 2016

Continued in #615 .

@piskvorky piskvorky closed this Feb 23, 2016

@zachmayer zachmayer referenced this pull request May 18, 2016

Closed

Online word2vec #700

@adarshaj

This comment has been minimized.

Copy link

adarshaj commented Jan 4, 2017

For posterity, this has been merged to develop on #900

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment