
[MRG] Documentation fixes #3307

Merged – 12 commits merged into develop, Apr 19, 2022
Conversation

codecov bot commented Mar 18, 2022

Codecov Report

Merging #3307 (65b7fb2) into develop (ac3bbcd) will decrease coverage by 1.90%.
The diff coverage is 100.00%.

❗ Current head 65b7fb2 differs from pull request most recent head 2982148. Consider uploading reports for the commit 2982148 to get more accurate results.

@@             Coverage Diff             @@
##           develop    #3307      +/-   ##
===========================================
- Coverage    81.43%   79.53%   -1.91%     
===========================================
  Files          122       68      -54     
  Lines        21052    11781    -9271     
===========================================
- Hits         17144     9370    -7774     
+ Misses        3908     2411    -1497     
Impacted Files Coverage Δ
gensim/models/word2vec.py 89.70% <ø> (+4.54%) ⬆️
gensim/corpora/wikicorpus.py 93.75% <100.00%> (-1.04%) ⬇️
gensim/models/doc2vec.py 81.61% <100.00%> (+0.32%) ⬆️
gensim/models/lsimodel.py 64.02% <100.00%> (-0.03%) ⬇️
gensim/scripts/glove2word2vec.py 76.19% <0.00%> (-7.15%) ⬇️
gensim/matutils.py 77.23% <0.00%> (-0.90%) ⬇️
gensim/similarities/docsim.py 23.95% <0.00%> (-0.76%) ⬇️
gensim/models/rpmodel.py 89.47% <0.00%> (-0.53%) ⬇️
gensim/models/ldamulticore.py 90.58% <0.00%> (-0.33%) ⬇️
... and 94 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 91175dd...2982148.

piskvorky (Owner Author) commented Mar 18, 2022

@gojomo the doc2vec runs finished:
https://github.com/RaRe-Technologies/gensim/blob/9e8e90a5b3d75c753dd6d0c492aed9c89610e5f1/docs/notebooks/doc2vec-wikipedia.ipynb

Unlike in the original notebook, the DBOW results no longer seem superior to DM. Both modes show reasonable results, so I changed the conclusions and wording in the notebook accordingly. Perhaps the difference was due to some bug in earlier versions of Gensim (the notebook is quite old).

The next step is to run the notebook front-to-back, commit & merge. Please review and let me know if you'd like anything reworded / changed / fixed before the final run.

gojomo (Collaborator) commented Mar 19, 2022

Looks good overall. My main suggestion would be to add a parallel demonstration of a 3rd mode: pure DBOW (leaving the default dbow_words=0 in place).

IIRC, the very idea of mixing DBOW with interleaved (& largely-analogous) skipgram word-training traces back to this paper, in its innocent aside: "We also show a simple yet effective trick to improve Paragraph Vector. In particular, we observe that by jointly training word embeddings, as in the skip gram model, the quality of the paragraph vectors is improved." In fact, I think I added the dbow_words option to make it possible to match this paper.

But the plain DBOW is also very fast, & very competitive on many tasks, if you just want full-doc vectors (without word-vectors). And the fact that the word-vectors still exist in that plain mode, but remain untrained random-initialized junk, is an occasional gotcha that it'd be useful to note. (My old IMDB-based Doc2Vec demo showed the nonsense results, when doing word-vector most-similars on a plain-DBOW model, to highlight this point.)

In particular, the reason your existing DBOW+W vs DM comparison shows a significant speed advantage for DM is entirely the extra word-training. With a corpus of M words, in one epoch, plain DBOW does M innermost-loop micro-predictions, each doc_vector -> word. Plain DM also does M innermost-loop micro-predictions, each mean(doc_vector, contextword0_vector, ..., contextwordN_vector) -> word. But DBOW+W does M + (window * M) innermost micro-predictions: both the plain DBOW predictions, plus an average of window more predictions, per target-word, of skip-gram form word-vector -> word.

It's likely that DBOW alone would train fastest, and do reasonably well on all the tasks that work only with articles. Even if not adding a "pure DBOW" trial, it'd be good to be clear that what you're trying is the special DBOW plus skip-gram introduced by the 'Document Embeddings' paper, not plain DBOW.
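For concreteness, here's a minimal sketch of the three configurations being discussed, using gensim's Doc2Vec API (the parameter values are illustrative placeholders, not the notebook's actual settings):

```python
from gensim.models.doc2vec import Doc2Vec

common = dict(vector_size=300, window=5, min_count=5, workers=8, epochs=10)

# Plain DBOW: M doc_vector -> word micro-predictions per epoch; word
# vectors are allocated but never trained (they stay random-initialized junk).
plain_dbow = Doc2Vec(dm=0, dbow_words=0, **common)

# DBOW + interleaved skip-gram word-training (the paper's "trick"):
# roughly M + (window * M) micro-predictions per epoch.
dbow_plus_words = Doc2Vec(dm=0, dbow_words=1, **common)

# DM: M micro-predictions, each mean(doc_vector, context word vectors) -> word.
dm = Doc2Vec(dm=1, dm_mean=1, **common)
```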

I notice you're using a non-default window=8 value – though the paper doesn't seem to specify what value they're using. You might as well leave the default window=5 in place. That'll make the DM mode a teensy bit faster, with fewer words averaged to context vector, and the DBOW+W mode noticeably faster, probably about 33% faster, as it directly reduces the raw number of word-vector -> word predictions. This improvement could mean that adding a 3rd plain-DBOW model still takes less training time than the existing 2. (Though, holding all 3 in RAM might be a challenge.)
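(Concretely, by the accounting above: with window=8, DBOW+W makes about M + 8M = 9M micro-predictions per epoch; with window=5, M + 5M = 6M – roughly a third fewer, hence the ~33% estimate.)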

From my eyeballing of the paper's declared vector-sizes, it looks like they tried 100, 300, 1000, & 10000 – & reported the 10000-dimensional (!) doc-vectors as best-performing. 10k dimensions are probably overkill & impractical on your dev laptop... but their 10000 wasn't that much better than their 1000. You could probably manage that – especially if discarding more of the smaller articles (so there aren't 10M+ total doc-vectors to train). You could also aim for 300 dimensions – very common in word-vector work, and one of the sizes they tested. If you go for 200, which doesn't match anything the paper tried, I'd add a comment in the notebook noting that you're doing so for memory efficiency, & that it's still sufficient to demo the benefits, though projects with adequate resources may find it useful to try larger dimensionalities when copious training data is available.

Assuming your M1 CPU reports 8 cores, it's possible that cores-1 (7) or cores-2 (6) workers would offer even better training throughput. (Running a train just for a few minutes, for the reported words-rate in the log to stabilize, is enough to search the space: this throughput remains roughly stable through long trains.)
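A quick way to probe that, sketched under the assumption of some `corpus` iterable of TaggedDocument (hypothetical here): train briefly at each worker count and compare the words/s throughput gensim logs.

```python
import logging
from gensim.models.doc2vec import Doc2Vec

logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s",
                    level=logging.INFO)

for workers in (6, 7, 8):
    model = Doc2Vec(dm=0, vector_size=100, workers=workers)
    model.build_vocab(corpus)  # `corpus` = your TaggedDocument iterable (hypothetical)
    model.train(corpus, total_examples=model.corpus_count, epochs=1)
    # compare the "words/s" lines logged during each run
```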

On a corpus this large, a non-default sample=1e-05 (or even sample=1e-06) could halve training time, or more, with no harm to quality – or even improvement, by diluting (via aggressive removal) the influence of the most-common words. (I suspect the model paper may have used such a more-aggressive-than-default value, because I think some early iterations of word2vec.c used 1e-05 rather than the later, more modest 1e-04 default that eventually prevailed.)

Note also that max_final_vocab dynamically sets a min_count such that the surviving total of words is no more than the target. (It won't hit the target exactly, if that'd require arbitrarily saving some words of frequency n but not others.) So, you're only winding up with effective_min_count=27 & 894,446 surviving words. As I suspect the paper authors just chose a min_count that got them under 1M words, using max_final_vocab=1000000 might be more in spirit with the paper, and wind up with a closer number of surviving words.

Finally, the paper seems to declare that they used HS mode, which you could match by specifying hs=1, negative=0 to the models. I'm not sure what effect that might have on training speed or end-evaluations. Generally negative-sampling does relatively better (same benefits in less training time) as vocabularies grow larger, but I'm not sure where this corpus/demo falls in terms of potential 'sweet spots'. Since negative-sampling is the default, & does well, & seems to be used more commonly on serious projects, it seems OK to use it here. But if trying to match the paper as closely as possible in all declared/practical aspects, you'd want to use HS.
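Pulling these last few suggestions together – a hedged sketch of what "match the paper as closely as practical" might look like (the exact values are my guesses at the spirit of the paper, not its declared settings):

```python
from gensim.models.doc2vec import Doc2Vec

paper_like = Doc2Vec(
    dm=0, dbow_words=1,          # DBOW + skip-gram word-training, per the paper's trick
    vector_size=1000,            # paper tried 100/300/1000/10000; 1000 is feasible locally
    sample=1e-5,                 # more aggressive downsampling of frequent words
    max_final_vocab=1_000_000,   # dynamic min_count targeting <= 1M surviving words
    hs=1, negative=0,            # hierarchical softmax, as the paper declares
    workers=8, epochs=10,
)
```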

piskvorky (Owner Author) commented Mar 19, 2022

I notice you're using a non-default window=8 value – though the paper doesn't seem to specify what value they're using. You might as well leave the default window=5 in place.

I didn't touch the pre-existing model settings. I assumed, given the notebook's explicit goal of "replicate the paper", that these were already set appropriately. So in addition to poor language and poor code, the setup was also incorrect? That's sad.

But at least it's a chance to clean up properly now… snowballing from the initial user report of "No str.decode in python3" :)

You could also aim for 300 dimensions – very common in word-vector work, and one of the sizes they tested. If you go for 200, which doesn't match anything the paper tried, I'd add a comment in the notebook noting that you're doing so for memory efficiency, & that it's still sufficient to demo the benefits, though projects with adequate resources may find it useful to try larger dimensionalities when copious training data is available.

OK (ditto).

Assuming your M1 CPU reports 8 cores, it's possible that cores-1 (7) or cores-2 (6) workers would offer even better training throughput. (Running a train just for a few minutes, for the reported words-rate in the log to stabilize, is enough to search the space: this throughput remains roughly stable through long trains.)

Ditto. Your suggestion makes good sense; I'll set the worker threads to 8 or 9 – I use my 10-core laptop for other things during the training anyway.

Regarding all the following:

Looks good overall. My main suggestion would be to add a parallel demonstration of a 3rd mode: pure DBOW (leaving the default dbow_words=0 in place).

It's likely that DBOW alone would train fastest, and do reasonably well on all the tasks that work only with articles. Even if not adding a "pure DBOW" trial, it'd be good to be clear that what you're trying is the special DBOW plus skip-gram introduced by the 'Document Embeddings' paper, not plain DBOW.

On a corpus this large, a non-default sample=1e-05 (or even sample=1e-06) could halve training time, or more, with no harm to quality – or even improvement, by diluting (via aggressive removal) the influence of the most-common words. (I suspect the model paper may have used such a more-aggressive-than-default value, because I think some early iterations of word2vec.c used 1e-05 rather than the later, more modest 1e-04 default that eventually prevailed.)

Note also that max_final_vocab dynamically sets a min_count such that the surviving total of words is no more than the target. (It won't hit the target exactly, if that'd require arbitrarily saving some words of frequency n but not others.) So, you're only winding up with effective_min_count=27 & 894,446 surviving words. As I suspect the paper authors just chose a min_count that got them under 1M words, using max_final_vocab=1000000 might be more in spirit with the paper, and wind up with a closer number of surviving words.

Finally, the paper seems to declare that they used HS mode, which you could match by specifying hs=1, negative=0 to the models. I'm not sure what effect that might have on training speed or end-evaluations. Generally negative-sampling does relatively better (same benefits in less training time) as vocabularies grow larger, but I'm not sure where this corpus/demo falls in terms of potential 'sweet spots'. Since negative-sampling is the default, & does well, & seems to be used more commonly on serious projects, it seems OK to use it here. But if trying to match the paper as closely as possible in all declared/practical aspects, you'd want to use HS.

May I tempt you with a commit fix @gojomo? Clearly you're more knowledgeable here, so you'd set the "correct" params and then I'd run the cells.

Let me know, because otherwise I'll adapt what I can & merge – I don't want to stall the release, and this notebook isn't that vital.

Although after the cleanup, maybe we could promote it to a tutorial @mpenkov – give the notebook more visibility, link it from the gallery? Are the long running times a problem? BTW, I see one of the CI tests failed again – could you have a look please?

piskvorky (Owner Author) commented Mar 20, 2022

@gojomo I changed the settings as suggested (300 dim, more subsampling, default window, etc.) & re-ran the notebook:
https://github.com/RaRe-Technologies/gensim/blob/7e866cd85e26505646f9770e9ca6e28cb5203f09/docs/notebooks/doc2vec-wikipedia.ipynb

Good news: the WARNING: Effective 'alpha' higher than previous training cycles disappeared. It was indeed an artifact of me re-launching the cells ad hoc, corrupting the state somehow.

Bad news: the results seem significantly worse. They no longer match the paper, nor the accompanying text.

I'm starting to see why the original notebook author chose the settings they did. It's probably the result of some laborious experimentation, left undocumented. Which, to be fair, matches the original paper too, because that was done in a similarly under-documented manner. So, a meta-match.

EDIT:

And results for dim=100, window=8: https://github.com/RaRe-Technologies/gensim/blob/010a7ac745896e7b4607f1ff2b4507b835c3ddf9/docs/notebooks/doc2vec-wikipedia.ipynb

Quite different again, although more meaningful than the dim=300 window=5 above IMO.

gojomo (Collaborator) commented Mar 21, 2022

It surprises me that a window=8 vs window=5 change (or sample=1e-05 vs sample=1e-04) creates noticeable degradation in demo-able results.

I can take a look & test/apply some other tweaks this week - but given the long cycles involved in full runs, even if it's just a couple hours of poking around, it'll likely be 4+ days of wall clock time.

gojomo (Collaborator) commented Mar 21, 2022

Thinking a bit more about it:

That the 100d model "looks better" suggests to me the larger 300d model might've needed more epochs to fully converge. Checking back to the original paper, they're even coy about exactly how many epochs they used, saying only "we trained paragraph vectors over at least 10 epochs of the data". So they may have used more, especially if their running-epoch-loss readout (which Gensim sadly still doesn't have in Doc2Vec) helped better-hint to them when more epochs could help.
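(Tangent: as a partial stand-in for that missing readout, gensim's CallbackAny2Vec hooks can at least mark epoch boundaries – a toy sketch, not something the notebook uses:)

```python
from gensim.models.callbacks import CallbackAny2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class EpochMarker(CallbackAny2Vec):
    """Print a line as each epoch ends; no running loss is available for Doc2Vec."""
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print(f"epoch {self.epoch} finished")

# toy corpus just to make the sketch self-contained
docs = [TaggedDocument(["just", "a", "toy", "doc"], [i]) for i in range(50)]
model = Doc2Vec(docs, vector_size=20, min_count=1, epochs=5,
                callbacks=[EpochMarker()])
```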

So my vague hunch is that to the extent a smaller window & more-aggressive sample save lots of training time, achieving the best results probably requires then giving some training-effort back via more epochs. But of course it'd take a very time-consuming, more-extensive parameter search to be sure.

To the extent the 100d model is going to be faster-to-train & smaller, and seems sufficient to demonstrate the vivid results, we might as well focus on that, with side commentary noting that when time/resources permit, even larger models can perform better on rigorous evaluations, as the paper's {300, 1000, 10000} figure implies.

(More generally, another reason for some drift in results, or other entries surging past the paper's, could easily be the emergence of newer acts in the later Wikipedia dumps we're using. So if matching-as-closely-as-possible were a top priority, using a similar vintage dump could be considered. But I think using the latest is a better basis for users' present needs – & if we or they are going to spend the effort downloading/preprocessing a 10-20GB dataset, most would rather have the latest around than some arbitrary dump from a decade+ ago.)

piskvorky (Owner Author) commented Mar 21, 2022

Yes, I added a mention of "7 years later" into the notebook as one possible reason for the divergence of results.

I'm training d300,n5,w8,mc5,s1e-05,t10 models now. It doesn't really cost me anything, but is tedious. The notebook doesn't matter that much TBH – I'm just down the rabbit hole and loath to merge something inferior :)
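(For anyone not fluent in the shorthand gensim prints in model names, that string decodes roughly as follows – the dm/dbow mode choice aside:)

```python
from gensim.models.doc2vec import Doc2Vec

# d300,n5,w8,mc5,s1e-05,t10 in gensim's model-name shorthand:
model = Doc2Vec(
    vector_size=300,  # d300
    negative=5,       # n5
    window=8,         # w8
    min_count=5,      # mc5
    sample=1e-05,     # s1e-05
    workers=10,       # t10 (threads)
)
```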

piskvorky (Owner Author) commented Mar 22, 2022
Several DM runs have stalled (no progress, no CPU used). @gojomo is this a known bug – have you seen it before?

[Screenshot: 2022-03-22 at 20:31:32]

Restarting usually helps; the issue appears in maybe 20% of runs.

Only in DM-mode training, btw – never in DBOW.

gojomo (Collaborator) commented Mar 23, 2022

Never seen that. Suggestive of a queueing issue where either expected items are lost, or the cross-thread signaling has gone awry. I'd look in the console where the notebook-server/kernel is running to see if some sort of odd/fatal error happened in a worker thread preventing normal progress.

piskvorky (Owner Author) commented Mar 23, 2022

I don't see anything out of the ordinary in the server console:

[Screenshot: 2022-03-23 at 08:14:46]

The cell "stopped" between 1:08 and 1:30 AM. I woke up and restarted the kernel around 8 AM.

This is weird. The issue now occurs pretty much 100% of the time, so I haven't been able to finish the notebook even once over the last couple of days. I wonder if it's some HW / OS issue. I'll try on a different machine.

I'll skip DM for now.

gojomo (Collaborator) commented Mar 24, 2022

My first run of a version of your "100d" notebook also hit the DM-mode hang, at the end of the very-1st epoch, waiting forever for 4 of the threads to report they were finished. My environment is MacOS 12.3 / Python 3.9 on a pre-M1 MacBook Pro - so it's not unique to the M1-compiled-versions you're using. That this is now so quick to reproduce makes me suspect it's a regression. I still suspect worker threads are silently dying.

@piskvorky piskvorky added this to the Next release milestone Apr 15, 2022
piskvorky (Owner Author) commented Apr 19, 2022

I retrained on a Linux machine over Easter, several times. Everything worked fine there – no hang-ups.

So I suspect it's a Mac issue. Because the problem appears both pre-M1 (you) and on M1 (me), it's likely related to the OS. Maybe the macOS BLAS – I remember it has some quirks.

Either way, I'm done here. @gojomo please review https://github.com/RaRe-Technologies/gensim/blob/d872c02849af37812991ed72d69c8ed5725d1563/docs/notebooks/doc2vec-wikipedia.ipynb and I'll merge and we'll release the new Gensim.

@piskvorky piskvorky changed the title [WIP] Documentation fixes [MRG] Documentation fixes Apr 19, 2022
@piskvorky piskvorky merged commit 5fe3bbc into develop Apr 19, 2022
@piskvorky piskvorky deleted the fix_docs branch April 19, 2022 18:36