Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

online word2vec #900

Merged
merged 5 commits into from
Oct 3, 2016
Merged

online word2vec #900

merged 5 commits into from
Oct 3, 2016

Conversation

isomap
Copy link
Contributor

@isomap isomap commented Sep 28, 2016

Rebase of #778 .
It's ready to merge :)

self.vocab[word].count += v
else:
self.vocab[word] = Vocab(count=v, index=len(self.index2word))
self.index2word.append(word)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add logging for the number of new words added to vocab?

logger.info("New words added to vocab: %d ", new_word_count)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @isohyt

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry i forgot it.
it's added in code

@@ -110,30 +175,6 @@ def testPersistenceWord2VecFormat(self):
norm_only_model.init_sims(replace=True)
self.assertFalse(numpy.allclose(model['human'], norm_only_model['human']))
self.assertTrue(numpy.allclose(model.syn0norm[model.vocab['human'].index], norm_only_model['human']))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why delete the lines below?

self.assertTrue(model.n_similarity(['graph', 'trees'], ['trees', 'graph']))
self.assertTrue(model.n_similarity(['graph'], ['trees']) == model.similarity('graph', 'trees'))
self.assertRaises(ZeroDivisionError, model.n_similarity, ['graph', 'trees'], [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please only have your changes

@tmylk
Copy link
Contributor

tmylk commented Sep 29, 2016

Please add logging and redo the merge.

We are almost done with this needed feature!

@isomap
Copy link
Contributor Author

isomap commented Sep 29, 2016

Oops, I overlooked test_w2v compatibility.
I will fix these which you pointed out within the week.

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still merge artefacts.
Also need logging for the new words.

@@ -214,9 +214,10 @@ def extract_pages(f, filter_namespaces=False):
title = elem.find(title_path).text
text = elem.find(text_path).text

ns = elem.find(ns_path).text
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isohyt why is this file here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I transform old enwiki dump (2010 in this notebook), Attribution Error was returned as follows.

File "***/lib/python3.5/site-packages/gensim/corpora/wikicorpus.py", line 211, in extract_pages ns = elem.find(ns_path).text AttributeError: 'NoneType' object has no attribute 'text'

To avoid this error, I fixed wikicorpus.py.

@tmylk
Copy link
Contributor

tmylk commented Oct 2, 2016

What about logging of the count of new words?

@isomap
Copy link
Contributor Author

isomap commented Oct 2, 2016

sorry i found obvious mistake through adding logger

original_unique_total = len(pre_exist_words) + len(new_words) + drop_unique
pre_exist_unique_pct = len(pre_exist_words) * 100 / max(original_unique_total, 1)
new_unique_pct = len(new_words) * 100 / max(original_unique_total, 1)
logger.info("""New added %i unique words (%i%% of original %i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand this logger message. could you please add an example in the notebook?

Copy link
Contributor Author

@isomap isomap Oct 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean to add this logger on online_w2v_tutorial.ipynb or add new notebook only for the logger message?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better add to the main notebook.

len(new_words), new_unique_pct, original_unique_total,
len(pre_exist_words), pre_exist_unique_pct, original_unique_total)
retain_words = new_words + pre_exist_words
retain_total = new_total + pre_exist_total
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this used below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retain_words and retain_total are used for the down sampling.
They are mainly used from line 642 to 663

@tmylk
Copy link
Contributor

tmylk commented Oct 2, 2016

Please add the logging example to the notebook on a small sample. It is hard to understand what is happening with the current messages.

@tmylk tmylk merged commit 6627c6f into piskvorky:develop Oct 3, 2016
@tmylk
Copy link
Contributor

tmylk commented Oct 3, 2016

@isohyt Thanks for finishing this code!
Let's update the notebook in a separate PR.

@isomap
Copy link
Contributor Author

isomap commented Oct 3, 2016

thanks for staying with me on this long journey, @tmylk :)
i will do that asap

@piskvorky
Copy link
Owner

@isohyt Has the notebook been updated?

I still see

This implementation is still beta version at 16/09/04. You can download the beta version of online word2vec implementation in the following repository.
In [ ]:
%%bash
git clone -b online-w2v git@github.com:isohyt/gensim.git

at the top of the notebook. Can you please update?

@tmylk what else needs to be done for this to be complete?

@tmylk
Copy link
Contributor

tmylk commented Jan 3, 2017

@isohyt the line about 'beta' and link to your private github repo should be removed. This project is complete.

@isomap
Copy link
Contributor Author

isomap commented Jan 4, 2017

Sorry, I forgot to remove that description.
I will update this notebook asap in other PR.

@isomap
Copy link
Contributor Author

isomap commented Jan 4, 2017

this notebook has been already fixed in dev branch now.
Thanks for fixing @tmylk !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants