online word2vec #900
        self.vocab[word].count += v
    else:
        self.vocab[word] = Vocab(count=v, index=len(self.index2word))
        self.index2word.append(word)
Could you please add logging for the number of new words added to vocab?
logger.info("New words added to vocab: %d ", new_word_count)
Ping @isohyt
Sorry, I forgot it.
It has now been added to the code.
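For readers following the thread, the requested logging can be sketched as below. This is a minimal illustration, not the code from the pull request: the `Vocab` class here is a hypothetical stand-in, and `update_vocab` and `raw_counts` are names assumed for the example.

```python
import logging

logger = logging.getLogger(__name__)


class Vocab:
    """Hypothetical stand-in for the vocabulary entry used in the diff."""

    def __init__(self, count, index):
        self.count = count
        self.index = index


def update_vocab(vocab, index2word, raw_counts):
    """Merge raw_counts into vocab and return how many words were new."""
    new_word_count = 0
    for word, v in raw_counts.items():
        if word in vocab:
            # Known word: only its frequency changes.
            vocab[word].count += v
        else:
            # Unseen word: append it and give it the next free index.
            vocab[word] = Vocab(count=v, index=len(index2word))
            index2word.append(word)
            new_word_count += 1
    logger.info("New words added to vocab: %d", new_word_count)
    return new_word_count


vocab = {"graph": Vocab(count=3, index=0)}
index2word = ["graph"]
added = update_vocab(vocab, index2word, {"graph": 2, "trees": 1})
```

Counting inside the `else` branch keeps the log message tied to the same condition that extends `index2word`.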
@@ -110,30 +175,6 @@ def testPersistenceWord2VecFormat(self):
norm_only_model.init_sims(replace=True)
self.assertFalse(numpy.allclose(model['human'], norm_only_model['human']))
self.assertTrue(numpy.allclose(model.syn0norm[model.vocab['human'].index], norm_only_model['human']))
Why delete the lines below?
self.assertTrue(model.n_similarity(['graph', 'trees'], ['trees', 'graph']))
self.assertTrue(model.n_similarity(['graph'], ['trees']) == model.similarity('graph', 'trees'))
self.assertRaises(ZeroDivisionError, model.n_similarity, ['graph', 'trees'], [])
Please include only your own changes.
Please add logging and redo the merge. We are almost done with this needed feature!
Oops, I overlooked test_w2v compatibility.
There are still merge artefacts.
Logging for the new words is also still needed.
@@ -214,9 +214,10 @@ def extract_pages(f, filter_namespaces=False):
title = elem.find(title_path).text
text = elem.find(text_path).text

ns = elem.find(ns_path).text
@isohyt why is this file here?
When I transformed an old enwiki dump (2010 in this notebook), an AttributeError was raised as follows:
File "***/lib/python3.5/site-packages/gensim/corpora/wikicorpus.py", line 211, in extract_pages
    ns = elem.find(ns_path).text
AttributeError: 'NoneType' object has no attribute 'text'
To avoid this error, I fixed wikicorpus.py.
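The failure mode described here can be reproduced and guarded against with the standard library alone. The sketch below is not necessarily the change made to wikicorpus.py in this pull request; `page_namespace` is a hypothetical helper, and the tiny XML snippets stand in for pages from an old and a newer dump.

```python
import xml.etree.ElementTree as ET


def page_namespace(elem, ns_path="ns", default=None):
    """Return the text of the namespace child, or `default` if it is absent."""
    ns_elem = elem.find(ns_path)
    # find() returns None when no child matches. Compare against None
    # explicitly, because an Element without children is falsy.
    if ns_elem is None:
        return default
    return ns_elem.text


# Assumed shape of a page in an old dump: no <ns> element at all.
old_page = ET.fromstring("<page><title>Graph</title><text>body</text></page>")
# Assumed shape of a page in a newer dump: <ns> is present.
new_page = ET.fromstring(
    "<page><title>Graph</title><ns>0</ns><text>body</text></page>"
)

old_ns = page_namespace(old_page)
new_ns = page_namespace(new_page)
```

Returning a default instead of raising lets the caller decide whether pages without a namespace should be kept or filtered.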
What about logging of the count of new words?
Sorry, I found an obvious mistake while adding the logger.
original_unique_total = len(pre_exist_words) + len(new_words) + drop_unique
pre_exist_unique_pct = len(pre_exist_words) * 100 / max(original_unique_total, 1)
new_unique_pct = len(new_words) * 100 / max(original_unique_total, 1)
logger.info("""New added %i unique words (%i%% of original %i)
Sorry, I don't understand this logger message. Could you please add an example in the notebook?
Do you mean adding this logger output to online_w2v_tutorial.ipynb, or adding a new notebook only for the logger message?
Better add to the main notebook.
len(new_words), new_unique_pct, original_unique_total,
len(pre_exist_words), pre_exist_unique_pct, original_unique_total)
retain_words = new_words + pre_exist_words
retain_total = new_total + pre_exist_total
Where is this used below?
retain_words and retain_total are used for downsampling.
They are mainly used from lines 642 to 663.
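Lines 642 to 663 are not shown in this conversation, so the following is only a sketch of how a retained word list and a retained total typically feed word2vec-style subsampling of frequent words. The formula is the commonly used one from word2vec implementations and is assumed here; `keep_probabilities`, `raw_counts`, and the sample value are illustrative names and numbers.

```python
from math import sqrt


def keep_probabilities(raw_counts, retain_words, retain_total, sample=1e-3):
    """Map each retained word to the probability of keeping one occurrence."""
    # Words whose count is well above this threshold get downsampled.
    threshold_count = sample * retain_total
    probabilities = {}
    for word in retain_words:
        v = raw_counts[word]
        p = (sqrt(v / threshold_count) + 1) * (threshold_count / v)
        # Rare words would get p > 1; cap so they are always kept.
        probabilities[word] = min(p, 1.0)
    return probabilities


raw_counts = {"the": 900, "graph": 90, "minors": 10}
new_words, pre_exist_words = ["minors"], ["the", "graph"]
retain_words = new_words + pre_exist_words
retain_total = sum(raw_counts[w] for w in retain_words)

probs = keep_probabilities(raw_counts, retain_words, retain_total, sample=0.01)
```

In this toy input the most frequent word receives the lowest keep probability, while the rarest word is never dropped.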
Please add the logging example to the notebook on a small sample. It is hard to understand what is happening with the current messages.
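A small sample of the kind requested might look like the sketch below. It reuses the percentage arithmetic visible in the diff, but the surrounding function, the `min_count` filtering rule, the toy counts, and the wording of the log message are assumptions made for illustration, not the notebook's actual content.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("online_w2v_example")


def summarize_vocab_update(existing_vocab, raw_counts, min_count):
    """Split words from a new corpus into pre-existing, new, and dropped."""
    pre_exist_words, new_words = [], []
    drop_unique = 0
    for word, count in raw_counts.items():
        if count < min_count:
            drop_unique += 1  # too rare in the new corpus, discarded
        elif word in existing_vocab:
            pre_exist_words.append(word)
        else:
            new_words.append(word)

    # Same arithmetic as in the diff under review.
    original_unique_total = len(pre_exist_words) + len(new_words) + drop_unique
    pre_exist_unique_pct = len(pre_exist_words) * 100 / max(original_unique_total, 1)
    new_unique_pct = len(new_words) * 100 / max(original_unique_total, 1)

    # Illustrative wording, not the message from the pull request.
    logger.info(
        "added %i new unique words (%i%% of %i); %i were already known (%i%% of %i)",
        len(new_words), new_unique_pct, original_unique_total,
        len(pre_exist_words), pre_exist_unique_pct, original_unique_total,
    )
    return {
        "new_words": new_words,
        "pre_exist_words": pre_exist_words,
        "drop_unique": drop_unique,
        "new_unique_pct": new_unique_pct,
        "pre_exist_unique_pct": pre_exist_unique_pct,
    }


existing_vocab = {"graph", "trees"}
raw_counts = {"graph": 4, "trees": 2, "minors": 3, "survey": 1}
summary = summarize_vocab_update(existing_vocab, raw_counts, min_count=2)
```

Here both percentages are taken over all unique words seen in the new corpus, including the ones dropped for being too rare, which is what `original_unique_total` represents.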
@isohyt Thanks for finishing this code!
Thanks for staying with me on this long journey, @tmylk :)
@isohyt Has the notebook been updated? I still see
at the top of the notebook. Can you please update? @tmylk what else needs to be done for this to be complete?
@isohyt the line about 'beta' and link to your private github repo should be removed. This project is complete.
Sorry, I forgot to remove that description.
This notebook has now been fixed in the dev branch.
Rebase of #778.
It's ready to merge :)