online word2vec #900

isomap · 2016-09-28T22:35:01Z

Rebase of #778.
It's ready to merge :)

tmylk · 2016-09-29T01:09:13Z

gensim/models/word2vec.py

+                            self.vocab[word].count += v
+                        else:
+                            self.vocab[word] = Vocab(count=v, index=len(self.index2word))
+                            self.index2word.append(word)


Could you please add logging for the number of new words added to vocab?

logger.info("New words added to vocab: %d ", new_word_count)

Ping @isohyt

Sorry, I forgot it.
It has been added to the code now.
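
For readers following the thread, the requested counter can be sketched outside gensim as below. This is a minimal illustration, not the merged code: the `Vocab` class is a simplified stand-in, and `merge_counts` is a hypothetical helper name introduced only for this example.

```python
import logging

logger = logging.getLogger(__name__)


class Vocab:
    """Simplified stand-in for a vocabulary entry (assumption, not gensim's class)."""

    def __init__(self, count, index):
        self.count = count
        self.index = index


def merge_counts(vocab, index2word, raw_counts):
    """Merge raw word counts into `vocab`; return how many words were new."""
    new_word_count = 0
    for word, v in raw_counts.items():
        if word in vocab:
            # Known word: only its count grows.
            vocab[word].count += v
        else:
            # Unseen word: append it at the end of the index.
            vocab[word] = Vocab(count=v, index=len(index2word))
            index2word.append(word)
            new_word_count += 1
    logger.info("New words added to vocab: %d", new_word_count)
    return new_word_count
```

Returning the count as well as logging it keeps the helper easy to check in a unit test.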

tmylk · 2016-09-29T01:10:16Z

gensim/test/test_word2vec.py

@@ -110,30 +175,6 @@ def testPersistenceWord2VecFormat(self):
        norm_only_model.init_sims(replace=True)
        self.assertFalse(numpy.allclose(model['human'], norm_only_model['human']))
        self.assertTrue(numpy.allclose(model.syn0norm[model.vocab['human'].index], norm_only_model['human']))


Why delete the lines below?

tmylk · 2016-09-29T01:11:19Z

gensim/test/test_word2vec.py

        self.assertTrue(model.n_similarity(['graph', 'trees'], ['trees', 'graph']))
        self.assertTrue(model.n_similarity(['graph'], ['trees']) == model.similarity('graph', 'trees'))
-        self.assertRaises(ZeroDivisionError, model.n_similarity, ['graph', 'trees'], [])


Please include only your own changes.

tmylk · 2016-09-29T01:12:08Z

Please add logging and redo the merge.

We are almost done with this needed feature!

isomap · 2016-09-29T01:27:13Z

Oops, I overlooked test_w2v compatibility.
I will fix the issues you pointed out within the week.

tmylk

Still merge artefacts.
Also need logging for the new words.

tmylk · 2016-10-02T06:05:55Z

gensim/corpora/wikicorpus.py

@@ -214,9 +214,10 @@ def extract_pages(f, filter_namespaces=False):
            title = elem.find(title_path).text
            text = elem.find(text_path).text

-            ns = elem.find(ns_path).text


@isohyt why is this file here?

When I processed an old enwiki dump (2010, in this notebook), an AttributeError was raised as follows:

File "***/lib/python3.5/site-packages/gensim/corpora/wikicorpus.py", line 211, in extract_pages
    ns = elem.find(ns_path).text
AttributeError: 'NoneType' object has no attribute 'text'

To avoid this error, I fixed wikicorpus.py.
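
The failure mode described here can be reproduced and guarded against with a small standalone sketch. It is not the change made in this PR (the full hunk is not shown above); `page_namespace` is a hypothetical helper, and falling back to `"0"` for pages without an `<ns>` element is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def page_namespace(elem, ns_path="ns", default="0"):
    """Return the text of the <ns> child, or `default` when it is missing.

    Older dumps may have no <ns> element, in which case elem.find() returns
    None and accessing .text on it raises AttributeError.
    """
    ns_elem = elem.find(ns_path)
    # Compare with None explicitly: an Element without children is falsy.
    if ns_elem is None or ns_elem.text is None:
        return default
    return ns_elem.text


# Page from an old-style dump (no <ns>) and from a newer one (with <ns>).
old_page = ET.fromstring("<page><title>A</title><text>body</text></page>")
new_page = ET.fromstring("<page><title>B</title><ns>14</ns><text>body</text></page>")
```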

tmylk · 2016-10-02T08:09:32Z

What about logging of the count of new words?

isomap · 2016-10-02T09:00:08Z

Sorry, I found an obvious mistake while adding the logger.

tmylk · 2016-10-02T10:51:37Z

gensim/models/word2vec.py

+            original_unique_total = len(pre_exist_words) + len(new_words) + drop_unique
+            pre_exist_unique_pct = len(pre_exist_words) * 100 / max(original_unique_total, 1)
+            new_unique_pct = len(new_words) * 100 / max(original_unique_total, 1)
+            logger.info("""New added %i unique words (%i%% of original %i)


Sorry, I don't understand this logger message. Could you please add an example in the notebook?

Do you mean adding this logger output to online_w2v_tutorial.ipynb, or adding a new notebook only for the logger message?

Better add to the main notebook.
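
As a small worked example of the percentages computed in the hunk above, the following sketch runs the same arithmetic on a toy vocabulary. The function name `summarize_update` and the wording of the returned message are hypothetical; the actual log message in the PR is truncated in the diff and is not reproduced here.

```python
def summarize_update(pre_exist_words, new_words, drop_unique):
    """Compute the share of new and pre-existing words after a vocab update."""
    original_unique_total = len(pre_exist_words) + len(new_words) + drop_unique
    denom = max(original_unique_total, 1)  # avoid division by zero on empty input
    pre_exist_unique_pct = len(pre_exist_words) * 100 / denom
    new_unique_pct = len(new_words) * 100 / denom
    # Illustrative wording only (assumption), not the PR's log message.
    message = "added %i new unique words (%i%% of %i); kept %i pre-existing words (%i%% of %i)" % (
        len(new_words), new_unique_pct, original_unique_total,
        len(pre_exist_words), pre_exist_unique_pct, original_unique_total,
    )
    return new_unique_pct, pre_exist_unique_pct, message


# Toy update: 3 words already known, 1 new word, 1 word dropped (e.g. by min_count).
new_pct, pre_pct, msg = summarize_update(["graph", "trees", "human"], ["artificial"], 1)
```

With these inputs the total is 5 unique words, so the new word accounts for 20% and the pre-existing words for 60%; the remaining 20% is the dropped word.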

tmylk · 2016-10-02T10:52:23Z

gensim/models/word2vec.py

+                        len(new_words), new_unique_pct, original_unique_total,
+                        len(pre_exist_words), pre_exist_unique_pct, original_unique_total)
+            retain_words = new_words + pre_exist_words
+            retain_total = new_total + pre_exist_total


Where is this used below?

retain_words and retain_total are used for downsampling.
They are mainly used in lines 642 to 663.
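
To make the role of the retained counts concrete, here is a rough sketch of how a retained total can drive frequent-word downsampling. It assumes the usual word2vec subsampling heuristic with a `sample` threshold below 1.0; `keep_probabilities` is a hypothetical name, and passing the retained words as a dict of counts is a simplification, not the structure used in the PR.

```python
from math import sqrt


def keep_probabilities(retain_counts, sample=1e-3):
    """Map each retained word to the probability of keeping one occurrence."""
    retain_total = sum(retain_counts.values())
    # Counts above this threshold are progressively downsampled.
    threshold_count = sample * retain_total if sample else retain_total
    probs = {}
    for word, count in retain_counts.items():
        p = (sqrt(count / threshold_count) + 1) * (threshold_count / count)
        probs[word] = min(p, 1.0)  # rare words are always kept
    return probs


probs = keep_probabilities({"the": 900, "graph": 90, "isomap": 10}, sample=0.01)
```

Because the threshold depends on the total over all retained words, an online update needs the combined total of new and pre-existing words rather than the new words alone.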

tmylk · 2016-10-02T10:53:38Z

Please add the logging example to the notebook on a small sample. It is hard to understand what is happening with the current messages.

tmylk · 2016-10-03T03:35:22Z

@isohyt Thanks for finishing this code!
Let's update the notebook in a separate PR.

isomap · 2016-10-03T03:42:26Z

Thanks for staying with me on this long journey, @tmylk :)
I will do that ASAP.

piskvorky · 2017-01-03T06:03:01Z

@isohyt Has the notebook been updated?

I still see

This implementation is still beta version at 16/09/04. You can download the beta version of online word2vec implementation in the following repository.
In [ ]:
%%bash
git clone -b online-w2v git@github.com:isohyt/gensim.git

at the top of the notebook. Can you please update?

@tmylk what else needs to be done for this to be complete?

tmylk · 2017-01-03T08:49:09Z

@isohyt the line about 'beta' and link to your private github repo should be removed. This project is complete.

isomap · 2017-01-04T08:26:39Z

Sorry, I forgot to remove that description.
I will update this notebook ASAP in another PR.

isomap · 2017-01-04T08:41:07Z

This notebook has already been fixed in the dev branch.
Thanks for fixing it, @tmylk!

online-w2v (done)

213d2db

tmylk reviewed Sep 29, 2016

fix compatibility

eb17c1f

tmylk suggested changes Oct 2, 2016

isomap added 2 commits October 2, 2016 17:21

add logger for count of new added words

dffc971

add retain_words and total in case of online updateing

8087ac8

fix retain_toatl

e55e4cc

tmylk reviewed Oct 2, 2016

tmylk merged commit 6627c6f into piskvorky:develop Oct 3, 2016

tmylk mentioned this pull request Oct 4, 2016

Online word2vec #700

Closed

isomap deleted the online-w2v_done branch October 10, 2016 09:22

martinpopel mentioned this pull request Nov 30, 2016

Enable and refactor image summaries ufal/neuralmonkey#162

Merged

schwittlick mentioned this pull request Dec 1, 2016

Combine Google/Wiki model with our own model Schwittleymani/ECO#176

Closed

adarshaj mentioned this pull request Jan 4, 2017

online word2vec #435

Closed

kalmanchapman mentioned this pull request Apr 3, 2017

[FLINK-2094] [ml] implements Word2Vec for FlinkML apache/flink#2735

Closed

piskvorky mentioned this pull request Jul 20, 2017

Fully supporting incremental updation of vocabulary in Word2Vec model #1493

Open
