Improving Scan_Vocab speed, build_vocab_from_freq function. Iteration 2 #1695

jodevak · 2017-11-06T10:58:24Z

As requested, this is a new pull request. Thanks

horpto · 2017-11-06T11:00:29Z

gensim/test/test_word2vec.py

        half_precision_model_kv = keyedvectors.KeyedVectors.load_word2vec_format(
            testfile(), binary=True, datatype=np.float16
        )
-        self.assertEqual(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2)
+        self.assertEquals(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2)


https://docs.python.org/2/library/unittest.html#deprecated-aliases

assertEquals is deprecated.

horpto · 2017-11-06T11:02:31Z

gensim/test/test_word2vec.py

+            ["minors", "survey", "minors", "survey", "minors"]
+        ]
+        model = word2vec.Word2Vec(sentences, size=10, min_count=0, max_vocab_size=2, seed=42, hs=1, negative=0)
+        self.assertTrue(len(model.wv.vocab), 3)


Maybe you need assertEqual here, not assertTrue?
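To illustrate the concern: assertTrue(expr, msg) treats its second argument as the failure message, so the check in the diff passes for any non-empty vocabulary. A minimal sketch follows, assuming a plain dict as a stand-in for model.wv.vocab; the class and test names are made up for illustration and are not part of gensim's test suite.

```python
import unittest


class VocabSizeTest(unittest.TestCase):
    # Hypothetical stand-in for model.wv.vocab; it holds 2 entries, not 3.
    vocab = {"minors": 3, "survey": 2}

    def test_assert_true_hides_mismatch(self):
        # The second argument is only the failure message, so this passes
        # for any non-zero length, even though 2 != 3.
        self.assertTrue(len(self.vocab), 3)

    def test_assert_equal_catches_mismatch(self):
        # assertEqual compares both values and raises here because 2 != 3.
        with self.assertRaises(AssertionError):
            self.assertEqual(len(self.vocab), 3)
```

Both tests pass, which shows that the assertTrue form cannot detect a wrong vocabulary size.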

horpto · 2017-11-06T11:09:42Z

gensim/models/word2vec.py

-        .. [#taddy] Taddy, Matt.  Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
-        .. [#deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
+        .. [taddy] Taddy, Matt.  Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
+        .. [deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb


I'm sorry, but why did you remove the # in the citations? (#1633)

The autopep8 tool did that.

This file was merged with an older version.

menshikh-iv · 2017-11-06T13:52:47Z

gensim/models/word2vec.py

@@ -647,15 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No

        Examples
        --------
-        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
+        >>> model.build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)


PEP8: model.build_vocab_from_freq({"Word1": 15, "Word2": 20}, update=True)

Sorry, what's the problem with this?

Spaces are missing after ':' and ',' (the snippet in my comment is the fixed variant).

menshikh-iv · 2017-11-06T13:55:31Z

gensim/models/word2vec.py

@@ -647,15 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No

        Examples
        --------
-        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)


model is undefined; please create a model first. The docstring should be executable, i.e. I should be able to copy-paste this code into a console and have it run successfully. We plan to add doctests to our CI soon.
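As a rough illustration of what "executable docstring" means here, the sketch below runs a docstring example through the standard doctest module. ToyModel and run_docstring_example are hypothetical names invented for this example; ToyModel is not gensim's Word2Vec and only mimics the shape of the call under review.

```python
import doctest


class ToyModel:
    """Hypothetical stand-in for a model; not gensim's Word2Vec."""

    def __init__(self):
        self.vocab = {}

    def build_vocab_from_freq(self, word_freq):
        """Store a word -> count table as the vocabulary.

        Examples
        --------
        >>> model = ToyModel()
        >>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20})
        >>> sorted(model.vocab.items())
        [('Word1', 15), ('Word2', 20)]
        """
        self.vocab = dict(word_freq)


def run_docstring_example():
    # Parse the examples out of the docstring and execute them, giving the
    # examples access to ToyModel just as a console session would have.
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        ToyModel.build_vocab_from_freq.__doc__,
        {"ToyModel": ToyModel},
        "build_vocab_from_freq",
        None,
        0,
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)  # TestResults(failed=..., attempted=...)
```

Because the example creates its own model before calling the method, it runs from a clean console; dropping the first line would make the doctest fail with a NameError.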

menshikh-iv · 2017-11-06T13:55:58Z

gensim/models/word2vec.py


-        self.corpus_count = corpus_count if corpus_count else 0
-        self.raw_vocab = vocab
+        self.corpus_count = corpus_count if corpus_count else 0  # Since no sentences are provided, this is to control the corpus_count


PEP8 - two spaces before #

There are two spaces, aren't there?

Oh, indeed, sorry.

menshikh-iv · 2017-11-06T14:03:46Z

gensim/models/word2vec.py


-        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
+        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule,update=update)  # trim by min_count & precalculate downsampling


Please revert to the previous variant (with the space after the comma).

menshikh-iv · 2017-11-06T14:07:27Z

gensim/models/word2vec.py

@@ -675,14 +679,14 @@ def scan_vocab(self, sentences, progress_per=10000, trim_rule=None):
                        type(sentence)
                    )
                checked_string_types += 1
-            if sentence_no % progress_per == 0:
+            if sentence_no % progress_per == 0 and sentence_no != 0:


Why is this needed?

Because 0 % anything equals 0, so the logger would log a statement for sentence 0 with 0 words processed.

But we want that :)
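The effect of the extra condition can be seen in a small sketch. This is a simplified stand-in for the progress check, not gensim's actual scan_vocab loop; progress_points and skip_first are hypothetical names used only for this illustration.

```python
def progress_points(num_sentences, progress_per, skip_first=False):
    """Return the sentence numbers at which a progress line would be logged."""
    points = []
    for sentence_no in range(num_sentences):
        if sentence_no % progress_per == 0:
            # 0 % progress_per is always 0, so sentence 0 triggers the check
            # unless it is excluded explicitly.
            if skip_first and sentence_no == 0:
                continue
            points.append(sentence_no)
    return points


print(progress_points(25, 10))                   # [0, 10, 20]
print(progress_points(25, 10, skip_first=True))  # [10, 20]
```

With the original condition the first sentence produces a log line; the proposed change suppresses it, which the reviewer prefers to keep.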

menshikh-iv

Please add a test based on #1599 (comment) and fix the log message as suggested in #1599 (comment).

After that, I'll merge your PR.

menshikh-iv · 2017-11-06T14:54:29Z

gensim/models/word2vec.py

@@ -647,13 +647,19 @@ def build_vocab_from_freq(self, word_freq, keep_raw_vocab=False, corpus_count=No

        Examples
        --------
-        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
+        >>> from gensim.models.word2vec import Word2Vec
+        >>> model=Word2Vec()


PEP8: model = Word2Vec()

jodevak · 2017-11-06T15:11:24Z

@menshikh-iv The function testPruneVocab is already there.

menshikh-iv · 2017-11-06T15:23:54Z

@jodevak You need to add a test for total_words, because you changed the "counting logic".

jodevak · 2017-11-06T16:22:31Z

@menshikh-iv Do you have any suggestions for testing total_words, other than adding a new attribute to the model object or returning total_words as a value?

menshikh-iv · 2017-11-07T05:52:56Z

I see 3 variants:

1. Test the logger output (a strange way, but why not)
2. Make a _total_words attribute and check it after build_vocab
3. Return total_words from build_vocab

@piskvorky which variant looks best to you?

jodevak · 2017-11-07T09:44:01Z

@menshikh-iv Choice 3 seems most convenient to me.
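A minimal sketch of what the third variant could look like, assuming a toy counting function rather than gensim's real implementation. toy_scan_vocab and TotalWordsTest are hypothetical names; the point is only that a returned total_words is easy to assert on.

```python
import unittest
from collections import defaultdict


def toy_scan_vocab(sentences):
    """Count word frequencies and return (vocab, total_words)."""
    vocab = defaultdict(int)
    total_words = 0
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        # Count every token seen, independent of any later vocabulary pruning.
        total_words += len(sentence)
    return dict(vocab), total_words


class TotalWordsTest(unittest.TestCase):
    def test_total_words(self):
        sentences = [["minors", "survey"], ["minors", "survey", "minors"]]
        vocab, total_words = toy_scan_vocab(sentences)
        self.assertEqual(total_words, 5)
        self.assertEqual(vocab, {"minors": 3, "survey": 2})
```

Returning the count keeps the model object free of test-only attributes and avoids parsing log output, which is the trade-off between the three variants listed above.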

menshikh-iv · 2017-11-08T06:49:01Z

Thank you @jodevak 👍

jodevak added 13 commits September 25, 2017 17:47

fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table

3f30e1e

fix build vocab speed issue, function build vocab from previously provided word frequencies table

c4f387e

fix build vocab speed issue, function build vocab from previously provided word frequencies table

8abd58b

fix build vocab speed issue, function build vocab from previously provided word frequencies table

8ec0433

Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab

b9f3a5f

Fixing Indentation

0a5e8d6

Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace

644fcad

Remove trailing white spaces

c91b4cb

Adding test

1e4ef3e

fix spaces

9ae7a84

iteration 2 on code

1e82811

iteration 2 on code

aa9227d

Merge branch 'build_vocab_freq' of https://github.com/jodevak/gensim into build_vocab_freq

e156b95

horpto reviewed Nov 6, 2017

jodevak added 2 commits November 6, 2017 15:24

Fixing old version of word2vec.py merge problems

2066a2a

Fixing indent

62ed129

menshikh-iv suggested changes Nov 6, 2017

Fixing Styling

473d7e6

menshikh-iv suggested changes Nov 6, 2017

menshikh-iv mentioned this pull request Nov 6, 2017

Fix scan vocab speed issue, build vocab from provided word frequencies #1599

Merged

Fixing Styling

a65e36b

jodevak added 2 commits November 6, 2017 18:24

test

7f46a05

test

f744c4f

adding total words count test

6471164

adding total words count test

9bc6b78

menshikh-iv merged commit 40b0417 into piskvorky:develop Nov 8, 2017
