
NMF optimization & documentation #2361

Merged
merged 201 commits into from Jan 31, 2019

Conversation

anotherbugmaster
Contributor

@anotherbugmaster anotherbugmaster commented Jan 29, 2019

Massive performance improvements and better docs. Continues from PR #2007.

It now uses less memory and runs 4-5 times faster. Metrics such as perplexity also work as expected.

TODO:

  • Fix use_r functionality (invalid, dramatically slows down training, while not improving quality too much)
  • Make the load method work with the old implementation (invalid)
  • Revamp wikipedia benchmark and add sklearn implementation
  • Revamp corpus parameter docs, make it easier to grasp
  • Document or deprecate use_r parameter
  • More info in the tutorial notebook
  • Fix tests
  • Add useful parameters ranges
  • Document minimum_probability
  • Remove unused parameters from the tutorial notebook
  • Remove unused parameters from wiki notebook
  • Remove r from eval metrics
  • Change eval metric to relative error
  • Add RAM estimation

Owner

@piskvorky piskvorky left a comment


Minor suggestions for language + code style. Looks much better overall 👍

first_doc = matutils.corpus2csc([first_doc], len(self.id2word))
self._h = None

if isinstance(corpus, scipy.sparse.csc.csc_matrix):
Owner

Is this some special undocumented case? Deserves a comment.

Contributor Author

The corpus can be a csc matrix now; I'll add that to the docstring.

Contributor Author

Fixed.

Owner

I still see no comment about this code path. Why does this method accept two different input types? Is it some optimization?

Contributor Author

Oops, I added that to the __init__ docstring, but forgot to mention csc in the update. I'll fix it.
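For context, a minimal sketch of handling both input types in one place (hypothetical helper, not the PR's actual code; it assumes a CSC corpus stores documents as columns and a streamed corpus yields bag-of-words lists of (term_id, weight) pairs):

```python
import scipy.sparse

def first_document(corpus, num_terms):
    """Return the first document as a (num_terms, 1) sparse column.

    Accepts either a scipy CSC matrix with documents as columns, or any
    iterable of bag-of-words documents. Hypothetical sketch, not the PR code.
    """
    if isinstance(corpus, scipy.sparse.csc_matrix):
        return corpus.getcol(0)
    bow = next(iter(corpus))  # note: consumes one item of a one-shot generator
    col = scipy.sparse.lil_matrix((num_terms, 1))
    for term_id, weight in bow:
        col[term_id, 0] = weight
    return col.tocsc()
```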

first_doc = corpus.getcol(0)
else:
first_doc_it = itertools.tee(corpus, 1)
first_doc = next(first_doc_it[0])
Owner

I think this will still advance the original generator (lose the first document).

Isn't there a mismatch between expected corpus docs and actually seen corpus docs during iteration, after _setup?

Contributor Author

No, that's something I checked: itertools.tee copies the generator and leaves the original one intact.

Owner

@piskvorky piskvorky Jan 31, 2019

# python 3.6
corpus = (i for i in range(10))
first_doc_it = itertools.tee(corpus, 1)
first_doc = next(first_doc_it[0])
print("first doc:", first_doc)
for doc_no, doc in enumerate(corpus):
    print(f"doc #{doc_no}: {doc}")

# outputs:

first doc: 0
doc #0: 1
doc #1: 2
doc #2: 3
doc #3: 4
doc #4: 5
doc #5: 6
doc #6: 7
doc #7: 8
doc #8: 9

Contributor Author

Aha, you're right, it works for iterators, but not for generators. I'll fix that.
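A common pattern for peeking at the first document without losing it is to chain the consumed item back onto the stream. A sketch with a hypothetical helper name, not necessarily what the PR ended up doing:

```python
import itertools

def peek_first(corpus):
    """Return (first_item, corpus) where corpus still yields ALL items,
    even when the input is a one-shot generator."""
    it = iter(corpus)
    first = next(it)
    # Re-attach the consumed item in front of the remainder.
    return first, itertools.chain([first], it)

stream = (i for i in range(5))        # one-shot generator
first, stream = peek_first(stream)
# first is 0, and iterating `stream` still yields 0, 1, 2, 3, 4
```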

Owner

Does NMF accept generators? Or does it require a "restartable" sequence in corpus?

Owner

Can you add those corner cases into unit tests, too? To prevent regressions. (generator input, empty docs, corpus shape not divisible by chunksize, etc)

Contributor Author

NMF loses the first document of a non-restartable generator for now, though a non-restartable sequence should work fine with passes=1.

I'll fix that and add unit tests.

@piskvorky
Owner

@anotherbugmaster regarding the notebook:

  1. Does trainset / testset need to be a np.array, instead of list? Storing objects (dicts) as numpy array is inefficient and looks strange. If really necessary, deserves a comment "why".

  2. Can we introduce the notebook cells with a bit of context / motivation? For example, under the section Coherence, mention why this section is here in the notebook + link to an explanation what coherence is?

  3. The density definition as def density(matrix): return (matrix > 0).mean() seems non-standard and deserves a comment (why density as "mean of above-zero elements"?)

@anotherbugmaster
Contributor Author

anotherbugmaster commented Jan 30, 2019

1. Does `trainset` / `testset` need to be a `np.array`, instead of `list`? Storing objects (`dict`s) as numpy array is inefficient and looks strange. If really necessary, deserves a comment "why".

Yes, this is not efficient, but I wanted to fix the random seed during permutation while not messing with the global random seed.
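One way to fix the shuffle seed without touching the global seed, and without storing dicts in a numpy array, is a local RandomState that permutes indices of a plain list. A sketch with hypothetical data and split sizes, not the notebook's code:

```python
import numpy as np

docs = [{"id": i} for i in range(10)]   # plain list of dicts, no np.array needed
rng = np.random.RandomState(42)         # local seed; np.random stays untouched
order = rng.permutation(len(docs))      # permute indices, not the objects
shuffled = [docs[i] for i in order]
trainset, testset = shuffled[:8], shuffled[8:]
```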

2. Can we introduce the notebook cells with a bit of context / motivation? For example, under the section **Coherence**, mention why this section is here in the notebook + link to an explanation what coherence is?

Sure, I'll explain it in more detail.

3. The density definition as `def density(matrix):    return (matrix > 0).mean()` seems non-standard and deserves a comment (why density as "mean of above-zero elements"?)

(matrix > 0) is a boolean matrix; the mean function counts the Trues (cells where the condition holds) and divides by the total number of elements in the matrix. So it's the ordinary density, nothing fancy. :)
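For illustration, the equivalence with the usual nnz-based density (assuming a nonnegative matrix, as in NMF, so "> 0" and "nonzero" coincide):

```python
import numpy as np

def density(matrix):
    return (matrix > 0).mean()

m = np.array([[0.5, 0.0, 1.2],
              [0.0, 0.0, 3.0]])
# Fraction of nonzero cells: 3 nonzeros out of 6 elements.
assert density(m) == np.count_nonzero(m) / m.size == 0.5
```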

@menshikh-iv
Contributor

@anotherbugmaster awesome 🔥

The release can't wait any longer, so I'm merging the PR right now.
Please don't forget to fix the unchecked notes from the first post.

@menshikh-iv menshikh-iv merged commit 366d8ae into piskvorky:develop Jan 31, 2019
v, self._W, r=self._r, h=self._h, v_max=self.v_max
if isinstance(corpus, scipy.sparse.csc.csc_matrix):
grouper = (
corpus[:, col_idx:col_idx + self.chunksize]
Owner

@piskvorky piskvorky Jan 31, 2019

What does this do? Takes a chunk of matrix columns (features) instead of rows (documents)? And then shuffles the features??

Needs a strong comment, unusual process.

Contributor Author

The corpus has shape (n_tokens, n_documents) in this case, so the grouper splits the input corpus in chunks by column (documents dimension) and then shuffles columns of every chunk (shuffles documents order in the chunk).
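A minimal sketch of that chunking, with hypothetical variable names (the corpus is (n_tokens, n_documents), so columns are documents):

```python
import numpy as np
import scipy.sparse

corpus = scipy.sparse.random(5, 10, density=0.5, format='csc')  # tokens x docs
chunksize = 4
rng = np.random.RandomState(0)

for col_idx in range(0, corpus.shape[1], chunksize):
    # Recent scipy truncates an out-of-range slice stop automatically;
    # older scipy needs a min() clamp here (see the thread below).
    chunk = corpus[:, col_idx:col_idx + chunksize]  # up to `chunksize` docs
    perm = rng.permutation(chunk.shape[1])          # shuffle doc order
    chunk = chunk[:, perm]                          # within the chunk only
```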

Owner

@piskvorky piskvorky Jan 31, 2019

Aha! Are you sure documents are columns? Where does this order originate from? (atypical, definitely needs a comment / docstring)

Contributor Author

Yes, it's unusual, but that's the format that's used in the original paper. I'll specify the shape of the corpus in the module docstring.

v, self._W, r=self._r, h=self._h, v_max=self.v_max
if isinstance(corpus, scipy.sparse.csc.csc_matrix):
grouper = (
corpus[:, col_idx:col_idx + self.chunksize]
Owner

This will raise an exception if corpus.shape[1] is not divisible by self.chunksize.

Contributor Author

Why?

In [10]: foo = sparse.random(2, 10, density=0.5, format='csc')

In [11]: foo[:, 5:100500].toarray()
Out[11]: 
array([[0.98888343, 0.        , 0.        , 0.24760066, 0.        ],
       [0.24857359, 0.        , 0.        , 0.27890554, 0.16412464]])

The corpus[:, col_idx:col_idx + self.chunksize] is equivalent to corpus[:, col_idx:corpus.shape[1]] on the last iteration

Owner

@piskvorky piskvorky Jan 31, 2019

This may be due to some changes in scipy, but your example results in

IndexError: index out of bounds: 0 <= 5 <= 10, 0 <= 100500 <= 10, 5 <= 100500

in scipy 0.19.0.

Contributor Author

Oh, I see. Seems like they added support for that in later versions. I'll do something like corpus[:, col_idx:min(col_idx + self.chunksize, corpus.shape[1])]
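The clamped slice, sketched standalone (hypothetical standalone variables; note the bound must come from the full corpus, not the chunk being built):

```python
import scipy.sparse

corpus = scipy.sparse.random(2, 10, density=0.5, format='csc')
chunksize = 4
col_idx = 8
# Clamp the stop index so the slice stays in bounds even on older scipy,
# which raises IndexError for out-of-range slice bounds.
stop = min(col_idx + chunksize, corpus.shape[1])
chunk = corpus[:, col_idx:stop]
assert chunk.shape == (2, 2)
```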

Owner

Thanks. Plus add a comment, so someone doesn't accidentally remove it again in the future.

We may simplify it again once the older scipy versions become irrelevant in the future.

@piskvorky
Owner

@anotherbugmaster let me know about the performance (RAM, speed, quality) in our target use-case of sparse input (text, not dense faces), especially in comparison to sklearn or other tools. Then we can publish this. Super cool feature!

@anotherbugmaster
Contributor Author

@piskvorky Ok! I'll push the notebook with metrics and additional annotations later this weekend.

@anotherbugmaster
Contributor Author

Here are the current wikipedia results:

[image: Wikipedia benchmark results table]

@piskvorky
Owner

piskvorky commented Feb 1, 2019

So, 0.17 GB vs 16 GB RAM compared to sklearn… nice! And 2x as fast 🐎

But the L2 norm is worrying, that's a big difference. Is that expected? Especially given the lower perplexity (?).

Can you post the resulting topics too? For visual, manual comparison.

Reordering the table to show model first, then train time, then RAM, then the quality metrics should make it easier to read and interpret. Thanks!

@anotherbugmaster
Contributor Author

anotherbugmaster commented Feb 1, 2019

So, 0.17 GB vs 16 GB RAM compared to sklearn… nice! And 2x as fast 🐎

But the L2 norm is worrying, that's a big difference. Is that expected? Especially given the lower perplexity (?).

I checked the metrics functions again and found a bug in the l2 evaluation (I normalized the input corpus for sklearn, but not for NMF and LDA, hence the invalid metrics). I've fixed it and am re-running the wikipedia notebook now. I expect changes only in the l2 norm of all models and in the perplexity of sklearn NMF.

Here are updated metrics for the tutorial:

[image: updated tutorial metrics table]

Can you post the resulting topics too? For visual, manual comparison.

Sure, but I'm not sure how to present them. I think I'll take top-5 topics from each model to fit them in one scroll.

Reordering the table to show model first, then train time, then RAM, then the quality metrics should make it easier to read and interpret. Thanks!

Ok!

@anotherbugmaster
Contributor Author

anotherbugmaster commented Feb 2, 2019

@piskvorky, here are the updated metrics on wikipedia:

[image: updated Wikipedia metrics table]

So, 2x faster, 100x less memory, still better on l2 and perplexity.

@piskvorky
Owner

piskvorky commented Feb 2, 2019

Awesome. Please update the notebook (and reorder the table columns) and the parameter ranges, and I'll "officially" announce our new model.

Great work @anotherbugmaster , this came together very nicely in the end.

@piskvorky
Owner

@anotherbugmaster since this PR is already merged, can you push these updated metrics / tables / notebook in a new PR?

@anotherbugmaster anotherbugmaster mentioned this pull request Feb 4, 2019
@anotherbugmaster
Contributor Author

@piskvorky Sure, here it is

@@ -1,21 +1,114 @@
"""Online Non-Negative Matrix Factorization."""
"""`Online Non-Negative Matrix Factorization. <https://arxiv.org/abs/1604.02634>`
Owner

@piskvorky piskvorky Feb 4, 2019

This formatting is broken, see https://radimrehurek.com/gensim/models/nmf.html. RST hyperlinks must end in _.

How about this instead:

Online Non-Negative Matrix Factorization.
This implementation uses the efficient sparse incremental algorithm of Renbo Zhao, Vincent Y. F. Tan et al. `[PDF] <https://arxiv.org/abs/1604.02634>`_.

Rendered version.
