
Online NMF #2007

Merged: 161 commits merged into piskvorky:develop on Jan 17, 2019

Conversation

@anotherbugmaster (Contributor) commented Mar 29, 2018

Online Robust NMF. Resolves #132. Based on this paper.

@menshikh-iv (Contributor) left a comment

Good start @anotherbugmaster 👍

Main things you need to do now:

  1. Benchmark (add a notebook comparing the current implementation with others on different tasks)
  2. Support for BoW format (feel free to drop numpy dense matrices); see the BoW sketch after this list
  3. API (should be very similar to Lda/Lsi)
  4. Tests
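
For reference, gensim's streamed bag-of-words (BoW) format represents each document as a sparse list of (token_id, count) pairs. A minimal sketch:

```python
from gensim.corpora import Dictionary

# Two toy documents, already tokenized.
texts = [["human", "computer", "interaction"],
         ["graph", "trees", "graph"]]

dictionary = Dictionary(texts)  # maps token <-> integer id
corpus = [dictionary.doc2bow(text) for text in texts]

# Each document is a sparse list of (token_id, count) pairs:
print(corpus)  # e.g. [[(0, 1), (1, 1), (2, 1)], [(3, 2), (4, 1)]]
```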

@menshikh-iv added the incubator project label on Mar 29, 2018
@menshikh-iv changed the title from [WIP] Online NMF to Online NMF on Jan 17, 2019
@menshikh-iv (Contributor) commented:

Time to merge, awesome work @anotherbugmaster 🚀💣🔥💣🚀

@menshikh-iv menshikh-iv merged commit 239856c into piskvorky:develop Jan 17, 2019
@piskvorky (Owner) commented Jan 18, 2019

@anotherbugmaster can you share those TL;DR comparisons against other implementations (sklearn etc), as per my comment above (time, memory, quality)?

I'd like to include that in the release notes. Thanks!

@piskvorky (Owner) commented:

I found some numbers in the images at the bottom of the tutorial. Is the Gensim implementation really 6x slower than sklearn's?

@anotherbugmaster (Contributor, Author) commented Jan 18, 2019

> @anotherbugmaster can you share those TL;DR comparisons, as per my comment above (time, memory, quality)?
>
> I'd like to include that in the release notes. Thanks!

Sure, here they are: https://github.com/anotherbugmaster/gensim/blob/e34b939e9a5f1f79f9582ef3d0618fd43bbd7be2/docs/notebooks/nmf_wikipedia.ipynb

[screenshot: benchmark comparison table from the notebook]

> I found some numbers in the images at the bottom of the tutorial. Is the Gensim implementation really 6x slower than sklearn's?

Only with certain hyperparameters. In most cases it's 2-3x faster than sklearn, and it also achieves a better F1:

[screenshot: training-time and F1 comparison table]

@piskvorky (Owner) commented Jan 19, 2019

@anotherbugmaster thanks, but I don't know how to read any of these tables, what the NaNs mean, or what F1 is doing in an unsupervised method. It looks more like some internal benchmark notes -- what I'd like is a human-readable digest and insights.

There's almost no text in the tutorial. The only part that was easy to interpret was the images at the end, which say Gensim is 6x slower than anything else :(

Can you please post a TL;DR comparison against sklearn on the same dataset (wiki? images?): memory, time, quality? Why should someone use our NMF implementation, instead of other implementations?

@anotherbugmaster (Contributor, Author) commented:

> @anotherbugmaster thanks, but I don't know how to read any of these tables, what the NaNs mean, or what F1 is doing in an unsupervised method. It looks more like some internal benchmark notes -- what I'd like is a human-readable digest and insights.
>
> There's almost no text in the tutorial. The only part that was easy to interpret was the images at the end, which say Gensim is 6x slower than anything else :(
>
> Can you please post a TL;DR comparison against sklearn on the same dataset (wiki? images?): memory, time, quality? Why should someone use our NMF implementation, instead of other implementations?

Ok, Radim, how about the first table in the release notes?

https://github.com/RaRe-Technologies/gensim/releases

[screenshot: comparison table from the release notes]

Also, here are the insights from the tutorial notebook:

  • Gensim NMF clearly beats the sklearn implementation both in terms of speed and quality
  • LDA is still significantly better in terms of quality, though interpretability of topics and speed are clearly worse than NMF's

Here is the RAM comparison on Wikipedia:

[screenshot: RAM usage comparison on Wikipedia]

NaN means that the metric wasn't computed for that particular model (coherence for sklearn NMF, for example).

F1 measures the model's quality on a downstream task: 20-newsgroups classification.

Our NMF is online (you can't just run sklearn on Wikipedia, it won't fit in memory) and faster than sklearn NMF on large, sparse datasets (which is the typical case in topic modeling).
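
To make the memory contrast concrete, here is a schematic (not benchmark) comparison: sklearn's batch NMF needs the entire term-document matrix in RAM, while an online model only ever holds one chunk of documents:

```python
import numpy as np
from sklearn.decomposition import NMF

# Batch NMF (sklearn): the full (n_docs x n_terms) matrix must sit in RAM at once.
X = np.random.rand(200, 100)  # toy stand-in for a real term-document matrix
batch_model = NMF(n_components=10, init="nndsvd", max_iter=200).fit(X)

# An online model only needs one chunk of documents at a time; peak memory is
# proportional to the chunk size, not the corpus size.
def iter_chunks(doc_stream, chunksize):
    chunk = []
    for doc in doc_stream:
        chunk.append(doc)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```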

@piskvorky (Owner) commented Jan 19, 2019

@anotherbugmaster I already saw all these tables and the notebook multiple times. They are not what I am asking for. Nobody but you knows how the numbers relate, what's important, or even which way is up.

I am asking for a short human summary of a head-to-head comparison of memory, time and quality with other NMF implementations (e.g. sklearn), on a concrete dataset.

"Gensim NMF clearly beats the sklearn implementation both in terms of speed and quality" is substantiated where? The images show it's actually 6x slower. I don't know where "clearly beats in quality" is coming from -- the numbers seem to be either NaN or better for sklearn?

Similarly for "LDA is still significantly better in terms of quality, though interpretability of topics and speed are clearly worse than NMF's": what does this claim actually mean? The whole point of LDA is to produce interpretable topics.

I'm sure the code is fine if @menshikh-iv OKed it and merged. That's not the issue. The issue is the documentation, especially with regard to motivation and user guidance. As a user, I don't understand where this NMF implementation stands, how it compares to other implementations, when I should use it (or not use it), what the parameters mean and which ones I should change (or not touch).

I can help with the language once I understand it myself, but I need some insights, not a huge table full of some unexplained numbers and code cells without commentary.

@menshikh-iv do you understand what I'm asking? Can you help out here?

@piskvorky (Owner) commented Jan 19, 2019

For clarity, here's an example of what I meant by "insights" -- something users may understand, something to ground them conceptually and guide their intuition about this implementation:

Gensim NMF should be used whenever you want to retrieve interpretable (non-negative factors) topics from a very large and sparse dataset. Its online incremental training allows you to update the NMF model in pieces, in constant memory. This is in stark contrast to other NMF implementations (such as in scikit-learn), where the entire dataset must be loaded into memory at once. It also allows resuming training with more data at a later time. Another application of this "online" architecture is joining NMF models built from partial data slices into a single model (e.g. individual NMF models from weekly time-slices combined into a single NMF model for the whole year).

In terms of memory, the Gensim NMF implementation scales linearly with the number of terms and topics. You also need to be able to load a partial chunk of documents into RAM at a time (the chunksize parameter). For example, on the English Wikipedia dataset, you'll need 2 GB RAM for 100 NMF topics and 100k vocabulary, updating the model with chunks of 10,000 documents at a time. See this notebook table for more details and benchmark numbers.

In terms of CPU, the runtime is dominated by the coordinate descent in each update iteration. You can control the CPU-accuracy tradeoff by tweaking the ABC parameter. The defaults are set to work well on standard English texts (sparsity <1%), but if your dataset is dense, you may want to change it to EFG.

In terms of model quality, the algorithm implemented in Gensim NMF follows this paper. It achieves the online training capability by calculating only approximate XYZ. On the English Wikipedia, this results in an L2 reconstruction error of ABC (compared to sklearn's DEF). For more information, see the paper above or our benchmarks here.

If you want to use NMF, check out our official tutorial here for a step-by-step code guide. The API parameters are documented here.

(Just an example -- maybe the facts are wrong, or the implementation cannot do this, I don't know. But this was our goal.)

@anotherbugmaster (Contributor, Author) commented Jan 19, 2019

> @anotherbugmaster I already saw all these tables multiple times. They are not what I am asking for. Nobody but you knows how they relate, what's important, or even which way is up.
>
> I am asking for a head-to-head comparison of memory, time and quality with other NMF implementations (e.g. sklearn), on a concrete dataset.
>
> "Gensim NMF clearly beats the sklearn implementation both in terms of speed and quality" is substantiated where? The images show it's actually 6x slower. I don't know where "clearly beats in quality" is coming from.
>
> Similarly for "LDA is still significantly better in terms of quality, though interpretability of topics and speed are clearly worse than NMF's": what does this claim actually mean? The whole point of LDA is to produce interpretable topics, that's its "quality".
>
> I'm sure the code is fine if @menshikh-iv OKed it and merged. But I don't understand where this functionality stands when it comes to how it's different from other approaches, when I should use it (or not use it). Consequently, I don't know how to communicate it to users. I need some insights, not a huge table full of unexplained numbers.

Radim, to be clear, the Olivetti faces decomposition is included just to show that it's possible to extract latent components. The model is optimized for the case of sparse corpora, not dense image matrices.

The main benchmark dataset is 20-newsgroups, and the huge table concerns that dataset.

As for the quality, I can't entirely agree, because:

  • Perplexity and coherence don't always correlate with human judgment
  • Topic models are used as features in downstream tasks, and that we can measure precisely

I see what you mean by insights. I'll try to make something similar to your example.

@anotherbugmaster (Contributor, Author) commented Jan 19, 2019

Gensim NMF should be used whenever you want to retrieve interpretable (non-negative factors) topics from a very large and sparse dataset. Its online incremental training allows you to update the NMF model in pieces, in constant memory. This is in stark contrast to other NMF implementations (such as in scikit-learn), where the entire dataset must be loaded into memory at once. It also allows resuming training at a later time.

In terms of memory, the Gensim NMF implementation scales linearly with the number of terms and topics. You also need to be able to load a partial chunk of documents into RAM at a time (the chunksize parameter). For example, on the English Wikipedia dataset, you'll need 150Mb RAM for 50 NMF topics and 100k vocabulary, updating the model with chunks of 2,000 documents at a time. See this notebook table for more details and benchmark numbers.

In terms of CPU, the runtime is dominated by the coordinate descent in each update iteration. You can control the CPU-accuracy tradeoff by tweaking the w_max_iter, w_stop_condition, h_r_max_iter, h_r_stop_condition and sparse_coef parameters. The defaults are set to work well on standard English texts (sparsity <1%), but if your dataset is dense, you may want to increase sparse_coef.

In terms of model quality, the algorithm implemented in Gensim NMF follows this paper. It achieves the online training capability by accumulating the document-topic matrices of each subsequent batch in a special way and then iteratively computing the topic-word matrix. For more information, see the paper above or our benchmarks here.
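
A rough numpy sketch of that accumulation idea, with invented helper names and deliberately simplified update rules (the paper's actual algorithm also handles outliers and uses proper projected-gradient solvers):

```python
import numpy as np

def online_nmf(batches, n_terms, n_topics, w_iters=5, eps=1e-12):
    """Toy online NMF: per-batch topic inference plus running sufficient
    statistics (A, B) from which the topic-word matrix W is refreshed."""
    rng = np.random.default_rng(0)
    W = np.abs(rng.standard_normal((n_terms, n_topics)))
    A = np.zeros((n_topics, n_topics))   # accumulates H @ H.T over batches
    B = np.zeros((n_terms, n_topics))    # accumulates V @ H.T over batches
    for V in batches:                    # V: (n_terms, batch_size)
        # Infer document-topic weights for this batch (crude nonnegative solve).
        H = np.clip(np.linalg.pinv(W) @ V, 0.0, None)
        A += H @ H.T
        B += V @ H.T
        # Refresh W from the accumulated statistics (multiplicative-style updates).
        for _ in range(w_iters):
            W *= B / np.maximum(W @ A, eps)
        W /= np.maximum(W.sum(axis=0, keepdims=True), eps)  # keep columns normalized
    return W

# Usage: stream two random batches of 50 "documents" over 300 terms.
batches = (np.random.rand(300, 50) for _ in range(2))
W = online_nmf(batches, n_terms=300, n_topics=10)
```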

If you want to use NMF, check out our official tutorial here for a step-by-step code guide. The API parameters are documented here.
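
And a minimal usage sketch of that workflow (parameter names taken from this discussion; the final API may differ, so treat the exact signature as an assumption):

```python
from gensim.corpora import Dictionary
from gensim.models.nmf import Nmf

texts = [["sparse", "topic", "modeling"],
         ["online", "topic", "updates"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Streamed training: only `chunksize` documents are held in RAM at a time.
nmf = Nmf(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    chunksize=2000,    # documents per training chunk
    passes=5,          # sweeps over the corpus
)

# Online updates: resume training later with more documents.
more_docs = [dictionary.doc2bow(["online", "updates"])]
nmf.update(more_docs)
print(nmf.show_topics())
```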

@piskvorky (Owner) commented Jan 20, 2019

Thanks for your patience, but we need to improve the docs significantly before we really promote this exciting new model addition.

Still missing: clear numbers from a single benchmark (ideally Wikipedia, 3 numbers: RAM + time + reconstruction error/loss), and a TL;DR comparison to sklearn (same 3 calculated/estimated numbers, for a direct head-to-head).

I don't know how else to say it, but we need a human-friendly TL;DR comparison of the NMF implementation in Gensim and other NMF implementations. The current nondescript table full of numbers and NaNs, in a notebook without comments, is insufficient.

@anotherbugmaster Can you improve the parameter intuition too please? Enumerating the parameter names like h_r_max_iter tells me nothing. What are they for? What are their acceptable value ranges? When would I want to change them? How do they relate to each other? The API documentation under https://radimrehurek.com/gensim/models/nmf.html is similarly terse and frustrating (compare to sklearn NMF, Gensim SVD).

Try to see this from the user perspective please. Users are not going to decode academic papers or pore over the code just to understand what this model is supposed to do and how it differs from their other options. We have to provide a basic overview and intuition.

> Radim, to be clear, the Olivetti faces decomposition is included just to show that it's possible to extract latent components. The model is optimized for the case of sparse corpora, not dense image matrices.

That wasn't clear at all from the notebook. In fact, "Olivetti faces" is not even introduced or described anywhere. As a reader, I don't know what I'm looking at, why, or what I'm supposed to be seeing there.

I assume by 150 Mb you mean megabytes, right?

Does this model support merging partial models built from independent chunks, or not? I see you removed this sentence from my example text, which you used as a template (I completely made it up -- are you sure the algo descriptions fit?), but then the rest of the text makes it sound like it does support such partial training.

@anotherbugmaster (Contributor, Author) commented Jan 21, 2019

> Thanks for your patience, but we need to improve the docs significantly before we really promote this exciting new model addition.
>
> Still missing: clear numbers from a single benchmark (ideally Wikipedia, 3 numbers: RAM + time + reconstruction error/loss), and a TL;DR comparison to sklearn (same 3 calculated/estimated numbers, for a direct head-to-head).

Radim, as I wrote before, I can't run sklearn's NMF on Wikipedia (at least on my machine); it takes too much RAM. I can either run it on a smaller corpus (like 20-newsgroups) or compare NMF with some other model, LDA for example (though it wouldn't be completely fair to compare L2 here). Do you have any ideas on how I can implement the right benchmark?

> I don't know how else to say it, but we need a human-friendly TL;DR comparison of the NMF implementation in Gensim and other NMF implementations. The current nondescript table full of numbers and NaNs, in a notebook without comments, is insufficient.

Okay, I obviously need to revamp the notebooks and NMF's documentation. I'll try to do it this week.

> @anotherbugmaster Can you improve the parameter intuition too please? Enumerating the parameter names like h_r_max_iter tells me nothing. What are they for? What are their acceptable value ranges? When would I want to change them? How do they relate to each other? The API documentation under radimrehurek.com/gensim/models/nmf.html is similarly terse and frustrating (compare to sklearn NMF, Gensim SVD).

Sure. I think I'll add more info to the module docstrings and describe what the W, h and r matrices mean and how exactly the algorithm works.

Those parameters control the estimation and maximization steps of the algorithm. For example, h_r_max_iter is the maximum number of iterations for the estimation step, and h_r_stop_condition is the error value considered small enough to finish the step.

w_max_iter and w_stop_condition work the same way.
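
A toy illustration (not gensim's actual code) of how such a max_iter / stop_condition pair typically bounds an inner optimization loop:

```python
def iterate(step, loss, max_iter, stop_condition):
    """Run `step` until `loss` stops improving by more than `stop_condition`,
    but never more than `max_iter` times (the h_r_max_iter-style hard cap)."""
    prev = loss()
    for i in range(max_iter):
        step()
        curr = loss()
        if abs(prev - curr) < stop_condition:  # h_r_stop_condition-style early exit
            return i + 1
        prev = curr
    return max_iter

# Example: minimize f(x) = x^2 with plain gradient steps.
state = {"x": 10.0}
iters = iterate(step=lambda: state.update(x=state["x"] - 0.1 * 2 * state["x"]),
                loss=lambda: state["x"] ** 2,
                max_iter=100, stop_condition=1e-9)
print(iters, state["x"])
```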

> Try to see this from the user perspective please. Users are not going to decode academic papers or pore over the code just to understand what this model is supposed to do and how it differs from their other options. We have to provide a basic overview and intuition.

I see that a lot of things seem vague; I'll try to clear them up.

> Radim, to be clear, the Olivetti faces decomposition is included just to show that it's possible to extract latent components. The model is optimized for the case of sparse corpora, not dense image matrices.

> That wasn't clear at all from the notebook. In fact, "Olivetti faces" is not even introduced or described anywhere. As a reader, I don't know what I'm looking at, why, or what I'm supposed to be seeing there.

Fair enough. I can either elaborate more on this section, or we can remove it entirely so it doesn't confuse readers.

> I assume by 150 Mb you mean megabytes, right?

Yep, that's right.

> Does this model support merging partial models built from independent chunks, or not? I see you removed this sentence from my example text, which you used as a template (I completely made it up -- are you sure the algo descriptions fit?), but then the rest of the text makes it sound like it does support such partial training.

No, the model doesn't support merging partial models, and I have no idea how to implement that even in theory. Maybe "updating in pieces" is not a good description of the model's behavior; it's more that it updates iteratively, which means we need to go through the corpus sequentially, not build partial models and then merge them.

Yes, I get that you made the example up, but it's actually quite close to the truth; I fixed the parts where it wasn't.

@piskvorky (Owner) commented Jan 21, 2019

> I can't run sklearn's NMF on Wikipedia (at least on my machine); it takes too much RAM.

I understand, hence the word "estimated". By the way, how much RAM would be needed? Perhaps we can run it on one of our machines (64 GB RAM).

> I can either elaborate more on this section, or we can remove it entirely so it doesn't confuse readers.

I like that idea (showing a different, non-text use case/workflow); I'd prefer to keep it. Being visual always helps!

Expanding the high-level descriptions -- "what am I looking at and why, how is it different from others" -- is really what is needed here, across the board.

We went over the API docs with Ivan today, and we'll need to:

  • Add a module docstring with an overview, as per above (currently the docstring is missing).
  • Clarify the parameters, their relationships, ranges, and performance/quality implications. Docstrings like
    "sparse_coef (float, optional) – The more it is, the more sparse are matrices. Significantly increases performance." or
    "lambda (float, optional) – Residuals regularizer coefficient. Increasing it helps prevent ovefitting. Has no effect if use_r is set to False." (what is use_r? not documented) or
    "normalize (bool, optional) – Whether to normalize results. Offers "kind-of-probabilistic" result."
    are too vague and not actionable. We need to build user intuition more, not require users to study papers or code.
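
For instance, a hypothetical rewrite of the sparse_coef entry in the style being requested (the guidance and ranges below are invented for illustration, not verified against the implementation):

```python
# Hypothetical, more actionable docstring (values invented for illustration):
#
# sparse_coef : float, optional
#     Controls how aggressively small entries of the factor matrices are
#     zeroed out to keep them sparse. Larger values mean sparser factors
#     and faster updates, at the cost of reconstruction accuracy.
#     0 disables sparsification; increase it when your corpus is very
#     sparse (typical English text) and training speed matters more
#     than exact reconstruction.
```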

Thanks!

Labels: incubator project (PR is RaRe incubator project), interesting PR ⭐ (interesting PR topic, but not ready; needs much work to finish)