[WIP] Adding sklearn wrapper for LDA code #932

AadityaJ · 2016-10-10T07:48:16Z

Creating a wrapper that exposes gensim's LdaModel through the scikit-learn API.

tmylk · 2016-10-10T09:07:30Z

Please add tests, a changelog and a quick draft of an ipynb on how to use it with sklearn api.

devashishd12 · 2016-10-10T09:10:10Z

Wouldn't it be useful to also add a partial_fit which wraps update?

tmylk · 2016-10-10T09:30:26Z

Yes, update is on the road map
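A `partial_fit` that simply forwards to the model's incremental `update`, as suggested above, might look roughly like the sketch below. `CountingModelStub` and `IncrementalWrapper` are hypothetical stand-ins so the snippet runs without gensim installed; a real wrapper would presumably delegate to `LdaModel.update` instead.

```python
class CountingModelStub(object):
    """Hypothetical stand-in for a model with an incremental update() method."""

    def __init__(self):
        self.docs_seen = 0

    def update(self, corpus):
        # A real topic model would refine its parameters here.
        self.docs_seen += len(corpus)


class IncrementalWrapper(object):
    """Hypothetical scikit-learn style wrapper around an updatable model."""

    def __init__(self):
        self.model = None

    def fit(self, X, y=None):
        # Train from scratch: discard any previous state.
        self.model = CountingModelStub()
        self.model.update(X)
        return self

    def partial_fit(self, X, y=None):
        # Continue training; create the model on first use.
        if self.model is None:
            self.model = CountingModelStub()
        self.model.update(X)
        return self


wrapper = IncrementalWrapper().fit([[(0, 1)], [(1, 2)]])
wrapper.partial_fit([[(0, 3)]])
print(wrapper.model.docs_seen)
```

The design point is that `fit` resets state while `partial_fit` keeps it, and both return `self` so calls can be chained.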

piskvorky · 2016-10-21T02:13:23Z

gensim/sklearn_integration/base.py

+            self.setattr(parameter, value)
+        return self
+
+    def fit(self,X,y=None):


PEP8: Spaces after commas.

Suggestions have been incorporated. Thanks.

piskvorky · 2016-10-21T02:13:44Z

gensim/sklearn_integration/base.py

+        """
+        if X is None:
+            raise AttributeError("Corpus defined as none")
+        self.lda_model = gensim.models.LdaModel(corpus=X,num_topics=self.n_topics, id2word=self.id2word, passes=self.passes,


Code style: we don't use vertical indent in gensim. Use hanging indent.

(plus, spaces & commas again)
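For readers unfamiliar with the two styles named here, the following sketch contrasts them. `build_model` is a hypothetical stand-in for a constructor with many keyword arguments; it is not gensim code.

```python
def build_model(**params):
    """Hypothetical stand-in for a constructor with many keyword arguments."""
    return params


# Vertical alignment (the style objected to in this review):
# continuation lines are aligned with the opening parenthesis.
vertical = build_model(corpus=[], num_topics=2,
                       id2word={}, passes=10)

# Hanging indent: nothing follows the opening parenthesis, and the
# continuation lines are indented by a single level.
hanging = build_model(
    corpus=[], num_topics=2,
    id2word={}, passes=10
)

print(vertical == hanging)
```

Both calls are equivalent at runtime; the difference is purely layout, and the hanging form survives renaming the function without re-aligning every continuation line.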

piskvorky · 2016-10-21T02:14:31Z

gensim/test/test_sklearn_integration.py

+corpus = [dictionary.doc2bow(text) for text in texts]
+
+
+class TestLdaModel:


All Python classes should inherit from object (new-style classes).
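As background: in Python 2, a class declared without a base is an old-style class, and inheriting from `object` makes it new-style; in Python 3 both spellings produce the same kind of class. The class names in this small illustration are hypothetical.

```python
class ImplicitBase:
    # Old-style on Python 2, ordinary class on Python 3.
    pass


class ExplicitBase(object):
    # New-style on both Python 2 and Python 3.
    pass


# A new-style class is an instance of `type` and has `object` in its MRO.
print(type(ExplicitBase) is type)
print(object in ExplicitBase.__mro__)
```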

piskvorky · 2016-10-21T02:14:44Z

gensim/test/test_sklearn_integration.py

+            self.assertTrue(isinstance(v, float))
+
+if __name__ == '__main__':
+    unittest.main()


Missing newline at the end of file.

piskvorky · 2016-10-25T23:59:25Z

gensim/sklearn_integration/base.py

-                                                eta=self.eta,random_state=self.random_state)
-        return  self.lda_model
+        self.lda_model = gensim.models.LdaModel(
+                         corpus=X, num_topics=self.n_topics, id2word=self.id2word, passes=self.passes,


Indent too large (should be a single level).

piskvorky · 2016-10-26T00:00:03Z

gensim/sklearn_integration/base.py

-        # might need to do more
-    def get_term_topics(self,wordid,minimum_probability=None):
+        return self.lda_model.get_document_topics(bow, minimum_probability=minimum_probability,
+                                                  minimum_phi_value=minimum_phi_value, per_word_topics=per_word_topics)


No vertical indent in gensim; we use hanging indent (see PEP8 for examples).

tmylk · 2016-10-26T07:26:13Z

gensim/test/test_sklearn_integration.py

+        self.model=base.LdaModel(id2word=dictionary,n_topics=2,passes=100)
+        self.model.fit(corpus)
+
+    def testPrintTopic(self):


Please add a partial_fit test

tmylk · 2016-12-19T17:55:25Z

gensim/sklearn_integration/base.py

-
-
-class LdaModel(object):
+class LdaModel(models.LdaModel,object):


Please rename to SklearnWrapperLdaModel

tmylk

Please add a sklearn pipeline example with a classifier
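A pipeline of the kind requested here could be sketched as follows. This is only an illustration under stated assumptions: scikit-learn's own `LatentDirichletAllocation` stands in for the gensim wrapper (whose module and class names were still changing in this thread), the documents and labels are toy data, and the `n_components` parameter assumes scikit-learn 0.19 or later.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, made up for this sketch.
docs = [
    "graph tree node path cycle",
    "tree path graph node",
    "node cycle graph tree",
    "path graph cycle node",
    "computer system compiler program",
    "system program computer",
    "compiler computer system",
    "program compiler system computer",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("counts", CountVectorizer()),  # raw text -> sparse term counts
    ("topics", LatentDirichletAllocation(n_components=2, random_state=0)),  # stand-in topic model
    ("clf", LogisticRegression()),  # classifier trained on topic proportions
])

pipeline.fit(docs, labels)
predictions = pipeline.predict(docs)
print(predictions.shape)
```

The classifier here is fitted on the topic model's output rather than on the raw counts, which is the arrangement a wrapper with a corpus-level `transform` would make possible.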

tmylk · 2016-12-25T10:39:11Z

CHANGELOG.md

-
+* Added sklearn wrapper for LdaModel (Basic LDA Model) along with relevant test cases and ipynb draft. (@AadityaJ,
+[#932](https://github.com/RaRe-Technologies/gensim/pull/932))
+* Add online learning feature to word2vec. (@isohyt [#900](https://github.com/RaRe-Technologies/gensim/pull/900))


Please resolve merge conflicts. Only one line should be added to changelog. Remove extra 2 lines about other changes.

Please merge in develop branch to remove merge conflicts

tmylk · 2016-12-25T10:39:54Z

gensim/sklearn_integration/base.py

+                 eval_every=10, iterations=50, gamma_threshold=0.001,
+                 minimum_probability=0.01, random_state=None):
+        """
+        sklearn wrapper for LDA model.derived class for gensim.model.LdaModel


Space after "."

tmylk · 2016-12-25T10:41:40Z

gensim/sklearn_integration/base.py

+from gensim import models
+
+
+class SklearnWrapperLdaModel(models.LdaModel,object):


Rename the python file to "SklearnWrapperGensimLdaModel"

tmylk · 2016-12-25T10:52:55Z

docs/notebooks/sklearn_wrapper.ipynb

+   "metadata": {},
+   "source": [
+    "The wrapper available (as of now) are :\n",
+    "* LdaModel (```gensim.sklearn_integration.base.LdaModel```),which implements gensim's ```LdaModel``` in a scikit-learn interface"


Please update ipynb with new names of .py file and of the class

tmylk · 2016-12-25T10:55:10Z

gensim/test/test_sklearn_integration.py

+        dictionary_up = Dictionary(texts_update)
+        corpus_up = [dictionary_up.doc2bow(text) for text in texts_update]
+        self.model.partial_fit(corpus_up)
+        self.testPrintTopic()


add a test that checks that the values changed after the update
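One way to write such a check is sketched below. `RunningMeanStub` is a hypothetical stand-in for a model whose state is mutated in place by `update`; the point of the sketch is the snapshot-by-value before the update, not the model itself.

```python
import copy
import unittest


class RunningMeanStub(object):
    """Hypothetical stand-in for a model whose state is mutated in place by update()."""

    def __init__(self):
        self.weights = [0.0, 0.0]
        self.n = 0

    def update(self, rows):
        for row in rows:
            self.n += 1
            for i, value in enumerate(row):
                # Incremental mean of everything seen so far.
                self.weights[i] += (value - self.weights[i]) / self.n


class TestUpdateChangesModel(unittest.TestCase):
    def test_update_changes_weights(self):
        model = RunningMeanStub()
        model.update([[1.0, 0.0]])
        # Snapshot by value: update() mutates the list in place, so keeping
        # only a reference would make the comparison below trivially equal.
        before = copy.deepcopy(model.weights)
        model.update([[0.0, 1.0]])
        self.assertNotEqual(before, model.weights)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestUpdateChangesModel)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With a real topic model the comparison would more likely use a numeric tolerance on the topic matrix, but the copy-before-update step applies either way.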


tmylk · 2016-12-27T12:40:17Z

gensim/test/test_sklearn_integration.py

+corpus = [dictionary.doc2bow(text) for text in texts]
+
+
+class TestLdaModel(unittest.TestCase):


Please rename the tests to TestSklearnLDAWrapper

piskvorky

Good start!

I didn't review the notebook but the code is in the right direction.

Will need some code style & language polishing before merging (punctuation, missing/extra whitespace, capitalization...).

piskvorky · 2017-01-06T11:58:25Z

CHANGELOG.md

@@ -5,6 +5,10 @@ Unreleased:

 None

+0.13.5, 2016-12-31
+
+* Added sklearn wrapper for LdaModel (Basic LDA Model) along with relevant test cases and ipynb draft. (@AadityaJ,[#932](https://github.com/RaRe-Technologies/gensim/pull/932))


What is "Basic LDA Model"?

piskvorky · 2017-01-06T11:59:36Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

@@ -0,0 +1,116 @@
+#!/usr/bin/env python


Not a good filename; please use lower case, with underscores _ to separate expressions where necessary.

piskvorky · 2017-01-06T12:00:22Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

+from scipy.sparse.csr import csr_matrix
+
+
+class SklearnWrapperLdaModel(models.LdaModel,object):


PEP8: space after comma.

Actually, not relevant at all, because LdaModel already inherits from object naturally.
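The redundancy pointed out here can be seen from the method resolution order. In this sketch `Parent` is a hypothetical stand-in for `models.LdaModel`, which already derives from `object`.

```python
class Parent(object):
    """Hypothetical stand-in for a base class that already inherits from object."""


class WithExtraObject(Parent, object):
    pass


class WithoutExtraObject(Parent):
    pass


# Apart from the class itself, both resolve to the same chain: Parent, then object.
print(WithExtraObject.__mro__[1:] == WithoutExtraObject.__mro__[1:])
```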

piskvorky · 2017-01-06T12:00:37Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

+    """
+    Base LDA module
+    """
+    def __init__(self, corpus=None, num_topics=100, id2word=None,


Code style: no vertical indent.

piskvorky · 2017-01-06T12:00:47Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

+        """
+        if self.corpus:
+            models.LdaModel.__init__(
+                                     self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word,


No vertical indent.

piskvorky · 2017-01-06T12:07:48Z

gensim/test/test_sklearn_integration.py

+from gensim import matutils
+
+texts = [['complier', 'system', 'computer'],
+ ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],


Incorrect indentation.

piskvorky · 2017-01-06T12:11:42Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

+        self.corpus = corpus
+        self.num_topics = num_topics
+        self.id2word = id2word
+        self.distributed = distributed


I don't think stuff like distributed would really work in sklearn. Same with training or storing very large models (sklearn makes lots of deep object copies internally).

piskvorky · 2017-01-06T12:12:20Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

+        self.gamma_threshold = gamma_threshold
+        self.minimum_probability = minimum_probability
+        self.random_state = random_state
+        """


Use normal # code comments (not docstring """ comments).
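The distinction matters because a triple-quoted string in the middle of a function body is not a comment: it is parsed as an expression statement, whereas a `#` comment never reaches the syntax tree. The sketch below shows this with the standard `ast` module; it assumes Python 3.8 or later (where string literals are parsed as `ast.Constant`), and the `configure` function is made up for illustration.

```python
import ast

SOURCE = '''
def configure(value):
    """Store the value."""
    stored = value
    """
    This string is NOT a comment: it is parsed as an expression statement.
    """
    # This is a real comment and never reaches the syntax tree.
    return stored
'''

tree = ast.parse(SOURCE)
func = tree.body[0]

# Collect every statement in the function body that is just a string literal.
string_statements = [
    node for node in func.body
    if isinstance(node, ast.Expr)
    and isinstance(node.value, ast.Constant)
    and isinstance(node.value.value, str)
]

# Only the first string is the docstring; the later one is a stray statement.
print(ast.get_docstring(func))
print(len(string_statements))
```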

piskvorky · 2017-01-06T12:13:08Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

+        Warnings: Must for sklearn API. Do not Remove.
+        """
+        if isinstance(X, csr_matrix):
+            self.corpus = matutils.Sparse2Corpus(X)


What about other cases of X? Numpy array? List?

I used csr matrix since it is the return type of fit_transform for CountVectorizer as well as other vectorizers (TfidfVectorizer, DictVectorizer)
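Handling the other input types raised in this exchange could look something like the sketch below. `rows_to_bow` is a hypothetical helper, not part of gensim or of this PR; it assumes rows are documents (the layout scikit-learn vectorizers produce) and uses `scipy.sparse.issparse` so that any sparse format is accepted, not only `csr_matrix`.

```python
import numpy as np
from scipy import sparse


def rows_to_bow(X):
    """Convert a document-term matrix (rows = documents) to a list of BoW documents."""
    if sparse.issparse(X):
        X = X.tocsr()
        corpus = []
        for i in range(X.shape[0]):
            row = X.getrow(i)
            # indices/data hold the column ids and values of the stored entries.
            corpus.append([(int(j), float(v)) for j, v in zip(row.indices, row.data)])
        return corpus
    if isinstance(X, np.ndarray):
        # Dense 2-D array: keep only the non-zero columns of each row.
        return [[(int(j), float(row[j])) for j in np.flatnonzero(row)] for row in X]
    # Otherwise assume X is already an iterable of BoW documents.
    return [list(doc) for doc in X]


dense = np.array([[1, 0, 2], [0, 3, 0]])
as_sparse = rows_to_bow(sparse.csr_matrix(dense))
as_dense = rows_to_bow(dense)
as_list = rows_to_bow([[(0, 1.0), (2, 2.0)], [(1, 3.0)]])
print(as_sparse == as_dense == as_list)
```

In the PR itself the sparse case is handled with `matutils.Sparse2Corpus`; the manual loop here only keeps the sketch free of a gensim dependency.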

piskvorky · 2017-01-06T12:14:45Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

+        Return topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.
+        Warnings: Must for sklearn API. Do not Remove.
+        """
+        return self.get_document_topics(


This doesn't look right -- transform accepts a corpus (~sequence or array of multiple examples), not a single document (~one example).
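A corpus-level `transform` along the lines described here might be sketched as follows. Everything in it is illustrative: `infer_topics` is a hypothetical callable mapping one bag-of-words document to a list of `(topic_id, probability)` pairs, and `fake_infer` is a made-up stand-in for real inference.

```python
def topics_to_matrix(per_doc_topics, num_topics):
    """Turn per-document [(topic_id, prob), ...] lists into one dense row per document."""
    matrix = []
    for doc_topics in per_doc_topics:
        row = [0.0] * num_topics
        for topic_id, prob in doc_topics:
            row[topic_id] = prob
        matrix.append(row)
    return matrix


def transform(corpus, infer_topics, num_topics):
    # One output row per input document, as scikit-learn transformers return.
    return topics_to_matrix([infer_topics(doc) for doc in corpus], num_topics)


def fake_infer(doc):
    # Made-up rule standing in for real inference.
    if doc and doc[0][0] == 0:
        return [(0, 0.75), (1, 0.25)]
    return [(1, 1.0)]


corpus = [[(0, 2)], [(3, 1)], [(0, 1), (3, 1)]]
matrix = transform(corpus, fake_infer, num_topics=2)
print(matrix)
```

Returning a fixed-width row per document, with zeros for omitted topics, is what lets a downstream estimator consume the output directly.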

AadityaJ · 2017-01-06T18:18:54Z

Thanks for the comments and suggestions. I'll make some of the changes today and the rest within a couple of days.

tmylk · 2017-01-08T14:30:23Z

@AadityaJ About the RST files. Here is an example of what needs to be added. Here is how to re-generate API ref and make sure that new files are generated.

tmylk · 2017-01-08T14:36:32Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

-                                     eval_every=self.eval_every, iterations=self.iterations,
-                                     gamma_threshold=self.gamma_threshold, minimum_probability=self.minimum_probability,
-                                     random_state=self.random_state)
+                                self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word,


Too much indentation. Hanging indent would have the line starting one indent more than line 46

tmylk · 2017-01-08T14:36:47Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

-                                eval_every=self.eval_every, iterations=self.iterations,
-                                gamma_threshold=self.gamma_threshold, minimum_probability=self.minimum_probability,
-                                random_state=self.random_state)
+                            self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word,


Too much indent as above

tmylk · 2017-01-08T14:37:43Z

gensim/sklearn_integration/SklearnWrapperGensimLdaModel.py

@@ -98,8 +98,8 @@ def transform(self, bow, minimum_probability=None, minimum_phi_value=None, per_w
        Returns the topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.
        """
        return self.get_document_topics(
-                                        bow, minimum_probability=minimum_probability,
-                                        minimum_phi_value=minimum_phi_value, per_word_topics=per_word_topics)
+                                    bow, minimum_probability=minimum_probability,


Too much indent, as above.

tmylk · 2017-01-10T12:47:07Z

@AadityaJ Please add the new class to gensim/docs/src/apiref.rst

tmylk · 2017-02-14T22:44:12Z

@AadityaJ Could you please update the tutorial for clf.fit to use the output of LDA? Currently it is using sklearn's CountVectorizer in X as input.

adding basic sklearn wrapper for LDA code

08f417c

AadityaJ added 3 commits October 11, 2016 15:16

updating changelog

61a6f8c

adding test case,adding id2word,deleting showtopics

66be324

adding relevant ipynb

cffa95b

tmylk mentioned this pull request Oct 18, 2016

Sklearn wrapper integration #916

Closed

adding transfrom and other get methods and modifying print_topics

10badc6

piskvorky requested changes Oct 21, 2016


AadityaJ added 3 commits October 21, 2016 13:36

stylizing code to follow conventions

62a4d2f

removing redundant default argumen values

b7eff2d

adding partial_fit

2a193fd

piskvorky reviewed Oct 25, 2016


piskvorky reviewed Oct 26, 2016


tmylk reviewed Oct 26, 2016


AadityaJ added 3 commits December 9, 2016 20:01

adding a line in test_sklearn_integration

a32f8dc

using LDAModel as Parent Class

a048ddc

adding docs, modifying getparam

ac1d28e

tmylk reviewed Dec 19, 2016


AadityaJ added 3 commits December 20, 2016 00:29

changing class name.Adding comments

0d6cc0a

adding test case for update and transform

5d8c1a6

adding init

894784c

tmylk suggested changes Dec 25, 2016


AadityaJ added 3 commits December 26, 2016 17:17

updating changes,fixed typo and changing file name

7a5ca4b

deleted base.py

b35baba

adding better testPartialFit method and minor changes due to change in class

13a136d

tmylk reviewed Dec 27, 2016


AadityaJ added 2 commits December 30, 2016 19:08

change name of test class

682f045

adding changes in classname to ipynb

9fda951

adding rst file

a3895b5

piskvorky requested changes Jan 6, 2017


AadityaJ added 3 commits January 6, 2017 20:29

removed "basic" , added rst update to log

f832737

changing indentation in texts

bc352a0

added file preamble, removed unnecessary space

7cc39da

AadityaJ added 7 commits January 7, 2017 00:10

following more pep8 conventions

0ba233c

removing unnecessary comments

e23a8a4

changing isinstance csr_matrix to issparse

041a32e

changed to hanging indentation

e7120f0

changing main filename

8a0950d

changing module name in test

bd8bced

updating ipynb with main filename

bb5872b

tmylk reviewed Jan 8, 2017


AadityaJ added 6 commits January 8, 2017 20:14

changed class name

777576e

changed file name

e50c3f9

fixing filename typo

e521269

adding html file

51931fa

deleting html file

7ba30d6

vertical indentation fixes

82d1fdc

piskvorky changed the title ~~[WIP]Adding sklearn wrapper for LDA code~~ [WIP] Adding sklearn wrapper for LDA code Jan 10, 2017

adding file to apiref.rst

4f3441e

tmylk merged commit 0e0c082 into piskvorky:develop Jan 29, 2017

tmylk mentioned this pull request Jan 29, 2017

Support sklearn pipeline interface. Continuing #932. #1123

Closed

kris-singh mentioned this pull request Mar 17, 2017

Fix Pipeline #1213

Merged

