[WIP] Added sklearn wrapper for LDASeq model #1405

chinmayapancholi13 · 2017-06-09T09:28:08Z

This PR adds a scikit-learn wrapper for Gensim's LDASeq model.

menshikh-iv · 2017-06-13T15:04:59Z

gensim/sklearn_integration/sklearn_wrapper_gensim_ldaseqmodel.py

+        """
+        Sklearn wrapper for LdaSeq model. Class derived from gensim.models.LdaSeqModel
+        """
+        self.corpus = None


Why do you need a field for the corpus?

@menshikh-iv In my opinion, the user might be interested to know about the corpus used for training the model (using the get_params function). Should we continue to store this value?

@chinmayapancholi13 No, sklearn does not store X, so we should not either.

@menshikh-iv Yes, that is true for sklearn. Removing corpus attribute from all the wrappers then.
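
The convention agreed on here can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: `ToyWrapper` is a hypothetical class and its "training" step is a placeholder. The point it shows is that `fit` consumes `X` without keeping a reference to it, so `get_params` reports hyperparameters only.

```python
class ToyWrapper(object):
    """Hypothetical stand-in for a wrapper; not the real SklLdaSeqModel."""

    def __init__(self, num_topics=10):
        self.num_topics = num_topics
        self.gensim_model = None  # set by fit()

    def get_params(self, deep=True):
        # Only constructor hyperparameters are reported, never the data.
        return {"num_topics": self.num_topics}

    def fit(self, X, y=None):
        # X is used for training but is not stored on the instance.
        self.gensim_model = {"n_docs_seen": len(X)}  # placeholder for a trained model
        return self


wrapper = ToyWrapper(num_topics=2).fit([[(0, 1)], [(1, 2)]])
attribute_names = sorted(vars(wrapper))  # no 'corpus' attribute among them
```

A user who needs the training corpus later still has the object they passed to `fit`, so nothing is lost by not duplicating it on the estimator.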

menshikh-iv · 2017-06-13T15:05:34Z

gensim/sklearn_integration/sklearn_wrapper_gensim_ldaseqmodel.py

+        Sklearn wrapper for LdaSeq model. Class derived from gensim.models.LdaSeqModel
+        """
+        self.corpus = None
+        self.model = None


Please make this field "private" (start its name with underscores).

menshikh-iv · 2017-06-13T15:06:15Z

gensim/sklearn_integration/sklearn_wrapper_gensim_ldaseqmodel.py

+                initialize='gensim', sstats=None,  lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10,
+                random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)
+        """
+        self.corpus = X


There is no need to save X.

menshikh-iv · 2017-06-13T15:09:05Z

gensim/sklearn_integration/sklearn_wrapper_gensim_ldaseqmodel.py

+        """
+        Fit the model according to the given training data.
+        Calls gensim.models.LdaSeqModel:
+        >>> gensim.models.LdaSeqModel(corpus=None, time_slice=None, id2word=None, alphas=0.01, num_topics=10,


Please remove this >>> ... block; this example does not help a new user.

@menshikh-iv Should we remove this >>> .... statement in all the model wrappers? This line basically tells us how the associated Gensim model is actually called.

You just need to specify the class that is used (you have already done above) and write where a user can read the documentation.

menshikh-iv · 2017-06-16T03:47:05Z

gensim/sklearn_integration/sklearn_wrapper_gensim_ldaseqmodel.py

+            em_min_iter=self.em_min_iter, em_max_iter=self.em_max_iter, chunksize=self.chunksize)
+        return self
+
+    def transform(self, docs):


Check the case where you create an instance and call transform immediately (without fit); you need to raise an exception, as sklearn does.

Also, please add an example of docs param in docstring.

@menshikh-iv For checking if the model has been fitted, would it be a good idea to check if self.gensim_model is None or not? This approach would clearly give an error when fit hasn't been called before calling transform but this also allows the user to set the value of self.gensim_model through set_params function (or even as wrapper.gensim_model=...) and then call transform function, which makes sense for us to allow.

I completely forgot about set_params. I think that if you disable gensim_model in set_params, you can check whether the model is None (this does not cover all cases, but it covers the most obvious ones).

Could you elaborate on what "disabling" the gensim_model param in set_params means?
Actually, gensim_model is a public attribute of the model, so it can be set like ldaseq_wrapper.gensim_model = some_model, which is almost the same as using set_params to set this value. So checking whether self.gensim_model is None should be enough, right?
This would look like:

    def transform(self, docs):
        """
        Return the topic proportions for the documents passed.
        """
        if self.gensim_model is None:
            raise NotFittedError("This model has not been fitted yet. Call 'fit' with appropriate arguments before using this method.")
        # The input as array of array
        check = lambda x: [x] if isinstance(x[0], tuple) else x
        ..........................................................................

Ok, as a temporary option.
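
A self-contained version of the check discussed above might look like the sketch below. `ToyTransformer` and `raises_not_fitted` are hypothetical names, the output of `transform` is a placeholder rather than real topic proportions, and the fallback exception class exists only so the example runs without scikit-learn installed.

```python
try:
    from sklearn.exceptions import NotFittedError
except ImportError:
    # Fallback so the sketch runs without scikit-learn installed.
    class NotFittedError(ValueError, AttributeError):
        pass


class ToyTransformer(object):
    """Hypothetical wrapper; gensim_model stays None until fit() is called."""

    def __init__(self):
        self.gensim_model = None

    def fit(self, X, y=None):
        self.gensim_model = object()  # placeholder for a trained model
        return self

    def transform(self, docs):
        if self.gensim_model is None:
            raise NotFittedError(
                "This model has not been fitted yet. "
                "Call 'fit' with appropriate arguments before using this method."
            )
        # Accept a single bag-of-words document as well as a list of documents.
        if docs and isinstance(docs[0], tuple):
            docs = [docs]
        return [len(doc) for doc in docs]  # placeholder output, one value per doc


def raises_not_fitted(model, docs):
    """Return True if transform() raises NotFittedError for this model."""
    try:
        model.transform(docs)
    except NotFittedError:
        return True
    return False
```

As noted in the thread, this check only covers the obvious case: assigning any non-None value to `gensim_model` directly bypasses it.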

menshikh-iv · 2017-06-16T03:48:54Z

gensim/sklearn_integration/sklearn_wrapper_gensim_ldaseqmodel.py

+        return np.reshape(np.array(X), (len(docs), self.num_topics))
+
+    def partial_fit(self, X):
+        raise NotImplementedError("'partial_fit' has not been implemented for the LDA Seq model")


LDA Seq model -> SklLdaSeqModel

menshikh-iv · 2017-06-16T03:49:50Z

gensim/test/test_sklearn_integration.py

+        for key in param_dict.keys():
+            self.assertEqual(model_params[key], param_dict[key])
+
+


Please add a persistence test using pickle.

Also add a test that uses a Pipeline.

menshikh-iv · 2017-06-19T08:31:44Z

gensim/test/test_sklearn_integration.py

+        score = text_ldaseq.score(corpus, test_target)
+        self.assertGreater(score, 0.50)
+
+    def testPersistence(self):


This is only a sanity check.
For persistence, you need to compare the current and loaded models. To do that, either compare their inner matrices, or take a corpus, transform it with both models, and compare the results.

Thanks. I have now added code for comparing the vectors transformed from original and loaded models, in addition to this sanity check. :)
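
The comparison described above can be sketched roughly as follows. This is not the test from the PR: the "model" here is a plain dict of hypothetical topic weights and `transform` is a toy function, chosen so the example runs with the standard library alone. The idea it shows is to round-trip the model through pickle and then compare both the inner state and the transformed output.

```python
import pickle


def transform(model, docs):
    """Toy transform: scale each document's total count by every topic weight."""
    return [[w * sum(count for _, count in doc) for w in model["weights"]]
            for doc in docs]


corpus = [[(0, 1), (1, 2)], [(2, 3)]]
model = {"weights": [0.25, 0.75]}  # hypothetical stand-in for a fitted model

# Round-trip through pickle, then compare behaviour, not just loadability.
loaded = pickle.loads(pickle.dumps(model))
same_state = model["weights"] == loaded["weights"]
same_output = transform(model, corpus) == transform(loaded, corpus)
```

With real floating-point matrices, an approximate comparison (for example NumPy's `allclose`) is usually a safer choice than exact equality.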

menshikh-iv · 2017-06-19T08:32:55Z

gensim/test/test_sklearn_integration.py

+        text_ldaseq = Pipeline((('features', model,), ('classifier', clf)))
+        text_ldaseq.fit(corpus, test_target)
+        score = text_ldaseq.score(corpus, test_target)
+        self.assertGreater(score, 0.50)


Will this be correct every time? Don't you need to fix the seeds for reproducibility?

We now have a fixed seed which is set before the test testPipeline to ensure that we get similar values.
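
Why a fixed seed makes a threshold assertion stable can be illustrated with a small sketch. `noisy_score` is a hypothetical stand-in for a pipeline score that depends on random state; it uses the standard library's `random.Random`, which is an assumption made for this example and not necessarily the generator the test seeds.

```python
import random


def noisy_score(seed):
    """Hypothetical stand-in for a score that depends on random state."""
    rng = random.Random(seed)  # seeded generator, isolated from global state
    return sum(rng.random() for _ in range(100)) / 100.0


# The same seed yields the same score on every run, so an assertion
# against a fixed threshold cannot flip between passing and failing.
first = noisy_score(seed=0)
second = noisy_score(seed=0)
```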

menshikh-iv · 2017-06-20T17:19:27Z

Thank you @chinmayapancholi13 👍

chinmayapancholi13 added 7 commits June 9, 2017 02:25

added new file for LDASeq model's sklearn wrapper

73cd770

PEP8 changes

4744c7b

added 'transform' and 'partial_fit' methods

d79f125

added unit_tests for ldaseq model

07efa33

PEP8 changes

d73838e

PEP8 changes

6e57c5f

refactored code acc. to composite design pattern

c969c8b

menshikh-iv suggested changes Jun 13, 2017


This was referenced Jun 13, 2017

[MRG] Added sklearn wrapper for AuthorTopic model #1403

Merged

[WIP] Changes in sklearn wrappers for LDA and LSI models #1398

Merged

[WIP] Sklearn wrapper for RandomProjections Model #1395

Merged

chinmayapancholi13 added 6 commits June 14, 2017 00:47

refactored wrapper and tests

8b0cced

removed 'self.corpus' attribute

ea9922e

updated 'self.__model' to 'self.gensim_model'

8f88a10

updated 'fit' and 'transform' functions

4f33248

updated 'testTransform' test

8aa6898

updated 'testTransform' test

77a8672

menshikh-iv suggested changes Jun 16, 2017


chinmayapancholi13 added 6 commits June 16, 2017 00:02

added 'NotFittedError' in 'transform' function

ad895a2

added 'testPersistence' and 'testModelNotFitted' tests

6f9929a

added description for 'docs' in docstring of 'transform'

05b63e3

added 'testPipeline' test

3452e80

PEP8 change

492fbc6

replaced 'text_lda' variable with 'text_ldaseq'

dec60e1

menshikh-iv reviewed Jun 19, 2017


chinmayapancholi13 added 2 commits June 19, 2017 03:10

updated 'testPersistence' test

fd5fc90

set fixed seed in 'testPipeline' test

e041431

menshikh-iv approved these changes Jun 20, 2017


menshikh-iv merged commit 477a3a3 into piskvorky:develop Jun 20, 2017
