
Update sklearn API for Gensim models #1473

Merged

Conversation

@chinmayapancholi13 (Contributor) commented Jul 6, 2017

This PR does the following things:

  • Renames the scikit-learn wrapper classes for Gensim models as suggested here.
  • Updates the existing API by removing the base sklearn-API class and its set_params() & get_params() functions.
  • Updates the transform() implementation to use the sparse2full function.
  • Adds a testConsistencyWithGensimModel test for LdaModel.
  • Updates the corresponding ipynb notebook for these changes.

@menshikh-iv menshikh-iv added the breaks backward-compatibility Change breaks backward compatibility label Jul 7, 2017
from gensim.sklearn_integration.sklearn_wrapper_gensim_lsimodel import LsiTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_ldaseqmodel import LdaSeqTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_w2vmodel import W2VTransformer
from gensim.sklearn_integration.sklearn_wrapper_gensim_atmodel import AuthorTopicTransformer
Collaborator (@gojomo):

As long as we're making breaking renames/refactors for clarity/concision, these package names have a lot of redundancy, too. gensim.sklearn_integration.sklearn_wrapper_gensim_atmodel repeats both sklearn and gensim, and integration and wrapper have overlapping meanings.

So perhaps just: gensim.sklearn_wrappers.atmodel? And that's only if at is already easily recognized from use elsewhere as an abbreviation; otherwise gensim.sklearn_wrappers.authortopicmodel or even gensim.sklearn_wrappers.authortopic?

Or maybe even, given that these wrappers are each pretty slight in size, they could go either: (1) into the respective associated model file – that is, W2VTransformer goes into gensim.models.word2vec (and probably loses its abbreviation, to match the non-abbreviated names used in that file's classes); or (2) into a single sklearn_wrappers file, alongside the BaseSklearnWrapper?

(And regarding BaseSklearnWrapper – I'm not sure it adds much beyond the alternative of using sklearn's own BaseEstimator and TransformerMixin. It forces more rigor in overriding the necessary abstract methods, which sklearn itself never does, but has less functionality in set_params() and other aspects of introspectability.)
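
For concreteness, a minimal sketch (illustrative names, not this PR's code) of a wrapper that leans on sklearn's own base classes instead of a custom abstract base:

from sklearn.base import BaseEstimator, TransformerMixin

class ExampleTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical gensim-model wrapper built on sklearn's own bases."""

    def __init__(self, num_topics=100):
        # BaseEstimator introspects __init__ to supply get_params()/set_params().
        self.num_topics = num_topics
        self.gensim_model = None

    def fit(self, X, y=None):
        # Train the wrapped gensim model on X here.
        return self

    def transform(self, X):
        # Map documents in X to vectors here.
        return X

    # No partial_fit() at all: following sklearn convention, its absence
    # signals that incremental training is unsupported.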

Collaborator (@gojomo):

cc @piskvorky for overall library naming/organization priorities

Contributor:

I vote for from gensim.models.atmodel import AuthorTopicTransformer, but would agree with from gensim.sklearn_api import AuthorTopicTransformer too. It is neither a wrapper nor an integration, but just an API.

@piskvorky (Owner) commented Jul 7, 2017:

If we were to share the same module for gensim/sklearn classes, we'd have to tiptoe around the sklearn imports, because sklearn is not a dependency of gensim.

An extra subpackage sounds cleaner to me: one subpackage for sklearn / keras / spark / whatever API. Not imported automatically from gensim __init__, so that users need to import it explicitly, after installing sklearn/keras/spark/tensorflow/whatever.

And if the subpackage becomes too unwieldy or complex (not the case with sklearn now), a separate library would make sense to me too, to decouple the release and maintenance cycle.
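
For concreteness, a sketch (assumed code, not gensim's actual implementation) of the optional-dependency pattern this implies, using the sklearn_api subpackage name proposed below:

# Sketch of gensim/sklearn_api/__init__.py: sklearn is imported inside the
# subpackage only, so a plain `import gensim` never requires scikit-learn.
try:
    import sklearn  # noqa: F401
except ImportError:
    raise ImportError(
        "scikit-learn must be installed to use gensim.sklearn_api; "
        "run `pip install scikit-learn` first."
    )

Users then opt in explicitly (e.g. from gensim.sklearn_api import LdaTransformer) only after installing scikit-learn.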

Contributor:

I vote for from gensim.sklearn_api import AuthorTopicTransformer, with the subpackage sklearn_api.
We would not have sklearn as a dependency, but we would have a short and unique import path for the sklearn wrappers.

@gojomo gojomo requested review from piskvorky and removed request for piskvorky July 7, 2017 11:08
@@ -11,7 +11,7 @@
from abc import ABCMeta, abstractmethod


class BaseSklearnWrapper(object):
class BaseTransformer(object):
Collaborator (@gojomo):

With this refactoring... I'm again wondering, is this class even helpful? Its set_params() does less than the implementation that would be inherited from sklearn's own BaseEstimator. The other methods just serve to enforce that any subclass override certain methods – a rigor that is possibly useful, but that sklearn does not, itself, impose within its own superclass-hierarchy.

Contributor Author (@chinmayapancholi13):

@gojomo scikit-learn indeed does not impose such a hierarchy among its classes. However, as you stated above, in addition to not having to implement set_params() every time, we know which functions are to be implemented for each model. Also, if in the future we decide to refine/improve the set_params() function and make it more sophisticated (e.g. as in sklearn's BaseEstimator), we would only have to do it once in the base class rather than making the same change in the sklearn-api class of each Gensim model. Hence, having an abstract base class, and deriving the classes for Gensim models from it, would help us write new sklearn-api classes in the future as well.

Collaborator (@gojomo):

But, the existing set_params() looks worse than just doing nothing (and inheriting from BaseEstimator as all current subclasses of this class already do) – unless I'm missing something. So there's no need to plan for a future where "we decide to refine/improve" it, and do such improvements in the base class. We can make that decision now, and delete 8 lines of code from the base class, and all the subclasses automatically get the better & more standard (for sklearn) implementation of set_params().

And then, the only benefit of this shared superclass is forcing an error as a reminder of what to implement. But if sklearn doesn't do that itself, is it that important to do so? It's also over-enforcing – it is not a strict requirement that all transformers support partial_fit() - not every algorithm will necessarily support that.
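
To make the comparison concrete, a quick sketch (illustrative class, not this PR's code) of what inheriting from BaseEstimator alone already provides:

from sklearn.base import BaseEstimator

class Demo(BaseEstimator):
    def __init__(self, num_topics=100, alpha=0.1):
        # BaseEstimator derives get_params()/set_params() by introspecting
        # these __init__ keyword arguments.
        self.num_topics = num_topics
        self.alpha = alpha

demo = Demo()
print(demo.get_params())        # {'alpha': 0.1, 'num_topics': 100}
demo.set_params(num_topics=50)  # validates the name and returns self
# demo.set_params(bogus=1)      # would raise ValueError: invalid parameter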

@tmylk (Contributor) commented Jul 12, 2017:

In my head, I expect all sklearn models to have partial_fit, fit, and transform, and to be usable in a pipeline. We constantly get questions on the mailing list about how to update model X, for almost every sklearn algo. For word2vec and doc2vec, raise a NotImplementedError with an explanation: "use the gensim API update function, but it's not recommended as it hasn't been researched". And for RandomProj, throw explicitly, saying it can't do it because of algorithmic limitations.

Also, the set_params method shouldn't be abstract and shouldn't be defined in all the children – it's the same code. Having a common set_params makes the base class useful.

Collaborator (@gojomo):

OK, providing a place to include an error/warning is a potential reason to implement partial_fit() even where the underlying algorithm can't (or shouldn't casually) be used that way. BUT, simply not implementing a method is also a pretty good way of signalling that something isn't implemented – and indeed that seems to be what scikit-learn itself does. For that reason, it's even possible that other code will use introspection of whether an sklearn pipeline object implements partial_fit() as a test of support, in which case having an implemented method that just errors could be suboptimal.

But with set_params(), again, why have any lesser-featured implementation here, when every subclass (so far) inherits from scikit-learn's own BaseEstimator, which provides a more fully-featured implementation? Why wouldn't we want fewer limitations (and match the source library's behavior) for zero lines of code, rather than more limitations with more code?
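
A small sketch of the introspection pattern alluded to above (hypothetical helper name):

# Downstream code can probe for partial_fit to decide whether an estimator
# supports incremental training.
def supports_incremental_training(estimator):
    return hasattr(estimator, "partial_fit")

Under the scikit-learn convention (the method is simply absent), this returns False for non-online algorithms; a stub that merely raises NotImplementedError would wrongly return True.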


hi,

jumping in the conversation a little late here, but I do agree that the old BaseSklearnWrapper does very little, since:

  • the default get_params and set_params from sklearn's BaseEstimator are very reasonable and should only be overridden when necessary; if all the concrete wrapper classes inherit from BaseEstimator (as they do now), then there's no need to force them to re-implement these methods
  • @tmylk partial_fit is absolutely not a requirement for a class to be compatible with sklearn Pipeline, since lots of algorithms are not online in nature. If there is a particular algorithm that most people are confused about, then the warning should be raised for that class only.

Contributor:

I would struggle to find a gensim algorithm for which I haven't seen an "is it possible to update the model post-training?" question on the mailing list...

Collaborator (@gojomo):

Right, it's always asked, but not always supported. The scikit-learn convention for not supporting incremental training is to simply not implement partial_fit(). Why not adopt the same convention for scikit-learn compatible classes here?

@piskvorky (Owner) commented Jul 21, 2017:

I agree with @gojomo's point of view here. The point of these adapters is to assume the "philosophy" of another package, whatever structure or idiosyncrasies it may have. The fewest surprises and the least novelty for users of that package.

It is not the place of the adapters to introduce their own philosophy on class structures or missing methods or whatever.

Contributor Author (@chinmayapancholi13):

I have removed the BaseTransformer class now. Thanks for the useful feedback and suggestions. :)

@@ -83,4 +83,5 @@ def transform(self, words):
return np.reshape(np.array(X), (len(words), self.size))

def partial_fit(self, X):
raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer")
raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer since 'update()' function for Word2Vec class is experimental. "
Owner (@piskvorky):

Is this correct? Why would the update() method of word2vec be experimental?

Contributor:

Yes, it's correct.

@piskvorky (Owner) commented Jul 19, 2017:

I don't even see any update() method. Can you elaborate?

Contributor Author (@chinmayapancholi13):

@piskvorky The original error message was not correct (since there is no update() function in Word2Vec class as you correctly pointed out) and I have updated the message now. My apologies for the confusion.
This change was made in reference to this comment by @gojomo in one of the older PRs.

@piskvorky (Owner) commented Jul 20, 2017:

@chinmayapancholi13 thanks for the clarification.

So the concern is about initializing the vocabulary. That's a separate concern from incremental training, which is fully supported and not experimental (in fact, it's the same code as non-incremental training).

I don't think partial_fit is a high priority, but at the same time, users ask for it all the time, especially in combination with incremental vocab updates (which are indeed experimental).

So I see two directions (both non-critical, nice-to-have) here:

  1. Support incremental training with a fixed, pre-defined vocabulary:
  • implies finding a natural way to initialize the vocabulary beforehand
  • implies finding a natural way to control the learning rate (alpha) during the incremental calls.
  2. Support fully online training, including updating the vocabulary incrementally:
  • implies changing the w2v algo(s) to support this (hashing trick with fixed hash space? what would this do to HS? or do we not support HS in this online mode?)
  • still implies finding a way to control alpha
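
A rough sketch of direction 1 using the current gensim Word2Vec API (the sentence batches are placeholders and the alpha values arbitrary):

from gensim.models import Word2Vec

# first_batch / second_batch: placeholder lists of tokenized sentences.
model = Word2Vec(size=100, min_count=1)
model.build_vocab(first_batch)  # fix the vocabulary up front

model.train(first_batch, total_examples=len(first_batch), epochs=5,
            start_alpha=0.025, end_alpha=0.01)

# A later incremental call over the same fixed vocabulary; choosing sensible
# alphas for each such call is exactly the open question raised above.
model.train(second_batch, total_examples=len(second_batch), epochs=5,
            start_alpha=0.01, end_alpha=0.005)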

Contributor Author (@chinmayapancholi13):

Ah, I see. So the issue is actually "updating the vocab" (and not "training") being experimental. I have updated the error message in partial_fit for W2VTransformer accordingly.
Thanks for your explanation and suggestions about resolving this concern. I guess it would be nice to create an issue for this. :)

@@ -101,4 +83,4 @@ def transform(self, words):
return np.reshape(np.array(X), (len(words), self.size))

def partial_fit(self, X):
raise NotImplementedError("'partial_fit' has not been implemented for SklW2VModel")
raise NotImplementedError("'partial_fit' has not been implemented for W2VTransformer since updating vocabulary incrementally for Word2Vec model is experimental.")
@tmylk (Contributor) commented Jul 21, 2017:

"However the model can be updated with a fixed vocab using Gensim API call"

Contributor Author (@chinmayapancholi13):

Done

@@ -295,19 +292,6 @@ def testTransform(self):
self.assertEqual(transformed_vecs.shape[0], 1)
self.assertEqual(transformed_vecs.shape[1], self.model.num_topics)

def testSetGetParams(self):
@tmylk (Contributor) commented Jul 21, 2017:

Please rewrite these tests to check that the gensim model inside the wrapper is affected: check gensim_model after the fit call.
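
A hedged sketch of such a test, assuming (per the comment above) that the wrapper exposes the trained model as gensim_model and reusing the self.model / self.corpus fixtures:

def testSetGetParams(self):
    # Change a parameter through the sklearn API, refit, and verify the
    # change reached the wrapped gensim model.
    self.model.set_params(num_topics=3)
    self.model.fit(self.corpus)
    self.assertEqual(self.model.gensim_model.num_topics, 3)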

Contributor Author (@chinmayapancholi13):

Done.

Contributor:

Looks good. Thanks

@tmylk (Contributor) commented Jul 27, 2017:

@menshikh-iv LGTM please

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright (C) 2011 Radim Rehurek <radimrehurek@seznam.cz>
Owner (@piskvorky):

@chinmayapancholi13 please add yourself as the author. This is your baby :)

Contributor Author (@chinmayapancholi13):

Hehe. Ok! 😄

"""


from .ldamodel import LdaTransformer # noqa: F401
Owner (@piskvorky):

What is all that noqa: F401 for? Is it really necessary?

Contributor Author (@chinmayapancholi13):

A line carrying a noqa comment doesn't trigger flake8 errors. Writing noqa: F401 suppresses the flake8 error saying that we have imports (e.g. LdaTransformer) that aren't used anywhere in the file.
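
For reference, the directive is per-line and per-error-code; roughly:

# Without the comment, flake8 reports: F401 '.ldamodel.LdaTransformer' imported but unused
from .ldamodel import LdaTransformer  # noqa: F401  (intentional re-export)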

Owner (@piskvorky):

It feels wrong to litter the gensim code with such constructs – the gensim code is correct; this is essentially working around some idiosyncrasy (bug?) of an unrelated library.

By the way, how come we don't get these errors from all the other __init__ imports in gensim? Or do we? CC @menshikh-iv

Contributor (@menshikh-iv):

@piskvorky Because flake8 analyzes only the diff each time; if you change the same lines in any other __init__ file, you will get the same error.

@piskvorky (Owner) commented Jul 28, 2017:

OK, so the fake warnings will go away immediately? Why did we add these comments then? Let's remove them.

"metadata": {},
"source": [
"The wrappers available (as of now) are :\n",
"* LdaModel (```gensim.sklearn_api.ldaModel.SklLdaModel```),which implements gensim's ```LDA Model``` in a scikit-learn interface\n",
Owner (@piskvorky):

Haven't these been updated?

Contributor Author (@chinmayapancholi13):

Yes. I had changed the transformer names in the code in the ipynb but missed updating it here. Thanks a lot for pointing this out. :)

@chinmayapancholi13 chinmayapancholi13 changed the title Renaming sklearn wrapper classes [WIP] Updating sklearn API for Gensim models Jul 27, 2017
@menshikh-iv menshikh-iv changed the title [WIP] Updating sklearn API for Gensim models Update sklearn API for Gensim models Aug 10, 2017
@menshikh-iv menshikh-iv merged commit 718b1c6 into piskvorky:develop Aug 10, 2017
@tmylk (Contributor) commented Aug 17, 2017:

@menshikh-iv May I ask for a date of a new release with this new API?

@menshikh-iv (Contributor):

@tmylk ~ first half of September
