Allow passing of sparse matrices #29

Closed
msjgriffiths opened this issue Nov 17, 2015 · 10 comments

Comments

@msjgriffiths

In some situations, it's easier to pass a scipy.sparse matrix object to sklearn model objects. This would reduce the memory requirements when fitting larger datasets.

Most sklearn models will accept a sparse matrix. For those that do not, checking for sparsity in the method call and calling matrix.todense() would work.
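For reference, a minimal sketch of what that sparsity check could look like. `ensure_dense` is a hypothetical helper name, not something that exists in TPOT or scikit-learn, and it uses `toarray()` so the result is a plain numpy array:

```python
import numpy as np
from scipy import sparse


def ensure_dense(X):
    """Return X as a dense numpy array, converting only when X is sparse."""
    if sparse.issparse(X):
        return X.toarray()
    return np.asarray(X)


# A sparse input is converted; a dense input passes through unchanged.
X_sparse = sparse.csr_matrix(np.eye(3))
X_dense = ensure_dense(X_sparse)
```

A model wrapper that cannot handle sparse input could call this on `X` at the top of its `fit` and `predict` methods.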

@rhiever
Contributor

rhiever commented Nov 17, 2015

Do you know how this data format interacts with pandas? Right now, TPOT uses pandas to pass around the data sets between pipeline operators.

@rasbt
Contributor

rasbt commented Nov 17, 2015

I further recommend toarray over todense. In practice, this may often not be an issue, but todense returns a numpy matrix whereas toarray returns a numpy array.
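A small illustration of the difference, assuming a `scipy.sparse.csr_matrix` as input (the variable names are just for this example):

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix([[1, 0], [0, 2]])

dense_matrix = X.todense()  # numpy.matrix
dense_array = X.toarray()   # numpy.ndarray

# A numpy.matrix always stays 2-D, so indexing a row keeps two dimensions,
# while indexing a row of an ndarray yields a 1-D array.
row_from_matrix = dense_matrix[0]
row_from_array = dense_array[0]

print(type(dense_matrix).__name__, row_from_matrix.shape)
print(type(dense_array).__name__, row_from_array.shape)
```

Code written for plain arrays can behave differently when it receives a `numpy.matrix`, which is why `toarray` is the safer default.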

@msjgriffiths Yes, I think sparse matrices shouldn't be a problem anymore -- in the majority of cases. Even the random forest has sparse matrix support now.

> Do you know how this data format interacts with pandas? Right now, TPOT uses pandas to pass around the data sets between pipeline operators.

As far as I can tell, you are using numpy arrays as input to the scikit-learn pipelines, right?

E.g.,

training_features = input_df.loc[input_df['group'] == 'training'].drop(['class', 'group', 'guess'], axis=1).values

So, in this case, we could add an intermediate transformer step for the conversion

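import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# `stopwords` is assumed to be defined elsewhere (e.g. a list of stop words).
pipe_1 = Pipeline([
    ('prep', CountVectorizer(analyzer='word',
                             decode_error='replace',
                             preprocessor=lambda text: re.sub('[^a-zA-Z]', ' ', text.lower()),
                             stop_words=stopwords)),
    ('to_dense', DenseTransformer()),  # sparse -> dense conversion step
    ('clf', RandomForestClassifier())
])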

(Note that the dense conversion is no longer necessary for the RandomForest.)

And the DenseTransformer could simply be:

class DenseTransformer(object):
    """
    A transformer for scikit-learn's Pipeline class that converts
    a sparse matrix into a dense matrix.
    """

    def __init__(self, some_param=True):
        pass

    def transform(self, X, y=None):
        return X.toarray()

    def fit(self, X, y=None):
        return self

    def fit_transform(self, X, y=None):
        return X.toarray()

    def get_params(self, deep=True):
        return {'some_param': True}

I think we'd need to check for which scikit transformers this would be necessary, but I could add the general functionality if you like.
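As an alternative sketch, the same idea can be written on top of scikit-learn's `BaseEstimator` and `TransformerMixin`, which provide `get_params` and `fit_transform` automatically. The toy documents and labels below are made up for illustration, and the `issparse` guard is an addition so that dense input passes through unchanged:

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a sparse matrix into a dense numpy array inside a Pipeline."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # Only convert when the input is actually sparse.
        return X.toarray() if sparse.issparse(X) else np.asarray(X)


# Toy data, purely illustrative.
docs = ["sparse matrices save memory", "dense arrays use more memory",
        "sparse input", "dense input"]
labels = [0, 1, 0, 1]

pipe = Pipeline([
    ('prep', CountVectorizer()),
    ('to_dense', DenseTransformer()),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipe.fit(docs, labels)
predictions = pipe.predict(docs)
```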

@manugarri

+1 to this, it would make it even closer to sklearn

@rhiever
Contributor

rhiever commented Mar 6, 2016

Now that we're looking at adding support for sklearn.preprocessing.OneHotEncoder, I'm looking more seriously at sparse matrix support for TPOT. I've changed this issue to high priority and will think about it more in the near future.

We may have to rework TPOT's internals entirely to work with numpy arrays/matrices in place of pandas DataFrames. I think I'm okay with that -- it would actually eliminate the pandas dependency -- but that means this would entail a significant rework of TPOT.
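To illustrate why OneHotEncoder raises the question: with its default settings it returns a scipy sparse matrix rather than a numpy array, so any downstream operator receives sparse input. A minimal sketch with made-up integer categories:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three distinct integer values.
X = np.array([[0], [1], [2], [1]])

encoded = OneHotEncoder().fit_transform(X)

# The default output is sparse; each row has exactly one non-zero entry.
print(sparse.issparse(encoded), encoded.shape)
```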

@tonyfast

Would using some of the sklearn mixins, like SparseCoefMixin, help with problems like this?

There is a lot of talk about a rework of TPOT in the issues.

@rhiever
Contributor

rhiever commented May 31, 2016

We're actually heading toward a major refactor of TPOT after the next release (v0.4). :-)

@tonyfast

Are y'all talking about that in an issue somewhere?

@rhiever
Contributor

rhiever commented May 31, 2016

@teaearlgraycold should be raising that issue soon (we just finished meeting). He's been working on a PR with a base implementation of it.

@rhiever rhiever modified the milestone: TPOT v0.6 Aug 20, 2016
@rhiever rhiever modified the milestones: TPOT v0.6, TPOT v0.7 Aug 29, 2016
@rhiever
Contributor

rhiever commented Sep 1, 2016

@teaearlgraycold, can you please write up a script here to pull all operators in TPOT and pass a sparse matrix to each of them individually? I'd like to see what operators we need to work on to natively support sparse matrices.
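A rough sketch of what such a check might look like. The `candidates` list is a stand-in made of a few scikit-learn estimators, not TPOT's actual operator set, and `accepts_sparse` is a hypothetical helper name:

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Stand-in candidates; a real script would iterate over TPOT's operators.
candidates = [
    RandomForestClassifier(n_estimators=10, random_state=0),
    LogisticRegression(),
    GaussianNB(),
]

# Small random sparse data set, purely for probing.
rng = np.random.RandomState(0)
X = sparse.csr_matrix(rng.binomial(1, 0.2, size=(40, 8)).astype(float))
y = np.array([0, 1] * 20)


def accepts_sparse(estimator, X, y):
    """Return True if the estimator can be fit on sparse X without raising."""
    try:
        estimator.fit(X, y)
        return True
    except Exception:
        return False


support = {type(est).__name__: accepts_sparse(est, X, y) for est in candidates}
for name, ok in sorted(support.items()):
    print(name, 'sparse OK' if ok else 'needs dense input')
```

Operators that fail this probe are the ones that would need a dense conversion step or native sparse support.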

@rhiever
Contributor

rhiever commented Oct 9, 2017

The 0.9 release added sparse matrix support via the TPOT sparse configuration.

5 participants