Allow passing of sparse matrices #29

msjgriffiths · 2015-11-17T13:12:57Z

In some situations, it's easier to pass a scipy.sparse matrix object to sklearn model objects. This would reduce the memory requirements when fitting larger datasets.

Most sklearn models will accept a sparse matrix. For those that do not, checking for sparsity in the method call and calling matrix.todense() would work.
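
A minimal sketch of that fallback, assuming a helper that runs just before a model is fitted. The name `ensure_dense` is hypothetical and not part of TPOT or sklearn; it relies only on `scipy.sparse.issparse` and the sparse matrix's own conversion method.

```python
import numpy as np
from scipy import sparse


def ensure_dense(X):
    """Return X as a dense numpy array, converting only when it is sparse."""
    if sparse.issparse(X):
        # toarray() yields a plain ndarray rather than a numpy.matrix.
        return X.toarray()
    return np.asarray(X)


X_sparse = sparse.csr_matrix([[0, 1], [2, 0]])
X_dense = ensure_dense(X_sparse)
```

Models that already accept sparse input would skip this step, so the memory savings are kept wherever possible.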

rhiever · 2015-11-17T13:26:30Z

Do you know how this data format interacts with pandas? Right now, TPOT uses pandas to pass around the data sets between pipeline operators.

rasbt · 2015-11-17T15:38:27Z

I further recommend toarray over todense. In practice, this may often not be an issue, but todense returns a numpy matrix whereas toarray returns a numpy array.
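
A small illustration of why the return type can matter, using a `scipy.sparse.csr_matrix`. The variable names are only for this example.

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix([[1, 2], [3, 4]])

as_matrix = X.todense()  # numpy.matrix
as_array = X.toarray()   # numpy.ndarray

# `*` means matrix multiplication for numpy.matrix,
# but elementwise multiplication for numpy.ndarray.
matrix_product = as_matrix * as_matrix
elementwise_product = as_array * as_array
```

The two products differ, which is the kind of silent surprise a plain array avoids in downstream code.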

@msjgriffiths Yes, I think sparse matrices shouldn't be a problem anymore -- in the majority of cases. Even the random forest has sparse matrix support now.

> Do you know how this data format interacts with pandas? Right now, TPOT uses pandas to pass around the data sets between pipeline operators.

As far as I can tell, you are using numpy arrays as input to the scikit-learn pipelines, right?

E.g.,

training_features = input_df.loc[input_df['group'] == 'training'].drop(['class', 'group', 'guess'], axis=1).values

So, in this case, we could add an intermediate transformer step for the conversion

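import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# `stopwords` is a list of stop words defined elsewhere.
pipe_1 = Pipeline([
    ('prep', CountVectorizer(analyzer='word',
                             decode_error='replace',
                             # lowercase the text and replace non-letters with spaces
                             preprocessor=lambda text: re.sub('[^a-zA-Z]', ' ', text.lower()),
                             stop_words=stopwords)),
    ('to_dense', DenseTransformer()),
    ('clf', RandomForestClassifier())
])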

(Note that the conversion is no longer necessary for the RandomForest.)

And the DenseTransformer could simply be:

class DenseTransformer(object):
    """
    A transformer for scikit-learn's Pipeline class that converts
    a sparse matrix into a dense matrix.
    """

    def __init__(self, some_param=True):
        pass

    def transform(self, X, y=None):
        return X.toarray()

    def fit(self, X, y=None):
        return self

    def fit_transform(self, X, y=None):
        return X.toarray()

    def get_params(self, deep=True):
        return {'some_param': True}

I think we'd need to check for which scikit transformers this would be necessary, but I could add the general functionality if you like.
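
As an alternative sketch, the same transformer could inherit from sklearn's `BaseEstimator` and `TransformerMixin`, which supply `get_params`, `set_params`, and `fit_transform`. This is one possible variant, not something TPOT ships; passing dense input through unchanged is an added assumption.

```python
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a scipy sparse matrix to a dense numpy array inside a Pipeline."""

    def fit(self, X, y=None):
        # Nothing to learn; present only to satisfy the transformer interface.
        return self

    def transform(self, X):
        # Dense input is returned as-is, so the step is safe in either case.
        return X.toarray() if sparse.issparse(X) else X


X = sparse.csr_matrix([[0, 1], [1, 0]])
dense = DenseTransformer().fit_transform(X)
```

Since the class takes no constructor arguments, `get_params()` returns an empty dict and no placeholder parameter is needed.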

manugarri · 2015-11-20T17:02:56Z

+1 to this, it would make it even closer to sklearn

rhiever · 2016-03-06T13:00:05Z

Now that we're looking at adding support for sklearn.preprocessing.OneHotEncoder, I'm looking more seriously at sparse matrix support for TPOT. I've changed this issue to high priority and will think about it more in the near future.

We may have to rework TPOT's internals entirely to work with numpy arrays/matrices in place of pandas DataFrames. I think I'm okay with that -- it would actually eliminate the pandas dependency -- but that means this would entail a significant rework of TPOT.

tonyfast · 2016-05-31T15:11:16Z

Would some of the sklearn mixins, such as SparseCoefMixin, help with problems like this?

There is a lot of talk about a rework of TPOT in the issues.

rhiever · 2016-05-31T15:13:17Z

We're actually heading toward a major refactor of TPOT after the next release (v0.4). :-)

tonyfast · 2016-05-31T15:14:07Z

Are y'all talking about that in an issue somewhere?

rhiever · 2016-05-31T15:18:18Z

@teaearlgraycold should be raising that issue soon (we just finished meeting). He's been working on a PR with a base implementation of it.

rhiever · 2016-09-01T00:06:41Z

@teaearlgraycold, can you please write up a script here to pull all operators in TPOT and pass a sparse matrix to each of them individually? I'd like to see what operators we need to work on to natively support sparse matrices.
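
A rough sketch of what such a check might look like. The `estimators` dict is a hypothetical stand-in for TPOT's actual operator list, and `check_sparse_support` is an illustrative name, not an existing TPOT function.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def check_sparse_support(estimators, X_sparse, y):
    """Try fitting each estimator on sparse input; record whether it succeeds."""
    results = {}
    for name, estimator in estimators.items():
        try:
            estimator.fit(X_sparse, y)
            results[name] = True
        except Exception:
            # Any failure is treated as "no native sparse support".
            results[name] = False
    return results


rng = np.random.RandomState(0)
X_sparse = sparse.csr_matrix(rng.randint(0, 2, size=(20, 5)).astype(float))
y = np.array([0, 1] * 10)

# Hypothetical stand-in for the operators TPOT would iterate over.
estimators = {
    "LogisticRegression": LogisticRegression(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "GaussianNB": GaussianNB(),
}
support = check_sparse_support(estimators, X_sparse, y)
```

Operators that come back `False` are the ones that would need either a dense conversion step or exclusion from a sparse-only configuration.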

rhiever · 2017-10-09T13:56:00Z

The 0.9 release added sparse matrix support via the TPOT sparse configuration.

rhiever added the enhancement label Nov 17, 2015

rhiever added the need contributor label Feb 16, 2016

rhiever added the high priority label Mar 6, 2016

rhiever mentioned this issue Mar 6, 2016

Add more feature preprocessing operators #102

Closed

4 tasks

rhiever mentioned this issue Mar 15, 2016

Refactor TPOT to work directly with numpy matrix instead of pandas DataFrames #113

Closed

rhiever removed the high priority label Jun 1, 2016

rhiever modified the milestone: TPOT v0.6 Aug 20, 2016

rhiever mentioned this issue Aug 20, 2016

Add OneHotEncoder operator #215

Closed

rhiever modified the milestones: TPOT v0.6, TPOT v0.7 Aug 29, 2016

rhiever removed this from the TPOT v0.7 milestone Dec 19, 2016

AIAdventures mentioned this issue Jun 6, 2017

Titanic example -problem with 2nd last cell. #492

Closed

rhiever added being worked on and removed need contributor labels Oct 9, 2017

rhiever closed this as completed Oct 9, 2017

saddy001 mentioned this issue Mar 20, 2018

Segfault on optimization process #676

Closed
