
Convenience function: Detect if there are non-numerical features and encode them as numerical features #61

Closed
rhiever opened this issue Dec 16, 2015 · 25 comments


@rhiever (Contributor) commented Dec 16, 2015

(As discussed in #60)

Since many sklearn tools only work on numerical data, one limitation of TPOT is that it cannot work with non-numerical features. We should look into adding a convenience function that:

  1. detects whether there are non-numerical features in the feature set,

  2. warns the user that they should preprocess the non-numerical features into numerical features themselves,

  3. ... but also tells the user that TPOT is automatically encoding the non-numerical features as numerical features, performs that encoding, and passes the new preprocessed feature set to the optimization process.
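
A minimal sketch of such a function, assuming the feature set arrives as a pandas DataFrame (the function name and warning text are hypothetical, not TPOT code):

```python
import warnings

import pandas as pd


def encode_non_numerical(features: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: detect non-numerical columns, warn, and encode."""
    # (1) Detect columns that are not already numeric.
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    if not non_numeric:
        return features

    # (2) Warn the user that preprocessing is normally their responsibility.
    warnings.warn(
        "Non-numerical features detected ({}). Consider preprocessing them "
        "yourself; automatically one-hot encoding them for now.".format(non_numeric)
    )

    # (3) Encode and hand the preprocessed feature set to the optimizer.
    return pd.get_dummies(features, columns=non_numeric)
```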

@bartleyn (Contributor)

As an additional note, should we make an effort to distinguish between numerical and ordinal data?

@rhiever (Contributor, Author) commented Dec 17, 2015

I can definitely see the value in that, but how could we accomplish it without explicit information from the user?

@bartleyn (Contributor)

Yeah, I can imagine it's difficult without that information. Maybe that'll just be something on the back burner/wish list, then.

@rhiever (Contributor, Author) commented Dec 17, 2015

@amueller, how does sklearn handle this (w.r.t. what @bartleyn mentioned)? Does sklearn just assume that the user will take care of issues w.r.t. numerical vs. ordinal data?

@amueller

sklearn basically assumes everything is numerical (well, or binary).

For trees, ordinal vs. numeric doesn't make a difference; how would you even handle it in other models?

@bartleyn (Contributor)

Okay, so the onus for dealing with ordinal data would then be on the user. As for encoding the non-numerical features, should we be binarizing them? If not, what should the range be? I.e., should we consider normalizing other features?

@rhiever (Contributor, Author) commented Dec 17, 2015

I'm thinking binarizing is the best option, yes. IIRC sklearn (or was it pandas?) has a built-in function for this, so we can probably harness that.

@pronojitsaha (Contributor)

I was thinking the same: for trees, the data type doesn't make a difference, so maybe for other models we do preprocessing specific to the model. Further, I believe that for binarizing we need to choose a threshold, which then becomes a tuning parameter. How about the following encodings, which require no decision parameters:

  • pandas.get_dummies()
  • sklearn.preprocessing.LabelEncoder()

Both preserve the information in most cases.
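
For concreteness, a quick sketch of how the two encodings differ on a toy column (illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "red", "blue"])

# pandas.get_dummies: one indicator column per category ("blue", "green", "red").
dummies = pd.get_dummies(colors)

# LabelEncoder: a single integer column (blue=0, green=1, red=2). Note that this
# imposes an arbitrary ordering, which non-tree models may misinterpret.
codes = LabelEncoder().fit_transform(colors)
```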

@dmarx commented Dec 18, 2015

Pretty sure these are the sklearn tools you're looking for:

  • sklearn.preprocessing.LabelBinarizer
  • sklearn.preprocessing.MultiLabelBinarizer

@rhiever (Contributor, Author) commented Dec 18, 2015

@dmarx: Yep, those are the functions I was thinking of!

@amueller

These are for labels, not input features.
For trees "types don't make a difference" if the tree knows what to do with them. The scikit-learn trees don't at the moment, and there will be a difference between one-hot encoding a variable and not doing so.

The problem with the pandas dummy variables is that they don't know about a training and a test phase, and so you need to make sure that the categorical variables in testing have the same possible values as in training.
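
A small demonstration of the mismatch, plus the usual fix of reindexing the encoded test frame onto the training columns (a sketch, not anything TPOT does):

```python
import pandas as pd

train = pd.DataFrame({"city": ["NYC", "LA", "NYC"]})
test = pd.DataFrame({"city": ["LA", "SF"]})  # "SF" never appears in training

train_enc = pd.get_dummies(train)  # columns: city_LA, city_NYC
test_enc = pd.get_dummies(test)    # columns: city_LA, city_SF -- misaligned!

# Fix: force the test frame onto the training columns, dropping categories the
# model never saw and zero-filling categories absent from the test set.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```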

@rhiever (Contributor, Author) commented Dec 29, 2015

Hmmm, that's a good point @amueller. I wonder if we should simply make it a requirement that the user perform the data cleaning beforehand. Otherwise, it's possible for the feature encodings to change between fit() and score()/predict() calls, which could be quite disastrous.

@amueller

Jeff said it's possible to make sure that the encoding is the same, if you remember the possible values for the categorical variable.
You just have to take care of that, as ignoring it might indeed be disastrous.
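
One way to "remember the possible values" on the pandas side, as a sketch: pin the column to a fixed category set before encoding.

```python
import pandas as pd

# Categories recorded during training.
categories = ["LA", "NYC", "SF"]

# Pin any future column to that same set before encoding. Values outside the
# known categories ("Boston" here) become NaN and encode as all-zero rows, and
# the output always has exactly one column per remembered category.
col = pd.Categorical(["NYC", "Boston"], categories=categories)
encoded = pd.get_dummies(col)
```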

@pronojitsaha (Contributor)

@amueller I thought the same about MultiLabelBinarizer initially, but don't you think we can represent an input feature with it by treating the feature's values as classes and then stacking the (n_samples x n_classes) matrix obtained from MultiLabelBinarizer (on the input feature) onto the original dataset for further analysis?

@amueller commented Jan 4, 2016

Yeah, but you should really use OneHotEncoder instead -- which unfortunately doesn't support strings at the moment, because the changes for that haven't been reviewed.
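
(For reference, string support did land in later scikit-learn releases. A sketch assuming a modern scikit-learn, 0.20+, where OneHotEncoder accepts strings and handle_unknown="ignore" zeroes out unseen categories:)

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["NYC"], ["LA"]])  # learn the category set from the training data

# Unseen categories ("SF") encode as all zeros instead of raising an error.
print(encoder.transform([["LA"], ["SF"]]).toarray())
# [[1. 0.]
#  [0. 0.]]
```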

@MichaelMarkieta

pandas has get_dummies() as a convenience function, which by default scans the data frame and one-hot encodes the obvious categorical features: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

@MichaelMarkieta

I just came across this again in a new dataset where one column is a categorical feature with values such as "c1", "c2", "c3", etc. pandas.get_dummies() worked well for me. I just passed the X_train data frame like X_train = pandas.get_dummies(X_train) and off I went without issue: it one-hot encoded the categorical column and kept the rest of the data frame intact (and in order).

@rhiever (Contributor, Author) commented Jan 24, 2016

Nice -- good to know, @MichaelMarkieta.

@amueller

@MichaelMarkieta how did you transform the test set?

@pronojitsaha (Contributor)

PR #71 shows an implementation using the Titanic dataset. Basically, it drops the values that do not appear in the training set.

@rhiever (Contributor, Author) commented Feb 27, 2016

@pronojitsaha, I'm going to close this issue and scrap the feature. Given all the conversations in this thread, I think this convenience function would unnecessarily bloat the code as we try to accommodate so many different potential inputs. TPOT is meant to be a pipeline optimizer, not a data cleaner, so this convenience function would undoubtedly be scope creep.

Instead, I think we should clearly state the expected format of the input and throw an exception if those expectations are violated. This is what's done in sklearn, and it's the more Pythonic way of handling inputs.
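
A sketch of the kind of check this implies (the function name and message are hypothetical, not TPOT code):

```python
import numpy as np


def check_numerical(features):
    """Raise instead of silently re-encoding, mirroring sklearn's convention."""
    features = np.asarray(features)
    if not np.issubdtype(features.dtype, np.number):
        raise ValueError(
            "An all-numerical feature matrix is expected. Please encode "
            "categorical features (e.g. with pandas.get_dummies or a "
            "one-hot encoder) before calling fit()."
        )
```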

However, I do believe that automatically cleaning data is valuable, so I've created the datacleaner project to build a separate tool for that functionality. Please feel free to direct your contributions there if you've worked on this problem. Ultimately, I think having the two tasks as separate tools is better software design, especially if we follow Doug McIlroy's teachings:

(i) Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.

@rhiever closed this as completed Feb 27, 2016
@amueller

Curious to see what will happen with the datacleaner :)
What are your plans?

@rhiever (Contributor, Author) commented Feb 27, 2016

As of now, I'm writing it to handle my daily cleaning needs. So:

  • Encode categorical features/classes/strings as numerals
  • Impute or drop NaNs (which one is determined by a setting; imputes w/ median by default)
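
Not datacleaner's actual API, but a sketch of those two steps on a DataFrame (function name hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def clean(df: pd.DataFrame, drop_nans: bool = False) -> pd.DataFrame:
    """Encode categoricals as integers; impute (median) or drop missing values."""
    df = df.copy()

    # Impute NaNs with each numeric column's median, or drop the rows entirely.
    if drop_nans:
        df = df.dropna()
    else:
        df = df.fillna(df.median(numeric_only=True))

    # Encode any remaining non-numerical columns as integer codes.
    for col in df.select_dtypes(exclude="number").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```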

From there, I'll likely raid the sklearn preprocessing module and a couple other packages I know of to find some other common uses.

Feedback is of course welcomed! Feel free to file an issue on the datacleaner repo.

@pronojitsaha (Contributor)

@rhiever Agreed, I like the thought process. Will look into datacleaner and discuss with you once I get some airtime! BTW did you have a look at https://github.com/wdm0006/categorical_encoding?

@thedatadecoder

So, has the issue in TPOT with categorical variables been resolved or not?
