
Convenience function: Detect if there are non-numerical features and encode them as numerical features #61

Closed
rhiever opened this issue Dec 16, 2015 · 25 comments


@rhiever (Contributor) commented Dec 16, 2015

(As discussed in #60)

Since many sklearn tools only work on numerical data, one limitation of TPOT is that it cannot work with non-numerical features. We should look into adding a convenience function that:

  1. detects whether there are non-numerical features in the feature set,

  2. warns the user that they should preprocess the non-numerical features into numerical features themselves,

  3. ... but also tells the user that TPOT is automatically encoding the non-numerical features as numerical features, performs that encoding, and passes the new preprocessed feature set to the optimization process.
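
A minimal sketch of such a function, assuming the feature set arrives as a pandas DataFrame (the function name and warning text are hypothetical, not TPOT code):

```python
import warnings

import pandas as pd


def encode_non_numerical(features: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: detect non-numerical columns, warn, and encode."""
    # (1) Detect columns that are not already numeric.
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    if not non_numeric:
        return features

    # (2) Warn the user that preprocessing is normally their responsibility.
    warnings.warn(
        "Non-numerical features detected ({}). Consider preprocessing them "
        "yourself; automatically one-hot encoding them for now.".format(non_numeric)
    )

    # (3) Encode and hand the preprocessed feature set to the optimizer.
    return pd.get_dummies(features, columns=non_numeric)
```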

@bartleyn (Contributor)

As an additional note, should we make an effort to distinguish between numerical and ordinal data?

@rhiever (Contributor, Author) commented Dec 17, 2015

I can definitely see the value in that, but how could we accomplish it without explicit information from the user?

@bartleyn (Contributor)

Yeah, I can imagine it's difficult without that information. Maybe that'll just be something on the back burner/wish list, then.

@rhiever (Contributor, Author) commented Dec 17, 2015

@amueller, how does sklearn handle this (w.r.t. what @bartleyn mentioned)? Does sklearn just assume that the user will take care of issues w.r.t. numerical vs. ordinal data?

@amueller

sklearn basically assumes everything is numerical (well, or binary).

For trees, ordinal vs. numeric doesn't make a difference; how would you even handle it in other models?

@bartleyn (Contributor)

Okay, so the onus for dealing with ordinal data would then be on the user. As for encoding the non-numerical features, should we be binarizing them? If not, what should the range be? I.e., should we consider normalizing other features?

@rhiever (Contributor, Author) commented Dec 17, 2015

I'm thinking binarizing is the best option, yes. IIRC sklearn (or was it pandas?) has a built-in function for this, so we can probably harness that.

@pronojitsaha (Contributor)

I was thinking the same: for trees, the data type doesn't make a difference, so maybe for other models we do preprocessing specific to the model. Further, I believe that for binarizing we need to choose a threshold, which then becomes a tuning parameter. How about the following encodings, which require no decision parameters:

  • pandas.get_dummies()
  • sklearn.preprocessing.LabelEncoder()

Both preserve the information in most cases.
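
For concreteness, a quick sketch of how the two encodings differ on a toy column (illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "red", "blue"])

# pandas.get_dummies: one indicator column per category ("blue", "green", "red").
dummies = pd.get_dummies(colors)

# LabelEncoder: a single integer column (blue=0, green=1, red=2). Note that this
# imposes an arbitrary ordering, which non-tree models may misinterpret.
codes = LabelEncoder().fit_transform(colors)
```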

@dmarx commented Dec 18, 2015

Pretty sure these are the sklearn tools you're looking for:

  • sklearn.preprocessing.LabelBinarizer
  • sklearn.preprocessing.MultiLabelBinarizer

@rhiever (Contributor, Author) commented Dec 18, 2015

@dmarx: Yep, those are the functions I was thinking of!

@amueller

These are for labels, not input features.
For trees "types don't make a difference" if the tree knows what to do with them. The scikit-learn trees don't at the moment, and there will be a difference between one-hot encoding a variable and not doing so.

The problem with the pandas dummy variables is that they don't know about a training and a test phase, and so you need to make sure that the categorical variables in testing have the same possible values as in training.
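
A small demonstration of the mismatch, plus the usual fix of reindexing the encoded test frame onto the training columns (a sketch, not anything TPOT does):

```python
import pandas as pd

train = pd.DataFrame({"city": ["NYC", "LA", "NYC"]})
test = pd.DataFrame({"city": ["LA", "SF"]})  # "SF" never appears in training

train_enc = pd.get_dummies(train)  # columns: city_LA, city_NYC
test_enc = pd.get_dummies(test)    # columns: city_LA, city_SF -- misaligned!

# Fix: force the test frame onto the training columns, dropping categories the
# model never saw and zero-filling categories absent from the test set.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```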

@rhiever (Contributor, Author) commented Dec 29, 2015

Hmmm, that's a good point @amueller. I wonder if we should simply make it a requirement that the user perform the data cleaning beforehand. Otherwise, it's possible for the feature encodings to change between fit() and score()/predict() calls, which could be quite disastrous.

@amueller

Jeff said it's possible to make sure that the encoding is the same, if you remember the possible values for the categorical variable.
You just have to take care of that, as ignoring it might indeed be disastrous.
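
One way to "remember the possible values" on the pandas side, as a sketch: pin the column to a fixed category set before encoding.

```python
import pandas as pd

# Categories recorded during training.
categories = ["LA", "NYC", "SF"]

# Pin any future column to that same set before encoding. Values outside the
# known categories ("Boston" here) become NaN and encode as all-zero rows, and
# the output always has exactly one column per remembered category.
col = pd.Categorical(["NYC", "Boston"], categories=categories)
encoded = pd.get_dummies(col)
```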

@pronojitsaha (Contributor)

@amueller I thought the same about MultiLabelBinarizer initially, but don't you think we can represent an input feature with it by treating the feature's values as classes and then stacking the (n_samples x n_classes) matrix obtained from MultiLabelBinarizer (on the input feature) onto the original dataset for further analysis?

@amueller commented Jan 4, 2016

Yeah, but you should really use OneHotEncoder instead -- which unfortunately doesn't support strings at the moment, because the changes for that haven't been reviewed.
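
(For reference, string support did land in later scikit-learn releases. A sketch assuming a modern scikit-learn, 0.20+, where OneHotEncoder accepts strings and handle_unknown="ignore" zeroes out unseen categories:)

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["NYC"], ["LA"]])  # learn the category set from the training data

# Unseen categories ("SF") encode as all zeros instead of raising an error.
print(encoder.transform([["LA"], ["SF"]]).toarray())
# [[1. 0.]
#  [0. 0.]]
```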

@MichaelMarkieta

pandas has get_dummies() as a convenience function, which by default scans the data frame and one-hot encodes the obvious categorical features: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

@MichaelMarkieta

I just came across this again in a new dataset where one column is a categorical feature with values such as "c1", "c2", "c3", etc. pandas.get_dummies() worked well for me. I just passed the X_train data frame like X_train = pandas.get_dummies(X_train) and off I went without issue: it one-hot encoded the categorical column and kept the rest of the data frame intact (and in order).

@rhiever (Contributor, Author) commented Jan 24, 2016

Nice -- good to know, @MichaelMarkieta.

@amueller

@MichaelMarkieta how did you transform the test set?

@pronojitsaha (Contributor)

PR #71 shows an implementation using the Titanic dataset. Basically, it drops the values that do not appear in the training set.

@rhiever (Contributor, Author) commented Feb 27, 2016

@pronojitsaha, I'm going to close this issue and scrap the feature. Given all the conversations in this thread, I think this convenience function would unnecessarily bloat the code as we try to accommodate so many different potential inputs. TPOT is meant to be a pipeline optimizer, not a data cleaner, so this convenience function would undoubtedly be scope creep.

Instead, I think we should clearly state the expected format of the input and throw an exception if those expectations are violated. This is what's done in sklearn, and it's the more Pythonic way of handling inputs.
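
A sketch of the kind of check this implies (the function name and message are hypothetical, not TPOT code):

```python
import numpy as np


def check_numerical(features):
    """Raise instead of silently re-encoding, mirroring sklearn's convention."""
    features = np.asarray(features)
    if not np.issubdtype(features.dtype, np.number):
        raise ValueError(
            "An all-numerical feature matrix is expected. Please encode "
            "categorical features (e.g. with pandas.get_dummies or a "
            "one-hot encoder) before calling fit()."
        )
```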

However, I do believe that automatically cleaning data is valuable, so I've created the datacleaner project to build a separate tool for that functionality. Please feel free to direct your contributions there if you've worked on this problem. Ultimately, I think having the two tasks as separate tools is better software design, especially if we follow Doug McIlroy's teachings:

(i) Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.

@rhiever closed this as completed Feb 27, 2016
@amueller

Curious to see what will happen with the datacleaner :)
What are your plans?

@rhiever (Contributor, Author) commented Feb 27, 2016

As of now, I'm writing it to handle my daily cleaning needs. So:

  • Encode categorical features/classes/strings as numerals
  • Impute or drop NaNs (which one is determined by a setting; imputes w/ median by default)
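
Not datacleaner's actual API, but a sketch of those two steps on a DataFrame (function name hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def clean(df: pd.DataFrame, drop_nans: bool = False) -> pd.DataFrame:
    """Encode categoricals as integers; impute (median) or drop missing values."""
    df = df.copy()

    # Impute NaNs with each numeric column's median, or drop the rows entirely.
    if drop_nans:
        df = df.dropna()
    else:
        df = df.fillna(df.median(numeric_only=True))

    # Encode any remaining non-numerical columns as integer codes.
    for col in df.select_dtypes(exclude="number").columns:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```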

From there, I'll likely raid the sklearn preprocessing module and a couple other packages I know of to find some other common uses.

Feedback is of course welcomed! Feel free to file an issue on the datacleaner repo.

@pronojitsaha (Contributor)

@rhiever Agreed, I like the thought process. Will look into datacleaner and discuss with you once I get some airtime! BTW did you have a look at https://github.com/wdm0006/categorical_encoding?

@thedatadecoder

So, has the issue in TPOT with categorical variables been resolved or not?
