Convenience function: Detect if there are non-numerical features and encode them as numerical features #61
As an additional note, should we make an effort to distinguish between numerical and ordinal data?
I can definitely see the value in that, but how could we accomplish it without explicit information from the user?
Yeah, I can imagine it's difficult without that information. Maybe that'll just be something on the backburner/wishlist then.
sklearn basically assumes everything is numerical (well, or binary). For trees, ordinal vs. numeric doesn't make a difference; how would you even handle it in other models?
Okay, so the onus for dealing with ordinal data would then be on the user. As for encoding the non-numerical features, should we be binarizing them? If not, what should the range be? I.e., should we consider normalizing other features?
I'm thinking binarizing is the best option, yes. IIRC sklearn (or was it pandas?) has a built-in function for this, so we can probably harness that.
I was thinking the same, that for trees the data type doesn't make a difference. So maybe for other models we do preprocessing specific to the model. Further, I believe that for binarizing we need to choose a threshold, which then becomes a tuning parameter. How about the following encodings, which require no decision parameters:
Both lead to information gain in most cases.
Pretty sure these are the sklearn tools you're looking for:
@dmarx: Yep, those are the functions I was thinking of!
These are for labels, not input features. The problem with the pandas dummy variables is that they don't know about a training and a test phase, and so you need to make sure that the categorical variables in testing have the same possible values as in training.
Hmmm, that's a good point @amueller. I wonder if we should simply make it a requirement that the user perform the data cleaning beforehand. Otherwise, it's possible for the feature encodings to change between training and testing.
Jeff said it's possible to make sure that the encoding is the same, if you remember the possible values for the categorical variable. |
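For illustration, a minimal sketch of "remembering the possible values" with pandas; the column name and example values are invented:

```python
import pandas as pd

# Invented example data; "purple" never appears in training.
train = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "red", "purple"]})

# Remember the category set from the training data...
categories = train["color"].astype("category").cat.categories

# ...and reuse it when encoding the test data. Unseen values map to
# code -1 instead of silently shifting the encoding.
test_codes = pd.Categorical(test["color"], categories=categories).codes
print(test_codes)  # [1, 2, -1]
```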
@amueller I thought the same about MultiLabelBinarizer initially, but don't you think we can represent an input feature with it by treating the feature's values as classes and then stacking the (n_samples x n_classes) matrix obtained through MultiLabelBinarizer (on the input feature) with the original dataset for further analysis?
Yeah, but you should really rather use OneHotEncoder, which unfortunately doesn't support strings at the moment, because the changes for that haven't been reviewed.
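(In more recent scikit-learn releases, OneHotEncoder does accept string features directly. A minimal sketch under that assumption, with invented data:)

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Invented data; "c4" never appears in training.
X_train = np.array([["c1"], ["c2"], ["c3"], ["c1"]])
X_test = np.array([["c2"], ["c4"]])

# handle_unknown="ignore" encodes unseen categories as all-zero rows,
# so the train and test feature spaces stay identical.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)
print(enc.transform(X_test).toarray())
```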
pandas has get_dummies for this.
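A minimal sketch of keeping the get_dummies output aligned between training and testing, addressing the concern above (column name invented):

```python
import pandas as pd

# Invented data; "purple" never appears in training.
train = pd.DataFrame({"color": ["red", "blue", "green"]})
test = pd.DataFrame({"color": ["blue", "purple"]})

train_dummies = pd.get_dummies(train)

# Reindex onto the training columns: dummy columns for unseen test
# categories are dropped, and missing ones are filled with 0.
test_dummies = pd.get_dummies(test).reindex(
    columns=train_dummies.columns, fill_value=0
)
```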
I just came across this again in a new dataset where one column is a categorical feature with values such as "c1", "c2", "c3", etc.
Nice -- good to know, @MichaelMarkieta.
@MichaelMarkieta how did you transform the test set?
PR #71 shows the implementation using the Titanic dataset. Basically dropped the values which do not appear in the training set.
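For illustration only (the actual PR may differ), one way to drop test rows whose categorical value never appears in training, with an invented column name:

```python
import pandas as pd

# Invented data; "crew" never appears in training.
train = pd.DataFrame({"cabin_class": ["first", "second", "third"]})
test = pd.DataFrame({"cabin_class": ["second", "crew"]})

# Keep only the test rows whose value was seen during training.
seen = set(train["cabin_class"])
test_filtered = test[test["cabin_class"].isin(seen)]
```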
@pronojitsaha, I'm going to close this issue and scrap the feature. Now that I think about it, with all the conversations going on in this thread, I think that this convenience function would unnecessarily bloat the code while we try to accommodate so many different potential inputs. TPOT is meant to be a pipeline optimizer, not a data cleaner, so this convenience function would undoubtedly be scope creep. Instead, I think we should clearly state the expected format of the input and throw an exception if those expectations are violated. This is what's done in sklearn, and it's the more Pythonic way of handling inputs. However, I do believe that automatically cleaning data is valuable, so I've created the datacleaner project to build a separate tool for that functionality. Please feel free to direct your contributions there if you've worked on this problem. Ultimately, I think having the two tasks as separate tools is better software design, especially if we follow Doug McIlroy's teachings:
"Write programs that do one thing and do it well. Write programs to work together."
Curious to see what will happen with the datacleaner :)
As of now, I'm writing it to handle my daily cleaning needs. So:
From there, I'll likely raid the sklearn preprocessing module and a couple other packages I know of to find some other common uses. Feedback is of course welcomed! Feel free to file an issue on the datacleaner repo.
@rhiever Agreed, I like the thought process. Will look into datacleaner and discuss with you once I get some airtime! BTW did you have a look at https://github.com/wdm0006/categorical_encoding?
So, has the issue with categorical variables in TPOT been resolved or not?
(As discussed in #60)
Since many sklearn tools only work on numerical data, one limitation of TPOT is that it cannot work with non-numerical features. We should look into adding a convenience function that:
- detects whether there exist non-numerical features in the feature set,
- sends a warning to the user that they should preprocess the non-numerical features into numerical features,
- but also tells the user that TPOT is automatically encoding the non-numerical features as numerical features, does so, and passes the new preprocessed feature set to the optimization process (a rough sketch follows after this list).
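A rough sketch of what such a convenience function might look like, assuming a pandas DataFrame input; the function name and the choice of get_dummies are illustrative, not the actual TPOT implementation:

```python
import warnings

import pandas as pd


def encode_non_numerical(df):
    """Return a copy of df with any non-numerical columns one-hot encoded."""
    non_numerical = df.select_dtypes(exclude="number").columns.tolist()
    if not non_numerical:
        return df

    # Warn the user, per the proposal above, then encode automatically.
    warnings.warn(
        "Non-numerical features detected: {}. Consider preprocessing them "
        "yourself; TPOT is automatically encoding them as numerical "
        "features.".format(non_numerical)
    )
    return pd.get_dummies(df, columns=non_numerical)
```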