
[QUESTION] untransform encoded categorical values and change type of problem #21

Closed
fjpa121197 opened this issue Feb 25, 2022 · 8 comments


@fjpa121197

fjpa121197 commented Feb 25, 2022

Hello, I'm testing featurewiz with a dataframe that has numerical and categorical variables, and a target variable that ranges from 0 to 55, with most of its values between 0 and 6.

My first question concerns what happens when I run:

outputs = FW.featurewiz(train_df, target='unique_offers_cut', feature_engg='', category_encoders='OneHotEncoder', dask_xgboost_flag=False, nrows=None, verbose=2)

Everything runs fine, but the final output is like this:

['OneHotEncoder_property_type_1',
 'OneHotEncoder_property_type_6',
 'OneHotEncoder_itv_region_10',
 'OneHotEncoder_itv_region_5',
 'OneHotEncoder_itv_region_8',
 'OneHotEncoder_listing_pricetype_12',
 'OneHotEncoder_property_type_3',
 'first_listed_price',
 'OneHotEncoder_property_type_4',...

Is there any chance of knowing what property_type_1 is, or at least of having it transformed back to its original name?

On the other hand, regarding the type of problem, is there any way to override it? I want to set it up as a regression problem, but featurewiz treats the target variable as multi-class classification (and the XGBoost part ends up not working).

Thanks

@AutoViML
Owner

Hi @fjpa121197 👍
There are two mistakes you are making:

  1. You don't need to transform the variables. For feature selection, featurewiz automatically transforms categorical variables internally before feeding them to XGBoost. So you can simply remove OrdinalEncoder from your input (see the sketch after this list). That should solve your first problem.
  2. You can try solving the second problem with another model once you have let featurewiz select the best variables.
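A minimal, hypothetical sketch of point 1, reusing the parameters from the call earlier in the thread and simply dropping the category_encoders setting so featurewiz handles the categorical columns itself:

# Same call as before, but without category_encoders, so featurewiz encodes
# the categorical columns internally during feature selection
outputs = FW.featurewiz(train_df, target='unique_offers_cut', feature_engg='',
                        dask_xgboost_flag=False, nrows=None, verbose=2)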
Hope this helps,
AutoViML

@fjpa121197
Author

fjpa121197 commented Feb 25, 2022

Hi @AutoViML,

But the last part, which uses XGBoost, gives the following output:

[15:33:59] [C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/objective/multiclass_obj.cu:120](): SoftmaxMultiClassObj: label must be in [0, num_class).
Regular XGBoost is crashing. Returning with currently selected features...

And outputs[0] contains only the target variable (as a dataframe).


The first suggestion solved my problem, but I was curious, when looking at the transformed dataset (or the dataset with selected features), to find my categorical variables encoded with OrdinalEncoder. Is this the default way the XGBoost part finds the most important features? I'm not sure that assuming an ordinal relationship is appropriate for all categorical columns.

@AutoViML
Owner

Hi @fjpa121197 👍
There is one quick and easy way to resolve this. Just change your target variable to float before feeding it to Featurewiz. If it is float, it will treat it as a Regression problem. That should work.
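A minimal sketch of that cast, assuming the dataframe and target name from the call earlier in this thread:

# Casting the target to float makes featurewiz treat this as a regression problem
train_df['unique_offers_cut'] = train_df['unique_offers_cut'].astype(float)

outputs = FW.featurewiz(train_df, target='unique_offers_cut', feature_engg='',
                        dask_xgboost_flag=False, nrows=None, verbose=2)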
If you still have a problem, just cut and paste the first 10 rows of your dataset here or attach a zip file with a sample dataset and I will try to troubleshoot it.
AutoViML

@fjpa121197
Author

fjpa121197 commented Feb 26, 2022

Hi @AutoViML,

That did solve my problem, and I was able to run the last part without any issues, thanks!

I do still have questions about this:
"when looking at the transformed dataset (or the dataset with selected features) to find my categorical variables encoded using OrdinalEncoder? Is this the default on how the XGBoost part finds the most important features? Not sure if assuming an ordinal relationship is appropiate for all categorical columns."

Is there any way to see whether the results are different when using one-hot encoding, and also to see the actual features after encoding? For example:

Let's say I have a categorical column type_transportation with the unique values ['car', 'boat', 'bike', 'plane']. After one-hot encoding, it will create the columns ['type_transportation_car', 'type_transportation_boat', 'type_transportation_bike'].

However, after using featurewiz, the selected features are returned like this:

['OneHotEncoder_property_type_1',
 'OneHotEncoder_property_type_6' ...

Is there any way to know the actual value, or which category each encoded column refers to?

@AutoViML
Owner

Hi @fjpa121197 👍
I will look into it. In the meantime, as I said earlier, you can one-hot encode the categorical variables in your dataframe before you send it to featurewiz. The other option is to remove one-hot encoding from your featurewiz calling statement, since featurewiz automatically transforms variables, detects which ones are important, and sends you the list of features untransformed.
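As an illustration of the first option, a hypothetical sketch using pandas.get_dummies with the type_transportation column from the example above, so the selected features keep readable names:

import pandas as pd

# One-hot encode up front so the resulting columns keep readable names,
# e.g. type_transportation_car, type_transportation_boat, ...
encoded_df = pd.get_dummies(train_df, columns=['type_transportation'])

outputs = FW.featurewiz(encoded_df, target='unique_offers_cut', feature_engg='',
                        dask_xgboost_flag=False, nrows=None, verbose=2)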
Check out both options.
Thanks for trying out featurewiz.
AutoViML

@fjpa121197
Author

Hi @AutoViML,

The first option sounds good to me! That way I can handle the inverse transformation of the columns from the featurewiz output myself, and avoid assuming an ordinal relationship for my categorical features.

Sorry for another question, but I'm really interested and amazed by the automation part.

Is there any way to know the performance of the XGBoost estimator at the different stages where it reduces features?
I think it would be good to know, since feature importance is also affected by the estimator's performance.

@AutoViML
Owner

AutoViML commented Feb 28, 2022

Hi @fjpa121197 👍
Great question.

Is there any way to know the performance of the XGBoost estimator at the different stages where it reduces features?
I think it would be good to know, since feature importance is also affected by the estimator's performance.

You should not worry too much about performance at each stage, since Recursive XGBoost uses fewer and fewer features in its modeling. That means the actual performance in each round might be falling, but that is not what matters. What matters is knowing which of the remaining variables stands out as the most important. That's why I don't show the performance: it would give a misleading picture. If you don't believe this method will work for you, the best thing to do is to compare featurewiz with other methods and see which one does feature selection better. That is one way to find out.
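For example, one rough way to run such a comparison, assuming the dataframe is already fully numeric (e.g. one-hot encoded as sketched earlier) and that selected_features is a hypothetical list holding the column names featurewiz returned; the model settings and scoring metric are only illustrative:

from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# encoded_df: fully numeric dataframe; selected_features: columns chosen by featurewiz
X_all = encoded_df.drop(columns=['unique_offers_cut'])
y = encoded_df['unique_offers_cut'].astype(float)

model = XGBRegressor()

# Cross-validated error with all features versus only the selected ones
score_all = cross_val_score(model, X_all, y,
                            scoring='neg_mean_absolute_error', cv=5).mean()
score_sel = cross_val_score(model, X_all[selected_features], y,
                            scoring='neg_mean_absolute_error', cv=5).mean()

print('All features:', score_all, 'Selected features:', score_sel)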

If this answers your question, please consider closing this issue.
Hope this helps,
AutoViML

@fjpa121197
Author

That is understandable; I think I will compare the results with other techniques.

But overall, great tool. Thanks for the help and answering these questions!

Closing this.
