[QUESTION] untransform encoded categorical values and change type of problem #21
Comments
Hi @fjpa121197 👍
Hi @AutoViML, the last part, using XGBoost, gives the following output:
And outputs[0] contains only the target variable (as a dataframe). The first suggestion solved my problem, but when I look at the transformed dataset (the dataset with the selected features), I find my categorical variables encoded with an OrdinalEncoder. Is this the default way the XGBoost part finds the most important features? I'm not sure that assuming an ordinal relationship is appropriate for all categorical columns.
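(A minimal sketch, not featurewiz internals, of what ordinal encoding does and why the implied ordering can be inappropriate for nominal categories; the `property_type` column and its values are hypothetical:)

```python
# Sketch: ordinal encoding assigns an integer code per category, which
# implicitly imposes an order the data may not have.
# "property_type" and its values are hypothetical examples.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"property_type": ["Apartment", "House", "Condo", "House"]})
enc = OrdinalEncoder()
df["property_type_enc"] = enc.fit_transform(df[["property_type"]])
print(df)
# Codes are assigned alphabetically: Apartment=0, Condo=1, House=2,
# so a tree split like "property_type_enc < 2" groups Apartment with Condo.
print(enc.categories_)  # the code-to-label mapping
```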
Hi @fjpa121197 👍
Hi @AutoViML, that did solve my problem, and I was able to run the last part without problems, thanks! I do still have questions about this: is there any way to see whether the results differ when using one-hot encoding, and to see the actual features after encoding? For example, let's say I have a categorical column. After using featurewiz, the selected features are returned like this:
Is there any way to know the actual value, or which category it refers to?
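(For comparison, a minimal sketch of one-hot encoding with pandas, where the category value survives in the generated column name, so a selected feature stays self-describing; the column name is hypothetical:)

```python
# Sketch: one-hot encoding keeps the original category value in the
# generated column name. "property_type" is a hypothetical column.
import pandas as pd

df = pd.DataFrame({"property_type": ["Apartment", "House", "Condo"]})
dummies = pd.get_dummies(df, columns=["property_type"])
print(dummies.columns.tolist())
# ['property_type_Apartment', 'property_type_Condo', 'property_type_House']
```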
Hi @fjpa121197 👍
Hi @AutoViML, the first option sounds good to me! I can handle the inverse/untransformation of the columns with the output from featurewiz, and avoid assuming an ordinal relationship for my categorical features. Sorry for another question, but I'm really interested in, and amazed by, the automation part: is there any way to know the performance of the XGBoost estimator at the different stages where it reduces features?
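(A minimal sketch of the inverse/untransformation mentioned above, assuming the encoding was done with scikit-learn's OrdinalEncoder and the fitted encoder was kept around:)

```python
# Sketch: keep the fitted encoder so encoded columns can be mapped back
# to their original labels after feature selection.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"property_type": ["Apartment", "House", "Condo", "House"]})
enc = OrdinalEncoder()
encoded = enc.fit_transform(train[["property_type"]])

# ... feature selection runs on the encoded data ...

restored = enc.inverse_transform(encoded)  # back to the original labels
print(restored.ravel())  # ['Apartment' 'House' 'Condo' 'House']
```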
Hi @fjpa121197 👍
You should not worry too much about performance in each round, since Recursive XGBoost uses fewer and fewer features in its modeling. That means the actual performance in each round might be falling, but that is not what matters. What matters is knowing which of the remaining variables stands out as the most important. That's why I don't show the performance: it would give a misleading picture. If you don't believe this method will work for you, the best thing to do is to compare its results against other feature-selection techniques. If this answers your question, please consider closing this issue.
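(To make this concrete, here is a minimal sketch of recursive, importance-based selection with XGBoost; it illustrates the general idea only, not featurewiz's actual implementation, and prints a per-round CV score to show why the score naturally drifts as the feature pool shrinks:)

```python
# Sketch of recursive, importance-based feature selection with XGBoost.
# Illustrative only -- not featurewiz's implementation.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def recursive_select(X, y, keep_frac=0.5, min_features=5):
    cols = list(X.columns)
    while len(cols) > min_features:
        model = XGBRegressor(n_estimators=100, verbosity=0)
        model.fit(X[cols], y)
        # Per-round score: expect it to drift as the pool shrinks.
        score = cross_val_score(model, X[cols], y, cv=3).mean()
        print(f"{len(cols)} features, CV R^2 = {score:.3f}")
        # Keep the top fraction of features by importance.
        order = np.argsort(model.feature_importances_)[::-1]
        n_keep = max(min_features, int(len(cols) * keep_frac))
        cols = [cols[i] for i in order[:n_keep]]
    return cols
```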
That is understandable; I think I will compare the results with other techniques. But overall, a great tool. Thanks for the help and for answering these questions! Closing this.
Original issue (fjpa121197):
Hello, I'm testing featurewiz with a dataframe that has numerical and categorical variables, and a target variable that ranges from 0 to 55, with most of its values between 0 and 6.
My first question comes from the fact that when I run:
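(The exact command was not captured on the page; a typical invocation, following the featurewiz README, looks roughly like this, with the dataframe and target name as placeholders:)

```python
# Illustrative call only; the poster's exact command was not captured.
# "train_df" and "my_target" are placeholders.
from featurewiz import featurewiz

outputs = featurewiz(dataname=train_df, target="my_target",
                     corr_limit=0.70, verbose=2)
```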
Everything runs fine, but the final output is like this:
Is there any chance I can know what `property_type_1` refers to? Or at least have it transformed back to its original name?
On the other hand, is there any way to override the detected type of problem? I want to set it to a regression problem, but featurewiz is treating the target variable as multi-class classification (and the XGBoost part ends up not working).
Thanks
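(One possible workaround for the problem-type question, assuming featurewiz infers the task from the target's dtype and cardinality, as similar AutoViML tools do: cast an integer-valued target to float so it is read as continuous. This is an assumption, not a documented override:)

```python
# Assumption: an integer target with few unique values may be read as
# multi-class, so casting to float nudges detection toward regression.
# "train_df" and "my_target" are placeholders.
train_df["my_target"] = train_df["my_target"].astype(float)
```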