Label encoder for the case where y is 1-D. #18

NiMaZi · 2021-02-03T16:12:51Z

Resolved issue #13

This is a very naive label encoder implemented with sklearn.preprocessing.LabelEncoder

single output (1-D) partial mode
single output (1-D) full mode
unit test

The label encoder converts the original labels into integers (0, 1, 2, ...)

Label encoder does not deal with partial mode yet.

xuyxu · 2021-02-04T03:08:25Z

Hi, thanks for the PR! Here are some of my thoughts on how to implement this feature request:

We override the _check_input function in the class CascadeForestClassifier, and conduct the check and transformation on input labels in this function;
When the input to the _check_input function is training data (i.e., y is not None), we first check the type of target using the function type_of_target. If the result is 'binary' or 'multiclass', we then use a LabelEncoder to transform the labels. Otherwise, we raise an error to tell users that the labels are invalid.
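The two steps above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `check_and_encode_labels` is hypothetical, while `type_of_target` and `LabelEncoder` are the existing Scikit-Learn utilities mentioned above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.multiclass import type_of_target


def check_and_encode_labels(y):
    """Validate 1-D class labels and map them to integers 0..n_classes-1.

    Hypothetical helper for illustration; not part of deep-forest.
    """
    y = np.asarray(y)
    target_type = type_of_target(y)
    if target_type not in ("binary", "multiclass"):
        raise ValueError(
            f"Unsupported target type {target_type!r}; "
            "expected 'binary' or 'multiclass'."
        )
    encoder = LabelEncoder()
    y_encoded = encoder.fit_transform(y)
    # Return the fitted encoder so predictions can be decoded later.
    return y_encoded, encoder


y_encoded, encoder = check_and_encode_labels(["cat", "dog", "cat", "bird"])
```

Keeping the fitted encoder around is what later allows predictions to be mapped back to the original labels with `encoder.inverse_transform`.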

We can first focus on the training part, and add the prediction part later. My biggest concern with your implementation is that it may not be a good idea to build everything from scratch. Instead, using mature tools from Scikit-Learn would be better.

Feel free to ask me if you have any problems, or if I did not explain something clearly. Let's work together to complete this great feature ;-)

NiMaZi · 2021-02-04T09:06:33Z

@xuyxu Hi, I fully agree with your suggestion.
Previously I didn't notice that sklearn is already one of the dependencies of deep-forest. That's why I figured I'd better write everything natively in numpy, so that I wouldn't add an extra dependency to your package.
Now that this is no longer a problem, I will of course use the existing implementation instead of reinventing the wheel :)

There is a typo. :)

xuyxu · 2021-02-04T13:02:40Z

Great, it looks much better now. I will have a careful look tomorrow ;-) Thanks!

NiMaZi · 2021-02-04T13:11:21Z

Hi @xuyxu I came up with a working branch.

I have some thoughts about your suggestion:

I'm not sure about overriding _check_input. It has no return value, and fit does not save the labels y in any member variable either. If we override _check_input for label encoding, it cannot output the encoded labels unless we add return values to it and adapt the base class. Personally I always try not to change base classes, so I chose to override fit and predict with two helper functions, _encode_class_labels and _decode_class_labels, and added some utility variables around them.
In order to make this work for "partial_mode", I think we need to save and load the class label encoder together with the other "params" using the save and load functions. The next step for me is then to override save and load in CascadeForestClassifier.
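A rough sketch of the override approach described above, under stated assumptions: `ToyBaseModel` and `ToyClassifier` are hypothetical stand-ins (the real base class and classifier are far more involved), and only the helper names `_encode_class_labels` / `_decode_class_labels` and the attributes `labels_are_encoded` / `label_encoder_` come from this thread.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder


class ToyBaseModel:
    """Stand-in for a base class that only understands integer labels."""

    def fit(self, X, y):
        # Placeholder for real training: remember the most frequent label.
        self.majority_ = int(np.bincount(y).argmax())
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_, dtype=int)


class ToyClassifier(ToyBaseModel):
    """Wraps fit/predict with label encoding, leaving the base class untouched."""

    def __init__(self):
        self.label_encoder_ = LabelEncoder()
        self.labels_are_encoded = False

    def _encode_class_labels(self, y):
        self.labels_are_encoded = True
        return self.label_encoder_.fit_transform(y)

    def _decode_class_labels(self, y):
        return self.label_encoder_.inverse_transform(y)

    def fit(self, X, y):
        return super().fit(X, self._encode_class_labels(y))

    def predict(self, X):
        return self._decode_class_labels(super().predict(X))


X = np.zeros((4, 2))
model = ToyClassifier().fit(X, ["spam", "ham", "spam", "spam"])
predictions = model.predict(X[:2])
```

The base class never sees the original labels, so it does not need to change; the trade-off is that the encoder becomes extra state that has to be saved and restored with the model.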

xuyxu · 2021-02-05T08:09:10Z

Hi @NiMaZi, I have made some edits on your PR, mainly on the side of adding docstrings. Let me know if you are OK with them.

In addition, here are some of my thoughts on your latest comment:

About partial_mode:
- I think whether the model is trained in full-memory mode or partial mode has no effect on the label transformation. We only need to make sure that the attributes related to class-label handling are saved and restored in save() and load(). Without these attributes, the model cannot be correctly reused after dumping and reloading.
- I think adding the functions _encode_class_labels and _decode_class_labels to CascadeForestClassifier is fine ;-)
About multi-output:
- Supporting this requires a lot of extra work, and I think we can ignore it for now.
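To illustrate the point about save() and load() above, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the classes `BaseModel`, `Classifier`, and `Regressor`, the file name `params.pkl`, and the use of a plain dict in place of a fitted encoder are all hypothetical and do not reflect how deep-forest actually serializes models.

```python
import os
import pickle
import tempfile


class BaseModel:
    """Hypothetical stand-in for a base class shared by several estimators."""

    def __init__(self):
        self.n_layers_ = 0

    def save(self, dirname):
        os.makedirs(dirname, exist_ok=True)
        state = {"n_layers": self.n_layers_}
        # Only the classifier defines these attributes, so guard with hasattr.
        if hasattr(self, "labels_are_encoded"):
            state["labels_are_encoded"] = self.labels_are_encoded
            state["label_encoder"] = self.label_encoder_
        with open(os.path.join(dirname, "params.pkl"), "wb") as f:
            pickle.dump(state, f)

    def load(self, dirname):
        with open(os.path.join(dirname, "params.pkl"), "rb") as f:
            state = pickle.load(f)
        self.n_layers_ = state["n_layers"]
        if hasattr(self, "labels_are_encoded"):
            self.labels_are_encoded = state["labels_are_encoded"]
            self.label_encoder_ = state["label_encoder"]


class Classifier(BaseModel):
    def __init__(self):
        super().__init__()
        self.labels_are_encoded = False
        self.label_encoder_ = None  # a plain dict stands in for a fitted encoder


class Regressor(BaseModel):
    pass


with tempfile.TemporaryDirectory() as tmp:
    clf = Classifier()
    clf.n_layers_ = 3
    clf.labels_are_encoded = True
    clf.label_encoder_ = {"ham": 0, "spam": 1}
    clf.save(os.path.join(tmp, "clf"))

    restored = Classifier()
    restored.load(os.path.join(tmp, "clf"))

    reg = Regressor()
    reg.save(os.path.join(tmp, "reg"))  # no label attributes are written
    restored_reg = Regressor()
    restored_reg.load(os.path.join(tmp, "reg"))
```

Without the label-handling attributes in the saved state, a reloaded classifier would have no way to map its integer predictions back to the original labels.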

If you are OK with my modifications, here are things to do next to complete this PR:

Modify the save() and load() functions so that this feature works correctly after model serialization and de-serialization
- Since labels_are_encoded, type_of_target_, and label_encoder_ are not attributes of BaseCascadeForest, you may need to use hasattr(self, "XXX") to determine whether save() is called from CascadeForestClassifier
Add unit tests for this feature request (create another Python file named test_model_input.py in the directory tests)
- For example, we can load a dataset whose labels come in two versions: one as integers (e.g., [0, 1, 2, ..., 9]) and the other as strings (e.g., ["0", "1", ..., "9"]). We need to make sure that the two models trained on the two kinds of labels behave the same on the testing data, right?
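The unit test suggested above might look something like the sketch below. To keep it runnable without deep-forest installed, it uses Scikit-Learn's `DecisionTreeClassifier` as a stand-in for `CascadeForestClassifier`; the test name, the digits dataset, and the split parameters are likewise assumptions, not the contents of the actual test_model_input.py.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def test_int_and_str_labels_agree():
    # Integer labels 0..9 and their string versions "0".."9".
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, _ = train_test_split(
        X, y, test_size=0.3, random_state=0
    )

    # Stand-in estimator; the real test would use CascadeForestClassifier.
    model_int = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    model_str = DecisionTreeClassifier(random_state=0).fit(
        X_train, y_train.astype(str)
    )

    pred_int = model_int.predict(X_test)
    pred_str = model_str.predict(X_test)

    # Predictions come back in the label type the model was trained on ...
    assert pred_str.dtype.kind == "U"
    # ... and both models agree once the integer predictions are stringified.
    np.testing.assert_array_equal(pred_int.astype(str), pred_str)
```

Fixing `random_state` matters here: without it, two separately trained models could differ for reasons unrelated to label encoding.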

NiMaZi · 2021-02-05T13:39:52Z

Hi @xuyxu
The encoder and its testing should be working. But the black formatter somehow hates my code. :(((
I tried a couple of online formatting tools but they don't seem to help. Could you please run a formatting check on test_model_input.py from your side? Then I think the PR is complete.

xuyxu · 2021-02-05T13:45:17Z

Never mind, I will take a look later. Thanks 😄

xuyxu

LGTM

xuyxu · 2021-02-06T04:48:29Z

Merged. Thanks for your contributions! 👍

NiMaZi added 5 commits February 3, 2021 16:28

Add label encoder

ea89404

The label encoder converts the original labels into integers (0, 1, 2, ...)

check dtype and add comments

23d3992

Bug fix

96d9565

bug fix

f19bb92

Disable partial mode

9757e53

Label encoder does not deal with partial mode yet.

NiMaZi added 4 commits February 4, 2021 13:06

Label encoder with scikit-learn

45659e7

bug fix

4576389

Add utility vars in __init__()

103d9ee

Bug fix

2e8a5f8

There is a typo. :)

xuyxu added 3 commits February 5, 2021 14:11

Merge branch 'master' into master

d204ced

black formatting

8a66d4f

Update cascade.py

8d80685

NiMaZi added 5 commits February 5, 2021 13:21

modify save and load

03fda38

fix format

d5968e2

Add testing case for label encoder

f43187c

fix format

c4dd46c

fix format

9a5ca75

xuyxu added 2 commits February 6, 2021 12:14

black formatting

93c9b6c

add CHANGELOG.rst

62542e1

xuyxu reviewed Feb 6, 2021

xuyxu merged commit ad030f4 into LAMDA-NJU:master Feb 6, 2021

xuyxu mentioned this pull request Feb 7, 2021

Can CascadeForestClassifier.predict() return the original class labels instead of 0/1? #13

Closed
