Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Label encoder for the case where y is 1-D. #18

Merged
merged 19 commits into from
Feb 6, 2021
Merged

Conversation

NiMaZi
Copy link
Contributor

@NiMaZi NiMaZi commented Feb 3, 2021

Resolved issue #13

This is a very naive label encoder implemented with sklearn.preprocessing.LabelEncoder

  • single output (1-D) partial mode
  • single output (1-D) full mode
  • unit test

The label encoder that convert original labels into integers (0, 1, 2, ...)
Label encoder does not deal with partial mode yet.
@xuyxu
Copy link
Member

xuyxu commented Feb 4, 2021

Hi, thanks for the PR! Here are some of my thoughts on how to implement this feature request:

  • We override the _check_input function in the class CascadeForestClassifier, and conduct the check and transformation on input labels in this function;
  • When the input to _check_input function is training data (depend on whether y is not None), we first check the type of target using the function type_of_target. If the result is 'binary' or multiclass, we then use a LabelEncoder to transfrom the labels. Otherwise, we throw an error to tell the users that the labels are invalid.

We can first focus on the training part, and add the prediction part latter. My biggest concern on your implementation is that it may not be a good idea to build everything from scratch. Instead, use mature tools from Scikit-Learn would be better.

Feel free to ask me if you have any problem, or I did not deliver the meaning clearly. Let's cooperate with each other to complete this great feature ;-)

@NiMaZi
Copy link
Contributor Author

NiMaZi commented Feb 4, 2021

@xuyxu Hi, I fully agree with your suggestion.
Previously I didn't notice that sklearn is already part of the dependencies of deep-forest. That's why I figured I'd better write everything in numpy natively so I don't add an extra dependency to your package.
Now that it is no longer the problem, I will off course use existing implementation instead of reinventing the wheel :)

@xuyxu
Copy link
Member

xuyxu commented Feb 4, 2021

Great, it looks much better now. I will have a careful look tomorrow ;-) Thanks!

@NiMaZi
Copy link
Contributor Author

NiMaZi commented Feb 4, 2021

Hi @xuyxu I came up with a working branch.

I have some thoughts about your suggestion:

  1. I'm not sure about overriding _check_input. it has no return value, and fit does not save the labels y in any member variable either. If we override _check_input for label encoding, it cannot output the "encoded label", unless we add return values to it and adapt the base class. Personally I always try not to change base classes, so I chose to override fit and predict with two helper functions _encode_class_labels and _decode_class_labels, and added some utility variables around them.
  2. In order to make this work for "partial_mode", I think we need to save & load the "class label encoder" together with other "params" using save and load function. The next step for me is then to override save and load in CascadeForestClassifier.

@xuyxu
Copy link
Member

xuyxu commented Feb 5, 2021

Hi @NiMaZi, I have made some edits on your PR, mainly on the side of adding docstrings. Let me know if you are OK with them.

In addition, here are some of my thoughts on your latest comment:

  • About partial_mode:
    • I think whether the model is trained in the full-memory mode or partial mode has no effect on the labels transformation. We only need to take care that we should save related attributes on handling class labels in the function save() and load(). Without these attributes, the model cannot be correctly re-used after dumping and re-loading
    • I think adding the function _encode_class_labels and _decode_class_labels in CascadeForestClassifier is fine ;-)
  • About multi-output:
    • Supporting this needs many extra works, and I think we can ignore it currently.

If you are OK with my modifications, here are things to do next to complete this PR:

  • Modify the save() and load() function to make this feature works fine after model serialization and de-serialization
    • Since labels_are_encoded, type_of_target_, and label_encoder_ are not properties of BaseCascadeForest, you may need to use hasattr(self, "XXX") to judge that whether the save() is called from the CascadeForestClassifier
  • Add unit tests for this feature request (create another python file named test_model_input.py in the directory tests)
    • For example, we can load a dataset with the labels equipped with two versions: one is the integers (e.g., [0, 1, 2, ..., 9]), while another is strings (e.g., ["0", "1", ..., "9"]), we need to make sure that two models produced from two different kinds of labels should have the same behavior on the testing data, right?

@NiMaZi
Copy link
Contributor Author

NiMaZi commented Feb 5, 2021

Hi @xuyxu
The encoder and its testing should be working. But the black formatter somehow hates my code. :(((
I tried a couple of online formatting tools but they don't seem to help. Could you please run a formatting check on test_model_input.py from your side? Then I think the PR is complete.

@xuyxu
Copy link
Member

xuyxu commented Feb 5, 2021

Hi @xuyxu
The encoder and its testing should be working. But the black formatter somehow hates my code. :(((
I tried a couple of online formatting tools but they don't seem to help. Could you please run a formatting check on test_model_input.py from your side? Then I think the PR is complete.

Never mind, I will take a look latter. Thanks 😄

Copy link
Member

@xuyxu xuyxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xuyxu xuyxu merged commit ad030f4 into LAMDA-NJU:master Feb 6, 2021
@xuyxu
Copy link
Member

xuyxu commented Feb 6, 2021

Merged. Thanks for your contributions! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants