Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CatBoost component and pipeline #247
Add CatBoost component and pipeline #247
Changes from all commits
25c105f
8f0e936
e0e5d8c
dac6333
03c773b
d1e39f8
c9c92e1
7f0f1f9
9208514
80cc9e2
697a20f
3eb43e4
2dbf9f2
08186f2
8f5ee81
b2b3f4b
da2546a
9660f58
c6bc671
2e26550
15ca3bd
7446fc3
40fa7e9
403ce03
b889244
91c24f5
3b42d30
99a4ebe
d8da113
c40de4b
001d9dc
a16de3f
031d985
9c58e34
8b767b0
eec71f9
425b1a8
6375ad1
c1e6387
0010a49
60daead
c9fa6d3
1b7ebdc
c06837d
1222588
b7b50d4
dfb8de8
1a8e4c8
e51b45e
bfeb9ea
08cf82c
9c65982
6a0e8e7
0bca241
06b8ee9
da3947e
2d0f368
8d1f3c1
e955350
0006528
0559332
4a27e39
b7be6c1
2b36424
c69e777
222d93c
4fb3e09
db99bbb
2ca522e
3d0b801
706541f
360274d
b66a1a2
d6a2f7b
720373d
2fd4d29
becfdb6
6999dee
040d858
2d29206
eadcc21
208bfa0
cf8276c
b7c1ab8
f5dc5d5
f26568b
ef905b3
f405c32
505cdf6
7b4c88d
d2966ec
b6eed83
e754434
10c6946
6815c49
d576a1d
e1e335f
e132a3d
2c637ac
95c8c2f
3ab49b9
758a577
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think the benefits are of using CatBoost for encoding over our one-hot-encoding component? I agree that this is the better implementation due to simplicity and probably more optimization within catboost but I was wondering what you were thinking 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, catboost claims they "[allow] you to use non-numeric factors, instead of having to pre-process your data."
@jeremyliweishih, was this a general thought which came up, or was it related to this code? My read of this was that it sets up
SimpleImputer
, but does not use ourOneHotEncoder
currently, which feels like the right call for now. I ask just to make sure I'm not missing something subtle. And if any of us have a suspicion that a different pipeline configuration would be better, I'd suggest we file a ticket along the lines of "Performance: compare catboost pipelines using OneHotEncoder vs native catboost"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry Think it just stemmed from not using our
OneHotEncoder
and I agree it should be the right call. Was mainly curious if catboost provided explanations on how it enhances speed or performance.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Me too; I haven't read up on it enough to answer well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't been able to find too much describing the speed or performance boost, but here are two links from the catboost doc regarding this topic. The first link has a big attention box stating, "Attention. Do not use one-hot encoding during preprocessing. This affects both the training speed and the resulting quality." (... but unfortunately doesn't state how / why) 😟
Categorical features
Transforming categorical features to numerical features
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for sharing that. Strange that they didn't provide more context to that warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also leads me to think that we may need more logic for SimpleImputer. I'm not too sure about current behavior I would think that if we select mean or median with non-numeric columns there would be issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Filed #314 to track that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I see that the Estimator base class does
return self._component_obj.feature_importances_
. Can we delete the override here inCatBoostClassifier
? Maybe this code is out of date, because I don't seeget_feature_importance
in the repo right nowThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, code should still be there but we need to override feature_importances in the CatBoost class because the Estimator base class way of accessing the feature importance is very sklearn specific (that is, you can access the feature importance of an estimator via
feature_importances_
for all sklearn objects). This is CatBoost's way of exposing and calculating the feature importances :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I got confused and forgot this is defined by catboost. Cool