
Add CatBoost component and pipeline #247

Merged Feb 13, 2020

Commits (102)
25c105f
cat init wip
angela97lin Dec 3, 2019
8f0e936
imports and all
angela97lin Dec 3, 2019
e0e5d8c
update
angela97lin Dec 3, 2019
dac6333
pipelibe init
angela97lin Dec 4, 2019
03c773b
install
angela97lin Dec 4, 2019
d1e39f8
test
angela97lin Dec 4, 2019
c9c92e1
adding line to remove catboost output
angela97lin Dec 4, 2019
7f0f1f9
merging master
angela97lin Dec 4, 2019
9208514
cat cat
angela97lin Dec 4, 2019
80cc9e2
lint
angela97lin Dec 4, 2019
697a20f
adding regression + cat cleanup
angela97lin Dec 4, 2019
3eb43e4
lint
angela97lin Dec 4, 2019
2dbf9f2
fixing feature importance test
angela97lin Dec 4, 2019
08186f2
fixing str
angela97lin Dec 4, 2019
8f5ee81
adding to autobase
angela97lin Dec 4, 2019
b2b3f4b
Merge branch 'master' into cat
angela97lin Dec 5, 2019
da2546a
heavily WIP, testing feature selector
angela97lin Dec 5, 2019
9660f58
Merge branch 'cat' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 5, 2019
c6bc671
adding num_features for now to pipeline
angela97lin Dec 5, 2019
2e26550
Merge branch 'master' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 5, 2019
15ca3bd
Merge branch 'master' into cat
angela97lin Dec 5, 2019
7446fc3
cleaning up errors
angela97lin Dec 5, 2019
40fa7e9
Merge branch 'cat' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 5, 2019
403ce03
removing from autobase to test 3.5 error
angela97lin Dec 6, 2019
b889244
Merge branch 'master' into cat
angela97lin Dec 6, 2019
91c24f5
tested locally and ok, retrying
angela97lin Dec 6, 2019
3b42d30
Merge branch 'cat' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 6, 2019
99a4ebe
merging master
angela97lin Dec 9, 2019
d8da113
Merge branch 'master' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 10, 2019
c40de4b
Merge branch 'master' into cat
angela97lin Dec 10, 2019
001d9dc
Merge branch 'cat' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 10, 2019
a16de3f
fixing fit()
angela97lin Dec 10, 2019
031d985
more fiddling
angela97lin Dec 10, 2019
9c58e34
fixing multiclass
angela97lin Dec 10, 2019
8b767b0
remove constant impute strategy
angela97lin Dec 10, 2019
eec71f9
Merge branch 'master' into cat
angela97lin Dec 10, 2019
425b1a8
testing 3.5.7 on circleci
angela97lin Dec 10, 2019
6375ad1
Merge branch 'cat' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 10, 2019
c1e6387
more testing: set n_jobs
angela97lin Dec 11, 2019
0010a49
more testing :(
angela97lin Dec 11, 2019
60daead
oops forgot to move
angela97lin Dec 11, 2019
c9fa6d3
removing remove files for now for debug
angela97lin Dec 11, 2019
1b7ebdc
more testing...
angela97lin Dec 11, 2019
c06837d
readd rm files testing
angela97lin Dec 11, 2019
1222588
moving rm files to auto
angela97lin Dec 11, 2019
b7b50d4
revert, go back to using 3.5.9
angela97lin Dec 11, 2019
dfb8de8
readd n_jobs
angela97lin Dec 11, 2019
1a8e4c8
readd n_jobs
angela97lin Dec 11, 2019
e51b45e
... n_jobs gotta be 1 for logistic?
angela97lin Dec 11, 2019
bfeb9ea
more testing
angela97lin Dec 11, 2019
08cf82c
revert lr
angela97lin Dec 11, 2019
9c65982
decreasing parameters
angela97lin Dec 11, 2019
6a0e8e7
removing rm files
angela97lin Dec 11, 2019
0bca241
linting + logistic
angela97lin Dec 11, 2019
06b8ee9
bump max_depth
angela97lin Dec 11, 2019
da3947e
decrease max_depth...
angela97lin Dec 11, 2019
2d0f368
bumping n_estimators
angela97lin Dec 11, 2019
8d1f3c1
bump container size and max_depth
angela97lin Dec 11, 2019
e955350
decreasing image size to see where the limits are
angela97lin Dec 11, 2019
0006528
testing removing catboost_info after fitting?
angela97lin Dec 11, 2019
0559332
hmmm... moving rm stuff to pipeline
angela97lin Dec 12, 2019
4a27e39
can't remove catboost_info :c
angela97lin Dec 12, 2019
b7be6c1
readding shutil, using allow_writing=False
angela97lin Dec 12, 2019
2b36424
removing catboost feature selectors
angela97lin Dec 12, 2019
c69e777
merging
angela97lin Dec 12, 2019
222d93c
forgot to revert n_jobs in auto_base
angela97lin Dec 12, 2019
4fb3e09
making libaries optional
angela97lin Dec 12, 2019
db99bbb
oops forgot to import
angela97lin Dec 12, 2019
2ca522e
util tests from featuretools :3
angela97lin Dec 13, 2019
3d0b801
Merge branch 'master' into cat
angela97lin Dec 13, 2019
706541f
fix tests
angela97lin Dec 13, 2019
360274d
Merge branch 'cat' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 13, 2019
b66a1a2
revert circleci resource size change
angela97lin Dec 13, 2019
d6a2f7b
add fxn to test dependencies
angela97lin Dec 13, 2019
720373d
linting
angela97lin Dec 16, 2019
2fd4d29
linting, had to update linter?
angela97lin Dec 16, 2019
becfdb6
Merge branch 'master' into cat
angela97lin Dec 16, 2019
6999dee
fixing
angela97lin Dec 16, 2019
040d858
Merge branch 'cat' of github.com:FeatureLabs/evalml into cat
angela97lin Dec 16, 2019
2d29206
Merge branch 'master' into cat
angela97lin Dec 20, 2019
eadcc21
Merge branch 'master' into cat
angela97lin Dec 26, 2019
208bfa0
fixing merge conflicts
angela97lin Jan 27, 2020
cf8276c
lint
angela97lin Jan 27, 2020
b7c1ab8
docstrings
angela97lin Jan 27, 2020
f5dc5d5
adding bootstrap_type
angela97lin Jan 27, 2020
f26568b
cleanup
angela97lin Jan 28, 2020
ef905b3
typo and missed param
angela97lin Jan 28, 2020
f405c32
changing requirements
angela97lin Jan 28, 2020
505cdf6
cleanup + test removing select object for fit in cat
angela97lin Jan 28, 2020
7b4c88d
Merge branch 'master' into cat
angela97lin Jan 29, 2020
d2966ec
readding object dtype
angela97lin Jan 29, 2020
b6eed83
Merge branch 'cat' of github.com:FeatureLabs/evalml into cat
angela97lin Jan 29, 2020
e754434
addressing more comments: remove util fxn and cleaning up tests
angela97lin Jan 29, 2020
10c6946
lint
angela97lin Jan 29, 2020
6815c49
write files flag works :)
angela97lin Jan 29, 2020
d576a1d
remove rm tree fxn in catboost
angela97lin Jan 29, 2020
e1e335f
removing unnecessary import
angela97lin Jan 29, 2020
e132a3d
adding
angela97lin Feb 7, 2020
2c637ac
merging master
angela97lin Feb 7, 2020
95c8c2f
addressing PR comments
angela97lin Feb 10, 2020
3ab49b9
merging
angela97lin Feb 12, 2020
758a577
Merge branch 'master' into cat
angela97lin Feb 12, 2020
1 change: 1 addition & 0 deletions docs/source/changelog.rst
@@ -5,6 +5,7 @@ Changelog
**Future Releases**
* Enhancements
* Added emacs buffers to .gitignore :pr:`350`
* Add CatBoost (gradient-boosted trees) classification and regression components and pipelines :pr:`247`
* Fixes
* Fixed ROC and confusion matrix plots not being calculated if user passed own additional_objectives :pr:`276`
* Changes
4 changes: 3 additions & 1 deletion evalml/model_types/model_types.py
@@ -6,9 +6,11 @@ class ModelTypes(Enum):
RANDOM_FOREST = 'random_forest'
XGBOOST = 'xgboost'
LINEAR_MODEL = 'linear_model'
CATBOOST = 'catboost'

def __str__(self):
model_type_dict = {ModelTypes.RANDOM_FOREST.name: "Random Forest",
ModelTypes.XGBOOST.name: "XGBoost Classifier",
ModelTypes.LINEAR_MODEL.name: "Linear Model"}
ModelTypes.LINEAR_MODEL.name: "Linear Model",
ModelTypes.CATBOOST.name: "CatBoost Classifier"}
return model_type_dict[self.name]
13 changes: 10 additions & 3 deletions evalml/pipelines/__init__.py
@@ -14,16 +14,23 @@
FeatureSelector,
CategoricalEncoder,
RFClassifierSelectFromModel,
RFRegressorSelectFromModel
RFRegressorSelectFromModel,
CatBoostClassifier,
CatBoostRegressor
)

from .pipeline_base import PipelineBase
from .classification import (
LogisticRegressionPipeline,
RFClassificationPipeline,
XGBoostPipeline
XGBoostPipeline,
CatBoostClassificationPipeline,
)
from .regression import (
LinearRegressionPipeline,
RFRegressionPipeline,
CatBoostRegressionPipeline
)
from .regression import LinearRegressionPipeline, RFRegressionPipeline
from .utils import (
get_pipelines,
list_model_types,
1 change: 1 addition & 0 deletions evalml/pipelines/classification/__init__.py
@@ -2,3 +2,4 @@
from .logistic_regression import LogisticRegressionPipeline
from .random_forest import RFClassificationPipeline
from .xgboost import XGBoostPipeline
from .catboost import CatBoostClassificationPipeline
39 changes: 39 additions & 0 deletions evalml/pipelines/classification/catboost.py
@@ -0,0 +1,39 @@
from skopt.space import Integer, Real

from evalml.model_types import ModelTypes
from evalml.pipelines import PipelineBase
from evalml.pipelines.components import CatBoostClassifier, SimpleImputer
from evalml.problem_types import ProblemTypes


class CatBoostClassificationPipeline(PipelineBase):
"""
CatBoost Pipeline for both binary and multiclass classification.
CatBoost is an open-source library and natively supports categorical features.

For more information, check out https://catboost.ai/
"""
name = "CatBoost Classifier w/ Simple Imputer"
model_type = ModelTypes.CATBOOST
problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]
hyperparameters = {
"impute_strategy": ["most_frequent"],
"n_estimators": Integer(10, 1000),
"eta": Real(0, 1),
"max_depth": Integer(1, 8),
}

def __init__(self, objective, impute_strategy, n_estimators,
eta, max_depth, number_features, bootstrap_type=None,
n_jobs=1, random_state=0):
# note: impute_strategy must support both string and numeric data
imputer = SimpleImputer(impute_strategy=impute_strategy)
Contributor:
What do you think the benefits are of using CatBoost for encoding over our one-hot-encoding component? I agree that this is the better implementation due to simplicity and probably more optimization within catboost but I was wondering what you were thinking 😄

Contributor:
Yeah, catboost claims they "[allow] you to use non-numeric factors, instead of having to pre-process your data."

@jeremyliweishih, was this a general thought which came up, or was it related to this code? My read of this was that it sets up SimpleImputer, but does not use our OneHotEncoder currently, which feels like the right call for now. I ask just to make sure I'm not missing something subtle. And if any of us have a suspicion that a different pipeline configuration would be better, I'd suggest we file a ticket along the lines of "Performance: compare catboost pipelines using OneHotEncoder vs native catboost"

Contributor:
@dsherry Think it just stemmed from not using our OneHotEncoder and I agree it should be the right call. Was mainly curious if catboost provided explanations on how it enhances speed or performance.

Contributor:
Cool. Me too; I haven't read up on it enough to answer well

Contributor (author):
I haven't been able to find too much describing the speed or performance boost, but here are two links from the catboost doc regarding this topic. The first link has a big attention box stating, "Attention. Do not use one-hot encoding during preprocessing. This affects both the training speed and the resulting quality." (... but unfortunately doesn't state how / why) 😟

Categorical features
Transforming categorical features to numerical features

Contributor:
Cool, thanks for sharing that. Strange that they didn't provide more context to that warning.

Contributor:
This also leads me to think that we may need more logic for SimpleImputer. I'm not too sure about the current behavior, but I would think that if we select mean or median with non-numeric columns there would be issues.

Contributor:
Good point. Filed #314 to track that

estimator = CatBoostClassifier(n_estimators=n_estimators,
eta=eta,
max_depth=max_depth,
bootstrap_type=bootstrap_type,
random_state=random_state)
super().__init__(objective=objective,
component_list=[imputer, estimator],
n_jobs=1,
random_state=random_state)
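The SimpleImputer concern raised in the thread above can be seen with a minimal stdlib sketch. The `simple_impute` helper here is hypothetical (it is not evalml's actual SimpleImputer): a mean only exists for numeric columns, while most_frequent works for any value type.

```python
import statistics

def simple_impute(column, strategy):
    """Hypothetical single-column imputer illustrating why a 'mean'
    strategy cannot be applied to non-numeric data (cf. issue #314)."""
    present = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(present)  # raises TypeError on strings
    elif strategy == "most_frequent":
        fill = statistics.mode(present)  # works for any hashable values
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]

print(simple_impute([1.0, None, 3.0], "mean"))                # [1.0, 2.0, 3.0]
print(simple_impute(["a", "b", None, "b"], "most_frequent"))  # ['a', 'b', 'b', 'b']
try:
    simple_impute(["a", "b", None], "mean")
except TypeError:
    print("mean strategy fails on non-numeric columns")
```

This is why the pipeline's hyperparameter space pins `impute_strategy` to `["most_frequent"]`, the one strategy safe for both string and numeric data.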
4 changes: 3 additions & 1 deletion evalml/pipelines/components/__init__.py
@@ -7,7 +7,9 @@
LogisticRegressionClassifier,
RandomForestClassifier,
RandomForestRegressor,
XGBoostClassifier
XGBoostClassifier,
CatBoostClassifier,
CatBoostRegressor
)
from .transformers import (
Transformer,
6 changes: 4 additions & 2 deletions evalml/pipelines/components/estimators/__init__.py
@@ -2,6 +2,8 @@
from .estimator import Estimator
from .classifiers import (LogisticRegressionClassifier,
RandomForestClassifier,
XGBoostClassifier)
XGBoostClassifier,
CatBoostClassifier)
from .regressors import (LinearRegressor,
RandomForestRegressor)
RandomForestRegressor,
CatBoostRegressor)
@@ -2,3 +2,4 @@
from .logistic_regression import LogisticRegressionClassifier
from .rf_classifier import RandomForestClassifier
from .xgboost_classifier import XGBoostClassifier
from .catboost_classifier import CatBoostClassifier
@@ -0,0 +1,67 @@
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from skopt.space import Integer, Real

from evalml.model_types import ModelTypes
from evalml.pipelines.components import ComponentTypes
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes
from evalml.utils import import_or_raise


class CatBoostClassifier(Estimator):
"""
CatBoost Classifier, a classifier that uses gradient-boosting on decision trees.
CatBoost is an open-source library and natively supports categorical features.

For more information, check out https://catboost.ai/
"""
name = "CatBoost Classifier"
component_type = ComponentTypes.CLASSIFIER
_needs_fitting = True
hyperparameter_ranges = {
"n_estimators": Integer(10, 1000),
"eta": Real(0, 1),
"max_depth": Integer(1, 16),
}
model_type = ModelTypes.CATBOOST
problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

def __init__(self, n_estimators=1000, eta=0.03, max_depth=6, bootstrap_type=None, random_state=0):
parameters = {"n_estimators": n_estimators,
"eta": eta,
"max_depth": max_depth}
if bootstrap_type is not None:
parameters['bootstrap_type'] = bootstrap_type

cb_error_msg = "catboost is not installed. Please install using `pip install catboost.`"
catboost = import_or_raise("catboost", error_msg=cb_error_msg)
self._label_encoder = None
cb_classifier = catboost.CatBoostClassifier(**parameters,
silent=True,
allow_writing_files=False)
super().__init__(parameters=parameters,
component_obj=cb_classifier,
random_state=random_state)

def fit(self, X, y=None):
cat_cols = X.select_dtypes(['category', 'object'])

# For binary classification, catboost expects numeric values, so encoding before.
if y.nunique() <= 2:
self._label_encoder = LabelEncoder()
y = pd.Series(self._label_encoder.fit_transform(y))
model = self._component_obj.fit(X, y, silent=True, cat_features=cat_cols)
return model

def predict(self, X):
predictions = self._component_obj.predict(X)
if self._label_encoder:
return self._label_encoder.inverse_transform(predictions.astype(np.int64))

return predictions

@property
def feature_importances(self):
return self._component_obj.get_feature_importance()
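The fit/predict pair above encodes binary string labels to integers before training and decodes predictions back afterwards. The round trip works along the lines of this sketch, where `TinyLabelEncoder` is a plain-Python stand-in for sklearn's LabelEncoder, not the component's actual code:

```python
class TinyLabelEncoder:
    """Minimal stand-in for sklearn.preprocessing.LabelEncoder."""

    def fit_transform(self, y):
        self.classes_ = sorted(set(y))           # e.g. ['no', 'yes']
        index = {label: i for i, label in enumerate(self.classes_)}
        return [index[label] for label in y]     # numeric targets for catboost

    def inverse_transform(self, codes):
        # map integer predictions back to the original string labels
        return [self.classes_[int(c)] for c in codes]

enc = TinyLabelEncoder()
y = ["yes", "no", "no", "yes"]
encoded = enc.fit_transform(y)           # [1, 0, 0, 1]
decoded = enc.inverse_transform(encoded)
print(encoded, decoded == y)             # [1, 0, 0, 1] True
```

Storing the fitted encoder on `self._label_encoder` is what lets `predict` return labels in the caller's original vocabulary.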
Contributor:
Hmm, I see that the Estimator base class does return self._component_obj.feature_importances_. Can we delete the override here in CatBoostClassifier? Maybe this code is out of date, because I don't see get_feature_importance in the repo right now

Contributor (author):
Hmm, code should still be there but we need to override feature_importances in the CatBoost class because the Estimator base class way of accessing the feature importance is very sklearn specific (that is, you can access the feature importance of an estimator via feature_importances_ for all sklearn objects). This is CatBoost's way of exposing and calculating the feature importances :)

Contributor:
Ah I got confused and forgot this is defined by catboost. Cool
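The override discussed here is a small adapter: the Estimator base class assumes sklearn's `feature_importances_` attribute, while catboost exposes a `get_feature_importance()` method. A schematic sketch of that pattern, using dummy classes rather than evalml's real ones:

```python
class SklearnLikeEstimator:
    # sklearn convention: importances live on a trailing-underscore attribute
    feature_importances_ = [0.7, 0.3]

class CatBoostLikeModel:
    # catboost convention: importances come from a method call
    def get_feature_importance(self):
        return [0.6, 0.4]

class EstimatorBase:
    def __init__(self, component_obj):
        self._component_obj = component_obj

    @property
    def feature_importances(self):
        # default path: works for any sklearn-style estimator
        return self._component_obj.feature_importances_

class CatBoostEstimator(EstimatorBase):
    @property
    def feature_importances(self):
        # adapt catboost's method to the shared property interface
        return self._component_obj.get_feature_importance()

print(EstimatorBase(SklearnLikeEstimator()).feature_importances)   # [0.7, 0.3]
print(CatBoostEstimator(CatBoostLikeModel()).feature_importances)  # [0.6, 0.4]
```

Callers see one uniform `feature_importances` property either way, which is why the override belongs in the CatBoost subclass rather than in shared code.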

@@ -1,10 +1,10 @@
from skopt.space import Integer, Real
from xgboost import XGBClassifier

from evalml.model_types import ModelTypes
from evalml.pipelines.components import ComponentTypes
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes
from evalml.utils import import_or_raise


class XGBoostClassifier(Estimator):
@@ -26,11 +26,13 @@ def __init__(self, eta=0.1, max_depth=3, min_child_weight=1, n_estimators=100, r
"max_depth": max_depth,
"min_child_weight": min_child_weight,
"n_estimators": n_estimators}
xgb_classifier = XGBClassifier(random_state=random_state,
eta=eta,
max_depth=max_depth,
n_estimators=n_estimators,
min_child_weight=min_child_weight)
xgb_error_msg = "XGBoost is not installed. Please install using `pip install xgboost.`"
xgb = import_or_raise("xgboost", error_msg=xgb_error_msg)
xgb_classifier = xgb.XGBClassifier(random_state=random_state,
eta=eta,
max_depth=max_depth,
n_estimators=n_estimators,
min_child_weight=min_child_weight)
super().__init__(parameters=parameters,
component_obj=xgb_classifier,
random_state=random_state)
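Both CatBoost and XGBoost are now imported lazily through evalml's `import_or_raise`, so the libraries stay optional dependencies. A plausible stdlib sketch of such a helper (the real implementation may differ):

```python
import importlib

def import_or_raise(library, error_msg=None):
    """Import `library` by name, raising a friendlier ImportError
    carrying `error_msg` when it is not installed (sketch only)."""
    try:
        return importlib.import_module(library)
    except ImportError:
        raise ImportError(error_msg or f"{library} is not installed.")

math_mod = import_or_raise("math")
print(math_mod.sqrt(16))  # 4.0
try:
    import_or_raise("definitely_not_installed_xyz", "please pip install it")
except ImportError as e:
    print(e)  # please pip install it
```

Because the import happens inside `__init__` rather than at module load, users without catboost or xgboost installed can still import the rest of evalml.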
@@ -1,4 +1,4 @@
# flake8:noqa
from .linear_regressor import LinearRegressor
from .rf_regressor import RandomForestRegressor

from .catboost_regressor import CatBoostRegressor
@@ -0,0 +1,61 @@
from skopt.space import Integer, Real

from evalml.model_types import ModelTypes
from evalml.pipelines.components import ComponentTypes
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes
from evalml.utils import import_or_raise


class CatBoostRegressor(Estimator):
"""
CatBoost Regressor, a regressor that uses gradient-boosting on decision trees.
CatBoost is an open-source library and natively supports categorical features.

For more information, check out https://catboost.ai/
"""
name = "CatBoost Regressor"
component_type = ComponentTypes.REGRESSOR
_needs_fitting = True
hyperparameter_ranges = {
"n_estimators": Integer(10, 1000),
"eta": Real(0, 1),
"max_depth": Integer(1, 16),
}
model_type = ModelTypes.CATBOOST
problem_types = [ProblemTypes.REGRESSION]

def __init__(self, n_estimators=1000, eta=0.03, max_depth=6, bootstrap_type=None, random_state=0):
parameters = {"n_estimators": n_estimators,
"eta": eta,
"max_depth": max_depth}
if bootstrap_type is not None:
parameters['bootstrap_type'] = bootstrap_type

cb_error_msg = "catboost is not installed. Please install using `pip install catboost.`"
catboost = import_or_raise("catboost", error_msg=cb_error_msg)
cb_regressor = catboost.CatBoostRegressor(**parameters,
random_state=random_state,
silent=True,
allow_writing_files=False)
super().__init__(parameters=parameters,
component_obj=cb_regressor,
random_state=random_state)

def fit(self, X, y=None):
"""Build a model

Arguments:
X (pd.DataFrame or np.array): the input training data of shape [n_samples, n_features]
y (pd.Series): the target training labels of length [n_samples]

Returns:
self
"""
cat_cols = X.select_dtypes(['object', 'category'])
model = self._component_obj.fit(X, y, silent=True, cat_features=cat_cols)
return model

@property
def feature_importances(self):
return self._component_obj.get_feature_importance()
@@ -29,11 +29,9 @@ def __init__(self, number_features=None, n_estimators=10, max_depth=None,
n_estimators=n_estimators,
max_depth=max_depth,
n_jobs=n_jobs)
feature_selection = SkSelect(
estimator=estimator,
max_features=max_features,
threshold=threshold
)
feature_selection = SkSelect(estimator=estimator,
max_features=max_features,
threshold=threshold)

super().__init__(parameters=parameters,
component_obj=feature_selection,
@@ -29,11 +29,9 @@ def __init__(self, number_features=None, n_estimators=10, max_depth=None,
n_estimators=n_estimators,
max_depth=max_depth,
n_jobs=n_jobs)
feature_selection = SkSelect(
estimator=estimator,
max_features=max_features,
threshold=threshold
)
feature_selection = SkSelect(estimator=estimator,
max_features=max_features,
threshold=threshold)

super().__init__(parameters=parameters,
component_obj=feature_selection,
2 changes: 1 addition & 1 deletion evalml/pipelines/pipeline_base.py
@@ -263,7 +263,7 @@ def score(self, X, y, other_objectives=None):

@property
def feature_importances(self):
"""Return feature importances. Feature dropped by feaure selection are excluded"""
"""Return feature importances. Features dropped by feature selection are excluded"""
feature_names = self.input_feature_names[self.estimator.name]
importances = list(zip(feature_names, self.estimator.feature_importances)) # note: this only works for binary
importances.sort(key=lambda x: -abs(x[1]))
1 change: 1 addition & 0 deletions evalml/pipelines/regression/__init__.py
@@ -1,3 +1,4 @@
# flake8:noqa
from .linear_regression import LinearRegressionPipeline
from .random_forest import RFRegressionPipeline
from .catboost import CatBoostRegressionPipeline
39 changes: 39 additions & 0 deletions evalml/pipelines/regression/catboost.py
@@ -0,0 +1,39 @@
from skopt.space import Integer, Real

from evalml.model_types import ModelTypes
from evalml.pipelines import PipelineBase
from evalml.pipelines.components import CatBoostRegressor, SimpleImputer
from evalml.problem_types import ProblemTypes


class CatBoostRegressionPipeline(PipelineBase):
"""
CatBoost Pipeline for regression problems.
CatBoost is an open-source library and natively supports categorical features.

For more information, check out https://catboost.ai/
"""
name = "CatBoost Regressor w/ Simple Imputer"
model_type = ModelTypes.CATBOOST
problem_types = [ProblemTypes.REGRESSION]
hyperparameters = {
"impute_strategy": ["most_frequent"],
"n_estimators": Integer(10, 1000),
"eta": Real(0, 1),
"max_depth": Integer(1, 8),
}

def __init__(self, objective, impute_strategy, n_estimators, eta,
max_depth, number_features, bootstrap_type=None,
n_jobs=-1, random_state=0):
# note: impute_strategy must support both string and numeric data
imputer = SimpleImputer(impute_strategy=impute_strategy)
estimator = CatBoostRegressor(n_estimators=n_estimators,
eta=eta,
max_depth=max_depth,
bootstrap_type=bootstrap_type,
random_state=random_state)
super().__init__(objective=objective,
component_list=[imputer, estimator],
n_jobs=1,
random_state=random_state)
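Both new pipelines hand PipelineBase a `component_list` of `[imputer, estimator]`. Conceptually, the base class fits each transformer in order and the estimator last, along these lines (a schematic sketch with dummy components, not evalml's PipelineBase):

```python
class MiniPipeline:
    """Schematic two-stage pipeline: transformers, then a final estimator."""

    def __init__(self, component_list):
        *self.transformers, self.estimator = component_list

    def fit(self, X, y):
        for t in self.transformers:
            X = t.fit_transform(X)   # e.g. the SimpleImputer step
        self.estimator.fit(X, y)     # e.g. the CatBoost estimator
        return self

    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)
        return self.estimator.predict(X)

class FillNone:
    """Toy imputer: replace missing cells with 0."""
    def fit_transform(self, X):
        return self.transform(X)
    def transform(self, X):
        return [[0 if v is None else v for v in row] for row in X]

class MeanLabel:
    """Toy estimator: always predict the mean training label."""
    def fit(self, X, y):
        self.value = sum(y) / len(y)
    def predict(self, X):
        return [self.value] * len(X)

pipe = MiniPipeline([FillNone(), MeanLabel()])
pipe.fit([[1, None], [3, 4]], [10, 20])
print(pipe.predict([[None, 2]]))  # [15.0]
```

The two-component list is deliberately short here: CatBoost handles categorical features natively, so no OneHotEncoder or feature selector precedes the estimator.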
3 changes: 2 additions & 1 deletion evalml/pipelines/regression/linear_regression.py
@@ -21,7 +21,8 @@ class LinearRegressionPipeline(PipelineBase):
'fit_intercept': [False, True]
}

def __init__(self, objective, random_state, number_features, impute_strategy, normalize=False, fit_intercept=True, n_jobs=-1):
def __init__(self, objective, number_features, impute_strategy,
normalize=False, fit_intercept=True, random_state=0, n_jobs=-1):

imputer = SimpleImputer(impute_strategy=impute_strategy)
enc = OneHotEncoder()