# $$CatBoost\ Tutorial$$

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/python_tutorial.ipynb)

In this tutorial we will explore some basic use cases of CatBoost, such as model training, cross-validation and prediction, as well as some useful features like early stopping, snapshot support, feature importances and parameter tuning.
  
You can run this tutorial in the Google Colaboratory environment with a free CPU or GPU. Just click on this <a href="https://colab.research.google.com/github/catboost/tutorials/blob/master/python_tutorial.ipynb" target="_blank" title="Colab">link</a>.

## $$Contents$$
* [1. Data Preparation](#$$1.\-Data\-Preparation$$)
    * [1.1 CatBoost installation](#1.1-CatBoost-installation)
    * [1.2 Data Loading](#1.2-Data-Loading)
    * [1.3 Feature Preparation](#1.3-Feature-Preparation)
    * [1.4 Data Splitting](#1.4-Data-Splitting)
* [2. CatBoost Basics](#$$2.\-CatBoost\-Basics$$)
    * [2.1 Model Training](#2.1-Model-Training)
    * [2.2 Model Cross-Validation](#2.2-Model-Cross-Validation)
    * [2.3 Model Applying](#2.3-Model-Applying)
* [3. CatBoost Features](#$$3.\-CatBoost\-Features$$)
    * [3.1 Using the best model](#3.1-Using-the-best-model)
    * [3.2 Early Stopping](#3.2-Early-Stopping)
    * [3.3 Using Baseline](#3.3-Using-Baseline)
    * [3.4 Snapshot Support](#3.4-Snapshot-Support)
    * [3.5 User Defined Objective Function](#3.5-User-Defined-Objective-Function)
    * [3.6 User Defined Metric Function](#3.6-User-Defined-Metric-Function)
    * [3.7 Staged Predict](#3.7-Staged-Predict)
    * [3.8 Feature Importances](#3.8-Feature-Importances)
    * [3.9 Eval Metrics](#3.9-Eval-Metrics)
    * [3.10 Learning Processes Comparison](#3.10-Learning-Processes-Comparison)
    * [3.11 Model Saving](#3.11-Model-Saving)
* [4. Parameters Tuning](#$$4.\-Parameters\-Tuning$$)

## $$1.\ Data\ Preparation$$
### 1.1 CatBoost installation
If you have not already installed CatBoost, you can do so by running the `!pip install catboost` command.  
  
You should also install the ipywidgets package and enable the widgets notebook extension before launching Jupyter Notebook, so that plots can be drawn.

In [None]:
!pip install catboost
!pip install scikit-learn
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension

### 1.2 Data Loading
The data for this tutorial can be obtained from [this page](https://www.kaggle.com/c/titanic/data) (you will have to register a Kaggle account or log in with Facebook or Google+), or you can use `catboost.datasets` as in the code below.

In [1]:
from catboost.datasets import titanic
import numpy as np

train_df, test_df = titanic()

train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### 1.3 Feature Preparation
First of all, let's check how many missing values we have:

In [2]:
null_value_stats = train_df.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

Age         177
Cabin       687
Embarked      2
dtype: int64

As we can see, **`Age`**, **`Cabin`** and **`Embarked`** indeed have some missing values, so let's fill them with a number far outside their distributions - that way the model can easily distinguish missing values from real ones and take them into account:

In [3]:
train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)

Now let's separate features and label variable:

In [4]:
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

Note that our features are of different types - some are numeric, some are categorical, and some are just strings, which normally have to be handled in a specific way (for example, encoded with a bag-of-words representation). But in our case we can treat these string features simply as categorical ones - all the heavy lifting is done inside CatBoost. How cool is that? :)

In [5]:
print(X.dtypes)

categorical_features_indices = np.where(X.dtypes != float)[0]

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
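
To make the selection rule above concrete, here is a small sketch in plain Python that mirrors `np.where(X.dtypes != float)[0]` using the dtype listing printed above. The `dtypes` list is copied from that output, and the variable names are illustrative only. Note that under this rule the integer columns (`PassengerId`, `Pclass`, `SibSp`, `Parch`) are treated as categorical as well, not only the string columns.

```python
# Column names and dtypes copied from the output of print(X.dtypes) above.
dtypes = [
    ("PassengerId", "int64"), ("Pclass", "int64"), ("Name", "object"),
    ("Sex", "object"), ("Age", "float64"), ("SibSp", "int64"),
    ("Parch", "int64"), ("Ticket", "object"), ("Fare", "float64"),
    ("Cabin", "object"), ("Embarked", "object"),
]

# Every column that is not float64 is selected, mirroring `X.dtypes != float`.
categorical_indices = [i for i, (_, dtype) in enumerate(dtypes) if dtype != "float64"]
categorical_names = [dtypes[i][0] for i in categorical_indices]

print(categorical_indices)
print(categorical_names)
```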


### 1.4 Data Splitting
Let's split the train data into training and validation sets.

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)

X_test = test_df

## $$2.\ CatBoost\ Basics$$

Let's make necessary imports.

In [7]:
from catboost import CatBoostClassifier, Pool, metrics, cv
from sklearn.metrics import accuracy_score

### 2.1 Model Training
Now let's create the model itself. We will use the default parameters here, as they provide a _really_ good baseline almost all the time. The only thing we specify is the `custom_loss` parameter, which lets us track accuracy (the metric of this competition) alongside logloss; logloss is worth watching as well because it changes more smoothly than accuracy on a dataset of this size.

In [13]:
model = CatBoostClassifier(
    custom_loss=[metrics.Accuracy()],
    random_seed=42,
    #logging_level='Silent',
)

In [14]:
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_validation, y_validation),
#     logging_level='Verbose',  # you can uncomment this for text output
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Learning rate set to 0.028683
0:	learn: 0.6739988	test: 0.6742630	best: 0.6742630 (0)	total: 21.2ms	remaining: 21.2s
1:	learn: 0.6589013	test: 0.6592240	best: 0.6592240 (1)	total: 32.5ms	remaining: 16.2s
2:	learn: 0.6421502	test: 0.6426778	best: 0.6426778 (2)	total: 49.1ms	remaining: 16.3s
3:	learn: 0.6297276	test: 0.6302310	best: 0.6302310 (3)	total: 62ms	remaining: 15.4s
4:	learn: 0.6147184	test: 0.6198228	best: 0.6198228 (4)	total: 77.1ms	remaining: 15.3s
5:	learn: 0.6017730	test: 0.6073627	best: 0.6073627 (5)	total: 91.8ms	remaining: 15.2s
6:	learn: 0.5885309	test: 0.5956000	best: 0.5956000 (6)	total: 106ms	remaining: 15s
7:	learn: 0.5783200	test: 0.5858523	best: 0.5858523 (7)	total: 119ms	remaining: 14.8s
8:	learn: 0.5665895	test: 0.5743842	best: 0.5743842 (8)	total: 135ms	remaining: 14.8s
9:	learn: 0.5575381	test: 0.5662283	best: 0.5662283 (9)	total: 152ms	remaining: 15s
10:	learn: 0.5491045	test: 0.5575176	best: 0.5575176 (10)	total: 166ms	remaining: 14.9s
11:	learn: 0.5423887	t

As you can see, it is possible to watch our model learn through verbose output or with nice plots (personally I would definitely go with the second option - just check out those plots: you can, for example, zoom in on areas of interest!)

With this we can see that the best accuracy value of **0.8296** (on the validation set) was achieved at boosting step **150**.
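
The remark about logloss being smoother than accuracy can be illustrated with a toy calculation. The sketch below uses made-up labels and probabilities (not the Titanic data) and two hypothetical helper functions. It shows that predictions can become more confident, and logloss can improve, while accuracy stays exactly the same.

```python
import math

def accuracy(labels, probs, threshold=0.5):
    """Share of examples whose thresholded prediction matches the label."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)

def logloss(labels, probs):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = sum(-math.log(p if y == 1 else 1.0 - p) for y, p in zip(labels, probs))
    return total / len(labels)

# Made-up labels and two sets of predicted probabilities for class 1.
labels = [1, 0, 1, 0]
probs_before = [0.60, 0.40, 0.55, 0.45]
probs_after = [0.80, 0.20, 0.70, 0.30]  # more confident, same decisions

acc_before, acc_after = accuracy(labels, probs_before), accuracy(labels, probs_after)
ll_before, ll_after = logloss(labels, probs_before), logloss(labels, probs_after)

print(acc_before, acc_after)  # accuracy does not move
print(ll_before, ll_after)    # logloss decreases
```

On a small validation set accuracy can only change in steps of one object, whereas logloss reacts to every change in the predicted probabilities.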

### 2.2 Model Cross-Validation

It is good to validate your model, but cross-validating it is even better. And also with plots! So without further ado:

In [15]:
cv_params = model.get_params()
cv_params.update({
    'loss_function': metrics.Logloss()
})
cv_data = cv(
    Pool(X, y, cat_features=categorical_features_indices),
    cv_params,
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

Training on fold [0/3]
0:	learn: 0.6773731	test: 0.6767121	best: 0.6767121 (0)	total: 20.2ms	remaining: 20.2s
1:	learn: 0.6606181	test: 0.6602677	best: 0.6602677 (1)	total: 38.9ms	remaining: 19.4s
2:	learn: 0.6480805	test: 0.6465190	best: 0.6465190 (2)	total: 50.7ms	remaining: 16.9s
3:	learn: 0.6340568	test: 0.6322194	best: 0.6322194 (3)	total: 65.7ms	remaining: 16.4s
4:	learn: 0.6222855	test: 0.6199084	best: 0.6199084 (4)	total: 74.5ms	remaining: 14.8s
5:	learn: 0.6110496	test: 0.6085780	best: 0.6085780 (5)	total: 87.5ms	remaining: 14.5s
6:	learn: 0.5978812	test: 0.5974018	best: 0.5974018 (6)	total: 105ms	remaining: 14.8s
7:	learn: 0.5874380	test: 0.5868837	best: 0.5868837 (7)	total: 118ms	remaining: 14.7s
8:	learn: 0.5778792	test: 0.5766659	best: 0.5766659 (8)	total: 130ms	remaining: 14.3s
9:	learn: 0.5685231	test: 0.5682471	best: 0.5682471 (9)	total: 149ms	remaining: 14.7s
10:	learn: 0.5597293	test: 0.5601873	best: 0.5601873 (10)	total: 164ms	remaining: 14.8s
11:	learn: 0.5507892	te

Now we have the values of our loss functions at each boosting step averaged over 3 folds, which should give us a more accurate estimate of our model's performance:

In [20]:
print('Best validation accuracy score: {:.2f}±{:.2f} on step {}'.format(
    np.max(cv_data['test-Accuracy-mean']),
    cv_data['test-Accuracy-std'][np.argmax(cv_data['test-Accuracy-mean'])],
    np.argmax(cv_data['test-Accuracy-mean'])
))

Best validation accuracy score: 0.83±0.02 on step 355


In [21]:
print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))

Precise validation accuracy score: 0.8294051627384961


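As we can see, the cross-validated accuracy (0.8294, with a standard deviation of about 0.02 across folds) is slightly lower than the 0.8296 we got on the single validation split, and the spread between folds shows how much a single split can mislead - that is why cross-validation is so important!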
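
As a rough sketch of what the `test-Accuracy-mean` and `test-Accuracy-std` columns represent, the snippet below averages per-fold scores at each boosting step and picks the best step. The fold scores are hypothetical numbers chosen for illustration, and using the sample standard deviation is an assumption here, not a statement about how CatBoost computes it.

```python
import statistics

# Hypothetical validation accuracy per boosting step for 3 folds (illustrative only).
fold_scores = [
    [0.78, 0.81, 0.83, 0.82],  # fold 0
    [0.80, 0.82, 0.84, 0.85],  # fold 1
    [0.76, 0.80, 0.82, 0.80],  # fold 2
]

# Regroup the scores so that each tuple holds all folds for one boosting step.
per_step = list(zip(*fold_scores))
means = [statistics.mean(step) for step in per_step]
stds = [statistics.stdev(step) for step in per_step]  # sample std (assumption)

# The best step is the one with the highest mean across folds.
best_step = max(range(len(means)), key=means.__getitem__)
print('Best mean accuracy: {:.3f}±{:.3f} on step {}'.format(
    means[best_step], stds[best_step], best_step
))
```

Note that the best step is chosen on the averaged curve, so it does not have to coincide with the best step of any individual fold (fold 1 above peaks one step later).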

### 2.3 Model Applying
All you have to do to get predictions is:

In [22]:
predictions = model.predict(X_test)
predictions_probs = model.predict_proba(X_test)
print(predictions[:10])
print(predictions_probs[:10])

[0 0 0 0 1 0 1 0 1 0]
[[0.85473931 0.14526069]
 [0.76313031 0.23686969]
 [0.88972889 0.11027111]
 [0.87876173 0.12123827]
 [0.3611047  0.6388953 ]
 [0.90513381 0.09486619]
 [0.33434185 0.66565815]
 [0.78468564 0.21531436]
 [0.39429048 0.60570952]
 [0.94047549 0.05952451]]
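
The two outputs above are related in a simple way. As a minimal sketch, assuming a decision threshold of 0.5 on the probability of class 1, the class labels can be recovered from the second column of the `predict_proba` output. The probabilities below are copied from the printed output.

```python
# Probability of class 1 (second column) copied from the predict_proba output above.
class1_probs = [0.14526069, 0.23686969, 0.11027111, 0.12123827, 0.6388953,
                0.09486619, 0.66565815, 0.21531436, 0.60570952, 0.05952451]

# Assumed decision rule: predict class 1 when its probability exceeds 0.5.
labels = [int(p > 0.5) for p in class1_probs]
print(labels)
```

The resulting list matches the first ten values printed by `model.predict(X_test)`.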


But let's try to get better predictions, and CatBoost features will help us with that.

## $$3.\ CatBoost\ Features$$
You may have noticed that at the model creation step I specified not only `custom_loss` but also the `random_seed` parameter. That was done to make this notebook reproducible. When no seed is passed, the seed that CatBoost actually used can be read from the `random_seed_` attribute:

In [23]:
model_without_seed = CatBoostClassifier(iterations=10, logging_level='Silent')
model_without_seed.fit(X, y, cat_features=categorical_features_indices)

print('Random seed assigned for this model: {}'.format(model_without_seed.random_seed_))

Random seed assigned for this model: 0


Let's define some params and create a `Pool` for convenience. It stores all information about the dataset (features, labels, categorical feature indices, weights and much more).

In [32]:
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'eval_metric': metrics.Accuracy(),
    'random_seed': 42,
    'logging_level': 'Silent',
    'use_best_model': False
}
train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)

### 3.1 Using the best model
If you have a validation set, it is usually better to use the `use_best_model` parameter during training. By default, this parameter is enabled when a validation set is provided. When it is enabled, the resulting tree ensemble is shrunk to the best iteration.

In [33]:
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)

best_model_params = params.copy()
best_model_params.update({
    'use_best_model': True
})
best_model = CatBoostClassifier(**best_model_params)
best_model.fit(train_pool, eval_set=validate_pool);

print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')

print('Best model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, best_model.predict(X_validation))
))

Simple model validation accuracy: 0.7982

Best model validation accuracy: 0.8251
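
Conceptually, shrinking to the best iteration means finding the boosting step with the best validation metric and discarding every tree built after it. The sketch below illustrates this with a hypothetical accuracy curve; it is not CatBoost's internal implementation.

```python
# Hypothetical validation accuracy after each boosting iteration (illustrative numbers).
validation_accuracy = [0.78, 0.80, 0.83, 0.82, 0.81, 0.80]

# The best iteration is the one with the highest validation metric (first one on ties).
best_iteration = max(range(len(validation_accuracy)), key=validation_accuracy.__getitem__)

# Keeping the best model means keeping trees 0..best_iteration and dropping the rest.
trees_kept = best_iteration + 1
trees_dropped = len(validation_accuracy) - trees_kept

print(best_iteration, trees_kept, trees_dropped)
```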


### 3.2 Early Stopping
If you have a validation set, it is usually easier and better to use early stopping. This feature is similar to the previous one, but in addition to improving quality it also saves training time.

In [42]:
%%time
model = CatBoostClassifier(**params)
model.fit(train_pool, eval_set=validate_pool)

CPU times: user 9.42 s, sys: 368 ms, total: 9.79 s
Wall time: 1.1 s


<catboost.core.CatBoostClassifier at 0x7fa2be087f70>

In [43]:
%%time
earlystop_params = params.copy()
earlystop_params.update({
    'od_type': 'Iter',
    'od_wait': 40
})
earlystop_model = CatBoostClassifier(**earlystop_params)
earlystop_model.fit(train_pool, eval_set=validate_pool);

CPU times: user 1.52 s, sys: 108 ms, total: 1.62 s
Wall time: 195 ms


<catboost.core.CatBoostClassifier at 0x7fa2be234a60>

In [44]:
print('Simple model tree count: {}'.format(model.tree_count_))
print('Simple model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, model.predict(X_validation))
))
print('')

print('Early-stopped model tree count: {}'.format(earlystop_model.tree_count_))
print('Early-stopped model validation accuracy: {:.4}'.format(
    accuracy_score(y_validation, earlystop_model.predict(X_validation))
))

Simple model tree count: 500
Simple model validation accuracy: 0.7982

Early-stopped model tree count: 82
Early-stopped model validation accuracy: 0.8072


So we get better quality in a shorter time.

Although, as was shown earlier, a simple validation scheme does not precisely describe the model's out-of-train score (it may be biased by the dataset split), it is still useful for tracking how the model improves. As this example shows, it is really good to stop the boosting process early, before overfitting kicks in.
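
The stopping rule configured with `od_type='Iter'` and `od_wait=40` can be pictured with a simplified simulation. The function below is a hypothetical stand-in, not CatBoost's detector: it stops once `wait` iterations have passed since the best metric value. The metric curve is made up, and it was deliberately chosen to peak at iteration 41 so that the tree count comes out at 82; the real best iteration is not shown in the output above.

```python
def train_with_early_stopping(metric_per_iteration, wait):
    """Return (trees_built, best_iteration) for a metric that should be maximised.

    Simplified model of an 'Iter'-style detector: training stops once `wait`
    iterations have passed without a new best value.
    """
    best_value, best_iteration = float('-inf'), -1
    for iteration, value in enumerate(metric_per_iteration):
        if value > best_value:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= wait:
            return iteration + 1, best_iteration
    return len(metric_per_iteration), best_iteration

# Hypothetical curve: improves up to iteration 41, then slowly degrades.
curve = [0.70 + 0.002 * i for i in range(42)] + [0.78 - 0.0001 * i for i in range(458)]

trees_built, best_iteration = train_with_early_stopping(curve, wait=40)
print(trees_built, best_iteration)
```

Under this simplified rule, 418 of the 500 requested iterations are never run, which is where the time saving comes from.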

### 3.3 Using Baseline
It is possible to use pre-training results (baseline) for training.

In [45]:
current_params = params.copy()
current_params.update({
    'iterations': 10
})
model = CatBoostClassifier(**current_params).fit(
    X_train, y_train, cat_features=categorical_features_indices
)
# Get baseline (only with prediction_type='RawFormulaVal')
baseline = model.predict(X_train, prediction_type='RawFormulaVal')
# Fit new model, starting from the baseline values
model.fit(X_train, y_train, cat_features=categorical_features_indices, baseline=baseline);
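
One way to picture the baseline is as a per-object starting raw score (log-odds for a binary classifier) to which the newly trained trees add their corrections. The sketch below illustrates that idea with hypothetical numbers in plain Python; the variable names are illustrative, and it is a conceptual picture rather than a description of CatBoost internals.

```python
import math

def sigmoid(x):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw scores from a first model, one per training object.
baseline = [-1.2, 0.3, 2.0]

# Hypothetical corrections learned by a second model trained on top of the baseline.
second_model_raw = [0.4, -0.1, 0.5]

# The combined raw score is the sum; probabilities come from applying the sigmoid.
combined_raw = [b + r for b, r in zip(baseline, second_model_raw)]
probabilities = [sigmoid(s) for s in combined_raw]

print(combined_raw)
print(probabilities)
```

This is why the baseline is taken with `prediction_type='RawFormulaVal'`: the values have to be on the raw-score scale, not probabilities or class labels.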

### 3.4 Snapshot Support
CatBoost supports snapshots. You can use them to resume training after an interruption or to start training from previous results.

In [46]:
params_with_snapshot = params.copy()
params_with_snapshot.update({
    'iterations': 5,
    'learning_rate': 0.5,
    'logging_level': 'Verbose'
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)
params_with_snapshot.update({
    'iterations': 10,
    'learning_rate': 0.1,
})
model = CatBoostClassifier(**params_with_snapshot).fit(train_pool, eval_set=validate_pool, save_snapshot=True)

0:	learn: 0.8053892	test: 0.7937220	best: 0.7937220 (0)	total: 1.39ms	remaining: 5.54ms
1:	learn: 0.8008982	test: 0.7982063	best: 0.7982063 (1)	total: 3ms	remaining: 4.5ms
2:	learn: 0.8008982	test: 0.7937220	best: 0.7982063 (1)	total: 4.13ms	remaining: 2.75ms
3:	learn: 0.8113772	test: 0.7892377	best: 0.7982063 (1)	total: 5.52ms	remaining: 1.38ms
4:	learn: 0.8173653	test: 0.8026906	best: 0.8026906 (4)	total: 6.67ms	remaining: 0us

bestTest = 0.802690583
bestIteration = 4

5:	learn: 0.8173653	test: 0.8026906	best: 0.8026906 (4)	total: 8.14ms	remaining: 5.89ms
6:	learn: 0.8248503	test: 0.8026906	best: 0.8026906 (4)	total: 9.64ms	remaining: 4.46ms
7:	learn: 0.8233533	test: 0.8026906	best: 0.8026906 (4)	total: 10.8ms	remaining: 2.76ms
8:	learn: 0.8233533	test: 0.8026906	best: 0.8026906 (4)	total: 11.4ms	remaining: 1.19ms
9:	learn: 0.8233533	test: 0.8026906	best: 0.8026906 (4)	total: 12.7ms	remaining: 0us

bestTest = 0.802690583
bestIteration = 4



### 3.5 User Defined Objective Function
It is possible to create your own objective function. Let's create a logloss objective function.

In [None]:
# for performance reasons it is better to install `numba` package for working with user defined functions
!pip install numba

In [49]:
class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        # approxes, targets, weights are indexed containers of floats
        # (containers which have only __len__ and __getitem__ defined).
        # weights parameter can be None.
        #
        # To understand what these parameters mean, assume that there is
        # a subset of your dataset that is currently being processed.
        # approxes contains current predictions for this subset,
        # targets contains target values you provided with the dataset.
        #
        # This function should return a list of pairs (der1, der2), where
        # der1 is the first derivative of the loss function with respect
        # to the predicted value, and der2 is the second derivative.
        #
        # In our case, logloss is defined by the following formula:
        # target * log(sigmoid(approx)) + (1 - target) * log(1 - sigmoid(approx))
        # where sigmoid(x) = 1 / (1 + e^(-x)).
        
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)
        
        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = (1 - p) if targets[index] > 0.0 else -p
            der2 = -p * (1 - p)

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result
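
The derivatives returned by `calc_ders_range` can be checked numerically. The sketch below compares them with central finite differences of the log-likelihood `target * log(p) + (1 - target) * log(1 - p)`, where `p = sigmoid(approx)`. The helper names are hypothetical, and weights are left out for brevity.

```python
import math

def log_likelihood(approx, target):
    """target * log(p) + (1 - target) * log(1 - p), with p = sigmoid(approx)."""
    p = 1.0 / (1.0 + math.exp(-approx))
    return target * math.log(p) + (1.0 - target) * math.log(1.0 - p)

def analytic_ders(approx, target):
    """First and second derivatives, written as in calc_ders_range above."""
    e = math.exp(approx)
    p = e / (1 + e)
    der1 = (1 - p) if target > 0.0 else -p
    der2 = -p * (1 - p)
    return der1, der2

def numeric_ders(approx, target, h=1e-4):
    """Central finite differences of the log-likelihood."""
    f_plus = log_likelihood(approx + h, target)
    f_zero = log_likelihood(approx, target)
    f_minus = log_likelihood(approx - h, target)
    return (f_plus - f_minus) / (2 * h), (f_plus - 2 * f_zero + f_minus) / (h * h)

# Compare both versions on a few raw scores and both target values.
max_error = 0.0
for approx in (-2.0, -0.5, 0.0, 1.5):
    for target in (0.0, 1.0):
        a1, a2 = analytic_ders(approx, target)
        n1, n2 = numeric_ders(approx, target)
        max_error = max(max_error, abs(a1 - n1), abs(a2 - n2))

print(max_error)
```

The first derivative simplifies to `target - p` and the second to `-p * (1 - p)`, which is always negative, so the expression being differentiated is concave in `approx`.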

In [52]:
model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function=LoglossObjective(), 
    eval_metric=metrics.Logloss()
)
# Fit model
model.fit(train_pool)
# Only prediction_type='RawFormulaVal' is allowed with custom `loss_function`
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

0:	learn: 0.6827074	total: 16.7ms	remaining: 150ms
1:	learn: 0.6723302	total: 30.8ms	remaining: 123ms
2:	learn: 0.6619449	total: 43.3ms	remaining: 101ms
3:	learn: 0.6521466	total: 60.3ms	remaining: 90.5ms
4:	learn: 0.6435227	total: 73.4ms	remaining: 73.4ms
5:	learn: 0.6353848	total: 87.8ms	remaining: 58.5ms
6:	learn: 0.6277210	total: 99.6ms	remaining: 42.7ms
7:	learn: 0.6210282	total: 111ms	remaining: 27.8ms
8:	learn: 0.6141958	total: 123ms	remaining: 13.6ms
9:	learn: 0.6073236	total: 135ms	remaining: 0us


### 3.6 User Defined Metric Function
It is also possible to create your own metric function. Let's create a logloss metric function.

In [53]:
class LoglossMetric(object):
    def get_final_error(self, error, weight):
        return error / (weight + 1e-38)

    def is_max_optimal(self):
        return False

    def evaluate(self, approxes, target, weight):
        # approxes is a list of indexed containers
        # (containers with only __len__ and __getitem__ defined),
        # one container per approx dimension.
        # Each container contains floats.
        # weight is a one dimensional indexed container.
        # target is a one dimensional indexed container of floats.
        
        # weight parameter can be None.
        # Returns pair (error, weights sum)
        
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])

        approx = approxes[0]

        error_sum = 0.0
        weight_sum = 0.0

        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            weight_sum += w
            error_sum += -w * (target[i] * approx[i] - np.log(1 + np.exp(approx[i])))

        return error_sum, weight_sum
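
The per-object term in `evaluate` is written in terms of the raw score rather than a probability. As a quick check, the sketch below (with hypothetical helper names and made-up inputs) compares it with the textbook form `-(t * log(p) + (1 - t) * log(1 - p))`, where `p = sigmoid(approx)`.

```python
import math

def logloss_from_raw(target, approx):
    """Per-object term as written in LoglossMetric.evaluate above."""
    return -(target * approx - math.log(1 + math.exp(approx)))

def logloss_from_probability(target, approx):
    """Textbook form: -(t * log(p) + (1 - t) * log(1 - p)), with p = sigmoid(approx)."""
    p = 1.0 / (1.0 + math.exp(-approx))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

# Illustrative (target, raw score) pairs, not taken from the Titanic data.
pairs = [(1, 2.0), (0, 2.0), (1, -0.7), (0, -0.7), (1, 0.0)]
largest_gap = max(
    abs(logloss_from_raw(t, a) - logloss_from_probability(t, a)) for t, a in pairs
)
print(largest_gap)
```

The two forms agree up to floating-point rounding, so the custom metric measures the same quantity as the built-in logloss formula, only expressed on the raw-score scale.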

In [54]:
model = CatBoostClassifier(
    iterations=10,
    random_seed=42, 
    loss_function=metrics.Logloss(),
    eval_metric=LoglossMetric()
)
# Fit model
model.fit(train_pool)
# RawFormulaVal is used here to mirror the previous example; the restriction noted there concerns a custom `loss_function`, which this model does not use
preds_raw = model.predict(X_test, prediction_type='RawFormulaVal')

Learning rate set to 0.5
0:	learn: 0.5521578	total: 6.34ms	remaining: 57.1ms
1:	learn: 0.4885686	total: 11.9ms	remaining: 47.5ms
2:	learn: 0.4607664	total: 17.4ms	remaining: 40.5ms
3:	learn: 0.4418819	total: 22.6ms	remaining: 33.9ms
4:	learn: 0.4278162	total: 28.4ms	remaining: 28.4ms
5:	learn: 0.4151036	total: 35.1ms	remaining: 23.4ms
6:	learn: 0.4099336	total: 40.6ms	remaining: 17.4ms
7:	learn: 0.4095363	total: 45.8ms	remaining: 11.5ms
8:	learn: 0.4032867	total: 51.3ms	remaining: 5.7ms
9:	learn: 0.3929586	total: 56.6ms	remaining: 0us


### 3.7 Staged Predict
A CatBoost model has a `staged_predict` method. It allows you to iteratively get predictions for a given range of trees.

In [55]:
model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
ntree_start, ntree_end, eval_period = 3, 9, 2
predictions_iterator = model.staged_predict(validate_pool, 'Probability', ntree_start, ntree_end, eval_period)
for preds, tree_count in zip(predictions_iterator, range(ntree_start, ntree_end, eval_period)):
    print('First class probabilities using the first {} trees: {}'.format(tree_count, preds[:5, 1]))

First class probabilities using the first 3 trees: [0.53597869 0.41039128 0.42057479 0.64281031 0.46576685]
First class probabilities using the first 5 trees: [0.63722688 0.42492029 0.46209302 0.70926021 0.44280772]
First class probabilities using the first 7 trees: [0.66964764 0.42409144 0.46124982 0.76101033 0.47205986]


### 3.8 Feature Importances
Sometimes it is very important to understand which feature made the greatest contribution to the final result. To do this, the CatBoost model has a `get_feature_importance` method.

In [56]:
model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print('{}: {}'.format(name, score))

Sex: 59.0040920142686
Pclass: 16.340887169747038
Ticket: 6.028107169932206
Cabin: 3.8347242202560192
Fare: 3.712969667934385
Age: 3.4844512041824824
Parch: 3.378089740355865
Embarked: 2.313999407289956
SibSp: 1.902679406033451
PassengerId: 0.0
Name: 0.0


This shows that features **`Sex`** and **`Pclass`** had the biggest influence on the result.
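
The printed scores can be checked with a few lines of plain Python. The sketch below copies the values from the output above, confirms that they add up to 100 (so each score can be read as a percentage of the total), and computes the combined share of the two strongest features.

```python
# Scores copied from the get_feature_importance output above.
importances = {
    'Sex': 59.0040920142686, 'Pclass': 16.340887169747038, 'Ticket': 6.028107169932206,
    'Cabin': 3.8347242202560192, 'Fare': 3.712969667934385, 'Age': 3.4844512041824824,
    'Parch': 3.378089740355865, 'Embarked': 2.313999407289956, 'SibSp': 1.902679406033451,
    'PassengerId': 0.0, 'Name': 0.0,
}

total = sum(importances.values())

# Names of the two features with the highest scores.
top_two = sorted(importances, key=importances.get, reverse=True)[:2]
top_two_share = sum(importances[name] for name in top_two)

print('Total: {:.4f}'.format(total))
print('Top two: {} with a combined share of {:.2f}'.format(top_two, top_two_share))
```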

### 3.9 Eval Metrics
CatBoost has an `eval_metrics` method that calculates the given metrics on a given dataset. And draws them, of course :)

In [57]:
model = CatBoostClassifier(iterations=50, random_seed=42, logging_level='Silent').fit(train_pool)
eval_metrics = model.eval_metrics(validate_pool, [metrics.AUC()], plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [58]:
print(eval_metrics['AUC'][:6])

[0.8627368774106994, 0.8623176253563642, 0.8602213650846889, 0.8514170719436525, 0.8495723629045783, 0.8569092738554419]
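
For readers less familiar with AUC, the sketch below computes it from its pairwise definition: the share of (positive, negative) pairs in which the positive object gets the higher score. The function name and the data are illustrative, and this quadratic-time version is meant for understanding only, not as the way CatBoost computes the metric.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Illustrative labels and predicted scores (not taken from the Titanic data).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]

auc = roc_auc(labels, scores)
print(auc)
```

Here five of the six positive-negative pairs are ordered correctly; only the positive scored 0.6 falls below the negative scored 0.7.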


### 3.10 Learning Processes Comparison
You can also compare the learning processes of different models on a single plot.

In [61]:
model1 = CatBoostClassifier(iterations=100, depth=1, train_dir='model_depth_1/', logging_level='Silent')
model1.fit(train_pool, eval_set=validate_pool)
model2 = CatBoostClassifier(iterations=100, depth=5, train_dir='model_depth_5/', logging_level='Silent')
model2.fit(train_pool, eval_set=validate_pool);

In [62]:
from catboost import MetricVisualizer
widget = MetricVisualizer(['model_depth_1', 'model_depth_5'])
widget.start()

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

### 3.11 Model Saving
It is always really handy to be able to dump your model to disk (especially if training took some time).

In [63]:
model = CatBoostClassifier(iterations=10, random_seed=42, logging_level='Silent').fit(train_pool)
model.save_model('catboost_model.dump')
model = CatBoostClassifier()
model.load_model('catboost_model.dump');

## $$4.\ Parameters\ Tuning$$
While you can always select the optimal number of iterations (boosting steps) using cross-validation and learning curve plots, it is also important to tune some of the model parameters. Here we pay special attention to `l2_leaf_reg` and `learning_rate`.

In this section, we'll select these parameters using the **`hyperopt`** package.

In [None]:
!pip install hyperopt

In [76]:
import hyperopt

def hyperopt_objective(params):
    model = CatBoostClassifier(
        l2_leaf_reg=int(params['l2_leaf_reg']),
        learning_rate=params['learning_rate'],
        iterations=500,
        eval_metric=metrics.Accuracy(),
        random_seed=42,
        verbose=False,
        loss_function=metrics.Logloss(),
    )
    
    cv_data = cv(
        Pool(X, y, cat_features=categorical_features_indices),
        model.get_params(),
        logging_level='Silent',
    )
    best_accuracy = np.max(cv_data['test-Accuracy-mean'])
    
    return 1 - best_accuracy # as hyperopt minimises

In [78]:
from numpy.random import RandomState

params_space = {
    'l2_leaf_reg': hyperopt.hp.qloguniform('l2_leaf_reg', 0, 2, 1),
    'learning_rate': hyperopt.hp.uniform('learning_rate', 1e-3, 5e-1),
}

trials = hyperopt.Trials()

best = hyperopt.fmin(
    hyperopt_objective,
    space=params_space,
    algo=hyperopt.tpe.suggest,
    max_evals=50,
    trials=trials,
    rstate=RandomState(123)  # note: newer hyperopt releases may expect np.random.default_rng(123) instead
)

print(best)

100%|██████████| 50/50 [03:59<00:00,  4.79s/trial, best loss: 0.16386083052749711]
{'l2_leaf_reg': 1.0, 'learning_rate': 0.0450866712211308}
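
It may help to see which values the `l2_leaf_reg` search space can actually produce. According to the hyperopt documentation, `hp.qloguniform(label, low, high, q)` returns a value like `round(exp(uniform(low, high)) / q) * q`. The sketch below mimics that rule with the standard library only; `sample_qloguniform` is a hypothetical helper, not part of hyperopt.

```python
import math
import random

def sample_qloguniform(rng, low, high, q):
    """Mimics the documented rule round(exp(uniform(low, high)) / q) * q."""
    return round(math.exp(rng.uniform(low, high)) / q) * q

rng = random.Random(123)  # fixed seed so the sketch is repeatable

# Same bounds and step as hp.qloguniform('l2_leaf_reg', 0, 2, 1) above.
samples = [sample_qloguniform(rng, 0, 2, 1) for _ in range(10000)]
observed_values = sorted(set(samples))

print(observed_values)
```

With `low=0`, `high=2` and `q=1`, the search only ever tries whole numbers between 1 and 7 (since `exp(2)` is about 7.39), with smaller values drawn more often than larger ones.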


Now let's get all cv data with the best parameters:

In [79]:
model = CatBoostClassifier(
    l2_leaf_reg=int(best['l2_leaf_reg']),
    learning_rate=best['learning_rate'],
    iterations=500,
    eval_metric=metrics.Accuracy(),
    random_seed=42,
    verbose=False,
    loss_function=metrics.Logloss(),
)
cv_data = cv(Pool(X, y, cat_features=categorical_features_indices), model.get_params())

Training on fold [0/3]

bestTest = 0.8417508418
bestIteration = 262

Training on fold [1/3]

bestTest = 0.8451178451
bestIteration = 269

Training on fold [2/3]

bestTest = 0.8215488215
bestIteration = 284



In [80]:
print('Precise validation accuracy score: {}'.format(np.max(cv_data['test-Accuracy-mean'])))

Precise validation accuracy score: 0.8361391694725029


Recall that with default parameters our cv score was 0.8294, so we have some improvement (though probably not a statistically significant one).

### Make submission
Now we will re-train our tuned model on all the training data that we have:

In [81]:
model.fit(X, y, cat_features=categorical_features_indices)

<catboost.core.CatBoostClassifier at 0x7fa2c200feb0>

And finally let's prepare the submission file:

In [82]:
import pandas as pd
submission = pd.DataFrame()
submission['PassengerId'] = X_test['PassengerId']
submission['Survived'] = model.predict(X_test)

In [83]:
submission.to_csv('submission.csv', index=False)

Finally, you can make a submission to the [Titanic Kaggle competition](https://www.kaggle.com/c/titanic).

That's it! Now you can play around with CatBoost and win some competitions! :)