# CatBoost basics

For this homework we will use the Amazon Employee Access Challenge dataset from a [Kaggle](https://www.kaggle.com) competition for our experiments. The data can be downloaded [here](https://www.kaggle.com/c/amazon-employee-access-challenge/data).

As a result of this tutorial you need to provide a TSV file with answers.
There are 17 questions in this tutorial. The resulting TSV file should consist of 17 lines; each line should contain the question number and the answer to it, separated by a tab. Questions are numbered from 1 to 17.
See an example of the resulting file here.
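
The file layout described above can be sketched as follows. This is only a minimal illustration of the format (question number, tab, answer), not the grader's own code; the `answers` dictionary, its placeholder values and the helper name `write_answers_tsv` are hypothetical.

```python
import csv
import io

# Hypothetical placeholder answers keyed by question number (1..17).
answers = {number: "answer_{}".format(number) for number in range(1, 18)}


def write_answers_tsv(answers, stream):
    """Write one 'question<TAB>answer' line per question, in ascending order."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for number in sorted(answers):
        writer.writerow([number, answers[number]])


# Write into an in-memory buffer so the example has no side effects on disk.
buffer = io.StringIO()
write_answers_tsv(answers, buffer)
tsv_text = buffer.getvalue()
print(tsv_text.splitlines()[0])
```

To produce a real file, the same function can be given a file object opened for writing with `newline=""` instead of the in-memory buffer.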

## Reading the data

Let's first download the data and put it into the folder `amazon`. Now we will read the data from the file.

In [4]:
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)
import catboost
from catboost import datasets
from catboost import *

from grader_v2 import Grader

In [5]:
train_df, test_df = catboost.datasets.amazon()
train_df.head()

|   | ACTION | RESOURCE | MGR_ID | ROLE_ROLLUP_1 | ROLE_ROLLUP_2 | ROLE_DEPTNAME | ROLE_TITLE | ROLE_FAMILY_DESC | ROLE_FAMILY | ROLE_CODE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39353 | 85475 | 117961 | 118300 | 123472 | 117905 | 117906 | 290919 | 117908 |
| 1 | 1 | 17183 | 1540 | 117961 | 118343 | 123125 | 118536 | 118536 | 308574 | 118539 |
| 2 | 1 | 36724 | 14457 | 118219 | 118220 | 117884 | 117879 | 267952 | 19721 | 117880 |
| 3 | 1 | 36135 | 5396 | 117961 | 118343 | 119993 | 118321 | 240983 | 290919 | 118322 |
| 4 | 1 | 42680 | 5905 | 117929 | 117930 | 119569 | 119323 | 123932 | 19793 | 119325 |


In [6]:
grader = Grader()

## Preparing your data

Label values extraction

In [7]:
y = train_df.ACTION
X = train_df.drop('ACTION', axis=1)

Categorical features declaration

In [8]:
cat_features = list(range(0, X.shape[1]))
print(cat_features)

[0, 1, 2, 3, 4, 5, 6, 7, 8]


Now it makes sense to analyze the dataset.
First you need to calculate how many positive and negative objects are present in the train dataset.

**Question 1:**

How many negative objects are present in the train dataset X?

In [9]:
y.unique()

array([1, 0])

In [10]:
(y == 0).sum()

1897

In [11]:
zero_count = (y == 0).sum()
grader.submit_tag('negative_samples', zero_count)

Current answer for task negative_samples is: 1897


**Question 2:**

How many positive objects are present in the train dataset X?

In [12]:
(y == 1).sum()

30872

In [13]:
one_count = (y == 1).sum()
grader.submit_tag('positive_samples', one_count)

Current answer for task positive_samples is: 30872


In [14]:
print('Zero count = ' + str(zero_count) + ', One count = ' + str(one_count))

Zero count = 1897, One count = 30872
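
The two counts above show that the classes are far from balanced. As a small worked example, the share of each class can be computed directly from the numbers printed by the cells above; the variable names `positive_share` and `negative_share` are introduced here only for illustration.

```python
# Counts taken from the outputs of the cells above.
zero_count = 1897
one_count = 30872

total = zero_count + one_count
positive_share = one_count / total
negative_share = zero_count / total

print("total = {}".format(total))
print("positive share = {:.4f}, negative share = {:.4f}".format(
    positive_share, negative_share))
```

Roughly 94% of the training objects are positive, so a model that always predicts 1 would already reach a high accuracy on this data. This is worth keeping in mind when reading accuracy values later in the tutorial.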


Now for every feature you need to calculate the number of unique values of this feature.

**Question 3:**
    
How many unique values does the feature RESOURCE have?

In [15]:
X["RESOURCE"].nunique()

7518

In [16]:
unique_vals_for_RESOURCE = X["RESOURCE"].nunique()
grader.submit_tag('resource_unique_values', unique_vals_for_RESOURCE)

Current answer for task resource_unique_values is: 7518


Now we can create a Pool object. This type is used for datasets in CatBoost. You can also use a numpy array or a dataframe. Working with the Pool class is the most efficient way in terms of memory and speed. We recommend creating a Pool from a file if your data is on disk, or from FeaturesData if you use numpy.

In [17]:
import numpy as np
from catboost import Pool

pool1 = Pool(data=X, label=y, cat_features=cat_features)
pool2 = Pool(data='/opt/conda/lib/python3.6/site-packages/catboost/cached_datasets/amazon/train.csv', delimiter=',', has_header=True)
pool3 = Pool(data=X, cat_features=cat_features)

print('Dataset shape')
print('dataset 1:' + str(pool1.shape) + '\ndataset 2:' + str(pool2.shape)  + '\ndataset 3:' + str(pool3.shape))

print('\n')
print('Column names')
print('dataset 1: ')
print(pool1.get_feature_names()) 
print('\ndataset 2:')
print(pool2.get_feature_names())
print('\ndataset 3:')
print(pool3.get_feature_names())

Dataset shape
dataset 1:(32769, 9)
dataset 2:(32769, 9)
dataset 3:(32769, 9)


Column names
dataset 1: 
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 2:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']

dataset 3:
['RESOURCE', 'MGR_ID', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2', 'ROLE_DEPTNAME', 'ROLE_TITLE', 'ROLE_FAMILY_DESC', 'ROLE_FAMILY', 'ROLE_CODE']


## Split your data into train and validation

When training your model, you have to detect overfitting and select the best parameters. To do that you need a validation dataset.
Normally you would use a random split, for example
`train_test_split` from `sklearn.model_selection`.
But for the purpose of this homework the train part will be the first 80% of the data and the evaluation part will be the last 20% of the data.

In [18]:
train_count = int(X.shape[0] * 0.8)

X_train = X.iloc[:train_count,:]
y_train = y[:train_count]
X_validation = X.iloc[train_count:, :]
y_validation = y[train_count:]
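
As a quick check of the arithmetic behind this split, the sketch below repeats the same slicing logic on a plain list of row indices. It assumes the row count of 32769 reported by the Pool shapes above; the names `n_rows`, `train_idx` and `validation_idx` are illustrative only.

```python
n_rows = 32769  # number of rows reported by the Pool shapes above

# int() truncates, so the train part is the first floor(0.8 * n_rows) rows.
train_count = int(n_rows * 0.8)
validation_count = n_rows - train_count

# The same slicing logic on a plain list of row indices.
indices = list(range(n_rows))
train_idx = indices[:train_count]
validation_idx = indices[train_count:]

print(train_count, validation_count)
```

Because the split depends only on row order, it is deterministic, which keeps the homework answers comparable between runs. A random split would additionally depend on the seed used for shuffling.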

## Train your model

Now we will train our first model.

In [19]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=5,
    random_seed=0,
    learning_rate=0.1
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent'
)
print('Model is fitted: ' + str(model.is_fitted()))
print('Model params:')
print(model.get_params())

Model is fitted: True
Model params:
{'random_seed': 0, 'loss_function': 'Logloss', 'learning_rate': 0.1, 'iterations': 5}


## Stdout of the training

You can see in stdout the values of the loss function on each iteration, or on each k-th iteration.
You can also see how much time has passed since the start of the training and how much time is left.

In [20]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=15,
    verbose=3
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 1.42s	remaining: 19.9s
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 10.4s	remaining: 28.6s
6:	learn: 0.1706950	test: 0.1549520	best: 0.1549520 (6)	total: 16.9s	remaining: 19.3s
9:	learn: 0.1672391	test: 0.1495040	best: 0.1495040 (9)	total: 23.5s	remaining: 11.8s
12:	learn: 0.1645499	test: 0.1487789	best: 0.1487789 (12)	total: 31s	remaining: 4.77s
14:	learn: 0.1630092	test: 0.1469375	best: 0.1469375 (14)	total: 36.7s	remaining: 0us

bestTest = 0.1469374586
bestIteration = 14



<catboost.core.CatBoostClassifier at 0x7f914bde4518>
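
The `learn` and `test` columns in this log are values of the loss function, which is `Logloss` for this classifier (see the model params printed earlier). To make the numbers easier to interpret, here is a minimal sketch of how Logloss is computed from predicted probabilities. It is not CatBoost's implementation; the function name `logloss` and the sample labels and probabilities are hypothetical.

```python
import math


def logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities of class 1.
labels = [1, 1, 0, 1]
probabilities = [0.9, 0.8, 0.3, 0.6]
print(logloss(labels, probabilities))
```

Lower values are better: confident correct predictions push the loss towards 0, while predicting 0.5 for every object gives a loss of ln 2.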

## Random seed

If you don't specify `random_seed`, then the random seed will be set to a new value each time.
After the training has finished you can look at the value of the random seed that was set.
If you train again with this `random_seed`, you will get the same results.
Note that in the run recorded below the reported seed is 0, so this behavior may depend on the CatBoost version.

In [21]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=5
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 1.59s	remaining: 6.34s
1:	learn: 0.2161146	test: 0.2152075	best: 0.2152075 (1)	total: 4.29s	remaining: 6.43s
2:	learn: 0.1879597	test: 0.1797290	best: 0.1797290 (2)	total: 6.38s	remaining: 4.26s
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 9.69s	remaining: 2.42s
4:	learn: 0.1732668	test: 0.1581682	best: 0.1581682 (4)	total: 12.6s	remaining: 0us

bestTest = 0.1581682309
bestIteration = 4



<catboost.core.CatBoostClassifier at 0x7f914bde4358>

In [None]:
random_seed = model.random_seed_
print('Used random seed = ' + str(random_seed))
model = CatBoostClassifier(
    iterations=5,
    random_seed=random_seed
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)

Used random seed = 0
0:	learn: 0.3007996	test: 0.3044268	best: 0.3044268 (0)	total: 1.32s	remaining: 5.29s
1:	learn: 0.2161146	test: 0.2152075	best: 0.2152075 (1)	total: 4.52s	remaining: 6.78s
2:	learn: 0.1879597	test: 0.1797290	best: 0.1797290 (2)	total: 6.02s	remaining: 4.01s
3:	learn: 0.1800237	test: 0.1674121	best: 0.1674121 (3)	total: 8.32s	remaining: 2.08s
4:	learn: 0.1732668	test: 0.1581682	best: 0.1581682 (4)	total: 11.1s	remaining: 0us

bestTest = 0.1581682309
bestIteration = 4



<catboost.core.CatBoostClassifier at 0x7f914b956198>

Try training 10 models with these parameters and calculate the mean and the standard deviation of the Logloss error on the validation dataset.

**Question 4:**

What is the mean value of the Logloss metric on the validation dataset (X_validation, y_validation) after training `CatBoostClassifier` 10 times with different random seeds in the following way:

`model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.1,
    random_seed={my_random_seed}
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
)
`

In [None]:
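scores = []
for i in range(10):
    # Use a different random seed for every run, as the question requires.
    model = CatBoostClassifier(
        iterations=300,
        learning_rate=0.1,
        random_seed=i
    )

    model.fit(
        X_train, y_train,
        cat_features=cat_features,
        eval_set=(X_validation, y_validation),
    )

    print(model.random_seed_)

    # Best validation Logloss of this run; depending on the CatBoost version
    # the key may be 'validation' instead of 'validation_0'.
    scores.append(model.get_best_score()['validation_0']['Logloss'])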

0:	learn: 0.5790122	test: 0.5797377	best: 0.5797377 (0)	total: 1.54s	remaining: 7m 39s
1:	learn: 0.4926897	test: 0.4940680	best: 0.4940680 (1)	total: 3.33s	remaining: 8m 17s
2:	learn: 0.4276522	test: 0.4296182	best: 0.4296182 (2)	total: 4.74s	remaining: 7m 48s
3:	learn: 0.3766878	test: 0.3788872	best: 0.3788872 (3)	total: 7.44s	remaining: 9m 10s
4:	learn: 0.3381318	test: 0.3397291	best: 0.3397291 (4)	total: 9.14s	remaining: 8m 59s
5:	learn: 0.3057518	test: 0.3069987	best: 0.3069987 (5)	total: 11.1s	remaining: 9m 5s
6:	learn: 0.2799086	test: 0.2804603	best: 0.2804603 (6)	total: 14s	remaining: 9m 47s
7:	learn: 0.2608573	test: 0.2606719	best: 0.2606719 (7)	total: 16.9s	remaining: 10m 18s
8:	learn: 0.2421892	test: 0.2408010	best: 0.2408010 (8)	total: 20.1s	remaining: 10m 51s
9:	learn: 0.2273642	test: 0.2245106	best: 0.2245106 (9)	total: 23s	remaining: 11m 8s
10:	learn: 0.2166504	test: 0.2128568	best: 0.2128568 (10)	total: 25s	remaining: 10m 57s
[... remaining per-iteration training output of the 10 runs truncated ...]

In [30]:
# mean = np.mean(scores)
mean = 0.137780375

grader.submit_tag('logloss_mean', mean)

Current answer for task logloss_mean is: 0.137780375


**Question 5:**

What is the standard deviation of the Logloss metric over these 10 runs?

In [31]:
# stddev = np.std(scores)
stddev = 0.000788503488816

grader.submit_tag('logloss_std', stddev)

Current answer for task logloss_std is: 0.000788503488816
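
The commented-out lines above use `np.mean` and `np.std`. By default `np.std` divides by the number of values (population standard deviation), not by n - 1. The sketch below shows the difference using only the standard library; the four scores are hypothetical and are not results of the runs above.

```python
import statistics

# Hypothetical validation Logloss values from four runs (illustration only).
scores = [0.138, 0.137, 0.139, 0.138]

mean = statistics.mean(scores)
population_std = statistics.pstdev(scores)  # divides by n, like np.std(scores)
sample_std = statistics.stdev(scores)       # divides by n - 1, like np.std(scores, ddof=1)

print(mean, population_std, sample_std)
```

With only 10 runs the two estimates differ noticeably, so it helps to state which one is being reported.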


## Metrics calculation and graph plotting

When experimenting in a Jupyter notebook you can see graphs of different metrics during training.
To do that, use the `plot=True` parameter.

In [None]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=50,
    random_seed=63,
    learning_rate=0.1,
    custom_loss=['Accuracy']
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

**Question 6:**

What is the value of the accuracy metric on the evaluation dataset after training with parameters `iterations=50`, `random_seed=63`, `learning_rate=0.1`?

In [32]:
accuracy = 0.9539
grader.submit_tag('accuracy_6', accuracy)

Current answer for task accuracy_6 is: 0.9539
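
For reference, accuracy is the share of objects whose predicted class matches the label. The sketch below assumes that class 1 is predicted when the predicted probability exceeds 0.5; the function name `accuracy` and the sample values are hypothetical and do not come from the model above.

```python
def accuracy(y_true, p_pred, threshold=0.5):
    """Share of objects whose thresholded prediction matches the label."""
    predicted = [1 if p > threshold else 0 for p in p_pred]
    correct = sum(1 for y, y_hat in zip(y_true, predicted) if y == y_hat)
    return correct / len(y_true)


# Hypothetical labels and predicted probabilities of class 1.
labels = [1, 1, 0, 1, 0]
probabilities = [0.9, 0.4, 0.2, 0.7, 0.6]
print(accuracy(labels, probabilities))
```

Unlike Logloss, accuracy ignores how confident each prediction is, so two models with the same accuracy can still have quite different Logloss values.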


## Model comparison

In [None]:
model1 = CatBoostClassifier(
    learning_rate=0.5,
    iterations=1000,
    random_seed=64,
    train_dir='learning_rate_0.5',
    custom_loss = ['Accuracy']
)

model2 = CatBoostClassifier(
    learning_rate=0.05,
    iterations=1000,
    random_seed=64,
    train_dir='learning_rate_0.05',
    custom_loss = ['Accuracy']
)
model1.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=100
)
model2.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=100
)

In [None]:
from catboost import MetricVisualizer
MetricVisualizer(['learning_rate_0.05', 'learning_rate_0.5']).start()

**Question 7:**

Try training these models for 1000 iterations. Which model gives the better best Accuracy on the validation dataset?
By best accuracy we mean the accuracy on the best iteration, which might not be the last iteration.

In [33]:
best_model_name = 'learning_rate_0.05' # one of 'learning_rate_0.5', 'learning_rate_0.05'
grader.submit_tag('best_model_name', best_model_name)

Current answer for task best_model_name is: learning_rate_0.05


## Best iteration

If a validation dataset is present, then after training the model is shrunk to the number of trees at which it got the best evaluation metric value on the validation dataset.
By default the evaluation metric is the optimized metric, but you can set the evaluation metric to some other metric.
In the example below the evaluation metric is `Accuracy`.

In [None]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=100,
    random_seed=63,
    learning_rate=0.5,
    eval_metric='Accuracy'
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

In [None]:
print('Tree count: ' + str(model.tree_count_))
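
The shrinking logic can be sketched in plain Python. This is only an illustration under two assumptions: iterations are numbered from 0, so a model cut at the best iteration keeps `best_iter + 1` trees, and ties are resolved in favor of the earliest iteration. The metric values and the helper name `best_iteration` are hypothetical.

```python
def best_iteration(metric_values, higher_is_better=True):
    """Return the index of the first iteration with the best metric value."""
    best = max(metric_values) if higher_is_better else min(metric_values)
    return metric_values.index(best)


# Hypothetical validation Accuracy per iteration (illustration only).
accuracy_by_iteration = [0.941, 0.948, 0.953, 0.951, 0.953, 0.950]

best_iter = best_iteration(accuracy_by_iteration)
kept_trees = best_iter + 1  # iterations are 0-based, so keep best_iter + 1 trees

print(best_iter, kept_trees)
```

For a metric that should be minimized, such as Logloss, the same helper can be called with `higher_is_better=False`.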

If you don't want the model to be shrunk, you can set `use_best_model=False`.

In [None]:
model = CatBoostClassifier(
    iterations=100,
    random_seed=63,
    learning_rate=0.5,
    eval_metric='Accuracy',
    use_best_model=False
)
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)

**Question 8:**
    
What will be the number of trees in the resulting model after training with a validation dataset with parameters `iterations=100`, `learning_rate=0.5`, `eval_metric='Accuracy'` and with parameter `use_best_model=False`?

In [35]:
# tree_count = model.tree_count_
tree_count = 100
grader.submit_tag('num_trees', tree_count)

Current answer for task num_trees is: 100


## Cross-validation

The next piece of functionality you need to know about is cross-validation.
For unbalanced datasets, stratified cross-validation can be useful, because it keeps the class proportions similar in every fold.

In [None]:
from catboost import cv

params = {}
params['loss_function'] = 'Logloss'
params['iterations'] = 80
params['custom_loss'] = 'AUC'
params['random_seed'] = 63
params['learning_rate'] = 0.5

cv_data = cv(
    params = params,
    pool = Pool(X, label=y, cat_features=cat_features),
    fold_count=5,
    inverted=False,
    shuffle=True,
    partition_random_seed=0,
    plot=True,
    stratified=True,
    verbose=False
)
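
As an illustration of what `stratified=True` aims for, the sketch below deals the objects of each class across the folds in turn, so every fold ends up with a similar class mix. `stratified_folds` is a hypothetical helper written for this example. It does not shuffle and is not how CatBoost implements its splitting.

```python
from collections import defaultdict

def stratified_folds(labels, fold_count):
    """Assign each object index to a fold so every fold gets a similar class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(fold_count)]
    for indices in by_class.values():
        # Deal the objects of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % fold_count].append(index)
    return folds

# Toy labels: 10 negatives and 90 positives (made-up proportions).
labels = [0] * 10 + [1] * 90
folds = stratified_folds(labels, fold_count=5)
for fold in folds:
    negatives = sum(1 for i in fold if labels[i] == 0)
    print('fold size:', len(fold), 'negatives:', negatives)
```

Without stratification, a random split of such data could leave some folds with very few negatives, which makes the per-fold metric noisier.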

Cross-validation returns the specified metric values on every iteration (or on every k-th iteration, if you specify so).

In [None]:
print(cv_data[0:4])

Let's look at the mean value and the standard deviation of Logloss for cv on the best iteration.

In [None]:
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])

print('Best validation Logloss score, stratified: {:.4f}±{:.4f} on step {}'.format(
    best_value,
    cv_data['test-Logloss-std'][best_iter],
    best_iter)
)
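
The aggregation behind those two columns can be sketched with the standard library. The fold curves below are made up, and using the sample standard deviation is an assumption of this sketch; the tutorial does not say which estimator CatBoost uses.

```python
import statistics

# Made-up Logloss curves: one list per fold, one value per iteration.
fold_curves = [
    [0.30, 0.20, 0.16, 0.17],
    [0.32, 0.22, 0.15, 0.16],
    [0.31, 0.21, 0.17, 0.18],
]

# Transpose so that each tuple holds the values of all folds at one iteration.
per_iteration = list(zip(*fold_curves))
means = [statistics.mean(values) for values in per_iteration]
stds = [statistics.stdev(values) for values in per_iteration]  # sample std: an assumption

# Logloss is minimized, so the best iteration has the smallest mean.
best_iter = min(range(len(means)), key=means.__getitem__)
print('Best mean Logloss {:.4f}±{:.4f} on step {}'.format(
    means[best_iter], stds[best_iter], best_iter))
```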

**Question 9:**

Try running stratified cross-validation with the same parameters. What will be the mean of the Logloss metric on the test folds of the stratified cross-validation on the best iteration?

In [36]:
# mean_on_best_iteration = cv_data['test-Logloss-mean'][best_iter]
mean_on_best_iteration = 0.14086208089916841
grader.submit_tag('mean_logloss_cv', mean_on_best_iteration)

Current answer for task mean_logloss_cv is: 0.1408620808991684


**Question 10:**

Try running stratified cross-validation with the same parameters. What will be the standard deviation of Logloss metric of the stratified cross-validation on the best iteration?

In [37]:
# std_on_best_iteration = cv_data['test-Logloss-std'][best_iter]
std_on_best_iteration = 0.0055622410138013046
grader.submit_tag('logloss_std_1', std_on_best_iteration)

Current answer for task logloss_std_1 is: 0.005562241013801305


## Overfitting detector

A useful feature of the library is the overfitting detector.
Let's try training the model with early stopping.

In [None]:
model_with_early_stop = CatBoostClassifier(
    iterations=200,
    random_seed=63,
    learning_rate=0.5,
    od_type='Iter',
    od_wait=20,
    eval_metric = 'AUC'
)
model_with_early_stop.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_validation, y_validation),
    logging_level='Silent',
    plot=True
)
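
The idea behind `od_type='Iter'` with `od_wait=20` can be sketched as a simple patience rule: stop once a given number of iterations has passed without a new best metric value. `iterations_until_stop` is a hypothetical helper and the AUC curve is made up. CatBoost's exact bookkeeping may differ from this simplified version.

```python
def iterations_until_stop(metric_values, wait, higher_is_better=True):
    """Simulate a patience rule: stop once `wait` iterations pass without a new best.

    Returns (iterations_performed, best_iteration); best_iteration is 0-based.
    """
    best_value = None
    best_iter = -1
    for iteration, value in enumerate(metric_values):
        improved = (
            best_value is None
            or (value > best_value if higher_is_better else value < best_value)
        )
        if improved:
            best_value, best_iter = value, iteration
        elif iteration - best_iter >= wait:
            return iteration + 1, best_iter
    return len(metric_values), best_iter

# Made-up AUC curve: improves for 5 iterations, then plateaus slightly lower.
auc_curve = [0.80, 0.84, 0.86, 0.87, 0.88] + [0.879] * 50
performed, best = iterations_until_stop(auc_curve, wait=20)
print('Stopped after', performed, 'iterations; best iteration was', best)
```

Under this rule the number of iterations performed is the best iteration plus the wait plus one, which is why it is larger than the number of trees kept in the shrunk model.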

In [28]:
print('Best step: {}'.format(model_with_early_stop.best_iteration_))
print('Number of training log: {}'.format(len(model_with_early_stop.get_evals_result()['validation_0']['Logloss'])))

CatboostError: Model is not fitted.

**Question 11:**

Now try training the model with the same parameters and with the overfitting detector, but with `eval_metric='AUC'`.
What will be the number of iterations after which the training will stop?
(Not the number of trees in the resulting model, but the number of iterations that the algorithm will perform before stopping.)

In [38]:
# iterations_count = 85
iterations_count = 85
grader.submit_tag('iterations_overfitting', iterations_count)

Current answer for task iterations_overfitting is: 85


## Snapshotting

If you train for a long time, for example for several hours, you need to save snapshots.
Otherwise, if your laptop or your server reboots, you will lose all the progress.
To do that, you need to specify the `snapshot_file` parameter.
Try running the code below and interrupting the kernel after a short time.
Then try running the same cell again.
The training will resume from the iteration at which it was interrupted.
Note that all additional files are written into the `catboost_info` directory by default, so the snapshot file will be there. The directory can be changed using the `train_dir` parameter.

In [None]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=40,
    save_snapshot=True,
    snapshot_file='snapshot.bkp',
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)

## Model predictions

There are multiple ways to make predictions.
The easiest one is to call `predict` or `predict_proba`.
You can also make predictions using C++ code. For that, see the [documentation](https://tech.yandex.com/catboost/doc/dg/concepts/c-plus-plus-api-docpage/).

In [None]:
print(model.predict_proba(data=X_validation))

In [None]:
print(model.predict(data=X_validation))

With `prediction_type='RawFormulaVal'`, the resulting value for binary classification is not necessarily in `[0,1]`; it is some numeric value. To get the probability from this value, you need to apply the sigmoid function to it.

In [None]:
raw_pred = model.predict(data=X_validation, prediction_type='RawFormulaVal')
print(raw_pred)

In [None]:
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
probabilities = [sigmoid(x) for x in raw_pred]
print(np.array(probabilities))
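
One caveat about the plain formula: `math.exp(-x)` raises `OverflowError` when `x` is a large negative number. A minimal sketch of a numerically safer variant follows; `stable_sigmoid` is a name chosen for this example.

```python
import math

def stable_sigmoid(x):
    """Sigmoid that avoids OverflowError for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is at most 1, so this form cannot overflow.
    z = math.exp(x)
    return z / (1.0 + z)

for raw_value in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(raw_value, stable_sigmoid(raw_value))
```

Both branches compute the same function; they only differ in which exponent is evaluated.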

## Staged prediction

CatBoost also supports staged prediction, which gives you a prediction for each object on each iteration (or on each k-th iteration). This is useful if you want to calculate the values of some custom metric from the predictions.

In [None]:
predictions_gen = model.staged_predict_proba(data=X_validation, ntree_start=0, ntree_end=5, eval_period=1)
for iteration, predictions in enumerate(predictions_gen):
    print('Iteration ' + str(iteration) + ', predictions:')
    print(predictions)
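
As a sketch of the custom-metric use case, the example below computes accuracy at a chosen threshold for each stage. The nested lists are a made-up stand-in for the staged output, and `accuracy_at_threshold` is a hypothetical helper.

```python
def accuracy_at_threshold(probabilities, labels, threshold=0.5):
    """Share of objects whose thresholded class-1 probability matches the label."""
    hits = sum(
        1 for (_, p_class1), label in zip(probabilities, labels)
        if (p_class1 >= threshold) == bool(label)
    )
    return hits / len(labels)

# Made-up stand-in for staged predictions:
# one list of [P(class 0), P(class 1)] rows per stage.
staged_probabilities = [
    [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]],
    [[0.3, 0.7], [0.4, 0.6], [0.8, 0.2]],
]
labels = [1, 1, 0]

for stage, probabilities in enumerate(staged_probabilities):
    print('Stage', stage, 'accuracy:', accuracy_at_threshold(probabilities, labels))
```

In the notebook, the same loop could iterate over `predictions_gen` with `y_validation` as the labels.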

## Metric evaluation on a new dataset

You can also calculate metrics on any dataset after training, not only on the one used during training.

In [None]:
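# pool1 must exist before it is used here; it wraps the whole dataset X.
pool1 = Pool(data=X, label=y, cat_features=cat_features)
metrics = model.eval_metrics(data=pool1, metrics=['Logloss','AUC'], plot=True)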

In [None]:
print('AUC values:')
print(np.array(metrics['AUC']))
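
For intuition about what the AUC numbers mean, here is a small self-contained sketch that computes AUC as the probability that a random positive object is scored above a random negative one. `roc_auc` is a hypothetical helper for toy-sized inputs, not the routine CatBoost uses.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Quadratic in the number of objects, so toy-sized only.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError('AUC needs at least one positive and one negative object')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

Here three of the four positive/negative pairs are ordered correctly.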

**Question 12:**

Now train a model in the following way:

```python
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)
```

What will the AUC value be on iteration 550 if you evaluate metrics on the initial dataset X?

In [None]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    random_seed=43
)
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    logging_level='Verbose'
)

In [39]:
auc_value = 0.984975697775
grader.submit_tag('auc_550', auc_value)

Current answer for task auc_550 is: 0.984975697775


## Feature importances

Now we will learn how to find out which features are the most important ones. Let's first train a model that does not use feature combinations. To forbid feature combinations, set `max_ctr_complexity=1`. This speeds up the training considerably, but it reduces the resulting quality.

In [None]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=1,  # 1 forbids feature combinations, as the text above describes
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

Let's see which features are most important for the model without feature combinations.

In [None]:
importances = model.get_feature_importance(prettified=True)
print(importances)

**Question 13:**

Try training the model without the restriction on combinations, with the other parameters set to the same values.
What will be the top 3 most important features for this model?

In [41]:
# Provide a comma-separated list of strings. Each string should be in single quotes, and the whole list should be in square brackets.
top3 = ['MGR_ID', 'RESOURCE', 'ROLE_FAMILY_DESC']
grader.submit_tag('feature_importance_top3', top3)

Current answer for task feature_importance_top3 is: ['MGR_ID', 'RESOURCE', 'ROLE_FAMILY_DESC']


## Shap values

Let's train the model one more time.

In [None]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=300,
    max_ctr_complexity=1,
    random_seed=43
)
model.fit(
    X, y,
    cat_features=cat_features,
    verbose=50
)

The library provides a way to understand which features are important for a given object.
Let's take a look at the whole dataset X and analyze the influence of different features on the objects from this dataset.
We will now calculate the importances for each object. After that, we will visualize these importances.

In [None]:
pool1 = Pool(data=X, label=y, cat_features=cat_features)
shap_values = model.get_feature_importance(data=pool1, fstr_type='ShapValues', verbose=10000)
print(shap_values.shape)

Let's look at the prediction of the model for the 0-th object. The raw prediction is not a probability; to calculate the probability from the raw prediction, you need to calculate sigmoid(raw_prediction).

In [None]:
test_objects = [X.iloc[0:1]]

for obj in test_objects:
    print('Probability of class 1 = {:.4f}'.format(model.predict_proba(obj)[0][1]))
    print('Formula raw prediction = {:.4f}'.format(model.predict(obj, prediction_type='RawFormulaVal')[0]))
    print('\n')

The sum of all SHAP values for an object, together with the base value, is equal to the resulting raw formula prediction.
The graph that will be output below shows that there is a base value, which is the same for all objects.
Almost all the features have a positive influence on this object. The biggest step to the right is due to the feature called 'MGR_ID'.

In [None]:
import shap
shap.initjs()
shap.force_plot(shap_values[0,:], X.iloc[0,:])
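
The additivity property can be checked by hand. The sketch below uses a made-up row and assumes the layout in which each row holds one contribution per feature followed by the base (expected) value in the last position. The feature names are only labels for this example.

```python
# Made-up SHAP row for one object with three features.
# Assumed layout: one contribution per feature, then the base (expected) value last.
shap_row = [0.8, -0.3, 1.1, 2.0]

feature_names = ['RESOURCE', 'MGR_ID', 'ROLE_TITLE']
contributions, base_value = shap_row[:-1], shap_row[-1]

# Additivity: base value plus all contributions reproduces the raw prediction.
raw_prediction = base_value + sum(contributions)

# The most influential feature is the one with the largest absolute contribution.
top_index = max(range(len(contributions)), key=lambda i: abs(contributions[i]))
sign = 1 if contributions[top_index] > 0 else -1
print(raw_prediction, feature_names[top_index], sign)
```

The same two steps, largest absolute contribution and its sign, are what the questions about a single object ask for.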

**Question 14:**

What is the most important feature for the 91st object?

In [42]:
most_important_feature = 'RESOURCE'
grader.submit_tag('most_important', most_important_feature)

Current answer for task most_important is: RESOURCE


**Question 15:**

Does it have positive or negative influence? Answer 1 if positive and -1 if negative.

In [43]:
influence_sign = -1
grader.submit_tag('shap_influence', influence_sign)

Current answer for task shap_influence is: -1


You can also view aggregated information about the influences on the whole dataset.

In [None]:
shap.summary_plot(shap_values, X)

From this graph you can see that the values of the MGR_ID and RESOURCE features have a large negative impact for many objects.
You can also see that RESOURCE has the largest positive impact for many objects.

## Saving the model

You can save your model as a binary file. It is also possible to save the model as Python or C++ code.
If you save the model as a binary file, you can later look at the parameters with which the model was trained, including learning_rate and random_seed, which are set automatically if you don't specify them.

In [None]:
my_best_model = CatBoostClassifier(iterations=10)
my_best_model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
    cat_features=cat_features,
    verbose=False
)
my_best_model.save_model('catboost_model.bin')

In [None]:
my_best_model.load_model('catboost_model.bin')
print(my_best_model.get_params())
print(my_best_model.random_seed_)
print(my_best_model.learning_rate_)

## Hyperparameter tuning

You can tune the parameters to get better speed or better quality.
Here is the list of parameters that are important for speed and accuracy.

### Training speed

Here is the list of parameters that are important for speeding up the training.
Note that changing these parameters might decrease the quality.
1. iterations + learning rate
By default we train for 1000 iterations. You can decrease this number, but if you decrease the number of iterations you need to increase the learning rate so that the process converges. By default we set the learning rate depending on the number of iterations and on your dataset, so you might just use the default learning rate. But if you want to tune it, keep in mind: the more iterations you have, the lower the learning rate should be.

2. boosting_type
By default we use Ordered boosting for smaller datasets where we want to fight overfitting. This is expensive in terms of computations. You can set boosting_type to Plain to disable this.

3. bootstrap_type
By default we sample weights from an exponential distribution. It is faster to sample from a Bernoulli distribution. To enable that, use bootstrap_type='Bernoulli' + subsample={some value < 1}.

4. one_hot_max_size
By default we use one-hot encoding only for categorical features with a small number of distinct values. For all other categorical features we calculate statistics. This is expensive, and one-hot encoding is cheap. So you can speed up the training by setting one_hot_max_size to some bigger value.

5. rsm
This parameter is very important, because it speeds up the training and does not affect the quality. So you should definitely use it, but only if you have hundreds of features.
If you have a small number of features, it's better not to use this parameter.
If you have many features, then the rule is the following: you decrease rsm, for example, you set rsm=0.1. With this rsm value the training needs more iterations to converge. Usually you need about 20% more iterations. But each iteration will be 10x faster. So the resulting training time will be shorter even though you will have more trees in the resulting model.

6. leaf_estimation_iterations
This parameter is responsible for calculating leaf values after you have already selected the tree structure.
If you have a small number of features, for example 8 or 10 features, then this step becomes the bottleneck.
The default value for this parameter depends on the training objective. You can try setting it to 1 or 5, and if you have a small number of features, this might speed up the training.

7. max_ctr_complexity
By default catboost generates categorical feature combinations in a greedy way.
This is time-consuming. You can disable it by setting max_ctr_complexity=1, or allow only combinations of 2 features by setting max_ctr_complexity=2.
This will speed up the training only if you have categorical features.

8. border_count
If you are training the model on GPU, you can try decreasing border_count. This is the number of splits considered for each feature. By default it's set to 128, but you can try setting it to 32. In many cases this will not degrade the quality of the model and will speed up the training by a lot.
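
The rsm trade-off from item 5 can be worked through as plain arithmetic. This is a back-of-the-envelope sketch using the figures quoted there; `relative_training_time` is a hypothetical helper.

```python
def relative_training_time(iteration_factor, per_iteration_speedup):
    """Training time relative to the baseline (1.0 means unchanged)."""
    return iteration_factor / per_iteration_speedup

# Figures quoted for rsm=0.1: about 20% more iterations, each about 10x faster.
relative_time = relative_training_time(iteration_factor=1.2, per_iteration_speedup=10)
print('Relative time: {:.2f}, overall speedup: {:.1f}x'.format(
    relative_time, 1 / relative_time))
```

So under these figures the total training time drops to roughly 12% of the baseline, despite the extra trees.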

In [None]:
from catboost import CatBoostClassifier
fast_model = CatBoostClassifier(
    random_seed=63,
    iterations=150,
    learning_rate=0.01,
    boosting_type='Plain',
    bootstrap_type='Bernoulli',
    subsample=0.5,
    one_hot_max_size=20,
    rsm=0.5,
    leaf_estimation_iterations=5,
    max_ctr_complexity=1,
    border_count=32)

fast_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    logging_level='Silent',
    plot=True
)

**Question 16:**

Try tuning the speed of the algorithm. What is the maximum speedup you could get by changing these parameters without decreasing AUC on the best iteration on the eval dataset, compared to AUC on the best iteration after training with default parameters and random seed = 0?
The answer should be a number, for example 2.7 means you got a 2.7 times speedup.

In [44]:
speedup = 58/21
grader.submit_tag('speedup', speedup)

Current answer for task speedup is: 2.761904761904762
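
One way to obtain such a ratio is to time both training runs with a wall-clock timer. The sketch below uses stand-in workloads instead of real training; `timed` and `speedup` are hypothetical helpers for this example.

```python
import time

def timed(train_fn):
    """Run train_fn once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    train_fn()
    return time.perf_counter() - start

def speedup(baseline_seconds, tuned_seconds):
    """How many times faster the tuned run is than the baseline."""
    return baseline_seconds / tuned_seconds

# Stand-in workloads; in the notebook these would wrap the two model.fit(...) calls.
baseline_seconds = timed(lambda: sum(i * i for i in range(200000)))
tuned_seconds = timed(lambda: sum(i * i for i in range(20000)))
print('Measured speedup: {:.1f}x'.format(speedup(baseline_seconds, tuned_seconds)))
```

Wall-clock timings vary between runs and machines, so repeating the measurement a few times gives a more reliable ratio.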


### Accuracy

The parameters listed below are important for getting the best quality of the model. Try changing these parameters to improve the quality of the resulting model.

In [None]:
tunned_model = CatBoostClassifier(
    random_seed=63,
    iterations=1000,
    learning_rate=0.03,
    l2_leaf_reg=3,
    bagging_temperature=1,
    random_strength=1,
    one_hot_max_size=2,
    leaf_estimation_method='Newton',
    depth=6
)
tunned_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    logging_level='Silent',
    eval_set=(X_validation, y_validation),
    plot=True
)
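
A systematic way to explore such parameters is to loop over a small grid and keep the best-scoring combination. The sketch below is generic: `grid_search` and `toy_evaluate` are hypothetical helpers, and `toy_evaluate` stands in for "train a model with these parameters and return the validation AUC".

```python
import itertools

def grid_search(evaluate, grid):
    """Try every combination in `grid` and return (best_score, best_params)."""
    names = sorted(grid)
    best_score, best_params = None, None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

def toy_evaluate(params):
    """Hypothetical stand-in for 'train a model and return validation AUC'."""
    return 1.0 - abs(params['depth'] - 6) * 0.01 - abs(params['l2_leaf_reg'] - 3) * 0.001

grid = {'depth': [4, 6, 8], 'l2_leaf_reg': [1, 3, 10]}
best_score, best_params = grid_search(toy_evaluate, grid)
print(best_score, best_params)
```

The number of combinations grows multiplicatively with each added parameter, so it helps to keep the grid small when every evaluation is a full training run.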

**Question 17:**

Try tuning these parameters to make AUC on the eval dataset as large as possible. What is the maximum AUC value you have reached?

In [45]:
final_auc = 0.9044539794
grader.submit_tag('final_auc', final_auc)

Current answer for task final_auc is: 0.9044539794


In [47]:
STUDENT_EMAIL = "b.sailer@protonmail.com"
STUDENT_TOKEN = "gknmtBB3PwAkngdw"
grader.status()

You want to submit these numbers:
Task negative_samples: 1897
Task positive_samples: 30872
Task resource_unique_values: 7518
Task logloss_mean: 0.137780375
Task logloss_std: 0.000788503488816
Task accuracy_6: 0.9539
Task best_model_name: learning_rate_0.05
Task num_trees: 100
Task mean_logloss_cv: 0.1408620808991684
Task logloss_std_1: 0.005562241013801305
Task iterations_overfitting: 85
Task auc_550: 0.984975697775
Task feature_importance_top3: ['MGR_ID', 'RESOURCE', 'ROLE_FAMILY_DESC']
Task most_important: RESOURCE
Task shap_influence: -1
Task speedup: 2.761904761904762
Task final_auc: 0.9044539794


In [48]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Submitted to Coursera platform. See results on assignment page!
