# CatBoost

CatBoost is an algorithm for gradient boosting on decision trees. It is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction, and many other tasks at Yandex and at other companies, including CERN, Cloudflare, and Careem taxi. It is open source and can be used by anyone.

In [100]:
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from preprocessing import clean_data, add_new_features
from catboost import CatBoostRegressor, cv, Pool

In [101]:
SEED = 42

In [102]:
random.seed(SEED)
np.random.seed(SEED)

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of equal length."""
    squared_errors = (y_true - y_pred) ** 2
    return np.sqrt(squared_errors.mean())
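
As a quick sanity check, the `rmse` helper can be run on a tiny hand-made example. The arrays below are made up for illustration and do not come from the beer dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    # Same computation as the helper above: square root of the mean squared error.
    return np.sqrt(((y_true - y_pred) ** 2).mean())


# Toy values (illustrative only, not from the dataset).
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 20.0, 26.0])

# Errors are -2, 0, 4; squared errors are 4, 0, 16; their mean is 20/3.
toy_rmse = rmse(y_true, y_pred)
print(round(toy_rmse, 3))  # 2.582
```

Because the metric is expressed in the same units as the target, it can be read directly as a typical prediction error in IBU points.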

## Prepare data

In [67]:
data = pd.read_csv('../../data/beer_train.csv', index_col=['id'])

In [68]:
df = clean_data(data)
df = add_new_features(df)

In [69]:
target = ['ibu']
cat_features = ['available', 'glass']
label_features = ['isOrganic']
num_features = ['originalGravity',
                'abv',
                'srm',
                'abv_mul_grav',
                'abv_mul_srm',
                'srm_div_abv',
                'srm_mull_grav',
                'srm_mull_grav_div_abv']

In [70]:
train_features = cat_features + label_features + num_features

Create the training and validation datasets.

In [71]:
df_train, df_val = train_test_split(df, test_size=0.2, random_state=SEED)
X_train = df_train[train_features].reset_index(drop=True)
X_val = df_val[train_features].reset_index(drop=True)

In [72]:
y_train = df_train['ibu'].values
y_val = df_val['ibu'].values

CatBoost needs to know which columns are categorical, so we look up the column indices of the categorical features.

In [73]:
cat_features_indices = [list(X_train.columns).index(i) for i in cat_features]
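
To make the lookup concrete, here is a minimal sketch on an empty toy frame. The column list is a shortened version of `train_features`, and the name `toy` is only used for this illustration.

```python
import pandas as pd

cat_features = ['available', 'glass']

# Shortened column list in the same order as train_features (illustrative).
columns = ['available', 'glass', 'isOrganic', 'originalGravity', 'abv', 'srm']
toy = pd.DataFrame(columns=columns)

# Same approach as in the notebook: position of each categorical column.
indices = [list(toy.columns).index(name) for name in cat_features]

# Equivalent lookup through the pandas Index API.
indices_alt = [toy.columns.get_loc(name) for name in cat_features]

print(indices)  # [0, 1]
```

The categorical columns come first because `train_features` is built by putting `cat_features` at the front. As far as I know, `Pool` also accepts column names for `cat_features` when the data is a pandas DataFrame, but this notebook sticks to indices.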

In [107]:
train_pool = Pool(X_train, y_train, cat_features=cat_features_indices)
val_pool = Pool(X_val, cat_features=cat_features_indices)

## Feature Importances


In [75]:
model = CatBoostRegressor(verbose=1000, loss_function='RMSE', random_seed=SEED)
model.fit(train_pool)


Learning rate set to 0.052458
0:	learn: 24.4825393	total: 22.6ms	remaining: 22.6s
999:	learn: 12.7457132	total: 22.9s	remaining: 0us


In [76]:
feature_importances = model.get_feature_importance(train_pool)
feature_names = X_train.columns

In [77]:
for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
    print(f'{name}: {round(score, 3)}')

originalGravity: 40.189
abv_mul_grav: 11.016
abv: 9.215
srm_mull_grav: 7.273
srm_mull_grav_div_abv: 6.719
glass: 5.988
abv_mul_srm: 5.95
available: 5.641
srm_div_abv: 4.26
srm: 3.493
isOrganic: 0.256


**The most important feature is originalGravity.**
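
The printed scores add up to roughly 100, so they can be read as percentage shares. The sketch below copies the numbers from the output above into a pandas Series and computes the cumulative share; the variable names are only for this illustration.

```python
import pandas as pd

# Scores copied from the printed output above.
importances = pd.Series({
    'originalGravity': 40.189,
    'abv_mul_grav': 11.016,
    'abv': 9.215,
    'srm_mull_grav': 7.273,
    'srm_mull_grav_div_abv': 6.719,
    'glass': 5.988,
    'abv_mul_srm': 5.95,
    'available': 5.641,
    'srm_div_abv': 4.26,
    'srm': 3.493,
    'isOrganic': 0.256,
}).sort_values(ascending=False)

total = importances.sum()          # close to 100
cumulative = importances.cumsum()  # running share of the total

print(round(total, 3))
print(cumulative.round(3).head(3))
```

Read this way, the three top features (`originalGravity`, `abv_mul_grav`, `abv`) account for about 60 of the 100 points, while `isOrganic` contributes almost nothing.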

## Tuning hyperparameters

In [83]:
model = CatBoostRegressor(verbose=1000, loss_function='RMSE')

In [None]:
grid = {'learning_rate': [0.03, 0.1, 0.5],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9],
        'iterations': [500, 1000, 2000],
        'random_seed': [SEED]}

grid_search_result = model.grid_search(grid, train_pool)
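
To get a feel for how much work this search is, the grid can be expanded with `itertools.product`. This is only a sketch of the size of the search space; it does not reproduce how CatBoost iterates over the grid internally.

```python
from itertools import product

SEED = 42

grid = {'learning_rate': [0.03, 0.1, 0.5],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9],
        'iterations': [500, 1000, 2000],
        'random_seed': [SEED]}

# Every combination of one value per parameter, as a list of dicts.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
n_combinations = len(combinations)

print(n_combinations)  # 3 * 3 * 5 * 3 * 1 = 135
print(combinations[0])
```

With 135 candidate settings and up to 2000 iterations each, the search takes a while, which is worth keeping in mind before widening the grid.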

In [92]:
best_params = grid_search_result['params'] 
best_params

{'depth': 10,
 'random_seed': 42,
 'l2_leaf_reg': 7,
 'iterations': 500,
 'learning_rate': 0.1}

## Evaluation

Now we retrain the tuned model on all the training data we have.

In [98]:
model.fit(train_pool)

0:	learn: 23.8882264	total: 54.5ms	remaining: 27.2s
499:	learn: 10.8383255	total: 21.9s	remaining: 0us


<catboost.core.CatBoostRegressor at 0x1b61f605040>

Let's look at the validation metric.

In [109]:
print('CatBoost rmse: ', rmse(y_val, model.predict(val_pool)))

CatBoost rmse:  16.680498705999515


On the validation data, CatBoost achieves a better metric than the other models.

Therefore, I'll save the model parameters to a file so that the model can later be trained with these parameters on all the data.


In [111]:
import pickle

with open('../../config/model_params.pkl', 'wb') as f:
    pickle.dump(best_params, f)
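
For completeness, here is a minimal sketch of the round trip: writing the parameters and reading them back. It uses a temporary directory instead of `../../config/` so that it runs anywhere, and the parameter values are copied from the grid search output above.

```python
import os
import pickle
import tempfile

# Values copied from the grid search result shown earlier.
best_params = {'depth': 10,
               'random_seed': 42,
               'l2_leaf_reg': 7,
               'iterations': 500,
               'learning_rate': 0.1}

# Temporary location used only for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model_params.pkl')

    with open(path, 'wb') as f:
        pickle.dump(best_params, f)

    with open(path, 'rb') as f:
        loaded_params = pickle.load(f)

print(loaded_params == best_params)  # True

# Assumption: in the training script the dict would be unpacked into the
# constructor, e.g. CatBoostRegressor(loss_function='RMSE', **loaded_params).
```

Since the parameters are a plain dictionary of numbers, a text format such as JSON would work just as well and would be readable without Python; pickle is simply what this notebook uses.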