# CatBoost

The goal of training is to select a model y, a function of a set of features Xi, that best solves the given problem for any input object.

This model is found by using a training dataset, which is a set of objects with known features and label values. Model quality is checked on a validation dataset, which has the same format as the training dataset but is used only for evaluating the quality of training, not for fitting the model.

CatBoost is based on gradient boosted decision trees. During training, a set of decision trees is built consecutively. Each successive tree is built to reduce the loss left by the previous trees.
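The idea of building trees consecutively so that each one reduces the remaining loss can be illustrated with a minimal sketch. This is plain gradient boosting with one-split trees (stumps) on synthetic data, written in NumPy; it is not CatBoost's actual algorithm, and the helper names `fit_stump` and `predict_stump` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on a 1-D feature that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Synthetic regression problem (an assumption made for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from a constant model
train_rmse = [np.sqrt(np.mean((y - prediction) ** 2))]

for _ in range(50):
    residual = y - prediction            # negative gradient of the squared error
    stump = fit_stump(x, residual)       # next tree is fitted to the residual
    prediction = prediction + learning_rate * predict_stump(stump, x)
    train_rmse.append(np.sqrt(np.mean((y - prediction) ** 2)))

print('RMSE before boosting: {:.3f}'.format(train_rmse[0]))
print('RMSE after 50 stumps: {:.3f}'.format(train_rmse[-1]))
```

The training RMSE never increases from one stump to the next here. The loss on a validation dataset does not have this guarantee, which is why the number of trees has to be chosen with care.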

The number of trees is controlled by the starting parameters. To prevent overfitting, use the overfitting detector. When it is triggered, training stops and no further trees are built.

## Number of trees
Check that there is no obvious underfitting or overfitting before tuning any other parameters. To do this, analyze the metric value on the validation dataset and select the appropriate number of iterations.

This can be done by setting the number of iterations to a large value, enabling the overfitting detector, and turning on the use best model option (`use_best_model`). In this case the resulting model contains only the first k trees, where k is the iteration with the best loss value on the validation dataset.

The metric for choosing the best model may differ from the one used for optimizing the objective value. For example, it is possible to set the optimized function to Logloss and use the AUC function for the overfitting detector. To do so, use the `eval_metric` parameter.

In [1]:
import catboost as cb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X_train = pd.read_csv('x_train_preprocessed.csv')
X_test = pd.read_csv('x_test_preprocessed.csv')
y_train = pd.read_csv('y_train_preprocessed.csv')
y_test = pd.read_csv('y_test_preprocessed.csv')

# Collect the one-hot encoded neighborhood columns so they can be dropped
oh_neighbor = [col for col in X_train.columns if 'Neighborhood_b' in col]

X_train.drop(columns=oh_neighbor, inplace=True)
X_test.drop(columns=oh_neighbor, inplace=True)

porch = ['Wood_Deck_SF', 'Open_Porch_SF', 'Enclosed_Porch', 'Three_season_porch', 'Screen_Porch']
surface = ['Total_Finished_Bsmt_SF', 'First_Flr_SF', 'Second_Flr_SF', 'Garage_Area']
baths = ['Full_Bath', 'Half_Bath', 'Bsmt_Full_Bath', 'Bsmt_Half_Bath']

X_train.drop(columns=porch, inplace=True)
X_test.drop(columns=porch, inplace=True)

X_train.drop(columns=surface, inplace=True)
X_test.drop(columns=surface, inplace=True)

X_train.drop(columns=baths, inplace=True)
X_test.drop(columns=baths, inplace=True)

train_dataset = cb.Pool(X_train, y_train)
test_dataset = cb.Pool(X_test, y_test)

# use_best_model=True is deliberately not set here: it requires a non-empty
# eval_set, and none is passed to the hyperparameter search below.
model = cb.CatBoostRegressor(loss_function='RMSE', task_type='GPU', devices='0:1',
                             eval_metric='RMSE',
                             od_type='IncToDec',  # Type of overfitting detector
                             od_pval=.01,  # Overfitting detector threshold, range 1e-10 to 1e-2
                             border_count=254,  # Number of splits for numerical features. Strongly affects GPU training speed: the smaller the value, the faster the training.
                             )

grid = {'iterations': [100, 150, 200],  # The maximum number of trees that can be built
        'learning_rate': [0.01, 0.03, 0.05, 0.1],  # Used for reducing the gradient step
        'depth': [6, 7, 8, 9, 10],
        'l2_leaf_reg': [0.2, 0.5, 1, 3],
        'random_strength': [0.2, 0.5, 0.8]}

#model.grid_search(grid, train_dataset, plot=True,cv=5,search_by_train_test_split=True,calc_cv_statistics=True)


model.randomized_search(grid,
                        train_dataset,
                        y=None,
                        cv=5,
                        n_iter=10,
                        partition_random_seed=0,
                        calc_cv_statistics=True,
                        search_by_train_test_split=True,
                        refit=True,
                        shuffle=True,
                        stratified=None,
                        train_size=0.8,
                        verbose=True,
                        )

pred = model.predict(X_test)
rmse = (np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)

print("Testing performance")
print('RMSE: {:.2f}'.format(rmse))
print('R2: {:.2f}'.format(r2))

With `use_best_model=True` set on the model and no `eval_set` passed, this cell fails with:

CatBoostError: To employ param {'use_best_model': True} provide non-empty 'eval_set'.