<img src="../../../../imgs/CampQMIND_banner.png">

# Catboost

CatBoost is an open-source gradient boosting library developed by Yandex. It handles categorical variables natively, so they can be included in a gradient boosted model without manual encoding. Its default parameters often give competitive results without any tuning.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Catboost" data-toc-modified-id="Catboost-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Catboost</a></span></li><li><span><a href="#Worked-Example" data-toc-modified-id="Worked-Example-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Worked Example</a></span></li><li><span><a href="#Resources" data-toc-modified-id="Resources-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Resources</a></span></li></ul></div>

# Worked Example

In [20]:
import numpy as np
import pandas as pd

from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score


In [13]:
df = pd.read_csv("train.csv",index_col=0)
df.dropna(inplace=True)

In [14]:
# CatBoost has its own way of dealing with categorical variables, so we will not encode any.

model = CatBoostClassifier(
    learning_rate=0.01,
    class_weights=[0.8, 0.2],  # weight for class 0, then class 1
    eval_metric='AUC',
    logging_level='Silent',
    use_best_model=True,       # keep the trees up to the best eval_set iteration
)
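
To give a feel for what "its own way of dealing with categorical variables" can mean, here is a simplified sketch of an ordered target statistic: each row's category is replaced by a smoothed mean of the targets of rows that came *before* it in a random order, so a row never sees its own label. The function name, the `prior` value and the single permutation are assumptions made for illustration; the library's actual procedure is more involved.

```python
import random

def ordered_target_statistic(categories, targets, prior=0.5, seed=0):
    """Encode each category using only targets seen earlier in a random order.

    Simplified illustration (one permutation, binary targets), not CatBoost's
    actual implementation.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    seen_count = {}     # rows seen so far, per category
    seen_positive = {}  # sum of targets seen so far, per category
    encoded = [0.0] * len(categories)

    for idx in order:
        cat = categories[idx]
        count = seen_count.get(cat, 0)
        positives = seen_positive.get(cat, 0)
        # Smoothed mean of the targets observed so far for this category.
        encoded[idx] = (positives + prior) / (count + 1)
        # Only now make this row's own target visible to later rows.
        seen_count[cat] = count + 1
        seen_positive[cat] = positives + targets[idx]
    return encoded

colors = ["red", "blue", "red", "red", "blue"]
labels = [1, 0, 1, 0, 0]
print(ordered_target_statistic(colors, labels))
```

The first row of each category in the permutation always receives the prior, since no earlier targets exist for it yet.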

In [15]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(["target"],axis=1),df.target, random_state=0)

In [16]:
# We need to tell CatBoost which features are categorical.
# Here every column that is not a float is treated as categorical.
# (np.float was removed in NumPy 1.24, so we compare against the built-in float.)
categorical_features = np.where(X_train.dtypes != float)[0]
# CatBoost works best when the data is wrapped in its own Pool data structure
train = Pool(X_train, y_train, cat_features=categorical_features)
test = Pool(X_test, y_test, cat_features=categorical_features)

In [17]:
%%time
model.fit(train, eval_set=test)

CPU times: user 52min 47s, sys: 9min 9s, total: 1h 1min 56s
Wall time: 11min 22s


<catboost.core.CatBoostClassifier at 0x1412a6b50>
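
Because the model was created with `use_best_model=True` and fitted with an `eval_set`, the number of trees kept is decided by the evaluation metric rather than by the last iteration. A minimal sketch of that selection step, using hypothetical per-iteration AUC values (the helper name and the tie-breaking rule are assumptions of this sketch):

```python
def best_iteration(eval_scores, higher_is_better=True):
    """Return the 0-based index of the best validation score (first on ties)."""
    if not eval_scores:
        raise ValueError("eval_scores must not be empty")
    pick = max if higher_is_better else min
    return eval_scores.index(pick(eval_scores))

# Hypothetical AUC on the evaluation set after each boosting iteration.
auc_per_iteration = [0.70, 0.74, 0.78, 0.77, 0.76]
trees_to_keep = best_iteration(auc_per_iteration) + 1
print(trees_to_keep)  # 3
```

On a fitted model, `model.get_best_iteration()` should report the iteration that was selected. In practice, the data used for this selection is usually kept separate from the data used for the final score.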

In [21]:
# A competitive score without hyperparameter tuning or categorical encoding
preds = model.predict_proba(X_test)
roc_auc_score(y_test,preds[:,1])

0.7826332921931083
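
The number above is the area under the ROC curve. One way to read it: it is the probability that a randomly chosen positive example gets a higher predicted score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on a tiny hand-made example; `pairwise_auc` is a name introduced here for illustration, and its quadratic loop is only meant for small inputs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y, s))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so roughly 0.78 means about 78% of positive/negative pairs are ordered correctly.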

# Resources
- https://www.youtube.com/watch?v=8o0e-r0B5xQ - CatBoost - the new generation of gradient boosting - Anna Veronika Dorogush
- https://www.youtube.com/watch?v=V5158Oug4W8 - Topic 10. Part 2. Key ideas behind Xgboost, LightGBM, and CatBoost. Practice with LightGBM