# CatBoost for Categorical Features

In [1]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [7]:
from catboost import Pool, CatBoostClassifier
from sklearn.metrics import classification_report, confusion_matrix

We don't have to encode the categorical features ourselves. We only need to tell CatBoost which column indices hold them.

In [26]:
train_data = [["summer", 1924, 44, "yes"],
              ["summer", 1932, 37, "yes"],
              ["winter", 1980, 37, "no"],
              ["summer", 2012, 204, "yes"]]

eval_data = [["winter", 1996, 197, "no"],
             ["winter", 1968, 37, "no"],
             ["summer", 2002, 77, "yes"],
             ["summer", 1948, 59, "yes"]]

cat_features = [0, 3]  # indices of categorical features

train_label = ["France", "USA", "USA", "France"]
eval_label = ["USA", "France", "France", "USA"]

Next, we create a `Pool` for the training data and another for the evaluation data. Each pool takes the full feature data, the labels, and the indices of the categorical features.

In [27]:
train_dataset = Pool(data=train_data,
                     label=train_label,
                     cat_features=cat_features)  # give the indices of the categorical features

eval_dataset = Pool(data=eval_data,
                    label=eval_label,
                    cat_features=cat_features)

In [28]:
# Initialize CatBoostClassifier with a MultiClass loss function
model = CatBoostClassifier(iterations=10,
                           learning_rate=1,
                           depth=3,
                           loss_function='MultiClass')

In [29]:
# Fit model
model.fit(train_dataset)
# Get predicted classes
preds_class = model.predict(eval_dataset)

0:	learn: 0.5604596	total: 2.41ms	remaining: 21.7ms
1:	learn: 0.4622656	total: 5.78ms	remaining: 23.1ms
2:	learn: 0.3884250	total: 6.62ms	remaining: 15.5ms
3:	learn: 0.3124253	total: 7.31ms	remaining: 11ms
4:	learn: 0.2567740	total: 8.03ms	remaining: 8.03ms
5:	learn: 0.2170463	total: 8.79ms	remaining: 5.86ms
6:	learn: 0.2060403	total: 9.43ms	remaining: 4.04ms
7:	learn: 0.1788116	total: 10.5ms	remaining: 2.62ms
8:	learn: 0.1507468	total: 11ms	remaining: 1.22ms
9:	learn: 0.1296974	total: 11.6ms	remaining: 0us


In [30]:
preds_class

array([['France'],
       ['USA'],
       ['France'],
       ['France']], dtype=object)

In [31]:
print(eval_label)

['USA', 'France', 'France', 'USA']


In [32]:
print(classification_report(eval_label, preds_class))

              precision    recall  f1-score   support

      France       0.33      0.50      0.40         2
         USA       0.00      0.00      0.00         2

    accuracy                           0.25         4
   macro avg       0.17      0.25      0.20         4
weighted avg       0.17      0.25      0.20         4
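The report can be cross-checked with the `confusion_matrix` function imported earlier. The sketch below hard-codes the true labels and the predictions printed above, so it runs without retraining. The `labels` argument is passed explicitly to fix the row and column order.

```python
from sklearn.metrics import confusion_matrix

# True labels and predictions copied from the outputs above.
eval_label = ["USA", "France", "France", "USA"]
preds_class = ["France", "USA", "France", "France"]

labels = ["France", "USA"]  # rows are true classes, columns are predicted classes
cm = confusion_matrix(eval_label, preds_class, labels=labels)
print(cm)
# [[1 1]
#  [2 0]]
```

Reading the matrix: one of the two France rows is predicted correctly, and both USA rows are predicted as France, which matches the 0.50 recall for France and the 0.00 recall for USA in the report.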

