# CatBoost

## CatBoost overview
For more in-depth information and a comprehensive understanding of CatBoost capabilities, it is recommended to refer to the <a href="https://catboost.ai/en/docs/">official CatBoost documentation</a>. The original documentation includes extensive materials, usage examples, and detailed explanations of various aspects of the library.

CatBoost is a machine learning method based on gradient boosting. Initially, it was introduced at Yandex for solving prediction, categorization, and recommendation tasks. CatBoost has built-in support for categorical variables and ensures high accuracy.
Main advantages of CatBoost:
1) Superior quality when compared with other GBDT libraries on many datasets.
2) Best in class prediction speed.
3) Support for both numerical and categorical features.
4) Fast GPU and multi-GPU support for training out of the box.
5) Visualization tools included.
6) Fast and reproducible distributed training with Apache Spark and CLI.

Naturally, a series of questions arises:
1) What are the advantages of CatBoost over its counterparts?
2) Why boosting instead of neural networks?
3) What do cats have to do with this?



````{admonition} Question
:class: important

<div style="display:flex; justify-content:space-between;">
    <img src="https://tr.rbxcdn.com/20d9646248a37d88a4c0779b5d87b6b6/420/420/Hat/Png" style="width:300px; height:300px;">
    <img src = "https://cdn-icons-png.flaticon.com/256/32/32339.png" style="width:150px; height:150px;">
    <img src="https://images.apollo247.in/pub/media/catalog/product/b/o/boo0003_1-sep2023.jpg" style="width:300px; height:300px;">
    
</div>


```{admonition} Answer
:class: tip, dropdown

CATBOOST!!!

```
````


<h2>What does CatBoost include?</h2>
<ol>
  <li>Classification</li>
  <li>Regression</li>
  <li>Multiclassification</li>
  <li>Ranking</li>
  <li>Metrics</li>
  <li>etc.</li>
</ol> 

### Operating Principle
<h3>Decision Tree</h3>

<a href = "https://fedmug.github.io/kbtu-ml-book/decision_trees/decision_tree.html">Decision trees information</a>

The operating algorithm is as follows: for each document, there is a set of feature values, and there is a tree with conditions at its nodes. If the condition is met, the algorithm moves to the right child of the node; otherwise, it goes to the left. One needs to traverse the tree to a leaf according to the feature values for the document. The value of the leaf corresponds to the output for each document. That is the answer.
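
The traversal described above can be sketched in a few lines of plain Python. The dictionary-based tree layout, the feature names and the thresholds are hypothetical and chosen only for illustration; CatBoost stores its trees differently.

```python
# A tiny hand-written tree: inner nodes hold a condition, leaves hold a value.
# The structure and the numbers are made up for illustration only.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"value": 0.2},                      # condition not met
    "right": {                                   # condition met
        "feature": "income", "threshold": 50,
        "left": {"value": 0.5},
        "right": {"value": 0.9},
    },
}

def predict(node, document):
    """Walk from the root to a leaf and return the leaf value."""
    while "value" not in node:
        condition_met = document[node["feature"]] > node["threshold"]
        node = node["right"] if condition_met else node["left"]
    return node["value"]

print(predict(tree, {"age": 42, "income": 35}))  # 0.5
print(predict(tree, {"age": 25, "income": 80}))  # 0.2
```

Each document follows exactly one path from the root to a leaf, so the cost of a prediction is proportional to the depth of the tree.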

<h3>Boosting</h3>

Boosting is a powerful machine learning technique that aims to enhance the predictive performance of weak models by combining them into a strong ensemble. This iterative process involves training new models to correct the errors of the previous ones. For an in-depth exploration of gradient boosting, refer to [this article](https://fedmug.github.io/kbtu-ml-book/gradient_boosting/generic_gb.html).
 
<h3>CatBoost is based on gradient boosting.</h3>

Gradient boosting, a specific form of boosting, builds an ensemble of weak prediction models, typically decision trees, step by step. It can optimize any differentiable loss function: each new model is fitted to the negative gradient of the loss, so every step moves the ensemble in the direction that decreases the loss.

Boosting is a method which builds a prediction model $F^{T}$ as an ensemble of weak learners $F^{T} = \sum\limits_{t=1}^{T} f^{t}$.

In our case, $f^{t}$ is a decision tree. Trees are built sequentially and each next tree is built to approximate negative gradients $g_{i}$ of the loss function $L$ at predictions of the current ensemble:
$g_{i} = -\frac{\partial L(a, y_{i})}{\partial a} \Bigr|_{a = F^{T-1}(x_{i})}$
Thus, it performs a gradient descent optimization of the function $L$. The quality of the gradient approximation is measured by a score function $Score(a, g) = S(a, g)$.
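
As a rough illustration of this scheme, here is a minimal sketch in plain Python. It makes several assumptions that are not part of CatBoost: the weak learners are one-split trees (stumps) on a single feature, the loss is $L(a, y) = (a - y)^2 / 2$ so that the negative gradient is simply $y - a$, and every tree is scaled by a learning rate. The data are made up.

```python
def fit_stump(xs, targets):
    """Find the one-split tree that best fits the targets in the least-squares sense."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = sum(left) / len(left), sum(right) / len(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def gradient_boosting(xs, ys, n_trees=50, learning_rate=0.1):
    trees = []
    predictions = [0.0] * len(xs)  # the initial ensemble predicts 0
    for _ in range(n_trees):
        # Negative gradients of L(a, y) = (a - y)^2 / 2 at the current predictions
        gradients = [y - a for y, a in zip(ys, predictions)]
        tree = fit_stump(xs, gradients)  # the next tree approximates the gradients
        trees.append(tree)
        predictions = [a + learning_rate * tree(x) for a, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * tree(x) for tree in trees)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 0.9, 1.0, 3.0, 3.2, 2.9]
model = gradient_boosting(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"training MSE after boosting: {mse:.4f}")
```

Here the least-squares fit of the stump plays the role of the score function: it measures how well a candidate tree approximates the gradients.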

````{admonition} Question
:class: important

How does CatBoost control the depth of decision trees, and why is this important?

```{admonition} Answer
:class: tip, dropdown

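By default CatBoost grows symmetric (oblivious) trees: all nodes on the same level share one split condition, so every leaf lies at the same depth. That depth is set explicitly with the `depth` parameter (6 by default) and is not adapted dynamically during training. Limiting the depth keeps each tree a weak learner: shallow trees are fast to evaluate and less prone to overfitting, while deeper trees capture more complex feature interactions at the cost of speed and generalization. Leaf values are additionally regularized, for example with `l2_leaf_reg`.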

```
````
<h3>How training is performed</h3>

<img src = "images/cat_is_boosting.png">

The goal of training is to select the model $y$, depending on a set of features $x_i$, that best solves the given problem (regression, classification, or multiclassification) for any input object. This model is found by using a training dataset, which is a set of objects with known features and label values. Accuracy is checked on the validation dataset, which has data in the same format as in the training dataset, but it is only used for evaluating the quality of training (it is not used for training).

CatBoost is based on gradient boosted decision trees. During training, a set of decision trees is built consecutively. Each successive tree is built with reduced loss compared to the previous trees.

The number of trees is controlled by the starting parameters. To prevent overfitting, use the overfitting detector. When it is triggered, trees stop being built.

Building stages for a single tree:

1) <a href = "https://catboost.ai/en/docs/concepts/algorithm-main-stages_pre-count">Preliminary calculation of splits.</a>
2) (Optional) <a href = "https://catboost.ai/en/docs/concepts/algorithm-main-stages_cat-to-numberic">Transforming categorical features to numerical features.</a>
3) (Optional) <a href = "https://catboost.ai/en/docs/concepts/algorithm-main-stages_text-to-numeric">Transforming text features to numerical features.</a>
4) (Optional) <a href = "https://catboost.ai/en/docs/concepts/algorithm-main-stages_embedding-to-numeric">Transforming embedding features to numerical features.</a>
5) <a href = "https://catboost.ai/en/docs/concepts/algorithm-main-stages_choose-tree-structure">Choosing the tree structure</a>. This stage is affected by the set <a href = "https://catboost.ai/en/docs/concepts/algorithm-main-stages_bootstrap-options">Bootstrap options</a>.
6) Calculating values in leaves.



<!-- 
## Metrics

CatBoost supports a variety of metrics, such as:

<b>Regression</b>: MAE, MAPE, RMSE, SMAPE, etc.

<b>Classification</b>: Logloss, Precision, Recall, F1, CrossEntropy, BalancedAccuracy, etc.

<b>Multiclass Classification</b>: MultiClass, MultiClassOneVsAll, HammingLoss, F1, etc.

<b>Ranking</b>: NDCG, PrecisionAt, RecallAt, PFound, PairLogit, etc. -->

````{admonition} Question
:class: important

In a regression problem, explain when you would choose Mean Absolute Error (MAE) over Root Mean Squared Error (RMSE) as a metric in CatBoost.

```{admonition} Answer
:class: tip, dropdown

In scenarios where outliers can heavily impact predictions, MAE may be preferred over RMSE since MAE is less sensitive to extreme values.
```
````

In [1]:
from jupyterquiz import display_quiz
import json
from base64 import b64encode

def get_spanned_encoded_q(q, q_name):
    byte_code = b64encode(bytes(json.dumps(q), 'utf8'))
    return f'<span style="display:none" id="{q_name}">{byte_code.decode()}</span>'


q_e_cat = [{
    "question": "What is the operating principle of CatBoost in building models, and what tasks does it include in its functionality?",
    "type": "many_choice",
    "answers": [
        {
            "answer": "CatBoost is based on neural networks and is used exclusively for classification tasks.",
            "correct": False,
            "feedback": "Incorrect! CatBoost actually employs gradient boosting on decision trees and supports various tasks beyond classification."
        },
        {
            "answer": "The operating principle of CatBoost is based on random forests, and it is designed only for regression.",
            "correct": False,
            "feedback": "Not quite right! CatBoost's operating principle involves gradient boosting on decision trees, and it supports more than just regression tasks."
        },
        {
            "answer": "CatBoost utilizes gradient boosting on decision trees and supports classification, regression, multiclassification, and ranking.",
            "correct": True,
            "feedback": "Correct! Categorical boosting in CatBoost allows the direct incorporation of categorical features without the need for extensive preprocessing, distinguishing it from traditional gradient boosting."
        },
        {
            "answer": "CatBoost builds a model based on the k-means method applicable only to clustering tasks.",
            "correct": False,
            "feedback": "Not the right choice! CatBoost primarily focuses on gradient boosting with decision trees and supports various tasks beyond clustering."
        },
    ]
}]

display_quiz(q_e_cat)




## CatBoost Classifier
<h3>Purpose</h3>
Training and applying models for classification problems. Provides compatibility with the scikit-learn tools.

The default optimized objective depends on various conditions:

`Logloss` — The target has only two different values or the `target_border` parameter is not `None`.

`MultiClass` — The target has more than two different values and the `border_count` parameter is `None`.
    
<h2>Loss functions</h2>
Note: for binary classification problems, approxes (the raw model outputs) are not equal to probabilities. Probabilities are calculated from approxes using the sigmoid function.

<h3>1) LogLoss function</h3>
CatBoost uses the logarithmic loss function for optimizing class probabilities. The model aims to minimize logloss, which is important for accurate probabilistic predictions. Internally, CatBoost measures the error in probability prediction. Below we consider examples of user defined metrics for different types of tasks. We will use the following variables:

$ a $ - approx value

$ p $ - probability

$ t $ - target

$ w $ - weight

$ p = \frac{1}{1+e^{-a}} = \frac{e^a}{1+e^a} $

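$ L(w_i) = -w_i\cdot (t_i \cdot \log(p_i) + (1-t_i)\cdot \log(1-p_i))  $

$ L(w) = \frac{\sum{L(w_i)}}{\sum{w_i}}  $

The derivatives below follow the sign convention of CatBoost's custom objective examples, where the objective is maximized, so they are taken of $-L(w_i)$:

$ \frac{d(-L(w_i))}{da_i} = w_i \cdot (t_i - p_i) $

$ \frac{d^2(-L(w_i))}{da_i^2} = - w_i \cdot p_i \cdot (1 - p_i) $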
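
A quick numerical check helps to keep the signs straight. The sketch below (plain Python; the sample values of $a$, $t$ and $w$ are arbitrary) compares central finite differences of $L(w_i)$, as defined above with the leading minus sign, against the closed forms $w_i (p_i - t_i)$ and $w_i p_i (1 - p_i)$. The expressions $w_i (t_i - p_i)$ and $-w_i p_i (1 - p_i)$ are their negatives, that is, the derivatives of $-L(w_i)$.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logloss(a, t, w=1.0):
    """Weighted Logloss of one object as a function of the approx a."""
    p = sigmoid(a)
    return -w * (t * math.log(p) + (1 - t) * math.log(1 - p))

a, t, w = 0.3, 1, 2.0
p = sigmoid(a)

# Closed-form derivatives of L with respect to the approx
first = w * (p - t)
second = w * p * (1 - p)

# Central finite differences of L
eps = 1e-4
num_first = (logloss(a + eps, t, w) - logloss(a - eps, t, w)) / (2 * eps)
num_second = (logloss(a + eps, t, w) - 2 * logloss(a, t, w)
              + logloss(a - eps, t, w)) / eps ** 2

print(first, num_first)    # the two numbers agree up to rounding
print(second, num_second)  # the two numbers agree up to rounding
```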



<h3>2) Multiclass function</h3>
CatBoost inherently supports multiclass classification. It automatically adapts its internal mechanisms for multiclass tasks.

```{note} 
For multiclassification problems approxes are not equal to probabilities. Usually approxes are transformed to probabilities using Softmax function. 
```

$p_{i,c} = \frac{e^{a_{i,c}}}{\sum_{j=1}^k{e^{a_{i,j}}}}$

$p_{i,c}$ - the probability that $x_i$ belongs to class $c$

$k$ - number of classes

$a_{i,j}$ - approx for object $x_i$ for class $j$


Let's implement MultiClass objective that is defined as follows:

$MultiClass_i = w_i \cdot \log{p_{i,t_i}}$

$MultiClass = \frac{\sum_{i=1}^{N}MultiClass_i}{\sum_{i=1}^{N}w_i}$

$\frac{\partial(MultiClass_i)}{\partial{a_{i,c}}} = \begin{cases} 
w_i-\frac{w_i\cdot e^{a_{i,c}}}{\sum_{j=1}^{k}e^{a_{i,j}}}, & \mbox{if } c = t_i \\ 
-\frac{w_i\cdot e^{a_{i,c}}}{\sum_{j=1}^{k}e^{a_{i,j}}}, & \mbox{if } c \neq t_i 
\end{cases}$

$\frac{\partial^2(MultiClass_i)}{\partial{a_{i,c_1}}\partial{a_{i,c_2}}} = \begin{cases} 
\frac{w_i\cdot e^{2 \cdot a_{i,c_1}}}{(\sum_{j=1}^{k}e^{a_{i,j}})^2} - \frac{w_i \cdot e^{a_{i, c_1}}}{\sum_{j=1}^{k}e^{a_{i,j}}}, & \mbox{if } c_1 = c_2 \\ 
\frac{w_i \cdot e^{a_{i,c_1}+a_{i,c_2}}}{(\sum_{j=1}^{k}e^{a_{i,j}})^2}, & \mbox{if } c_1 \neq c_2 
\end{cases}$
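
The first-derivative formula above reduces to $w_i \cdot ([c = t_i] - p_{i,c})$. The following plain-Python sketch computes the Softmax probabilities and this gradient for a single object and compares it with finite differences. The approxes, the target class and the weight are arbitrary illustrative values, and the helper names are not part of the CatBoost API.

```python
import math

def softmax(approxes):
    """Turn the approxes of one object into class probabilities."""
    shift = max(approxes)  # subtracting the maximum avoids overflow
    exps = [math.exp(a - shift) for a in approxes]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_objective(approxes, target, w=1.0):
    """MultiClass_i = w_i * log(p_{i, t_i})."""
    return w * math.log(softmax(approxes)[target])

def multiclass_gradient(approxes, target, w=1.0):
    """Derivative of MultiClass_i with respect to every approx a_{i,c}."""
    p = softmax(approxes)
    return [w * ((1.0 if c == target else 0.0) - p[c]) for c in range(len(p))]

approxes, target, w = [1.0, 2.0, 0.5], 1, 1.5
grad = multiclass_gradient(approxes, target, w)

# Finite-difference approximation of the same derivatives
eps = 1e-6
numeric = []
for c in range(len(approxes)):
    up = list(approxes)
    down = list(approxes)
    up[c] += eps
    down[c] -= eps
    numeric.append((multiclass_objective(up, target, w)
                    - multiclass_objective(down, target, w)) / (2 * eps))

print(grad)
print(numeric)
```

The gradient components sum to zero, because adding the same constant to all approxes of an object does not change its probabilities.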

<h4>Examples of using CatBoostClassifier</h4>

In [29]:
from catboost import CatBoostClassifier

cat_features = [0,1,2]

train_data = [["a", "b", 1, 4, 5, 6],
    ["a", "b", 4, 5, 6, 7],
    ["c", "d", 30, 40, 50, 60]]

train_labels = [1,1,0]

model = CatBoostClassifier(iterations=20, learning_rate = 0.05,
    custom_loss=['Logloss', 'Accuracy'])

model.fit(train_data, train_labels, cat_features, plot=True)
predictions = model.predict(train_data)



0:	learn: 0.6871227	total: 3.41ms	remaining: 64.9ms
1:	learn: 0.6833302	total: 9.32ms	remaining: 83.9ms
2:	learn: 0.6774286	total: 13ms	remaining: 73.6ms
3:	learn: 0.6716027	total: 16.2ms	remaining: 64.9ms
4:	learn: 0.6658512	total: 23ms	remaining: 68.9ms
5:	learn: 0.6622217	total: 30.7ms	remaining: 71.6ms
6:	learn: 0.6565870	total: 43.5ms	remaining: 80.7ms
7:	learn: 0.6510241	total: 46.2ms	remaining: 69.3ms
8:	learn: 0.6475080	total: 59.7ms	remaining: 72.9ms
9:	learn: 0.6440187	total: 67.7ms	remaining: 67.7ms
10:	learn: 0.6386098	total: 75.8ms	remaining: 62ms
11:	learn: 0.6332694	total: 79.4ms	remaining: 52.9ms
12:	learn: 0.6279964	total: 83.8ms	remaining: 45.1ms
13:	learn: 0.6227900	total: 87.7ms	remaining: 37.6ms
14:	learn: 0.6176491	total: 91.2ms	remaining: 30.4ms
15:	learn: 0.6125728	total: 94.6ms	remaining: 23.7ms
16:	learn: 0.6075601	total: 98.3ms	remaining: 17.4ms
17:	learn: 0.6026101	total: 102ms	remaining: 11.3ms
18:	learn: 0.5977218	total: 105ms	remaining: 5.54ms
19:	learn: 

In [27]:
from catboost import Pool, cv

cv_data = [["France", 1924, 44],
           ["USA", 1932, 37],
           ["Switzerland", 1928, 25],
           ["Norway", 1952, 30],
           ["Japan", 1972, 35],
           ["Mexico", 1968, 112]]

labels = [1, 1, 0, 0, 0, 1]

cat_features = [0]

cv_dataset = Pool(data=cv_data,
                  label=labels,
                  cat_features=cat_features)

params = {"iterations": 100,
          "depth": 2,
          "loss_function": "CrossEntropy",
          "verbose": False}

scores = cv(cv_dataset,
            params,
            fold_count=2, 
            plot=True)


Training on fold [0/2]

bestTest = 0.6877805006
bestIteration = 0

Training on fold [1/2]

bestTest = 0.6962181163
bestIteration = 0



The plot shows the learn and test curves of the loss for each fold of the cross-validation. `Cross-validation` is a technique used to estimate the generalizability of a machine learning model by training it on different subsets of the data and then evaluating it on the remaining subsets. Note that the reported values are `CrossEntropy` losses, not accuracies: `bestTest` is about $0.69$ on both folds, which is close to $\ln 2 \approx 0.693$, the loss of a model that always predicts a probability of $0.5$.

Next to the curves, the visualization widget offers display controls such as `Click Mode`. They affect only how the chart is shown and have no influence on training. The `learning rate`, in contrast, is a training parameter that controls the size of the steps taken by the gradient boosting algorithm when fitting the model.

With `bestIteration = 0` on both folds, additional trees did not improve the test loss, so this model is hardly better than chance on unseen data. This is expected for a toy dataset of six objects: the example illustrates the `cv` API rather than a well-fitted model.

<h3>AUC</h3>
The calculation of this metric is disabled by default for the training dataset to speed up training. To enable it, pass the `hints=skip_train~false` parameter of the metric (for example, `AUC:hints=skip_train~false`).

$ L(w) = \frac {\sum{I(a_i, a_j) \cdot w_i \cdot w_j}}{\sum{w_i\cdot w_j}}$

The sum is calculated on all pairs of objects (i,j) such that:

$ t_i = 0 $
    
$ t_j = 1 $

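$ \begin{aligned}
    I(x,y) = \begin{cases}
        {1}, & \text{if } x < y \\
        {0.5}, & \text{if } x = y \\
        {0}, & \text{if } x > y
    \end{cases}
\end{aligned} $

That is, a pair counts as correctly ordered when the object of class 0 receives a smaller approx than the object of class 1.

In the more general ranking variant, where targets are not restricted to 0 and 1, the sum is calculated on all pairs of objects (i,j) such that:

$ t_i < t_j $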
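
The pairwise definition can be turned into a direct, if quadratic, computation. The sketch below is plain Python written for this chapter, not CatBoost's implementation. It counts a pair as correctly ordered when the object with target 0 has the smaller approx, and the scores, targets and weights are made-up values.

```python
def weighted_auc(approxes, targets, weights=None):
    """AUC as the weighted share of correctly ordered (negative, positive) pairs."""
    if weights is None:
        weights = [1.0] * len(targets)
    objects = list(zip(approxes, targets, weights))
    numerator = 0.0
    denominator = 0.0
    for a_i, t_i, w_i in objects:
        if t_i != 0:
            continue  # i runs over objects with target 0
        for a_j, t_j, w_j in objects:
            if t_j != 1:
                continue  # j runs over objects with target 1
            if a_i < a_j:
                indicator = 1.0   # pair ordered correctly
            elif a_i == a_j:
                indicator = 0.5   # tie
            else:
                indicator = 0.0   # pair ordered incorrectly
            numerator += indicator * w_i * w_j
            denominator += w_i * w_j
    return numerator / denominator

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(weighted_auc(scores, labels))                # 0.75
print(weighted_auc(scores, labels, [1, 2, 1, 1]))  # 4 / 6, about 0.667
```

In the unweighted case three of the four (negative, positive) pairs are ordered correctly. Doubling the weight of the negative object with score 0.4 makes its misordered pair count more and lowers the result.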

```{note}
Logloss evaluates the predicted probabilities rather than only the predicted labels, so confident wrong predictions are penalized heavily. This makes it a more nuanced metric than accuracy, which can look deceptively high when the classes are imbalanced.
```

### Binary Classification

In binary classification, the model first produces a raw output $z$ (the approx), which is then converted into a probability and a class label. In CatBoost itself the raw output is the sum of the leaf values of all trees in the ensemble. To keep the arithmetic transparent, this section illustrates the remaining steps with a simplified raw function: a linear combination of the input features and model coefficients. For a given instance it is expressed as:

$ z = \sum_{i=0}^n{\beta_i\cdot x_i}$

Where:

$ z $ is the raw output.

$ \beta_0,\beta_1,...,\beta_n $ are the model coefficients or weights.

$ x_1,x_2,...,x_n$ are the input features of the instance, and $x_0 = 1$ so that $\beta_0$ acts as the intercept.

The raw output serves as the input for the subsequent transformations.
<h3>Sigmoid Transformation</h3>

Following the computation of the raw output, the next step involves applying a sigmoid transformation. The sigmoid function, defined as $S(x) = \frac {1}{1 + e^{-x}}$, plays a pivotal role in mapping the raw output to a probability scale between 0 and 1. This transformation ensures that the model's output aligns with a probabilistic interpretation, a crucial aspect in binary classification.

<h3>Decision Threshold</h3>

The transformed probability ($P$) is then subject to a decision threshold. In CatBoost and many other classifiers, the default threshold is often set at 0.5. If the probability estimate $P$ exceeds or equals this threshold, the instance is classified as the positive class; otherwise, it assumes the negative class.
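
The two steps, the sigmoid transformation and the threshold, can be written as a short plain-Python sketch. The function name `predict_class` and the raw scores are hypothetical; the comparison uses "greater than or equal", as described above.

```python
import math

def predict_class(raw_score, threshold=0.5):
    """Map a raw score to a probability and then to a class label."""
    probability = 1.0 / (1.0 + math.exp(-raw_score))
    label = 1 if probability >= threshold else 0
    return probability, label

for z in (-1.0, 0.0, 2.0):
    p, label = predict_class(z)
    print(f"z = {z:+.1f} -> P = {p:.3f} -> class {label}")
```

Because the sigmoid is monotonic and $S(0) = 0.5$, thresholding the probability at 0.5 is the same as checking the sign of the raw score.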

In [None]:
from jupyterquiz import display_quiz
import json
from base64 import b64encode

def get_spanned_encoded_q(q, q_name):
    byte_code = b64encode(bytes(json.dumps(q), 'utf8'))
    return f'<span style="display:none" id="{q_name}">{byte_code.decode()}</span>'


q_e_cat = [{
    "question": "What is the key difference between Log Loss and Multiclass Log Loss in the context of CatBoost?",
    "type": "multiple_choice",
    "answers": [
        {
            "answer": "Log Loss is used for binary classification, while Multiclass Log Loss is used for multiclass classification.",
            "correct": True,
            "feedback": "Correct! Log Loss is designed for binary classification, whereas Multiclass Log Loss is used when dealing with multiple classes."
        },
        {
            "answer": "Log Loss and Multiclass Log Loss are identical terms and can be used interchangeably.",
            "correct": False,
            "feedback": "Incorrect! Log Loss and Multiclass Log Loss have distinct applications and are not interchangeable."
        },
        {
            "answer": "Log Loss is specific to CatBoost, while Multiclass Log Loss is a general term in machine learning.",
            "correct": False,
            "feedback": "Not quite right! Log Loss is a general concept used widely, and Multiclass Log Loss extends it to multiclass scenarios."
        },
        {
            "answer": "There is no difference; both terms refer to the same loss function.",
            "correct": False,
            "feedback": "Incorrect! Log Loss and Multiclass Log Loss represent different loss functions with specific use cases."
        }
    ]
}]

display_quiz(q_e_cat)



In [None]:
from jupyterquiz import display_quiz
import json
from base64 import b64encode

def get_spanned_encoded_q(q, q_name):
    byte_code = b64encode(bytes(json.dumps(q), 'utf8'))
    return f'<span style="display:none" id="{q_name}">{byte_code.decode()}</span>'



q_demo_seq = [{
    "question": "Consider a binary classification scenario using CatBoost with two features, $x_1$ and $x_2$, and corresponding coefficients $β_0, β_1, β_2$. Given: $β_0=−2, β_1=1, β_2=2$. For a specific instance where $x_1=3$ and $x_2=1$, calculate the raw prediction score and the corresponding probability estimate.",
    "type": "numeric",
    "answers": [
        {
            "type": "value",
            "value": 0.952,
            "correct": True,
            "feedback": "Correct! The raw prediction score is calculated using the formula $z = β_0 + β_1 \\cdot x_1 + β_2 \\cdot x_2 = -2 + 1 \\cdot 3 + 2 \\cdot 1 = 3$. The probability estimate is obtained using the logistic function, resulting in P ≈ 0.952. Since P is above 0.5, we classify it as the positive class."
        },
        {
            "type": "default",
            "feedback": "Incorrect Answer! Review the calculation using the provided coefficients and features in the context of binary classification with CatBoost."
        }
    ]
}]
display_quiz(q_demo_seq)

| Name | Optimization | GPU support |
|-------------|-------------|-------------|
| Logloss  | + | + |
| Precision  | - | + |
| Accuracy  | - | + |
| AUC  | - | - |
| F1  | - | + |
| Recall  | - | + |
| CrossEntropy  | + | + |

### Example: breast cancer dataset

This is a dataset with 30 features and binary target.

In [47]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data['data'].shape, data['target'].shape

((569, 30), (569,))

In [5]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)


In [7]:
data.target_names

array(['malignant', 'benign'], dtype='<U9')

Divide the dataset into train and test:

In [48]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

Classification with `custom_loss = ['Logloss', 'Accuracy','CrossEntropy']`

In [49]:
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
model = CatBoostClassifier(iterations=100, depth=6, learning_rate=0.05, custom_loss = ['Logloss', 'Accuracy','CrossEntropy'])
model.fit(X_train, y_train, verbose = False, plot = True)

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")



Accuracy: 0.9649122807017544


In [9]:
# TODO: hide this cell
import plotly.express as px
import pandas as pd
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

confusion_df = pd.DataFrame(cm, index=model.classes_, columns=model.classes_)

fig = px.imshow(confusion_df, labels=dict(x="Predicted", y="Actual", color="Count"), 
                x=confusion_df.columns, y=confusion_df.index, color_continuous_scale="Reds")

fig.update_layout(xaxis_title="Predicted", yaxis_title="Actual")

for i in range(len(confusion_df.index)):
    for j in range(len(confusion_df.columns)):
        fig.add_annotation(
            x=confusion_df.columns[j], 
            y=confusion_df.index[i], 
            text=str(confusion_df.iloc[i, j]), 
            showarrow=False,
            font=dict(color='black' if i!=j else 'white')  # Optional: use a different font color on the diagonal
        )

fig.show()

In [10]:
import plotly.graph_objects as go
from sklearn.metrics import precision_recall_curve, roc_curve, auc

# Get the predicted probabilities of class 1
y_scores = model.predict_proba(X_test)[:, 1]

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_scores)
prc_trace = go.Scatter(x=recall, y=precision, mode='lines', name='Precision-Recall Curve')

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)
roc_trace = go.Scatter(x=fpr, y=tpr, mode='lines', name=f'ROC Curve (AUC = {roc_auc:.2f})')

# Layout
layout = go.Layout(
    title='Precision-Recall and ROC Curves',
    xaxis=dict(title='Recall or False Positive Rate'),
    yaxis=dict(title='Precision or True Positive Rate'),
)

# Figure
fig = go.Figure(data=[prc_trace, roc_trace], layout=layout)
fig.show()


The values tend to stay above 0.9, which suggests that the model is good at distinguishing between positive and negative cases. The ROC curve has the typical shape: as the TPR increases, the FPR also increases.

| Name | Catboost Logloss | Catboost Multiclass | Logistic Regression |
|-------------|-------------|-------------|-------------|
| **Accuracy**  | 0.9649 | 0.9649 | 0.9561 |
| **Precision**  | 0.9589 | 0.9589 | 0.9459 |
| **F1 score**  | 0.9722 | 0.9722 | 0.9655 |
| **Recall**  | 0.9859 | 0.9859 | 0.9859 |



````{admonition} Question
:class: important

Why do you think the results for logloss and multiclass turned out exactly the same?

````

### MNIST dataset with CatBoostClassifier

In [13]:
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'svg'

X, Y = fetch_openml('mnist_784', return_X_y=True, parser='auto')

X = X.astype(float).values / 255
Y = Y.astype(int).values
X.shape, Y.shape

((70000, 784), (70000,))

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=10000)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((60000, 784), (10000, 784), (60000,), (10000,))

When `verbose=True` is set, CatBoost prints additional information about the training process: the value of the loss on the training data (`learn`) after each iteration, the elapsed time, and an estimate of the remaining time. This is useful for tracking the training progress. In the cell below the log is switched off with `verbose=False`, and the interactive plot (`plot=True`) is used instead.

In [21]:
from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=50, depth=4, learning_rate=0.05, custom_loss=['AUC', 'Accuracy'])
model.fit(X_train, y_train, verbose = False, plot=True)

y_pred = model.predict(X_test)


```{note}
Let's train the classifier several times with different parameters and compare the results.
```

In [None]:
import numpy as np
import plotly.express as px
from sklearn.metrics import accuracy_score

# Precomputed confusion matrices and predictions for three parameter sets
settings = {
    "iter = 50, depth = 4": ("cm50.txt", "pred50.txt"),
    "iter = 150, depth = 5": ("cm150.txt", "pred150.txt"),
    "iter = 250, depth = 7": ("cm250.txt", "pred250.txt"),
}

def make_annotations(cm):
    # One text label per cell of the confusion matrix
    return [dict(x=j, y=i, text=str(int(cm[i, j])), showarrow=False)
            for i in range(cm.shape[0]) for j in range(cm.shape[1])]

# Create one button per parameter set
buttons = []
for name, (cm_file, pred_file) in settings.items():
    cm = np.loadtxt(cm_file)
    # Assumes y_test is the same test split that produced the saved predictions
    accuracy = accuracy_score(y_test, np.loadtxt(pred_file))
    title = f"Parameters {name} (accuracy: {accuracy:.4f})"
    # The "update" method takes [trace updates, layout updates]
    buttons.append(dict(label=f"Parameters {name}", method="update",
                        args=[{"z": [cm]},
                              {"annotations": make_annotations(cm), "title.text": title}]))

# Create the initial plot from the first parameter set
trace_update, layout_update = buttons[0]["args"]
fig = px.imshow(trace_update["z"][0], color_continuous_scale="Reds")

# Add the title, the annotations and the buttons to the layout
fig.update_layout(title_text=layout_update["title.text"],
                  annotations=layout_update["annotations"],
                  updatemenus=[dict(type="buttons", showactive=True, buttons=buttons)])

# Display the plot
fig.show()


In [None]:
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from IPython.display import display, clear_output

import plotly.express as px

def plot_random_images_grid(X, y, y_true, grid_size=(4, 4)):
    num_images = grid_size[0] * grid_size[1]
    indices = np.random.choice(len(X), size=num_images, replace=False)

    fig = make_subplots(rows=grid_size[0], cols=grid_size[1],
                        subplot_titles=[f'ŷ: {y[i]}, y: {y_true[i]}' for i in indices])

    for i, index in enumerate(indices, 1):
        row = (i - 1) // grid_size[1] + 1
        col = (i - 1) % grid_size[1] + 1

        image = X[index].reshape(28, 28)

        trace = go.Heatmap(z=image[::-1], colorscale='gray', showscale=False)
        fig.add_trace(trace, row=row, col=col)

        fig.update_xaxes(showticklabels=False, row=row, col=col)
        fig.update_yaxes(showticklabels=False, row=row, col=col)

    # Adjust the figure appearance
    fig.update_layout(height=grid_size[0]*200, width=grid_size[1]*200)

    return fig
plot_random_images_grid(X_test, y_pred, y_test, grid_size=(4, 4))

Comparison between two models: CatBoost (iterations=100, depth=6, learning_rate=0.05) and Logistic Regression (max_iter=100)

| Name | CatBoostClassifier | Logistic Regression |
|-------------|-------------|-------------|
| Accuracy  | 0.9351 | 0.9215 |

<h3>Conclusions</h3>
In the analysis of the breast cancer dataset, we performed classification based on the target labels 'malignant' and 'benign'. We constructed a confusion matrix and juxtaposed the results with those obtained from <a href = "https://fedmug.github.io/kbtu-ml-book/linear_classification/log_reg.html">Logistic Regression</a>.

Additionally, for the MNIST dataset, we leveraged CatBoost's Multiclass loss function to classify digits. We created a confusion matrix, compared the outcomes with those from <a href = "https://fedmug.github.io/kbtu-ml-book/linear_classification/multi_log_reg.html">Multinomial Logistic Regression</a>, and presented the classification results.

These findings contribute to a comprehensive understanding of the model performances across different datasets and highlight the utility of CatBoost in both binary and multiclass classification scenarios.

In [None]:
from jupyterquiz import display_quiz
import json
from base64 import b64encode

def get_spanned_encoded_q(q, q_name):
    byte_code = b64encode(bytes(json.dumps(q), 'utf8'))
    return f'<span style="display:none" id="{q_name}">{byte_code.decode()}</span>'

q_e_auc = [{
    "question": "You are building a CatBoost model for a binary classification task. The model predicts probabilities, and you want to maximize the AUC-ROC. What would be the AUC-ROC score if your model predicts the probabilities [0.9, 0.7, 0.3, 0.1] for the positive class and the true labels of these objects are [1, 0, 1, 0]?",
    "type": "many_choice",
    "answers": [
        {
            "answer": "0.5",
            "correct": False,
            "feedback": "Incorrect answer"
        },
        {
            "answer": "0.75",
            "correct": True,
            "feedback": "Explanation: The AUC-ROC score is the area under the Receiver Operating Characteristic curve, which evaluates the trade-off between true positive rate and false positive rate. Equivalently, it is the share of (positive, negative) pairs that are ranked correctly: here 3 of the 4 pairs, since only the pair (0.3, 0.7) is misordered."
        },
        {
            "answer": "0.8",
            "correct": False,
            "feedback": "Incorrect answer"
        },
        {
            "answer": "1.0",
            "correct": False,
            "feedback": "Incorrect answer"
        },
    ]
}]


display_quiz(q_e_auc)



## CatBoost regression

CatBoost Regression is based on the same gradient boosting algorithm as CatBoost Classifier. Gradient boosting is an ensemble method that combines predictions from multiple weak models (usually decision trees) to create a stronger model.

In regression tasks, CatBoost minimizes RMSE by default, which has the same optimum as the mean squared error (MSE). MSE measures the average squared difference between predicted and actual values. Training looks for the tree structures and leaf values that minimize this loss.
   
<h3>Purpose</h3>

The `CatBoostRegressor` class is used for training and applying regression models.

<h3>Description</h3>

The `cat_features` parameter is a one-dimensional array of categorical column indices (specified as integers) or names (specified as strings).

This array can contain both indices and names for different elements.

If any features in the `cat_features` parameter are specified as names instead of indices, feature names must be provided for the training dataset. In that case, the `X` argument in later calls of the `fit` function must be either a `catboost.Pool` with defined feature names or a `pandas.DataFrame` with defined column names.

### Popular regression models

<h3>RMSE</h3>

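For regression, the raw model outputs (approxes) need no transformation. As an example of a regression loss function and metric, take the well-known RMSE, defined as follows, where $t_i$ is the target, $a_i$ is the model output, and $w_i$ is the weight of the $i$-th object:

$RMSE = \sqrt{\frac{\sum_{i=1}^{N}{w_i (t_i - a_i)^2}}{\sum_{i=1}^{N}{w_i}}}$

The MSE derivative is easier to compute, so we use it instead of the RMSE derivative. This does not affect the solution, because both metrics have the same optimum. CatBoost treats a custom objective as a function to maximize, so the derivatives are taken of the negated squared error with constant positive factors dropped, $F = -\frac{1}{2}\sum_{i=1}^{N}{w_i (t_i - a_i)^2}$:

$\frac{\partial F}{\partial a_i} = w_i (t_i - a_i)$

$\frac{\partial^2 F}{(\partial a_i)^2} = -w_i$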
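
The derivative formulas can be checked numerically. The sketch below assumes the objective $-\frac{1}{2}\sum_i w_i (t_i - a_i)^2$ (the negated, unnormalized squared error, which is an assumption about the sign and scaling convention) and compares the formulas above with finite differences on made-up values. All function names are hypothetical helpers:

```python
def objective(targets, approxes, weights):
    """Negated half of the weighted squared error (assumed convention)."""
    return -0.5 * sum(w * (t - a) ** 2
                      for t, a, w in zip(targets, approxes, weights))


def first_derivative(t, a, w):
    return w * (t - a)


def second_derivative(w):
    return -w


def numeric_derivatives(targets, approxes, weights, i, eps=1e-3):
    """Central finite differences of the objective with respect to approxes[i]."""
    up = list(approxes)
    down = list(approxes)
    up[i] += eps
    down[i] -= eps
    f_up = objective(targets, up, weights)
    f_mid = objective(targets, approxes, weights)
    f_down = objective(targets, down, weights)
    return (f_up - f_down) / (2 * eps), (f_up - 2 * f_mid + f_down) / eps ** 2


# Made-up targets, model outputs, and weights
targets = [3.0, -1.0, 2.5]
approxes = [2.0, 0.5, 2.5]
weights = [1.0, 2.0, 0.5]

for i, (t, a, w) in enumerate(zip(targets, approxes, weights)):
    num_d1, num_d2 = numeric_derivatives(targets, approxes, weights, i)
    print(i, first_derivative(t, a, w), round(num_d1, 6),
          second_derivative(w), round(num_d2, 6))
```

The analytic and numeric columns agree up to rounding, which confirms that the first derivative points from the prediction toward the target and the second derivative is the negated weight.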

<h3>Poisson model</h3>

The probability mass function of the Poisson distribution is:

$P(k) = \frac{e^{-\lambda} \lambda^k}{k!}$

where $P(k)$ is the probability of observing $k$ events per unit of time, given the event rate (the expected number of events per unit of time) $\lambda$.

The idea of Poisson regression is to treat the event rate $\lambda$ as the dependent variable, so that it becomes a function $\lambda(X_i)$ of the features. The loss function is the negative log-likelihood of the Poisson distribution with the constant term $\log(y_i!)$ dropped, so minimizing it maximizes the log-likelihood:

$L_{\text{poisson}} =  \sum_{i=1}^{N}\left(\lambda(X_i) - y_i \log (\lambda(X_i) )  \right)$
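
To make the loss concrete, here is a minimal sketch for the simplest case of a constant rate (no features), which is an assumption made to keep the example short. The counts are made up, and `poisson_pmf` and `poisson_loss` are hypothetical helpers. For a constant rate, the loss is minimized when $\lambda$ equals the mean of the observed counts:

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing k events given the event rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_loss(counts, lam):
    """Sum of lam - y * log(lam): negative log-likelihood without log(y!)."""
    return sum(lam - y * math.log(lam) for y in counts)


# Made-up event counts
counts = [2, 0, 3, 1, 4]
mean_count = sum(counts) / len(counts)

# Scan candidate rates 0.5, 0.6, ..., 4.4 and keep the one with the lowest loss
candidates = [i / 10 for i in range(5, 45)]
best_rate = min(candidates, key=lambda lam: poisson_loss(counts, lam))

print(mean_count, best_rate)
```

In Poisson regression the rate is no longer a single number but a function of the features, and the same loss is summed over all training objects.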


In [50]:
import numpy 
from catboost import CatBoostRegressor

dataset = numpy.array([[1,4,5,6],[4,5,6,7],[30,40,50,60],[20,15,85,60]])
train_labels = [1.2,3.4,9.5,24.5]
model = CatBoostRegressor(learning_rate=1, depth=6, loss_function='RMSE')
fit_model = model.fit(dataset, train_labels, verbose=False)

print(fit_model.get_params())

{'learning_rate': 1, 'depth': 6, 'loss_function': 'RMSE'}


````{admonition} Question
:class: important

How does adjusting $\lambda$ impact the shape and characteristics of the distribution?

```{admonition} Answer
:class: tip, dropdown

Adjusting $\lambda$ changes both the location and the spread of the distribution, because the mean and the variance of a Poisson distribution both equal $\lambda$. As $\lambda$ increases, the distribution shifts to the right (a higher average rate of occurrence) and becomes wider and more symmetric. As $\lambda$ decreases, it shifts to the left, becomes narrower, and grows more right-skewed, with most of the probability mass near zero.
```
````

In [None]:
from jupyterquiz import display_quiz

q_e_mae = [{
    "question": "In a regression task using CatBoost, your model predicts the values [5, 7, 10] for three instances. The actual target values are [4, 8, 9]. Calculate the Mean Absolute Error (MAE).",
    "type": "many_choice",
    "answers": [
        {
            "answer": "0.5",
            "correct": False,
            "feedback": "Incorrect! Go check your solution"
        },
        {
            "answer": "2",
            "correct": False,
            "feedback": "Incorrect! Go check your solution"
        },
        {
            "answer": "3",
            "correct": False,
            "feedback": "Incorrect! Go check your solution"
        },
        {
            "answer": "1",
            "correct": True,
            "feedback": "Explanation: MAE is the average absolute difference between predicted and actual values. In this case, it is (|5 - 4| + |7 - 8| + |10 - 9|) / 3 = 1."
        },
    ]
}]

display_quiz(q_e_mae)



| Name | Used for optimization | GPU support |
|-------------|-------------|-------------|
| MAE  | + | - |
| RMSE  | + | + |
| Poisson  | + | + |
| R2  | - | - |

````{admonition} Question
:class: important

In a regression problem, explain when you would choose Mean Absolute Error (MAE) over Root Mean Squared Error (RMSE) as a metric in CatBoost.

````
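
One way to explore this question is to compare how the two metrics react to a single large error. In the sketch below (made-up numbers, hand-written helper functions rather than CatBoost metrics), both prediction sets have the same MAE, but the one with a single outlier error has twice the RMSE:

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [10.0, 12.0, 11.0, 13.0]
evenly_off = [11.0, 11.0, 12.0, 12.0]    # every prediction is off by 1
one_outlier = [10.0, 12.0, 11.0, 17.0]   # three exact predictions, one off by 4

print(mae(y_true, evenly_off), rmse(y_true, evenly_off))    # 1.0 1.0
print(mae(y_true, one_outlier), rmse(y_true, one_outlier))  # 1.0 2.0
```

Because RMSE squares each error before averaging, it penalizes the single large miss much more heavily than MAE does.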




### Example: Boston dataset

In [39]:
import pandas as pd
from catboost import CatBoostRegressor

# Load the Boston housing data from a local CSV file
boston = pd.read_csv("assets/Boston.csv")

# Single feature: lstat (share of lower-status population); target: medv (median home value)
X = boston[['lstat']]
y = boston['medv']

X.size, y.size

(506, 506)

In [40]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Train with default parameters; note that the metrics below are computed on the training data
model = CatBoostRegressor()
model.fit(X, y, verbose=False)
y_pred = model.predict(X)

r2 = r2_score(y, y_pred)
print(f"R2 Score: {r2}")

mae = mean_absolute_error(y, y_pred)
print(f"Mean Absolute Error: {mae}")

# mean_squared_error returns MSE (not RMSE), so the variable is named accordingly
mse = mean_squared_error(y, y_pred)
print(f"Mean Squared Error: {mse}")

R2 Score: 0.7910914683615309
Mean Absolute Error: 3.081905582171659
Mean Squared Error: 17.635965518155835


| Name | CatBoost | Simple Linear Regression |
|-------------|-------------|-------------|
| MAE  | 3.2326 | 4.5053 |
| MSE  | 19.4215 | 38.482 |
| R2  | 0.7699 | 0.5441 |
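
For reference, the R2 score compares the model's squared error with that of a baseline that always predicts the mean of the targets. A minimal sketch with a hand-written helper and made-up numbers (not the Boston data):

```python
def r2_manual(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
mean_only = [2.5, 2.5, 2.5, 2.5]  # baseline: always predict the mean

print(r2_manual(y_true, y_pred))
print(r2_manual(y_true, mean_only))
```

A score of 1 means a perfect fit, 0 means no better than the mean baseline, and negative values mean worse than that baseline.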


In [41]:
import plotly.express as px

scatter_data = pd.DataFrame({
    'lstat': X["lstat"],
    'y_test': y,
    'y_pred': y_pred
})

fig = px.scatter(scatter_data, x='lstat', y=['y_test', 'y_pred'], labels={'y_test': 'Actual', 'y_pred': 'Predicted'})

fig.update_layout(
    title='Actual vs Predicted Values',
    xaxis_title='lstat',
    yaxis_title='Values',
    legend_title='Data'
)

# Show the plot
fig.show()


## QUIZ TIME

In [44]:
from jupyterquiz import display_quiz

q_e_cat = [
    {
        "question": "What type of machine learning algorithm is CatBoost?",
        "type": "many_choice",
        "answers": [
            {"answer": "Decision Trees", "correct": False, "feedback": "Incorrect! CatBoost uses decision trees as base learners, but the algorithm itself is gradient boosting over those trees."},
            {"answer": "Neural Networks", "correct": False, "feedback": "Not quite right! CatBoost's operating principle is gradient boosting on decision trees, not neural networks."},
            {"answer": "Gradient Boosting", "correct": True, "feedback": "Correct! CatBoost is a gradient boosting library. Its categorical boosting allows the direct use of categorical features without extensive preprocessing, which distinguishes it from traditional gradient boosting."},
            {"answer": "k-Nearest Neighbors", "correct": False, "feedback": "Not the right choice! CatBoost is based on gradient boosting with decision trees, not on distance-based methods such as k-nearest neighbors."},
        ]
    },
    {
        "question": "What is the purpose of the Pool class in CatBoost?",
        "type": "many_choice",
        "answers": [
            {"answer": "It represents a swimming pool where the model can relax.", "correct": False, "feedback": "Incorrect! The Pool class is a data container: it stores and manages the dataset passed to CatBoost."},
            {"answer": "It is used to store and manage training data.", "correct": True, "feedback": "Correct! The Pool class is used to store and manage training data in CatBoost."},
            {"answer": "It defines the depth of each tree in the ensemble.", "correct": False, "feedback": "Not quite right! The depth of each tree is controlled by the 'depth' parameter, not the Pool class."},
            {"answer": "It specifies the learning rate of the boosting algorithm.", "correct": False, "feedback": "Not the right choice! The learning rate is controlled by the 'learning_rate' parameter, not the Pool class."},
        ]
    },
    {
        "question": "What does the `loss_function` parameter in CatBoost define?",
        "type": "many_choice",
        "answers": [
            {"answer": "The function used to calculate the training error.", "correct": False, "feedback": "Incorrect! The loss_function parameter in CatBoost defines the function used to calculate the loss for each prediction."},
            {"answer": "The function used to calculate the validation error.", "correct": False, "feedback": "Not quite right! The loss_function parameter is used for training, not validation."},
            {"answer": "The function used to calculate the gradient during optimization.", "correct": False, "feedback": "Not quite right! Gradients are derived from the loss, but the loss_function parameter itself defines the function used to calculate the loss for each prediction."},
            {"answer": "The function used to calculate the loss for each prediction.", "correct": True, "feedback": "Correct! The loss_function parameter in CatBoost defines the function used to calculate the loss for each prediction."},
        ]
    },
    {
        "question": "In CatBoost Regression, what type of output does the model predict?",
        "type": "many_choice",
        "answers": [
            {"answer": "Continuous values", "correct": True, "feedback": "Correct! CatBoost Regression predicts continuous values."},
            {"answer": "Discrete values", "correct": False, "feedback": "Incorrect! CatBoost Regression predicts continuous values, not discrete ones."},
            {"answer": "Binary values", "correct": False, "feedback": "Not the right choice! CatBoost Regression predicts continuous values."},
            {"answer": "Categorical values", "correct": False, "feedback": "Not quite right! CatBoost Regression predicts continuous values."},
        ]
    },
    {
        "question": "How does CatBoost handle categorical features in the context of regression?",
        "type": "many_choice",
        "answers": [
            {"answer": "It ignores categorical features.", "correct": False, "feedback": "Incorrect! CatBoost automatically converts categorical features to numerical features in regression."},
            {"answer": "It automatically converts them to numerical features.", "correct": True, "feedback": "Correct! CatBoost automatically converts categorical features to numerical features in regression."},
            {"answer": "It requires explicit encoding of categorical features.", "correct": False, "feedback": "Not quite right! CatBoost automatically handles the conversion of categorical features in regression."},
            {"answer": "It treats them as binary features.", "correct": False, "feedback": "Not the right choice! CatBoost automatically converts categorical features to numerical features in regression."},
        ]
    }
]

display_quiz(q_e_cat)


<div style = "display: flex; justify-content: space-around;">
    <img src = "images/cat1.png" width = 300 height = 300>

</div>
<div style = "display: flex; justify-content: center;">
    <img src = "images/box.png" width = 150 height = 100>
    <img src = "images/shred.png" width = 200 height = 200>
</div>
<div style = "display: flex; justify-content: space-around; align-items: center;">
    <img src = "images/cat2.png" width = 400 height = 400>
</div>

In [46]:
from jupyterquiz import display_quiz

q_e_cat = [
    {
        "question": "Rate the project! ",
        "type": "many_choice",
        "answers": [
            {"answer": "grade < 10", "correct": False, "feedback": "Incorrect! Choose another option"},
            {"answer": "10 < grade < 20", "correct": False, "feedback": "Maybe we can work something out? ;D "},
            {"answer": "20 < grade < 30", "correct": True, "feedback": "Correct! All kitties are safe now"},
            {"answer": "Naah", "correct": False, "feedback": "Please..."},
        ]
    }
]
display_quiz(q_e_cat)
