<font color="red" size="6">Ensemble methods</font>
<p> <font color="Yellow" size="5"><b>5_CatBoost (Categorical Boosting)</b></font>

CatBoost (Categorical Boosting) is a gradient boosting algorithm developed by Yandex. It is designed to handle categorical features efficiently without needing to perform one-hot encoding or other preprocessing steps. CatBoost is a powerful tool for both classification and regression tasks, and it often performs well with minimal hyperparameter tuning.

<font color="pink" size=4>Key Features of CatBoost:</font>
<ol>
    <li><font color="orange">Handling Categorical Features:</font> Unlike most other gradient boosting libraries, CatBoost accepts categorical features directly. It converts them to numerical representations internally, mainly through ordered target statistics, so no explicit encoding step such as one-hot encoding is needed.</li>
    <li><font color="orange">Ordered Boosting:</font> CatBoost introduces an innovative technique called "Ordered Boosting" that helps reduce overfitting, especially in small datasets.</li>
    <li><font color="orange">Efficient and Fast:</font> CatBoost is highly optimized for both speed and memory efficiency, making it faster than other gradient boosting methods in some cases.</li>
    <li><font color="orange">Regularization:</font> It supports L2 regularization and incorporates methods to avoid overfitting during training.</li>
    <li><font color="orange">Supports Multiclass Classification:</font> CatBoost can be used for multiclass classification tasks with minimal configuration.</li>
    <li><font color="orange">Explainability:</font> It provides tools to interpret and visualize feature importance and model behavior.</li></ol>

In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

<font color="pink" size=4>CatBoost Classifier:</font>
<ol>
     <li><font color="orange">iterations:</font> The number of boosting iterations (trees). We set it to 1000 here, but you can adjust it.</li>
     <li><font color="orange">learning_rate:</font> The learning rate for each iteration, which determines the size of the steps taken towards minimizing the loss function.</li>
     <li><font color="orange">depth:</font> The depth of the individual decision trees. Deeper trees can capture more complex patterns but might lead to overfitting.</li>
     <li><font color="orange">cat_features:</font> The Wine dataset has no categorical features, so the list is left empty. For a dataset that does have them, pass their column indices here.</li></ol>

In [1]:


# 3. Create the CatBoost classifier
catboost_model = CatBoostClassifier(
    iterations=1000,        # Number of boosting iterations
    learning_rate=0.1,      # Learning rate for each iteration
    depth=6,                # Depth of each tree
    random_state=42,        # Random seed for reproducibility
    cat_features=[]         # No categorical features for this dataset, but this can be used for datasets with categorical columns
)

# 4. Train the model
catboost_model.fit(X_train, y_train, verbose=200)  # Verbose outputs the training progress every 200 iterations

# 5. Make predictions on the test set
y_pred = catboost_model.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Display the classification report and confusion matrix
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


0:	learn: 1.0171507	total: 178ms	remaining: 2m 58s
200:	learn: 0.0194995	total: 817ms	remaining: 3.25s
400:	learn: 0.0084179	total: 1.37s	remaining: 2.05s
600:	learn: 0.0052795	total: 1.93s	remaining: 1.28s
800:	learn: 0.0038484	total: 2.46s	remaining: 613ms
999:	learn: 0.0030384	total: 3s	remaining: 0us
Accuracy: 0.9815

Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97        19
           1       1.00      0.95      0.98        21
           2       1.00      1.00      1.00        14

    accuracy                           0.98        54
   macro avg       0.98      0.98      0.98        54
weighted avg       0.98      0.98      0.98        54


Confusion Matrix:
[[19  0  0]
 [ 1 20  0]
 [ 0  0 14]]


<font color="pink" size=4>Hyperparameters in CatBoost:</font>

Here are some important hyperparameters that you can adjust to improve performance:
<ol>
    <li><font color="orange">iterations:</font> The number of boosting rounds (trees). Higher values can lead to better performance but might increase the risk of overfitting.</li>
    <li><font color="orange">learning_rate:</font> The learning rate controls how much each tree contributes to the final prediction. Lower values typically lead to better generalization but require more iterations.</li>
    <li><font color="orange">depth:</font> The maximum depth of individual trees. Deeper trees can capture more complex patterns but may overfit the data.</li>
    <li><font color="orange">cat_features:</font> Indexes of categorical features in the dataset (not used in the Wine dataset, but can be helpful for datasets with categorical columns).</li>
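    <li><font color="orange">subsample:</font> The fraction of training samples drawn for each tree. It takes effect with bootstrap types that support a sampling rate, such as Bernoulli or MVS, and can help prevent overfitting.</li>
    <li><font color="orange">reg_lambda:</font> L2 regularization coefficient on the leaf values (an alias of l2_leaf_reg). Larger values shrink leaf values more strongly, which helps prevent overfitting.</li>
    <li><font color="orange">border_count:</font> The number of splits (bin borders) used to quantize each numeric feature. CatBoost buckets numeric features into discrete bins before building trees; more borders allow finer splits at the cost of slower training.</li></ol>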