# **CatBoost Algorithm**

One of the best algorithms for `classification` tasks

- `CatBoost` is an open-source library for gradient boosting on decision trees.
- It is developed by Yandex researchers and engineers, and is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction and many other tasks at Yandex and in other companies.
- It has a Python API (alongside R and command-line interfaces) and is designed to fit into data science pipelines.
- It often delivers strong results with little tuning and handles categorical features natively.
- Do we need to encode categorical features before training the model? `No, CatBoost does not require it.` Pass the column names via `cat_features` and CatBoost encodes them internally.
- It is efficient. It provides a fast and scalable multi-threaded implementation of the algorithm.
- It provides powerful visualization tools to understand the model.
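
To give some intuition for how categorical columns can be used without manual encoding, here is a simplified sketch of ordered target statistics, the idea CatBoost's documentation describes for turning categories into numbers. The function name `ordered_target_stats`, the `prior` value and the toy data are assumptions made for illustration; the library's real implementation also relies on random permutations and differs in detail.

```python
def ordered_target_stats(categories, targets, prior=0.5):
    """Encode each row's category using only the rows that come before it."""
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = []
    for cat, target in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # smoothed mean of the target over earlier rows with the same category
        encoded.append((s + prior) / (n + 1))
        # update the running statistics only after encoding the current row,
        # so a row never sees its own label
        sums[cat] = s + target
        counts[cat] = n + 1
    return encoded


# toy data (made up for illustration)
sex = ["male", "female", "female", "male", "female"]
survived = [0, 1, 1, 0, 0]

encoded = ordered_target_stats(sex, survived)
print(encoded)
```

Because each row is encoded from earlier rows only, its own label never leaks into its feature value, which is the main motivation for the ordered variant.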

In [None]:
# !pip install catboost -q

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# load the Titanic dataset via seaborn (downloaded on first use)
df = sns.load_dataset('titanic')
df.head()

# pre-processing


In [None]:
df.isnull().sum().sort_values(ascending=False)

In [None]:
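# impute missing ages with KNNImputer
# note: 'age' is the only input column here, so there are no other features to
# measure distance on and the imputer effectively fills with the column mean
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df['age'] = imputer.fit_transform(df[['age']]).ravel()  # flatten (n, 1) -> (n,)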

# fill missing embarked / embark_town values with the most frequent value (mode)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
# drop the deck column: most of its values are missing
df = df.drop(columns='deck')

# confirm that no missing values remain
df.isnull().sum().sort_values(ascending=False)
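
KNN imputation becomes more meaningful when the imputer can see other numeric columns and use them to find similar rows. The sketch below shows that on a tiny made-up array; the variable names (`X_toy`, `X_toy_filled`) and the values are assumptions for illustration, not part of the Titanic data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy rows: [age, fare, sibsp]; values are made up for illustration
X_toy = np.array([
    [22.0,    7.25, 1.0],
    [38.0,   71.28, 1.0],
    [np.nan,  7.92, 0.0],
    [35.0,   53.10, 1.0],
    [np.nan,  8.05, 0.0],
    [54.0,   51.86, 0.0],
])

# each missing age is replaced by the mean age of the 2 nearest rows,
# where distance is measured on the columns both rows have (fare, sibsp)
toy_imputer = KNNImputer(n_neighbors=2)
X_toy_filled = toy_imputer.fit_transform(X_toy)

observed = ~np.isnan(X_toy)  # positions that were not missing
print(X_toy_filled[:, 0])    # the age column after imputation
```

The columns are not scaled here, so `fare` dominates the distance; scaling the features first is a common refinement.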

In [None]:
df.info()

In [None]:
df = df.drop(['alive'], axis=1) # 'alive' is a yes/no copy of the target 'survived', so keeping it would leak the label

# select every object/category column
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
# convert those columns to the pandas 'category' dtype in place
df[categorical_cols] = df[categorical_cols].astype('category')

In [None]:
df.dtypes

In [None]:
# split data into X and y
X = df.drop('survived', axis=1)
y = df['survived']

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
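
When the classes are imbalanced, a plain random split can give the train and test sets slightly different class ratios. The sketch below uses the `stratify` argument of `train_test_split` on made-up data to keep the ratio the same in both parts; the `*_toy` names and the 40/60 class mix are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 100 rows, 40% positive class (made up for illustration)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy keeps the class proportions (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())  # share of positives in each split
```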


In [None]:
%%time
# configure the CatBoost classifier
model = CatBoostClassifier(iterations=1000, # maximum number of trees to build
                           learning_rate=0.001, # shrinkage applied to each tree's contribution; this is very small for 1000 trees, so larger values (e.g. 0.03 to 0.1) are worth trying
                           depth=3, # depth of each tree
                           loss_function='Logloss', # objective optimized during training; Logloss suits binary classification
                           eval_metric='Accuracy', # metric used for overfitting detection and best-model selection when an eval set is given; it is not the optimized loss
                           random_seed=42,
                           verbose=False) # suppress per-iteration training output

# train the model
model.fit(X_train, y_train, cat_features=categorical_cols.tolist()) # cat_features names the categorical columns; CatBoost encodes them internally, so no manual encoding is needed

# predictions
y_pred = model.predict(X_test)

# evaluate the model
print(f'Accuracy Score: {accuracy_score(y_test, y_pred)}')
print(f'Confusion Matrix: \n {confusion_matrix(y_test, y_pred)}')
print(f'Classification Report: \n {classification_report(y_test, y_pred)}')

# plot confusion matrix
plt.figure(figsize=(10, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='viridis')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
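
To see why `learning_rate` and `iterations` need to be chosen together, here is a deliberately simplified sketch of boosting under squared loss, where every "tree" fits the current residual perfectly and only the shrinkage step matters. The function `boost_constant` and the single-target setup are assumptions for illustration; this is not how CatBoost itself is implemented.

```python
def boost_constant(target, learning_rate, iterations):
    """Boost towards a single target value, one shrunken step per iteration."""
    prediction = 0.0
    for _ in range(iterations):
        residual = target - prediction          # what is still unexplained
        prediction += learning_rate * residual  # add a shrunken correction
    return prediction


# same number of iterations, two different learning rates
slow = boost_constant(1.0, learning_rate=0.001, iterations=1000)
fast = boost_constant(1.0, learning_rate=0.1, iterations=1000)

print(f"learning_rate=0.001 -> {slow:.3f}")
print(f"learning_rate=0.1   -> {fast:.3f}")
```

In this toy setting the remaining error after `n` steps is `(1 - learning_rate) ** n`, so a very small learning rate can leave the model noticeably underfit unless the number of iterations is raised to match.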

# Assignment Alert:
- In this notebook, you learned how to use the CatBoost algorithm to make predictions.
- Repeat the same workflow on a dataset of your choice and submit the results via Discord.

---

In [None]:
# feature importance - this will show the importance of each feature in the model
feature_importance = model.get_feature_importance(prettified=True)
plt.figure(figsize=(10, 6))
sns.barplot(x='Importances', y='Feature Id', data=feature_importance)
plt.title('Feature Importance')
plt.show()
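
A model-agnostic way to cross-check built-in importances is permutation importance: shuffle one column and measure how much the score drops. The sketch below is a hand-rolled version on made-up data; `toy_model`, `permutation_importance` and the rows are hypothetical and stand in for a trained model and a real test set. It is a different technique from the importances plotted above.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling the values of one column."""
    rng = random.Random(seed)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    # rebuild each row with the shuffled value in the chosen column
    permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)


def toy_model(row):
    """Hypothetical model: predicts from column 0 only and ignores column 1."""
    return row[0]


# toy rows: (is_female, some_unrelated_number); made up for illustration
rows = [(1, 5), (0, 3), (1, 8), (0, 1), (1, 2), (0, 9), (1, 4), (0, 7)]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

imp_used = permutation_importance(toy_model, rows, labels, column=0)
imp_ignored = permutation_importance(toy_model, rows, labels, column=1)
print(imp_used, imp_ignored)
```

A column the model never reads always gets an importance of zero here, while shuffling a column the model depends on can only hurt or leave the score unchanged.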