# CatBoost Algorithm
CatBoost is a gradient boosting algorithm developed by Yandex. It is part of the ensemble learning family and is specifically designed to handle categorical features effectively.
Key Features of CatBoost:

    Gradient Boosting: It builds decision trees sequentially, where each new tree corrects the errors made by previous trees.
    Handling Categorical Data: Unlike gradient boosting libraries that traditionally expect categorical columns to be encoded by hand (for example, one-hot encoding before training XGBoost), CatBoost accepts categorical features directly and encodes them internally.
    High Performance: CatBoost is known for fast prediction and strong accuracy with little tuning, and it is often competitive with or better than other boosting libraries on data with many categorical features.
    Overfitting Reduction: It uses techniques like ordered boosting and regularization to reduce overfitting.
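
The sequential, error-correcting idea in the list above can be sketched for squared-error regression, where each new tree is fit to the residuals left by the trees before it. This is a minimal illustration built on scikit-learn's `DecisionTreeRegressor` and synthetic data, not CatBoost's own implementation; the function names `fit_boosted_trees` and `predict_boosted_trees` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, depth=2):
    """Fit shallow trees one after another, each on the current residuals."""
    base = float(np.mean(y))                  # start from a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                   # errors left by the trees so far
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X, residual)
        pred += learning_rate * tree.predict(X)   # small corrective step
        trees.append(tree)
    return base, trees

def predict_boosted_trees(base, trees, X, learning_rate=0.1):
    """Sum the constant start value and every tree's scaled correction."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# synthetic data, assumed only for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = fit_boosted_trees(X_demo, y_demo)
mse_start = np.mean((y_demo - base) ** 2)
mse_end = np.mean((y_demo - predict_boosted_trees(base, trees, X_demo)) ** 2)
print(f"training MSE: {mse_start:.3f} -> {mse_end:.3f}")
```

The `learning_rate` and `depth` arguments play the same roles here as the CatBoost parameters of the same names: one scales each tree's correction, the other limits how complex each tree can be.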

Use Cases:

    CatBoost is used for both classification and regression tasks and performs well with both numerical and categorical data.

In short, CatBoost is a powerful gradient boosting algorithm optimized for categorical features, providing high performance and reduced overfitting.
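
CatBoost's handling of categorical features is based on ordered target statistics: a category value is replaced by a smoothed average of the target, computed only from rows that come earlier in a random permutation, so a row's own label never enters its encoding. The sketch below is a simplified pure-Python version of that idea. The function name and the `prior` default are assumptions for illustration, and the library itself adds further details such as several permutations and feature combinations.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode each category using only the targets of rows seen earlier."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)        # random processing order
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in order:
        cat = categories[i]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # smoothed mean of earlier targets; equals `prior` for a first occurrence
        encoded[i] = (seen_sum + prior) / (seen_count + 1)
        # only now is the current row's target added to the running totals
        sums[cat] = seen_sum + targets[i]
        counts[cat] = seen_count + 1
    return encoded

# hypothetical port codes and survival labels
ports = ["S", "C", "S", "S", "C", "Q"]
labels = [0, 1, 1, 0, 1, 0]
print(ordered_target_stats(ports, labels))
```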


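# Install CatBoost

Installing CatBoost inside a conda environment (for example, with `conda install -c conda-forge catboost`) helps avoid dependency errors.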

In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [3]:
# load the Titanic dataset that ships with seaborn
df = sns.load_dataset('titanic')
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

# Preprocessing the data


Summary of the Code:

    Imputing Missing Values in age:
        KNNImputer fills each missing value in the age column with the average age of its k = 5 nearest neighbors, that is, the rows most similar to the one being imputed.

    Imputing Missing Values in embarked and embark_town:
        Missing categorical values in embarked and embark_town are filled with the mode (most frequent value) of each column, assuming that the most common value is the best guess for missing data.

    Dropping the deck Column:
        The deck column is dropped because 688 of its values are missing (about 77% of the 891 rows), which leaves too little information to impute reliably.

    Checking for Remaining Missing Values:
        The code checks the remaining missing values in the DataFrame using isnull().sum() and sorts them in descending order to quickly identify columns with the most missing data.

This code helps clean and prepare the dataset by handling missing values in a structured manner, making the data ready for analysis or modeling.
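
The two imputation strategies described above can be tried on a toy frame before they are applied to the full dataset. The values below are hypothetical. One detail worth checking: `KNNImputer` measures similarity using the columns it is given, so when it receives only the column being imputed there are no distances to compare and scikit-learn falls back to that column's mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# toy data, assumed only for illustration
toy = pd.DataFrame({
    "age":      [20.0, 22.0, np.nan, 60.0, 62.0],
    "fare":     [7.0, 8.0, 7.5, 80.0, 82.0],
    "embarked": ["S", "S", None, "C", "S"],
})

# age alone: no other column defines neighbors, so the observed mean is used
age_only = KNNImputer(n_neighbors=2).fit_transform(toy[["age"]])[:, 0]

# age with fare: the two rows with the closest fare supply the value
age_with_fare = KNNImputer(n_neighbors=2).fit_transform(toy[["age", "fare"]])[:, 0]

# categorical column: fill with the most frequent value (the mode)
embarked_filled = toy["embarked"].fillna(toy["embarked"].mode()[0])

print(age_only[2], age_with_fare[2], embarked_filled[2])
```

Because distances are computed on raw values, a column with a large range such as fare can dominate the neighbor search; scaling the columns first is a common precaution.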

In [5]:
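# impute missing age values with KNNImputer
# note: KNNImputer finds neighbors using the columns it is given; with age alone
# it falls back to the column mean, so the other numeric columns are passed too
from sklearn.impute import KNNImputer
num_cols = ['age', 'pclass', 'sibsp', 'parch', 'fare']
imputer = KNNImputer(n_neighbors=5)
df['age'] = imputer.fit_transform(df[num_cols])[:, 0]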

# impute embarked missing values using pandas
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
# drop deck column
df.drop('deck', axis=1, inplace=True)

# df missing values
df.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [6]:
# convert every object or category column to the pandas 'category' dtype
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
df[categorical_cols] = df[categorical_cols].astype('category')

In [7]:
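# split data into X and y
# NOTE: X still contains 'alive', a yes/no copy of 'survived', so the target leaks into the features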

X = df.drop('survived', axis=1)
y = df['survived']

# split data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

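Summary of the Model Parameters

    iterations: Sets the number of trees to build (100 here).
    learning_rate: Scales each tree's contribution; smaller values learn more slowly and usually need more iterations.
    depth: Sets the depth of each tree, which controls the complexity of the model.
    loss_function: The objective minimized during training; Logloss is the usual choice for binary classification.
    eval_metric: The metric used to report how well the model is performing (Accuracy here); it does not change what is optimized.
    random_seed: Fixes the random state so that results are reproducible.
    verbose: Controls whether progress updates are shown during training.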
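
To make the difference between `loss_function` and `eval_metric` concrete, the sketch below computes Logloss and Accuracy by hand for two hypothetical sets of predicted probabilities. Both sets classify every example correctly at a 0.5 threshold, so accuracy cannot tell them apart, while log loss rewards the more confident correct predictions. The helper names are illustrative and are not part of CatBoost.

```python
import math

def log_loss(y_true, probs):
    """Average negative log-likelihood of the true labels."""
    eps = 1e-15
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)         # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def accuracy(y_true, probs, threshold=0.5):
    """Share of examples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, probs))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
hesitant = [0.6, 0.4, 0.6, 0.4]     # correct, but barely past the threshold
confident = [0.9, 0.1, 0.9, 0.1]    # correct and far from the threshold

for name, probs in [("hesitant", hesitant), ("confident", confident)]:
    print(name, "accuracy:", accuracy(y_true, probs),
          "logloss:", round(log_loss(y_true, probs), 4))
```

This is one reason to train on a smooth loss such as Logloss and keep Accuracy for reporting: the loss still changes when predictions improve without crossing the threshold.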

In [9]:
# run the catboost classifier
model = CatBoostClassifier(iterations=100,
                           learning_rate=0.1,
                           depth=3,
                           loss_function='Logloss',
                           eval_metric='Accuracy',
                           random_seed=42,
                           verbose=False)

# train the model
model.fit(X_train, y_train, cat_features=categorical_cols.tolist())

# predictions
y_pred = model.predict(X_test)

# evaluate the model
print(f'Accuracy Score: {accuracy_score(y_test, y_pred)}')
print(f'Confusion Matrix: \n {confusion_matrix(y_test, y_pred)}')
print(f'Classification Report: \n {classification_report(y_test, y_pred)}')

Accuracy Score: 1.0
Confusion Matrix: 
 [[105   0]
 [  0  74]]
Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       105
           1       1.00      1.00      1.00        74

    accuracy                           1.00       179
   macro avg       1.00      1.00      1.00       179
weighted avg       1.00      1.00      1.00       179
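
A perfect score on this dataset is unusual enough to warrant a leakage check. In the seaborn version of Titanic, the `alive` column is a yes/no restatement of `survived`, and it is still part of `X` in the split above, so the model can read the answer directly. The sketch below shows one way to flag such columns, using a toy frame with hypothetical values; `columns_mirroring_target` is an illustrative helper, not a pandas or CatBoost function.

```python
import pandas as pd

def columns_mirroring_target(df, target):
    """Return columns whose values map one-to-one onto the target's values."""
    mirrors = []
    for col in df.columns:
        if col == target:
            continue
        pairs = df[[col, target]].drop_duplicates()
        # one-to-one: each column value pairs with exactly one target value
        # and each target value pairs with exactly one column value
        if pairs[col].is_unique and pairs[target].is_unique:
            mirrors.append(col)
    return mirrors

# toy data, assumed only for illustration
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "alive":    ["no", "yes", "yes", "no", "yes", "no"],
    "sex":      ["male", "female", "male", "female", "female", "male"],
    "fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 8.46],
})

leaky = columns_mirroring_target(toy, "survived")
print(leaky)
```

On the real frame, the corresponding change would be `X = df.drop(['survived', 'alive'], axis=1)`, with `'alive'` also removed from the `cat_features` list before refitting. The score after that change is not reproduced here.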

