<a href="https://colab.research.google.com/github/ABBAS-37405/PYTHON-AND-DATA-SCIENCE/blob/main/CatBoost_ML_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CatBoost Classifier**

CatBoost is an open-source gradient boosting library developed by Yandex. It is known for strong performance without extensive hyperparameter tuning, especially on datasets with categorical features.

**CatBoost Classifier**

Purpose: Used for classification tasks, where the goal is to predict a categorical outcome (e.g., 'Low' or 'High' health risk).

Key Parameters Explained:

- iterations (or n_estimators): The number of boosting rounds, i.e. trees to build. More iterations can improve accuracy but also increase training time and the risk of overfitting. Default is 1000.
- learning_rate (or eta): Scales the contribution of each new tree (step size shrinkage) to prevent overfitting. A smaller learning rate requires more iterations but can lead to a more robust model. The documented fallback is 0.03, but when it is left unset CatBoost chooses a value automatically for common losses such as Logloss, MultiClass, and RMSE, based on the dataset and the number of iterations.
- depth: The maximum depth of the trees. Deeper trees can capture more complex relationships but are more prone to overfitting. Default is 6.
- l2_leaf_reg (or reg_lambda): Coefficient of the L2 regularization term in the cost function. Larger values help to prevent overfitting. Default is 3.0.
- loss_function: The objective function to optimize. For classification, common options include 'Logloss' (default for binary classification) or 'MultiClass' (for multi-class classification).
- random_seed (or random_state): Fixes the random seed so that results are reproducible. If it is not set, seed 0 is used.
- verbose: Controls the amount of diagnostic information printed to the console during training. Setting it to 0 (as in your notebook) suppresses output.

**CatBoost Regressor**

Purpose: Used for regression tasks, where the goal is to predict a continuous numerical outcome (e.g., heart_rate).

Key Parameters Explained:

- iterations, learning_rate, depth, l2_leaf_reg, random_seed, and verbose work the same way as in the Classifier: they control the boosting process and help prevent overfitting.
- loss_function: The objective function to optimize. For regression, common options include 'RMSE' (Root Mean Squared Error, default) or 'MAE' (Mean Absolute Error).
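
To make the difference between the two regression losses concrete, here is a small sketch in plain Python that computes both on four made-up heart-rate predictions (the numbers are invented for illustration). One large miss dominates RMSE far more than MAE, which is the usual reason to consider 'MAE' when outliers are expected.

```python
import math

# Hypothetical true and predicted heart rates; the last prediction is a large miss
y_true = [70, 72, 68, 75]
y_pred = [71, 70, 69, 95]

errors = [p - t for p, t in zip(y_pred, y_true)]

# MAE: average absolute error, each miss counts in proportion to its size
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the average squared error, large misses are amplified
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```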

**Advantages of CatBoost:**

- Handles categorical features directly: Columns declared as categorical (via the cat_features parameter) are encoded internally, so manual one-hot encoding, which can be memory-intensive and slow down training, is not required.
- Robust to overfitting: Uses ordered boosting and ordered target encoding to combat prediction shift and overfitting.
- Good default parameters: Often performs well with default settings, reducing the need for extensive hyperparameter tuning.

Your notebook uses random_state=42 and verbose=0 for both the Classifier and Regressor, which are good practices for reproducibility and cleaner output, respectively.
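
The ordered target encoding mentioned in the advantages can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's implementation: it uses a single fixed row order (CatBoost works with random permutations), and the function name and the `prior` and `prior_weight` arguments are hypothetical. The idea it shows is that each row's category is encoded using only the targets of rows that came before it, so a row never sees its own label.

```python
def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each category value using only the targets of earlier rows."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)    # sum of targets seen so far for this category
        n = counts.get(cat, 0)    # number of earlier rows with this category
        # Smoothed mean of earlier targets; falls back to the prior when n == 0
        encoded.append((s + prior_weight * prior) / (n + prior_weight))
        # Update the running statistics only after encoding the current row
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
enc = ordered_target_stats(cats, ys)
print(enc)
```

The first occurrence of each category receives the prior (0.5 here), and later occurrences move toward the mean of the targets observed before them.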

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv("healthcare_data_10000.csv")

# Encode target: Low = 0, High = 1
df['health_risk_category'] = df['health_risk_category'].map({'Low': 0, 'High': 1})

# Features and target
X = df.drop(columns=['health_risk_category'])
y = df['health_risk_category']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [4]:
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report

model = CatBoostClassifier(verbose=0, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Store the result under a different name so the imported function is not overwritten
report = classification_report(y_test, y_pred, target_names=['Low', 'High'])
print(report)

              precision    recall  f1-score   support

         Low       1.00      1.00      1.00      1738
        High       1.00      1.00      1.00       262

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000



# **CatBoost Regressor**

In [5]:
# Step 2: Select numeric columns
numeric_cols = [
    'age', 'bmi', 'systolic_bp', 'diastolic_bp',
    'cholesterol_level', 'glucose_level',
    'exercise_mins_per_week', 'alcohol_units_per_week', 'medications_count'
]
target = 'heart_rate'

# Step 3: Feature matrix (X) and target vector (y)
X = df[numeric_cols]
y = df[target]

# Step 4: Train-test split
X_train, X_test, y_train_reg, y_test_reg = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [6]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

model = CatBoostRegressor(verbose=0, random_state=42)
model.fit(X_train_scaled, y_train_reg)

y_pred_reg = model.predict(X_test_scaled)

MSE = mean_squared_error(y_test_reg, y_pred_reg)
print(MSE)

109.47677313935505
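
MSE is expressed in squared units of the target, which makes the number above hard to read directly. As a small follow-up sketch, taking the square root gives the RMSE in the same units as heart_rate; the value below is copied from the printed output, and the variable names are illustrative.

```python
import math

# MSE copied from the regression cell's printed output
mse = 109.47677313935505

# RMSE is the square root of MSE, in the target's own units
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f}")
```

This works out to roughly 10.46, meaning the typical prediction error is on the order of ten units of heart_rate.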
