# Hello everyone!


<img src="https://media4.giphy.com/media/v1.Y2lkPTc5MGI3NjExeTQ2anQ0NzYwMGl5anliZzczcnFjYjBjbGtkd3FwNnA0OGNncmZjbiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/LoCDk7fecj2dwCtSB3/giphy.webp">


Today we will dive into the world of classification problems in data science. Don’t worry: whether you are an engineer or an English teacher, I have you covered, and we will try to look at things from everyone’s perspective. So fasten your seat belts and get ready for this amazing journey! Since I assume you already know the basics of AI, data science, and machine learning, we will skip the introductions and get straight to the action. We will work with machine learning algorithms to address a classic classification problem. Let’s start with the first step.

Ready? Let’s get started!


<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*6LMDb34Pv4sTkt-55Zgo9A.png">


# Problem Definition

*The first step is always the same, no matter what type of problem you're working on: you must identify your problem, understand your dataset, and come up with a rock-solid game plan.*

Today, we focus on a very serious issue: cardiovascular diseases (CVDs). Did you know that CVDs are the leading cause of death globally?

They account for an estimated 17.9 million lives lost in a year, representing a gigantic 31% of all the deaths worldwide. And get this: four out of five of those deaths are due to heart attacks and strokes, with a third occurring prematurely in people under 70. CVDs usually lead to heart failure, which is where our dataset comes into play.

Our dataset has 11 features that help predict the likelihood of heart disease. Early detection and management are crucial for people who have cardiovascular disease or are at high cardiovascular risk because of risk factors such as hypertension, diabetes, hyperlipidemia, or an already established disease. That is exactly where a well-trained machine learning model can make a big difference. We now have the features and data we need for a model that predicts heart disease, so let's tuck in and build it bit by bit. Because every record carries a known label (HeartDisease), this is a supervised learning problem, specifically binary classification. Now let's examine the data.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/0*urEV9Jr7qoP--9Qa.png">

# Importing Libraries

(((: After identifying the problem, the first thing we need to do is import the necessary libraries so that we can start working on the tasks :)))

* If some libraries fail to import, the Kaggle environment may not include them or may ship an older version. In that case, install them manually with the cell below.

In [None]:
!pip install fairlearn
!pip install lime
!pip install xgboost
!pip install catboost
!pip install lightgbm
!pip install scikit-optimize
!pip install shap


In [None]:
import numpy as np 
import pandas as pd 
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
import datetime

warnings.filterwarnings("ignore")
pd.set_option("display.max_rows",None)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Fill Data 
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest

# Train-Test 
from sklearn.model_selection import train_test_split

# Preprocessing
from sklearn.preprocessing import StandardScaler

# Machine learning algorithms

###### Tree-based algorithms (plus Naive Bayes, which also needs no feature scaling)
import xgboost as xgb
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB


###### Non Tree Algorithm
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


#Hyperparameter optimizations
from sklearn.experimental import enable_halving_search_cv 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, HalvingRandomSearchCV
from skopt import BayesSearchCV

# Model evaluation
from sklearn.metrics import classification_report, f1_score ,recall_score, roc_auc_score

# Fairness Metrics bias mitigation
from fairlearn.metrics import MetricFrame, selection_rate, demographic_parity_difference, equalized_odds_difference


# Model Interpretability
import shap
import lime
from lime.lime_tabular import LimeTabularExplainer

# Probability Calibration
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss
from sklearn.calibration import CalibratedClassifierCV



In [None]:
# Let's load the data
df = pd.read_csv("/kaggle/input/heart-failure-prediction/heart.csv")

# Exploratory Data Analysis (EDA)


<img src="https://www.researchgate.net/profile/Mahmoud_Elansary2/publication/352546274/figure/fig4/AS:1036518353289217@1624136641643/Exploratory-Data-Analysis-EDA-steps-source-7.png">

***Alright, folks, it’s time to roll up our sleeves and dive into Exploratory Data Analysis (EDA)! Think of EDA as the detective work we do to get familiar with our dataset. It’s all about understanding what’s going on with the data—finding patterns, spotting any weird anomalies, and identifying those all-important features. Basically, EDA helps us get the lay of the land so we can choose the best model and perform deeper analyses later on.***

Here’s the game plan for our EDA:

Data Identification: First things first, we need to get to know our data. What kind of data are we dealing with? What do the features represent? Let’s answer these questions before we dive deeper.

Data Cleaning: Next, we clean up our data. Think of this like tidying up your room—getting rid of anything that doesn’t belong, filling in the gaps, and making sure everything is in the right place.

Data Visualization: Now, we make our data speak visually. We’ll create charts, graphs, and plots to see if there are any obvious patterns or trends.

Data Transformation: Sometimes, our data needs a little makeover. This could mean normalizing values, encoding categories, or other transformations that make our data easier to work with.

Feature Analysis: Here, we dig into each feature to understand its impact. Are some features more important than others? We’ll find out!

Correlation Analysis: Lastly, we check out how our features relate to each other. Are there strong correlations we should be aware of? This step helps us understand these relationships better.

Alright, let's kick things off with Data Identification and see what we’re working with!

## Preview

`info()` shows how complete each feature is (non-null counts), how many rows the dataset has, and the data type of each column.

Let's start with a short preview.

In [None]:
# Let's check the column types, non-null counts and memory usage of the dataset
df.info()


Every column has the same non-null count, so we don't have any null values.

In [None]:
df.head()
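
To back up the "no null values" reading of `info()`, a direct count is a quick cross-check. The sketch below runs on a tiny made-up frame (`toy` and its values are assumptions for illustration, not the heart dataset); the same two calls work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up
toy = pd.DataFrame({
    "Age": [40, 49, np.nan, 54],
    "Sex": ["M", "F", "M", None],
    "MaxHR": [172, 156, 98, 122],
})

# Missing values per column (info() reports the complement: non-null counts)
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of fully duplicated rows
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)
```

A column whose count is above zero needs one of the missing-data treatments discussed in the next section.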

## Dataset Properties

The features of our dataset cover demographic data and clinical factors related to heart disease. Let's examine the unique values and distributions of each and then move on to preparation.

* Age: age of the patient [years]
* Sex: sex of the patient [M: Male, F: Female]
* ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* RestingBP: resting blood pressure [mm Hg]
* Cholesterol: serum cholesterol [mg/dl]
* FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
* ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
* Oldpeak: ST depression induced by exercise relative to rest [Numeric value measured in depression]
* ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* HeartDisease: output class [1: heart disease, 0: Normal]

**Dividing the work into pieces makes it easier to manage, so we will evaluate categorical and numerical features separately. For the categorical features we will check distributions and cardinality; for the numerical features we will check distributions and outliers.**

## Cardinality and Missing Data Checks


**Handling Missing Values:**

Missing data can create problems in the analysis. These data can be filled in, removed, or processed according to a specific method.

**Removing Inconsistencies:**

Inconsistencies in the dataset are often caused by data entry errors and can distort the analysis. For example, entering the same value in different formats (such as "High School" and "HS") leads to inconsistencies. Such inconsistencies should be eliminated and the data standardized.

**Correcting Incorrect Data Types:**

The types of data in the columns (numeric, categorical, date, etc.) may be incorrect. These incorrect types should be converted to the correct data type.

For example, if a numeric column is defined as text, it should be converted to numeric values.

In addition, we need to convert every feature to numeric values before building our model, because most algorithms cannot work directly with text.
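
Here is a minimal sketch of the last two fixes: standardizing inconsistent labels and converting a text column to numbers. The frame `raw`, its column names, and the `aliases` mapping are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical survey data: inconsistent labels and a numeric column stored as text
raw = pd.DataFrame({
    "education": ["High School", "HS", "hs ", "University"],
    "income": ["42000", "51000", "n/a", "38000"],
})

# Standardize labels: trim whitespace, lowercase, then map known aliases to one spelling
aliases = {"hs": "High School", "high school": "High School", "university": "University"}
raw["education"] = raw["education"].str.strip().str.lower().map(aliases)

# Convert text to numbers; entries that cannot be parsed become NaN instead of raising
raw["income"] = pd.to_numeric(raw["income"], errors="coerce")

print(raw)
```

Note that `map` turns any label missing from `aliases` into NaN, which makes unexpected spellings easy to spot.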

**Detecting and Correcting Outliers:**

Outliers are observations that differ significantly from the rest of the dataset. These values may be due to incorrect measurement, data entry errors, or unexpected events.

Outliers should be handled carefully as they may cause misleading results in the analysis.
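
One common way to flag such values is the interquartile range (IQR) rule. The sketch below is only an illustration on made-up readings; it is not the method this notebook uses later (that one is Isolation Forest), and the 1.5 multiplier is the conventional default rather than a tuned value.

```python
import pandas as pd

# Hypothetical resting blood pressure readings; 300 is an implausible entry
values = pd.Series([118, 120, 122, 125, 130, 132, 135, 300])

# Quartiles and the interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())
```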

**Cardinality analysis:**

Cardinality is the number of unique values in a categorical feature. High-cardinality data is categorical data with a very large number of unique values, such as customer IDs or IP addresses. High cardinality causes several problems:

**It May Affect the Performance of Models:**

Since high-cardinality features contain a large number of unique categories, they are difficult to use directly in modeling. High cardinality may cause the model to memorize the training data (overfitting).

**Consumes Memory and Computational Power:**

Encoding a high-cardinality feature (for example with one-hot encoding) creates large data matrices. This increases memory usage and computation time.

**Noise and Meaninglessness:**

Many categories in a high-cardinality feature appear only a handful of times, so they mostly add noise and carry little signal.

In [None]:
# Columns stored as text (object) or booleans are treated as categorical
categorical_features = df.select_dtypes(include=['object', 'bool']).columns.tolist()

pd.DataFrame({  
              'cardinality': df[categorical_features].nunique(),
             })

We need to choose a cardinality threshold. If a feature exceeds it, we can either drop the feature or apply rare encoding, which merges infrequent categories into a single "Rare" category and so reduces both cardinality and the risk of overfitting. The threshold values we use as a guide:
* For small data sets: 5-10 unique values
* For medium-sized data sets: 10-50 unique values
* For large data sets: 50-100 unique values

We did not detect any high-cardinality features or null values in the categorical variables, which simplifies model development.
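
Our dataset does not need it, but here is a rough sketch of what rare encoding could look like. The helper `rare_encode`, its `min_share` and `other_label` parameters, and the city data are all hypothetical names made up for this illustration.

```python
import pandas as pd

def rare_encode(series, min_share=0.05, other_label="Rare"):
    """Replace categories whose relative frequency is below min_share with other_label."""
    shares = series.value_counts(normalize=True)
    rare_categories = shares[shares < min_share].index
    # Keep frequent categories as they are, overwrite the rare ones
    return series.where(~series.isin(rare_categories), other_label)

# Hypothetical city column: two frequent values and three that appear only once
cities = pd.Series(["Ankara"] * 50 + ["Izmir"] * 47 + ["Bolu", "Rize", "Sinop"])
encoded = rare_encode(cities, min_share=0.05)

print(encoded.value_counts())
```

The five original categories collapse to three, so a later one-hot encoding step would create fewer and better-populated columns.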

Now let's visualize the distributions of categorical data and perform analysis for the next steps.

## Categorical Columns

### Visualize

In [None]:
# We visualize each categorical variable with HeartDisease
plt.figure(figsize=(20, 15))

for i, feature in enumerate(categorical_features, 1):
    plt.subplot(3, 3, i)
    sns.countplot(data=df, x=feature, hue='HeartDisease')
    plt.title(f'{feature} vs HeartDisease')
    plt.xlabel(feature)
    plt.ylabel('Count')

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8, 8))
df['HeartDisease'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title('Distribution of HeartDisease')
plt.ylabel('')  
plt.show()

In [None]:

# Let's examine the relationship between each categorical variable and HeartDisease.
categorical_analysis = {}

for feature in categorical_features:
    cross_tab = pd.crosstab(df[feature], df['HeartDisease'], normalize='index')
    categorical_analysis[feature] = cross_tab


# Let's convert the analysis results of categorical variables into a DataFrame
categorical_analysis_df = pd.concat(categorical_analysis.values(), keys=categorical_analysis.keys())
categorical_analysis_df


The categorical distributions are neither clearly good nor clearly bad; the small size of the dataset is one reason for this. Some categories are unevenly represented, which can lower model accuracy or introduce bias later on. Class weighting, sampling techniques, or more advanced modeling methods can help with this, so keep it in mind.
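
As a small illustration of class weighting, scikit-learn can derive "balanced" weights from the label counts. The label vector `y` below is made up (80 negatives, 20 positives) and is deliberately more imbalanced than our dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 80 negatives, 20 positives
y = np.array([0] * 80 + [1] * 20)

# "balanced" weights are n_samples / (n_classes * count_of_class)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

The resulting dictionary can be passed as `class_weight=` to scikit-learn estimators that support it, such as `RandomForestClassifier` or `LogisticRegression`.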

##### Sex:

* 26% of women have heart disease.
* 63% of men have heart disease.

##### ChestPainType:

* 79% of those who are asymptomatic (ASY) have heart disease.
* 14% of those with atypical angina (ATA) have heart disease.
* 35% of those with non-anginal pain (NAP) have heart disease.
* 43% of those with typical angina (TA) have heart disease.

##### FastingBS (Fasting Blood Sugar):

* 48% of patients with fasting blood sugar at or below 120 mg/dl have heart disease.
* 79% of patients with fasting blood sugar above 120 mg/dl have heart disease.

##### RestingECG (Resting ECG Results):

* 56% of patients with left ventricular hypertrophy (LVH) have heart disease.
* 52% of patients with normal ECG results have heart disease.
* 66% of patients with ST-T wave abnormalities have heart disease.

##### ExerciseAngina (Exercise-Triggered Angina):

* 35% of patients without exercise-induced angina have heart disease.
* 85% of patients with exercise-induced angina have heart disease.

##### ST_Slope (ST Segment Slope):

* 78% of patients with a downsloping ST segment have heart disease.
* 83% of patients with a flat ST segment have heart disease.
* 20% of patients with an upsloping ST segment have heart disease.

After obtaining the analysis, let's move on to our numerical features.

## Numeric Columns

In [None]:
# Let's select the numeric columns. HeartDisease (the 0/1 target) is kept on purpose:
# the correlation matrix and the pairplot below both need it.
numeric_features = df.select_dtypes(include='number').columns.tolist()

# Let's calculate the correlation between numerical features
correlation_matrix = df[numeric_features].corr()

# Let's create a heatmap for the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.show()




First of all, looking at the correlation between our numerical features will tell us a lot.

Generally speaking, there is no strong multicollinearity: no pair of numerical features is highly correlated. That is good for modeling, because high correlation between features reduces the stability and interpretability of a model.
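
If you want a number instead of eyeballing the heatmap, the variance inflation factor (VIF) is a common check. The sketch below uses synthetic features (`x1`, `x2`, `x3` are made up; `x2` is built to be nearly a copy of `x1`) and relies on the fact that the diagonal of the inverse correlation matrix equals each feature's VIF.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features: x2 is nearly a copy of x1, x3 is independent of both
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
features = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# The diagonal of the inverse correlation matrix gives each feature's VIF
corr = features.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns)

print(vif.round(1))
```

A VIF close to 1 means a feature is not explained by the others; values above roughly 5 to 10 are often treated as a warning sign.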


There is a positive correlation between Age and HeartDisease (0.28). In other words, the risk of heart disease increases as age increases.

There is a weak positive correlation between RestingBP and HeartDisease (0.11).

There is a weak negative correlation between Cholesterol and HeartDisease (-0.23).

There is a moderate negative correlation between MaxHR and HeartDisease (-0.40). This indicates that individuals with higher maximum heart rates may have a lower risk of heart disease.

There is a moderate positive correlation between Oldpeak and HeartDisease (0.40). This indicates that the risk of heart disease may increase if ST segment depression is higher.

In [None]:

# Let's visualize these numerical features with histograms

plt.figure(figsize=(15, 10))
for i, feature in enumerate(numeric_features, 1):
    plt.subplot(3, 3, i)
    sns.histplot(df[feature], kde=True, bins=30)
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


In [None]:
# Pairplot to visualize relationships between numerical features
sns.pairplot(df[numeric_features], hue="HeartDisease", diag_kind="kde")
plt.show()

### Cholesterol Analysis

Looking at the scatter plot, we can see that the cholesterol data is spread out over a wide range, but there’s an interesting peak near zero. This peak might suggest some unrealistic values in our dataset—after all, having a cholesterol level near zero isn’t possible!


When we check out the boxplot, it seems like there’s not a huge difference in cholesterol levels between individuals with and without heart disease. However, those zero values stand out and could be considered abnormal.


* The average cholesterol value is 198.8 mg/dl, which is within a reasonable range for most people.
* However, we have a minimum value reported as 0, which is not realistic for cholesterol levels. This could indicate errors or missing data in the dataset.

Given these findings, it’s likely that we’ll treat the zero or abnormally low cholesterol values as incorrect or missing data. We’ll need to clean up our dataset by addressing these inaccuracies to ensure our analysis is based on reliable data.

### Oldpeak Analysis

For the Oldpeak data, the scatter plot shows that most values are clustered between 0 and 1.5, but there are some that fall outside this range, indicating a broader distribution.


The boxplot reveals a significant difference in Oldpeak values between individuals with and without heart disease. Generally, those with heart disease tend to have higher Oldpeak values.



* The mean Oldpeak value is 0.887, with a standard deviation of 1.067, showing some variability in the data.
* The minimum Oldpeak value is reported as -2.6. Since Oldpeak is recorded as ST depression, we treat negative values as implausible and handle them as errors.
* The maximum value is 6.2, which is quite high. This might be an outlier, so we’ll need to investigate further in our visualizations.

We’ll need to dig deeper into these unusual values, especially the negative and very high Oldpeak figures, to determine if they’re errors or valid outliers. By refining our dataset, we can improve the accuracy of our analysis and ensure our models are robust.

In [None]:
# Let's show the relationships between numerical and categorical features with violin plot

plt.figure(figsize=(20, 15))

for i, numeric_feature in enumerate(numeric_features):
    for j, categorical_feature in enumerate(categorical_features):
        plt.subplot(len(numeric_features), len(categorical_features), i * len(categorical_features) + j + 1)
        sns.violinplot(x=categorical_feature, y=numeric_feature, data=df)
        plt.title(f'{numeric_feature} vs {categorical_feature}')

plt.tight_layout()
plt.show()


**Exploring Relationships Between Features**
Let’s dive deeper into our data to explore how different features are related. Understanding these relationships can really help us make smarter decisions when building our model. Here’s what we’ve discovered so far:

**Relationship Between Sex and Numerical Features:**

***Age:*** While age itself doesn’t directly depend on gender, combining these two can reveal some interesting patterns. For instance, the risk of heart disease might vary across age groups differently for men and women. It’s definitely something worth looking into!

***Maximum Heart Rate (MaxHR):*** We might notice that maximum heart rate differs between genders. This could suggest that when building our model, we might need to consider gender as a segment to account for these differences effectively.

**Relationship Between Chest Pain Type and Numerical Features:**

***Cholesterol and Resting Blood Pressure (RestingBP):*** There appear to be noticeable differences in cholesterol levels and blood pressure depending on the type of chest pain someone experiences. These differences could be crucial for assessing the risk of heart disease, making them important features to focus on.

***Oldpeak:*** Oldpeak, which refers to ST segment depression after exercise, can vary based on the type of chest pain. Understanding these variations is super helpful for predicting heart disease in our models, as it provides deeper insights into the patient’s condition.



**Relationship Between Exercise-Induced Angina and Numerical Features:**

***MaxHR and Oldpeak:*** Exercise-induced angina seems closely connected with both maximum heart rate and post-exercise ST segment depression (Oldpeak). These relationships are key for assessing the risk of heart disease progression, making them valuable features for our model.

**Relationship Between Fasting Blood Sugar (FastingBS) and Numerical Features:**

***Cholesterol and RestingBP:*** Individuals with high fasting blood sugar often have different cholesterol and blood pressure levels. This information is vital for understanding the link between diabetes and heart disease, which could help refine our model.

**Relationship Between Heart Disease and Numerical Features:**

***Age, MaxHR, Cholesterol, RestingBP, Oldpeak:*** These features show quite a bit of variation between those with heart disease and those without. Because of this, they’re likely to be key predictors in our model and should be considered carefully.

### Useful Inferences for Next Steps

**Segmentation and Grouping:** By segmenting our data based on categorical variables like gender, chest pain type, and exercise-induced angina, we can get a clearer picture of the dataset. Treating these segments separately might lead to better modeling outcomes and more accurate predictions.

**Feature Selection:** Observing how numerical features like MaxHR, Cholesterol, RestingBP, and Oldpeak change across different categories can guide us in selecting the most relevant features for our model. This step is crucial for building a strong, predictive model that accurately reflects the underlying data.

Let’s use these insights to refine our approach and create the most effective model possible!

In [None]:
# First, let's compute the correlation matrix so we can see which columns are most correlated with RestingBP and Cholesterol
correlation_matrix = df[numeric_features].corr()

# Let's examine the correlations of the RestingBP, Cholesterol and Oldpeak columns with the other columns
restingbp_corr = correlation_matrix['RestingBP'].sort_values(ascending=False)
cholesterol_corr = correlation_matrix['Cholesterol'].sort_values(ascending=False)
Oldpeak_corr = correlation_matrix['Oldpeak'].sort_values(ascending=False)
restingbp_corr, cholesterol_corr, Oldpeak_corr


In [None]:
# Zeros in RestingBP and Cholesterol are not physiologically possible, so treat them as missing.
# Cast to float first so that (possibly fractional) medians can be stored, and so that
# the zeros no longer drag the group medians down.
for col in ['RestingBP', 'Cholesterol']:
    df[col] = df[col].astype(float).replace(0, np.nan)
imputer = SimpleImputer(strategy='median')

# For RestingBP: fill missing values with the median of rows that share the same Age and Oldpeak
df['RestingBP'] = df['RestingBP'].fillna(
    df.groupby(['Age', 'Oldpeak'])['RestingBP'].transform('median')
)
# Fall back to the overall median if the group had no valid value
df['RestingBP'] = df['RestingBP'].fillna(df['RestingBP'].median())

# For Cholesterol: fill missing values with the median of rows that share the same MaxHR and RestingBP
df['Cholesterol'] = df['Cholesterol'].fillna(
    df.groupby(['MaxHR', 'RestingBP'])['Cholesterol'].transform('median')
)

# Make values less than 0 NaN in Oldpeak, then fill them with the median
df.loc[df['Oldpeak'] < 0, 'Oldpeak'] = np.nan
df['Oldpeak'] = imputer.fit_transform(df[['Oldpeak']]).ravel()

# Fill the Cholesterol values that the group medians could not fill with linear interpolation
# (based on neighbouring rows), then use the overall median for anything still missing
df['Cholesterol'] = df['Cholesterol'].interpolate(method='linear')
df['Cholesterol'] = df['Cholesterol'].fillna(df['Cholesterol'].median())

# Let's check the filled values again
restingbp_summary_after = df['RestingBP'].describe()
cholesterol_summary_after = df['Cholesterol'].describe()
oldpeak_summary_after = df['Oldpeak'].describe()

restingbp_summary_after, cholesterol_summary_after, oldpeak_summary_after



In [None]:
# RestingBP graph - After filling
plt.figure(figsize=(15, 10))
plt.subplot(3, 1, 1)
sns.histplot(df['RestingBP'], kde=True, bins=30, color='blue')
plt.title('RestingBP Distribution (After Imputation)')
plt.xlabel('RestingBP')
plt.ylabel('Frequency')

# Cholesterol chart - After filling
plt.subplot(3, 1, 2)
sns.histplot(df['Cholesterol'], kde=True, bins=30, color='green')
plt.title('Cholesterol Distribution (After Imputation)')
plt.xlabel('Cholesterol')
plt.ylabel('Frequency')

# Oldpeak chart - After filling
plt.subplot(3, 1, 3)
sns.histplot(df['Oldpeak'], kde=True, bins=30, color='orange')
plt.title('Oldpeak Distribution (After Imputation)')
plt.xlabel('Oldpeak')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


# Why Detect Outliers?

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_anomaly_comparison_001.png">


Now let's talk about outliers: quirky data points that don't sit well with the rest. Outlier detection is a key step in data preparation, for several reasons.

Better model performance: Outliers can wreak havoc on the learning process of your model. Consider linear regression. It fits its coefficients by minimizing squared errors, so a single extreme value pulls the fitted line toward itself far more than an ordinary point does. This can lead your model to assign weird, oversized coefficients, reducing its overall accuracy and making its predictions unreliable.

Overfitting: Another problem of outliers is that they can lead to overfitting. If your model fits the unusual data points too well, it ends up performing well on training but failing to generalize to new, unseen data. It's sort of like training for a marathon by running only downhill: you'll nail the downhill, but as soon as you hit a flat or uphill section, you're in trouble. Equally, overfitting due to outliers can lead to a significant drop in performance of your model in the real world.

Detection and handling of outliers can help to make our models more accurate, generalizable, and robust in nature. Let's therefore not skip this important step; it is key to creating reliable machine learning models!
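
To see the linear regression effect in numbers, here is a tiny sketch on made-up data: ten points that lie exactly on a line, then the same points with one corrupted value. Everything here is synthetic and only meant to illustrate the mechanism.

```python
import numpy as np

# Ten points lying exactly on y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Ordinary least-squares fit of a straight line
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Corrupt a single observation with an extreme value
y_outlier = y.copy()
y_outlier[-1] = 100.0

slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print(f"clean slope:   {slope_clean:.2f}")
print(f"outlier slope: {slope_outlier:.2f}")
```

One bad point out of ten is enough to more than triple the fitted slope, which is why we deal with outliers before training.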

In [None]:


# Set up the visualizations for each column with potential outliers (2x2 grid, one panel per column)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 10))

# Plotting each potential outlier column
sns.boxplot(x=df['Cholesterol'], ax=axes[0, 0]).set_title('Cholesterol')
sns.boxplot(x=df['MaxHR'], ax=axes[0, 1]).set_title('MaxHR')
sns.boxplot(x=df['Oldpeak'], ax=axes[1, 0]).set_title('Oldpeak')
sns.boxplot(x=df['RestingBP'], ax=axes[1, 1]).set_title('RestingBP')

plt.tight_layout()
plt.show()


In [None]:
# Creating age groups
df['AgeGroup'] = pd.cut(df['Age'], bins=[20, 40, 60, 80], labels=['20-40', '40-60', '60-80'])

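# Outlier detection and replacement function
def replace_outliers_corrected(df, group_column, target_columns):
    """Within each group, flag outliers per column with IsolationForest and replace them with the group median."""
    for col in target_columns:
        # observed=True skips empty categories, which would make IsolationForest.fit fail
        for group_name, group in df.groupby(group_column, observed=True):
            # contamination=0.05 flags roughly 5% of each group as outliers
            iso_forest = IsolationForest(contamination=0.05, random_state=42)
            outliers = iso_forest.fit_predict(group[[col]])
            median_value = group[col].median()
            df.loc[group.index[outliers == -1], col] = median_value
    return df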

# Let's specify the relevant columns
outlier_columns = ['RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']

# Substituting outliers within groups
df = replace_outliers_corrected(df, 'AgeGroup', outlier_columns)

df.drop("AgeGroup",axis=1,inplace= True)

# Feature Engineering 


<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*kCuNfifKaF-qhpwwKrESZQ.png">

**Feature Engineering: The Secret Sauce of Machine Learning**

Feature engineering is like adding secret ingredients to your grandmother's famous cookie recipe: it's all about creating something extraordinary from the ordinary. It is the art of crafting new, predictive features from raw data in a way that boosts the learning capacity of your model and leads to better results.

In [None]:
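# Creating interaction terms (kept commented out: Sex, ChestPainType and ExerciseAngina
# are still text at this point, so they must be encoded as numbers before they can be multiplied)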

# df['Age_Sex'] = df['Age'] * df['Sex']
# df['ChestPain_MaxHR'] = df['ChestPainType'] * df['MaxHR']
# df['Oldpeak_ExerciseAngina'] = df['Oldpeak'] * df['ExerciseAngina']

# df.drop(labels=['Age','Sex','ChestPainType','MaxHR','Oldpeak','ExerciseAngina'],axis=1,inplace = True)
# Display the updated dataframe with new interaction terms
# print(df.head())


**Not Always Needed, But Often Beneficial:** In some rare cases, a dataset might be perfectly clean and complete—like finding all the ingredients for your favorite recipe prepped and ready. But most of the time, a little feature engineering goes a long way.


**Model Complexity:** Including interaction terms can increase model complexity. If interaction terms do not contribute significantly to model performance (e.g., improve accuracy, recall, precision, etc.), it may be better to remove them to avoid overfitting and simplify the model.

**Interpretability:** Depending on the model, interaction terms can make it difficult to interpret results. If interpretability is important, you may want to avoid using these terms unless they provide significant predictive value.

**Model Performance Evaluation:**

Evaluate the performance of the model on the validation set with and without these interaction terms. If the interaction terms significantly improve the model, keep them; otherwise, consider removing them.
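
The comparison described above could look roughly like the sketch below. It is an illustration under loud assumptions: the data is synthetic (`age` and `sex` are made-up stand-ins, with `sex` already encoded as 0/1), the label is constructed so that an interaction really exists, and logistic regression with cross-validated ROC AUC is just one reasonable choice of model and metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 600

# Synthetic stand-ins for an age-like feature and an already-encoded 0/1 feature
data = pd.DataFrame({
    "age": rng.normal(55, 10, n),
    "sex": rng.integers(0, 2, n),
})

# By construction, the label depends on age only when sex == 1 (a pure interaction effect)
logit = 0.15 * (data["age"] - 55) * data["sex"]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Same features plus the interaction term
with_interaction = data.assign(age_sex=data["age"] * data["sex"])

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
score_base = cross_val_score(model, data, y, cv=5, scoring="roc_auc").mean()
score_inter = cross_val_score(model, with_interaction, y, cv=5, scoring="roc_auc").mean()

print(f"AUC without interaction: {score_base:.3f}")
print(f"AUC with interaction:    {score_inter:.3f}")
```

If the second score is not clearly better on your real data, the simpler feature set is usually the safer choice.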


# One Hot Encoding

<img src="https://www.researchgate.net/profile/Fatemeh-Davoudi-Kakhki/publication/344409939/figure/fig1/AS:940907041918978@1601341128930/An-example-of-one-hot-encoding.png">

**What is One-Hot Encoding?**

Let’s talk about One-Hot Encoding—a super handy technique when you're working with categorical data in machine learning. Categorical data is basically data that falls into specific categories or classes, like colors (red, blue, green) or types of fruit (apple, banana, cherry). The problem is, machine learning models don’t really know what to do with words or labels—they need numbers to crunch!

That’s where One-Hot Encoding comes in. It’s a way to convert these categories into a numerical format that our models can understand. Here’s how it works: One-Hot Encoding creates new columns (or features) for each category, filling them with binary values—0s and 1s. Each row in your data will have a ‘1’ in the column of the category it belongs to and ‘0’s in all other category columns.
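
Here is the idea on a toy column before we apply it to our dataset. The `color` data is made up; `pd.get_dummies` is the same pandas function used in the cell that follows.

```python
import pandas as pd

# Hypothetical toy column with three categories
toy = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One new column per category; each row has a single 1
one_hot = pd.get_dummies(toy, columns=["color"], dtype="int64")
print(one_hot)

# drop_first=True drops the first category, which becomes the all-zero baseline
one_hot_reduced = pd.get_dummies(toy, columns=["color"], dtype="int64", drop_first=True)
print(one_hot_reduced.columns.tolist())
```

With `drop_first=True`, a row of all zeros means "the dropped category", so no information is lost and the encoded columns are no longer perfectly redundant.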



In [None]:

# Let's encode the categorical variables with one-hot encoding
df_encoded = pd.get_dummies(df, columns=categorical_features, dtype='int64', drop_first=True)

# Let's show the first few rows
df_encoded.head()


# Data Splitting

In [None]:
# Let's separate the target variable from the features
X_tree = df_encoded.drop('HeartDisease', axis=1)
y_tree = df_encoded['HeartDisease']


In [None]:
# Let's split the data into training and test sets
X_train_tree, X_test_tree, y_train_tree, y_test_tree = train_test_split(X_tree, y_tree, test_size=0.2, random_state=42)

# Let's check the sizes of training and test sets
X_train_tree.shape, X_test_tree.shape, y_train_tree.shape, y_test_tree.shape
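
One optional variation, not used in the split above, is a stratified split, which keeps the share of each class the same in the training and test sets. A minimal sketch on made-up data (`X` and `y` here are hypothetical arrays, not our features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 30% positive class
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_train.mean(), y_test.mean())
```

This matters most for small or imbalanced datasets, where a purely random split can leave the test set with a noticeably different class mix.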

# Non-Normalized Model


<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/0*U0rcW7XrdHpvI0hU.jpeg">

**Machine Learning Algorithms That Don’t Sweat the Small Stuff: Normalization and Standardization**

When working with machine learning, you often hear about the need to normalize or standardize your data—basically, making sure all your features are on a similar scale. This is super important for many algorithms that are sensitive to the scale of data. But guess what? Not all algorithms need this kind of preprocessing! Some algorithms are scale-independent and work just fine without normalization or standardization. Let’s look at a few of these:

***Decision Trees:***

Decision trees are like those people who don't care about the fancy stuff—they’re straightforward and to the point. They split the data based on feature values, regardless of the scale. Whether your features range from 1 to 10 or 1 to 10,000, it doesn’t matter. Decision trees operate independently of feature scales, so you can skip normalization and standardization here.

***Random Forests:***

Random forests are basically a bunch of decision trees working together, like a team of independent thinkers. Since each tree in a random forest handles features just like a single decision tree, this algorithm is also scale-independent. No need to worry about the scales of your features with random forests—they don’t mind.

***Naive Bayes:***

Naive Bayes takes a different approach. It assumes that the features are conditionally independent of one another given the class label, so it models each feature separately. Gaussian Naive Bayes, the variant used here, estimates a per-class mean and variance for every feature, so rescaling a feature rescales those estimates along with it and leaves the predictions essentially unchanged. Whether your features are measured in grams or tons, Naive Bayes will handle them just the same.

***Boosting Algorithms (e.g., AdaBoost, Gradient Boosting, XGBoost, CatBoost):***

Boosting algorithms are like decision trees on steroids—they use an ensemble of trees to improve prediction accuracy. Since these algorithms rely on decision trees, they inherit the same scale independence. So, just like with decision trees and random forests, you don’t need to worry about normalizing or standardizing your data when using boosting algorithms.

So, if you’re working with any of these algorithms, you can save yourself some time and skip the normalization or standardization steps. Just dive right into building your model!
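
To see why tree splits don't care about scale, here is a small sketch of a one-level decision tree (a "stump") written in plain Python. The functions `gini` and `best_split` and the sample numbers are illustrative assumptions, not the scikit-learn implementation. The same data is split twice: once as-is and once multiplied by 1000.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Find the threshold on one feature that minimizes the weighted Gini impurity.

    Returns (threshold, left_mask), where left_mask marks rows with x <= threshold.
    """
    order = sorted(set(xs))
    best = None
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2                       # midpoint between neighboring values
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, thr)
    thr = best[1]
    return thr, [x <= thr for x in xs]

# Hypothetical cholesterol-like values and labels (illustrative only)
x = [180, 195, 210, 240, 260, 300]
y = [0, 0, 0, 1, 1, 1]

thr_raw, mask_raw = best_split(x, y)
thr_scaled, mask_scaled = best_split([v * 1000 for v in x], y)

print(thr_raw, thr_scaled)       # the threshold moves with the scale of the feature...
print(mask_raw == mask_scaled)   # ...but the rows on each side of the split are identical
```

The impurity only depends on which labels end up on each side of the split, so rescaling a feature changes the threshold value but not the resulting partition.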

In [None]:
# Let's define the models that work without feature scaling
models = {
    'NaiveBayes': GaussianNB(),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=42),
    'AdaBoostClassifier': AdaBoostClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'),
    'CatBoostClassifier': CatBoostClassifier(random_state=42, verbose=0)
}



# Model Tuning

<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/0*axPdc1zynCHtW__d">

# Hyperparameters: The Secret Sauce of Model Optimization
Hyperparameters are like the secret ingredients in your favorite recipe—they need to be just right to make your machine learning models perform their best. Unlike regular parameters that the model learns on its own, hyperparameters are set by you and need to be fine-tuned to get the optimal results.

There are plenty of methods out there for hyperparameter optimization, but instead of diving into each one, let’s hit some key points to get you started:

**General Strategy:**

Limited Time and Resources? Try Random Search: If you're working with limited time or computational resources, Random Search is usually a good place to start. It explores the hyperparameter space more broadly and can often find good results without too much effort.

Looking for Something More Sophisticated? Go for Bayesian Optimization or Hyperband: If you're working with complex models and want to get the best results with fewer trials, methods like Bayesian Optimization or Hyperband are great choices. They’re more efficient and can zero in on the optimal hyperparameters faster.

Simple Models? Stick with Grid Search: For simpler models, Grid Search will usually do the trick. It tests every possible combination of hyperparameters, which is great when you don't have too many to consider.
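
The difference between Grid Search and Random Search is easy to sketch in a few lines. Everything here is a toy assumption: `score` stands in for a cross-validated metric, and the search space is made up. Grid Search evaluates every combination, while Random Search evaluates only a fixed number of sampled ones.

```python
import itertools
import random

def score(params):
    """Toy validation score (made up): best at max_depth=5, learning_rate=0.1."""
    return -((params["max_depth"] - 5) ** 2) - 100 * (params["learning_rate"] - 0.1) ** 2

space = {
    "max_depth": [3, 5, 7, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}

def grid_search(space):
    """Evaluate every combination in the space."""
    keys = list(space)
    candidates = [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]
    return max(candidates, key=score), len(candidates)

def random_search(space, n_iter, seed=42):
    """Evaluate only n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]
    return max(candidates, key=score), len(candidates)

best_grid, n_grid = grid_search(space)
best_rand, n_rand = random_search(space, n_iter=5)

print("grid:  ", best_grid, "after", n_grid, "evaluations")
print("random:", best_rand, "after", n_rand, "evaluations")
```

Grid Search is guaranteed to find the best point on the grid, but its cost grows multiplicatively with every hyperparameter you add. Random Search caps the cost at `n_iter` and accepts that it may miss the very best combination.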

Here’s how we’d approach hyperparameter tuning for some specific models:

**Grid Search for Naive Bayes:**

Naive Bayes models typically have a limited set of hyperparameters to tune. Because of this, Grid Search, which systematically tries every combination of hyperparameters, works really well. It’s straightforward, and since there aren't too many combinations to test, it's quick and efficient.

**Random Search for RandomForestClassifier, AdaBoostClassifier, and DecisionTreeClassifier:**

Random forests, AdaBoost, and decision trees come with a lot of hyperparameters (like the number of trees in a forest or the maximum depth of a tree). This can make the search space quite large! Random Search is a great option here because it randomly samples from the hyperparameter space, allowing for a fast and efficient search without needing to test every single combination.

**Hyperband for GradientBoostingClassifier and XGBoost:**

For models like Gradient Boosting and XGBoost, Hyperband is a fantastic choice. These models have a wide range of hyperparameters to tune, and methods like Hyperband can find a good combination more efficiently than a brute-force approach. Hyperband works by allocating resources dynamically and quickly discarding less promising configurations, which saves time and computational power. Note that scikit-learn does not ship Hyperband itself: the code below uses `HalvingRandomSearchCV`, which implements successive halving, the elimination procedure that Hyperband is built on.
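
The idea of discarding weak configurations early can be sketched with successive halving, the elimination loop that Hyperband builds on and that scikit-learn exposes as `HalvingRandomSearchCV`. The candidates, their hidden "quality", and the `evaluate` function below are all made up for illustration; in a real search the budget would be something like the number of training samples.

```python
import random

rng = random.Random(42)
# Hypothetical candidates: each has a hidden true quality and a noise term
candidates = [
    {"id": i, "quality": rng.random(), "noise": rng.uniform(-1, 1)}
    for i in range(27)
]

def evaluate(config, budget):
    """Toy score: the noise shrinks as the budget grows."""
    return config["quality"] + config["noise"] / budget

def successive_halving(candidates, min_budget=1, factor=3):
    """Keep the top 1/factor of the candidates each round while multiplying the budget."""
    survivors, budget, history = list(candidates), min_budget, []
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((len(survivors), budget))
        survivors = ranked[: max(1, len(survivors) // factor)]
        budget *= factor
    return survivors[0], history

winner, history = successive_halving(candidates)
for n, b in history:
    print(f"{n:2d} candidates evaluated with budget {b}")
print("total cost:", sum(n * b for n, b in history))
print("winner id:", winner["id"])
```

With `factor=3`, each round keeps a third of the candidates and triples the budget, so most of the compute is spent on the configurations that looked promising early on. The trade-off is that a configuration that only shines with a large budget can be eliminated too soon.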

By choosing the right hyperparameter optimization strategy for your model and dataset, you can significantly improve your model's performance without wasting resources. So, whether you’re sticking with the basics or trying out more advanced methods, there’s a strategy that’s perfect for your needs!

In [None]:

# Hyperparameter optimization functions

def optimize_naive_bayes(X_train_tree, y_train_tree):
    param_grid_nb = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]}
    grid_search_nb = GridSearchCV(estimator=models['NaiveBayes'], param_grid=param_grid_nb, cv=5, scoring='recall')
    grid_search_nb.fit(X_train_tree, y_train_tree)
    return grid_search_nb.best_estimator_

def optimize_random_forest(X_train_tree, y_train_tree):
    param_dist_rf = {'n_estimators': [100, 200, 500], 'max_depth': [10, 20, 30], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
    random_search_rf = RandomizedSearchCV(estimator=models['RandomForestClassifier'], param_distributions=param_dist_rf, n_iter=50, cv=5, scoring='recall', random_state=42)
    random_search_rf.fit(X_train_tree, y_train_tree)
    return random_search_rf.best_estimator_

def optimize_adaboost(X_train_tree, y_train_tree):
    param_dist_ab = {'n_estimators': [50, 100, 500], 'learning_rate': [0.001, 0.01, 0.1, 1.0]}
    # The grid above has only 3 * 4 = 12 combinations, so n_iter is capped at 12
    random_search_ab = RandomizedSearchCV(estimator=models['AdaBoostClassifier'], param_distributions=param_dist_ab, n_iter=12, cv=5, scoring='recall', random_state=42)
    random_search_ab.fit(X_train_tree, y_train_tree)
    return random_search_ab.best_estimator_

def optimize_decision_tree(X_train_tree, y_train_tree):
    param_dist_dt = {'max_depth': [10, 20, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
    # The grid above has only 3 * 3 * 3 = 27 combinations, so n_iter is capped at 27
    random_search_dt = RandomizedSearchCV(estimator=models['DecisionTreeClassifier'], param_distributions=param_dist_dt, n_iter=27, cv=5, scoring='recall', random_state=42)
    random_search_dt.fit(X_train_tree, y_train_tree)
    return random_search_dt.best_estimator_

def optimize_gradient_boosting(X_train_tree, y_train_tree):
    param_dist_gb = {'n_estimators': [100, 200, 500], 'learning_rate': [0.001, 0.01, 0.1, 1.0], 'max_depth': [3, 5, 7]}
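    # HalvingRandomSearchCV implements successive halving (the core of Hyperband); it is still experimental,
    # so `from sklearn.experimental import enable_halving_search_cv` must run before it is imported
    hyperband_gb = HalvingRandomSearchCV(estimator=models['GradientBoostingClassifier'], param_distributions=param_dist_gb, factor=3, random_state=42, cv=5, scoring='recall')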
    hyperband_gb.fit(X_train_tree, y_train_tree)
    return hyperband_gb.best_estimator_

def optimize_xgboost(X_train_tree, y_train_tree):
    param_dist_xgb = {'n_estimators': [100, 200, 300, 400, 500, 1000], 'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0], 'max_depth': [3, 5, 7, 9, 11],
                      'min_child_weight': [1, 3, 5, 7], 'gamma': [0, 0.1, 0.2, 0.3, 0.4], 'subsample': [0.6, 0.7, 0.8, 0.9, 1.0], 'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
                      'reg_alpha': [0, 0.01, 0.1, 1, 10], 'reg_lambda': [0.01, 0.1, 1, 10, 100]}
    hyperband_xgb = HalvingRandomSearchCV(estimator=models['XGBoost'], param_distributions=param_dist_xgb, factor=3, random_state=42, cv=5, scoring='recall')
    hyperband_xgb.fit(X_train_tree, y_train_tree)
    return hyperband_xgb.best_estimator_

def optimize_catboost(X_train_tree, y_train_tree):
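    # In scikit-optimize, integer bounds define an integer range and float bounds a continuous one,
    # so bagging_temperature uses (0.0, 1.0) to search the whole interval rather than only 0 and 1
    param_grid_cb = {'iterations': (100, 1000), 'depth': (4, 10), 'learning_rate': (0.01, 0.3), 'l2_leaf_reg': (1, 10), 'bagging_temperature': (0.0, 1.0), 'border_count': (32, 255), 'random_strength': (1e-9, 10)}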
    opt_cb = BayesSearchCV(estimator=models['CatBoostClassifier'], search_spaces=param_grid_cb, n_iter=50, cv=5, scoring='recall', verbose=0, random_state=42)
    opt_cb.fit(X_train_tree, y_train_tree)
    return opt_cb.best_estimator_

# Let's optimize the models
models['NaiveBayes'] = optimize_naive_bayes(X_train_tree, y_train_tree)
models['RandomForestClassifier'] = optimize_random_forest(X_train_tree, y_train_tree)
models['AdaBoostClassifier'] = optimize_adaboost(X_train_tree, y_train_tree)
models['DecisionTreeClassifier'] = optimize_decision_tree(X_train_tree, y_train_tree)
models['GradientBoostingClassifier'] = optimize_gradient_boosting(X_train_tree, y_train_tree)
models['XGBoost'] = optimize_xgboost(X_train_tree, y_train_tree)
models['CatBoostClassifier'] = optimize_catboost(X_train_tree, y_train_tree)

# You can print optimized models or save them for later use


In [None]:
models.items()

# Model Evaluation and Improvement
<img src="https://media.licdn.com/dms/image/D5612AQGKoVQ8Xhnjzg/article-cover_image-shrink_600_2000/0/1690913948713?e=2147483647&v=beta&t=Ou7PqmMdh9aYgSSD-smMzhB2HCJX1UESmqP8Z9ly5no">



## Understanding the Classification Report

When working on classification problems in machine learning and deep learning, the classification report is a handy tool for evaluating how well your model is doing. This report gives you a detailed breakdown of the model's performance across different classes using various metrics. Let's go through the key components of the classification report:

**Precision:**

Precision tells us how many of the examples that the model predicted as positive were actually positive. In other words, it shows how accurate the positive predictions are. A high precision score means that the model is good at minimizing false positives—meaning it rarely predicts something as positive when it’s not.

**Recall:**

Recall, also known as sensitivity, measures how many of the actual positive examples were correctly predicted by the model. A high recall score indicates that the model is good at capturing all the actual positives, producing few false negatives—meaning it doesn’t miss many positive cases.

**F1-Score:**

The F1-score provides a balance between precision and recall. It’s the harmonic mean of precision and recall, making it especially useful when you’re dealing with imbalanced classes. If one class is much more common than the other, the F1-score helps ensure that both precision and recall are being considered equally, giving you a better sense of the model's overall performance.
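
These three metrics are simple ratios of confusion-matrix counts. As a minimal sketch (the function name and the toy labels are made up for illustration), they can be computed by hand like this:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 4 sick patients, 6 healthy ones (illustrative only)
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the model finds 3 of the 4 sick patients (recall 0.75) but also raises 2 false alarms, so only 3 of its 5 positive predictions are right (precision 0.60).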





## Recall (Sensitivity / True Positive Rate)

*Why is it used?*
Recall, also known as sensitivity or the true positive rate, is crucial in scenarios where missing a positive case could have serious consequences. Take health-related issues, like cancer detection, for example. In these situations, a false negative—where the model incorrectly predicts a sick patient as healthy—can be extremely dangerous. It means a patient might miss out on early treatment opportunities, which could be life-threatening.

*Why is Recall Important in These Cases?*
Recall measures how many of the actual positive cases (like patients with cancer) are correctly identified by the model. A high recall means that the model is good at catching nearly all the positive cases, which is exactly what you want in critical health diagnoses. The goal is to minimize the chances of false negatives so that fewer patients are wrongly considered healthy when they are, in fact, ill.

*Where Is It Especially Used?*
Recall is particularly important in fields like healthcare, where early and accurate diagnosis is vital. For conditions like cancer, where timely detection can significantly impact treatment outcomes, prioritizing recall helps ensure that as many true positives as possible are caught by the model.

In [None]:

# Lists to store results
recall_scores_test = []
recall_scores_train = []
model_names = []

# Get recall scores for each model
for name, model in models.items():
    
    # Make a prediction (Using trained models in hyperparameter optimization)
    y_pred_tree = model.predict(X_test_tree)
    y_pred_train_tree = model.predict(X_train_tree)
    
    # Calculate and save recall scores
    recall_test = recall_score(y_test_tree, y_pred_tree, average='binary')
    recall_train = recall_score(y_train_tree, y_pred_train_tree, average='binary')
    
    recall_scores_test.append(recall_test)
    recall_scores_train.append(recall_train)
    model_names.append(name)

# Let's visualize the recall scores on the same graph
plt.figure(figsize=(12, 6))

plt.plot(model_names, recall_scores_train, marker='o', linestyle='-', color='b', label='Train Recall')
plt.plot(model_names, recall_scores_test, marker='o', linestyle='-', color='r', label='Test Recall')

plt.xlabel('Models')
plt.ylabel('Recall')
plt.title('Train and Test Recall for Different Models')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()


* Naive Bayes: The recall gap between the training and test sets is relatively small, so there is no sign of overfitting (memorization).

* Random Forest: Recall on the training set is very high (almost 1.0) but lower on the test set. The model has overfitted.

* Decision Tree: Similarly, recall is high on the training set but drops on the test set. The model has overfitted.

* Gradient Boosting: Recall is perfect (1.0) on the training set, but performance on the test set is very low by comparison. The model has overfitted.

* AdaBoost: The gap between training and test recall is more balanced. However, since training performance is still higher, it has overfitted slightly.

* XGBoost: Recall is similar on the training and test sets, so it has not overfitted.

* CatBoost: Likewise, recall is close on the training and test sets, so it has not overfitted.

In [None]:
comparison_data = []

for name, model in models.items():
        
    # Make predictions
    y_pred_tree = model.predict(X_test_tree)
    y_pred_train_tree = model.predict(X_train_tree)
    
    # Get classification report
    report_test = classification_report(y_test_tree, y_pred_tree, output_dict=True)
    report_train = classification_report(y_train_tree, y_pred_train_tree, output_dict=True)
    
    # Let's tabulate the classification report values for each class
    for label in report_test.keys():
        if label not in ["accuracy", "macro avg", "weighted avg"]:
            comparison_data.append({
                'Model': name,
                'DataSet': 'Test',
                'Label': label,
                'Precision': report_test[label]['precision'],
                'Recall': report_test[label]['recall'],
                'F1-Score': report_test[label]['f1-score'],
                'Accuracy': report_test['accuracy']  # add the overall accuracy
            })
            comparison_data.append({
                'Model': name,
                'DataSet': 'Train',
                'Label': label,
                'Precision': report_train[label]['precision'],
                'Recall': report_train[label]['recall'],
                'F1-Score': report_train[label]['f1-score'],
                'Accuracy': report_train['accuracy']  # add the overall accuracy
            })

# Let's convert the results to a DataFrame
comparison_df = pd.DataFrame(comparison_data)



# Let's view the DataFrame
comparison_df


In [None]:
# Compare recall for the positive class only (assumed to be label '1', heart disease) and skip
# perfect scores of 1.0, which here point to memorization rather than a genuinely better model
def recall_key(dataset):
    return lambda x: x['Recall'] if (x['DataSet'] == dataset and x['Label'] == '1' and x['Recall'] < 1) else 0

best_model_recall_test = max(comparison_data, key=recall_key('Test'))

print("The best model based on test recall is:", best_model_recall_test['Model'])
print("Recall:", best_model_recall_test['Recall'])

best_model_recall_train = max(comparison_data, key=recall_key('Train'))

print("The best model based on train recall is:", best_model_recall_train['Model'])
print("Recall:", best_model_recall_train['Recall'])


# Fairness Evaluation

<img src="https://dsp700.wordpress.com/wp-content/uploads/2012/02/fair-selection.jpg?w=584">


### What is Fairness in Machine Learning?
Fairness in machine learning is all about ensuring that your model doesn’t produce biased results across different demographic groups, such as race, gender, or age. A fair model should provide equal performance and treatment for all groups, avoiding any form of discriminatory stereotypes. In other words, fairness means that your model’s decisions are just and unbiased, regardless of who it’s making predictions about.

**Why Is Fairness Important?**

* ***Ethical Considerations:***

Fairness in machine learning isn’t just a technical goal—it’s an ethical one. Unfair models can perpetuate and even amplify social biases, leading to discriminatory practices and unfair treatment of certain groups. This is why it’s crucial to ensure that our models don’t reinforce harmful stereotypes or biases that exist in the data.

**When Should You Evaluate Fairness?**

* ***High-Impact Decisions:*** 
If your model is being used in scenarios where the stakes are high—like hiring decisions, loan approvals, law enforcement, or healthcare—it’s essential to evaluate fairness. These are situations where biased outcomes can have serious consequences for people’s lives.

* ***Sensitive Attributes in Your Dataset:***
Whenever your dataset includes sensitive attributes (like race, gender, or age) that could lead to biased results, you need to be vigilant about fairness. It’s important to check whether these attributes are causing the model to make unfair predictions.

* ***Deployment in Diverse Environments:***
If your model is going to be used in various environments where fairness matters across different groups, it’s crucial to ensure it performs fairly for everyone.

**What to Consider When Choosing an Algorithm:**

* ***Some Algorithms Are More Prone to Bias:***
Not all algorithms are created equal when it comes to fairness. For example, simpler models like decision trees or linear models can easily pick up on biases present in the training data, which can lead to biased predictions. On the other hand, more complex models like neural networks might learn subtle patterns that include biased behaviors, even if they aren’t immediately obvious.

* ***Interpretable Models for Fairness:***
Using interpretable models, such as decision trees or logistic regression, can make it easier to spot and address biases. These models are more transparent, allowing you to understand how decisions are being made and identify any potential issues. In contrast, complex models like deep neural networks are more of a black box and can be harder to interpret, making it challenging to ensure fairness.



Understanding both model robustness and fairness is crucial for developing reliable and ethical machine learning systems. By considering these factors, we can ensure our models perform consistently in the real world and make fair decisions for everyone—especially in scenarios where the impact on individuals is significant.
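
Two common ways to put a number on fairness are the demographic parity difference (the gap in the rate of positive predictions between groups) and the equalized odds difference (the larger of the gaps in true positive rate and false positive rate). The sketch below computes both by hand on made-up data. The helper names are hypothetical, and it is meant to mirror the idea behind the Fairlearn metrics of the same names rather than reproduce that library's implementation.

```python
def rate(flags):
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0

def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true positive rate, and false positive rate."""
    out = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, s in zip(y_true, y_pred, groups) if s == g]
        out[g] = {
            "selection_rate": rate([p == 1 for _, p in rows]),
            "tpr": rate([p == 1 for t, p in rows if t == 1]),
            "fpr": rate([p == 1 for t, p in rows if t == 0]),
        }
    return out

def spread(rates, key):
    """Largest minus smallest value of one metric across the groups."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

# Illustrative labels, predictions, and a binary sensitive feature
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sex_m  = [1, 1, 1, 1, 0, 0, 0, 0]

rates = group_rates(y_true, y_pred, sex_m)
dp_diff = spread(rates, "selection_rate")
eo_diff = max(spread(rates, "tpr"), spread(rates, "fpr"))

print(rates)
print("Demographic parity difference:", dp_diff)
print("Equalized odds difference:", eo_diff)
```

A value of 0 means the groups are treated identically on that metric, and larger values mean a bigger gap. In this toy data the model flags 75% of one group but only 25% of the other, which gives a demographic parity difference of 0.5.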

In [None]:

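# Features used to define the groups for the fairness check
# (only Sex_M is a true demographic attribute; the others are clinical indicators)
Demographic_feature = ["FastingBS","Sex_M","ChestPainType_ATA","ChestPainType_NAP",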
                     "ChestPainType_TA","RestingECG_Normal","RestingECG_ST",
                     "ExerciseAngina_Y","ST_Slope_Flat","ST_Slope_Up"]
selected_models = {
    'CatBoostClassifier': models['CatBoostClassifier'],
    'xgb_model': models['XGBoost']
}

# Create a list to store the results
all_results = []

for model_name, model_ in selected_models.items():
        
    for feature in Demographic_feature:

        y_pred_tree = model_.predict(X_test_tree)

        # Calculate fairness metrics using Fairlearn MetricFrame,
        # grouping the test rows by the current feature
        sensitive_feature = X_test_tree[feature]

        metric_frame = MetricFrame(
            metrics={"recall": recall_score, "selection_rate": selection_rate},
            y_true=y_test_tree,
            y_pred=y_pred_tree,
            sensitive_features=sensitive_feature
        )

        # Calculate and store differences between groups for Recall and Selection Rate
        group_labels = sensitive_feature.unique()
        group_metrics = []

        for label in group_labels:
            recall_value = metric_frame.by_group.loc[label, 'recall']
            selection_rate_value = metric_frame.by_group.loc[label, 'selection_rate']
            group_metrics.append({
                "Model": model_name,
                "Feature": feature,
                "Group": f" {label}",
                "Recall": recall_value,
                "Selection Rate": selection_rate_value
            })

        # Add Demographic Parity Difference and Equalized Odds Difference metrics
        # (both values are stored in the "Recall" column so they fit into the same table)
        dp_diff = demographic_parity_difference(y_test_tree, y_pred_tree, sensitive_features=sensitive_feature)
        eo_diff = equalized_odds_difference(y_test_tree, y_pred_tree, sensitive_features=sensitive_feature)

        group_metrics.append({
            "Model": model_name,
            "Feature": feature,
            "Group": "Demographic Parity Difference",
            "Recall": dp_diff,
            "Selection Rate": None
        })

        group_metrics.append({
            "Model": model_name,
            "Feature": feature,
            "Group": "Equalized Odds Difference",
            "Recall": eo_diff,
            "Selection Rate": None
        })

        # Store all results collectively
        all_results.extend(group_metrics)
        

# Convert the results to a DataFrame
results_df = pd.DataFrame(all_results)

# View DataFrame
results_df


**General Evaluation: Fairness Analysis of Your Models**

When we evaluate the fairness of your **CatBoost model** across different demographic features, we notice some potential biases that could be a cause for concern. Here are the key takeaways:

***ST_Slope_Up and ST_Slope_Flat:***

These features show the largest Demographic Parity Difference and Equalized Odds Difference. What does that mean? The rate of positive predictions and the error rates differ markedly between the groups defined by these features, which suggests that the model is biased against some of them. In other words, the model's predictions are not as fair as they could be for these groups.

***ExerciseAngina_Y and ChestPainType_ATA:***

We also see notable fairness disparities here. Both the Demographic Parity Difference and Equalized Odds Difference are high, which indicates that the model might be making biased predictions against certain groups associated with these features.

***FastingBS, Sex_M, and RestingECG_Normal:***

On a brighter note, the fairness differences for these features are relatively low. This suggests that the model is performing more consistently and fairly for groups with these characteristics, which is a good sign!

Now, let’s take a look at the fairness analysis of your **XGBoost model:**

***ST_Slope_Flat and ST_Slope_Up:***

Just like with CatBoost, these features show the highest disparities in Demographic Parity Difference and Equalized Odds Difference. This indicates a strong bias against groups with these characteristics, signaling serious fairness issues within the model.

***ExerciseAngina_Y and ChestPainType_ATA:***

These features also show significant fairness differences in the XGBoost model. Again, the high values for Demographic Parity Difference and Equalized Odds Difference suggest that the model may be biased against some groups associated with these features.

***FastingBS, Sex_M, and RestingECG_Normal:***

Similar to the CatBoost model, the fairness differences for these features are relatively low. This indicates that the model is behaving more consistently and fairly across groups with these characteristics.

# Model Interpretability and Feature Importance




<img src="https://www.researchgate.net/publication/381518989/figure/fig2/AS:11431281252656108@1718783012097/Feature-importance-for-model-interpretability-The-present-figure-represents-the.png">



## Feature Importance

Feature importance is all about understanding which features are most influential in making predictions with your model. In decision tree-based models, for instance, one common measure (the `weight` type used in the XGBoost code below) counts how many times a feature is used to split the data into branches.

However, feature importance has its limitations. While it tells you which features are generally more important for the model, it doesn’t give you the full picture. It doesn’t reveal details about interactions between features or show exactly how each feature contributes to individual predictions.
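
As a rough sketch of the split-counting idea, the snippet below walks a tiny hand-written tree and counts how often each feature is used for a split. The tree structure and the helper `count_splits` are hypothetical and only meant to illustrate the principle, not how any particular library stores its trees.

```python
from collections import Counter

# A hypothetical, hand-written tree: internal nodes name the feature they split on
tree = {
    "feature": "ST_Slope_Up",
    "left": {"feature": "Age", "left": {"leaf": 1}, "right": {"leaf": 0}},
    "right": {
        "feature": "Sex_M",
        "left": {"leaf": 0},
        "right": {"feature": "Age", "left": {"leaf": 0}, "right": {"leaf": 1}},
    },
}

def count_splits(node, counts=None):
    """Count how many internal nodes split on each feature."""
    counts = Counter() if counts is None else counts
    if "leaf" not in node:
        counts[node["feature"]] += 1
        count_splits(node["left"], counts)
        count_splits(node["right"], counts)
    return counts

importance = count_splits(tree)
print(importance.most_common())
```

Notice what the counts leave out: `Age` is used twice, but nothing tells you whether a higher age pushes the prediction up or down, or how it interacts with the other features.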




## Model Interpretability with SHAP

Enter SHAP values—a more advanced way to interpret model predictions. SHAP (SHapley Additive exPlanations) values provide a detailed look at the contribution of each feature to a specific model prediction. They’re based on Shapley values from cooperative game theory, which allow the effect of each feature on the prediction to be fairly distributed among all features.

What’s cool about SHAP values is that they show how much each feature’s value pushes a specific prediction above or below the model’s average (baseline) output. This gives you a clear understanding of how each feature is impacting the model’s decisions.
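
For a model with only a few features, Shapley values can be computed exactly by brute force, which makes the "fair distribution" idea tangible. The sketch below is an illustration under made-up assumptions (a toy `predict` function and three hypothetical binary features). The `shap` library uses much faster algorithms for tree models, but the quantity it estimates follows the same definition.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution over
    every order in which the features are switched from the baseline to x."""
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)              # start from the baseline row
        for name in order:
            before = predict(current)
            current[name] = x[name]           # switch this feature to its real value
            totals[name] += predict(current) - before
    return {name: total / len(orderings) for name, total in totals.items()}

def predict(row):
    """Toy risk score (made up) with an interaction between angina and flat_slope."""
    return 0.1 + 0.2 * row["angina"] + 0.4 * row["angina"] * row["flat_slope"] - 0.1 * row["young"]

x = {"angina": 1, "flat_slope": 1, "young": 1}          # the patient to explain
baseline = {"angina": 0, "flat_slope": 0, "young": 0}   # the reference row

phi = shapley_values(predict, x, baseline)
print(phi)
print("sum of contributions:", sum(phi.values()))
print("prediction - baseline:", predict(x) - predict(baseline))
```

Two things are worth noticing. The contributions add up to the difference between the prediction and the baseline, and the interaction term is split evenly between `angina` and `flat_slope`, which is the "fair distribution" that Shapley values are known for.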





### Differences Between Feature Importance and SHAP Values

* **Feature Importance:**

  * Shows which features are more important overall.
  * Provides a general sense of which features the model relies on the most, but it doesn’t indicate the direction of the impact or specific contributions to individual predictions.
* **SHAP Values:**

  * Show the effect of each feature on a particular prediction made by the model.
  * Provide a detailed view of how each feature positively or negatively affects a prediction and by how much.
  
### Level of Detail:

* **Feature Importances:**
  * Tell you how much features are used overall in the model but don’t specify in what direction they affect the predictions.
* **SHAP Values:**
  * Indicate whether a feature has a positive or negative impact on predictions and quantify that effect.
  
  
### Interactions:

* **Feature Importances:**

  * Typically do not account for interactions between features. They consider each feature independently, without showing how features might work together to affect predictions.
* **SHAP Values:**

  * Take feature interactions into account and demonstrate how multiple features can interact to influence the model’s predictions.
  
  
*In summary, while feature importance gives a high-level overview of which features are crucial to the model, SHAP values go a step further by showing the detailed impact of each feature, including interactions and directionality. This makes SHAP a powerful tool for understanding and interpreting complex models, especially when you need to explain model behavior in detail.*

In [None]:
xgb_model = models['XGBoost']

# Calculate feature importance levels
importance = xgb_model.get_booster().get_score(importance_type='weight')

# Convert and sort feature importances to a DataFrame
importance_df = pd.DataFrame(list(importance.items()), columns=['Feature', 'Importance']).sort_values(by='Importance', ascending=False)

# Visualize feature importance levels
fig, ax = plt.subplots(figsize=(10, 8))
xgb.plot_importance(xgb_model, importance_type='weight', max_num_features=15, ax=ax)  # shows the 15 most important features
plt.title('Feature Importance (Weight)')
plt.show()

# If you want to evaluate based on gain or cover instead:
# importance_gain = xgb_model.get_booster().get_score(importance_type='gain')
# importance_cover = xgb_model.get_booster().get_score(importance_type='cover')


In [None]:
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_train_tree)

# Visualize the overall impact of features
shap.summary_plot(shap_values, X_train_tree)

# Analysis of SHAP values by feature
shap.dependence_plot("ST_Slope_Up", shap_values, X_train_tree)


In [None]:

catboost_model = models['CatBoostClassifier']

# Calculating feature importance levels
feature_importance = catboost_model.get_feature_importance()
features = X_train_tree.columns

# Convert and sort features and their importance into a DataFrame
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importance})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Visualize ranked feature importance levels
plt.figure(figsize=(10, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance (CatBoost)')
plt.gca().invert_yaxis()  
plt.show()

# Display the sorted table
importance_df


In [None]:
# Calculating SHAP values
explainer = shap.TreeExplainer(catboost_model)
shap_values = explainer.shap_values(X_train_tree)

# Visualize the overall impact of features
shap.summary_plot(shap_values, X_train_tree)

# Analysis of SHAP values by feature (e.g. the 'Age' feature)
shap.dependence_plot("Age", shap_values, X_train_tree)


# Probability Calibration

<img src="https://ploomber.io/images/blog/calibration-curve/serialized/25-0.png">

### What is Probability Calibration?

Probability calibration is a technique used to adjust the predicted probabilities of a machine learning model to better match the true probabilities of the outcomes. This process is especially important in scenarios where accurate probability estimates are crucial, such as in emerging or high-stakes problems. The goal is to ensure that the model’s predicted probabilities are reliable and aligned with reality.

### Why is Probability Calibration Important?

In many applications, it’s not just about making the right prediction but also about understanding how confident the model is in that prediction. For example, in a medical diagnosis scenario, knowing whether a model is 90% or 50% confident in predicting a disease can make a huge difference in decision-making. Similarly, in financial forecasting or risk assessment, calibrated probabilities can provide more accurate insights.

### When is Probability Calibration Done?

1. After the Model is Trained:
Once your model has been trained and you’ve evaluated its initial performance, you might notice that the predicted probabilities aren’t quite aligning with the real-world outcomes. This is when you perform probability calibration to fine-tune these estimates, making them more representative of true probabilities.

2. Performance Evaluation:
After analyzing various performance metrics such as the Brier score, you may determine that your model’s probability estimates are inaccurate. If these metrics indicate that the model’s estimates are not as reliable or consistent as they should be, calibration can be applied to improve the accuracy of these probability estimates.

3. Especially in Cases of Imbalanced Data:
If your training data is imbalanced—say, there’s a significant disparity between the number of instances in different classes—your model’s probability estimates might be biased. For example, if one class is underrepresented, the model might not predict probabilities accurately for that class. In such cases, calibration is crucial to ensure fair and balanced probability estimates.
​
### How is Probability Calibration Useful?

* **Improving Decision-Making:** Calibrated probabilities help make better-informed decisions in critical applications, such as healthcare, finance, or any field where precise probability estimates are necessary.

* **Enhancing Model Reliability:** By ensuring that the predicted probabilities reflect real-world chances more accurately, calibration increases the trustworthiness of the model’s predictions.

* **Addressing Bias in Predictions:** When dealing with imbalanced datasets, calibration can help correct systematic over- or underestimation in the probability estimates, so the model’s outputs better match the observed class frequencies.

*In summary, probability calibration is a valuable step in fine-tuning a machine learning model, especially when you need reliable probability estimates. Its main job is not to change which class is predicted but to make the confidence attached to each prediction match how often such predictions turn out to be right, leading to better decisions and more trustworthy outcomes.*
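
One way to see what a calibrated probability looks like is to group predictions into bins and compare the average predicted probability with the observed share of positives in each bin. The sketch below does this on simulated data; the names `true_p`, `y_sim`, `reliability_table`, and `max_gap` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "true" probabilities and outcomes drawn from them (synthetic data)
true_p = rng.uniform(0.05, 0.95, size=20000)
y_sim = (rng.uniform(size=true_p.size) < true_p).astype(int)

# A calibrated model reports true_p; an overconfident one pushes scores toward 0 and 1
p_calibrated = true_p
p_overconfident = np.clip(0.5 + 1.8 * (true_p - 0.5), 0.01, 0.99)

def reliability_table(y, p, n_bins=5):
    # For each probability bin: (mean predicted probability, observed positive rate)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(p, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((p[mask].mean(), y[mask].mean()))
    return np.array(rows)

def max_gap(table):
    # Largest absolute gap between predicted and observed rate across bins
    return float(np.max(np.abs(table[:, 0] - table[:, 1])))

gap_calibrated = max_gap(reliability_table(y_sim, p_calibrated))
gap_overconfident = max_gap(reliability_table(y_sim, p_overconfident))

print(f"Largest gap (calibrated):    {gap_calibrated:.3f}")
print(f"Largest gap (overconfident): {gap_overconfident:.3f}")
```

This is the same idea that `calibration_curve` plots: the closer the two columns are in every bin, the closer the curve sits to the diagonal.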

In [None]:

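# Imports used by the calibration cells
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss

# Calibrate the model. The estimator is passed positionally because the
# `base_estimator` keyword was renamed to `estimator` in newer scikit-learn releases.
calibrated_model = CalibratedClassifierCV(catboost_model, method='sigmoid', cv=5)
calibrated_model.fit(X_train_tree, y_train_tree)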

# Post-calibration estimates
y_prob = calibrated_model.predict_proba(X_test_tree)[:, 1]

# Measuring calibration success with Brier Score Loss
brier_loss = brier_score_loss(y_test_tree, y_prob)
print(f"Brier Score Loss: {brier_loss}")


In [None]:

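# Calibrate the model (estimator passed positionally: the `base_estimator`
# keyword was renamed to `estimator` in newer scikit-learn releases)
calibrated_model = CalibratedClassifierCV(xgb_model, method='sigmoid', cv=5)
calibrated_model.fit(X_train_tree, y_train_tree)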

# Post-calibration estimates
y_prob = calibrated_model.predict_proba(X_test_tree)[:, 1]

# Measuring calibration success with Brier Score Loss
brier_loss = brier_score_loss(y_test_tree, y_prob)
print(f"Brier Score Loss: {brier_loss}")


In [None]:
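# Let's perform the calibration (estimator passed positionally: the `base_estimator`
# keyword was renamed to `estimator` in newer scikit-learn releases)

calibrated_catboost = CalibratedClassifierCV(catboost_model, method='sigmoid', cv=5)
calibrated_catboost.fit(X_train_tree, y_train_tree)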

# Post-calibration estimates
y_prob_calibrated = calibrated_catboost.predict_proba(X_test_tree)[:, 1]

# Pre-calibration estimates
y_prob_uncalibrated = catboost_model.predict_proba(X_test_tree)[:, 1]

# Let's calculate the Brier scores
brier_uncalibrated = brier_score_loss(y_test_tree, y_prob_uncalibrated)
brier_calibrated = brier_score_loss(y_test_tree, y_prob_calibrated)

print(f"Brier Score (Uncalibrated): {brier_uncalibrated:.4f}")
print(f"Brier Score (Calibrated): {brier_calibrated:.4f}")


# Let's draw the calibration curve
plt.figure(figsize=(10, 8))

# Before calibration
fraction_of_positives, mean_predicted_value = calibration_curve(y_test_tree, y_prob_uncalibrated, n_bins=10)
plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Uncalibrated")

# After calibration
fraction_of_positives_cal, mean_predicted_value_cal = calibration_curve(y_test_tree, y_prob_calibrated, n_bins=10)
plt.plot(mean_predicted_value_cal, fraction_of_positives_cal, "s-", label="Calibrated (Sigmoid)")

# Perfect calibration line
plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")

plt.xlabel("Mean predicted value")
plt.ylabel("Fraction of positives")
plt.title("Calibration Curve")
plt.legend()
plt.grid(True)
plt.show()


# Normalized Model


### Data Splitting

In [None]:
X = df_encoded.drop('HeartDisease', axis=1)
y = df_encoded['HeartDisease']

# Let's split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Let's check the sizes of training and test sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape


# Normalization and Standardization: What’s the Difference?

When working with machine learning models, you might often hear about normalization and standardization. These are two common techniques used to rescale data, but they’re used in different scenarios and for different reasons. Let’s break down what each one means and when you might want to use them.

### Normalization

* **Definition:** Normalization is the process of rescaling your data to a specific range, usually between 0 and 1. This technique is handy when you want to make sure all your data points fall within a uniform range.

* **How to Do It:** To normalize a feature, you rescale each value using that feature's minimum and maximum. You subtract the minimum from each data point and then divide by the range (maximum value minus minimum value). This way, all your features end up on the same scale.

* **When to Use It:** Normalization is a reasonable choice when features live on very different scales and you need them in a fixed, bounded range, or when the data does not follow a normal distribution (like when it’s positively skewed). It helps bring all the data into a comparable range, which can improve the performance of distance-based algorithms such as k-nearest neighbors. Keep in mind that the minimum and maximum are sensitive to outliers: a single extreme value can squeeze all the other points into a narrow band.
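
As a small illustration of the min-max formula, the sketch below rescales a made-up two-column matrix by hand and compares the result with scikit-learn's `MinMaxScaler`. The array `X_toy` and its values are assumptions for demonstration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: two columns on very different scales (made-up values)
X_toy = np.array([[20.0, 100.0],
                  [30.0, 250.0],
                  [40.0, 400.0],
                  [60.0, 550.0]])

# Min-max formula applied column by column: (x - min) / (max - min)
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_toy_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation using scikit-learn
X_toy_sklearn = MinMaxScaler().fit_transform(X_toy)

print(X_toy_manual)
print("Matches MinMaxScaler:", np.allclose(X_toy_manual, X_toy_sklearn))
```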

### Standardization

* **Definition:** Standardization, on the other hand, involves rescaling your data so that it has a mean of 0 and a standard deviation of 1. It only shifts and rescales the values, so it does not change the shape of the distribution: skewed data stays skewed, it simply ends up centered around zero.

* **How to Do It:** To standardize your data, you subtract the mean of the data from each data point and then divide by the standard deviation. This results in data that is centered around zero with a consistent scale.

* **When to Use It:** Standardization is often the go-to option if your data is roughly normally distributed or when you’re using algorithms that expect centered, similarly scaled features, such as SVMs or regularized linear models. It helps ensure that the data is centered and has a uniform variance, which can be crucial for many statistical models.
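
The z-score formula can be sketched the same way. The example below standardizes one synthetic, right-skewed feature by hand and with `StandardScaler`, then checks that the skewness is the same before and after, since standardization only shifts and rescales. The variable `x_skewed` and the helper `skewness` are hypothetical names used for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# One right-skewed synthetic feature (exponential), stored as a column vector
x_skewed = rng.exponential(scale=2.0, size=(1000, 1))

# z-score formula: (x - mean) / std, with the population std that StandardScaler uses
x_manual = (x_skewed - x_skewed.mean(axis=0)) / x_skewed.std(axis=0)
x_sklearn = StandardScaler().fit_transform(x_skewed)

def skewness(values):
    # Third standardized moment; a positive value means a long right tail
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / values.std() ** 3)

print(f"mean after: {x_manual.mean():.4f}, std after: {x_manual.std():.4f}")
print(f"skewness before: {skewness(x_skewed):.3f}, after: {skewness(x_manual):.3f}")
print("Matches StandardScaler:", np.allclose(x_manual, x_sklearn))
```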

### So, Which One Should You Choose?

Honestly, there’s no one-size-fits-all answer here, friends. The choice between normalization and standardization depends on your specific dataset and the requirements of your machine learning model. Some algorithms, such as SVMs and k-nearest neighbors, are sensitive to the range and distribution of the data, while others, such as tree-based models, are largely unaffected by feature scaling.

If you’re unsure, there’s no harm in trying both! Experiment with normalization and standardization separately and see which one works best for your model. Sometimes, the best way to find out is through a bit of trial and error.
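
A quick way to run that experiment is to put each scaler in a pipeline and compare cross-validated scores. The sketch below uses synthetic data from `make_classification` as a stand-in; `X_demo`, `y_demo`, the choice of k-nearest neighbors, and the recall metric are assumptions for illustration, so swap in your own data and model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data; replace with your own features and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
# Inflate one feature's scale so that scaling actually matters
X_demo[:, 0] *= 1000

candidate_scalers = {"MinMaxScaler": MinMaxScaler(), "StandardScaler": StandardScaler()}
scaler_scores = {}

for scaler_name, scaler_obj in candidate_scalers.items():
    # The scaler lives inside the pipeline, so it is refit on each training fold only
    pipeline = make_pipeline(scaler_obj, KNeighborsClassifier(n_neighbors=5))
    fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="recall")
    scaler_scores[scaler_name] = fold_scores.mean()
    print(f"{scaler_name}: mean recall = {scaler_scores[scaler_name]:.3f}")
```

Keeping the scaler inside the pipeline means the test fold never influences the scaling statistics, which keeps the comparison fair.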

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [None]:
# Let's define the models
models_non_tree = {
    'SVC': SVC(random_state=42),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=5),
    'LogisticRegression': LogisticRegression(random_state=42),
    'RidgeClassifier': RidgeClassifier(random_state=42),
}



In [None]:

# HalvingRandomSearchCV is experimental in scikit-learn and has to be enabled before it can be imported
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import GridSearchCV, HalvingRandomSearchCV, RandomizedSearchCV
from skopt import BayesSearchCV

# Note: no `scoring` argument is passed below, so every search optimizes the
# estimator's default score (accuracy for these classifiers)

def optimize_support_vector_classifier(X_train_scaled, y_train):
    # Successive-halving random search over C, gamma and kernel
    param_dist_svc = {
        'C': [0.1, 1, 10, 100, 1000],
        'gamma': [1e-3, 1e-4, 'scale', 'auto'],
        'kernel': ['linear', 'rbf']
    }
    halving_search_svc = HalvingRandomSearchCV(SVC(random_state=42), param_dist_svc, factor=3, cv=5, random_state=42)
    halving_search_svc.fit(X_train_scaled, y_train)
    return halving_search_svc.best_estimator_


def optimize_logistic(X_train_scaled, y_train):
    # Bayesian search (scikit-optimize); C is sampled on a log scale
    param_grid_logistic = {
        'C': (1e-6, 1e+6, 'log-uniform'),
        'penalty': ['l2'],
        'solver': ['liblinear', 'saga']
    }
    bayes_search_logistic = BayesSearchCV(LogisticRegression(random_state=42), param_grid_logistic, n_iter=30, cv=5, random_state=42)
    bayes_search_logistic.fit(X_train_scaled, y_train)
    return bayes_search_logistic.best_estimator_


def optimize_knn(X_train_scaled, y_train):
    # Random search; note that 'minkowski' with the default p=2 is the same as 'euclidean'
    param_dist_knn = {
        'n_neighbors': range(1, 31),
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan', 'minkowski']
    }
    random_search_knn = RandomizedSearchCV(KNeighborsClassifier(), param_dist_knn, n_iter=30, cv=5, random_state=42)
    random_search_knn.fit(X_train_scaled, y_train)
    return random_search_knn.best_estimator_


def optimize_ridge(X_train_scaled, y_train):
    # Exhaustive grid search over the regularization strength and solver
    param_grid_ridge = {
        'alpha': [0.1, 1.0, 10.0, 100.0],
        'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sag']
    }
    grid_search_ridge = GridSearchCV(RidgeClassifier(random_state=42), param_grid_ridge, cv=5)
    grid_search_ridge.fit(X_train_scaled, y_train)
    return grid_search_ridge.best_estimator_


models_non_tree['SVC'] = optimize_support_vector_classifier(X_train_scaled, y_train)
models_non_tree['KNeighborsClassifier'] = optimize_knn(X_train_scaled, y_train)
models_non_tree['LogisticRegression'] = optimize_logistic(X_train_scaled, y_train)
models_non_tree['RidgeClassifier'] = optimize_ridge(X_train_scaled, y_train)




In [None]:
models_non_tree.items()

In [None]:
# Lists to store results
recall_scores_test = []
recall_scores_train = []
model_names = []

# Get recall scores for each model
for name, model in models_non_tree.items():
    
    # Predict with the estimators that were already fitted during hyperparameter optimization
    y_pred = model.predict(X_test_scaled)
    y_pred_tr = model.predict(X_train_scaled)
    
    # Calculate and save recall scores
    recall_test = recall_score(y_test, y_pred, average='macro')
    recall_train = recall_score(y_train, y_pred_tr, average='macro')
    
    recall_scores_test.append(recall_test)
    recall_scores_train.append(recall_train)
    model_names.append(name)

# Let's visualize the recall scores on the same graph
plt.figure(figsize=(12, 6))

plt.plot(model_names, recall_scores_train, marker='o', linestyle='-', color='b', label='Train Recall')
plt.plot(model_names, recall_scores_test, marker='o', linestyle='-', color='r', label='Test Recall')

plt.xlabel('Models')
plt.ylabel('Recall')
plt.title('Train and Test Recall for Different Models')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()


In [None]:
comparison_data = []

for name, model in models_non_tree.items():

    # Make predictions on the test and training sets
    y_pred = model.predict(X_test_scaled)
    y_pred_tr = model.predict(X_train_scaled)
    
    # Get the classification reports as dictionaries
    report_test = classification_report(y_test, y_pred, output_dict=True)
    report_train = classification_report(y_train, y_pred_tr, output_dict=True)
    
    # Let's tabulate the classification report values for each class
    for label in report_test.keys():
        if label not in ["accuracy", "macro avg", "weighted avg"]:
            comparison_data.append({
                'Model': name,
                'DataSet': 'Test',
                'Label': label,
                'Precision': report_test[label]['precision'],
                'Recall': report_test[label]['recall'],
                'F1-Score': report_test[label]['f1-score'],
                'Accuracy': report_test['accuracy']  # Add the overall accuracy
            })
            comparison_data.append({
                'Model': name,
                'DataSet': 'Train',
                'Label': label,
                'Precision': report_train[label]['precision'],
                'Recall': report_train[label]['recall'],
                'F1-Score': report_train[label]['f1-score'],
                'Accuracy': report_train['accuracy']  # Add the overall accuracy
            })

# Convert the results to a DataFrame
comparison_df = pd.DataFrame(comparison_data)



# Let's view the DataFrame
comparison_df


In [None]:
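# Each entry in comparison_data is one (model, dataset, class label) row, so the recall
# picked here can belong to either class. Rows with a recall of exactly 1 are skipped.
best_model_recall_test = max(comparison_data, key=lambda x: x['Recall'] if (x['DataSet'] == 'Test' and x['Recall']<1) else 0)

print("The best model based on test recall is:", best_model_recall_test['Model'])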
print("Recall:", best_model_recall_test['Recall'])

best_model_recall_train = max(comparison_data, key=lambda x: x['Recall'] if (x['DataSet'] == 'Train' and x['Recall']<1) else 0)

print("The best model based on train recall is:", best_model_recall_train['Model'])
print("Recall:", best_model_recall_train['Recall'])


In [None]:
# Let's take the tuned SVC. coef_ only exists for a linear kernel, so if the search
# picked 'rbf' we fall back to a linear SVC with the same C (an assumption made
# here so that the cell always runs)
model_svc = models_non_tree['SVC']
if model_svc.kernel != 'linear':
    model_svc = SVC(kernel='linear', C=model_svc.C, random_state=42)
model_svc.fit(X_train_scaled, y_train)

# Let's use the absolute coefficient values as feature importances
feature_importances = abs(model_svc.coef_[0])
importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Visualization
plt.figure(figsize=(10, 8))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel('Absolute Coefficient Value (Importance)')
plt.title('Feature Importance for Linear SVC')
plt.gca().invert_yaxis()  # Invert the axis so the most important features are at the top
plt.show()


## LIME (Local Interpretable Model-agnostic Explanations)

**What is LIME?**

LIME stands for Local Interpretable Model-agnostic Explanations. It’s a tool designed to explain individual predictions of any machine learning model, making it incredibly versatile and powerful. The cool thing about LIME is that it’s model-agnostic, which means it doesn’t matter what type of machine learning model you’re using: as long as the model exposes a prediction function (for classifiers, one that returns class probabilities), LIME can work with it!

**How Does LIME Work?**

LIME explains a model's decision for a specific prediction by creating a simple, interpretable model (usually a linear model) that approximates the complex model locally around the prediction. In other words, LIME focuses on one prediction at a time, building a local explanation that helps you understand why the model made that particular prediction.

Think of LIME as creating a small, simplified snapshot of your model's decision-making process for just one data point. It essentially says, "Hey, if we look at this prediction closely and simplify things a bit, here’s how the model is making its decision." This makes it easier to understand and trust the model’s predictions, especially when dealing with complex or black-box models.
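
The idea can be sketched in a few lines without the `lime` library. The example below is a simplified illustration of the local-surrogate approach, not the library's exact procedure: it perturbs one instance, asks a stand-in black-box model for probabilities, weights the perturbed points by how close they are to the instance, and fits a weighted linear model. The random forest, the synthetic data, and the kernel width are all assumptions made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Synthetic data and a stand-in "black-box" model (not the notebook's SVC)
X_lime, y_lime = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_lime, y_lime)

rng = np.random.default_rng(0)
instance = X_lime[0]
feature_scale = X_lime.std(axis=0)

# 1. Perturb: sample points around the instance we want to explain
perturbed = instance + rng.normal(size=(1000, X_lime.shape[1])) * feature_scale

# 2. Query the black box for the positive-class probability of each sample
target = black_box.predict_proba(perturbed)[:, 1]

# 3. Weight samples by closeness to the instance (exponential kernel on scaled distance)
distances = np.linalg.norm((perturbed - instance) / feature_scale, axis=1)
kernel_width = 0.75 * np.sqrt(X_lime.shape[1])
weights = np.exp(-(distances ** 2) / kernel_width ** 2)

# 4. Fit a weighted linear model; its coefficients act as the local explanation
surrogate = Ridge(alpha=1.0).fit(perturbed, target, sample_weight=weights)
local_effects = surrogate.coef_

for idx in np.argsort(-np.abs(local_effects)):
    print(f"feature_{idx}: {local_effects[idx]:+.4f}")
```

The sign of each coefficient says whether increasing that feature pushes this particular prediction toward or away from the positive class, and the explanation is only valid near the chosen instance.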

**Key Features of LIME:**

* **Model-agnostic:** Works with any machine learning model, whether it's a simple logistic regression or a complex deep neural network.

* **Local Explanations:** Focuses on explaining individual predictions rather than the overall model. This makes it great for understanding specific decisions and gaining insights into why a model behaved in a certain way for a particular instance.

* **Interpretable Models:** LIME builds a simple, interpretable model (like a linear model) around the prediction to approximate the complex model’s behavior locally. This helps in visualizing and understanding the influence of each feature on that specific prediction.

In summary, LIME is a fantastic tool when you need to interpret the decisions of your machine learning model on a granular, case-by-case basis. It’s especially useful when working with complex models where understanding the rationale behind each prediction can provide valuable insights and improve trust in the model’s outputs.

In [None]:
import lime.lime_tabular

# Let's train the SVC model
# probability=True is required so that LIME can call predict_proba
model_svc = SVC(C=1, kernel='linear', probability=True, random_state=42)
model_svc.fit(X_train_scaled, y_train)

# Create the LIME explainer
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train_scaled,
    feature_names=X_train.columns.tolist(),
    class_names=['No HeartDisease', 'HeartDisease'],  # Class names shown in the explanation
    mode='classification'
)

# Select sample data to explain a single prediction
i = 0  # We pick the first test sample; any other index works too
exp = explainer.explain_instance(X_test_scaled[i], model_svc.predict_proba, num_features=10)

# Visualize feature importance
exp.show_in_notebook(show_table=True)

fig = exp.as_pyplot_figure()
plt.show()



In [None]:
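# Let's perform the calibration (estimator passed positionally: the `base_estimator`
# keyword was renamed to `estimator` in newer scikit-learn releases)

calibrated_svc = CalibratedClassifierCV(model_svc, method='isotonic', cv=5)
calibrated_svc.fit(X_train_scaled, y_train)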

# Post-calibration estimates
y_prob_calibrated = calibrated_svc.predict_proba(X_test_scaled)[:, 1]

# Pre-calibration estimates
y_prob_uncalibrated = model_svc.predict_proba(X_test_scaled)[:, 1]

# Let's calculate the Brier scores
brier_uncalibrated = brier_score_loss(y_test, y_prob_uncalibrated)
brier_calibrated = brier_score_loss(y_test, y_prob_calibrated)

print(f"Brier Score (Uncalibrated): {brier_uncalibrated:.4f}")
print(f"Brier Score (Calibrated): {brier_calibrated:.4f}")


# Let's draw the calibration curve
plt.figure(figsize=(10, 8))

# Before calibration
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_prob_uncalibrated, n_bins=10)
plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Uncalibrated")

# After calibration
fraction_of_positives_cal, mean_predicted_value_cal = calibration_curve(y_test, y_prob_calibrated, n_bins=10)
plt.plot(mean_predicted_value_cal, fraction_of_positives_cal, "s-", label="Calibrated (Isotonic)")


# Perfect calibration line
plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")

plt.xlabel("Mean predicted value")
plt.ylabel("Fraction of positives")
plt.title("Calibration Curve")
plt.legend()
plt.grid(True)
plt.show()


# Model Robustness

### What is Robustness?

Robustness in machine learning refers to a model’s ability to keep performing well even when the going gets tough. This means that when a model is exposed to challenging conditions (like noisy data, missing information, or even adversarial attacks), it still manages to produce reliable results. Essentially, a robust model doesn’t lose its cool when the input data isn’t perfect or when unexpected changes occur in the data distribution.

### Why is Robustness Important in Real-World Applications?

In the real world, data is rarely clean or perfect. Models often have to deal with noise, outliers, or missing information, and this is where robustness becomes a key factor. A robust model handles these imperfect conditions gracefully without its performance taking a nosedive. Imagine a model deployed in a noisy environment or one that has to make predictions when some of the data is missing: robustness ensures it still gets the job done.

### Adaptability Matters

A robust model is not just tough; it’s also adaptable. It can handle new data or slightly different datasets than what it was trained on. This adaptability is crucial for models that will be deployed in dynamic environments where the data can change unpredictably.

### When Should You Test for Robustness?

* **Dynamic Environments:** If your model is going to be used in an environment where data might change unexpectedly, robustness testing is essential. You want to make sure your model can adapt to these changes without losing performance.

* **Noisy or Missing Data:** In situations where your model will have to deal with noisy or incomplete data, robustness testing helps ensure it can still function effectively.

* **Critical Applications:** For applications where even small errors can have significant consequences (like medical diagnostics or autonomous driving), robustness is not just a nice-to-have; it’s a necessity.

### The Role of Dataset Size in Robustness

* **Large Datasets:** Generally speaking, more data helps models generalize better, which can increase their robustness. However, it’s not just about quantity. The data also needs to be diverse and cover various scenarios the model might encounter. This diversity prepares the model for real-world challenges, making it more robust.

* **Small Datasets:** On the flip side, models trained on small datasets may not be as robust. They are more likely to overfit the training data, meaning they perform well on what they’ve seen but struggle with new, unseen data. This lack of generalization can make them less robust in real-world applications.

### Testing Robustness: After Training, Before Deployment

Once your model is trained, it’s crucial to perform robustness testing before deployment. This involves putting your model through its paces to see how it handles different scenarios. You might add noise to the test data, remove certain features, or test the model on slightly different data distributions. By doing this, you can identify any weaknesses and ensure that your model is ready for whatever the real world throws at it.
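
As a minimal sketch of one such test, the code below adds Gaussian noise of increasing strength to the test features and tracks how recall changes. Everything here is a stand-in: the synthetic data from `make_classification`, the logistic regression pipeline, and the chosen noise levels are assumptions for illustration, so plug in your own fitted model and test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and model; replace with your own fitted model and test set
X_rob, y_rob = make_classification(n_samples=1000, n_features=10, random_state=42)
X_rob_train, X_rob_test, y_rob_train, y_rob_test = train_test_split(
    X_rob, y_rob, test_size=0.2, random_state=42, stratify=y_rob
)
robust_demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
robust_demo_model.fit(X_rob_train, y_rob_train)

rng = np.random.default_rng(42)
feature_std = X_rob_train.std(axis=0)
noise_levels = [0.0, 0.1, 0.5, 1.0]  # noise std as a fraction of each feature's std
recall_by_noise = {}

for level in noise_levels:
    # Add zero-mean Gaussian noise to the test features only
    noise = rng.normal(0.0, 1.0, size=X_rob_test.shape) * feature_std * level
    y_pred_noisy = robust_demo_model.predict(X_rob_test + noise)
    recall_by_noise[level] = recall_score(y_rob_test, y_pred_noisy)
    print(f"noise level {level:.1f} -> recall {recall_by_noise[level]:.3f}")
```

A model whose recall stays close to the noise-free value as the noise grows is behaving robustly for this kind of perturbation; a sharp drop at small noise levels is a warning sign worth investigating before deployment.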

After these steps, all that remains is to put the model to use: deploying it, monitoring it after deployment, and watching for model drift.

# Model Deploy
# Post-Deployment Monitoring
# Model Drift

> ***Thanks for sticking with me until the end! If you enjoyed this notebook, please don't forget to upvote it. And if you have any feedback or suggestions on things that could be added or improved, I’d love to hear from you. Your advice is always welcome!***

<img src="https://media0.giphy.com/media/v1.Y2lkPTc5MGI3NjExcGVrbGt0cnU0eXZycXdpbW5idGZueWtmOTN6YXRyY2x1cGdoczJzOCZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/7zliaSiCjWREhN1iqC/giphy.webp">