# Heart Disease Classification

***Author: Jacob***

In [43]:
# Imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import xgboost as xgb
import random
import math
import datetime as dt
import os

# 1.1: Load Data
For this part, we will be using the Personal Key Indicators of Heart Disease prediction dataset from Kaggle. This dataset contains 18 columns and about 320,000 rows.

In this section, we load the dataset called `heart_cleaned.csv`.

In [2]:
github_url = 'https://raw.githubusercontent.com/Bubu631/Heart-Disease-Classification-Imbalanced/refs/heads/main/heart_cleaned.csv'

In [3]:
heart_df = pd.read_csv(github_url)
heart_df.head()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


In [4]:
# Get an overview of the data
heart_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

# 1.2: Data Pre-Processing & Feature Engineering

In a typical machine learning classification problem, it's crucial to carefully process and analyze your data to understand the features you're working with.
However, this one, the focus will be more on modeling rather than Exploratory Data Analysis (EDA). We've provided a dataset that is nearly ready for use, so you won't need to spend much time on EDA here.



## 1.2.1: Encoding Categorical Features
Looking at our column types, we see that we have some features of type object and some of type float64. Let's start by separating these.

**Task:**
*   Create two lists containing the column names of `numerical` and `categorical` features named `numerical_features` and `categorical_features` respectively
*   Sort these lists alphabetically

**Hint:**
* Consider using `.select_dtypes` from Pandas.

**Note:**
* Though `HeartDisease` is not a feature (it is our target), we include it within one of the lists you create. We will address this later when we create our test-train split.

In [5]:
# Populate the following lists
numerical_features = sorted(heart_df.select_dtypes(include=['float64']).columns.tolist())
categorical_features = sorted(heart_df.select_dtypes(include=['object']).columns.tolist())

print(f'There are {len(categorical_features)} categorical variables')
print(f'There are {len(numerical_features)} numerical variables')

There are 14 categorical variables
There are 4 numerical variables


In [6]:
numerical_features

['BMI', 'MentalHealth', 'PhysicalHealth', 'SleepTime']

In [7]:
categorical_features

['AgeCategory',
 'AlcoholDrinking',
 'Asthma',
 'Diabetic',
 'DiffWalking',
 'GenHealth',
 'HeartDisease',
 'KidneyDisease',
 'PhysicalActivity',
 'Race',
 'Sex',
 'SkinCancer',
 'Smoking',
 'Stroke']

Now, let's focus on those categorical features and do some **encoding**.

Encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction.

In this first part, we want to cast the columns containing binary variables (no/yes) into integer values (0 and 1).

**Task:**

*   Create a copy of `heart_df` and save it as `binary_heart_df`, use this new data frame for this problem.
*   Find all columns that have 2 unique values (these are binary features).
*   Encode the columns with binary features as *float* values (0.0 and 1.0) using `OneHotEncoder` from sklearn **(refer to sklearn documentation for special arguments to the function)**.
*   Save results in `binary_heart_df`.

**Hint:**

* `binary_heart_df` should have the same dimensions/shape before and after encoding, and the order of the columns should remain the same.
* If not receiving full points, consider swapping the encoding scheme (swapping which values get the 0s and 1s)

In [8]:
# Create a copy of heart_df and store it in binary_heart_df
binary_heart_df = heart_df.copy()

# For every column in heart_df, see if it has 2 unique values and if so, save to the binary_cols list
binary_cols = [col for col in binary_heart_df.columns if binary_heart_df[col].nunique() == 2]

In [9]:
# Create encoder
binary_encoder = OneHotEncoder(dtype=float, drop='if_binary', sparse_output=False)

# Convert the values in these columns to 1.0/0.0
binary_heart_df[binary_cols] = binary_encoder.fit_transform(binary_heart_df[binary_cols])

In [10]:
binary_heart_df[binary_cols]

Unnamed: 0,HeartDisease,Smoking,AlcoholDrinking,Stroke,DiffWalking,Sex,PhysicalActivity,Asthma,KidneyDisease,SkinCancer
0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
319790,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
319791,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0
319792,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
319793,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
binary_cols

['HeartDisease',
 'Smoking',
 'AlcoholDrinking',
 'Stroke',
 'DiffWalking',
 'Sex',
 'PhysicalActivity',
 'Asthma',
 'KidneyDisease',
 'SkinCancer']

In this step, we address the issue of categorical text data, which cannot be processed directly by machine learning algorithms. We identified 10 binary features in the dataset (columns with exactly two unique values, such as 'Yes'/'No' or 'Female'/'Male'), including our target variable HeartDisease.

We utilized Scikit-Learn's OneHotEncoder with the specific parameter `drop='if_binary'`. Instead of creating two redundant columns (e.g., 'Smoking_Yes' and 'Smoking_No'), this setting compresses the information into a single column represented by 0.0 and 1.0.

As seen in the output dataframe binary_heart_df, features like HeartDisease, Smoking, and Sex have been successfully transformed into numeric types (floats). This preserves the binary nature of the data while eliminating redundancy and preparing the dataset for the next stages of encoding and modeling.

Thus far we've encoded only our binary features, let's encode other categorical columns.

Here we can distinguish between ordinal categories and nominal featues. Features are ordinal when there is a clear ordering among the categories (i.e., good, neutral, bad, could be ordered, while gender would be nominal). To encode this in the data, we need some way to represent that good is *closer* to neutral then it is to bad.

One commonly used strategy for encoding nominal features is One Hot Encoding. This strategy involves converting a single feature with $k$ categories into $k$ or $(k-1)$ new columns of only ones and zeros. This allows us to train models that only take numerical values as input.

On the other hand, for ordered categories we have Ordinal Encoding. This typically replaces all the categorical values with increasing integers starting from 0 by default.

**Task:**

* Identify ordinal features, and store them in `ordered_features`. You should also specify the ordering within each category and store it in a list of lists named `ordered_categories` (i.e., for a hypothetical feature `Rating`, the categories/levels might be `Low`, `Medium`, `High`).
* Make a copy of `binary_heart_df` and save it as `ordered_heart_df`, use this new data frame for analysis.
* Apply Ordinal Encoding to the ordered_categorical columns, and replace them in the data; use `OrdinalEncoder` from sklearn (already imported).
* Save results in the pandas df named `ordered_heart_df`. The order of the columns should remain the same.

**Hint:**

* `ordered_features` and `ordered_categories` lists can be created manually
* For the purposes of this homework, treat `Diabetic` as a nominal feature

In [12]:
# Identify the columns that represent ordered categories and store them in ordered_features
ordered_features = ['AgeCategory', 'GenHealth']

In [13]:
# Extract the ordering of each ordered feature and place them in a list.
# Use a logical ordering of the categories in increasing order

# 1. Define the order of AgeCategory (From youngest to oldest)
age_cats = ['18-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54',
            '55-59', '60-64', '65-69', '70-74', '75-79', '80 or older']
# 2. Define the order of GenHealth (from Worst to Greatest)
health_cats = ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']

ordered_categories = [age_cats, health_cats]

In [14]:
ordered_categories

[['18-24',
  '25-29',
  '30-34',
  '35-39',
  '40-44',
  '45-49',
  '50-54',
  '55-59',
  '60-64',
  '65-69',
  '70-74',
  '75-79',
  '80 or older'],
 ['Poor', 'Fair', 'Good', 'Very good', 'Excellent']]

In [15]:
# Create copy of binary_heart_df and save in ordered_heart_df
ordered_heart_df = binary_heart_df.copy()

# Apply Ordinal Encoding to ordered columns in the dataset.

# 1. Initialize the encoder, passing in our defined category order
ordinal_encoder = OrdinalEncoder(categories=ordered_categories)

# 2. Fit and transform the 'ordered_features' columns in the dataframe
#    and assign the results back to the same columns (overwriting the text)
ordered_heart_df[ordered_features] = ordinal_encoder.fit_transform(ordered_heart_df[ordered_features])

**Task:**

* Identify nominal features, and store them in a list in `unordered_features`. **Make sure that the list is sorted in alphabetical order.**
* Make a copy of `ordered_heart_df` and save it as `encoded_heart_df`, use this new data frame for analysis.
* Apply One Hot Encoding to the `unordered_features`. If using `OneHotEncoder` from sklearn, **make sure to drop the original columns from the dataframe and concatenate the newly encoded columns to the *end* of your final dataframe using pd.concat().**
* For each encoded variable, drop the first column that is produced in your One Hot Encoding scheme - this ensures we have the minimum number of colums required to represent our data.
* Drop any of the original columns that have now been encoded and added to the data frame.
* Save results in the pandas df named `encoded_heart_df`.


**Hint:**

* Use `pd.get_dummies` or `OneHotEncoder` from sklearn (already imported).
* Columns containing dummy variables after performing One Hot Encoding do not need to have any specific names, they can have any name
* **Refer to sklearn documentation for necessary/special arguments to functions**

In [16]:
# Identify the columns that represent unordered categories and store them in unordered_features
# Our logic is: from all categorical features, subtract the binary features and ordered features we've already processed
unordered_set = set(categorical_features) - set(binary_cols) - set(ordered_features)

# Sort alphabetically
unordered_features = sorted(list(unordered_set))

In [17]:
# Create a copy of ordered_heart_df and save it in encoded_heart_df
encoded_heart_df = ordered_heart_df.copy()


# Apply One-Hot Encoding (OHE) to the unordered categorical columns in encoded_heart_df
# Initialize the OneHotEncoder with sparse_output=False to get a dense array,
# drop='first' to avoid multicollinearity, and dtype=float for numerical output
onehot_encoder = OneHotEncoder(sparse_output = False, drop = 'first', dtype = float)

# Fit and transform the unordered features
encoded_race = onehot_encoder.fit_transform(encoded_heart_df[unordered_features])

# Get the feature names for the encoded columns
encoded_cols = onehot_encoder.get_feature_names_out(unordered_features)

# Create a DataFrame from the encoded features with proper column names and index
encoded_df = pd.DataFrame(encoded_race, columns = encoded_cols, index = encoded_heart_df.index)

# Remove the original unordered categorical columns from the dataframe
encoded_heart_df = encoded_heart_df.drop(columns = unordered_features)

# Concatenate the encoded features with the remaining features
encoded_heart_df = pd.concat([encoded_heart_df, encoded_df], axis = 1)

In [18]:
encoded_heart_df

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,...,KidneyDisease,SkinCancer,"Diabetic_No, borderline diabetes",Diabetic_Yes,Diabetic_Yes (during pregnancy),Race_Asian,Race_Black,Race_Hispanic,Race_Other,Race_White
0,0.0,16.60,1.0,0.0,0.0,3.0,30.0,0.0,0.0,7.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,20.34,0.0,0.0,1.0,0.0,0.0,0.0,0.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,26.58,1.0,0.0,0.0,20.0,30.0,0.0,1.0,9.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,24.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,23.71,0.0,0.0,0.0,28.0,0.0,1.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,1.0,27.41,1.0,0.0,0.0,7.0,0.0,1.0,1.0,8.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
319791,0.0,29.84,1.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
319792,0.0,24.24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
319793,0.0,32.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


### Methodology: Encoding Nominal Features

In the previous steps, we handled binary features (via `drop='if_binary'`) and ordinal features (via `OrdinalEncoder`). The final subset of categorical data consists of **Nominal Features**.

**Nominal features** (such as `Race` and `Diabetic` status) represent categories with no inherent ranking or order. For example, "Asian" is not greater than or less than "White". Therefore, using Ordinal Encoding would be inappropriate as it introduces false mathematical relationships.

### Technical Strategy: One-Hot Encoding (OHE)

To correctly handle these features, we applied **One-Hot Encoding**. This technique converts categorical variables into a binary matrix representation.

#### The "Dummy Variable Trap" and `drop='first'`

A critical decision in this step was setting the parameter `drop='first'`.

* **The Concept:** If a feature has $k$ unique categories, standard OHE creates $k$ binary columns. However, the sum of these $k$ columns will always equal 1. This means the value of any one column can be perfectly predicted from the others.
* **The Problem:** This creates **perfect multicollinearity** (also known as the Dummy Variable Trap), which causes mathematical instability for linear models (like Logistic Regression) because the design matrix becomes singular (non-invertible).
* **The Solution:** By dropping the first category (creating $k-1$ columns), we remove this redundancy while preserving all information. The dropped category becomes the "reference" or "baseline" class.


## 1.2.2: Test-Train Split and Class Imbalance

Now that we've encoded the data, let's take a closer look at the data before splitting it into testing and training sets. Our goal is to determine the class imbalance in our target variable.

**Task:**
* Calculate the class_ratio of those with `HeartDisease` compared to those without. Save this in a variable `class_ratio`


In [19]:
# Find the ratio of majority class to minority class

# The final DataFrame from the previous step: encoded_heart_df
# In section 1.2.1, the 'HeartDisease' column was binary encoded as 0.0 (for 'No') and 1.0 (for 'Yes')

# Get the count of each class (0.0 and 1.0) in the 'HeartDisease' column
counts = encoded_heart_df['HeartDisease'].value_counts()

# Extract the counts for "no heart disease" (0.0) and "has heart disease" (1.0)
count_negative_majority = counts[0.0]
count_positive_minority = counts[1.0]

# Calculate the ratio of majority class to minority class
class_ratio = count_negative_majority / count_positive_minority

# DO NOT CHANGE ----------------------------------------------------------------
print(f'Target variable has {class_ratio:.2f}x more observations in the negative (majority) class than the positive (minority) class.')
# DO NOT CHANGE ----------------------------------------------------------------

Target variable has 10.68x more observations in the negative (majority) class than the positive (minority) class.


Before we move on to our final step in preprocessing our data, we're going to split the data into our training and testing sets. Our data set appears to be quite imbalanced, so in splitting our data, we must ensure the proportion of each class in the data remains the same in our training and testing set. This is called a stratified split or sample - we split the data in such a way that the resulting training and testing sets have the same proportions of the classes that were observed in the original data. Without stratifying our data when splitting, we risk getting an inaccurate depiction of model performance later.

**Task:**

* Using `encoded_heart_df`, split your data into a training and testing set via sklearn's `train_test_split` function
  * Ensure that you stratify your sample so the target variable has approximately the same proportion of positive and negative observations in the training and testing set
  * Ensure that you shuffle your sample
  * Ensure that the testing set holds about 20% of total observations
  * Use the seed set for you do not change it!
* Save your splits in variables `X_train`, `y_train`, `X_test`, and `y_test`

Reminder: the target column is `HeartDisease`.

In [20]:
# DO NOT CHANGE ----------------------------------------------------------------
SEED = 12345
# DO NOT CHANGE ----------------------------------------------------------------

# TODO: Split the data into training and testing sets. Set the random_state = SEED.
X = encoded_heart_df.drop(columns = 'HeartDisease')
y = encoded_heart_df['HeartDisease']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, shuffle = True, random_state = SEED)

# DO NOT CHANGE ----------------------------------------------------------------
print(f'Training data set has size {X_train.shape} and {y_train.shape}')
print(f'Testing data set has size {X_test.shape} and {y_test.shape}')

Training data set has size (255836, 23) and (255836,)
Testing data set has size (63959, 23) and (63959,)


## 1.2.3: Scaling Features
For our final pre-processing step we will be scaling our data. Scaling broadly is the process of mapping your features to a new range of values. This often helps machine learning models learn and converge faster. Some machine learning models are not *scale invariant*, meaning their ability to learn, and what they learn, about the data can be impacted by the size (scale) of the features. For example, some models might give a feature that is larger in scale more influence in the prediction, implying that the feature is more important than others when in practice it should be treated with similar importance. Scaling features helps mitigate this.

There are multiple strategies for scaling, but today we will be using Standardization on our continuous numerical features. Standardizing the columns of the data ensures each feature is centered around zero ($\mu=0$) and has unit variance ($\sigma^2=1$).

Information from outside our training set should not be used to train our models. When deploying a model in practice, you only have your training data, and so any hyperparameter tuning, encoding, and inference should be done on only the training set. In deployment, data that you see should be encoded using the information and learned parameters from the training data.

**Task:**

* Apply standardization to our training and testing data using `StandardScaler` from sklearn (imported).
  * Be sure to include our original numerical columns and new ordinal encoded feature!
* Any results should also be stored in `X_train`, `y_train`, `X_test`, and `y_test`

**Hint:**

* Prevent data leakage -  parameters should only be learned on the training set.

In [21]:
# Scale the necessary features
numerical_cols = ['BMI', 'PhysicalHealth', 'MentalHealth', 'AgeCategory', 'GenHealth', 'SleepTime']
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

# If not already done, convert X_train, y_train, X_test, y_test into numpy arrays
X_train = X_train.to_numpy(dtype=float)
X_test  = X_test.to_numpy(dtype=float)
y_train = y_train.to_numpy()
y_test  = y_test.to_numpy()

# 1.3: Modeling Imbalanced Data

We now have encoded and pre-processed our data. Let's start building the models! We will explore three types of models here: Logistic Regression, Random Forests, and XGBoost.

As we build our models, we have to keep the class imbalanced we observed in mind. To combat this, we're going to use class weights to adjust the way our model performs optimizations and learns about our data.

## 1.3.1: Logistic Regression

We will now be using Logistic Regression to build our first simple classification model. The Logistic Regression Classifier models the probability of a target class by using a linear combination of input features. It applies a logistic function to this combination to output probabilities, making it effective for binary classification tasks.

By default, the [`LogisticRegression`](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier in sklearn weighs both classes (subscribing to a term deposit or not) equally. However, we know that there is a large imbalance within our class.

The `class_weight` parameter controls the penalty for misclassifying an observation from a given class. Typically, the penalty is the same for each class. However, when there is a large class imbalance, we can increase the penalty for misclassifying the minority class (positive). These make the model adjust and emphasize learning observations from the minority class more than the majority. Providing the `balanced` keyword to this parameter results in the model penalizing the misclassification of each observation inversely proportional to class frequencies.

**Task:**
* Build two Logistic Regression models using `LogisticRegression` from sklearn (already imported)
  * The first should be saved in the `lrg` variable, and should be a plain Logistic Regression model
  * The second should be saved in the `weighted_lrg` variable, and should use the `class_weight` hyperparameter, set this to be `balanced`
* Fit this model to `X_train` and `y_train`

*Ignore any warnings from sklearn that show a max number of iterations reached*

In [36]:
# Create regular and class-weighted logistic regression models

# Model 1 (lrg): A regular logistic regression model
lrg = LogisticRegression()

# Model 2 (weighted_lrg): A logistic regression model with class_weight='balanced'
weighted_lrg = lrg.set_params(class_weight = 'balanced')

# Fit (train) both models on X_train and y_train
lrg.fit(X_train, y_train)
weighted_lrg.fit(X_train, y_train)

## 1.3.2: Logistic Regression Evaluation

Now that we have two models, we need to assess them. It's important to think critically about which metric you're using to measure model performance and why. There are many metrics you can use to evaluate models.

Metrics:
* Accuracy = $\frac{TP+TN}{TP+FN+FP+TN}$ - for an imbalanced dataset like this one, accuracy is usually not the best metric, as one can get nearly 90% accuracy just by predicting the majority class.
* Recall/Sensitivity = $\frac{TP}{TP+FN}$ - intuitively, recall is the ability of the classifier to find all the positive samples. A model with high recall has few false negatives, and thus fewer positive cases may be missed.
* Precision = $\frac{TP}{TP+FP}$ - intuitively, precision is the ability of the classifier to not label a negative sample as positive. A model with high precision has few false positives, and thus fewer negative cases that are misidentified as positive.
* F1-Score = $2 \times \frac{Precision \times Recall}{Precision + Recall}$ - can be thought of as an "average" (actually the harmonic mean) of precision and recall and is high if there is a good balance between them.

In our setting, the cost of missing a potential subscriber (false negative) is usually higher than incorrectly predicting a non-subscriber (false positive). Therefore, getting a high recall may be slightly more important than getting a high precision in this case. F1-Score is a great metric to look at for a balanced approach between not sacrificing too much recall or precision. However, we should not rely on accuracy as an indicator of model performance.

Remember that in machine learning we care mostly about model performance on *unseen* data (i.e. test data), but it is useful to look at the training metrics as well to see if the model is overfitting or underfitting.

**Task:**

* For each of your models, compute the performance using each of the above metrics on the training and testing sets
* Save your calculations in the pre-defined variables

**Hint:**

* Make use of the scoring functions in sklearn, `accuracy score()`, `recall_score()`, `precision_score()`, and `f1_score()`

In [37]:
# Predict on training and testing data for both models

# Predictions from the regular model (lrg)
lrg_pred_train = lrg.predict(X_train)
lrg_pred_test = lrg.predict(X_test)

# Predictions from the weighted model (weighted_lrg)
weighted_lrg_pred_train = weighted_lrg.predict(X_train)
weighted_lrg_pred_test = weighted_lrg.predict(X_test)

In [38]:
# Compute performance metrics for both train and test models

# --- Regular model (lrg) metrics ---
# Training set
lrg_train_acc = accuracy_score(y_train, lrg_pred_train)      # Training Accuracy
lrg_train_rec = recall_score(y_train, lrg_pred_train)        # Training Recall
lrg_train_pre = precision_score(y_train, lrg_pred_train)     # Training Precision
lrg_train_f1 = f1_score(y_train, lrg_pred_train)             # Training F1-Score

# Test set
lrg_test_acc = accuracy_score(y_test, lrg_pred_test)         # Testing Accuracy
lrg_test_rec = recall_score(y_test, lrg_pred_test)           # Testing Recall
lrg_test_pre = precision_score(y_test, lrg_pred_test)        # Testing Precision
lrg_test_f1 = f1_score(y_test, lrg_pred_test)                # Testing F1-Score

# --- Weighted model (weighted_lrg) metrics ---
# Training set
weighted_lrg_train_acc = accuracy_score(y_train, weighted_lrg_pred_train)      # Training Accuracy
weighted_lrg_train_rec = recall_score(y_train, weighted_lrg_pred_train)        # Training Recall
weighted_lrg_train_pre = precision_score(y_train, weighted_lrg_pred_train)     # Training Precision
weighted_lrg_train_f1 = f1_score(y_train, weighted_lrg_pred_train)             # Training F1-Score

# Test set
weighted_lrg_test_acc = accuracy_score(y_test, weighted_lrg_pred_test)         # Testing Accuracy
weighted_lrg_test_rec = recall_score(y_test, weighted_lrg_pred_test)           # Testing Recall
weighted_lrg_test_pre = precision_score(y_test, weighted_lrg_pred_test)        # Testing Precision
weighted_lrg_test_f1 = f1_score(y_test, weighted_lrg_pred_test)                # Testing F1-Score

In [39]:
# DO NOT CHANGE ----------------------------------------------------------------
print('Regular Logistic Regression Performance')
print('---------------------------------------')
print(f'Training Accuracy: {lrg_train_acc*100:.2f}%')
print(f'Testing Accuracy: {lrg_test_acc*100:.2f}%')
print(f'Training Recall: {lrg_train_rec*100:.2f}%')
print(f'Testing Recall: {lrg_test_rec*100:.2f}%')
print(f'Training Precision: {lrg_train_pre*100:.2f}%')
print(f'Testing Precision: {lrg_test_pre*100:.2f}%')
print(f'Training F1-Score: {lrg_train_f1*100:.2f}%')
print(f'Testing F1-Score: {lrg_test_f1*100:.2f}%')

print()

print('Class Weighted Logistic Regression Performance')
print('----------------------------------------------')
print(f'Training Accuracy: {weighted_lrg_train_acc*100:.2f}%')
print(f'Testing Accuracy: {weighted_lrg_test_acc*100:.2f}%')
print(f'Training Recall: {weighted_lrg_train_rec*100:.2f}%')
print(f'Testing Recall: {weighted_lrg_test_rec*100:.2f}%')
print(f'Training Precision: {weighted_lrg_train_pre*100:.2f}%')
print(f'Testing Precision: {weighted_lrg_test_pre*100:.2f}%')
print(f'Training F1-Score: {weighted_lrg_train_f1*100:.2f}%')
print(f'Testing F1-Score: {weighted_lrg_test_f1*100:.2f}%')
# DO NOT CHANGE ----------------------------------------------------------------

Regular Logistic Regression Performance
---------------------------------------
Training Accuracy: 75.19%
Testing Accuracy: 75.12%
Training Recall: 78.05%
Testing Recall: 77.77%
Training Precision: 22.56%
Testing Precision: 22.46%
Training F1-Score: 35.00%
Testing F1-Score: 34.86%

Class Weighted Logistic Regression Performance
----------------------------------------------
Training Accuracy: 75.19%
Testing Accuracy: 75.12%
Training Recall: 78.05%
Testing Recall: 77.77%
Training Precision: 22.56%
Testing Precision: 22.46%
Training F1-Score: 35.00%
Testing F1-Score: 34.86%


## 1.3.3: XGBoost Classifier

We will now examine another type of classifier: XGBoost. XGBoost is an optimized implementation of gradient boosting designed for speed and performance. It builds an ensemble of decision trees sequentially, where each tree corrects the errors of the previous ones. XGBoost includes regularization to prevent overfitting and can handle large, complex datasets efficiently. It support parallelization, making it faster than other gradient boosting implementations.

 As mentioned, XGBoost trees are built sequentially, with each tree correcting the errors of the previous ones. In contrast, Random Forest (next section) combines the results of independently built trees for final predictions. XGBoost includes regularization to prevent overfitting and supports parallelization during tree construction, making it faster and more accurate in certain scenarios. Random Forest, on the other hand, is typically more robust to hyperparameter tuning and easier to train but may not achieve the same level of performance on structured data as XGBoost.

 **Task:**
 * Initialize an `XGBClassifier` called `xgb_clf` with the following parameters:
  * `random_state = SEED` (same SEED as before)
  * `tree_method = 'hist'`
  * `scale_pos_weight = class_ratio`, this helps account for class imbalances (similar to the `class_weight` parameter in `LogisticRegression`)
  * `n_estimators = 100`
  * `max_depth = 3`
* Fit it to the training data

**Hint:**
 *   Use this [XGBoost documentation](https://xgboost.readthedocs.io/en/latest/python/python_api.html)

In [44]:
# TODO: Initialize your XGB Classifier here with the above parameters
xgb_clf = xgb.XGBClassifier(
    random_state=SEED,
    tree_method='hist',
    scale_pos_weight=class_ratio,
    n_estimators=100,
    max_depth=3
)

# TODO: Fit the classifier to our data
xgb_clf.fit(X_train, y_train)

## 1.3.4 XGBoost Classifier Evaluation
We will now use the same metrics as we used for the `LogisticRegression` model on `xgb_clf`.

**Task:**
*   Predict on the training and testing data using the `.predict()` method of the `XGBClassifier` class
*   Compute the performance using each of the above metrics on the training and testing sets
* Save your calculations in the pre-defined variables

**Hint:**
* Make use of the scoring functions in sklearn, [`accuracy_score()`](https://scikit-learn.org/dev/modules/generated/sklearn.metrics.accuracy_score.html),  [`recall_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html),  [`precision_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html),  [`f1_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)

In [45]:
# Predict using your classifier
xgb_clf_y_train_pred = xgb_clf.predict(X_train)
xgb_clf_y_test_pred = xgb_clf.predict(X_test)

# Compute metrics

# --- Training ---
training_accuracy_score = accuracy_score(y_train, xgb_clf_y_train_pred)
training_recall_score = recall_score(y_train, xgb_clf_y_train_pred)
training_precision_score = precision_score(y_train, xgb_clf_y_train_pred)
training_f1_score = f1_score(y_train, xgb_clf_y_train_pred)

# --- Testing ---
testing_accuracy_score = accuracy_score(y_test, xgb_clf_y_test_pred)
testing_recall_score = recall_score(y_test, xgb_clf_y_test_pred)
testing_precision_score = precision_score(y_test, xgb_clf_y_test_pred)
testing_f1_score = f1_score(y_test, xgb_clf_y_test_pred)

# DO NOT EDIT ------------------------------------------------------------------
print(f'XGBoosting Training Accuracy: {training_accuracy_score}')
print(f'XGBoosting Testing Accuracy: {testing_accuracy_score}')

print(f'XGBoosting Training Recall: {training_recall_score}')
print(f'XGBoosting Testing Recall: {testing_recall_score}')

print(f'XGBoosting Training Precision: {training_precision_score}')
print(f'XGBoosting Testing Precision: {testing_precision_score}')

print(f'XGBoosting Training F1-Score: {training_f1_score}')
print(f'XGBoosting Testing F1-Score: {testing_f1_score}')

XGBoosting Training Accuracy: 0.7397395206304038
XGBoosting Testing Accuracy: 0.7387388795947404
XGBoosting Training Recall: 0.8106219746095534
XGBoosting Testing Recall: 0.7972602739726027
XGBoosting Training Precision: 0.22136728687584178
XGBoosting Testing Precision: 0.21863260706235912
XGBoosting Training F1-Score: 0.3477656093881629
XGBoosting Testing F1-Score: 0.3431603773584906


## 1.3.5: Random Forest Classifier
Now that you've built a Logistic Regression classifier and XGBoost Classifier, we'll contrast it with a different classifier: a random forest. The Random Forest Classifier is an ensemble method that builds multiple decision trees on different subsets of the data and combines their predictions to improve accuracy and reduce overfitting. Each tree is trained on a random subset of the data and features, making the model robust to noise and complex patterns. It's particularly useful for handling non-linear relationships and is effective on high-dimensional datasets.


**Task:**
*   Using the [`RandomForestClassifer`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) class from sklearn, set the following hyperparameters:
   * `class_weight = 'balanced'`
   * `n_estimators = 100`
   * `max_depth = 5`
   * `random_state = SEED` (same SEED as before)
*   Save this classifier as `rfc` and fit the classifier to `X_train` and `y_train`

In [46]:
# Initialize the RandomForestClassifier with the above parameters
rfc = RandomForestClassifier(
    class_weight='balanced',              # Handle class imbalance by automatically assigning higher weight to the minority class (has heart disease)
    n_estimators=100,                     # Number of trees in the forest
    max_depth=5,                          # Maximum depth of each tree to control overfitting
    random_state=SEED                     # Set random seed for reproducibility
)

# Fit the classifier to the training set
rfc.fit(X_train, y_train)

## 1.3.6: Random Forest Classifier Evaluation
We will now use the same metrics as for the previous models model on `rfc`.

**Task:**
*   Predict on the training and testing data using the `.predict()` method of the `RandomForestClassifier` class
*   Compute the performance using each of the above metrics on the training and testing sets
* Save your calculations in the pre-defined variables

In [47]:
# Predict on the training and testing set
rfc_y_train_pred = rfc.predict(X_train)
rfc_y_test_pred = rfc.predict(X_test)

In [48]:
# Compute performance metrics for the model

# --- Training Set Metrics ---
rfc_train_acc = accuracy_score(y_train, rfc_y_train_pred)
rfc_train_recall = recall_score(y_train, rfc_y_train_pred)
rfc_train_precision = precision_score(y_train, rfc_y_train_pred)
rfc_train_f1 = f1_score(y_train, rfc_y_train_pred)

# --- Testing Set Metrics ---
rfc_test_acc = accuracy_score(y_test, rfc_y_test_pred)
rfc_test_recall = recall_score(y_test, rfc_y_test_pred)
rfc_test_precision = precision_score(y_test, rfc_y_test_pred)
rfc_test_f1 = f1_score(y_test, rfc_y_test_pred)

In [49]:
# DO NOT CHANGE ----------------------------------------------------------------
print(f'Random Forest Classifier Training Accuracy: {rfc_train_acc}')
print(f'Random Forest Classifier Testing Accuracy: {rfc_test_acc}')

print(f'Random Forest Classifier Training Recall: {rfc_train_recall}')
print(f'Random Forest Classifier Testing Recall: {rfc_test_recall}')

print(f'Random Forest Classifier Training Precision: {rfc_train_precision}')
print(f'Random Forest Classifier Testing Precision: {rfc_test_precision}')

print(f'Random Forest Classifier Training F1-Score: {rfc_train_f1}')
print(f'Random Forest Classifier Testing F1-Score: {rfc_test_f1}')
# DO NOT CHANGE ----------------------------------------------------------------

Random Forest Classifier Training Accuracy: 0.7367102362450945
Random Forest Classifier Testing Accuracy: 0.736581247361591
Random Forest Classifier Training Recall: 0.7751392821262215
Random Forest Classifier Testing Recall: 0.7682191780821918
Random Forest Classifier Training Precision: 0.21375410847636916
Random Forest Classifier Testing Precision: 0.2125852918877938
Random Forest Classifier Training F1-Score: 0.3351002398649649
Random Forest Classifier Testing F1-Score: 0.3330166270783848


# 1.4 Model Evaluation and Comparative Analysis

## 1.4.1 Overview of Objectives

In the modeling phase of this project, we trained three distinct classifiers: **Logistic Regression**, **XGBoost**, and **Random Forest**.

Given the critical nature of medical diagnosis (predicting heart disease) and the significant class imbalance in our dataset, our primary evaluation metric is **Recall (Sensitivity)**. Our objective is to minimize False Negatives (missed diagnoses), even if it incurs a higher cost in False Positives (low Precision).

---

## 1.4.2 Individual Model Performance

### Random Forest Classifier

* **Configuration:** The Random Forest model was trained with `class_weight='balanced'` and a constrained `max_depth=5` to prevent overfitting.
* **Testing Recall:** 76.82%
* **Testing Precision:** 21.26%

**Analysis:**
The Random Forest model demonstrated stability, with virtually identical scores on training and testing sets (indicating **zero overfitting**). However, its Recall of **76.82%** was the lowest among the three models. This suggests that the `max_depth=5` constraint, while making the model robust, likely caused it to **underfit** slightly.

### Logistic Regression (Baseline)

* **Configuration:** We utilized a Logistic Regression model with `class_weight='balanced'`.
* **Testing Recall:** ~77.77%
* **Testing Precision:** ~22.46%

**Analysis:**
Surprisingly, the simple linear baseline outperformed the constrained Random Forest in terms of Recall. This indicates that either the linear relationships in the dataset are quite strong, or that the Random Forest was simply too restricted to outperform the linear decision boundary.

### XGBoost Classifier

* **Configuration:** We utilized the XGBClassifier with `scale_pos_weight` set to the class ratio, `tree_method='hist'`, and `max_depth=3`.
* **Testing Recall:** **79.72%**
* **Testing Precision:** 21.86%

**Analysis:**
XGBoost emerged as the superior model. By building trees sequentially—where each new tree attempts to correct the errors of the previous ones—the "boosting" mechanism allowed it to better focus on the hard-to-classify minority samples.

---

## 1.4.3 Comparative Summary

The table below summarizes the performance of all three models on the unseen testing set.

| Model | Testing Recall (Sensitivity) | Testing Precision | Overfitting Risk |
| :--- | :--- | :--- | :--- |
| **Logistic Regression** | 77.77% | 22.46% | Low |
| **XGBoost** | **79.72% (Winner)** | 21.86% | Low |
| **Random Forest** | 76.82% | 21.26% | Very Low |

### Key Observations

1.  **The Trade-off:** All three models exhibit the classic trade-off required for imbalanced data: **High Recall, Low Precision**. We successfully forced the models to cast a "wide net" to catch sick patients.
2.  **Complexity vs. Performance:** While Random Forest is generally robust, the boosting architecture of XGBoost proved more effective for this specific classification task, providing a ~2% gain in Recall over the Random Forest and a ~2% gain over the Linear baseline.

---

## 1.4.4 Final Conclusion and Recommendation

For a **medical screening application**, the cost of a False Negative (a patient dying because their heart disease was missed) is infinitely higher than the cost of a False Positive.

Therefore, **XGBoost is the recommended model**.

It provides the highest probability of detecting at-risk patients (Recall ~79.7%) while maintaining a stable generalization profile.

# 1.5: Intro to Hyperparameter Tuning for RandomForest
In this section, we will be gain some familiarity with a new machine learning algorithm and an essential component of data modelling called hyperparameter tuning.

In machine learning, there are typically two sets of parameters. The first being the model parameters. These are typically the parameters that are used to model the data directly, and these are learned through some sort of optimization problem. An example of these would be the weights in a linear regression model (Recall: $y=\beta^Tx=\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m$, $\beta$ are the model parameters). The second are the *hyperparameters*, and these typically control the learning algorithm. Hyperparameters influence the way the model learns or fits the model parameters to the data, and as a result influences model performance. Some example of hyperparameters include the learning rate in a neural network, the class weight we used in our logistic regression model earlier, or optimization method in a linear regression.

**Tuning** is the process of finding the set of hyperparameters that yield the best model parameters for the given data set. As you saw in 3.3, we define *best* loosely here as that depends on the specific problem being worked on. Here we will be using the random forest model to provide an introduction to hyperparameter tuning.

## 1.5.1: Grid Search w/ Random Forest
There are many different ways in which hyperparameter tuning can be done (see [Ax](https://ax.dev/docs/why-ax.html) and [Optuna](https://optuna.org/) for recent developments). We will be getting started with one of the earliest hyperparameter tuning methods, Grid Search. As the name suggests, here were are going to select a model, define a grid of hyperparameters to search on, and then gradually and exhaustively search every combination of hyperparameters.

We will be using a model training and evaluation method called cross-validation to determine the best set of hyperparameters - if you're curious and would like to learn more about cross-validation, see [sklearn documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

Here we will be tuning a new Random Forest Classifier.

**Task:**
*   Define a [`StratifiedKFold`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.StratifiedKFold.html) object from sklearn. This allows us to stratify our splits in cross-validation according to class proportions, similar to what we did in training.
   *   Set the `random_state` to the `SEED` value from before.
   *   Ensure that you shuffle the data
   *   Use 5 splits for cross validation
   *   Save this data in the variable `cv`

*   Define a `RandomForestClassifier` class and save the instance in the variable `estimator`. Remember to set `random_state = SEED`.

*   Create a search grid for hyperparameter tuning
  * The parameters we are tuning are `class_weight`, `num_estimators`, and `max_depth`
  *  Values for `class_weight` should be `'balanced'`
  * Values for `n_estimators` should be 50, 100
  * Values for `max_depth` should be 3, 5, 10
  * Save the grid in a `dictionary` called `param_grid`, where the keys are strings of the parameter names and the values are the lists of possible values.

* Based on your experience with the Random Forest and what you know about the problem context, choose an appropriate metric to evaluate the models and set the `scoring` variable to a string for that metric (e.g. `scoring = 'accuracy'`). Keep in mind that since we are working with a heart disease prediction model, we want to favor minimizing false negatives and ensure we identify as many heart disease cases as possible.


  **Hint:**
  *   Reminder that `param_grid` is  a `dictionary`

In [50]:
# Define the stratified cross-validation splitter with the above seed
# 'cv' (Cross-Validation) defines how we will test our model combinations
cv = StratifiedKFold(
    n_splits=5,       # Split the data into 5 folds
    shuffle=True,     # Shuffle the data before splitting
    random_state=SEED # Ensure the shuffling is reproducible
)

# Define the RandomForestClassifier with the above seed
estimator = RandomForestClassifier(
    random_state=SEED # We only set random_state; other parameters will be filled by GridSearch
)

# Define the parameter grid
# 'param_grid' is a dictionary that defines all hyperparameters we want to search through
param_grid = {
    # Dictionary keys must be parameter names recognized by the model (e.g., 'class_weight')
    # Dictionary values must be lists containing all options to try

    'class_weight': ['balanced'],  # Try 'balanced' for handling class imbalance
    'n_estimators': [50, 100],     # Try 50 and 100 trees
    'max_depth': [3, 5, 10]        # Try maximum depths of 3, 5, and 10

    # GridSearch will automatically test all combinations (1 * 2 * 3 = 6 combinations)
}

# Define the metric as a lowercase string (one of 'accuracy', 'recall', 'precision', 'f1')
# 'scoring' defines how we judge which combination is best
# To minimize false negatives (avoid missing diagnoses), we use 'recall' as our metric
scoring = 'recall'

Now we will perform the grid search for Random Forest Classifier and see if we can find a set of hyperparameters that produce a model which outperforms the one we trained in Section 3.3.

**Task:**

* Using [`GridSearchCV`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html) from sklearn, set up the grid search and save this in the variable `search`
* Execute the grid search for the Random Forest model over the hyperparameter space you defined.
* After the best model is found, use it to predict and calculate the recall score on the `X_test` data set, save this in the `search_score` variable

**Hint:**

* This may take a few minutes - if you'd like to see the progress of the grid search procedure, set `verbose=2` in `GridSearchCV`

In [51]:
# Set up the grid search
# Initialize the "experiment manager"
search = GridSearchCV(
    estimator=estimator,        # The model to train
    param_grid=param_grid,      # The hyperparameters to search through
    scoring=scoring,            # The metric to optimize (recall)
    cv=cv,                      # The cross-validation strategy
    verbose=2                   # Print progress updates, e.g., "Fitting 5 folds for 1 of 6 candidates"
)

# Execute the grid search by calling fit
# It will automatically complete all 6 combinations * 5 folds = 30 training runs
search.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END class_weight=balanced, max_depth=3, n_estimators=50; total time=   3.3s
[CV] END class_weight=balanced, max_depth=3, n_estimators=50; total time=   4.4s
[CV] END class_weight=balanced, max_depth=3, n_estimators=50; total time=   3.8s
[CV] END class_weight=balanced, max_depth=3, n_estimators=50; total time=   3.9s
[CV] END class_weight=balanced, max_depth=3, n_estimators=50; total time=   3.3s
[CV] END class_weight=balanced, max_depth=3, n_estimators=100; total time=   7.7s
[CV] END class_weight=balanced, max_depth=3, n_estimators=100; total time=   8.9s
[CV] END class_weight=balanced, max_depth=3, n_estimators=100; total time=   6.6s
[CV] END class_weight=balanced, max_depth=3, n_estimators=100; total time=   8.9s
[CV] END class_weight=balanced, max_depth=3, n_estimators=100; total time=   7.3s
[CV] END class_weight=balanced, max_depth=5, n_estimators=50; total time=   5.8s
[CV] END class_weight=balanced, max_depth=5,

In [None]:
# Compute the score on the testing set using the best model
# The .score() method will now automatically calculate "recall" instead of the default "accuracy"
search_score = search.score(X_test, y_test)

# DO NOT CHANGE ----------------------------------------------------------------
print(f'The best Random Forest model has hyperparameters {search.best_params_}')
print(f'The best model achieves an average cross-validation score of {search.best_score_*100:.2f}%')
print(f'The best model achieves a {scoring} score of {search_score*100:.2f}% on the testing data')
# DO NOT CHANGE ----------------------------------------------------------------


# 1.6 Analysis of Best Parameters

The Grid Search successfully identified the set of hyperparameters that maximized the **Recall** score, confirming the value of the optimization process.

### 1.6.1 The Winning Configuration

The best model parameters found are: **`{'class_weight': 'balanced', 'max_depth': 10, 'n_estimators': 100}`**.

* **Max Depth (The Key Insight):** The optimal `max_depth` was determined to be **10**, which was the highest value tested. This confirms that the model used in Section 3.3.5 (with a manually set `max_depth=5`) was severely **underfitting** the training data. Increasing the depth allowed the model to learn the complex, non-linear boundaries necessary to distinguish minority samples effectively.
* **Class Weight:** The selection of `'balanced'` reaffirms the necessity of weighting the minority class to achieve high sensitivity in this specific classification problem.

### 1.6.2 Performance and Generalization

The primary goal of tuning is to improve performance on unseen data while maintaining stability.

| Score Type | Result | Interpretation |
| :--- | :--- | :--- |
| **Average Cross-Validation Score** | **78.90%** | The model's average Recall during the 5-fold training phase. |
| **Final Test Score (Unseen Data)** | **79.25%** | The ultimate performance of the best model on the true test set. |

**Observation on Robustness:**
The two scores are extremely close, with the final test score being slightly higher. This indicates excellent **generalization ability** and confirms that the model is **not overfitted**. The hyperparameter combination successfully captured the general patterns without memorizing the noise in the training data.

### 1.6.3 Comparative Conclusion

The tuning process successfully elevated the performance of the Random Forest model to near-XGBoost levels, making it a highly competitive alternative.

* **Tuned Random Forest Recall:** **79.25%**
* **Manual Random Forest Recall (Depth 5):** ~76.82%
* **XGBoost Recall (Best Model):** ~79.72%

The hyperparameter tuning achieved a gain of over **2.4 percentage points** in Recall, pushing the Random Forest into the optimal performance zone for minimizing false negatives.