## Predicting Heart Disease with KNeighborsClassifier

### Project Introduction
The **World Health Organization (WHO)** estimates that **17.9 million people** die every year due to **cardiovascular diseases (CVDs)** — making them the leading cause of death globally.

There are several **risk factors** that may contribute to CVD in an individual, such as:

- An unhealthy diet  
- Lack of physical activity  
- Mental health challenges

Being able to **identify these risk factors early** can play a crucial role in preventing **premature deaths**.

---

### Project Goal

In this project, I’ll be using a **heart disease dataset from Kaggle** to build a **K-Nearest Neighbors (KNN) classifier**.  

The aim is to **accurately predict the likelihood** of a patient developing heart disease in the future, based on various health indicators and risk factors.


* This is a supervised machine learning problem: the model learns from labeled patient records which health attributes are associated with heart disease. <br>
* The dataset consists of anonymized patient data collected from multiple hospitals.

#### Data dictionary

**Age:** age of the patient [years]<br>
**Sex:** sex of the patient [M: Male, F: Female]<br>
**ChestPainType:** chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]<br>
**RestingBP:** resting blood pressure [mm Hg]<br>
**Cholesterol:** serum cholesterol [mg/dl]<br>
**FastingBS:** fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]<br>
**RestingECG:** resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]<br>
**MaxHR:** maximum heart rate achieved [Numeric value between 60 and 202]<br>
**ExerciseAngina:** exercise-induced angina [Y: Yes, N: No]<br>
**Oldpeak:** ST depression induced by exercise relative to rest [numeric value]<br>
**ST_Slope:** the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]<br>
**HeartDisease:** output class [1: heart disease, 0: Normal]<br>

### Import required libraries and preview the dataset

In [62]:
# import libraries
import pandas as pd

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [63]:
df = pd.read_csv("C:/Users/DELL/Downloads/heart/predicting_heart.csv")

In [64]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [65]:
df.shape

(918, 12)

In [66]:
df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [67]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [68]:
df.isna().sum()


Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

####  Observations:

- The average age of patients in the dataset is approximately **53.5 years**, with ages ranging from **28 to 77 years**. This suggests the dataset covers a wide spectrum of adult patients, from young adults to the elderly.

- In the **RestingBP** (Resting Blood Pressure) variable, we observe a **minimum value of 0**, which is not physiologically plausible and likely indicates **missing or incorrectly recorded data**. The **maximum value of 200** is abnormally high and may represent patients with severe cardiovascular conditions.

- Similarly, the **Cholesterol** variable also contains a **minimum value of 0**, which again is not realistic in a medical context. This indicates that zero values may have been used as placeholders for missing data and should be treated accordingly during data cleaning.


### Exploring the Dataset Variable Distributions (Categorical)

#### Plotting a bar chart for each categorical variable


In [69]:
categorical_cols = ["Sex", "ChestPainType", "FastingBS", "RestingECG", "ExerciseAngina", "ST_Slope", "HeartDisease"]

fig = plt.figure(figsize=(16,15))

for idx, col in enumerate(categorical_cols):
    ax = plt.subplot(4, 2, idx+1)
    # plain counts per category; the split by HeartDisease comes in the next section
    sns.countplot(x=df[col], ax=ax)

    # add data labels to each bar
    for container in ax.containers:
        ax.bar_label(container, label_type="center")

#### Key Observations from Categorical Variable Distributions

- The dataset contains a significantly higher number of **male patients (725)** compared to **female patients (193)**. This gender imbalance may influence the performance and generalizability of the machine learning model.

- The most frequent **chest pain type** is **ASY (Asymptomatic)**, meaning most patients in the dataset reported no chest pain. The feature may still carry predictive weight in heart disease classification.

- The **target variable (`HeartDisease`)** is fairly **balanced**: about 55% of patients have heart disease and about 45% do not (the mean of `HeartDisease` is 0.553). This is beneficial for training a supervised classification model without requiring additional rebalancing techniques.


### Analysis of Categorical Variables by Heart Disease Status

To better understand the distribution patterns in our dataset, we group the categorical variables by `HeartDisease` status. This approach provides several key benefits:

1. **Comparative Visualization**: By stratifying each categorical variable according to the presence or absence of heart disease, we can identify potential risk factors or protective characteristics.

2. **Pattern Identification**: The grouped distributions may reveal clinically meaningful associations between patient characteristics and cardiovascular outcomes.

3. **Data Quality Assessment**: This grouping helps verify whether our dataset contains sufficient representation across all categories for both disease states, which is crucial for modeling.

The following visualization presents count distributions for all categorical variables, stratified by heart disease status:

In [70]:
fig = plt.figure(figsize=(16,15))

for idx, col in enumerate(categorical_cols[:-1]):
    ax = plt.subplot(4, 2, idx+1)
    # group by HeartDisease
    sns.countplot(x=df[col], hue=df["HeartDisease"], ax=ax)
    # add data labels to each bar
    for container in ax.containers:
        ax.bar_label(container, label_type="center")

### Key Observations from Categorical Feature Distributions

- The dataset is notably **skewed toward male patients**. Only **50 female patients** have been diagnosed with heart disease.
  
- A large number of patients diagnosed with heart disease (**392**) had the **asymptomatic (ASY)** chest pain type.  
  Chest pain is usually considered a relevant symptom, yet most heart disease patients in this dataset did not report it.

- Among patients with **fasting blood sugar > 120 mg/dl**, **170** were diagnosed with heart disease, a clear majority of that group.

- Out of all patients who experienced **exercise-induced angina**, **316** were diagnosed with heart disease.

- Among patients with a **flat ST slope**, **381** were diagnosed with heart disease.

---
**Conclusion**:  
From the distribution of the above categorical features, we can begin to identify **potentially relevant predictors** for our model. However, we will **perform data cleaning** first before finalizing our feature selection.


### Data Cleaning Operation

- We identified that there are **no missing values** in the dataset.

- However, as observed earlier, a few columns contain **zero values that are not logically valid**.

- Specifically, we will examine the number of zero values in the **`RestingBP`** and **`Cholesterol`** columns, and then decide on the appropriate strategy to handle them.
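
Counting the zeros directly gives a quicker answer than scanning the filtered rows. The following is a minimal sketch, assuming a small made-up frame (`toy`, a hypothetical stand-in for `df`) with illustrative values only:

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "RestingBP":   [140, 0, 130, 120],
    "Cholesterol": [289, 0, 0, 214],
})

# Comparing with 0 yields a boolean frame; summing counts the True values per column
zero_counts = (toy[["RestingBP", "Cholesterol"]] == 0).sum()
print(zero_counts)
```

The same expression could be applied to `df` to get the per-column counts for the real data.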


In [71]:
df[df['RestingBP'] == 0]

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
449,55,M,NAP,0,0,0,Normal,155,N,1.5,Flat,1


In [72]:
df[df["Cholesterol"] == 0]


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
293,65,M,ASY,115,0,0,Normal,93,Y,0.0,Flat,1
294,32,M,TA,95,0,1,Normal,127,N,0.7,Up,1
295,61,M,ASY,105,0,1,Normal,110,Y,1.5,Up,1
296,50,M,ASY,145,0,1,Normal,139,Y,0.7,Flat,1
297,57,M,ASY,110,0,1,ST,131,Y,1.4,Up,1
...,...,...,...,...,...,...,...,...,...,...,...,...
514,43,M,ASY,122,0,0,Normal,120,N,0.5,Up,1
515,63,M,NAP,130,0,1,ST,160,N,3.0,Flat,0
518,48,M,NAP,102,0,1,ST,110,Y,1.0,Down,1
535,56,M,ASY,130,0,0,LVH,122,Y,1.0,Flat,1


### Handling Zero Values in `RestingBP` and `Cholesterol`

Upon analysis:

- The `RestingBP` column contains **only one row** with a value of `0`. Since this is clearly invalid and isolated, we will **remove this row from the dataset**.

- The `Cholesterol` column, however, has **172 zero values**, which is a **significant proportion** of the dataset (almost 19% of the 918 rows).  
  Removing all these rows would result in **substantial data loss**, so we will impute them with a median instead. Median imputation isn't a perfect solution, but it provides a **practical balance between data integrity and completeness**.

To make the imputed values more representative:

- We will apply **conditional median replacement** based on the `HeartDisease` status:
  - For patients **diagnosed with heart disease** (`HeartDisease == 1`), zero values in `Cholesterol` will be replaced with the **median `Cholesterol` value** of that group.
  - For patients **not diagnosed with heart disease** (`HeartDisease == 0`), zero values will be replaced with the **median of their own group**.
  - In the code below, each group's median is computed over all of its values, zeros included.

This approach allows us to preserve patterns in the data while correcting invalid entries, leading to a cleaner and more reliable dataset for modeling.


In [75]:
# The RestingBP column has only one row with a value of 0. 
# Since it's an outlier, we will drop this row from the dataset:

df_clean = df.copy()
df_clean = df_clean[df_clean["RestingBP"] != 0]
df_clean.head()


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [76]:
# True for patients WITHOUT heart disease (HeartDisease == 0)
no_heartdisease_mask = df_clean["HeartDisease"] == 0

cholesterol_without_heartdisease = df_clean.loc[no_heartdisease_mask, "Cholesterol"]
cholesterol_with_heartdisease = df_clean.loc[~no_heartdisease_mask, "Cholesterol"]

# Replace zeros with the median of the same group.
# Note: each median is computed over the whole group, zeros included.
df_clean.loc[no_heartdisease_mask, "Cholesterol"] = cholesterol_without_heartdisease.replace(to_replace = 0, value = cholesterol_without_heartdisease.median())
df_clean.loc[~no_heartdisease_mask, "Cholesterol"] = cholesterol_with_heartdisease.replace(to_replace = 0, value = cholesterol_with_heartdisease.median())

In [77]:
df_clean[["Cholesterol", "RestingBP"]].describe()


Unnamed: 0,Cholesterol,RestingBP
count,917.0,917.0
mean,239.700109,132.540894
std,54.352727,17.999749
min,85.0,80.0
25%,214.0,120.0
50%,225.0,130.0
75%,267.0,140.0
max,603.0,200.0


Both minimum values have changed: `Cholesterol` now starts at 85 and `RestingBP` at 80, so neither column contains zeros anymore. This satisfies our cleaning requirement.
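
The imputation cell above computes each group's median over all values in the group, which includes the zeros being replaced. A possible variant excludes the zeros before taking the median. The sketch below is not the method used in this notebook; the helper name `impute_zero_with_group_median` and the toy values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def impute_zero_with_group_median(frame, column, group_col):
    """Replace zeros in `column` with the median of the non-zero values
    from the same `group_col` group (hypothetical helper)."""
    out = frame.copy()
    # Mark zeros as missing so that median() skips them
    values = out[column].astype(float).replace(0, np.nan)
    # Per-row median of the row's own group, computed on non-missing values only
    group_medians = values.groupby(out[group_col]).transform("median")
    out[column] = values.fillna(group_medians)
    return out

# Made-up rows (illustrative values only)
toy = pd.DataFrame({
    "Cholesterol":  [0, 200, 240, 0, 180, 220],
    "HeartDisease": [1, 1, 1, 0, 0, 0],
})

imputed = impute_zero_with_group_median(toy, "Cholesterol", "HeartDisease")
print(imputed)
```

With this variant, the summary statistics from `describe()` would differ slightly from the ones shown above.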



### Feature Selection


#### My Feature Selection Approach

Based on the insights gathered from my exploratory data analysis (EDA) and a solid understanding of the dataset, I’ve identified a few features that I believe are worth exploring further:

- `Age`
- `Sex`
- `ChestPainType`
- `Cholesterol`
- `FastingBS`

These features stood out as potentially useful for predicting heart disease.

---

#### What's Next?

To dig deeper, I want to understand **how strongly each of these features is related to the target variable** (`HeartDisease`). This will help me focus on the features that truly matter and avoid noise in the model.

---

#### Getting the Data Ready

Since some of my features are categorical, I’ll need to **convert them into numerical format** before running any correlation checks. I’ll use **one-hot encoding (dummy variables)** to transform them.

This step is crucial for ensuring all the features are model-ready and can be properly analyzed for their relationship with the target.
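
To make the effect of one-hot encoding with `drop_first=True` concrete, here is a small sketch on a made-up frame (`toy` is hypothetical and its values are illustrative only). For each categorical column, the first category in sorted order is dropped and becomes the implicit baseline:

```python
import pandas as pd

# Made-up rows (illustrative values only)
toy = pd.DataFrame({
    "Age":      [40, 49, 37],
    "Sex":      ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
})

# Numeric columns pass through unchanged; each categorical column becomes
# one indicator column per category, minus the first category
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Here `Sex_F` and `ST_Slope_Down` are the dropped baselines, which is why the real encoded frame has `Sex_M`, `ST_Slope_Flat`, and `ST_Slope_Up` but no `Sex_F` or `ST_Slope_Down`.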


In [78]:
df_clean = pd.get_dummies(df_clean, drop_first=True)
df_clean.head()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_M,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ExerciseAngina_Y,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,0,True,True,False,False,True,False,False,False,True
1,49,160,180,0,156,1.0,1,False,False,True,False,True,False,False,True,False
2,37,130,283,0,98,0.0,0,True,True,False,False,False,True,False,False,True
3,48,138,214,0,108,1.5,1,False,False,False,False,True,False,True,True,False
4,54,150,195,0,122,0.0,0,True,False,True,False,True,False,False,False,True


In [79]:
correlations = abs(df_clean.corr())
plt.figure(figsize=(12,8))
sns.heatmap(correlations, annot=True, cmap="Blues")

<Axes: >

In [80]:
plt.figure(figsize=(12,8))
sns.heatmap(correlations[correlations > 0.3], annot=True, cmap="Blues")

<Axes: >

### Correlation Insights

From the correlation heatmap, I identified the following features whose **absolute correlation** with `HeartDisease` is greater than **0.3**. Because the heatmap plots `abs(df_clean.corr())`, this list mixes positive and negative relationships:

- `Oldpeak`
- `MaxHR`
- `ChestPainType_ATA`
- `ExerciseAngina_Y`
- `ST_Slope_Flat`
- `ST_Slope_Up`

>  *Note:* The correlation coefficient threshold of 0.3 was chosen arbitrarily as a starting point for feature relevance.

Interestingly, **Cholesterol** did **not** show a strong correlation with heart disease. That’s a bit surprising, so for now, I'm considering **excluding it** from the model.
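
Since the heatmap is built from `abs(df_clean.corr())`, it hides the direction of each relationship. The sketch below shows one way to keep the sign while still ranking by strength. It runs on a made-up frame (`toy`, hypothetical, illustrative values only), not on the real data:

```python
import pandas as pd

# Made-up rows (illustrative values only)
toy = pd.DataFrame({
    "HeartDisease": [0, 0, 0, 1, 1, 1],
    "Oldpeak":      [0.0, 0.5, 0.2, 1.5, 2.0, 1.0],
    "MaxHR":        [170, 160, 150, 120, 130, 110],
})

# Signed Pearson correlation of every feature with the target
signed = toy.corr()["HeartDisease"].drop("HeartDisease")

# Rank by absolute strength but keep the sign visible
ranked = signed.sort_values(key=abs, ascending=False)
print(ranked)
```

In this toy example `Oldpeak` correlates positively with the target and `MaxHR` negatively, a distinction the absolute-value heatmap cannot show.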

---

### Final Feature Selection

Based on both EDA findings and correlation results, I’m narrowing down my feature set to:

- `Oldpeak`
- `Sex_M`  
  *(It has a lower correlation coefficient, but EDA showed a noticeable trend, so I’m keeping it in.)*
- `ExerciseAngina_Y`
- `ST_Slope_Flat`
- `ST_Slope_Up`

---

###  Ready to Model

With these features selected, I’m now ready to move forward and build my predictive model!


In [93]:
X = df_clean.drop(["HeartDisease"], axis=1)
y = df_clean["HeartDisease"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=417)
    
features = [
    "Oldpeak",
    "Sex_M",
    "ExerciseAngina_Y",
    "ST_Slope_Flat",
    "ST_Slope_Up"
]

### Individual Feature Modeling

#### One Feature at a Time

I’ll create an individual model for each of the features I’ve selected and check how well it performs, using **accuracy** as my key metric.

Doing this will give me a clear picture of which features carry the most predictive power on their own, before I move on to combining them into a more complete model.


In [98]:
for feature in features:
    knn = KNeighborsClassifier(n_neighbors = 3)
    knn.fit(X_train[[feature]], y_train)
    accuracy = knn.score(X_val[[feature]], y_val)
    print(f"The k-NN classifier trained on {feature} and with k = 3 has an accuracy of {accuracy*100:.2f}%")

The k-NN classifier trained on Oldpeak and with k = 3 has an accuracy of 58.70%
The k-NN classifier trained on Sex_M and with k = 3 has an accuracy of 61.59%
The k-NN classifier trained on ExerciseAngina_Y and with k = 3 has an accuracy of 73.19%
The k-NN classifier trained on ST_Slope_Flat and with k = 3 has an accuracy of 81.88%
The k-NN classifier trained on ST_Slope_Up and with k = 3 has an accuracy of 55.07%


My **best performing model**, with an accuracy of approximately **82%**, was trained on the **`ST_Slope_Flat`** feature.  
The **`ExerciseAngina_Y`** feature came second at about **73%**, while the other three features each scored between 55% and 62%.

These results are consistent with the **data distributions** I explored earlier during EDA.



### Building a Classifier with Multiple Features
####  Preparing the Data: Normalization

Before I train the model using the features I’ve selected, it’s important to **normalize the data** so that each feature contributes fairly during the distance calculations in KNN.

To do this, I’ll use **`MinMaxScaler` from scikit-learn** to scale all feature values to a range between **0 and 1**. Once that’s done, I’ll move ahead with training the model on the normalized data.
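
As a quick illustration of what `MinMaxScaler` does, the sketch below uses made-up one-column arrays (`train` and `val` are hypothetical). The scaler learns the minimum and maximum from the training data only and applies `(x - min) / (max - min)` to both sets, so validation values outside the training range can land outside 0 to 1:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data (illustrative values only)
train = np.array([[0.0], [2.0], [4.0]])
val = np.array([[1.0], [6.0]])

demo_scaler = MinMaxScaler()
# fit_transform learns min=0 and max=4 from the training data
train_scaled = demo_scaler.fit_transform(train)
# transform reuses the training min and max; 6.0 maps above 1
val_scaled = demo_scaler.transform(val)

print(train_scaled.ravel())
print(val_scaled.ravel())
```

Fitting on the training set alone keeps information from the validation set out of the preprocessing step.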

In [96]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train[features])
X_val_scaled = scaler.transform(X_val[features])


In [97]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train_scaled, y_train)
accuracy = knn.score(X_val_scaled, y_val)
print(f"Accuracy: {accuracy*100:.2f}")

Accuracy: 83.33


The model's accuracy rose to approximately **83%**, up from about **82%** for the best single-feature model.  

While this isn’t a huge improvement, it’s a good start. Using all the selected features together has helped, and now it's time to explore which hyperparameters might give even better performance.


### Hyperparameter Optimization

Now that I’ve got a working model, it’s time to take things a step further with **hyperparameter tuning** to see if I can squeeze out even better performance.

But first, I’ll make sure my data is properly prepared and ready for optimization.


In [100]:
X = df_clean.drop(["HeartDisease"], axis=1)
y = df_clean["HeartDisease"]

# Same test_size and random_state as before, so X_test holds the same rows as the earlier X_val
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state = 417)

features = [
    "Oldpeak",
    "Sex_M",
    "ExerciseAngina_Y",
    "ST_Slope_Flat",
    "ST_Slope_Up"
]

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train[features])

To optimize our K-Nearest Neighbors (KNN) model, we’ll use **Grid Search** to explore a range of hyperparameter values.

Here’s what we’ll be testing:

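- **`k` (`n_neighbors` in scikit-learn):**  
  Values ranging from **1 to 19** (`range(1, 20)` excludes the upper bound)
  
- **Distance Metrics:**  
  - `minkowski` (default metric in `sklearn.neighbors.KNeighborsClassifier`; with the default `p=2` it is the Euclidean distance)  
  - `manhattan`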

> While `minkowski` is the default and often performs well, we’ll also try `manhattan` just to see if it yields any surprising improvements.
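
The two metrics are closely related: the Minkowski distance with `p=2` is the Euclidean distance, and with `p=1` it is the Manhattan distance. The sketch below illustrates this with a hand-written helper (`minkowski_distance` is a hypothetical function for illustration, not part of scikit-learn):

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two equal-length points (illustrative helper)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

# p=2: straight-line (Euclidean) distance
euclidean = minkowski_distance(point_a, point_b, p=2)
# p=1: sum of absolute coordinate differences (Manhattan)
manhattan = minkowski_distance(point_a, point_b, p=1)

print(euclidean, manhattan)
```

So the grid search effectively compares Euclidean and Manhattan distance on the scaled features.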


In [85]:
grid_params = {"n_neighbors": range(1,20), "metric": ["minkowski", "manhattan"]}

knn = KNeighborsClassifier()
knn_grid = GridSearchCV(knn, grid_params, scoring="accuracy")
knn_grid.fit(X_train_scaled, y_train)


In [88]:
knn_grid.best_score_*100, knn_grid.best_params_

(83.43507030603806, {'metric': 'minkowski', 'n_neighbors': 19})

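Our best KNN model achieved a **mean cross-validated accuracy of approximately 83%**, using:

- **`n_neighbors = 19`**
- **Distance metric = `minkowski`**

Note that `19` is the largest value in the grid, so a wider search range might find an even better `k`.

>  **GridSearchCV** uses a **cross-validation** approach (5 folds by default), so this accuracy is likely a more reliable estimate of how the model will perform on unseen data.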
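
To see how much the score varies between folds and between parameter combinations, the `cv_results_` attribute of a fitted `GridSearchCV` can be inspected. The sketch below uses synthetic data from `make_classification` and a deliberately tiny grid, both assumptions made for illustration; the numbers it prints say nothing about the heart disease data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_params = {"n_neighbors": [3, 5], "metric": ["minkowski", "manhattan"]}
demo_grid = GridSearchCV(KNeighborsClassifier(), demo_params, scoring="accuracy", cv=5)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, with the mean and spread across the 5 folds
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_metric", "param_n_neighbors", "mean_test_score", "std_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))
```

`best_score_` is the highest `mean_test_score` in this table, and `std_test_score` gives a sense of how stable that estimate is.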

Now, it's time to evaluate our best model on the **test set** to see how well it generalizes to new data.


### Model Evaluation on Test Set

Before I can evaluate how well the model performs on unseen data, I need to **normalize the test set** just like I did with the training set.  
This ensures consistency in scale and helps the model make fair predictions.

In [104]:
X_test_scaled = scaler.transform(X_test[features])
predictions = knn_grid.best_estimator_.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predictions)
print(f" Model Accuracy on test set: {accuracy*100:.2f}")

 Model Accuracy on test set: 86.96


### Test Set Accuracy and Initial Insights

The model achieved an impressive **accuracy of ~87%** on the test set. That’s a strong result! 

It suggests that the model is able to correctly predict whether a patient is at risk for heart disease roughly **87% of the time**. 

However, the test accuracy is higher than the cross-validated accuracy of about 83%, which raises a question: could the improvement be too good to be true?

The test set contains only **138 patients**, so the gap between 83% and 87% corresponds to roughly five patients. One possible explanation might lie in the **distribution of the test data**. We’ll need to investigate this further to confirm whether the performance truly reflects the model’s ability or if the test data happened to be particularly favorable.
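
One way to judge whether the gap is meaningful is to estimate the uncertainty of an accuracy measured on a small test set. The sketch below uses the normal approximation to the binomial, which is an assumption of this example and not something computed in the notebook. The test-set size of 138 comes from the 15% split of the 917 cleaned rows:

```python
import math

n_test = 138        # rows in the test set (15% of the 917 cleaned rows)
accuracy = 0.8696   # reported test accuracy

# Standard error of a proportion under the normal approximation
standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Approximate 95% interval: accuracy +/- 1.96 standard errors
margin = 1.96 * standard_error
low, high = accuracy - margin, accuracy + margin

print(f"approx. 95% interval: {low:.3f} to {high:.3f}")
```

The interval spans roughly 81% to 93%, so it includes the cross-validated accuracy of about 83%. Under this approximation, the difference could be due to chance alone.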

In [106]:
print("Distribution of patients by their sex in the entire dataset")
print(X.Sex_M.value_counts())

print("\nDistribution of patients by their sex in the training dataset")
print(X_train.Sex_M.value_counts())

print("\nDistribution of patients by their sex in the test dataset")
print(X_test.Sex_M.value_counts())

Distribution of patients by their sex in the entire dataset
Sex_M
True     724
False    193
Name: count, dtype: int64

Distribution of patients by their sex in the training dataset
Sex_M
True     615
False    164
Name: count, dtype: int64

Distribution of patients by their sex in the test dataset
Sex_M
True     109
False     29
Name: count, dtype: int64


We included **Sex** as one of the features while training our model. Here's how the gender distribution looks across our datasets:

- **Full Dataset (X):** 724 males, 193 females  
- **Training Set (X_train):** 615 males, 164 females  
- **Test Set (X_test):** 109 males, 29 females  

As you can see, there's a **noticeable imbalance**, with male patients far outnumbering female patients. The ratio is almost identical in every subset (about 79% male), so the split itself did not skew the test set. We touched on the imbalance earlier, but it's important to highlight again: it can **introduce bias** into our model.

If the training data contains significantly more male patients, the model might naturally become better at making predictions for males simply because it has more examples to learn from. The test set reflects this too: with only 29 female patients, it is harder to fairly evaluate how well the model performs on that group.

While this imbalance could contribute to the **performance discrepancy** we observe, the identical proportions suggest it is not the whole story, and **other factors could be at play**, such as the small size of the test set. It's a reminder to stay mindful of the data we use and consider how it might shape our model's behavior.
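
To check whether the model really performs differently for male and female patients, accuracy can be computed separately per group. The sketch below uses a hypothetical helper (`accuracy_by_group`) and made-up labels and predictions; it is not a result from this notebook:

```python
import numpy as np
import pandas as pd

def accuracy_by_group(y_true, y_pred, groups):
    """Share of correct predictions within each group (illustrative helper)."""
    frame = pd.DataFrame({
        "correct": np.asarray(y_true) == np.asarray(y_pred),
        "group": np.asarray(groups),
    })
    # The mean of a boolean column is the fraction of True values
    return frame.groupby("group")["correct"].mean()

# Made-up labels, predictions, and group membership (illustrative only)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]
sex = ["M", "M", "M", "M", "F", "F"]

per_group = accuracy_by_group(y_true, y_pred, sex)
print(per_group)
```

On the real data, the inputs would be `y_test`, the test-set predictions, and `X_test["Sex_M"]`. With only 29 female patients in the test set, the female accuracy would itself be a noisy estimate.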


## Conclusion

In this project, we explored and analyzed a heart disease dataset to build a predictive model using the K-Nearest Neighbors (KNN) algorithm. Through thorough exploratory data analysis (EDA), we identified key characteristics and patterns within the data, such as the skew towards male patients and the distribution of chest pain types, fasting blood sugar levels, and exercise-induced angina in relation to heart disease presence.

Data cleaning was essential, especially to address unrealistic zero values in `RestingBP` and `Cholesterol`. We dropped the single row with a zero `RestingBP` and replaced the zero `Cholesterol` values with medians segmented by heart disease status, which improved data quality and reliability.

Feature selection was guided by correlation analysis and domain knowledge. Surprisingly, some commonly assumed predictors like cholesterol showed low correlation with heart disease, while features like `Oldpeak`, `MaxHR`, and certain chest pain types exhibited stronger associations.

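The final feature set (`Oldpeak`, `Sex_M`, `ExerciseAngina_Y`, `ST_Slope_Flat`, `ST_Slope_Up`) was chosen based on these insights. After a grid search, the best KNN model (`n_neighbors = 19`, `minkowski` metric) reached about **83%** cross-validated accuracy and about **87%** accuracy on the test set.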

Overall, this project demonstrated the importance of detailed EDA and data preprocessing in developing effective machine learning models for healthcare applications. Early and accurate prediction of heart disease risk can significantly contribute to preventative care and reduce premature mortality related to cardiovascular diseases.
