Johann Rajosefa 300300054

Kalala, Hilaire Junior 300289737

Assignment 2 - CSI4142 - Group A-72

Dataset 2: Heart Attack Risk & Prediction Dataset In India

Comprehensive Cardiovascular Health Data Covering Risk Factors, Demographics.

(No need to download the dataset since our code accesses it from our GitHub page)

1) Introduction

The purpose of analyzing this dataset is to evaluate different imputation techniques for handling missing data related to heart attack risk factors. Given that cardiovascular diseases (CVDs) are a major health concern in India, it is essential to ensure data completeness and accuracy for effective predictive modeling and analysis. This experiment will help determine which imputation methods perform best for handling missing values in key health-related attributes.

2) Description of the Dataset

Official website link : https://www.kaggle.com/datasets/ankushpanday2/heart-attack-risk-and-prediction-dataset-in-india/data

Fast download link : https://github.com/KugleBlitz007/CSI4142/blob/main/heart_attack_prediction_india.csv

Dataset Name : Heart Attack Risk & Prediction Dataset In India

Author : Ankit

Purpose : It can be used for predictive modeling, machine learning applications, epidemiological research, and policy analysis to improve early detection and intervention strategies for heart disease.

Shape : This dataset contains 10000 rows (patient samples) and 26 columns

Features :

- Patient_ID - (Categorical): Unique identifier for each patient.  
- State_Name - (Categorical): The state where the patient resides.  
- Age - (Numerical): Patient’s age in years.  
- Gender - (Categorical): Patient’s gender.
- Diabetes - (Categorical) (Binary): Whether the patient has diabetes (Yes 1/No 0).  
- Hypertension - (Categorical) (Binary): Whether the patient has high blood pressure (Yes 1/No 0).  
- Obesity - (Categorical) (Binary): Whether the patient is classified as obese (Yes 1/No 0).  
- Smoking - (Categorical) (Binary): Whether the patient is a smoker (Yes 1/No 0).  
- Alcohol_Consumption - (Categorical) (Binary): If the patient consumes alcohol (Yes 1/ No 0).
- Physical_Activity - (Categorical) (Binary): If the patient is physically active (Yes 1/ No 0).
- Diet_Score - (Numerical): A score representing the healthiness of the patient's diet (0 to 10).  
- Cholesterol_Level - (Numerical): Patient’s total cholesterol level in mg/dL.  
- Triglyceride_Level - (Numerical): Triglyceride level in mg/dL.  
- LDL_Level - (Numerical): Low-Density Lipoprotein (bad cholesterol) level in mg/dL.  
- HDL_Level - (Numerical): High-Density Lipoprotein (good cholesterol) level in mg/dL.  
- Systolic_BP - (Numerical): Systolic blood pressure (mmHg).  
- Diastolic_BP - (Numerical): Diastolic blood pressure (mmHg).  
- Air_Pollution_Exposure - (Categorical) (Binary): If the patient is exposed to air pollution (Yes 1/ No 0).  
- Family_History - (Categorical) (Binary): Whether the patient has a family history of heart disease (Yes 1/No 0).  
- Stress_Level - (Numerical): Self-reported stress level on a scale (1-10).  
- Healthcare_Access - (Categorical) (Binary): If the patient has access to healthcare (Yes 1/ No 0).
- Heart_Attack_History - (Categorical) (Binary): Whether the patient has had a heart attack before (Yes 1/No 0).  
- Emergency_Response_Time - (Numerical): Average emergency response time (we assume it is in minutes).  
- Annual_Income - (Numerical): Patient’s income in local currency.  
- Health_Insurance - (Categorical) (Binary): Whether the patient has health insurance (Yes 1/No 0).  
- Heart_Attack_Risk - (Categorical) (Binary): Predicted risk of heart attack (Yes 1/No 0).  

We now run three tests. Each test applies a different imputation method, and for each one we evaluate how accurately the method recovers the values we removed.
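The general procedure behind these tests (hide some known values, impute them, compare against the originals) can be sketched on toy data. This is a minimal illustration and not the code used in the tests: the DataFrame `toy`, the column names `x` and `y`, and the helper functions are all hypothetical. It uses the standard definitions of the three missingness mechanisms: MCAR does not depend on any value, MAR depends only on another observed column, and MNAR depends on the hidden value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data (hypothetical values, not the heart attack dataset).
toy = pd.DataFrame({
    "x": rng.integers(60, 120, size=1000),  # fully observed helper column
    "y": rng.integers(0, 11, size=1000),    # column that will lose values
})

def mask_mcar(df, frac, rng):
    """MCAR: every row has the same chance of losing its value."""
    return df.index[rng.random(len(df)) < frac]

def mask_mar(df, helper, frac, rng):
    """MAR: missingness depends only on another, observed column."""
    eligible = (df[helper] > df[helper].median()).to_numpy()
    return df.index[eligible & (rng.random(len(df)) < frac)]

def mask_mnar(df, target, frac, rng):
    """MNAR: missingness depends on the hidden value itself."""
    eligible = (df[target] < df[target].median()).to_numpy()
    return df.index[eligible & (rng.random(len(df)) < frac)]

def evaluate(original, imputed):
    """Return (MAE, MSE) between the true and the imputed values."""
    err = np.asarray(imputed, dtype=float) - np.asarray(original, dtype=float)
    return np.abs(err).mean(), (err ** 2).mean()

# Example: hide values under MAR, impute with the median, then evaluate.
idx = mask_mar(toy, "x", 0.2, rng)
truth = toy.loc[idx, "y"].copy()
toy_missing = toy.copy()
toy_missing["y"] = toy_missing["y"].astype(float)
toy_missing.loc[idx, "y"] = np.nan
filled = toy_missing["y"].fillna(toy_missing["y"].median())
mae, mse = evaluate(truth, filled.loc[idx])
print(f"MAE = {mae:.4f}, MSE = {mse:.4f}")
```

Keeping the masking step separate from the imputation step makes it possible to swap either one without touching the evaluation.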

3) Imputation tests

a) For the first test, we choose the Cholesterol_Level attribute because it is numerical and its values in the dataset are fairly uniformly distributed. We simulate data that is missing completely at random (MCAR).

In [86]:
import pandas as pd
import numpy as np
import requests
from io import StringIO
from sklearn.metrics import mean_absolute_error, mean_squared_error


# First we load the data from github
# We saw this method from assignment 1
GITHUB_CSV_URL = "https://raw.githubusercontent.com/KugleBlitz007/CSI4142/refs/heads/main/heart_attack_prediction_india.csv"
response = requests.get(GITHUB_CSV_URL)
data = pd.read_csv(StringIO(response.text))

# Secondly we select the Cholesterol_Level column and we are going to remove 20% of the data
np.random.seed(300300054) 
missing_fraction = 0.2 
cholesterol_col = "Cholesterol_Level"
n_missing = int(missing_fraction * len(data))

# Third we choose indices completely at random and remove them, but we retain the original values for 
# evaluation later
missing_indices = np.random.choice(data.index, size=n_missing, replace=False)
original_values = data.loc[missing_indices, cholesterol_col].copy()  
data.loc[missing_indices, cholesterol_col] = np.nan  

# Fourth we compute the median of the remaining (non-missing) cholesterol values
# and replace every missing value with that median
median_value = data[cholesterol_col].median()
data.loc[:, cholesterol_col] = data[cholesterol_col].fillna(median_value)

# Finally we evaluate the median method using two error metrics, MAE and MSE
imputed_values = data.loc[missing_indices, cholesterol_col]

mae = mean_absolute_error(original_values, imputed_values)
mse = mean_squared_error(original_values, imputed_values)

# We display the result
print(f"We have a Mean Absolute Error (MAE) of: {mae:.4f}")
print(f"And we have a Mean Squared Error (MSE) of: {mse:.4f}")
print("")
print("The original values were:")
print(original_values.head())
print("")
print("The imputed values are:")
print(imputed_values.head())

We have a Mean Absolute Error (MAE) of: 37.9370
And we have a Mean Squared Error (MSE) of: 1899.5990

The original values were:
5097    184
2753    171
1375    171
3537    281
2687    278
Name: Cholesterol_Level, dtype: int64

The imputed values are:
5097    225.0
2753    225.0
1375    225.0
3537    225.0
2687    225.0
Name: Cholesterol_Level, dtype: float64


- Our first method, median imputation (a form of default-value imputation), gave an MAE of about 37.9. Since the values in this column range from 150 to 299, this is a significant deviation. The MSE is also high because it squares each error. Replacing every missing entry with a single value (the median) also removes the variability of the imputed data in this column.

- We can conclude that median imputation was far from successful at recovering the missing values in this test.
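The size of these errors can be checked against a simple back-of-the-envelope calculation. The sketch below assumes, as an idealization, that Cholesterol_Level is exactly uniform over the integers 150 to 299 (the range quoted above); the real column is only approximately uniform. Under that assumption, imputing the median gives an expected MAE of a quarter of the range and an expected MSE equal to the variance of the column.

```python
import numpy as np

# Assumption: an exactly uniform column over the integers 150..299.
values = np.arange(150, 300)
median = np.median(values)

# Error made when every value is replaced by the median.
expected_mae = np.abs(values - median).mean()
expected_mse = ((values - median) ** 2).mean()

print(f"Expected MAE under the uniform assumption: {expected_mae:.4f}")
print(f"Expected MSE under the uniform assumption: {expected_mse:.4f}")
```

This prints an MAE of 37.5 and an MSE of about 1874.92, close to the measured 37.9370 and 1899.5990. That suggests the error comes from imputing a single constant into a widely spread column rather than from a mistake in the procedure.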

b) For our second test, we use the Systolic_BP (systolic blood pressure) column, since this <a href="https://pubmed.ncbi.nlm.nih.gov/18192832/">article</a> states that it is related to Diastolic_BP (diastolic blood pressure). After removing some values under a MAR mechanism, we apply regression imputation, which exploits the relationship between the two columns.

In [89]:
import pandas as pd
import numpy as np
import requests
from io import StringIO
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# First we load the data from github
# We saw this method from assignment 1
GITHUB_CSV_URL = "https://raw.githubusercontent.com/KugleBlitz007/CSI4142/refs/heads/main/heart_attack_prediction_india.csv"
response = requests.get(GITHUB_CSV_URL)
data = pd.read_csv(StringIO(response.text))

# Secondly we select the Systolic_BP column (to impute) and the Diastolic_BP column (the predictor)
np.random.seed(300300054)  
missing_fraction = 0.2
systolic_col = "Systolic_BP"
diastolic_col = "Diastolic_BP"

# Third we pick 20% of the rows whose Diastolic_BP is above the median (MAR) and remove their
# Systolic_BP, but we retain the original values for evaluation later
missing_indices = data[data[diastolic_col] > data[diastolic_col].median()].sample(frac=missing_fraction).index
original_values = data.loc[missing_indices, systolic_col].copy() 
data.loc[missing_indices, systolic_col] = np.nan  

# Then we train the regression model on the rows where Systolic_BP is still observed so it can predict the missing ones
train_data = data.dropna(subset=[systolic_col, diastolic_col])
X_train = train_data[[diastolic_col]]
y_train = train_data[systolic_col]
model = LinearRegression()
model.fit(X_train, y_train)

# Now we predict the missing values
X_missing = data.loc[missing_indices, [diastolic_col]]
predicted_values = model.predict(X_missing)
data.loc[missing_indices, systolic_col] = predicted_values

# Finally, we use the MAE and MSE metrics to evaluate how well the method recovers the missing data
imputed_values = data.loc[missing_indices, systolic_col]

mae = mean_absolute_error(original_values, imputed_values)
mse = mean_squared_error(original_values, imputed_values)

print("Original vs Imputed Values:")
print(pd.DataFrame({"Original": original_values, "Imputed": imputed_values}))
print(f"\nWe have a Mean Absolute Error (MAE) of: {mae:.4f}")
print(f"And we have a Mean Squared Error (MSE) of: {mse:.4f}")

Original vs Imputed Values:
      Original     Imputed
4722       163  134.334678
8338       109  134.486804
3019       172  134.499481
5989       147  134.613575
2084       131  134.689638
...        ...         ...
7226        92  134.689638
8959       131  134.600898
5376        98  134.689638
6320       150  134.638929
7529       153  134.448772

[991 rows x 2 columns]

We have a Mean Absolute Error (MAE) of: 21.9409
And we have a Mean Squared Error (MSE) of: 642.2892


- The imputed values deviate from the original values by about 22 units on average, which is smaller than the deviation of about 38 units in the previous test; the MSE is also lower. In these tests, regression imputation therefore gave a smaller absolute error than median imputation, although the two errors are measured on different columns. We may still want a more accurate method, considering that this is a health-related analysis.
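In the output above, all the imputed values fall between roughly 134.3 and 134.7. One possible reading is that Diastolic_BP carries little linear information about Systolic_BP in this dataset, so the regression predicts something close to the column mean. The sketch below illustrates that behaviour on synthetic data. The two columns are generated independently and their value ranges are assumptions for illustration only, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic, independent columns (assumed ranges, not the real data).
toy = pd.DataFrame({
    "Diastolic_BP": rng.integers(60, 120, size=5000),
    "Systolic_BP": rng.integers(90, 180, size=5000),
})

# Pearson correlation between the predictor and the target.
r = toy["Diastolic_BP"].corr(toy["Systolic_BP"])

# Fit the same kind of model as in the test and look at its predictions.
model = LinearRegression().fit(toy[["Diastolic_BP"]], toy["Systolic_BP"])
preds = model.predict(toy[["Diastolic_BP"]])
spread = preds.max() - preds.min()

print(f"Correlation: {r:.4f}")
print(f"Mean of Systolic_BP: {toy['Systolic_BP'].mean():.2f}")
print(f"Spread of the predictions: {spread:.2f}")
```

Running the same `corr` check on the real columns before fitting would show whether regression imputation can be expected to do better than imputing the mean.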

c) For our final test, we choose the Diet_Score column and remove some data not at random (MNAR), because some patients may either overestimate how healthy their diet is or not want to disclose that information. We are going to use the similarity-based imputation method to recover the missing data.

In [104]:
import pandas as pd
import numpy as np
import requests
from io import StringIO
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error

# First we load the data from github
# We saw this method from assignment 1
GITHUB_CSV_URL = "https://raw.githubusercontent.com/KugleBlitz007/CSI4142/refs/heads/main/heart_attack_prediction_india.csv"
response = requests.get(GITHUB_CSV_URL)
data = pd.read_csv(StringIO(response.text))

# Secondly, we select the Diet_Score column and arbitrarily choose to remove its value for
# a random 20% of the patients who are obese.
np.random.seed(300300054)  
missing_fraction = 0.2  
diet_col = "Diet_Score"
obesity_col = "Obesity" 

# We manually replace the diet values with NaN and retain the original values for evaluation
missing_indices = data[data[obesity_col] == 1].sample(frac=missing_fraction).index
original_values = data.loc[missing_indices, diet_col].copy()  
data.loc[missing_indices, diet_col] = np.nan  

# KNNImputer fills each missing value with the average of its 1000 nearest neighbours, computed on the numeric columns
imputer = KNNImputer(n_neighbors=1000, weights="uniform")
numeric_data = data.select_dtypes(include=[np.number])  
imputed_array = imputer.fit_transform(numeric_data) 
data[numeric_data.columns] = imputed_array  # Restore imputed values back into DataFrame

# Finally we use the MAE and MSE metrics to evaluate our method
imputed_values = data.loc[missing_indices, diet_col]

mae = mean_absolute_error(original_values, imputed_values)
mse = mean_squared_error(original_values, imputed_values)

print("Original vs Imputed Values:")
print(pd.DataFrame({"Original": original_values, "Imputed": imputed_values}))
print(f"\nWe have a Mean Absolute Error (MAE) of: {mae:.4f}")
print(f"And we have a Mean Squared Error (MSE) of: {mse:.4f}")


Original vs Imputed Values:
      Original  Imputed
2167         0    5.080
4600         3    4.976
5542         6    4.928
3548         1    5.109
8391         0    4.898
...        ...      ...
3592         0    5.050
2177         6    5.047
864          0    4.934
8229         7    5.115
1731         5    4.940

[607 rows x 2 columns]

We have a Mean Absolute Error (MAE) of: 2.7177
And we have a Mean Squared Error (MSE) of: 9.9790


- Of the three tests, this one has the smallest absolute error: the imputed values differ from the original values by about 2.7 units on average, and the MSE of about 10 is also the lowest. In our experiments, KNN imputation therefore gave the lowest error when recovering missing data, which is an important finding since this is a health-related dataset.
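Because the three tests use columns with different scales, it can also help to express each MAE as a fraction of the column's value range. The helper below is a hypothetical sketch, not part of the original tests. It uses only figures stated in this report: the MAE values from the outputs, the 150 to 299 range for Cholesterol_Level, and the 0 to 10 scale for Diet_Score. Systolic_BP is left out because its range is not stated here.

```python
def normalized_mae(mae, col_min, col_max):
    """Return the MAE as a fraction of the column's value range."""
    value_range = col_max - col_min
    if value_range <= 0:
        raise ValueError("col_max must be greater than col_min")
    return mae / value_range

# MAE values are copied from the outputs above; ranges come from the text.
results = {
    "Cholesterol_Level (median)": normalized_mae(37.9370, 150, 299),
    "Diet_Score (KNN)": normalized_mae(2.7177, 0, 10),
}

for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

A scale-free figure like this one complements the raw MAE and MSE when the goal is to compare methods that were applied to different attributes.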

4) References :

- ChatGPT 4o
- https://www.geeksforgeeks.org
- https://www.youtube.com/watch?v=R15LjD8aCzc
- https://pubmed.ncbi.nlm.nih.gov/18192832/
- https://scikit-learn.org/stable/