# Building a Well-Performing ML Model for Heart Disease Prediction

In this notebook, we aim to build a predictive model for heart disease using clinical and demographic data. The goal is to identify the best-performing model through exploratory data analysis (EDA) and robust evaluation metrics. We will discuss our approach, methodology, and final results.

# Problem Statement

The task is to predict the presence (1) or absence (0) of cardiovascular disease based on a dataset containing clinical and demographic features.
Our objectives include preprocessing and exploring the data to understand its structure, class distribution, and potential issues. Then we


# Exploratory Data Analysis (EDA)
In this section, we load the dataset from Google Drive, examine the first few rows and the overall structure, and check for missing values, outliers, and class imbalances.
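The three checks named above (missing values, outliers, class balance) can be sketched as a small helper. This is a minimal illustration, not the notebook's own analysis: `eda_summary` is a hypothetical function, the 1.5 × IQR rule is one common outlier convention chosen here as an assumption, and the toy rows reuse the first five training rows printed later in the notebook, with one `weight` value deliberately blanked so the missing-value check has something to find.

```python
import pandas as pd

# Toy rows mirroring a few of the notebook's columns (illustrative only).
toy = pd.DataFrame({
    "age": [18995, 17319, 19017, 20388, 18236],   # appears to be in days (18995 days is about 52 years)
    "height": [162, 158, 165, 164, 156],
    "weight": [83.0, 64.0, 95.0, 83.0, None],      # last value blanked on purpose for the demo
    "ap_hi": [120, 120, 160, 150, 100],
    "ap_lo": [80, 80, 100, 100, 67],
    "cardio": [1, 0, 1, 1, 0],
})


def eda_summary(df: pd.DataFrame, target: str) -> dict:
    """Hypothetical helper: missing counts, class shares, IQR outlier counts."""
    features = df.drop(columns=[target])
    # Number of missing entries per column
    missing = df.isna().sum()
    # Share of each class in the target column
    class_share = df[target].value_counts(normalize=True)
    # Count values lying more than 1.5 * IQR outside the quartiles
    q1 = features.quantile(0.25)
    q3 = features.quantile(0.75)
    iqr = q3 - q1
    outliers = ((features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)).sum()
    return {"missing": missing, "class_share": class_share, "outliers": outliers}


summary = eda_summary(toy, "cardio")
print(summary["missing"])
print(summary["class_share"])
print(summary["outliers"])
```

On the real data the same call would be `eda_summary(train_data.drop(columns=["id"]), "cardio")`, assuming the column names shown in the `info()` output.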

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
import os

In [5]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [7]:
# Load the train, test, and sample-submission datasets (pandas is already imported above)
train_data = pd.read_csv('/content/drive/MyDrive/train.csv')
test_data = pd.read_csv('/content/drive/MyDrive/test.csv')
sample_submission = pd.read_csv('/content/drive/MyDrive/sample.csv')


In [8]:
print(train_data.head())
print(train_data.info())

      id    age  gender  height  weight  ap_hi  ap_lo  cholesterol  gluc  \
0  83327  18995       2     162    83.0    120     80            1     1   
1  86196  17319       1     158    64.0    120     80            1     1   
2  59158  19017       1     165    95.0    160    100            2     1   
3  16399  20388       1     164    83.0    150    100            1     1   
4  29470  18236       1     156    52.0    100     67            1     1   

   smoke  alco  active  cardio  
0      0     0       0       1  
1      0     0       1       0  
2      0     0       1       1  
3      0     0       1       1  
4      0     0       0       0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56000 entries, 0 to 55999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           56000 non-null  int64  
 1   age          56000 non-null  int64  
 2   gender       56000 non-null  int64  
 3   height       56000 no

In [9]:
print(test_data.head())
print(test_data.info())

      id    age  gender  height  weight  ap_hi  ap_lo  cholesterol  gluc  \
0  26681  19386       1     155    59.5    120     85            1     1   
1  58585  21081       1     160    59.0    130     90            1     1   
2  54339  15129       2     175    88.0    120     80            2     1   
3  17273  18785       2     177    62.0    120     90            1     1   
4  25420  18171       1     167    81.0    120     80            1     1   

   smoke  alco  active  
0      0     0       1  
1      0     0       1  
2      0     0       1  
3      0     0       1  
4      0     0       1  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14000 entries, 0 to 13999
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           14000 non-null  int64  
 1   age          14000 non-null  int64  
 2   gender       14000 non-null  int64  
 3   height       14000 non-null  int64  
 4   weight       14000 non-null

In [10]:
print(sample_submission.head())
print(sample_submission.info())

     id  cardio
0  3001       0
1  3002       0
2  3003       0
3  3004       0
4  3005       0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      21 non-null     int64
 1   cardio  21 non-null     int64
dtypes: int64(2)
memory usage: 468.0 bytes
None


In [11]:
X_train = train_data.drop(['id', 'cardio'], axis=1)
y_train = train_data['cardio']

X_test = test_data.drop(['id'], axis=1)

In [12]:
# Data Preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Scale the test data using the same scaler

# Model Training with GridSearchCV
rf_model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

In [13]:
grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)
print("Best parameters:", grid_search.best_params_)

Best parameters: {'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 100}
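In the cells above, the scaler is fit on the full training set before cross-validation, so each validation fold has already influenced the scaling statistics. One way to keep preprocessing inside each fold is to wrap the scaler and the classifier in a `Pipeline`. The sketch below runs on synthetic data from `make_classification` (an assumption, standing in for the real CSV) with a deliberately small grid; the names `pipe`, `demo_grid`, and `demo_search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (11 features, as in the notebook).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=42)

# The scaler is refit on the training part of every CV fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# The "rf__" prefix routes each parameter to the pipeline step named "rf".
demo_grid = {
    "rf__n_estimators": [20, 40],
    "rf__max_depth": [None, 5],
}

demo_search = GridSearchCV(pipe, demo_grid, scoring="f1", cv=3)
demo_search.fit(X_demo, y_demo)

print("Best parameters:", demo_search.best_params_)
print("Mean cross-validated F1:", demo_search.best_score_)
```

`best_score_` is the mean F1 across validation folds for the winning configuration, which makes it a more honest estimate than scoring the refit model on rows it was trained on. Tree-based models such as random forests are largely insensitive to feature scaling, so the practical effect here is small, but the pattern matters once a scale-sensitive model is swapped in.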


In [17]:
# Evaluate F1 on a 20% hold-out split of the training data, since the test set has no labels.
# Caveat: grid_search was already fit on all of X_train_scaled, so these "held-out" rows were
# seen during training; the score below is optimistic and is not a true test-set estimate.
from sklearn.model_selection import train_test_split
X_train_sampled, X_test_sampled, y_train_sampled, y_test_sampled = train_test_split(
    X_train_scaled, y_train, test_size=0.2, random_state=42
)

# Use the best model from grid search
best_rf_model = grid_search.best_estimator_

# Make predictions on the sampled test data
y_pred_sampled = best_rf_model.predict(X_test_sampled)

# Evaluate the model using f1_score
f1 = f1_score(y_test_sampled, y_pred_sampled)
print(f"F1 Score on Sampled Test Data: {f1}")


F1 Score on Sampled Test Data: 0.852606015318513
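Because `grid_search` was fit on every training row before the split above was made, the 20% slice is not unseen data. A cleaner ordering is to split first, fit only on the larger part, and score on the rest. The sketch below shows that ordering on synthetic data (an assumption, so it runs without the CSV files); it reuses the best parameters reported earlier, and `holdout_model` and `val_f1` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled training data.
X_all, y_all = make_classification(n_samples=500, n_features=11, random_state=42)

# Split first, so the validation rows never reach fit().
X_fit, X_val, y_fit, y_val = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Hyperparameters taken from the grid search result reported in the notebook.
holdout_model = RandomForestClassifier(
    n_estimators=100, max_depth=20, min_samples_split=5, random_state=42
)
holdout_model.fit(X_fit, y_fit)

val_f1 = f1_score(y_val, holdout_model.predict(X_val))
print(f"Validation F1 on unseen rows: {val_f1:.3f}")
```

With the real data, the grid search itself would also be run on `X_fit` only, so that neither the hyperparameter choice nor the final fit has seen the validation rows.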


In [18]:
# Model Prediction on Test Data
best_model = grid_search.best_estimator_
y_test_pred = best_model.predict(X_test_scaled)

# Save the predictions
predictions = pd.DataFrame({'id': test_data['id'], 'cardio': y_test_pred})
predictions.to_csv('heart_disease_predictions.csv', index=False)

print("Predictions saved to 'heart_disease_predictions.csv'")

Predictions saved to 'heart_disease_predictions.csv'


In [20]:
# Sort heart_disease_predictions.csv by id and save it again

# Load the predictions
predictions = pd.read_csv('heart_disease_predictions.csv')

# Sort by 'id'
predictions_sorted = predictions.sort_values('id')

# Save the sorted predictions
predictions_sorted.to_csv('heart_disease_predictions.csv', index=False)


In [19]:
# Model evaluation on the training data (in-sample score, not a generalization estimate)
best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train_scaled)
train_f1 = f1_score(y_train, y_train_pred)
print("Training F1 Score:", train_f1)

Training F1 Score: 0.8580901115684777
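The training F1 above is an in-sample number, and the imports at the top of the notebook include `confusion_matrix`, `classification_report`, and `roc_auc_score`, none of which are called. A sketch of how they might be applied to held-out rows follows. It uses synthetic data and a small forest purely as an assumption for illustration; `clf`, `cm`, and `auc` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled data.
X_syn, y_syn = make_classification(n_samples=400, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

y_hat = clf.predict(X_te)                 # hard 0/1 labels
y_score = clf.predict_proba(X_te)[:, 1]   # probability of class 1, needed for ROC AUC

cm = confusion_matrix(y_te, y_hat)        # rows: true class, columns: predicted class
auc = roc_auc_score(y_te, y_score)

print(cm)
print(classification_report(y_te, y_hat))
print(f"ROC AUC: {auc:.3f}")
```

The confusion matrix separates false positives from false negatives, which a single F1 value hides; in a screening setting the two kinds of error typically carry different costs.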


In [21]:
import pandas as pd

try:
    predictions = pd.read_csv('heart_disease_predictions.csv')

    # Check if the 'id' column exists
    if 'id' not in predictions.columns:
        raise ValueError("The 'id' column is missing from the predictions file.")

    # Check if all 'cardio' values are either 0 or 1
    if not predictions['cardio'].isin([0, 1]).all():
        raise ValueError("The 'cardio' column contains values other than 0 or 1.")

    print("Predictions file is valid.")

except FileNotFoundError:
    print("Error: 'heart_disease_predictions.csv' not found.")
except ValueError as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")


Predictions file is valid.
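The check above confirms the column exists and the labels are binary. A submission can still be wrong if ids are duplicated or do not match the test set, so a slightly broader check might look like the sketch below. `check_submission` is a hypothetical helper, and the small frames are made-up examples that reuse three ids from the `test_data.head()` output.

```python
import pandas as pd


def check_submission(predictions: pd.DataFrame, expected_ids) -> list:
    """Return a list of problems found; an empty list means the frame passed."""
    problems = []
    if list(predictions.columns) != ["id", "cardio"]:
        # Without the right columns the remaining checks cannot run.
        problems.append("columns must be exactly ['id', 'cardio']")
        return problems
    if predictions["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(predictions["id"]) != set(expected_ids):
        problems.append("ids do not match the expected set")
    if not predictions["cardio"].isin([0, 1]).all():
        problems.append("cardio contains values other than 0 or 1")
    return problems


expected = [26681, 58585, 54339]
good = pd.DataFrame({"id": [26681, 58585, 54339], "cardio": [0, 1, 0]})
bad = pd.DataFrame({"id": [26681, 26681, 54339], "cardio": [0, 2, 0]})

print(check_submission(good, expected))
print(check_submission(bad, expected))
```

In the notebook this would be called as `check_submission(predictions, test_data["id"])`, assuming the submission is expected to cover every test id exactly once.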
