# Heart Disease Prediction

This notebook aims to predict heart disease using various machine learning techniques. The dataset includes several medical attributes that contribute to heart disease outcomes. We will perform data preprocessing, explore the creation of interaction terms, encode categorical variables, and apply machine learning models to predict the presence of heart disease.


### Import Libraries

In [306]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBRFClassifier
from sklearn.tree import DecisionTreeClassifier
import numpy as np

### Data Loading and Preliminary Exploration

We start by loading the training and test datasets and take a first look at the training data to understand its structure and column types.

In [307]:
# Load the data
train_data = pd.read_csv('train_heart.csv', sep=',')
test_data = pd.read_csv('test_heart.csv', sep=',')

# Display the first few rows of the training data
train_data.head()

Unnamed: 0,id,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,563,55,M,ASY,135,204,1,ST,126,Y,1.1,Flat,1
1,884,67,M,ASY,160,286,0,LVH,108,Y,1.5,Flat,1
2,352,56,M,ASY,120,0,0,ST,100,Y,-1.0,Down,1
3,694,56,M,ATA,120,236,0,Normal,178,N,0.8,Up,0
4,491,75,M,ASY,170,203,1,ST,108,N,0.0,Flat,1


### Data Preprocessing

In this section, we address missing values, create interaction terms, encode categorical variables, and scale the features to prepare our data for modeling.

First, we address implausible values in the dataset. Specifically, we focus on two important variables: `Cholesterol` and `RestingBP` (Resting Blood Pressure). Zero values in these variables are not plausible and likely indicate missing or incorrectly recorded data. We replace these zeros with the median of the non-zero values of the respective variable. The median is a robust measure of central tendency that is less sensitive to outliers than the mean.

In [308]:
# Replace implausible zeros in Cholesterol and RestingBP with the median of the non-zero values.
# The result is assigned back to the column: the chained inplace call
# (df[col].replace(..., inplace=True)) triggers a pandas FutureWarning and stops working in pandas 3.0.
chol_median = train_data.loc[train_data['Cholesterol'] != 0, 'Cholesterol'].median()
train_data['Cholesterol'] = train_data['Cholesterol'].replace(0, chol_median)

resting_median = train_data.loc[train_data['RestingBP'] != 0, 'RestingBP'].median()
train_data['RestingBP'] = train_data['RestingBP'].replace(0, resting_median)


#### Create Interaction Term

Interaction terms can capture the effect of two or more variables acting together on the target variable. Here, we create an interaction term between `Age` and `Cholesterol`, hypothesizing that the combination of these variables might have a different effect on heart disease risk compared to considering them individually.

In [309]:
train_data['Age_Chol_Interact'] = train_data['Age'] * train_data['Cholesterol']

#### Encode Categorical Variables

Many machine learning models cannot handle categorical variables unless they are converted into numerical values. We use ordinal encoding, which maps each category to an integer. For unordered categories such as `ChestPainType` the assigned order is arbitrary, which tree-based models generally tolerate well.

In [310]:
categorical_columns = train_data.select_dtypes(include=['object']).columns

# Encode categorical variables
encoder = ce.OrdinalEncoder(cols=categorical_columns)
X_encoded = encoder.fit_transform(train_data.drop(['id', 'HeartDisease'], axis=1))
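
To make the encoding step concrete, here is a minimal pure-Python sketch of what an ordinal mapping does, using `ChestPainType` values from the preview above. The helper `fit_ordinal_mapping` is hypothetical and numbers categories in order of first appearance starting at 1; the integers that `category_encoders` actually assigns may differ.

```python
def fit_ordinal_mapping(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


# ChestPainType values taken from the first five training rows
chest_pain = ["ASY", "ASY", "ASY", "ATA", "ASY"]

mapping = fit_ordinal_mapping(chest_pain)
encoded = [mapping[v] for v in chest_pain]

print(mapping)
print(encoded)
```

Because the integers carry no real order for nominal columns, this kind of encoding tends to suit tree-based models better than linear ones.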

#### Add Polynomial Features

Polynomial features extend the feature set with powers and products of the existing features. With `degree=2`, `PolynomialFeatures` adds the square of every feature and the product of every pair of features, which lets a model pick up non-linear effects and pairwise interactions between the original features.

In [311]:
# Instantiate and apply PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_encoded)
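
As an illustration of what a degree-2 expansion without a bias column contains, the sketch below enumerates the terms for three of the dataset's numeric columns. The helper `degree2_terms` is hypothetical and only builds column labels; it is not how `PolynomialFeatures` is implemented.

```python
from itertools import combinations_with_replacement


def degree2_terms(feature_names):
    """List the columns a degree-2 expansion without a bias term would contain."""
    terms = list(feature_names)  # the original (degree-1) features
    for a, b in combinations_with_replacement(feature_names, 2):
        # a == b gives a squared term, otherwise a pairwise product
        terms.append(f"{a}^2" if a == b else f"{a}*{b}")
    return terms


terms = degree2_terms(["Age", "MaxHR", "Oldpeak"])
print(len(terms))
print(terms)
```

For `n` input features this gives `n + n(n+1)/2` columns, so the feature count grows quickly. Such an expansion already contains the `Age * Cholesterol` product, so the hand-built `Age_Chol_Interact` column is probably redundant once polynomial features are added.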

#### Scale Features

Feature scaling puts all features on a comparable range. Without it, models that rely on distances or gradient magnitudes can be dominated by features with large values. Tree-based models such as those used below are largely insensitive to rescaling, so this step matters most if other model families are tried. We use `StandardScaler`, which standardizes features by removing the mean and scaling to unit variance.

In [312]:
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)

The final step in data preprocessing is splitting the data from `train_heart.csv` into a training set and a held-out test set (70/30). This allows us to train our model on one subset and evaluate it on rows it has not seen during training.

In [313]:
# Define the target variable
y = train_data['HeartDisease']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
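
Note that the cells above fit the scaler on the full dataset before splitting, so statistics from the held-out rows influence the transformation. A common alternative is to learn the scaling statistics from the training split only and reuse them on the held-out rows. The sketch below shows the idea in pure Python on made-up values; `fit_standardizer` and `standardize` are hypothetical helpers that mimic, for a single column, the mean and population standard deviation that `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data only."""
    return mean(train_values), pstdev(train_values)


def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma using previously learned statistics."""
    return [(x - mu) / sigma for x in values]


# Made-up resting blood pressure readings, for illustration only
train_bp = [120, 130, 140, 150, 160]
heldout_bp = [135, 170]

mu, sigma = fit_standardizer(train_bp)
train_scaled = standardize(train_bp, mu, sigma)
heldout_scaled = standardize(heldout_bp, mu, sigma)  # reuses the training statistics

print(mu, sigma)
print(heldout_scaled)
```

With scikit-learn, the same effect is usually obtained by splitting first and calling `fit_transform` on the training split and `transform` on the held-out split.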

With the data preprocessing completed, we're now ready to move on to model training and evaluation.

### Feature Engineering and Model Preparation

In this step, we focus on optimizing the model's performance through hyperparameter tuning. Hyperparameters are the settings that can be adjusted to control the model's learning process. Tuning these parameters helps in finding the best version of the model that can predict more accurately.

For our Decision Tree classifier, we will explore a range of hyperparameters to find the best combination that yields the highest accuracy. This process is facilitated by `GridSearchCV`, a tool that performs exhaustive search over specified parameter values for an estimator.

#### Hyperparameter Tuning with GridSearchCV

The parameters we are tuning include:
- `max_depth`: The maximum depth of the tree.
- `min_samples_split`: The minimum number of samples required to split an internal node.
- `min_samples_leaf`: The minimum number of samples required to be at a leaf node.
- `ccp_alpha`: Complexity parameter used for Minimal Cost-Complexity Pruning.
- `splitter`: The strategy used to choose the split at each node.
- `class_weight`: Weights associated with classes.

The goal is to find the optimal settings that lead to the best model performance.
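
An exhaustive grid search grows multiplicatively with every parameter added. As a rough sketch, the number of model fits for the decision tree grid used in this notebook can be computed by hand; the variable names below are illustrative.

```python
from math import prod

# Same candidate values as the decision tree grid in this notebook
param_grid = {
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'ccp_alpha': [0.0, 0.001, 0.005, 0.01, 0.05, 0.1],
    'splitter': ['best', 'random'],
    'class_weight': [None, 'balanced'],
}
cv_folds = 5

# One candidate per combination of values, one fit per candidate and fold
n_candidates = prod(len(values) for values in param_grid.values())
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

This agrees with the "Fitting 5 folds for each of 1296 candidates, totalling 6480 fits" message that `GridSearchCV` prints for this grid with `verbose=1`.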

In [314]:
# Define the parameter grid to search
param_grid_extended = {
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'ccp_alpha': [0.0, 0.001, 0.005, 0.01, 0.05, 0.1],
    'splitter': ['best', 'random'],
    'class_weight': [None, 'balanced']
}

# Set up the grid search for the decision tree
grid_search_extended = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid_extended,
    cv=5,
    scoring='accuracy',
    verbose=1,
    n_jobs=-1
)
# Grid search setup for the XGBoost random forest (XGBRFClassifier)
param_grid_randomforest = {
    'n_estimators': [50, 100, 200],            # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],           # Maximum depth of each tree
    'subsample': [0.2, 0.5, 0.7, 0.9],         # Fraction of rows sampled
    'colsample_bynode': [0.1, 0.2, 0.5, 0.7],  # Fraction of columns sampled per split
}

grid_search_randomforest = GridSearchCV(
    estimator=XGBRFClassifier(),
    param_grid=param_grid_randomforest,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the model with grid search
grid_search_extended.fit(X_train, y_train)

grid_search_randomforest.fit(X_train, y_train)

# Access best_params_ to see the best set of parameters found by GridSearchCV
best_params = grid_search_extended.best_params_
print(f"Best parameters found for decisiontree: {best_params}")

best_params_random = grid_search_randomforest.best_params_
print(f"Best parameters found for randomforest: {best_params_random}")

# Use the best estimator directly
base_est = grid_search_extended.best_estimator_
base_est_rf = grid_search_randomforest.best_estimator_

Fitting 5 folds for each of 1296 candidates, totalling 6480 fits


Best parameters found for decisiontree: {'ccp_alpha': 0.005, 'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random'}
Best parameters found for randomforest: {'colsample_bynode': 0.1, 'max_depth': 10, 'n_estimators': 200, 'subsample': 0.9}


With the best hyperparameters identified, we can now proceed to use this optimally tuned Decision Tree as the base estimator for our AdaBoost classifier. This ensures that our ensemble model starts with a strong foundation.

### Model Training with AdaBoost

After tuning the decision tree, we move on to model training. We use AdaBoost, an ensemble method that combines multiple weak learners (here, decision trees) into a strong classifier. By increasing the weights of misclassified instances after each round, AdaBoost focuses later learners on the hard cases.

#### Training the AdaBoost Model

AdaBoost works by iteratively adding models (in this case, decision trees) that correct the mistakes of the models already added to the ensemble. We specify the number of trees to be used (`n_estimators`) and use the best decision tree model (`base_est`) identified in the previous step as our base estimator.
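
The reweighting idea can be sketched in a few lines. The example below implements one round of the classic binary AdaBoost update on made-up data; `adaboost_round` is a hypothetical helper, and scikit-learn's `AdaBoostClassifier` uses the SAMME variant, whose formulas differ in detail.

```python
import math


def adaboost_round(weights, is_correct):
    """One reweighting round of classic binary AdaBoost.

    weights: current sample weights, summing to 1
    is_correct: whether the weak learner classified each sample correctly
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    alpha = 0.5 * math.log((1 - error) / error)  # the learner's vote in the ensemble
    # Shrink the weights of correct samples, grow the weights of mistakes
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five samples with uniform weights; the weak learner gets the third one wrong
weights = [0.2] * 5
is_correct = [True, True, False, True, True]

alpha, new_weights = adaboost_round(weights, is_correct)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next learner is pushed to get it right.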

In [315]:
# Define the AdaBoost ensemble using the tuned decision tree as its base estimator
ada_boost_clf = AdaBoostClassifier(estimator=base_est, n_estimators=1000, random_state=42)

# Train AdaBoost and refit the tuned XGBoost random forest on the training split
ada_boost_clf.fit(X_train, y_train)
base_est_rf.fit(X_train, y_train)

# Predict on the test set
y_pred_ada = ada_boost_clf.predict(X_test)
y_pred_xgb_rf = base_est_rf.predict(X_test)

# Calculate and print the accuracy
accuracy_ada = accuracy_score(y_test, y_pred_ada)
accuracy_xgb_rf = accuracy_score(y_test, y_pred_xgb_rf)

print(f'AdaBoost Decision Tree Accuracy: {accuracy_ada}')
print(f'XGBoost Random Forest Accuracy: {accuracy_xgb_rf}')




AdaBoost Decision Tree Accuracy: 0.8393782383419689
XGBoost Random Forest Accuracy: 0.8549222797927462


### Adjusting Prediction Threshold

A key aspect of model evaluation involves adjusting the prediction threshold. This adjustment can help in achieving a balance between sensitivity and specificity, based on the prediction problem's requirements. Lowering the threshold below the default of 0.5 flags more cases as positive, which raises sensitivity at the cost of specificity. Here, we experiment with a threshold of 0.4 to see its impact on the model's accuracy.

In [316]:
# Get probabilities of the positive class
y_probs = ada_boost_clf.predict_proba(X_test)[:, 1]
# Define a new threshold
new_threshold = 0.4  # Example threshold

# Apply new threshold to get the adjusted predictions
y_pred_adjusted = np.where(y_probs > new_threshold, 1, 0)

# Calculate and print the adjusted accuracy
accuracy_adjusted = accuracy_score(y_test, y_pred_adjusted)
print(f'Adjusted AdaBoost Decision Tree Accuracy: {accuracy_adjusted}')

Adjusted AdaBoost Decision Tree Accuracy: 0.8393782383419689
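
In the run above, accuracy at the 0.4 threshold equals the accuracy of the default predictions, so accuracy alone does not reveal the trade-off. A hedged sketch of how sensitivity and specificity could be compared across thresholds is shown below. The labels and probabilities are made up for illustration, and `sensitivity_specificity` is a hypothetical helper.

```python
def sensitivity_specificity(y_true, probs, threshold):
    """Return (sensitivity, specificity) when predicting 1 for probs > threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels and positive-class probabilities, for illustration only
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.45, 0.60, 0.35, 0.55, 0.90]

results = {}
for threshold in (0.5, 0.4, 0.3):
    results[threshold] = sensitivity_specificity(y_true, probs, threshold)
    sens, spec = results[threshold]
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On real data, the same loop could be run over `y_test` and `y_probs` to pick a threshold that matches the desired balance.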


### Evaluating the Model

The final step in model training is to evaluate the performance of our classifier. We do this by calculating the confusion matrix and classification report for the threshold-adjusted AdaBoost predictions, which provide insights into the accuracy, precision, recall, and F1-score of the model.

In [317]:
# Print confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred_adjusted)
cr = classification_report(y_test, y_pred_adjusted)


print("Confusion Matrix:\n", cm)
print("\nClassification Report:\n", cr)

Confusion Matrix:
 [[64 22]
 [ 9 98]]

Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.74      0.81        86
           1       0.82      0.92      0.86       107

    accuracy                           0.84       193
   macro avg       0.85      0.83      0.83       193
weighted avg       0.84      0.84      0.84       193
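
The figures in the classification report follow directly from the confusion matrix. As a small worked check, the sketch below recomputes the class-1 metrics from the four counts printed above, using scikit-learn's layout of true classes in rows and predicted classes in columns.

```python
# Counts from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 64, 22
fn, tp = 9, 98

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)    # share of predicted positives that are correct
recall = tp / (tp + fn)       # share of actual positives that are found (sensitivity)
specificity = tn / (tn + fp)  # share of actual negatives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"specificity={specificity:.2f}")
```

The specificity here is the same number as the recall reported for class 0.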



These metrics show where the model is strong and where it is weak. At the 0.4 threshold it recalls 92% of heart-disease cases (98 of 107) but only 74% of healthy cases (64 of 86), so most of its errors are false positives (22, against 9 false negatives). This analysis guides any further adjustments to improve model performance.

### Testing on New Data

Having trained and evaluated our model on a training dataset, the next critical step is to test its performance on new, unseen data. This step is essential for understanding how well our model generalizes to data it has not encountered before. We follow the same data preprocessing steps for the new dataset as we did for the training dataset. Finally, we use our trained model to make predictions on this new data and prepare a submission file with these predictions.

#### Preparing the Test Data

The new data must be preprocessed to match the format expected by the model. This involves replacing implausible zero values, creating the interaction term, encoding categorical variables, adding polynomial features, and scaling the features, just as we did with the training data.


In [318]:
# Import the test dataset
test = pd.read_csv('test_heart.csv', sep=',')

# Apply the same preprocessing steps to the test data
# Replace implausible zeros in Cholesterol and RestingBP with the median of the non-zero values
# (medians are recomputed on the test data; assignment replaces the chained inplace call,
# which triggers a pandas FutureWarning and stops working in pandas 3.0)
chol_median = test.loc[test['Cholesterol'] != 0, 'Cholesterol'].median()
test['Cholesterol'] = test['Cholesterol'].replace(0, chol_median)

resting_median = test.loc[test['RestingBP'] != 0, 'RestingBP'].median()
test['RestingBP'] = test['RestingBP'].replace(0, resting_median)

# Create interaction term
test['Age_Chol_Interact'] = test['Age'] * test['Cholesterol']

# Prepare the test data
X_new = test.drop(['id'], axis=1)

# Encode and transform the test data using the encoder and transformers fitted on the training data
X_new_encoded = encoder.transform(X_new)
X_new_poly = poly.transform(X_new_encoded)
X_new_scaled = scaler.transform(X_new_poly)


#### Making Predictions and Adjusting Threshold

With the test data prepared, we predict heart disease for each instance using the tuned XGBoost random forest, which scored higher than AdaBoost on the held-out split. `predict` returns hard class labels (0 or 1) at the model's default decision threshold. A custom threshold, as explored earlier with AdaBoost, would require class probabilities from `predict_proba`; that variant is left commented out.

In [319]:
# Predict class labels (0/1) for the test data with the tuned XGBoost random forest.
# Despite its name, y_new_probs holds hard labels: predict() does not return probabilities.
y_new_probs = base_est_rf.predict(X_new_scaled)

# Optional: apply a custom threshold instead (requires probabilities from predict_proba)
# chosen_threshold = 0.4  # This threshold can be adjusted
# y_new_pred_adjusted = np.where(base_est_rf.predict_proba(X_new_scaled)[:, 1] > chosen_threshold, 1, 0)

#### Preparing the Submission File

Finally, we prepare a submission file that includes the predictions for the new data. This file typically contains an ID for each instance and the corresponding prediction. This step is especially relevant in competition or operational settings where predictions need to be shared or deployed.

In [320]:
# Create a DataFrame for submission
id_to_prediction_df = pd.DataFrame({
    'id': test['id'],
    'HeartDisease': y_new_probs
})

# Output the submission file
file_name = 'submission.csv'
id_to_prediction_df.to_csv(file_name, index=False)

print(f"File saved as {file_name}")

File saved as submission.csv


This step concludes our process of developing a model to predict heart disease. By carefully preparing the test data, making informed predictions, and generating a submission file, we have applied our model to new data, showcasing its potential for real-world applications.
