# Heart Disease


## Variable Information
Only 14 of the dataset's 76 attributes are used:

    1. Age (age): Patient's age in years.

    2. Sex (sex): Gender of the patient.
        Values: 1 = Male, 0 = Female

    3. Chest Pain Type (cp): Type of chest pain experienced.
        Values: 1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic

    4. Resting Blood Pressure (trestbps): Blood pressure on admission in mm Hg.

    5. Serum Cholesterol (chol): Serum cholesterol level in mg/dl.

    6. Fasting Blood Sugar (fbs): Fasting blood sugar level.
        Values: 1 = >120 mg/dl, 0 = <=120 mg/dl

    7. Resting Electrocardiographic Results (restecg): Results of resting electrocardiogram.
        Values: 0 = Normal, 1 = ST-T wave abnormality, 2 = Probable or definite left ventricular hypertrophy

    8. Maximum Heart Rate Achieved (thalach): Maximum heart rate during examination.

    9. Exercise-Induced Angina (exang): Presence of exercise-induced angina.
        Values: 1 = Yes, 0 = No

    10. ST Depression Induced by Exercise Relative to Rest (oldpeak): ST depression induced by exercise relative to rest.

    11. Slope of the Peak Exercise ST Segment (slope): Slope of the peak exercise ST segment.
        Values: 1 = Upsloping, 2 = Flat, 3 = Downsloping

    12. Number of Major Vessels Colored by Fluoroscopy (ca): Number of major vessels colored by fluoroscopy. A higher count may indicate a greater degree of vessel involvement or narrowing, which can be associated with more advanced stages of coronary artery disease.

    13. Thalassemia (thal): Type of thalassemia.
        Values: 3 = Normal, 6 = Fixed defect, 7 = Reversible defect

    14. Diagnosis of Heart Disease (num): Diagnosis based on angiographic disease status.
        Values: 0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing (in any major vessel). The raw column ranges from 0 to 4; any value above 0 indicates disease.

## Dataset Info
https://archive.ics.uci.edu/dataset/45/heart+disease


## Introductory Paper

International application of a new probability algorithm for the diagnosis of coronary artery disease.
By R. Detrano, A. Jánosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, V. Froelicher. 1989

Published in American Journal of Cardiology

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")
warnings.filterwarnings("ignore", category=UserWarning, module="seaborn")

In [None]:
! pip install ucimlrepo
! pip install pandas-profiling

In [None]:
from ucimlrepo import fetch_ucirepo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from pandas.api.types import CategoricalDtype

# Silence Seaborn FutureWarnings related to is_categorical_dtype
warnings.filterwarnings("ignore", category=FutureWarning, module="seaborn._oldcore",
                        message="is_categorical_dtype is deprecated")
# import dataset
#heart_disease = fetch_ucirepo(name='Heart Disease')
heart_disease = fetch_ucirepo(id=45) 


# access data
df = heart_disease['data']['original']

In [None]:
# First 5 rows of our data
df.head()

In [None]:
# Define a dictionary with the new column names
new_column_names = {
    'age': 'Age',
    'sex': 'Sex',
    'cp': 'ChestPainType',
    'trestbps': 'RestingBloodPressure',
    'chol': 'SerumCholesterol',
    'fbs': 'FastingBloodSugar',
    'restecg': 'RestingECG',
    'thalach': 'MaxHeartRate',
    'exang': 'ExerciseInducedAngina',
    'oldpeak': 'STDepression',
    'slope': 'SlopeSTSegment',
    'ca': 'NumMajorVessels',
    'thal': 'Thalassemia',
    'num': 'HeartDiseaseDiagnosis'
}

# Rename the columns using the dictionary
df.rename(columns=new_column_names, inplace=True)

In [None]:
# First 5 rows of our data
df.head()

In [None]:
df.info()

In [None]:
# Class distribution of the raw target
df['HeartDiseaseDiagnosis'].value_counts()
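
The raw target takes integer values from 0 to 4, where 0 means no disease and any higher value indicates disease. As a minimal sketch of how such a column can be collapsed into a binary label, the example below uses a small made-up series (`raw_num` and `binary_target` are illustrative names, not part of the notebook's data):

```python
import pandas as pd

# Toy stand-in for the raw `num` column (values 0 to 4); not the real data.
raw_num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# Any value above 0 counts as disease, so 1 means "disease present".
binary_target = (raw_num > 0).astype("int64")

print(binary_target.tolist())
```

Encoding disease as 1 keeps the positive class aligned with what `predict_proba(...)[:, 1]` and most scikit-learn metrics treat as the positive label.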

In [None]:
df['HeartDiseaseDiagnosis'] = (df['HeartDiseaseDiagnosis'] > 0).astype(np.int64)  # binarize the target: 1 = disease, 0 = no disease

In [None]:
palette = ["#87CEEB", "#FFA07A"]

In [None]:
sns.countplot(x="HeartDiseaseDiagnosis", data=df, palette=palette)
plt.xlabel("Heart Disease (0 = False, 1 = True)")
plt.show()

In [None]:
sns.countplot(x='Sex', data=df, palette="mako_r")
plt.xlabel("Sex (0 = Female, 1 = Male)")
plt.show()

In [None]:
df.groupby('HeartDiseaseDiagnosis').mean()

In [None]:
pd.crosstab(df.Sex,df.HeartDiseaseDiagnosis).plot(kind="bar",figsize=(15,6),color=palette)
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('Sex (0 = Female, 1 = Male)')
plt.xticks(rotation=0)
plt.legend(["Not Disease", "Disease"])
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.scatter(x=df.Age[df.HeartDiseaseDiagnosis==1], y=df.MaxHeartRate[(df.HeartDiseaseDiagnosis==1)], c=palette[1])
plt.scatter(x=df.Age[df.HeartDiseaseDiagnosis==0], y=df.MaxHeartRate[(df.HeartDiseaseDiagnosis==0)], c=palette[0])
plt.legend(["Disease", "Not Disease"])
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()

In [None]:
# Define the mapping for each categorical variable
ChestPainType_mapping = {1: 'Typical angina', 2: 'Atypical angina', 3: 'Non-anginal pain', 4: 'Asymptomatic'}
Thalassemia_mapping = {3.0: 'Normal', 6.0: 'Fixed defect', 7.0: 'Reversible defect'}
SlopeSTSegment_mapping = {1: 'Upsloping', 2: 'Flat', 3: 'Downsloping'}

# Replace the values in the original DataFrame
df['ChestPainType'] = df['ChestPainType'].map(ChestPainType_mapping)
df['Thalassemia'] = df['Thalassemia'].map(Thalassemia_mapping)
df['SlopeSTSegment'] = df['SlopeSTSegment'].map(SlopeSTSegment_mapping)

In [None]:
pd.crosstab(df.SlopeSTSegment,df.HeartDiseaseDiagnosis).plot(kind="bar",figsize=(15,6),color=palette)
plt.title('Heart Disease Frequency for Slope')
plt.xlabel('The Slope of The Peak Exercise ST Segment ')
plt.xticks(rotation = 0)
plt.legend(["Not Disease", "Disease"])
plt.ylabel('Frequency')
plt.show()

In [None]:
pd.crosstab(df.FastingBloodSugar,df.HeartDiseaseDiagnosis).plot(kind="bar",figsize=(15,6),color=palette)
plt.title('Heart Disease Frequency According To FBS')
plt.xlabel('FBS - (Fasting Blood Sugar > 120 mg/dl) (1 = true; 0 = false)')
plt.xticks(rotation = 0)
plt.legend(["Not Disease", "Disease"])
plt.ylabel('Frequency of Disease or Not')
plt.show()

In [None]:
pd.crosstab(df.ChestPainType,df.HeartDiseaseDiagnosis).plot(kind="bar",figsize=(15,6),color=palette)
plt.title('Heart Disease Frequency According To Chest Pain Type')
plt.xlabel('Chest Pain Type')
plt.xticks(rotation = 0)
plt.legend(["Not Disease", "Disease"])
plt.ylabel('Frequency of Disease or Not')
plt.show()

## Creating Dummy Variables
Since `cp`, `thal` and `slope` (renamed above to `ChestPainType`, `Thalassemia` and `SlopeSTSegment`) are categorical variables, we turn them into dummy variables.

**Chest Pain Type (cp)**: Type of chest pain experienced.
        Values: 1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic

**Thalassemia (thal)**: Type of thalassemia.
        Values: 3 = Normal, 6 = Fixed defect, 7 = Reversible defect

**Slope of the Peak Exercise ST Segment (slope)**: Slope of the peak exercise ST segment.
        Values: 1 = Upsloping, 2 = Flat, 3 = Downsloping
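
As a minimal sketch of what `pd.get_dummies` with `drop_first=True` produces, the toy frame here (made-up values, not the real data) encodes a single categorical column. The name `toy` is illustrative only.

```python
import numpy as np
import pandas as pd

# Toy column with the three slope labels and one missing value.
toy = pd.DataFrame({"SlopeSTSegment": ["Upsloping", "Flat", "Downsloping", np.nan]})

# drop_first=True drops the alphabetically first level ("Downsloping"),
# which then acts as the baseline encoded by all zeros.
dummies = pd.get_dummies(
    toy, columns=["SlopeSTSegment"], prefix=["SlopeSTSegment"], drop_first=True
).astype("int64")

print(dummies)
```

Note that with `dummy_na=False` a row whose category is missing is also encoded as all zeros, so it becomes indistinguishable from the baseline level. Whether that matters depends on how many values are missing in the encoded columns.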


In [None]:
# Cast the mapped categorical columns to strings
categorical_vars = ['ChestPainType', 'Thalassemia', 'SlopeSTSegment']
df[categorical_vars] = df[categorical_vars].astype(str)

In [None]:
# astype(str) turns missing values into the literal string 'nan'; restore them to real NaN
df.replace('nan', np.nan, inplace=True)

In [None]:
# One-hot encode the categorical columns, dropping the first level of each as the baseline
dummies = pd.get_dummies(df[categorical_vars], prefix=categorical_vars, dummy_na=False, drop_first=True).astype(np.int64)
df = pd.concat([df, dummies], axis=1)
df = df.drop(columns=categorical_vars)

In [None]:
# Check for missing values in all columns
missing_values = df.isna().sum()

# Display the count of missing values in each column
print(missing_values)

In [None]:
# Drop rows that still contain missing values
df.dropna(inplace=True)
# Display the count of missing values in each column
missing_values = df.isna().sum()
print(missing_values)

## Model

In [None]:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Separate features and target variable
X = df.drop('HeartDiseaseDiagnosis', axis=1)
y = df['HeartDiseaseDiagnosis']

# Perform a hold-out split (adjust test_size as needed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Mean-impute NumMajorVessels, fitting the imputer on the training split only.
# Rows with missing values were already dropped above, so this step is a safeguard.
imputer = SimpleImputer(strategy='mean')
X_train['NumMajorVessels'] = imputer.fit_transform(X_train[['NumMajorVessels']]).ravel()
X_test['NumMajorVessels'] = imputer.transform(X_test[['NumMajorVessels']]).ravel()
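
To illustrate why the imputer is fit on the training split and only applied to the test split, here is a minimal sketch on made-up values (the frames `train` and `test` are hypothetical stand-ins, not the notebook's splits):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical splits with one missing value each.
train = pd.DataFrame({"NumMajorVessels": [0.0, 1.0, np.nan, 3.0]})
test = pd.DataFrame({"NumMajorVessels": [np.nan, 2.0]})

imputer = SimpleImputer(strategy="mean")

# The mean is learned from the training data only: (0 + 1 + 3) / 3.
train_filled = imputer.fit_transform(train[["NumMajorVessels"]]).ravel()

# The test split reuses that training mean, so no test information leaks in.
test_filled = imputer.transform(test[["NumMajorVessels"]]).ravel()

print(train_filled, test_filled)
```

Here the missing test value is filled with the training mean rather than the test mean, which keeps the evaluation honest.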

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, classification_report, confusion_matrix, roc_curve
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns

# Set hyperparameters (clinicians can modify these)
max_depth = 10
n_estimators = 100

# Create and train the Random Forest model
model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimators, random_state=42)
trained_model = model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix using Seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['No Disease', 'Disease'], yticklabels=['No Disease', 'Disease'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Print classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Calculate AUC (Area Under the Curve)
y_probs = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_probs)

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_probs)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.legend()
plt.show()
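
The ROC curve plots the true positive rate (sensitivity) against the false positive rate, which equals 1 minus specificity. As a minimal sketch of how these quantities follow from a confusion matrix, the example here uses made-up labels (`y_true` and `y_hat` are illustrative, not the model's output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for eight patients.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
false_positive_rate = fp / (fp + tn)  # equals 1 - specificity

print(sensitivity, specificity, false_positive_rate)
```

Each point on the ROC curve corresponds to one such pair of rates, computed at a different probability threshold.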

In [None]:
# Display feature importances
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': trained_model.feature_importances_})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False)
print("\nFeature Importances:")
print(feature_importances)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set seaborn style
sns.set(style="whitegrid")

# Display feature importances with a bar plot using seaborn
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances, palette="viridis")
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()


In [None]:
df['NumMajorVessels'].value_counts()

In [None]:
# Create a contingency table of diagnosis by number of major vessels
contingency_table = pd.crosstab(df['HeartDiseaseDiagnosis'], df['NumMajorVessels'])

# Plot a grouped bar chart of the counts
plt.figure(figsize=(8, 6))
sns.countplot(x='NumMajorVessels', hue='HeartDiseaseDiagnosis', data=df, palette=palette)
plt.title('Heart Disease Diagnosis by Number of Major Vessels')
plt.xlabel('Number of Major Vessels Colored by Fluoroscopy')
plt.ylabel('Count')
plt.show()


In [None]:
# Create a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='HeartDiseaseDiagnosis', y='STDepression', data=df, palette=palette)

# Add labels and title
plt.xlabel('Heart Disease (0 = False, 1 = True)')
plt.ylabel('STDepression')
plt.title('Box Plot of STDepression for Each Class')

# Show the plot
plt.show()

## Explanation 

In [None]:
palette

In [None]:
import shap

# load JS visualization code to notebook
shap.initjs()
# explain the model's predictions using SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Older SHAP releases return a list with one array per class, newer ones a single
# array with the class on the last axis; select class 1 either way
shap_values_pos = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]

patient_number = 2

# Force plot with custom colors
shap.force_plot(explainer.expected_value[1], shap_values_pos[patient_number], X_test.iloc[patient_number, :], plot_cmap=palette[::-1])

In [None]:
# Patient with the lowest predicted probability for class 1
patient_number = np.argmin(y_probs)

# Force plot with custom colors
shap.force_plot(explainer.expected_value[1], shap_values_pos[patient_number], X_test.iloc[patient_number, :], plot_cmap=palette[::-1])

In [None]:
# Create a beeswarm plot to visualize the impact of features on predictions.
shap.summary_plot(shap_values_pos, X_test, show=False)
plt.title("SHAP Summary Plot - Impact on Heart Disease Diagnosis")
plt.xlabel("Impact on Model Output (Proximity to Heart Disease)")
plt.ylabel("Feature")
plt.show()

Sex (sex): Gender of the patient. Values: 1 = Male, 0 = Female

Exercise-Induced Angina (exang): Presence of exercise-induced angina.
    Values: 1 = Yes, 0 = No