# Predicting Heart Attacks

In this notebook, we predict which people are at risk for heart attacks.

## The Data

We use the "Heart Attack Analysis & Prediction Dataset" provided by Rashik Rahman.

age : age of the patient

sex : sex of the patient

exng : exercise-induced angina (1 = yes; 0 = no)

caa : number of major vessels (0-3)

cp : chest pain type
- Value 1: typical angina
- Value 2: atypical angina
- Value 3: non-anginal pain
- Value 4: asymptomatic

trtbps : resting blood pressure (in mm Hg)

chol : cholesterol in mg/dl, fetched via BMI sensor

fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

restecg : resting electrocardiographic results
- Value 0: normal
- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

thalachh : maximum heart rate achieved

output : 0 = less chance of heart attack; 1 = more chance of heart attack

# Import Libraries and Read Data

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve, auc, make_scorer
from scikitplot.metrics import plot_roc
from sklearn.preprocessing import RobustScaler, StandardScaler, minmax_scale, Normalizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
plt.style.use('default')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')
df

In [None]:
df.info()

In [None]:
df.isnull().any()

In [None]:
# Drop exact duplicate rows, keeping the first occurrence of each
df.drop_duplicates(inplace=True)

# EDA
## Univariate Analysis



### Target Variable

In [None]:
sns.countplot(x = 'output',data=df);

### Categorical Variables

In [None]:
plt.figure(figsize=(25,10))

plt.subplot(241)

sns.countplot(x = 'sex', data=df)

plt.subplot(242)
sns.countplot(x = 'cp', data=df)

plt.subplot(243)
sns.countplot(x = 'fbs', data=df)

plt.subplot(244)
sns.countplot(x = 'restecg', data=df)

plt.subplot(245)
sns.countplot(x = 'exng', data=df)

plt.subplot(246)
sns.countplot(x = 'slp', data=df)

plt.subplot(247)
sns.countplot(x = 'caa', data=df)

plt.subplot(248)
sns.countplot(x = 'thall', data=df);




### Numerical Variables

In [None]:
plt.figure(figsize=(15,15))

plt.subplot(321)
sns.histplot(x = 'age', data=df)

plt.subplot(322)
sns.histplot(x = 'trtbps', data=df)

plt.subplot(323)
sns.histplot(x = 'chol', data=df)

plt.subplot(324)
sns.histplot(x = 'thalachh', data=df)

plt.subplot(325)
sns.histplot(x = 'oldpeak', data=df);


## Bivariate Analysis

In [None]:
risky = df[df['output'] == 1]

not_risky = df[df['output'] == 0]


In [None]:
plt.figure(figsize=(20,7))

plt.subplot(131)
sns.countplot(data=df,x='sex', hue='output')

plt.subplot(132)
sns.countplot(data=df,x='fbs', hue='output')

plt.subplot(133)
sns.countplot(data=df,x='exng', hue='output');

In [None]:
df

In [None]:
plt.figure(figsize=(25,10))

plt.subplot(331)
sns.kdeplot(data=df, x='age',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0)

plt.subplot(332)
sns.kdeplot(data=df, x='cp',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0)

plt.subplot(333)
sns.kdeplot(data=df, x='trtbps',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0)

plt.subplot(334)
sns.kdeplot(data=df, x='chol',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0)

plt.subplot(335)
sns.kdeplot(data=df, x='restecg',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0)

plt.subplot(336)
sns.kdeplot(data=df, x='thalachh',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0)

plt.subplot(337)
sns.kdeplot(data=df, x='oldpeak',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0)

plt.subplot(338)
sns.kdeplot(data=df, x='slp',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0)

plt.subplot(339)
sns.kdeplot(data=df, x='caa',hue='output',fill=True,palette=["blue","orange"], alpha=.5, linewidth=0);

# Model Building


## Model Preparation

To prepare the data for modeling, we create dummy variables for all the categorical variables. We also use a small validation split (10%): the dataset is small, so we want to keep as much of it as possible for training.

In [None]:
modeling_df = df.copy()

cat_cols = ['sex','cp','fbs','restecg','exng','slp','caa','thall']
num_cols = ['age','trtbps','chol','thalachh','oldpeak']



# get_dummies returns a new DataFrame, so the result must be assigned back
modeling_df = pd.get_dummies(modeling_df, columns = cat_cols, drop_first = True)



X = modeling_df.drop(['output'],axis=1)
# Use a 1-D Series for the target, which is the shape scikit-learn estimators expect
y = modeling_df['output']





X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=.10, random_state=0)
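
As a rough illustration of what dummy encoding with `drop_first=True` does, the sketch below applies `pd.get_dummies` to a tiny made-up frame. The values are invented for illustration and are not taken from the dataset. It also shows that `get_dummies` returns a new frame rather than modifying its input.

```python
import pandas as pd

# Toy data (invented): 'cp' is categorical with three levels, 'age' is numeric.
toy = pd.DataFrame({"age": [63, 37, 41, 56], "cp": [3, 2, 1, 1]})

# get_dummies returns a new DataFrame; the original frame is left unchanged.
encoded = pd.get_dummies(toy, columns=["cp"], drop_first=True)

# The first level (cp == 1) is dropped and becomes the implicit reference category.
print(list(encoded.columns))
print(list(toy.columns))
```

Dropping the first level avoids a redundant column, since its value is implied when all the other dummy columns are zero.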

## Establishing Baseline Performance

To understand whether our models add any value, we need a baseline to compare them against.

In [None]:
# Class counts of the training labels (not the features) define the baseline
y_train.value_counts()

In [None]:
# Baseline accuracies from the training-label class counts (130 positive, 111 negative)
print('All-positive model accuracy:', 130/y_train.size)

print('All-negative model accuracy:', 111/y_train.size)

Since the all-positive model has the higher accuracy, we use it as our baseline. This means that our models must beat an accuracy of 53.94%.
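
The baseline arithmetic can be sketched without hard-coding which class wins. The helper name `majority_baseline_accuracy` is hypothetical, and the label list is rebuilt from the class counts used in the cell above (130 positive, 111 negative) rather than read from the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Labels rebuilt from the class counts in the cell above: 130 positive, 111 negative.
train_labels = [1] * 130 + [0] * 111

cls, acc = majority_baseline_accuracy(train_labels)
print(f"Always predicting {cls} gives accuracy {acc:.4f}")
```

Any model worth keeping should clear this number on held-out data, not just on the training labels.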

## Model Selection

### Defining model functions

This is a function to plot the ROC curve for each model.

In [None]:
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

This function trains a model on the training data, predicts on the validation set, and reports a confusion matrix, a classification report, the ROC curve, and the AUC score.

In [None]:
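def fit_model(model):
    # Fit on the training split and predict hard labels for the validation split
    model.fit(X_train, y_train)
    val_preds = model.predict(X_val)
    print(pd.DataFrame(confusion_matrix(y_val, val_preds),
                       columns=["Predicted No", "Predicted Yes"],
                       index=["No", "Yes"]))
    print('\n')
    print(classification_report(y_val, val_preds))

    # Probability of the positive class, used for both the ROC curve and the AUC
    probs = model.predict_proba(X_val)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_val, probs)
    plot_roc_curve(fpr, tpr)
    # Compute AUC from probabilities; hard labels would reduce the curve to one threshold
    print('auc score: ' + str(roc_auc_score(y_val, probs)))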
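
A small, self-contained sketch of why AUC is usually computed from predicted probabilities rather than hard class labels. The helper `pairwise_auc` is hypothetical and uses the pairwise-ranking definition of AUC (ties count as half); the labels and probabilities are made up for illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.40, 0.60, 0.55, 0.70, 0.90]

# Hard labels at a 0.5 cut-off, roughly what a classifier's predict() would return.
hard = [int(p >= 0.5) for p in probs]

auc_from_probs = pairwise_auc(y_true, probs)
auc_from_labels = pairwise_auc(y_true, hard)
print(auc_from_probs, auc_from_labels)
```

Thresholding throws away the ordering among samples on the same side of the cut-off, so the two values generally differ.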

## Model Fitting

### Logistic Regression

In [None]:
log_model = LogisticRegression(max_iter=700)

fit_model(log_model)

### K-nearest Neighbors

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=50)

fit_model(knn_model)

### Decision Tree Classifier

In [None]:
tree_model = DecisionTreeClassifier(criterion='entropy', max_depth=5, max_leaf_nodes=10)

fit_model(tree_model)

### Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(n_estimators=1000, criterion = "entropy")

fit_model(rf_model)

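### XGBoost Classifier

In [None]:
# Note: max_depth is a tree-booster parameter and is likely ignored when booster="gblinear"
xgb = XGBClassifier(max_depth = 100,learning_rate = .07, booster = "gblinear" )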

fit_model(xgb)

# Conclusion

Of all the models we tried, the XGBoost classifier performed best. It beat the baseline accuracy by about 40% and was strongest at identifying people at higher risk of a heart attack. However, both the dataset and the validation set are very small, so we should be cautious about how well this model will perform on new, unseen data.
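
To make the small-sample caveat concrete, here is a minimal sketch of a Wilson score interval for an accuracy measured on a small validation set. The helper name `wilson_interval` and the counts (28 correct out of 31) are hypothetical and are not results from this notebook.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z=1.96)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical numbers: 28 correct predictions out of 31 validation rows.
low, high = wilson_interval(28, 31)
print(f"accuracy = {28 / 31:.3f}, 95% interval = ({low:.3f}, {high:.3f})")
```

With only a few dozen validation rows the interval spans many percentage points, which is why a single accuracy figure should be read with caution.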

## Next Steps

The next step would be to collect more data. The dataset is very small, and a larger one could support a much better model. With more records, deep learning methods would also be interesting to apply here.