# **HEART ATTACK ANALYSIS AND PREDICTION**

![](https://source.wustl.edu/wp-content/uploads/2019/02/HeartImage-760x594.jpg)

# Overview

According to Wikipedia, a heart attack, or myocardial infarction, commonly occurs when blood flow to a part of the heart decreases or stops, damaging the heart muscle. The most common symptom is chest pain or discomfort, which may travel into the shoulder, arm, back, neck or jaw. According to a medical survey in the USA, about 647,000 people die of heart disease every year, making it the leading cause of death. According to the Centers for Disease Control and Prevention (CDC), an American has a heart attack approximately every 40 seconds. The situation is much the same in countries like India.
Through the analysis and visualisations in this notebook we will try to get to the bottom of this problem and figure out which features are most strongly associated with heart attacks.
All conclusions depend entirely on the information provided in the data.

# How will we proceed?

1. **Understanding the Data**

2. **EDA**

3. **Model Building**

4. **Model Performance**

5. **Inference**


# **UNDERSTANDING THE DATA**

# Including Required Packages 

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

**READING THE DATA**

In [None]:
df= pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df.head()


In [None]:
df.shape

So the dataset has 14 columns: 13 features plus the `output` column that we want to predict.

In [None]:
df.info()

**DESCRIPTION OF THE DATASET**

In [None]:
df.describe()

**Let us check whether we have any missing values**

In [None]:
# Collect every column that contains at least one missing value
features_with_na = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
features_with_na

Great! The list is empty, so we don't have to handle missing values.
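
If a dataset did contain gaps, a per-column summary would be more useful than a bare list of names. The following is a minimal sketch of such a report; the helper `missing_value_report` and the toy frame are hypothetical and only illustrate the idea, they are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_value_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(frame)).round(2),
    })
    # Keep only columns with at least one missing value, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame, used only to show the shape of the report
toy = pd.DataFrame({
    "age": [63, np.nan, 41, 56],
    "chol": [233, 250, np.nan, np.nan],
    "sex": [1, 0, 1, 1],
})
report = missing_value_report(toy)
print(report)
```

Calling `missing_value_report(df)` on the heart data should return an empty frame if no column has missing values.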

# **EDA**

**Number of Numerical Variables**

In [None]:
numerical_features = [feature for feature in df.columns if df[feature].dtypes != 'O']
len(numerical_features),df.shape

So all of the features are numerical variables!

**Next, let us find out how many of them are discrete.**

In [None]:
discrete_feature=[feature for feature in numerical_features if len(df[feature].unique())<25]
print("Discrete Variables Count: {}".format(len(discrete_feature)))

In [None]:
discrete_feature

**LET US FIND OUT THE RELATION BETWEEN EACH DISCRETE FEATURE AND THE OUTPUT**

In [None]:
for feature in discrete_feature:
    if feature == 'output':
        continue  # skip the target itself
    # The mean of a 0/1 target is the share of positive cases in each group
    df.groupby(feature)['output'].mean().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('mean output')
    plt.title(feature)
    plt.show()
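
A bar chart of a per-group summary hides how many rows stand behind each bar. As a rough sketch, the hypothetical helper `positive_rate_by` below returns the group size next to the share of positive cases; the toy data is invented for illustration.

```python
import pandas as pd


def positive_rate_by(frame, feature, target="output"):
    """Row count and share of target == 1 for each level of `feature`."""
    grouped = frame.groupby(feature)[target].agg(n="count", positive_rate="mean")
    return grouped.sort_index()


# Hypothetical toy data: 3 rows with sex == 0 and 5 rows with sex == 1
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 1, 1, 1, 1, 1],
    "output": [1, 1, 0, 1, 0, 0, 0, 1],
})
rates = positive_rate_by(toy, "sex")
print(rates)
```

A rate computed from a handful of rows deserves less trust than one computed from a large group, which is why the `n` column is worth reading alongside the plot.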

**Now let's deal with the Continuous Variables**

In [None]:
continuous_feature=[feature for feature in numerical_features if feature not in discrete_feature]
print("Continuous feature Count {}".format(len(continuous_feature)))

In [None]:
for feature in continuous_feature:
    data=df.copy()
    data[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,8))
sns.heatmap(df.corr(),annot=True,ax=ax)
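
The heatmap shows every pairwise correlation, but for prediction the column of correlations with `output` matters most. A small sketch of how that column could be ranked is given here; `rank_by_target_correlation` is a hypothetical helper and the data is synthetic.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(frame, target="output"):
    """Sort features by the absolute Pearson correlation with the target column."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic data: one feature tied to the target, one pure noise feature
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy = pd.DataFrame({
    "strong": signal + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
    "output": (signal > 0).astype(int),
})
ranking = rank_by_target_correlation(toy)
print(ranking)
```

Pearson correlation only captures linear association, so a low value here does not prove that a feature is useless.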

**Results against the Age**

In [None]:
sns.displot(x='age', hue='output', data=df, alpha=0.6)
plt.show()

In [None]:
# .copy() so that later column assignments do not touch a view of df
attack = df[df['output']==1].copy()
sns.displot(attack.age, kind='kde')
plt.show()

In [None]:
sns.displot(attack.age, kind='ecdf')
plt.grid(True)
plt.show()

In [None]:
ranges = [0, 30, 40, 50, 60, 70, np.inf]
labels = ['0-30', '30-40', '40-50', '50-60', '60-70', '70+']

attack['age'] = pd.cut(attack['age'], bins=ranges, labels=labels)
attack['age'].head()

In [None]:
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=attack['age'])

**WE SEE THAT THE 50-60 AGE GROUP HAS THE MOST POSITIVE CASES**
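
A count of positive cases per age group depends on how many people of each age are in the dataset, so it helps to look at the rate within each group as well. The sketch below reuses the age bins from the cell above; the helper `positives_by_age_bin` and the toy ages are hypothetical.

```python
import numpy as np
import pandas as pd

AGE_BINS = [0, 30, 40, 50, 60, 70, np.inf]
AGE_LABELS = ["0-30", "30-40", "40-50", "50-60", "60-70", "70+"]


def positives_by_age_bin(frame, age_col="age", target="output"):
    """Per age bin: total rows, positive rows, and the positive rate."""
    binned = pd.cut(frame[age_col], bins=AGE_BINS, labels=AGE_LABELS)
    summary = frame.groupby(binned, observed=False)[target].agg(total="count", positives="sum")
    # Empty bins give 0 / 0, which pandas reports as NaN
    summary["positive_rate"] = summary["positives"] / summary["total"]
    return summary


# Hypothetical toy data: the 50-60 bin has the most positives but not the highest rate
toy = pd.DataFrame({
    "age":    [35, 45, 45, 55, 55, 55, 55, 55, 55, 65],
    "output": [1,  1,  1,  1,  1,  1,  0,  0,  0,  0],
})
summary = positives_by_age_bin(toy)
print(summary)
```

In the toy data the 50-60 bin holds the most positive cases simply because it holds the most rows, while its rate is lower than that of the 40-50 bin.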

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))
sns.countplot(x='sex', hue='age', data=attack, ax=ax)



In [None]:
attack = df[df['output'] == 1]
sns.displot(x='age', kind='kde', hue='sex', data=attack)


**WE NOTICE THAT MORE OF THE POSITIVE CASES ARE MALE (sex = 1)**

In [None]:
male_attack=attack[attack['sex']==1]

In [None]:
sns.countplot(x=male_attack['age'])

In [None]:
for feature in continuous_feature:
    data=df.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature]=np.log(data[feature])
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

**PREPARING THE DATASET FOR MODEL**

In [None]:
# Work on a real copy so the original DataFrame stays untouched
data = df.copy()

In [None]:

scaler = StandardScaler()

# define the columns to be encoded and scaled
categorical_vars = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
continuous_vars = ["age","trtbps","chol","thalachh","oldpeak"]

# one-hot encode the categorical columns (dtype=int keeps every column numeric)
data = pd.get_dummies(data, columns=categorical_vars, drop_first=True, dtype=int)

# scale the continuous columns
# note: the scaler is fitted on all rows, before the train/test split,
# so the test rows influence the scaling statistics
data[continuous_vars] = scaler.fit_transform(data[continuous_vars])

# defining the features and target (a 1-D Series is what scikit-learn expects)
X = data.drop(['output'], axis=1)
y = data['output']



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.1)
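
In the cells above the scaler is fitted on every row before the split, so the test rows leak into the scaling statistics. One common way to avoid this is to wrap the preprocessing and the model in a single scikit-learn `Pipeline` that is fitted on the training rows only. The following is a sketch under that assumption, using a synthetic stand-in for the heart data; all values and the names `toy`, `X_tr`, `X_te` and `pipe` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the heart data (hypothetical values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(30, 78, size=n),
    "chol": rng.normal(245, 50, size=n),
    "cp": rng.integers(0, 4, size=n),
    "sex": rng.integers(0, 2, size=n),
})
toy["output"] = (toy["cp"] + rng.normal(size=n) > 1.5).astype(int)

X_toy = toy.drop(columns="output")
y_toy = toy["output"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42, stratify=y_toy
)

# Scaling and encoding are learned inside the pipeline, from training rows only
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["age", "chol"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["cp", "sex"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print("test accuracy: {:.3f}".format(test_accuracy))
```

The fixed `random_state` and `stratify` arguments are choices made for this sketch; they make the split repeatable and keep the class balance similar in both parts.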

# **Models**

In [None]:
lr = LogisticRegression(random_state=42)

knn = KNeighborsClassifier()
para_knn = {'n_neighbors':np.arange(1, 50)}

grid_knn = GridSearchCV(knn, param_grid=para_knn, cv=5)

dt = DecisionTreeClassifier()
para_dt = {'criterion':['gini','entropy'],'max_depth':np.arange(1, 50), 'min_samples_leaf':[1,2,4,5,10,20,30,40,80,100]}
grid_dt = GridSearchCV(dt, param_grid=para_dt, cv=5)

rf = RandomForestClassifier()

# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators':[100, 350, 500],
    'min_samples_leaf':[2, 10, 30]
}
grid_rf = GridSearchCV(rf, param_grid=params_rf, cv=5)
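
The grid-search objects defined above only take effect once `fit` is called on them. As a minimal sketch of that step, the example below runs a search for k-nearest neighbours on synthetic data and reads back the selected setting; the data and the names `X_demo`, `y_demo` and `search` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": np.arange(1, 30, 2)},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)  # runs 5-fold cross-validation for every candidate

print("best params:", search.best_params_)
print("best CV accuracy: {:.3f}".format(search.best_score_))

# By default the best setting is refitted on all the data passed to fit()
best_knn = search.best_estimator_
```

In the notebook the equivalent call would be something like `grid_knn.fit(X_train, y_train)`, followed by reading `grid_knn.best_params_`.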

In [None]:
dt = DecisionTreeClassifier(criterion='gini', max_depth=9, min_samples_leaf=10, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=2, random_state=42)

In [None]:
# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt), ('Random Forest', rf)]

In [None]:
for clf_name, clf in classifiers:

    # Fit clf to the training set (ravel gives the 1-D target scikit-learn expects)
    clf.fit(X_train, np.ravel(y_train))

    # Predict y_pred
    y_pred = clf.predict(X_test)

    # Calculate accuracy (true labels first, then predictions)
    accuracy = accuracy_score(y_test, y_pred)

    # Print clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

**WE SEE THAT LOGISTIC REGRESSION PERFORMS THE BEST ON THIS TEST SPLIT**
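
A ranking based on one small test split can change with a different split. Cross-validation gives a steadier comparison by averaging over several splits. The sketch below shows the idea on synthetic data; the data and the names `candidates` and `cv_results` are hypothetical, and the model settings are simplified for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the heart dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbours": KNeighborsClassifier(n_neighbors=3),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# The same 10 stratified folds are reused for every model
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print("{:s} : {:.3f} +/- {:.3f}".format(name, scores.mean(), scores.std()))
```

The standard deviation across folds gives a rough sense of how much of the gap between two models could be due to the split alone.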

In [None]:
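from sklearn.ensemble import AdaBoostClassifier

# The base estimator is passed positionally because its keyword name differs
# between scikit-learn releases (base_estimator in older ones, estimator in newer ones)
ada = AdaBoostClassifier(rf, n_estimators=100, random_state=1)

ada.fit(X_train, np.ravel(y_train))

y_pred = ada.predict(X_test)

# true labels first, then predictions
accuracy_score(y_test, y_pred)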

In [None]:
# Importances come from the random forest fitted in the loop above
importances = pd.Series(data=rf.feature_importances_,
                        index=X_train.columns)

# Sort importances
importances_sorted = importances.sort_values()

# Draw a horizontal barplot of importances_sorted
plt.figure(figsize=(10, 10))
importances_sorted.plot(kind='barh', color='orange')
plt.title('Feature Importances')
plt.show()

# NEURAL NETWORK APPROACH

**IMPORTING THE NECESSARY LIBRARIES**

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout

In [None]:

model = tf.keras.Sequential()
model.add(Dense(1024, input_dim=22, activation= "relu"))
model.add(Dropout(0.3))
model.add(Dense(512, activation= "relu"))
model.add(Dropout(0.4))
model.add(Dense(128, activation= "relu"))
model.add(Dropout(0.2))
model.add(Dense(32, activation= "relu"))
model.add(Dropout(0.2))
# sigmoid output: binary_crossentropy expects a probability between 0 and 1
model.add(Dense(1, activation="sigmoid"))
model.summary() #Print model Summary

In [None]:
model.compile(loss= "binary_crossentropy" , optimizer="adam", metrics=["accuracy"])
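
With its default settings, the `binary_crossentropy` loss treats the model output as a probability, which is why the last layer of a binary classifier usually ends in a sigmoid. The NumPy sketch below spells out the formula; the function names and the example numbers are illustrative and are not taken from Keras.

```python
import numpy as np


def sigmoid(z):
    """Map raw scores (logits) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


# Hypothetical labels and raw network outputs
y_true = np.array([1, 0, 1, 0])
logits = np.array([2.0, -1.0, 0.5, 3.0])

# The loss is computed on probabilities, so the logits go through the sigmoid first
loss = binary_cross_entropy(y_true, sigmoid(logits))
print(round(loss, 4))
```

A prediction of 0.5 for every row gives a loss of log(2), about 0.693, which is a handy reference point when reading the training log.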

In [None]:
Performance = model.fit(X_train, y_train, validation_split =0.1,epochs=30)

In [None]:
model.evaluate(X_test,y_test)

In [None]:
my_dpi = 50 # dots per inch .. (resolution)
plt.figure(figsize=(400/my_dpi, 400/my_dpi), dpi = my_dpi)
plt.plot(Performance.history['accuracy'], label='train accuracy')
plt.plot(Performance.history['val_accuracy'], label='val accuracy')
plt.legend()
# Save before show(): once the figure has been shown, savefig writes an empty image
plt.savefig('AccVal_acc')
plt.show()

# Inference

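On the held-out test split (10% of the data), the models reached the following accuracies:
1. Logistic Regression : 0.871
2. K Nearest Neighbours : 0.742
3. Classification Tree : 0.742
4. Random Forest : 0.839
5. AdaBoost Classifier : 0.806
6. ANN : 0.780

Because `train_test_split` is called without a fixed `random_state`, these figures will change from run to run.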
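
With a test split this small, a single accuracy figure carries a wide margin of error. As a rough illustration, the sketch below computes an approximate 95% Wilson score interval for a proportion. The counts used (27 correct out of 31) are an assumption that happens to match 0.871; the notebook itself does not report the size of the test set.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical example: 27 correct predictions out of 31 test rows
low, high = wilson_interval(27, 31)
print("{:.3f} to {:.3f}".format(low, high))
```

Under these assumed counts the interval spans roughly 0.71 to 0.95, wide enough to cover most of the other accuracies in the list, so the ranking of the models should be read with caution.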


# Acknowledgements


[Sarthak Bobde](https://www.kaggle.com/sarthakbobde/heart-attack-analysis-and-classifier)

[Jędrzej Dudzicz](https://www.kaggle.com/jedrzejdudzicz/heart-attack-analysis-prediciton)