<h1><center>💓Heart Attack Data Analysis🔎</center></h1>
<h3><center>🩺(Prediction at the end)🔮</center></h3>
<center><img src= "https://intermountainhealthcare.org/-/media/images/modules/blog/posts/2020/02/how-to-know-if-youre-having-a-heart-attack.jpg?la=en&h=595&w=896&mw=896&hash=86751E3A3A3AFDCD26067D5A66E861E4ED92AC72" alt="Heart attack illustration" style='width: 400px;'></center>

<h3>What is a heart attack?</h3>
<p>
A heart attack, also called a myocardial infarction, happens when a part of the heart muscle doesn’t get enough blood.
    
The more time that passes without treatment to restore blood flow, the greater the damage to the heart muscle.
    
Coronary artery disease (CAD) is the main cause of heart attack. A less common cause is a severe spasm, or sudden contraction, of a coronary artery that can stop blood flow to the heart muscle.
</p>

<h3>What are the symptoms of heart attack?</h3>
<p>
The major symptoms of a heart attack are :
    
* Chest pain or discomfort. Most heart attacks involve discomfort in the center or left side of the chest that lasts for more than a few minutes or that goes away and comes back. The discomfort can feel like uncomfortable pressure, squeezing, fullness, or pain.
    
* Feeling weak, light-headed, or faint. You may also break out into a cold sweat.
    
* Pain or discomfort in the jaw, neck, or back.
    
* Pain or discomfort in one or both arms or shoulders.
    
* Shortness of breath. This often comes along with chest discomfort, but shortness of breath also can happen before chest discomfort.
</p>

# Exploratory Data Analysis

## Aim :
- Understand the data ("A small step forward is better than a big one backwards")
- Begin to develop a modelling strategy

## Features

- Age : Age of the patient

- Sex : Sex of the patient

- exang: exercise induced angina (1 = yes; 0 = no)

- ca: number of major vessels (0-3)

- cp : chest pain type :
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic
  
- trtbps : resting blood pressure (in mm Hg)

- chol : cholesterol in mg/dl fetched via BMI sensor

- fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

- rest_ecg : resting electrocardiographic results :
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

- thalach : maximum heart rate achieved

- output (target) : 0 = less chance of heart attack; 1 = more chance of heart attack


## Base Checklist
#### Shape Analysis :
- **target feature** : output
- **rows and columns** : 303 , 14
- **features types** : qualitatives : 0 , quantitatives : 14
- **NaN analysis** :
    - NaN (0 % of NaN)

#### Columns Analysis :
- **Target Analysis** :
    - Balanced (Yes/No) : Yes
    - Percentages : 55% / 45%
- **Categorical values**
    - There are 8 categorical features (0/1) (not including the target)
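
The checklist above can be reproduced with a few pandas calls. Here is a minimal sketch, assuming a hypothetical helper `base_checklist` and a made-up toy frame standing in for `heart.csv`; the `max_levels` cut-off used to call a column "categorical" is an assumption, not a rule from the dataset.

```python
import numpy as np
import pandas as pd

def base_checklist(frame: pd.DataFrame, target: str = "output", max_levels: int = 5) -> dict:
    """Summarise shape, missing values, target balance and low-cardinality columns."""
    features = frame.drop(columns=[target])
    # Assumption: a column with at most `max_levels` distinct values is treated as categorical.
    categorical = [c for c in features.columns if features[c].nunique() <= max_levels]
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "nan_percent": float(frame.isna().mean().mean() * 100),
        "target_share": frame[target].value_counts(normalize=True).round(2).to_dict(),
        "categorical": categorical,
    }

# Toy stand-in for heart.csv: the values are synthetic, only the column names mimic the dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(29, 78, size=20),
    "sex": rng.integers(0, 2, size=20),
    "cp": rng.integers(0, 4, size=20),
    "chol": rng.integers(126, 565, size=20),
    "output": [1] * 11 + [0] * 9,
})

summary = base_checklist(toy)
print(summary)
```

On the real data, calling `base_checklist(df)` after loading the CSV should give the numbers to compare against the checklist.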

In [None]:
!pip install dataprep

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

from dataprep.eda import create_report
from dataprep.eda import plot_missing
from dataprep.eda import plot_correlation
from dataprep.eda import plot

from sklearn.model_selection import train_test_split, learning_curve, cross_val_score, GridSearchCV

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, roc_curve

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler,StandardScaler,MinMaxScaler

import warnings
warnings.filterwarnings('ignore')

## Dataset Analysis

In [None]:
data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df = data.copy()
pd.set_option('display.max_rows',df.shape[0])
pd.set_option('display.max_columns',df.shape[1])
df.head()

In [None]:
(df.isna().sum()/df.shape[0]*100).sort_values(ascending=False)

In [None]:
plot_missing(df)

In [None]:
print('There are' , df.shape[0] , 'rows')
print('There are' , df.shape[1] , 'columns')

In [None]:
df.duplicated().sum()

In [None]:
df.loc[df.duplicated(keep=False),:]

In [None]:
df.drop_duplicates(keep='first',inplace=True)
df.shape

## Visualising Target and Features

In [None]:
df['output'].value_counts(normalize=True) # class proportions of the target

In [None]:
for col in df.select_dtypes(include=['float64','int64']):
    # displot is figure-level and creates its own figure, so no plt.figure() is needed
    sns.displot(df[col],kind='kde',height=3)
    plt.show()

In [None]:
X = df.drop('output',axis=1)
y = df['output']

## Detailed Analysis

In [None]:
riskyDF = df[y == 1]
safeDF = df[y == 0]

In [None]:
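# pairplot is figure-level: its size is set through height, not plt.figure()
# df (duplicates removed) is used instead of the raw data
sns.pairplot(df,height=1.5)
plt.show()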

In [None]:
corr = df.corr(method='pearson').abs()

fig = plt.figure(figsize=(8,6))
# absolute correlations lie in [0, 1], so use a sequential colormap over that range
sns.heatmap(corr, annot=True, cmap='viridis', vmin=0, vmax=1)
plt.title('Absolute Pearson Correlation')
plt.show()

print(corr['output'].sort_values())

In [None]:
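# iterate over the features only (X), since the target is constant within each group
for col in X.select_dtypes(include=['float64','int64']):
    plt.figure(figsize=(4,4))
    # sns.distplot is deprecated; histplot with kde=True draws a comparable density plot
    sns.histplot(riskyDF[col],kde=True,stat='density',color='tab:red',label='High Risk')
    sns.histplot(safeDF[col],kde=True,stat='density',color='tab:blue',label='Low Risk')
    plt.legend()
    plt.show()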

### Comments

It looks like we have some very useful features here, with an absolute correlation with the target above 0.4.
The following features seem promising for predicting whether a patient will have a heart attack or not :
- **oldpeak**
- **exng**
- **cp**
- **thalachh**

We can also notice that **slp** and **oldpeak** look correlated, let's find out !
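
Both claims can also be checked with numbers rather than by reading the heatmap. The sketch below assumes a hypothetical helper `strong_features` and a synthetic frame whose columns only borrow the dataset's names (`oldpeak`, `slp`, `chol`, `output`); the relationships in it are made up for illustration and say nothing about the real data.

```python
import numpy as np
import pandas as pd

def strong_features(frame: pd.DataFrame, target: str = "output", threshold: float = 0.4) -> pd.Series:
    """Features whose absolute Pearson correlation with the target exceeds the threshold."""
    corr = frame.corr(method="pearson")[target].drop(target).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# Synthetic stand-in: oldpeak depends on the target, slp depends on oldpeak, chol is noise.
rng = np.random.default_rng(0)
n = 300
output = rng.integers(0, 2, size=n)
oldpeak = 2.0 - 1.5 * output + rng.normal(0, 0.5, size=n)
toy = pd.DataFrame({
    "oldpeak": oldpeak,
    "slp": -0.8 * oldpeak + rng.normal(0, 0.3, size=n),
    "chol": rng.normal(245, 50, size=n),
    "output": output,
})

strong = strong_features(toy)
pair_corr = toy["slp"].corr(toy["oldpeak"])  # correlation between two features
print(strong)
print("corr(slp, oldpeak) =", round(pair_corr, 3))
```

With the real data, `strong_features(df)` and `df['slp'].corr(df['oldpeak'])` would be the equivalent calls.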

In [None]:
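for col in X.select_dtypes(include=['float64','int64']):
    if col == 'oldpeak':
        continue  # skip plotting oldpeak against itself
    # lmplot is figure-level and creates its own figure, so no plt.figure() is needed
    sns.lmplot(x='oldpeak', y=col, hue='output', data=df, height=4)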

In [None]:
create_report(df)

# A bit of data engineering ...

In [None]:
def encoding(df):
    code = {
            # All columns already hold numeric values, so there is no need to encode the features
           }
    for col in df.select_dtypes('object'):
        df.loc[:,col]=df[col].map(code)
        
    return df

def imputation(df):
    
    df = df.dropna(axis=0) # There are no NaN anyways
    
    return df

def feature_engineering(df):
    useless_columns = [] # Let's consider we want to use all the features
    df = df.drop(useless_columns,axis=1)
    return df

In [None]:
def preprocessing(df):
    df = encoding(df)
    df = feature_engineering(df)
    df = imputation(df)
    
    X = df.drop('output',axis=1)
    y = df['output']    
      
    return df,X,y

### Comments
We can now analyze categorical features as quantitative features (note: there are no qualitative features to encode here).

# Modelling

In [None]:
df = data.copy()
trainset, testset = train_test_split(df, test_size=0.2, random_state=0)
print(trainset['output'].value_counts())
print(testset['output'].value_counts())

In [None]:
_, X_train, y_train = preprocessing(trainset)
_, X_test, y_test = preprocessing(testset)

In [None]:
preprocessor = make_pipeline(MinMaxScaler())

PCAPipeline = make_pipeline(StandardScaler(), PCA(n_components=2,random_state=0))

RandomPipeline = make_pipeline(preprocessor,RandomForestClassifier(random_state=0))
AdaPipeline = make_pipeline(preprocessor,AdaBoostClassifier(random_state=0))
SVMPipeline = make_pipeline(preprocessor,SVC(random_state=0,probability=True))
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier())
LRPipeline = make_pipeline(preprocessor,LogisticRegression(solver='sag'))

## PCA Analysis

In [None]:
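# keep X's index so that concat aligns each projected row with its own label
# (df lost a row in drop_duplicates, so a fresh 0..n-1 index would misalign with y)
PCA_df = pd.DataFrame(PCAPipeline.fit_transform(X), index=X.index)
PCA_df = pd.concat([PCA_df, y], axis=1)
PCA_df.head()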

In [None]:
plt.figure(figsize=(8,8))
# recent seaborn versions require x and y to be passed as keywords
sns.scatterplot(x=PCA_df[0],y=PCA_df[1],hue=PCA_df['output'],palette=sns.color_palette("tab10", 2))
plt.show()
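
A 2D scatter is only as informative as the variance the two components keep. As a rough sketch (on random data, not on the heart dataset), the share of variance captured by each component can be read from the fitted PCA step; the step name `"pca"` is the one `make_pipeline` assigns automatically, and `X_demo` is a made-up array.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features; column 1 is made nearly redundant with column 0 on purpose.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=0))
pca_pipeline.fit(X_demo)

# explained_variance_ratio_ holds the fraction of total variance kept by each component
ratios = pca_pipeline.named_steps["pca"].explained_variance_ratio_
print("explained variance per component:", ratios.round(3))
print("total kept by 2 components:", round(float(ratios.sum()), 3))
```

If the total is low on the real data, overlapping classes in the scatter plot do not necessarily mean the classes are hard to separate in the full feature space.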

## Classification problem

In [None]:
dict_of_models = {'RandomForest': RandomPipeline,
'AdaBoost': AdaPipeline,
'SVM': SVMPipeline,
'KNN': KNNPipeline,
'LR': LRPipeline}

In [None]:
def evaluation(model):
    model.fit(X_train, y_train)
    # calculating the probabilities
    y_pred_proba = model.predict_proba(X_test)

    # predicted class = index of the highest probability
    y_pred = np.argmax(y_pred_proba,axis=1)
    print('Accuracy = ', accuracy_score(y_test, y_pred))
    print('-')
    print(confusion_matrix(y_test,y_pred))
    print('-')
    print(classification_report(y_test,y_pred))
    print('-')
    
    N, train_score, val_score = learning_curve(model, X_train, y_train, 
                                               cv=4, scoring='f1', 
                                               train_sizes=np.linspace(0.1,1,10))
    plt.figure(figsize=(12,8))
    plt.plot(N, train_score.mean(axis=1), label='train score')
    plt.plot(N, val_score.mean(axis=1), label='validation score')
    plt.legend()

In [None]:
for name, model in dict_of_models.items():
    print('---------------------------------')
    print(name)
    evaluation(model)

### Comments
#### All 5 models look promising, but **AdaBoost** has a slightly better accuracy **(90%)**
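
Since a 20% test split of about 300 rows is small, differences of a few points between models can come from the split alone. One possible cross-check, sketched here on synthetic data from `make_classification` (not the heart dataset) and with a reduced, illustrative set of models, is to compare mean cross-validated accuracy and its spread.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary problem with 13 features, standing in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=240, n_features=13,
                                     n_informative=5, random_state=0)

models = {
    "RandomForest": make_pipeline(MinMaxScaler(), RandomForestClassifier(random_state=0)),
    "AdaBoost": make_pipeline(MinMaxScaler(), AdaBoostClassifier(random_state=0)),
    "KNN": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

In the notebook, the same loop could run over `dict_of_models` with `X_train` and `y_train`.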

# Using AdaBoost

In [None]:
AdaPipeline.fit(X_train, y_train)
y_proba = AdaPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)

print("Adaboost : ", accuracy_score(y_test, y_pred))

In [None]:
y_pred_prob = AdaPipeline.predict_proba(X_test)[:,1]

fpr,tpr,thresholds=roc_curve(y_test,y_pred_prob)

plt.plot(fpr,tpr,label='AdaBoost ROC Curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("AdaBoost ROC Curve")
plt.show()
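
The ROC curve is often summarised by the area under it (AUC), which `roc_auc_score` computes from the same inputs as `roc_curve`. A tiny hand-checkable sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC = share of (positive, negative) pairs where the positive gets the higher score.
# Here 3 of the 4 pairs are ranked correctly, so the AUC is 0.75.
auc = roc_auc_score(y_true, y_score)
print("AUC =", auc)
```

For the model above, the equivalent call would be `roc_auc_score(y_test, y_pred_prob)`.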

# Using KNN

In [None]:
KNNPipeline.fit(X_train, y_train)
y_proba = KNNPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)

print("KNN : ", accuracy_score(y_test, y_pred))

## KNN Optimization

In [None]:
err = []
  
for i in range(1, 40):
    
    model = make_pipeline(preprocessor,KNeighborsClassifier(n_neighbors = i))
    model.fit(X_train, y_train)
    pred_i = model.predict(X_test)
    err.append(np.mean(pred_i != y_test))
  
plt.figure(figsize =(10, 8))
plt.plot(range(1, 40), err, color ='blue',
                linestyle ='dashed', marker ='o',
         markerfacecolor ='blue', markersize = 8)
  
plt.title('Mean Err = f(K)')
plt.xlabel('K')
plt.ylabel('Mean Err')

In [None]:
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier(n_neighbors = 7))
KNNPipeline.fit(X_train, y_train)
y_proba = KNNPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)

print("KNN : ", accuracy_score(y_test, y_pred))
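
Note that the error curve above picks K by looking at the test set, so the final test accuracy is likely to be optimistic. A possible alternative, sketched below on synthetic data (not the heart dataset), is to let `GridSearchCV` choose K by cross-validation on the training split only and to touch the test split once at the end. The parameter name `kneighborsclassifier__n_neighbors` follows the step name that `make_pipeline` generates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0, stratify=y_demo)

knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 40))}

# K is selected with 4-fold cross-validation inside the training split only.
grid = GridSearchCV(knn, param_grid, cv=4, scoring="accuracy")
grid.fit(X_tr, y_tr)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_acc = grid.score(X_te, y_te)  # the test split is used once, after K is fixed
print("best K:", best_k)
print("test accuracy:", round(test_acc, 3))
```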
