**Heart Attack Analysis**

A heart attack happens when blood flow to part of the heart muscle is reduced or blocked, so the muscle does not get enough oxygen. The usual cause is a blood clot in a coronary artery. Chest pain is a common symptom, while high blood pressure and high cholesterol are risk factors that make a heart attack more likely.

**Summary**

The dataset has 14 columns and 303 rows. Five columns are numerical ('age', 'trtbps', 'chol', 'thalachh', 'oldpeak') and eight are categorical ('sex', 'cp', 'caa', 'fbs', 'restecg', 'exng', 'slp', 'thall'). The remaining column, 'output', is the response variable: it takes the value 1 for a higher risk of heart attack and 0 for a lower risk.
The dataset has no null values, but it does contain one duplicated row, which is removed.


In [None]:

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
ha =pd.read_csv("../input/heart-attack-analysis-prediction-dataset/heart.csv")

In [None]:
ha.head()

In [None]:
ha.shape

In [None]:
ha.info()

In [None]:
ha.isna().sum()

In [None]:
ha.drop_duplicates(inplace=True)
ha.reset_index(drop=True, inplace=True)
ha.shape
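Before dropping duplicates it can help to count and inspect them. The sketch below does this on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the heart dataset); the same calls should work on `ha`.

```python
import pandas as pd

# Toy stand-in for the heart dataset (values are made up for illustration).
toy = pd.DataFrame({
    "age": [63, 37, 41, 37],
    "chol": [233, 250, 204, 250],
    "output": [1, 1, 1, 1],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(toy.duplicated().sum())
print("duplicated rows:", n_dupes)

# keep=False marks all copies, which makes the repeated rows easy to inspect.
print(toy[toy.duplicated(keep=False)])

# Drop the repeats and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print("shape before:", toy.shape, "after:", deduped.shape)
```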

**Univariate Analysis**

*A few important features that can help determine the output*

**Categorical**

**Output**

In [None]:
ha['output'].value_counts().plot(kind='bar',grid=True,rot=0)

**SEX**

In [None]:
labels= ha['sex'].astype('category').cat.categories.tolist()
counts= ha['sex'].value_counts()
sizes= [counts[var_cat] for var_cat in labels]
figx1,ax1 = plt.subplots()
ax1.pie(sizes,labels=labels,autopct='%1.1f%%',shadow=True)
#autopct helps in showing the percentage on the pie chart
ax1.axis('equal')
plt.show()

This shows that 68.3% of the patients belong to the sex encoded as 1; at this point we do not yet know whether that is male or female.

In [None]:
ha.boxplot('age','sex',rot = 0,figsize=(5,6),grid=True)

In [None]:
fig, ax = plt.subplots(2,3, figsize=(20,18))
sns.countplot(x='fbs', data=ha, palette='magma', ax=ax[0][0]).set(title='Fasting Blood Sugar')
sns.countplot(x='exng', data=ha, palette='magma', ax=ax[0][1]).set(title='Exercise Induced Angina')
sns.countplot(x='restecg', data=ha, palette='magma', ax=ax[1][0]).set(title='Rest ECG')
sns.countplot(x='cp', data=ha, palette='magma', ax=ax[0][2]).set(title='Chest Pain Type')
sns.countplot(x='caa', data=ha, palette='magma', ax=ax[1][1]).set(title='Number of major vessels')
sns.countplot(x='thall', data=ha, palette='magma', ax=ax[1][2]).set(title='Thallium Stress Test')

**Numerical**

In [None]:
fig, ax = plt.subplots(2,2, figsize=(20,18))
sns.histplot(x=ha["age"], ax=ax[0][0], color="red", kde=True).set(title='Age')
sns.histplot(x=ha["trtbps"], ax=ax[0][1], color="blue", kde=True).set(title='Resting Blood Pressure')
sns.histplot(x=ha["chol"], ax=ax[1][0], color="orange", kde=True).set(title='Cholesterol Levels')
sns.histplot(x=ha["thalachh"], ax=ax[1][1], color="green", kde=True).set(title='Maximum Heart Rate Achieved')

**Bivariate analysis**

Our target here is the 'output' feature, which indicates whether the patient is likely to have a heart attack. We can perform bivariate analysis of each categorical feature against it.

In [None]:
def bi_ana(df,feature,target):
    sns.set(rc={'figure.figsize':(6,6)})
    
    ax=sns.countplot(x=feature,hue=target,data=df)
    
    for n in ax.patches:
        patch_height=n.get_height()
        if np.isnan(patch_height):
            patch_height=0
        ax.annotate('{}'.format(int(patch_height)), (n.get_x()+0.05, patch_height+10))
    plt.show()

In [None]:
bi_ana(ha,"sex",'output')
bi_ana(ha,"cp",'output')
bi_ana(ha,"fbs",'output')
bi_ana(ha,"restecg",'output')
bi_ana(ha,"exng",'output')
bi_ana(ha,"slp",'output')
bi_ana(ha,"caa",'output')
bi_ana(ha,"thall",'output')

**Insights from the above bivariate countplots**

1. From the univariate analysis, we found that 68.3 percent of patients are of one sex and the remaining 31.7 percent are of the other. What remains is to figure out which is which. For this, we can look at the Harvard studies, which report that men are more than twice as likely as women to have a heart attack. On that basis, the sex with more high-risk patients is more likely to be male and the other female. Going by the counts above, we take sex = 0 to be female, as it had fewer high-risk patients, and sex = 1 to be male, as it had more. This is an inference from the counts, not a documented encoding.
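Raw counts and per-group rates can point in different directions when the groups differ in size, so it may be worth checking both before reading risk off a countplot. The sketch below uses `pd.crosstab` on a made-up frame; `toy` and its values are assumptions for illustration and say nothing about the real dataset. On the notebook's data the same calls would take `ha['sex']` and `ha['output']`.

```python
import pandas as pd

# Made-up data: the larger group has more positives in absolute terms,
# but a lower share of positives.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "output": [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0],
})

# Absolute counts per (sex, output) pair, as shown by a countplot.
counts = pd.crosstab(toy["sex"], toy["output"])

# Row-normalised: the share of each output value within each sex.
rates = pd.crosstab(toy["sex"], toy["output"], normalize="index")

print(counts)
print(rates)
```

In this toy frame the group with more positive cases has the lower positive rate, which is why the row-normalised table is the safer one to compare.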

**Multivariate analysis**

Correlation matrix

In [None]:

f, ax = plt.subplots(figsize=(15, 8))
corr = ha.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
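A heatmap gives the overall picture, but the column of correlations with the target is often easier to read as a sorted list. The sketch below shows one way to extract it; the frame `toy` and its feature names are synthetic assumptions, and on the notebook's data the same lines would start from `ha.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature moves with the target, one against it, one is noise.
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "feat_pos": signal + rng.normal(scale=0.5, size=n),
    "feat_neg": -signal + rng.normal(scale=0.5, size=n),
    "feat_noise": rng.normal(size=n),
    "output": (signal > 0).astype(int),
})

corr = toy.corr()

# Keep only the target column, drop its self-correlation,
# and order by strength regardless of sign.
target_corr = corr["output"].drop("output").sort_values(key=np.abs, ascending=False)
print(target_corr)
```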

**Using Machine learning Algorithms**

**1. Logistic Regression**

In [None]:
h_num_cols =['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']
h_cat_cols= ['sex', 'cp', 'caa', 'fbs', 'restecg', 'exng', 'slp', 'thall']
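The two column lists are not used by the models that follow, which are fit on the raw columns. One common way to put such lists to work is to scale the numerical columns and one-hot encode the categorical ones inside a scikit-learn pipeline. The sketch below is an assumption about how that could look, not the notebook's method; `toy`, `num_cols` and `cat_cols` are illustrative stand-ins for `ha`, `h_num_cols` and `h_cat_cols`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the dataset (values are random, not real patients).
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "age": rng.integers(30, 80, size=n),
    "chol": rng.integers(150, 400, size=n),
    "cp": rng.integers(0, 4, size=n),
    "sex": rng.integers(0, 2, size=n),
})
toy["output"] = (toy["cp"] > 1).astype(int)  # toy target driven by 'cp'

num_cols = ["age", "chol"]
cat_cols = ["cp", "sex"]

# Scale numerical columns; expand each categorical code into indicator columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

features = toy[num_cols + cat_cols]
clf.fit(features, toy["output"])

n_features_out = clf.named_steps["prep"].transform(features).shape[1]
train_acc = clf.score(features, toy["output"])
print("encoded feature count:", n_features_out)
print("training accuracy:", train_acc)
```

One-hot encoding matters mostly for the linear model, since integer codes such as chest pain type are labels rather than quantities.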


In [None]:
#importing the model and the evaluation helpers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score,roc_curve,classification_report

In [None]:
#initializing the model
LR= LogisticRegression(max_iter=1000)

In [None]:
#Preparing the data
x = ha.drop('output', axis=1)
y = ha['output']


In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y, test_size=0.2, random_state=42)
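With a dataset this small, a plain random split can leave the test set with a different class balance from the training set. Passing `stratify=y` to `train_test_split` keeps the proportions aligned. The sketch below shows the effect on made-up arrays (`X` and `y` here are assumptions, not the notebook's `x` and `y`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 60% class 0 and 40% class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# stratify=y asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```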

In [None]:
print("shape of training set:",x_train.shape)
print("shape of testing set:",x_test.shape)

In [None]:
#fitting the model 
LR.fit(x_train,y_train)

In [None]:
#predicting 
ypred=LR.predict(x_test)

In [None]:
print(classification_report(y_test, ypred))
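`confusion_matrix` is imported above but never called. It complements the classification report by showing the raw counts behind precision and recall. A minimal sketch on hand-written labels (the two lists are illustrative assumptions; in the notebook the arguments would be `y_test` and `ypred`):

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration only.
y_true = [0, 1, 1, 0, 1]
y_hat = [0, 1, 0, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("true negatives:", tn, "false positives:", fp)
print("false negatives:", fn, "true positives:", tp)
```

In a screening setting the false negatives are usually the count to watch, since they are high-risk patients the model missed.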

In [None]:
#feature importance: pair each feature name with its coefficient, largest positive first
importance = list(zip(x.columns, LR.coef_.ravel()))
importance = sorted(importance, key=lambda pair: pair[1], reverse=True)
print(importance[0:2])

In [None]:
LR_acc_score= accuracy_score(y_test,ypred )
print("Accuracy of LogisticRegression:",LR_acc_score*100)

In [None]:
# plot feature importance top 5
top_columns, top_score = zip(*importance[:5])
plt.xticks(rotation=45)
plt.bar(top_columns, top_score)
plt.show()
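Logistic regression coefficients depend on the units of each feature, so ranking raw coefficients can understate features measured on large scales (a value in the hundreds gets a small coefficient per unit). The sketch below illustrates this on synthetic data by comparing raw and standardized coefficients and ranking by absolute value. All names and values here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Two synthetic features with equal influence but very different scales.
rng = np.random.default_rng(0)
n = 300
big_scale = rng.normal(loc=250, scale=50, size=n)
small_scale = rng.normal(loc=1, scale=1, size=n)
X = pd.DataFrame({"big_scale": big_scale, "small_scale": small_scale})

# The target depends equally on both features once each is standardized.
z = (big_scale - 250) / 50 + (small_scale - 1) / 1
y = (z + rng.normal(scale=0.5, size=n) > 0).astype(int)

raw = LogisticRegression(max_iter=5000).fit(X, y)
scaled = LogisticRegression(max_iter=5000).fit(StandardScaler().fit_transform(X), y)

raw_coef = pd.Series(raw.coef_.ravel(), index=X.columns)
std_coef = pd.Series(scaled.coef_.ravel(), index=X.columns)

# Rank by magnitude so strong negative effects are not pushed to the bottom.
ranked = std_coef.abs().sort_values(ascending=False)

print("raw coefficients:\n", raw_coef)
print("standardized coefficients:\n", std_coef)
print("ranking by |standardized coefficient|:\n", ranked)
```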

**2. Random Forest**

In [None]:
#importing model
from sklearn.ensemble import RandomForestClassifier


In [None]:
#initializing the model
rfc= RandomForestClassifier(n_estimators=100)

In [None]:
#fitting the model
rfc.fit(x_train, y_train)

In [None]:
# predicting
y_pred=rfc.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
rfc_acc_score = accuracy_score(y_test, y_pred)
print("Accuracy of Random Forests Model is: ", rfc_acc_score*100)

In [None]:
#feature importance, labelled with the actual training columns
imp_feature = pd.DataFrame({'Feature': x.columns, 'Importance': rfc.feature_importances_})
plt.figure(figsize=(10,4))
plt.title("Random forest feature importance")
plt.xlabel("importance ")
plt.ylabel("features")
plt.barh(imp_feature['Feature'],imp_feature['Importance'])
plt.show()
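The `feature_importances_` attribute of tree ensembles is impurity-based and computed on the training data. Permutation importance on held-out data is a common cross-check: it measures how much the score drops when one column is shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the dataset and variable names are assumptions, and in the notebook the call would take `rfc`, `x_test` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first two columns are the informative ones.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the held-out set several times and record the score drop.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm_mean = result.importances_mean

for idx in perm_mean.argsort()[::-1]:
    print(f"feature {idx}: {perm_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")
```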

**3. Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier(criterion = 'entropy',random_state=0,max_depth = 6)

In [None]:
print("Shape of training set:", x_train.shape)
print("Shape of test set:", x_test.shape)

In [None]:
dt.fit(x_train,y_train)

In [None]:
Y_pred= dt.predict(x_test)

In [None]:

print(classification_report(y_test, Y_pred))

In [None]:
dt_acc_score= accuracy_score(y_test,Y_pred )
print("Accuracy of decision tree:",dt_acc_score*100)

In [None]:
imp_feature = pd.DataFrame({'Feature': x.columns, 'Importance': dt.feature_importances_})
plt.figure(figsize=(10,4))
plt.title("Decision tree feature importance")
plt.xlabel("importance ")
plt.ylabel("features")
plt.barh(imp_feature['Feature'],imp_feature['Importance'])
plt.show()

**4. Extreme Gradient Boost**

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb= XGBClassifier(random_state=123 )

In [None]:
xgb.fit(x_train,y_train)

In [None]:
y1pred=xgb.predict(x_test)

In [None]:

print(classification_report(y_test, y1pred))

In [None]:
xgb_acc_score= accuracy_score(y_test,y1pred )
print("Accuracy of Gradient Boost:",xgb_acc_score*100)

In [None]:
imp_feature = pd.DataFrame({'Feature': x.columns, 'Importance': xgb.feature_importances_})
plt.figure(figsize=(10,4))
plt.title("XGBoost feature importance")
plt.xlabel("importance ")
plt.ylabel("features")
plt.barh(imp_feature['Feature'],imp_feature['Importance'])
plt.show()

**Model comparison**

In [None]:
model=pd.DataFrame({'Model':["Logistic Regression",'Random Forest','Decision Tree',"Gradient Boost"],
                    'Accuracy':[LR_acc_score*100,rfc_acc_score*100,dt_acc_score*100,xgb_acc_score*100]})
model

Random forest and gradient boosting performed best among these machine learning models, giving the highest accuracy.
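A single 80/20 split on roughly 300 rows leaves about 60 test samples, so accuracy differences of a few points between models may be noise. Cross-validation gives a mean and a spread per model. The sketch below compares three of the model types on synthetic data from `make_classification`; the data, the fold count and the seeds are assumptions, and on the notebook's data `X` and `y` would be `x` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of features as the heart dataset.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

# The same shuffled, stratified folds are reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", max_depth=6, random_state=0),
}

cv_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the standard deviations overlap heavily, the ranking from a single split should be read with caution.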


**Feature importance**

Comparing the feature importances across all the models, the features that matter most for the target are 'thall', 'caa', 'cp', 'thalachh' and 'age'.

**Conclusion**

Patients with the following characteristics have a higher likelihood of a heart attack and should be given immediate attention. The factors to look for are:

1. Number of major vessels: value 0

2. Chest pain type: non-anginal pain

3. Age of the patient

4. Maximum heart rate achieved

5. Sex of the patient: especially male, since, as noted earlier, males are more likely than females to have a heart attack
