### Importing the Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from scipy import stats
from scipy.stats import norm
warnings.filterwarnings('ignore')

In [None]:
raw_data=pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")

In [None]:
data=raw_data.copy()
data.head()

In [None]:
# To check the number of rows and columns
data.shape

### Questions (before any analysis):
#### Before starting any analysis, I like to look at the data and note down some questions. This keeps the analysis that follows focused.
1) Age : Which age group experiences the most heart attacks? Older people are usually more susceptible to heart attacks (needs confirmation).

2) Sex : What is the ratio of males to females in our data, and which sex experiences more heart attacks?

3) CP (chest pain type) : Is there any relationship between the type of chest pain and heart attack? If so, which type of chest pain resulted in the most heart attacks? (Important and must be closely observed.)

4) trtbps (resting blood pressure) : Is there any relationship between resting bp and heart attack? (Must be looked at closely.)

5) chol (cholesterol) : Is there any threshold for cholesterol levels? Bad cholesterol is a cause of heart attack, and this can help in determining the level (useful for analysis).

6) fbs (fasting blood sugar) : Did people who experienced heart attacks also have fbs > 120? What is the ratio of people with fbs > 120 to people with fbs < 120?

7) restecg (resting electrocardiographic results) : Is there a relationship between resting ecg and heart attack? If so, which value is more susceptible?

8) thalachh (maximum heart rate achieved) : Can maximum heart rate give any insight into the cause of heart attack? If so, is there any threshold for this?

9) exng (exercise induced angina) : What is the ratio of people with exng = 1 to people with exng = 0? (There is probably a relationship between exng and heart attack.)

10) caa (number of major blood vessels, 0-3) : Is there any relationship between caa and heart attack? Fewer vessels may lead to a higher chance of heart attack (needs confirmation).

11) output : Our target variable.


In [None]:
data.info()

All our features except oldpeak are of type integer.

### Descriptive Statistics

In [None]:
data.describe()

#### Observations :
1) The average age is 54 years and the minimum age is 29 years, so no children are included in this data.

2) The average resting bp is 131 and its maximum value is 200.

3) The average cholesterol level is about 246 and the maximum value is 564 (this may be an outlier).

4) The average maximum heart rate is 149 (common while exercising) and the maximum is 202 (questionable).

### Missing values

In [None]:
def draw_missing_data_table(data):
    total = data.isnull().sum().sort_values(ascending=False)
    percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data
draw_missing_data_table(data)

There are no missing values present in our data. 

In [None]:
# Classifying data into numerical and categorical features.
data_numerical = data[['age','trtbps','chol','thalachh','oldpeak']] # slp (slope) is left out as I don't think it's useful
data_categorical = data[['sex','cp','fbs','restecg','exng','caa','thall','output']]


### Numerical Variable Analysis

In [None]:
fig,ax=plt.subplots(2,3,figsize=(15,10))
fig.patch.set_facecolor('#f6f5f7')
for i,idx in enumerate(data_numerical.columns):
    sns.histplot(ax=ax[i%2,i//2],x=data_numerical[idx],color='darkred',kde=True,alpha=0.5)
    ax[i%2,i//2].set_title(idx,fontweight='bold')
    ax[i%2,i//2].set_facecolor('#f6f5f5')
    for z in ["top","right"]:
        ax[i%2,i//2].spines[z].set_visible(False)
ax[1,2].set_visible(False)

#### Observations :
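1) The age distribution appears roughly normal, with the bulk of the values sitting slightly to the right (a mild left skew).

2) Cholesterol appears roughly normally distributed, apart from a few very high values in the right tail.

3) Old peak is heavily right-skewed: most values pile up near 0 on the left, with a long tail to the right.

4) Resting bp appears roughly normal, with the bulk of the values slightly to the left (a mild right skew).

5) The maximum heart rate appears roughly normal, with the bulk of the values slightly to the right (a mild left skew).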

### Skewness and Kurtosis

In [None]:
s_k=[]
for i in data_numerical.columns:
    s_k.append([i,data_numerical[i].skew(),data_numerical[i].kurt()])
skew_kurt=pd.DataFrame(s_k,columns=['Columns','Skewness','Kurtosis'])
skew_kurt

General Rule :
1) If skewness is less than -1 or greater than 1, the distribution is highly skewed. If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed. If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.

2) Kurtosis measures the combined weight of the two tails, i.e. the amount of probability in the tails. It is usually compared with the kurtosis of the normal distribution, which is 3. Note that pandas' `kurt()` returns excess kurtosis (kurtosis minus 3), so the values in the table above should be compared with 0: a value greater than 0 means heavier tails than a normal distribution, and a value less than 0 means lighter tails.
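The rule of thumb above can be sketched as two small helper functions. This is only an illustration: the function names and the sample values are hypothetical, and the tail check assumes the input is excess kurtosis (normal distribution = 0), which is what pandas' `kurt()` reports.

```python
def describe_skewness(skew):
    """Label a skewness value using the rule of thumb above."""
    magnitude = abs(skew)
    if magnitude > 1:
        return "highly skewed"
    if magnitude >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"


def describe_tails(excess_kurtosis):
    """Label tail weight from excess kurtosis (normal distribution = 0)."""
    if excess_kurtosis > 0:
        return "heavier tails than normal"
    if excess_kurtosis < 0:
        return "lighter tails than normal"
    return "same tail weight as normal"


# Hypothetical values, chosen only to exercise each branch.
for name, skew, kurt in [("a", 1.3, 1.6), ("b", -0.7, 0.2), ("c", 0.1, -0.5)]:
    print(name, "|", describe_skewness(skew), "|", describe_tails(kurt))
```

The same functions could be applied to the `Skewness` and `Kurtosis` columns of the table computed above.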

### Analysis of Numerical Variables with Target


In [None]:
fig = plt.figure(figsize=(17,17))
gs = fig.add_gridspec(5,3)
gs.update(wspace=0.4, hspace=0.4)
# adding figures
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])
ax4 = fig.add_subplot(gs[2,0])
ax5 = fig.add_subplot(gs[2,1])
ax6 = fig.add_subplot(gs[3,0])
ax7 = fig.add_subplot(gs[3,1])
ax8 = fig.add_subplot(gs[4,0])
ax9 = fig.add_subplot(gs[4,1])
axes=[ax0,ax1,ax2,ax3,ax4,ax5,ax6,ax7,ax8,ax9]
# data_numerical = data[['age','trestbps','chol','thalach','oldpeak']]
background_color = '#f6f5f7'
for i in axes:
    i.set_facecolor(background_color)
fig.patch.set_facecolor(background_color) 
#https://www.geeksforgeeks.org/kde-plot-visualization-with-pandas-and-seaborn/
#ax0
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(ax=ax0,x=data.loc[data['output']==0]['age'],color='black',label='No Heart attack',fill=True)
sns.kdeplot(ax=ax0,x=data.loc[data['output']==1]['age'],color='darkred',label='Heart attack',fill=True)
ax0.legend(loc = 'upper left')
ax0.grid(linestyle='--', axis='y')
#ax1
ax1.text(0.5,0.5,'Distribution of Age wrt Heart Attack',horizontalalignment = 'center',verticalalignment = 'center',fontsize = 18)

#ax2
sns.kdeplot(ax=ax2,x=data.loc[data['output']==0]['trtbps'],color='black',label='No Heart attack',fill=True)
sns.kdeplot(ax=ax2,x=data.loc[data['output']==1]['trtbps'],color='darkred',label='Heart attack',fill=True)
ax2.legend(loc = 'upper right')
ax2.grid(linestyle='--', axis='y')
#ax3
ax3.text(0.5,0.5,'Distribution of resting bp\n wrt Heart Attack',horizontalalignment = 'center',verticalalignment = 'center',fontsize = 18)

#ax4
sns.kdeplot(ax=ax4,x=data.loc[data['output']==0]['chol'],color='black',label='No Heart attack',fill=True)
sns.kdeplot(ax=ax4,x=data.loc[data['output']==1]['chol'],color='darkred',label='Heart attack',fill=True)
ax4.legend(loc = 'upper right')
ax4.grid(linestyle='--', axis='y')
#ax5
ax5.text(0.5,0.5,'Distribution of cholesterol \n wrt Heart Attack',horizontalalignment = 'center',verticalalignment = 'center',fontsize = 18)

#ax6
sns.kdeplot(ax=ax6,x=data.loc[data['output']==0]['thalachh'],color='black',label='No Heart attack',fill=True)
sns.kdeplot(ax=ax6,x=data.loc[data['output']==1]['thalachh'],color='darkred',label='Heart attack',fill=True)
ax6.legend(loc = 'upper left')
ax6.grid(linestyle='--', axis='y')
#ax7
ax7.text(0.5,0.5,'Distribution of maximum heart rate\n wrt Heart Attack',horizontalalignment = 'center',verticalalignment = 'center',fontsize = 18)

#ax8
sns.kdeplot(ax=ax8,x=data.loc[data['output']==0]['oldpeak'],color='black',label='No Heart attack',fill=True)
sns.kdeplot(ax=ax8,x=data.loc[data['output']==1]['oldpeak'],color='darkred',label='Heart attack',fill=True)
ax8.legend(loc = 'upper right')
ax8.grid(linestyle='--', axis='y')
#ax9
ax9.text(0.5,0.5,'Distribution of old peak\n wrt Heart Attack',horizontalalignment = 'center',verticalalignment = 'center',fontsize = 18)

# removing labels
axes1=[ax1,ax3,ax5,ax7,ax9]
for i in axes1:
    i.spines["bottom"].set_visible(False)
    i.spines["left"].set_visible(False)
    i.set_xlabel("")
    i.set_ylabel("")
    i.set_xticklabels([])
    i.set_yticklabels([])
    i.tick_params(left=False, bottom=False)
    
# removing spines of figures
for i in ["top","left","right"]:
    ax0.spines[i].set_visible(False)
    ax1.spines[i].set_visible(False)
    ax2.spines[i].set_visible(False)
    ax3.spines[i].set_visible(False)
    ax4.spines[i].set_visible(False)
    ax5.spines[i].set_visible(False)
    ax6.spines[i].set_visible(False)
    ax7.spines[i].set_visible(False)
    ax8.spines[i].set_visible(False)
    ax9.spines[i].set_visible(False)
    

### Bivariate  Analysis of Numerical Data

The best way to do bivariate analysis of numerical data is to make use of scatterplots.
#### Age with other numerical data

In [None]:
# Matplotlib because i want more control over my scatterplot colors.
fig = plt.figure(figsize=(15,12))
gs = fig.add_gridspec(2,2)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3=  fig.add_subplot(gs[1,1])
axes=[ax0,ax1,ax2,ax3]
background_color = '#f6f5f7'
for i in axes:
    i.set_facecolor(background_color)
fig.patch.set_facecolor(background_color)

# Age and Resting bp
ax0.scatter(x='age',y='trtbps',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax0.scatter(x='age',y='trtbps',data=data[data['output']==1],color='darkred',alpha=0.7,label = 'Heart attack')
ax0.legend()
ax0.set_xlabel('Age')
ax0.set_title('Age and Resting bp',fontweight='bold')

# Age and Cholesterol
ax1.scatter(x='age',y='chol',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax1.scatter(x='age',y='chol',data=data[data['output']==1],color='darkred',alpha=0.6,label = 'Heart attack')
ax1.legend()
ax1.set_xlabel('Age')
ax1.set_title('Age and Cholesterol',fontweight='bold')

# Age and Maximum heart rate
ax2.scatter(x='age',y='thalachh',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax2.scatter(x='age',y='thalachh',data=data[data['output']==1],color='darkred',alpha=0.7,label = 'Heart attack')
ax2.legend()
ax2.set_xlabel('Age')
ax2.set_title('Age and Maximum heart rate',fontweight='bold')

# Age and Oldpeak
ax3.scatter(x='oldpeak',y='age',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax3.scatter(x='oldpeak',y='age',data=data[data['output']==1],color='darkred',alpha=0.7,label = 'Heart attack')
ax3.legend()
# oldpeak is on the x-axis and age on the y-axis in this panel
ax3.set_xlabel('Old peak')
ax3.set_ylabel('Age')
ax3.set_title('Age and OldPeak',fontweight='bold')

#removing spines
for i in ["top","right"]:
    ax0.spines[i].set_visible(False)
    ax1.spines[i].set_visible(False)
    ax2.spines[i].set_visible(False)
    ax3.spines[i].set_visible(False)  

#### Observations :
1) People with resting bp > 150 seem to experience fewer heart attacks than people with resting bp < 150. There are a few outliers here, but we can ignore them.

2) There is no clear relationship between age and cholesterol; the points are spread evenly.

3) People with a maximum heart rate above 140 experience noticeably more heart attacks than people with a maximum heart rate below 140.

4) People with an old peak of 0 experience more heart attacks than any other group. This will become clearer in the count plots later.

5) One person under the age of 30 had a heart attack. This is probably an outlier.

#### Resting bp with other data
Not including age as we already saw it in the above figure

In [None]:
fig = plt.figure(figsize=(15,12))
gs = fig.add_gridspec(2,2)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3=  fig.add_subplot(gs[1,1])
axes=[ax0,ax1,ax2,ax3]
background_color = '#f6f5f7'
for i in axes:
    i.set_facecolor(background_color)
fig.patch.set_facecolor(background_color) 

# Resting bp and Old peak
ax0.scatter(x='oldpeak',y='trtbps',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax0.scatter(x='oldpeak',y='trtbps',data=data[data['output']==1],color='darkred',alpha=0.7,label = 'Heart attack')
# oldpeak is on the x-axis and resting bp on the y-axis in this panel
ax0.set_xlabel('Old peak')
ax0.set_ylabel('Resting bp')
ax0.legend()
ax0.set_title('Resting bp and Old peak',fontweight='bold')

# Resting BP and Cholesterol
ax1.scatter(x='trtbps',y='chol',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax1.scatter(x='trtbps',y='chol',data=data[data['output']==1],color='darkred',alpha=0.6,label = 'Heart attack')
ax1.set_xlabel('Resting bp')
ax1.legend()
ax1.set_title('Resting BP and Cholesterol',fontweight='bold')

# Resting BP and Maximum heart rate
ax2.scatter(x='trtbps',y='thalachh',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax2.scatter(x='trtbps',y='thalachh',data=data[data['output']==1],color='darkred',alpha=0.7,label = 'Heart attack')
ax2.set_xlabel('Resting bp')
ax2.legend()
ax2.set_title('Resting BP and Maximum heart rate',fontweight='bold')

#removing spines
for i in ["top","right"]:
    ax0.spines[i].set_visible(False)
    ax1.spines[i].set_visible(False)
    ax2.spines[i].set_visible(False)
ax3.set_visible(False)

#### Observations :
1) People with an old peak of 0 experience more heart attacks than others. We saw the same pattern with age in the figures above.

2) People with resting bp above 150 seem to experience fewer heart attacks than others (there is less data in this range). Resting bp between 120 and 140 combined with cholesterol levels of 200-270 shows more heart attacks.

3) People with a maximum heart rate above 140 experience more heart attacks than others. We saw the same pattern with age in the figures above.


#### Maximum heart rate with other data

In [None]:
fig = plt.figure(figsize=(15,5))
gs = fig.add_gridspec(1,2)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
axes=[ax0,ax1]
background_color = '#f6f5f7'
for i in axes:
    i.set_facecolor(background_color)
fig.patch.set_facecolor(background_color) 

# Maximum heart rate and Old peak
ax0.scatter(x='oldpeak',y='thalachh',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax0.scatter(x='oldpeak',y='thalachh',data=data[data['output']==1],color='darkred',alpha=0.7,label = 'Heart attack')
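# oldpeak is on the x-axis and maximum heart rate on the y-axis in this panel
ax0.set_xlabel('Old peak')
ax0.set_ylabel('Maximum heart rate')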
ax0.legend()
ax0.set_title('Maximum heart rate and Old peak',fontweight='bold')

# Maximum heart rate and Cholesterol
ax1.scatter(x='thalachh',y='chol',data=data[data['output']==0],alpha=0.5,color='lightslategrey',label = 'No heart attack')
ax1.scatter(x='thalachh',y='chol',data=data[data['output']==1],color='darkred',alpha=0.6,label = 'Heart attack')
ax1.set_xlabel('Maximum heart rate')
ax1.legend()
ax1.set_title('Maximum heart rate and Cholesterol',fontweight='bold')

#removing spines
for i in ["top","right"]:
    ax0.spines[i].set_visible(False)
    ax1.spines[i].set_visible(False)

### Correlation plot for numerical variables

In [None]:
fig=plt.figure(figsize=(10,4),dpi=100)
gs=fig.add_gridspec(1,2)
# adding subplots
ax0=fig.add_subplot(gs[0,0])
ax1=fig.add_subplot(gs[0,1])
axes=[ax0,ax1]
background_color = '#f6f5f7'
# changing background color of our plots
for i in axes:
    i.set_facecolor(background_color)
# changing the figure background color
fig.patch.set_facecolor(background_color) 
# heatmap of numerical data
matrix = np.triu(data_numerical.corr())
colors=['black','grey']
sns.heatmap(ax=ax0,data=data_numerical.corr(), annot=True, mask=matrix,cmap=colors)
ax1.text(0.5,0.5,'No strong correlation between\nthe variables',horizontalalignment = 'center',verticalalignment = 'center',fontsize = 15,fontfamily='serif')
ax1.spines["bottom"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.set_xlabel("")
ax1.set_ylabel("")
ax1.set_xticklabels([])
ax1.set_yticklabels([])
ax1.tick_params(left=False, bottom=False)
for i in ["top","right","bottom","left"]:
    ax1.spines[i].set_visible(False)
plt.text(-1.7,1.1,'Heatmap of Numerical Variables',fontsize=18,fontweight='bold',fontfamily='serif')   

#### Observations :
1) None of the pairwise correlations between the numerical features is strong.

2) thalachh is negatively correlated with cholesterol, age and old peak. Since every correlation is weak in magnitude (it is the magnitude, not the sign, that matters here), multicollinearity among these features is unlikely to be a problem.

### Pairplot

In [None]:
colors=['darkred','black']
# pairplot creates its own figure; `height` replaces the old `size` argument
sns.pairplot(data=data,hue='output',height=2,palette=colors)
plt.show()

### Univariate Analysis of Categorical Variables

In [None]:
# Our Target Variable
colors=['brown','lightslategrey']
fig=plt.figure(figsize=(10,7))
gs=fig.add_gridspec(1,2)
ax0=fig.add_subplot(gs[0,0])
ax1=fig.add_subplot(gs[0,1])
axes=[ax0,ax1]
background_color = '#f6f5f7'
for i in axes:
    i.set_facecolor(background_color)
    i.spines["bottom"].set_visible(False)
    i.spines["left"].set_visible(False)
    i.set_xlabel("")
    i.set_ylabel("")
    i.set_xticklabels([])
    i.set_yticklabels([])
fig.patch.set_facecolor(background_color) 
labels=data_categorical['output'].value_counts().index
values=data_categorical['output'].value_counts()

ax0.pie(values,  labels=labels, colors=colors, autopct='%1.1f%%', shadow=True,startangle = 90)

ax1.text(0.5,0.5,'54.5% had heart attack and \n45.5% did not have any heart attack',horizontalalignment = 'center',verticalalignment = 'center',fontsize = 15,fontfamily='serif')
for i in ["top","right","bottom","left"]:
    ax1.spines[i].set_visible(False)
ax1.tick_params(left=False, bottom=False)

In [None]:
colors=['brown','lightslategrey']
fig=plt.figure(figsize=(15,23))
background_color = '#f6f5f7'
fig.patch.set_facecolor(background_color) 
for indx,val in enumerate(data_categorical.columns):
    ax=plt.subplot(4,3,indx+1)
    ax.set_facecolor(background_color)
    ax.set_title(val.upper(),fontweight='bold')
    for i in ['top','right']:
        ax.spines[i].set_visible(False)
    ax.grid(linestyle=':',axis='y')
    # pass the column as x explicitly; newer seaborn versions reject a positional Series
    sns.countplot(x=data_categorical[val],palette=colors,ax=ax)

#### Observations :
1) ***Sex*** : Males far outnumber females in our data (male = 1, female = 0).

2) ***CP*** : People with type 0 chest pain (typical angina) far outnumber the other groups. Type 3 cp (asymptomatic) is the least common.

3) ***FBS*** : People with fasting blood sugar < 120 outnumber people with fasting blood sugar > 120.

4) ***RESTECG*** : 0 (normal) and 1 (having ST-T wave abnormality) are almost equal in number. This will be useful for predicting heart attack. Type 2 is almost negligible.

5) ***EXANG*** : People without exang (0) are roughly twice as numerous as people with exang.

6) ***CA*** : Most people in our data have CA = 0. More heart attacks were observed when CA = 0 (previous analysis).

7) ***THAL*** : People with thal = 2 are the largest group. No information was given about this feature (it may be left out of the predictions).


### Analysis with output

In [None]:
data_cat=data_categorical[['sex','cp','fbs','restecg','exng','caa','thall']]
fig=plt.figure(figsize=(15,23))
colors=['lightslategrey','brown']
background_color = '#f6f5f7'
fig.patch.set_facecolor(background_color) 
for indx,val in enumerate(data_cat.columns):
    ax=plt.subplot(4,3,indx+1)
    ax.set_facecolor(background_color)
    ax.set_title(val.upper(),fontweight='bold',fontfamily='serif')
    for i in ['top','right']:
        ax.spines[i].set_visible(False)
    ax.grid(linestyle=':',axis='y')
    # pass the column as x explicitly; newer seaborn versions reject a positional Series
    sns.countplot(x=data_cat[val],hue=data['output'],palette=colors,ax=ax)

#### Observations :
1) ***Sex*** : In absolute numbers, more men than women experienced heart attacks. Men are also far more numerous in our data, so the counts alone do not show which sex is at higher risk.

2) ***CP*** : People with type 2 chest pain (non-anginal pain, following the encoding used above where 0 is typical angina and 3 is asymptomatic) are more prone to heart attacks than any other type of chest pain (only a few with type 2 pain did not get a heart attack).

3) ***FBS*** : People with fasting blood sugar < 120 experienced more heart attacks, but this group is also much larger. Will have to look at this again.

4) ***RESTECG*** : People with type 1 restecg (having ST-T wave abnormality) experienced the most heart attacks (this is expected). Surprisingly, people with normal restecg also experienced many heart attacks. Type 2 restecg can be ignored.

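5) ***EXANG*** : In this data, most heart attacks occur among people without exercise induced angina (exng = 0), which is the opposite of what I expected. Will have to look at this again.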

6) ***CA*** : People with CA = 0 (no major blood vessels counted) experienced the most heart attacks (expected).

### Looking at few categorical variables closely
Observations here are the same as the above count plot
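When the groups being compared differ in size, it helps to keep two percentages apart: the share of all heart attack cases that falls in a group, and the rate of heart attacks within that group. The sketch below shows the difference on made-up counts; the function name, the group names and the numbers are hypothetical and are not taken from our data.

```python
def share_and_rate(cases_by_group, totals_by_group):
    """Return, per group, its share of all cases and its within-group rate (both in %)."""
    total_cases = sum(cases_by_group.values())
    result = {}
    for group, cases in cases_by_group.items():
        result[group] = {
            "share_of_cases": 100 * cases / total_cases,
            "rate_within_group": 100 * cases / totals_by_group[group],
        }
    return result


# Made-up counts: group_a is four times larger than group_b.
cases = {"group_a": 30, "group_b": 20}
totals = {"group_a": 100, "group_b": 25}

summary = share_and_rate(cases, totals)
for group, values in summary.items():
    print(group, values)
```

In this toy example the larger group contributes most of the cases, yet the smaller group has the higher rate, so the two numbers answer different questions.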

#### Share of heart attack cases by sex

In [None]:
sum_target = data['output'].sum()
data_sex = pd.pivot_table(data=data[data['output']==1],index=data['sex'],values='output',aggfunc='count').reset_index()
data_sex['percentage'] = (data_sex['output']*100)/sum_target
colors=['lightslategrey','brown']
fig=plt.figure(figsize=(10,7))
gs=fig.add_gridspec(1,2)
ax0=fig.add_subplot(gs[0,0])
ax1=fig.add_subplot(gs[0,1])
axes=[ax0,ax1]
background_color = '#f6f5f7'
for i in axes:
    i.set_facecolor(background_color)
    i.spines["bottom"].set_visible(False)
    i.spines["left"].set_visible(False)
    i.set_xlabel("")
    i.set_ylabel("")
    i.set_xticklabels([])
    i.set_yticklabels([])
fig.patch.set_facecolor(background_color) 
labels=data_sex['sex']
values=data_sex['percentage']
ax0.pie(values,  labels=labels, colors=colors, autopct='%1.1f%%', shadow=True,startangle = 90)

# the percentages are shares of all heart attack cases, not rates within each sex
ax1.text(0.5,0.5,'56.4% of heart attack cases were male and \n43.6% were female',horizontalalignment = 'center',verticalalignment = 'center',fontsize = 15,fontfamily='serif')
plt.text(-1.4,0.9,'Heart attacks % of male and female',fontsize=18,fontweight='bold',fontfamily='serif')   
for i in ["top","right","bottom","left"]:
    ax1.spines[i].set_visible(False)
ax1.tick_params(left=False, bottom=False)

#### Fasting Blood Sugar

In [None]:
fig,axes=plt.subplots(1,1,figsize=(8,5))
background_color='#f6f5f7'
fig.patch.set_facecolor(background_color)
axes.set_facecolor(background_color)
data_fbs = pd.pivot_table(data=data[data['output']==1],index=data['fbs'],values='output',aggfunc='count').reset_index()
sns.barplot(ax=axes,x=data_fbs['fbs'],y=data_fbs['output'],palette=colors)
for idx,val in enumerate(data_fbs['output']):
    axes.text( idx,val+1, round(val, 1), horizontalalignment='center')
axes.grid(linestyle=':',axis='y')
axes.set_xlabel('FBS')
axes.set_ylabel('Count')
plt.text(-0.7,155,'Fbs and Heart attack',fontsize=18,fontweight='bold',fontfamily='serif')
for i in ['top','right']:
    axes.spines[i].set_visible(False)

#### Chest pain and Heart attack


In [None]:
fig,axes=plt.subplots(1,1,figsize=(10,5))
background_color='#f6f5f7'
fig.patch.set_facecolor(background_color)
axes.set_facecolor(background_color)
data_cp = pd.pivot_table(data=data[data['output']==1],index=data['cp'],values='output',aggfunc='count').reset_index()
sns.barplot(ax=axes,x=data_cp['cp'],y=data_cp['output'],palette=colors)
for idx,val in enumerate(data_cp['output']):
    axes.text( idx,val+1, round(val, 1), horizontalalignment='center')
axes.grid(linestyle=':',axis='y')
axes.set_xlabel('CP')
axes.set_ylabel('Count')
plt.text(-0.7,75,'Chest Pain and Heart attack',fontsize=18,fontweight='bold',fontfamily='serif')

for i in ['top','right']:
    axes.spines[i].set_visible(False)


#### Checking for outliers


In [None]:
# keep a handle on the new figure so the background colour is applied to it
fig=plt.figure(figsize=(15,10))
background_color = '#f6f5f7'
fig.patch.set_facecolor(background_color) 
for idx,val in enumerate(data_numerical.columns):
    ax=plt.subplot(2,3,idx+1)
    sns.boxplot(x=data_numerical[val],color=colors[0],ax=ax)
    ax.set_facecolor(background_color)
    ax.set_title(val.upper(),fontweight='bold',fontfamily='serif')
    for i in ['top','right']:
        ax.spines[i].set_visible(False)

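Cholesterol, old peak and resting bp have outliers. We keep them in the data and standardize the features before modelling. Note that standardization only rescales the features; it does not remove these points.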
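For reference, the points drawn individually on a box plot are usually the values lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that convention, using only the standard library, is given below; the function name `iqr_outliers` and the sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # method="inclusive" interpolates linearly between the observed data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings with one clearly extreme value.
sample = [120, 125, 130, 132, 135, 138, 140, 200]
print(iqr_outliers(sample))
```

The same check could be run on each column of `data_numerical` to list the flagged values instead of reading them off the plot.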

### Data Preprocessing

In [None]:
# Before training the models we are going to drop the slope and target columns.
data_target=data['output']
data.drop(columns=['slp','output'],inplace=True)

In [None]:
# One-Hot encoding of Categorical Variables
data_dummies=data[['sex','cp','fbs','restecg','exng','caa','thall']]
data_dummies= pd.get_dummies(data_dummies,columns=['sex','cp','fbs','restecg','exng','caa','thall'])

In [None]:
# Merging the dummy variables and our original data
data.drop(columns=['sex','cp','fbs','restecg','exng','caa','thall'],inplace=True)
data=data.merge(data_dummies,left_index=True, right_index=True,how='left')
data.head()

In [None]:
# Splitting the data into training and testing sets.
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(data,data_target,test_size=0.3,random_state=42)

In [None]:
# Standardizing the training and testing data.
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x_train=scaler.fit_transform(x_train)
x_test=scaler.transform(x_test)

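We perform feature scaling after splitting the data into training and testing sets: the scaler is fitted on the training set only, and the same fitted parameters are then used to transform the testing set. This avoids data leakage.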
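The idea can be illustrated without scikit-learn. In the sketch below, the mean and standard deviation are computed from the training values only and then reused for the test values. The helper names and the numbers are hypothetical; the population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics


def fit_scaler(train_values):
    """Learn the scaling parameters from the training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std (ddof = 0)
    return mean, std


def transform(values, mean, std):
    """Apply previously learned parameters to any data split."""
    return [(v - mean) / std for v in values]


# Hypothetical single-feature example.
train = [100.0, 120.0, 140.0]
test = [160.0]

mean, std = fit_scaler(train)                # the test data is never seen here
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)     # reuses the training parameters
print(train_scaled, test_scaled)
```

Fitting on the full data before splitting would let the test values influence `mean` and `std`, which is the leakage this ordering is meant to prevent.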

### Training and Prediction

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below replace them
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [None]:
#https://towardsdatascience.com/model-evaluation-techniques-for-classification-models-eac30092c38b
colors=['black','grey']
def Model(model):
    # fit on the training data, then report accuracy on both splits
    model.fit(x_train,y_train)
    model_train_score = model.score(x_train, y_train)
    model_test_score = model.score(x_test, y_test)
    prediction = model.predict(x_test)
    cm = confusion_matrix(y_test,prediction)
    print('Training Score \n',model_train_score)
    print('Testing Score \n',model_test_score)
    # confusion matrix and ROC curve on the testing data
    ConfusionMatrixDisplay.from_estimator(model,x_test,y_test,cmap='rocket_r')
    RocCurveDisplay.from_estimator(model, x_test, y_test)

In [None]:
# Logistic Regression
lg_reg=LogisticRegression()

Model(lg_reg)

In [None]:
# Decision Tree Classification
d_classif= DecisionTreeClassifier()
Model(d_classif)

In [None]:
# Random Forest Classification
rf_classif = RandomForestClassifier()
Model(rf_classif)

#### The confusion matrix interpretation:
1) The first element is True Negative([0,0]) - Their actual value is 0 and our model correctly predicted them as 0.

2) The second element is False Positive([0,1]) - Their actual value is 0 but our model predicted them as 1.

3) The third element is False Negative([1,0]) - Their actual value is 1 but our model predicted them as 0.

4) The Fourth element is True Positive([1,1]) - Their actual value is 1 and our model predicted them as 1.
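The four cells can be counted by hand to check the interpretation. The sketch below uses made-up labels and a hypothetical helper, `confusion_counts`; it lays the counts out as `[[TN, FP], [FN, TP]]`, the same row and column order described above.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return [[tn, fp], [fn, tp]]


# Made-up labels, not taken from our models.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)
print([[tn, fp], [fn, tp]], accuracy)
```

Accuracy is the sum of the diagonal cells (TN and TP) divided by the number of samples.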

***Random Forest Classifier*** performed better than Decision Tree Classifier and Logistic Regression for our data.

#### Check out my other notebooks here:
1) https://www.kaggle.com/ruthvikpvs/stroke-data-eda-and-prediction

2) https://www.kaggle.com/ruthvikpvs/students-performance-eda-and-prediction

3) https://www.kaggle.com/ruthvikpvs/diabetes-prediction

### References
1) https://www.kaggle.com/namanmanchanda/heart-attack-eda-prediction-90-accuracy

2) https://www.kaggle.com/subinium/simple-matplotlib-visualization-tips

### Do upvote the kernel if you find it useful. Feedback is highly appreciated. Thank You.