# Heart Failure Prediction Dataset: Exploratory Data Analysis
In this notebook, we analyse the Heart Failure Prediction Dataset taken from Kaggle. The dataset is meant for supervised machine learning, but at this stage we only do some exploratory analysis.

# We'll try to answer the following questions:
1. What is the age distribution of patients who survived and who died?

2. What is the age distribution of males and females w.r.t. survival?

3. What is the age distribution of anaemic and non-anaemic patients w.r.t. survival?

4. What is the age distribution of diabetic and non-diabetic patients w.r.t. survival?

5. What is the age distribution of patients with and without high B.P w.r.t. survival?

6. What is the age distribution of smokers and non-smokers w.r.t. survival?

7. What are the distributions of creatinine phosphokinase, platelets, serum creatinine and serum Na for males and females w.r.t. survival?

8. What is the distribution of anaemia, and what are the death and survival rates of anaemic and non-anaemic patients?

9. What is the distribution of diabetes, and what are the death and survival rates of diabetic and non-diabetic patients?

10. What is the distribution of high B.P, and what are the death and survival rates of patients with and without high B.P?

11. What is the sex distribution, and what are the death and survival rates of males and females w.r.t. anaemia, diabetes, high B.P and smoking?

12. What is the distribution of smoking, and what are the death and survival rates of smokers and non-smokers?



In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as pyo
from plotly.offline import init_notebook_mode,plot,iplot
import plotly.figure_factory as ff
import cufflinks as cf
import plotly.io as pio
pio.renderers.default = "svg"
import warnings
warnings.filterwarnings('ignore')

In [2]:
pyo.init_notebook_mode(connected=True)
cf.go_offline()

In [3]:
#Loading the dataset
df=pd.read_csv("/content/heart_failure_clinical_records_dataset.csv")
df

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.7,116,0,0,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
294,62.0,0,61,1,38,1,155000.00,1.1,143,1,1,270,0
295,55.0,0,1820,0,38,0,270000.00,1.2,139,0,0,271,0
296,45.0,0,2060,1,60,0,742000.00,0.8,138,0,0,278,0
297,45.0,0,2413,0,38,0,140000.00,1.4,140,1,1,280,0


In [4]:
#Renaming the columns
df.columns=['Age','Anaemia','Creatinine phosphokinase','Diabetes','Ejection fraction','High B.P','Platelets','Serum creatinine','Serum Na','Sex','Smoking','Time','Outcome']

In [5]:
# Map the 0/1 codes to labels (avoid naming the mapping `dict`, which shadows the built-in)
sex_labels={0:'Female',1:'Male'}
df['Sex']=df['Sex'].map(sex_labels)




In [6]:
#Checking missing values and datatype of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       299 non-null    float64
 1   Anaemia                   299 non-null    int64  
 2   Creatinine phosphokinase  299 non-null    int64  
 3   Diabetes                  299 non-null    int64  
 4   Ejection fraction         299 non-null    int64  
 5   High B.P                  299 non-null    int64  
 6   Platelets                 299 non-null    float64
 7   Serum creatinine          299 non-null    float64
 8   Serum Na                  299 non-null    int64  
 9   Sex                       299 non-null    object 
 10  Smoking                   299 non-null    int64  
 11  Time                      299 non-null    int64  
 12  Outcome                   299 non-null    int64  
dtypes: float64(3), int64(9), object(1)
memory usage: 30.5+ KB


# Note:
The columns Anaemia, Diabetes, High B.P and Smoking hold boolean (0/1) values, but `info()` reports them as int64, so we change their datatype to bool. Outcome is also a 0/1 flag and is converted along with them.
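Casting to bool turns every non-zero value into `True`, so it can be worth confirming that a column really contains only 0 and 1 before converting it. The sketch below is a minimal illustration on the first five rows of the preview above; the `sample` frame and the `flag_columns` list are assumptions made for the example, not part of the original notebook.

```python
import pandas as pd

# First five rows of three columns, copied from the dataset preview above.
sample = pd.DataFrame({
    "Anaemia": [0, 0, 0, 1, 1],
    "Diabetes": [0, 0, 0, 0, 1],
    "Ejection fraction": [20, 38, 20, 20, 20],
})

# Hypothetical list of columns we expect to be 0/1 flags.
flag_columns = ["Anaemia", "Diabetes"]

# A column is safe to cast only if every value is 0 or 1.
is_binary = sample[flag_columns].isin([0, 1]).all()
if not is_binary.all():
    raise ValueError(f"Non-binary values in: {list(is_binary[~is_binary].index)}")

sample[flag_columns] = sample[flag_columns].astype(bool)
print(sample.dtypes)
```

A column such as Ejection fraction would fail this check, which is the intended behaviour: it is a measurement, not a flag.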

In [7]:
df[['Anaemia','Diabetes','High B.P','Smoking','Outcome']]=df[['Anaemia','Diabetes','High B.P','Smoking','Outcome']].astype(bool)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       299 non-null    float64
 1   Anaemia                   299 non-null    bool   
 2   Creatinine phosphokinase  299 non-null    int64  
 3   Diabetes                  299 non-null    bool   
 4   Ejection fraction         299 non-null    int64  
 5   High B.P                  299 non-null    bool   
 6   Platelets                 299 non-null    float64
 7   Serum creatinine          299 non-null    float64
 8   Serum Na                  299 non-null    int64  
 9   Sex                       299 non-null    object 
 10  Smoking                   299 non-null    bool   
 11  Time                      299 non-null    int64  
 12  Outcome                   299 non-null    bool   
dtypes: bool(5), float64(3), int64(4), object(1)
memory usage: 20.3+ KB

# Note:
This dataset contains no null values.


In [9]:
# Summary statistics of the numeric (integer and float) columns:
df.describe()

Unnamed: 0,Age,Creatinine phosphokinase,Ejection fraction,Platelets,Serum creatinine,Serum Na,Time
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,581.839465,38.083612,263358.029264,1.39388,136.625418,130.26087
std,11.894809,970.287881,11.834841,97804.236869,1.03451,4.412477,77.614208
min,40.0,23.0,14.0,25100.0,0.5,113.0,4.0
25%,51.0,116.5,30.0,212500.0,0.9,134.0,73.0
50%,60.0,250.0,38.0,262000.0,1.1,137.0,115.0
75%,70.0,582.0,45.0,303500.0,1.4,140.0,203.0
max,95.0,7861.0,80.0,850000.0,9.4,148.0,285.0


In [10]:
died=df[df['Outcome']==True]
survived=df[df['Outcome']==False]
values=(len(died),len(survived))
names=('Died','Survived')
fig=px.pie(names=names,values=values,title='Distribution of Survival')
fig.show()

## Useful insight:-
Out of 299 people:

96 (32.1%) died and 203 (67.9%) survived.
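These figures can also be cross-checked without a chart. The sketch below is a minimal illustration that rebuilds an `Outcome` column from the counts reported above (96 deaths, 203 survivors) instead of loading the CSV; the names `outcome` and `summary` are assumptions made for the example.

```python
import pandas as pd

# Stand-in for df['Outcome'], rebuilt from the counts reported above
# (True = died, False = survived).
outcome = pd.Series([True] * 96 + [False] * 203, name="Outcome")

counts = outcome.value_counts()                                   # absolute counts
shares = (outcome.value_counts(normalize=True) * 100).round(1)    # percentages

summary = pd.DataFrame({"count": counts, "percent": shares})
summary.index = summary.index.map({True: "Died", False: "Survived"})
print(summary)
```

On the real data, replacing `outcome` with `df['Outcome']` should give the same table.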

# What is age distribution of people considering survival and death?

In [11]:
fig=px.violin(df,x=df['Outcome'],y=df['Age'],points='all',box=True)
fig.update_layout(title_text='Age distribution')
fig.show()


## Useful insight:-
1. Most of the patients who survived are in the 50-70 age group.

2. Most of the patients who died are also in the 50-70 age group.
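Reading age groups off a violin plot is approximate, so it can help to count patients per age bin. The sketch below is a minimal illustration using only the nine rows visible in the dataset preview, so its counts say nothing about the full data; the bin edges, labels and the `sample` frame are assumptions made for the example.

```python
import pandas as pd

# Nine rows copied from the dataset preview above (not the full data).
sample = pd.DataFrame({
    "Age": [75, 55, 65, 50, 65, 62, 55, 45, 45],
    "Outcome": [True, True, True, True, True, False, False, False, False],
})
sample["Outcome"] = sample["Outcome"].map({True: "Died", False: "Survived"})

# Hypothetical bin edges; right=False makes each bin closed on the left,
# e.g. the "50-59" bin covers ages 50 <= age < 60.
bins = [40, 50, 60, 70, 80, 100]
labels = ["40-49", "50-59", "60-69", "70-79", "80+"]
sample["Age group"] = pd.cut(sample["Age"], bins=bins, labels=labels, right=False)

# Rows are age groups, columns are outcomes, cells are patient counts.
age_table = pd.crosstab(sample["Age group"], sample["Outcome"])
print(age_table)
```

Applying the same `pd.cut` and `pd.crosstab` calls to the full `df` would turn the visual impression above into exact counts per age group.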


# What is Age distribution of Male and Female w.r.t survival?

In [12]:
fig=px.violin(df,x=df['Sex'],y=df['Age'],color='Outcome',points='all',box=True,title='Gender Distribution')
fig.show()


## Useful insight:-
1. Most of the males who survived are in the 50-70 age group.

2. Most of the females who survived are in the 50-70 age group.

3. The ages of the males and females who died are spread over the entire age range (40 to 100).

# What is age distribution of Anaemic and Non-Anaemic patients considering their survival? 

In [13]:
fig=px.violin(df,x=df['Anaemia'],y=df['Age'],color='Outcome',points='all',box=True,title='Anaemia distribution')
fig.show()

## Useful insight:-
1. Anaemic patients who died are spread over the entire age range (40 to 100).

2. Most anaemic patients who survived are in the 50-70 age group.

3. Non-anaemic patients who died are spread over the entire age range (40 to 100).

4. Most non-anaemic patients who survived are in the 40-70 age group.

# What is age distribution of Diabetic and Non-Diabetic patients considering their survival?

In [14]:
fig=px.violin(df,x=df['Diabetes'],y=df['Age'],color='Outcome',points='all',box=True,title='Diabetes distribution')
fig.show()

## Useful insight:-
1. Diabetic patients who died are spread over the entire age range (40 to 100).

2. Most diabetic patients who survived are in the 40-70 age group.

3. Non-diabetic patients who died are spread over the entire age range (40 to 100).

4. Most non-diabetic patients who survived are in the 40-70 age group.

# What is age distribution of High B.P and Non High B.P patients considering their survival?


In [15]:
fig=px.violin(df,x=df['High B.P'],y=df['Age'],color='Outcome',points='all',box=True,title='High B.P. distribution')
fig.show()

## Useful insight:-
1. High B.P patients who died are spread over the entire age range (40 to 100).

2. High B.P patients who survived are spread over the 40-80 age group.

3. Patients without high B.P who died are spread over the entire age range (40 to 100).

4. Most patients without high B.P who survived are in the 40-70 age group.

# What is age distribution of Smokers and Non Smokers  considering their survival?

In [16]:
fig=px.violin(df,x=df['Smoking'],y=df['Age'],color='Outcome',points='all',box=True,title='Smoking distribution')
fig.show()

## Useful insight:-
1. Smokers who died are spread over the 50-90 age group.

2. Most smokers who survived are in the 40-70 age group.

3. Non-smokers who died are spread over the entire age range (40 to 100).

4. Most non-smokers who survived are in the 40-70 age group.




# What is Creatinine Phosphokinase,Platelets,Serum Creatinine,Serum Na distribution of Male and Female w.r.t their survival?

## Creatinine phosphokinase w.r.t Gender and Outcome:-

In [17]:
fig=px.violin(df,x=df['Sex'],y=df['Creatinine phosphokinase'],color='Outcome',points='all',box=True,title='Creatinine phosphokinase')
fig.show()

## Platelets w.r.t Gender and Outcome:-

In [18]:
fig=px.violin(df,x=df['Sex'],y=df['Platelets'],color='Outcome',points='all',box=True,title='Platelets')
fig.show()

## Ejection fraction w.r.t Gender and Outcome:-

In [19]:
fig=px.violin(df,x=df['Sex'],y=df['Ejection fraction'],color='Outcome',points='all',box=True,title='Ejection fraction')
fig.show()

## Serum Creatinine w.r.t Gender and Outcome:-

In [20]:
fig=px.violin(df,x=df['Sex'],y=df['Serum creatinine'],color='Outcome',points='all',box=True,title='Serum Creatinine')
fig.show()

## Serum Na w.r.t Gender and Outcome:-

In [21]:
fig=px.violin(df,x=df['Sex'],y=df['Serum Na'],color='Outcome',points='all',box=True,title='Serum Na')
fig.show()

# What is Anaemia distribution,death and survival rate of Anaemic patients and Non-Anaemic patients?

In [22]:
Anaemia_yes=df[df['Anaemia']==True]
Anaemia_no=df[df['Anaemia']==False]
values=(len(Anaemia_yes),len(Anaemia_no))
names=('Anaemic','Non Anaemic')
fig=px.pie(names=names,values=values,title='Anaemia distribution')
fig.show()

## Useful insight:-
Out of 299 people:

129 (43.1%) are anaemic and 170 (56.9%) are not anaemic.

In [23]:
Anaemic_died=Anaemia_yes[Anaemia_yes['Outcome']==True]
Anaemic_survived=Anaemia_yes[Anaemia_yes['Outcome']==False]
values=(len(Anaemic_died),len(Anaemic_survived))
names=('Anaemic people died','Anaemic people survived')
fig=px.pie(names=names,values=values,title='Death and Survival rate of Anaemic people')
fig.show()



## Useful insight:-
Out of 129 anaemic people:

83 (64.3%) survived and 46 (35.7%) died.

In [24]:
Non_Anaemic_died=Anaemia_no[Anaemia_no['Outcome']==True]
Non_Anaemic_survived=Anaemia_no[Anaemia_no['Outcome']==False]
values=(len(Anaemic_died),len(Anaemic_survived),len(Non_Anaemic_died),len(Non_Anaemic_survived))
names=('Anaemic people died','Anaemic people survived','Non-Anaemic people died','Non-Anaemic people survived')
fig=px.pie(names=names,values=values,title='Analysis on Survival')
fig.show()

## Useful insight:-
From the above pie chart, where each percentage is a share of all 299 patients:

1. Out of 129 anaemic people, 83 (27.8%) survived and 46 (15.4%) died.

2. Out of 170 non-anaemic people, 120 (40.1%) survived and 50 (16.7%) died.
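The two-slice pies report rates within a group, while the four-slice pie reports shares of the whole cohort, so the same count appears with two different percentages. The sketch below is a minimal illustration of both normalisations, built from the counts stated in the insights above; the names `table`, `within_group` and `of_total` are assumptions made for the example.

```python
import pandas as pd

# Counts reported in the insights above, arranged as a 2x2 table.
table = pd.DataFrame(
    {"Died": [46, 50], "Survived": [83, 120]},
    index=["Anaemic", "Non-Anaemic"],
)

# Share of each cell within its own row (group-level death/survival rate).
within_group = table.div(table.sum(axis=1), axis=0).mul(100).round(1)

# Share of each cell in the whole cohort (what the four-slice pie shows).
of_total = table.div(table.to_numpy().sum()).mul(100).round(1)

print(within_group)
print(of_total)
```

To compare anaemic with non-anaemic patients, the within-group rates are the ones to read, because they do not depend on how large each group is.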



#  What is Diabetes distribution,death and survival rate of Diabetic patients and Non-Diabetic patients?

In [25]:
Diabetic_yes=df[df['Diabetes']==True]
Diabetic_no=df[df['Diabetes']==False]
values=(len(Diabetic_yes),len(Diabetic_no))
names=('Diabetic','Non-Diabetic')
fig=px.pie(names=names,values=values,title='Diabetes distribution')
fig.show()

## Useful insight:-
Out of 299 people, 125 (41.8%) are diabetic and 174 (58.2%) are non-diabetic.

In [26]:
Diabetic_died=Diabetic_yes[Diabetic_yes['Outcome']==True]
Diabetic_survived=Diabetic_yes[Diabetic_yes['Outcome']==False]
values=(len(Diabetic_died),len(Diabetic_survived))
names=('Diabetic people died','Diabetic people survived')
fig=px.pie(names=names,values=values,title='Death rate of Diabetic people')
fig.show()

## Useful insight:-
Out of 125 diabetic people, 85 (68%) survived and 40 (32%) died.

In [27]:
Non_Diabetic_died=Diabetic_no[Diabetic_no['Outcome']==True]
Non_Diabetic_survived=Diabetic_no[Diabetic_no['Outcome']==False]
values=(len(Diabetic_died),len(Diabetic_survived),len(Non_Diabetic_died),len(Non_Diabetic_survived))
names=('Diabetic died','Diabetic survived','Non-Diabetic died','Non-Diabetic survived')
fig=px.pie(names=names,values=values,title='Analysis of Survival')
fig.show()


## Useful insight:-
Each percentage below is a share of all 299 patients:

1. Out of 125 diabetic people, 85 (28.4%) survived and 40 (13.4%) died.

2. Out of 174 non-diabetic people, 118 (39.5%) survived and 56 (18.7%) died.






# What is High B.P distribution,death and survival rate of High B.P patients and Non-High B.P patients?


In [28]:
High_BP_yes=df[df['High B.P']==True]
High_BP_no=df[df['High B.P']==False]
values=(len(High_BP_yes),len(High_BP_no))
names=('High_B.P_yes','High_B.P_no')
fig=px.pie(names=names,values=values,title='High B.P distribution')
fig.show()


## Useful insight:-
Out of 299 people, 105 (35.1%) have high B.P and 194 (64.9%) do not.

In [29]:
High_BP_died=High_BP_yes[High_BP_yes['Outcome']==True]
High_BP_survived=High_BP_yes[High_BP_yes['Outcome']==False]
values=(len(High_BP_died),len(High_BP_survived))
names=('B.P patients died','B.P patients survived')
fig=px.pie(names=names,values=values,title='Death rate of High B.P patients')
fig.show()


## Useful insights:-
Out of 105 high B.P patients, 39 (37.1%) died and 66 (62.9%) survived.

In [30]:
High_BP_no_died=High_BP_no[High_BP_no['Outcome']==True]
High_BP_no_survived=High_BP_no[High_BP_no['Outcome']==False]
values=(len(High_BP_died),len(High_BP_survived),len(High_BP_no_died),len(High_BP_no_survived))
names=('B.P patients died','B.P patients survived','Non B.P patients died','Non B.P patients survived')
fig=px.pie(names=names,values=values,title='Analysis of survival')
fig.show()


## Useful insight:-
Each percentage below is a share of all 299 patients:

1. Out of 105 high B.P patients, 39 (13%) died and 66 (22.1%) survived.

2. Out of 194 patients without high B.P, 57 (19.1%) died and 137 (45.8%) survived.

# What is Sex distribution, death and survival rate of Male and Female w.r.t Anaemia, Diabetes, High B.P, Smoking?

In [31]:
Male=df[df['Sex']=='Male']
Female=df[df['Sex']=='Female']
values=(len(Male),len(Female))
names=('Male','Female')
fig=px.pie(names=names,values=values,title='Sex distribution')
fig.show()

## Useful insight:-
Out of 299 people, 194 (64.9%) are male and 105 (35.1%) are female.

In [32]:
Male_died=Male[Male['Outcome']==True]
Male_survived=Male[Male['Outcome']==False]
Female_died=Female[Female['Outcome']==True]
Female_survived=Female[Female['Outcome']==False]
values=(len(Male_died),len(Male_survived),len(Female_died),len(Female_survived))
names=('Male died','Male survived','Female died','Female survived')
fig=px.pie(names=names,values=values,title='Analysis of survival')
fig.show()

## Useful insights:-
Each percentage below is a share of all 299 patients:

1. Out of 194 males, 62 (20.7%) died and 132 (44.1%) survived.

2. Out of 105 females, 34 (11.4%) died and 71 (23.7%) survived.

In [34]:
Male_Anaemic=Male[Male['Anaemia']==True]
Female_Anaemic=Female[Female['Anaemia']==True]

In [35]:
Male_Anaemic_died=Male_Anaemic[Male_Anaemic['Outcome']==True]
Male_Anaemic_survived=Male_Anaemic[Male_Anaemic['Outcome']==False]
values=(len(Male_Anaemic_died),len(Male_Anaemic_survived))
names=('Male Anaemic died','Male Anaemic survived')
fig=px.pie(names=names,values=values,title='Analysis of Male Anaemic patients')
fig.show()




## Useful insights:-
Out of 77 male anaemic patients:

26 (33.8%) died and 51 (66.2%) survived.

In [36]:
Female_Anaemic_died=Female_Anaemic[Female_Anaemic['Outcome']==True]
Female_Anaemic_survived=Female_Anaemic[Female_Anaemic['Outcome']==False]
values=(len(Female_Anaemic_died),len(Female_Anaemic_survived))
names=('Female Anaemic died','Female Anaemic survived')
fig=px.pie(names=names,values=values,title='Analysis of Female Anaemic patients')
fig.show()



## Useful insight:-
Out of 52 female anaemic patients:

20 (38.5%) died and 32 (61.5%) survived.




In [37]:
Male_diabetic=Male[Male['Diabetes']==True]
Female_diabetic=Female[Female['Diabetes']==True]


In [38]:
Male_diabetic_died=Male_diabetic[Male_diabetic['Outcome']==True]
Male_diabetic_survived=Male_diabetic[Male_diabetic['Outcome']==False]

values=(len(Male_diabetic_died),len(Male_diabetic_survived))
names=('Male diabetic died','Male diabetic survived')
fig=px.pie(names=names,values=values,title='Analysis of Male Diabetic patients')
fig.show()



## Useful insight:-
Out of 70 male diabetic patients:

20 (28.6%) died and 50 (71.4%) survived.



In [39]:
Female_diabetic_died=Female_diabetic[Female_diabetic['Outcome']==True]
Female_diabetic_survived=Female_diabetic[Female_diabetic['Outcome']==False]
values=(len(Female_diabetic_died),len(Female_diabetic_survived))
names=('Female Diabetic died','Female Diabetic survived')
fig=px.pie(names=names,values=values,title='Analysis of Female Diabetic patients')
fig.show()

## Useful insights:-
Out of 55 female diabetic patients:

20 (36.4%) died and 35 (63.6%) survived.

In [40]:
Male_High_BP=Male[Male['High B.P']==True]
Female_High_BP=Female[Female['High B.P']==True]

In [41]:
Male_BP_died=Male_High_BP[Male_High_BP['Outcome']==True]
Male_BP_survived=Male_High_BP[Male_High_BP['Outcome']==False]
values=(len(Male_BP_died),len(Male_BP_survived))
names=('Male BP patients died','Male BP patients survived')
fig=px.pie(names=names,values=values,title='Analysis of Male High BP patients')
fig.show()

## Useful insights:-
Out of 61 male high B.P patients:

22 (36.1%) died and 39 (63.9%) survived.

In [42]:
Female_BP_died=Female_High_BP[Female_High_BP['Outcome']==True]
Female_BP_survived=Female_High_BP[Female_High_BP['Outcome']==False]
values=(len(Female_BP_died),len(Female_BP_survived))
names=('Female BP patients died','Female BP patients survived')
fig=px.pie(names=names,values=values,title='Analysis of Female High BP patients')
fig.show()

## Useful insights:-
Out of 44 female high B.P patients:

17 (38.6%) died and 27 (61.4%) survived.


In [43]:
Male_smokers=Male[Male['Smoking']==True]
Female_smokers=Female[Female['Smoking']==True]

In [44]:
Male_smokers_died=Male_smokers[Male_smokers['Outcome']==True]
Male_smokers_survived=Male_smokers[Male_smokers['Outcome']==False]
values=(len(Male_smokers_died),len(Male_smokers_survived))
names=('Male smokers died','Male smokers survived')
fig=px.pie(names=names,values=values,title='Analysis of Male smokers')
fig.show()


## Useful insights:-
Out of 92 male smokers:

27 (29.3%) died and 65 (70.7%) survived.

In [45]:
Female_smokers_died=Female_smokers[Female_smokers['Outcome']==True]
Female_smokers_survived=Female_smokers[Female_smokers['Outcome']==False]
values=(len(Female_smokers_died),len(Female_smokers_survived))
names=('Female smokers died','Female smokers survived')
fig=px.pie(names=names,values=values,title='Analysis of Female smokers')
fig.show()

## Useful insights:-
1. There are very few female smokers.

2. Out of 4 female smokers, 3 (75%) died and 1 (25%) survived.
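A 75% death rate computed from 4 patients is far less reliable than a rate computed from 92, so it helps to show the group size next to each rate. The sketch below is a minimal illustration built from counts in this notebook: the smoker counts are reported directly, while the non-smoker counts are derived as the remainder of each sex total (62 of 194 males and 34 of 105 females died). The `MIN_GROUP_SIZE` threshold is a hypothetical value chosen only for the example.

```python
import pandas as pd

# Counts taken or derived from the insights in this notebook.
rows = pd.DataFrame({
    "Sex": ["Male", "Male", "Female", "Female"],
    "Smoking": [True, False, True, False],
    "Died": [27, 35, 3, 31],
    "Survived": [65, 67, 1, 70],
})

rows["n"] = rows["Died"] + rows["Survived"]
rows["Death rate (%)"] = (rows["Died"] / rows["n"] * 100).round(1)

# Hypothetical cut-off below which a rate is flagged as unreliable.
MIN_GROUP_SIZE = 30
rows["Small sample"] = rows["n"] < MIN_GROUP_SIZE

print(rows)
```

Only the female smokers group is flagged, which matches the observation above that this group is too small to draw conclusions from.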

# What is Smoking distribution,death and survival rate of Smokers and Non Smokers?


In [46]:
Smokers=df[df['Smoking']==True]
Non_Smokers=df[df['Smoking']==False]
values=(len(Smokers),len(Non_Smokers))
names=('Smokers','Non-Smokers')
fig=px.pie(names=names,values=values,title='Analysis of Smoking')
fig.show()

## Useful insights:-
Out of 299 people:

96 (32.1%) are smokers and 203 (67.9%) are non-smokers.

In [47]:
Smokers_died=Smokers[Smokers['Outcome']==True]
Smokers_survived=Smokers[Smokers['Outcome']==False]
values=(len(Smokers_died),len(Smokers_survived))
names=('Smokers died','Smokers survived')
fig=px.pie(names=names,values=values,title='Death rate of smokers')
fig.show()

## Useful insights:-
Out of 96 smokers:

30 (31.3%) died and 66 (68.8%) survived.


In [48]:
Non_Smokers_died=Non_Smokers[Non_Smokers['Outcome']==True]
Non_Smokers_survived=Non_Smokers[Non_Smokers['Outcome']==False]
values=(len(Smokers_died),len(Smokers_survived),len(Non_Smokers_died),len(Non_Smokers_survived))
names=('Smokers died','Smokers survived','Non-Smokers died','Non-Smokers survived')
fig=px.pie(names=names,values=values,title='Analysis of survival')
fig.show()

## Useful insights:-
Each percentage below is a share of all 299 patients:

1. Out of 96 smokers, 30 (10%) died and 66 (22.1%) survived.

2. Out of 203 non-smokers, 66 (22.1%) died and 137 (45.8%) survived.
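The filter-and-count pattern used throughout this notebook can be collapsed into a single grouped aggregation. The sketch below is a minimal illustration, assuming a boolean outcome column where `True` means the patient died; `outcome_breakdown` is a hypothetical helper, and `demo` is a stand-in frame rebuilt from the smoking counts reported above rather than the real dataset.

```python
import pandas as pd

def outcome_breakdown(data, column, outcome="Outcome"):
    """Count deaths and survivals per value of `column` (hypothetical helper)."""
    died_flag = data[outcome].astype(int)  # True -> 1, so the sum counts deaths
    counts = died_flag.groupby(data[column]).agg(died="sum", total="count")
    counts["survived"] = counts["total"] - counts["died"]
    counts["death_rate_pct"] = (counts["died"] / counts["total"] * 100).round(2)
    return counts

# Stand-in data: 96 smokers (30 died) followed by 203 non-smokers (66 died).
demo = pd.DataFrame({
    "Smoking": [True] * 96 + [False] * 203,
    "Outcome": [True] * 30 + [False] * 66 + [True] * 66 + [False] * 137,
})

result = outcome_breakdown(demo, "Smoking")
print(result)
```

On the real data the same call, for example `outcome_breakdown(df, 'Anaemia')`, would produce the counts that the per-condition cells compute by hand.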




## Note:-
To extend this project, we can try running ML algorithms on the data to see whether we can build a model that accurately predicts whether a patient died.

This notebook will be updated with those sections in the future.