# Heart Failure Prediction

In this notebook, we aim to predict heart failure using a dataset containing various clinical features. Heart disease is one of the leading causes of death worldwide, and early prediction can help in taking preventive measures. We will explore the dataset, preprocess it, build a machine learning model, and evaluate its performance.



<center>
<img src="https://storage.googleapis.com/kaggle-datasets-images/1582403/2603715/fc66626bcce9dec0f401f3f69c2ab2d1/dataset-cover.jpg?t=2021-09-10-18-13-42" alt="error" width="1000" height="600"></center>


## Attribute Information

- **👤 Age:** Age of the patient [years]

- **🔘 Sex:** Sex of the patient [M: Male, F: Female]

- **💓 ChestPainType:** Type of chest pain  
  - TA: Typical Angina  
  - ATA: Atypical Angina  
  - NAP: Non-Anginal Pain  
  - ASY: Asymptomatic

- **💉 RestingBP:** Resting blood pressure [mm Hg]

- **🩺 Cholesterol:** Serum cholesterol level [mg/dl]

- **🍬 FastingBS:** Fasting blood sugar  
  - 1: If FastingBS > 120 mg/dl  
  - 0: Otherwise

- **📉 RestingECG:** Resting electrocardiogram results  
  - Normal: Normal  
  - ST: ST-T wave abnormality  
  - LVH: Left ventricular hypertrophy by Estes' criteria

- **❤️‍🔥 MaxHR:** Maximum heart rate achieved [Numeric value between 60 and 202]

- **🚴‍♂️ ExerciseAngina:** Exercise-induced angina  
  - Y: Yes  
  - N: No

- **📏 Oldpeak:** ST depression value [Numeric value measured in depression]

- **📈 ST_Slope:** Slope of the peak exercise ST segment  
  - Up: Upsloping  
  - Flat: Flat  
  - Down: Downsloping

- **🫀 HeartDisease:** Output class  
  - 1: Heart disease  
  - 0: Normal


## Import Libraries

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

## Read Data📚

In [3]:
df = pd.read_csv(r"D:\Projects\predicting-heart-failure\heart.csv")  # raw string: backslashes are not treated as escape sequences


**Show data sample**

In [3]:
df.head() #show the first 5 rows

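| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |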


In [4]:
df.tail() #show the last 5 rows

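| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 913 | 45 | M | TA | 110 | 264 | 0 | Normal | 132 | N | 1.2 | Flat | 1 |
| 914 | 68 | M | ASY | 144 | 193 | 1 | Normal | 141 | N | 3.4 | Flat | 1 |
| 915 | 57 | M | ASY | 130 | 131 | 0 | Normal | 115 | Y | 1.2 | Flat | 1 |
| 916 | 57 | F | ATA | 130 | 236 | 0 | LVH | 174 | N | 0.0 | Flat | 1 |
| 917 | 38 | M | NAP | 138 | 175 | 0 | Normal | 173 | N | 0.0 | Up | 0 |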


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


## Statistical Summary

In [6]:
df.describe(include='all')

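| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 918.0 | 918 | 918 | 918.0 | 918.0 | 918.0 | 918 | 918.0 | 918 | 918.0 | 918 | 918.0 |
| unique | NaN | 2 | 4 | NaN | NaN | NaN | 3 | NaN | 2 | NaN | 3 | NaN |
| top | NaN | M | ASY | NaN | NaN | NaN | Normal | NaN | N | NaN | Flat | NaN |
| freq | NaN | 725 | 496 | NaN | NaN | NaN | 552 | NaN | 547 | NaN | 460 | NaN |
| mean | 53.510893 | NaN | NaN | 132.396514 | 198.799564 | 0.233115 | NaN | 136.809368 | NaN | 0.887364 | NaN | 0.553377 |
| std | 9.432617 | NaN | NaN | 18.514154 | 109.384145 | 0.423046 | NaN | 25.460334 | NaN | 1.06657 | NaN | 0.497414 |
| min | 28.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 60.0 | NaN | -2.6 | NaN | 0.0 |
| 25% | 47.0 | NaN | NaN | 120.0 | 173.25 | 0.0 | NaN | 120.0 | NaN | 0.0 | NaN | 0.0 |
| 50% | 54.0 | NaN | NaN | 130.0 | 223.0 | 0.0 | NaN | 138.0 | NaN | 0.6 | NaN | 1.0 |
| 75% | 60.0 | NaN | NaN | 140.0 | 267.0 | 0.0 | NaN | 156.0 | NaN | 1.5 | NaN | 1.0 |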


### **Univariate Analysis**

**Age**

In [7]:
df.Age.describe()

count    918.000000
mean      53.510893
std        9.432617
min       28.000000
25%       47.000000
50%       54.000000
75%       60.000000
max       77.000000
Name: Age, dtype: float64

**Sex**

In [8]:
df.Sex.value_counts()

Sex
M    725
F    193
Name: count, dtype: int64

**RestingBP**

In [9]:
df.RestingBP.describe()

count    918.000000
mean     132.396514
std       18.514154
min        0.000000
25%      120.000000
50%      130.000000
75%      140.000000
max      200.000000
Name: RestingBP, dtype: float64

**A RestingBP of 0 is not a plausible reading for a living patient, so find those rows and drop them**

In [10]:
df.drop(index=df[df.RestingBP==0].index,inplace=True)
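
The same filtering step can also be written with a boolean mask, which makes it easy to report how many rows are affected before removing them. This is only a sketch: `toy`, `invalid_bp`, and `cleaned` are hypothetical names, and the small frame stands in for the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's df
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138],
    "HeartDisease": [0, 1, 0, 1],
})

# Boolean mask marking the implausible readings
invalid_bp = toy["RestingBP"] == 0
print(f"Rows with RestingBP == 0: {invalid_bp.sum()}")

# Keep only the valid rows; .copy() gives an independent frame
cleaned = toy.loc[~invalid_bp].copy()
print(cleaned.shape)
```

Unlike `inplace=True`, this form leaves the original frame untouched, so the cell can be re-run safely.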

**Cholesterol**

- **Good:** below 200 mg/dl
- **Borderline high:** 200 to 239 mg/dl
- **High:** 240 mg/dl and above
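
These bands could be attached to the data with `pd.cut`. The sketch below uses a hypothetical series `chol` with made-up values rather than the notebook's `df`, and the label names are assumptions chosen to match the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical cholesterol values in mg/dl, chosen to hit each band edge
chol = pd.Series([180, 200, 239, 240, 603], name="Cholesterol")

# right=False makes each bin closed on the left: [0, 200), [200, 240), [240, inf)
chol_band = pd.cut(
    chol,
    bins=[0, 200, 240, np.inf],
    right=False,
    labels=["Good", "Borderline high", "High"],
)
print(chol_band.value_counts(sort=False))
```

Note that a placeholder value of 0 would fall into the lowest band, so invalid readings should be handled before binning.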

In [11]:
df.Cholesterol.describe()

count    917.000000
mean     199.016358
std      109.246330
min        0.000000
25%      174.000000
50%      223.000000
75%      267.000000
max      603.000000
Name: Cholesterol, dtype: float64

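**A Cholesterol of 0 mg/dl is not a plausible measurement, so find those rows and drop them (this removes 171 of the 917 remaining rows, leaving 746)**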

In [12]:
df.drop(index=df[df.Cholesterol==0].index,inplace=True)
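
Because this step removes a large share of the data, it can help to make the placeholder zeros explicit first. The sketch below is an alternative way of expressing the same drop, not the notebook's method: it marks zeros as missing and then removes them. `toy`, `marked`, and `cleaned` are hypothetical names on made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the notebook's df
toy = pd.DataFrame({"Cholesterol": [289, 0, 283, 0, 195]})

# Treat 0 as a placeholder for "not measured"
marked = toy.assign(Cholesterol=toy["Cholesterol"].replace(0, np.nan))
n_missing = int(marked["Cholesterol"].isna().sum())
print(f"Placeholder zeros: {n_missing} of {len(marked)} rows")

# Dropping the marked rows gives the same result as dropping Cholesterol == 0
cleaned = marked.dropna(subset=["Cholesterol"])
print(cleaned.shape)
```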

**FastingBS**

In [13]:
df.FastingBS.value_counts()

FastingBS
0    621
1    125
Name: count, dtype: int64

**What is the effect of fasting blood sugar on heart disease?**

In [14]:
df.groupby(['FastingBS','HeartDisease'])['HeartDisease'].count()

FastingBS  HeartDisease
0          0               347
           1               274
1          0                43
           1                82
Name: HeartDisease, dtype: int64

**<p style="color:red">Observations 📋</p>**

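- **Fasting blood sugar:**

   - **Most** people in the data (621 of 746) have **FastingBS = 0**, meaning fasting blood sugar is not above 120 mg/dl
   - **Few** people in the data (125 of 746) have **FastingBS = 1**, meaning fasting blood sugar is **above 120 mg/dl**
   - Heart disease is **more common** in the FastingBS = 1 group (82 of 125) than in the FastingBS = 0 group (274 of 621)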
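
Raw counts are hard to compare when the groups differ in size, so the counts from the `groupby` output above can be turned into within-group proportions. This is a minimal sketch: `counts`, `table`, and `rates` are hypothetical names, and the frame is rebuilt from the printed counts rather than from `df`.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.DataFrame({
    "FastingBS": [0, 0, 1, 1],
    "HeartDisease": [0, 1, 0, 1],
    "n": [347, 274, 43, 82],
})

# One row per FastingBS group, one column per HeartDisease class
table = counts.pivot(index="FastingBS", columns="HeartDisease", values="n")

# Divide each row by its total to get the share of each class within the group
rates = table.div(table.sum(axis=1), axis=0)
print(rates.round(3))
```

On the full frame, `pd.crosstab(df["FastingBS"], df["HeartDisease"], normalize="index")` should give the same table directly.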
   
   
--------------------------------------------------------------------------------    

**What is the number of sick and healthy people in both sexes?**

In [15]:
df.groupby("Sex")["HeartDisease"].value_counts()

Sex  HeartDisease
F    0               142
     1                40
M    1               316
     0               248
Name: count, dtype: int64

**<p style="color:red">Observations 📋</p>**

- Most of the people in the data are **male** (564 of 746)

- Among **women**, healthy cases (142) outnumber heart disease cases (40); among **men**, heart disease cases (316) outnumber healthy cases (248)
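
Since `HeartDisease` is coded 0/1, its mean within each group is the share of people with heart disease. The sketch below rebuilds one row per person from the counts shown above; `counts`, `patients`, and `disease_rate` are hypothetical names.

```python
import pandas as pd

# Counts copied from the value_counts output above: (Sex, HeartDisease) -> n
counts = {("F", 0): 142, ("F", 1): 40, ("M", 0): 248, ("M", 1): 316}

# Expand the counts into one row per person
rows = [(sex, hd) for (sex, hd), n in counts.items() for _ in range(n)]
patients = pd.DataFrame(rows, columns=["Sex", "HeartDisease"])

# Mean of a 0/1 column = proportion of 1s in each group
disease_rate = patients.groupby("Sex")["HeartDisease"].mean()
print(disease_rate.round(3))
```

On the notebook's data the equivalent call would be `df.groupby("Sex")["HeartDisease"].mean()`.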

--------------------------------------------------------------------------------

**Which type of chest pain is most common among patients with heart disease?**


In [16]:
df[df.HeartDisease==1]['ChestPainType'].value_counts()

ChestPainType
ASY    274
NAP     46
ATA     21
TA      15
Name: count, dtype: int64

**<p style="color:red">Observations 📋</p>**

- **Chest pain types ranked by number of patients with heart disease:**

   1. **ASY** (274)

   2. **NAP** (46)

   3. **ATA** (21)

   4. **TA** (15)
   
--------------------------------------------------------------------------------   

In [17]:
df.groupby(['Sex'])['ChestPainType'].value_counts(normalize=True)

Sex  ChestPainType
F    ASY              0.340659
     ATA              0.324176
     NAP              0.285714
     TA               0.049451
M    ASY              0.546099
     NAP              0.207447
     ATA              0.189716
     TA               0.056738
Name: proportion, dtype: float64

**What is the effect of exercise-induced angina on heart disease?**

In [18]:
df.groupby(['ExerciseAngina','HeartDisease'])['HeartDisease'].count()

ExerciseAngina  HeartDisease
N               0               340
                1               119
Y               0                50
                1               237
Name: HeartDisease, dtype: int64

**<p style="color:red">Observations 📋</p>**

- **Without exercise-induced angina (N):**

   - **Healthy** people (340) **outnumber** patients (119); about 26% of this group has heart disease

- **With exercise-induced angina (Y):**

   - **Patients** (237) **outnumber** healthy people (50); about 83% of this group has heart disease
   
    

--------------------------------------------------------------------------------

**What is the effect of the slope of the peak exercise ST segment on heart disease?**


In [19]:
df[df.HeartDisease==1]['ST_Slope'].value_counts()

ST_Slope
Flat    279
Up       45
Down     32
Name: count, dtype: int64

**<p style="color:red">Observations 📋</p>**

- **ST slope categories ranked by number of patients with heart disease:**

  1. **Flat** (279)

  2. **Up** (45)

  3. **Down** (32)
   
   
--------------------------------------------------------------------------------    

## Data preprocessing







**Data preprocessing** is the process of cleaning and organizing raw data so that it is suitable for building and training machine learning models.

**1 : Finding and cleaning null values**

In [20]:
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

**<p style="color:red">Observations 📋</p>**

There are no null values in any column.
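
Note that `isnull()` only detects `NaN`/`None`. Placeholder values such as the zeros dropped earlier from `RestingBP` and `Cholesterol` are not flagged and need a separate check. A minimal sketch on a hypothetical toy frame (`toy` and `zero_counts` are illustrative names):

```python
import pandas as pd

# Hypothetical toy frame with placeholder zeros but no NaN values
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130],
    "Cholesterol": [289, 180, 0],
})

# isnull() reports nothing, because 0 is a valid number
print(toy.isnull().sum())

# Counting zeros per column reveals the placeholders
zero_counts = (toy[["RestingBP", "Cholesterol"]] == 0).sum()
print(zero_counts)
```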

--------------------------------------------------------------------------------    


**2 : Delete duplicate data**

In [21]:
df.drop_duplicates(inplace=True)

In [22]:
df.shape

(746, 12)

**<p style="color:red">Observations 📋</p>**

**There are no duplicate rows: the frame still has 746 rows, the same number as before `drop_duplicates` was called**
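
Rather than comparing shapes, duplicates can be counted directly with `duplicated()` before anything is dropped. The sketch below uses a hypothetical toy frame in which one row is repeated; `toy`, `n_dupes`, and `deduped` are illustrative names.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first
toy = pd.DataFrame({
    "Age": [40, 49, 40],
    "Sex": ["M", "F", "M"],
    "HeartDisease": [0, 1, 0],
})

# duplicated() marks every repeat of an earlier row as True
n_dupes = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_dupes}")

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()
print(deduped.shape)
```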

------------------------------