# Heart Failure Disease Prediction

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease, or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease), need early detection and management, where a machine learning model can be of great help.

Attribute Information:
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise relative to rest [numeric value]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]

In [1]:
# Installing necessary libraries (uncomment to run)
# !pip install pandas
# !pip install numpy
# !pip install matplotlib
# !pip install seaborn
# !pip install scikit-learn

In [2]:
# Importing Libraries
import pandas as pd
import numpy as np

# Visualization Libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')


## Data Preprocessing

### Loading Dataset

In [3]:
df = pd.read_csv("heart.csv")

In [4]:
df.head()

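| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |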


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [6]:
# Counting the number of rows and columns
print("Number of rows : ",df.shape[0])
print("Number of columns : ",df.shape[1])

Number of rows :  918
Number of columns :  12



### Handling Missing Values

In [7]:
# Counting Missing/Null Values 
Missing_Value = df.isnull().sum()
print(Missing_Value)

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64


In [8]:
# Counting duplicate rows
Duplicate_Value = df.duplicated().sum()
print(f'Number of duplicated rows = {Duplicate_Value}')

Number of duplicated rows = 0
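
Note that `isnull()` only counts true `NaN` values. If a dataset encodes missing measurements with a placeholder such as `0`, the check above would not flag them. The sketch below uses a small made-up frame (`demo` and its values are hypothetical, not rows from `heart.csv`) to show how such placeholders could be counted; whether zeros are actually implausible for a given column is an assumption to verify against the data.

```python
import pandas as pd

# Hypothetical toy data: zeros stand in for "not measured"
demo = pd.DataFrame({
    "RestingBP": [140, 0, 130],
    "Cholesterol": [289, 0, 0],
})

# isnull() sees no missing values because 0 is a valid number
print(demo.isnull().sum())

# Counting exact zeros per column reveals the placeholders
zero_counts = (demo == 0).sum()
print(zero_counts)
```

On the real data, the same expression can be applied to the numeric columns of `df` before deciding how to treat any zeros it finds.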



In [None]:
df.shape


### Data Cleaning for Sex

In [None]:
df.head()

In [None]:
df.Sex.unique()

In [None]:
# One-hot encode Sex: 1 if male, else 0 (vectorized comparison instead of a row-wise apply)
df['Gender_Male'] = (df['Sex'] == 'M').astype(int)

In [None]:
# 1 if female, else 0
df['Gender_Female'] = (df['Sex'] == 'F').astype(int)

In [None]:
df.drop(columns=['Sex'], inplace=True)

In [None]:
df.head()


### Data Cleaning for ChestPainType

In [None]:
df.head()

In [None]:
df.ChestPainType.unique()

In [None]:
# One-hot encode ChestPainType: each column compares against its own label
df['ChestPainType_ATA'] = (df['ChestPainType'] == 'ATA').astype(int)

In [None]:
df['ChestPainType_NAP'] = (df['ChestPainType'] == 'NAP').astype(int)

In [None]:
df['ChestPainType_ASY'] = (df['ChestPainType'] == 'ASY').astype(int)

In [None]:
df['ChestPainType_TA'] = (df['ChestPainType'] == 'TA').astype(int)

In [None]:
df.drop(columns=['ChestPainType'], inplace=True)

In [None]:
df.head()
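
The manual one-hot encoding above can also be expressed with `pd.get_dummies`, which creates one indicator column per category in a single call. This is a minimal sketch on a made-up frame (`demo` and its values are hypothetical); it illustrates the technique and is not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy frame using the same category labels as the dataset
demo = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
})

# One 0/1 column per category; the original column is dropped automatically
encoded = pd.get_dummies(demo, columns=["ChestPainType"], dtype=int)
print(encoded)
```

Because each indicator is derived from the column's actual labels, this approach avoids copy-paste mistakes in the comparison strings.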


### Data Cleaning for RestingECG

In [None]:
df.head()

In [None]:
df.RestingECG.unique()

In [None]:
# One-hot encode RestingECG: each column compares against its own label
df['RestingECG_Normal'] = (df['RestingECG'] == 'Normal').astype(int)

In [None]:
df['RestingECG_ST'] = (df['RestingECG'] == 'ST').astype(int)

In [None]:
df['RestingECG_LVH'] = (df['RestingECG'] == 'LVH').astype(int)

In [None]:
df.drop(columns=['RestingECG'], inplace=True)

In [None]:
df.head()


### Data Cleaning for ExerciseAngina

In [None]:
df.head()

In [None]:
df.ExerciseAngina.unique()

In [None]:
df['ExerciseAngina'] = (df['ExerciseAngina'] == 'Y').astype(int)

In [None]:
df.head()


### Data Cleaning for ST_Slope

In [None]:
df.head()

In [None]:
df.ST_Slope.unique()

In [None]:
# Ordinal encoding: Up -> 0, Flat -> 1, Down -> 2; any other label becomes NaN
df['ST_Slope'] = df['ST_Slope'].map({'Up': 0, 'Flat': 1, 'Down': 2})

In [None]:
df.head()


## Independent and Dependent Features

In [None]:
# X holds the predictor columns, y the target label
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']

In [None]:
X.head()

In [None]:
y.head()


## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)

In [None]:
X_train.shape,X_test.shape

In [None]:
y_train.shape,y_test.shape
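
The split above samples rows at random, so the share of positive cases can differ slightly between the training and test sets. A minimal sketch, assuming one wants the class ratio preserved in both parts, passes `stratify` to `train_test_split`. The arrays `X_demo` and `y_demo` are synthetic stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 60% positive class
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 60 + [0] * 40)

# stratify=y_demo keeps the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```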


## Check for multicollinearity

In [None]:
plt.figure(figsize=(20,20))
corr=X_train.corr()
sns.heatmap(corr,annot=True)
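
A large annotated heatmap can be hard to read, so it may help to list the strongly correlated pairs directly. The sketch below does this on synthetic columns; `demo`, the column names, and the `0.85` threshold are assumptions chosen for illustration. In this notebook, complementary indicator columns such as `Gender_Male` and `Gender_Female` would be expected to show up in such a list.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is nearly a scaled copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.85  # illustrative cut-off, not a fixed rule
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]
print(high_pairs)
```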


## Feature Scaling or Standardization

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_train_scaled=scaler.fit_transform(X_train)
X_test_scaled=scaler.transform(X_test)

In [None]:
X_train_scaled
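
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation. The following sketch, on synthetic numbers rather than the notebook's data, shows what that implies: the scaled training data has mean 0 and standard deviation 1, while the scaled test data is only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data standing in for a column such as RestingBP
rng = np.random.default_rng(1)
train = rng.normal(loc=130, scale=20, size=(50, 1))
test = rng.normal(loc=130, scale=20, size=(20, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("test mean/std:", test_scaled.mean(), test_scaled.std())
```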

### Box Plots to Understand the Effect of Standard Scaler

In [None]:
plt.figure(figsize=(25, 6))  # one figure; the two panels are added with plt.subplot below
plt.subplot(1, 2, 1)
sns.boxplot(data=X_train)
plt.title('X_train Before Scaling')
plt.subplot(1, 2, 2)
sns.boxplot(data=X_train_scaled)
plt.title('X_train After Scaling')


## Model Selection

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score


# Creating an instance of each classifier
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Support Vector Machine': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
}

# Training and evaluating each classifier
for name, classifier in classifiers.items():
    classifier.fit(X_train_scaled, y_train)
    y_pred = classifier.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{name}: Accuracy = {accuracy}')
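
Accuracy on a single train/test split can vary with the random seed, so a comparison based on cross-validation may be more stable. This is a sketch under stated assumptions: the data comes from `make_classification` rather than `heart.csv`, only two of the classifiers are shown, and wrapping the scaler in a pipeline is a design choice that is not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data as a stand-in
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Each pipeline refits the scaler inside every fold, so no fold sees test statistics
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Random Forest": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=50, random_state=42)
    ),
}

cv_scores = {}
for name, pipe in candidates.items():
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```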

#### Selecting the Random Forest Classifier for the ML model, as it performed best among the classifiers compared

## Creating ML Model using Random Forest Classifier

In [None]:
import pickle # To save model for future use

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Initializing the Random Forest Classifier
model = RandomForestClassifier()

# Training the model on the training set
model.fit(X_train_scaled, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test_scaled)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Printing the evaluation metrics
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

# Saving the trained model to disk; the with-block closes the file afterwards
with open('heart.pk1', 'wb') as f:
    pickle.dump(model, f)
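
Precision and recall can be read directly off the confusion matrix, which also shows how many cases of each kind were misclassified. The labels below are made up for illustration (`y_true_demo` and `y_pred_demo` are hypothetical, not model output from this notebook).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)     # share of actual positives that are found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision_manual, "recall:", recall_manual)
```

In a screening setting, the false-negative count is often the number to watch, since those are patients with heart disease whom the model missed.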

## Making Predictions

In [None]:
import pickle

# Loading the saved model; the with-block closes the file afterwards
with open('heart.pk1', 'rb') as f:
    heart_model = pickle.load(f)

heart_model_accuracy = heart_model.score(X_test_scaled, y_test)

print("Model Accuracy:" , heart_model_accuracy * 100 , "%")
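
The saved model was trained on scaled features, so any new input must be transformed with the same fitted scaler before prediction. One way to keep the two together, sketched here as an assumption rather than the notebook's method, is to pickle a scikit-learn pipeline that contains both steps. The data is synthetic and the round trip uses in-memory bytes instead of a file.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaler and classifier are fitted and stored as one object
pipeline = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=50, random_state=0)
)
pipeline.fit(X_demo, y_demo)

# Serialize and restore, as one would with a model file
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# Raw (unscaled) rows can be passed straight to the restored pipeline
new_rows = X_demo[:3]
print("predicted class:", restored.predict(new_rows))
print("probability of class 1:", restored.predict_proba(new_rows)[:, 1])
```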