<a href="https://colab.research.google.com/github/JoaquinGF21/Hackabull-Code/blob/master/ACMxDSC_Workshop_Hands_On_Code_27_28.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Heart Failure Prediction**



### Aim :
- To classify / predict whether a patient is prone to heart failure based on multiple attributes.
- It is a **binary classification** problem with multiple numerical and categorical features.

### <center>Dataset Attributes</center>
    
- **Age** : age of the patient [years]
- **Sex** : sex of the patient [M: Male, F: Female]
- **ChestPainType** : chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- **RestingBP** : resting blood pressure [mm Hg]
- **Cholesterol** : serum cholesterol [mg/dl]
- **FastingBS** : fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- **RestingECG** : resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- **MaxHR** : maximum heart rate achieved [Numeric value between 60 and 202]
- **ExerciseAngina** : exercise-induced angina [Y: Yes, N: No]
- **Oldpeak** : ST depression induced by exercise relative to rest [Numeric value]
- **ST_Slope** : the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- **HeartDisease** : output class [1: heart disease, 0: Normal]

# **Data Collection**

### Import the Necessary Libraries :

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
#Read the csv file
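
A minimal sketch of the loading step. The notebook does not name the file, so `heart.csv` in the comment below is a hypothetical path. To stay runnable anywhere, the sketch reads a few made-up rows from an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Made-up sample rows standing in for the real file (values are illustrative only).
sample_csv = io.StringIO(
    "Age,Sex,ChestPainType,RestingBP,Cholesterol,HeartDisease\n"
    "40,M,ATA,140,289,0\n"
    "49,F,NAP,160,180,1\n"
    "37,M,,130,283,0\n"
)

# With the real dataset this would be something like pd.read_csv("heart.csv"),
# where "heart.csv" is a hypothetical file name.
data = pd.read_csv(sample_csv)
print(data.head())
```

Empty fields, such as the missing `ChestPainType` in the third row, are read as `NaN`, which is what the later missing-value steps operate on.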

### Data Info :

In [None]:
#Get the dataset shape

In [None]:
#Get Columns in dataset

In [None]:
#Find the Null values in the dataset parameters
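
The three inspection steps above could look roughly like this. The toy frame is an assumption made only so the sketch runs on its own; in the notebook, `data` would be the frame loaded from the CSV file.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustrative data only).
data = pd.DataFrame({
    "Age": [40, 49, np.nan, 48],
    "ChestPainType": ["ATA", None, "NAP", "ASY"],
    "HeartDisease": [0, 1, 0, 1],
})

print(data.shape)           # (number of rows, number of columns)
print(list(data.columns))   # column names
print(data.isnull().sum())  # count of missing values per column
```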

# Dealing with Missing Values in the Dataset:
There are two ways:
1. **Remove the missing values in the dataset**: If the missing values are a small proportion of your dataset, you may choose to remove the rows or columns containing them. This can be done using the dropna() function in pandas.
2. **Imputing the dataset**: Replace missing values with a suitable estimate. Common methods include:
  * Mean/Median Imputation: Replace missing values with the mean or median of the column.
  * Mode Imputation: Replace missing values with the mode (most frequent value) of the column.
  * Custom Imputation: Use domain knowledge or other statistical methods to impute missing values.
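
The two approaches can be sketched side by side. This is one possible way to do it, on a made-up frame; which columns get the mean, the median, or the mode here is an assumption for illustration, not a recommendation for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
data = pd.DataFrame({
    "Age": [40.0, np.nan, 37.0, 48.0],
    "Cholesterol": [289.0, 180.0, np.nan, 214.0],
    "ChestPainType": ["ATA", "NAP", None, "NAP"],
})

# Approach 1: drop every row that has at least one missing value.
dropped = data.dropna()

# Approach 2: impute. Mean or median for numeric columns, mode for categorical ones.
imputed = data.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())
imputed["Cholesterol"] = imputed["Cholesterol"].fillna(imputed["Cholesterol"].median())
imputed["ChestPainType"] = imputed["ChestPainType"].fillna(imputed["ChestPainType"].mode()[0])

print(len(data), len(dropped))
print(imputed)
```

Dropping is simple but discards whole rows, while imputation keeps every row at the cost of introducing estimated values.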






In [None]:
#Replace the missing values with the mean of the respective column.


In [None]:
#Finding the number of 0's and number of 1's
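
A short sketch of the counting step, assuming the stub refers to the binary target column `HeartDisease` (the notebook does not say which column is meant). The values are made up.

```python
import pandas as pd

# Toy target column (illustrative values only).
data = pd.DataFrame({"HeartDisease": [1, 0, 1, 1, 0]})

counts = data["HeartDisease"].value_counts()                 # absolute counts per class
shares = data["HeartDisease"].value_counts(normalize=True)   # relative frequencies

print(counts)
print(shares)
```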


In [None]:
#Filling the ChestPainType parameter with the most common value for the parameter.

In [None]:
#Dropping the Null values in dataset

In [None]:
#Check the NAs in dataset for Age


In [None]:
data.describe().T

# **Exploratory Data Analysis**

### Dividing features into Numerical and Categorical :

In [None]:
col = list(data.columns)
categorical_features = []
numerical_features = []
for i in col:
    if len(data[i].unique()) > 6:
        numerical_features.append(i)
    else:
        categorical_features.append(i)

print('Categorical Features :',*categorical_features)
print('Numerical Features :',*numerical_features)

Categorical Features : Sex ChestPainType FastingBS RestingECG ExerciseAngina ST_Slope HeartDisease
Numerical Features : Age RestingBP Cholesterol MaxHR Oldpeak


- Here, an attribute is treated as categorical if it has 6 or fewer unique values; otherwise it is treated as numerical.
- A typical alternative is to divide the features by the datatype of each attribute's elements.

**Eg :** datatype = integer, attribute = numerical feature ; datatype = string, attribute = categorical feature

- For this dataset, as the number of features is small, we can also check the dataset manually.
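
The datatype-based division mentioned above could be sketched with `select_dtypes`. The frame below is made up; it also shows why the unique-value rule is used in this notebook: a 0/1 flag such as `FastingBS` is stored as an integer, so a pure datatype split would file it under numerical features.

```python
import pandas as pd

# Toy frame (illustrative values only).
data = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "FastingBS": [0, 1, 0],
    "Oldpeak": [0.0, 1.0, 1.5],
})

# Split purely by datatype: numeric columns vs everything else.
numerical_by_dtype = data.select_dtypes(include="number").columns.tolist()
categorical_by_dtype = data.select_dtypes(exclude="number").columns.tolist()

print("Numerical (by dtype)   :", numerical_by_dtype)
print("Categorical (by dtype) :", categorical_by_dtype)
```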

### Categorical Features :

In [None]:
data.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data_transformed = data.copy(deep = True)

data_transformed['Sex'] = le.fit_transform(data_transformed['Sex'])
#Similarly for ChestPainType, RestingECG, ExerciseAngina, ST_Slope

- Creating a deep copy of the original dataset and label encoding the text data of the categorical features.
- Later modifications to the original dataset will not be reflected in this deep copy, and vice versa.
- Hence, we use this deep copy of dataset that has all the features converted into numerical values for visualization & modeling purposes.
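
One way to extend the encoding to the remaining text columns is a loop with a fresh `LabelEncoder` per column. This is a sketch on a made-up frame; the `encoders` dictionary is a hypothetical helper that is not part of the notebook, kept here so the integer codes can be mapped back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (illustrative values only).
data = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "ExerciseAngina": ["N", "Y", "N"],
})
data_transformed = data.copy(deep=True)

text_columns = ["Sex", "ChestPainType", "ExerciseAngina"]
encoders = {}  # hypothetical helper: one fitted encoder per column
for column in text_columns:
    encoder = LabelEncoder()
    data_transformed[column] = encoder.fit_transform(data_transformed[column])
    encoders[column] = encoder

print(data_transformed)
print(list(encoders["ChestPainType"].classes_))  # labels in the order of their codes
```

`LabelEncoder` assigns codes in sorted label order, so the integers carry no ranking of their own; they are only convenient identifiers for plotting and modeling.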

In [None]:
data_transformed.head()

#### Distribution of Categorical Features :

In [None]:
# One panel per categorical feature except the last one (HeartDisease here), which gets its own figure below.
fig, axes = plt.subplots(nrows = 3,ncols = 2,figsize = (10,15))
for ax, feature in zip(axes.flat, categorical_features[:-1]):
    # histplot with kde = True replaces the deprecated sns.distplot
    sns.histplot(data_transformed[feature], kde = True, stat = 'density', kde_kws = {'bw_method' : 1}, ax = ax)
    ax.set_title('Distribution : ' + feature)
plt.show()

last_feature = categorical_features[-1]
plt.figure(figsize = (4.75,4.55))
sns.histplot(data_transformed[last_feature], kde = True, stat = 'density', kde_kws = {'bw_method' : 1})
plt.title('Distribution : ' + last_feature);

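- Because of the wide KDE bandwidth, the smoothed curves of the categorical features look roughly **Normally Distributed**, but these are label-encoded discrete values, so the bar heights (category frequencies) are the meaningful part of each plot.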

In [None]:
output_counts = data_transformed['HeartDisease'].value_counts()

# Plotting the pie chart with Matplotlib, using Seaborn's pastel palette
plt.figure(figsize=(3, 3))
sns.set(style="whitegrid")
sns.set_palette("pastel")
plt.pie(output_counts, labels=output_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Output Variable')
plt.show()

In [None]:
data['HeartDisease'].value_counts().plot(kind = 'bar')

### Numerical Features :

#### Distribution of Numerical Features :

In [None]:
# The first four numerical features share a 2x2 grid; the last one gets its own figure below.
fig, axes = plt.subplots(nrows = 2,ncols = 2,figsize = (9,9))
for ax, feature in zip(axes.flat, numerical_features[:-1]):
    # histplot with kde = True replaces the deprecated sns.distplot
    sns.histplot(data[feature], kde = True, stat = 'density', ax = ax)
    ax.set_title('Distribution : ' + feature)
plt.show()

last_feature = numerical_features[-1]
plt.figure(figsize = (4.75,4.55))
sns.histplot(data[last_feature], kde = True, stat = 'density', kde_kws = {'bw_method' : 1})
plt.title('Distribution : ' + last_feature);

- **Oldpeak's** data distribution is right-skewed.
- **Cholesterol** has a bimodal data distribution.
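
Skewness can also be checked numerically rather than only by eye. The sketch below uses a made-up series shaped loosely like `Oldpeak` (many zeros, a few large values); it is not the dataset itself. A positive value from `Series.skew()` indicates a right-skewed distribution, and in such cases the mean typically sits above the median.

```python
import pandas as pd

# Made-up values with many zeros and a long right tail (illustrative only).
oldpeak_like = pd.Series([0.0, 0.0, 0.0, 0.5, 1.0, 1.5, 2.0, 4.0])

skewness = oldpeak_like.skew()   # sample skewness; > 0 means right-skewed
mean_value = oldpeak_like.mean()
median_value = oldpeak_like.median()

print("skewness:", skewness)
print("mean:", mean_value, "median:", median_value)
```

In the notebook, the same call on `data['Oldpeak']` would give a quick numeric check of the observation above.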