# Heart Failure Prediction Data




<center><img src="https://www.michiganmedicine.org/sites/default/files/blog/heart_beating_0.gif"></center>

## Summary
Cardiovascular disease (CVD) has remained one of the leading causes of death worldwide since 1975. In 2015, the World Health Organisation attributed 17.7 million deaths to CVD, while 633,842 deaths were recorded in the United States alone (1). As CVD-related deaths continue to increase, so does the burden on the healthcare system: one estimate puts its cost at \$237 billion, surpassing both Alzheimer's disease and diabetes, with projections reaching \$368 billion by the year 2035 (1). Studies have shown that CVD incidence increases significantly with age, with some variation between genders (1). This dataset records 11 key features of heart disease across 918 patients (2). This project aims to clean, validate and investigate trends in the data to identify correlations that may help develop early detection and treatment strategies for future patients.

## Importing required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import plotly.express as px

## Importing the Dataset

In [None]:
df = pd.read_csv('/kaggle/input/heart-failure-prediction/heart.csv')
df.info()

The dataset contains 918 rows and 12 columns spanning several data types.

In [None]:
data_types = df.dtypes.value_counts()
data_types_table = pd.concat([data_types], axis = 1, keys = ['Sum of Data Type'])
data_types_table

Looking further into the data types we can see that we have a total of 6 integer measurements, 5 object measurements and 1 float measurement.

To work with the columns that have the 'object' data type, we convert them to the pandas 'string' data type.

In [None]:
object_select = df.select_dtypes(include = 'object').columns
df[object_select]=df[object_select].astype('string')

data_types = df.dtypes.value_counts()
data_types_table = pd.concat([data_types], axis = 1, keys = ['Sum of Data Type'])
data_types_table

As we can see all 'object' columns are now 'string' columns, which makes it easier for us to analyse the data while keeping the overall data consistent.

## Data Validation and Cleaning
### Missing Data Check and Removal

In [None]:
null_check = df.isnull().sum()

missing_data_table = pd.concat([null_check], axis = 1, keys = ['Total Missing Data'])

missing_data_table

As the non-null counts from `df.info()` already suggested, no null values are present in this dataset.
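
Note that `isnull()` only counts values that pandas itself treats as missing. A physiologically implausible value, such as a resting blood pressure or cholesterol reading of 0, would pass this check while still acting as a placeholder for a missing measurement. The sketch below runs on a small hypothetical frame (the values and the helper name `count_zero_placeholders` are illustrative assumptions, not taken from the dataset) and shows one way to count such zeros. Whether any occur in `heart.csv` would need to be confirmed by running the helper on `df`.

```python
import pandas as pd


def count_zero_placeholders(frame, columns):
    """Count entries equal to 0 in each of the given columns."""
    return (frame[columns] == 0).sum()


# Hypothetical toy data for illustration only, not rows from heart.csv
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 120],
    "Cholesterol": [289, 0, 0, 214],
})

zero_counts = count_zero_placeholders(toy, ["RestingBP", "Cholesterol"])
print(zero_counts)
```

On the real data, the equivalent call would be `count_zero_placeholders(df, ['RestingBP', 'Cholesterol'])`.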

### Duplicated Data Check and Removal

In [None]:
duplicated_check = df.duplicated().sum()

print('Total duplicates present:', duplicated_check)

print('Rows before duplicate removal:', df.shape[0])

df.drop_duplicates(inplace = True)

print('Rows after duplicate removal:', df.shape[0])

As no duplicates were present in the dataset, no rows were removed after duplicate removal

## Data Analysis

### Average Age, RestingBP, Cholesterol and MaxHR from patients

In [None]:
mean_values = round(df[['Age', 'RestingBP', 'Cholesterol', 'MaxHR']].mean(),1)

pd.concat([mean_values], axis = 1, keys = ['Average values'])

**Summary**
* The average age of patients in this dataset is 53.5, suggesting that the sample consists primarily of older patients.

* Typically the average resting blood pressure of a healthy individual is around 120-125 mm Hg, while the sample average is considerably higher at 132.4 mm Hg.

* Cholesterol level is a significant risk factor for the development of early heart failure. Within the dataset, the average cholesterol level is significantly higher than in healthy patients. Typically, cholesterol levels of 190 mg/dL or higher are considered very high.

* MaxHR is the maximum heart rate achieved during exercise tests. Using the **Miller formula**, where *HRmax = 217 - (0.85 x age)*, we can see that the average MaxHR is considerably lower than expected for the average age.

>* 217 - (0.85 x 53.5) = 171.5, which is 34.7 beats per minute higher than the recorded average MaxHR of 136.8
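
The Miller calculation above can be reproduced with a short helper. This is a minimal sketch: the function names `miller_hr_max` and `hr_gap` are illustrative, and the mean age of 53.5 is the value quoted in the summary rather than one recomputed from `df`.

```python
def miller_hr_max(age):
    """Age-predicted maximum heart rate (bpm): HRmax = 217 - 0.85 * age."""
    return 217 - 0.85 * age


def hr_gap(age, observed_max_hr):
    """Difference between the predicted and the observed maximum heart rate."""
    return miller_hr_max(age) - observed_max_hr


mean_age = 53.5  # average age quoted in the summary above
predicted = miller_hr_max(mean_age)
print(f"Predicted HRmax at age {mean_age}: {predicted:.1f} bpm")

# With the notebook's DataFrame loaded, the gap could be computed as:
# hr_gap(df['Age'].mean(), df['MaxHR'].mean())
```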

### Selecting Categorical data

In [None]:
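# Columns stored as strings are the categorical features
string_col = df.select_dtypes('string').columns.to_list()

# Despite its name, cat_col ends up holding the remaining numeric feature columns:
# start from every column, then drop the string columns and the target
cat_col = df.columns.to_list()

for col in string_col:
    cat_col.remove(col)

cat_col.remove("HeartDisease")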

In [None]:
df.describe().T

### Distribution of Gender

In [None]:
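# px.pie's `labels` argument renames columns, not values, so map M/F before plotting
fig = px.pie(df.assign(Sex=df['Sex'].map({'M': 'Male', 'F': 'Female'})),
             names='Sex',
             title='Gender Distribution')
fig.show()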

***Summary of Gender Distribution***

725 patients (79%) recorded in this study were male, while 193 patients (21%) were female. While studies have found that heart failure occurs at a higher incidence in men, the overall prevalence rate is similar between genders. This can be explained by women typically surviving longer after the onset of heart failure (3). Within healthcare studies it is critically important to use samples that accurately represent the overall population. As the population in this dataset is skewed towards males, the generalisability of this study will be limited, and further study using more diverse populations will be required if we are to avoid developing a treatment methodology that is biased towards a specific gender.
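
The percentages quoted above can be checked with `value_counts(normalize=True)`. The sketch below rebuilds a stand-in `Sex` column from the counts in the text (725 male, 193 female) so that it runs on its own; in the notebook the same expression could be applied to `df['Sex']` directly.

```python
import pandas as pd

# Stand-in column rebuilt from the counts quoted in the summary (assumption:
# 725 'M' and 193 'F'); in the notebook, use df['Sex'] instead
sex = pd.Series(["M"] * 725 + ["F"] * 193, name="Sex")

# Share of each category as a percentage, rounded to one decimal place
sex_share = (sex.value_counts(normalize=True) * 100).round(1)
print(sex_share)
```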

### Most common chest pain type

In [None]:
Chest_pain_type_fig = px.histogram(df, "ChestPainType", color='Sex',hover_data=df.columns)
Chest_pain_type_fig.show()

***Summary of Chest Pain Type***

* The most common type of chest pain overall was Asymptomatic (ASY). Non-Anginal Pain (NAP) and Atypical Angina (ATA) occurred at similar frequencies, leaving Typical Angina (TA) as the least common type of chest pain.

In [None]:
fig = px.box(df, y= "Age", x="HeartDisease",color='Sex', title= 'Distribution of Age')
fig.show()

In [None]:
fig = px.box(df, y= "RestingBP", x="HeartDisease", color='Sex', title= 'Distribution of RestingBP')
fig.show()

In [None]:
fig = px.box(df, y= "Cholesterol", x="HeartDisease",color='Sex',title= 'Distribution of Cholesterol')
fig.show()

In [None]:
fig = px.box(df, y= "MaxHR", x="HeartDisease",color='Sex', title= 'Distribution of MaxHR')
fig.show()

In [None]:
fig = px.box(df, y= "Oldpeak", x="HeartDisease",color='Sex', title= 'Distribution of Oldpeak')
fig.show()

### Cholesterol Distribution between Gender

In [None]:
fig= px.histogram(df, 'Cholesterol',color = 'Sex')

fig.show()

### Distribution of exercise-induced angina
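
No cell was provided for this section. As a hedged sketch of how the distribution might be tabulated, the example below cross-tabulates exercise-induced angina against sex on a small hypothetical frame. The column name `ExerciseAngina` and its `Y`/`N` coding are assumptions about the dataset, and the toy rows are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows, not taken from heart.csv; the column name and
# Y/N coding are assumed
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "F"],
    "ExerciseAngina": ["Y", "N", "N", "Y", "N"],
})

# Count of patients for each combination of angina status and sex
angina_by_sex = pd.crosstab(toy["ExerciseAngina"], toy["Sex"])
print(angina_by_sex)
```

If the column exists under that name, the same call on `df` would give the real counts.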

### Scatterplot: Age and Cholesterol level

In [None]:
fig = px.scatter(df,x = "Cholesterol", y = "Age", color= 'Sex')

fig.show()

References <br>
1. Benjamin, E. J., Virani, S. S., Callaway, C. W., Chamberlain, A. M., Chang, A. R., Cheng, S., Chiuve, S. E., Cushman, M., Delling, F. N., Deo, R., de Ferranti, S. D., Ferguson, J. F., Fornage, M., Gillespie, C., Isasi, C. R., Jiménez, M. C., Jordan, L. C., Judd, S. E., Lackland, D., Lichtman, J. H., … American Heart Association Council on Epidemiology and Prevention Statistics Committee and Stroke Statistics Subcommittee (2018). Heart Disease and Stroke Statistics-2018 Update: A Report From the American Heart Association. Circulation, 137(12), e67–e492. https://doi.org/10.1161/CIR.0000000000000558
2. Fedesoriano. (2021). Heart Failure Prediction Dataset. Retrieved [27/09/2023] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.
3. Strömberg, A., & Mårtensson, J. (2003). Gender differences in patients with heart failure. European journal of cardiovascular nursing, 2(1), 7–18. https://doi.org/10.1016/S1474-5151(03)00002-1