# Heart Failure Prediction Data




<center><img src="https://www.michiganmedicine.org/sites/default/files/blog/heart_beating_0.gif"></center>

## Summary
Since 1975, cardiovascular disease (CVD) has remained one of the leading causes of death worldwide. In 2015, the World Health Organisation found that 17.7 million deaths were CVD-related, while 633,842 deaths were recorded in the United States alone (1). As CVD-related deaths continue to increase, so does the burden on the healthcare system: some estimates put its cost at \$237 billion, surpassing both Alzheimer's disease and diabetes, and projections put it at \$368 billion by the year 2035 (1). Studies have shown that CVD incidence increases significantly with age, with some variation between genders (1). This dataset covers 11 key features of heart disease across 918 patients (2). This project aims to clean, validate and investigate trends in the data to identify correlations that may help develop early detection and treatment strategies for future patients.

## Importing required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the Dataset

In [None]:
df = pd.read_csv('/kaggle/input/heart-failure-prediction/heart.csv')
df.info()

The dataset contains 918 rows and 12 columns, spanning several data types.

In [None]:
data_types = df.dtypes.value_counts()
data_types_table = pd.concat([data_types], axis = 1, keys = ['Sum of Data Type'])
data_types_table

Looking further into the data types, we can see a total of 6 integer columns, 5 object columns and 1 float column.

The 'object' columns hold text, so we convert them to the 'string' data type to make that explicit.

In [None]:
object_select = df.select_dtypes(include = 'object').columns
df[object_select]=df[object_select].astype('string')

data_types = df.dtypes.value_counts()
data_types_table = pd.concat([data_types], axis = 1, keys = ['Sum of Data Type'])
data_types_table

As we can see, all 'object' columns are now 'string' columns, which makes it easier for us to analyse the data while keeping the overall data consistent.
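
To see the effect of this conversion in isolation, here is a minimal sketch on a toy frame. The values are hypothetical, and the text columns are created with `dtype="object"` on purpose so that the sketch behaves the same regardless of how a given pandas version infers text columns.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": pd.Series(["M", "F", "M"], dtype="object"),
    "ChestPainType": pd.Series(["ATA", "NAP", "ATA"], dtype="object"),
})

# Select the text columns by dtype, then convert them to the 'string' dtype.
text_cols = toy.select_dtypes(include="object").columns
toy[text_cols] = toy[text_cols].astype("string")

# The numeric column is left untouched; the text columns now report 'string'.
print(toy.dtypes)
```

Selecting by dtype rather than by name means the same two lines keep working if more text columns are added later.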

## Data Validation and Cleaning
### Missing Data Check and Removal

In [None]:
null_check = df.isnull().sum()

missing_data_table = pd.concat([null_check], axis = 1, keys = ['Total Missing Data'])

missing_data_table

**No null values are present within this dataset**
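
Since this dataset has no nulls, the check above returns only zeros. As an illustration of what it would report otherwise, here is a small sketch on a toy frame with one deliberately missing value. The values and the extra `Percent Missing` column are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical values; one RestingBP reading is deliberately missing.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 54],
    "RestingBP": [140.0, np.nan, 130.0, 150.0],
})

# isnull() gives a True/False frame: sum() counts the missing entries,
# and mean() gives the fraction of rows affected in each column.
summary = pd.DataFrame({
    "Total Missing Data": toy.isnull().sum(),
    "Percent Missing": toy.isnull().mean() * 100,
})
print(summary)
```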

### Duplicated Data Check and Removal

In [None]:
duplicated_check = df.duplicated().sum()

print('Total duplicates present:', duplicated_check)

print('Rows before duplicate removal:', df.shape[0])

df.drop_duplicates(inplace = True)

print('Rows after duplicate removal:', df.shape[0])

**No duplicates were present in the dataset, so no rows were removed**
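
When duplicates do exist, it can help to look at them before dropping anything. The sketch below uses a toy frame with hypothetical values and one repeated row to show one way of doing that.

```python
import pandas as pd

# Hypothetical values; the third row repeats the first.
toy = pd.DataFrame({
    "Age": [40, 49, 40],
    "Sex": ["M", "F", "M"],
    "MaxHR": [172, 156, 172],
})

# keep=False flags every copy of a repeated row, so all of them can be inspected.
repeated_rows = toy[toy.duplicated(keep=False)]
print(repeated_rows)

# drop_duplicates() returns a new frame and keeps the first occurrence by default.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduplicated))
```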

## Data Analysis

### Average Age, RestingBP, Cholesterol and MaxHR of patients

In [None]:
mean_values = round(df[['Age', 'RestingBP', 'Cholesterol', 'MaxHR']].mean(),1)

pd.concat([mean_values], axis = 1, keys = ['Average values'])

**Summary**
* The average age in this dataset is 53, suggesting that the sample consisted primarily of older patients.

* The average resting blood pressure of a healthy individual is typically around 120-125 mm Hg, while the sample average is considerably higher at 132.4 mm Hg.
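
One caveat on the cholesterol average: the scatterplot later in this notebook filters out cholesterol values of 0, which suggests that zeros may stand in for missing measurements. If that assumption holds, they pull the mean down. The sketch below, on hypothetical values, shows how the two averages can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the zeros play the role of missing measurements.
toy = pd.DataFrame({"Cholesterol": [289, 180, 0, 214, 0]})

# Plain mean, with the zeros counted as real readings.
mean_with_zeros = toy["Cholesterol"].mean()

# Assumption: 0 is a placeholder, not a real reading. Replacing it with NaN
# makes mean() skip those rows, since NaN is ignored by default.
mean_without_zeros = toy["Cholesterol"].replace(0, np.nan).mean()

print(round(mean_with_zeros, 1), round(mean_without_zeros, 1))
```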

### Distribution of Gender

In [None]:
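# Count patients per sex. groupby sorts the keys, so the order is 'F' then 'M'
# (assuming the column uses those two codes), which matches the labels below.
grouped_sex = df.groupby('Sex').size()

glabels = ['Female', 'Male']
gexplode = [0, 0.05]

plt.pie(grouped_sex, autopct = '%1.0f%%', explode = gexplode, labels = glabels)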
plt.title('Male and Female Distribution')

plt.show()

### Most common chest pain type

In [None]:
grouped_chest_pain_type = df.groupby('ChestPainType').size()
pd.concat([grouped_chest_pain_type], axis = 1, keys = ['Sum of each Chest Pain Type'])
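
To answer the "most common" question directly, the counts can also be sorted and shown as shares. This is a sketch on hypothetical values; `value_counts` is a standard pandas method, but the table layout and column names here are our own choice.

```python
import pandas as pd

# Hypothetical values, used only to illustrate the method.
toy = pd.Series(["ASY", "ATA", "ASY", "NAP", "ASY", "TA"], name="ChestPainType")

# value_counts() sorts in descending order, so the most common type comes first.
counts = toy.value_counts()

# Build a small table with the count and the percentage share of each type.
table = counts.to_frame("Count")
table["Share (%)"] = (counts / counts.sum() * 100).round(1)
print(table)
```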

### Cholesterol Distribution

In [None]:
cholesterol = df['Cholesterol']

plt.hist(cholesterol, color='orange', bins = 15)

plt.title("Cholesterol Distribution")
plt.ylabel("Number of patients")
plt.xlabel("Cholesterol (mg/dl)")
plt.show()

### Distribution of exercise-induced angina

In [None]:
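# groupby sorts the keys alphabetically, so the order is 'N' then 'Y'
# (assuming the column uses those two codes). The labels must follow that order.
angina_counts = df.groupby('ExerciseAngina').size()

glabels = ['No', 'Yes']
gexplode = [0, 0.05]

plt.pie(angina_counts, autopct = '%1.0f%%', explode = gexplode, labels = glabels)
plt.title("Distribution of exercise-induced angina")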

plt.show()

### Scatterplot: Age and Cholesterol level

In [None]:
x = df['Cholesterol']
y = df['Age']

# Keep only rows where both values are positive (this drops cholesterol readings of 0)
valid = (x > 0) & (y > 0)

plt.scatter(x[valid], y[valid])
plt.title("Age and Cholesterol")
plt.xlabel("Cholesterol Level (mg/dl)")
plt.ylabel("Age (Years)")
plt.show()
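
A scatterplot shows the shape of a relationship but not its strength. One way to put a number on it is the Pearson correlation coefficient, sketched below on hypothetical values with the same positive-value filter as the plot. The choice of Pearson is an assumption for this example; it only captures linear relationships.

```python
import pandas as pd

# Hypothetical values; two cholesterol readings are 0 and will be filtered out.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 54, 61, 58],
    "Cholesterol": [289, 180, 0, 214, 260, 0],
})

# Same filter as the scatterplot: keep rows where both values are positive.
valid = toy[(toy["Cholesterol"] > 0) & (toy["Age"] > 0)]

# Series.corr() uses the Pearson method by default and returns a value in [-1, 1].
r = valid["Age"].corr(valid["Cholesterol"])
print(len(valid), round(r, 2))
```

Values near 0 indicate little linear association, while values near -1 or 1 indicate a strong one.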

## References
1. Benjamin, E. J., Virani, S. S., Callaway, C. W., Chamberlain, A. M., Chang, A. R., Cheng, S., Chiuve, S. E., Cushman, M., Delling, F. N., Deo, R., de Ferranti, S. D., Ferguson, J. F., Fornage, M., Gillespie, C., Isasi, C. R., Jiménez, M. C., Jordan, L. C., Judd, S. E., Lackland, D., Lichtman, J. H., … American Heart Association Council on Epidemiology and Prevention Statistics Committee and Stroke Statistics Subcommittee (2018). Heart Disease and Stroke Statistics-2018 Update: A Report From the American Heart Association. Circulation, 137(12), e67–e492. https://doi.org/10.1161/CIR.0000000000000558
2. Fedesoriano. (2021). Heart Failure Prediction Dataset. Retrieved [27/09/2023] from https://www.kaggle.com/fedesoriano/heart-failure-prediction.