# Heart Failure Prediction Data
## Importing the dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('/kaggle/input/heart-failure-prediction/heart.csv')
df.info()

**The dataset contains 918 rows and 12 columns of mixed data types**

## Data Validation and Cleaning
### Missing Data Check and Removal

In [None]:
# Count null values per column
null_check = df.isnull().sum()

# Present the counts as a single-column table
missing_data_table = null_check.to_frame('Total Missing Data')

missing_data_table

**No null values are present within this dataset**
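
A null check only catches values that pandas recognises as missing. The scatterplot cell later in this notebook filters out values that are not greater than 0, which suggests some numeric columns may use 0 as a placeholder. The following is a minimal sketch of counting zeros per column alongside nulls. The `toy` frame and its values are hypothetical stand-ins, not rows from the real dataset, and treating 0 as a placeholder is an assumption.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; values are illustrative only
toy = pd.DataFrame({
    'Age': [40, 49, 37, 54],
    'RestingBP': [140, 160, 0, 150],
    'Cholesterol': [289, 0, 0, 195],
})

# isnull() reports nothing missing for this frame
null_counts = toy.isnull().sum()

# Counting zeros per column exposes possible placeholder values
zero_counts = (toy == 0).sum()

print(pd.concat([null_counts, zero_counts], axis=1, keys=['Nulls', 'Zeros']))
```

Running the same two counts on `df` would show whether any column needs this kind of follow-up before averages are interpreted.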

### Duplicated Data Check and Removal

In [None]:
# Count fully duplicated rows
duplicated_check = df.duplicated().sum()

print('Total duplicates present:', duplicated_check)

print('Rows before duplicate removal:', df.shape[0])

df.drop_duplicates(inplace=True)

print('Rows after duplicate removal:', df.shape[0])

**No duplicates were present in the dataset, so no rows were removed**

## Data Analysis

### Average Age, RestingBP, Cholesterol and MaxHR of patients

In [None]:
mean_values = round(df[['Age', 'RestingBP', 'Cholesterol', 'MaxHR']].mean(),1)

pd.concat([mean_values], axis = 1, keys = ['Average values'])

**Summary**
* The average age in this dataset is 53, suggesting that the sample consists primarily of older patients

* The resting blood pressure of a healthy individual is typically around 120-125 mm Hg, while the sample average is considerably higher at 132.4 mm Hg

### Distribution of Gender

In [None]:
# Count patients per sex; groupby sorts the keys, so 'F' comes before 'M'
grouped_sex = df.groupby('Sex').size()

glabels = ['Female', 'Male']
gexplode = [0, 0.05]

plt.pie(grouped_sex, autopct='%1.0f%%', explode=gexplode, labels=glabels)
plt.title('Male and Female Distribution')

plt.show()

### Most common chest pain type

In [None]:
# Count patients per chest pain type
grouped_chest_pain_type = df.groupby('ChestPainType').size()
pd.concat([grouped_chest_pain_type], axis = 1, keys = ['Count of each Chest Pain Type'])
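
The grouped table lists the types in alphabetical order, so the most common one still has to be picked out by eye. As a possible alternative, `value_counts()` sorts by frequency. This is a minimal sketch on a hypothetical toy column; the category codes and counts are illustrative and are not taken from the real data.

```python
import pandas as pd

# Hypothetical toy column; codes and counts are illustrative only
toy = pd.DataFrame({'ChestPainType': ['ASY', 'NAP', 'ASY', 'ATA', 'ASY', 'NAP']})

# value_counts() returns counts sorted from most to least frequent
counts = toy['ChestPainType'].value_counts()

# The label of the largest count is the most common type
most_common = counts.idxmax()

print(counts.to_frame('Number of patients'))
print('Most common chest pain type:', most_common)
```

Applied to the notebook's data, the same two calls would be `df['ChestPainType'].value_counts()` and `.idxmax()`.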

### Cholesterol Distribution

In [None]:
Cholesterol = df['Cholesterol']

# Histogram of cholesterol readings across all patients
plt.hist(Cholesterol, color='orange', bins=15)

plt.title("Cholesterol Distribution")
plt.ylabel("Number of patients")
plt.xlabel("Cholesterol (mg/dl)")
plt.show()

### Distribution of exercise-induced angina

In [None]:
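# Count patients per answer; groupby sorts the keys, so 'N' comes before 'Y'
x = df.groupby('ExerciseAngina').size()

# Labels must follow the sorted group order: 'N' first, then 'Y'
glabels = ['No', 'Yes']
gexplode = [0, 0.05]

plt.pie(x, autopct='%1.0f%%', explode=gexplode, labels=glabels)
plt.title("Distribution of exercise-induced angina")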

plt.show()
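
Hand-written pie labels only match the slices if they are listed in the same order as the grouped result, and `groupby` sorts its keys. One way to avoid a mismatch is to derive the labels from the group index. This is a minimal sketch on hypothetical toy data; `label_map` and the assumption that the column is coded as 'Y'/'N' are illustrative.

```python
import pandas as pd

# Hypothetical toy column; the 'Y'/'N' coding is an assumption
toy = pd.DataFrame({'ExerciseAngina': ['N', 'Y', 'N', 'N', 'Y']})

# groupby sorts the keys, so the index order is 'N', then 'Y'
counts = toy.groupby('ExerciseAngina').size()

# Build display labels from the index so they always follow the slice order
label_map = {'N': 'No', 'Y': 'Yes'}
labels = [label_map[key] for key in counts.index]

print(list(zip(labels, counts)))
# The pair can then be passed on, e.g. plt.pie(counts, labels=labels, autopct='%1.0f%%')
```

The same pattern would work for the Sex pie chart with a map from the sex codes to display names.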

### Scatterplot: Age and Cholesterol level

In [None]:
x = df['Cholesterol']
y = df['Age']

# Keep only rows where both values are greater than zero
mask = (x > 0) & (y > 0)
x_filtered = x[mask]
y_filtered = y[mask]

plt.scatter(x_filtered, y_filtered)
plt.title("Age and Cholesterol")
plt.xlabel("Cholesterol Level (mg/dl)")
plt.ylabel("Age (Years)")
plt.show()
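
A scatterplot shows the shape of a relationship but not its strength. A Pearson correlation coefficient over the same filtered rows could put a number on it. This is a minimal sketch on hypothetical toy values, not results from the real dataset; the filter mirrors the "greater than zero" rule used for the plot.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    'Age': [40, 49, 37, 54, 60],
    'Cholesterol': [200, 0, 180, 240, 260],
})

# Apply the same filter as the scatterplot: drop rows with non-positive values
valid = toy[(toy['Cholesterol'] > 0) & (toy['Age'] > 0)]

# Series.corr defaults to the Pearson correlation coefficient
r = valid['Age'].corr(valid['Cholesterol'])

print('Rows used:', len(valid))
print('Pearson r:', round(r, 2))
```

A value near 0 would indicate little linear relationship, while values near 1 or -1 would indicate a strong one.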