# **Project: Heart Disease Analysis**

## Goals: 
- Find correlations between attributes linked to the cause(s) of heart disease
- Explain why and how attributes correlate with heart disease
- Suggest ways to lower the risk of heart disease

## Dataset Information
Source: <a href="https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data">Kaggle Heart Failure Prediction</a>

Note: The attribute definitions come from the source URL and are summarized below:

### Categorical
- Nominal Attributes
    - Boolean
        - **DEATH_EVENT:** If the patient died during the follow-up period
        - **Smoking:** If the patient smokes
        - **High Blood Pressure:** If the patient has hypertension
        - **Diabetes:** If the patient has diabetes
        - **Anaemia:** Decrease of red blood cells or hemoglobin
    - Binary
        - **Sex:** Woman or man
### Numerical
- Ratio Attributes
    - **Age**
    - **Time:** Follow-up period (days)
    - **Serum Sodium:** Level of serum sodium in the blood (mEq/L)
    - **Serum Creatinine:** Level of serum creatinine in the blood (mg/dL)
    - **Creatinine Phosphokinase:** Level of the CPK enzyme in the blood (mcg/L)
    - **Platelets:** Platelets in the blood (kiloplatelets/mL)
    - **Ejection Fraction:** Percentage of blood leaving the heart at each contraction

***
## Section 1 - Setup
- Adding needed imports, helper functions, etc.
***

In [2]:
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

#!pip install matplotlib
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
#matplotlib.use('Qt5Agg')

import seaborn as sns

pd.set_option('display.max_columns', 50)  # Show up to 50 columns so wide tables are not truncated with '...'

### Loading the dataset

We load the data from 'heart_failure_clinical_records_dataset.csv' into `data_with_errors` and print its schema.

In [3]:

data_with_errors = pd.read_csv("heart_failure_clinical_records_dataset.csv")
data_with_errors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


***
# Section 2 - Data Understanding
***

***
## Section 2.1 - Group attribute types
Group the attributes into two categories:
- Categorical
- Numerical

We will use these groups for further analysis of the dataset based on each attribute's data type.
***

In [24]:
# Retrieve and display nominal attributes (at most two distinct values)
categoric_attributes = data_with_errors.columns[data_with_errors.nunique() <= 2]
print("Categorical attributes:", *categoric_attributes, sep="\n", end = "\n\n")

# Retrieve and display ratio attributes (more than two distinct values)
numeric_attributes = data_with_errors.columns[data_with_errors.nunique() > 2]
print("Numerical attributes:", *numeric_attributes, sep="\n")

Categorical attributes:
anaemia
diabetes
high_blood_pressure
sex
smoking
death_event

Numerical attributes:
age
creatinine_phosphokinase
ejection_fraction
platelets
serum_creatinine
serum_sodium
time
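
The split above relies on the number of distinct values per column. As a minimal sketch of that rule, the following uses a small hypothetical frame (`toy` and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical toy data, not taken from the heart failure dataset
toy = pd.DataFrame({
    "age": [55.0, 60.0, 72.0, 48.0],
    "smoking": [0, 1, 0, 1],
    "sex": [1, 1, 0, 0],
    "time": [10, 42, 130, 200],
})

# Number of distinct values in each column
distinct_counts = toy.nunique()

# Columns with at most two distinct values are treated as categorical
toy_categoric = toy.columns[distinct_counts <= 2]
# All remaining columns are treated as numerical
toy_numeric = toy.columns[distinct_counts > 2]

print(list(toy_categoric))
print(list(toy_numeric))
```

One caveat of this heuristic: a genuinely numerical column that happens to hold only one or two distinct values would also land in the categorical group, so the result is worth checking against the attribute definitions.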


***
## Section 2.2 - Provide basic statistics for attributes
***

### Profile Report

The following report will allow us to analyze each data attribute in the following ways:
- Count of distinct and missing values
- Quantile Statistics (Min, Max, Median, Quartiles)
- Descriptive Statistics (Std, CV, Mean, Variance)

We will use these analytics to better understand the given data and to find further ways to clean and standardize the dataset.

In [4]:
# Create a profiling report for the unclean dataset
profile_unclean = ProfileReport(df=data_with_errors, title="Unsanitary Data", minimal=True)  # Builds a minimal profiling report from 'data_with_errors'
profile_unclean




From the profile above we observe the following:

- All attributes seem to have no missing values (further analysis will confirm this)
- The report lists no attributes as categorical, since the binary attributes are stored as integers
- Only the nominal attributes contain zeros
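
The observation about zeros can be checked directly rather than read off the report. The sketch below counts zero entries per column on a small hypothetical frame (`toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, not taken from the heart failure dataset
toy = pd.DataFrame({
    "age": [55.0, 60.0, 72.0],
    "anaemia": [0, 1, 0],
    "diabetes": [0, 0, 1],
    "platelets": [265000.0, 162000.0, 210000.0],
})

# Count how many entries equal zero in each column
zero_counts = (toy == 0).sum()

# Keep only the columns that contain at least one zero
columns_with_zeros = zero_counts[zero_counts > 0].index.tolist()
print(columns_with_zeros)
```

Running the same two lines on `data_with_errors` instead of `toy` would show which of the real attributes contain zeros.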


### Describing Numerical Attributes

We will use the following table to display the basic statistics of the numerical attributes from our dataset.

In [25]:
data_with_errors[numeric_attributes].describe()

| | age | creatinine_phosphokinase | ejection_fraction | platelets | serum_creatinine | serum_sodium | time |
|---|---|---|---|---|---|---|---|
| count | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 |
| mean | 60.833893 | 581.839465 | 38.083612 | 263358.029264 | 1.39388 | 136.625418 | 130.26087 |
| std | 11.894809 | 970.287881 | 11.834841 | 97804.236869 | 1.03451 | 4.412477 | 77.614208 |
| min | 40.0 | 23.0 | 14.0 | 25100.0 | 0.5 | 113.0 | 4.0 |
| 25% | 51.0 | 116.5 | 30.0 | 212500.0 | 0.9 | 134.0 | 73.0 |
| 50% | 60.0 | 250.0 | 38.0 | 262000.0 | 1.1 | 137.0 | 115.0 |
| 75% | 70.0 | 582.0 | 45.0 | 303500.0 | 1.4 | 140.0 | 203.0 |
| max | 95.0 | 7861.0 | 80.0 | 850000.0 | 9.4 | 148.0 | 285.0 |


Note: These summary statistics alone are not enough to detect potential outliers. We must create a box plot to identify and pinpoint them. See [Section 2.4](#sec-24-verify-data-quality).
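
Box plots in matplotlib and seaborn draw their whiskers at 1.5 times the interquartile range (IQR) by default, and points beyond the whiskers are shown as potential outliers. As a minimal sketch of that rule, assuming the 1.5 × IQR convention is the one we want, the hypothetical helper `iqr_outliers` below returns the values a box plot would flag (the helper name and the sample values are illustrative assumptions):

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the values beyond either fence
    return values[(values < lower) | (values > upper)]

# Hypothetical sample values, not rows from the dataset
sample = pd.Series([0.9, 1.0, 1.1, 1.2, 1.4, 9.4])
print(iqr_outliers(sample))
```

The same arithmetic can be applied to the table above: serum_creatinine has Q1 = 0.9 and Q3 = 1.4, so the IQR is 0.5 and the upper fence is 1.4 + 1.5 × 0.5 = 2.15. The maximum of 9.4 lies well above that fence, which suggests this attribute is worth inspecting in a box plot.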



### Count duplicate rows and confirm no missing values

In [5]:
# Count duplicate rows
print("Number of duplicate rows:", data_with_errors.duplicated().sum(), end="\n\n")

# Check for missing values
print("Number of missing values:", "\n" + str(data_with_errors.isnull().sum()))


Number of duplicate rows: 0

Number of missing values: 
age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64


From the evaluation above we can confirm there are no missing values and no duplicate rows.

***
## Section 2.3 - Visualize / analyze the most important or interesting attributes
    
***

In [7]:
data_with_errors.columns = data_with_errors.columns.str.lower()

print("\n".join(data_with_errors.columns.values))

age
anaemia
creatinine_phosphokinase
diabetes
ejection_fraction
high_blood_pressure
platelets
serum_creatinine
serum_sodium
sex
smoking
time
death_event



***
<a id="sec-24-verify-data-quality"></a>
## Section 2.4 - Verify data quality

We will consider various methods of cleaning the dataset:
- Removing potential outliers
- Determining which attributes would be better represented with a different data type (e.g. int -> bool)
- Removing attributes that we will not consider when analyzing the data
***
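
Two of the cleaning steps listed above, converting 0/1 integer columns to booleans and dropping unused attributes, could look like the following sketch. The frame `toy` is hypothetical, and the choice to drop `time` is purely an illustrative assumption rather than a decision made in this analysis:

```python
import pandas as pd

# Hypothetical toy data, not taken from the heart failure dataset
toy = pd.DataFrame({
    "age": [55.0, 60.0, 72.0],
    "smoking": [0, 1, 0],
    "death_event": [1, 0, 0],
    "time": [10, 42, 130],
})

# Columns whose values are all 0 or 1 can be represented as booleans
flag_columns = [c for c in toy.columns if set(toy[c].unique()) <= {0, 1}]
cleaned = toy.astype({c: "bool" for c in flag_columns})

# Dropping "time" is an illustrative assumption only
cleaned = cleaned.drop(columns=["time"])

print(cleaned.dtypes)
```

On the real dataset this 0/1 test would also select `sex`, which is binary rather than boolean, so it may be preferable to exclude it from `flag_columns` or to map it to labels instead.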

***
## **Section 3 - Determine appropriate methods of finding correlation**
Methods of correlation to evaluate:
- Numerical vs Numerical
    - Pearson Correlation
    - Spearman Correlation
- Categorical vs Categorical
    - Cramér's V
- Categorical vs Numerical
    - Correlation Ratio
***
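
The four measures listed above can be sketched with pandas and NumPy alone. This is a minimal illustration under stated assumptions, not necessarily the implementation the analysis will use: `cramers_v` and `correlation_ratio` are hypothetical helpers written for this example, Cramér's V is computed without bias correction, both categorical inputs are assumed to have at least two categories, and `toy` is made-up data. Spearman correlation is computed as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction)."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two attributes were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta) of numerical values grouped by a category."""
    overall_mean = values.mean()
    groups = values.groupby(categories)
    # Variation of the group means around the overall mean, weighted by size
    between = (groups.size() * (groups.mean() - overall_mean) ** 2).sum()
    # Total variation of the values around the overall mean
    total = ((values - overall_mean) ** 2).sum()
    return float(np.sqrt(between / total))

# Hypothetical toy data, not taken from the heart failure dataset
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1, 8, 27, 64, 125, 216],  # y = x ** 3: monotonic but not linear
    "a": [0, 0, 0, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1],
})

pearson = toy["x"].corr(toy["y"])                  # Pearson is the default method
spearman = toy["x"].rank().corr(toy["y"].rank())   # Pearson correlation of ranks
v = cramers_v(toy["a"], toy["b"])
eta = correlation_ratio(toy["a"], toy["x"])

print(pearson, spearman, v, eta)
```

In this toy data `y` grows monotonically but not linearly with `x`, so the rank-based Spearman value is higher than the Pearson value, which is one reason to evaluate both for the numerical attributes.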