# **Exploratory Data Analysis (EDA)**

## Objectives

This notebook uses **Exploratory Data Analysis (EDA)** to explore the dataset and understand its patterns and trends. The EDA process aims to:
- Conduct a descriptive analysis of the data
- Generate and refine features through feature engineering
- Test hypotheses using statistical tests
- Generate visualisations to display trends
- Gather insights from the data that answer the business requirements


## Prerequisites
- Python 3.12.8 is installed
- Required Python libraries from `requirements.txt` and their dependencies must be installed
- Setting up a Python virtual environment is optional
- Completed ETL step

## Inputs

- Cleaned dataset from ETL `cleaned_heart_data.csv`.

## Initial hypotheses
- Older individuals are more likely to develop cardiovascular disease (CVD)
- Males are more likely to develop CVD than females
- High blood pressure increases the likelihood of developing CVD
- Individuals with high cholesterol are more likely to have CVD
- Physically active individuals are less likely to have CVD
- Smokers are more likely to have CVD
- Alcohol consumption increases the likelihood of developing CVD

## Outputs

- Generated insights and tested hypotheses
- Data visualisations created using matplotlib, seaborn and plotly




---

# Change working directory

The working directory must be changed from its current folder to its parent folder
* The current directory can be accessed with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases\\jupyter_notebooks'

The parent of the current directory will be the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases'
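One caveat with the cells above is that re-running the `os.chdir` cell moves one level further up each time. The sketch below shows one possible way to make the step repeatable: walk upwards until a marker file is found. The helper name `find_project_root` and the choice of `requirements.txt` as the marker (assumed here to sit in the project root) are illustrative assumptions, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")


# Demonstration on a throwaway folder layout (hypothetical, mirrors the repo shape)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "requirements.txt").touch()
    notebooks = root / "jupyter_notebooks"
    notebooks.mkdir()

    found = find_project_root(notebooks)   # starting from the notebooks folder
    found_again = find_project_root(found)  # starting from the root itself
    print(found == root, found_again == root)
```

In the notebook this could be used as `os.chdir(find_project_root(os.getcwd()))`, which gives the same result no matter how many times the cell is run.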

---

# Initial Setup

### Import Libraries

Essential data analysis and visualisation libraries are imported.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

### Extract Dataset

Load the cleaned CSV file into a pandas DataFrame.

In [5]:
'''Read the cleaned data and keep an untouched copy of it'''
df = pd.read_csv('data/cleaned/cleaned_heart_data.csv') # Cleaned data directory
df_original = df.copy()
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


---

# EDA: Features Engineering and Descriptive Analysis

A basic summary of the DataFrame is generated using `.info()`. Although the categorical columns were converted to the category data type in the previous ETL notebooks, they are read back as integers. A CSV file stores only values, not data types, and because the integer codes in those columns were never replaced with labels, pandas infers an integer data type when the saved file is read.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68768 entries, 0 to 68767
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           68768 non-null  int64  
 1   age          68768 non-null  int64  
 2   gender       68768 non-null  int64  
 3   height       68768 non-null  int64  
 4   weight       68768 non-null  float64
 5   ap_hi        68768 non-null  int64  
 6   ap_lo        68768 non-null  int64  
 7   cholesterol  68768 non-null  int64  
 8   gluc         68768 non-null  int64  
 9   smoke        68768 non-null  int64  
 10  alco         68768 non-null  int64  
 11  active       68768 non-null  int64  
 12  cardio       68768 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.8 MB
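
The loss of the category data type on a CSV round trip can be reproduced on a small example. The sketch below uses a toy frame with made-up values (not the project data) and an in-memory buffer instead of a file. It also shows that `pd.read_csv` accepts a `dtype` mapping, which is one possible way to restore the type at load time.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned data (values are illustrative only)
toy = pd.DataFrame({"smoke": [0, 1, 0], "weight": [62.0, 85.0, 64.0]})
toy["smoke"] = toy["smoke"].astype("category")

# CSV stores only text, so the category dtype is not written to the file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the dtype is inferred from the values
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["smoke"].dtype)

# Passing dtype= asks pandas to build a categorical column while reading
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"smoke": "category"})
print(typed["smoke"].dtype)
```

Note that with `dtype="category"` the categories are parsed as strings (`'0'`, `'1'`), so replacing the codes with descriptive labels is still a separate step.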


The integer values of `gender`, `cholesterol`, `gluc`, `smoke`, `alco`, `active` and `cardio` are replaced with descriptive string labels. The columns are then converted to the category data type.

In [7]:
'''Replace integer values with descriptive names and convert to category data type'''
catago_cols = ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']

# Assign the result back rather than calling replace(inplace=True) on a
# selected column, which recent pandas versions warn about
df['gender'] = df['gender'].replace({1: 'Male', 2: 'Female'})
df['cholesterol'] = df['cholesterol'].replace({1: 'Normal', 2: 'High', 3: 'Very high'})
df['gluc'] = df['gluc'].replace({1: 'Normal', 2: 'High', 3: 'Very high'})
df['smoke'] = df['smoke'].replace({0: 'Non-smoker', 1: 'Smoker'})
df['alco'] = df['alco'].replace({0: 'Non-drinker', 1: 'Drinker'})
df['active'] = df['active'].replace({0: 'Inactive', 1: 'Active'})
df['cardio'] = df['cardio'].replace({0: 'No CVD', 1: 'CVD'})

# Convert the labelled columns to the category data type
for col in catago_cols:
    df[col] = df[col].astype('category')

df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,Female,168,62.0,110,80,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD
1,1,20228,Male,156,85.0,140,90,Very high,Normal,Non-smoker,Non-drinker,Active,CVD
2,2,18857,Male,165,64.0,130,70,Very high,Normal,Non-smoker,Non-drinker,Inactive,CVD
3,3,17623,Female,169,82.0,150,100,Normal,Normal,Non-smoker,Non-drinker,Active,CVD
4,4,17474,Male,156,56.0,100,60,Normal,Normal,Non-smoker,Non-drinker,Inactive,No CVD
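
`cholesterol` and `gluc` have a natural order (Normal, High, Very high) that a plain category column does not record. As an optional variation, the sketch below shows how an ordered categorical could be built with `pd.CategoricalDtype`. The toy values and the variable names `levels` and `labels` are illustrative; the notebook itself uses unordered categories.

```python
import pandas as pd

# Ordered levels, lowest to highest
levels = pd.CategoricalDtype(["Normal", "High", "Very high"], ordered=True)
labels = {1: "Normal", 2: "High", 3: "Very high"}

# Toy column of integer codes (illustrative values only)
toy = pd.DataFrame({"cholesterol": [1, 3, 2, 1]})
toy["cholesterol"] = toy["cholesterol"].map(labels).astype(levels)

# The category order is fixed regardless of how often each level occurs
print(toy["cholesterol"].cat.categories.tolist())

# Ordered categories support comparisons against a level
at_least_high = toy["cholesterol"] >= "High"
print(at_least_high.tolist())
```

With an ordered type, sorting and plotting follow the clinical order rather than alphabetical or frequency order.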


Confirm that the data types have been changed to category

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68768 entries, 0 to 68767
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   id           68768 non-null  int64   
 1   age          68768 non-null  int64   
 2   gender       68768 non-null  category
 3   height       68768 non-null  int64   
 4   weight       68768 non-null  float64 
 5   ap_hi        68768 non-null  int64   
 6   ap_lo        68768 non-null  int64   
 7   cholesterol  68768 non-null  category
 8   gluc         68768 non-null  category
 9   smoke        68768 non-null  category
 10  alco         68768 non-null  category
 11  active       68768 non-null  category
 12  cardio       68768 non-null  category
dtypes: category(7), float64(1), int64(5)
memory usage: 3.6 MB


In [9]:
'''Create new features: pulse pressure (pp), body mass index (bmi), age in years (age_years)'''
df['pp'] = df['ap_hi'] - df['ap_lo']
df['bmi'] = (df['weight'] / (df['height']/100)**2).round(2)
df["age_years"] = (df["age"] / 365.25).round().astype(int)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
0,0,18393,Female,168,62.0,110,80,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,30,21.97,50
1,1,20228,Male,156,85.0,140,90,Very high,Normal,Non-smoker,Non-drinker,Active,CVD,50,34.93,55
2,2,18857,Male,165,64.0,130,70,Very high,Normal,Non-smoker,Non-drinker,Inactive,CVD,60,23.51,52
3,3,17623,Female,169,82.0,150,100,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,50,28.71,48
4,4,17474,Male,156,56.0,100,60,Normal,Normal,Non-smoker,Non-drinker,Inactive,No CVD,40,23.01,48
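
As a sanity check on the three derived features, the same formulas can be applied by hand to the first record shown above (`ap_hi` 110, `ap_lo` 80, weight 62.0 kg, height 168 cm, age 18393 days). The helper name `engineer_row` is hypothetical and only restates the column arithmetic for a single row.

```python
def engineer_row(ap_hi, ap_lo, weight_kg, height_cm, age_days):
    """Recompute pp, bmi and age_years for a single record."""
    pp = ap_hi - ap_lo                                  # pulse pressure in mmHg
    bmi = round(weight_kg / (height_cm / 100) ** 2, 2)  # kg divided by metres squared
    age_years = round(age_days / 365.25)                # days to whole years
    return pp, bmi, age_years


first_row = engineer_row(110, 80, 62.0, 168, 18393)
print(first_row)  # (30, 21.97, 50)
```

The result matches the `pp`, `bmi` and `age_years` values of the first row in the table.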


The basic distribution of the numerical columns can be summarised using `.describe()`.

The descriptive statistics indicate the presence of outliers and errors:
- The minimum height of 55 cm and the minimum weight of 11 kg are implausible, considering that the youngest individual in the dataset is 29 years old (10,798 days).
- Pulse pressure (pp) cannot be negative, as systolic pressure cannot be lower than diastolic pressure. A typical range is 30–60 mmHg, but a pulse pressure above 100 mmHg is possible in individuals with medical conditions.
- The lowest BMI is 3.47 kg/m², which is very likely an error, as the lowest BMI ever recorded is 6.7 kg/m².
- The highest BMI is 298.67 kg/m², which is also very unlikely to be genuine.

In [10]:
'''Get a statistical summary of the numerical columns'''
df.select_dtypes(exclude='category').describe().round(2)

Unnamed: 0,id,age,height,weight,ap_hi,ap_lo,pp,bmi,age_years
count,68768.0,68768.0,68768.0,68768.0,68768.0,68768.0,68768.0,68768.0,68768.0
mean,49969.66,19464.34,164.36,74.12,126.61,81.35,45.26,27.52,53.29
std,28844.51,2468.21,8.18,14.33,16.75,9.57,12.08,6.05,6.76
min,0.0,10798.0,55.0,11.0,60.0,30.0,-60.0,3.47,30.0
25%,24996.75,17657.0,159.0,65.0,120.0,80.0,40.0,23.88,48.0
50%,50007.5,19701.0,165.0,72.0,120.0,80.0,40.0,26.35,54.0
75%,74858.25,21324.0,170.0,82.0,140.0,90.0,50.0,30.12,58.0
max,99999.0,23713.0,250.0,200.0,240.0,150.0,140.0,298.67,65.0


Rows with a negative pulse pressure are listed below.

In [11]:
df[df['pp'] < 0]

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
468,681,19099,Male,156,65.0,120,150,High,Normal,Non-smoker,Non-drinker,Active,No CVD,-30,26.71,52
627,913,20457,Female,169,68.0,70,110,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,-40,23.81,56
2341,3356,23361,Male,154,102.0,90,150,Normal,Normal,Non-smoker,Non-drinker,Inactive,CVD,-60,43.01,64
2930,4214,21957,Female,182,90.0,80,140,Very high,Very high,Non-smoker,Non-drinker,Active,CVD,-60,27.17,60
3382,4880,19992,Female,180,80.0,80,125,Very high,Very high,Smoker,Drinker,Active,CVD,-45,24.69,55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64604,93855,14375,Male,165,65.0,80,120,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,-40,23.88,39
65491,95164,19498,Female,160,81.0,80,120,Very high,Very high,Smoker,Drinker,Active,CVD,-40,31.64,53
66235,96271,23424,Male,153,74.0,80,130,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,-50,31.61,64
66284,96339,21193,Female,172,57.0,80,120,Normal,Normal,Smoker,Non-drinker,Active,CVD,-40,19.27,58
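
The plausibility checks described above can be combined into a single boolean mask. The sketch below is one possible approach on a toy frame with illustrative values. The helper name `flag_implausible` and the upper BMI bound of 100 are assumptions made for the example; the lower bound of 6.7 is the figure quoted in the text.

```python
import pandas as pd


def flag_implausible(frame, min_bmi=6.7, max_bmi=100.0):
    """Return a boolean Series marking rows that fail basic plausibility checks."""
    negative_pp = frame["pp"] < 0                             # systolic below diastolic
    bmi_out_of_range = ~frame["bmi"].between(min_bmi, max_bmi)  # bounds are inclusive
    return negative_pp | bmi_out_of_range


# Toy rows (illustrative values only): one valid, one negative pp, two extreme BMIs
toy = pd.DataFrame({
    "pp": [30, -40, 50, 40],
    "bmi": [21.97, 23.81, 298.67, 3.47],
})

mask = flag_implausible(toy)
print(mask.tolist())
print(f"{mask.sum()} of {len(toy)} rows flagged")
```

Keeping the thresholds as parameters makes it easy to see how many rows each choice would affect before deciding whether to drop or correct them.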


The dataset consists of:
- 44783 male entries (65.12%) and 23985 female entries (34.88%)
- 6053 smokers (8.80%) and 62715 non-smokers (91.20%)
- 3683 alcohol drinkers (5.36%) and 65085 non-drinkers (94.64%)

In [12]:
'''Get the distribution of categorical columns and put into a dataframe'''
stats = []
for col in df.select_dtypes(include='category').columns:
    counts = df[col].value_counts()
    # Percentages come back in the same order as counts (sorted by frequency)
    percents = df[col].value_counts(normalize=True) * 100
    stat = pd.DataFrame({
        'col': col,
        'value': counts.index,
        'count': counts.values,
        'percent': percents.round(2).values
    })
    stats.append(stat)

# Stack the per-column summaries into one table
catago_stats = pd.concat(stats).reset_index(drop=True)
catago_stats


Unnamed: 0,col,value,count,percent
0,gender,Male,44783,65.12
1,gender,Female,23985,34.88
2,cholesterol,Normal,51574,75.0
3,cholesterol,High,9312,13.54
4,cholesterol,Very high,7882,11.46
5,gluc,Normal,58469,85.02
6,gluc,Very high,5230,7.61
7,gluc,High,5069,7.37
8,smoke,Non-smoker,62715,91.2
9,smoke,Smoker,6053,8.8
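
The counts above describe each column on its own. To relate a risk factor to the outcome, as the initial hypotheses require, one possible next step is a row-normalised cross-tabulation with `pd.crosstab`. The sketch below uses a toy frame with made-up values, so its proportions say nothing about the real dataset.

```python
import pandas as pd

# Toy data (illustrative values only, not taken from the project dataset)
toy = pd.DataFrame({
    "smoke": ["Smoker", "Smoker", "Non-smoker", "Non-smoker", "Non-smoker", "Smoker"],
    "cardio": ["CVD", "No CVD", "No CVD", "No CVD", "CVD", "CVD"],
})

# normalize="index" turns each row into proportions that sum to 1,
# giving the share of CVD and No CVD within each smoking group
rates = pd.crosstab(toy["smoke"], toy["cardio"], normalize="index")
print(rates.round(2))
```

The same call could be applied to `df` with any of the categorical columns in place of `smoke`.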

