# **Exploratory Data Analysis (EDA)**

## Objectives

This notebook uses **Exploratory Data Analysis (EDA)** to explore and understand the dataset and its existing patterns and trends. The EDA process aims to:
- Conduct descriptive analysis of the data
- Generate and refine features through feature engineering
- Test hypotheses using statistical tests
- Generate visualisations to display trends
- Gather insights from the data that answer the business requirements


## Prerequisites
- Python 3.12.8 is installed
- Required Python libraries from `requirements.txt` and their dependencies are installed
- A Python virtual environment is set up (optional)
- The ETL step has been completed

## Inputs

- Cleaned dataset from ETL `cleaned_heart_data.csv`.

## Initial hypotheses
- Older individuals are more likely to develop cardiovascular disease (CVD)
- Males are more likely to develop CVD than females
- High blood pressure increases the likelihood of developing CVD
- Individuals with high cholesterol are more likely to have CVD
- Physically active individuals are less likely to have CVD
- Smokers are more likely to have CVD
- Alcohol consumption increases the likelihood of developing CVD

## Outputs

- Generated insights and tested hypotheses
- Data visualisations created using matplotlib, seaborn and plotly




---

# Change working directory

The working directory must be changed from its current folder to its parent folder
* The current directory can be accessed with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases\\jupyter_notebooks'

The parent of the current directory will be the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases'
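
Because the cells above move up one level from whatever the current directory is, running them a second time moves the working directory one level too far. The following is a minimal sketch of a re-runnable alternative. It assumes the project root is the folder that contains `requirements.txt` (its location is an assumption), and `find_project_root` is a hypothetical helper, not part of this notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "requirements.txt") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found.

    `marker` is an assumed file name; any file that only exists in the
    project root would work.
    """
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")
```

With this helper, `os.chdir(find_project_root(Path.cwd()))` gives the same result whether it is run from `jupyter_notebooks` or from the project root itself.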

---

# Initial Setup

### Import Libraries

Essential data analysis and visualisation libraries are imported.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

### Extract Dataset

Load the cleaned CSV file into a pandas DataFrame.

In [5]:
'''Read the cleaned data and keep a copy of it as loaded'''
df = pd.read_csv('data/cleaned/cleaned_heart_data.csv')  # Cleaned data directory
df_original = df.copy()
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


---

# EDA: Feature Engineering and Descriptive Analysis

## Data type conversion
A basic summary of the DataFrame is generated using `.info()`. Although the categorical columns were converted to the category data type in the previous ETL notebooks, they are integers again here. This is because the actual values in those columns were not replaced, and a CSV file does not store pandas data types, so the columns are inferred as integers when the saved file is read.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68768 entries, 0 to 68767
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           68768 non-null  int64  
 1   age          68768 non-null  int64  
 2   gender       68768 non-null  int64  
 3   height       68768 non-null  int64  
 4   weight       68768 non-null  float64
 5   ap_hi        68768 non-null  int64  
 6   ap_lo        68768 non-null  int64  
 7   cholesterol  68768 non-null  int64  
 8   gluc         68768 non-null  int64  
 9   smoke        68768 non-null  int64  
 10  alco         68768 non-null  int64  
 11  active       68768 non-null  int64  
 12  cardio       68768 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.8 MB
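
The loss of the category data type can be reproduced on a small example. This sketch uses a made-up two-column frame and an in-memory buffer instead of the project files. It also shows one possible way to restore the data types at load time through the `dtype` argument of `pd.read_csv`, which this notebook does not use.

```python
import io

import pandas as pd

# Toy data standing in for two of the categorical columns
toy = pd.DataFrame({
    "smoke": pd.Categorical(["Smoker", "Non-smoker", "Smoker"]),
    "cardio": pd.Categorical([1, 0, 1]),
})

# Round trip through CSV text: only the values are written, not the dtypes
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(toy.dtypes)
print(reloaded.dtypes)

# Optionally request the category dtype again while reading
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"smoke": "category", "cardio": "category"})
print(restored.dtypes)
```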


The integer values of `gender`, `cholesterol`, `gluc`, `smoke`, `alco`, `active` and `cardio` are replaced with descriptive string labels. The columns are then converted to the category data type.

In [7]:
'''Replace integer values with descriptive names and convert to category data type'''
catago_cols = ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']

# Assign the result back instead of calling replace(..., inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply to df
df['gender'] = df['gender'].replace({1: 'Male', 2: 'Female'})
df['cholesterol'] = df['cholesterol'].replace({1: 'Normal', 2: 'High', 3: 'Very high'})
df['gluc'] = df['gluc'].replace({1: 'Normal', 2: 'High', 3: 'Very high'})
df['smoke'] = df['smoke'].replace({0: 'Non-smoker', 1: 'Smoker'})
df['alco'] = df['alco'].replace({0: 'Non-drinker', 1: 'Drinker'})
df['active'] = df['active'].replace({0: 'Inactive', 1: 'Active'})
df['cardio'] = df['cardio'].replace({0: 'No CVD', 1: 'CVD'})

for col in catago_cols:
    df[col] = df[col].astype('category')

df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,Female,168,62.0,110,80,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD
1,1,20228,Male,156,85.0,140,90,Very high,Normal,Non-smoker,Non-drinker,Active,CVD
2,2,18857,Male,165,64.0,130,70,Very high,Normal,Non-smoker,Non-drinker,Inactive,CVD
3,3,17623,Female,169,82.0,150,100,Normal,Normal,Non-smoker,Non-drinker,Active,CVD
4,4,17474,Male,156,56.0,100,60,Normal,Normal,Non-smoker,Non-drinker,Inactive,No CVD


Confirm that the data types have been changed to category

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68768 entries, 0 to 68767
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   id           68768 non-null  int64   
 1   age          68768 non-null  int64   
 2   gender       68768 non-null  category
 3   height       68768 non-null  int64   
 4   weight       68768 non-null  float64 
 5   ap_hi        68768 non-null  int64   
 6   ap_lo        68768 non-null  int64   
 7   cholesterol  68768 non-null  category
 8   gluc         68768 non-null  category
 9   smoke        68768 non-null  category
 10  alco         68768 non-null  category
 11  active       68768 non-null  category
 12  cardio       68768 non-null  category
dtypes: category(7), float64(1), int64(5)
memory usage: 3.6 MB
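
The conversion above produces unordered categories. For columns with a natural order, such as `cholesterol` and `gluc`, an ordered categorical is one option that keeps the levels sorted as Normal < High < Very high and allows comparisons. This is a sketch on made-up codes, not a step this notebook performs. The names `levels`, `codes` and `labels` are illustrative.

```python
import pandas as pd

# Ordered dtype: the list order defines Normal < High < Very high
levels = pd.CategoricalDtype(["Normal", "High", "Very high"], ordered=True)

# Made-up integer codes in the same 1/2/3 encoding as the cleaned data
codes = pd.Series([1, 3, 2, 1], name="cholesterol")
labels = codes.map({1: "Normal", 2: "High", 3: "Very high"}).astype(levels)

# Sorting follows the declared order rather than alphabetical order
print(labels.sort_values().tolist())

# Ordered categories support comparisons against a level
print((labels >= "High").sum())
```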


## Feature creation

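Three features are created: pulse pressure (`pp`, systolic minus diastolic pressure), body mass index (`bmi`, weight in kg divided by the square of height in metres) and age in years (`age_years`, age in days divided by 365.25 and rounded to the nearest year).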

In [9]:
'''Create new features: pulse pressure (pp), body mass index (bmi), age in years (age_years)'''
df['pp'] = df['ap_hi'] - df['ap_lo']
df['bmi'] = (df['weight'] / (df['height']/100)**2).round(2)
df["age_years"] = (df["age"] / 365.25).round().astype(int)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
0,0,18393,Female,168,62.0,110,80,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,30,21.97,50
1,1,20228,Male,156,85.0,140,90,Very high,Normal,Non-smoker,Non-drinker,Active,CVD,50,34.93,55
2,2,18857,Male,165,64.0,130,70,Very high,Normal,Non-smoker,Non-drinker,Inactive,CVD,60,23.51,52
3,3,17623,Female,169,82.0,150,100,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,50,28.71,48
4,4,17474,Male,156,56.0,100,60,Normal,Normal,Non-smoker,Non-drinker,Inactive,No CVD,40,23.01,48


Updated Features Description

|Name of feature|Description|Data type|
| ----------- | ----------- | ----------- |
|`id`|Unique identifier assigned to each person|Integer|
|`age`|Age of the person in days|Integer|
|`gender`|Gender of the person|Category|
|`height`|Height of the person in cm|Integer|
|`weight`|Weight of the person in kg|Float|
|`ap_hi`|Systolic blood pressure reading|Integer|
|`ap_lo`|Diastolic blood pressure reading|Integer|
|`cholesterol`|Cholesterol level|Category|
|`gluc`|Glucose level|Category|
|`smoke`|Smoking status|Category|
|`alco`|Alcohol status|Category|
|`active`|Physical activity status|Category|
|`cardio`|Presence of cardiovascular disease|Category|
|`pp`|Pulse pressure|Integer|
|`bmi`|Body Mass Index in kg/m²|Float|
|`age_years`|Age of the person in years|Integer|
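
As a quick sanity check on the three derived features, the same formulas can be applied by hand to a single record. The sketch below uses a hypothetical helper, `derive_features`, and the raw values from the first row of the dataset.

```python
def derive_features(ap_hi, ap_lo, weight_kg, height_cm, age_days):
    """Return (pulse pressure, BMI, age in years) using the formulas from the feature-creation cell."""
    pp = ap_hi - ap_lo                                   # systolic minus diastolic
    bmi = round(weight_kg / (height_cm / 100) ** 2, 2)   # kg divided by metres squared
    age_years = round(age_days / 365.25)                 # days to whole years
    return pp, bmi, age_years


# First row of the dataset: ap_hi=110, ap_lo=80, weight=62.0, height=168, age=18393
print(derive_features(110, 80, 62.0, 168, 18393))
```

The result agrees with the `pp`, `bmi` and `age_years` values shown for that row (30, 21.97 and 50).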


## Identifying and handling outliers
The basic distribution of the numerical columns can be summarised using `.describe()`. The dataset covers adults aged between 30 and 65 years.

The descriptive statistics indicate the presence of outliers and errors:
- The minimum height of 55 cm and the minimum weight of 11 kg are implausible, considering that the youngest individual in the dataset is 29 years old (10,798 days, which rounds to 30 years).
- Pulse pressure (`pp`) cannot be negative, as systolic pressure cannot be lower than diastolic pressure. A typical range is 30–60 mmHg, but a pulse pressure higher than 100 mmHg is possible in individuals with medical conditions.
- The lowest BMI is 3.47 kg/m², which is very likely an error, as the lowest BMI ever recorded is 6.7 kg/m².
- The highest BMI is 298.67 kg/m², which also seems unlikely.

In [None]:
'''Get a statistical summary of the numerical columns'''
def stat_summary(dataframe):
    desc = dataframe.describe().T
    desc['range'] = desc['max'] - desc['min']
    desc['var'] = dataframe.var()
    desc['skew'] = dataframe.skew()
    desc['kurtosis'] = dataframe.kurtosis()
    cols = ['min', 'max', 'range', 'mean', '50%', 'std', 'var', 'skew', 'kurtosis']
    return desc[cols].round(2)
stat_summary(df.select_dtypes(exclude='category'))

Unnamed: 0,min,max,range,mean,50%,std,var,skew,kurtosis
id,0.0,99999.0,99999.0,49969.66,50007.5,28844.51,832005600.0,-0.0,-1.2
age,10798.0,23713.0,12915.0,19464.34,19701.0,2468.21,6092059.0,-0.31,-0.83
height,55.0,250.0,195.0,164.36,165.0,8.18,66.99,-0.61,7.59
weight,11.0,200.0,189.0,74.12,72.0,14.33,205.4,1.01,2.56
ap_hi,60.0,240.0,180.0,126.61,120.0,16.75,280.45,0.9,1.83
ap_lo,30.0,150.0,120.0,81.35,80.0,9.57,91.54,0.43,2.24
pp,-60.0,140.0,200.0,45.26,40.0,12.08,145.94,0.73,6.49
bmi,3.47,298.67,295.2,27.52,26.35,6.05,36.61,7.79,227.24
age_years,30.0,65.0,35.0,53.29,54.0,6.76,45.74,-0.31,-0.82


There are 86 entries with negative pulse pressure values. Since these values are impossible, the entries will be removed from the dataset.

In [11]:
df[df['pp'] < 0]

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
468,681,19099,Male,156,65.0,120,150,High,Normal,Non-smoker,Non-drinker,Active,No CVD,-30,26.71,52
627,913,20457,Female,169,68.0,70,110,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,-40,23.81,56
2341,3356,23361,Male,154,102.0,90,150,Normal,Normal,Non-smoker,Non-drinker,Inactive,CVD,-60,43.01,64
2930,4214,21957,Female,182,90.0,80,140,Very high,Very high,Non-smoker,Non-drinker,Active,CVD,-60,27.17,60
3382,4880,19992,Female,180,80.0,80,125,Very high,Very high,Smoker,Drinker,Active,CVD,-45,24.69,55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64604,93855,14375,Male,165,65.0,80,120,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,-40,23.88,39
65491,95164,19498,Female,160,81.0,80,120,Very high,Very high,Smoker,Drinker,Active,CVD,-40,31.64,53
66235,96271,23424,Male,153,74.0,80,130,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,-50,31.61,64
66284,96339,21193,Female,172,57.0,80,120,Normal,Normal,Smoker,Non-drinker,Active,CVD,-40,19.27,58


A pulse pressure of 0 is impossible, and any pulse pressure below 10 mmHg is implausible. A pulse pressure of less than 25 mmHg is already considered dangerously low and requires medical evaluation.

In [12]:
df[df['pp']<10]

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
468,681,19099,Male,156,65.0,120,150,High,Normal,Non-smoker,Non-drinker,Active,No CVD,-30,26.71,52
627,913,20457,Female,169,68.0,70,110,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,-40,23.81,56
2341,3356,23361,Male,154,102.0,90,150,Normal,Normal,Non-smoker,Non-drinker,Inactive,CVD,-60,43.01,64
2930,4214,21957,Female,182,90.0,80,140,Very high,Very high,Non-smoker,Non-drinker,Active,CVD,-60,27.17,60
3382,4880,19992,Female,180,80.0,80,125,Very high,Very high,Smoker,Drinker,Active,CVD,-45,24.69,55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64604,93855,14375,Male,165,65.0,80,120,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,-40,23.88,39
65491,95164,19498,Female,160,81.0,80,120,Very high,Very high,Smoker,Drinker,Active,CVD,-40,31.64,53
66235,96271,23424,Male,153,74.0,80,130,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,-50,31.61,64
66284,96339,21193,Female,172,57.0,80,120,Normal,Normal,Smoker,Non-drinker,Active,CVD,-40,19.27,58


In [13]:
'''Removing entries with negative pulse pressure values and implausibly low pulse pressure values'''
df = df[df['pp'] >= 10]
df.shape

(68678, 16)
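
When rows are filtered out, it can help to record how many were dropped and why. The sketch below shows one way to do that on made-up blood pressure readings. `drop_rows` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import pandas as pd


def drop_rows(frame: pd.DataFrame, bad: pd.Series, reason: str) -> pd.DataFrame:
    """Return a copy of `frame` without the rows flagged in `bad`, printing how many were dropped."""
    kept = frame.loc[~bad].copy()
    print(f"{reason}: dropped {len(frame) - len(kept)} of {len(frame)} rows")
    return kept


# Made-up readings: the second and third rows have implausible pulse pressure
toy = pd.DataFrame({"ap_hi": [120, 70, 110, 125], "ap_lo": [80, 110, 105, 80]})
toy["pp"] = toy["ap_hi"] - toy["ap_lo"]
toy = drop_rows(toy, toy["pp"] < 10, "pulse pressure below 10 mmHg")
```

Returning a copy also means that later column assignments act on an independent DataFrame rather than on a slice of the original.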

The extremely low BMI values are likely due to incorrect weight measurements. It is unrealistic for adults to weigh less than 30 kg, especially considering the heights recorded for these entries.

In [14]:
df[df['bmi']<15]

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
5259,7634,16755,Male,167,41.0,110,80,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,30,14.7,46
6365,9223,21220,Male,250,86.0,140,100,Very high,Normal,Non-smoker,Non-drinker,Active,CVD,40,13.76,58
9308,13518,20958,Female,172,40.0,140,90,Normal,High,Non-smoker,Non-drinker,Active,CVD,50,13.52,57
10249,14908,22007,Male,162,38.0,100,70,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,30,14.48,60
12694,18415,15574,Female,174,45.0,130,100,High,High,Non-smoker,Non-drinker,Inactive,CVD,30,14.86,43
15942,23181,19621,Male,196,56.0,125,80,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,45,14.58,54
16038,23318,21872,Male,165,35.0,100,70,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,30,12.86,60
16616,24167,17272,Female,170,31.0,150,90,High,High,Non-smoker,Non-drinker,Active,CVD,60,10.73,47
16676,24244,21860,Male,165,40.0,90,60,High,Normal,Non-smoker,Non-drinker,Active,CVD,30,14.69,60
18232,26503,18140,Male,160,30.0,120,80,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,40,11.72,50


A BMI above 40 kg/m² is classified as [class III (severe) obesity](https://www.nhs.uk/conditions/obesity/). Values exceeding 60 kg/m² are more likely the result of data entry errors rather than genuine outliers. Upon inspecting these abnormally high BMI values, many entries have heights below 100 cm, which is highly unlikely for adults, even for individuals with restricted growth conditions. This further supports the idea that the high BMI values are likely caused by incorrect height or weight measurements and should be removed.

In [15]:
'''Inspecting extremely high BMI values'''
df[df['bmi']>60].sort_values(by='bmi')

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
38846,56496,14606,Male,108,70.0,140,90,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,50,60.01,40
63520,92301,21705,Male,169,172.0,120,70,High,Normal,Non-smoker,Non-drinker,Active,No CVD,50,60.22,59
14418,20970,21135,Male,159,153.0,120,80,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,40,60.52,58
41704,60631,19450,Male,160,155.0,120,80,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,40,60.55,53
62817,91284,21115,Male,164,164.0,120,80,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,40,60.98,58
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28639,41661,19088,Male,60,69.0,110,70,Normal,Normal,Non-smoker,Non-drinker,Inactive,No CVD,40,191.67,52
23492,34186,19074,Male,81,156.0,140,90,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,50,237.77,52
22321,32456,23386,Male,55,81.0,130,90,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,40,267.77,64
26900,39156,15292,Male,80,178.0,140,90,Very high,Very high,Non-smoker,Non-drinker,Active,CVD,50,278.12,42


BMI values were restricted to the range 15–60 kg/m² to exclude outliers that could distort the analysis. Values outside this range are unrealistic, considering that the individuals in the dataset are between 30 and 65 years old.

In [16]:
'''Removing BMI outliers'''
df = df[(df['bmi'] <= 60) & (df['bmi'] >= 15)]
df

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
0,0,18393,Female,168,62.0,110,80,Normal,Normal,Non-smoker,Non-drinker,Active,No CVD,30,21.97,50
1,1,20228,Male,156,85.0,140,90,Very high,Normal,Non-smoker,Non-drinker,Active,CVD,50,34.93,55
2,2,18857,Male,165,64.0,130,70,Very high,Normal,Non-smoker,Non-drinker,Inactive,CVD,60,23.51,52
3,3,17623,Female,169,82.0,150,100,Normal,Normal,Non-smoker,Non-drinker,Active,CVD,50,28.71,48
4,4,17474,Male,156,56.0,100,60,Normal,Normal,Non-smoker,Non-drinker,Inactive,No CVD,40,23.01,48
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68763,99993,19240,Female,168,76.0,120,80,Normal,Normal,Smoker,Non-drinker,Active,No CVD,40,26.93,53
68764,99995,22601,Male,158,126.0,140,90,High,High,Non-smoker,Non-drinker,Active,CVD,50,50.47,62
68765,99996,19066,Female,183,105.0,180,90,Very high,Normal,Non-smoker,Drinker,Inactive,CVD,90,31.35,52
68766,99998,22431,Male,163,72.0,135,80,Normal,High,Non-smoker,Non-drinker,Inactive,CVD,55,27.10,61


Whilst heights below 130 cm seem unlikely in adults, they are plausible for those with [restricted growth conditions](https://www.nhs.uk/conditions/restricted-growth/). Therefore, these entries will be kept.

In [17]:
'''Inspecting extremely low height values'''
(df[df['height']<130]
 .select_dtypes(exclude='category')
 .describe()
 .round(2)
 )

Unnamed: 0,id,age,height,weight,ap_hi,ap_lo,pp,bmi,age_years
count,46.0,46.0,46.0,46.0,46.0,46.0,46.0,46.0,46.0
mean,57050.67,18853.7,118.24,72.7,120.65,77.39,43.26,52.27,51.54
std,29288.79,2607.13,5.13,11.98,9.04,7.13,6.34,8.95,7.11
min,5278.0,14445.0,100.0,28.0,100.0,60.0,40.0,17.09,40.0
25%,33591.5,17071.75,120.0,70.0,120.0,70.0,40.0,54.52,46.5
50%,64626.0,18976.0,120.0,80.0,120.0,80.0,40.0,55.56,52.0
75%,82360.75,20693.75,120.0,80.0,120.0,80.0,40.0,55.56,56.75
max,98630.0,23422.0,128.0,80.0,150.0,90.0,60.0,60.0,64.0


In [None]:
'''Statistical summary of numerical values after outlier removal'''
stat_summary(df.select_dtypes(exclude='category'))

Unnamed: 0,min,max,range,mean,50%,std,var,skew,kurtosis
id,0.0,99999.0,99999.0,49978.88,50022.0,28845.3,832051500.0,-0.0,-1.2
age,10798.0,23713.0,12915.0,19464.16,19701.0,2468.14,6091728.0,-0.31,-0.83
height,100.0,207.0,107.0,164.41,165.0,7.93,62.92,-0.02,0.87
weight,28.0,200.0,172.0,74.1,72.0,14.2,201.72,0.94,2.04
ap_hi,60.0,240.0,180.0,126.67,120.0,16.68,278.07,0.93,1.83
ap_lo,30.0,150.0,120.0,81.3,80.0,9.43,88.99,0.3,1.64
pp,10.0,140.0,130.0,45.37,40.0,11.67,136.25,1.32,3.58
bmi,15.01,60.0,44.99,27.45,26.35,5.21,27.12,1.2,2.44
age_years,30.0,65.0,35.0,53.29,54.0,6.76,45.73,-0.31,-0.82


The dataset consists of the following (the counts below sum to 68,768, the number of rows before outlier removal):
- 44783 male entries (65.12%) and 23985 female entries (34.88%)
- 6053 smokers (8.80%) and 62715 non-smokers (91.20%)
- 3683 alcohol drinkers (5.36%) and 65085 non-drinkers (94.64%)

In [12]:
'''Get the distribution of categorical columns and put into a dataframe'''
stats = []
for col in df.select_dtypes(include='category').columns:
    counts = df[col].value_counts()
    # Derive percentages from counts so both always share the same order
    percents = (counts / counts.sum() * 100).round(2)
    stat = pd.DataFrame({
        'col': col,
        'value': counts.index,
        'count': counts.values,
        'percent': percents.values
        })
    stats.append(stat)


catago_stats = pd.concat(stats).reset_index(drop=True)
catago_stats


Unnamed: 0,col,value,count,percent
0,gender,Male,44783,65.12
1,gender,Female,23985,34.88
2,cholesterol,Normal,51574,75.0
3,cholesterol,High,9312,13.54
4,cholesterol,Very high,7882,11.46
5,gluc,Normal,58469,85.02
6,gluc,Very high,5230,7.61
7,gluc,High,5069,7.37
8,smoke,Non-smoker,62715,91.2
9,smoke,Smoker,6053,8.8


---

