# **Exploratory Data Analysis (EDA)**

## Objectives

This notebook uses **Exploratory Data Analysis (EDA)** to explore and understand the dataset and its patterns and trends. The EDA process aims to:
- Conduct descriptive analysis of the data
- Generate and refine features through feature engineering
- Test hypotheses using statistical tests
- Generate visualisations to display trends
- Gather insights from the data that answer the business requirements


## Prerequisites
- Python 3.12.8 is installed
- Required Python libraries from `requirements.txt` and their dependencies are installed
- Optionally, a Python virtual environment is set up
- The ETL step is completed

## Inputs

- Cleaned dataset from ETL `cleaned_heart_data.csv`.

## Initial hypotheses
- Older individuals are more likely to develop cardiovascular disease (CVD)
- Males are more likely to develop CVD than females
- High blood pressure increases the likelihood of developing CVD
- Individuals with high cholesterol are more likely to have CVD
- Physically active individuals are less likely to have CVD
- Smokers are more likely to have CVD
- Alcohol consumption increases the likelihood of developing CVD
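
Each hypothesis above that pairs a binary factor with the binary `cardio` outcome could be checked with a chi-square test of independence. The sketch below is a minimal illustration under stated assumptions: the toy data and the helper name `chi_square_2x2` are hypothetical, no continuity correction is applied, and the p-value formula only holds for a 2x2 table (one degree of freedom).

```python
import math

import pandas as pd

# Hypothetical toy data; the notebook would pass df with its real columns.
toy = pd.DataFrame({
    "smoke":  [0] * 60 + [1] * 40,
    "cardio": [0] * 40 + [1] * 20 + [0] * 10 + [1] * 30,
})


def chi_square_2x2(data, factor, outcome):
    """Pearson chi-square test of independence for two binary columns."""
    observed = pd.crosstab(data[factor], data[outcome]).to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if the factor and the outcome were independent
    expected = row_totals @ col_totals / observed.sum()
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi_square_2x2(toy, "smoke", "cardio")
print(f"chi2 = {chi2:.2f}, p = {p_value:.6f}")
```

A small p-value would suggest that the factor and the outcome are not independent; it says nothing about the direction or the cause of the association.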

## Outputs

- Generated insights and tested hypotheses
- Data visualisations created using matplotlib, seaborn and plotly



---

# Change working directory

The working directory must be changed from its current folder to its parent folder
* The current directory can be accessed with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases\\jupyter_notebooks'

The parent of the current directory becomes the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases'
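
Running the `os.chdir()` cell twice moves the working directory up two levels. One way to make the step repeatable is to look for a marker file instead of always taking the parent. The sketch below assumes `requirements.txt` sits in the project root (the Prerequisites mention the file but not its location), and the helper name `find_project_root` is hypothetical.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "requirements.txt") -> Path:
    """Return the closest folder at or above `start` that contains `marker`."""
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No folder containing {marker!r} found above {start}")
```

In the notebook this could be used as `os.chdir(find_project_root(Path.cwd()))`, which gives the same result whether the cell is run once or several times.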

---

# Initial Setup

### Import Libraries

Essential data analysis and visualisation libraries are imported.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

### Extract Dataset

Load the cleaned CSV file into a pandas DataFrame.

In [5]:
df = pd.read_csv('data/cleaned/cleaned_heart_data.csv') # Cleaned data directory
df.head()

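| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |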
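
A quick sanity check right after loading can confirm that the ETL output has the expected shape before any analysis depends on it. The sketch below is an assumption about what such a check might look like: the helper name `load_report` is hypothetical, the column list is taken from the `df.head()` output above, and the two toy rows are copied from that output so the example runs without the CSV file.

```python
import io

import pandas as pd

# Column names as they appear in the df.head() output of this notebook
EXPECTED_COLUMNS = [
    "id", "age", "gender", "height", "weight", "ap_hi", "ap_lo",
    "cholesterol", "gluc", "smoke", "alco", "active", "cardio",
]


def load_report(frame, expected):
    """Summarise basic integrity problems in a freshly loaded DataFrame."""
    return {
        "missing_columns": [c for c in expected if c not in frame.columns],
        "null_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy stand-in for the real file: two rows copied from the output above
toy_csv = io.StringIO(
    "id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio\n"
    "0,18393,2,168,62.0,110,80,1,1,0,0,1,0\n"
    "1,20228,1,156,85.0,140,90,3,1,0,0,1,1\n"
)
toy = pd.read_csv(toy_csv)
report = load_report(toy, EXPECTED_COLUMNS)
print(report)
```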


---

# EDA: Descriptive Analysis

## Feature Description

The basic distribution of numerical values can be found using `.describe()`. This is applied to the columns `age`, `height`, `weight`, `ap_hi` and `ap_lo`. Although other columns also have numerical dtypes, they actually encode categorical values, so `.describe()` would not be useful for them.

In [14]:
numerical_cols = ['age', 'height', 'weight', 'ap_hi', 'ap_lo',]
df[numerical_cols].describe().round(2)

| | age | height | weight | ap_hi | ap_lo |
|---|---|---|---|---|---|
| count | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 |
| mean | 19468.87 | 164.36 | 74.21 | 128.82 | 96.63 |
| std | 2467.25 | 8.21 | 14.4 | 154.01 | 188.47 |
| min | 10798.0 | 55.0 | 10.0 | -150.0 | -70.0 |
| 25% | 17664.0 | 159.0 | 65.0 | 120.0 | 80.0 |
| 50% | 19703.0 | 165.0 | 72.0 | 120.0 | 80.0 |
| 75% | 21327.0 | 170.0 | 82.0 | 140.0 | 90.0 |
| max | 23713.0 | 250.0 | 200.0 | 16020.0 | 11000.0 |


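From the statistics generated, `age` is recorded in days rather than years, and `ap_hi` and `ap_lo` still contain implausible readings: negative minimums (-150 and -70) and maximums of 16020 and 11000. These extremes inflate the standard deviations of both columns and point to recording errors that should be handled before the blood pressure hypothesis is tested.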
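
The `.describe()` output shows negative minimums and five-digit maximums for `ap_hi` and `ap_lo`. One way to quantify such readings is to count the values outside a plausible range. The sketch below uses hypothetical toy rows, and the bounds in `BOUNDS` are illustrative assumptions rather than clinical thresholds; they would need to be agreed on before being applied to the real data.

```python
import pandas as pd

# Hypothetical toy rows mixing plausible and implausible readings
toy = pd.DataFrame({
    "ap_hi": [110, -150, 16020, 140],
    "ap_lo": [80, 70, 90, 11000],
})

# Illustrative bounds in mmHg (assumptions, not clinical guidance)
BOUNDS = {"ap_hi": (50, 250), "ap_lo": (30, 200)}


def count_out_of_range(frame, bounds):
    """Count the values in each column that fall outside [low, high]."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in bounds.items()
    }


flags = count_out_of_range(toy, BOUNDS)
print(flags)
```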

In [7]:
print(f"{df['gender'].value_counts()} \n \n {df['gender'].value_counts(normalize=True)}")

gender
1    45530
2    24470
Name: count, dtype: int64 
 
 gender
1    0.650429
2    0.349571
Name: proportion, dtype: float64


In [None]:
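catag_cols = ["smoke", "alco", "active", "cardio"]  # Binary (0/1) categorical columns
summary = {}

for col in catag_cols:
    counts = df[col].value_counts()  # Absolute frequency of each value
    perc = df[col].value_counts(normalize=True) * 100  # Relative frequency in percent
    summary[col] = pd.DataFrame({
        "Count": counts,
        "Percentage (%)": perc.round(2)
    })

# Combine the per-column summaries side by side under a two-level column header
summary_df = pd.concat(summary, axis=1)
summary_df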


| | smoke: Count | smoke: Percentage (%) | alco: Count | alco: Percentage (%) | active: Count | active: Percentage (%) | cardio: Count | cardio: Percentage (%) |
|---|---|---|---|---|---|---|---|---|
| 0 | 63831 | 91.19 | 66236 | 94.62 | 13739 | 19.63 | 35021 | 50.03 |
| 1 | 6169 | 8.81 | 3764 | 5.38 | 56261 | 80.37 | 34979 | 49.97 |
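
The counts above describe each column on its own. To relate a factor to the outcome, the share of `cardio == 1` can be compared across the groups of that factor. The sketch below is a minimal illustration: the helper name `cardio_rate_by` and the toy data are hypothetical, so the printed rates are not findings about the real dataset.

```python
import pandas as pd

# Hypothetical toy data with the same 0/1 coding as the notebook's columns
toy = pd.DataFrame({
    "smoke":  [0, 0, 0, 0, 1, 1],
    "cardio": [0, 1, 0, 0, 1, 1],
})


def cardio_rate_by(frame, factor):
    """Percentage of rows with cardio == 1 within each group of `factor`."""
    # The mean of a 0/1 column is the proportion of ones
    return frame.groupby(factor)["cardio"].mean().mul(100).round(2)


rates = cardio_rate_by(toy, "smoke")
print(rates)
```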


---

# EDA: Feature Engineering

In [9]:
df['pp'] = df['ap_hi'] - df['ap_lo'] # Create a new feature: pulse pressure
df.head()

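| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | pp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 30 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 50 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 60 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 50 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 40 |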
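
Pulse pressure is the difference `ap_hi - ap_lo`, so it should be positive whenever both readings are recorded correctly. That makes the new column a convenient consistency check. The sketch below uses hypothetical toy rows, and treating `pp <= 0` as suspect is an assumption about how such rows might be flagged, not a rule taken from the notebook.

```python
import pandas as pd

# Hypothetical toy rows: one consistent reading and two inconsistent ones
toy = pd.DataFrame({
    "ap_hi": [110, 80, 120],
    "ap_lo": [80, 120, 120],
})
toy["pp"] = toy["ap_hi"] - toy["ap_lo"]

# A non-positive pulse pressure means the lower reading is not below the upper one
suspect = toy[toy["pp"] <= 0]
print(suspect)
```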


In [10]:
df['bmi'] = (df['weight'] / (df['height']/100)**2).round(2) # Create a new feature: BMI
df.head()

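| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | pp | bmi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 30 | 21.97 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 50 | 34.93 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 60 | 23.51 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 50 | 28.71 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 40 | 23.01 |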
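
BMI is often easier to compare across groups once it is binned into categories. The sketch below applies the standard WHO adult cut-offs (18.5, 25 and 30) with `pd.cut`. The five BMI values are copied from the output above; the variable name `bmi_category` and the label names are assumptions, and the notebook itself does not create such a column.

```python
import numpy as np
import pandas as pd

# BMI values copied from the first five rows of the output above
bmi = pd.Series([21.97, 34.93, 23.51, 28.71, 23.01], name="bmi")

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    right=False,
    labels=["underweight", "normal", "overweight", "obese"],
)
print(bmi_category.tolist())
```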


In [11]:
df["age_years"] = (df["age"] / 365.25).round().astype(int)  # Create a new feature: age in years
df.head()

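| | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | pp | bmi | age_years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 30 | 21.97 | 50 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 50 | 34.93 | 55 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 60 | 23.51 | 52 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 50 | 28.71 | 48 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 40 | 23.01 | 48 |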


In [12]:
df['age_years'].min(), df['age_years'].max()

(30, 65)
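
With `age_years` spanning 30 to 65, grouping ages into ten-year bands is one possible way to examine the age hypothesis. The sketch below is an assumption about how that could be done with `pd.cut`; the variable name `age_band` and the band labels are hypothetical, and the sample ages combine the five values shown above with the observed minimum and maximum.

```python
import pandas as pd

# Ages from the first five rows above, plus the observed minimum and maximum
age_years = pd.Series([50, 55, 52, 48, 48, 30, 65], name="age_years")

# Left-closed ten-year bands: [30, 40), [40, 50), [50, 60), [60, 70)
age_band = pd.cut(
    age_years,
    bins=[30, 40, 50, 60, 70],
    right=False,
    labels=["30-39", "40-49", "50-59", "60-69"],
)
print(age_band.tolist())
```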

---


# Push files to Repo
