# **Exploratory Data Analysis (EDA)**

## Objectives

This notebook uses **Exploratory Data Analysis (EDA)** to explore and understand the dataset and its existing patterns and trends. It consists of the following sections:
- Initial Setup
- Feature Engineering


## Prerequisites
- Python 3.12.8 is installed
- Required Python libraries from `requirements.txt` and their dependencies are installed
- Optionally, a Python virtual environment is set up
- Completed ETL step

## Inputs

- Cleaned dataset from ETL `cleaned_heart_data.csv`.

## Initial hypotheses
- Older individuals are more likely to develop cardiovascular disease (CVD)
- Males are more likely to develop CVD than females
- High blood pressure increases the likelihood of developing CVD
- Individuals with high cholesterol are more likely to have CVD
- Physically active individuals are less likely to have CVD
- Smokers are more likely to have CVD
- Alcohol consumption increases the likelihood of developing CVD

## Outputs

- Generated insights and tested hypotheses
- Data visualisations to show trends




---

# Change working directory

The working directory must be changed from its current folder to its parent folder
* The current directory can be accessed with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases\\jupyter_notebooks'

The parent of the current directory becomes the new working directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new working directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases'
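
One caveat with the cells above is that re-running the `os.chdir()` cell moves up one more level each time. A minimal sketch of a guard is shown here, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed earlier; the helper name `project_root_from` is hypothetical and not part of the notebook.

```python
import os
from pathlib import Path


def project_root_from(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of `cwd` only when `cwd` is the notebooks folder.

    Hypothetical helper: if `cwd` is already somewhere else (for example the
    project root after a previous run), it is returned unchanged.
    """
    return cwd.parent if cwd.name == notebooks_dir else cwd


root = project_root_from(Path.cwd())
os.chdir(root)  # has no effect when the working directory is already `root`
print(root)
```

With this guard the cell can be run repeatedly without climbing above the project folder.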

---

# Initial Setup

### Import Libraries

Essential data analysis and visualisation libraries are imported.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

### Extract Dataset

Load the cleaned CSV file into a pandas DataFrame.

In [None]:
df = pd.read_csv('data/cleaned/cleaned_heart_data.csv') # Cleaned data directory
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
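
The `Unnamed: 0` column in the output above is the name pandas gives to a CSV column with an empty header, which typically appears when a DataFrame index was written to the file. Assuming that is what happened in the ETL step (not shown here), the sketch below uses a small in-memory CSV, not the real file, to show how `index_col=0` would read that column back as the index instead.

```python
import io

import pandas as pd

# Toy CSV mimicking a file saved with its index: the first header cell is empty.
csv_text = ",id,age,cardio\n0,0,18393,0\n1,1,20228,1\n"

# Default read: the header-less column becomes a regular column named "Unnamed: 0".
with_index_column = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))
print(list(without_index_column.columns))
```

Alternatively, writing the file with `df.to_csv(path, index=False)` in the ETL step would avoid the extra column altogether.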


---

# EDA: Descriptive Analysis

---

# EDA: Feature Engineering

The current DataFrame contains 13 original features (not counting the `Unnamed: 0` index column carried over from the CSV). The table below lists the name, description, and data type of each.

|Name of feature|Description|Data type|
| ----------- | ----------- | ----------- |
|`id`|Unique identifier assigned to each person|Integer|
|`age`|Age of the person in days|Integer|
|`gender`|Gender of the person, 1 = male, 2 = female|Integer|
|`height`|Height of the person in cm|Integer|
|`weight`|Weight of the person in kg|Float|
|`ap_hi`|Systolic blood pressure reading|Integer|
|`ap_lo`|Diastolic blood pressure reading|Integer|
|`cholesterol`|Cholesterol level|Integer|
|`gluc`|Glucose level|Integer|
|`smoke`|Smoking status, 0 = non-smoker, 1 = smoker|Integer|
|`alco`|Alcohol status, 0 = does not consume alcohol, 1 = consumes alcohol|Integer|
|`active`|Physical activity status, 0 = not active, 1 = active|Integer|
|`cardio`|Presence of cardiovascular disease, 0 = not present, 1 = present|Integer|
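
For plots and group summaries, the integer codes in the table can be easier to read as labels. The following is a minimal sketch on toy rows, not the real dataset; the `labels` dictionary and the label wording are assumptions derived from the descriptions in the table.

```python
import pandas as pd

# Toy rows using the 0/1 codes described in the table.
toy = pd.DataFrame({"smoke": [0, 1, 0], "active": [1, 1, 0], "cardio": [0, 1, 1]})

# Hypothetical label maps based on the feature descriptions.
labels = {
    "smoke": {0: "non-smoker", 1: "smoker"},
    "active": {0: "not active", 1: "active"},
    "cardio": {0: "no CVD", 1: "CVD"},
}

# Work on a copy so the numeric codes stay available for calculations.
decoded = toy.copy()
for column, mapping in labels.items():
    decoded[column] = toy[column].map(mapping).astype("category")

print(decoded)
```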

In [6]:
df['pp'] = df['ap_hi'] - df['ap_lo'] # Create a new feature: pulse pressure
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,30
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,50
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,60
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,50
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,40
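
Pulse pressure is systolic minus diastolic pressure, so a value of zero or below means the two readings are inconsistent. The sketch below reproduces the calculation on toy rows (three taken from the output above plus one deliberately invalid row) and flags such cases. Whether the cleaned dataset contains any is not known from this notebook; the variable `suspect` is hypothetical.

```python
import pandas as pd

# Three rows from the notebook output plus one deliberately invalid row (80/120).
toy = pd.DataFrame({"ap_hi": [110, 140, 130, 80], "ap_lo": [80, 90, 70, 120]})

# Pulse pressure = systolic - diastolic, as in the notebook cell.
toy["pp"] = toy["ap_hi"] - toy["ap_lo"]

# Rows where systolic <= diastolic are implausible and worth inspecting.
suspect = toy[toy["pp"] <= 0]

print(toy["pp"].tolist())
print(len(suspect))
```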


In [8]:
df['bmi'] = (df['weight'] / (df['height']/100)**2).round(2) # Create a new feature: BMI
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,30,21.97
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,50,34.93
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,60,23.51
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,50,28.71
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,40,23.01
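
BMI is weight in kilograms divided by the square of height in metres, which is why `height` (in cm) is divided by 100 first. As a possible follow-up, the sketch below repeats the calculation on three rows from the output above and groups the result into the standard WHO adult categories. The column name `bmi_category` is hypothetical and the binning is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Heights (cm) and weights (kg) from the first three rows of the output.
toy = pd.DataFrame({"height": [168, 156, 165], "weight": [62.0, 85.0, 64.0]})

# BMI = weight (kg) / height (m)^2, rounded to two decimals as in the notebook.
toy["bmi"] = (toy["weight"] / (toy["height"] / 100) ** 2).round(2)

# WHO adult cut-offs: <18.5 underweight, 18.5 to <25 normal, 25 to <30 overweight, >=30 obese.
toy["bmi_category"] = pd.cut(
    toy["bmi"],
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,  # intervals are closed on the left: [18.5, 25), [25, 30), ...
)

print(toy)
```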


In [13]:
df["age_years"] = (df["age"] / 365.25).round().astype(int)  # Create a new feature: age in years
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,pp,bmi,age_years
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,30,21.97,50
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,50,34.93,55
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,60,23.51,52
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,50,28.71,48
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,40,23.01,48
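
Note that `.round()` gives the nearest whole year, not the number of completed years, so someone aged 51.6 is recorded as 52. The sketch below compares both conventions on the five `age` values shown above; the column `age_completed` is hypothetical and only included for comparison.

```python
import numpy as np
import pandas as pd

# Ages in days from the first five rows of the output.
toy = pd.DataFrame({"age": [18393, 20228, 18857, 17623, 17474]})

years = toy["age"] / 365.25  # 365.25 accounts for leap years on average

toy["age_years"] = years.round().astype(int)        # nearest year, as in the notebook
toy["age_completed"] = np.floor(years).astype(int)  # completed years (hypothetical)

print(toy)
```

Either convention is defensible for EDA; what matters is stating which one is used when reporting age-based results.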


In [21]:
df['age_years'].min(), df['age_years'].max()

(30, 65)
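
With `age_years` spanning 30 to 65, one way to start examining the first hypothesis (older individuals are more likely to have CVD) is to compare the share of `cardio == 1` across age bands. This is a minimal sketch on made-up toy data, so the numbers it prints are not results; the band edges and the names `bands` and `cvd_rate` are assumptions.

```python
import pandas as pd

# Toy data only: values are invented to show the mechanics, not real findings.
toy = pd.DataFrame({
    "age_years": [30, 38, 42, 47, 51, 55, 60, 65],
    "cardio":    [0,  0,  0,  1,  0,  1,  1,  1],
})

# Bands cover the observed 30 to 65 range; intervals are right-closed, e.g. (29, 39].
bands = pd.cut(
    toy["age_years"],
    bins=[29, 39, 49, 59, 65],
    labels=["30-39", "40-49", "50-59", "60-65"],
)

# The mean of a 0/1 column is the proportion of rows with cardio == 1 in each band.
cvd_rate = toy.groupby(bands, observed=True)["cardio"].mean()

print(cvd_rate)
```

On the real DataFrame the same pattern would be `df.groupby(bands, observed=True)["cardio"].mean()`, with `bands` built from `df["age_years"]`.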

---
