# **Extract Transform Load (ETL)**

## Objectives

The objective of this notebook is to: 
- **Extract** raw data from a csv file downloaded from Kaggle. 
- **Transform** and clean the extracted data. 
- **Load** processed data into the cleaned data folder as a csv file.

## Prerequisites
- Python 3.12.8 is installed.
- The required Python libraries listed in `requirements.txt`, and their dependencies, are installed.
- Optionally, a Python virtual environment is set up.

## Inputs

- `heart_data.csv` file.
- Data source: https://data.world/kudem/heart-disease-dataset
- Data author: [Kuzak Dempsy](https://data.world/kudem)
- Kaggle download link: https://www.kaggle.com/datasets/thedevastator/exploring-risk-factors-for-cardiovascular-diseas 

## Outputs

* Cleaned dataset saved as `cleaned_heart_data.csv`.
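
As a rough illustration of the load step, the sketch below writes a small stand-in DataFrame to `cleaned_heart_data.csv` and reads it back. The helper name `save_cleaned` and the `data/cleaned` folder are assumptions made for this example; the notebook does not state the exact output path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(df: pd.DataFrame, out_dir: Path,
                 filename: str = "cleaned_heart_data.csv") -> Path:
    """Write df to out_dir/filename without the row index and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)  # index=False keeps the row index out of the file
    return out_path


# Tiny stand-in frame; the real notebook would pass its cleaned DataFrame.
sample = pd.DataFrame({"id": [0, 1], "age": [18393, 20228], "cardio": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    # "data/cleaned" is an assumed folder name, not taken from the notebook.
    written = save_cleaned(sample, Path(tmp) / "data" / "cleaned")
    reloaded = pd.read_csv(written)

print(list(reloaded.columns))
```

Passing `index=False` matters here: if the row index is written to the file, `pd.read_csv()` loads it back as an extra column named `Unnamed: 0`.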

## Additional Comments
- The transformations in this section mainly aim to clean the dataset. More in-depth transformations will take place in a dedicated EDA section later on.



---

# Change working directory

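The notebook runs from the `jupyter_notebooks` folder, so the working directory must be changed to its parent folder (the project root). Relative paths such as `data/raw/heart_data.csv` then resolve correctly.
* The current directory can be accessed with `os.getcwd()`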

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases\\jupyter_notebooks'

The parent of the current directory will become the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases'
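
One caveat with the cells above is that re-running them moves the working directory up one more level each time. A minimal sketch of a re-runnable alternative is shown below. The helper `resolve_project_root` is hypothetical, and it assumes the notebooks folder is always named `jupyter_notebooks`, as in the path printed earlier.

```python
import os
from pathlib import Path


def resolve_project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd only when cwd is the notebooks folder."""
    return cwd.parent if cwd.name == notebooks_dir else cwd


root = resolve_project_root(Path.cwd())
os.chdir(root)  # no change when the notebook is already at the project root
print(f"Working directory: {Path.cwd()}")
```

Because the move only happens when the current folder is the notebooks folder, running the cell twice leaves the working directory at the project root.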

---

# ETL: Extraction

Import the essential libraries for ETL before extracting the data.

In [4]:
import pandas as pd
import numpy as np

Use `pd.read_csv()` to read the CSV file and store it as a pandas DataFrame.

In [14]:
df = pd.read_csv('data/raw/heart_data.csv')  # Path to the input CSV file, relative to the project root
df.head()

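|   | index | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|-------|----|-----|--------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0 | 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |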
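
Beyond `df.head()`, a few quick checks right after extraction can reveal problems before any cleaning starts. The sketch below is a suggestion rather than part of the original notebook. It uses the five rows displayed above as an in-memory stand-in for `data/raw/heart_data.csv`, so it runs without the file.

```python
import io

import pandas as pd

# The five rows shown in the notebook output, used as a stand-in for the raw file.
raw_csv = io.StringIO(
    "index,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio\n"
    "0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0\n"
    "1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1\n"
    "2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1\n"
    "3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1\n"
    "4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0\n"
)
sample = pd.read_csv(raw_csv)

n_rows, n_cols = sample.shape                  # size of the extracted table
missing_per_column = sample.isna().sum()       # count of missing values per column
n_duplicates = int(sample.duplicated().sum())  # count of fully duplicated rows

print(f"{n_rows} rows x {n_cols} columns")
print(sample.dtypes)
print(f"missing values: {int(missing_per_column.sum())}, duplicate rows: {n_duplicates}")
```

On the full dataset, the same three checks would be run on `df` instead of `sample`.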


The `index` column is redundant because a pandas DataFrame already has its own index, so it is removed.

In [15]:
df.drop(columns=['index'], inplace=True)
df.head()

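|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|----|-----|--------|--------|--------|-------|-------|-------------|------|-------|------|--------|--------|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |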
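
Dropping with `inplace=True` works, but the cell raises a `KeyError` if it is run a second time because the column is already gone. Two alternatives are sketched below on a shortened, made-up sample of the columns: dropping into a new DataFrame, or skipping the column at read time through the `usecols` parameter of `pd.read_csv()`. Neither is the notebook's own approach.

```python
import io

import pandas as pd

# Shortened stand-in for the raw file (only four of the real columns).
raw_csv = "index,id,age,cardio\n0,0,18393,0\n1,1,20228,1\n"

# Option 1: drop after reading, returning a new DataFrame instead of mutating in place.
with_index = pd.read_csv(io.StringIO(raw_csv))
dropped = with_index.drop(columns=["index"])

# Option 2: never load the column; usecols accepts a callable applied to column names.
skipped = pd.read_csv(io.StringIO(raw_csv), usecols=lambda name: name != "index")

print(list(dropped.columns))
print(dropped.equals(skipped))
```

If the in-place form is kept, passing `errors="ignore"` to `drop()` is another way to make the cell safe to re-run.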


---


