# **Extract Transform Load (ETL)**

## Objectives

The objective of this notebook is to: 
- **Extract** raw data from a csv file downloaded from Kaggle. 
- **Transform** and clean the extracted data. 
- **Load** processed data into the cleaned data folder as a csv file.

## Prerequisites
- Python 3.12.8 is installed.
- Required Python libraries from `requirements.txt` and their dependencies must be installed.
- Optionally, set up a Python virtual environment.

## Inputs

- `heart_data.csv` file.
- Data source: https://data.world/kudem/heart-disease-dataset
- Data author: [Kuzak Dempsy](https://data.world/kudem)
- Kaggle download link: https://www.kaggle.com/datasets/thedevastator/exploring-risk-factors-for-cardiovascular-diseas 

## Outputs

* Cleaned dataset saved as `cleaned_heart_data.csv`.

## Additional Comments
- The transformations in this notebook mainly aim to clean the dataset. More in-depth analysis of the dataset will take place in a dedicated EDA section later on.



---

# Change working directory

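The notebook is stored in the `jupyter_notebooks` folder, so the working directory must be changed to its parent folder (the project root). This lets relative paths such as `data/raw/heart_data.csv` resolve correctly.
* The current directory can be accessed with `os.getcwd()`.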

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases\\jupyter_notebooks'

The parent of the current directory will become the new current directory.
* `os.path.dirname()` gets the parent directory.
* `os.chdir()` sets the new current directory.

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\Code projects\\Capstone\\analysis-of-risk-factors-for-cardiovascular-diseases'
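
Re-running the cell above moves the working directory up one more level each time. A minimal sketch of a guarded version follows, assuming the notebooks live in a folder named `jupyter_notebooks` (as the path output above suggests). The helper name `move_to_parent` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_parent(expected_child: str = "jupyter_notebooks") -> Path:
    """Move up one level only if the current folder is `expected_child`."""
    cwd = Path.cwd()
    if cwd.name == expected_child:
        os.chdir(cwd.parent)  # Same effect as os.chdir(os.path.dirname(os.getcwd()))
    return Path.cwd()


# Calling the helper more than once leaves the directory unchanged after the first move.
print(move_to_parent())
```

Because the folder name is checked first, the cell can be re-run safely without ending up above the project root.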

---

# ETL: Extraction

In this section, a pandas DataFrame will be extracted from the csv file. 

First, import the essential libraries for ETL before data extraction.

In [4]:
import pandas as pd
import numpy as np

Use `.read_csv()` to read the csv file and save it as a pandas DataFrame.

In [5]:
df = pd.read_csv('data/raw/heart_data.csv') # Directory of the input csv file
df.head()

Unnamed: 0,index,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
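
If the working directory is wrong or the file has an unexpected layout, `pd.read_csv()` fails with a less helpful message or silently loads the wrong columns. The sketch below shows one possible guard. The helper `load_raw` and the `EXPECTED_COLUMNS` set are illustrative assumptions built from the column names in the output above, and the demonstration uses a temporary stand-in file rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Column names taken from the df.head() output above.
EXPECTED_COLUMNS = {
    "id", "age", "gender", "height", "weight", "ap_hi", "ap_lo",
    "cholesterol", "gluc", "smoke", "alco", "active", "cardio",
}


def load_raw(path):
    """Hypothetical helper: read a csv file and check that the expected columns exist."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found (working directory: {Path.cwd()})")
    frame = pd.read_csv(path)
    missing = EXPECTED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return frame


# Demonstration on a temporary stand-in file containing the first row shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "heart_data.csv"
    demo_path.write_text(
        "index,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio\n"
        "0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0\n"
    )
    demo_df = load_raw(demo_path)

print(demo_df.shape)
```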


---

# ETL: Transformation

In this section, descriptive analysis will be done to get an overview of the DataFrame. The DataFrame will then be transformed and cleaned. More in-depth transformations of the DataFrame, such as finding and handling outliers, will be explored in a dedicated EDA notebook.

The `index` column is not needed because a pandas DataFrame already has its own index. It will therefore be removed.

In [6]:
df.drop(columns=['index'], inplace=True)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0
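
As an alternative to dropping the column after loading, the redundant column can be skipped while reading the file. The sketch below uses the `usecols` parameter of `pd.read_csv()` with a callable. The in-memory csv is an illustrative stand-in built from the first two rows shown above, not the real file.

```python
import io

import pandas as pd

# Illustrative stand-in for data/raw/heart_data.csv (first two rows shown above).
sample_csv = io.StringIO(
    "index,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio\n"
    "0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0\n"
    "1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1\n"
)

# A callable passed to usecols keeps only the columns for which it returns True.
sample_df = pd.read_csv(sample_csv, usecols=lambda name: name != "index")
print(sample_df.columns.tolist())
```

Both approaches give the same 13 columns. Dropping after loading, as done in this notebook, keeps the extraction step a plain read of the raw file.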


After removing the index column, the dimensions of the DataFrame can be checked using `.shape`. The output of `(70000, 13)` indicates 70000 rows (entries) and 13 columns.

In [7]:
df.shape # Get the shape of the DataFrame

(70000, 13)

A basic summary of the DataFrame is produced using `df.info()`. Each column has 70000 non-null values, suggesting no missing values. All 13 columns consist of numerical values: the weight column has a float64 dtype and the other 12 columns are int64.

In [8]:
df.info() # Get a summary of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB
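
The dtype observations from `df.info()` can also be checked programmatically rather than read off the printed summary. This is a minimal sketch using `select_dtypes()` on a small illustrative DataFrame. The values are copied from the first two rows shown earlier and only four of the columns are included.

```python
import pandas as pd

# Small illustrative subset of the dataset (values from the first two rows).
sample_df = pd.DataFrame(
    {"id": [0, 1], "age": [18393, 20228], "weight": [62.0, 85.0], "cardio": [0, 1]}
)

# Columns that are NOT numeric; an empty list means every column is numeric.
non_numeric = sample_df.select_dtypes(exclude="number").columns.tolist()

# Columns stored as 64-bit floats.
float_cols = sample_df.select_dtypes(include="float64").columns.tolist()

print(non_numeric, float_cols)
```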


The absence of missing values can be confirmed using `.isnull().sum()`, which counts the null values in each column. An output of 0 for every column means that there are no missing values.

In [9]:
df.isnull().sum() # Check for missing values

id             0
age            0
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
dtype: int64
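
For wider datasets it can be easier to list only the columns that do contain missing values. The sketch below filters the result of `.isnull().sum()`. The toy DataFrame, with one deliberately missing weight, is illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data with one missing weight, used only to show a non-zero count.
toy = pd.DataFrame({"height": [168, 156, 165], "weight": [62.0, np.nan, 64.0]})

missing_counts = toy.isnull().sum()                   # Null count per column
cols_with_missing = missing_counts[missing_counts > 0]  # Keep only non-zero counts

print(cols_with_missing)
```

On the heart dataset above, the same filter would return an empty Series because every count is 0.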

Fully duplicated rows are counted with `.duplicated().sum()`. An output of 0 means there are no duplicated rows.

In [10]:
df.duplicated().sum() # Check for duplicate rows

0
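
Had the count been greater than 0, the duplicated rows could be inspected and removed as sketched below. The toy DataFrame is illustrative. By default `.duplicated()` marks every repeat after the first occurrence, while `keep=False` marks all copies.

```python
import pandas as pd

# Toy data in which the second and third rows are identical.
toy = pd.DataFrame({"id": [0, 1, 1], "age": [18393, 20228, 20228]})

n_dupes = int(toy.duplicated().sum())          # Repeats after the first occurrence
all_copies = toy[toy.duplicated(keep=False)]   # Every row that has a twin
deduped = toy.drop_duplicates()                # Keep the first occurrence only

print(n_dupes, len(all_copies), len(deduped))
```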

The **'id'** column has an int64 dtype. If multiple entries shared the same id, it would be better to convert the column to a categorical data type, because that allows easier grouping later. As there are no duplicated id values, as shown by the `False` output, conversion is not needed.

In [11]:
df['id'].duplicated().any() # Check for any duplicate IDs

False
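
The same check can be written with the `is_unique` attribute of a Series, and `value_counts()` shows which ids repeat when it is `False`. The sketch below uses a toy Series with one repeated id to illustrate the case where conversion to a categorical dtype would apply. The data is illustrative only.

```python
import pandas as pd

# Toy id column in which the value 2 appears twice.
toy_ids = pd.Series([0, 1, 2, 2, 3], name="id")

print(toy_ids.is_unique)  # Equivalent to: not toy_ids.duplicated().any()

counts = toy_ids.value_counts()
repeated_ids = counts[counts > 1].index.tolist()  # ids that occur more than once

# Conversion that would be applied if ids repeated.
as_category = toy_ids.astype("category")
print(repeated_ids, as_category.dtype)
```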

---

# ETL: Load

In this section, the cleaned DataFrame will be saved as a csv file named `cleaned_heart_data.csv`.

In [12]:
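output_path = 'data/cleaned/cleaned_heart_data.csv' # Path of the output csv file (avoids shadowing the built-in dir)
df.to_csv(output_path, index=False) # Save without writing the DataFrame index as a column
print(f"Cleaned data saved to {output_path}")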

Cleaned data saved to data/cleaned/cleaned_heart_data.csv
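
`to_csv()` raises an error if the target folder does not exist, so it can help to create the folder first and to confirm that the saved file reads back unchanged. The sketch below is one way to do this under stated assumptions: the helper `save_cleaned` is hypothetical, and the demonstration writes a small illustrative DataFrame to a temporary folder instead of the project's `data/cleaned` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_cleaned(frame, output_path):
    """Hypothetical helper: create the parent folder if needed, then save as csv."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(output_path, index=False)
    return output_path


# Small illustrative DataFrame (values from the first two rows shown earlier).
sample_df = pd.DataFrame({"id": [0, 1], "age": [18393, 20228], "weight": [62.0, 85.0]})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_cleaned(sample_df, Path(tmp) / "data" / "cleaned" / "cleaned_heart_data.csv")
    reloaded = pd.read_csv(saved)

# True when values, column names, and dtypes all survive the round trip.
round_trip_ok = reloaded.equals(sample_df)
print(round_trip_ok)
```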


---

# Conclusion

- A basic cleaning of the dataset has been completed.
- The cleaned dataset has been saved as a csv file.
- More in-depth analysis and transformation of the dataset will be conducted in the following EDA section.