# **ETL - Extract, Transform, Load Process**

## Objectives

* This notebook performs the Extract, Transform, Load (ETL) process for the Diabetes dataset.
We aim to load all raw CSV files, clean and transform the data where necessary, and prepare a combined dataset for analysis.

## Inputs

This project uses three related CSV files containing health indicators and diabetes classification data from the 2015 Behavioral Risk Factor Surveillance System (BRFSS). Each file represents a slightly different version of the dataset for classification analysis.

* `diabetes_012_health_indicators_BRFSS2015.csv`: Multiclass dataset with a three-class target (`Diabetes_012`).
* `diabetes_binary_5050split_health_indicators_BRFSS2015.csv`: Balanced binary dataset with a 50/50 split of diabetic and non-diabetic cases.
* `diabetes_binary_health_indicators_BRFSS2015.csv`: Original binary classification dataset with an imbalanced class distribution.

Location: All files are in the `data/` folder.

## Outputs

* A cleaned, merged DataFrame ready for analysis
* Saved version of the cleaned data: `data/combined_diabetes_data.csv`

## Additional Comments

* All transformations will be documented and commented during each step. Column names and values will be standardised for consistency.



# Change working directory

The working directory needs to be the project root so that relative file paths (e.g. for reading data) work reliably.

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory.

We change the working directory from the current folder to its parent folder.
* `os.getcwd()` returns the current directory.

In [25]:
import os

# Change working directory to the project root (one level up from 'notebooks')
os.chdir('..')
print("Working directory set to:", os.getcwd())

Working directory set to: /Users/nasraibrahim/Documents/vscode-projects


In [26]:
import os
current_dir = os.getcwd()
current_dir

'/Users/nasraibrahim/Documents/vscode-projects'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [27]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [28]:
current_dir = os.getcwd()
current_dir

'/Users/nasraibrahim/Documents'

In [29]:
import os
print("Current working directory:", os.getcwd())

Current working directory: /Users/nasraibrahim/Documents


In [30]:
import os

# Set this path to your actual project folder
project_root = "/Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis"

os.chdir(project_root)
print("New working directory:", os.getcwd())

New working directory: /Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis


In [31]:
import os

print("Current working directory:", os.getcwd())
print("Files in data folder:", os.listdir('data'))

Current working directory: /Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis
Files in data folder: ['diabetes_012_health_indicators_BRFSS2015.csv', 'diabetes_binary_health_indicators_BRFSS2015.csv', 'diabetes_binary_5050split_health_indicators_BRFSS2015.csv']
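
The cells above reach the project root through several manual `os.chdir` calls and a hard-coded absolute path. As a more portable alternative, the sketch below walks up from a starting folder until it finds one that contains a `data` directory. The helper name `find_project_root` and the choice of `data` as the marker are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    `find_project_root` and the `marker` default are hypothetical names
    introduced for this sketch.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' folder")


# Usage in a notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(find_project_root(Path.cwd()))
```

Because the search starts at the current folder, running the cell repeatedly should not move the working directory further up the tree, unlike repeated calls to `os.chdir('..')`.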


# Section 1 - Extract the Data


In this section, we load the three diabetes-related CSV files stored in the `data` folder.  
We first verify that all files exist, then read each into a Pandas DataFrame.  
To keep track of the source of each record, a new column `source` is added to each DataFrame.  
Finally, we combine all three datasets into a single DataFrame called `combined_df` for further processing.

In [32]:
import pandas as pd

In [34]:
import pandas as pd
import os

current_dir = os.getcwd()  # define current_dir first
print("Current working directory:", os.getcwd())

# File paths (use absolute paths based on current_dir)
file_1 = os.path.join(current_dir, "data/diabetes_012_health_indicators_BRFSS2015.csv")
file_2 = os.path.join(current_dir, "data/diabetes_binary_5050split_health_indicators_BRFSS2015.csv")
file_3 = os.path.join(current_dir, "data/diabetes_binary_health_indicators_BRFSS2015.csv")

# Check each file exists
for f in [file_1, file_2, file_3]:
    if not os.path.exists(f):
        raise FileNotFoundError(f"File not found: {f}")

# Load CSVs
df1 = pd.read_csv(file_1)
df2 = pd.read_csv(file_2)
df3 = pd.read_csv(file_3)

# Optional: Add a source column to keep track of origin
df1['source'] = 'original'
df2['source'] = 'balanced_5050'
df3['source'] = 'binary'

# Stack the three DataFrames row-wise; columns absent from a file become NaN
combined_df = pd.concat([df1, df2, df3], ignore_index=True)

# Preview
combined_df.head()


Current working directory: /Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis


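| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | source | Diabetes_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 | original | NaN |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 | original | NaN |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 | original | NaN |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 | original | NaN |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 | original | NaN |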
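
The preview shows no value in `Diabetes_binary` for rows tagged `original`. This follows from how `pd.concat` aligns columns: a column that exists in only some of the inputs is filled with `NaN` for rows from the others. The toy frames below are illustrative stand-ins with made-up values, not the real files.

```python
import pandas as pd

# Toy stand-ins for two of the real files (values are illustrative only)
multi = pd.DataFrame({"Diabetes_012": [0.0, 2.0], "BMI": [40.0, 25.0]})
binary = pd.DataFrame({"Diabetes_binary": [1.0, 0.0], "BMI": [28.0, 27.0]})

multi["source"] = "original"
binary["source"] = "binary"

# Row-wise stack; the result has the union of both column sets
stacked = pd.concat([multi, binary], ignore_index=True)

# Each target column is NaN for the rows that came from the other file
print(stacked.isna().sum())
```

The same mechanism explains the missing-value counts reported for `Diabetes_012` and `Diabetes_binary` in Section 2: they reflect which file a row came from rather than gaps in the survey data.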


---

# Section 2 - Transform the Data


In this section, we perform necessary data cleaning and transformation steps on the combined dataset.  
This includes handling missing values, converting data types if needed, and possibly creating new features.  
The goal is to prepare a clean and consistent dataset for analysis.
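
Before choosing how to handle missing values, it can help to check which source file each gap comes from. The sketch below does this on a toy frame and then derives a single binary target. Two things here are assumptions rather than part of the notebook: the column name `target_binary`, and the mapping that treats `Diabetes_012` classes 1 and 2 as positive.

```python
import pandas as pd

# Toy frame mimicking the combined data (illustrative values only)
toy = pd.DataFrame({
    "Diabetes_012": [0.0, 1.0, 2.0, None, None],
    "Diabetes_binary": [None, None, None, 1.0, 0.0],
    "source": ["original"] * 3 + ["binary"] * 2,
})

# Count missing values per source file for the two target columns
target_cols = ["Diabetes_012", "Diabetes_binary"]
nulls_by_source = toy[target_cols].isna().groupby(toy["source"]).sum()
print(nulls_by_source)

# ASSUMPTION: classes 1 and 2 of Diabetes_012 both count as positive.
# Rows without a Diabetes_012 value stay NaN in `derived`.
derived = (toy["Diabetes_012"] > 0).astype(float).where(toy["Diabetes_012"].notna())

# `target_binary` is a hypothetical column name for the unified label
toy["target_binary"] = toy["Diabetes_binary"].fillna(derived)
print(toy)
```

Deriving the label from the column a row actually has keeps every value traceable to real data, whereas imputing a mode would assign the same class to every row from the multiclass file.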

In [35]:
# Check for missing values
missing_counts = combined_df.isnull().sum()
print("Missing values per column:\n", missing_counts)

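# Example: fill missing values if appropriate (e.g. with the mode or median)
# Here we fill missing values in 'Diabetes_binary' with the mode as an example.
# Note: these gaps are structural, because rows from the multiclass file carry
# 'Diabetes_012' instead, so the filled values are placeholders, not real labels.
if combined_df['Diabetes_binary'].isnull().any():
    mode_value = combined_df['Diabetes_binary'].mode()[0]
    # Assign back instead of calling fillna(inplace=True) on a column selection
    combined_df['Diabetes_binary'] = combined_df['Diabetes_binary'].fillna(mode_value)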

# Convert data types if needed (example)
# combined_df['Age'] = combined_df['Age'].astype(int)

# Check data types
print("Data types:\n", combined_df.dtypes)

# Optionally drop duplicate rows if any
before_dedup = combined_df.shape[0]
combined_df.drop_duplicates(inplace=True)
after_dedup = combined_df.shape[0]
print(f"Dropped {before_dedup - after_dedup} duplicate rows.")

# Preview cleaned data
combined_df.head()

Missing values per column:
 Diabetes_012            324372
HighBP                       0
HighChol                     0
CholCheck                    0
BMI                          0
Smoker                       0
Stroke                       0
HeartDiseaseorAttack         0
PhysActivity                 0
Fruits                       0
Veggies                      0
HvyAlcoholConsump            0
AnyHealthcare                0
NoDocbcCost                  0
GenHlth                      0
MentHlth                     0
PhysHlth                     0
DiffWalk                     0
Sex                          0
Age                          0
Education                    0
Income                       0
source                       0
Diabetes_binary         253680
dtype: int64
Data types:
 Diabetes_012            float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke    

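| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | source | Diabetes_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 | original | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 | original | 0.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 | original | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 | original | 0.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 | original | 0.0 |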
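
One point to keep in mind about the de-duplication step: because every row carries a `source` label, `drop_duplicates()` over all columns only removes rows that repeat within the same file. If records shared across files should also count as duplicates, the comparison can be limited to the feature columns. The sketch below uses a toy frame with made-up values; whether cross-file duplicates should be dropped depends on the analysis.

```python
import pandas as pd

# Toy frame: rows 0 and 1 have identical features but different sources
toy = pd.DataFrame({
    "HighBP": [1.0, 1.0, 0.0],
    "BMI": [40.0, 40.0, 25.0],
    "source": ["original", "binary", "binary"],
})

# With `source` included, rows 0 and 1 differ, so both survive
kept_all_cols = toy.drop_duplicates()

# Comparing only the feature columns treats them as the same record
feature_cols = [c for c in toy.columns if c != "source"]
kept_features = toy.drop_duplicates(subset=feature_cols, keep="first")

print(len(kept_all_cols), len(kept_features))
```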


# Section 3 - Load the Data

In this stage, we save the cleaned and merged dataset to a CSV file for easy access in future analysis or modeling steps.  
We will export the combined dataframe `combined_df` to a new CSV file in the `data/` folder named `combined_diabetes_data.csv`.

This makes the data ready for further use without needing to repeat the extraction and transformation steps.

In [36]:
# Define output file path
output_file = "data/combined_diabetes_data.csv"

# Ensure combined_df exists
if 'combined_df' not in locals():
    raise NameError("combined_df is not defined. Please run the cell where combined_df is created.")

# Save the combined DataFrame to CSV
combined_df.to_csv(output_file, index=False)

print(f"Cleaned and merged data saved to {output_file}")

Cleaned and merged data saved to data/combined_diabetes_data.csv
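
As an optional check on the export, the sketch below creates the output folder if it is missing, writes a CSV, and reads it back to confirm that the shape and column names survived the round trip. It uses a toy frame and a temporary directory in place of the real `data/` folder, so the paths shown are assumptions for illustration only.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy frame standing in for combined_df (illustrative values only)
toy = pd.DataFrame({"BMI": [40.0, 25.0], "source": ["original", "binary"]})

# A temporary folder stands in for the project's data/ directory
with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "data" / "combined_diabetes_data.csv"
    # Create the target folder first so to_csv does not fail on a missing path
    output_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(output_path, index=False)
    reloaded = pd.read_csv(output_path)

# Compare the reloaded frame against what was written
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```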

