# ETL Part 2: Feature Engineering, Encoding, and Scaling

## Overview

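This notebook is the second part of the ETL (Extract, Transform, Load) pipeline for the diabetes dataset project. It starts from the combined, cleaned dataset and covers feature engineering, encoding, and scaling.  
The pipeline as a whole prepares the data for downstream analysis and modeling by:  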
- Extracting data from multiple sources  
- Transforming it through cleaning and feature engineering  
- Loading the cleaned and processed dataset for further machine learning tasks  

The sections below walk through each step in order, with reproducibility and data quality in mind.

## 1. Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
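
The imports above include `OneHotEncoder` and `StandardScaler`, which the title's encoding and scaling steps presumably rely on. As a minimal sketch of how the two are commonly combined (not necessarily how this notebook applies them), the example below runs both through a `ColumnTransformer` on a toy frame. The column names `BMI` and `source` come from the project's dataset; the values, the `"synthetic"` category, and the choice of which columns to scale or encode are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: column names come from the project's dataset, values are invented.
# "synthetic" is a hypothetical category used only for illustration.
toy = pd.DataFrame({
    "BMI": [40.0, 25.0, 28.0, 27.0],
    "source": ["original", "original", "synthetic", "synthetic"],
})

# Scale the numeric column and one-hot encode the categorical one.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["BMI"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["source"]),
    ],
    sparse_threshold=0,
)

features = preprocessor.fit_transform(toy)
print(features.shape)  # one scaled column plus one column per category
```

When a train/test split is involved, the transformer is normally fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into the scaling.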

## 2. Load Cleaned Dataset

In [3]:
import os
print(os.getcwd())

/Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis/jupyter_notebooks


In [8]:
import os
# Check where the relative path resolves and whether the file exists there
print(os.path.abspath("../data/combined_cleaned_final.csv"))
print(os.path.exists("../data/combined_cleaned_final.csv"))

/Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis/data/combined_cleaned_final.csv
True


In [9]:
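import os
# Pin the working directory so that the relative "../data" path resolves.
# This absolute path is specific to the author's machine; adjust it before running elsewhere.
os.chdir("/Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis/jupyter_notebooks")
print("Current working directory:", os.getcwd())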

Current working directory: /Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis/jupyter_notebooks
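
Changing the working directory to an absolute path ties the notebook to one machine. One possible alternative, sketched below, is to search upward from the current directory for the data file. `find_data_file` is a hypothetical helper introduced here for illustration; it is not part of the original notebook, and it assumes the `data/` folder sits in the current directory or one of its parents.

```python
from pathlib import Path
from typing import Optional


def find_data_file(relative_path: str, start: Optional[Path] = None) -> Path:
    """Search `start` and each of its parents for `relative_path`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    base = (start or Path.cwd()).resolve()
    for directory in [base, *base.parents]:
        candidate = directory / relative_path
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f"{relative_path} not found in {base} or any parent directory"
    )


# Usage in the notebook (commented out because it depends on the local layout):
# file_path = find_data_file("data/combined_cleaned_final.csv")
# df = pd.read_csv(file_path)
```

With this approach the same cell works whether the notebook is launched from the project root or from `jupyter_notebooks`, as long as the `data` folder is reachable by walking up the directory tree.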


In [10]:
file_path = "../data/combined_cleaned_final.csv"
df = pd.read_csv(file_path)
print(f"Loaded cleaned dataset with shape: {df.shape}")
df.head()

Loaded cleaned dataset with shape: (528312, 24)


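| Unnamed: 0 | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | source | Diabetes_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 | original | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 | original | 0.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 | original | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 | original | 0.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 | original | 0.0 |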
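
The preview contains an `Unnamed: 0` column. pandas assigns that name when a CSV has an empty first header field, which typically happens when the file was written with `DataFrame.to_csv()` and its default `index=True`. Whether that is the origin here, and whether the column should be kept, is an assumption the source does not confirm. The sketch below uses a small in-memory CSV (invented values) to show two common ways of handling such a column.

```python
import io

import pandas as pd

# Stand-in for a CSV written with DataFrame.to_csv() and the default index=True;
# the leading empty header field is what pandas later labels "Unnamed: 0".
csv_text = ",BMI,source\n0,40.0,original\n1,25.0,original\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Option 1: treat the first column as the index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: drop the column after loading; errors="ignore" makes this a no-op
# if the column is absent.
dropped = raw.drop(columns=["Unnamed: 0"], errors="ignore")
print(list(indexed.columns), list(dropped.columns))
```

Either option keeps a leftover row counter from being treated as a feature during encoding, scaling, or model training.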
