# ETL Part 2: Feature Engineering, Encoding, and Scaling

## Overview

This notebook is the second part of the ETL (Extract, Transform, Load) pipeline for the diabetes dataset project.  
It takes the cleaned dataset produced by the previous ETL step and prepares it for downstream analysis and modeling by:  
- Engineering new features (BMI categories and age groups)  
- One-hot encoding the categorical columns  
- Standardising the numerical columns  
- Saving the ML-ready dataset and fitting a baseline logistic regression model  

The sections below walk through each step in order.

## 1. Import Required Libraries

In [None]:
# This cell imports essential libraries for data manipulation (`pandas`, `numpy`), preprocessing (`OneHotEncoder`, `StandardScaler`), model building (`LogisticRegression`), and evaluation (`classification_report`, `accuracy_score`). These tools will be used for feature engineering, encoding, scaling, and modeling.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

## 2. Load Cleaned Dataset

In [None]:
# This cell prints the current working directory to help ensure that relative file paths will work as expected for loading the cleaned dataset.
import os
print(os.getcwd())

/Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis/jupyter_notebooks


In [None]:
# This cell loads the cleaned dataset from the previous ETL step into a pandas DataFrame for feature engineering and encoding.
file_path = "../data/combined_cleaned_final.csv"
df = pd.read_csv(file_path)

In [None]:
# This cell prints the absolute path to the cleaned data file and checks if it exists, helping debug any file path issues.
import os
print(os.path.abspath("../data/combined_cleaned_final.csv"))
print(os.path.exists("../data/combined_cleaned_final.csv"))

/Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis/data/combined_cleaned_final.csv
True


In [None]:
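# This cell sets the working directory to the notebook folder and prints it, ensuring that subsequent file operations use the correct paths.
# Note: the absolute path below is specific to the author's machine; change it to match your own checkout.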
import os
os.chdir("/Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis/jupyter_notebooks")
print("Current working directory:", os.getcwd())

Current working directory: /Users/nasraibrahim/Documents/vscode-projects/diabetes-data-analysis/jupyter_notebooks
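
The hard-coded `os.chdir` path above only works on the machine it was written on. As a minimal sketch, assuming the notebook is run from inside `jupyter_notebooks` and the data lives in a sibling `data` folder, the file path could instead be derived with `pathlib`. The helper name `resolve_data_path` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def resolve_data_path(notebook_dir: Path, filename: str) -> Path:
    """Build the path to a file in the sibling `data` folder (hypothetical helper)."""
    # Go one level up from the notebook folder, then into `data`.
    return notebook_dir.parent / "data" / filename


# Path.cwd() is the notebook folder when Jupyter is started from it (an assumption).
data_path = resolve_data_path(Path.cwd(), "combined_cleaned_final.csv")
print(data_path, "exists:", data_path.exists())
```

The resulting `data_path` can be passed straight to `pd.read_csv`, so no change of working directory is needed.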


In [None]:
# This cell prints the shape of the loaded DataFrame and displays the first few rows to verify that the data has been loaded correctly.
file_path = "../data/combined_cleaned_final.csv"
df = pd.read_csv(file_path)
print(f"Loaded cleaned dataset with shape: {df.shape}")
df.head()

Loaded cleaned dataset with shape: (528312, 24)


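| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | source | Diabetes_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 | original | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 | original | 0.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 | original | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 | original | 0.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 | original | 0.0 |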


## 3. Feature Engineering

- In this step, we create new meaningful features to enhance the dataset and potentially improve model performance.
- This can include creating interaction features, aggregating related variables, or transforming existing ones.

In [11]:
# Example: Create a new feature - BMI category
def bmi_category(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif 18.5 <= bmi < 25:
        return 'Normal'
    elif 25 <= bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['BMI_Category'] = df['BMI'].apply(bmi_category)

# Another example: Age groups
def age_group(age):
    if age <= 3:  # assuming Age holds ordinal category codes rather than years (e.g. 1-13; values such as 11 appear in the data)
        return 'Young'
    elif 4 <= age <= 7:
        return 'Middle-aged'
    else:
        return 'Senior'

df['Age_Group'] = df['Age'].apply(age_group)

df[['BMI', 'BMI_Category', 'Age', 'Age_Group']].head()

| | BMI | BMI_Category | Age | Age_Group |
|---|---|---|---|---|
| 0 | 40.0 | Obese | 9.0 | Senior |
| 1 | 25.0 | Overweight | 7.0 | Middle-aged |
| 2 | 28.0 | Overweight | 9.0 | Senior |
| 3 | 27.0 | Overweight | 11.0 | Senior |
| 4 | 24.0 | Normal | 11.0 | Senior |
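
The `bmi_category` and `age_group` helpers are applied row by row, which can be slow on roughly half a million rows. As a hedged alternative, the same bins could be expressed with `pd.cut`. The sketch below uses a small toy DataFrame (illustrative values, not the real data) and assumes `Age` holds integer category codes.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real DataFrame (values are illustrative).
toy = pd.DataFrame({"BMI": [40.0, 25.0, 24.0, 17.0], "Age": [9.0, 7.0, 11.0, 2.0]})

# right=False gives left-closed bins such as [18.5, 25), matching bmi_category.
toy["BMI_Category"] = pd.cut(
    toy["BMI"],
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)

# The default right=True gives right-closed bins such as (3, 7],
# matching age_group for integer-coded ages.
toy["Age_Group"] = pd.cut(
    toy["Age"],
    bins=[-np.inf, 3, 7, np.inf],
    labels=["Young", "Middle-aged", "Senior"],
)

print(toy)
```

One behavioural difference is worth noting: `pd.cut` leaves missing values as `NaN`, whereas the `apply`-based helpers send a missing BMI to the `else` branch and label it `'Obese'`.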


## 4. Data Encoding

- Convert categorical variables into a machine-readable format.
- We one-hot encode the categorical columns `BMI_Category`, `Age_Group`, `source`, `Education`, and `Income`, dropping the first level of each to avoid redundant columns.

In [12]:
# Select categorical columns to encode
categorical_cols = ['BMI_Category', 'Age_Group', 'source', 'Education', 'Income']

# One-hot encode using pandas.get_dummies (simpler for demo)
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print(f"Data shape after encoding: {df_encoded.shape}")
df_encoded.head()

Data shape after encoding: (528312, 40)


| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | Education_4.0 | Education_5.0 | Education_6.0 | Income_2.0 | Income_3.0 | Income_4.0 | Income_5.0 | Income_6.0 | Income_7.0 | Income_8.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | True | False | False | False | True | False | False | False | False | False |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | False | False | True | False | False | False | False | False | False | False |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | True | False | False | False | False | False | False | False | False | True |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | False | True | False | False | False | True | False | False | False | False |


## 5. Scaling Numerical Features

- Standardise numerical features so that each has zero mean and unit variance, which keeps them on comparable scales for models such as logistic regression.
- We apply `StandardScaler` to `BMI`, `MentHlth`, `PhysHlth`, and `Age`.

In [13]:
# Numerical columns to scale
numerical_cols = ['BMI', 'MentHlth', 'PhysHlth', 'Age']  # Adjust as needed

scaler = StandardScaler()
df_encoded[numerical_cols] = scaler.fit_transform(df_encoded[numerical_cols])

df_encoded.head()

| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | Education_4.0 | Education_5.0 | Education_6.0 | Income_2.0 | Income_3.0 | Income_4.0 | Income_5.0 | Income_6.0 | Income_7.0 | Income_8.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.62754 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | True | False | False | False | True | False | False | False | False | False |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | -0.562466 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | False | False | True | False | False | False | False | False | False | False |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | -0.124464 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | True | False | False | False | False | False | False | False | False | True |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | -0.270465 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | -0.708466 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | False | True | False | False | False | True | False | False | False | False |
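
Here the scaler is fitted on the full dataset before any train/test split, so the means and standard deviations include rows that later end up in the test set. A common alternative is to fit the scaler on the training rows only and reuse it on the test rows. The sketch below shows that ordering on synthetic data; the column values are randomly generated and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (values are made up).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "BMI": rng.normal(28, 6, size=200),
    "MentHlth": rng.integers(0, 31, size=200).astype(float),
    "target": rng.integers(0, 2, size=200),
})
numerical_cols = ["BMI", "MentHlth"]

# Split first, so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    toy[numerical_cols], toy["target"], test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=numerical_cols, index=X_train.index
)
# transform (not fit_transform) reuses the training mean and standard deviation.
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=numerical_cols, index=X_test.index
)

print(X_train_scaled.mean().round(6))
```

With this ordering the training columns have mean zero by construction, while the test columns will be close to zero but not exactly zero.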


## 6. Save the ML-Ready Dataset

After completing the feature engineering and encoding steps, it’s important to save the processed dataset for future modeling and analysis.

Since the current notebook is located inside the jupyter_notebooks folder, and the data folder is one level above, the save path uses "../data/" to navigate up one directory before accessing the data folder.

The code also ensures that the data directory exists and creates it if necessary to avoid errors during saving.

This saved dataset (combined_ml_ready.csv) contains all transformed, encoded, and scaled features, ready to be used for machine learning tasks.

In [19]:
import os

# Define output file path for the ML-ready dataset (one directory above)
output_file_ml = "../data/combined_ml_ready.csv"

# Ensure the 'data' directory exists (create if it doesn't)
os.makedirs("../data", exist_ok=True)

# Save the ML-ready dataframe to CSV
df_encoded.to_csv(output_file_ml, index=False)

print(f"ML-ready dataset saved to {output_file_ml}")

ML-ready dataset saved to ../data/combined_ml_ready.csv


## 6.1 Saving a Sample ML-Ready Dataset

The full ML-ready dataset (`combined_ml_ready.csv`) is about 130 MB, which exceeds GitHub's 100 MB per-file limit, so it cannot be pushed directly to the repository.

To address this, we save the full dataset locally for use in modeling and analysis, **but do not push it to GitHub**.

Additionally, we create a smaller random sample (10% of the data) and save it as `combined_ml_ready_sample.csv`. This smaller dataset is lightweight enough to be uploaded to GitHub, allowing others to review the structure and sample data without the burden of a large file.

This approach balances reproducibility with platform constraints.

In [21]:
import os

# 1. Save full ML-ready dataset locally (not pushed to GitHub)
output_file_ml = "../data/combined_ml_ready.csv"
os.makedirs("../data", exist_ok=True)
df_encoded.to_csv(output_file_ml, index=False)
print(f"Full ML-ready dataset saved to {output_file_ml}")

# 2. Save a smaller sample dataset (10%) to push to GitHub
df_sample = df_encoded.sample(frac=0.1, random_state=42)
sample_output_file = "../data/combined_ml_ready_sample.csv"
df_sample.to_csv(sample_output_file, index=False)
print(f"Sample ML-ready dataset saved to {sample_output_file}")

Full ML-ready dataset saved to ../data/combined_ml_ready.csv
Sample ML-ready dataset saved to ../data/combined_ml_ready_sample.csv
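
Before committing a sample file, it can help to confirm that it really is below the size limit. The sketch below is a minimal illustration that writes a toy sample to a temporary folder and checks its size. The helper `file_size_mb` and the constant `GITHUB_FILE_LIMIT_MB` are hypothetical names introduced here, and the data is made up.

```python
import os
import tempfile

import pandas as pd

GITHUB_FILE_LIMIT_MB = 100  # per-file limit assumed for this check


def file_size_mb(path: str) -> float:
    """Return the size of a file in megabytes (hypothetical helper)."""
    return os.path.getsize(path) / (1024 * 1024)


# Toy frame standing in for df_encoded (values are illustrative).
toy = pd.DataFrame({"BMI": range(1000), "Diabetes_binary": [0, 1] * 500})
sample = toy.sample(frac=0.1, random_state=42)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    sample.to_csv(sample_path, index=False)
    size_mb = file_size_mb(sample_path)
    safe_to_push = size_mb < GITHUB_FILE_LIMIT_MB
    print(f"{size_mb:.4f} MB, safe to push: {safe_to_push}")
```

In the notebook itself, the same check could be pointed at `../data/combined_ml_ready_sample.csv` after it has been written.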


## 7. Basic Evaluation and Modeling (Logistic Regression)

In this section, we apply logistic regression to the preprocessed and encoded dataset to build a predictive model for diabetes classification.

- Define the target variable and features.
- Split the dataset into training and testing subsets.
- Train a logistic regression model on the training data.
- Evaluate the model's performance using accuracy and the classification report (precision, recall, and F1-score).

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

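# Define target and features
# Note: X still contains `Diabetes_012`, which records the same diagnosis as the target.
# Where the two labels agree, it leaks the answer to the model, so consider dropping it as well.
target = 'Diabetes_binary'
X = df_encoded.drop(columns=[target])
y = df_encoded[target]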

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions and evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9024540283732243

Classification Report:
               precision    recall  f1-score   support

         0.0       0.92      0.97      0.95     91654
         1.0       0.69      0.47      0.56     14009

    accuracy                           0.90    105663
   macro avg       0.81      0.72      0.75    105663
weighted avg       0.89      0.90      0.89    105663
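
The report above shows that only about 13% of the test rows are positive (14,009 of 105,663) and that recall for that class is 0.47. Two options that are often tried in this situation are a stratified split and class weighting. The sketch below illustrates both on synthetic data from `make_classification`. It is not the notebook's data, and the scores it prints say nothing about how the real model would behave.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 13% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.87, 0.13], random_state=42
)

# stratify=y keeps the class ratio the same in the train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# class_weight="balanced" reweights errors inversely to class frequency.
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

plain_recall = recall_score(y_test, plain.predict(X_test))
balanced_recall = recall_score(y_test, balanced.predict(X_test))
print(f"Positive-class recall, plain: {plain_recall:.2f}, balanced: {balanced_recall:.2f}")
print(classification_report(y_test, balanced.predict(X_test)))
```

Class weighting typically trades some precision for recall, so the choice depends on whether missed diabetes cases or false alarms are more costly for the project.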



## Conclusion
In this ETL and feature engineering pipeline, we combined multiple diabetes datasets, cleaned and preprocessed the data, handled missing values and duplicates, and engineered new features. The dataset was then encoded and scaled to prepare it for machine learning models.

For initial predictive modeling, Logistic Regression was implemented as a baseline classification model. It reached about 90% accuracy on the held-out test set, but recall for the positive class was only 0.47, so it serves mainly as a performance benchmark for later models.

The full ML-ready dataset has been saved locally for use in further modeling and analysis.

Note: Because the full ML-ready dataset is large (~130MB) and exceeds GitHub’s file size limits, only a 10% random sample has been saved and pushed to GitHub. This allows collaborators and reviewers to inspect the dataset structure and sample records without the overhead of the full dataset.

## Next Steps
Moving forward, the next phase of the project will encompass both advanced machine learning modeling and data visualisation to deepen insights and improve predictive performance.

* Linear Regression:
Although diabetes prediction is a classification problem, Linear Regression will be applied to demonstrate its application and limitations in this context. This will help compare regression-based predictions with classification approaches.

* Advanced Classification Models:
Models such as Random Forest and XGBoost will be developed to improve predictive accuracy. These ensemble methods are well-suited for handling complex interactions and non-linear relationships in the dataset.

* Model Evaluation and Comparison:
Each model’s performance will be evaluated using appropriate metrics (accuracy, precision, recall, F1-score, ROC-AUC for classification) to identify the best performing approach.

* Further Feature Engineering and Hyperparameter Tuning:
Explore additional feature creation and optimise model parameters to enhance predictive power.

* Visualisation and Reporting:
Use EDA and Tableau to visualise key findings and support decision-making.

By following these next steps, the project aims to deliver a robust predictive solution for diabetes risk assessment while demonstrating a comprehensive understanding of both regression and classification modeling techniques.

