# Feature Engineering with Pipelines (Assignment 4.3)

In this notebook, I prepare the curated STEDI Silver dataset for machine learning by engineering features that models can interpret. This includes creating train/test splits, scaling numeric features, one-hot encoding categorical features, and combining all preprocessing steps into a single reproducible scikit-learn Pipeline.


## Step 1: Import Libraries and Load Curated Dataset

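In this step, I load the curated Silver dataset (`labeled_step_test`) from Databricks using Spark and convert it into a pandas DataFrame. The conversion is needed because scikit-learn operates on in-memory data structures such as pandas DataFrames and NumPy arrays. Note that `toPandas()` collects the entire table onto the driver, so this approach assumes the dataset fits in driver memory.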


In [0]:
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

df_spark = spark.table("labeled_step_test")
df = df_spark.toPandas()

df.head()


## Step 2: Define Feature Columns

Here I define which columns will be used as model inputs (features) and which column represents the label. Numeric and categorical features are separated so they can be preprocessed differently.


In [0]:
# Numeric features
feature_cols_numeric = ["distance_cm"]

# Categorical features
feature_cols_categorical = ["sensor_type", "device_id"]

# Label
label_col = "step_label"


## Step 3: Create Train/Test Split

The dataset is split into training and testing sets. Stratification is used to preserve the proportion of step and no_step labels in both sets, which helps prevent bias during model evaluation.


In [0]:
from sklearn.model_selection import train_test_split

X = df[feature_cols_numeric + feature_cols_categorical]
y = df[label_col]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
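
To illustrate what `stratify=y` does, here is a minimal sketch on made-up data. The toy frame and its 80/20 label balance are assumptions for illustration only, not the STEDI dataset. It compares the label proportions in the resulting train and test sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 "no_step" rows and 20 "step" rows (not the STEDI data).
toy = pd.DataFrame({
    "distance_cm": range(100),
    "step_label": ["no_step"] * 80 + ["step"] * 20,
})

X_toy = toy[["distance_cm"]]
y_toy = toy["step_label"]

# Same split settings as the cell above, applied to the toy data.
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of each label in the two splits; with stratification they should match.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Running the same comparison on `y_train` and `y_test` from the real split is a quick way to confirm that the class balance was preserved.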

## Step 4: Define Preprocessing Transformers

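Numeric features are scaled using `StandardScaler`, which centers each feature to zero mean and scales it to unit variance, while categorical features are one-hot encoded with `OneHotEncoder`. Setting `handle_unknown="ignore"` means a category that was not seen during fitting is encoded as all zeros instead of raising an error. A `ColumnTransformer` routes each column group to its transformer, so the features end up on comparable scales and in a numeric form that machine-learning algorithms can use.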


In [0]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()

categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, feature_cols_numeric),
        ("cat", categorical_transformer, feature_cols_categorical)
    ]
)

## Step 5: Build Feature Engineering Pipeline

The preprocessing steps are combined into a single scikit-learn Pipeline. This ensures that the same transformations are applied consistently every time the pipeline is run.


In [0]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ("preprocess", preprocessor)
])

## Step 6: Fit Pipeline and Transform Data

The pipeline is fit on the training data only, then applied to both training and test sets. This prevents data leakage and ensures fair evaluation.

In [0]:
pipeline.fit(X_train)

X_train_transformed = pipeline.transform(X_train)
X_test_transformed = pipeline.transform(X_test)

X_train_transformed, X_test_transformed
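
As a rough illustration of why fitting on the training data alone matters, the sketch below fits a one-step toy pipeline on made-up training values and then transforms a made-up test value. All numbers and the `toy_pipeline` name are assumptions for illustration. The scaling statistics come only from the training rows, so the test row has no influence on them.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical values (not the STEDI data).
train_vals = np.array([[10.0], [20.0], [30.0]])
test_vals = np.array([[100.0]])

toy_pipeline = Pipeline(steps=[("scale", StandardScaler())])
toy_pipeline.fit(train_vals)  # statistics are learned from the training rows only

# The fitted step can be inspected by name.
fitted_scaler = toy_pipeline.named_steps["scale"]
print(fitted_scaler.mean_)  # mean of the training values

train_scaled = toy_pipeline.transform(train_vals)
test_scaled = toy_pipeline.transform(test_vals)

# Training data is centered at zero; the test row is scaled with the
# training mean and standard deviation, so it is not centered.
print(train_scaled.mean(), test_scaled[0, 0])
```

Calling `fit` on the combined train and test rows would instead shift the mean toward the test value, which is the leakage this step is meant to avoid.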


## Step 7: Save Pipeline and Processed Features

The fitted pipeline, the transformed feature matrices, and the label splits are saved using joblib so they can be reused later for model training and evaluation.


In [0]:
import joblib
import os

# Save locally first
joblib.dump(pipeline, "/tmp/stedi_feature_pipeline.pkl")
joblib.dump(X_train_transformed, "/tmp/X_train_transformed.pkl")
joblib.dump(X_test_transformed, "/tmp/X_test_transformed.pkl")
joblib.dump(y_train, "/tmp/y_train.pkl")
joblib.dump(y_test, "/tmp/y_test.pkl")

# Show saved files
os.listdir("/tmp/")
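
For reuse later, a saved pipeline can be restored with `joblib.load`. The sketch below is a self-contained round trip on a toy pipeline written to a temporary directory. The file name `toy_pipeline.pkl` and the data are hypothetical, and the real artifacts would be loaded from the paths used in the cell above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit a small stand-in pipeline on made-up values.
toy_pipeline = Pipeline(steps=[("scale", StandardScaler())])
toy_pipeline.fit(np.array([[1.0], [2.0], [3.0]]))

# Save and reload inside a temporary directory that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should transform new rows exactly like the original.
new_rows = np.array([[2.0], [4.0]])
same_output = np.allclose(
    toy_pipeline.transform(new_rows), restored.transform(new_rows)
)
print(same_output)
```

Pickled scikit-learn objects are generally expected to be loaded with the same scikit-learn version that saved them, so it helps to record the library version alongside the artifacts.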

## Ethics Reflection

Using a consistent and reproducible feature pipeline supports fairness by ensuring that every record is processed the same way each time, which removes inconsistent preprocessing as a hidden source of bias. Scaling numeric features keeps a feature from dominating a model simply because of its units or range, and encoding categories consistently keeps the same category mapped to the same columns during training and at prediction time. Reproducible pipelines are also easier to audit and validate, which increases accountability. Spiritually, Doctrine and Covenants 130:20–21 teaches that blessings follow dependable laws, reminding me that fairness and consistency, both in data and in life, come from following reliable, repeatable principles.


In [0]:
import os
import joblib

ARTIFACT_DIR = "/Workspace/Users/dec816@ensign.edu/csai382_lab_2_4_-DesmondChaparadza-/etl_pipeline"
os.makedirs(ARTIFACT_DIR, exist_ok=True)

# Re-serialize the in-memory objects into the repo folder (the /tmp copies are not read here)
joblib.dump(pipeline, f"{ARTIFACT_DIR}/stedi_feature_pipeline.pkl")
joblib.dump(X_train_transformed, f"{ARTIFACT_DIR}/X_train_transformed.pkl")
joblib.dump(X_test_transformed, f"{ARTIFACT_DIR}/X_test_transformed.pkl")
joblib.dump(y_train, f"{ARTIFACT_DIR}/y_train.pkl")
joblib.dump(y_test, f"{ARTIFACT_DIR}/y_test.pkl")

# Verify
os.listdir(ARTIFACT_DIR)


In [0]:
print(os.listdir(ARTIFACT_DIR))


In [0]:
import os
import joblib

# Save inside the Workspace repo folder (same directory as ARTIFACT_DIR in the earlier cell)
SAVE_DIR = "/Workspace/Users/dec816@ensign.edu/csai382_lab_2_4_-DesmondChaparadza-/etl_pipeline"
os.makedirs(SAVE_DIR, exist_ok=True)

joblib.dump(pipeline, f"{SAVE_DIR}/stedi_feature_pipeline.pkl")
joblib.dump(X_train_transformed, f"{SAVE_DIR}/X_train_transformed.pkl")
joblib.dump(X_test_transformed, f"{SAVE_DIR}/X_test_transformed.pkl")
joblib.dump(y_train, f"{SAVE_DIR}/y_train.pkl")
joblib.dump(y_test, f"{SAVE_DIR}/y_test.pkl")

print("Saved files:", os.listdir(SAVE_DIR))
