# Assignment 4.3 - Feature Pipelines
### Geovanny Peña Rueda

In this notebook, we prepare the curated STEDI dataset for machine learning by:
- Creating train/test splits
- Scaling numeric features
- One-hot encoding categorical features
- Combining preprocessing steps into a scikit-learn Pipeline
- Saving the processed data and pipeline for later model training


1. Import Libraries and Load Your Curated Dataset

In [0]:
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Load the curated dataset by its fully qualified name (catalog.schema.table)
df_spark = spark.table("workspace.silver.labeled_step_test")

# Convert to pandas for scikit-learn; toPandas() collects the whole table
# onto the driver, so this assumes the curated data fits in memory
df = df_spark.toPandas()

df.head()


2. Define Your Feature Columns

In [0]:
# Feature columns
feature_cols_numeric = ["distance_cm"]
feature_cols_categorical = ["sensor_type", "device_id"]

# Label column
label_col = "step_label"


3. Create a Train/Test Split

In [0]:
from sklearn.model_selection import train_test_split

X = df[feature_cols_numeric + feature_cols_categorical]
y = df[label_col]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
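To illustrate what `stratify=y` does, here is a minimal sketch on made-up data. The label values (`"no_step"`, `"step"`) and the 80/20 class balance are assumptions chosen for illustration and are not taken from the STEDI table.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 "no_step" rows and 20 "step" rows.
toy = pd.DataFrame({
    "distance_cm": range(100),
    "step_label": ["no_step"] * 80 + ["step"] * 20,
})

X_toy = toy[["distance_cm"]]
y_toy = toy["step_label"]

# Same arguments as the real split above, applied to the toy data.
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the original class proportions.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share.to_dict())
print(test_share.to_dict())
```

Without `stratify`, a small or imbalanced dataset could end up with a test set that over- or under-represents one class, which would make later evaluation less reliable.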


4. Build Preprocessing Steps

In [0]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scale numeric columns to zero mean and unit variance
numeric_transformer = StandardScaler()

In [0]:
# One-hot encode categorical columns; a category not seen during fit
# is encoded as all zeros instead of raising an error

categorical_transformer = OneHotEncoder(handle_unknown="ignore")

In [0]:
# Combine into a single transformer

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, feature_cols_numeric),
        ("cat", categorical_transformer, feature_cols_categorical)
    ]
)
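The effect of `handle_unknown="ignore"` can be sketched in isolation. The sensor names below (`"ultrasonic"`, `"infrared"`, `"lidar"`) are hypothetical values used only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories: the encoder only sees two sensor types during fit.
train_cats = pd.DataFrame({"sensor_type": ["ultrasonic", "ultrasonic", "infrared"]})
# "lidar" never appeared in the training data.
new_cats = pd.DataFrame({"sensor_type": ["infrared", "lidar"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)

# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.transform(new_cats).toarray()
print(encoder.categories_)
print(encoded)
```

The known category gets its usual one-hot column, while the unseen one becomes a row of zeros. With the default `handle_unknown="error"`, the same `transform` call would raise instead, which matters if new devices or sensor types show up after training.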

5. Build a Scikit-Learn Pipeline

In [0]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ("preprocess", preprocessor)
])


6. Fit the Pipeline and Transform the Data

In [0]:
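# Fit on the training split only, so the scaler statistics and encoder
# categories are learned without looking at the test data
pipeline.fit(X_train)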

X_train_transformed = pipeline.transform(X_train)
X_test_transformed = pipeline.transform(X_test)


7. Save Processed Feature Set and Pipeline

In [0]:
import os
import joblib

# Output folder, relative to the notebook's current working directory.
base_path = "./etl_pipeline"
os.makedirs(base_path, exist_ok=True)

# Save the pipeline and transformed datasets.
joblib.dump(pipeline, f"{base_path}/stedi_feature_pipeline.pkl")
joblib.dump(X_train_transformed, f"{base_path}/X_train_transformed.pkl")
joblib.dump(X_test_transformed, f"{base_path}/X_test_transformed.pkl")
joblib.dump(y_train, f"{base_path}/y_train.pkl")
joblib.dump(y_test, f"{base_path}/y_test.pkl")

print("All files were saved successfully in:", base_path)
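As a sanity check for the save step, a saved object can be reloaded with `joblib.load` and compared against the original. The sketch below uses a toy `StandardScaler` and a temporary directory (both stand-ins chosen for this example) so it does not touch the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted pipeline.
fitted = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

# Round-trip the fitted object through a file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_scaler.pkl")
    joblib.dump(fitted, path)
    reloaded = joblib.load(path)

# The reloaded object should transform new data exactly like the original.
sample = np.array([[2.0], [4.0]])
same_output = np.allclose(fitted.transform(sample), reloaded.transform(sample))
print(same_output)
```

The same pattern should apply to `stedi_feature_pipeline.pkl`. Pickled scikit-learn objects are generally best reloaded with the same scikit-learn version that saved them, and only from sources you trust.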



## Ethical Reflection
Using a consistent, repeatable feature pipeline means that a machine learning model receives data transformed in exactly the same way at training time and at prediction time. This reduces the risk that ad hoc or inconsistent preprocessing introduces errors or bias into the model's results, which supports fairer decisions. It also means that any analysis can be reviewed and replicated, increasing confidence in the model. From a spiritual perspective, this reflects the principle of obedience taught in Doctrine and Covenants 130:20–21: blessings are predicated upon eternal laws, and our Heavenly Father, who is the same yesterday, today, and forever, blesses His children according to their obedience to those laws. By applying consistent habits in data management, we promote fairness and accountability in our work.