# Baseline Machine Learning Models (Assignment 4.4)

In this notebook, I train and evaluate baseline machine learning models using the processed STEDI dataset. These baseline models help establish a performance reference point before applying hyperparameter tuning or model optimization.


## Step 1: Load Feature Pipeline and Transformed Data

In this step, I load the saved feature preprocessing pipeline and the transformed training and testing datasets. These datasets were generated in the previous feature engineering assignment and are required for consistent model training.

In [0]:
import joblib
import os

ARTIFACT_DIR = "/Workspace/Users/dec816@ensign.edu/csai382_lab_2_4_-DesmondChaparadza-/etl_pipeline"

# Sanity check (this MUST show the files)
print(os.listdir(ARTIFACT_DIR))

pipeline = joblib.load(f"{ARTIFACT_DIR}/stedi_feature_pipeline.pkl")
X_train_transformed = joblib.load(f"{ARTIFACT_DIR}/X_train_transformed.pkl")
X_test_transformed = joblib.load(f"{ARTIFACT_DIR}/X_test_transformed.pkl")
y_train = joblib.load(f"{ARTIFACT_DIR}/y_train.pkl")
y_test = joblib.load(f"{ARTIFACT_DIR}/y_test.pkl")
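
Since `joblib.load` stops at the first file it cannot find, it can help to check for every expected artifact up front. The following is a minimal sketch that assumes the five file names used in the cell above; `missing_artifacts` and `EXPECTED_ARTIFACTS` are hypothetical names introduced for illustration, not part of joblib.

```python
import os

# File names taken from the loading cell above.
EXPECTED_ARTIFACTS = [
    "stedi_feature_pipeline.pkl",
    "X_train_transformed.pkl",
    "X_test_transformed.pkl",
    "y_train.pkl",
    "y_test.pkl",
]

def missing_artifacts(directory, names=EXPECTED_ARTIFACTS):
    """Return the expected file names that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in names if name not in present]
```

Calling `missing_artifacts(ARTIFACT_DIR)` should return an empty list when all five files are in place.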


In [0]:
import os
print(os.getcwd())

In [0]:
import numpy as np
from scipy.sparse import issparse


## Step 2: Convert Transformed Features to Numeric Matrices

Because transformed feature arrays may be sparse, object-based, or inconsistently shaped, this helper function ensures all inputs are converted into clean numeric 2D arrays suitable for machine-learning models.


In [0]:
def to_float_matrix(arr) -> np.ndarray:
    """
    Convert transformed features (dense, sparse, object-dtype, or a 0-d
    object wrapper) to a dense float array for model training.
    """
    # Check for sparse input first: scipy sparse matrices are not ndarrays.
    if issparse(arr):
        return arr.toarray().astype(float)
    arr = np.asarray(arr)
    if arr.ndim == 0:
        # A 0-d array wraps a single object; unwrap it before converting.
        inner = arr.item()
        return inner.toarray().astype(float) if issparse(inner) else np.array(inner, dtype=float)
    if arr.dtype == object:
        # Each element is one row, possibly sparse; densify, then stack.
        rows = [
            x.toarray() if issparse(x) else np.atleast_2d(np.asarray(x, dtype=float))
            for x in arr
        ]
        return np.vstack(rows).astype(float)
    return arr.astype(float)
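
To make the object-dtype case concrete, here is a small sketch of what such an array looks like and how stacking its rows gives a numeric matrix. The toy values are made up for illustration and are not STEDI data.

```python
import numpy as np

# Build a 1-D object array whose elements are row vectors,
# mimicking a saved feature set that lost its numeric dtype.
rows = np.empty(2, dtype=object)
rows[0] = np.array([1, 2, 3])
rows[1] = np.array([4, 5, 6])

# Convert each row to float and stack the rows into one 2-D matrix.
dense = np.vstack([np.asarray(row, dtype=float) for row in rows])

print(rows.dtype, rows.shape)    # object (2,)
print(dense.dtype, dense.shape)  # float64 (2, 3)
```

This mirrors the row-stacking idea in the object-dtype branch of `to_float_matrix`.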


## Step 2B: Apply Conversion and Verify Shapes

This converts the train and test features to numeric arrays and verifies that the shapes match expectations before training.

In [0]:
X_train = to_float_matrix(X_train_transformed)
X_test = to_float_matrix(X_test_transformed)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
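
Reading the shapes by eye works, but the same expectations can also be written as explicit checks. Below is a minimal sketch; `check_split_shapes` is a hypothetical helper, and the toy arrays only stand in for the real STEDI data.

```python
import numpy as np

def check_split_shapes(X_train, X_test, y_train, y_test):
    """Raise ValueError if the train/test arrays are inconsistent."""
    if X_train.ndim != 2 or X_test.ndim != 2:
        raise ValueError("Feature arrays must be 2-D")
    if X_train.shape[1] != X_test.shape[1]:
        raise ValueError("Train and test must have the same number of features")
    if X_train.shape[0] != y_train.shape[0]:
        raise ValueError("X_train and y_train row counts differ")
    if X_test.shape[0] != y_test.shape[0]:
        raise ValueError("X_test and y_test row counts differ")

# Toy arrays standing in for the real data: this call passes silently.
check_split_shapes(np.zeros((8, 3)), np.zeros((2, 3)), np.zeros(8), np.zeros(2))
```

In the notebook, `check_split_shapes(X_train, X_test, y_train, y_test)` would fail early if a saved artifact came from a different pipeline run.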



## Step 3: Train Logistic Regression Baseline Model

Logistic Regression provides a simple and interpretable baseline model. It helps establish how well linear decision boundaries perform on the engineered features.


In [0]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=300)
log_reg.fit(X_train, y_train)

log_reg_score = log_reg.score(X_test, y_test)
log_reg_score
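
The `score` method returns a single accuracy value, which does not show which classes are being confused. A confusion matrix is one way to look closer. The sketch below uses only NumPy; `confusion_counts` is a hypothetical helper and the labels are toy values, not STEDI results.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (labels, matrix) where matrix[i, j] counts true label i predicted as j."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    matrix = np.zeros((labels.size, labels.size), dtype=int)
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            matrix[i, j] = np.sum((y_true == true_label) & (y_pred == pred_label))
    return labels, matrix

# Toy labels for illustration only.
labels, matrix = confusion_counts([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
print(matrix)
# [[1 1]
#  [1 2]]
```

In the notebook this could be called as `confusion_counts(y_test, log_reg.predict(X_test))`. scikit-learn's `sklearn.metrics.confusion_matrix` provides the same counts.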



## Step 4: Train Random Forest Baseline Model

Random Forest is a non-linear ensemble of decision trees that often handles noisy sensor data better than a linear model.


In [0]:
from sklearn.ensemble import RandomForestClassifier

# Fixed seed so the baseline score is reproducible across runs.
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

rf_score = rf.score(X_test, y_test)
rf_score


## Step 5: Compare Baseline Model Performance

This displays both baseline accuracies side by side for comparison.


In [0]:
results = {
    "Logistic Regression baseline": log_reg_score,
    "Random Forest baseline": rf_score
}

results
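
Accuracy is easier to interpret next to a trivial reference: the score obtained by always predicting the most frequent training label. The sketch below is one way to compute it with NumPy; `majority_class_accuracy` is a hypothetical helper and the labels are toy values, not STEDI results.

```python
import numpy as np

def majority_class_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    labels, counts = np.unique(np.asarray(y_train), return_counts=True)
    majority_label = labels[np.argmax(counts)]
    return float(np.mean(np.asarray(y_test) == majority_label))

# Toy labels: 3 of the 4 test labels match the majority training class (1).
print(majority_class_accuracy([1, 1, 1, 0], [1, 1, 0, 1]))  # 0.75
```

If either model scores close to `majority_class_accuracy(y_train, y_test)`, class imbalance may explain much of its accuracy. scikit-learn's `DummyClassifier(strategy="most_frequent")` offers the same reference as an estimator.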


## Step 6: Save Trained Baseline Models
This saves both trained models so they can be submitted as a ZIP file, per the assignment instructions.


In [0]:
import joblib
import os

MODEL_DIR = "/Workspace/Users/dec816@ensign.edu/csai382_lab_2_4_-DesmondChaparadza-/outputs/models"
os.makedirs(MODEL_DIR, exist_ok=True)

joblib.dump(log_reg, f"{MODEL_DIR}/log_reg_model.pkl")
joblib.dump(rf, f"{MODEL_DIR}/random_forest_model.pkl")

os.listdir(MODEL_DIR)



In [0]:
import zipfile

zip_path = f"{MODEL_DIR}/baseline_models.zip"
with zipfile.ZipFile(zip_path, "w") as z:
    z.write(f"{MODEL_DIR}/log_reg_model.pkl", arcname="log_reg_model.pkl")
    z.write(f"{MODEL_DIR}/random_forest_model.pkl", arcname="random_forest_model.pkl")

zip_path
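
Before submitting, it may be worth confirming that the archive actually contains both model files. The sketch below uses only the standard library; `zip_members` is a hypothetical helper, and the demo writes placeholder files to a temporary directory rather than touching the real models.

```python
import os
import tempfile
import zipfile

def zip_members(zip_path):
    """Return the sorted file names stored in the archive at `zip_path`."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())

# Self-contained demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("log_reg_model.pkl", b"placeholder")
        archive.writestr("random_forest_model.pkl", b"placeholder")
    demo_members = zip_members(demo_zip)

print(demo_members)  # ['log_reg_model.pkl', 'random_forest_model.pkl']
```

In the notebook, `zip_members(zip_path)` should list the same two file names.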


## Baseline Model Analysis

The Random Forest model performed better than Logistic Regression, as shown by its higher accuracy score on the test dataset. Random Forest appears more stable for noisy sensor data because it combines many decision trees and reduces the impact of outliers or small fluctuations in individual readings.

One question I have is how much of the accuracy difference is driven by non-linear feature interactions versus class imbalance. Another question is whether feature importance scores would reveal sensor patterns that Logistic Regression cannot capture.

Testing models before real-world use is critical because incorrect predictions could affect users, developers, or organizations relying on the system. Fairness matters in both data science and discipleship because consistent, careful evaluation helps prevent harm and ensures decisions are made responsibly and with accountability.
