# Assignment 4.4 – Trained ML Models: Compare Models
### Geovanny Peña Rueda
In this assignment, I train two baseline machine learning models on the transformed STEDI dataset.
I fit a Logistic Regression model and a Random Forest model, then compare their accuracy on the held-out test set.


### 1. Import Libraries and Load Feature Pipeline

In [0]:
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import issparse

pipeline = joblib.load("./etl_pipeline/stedi_feature_pipeline.pkl")

X_train_transformed = joblib.load("./etl_pipeline/X_train_transformed.pkl")
X_test_transformed = joblib.load("./etl_pipeline/X_test_transformed.pkl")

y_train = joblib.load("./etl_pipeline/y_train.pkl")
y_test = joblib.load("./etl_pipeline/y_test.pkl")


In [0]:
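def to_float_matrix(arr) -> np.ndarray:
    """
    Convert a loaded feature container (dense, sparse, object-dtype,
    or a 0-d object array wrapping a matrix) to a dense float matrix.
    """
    # Handle sparse input first: SciPy sparse matrices are not ndarrays.
    if issparse(arr):
        return arr.toarray().astype(float)
    arr = np.asarray(arr)
    if arr.ndim == 0:
        # A single object can end up wrapped in a 0-d array; unwrap it.
        inner = arr.item()
        if issparse(inner):
            return inner.toarray().astype(float)
        return np.asarray(inner, dtype=float)
    if arr.dtype == object:
        # Each element is one row (sparse or array-like); stack rows into 2-D.
        rows = [x.toarray() if issparse(x) else np.asarray(x, dtype=float) for x in arr]
        return np.vstack(rows).astype(float)
    return arr.astype(float)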


In [0]:
X_train = to_float_matrix(X_train_transformed)
X_test = to_float_matrix(X_test_transformed)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
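
Printing the shapes shows whether the arrays line up, but the check can also be made explicit. The following is a minimal sketch, not part of the original notebook: `check_design_matrices` is a hypothetical helper, and the `demo_*` arrays are synthetic stand-ins for the real STEDI matrices.

```python
import numpy as np


def check_design_matrices(X_train, X_test, y_train, y_test):
    """Hypothetical helper: raise AssertionError if the arrays are not model-ready."""
    assert X_train.ndim == 2 and X_test.ndim == 2, "feature matrices must be 2-D"
    assert X_train.shape[1] == X_test.shape[1], "train/test feature counts differ"
    assert X_train.shape[0] == y_train.shape[0], "X_train/y_train row counts differ"
    assert X_test.shape[0] == y_test.shape[0], "X_test/y_test row counts differ"
    assert np.isfinite(X_train).all(), "NaN or inf in X_train"
    assert np.isfinite(X_test).all(), "NaN or inf in X_test"
    return True


# Synthetic stand-ins (assumption: the real matrices are dense float arrays).
rng = np.random.default_rng(0)
demo_X_train = rng.normal(size=(8, 3))
demo_X_test = rng.normal(size=(4, 3))
demo_y_train = np.array([0, 1, 0, 1, 0, 1, 0, 1])
demo_y_test = np.array([0, 1, 0, 1])

checks_passed = check_design_matrices(
    demo_X_train, demo_X_test, demo_y_train, demo_y_test
)
```

In the notebook itself, the same call could be made with `X_train, X_test, y_train, y_test` right after the conversion step.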


### 2. Train Logistic Regression (Baseline Model)

In [0]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=300)
log_reg.fit(X_train, y_train)

log_reg_score = log_reg.score(X_test, y_test)
log_reg_score
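
An accuracy score is easier to interpret next to a trivial reference. The sketch below uses scikit-learn's `DummyClassifier`, which always predicts the most frequent training label. The labels are synthetic and deliberately imbalanced; the class balance of the real STEDI labels is not known from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels (assumption): 70/30 split in training, 15/5 in test.
demo_y_train = np.array([0] * 70 + [1] * 30)
demo_y_test = np.array([0] * 15 + [1] * 5)

# DummyClassifier ignores the features, so placeholder columns are enough.
demo_X_train = np.zeros((100, 1))
demo_X_test = np.zeros((20, 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(demo_X_train, demo_y_train)

# Accuracy of always predicting class 0: 15 of 20 test labels, i.e. 0.75.
dummy_score = dummy.score(demo_X_test, demo_y_test)
print(dummy_score)
```

A trained model whose test accuracy is close to this reference has learned little beyond the class frequencies.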


### 3. Train Random Forest (Baseline Model)

In [0]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)  # fixed seed so the baseline score is reproducible
rf.fit(X_train, y_train)

rf_score = rf.score(X_test, y_test)
rf_score
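
Random Forest training is randomized (bootstrap samples and random feature subsets), so its score can vary from run to run unless `random_state` is fixed. The sketch below illustrates this on synthetic data from `make_classification`, which stands in for the STEDI features. It also reads `feature_importances_`, one way to see which columns the forest relies on. The seed and `n_estimators=50` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
demo_X, demo_y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Two forests with the same seed are trained identically.
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(demo_X, demo_y)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(demo_X, demo_y)
same_predictions = np.array_equal(rf_a.predict(demo_X), rf_b.predict(demo_X))

# Impurity-based importances, one value per feature column.
importances = rf_a.feature_importances_
ranked_columns = np.argsort(importances)[::-1]  # most important column first

print(same_predictions)
print(ranked_columns)
```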


### 4. Compare Baseline Models

In [0]:
results = {
    "Logistic Regression baseline": log_reg_score,
    "Random Forest baseline": rf_score
}

results
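
A single train/test split gives one accuracy number per model, which can be sensitive to how the split happened to fall. As a hedged sketch of a sturdier comparison, the block below runs 5-fold cross-validation with `cross_val_score` on synthetic data. The dataset, the `candidates` dictionary, and the hyperparameters are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the transformed training data (assumption).
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=300),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    # Five accuracy scores, one per held-out fold.
    scores = cross_val_score(model, demo_X, demo_y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())

for name, (mean_acc, std_acc) in cv_results.items():
    print(f"{name}: {mean_acc:.3f} +/- {std_acc:.3f}")
```

If the gap between the two mean scores is smaller than the fold-to-fold spread, the ranking of the models is not well established.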


### 5. Baseline Model Analysis

The Random Forest model achieved higher baseline accuracy on the test set than the Logistic Regression model.
A likely reason is that Random Forest can capture non-linear patterns and is more tolerant of noisy sensor data.
Logistic Regression is simpler and easier to interpret, but it models a linear decision boundary and may miss complex feature interactions.
The accuracy gap raises two questions: which features drive the predictions, and how much of the gain comes from the added model complexity.
Testing models before real-world use is critical because incorrect predictions could affect users relying on the system.
Fairness matters in both data science and discipleship because decisions and judgments can impact others, and we are responsible for acting with integrity and care.
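
Because incorrect predictions can affect users, it helps to see which kinds of errors a model makes, not only how many. The sketch below is an illustration on synthetic, imbalanced data (an assumption; it is not the STEDI test set) and uses `confusion_matrix` and `classification_report` from scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 80% of samples in class 0 (assumption).
demo_X, demo_y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
demo_X_tr, demo_X_te, demo_y_tr, demo_y_te = train_test_split(
    demo_X, demo_y, test_size=0.25, stratify=demo_y, random_state=0
)

demo_model = LogisticRegression(max_iter=300).fit(demo_X_tr, demo_y_tr)
demo_pred = demo_model.predict(demo_X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(demo_y_te, demo_pred)
# Per-class precision, recall, and F1.
report = classification_report(demo_y_te, demo_pred, zero_division=0)

print(cm)
print(report)
```

The same two calls could be applied to `log_reg.predict(X_test)` and `rf.predict(X_test)` to compare how each model's errors are distributed across classes.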
