# Assignment 4.4 – Trained ML Models: Compare Models
### Geovanny Peña Rueda
In this assignment, I train baseline machine learning models using the transformed STEDI dataset.
I compare Logistic Regression and Random Forest models and evaluate their baseline accuracy.


### 1. Import Libraries and Load Feature Pipeline

In [0]:
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import issparse

pipeline = joblib.load("./etl_pipeline/stedi_feature_pipeline.pkl")

X_train_transformed = joblib.load("./etl_pipeline/X_train_transformed.pkl")
X_test_transformed = joblib.load("./etl_pipeline/X_test_transformed.pkl")

y_train = joblib.load("./etl_pipeline/y_train.pkl")
y_test = joblib.load("./etl_pipeline/y_test.pkl")


In [0]:
def to_float_matrix(arr) -> np.ndarray:
    """
    Convert the input (dense, sparse, object-dtype, or 0-d array)
    to a dense float matrix.
    """
    if issparse(arr):
        # Sparse matrices are not ndarrays; densify and cast explicitly.
        arr = arr.toarray().astype(float)
    elif arr.ndim == 0:
        # A 0-d object array wraps a single object; unwrap it first.
        arr = arr.item()
        if issparse(arr):
            arr = arr.toarray()
        arr = np.array(arr, dtype=float)
    elif arr.dtype == object:
        # Each element holds one row (or block of rows); stack them directly.
        arr = np.vstack([
            x.toarray() if issparse(x) else np.array(x, dtype=float)
            for x in arr
        ]).astype(float)
    else:
        arr = np.array(arr, dtype=float)
    return arr
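
The 0-d branch is the least obvious case in the helper above. A 0-d object array is a NumPy wrapper around a single Python object, which may be what comes back when a sparse matrix passes through a NumPy conversion step. The sketch below builds such a wrapper by hand (the names `sparse_block` and `wrapped` are illustrative, not part of the STEDI pipeline) and shows how `.item()` recovers the matrix so it can be densified.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Illustrative 3x3 sparse matrix (assumption: not real pipeline output).
sparse_block = csr_matrix(np.eye(3))

# Build a 0-d object array that holds the sparse matrix as its only element.
wrapped = np.empty((), dtype=object)
wrapped[()] = sparse_block

print(wrapped.ndim, wrapped.dtype)   # zero dimensions, object dtype

# .item() unwraps the single element; toarray() makes it dense.
inner = wrapped.item()
dense = inner.toarray() if issparse(inner) else np.asarray(inner, dtype=float)
print(dense.shape)
```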


In [0]:
X_train = to_float_matrix(X_train_transformed)
X_test = to_float_matrix(X_test_transformed)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


### 2. Train Logistic Regression (Baseline Model)

In [0]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=300)
log_reg.fit(X_train, y_train)

log_reg_score = log_reg.score(X_test, y_test)
log_reg_score
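
`score` returns plain accuracy, which can hide how errors are split between classes. As a hedged illustration, the sketch below computes a confusion matrix and per-class report alongside accuracy. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the STEDI data), and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): NOT the STEDI dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_model = LogisticRegression(max_iter=300).fit(X_tr, y_tr)
y_pred = demo_model.predict(X_te)

acc = accuracy_score(y_te, y_pred)    # the same quantity .score() returns
cm = confusion_matrix(y_te, y_pred)   # rows = true class, columns = predicted class

print(f"accuracy: {acc:.3f}")
print(cm)
print(classification_report(y_te, y_pred))
```

In the notebook itself, the same calls could be applied to `log_reg`, `X_test`, and `y_test`.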


### 3. Train Random Forest (Baseline Model)

In [0]:
from sklearn.ensemble import RandomForestClassifier

# Fix the seed so the forest (and its score) is reproducible across runs
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

rf_score = rf.score(X_test, y_test)
rf_score


### 4. Compare Baseline Models

In [0]:
results = {
    "Logistic Regression baseline": log_reg_score,
    "Random Forest baseline": rf_score
}

results


### 5. Baseline Model Analysis

**Which baseline model performed better?**  
On test-set accuracy, the Logistic Regression model scored slightly higher than the Random Forest model.
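
A small gap on a single train/test split can come from the split itself. One way to check, sketched here under assumptions, is k-fold cross-validation, which reports a mean and spread per model. The data is again a synthetic stand-in, and `candidates` and `cv_scores` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): NOT the STEDI dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=300),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # Five folds: each model is trained and scored five times on different splits.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the difference between the means is smaller than the fold-to-fold standard deviation, the ranking of the two models is probably not reliable.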

**Which model seems more stable for noisy sensor data?**  
The Random Forest model is generally more stable for noisy sensor data because it combines the votes of many decision trees, each trained on a bootstrap sample of the data, which dampens the effect that outliers and noise have on any single tree.
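
This claim can be probed empirically. The sketch below is one possible check, not a definitive test: it trains both models on clean synthetic data (an assumption, not the STEDI data), then adds Gaussian noise of increasing strength to the test features and records accuracy at each level. The noise levels and the names `noise_levels` and `noise_results` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): NOT the STEDI dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=300).fit(X_tr, y_tr),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr),
}

rng = np.random.default_rng(0)
noise_levels = [0.0, 0.5, 1.0]   # standard deviation of the added noise
noise_results = {name: [] for name in models}

for sigma in noise_levels:
    # Perturb only the test features; the trained models stay unchanged.
    X_noisy = X_te + rng.normal(0.0, sigma, size=X_te.shape)
    for name, model in models.items():
        noise_results[name].append(model.score(X_noisy, y_te))

for name, accs in noise_results.items():
    print(name, [round(a, 3) for a in accs])
```

Whichever model loses less accuracy as the noise grows is the more robust one on that dataset; the outcome on real sensor data may differ.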

**What questions do you have about why the numbers differ?**  
One question is whether the strong feature engineering and scaling allowed Logistic Regression to perform as well as, or better than, a more complex model like Random Forest. Another question is how the results might change with different hyperparameters.
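
The hyperparameter question could be explored with a grid search. The following is a minimal sketch on synthetic stand-in data (an assumption, not the STEDI data); the grid values are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): NOT the STEDI dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative grid: 2 x 2 = 4 parameter combinations.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```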

**Why is it important to test your model before using it in real life?**  
Testing on held-out data shows how a model behaves on inputs it has not seen. Models can make mistakes, and using an untested model could lead to incorrect decisions or harmful outcomes.

**If a model is wrong, who could be affected?**  
If a model is wrong, the people who rely on its predictions could be affected: customers, patients, or organizations that use the model for decision-making.

**Why does fairness matter in both data science and discipleship?**  
Fairness matters in data science because models should treat all data and people responsibly and without bias. In discipleship, fairness and consistency reflect moral accountability and the responsibility to act with integrity and care toward others.


### 6. Save Trained Models

In [0]:
import os
import joblib
from datetime import datetime

# Create a unique folder name (prevents overwriting files)
run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
base_dir = f"stedi_models/{run_id}"
os.makedirs(base_dir, exist_ok=True)

# Save trained models
joblib.dump(log_reg, f"{base_dir}/log_reg.joblib")
joblib.dump(rf, f"{base_dir}/random_forest.joblib")

# Save accuracy information (metadata)
metadata = {
    "run_id": run_id,
    "logistic_regression_accuracy": float(log_reg_score),
    "random_forest_accuracy": float(rf_score),
}
joblib.dump(metadata, f"{base_dir}/metadata.joblib")

base_dir
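
A saved model is only useful if it loads back and predicts the same way. The sketch below shows one way to verify a `joblib` round trip. It uses a temporary directory and a small model trained on synthetic data (both assumptions for illustration); in the notebook, the same check could be run against the files written to `base_dir`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (assumption): NOT the STEDI artifacts.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(max_iter=300).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "log_reg.joblib")
    joblib.dump(demo_model, model_path)   # write the model to disk
    reloaded = joblib.load(model_path)    # read it back before the directory is removed

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_demo), reloaded.predict(X_demo))
print("round trip OK:", same_predictions)
```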

In [0]:
import shutil

# make_archive returns the path of the archive it actually wrote
zip_path = shutil.make_archive(f"stedi_models_{run_id}", "zip", base_dir)

zip_path
