# Assignment 5.3 - Trained ML Models: Hyperparameter Tuning
### Geovanny Peña
In this assignment, I improve the baseline machine-learning models by tuning their hyperparameters and selecting the strongest model based on cross-validated accuracy. I tune both Logistic Regression and Random Forest models using GridSearchCV, compare their results, and save the best-performing model for future analysis.


### 1. Import Libraries and Load Your Data

In [0]:
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import issparse


In [0]:
pipeline = joblib.load("./etl_pipeline/stedi_feature_pipeline.pkl")

X_train_transformed = joblib.load("./etl_pipeline/X_train_transformed.pkl")
X_test_transformed = joblib.load("./etl_pipeline/X_test_transformed.pkl")

y_train = joblib.load("./etl_pipeline/y_train.pkl")
y_test = joblib.load("./etl_pipeline/y_test.pkl")


In [0]:
def to_float_matrix(arr: np.ndarray) -> np.ndarray:
    """
    Ensures that input arrays (possibly object-dtype, sparse, or 0-d)
    are converted to a 2-D float matrix.
    """
    if arr.ndim == 0:
        arr = arr.item()
        if issparse(arr):
            arr = arr.toarray()
        arr = np.array(arr, dtype=float)
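    elif arr.dtype == object:
        # Convert each element to a dense 2-D float block, then stack the blocks.
        # Stacking a plain list avoids building a ragged intermediate array.
        arr = np.vstack([
            x.toarray() if issparse(x) else np.atleast_2d(np.array(x, dtype=float))
            for x in arr
        ])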
    elif issparse(arr):
        arr = arr.toarray()
    else:
        arr = np.array(arr, dtype=float)
    return arr


In [0]:
X_train = to_float_matrix(X_train_transformed)
X_test = to_float_matrix(X_test_transformed)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


### 2. Hyperparameter Tuning - Logistic Regression
I used GridSearchCV to test different regularization strengths (C) and solvers for Logistic Regression. The goal was to identify the combination of hyperparameters with the highest mean accuracy across 3 cross-validation folds.

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


In [0]:
log_reg_params = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"]
}


In [0]:
log_reg_grid = GridSearchCV(
    LogisticRegression(max_iter=300),
    log_reg_params,
    cv=3,
    scoring="accuracy"
)

log_reg_grid.fit(X_train, y_train)

log_reg_best_params = log_reg_grid.best_params_
log_reg_best_score = log_reg_grid.best_score_

print("Logistic Regression Best Params:", log_reg_grid.best_params_)
print("Logistic Regression Best CV Score:", log_reg_grid.best_score_)

### 3. Hyperparameter Tuning - Random Forest
I used GridSearchCV to tune key Random Forest hyperparameters, including the number of trees, tree depth, and minimum sample requirements. This helps balance model accuracy and generalization.

In [0]:
from sklearn.ensemble import RandomForestClassifier


In [0]:
rf_params = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2]
}


In [0]:
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),  # fixed seed so the search is reproducible
    rf_params,
    cv=3,
    scoring="accuracy",
    n_jobs=-1
)

rf_grid.fit(X_train, y_train)

rf_best_params = rf_grid.best_params_
rf_best_score = rf_grid.best_score_

rf_best_params, rf_best_score


### 4. Compare Tuned Models


In [0]:
results = {
    "Logistic Regression (tuned)": log_reg_best_score,
    "Random Forest (tuned)": rf_best_score
}

results


### 5. Choose the Best Model

In [0]:
if rf_best_score > log_reg_best_score:
    best_model = rf_grid.best_estimator_
    best_model_name = "Random Forest"
else:
    best_model = log_reg_grid.best_estimator_
    best_model_name = "Logistic Regression"

best_model_name


### 6. Save the Model

In [0]:
# best_model was selected in Section 5; a tie goes to Logistic Regression, the simpler model.
joblib.dump(best_model, "stedi_best_model.pkl")

In [0]:
import os
os.listdir(".")


### Model Evaluation Report and Ethics Reflection

**Which model performed best?**
The tuned Logistic Regression model was selected as the best model.

**How do you know? (accuracy, precision, recall?)** 
Both tuned models reached about the same mean cross-validation accuracy (roughly 95.1%), so Logistic Regression was chosen because it is simpler and more interpretable. Accuracy was the only metric compared in this assignment; precision and recall were not measured.

**What hyperparameters improved performance?** 
For Logistic Regression, the smallest C value in the grid (0.01) performed best. Because C is the inverse of regularization strength, a smaller C means stronger regularization, which reduces overfitting. For Random Forest, limiting max_depth to 5 and using fewer trees helped control model complexity.
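
To illustrate how C relates to regularization, here is a minimal sketch. In scikit-learn, C is the inverse of regularization strength, so smaller values should shrink the coefficients. The data comes from `make_classification` and is only a stand-in (an assumption); it is not the STEDI feature set, and the printed norms will differ on the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the STEDI features).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Fit one model per C value and record the size of the coefficient vector.
coef_norms = {}
for C in [0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C, solver="lbfgs", max_iter=300)  # default L2 penalty
    model.fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))

for C, norm in coef_norms.items():
    print(f"C={C:<5} L2 norm of coefficients = {norm:.3f}")
```

The norm should grow as C grows, which is the weaker-regularization end of the grid.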

**Any surprising results?** 
It was surprising that both tuned models reached nearly identical accuracy, even though Random Forest can capture nonlinear patterns that Logistic Regression cannot. This suggests that the features in the dataset are well structured and that the classes are mostly linearly separable.
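
One rough way to check whether two tuned models are really tied is to compare the gap between their best mean scores with the fold-to-fold spread stored in `cv_results_`. The sketch below does this on synthetic data (an assumption, not the STEDI data), with smaller grids than the ones in this notebook. The helper `best_mean_and_std` and the "within one standard deviation" rule are my own illustrative choices, not a formal statistical test.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the STEDI features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

def best_mean_and_std(grid):
    """Return (mean, std) of the fold scores for the best parameter set."""
    i = grid.best_index_
    return (
        float(grid.cv_results_["mean_test_score"][i]),
        float(grid.cv_results_["std_test_score"][i]),
    )

lr_demo = GridSearchCV(
    LogisticRegression(max_iter=300), {"C": [0.01, 1]}, cv=3, scoring="accuracy"
).fit(X_demo, y_demo)
rf_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {"max_depth": [5, None]}, cv=3, scoring="accuracy",
).fit(X_demo, y_demo)

lr_mean, lr_std = best_mean_and_std(lr_demo)
rf_mean, rf_std = best_mean_and_std(rf_demo)

# Heuristic: treat the models as tied if the gap is smaller than the fold spread.
gap = abs(lr_mean - rf_mean)
within_noise = gap <= max(lr_std, rf_std)
print(f"LR {lr_mean:.3f} +/- {lr_std:.3f} | RF {rf_mean:.3f} +/- {rf_std:.3f}")
print("Gap within fold-to-fold spread:", within_noise)
```

With only 3 folds the standard deviation is a coarse estimate, so this is a sanity check rather than proof of a tie.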

**What would you test next if you had more time?** 
I would test additional evaluation metrics such as precision and recall, try more Random Forest parameters, or experiment with other models like Gradient Boosting or XGBoost. I would also explore adding more sensor features or collecting more data.
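
As a starting point for that next step, the sketch below shows one way to track precision and recall alongside accuracy in GridSearchCV and then score a held-out split. It uses synthetic, imbalanced binary data (an assumption, not the STEDI data). With more than two classes, the precision and recall scorers would need an averaging option such as `"precision_macro"`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced binary data (assumption: NOT the STEDI features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Track three metrics during the search; refit the final model on accuracy.
grid = GridSearchCV(
    LogisticRegression(max_iter=300),
    {"C": [0.01, 0.1, 1, 10]},
    cv=3,
    scoring=["accuracy", "precision", "recall"],
    refit="accuracy",
)
grid.fit(X_tr, y_tr)

# Mean cross-validation score per metric for the selected parameter set.
i = grid.best_index_
cv_summary = {
    m: float(grid.cv_results_[f"mean_test_{m}"][i])
    for m in ("accuracy", "precision", "recall")
}

# Held-out evaluation, which cross-validation scores alone do not provide.
y_pred = grid.predict(X_te)
test_precision = precision_score(y_te, y_pred, zero_division=0)
test_recall = recall_score(y_te, y_pred, zero_division=0)

print(cv_summary)
print(classification_report(y_te, y_pred, zero_division=0))
```

In this notebook the same pattern could be applied to `X_train`, `y_train`, `X_test`, and `y_test`, which are loaded above but not yet used for a final test-set score.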

**How could hyperparameter tuning accidentally make a model unfair or biased?** 
Hyperparameter tuning can unintentionally favor patterns that work well for the majority group but perform poorly for smaller or underrepresented groups. If the tuning process only focuses on overall accuracy, it may hide unfair behavior in specific populations.
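
A simple check for this is to report accuracy per group next to the overall number. The sketch below is only illustrative: the data is synthetic and the `group` labels are hypothetical and randomly assigned, since the original dataset's group attributes are not shown in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the STEDI features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)

# Hypothetical group attribute: about 85% "majority", 15% "minority".
rng = np.random.default_rng(0)
group = rng.choice(["majority", "minority"], size=len(y_demo), p=[0.85, 0.15])

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X_demo, y_demo, group, test_size=0.3, random_state=0
)

model = LogisticRegression(C=0.01, max_iter=300).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Overall accuracy can hide a weaker score for the smaller group.
overall = float(accuracy_score(y_te, y_pred))
per_group = {}
for g in np.unique(g_te):
    mask = g_te == g
    per_group[str(g)] = float(accuracy_score(y_te[mask], y_pred[mask]))

print(f"Overall accuracy: {overall:.3f}")
for g, acc in per_group.items():
    print(f"  {g}: {acc:.3f}")
```

Because the groups here are random, no real gap is expected; on real data, a large difference between groups would be a signal to tune against a per-group metric rather than overall accuracy alone.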

**Why is transparency important, and how does the gospel teach honest evaluation?** 
Transparency allows others to understand how decisions were made and to identify potential bias or errors. The gospel teaches principles of integrity, reminding us to evaluate our work truthfully and responsibly, especially when our decisions can affect others.