# MLOps Lab 2 Assignment (MLFlow)

Author: Grant Nitta

Date Created: 03/20/2025

Date Last Modified: 03/20/2025

# Task

Once you have selected a set of data, create a brand new experiment in MLFlow and begin exploring your data. Do some EDA, clean up, and learn about your data. You do not need to begin tracking anything yet, but you can if you want to (e.g. you can log different versions of your data as you clean it up and do any feature engineering). Do not spend a ton of time on this part. Your goal isn't really to build a great model, so don't spend hours on feature engineering and missing data imputation and things like that.

Once your data is clean, begin training models and tracking your experiments. If you intend to use this same dataset for your final project, then start thinking about what your model might look like when you actually deploy it. For example, when you engineer new features, be sure to save the code that does this, as you will need this in the future. If your final model has 1000 complex features, you might have a difficult time deploying it later on. If your final model takes 15 minutes to train, or takes a long time to score a new batch of data, you may want to think about training a less complex model.

Now, when tracking your experiments, at a *minimum*, you should:

1. Try at least 3 different ML algorithms (e.g. linear regression, decision tree, random forest, etc.).
2. Do hyperparameter tuning for **each** algorithm.
3. Do some very basic feature selection, and repeat the above steps with these reduced sets of features.
4. Identify the top 3 best models and note these down for later.
5. Choose the **final** "best" model that you would deploy or use on future data, stage it (in MLFlow), and run it on the test set to get a final measure of performance. Don't forget to log the test set metric.
6. Be sure you logged the exact training, validation, and testing datasets for the 3 best models, as well as hyperparameter values, and the values of your metrics.
7. Push your code to GitHub. No need to track the mlruns folder, the images folder, any datasets, or the SQLite database in git.
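
The requirement to record the exact datasets behind each run can be complemented with a content fingerprint. The sketch below is a minimal illustration, not part of the assignment: `dataset_fingerprint` is a hypothetical helper, the rows are made-up toy values, and the resulting string is assumed to be stored with the run (for example via `mlflow.set_tag`) next to the logged data files.

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a short SHA-256 digest of an iterable of CSV-like text rows."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:16]


# Toy rows for illustration only (hypothetical values, not the real data)
train_rows = ["age,studytime,G1", "18,2,5", "17,2,5"]
train_fingerprint = dataset_fingerprint(train_rows)

# Any change to the data yields a different fingerprint
changed_fingerprint = dataset_fingerprint(train_rows + ["15,3,7"])
print(train_fingerprint, changed_fingerprint)
```

Two runs that report the same fingerprint were trained on byte-identical rows, which makes it easier to compare the top models fairly.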

# Library Importation

In [1]:
import mlflow
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

from statsmodels.stats.outliers_influence import variance_inflation_factor


from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# Data Importation

In [2]:
# Install once if needed:
# !pip install ucimlrepo statsmodels

In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
student_performance = fetch_ucirepo(id=320)

# data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets

# metadata
# print(student_performance.metadata)

# variable information
# print(student_performance.variables)

# Setting up MLFlow

In [4]:
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Lab2-student_performance_V2")

2025/03/20 18:47:35 INFO mlflow.tracking.fluent: Experiment with name 'Lab2-student_performance_V2' does not exist. Creating a new experiment.


<Experiment: artifact_location='/Users/skier/MSDS/Spring2/Spring2-MSDS-MLOps/labs/lab2/mlruns/4', creation_time=1742521655423, experiment_id='4', last_update_time=1742521655423, lifecycle_stage='active', name='Lab2-student_performance_V2', tags={}>

# Experimenting Phase 1

In [5]:
X_encoded = X.copy()

# Track column transformations
column_mapping = {}
label_encoders = {}

# Find all object and category columns
string_columns = X.select_dtypes(include=["object", "category"]).columns

for col in string_columns:
    # Label-encode every string/categorical column (integer codes in sorted class order)
    le = LabelEncoder()
    X_encoded[col + "_encoded"] = le.fit_transform(X[col])

    # Drop the original column
    X_encoded = X_encoded.drop(col, axis=1)

    # Store mapping information
    column_mapping[col] = [col + "_encoded"]
    label_encoders[col] = le

X_encoded = X_encoded.astype(float)
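
As a rough illustration of what the encoding step above does to each string column, the sketch below reproduces the idea in plain Python: distinct values are sorted and each one is replaced by its position in that sorted list. `label_encode` is a hypothetical helper and the column values are illustrative, not rows taken from the dataset.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted class order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


# Toy column; these values are illustrative only
school = ["GP", "MS", "GP", "GP", "MS"]
codes, classes = label_encode(school)
print(codes)    # [0, 1, 0, 0, 1]
print(classes)  # ['GP', 'MS']
```

The integer codes imply an ordering that the original categories may not have, which is worth keeping in mind when interpreting the models trained on them.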

In [6]:
# Use the first-period grade (G1) as the prediction target
y_use = y["G1"]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_use, test_size=0.2, shuffle=True
)

In [8]:
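def objective(params):
    # Work on a copy so the dict passed in by hyperopt is not mutated
    params = dict(params)
    classifier_type = params.pop("type")
    with mlflow.start_run():
        if classifier_type == "dt":
            clf = DecisionTreeClassifier(**params)
        elif classifier_type == "rf":
            clf = RandomForestClassifier(**params)
        elif classifier_type == "gb":
            clf = GradientBoostingClassifier(**params)
        else:
            # A bare 0 is not a valid hyperopt result, so fail loudly instead
            raise ValueError(f"Unknown classifier type: {classifier_type}")
        # Mean accuracy over the default cross-validation folds
        acc = cross_val_score(clf, X_train, y_train).mean()

        mlflow.set_tag("Model", classifier_type)
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        # The `with` block ends the run, so no explicit end_run() is needed
    # hyperopt minimizes, so return the negated accuracy as the loss
    return {"loss": -acc, "status": STATUS_OK}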


search_space = hp.choice(
    "classifier_type",
    [
        {
            "type": "dt",
            "criterion": hp.choice("dtree_criterion", ["gini", "entropy"]),
            "max_depth": hp.choice(
                "dtree_max_depth",
                [None, hp.randint("dtree_max_depth_int", 1, 10)],
            ),
            "min_samples_split": hp.randint("dtree_min_samples_split", 2, 10),
        },
        {
            "type": "rf",
            "n_estimators": hp.randint("rf_n_estimators", 20, 500),
            "max_features": hp.randint("rf_max_features", 2, 9),
            "criterion": hp.choice("criterion", ["gini", "entropy"]),
        },
        {
            "type": "gb",
            "loss": hp.choice("gb_loss", ["log_loss"]),
            "learning_rate": hp.uniform("gb_learning_rate", 0.05, 2),
            "n_estimators": hp.randint("gb_n_estimators", 20, 500),
            "subsample": hp.uniform("gb_subsample", 0.1, 1),
            "criterion": hp.choice(
                "gb_criterion", ["friedman_mse", "squared_error"]
            ),
            "max_depth": hp.choice(
                "gb_max_depth",
                [None, hp.randint("gb_max_depth_int", 1, 10)],
            ),
        },
    ],
)

algo = tpe.suggest
trials = Trials()

In [9]:
best_result = fmin(
    fn=objective, space=search_space, algo=algo, max_evals=32, trials=trials
)

100%|██████████| 32/32 [01:20<00:00,  2.51s/trial, best loss: -0.18312173263629575]

In [10]:
best_result

{'classifier_type': 1,
 'criterion': 0,
 'rf_max_features': 6,
 'rf_n_estimators': 95}
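
The dictionary returned by `fmin` stores the *index* of each `hp.choice` option rather than the option itself, so `'classifier_type': 1` refers to the second entry of the search space. The sketch below decodes such a result by hand. `decode_best` is a hypothetical helper, the option lists are copied from the search space above, and it assumes the `hp.randint` and `hp.uniform` entries already hold the sampled values. hyperopt's own `space_eval(search_space, best_result)` serves the same purpose.

```python
# Option lists copied from the search space, in the same order as in hp.choice
CLASSIFIERS = ["dt", "rf", "gb"]
CRITERIA = ["gini", "entropy"]
GB_CRITERIA = ["friedman_mse", "squared_error"]


def decode_best(best):
    """Translate hp.choice indices in fmin's result into readable settings."""
    model = CLASSIFIERS[best["classifier_type"]]
    if model == "dt":
        # Index 0 of the max_depth choice is None; index 1 is the randint draw
        depth = best["dtree_max_depth_int"] if best["dtree_max_depth"] == 1 else None
        return {
            "type": model,
            "criterion": CRITERIA[best["dtree_criterion"]],
            "max_depth": depth,
            "min_samples_split": best["dtree_min_samples_split"],
        }
    if model == "rf":
        return {
            "type": model,
            "criterion": CRITERIA[best["criterion"]],
            "max_features": best["rf_max_features"],
            "n_estimators": best["rf_n_estimators"],
        }
    # Remaining case: gradient boosting
    depth = best["gb_max_depth_int"] if best["gb_max_depth"] == 1 else None
    return {
        "type": model,
        "loss": "log_loss",
        "learning_rate": best["gb_learning_rate"],
        "n_estimators": best["gb_n_estimators"],
        "subsample": best["gb_subsample"],
        "criterion": GB_CRITERIA[best["gb_criterion"]],
        "max_depth": depth,
    }


decoded = decode_best(
    {"classifier_type": 1, "criterion": 0, "rf_max_features": 6, "rf_n_estimators": 95}
)
print(decoded)
```

Read this way, the result above corresponds to a random forest with the `gini` criterion, `max_features=6`, and `n_estimators=95`.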

# Feature Selection

In [11]:
X_redcued_train = X_train.copy()
VIF = [0]  # sentinel: any non-empty value keeps the loop running
while len(VIF) > 0:
    X_numeric = pd.DataFrame()
    for col in X_redcued_train.columns:
        # Coerce each remaining column to numeric (unparseable values become NaN)
        X_numeric[col] = pd.to_numeric(
            X_redcued_train[col].astype(str), errors="coerce"
        )

    # Compute the VIF of every remaining feature
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X_numeric.columns
    vif_data["VIF"] = [
        variance_inflation_factor(X_numeric.values, i)
        for i in range(X_numeric.shape[1])
    ]
    vif_test = vif_data.set_index("Feature").sort_values(
        by="VIF", ascending=False
    )
    # Drop the feature with the largest VIF while that VIF exceeds 5
    if vif_test.max().iloc[0] > 5:
        print(VIF)  # shows the value set in the previous iteration
        VIF = vif_test.idxmax().iloc[0]
        X_redcued_train.drop(VIF, axis=1, inplace=True)
    else:
        print("Stopping")
        VIF = []

[0]
age
famrel
Medu
freetime
higher_encoded
goout
Fjob_encoded
Pstatus_encoded
Walc
studytime
health
Stopping
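
For reference, the variance inflation factor of a feature is usually defined as 1 / (1 - R²), where R² comes from regressing that feature on the others. The sketch below works through the two-feature case, where R² reduces to the squared Pearson correlation. It is a plain-Python illustration on toy numbers, with hypothetical helper names, and it assumes a regression with an intercept. `variance_inflation_factor` in statsmodels regresses each column on the remaining columns exactly as passed, so its values can differ from this definition unless the design matrix includes a constant column.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation between two equally long numeric sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def vif_two_features(a, b):
    """VIF = 1 / (1 - R^2); with a single other predictor, R^2 = r^2."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)


# Toy data for illustration only
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
vif = vif_two_features(x1, x2)
print(round(vif, 3))  # 2.5
```

With the cutoff of 5 used in the loop above, a pair of features like this one would both be kept.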


In [12]:
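def objective(params):
    # Work on a copy so the dict passed in by hyperopt is not mutated
    params = dict(params)
    classifier_type = params.pop("type")
    with mlflow.start_run():
        if classifier_type == "dt":
            clf = DecisionTreeClassifier(**params)
        elif classifier_type == "rf":
            clf = RandomForestClassifier(**params)
        elif classifier_type == "gb":
            clf = GradientBoostingClassifier(**params)
        else:
            # A bare 0 is not a valid hyperopt result, so fail loudly instead
            raise ValueError(f"Unknown classifier type: {classifier_type}")
        # Same search as before, now on the VIF-reduced training features
        acc = cross_val_score(clf, X_redcued_train, y_train).mean()

        mlflow.set_tag("Model", classifier_type)
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        # The `with` block ends the run, so no explicit end_run() is needed
    # hyperopt minimizes, so return the negated accuracy as the loss
    return {"loss": -acc, "status": STATUS_OK}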


search_space = hp.choice(
    "classifier_type",
    [
        {
            "type": "dt",
            "criterion": hp.choice("dtree_criterion", ["gini", "entropy"]),
            "max_depth": hp.choice(
                "dtree_max_depth",
                [None, hp.randint("dtree_max_depth_int", 1, 10)],
            ),
            "min_samples_split": hp.randint("dtree_min_samples_split", 2, 10),
        },
        {
            "type": "rf",
            "n_estimators": hp.randint("rf_n_estimators", 20, 500),
            "max_features": hp.randint("rf_max_features", 2, 9),
            "criterion": hp.choice("criterion", ["gini", "entropy"]),
        },
        {
            "type": "gb",
            "loss": hp.choice("gb_loss", ["log_loss"]),
            "learning_rate": hp.uniform("gb_learning_rate", 0.05, 2),
            "n_estimators": hp.randint("gb_n_estimators", 20, 500),
            "subsample": hp.uniform("gb_subsample", 0.1, 1),
            "criterion": hp.choice(
                "gb_criterion", ["friedman_mse", "squared_error"]
            ),
            "max_depth": hp.choice(
                "gb_max_depth",
                [None, hp.randint("gb_max_depth_int", 1, 10)],
            ),
        },
    ],
)

algo = tpe.suggest
trials = Trials()

In [13]:
best_result = fmin(
    fn=objective, space=search_space, algo=algo, max_evals=32, trials=trials
)

100%|██████████| 32/32 [01:44<00:00,  3.26s/trial, best loss: -0.17148991784914117]

In [14]:
best_result

{'classifier_type': 0,
 'dtree_criterion': 1,
 'dtree_max_depth': 1,
 'dtree_max_depth_int': 3,
 'dtree_min_samples_split': 8}

# Top 3 Best Models

1. 0311863630594607bdc85a56ad9c50eb
2. 190b280dca4648f6b65939a9d8e616bf
3. b4f5c7052a0d4b68b1000bc5d4300104

# Running on Test Data

In [16]:
# Apply the feature subset selected on the training data to the test split.
# The VIF loop is not re-run here: it read its values from X_train, so it
# dropped exactly the same columns as the training-side selection.
X_redcued_test = X_test[X_redcued_train.columns].copy()

In [34]:
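def objective(params):
    # Work on a copy so the dict passed in by hyperopt is not mutated
    params = dict(params)
    classifier_type = params.pop("type")
    with mlflow.start_run():
        if classifier_type == "dt":
            clf = DecisionTreeClassifier(**params)
        elif classifier_type == "rf":
            clf = RandomForestClassifier(**params)
        elif classifier_type == "gb":
            clf = GradientBoostingClassifier(**params)
        else:
            # A bare 0 is not a valid hyperopt result, so fail loudly instead
            raise ValueError(f"Unknown classifier type: {classifier_type}")
        # Note: this cross-validates within the test split itself; it does not
        # score a model that was fitted on the training split
        acc = cross_val_score(clf, X_redcued_test, y_test).mean()

        mlflow.set_tag("Model", classifier_type)
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        # The `with` block ends the run, so no explicit end_run() is needed
    # hyperopt minimizes, so return the negated accuracy as the loss
    return {"loss": -acc, "status": STATUS_OK}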


search_space = hp.choice(
    "classifier_type",
    [
        {
            "type": "dt",
            "criterion": hp.choice("dtree_criterion", ["gini", "entropy"]),
            "max_depth": hp.choice(
                "dtree_max_depth",
                [None, hp.randint("dtree_max_depth_int", 1, 10)],
            ),
            "min_samples_split": hp.randint("dtree_min_samples_split", 2, 10),
        },
        {
            "type": "rf",
            "n_estimators": hp.randint("rf_n_estimators", 20, 500),
            "max_features": hp.randint("rf_max_features", 2, 9),
            "criterion": hp.choice("criterion", ["gini", "entropy"]),
        },
        {
            "type": "gb",
            "loss": hp.choice("gb_loss", ["log_loss"]),
            "learning_rate": hp.uniform("gb_learning_rate", 0.05, 2),
            "n_estimators": hp.randint("gb_n_estimators", 20, 500),
            "subsample": hp.uniform("gb_subsample", 0.1, 1),
            "criterion": hp.choice(
                "gb_criterion", ["friedman_mse", "squared_error"]
            ),
            "max_depth": hp.choice(
                "gb_max_depth",
                [None, hp.randint("gb_max_depth_int", 1, 10)],
            ),
        },
    ],
)

algo = tpe.suggest
trials = Trials()

In [35]:
best_result_test = fmin(
    fn=objective, space=search_space, algo=algo, max_evals=32, trials=trials
)

100%|██████████| 32/32 [00:27<00:00,  1.17trial/s, best loss: -0.17692307692307693]

In [37]:
best_result_test

{'classifier_type': 0,
 'dtree_criterion': 1,
 'dtree_max_depth': 0,
 'dtree_min_samples_split': 9}

In [41]:
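# Register the model stored under this run's "artifacts/better_models" path
# as a new version of the registered model "Best_test_Lab2"
runid = "13b23d634e8b42ab87789617d4947ca1"
mod_path = f"runs:/{runid}/artifacts/better_models"
mlflow.register_model(model_uri=mod_path, name="Best_test_Lab2")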

Registered model 'Best_test_Lab2' already exists. Creating a new version of this model...
Created version '2' of model 'Best_test_Lab2'.


<ModelVersion: aliases=[], creation_timestamp=1742525588969, current_stage='None', description=None, last_updated_timestamp=1742525588969, name='Best_test_Lab2', run_id='13b23d634e8b42ab87789617d4947ca1', run_link=None, source='/Users/skier/MSDS/Spring2/Spring2-MSDS-MLOps/labs/lab2/mlruns/4/13b23d634e8b42ab87789617d4947ca1/artifacts/artifacts/better_models', status='READY', status_message=None, tags={}, user_id=None, version=2>