# MLOps Lab 2 Assignment (MLFlow)

Author: Grant Nitta

Date Created: 03/20/2025

Date Last Modified: 03/21/2025

# Task

Once you have selected a set of data, create a brand new experiment in MLFlow and begin exploring your data. Do some EDA, clean up, and learn about your data. You do not need to begin tracking anything yet, but you can if you want to (e.g. you can log different versions of your data as you clean it up and do any feature engineering). Do not spend a ton of time on this part. Your goal isn't really to build a great model, so don't spend hours on feature engineering and missing data imputation and things like that.

Once your data is clean, begin training models and tracking your experiments. If you intend to use this same dataset for your final project, then start thinking about what your model might look like when you actually deploy it. For example, when you engineer new features, be sure to save the code that does this, as you will need this in the future. If your final model has 1000 complex features, you might have a difficult time deploying it later on. If your final model takes 15 minutes to train, or takes a long time to score a new batch of data, you may want to think about training a less complex model.

Now, when tracking your experiments, at a *minimum*, you should:

1. Try at least 3 different ML algorithms (e.g. linear regression, decision tree, random forest, etc.).
2. Do hyperparameter tuning for **each** algorithm.
3. Do some very basic feature selection, and repeat the above steps with these reduced sets of features.
4. Identify the top 3 best models and note these down for later.
6. Choose the **final** "best" model that you would deploy or use on future data, stage it (in MLFlow), and run it on the test set to get a final measure of performance. Don't forget to log the test set metric.
7. Be sure you logged the exact training, validation, and testing datasets for the 3 best models, as well as hyperparameter values, and the values of your metrics.  
8. Push your code to Github. No need to track the mlruns folder, the images folder, any datasets, or the sqlite database in git.

### Turning It In

In the MLFlow UI, next to the refresh button you should see three vertical dots. Click the dots and then download your experiments as a csv file. Open the csv file in Excel and highlight the rows for your top 3 models from step 4, highlight the run where you applied your best model to the test set, and then save as an excel file. Take a snapshot of the Models page in the MLFLow UI showing the model you staged in step 6 above. Submit the excel file and the snapshot to Canvas.

# Library Importation

In [17]:
import mlflow
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

from statsmodels.stats.outliers_influence import variance_inflation_factor


from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# Data Importation

In [18]:
# pip install ucimlrepo
# !pip install statsmodels

In [19]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
# student_performance = fetch_ucirepo(id=320)

# # data (as pandas dataframes)
# X = student_performance.data.features
# y = student_performance.data.targets

# metadata
# print(student_performance.metadata)

# # variable information
# print(student_performance.variables)

# fetch dataset
iris = fetch_ucirepo(id=53)

# data (as pandas dataframes)
X = iris.data.features
y = iris.data.targets

# metadata
# print(iris.metadata)

# # variable information
# print(iris.variables)

# Setting up MLFlow

In [9]:
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Lab2-student_performance_Iris")

2025/03/21 22:45:05 INFO mlflow.tracking.fluent: Experiment with name 'Lab2-student_performance_Iris' does not exist. Creating a new experiment.


<Experiment: artifact_location='/Users/skier/MSDS/Spring2/Spring2-MSDS-MLOps/labs/lab2/mlruns/12', creation_time=1742622305659, experiment_id='12', last_update_time=1742622305659, lifecycle_stage='active', name='Lab2-student_performance_Iris', tags={}>

# Experimenting Phase 1

In [20]:
X_encoded = X.copy()

# Track column transformations
column_mapping = {}
label_encoders = {}

# Find all object and category columns
string_columns = X.select_dtypes(include=["object", "category"]).columns

for col in string_columns:
    # For columns with many unique values, use label encoding
    le = LabelEncoder()
    X_encoded[col + "_encoded"] = le.fit_transform(X[col])

    # Drop the original column
    X_encoded = X_encoded.drop(col, axis=1)

    # Store mapping information
    column_mapping[col] = [col + "_encoded"]
    label_encoders[col] = le

X_encoded = X_encoded.astype(float)
# y_use = y["G1"]
y_use = y.values.ravel()

In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y_use, test_size=0.2, shuffle=True
)
X_train_val, X_val, y_train_val, y_val = train_test_split(
    X_train, y_train, test_size=0.2, shuffle=True
)

# Initial Training

In [22]:
def objective(params):
    with mlflow.start_run():
        classifier_type = params["type"]
        del params["type"]
        if classifier_type == "dt":
            clf = DecisionTreeClassifier(**params)
        elif classifier_type == "rf":
            clf = RandomForestClassifier(**params)
        elif classifier_type == "gb":
            clf = GradientBoostingClassifier(**params)
        else:
            return 0
        acc = cross_val_score(clf, X_train_val, y_train_val).mean()

        mlflow.set_tag("Model", classifier_type)

        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.end_run()
        return {"loss": -acc, "status": STATUS_OK}


search_space = hp.choice(
    "classifier_type",
    [
        {
            "type": "dt",
            "criterion": hp.choice("dtree_criterion", ["gini", "entropy"]),
            "max_depth": hp.choice(
                "dtree_max_depth",
                [None, hp.randint("dtree_max_depth_int", 1, 10)],
            ),
            "min_samples_split": hp.randint("dtree_min_samples_split", 2, 10),
        },
        {
            "type": "rf",
            "n_estimators": hp.randint("rf_n_estimators", 20, 500),
            "max_features": hp.randint("rf_max_features", 2, 9),
            "criterion": hp.choice("criterion", ["gini", "entropy"]),
        },
        {
            "type": "gb",
            "loss": hp.choice("gb_loss", ["log_loss"]),
            "learning_rate": hp.uniform("gb_learning_rate", 0.05, 2),
            "n_estimators": hp.randint("gb_n_estimators", 20, 500),
            "subsample": hp.uniform("gb_subsample", 0.1, 1),
            "criterion": hp.choice(
                "gb_criterion", ["friedman_mse", "squared_error"]
            ),
            "max_depth": hp.choice(
                "gb_max_depth",
                [None, hp.randint("gb_max_depth_int", 1, 10)],
            ),
        },
    ],
)

algo = tpe.suggest
trials = Trials()

In [23]:
best_result = fmin(
    fn=objective, space=search_space, algo=algo, max_evals=32, trials=trials
)
best_result

100%|██████████| 32/32 [00:10<00:00,  2.96trial/s, best loss: -0.9689473684210526]


{'classifier_type': 2,
 'gb_criterion': 1,
 'gb_learning_rate': 0.7967696607938101,
 'gb_loss': 0,
 'gb_max_depth': 0,
 'gb_n_estimators': 211,
 'gb_subsample': 0.4209379501392708}

# Feature Selection

In [24]:
X_redcued_train = X_train.copy()
VIF = [0]
while len(VIF) > 0:
    X_numeric = pd.DataFrame()
    for col in X_redcued_train.columns:
        # Force everything through string conversion to be safe
        X_numeric[col] = pd.to_numeric(
            X_train[col].astype(str), errors="coerce"
        )

    # Now calculate VIF
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X_numeric.columns
    vif_data["VIF"] = [
        variance_inflation_factor(X_numeric.values, i)
        for i in range(X_numeric.shape[1])
    ]
    vif_test = vif_data.set_index("Feature").sort_values(
        by="VIF", ascending=False
    )
    if vif_test.max().iloc[0] > 5:
        print(VIF)
        VIF = vif_test.idxmax().iloc[0]
        X_redcued_train.drop(VIF, axis=1, inplace=True)
    else:
        print("Stopping")
        VIF = []

[0]
sepal length
Stopping


In [25]:
def objective(params):
    with mlflow.start_run():
        classifier_type = params["type"]
        del params["type"]
        if classifier_type == "dt":
            clf = DecisionTreeClassifier(**params)
        elif classifier_type == "rf":
            clf = RandomForestClassifier(**params)
        elif classifier_type == "gb":
            clf = GradientBoostingClassifier(**params)
        else:
            return 0
        acc = cross_val_score(clf, X_redcued_train, y_train).mean()

        mlflow.set_tag("Model", classifier_type)
        mlflow.set_tag("Data", "Training")
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", acc)
        mlflow.sklearn.log_model(clf, artifact_path="better_models")
        mlflow.end_run()
        return {"loss": -acc, "status": STATUS_OK}


search_space = hp.choice(
    "classifier_type",
    [
        {
            "type": "dt",
            "criterion": hp.choice("dtree_criterion", ["gini", "entropy"]),
            "max_depth": hp.choice(
                "dtree_max_depth",
                [None, hp.randint("dtree_max_depth_int", 1, 10)],
            ),
            "min_samples_split": hp.randint("dtree_min_samples_split", 2, 10),
        },
        {
            "type": "rf",
            "n_estimators": hp.randint("rf_n_estimators", 20, 500),
            "max_features": hp.randint("rf_max_features", 2, 9),
            "criterion": hp.choice("criterion", ["gini", "entropy"]),
        },
        {
            "type": "gb",
            "loss": hp.choice("gb_loss", ["log_loss"]),
            "learning_rate": hp.uniform("gb_learning_rate", 0.05, 2),
            "n_estimators": hp.randint("gb_n_estimators", 20, 500),
            "subsample": hp.uniform("gb_subsample", 0.1, 1),
            "criterion": hp.choice(
                "gb_criterion", ["friedman_mse", "squared_error"]
            ),
            "max_depth": hp.choice(
                "gb_max_depth",
                [None, hp.randint("gb_max_depth_int", 1, 10)],
            ),
        },
    ],
)

algo = tpe.suggest
trials = Trials()

In [26]:
best_result = fmin(
    fn=objective, space=search_space, algo=algo, max_evals=32, trials=trials
)
best_result

  0%|          | 0/32 [00:00<?, ?trial/s, best loss=?]




  3%|▎         | 1/32 [00:01<00:39,  1.26s/trial, best loss: -0.9583333333333334]




  6%|▋         | 2/32 [00:02<00:41,  1.37s/trial, best loss: -0.9583333333333334]




  9%|▉         | 3/32 [00:03<00:33,  1.16s/trial, best loss: -0.9583333333333334]




 12%|█▎        | 4/32 [00:05<00:35,  1.25s/trial, best loss: -0.9583333333333334]




 16%|█▌        | 5/32 [00:06<00:33,  1.24s/trial, best loss: -0.9583333333333334]




 19%|█▉        | 6/32 [00:07<00:29,  1.12s/trial, best loss: -0.9583333333333334]




 22%|██▏       | 7/32 [00:08<00:30,  1.23s/trial, best loss: -0.9583333333333334]




 25%|██▌       | 8/32 [00:09<00:27,  1.16s/trial, best loss: -0.9583333333333334]




 28%|██▊       | 9/32 [00:11<00:29,  1.28s/trial, best loss: -0.9583333333333334]




 31%|███▏      | 10/32 [00:12<00:25,  1.15s/trial, best loss: -0.9583333333333334]




 34%|███▍      | 11/32 [00:13<00:25,  1.21s/trial, best loss: -0.9583333333333334]




 38%|███▊      | 12/32 [00:15<00:27,  1.37s/trial, best loss: -0.9583333333333334]




 41%|████      | 13/32 [00:16<00:25,  1.37s/trial, best loss: -0.9583333333333334]




 44%|████▍     | 14/32 [00:18<00:28,  1.57s/trial, best loss: -0.9583333333333334]




 47%|████▋     | 15/32 [00:19<00:23,  1.38s/trial, best loss: -0.9583333333333334]




 50%|█████     | 16/32 [00:20<00:22,  1.40s/trial, best loss: -0.9583333333333334]




 53%|█████▎    | 17/32 [00:21<00:18,  1.25s/trial, best loss: -0.9583333333333334]




 56%|█████▋    | 18/32 [00:23<00:19,  1.40s/trial, best loss: -0.9583333333333334]




 59%|█████▉    | 19/32 [00:24<00:16,  1.24s/trial, best loss: -0.9583333333333334]




 62%|██████▎   | 20/32 [00:25<00:15,  1.30s/trial, best loss: -0.9583333333333334]




 66%|██████▌   | 21/32 [00:26<00:13,  1.20s/trial, best loss: -0.9583333333333334]




 69%|██████▉   | 22/32 [00:27<00:10,  1.10s/trial, best loss: -0.9583333333333334]




 72%|███████▏  | 23/32 [00:28<00:10,  1.16s/trial, best loss: -0.9583333333333334]




 75%|███████▌  | 24/32 [00:29<00:08,  1.06s/trial, best loss: -0.9583333333333334]




 78%|███████▊  | 25/32 [00:30<00:07,  1.00s/trial, best loss: -0.9583333333333334]




 81%|████████▏ | 26/32 [00:31<00:05,  1.03trial/s, best loss: -0.9583333333333334]




 84%|████████▍ | 27/32 [00:32<00:04,  1.08trial/s, best loss: -0.9583333333333334]




 88%|████████▊ | 28/32 [00:33<00:03,  1.12trial/s, best loss: -0.9583333333333334]




 91%|█████████ | 29/32 [00:34<00:02,  1.14trial/s, best loss: -0.9583333333333334]




 94%|█████████▍| 30/32 [00:34<00:01,  1.14trial/s, best loss: -0.9583333333333334]




 97%|█████████▋| 31/32 [00:35<00:00,  1.14trial/s, best loss: -0.9583333333333334]




100%|██████████| 32/32 [00:37<00:00,  1.17s/trial, best loss: -0.9583333333333334]


{'classifier_type': 0,
 'dtree_criterion': 1,
 'dtree_max_depth': 0,
 'dtree_min_samples_split': 7}

# Top 3 Best Training Models

1. 7a2f884e11b9483b9ea730a06ae1fcdf
2. 1ba90b7e0ade4ea181f5ce14259a45fb
3. dbb34c6780f84f35abcc1aa0e37e881b

# Using Validation Set to find the Best

In [32]:
X_redcued_val = X_val.copy()
VIF = [0]
while len(VIF) > 0:
    X_numeric = pd.DataFrame()
    for col in X_redcued_val.columns:
        # Force everything through string conversion to be safe
        X_numeric[col] = pd.to_numeric(
            X_train[col].astype(str), errors="coerce"
        )

    # Now calculate VIF
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X_numeric.columns
    vif_data["VIF"] = [
        variance_inflation_factor(X_numeric.values, i)
        for i in range(X_numeric.shape[1])
    ]
    vif_test = vif_data.set_index("Feature").sort_values(
        by="VIF", ascending=False
    )
    if vif_test.max().iloc[0] > 5:
        print(VIF)
        VIF = vif_test.idxmax().iloc[0]
        X_redcued_val.drop(VIF, axis=1, inplace=True)
    else:
        print("Stopping")
        VIF = []

[0]
sepal length
Stopping


In [33]:
top3_1_logged_model = "runs:/7a2f884e11b9483b9ea730a06ae1fcdf/better_models"  # replace with one of your models

# Load model as a PyFuncModel.
top3_1_loaded_model = mlflow.pyfunc.load_model(top3_1_logged_model)

top3_2_logged_model = "runs:/1ba90b7e0ade4ea181f5ce14259a45fb/better_models"  # replace with one of your models

# Load model as a PyFuncModel.
top3_2_loaded_model = mlflow.pyfunc.load_model(top3_2_logged_model)

top3_3_logged_model = "runs:/dbb34c6780f84f35abcc1aa0e37e881b/better_models"  # replace with one of your models

# Load model as a PyFuncModel.
top3_3_loaded_model = mlflow.pyfunc.load_model(top3_3_logged_model)

model_1_train = mlflow.sklearn.load_model(top3_1_logged_model)
model_2_train = mlflow.sklearn.load_model(top3_2_logged_model)
model_3_train = mlflow.sklearn.load_model(top3_3_logged_model)

In [35]:
with mlflow.start_run():
    counter = 0
    for model in [model_1_train, model_2_train, model_3_train]:
        counter += 1
        params = model.get_params()
        for phase in ["validation"]:
            if phase == "validation":
                acc = cross_val_score(model, X_redcued_val, y_val).mean()
            mlflow.set_tag("Model", "dt")
            mlflow.set_tag("Data", "validation")
            mlflow.log_params(model.get_params())
            mlflow.log_metric("accuracy", acc)
            mlflow.sklearn.log_model(model, artifact_path="better_models")
            mlflow.end_run()



# Best Model according to Validation

1. 18344d4723fb4fb69155783ae8ddd437

# Running All models on the Test Data

In [36]:
X_redcued_test = X_test.copy()
VIF = [0]
while len(VIF) > 0:
    X_numeric = pd.DataFrame()
    for col in X_redcued_test.columns:
        # Force everything through string conversion to be safe
        X_numeric[col] = pd.to_numeric(
            X_train[col].astype(str), errors="coerce"
        )

    # Now calculate VIF
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X_numeric.columns
    vif_data["VIF"] = [
        variance_inflation_factor(X_numeric.values, i)
        for i in range(X_numeric.shape[1])
    ]
    vif_test = vif_data.set_index("Feature").sort_values(
        by="VIF", ascending=False
    )
    if vif_test.max().iloc[0] > 5:
        # print(VIF)
        VIF = vif_test.idxmax().iloc[0]
        X_redcued_test.drop(VIF, axis=1, inplace=True)
    else:
        # print("Stopping")
        VIF = []

In [37]:
top3_1_logged_model = "runs:/18344d4723fb4fb69155783ae8ddd437/better_models"  # replace with one of your models

# Load model as a PyFuncModel.
top3_1_loaded_model = mlflow.pyfunc.load_model(top3_1_logged_model)

top3_2_logged_model = "runs:/c7ab29591da9434ba65d90e8e3750fb9/better_models"  # replace with one of your models

# Load model as a PyFuncModel.
top3_2_loaded_model = mlflow.pyfunc.load_model(top3_2_logged_model)

top3_3_logged_model = "runs:/10b5fccad25f469ab55ea8f8c1b97586/better_models"  # replace with one of your models

# Load model as a PyFuncModel.
top3_3_loaded_model = mlflow.pyfunc.load_model(top3_3_logged_model)

best_model = mlflow.sklearn.load_model(top3_1_logged_model)
model_2 = mlflow.sklearn.load_model(top3_2_logged_model)
model_3 = mlflow.sklearn.load_model(top3_3_logged_model)

In [38]:
# mlflow.set_experiment("Lab2-student_performance_output")
with mlflow.start_run():
    counter = 0
    for model in [best_model, model_2, model_3]:
        counter += 1
        params = model.get_params()
        for phase in ["train", "validation", "test"]:
            if phase == "train":
                acc = cross_val_score(model, X_redcued_train, y_train).mean()
            elif phase == "validation":
                acc = cross_val_score(model, X_redcued_val, y_val).mean()
            elif phase == "test":
                acc = cross_val_score(best_model, X_redcued_test, y_test).mean()
            if counter == 1:
                mlflow.set_tag("Model_Variation", "Best")
            if counter == 2:
                mlflow.set_tag("Model_Variation", "2nd_Best")
            if counter == 3:
                mlflow.set_tag("Model_Variation", "3rd_Best")
            mlflow.set_tag("Model", "dt")
            mlflow.set_tag("Data", phase)
            mlflow.log_params(model.get_params())
            mlflow.log_metric("accuracy", acc)
            mlflow.sklearn.log_model(model, artifact_path="better_models")
            mlflow.end_run()



# Staging the best model on Github

In [None]:
runid = "18344d4723fb4fb69155783ae8ddd437"
mod_path = f"runs:/{runid}/artifacts/better_models"
mlflow.register_model(model_uri=mod_path, name="Lab2_Best_Model")

Successfully registered model 'Lab2_Best_Model'.
Created version '1' of model 'Lab2_Best_Model'.


<ModelVersion: aliases=[], creation_timestamp=1742622184659, current_stage='None', description=None, last_updated_timestamp=1742622184659, name='Lab2_Best_Model', run_id='e311596193dd48fca9cadd160c8e96d1', run_link=None, source='/Users/skier/MSDS/Spring2/Spring2-MSDS-MLOps/labs/lab2/mlruns/11/e311596193dd48fca9cadd160c8e96d1/artifacts/artifacts/better_models', status='READY', status_message=None, tags={}, user_id=None, version=1>