# 🚀 Module 4: Review the Experiments & Select the Best Model

In this module, we will:
1. Train a Machine Learning Model and Evaluate its Performance
2. Track Experiments and Parameters using **MLflow**
3. Register the Trained Model with the Best Performance

Make sure MLflow is installed in your environment:

```bash
pip install mlflow
```

In [21]:
# Install requirements
!pip install -r requirements.txt


## 📦 Import Required Libraries

Before we proceed with training and tracking our machine learning model, we need to import the necessary libraries.


In [1]:
# Import necessary modules
import os
import random

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

import pandas as pd
import numpy as np

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

## 📥 Load the Processed Dataset

We start by loading the processed monthly datasets, beginning with January 2011, which were prepared in the data exploration phase.

The data is cleaned and feature-engineered and serves as the reference dataset for drift and performance comparison.

We use `pandas` to read the CSV files, concatenate them into a single DataFrame, and inspect the first few rows.

In [None]:
# Load the training data
data_path = "./data/processed/"

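# Read both monthly CSV files
# (DATA_MONTH_1 and DATA_MONTH_2 appear to be placeholders; replace them with the actual file names)
data_01 = pd.read_csv(data_path + 'DATA_MONTH_1')
data_02 = pd.read_csv(data_path + 'DATA_MONTH_2')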
# data_03 = pd.read_csv(data_path + 'data_2011_03.csv')

# Concatenate the datasets
# input_data_df = pd.concat([data_01, data_02, data_03], ignore_index=True)
input_data_df = pd.concat([data_01, data_02], ignore_index=True)

input_data_df.head()

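| | dteday | instant | season | year | month | hour | holiday | weekday | workingday | weathersit | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 | 2 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 | 3 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 | 4 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 | 5 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |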


We will now continue with the steps in task 4 (`Review the Experiments & Select the Best Model`).

## 📋 Retrieve and Review Experiment Runs

We query the MLflow tracking server to retrieve all runs associated with the `bike_sharing_model_{PARTICIPANT_FIRSTNAME}` experiment:

- Use `MlflowClient` to fetch all runs.

- Extract relevant details such as Run ID, Run Name, RMSE, R², and Start Time.

- Display the runs in a pandas DataFrame for easier inspection and comparison.

The runs are sorted by start time (stored by MLflow as milliseconds since the Unix epoch), most recent first. This lets us review recent experiments and identify the best-performing model, that is, the run with the lowest RMSE and the highest R².

In [None]:
MLFLOW_TRACKING_URI = 'MLFLOW_REMOTE_TRACKING_SERVER'  # Replace with your tracking server URI
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

# Get the experiment by name (set_experiment creates it if it does not exist yet)
PARTICIPANT_FIRSTNAME = 'YOUR_FIRSTNAME'  # Replace with your first name
experiment = mlflow.set_experiment(f"bike_sharing_model_{PARTICIPANT_FIRSTNAME}")

# Load all runs from the experiment
client = MlflowClient()
runs = client.search_runs(experiment_ids=[experiment.experiment_id])

# Collect the key details of each run in a DataFrame
df_runs = pd.DataFrame([{
    "Run ID": run.info.run_id,
    "Run Name": run.data.tags.get("mlflow.runName"),
    "RMSE": run.data.metrics.get("rmse"),
    "R2": run.data.metrics.get("r2"),
    "Date": run.info.start_time  # start time in milliseconds since the Unix epoch
} for run in runs])

# Show the most recent runs first
df_runs.sort_values("Date", ascending=False).reset_index(drop=True)

| | Run ID | Run Name | RMSE | R2 | Date |
|---|---|---|---|---|---|
| 0 | 4f79bec101d547cba5b11a98b86f5d47 | random_forest_baseline_50_2235 | 20.682602 | 0.873506 | 1749575876951 |
| 1 | 218cf2fd397541ce8e3106d998d3e25f | random_forest_baseline_50_5269 | 20.866728 | 0.871244 | 1749575857719 |
| 2 | 753d6da466b54be3af662ecfa40b7011 | random_forest_baseline_50_6336 | 28.849951 | 0.753878 | 1749575850003 |
| 3 | 5aeb4210d6414e8d9c63b5f8e561aa36 | random_forest_baseline_50_1820 | 43.473912 | 0.441122 | 1749575834377 |
| 4 | dde1ea36fb894347a1fa7a79b95ff8a6 | random_forest_baseline_200_9507 | 43.45814 | 0.441527 | 1749575817326 |
| 5 | f8c461ccccb443618f8cb058a1c4e0ed | random_forest_baseline_200_2288 | 28.344997 | 0.762419 | 1749575807361 |
| 6 | de6bbac5ba044e299071796755cd07c7 | random_forest_baseline_200_5349 | 20.734117 | 0.872875 | 1749575796513 |
| 7 | 9f41eb6e02ef403cbfedb46d778a15fa | random_forest_baseline_200_7665 | 20.538956 | 0.875257 | 1749575775844 |
| 8 | 0c7d96949a2646e9adf949ea505ecc40 | random_forest_baseline_100_7233 | 20.388602 | 0.877077 | 1749575760894 |
| 9 | 337e650547324f86a874712ff97617d5 | random_forest_baseline_100_3667 | 20.49174 | 0.87583 | 1749575747423 |
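
To pick a candidate from a table like the one above, it can help to rank the runs by RMSE and turn the raw start times into readable timestamps. The following is a minimal sketch using only the standard library. The three rows are copied from the output above, while `runs_summary` and `start_time_utc` are illustrative names that are not part of the notebook.

```python
from datetime import datetime, timezone

# Three rows copied from the run table above (illustrative subset, not the full experiment).
runs_summary = [
    {"Run Name": "random_forest_baseline_50_2235", "RMSE": 20.682602, "R2": 0.873506, "Date": 1749575876951},
    {"Run Name": "random_forest_baseline_200_7665", "RMSE": 20.538956, "R2": 0.875257, "Date": 1749575775844},
    {"Run Name": "random_forest_baseline_100_7233", "RMSE": 20.388602, "R2": 0.877077, "Date": 1749575760894},
]


def start_time_utc(epoch_ms):
    """Convert an MLflow start time (milliseconds since the Unix epoch) to a UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# Lower RMSE is better, so sort ascending and take the first entry.
ranked = sorted(runs_summary, key=lambda r: r["RMSE"])
best = ranked[0]

for r in ranked:
    print(f'{r["Run Name"]}: RMSE={r["RMSE"]:.3f}, R2={r["R2"]:.3f}, started {start_time_utc(r["Date"]):%Y-%m-%d %H:%M:%S} UTC')

print("Best run by RMSE:", best["Run Name"])
```

Ranking by RMSE alone is one possible criterion. In this table the run with the lowest RMSE also has the highest R², so both metrics point to the same run.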


## 🏆 Select and Register the Best Model Run
You are prompted to enter the run name of the best-performing experiment run. Based on this input:

- We locate the matching run and retrieve its unique run ID.

- Using the run ID, we construct the model URI to register the model in the MLflow Model Registry.

- The model is registered under a descriptive name (`BikeSharingModel_{PARTICIPANT_FIRSTNAME}`), enabling version control and easy deployment.

This process ensures the chosen model is formally tracked and available for production use.

In [None]:
# Ask user to select a run name (strip stray whitespace from the input)
selected_name = input("Enter the run name with the best performance to register its model: ").strip()

# Find the corresponding run ID; fail with a clear message if the name is unknown
selected_run = next((run for run in runs if run.data.tags.get("mlflow.runName") == selected_name), None)
if selected_run is None:
    raise ValueError(f"No run named '{selected_name}' found in the experiment.")
run_id = selected_run.info.run_id

# Register the model from the selected run
model_uri = f"runs:/{run_id}/model"
MODEL_NAME = f"BikeSharingModel_{PARTICIPANT_FIRSTNAME}"
result = mlflow.register_model(model_uri, MODEL_NAME)

print(f"Model registered: {result.name} v{result.version}")

Enter the run name with the best performance to register its model:  random_forest_baseline_100_7233


Successfully registered model 'BikeSharingModel'.
2025/06/10 17:19:02 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: BikeSharingModel, version 1


Model registered: BikeSharingModel v1


Created version '1' of model 'BikeSharingModel'.
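
Once a version exists in the registry, it can be addressed by name and version instead of by run ID. The sketch below shows the two URI forms involved, assuming the model was logged under the `model` artifact path as in the cell above. The helper names (`run_model_uri`, `registered_model_uri`, `load_registered_model`) are hypothetical and not part of the notebook, and the loading helper needs MLflow installed and a reachable registry.

```python
def run_model_uri(run_id, artifact_path="model"):
    """URI of a model logged under a run, the form passed to mlflow.register_model."""
    return f"runs:/{run_id}/{artifact_path}"


def registered_model_uri(name, version):
    """URI of a specific model version in the MLflow Model Registry."""
    return f"models:/{name}/{version}"


def load_registered_model(name, version):
    """Load a registered version as a generic pyfunc model (requires MLflow and a reachable registry)."""
    import mlflow.pyfunc  # imported lazily so the URI helpers work without MLflow installed
    return mlflow.pyfunc.load_model(registered_model_uri(name, version))


# Run ID, model name, and version taken from the outputs shown in this notebook.
print(run_model_uri("0c7d96949a2646e9adf949ea505ecc40"))
print(registered_model_uri("BikeSharingModel", 1))
```

With the tracking URI configured, `load_registered_model("BikeSharingModel", 1)` would return a model object whose `predict` method accepts a pandas DataFrame of features.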
