# 🚀 Module 2: Model Training and Experiment Tracking

In this module, we will:
1. Train a machine learning model on the processed dataset
2. Evaluate the model's performance
3. Track experiments and parameters using **MLflow**
4. (Optional) Register the trained model in the MLflow Model Registry

Make sure MLflow is installed in your environment:
```bash
pip install mlflow
```

In [None]:
# Install requirements
!pip install -r requirements.txt

## 📦 Import Required Libraries

Before we proceed with training and tracking our machine learning model, we need to import the necessary libraries.


In [2]:
# Import necessary modules
import os

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

import pandas as pd
import numpy as np

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

## 📥 Load the Processed Dataset

We'll start by loading the processed dataset for January 2011, which was prepared in the data exploration phase.

This dataset contains cleaned and feature-engineered data and will be used as the reference dataset for drift and performance comparison.

We use `pandas` to read the CSV files (January 2011 for training, February 2011 for testing) and inspect the first few rows of each.

In [3]:
# Load the training data
data_path = "../data/processed/"
train_df = pd.read_csv(data_path + 'data_2011_01.csv')
train_df.head()

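|   | dteday | instant | season | year | month | hour | holiday | weekday | workingday | weathersit | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 | 1 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2011-01-01 | 2 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 2011-01-01 | 3 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |
| 3 | 2011-01-01 | 4 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
| 4 | 2011-01-01 | 5 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |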


In [4]:
# Load the testing data
data_path = "../data/processed/"
test_df = pd.read_csv(data_path + 'data_2011_02.csv')
test_df.head()

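|   | dteday | instant | season | year | month | hour | holiday | weekday | workingday | weathersit | temp | atemp | humidity | windspeed | casual | registered | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-02-01 | 689 | 1 | 0 | 2 | 0 | 0 | 2 | 1 | 2 | 0.16 | 0.1818 | 0.64 | 0.1045 | 2 | 6 | 8 |
| 1 | 2011-02-01 | 690 | 1 | 0 | 2 | 1 | 0 | 2 | 1 | 2 | 0.16 | 0.1818 | 0.69 | 0.1045 | 0 | 3 | 3 |
| 2 | 2011-02-01 | 691 | 1 | 0 | 2 | 2 | 0 | 2 | 1 | 2 | 0.16 | 0.2273 | 0.69 | 0.0 | 0 | 2 | 2 |
| 3 | 2011-02-01 | 692 | 1 | 0 | 2 | 3 | 0 | 2 | 1 | 2 | 0.16 | 0.2273 | 0.69 | 0.0 | 0 | 2 | 2 |
| 4 | 2011-02-01 | 693 | 1 | 0 | 2 | 5 | 0 | 2 | 1 | 3 | 0.14 | 0.2121 | 0.93 | 0.0 | 0 | 3 | 3 |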


In [8]:
# Columns used as model inputs
numerical_features = ['temp', 'atemp', 'humidity', 'windspeed', 'hour', 'weekday']
categorical_features = ['season', 'holiday', 'workingday']
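
The feature lists above assume that every named column exists in both monthly files. A minimal sketch of a pre-flight check follows. `missing_columns` is a hypothetical helper that is not part of the original notebook, and it only compares column names, not dtypes or value ranges.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`, in order."""
    available_set = set(available)
    return [col for col in required if col not in available_set]


# Column names copied from the train_df.head() output shown earlier.
available = [
    "dteday", "instant", "season", "year", "month", "hour", "holiday",
    "weekday", "workingday", "weathersit", "temp", "atemp", "humidity",
    "windspeed", "casual", "registered", "count",
]
numerical_features = ["temp", "atemp", "humidity", "windspeed", "hour", "weekday"]
categorical_features = ["season", "holiday", "workingday"]

# "count" is the target column used later in the notebook.
required = numerical_features + categorical_features + ["count"]
missing = missing_columns(available, required)
print("Missing columns:", missing)
```

In the notebook itself, `train_df.columns` and `test_df.columns` could be passed as `available` so that a mismatch between the two months is caught before training.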

## ✂️ Prepare Features and Target Variable

In [12]:
# Define training features and target (January 2011)
X_train = train_df[numerical_features + categorical_features]
y_train = train_df["count"]

# No random train/test split is needed: February 2011 serves as the test set.
X_train.head()

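|   | temp | atemp | humidity | windspeed | hour | weekday | season | holiday | workingday |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.24 | 0.2879 | 0.81 | 0.0 | 0 | 6 | 1 | 0 | 0 |
| 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 1 | 6 | 1 | 0 | 0 |
| 2 | 0.22 | 0.2727 | 0.8 | 0.0 | 2 | 6 | 1 | 0 | 0 |
| 3 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 6 | 1 | 0 | 0 |
| 4 | 0.24 | 0.2879 | 0.75 | 0.0 | 4 | 6 | 1 | 0 | 0 |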


In [13]:
# Define test features and target (February 2011)
X_test = test_df[numerical_features + categorical_features]
y_test = test_df["count"]

X_test.head()

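|   | temp | atemp | humidity | windspeed | hour | weekday | season | holiday | workingday |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.16 | 0.1818 | 0.64 | 0.1045 | 0 | 2 | 1 | 0 | 1 |
| 1 | 0.16 | 0.1818 | 0.69 | 0.1045 | 1 | 2 | 1 | 0 | 1 |
| 2 | 0.16 | 0.2273 | 0.69 | 0.0 | 2 | 2 | 1 | 0 | 1 |
| 3 | 0.16 | 0.2273 | 0.69 | 0.0 | 3 | 2 | 1 | 0 | 1 |
| 4 | 0.14 | 0.2121 | 0.93 | 0.0 | 5 | 2 | 1 | 0 | 1 |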


## 🤖 Train a Regression Model

In [16]:
# Define and train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

## 📊 Evaluate Model Performance

In [17]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")

RMSE: 31.77
R² Score: 0.75
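
To make the two metrics concrete, here is a small worked example that computes RMSE and R² by hand using only the standard library. The function names `rmse_by_hand` and `r2_by_hand` and the toy values are illustrative assumptions. They are not the notebook's data, and the notebook itself relies on scikit-learn for these metrics.

```python
import math


def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not taken from the bike-sharing data).
toy_true = [10, 20, 30, 40]
toy_pred = [12, 18, 33, 37]

print(f"RMSE: {rmse_by_hand(toy_true, toy_pred):.2f}")  # sqrt(26 / 4)
print(f"R² Score: {r2_by_hand(toy_true, toy_pred):.2f}")  # 1 - 26 / 500
```

RMSE is expressed in the same units as the target (rentals per hour here), while R² compares the model against a predictor that always outputs the mean of `y_true`.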


## 📝 Track Experiments with MLflow

In [18]:
# Ask the user for a number to customize the run name
run_number = input("Enter a run number for this experiment (Remember! It should be a new number each time!): ")
run_name = f"random_forest_baseline_{run_number}"
print(f"Generated MLflow run name: {run_name}")

# Point MLflow at the local tracking server and select the experiment
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("bike_sharing_model")

with mlflow.start_run(run_name=run_name):
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", model.n_estimators)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)

    # Log under the artifact path "model" so the registration step can
    # reference it as runs:/<run_id>/model
    mlflow.sklearn.log_model(model, "model")
    print("Model and metrics logged to MLflow.")

Enter a run number for this experiment (Remember! It should be a new number each time!):  10


Generated MLflow run name: random_forest_baseline_10




Model and metrics logged to MLflow.
🏃 View run random_forest_baseline_10 at: http://localhost:5000/#/experiments/120047244423298152/runs/cd611bbea71e4006a2c1668522776c47
🧪 View experiment at: http://localhost:5000/#/experiments/120047244423298152
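
The tracking cell asks for a fresh number on every run. One possible alternative, sketched below, derives the suffix from the current UTC time so that no manual input is needed. `make_run_name` is a hypothetical helper, not something the notebook or MLflow provides, and two runs started within the same second would still share a name.

```python
from datetime import datetime, timezone


def make_run_name(prefix="random_forest_baseline", now=None):
    """Build a run name from a prefix and a UTC timestamp.

    `now` is expected to be a timezone-aware UTC datetime; it defaults to
    the current time and is exposed mainly to make the helper testable.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%dT%H%M%SZ')}"


run_name = make_run_name()
print(f"Generated MLflow run name: {run_name}")
```

The resulting string can be passed to `mlflow.start_run(run_name=run_name)` exactly as in the cell above. Run names are only labels; each run is still identified by its own run ID.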


In [19]:
# Get the experiment by name
experiment = mlflow.get_experiment_by_name("bike_sharing_model")

# Load all runs from the experiment
client = MlflowClient()
runs = client.search_runs(experiment_ids=[experiment.experiment_id])

# Display runs in a DataFrame (pandas is already imported as pd)

df_runs = pd.DataFrame([{
    "Run ID": run.info.run_id,
    "Run Name": run.data.tags.get("mlflow.runName"),
    "RMSE": run.data.metrics.get("rmse"),
    "R2": run.data.metrics.get("r2"),
    "Date": run.info.start_time
} for run in runs])

df_runs.sort_values("Date", ascending=False).reset_index(drop=True)

|   | Run ID | Run Name | RMSE | R2 | Date |
|---|---|---|---|---|---|
| 0 | cd611bbea71e4006a2c1668522776c47 | random_forest_baseline_10 | 31.772359 | 0.750254 | 1746699157421 |
| 1 | 348b1d7a4db24bf19e5b48c074ae6bad | random_forest_baseline_50 | 2.763474 | 0.999759 | 1746684781029 |
| 2 | bd324dc36d764539aae2d7e5226fd5e9 | random_forest_baseline_20 | 2.763474 | 0.999759 | 1746684752364 |
| 3 | 9e9aa0888ca24e29b30d183f532db3c5 | mysterious-crane-350 | 2.763474 | 0.999759 | 1746684297325 |
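
The `Date` column holds `run.info.start_time`, which is a Unix timestamp in milliseconds and is hard to read as is. A minimal sketch of converting one such value to a UTC datetime with the standard library follows. `ms_to_utc` is a hypothetical helper name, and the input value is the start time of the first run in the table above.

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(epoch_ms):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    seconds, millis = divmod(epoch_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Start time of run random_forest_baseline_10, taken from the runs table.
started = ms_to_utc(1746699157421)
print(started.isoformat(timespec="seconds"))
```

For the whole column, `pd.to_datetime(df_runs["Date"], unit="ms", utc=True)` performs the same conversion in pandas.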


## 🗃️ (Optional) Register the Model in MLflow

In [20]:
# Set up MLflow client and experiment
client = MlflowClient()
experiment = mlflow.get_experiment_by_name("bike_sharing_model")
runs = client.search_runs(experiment_ids=[experiment.experiment_id])

# Show available run names
run_names = [run.data.tags.get("mlflow.runName") for run in runs]
print("Available run names:")
for name in run_names:
    print("-", name)

# Ask user to select a run name
selected_name = input("Enter the run name to register its model: ")

# Find the corresponding run ID
selected_run = next(run for run in runs if run.data.tags.get("mlflow.runName") == selected_name)
run_id = selected_run.info.run_id

# Register the model from the selected run
model_uri = f"runs:/{run_id}/model"
result = mlflow.register_model(model_uri, "BikeSharingModel")

print(f"Model registered: {result.name} v{result.version}")

Available run names:
- random_forest_baseline_10
- random_forest_baseline_50
- random_forest_baseline_20
- mysterious-crane-350


Enter the run name to register its model:  random_forest_baseline_10


Registered model 'BikeSharingModel' already exists. Creating a new version of this model...
2025/05/08 10:13:00 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: BikeSharingModel, version 2


Model registered: BikeSharingModel v2


Created version '2' of model 'BikeSharingModel'.
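
The registration cell looks up the run with a bare `next(...)`, which raises an uninformative `StopIteration` if the name is mistyped and silently picks the first match if two runs share a name. A more defensive lookup could look like the sketch below. `find_run_id` and `fake_run` are hypothetical helpers; the stand-in objects only mimic the `info.run_id` and `data.tags` attributes that the notebook reads from MLflow runs, and they reuse run IDs from the table above.

```python
from types import SimpleNamespace


def find_run_id(runs, run_name):
    """Return the ID of the single run named `run_name`, or raise ValueError."""
    matches = [
        run.info.run_id
        for run in runs
        if run.data.tags.get("mlflow.runName") == run_name
    ]
    if not matches:
        raise ValueError(f"No run named {run_name!r} was found.")
    if len(matches) > 1:
        raise ValueError(f"Run name {run_name!r} is ambiguous: {matches}")
    return matches[0]


def fake_run(run_id, name):
    """Stand-in for an MLflow run, used here only for illustration."""
    return SimpleNamespace(
        info=SimpleNamespace(run_id=run_id),
        data=SimpleNamespace(tags={"mlflow.runName": name}),
    )


runs = [
    fake_run("cd611bbea71e4006a2c1668522776c47", "random_forest_baseline_10"),
    fake_run("348b1d7a4db24bf19e5b48c074ae6bad", "random_forest_baseline_50"),
]
run_id = find_run_id(runs, "random_forest_baseline_10")
print(f"runs:/{run_id}/model")
```

In the notebook, the list returned by `client.search_runs(...)` would take the place of the stand-in `runs` list.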


## ✅ Summary
- Trained and evaluated a regression model.
- Tracked parameters, metrics, and artifacts with MLflow.
- Optionally registered the model.

Next, we will package and deploy the model on Kubernetes!