# 🚀 Module 2: Model Training and Experiment Tracking

In this module, we will:
1. Train a machine learning model on the processed dataset
2. Evaluate the model's performance
3. Track experiments and parameters using **MLflow**
4. (Optional) Register the trained model in the MLflow Model Registry

Make sure MLflow is installed in your environment:
```bash
pip install mlflow
```

## 📥 Load the Processed Dataset

In [1]:
import pandas as pd

data_path = "../data/processed/hour_processed.csv"
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,instant,dteday,season,year,month,hour,holiday,weekday,workingday,weathersit,temp,atemp,humidity,windspeed,casual,registered,count
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


## ✂️ Prepare Features and Target Variable

In [2]:
from sklearn.model_selection import train_test_split

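# Define features and target
# Note: "casual" and "registered" add up to "count" in the preview above,
# so keeping them in X leaks the target into the features.
X = df.drop(columns=["count", "dteday"])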
y = df["count"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
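
Before training, it can help to check whether any feature determines the target outright. In the five rows previewed earlier, `casual + registered` equals `count` every time (for example, 3 + 13 = 16). The sketch below runs that check on a small hand-copied sample rather than the full file. The `leaky_columns` list and the choice to also drop `instant` (a row counter) are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

# Hand-copied sample of the preview rows (only the columns needed here)
sample = pd.DataFrame({
    "instant": [1, 2, 3, 4, 5],
    "hour": [0, 1, 2, 3, 4],
    "temp": [0.24, 0.22, 0.22, 0.24, 0.24],
    "casual": [3, 8, 5, 3, 0],
    "registered": [13, 32, 27, 10, 1],
    "count": [16, 40, 32, 13, 1],
})

# True if the two rider columns always add up to the target
is_exact_sum = bool((sample["casual"] + sample["registered"] == sample["count"]).all())
print(f"casual + registered == count for every row: {is_exact_sum}")

# Assumed alternative feature set: drop the target, its two components, and the row counter
leaky_columns = ["casual", "registered", "instant"]
X_no_leak = sample.drop(columns=["count"] + leaky_columns)
print(list(X_no_leak.columns))
```

Applied to the full `df`, the same `drop` call would also need to exclude `dteday`. A model trained without `casual` and `registered` will likely score less flatteringly, but its metrics would reflect what it can predict from weather and calendar features alone.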

## 🤖 Train a Regression Model

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Define and train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

## 📊 Evaluate Model Performance

In [4]:
import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.2f}")

RMSE: 2.76
R² Score: 1.00
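
For reference, both metrics can be computed by hand, which makes their definitions explicit: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers (`y_true_demo`, `y_pred_demo`), not the bike-sharing data.

```python
import numpy as np

# Toy values, chosen only to keep the arithmetic easy to follow
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

residuals = y_true_demo - y_pred_demo

# RMSE: square root of the mean squared residual
rmse_demo = np.sqrt(np.mean(residuals ** 2))

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_demo:.3f}")  # 1.291
print(f"R²: {r2_demo:.3f}")      # 0.375
```

An R² that rounds to 1.00, as in the output above, means the residual sum of squares is tiny relative to the total variance in `count`.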


## 📝 Track Experiments with MLflow

In [8]:
import mlflow
import mlflow.sklearn

# Ask user for a number to customize the run name
run_number = input("Enter a run number for this experiment (use a new number each time): ")
run_name = f"random_forest_baseline_{run_number}"
print(f"Generated MLflow run name: {run_name}")

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("bike_sharing_model")

with mlflow.start_run(run_name=run_name):
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", model.n_estimators)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)

    mlflow.sklearn.log_model(model, "model")
    print("Model and metrics logged to MLflow.")

Enter a run number for this experiment:  50


Generated MLflow run name: random_forest_baseline_50




Model and metrics logged to MLflow.
🏃 View run random_forest_baseline_50 at: http://localhost:5000/#/experiments/120047244423298152/runs/348b1d7a4db24bf19e5b48c074ae6bad
🧪 View experiment at: http://localhost:5000/#/experiments/120047244423298152
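
Typing a fresh number for every run is easy to get wrong. One alternative, sketched here as an assumption rather than as part of the original workflow, is to derive the suffix from the current UTC time. The helper name `make_run_name` is hypothetical.

```python
from datetime import datetime, timezone

def make_run_name(prefix="random_forest_baseline", now=None):
    """Return a run name such as 'random_forest_baseline_20250508-061301'."""
    # Use the supplied time if given (handy for testing), else the current UTC time
    now = now or datetime.now(timezone.utc)
    return f"{prefix}_{now.strftime('%Y%m%d-%H%M%S')}"

run_name = make_run_name()
print(f"Generated MLflow run name: {run_name}")
```

The resulting `run_name` can be passed to `mlflow.start_run(run_name=run_name)` in place of the interactive prompt. Names built this way are distinct only down to the second, which is usually enough, since MLflow identifies each run by its own run ID rather than by its name.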


In [9]:
import mlflow
import pandas as pd

# Get the experiment by name
experiment = mlflow.get_experiment_by_name("bike_sharing_model")

# Load all runs from the experiment
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(experiment_ids=[experiment.experiment_id])

# Display runs in a DataFrame
df_runs = pd.DataFrame([{
    "Run ID": run.info.run_id,
    "Run Name": run.data.tags.get("mlflow.runName"),
    "RMSE": run.data.metrics.get("rmse"),
    "R2": run.data.metrics.get("r2"),
    "Date": run.info.start_time  # Unix timestamp in milliseconds
} for run in runs])

df_runs.sort_values("Date", ascending=False).reset_index(drop=True)

Unnamed: 0,Run ID,Run Name,RMSE,R2,Date
0,348b1d7a4db24bf19e5b48c074ae6bad,random_forest_baseline_50,2.763474,0.999759,1746684781029
1,bd324dc36d764539aae2d7e5226fd5e9,random_forest_baseline_20,2.763474,0.999759,1746684752364
2,9e9aa0888ca24e29b30d183f532db3c5,mysterious-crane-350,2.763474,0.999759,1746684297325
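
The `Date` column holds `run.info.start_time`, which MLflow reports as a Unix timestamp in milliseconds. The sketch below converts it to readable UTC datetimes with pandas, using two values copied from the table above. The `runs_demo` frame and the `Started (UTC)` column name are assumptions for illustration.

```python
import pandas as pd

runs_demo = pd.DataFrame({
    "Run Name": ["random_forest_baseline_50", "random_forest_baseline_20"],
    "Date": [1746684781029, 1746684752364],
})

# Interpret the integers as milliseconds since the Unix epoch (UTC)
runs_demo["Started (UTC)"] = pd.to_datetime(runs_demo["Date"], unit="ms")
print(runs_demo[["Run Name", "Started (UTC)"]])
```

The same `pd.to_datetime(..., unit="ms")` call can be applied to `df_runs["Date"]` before sorting.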


## 🗃️ (Optional) Register the Model in MLflow

In [10]:
import mlflow
from mlflow.tracking import MlflowClient

# Set up MLflow client and experiment
client = MlflowClient()
experiment = mlflow.get_experiment_by_name("bike_sharing_model")
runs = client.search_runs(experiment_ids=[experiment.experiment_id])

# Show available run names
run_names = [run.data.tags.get("mlflow.runName") for run in runs]
print("Available run names:")
for name in run_names:
    print("-", name)

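# Ask user to select a run name
selected_name = input("Enter the run name to register its model: ").strip()

# Find the corresponding run ID; fail with a clear message if the name is unknown
selected_run = next((run for run in runs if run.data.tags.get("mlflow.runName") == selected_name), None)
if selected_run is None:
    raise ValueError(f"No run named {selected_name!r} in this experiment")
run_id = selected_run.info.run_id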

# Register the model from the selected run
model_uri = f"runs:/{run_id}/model"
result = mlflow.register_model(model_uri, "BikeSharingModel")

print(f"Model registered: {result.name} v{result.version}")

Available run names:
- random_forest_baseline_50
- random_forest_baseline_20
- mysterious-crane-350


Enter the run name to register its model:  random_forest_baseline_20


Successfully registered model 'BikeSharingModel'.
2025/05/08 06:17:16 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: BikeSharingModel, version 1


Model registered: BikeSharingModel v1


Created version '1' of model 'BikeSharingModel'.
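
Once a version is registered, it can be referenced by name and version instead of by run ID, using a `models:/<name>/<version>` URI. The sketch below assumes the same tracking server at `http://localhost:5000` is reachable and that version 1 of `BikeSharingModel` exists, as in the output above. Both helper names are hypothetical.

```python
def registered_model_uri(name, version):
    """Build a Model Registry URI of the form models:/<name>/<version>."""
    return f"models:/{name}/{version}"

def load_registered_model(name, version, tracking_uri="http://localhost:5000"):
    """Load a registered model as a generic pyfunc model (requires a running server)."""
    # Imported here so the URI helper above works without MLflow installed
    import mlflow
    import mlflow.pyfunc

    mlflow.set_tracking_uri(tracking_uri)
    return mlflow.pyfunc.load_model(registered_model_uri(name, version))

print(registered_model_uri("BikeSharingModel", 1))
```

With the server running, `load_registered_model("BikeSharingModel", 1).predict(X_test)` should return predictions for the held-out rows.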


## ✅ Summary
- Trained and evaluated a regression model.
- Tracked parameters, metrics, and artifacts with MLflow.
- Optionally registered the model.

Next, we will package and deploy the model on Kubernetes!