# 🚀 Module 3: Model Training and Experiment Tracking

Make sure MLflow is installed in your environment:

```bash
pip install mlflow
```

In [21]:
# Install requirements
!pip install -r requirements.txt


## 📦 Import Required Libraries

Before we proceed with training and tracking our machine learning model, we need to import the necessary libraries.


In [1]:
# Import necessary modules
import os
import random

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

import pandas as pd
import numpy as np

import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient

## 📥 Load the Train and Test Datasets

We use `pandas` to read the training and test splits (features and targets) from CSV files.

In [2]:
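# Directory containing the data splits
split_dir = "./data/split/"

# Load the feature and target splits
X_train = pd.read_csv(os.path.join(split_dir, "X_train.csv"))
X_test = pd.read_csv(os.path.join(split_dir, "X_test.csv"))
# squeeze() turns the single-column target DataFrames into Series
y_train = pd.read_csv(os.path.join(split_dir, "y_train.csv")).squeeze()
y_test = pd.read_csv(os.path.join(split_dir, "y_test.csv")).squeeze()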

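Display the first few rows of the **training and test feature sets** to verify their structure and content:

In [None]:
# Only the last expression in a cell is rendered automatically, so print each preview explicitly
print("X_train")
print(X_train.head())
print("X_test")
print(X_test.head())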

## ✅ Set Necessary Variables

In [None]:
# TODO: set the required variables below before running the rest of the notebook

# Variables for model tracking
MLFLOW_REMOTE_TRACKING_SERVER =  # set the provided URL of the MLflow tracking server

# Used to build a unique MLflow experiment name
YOUR_FIRSTNAME =  # set your first name, for example "mohammad"

# Variables for model training
N_ESTIMATORS =  # set the number of estimators (trees)
MAX_DEPTH =  # set the maximum depth of each tree
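
Because the cell above is left blank on purpose, a small sanity check can catch a forgotten or mistyped value before training starts. The sketch below is only an illustration: `check_settings` is a hypothetical helper, not part of the workshop material or of MLflow, and the rules it enforces (an http/https URL, a non-empty name, positive integers) are assumptions about what valid values look like. Note that scikit-learn also accepts `max_depth=None` for unlimited depth; the check rejects it only because this notebook asks for a number.

```python
from urllib.parse import urlparse


def check_settings(tracking_uri, firstname, n_estimators, max_depth):
    """Return a list of problems found in the notebook settings (empty if none).

    Hypothetical helper: the validation rules are assumptions, not workshop requirements.
    """
    problems = []

    # The tracking server is assumed to be reachable over http(s).
    parsed = urlparse(tracking_uri) if isinstance(tracking_uri, str) else None
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("MLFLOW_REMOTE_TRACKING_SERVER should be an http(s) URL")

    # The first name becomes part of the experiment name, so it must not be empty.
    if not isinstance(firstname, str) or not firstname.strip():
        problems.append("YOUR_FIRSTNAME should be a non-empty string")

    # Both hyperparameters are assumed to be positive integers here.
    for name, value in (("N_ESTIMATORS", n_estimators), ("MAX_DEPTH", max_depth)):
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            problems.append(f"{name} should be a positive integer")

    return problems


# Example calls with placeholder values (the URL is made up for illustration).
print(check_settings("http://localhost:5000", "mohammad", 100, 10))
print(check_settings("", "", 0, None))
```

An empty list means every setting passed; otherwise each entry names the variable to fix.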

## 🤖 Train a Regression Model

We train a Random Forest Regressor to predict bike rental counts using the training dataset:

``n_estimators``: Number of decision trees the model builds (e.g. 50, 100, 200)

``max_depth``: Maximum depth of each tree (e.g. 2, 6, 10, 15)

``random_state``: Fixes the random seed so that results are reproducible (e.g. 42)

After fitting the model on `X_train` and `y_train`, we generate predictions on the test set (`X_test`) and store them in `y_pred`. These predictions are then evaluated to measure model performance.

In [93]:
# Define and train model
# n_estimators = PUT NUMBER OF ESTIMATORS HERE  # Replace with actual number of estimators
# max_depth = PUT MAX DEPTH HERE  # Replace with actual max depth
n_estimators = N_ESTIMATORS
max_depth = MAX_DEPTH
random_state = 42  # By default it is set to 42 for reproducibility

model_randomforest = RandomForestRegressor(
    n_estimators=n_estimators, 
    random_state=random_state,
    max_depth=max_depth
    )
model_randomforest.fit(X_train, y_train)

# Predict on test set
y_pred = model_randomforest.predict(X_test)

# print(y_pred)
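
The role of `random_state` can be illustrated without scikit-learn. By default, a random forest trains each tree on a bootstrap sample (rows drawn with replacement), so fixing the seed fixes which rows each tree sees. The sketch below uses only the standard library; `bootstrap_indices` is a hypothetical helper written for this illustration and does not mirror scikit-learn's internals.

```python
import random


def bootstrap_indices(n_rows, seed):
    """Draw n_rows row indices with replacement using a seeded generator."""
    rng = random.Random(seed)  # private generator, leaves the global one untouched
    return [rng.randrange(n_rows) for _ in range(n_rows)]


same_a = bootstrap_indices(10, seed=42)
same_b = bootstrap_indices(10, seed=42)
other = bootstrap_indices(10, seed=7)

print(same_a == same_b)  # True: the same seed reproduces the same sample
print(same_a == other)   # a different seed will typically give a different sample
```

This is why two runs with the same data, hyperparameters, and `random_state` should produce the same model, which makes runs comparable when only one setting is changed at a time.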

## 📈 Model Evaluation Performance

💡 **Note:** Data scientists usually select the most appropriate model based on the specific problem and the characteristics of the data. For the **bike sharing forecasting** problem, we use two key regression metrics, ``rmse`` and ``r2``, to evaluate the performance of the trained Random Forest model:

- **RMSE** (Root Mean Squared Error): Measures the typical magnitude of prediction errors, in the same units as the target. Large errors are penalized more heavily, and lower values indicate better model performance.

- **R² Score** (Coefficient of Determination): Indicates how much of the variance in the target variable the model explains. A value closer to 1.0 means a better fit.
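
To make the two definitions concrete, here is a minimal sketch that computes both metrics by hand with the standard library. The helper names `rmse_score` and `r2_manual` and the four sample values are made up for illustration; the notebook itself relies on the scikit-learn functions.

```python
import math


def rmse_score(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, not taken from the bike sharing data.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(f"RMSE: {rmse_score(y_true, y_pred):.2f}")  # RMSE: 0.71
print(f"R2:   {r2_manual(y_true, y_pred):.2f}")   # R2:   0.90
```

Two of the four predictions are off by one unit, so the squared errors sum to 2, giving an RMSE of sqrt(2/4) ≈ 0.71 and an R² of 1 - 2/20 = 0.90.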


In [94]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE for {n_estimators} number of estimators: {rmse:.2f}")
print(f"R² Score for {n_estimators} number of estimators: {r2:.2f}")

RMSE for 100 number of estimators: 20.49
R² Score for 100 number of estimators: 0.88


## 🧪 Track Experiments with MLflow

We use MLflow to track the experiment run, including model parameters, metrics, and the trained model itself:

- Run Name: Automatically generated from the number of estimators plus a random 4-digit suffix (e.g., `random_forest_baseline_100_<4-digit number>`) for better traceability.

- Tracking URI: Points to the MLflow server used for logging and managing experiments.

- Experiment Name: Set to `bike_sharing_<your first name>` to group your related runs.

Within the `mlflow.start_run()` context:

- We log model parameters: the model type (RandomForest), the number of estimators, the maximum depth, and the random state.

- We log evaluation metrics: RMSE and R² score.

- The trained model is saved and logged using `mlflow.sklearn.log_model()`.

This makes each run easier to reproduce and to compare with other runs in the MLflow dashboard.

In [None]:
MLFLOW_TRACKING_URI = MLFLOW_REMOTE_TRACKING_SERVER
PARTICIPANT_FIRSTNAME = YOUR_FIRSTNAME

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
experiment = mlflow.set_experiment(f"bike_sharing_{PARTICIPANT_FIRSTNAME}")

random_num = random.randint(1000, 9999)  # generates a 4-digit random number
run_name = f"random_forest_baseline_{n_estimators}_{random_num}"

print(f"MLflow run name based on the number of estimators: {run_name}")

# directory_path = "../model"
# os.makedirs(directory_path, exist_ok=True)

with mlflow.start_run(run_name=run_name):
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("random_state", random_state)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)

    mlflow.sklearn.log_model(model_randomforest, "model")
    print("Model and metrics logged to MLflow.")

#### Congratulations! You have completed all the steps in task 3 (`Model Training & Experiment Tracking`).
#### Please go back to the instructions on the GitHub Pages site.