Q1. Install MLflow
Q: What's the version that you have?

A: 2.22.0

In [18]:
%pip install mlflow hyperopt

Collecting hyperopt
  Downloading hyperopt-0.2.7-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting networkx>=2.2 (from hyperopt)
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting future (from hyperopt)
  Downloading future-1.0.0-py3-none-any.whl.metadata (4.0 kB)
Collecting tqdm (from hyperopt)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting py4j (from hyperopt)
  Downloading py4j-0.10.9.9-py2.py3-none-any.whl.metadata (1.3 kB)
Downloading hyperopt-0.2.7-py2.py3-none-any.whl (1.6 MB)
Downloading networkx-3.4.2-py3-none-any.whl (1.7 MB)
Downloading future-1.0.0-py3-none-any.wh


[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
import mlflow
print(mlflow.__version__)

Q2. Download and preprocess the data

Download the data for January, February and March 2023 in parquet format from here.
Your task is to download the datasets and then execute this command:
```bash
python preprocess_data.py --raw_data_path <TAXI_DATA_FOLDER> --dest_path ./output
```

Q: How many files were saved to OUTPUT_FOLDER?

A: 4

In [9]:
%run preprocess_data.py --raw_data_path taxi_data_folder --dest_path ./output

In [10]:
import os

os.listdir('output')

['dv.pkl', 'test.pkl', 'train.pkl', 'val.pkl']
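
The same check can be done programmatically. This is a minimal sketch that assumes the four file names in the listing above are the complete expected output. `EXPECTED_FILES` and `missing_outputs` are hypothetical names, not part of preprocess_data.py.

```python
from pathlib import Path

# Assumption: these are the files preprocess_data.py is expected to write,
# taken from the directory listing shown in this notebook.
EXPECTED_FILES = {"dv.pkl", "train.pkl", "val.pkl", "test.pkl"}


def missing_outputs(dest_path):
    """Return the expected file names that are absent from dest_path, sorted."""
    present = {p.name for p in Path(dest_path).iterdir() if p.is_file()}
    return sorted(EXPECTED_FILES - present)
```

Calling `missing_outputs('output')` returns an empty list when all four files are present.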

Q3. Train a model with autolog

Q: What is the value of the min_samples_split parameter?

A: 2

In [14]:
import os
import pickle
import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def load_pickle(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Enable MLflow autologging for scikit-learn
mlflow.sklearn.autolog()

# Specify the path to your data files
current_dir = os.path.dirname(os.path.abspath(os.curdir))
output_dir = os.path.join(current_dir, 'homework', 'output')
train_path = os.path.join(output_dir, 'train.pkl')
val_path = os.path.join(output_dir, 'val.pkl')

# Load the data
X_train, y_train = load_pickle(train_path)
X_val, y_val = load_pickle(val_path)

# Train the model with MLflow tracking
with mlflow.start_run():
    rf = RandomForestRegressor(max_depth=10, random_state=0)
    rf.fit(X_train, y_train)

    # Make predictions
    y_pred = rf.predict(X_val)

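    # Calculate RMSE; mean_squared_error returns the MSE, so take its square root
    mse = mean_squared_error(y_val, y_pred)
    rmse = mse ** 0.5
    print(f'MSE: {mse}')
    print(f'RMSE: {rmse:.3f}')

MSE: 29.497522626996197
RMSE: 5.431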


In [15]:
print(f"min_samples_split value: {rf.min_samples_split}")

min_samples_split value: 2
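
scikit-learn's `mean_squared_error` returns the mean squared error, and the RMSE is the square root of that value. The sketch below shows the relationship using only the standard library. The helper names `mse` and `rmse` and the toy numbers are illustrative assumptions, not taken from the homework data.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Toy example: the errors are 1 and 7, so the squared errors are 1 and 49.
y_true = [10.0, 20.0]
y_pred = [11.0, 27.0]
print(mse(y_true, y_pred))   # 25.0
print(rmse(y_true, y_pred))  # 5.0
```

Because the RMSE is in the same units as the target, it is usually the easier of the two to interpret.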


Q4. Launch the tracking server locally
Now we want to manage the entire lifecycle of our ML model. In this step, you'll need to launch a tracking server. This way we will also have access to the model registry.

Your task is to:

- launch the tracking server on your local machine,
- select a SQLite db for the backend store and a folder called artifacts for the artifacts store.

You should keep the tracking server running to work on the next two exercises that use the server.

Q: In addition to backend-store-uri, what else do you need to pass to properly configure the server?

A: `default-artifact-root`


mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./artifacts \
    --host 0.0.0.0 \
    --port 5000
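
Before moving on, it can help to confirm that something is listening on the port passed to `mlflow server`. This is a minimal sketch using only the standard library. `server_is_up` is a hypothetical helper, and its host and port defaults assume the command above was run unchanged on the same machine.

```python
import socket


def server_is_up(host="127.0.0.1", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds.

    This only checks that the port accepts connections; it does not
    verify that the process behind it is an MLflow server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("tracking server reachable:", server_is_up())
```

Once the check passes, calling `mlflow.set_tracking_uri("http://127.0.0.1:5000")` in a notebook or script sends subsequent runs to that server.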

Q5. Tune model hyperparameters
Now let's try to reduce the validation error by tuning the hyperparameters of the RandomForestRegressor using hyperopt.

We have prepared the script hpo.py for this exercise.

Your task is to modify the script hpo.py so that the validation RMSE is logged to the tracking server for each run of the hyperparameter optimization (you will need to add a few lines of code to the objective function). Then run the script without passing any parameters.

After that, open the UI and explore the runs from the experiment called random-forest-hyperopt to answer the question below.

Note: Don't use autologging for this exercise.

The idea is to just log the information that you need to answer the question below, including:

- the list of hyperparameters that are passed to the objective function during the optimization,
- the RMSE obtained on the validation set (February 2023 data).

Q: What's the best validation RMSE that you got?

A: 28.258

In [None]:
%run hpo_modified.py
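
hpo_modified.py itself is not shown here. The following is a minimal sketch of how the objective function might log each trial without autologging, assuming the search space yields the usual RandomForestRegressor parameters. `to_rf_params` and `make_objective` are hypothetical names, and the metric key `rmse` is an assumption.

```python
# Assumption: parameters that RandomForestRegressor expects as integers.
INT_PARAMS = ("max_depth", "n_estimators", "min_samples_split",
              "min_samples_leaf", "random_state")


def to_rf_params(sampled):
    """Cast sampled values such as 12.0 to int where the model needs integers."""
    return {k: int(v) if k in INT_PARAMS else v for k, v in sampled.items()}


def make_objective(X_train, y_train, X_val, y_val):
    """Build a hyperopt objective that logs params and validation RMSE per trial."""
    # Third-party imports live here so to_rf_params stays usable on its own.
    import mlflow
    from hyperopt import STATUS_OK
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    def objective(sampled):
        params = to_rf_params(sampled)
        with mlflow.start_run():
            # Log the hyperparameters passed to this trial.
            mlflow.log_params(params)
            rf = RandomForestRegressor(**params)
            rf.fit(X_train, y_train)
            y_pred = rf.predict(X_val)
            # mean_squared_error returns the MSE; the square root gives the RMSE.
            rmse = mean_squared_error(y_val, y_pred) ** 0.5
            mlflow.log_metric("rmse", rmse)
        return {"loss": rmse, "status": STATUS_OK}

    return objective
```

The returned function can be passed to hyperopt's `fmin`. The sketch assumes `mlflow.set_tracking_uri` and `mlflow.set_experiment("random-forest-hyperopt")` were called beforehand so the runs land in the right experiment.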

Q6. Promote the best model to the model registry
The results from the hyperparameter optimization are quite good. So, we can assume that we are ready to test some of these models in production. In this exercise, you'll promote the best model to the model registry. We have prepared a script called register_model.py, which will check the results from the previous step and select the top 5 runs. After that, it will calculate the RMSE of those models on the test set (March 2023 data) and save the results to a new experiment called random-forest-best-models.

Your task is to update the script register_model.py so that it selects the model with the lowest RMSE on the test set and registers it to the model registry.

Tip 1: you can use the method search_runs from the MlflowClient to get the model with the lowest RMSE,

Tip 2: to register the model you can use the method mlflow.register_model and you will need to pass the right model_uri in the form of a string that looks like this: "runs:/<RUN_ID>/model", and the name of the model (make sure to choose a good one!).

Q: What is the test RMSE of the best model?

A: 30.7931798953214
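
register_model_modified.py is not shown here. The following is a minimal sketch of the selection and registration step described in the two tips. It assumes the test metric was logged under the key `test_rmse` and that the tracking server from Q4 is reachable at http://127.0.0.1:5000. `model_uri_for` and `register_best_model` are hypothetical names.

```python
def model_uri_for(run_id):
    """Build the model URI string in the form described in Tip 2."""
    return f"runs:/{run_id}/model"


def register_best_model(experiment_name, model_name,
                        tracking_uri="http://127.0.0.1:5000",
                        metric="test_rmse"):
    """Register the run with the lowest value of `metric` in the experiment."""
    # Imported here so model_uri_for stays usable without MLflow installed.
    import mlflow
    from mlflow.entities import ViewType
    from mlflow.tracking import MlflowClient

    mlflow.set_tracking_uri(tracking_uri)
    client = MlflowClient(tracking_uri=tracking_uri)

    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"experiment not found: {experiment_name}")

    # Sort ascending by the metric and keep only the best run.
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=1,
        order_by=[f"metrics.{metric} ASC"],
    )
    if not runs:
        raise ValueError(f"no runs found in experiment: {experiment_name}")

    best_run = runs[0]
    return mlflow.register_model(model_uri_for(best_run.info.run_id), model_name)
```

A call such as `register_best_model("random-forest-best-models", "nyc-taxi-random-forest-regressor")` would use the experiment and model names that appear in this notebook.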

In [30]:
%run register_model_modified.py

2025/05/06 15:03:14 INFO mlflow.tracking.fluent: Experiment with name 'random-forest-best-models' does not exist. Creating a new experiment.
Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 16.66it/s]


🏃 View run abrasive-pug-191 at: http://127.0.0.1:5000/#/experiments/901281849889807915/runs/4a2fedfa92764dbfb40a5c47382550cd
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/901281849889807915


Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 21.29it/s]


🏃 View run nosy-deer-873 at: http://127.0.0.1:5000/#/experiments/901281849889807915/runs/f16bad68d93f4deaa4de930a27155319
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/901281849889807915


Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 34.52it/s]


🏃 View run useful-penguin-307 at: http://127.0.0.1:5000/#/experiments/901281849889807915/runs/b49a911b1b0041e4af36ce1387868777
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/901281849889807915


Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 23.77it/s]


🏃 View run dazzling-mink-49 at: http://127.0.0.1:5000/#/experiments/901281849889807915/runs/0f31f1a404524760af28acdae8196124
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/901281849889807915


Downloading artifacts: 100%|██████████| 5/5 [00:00<00:00, 41.53it/s]
Successfully registered model 'nyc-taxi-random-forest-regressor'.
2025/05/06 15:03:16 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: nyc-taxi-random-forest-regressor, version 1


🏃 View run serious-bee-310 at: http://127.0.0.1:5000/#/experiments/901281849889807915/runs/19e271dea736483f9acd2400fd948ddd
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/901281849889807915
Registered model with run_id: 0836093038934cae81734d5355112439
Best test RMSE: 30.7931798953214


Created version '1' of model 'nyc-taxi-random-forest-regressor'.
