In [2]:
import os
import mlflow

### Q1. Install MLflow

In [8]:
!mlflow --version

mlflow, version 2.12.2


What's the version that you have? **2.12.2**

### Q2. Download and preprocess the data

In [None]:
!wget https://raw.githubusercontent.com/DataTalksClub/mlops-zoomcamp/refs/heads/main/cohorts/2025/02-experiment-tracking/homework/preprocess_data.py

In [None]:
links = [
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet",
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet",
]

# Download each monthly file into the taxi_data folder
for link in links:
    os.system(f"wget {link} -P taxi_data")

In [8]:
os.system("python preprocess_data.py --raw_data_path taxi_data --dest_path ./output")

0

How many files were saved to OUTPUT_FOLDER? **4**
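
The count can be double-checked from Python instead of by eye. This is a minimal sketch, assuming the preprocessing step wrote its files to `./output` as in the command above; `count_files` is a hypothetical helper and not part of the homework scripts.

```python
from pathlib import Path


def count_files(folder: str) -> int:
    """Count regular files directly inside `folder` (subfolders are not descended into)."""
    return sum(1 for entry in Path(folder).iterdir() if entry.is_file())


if __name__ == "__main__":
    # "./output" is the --dest_path used in the preprocessing command
    output_dir = Path("./output")
    if output_dir.is_dir():
        print(count_files(str(output_dir)))
    else:
        print(f"{output_dir} not found; run preprocess_data.py first")
```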

### Q3. Train a model with autolog

In [None]:
train_link = "https://raw.githubusercontent.com/DataTalksClub/mlops-zoomcamp/refs/heads/main/cohorts/2025/02-experiment-tracking/homework/train.py"
os.system(f"wget {train_link}")

In [12]:
os.system("python train.py --data_path ./output")

2025/05/28 07:19:11 INFO mlflow.tracking.fluent: Experiment with name 'experiment_tracking_homework' does not exist. Creating a new experiment.
2025/05/28 07:19:12 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'ee1822d0d37a441faefc2d2d56f0672a', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


0

What is the value of the min_samples_split parameter: **2**
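
The parameter can also be read back programmatically rather than from the MLflow UI. The sketch below assumes a client pointed at the same tracking URI that `train.py` logged to; `run_param` is a hypothetical helper, and the run ID in the usage comment is the one printed in the autologging message above. MLflow stores parameter values as strings, so the result would be a string such as `"2"` and not an integer.

```python
def run_param(client, run_id: str, param_name: str) -> str:
    """Return one logged parameter of a run.

    `client` is expected to behave like mlflow.tracking.MlflowClient:
    get_run(run_id) returns a run whose data.params is a dict of strings.
    """
    run = client.get_run(run_id)
    params = run.data.params
    if param_name not in params:
        raise KeyError(f"Run {run_id} has no parameter named {param_name!r}")
    return params[param_name]


# Usage (assumes mlflow is installed and the tracking URI matches train.py):
# from mlflow.tracking import MlflowClient
# print(run_param(MlflowClient(), "ee1822d0d37a441faefc2d2d56f0672a", "min_samples_split"))
```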

### Q4. Launch the tracking server locally

In addition to backend-store-uri, what else do you need to pass to properly configure the server? **default-artifact-root**
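
For reference, a server launch with both options might look like the sketch below. It only builds and prints the command line. The SQLite file name `mlflow.db` and the `./artifacts` folder are assumptions chosen for illustration, and the host and port match the tracking URI used later in this notebook. `mlflow_server_command` is a hypothetical helper.

```python
import shlex


def mlflow_server_command(backend_store_uri: str, default_artifact_root: str,
                          host: str = "127.0.0.1", port: int = 5000) -> list:
    """Build the argument list for launching a local MLflow tracking server."""
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,          # where runs and metadata are stored
        "--default-artifact-root", default_artifact_root,  # where model artifacts are stored
        "--host", host,
        "--port", str(port),
    ]


# Assumed values: a local SQLite database and a local artifacts folder
command = mlflow_server_command("sqlite:///mlflow.db", "./artifacts")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be passed to `subprocess.Popen` to start the server in the background.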

### Q5. Tune model hyperparameters

In [2]:
tune_link = "https://raw.githubusercontent.com/DataTalksClub/mlops-zoomcamp/refs/heads/main/cohorts/2025/02-experiment-tracking/homework/hpo.py"
os.system(f"wget {tune_link}")

--2025-05-29 21:09:31--  https://raw.githubusercontent.com/DataTalksClub/mlops-zoomcamp/refs/heads/main/cohorts/2025/02-experiment-tracking/homework/hpo.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1836 (1.8K) [text/plain]
Saving to: ‘hpo.py’

     0K .                                                     100% 32.4M=0s

2025-05-29 21:09:31 (32.4 MB/s) - ‘hpo.py’ saved [1836/1836]



0

In [None]:
os.system("python hpo.py")

What's the best validation RMSE that you got? **5.0375**
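
The best value can be pulled from the tracking server instead of sorting runs in the UI. This is a sketch under stated assumptions: the tuning runs live in the `random-forest-hyperopt` experiment, and each run logged its validation error under the metric name `rmse` (that name is an assumption about how `hpo.py` was completed). `best_metric_value` is a hypothetical helper.

```python
def best_metric_value(client, experiment_name: str, metric_name: str = "rmse") -> float:
    """Return the lowest value of `metric_name` across the runs of an experiment.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    """
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"Experiment {experiment_name!r} does not exist")
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=[f"metrics.{metric_name} ASC"],  # smallest metric value first
        max_results=1,
    )
    if not runs:
        raise ValueError(f"Experiment {experiment_name!r} has no runs")
    return runs[0].data.metrics[metric_name]


# Usage (assumes the tracking server is running on http://127.0.0.1:5000):
# from mlflow.tracking import MlflowClient
# client = MlflowClient(tracking_uri="http://127.0.0.1:5000")
# print(best_metric_value(client, "random-forest-hyperopt"))
```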

### Q6. Promote the best model to the model registry

In [7]:
reg_link = "https://raw.githubusercontent.com/DataTalksClub/mlops-zoomcamp/refs/heads/main/cohorts/2025/02-experiment-tracking/homework/register_model.py"
os.system(f"wget {reg_link}")

--2025-05-29 22:54:14--  https://raw.githubusercontent.com/DataTalksClub/mlops-zoomcamp/refs/heads/main/cohorts/2025/02-experiment-tracking/homework/register_model.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2487 (2.4K) [text/plain]
Saving to: ‘register_model.py’

     0K ..                                                    100% 18.9M=0s

2025-05-29 22:54:14 (18.9 MB/s) - ‘register_model.py’ saved [2487/2487]



0

In [None]:
os.system("python register_model.py")

In [3]:
HPO_EXPERIMENT_NAME = "random-forest-hyperopt"
EXPERIMENT_NAME = "random-forest-best-models"
RF_PARAMS = ['max_depth', 'n_estimators', 'min_samples_split', 'min_samples_leaf', 'random_state']

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.sklearn.autolog(log_datasets=False)

In [5]:
from mlflow.tracking import MlflowClient
client = MlflowClient()

In [7]:
# Select the model with the lowest test RMSE
experiment = client.get_experiment_by_name(EXPERIMENT_NAME)
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],  # search_runs expects a list of experiment IDs
    run_view_type=mlflow.entities.ViewType.ACTIVE_ONLY,
    max_results=1,
    order_by=["metrics.test_rmse ASC"],
)[0]

# Register the best model
mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name=EXPERIMENT_NAME
)

Registered model 'random-forest-best-models' already exists. Creating a new version of this model...
2025/05/30 13:59:23 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: random-forest-best-models, version 2
Created version '2' of model 'random-forest-best-models'.


<ModelVersion: aliases=[], creation_timestamp=1748613563926, current_stage='None', description='', last_updated_timestamp=1748613563926, name='random-forest-best-models', run_id='f49803bcc42044e79ec78feba8d59e5e', run_link='', source='/workspaces/mlops_zoomcamp/02-experiment-tracking/artifacts/2/f49803bcc42044e79ec78feba8d59e5e/artifacts/model', status='READY', status_message=None, tags={}, user_id='', version='2'>

What is the test RMSE of the best model? **5.5405**