In [3]:
import os  # needed below for os.path.getsize

import pandas as pd
import numpy as np
import mlflow


## Homework
The goal of this homework is to get familiar with tools like MLflow for experiment tracking and model management.

## Q1. Install the package
To get started with MLflow you'll need to install the appropriate Python package.

For this we recommend creating a separate Python environment (for example, a conda environment) and installing the package there with pip or conda.

Once you have installed the package, run the command mlflow --version and check the output.

What's the version that you have?
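The same check can be made from inside Python, which is handy in a notebook where the shell may point at a different environment than the kernel. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version seen by the current kernel, or None if mlflow is not installed here.
print(installed_version("mlflow"))
```

If this prints None while `mlflow --version` works in the terminal, the notebook kernel is probably running in a different environment.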

In [1]:
!mlflow --version # ideally use 1.26.4: my version is missing many modules that worked in earlier versions. Lesson learned.


mlflow, version 2.3.2


## Q2. Download and preprocess the data
We'll use the Green Taxi Trip Records dataset to predict the amount of tips for each trip.

Download the data for January, February and March 2022 in parquet format from here.
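A small helper can build the three monthly file locations so that no month gets missed. This is only a sketch: the host in `BASE_URL` is a placeholder, and the file-name pattern `green_tripdata_YYYY-MM.parquet` is an assumption that should be checked against the actual download page.

```python
def monthly_files(color, year, months, base_url):
    """Return one parquet URL per month, assuming a <color>_tripdata_<YYYY>-<MM>.parquet pattern."""
    base = base_url.rstrip("/")
    return [f"{base}/{color}_tripdata_{year}-{month:02d}.parquet" for month in months]


# Placeholder host (hypothetical): replace it with the real download location.
BASE_URL = "https://example.com/trip-data"

urls = monthly_files("green", 2022, [1, 2, 3], BASE_URL)
for url in urls:
    print(url)
```

The resulting files would go into the folder that is later passed to preprocess_data.py as --raw_data_path.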

Use the script preprocess_data.py located in the folder homework to preprocess the data.

The script will:

* load the data from the folder <TAXI_DATA_FOLDER> (the folder where you have downloaded the data),
* fit a DictVectorizer on the training set (January 2022 data),
* save the preprocessed datasets and the DictVectorizer to disk.

Your task is to download the datasets and then execute this command:



In [None]:
# Run in the shell, inside the mlops environment (done):
# python preprocess_data.py --raw_data_path data/ --dest_path output/

In [8]:

# Size of the saved DictVectorizer in KiB (bytes / 1024)
os.path.getsize(os.path.join(os.getcwd(), "output", "dv.pkl")) / 1024
# approx. 154 kB (153,660 bytes, i.e. about 150 KiB)

150.05859375

## Q3. Train a model with autolog
We will train a RandomForestRegressor (from Scikit-Learn) on the taxi dataset.

We have prepared the training script train.py for this exercise, which can be also found in the folder homework.

The script will:

* load the datasets produced by the previous step,
* train the model on the training set,
* calculate the RMSE score on the validation set.

Your task is to modify the script to enable autologging with MLflow, execute the script and then launch the MLflow UI to check that the experiment run was properly tracked.
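For reference, the RMSE reported by the script is the square root of the mean squared difference between predictions and true values. The sketch below spells that out in plain Python; it is not the code in train.py, which presumably relies on Scikit-Learn for this step.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values for illustration only: errors are 3 and -4, so the RMSE is sqrt((9 + 16) / 2).
print(rmse([3.0, 0.0], [0.0, 4.0]))
```

Because the errors are squared before averaging, a few badly predicted trips weigh more heavily than many small misses.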

Tip 1: don't forget to wrap the training code with a with mlflow.start_run(): statement as we showed in the videos.

Tip 2: don't modify the hyperparameters of the model to make sure that the training will finish quickly.

What is the value of the max_depth parameter?

* 4
* 6
* 8
* 10

In [10]:
!python train.py --data_path output

# !mlflow ui --backend-store-uri file://$(pwd)/random-forest-experiments

Start training ...
rmse: 6.121103017486401


In [4]:
client = mlflow.tracking.MlflowClient(tracking_uri = "/Users/mdurango/Documents/Mlops-datatalk-2023/cohort-2023/homeworks/week2/random-forest-experiments")
experiment = client.get_experiment_by_name('homework2')

In [17]:
experiment

<Experiment: artifact_location='/Users/mdurango/Documents/Mlops-datatalk-2023/cohort-2023/homeworks/week2/random-forest-experiments/0', creation_time=1685579503133, experiment_id='0', last_update_time=1685579960848, lifecycle_stage='active', name='homework2', tags={}>

In [18]:
experiment.experiment_id

'0'

In [19]:
runs = client.search_runs(experiment_ids=[experiment.experiment_id])

In [21]:
for run in runs:
    print("Run ID:", run.info.run_id)
    print("Logged parameters:")
    for key, value in run.data.params.items():
        print(f"{key}: {value}")
    print("-----------------------")

# the answer is in your heart, just kidding... it's 10 8)

Run ID: 2fa77934104244cd9e3d87e103e8759e
Logged parameters:
bootstrap: True
max_depth: 10
max_samples: None
min_weight_fraction_leaf: 0.0
max_leaf_nodes: None
min_samples_leaf: 1
random_state: 0
min_impurity_decrease: 0.0
verbose: 0
n_estimators: 100
criterion: squared_error
oob_score: False
ccp_alpha: 0.0
warm_start: False
max_features: 1.0
n_jobs: None
min_samples_split: 2
-----------------------


## Launch the tracking server locally for MLflow

Now we want to manage the entire lifecycle of our ML model. In this step, you'll need to launch a tracking server. This way we will also have access to the model registry.

In the case of MLflow, you need to:

* launch the tracking server on your local machine,
* select a SQLite db for the backend store and a folder called artifacts for the artifacts store.

You should keep the tracking server running to work on the next three exercises that use the server.

In [None]:
# !mlflow server --backend-store-uri "sqlite:///mlflow.db" --default-artifact-root "./artifacts"

## Q5. Tune the hyperparameters of the model

Your task is to modify the script hpo.py so that the validation RMSE is logged to MLflow for each run of the hyperparameter optimization (you will need to add a few lines of code to the objective function). Then run the script without passing any parameters.
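The few lines to add usually follow one pattern: open a run, log the trial's parameters, compute the validation RMSE, and log it before returning the loss. The sketch below shows that pattern without requiring MLflow. `InMemoryTracker`, `make_objective`, and `toy_score` are hypothetical names; the tracker only mimics the `start_run`, `log_params`, and `log_metric` calls that the `mlflow` module exposes under the same names. The returned dictionary assumes hpo.py uses hyperopt, as the experiment name suggests.

```python
from contextlib import contextmanager


class InMemoryTracker:
    """Hypothetical stand-in for the three MLflow calls the objective needs."""

    def __init__(self):
        self.runs = []  # one dict per started run
        self._active = None

    @contextmanager
    def start_run(self):
        self._active = {"params": {}, "metrics": {}}
        self.runs.append(self._active)
        try:
            yield self._active
        finally:
            self._active = None

    def log_params(self, params):
        self._active["params"].update(params)

    def log_metric(self, key, value):
        self._active["metrics"][key] = value


def make_objective(tracker, train_and_score):
    """Build an objective that logs the parameters and validation RMSE of every trial."""

    def objective(params):
        with tracker.start_run():
            tracker.log_params(params)
            rmse = train_and_score(params)
            tracker.log_metric("rmse", rmse)
        # Assumption: hyperopt-style result with "loss" and "status" keys.
        return {"loss": rmse, "status": "ok"}

    return objective


def toy_score(params):
    # Hypothetical scoring function; the real script trains a model and evaluates it.
    return float(abs(params["max_depth"] - 10))


tracker = InMemoryTracker()
objective = make_objective(tracker, toy_score)
results = [objective({"max_depth": depth}) for depth in (5, 10, 15)]
print(len(tracker.runs), "runs recorded")
```

In the real script, passing the `mlflow` module itself as `tracker` should give the same flow, with each trial showing up as a separate run in the UI.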

In [6]:
experiment2 = client.get_experiment_by_name('random-forest-hyperopt')
experiment2.experiment_id

'410122080002736892'

In [9]:
runs = client.search_runs(experiment_ids=[experiment2.experiment_id])



In [14]:
rmses = []
for run in runs:
    print("Run ID:", run.info.run_id)
    print("Logged parameters:")
    # Collect the logged metric values (the validation RMSE) from each run
    for key, value in run.data.metrics.items():
        rmses.append(value)

Run ID: 3d4c7b9a8dbb41a488e6514d7f00deae
Logged parameters:
Run ID: ac5abfb313654ef396d43ba3f0439f48
Logged parameters:
Run ID: 917e6477780044b094bd89e4deae7604
Logged parameters:
Run ID: 92321cbcb8324494a79f176e4d2ff64a
Logged parameters:
Run ID: 2fd508fc137b40f9b3c3d357241790f2
Logged parameters:
Run ID: 83f9378c6f284e53a86ded18da4f330a
Logged parameters:
Run ID: 504f7a1428fc41cbb3c11b7a6032c1a3
Logged parameters:
Run ID: 8367193f3d57480fa2e272975c06846d
Logged parameters:
Run ID: 901818a689914b7ebf86ddbdcb7b15b1
Logged parameters:
Run ID: aaa7fb77e88c4019b590fca879457224
Logged parameters:
Run ID: 3d18002022cc4f5f8508596456669448
Logged parameters:
Run ID: ed85ff1d05c84374a606e6cef1842c9f
Logged parameters:
Run ID: d778748388c24e2b8abe41338dff524e
Logged parameters:
Run ID: b47ce3879ff24bd6b4be53303bd91708
Logged parameters:
Ejecuc

In [18]:
rmses = np.array(rmses)
np.sort(rmses)[0]  # smallest logged RMSE across all runs
# unexpected value ?? :o


6.017362308837179
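Sorting a flat list of values loses track of which run produced the minimum. A possible alternative, sketched here with made-up run IDs and metric values, keeps each run ID next to its metrics and picks the lowest RMSE. The helper `best_run` and the key name "rmse" are assumptions. With MLflow, the input mapping could be built as `{run.info.run_id: run.data.metrics for run in runs}`.

```python
def best_run(run_metrics, metric="rmse"):
    """Return (run_id, value) for the run with the lowest value of `metric`."""
    scored = [
        (run_id, metrics[metric])
        for run_id, metrics in run_metrics.items()
        if metric in metrics  # skip runs that never logged the metric
    ]
    if not scored:
        raise ValueError(f"no run logged the metric {metric!r}")
    return min(scored, key=lambda pair: pair[1])


# Hypothetical data for illustration; these are not values from the experiment above.
example = {
    "run-a": {"rmse": 6.5},
    "run-b": {"rmse": 6.1},
    "run-c": {},  # a run that failed before logging anything
}
print(best_run(example))
```

Keeping the run ID makes it easy to look the winning run up in the UI and compare its parameters with the others.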