<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Q1:-Install-MLflow" data-toc-modified-id="Q1:-Install-MLflow-1">Q1: Install MLflow</a></span><ul class="toc-item"><li><span><a href="#What-is-the-version-that-you-have?" data-toc-modified-id="What-is-the-version-that-you-have?-1.1">What is the version that you have?</a></span></li></ul></li><li><span><a href="#Q2:-Download-and-preprocess-the-data" data-toc-modified-id="Q2:-Download-and-preprocess-the-data-2">Q2: Download and preprocess the data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Run-the-preprocessor" data-toc-modified-id="Run-the-preprocessor-2.0.1">Run the preprocessor</a></span></li></ul></li><li><span><a href="#How-many-files-were-saved-to-OUTPUT_FOLDER?" data-toc-modified-id="How-many-files-were-saved-to-OUTPUT_FOLDER?-2.1">How many files were saved to <code>OUTPUT_FOLDER</code>?</a></span></li></ul></li><li><span><a href="#Q3:-Train-a-model-with-autolog" data-toc-modified-id="Q3:-Train-a-model-with-autolog-3">Q3: Train a model with autolog</a></span><ul class="toc-item"><li><span><a href="#How-many-parameters-are-automatically-logged-by-MLflow?" data-toc-modified-id="How-many-parameters-are-automatically-logged-by-MLflow?-3.1">How many parameters are automatically logged by MLflow?</a></span></li></ul></li><li><span><a href="#Q4:-Launch-the-tracking-server-locally" data-toc-modified-id="Q4:-Launch-the-tracking-server-locally-4">Q4: Launch the tracking server locally</a></span><ul class="toc-item"><li><span><a href="#In-addition-to-backend-store-uri,-what-else-do-you-need-to-pass-to-properly-configure-the-server?" 
data-toc-modified-id="In-addition-to-backend-store-uri,-what-else-do-you-need-to-pass-to-properly-configure-the-server?-4.1">In addition to backend-store-uri, what else do you need to pass to properly configure the server?</a></span></li></ul></li><li><span><a href="#Q5:-Tune-the-hyperparameters-of-the-model" data-toc-modified-id="Q5:-Tune-the-hyperparameters-of-the-model-5">Q5: Tune the hyperparameters of the model</a></span><ul class="toc-item"><li><span><a href="#What's-the-best-validation-RMSE-that-you-got?" data-toc-modified-id="What's-the-best-validation-RMSE-that-you-got?-5.1">What's the best validation RMSE that you got?</a></span></li></ul></li><li><span><a href="#Q6:-Promote-the-best-model-to-the-model-registry" data-toc-modified-id="Q6:-Promote-the-best-model-to-the-model-registry-6">Q6: Promote the best model to the model registry</a></span><ul class="toc-item"><li><span><a href="#What-is-the-test-RMSE-of-the-best-model?" data-toc-modified-id="What-is-the-test-RMSE-of-the-best-model?-6.1">What is the test RMSE of the best model?</a></span></li></ul></li></ul></div>

In [1]:
import os

## Q1: Install MLflow

In [2]:
#! pip install mlflow
! mlflow --version

mlflow, version 1.26.0
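
The same check can be made from Python, which is convenient when shelling out from a notebook is not possible. This is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of MLflow.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed MLflow version, or None when MLflow is not installed.
print(installed_version("mlflow"))
```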


### What is the version that you have?

In [3]:
A1 = '1.26.0'
print(f'Answer: {A1}')

Answer: 1.26.0


## Q2: Download and preprocess the data

[TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

Inside the data folder: 
- `wget https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2021-01.parquet`
- `wget https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2021-02.parquet`
- `wget https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2021-03.parquet`
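
If `wget` is unavailable, the same three files could be fetched from Python. The sketch below uses only the standard library. `BASE_URL` is copied from the commands above, `parquet_url` and `download_month` are hypothetical helper names, and it assumes the bucket still serves these files.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL taken from the wget commands above.
BASE_URL = "https://s3.amazonaws.com/nyc-tlc/trip+data"


def parquet_url(year: int, month: int, color: str = "green") -> str:
    """Build the download URL for one month of trip data."""
    return f"{BASE_URL}/{color}_tripdata_{year}-{month:02d}.parquet"


def download_month(year: int, month: int, dest_dir: str = "data") -> Path:
    """Download one monthly file into dest_dir and return its local path."""
    url = parquet_url(year, month)
    target = Path(dest_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, target)  # performs the network request
    return target


# January to March 2021, matching the files listed above (URLs only, no download).
for month in (1, 2, 3):
    print(parquet_url(2021, month))
```

Calling `download_month(2021, month)` inside the loop instead of `print` would perform the actual downloads.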

In [4]:
! ls -la data

total 3872
drwxrwxr-x 2 fdelca fdelca    4096 mai 27 12:25 .
drwxrwxr-x 9 fdelca fdelca    4096 mai 27 14:33 ..
-rw-rw-r-- 1 fdelca fdelca 1333519 mai 27 12:25 green_tripdata_2021-01.parquet
-rw-rw-r-- 1 fdelca fdelca 1145679 mai 27 12:25 green_tripdata_2021-02.parquet
-rw-rw-r-- 1 fdelca fdelca 1474538 mai 27 12:25 green_tripdata_2021-03.parquet


#### Run the preprocessor

In [5]:
! python homework/preprocess_data.py --raw_data_path "data" --dest_path "preprocessed"

In [6]:
! ls -la preprocessed/

total 7764
drwxrwxr-x 2 fdelca fdelca    4096 mai 27 13:05 .
drwxrwxr-x 9 fdelca fdelca    4096 mai 27 14:33 ..
-rw-rw-r-- 1 fdelca fdelca  305256 mai 27 14:34 dv.pkl
-rw-rw-r-- 1 fdelca fdelca 2805197 mai 27 14:34 test.pkl
-rw-rw-r-- 1 fdelca fdelca 2661149 mai 27 14:34 train.pkl
-rw-rw-r-- 1 fdelca fdelca 2166527 mai 27 14:34 valid.pkl


As the listing shows, preprocessing succeeded: the `train`, `valid`, and `test` sets were each saved as a pickle file, and the fitted `DictVectorizer` was saved as `dv.pkl`.

### How many files were saved to `OUTPUT_FOLDER`?

The `output_folder` corresponds to our `preprocessed` folder:

In [11]:
A2 = len(os.listdir('preprocessed/'))
print(f'Answer: {A2}')

Answer: 4
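
Note that `os.listdir` also counts subdirectories. A slightly stricter count, sketched here with `pathlib`, only considers regular files. The helper name `count_files` is hypothetical, and the temporary folder merely imitates the layout of `preprocessed/`.

```python
import tempfile
from pathlib import Path


def count_files(folder: str) -> int:
    """Count regular files directly inside `folder`, ignoring subdirectories."""
    return sum(1 for entry in Path(folder).iterdir() if entry.is_file())


# Demonstration on a throwaway folder that mimics the preprocessed/ layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dv.pkl", "train.pkl", "valid.pkl", "test.pkl"):
        (Path(tmp) / name).touch()
    (Path(tmp) / "nested").mkdir()  # a subdirectory, which is not counted
    print(count_files(tmp))  # 4
```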


## Q3: Train a model with autolog

Specifications:
- Random Forest Regressor model
- Change `train.py` to autolog the model characteristics

```python
import argparse
import os
import pickle
import mlflow

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Set the tracking URI - SQLite backend
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("homework")


def load_pickle(filename: str):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


def run(data_path):

    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_valid, y_valid = load_pickle(os.path.join(data_path, "valid.pkl"))

    rf = RandomForestRegressor(max_depth=10, random_state=0)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_valid)

    rmse = mean_squared_error(y_valid, y_pred, squared=False)


if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_path",
        default="./output",
        help="the location where the processed NYC taxi trip data was saved."
    )
    args = parser.parse_args()
    # Enable autologging before the run starts so that fit() is captured
    mlflow.sklearn.autolog()
    with mlflow.start_run():
        run(args.data_path)
```

In [1]:
! python homework/train.py --data_path "preprocessed"



To check the logged parameters, run the following command and open the run in the UI:

`mlflow ui --backend-store-uri sqlite:///mlflow.db --default-artifact-root file:<PATH_TO_STORE_ARTIFACTS>/mlruns -h 0.0.0.0 -p 8000`

### How many parameters are automatically logged by MLflow?

In [12]:
A3 = 17
print(f'Answer: {A3}')

Answer: 17
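
Rather than counting in the UI, the number could be read from the tracking store. This is a sketch that assumes the run was logged to the `homework` experiment set in `train.py`. `count_logged_params` is a hypothetical helper that accepts any object exposing the `MlflowClient` methods `get_experiment_by_name` and `search_runs`, so MLflow is only needed at the call site.

```python
def count_logged_params(client, experiment_name: str) -> int:
    """Return how many parameters the most recent run of an experiment logged."""
    experiment = client.get_experiment_by_name(experiment_name)
    if experiment is None:
        raise ValueError(f"experiment {experiment_name!r} not found")
    runs = client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["attributes.start_time DESC"],  # newest run first
        max_results=1,
    )
    if not runs:
        raise ValueError(f"experiment {experiment_name!r} has no runs")
    # run.data.params is a dict of parameter name -> logged value
    return len(runs[0].data.params)
```

With MLflow installed, it would be called as `count_logged_params(MlflowClient(tracking_uri="sqlite:///mlflow.db"), "homework")`, where `MlflowClient` comes from `mlflow.tracking`.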


## Q4: Launch the tracking server locally

In the previous question I was already running a server with a `default-artifact-root` configured; artifacts are saved to the `mlruns` folder inside the working directory.

To properly run the server:

1. Check the [LearningNotes Notebook](/notebooks/Week2-LearningNotes.ipynb) to properly create a sqlite database on your computer and store artifacts
2. Run the following command in the terminal: `mlflow ui --backend-store-uri sqlite:///mlflow.db --default-artifact-root file:<PATH_TO_STORE_ARTIFACTS>/mlruns -h 0.0.0.0 -p 8000`
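
To make the two required settings explicit, the command can be assembled as an argument list. This is only a sketch: `mlflow_ui_command` is a hypothetical helper, the flags are the ones used in the command above, and the artifact path `/tmp/artifacts` is a placeholder for `<PATH_TO_STORE_ARTIFACTS>`.

```python
import shlex


def mlflow_ui_command(db_file: str, artifact_root: str,
                      host: str = "0.0.0.0", port: int = 8000) -> list:
    """Assemble the `mlflow ui` argument list used in this notebook."""
    return [
        "mlflow", "ui",
        "--backend-store-uri", f"sqlite:///{db_file}",  # where runs, params and metrics live
        "--default-artifact-root", artifact_root,       # where models and files are stored
        "-h", host,
        "-p", str(port),
    ]


# Placeholder artifact location; replace with a real path before running.
cmd = mlflow_ui_command("mlflow.db", "file:/tmp/artifacts/mlruns")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.Popen(cmd)` to start the server from a script.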

### In addition to backend-store-uri, what else do you need to pass to properly configure the server?

In [2]:
A4 = 'default-artifact-root'
print(f'Answer: {A4}')

Answer: default-artifact-root


## Q5: Tune the hyperparameters of the model

Reduce the validation error by tuning the hyperparameters of the random forest regressor with `hyperopt`.

Specifications: 
- Use `hpo.py` file for that;
- Validation `RMSE` error is logged to MLflow for each run;
- Run script without passing any parameters;
- Do not use `autolog()` in this exercise;
- Log: 
    - List of hyperparameters that are passed to `objective function` during optimization;
    - `RMSE` obtained on the validation set (February 2021 data)

```python
import argparse
import os
import pickle

import mlflow
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from hyperopt.pyll import scope
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("random-forest-hyperopt")


def load_pickle(filename):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


def run(data_path, num_trials):

    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_valid, y_valid = load_pickle(os.path.join(data_path, "valid.pkl"))

    def objective(params):
        
        with mlflow.start_run():
            # Log parameters
            mlflow.log_params(params)
        
            rf = RandomForestRegressor(**params)
            rf.fit(X_train, y_train)
            y_pred = rf.predict(X_valid)
            rmse = mean_squared_error(y_valid, y_pred, squared=False)
        
            # Log error metric
            mlflow.log_metric('rmse', rmse)
        
        return {'loss': rmse, 'status': STATUS_OK}

    search_space = {
        'max_depth': scope.int(hp.quniform('max_depth', 1, 20, 1)),
        'n_estimators': scope.int(hp.quniform('n_estimators', 10, 50, 1)),
        'min_samples_split': scope.int(hp.quniform('min_samples_split', 2, 10, 1)),
        'min_samples_leaf': scope.int(hp.quniform('min_samples_leaf', 1, 4, 1)),
        'random_state': 42
    }
    
    rstate = np.random.default_rng(42)  # for reproducible results
    fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=num_trials,
        trials=Trials(),
        rstate=rstate
    )


if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_path",
        default="./output",
        help="the location where the processed NYC taxi trip data was saved."
    )
    parser.add_argument(
        "--max_evals",
        default=50,
        type=int,  # without this, a value passed on the command line arrives as a string
        help="the number of parameter evaluations for the optimizer to explore."
    )
    args = parser.parse_args()
    run(args.data_path, args.max_evals)
```

In [1]:
! python homework/hpo.py --data_path "preprocessed"

100%|█████████| 50/50 [06:25<00:00,  7.72s/trial, best loss: 6.6284257482044735]
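
The search space wraps `hp.quniform` in `scope.int` because `quniform` yields floats. The sketch below emulates that behaviour with the standard library only, following hyperopt's documented definition `round(uniform(low, high) / q) * q`. It is an illustration rather than hyperopt's implementation, and `sample_quniform_int` is a hypothetical name.

```python
import random


def sample_quniform_int(rng: random.Random, low: float, high: float, q: float) -> int:
    """Draw one value the way scope.int(hp.quniform(label, low, high, q)) is defined."""
    value = round(rng.uniform(low, high) / q) * q  # quantised float, e.g. 7.0
    return int(value)  # the cast that scope.int applies


rng = random.Random(42)
# Same bounds as the max_depth entry of the search space above.
samples = [sample_quniform_int(rng, 1, 20, 1) for _ in range(1000)]
print(min(samples), max(samples))
```

The cast matters because scikit-learn expects integers for parameters such as `n_estimators`.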


**Note:** An advantage of running `mlflow` from `.py` scripts rather than Jupyter notebooks is that MLflow records the name of the source file for each run, which makes experiments easier to trace later.

### What's the best validation RMSE that you got?

In [2]:
A5 = 6.628
print(f'Answer: {A5}')

Answer: 6.628
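
For reference, the RMSE reported here is the square root of the mean squared error, which is what `mean_squared_error(..., squared=False)` computes in the script. A plain-Python sketch of the formula (the function name `rmse` and the sample numbers are illustrative only):

```python
import math


def rmse(y_true, y_pred) -> float:
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5).
print(rmse([10.0, 20.0], [13.0, 16.0]))
```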


## Q6: Promote the best model to the model registry

Specifications:
- Select the top 5 models from the previous run;
- Calculate the `RMSE` of those models in the test set (March 2021);
- Save the results in a new experiment called `random-forest-best-models`

- Update `register_model.py` to select the model with the lowest `RMSE` on the test set and register it to the `model registry`

**Tip 1:** you can use the method `search_runs` from the `MlflowClient` to get the model with the **lowest RMSE**. 


**Tip 2:** to register the model you can use the method `mlflow.register_model` and you will need to pass the right `model_uri` in the form of a string that looks like this: `"runs:/<RUN_ID>/model"`, and the name of the model.
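
The two tips can be sketched in isolation. The run objects below are stand-ins with made-up IDs and metric values that only mimic the `run.info.run_id` and `run.data.metrics` attributes of MLflow runs; `pick_best_run` and `model_uri_for` are hypothetical helper names.

```python
from types import SimpleNamespace


def pick_best_run(runs, metric: str = "test_rmse"):
    """Return the run with the smallest value of `metric`."""
    return min(runs, key=lambda run: run.data.metrics[metric])


def model_uri_for(run_id: str, artifact_path: str = "model") -> str:
    """Build the runs:/<RUN_ID>/<artifact_path> string expected by mlflow.register_model."""
    return f"runs:/{run_id}/{artifact_path}"


# Stand-in runs for illustration only (IDs and metrics are invented).
runs = [
    SimpleNamespace(info=SimpleNamespace(run_id="aaa"),
                    data=SimpleNamespace(metrics={"test_rmse": 6.9})),
    SimpleNamespace(info=SimpleNamespace(run_id="bbb"),
                    data=SimpleNamespace(metrics={"test_rmse": 6.5})),
]
best = pick_best_run(runs)
print(model_uri_for(best.info.run_id))  # runs:/bbb/model
```

The script below relies on `order_by` in `search_runs` instead of `min`, which pushes the sorting to the tracking store.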

```python
import argparse
import os
import pickle

import mlflow
from hyperopt import hp, space_eval
from hyperopt.pyll import scope
from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

HPO_EXPERIMENT_NAME = "random-forest-hyperopt"
EXPERIMENT_NAME = "random-forest-best-models"

# mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.sklearn.autolog()

SPACE = {
    'max_depth': scope.int(hp.quniform('max_depth', 1, 20, 1)),
    'n_estimators': scope.int(hp.quniform('n_estimators', 10, 50, 1)),
    'min_samples_split': scope.int(hp.quniform('min_samples_split', 2, 10, 1)),
    'min_samples_leaf': scope.int(hp.quniform('min_samples_leaf', 1, 4, 1)),
    'random_state': 42
}


def load_pickle(filename):
    with open(filename, "rb") as f_in:
        return pickle.load(f_in)


def train_and_log_model(data_path, params):
    X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
    X_valid, y_valid = load_pickle(os.path.join(data_path, "valid.pkl"))
    X_test, y_test = load_pickle(os.path.join(data_path, "test.pkl"))

    with mlflow.start_run():
        params = space_eval(SPACE, params)
        rf = RandomForestRegressor(**params)
        rf.fit(X_train, y_train)

        # evaluate model on the validation and test sets
        valid_rmse = mean_squared_error(y_valid, rf.predict(X_valid), squared=False)
        mlflow.log_metric("valid_rmse", valid_rmse)
        test_rmse = mean_squared_error(y_test, rf.predict(X_test), squared=False)
        mlflow.log_metric("test_rmse", test_rmse)


def run(data_path, log_top):

    client = MlflowClient()

    # retrieve the top_n model runs and log the models to MLflow
    experiment = client.get_experiment_by_name(HPO_EXPERIMENT_NAME)
    runs = client.search_runs(
        experiment_ids=experiment.experiment_id,
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=log_top,
        order_by=["metrics.rmse ASC"]
    )
    for run in runs:
        train_and_log_model(data_path=data_path, params=run.data.params)

    # select the model with the lowest test RMSE
    experiment = client.get_experiment_by_name(EXPERIMENT_NAME)
    best_run = client.search_runs(
        experiment_ids=experiment.experiment_id,
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=log_top,
        order_by=["metrics.test_rmse ASC"]
    )[0]
    
    # register the best model
    best_model_run_id = best_run.info.run_id
    best_model_uri = f"runs:/{best_model_run_id}/model"
    mlflow.register_model(model_uri=best_model_uri, name='rf_best_model_in_test')

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_path",
        default="./output",
        help="the location where the processed NYC taxi trip data was saved."
    )
    parser.add_argument(
        "--top_n",
        default=5,
        type=int,
        help="the top 'top_n' models will be evaluated to decide which model to promote."
    )
    args = parser.parse_args()

    run(args.data_path, args.top_n)
```

In [2]:
! python homework/register_model.py --data_path "preprocessed" --top_n 5

2022/05/27 16:31:29 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2022/05/27 16:31:29 INFO mlflow.store.db.utils: Updating database tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
INFO  [alembic.runtime.migration] Running upgrade  -> 451aebb31d03, add metric step
INFO  [alembic.runtime.migration] Running upgrade 451aebb31d03 -> 90e64c465722, migrate user column to tags
INFO  [alembic.runtime.migration] Running upgrade 90e64c465722 -> 181f10493468, allow nulls for metric values
INFO  [alembic.runtime.migration] Running upgrade 181f10493468 -> df50e92ffc5e, Add Experiment Tags Table
INFO  [alembic.runtime.migration] Running upgrade df50e92ffc5e -> 7ac759974ad8, Update run tags with larger limit
INFO  [alembic.runtime.migration] Running upgrade 7ac759974ad8 -> 89d4b8295536, create latest metrics table
INFO  [89d4b8295536_create_latest_metrics_table_py] Migration complete!
INFO  

### What is the test RMSE of the best model?

In [41]:
A6 = 6.548
print(f'Answer: {A6}')

Answer: 6.548

