# MLflow 'Crash Course'
By JoyZ @ Oct 2022

--------------------------------------------------------------------------------
## Part 1: Introduction to MLflow
* Homepage: https://mlflow.org/

<p align="left">
<img src="mlflow-modules.png" width=1000>
</p>

### Our focus is **MLflow Tracking**
* Concept: https://mlflow.org/docs/latest/tracking.html#concepts

### In this tutorial, we'll use the MLflow Fluent API to:
* Set up a local MLflow UI and experiment
* Track the following aspects of your experiment runs:
  * Parameters
  * Metrics
  * Artifacts (e.g. models, plots, datasets)
  * Log information (e.g. tags and description)
* Reuse the results of your runs
* Manage your runs

--------------------------------------------------------------------------------------------------------------------------------
## Part 2: Hands-on!

### Set up a virtual environment for our project!
```
cd [project-folder]
python3 -m venv venv/
source venv/bin/activate
which python3
pip3 install -r requirements.txt
```

In [None]:
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

In [None]:
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
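
As a quick sanity check of what `eval_metrics` returns, here is a minimal sketch on three made-up values (they are not from the wine data and were picked only because the arithmetic is easy to follow). The helper is repeated so the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def eval_metrics(actual, pred):
    # Same helper as above, repeated so this snippet is self-contained
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


# Toy values (assumption: illustrative only)
actual = [3, 5, 7]
pred = [2, 5, 9]

# Errors are -1, 0, +2, so squared errors are 1, 0, 4
rmse, mae, r2 = eval_metrics(actual, pred)
print(rmse)  # sqrt(5 / 3)
print(mae)   # (1 + 0 + 2) / 3
print(r2)    # 1 - 5 / 8
```

Lower is better for RMSE and MAE, while R2 is better the closer it gets to 1.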

### Load and Prepare Modeling Data
The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

In [None]:
# Read the wine-quality csv file from the URL
csv_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data = pd.read_csv(csv_url, sep=";")

# Split the data into training and test sets (train_test_split defaults to a 0.75/0.25 split).
train, test = train_test_split(data)
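
The split above is reshuffled on every execution, so metrics can differ between notebook sessions. A minimal sketch of a reproducible split, assuming you are happy to pin the shuffle with `random_state` (that argument is an addition, not part of the original cell), on a hypothetical toy frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the wine data
toy = pd.DataFrame({"alcohol": range(8), "quality": [5, 6, 5, 7, 6, 5, 6, 7]})

# test_size=0.25 is the default; random_state pins the shuffle
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)
toy_train_again, _ = train_test_split(toy, test_size=0.25, random_state=42)

print(len(toy_train), len(toy_test))
print(toy_train.index.equals(toy_train_again.index))
```

With the same `random_state`, both calls select the same rows, which makes runs logged to MLflow easier to compare.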

In [None]:
# The predicted column is "quality", a scalar from [3, 9]
y = "quality"
train_x = train.drop([y], axis=1)
test_x = test.drop([y], axis=1)
# Select the target as a 1-D Series; scikit-learn warns about column-vector targets
train_y = train[y]
test_y = test[y]

* * *
### Let's play with MLflow now!
* * * 
### Very first but important step: get familiar with the Docs!
* Documentation: https://mlflow.org/docs/latest/index.html
* Github: https://github.com/mlflow/mlflow
* Python API: https://www.mlflow.org/docs/latest/python_api/index.html
* R API: https://www.mlflow.org/docs/latest/R-api.html

### Spin up the Tracking UI
The Tracking UI lets you visualize, search and compare runs, as well as download run artifacts or metadata for analysis in other tools.

If you log runs to a local `mlruns` directory, run `mlflow ui` in the directory above it, and it loads the corresponding runs.

Run the following in your terminal to start the UI:
```
mlflow ui
```
If the UI started successfully, you can view it at http://127.0.0.1:5000

### Set up an experiment; otherwise runs are logged to 'Default'

In [None]:
# mlflow.create_experiment(name='Crash Course Demo')
mlflow.set_experiment(experiment_name="Crash Course Demo")

### Let's start with a simple run
* Check what's there in the UI
* Check the local folder *mlruns/*

In [None]:
mlflow.start_run(run_name="my-first-run")
mlflow.log_param("hello", "world")
mlflow.log_metric("score", 100)
mlflow.end_run()

### Let's do some real modeling

* Use Python's `with` statement: `mlflow.start_run()` works as a context manager
* Hint: a context manager usually takes care of setting up some resource, e.g. opening a connection, and automatically handles the cleanup when we are done with it
* No need to end your run with `mlflow.end_run()`; the run ends when the `with` block exits

In [None]:
params = {"n_estimators": 100, "max_depth": 4}

with mlflow.start_run(run_name="random-forest") as run:

    clf_rf = RandomForestRegressor(**params, random_state=42)
    clf_rf.fit(train_x, train_y)

    y_test_predicted = clf_rf.predict(test_x)

    (rmse, mae, r2) = eval_metrics(test_y, y_test_predicted)
    print("RMSE: %s" % rmse)
    print("MAE: %s" % mae)
    print("R2: %s" % r2)
    
    metrics = {"rmse": rmse, "mae": mae, "r2": r2}

    # # log a single parameter
    # mlflow.log_param("n_estimators", 100)
    
    # # log a single metric
    # mlflow.log_metric("r2", r2)

    # log a dict of parameters
    mlflow.log_params(params)
    # log a dict of metrics
    mlflow.log_metrics(metrics)

    # log the model using an MLflow-supported model flavor (check the docs for more)
    mlflow.sklearn.log_model(clf_rf, "random-forest-model")

    # Get the run id
    print("Run ID: ", run.info.run_id)
    rf_run_id = run.info.run_id


### Log to an existing run using run_id

In [None]:
# Replace with a run ID from your own tracking store (e.g. the one printed above)
rf_run_id = "d6527a30db6141baadf2ca007b43c129"

In [None]:
with mlflow.start_run(run_id=rf_run_id):
    mlflow.log_metric("new_metric", 100)

### Log artifacts
* `mlflow.log_artifact()` logs a **local** file or directory as an artifact. The optional `artifact_path` places it in a subdirectory of the run's artifact URI, so you can organize run artifacts into directories
* https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_artifact

In [None]:
import os

# make sure the output folders exist before writing to them
os.makedirs("data", exist_ok=True)
os.makedirs("plot", exist_ok=True)

# dataset
train.to_csv("data/train_data.csv", index=False)
test.to_csv("data/test_data.csv", index=False)

# plots
fig = train.hist(figsize=(10, 10))
plt.savefig("plot/train_distribution.png", format="png")
fig = test.hist(figsize=(10, 10))
plt.savefig("plot/test_distribution.png", format="png")

#### log a single file

In [None]:
with mlflow.start_run(run_id=rf_run_id):
    # a single file
    mlflow.log_artifact("data/train_data.csv", artifact_path="data")
    mlflow.log_artifact("plot/train_distribution.png", artifact_path="plot")

#### log local directory

In [None]:
with mlflow.start_run(run_id=rf_run_id):
    mlflow.log_artifacts("data", artifact_path="data")
    mlflow.log_artifacts("plot", artifact_path="plot")

### Let's track a parameter search run!

In [None]:
params = {"max_depth": 4}

# Search for the best n_estimators

for n_estimators in [100, 200, 300]:

    with mlflow.start_run(run_name=f"random-forest-ntrees-{n_estimators}"):
        print(f"-----> Start training with ntrees = {n_estimators}")
        
        params.update({"n_estimators": n_estimators})
        
        clf = RandomForestRegressor(**params, random_state=42)
        clf.fit(train_x, train_y)

        predicted_qualities = clf.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
        print("RMSE: %s" % rmse)
        print("MAE: %s" % mae)
        print("R2: %s" % r2)
        metrics = {"rmse": rmse, "mae": mae, "r2": r2}

        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        mlflow.sklearn.log_model(clf, "model")

### Let's try a better way - use parent & child runs!

In [None]:
with mlflow.start_run(run_name="random-forest-parent-ntrees") as parent_run:

    rf_params = {"max_depth": 4}
    
    for n_estimators in [100, 200, 300]:
        with mlflow.start_run(run_name=f"ntrees-{n_estimators}", nested=True):
            
            print(f"-----> Start training with ntrees = {n_estimators}")
            
            rf_params.update({"n_estimators": n_estimators})

            clf = RandomForestRegressor(**rf_params, random_state=42)
            clf.fit(train_x, train_y)

            predicted_qualities = clf.predict(test_x)

            (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
            rf_metrics = {"rmse": rmse, "mae": mae, "r2": r2}
            print("RMSE: %s" % rmse)
            print("MAE: %s" % mae)
            print("R2: %s" % r2)

            mlflow.log_params(rf_params)
            mlflow.log_metrics(rf_metrics)
            mlflow.sklearn.log_model(clf, "model")

### Set a tag and description on your run

In [None]:
my_desc = 'This is a test run'
with mlflow.start_run(run_id="d1e3382727c2415187f30f48c8f20f29", description=my_desc):
    mlflow.set_tag('algorithm', 'randomforest')
    # mlflow.set_tags()

### Lazy? Let's try MLflow's autolog
* Automatic logging allows you to log metrics, parameters, and models without the need for explicit log statements.
* MLflow currently supports autologging for a list of popular libraries. Check the docs for the latest list: https://mlflow.org/docs/latest/tracking.html#automatic-logging
* Detailed doc for mlflow.sklearn.autolog(): https://mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.autolog

In [None]:
gbm_grid_params = {
    "learning_rate": [0.02, 0.05, 0.1],
    "n_estimators": list(range(100, 501, 100)),
    "max_depth": list(range(2, 12, 2)),
    "subsample": [
        0.5,
        0.6,
        0.7,
    ],
}

print(gbm_grid_params)

In [None]:
# MLflow autologging supports scikit-learn estimators, including GridSearchCV and RandomizedSearchCV
mlflow.sklearn.autolog(silent=True, max_tuning_runs=5)

clf = GradientBoostingRegressor(random_state=42)
grid = RandomizedSearchCV(clf, gbm_grid_params, n_iter=10, cv=3, verbose=0)
grid.fit(train_x, train_y)

### Let's load back a model and predict some samples!

* Where is the model saved? 
  * Local artifact stores: mlruns/[experiment_id]/[run_id]/artifacts/[model_artifact_path]
  * Or copy it from your MLflow UI!

In [None]:
# Replace the experiment ID and run ID in this path with your own
model = mlflow.sklearn.load_model('mlruns/1/edd4ab5f328e4537bb7fcc12783b09d3/artifacts/model')
model

In [None]:
new_sample = test_x.copy()
predictions = model.predict(new_sample)
predictions

In [None]:
eval_metrics(test_y, predictions)

### Clean up your MLflow runs

#### Delete a run from the active runs


In [None]:
mlflow.delete_run(run_id='7d413df6bed24f7c9ce38b1eabec01ea')

---> Check the run's `meta.yaml`: its `lifecycle_stage` is now `deleted`

#### Remove it completely from the backend store


In [None]:
!mlflow gc

## For more advanced use cases:
### Let's watch this video if we still have time
https://app.pluralsight.com/course-player?clipId=26623955-88a7-49da-895a-b9621cc5616b
