# Experiment Tracking with MLFlow (Local)

In this demo we will see how to use MLFlow for tracking experiments, using a toy data set. In the attached lab (below), you will download a larger dataset and attempt to train the best model that you can.

We should first install mlflow, and add it to the requirements.txt file if not done already.

`pip install mlflow` or `python3 -m pip install mlflow`.

You may also need to `pip install setuptools`.

From here, make sure to save this notebook in a specific folder, and ensure you run all command line commands from the same folder.

In [1]:
import pickle
from pathlib import Path

import mlflow
import numpy as np
import pandas as pd
import sklearn as sk
import sklearn.linear_model  # loaded explicitly so that sk.linear_model is available below
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

After loading the libraries, we can first check which version of mlflow is installed. And, just for fun, let's look at the mlflow UI by running `mlflow ui` in a terminal. After this, we should do two things:
- set the tracking uri
- create or set the experiment

Setting the tracking uri tells mlflow where to save the results of our experiments. We will first save these locally in a sqlite instance. In the next lab we will set up mlflow to run in GCP.

If you've already created an experiment previously that you'd like to use, you can tell mlflow by setting the experiment. You can also use `set_experiment` even if the experiment has not yet been created - mlflow will first check if the experiment exists, and if not, it will create it for you. 

In [2]:
mlflow.__version__

'2.15.1'

Running the code below will create a SQLite database at the path given in the tracking URI (here `../mlflow.db`, one level above the current directory) and an mlruns folder in the current directory.

In [18]:
mlflow.set_tracking_uri('sqlite:///../mlflow.db')
mlflow.set_experiment('price-prediction-experiment')

INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.


<Experiment: artifact_location='/Users/tatshini/USF_School/DE/mlops/insyd_mlops/mlruns/1', creation_time=1725318818773, experiment_id='1', last_update_time=1725318818773, lifecycle_stage='active', name='price-prediction-experiment', tags={}>

# Data reformatting

In [5]:
def reformat(inp):
    for i in range(len(inp)):
        try:
            if type(inp[i]['price']) is str:
                t = ''
                for a in inp[i]['price']:
                    if a in '0123456789':
                        t += a
                inp[i]['price'] = int(t)
        except KeyError:
            pass
        try:
            for a in inp[i]['features']:
                s = a.split(' ')
                if s[1] == 'm2':
                    inp[i]['area'] = int(s[0])
                elif s[1] == 'hab.':
                    inp[i]['bed'] = int(s[0])
                elif s[1] == 'ba\u00f1o':
                    inp[i]['bath'] = int(s[0])
            del inp[i]['features']
        except KeyError:
            pass
        """
        if 'aire acondicionado' in inp[i]['desc'].lower():
            inp[i]['air_conditioning'] = True
        else:
            inp[i]['air_conditioning'] = False
        if 'jardin' in inp[i]['desc'].lower() or 'jard\u00edn' in inp[i]['desc'].lower():
            inp[i]['garden'] = True
        else:
            inp[i]['garden'] = False
        if 'parking' in inp[i]['desc'].lower():
            inp[i]['parking'] = True
        else:
            inp[i]['parking'] = False
        if 'galer\u00edn' in inp[i]['desc'].lower():
            inp[i]['gallery'] = True
        else:
            inp[i]['gallery'] = False
        """
    df = pd.DataFrame(inp)
    return df
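
To make the parsing rules in `reformat` easier to follow, here is a minimal standalone sketch of the same two ideas: keep only the digits of a price string, and map `"<number> <unit>"` feature strings to named fields. The helper names `parse_price` and `parse_features` and the sample record are hypothetical, and the exact format of the raw listings is an assumption.

```python
def parse_price(raw):
    """Keep only the ASCII digits of a price string; pass non-strings through."""
    if not isinstance(raw, str):
        return raw
    digits = "".join(ch for ch in raw if ch in "0123456789")
    return int(digits)


def parse_features(features):
    """Map entries such as '80 m2' to named numeric fields."""
    units = {"m2": "area", "hab.": "bed", "baño": "bath"}
    parsed = {}
    for item in features:
        parts = item.split(" ")
        # Skip entries that do not look like '<number> <unit>'.
        if len(parts) >= 2 and parts[1] in units:
            parsed[units[parts[1]]] = int(parts[0])
    return parsed


# Made-up listing, only to show the shape of the input.
record = {"price": "250.000 €", "features": ["80 m2", "3 hab.", "2 baño"]}
print(parse_price(record["price"]), parse_features(record["features"]))
```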

In [6]:
with open(Path('../data/train.pickle'), 'rb') as infile:
    train = pickle.load(infile)
train_rf = reformat(train).dropna()
with open(Path('../data/test_kaggle.pickle'), 'rb') as infile:
    test = pickle.load(infile)
y_train = train_rf['price']
x_train = train_rf.drop('price', axis=1)
x_test = reformat(test).drop('id', axis=1)

In [7]:
x_train = x_train.select_dtypes(exclude=['object'])
x_test = x_test.select_dtypes(exclude=['object']).dropna()

In [8]:
model = sk.linear_model.LinearRegression()
model.fit(x_train, y_train)
y_pred_train = model.predict(x_train)

# Calculate R^2 score on the training set
r2_train = r2_score(y_train, y_pred_train)
print("Linear Regression R^2 Score:", r2_train)

Linear Regression R^2 Score: 0.29313953268795934


In [36]:
with open(Path('../data/train.pickle'), 'rb') as infile:
    train = pickle.load(infile)
train_rf = reformat(train)
with open(Path('../data/test_kaggle.pickle'), 'rb') as infile:
    test = pickle.load(infile)
test_rf = reformat(test)
train = pd.DataFrame(train_rf)
y_train = train['price']
x_train = train.drop('price', axis=1)
x_test = pd.DataFrame(test_rf).drop('id', axis=1)[x_train.columns]
x_train = pd.get_dummies(x_train, columns=["loc_string", "loc", "type", "subtype", "selltype"])
x_test = pd.get_dummies(x_test, columns=["loc_string", "loc", "type", "subtype", "selltype"])
all_columns = set(x_train.columns)
# Add any missing columns to the test dataset with values set to 0
missing_columns = all_columns - set(x_test.columns)
for col in missing_columns:
    x_test[col] = 0
# Reorder columns to match the order in the training dataset
x_test = x_test[x_train.columns]
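
Adding missing dummy columns one at a time can make pandas complain about a fragmented DataFrame when there are many of them. A possible alternative, sketched here on toy data (the column name and category values are made up), is `DataFrame.reindex`, which aligns the test columns to the training columns in one step.

```python
import pandas as pd

# Toy frames; the column name and category values are made up for illustration.
train = pd.DataFrame({"type": ["flat", "house", "studio"], "area": [80, 120, 35]})
test = pd.DataFrame({"type": ["flat", "loft"], "area": [75, 60]})

train_enc = pd.get_dummies(train, columns=["type"])
test_enc = pd.get_dummies(test, columns=["type"])

# reindex adds the training-only columns (filled with 0), drops test-only
# columns, and matches the training column order in a single step.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```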



In [37]:
x_train["bath"] = x_train["bath"].fillna(1)
x_train["bed"] = x_train["bed"].fillna(1)
x_train["area"] = x_train["area"].fillna(int(x_train["area"].mean()))
x_test["bath"] = x_test["bath"].fillna(1)
x_test["bed"] = x_test["bed"].fillna(1)
x_test["area"] = x_test["area"].fillna(int(x_train["area"].mean()))

In [38]:
numeric_columns = x_train.select_dtypes(exclude=['object']).columns

scaler = StandardScaler()
x_train_numeric_scaled = x_train.copy()
x_train_numeric_scaled[numeric_columns] = scaler.fit_transform(x_train[numeric_columns])

x_train_processed = pd.DataFrame(np.hstack((x_train_numeric_scaled[numeric_columns], x_train.select_dtypes(include=['object']))))


In [39]:
preprocessor = ColumnTransformer(
    transformers=[
        ('tfidf_title', TfidfVectorizer(), 'title'),
        ('tfidf_desc', TfidfVectorizer(), 'desc')
    ],
    remainder='passthrough'
)
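
To see what this kind of preprocessor produces, here is a minimal sketch on a made-up three-row DataFrame (the texts and the `area` values are invented). Each text column gets its own TF-IDF vocabulary, and the remaining column is passed through unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up listings, only to show the shape of the output.
toy = pd.DataFrame({
    "title": ["sunny flat", "large house", "small flat"],
    "desc": ["near the beach", "with a garden", "near the station"],
    "area": [80, 150, 40],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # A plain string (not a list) hands TfidfVectorizer a 1-D column of text.
        ("tfidf_title", TfidfVectorizer(), "title"),
        ("tfidf_desc", TfidfVectorizer(), "desc"),
    ],
    remainder="passthrough",  # "area" is appended unchanged
)

features = toy_preprocessor.fit_transform(toy)
n_title = len(toy_preprocessor.named_transformers_["tfidf_title"].vocabulary_)
n_desc = len(toy_preprocessor.named_transformers_["tfidf_desc"].vocabulary_)

# One row per listing; columns = title vocabulary + desc vocabulary + 1 passthrough column.
print(features.shape, n_title, n_desc)
```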

In [40]:
columns = x_train.select_dtypes(exclude=['object']).columns.to_list() + x_train.select_dtypes(include=['object']).columns.to_list()

In [41]:
X_train = pd.DataFrame(x_train_processed.to_numpy(), columns=columns)

In [42]:
numeric_columns = x_test.select_dtypes(exclude=['object']).columns
x_test_numeric_scaled = x_test.copy()
x_test_numeric_scaled[numeric_columns] = scaler.transform(x_test[numeric_columns])

x_test_processed = pd.DataFrame(np.hstack((x_test_numeric_scaled[numeric_columns], x_test.select_dtypes(include=['object']))))
X_test = pd.DataFrame(x_test_processed.to_numpy(), columns=columns)

## Train a Model Using MLFlow

In this section, let's cross-validate three ensemble regressors (gradient boosting, random forest, and AdaBoost) and save the results of each run of the experiment using mlflow. To do so, we need to tell mlflow to start recording. We do this with `start_run`. 

The things we might want to record in this simple case are:
- the settings of the run (here, the number of cross-validation folds)
- the corresponding mean R^2 score of the model

We can also tag each run to make it easier to identify them later.

After running the below code, be sure to check the mlflow UI by running the following in the terminal from the same directory as where you saved this notebook:

Note that running plain `mlflow ui` will not show any of your experiments. You must specify the backend store URI (the place where all of your results are being stored), and it must point to the same database file as the tracking URI set in the notebook:

`mlflow ui --backend-store-uri sqlite:///../mlflow.db`
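
Relative SQLite paths are resolved against the directory a process is started from, so the notebook and the terminal only see the same database when both start in the same folder. One way around this, sketched below, is to build the URI from an absolute path. `sqlite_tracking_uri` is a hypothetical helper written for this example, not part of mlflow.

```python
from pathlib import Path


def sqlite_tracking_uri(db_path):
    """Return a SQLite URI that uses the absolute form of db_path."""
    absolute = Path(db_path).expanduser().resolve()
    return f"sqlite:///{absolute.as_posix()}"


uri = sqlite_tracking_uri("../mlflow.db")

# The same string can be passed to mlflow.set_tracking_uri(...) in the notebook
# and to the command line, regardless of the folder each one starts in.
print(f"mlflow ui --backend-store-uri {uri}")
```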

In [19]:
r2_scorer = make_scorer(r2_score)

# Define models
gb_model = make_pipeline(preprocessor, GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42))
rf_model = make_pipeline(preprocessor, RandomForestRegressor(n_estimators=100, max_depth=6, random_state=42))
adaboost_model =  make_pipeline(preprocessor, AdaBoostRegressor(learning_rate=0.1, random_state=42))

# Cross-validation with R^2 score for each model
models = {
    'GradientBoost': gb_model,
    'RandomForest': rf_model,
    'AdaBoost': adaboost_model
}
for name, model in models.items():
    # The with-block ends the run on exit, so no explicit mlflow.end_run() is needed
    with mlflow.start_run():
        mlflow.set_tags({"Model": name, "Train Data": "training-set"})
        mlflow.log_params({'cross_validation': 5})
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring=r2_scorer)
        mlflow.log_metric("Mean R2 Score", cv_scores.mean())

Typically, in a real-world scenario, you wouldn't change your parameter values manually and re-run your code. You would either loop through different parameter values, or use a built-in method for cross-validated search, of which there are a few. First, let's use a simple loop to run the experiment multiple times, and save the results of each run.

In [22]:
from itertools import product

r2_scorer = make_scorer(r2_score)

param_grid = {
    'n_estimators': [300, 400, 500],
    'learning_rate': [0.05, 0.1, 0.01, 0.08, 0.09, 0.07, 0.06, 0.12, 0.15],
    'max_depth': [3, 5],
    'min_samples_split': [5, 10],
    'subsample': [0.6],
}

# One run per combination, in the same order as nested loops over the keys above
for est, lr, md, ss, sbs in product(*param_grid.values()):
    params = {'n_estimators': est, 'learning_rate': lr, 'max_depth': md,
              'min_samples_split': ss, 'subsample': sbs}
    with mlflow.start_run():  # the run is ended automatically when the block exits
        mlflow.set_tags({"Model": "GradientBoost", "Train Data": "training-set"})
        mlflow.log_params(params)
        gb_model = make_pipeline(preprocessor, GradientBoostingRegressor(random_state=42, **params))
        cv_scores = cross_val_score(gb_model, X_train, y_train, cv=5, scoring=r2_scorer)
        # max_features is left at its default, so it is not part of the message
        print(f"Gradient boosting with parameters n_estimators={est}, learning_rate={lr}, max_depth={md}, "
              f"min_samples_split={ss}, subsample={sbs} Cross-Validation R^2 Scores:", cv_scores)
        mlflow.log_metric("Mean R2 Score", cv_scores.mean())

Gradient boosting with parameters n_estimators=300, learning_rate=0.05, max_depth=3, min_samples_split=5, subsample=0.6 Cross-Validation R^2 Scores: [0.5797279  0.47611293 0.64536234 0.4247842  0.52691693]
Gradient boosting with parameters n_estimators=300, learning_rate=0.05, max_depth=3, min_samples_split=10, subsample=0.6 Cross-Validation R^2 Scores: [0.57032164 0.47910052 0.65317386 0.41155257 0.52817611]
Gradient boosting with parameters n_estimators=300, learning_rate=0.05, max_depth=5, min_samples_split=5, subsample=0.6 Cross-Validation R^2 Scores: [0.56567861 0.479377   0.64487676 0.44558909 0.52319222]
Gradient boosting with parameters n_estimators=300, learning_rate=0.05, max_depth=5, min_samples_split=10, subsample=0.6 Cross-Validation R^2 Scores: [0.56287443 0.53277
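
One of the built-in alternatives to a hand-written loop is scikit-learn's `GridSearchCV`. The following is a minimal sketch on synthetic data; the data, the deliberately small grid, and the variable names are assumptions for illustration and are not the lab dataset or the grid used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42, subsample=0.6),
    param_grid={"n_estimators": [20, 40], "max_depth": [2, 3]},
    scoring="r2",
    cv=3,
)
search.fit(X_demo, y_demo)

# One entry per parameter combination, in the same order as mean_test_score.
for params, mean_r2 in zip(search.cv_results_["params"], search.cv_results_["mean_test_score"]):
    print(params, round(mean_r2, 3))
```

Each `(params, mean_r2)` pair could then be logged in its own run with `mlflow.log_params` and `mlflow.log_metric`, in the same way as the loop does.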

# Artifact Tracking and Model Registry (Local)

In this section we will save some artifacts from our model as we go through the model development process. There are a few things that might be worth saving, such as datasets, plots, and the final model itself that might go into production later.

## Data

First, let's see how we can store our important datasets, in a compressed format, for later use. This helps, for example, when we get a new request about our model and need to run some analyses (such as "what is the distribution of this feature, but only for this specific subset of data?" or "how did the model do on these particular observations from your validation set?").

In [58]:
r2_scorer = make_scorer(r2_score)

final_params = {'n_estimators': 400, 'learning_rate': 0.09, 'max_depth': 3,
                'min_samples_split': 5, 'subsample': 0.6}

# Start the run without a with-block so that it stays active for the artifact and
# model logging in the following cells; it is closed later by mlflow.end_run().
mlflow.start_run()
mlflow.set_tags({"Model": "GradientBoost", "Train Data": "training-set"})
mlflow.log_params({**final_params, 'cross_validation': 5})
gb_model = make_pipeline(preprocessor, GradientBoostingRegressor(random_state=42, **final_params))
cv_scores = cross_val_score(gb_model, X_train, y_train, cv=5, scoring=r2_scorer)
# Print the parameters actually used here, not variables left over from the loop
print(f"Gradient boosting with parameters n_estimators={final_params['n_estimators']}, "
      f"learning_rate={final_params['learning_rate']}, max_depth={final_params['max_depth']}, "
      f"min_samples_split={final_params['min_samples_split']}, subsample={final_params['subsample']} "
      "Cross-Validation R^2 Scores:", cv_scores)
mlflow.log_metric("Mean R2 Score", cv_scores.mean())


Gradient boosting with parameters n_estimators=400, learning_rate=0.09, max_depth=3, min_samples_split=5, subsample=0.6 Cross-Validation R^2 Scores: [0.58305717 0.50590536 0.67127705 0.43713941 0.52502832]


In [59]:
import os 

os.makedirs('save_data', exist_ok = True)

X_train.to_parquet('save_data/x_train.parquet')
pd.DataFrame(y_train).to_parquet('save_data/y_train.parquet')
mlflow.log_artifact('save_data/x_train.parquet')
mlflow.log_artifact('save_data/y_train.parquet')

In [60]:
X_test.to_parquet('save_data/x_test.parquet')

mlflow.log_artifact('save_data/x_test.parquet')

In [61]:
import pickle

os.makedirs('../models', exist_ok = True)

with open('../models/model.pkl','wb') as f:
    pickle.dump(gb_model,f)

# First we'll log the model as an artifact
mlflow.log_artifact('../models/model.pkl', artifact_path='my_models')

### Logging as a Model

Logging the model as an artifact only logs the pickle file (the serialized version of the model). On its own that is not very useful, because a model comes with metadata (such as its flavor and its dependencies) that might be critical to know for deploying the model later. mlflow has a built-in way of logging models specifically, so let's see how to use this, and how it's different from logging models as an artifact.

In [62]:
# Let's do it again, but this time we will log the model using log_model
mlflow.sklearn.log_model(gb_model, artifact_path = 'better_models')
mlflow.end_run()



### Loading Models

Now that models have been logged, you can load specific models back into python for predicting and further analysis. There are two main ways to do this: as a generic `pyfunc` model, or as a native scikit-learn model. The mlflow UI also gives you some instructions, with code that you can copy and paste.

In [63]:
logged_model = 'runs:/6e609e69e44a452781beff60f3502547/better_models' #replace with one of your models

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)
loaded_model

mlflow.pyfunc.loaded_model:
  artifact_path: better_models
  flavor: mlflow.sklearn
  run_id: 6e609e69e44a452781beff60f3502547

In [64]:
sklearn_model = mlflow.sklearn.load_model(logged_model)
sklearn_model

In [65]:
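# cross_val_score only fits clones of the pipeline, so the logged pipeline itself
# is still unfitted; fit it on the training data before predicting.
sklearn_model.fit(X_train, y_train)
preds = sklearn_model.predict(X_test)
preds[:5]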

array([351227.86924396, 341710.04607325, 290986.67943832, 374140.77610102,
       344945.10384913])

### Model Registry

Typically, you will **register** your *chosen* model, the model you plan to put into production. But, sometimes, after you've chosen and registered a model, you may need to replace that model with a new version. For example, the model may have gone into production and started to degrade in performance, and so the model needs to be retrained. Or, you go to deploy your model, notice an error or bug, and now have to go back and retrain it.

In this section let's see how we take our logged models and register them in the model registry, which then can get picked up by the production process, or engineer, for deployment. First, I'll demonstrate how this is done within the UI, but then below I'll show how we can use the python API to do the same thing.

In [66]:
runid = '6e609e69e44a452781beff60f3502547'
# The model URI is runs:/<run_id>/<artifact_path>, without an extra 'artifacts' segment
mod_path = f'runs:/{runid}/better_models'
mlflow.register_model(model_uri = mod_path, name = 'Price_prediction')

Successfully registered model 'Price_prediction'.
Created version '1' of model 'Price_prediction'.


<ModelVersion: aliases=[], creation_timestamp=1725334604027, current_stage='None', description=None, last_updated_timestamp=1725334604027, name='Price_prediction', run_id='6e609e69e44a452781beff60f3502547', run_link=None, source='/Users/tatshini/USF_School/DE/mlops/insyd_mlops/mlruns/1/6e609e69e44a452781beff60f3502547/artifacts/better_models', status='READY', status_message=None, tags={}, user_id=None, version=1>


# Experiment Tracking and Model Registry Lab

## Overview

In this lab you will each download a new dataset and attempt to train a good model, and use mlflow to keep track of all of your experiments, log your metrics, artifacts and models, and then register a final set of models for "deployment", though we won't actually deploy them anywhere yet.

## Goal

Your goal is **not** to become a master at MLFlow - this is not a course on learning all of the ins and outs of MLFlow. Instead, your goal is to understand when and why it is important to track your model development process (tracking experiments, artifacts and models) and to get into the habit of doing so, and then learn at least the basics of how MLFlow helps you do this so that you can then compare with other tools that are available.

## Data

You can choose your own dataset to use here. It will be helpful to choose a dataset that is already fairly clean and easy to work with. You can even use a dataset that you've used in a previous course. We will do a lot of labs where we do different things with datasets, so if you can find one that is interesting enough for modeling, it should work for most of the rest of the course. 

There are tons of places where you can find open public datasets. Choose something that interests you, but don't overthink it.

[Kaggle Datasets](https://www.kaggle.com/datasets)  
[HuggingFace Datasets](https://huggingface.co/docs/datasets/index)  
[Dagshub Datasets](https://dagshub.com/datasets/)  
[UCI](https://archive.ics.uci.edu/ml/datasets.php)  
[Open Data on AWS](https://registry.opendata.aws/)  
[Yelp](https://www.yelp.com/dataset)  
[MovieLens](https://grouplens.org/datasets/movielens/)  
And so many more...

## Instructions

Once you have selected a set of data, create a brand new experiment in MLFlow and begin exploring your data. Do some EDA, clean up, and learn about your data. You do not need to begin tracking anything yet, but you can if you want to (e.g. you can log different versions of your data as you clean it up and do any feature engineering). Do not spend a ton of time on this part. Your goal isn't really to build a great model, so don't spend hours on feature engineering and missing data imputation and things like that.

Once your data is clean, begin training models and tracking your experiments. If you intend to use this same dataset for your final project, then start thinking about what your model might look like when you actually deploy it. For example, when you engineer new features, be sure to save the code that does this, as you will need this in the future. If your final model has 1000 complex features, you might have a difficult time deploying it later on. If your final model takes 15 minutes to train, or takes a long time to score a new batch of data, you may want to think about training a less complex model.

At a minimum, you should:

1. Try at least 3 different ML algorithms.
2. Do hyperparameter tuning for each model.
3. Do some basic feature selection, and repeat the above steps with these reduced sets of features.
4. Identify the top 3 best models and note these down for later.
5. Choose the **final** model you want to deploy and stage it (in MLFlow) and run it on the test set to get a final measure of performance.
6. Log the exact training, validation, and testing datasets for the 3 best models, as well as hyperparameter values, and the values of your metrics.  
7. Push your code to GitHub. No need to track the mlruns folder, the images folder, any datasets, or the sqlite database in git.

### Turning It In

In the MLFlow UI, next to the refresh button you should see three vertical dots. Click the dots and then download your experiments as a csv file. Open the csv file and highlight the rows for your top 3 models from step 4 above and then save as an excel file. Take a snapshot of the Models page in the MLFlow UI showing the model you staged in step 5 above. Submit the excel file and snapshot to Canvas.