# 1. Experiment Tracking

See `introduction.ipynb` for the experiment tracking code.

![image info](images/experiment_tracking.png)

## 1.1 Terminology

- Experiment run: each trial in an ML experiment
- Run artifact: any file that is associated with an ML run


## 1.2 What is experiment tracking?

- Experiment tracking is the process of keeping track of all the relevant information from an ML experiment, including source code, environment, data, model, hyperparameters and metrics.

## 1.3 Why is experiment tracking so important?

- Reproducibility: we want experiments that can be reproduced.
- Organization: for collaboration, so everybody knows where to find what.
- Optimization: tracked parameters and metrics make it easier to compare runs and tune the ML model (here using MLflow).

## 1.4 Tracking experiments in spreadsheets

Spreadsheets are not enough because:
- They are error prone, as values are filled in manually.
- There is no standard format and no standard way of seeing how hyperparameters changed between runs.
- They give little visibility and make collaboration hard.

## 1.5 Experiment tracking with MLflow

MLflow is an open source platform for the machine learning lifecycle. It contains four main modules: Tracking, Models, Model Registry and Projects.

- **Tracking:** allows you to organize your experiments into runs and to keep track of parameters, metrics, metadata, artifacts and models. Along with this information, MLflow automatically logs extra information about each run: source code, version of the code, start and end time, and author.
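To make the vocabulary concrete, here is a minimal standard-library sketch of the kind of record a tracker keeps per run. `ToyRun` and its fields are hypothetical names chosen for illustration; this is not MLflow's actual data model, only an approximation of the information listed above.

```python
from dataclasses import dataclass, field
import time
import uuid


@dataclass
class ToyRun:
    """Hypothetical stand-in for one tracked run (not an MLflow class)."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start_time: float = field(default_factory=time.time)
    params: dict = field(default_factory=dict)     # inputs, e.g. hyperparameters
    metrics: dict = field(default_factory=dict)    # outputs, e.g. validation error
    tags: dict = field(default_factory=dict)       # free-form metadata
    artifacts: list = field(default_factory=list)  # paths of files tied to the run


run = ToyRun()
run.tags["developer"] = "Joanna"
run.params["alpha"] = 0.1
run.artifacts.append("models/lin_reg.bin")
```

Each trial gets its own record, so two runs with different `alpha` values never overwrite each other.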

### 1.5.1 Steps to create the experiment tracker

- `cd` into the project directory
- Run `mlflow ui --backend-store-uri sqlite:///mlflow.db` to start the MLflow UI with a SQLite backend store

In [3]:
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression 
from sklearn.linear_model import Lasso 
from sklearn.linear_model import Ridge 
from sklearn.metrics import mean_squared_error 

In [22]:
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc-taxi-experiment-new") # creates the experiment if it does not exist; otherwise new runs are added to it

<Experiment: artifact_location='./mlruns/3', experiment_id='3', lifecycle_stage='active', name='nyc-taxi-experiment-new', tags={}>

In [10]:
def read_dataframe(filename):
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)
        
        df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
        df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename)
        
    df['duration'] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df
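The duration logic in `read_dataframe` can be checked on a few hand-made trips. This is a standard-library sketch; the three trips are made-up values, not rows from the taxi data.

```python
from datetime import datetime


def trip_duration_minutes(pickup, dropoff):
    """Trip length in minutes, the same quantity as the `duration` column."""
    return (dropoff - pickup).total_seconds() / 60


# Made-up (pickup, dropoff) pairs for illustration only.
trips = [
    (datetime(2021, 1, 1, 0, 15), datetime(2021, 1, 1, 0, 15, 30)),  # 0.5 min
    (datetime(2021, 1, 1, 8, 0), datetime(2021, 1, 1, 8, 25)),       # 25 min
    (datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 1, 11, 0)),       # 120 min
]

durations = [trip_duration_minutes(p, d) for p, d in trips]

# Same filter as in read_dataframe: keep trips between 1 and 60 minutes.
kept = [minutes for minutes in durations if 1 <= minutes <= 60]
```

Only the 25-minute trip survives the filter; very short and very long trips are treated as outliers.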
        

In [11]:
df_train = read_dataframe('./data/green_tripdata_2021-01.parquet')
df_val = read_dataframe('./data/green_tripdata_2021-02.parquet')

In [12]:
df_train['PU_DO'] = df_train['PULocationID'] + '_' + df_train['DOLocationID']
df_val['PU_DO'] = df_val['PULocationID'] + '_' + df_val['DOLocationID']

In [13]:
categorical = ['PU_DO'] #'PULocationID', 'DOLocationID']
numerical = ['trip_distance']

dv = DictVectorizer()

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
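What `DictVectorizer` does with these records can be sketched in plain Python. The two functions below are a simplified, dense approximation written for illustration (the real class returns a sparse matrix and handles more cases); the toy records are made up.

```python
def fit_feature_index(records):
    """Assign a column index to every feature name seen in the records."""
    names = set()
    for record in records:
        for key, value in record.items():
            # Strings become "key=value" indicator columns; numbers keep their key.
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return {name: i for i, name in enumerate(sorted(names))}


def transform(records, feature_index):
    """Turn records into dense rows using an already fitted index."""
    rows = []
    for record in records:
        row = [0.0] * len(feature_index)
        for key, value in record.items():
            is_category = isinstance(value, str)
            name = f"{key}={value}" if is_category else key
            if name in feature_index:  # categories unseen during fit are ignored
                row[feature_index[name]] = 1.0 if is_category else float(value)
        rows.append(row)
    return rows


toy_train = [
    {"PU_DO": "43_151", "trip_distance": 1.2},
    {"PU_DO": "166_239", "trip_distance": 3.5},
]
toy_val = [
    {"PU_DO": "43_151", "trip_distance": 0.8},
    {"PU_DO": "7_7", "trip_distance": 2.0},  # pair not present in toy_train
]

index = fit_feature_index(toy_train)
X_toy_train = transform(toy_train, index)
X_toy_val = transform(toy_val, index)
```

This is why the vectorizer is fitted on the training data only and merely applied to the validation data: the validation rows must use the columns learned during training.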

In [14]:
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

In [15]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_val)

mean_squared_error(y_val, y_pred, squared=False)

7.758715205808363
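With `squared=False`, `mean_squared_error` returns the root mean squared error (RMSE), so the value above is in minutes. A plain-Python sketch of the same metric, using made-up durations, is shown below. Note that, as far as I am aware, scikit-learn 1.4 and later deprecate the `squared` argument in favour of `root_mean_squared_error`, so the notebook call may need adjusting on newer installs.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up durations in minutes, for illustration only.
example = rmse([10.0, 20.0, 30.0], [13.0, 16.0, 30.0])
```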

Using MLflow to keep track of the trained model's parameters

In [16]:
with mlflow.start_run():

    mlflow.set_tag("developer", "Joanna")

    # log the files that were actually used to build X_train and X_val above
    mlflow.log_param("train-data-path", "./data/green_tripdata_2021-01.parquet")
    mlflow.log_param("valid-data-path", "./data/green_tripdata_2021-02.parquet")

    alpha = 0.1
    mlflow.log_param("alpha", alpha)
    lr = Lasso(alpha)
    lr.fit(X_train, y_train)

    y_pred = lr.predict(X_val)

    rmse = mean_squared_error(y_val, y_pred, squared=False)
    mlflow.log_metric("rmse", rmse)

    # Method 1 of saving a model: pickle it to disk, then log the file as an artifact
    # (assumes the models/ directory exists; saving dv alongside lr is a choice made here)
    with open("models/lin_reg.bin", "wb") as f_out:
        pickle.dump((dv, lr), f_out)
    mlflow.log_artifact(local_path="models/lin_reg.bin", artifact_path="models_pickle")

Some frameworks, such as XGBoost, support automatic logging through `autolog()`. See https://www.mlflow.org/docs/latest/tracking.html for the list of supported libraries.

In [23]:
import xgboost as xgb

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt.pyll import scope

In [24]:
train = xgb.DMatrix(X_train, label=y_train)
valid = xgb.DMatrix(X_val, label=y_val)

In [25]:
def objective(params): # params: the XGBoost hyperparameters for this specific run
    
    with mlflow.start_run():
        mlflow.set_tag("model", "xgboost")
        mlflow.log_params(params)
        booster = xgb.train(
            params=params,
            dtrain=train,
            num_boost_round=10, # 10 just for testing. In practice, can do 1000.
            evals=[(valid, "validation")],
            early_stopping_rounds=50
        )
        y_pred = booster.predict(valid)
        rmse = mean_squared_error(y_val, y_pred, squared=False)
        mlflow.log_metric("rmse", rmse)
        
        
    return {'loss': rmse, 'status': STATUS_OK}
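The objective returns a dictionary with a `loss` and a `status`, which is what the optimizer minimizes over. The sketch below shows that contract with a plain random search standing in for hyperopt's TPE algorithm; `toy_objective`, `random_search` and the local `STATUS_OK` constant are hypothetical names, and the loss function is invented for illustration.

```python
import random

STATUS_OK = "ok"  # local stand-in for hyperopt's STATUS_OK constant


def toy_objective(params):
    # Invented loss with a known minimum at learning_rate = 0.3.
    loss = (params["learning_rate"] - 0.3) ** 2
    return {"loss": loss, "status": STATUS_OK}


def random_search(objective, n_trials, seed=42):
    """Call the objective n_trials times and keep the lowest loss (not TPE)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {"learning_rate": rng.uniform(0.05, 1.0)}
        result = objective(params)
        if result["status"] != STATUS_OK:
            continue  # failed trials are skipped
        if best is None or result["loss"] < best["loss"]:
            best = {"loss": result["loss"], "params": params}
    return best


best = random_search(toy_objective, n_trials=50)
```

In the notebook, each call to the objective also opens its own MLflow run, so every trial appears as a separate row in the UI.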

In [None]:
search_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 4, 100, 1)),
    'learning_rate': hp.loguniform('learning_rate', -3, 0), # exp(-3), exp(0) --> [0.05, 1]
    'reg_alpha': hp.loguniform('reg_alpha', -5, -1),
    'reg_lambda': hp.loguniform('reg_lambda', -6, -1),
    'min_child_weight': hp.loguniform('min_child_weight', -1, 3),
    'objective': 'reg:linear', #regression problem
    'seed': 42
}

best_result = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest, # algorithm to run the optimization
    max_evals=50,
    trials=Trials()
)
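The bounds passed to `hp.loguniform` are exponents, not the values themselves: the sampled value is `exp(u)` with `u` drawn uniformly between the two bounds. A standard-library sketch of that definition (`sample_loguniform` is a hypothetical helper, not part of hyperopt) confirms the range quoted in the `learning_rate` comment.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw exp(u) with u ~ Uniform(low, high)."""
    return math.exp(rng.uniform(low, high))


rng = random.Random(0)
draws = [sample_loguniform(-3, 0, rng) for _ in range(1000)]

# Bounds of the learning_rate search space: roughly 0.05 to 1.
lower_bound, upper_bound = math.exp(-3), math.exp(0)
```

Sampling on a log scale spends as many trials between 0.05 and 0.1 as between 0.5 and 1, which suits parameters whose effect depends on order of magnitude.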

After finding the best parameters, we can retrain the model with them and let autologging record the run.

In [105]:
mlflow.end_run()

In [29]:
mlflow.xgboost.autolog(disable=False)

In [30]:
params = {
    'learning_rate': 0.47238438352430406,
    'max_depth': 69,
    'min_child_weight':	1.9882917503498563,
    'objective': 'reg:linear',
    'reg_alpha': 0.10369777005688582,
    'reg_lambda': 0.3648877224132926,
    'seed': 42
}

mlflow.xgboost.autolog()

booster = xgb.train(
            params=params,
            dtrain=train,
            num_boost_round=1000, 
            evals=[(valid, "validation")],
            early_stopping_rounds=50)

2022/05/25 10:27:46 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'b9248be6e4fd43eaba84fec4580c00a8', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current xgboost workflow


[0]	validation-rmse:12.96198
[1]	validation-rmse:9.16113
[2]	validation-rmse:7.57532
[3]	validation-rmse:6.92101
[4]	validation-rmse:6.65273
[5]	validation-rmse:6.53380
[6]	validation-rmse:6.47145
[7]	validation-rmse:6.43689
[8]	validation-rmse:6.41869
[9]	validation-rmse:6.40946
[10]	validation-rmse:6.40333
[11]	validation-rmse:6.39770
[12]	validation-rmse:6.39220
[13]	validation-rmse:6.38737
[14]	validation-rmse:6.38185
[15]	validation-rmse:6.37721
[16]	validation-rmse:6.37559
[17]	validation-rmse:6.37273
[18]	validation-rmse:6.37026
[19]	validation-rmse:6.36691
[20]	validation-rmse:6.36417
[21]	validation-rmse:6.36147
[22]	validation-rmse:6.35938
[23]	validation-rmse:6.35809
[24]	validation-rmse:6.35593
[25]	validation-rmse:6.35396
[26]	validation-rmse:6.35010
[27]	validation-rmse:6.34818
[28]	validation-rmse:6.34672
[29]	validation-rmse:6.34467
[30]	validation-rmse:6.34207
[31]	validation-rmse:6.34089
[32]	validation-rmse:6.33992
[33]	validation-rmse:6.33851
[34]	validation-rmse:6.



![mlflow ui](images/experiment_tracking_1.png)

# 2. Model Management

A basic way of managing model versions is to keep them in manually created folders:

![model management](images/model_management.png)

The problems with this approach are:
- Error prone: folders are created manually, so a model might be overwritten accidentally.
- No versioning: as the number of models grows, it gets confusing.
- No model lineage: it is not easy to understand how all these models were created (for example, which hyperparameters were used).

Using MLflow, we can organise our models better.

![mlflow_model_registry](images/model_registry_chart.png)

## 2.1 Ways to save a model

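- Method 1: log an existing file (for example a pickled model) as an artifact  
`mlflow.log_artifact(local_path="models/lin_reg.bin", artifact_path="models_pickle")`

- Method 2: log the model with the MLflow flavour of its framework  
`mlflow.xgboost.log_model(booster, artifact_path="models_mlflow")`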
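Method 1 only copies a file that already exists, so the model has to be written to disk first. The sketch below shows that pickle round trip with the standard library only; the dictionary is a hypothetical stand-in for a fitted model and a temporary directory replaces the project's `models/` folder.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted model object.
model = {"coef": [0.5, -1.2], "intercept": 3.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "models" / "lin_reg.bin"
    model_path.parent.mkdir(parents=True, exist_ok=True)

    # Step 1: serialize the model to a local file.
    with open(model_path, "wb") as f_out:
        pickle.dump(model, f_out)
    file_exists = model_path.exists()

    # Step 2 (not run here): mlflow.log_artifact(local_path=str(model_path), ...)
    # would now have a file to copy into the run's artifact store.

    # Loading it back later requires knowing it was pickled.
    with open(model_path, "rb") as f_in:
        restored = pickle.load(f_in)
```

Method 2 removes that guesswork, because the logged model also records how it should be loaded.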

In [31]:
mlflow.xgboost.autolog(disable=True)

In [32]:
with mlflow.start_run():
    train = xgb.DMatrix(X_train, label=y_train)
    valid = xgb.DMatrix(X_val, label=y_val)

    best_params = {
        'learning_rate': 0.47238438352430406,
        'max_depth': 69,
        'min_child_weight': 1.9882917503498563,
        'objective': 'reg:linear',
        'reg_alpha': 0.10369777005688582,
        'reg_lambda': 0.3648877224132926,
        'seed': 42
    }

    mlflow.log_params(best_params)

    booster = xgb.train(
        params=best_params,
        dtrain=train,
        num_boost_round=1000,
        evals=[(valid, "validation")],
        early_stopping_rounds=50
    )

    y_pred = booster.predict(valid)
    rmse = mean_squared_error(y_val, y_pred, squared=False)
    mlflow.log_metric("rmse", rmse)

    with open("models/preprocessor.b", "wb") as f_out:
        pickle.dump(dv, f_out)

    # method 1: log the pickled preprocessor (the DictVectorizer) as a plain artifact
    mlflow.log_artifact("models/preprocessor.b", artifact_path="preprocessor")

    # method 2: log the booster with the XGBoost flavour
    mlflow.xgboost.log_model(booster, artifact_path="models_mlflow")

[0]	validation-rmse:12.96198
[1]	validation-rmse:9.16113
[2]	validation-rmse:7.57532
[3]	validation-rmse:6.92101
[4]	validation-rmse:6.65273
[5]	validation-rmse:6.53380
[6]	validation-rmse:6.47145
[7]	validation-rmse:6.43689
[8]	validation-rmse:6.41869
[9]	validation-rmse:6.40946
[10]	validation-rmse:6.40333
[11]	validation-rmse:6.39770
[12]	validation-rmse:6.39220
[13]	validation-rmse:6.38737
[14]	validation-rmse:6.38185
[15]	validation-rmse:6.37721
[16]	validation-rmse:6.37559
[17]	validation-rmse:6.37273
[18]	validation-rmse:6.37026
[19]	validation-rmse:6.36691
[20]	validation-rmse:6.36417
[21]	validation-rmse:6.36147
[22]	validation-rmse:6.35938
[23]	validation-rmse:6.35809
[24]	validation-rmse:6.35593
[25]	validation-rmse:6.35396
[26]	validation-rmse:6.35010
[27]	validation-rmse:6.34818
[28]	validation-rmse:6.34672
[29]	validation-rmse:6.34467
[30]	validation-rmse:6.34207
[31]	validation-rmse:6.34089
[32]	validation-rmse:6.33992
[33]	validation-rmse:6.33851
[34]	validation-rmse:6.

ModelInfo(artifact_path='models_mlflow', flavors={'python_function': {'loader_module': 'mlflow.xgboost', 'python_version': '3.8.5', 'data': 'model.xgb', 'env': 'conda.yaml'}, 'xgboost': {'xgb_version': '1.6.1', 'data': 'model.xgb', 'model_class': 'xgboost.core.Booster', 'code': None}}, model_uri='runs:/3af341ad41d8439fa8c824959a1e8b8a/models_mlflow', model_uuid='7b2dcb2c4d8349629883c301543735dc', run_id='3af341ad41d8439fa8c824959a1e8b8a', saved_input_example_info=None, signature_dict=None, utc_time_created='2022-05-25 02:31:03.387579', mlflow_version='1.26.0')

## 2.2 Ways to load a model

- Flavour 1: Python function (`mlflow.pyfunc`)
- Flavour 2: the framework's native flavour (XGBoost here; scikit-learn etc. for other models)

The loading code below is copied from the MLflow UI.

In [33]:
import mlflow
logged_model = 'runs:/3af341ad41d8439fa8c824959a1e8b8a/models_mlflow'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)



In [34]:
loaded_model

mlflow.pyfunc.loaded_model:
  artifact_path: models_mlflow
  flavor: mlflow.xgboost
  run_id: 3af341ad41d8439fa8c824959a1e8b8a

In [35]:
# Load model as an XGBoost model
xgboost_model = mlflow.xgboost.load_model(logged_model)



In [36]:
y_pred = xgboost_model.predict(valid)

## 2.3 MLflow's Model Registry

The MlflowClient object allows us to interact with...

- an MLflow Tracking Server that creates and manages experiments and runs.
- an MLflow Registry Server that creates and manages registered models and model versions.

In [37]:
from mlflow.tracking import MlflowClient

MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"

client = MlflowClient(tracking_uri=MLFLOW_TRACKING_URI)

In [38]:
client.list_experiments()

[<Experiment: artifact_location='./mlruns/3', experiment_id='3', lifecycle_stage='active', name='nyc-taxi-experiment-new', tags={}>]

In [40]:
client.create_experiment(name="new-experiment")

'5'

What is the best run from our experiments?

In [41]:
from mlflow.entities import ViewType

runs = client.search_runs(
    experiment_ids='3',
    filter_string="metrics.rmse < 6.5",
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=5,
    order_by=["metrics.rmse ASC"]
)

In [42]:
for run in runs:
    print(f"Run ID: {run.info.run_id}, rmse: {run.data.metrics['rmse']:.4f}")

Run ID: f3c4a33951aa44bfaa0e2d65fa606c73, rmse: 6.2906
Run ID: ea522146c6c4485990739d5d7b3b888a, rmse: 6.2906
Run ID: 7ee3f06acf924f7db455190a82531dd6, rmse: 6.2906
Run ID: 48faba9136da4144a1e648e6266aaa11, rmse: 6.3128
Run ID: 3af341ad41d8439fa8c824959a1e8b8a, rmse: 6.3244
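The query above combines a filter, an ordering and a result limit. The same logic over plain dictionaries looks like the sketch below; three run IDs and their RMSE values are taken from the output above, while `hypothetical-slow-run` and the `search` helper are made up for illustration.

```python
toy_runs = [
    {"run_id": "3af341ad41d8439fa8c824959a1e8b8a", "rmse": 6.3244},
    {"run_id": "hypothetical-slow-run", "rmse": 7.1},  # made up, fails the filter
    {"run_id": "48faba9136da4144a1e648e6266aaa11", "rmse": 6.3128},
    {"run_id": "f3c4a33951aa44bfaa0e2d65fa606c73", "rmse": 6.2906},
]


def search(runs, max_rmse, max_results):
    """Keep runs with rmse < max_rmse, sort ascending, return the first few."""
    matching = [run for run in runs if run["rmse"] < max_rmse]
    return sorted(matching, key=lambda run: run["rmse"])[:max_results]


top = search(toy_runs, max_rmse=6.5, max_results=5)
best_run_id = top[0]["run_id"]
```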


![model registry 1](images/model_registry.png)

In [43]:
import mlflow

MLFLOW_TRACKING_URI = "sqlite:///mlflow.db"
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

### 2.3.1 Register Models

In [106]:
run_id = "f3c4a33951aa44bfaa0e2d65fa606c73"
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri=model_uri, name="nyc-taxi-regressor")

Registered model 'nyc-taxi-regressor' already exists. Creating a new version of this model...
2022/05/25 11:20:34 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: nyc-taxi-regressor, version 4
Created version '4' of model 'nyc-taxi-regressor'.


<ModelVersion: creation_timestamp=1653448834456, current_stage='None', description=None, last_updated_timestamp=1653448834456, name='nyc-taxi-regressor', run_id='f3c4a33951aa44bfaa0e2d65fa606c73', run_link=None, source='./mlruns/3/f3c4a33951aa44bfaa0e2d65fa606c73/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=4>

![model_registry_2](images/model_registry_2.png)

In [107]:
model_uri

'runs:/f3c4a33951aa44bfaa0e2d65fa606c73/model'
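Two URI schemes appear in these notes: `runs:/<run_id>/<artifact_path>` points at a model logged in a specific run, and `models:/<name>/<stage or version>` points at a registered model. The two helpers below are hypothetical conveniences that only build the strings.

```python
def run_model_uri(run_id, artifact_path):
    """URI of a model logged under artifact_path in a given run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name, stage_or_version):
    """URI of a registered model, addressed by stage name or version number."""
    return f"models:/{name}/{stage_or_version}"


logged_uri = run_model_uri("f3c4a33951aa44bfaa0e2d65fa606c73", "model")
production_uri = registry_model_uri("nyc-taxi-regressor", "Production")
version_uri = registry_model_uri("nyc-taxi-regressor", 4)
```

The artifact path must match the one used when the model was logged: `model` for this run, `models_mlflow` for the run logged manually with `log_model` earlier.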

### 2.3.2 Transition the stage of the model version

In [108]:
model_name = "nyc-taxi-regressor"
latest_versions = client.get_latest_versions(name=model_name)

for version in latest_versions:
    print(f"Version: {version.version}, Stage: {version.current_stage}")

Version: 1, Stage: Archived
Version: 2, Stage: Production
Version: 3, Stage: Staging
Version: 4, Stage: None


![model_registry_3](images/model_registry_3.png)

In [99]:
model_version = 3
new_stage = "Production"

client.transition_model_version_stage(
    name=model_name,
    version=model_version,
    stage=new_stage,
    archive_existing_versions=False
)

<ModelVersion: creation_timestamp=1653448465730, current_stage='Production', description=None, last_updated_timestamp=1653448509540, name='nyc-taxi-regressor', run_id='12ba80256bb84890995d4d96f4c2f5fc', run_link=None, source='./mlruns/3/12ba80256bb84890995d4d96f4c2f5fc/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=3>

![model_registry_4](images/model_registry_5.png)

### 2.3.3 Update Model Description

In [100]:
from datetime import datetime

date = datetime.today()

client.update_model_version(
    name=model_name,
    version=model_version,
    description=f"The model version {model_version} was transitioned to {new_stage} on {date}"
)

<ModelVersion: creation_timestamp=1653448465730, current_stage='Production', description=('The model version 3 was transitioned to Production on 2022-05-25 '
 '11:16:27.578790'), last_updated_timestamp=1653448587580, name='nyc-taxi-regressor', run_id='12ba80256bb84890995d4d96f4c2f5fc', run_link=None, source='./mlruns/3/12ba80256bb84890995d4d96f4c2f5fc/artifacts/model', status='READY', status_message=None, tags={}, user_id=None, version=3>

# 3. Example

In [109]:
from sklearn.metrics import mean_squared_error
import pandas as pd


def read_dataframe(filename):
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename)

    # make sure both timestamp columns are datetimes, whatever the input format
    df.lpep_dropoff_datetime = pd.to_datetime(df.lpep_dropoff_datetime)
    df.lpep_pickup_datetime = pd.to_datetime(df.lpep_pickup_datetime)

    df['duration'] = df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    df.duration = df.duration.apply(lambda td: td.total_seconds() / 60)

    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df


def preprocess(df, dv):
    df['PU_DO'] = df['PULocationID'] + '_' + df['DOLocationID']
    categorical = ['PU_DO']
    numerical = ['trip_distance']
    train_dicts = df[categorical + numerical].to_dict(orient='records')
    return dv.transform(train_dicts)


def test_model(name, stage, X_test, y_test):
    model = mlflow.pyfunc.load_model(f"models:/{name}/{stage}")
    y_pred = model.predict(X_test)
    return {"rmse": mean_squared_error(y_test, y_pred, squared=False)}

In [110]:
df = read_dataframe("data/green_tripdata_2021-03.parquet")

In [111]:
# downloads the preprocessor artifact logged with the run into a local preprocessor/ folder
client.download_artifacts(run_id=run_id, path="preprocessor", dst_path=".")

'c:\\Users\\joann\\OneDrive\\Desktop\\My Files\\Data Science\\mlops_zoomcamp\\Module 2 - Experimental Tracking\\preprocessor'

In [112]:
with open("preprocessor/preprocessor.b", "rb") as f_in:
    dv = pickle.load(f_in)

In [113]:
X_test = preprocess(df, dv)

In [114]:
target = "duration"
y_test = df[target].values

In [None]:
%time test_model(name=model_name, stage="Production", X_test=X_test, y_test=y_test)

In [None]:
client.transition_model_version_stage(
    name=model_name,
    version=4,
    stage="Production",
    archive_existing_versions=True
)
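The effect of `archive_existing_versions` can be sketched as a small state update: when it is `True`, versions already in the target stage are moved to `Archived`. The `transition` function below is a hypothetical simulation of that behaviour, not the MLflow client, and it starts from the version listing printed in the stage transition section above.

```python
def transition(stages, version, stage, archive_existing_versions=False):
    """Return a new {version: stage} mapping after moving `version` to `stage`."""
    updated = dict(stages)
    if archive_existing_versions:
        for other, current in stages.items():
            if other != version and current == stage:
                updated[other] = "Archived"
    updated[version] = stage
    return updated


before = {1: "Archived", 2: "Production", 3: "Staging", 4: "None"}

# Version 3 joins Production; version 2 stays there as well.
after_keep = transition(before, 3, "Production", archive_existing_versions=False)

# Version 4 takes over Production; versions 2 and 3 are archived.
after_archive = transition(after_keep, 4, "Production", archive_existing_versions=True)
```

With `False`, several versions can share the Production stage, so a `models:/<name>/Production` URI is only unambiguous once the older versions have been archived.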