# Experiment tracking

## What is experiment tracking?
Experiment tracking is the process of keeping track of all the **relevant information** from an **ML experiment**, which includes:

- source code
- environment
- data
- model
- hyperparameters
- metrics



## Why is experiment tracking important?

In general, because of these 3 main reasons:

- reproducibility
- organization
- optimization



## Tracking experiments in spreadsheets

Why is it not enough?

- error prone
- no standard format
- poor visibility & collaboration

[excel_tracking.png](excel_tracking.png)

## MLflow

Definition: "an open source platform for the machine learning lifecycle"

In practice, it's just a Python package that can be installed with pip, and it contains four main modules:
- tracking
- models
- model registry
- projects


The tracking module keeps track of:
- parameters
- metrics
- metadata
- artifacts
- models

It also logs automatically:
- source code
- version of the code (git commit)
- start & end time
- author



`mlflow ui --backend-store-uri sqlite:///mlflow.db`

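This tells MLflow to use a SQLite database (`mlflow.db`) as the backend store, which holds the run metadata (parameters, metrics, tags). Artifacts (files such as models) go to a separate artifact store, by default a local directory.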



`mlflow.set_tracking_uri` points the client at the SQLite database (e.g. `"sqlite:///mlflow.db"`)

`mlflow.set_experiment` sets the active experiment and creates it if it does not exist yet. You can then see it in the UI.

`with mlflow.start_run():` associates everything logged inside the block with the current run

`mlflow.set_tag` sets tags associated with the current run (e.g. the developer name)

`mlflow.log_param` logs a single parameter as a key-value pair (e.g. a hyperparameter, or the path of the dataset)

`mlflow.log_metric` logs a metric, e.g. `mlflow.log_metric("rmse", rmse)`

## Hyperparameter tuning logging

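`from hyperopt import fmin` - the model training is wrapped in a function, and `fmin` searches for the hyperparameters that minimize the output of that function

`tpe` is the algorithm (Tree-structured Parzen Estimator) that controls the search for the minimum

`hp` defines the ranges (search space) for the hyperparameters

`STATUS_OK` is the status the objective function returns to signal that the run finished successfully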



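The objective function is what `fmin` runs on every iteration:

`def objective(params)`

`params` holds the hyperparameter values sampled for that iteration

xgboost is trained with these values, and the error on the validation set is returned as the loss that `fmin` tries to minimize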

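`hp.loguniform(label, low, high)` defines a log-uniform search space: it returns `exp(uniform(low, high))`, so the logarithm of the value is uniformly distributed and the value itself lies between `exp(low)` and `exp(high)` (see "Defining a Search Space" in the hyperopt docs)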
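A small illustration of what log-uniform sampling means, written with the standard library only. The helper `sample_loguniform` is a made-up name that mimics the behavior described for `hp.loguniform`; it is not hyperopt's own implementation.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw exp(u) with u ~ Uniform(low, high)."""
    return math.exp(rng.uniform(low, high))


rng = random.Random(0)
samples = [sample_loguniform(-3, 0, rng) for _ in range(10_000)]

# All samples lie between exp(-3) (about 0.05) and exp(0) = 1.
lowest, highest = min(samples), max(samples)

# Roughly half of the samples fall below exp(-1.5) (about 0.22), the midpoint
# on the log scale, so small values are explored as often as large ones.
share_below = sum(s < math.exp(-1.5) for s in samples) / len(samples)
```

This is why a log-uniform range suits parameters such as a learning rate, where the difference between 0.01 and 0.1 matters as much as the difference between 0.1 and 1.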




## After hyperparameter tuning

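- instead of logging everything by hand inside `with mlflow.start_run():`, there is an alternative:

- mlflow autologging: if you use PySpark, XGBoost, scikit-learn and some other frameworks, you can use the autolog functionality!

`mlflow.xgboost.autolog()`

Call it before training, then train the model with the best parameters found during tuning.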

- `dv` is the `DictVectorizer`
- `lr` is the `LinearRegression` model

saving the model as an artifact:

`mlflow.log_artifact(local_path = "models/lin_reg.bin", artifact_path = "models_pickle")`



logging the model with MLflow's xgboost flavor (`booster` is the trained xgboost model):

`mlflow.xgboost.log_model(booster, artifact_path = "models_mlflow")`



