# MLflow Tracking Example

MLflow is organized into four components: **Tracking**, **Projects**, **Models**, and **Model Registry**. You can use each component on its own (for example, you might export models in MLflow's model format without using Tracking or Projects), but they are also designed to work well together. This notebook focuses only on the **Tracking** component within the PySpark environment.

### Why is tracking useful/important?

Machine learning typically requires experimenting with a diverse set of hyperparameter tuning techniques, data preparation steps, and algorithms to build a model that maximizes some target metric. Given this complexity, building a machine learning model can be challenging for a couple of reasons:

1. **It’s difficult to keep track of experiments.** When you are just working with files on your laptop, or with an interactive notebook, how do you tell which data, code and parameters went into getting a particular result?
2. **It’s difficult to reproduce code.** Even if you have meticulously tracked the code versions and parameters, you need to capture the whole environment (for example, library dependencies) to get the same result again. This is especially challenging if you want another data scientist to use your code, or if you want to run the same code at scale on another platform (for example, in the cloud).

### Solution that MLflow Tracking provides

MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and artifacts when running your machine learning code and for later visualizing the results. You can use MLflow Tracking in any environment (for example, a standalone script or a notebook) to log results to local files or to a server, then compare multiple runs.

### How to install MLflow

Install MLflow by running *"pip install mlflow"* from the command line. See the Quick Start Guide for more details: https://mlflow.org/docs/latest/quickstart.html

### Viewing the MLflow Tracking UI

By default, wherever you run your program (a Jupyter Notebook in this case), the tracking API writes data to files in a local ./mlruns directory. To view that data, open a terminal, cd into the folder where this notebook is stored, and start the UI by running *"mlflow ui"*. You can then open MLflow's Tracking UI in a browser: http://localhost:5000/#/
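To get a feel for what ends up in ./mlruns, the sketch below walks a local store and lists the run folders under each experiment folder. It assumes the layout `mlruns/<experiment_id>/<run_id>/`, which matches the artifact paths printed later in this notebook. The helper name `list_runs` and the temporary fake store are illustrative only and are not part of MLflow.

```python
import tempfile
from pathlib import Path


def list_runs(mlruns_dir):
    """Return {experiment_id: [run_id, ...]} for a local mlruns directory.

    Assumes the layout <mlruns_dir>/<experiment_id>/<run_id>/ and skips
    hidden entries such as .trash.
    """
    runs = {}
    for exp_dir in sorted(Path(mlruns_dir).iterdir()):
        if not exp_dir.is_dir() or exp_dir.name.startswith("."):
            continue
        # Each sub-directory of an experiment folder is treated as one run
        runs[exp_dir.name] = sorted(p.name for p in exp_dir.iterdir() if p.is_dir())
    return runs


# Build a tiny fake store so the sketch runs without MLflow installed
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "mlruns"
    (root / "0").mkdir(parents=True)
    (root / "6" / "86e0d3b013c242a9a223e742583998ab").mkdir(parents=True)
    (root / ".trash").mkdir()
    layout = list_runs(root)

print(layout)
```

Against a real store, you would call `list_runs("./mlruns")` from the folder where this notebook is stored.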

### How to cd into a folder

 - **Mac**: https://macpaw.com/how-to/use-terminal-on-mac
 - **Windows**: https://www.minitool.com/news/how-to-change-directory-in-cmd.html


### Import dependencies

In [2]:
import os
import warnings
import sys

# MLflow libraries
import mlflow

# MLflow client
from mlflow.tracking import MlflowClient
client = MlflowClient()

# Numpy for random number generator
import numpy as np

In [3]:
# Get info about your environment
mlflow.spark.get_default_conda_env()


{'name': 'mlflow-env',
 'channels': ['defaults'],
 'dependencies': ['python=3.7.4', 'pyspark=2.4.4', 'pip', {'pip': ['mlflow']}]}

### Managing Experiments and Runs with the Tracking Service API

https://mlflow.org/docs/latest/tracking.html#managing-experiments-and-runs-with-the-tracking-service-api

### Organizing Runs in Experiments

https://mlflow.org/docs/latest/tracking.html#organizing-runs-in-experiments

In [4]:
# Create an experiment (this raises an error if the name already exists)
exp_id = mlflow.create_experiment("Experiment-4")
# mlflow.create_experiment("Experiment-0")
exp_id

'6'

In [5]:
# Set the active experiment
# This automatically creates the experiment if the name you pass doesn't exist yet
mlflow.set_experiment(experiment_name="Experiment-4")

In [6]:
# Set up your client and get the list of experiments
from mlflow.tracking import MlflowClient
client = MlflowClient()
# Returns a list of mlflow.entities.Experiment
# (list_experiments() was removed in MLflow 2.x; use client.search_experiments() there)
experiments = client.list_experiments()
for x in experiments:
    # print(x.name)
    print(x)
    print(" ")

<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/0', experiment_id='0', lifecycle_stage='active', name='Default', tags={}>
 
<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/6', experiment_id='6', lifecycle_stage='active', name='Experiment-4', tags={}>
 
<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/1', experiment_id='1', lifecycle_stage='active', name='first-experiment', tags={'mlflow.note.content': 'This experiment tested various tree classifiers '
                        'across various parameters. The training process used '
                        'a parameter grid search technique for Hyperparameter '
                        'optimization. It was a real su

In [61]:
# You can retrieve any of the elements from experiments that you need
print("Full Description: ",experiments[2])
print(" ")
print("Name: ",experiments[2].name)
print("ID: ",experiments[2].experiment_id)
print("Tags: ",experiments[2].tags)

Full Description:  <Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/3', experiment_id='3', lifecycle_stage='active', name='Experiment-1', tags={}>
 
Name:  Experiment-1
ID:  3
Tags:  {}
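Indexing by position (`experiments[2]`) depends on the order the list happens to come back in. A minimal sketch of looking records up by exact name instead is shown below. The `SimpleNamespace` objects are hypothetical stand-ins that only mimic the `name`, `experiment_id`, and `tags` attributes printed above, and `index_by_name` is an illustrative helper, not an MLflow function.

```python
from types import SimpleNamespace

# Stand-in records with the same attribute names the notebook prints;
# in the notebook itself you would pass the real `experiments` list instead.
experiments_demo = [
    SimpleNamespace(name="Default", experiment_id="0", tags={}),
    SimpleNamespace(name="Experiment-4", experiment_id="6", tags={}),
    SimpleNamespace(name="Experiment-1", experiment_id="3", tags={}),
]


def index_by_name(items):
    """Map each experiment name to its record for exact-name lookup."""
    return {item.name: item for item in items}


by_name = index_by_name(experiments_demo)
print("ID: ", by_name["Experiment-4"].experiment_id)
```

Because the lookup uses the full name as the dictionary key, a partial name such as "Experiment" does not match anything.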


In [8]:
# Create a run and attach it to the experiment you just created,
# just to get the general concept down

experiement_name = 'Experiment-4'
def create_run(experiement_name):
    # Find the experiment whose name matches exactly and start a run under it
    for x in experiments:
        if x.name == experiement_name:
            return client.create_run(x.experiment_id)  # returns mlflow.entities.Run
    # No experiment with that name was found
    return None

# Example run:
run = create_run(experiement_name)
run

<Run: data=<RunData: metrics={}, params={}, tags={}>, info=<RunInfo: artifact_uri='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/6/86e0d3b013c242a9a223e742583998ab/artifacts', end_time=None, experiment_id='6', lifecycle_stage='active', run_id='86e0d3b013c242a9a223e742583998ab', run_uuid='86e0d3b013c242a9a223e742583998ab', start_time=1589220431918, status='RUNNING', user_id='unknown'>>
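The `start_time` in the output above is a count of milliseconds since the Unix epoch, so it is hard to read at a glance. A minimal sketch for converting it to a UTC timestamp follows. The helper name `ms_to_utc` is illustrative, not part of MLflow.

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


started = ms_to_utc(1589220431918)  # start_time copied from the run output above
print(started.isoformat())
```

In the notebook, the same helper could be applied to `run.info.start_time`.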

In [14]:
experiments[0]

<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/0', experiment_id='0', lifecycle_stage='active', name='Default', tags={}>
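The `artifact_location` values are `file://` URIs, so spaces in folder names appear as `%20`. If you want the plain folder path (for example, to open it in a file browser), one way to decode it with the standard library is sketched below. The helper name `file_uri_to_path` is illustrative, and it only handles local `file://` URIs.

```python
from urllib.parse import unquote, urlparse


def file_uri_to_path(uri):
    """Turn a file:// URI into a plain local path with %XX escapes decoded."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URI, got: {uri}")
    return unquote(parsed.path)


# artifact_location copied from the experiment output above
uri = (
    "file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/"
    "Jupyter%20Notebooks/Machine%20Learning/mlruns/0"
)
local_path = file_uri_to_path(uri)
print(local_path)
```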

**Conduct a run!**

In [36]:
# First create a new run if you haven't already
run = create_run(experiement_name)

random_num = np.random.randint(0,500)
binary = np.random.randint(0,2)
test_id = np.random.randint(0,3000)

# Add tags to the run
client.set_tag(run.info.run_id, "Test ID", test_id)

# Record a pass/fail result based on the random binary value
result = 'Pass' if binary == 0 else 'Fail'
client.set_tag(run.info.run_id, "Result", result)

client.set_tag(run.info.run_id, "Random Number", random_num)

# Mark the run as terminated (this ends the run, not the client)
client.set_terminated(run.info.run_id)