# MLflow Tracking Example

MLflow is organized into four components: **Tracking**, **Projects**, **Models**, and **Model Registry**. You can use each component on its own (for example, you might export models in MLflow's model format without using Tracking or Projects), but they are also designed to work well together. This notebook focuses only on the **Tracking** component within the PySpark environment.

### Why is tracking useful/important?

Machine learning typically requires experimenting with a diverse set of hyperparameter tuning techniques, data preparation steps, and algorithms to build a model that maximizes some target metric. This complexity makes building a machine learning model challenging for two reasons:

1. **It’s difficult to keep track of experiments.** When you are just working with files on your laptop, or with an interactive notebook, how do you tell which data, code and parameters went into getting a particular result?
2. **It’s difficult to reproduce code.** Even if you have meticulously tracked the code versions and parameters, you need to capture the whole environment (for example, library dependencies) to get the same result again. This is especially challenging if you want another data scientist to use your code, or if you want to run the same code at scale on another platform (for example, in the cloud).

### Solution that MLflow Tracking provides

MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and artifacts when running your machine learning code and for later visualizing the results. You can use MLflow Tracking in any environment (for example, a standalone script or a notebook) to log results to local files or to a server, then compare multiple runs.

### How to install MLflow

Install MLflow by running `pip install mlflow` from the command line. See the Quick Start Guide for more details: https://mlflow.org/docs/latest/quickstart.html

### Viewing the MLflow Tracking UI

By default, wherever you run your program (a Jupyter Notebook in this case), the tracking API writes data to files in a local ./mlruns directory. To view the results, open a command line, cd into the folder where this notebook is stored, and start the tracking UI by running `mlflow ui`. You can then open MLflow's Tracking UI in a browser: http://localhost:5000/#/
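
To make the `./mlruns` folder less of a black box, here is a minimal sketch that reads one run's parameters and metrics straight from disk using only the standard library. It assumes the file-based store lays a run out as `mlruns/<experiment_id>/<run_id>/` with `params/` and `metrics/` subfolders holding one file per key, and that each metric line looks like `<timestamp> <value> <step>`. That layout is an assumption about the local file store and may differ between MLflow versions. The helper `read_run`, the run id `abc123`, and the demo values are hypothetical.

```python
import tempfile
from pathlib import Path


def read_run(run_dir):
    """Collect params and the latest value of each metric from one run folder."""
    run_dir = Path(run_dir)
    params = {}
    metrics = {}

    params_dir = run_dir / "params"
    if params_dir.is_dir():
        for path in sorted(params_dir.iterdir()):
            if path.is_file():
                params[path.name] = path.read_text().strip()

    metrics_dir = run_dir / "metrics"
    if metrics_dir.is_dir():
        for path in sorted(metrics_dir.iterdir()):
            if path.is_file():
                lines = path.read_text().splitlines()
                if lines:
                    # Assumed line format: "<timestamp> <value> <step>"
                    metrics[path.name] = float(lines[-1].split()[1])

    return {"params": params, "metrics": metrics}


# Build a throwaway folder that mimics the assumed layout (hypothetical values)
demo_root = Path(tempfile.mkdtemp()) / "mlruns"
demo_run = demo_root / "0" / "abc123"
(demo_run / "params").mkdir(parents=True)
(demo_run / "metrics").mkdir(parents=True)
(demo_run / "params" / "Max Depth").write_text("100")
(demo_run / "metrics" / "Accuracy").write_text("1580000000000 0.87 0\n")

summary = read_run(demo_run)
print(summary)
```

In practice the Tracking UI or the client API is the more robust way to read results. The sketch is only meant to show that the default store is plain files.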

### How to cd into a folder

 - **Mac**: https://macpaw.com/how-to/use-terminal-on-mac
 - **Windows**: https://www.minitool.com/news/how-to-change-directory-in-cmd.html
 
### Getting started

Since we are using PySpark for this example, we first need to create a Spark session.

In [1]:
# import findspark
# findspark.init()

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
# Notice that we call our Spark session ss instead of spark
# This is because the name spark clashes with the mlflow.spark module we import below (tried to rename it but that didn't work)
ss = SparkSession.builder.appName("Mlflow").getOrCreate()
ss

### Import dependencies

In [2]:
import os
import warnings
import sys

# PySpark modeling libraries
from pyspark.ml.classification import *
from pyspark.ml.evaluation import *
from pyspark.ml.linalg import Vectors

# MLflow libraries
import mlflow
from mlflow import spark

# MLflow client
from mlflow.tracking import MlflowClient
client = MlflowClient()

# Numpy for random number generator
import numpy as np

In [4]:
# Get info about your environment
mlflow.spark.get_default_conda_env()

{'name': 'mlflow-env',
 'channels': ['defaults'],
 'dependencies': ['python=3.7.4', 'pyspark=2.4.4', 'pip', {'pip': ['mlflow']}]}

## Train & Eval Model, then Log Results to MLflow

### Mexican Vanilla (a bit of flavor added)
 - **Without** a cross validator
 - **With** the client, which allows you to manage experiments and runs

### Managing Experiments and Runs with the Tracking Service API

https://mlflow.org/docs/latest/tracking.html#managing-experiments-and-runs-with-the-tracking-service-api

### Organizing Runs in Experiments

https://mlflow.org/docs/latest/tracking.html#organizing-runs-in-experiments

In [7]:
# Create experiment (raises an error if an experiment with this name already exists)
exp_id = mlflow.create_experiment("Experiment-0")
exp_id

'2'

In [3]:
# Set the active experiment
# This automatically creates the experiment if the name doesn't exist yet
mlflow.set_experiment(experiment_name="Experiment-1")
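
Rerunning the `mlflow.create_experiment` cell raises an error once the name is taken, so a notebook that is executed more than once may want a "get or create" step. The sketch below shows one way to do that with the client methods `get_experiment_by_name` and `create_experiment`. The helper name `get_or_create_experiment` and the `InMemoryClient` class are hypothetical. The latter is only a stand-in so the example runs without a tracking store.

```python
from types import SimpleNamespace


def get_or_create_experiment(client, name):
    """Return the id of the experiment called `name`, creating it if needed."""
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    return client.create_experiment(name)


class InMemoryClient:
    """Hypothetical stand-in that mimics the two client methods used above."""

    def __init__(self):
        self._by_name = {}

    def get_experiment_by_name(self, name):
        # Returns None when the name is unknown
        return self._by_name.get(name)

    def create_experiment(self, name):
        experiment_id = str(len(self._by_name))
        self._by_name[name] = SimpleNamespace(name=name, experiment_id=experiment_id)
        return experiment_id


fake_client = InMemoryClient()
first_id = get_or_create_experiment(fake_client, "Experiment-0")
second_id = get_or_create_experiment(fake_client, "Experiment-0")  # no error on the second call
print(first_id, second_id)
```

With the real client this would be called as `get_or_create_experiment(MlflowClient(), "Experiment-0")`.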

In [4]:
# Set up your client and get the list of experiments
# Note: list_experiments() was removed in MLflow 2.0; on newer versions use client.search_experiments()
from mlflow.tracking import MlflowClient
client = MlflowClient()
experiments = client.list_experiments() # returns a list of mlflow.entities.Experiment
for x in experiments:
    # print(x.name)
    print(x)
    print(" ")

<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/0', experiment_id='0', lifecycle_stage='active', name='Default', tags={}>
 
<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/1', experiment_id='1', lifecycle_stage='active', name='first-experiment', tags={'mlflow.note.content': 'This experiment tested various tree classifiers '
                        'across various parameters. The training process used '
                        'a parameter grid search technique for Hyperparameter '
                        'optimization. It was a real success!'}>
 
<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/4', experiment_id='4', lifecycle_stage='active', name='Experiment-2',

In [61]:
# You can retrieve any of the elements from experiments that you need....
print("Full Description: ",experiments[2])
print(" ")
print("Name: ",experiments[2].name)
print("ID: ",experiments[2].experiment_id)
print("Tags: ",experiments[2].tags)

Full Description:  <Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/3', experiment_id='3', lifecycle_stage='active', name='Experiment-1', tags={}>
 
Name:  Experiment-1
ID:  3
Tags:  {}


In [5]:
# Create a run and attach it to the experiment you just created
# Just to get the general concept down
for x in experiments:
    if x.name == 'Experiment-1':
        run = client.create_run(x.experiment_id) # returns mlflow.entities.Run
        break

**Conduct a run!**

In [7]:
# Normally you would have some ML stuff here

# Add tag to a run
client.set_tag(run.info.run_id, "Algorithm", "Gradient Boosted Tree")
client.set_tag(run.info.run_id,"Random Seed",900)
client.set_tag(run.info.run_id,"Train Perct",0.8)

# Add params and metrics to a run
client.log_param(run.info.run_id, "Max Depth", 100)
client.log_param(run.info.run_id, "Max Bins", 50)
client.log_metric(run.info.run_id, "Accuracy", 0.87)

# Terminate the run
client.set_terminated(run.info.run_id)
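
A run created with `client.create_run` stays open until `set_terminated` is called, which is easy to forget when an exception interrupts the cell. One possible pattern is a small context manager that always records how the run ended. In this sketch, `client_run` and `RecordingClient` are hypothetical names. The stand-in client only mimics `create_run` and `set_terminated` so the example runs without a tracking store.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def client_run(client, experiment_id):
    """Create a run, yield its id, and always record how it ended."""
    run = client.create_run(experiment_id)
    run_id = run.info.run_id
    try:
        yield run_id
    except Exception:
        client.set_terminated(run_id, status="FAILED")
        raise
    else:
        client.set_terminated(run_id, status="FINISHED")


class RecordingClient:
    """Hypothetical stand-in that mimics the client methods used above."""

    def __init__(self):
        self.statuses = {}

    def create_run(self, experiment_id):
        run_id = "run-{}".format(len(self.statuses))
        self.statuses[run_id] = "RUNNING"
        return SimpleNamespace(info=SimpleNamespace(run_id=run_id))

    def set_terminated(self, run_id, status=None):
        self.statuses[run_id] = status


recording_client = RecordingClient()

with client_run(recording_client, "3") as ok_id:
    pass  # log tags, params and metrics here

try:
    with client_run(recording_client, "3") as bad_id:
        raise ValueError("training failed")
except ValueError:
    pass

print(recording_client.statuses)
```

Passing a real `MlflowClient()` instead of the stand-in should behave the same way, since only `create_run` and `set_terminated` are used.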

### Full Script

In [9]:
if __name__ == "__main__":
    warnings.filterwarnings("ignore")

    # Build a small Spark DataFrame of (label, features) rows
    data = ss.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["label", "features"])

    # Split the data into training and test sets. (0.7, 0.3) split.
    train_val = 0.7
    test_val = 1-train_val
    seed=40
    np.random.seed(seed)
    train,test = data.randomSplit([train_val,test_val],seed=seed)

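    # mlflow.start_run() opens a run through the fluent API (in the active experiment)
    with mlflow.start_run():
        
        # client.create_run() creates a second, separate run attached to an experiment (using the id)
        run = client.create_run(experiments[2].experiment_id) # returns mlflow.entities.Run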
        # Instantiate Classifier
        classifier = DecisionTreeClassifier(maxDepth=5, maxBins=32)
        # Extract the name of the classifier
        classifier_name = type(classifier).__name__
        
        # Add tags to the run
        client.set_tag(run.info.run_id, "Algorithm", classifier_name)   
        client.set_tag(run.info.run_id,"Random Seed",seed)
        client.set_tag(run.info.run_id,"Train Perct",train_val)
        
        fitModel = classifier.fit(train)
        
        # Evaluate
        predictions = fitModel.transform(test)
        
        # Calculate Accuracy
        MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
        accuracy = (MC_evaluator.evaluate(predictions))*100
        print(accuracy)

        # Log the metric to the run opened by mlflow.start_run()
        mlflow.log_metric("accuracy", accuracy)
    
        mlflow.spark.log_model(fitModel, "model")

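        # Add params and metrics to the run created with the client
        # (read the values from the classifier so they match what was actually trained)
        client.log_param(run.info.run_id, "Max Depth", classifier.getOrDefault(classifier.maxDepth))
        client.log_param(run.info.run_id, "Max Bins", classifier.getOrDefault(classifier.maxBins))
        client.log_metric(run.info.run_id, "accuracy", accuracy)

        # Mark the run created with the client as finished
        client.set_terminated(run.info.run_id)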

25.0


In [212]:
### Print all the parameter names (keys in a dict) and their values
paramMap = fitModel.extractParamMap()
for k, v in paramMap.items():
    print("Key: ",k, ": ", v)

Key:  DecisionTreeClassifier_c86026698e30__cacheNodeIds :  False
Key:  DecisionTreeClassifier_c86026698e30__checkpointInterval :  10
Key:  DecisionTreeClassifier_c86026698e30__featuresCol :  features
Key:  DecisionTreeClassifier_c86026698e30__impurity :  gini
Key:  DecisionTreeClassifier_c86026698e30__labelCol :  label
Key:  DecisionTreeClassifier_c86026698e30__maxBins :  32
Key:  DecisionTreeClassifier_c86026698e30__maxDepth :  5
Key:  DecisionTreeClassifier_c86026698e30__maxMemoryInMB :  256
Key:  DecisionTreeClassifier_c86026698e30__minInfoGain :  0.0
Key:  DecisionTreeClassifier_c86026698e30__minInstancesPerNode :  1
Key:  DecisionTreeClassifier_c86026698e30__predictionCol :  prediction
Key:  DecisionTreeClassifier_c86026698e30__probabilityCol :  probability
Key:  DecisionTreeClassifier_c86026698e30__rawPredictionCol :  rawPrediction
Key:  DecisionTreeClassifier_c86026698e30__seed :  -8543656710325985320


In [213]:
##### Search for a specific Param by key word

paramMap = fitModel.extractParamMap()
# Loop over items() and do a substring match
# of the search key against each Param's name
search_key = 'maxBins' 

for key, val in paramMap.items():
    if search_key in key.name:
        print(val)

32
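
The keyword search above can be generalized into a helper that flattens a parameter map into plain `{name: value}` pairs, which is a convenient shape for logging. This is a minimal sketch: `params_by_name` is a hypothetical helper, and the `Param` namedtuple is only a stand-in for `pyspark.ml.param.Param` (which also exposes a `name` attribute) so the example runs without Spark.

```python
from collections import namedtuple

# Stand-in for pyspark.ml.param.Param, which also exposes a `name` attribute
Param = namedtuple("Param", ["parent", "name"])


def params_by_name(param_map, wanted=None):
    """Flatten {Param: value} into {name: value}, optionally keeping only `wanted` names."""
    return {
        param.name: value
        for param, value in param_map.items()
        if wanted is None or param.name in wanted
    }


# Hypothetical values shaped like the output of extractParamMap()
demo_param_map = {
    Param("DecisionTreeClassifier_demo", "maxBins"): 32,
    Param("DecisionTreeClassifier_demo", "maxDepth"): 5,
    Param("DecisionTreeClassifier_demo", "impurity"): "gini",
}

selected = params_by_name(demo_param_map, wanted={"maxDepth", "maxBins"})
print(selected)
```

With the real objects, each pair from `params_by_name(fitModel.extractParamMap(), {"maxDepth", "maxBins"})` could be passed to `client.log_param(run.info.run_id, name, value)`, so the logged values always match the fitted model.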


## Train & Eval Model, then Log Results to MLflow

### Rocky Road (most flavorful option)
 - **With** a cross validator
 - **With** the client, which allows you to manage experiments and runs

In [66]:
# Question
# If you use the client, do you have to call mlflow.start_run()?
# Test here
# Test Result: you do not need to start a run with this method

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

data = ss.createDataFrame([
(0, Vectors.dense([1.0, 0.1, -1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(2, Vectors.dense([2.0, 1.1, 1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(2, Vectors.dense([2.0, 1.1, 1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(2, Vectors.dense([2.0, 1.1, 1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(1, Vectors.dense([2.0, 1.1, 1.0]),),
(2, Vectors.dense([2.0, 1.1, 1.0]),),
(2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["label", "features"])

# Split the data into training and test sets. (0.7, 0.3) split.
seed = 60
train_val = 0.7
test_val = 1-train_val
train,test = data.randomSplit([train_val,test_val],seed=seed)

# Create a run and attach it to an experiment (using the id) and add tags
run = client.create_run(experiments[0].experiment_id) # returns mlflow.entities.Run
# Instantiate Classifier
classifier = DecisionTreeClassifier()
# Create a parameter grid to search
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.maxDepth, [2, 5, 10, 20, 30])
             .addGrid(classifier.maxBins, [10, 20, 40, 80, 100])
             .build())

# Cross validator: takes the estimator, the parameter grid and an evaluator
# (MulticlassClassificationEvaluator() uses f1 as its default metric)
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          seed=seed, # same as the one set above
                          numFolds=2) # 3 or more is best practice

# Fit Model: Run cross-validation, and choose the best set of parameters.
fitModel = crossval.fit(train)

# Evaluate
predictions = fitModel.transform(test)

# Calculate Accuracy
MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
accuracy = (MC_evaluator.evaluate(predictions))*100
print(accuracy)

########### Track results in MLflow UI ################

# Add tag to a run
# Extract the name of the classifier
classifier_name = type(classifier).__name__
client.set_tag(run.info.run_id, "Algorithm", classifier_name) 
client.set_tag(run.info.run_id,"Random Seed",seed)
client.set_tag(run.info.run_id,"Train Perct",train_val)

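# Log model (this goes through the fluent API rather than the client, so it is
# stored under the fluent API's active run, which is started automatically if none is active)
mlflow.spark.log_model(fitModel, "model")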

# Extract params of Best Model
BestModel = fitModel.bestModel
paramMap = BestModel.extractParamMap()

# Log parameters of the best model to the client
for key, val in paramMap.items():
    if key.name == 'maxDepth':
        client.log_param(run.info.run_id, "Max Depth", val)
    elif key.name == 'maxBins':
        client.log_param(run.info.run_id, "Max Bins", val)

# Log metrics to the client
client.log_metric(run.info.run_id, "accuracy", accuracy)

# Mark the run as finished
client.set_terminated(run.info.run_id)

33.33333333333333
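
The cost of the grid search above grows with the product of the grid sizes and the number of folds. The short sketch below works through that arithmetic for the values used in the cell. It uses plain Python rather than Spark, and the variable names are illustrative only.

```python
from itertools import product

max_depth_values = [2, 5, 10, 20, 30]
max_bins_values = [10, 20, 40, 80, 100]
num_folds = 2

# Every (maxDepth, maxBins) pair is one candidate parameter setting
grid = list(product(max_depth_values, max_bins_values))

# Each candidate is fitted once per fold, then the best one is refit on all training data
fits_during_search = len(grid) * num_folds
total_fits = fits_during_search + 1

print(len(grid), fits_during_search, total_fits)  # 25 50 51
```

Raising `numFolds` to 3 would bring the search to 75 fits plus the final refit, which is worth keeping in mind before widening the grid.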


## More Resources

**Scikit Learn Example:** https://github.com/mlflow/mlflow-example

**Link to MLflow Spark Documentation:** https://mlflow.org/docs/latest/python_api/mlflow.spark.html

In [14]:
experiments[0]

<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/0', experiment_id='0', lifecycle_stage='active', name='Default', tags={}>

## Without the MLflow start run bit