# MLflow Tracking Example

MLflow is organized into four components: **Tracking**, **Projects**, **Models**, and **Model Registry**. You can use each component on its own (for example, you might export models in MLflow’s model format without using Tracking or Projects), but they are also designed to work well together. This notebook focuses only on the **Tracking** component within a PySpark environment.

### Why is tracking useful/important?

Machine learning typically requires experimenting with a diverse set of hyperparameter tuning techniques, data preparation steps, and algorithms to build a model that maximizes some target metric. Given this complexity, building a machine learning model can be challenging for a couple of reasons:

1. **It’s difficult to keep track of experiments.** When you are just working with files on your laptop, or with an interactive notebook, how do you tell which data, code and parameters went into getting a particular result?
2. **It’s difficult to reproduce code.** Even if you have meticulously tracked the code versions and parameters, you need to capture the whole environment (for example, library dependencies) to get the same result again. This is especially challenging if you want another data scientist to use your code, or if you want to run the same code at scale on another platform (for example, in the cloud).

### Solution that MLflow Tracking provides

MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and artifacts when running your machine learning code and for later visualizing the results. You can use MLflow Tracking in any environment (for example, a standalone script or a notebook) to log results to local files or to a server, then compare multiple runs.

### How to install MLflow

Install MLflow by running `pip install mlflow` from the command line. See the Quick Start Guide for more details: https://mlflow.org/docs/latest/quickstart.html

### Viewing the MLflow Tracking UI

By default, wherever you run your program (a Jupyter Notebook in this case), the tracking API writes data to files in a local ./mlruns directory. To view the results, open a command line, cd into the folder where this notebook is stored, and start the tracking UI by running `mlflow ui`. You can then open the UI in your browser: http://localhost:5000/#/

### How to cd into a folder

 - **Mac**: https://macpaw.com/how-to/use-terminal-on-mac
 - **Windows**: https://www.minitool.com/news/how-to-change-directory-in-cmd.html
 
### Getting started

Since we are using PySpark for this example, we first need to create a Spark session.

In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Mlflow").getOrCreate()
spark

### Import dependencies

In [5]:
# PySpark modeling libraries
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors

# MLflow libraries
import mlflow

# MLflow client
from mlflow.tracking import MlflowClient
client = MlflowClient()

### Create a fake dataset

In [6]:
df = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, -1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (1, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([2.0, 1.1, 1.0]),),
    (2, Vectors.dense([3.0, 10.1, 3.0]),)
], ["label", "features"])

In [7]:
# Split the data into training and test sets (0.7 / 0.3 split).
train_val = 0.7
test_val = 1-train_val
seed=40
train,test = df.randomSplit([train_val,test_val],seed=seed)
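
The split weights above deserve a quick sanity check. `randomSplit` normalizes its weights if they do not sum to 1 and samples rows randomly, so the actual split sizes only approximate the requested fractions. The following plain-Python sketch (no Spark required) works through the arithmetic for the 14-row fake dataset; the variable names `normalized` and `expected_rows` are introduced here for illustration.

```python
train_val = 0.7
test_val = 1 - train_val  # not exactly 0.3 because of floating-point rounding
weights = [train_val, test_val]

# Mirror the normalization step: divide each weight by the total
total = sum(weights)
normalized = [w / total for w in weights]

n_rows = 14  # number of rows in the fake dataset above
expected_rows = [round(n_rows * w, 1) for w in normalized]
print(expected_rows)
```

With only about four rows expected in the test set, a single prediction shifts accuracy by roughly 25 points, so metrics on this toy dataset should be read as a demonstration of logging rather than as a measure of model quality.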

In [19]:
# Get list of experiments and print them
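# Note: list_experiments() was removed in MLflow 2.x; use client.search_experiments() there
experiments = client.list_experiments() # returns a list of mlflow.entities.Experiment
for x in experiments:
    # print(x.name)  # uncomment to print only the experiment name
    print(x)
    print(" ")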

<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/0', experiment_id='0', lifecycle_stage='active', name='Default', tags={}>
 
<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/7', experiment_id='7', lifecycle_stage='active', name='Tree-Algorithms', tags={}>
 
<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/6', experiment_id='6', lifecycle_stage='active', name='Experiment-4', tags={}>
 
<Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/1', experiment_id='1', lifecycle_stage='active', name='first-experiment', tags={'mlflow.note.content': 'This experiment tested 

### Managing Experiments and Runs with the Tracking Service API

https://mlflow.org/docs/latest/tracking.html#managing-experiments-and-runs-with-the-tracking-service-api

### Organizing Runs in Experiments

https://mlflow.org/docs/latest/tracking.html#organizing-runs-in-experiments

In [9]:
# You can retrieve any of the elements from experiments that you need
print("Full Description: ",experiments[2])
print(" ")
print("Name: ",experiments[2].name)
print("ID: ",experiments[2].experiment_id)
print("Tags: ",experiments[2].tags)

Full Description:  <Experiment: artifact_location='file:///Users/orcuncanlilar/Eden/PySpark%20Essentials%20for%20Data%20Scientists/Jupyter%20Notebooks/Machine%20Learning/mlruns/1', experiment_id='1', lifecycle_stage='active', name='first-experiment', tags={'mlflow.note.content': 'This experiment tested various tree classifiers '
                        'across various parameters. The training process used '
                        'a parameter grid search technique for Hyperparameter '
                        'optimization. It was a real success!'}>
 
Name:  first-experiment
ID:  1
Tags:  {'mlflow.note.content': 'This experiment tested various tree classifiers across various parameters. The training process used a parameter grid search technique for Hyperparameter optimization. It was a real success!'}


In [20]:
# Set an experiment name and tie it to a create-run function
experimentName = "Tree-Algorithms"
def create_run(experimentName):
    # set_experiment creates the experiment if it does not exist yet
    mlflow.set_experiment(experiment_name = experimentName)
    # Look the experiment up by its exact name (avoids substring matches and stale lists)
    experiment = client.get_experiment_by_name(experimentName)
    run = client.create_run(experiment.experiment_id) # returns mlflow.entities.Run
    return run

**Train and Evaluate your model**

In [11]:
# Instantiate Classifier
classifier = DecisionTreeClassifier(maxDepth=5, maxBins=32)
fitModel = classifier.fit(train)

# Evaluate
predictions = fitModel.transform(test)

# Calculate Accuracy
MC_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
accuracy = (MC_evaluator.evaluate(predictions))*100
print(accuracy)

25.0


**Log results to MLflow**

In [21]:
run = create_run(experimentName)

# Add tags to the run
classifier_name = type(classifier).__name__ # Extract the name of the classifier
client.set_tag(run.info.run_id, "Algorithm", classifier_name)
client.set_tag(run.info.run_id, "Random Seed", seed)
client.set_tag(run.info.run_id, "Train Perct", train_val)

# Log metrics to the run
client.log_metric(run.info.run_id, "accuracy", accuracy)

# Log parameters to the run
paramMap = fitModel.extractParamMap() # must be defined before the loop below
for key, val in paramMap.items():
    if 'maxDepth' in key.name:
        client.log_param(run.info.run_id, "Max Depth", val)
    if 'maxBins' in key.name:
        client.log_param(run.info.run_id, "Max Bins", val)

# Mark the run as finished; runs made with client.create_run otherwise stay in RUNNING status
client.set_terminated(run.info.run_id)

In [14]:
type(classifier_name)

str

In [212]:
# See all params available to you
paramMap = fitModel.extractParamMap()
for k, v in paramMap.items():
    print("Key: ",k, ": ", v)

Key:  DecisionTreeClassifier_c86026698e30__cacheNodeIds :  False
Key:  DecisionTreeClassifier_c86026698e30__checkpointInterval :  10
Key:  DecisionTreeClassifier_c86026698e30__featuresCol :  features
Key:  DecisionTreeClassifier_c86026698e30__impurity :  gini
Key:  DecisionTreeClassifier_c86026698e30__labelCol :  label
Key:  DecisionTreeClassifier_c86026698e30__maxBins :  32
Key:  DecisionTreeClassifier_c86026698e30__maxDepth :  5
Key:  DecisionTreeClassifier_c86026698e30__maxMemoryInMB :  256
Key:  DecisionTreeClassifier_c86026698e30__minInfoGain :  0.0
Key:  DecisionTreeClassifier_c86026698e30__minInstancesPerNode :  1
Key:  DecisionTreeClassifier_c86026698e30__predictionCol :  prediction
Key:  DecisionTreeClassifier_c86026698e30__probabilityCol :  probability
Key:  DecisionTreeClassifier_c86026698e30__rawPredictionCol :  rawPrediction
Key:  DecisionTreeClassifier_c86026698e30__seed :  -8543656710325985320
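
Since each key in the param map exposes a `.name` attribute, the parameters of interest can be collected into a plain dictionary in one pass. The sketch below is one way to do that. The helper `params_by_name` and the `FakeParam` stand-in are hypothetical names introduced here so the example runs without Spark; with the real notebook objects you would pass `paramMap` instead of `fake_param_map`.

```python
from collections import namedtuple

def params_by_name(param_map, wanted=None):
    """Return {param name: value}, optionally restricted to the names in `wanted`."""
    return {
        key.name: value
        for key, value in param_map.items()
        if wanted is None or key.name in wanted
    }

# Hypothetical stand-in for the Param keys: only the `.name` attribute is needed
FakeParam = namedtuple("FakeParam", ["name"])

# Values copied from the printed param map above
fake_param_map = {
    FakeParam("maxBins"): 32,
    FakeParam("maxDepth"): 5,
    FakeParam("impurity"): "gini",
}

selected = params_by_name(fake_param_map, wanted={"maxDepth", "maxBins"})
print(selected)
```

Each entry of the resulting dictionary could then be passed to `client.log_param(run.info.run_id, name, value)` in a single loop, which avoids one loop per parameter.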
