# MLflow Tracking: H2O example (remote server w/ Minio)
- The MLflow Tracking component is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results.
- This example trains an H2O binary classification model on a public dataset.
- MLflow Tracking can be done locally or using a remote server.
    - this notebook will use a `remote server` for tracking
    - with a remote server, MLflow stores artifacts in bucket storage
        - a Minio server provides the bucket for artifacts (plots, images, log files, or any other unstructured data)
- set the following environment variables for Minio bucket storage before running the notebook
    - export MLFLOW_URL=mlflow_url
    - export MLFLOW_S3_ENDPOINT_URL=minio_url
    - export AWS_ACCESS_KEY_ID=minio_access_key
    - export AWS_SECRET_ACCESS_KEY=minio_secret_key   
- this notebook was tested on Windows Subsystem for Linux
- references:
    - h2o: https://www.h2o.ai/
    - mlflow: https://mlflow.org/
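
Because every later cell depends on these four variables, a quick check at the top of the notebook can catch a missing one before any run starts. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and is not part of MLflow or Minio.

```python
import os

# Variables this notebook expects (taken from the list above).
REQUIRED_VARS = [
    "MLFLOW_URL",
    "MLFLOW_S3_ENDPOINT_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]

def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    return [name for name in required if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Set these variables before running the notebook:", ", ".join(missing))
```

Passing `environ` explicitly makes the check easy to try against a plain dictionary without touching the real environment.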

## Import Dependencies

In [1]:
import h2o
import mlflow
import mlflow.h2o
import mlflow.server
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.automl import H2OAutoML
import numpy as np
import os.path
import time
import matplotlib.pyplot as plt
import itertools
import json
import os

%matplotlib inline

## Initialize H2O

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_191"; OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12); OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from /home/sean/miniconda3/envs/mlflow_h2o/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmphwh0b7tt
  JVM stdout: /tmp/tmphwh0b7tt/h2o_sean_started_from_python.out
  JVM stderr: /tmp/tmphwh0b7tt/h2o_sean_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime:,04 secs
H2O cluster timezone:,Etc/GMT
H2O data parsing timezone:,UTC
H2O cluster version:,3.22.1.4
H2O cluster version age:,6 days
H2O cluster name:,H2O_from_python_sean_j55rgh
H2O cluster total nodes:,1
H2O cluster free memory:,3.531 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,1


## Import dataset
- predict whether a patient has diabetes
- response column: Outcome
- feature columns: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age

In [3]:
hf = h2o.import_file(path="https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
hf.head(5)

Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1




## Add more descriptive response labels

In [5]:
hf['OutcomeClass'] = 'Sick'
mask = hf['Outcome'] == 0
hf[mask,'OutcomeClass'] = 'NotSick'
hf['OutcomeClass'] = hf['OutcomeClass'].asfactor() # make sure the column type is categorical
hf['Outcome'] = hf['Outcome'].asfactor()

hf.head(5)

Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,OutcomeClass
6,148,72,35,0,33.6,0.627,50,1,Sick
1,85,66,29,0,26.6,0.351,31,0,NotSick
8,183,64,0,0,23.3,0.672,32,1,Sick
1,89,66,23,94,28.1,0.167,21,0,NotSick
0,137,40,35,168,43.1,2.288,33,1,Sick




## Check class balance

In [6]:
hf.group_by('OutcomeClass').count().get_frame()

OutcomeClass,nrow
NotSick,500
Sick,268
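
The counts above can be turned into class shares to see how imbalanced the data is. This is a small sketch in plain Python using the two counts from the output; the majority-class baseline is added here as a reference point and is not computed anywhere else in the notebook.

```python
# Class counts copied from the group_by output above.
counts = {"NotSick": 500, "Sick": 268}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")

# Accuracy obtained by always predicting the majority class;
# a useful model should do better than this.
majority_baseline = max(proportions.values())
```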




## Select features and response columns
- display the list of features

In [7]:
y = 'OutcomeClass'
X = hf.columns
X.remove(y)
X.remove('Outcome')

X

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

## Split data into test/train subsets

In [8]:
hf_train, hf_test = hf.split_frame(ratios = [0.8],seed=42)

# Now we'll do some modeling and track with MLflow

## Set mlflow url
- MLFLOW_URL env variable should be set before running notebook
- choose an experiment name and activate it with `mlflow.set_experiment()`

In [9]:
mlflow_url = os.environ['MLFLOW_URL'] 
mlflow.tracking.set_tracking_uri(mlflow_url)

mlflow_url

'http://127.0.0.1:5000'

## List existing experiments

In [10]:
mlflow.tracking.MlflowClient().list_experiments()

[<Experiment: experiment_id=0, name='h2o_diabetes', artifact_location='s3://mlflow/artifacts/0', lifecycle_stage='active'>,
 <Experiment: experiment_id=1, name='h2o_diabetes2', artifact_location='s3://mlflow/artifacts/1', lifecycle_stage='active'>]

## Create a new experiment
- this will generate an experiment ID

In [16]:
ex_id = mlflow.create_experiment(name='h2o_diabetes2')

ex_id

1

In [11]:
ex_id = 1
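
The experiment list above already contains `h2o_diabetes2`, and creating an experiment under an existing name typically raises an error, which is presumably why `ex_id` is set by hand here. One way to make the cell re-runnable is to look the experiment up by name first. The sketch below assumes an MLflow version whose `MlflowClient` provides `get_experiment_by_name` and `create_experiment`; the helper name `get_or_create_experiment` is hypothetical.

```python
def get_or_create_experiment(client, name):
    """Return the ID of the experiment called `name`, creating it if needed.

    `client` is assumed to behave like mlflow.tracking.MlflowClient.
    """
    experiment = client.get_experiment_by_name(name)
    if experiment is not None:
        return experiment.experiment_id
    return client.create_experiment(name)

# Usage in this notebook (requires a reachable tracking server):
# ex_id = get_or_create_experiment(mlflow.tracking.MlflowClient(), 'h2o_diabetes2')
```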

## We'll now train random forest models over a list of values for one hyperparameter
- hyperparameter: number of trees
- we'll log the number of trees, logloss, and AUC
- we'll also save each model

### Helper functions
- save scoring history, variable importance, ROC, and confusion matrix plots

In [12]:
def save_plot_scoring_history(model,image_name):
    df = model.scoring_history()
    plt.plot(df['number_of_trees'],df['training_logloss'])
    plt.plot(df['number_of_trees'],df['validation_logloss'])
    plt.xlabel('number of trees',fontsize=14)
    plt.ylabel('logloss',fontsize=14)
    plt.title('Scoring History',fontsize=18)
    plt.legend(['training','validation'])
    plt.grid()
    plt.savefig(image_name)
    plt.close()

def save_plot_varimp(model,image_name):
    plt.rcdefaults()
    fig, ax = plt.subplots()
    variables = model._model_json['output']['variable_importances']['variable']
    y_pos = np.arange(len(variables))
    scaled_importance = model._model_json['output']['variable_importances']['scaled_importance']
    ax.barh(y_pos, scaled_importance, align='center', color='blue', ecolor='black')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(variables)
    ax.invert_yaxis()
    ax.set_xlabel('Scaled Importance')
    ax.set_title('Variable Importance')
    fig.savefig(image_name)
    plt.close()

    
def save_plot_roc(model,image_name):
    perf = model.model_performance(valid=True) # roc for validation frame
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')
    plt.title('ROC Curve')
    plt.text(0.5, 0.5, r'AUC={0:.4f}'.format(perf._metric_json["AUC"]))
    plt.plot(perf.fprs, perf.tprs, 'b--')
    plt.legend(['validation'])
    plt.axis([0, 1, 0, 1])
    plt.grid()
    plt.savefig(image_name)
    plt.close()

    
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues,
                          image_name='confusion_matrix.png'):
    
    #reference: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    #else:
        #print('Confusion matrix, without normalization')

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
    plt.savefig(image_name)
    plt.close()
    


### create a folder for local artifact storage

In [13]:
os.makedirs("artifact_folder", exist_ok=True)  # no error if the folder already exists

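An alternative to a fixed folder with manual cleanup is a temporary directory that is removed automatically once the artifacts have been logged. The sketch below uses only the standard library; `write_artifacts` is a hypothetical helper, and `log_artifact` stands for any callable that accepts a file path, such as `mlflow.log_artifact`.

```python
import os
import tempfile

def write_artifacts(writers, log_artifact):
    """Write each artifact to a temporary directory, log it, then clean up.

    `writers` maps a file name to a function that takes a path and writes the file.
    `log_artifact` is any callable taking a path, e.g. mlflow.log_artifact.
    Returns the list of file names that were logged.
    """
    logged = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        for file_name, write in writers.items():
            path = os.path.join(tmp_dir, file_name)
            write(path)          # the file exists once this call returns
            log_artifact(path)   # must run before the directory is removed
            logged.append(file_name)
    return logged
```

Inside an active run, a call could look like `write_artifacts({'roc.png': lambda p: save_plot_roc(rf, p)}, mlflow.log_artifact)`.
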
### Random forest training function, parameterized by number of trees
- this is a modification of the example here: https://github.com/mlflow/mlflow/blob/master/examples/h2o/random_forest.ipynb

In [14]:
mlflow.start_run(experiment_id=ex_id)

<ActiveRun: info=<RunInfo: run_uuid='53bc1abb5edb4f43a1bd7e68191b7264', experiment_id=1, name='', source_type=4, source_name='/home/sean/miniconda3/envs/mlflow_h2o/lib/python3.6/site-packages/ipykernel_launcher.py', entry_point_name='', user_id='sean', status=1, start_time=1550855992337, end_time=None, source_version='', lifecycle_stage='active', artifact_uri='s3://mlflow/artifacts/1/53bc1abb5edb4f43a1bd7e68191b7264/artifacts'>, data=<RunData: metrics=[], params=[], tags=[]>>

In [16]:
mlflow.end_run()

In [17]:
file_path = os.path.join('artifact_folder', "some_output_file.txt")
with open(file_path, "w") as handle:
    handle.write('hello,world')
mlflow.log_artifacts('artifact_folder')
#log_artifact(file_path, "another_dir")

In [None]:
def trainRandomForest(ntrees):
    with mlflow.start_run(experiment_id=ex_id):
        rf = H2ORandomForestEstimator(ntrees=ntrees)
        rf.train(x=X,
                 y=y,
                 training_frame=hf_train,
                 validation_frame=hf_test)
        
        mlflow.log_param("ntrees", ntrees)
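        # rf.auc() and rf.logloss() report training metrics;
        # pass valid=True to log validation-frame metrics instead
        mlflow.log_metric("auc", rf.auc())
        mlflow.log_metric("logloss", rf.logloss())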
        
        mlflow.h2o.log_model(rf, "model")
        
        cnf_matrix = np.zeros((2, 2))
        cnf_matrix = cnf_matrix.astype('int')
        perf = rf.model_performance(valid=True) # performance metrics on the validation frame
        conf_list = perf.confusion_matrix().to_list()
        cnf_matrix[0,0] = conf_list[0][0]
        cnf_matrix[1,0] = conf_list[1][0]
        cnf_matrix[0,1] = conf_list[0][1]
        cnf_matrix[1,1] = conf_list[1][1]
        np.set_printoptions(precision=2)
        plot_confusion_matrix(cnf_matrix, classes = ['NotSick','Sick'],
                              title='Confusion matrix, without normalization',
                              image_name = 'artifact_folder/confusion_matrix.png')
        
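        # ratio of Sick cases predicted Sick to Sick cases predicted NotSick
        # (rows are true labels, columns are predicted labels, as in the plot);
        # skipped when there are no missed Sick cases, to avoid dividing by zero
        if cnf_matrix[1,0] > 0:
            catch_kill = cnf_matrix[1,1]/cnf_matrix[1,0]
            mlflow.log_metric("catch/kill",catch_kill)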
        
        while not os.path.exists('artifact_folder/confusion_matrix.png'):
            time.sleep(1)
        mlflow.log_artifact("artifact_folder/confusion_matrix.png")
        
        save_plot_scoring_history(rf,'artifact_folder/score_history.png')
        while not os.path.exists('artifact_folder/score_history.png'):
            time.sleep(1)
        mlflow.log_artifact("artifact_folder/score_history.png")
        
        save_plot_varimp(rf,'artifact_folder/varimp.png')
        while not os.path.exists('artifact_folder/varimp.png'):
            time.sleep(1)
        mlflow.log_artifact("artifact_folder/varimp.png")
        
        save_plot_roc(rf,'artifact_folder/roc.png')
        while not os.path.exists('artifact_folder/roc.png'):
            time.sleep(1)
        mlflow.log_artifact("artifact_folder/roc.png")
        
        # in this case we'll delete the local plots on each iteration
        os.remove("artifact_folder/score_history.png")
        os.remove("artifact_folder/varimp.png")
        os.remove("artifact_folder/roc.png")
        os.remove("artifact_folder/confusion_matrix.png")
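
The "catch/kill" metric logged above divides one confusion-matrix cell by another, so it is undefined when the denominator is zero. The sketch below shows the same ratio next to recall and precision for a 2x2 matrix, assuming rows are true labels and columns are predicted labels (the layout the confusion-matrix plot uses). The function name and the example counts are illustrative only and are not results from this notebook.

```python
def confusion_metrics(cm):
    """Derive rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm

    def ratio(num, den):
        # None marks an undefined ratio instead of dividing by zero.
        return num / den if den else None

    return {
        "recall": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        "tp_fn_ratio": ratio(tp, fn),  # the quantity logged as "catch/kill"
    }

# Illustrative counts only.
example = confusion_metrics([[80, 20], [10, 40]])
```

Recall carries the same information as the TP/FN ratio but stays between 0 and 1, which can make runs easier to compare in the MLflow UI.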

In [19]:
%env AWS_SECRET_ACCESS_KEY

UsageError: Environment does not have key: AWS_SECRET_ACCESS_KEY


## Train one model per value in a list of tree counts

In [None]:
for ntrees in [10, 20, 50, 100, 200]:
    trainRandomForest(ntrees)

## Open the MLflow UI
- open a browser to the URI provided below

In [None]:
mlflow.tracking.get_tracking_uri()

## If you need to make any additions to a run of an experiment:
- `mlflow.start_run(experiment_id=ex_id, run_uuid='')`
    - you can get the run_uuid from the MLflow UI
- make additions
- `mlflow.end_run()`
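
As an alternative to reopening the run, metrics and tags can also be written to an existing run through a client object. This is a sketch under the assumption that `client` behaves like `mlflow.tracking.MlflowClient`, whose `log_metric` and `set_tag` methods take the run ID as their first argument; the helper name `annotate_run` and the example tag are hypothetical.

```python
def annotate_run(client, run_id, metrics=None, tags=None):
    """Add metrics and tags to an existing run without reopening it.

    `client` is assumed to behave like mlflow.tracking.MlflowClient.
    """
    for key, value in (metrics or {}).items():
        client.log_metric(run_id, key, value)
    for key, value in (tags or {}).items():
        client.set_tag(run_id, key, value)

# Usage (the run ID comes from the MLflow UI; left empty here as a placeholder):
# annotate_run(mlflow.tracking.MlflowClient(), run_id='', tags={'reviewed': 'yes'})
```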

## Shutdown h2o cluster

In [None]:
h2o.cluster().shutdown()