# HIDA Workshop Introduction to MLOps, Workflow tools for Data Science
## Session 1: Tracking metrics, parameters, metadata and models

***Christian Gerloff - Helmholtz School for Data Science in Life, Earth and Energy*** <br>
This notebook consists of the practical examples of the first part of the workshop in MLOps and Workflow tools. <br><br><br>
### Get started with the course materials
To work with the materials interactively, you can open this notebook in [Google Colab](https://colab.research.google.com/). All you need is a Google account. Apart from the server application, all course materials are prepared for direct use in Google Colab, so no local installation is required. In the readme and during the course, we will provide an additional how-to for local or remote installations. <br><br>

### Credentials for cloud-hosted servers and storage

To allow interaction during the workshop and to provide a realistic server setup for labs or industrial use cases, we will use cloud-hosted storage and a cloud-hosted MLflow server. Both are protected. Every participant will receive their own credentials for the MLflow server by email beforehand. The credentials prevent collisions between runs, so please use your own. You should have received:

* MLFLOW_TRACKING_USERNAME
* MLFLOW_TRACKING_PASSWORD
* AWS_ACCESS_KEY_ID
* AWS_SECRET_ACCESS_KEY
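
One way to avoid hard-coding these values in the notebook is to read them from the environment and fail early if any is missing. The following is a minimal sketch that assumes the four variable names listed above; the helper `missing_credentials` and the placeholder values are hypothetical and not part of the course materials.

```python
import os

REQUIRED_VARS = (
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# illustrative environment with placeholder values (not real credentials)
example_env = {
    "MLFLOW_TRACKING_USERNAME": "participant",
    "AWS_ACCESS_KEY_ID": "placeholder",
}
print(missing_credentials(example_env))
```

Calling `missing_credentials()` without arguments checks the actual process environment, which may be a convenient first cell before any server interaction.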

## 1 Preparation

### 1.1 Install required packages
Here we install the required packages. <br><br>

***Tip: random initialization***: Many methods that we apply or develop in data science rely on randomness. Because we work on deterministic machines, these numbers are not truly random; instead, their generation depends on a seed. If no seed is given, it is often derived from the CPU clock time. This has disadvantages but brings one advantage: we can set the seed to make results reproducible. We suggest that you always set a seed. Moreover, we recommend testing the variance introduced by different seeds, because the final conclusions in your paper should not depend on the seed.
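
The tip can be illustrated with a small sketch using NumPy's generator API. The function `noisy_mean` and the choice of ten seeds are assumptions made for illustration only; the idea is that the same seed reproduces a result, while the spread across seeds indicates how much a conclusion might depend on it.

```python
import numpy as np


def noisy_mean(seed, n=1000):
    """Mean of n standard-normal draws from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=n).mean()


# the same seed reproduces the same value
first = noisy_mean(42)
second = noisy_mean(42)
print(first == second)

# different seeds give different values; their spread is the seed-induced variance
seed_results = np.array([noisy_mean(s) for s in range(10)])
print(f"std across seeds: {seed_results.std():.3f}")
```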

In [None]:
!pip -q install mlflow boto3

In [None]:
import os
import boto3
import mlflow as mf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as clr
import seaborn as sns

from pathlib import Path
from joblib import load
from mlflow.tracking import MlflowClient

from sklearn import metrics
from sklearn.linear_model import RidgeClassifier
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.model_selection import cross_val_score

# set seed
SEED = 42

### 1.2 Gather data sets from AWS S3
The original datasets are usually stored ***locally***, on HDFS, FDP, in databases (e.g., postgres), or in cloud storage such as S3 (see the storage overview in 3.1). For demonstration purposes, we load the initial data set from S3 via boto. If you work locally, the details of this code are not important.

***Tip: data consistency:*** We recommend that you never modify the initial dataset. Always create a new version of your dataset when you touch it. If you need to save the manipulated data for subsequent analyses, a multi-stage pipeline may be useful (see session 2 of the workshop). An excellent alternative to ensure consistent data is the data versioning system `datalad`.
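
A minimal sketch of this idea, using only the standard library: the raw file is checksummed, a derived version is written to a new file, and the checksum confirms that the raw file was left untouched. The file names and the doubling step are hypothetical; tools such as `datalad` handle this far more thoroughly.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum used to verify that a file has not been modified."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"  # hypothetical raw file
    raw.write_text("value\n1\n2\n3\n")
    checksum_before = sha256_of(raw)

    # derive a new version instead of overwriting the raw file
    rows = raw.read_text().splitlines()
    doubled = [rows[0]] + [str(2 * int(v)) for v in rows[1:]]
    derived = Path(tmp) / "raw_v2_doubled.csv"
    derived.write_text("\n".join(doubled) + "\n")

    raw_unchanged = sha256_of(raw) == checksum_before
    derived_content = derived.read_text()

print(raw_unchanged)
```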

In [None]:
# aws settings for raw data & artifact storage
os.environ['AWS_ACCESS_KEY_ID'] =
os.environ['AWS_SECRET_ACCESS_KEY'] =
BUCKET_NAME = 'hida-workshop-data'
AWS_ACC_KEY = os.getenv('AWS_ACCESS_KEY_ID')
AWS_SEC_KEY = os.getenv('AWS_SECRET_ACCESS_KEY')

# specify aws resources to gather initial data
client = boto3.client('s3', aws_access_key_id=AWS_ACC_KEY,
                      aws_secret_access_key=AWS_SEC_KEY)
s3 = boto3.resource('s3')


def download_s3(client, resource, bucket, dist, local='/tmp'):
    paginator = client.get_paginator('list_objects')
    pag = paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist)
    for result in pag:
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_s3(client,
                            resource,
                            bucket,
                            subdir.get('Prefix'),
                            local)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            if file.get('Key')[-1] != "/":
                resource.meta.client.download_file(bucket,
                                                   file.get('Key'),
                                                   dest_pathname)


download_s3(client, s3, BUCKET_NAME, 'fetal_health', 'data')


## 2 A first data analysis 



### 2.1 Explorative data analysis
***Description:*** Data Set | Classification task

Cardiotocograms (CTGs) are a simple and cost accessible option to assess fetal health, allowing healthcare professionals to take action in order to prevent child and maternal mortality. The equipment itself works by sending ultrasound pulses and reading its response, thus shedding light on fetal heart rate (FHR), fetal movements, uterine contractions and more. <br><br>

***Reference***: Ayres-de-Campos, D., Bernardes, J., Garrido, A., Marques-de-Sa, J., & Pereira-Leite, L. (2000). SisPorto 2.0: a program for automated analysis of cardiotocograms. Journal of Maternal-Fetal Medicine, 9(5), 311-318.

In [None]:
# load data set and perform some EDA
data = pd.read_csv(Path.cwd() / 'data/fetal_health/fetal_health.csv')

labels = data.fetal_health.astype(int)
features = data.drop(columns=['fetal_health'])

fig = plt.figure(figsize=(20, 15))
plt.suptitle("Distribution of the Numeric variables", weight='bold', y=1.1)
for i, f in enumerate(features.columns):
    ax = plt.subplot(7, 3, 1+i)
    ax = sns.kdeplot(data=features, x=f, fill=True, alpha=1)
    ax.set_title(f, y=1.1)
plt.tight_layout(pad=0, w_pad=2, h_pad=2)

# corr
fig = plt.figure(figsize=(15, 8))
sns.heatmap(features.corr(), linewidths=3, annot=True)
plt.title("Correlation matrix", size=20, weight='bold')

# labels
fig = plt.figure(figsize=(15, 8))
sns.histplot(data=labels, discrete=True)
plt.title("Dependent variable", size=20, weight='bold')

#### Additional notes

N = 2126 records 

Classes 'fetal_health':
* Normal == 1
* Suspect == 2
* Pathological == 3

Features:

* 'baseline value' FHR baseline (beats per minute)
* 'accelerations' Number of accelerations per second
* 'fetal_movement' Number of fetal movements per second
* 'uterine_contractions' Number of uterine contractions per second
* 'light_decelerations' Number of light decelerations per second
* 'severe_decelerations' Number of severe decelerations per second
* 'prolongued_decelerations' Number of prolonged decelerations per second
* 'abnormal_short_term_variability' Percentage of time with abnormal short term variability
* 'mean_value_of_short_term_variability' Mean value of short term variability
* 'percentage_of_time_with_abnormal_long_term_variability' Percentage of time with abnormal long term variability
* 'mean_value_of_long_term_variability' Mean value of long term variability
* 'histogram_width' Width of FHR histogram
* 'histogram_min' Minimum (low frequency) of FHR histogram
* 'histogram_max' Maximum (high frequency) of FHR histogram
* 'histogram_number_of_peaks' Number of histogram peaks
* 'histogram_number_of_zeroes' Number of histogram zeros
* 'histogram_mode' Histogram mode
* 'histogram_mean' Histogram mean
* 'histogram_median' Histogram median
* 'histogram_variance' Histogram variance
* 'histogram_tendency' Histogram tendency
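
The class coding above can be turned into readable names, and the class shares can be inspected before deciding on a splitting strategy or metric. This is a small sketch on made-up labels, not on the real data; in the notebook, `labels` comes from the CSV file.

```python
import pandas as pd

CLASS_NAMES = {1: "Normal", 2: "Suspect", 3: "Pathological"}

# toy labels for illustration only
toy_labels = pd.Series([1, 1, 1, 2, 3, 1, 2, 1], name="fetal_health")

named = toy_labels.map(CLASS_NAMES)
class_share = named.value_counts(normalize=True)
print(class_share)
```

If the shares are unequal, stratified splits and a balanced metric, as used in the following sections, are a reasonable choice.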

### 2.2 A simple prediction model

In [None]:
# create train set for cross-validation and additional hold-out set
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=SEED)

# create a simple sklearn pipeline
pipe = Pipeline(steps=[('sts', StandardScaler()),
                       ('cls', RidgeClassifier(alpha=1.0, random_state=SEED))])

# cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
scores = cross_val_score(pipe,
                         x_train,
                         y_train,
                         cv=cv,
                         scoring='balanced_accuracy')

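plt.figure(figsize=(15, 8))
# histplot is axes-level and draws on the figure created above
# (displot is figure-level and would open a second figure)
sns.histplot(scores, kde=True)
plt.xlim(scores.min(), 1)
plt.title("ACC", size=20, weight='bold')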

# test on hold-out
pipe.fit(x_train, y_train)
est_y = pipe.predict(x_test)

acc = metrics.balanced_accuracy_score(y_test, est_y)
(precision,
 recall,
 f_score,
 support) = metrics.precision_recall_fscore_support(
        y_test, est_y, beta=2, average=None)

print('Performance CV \n'
      f'acc mean: {np.mean(scores):.2f} \n'
      f'acc std: {np.std(scores):.2f} \n \n'
      'Hold-out: one-vs-rest performance \n'
      f'acc : {acc:.2f} \n'
      f'precision: {precision} \n'
      f'recall: {recall} \n'
      f'f-score: {f_score} \n'
      f'support: {support} \n')

***Motivation:*** But what if we want to compare several models?

## 3 Setting up MLflow for tracking




### 3.1 Background: What can we map in the data science lifecycle via MLflow and where is it stored?
<br><br>

MLflow is backed by two stores: one for files/artifacts and one for all metadata.

### Tracking / Metadata stores (your experiment)

* Local file path: data is stored directly on your machine, by default in `mlruns/`.
* Database: MLflow supports several databases such as MySQL, SQLite, or PostgreSQL. The connection is made via SQLAlchemy, so have a look at its documentation for the connection string of your database type.
* HTTP server: a server hosting an MLflow tracking server.
    * locally: https://my-server:5000
    * **In this workshop it is hosted remotely**

***Note:*** Alternatively, you could also use a Databricks workspace.
<br><br>

    
### Artifact stores (all data that specifically belongs to a run)

The default artifact store is your local folder which is feasible for local installations of MLflow. For this workshop, we hosted an MLflow server in the cloud with S3 as an artifact store.
Store Options:
* Amazon S3
* Azure Blob Storage
* Google Cloud Storage
* FTP server
* SFTP Server
* NFS
* HDFS
<br><br>

### ***Run***
= an instance of code that is tracked via MLflow. It can contain several elements, such as tags, notes, parameters, artifacts.
<br><br>
### ***Experiment***
= a set of runs. For example your current research project.
<br><br>
### ***Tags & Notes*** (stored in metadata storage): string
Information about a run, such as its main aim, its difference from other runs, a core assumption,
or the name of the underlying data set.

Notes can be added per run or for an experiment. They support markdown.
```
MlflowClient().set_tag(run_id, 
     "mlflow.note.content","***<nice_note>***")
MlflowClient().set_experiment_tag(experiment_id, 
     "mlflow.note.content","***<nice_note>***")
```
<br><br>
### ***Parameters*** (stored in metadata storage): e.g., dict
Key-value inputs for your code & model

```
parameters = {'s_width':  10}  # sliding window width in seconds
mf.log_params(parameters)  # fluent API, logs to the active run
```
<br><br>
### ***Metrics*** (stored in metadata storage): e.g., int or float
Numeric values, can contain temporal dependencies
```
mf.log_metrics({'ROC_AUC': 90.6}, step=1)  # fluent API, logs to the active run
```
<br><br>
### ***Artifact*** (stored in artifact storage)
Resulting data, such as preprocessed data, models, figures, or other files
<br><br>
### ***Source Information & Models*** (stored in artifact storage)
MLflow can store models, their versions, and the source of the associated code (git hashes only).
Everything can be defined manually, but for simple use cases autotracking can be considered. The integration with git is not as tight as in DVC, which is a main drawback of MLflow from our point of view. We will discuss source information and models in session 2.

### 3.2 Manual tracking 

Here we specify where our MLflow server is located. If you are running MLflow locally, you don't have to worry about the tracking URL.
The artifact storage and metadata storage are already configured. We only have to specify the name of the experiment and our username.

In [None]:
os.environ['MLFLOW_TRACKING_USERNAME'] = 
os.environ['MLFLOW_TRACKING_PASSWORD'] = 

MF_URL = 
EXPERIMENT = 'Examples-Session-1'
NOTE_EXPERIMENT = 'This experiment belongs to the workshop Session 1'

mf.set_tracking_uri(MF_URL)
mf.set_experiment(experiment_name=EXPERIMENT)
experiment = mf.get_experiment_by_name(name=EXPERIMENT)

client = MlflowClient()
client.set_experiment_tag(
    experiment.experiment_id, "mlflow.note.content", NOTE_EXPERIMENT)

params = {'seed': SEED,
          'test_size': 0.3,
          'k_cv': 5,
          'shuffle': True,
          'alpha': 0.9}

tags = {'data': 'raw_fetal_data',
        'objective': 'influence_of_splits'}

NOTE = 'My first manually ***tracked classification***'

In [None]:
# create additional hold-out set
(x_train,
 x_test,
 y_train,
 y_test) = train_test_split(features,
                            labels,
                            test_size=params['test_size'],
                            stratify=labels,
                            random_state=params['seed'])

cv = StratifiedKFold(n_splits=params['k_cv'],
                     shuffle=True,
                     random_state=params['seed'])
# pipeline
stages = [('sts', StandardScaler()),
          ('cls', RidgeClassifier(alpha=params['alpha'],
                                  random_state=params['seed']))]
pipe = Pipeline(steps=stages)

# start the manual tracking
with mf.start_run():

  # log meta data & parameters
  mf.set_tags(tags)  # add tags, e.g. to filter runs
  mf.set_tag('mlflow.note.content', NOTE)  # add notes
  mf.set_tag('mlflow.user', os.getenv('MLFLOW_TRACKING_USERNAME'))  # add user name
  mf.log_params(params)

  # log performance of the cross-validation on train set
  scores = cross_val_score(pipe,
                           x_train,
                           y_train,
                           cv=cv,
                           scoring='balanced_accuracy')

  for i, s in enumerate(scores):
    mf.log_metrics({'training_accuracy_score': s}, step=i+1)

  # train model on train set
  pipe.fit(x_train, y_train)
  est_y = pipe.predict(x_test)
  
  # test and log the performance on hold-out set
  acc = metrics.balanced_accuracy_score(y_test, est_y)
  mf.log_metrics({'test_accuracy_score': acc})
  precision, *_ = metrics.precision_recall_fscore_support(
          y_test, est_y, beta=2, average=None)
  for i, p in enumerate(precision):
    mf.log_metrics({'test_precision': p}, step=i+1)

  # store model
  mf.sklearn.log_model(pipe, 'model')

***Note:*** Please do not forget to set the username in all your runs.

### 3.3 Automatic tracking



In [None]:
# create data set
(x_train,
 x_test,
 y_train,
 y_test) = train_test_split(features,
                            labels,
                            test_size=0.3,
                            stratify=labels,
                            random_state=SEED)

cv = StratifiedKFold(n_splits=5,
                     shuffle=True,
                     random_state=SEED)
# pipeline
stages = [('sts', StandardScaler()),
          ('cls', RidgeClassifier(alpha=0.1,
                      random_state=SEED))]

pipe = Pipeline(steps=stages)
pipe.fit(x_train, y_train) 

# enable autologging
mf.sklearn.autolog(disable=False, silent=True)
with mf.start_run():

  mf.set_tag('mlflow.user',
             os.getenv('MLFLOW_TRACKING_USERNAME'))
  mf.set_tags(tags)

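  # test on hold-out
  # note: `eval_and_log_metrics` belongs to the MLflow 1.x API; MLflow 2.0
  # removed it in favor of `mlflow.evaluate`
  hold_out_metrics = mf.sklearn.eval_and_log_metrics(
      pipe, x_test, y_test, prefix="test_")
  cross_val_score(pipe, x_train, y_train, cv=cv)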

# just for demonstration to ensure that autologging is off if you rerun a cell
mf.sklearn.autolog(disable=True)

`mlflow.autolog` is an experimental feature that can already save you some lines of code. It supports several libraries, including scikit-learn (used here), PyTorch, TensorFlow, and XGBoost, and automatically tracks common metrics, models, parameters, and input examples. Hence, the `params` dictionary becomes unnecessary here. <br><br>

***Tip: autotracking***: For sklearn, parameters are always stored when the `.fit` method or one of its derivatives (e.g., `.fit_transform`) is called. The corresponding metrics are stored with the prefix `training`. Moreover, artifacts with default names are overwritten on each call. Be aware of this if you call fit multiple times, as we do here. In this notebook, we called `fit` for the hold-out evaluation before autotracking started and stored the hold-out metrics and artifacts with the prefix `test_` to avoid this issue.

## 4 How to fetch your tracked data - The non-UI way

An advantage of MLflow compared to DVC is its UI. The UI already provides an easy way to inspect your tracked data and to compare your results.

Out-of-the-box UI features:
* download of parameters and metrics as CSV across runs
* comparison and comparison charts across runs
* detailed graphs for each metric in a run

Nevertheless, we often want to create tables and figures locally after all analyses have been carried out, perhaps with the first result graphics already saved as artifacts. Or we want to create a comprehensive report.
For these and other scenarios, we need to extract the data of the individual runs via the API. In the following, we briefly give an example of this:

### 4.1 Get runs of an experiment
To get all runs that you are interested in, you can filter runs by defining filter strings for tags or parameters.

Here we filter by user and order the results by user, data tag, and test accuracy.
The results are already of type `pandas.DataFrame`, which makes it straightforward to inspect and visualize them.
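
Filter strings combine conditions on tags, parameters, and metrics with `and`. The helper below is a hypothetical convenience function, not part of MLflow; it only assembles the string, and the keys and the threshold of 0.7 are taken from or invented for this notebook's example.

```python
def build_filter(tags=None, params=None, min_metrics=None):
    """Join tag, parameter and metric conditions into one filter string."""
    clauses = []
    for key, value in (tags or {}).items():
        clauses.append(f"tags.{key} = '{value}'")
    for key, value in (params or {}).items():
        # parameters are stored as strings, so the value is quoted
        clauses.append(f"params.{key} = '{value}'")
    for key, threshold in (min_metrics or {}).items():
        clauses.append(f"metrics.{key} >= {threshold}")
    return " and ".join(clauses)


filter_string = build_filter(
    tags={"data": "raw_fetal_data"},
    params={"alpha": 0.9},
    min_metrics={"test_accuracy_score": 0.7},
)
print(filter_string)
```

The resulting string can be passed as `filter_string` when searching runs.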

In [None]:
# specify the experiment that we are interested in
experiment = mf.get_experiment_by_name(name='Examples-Session-1')

# specify a filter to select a subset of runs, e.g. only our own runs
filter_string = f"tags.mlflow.user='{os.getenv('MLFLOW_TRACKING_USERNAME')}'"

# fetch the data (experiment_ids expects a list of ids)
runs = mf.search_runs(experiment_ids=[experiment.experiment_id],
                      filter_string=filter_string,
                      order_by=['tags.mlflow.user',
                                'tags.data',
                                'metrics.test_accuracy_score'])
runs

***Tip: fetch runs:*** This method has one drawback. If you inspect the results, you will discover that metrics with multiple values (termed `history` in MLflow), such as the performance metrics of the cross-validation, only show the value of the last fold.


In [None]:
# show that the fold-specific information is missing with this procedure
runs['metrics.training_accuracy_score']

To recover this information, you can gather the metadata of a specific run via a client object `MlflowClient()`. This method also provides you with the history of a metric.


For example, here we collect a dictionary of performance measures of the cross-validation only from a specific run.

In [None]:
# fetch a nested metric of a specific run 
metric = client.get_metric_history(
      runs.run_id[0], 'training_accuracy_score')
print(f'{metric[0].key}: \n '
      f"{[f'{i}: {v.value:.2f}' for i, v in enumerate(metric)]}")
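
For reports, it can be handy to turn such a history into a `pandas.DataFrame`. The sketch below does not contact a server: it uses a `namedtuple` as a stand-in for the entries returned by `get_metric_history`, which expose the attributes `key`, `value`, `step`, and `timestamp`. The scores and timestamps are made up.

```python
from collections import namedtuple

import pandas as pd

# stand-in for the metric entries returned by `get_metric_history`
Metric = namedtuple("Metric", ["key", "value", "step", "timestamp"])

toy_history = [
    Metric("training_accuracy_score", 0.81, 1, 1000),
    Metric("training_accuracy_score", 0.84, 2, 1001),
    Metric("training_accuracy_score", 0.79, 3, 1002),
]

# one row per step (here: per cross-validation fold)
history_df = pd.DataFrame(
    {"step": [m.step for m in toy_history],
     "value": [m.value for m in toy_history]}
).set_index("step")

print(history_df["value"].agg(["mean", "std"]))
```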

**Congratulations!** <br>
Take a break, Session 2 will start soon.