# HIDA Workshop Introduction to MLOps, Workflow tools for Data Science
## Session 2: Pipelines & MLFlow projects

***Christian Gerloff - Helmholtz School for Data Science in Life, Earth and Energy*** <br>
This notebook contains the practical examples for the second part of the workshop on MLOps and workflow tools. All course materials are prepared to run in Google Colab without further requirements.<br><br><br>

## 1 Preparation

Here we download the MLflow project and install Miniconda in Colab, which we need to create the environment in which the project will run. By defining a specific conda environment or by setting up a Docker container, we aim for technical reusability of the pipeline.

In [None]:
# download the project from the workshop repo
!wget -O - https://github.com/ChristianGerloff/hida-workshop-mlflow/archive/refs/heads/mlfow-projects.tar.gz | tar xz \
       --strip=1 "hida-workshop-mlflow-mlfow-projects/mlflow_projects"

# download miniconda
!wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local

# install the packages required to start the project
import sys
sys.path.append('/usr/local/lib/python3.8/site-packages/')
!pip install mlflow python-dotenv --quiet

## 2 Pipelines
A pipeline is essentially a sequence of functions. We can describe a pipeline $P$ by a set of functions $F$ and a set $E$ of sequential relations between these functions. In short, a pipeline can be described as a directed acyclic graph.
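
To make the graph view concrete, here is a minimal sketch of a toy pipeline expressed as functions `F` and dependencies `E`, executed in topological order. The step names (`load`, `clean`, `train`) and the helper `run_pipeline` are illustrative assumptions and not part of the workshop project; the sketch relies on `graphlib` from the standard library (Python 3.9 or newer).

```python
from graphlib import TopologicalSorter


# F: the functions (steps) of a toy pipeline; the names are illustrative only
def load():
    return [3, 1, 2]


def clean(data):
    return sorted(data)


def train(data):
    return sum(data) / len(data)


F = {"load": load, "clean": clean, "train": train}

# E: for each step, the steps whose output it consumes (its predecessors)
E = {"load": [], "clean": ["load"], "train": ["clean"]}


def run_pipeline(functions, edges):
    """Run every step after all of its predecessors have finished."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        inputs = [results[dep] for dep in edges[name]]
        results[name] = functions[name](*inputs)
    return results


results = run_pipeline(F, E)
print(results["train"])  # 2.0
```

If `E` contained a cycle, `TopologicalSorter` would raise a `CycleError`, which is why the graph has to be acyclic.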

A pipeline can become quite complex. Some guiding questions:
* How many steps does the pipeline have?
* Is the state of the previous step in the pipeline relevant (synchronized or not)?
* Does the pipeline depend on multiple frameworks/packages?
* Do I have to run the pipeline from beginning to end or do I have multiple entry points?
* How is the pipeline triggered (manually, automatically, via an event or as a cron job)?
* How are the results of my pipeline served?
* ...


Hence, a pipeline can be realized in many ways (see previous discussion):
1. Native sequence of functions or methods
2. Package- or framework-specific pipelines
3. Workflow tools such as Airflow, DVC, or MLflow
...
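
As a minimal sketch of option 1, a native sequence of functions can be chained by feeding the output of each step into the next one. The step functions and the helper `run_sequence` are hypothetical examples, not code from the workshop repository.

```python
from functools import reduce


def normalize(values):
    """Scale the values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def square(values):
    """Square every value."""
    return [v * v for v in values]


def run_sequence(steps, data):
    """Apply each step to the output of the previous one."""
    return reduce(lambda acc, step: step(acc), steps, data)


result = run_sequence([normalize, square], [1, 1, 2])
print(result)  # [0.0625, 0.0625, 0.25]
```

This works well for short, linear pipelines. Once steps need separate environments, multiple entry points, or caching, a dedicated workflow tool tends to be the better fit.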


## 3 From tracking scripts to small pipelines

### 3.1 Prepare credentials & environment variables
The example project uses a `.env` file to store the credentials.
Therefore, please upload a `.env` file containing all required environment variables or create it via the code below. Make sure that each entry ends with a line break: `"<env.var.name>=<value>\n",`

***Note:*** In this Colab setting, you could also specify the environment variables directly in a code cell. In production, we strongly recommend isolating confidential environment variables in a file.


In [None]:
with open("mlflow_projects/.env", "w") as f:
    f.writelines(["AWS_ACCESS_KEY_ID=\n",
                  "AWS_SECRET_ACCESS_KEY=\n",
                  "BUCKET_NAME=hida-workshop-data\n",
                  "MLFLOW_TRACKING_URL=http://3.125.220.21:80\n",
                  "MLFLOW_TRACKING_USERNAME=\n",
                  "MLFLOW_TRACKING_PASSWORD=\n",
                  "MLFLOW_EXPERIMENT=Example-Session-2"])
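
For completeness, the alternative mentioned in the note (setting the variables directly in a code cell) could look like the sketch below. The dictionary name `workshop_env` is an assumption for illustration. The credentials are deliberately left out, and whether the start script runs without a `.env` file depends on how it loads its configuration, so treat this as an illustration rather than a replacement for the file.

```python
import os

# Non-confidential settings only; the values mirror the .env example above.
workshop_env = {
    "BUCKET_NAME": "hida-workshop-data",
    "MLFLOW_TRACKING_URL": "http://3.125.220.21:80",
    "MLFLOW_EXPERIMENT": "Example-Session-2",
}

# Make the settings visible to this process and to any child process it starts.
os.environ.update(workshop_env)

print(os.environ["MLFLOW_EXPERIMENT"])  # Example-Session-2
```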

Now let's take a look at the pipeline!

### 3.2 Start the example pipeline
Here we change the working directory and trigger the pipeline via a starting script. This starting script will perform the following actions:

* load environment variables from `.env`
* set up the MLflow client
* prepare and activate the conda environment via the `conda.yaml`
* start the pipeline with the defined entry point and parameters


In [None]:
import os

# change the current working directory
os.chdir('/content/mlflow_projects')

# start our project runs
!python start.py
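
The actual `start.py` is not reproduced in this notebook. As a rough sketch of how such a starting script might perform the actions listed above, consider the following. The helper names `build_run_config` and `launch`, the entry point name `main`, and the parameter `max_depth` are assumptions for illustration; `mlflow.projects.run`, `mlflow.set_tracking_uri`, and `load_dotenv` are existing library calls.

```python
import os


def build_run_config(entry_point="main", parameters=None):
    """Collect keyword arguments for mlflow.projects.run (hypothetical helper)."""
    return {
        "uri": ".",  # the directory that contains the MLproject file
        "entry_point": entry_point,
        "parameters": parameters or {},
        "experiment_name": os.getenv("MLFLOW_EXPERIMENT", "Example-Session-2"),
    }


def launch(config):
    """Load the .env file, point MLflow at the tracking server, and start the project."""
    # Imported here so that the helper above can be inspected without mlflow installed.
    import mlflow.projects
    from dotenv import load_dotenv

    load_dotenv()
    mlflow.set_tracking_uri(os.getenv("MLFLOW_TRACKING_URL"))
    return mlflow.projects.run(**config)


config = build_run_config(parameters={"max_depth": 3})
print(config["entry_point"])  # main
```

Calling `launch(config)` from the project directory would then hand control to MLflow, which prepares the environment described by the project before running the entry point.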

In the UI you will find two nested runs in the experiment. 
The first nested run contains all the steps in our pipeline, while the second contains only the last step. As discussed earlier, this behaviour results from our multistep setting. In this setting, we avoid repeating runs with identical data, parameters, etc. and only run parts of the pipeline that have changed or are new.
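
The reuse logic can be pictured as a lookup over earlier runs: a step is skipped if a finished run with the same entry point and parameters already exists. The sketch below works on plain dictionaries to stay independent of the tracking server; the record layout and the function `find_reusable_run` are assumptions for illustration and do not mirror the project's `main.py`.

```python
def find_reusable_run(previous_runs, entry_point, parameters):
    """Return the id of a finished run with identical settings, or None."""
    for run in previous_runs:
        if (run["entry_point"] == entry_point
                and run["parameters"] == parameters
                and run["status"] == "FINISHED"):
            return run["run_id"]
    return None


# Hypothetical run records, as they might be collected from earlier executions
previous_runs = [
    {"run_id": "a1", "entry_point": "preprocess",
     "parameters": {"test_size": "0.2"}, "status": "FINISHED"},
    {"run_id": "b2", "entry_point": "train",
     "parameters": {"max_depth": "3"}, "status": "FAILED"},
]

print(find_reusable_run(previous_runs, "preprocess", {"test_size": "0.2"}))  # a1
print(find_reusable_run(previous_runs, "train", {"max_depth": "3"}))  # None
```

In this toy example the preprocessing step would be reused, while the training step would run again because its earlier attempt failed.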

### 3.3 Linting & Testing

While it is best practice in modern software development to integrate test stages into CI/CD pipelines, analytical pipelines can also benefit from ensuring correct behavior and reliable results via testing strategies. Two possible options to do so are:


1.   Integrate tests into each entry point of the pipeline,
2.   Add one or more tests as a separate entry point of your pipeline.


We prefer option (2), which is more in line with our software development practices. Therefore, we run the test entry point at the end of the pipeline and reset the state of the previous runs to `failed` if the tests were not successful. In this introductory workshop, we will not cover this topic further.

### 3.4 How to submit your final results to the "Leaderboard"

Here we manually submit the results. 

***Tip:*** Alternatively, you can write your own automated submission procedure. To do so, take a look at session 1 for fetching runs and at how we reused runs in main.py. Perhaps it helps you :)

In [None]:
import os

import mlflow as mf
from dotenv import load_dotenv

# specify your team name, dataset (ASD/Fetal) and run_id of your final run
tags = {'team': 'HDS-LEE-Coffee-addicts',
        'dataset': 'ASD',
        'run_id': '12312312312312'}

metric = {'test_accuracy_score': 0.222,
          'test_f1_score': 0.12,
          'test_precision_score': 0.1,
          'test_recall_score': 0.3,
          'test_roc_auc': 0.5}

# feel free to add your parameters and model :)

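# read the tracking URL and the credentials from the .env file
load_dotenv()
mf.set_tracking_uri(os.getenv('MLFLOW_TRACKING_URL'))
mf.set_experiment('Leaderboard')

with mf.start_run() as run:
    # set tags inside the run: calling set_tag before start_run would
    # implicitly start a run and make start_run fail
    mf.set_tag('mlflow.user', os.getenv('MLFLOW_TRACKING_USERNAME'))
    mf.set_tags(tags)
    mf.log_metrics(metric)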
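
If you want to automate the submission, one possible starting point is to fetch the metrics of your final run and keep only the test scores. This is a sketch under assumptions: the helper names are hypothetical, the `test_` prefix follows the metric names used above, and the example values are made up for illustration. `MlflowClient().get_run(run_id).data.metrics` is the existing MLflow call that returns the metrics of a run as a dictionary.

```python
def select_leaderboard_metrics(all_metrics, prefix="test_"):
    """Keep only the metrics whose name starts with the given prefix."""
    return {name: value for name, value in all_metrics.items()
            if name.startswith(prefix)}


def fetch_run_metrics(run_id):
    """Fetch the logged metrics of a run from the tracking server."""
    # Imported here so that the filter above can be tried without mlflow installed.
    from mlflow.tracking import MlflowClient

    return MlflowClient().get_run(run_id).data.metrics


# Illustrative values only; in practice, pass fetch_run_metrics(<your run_id>)
example_metrics = {"train_accuracy_score": 0.9,
                   "test_accuracy_score": 0.222,
                   "test_roc_auc": 0.5}
print(select_leaderboard_metrics(example_metrics))
```

The filtered dictionary can then be passed to `mf.log_metrics` in place of the manually typed `metric` dictionary.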

## Great - Let's start the coding session!

***Important:*** Before you start with your colleagues, please change the name of your experiment in the `.env` to the name of your group, such as `"MLFLOW_EXPERIMENT=HDS-LEE-Coffee-addicts"`. Otherwise, we may get lost among the different runs.<br><br>

Have fun :)