# Chapter 1: Introducing MLflow

**MLflow** is an open source platform for the **machine learning (ML)** life cycle, with a focus on *reproducibility*, *training*, and *deployment*. It is based on an open interface design and can work with any language or platform: it offers clients in Python, Java, and R, and is accessible through a REST API. Scalability is also an important benefit that an ML developer can leverage with MLflow.

## **What is MLflow?**

Implementing a product based on ML can be a laborious task. There is a general need to reduce the friction between different steps of the ML development life cycle, and between teams of data scientists and engineers that are involved in the process. ML practitioners, such as data scientists and ML engineers, operate with different systems, standards, and tools. While data scientists spend most of their time developing models in tools such as Jupyter Notebooks, when running in production, the model is deployed in the context of a software application with an environment that is more demanding in terms of scale and reliability.

A common occurrence in ML projects is to have **the models reimplemented by an engineering team**, **creating a custom-made system to serve the specific model**. Teams that follow such bespoke approaches to model development tend to face a common set of challenges:

* ML projects that run over budget due to the need to create bespoke software infrastructure to develop and serve models
* Translation errors when reimplementing the models produced by data scientists
* Scalability issues when serving predictions
* Friction in terms of reproducing training processes between data scientists due to a lack of standard environments

Companies leveraging ML tend to create their own (often extremely laborious) internal systems in order to ensure a smooth and structured process of ML development. Widely documented ML platforms include systems such as Michelangelo and FBLearner, from Uber and Facebook, respectively.

It is in the context of the increasing adoption of ML that MLflow was initially created at Databricks and open sourced as a platform, to aid in the implementation of ML systems.

MLflow enables an everyday practitioner to manage the ML life cycle in one platform, from iteration on model development up to deployment in a reliable and scalable environment that is compatible with modern software system requirements.

## **Getting started with MLflow**

1. Create a Dockerfile with the following instructions:
```dockerfile
FROM jupyter/datascience-notebook
RUN pip install mlflow
RUN pip install scikit-learn
```
2. Create a run script, `run.sh`, with the following instructions:
```bash
#!/bin/bash
docker build -t chapter_1_homlflow .
docker run -p 8888:8888 -p 5000:5000 -v $(pwd):/home/ibra/ -it chapter_1_homlflow
```
3. Open your browser to `http://localhost:8888` and you should be able to navigate to the Chapter01 folder.

## **Developing your first model with MLflow**

Load the sample dataset:
```py
from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.4)
```
Next, let's train your model.
Training a simple machine learning model with a framework such as scikit-learn involves instantiating an estimator such as `LogisticRegression` and calling its `fit` method to execute training over the Iris dataset built into scikit-learn:
```py
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
```
The preceding lines of code are just a small portion of the ML engineering process. As will be demonstrated, a non-trivial amount of code is needed to productionize the preceding training code and make sure that it is usable and reliable. One of the main objectives of MLflow is to aid in the process of setting up ML systems and projects. In the following sections, we will demonstrate how MLflow can be used to make your solutions robust and reliable.

Then, we will add MLflow.
With a few more lines of code, you should be able to start your first MLflow interaction. In the following code listing, we start by importing the mlflow module, followed by the LogisticRegression class in scikit-learn. You can use the accompanying Jupyter notebook to run the next section:
```py
import mlflow
from sklearn.linear_model import LogisticRegression

mlflow.sklearn.autolog()

with mlflow.start_run():
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
```
The `mlflow.sklearn.autolog()` instruction enables you to automatically log the experiment in the local directory. It captures the metrics produced by the underlying ML library in use. MLflow Tracking is the module responsible for handling metrics and logs. By default, the metadata of an MLflow run is stored in the local filesystem.

After running the preceding excerpt from the root of the accompanying notebook's directory, you should find an `mlruns` folder in your working directory. The folder of the run contains the files and folders described next.

The `model.pkl` file contains a **serialized version of the model**. For a scikit-learn model, this is a binary (pickled) version of the Python model object. Upon autologging, the metrics are taken from the underlying machine learning library in use. The default packaging strategy is based on a `conda.yaml` file, which lists the dependencies needed to load the serialized model.

The `MLmodel` file **is the main definition of the model from MLflow's point of view**, with information related to how to run inference on the current model.

The `metrics` folder contains **the training score value of this particular run of the training process**, which can be used to benchmark the model with further model improvements down the line.

The `params` folder contains **the default parameters of the logistic regression model**, with the different default possibilities listed transparently and stored automatically.
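
To make this layout concrete, the following sketch reads the parameters and metrics of a single run straight from the local file store, using only the standard library. It assumes the default file-based layout, in which each file under `params` holds one value and each line of a file under `metrics` holds a timestamp, a value, and a step separated by spaces. The `read_run` helper is hypothetical, and the on-disk format is an implementation detail that may differ between MLflow versions:

```python
from pathlib import Path


def read_run(run_dir):
    """Return (params, metrics) for one run directory of a local file store.

    Assumed layout (an implementation detail, not a stable API):
      <run_dir>/params/<name>   -> file containing the parameter value
      <run_dir>/metrics/<name>  -> lines of "<timestamp> <value> <step>"
    """
    run_dir = Path(run_dir)
    params, metrics = {}, {}

    params_dir = run_dir / "params"
    if params_dir.is_dir():
        for path in sorted(params_dir.iterdir()):
            if path.is_file():
                params[path.name] = path.read_text().strip()

    metrics_dir = run_dir / "metrics"
    if metrics_dir.is_dir():
        for path in sorted(metrics_dir.iterdir()):
            if not path.is_file():
                continue
            lines = [line for line in path.read_text().splitlines() if line.strip()]
            if lines:
                # Keep only the last value written for each metric.
                metrics[path.name] = float(lines[-1].split()[1])

    return params, metrics
```

Calling `read_run` on the folder of a run under `mlruns` should return two dictionaries that mirror what the MLflow UI displays for that run.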

## **Exploring MLflow modules**

MLflow modules are extensible software components that organize related features in the platform and deliver the core functionality that supports the different phases of the ML life cycle.

The following are the built-in modules in MLflow:

* **MLflow Tracking**: Provides a mechanism and UI to handle metrics and artifacts generated by ML executions (training and inference)
* **MLflow Projects**: A package format to standardize ML projects
* **MLflow Models**: A mechanism that deploys to different types of environments, both on-premises and in the cloud
* **MLflow Model Registry**: A module that handles the management of models in MLflow and their life cycle, including state

In order to explore the different modules, we will install MLflow in your local environment using the following command:
```bash
pip install mlflow
```

## **Exploring MLflow projects**

An MLflow project represents the basic unit of organization of ML projects. There are three different environments supported by MLflow projects: the **Conda environment**, **Docker**, and the **local system**.
Once you have your environment, the main file that defines how your project should look is the **MLproject** file. This file is used by MLflow to understand how it should run your project.

## **Developing your first e2e pipeline**

We will prototype a **simple stock prediction project** in this section with MLflow and will document the different files and phases of the solution. You will develop it **on your local system, using locally installed MLflow and Docker**.

> In this section, we are assuming that MLflow and Docker are installed locally, as the steps in this section will be executed in your local environment.

The task in this illustrative project is to create a basic MLflow project and produce a working baseline ML model to predict, based on market signals over a certain number of days, whether the stock market will go up or down.

In this section, we will use a Yahoo Finance dataset with quotes for the BTC-USD pair (https://finance.yahoo.com/quote/BTC-USD/) over a period of 3 months. We will train a model to predict whether the quote will go up or not on a given day. A REST API will be made available for predictions through MLflow.

We will illustrate, step by step, **the creation of an MLflow project to train a classifier on stock data**, using financial information from the Yahoo API retrieved with the `pandas-datareader` package:

1. Add your MLproject file:

```yaml
name: stockpred
docker_env:
  image: stockpred
entry_points:
  main:
    command: "python train.py"
```

  The preceding **MLproject** file specifies that **dependencies will be managed in Docker with a specific image name**. **MLflow will first look for the image in the Docker installation on your system**. If it doesn't find it there, it will try to retrieve it from Docker Hub. For the goals of this chapter, it is completely fine to have MLflow running on your local machine.

  The second configuration that we add to our project **is the main entry point command**. This command runs the `train.py` Python file, which contains the code of our project, inside the Docker environment.

2. Add a Dockerfile to the project.

   Additionally, you can specify the Docker registry URL of your image. The advantage of running Docker is that your project is not bound to the Python language, as we will see in the advanced section of this book. The MLflow API is available through a REST interface alongside the official clients for Python, Java, and R. The Dockerfile for this project is as follows:

```dockerfile
FROM continuumio/miniconda3:4.12.0

RUN pip install mlflow \
    && pip install numpy \
    && pip install scipy \
    && pip install pandas \
    && pip install scikit-learn \
    && pip install cloudpickle \
    && pip install "pandas-datareader>=0.10.0"
```
The preceding Dockerfile is based on the open source package Miniconda, a free minimal installer with a minimal set of packages for data science that allows us to control the details of the packages that we need in our environment.

We install MLflow (our ML platform), along with `numpy` and `scipy` for numerical calculations. `cloudpickle` allows us to easily serialize objects. We will use `pandas` to manage data frames, and `pandas-datareader` to easily retrieve the data from public sources.

3. Import the packages required for the project.

In the following listing, we explicitly import all the libraries that we will use during the execution of the training script: the library to read the data, and the different sklearn modules related to the chosen initial ML model:

```python
import numpy as np
import datetime
import pandas_datareader.data as web
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import mlflow.sklearn
```
For the stock market movement detection problem, we explicitly chose a `RandomForestClassifier` because **it's an extremely versatile and widely accepted baseline model for classification problems**.

4. Acquire your training data.

The code that acquires the Yahoo Finance dataset is intentionally small: we choose a specific interval of 3 months to train our classifier.
The `acquire_training_data` function returns a pandas data frame with the relevant dataset:

```python
def acquire_training_data():
    start = datetime.datetime(2022, 9, 1)
    end = datetime.datetime(2022, 11, 30)
    df = web.DataReader("BTC-USD", 'yahoo', start, end)
    return df
```
5. Make the data usable by scikit-learn.

The data acquired in the preceding step is not directly usable by `RandomForestClassifier`, which works well with categorical features. To bridge this gap, we will transform the raw data into a feature vector using the rolling window technique.

Basically, **the feature vector for each day is built from the deltas of the days in the window that precedes it**. In this case, we use each previous day's market movement (`1` for a stock going up, `0` otherwise):

```python
def digitize(n):
    if n > 0:
        return 1
    return 0

def rolling_window(a, window):
    """
        Takes np.array 'a' and size 'window' as parameters
        Outputs an np.array with all the ordered sequences of values of 'a' of size 'window'
        e.g. Input: ( np.array([1, 2, 3, 4, 5, 6]), 4 )
             Output:
                     array([[1, 2, 3, 4],
                           [2, 3, 4, 5],
                           [3, 4, 5, 6]])
    """
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def prepare_training_data(data):
    """Add the daily price change and the binary up/down label to the data frame."""
    data['Delta'] = data['Close'] - data['Open']
    data['to_predict'] = data['Delta'].apply(digitize)
    return data
```
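
The alignment between windows and labels is easy to get wrong, so here is a small dependency-free sketch of the same idea on plain Python lists. The `make_windows` helper and the toy `movements` series are hypothetical; they only illustrate how each window of past movements is paired with the movement on the day that follows it:

```python
def make_windows(movements, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    features, labels = [], []
    for start in range(len(movements) - window):
        features.append(movements[start:start + window])
        labels.append(movements[start + window])
    return features, labels


# Toy series of daily movements: 1 = up, 0 = otherwise.
movements = [1, 0, 1, 1, 0, 1]
features, labels = make_windows(movements, window=3)

for row, label in zip(features, labels):
    print(row, "->", label)
```

A series of length `n` yields `n - window` samples, because a window with no following day has no label to learn from.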

6. Train and store your model in MLflow.

The following code listing calls the data preparation functions declared previously and runs the training process.

The main execution also explicitly logs the ML model trained in the current execution to the MLflow environment:
```python
if __name__ == "__main__":
    with mlflow.start_run():
        training_data = acquire_training_data()
        prepared_training_data_df = prepare_training_data(training_data)

        # Daily up/down labels as a NumPy array
        # (DataFrame.as_matrix() was removed from pandas; use to_numpy()).
        movements = prepared_training_data_df['to_predict'].to_numpy()

        WINDOW_SIZE = 14
        # Each row of X holds WINDOW_SIZE consecutive movements; the last window
        # is dropped because no following day exists to serve as its label.
        X = rolling_window(movements, WINDOW_SIZE)[:-1, :]
        # Y is the movement on the day right after each window.
        Y = movements[WINDOW_SIZE:]

        X_train, X_test, y_train, y_test = train_test_split(
            X, Y, test_size=.25, random_state=4284, stratify=Y)

        clf = RandomForestClassifier(
            bootstrap=True, criterion='gini', min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50,
            random_state=4284, verbose=0)
        clf.fit(X_train, y_train)
        predicted = clf.predict(X_test)

        # Persist the trained model as an artifact of the current run.
        mlflow.sklearn.log_model(clf, "model_random_forest")

        # Log precision, recall, and F1 separately for each class.
        mlflow.log_metric("precision_label_0", precision_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("recall_label_0", recall_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("f1score_label_0", f1_score(y_test, predicted, pos_label=0))
        mlflow.log_metric("precision_label_1", precision_score(y_test, predicted, pos_label=1))
        mlflow.log_metric("recall_label_1", recall_score(y_test, predicted, pos_label=1))
        mlflow.log_metric("f1score_label_1", f1_score(y_test, predicted, pos_label=1))
```

The `mlflow.sklearn.log_model(clf, "model_random_forest")` method **takes care of persisting the model upon training**. In contrast to the previous example, **we are explicitly asking MLflow to log the model and the metrics that we find relevant**. **This flexibility in the items to log allows one program to log multiple models into MLflow**.
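
To see what the six logged metrics measure, the sketch below computes precision, recall, and F1 for a chosen positive label in plain Python. The `binary_scores` helper and the toy labels are hypothetical; in the project itself, these values come from the scikit-learn functions imported earlier:

```python
def binary_scores(y_true, y_pred, pos_label):
    """Return (precision, recall, f1), treating `pos_label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

    # Design choice for this sketch: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy ground truth and predictions (1 = up, 0 = otherwise).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

for label in (0, 1):
    precision, recall, f1 = binary_scores(y_true, y_pred, pos_label=label)
    print(f"label {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Logging the scores per label, rather than a single accuracy figure, makes it easier to spot a model that only predicts one direction well.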

7. Build your project's Docker image.

In order to build your Docker image, you should run the following command:
```bash
docker build -t stockpred -f dockerfile .
```
This will build the image specified previously and tag it `stockpred`. The image is now available in your local Docker image store, so MLflow can use it in the subsequent steps.

8. Run your project.

To run your MLflow project, execute the following command:
```bash
mlflow run .
```
At this stage, you have **a simple, reproducible baseline of a stock predictor pipeline using MLflow** that you can improve on and easily share with others.

## **Exploring MLflow tracking**

The **MLflow tracking** component is responsible for observability. The main features of this module are the logging of metrics, artifacts, and parameters of an MLflow execution. It provides visualizations and artifact management features.

In a **production setting, it is used as a centralized tracking server implemented in Python that can be shared by a group of ML practitioners in an organization**. This **enables improvements in ML models to be shared within the organization**.

You can run the following command to access the results of your runs:

```bash
mlflow ui
```
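
When a shared tracking server is in use, clients need to be told where it is. MLflow clients read the `MLFLOW_TRACKING_URI` environment variable, so one option is to set it for the process that launches the project. The following is a minimal sketch under that assumption; the `tracking_env` helper and the server URL are hypothetical placeholders:

```python
import os


def tracking_env(server_url, base_env=None):
    """Return a copy of the environment that points MLflow at `server_url`."""
    env = dict(os.environ if base_env is None else base_env)
    env["MLFLOW_TRACKING_URI"] = server_url
    return env


# Placeholder address: replace it with your organization's tracking server.
env = tracking_env("http://tracking.example.com:5000", base_env={"PATH": "/usr/bin"})
print(env["MLFLOW_TRACKING_URI"])
```

Passing such a dictionary as the `env` argument of `subprocess.run(["mlflow", "run", "."], env=...)` should make the run log to the shared server instead of the local `mlruns` folder.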

## **Exploring MLflow Models**

**MLflow Models** is the core component that handles the different model flavors that are supported in MLflow and mediates deployment into different execution environments.

We will now delve into the different models supported in the latest version of MLflow.

As shown earlier in this chapter, MLflow models have a specific serialization approach for when the model is persisted in its internal format. For example, the serialized folder of the model implemented on the **stockpredictor** project would look like the following:
```bash
- MLmodel
- conda.yaml
- model.pkl
```

For a scikit-learn model, MLflow stores two flavors by default: **python_function** and **sklearn**. A flavor is essentially a format that the tools or environments serving the model know how to load.

A good example of this is being able to serve your model without any extra code by executing the following command:
```bash
mlflow models serve -m ./mlruns/0/b9ee36e80a934cef9cac3a0513db515c/artifacts/model_random_forest/
```
You now have access to a very simple web server that can run your model. You can call the model's prediction interface by running the following command:
```bash
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{"data":[[1,1,1,1,0,1,1,1,0,1,1,1,0,0]]}'
# Response: [1]
```
The response to the API call to our model was `1`. As defined by the `to_predict` label, this means that the model predicts the stock will move up in the next reading.
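
The same request can be prepared from Python. The sketch below builds it with the standard library and mirrors the `curl` call above. The `build_invocation_request` helper is hypothetical, and the `{"data": ...}` payload shape is taken from the example above, so it may need adjusting for other MLflow versions. Nothing is sent until the request is passed to `urllib.request.urlopen`:

```python
import json
import urllib.request


def build_invocation_request(rows, url="http://127.0.0.1:5000/invocations"):
    """Build a POST request that carries `rows` as a JSON body."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One row holding the previous 14 daily movements (1 = up, 0 = otherwise).
window = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
request = build_invocation_request([window])
print(request.get_method(), request.full_url)

# With the model server running, the prediction could be fetched like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```

Keeping the request construction separate from the network call makes the payload easy to inspect before a model server is available.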

The final few steps outline how powerful MLflow is as an end-to-end tool for model development, including for the prototyping of REST-based APIs for ML services.

The MLflow Models component allows the creation of custom-made Python modules that will have the same benefits as the built-in models, as long as a prediction interface is followed.

Some of the notable model types supported will be explored in upcoming chapters, including the following:

* XGBoost model format
* R functions
* H2O model
* Keras
* PyTorch
* Sklearn
* Spark MLlib
* TensorFlow
* Fastai

Support for the most prevalent ML types of models, combined with its built-in capability for on-premises and cloud deployment, is one of the strongest features of MLflow Models. We will explore this in more detail in the deployment-related chapters.

## **Exploring MLflow Model Registry**

The model registry component in MLflow **gives the ML developer an abstraction for model life cycle management**. It is a centralized store for an organization or function that allows models in the organization to be shared, created, and archived collaboratively.

Models can be managed through the different MLflow APIs and through the UI. Upon registering the model, you can annotate the registered model with the relevant metadata and manage its life cycle. One example is to have models in a staging pre-production environment and manage the life cycle by sending the model to production.
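
The life cycle described above can be pictured as a small state machine. The following sketch is an in-memory illustration only and does not use the MLflow API: the `ToyRegistry` class is hypothetical, and the stage names are modeled on the `None`, `Staging`, `Production`, and `Archived` stages that the MLflow registry has traditionally exposed:

```python
STAGES = ("None", "Staging", "Production", "Archived")


class ToyRegistry:
    """In-memory stand-in for a model registry (illustration only, not MLflow)."""

    def __init__(self):
        self._versions = {}  # (name, version) -> stage

    def register(self, name):
        """Add a new version of `name` and return its version number."""
        version = 1 + sum(1 for (n, _) in self._versions if n == name)
        self._versions[(name, version)] = "None"
        return version

    def transition(self, name, version, stage):
        """Move one version to `stage`, archiving the version that held it before."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        if (name, version) not in self._versions:
            raise KeyError(f"{name} version {version} is not registered")
        if stage in ("Staging", "Production"):
            # Design choice for this sketch: one active version per stage.
            for key, current in self._versions.items():
                if key[0] == name and current == stage:
                    self._versions[key] = "Archived"
        self._versions[(name, version)] = stage

    def stage_of(self, name, version):
        return self._versions[(name, version)]


registry = ToyRegistry()
v1 = registry.register("stockpred")
registry.transition("stockpred", v1, "Staging")
registry.transition("stockpred", v1, "Production")

v2 = registry.register("stockpred")
registry.transition("stockpred", v2, "Production")
print(registry.stage_of("stockpred", v1), registry.stage_of("stockpred", v2))
```

Here, promoting the second version to production archives the first one, which is one possible policy for keeping a single production model per name.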