# MLFlow

> __MLflow is an open source platform for managing the end-to-end machine learning lifecycle.__

`mlflow` provides a few components we can use to manage our machine learning experimentation and deployment:

- Packaging ML code for reproducibility/sharing (dependencies, models etc.)
- Tracking experiments (compare results between runs)
- Deploying models to various inference platforms
- Central model store for MLFlow models (model versioning, stage transitions, annotations)

> __MLFlow provides integrations with popular frameworks like `sklearn`, `pytorch` or `spark`__

> __MLFlow provides APIs not only for `python`, but also for `R` or `java`__ (and REST API/CLI interface for other languages)

We will focus on Python and shell usage, but keep the above in mind.

## Installation 

As `mlflow` is written in Python, we can install it with `pip`, `conda` or another package manager.

If you have `pytorch` or other integration installed, it will be picked up automatically by `mlflow` (no need for extras this time).

In [2]:
!pip install mlflow



## Projects

> __MLFlow Projects are mainly a CONVENTION for organizing and describing your code so that others (people, automation pipelines) can easily run it__

Projects are usually `git` repositories and allow you to specify (in varying levels of detail) the required environment (either `conda` or `docker`; a `system` environment can also be specified, but this is discouraged) via:
- directory structure
- `MLproject` file in git's root directory

Note:
- the `MLproject` file should be a YAML file, but it should have no extension
  - save it as `MLproject`, not `MLproject.yaml`

### Directories

> Structuring our code via directories is enough to create a basic `MLFlow` project, __but specifying `MLproject` is a better option__

If there is no `MLproject` file, the following defaults apply:
- __Name of the project__ - name of the project's root directory (e.g. git's root)
- __Conda environment__ - taken from `conda.yaml`, if it is available in the root
- __Any `.py`/`.sh` file in the project can be an entry point__ (more about running projects later)
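
The three defaults above can be sketched in a few lines of Python. This is only an illustration of the convention, not MLflow's own implementation; the helper name `describe_project` and the shape of its return value are made up for this example.

```python
from pathlib import Path


def describe_project(root):
    """Summarise what the directory convention alone tells us about a project.

    Hypothetical helper mirroring the three rules listed above;
    it is NOT how MLflow itself is implemented.
    """
    root = Path(root).resolve()
    conda_file = root / "conda.yaml"
    # Any .py or .sh file anywhere in the project may serve as an entry point
    entry_points = sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if path.is_file() and path.suffix in {".py", ".sh"}
    )
    return {
        "name": root.name,  # project name = name of the root directory
        "conda_env": conda_file.name if conda_file.is_file() else None,
        "entry_points": entry_points,
    }
```

Calling `describe_project(".")` from a project's root returns its name, whether a `conda.yaml` was found, and every script that could be used as an entry point.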

One can obtain `conda.yaml` file via a simple command (provided you are inside the conda environment while running it):

```bash
conda env export [--from-history] > conda.yaml
```

`--from-history` exports only the packages you have explicitly installed. This has two effects:
- Portability across operating systems (OS-specific dependencies are not pinned in the file, so they are resolved on the target machine)
- Not fully reproducible (the resolved dependency versions may differ between machines)

__In general it should be safe to use the `--from-history` flag for increased portability of projects__

### Using the MLproject file

> A better option is to explicitly specify entry points, structure, parameters etc. via the `MLproject` file

Here is an example `MLproject`:


```yml
---
name: My Project

conda_env: my_env.yaml

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"
```

Check out more examples in the MLflow repository [here](https://github.com/mlflow/mlflow/tree/master/examples)

As you can see one can:

1. __specify environment explicitly:__
    - `conda` (simply a file with dependencies)
    - `docker` environment:
        - specify image available on the OS
        - if image is not available, try to pull it from `DockerHub`
        - if registry containing image is specified it will try to pull it (unless it's already available on the system)


For `docker` environment one can also specify:
- volumes to be mounted during project running
- environment variables passed to the container

See an example of `docker_env` below:

```yml
---
name: My Project

docker_env:
  image:  mlflow-docker-example

entry_points:
  main:
    parameters:
      data_file: path
      regularization: {type: float, default: 0.1}
      p: float
    command: "python train.py -r {regularization} {data_file}"
  validate:
    parameters:
      data_file: path
    command: "python validate.py {data_file}"
```

[See more here](https://www.mlflow.org/docs/latest/projects.html#mlproject-file)

2. __Specify parameters and entrypoints__

One can specify:
- name of the parameter
- type of the parameter (default is `string`, others are `float`, `path`, `uri`)
- default value(s)


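Those values are later passed on to the `command` field and substituted for the matching `{placeholders}`.
__Any parameter supplied at run time that is not declared in the `parameters` section is passed to the running command via `--key value` syntax__

The `docker_env` example above shows all of these forms: a parameter with only a type, a parameter with a type and a default, and a parameter (`p`) that does not appear in `command`.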
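
To make the substitution rules concrete, here is a minimal sketch of how a `command` template could be filled in. It is an approximation written for this tutorial, not MLflow's code: `build_command` and its arguments are hypothetical names, and type handling (`path`, `uri`) is left out.

```python
import shlex


def build_command(template, declared, supplied):
    """Fill an entry point's command template (illustrative sketch only).

    template: the `command` string from the MLproject file
    declared: {name: default or None} for entries under `parameters`
    supplied: {name: value} given by the user at run time
    """
    values = {}
    for name, default in declared.items():
        if name in supplied:
            values[name] = supplied[name]
        elif default is not None:
            values[name] = default
        else:
            raise ValueError(f"missing required parameter: {name}")
    # Substitute declared parameters into the {placeholders}
    command = template.format(**{k: shlex.quote(str(v)) for k, v in values.items()})
    # Parameters that were not declared are appended as `--key value`
    extras = [
        f"--{key} {shlex.quote(str(value))}"
        for key, value in supplied.items()
        if key not in declared
    ]
    return " ".join([command, *extras])


declared = {"data_file": None, "regularization": 0.1}
template = "python train.py -r {regularization} {data_file}"
print(build_command(template, declared, {"data_file": "data.csv", "epochs": 5}))
# python train.py -r 0.1 data.csv --epochs 5
```

Here `regularization` falls back to its default, `data_file` is required, and the undeclared `epochs` ends up as `--epochs 5`.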

### Running projects

> `MLFlow` provides a command line program, `mlflow`, whose `run` subcommand allows us to run the project


Usage is really simple:

```bash
mlflow run <directory>
```

but there are a few useful tricks which allow us to run it with even less effort:

In [5]:
!mlflow run --help

Usage: mlflow run [OPTIONS] URI

  Run an MLflow project from the given URI.

  For local runs, the run will block until it completes. Otherwise, the
  project will run asynchronously.

  If running locally (the default), the URI can be either a Git repository
  URI or a local path. If running on Databricks, the URI must be a Git
  repository.

  By default, Git projects run in a new working directory with the given
  parameters, while local projects run from the project's root directory.

Options:
  -e, --entry-point NAME        Entry point within project. [default: main].
                                If the entry point is not found, attempts to
                                run the project file with the specified name
                                as a script, using 'python' to run .py files
                                and the default shell (specified by
                                environment variable $SHELL) to run .sh files

  -v, --version VERSION         Version o

For example we could do this (run in your CLI or in your cell):

In [7]:
!mlflow run https://github.com/mlflow/mlflow-example -P alpha=0.5

2021/07/04 18:41:53 INFO mlflow.projects.utils: === Fetching project from https://github.com/mlflow/mlflow-example into /var/folders/3z/29w5rr9d0k3_p863hm40sdnc0000gn/T/tmpyo68dla4 ===
2021/07/04 18:41:54 INFO mlflow.projects.utils: === Created directory /var/folders/3z/29w5rr9d0k3_p863hm40sdnc0000gn/T/tmp0f1l27x8 for downloading remote URIs passed to arguments of type 'path' ===
2021/07/04 18:41:54 INFO mlflow.projects.backend.local: === Running command 'source /Users/ice/miniconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-1abc00771765dd9dd15731cbda4938c765fbb90b 1>&2 && python train.py 0.5 0.1' in run with ID '3fedea0d61ba42ecafd5ed87791d507c' === 
  from collections import Sequence
  from collections import Iterable
  from collections import Mapping, namedtuple, defaultdict, Sequence
Elasticnet model (alpha=0.500000, l1_ratio=0.100000):
  RMSE: 0.7947931019036529
  MAE: 0.6189130834228138
  R2: 0.18411668718221819
2021/07/04 18:41:56 INFO mlflow.projects: === Run (ID '
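
When runs are launched from scripts rather than typed by hand, the same command line can be assembled programmatically. The sketch below only builds the argument list using the `-e` and `-P` options shown above; `mlflow_run_command` is a hypothetical helper, and actually executing the command is left as a comment.

```python
import shlex


def mlflow_run_command(uri, entry_point="main", params=None):
    """Return the `mlflow run` invocation as an argument list."""
    args = ["mlflow", "run", uri, "-e", entry_point]
    for key, value in (params or {}).items():
        # Each parameter is passed as a separate -P key=value pair
        args += ["-P", f"{key}={value}"]
    return args


args = mlflow_run_command(
    "https://github.com/mlflow/mlflow-example", params={"alpha": 0.5}
)
print(shlex.join(args))
# mlflow run https://github.com/mlflow/mlflow-example -e main -P alpha=0.5

# To actually launch the run (requires mlflow to be installed):
# import subprocess; subprocess.run(args, check=True)
```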

## Experiment tracking

> __Tracking is an API and UI which allows us to log an experiment's data and later visualize it__

Using it we can log:
- model parameters
- code versions (git commit hashes)
- metrics
- generated artifacts

__`mlflow` tracking is organized around runs; a run is simply a single execution of our code__.

Each run is recorded by `mlflow` to one of:
- local files
- an SQLAlchemy-compatible database
- a remote tracking server (via the [`mlflow.set_tracking_uri()`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.set_tracking_uri) function)

For more information about storage [check out relevant part of documentation](https://mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded).

> __With `MLFlow` we can track and version every stage of a comprehensive experiment, from ETL to deployment__

There are a few main concepts to keep in mind when using it:
- __experiment__ - mainly [`mlflow.set_experiment(UNIQUE_NAME_OF_EXPERIMENT)`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.set_experiment), which sets the current experiment and creates it if it doesn't exist.
- __run__ - a single run; an experiment can consist of many of them. Started with the context manager [`mlflow.start_run()`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.start_run)
- __logging__ - logging data from an experiment; here are the related functions:
    - [`mlflow.log_param(key, value)`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_param) - logs hyperparameters and other settable parameters under current run
    - [`mlflow.log_metric(key, value)`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_metric)
    - [`mlflow.log_artifact(local_path)`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_artifact) - logs created file (e.g. models, generated text etc.) under the current run
    
Given the above, let's see how to run and log a __non-flavored__ (i.e. without specific integrations) dummy experiment:

In [14]:
import mlflow


def create_dummy_file():
    features = "rooms, zipcode, median_price, school_rating, transport"
    with open("features.txt", "w") as f:
        f.write(features)


create_dummy_file()

# Create experiment (artifact_location=./mlruns by default)
mlflow.set_experiment("Dummy Experiments")

# The experiment set above is used by default
with mlflow.start_run():
    mlflow.log_artifact("features.txt")
    mlflow.log_param("learning_rate", 0.01)
    for i in range(10):
        mlflow.log_metric("Iteration", i, step=i)

To visualize and explore the saved data, we can run the `mlflow ui` command and open a web browser at [`http://localhost:5000`](http://localhost:5000) (__the data is stored inside `./mlruns`__)

Run below in the terminal:

In [8]:
# !mlflow ui --help

After navigating to the experiment, we can see the `Iteration` metric being logged, as shown below:

![](images/mlflow_ui.png)
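
The same data can also be inspected without the UI, since the local store is just a directory tree. The sketch below assumes the local file store layout `mlruns/<experiment_id>/<run_id>/metrics/<metric_name>`, with one `timestamp value step` triple per line; this layout is an assumption about the file-based backend and may differ between MLflow versions. The helper names are hypothetical.

```python
from pathlib import Path


def read_metric(run_dir, key):
    """Return [(step, value), ...] for one metric of a run.

    Assumes each line of <run_dir>/metrics/<key> reads `timestamp value step`.
    """
    points = []
    for line in (Path(run_dir) / "metrics" / key).read_text().splitlines():
        _timestamp, value, step = line.split()
        points.append((int(step), float(value)))
    return sorted(points)


def list_runs(store="mlruns"):
    """Yield every run directory found as <store>/<experiment_id>/<run_id>/."""
    for metrics_dir in sorted(Path(store).glob("*/*/metrics")):
        yield metrics_dir.parent
```

For the dummy experiment above, `read_metric(run_dir, "Iteration")` for a `run_dir` yielded by `list_runs()` would return the ten logged `(step, value)` pairs.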

## Model format

> MLFlow provides a standard format for saving machine learning models (from various libraries), which helps us with model usage (e.g. inference via REST API, in the cloud etc.)

MLFlow models consist of:
- a directory with arbitrary files (defined by the model)
- an `MLmodel` file (written in YAML) which specifies what is contained within the model

Let's see how to save our model (in this case `sklearn`) in Python...

In [9]:
# `model` is assumed to be an already fitted scikit-learn estimator
mlflow.sklearn.save_model(model, "my_model")

which creates the following directory in our `cwd`:

```bash
my_model/
├── MLmodel
└── model.pkl
```

The contents of the `MLmodel` file are equally easy to grasp:

```yml
---
time_created: 2021-04-03T17:28:53.35

flavors:
  sklearn:
    sklearn_version: 0.24.1
    pickled_model: model.pkl
  python_function:
    loader_module: mlflow.sklearn
```

### Model signature

In order to deploy a model (and sometimes even to run it, as in `tensorflow`) we need to specify its __model signature__

> __Model signature specifies type and shape of inputs going through the model__

- Standard casting rules apply (upcasting is fine, downcasting would raise an error)
- Helps with reading inputs when they are sent as JSON via REST API or similar

We can add it to the `MLmodel` file; the two ways of doing so are shown below:

#### Column signature

> Specify input signature by specifying each possible column input

This mode is supported by all flavors (frameworks), although it is not always the most convenient option.

Example for `iris` dataset:

```yaml
signature:
    inputs: '[{"name": "sepal length (cm)", "type": "double"}, {"name": "sepal width
      (cm)", "type": "double"}, {"name": "petal length (cm)", "type": "double"}, {"name":
      "petal width (cm)", "type": "double"}]'
    outputs: '[{"type": "integer"}]'
```
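
Since `inputs` is stored as a JSON string, it can be read back with the standard library. The sketch below uses the iris signature from above to check which required columns a record is missing; `missing_columns` is a hypothetical helper for illustration and not part of MLflow's API.

```python
import json

# The `inputs` entry of the iris signature above, as stored in MLmodel
inputs_field = (
    '[{"name": "sepal length (cm)", "type": "double"}, '
    '{"name": "sepal width (cm)", "type": "double"}, '
    '{"name": "petal length (cm)", "type": "double"}, '
    '{"name": "petal width (cm)", "type": "double"}]'
)


def missing_columns(inputs_json, record):
    """Return the signature columns that are absent from `record`."""
    columns = json.loads(inputs_json)
    return [col["name"] for col in columns if col["name"] not in record]


record = {"sepal length (cm)": 5.1, "sepal width (cm)": 3.5}
print(missing_columns(inputs_field, record))
# ['petal length (cm)', 'petal width (cm)']
```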

#### Tensor signature

> Specify input for deep learning inputs (e.g. images) via tensor shape

Image oriented example:

```yaml
signature:
    inputs: '[{"name": "images", "dtype": "uint8", "shape": [-1, 28, 28, 1]}]'
    outputs: '[{"shape": [-1, 10], "dtype": "float32"}]'
```
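
In a tensor signature a dimension of `-1` stands for a variable-sized axis (typically the batch size). A minimal sketch of checking a concrete shape against the image signature above might look like this; `shape_matches` is a hypothetical helper, not an MLflow function.

```python
import json

inputs_field = '[{"name": "images", "dtype": "uint8", "shape": [-1, 28, 28, 1]}]'


def shape_matches(spec_shape, actual_shape):
    """Check an actual shape against a spec where -1 means 'any size'."""
    if len(spec_shape) != len(actual_shape):
        return False
    return all(s == -1 or s == a for s, a in zip(spec_shape, actual_shape))


spec = json.loads(inputs_field)[0]
print(shape_matches(spec["shape"], (32, 28, 28, 1)))  # True
print(shape_matches(spec["shape"], (32, 28, 28, 3)))  # False
```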

#### Inferring input shapes

Often it is easier (and less error-prone) to infer the `dtype` and shape from our code. One can easily do this via [`mlflow.models.infer_signature`](https://mlflow.org/docs/latest/python_api/mlflow.models.html#mlflow.models.infer_signature).

Check out the code below for an example:

In [11]:
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature

iris = datasets.load_iris()
iris_train = pd.DataFrame(iris.data, columns=iris.feature_names)
clf = RandomForestClassifier(max_depth=7, random_state=0)
clf.fit(iris_train, iris.target)
signature = infer_signature(iris_train, clf.predict(iris_train))

mlflow.sklearn.log_model(clf, "iris_rf", signature=signature)

`infer_signature` is really simple:
- Pass the input data as the first argument (usually `torch.Tensor`, `pd.DataFrame`, `np.ndarray` or other standard types)
- Pass the model's output for that data as the second argument - this will create `outputs` automatically


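`mlflow.sklearn.log_model` logs the model, together with the signature we specified, as an artifact of the current run under the artifact path `iris_rf` (a run is started automatically if none is active).
We could later load it back via a `runs:/` URI (__the loading function has to match the flavor we saved it in!__):

In [None]:
# Load the sklearn model back from the run it was logged to
# (replace <run_id> with the ID of that run, visible in `mlflow ui`)
sklearn_model = mlflow.sklearn.load_model("runs:/<run_id>/iris_rf")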

## Deploying models

Once we have our model saved, we can easily deploy it to various services, namely:
- [locally](https://www.mlflow.org/docs/latest/models.html#deploy-mlflow-models) with a REST API (either inside a `docker` container or within a `conda` environment)
- [Microsoft's Azure ML](https://www.mlflow.org/docs/latest/models.html#deploy-a-python-function-model-on-microsoft-azure-ml)
- [Amazon SageMaker](https://www.mlflow.org/docs/latest/models.html#deploy-a-python-function-model-on-amazon-sagemaker)
- [Apache Spark UDF](https://www.mlflow.org/docs/latest/models.html#export-a-python-function-model-as-an-apache-spark-udf)
- Others, via community-maintained deployment plugins (for example `torchserve`), check out [here](https://www.mlflow.org/docs/latest/plugins.html#deployment-plugins)

Let's look at the `mlflow models` command:

In [10]:
!mlflow models --help

Usage: mlflow models [OPTIONS] COMMAND [ARGS]...

  Deploy MLflow models locally.

  To deploy a model associated with a run on a tracking server, set the
  MLFLOW_TRACKING_URI environment variable to the URL of the desired server.

Options:
  --help  Show this message and exit.

Commands:
  build-docker  **EXPERIMENTAL**: Builds a Docker image whose default...
  predict       Generate predictions in json format using a saved MLflow...
  prepare-env   **EXPERIMENTAL**: Performs any preparation necessary to...
  serve         Serve a model saved with MLflow by launching a webserver on...


### models build-docker

> This subcommand creates a docker image and places our model inside it

After this we can serve the model by running the created image (by default port `8080` is exposed, so we can easily map it).

Let's see this command in more detail:

In [None]:
!mlflow models build-docker --help

__`python_function` is the default flavor and every specific integration is compatible with it__ (see more details [here](https://www.mlflow.org/docs/latest/python_api/mlflow.pyfunc.html))

### models serve

> Runs a basic webserver (created via `flask`) which we can query (e.g. using `curl`)

We can specify (amongst other things):
- `--model-uri` - model resource (mandatory)
- `--workers` - number of parallel workers handling requests
- `--port` - on which port the server will listen for requests

In [None]:
!mlflow models serve --help

### models predict

> Allows us to query a model with a file (`.csv` or `.json`) (__useful for testing!__)

Let's see the possibilities:

In [None]:
!mlflow models predict --help

## Querying deployed model

Once we have deployed the model (via `docker` or the `flask` webserver) we can query it, either from other machines or from `localhost`.

Requests are done via sending `json` text strings to `/invocations` endpoint. There are a few possibilities to send the data:
- JSON-serialized pandas DataFrames in the split orientation (`data = pandas_df.to_json(orient='split')`)
- JSON-serialized pandas DataFrames in the records orientation (discouraged)
- CSV-serialized pandas DataFrames (`data = pandas_df.to_csv()`)
- Tensor input formatted as described in TF Serving’s API docs where the provided inputs will be cast to Numpy arrays

Examples of these are shown below (please note the `Content-Type` header used for each format):

In [None]:
# split-oriented DataFrame input
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{
    "columns": ["a", "b", "c"],
    "data": [[1, 2, 3], [4, 5, 6]]
}'

# record-oriented DataFrame input (fine for vector rows, loses ordering for JSON records)
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json; format=pandas-records' -d '[
    {"a": 1,"b": 2,"c": 3},
    {"a": 4,"b": 5,"c": 6}
]'

# numpy/tensor input using TF serving's "instances" format
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{
    "instances": [
        {"a": "s1", "b": 1, "c": [1, 2, 3]},
        {"a": "s2", "b": 2, "c": [4, 5, 6]},
        {"a": "s3", "b": 3, "c": [7, 8, 9]}
    ]
}'
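
The same requests can be made from Python. The sketch below builds the split-oriented request from the first `curl` example using only the standard library; it mirrors the payload format and address used above, the helper `build_invocation_request` is hypothetical, and the request is only sent if you uncomment the last lines with a model server running.

```python
import json
import urllib.request


def build_invocation_request(columns, rows, host="http://127.0.0.1:5000"):
    """Build (without sending) a POST request carrying split-oriented data."""
    payload = json.dumps({"columns": columns, "data": rows}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/invocations",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_invocation_request(["a", "b", "c"], [[1, 2, 3], [4, 5, 6]])
print(request.full_url)

# With a model server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```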

We could also encode more complex data before sending the request (e.g. images could be encoded using `base64` and automatically decoded by MLFlow):

In [None]:
# record-oriented DataFrame input with binary column "b"
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json; format=pandas-records' -d '[
    {"a": 0, "b": "dGVzdCBiaW5hcnkgZGF0YSAw"},
    {"a": 1, "b": "dGVzdCBiaW5hcnkgZGF0YSAx"},
    {"a": 2, "b": "dGVzdCBiaW5hcnkgZGF0YSAy"}
]'

# record-oriented DataFrame input with datetime column "b"
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json; format=pandas-records' -d '[
    {"a": 0, "b": "2020-01-01T00:00:00Z"},
    {"a": 1, "b": "2020-02-01T12:34:56Z"},
    {"a": 2, "b": "2021-03-01T00:00:00Z"}
]'
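
The binary column in the first example above holds base64 text (the strings decode to `test binary data 0`, `test binary data 1` and so on). A minimal sketch of producing such a payload in Python, with the hypothetical helper `encode_records`, could be:

```python
import base64
import json


def encode_records(records, binary_column):
    """Return JSON records with the bytes in `binary_column` base64-encoded."""
    encoded = [
        {**row, binary_column: base64.b64encode(row[binary_column]).decode("ascii")}
        for row in records
    ]
    return json.dumps(encoded)


rows = [{"a": i, "b": f"test binary data {i}".encode()} for i in range(3)]
print(encode_records(rows, "b"))
```

The printed records carry the same `b` values as the `curl` example and can be sent as the request body with the `pandas-records` content type.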

In summary, we've seen how MLFlow can be used to track experiments and deploy models.