# Introduction


## Who are we ?

- [**Data scientists**]{.orange} at Insee
    - methodological and IT innovation teams
    - support data science projects

- [**Contact us**]{.orange}
  - <romain.avouac@insee.fr>
  - <thomas.faria@insee.fr>
  - <tom.seimandi@insee.fr>

## Context

- Difficulty of transitioning from experiments to [**production-grade**]{.orange} machine learning systems

- Leverage [**best practices**]{.orange} from software engineering
  - Improve [**reproducibility**]{.blue2} of analysis
  - [**Deploy**]{.blue2} applications in a [**scalable**]{.blue2} way

## The DevOps approach

- [**Unify**]{.orange} development (*dev*) and system administration (*ops*)
  - [**shorten**]{.blue2} development time
  - maintain software [**quality**]{.blue2} 

. . .

![](img/devops.png){fig-align="center" height=300}

## The MLOps approach

- Integrate the [**specificities**]{.orange} of machine learning projects
  - [**Experimentation**]{.blue2}
  - [**Continuous improvement**]{.blue2}

. . .

![](img/mlops.png){fig-align="center" height=400}

## MLOps : principles

- [**Reproducibility**]{.orange}

- [**Versioning**]{.orange}

- [**Automation**]{.orange}

- [**Monitoring**]{.orange}

- [**Collaboration**]{.orange}

## Why MLflow ?

- Multiple [**frameworks**]{.orange} implement the MLOps principles

- Pros of `MLflow`
  - [**Open-source**]{.blue2}
  - Covers the whole [**ML lifecycle**]{.blue2}
  - [**Agnostic**]{.blue2} to the ML library used
  - We have [**experience**]{.blue2} with it

## Training platform : the SSP Cloud

- An [**open innovation production-like**]{.orange} environment
  - [**Kubernetes**]{.blue2} cluster
  - S3-compatible [**object storage**]{.blue2}
  - Large computational [**resources**]{.blue2} (including GPUs)

- Based on the [Onyxia](https://github.com/InseeFrLab/onyxia-web) project
  - User-friendly [interface](https://datalab.sspcloud.fr/) to launch data science services
  - A [catalog of services](https://datalab.sspcloud.fr/catalog/ide) which covers the full lifecycle of data science projects

## Outline

:one: Introduction to MLflow

. . .

:two: Deploying a model as an API

. . .

:three: Distributing the hyperparameter optimization



## Application 0

:::{.callout-tip collapse="true" icon=false}
## Preparation of the working environment

:::{.incremental}
1. Create an account on the [SSP Cloud](https://datalab.sspcloud.fr/home) using your professional email address
2. Launch an `MLflow` service by clicking [this URL](https://datalab.sspcloud.fr/launcher/automation/mlflow?autoLaunch=true)
3. Launch a `VSCode` service by clicking [this URL](https://datalab.sspcloud.fr/launcher/ide/vscode-python?autoLaunch=true&init.personalInit=«https%3A%2F%2Fraw.githubusercontent.com%2FInseeFrLab%2Fformation-mlops%2Fmain%2Finit.sh»)
4. Open the `VSCode` service and input the service password (either automatically copied or available in the `README` of the service)
5. You're all set !
:::

:::





# Introduction to MLflow

## Tracking server

- "An [**API**]{.orange} and [**UI**]{.orange} for [**logging**]{.orange} parameters, code versions, metrics, and artifacts"

. . .

![](img/mlflow-tracking.png){fig-align="center" height=400}

## Projects

- "A standard format for [**packaging**]{.orange} reusable data science code"

. . .

![](img/mlflow-projects.png){fig-align="center" height=400}

## Models

- "A convention for [**packaging**]{.orange} machine learning [**models**]{.orange} in multiple [**flavors**]{.orange}"

. . .

![](img/mlflow-models.png){fig-align="center" height=400}

## Model registry

- "A [**centralized model store**]{.orange}, set of APIs, and UI, to [**collaboratively**]{.orange} manage the full lifecycle of an MLflow Model"

. . .

![](img/mlflow-model-registry.png){fig-align="center" height=400}



## Application 1

:::{.callout-tip collapse="true" icon=false}
## Introduction to MLflow concepts

:::{.incremental}
1. In `VSCode`, open the notebook `mlflow-introduction.ipynb` (from the `notebooks` directory)
2. Choose our custom `Python` kernel : 
    + `Select Kernel -> Python environments... -> base (Python 3.x.x)`
3. Execute the notebook cell by cell. Try to understand carefully how the `Python` session interacts with the `MLflow` API. Explore the `MLflow` UI and try to build your own experiments from the example code provided in the notebook.
:::

:::





# A Practical Example: NACE Code Prediction for French Companies

## Context

- [**NACE**]{.orange}
  - European standard classification of productive [**economic activities**]{.blue2}
  - [**Hierarchical structure**]{.blue2} with 4 levels and 615 codes

- At Insee, previously handled by an outdated [**rule-based**]{.orange} algorithm

- A [**common problem**]{.orange} for all national statistical institutes

## Data used {.scrollable}

::: {.panel-tabset}

### Slide 

- A simple use-case with only [**2 variables**]{.orange}:
  - [**Textual description**]{.blue2} of the activity – [text]{.green2}
  - [**True NACE code**]{.blue2} labelled by the rule-based engine – [nace]{.green2} (732 categories)

- Standard [**preprocessing**]{.orange}:
  - lowercasing
  - punctuation removal
  - number removal
  - stopwords removal
  - stemming
  - ...
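
A minimal sketch of these cleaning steps, using only the standard library. The stopword list is a tiny illustrative assumption and stemming is left out; the `Preprocessor` class in `src` may work differently.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not the list used in the training)
STOPWORDS = {"de", "des", "du", "la", "le", "les", "et", "en"}


def clean_text(text: str) -> str:
    """Lowercase, then drop punctuation, digits and stopwords."""
    text = text.lower()
    # Replace every ASCII punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Remove digits
    text = re.sub(r"\d+", " ", text)
    # Drop stopwords and collapse repeated whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("COIFFEUR"))
print(clean_text("coiffeur, & 98789"))
```

Both calls print the same cleaned string, which is why two differently written descriptions can end up with the same prediction.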


### Raw

```{ojs}
viewof table_data = Inputs.table(transpose(data_raw), {
    rows: 22
})
```

### Preprocessed

```{ojs}
viewof table_data_prepro = Inputs.table(transpose(data_prepro), {
    rows: 22
})
```

:::

## MLflow with a non-standard framework

::: {.nonincremental}

:::: {.fragment fragment-index=1}
- [**Easy to use**]{.orange} with a variety of machine learning frameworks (scikit-learn, Keras, PyTorch...)
::::

:::: {.fragment fragment-index=2}
```python
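import mlflow

# Log a scikit-learn pipeline as an MLflow model
mlflow.sklearn.log_model(pipe_rf, "model")

# Load a registered version back through the generic pyfunc interface
model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{version}")
y_train_pred = model.predict(X_train)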
```
::::

:::: {.fragment fragment-index=3}
- What if we require greater [**flexibility**]{.orange} or our [**own framework**]{.orange}?
::::

:::: {.fragment fragment-index=4}
- Possibility to [**track**]{.orange}, [**register**]{.orange} and [**deliver**]{.orange} your own model
::::

:::

## MLflow with a non-standard framework

::: {.nonincremental}

:::: {.fragment fragment-index=1}
- There are [**2 main differences**]{.orange} when using your own framework:
  - [**logging**]{.blue2} of parameters, metrics and artifacts
  - [**wrapping**]{.blue2} of your custom model so that MLflow can serve it
::::

:::: {.fragment fragment-index=2}
```python
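# Define a custom model that MLflow can log, register and serve
class MyModel(mlflow.pyfunc.PythonModel):

    def load_context(self, context):
        # Called once at load time: restore the model from its logged artifact
        self.my_model.load_model(context.artifacts["my_model"])

    def predict(self, context, model_input):
        # Called for every prediction request
        return self.my_model.predict(model_input)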
```
::::

:::
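
To make the wrapping idea concrete, here is a dependency-free sketch that only mimics the `load_context` / `predict` interface of `mlflow.pyfunc.PythonModel`. `KeywordModel`, its labels and the `_clean` helper are hypothetical stand-ins, not the code in `src`; the point is that preprocessing placed inside `predict` travels with the model.

```python
import re


class KeywordModel:
    """Hypothetical stand-in for a model trained with your own framework."""

    labels = {"coiffeur": "hairdressing", "boulanger": "bakery"}

    def predict(self, texts):
        return [self.labels.get(text, "unknown") for text in texts]


class WrappedModel:
    """Mimics the interface of mlflow.pyfunc.PythonModel without importing MLflow."""

    def load_context(self, context):
        # With MLflow, the model would be restored from context.artifacts
        self.model = KeywordModel()

    def _clean(self, text):
        # Preprocessing shipped with the model: lowercase and keep letters only
        return " ".join(re.findall(r"[^\W\d_]+", text.lower()))

    def predict(self, context, model_input):
        cleaned = [self._clean(text) for text in model_input["query"]]
        return self.model.predict(cleaned)


wrapper = WrappedModel()
wrapper.load_context(context=None)
predictions = wrapper.predict(None, {"query": ["COIFFEUR", "coiffeur, & 98789"]})
print(predictions)
```

Because the cleaning step lives in the wrapper, callers can send raw text and do not need to reproduce the preprocessing on their side.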

<!-- By creating a class that inherits from mlflow.pyfunc.PythonModel, you are essentially creating a wrapper around your custom model that allows it to be used with the MLflow platform. The mlflow.pyfunc.PythonModel class provides a standardized interface that makes it easy to integrate your custom model with the rest of the MLflow platform. -->

## Necessity of avoiding notebooks for production deployment

- Arguments against using notebooks for [**ML model deployment**]{.orange}:
  - Limited scalability for [**automation**]{.blue2} of ML pipelines.
  - Lack of clear and [**reproducible**]{.blue2} workflows.
  - Hinders [**collaboration**]{.blue2} and [**versioning**]{.blue2} among team members.
  - Insufficient [**modularity**]{.blue2} for managing complex ML components.


```{python}
import sys
sys.path.append("../src/")

import pandas as pd
import s3fs
import pyarrow.parquet as pq
from constants import TEXT_FEATURE, DATA_PATH
from preprocessor import Preprocessor

preprocessor = Preprocessor()
fs = s3fs.S3FileSystem(
    client_kwargs={"endpoint_url": "https://minio.lab.sspcloud.fr"}
)
df = pq.ParquetDataset(DATA_PATH, filesystem=fs).read_pandas().to_pandas()
df = df.sample(frac=0.001, random_state=0)

df_prepro = preprocessor.clean_text(df, TEXT_FEATURE)

ojs_define(data_raw = df, data_prepro = df_prepro)
```

## Application 2 {.scrollable}

:::{.nonincremental}
:::: {.callout-tip collapse="true" icon=false}
## Part 1 : From notebook to Python scripts


1. All scripts related to our custom model are stored in the `src` folder. Check them out. Have a look at the `MLproject` file as well.
2. Run a training of the model using MLflow. To do so:

    ```sh
    mlflow run ~/work/formation-mlops/ --env-manager=local \
    -P remote_server_uri=$MLFLOW_TRACKING_URI \
    -P experiment_name="fasttext"
    ```

3. Look at the results of your previous run:
    + `Experiments -> fasttext -> <run_name>`
4. You have trained the model with some default parameters. In `MLproject`, check the available parameters. Re-train a model with different parameters (e.g. `dim = 25`).

<details>
<summary>
    <font size="3" color="darkgreen"><b>Click to see the command </b></font>
</summary>

```sh
mlflow run ~/work/formation-mlops/ --env-manager=local \
-P remote_server_uri=$MLFLOW_TRACKING_URI \
-P experiment_name="fasttext" \
-P dim=25
```

</details>


5. In MLflow, compare the 2 models by plotting the accuracy against one parameter you have changed (e.g. `dim`)
    + `Select the 2 runs -> Compare -> Scatter Plot -> Select your X and Y axis`
::::
:::


## Application 2 {.scrollable}

:::{.nonincremental}
:::: {.callout-tip collapse="true" icon=false}
## Part 2 : Model delivery with embedded preprocessing

1. Explore the `src/train.py` file carefully. What are the main differences compared with Application 1 ?
2. Why can we say that the MLflow model embeds the preprocessing ?
3. In MLflow, register your last model
4. Open and run the notebook `mlflow-custom-model.ipynb` which loads your model from the model store.
5. Read the documentation of the `predict()` function of the custom class (`src/fasttext_wrapper.py`)
6. Make a prediction with the model

<details>
<summary>
    <font size="3" color="darkgreen"><b>Click to see the command </b></font>
</summary>

```python
list_libs = ["vendeur d'huitres", "boulanger"]

test_data = {
    "query": list_libs,
    "k": 1
}
model.predict(test_data)
```

</details>


7. Make sure that the two following descriptions give the same results: `"COIFFEUR"` and `"coiffeur, & 98789"`
8. Change the value of the parameter `k`
::::
:::





# Deploying a model as an API

- We do not go into the details of building an API; see the FastAPI documentation
- Explain that we want to run the API from a Docker image so that everything it needs is packaged inside
- Brief overview of deployment on Kubernetes and of the `deployment`, `service` and `ingress` files: just say quickly what each one is for
- A word on Argo CD to cover continuous deployment (CD)




## Application 3 {.scrollable}

:::{.nonincremental}
:::: {.callout-tip collapse="true" icon=false}
## Deploying a machine-learning model as an API

1. We built a very simple REST API using FastAPI. All underlying files are in the `app` folder. Check them out.
2. To deploy an API, you need to run it on a machine that contains all necessary packages, scripts, and configurations. The most convenient way is to build a Docker image that bundles all of these. Open the `Dockerfile` to see how the image is built. The image is published via GitHub Actions; if interested, have a look at `.github/workflows/build_image.yml`.
3. Open the file `kubernetes/deployment.yml` and modify the highlighted line accordingly:

```{.yaml code-line-numbers="7"}
containers:
- name: api
  image: inseefrlab/formation-mlops-api:main
  imagePullPolicy: Always
  env:
  - name: MLFLOW_TRACKING_URI
    value: https://projet-formation-******.user.lab.sspcloud.fr
  - name: MLFLOW_MODEL_NAME
    value: fasttext-model
  - name: MLFLOW_MODEL_VERSION
    value: "1"
```

4. On the [SSP Cloud](https://datalab.sspcloud.fr/home), launch an Argo CD service by clicking [this URL](https://datalab.sspcloud.fr/launcher/automation/argo-cd?autoLaunch=true)
5. Create a new application and use the YAML template stored in the `argocd` folder
    a. `Create an application  -> Edit as YAML` 
    b. Adjust the yaml to your purpose
    c. `Save -> Create`
6. Reach your API using the URL you defined in your `ingress.yml` file
7. Display the documentation of your API by adding `/docs` to your URL
8. Try your API out!
9. Train a new model and deploy it in your API

<details>
<summary>
    <font size="3" color="darkgreen"><b>Click to see the steps </b></font>
</summary>

1. Train a model
2. Register the model
3. Adjust the `MLFLOW_MODEL_NAME` or `MLFLOW_MODEL_VERSION` environment variable in the `deployment.yml` file
4. Commit and push these changes
5. Synchronise in Argo CD or wait for 5 minutes
6. Refresh your API; it should now be based on your new version!

</details>

::::

:::
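
As an illustration of how the environment variables set in `deployment.yml` can select the model to serve, the sketch below builds a registry URI of the form `models:/<name>/<version>`. The helper `model_uri_from_env` is hypothetical, and the code in the `app` folder may read these variables differently.

```python
import os


def model_uri_from_env(env=os.environ):
    """Build an MLflow registry URI from the deployment's environment variables."""
    name = env["MLFLOW_MODEL_NAME"]
    version = env["MLFLOW_MODEL_VERSION"]
    return f"models:/{name}/{version}"


# Values taken from the deployment.yml excerpt; a plain dict stands in for os.environ
example_env = {"MLFLOW_MODEL_NAME": "fasttext-model", "MLFLOW_MODEL_VERSION": "1"}
uri = model_uri_from_env(example_env)
print(uri)
```

With this pattern, deploying a new model version only requires changing a value in the manifest, not rebuilding the image.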





# Distributing the hyperparameter optimization

## Parallel training

- With our setup, we can train models [**one by one**]{.orange} and log all relevant information to the MLflow tracking server
- What if we would like to train [**multiple models at once**]{.orange}, for example to optimize hyperparameters ?
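
A small sketch of the idea on a single machine: enumerate a hyperparameter grid and evaluate the combinations concurrently. The grid values, the `lr` parameter and the `train` placeholder (which returns a made-up score) are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical search space
grid = {"dim": [25, 50, 100], "lr": [0.1, 0.5]}


def train(params):
    """Placeholder for a real training run; returns a made-up score."""
    score = 1.0 / (1.0 + abs(params["dim"] - 50) + abs(params["lr"] - 0.1))
    return params, score


# One dictionary per combination of hyperparameters
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

# Run the trainings concurrently and collect (params, score) pairs
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(train, combinations))

best_params, best_score = max(results, key=lambda item: item[1])
print(len(combinations), best_params)
```

Threads on one machine only go so far; a workflow engine applies the same fan-out pattern with one container per combination.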

## Workflow automation

- [**General principles**]{.orange} :
    - Define workflows where each step in the workflow is a [**container**]{.blue2} (reproducibility)
    - Model multi-step workflows as a [**sequence**]{.blue2} of tasks or as a [**directed acyclic graph**]{.blue2}
    - This makes it easy to [**run compute-intensive jobs in parallel**]{.blue2} for machine learning or data processing
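
The directed acyclic graph view can be illustrated with the standard library's `graphlib` (Python 3.9+). The task names below are hypothetical; each "wave" groups the tasks whose dependencies are all satisfied and which could therefore run in parallel.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical pipeline)
dag = {
    "preprocess": {"download"},
    "train-dim-25": {"preprocess"},
    "train-dim-50": {"preprocess"},
    "evaluate": {"train-dim-25", "train-dim-50"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    waves.append(ready)                 # these could run in parallel
    sorter.done(*ready)

for wave in waves:
    print(wave)
```

A workflow engine does essentially this scheduling, except that each task is a container started on the cluster.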

## Argo workflows

- A popular [**workflow engine**]{.orange} for orchestrating parallel jobs on `Kubernetes`
  - [**open-source**]{.blue2}
  - [**container-native**]{.blue2}
  - available on the [**SSP Cloud**]{.orange}

. . .

![](img/argo-logo.png){fig-align="center" height=300}

## Hello World

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow                  # new type of k8s spec
metadata:
  generateName: hello-world-    # name of the workflow spec
spec:
  entrypoint: whalesay          # invoke the whalesay template
  templates:
    - name: whalesay            # name of the template
      container:
        image: docker/whalesay
        command: [ cowsay ]
        args: [ "hello world" ]
```

## What is going on ?

. . .

![](img/argo-0.png){fig-align="center" height=500}

## What is going on ?

![](img/argo-1a.png){fig-align="center" height=500}

## What is going on ?

![](img/argo-2a.png){fig-align="center" height=500}

## Parameters

- Templates can take [**input parameters**]{.orange}

. . .

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-parameters-
spec:
  entrypoint: whalesay
  arguments:
    parameters:
    - name: message
      value: hello world

  templates:
  - name: whalesay
    inputs:
      parameters:
      - name: message       # parameter declaration
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["{{inputs.parameters.message}}"]
```
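
Since a workflow manifest is plain data, it can also be generated programmatically. The sketch below rebuilds the parameterised workflow above as a Python dictionary and serialises it to JSON, which YAML parsers also accept. The `hello_workflow` helper is hypothetical and only uses the standard library.

```python
import json


def hello_workflow(message):
    """Return the parameterised workflow above as a Python dictionary."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "hello-world-parameters-"},
        "spec": {
            "entrypoint": "whalesay",
            "arguments": {"parameters": [{"name": "message", "value": message}]},
            "templates": [
                {
                    "name": "whalesay",
                    "inputs": {"parameters": [{"name": "message"}]},
                    "container": {
                        "image": "docker/whalesay",
                        "command": ["cowsay"],
                        "args": ["{{inputs.parameters.message}}"],
                    },
                }
            ],
        },
    }


manifest = hello_workflow("hello MLOps")
print(json.dumps(manifest, indent=2))
```

Generating manifests this way is convenient when many similar workflows differ only by their parameter values.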

## Multi-step workflows

- [**Multi-step workflows**]{.orange} can be specified (`steps` or `dag`)

. . .

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello

  # This spec contains two templates: hello-hello-hello and whalesay
  templates:
  - name: hello-hello-hello
    # Instead of just running a container
    # This template has a sequence of steps
    steps:
    - - name: hello1            # hello1 is run before the following steps
        template: whalesay
    - - name: hello2a           # double dash => run after previous step
        template: whalesay
      - name: hello2b           # single dash => run in parallel with previous step
        template: whalesay
  - name: whalesay              # name of the template
    container:
      image: docker/whalesay
      command: [ cowsay ]
      args: [ "hello world" ]
```

## What is going on ?

. . .

![](img/argo-0.png){fig-align="center" height=500}

## What is going on ?

![](img/argo-1b.png){fig-align="center" height=500}

## What is going on ?

![](img/argo-2b.png){fig-align="center" height=500}

## What is going on ?

![](img/argo-1b.png){fig-align="center" height=500}

## What is going on ?

![](img/argo-3b.png){fig-align="center" height=500}

## Further applications

- Workflows [**to test**]{.orange} registered models, or models pushed to staging / production
- Workflows can be [**triggered**]{.orange} automatically (via Argo Events for example)
- [**Continuous training workflows**]{.orange}
- [**Distributed**]{.orange} machine learning pipelines in general (data downloading, processing, etc.)

## Further applications

. . .

![](img/pokemon_workflow.png){fig-align="center" height=450}

## Notes

- [**Python SDK**]{.orange} for Argo Workflows
- Kubeflow pipelines
- [**Couler**]{.orange} : unified interface for constructing and managing workflows on different workflow engines
- Other Python-native orchestration tools : [**Apache Airflow**]{.orange}, [**Metaflow**]{.orange}, [**Prefect**]{.orange}




## Application 4

:::{.callout-tip collapse="true" icon=false}
## Distributing the hyperparameter optimization with an orchestrator

:::{.incremental}
1. Open an Argo Workflows service and submit the `Hello World` workflow. Visualize the logs on the Argo Workflows UI.
2. Take a look at the `argo_workflows/workflow.yml` file. What do you expect will happen when we submit this workflow ?
3. Submit the workflow. Once all jobs are completed, visualize the logs. Then open the MLflow UI to check what has been done.
:::

:::





# Conclusion
