In [1]:
!python -c "import sys; print(sys.executable)"

/home/maviaalamkhan/Documents/mlop/data_engineering_bootcamp_2303/tasks/3_machine_learning_essentials/day_4_mlops/mlops-student/bin/python


# MLflow lab

In [2]:
import pandas as pd

In [3]:
pd.__version__

'2.0.1'

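### Setting up MLflow tracking server

We also specify the backend store URI (where run metadata is kept, here an SQLite database) and the default artifact root (where artifacts such as models are written). This makes it possible to store models.

After running this command, the tracking server will be accessible at `localhost:5000`.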

In [4]:
%%bash --bg

mlflow server --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./mlruns

### MLProject file

This file is used to configure the steps of an MLflow project.

Using `MLproject` we can define our project's pipeline steps, called *entry points*.

Each entry point in this file corresponds to a shell command.

Entry points can be run using

```
mlflow run <PROJECT_URI> -e <ENTRY_POINT>
```

By default, `mlflow run` runs the `main` entry point.

In [5]:
%cat MLproject

name: basic_mlflow

# this file is used to configure Python package dependencies.
# it uses Anaconda, but it can be also alternatively configured to use pip.
conda_env: conda.yaml

# entry points can be run using `mlflow run <project_name> -e <entry_point_name>`
entry_points:
  # download_data:
    # you can run any command using MLFlow
    # command: "bash download_data.sh"
  # main is the default entry point: it runs when the -e option is omitted.
  main:
    # parameters is a key-value collection.
    parameters:
      file_name:
        type: str
        default: "winequalityN.csv"
      max_n:
        type: int
        default: 100
    command: "python train.py {file_name} {max_n}"
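
How the `main` entry point turns into a shell command can be sketched in a few lines of Python. This is an illustration only, not MLflow's implementation: `render_command` and `cli_args` are hypothetical helpers, and plain `str.format` stands in for MLflow's own parameter substitution. The `-P name=value` option is the `mlflow run` flag for overriding a parameter.

```python
import shlex

# Values copied from the MLproject file shown above.
COMMAND_TEMPLATE = "python train.py {file_name} {max_n}"
DEFAULTS = {"file_name": "winequalityN.csv", "max_n": 100}


def render_command(template, defaults, overrides=None):
    """Fill the command template with defaults, then apply any overrides."""
    params = {**defaults, **(overrides or {})}
    return template.format(**params)


def cli_args(project_uri=".", entry_point="main", overrides=None):
    """Build the argument list for `mlflow run`, passing overrides with -P."""
    args = ["mlflow", "run", project_uri, "-e", entry_point]
    for name, value in (overrides or {}).items():
        args += ["-P", f"{name}={value}"]
    return args


print(render_command(COMMAND_TEMPLATE, DEFAULTS))
print(shlex.join(cli_args(overrides={"max_n": 50})))
```

With the defaults, the rendered string is `python train.py winequalityN.csv 100`, which is the command that appears in the `mlflow run` log later in this notebook.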



First we need to download the data. We will use the wine quality dataset (`winequalityN.csv`), which is the default value of the `file_name` parameter.

## Training

Now we can train a model. See `train.py`.
It contains code from the supervised machine learning tutorial, extended to track metrics and the model with MLflow.

With the default parameters, the `main` entry point runs `python train.py winequalityN.csv 100` and registers the resulting model as `sklearn_rfc`.

After running `mlflow run .` below, you can go to `localhost:5000` and see the trained models.

In [7]:
import sklearn

In [8]:
sklearn.__version__

'1.2.2'

In [9]:
! pip install fire
import fire


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [10]:
! pip install mlflow


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [11]:
%%bash
source mlflow_env_vars.sh
mlflow run . 

2023/05/08 15:43:38 INFO mlflow.utils.conda: Conda environment mlflow-dd0fbdd40ba98798131458f29496394bd1a3fb33 already exists.
2023/05/08 15:43:38 INFO mlflow.projects.utils: === Created directory /tmp/tmp_7ucltb7 for downloading remote URIs passed to arguments of type 'path' ===
2023/05/08 15:43:38 INFO mlflow.projects.backend.local: === Running command 'source /home/maviaalamkhan/anaconda3/bin/../etc/profile.d/conda.sh && conda activate mlflow-dd0fbdd40ba98798131458f29496394bd1a3fb33 1>&2 && python train.py winequalityN.csv 100' in run with ID '953f804dd20344dca2b8450eb56c1776' === 
  self._final_estimator.fit(Xt, y, **fit_params_last_step)
Registered model 'sklearn_rfc' already exists. Creating a new version of this model...
2023/05/08 15:43:42 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn_rfc, version 21
Created version '21' of model 'sklearn_rfc'.
  self._final_estimator.fit(Xt, y, **fit_params_last

In [15]:
%%bash
last_model_path=$(ls -tr mlruns/0/ | tail -1)
cat mlruns/0/$last_model_path/artifacts/rfc/MLmodel
# cat mlruns/0/9bc20977e5894b72bc4bbeb0044f5e38/artifacts/rfc/MLmodel



artifact_path: rfc
flavors:
  python_function:
    env:
      conda: conda.yaml
      virtualenv: python_env.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    predict_fn: predict
    python_version: 3.10.6
  sklearn:
    code: null
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.2.2
mlflow_version: 2.3.1
model_uuid: 59ab7d62a3a945c28185c17dddcb28bd
run_id: a97e1fa4c6574c3c9ad86276bc1ac69a
utc_time_created: '2023-05-08 10:43:50.028765'
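
The shell cell above picks the newest entry in `mlruns/0/` by modification time. The same lookup can be sketched in Python, assuming the `mlruns/<experiment_id>/<run_id>/artifacts/rfc/MLmodel` layout seen above. `latest_run_dir` is a hypothetical helper, not part of MLflow. Unlike `ls -tr | tail -1`, it only considers directories.

```python
from pathlib import Path


def latest_run_dir(experiment_dir):
    """Return the most recently modified subdirectory, or None if there is none."""
    root = Path(experiment_dir)
    if not root.is_dir():
        return None
    run_dirs = [p for p in root.iterdir() if p.is_dir()]
    return max(run_dirs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    run_dir = latest_run_dir("mlruns/0")
    if run_dir is not None:
        mlmodel = run_dir / "artifacts" / "rfc" / "MLmodel"
        if mlmodel.is_file():
            print(mlmodel.read_text())
```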


In [16]:
import mlflow

In [17]:
mlflow.__version__

'2.3.1'

## Serving model

Now that we have trained our model, we can go to the *Models* page of the MLflow UI (http://localhost:5000/#/models).

Click *sklearn_rfc* on this page, choose a model version, and move it to the *Production* stage.

The following cell will serve the model on localhost at port 5003.

In [20]:
%%bash --bg
source mlflow_env_vars.sh
mlflow --version
mlflow models serve -m models:/sklearn_rfc/Production -p 5003 --env-manager=conda 


## Prediction

We'll load data that we can feed into the prediction server.

In [21]:
import pandas as pd
df = pd.read_csv("winequalityN.csv")
df

Unnamed: 0,type,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,white,7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,white,6.3,0.300,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,white,8.1,0.280,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,white,7.2,0.230,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6492,red,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
6493,red,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,,11.2,6
6494,red,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
6495,red,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


Let's request predictions for a two-row batch based on the first row of the dataset.

**Warning: this might fail at first if the prediction server has not finished starting; in that case, wait a minute and retry.**
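
Instead of waiting a fixed amount of time, the notebook could poll the server until it answers. The following is a minimal sketch using only the standard library. `wait_until_up` is a hypothetical helper, and the use of the scoring server's `/ping` route as a readiness check is an assumption about the MLflow version in use.

```python
import time
import urllib.error
import urllib.request


def wait_until_up(url, timeout=60.0, interval=2.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False
```

For the server started above, a call such as `wait_until_up("http://127.0.0.1:5003/ping")` returns `True` once the server responds and `False` if it is still unreachable after the timeout.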

In [24]:
%%bash
data='[[7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45], [7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45]]'
echo $data

curl -d "{\"inputs\": $data}" -H 'Content-Type: application/json' 127.0.0.1:5003/invocations

[[7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45], [7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45]]


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   153  100    23  100   130   1611   9111 --:--:-- --:--:-- --:--:-- 10928


{"predictions": [5, 5]}

In [25]:
%%bash
data='[[7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45], [7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45]]'
echo $data

curl -d "{\"instances\": $data}" -H 'Content-Type: application/json' 127.0.0.1:5003/invocations

[[7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45], [7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45]]


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   156  100    23  100   133   1900  10987 --:--:-- --:--:-- --:--:-- 13000


{"predictions": [5, 5]}

In [29]:
%%bash
data='[[7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45], [7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45]]'
columns='["fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol"]'
echo $data

curl -d "{\"dataframe_split\":{\"columns\":$columns,\"data\": $data}}" -H 'Content-Type: application/json' 127.0.0.1:5003/invocations

[[7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45], [7.0,0.270,0.36,20.7,0.045,45.0,170.0,1.00,100,3.00,0.45]]


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   343  100    23  100   320   1321  18389 --:--:-- --:--:-- --:--:-- 20176


{"predictions": [5, 5]}

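Voilà! The server returns a prediction for each input row, whichever of the three request formats we use. Note that the payload does not reproduce the first row exactly: the density `1.00100` is entered as `1.00,100` and the alcohol value `8.8` is missing, so the predicted class 5 is not directly comparable with that row's recorded quality of 6.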
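
The three request bodies used with `curl` can also be built and sent from Python. The sketch below uses only the standard library. `build_payloads` and `post_json` are hypothetical helpers, and `ROW` holds the eleven numeric features of the first row of `winequalityN.csv` as displayed in the DataFrame above (with `type` and `quality` left out, matching the column list in the `dataframe_split` request).

```python
import json
import urllib.request

# Feature columns, as listed in the dataframe_split request above.
COLUMNS = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
# First row of the DataFrame shown earlier, numeric features only.
ROW = [7.0, 0.270, 0.36, 20.7, 0.045, 45.0, 170.0, 1.00100, 3.00, 0.45, 8.8]


def build_payloads(columns, rows):
    """Return the three request bodies used in this notebook, keyed by format."""
    return {
        "inputs": {"inputs": rows},
        "instances": {"instances": rows},
        "dataframe_split": {"dataframe_split": {"columns": columns, "data": rows}},
    }


def post_json(url, payload):
    """POST `payload` as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the model server running, `post_json("http://127.0.0.1:5003/invocations", build_payloads(COLUMNS, [ROW])["dataframe_split"])` sends the same kind of request as the last `curl` cell. The `dataframe_split` format carries column names, so the server can match values to features by name rather than by position.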