# How to transition a model experiment to a pipeline?

Machine learning work requires collaboration among many roles. Two of the most important are the data scientist and the machine learning engineer, who work together to train models and bring them to production. This article is aimed at machine learning engineers and shows how to productionize models built by data scientists.
If you are not sure which role describes you, check the table below.

|Role |Responsibilities |Tools |
| --- | --- | --- |
|data scientist|model development, model debug, data understanding, model experimentation, continuous training |VSCode/PyCharm/Jupyter Notebook, Python, PyTorch/TensorFlow, ML Platform|
|machine learning engineer|engineering best practices, scaling, production training, model management, model deployment, application integration, MLOps|ML platform, Python, Docker, Kubernetes, ML pipeline|

When a model has been developed and is ready to go to production, the work is handed over from the data scientist (hereinafter DS) to the machine learning engineer (hereinafter MLE).

The data scientist should provide:
* a training script, which can be a Jupyter Notebook file or a Python script depending on the editor the DS uses. The script shows which sample data is used, how the data is processed, how the model is trained, how the evaluation metrics are defined, and the baseline metrics on the sample data.
* the environment needed to run the script, which can be a conda environment YAML file or a requirements.txt.

Once these inputs are handed over, the MLE's first job is to make the script run successfully locally (on a laptop, virtual machine, Codespace, Azure ML Notebook, etc.) and then in the cloud.

We will use NYC taxi fare prediction as an example. You can find all the datasets and code [here in github](gitlink).



## Get it working



### Get it working locally

To run the script locally, the MLE first needs to go through the code and understand its logic, refine the code if necessary, rebuild the environment, and then run the script.



#### Refine code

When refining code, an MLE should take security, compliance, cost, and the company's internal engineering practices into consideration.

For example, the MLE can delete or comment out code used only for data visualization and exploration, such as cells that check statistics or plot histograms of the data distribution. This saves compute time when running in the production environment.
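
If the exploration cells in the notebook are tagged, removing them can be scripted rather than done by hand. The sketch below is only an illustration: it assumes a hypothetical cell tag named `exploration` (the notebook in this example may not use tags at all) and relies on the fact that an .ipynb file is JSON with a `cells` list whose entries can carry tags under `metadata.tags`.

```python
import json


def drop_tagged_cells(notebook: dict, tag: str = "exploration") -> dict:
    """Return a copy of the notebook without the cells that carry `tag`."""
    kept = [
        cell
        for cell in notebook.get("cells", [])
        if tag not in cell.get("metadata", {}).get("tags", [])
    ]
    return {**notebook, "cells": kept}


# A tiny in-memory notebook; a real one would be read from disk with json.load().
notebook_json = """
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {},
  "cells": [
    {"cell_type": "code", "metadata": {}, "source": ["model.fit(trainX, trainy)"]},
    {"cell_type": "code", "metadata": {"tags": ["exploration"]}, "source": ["df.hist()"]}
  ]
}
"""

cleaned = drop_tagged_cells(json.loads(notebook_json))
# Only the untagged training cell is left.
print([cell["source"] for cell in cleaned["cells"]])
```

Writing `cleaned` back to disk with `json.dump` gives a notebook that can then be converted to a script in the usual way.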

In addition, if an .ipynb file is provided, you need to convert the notebook to a Python file with the command below, because Azure ML accepts a .py file as job input when you move to the cloud.



In [1]:
!jupyter nbconvert --to script --output script inputs_from_data_scientist/notebook.ipynb 

[NbConvertApp] Converting notebook inputs_from_data_scientist/notebook.ipynb to script
[NbConvertApp] Writing 10286 bytes to inputs_from_data_scientist/script.py


#### Rebuild environment to run script

To run DS's script on local, MLE needs to reproduce the same environment.

Common approaches include:

* build a docker image and run container
* create conda environment from yaml file 
* pip install requirements.txt

Building a Docker image and running the script in a container is recommended: it is OS independent and therefore the closest simulation of running the script remotely. We will use this method in the example below.

First, list all the dependencies imported in the script, and look up their versions in the full list of requirements provided by the DS.



In [None]:
from pathlib import Path
import os
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
import pickle
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt 




Now you can derive a relatively short requirements.txt. In this example, pathlib, os and pickle are removed because they are included in Python, and matplotlib (imported as plt) is no longer needed because the histogram plotting code has already been removed.



In [None]:
pathlib2==2.3.6
pandas==1.3.3
sklearn==0.0
numpy==1.18.5



Now you can write your Dockerfile: set the base image to Python with a suitable version, install the requirements, copy the sample data and script, and set the command.


In [None]:
#FROM mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04
FROM python:3.8.5

# python installs
COPY env/local/requirements.txt .
RUN pip install -r requirements.txt && rm requirements.txt

COPY data/sample_data /usr/python/data/sample_data
COPY /1_script_run_on_local/src/script.py /usr/python/1_script_run_on_local/src/
WORKDIR /usr/python

# set command
CMD ["bash", "-c", "cd 1_script_run_on_local/src && python script.py && exit"]

You can then run these commands to build image and run python script in container to test it. 

In [9]:
!docker build -t nyc_taxi_image -f env/local/Dockerfile .

Sending build context to Docker daemon  2.196GB
Step 1/7 : FROM python:3.8.5
 ---> 28a4c88cdbbf
Step 2/7 : COPY env/local/requirements.txt .
 ---> 12214c898ef2
Step 3/7 : RUN pip install -r requirements.txt && rm requirements.txt
 ---> Running in 8a205c2db9e1
Collecting pathlib2==2.3.6
  Downloading pathlib2-2.3.6-py2.py3-none-any.whl (17 kB)
Collecting pandas==1.3.3
  Downloading pandas-1.3.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.5 MB)
Collecting sklearn==0.0
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting numpy==1.18.5
  Downloading numpy-1.18.5-cp38-cp38-manylinux1_x86_64.whl (20.6 MB)
Collecting azureml-mlflow==1.39.0
  Downloading azureml_mlflow-1.39.0-py3-none-any.whl (46 kB)
Collecting argparse==1.4.0
  Downloading argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Collecting six
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting pytz>=2017.3
  Downloading pytz-2022.1-py2.py3-none-any.whl (503 kB)
Collecting python-dateutil>=2.7.3
  Downloading

In [10]:
 !docker run -it nyc_taxi_image:latest

raw data files: 
['yellowTaxiData.csv', 'greenTaxiData.csv']
(5000, 21)
(5000, 19)
['cost', 'distance', 'dropoff_datetime', 'dropoff_latitude', 'dropoff_longitude', 'passengers', 'pickup_datetime', 'pickup_latitude', 'pickup_longitude', 'store_forward', 'vendor']
green_columns:  {'vendorID': 'vendor', 'lpepPickupDatetime': 'pickup_datetime', 'lpepDropoffDatetime': 'dropoff_datetime', 'storeAndFwdFlag': 'store_forward', 'pickupLongitude': 'pickup_longitude', 'pickupLatitude': 'pickup_latitude', 'dropoffLongitude': 'dropoff_longitude', 'dropoffLatitude': 'dropoff_latitude', 'passengerCount': 'passengers', 'fareAmount': 'cost', 'tripDistance': 'distance'}
yellow_columns:  {'vendorID': 'vendor', 'tpepPickupDateTime': 'pickup_datetime', 'tpepDropoffDateTime': 'dropoff_datetime', 'storeAndFwdFlag': 'store_forward', 'startLon': 'pickup_longitude', 'startLat': 'pickup_latitude', 'endLon': 'dropoff_longitude', 'endLat': 'dropoff_latitude', 'passengerCount': 'passengers', 'fareAmount': 'cost', '

### Get it working in the cloud

After making sure your code works locally, you can move to the cloud by submitting an Azure ML job.



#### Prerequisites

To submit a job to Azure ML, you need to install the CLI and set up the environment on your local machine.

Install the latest version of the Azure CLI; see [How to install the Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli).

Install the latest version of the Azure ML CLI, then set the default subscription, resource group and workspace (commands shown below). For more information, see [Install and set up the Machine Learning CLI](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-cli?tabs=public).

Here is a cheat sheet for environment setup.



In [None]:
!az login --use-device-code
!az account set -s "sub_id"
!az configure --defaults group=rg_name workspace=ws_name location=location

#### Define inputs and outputs

In preparation for moving to the cloud, you need to define the script's interfaces (inputs and outputs), because you will use an Azure ML datastore instead of the local disk as the data source.

In this example, raw_data is defined as the input and model_output as the output.

Code modifications are:



1. Import the argparse package and add these two arguments.

    

In [None]:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--raw_data", type=str, help="Path to raw data")
parser.add_argument("--model_output", type=str, help="Path of output model")


args = parser.parse_args()

lines = [
    f"Raw data path: {args.raw_data}",
    f"model output path: {args.model_output}",

]

for line in lines:
    print(line)  

2. Replace raw_data with args.raw_data
    
    

In [None]:
# Read raw data from csv to dataframe
# raw_data = './../data/sample_data/'
print("raw data files: ")
arr = os.listdir(args.raw_data)
print(arr)

green_data = pd.read_csv((Path(args.raw_data) / 'greenTaxiData.csv'))
yellow_data = pd.read_csv((Path(args.raw_data) / 'yellowTaxiData.csv'))

3. Replace model_output with args.model_output
    

In [None]:
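# Output the model 
# model_output = './model/'
# Create the output folder if needed, including any missing parent folders
os.makedirs(args.model_output, exist_ok=True)
# Use a context manager so the file is closed once the model is written
with open(Path(args.model_output) / "model.sav", "wb") as f:
    pickle.dump(model, f)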

4. Add code to log metrics and parameters. Azure ML uses MLflow for experiment tracking. Import mlflow, then use mlflow.log_param() and mlflow.log_metric() instead of printing to standard output with print(). See more in [this article](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow-cli-runs?tabs=mlflow).

In [None]:
# Compare predictions to actuals (testy)
# The mean squared error
# print("Scored with the following model:\n{}".format(model))
# print("Mean squared error: %.2f" % mean_squared_error(testy, predictions))
# The coefficient of determination: 1 is perfect prediction
# print("Coefficient of determination: %.2f" % r2_score(testy, predictions))

# Log params and metrics to AML
import mlflow  # usually placed with the other imports at the top of the script

mlflow.log_param("learning_rate", learning_rate)
mlflow.log_param("n_estimators", n_estimators)

mlflow.log_metric("mean_squared_error", mean_squared_error(testy, predictions))
mlflow.log_metric("r2_score", r2_score(testy, predictions))
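
The first three modifications above can be tried out together before touching the real training code. The sketch below is an assumption-laden stand-in, not the script from the repository: the "model" is a placeholder dictionary, the `main` and `parse_args` helpers are hypothetical names, and MLflow logging is left out so that the example needs only the standard library. Passing an explicit argument list to `parse_args` makes the interface easy to test locally.

```python
import argparse
import os
import pickle
import tempfile
from pathlib import Path


def parse_args(argv=None):
    """Parse the two script interfaces; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--raw_data", type=str, required=True, help="Path to raw data")
    parser.add_argument("--model_output", type=str, required=True, help="Path of output model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Read from the input folder instead of a hard-coded local path.
    files = sorted(os.listdir(args.raw_data))
    model = {"trained_on": files}  # placeholder standing in for the trained model

    # Write to the output folder handed over on the command line.
    os.makedirs(args.model_output, exist_ok=True)
    model_path = Path(args.model_output) / "model.sav"
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model_path


# Local smoke test with throwaway folders.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "sample_data"
    raw.mkdir()
    (raw / "greenTaxiData.csv").write_text("cost,distance\n5.0,1.2\n")
    (raw / "yellowTaxiData.csv").write_text("cost,distance\n7.5,2.0\n")
    saved = main(["--raw_data", str(raw), "--model_output", str(Path(tmp) / "model")])
    print(saved.name, saved.exists())
```

Because the paths arrive only through the arguments, the same script can be pointed at a local folder or at the folder that Azure ML provides for the job.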

#### Create an Azure ML environment

First, modify the Dockerfile by deleting the commands that copy the data and script.

Next, write your environment YAML file following [this schema instruction](https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-environment). Remember to add argparse and azureml-mlflow to your requirements.txt.

In [None]:
azureml-mlflow==1.39.0
argparse==1.4.0

Finally, run this command to register the environment in Azure ML. To learn more about environment management commands, see the `az ml environment` CLI reference.

In [2]:
!az ml environment create --file env/cloud/env.yml

Command group 'ml environment' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
Uploading docker (0.0 MBs): 100%|███████████| 299/299 [00:00<00:00, 4402.55it/s]

{
  "build": {
    "dockerfile_path": "Dockerfile",
    "path": "https://pmdev9225598307.blob.core.windows.net/azureml-blobstore-663bf81f-1924-4d17-a62c-3bc4a3984cab/LocalUpload/1109e35eb4795f412b2e811640506238/docker/"
  },
  "creation_context": {
    "created_at": "2022-04-29T04:46:03.163592+00:00",
    "created_by": "Yijun Zhang",
    "created_by_type": "User",
    "last_modified_at": "2022-04-29T04:46:03.163592+00:00",
    "last_modified_by": "Yijun Zhang",
    "last_modified_by_type": "User"
  },
  "id": "azureml:/subscriptions/ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9/resourceGroups/pipeline-pm/providers/Microsoft.MachineLearningServices/workspaces/pm-dev/environments/nyc_taxi_image/versions/7",
  "name": "nyc_taxi_image",
  "os_type": "linux",
  "resource


Now you can run the command below to list all environments created in workspace.



In [3]:
!az ml environment list

Command group 'ml environment' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
[
  {
    "latest version": "7",
    "name": "nyc_taxi_image"
  },
  {
    "latest version": "1",
    "name": "test"
  },
  {
    "latest version": "1",
    "name": "0b32258cd1fc290ed0176979f5481357"
  },
  {
    "latest version": "1",
    "name": "6ca0e5ed1b7262c8bb953806eb9af5ee"
  },
  {
    "latest version": "1",
    "name": "r-mpg-environment"
  },
  {
    "latest version": "1",
    "name": "r-environment-2"
  },
  {
    "latest version": "1",
    "name": "r-environment-1"
  },
  {
    "latest version": "1",
    "name": "r-environment"
  },
  {
    "latest version": "2",
    "name": "pytorch_tabnet_env"
  },
  {
    "latest version": "1",
    "name": "pytorch_tabnet_env_test5"
  },
  {
    "latest version": "1",
    "name": "pytorch_tabnet_env_test4"
  },
  {
    "latest version": "1",
    "name": "pytorch_tabnet_env_test3"
  },
  {
    "latest ve



You can also log in to the Azure ML portal to check whether the environment is registered correctly.

![env](./images/env.png)




#### Prepare Azure ML job yaml file and submit a job

Next, wrap your Python script into a standalone job through a YAML file. Follow [this article](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-cli) to write the job YAML file, in which you define the job name, description, environment, code path, the command to run, the interfaces, and so on.

Here is an example.



In [None]:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: >-
  python script.py 
  --raw_data ${{inputs.raw_data}}
  --model_output ${{outputs.model_output}}
inputs:
  raw_data: 
    type: uri_folder
    path: ../sample_data 
outputs:
  model_output: 
    type: uri_folder
environment: azureml:nyc_taxi_image@latest
compute: azureml:cpu-cluster
display_name: nyc_taxi_regression
experiment_name: nyc_taxi_regression
description: Train a GBDT regression model on the NYC taxi dataset.


The input can be a local path; Azure ML will upload your sample data to the default datastore.
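
The `${{inputs.raw_data}}` and `${{outputs.model_output}}` expressions in the job YAML are placeholders that are replaced with concrete paths when the job runs, which is why the script only ever sees ordinary `--raw_data` and `--model_output` values. The sketch below imitates that substitution to make the idea concrete. It is not how Azure ML implements it, and both the `render_command` helper and the `/mnt/...` paths are invented for illustration.

```python
import re

# Matches ${{inputs.<name>}} and ${{outputs.<name>}} placeholders.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template: str, inputs: dict, outputs: dict) -> str:
    """Replace each placeholder with the path bound to that input or output."""
    bindings = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        kind, name = match.group(1), match.group(2)
        return bindings[kind][name]

    return PLACEHOLDER.sub(substitute, template)


template = (
    "python script.py "
    "--raw_data ${{inputs.raw_data}} "
    "--model_output ${{outputs.model_output}}"
)
command = render_command(
    template,
    inputs={"raw_data": "/mnt/inputs/sample_data"},       # hypothetical path
    outputs={"model_output": "/mnt/outputs/model_output"},  # hypothetical path
)
print(command)
```

A placeholder that names an input or output missing from the YAML would raise a `KeyError` in this sketch, which mirrors the need to keep the `command` and the `inputs`/`outputs` sections in sync.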

Once the job YAML file is ready, run this command in your local environment to submit the job to Azure ML.



In [8]:
!az ml job create --file 2_standalone_job_run_on_cloud/2a_job_sample_data.yml --web

Command group 'ml job' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
Uploading src (0.01 MBs): 100%|█████████| 8821/8821 [00:00<00:00, 284877.96it/s]

{
  "code": "/subscriptions/ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9/resourceGroups/pipeline-pm/providers/Microsoft.MachineLearningServices/workspaces/pm-dev/codes/656aa7a0-5c23-4db7-ae39-2bc01e5d4c5c/versions/1",
  "command": "python script.py  --raw_data ${{inputs.raw_data}} --model_output ${{outputs.model_output}}",
  "compute": "azureml:cpu-cluster",
  "creation_context": {
    "created_at": "2022-04-29T04:59:02.197122+00:00",
    "created_by": "Yijun Zhang",
    "created_by_type": "User"
  },
  "description": "Train a GBDT regression model on the NYC taxi dataset.",
  "display_name": "nyc_taxi_regression",
  "environment": "azureml:nyc_taxi_image:7",
  "environment_variables": {},
  "experiment_name": "nyc_taxi_regression",
  "id": "azureml:/subscriptions/ee85ed72



With the --web option, you are automatically taken to the Azure ML job detail page, where you can view job information such as run status, duration and logs.



## Get it reproducible

In the first stage, you got your script working remotely with sample data. To check that the results are reproducible, it is recommended to resubmit the Azure ML job with the full-size data.



#### Prepare full data

Azure ML datastores record connection information to your Azure storage where your full production data is located. For more details please refer to [Secure data access in Azure Machine Learning](https://review.docs.microsoft.com/en-us/azure/machine-learning/concept-data?branch=release-preview-aml-cli-v2-refresh#connect-to-storage-with-datastores).


Suppose your data is now in the cloud. Here we will take Azure File Share as an example. Your full-size data is stored in:



In [None]:
File shares/my_file_share_name/nyc_taxi/full_data

#### Prepare Azure ML job yaml file and submit a job

Take the job.yml from the previous step and change the input from a local path to a remote datastore path.

From:



In [None]:
inputs:
  raw_data: 
    type: uri_folder
    path: ../sample_data



To:



In [None]:
inputs:
  raw_data: 
    type: uri_folder
    path: azureml://datastores/workspaceblobstore/paths/nyc_taxi_data/full_data


In this YAML file, workspaceblobstore is the datastore name. For more information about the uri_folder path format, see [here is a doc about uri format]().
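
The datastore path used above follows the pattern `azureml://datastores/<datastore name>/paths/<path inside the datastore>`. As a small illustration of that pattern, the hypothetical helpers below (they are not part of any Azure SDK) build and split such a URI using plain string operations.

```python
PREFIX = "azureml://datastores/"


def datastore_uri(datastore: str, path: str) -> str:
    """Build a datastore URI from a datastore name and a relative path."""
    return f"{PREFIX}{datastore}/paths/{path.strip('/')}"


def parse_datastore_uri(uri: str) -> tuple[str, str]:
    """Split a datastore URI into (datastore name, relative path)."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a datastore URI: {uri}")
    datastore, separator, path = uri[len(PREFIX):].partition("/paths/")
    if not separator or not datastore or not path:
        raise ValueError(f"malformed datastore URI: {uri}")
    return datastore, path


uri = datastore_uri("workspaceblobstore", "nyc_taxi_data/full_data")
print(uri)
print(parse_datastore_uri(uri))
```

A check like this can catch a mistyped input path before the job is submitted, although whether the folder actually exists can only be verified against the datastore itself.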

Note that because you are using full-size data to reproduce your job, you might need to switch to a memory-optimized compute cluster.

Then rerun this command to submit a job:

In [12]:
!az ml job create --file 2_standalone_job_run_on_cloud/2b_job_full_data.yml --web

Command group 'ml job' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
{
  "code": "/subscriptions/ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9/resourceGroups/pipeline-pm/providers/Microsoft.MachineLearningServices/workspaces/pm-dev/codes/656aa7a0-5c23-4db7-ae39-2bc01e5d4c5c/versions/1",
  "command": "python script.py  --raw_data ${{inputs.raw_data}} --model_output ${{outputs.model_output}}",
  "compute": "azureml:cpu-cluster-ram",
  "creation_context": {
    "created_at": "2022-04-29T05:46:07.599122+00:00",
    "created_by": "Yijun Zhang",
    "created_by_type": "User"
  },
  "description": "Train a GBDT regression model on the NYC taxi dataset.",
  "display_name": "willing_apple_0yz3g8rkvn",
  "environment": "azureml:nyc_taxi_image:7",
  "environment_variables": {},
  "experiment_name": "nyc_taxi_regression",
  "id": "azureml:/subscriptions/ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9/resourceGroups/pipeline-pm/providers/Microsoft.MachineLearn

## Get it modularized

An MLE will sometimes use pipelines with modularized components in production for the value they add, such as easier collaboration and cost effectiveness. For more information about when and why to use pipelines, please refer to [here is concept doc]().



### Decompose code

First, go through the code, understand the AI workflow, and decompose it into several steps, for example data processing, feature engineering, training, prediction and scoring.



### Define components

In this NYC Taxi example, we decompose the script into five steps (data preparation, data transformation, training, prediction and scoring) and define one component for each step. For more details about components, please refer to [this component concept article](https://docs.microsoft.com/en-us/azure/machine-learning/concept-component).

Each component can be considered a standalone job, and the pipeline is responsible for scheduling them together. As with migrating a single script from local to remote, you need to import the necessary dependencies, define the interfaces, modify the code for each step, and create a YAML file for each component definition. You can learn more about the component YAML schema [here](https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-component-command).



#### Step 1:

Input: NYC taxi dataset folder, including 2 .csv files

Code: Take multiple taxi datasets (yellow and green), remove and rename columns, combine green and yellow data.

Output: Single combined data



#### Step 2:

Input: Output of step 1, combined data

Code: Eliminate outliers, filter out locations outside NYC, split the pickup and dropoff date into the day of the week, day of the month, and month values, etc.

Output: Dataset filtered and created with 20+ features



#### Step 3:

Input: Output of step 2, processed data

Code: Split data into X and Y, split the data into train/test set, train a GBDT model, log parameters

Output: Trained model (pickle format) and data subset for test (.csv)



#### Step 4:

Input: Output of step 3, GBDT model and test data

Code: Predict test dataset with trained model

Output: Test data with predictions added as a column



#### Step 5:

Input: Output of step 4, test data with predictions

Code: Calculate and log metrics

Output: None

After all this, you have five Python source files, each with its component YAML file.
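
To make the decomposition concrete, the five steps can be prototyped as plain functions that exchange data only through folders, which is the same contract the components will have. The sketch below is a toy under stated assumptions: the file names (`merged.csv`, `test.csv`, and so on) are invented, the filtering is reduced to dropping zero-distance trips, and a simple cost-per-distance rate stands in for the GBDT model so that only the standard library is needed.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def read_rows(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def write_rows(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def prep(raw_dir, out_dir):
    """Step 1: combine the green and yellow data into one file."""
    rows = []
    for name in ("greenTaxiData.csv", "yellowTaxiData.csv"):
        rows += read_rows(raw_dir / name)
    write_rows(out_dir / "merged.csv", rows)


def transform(in_dir, out_dir):
    """Step 2: filter rows (a stand-in for the real cleaning and feature code)."""
    rows = [r for r in read_rows(in_dir / "merged.csv") if float(r["distance"]) > 0]
    write_rows(out_dir / "transformed.csv", rows)


def train(in_dir, model_dir, test_dir):
    """Step 3: split the data, fit a stand-in model, save model and test set."""
    rows = read_rows(in_dir / "transformed.csv")
    train_rows, test_rows = rows[:-1], rows[-1:]
    rate = (sum(float(r["cost"]) for r in train_rows)
            / sum(float(r["distance"]) for r in train_rows))
    with open(model_dir / "model.sav", "wb") as f:
        pickle.dump(rate, f)
    write_rows(test_dir / "test.csv", test_rows)


def predict(model_dir, test_dir, out_dir):
    """Step 4: add a prediction column to the test data."""
    with open(model_dir / "model.sav", "rb") as f:
        rate = pickle.load(f)
    rows = read_rows(test_dir / "test.csv")
    for r in rows:
        r["predicted_cost"] = rate * float(r["distance"])
    write_rows(out_dir / "predictions.csv", rows)


def score(in_dir):
    """Step 5: compute the mean squared error of the predictions."""
    rows = read_rows(in_dir / "predictions.csv")
    return sum((float(r["cost"]) - float(r["predicted_cost"])) ** 2 for r in rows) / len(rows)


def run_locally(root: Path) -> float:
    """Chain the five steps, passing folders between them as a pipeline would."""
    names = ("raw", "merged", "transformed", "model", "test", "predictions")
    dirs = {name: root / name for name in names}
    for folder in dirs.values():
        folder.mkdir()
    (dirs["raw"] / "greenTaxiData.csv").write_text("cost,distance\n10,2\n0,0\n")
    (dirs["raw"] / "yellowTaxiData.csv").write_text("cost,distance\n20,4\n15,3\n")
    prep(dirs["raw"], dirs["merged"])
    transform(dirs["merged"], dirs["transformed"])
    train(dirs["transformed"], dirs["model"], dirs["test"])
    predict(dirs["model"], dirs["test"], dirs["predictions"])
    return score(dirs["predictions"])


with tempfile.TemporaryDirectory() as tmp:
    print("mean squared error:", run_locally(Path(tmp)))
```

Once each function reads and writes only the folders it is given, moving it into its own script with argparse arguments for those folders is a mechanical change.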



### Define and submit a pipeline

After defining the components, you can create a pipeline that connects a series of components and submit the complete AI workflow.

First write a YAML file that describes how the pipeline is built, which compute resource is used, the inputs and outputs, and so on; then submit a pipeline job from the command line.

You can refer to this article for [pipeline job yaml specification](https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-job-pipeline).
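
In a pipeline job, steps are connected by binding one job's input to another job's output, and the execution order follows from those bindings. The sketch below illustrates that idea with the standard `graphlib` module (Python 3.9 or later). Only `predict_job`, `model_input` and `pipeline_job_input` appear in the job output shown in this article; the other job, input and output names are hypothetical, and the code is an illustration of dependency ordering rather than the way Azure ML schedules jobs.

```python
import re
from graphlib import TopologicalSorter

# Each job maps its input names to binding expressions.
jobs = {
    "prep_job": {"raw_data": "${{parent.inputs.pipeline_job_input}}"},
    "transform_job": {"clean_data": "${{parent.jobs.prep_job.outputs.prep_data}}"},
    "train_job": {"training_data": "${{parent.jobs.transform_job.outputs.transformed_data}}"},
    "predict_job": {
        "model_input": "${{parent.jobs.train_job.outputs.model_output}}",
        "test_data": "${{parent.jobs.train_job.outputs.test_data}}",
    },
    "score_job": {"predictions": "${{parent.jobs.predict_job.outputs.predictions}}"},
}

# Captures the upstream job name from a job-output binding.
UPSTREAM = re.compile(r"\$\{\{\s*parent\.jobs\.(\w+)\.outputs\.\w+\s*\}\}")


def execution_order(jobs: dict) -> list:
    """Derive a valid run order from the input bindings of each job."""
    graph = {
        name: {m.group(1) for binding in inputs.values() for m in UPSTREAM.finditer(binding)}
        for name, inputs in jobs.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(jobs))
```

Jobs bound only to pipeline-level inputs, like `prep_job` here, have no upstream jobs and can start first; a cycle in the bindings would make `static_order()` raise `graphlib.CycleError`.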


You can first test with the sample data and then run with the full-size data by switching the pipeline input path from local to an Azure ML datastore.

In [6]:
!az ml job create --file 3_pipeline_job_run_on_cloud/3a_pipeline_sample_data.yml --web

Command group 'ml job' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
{
  "creation_context": {
    "created_at": "2022-04-29T04:50:43.592029+00:00",
    "created_by": "Yijun Zhang",
    "created_by_type": "User"
  },
  "display_name": "heroic_key_k2020pd2xd",
  "experiment_name": "nyc_taxi_regression",
  "id": "azureml:/subscriptions/ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9/resourceGroups/pipeline-pm/providers/Microsoft.MachineLearningServices/workspaces/pm-dev/jobs/heroic_key_k2020pd2xd",
  "inputs": {
    "pipeline_job_input": {
      "mode": "ro_mount",
      "path": "azureml:azureml://datastores/workspaceblobstore/paths/LocalUpload/1c2d0b4908fe99afe7e5d4d1e5af23e9/sample_data/",
      "type": "uri_folder"
    }
  },
  "jobs": {
    "predict_job": {
      "$schema": "{}",
      "code": "{}",
      "command": "{}",
      "component": "azureml:ce1ecdbb-ebcf-92e2-840d-386a32eefc9d:1",
      "environment_variables": {},
      "input

In [11]:
!az ml job create --file 3_pipeline_job_run_on_cloud/3b_pipeline_full_data.yml --web

Command group 'ml job' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
{
  "creation_context": {
    "created_at": "2022-04-29T05:46:00.373038+00:00",
    "created_by": "Yijun Zhang",
    "created_by_type": "User"
  },
  "display_name": "lemon_parsnip_ndvm7ktyj3",
  "experiment_name": "nyc_taxi_regression",
  "id": "azureml:/subscriptions/ee85ed72-2b26-48f6-a0e8-cb5bcf98fbd9/resourceGroups/pipeline-pm/providers/Microsoft.MachineLearningServices/workspaces/pm-dev/jobs/lemon_parsnip_ndvm7ktyj3",
  "inputs": {
    "pipeline_job_input": {
      "mode": "ro_mount",
      "path": "azureml:azureml://datastores/workspaceblobstore/paths/nyc_taxi_data/full_data",
      "type": "uri_folder"
    }
  },
  "jobs": {
    "predict_job": {
      "$schema": "{}",
      "code": "{}",
      "command": "{}",
      "component": "azureml:d6b35a61-b293-44fe-0b58-f316c0087bb5:1",
      "environment_variables": {},
      "inputs": {
        "model_input":

The command is the same as for the standalone job, except that this time you are taken to the pipeline job detail page, where you can see your pipeline graph.

![pipeline detail page](images/pipeline_detail_page.png)