Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Linear Regression for NYC Taxi Datasets with AzureML Pipelines 

In this notebook, you learn how to prepare data for regression modeling. You run various transformations to filter and combine two different NYC taxi datasets
using [Azure Machine Learning Pipelines](https://aka.ms/aml-pipelines). After the data processing steps, you add a final custom training step to train a regression model. The trained model is then registered in the AzureML workspace.

## Prerequisites

* [Setup Azure Arc-enabled Machine Learning Training and Inferencing on AKS on Azure Stack HCI](https://github.com/Azure/AML-Kubernetes/tree/master/docs/AKS-HCI/AML-ARC-Compute.md)

* Last but not least, you need to be able to run a notebook. The `azureml-core`, `azureml-opendatasets`, `numpy`, `matplotlib`, and `requests` packages are required.

   If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration notebook in the [MachineLearningNotebooks repository](https://github.com/Azure/MachineLearningNotebooks) first. This sets you up with a working config file that has information on your workspace, subscription id, etc.

## Initialize AzureML workspace

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`. 

If you haven't done so already, open the `config.json` file and fill in your workspace information.

In [2]:
from azureml.core.workspace import Workspace,  ComputeTarget
from azureml.exceptions import ComputeTargetException

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Workspace name: amlarc-sample-test-ws-eastus
Azure region: eastus
Subscription id: 86204643-5a96-427b-b6bb-b35b2bd6e6ce
Resource group: AKS-HCI2


## Download data for regression modeling

First, you will prepare data for regression modeling, using Azure Open Datasets for convenience. Run `pip install azureml-opendatasets` to get the Open Datasets package. The package contains a class representing each data source (`NycTlcGreen` and `NycTlcYellow`), which makes it easy to filter by date before downloading.


### Fetch data
Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes, to avoid a `MemoryError` with large datasets. To download a year of taxi data, iteratively fetch one month at a time and, before appending it to `green_df_raw`, randomly sample `sample_size` records from each month to avoid bloating the dataframe. Then preview the data. To keep this process short, this notebook samples only one month of data (`number_of_months = 1`).
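
As a standalone illustration of the month-at-a-time iteration described above, the sketch below generates the start and end date of each calendar month using only the standard library. The helper name `month_windows` is hypothetical and not part of Open Datasets; the notebook itself shifts a fixed start and end date with `relativedelta`, which is a different way to reach the same windows.

```python
import calendar
from datetime import datetime


def month_windows(year, first_month, count):
    """Yield (start, end) datetimes covering `count` consecutive calendar months."""
    for offset in range(count):
        # Roll the month index over into the next year once it passes December.
        index = (first_month - 1) + offset
        y, m = year + index // 12, index % 12 + 1
        last_day = calendar.monthrange(y, m)[1]
        yield datetime(y, m, 1), datetime(y, m, last_day)


windows = list(month_windows(2016, 1, 3))
for start, end in windows:
    print(start.date(), "->", end.date())
```

Each `(start, end)` pair could be passed to a dataset class one month at a time, with the sampled result appended to the running dataframe.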


In [3]:
from azureml.opendatasets import NycTlcGreen, NycTlcYellow
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

green_df_raw = pd.DataFrame([])
start = datetime.strptime("1/1/2016","%m/%d/%Y")
end = datetime.strptime("1/31/2016","%m/%d/%Y")

number_of_months = 1
sample_size = 5000

for sample_month in range(number_of_months):
    temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
        .to_pandas_dataframe()
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    green_df_raw = pd.concat([green_df_raw, temp_df_green.sample(sample_size)])

[Info] read from C:\Users\jiadu\AppData\Local\Temp\tmp2riuqlb1\https%3A\%2Fazureopendatastorage.azurefd.net\nyctlc\green\puYear=2016\puMonth=1\part-00119-tid-4753095944193949832-fee7e113-666d-4114-9fcb-bcd3046479f3-2689-1.c000.snappy.parquet


In [4]:
yellow_df_raw = pd.DataFrame([])
start = datetime.strptime("1/1/2016","%m/%d/%Y")
end = datetime.strptime("1/31/2016","%m/%d/%Y")

sample_size = 500

for sample_month in range(number_of_months):
    temp_df_yellow = NycTlcYellow(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
        .to_pandas_dataframe()
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    yellow_df_raw = pd.concat([yellow_df_raw, temp_df_yellow.sample(sample_size)])

[Info] read from C:\Users\jiadu\AppData\Local\Temp\tmptfo2y6qe\https%3A\%2Fazureopendatastorage.azurefd.net\nyctlc\yellow\puYear=2016\puMonth=1\part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-90.c000.snappy.parquet
[Info] read from C:\Users\jiadu\AppData\Local\Temp\tmptfo2y6qe\https%3A\%2Fazureopendatastorage.azurefd.net\nyctlc\yellow\puYear=2016\puMonth=1\part-00001-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426336-89.c000.snappy.parquet
[Info] read from C:\Users\jiadu\AppData\Local\Temp\tmptfo2y6qe\https%3A\%2Fazureopendatastorage.azurefd.net\nyctlc\yellow\puYear=2016\puMonth=1\part-00002-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426334-91.c000.snappy.parquet
[Info] read from C:\Users\jiadu\AppData\Local\Temp\tmptfo2y6qe\https%3A\%2Fazureopendatastorage.azurefd.net\nyctlc\yellow\puYear=2016\puMonth=1\part-00003-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426340-87.c000.snappy.parquet
[Info] read from

### See the data

In [5]:
from IPython.display import display

display(green_df_raw.head(5))
display(yellow_df_raw.head(5))

| | vendorID | lpepPickupDatetime | lpepDropoffDatetime | passengerCount | tripDistance | puLocationId | doLocationId | pickupLongitude | pickupLatitude | dropoffLongitude | ... | paymentType | fareAmount | extra | mtaTax | improvementSurcharge | tipAmount | tollsAmount | ehailFee | totalAmount | tripType |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 405733 | 1 | 2016-01-04 20:32:02 | 2016-01-04 20:37:52 | 1 | 0.5 | | | -73.895576 | 40.746532 | -73.889244 | ... | 1 | 4.5 | 0.5 | 0.5 | 0.3 | 0.0 | 0.0 | | 5.8 | 1.0 |
| 719159 | 2 | 2016-01-30 10:22:19 | 2016-01-30 10:28:59 | 1 | 2.66 | | | -73.921677 | 40.766869 | -73.893593 | ... | 1 | 10.0 | 0.0 | 0.5 | 0.3 | 1.0 | 0.0 | | 11.8 | 1.0 |
| 1250163 | 2 | 2016-01-15 09:49:29 | 2016-01-15 10:16:30 | 1 | 6.26 | | | -73.974739 | 40.687874 | -73.969604 | ... | 1 | 24.5 | 0.0 | 0.5 | 0.3 | 5.06 | 0.0 | | 30.36 | 1.0 |
| 324106 | 2 | 2016-01-28 13:10:07 | 2016-01-28 13:23:29 | 1 | 5.13 | | | -73.844246 | 40.721375 | -73.75383 | ... | 2 | 17.0 | 0.0 | 0.5 | 0.3 | 0.0 | 0.0 | | 17.8 | 1.0 |
| 1064446 | 2 | 2016-01-25 17:31:42 | 2016-01-25 17:53:06 | 1 | 3.3 | | | -73.949142 | 40.785168 | -73.938004 | ... | 1 | 15.0 | 1.0 | 0.5 | 0.3 | 3.36 | 0.0 | | 20.16 | 1.0 |


| | vendorID | tpepPickupDateTime | tpepDropoffDateTime | passengerCount | tripDistance | puLocationId | doLocationId | startLon | startLat | endLon | ... | rateCodeId | storeAndFwdFlag | paymentType | fareAmount | extra | mtaTax | improvementSurcharge | tipAmount | tollsAmount | totalAmount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 40494 | 1 | 2016-01-03 23:54:43 | 2016-01-04 00:07:17 | 1 | 3.6 | | | -73.986992 | 40.760986 | -73.953041 | ... | 1 | N | 2 | 13.5 | 0.5 | 0.5 | 0.3 | 0.0 | 0.0 | 14.8 |
| 38392 | 1 | 2016-01-03 17:13:52 | 2016-01-03 17:38:07 | 2 | 5.7 | | | -73.994987 | 40.749947 | -74.008118 | ... | 1 | N | 1 | 21.0 | 0.0 | 0.5 | 0.3 | 1.5 | 0.0 | 23.3 |
| 74255 | 1 | 2016-01-09 09:08:09 | 2016-01-09 09:17:40 | 1 | 4.6 | | | -73.944206 | 40.792145 | -73.936882 | ... | 1 | N | 2 | 15.0 | 0.0 | 0.5 | 0.3 | 0.0 | 0.0 | 15.8 |
| 270873 | 2 | 2016-01-09 03:21:37 | 2016-01-09 03:48:56 | 1 | 10.85 | | | -73.984688 | 40.763969 | -73.972176 | ... | 1 | N | 1 | 33.0 | 0.5 | 0.5 | 0.3 | 7.97 | 5.54 | 47.81 |
| 70525 | 1 | 2016-01-09 00:52:18 | 2016-01-09 00:58:42 | 2 | 0.9 | | | -73.985779 | 40.738205 | -74.001984 | ... | 1 | N | 2 | 6.0 | 0.5 | 0.5 | 0.3 | 0.0 | 0.0 | 7.3 |


### Save data locally

In [6]:
import os
dataDir = "data"

if not os.path.exists(dataDir):
    os.mkdir(dataDir)

greenDir = dataDir + "/green"
yellowDir = dataDir + "/yellow"

if not os.path.exists(greenDir):
    os.mkdir(greenDir)
    
if not os.path.exists(yellowDir):
    os.mkdir(yellowDir)
    
greenTaxiData = greenDir + "/unprepared.csv"
yellowTaxiData = yellowDir + "/unprepared.csv"

green_df_raw.to_csv(greenTaxiData, index=False)
yellow_df_raw.to_csv(yellowTaxiData, index=False)

print("Data written to local folder.")

Data written to local folder.
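
The directory setup above checks and creates each folder separately. As a more compact sketch of the same idea, `pathlib` can create nested folders in one call and tolerate folders that already exist. The helper name `make_taxi_dirs` is hypothetical, and a temporary directory stands in for the notebook's `data` folder so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def make_taxi_dirs(root):
    """Create <root>/green and <root>/yellow and return the CSV path for each."""
    paths = {}
    for colour in ("green", "yellow"):
        folder = Path(root) / colour
        # parents=True creates <root> as needed; exist_ok=True makes reruns safe.
        folder.mkdir(parents=True, exist_ok=True)
        paths[colour] = folder / "unprepared.csv"
    return paths


with tempfile.TemporaryDirectory() as tmp:
    csv_paths = make_taxi_dirs(Path(tmp) / "data")
    created = sorted(p.parent.name for p in csv_paths.values() if p.parent.is_dir())
    print(created)
```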


## Prepare the dataset

Your next step is to upload these files to the workspace's datastore and then register them as datasets in the workspace.

`datastore_name` is the name of the datastore you set up in [this step](https://github.com/Azure/AML-Kubernetes/blob/master/docs/ASH/Train-AzureArc.md).

Uploading the files and registering the datasets takes less than one minute.

In [9]:
from azureml.core import Workspace, Datastore, Dataset

# datastore_name = "<NAME_OF_ASH_HOSTED_AML_DATASTORE>"
def_data_store = ws.get_default_datastore()
datastore_name = def_data_store.name
datastore =  Datastore.get(ws, datastore_name)

dataDir = "data"
greenDir, yellowDir  = dataDir + "/green", dataDir + "/yellow"
greenTaxiData, yellowTaxiData = greenDir + "/unprepared.csv", yellowDir + "/unprepared.csv"

datastore.upload_files([greenTaxiData], 
                           target_path = 'green', 
                           overwrite = True, 
                           show_progress = True)

datastore.upload_files([yellowTaxiData], 
                           target_path = 'yellow', 
                           overwrite = True, 
                           show_progress = True)

print("Upload calls completed.")

# register datasets

dataset_name,  target_path = 'green_taxi_data_o_file', 'green/unprepared.csv'
datastore_paths = [(datastore, target_path)]
green_taxi_data = Dataset.File.from_files(path=datastore_paths)
green_taxi_data.register(ws, dataset_name)

dataset_name,  target_path = 'yellow_taxi_data_o_file', 'yellow/unprepared.csv'
datastore_paths = [(datastore, target_path)]
yellow_taxi_data = Dataset.File.from_files(path=datastore_paths)
yellow_taxi_data.register(ws, dataset_name)

Uploading an estimated of 1 files
Uploading data/green/unprepared.csv
Uploaded data/green/unprepared.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Uploading an estimated of 1 files
Uploading data/yellow/unprepared.csv
Uploaded data/yellow/unprepared.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Upload calls completed.


{
  "source": [
    "('workspaceblobstore', 'yellow/unprepared.csv')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "de1355e1-0ca6-44b4-a685-f9ab9a1876f8",
    "name": "yellow_taxi_data_o_file",
    "version": 1,
    "workspace": "Workspace.create(name='amlarc-sample-test-ws-eastus', subscription_id='86204643-5a96-427b-b6bb-b35b2bd6e6ce', resource_group='AKS-HCI2')"
  }
}

## Setup compute target

Find the attach name of the Arc-enabled Azure Stack Hub Kubernetes cluster in your AzureML workspace, and use it to create a ComputeTarget.

`attach_name` is the name under which you attached your ASH cluster in [this step](https://github.com/Azure/AML-Kubernetes/blob/master/docs/ASH/AML-ARC-Compute.md).

In [10]:
from azureml.core.compute import KubernetesCompute

# attach_name = "<NAME_OF_AML_ATTACHED_COMPUTE_OF_YOUR_ASH_CLUSTER>"
attach_name = "arc-compute3"
arcK_compute = KubernetesCompute(ws, attach_name)

Class KubernetesCompute: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


### Define RunConfig for the compute target
The pipeline steps also use `pandas`, `scikit-learn`, and `pyarrow`. Define the `runconfig` with these dependencies.

In [11]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Create a new runconfig object
aml_run_config = RunConfiguration()

# Use the arcK_compute you created above. 
aml_run_config.target = arcK_compute

# Enable Docker
aml_run_config.environment.docker.enabled = True

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
aml_run_config.environment.python.user_managed_dependencies = False

# Specify CondaDependencies obj, add necessary packages
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas','scikit-learn'], 
    pip_packages=['azureml-sdk', 'pyarrow'])

print ("Run configuration created.")

'enabled' is deprecated. Please use the azureml.core.runconfig.DockerConfiguration object with the 'use_docker' param instead.


Run configuration created.


## Processing data with pipeline steps

Data processing consists of the following pipeline steps, each implemented as a `PythonScriptStep`:

1. Cleanse Green Taxi Data
2. Cleanse Yellow Taxi Data
3. Merge cleansed Green and Yellow Taxi Data
4. Filter Taxi Data
5. Normalize Taxi Data
6. Transform Taxi Data
7. Train Test Data Split

### Cleanse Taxi Data

Here a set of "useful" columns for both Green and Yellow taxi data are defined:

In [12]:
useful_columns = str(["cost", "distance", "dropoff_datetime", "dropoff_latitude", 
                      "dropoff_longitude", "passengers", "pickup_datetime", 
                      "pickup_latitude", "pickup_longitude", "store_forward", "vendor"]).replace(",", ";")

#### Cleanse Green taxi data

In [13]:
from azureml.pipeline.core import PipelineData
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep
import os

# python scripts folder
prepare_data_folder = './scripts/prepdata'

# rename columns as per Azure Machine Learning NYC Taxi tutorial

green_columns = {
        "vendorID": "vendor",
        "lpepPickupDatetime": "pickup_datetime",
        "lpepDropoffDatetime": "dropoff_datetime",
        "storeAndFwdFlag": "store_forward",
        "pickupLongitude": "pickup_longitude",
        "pickupLatitude": "pickup_latitude",
        "dropoffLongitude": "dropoff_longitude",
        "dropoffLatitude": "dropoff_latitude",
        "passengerCount": "passengers",
        "fareAmount": "cost",
        "tripDistance": "distance"
    }

green_columns_key = str(list(green_columns.keys())).replace(",", ";")
green_columns_value = str(list(green_columns.values())).replace(",", ";")
    
# Define output after cleansing step
dest = (datastore, None)
cleansed_green_data = OutputFileDatasetConfig(name="cleansed_green_data", destination=dest).as_upload(overwrite=False)

print('Cleanse script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# cleansing step creation
# See the cleanse.py for details
cleansingStepGreen = PythonScriptStep(
    name="Cleanse Green Taxi Data",
    script_name="cleanse.py", 
    arguments=["--data-path", green_taxi_data.as_named_input('raw_data').as_mount(),
               "--useful_columns", useful_columns,
               "--columns_key", green_columns_key,
               "--columns_value", green_columns_value,
               "--output_cleanse", cleansed_green_data],
    outputs=[cleansed_green_data],
    compute_target=arcK_compute,
    runconfig=aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("cleansingStepGreen created.")

Cleanse script is in E:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\scripts\prepdata.
cleansingStepGreen created.
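
The column lists above are passed to `cleanse.py` as single strings, built by taking the list's string form and replacing commas with semicolons. The sketch below shows that encoding together with one possible way to decode it on the script side. `decode_columns` is a hypothetical helper written for illustration; it is an assumption about how such a string could be parsed, not a copy of what `cleanse.py` does.

```python
def encode_columns(columns):
    """Same encoding as the notebook: list repr with commas swapped for semicolons."""
    return str(columns).replace(",", ";")


def decode_columns(encoded):
    """Hypothetical inverse: strip the brackets, split on ';', drop quotes and spaces."""
    inner = encoded.strip().strip("[]")
    return [item.strip().strip("'\"") for item in inner.split(";") if item.strip()]


columns = ["cost", "distance", "pickup_datetime"]
encoded = encode_columns(columns)
decoded = decode_columns(encoded)
print(encoded)
print(decoded)
```

Note that this round trip only works for column names that contain no commas, semicolons, or quotes, which holds for the names used in this notebook.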


#### Cleanse Yellow taxi data

In [14]:
yellow_columns = {
    "vendorID": "vendor",
    "tpepPickupDateTime": "pickup_datetime",
    "tpepDropoffDateTime": "dropoff_datetime",
    "storeAndFwdFlag": "store_forward",
    "startLon": "pickup_longitude",
    "startLat": "pickup_latitude",
    "endLon": "dropoff_longitude",
    "endLat": "dropoff_latitude",
    "passengerCount": "passengers",
    "fareAmount": "cost",
    "tripDistance": "distance"
}

yellow_columns_key = str(list(yellow_columns.keys())).replace(",", ";")
yellow_columns_value = str(list(yellow_columns.values())).replace(",", ";")
    
# Define output after cleansing step
dest = (datastore, None)
cleansed_yellow_data = OutputFileDatasetConfig(name="cleansed_yellow_data", destination=dest).as_upload(overwrite=False)

# cleansing step creation
# See the cleanse.py for details about input and output
cleansingStepYellow = PythonScriptStep(
    name="Cleanse Yellow Taxi Data",
    script_name="cleanse.py", 
    arguments=["--data-path", yellow_taxi_data.as_named_input('raw_data').as_mount(),
               "--useful_columns", useful_columns,
               "--columns_key", yellow_columns_key,
               "--columns_value", yellow_columns_value,
               "--output_cleanse", cleansed_yellow_data],
    compute_target=arcK_compute,
    runconfig=aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("cleansingStepYellow created.")

cleansingStepYellow created.


### Merge cleansed Green and Yellow datasets

Create a single data source by merging the cleansed versions of Green and Yellow taxi data.

In [15]:
# Define output after merging step
merged_data = OutputFileDatasetConfig(name="merged_data", destination=dest).as_upload(overwrite=False)


print('Merge script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# merging step creation
# See the merge.py for details about input and output
mergingStep = PythonScriptStep(
    name="Merge Taxi Data",
    script_name="merge.py", 
    arguments=["--green_data_path", cleansed_green_data.as_input(),
               "--yellow_data_path", cleansed_yellow_data.as_input(),
               "--output_merge", merged_data],
    compute_target=arcK_compute,
    runconfig=aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("mergingStep created.")

Merge script is in E:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\scripts\prepdata.
mergingStep created.


### Filter taxi data

This step filters out coordinates for locations that are outside the city border. We use a TypeConverter object to change the latitude and longitude fields to decimal type. 

In [16]:
# Define output after filter step
filtered_data = OutputFileDatasetConfig(name="filtered_data", destination=dest).as_upload(overwrite=False)

print('Filter script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# filter step creation
# See the filter.py for details about input and output
filterStep = PythonScriptStep(
    name="Filter Taxi Data",
    script_name="filter.py", 
    arguments=["--data-path", merged_data.as_input(),
               "--output_filter", filtered_data],
    compute_target=arcK_compute,
    runconfig = aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("FilterStep created.")

Filter script is in E:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\scripts\prepdata.
FilterStep created.
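
To make the idea of the filter step concrete, here is a minimal sketch of a bounding-box check on pickup coordinates, including the string-to-decimal conversion mentioned above. The bounds are illustrative assumptions and are not taken from `filter.py`; the second and third rows are made-up examples of records that should be rejected.

```python
# Illustrative bounding box around New York City; an assumption, not filter.py's values.
LAT_MIN, LAT_MAX = 40.53, 40.88
LON_MIN, LON_MAX = -74.09, -73.72


def in_city(lat, lon):
    """Return True when the coordinate falls inside the bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


trips = [
    {"pickup_latitude": "40.746532", "pickup_longitude": "-73.895576"},
    {"pickup_latitude": "41.500000", "pickup_longitude": "-73.895576"},  # made up: too far north
    {"pickup_latitude": "0", "pickup_longitude": "0"},  # made up: placeholder for a missing GPS fix
]

# Convert the text fields to floats before comparing, then keep in-bounds rows only.
kept = [
    trip for trip in trips
    if in_city(float(trip["pickup_latitude"]), float(trip["pickup_longitude"]))
]
print(f"kept {len(kept)} of {len(trips)} trips")
```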


### Normalize taxi data
In this step, the pickup and dropoff datetime values are split into separate date and time columns. The columns are then renamed to meaningful names.

In [17]:
# Define output after normalize step
normalized_data = OutputFileDatasetConfig(name="normalized_data", destination=dest).as_upload(overwrite=False)

print('Normalize script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# normalize step creation
# See the normalize.py for details about input and output
normalizeStep = PythonScriptStep(
    name="Normalize Taxi Data",
    script_name="normalize.py", 
    arguments=["--data-path", filtered_data.as_input(),
               "--output_normalize", normalized_data],
    compute_target=arcK_compute,
    runconfig = aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("normalizeStep created.")

Normalize script is in E:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\scripts\prepdata.
normalizeStep created.
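
As a small illustration of the split performed in this step, the sketch below separates one timestamp into a date part and a time part. It assumes the `YYYY-MM-DD HH:MM:SS` format seen in the data preview; `split_datetime` is a hypothetical helper, and `normalize.py` may implement the split differently.

```python
from datetime import datetime


def split_datetime(value):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into (date, time) strings."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return parsed.date().isoformat(), parsed.time().isoformat()


pickup_date, pickup_time = split_datetime("2016-01-04 20:32:02")
print(pickup_date, pickup_time)
```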


### Transform taxi data
Transform the normalized taxi data into the final required format. This step does the following:

- Splits the pickup and dropoff dates further into day of the week, day of the month, and month values.
- Derives the day of the week with the derive_column_by_example() function. The function takes an array parameter of example objects that define the input data and the preferred output, and automatically determines the preferred transformation. For the pickup and dropoff time columns, it splits the time into hour, minute, and second by using the split_column_by_example() function with no example parameter.
- After the new features are generated, uses the drop_columns() function to delete the original fields, as the newly generated features are preferred.
- Renames the rest of the fields to use meaningful descriptions.

In [18]:
# Define output after transform step
transformed_data = OutputFileDatasetConfig(name="transformed_data", destination=dest).as_upload(overwrite=False)

print('Transform script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# transform step creation
# See the transform.py for details about input and output
transformStep = PythonScriptStep(
    name="Transform Taxi Data",
    script_name="transform.py", 
    arguments=["--data-path", normalized_data.as_input(),
               "--output_transform", transformed_data],
    compute_target=arcK_compute,
    runconfig = aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("transformStep created.")

Transform script is in E:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\scripts\prepdata.
transformStep created.
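
The feature derivation described for this step can be sketched with the standard library alone. The example below turns one pickup timestamp into weekday, day-of-month, month, hour, minute, and second features. Only `pickup_weekday` and `pickup_hour` appear later in this notebook; the other column names and the helper `derive_time_features` are assumptions for illustration, and `transform.py` may work differently.

```python
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def derive_time_features(value, prefix):
    """Derive calendar and clock features from a 'YYYY-MM-DD HH:MM:SS' string."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return {
        f"{prefix}_weekday": WEEKDAYS[parsed.weekday()],
        f"{prefix}_monthday": parsed.day,
        f"{prefix}_month": parsed.month,
        f"{prefix}_hour": parsed.hour,
        f"{prefix}_minute": parsed.minute,
        f"{prefix}_second": parsed.second,
    }


features = derive_time_features("2016-01-04 20:32:02", "pickup")
print(features)
```

Once features like these exist, the original datetime column carries no extra information for the model and can be dropped.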


### Split the data into train and test sets

This step splits the data into one dataset for model training and one for testing.

In [19]:
train_model_folder = './scripts/trainmodel'

# train and test splits output
output_split_train = OutputFileDatasetConfig(name="output_split_train", destination=dest).as_upload(overwrite=False)
output_split_test = OutputFileDatasetConfig(name="output_split_test", destination=dest).as_upload(overwrite=False)

output_split_test.register_on_complete(name='nyc_taxi_test_set', 
                                       description='test set  from Train Test Data Split')

print('Data split script is in {}.'.format(os.path.realpath(train_model_folder)))

# test train split step creation
# See the train_test_split.py for details about input and output
testTrainSplitStep = PythonScriptStep(
    name="Train Test Data Split",
    script_name="train_test_split.py", 
    arguments=["--data-path", transformed_data.as_input(),
               "--output_split_train", output_split_train,
               "--output_split_test", output_split_test],
    compute_target=arcK_compute,
    runconfig = aml_run_config,
    source_directory=train_model_folder,
    allow_reuse=True
)

print("testTrainSplitStep created.")

Data split script is in E:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\scripts\trainmodel.
testTrainSplitStep created.
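
For readers who want to see what a train/test split amounts to, here is a minimal, dependency-free sketch: shuffle the rows with a fixed seed, then cut off a fraction for testing. The helper `split_rows`, the 20% test fraction, and the seed are all assumptions; `train_test_split.py` is not shown here and may well rely on scikit-learn instead.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten transformed taxi records
train, test = split_rows(rows, test_fraction=0.2, seed=42)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed keeps the split reproducible between runs, which matters when a pipeline step is re-executed and its results are compared.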


### Define a custom training step

In [20]:
train_step = PythonScriptStep(
        name="train_step",
        script_name="train_step.py",
        arguments=["--train_data_path", output_split_train.as_input(),
                   "--test_data_path", output_split_test.as_input(),
        ],
      
        compute_target=arcK_compute,
        runconfig=aml_run_config,
        source_directory=train_model_folder,
        allow_reuse=True
    )

print("train_step created.")

train_step created.


## Create an experiment

In [21]:
from azureml.core import Experiment

experiment = Experiment(ws, 'NYCTaxi_Tutorial_Pipelines')

print("Experiment created")

Experiment created


### Build and run the pipeline

In [22]:
from azureml.pipeline.core import Pipeline

pipeline_steps = [train_step]

pipeline = Pipeline(workspace = ws, steps=pipeline_steps)
print("Pipeline is built.")

pipeline_run = experiment.submit(pipeline, regenerate_outputs=False)

print("Pipeline submitted for execution.")

pipeline_run.wait_for_completion()

Pipeline is built.
Created step train_step [732fe391][6a5373cd-fb70-44f9-9ea5-c10cdc1b12da], (This step will run and generate new outputs)
Created step Train Test Data Split [8bc2e8d4][982a0907-98f9-498c-a0db-042e9bbd6514], (This step will run and generate new outputs)
Created step Transform Taxi Data [0173ee30][6a15263a-d7d6-43b8-8d84-c1d542940dd9], (This step will run and generate new outputs)
Created step Normalize Taxi Data [24bd4890][7696f53a-ef4f-47cf-a65c-cb13cd74e1e4], (This step will run and generate new outputs)
Created step Filter Taxi Data [23aa11b1][70391ef1-d7ef-4a79-9286-74fb4f19cfc6], (This step will run and generate new outputs)
Created step Merge Taxi Data [697c34df][3d9a07e1-e15c-448f-b38f-180f9ea07ab3], (This step will run and generate new outputs)
Created step Cleanse Green Taxi Data [fe604d49][745dcc1e-612d-4406-b63a-87dc5cd8de02], (This step will run and generate new outputs)
Created step Cleanse Yellow Taxi Data [7fc00336][9cb5e7b4-da00-425e-9bd3-9e7522a5c5e4], (

'Finished'
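
Only `train_step` is passed to `Pipeline`, yet the log above lists all eight steps: the upstream steps are pulled in through the data each step consumes. The sketch below is a toy model of that kind of dependency resolution, using the step and output names from this notebook. It is an illustration of the general idea, not AzureML's implementation, and `upstream_closure` is a hypothetical helper.

```python
# step name -> (outputs it consumes, outputs it produces), mirroring the notebook
STEPS = {
    "Cleanse Green Taxi Data": ([], ["cleansed_green_data"]),
    "Cleanse Yellow Taxi Data": ([], ["cleansed_yellow_data"]),
    "Merge Taxi Data": (["cleansed_green_data", "cleansed_yellow_data"], ["merged_data"]),
    "Filter Taxi Data": (["merged_data"], ["filtered_data"]),
    "Normalize Taxi Data": (["filtered_data"], ["normalized_data"]),
    "Transform Taxi Data": (["normalized_data"], ["transformed_data"]),
    "Train Test Data Split": (["transformed_data"], ["output_split_train", "output_split_test"]),
    "train_step": (["output_split_train", "output_split_test"], []),
}


def upstream_closure(target):
    """Return every step needed to run `target`, in a valid execution order."""
    producers = {out: name for name, (_, outs) in STEPS.items() for out in outs}
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        # Visit the producer of each consumed output before the step itself.
        for needed in STEPS[name][0]:
            visit(producers[needed])
        ordered.append(name)

    visit(target)
    return ordered


order = upstream_closure("train_step")
for position, name in enumerate(order, start=1):
    print(position, name)
```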

### Register the model

Register the trained model.

In [23]:
train_step_run = pipeline_run.find_step_run(train_step.name)[0]

registered_model_name='taxi_model'
train_step_run.register_model(model_name=registered_model_name, model_path='outputs/taxi.pkl')

Model(workspace=Workspace.create(name='amlarc-sample-test-ws-eastus', subscription_id='86204643-5a96-427b-b6bb-b35b2bd6e6ce', resource_group='AKS-HCI2'), name=taxi_model, id=taxi_model:1, version=1, tags={}, properties={})

### Get the model

In [25]:
from azureml.core.model import Model

model_name = registered_model_name
model = Model(ws, model_name)
model_id = f"azureml:{model.name}:{model.version}"
print(f"Get {model.name}, latest version {model.version}, id in deployment.yml: {model_id}")

Get taxi_model, latest version 1, id in endpoint.yml: azureml:taxi_model:1


The machine learning model named "taxi_model" should be registered in your AML workspace.

## Test Registered Model

To test the trained model, you can serve it from the AKS-HCI cluster using an AML deployment.

### Deploy the model

In [26]:
# endpoint = '<tax-model endpoint name>'
endpoint = 'tax-model-jiadu'

import os
from pathlib import Path
prefix = Path(os.getcwd())
endpoint_file = str(prefix.joinpath("endpoint.yml"))
deployment_file = str(prefix.joinpath("deployment.yml"))
print(f"Using Endpoint file: {endpoint_file}, Deployment file: {deployment_file} please replace <modelId> (e.g. azureml:taxi_model:1), <instanceTypeName> (e.g. defaultInstanceType) and <computeTargetName> (e.g. azureml:amlarc-compute) according above output")

Using Endpoint file: e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\endpoint.yml, Deployment file: e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\deployment.yml please replace <modelId> (e.g. azureml:taxi_model:1), <instanceTypeName> (e.g. defaultInstanceType) and <computeTargetName> (e.g. azureml:amlarc-compute) according above output


You need to **replace the properties in deployment.yml**, including:
* `<modelId>`: example value: azureml:taxi_model:1
* `<instanceTypeName>`: example value: defaultInstanceType

You also need to **replace the properties in endpoint.yml**, including:
* `<computeTargetName>`: example value: azureml:amlarc-compute
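
If you prefer to fill in the placeholders from Python rather than by hand, a plain string substitution is enough. The sketch below is an assumption-laden illustration: the two-line template is a hypothetical fragment (the real `deployment.yml` is not shown in this notebook, and its keys may differ), and `fill_placeholders` is a helper invented for this example. The replacement values are the example values listed above.

```python
import re

# Hypothetical fragment for illustration only; not the actual deployment.yml.
template = "model: <modelId>\ninstance_type: <instanceTypeName>\n"


def fill_placeholders(text, values):
    """Replace each <name> placeholder and fail loudly if any placeholder remains."""
    for name, value in values.items():
        text = text.replace(f"<{name}>", value)
    leftover = re.findall(r"<\w+>", text)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return text


filled = fill_placeholders(
    template,
    {"modelId": "azureml:taxi_model:1", "instanceTypeName": "defaultInstanceType"},
)
print(filled)
```

Raising on leftover placeholders helps catch a forgotten value before the file is handed to the CLI.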

In [28]:
import helpers
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

Workspace name: amlarc-sample-test-ws-eastus
Azure region: eastus
Subscription id: 86204643-5a96-427b-b6bb-b35b2bd6e6ce
Resource group: AKS-HCI2


In [29]:
helpers.run(f"az ml online-endpoint create -n {endpoint} -f {endpoint_file} -w {ws.name} -g {ws.resource_group}")

START: az ml online-endpoint create -n tax-model-jiadu -f e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\endpoint.yml -w amlarc-sample-test-ws-eastus -g AKS-HCI2 @ 2021-11-24 15:11:26 (2021-11-24 07:11:26 UTC)
       using: C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.CMD (Windows 10 on AMD64)
       cwd: e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline
{
  "allow_public_access": true,
  "auth_mode": "aml_token",
  "compute": "azureml:/subscriptions/86204643-5a96-427b-b6bb-b35b2bd6e6ce/resourceGroups/AKS-HCI2/providers/Microsoft.MachineLearningServices/workspaces/amlarc-sample-test-ws-eastus/computes/arc-compute3",
  "id": "/subscriptions/86204643-5a96-427b-b6bb-b35b2bd6e6ce/resourceGroups/AKS-HCI2/providers/Microsoft.MachineLearningServices/workspaces/amlarc-sample-test-ws-eastus/onlineEndpoints/tax-model-jiadu",
  "identity": {
    "principal_id": "d5770823-9c62-4ec5-a42c-422357a7c18e",
    "tenant_id": "72f988bf-86f1-41af-91ab-2d

In [30]:
helpers.run(f"az ml online-endpoint show -n {endpoint} -w {ws.name} -g {ws.resource_group}")

START: az ml online-endpoint show -n tax-model-jiadu -w amlarc-sample-test-ws-eastus -g AKS-HCI2 @ 2021-11-24 15:12:17 (2021-11-24 07:12:17 UTC)
       using: C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.CMD (Windows 10 on AMD64)
       cwd: e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline
{
  "allow_public_access": true,
  "auth_mode": "aml_token",
  "compute": "azureml:/subscriptions/86204643-5a96-427b-b6bb-b35b2bd6e6ce/resourceGroups/AKS-HCI2/providers/Microsoft.MachineLearningServices/workspaces/amlarc-sample-test-ws-eastus/computes/arc-compute3",
  "id": "/subscriptions/86204643-5a96-427b-b6bb-b35b2bd6e6ce/resourceGroups/AKS-HCI2/providers/Microsoft.MachineLearningServices/workspaces/amlarc-sample-test-ws-eastus/onlineEndpoints/tax-model-jiadu",
  "identity": {
    "principal_id": "d5770823-9c62-4ec5-a42c-422357a7c18e",
    "tenant_id": "72f988bf-86f1-41af-91ab-2d7cd011db47",
    "type": "system_assigned"
  },
  "location": "eastus",
  "name": "tax-

In [31]:
helpers.run(f"az ml online-deployment create -n blue --endpoint {endpoint} -f {deployment_file} -w {ws.name} -g {ws.resource_group} --all-traffic")

START: az ml online-deployment create -n blue --endpoint tax-model-jiadu -f e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline\deployment.yml -w amlarc-sample-test-ws-eastus -g AKS-HCI2 --all-traffic @ 2021-11-24 15:12:27 (2021-11-24 07:12:27 UTC)
       using: C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.CMD (Windows 10 on AMD64)
       cwd: e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline
{
  "app_insights_enabled": true,
  "code_configuration": {
    "code": {
      "code_uri": "https://amlarcsamplete7068317793.blob.core.windows.net/azureml-blobstore-e6fafec8-81fb-4906-9685-77c5282597a0/LocalUpload/05472f5107cb927d0bf30d8d3334b32b/pipeline",
      "id": "azureml:/subscriptions/86204643-5a96-427b-b6bb-b35b2bd6e6ce/resourceGroups/AKS-HCI2/providers/Microsoft.MachineLearningServices/workspaces/amlarc-sample-test-ws-eastus/codes/d3665627-62ff-45aa-9b49-369e89805dd3/versions/1",
      "local_path": "E:\\github\\penorouzi\\AML-Kubernetes-1

### Test with inputs

A few instances of test data are stored in `test_set.csv` for convenience of testing.

In [32]:
# get score_url and access_token from AZ CLI
import helpers
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
cmd = f"az ml online-endpoint show -n {endpoint} -w {ws.name} -g {ws.resource_group}"
properties = helpers.run(cmd, return_output=True, no_output=True)

cmd = f"az ml online-endpoint get-credentials -n {endpoint} -w {ws.name} -g {ws.resource_group}"
credentials = helpers.run(cmd, return_output=True, no_output=True)

print(f"Got endpoint and credentials.")

START: az ml online-endpoint show -n tax-model-jiadu -w amlarc-sample-test-ws-eastus -g AKS-HCI2 @ 2021-11-24 15:24:10 (2021-11-24 07:24:10 UTC)
       using: C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.CMD (Windows 10 on AMD64)
       cwd: e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline
START: az ml online-endpoint get-credentials -n tax-model-jiadu -w amlarc-sample-test-ws-eastus -g AKS-HCI2 @ 2021-11-24 15:24:14 (2021-11-24 07:24:14 UTC)
       using: C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.CMD (Windows 10 on AMD64)
       cwd: e:\github\penorouzi\AML-Kubernetes-1\docs\AKS-HCI\notebooks\pipeline
Got endpoint and credentials.


In [33]:
import json
prop_response = json.loads(properties.replace(os.linesep,""))
score_uri = prop_response["scoring_uri"]

cred_response = json.loads(credentials.replace(os.linesep, ""))
access_token = cred_response["accessToken"]

In [37]:
import json
import pandas as pd

df_raw = pd.read_csv("test_set.csv")
selected_columns = ['pickup_weekday', 'pickup_hour', 'distance', 'passengers', 'vendor']
df = df_raw[selected_columns][:3]
input_np = df.to_numpy()

instances = json.dumps({"data": input_np.tolist()})

import requests
headers = {'Content-Type': 'application/json', 'Authorization': f"Bearer {access_token}"}
r = requests.post(score_uri, data=instances, headers=headers)
print(f"predicted_costs: {r.json()}")

# predicted_costs = service.run(instances)
# print(predicted_costs)

predicted_costs: [8.28844153742305, 22.987604999054316, 7.808465764846998]


In [38]:
# Real costs

df_raw["cost"][:3].to_numpy().tolist()

[10.5, 20.0, 7.5]
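
To put a number on how close the predictions are to the real costs, one simple option is the mean absolute error. The sketch below computes it for the three predicted and real values printed above. Choosing this metric is an assumption made for illustration; the notebook itself only shows the two lists side by side.

```python
# Values copied from the two outputs above.
predicted = [8.28844153742305, 22.987604999054316, 7.808465764846998]
actual = [10.5, 20.0, 7.5]


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae = mean_absolute_error(actual, predicted)
print(f"MAE over {len(actual)} trips: {mae:.2f}")
```

For these three trips the error averages about 1.84 in fare units. Three samples are far too few to judge the model, so treat this as a smoke test of the endpoint rather than an evaluation.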