# NYC Taxi Data Regression Model
This is an [Azure Machine Learning Pipelines](https://aka.ms/aml-pipelines) version of two-part tutorial ([Part 1](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-data-prep), [Part 2](https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-auto-train-models)) available for Azure Machine Learning.

You can combine the two part tutorial into one using AzureML Pipelines as Pipelines provide a way to stitch together various steps involved (like data preparation and training in this case) in a machine learning workflow.

In this notebook, you learn how to prepare data for regression modeling by using the [Azure Machine Learning Data Prep SDK](https://aka.ms/data-prep-sdk) for Python. You run various transformations to filter and combine two different NYC taxi data sets. Once you prepare the NYC taxi data for regression modeling, then you will use [AutoMLStep](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlstep?view=azure-ml-py) available with [Azure Machine Learning Pipelines](https://aka.ms/aml-pipelines) to define your machine learning goals and constraints as well as to launch the automated machine learning process. The automated machine learning technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.

After you complete building the model, you can predict the cost of a taxi trip by training a model on data features. These features include the pickup day and time, the number of passengers, and the pickup location.

## Prerequisite
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you go through the configuration Notebook located at https://github.com/Azure/MachineLearningNotebooks first if you haven't. This sets you up with a working config file that has information on your workspace, subscription id, etc.

We will run various transformations to filter and combine two different NYC taxi data sets. We will use DataPrep SDK for this preparing data. 

Perform `pip install azureml-dataprep` if you have't already done so.

## Prepare data for regression modeling
First, we will prepare data for regression modeling. We will leverage the convenience of Azure Open Datasets along with the power of Azure Machine Learning service to create a regression model to predict NYC taxi fare prices. Perform `pip install azureml-opendatasets` to get the open dataset package.  The Open Datasets package contains a class representing each data source (NycTlcGreen and NycTlcYellow) to easily filter date parameters before downloading.


### Load data
Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes to avoid MemoryError with large datasets. To download a year of taxi data, iteratively fetch one month at a time, and before appending it to green_df_raw, randomly sample 500 records from each month to avoid bloating the dataframe. Then preview the data. To keep this process short, we are sampling data of only 1 month.

Note: Open Datasets has mirroring classes for working in Spark environments where data size and memory aren't a concern.

In [1]:
import azureml.core
# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.53


In [2]:
import os
import pandas as pd

In [3]:
# from azureml.opendatasets import NycTlcGreen, NycTlcYellow

# from datetime import datetime
# from dateutil.relativedelta import relativedelta

# green_df_raw = pd.DataFrame([])
# start = datetime.strptime("1/1/2016","%m/%d/%Y")
# end = datetime.strptime("1/31/2016","%m/%d/%Y")

# number_of_months = 1
# sample_size = 5000

# for sample_month in range(number_of_months):
#     temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
#         .to_pandas_dataframe()
#     green_df_raw = green_df_raw.append(temp_df_green.sample(sample_size))

In [4]:
# yellow_df_raw = pd.DataFrame([])
# start = datetime.strptime("1/1/2016","%m/%d/%Y")
# end = datetime.strptime("1/31/2016","%m/%d/%Y")

# sample_size = 500

# for sample_month in range(number_of_months):
#     temp_df_yellow = NycTlcYellow(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
#         .to_pandas_dataframe()
#     yellow_df_raw = yellow_df_raw.append(temp_df_yellow.sample(sample_size))

### See the data

In [5]:
# import azureml.dataprep as dprep
# from IPython.display import display

# display(green_df_raw.head(5))
# display(yellow_df_raw.head(5))

### Download data locally and then upload to Azure Blob
This is a one-time process to save the dave in the default datastore. 

In [6]:
# dataDir = "data"

# if not os.path.exists(dataDir):
#     os.mkdir(dataDir)

# greenDir = dataDir + "/green"
# yelloDir = dataDir + "/yellow"

# if not os.path.exists(greenDir):
#     os.mkdir(greenDir)
    
# if not os.path.exists(yelloDir):
#     os.mkdir(yelloDir)
    
# greenTaxiData = greenDir + "/part-00000"
# yellowTaxiData = yelloDir + "/part-00000"

# green_df_raw.to_csv(greenTaxiData, index=False)
# yellow_df_raw.to_csv(yellowTaxiData, index=False)

# print("Data written to local folder.")

In [7]:
from azureml.core import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication
from azureml.core.datastore import Datastore


auth = InteractiveLoginAuthentication(tenant_id="cf36141c-ddd7-45a7-b073-111f66d0b30c")
ws = Workspace.from_config()
print("Workspace: " + ws.name, "Region: " + ws.location, sep = '\n')

# Default datastore
default_store = Datastore.get(ws, datastore_name='deal_input_blob')

Workspace: avadevitsmlsvc
Region: westus2


In [8]:
# default_store.upload_files([greenTaxiData], 
#                            target_path = 'green', 
#                            overwrite = False, 
#                            show_progress = True)

# default_store.upload_files([yellowTaxiData], 
#                            target_path = 'yellow', 
#                            overwrite = False, 
#                            show_progress = True)

# print("Upload calls completed.")

### Setup Compute
#### Create new or use an existing compute

In [9]:
from pygit2 import Repository
from azureml.core.compute import AmlCompute, ComputeTarget, DataFactoryCompute

def get_or_create_compute(workspace, compute_target_name, **kwargs):
    compute_target = ComputeTarget.create(
            workspace,
            compute_target_name,
            AmlCompute.provisioning_configuration(**kwargs))

    compute_target.wait_for_completion(show_output=True)
    print(compute_target_name, "has been created")
    return compute_target

def make_resource_name(prefix, suffix, max_len, sep='-'):
    '''
    project max_len:36, compute max_len:16
    '''
    # suffix will be abbreviated if it is longer than
    # (max_length - len(prefix) - len(sep))
    suffix_max_len = min(len(suffix), max_len - len(prefix) - len(sep))
    suffix_abbv = (suffix
                   [0:suffix_max_len]
                   .replace('_', '-')
                   )
    resource_name = sep.join([prefix, suffix_abbv])
    return resource_name

prefix = 'deal'
repo_name = Repository('../../..').head.shorthand
compute_target_name = make_resource_name(prefix, repo_name, max_len=16)

aml_compute = get_or_create_compute(workspace=ws,
                                       compute_target_name=compute_target_name,
                                       vm_size='STANDARD_D2_V2',
                                       max_nodes=8
                                       )
aml_compute

Creating
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned
deal-automl-step has been created


AmlCompute(workspace=Workspace.create(name='avadevitsmlsvc', subscription_id='ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e', resource_group='RG-ITSMLTeam-Dev'), name=deal-automl-step, id=/subscriptions/ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e/resourceGroups/RG-ITSMLTeam-Dev/providers/Microsoft.MachineLearningServices/workspaces/avadevitsmlsvc/computes/deal-automl-step, type=AmlCompute, provisioning_state=Succeeded, location=westus2, tags=None)

#### Define RunConfig for the compute
We need `azureml-dataprep` SDK for all the steps below. We will also use `pandas`, `scikit-learn` and `automl` for the training step. Defining the `runconfig` for that.

In [10]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Create a new runconfig object
aml_run_config = RunConfiguration()

# Use the aml_compute you created above. 
aml_run_config.target = aml_compute

# Enable Docker
aml_run_config.environment.docker.enabled = True

# Set Docker base image to the default CPU-based image
aml_run_config.environment.docker.base_image = "mcr.microsoft.com/azureml/base:0.2.1"

# Use conda_dependencies.yml to create a conda environment in the Docker image for execution
aml_run_config.environment.python.user_managed_dependencies = False

# Auto-prepare the Docker image when used for execution (if it is not already prepared)
aml_run_config.auto_prepare_environment = True

# Specify CondaDependencies obj, add necessary packages
aml_run_config.environment.python.conda_dependencies = CondaDependencies.create(
    conda_packages=['pandas','scikit-learn'], 
    pip_packages=['azureml-sdk', 'azureml-dataprep', 'azureml-train-automl==1.0.33'], 
    pin_sdk_version=False)

print ("Run configuration created.")



Run configuration created.


### Prepare data
Now we will prepare for regression modeling by using the `Azure Machine Learning Data Prep SDK for Python`. We run various transformations to filter and combine two different NYC taxi data sets.

We achieve this by creating a separate step for each transformation as this allows us to reuse the steps and saves us from running all over again in case of any change. We will keep data preparation scripts in one subfolder and training scripts in another.

> The best practice is to use separate folders for scripts and its dependent files for each step and specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` would trigger a re-upload of the snapshot, this helps keep the reuse of the step when there are no changes in the `source_directory` of the step.

#### Define Useful Colums
Here we are defining a set of "useful" columns for both Green and Yellow taxi data.

In [11]:
# display(green_df_raw.columns)
# display(yellow_df_raw.columns)

# useful columns needed for the Azure Machine Learning NYC Taxi tutorial
useful_columns = str(["cost", "distance", "dropoff_datetime", "dropoff_latitude", 
                      "dropoff_longitude", "passengers", "pickup_datetime", 
                      "pickup_latitude", "pickup_longitude", "store_forward", "vendor"]).replace(",", ";")

print("Useful columns defined.")

Useful columns defined.


#### Cleanse Green taxi data

In [12]:
from azureml.data.data_reference import DataReference 
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# python scripts folder
prepare_data_folder = './scripts/prepdata'

blob_green_data = DataReference(
    datastore=default_store,
    data_reference_name="green_taxi_data",
    path_on_datastore="green/part-00000")

# rename columns as per Azure Machine Learning NYC Taxi tutorial
green_columns = str({ 
    "vendorID": "vendor",
    "lpepPickupDatetime": "pickup_datetime",
    "lpepDropoffDatetime": "dropoff_datetime",
    "storeAndFwdFlag": "store_forward",
    "pickupLongitude": "pickup_longitude",
    "pickupLatitude": "pickup_latitude",
    "dropoffLongitude": "dropoff_longitude",
    "dropoffLatitude": "dropoff_latitude",
    "passengerCount": "passengers",
    "fareAmount": "cost",
    "tripDistance": "distance"
}).replace(",", ";")

# Define output after cleansing step
cleansed_green_data = PipelineData("green_taxi_data", datastore=default_store)

print('Cleanse script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# cleansing step creation
# See the cleanse.py for details about input and output
cleansingStepGreen = PythonScriptStep(
    name="Cleanse Green Taxi Data",
    script_name="cleanse.py", 
    arguments=["--input_cleanse", blob_green_data, 
               "--useful_columns", useful_columns,
               "--columns", green_columns,
               "--output_cleanse", cleansed_green_data],
    inputs=[blob_green_data],
    outputs=[cleansed_green_data],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("cleansingStepGreen created.")

Cleanse script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\prepdata.
cleansingStepGreen created.


#### Cleanse Yellow taxi data

In [13]:
blob_yellow_data = DataReference(
    datastore=default_store,
    data_reference_name="yellow_taxi_data",
    path_on_datastore="yellow/part-00000")

yellow_columns = str({
    "vendorID": "vendor",
    "tpepPickupDateTime": "pickup_datetime",
    "tpepDropoffDateTime": "dropoff_datetime",
    "storeAndFwdFlag": "store_forward",
    "startLon": "pickup_longitude",
    "startLat": "pickup_latitude",
    "endLon": "dropoff_longitude",
    "endLat": "dropoff_latitude",
    "passengerCount": "passengers",
    "fareAmount": "cost",
    "tripDistance": "distance"
}).replace(",", ";")

# Define output after cleansing step
cleansed_yellow_data = PipelineData("yellow_taxi_data", datastore=default_store)

print('Cleanse script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# cleansing step creation
# See the cleanse.py for details about input and output
cleansingStepYellow = PythonScriptStep(
    name="Cleanse Yellow Taxi Data",
    script_name="cleanse.py", 
    arguments=["--input_cleanse", blob_yellow_data, 
               "--useful_columns", useful_columns,
               "--columns", yellow_columns,
               "--output_cleanse", cleansed_yellow_data],
    inputs=[blob_yellow_data],
    outputs=[cleansed_yellow_data],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("cleansingStepYellow created.")

Cleanse script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\prepdata.
cleansingStepYellow created.


#### Merge cleansed Green and Yellow datasets
We are creating a single data source by merging the cleansed versions of Green and Yellow taxi data.

In [14]:
# Define output after merging step
merged_data = PipelineData("merged_data", datastore=default_store)

print('Merge script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# merging step creation
# See the merge.py for details about input and output
mergingStep = PythonScriptStep(
    name="Merge Taxi Data",
    script_name="merge.py", 
    arguments=["--input_green_merge", cleansed_green_data, 
               "--input_yellow_merge", cleansed_yellow_data,
               "--output_merge", merged_data],
    inputs=[cleansed_green_data, cleansed_yellow_data],
    outputs=[merged_data],
    compute_target=aml_compute,
    runconfig=aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("mergingStep created.")

Merge script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\prepdata.
mergingStep created.


#### Filter data
This step filters out coordinates for locations that are outside the city border. We use a TypeConverter object to change the latitude and longitude fields to decimal type. 

In [15]:
# Define output after merging step
filtered_data = PipelineData("filtered_data", datastore=default_store)

print('Filter script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# filter step creation
# See the filter.py for details about input and output
filterStep = PythonScriptStep(
    name="Filter Taxi Data",
    script_name="filter.py", 
    arguments=["--input_filter", merged_data, 
               "--output_filter", filtered_data],
    inputs=[merged_data],
    outputs=[filtered_data],
    compute_target=aml_compute,
    runconfig = aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("FilterStep created.")

Filter script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\prepdata.
FilterStep created.


#### Normalize data
In this step, we split the pickup and dropoff datetime values into the respective date and time columns and then we rename the columns to use meaningful names.

In [16]:
# Define output after normalize step
normalized_data = PipelineData("normalized_data", datastore=default_store)

print('Normalize script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# normalize step creation
# See the normalize.py for details about input and output
normalizeStep = PythonScriptStep(
    name="Normalize Taxi Data",
    script_name="normalize.py", 
    arguments=["--input_normalize", filtered_data, 
               "--output_normalize", normalized_data],
    inputs=[filtered_data],
    outputs=[normalized_data],
    compute_target=aml_compute,
    runconfig = aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("normalizeStep created.")

Normalize script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\prepdata.
normalizeStep created.


#### Transform data
Transform the normalized taxi data to final required format. This steps does the following:

- Split the pickup and dropoff date further into the day of the week, day of the month, and month values. 
- To get the day of the week value, uses the derive_column_by_example() function. The function takes an array parameter of example objects that define the input data, and the preferred output. The function automatically determines the preferred transformation. For the pickup and dropoff time columns, split the time into the hour, minute, and second by using the split_column_by_example() function with no example parameter.
- After new features are generated, use the drop_columns() function to delete the original fields as the newly generated features are preferred. 
- Rename the rest of the fields to use meaningful descriptions.

In [17]:
# Define output after transforme step
transformed_data = PipelineData("transformed_data", datastore=default_store)

print('Transform script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# transform step creation
# See the transform.py for details about input and output
transformStep = PythonScriptStep(
    name="Transform Taxi Data",
    script_name="transform.py", 
    arguments=["--input_transform", normalized_data,
               "--output_transform", transformed_data],
    inputs=[normalized_data],
    outputs=[transformed_data],
    compute_target=aml_compute,
    runconfig = aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("transformStep created.")

Transform script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\prepdata.
transformStep created.


### Extract features
Add the following columns to be features for our model creation. The prediction value will be *cost*.

In [18]:
feature_columns = str(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor']).replace(",", ";")

train_model_folder = './scripts/trainmodel'

print('Extract script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# features data after transform step
features_data = PipelineData("features_data", datastore=default_store)

# featurization step creation
# See the featurization.py for details about input and output
featurizationStep = PythonScriptStep(
    name="Extract Features",
    script_name="featurization.py", 
    arguments=["--input_featurization", transformed_data, 
               "--useful_columns", feature_columns,
               "--output_featurization", features_data],
    inputs=[transformed_data],
    outputs=[features_data],
    compute_target=aml_compute,
    runconfig = aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("featurizationStep created.")

Extract script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\prepdata.
featurizationStep created.


### Extract label

In [19]:
label_columns = str(['cost']).replace(",", ";")

# label data after transform step
label_data = PipelineData("label_data", datastore=default_store)

print('Extract script is in {}.'.format(os.path.realpath(prepare_data_folder)))

# label step creation
# See the featurization.py for details about input and output
labelStep = PythonScriptStep(
    name="Extract Labels",
    script_name="featurization.py", 
    arguments=["--input_featurization", transformed_data, 
               "--useful_columns", label_columns,
               "--output_featurization", label_data],
    inputs=[transformed_data],
    outputs=[label_data],
    compute_target=aml_compute,
    runconfig = aml_run_config,
    source_directory=prepare_data_folder,
    allow_reuse=True
)

print("labelStep created.")

Extract script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\prepdata.
labelStep created.


### Split the data into train and test sets
This function segregates the data into the **x**, features, dataset for model training and **y**, values to predict, dataset for testing.

In [30]:
# train and test splits output
output_split = PipelineData("output_split", datastore=default_store)

print('Data spilt script is in {}.'.format(os.path.realpath(train_model_folder)))

# test train split step creation
# See the train_test_split.py for details about input and output
testTrainSplitStep = PythonScriptStep(
    name="Train Test Data Split",
    script_name="train_test_split.py", 
    arguments=["--input_split_features", features_data, 
               "--input_split_labels", label_data,
               "--output_split", output_split],
    inputs=[features_data, label_data],
    outputs=[output_split],
    compute_target=aml_compute,
    runconfig = aml_run_config,
    source_directory=train_model_folder,
    allow_reuse=True
)

print("testTrainSplitStep created.")

Data spilt script is in c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\trainmodel.
testTrainSplitStep created.


## Use automated machine learning to build regression model
Now we will use **automated machine learning** to build the regression model. We will use [AutoMLStep](https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl.automlstep?view=azure-ml-py) in AML Pipelines for this part. These functions use various features from the data set and allow an automated model to build relationships between the features and the price of a taxi trip.

### Automatically train a model

#### Create experiment

In [31]:
from azureml.core import Experiment

experiment = Experiment(ws, 'NYCTaxi_Tutorial_Pipelines')

print("Experiment created")

Experiment created


#### Create get_data script

A script with `get_data()` function is necessary to fetch training features(X) and labels(Y) on remote compute, from input data. Here we use mounted path of `train_test_split` step to get the x and y train values. They are added as environment variable on compute machine by default

Note: Every DataReference are added as environment variable on compute machine since the defualt mode is mount

In [32]:
print('get_data.py will be written to {}.'.format(os.path.realpath(train_model_folder)))

get_data.py will be written to c:\Users\anders.swanson\Documents\MachineLearningNotebooks-1\how-to-use-azureml\machine-learning-pipelines\nyc-taxi-data-regression-model-building\scripts\trainmodel.


In [33]:
# %%writefile $train_model_folder/get_data.py


# import os
# import pandas as pd

# def get_data():
#     print("In get_data")
#     dir = os.environ['AZUREML_DATAREFERENCE_output_split']
#     print(dir)
#     X_train = pd.read_csv(os.path.join(dir,'x_train.csv'), header=0)
#     y_train = pd.read_csv(os.path.join(dir,'y_train.csv'), header=0)

#     return {"X": X_train.values, "y": y_train.values.flatten()}

#### Define settings for autogeneration and tuning

Here we define the experiment parameter and model settings for autogeneration and tuning. We can specify automl_settings as **kwargs as well. Also note that we have to use a get_data() function for remote excutions. See get_data script for more details.

Use your defined training settings as a parameter to an `AutoMLConfig` object. Additionally, specify your training data and the type of model, which is `regression` in this case.

Note: When using AmlCompute, we can't pass Numpy arrays directly to the fit method.

In [34]:
import logging
from azureml.train.automl import AutoMLConfig

# Change iterations to a reasonable number (50) to get better accuracy
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 2,
    "primary_metric" : 'spearman_correlation',
    "preprocess" : True,
    "verbosity" : logging.INFO,
    "n_cross_validations": 5
}

automl_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automated_ml_errors.log',
                             path = train_model_folder,
                             compute_target=aml_compute,
                             run_configuration=aml_run_config,
                             data_script = train_model_folder + "/get_data.py",
                             **automl_settings)
                             
print("AutoML config created.")

AutoML config created.


#### Define AutoMLStep

In [35]:
from azureml.train.automl import AutoMLStep

trainWithAutomlStep = AutoMLStep(
    name='AutoML_Regression',
    automl_config=automl_config,
    inputs=[output_split],
    allow_reuse=True,
    hash_paths=[os.path.realpath(train_model_folder)])

print("trainWithAutomlStep created.")



trainWithAutomlStep created.


#### Build and run the pipeline

In [36]:
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

pipeline_steps = [trainWithAutomlStep]

pipeline = Pipeline(workspace = ws, steps=pipeline_steps)
print("Pipeline is built.")

pipeline_run = experiment.submit(pipeline, regenerate_outputs=False)

print("Pipeline submitted for execution.")



In get_data
Pipeline is built.
Created step AutoML_Regression [69d76b4c][a039bede-c05f-42bd-9708-c9d8506d9c4b], (This step will run and generate new outputs)
Created step Train Test Data Split [5ec47a7b][32c61789-19ae-4103-81b8-c276667bda53], (This step will run and generate new outputs)
Created step Extract Features [28ad1c54][6b3b5d92-7456-4e79-9c4a-437fe40cc140], (This step is eligible to reuse a previous run's output)
Created step Transform Taxi Data [6246dea6][cee9dddc-e4f8-4123-b0f9-34d434168851], (This step is eligible to reuse a previous run's output)
Created step Normalize Taxi Data [b7e10bab][ffef805f-623a-4c81-a035-9bcf75357d22], (This step is eligible to reuse a previous run's output)
Created step Filter Taxi Data [ea8f425c][a5e4f0ff-cfbc-4dff-98b2-b2f1c10393cc], (This step is eligible to reuse a previous run's output)
Created step Merge Taxi Data [04192902][e34fd3c9-ca0b-40fe-bf06-acc14544736a], (This step is eligible to reuse a previous run's output)
Created step Cleanse 

In [37]:
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', …

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

### Explore the results

In [28]:
# Before we proceed we need to wait for the run to complete.
pipeline_run.wait_for_completion()

# functions to download output to local and fetch as dataframe
def get_download_path(download_path, output_name):
    output_folder = os.listdir(download_path + '/azureml')[0]
    path =  download_path + '/azureml/' + output_folder + '/' + output_name
    return path

def fetch_df(step, output_name):
    output_data = step.get_output_data(output_name)
    
    download_path = './outputs/' + output_name
    output_data.download(download_path)
    df_path = get_download_path(download_path, output_name) + '/part-00000'
    return dprep.auto_read_file(path=df_path)

PipelineRunId: 467bd511-cd24-4633-9d68-71bfd38532be
Link to Portal: https://mlworkspace.azure.ai/portal/subscriptions/ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e/resourceGroups/RG-ITSMLTeam-Dev/providers/Microsoft.MachineLearningServices/workspaces/avadevitsmlsvc/experiments/NYCTaxi_Tutorial_Pipelines/runs/467bd511-cd24-4633-9d68-71bfd38532be
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: 3e98ecd2-e960-495c-8086-f020e72b0d1a
Link to Portal: https://mlworkspace.azure.ai/portal/subscriptions/ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e/resourceGroups/RG-ITSMLTeam-Dev/providers/Microsoft.MachineLearningServices/workspaces/avadevitsmlsvc/experiments/NYCTaxi_Tutorial_Pipelines/runs/3e98ecd2-e960-495c-8086-f020e72b0d1a
StepRun( Cleanse Green Taxi Data ) Status: NotStarted
StepRun( Cleanse Green Taxi Data ) Status: Running

Streaming azureml-logs/70_driver_log.txt
Cleans the input data
Argument 1(input taxi data path): /mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/3e98




StepRunId: 30a8bd45-5e44-4de3-9535-d3bca60bd0bb
Link to Portal: https://mlworkspace.azure.ai/portal/subscriptions/ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e/resourceGroups/RG-ITSMLTeam-Dev/providers/Microsoft.MachineLearningServices/workspaces/avadevitsmlsvc/experiments/NYCTaxi_Tutorial_Pipelines/runs/30a8bd45-5e44-4de3-9535-d3bca60bd0bb

StepRun(Cleanse Yellow Taxi Data) Execution Summary
StepRun( Cleanse Yellow Taxi Data ) Status: Finished
{'runId': '30a8bd45-5e44-4de3-9535-d3bca60bd0bb', 'target': 'deal-automl-step', 'status': 'Completed', 'startTimeUtc': '2019-07-31T14:39:51.709496Z', 'endTimeUtc': '2019-07-31T14:42:35.765925Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '1034f585-bdcc-4da9-9512-3d661106bcff', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '467bd511-cd24-4633-9d68-71bfd38532be', '_azureml.ComputeTargetType': 'batchai'}, 'runDefinition': {'script': 'cleanse.py', 'arguments': ['--input_cleanse

/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/22d16ded-9fea-4654-b2a4-7c49c0a4f7c3/mounts/deal_input_blob/azureml/22d16ded-9fea-4654-b2a4-7c49c0a4f7c3/merged_data created


The experiment completed successfully. Finalizing run...
Logging experiment finalizing status in history service.
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.0003635883331298828 seconds

StepRun(Merge Taxi Data) Execution Summary
StepRun( Merge Taxi Data ) Status: Finished
{'runId': '22d16ded-9fea-4654-b2a4-7c49c0a4f7c3', 'target': 'deal-automl-step', 'status': 'Completed', 'startTimeUtc': '2019-07-31T14:42:53.951853Z', 'endTimeUtc': '2019-07-31T14:43:34.444007Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '1034f585-bdcc-4da9-9512-3d661106bcff', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '467bd511-cd24-4633-9d68-71bfd38532be', '_azureml.ComputeTargetType': 'batch



The experiment completed successfully. Finalizing run...
Logging experiment finalizing status in history service.
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.0006363391876220703 seconds

StepRun(Filter Taxi Data) Execution Summary
StepRun( Filter Taxi Data ) Status: Finished
{'runId': 'e0cbbace-7297-4180-85bc-9c93191d9faf', 'target': 'deal-automl-step', 'status': 'Completed', 'startTimeUtc': '2019-07-31T14:43:53.17335Z', 'endTimeUtc': '2019-07-31T14:44:29.122646Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '1034f585-bdcc-4da9-9512-3d661106bcff', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '467bd511-cd24-4633-9d68-71bfd38532be', '_azureml.ComputeTargetType': 'batchai'}, 'runDefinition': {'script': 'filter.py', 'arguments': ['--input_filter', '$AZUREML_DATAREFERENCE_merged_data', '--output_filter', '$AZUREML_DATAREFERENCE_filtered_data'], 'source

{'runId': '077388e2-10bc-4877-ab92-6a678f6d04ae', 'target': 'deal-automl-step', 'status': 'Completed', 'startTimeUtc': '2019-07-31T14:44:48.048567Z', 'endTimeUtc': '2019-07-31T14:45:31.693047Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '1034f585-bdcc-4da9-9512-3d661106bcff', 'StepType': 'PythonScriptStep', 'ComputeTargetType': 'AmlCompute', 'azureml.pipelinerunid': '467bd511-cd24-4633-9d68-71bfd38532be', '_azureml.ComputeTargetType': 'batchai'}, 'runDefinition': {'script': 'normalize.py', 'arguments': ['--input_normalize', '$AZUREML_DATAREFERENCE_filtered_data', '--output_normalize', '$AZUREML_DATAREFERENCE_normalized_data'], 'sourceDirectoryDataStore': None, 'framework': 'Python', 'communicator': 'None', 'target': 'deal-automl-step', 'dataReferences': {'filtered_data': {'dataStoreName': 'deal_input_blob', 'mode': 'Mount', 'pathOnDataStore': 'azureml/e0cbbace-7297-4180-85bc-9c93191d9faf/filtered_data', 'pathOnCompute': None, 'overwrite': False}, 'norm




StepRunId: 12c0e57c-8389-4066-8898-fea874860b02
Link to Portal: https://mlworkspace.azure.ai/portal/subscriptions/ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e/resourceGroups/RG-ITSMLTeam-Dev/providers/Microsoft.MachineLearningServices/workspaces/avadevitsmlsvc/experiments/NYCTaxi_Tutorial_Pipelines/runs/12c0e57c-8389-4066-8898-fea874860b02
StepRun( Extract Labels ) Status: Running

Streaming azureml-logs/70_driver_log.txt
Extracts important features from prepared data
Argument 1(input training data path): /mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/12c0e57c-8389-4066-8898-fea874860b02/mounts/deal_input_blob/azureml/18f0a113-29e9-4303-86f5-2d048d2cc2de/transformed_data
Argument 2(column features to use): ["'cost'"]
Argument 3:(output featurized training data path) /mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/12c0e57c-8389-4066-8898-fea874860b02/mounts/deal_input_blob/azureml/12c0e57c-8389-4066-8898-fea874860b02/label_data
/mnt/batch/tasks/shared/LS_root/jobs/av

StepRun( Extract Features ) Status: Running

Streaming azureml-logs/70_driver_log.txt
Extracts important features from prepared data
Argument 1(input training data path): /mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/1da8aa2a-9242-491b-a791-7307329d2581/mounts/deal_input_blob/azureml/18f0a113-29e9-4303-86f5-2d048d2cc2de/transformed_data
Argument 2(column features to use): ["'pickup_weekday'", " 'pickup_hour'", " 'distance'", " 'passengers'", " 'vendor'"]
Argument 3:(output featurized training data path) /mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/1da8aa2a-9242-491b-a791-7307329d2581/mounts/deal_input_blob/azureml/1da8aa2a-9242-491b-a791-7307329d2581/features_data
/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/1da8aa2a-9242-491b-a791-7307329d2581/mounts/deal_input_blob/azureml/1da8aa2a-9242-491b-a791-7307329d2581/features_data created


The experiment completed successfully. Finalizing run...
Logging experiment finalizing status in history serv

/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/a8cdd218-b015-452b-8880-3f71533e2089/mounts/deal_input_blob/azureml/a8cdd218-b015-452b-8880-3f71533e2089/output_split_train created
/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/a8cdd218-b015-452b-8880-3f71533e2089/mounts/deal_input_blob/azureml/a8cdd218-b015-452b-8880-3f71533e2089/output_split_train created
/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/a8cdd218-b015-452b-8880-3f71533e2089/mounts/deal_input_blob/azureml/a8cdd218-b015-452b-8880-3f71533e2089/output_split_train created
/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/a8cdd218-b015-452b-8880-3f71533e2089/mounts/deal_input_blob/azureml/a8cdd218-b015-452b-8880-3f71533e2089/output_split_train created


The experiment completed successfully. Finalizing run...
Logging experiment finalizing status in history service.
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.000448703765

StepRun( AutoML_Regression ) Status: Running

StepRun(AutoML_Regression) Execution Summary
StepRun( AutoML_Regression ) Status: Failed
{'runId': '0e2ccb75-8f48-4cc5-ad34-9f4c9cc3bf3a', 'target': 'deal-automl-step', 'status': 'Failed', 'startTimeUtc': '2019-07-31T14:49:20.942269Z', 'endTimeUtc': '2019-07-31T14:51:05.459207Z', 'properties': {'azureml.runsource': 'azureml.StepRun', 'ContentSnapshotId': '2a439a3c-8c6a-4555-9141-e22f027f57d9', 'StepType': 'AutoMLStep', 'azureml.pipelinerunid': '467bd511-cd24-4633-9d68-71bfd38532be', 'num_iterations': '2', 'training_type': 'TrainFull', 'acquisition_function': 'EI', 'metrics': 'accuracy', 'primary_metric': 'spearman_correlation', 'train_split': '0', 'MaxTimeSeconds': '600', 'acquisition_parameter': '0', 'num_cross_validation': '5', 'target': 'deal-automl-step', 'RawAMLSettingsString': "{'name':'NYCTaxi_Tutorial_Pipelines','subscription_id':'ff2e23ae-7d7c-4cbd-99b8-116bb94dca6e','resource_group':'RG-ITSMLTeam-Dev','workspace_name':'avadevitsml

#### View cleansed taxi data

In [29]:
green_cleanse_step = pipeline_run.find_step_run(cleansingStepGreen.name)[0]
yellow_cleanse_step = pipeline_run.find_step_run(cleansingStepYellow.name)[0]

cleansed_green_df = fetch_df(green_cleanse_step, cleansed_green_data.name)
cleansed_yellow_df = fetch_df(yellow_cleanse_step, cleansed_yellow_data.name)

display(cleansed_green_df.head(5))
display(cleansed_yellow_df.head(5))

NameError: name 'dprep' is not defined

#### View the combined taxi data profile

In [None]:
merge_step = pipeline_run.find_step_run(mergingStep.name)[0]
combined_df = fetch_df(merge_step, merged_data.name)

display(combined_df.get_profile())

#### View the filtered taxi data profile

In [None]:
filter_step = pipeline_run.find_step_run(filterStep.name)[0]
filtered_df = fetch_df(filter_step, filtered_data.name)

display(filtered_df.get_profile())

#### View normalized taxi data

In [None]:
normalize_step = pipeline_run.find_step_run(normalizeStep.name)[0]
normalized_df = fetch_df(normalize_step, normalized_data.name)

display(normalized_df.head(5))

#### View transformed taxi data

In [None]:
transform_step = pipeline_run.find_step_run(transformStep.name)[0]
transformed_df = fetch_df(transform_step, transformed_data.name)

display(transformed_df.get_profile())
display(transformed_df.head(5))

#### View training data used by AutoML

In [None]:
split_step = pipeline_run.find_step_run(testTrainSplitStep.name)[0]
train_split_x = fetch_df(split_step, output_split_train_x.name)
train_split_y = fetch_df(split_step, output_split_train_y.name)

display_x_train = train_split_x.keep_columns(columns=["vendor", "pickup_weekday", "pickup_hour", "passengers", "distance"])
display_y_train = train_split_y.rename_columns(column_pairs={"Column1": "cost"})

display(display_x_train.get_profile())
display(display_x_train.head(5))
display(display_y_train.get_profile())
display(display_y_train.head(5))

#### View the details of the AutoML run

In [None]:
from azureml.train.automl.run import AutoMLRun
#from azureml.widgets import RunDetails

# workaround to get the automl run as its the last step in the pipeline 
# and get_steps() returns the steps from latest to first

for step in pipeline_run.get_steps():
    automl_step_run_id = step.id
    print(step.name)
    print(automl_step_run_id)
    break

automl_run = AutoMLRun(experiment = experiment, run_id=automl_step_run_id)
#RunDetails(automl_run).show()

#### Retrieve all Child runs

We use SDK methods to fetch all the child runs and see individual metrics that we log.

In [None]:
children = list(automl_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(1)
rundata

### Retreive the best model

Uncomment the below cell to retrieve the best model

In [None]:
# best_run, fitted_model = automl_run.get_output()
# print(best_run)
# print(fitted_model)

### Test the model

#### Get test data

Uncomment the below cell to get test data

In [None]:
# split_step = pipeline_run.find_step_run(testTrainSplitStep.name)[0]

# x_test = fetch_df(split_step, output_split_test_x.name)
# y_test = fetch_df(split_step, output_split_test_y.name)

# display(x_test.keep_columns(columns=["vendor", "pickup_weekday", "pickup_hour", "passengers", "distance"]).head(5))
# display(y_test.rename_columns(column_pairs={"Column1": "cost"}).head(5))

# x_test = x_test.to_pandas_dataframe()
# y_test = y_test.to_pandas_dataframe()

#### Test the best fitted model

Uncomment the below cell to test the best fitted model

In [None]:
# y_predict = fitted_model.predict(x_test.values)

# y_actual =  y_test.iloc[:,0].values.tolist()

# display(pd.DataFrame({'Actual':y_actual, 'Predicted':y_predict}).head(5))

In [None]:
# import matplotlib.pyplot as plt

# fig = plt.figure(figsize=(14, 10))
# ax1 = fig.add_subplot(111)

# distance_vals = [x[4] for x in x_test.values]

# ax1.scatter(distance_vals[:100], y_predict[:100], s=18, c='b', marker="s", label='Predicted')
# ax1.scatter(distance_vals[:100], y_actual[:100], s=18, c='r', marker="o", label='Actual')

# ax1.set_xlabel('distance (mi)')
# ax1.set_title('Predicted and Actual Cost/Distance')
# ax1.set_ylabel('Cost ($)')

# plt.legend(loc='upper left', prop={'size': 12})
# plt.rcParams.update({'font.size': 14})
# plt.show()