Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# How to use EstimatorStep in AML Pipeline

This notebook shows how to use the EstimatorStep with Azure Machine Learning Pipelines. Estimator is a convenient object in Azure Machine Learning that wraps run configuration information to help simplify the tasks of specifying how a script is executed.


## Prerequisites
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](https://aka.ms/pl-config) to:
    * install the AML SDK
    * create a workspace and its configuration file (`config.json`)
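
As a rough illustration of what the workspace configuration file holds, the sketch below writes and reads back a `config.json` with the three keys that identify a workspace: `subscription_id`, `resource_group`, and `workspace_name`. The helper `load_workspace_config` and all values are hypothetical placeholders; in practice the configuration notebook creates this file for you.

```python
import json
import os
import tempfile

# Keys that identify a workspace in config.json.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Hypothetical helper: load config.json and check that the expected keys exist."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("config.json is missing keys: {}".format(", ".join(missing)))
    return config


# Placeholder values for illustration only; use the details of your own workspace.
example = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write the example to a temporary folder so no real configuration is overwritten.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(example, f, indent=4)

print(load_workspace_config(config_path))
```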

Let's get started. First let's import some Python libraries.

In [17]:
import azureml.core

from azureml.core import Workspace
from azureml.core import Experiment
from azureml.core.container_registry import ContainerRegistry
from azureml.train.estimator import Estimator
import azureml.contrib.core
from azureml.contrib.core.compute.cmakscompute import CmAksCompute
# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  1.6.0


## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [19]:
from azureml.core import Datastore
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

# Default datastore of the workspace
def_blob_store = ws.get_default_datastore()
# The following call gets the Azure Blob store associated with your workspace by name.
# Note that "workspaceblobstore" is the name of this store; it cannot be changed and must be used as is.
def_blob_store = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(def_blob_store.name))

Workspace name: cmaks-westcentralus-ws
Azure region: westcentralus
Subscription id: e9b2ec51-5c94-4fa8-809a-dc1e695e4896
Resource group: itp
Blobstore's name: workspaceblobstore


#### Upload data to default datastore
The workspace has an Azure Blob storage datastore (`workspaceblobstore`) associated with it. Let's upload the data file `20news.pkl` to the `20newsgroups` folder of that datastore.

In [20]:
# get_default_datastore() gets the default Azure Blob Store associated with your workspace.
# Here we are reusing the def_blob_store object we obtained earlier
def_blob_store.upload_files(["./20news.pkl"], target_path="20newsgroups", overwrite=True)
print("Upload call completed")

Uploading an estimated of 1 files
Uploading ./20news.pkl
Uploaded ./20news.pkl, 1 files out of an estimated total of 1
Uploaded 1 files
Upload call completed


## Create or Attach existing Compute Target
You will need a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. The cell below lists the CmAks compute targets in the workspace and retrieves an existing one by name.

In [22]:
from azureml.core.compute import ComputeTarget
from azureml.contrib.core.compute.cmakscompute import CmAksCompute

# List the CmAks compute targets attached to the workspace
for key, target in ws.compute_targets.items():
    if isinstance(target, CmAksCompute):
        print('Found compute target:{}\ttype:{}\tprovisioning_state:{}\tlocation:{}'.format(target.name, target.type, target.provisioning_state, target.location))

# Name of an existing compute target in the workspace; replace it with your own
cluster_name = 'cmaks0518'

# Retrieve the compute target by name (raises ComputeTargetException if it does not exist)
cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
print('Found existing compute target')
print(cpu_cluster)

Found compute target:cmaks0518	type:CmAks	provisioning_state:Succeeded	location:westcentralus
Found compute target:cmaks-portal	type:CmAks	provisioning_state:Succeeded	location:westcentralus
Found compute target:cmaks-demo	type:CmAks	provisioning_state:Succeeded	location:westcentralus
Found existing compute target
CmAksCompute(workspace=Workspace.create(name='cmaks-westcentralus-ws', subscription_id='e9b2ec51-5c94-4fa8-809a-dc1e695e4896', resource_group='itp'), name=cmaks0518, id=/subscriptions/e9b2ec51-5c94-4fa8-809a-dc1e695e4896/resourceGroups/itp/providers/Microsoft.MachineLearningServices/workspaces/cmaks-westcentralus-ws/computes/cmaks0518, type=CmAks, provisioning_state=Succeeded, location=westcentralus, tags=None)


## Use a simple script
We have already created a simple "hello world" script. This is the script that we will submit through the estimator pattern. It prints a hello-world message and, if the Azure ML SDK is installed, also logs an array of values ([Fibonacci numbers](https://en.wikipedia.org/wiki/Fibonacci_number)).
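
For orientation, a script of this kind could look roughly like the sketch below. This is not the actual `dummy_train.py`; the function names and the metric name `"Fibonacci numbers"` are assumptions made for illustration.

```python
# Hypothetical sketch of a hello-world training script; not the actual dummy_train.py.


def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from 0 and 1."""
    values = []
    a, b = 0, 1
    for _ in range(n):
        values.append(a)
        a, b = b, a + b
    return values


def main():
    print("Hello world!")
    fib = fibonacci(10)
    try:
        # Run.get_context() returns the current run when the script is submitted
        # through Azure ML, or an offline run object when it is executed locally.
        from azureml.core import Run
        run = Run.get_context()
        run.log_list("Fibonacci numbers", fib)
    except ImportError:
        # The SDK is not installed, so fall back to printing the values.
        print("Azure ML SDK not found; Fibonacci numbers:", fib)


if __name__ == "__main__":
    main()
```

Guarding the SDK import keeps the script runnable on a machine without the SDK, which is handy for a quick local check before submitting it.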

## Build an Estimator object
By default, Estimator attempts to use Docker-based execution and picks the default CPU image supplied by Azure ML. You can target a Cmk8s cluster (or any other supported compute target type). You can also customize the conda environment by adding conda and/or pip packages.

> Note: The arguments to the entry script used in the Estimator object should be specified as a *list* using the
    'estimator_entry_script_arguments' parameter when instantiating EstimatorStep. The Estimator object's
    'script_params' parameter accepts a dictionary, but 'estimator_entry_script_arguments' expects the arguments as
    a list.

> Estimator object initialization involves specifying a list of DataReference objects in its 'inputs' parameter.
    In Pipelines, a step can take another step's output or DataReferences as input. So when creating an EstimatorStep,
    the parameters 'inputs' and 'outputs' need to be set explicitly and that will override 'inputs' parameter
    specified in the Estimator object.
   
> The best practice is to use a separate folder for each step's scripts and their dependent files, and to specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since a change to any file in the `source_directory` triggers a re-upload of the snapshot, this also lets the step be reused when nothing in its `source_directory` has changed.
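
To make the first note concrete, the sketch below flattens a `script_params`-style dictionary into the flat list form that `estimator_entry_script_arguments` expects. The helper name `script_params_to_arguments` is hypothetical, and treating a `None` value as a flag without a value is an assumption of this sketch.

```python
def script_params_to_arguments(script_params):
    """Hypothetical helper: flatten a {name: value} dict into [name, value, ...].

    A value of None is treated as a flag that takes no value (assumption).
    """
    arguments = []
    for name, value in script_params.items():
        arguments.append(name)
        if value is not None:
            arguments.append(value)
    return arguments


# Placeholder values; in a pipeline these would typically be DataReference
# or PipelineData objects rather than strings.
params = {"--datadir": "20newsgroups", "--output": "outputs"}
print(script_params_to_arguments(params))
```

The helper passes values through unchanged, so it works the same way whether the values are strings or pipeline data objects.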

In [23]:
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import PipelineData

def_blob_store = Datastore(ws, "workspaceblobstore")

input_data = DataReference(
    datastore=def_blob_store,
    data_reference_name="input_data",
    path_on_datastore="20newsgroups/20news.pkl")

output = PipelineData("output", datastore=def_blob_store)

source_directory = 'estimator_train'
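
Because the whole `source_directory` is snapshotted, it can be worth checking how much the folder contains before submitting. The sketch below is one possible way to do that; `directory_size` is a hypothetical helper, not part of the Azure ML SDK.

```python
import os


def directory_size(path):
    """Return (file_count, total_bytes) for all regular files under path."""
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if os.path.isfile(full_path):
                file_count += 1
                total_bytes += os.path.getsize(full_path)
    return file_count, total_bytes


# Only report on the folder if it exists in the current working directory.
if os.path.isdir("estimator_train"):
    count, size = directory_size("estimator_train")
    print("{} files, {:.1f} KiB".format(count, size / 1024))
```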

In [24]:
from azureml.train.estimator import Estimator

est = Estimator(source_directory=source_directory, 
                compute_target=cpu_cluster, 
                entry_script='dummy_train.py', 
                conda_packages=['scikit-learn'])

## Create an EstimatorStep
[EstimatorStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.estimator_step.estimatorstep?view=azure-ml-py) adds a step to run Estimator in a Pipeline.

- **name:** Name of the step
- **estimator:** Estimator object
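- **estimator_entry_script_arguments:** List of command-line arguments to pass to the entry script
- **runconfig_pipeline_params:** Override runconfig properties at runtime using key-value pairs each with name of the runconfig property and PipelineParameter for that property
- **inputs:** List of inputs to the step, such as DataReference objects or another step's output
- **outputs:** List of PipelineData objects produced by the step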
- **compute_target:** Compute target to use 
- **allow_reuse:** Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.
- **version:** Optional version tag to denote a change in functionality for the step

In [25]:
from azureml.pipeline.steps import EstimatorStep

est_step = EstimatorStep(name="Estimator_Train", 
                         estimator=est, 
                         estimator_entry_script_arguments=["--datadir", input_data, "--output", output],
                         runconfig_pipeline_params=None, 
                         inputs=[input_data], 
                         outputs=[output], 
                         compute_target=cpu_cluster,
                         allow_reuse=False)
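
On the script side, the `--datadir` and `--output` arguments passed above arrive as ordinary command-line strings, since the data objects in the argument list are resolved to paths on the compute target at run time. The sketch below shows one way an entry script might parse them with `argparse`; it is not the actual `dummy_train.py`, and the example paths are made-up placeholders.

```python
import argparse


def parse_args(argv=None):
    """Parse the two arguments that the step passes to the entry script."""
    parser = argparse.ArgumentParser(description="Hypothetical entry-script argument parsing")
    parser.add_argument("--datadir", type=str, help="path of the input data")
    parser.add_argument("--output", type=str, help="path where outputs are written")
    return parser.parse_args(argv)


# Placeholder paths for illustration; the real paths are supplied at run time.
args = parse_args(["--datadir", "/mnt/input/20news.pkl", "--output", "/mnt/output"])
print("Input:", args.datadir)
print("Output:", args.output)
```

Calling `parse_args()` without a list makes `argparse` read `sys.argv`, which is what a submitted script would normally do.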

## Build and Submit the Experiment

In [26]:
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment
pipeline = Pipeline(workspace=ws, steps=[est_step])
pipeline_run = Experiment(ws, 'Estimator_sample').submit(pipeline)

Created step Estimator_Train [fe510f1c][6b469817-3150-4e5c-9a4d-d1603ee2b710], (This step will run and generate new outputs)
Using data reference input_data for StepId [2b87aa4e][60292cfb-2d9f-45fa-b4dc-90b2ad1ecfb0], (Consumers of this data are eligible to reuse prior runs.)
Submitted PipelineRun a2adb9cc-5472-4a3e-8656-97360ce7bbae
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/Estimator_sample/runs/a2adb9cc-5472-4a3e-8656-97360ce7bbae?wsid=/subscriptions/e9b2ec51-5c94-4fa8-809a-dc1e695e4896/resourcegroups/itp/workspaces/cmaks-westcentralus-ws


## View Run Details

In [27]:
from azureml.widgets import RunDetails
RunDetails(pipeline_run).show()

_PipelineWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', '…