Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# How to use EstimatorStep in AML Pipeline

This notebook shows how to use the EstimatorStep with Azure Machine Learning Pipelines. An Estimator is a convenience object in Azure Machine Learning that wraps run configuration information, simplifying the task of specifying how a script is executed.


## Prerequisites
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](https://aka.ms/pl-config) to:
    * install the AML SDK
    * create a workspace and its configuration file (`config.json`)

Let's get started. First let's import some Python libraries.

In [1]:
import azureml.core
# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

Azure ML SDK Version:  0.1.0.15477075


## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [1]:
from azureml.core import Workspace
ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Workspace name: centraleuap
Azure region: centraluseuap
Subscription id: 35f16a99-532a-4a47-9e93-00305f6c40f2
Resource group: rongduan-dev


## Create or Attach existing AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource.

If no cluster with the given name is found, a new one is created here. We create an `AmlCompute` cluster of `STANDARD_NC6` GPU VMs. This process is broken down into 3 steps:
1. create the configuration (this step is local and only takes a second)
2. create the cluster (this step will take about **20 seconds**)
3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step takes about **3-5 minutes** and provides only sparse output while it runs. Make sure to wait until the call returns before moving to the next cell.

In [2]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "mlc"

try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', max_nodes=4)

    # create the cluster
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    cpu_cluster.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(cpu_cluster.get_status().serialize())

Found existing compute target
{'currentNodeCount': 1, 'targetNodeCount': 1, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 1, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2020-06-02T04:14:32.924000+00:00', 'errors': None, 'creationTime': '2019-10-22T00:19:09.957713+00:00', 'modifiedTime': '2020-06-09T04:37:06.725202+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 1, 'maxNodeCount': 10, 'nodeIdleTimeBeforeScaleDown': 'PT3600S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}


Now that you have created the compute target, let's see what the workspace's `compute_targets` property returns. You should see an entry named after `cluster_name` (`mlc` in this notebook) of type `AmlCompute`.
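
A minimal sketch of that check, assuming `ws` is the `Workspace` object initialized earlier. `summarize_compute_targets` is a hypothetical helper, not part of the SDK; it assumes only that `compute_targets` behaves like a dictionary mapping names to compute targets and that each target exposes a `type` attribute.

```python
def summarize_compute_targets(workspace):
    """Return one 'name: type' string per compute target, sorted by name.

    Assumes workspace.compute_targets is a dict-like mapping of
    name -> compute target, and that each target has a `type` attribute.
    """
    targets = workspace.compute_targets
    return ["{}: {}".format(name, targets[name].type) for name in sorted(targets)]


# Usage against the workspace created earlier in this notebook:
# for line in summarize_compute_targets(ws):
#     print(line)
```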

## Use a simple script
We have already created a simple "hello world" script. This is the script that we will submit through the estimator pattern. It prints a hello-world message and, if the Azure ML SDK is installed, also logs an array of values ([Fibonacci numbers](https://en.wikipedia.org/wiki/Fibonacci_number)).
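
The script itself is not reproduced in this notebook. The following is a hypothetical stand-in that follows the description above, not the actual contents of `dummy_train.py`: it prints a greeting, computes a few Fibonacci numbers, and logs them with `Run.log_list` only when the Azure ML SDK can be imported. The function names and the metric name are illustrative assumptions.

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers, starting from 0 and 1."""
    values = []
    a, b = 0, 1
    for _ in range(n):
        values.append(a)
        a, b = b, a + b
    return values


def main():
    print("Hello world!")
    numbers = fibonacci(10)
    try:
        # Logging is optional: it only happens when the SDK is available.
        from azureml.core import Run
    except ImportError:
        print("Azure ML SDK not installed; skipping metric logging.")
        return numbers
    run = Run.get_context()
    run.log_list("Fibonacci numbers", numbers)  # metric name is an assumption
    return numbers


if __name__ == "__main__":
    main()
```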

## Build an Estimator object
By default, an Estimator attempts to use Docker-based execution. You can also enable Docker and let the Estimator pick the default CPU image supplied by Azure ML for execution. You can target an AmlCompute cluster (or any other supported compute target type), and you can customize the conda environment by adding conda and/or pip packages.

> Note: The arguments to the entry script used in the Estimator object should be specified as a *list*, using the `estimator_entry_script_arguments` parameter when instantiating EstimatorStep. The Estimator object's `script_params` parameter accepts a dictionary, whereas `estimator_entry_script_arguments` expects a list.
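
If you already have arguments in the dictionary form that `script_params` accepts, one way to obtain the list form is to flatten the dictionary. The helper below is a hypothetical sketch, not an SDK function; it assumes that a value of `None` marks a flag that takes no value.

```python
def script_params_to_arguments(script_params):
    """Flatten a {"--name": value} dictionary into ["--name", value, ...].

    Assumption: a value of None means the key is a bare flag.
    Values are passed through unchanged, so they may be strings,
    numbers, or dataset/data reference objects.
    """
    arguments = []
    for name, value in script_params.items():
        arguments.append(name)
        if value is not None:
            arguments.append(value)
    return arguments


example = script_params_to_arguments({"--datadir": "data", "--epochs": 5, "--verbose": None})
print(example)
```

The resulting list could then be passed as `estimator_entry_script_arguments` when creating the step.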

> Initializing an Estimator object involves specifying a list of DataReference objects in its `inputs` parameter. In Pipelines, a step can take another step's output or DataReferences as input. So when creating an EstimatorStep, the `inputs` and `outputs` parameters need to be set explicitly, and they override the `inputs` parameter specified in the Estimator object.
   
> The best practice is to use a separate folder for each step's scripts and their dependent files, and to specify that folder as the step's `source_directory`. This reduces the size of the snapshot created for the step (only that folder is snapshotted). Because a change to any file in the `source_directory` triggers a re-upload of the snapshot, this also lets the step be reused when nothing in its `source_directory` has changed.

In [3]:
from azureml.core import Datastore, Dataset
from azureml.data import OutputFileDatasetConfig

def_blob_store = Datastore(ws, "workspaceblobstore")

input_data = Dataset.File.from_files(def_blob_store.path('iris.csv')).as_named_input('input').as_mount()
output = OutputFileDatasetConfig(destination=(def_blob_store, 'may_sample/outputdataset'))

source_directory = 'estimator_train'



In [4]:

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

conda_env = Environment('conda-env')
conda_env.python.conda_dependencies = CondaDependencies.create(pip_packages=['azureml-sdk<0.1.1'],
                                                               pip_indexurl='https://azuremlsdktestpypi.azureedge.net/Create-Dev-Index/15335858/')

In [5]:
from azureml.train.estimator import Estimator

est = Estimator(source_directory=source_directory, 
                compute_target=cpu_cluster, 
                entry_script='dummy_train.py', 
                environment_definition=conda_env)

## Create an EstimatorStep
[EstimatorStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.estimator_step.estimatorstep?view=azure-ml-py) adds a step to run Estimator in a Pipeline.

- **name:** Name of the step
- **estimator:** Estimator object
- **estimator_entry_script_arguments:** List of command-line arguments for the Estimator's entry script
- **runconfig_pipeline_params:** Override runconfig properties at runtime using key-value pairs each with name of the runconfig property and PipelineParameter for that property
- **inputs:** Inputs to the step
- **outputs:** List of PipelineData outputs
- **compute_target:** Compute target to use 
- **allow_reuse:** Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.
- **version:** Optional version tag to denote a change in functionality for the step

In [6]:
from azureml.pipeline.steps import EstimatorStep

est_step = EstimatorStep(name="Estimator_Train", 
                         estimator=est, 
                         estimator_entry_script_arguments=["--datadir", input_data, "--output", output],
                         compute_target=cpu_cluster)
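
On the receiving side, the entry script has to read the `--datadir` and `--output` arguments passed above. The sketch below shows one way it might do so with `argparse`. It is an assumption about how such a script could be written, not the contents of the actual `dummy_train.py`; the example paths are placeholders standing in for the locations the script would be expected to receive at run time.

```python
import argparse


def parse_args(argv=None):
    """Parse the two arguments that the EstimatorStep passes to the script."""
    parser = argparse.ArgumentParser(description="Hypothetical entry script")
    parser.add_argument("--datadir", type=str, help="location of the input data")
    parser.add_argument("--output", type=str, help="location to write outputs to")
    # parse_known_args tolerates extra arguments instead of exiting on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Placeholder paths, used here only to demonstrate the parsing.
args = parse_args(["--datadir", "/tmp/input", "--output", "/tmp/output"])
print("Data directory:", args.datadir)
print("Output directory:", args.output)
```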

## Build and Submit the Experiment

In [7]:
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment
pipeline = Pipeline(workspace=ws, steps=[est_step])
pipeline_run = Experiment(ws, 'Estimator_sample').submit(pipeline)

Created step Estimator_Train [e6719891][9a595dbf-0d83-4fad-ac92-98bf93b92a1f], (This step will run and generate new outputs)
Submitted PipelineRun 4d72c644-9f57-4352-b110-1c774e1f6956
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/Estimator_sample/runs/4d72c644-9f57-4352-b110-1c774e1f6956?wsid=/subscriptions/35f16a99-532a-4a47-9e93-00305f6c40f2/resourcegroups/rongduan-dev/workspaces/centraleuap


## View Run Details

In [8]:
pipeline_run.wait_for_completion(show_output=True)

PipelineRunId: 4d72c644-9f57-4352-b110-1c774e1f6956
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/Estimator_sample/runs/4d72c644-9f57-4352-b110-1c774e1f6956?wsid=/subscriptions/35f16a99-532a-4a47-9e93-00305f6c40f2/resourcegroups/rongduan-dev/workspaces/centraleuap
PipelineRun Status: NotStarted
PipelineRun Status: Running


StepRunId: bc9d6dd3-f571-4636-be40-6e3d9ef96ab9
Link to Azure Machine Learning Portal: https://ml.azure.com/experiments/Estimator_sample/runs/bc9d6dd3-f571-4636-be40-6e3d9ef96ab9?wsid=/subscriptions/35f16a99-532a-4a47-9e93-00305f6c40f2/resourcegroups/rongduan-dev/workspaces/centraleuap
StepRun( Estimator_Train ) Status: NotStarted
StepRun( Estimator_Train ) Status: Running

Streaming azureml-logs/20_image_build_log.txt
2020/06/09 18:47:44 Downloading source code...
2020/06/09 18:47:49 Finished downloading source code
2020/06/09 18:47:49 Creating Docker network: acb_default_network, driver: 'bridge'
2020/06/09 18:47:50 Successfully set up Do

Removing intermediate container 302a24e3bf24
 ---> 1fac65cea366
Step 9/15 : ENV PATH /azureml-envs/azureml_becb75071dac89d1dfc4639fd26ab126/bin:$PATH
 ---> Running in 7963c3d2327b
Removing intermediate container 7963c3d2327b
 ---> 03e91e230aed
Step 10/15 : ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_becb75071dac89d1dfc4639fd26ab126
 ---> Running in f5384c68361f
Removing intermediate container f5384c68361f
 ---> d0811d3d539c
Step 11/15 : ENV LD_LIBRARY_PATH /azureml-envs/azureml_becb75071dac89d1dfc4639fd26ab126/lib:$LD_LIBRARY_PATH
 ---> Running in 11f74fc61a77
Removing intermediate container 11f74fc61a77
 ---> 746a2eee1dff
Step 12/15 : COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
 ---> b6b75c0bd157
Step 13/15 : RUN if [ $SPARK_HOME ]; then /bin/bash -c '$SPARK_HOME/bin/spark-submit  /azureml-environment-setup/spark_cache.py'; fi
 ---> Running in 4a9f7937eb09
Removing intermediate container 4a9f7937eb09

Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception cannot import name 'OutputDatasetConfig'.
Initialize DatasetContextManager.
Starting the daemon thread to refresh tokens in background for process with pid = 104
Set Dataset input's target path to /tmp/tmp8_sj0hdm
Enter __enter__ of DatasetContextManager
SDK version: azureml-core==0.1.0.15477075 azureml-dataprep==1.7.1a2020060801
Processing 'input'.
Processing dataset FileDataset
{
  "source": [
    "('workspaceblobstore', 'iris.csv')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "439f24a6-1ea0-4beb-8f3a-1e23d8f2493d",
    "name": null,
    "version": null,
    "workspace": "Workspace.create(name='centraleuap', subscription_id='35f16a99-532a-4a47-9e93-00305f6c40f2', resource_group='rongduan-dev')"
  }
}
Mounting input to /tmp/tmp8_sj0hdm.
Mounted input to /tmp/tmp8_sj0hdm
Processing 'output_1331



PipelineRun Execution Summary
PipelineRun Status: Finished
{'runId': '4d72c644-9f57-4352-b110-1c774e1f6956', 'status': 'Completed', 'startTimeUtc': '2020-06-09T18:47:26.386289Z', 'endTimeUtc': '2020-06-09T18:58:00.621285Z', 'properties': {'azureml.runsource': 'azureml.PipelineRun', 'runSource': 'SDK', 'runType': 'SDK', 'azureml.parameters': '{}'}, 'inputDatasets': [], 'logFiles': {'logs/azureml/executionlogs.txt': 'https://centralestoragedlzbbsxk.blob.core.windows.net/azureml/ExperimentRun/dcid.4d72c644-9f57-4352-b110-1c774e1f6956/logs/azureml/executionlogs.txt?sv=2019-02-02&sr=b&sig=HTfBAANWxr3KSx8inUV8MqZDq4mHXxkDhr7H3GMWcEY%3D&skoid=e3f42e2c-d581-4b65-a966-631cfa961328&sktid=72f988bf-86f1-41af-91ab-2d7cd011db47&skt=2020-06-09T04%3A03%3A56Z&ske=2020-06-10T03%3A09%3A51Z&sks=b&skv=2019-02-02&st=2020-06-09T18%3A48%3A07Z&se=2020-06-10T02%3A58%3A07Z&sp=r', 'logs/azureml/stderrlogs.txt': 'https://centralestoragedlzbbsxk.blob.core.windows.net/azureml/ExperimentRun/dcid.4d72c644-9f57-4352-

'Finished'