Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License [2017] Zalando SE, https://tech.zalando.com

# Build a simple ML pipeline for image classification

## Introduction
This tutorial shows how to train a simple deep neural network using the [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) dataset and Keras on Azure Machine Learning. Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
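
As a quick reference, the ten integer labels map to clothing categories. The sketch below uses the class names listed in the Fashion-MNIST README; the helper `label_name` is a hypothetical convenience function and is not part of the dataset or the Azure ML SDK.

```python
# Fashion-MNIST class names, indexed by integer label (0-9).
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def label_name(label: int) -> str:
    """Return the clothing category for an integer label in the range 0-9."""
    if not 0 <= label < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"label must be in 0..9, got {label}")
    return FASHION_MNIST_CLASSES[label]


print(label_name(9))
```

A lookup like this is handy later when inspecting the first column of the prepared CSV files, which holds these integer labels.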

Learn how to:

> * Set up your development environment
> * Create the Fashion MNIST dataset
> * Create a machine learning pipeline to train a simple deep learning neural network on a remote cluster
> * Retrieve input datasets from the experiment and register the output model with datasets

## Prerequisites
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) to:
    * install the latest version of AzureML SDK
    * create a workspace and its configuration file (`config.json`)
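
For orientation, here is a minimal sketch of what a workspace `config.json` typically contains, assuming the usual three keys (`subscription_id`, `resource_group`, `workspace_name`). All values are placeholders, and `write_config` is a hypothetical helper; the configuration notebook normally creates this file for you.

```python
import json
import os
import tempfile

# Placeholder values: replace them with the details of your own workspace.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}


def write_config(directory: str, settings: dict) -> str:
    """Write the settings to config.json in the given directory and return its path."""
    path = os.path.join(directory, "config.json")
    with open(path, "w") as f:
        json.dump(settings, f, indent=4)
    return path


# Written to a temporary folder here so the sketch does not touch your project.
config_path = write_config(tempfile.mkdtemp(), config)
print(config_path)
```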

In [None]:
# install private build for the feature
!pip install -U -i https://azuremlsdktestpypi.azureedge.net/dev/aml/office/134157926D8F --extra-index-url https://pypi.org/simple "azureml-sdk==0.1.0.*"

## Set up your development environment

All the setup for your development work can be accomplished in a Python notebook.  Setup includes:

* Importing Python packages
* Connecting to a workspace to enable communication between your local computer and remote resources
* Creating an experiment to track all your runs
* Creating a remote compute target to use for training

### Import packages

Import Python packages you need in this session. Also display the Azure Machine Learning SDK version.

In [None]:
import os
import azureml.core
from azureml.core import Workspace, Dataset, Datastore, ComputeTarget, RunConfiguration, Experiment
from azureml.core.runconfig import CondaDependencies
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.train.dnn import TensorFlow

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

### Connect to workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`.

In [None]:
# load workspace
workspace = Workspace.from_config()
print('Workspace name: ' + workspace.name, 
      'Azure region: ' + workspace.location, 
      'Subscription id: ' + workspace.subscription_id, 
      'Resource group: ' + workspace.resource_group, sep='\n')

### Create experiment and a directory

Create an experiment to track the runs in your workspace and a directory to deliver the necessary code from your computer to the remote resource.

In [None]:
# create an ML experiment
exp = Experiment(workspace=workspace, name='keras-mnist-fashion')

# create a directory
script_folder = './keras-mnist-fashion'
os.makedirs(script_folder, exist_ok=True)

### Create AzureML compute resource with user assigned identity
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.

To provision the compute with identity:
* `identity_type`: The compute identity type to set on the cluster, either `SystemAssigned` or `UserAssigned`.
* `identity_id`: A list of resource IDs of the managed identities when `identity_type` is `UserAssigned`; optional otherwise. To get the resource ID of your managed identity, run the following command:
`az identity show --resource-group yourRGName --name yourIdentityName`
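
The command prints a JSON document whose `id` field holds the resource ID. The sketch below shows one way to turn that output into the list expected by `identity_id`. The sample JSON is shortened and made up, and `identity_ids_from_cli_output` is a hypothetical helper, not part of the SDK or the CLI.

```python
import json

# Shortened, made-up example of the JSON printed by `az identity show`.
sample_output = """
{
  "clientId": "00000000-0000-0000-0000-000000000000",
  "id": "/subscriptions/<subscription-id>/resourcegroups/yourRGName/providers/Microsoft.ManagedIdentity/userAssignedIdentities/yourIdentityName",
  "location": "centralus",
  "name": "yourIdentityName",
  "principalId": "00000000-0000-0000-0000-000000000000",
  "resourceGroup": "yourRGName",
  "type": "Microsoft.ManagedIdentity/userAssignedIdentities"
}
"""


def identity_ids_from_cli_output(cli_json: str) -> list:
    """Return a one-element list with the identity's resource ID."""
    identity = json.loads(cli_json)
    return [identity["id"]]


identity_ids = identity_ids_from_cli_output(sample_output)
print(identity_ids)
```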

**Creation of compute takes approximately 5 minutes.** If an AmlCompute cluster with that name already exists in your workspace, the code skips the creation step.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = 'identitycomp3'

try:
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D3_v2', 
                                                           max_nodes=4,
                                                           vnet_resourcegroup_name='maytest',
                                                           vnet_name='mayvnetcentral',
                                                           subnet_name='default',
                                                           identity_type='UserAssigned',
                                                           identity_id=['/subscriptions/35f16a99-532a-4a47-9e93-00305f6c40f2/resourcegroups/maytest/providers/Microsoft.ManagedIdentity/userAssignedIdentities/mayidentity'])

    # create the cluster
    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

## Create datastore without credentials
For datastores without credentials:
1. If you interact with the datastore or dataset from your local machine (for example, in a notebook), you will be prompted to log in, and your identity will be used to authenticate data access.
2. If you submit an Azure ML experiment with an AmlCompute cluster as the compute target, the identity of the compute will be used to authenticate data access.

Therefore, make sure you grant both your user identity and the compute identity access to your storage account. In private preview, credential-less access is supported for Azure Blob, ADLS Gen1, ADLS Gen2, and Azure SQL.

In [None]:
# create blob datastore without credentials
# the blob container in the sample is behind vnet
blob_dstore = Datastore.register_azure_blob_container(workspace=workspace,
                                                      datastore_name='credentialless_blob',
                                                      container_name='data',
                                                      account_name='mayvnet')

In [None]:
# create adls gen1 without credentials
adls_dstore = Datastore.register_azure_data_lake(workspace = workspace,
                                                 datastore_name='credentialless_adls1',
                                                 store_name='rozh')

In [None]:
# create adls gen2 datastore without credentials
adls2_dstore = Datastore.register_azure_data_lake_gen2(workspace=workspace, 
                                                       datastore_name='credentialless_adls2', 
                                                       filesystem='tabular', 
                                                       account_name='mayadls2') 

In [None]:
# create sql datastore without credentials
sql_datastore = Datastore.register_azure_sql_database(workspace=workspace,
                                                      datastore_name='credentialless_sql',
                                                      server_name='dprep-sql-test',
                                                      database_name='dprep-sql-test')

## Interact with data in notebook
Datasets are the recommended way to interact with data in AzureML. You can download or mount a dataset, or load it into a common dataframe. [Learn More](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets)

In [None]:
# create a TabularDataset from the credential-less blob datastore behind the vnet
# if your datastore is behind a vnet, make sure the compute running this code (e.g. a compute instance) is in the same vnet
blob_ds = Dataset.Tabular.from_delimited_files((blob_dstore,'Titanic.csv'))

In [None]:
blob_ds.take(5).to_pandas_dataframe()

In [None]:
# create filedataset from the credential-less adlsgen2 datastore 
adls2_ds = Dataset.File.from_files((adls2_dstore,'updates_ca.csv'))
adls2_ds.to_path()

In [None]:
# These two extra steps are required to enable identity-based data access to SQL Database
# 1. Create AAD user for the SQL Database, as instructed in this document: 
#    https://docs.microsoft.com/en-us/azure/sql-database/sql-database-aad-authentication-configure?tabs=azure-powershell#create-contained-database-users-in-your-database-mapped-to-azure-ad-identities
# 2. Grant the required permissions to execute sql query
#    https://docs.microsoft.com/en-us/sql/t-sql/statements/grant-object-permissions-transact-sql?view=sql-server-ver15

sql_ds = Dataset.Tabular.from_sql_query((sql_datastore, 'SELECT TOP (10) * FROM [SalesLT].[Product]'))
sql_ds.to_pandas_dataframe()

## Create the Fashion MNIST dataset

By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. 

We will now upload the Fashion MNIST files to the blob datastore without credentials.

In [None]:
# new API: upload a local directory to the datastore and create a FileDataset from it
fashion_ds = Dataset.File.upload_directory(src_dir='./data', target = (blob_dstore, "mnist-fashion"), overwrite=True)

Then we will create an unregistered FileDataset pointing to the path in the datastore. You can also create a dataset from multiple paths. [Learn More](https://aka.ms/azureml/howto/createdatasets) 

In [None]:
fashion_ds = Dataset.File.from_files((blob_dstore, "mnist-fashion"))

In [None]:
# list files referenced by dataset
fashion_ds.to_path()

## Build 2-step ML pipeline

The [Azure Machine Learning Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines) enables data scientists to create and manage multiple simple and complex workflows concurrently. A typical pipeline would have multiple tasks to prepare data, train, deploy and evaluate models. Individual steps in the pipeline can make use of diverse compute options (for example: CPU for data preparation and GPU for training) and languages. [Learn More](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/machine-learning-pipelines)


### Step 1: data preparation

In step one, we will load the images and labels from the Fashion MNIST dataset into mnist_train.csv and mnist_test.csv.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. Both mnist_train.csv and mnist_test.csv contain 785 columns. The first column consists of the class labels, which represent the article of clothing. The rest of the columns contain the pixel-values of the associated image.
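
The tutorial does not show the preparation script itself, so the following is only a sketch of how such a conversion could look. It assumes the raw files use the IDX binary format of the original Fashion MNIST download (a 16-byte header for images, an 8-byte header for labels). The function name `idx_to_csv` and the tiny synthetic files are hypothetical and exist only to make the example runnable.

```python
import csv
import os
import struct
import tempfile


def idx_to_csv(image_path, label_path, csv_path):
    """Write one CSV row per image: the label followed by its pixel values."""
    with open(image_path, "rb") as img_f, open(label_path, "rb") as lbl_f:
        # IDX headers are big-endian unsigned 32-bit integers.
        _magic, n_images, n_rows, n_cols = struct.unpack(">IIII", img_f.read(16))
        _magic, n_labels = struct.unpack(">II", lbl_f.read(8))
        if n_images != n_labels:
            raise ValueError("image and label counts differ")
        pixels_per_image = n_rows * n_cols
        with open(csv_path, "w", newline="") as out_f:
            writer = csv.writer(out_f)
            for _ in range(n_images):
                label = lbl_f.read(1)[0]
                pixels = img_f.read(pixels_per_image)
                writer.writerow([label, *pixels])
    return n_images


# Synthetic demo: two 28x28 images (all 0s, then all 255s) with labels 3 and 7.
work_dir = tempfile.mkdtemp()
image_file = os.path.join(work_dir, "demo-images-idx3-ubyte")
label_file = os.path.join(work_dir, "demo-labels-idx1-ubyte")
csv_file = os.path.join(work_dir, "demo.csv")

with open(image_file, "wb") as f:
    f.write(struct.pack(">IIII", 2051, 2, 28, 28))
    f.write(bytes([0] * 784))
    f.write(bytes([255] * 784))
with open(label_file, "wb") as f:
    f.write(struct.pack(">II", 2049, 2))
    f.write(bytes([3, 7]))

idx_to_csv(image_file, label_file, csv_file)
with open(csv_file, newline="") as f:
    csv_rows = list(csv.reader(f))
print(len(csv_rows), len(csv_rows[0]))
```

Each row ends up with 1 label column plus 784 pixel columns, which matches the 785-column layout described above.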

In [None]:
# set up the compute environment to install required packages
conda = CondaDependencies.create(pip_packages=['azureml-sdk<0.1.1'],
                                 pip_indexurl='https://azuremlsdktestpypi.azureedge.net/dev/aml/office/134157926D8F')

conda.set_pip_option('--pre')

run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = conda

Intermediate data (or output of a step) is represented by an OutputFileDatasetConfig object. prepared_fashion_ds is produced as the output of step 1, and used as the input of step 2. OutputFileDatasetConfig introduces a data dependency between steps, and creates an implicit execution order in the pipeline. You can register an OutputFileDatasetConfig as a dataset and version the output data automatically.
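
To make the idea of an implicit execution order concrete, here is a small conceptual sketch that derives a run order from data dependencies with the standard library's `graphlib` (Python 3.9+). It is not how Azure ML implements pipelines; the `steps` dictionary and `execution_order` function are hypothetical and only mirror the step and dataset names used in this tutorial.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical description of the two steps: the named data each one
# consumes and produces.
steps = {
    "prepare step": {"inputs": ["fashion_ds"], "outputs": ["prepared_fashion_ds"]},
    "train step": {"inputs": ["prepared_fashion_ds"], "outputs": []},
}


def execution_order(steps):
    """Order steps so that every producer runs before its consumers."""
    # Map each produced data name to the step that produces it.
    producer = {out: name for name, spec in steps.items() for out in spec["outputs"]}
    # For every step, collect the steps whose outputs it reads.
    graph = {
        name: {producer[i] for i in spec["inputs"] if i in producer}
        for name, spec in steps.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(steps))
```

Because the train step reads what the prepare step writes, the prepare step is always scheduled first, even though no order was stated explicitly.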

In [None]:
from azureml.data import OutputFileDatasetConfig

# learn more about the output config
help(OutputFileDatasetConfig)

**Configure output as dataset**: This is a new feature in private preview. Writing output back through a mount is supported for Azure Blob, ADLS Gen1, ADLS Gen2, and Azure file share datastores via dataset.

In [None]:
# write output to blob datastore under folder `outputdataset` and register it as a dataset after the experiment completes
prepared_fashion_ds = OutputFileDatasetConfig(destination=(blob_dstore, 'outputdataset/{run-id}')).register_on_complete(name='prepared_fashion_ds')

**Important**<br>
In remote training, the identity of the compute will be used for data authentication. In this example, a managed identity was assigned to the compute target. Therefore, make sure the managed identity has the Storage Blob Data Contributor role on the storage account so that it can write data back to the datastore without credentials.
![image](grantidentity.jpg)

A **PythonScriptStep** is a basic, built-in step to run a Python script on a compute target. It takes a script name and optionally other parameters like arguments for the script, compute target, inputs and outputs. If no compute target is specified, the default compute target for the workspace is used. You can also use a [**RunConfiguration**](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.runconfiguration?view=azure-ml-py) to specify requirements for the PythonScriptStep, such as conda dependencies and Docker image.

In [None]:
prep_step = PythonScriptStep(name='prepare step',
                             script_name="prepare.py",
                             # mount fashion_ds dataset to the compute_target
                             arguments=[fashion_ds.as_named_input('fashion_ds').as_mount(), prepared_fashion_ds],
                             source_directory=script_folder,
                             compute_target=compute_target,
                             runconfig=run_config,
                             allow_reuse=True)
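
The step passes two positional arguments to `prepare.py`. As a sketch, and assuming that at run time the mounted input dataset and the output configuration each arrive in the script as a folder path, the script could read them as shown below. The helper `resolve_paths` is hypothetical, and the temporary folders only stand in for the real paths so the example can run locally.

```python
import os
import tempfile


def resolve_paths(argv):
    """Return (input_dir, output_dir) taken from the two positional arguments."""
    if len(argv) != 3:
        raise SystemExit("usage: prepare.py <mounted_input_path> <output_path>")
    input_dir, output_dir = argv[1], argv[2]
    # The output folder may not exist yet, so create it before writing.
    os.makedirs(output_dir, exist_ok=True)
    return input_dir, output_dir


# Local dry run: inside the real script you would pass sys.argv instead.
demo_in = tempfile.mkdtemp()
demo_out = os.path.join(tempfile.mkdtemp(), "prepared")
input_dir, output_dir = resolve_paths(["prepare.py", demo_in, demo_out])
print(input_dir, output_dir)
```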

### Step 2: train CNN with Keras

Next, we construct an Estimator object. We will first set up the conda environment to install the necessary packages.

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

conda_env = Environment('conda-env')
conda_env.python.conda_dependencies = CondaDependencies.create(pip_packages=['azureml-sdk<0.1.1','keras','tensorflow','numpy','scikit-learn', 'matplotlib','pandas'],
                                                               pip_indexurl='https://azuremlsdktestpypi.azureedge.net/dev/aml/office/134157926D8F')

In [None]:
from azureml.train.estimator import Estimator
# set up training step with Estimator
est = Estimator(entry_script='train.py',
                source_directory=script_folder,                 
                environment_definition=conda_env,
                compute_target=compute_target)

est_step = EstimatorStep(name='train step',
                         estimator=est,
                         # read prepared_fashion_ds as delimited files (a TabularDataset) and use it as the input of the train step
                         estimator_entry_script_arguments=[prepared_fashion_ds.read_delimited_files().as_input(name='prepared_fashion_ds')],
                         compute_target=compute_target,
                         allow_reuse=True)

In [None]:
help(prepared_fashion_ds.read_delimited_files())

### Build the pipeline
Once we have the steps (or steps collection), we can build the [pipeline](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipeline.pipeline?view=azure-ml-py).

A pipeline is created with a list of steps and a workspace. Submit a pipeline using [submit](https://docs.microsoft.com/python/api/azureml-core/azureml.core.experiment(class)?view=azure-ml-py#submit-config--tags-none----kwargs-). When submit is called, a [PipelineRun](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinerun?view=azure-ml-py) is created which in turn creates [StepRun](https://docs.microsoft.com/python/api/azureml-pipeline-core/azureml.pipeline.core.steprun?view=azure-ml-py) objects for each step in the workflow.

In [None]:
# build pipeline & run experiment
pipeline = Pipeline(workspace, steps=[prep_step, est_step])
run = exp.submit(pipeline)

### Monitor the PipelineRun

In [None]:
run.wait_for_completion(show_output=True)

In [None]:
run.find_step_run('train step')[0].get_metrics()

## Register the input dataset and the output model

Azure Machine Learning datasets make it easy to trace how your data is used in ML. [Learn More](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-version-track-datasets#track-datasets-in-experiments)<br>
For each Machine Learning experiment, you can easily trace the datasets used as input through the `Run` object.

In [None]:
# get input tabular datasets to train step
train_step = run.find_step_run('train step')[0]
inputs = train_step.get_details()['inputDatasets']
input_dataset = inputs[0]['dataset']

input_dataset.take(3).to_pandas_dataframe()

Register the prepared Fashion MNIST TabularDataset with the workspace so that you can reuse it in other experiments or share it with your colleagues who have access to your workspace.

In [None]:
tabular_prepared_ds = input_dataset.register(workspace = workspace,
                                    name = 'tabular_prepared_ds',
                                    description = 'prepared ds in tabular format',
                                    create_new_version = True)
tabular_prepared_ds

Register the output model together with the dataset it was trained on.

In [None]:
run.find_step_run('train step')[0].register_model(model_name = 'keras-model', model_path = 'outputs/model/',
                                                  datasets=[('training data', tabular_prepared_ds)])