Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License [2017] Zalando SE, https://tech.zalando.com


# Identity based data access with compute identity

## Introduction
This tutorial shows how to set up compute and a datastore for identity-based data access.

Learn how to:

> * Create compute with a managed identity.
> * Interact with a dataset using the user's identity.
> * Submit remote training using the compute identity for data access.

## Prerequisites
* Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning
* If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the [configuration notebook](../../../configuration.ipynb) to:
    * install the latest version of the AzureML SDK. No private build is required to use identity-based data access with compute identity
    * create a workspace and its configuration file (`config.json`)

In [None]:
# install latest version of AzureML SDK
!pip install -U azureml-sdk

In [None]:
!pip install pandas pyarrow

## Set up your development environment

All the setup for your development work can be accomplished in a Python notebook.  Setup includes:

* Importing Python packages
* Connecting to a workspace to enable communication between your local computer and remote resources
* Creating an experiment to track all your runs
* Creating a remote compute target to use for training

### Import packages

Import Python packages you need in this session. Also display the Azure Machine Learning SDK version.

In [None]:
import os
import azureml.core
from azureml.core import Workspace, Dataset, Datastore, ComputeTarget, RunConfiguration, Experiment
from azureml.core.runconfig import CondaDependencies

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

### Connect to workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`.

In [None]:
workspace = Workspace.from_config()
print('Workspace name: ' + workspace.name, 
      'Azure region: ' + workspace.location, 
      'Subscription id: ' + workspace.subscription_id, 
      'Resource group: ' + workspace.resource_group, sep='\n')
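For reference, `config.json` is a small JSON file with three keys. The sketch below shows its shape, assuming placeholder values and writing to a temporary folder (both are illustrative choices, not part of the original notebook) so that an existing configuration is not overwritten.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values: replace them with the details of your own workspace.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Written to a temporary folder so this sketch does not overwrite a real config.json.
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=4))

# Read the file back to confirm the three expected keys are present.
loaded = json.loads(config_path.read_text())
print(sorted(loaded))
```

If the file lives somewhere other than the notebook folder, `Workspace.from_config(path=...)` can be pointed at it.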

### Create experiment and a directory

Create an experiment to track the runs in your workspace and a directory to deliver the necessary code from your computer to the remote resource.

In [None]:
# create an ML experiment
exp = Experiment(workspace=workspace, name='iris')

# create a directory
script_folder = './iris-train'
os.makedirs(script_folder, exist_ok=True)

### Create AzureML compute resource with system assigned identity
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines, including VMs with GPU support. In this tutorial, you create an Azure Machine Learning Compute cluster as your training environment. The code below creates the cluster if it doesn't already exist in your workspace.

To provision the compute with identity:
* `identity_type`: The compute identity type to set on the cluster, either `SystemAssigned` or `UserAssigned`.
* `identity_id`: A list of resource IDs of the identities to use when the type is `UserAssigned`; optional otherwise. To get the `identity_id` of your managed identity, run the following command:
`az identity show --resource-group yourRGName --name yourIdentityName`

**Creation of compute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "identitycomp3"

try:
    compute_target = ComputeTarget(workspace=workspace, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D3_v2', 
                                                           max_nodes=4,
                                                           identity_type='SystemAssigned')

    # create the cluster
    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())
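The cell above uses a system-assigned identity. As a rough sketch of the user-assigned variant, the helpers below build the identity resource ID and the identity-related keyword arguments. The helper names and the all-zero subscription ID are hypothetical and only for illustration; `yourRGName` and `yourIdentityName` are the placeholders from the `az identity show` command above, whose output contains the same resource ID in its `id` field.

```python
def user_assigned_identity_id(subscription_id, resource_group, identity_name):
    """Build the Azure resource ID of a user-assigned managed identity."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity_name}"
    )


def identity_provisioning_kwargs(identity_type, identity_ids=None):
    """Return the identity-related keyword arguments for provisioning a cluster."""
    if identity_type not in ("SystemAssigned", "UserAssigned"):
        raise ValueError("identity_type must be 'SystemAssigned' or 'UserAssigned'")
    kwargs = {"identity_type": identity_type}
    if identity_type == "UserAssigned":
        if not identity_ids:
            raise ValueError("UserAssigned requires at least one identity resource ID")
        kwargs["identity_id"] = list(identity_ids)
    return kwargs


# Placeholder names; the subscription ID is made up for illustration.
identity_id = user_assigned_identity_id(
    "00000000-0000-0000-0000-000000000000", "yourRGName", "yourIdentityName"
)
kwargs = identity_provisioning_kwargs("UserAssigned", [identity_id])
print(kwargs)
```

The resulting dictionary could then be passed along with the other settings, for example `AmlCompute.provisioning_configuration(vm_size='Standard_D3_v2', max_nodes=4, **kwargs)`.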

## Create datastore without credentials
For datastores without credentials:
1. If you interact with the datastore or dataset from your local machine (for example, in a notebook), you will be prompted to log in, and your identity will be used for data access authentication.
2. If you submit an AzureML experiment with an AML Compute as the compute target, the identity of the compute will be used for data access authentication.

Therefore, make sure you grant both your user identity and the compute identity access to your storage account. Azure Blob, ADLS Gen1, and ADLS Gen2 are supported in private preview.

In [None]:
# create blob datastore without credentials
blob_dstore = Datastore.register_azure_blob_container(workspace=workspace,
                                                      datastore_name='credentialless_mayblob',
                                                      container_name='openhack',
                                                      account_name='mayworkspace8597807414')

In [None]:
# create adls gen1 without credentials
adls_dstore = Datastore.register_azure_data_lake(workspace = workspace,
                                                 datastore_name='credentialless_adls1',
                                                 store_name='rozh')

In [None]:
# create an ADLS Gen2 datastore without credentials
adls2_dstore = Datastore.register_azure_data_lake_gen2(workspace=workspace, 
                                                       datastore_name='credentialless_adls2', 
                                                       filesystem='tabular', 
                                                       account_name='mayadls2') 

## Interact with data in notebook
Dataset is the recommended way to interact with data in AzureML. You can download or mount a dataset, or load it into a common dataframe such as pandas. [Learn more](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets)

Your identity will be used for data access. For Blob and ADLS Gen2, make sure you have the **Storage Blob Data Reader** role to read data from the resource, or the **Storage Blob Data Contributor** role if you plan to write data back to the storage account.

In [None]:
# create a TabularDataset from the credential-less blob datastore
# if your datastore is behind a vnet, make sure the compute (e.g. compute instance) running this code is in the same vnet
blob_ds = Dataset.Tabular.from_delimited_files((blob_dstore,'test.csv'))

In [None]:
blob_ds.take(5).to_pandas_dataframe()

In [None]:
# create filedataset from the credential-less adlsgen2 datastore 
adls2_ds = Dataset.File.from_files((adls2_dstore,'updates_ca.csv'))
adls2_ds.to_path()

## Create the iris dataset

By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred. 

We will now upload the [Iris data](./train-dataset/Iris.csv) to a credential-less blob datastore within your workspace.

In [None]:
import pandas as pd
df = pd.read_csv("./iris-train/iris.csv")
df.head(3)
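The training script later in this tutorial selects the columns `sepal_length`, `sepal_width`, `petal_length`, `petal_width`, and `species`. Before registering the data, it can be worth confirming that the CSV header contains them. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper and the in-memory sample is stand-in data.

```python
import csv
import io

# Columns that the training script reads from the dataset.
REQUIRED_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Stand-in data for illustration; in the notebook you would pass the opened iris CSV file.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
)
print(missing_columns(sample))
```

An empty list means every column the script expects is present; any names returned would need to be renamed in the data or in the script.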

In [None]:
from azureml.core import Dataset
# this is a new API that creates a TabularDataset directly from a pandas dataframe
dataset = Dataset.Tabular.register_pandas_dataframe(df, target=(blob_dstore, "iris"), name='iris_train', show_progress=True)

### Create a training script

To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train_iris.py` in the script_folder. 

In [None]:
%%writefile $script_folder/train_iris.py

import os
import joblib

from azureml.core import Dataset, Run
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


run = Run.get_context()
# get input dataset by name
dataset = run.input_datasets['iris']

df = dataset.to_pandas_dataframe()

x_col = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
y_col = ['species']
x_df = df.loc[:, x_col]
y_df = df.loc[:, y_col]

# divide X and y into train and test data
x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)

data = {'train': {'X': x_train, 'y': y_train},

        'test': {'X': x_test, 'y': y_test}}

clf = DecisionTreeClassifier().fit(data['train']['X'], data['train']['y'])
model_file_name = 'decision_tree.pkl'

print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(x_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(clf.score(x_test, y_test)))

os.makedirs('./outputs', exist_ok=True)
# joblib.dump writes the file itself, so no separate open() call is needed
joblib.dump(value=clf, filename=os.path.join('outputs', model_file_name))

### Grant data access to compute identity

In remote training, the identity of the compute is used to access data. Make sure your compute identity has the Storage Blob Data Reader or Storage Blob Data Contributor role on your storage account. If you chose a system-assigned identity when creating the compute, the compute identity name is the following:

In [None]:
"{}/computes/{}".format(workspace.name, cluster_name)

You can go to the Azure portal to grant data access to this compute identity.
![image](grantaccess.jpg)
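As an alternative to the portal, the same role assignment can be made with the Azure CLI. The sketch below only assembles the `az role assignment create` command so it can be reviewed before running. It assumes the Azure CLI is installed and that you know the principal (object) ID of the compute identity; the helper names and every identifier shown are placeholders.

```python
import shlex


def storage_account_scope(subscription_id, resource_group, account_name):
    """Build the resource ID of a storage account, used as the role assignment scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def role_assignment_command(principal_id, scope, role="Storage Blob Data Reader"):
    """Return the az CLI arguments that assign `role` to the identity at `scope`."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", principal_id,
        "--role", role,
        "--scope", scope,
    ]


# All identifiers below are placeholders for illustration.
scope = storage_account_scope(
    "00000000-0000-0000-0000-000000000000", "yourRGName", "yourStorageAccount"
)
command = role_assignment_command("<compute-identity-principal-id>", scope)
print(shlex.join(command))
```

Pass `role="Storage Blob Data Contributor"` instead if the training job needs to write data back to the storage account.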

### Create an environment

Define a conda environment YAML file with your training script dependencies and create an Azure ML environment.

In [None]:
%%writefile conda_dependencies.yml

dependencies:
- python=3.6.2
- scikit-learn
- pip:
  - pandas
  - azureml-defaults

### Configure training run

A ScriptRunConfig object specifies the configuration details of your training job, including your training script, environment to use, and the compute target to run on. Specify the following in your script run configuration:
* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution. 
* The training script name, train_iris.py
* The input dataset for training, passed as an argument to your training script. `as_named_input()` is required so that the input dataset can be referenced by the assigned name in your training script. 
* The compute target. In this case you will use the AmlCompute you created
* The environment definition for the experiment

In [None]:
from azureml.core import Environment

sklearn_env = Environment.from_conda_specification(name = 'sklearn-env', file_path = './conda_dependencies.yml')

In [None]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder,
                      script='train_iris.py',
                      arguments=[dataset.as_named_input('iris')],
                      compute_target=compute_target,
                      environment=sklearn_env)

### Submit job to run
Submit the ScriptRunConfig to the Azure ML experiment to start the run.

In [None]:
run = exp.submit(src)

In [None]:
run.wait_for_completion(show_output=True)