# Welcome to Azure Machine Learning services

_Special thanks to the Azure CAT team and Daniel for the Dogbreeds sample._

The purpose of this sample is to demonstrate:

- What you need to start your run once the administrator has set up a workspace for you
- How to start your Azure Machine Learning services run

This setup notebook shows only the first feature of the sample.
See [Daniel's dogbreed notebook](../) for:

- Distributed training
- Hyperparameter tuning
- Azure Machine Learning Pipelines
- Inferencing
- Deploy model as web service

## System overview

You run your application from the local development system and submit it to an Azure workspace that consists of:

- Workspace
- AML Compute
- Storage
- Key vault
- Container registry
- Application Insights

The lab shares a common storage where the Dogbreeds demo data is stored. You will access the key to the lab storage using Key Vault.

![overview](assets/overview.png)

The admin provides you with the workspace information:

- SUBSCRIPTION_ID
- RESOURCE_GROUP
- WORKSPACE
- CLUSTER_NAME

The admin also gives you the shared data store information for the Dogbreeds data:

- DATA_RESOURCE_GROUP
- DATA_STORAGE_ACCOUNT
- DATA_STORAGE_CONTAINER

Before starting this notebook, use the information provided by the admin to retrieve the storage key, for example:

```
export DATA_STORAGE_KEY=$(az storage account keys list -g $DATA_RESOURCE_GROUP \
	  -n $DATA_STORAGE_ACCOUNT --query [0].value | tr -d '"')
```

Set each of these variables as shown in the next cell and you are ready to go.

In [None]:
import os

subscription_id = os.environ['SUBSCRIPTION_ID']
resource_group_name  = os.environ['RESOURCE_GROUP']
workspace_name  = os.environ['WORKSPACE']
cluster_name = os.environ['CLUSTER_NAME']

print('Workspace name: ' + workspace_name, 
      'subscription_id: ' + subscription_id, 
      'Resource group: ' + resource_group_name , sep = '\n')

data_storage_resource_group = os.environ['DATA_RESOURCE_GROUP']
data_storage_account = os.environ['DATA_STORAGE_ACCOUNT']
data_storage_container= os.environ['DATA_STORAGE_CONTAINER']
key=os.environ['DATA_STORAGE_KEY']

print('Data storage resource group: ' + data_storage_resource_group, 
      'Data storage account: ' + data_storage_account, 
      'Data storage container: ' + data_storage_container , sep = '\n')


The data storage provides a common storage location for data shared in the lab. You can also replace these values with those for your own data storage.
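
The cell that reads the settings raises a bare `KeyError` as soon as one environment variable is missing. As a minimal sketch, the check below reports all missing names at once. The helper `missing_settings` is hypothetical (not part of the sample), and the variable names are taken from the cell above, so adjust them if your admin uses different ones.

```python
import os

# Variable names as read by the settings cell in this notebook.
REQUIRED_VARS = [
    "SUBSCRIPTION_ID", "RESOURCE_GROUP", "WORKSPACE", "CLUSTER_NAME",
    "DATA_RESOURCE_GROUP", "DATA_STORAGE_ACCOUNT",
    "DATA_STORAGE_CONTAINER", "DATA_STORAGE_KEY",
]


def missing_settings(env=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


missing = missing_settings()
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Running this before the rest of the notebook lets you fix every missing value in one pass.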

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

# Dog breed classification using PyTorch Estimators on Azure Machine Learning service

Have you ever seen a dog and not been able to tell the breed? Some dogs look so similar that it can be nearly impossible to tell them apart. For instance, these are a few breeds that are difficult to distinguish:

#### Alaskan Malamutes vs Siberian Huskies
![Image of Alaskan Malamute vs Siberian Husky](http://cdn.akc.org/content/article-body-image/malamutehusky.jpg)

#### Whippet vs Italian Greyhound 
![Image of Whippet vs Italian Greyhound](http://cdn.akc.org/content/article-body-image/whippetitalian.jpg)

There are sites like http://what-dog.net, which use Microsoft Cognitive Services to be able to make this easier. 

In this tutorial, you will learn how to train a PyTorch image classification model using transfer learning with the Azure Machine Learning service. The Azure Machine Learning Python SDK's [PyTorch estimator](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-pytorch) enables you to easily submit PyTorch training jobs for both single-node and distributed runs on Azure compute. The model is trained to classify dog breeds using the [Stanford Dog dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/) and is based on a pretrained ResNet18 model, which was built using images and annotations from ImageNet. The Stanford Dog dataset contains 120 classes (i.e. dog breeds). To save time, however, most of the tutorial uses a subset of this dataset that includes only 10 dog breeds.

## The Problem
At the start, the user runs the workload on a local machine and finds it slow, even though training uses only about 8% of the data (10 of the 120 breeds).

In [None]:
# !mkdir outputs
# !python pytorch_train.py --data_dir breeds-10 --num_epochs 10 --output_dir outputs 

## What is Azure Machine Learning service?
Azure Machine Learning service is a cloud service that you can use to develop and deploy machine learning models. Using Azure Machine Learning service, you can track your models as you build, train, deploy, and manage them, all at the broad scale that the cloud provides.
![](aml-overview.png)


## How can we use it for training image classification models?
Training machine learning models, particularly deep neural networks, is often a time- and compute-intensive task. Once you've finished writing your training script and running on a small subset of data on your local machine, you will likely want to scale up your workload.

To facilitate training, the Azure Machine Learning Python SDK provides a high-level abstraction, the estimator class, which allows users to easily train their models in the Azure ecosystem. You can create and use an Estimator object to submit any training code you want to run on remote compute, whether it's a single-node run or distributed training across a GPU cluster. For PyTorch and TensorFlow jobs, Azure Machine Learning also provides respective custom PyTorch and TensorFlow estimators to simplify using these frameworks.

### Steps to train with a PyTorch Estimator:
In this tutorial, we will:
- Connect to an Azure Machine Learning service Workspace 
- Create a remote compute target
- Upload your training data (Optional)
- Create your training script
- Create an Estimator object
- Submit your training job

## Connect to the workspace

In the next step, you will connect to the Workspace that your admin set up for this tutorial.

**You will be asked to log in during this step. Please use your Microsoft AAD credentials.**

## Prerequisites
Make sure you have access to an Azure subscription. Your group's admin should have added you to your team's subscription. 

Details on how to set up the storage account are in the admin folder.

In [None]:
# from azureml.core import Workspace
#
# ws = Workspace.from_config(path='/Users/danielsc/git/dogbreeds/aml_config/config.json')
# 
# print('https://ms.portal.azure.com/#@microsoft.onmicrosoft.com/resource' + ws.get_details()['id'])

In [None]:
from azureml.core import Workspace
import os

print('Workspace name: ' + workspace_name, 
      'subscription_id: ' + subscription_id, 
      'Resource group: ' + resource_group_name , sep = '\n')

In [None]:
try:
    ws = Workspace(subscription_id=subscription_id, resource_group=resource_group_name, workspace_name=workspace_name)
    ws.write_config()
    print('Library configuration succeeded')
except Exception as err:
    # Stop here: the cells below need a valid `ws` object
    print('Workspace not found:', err)
    raise

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')


Once you are connected to the workspace, list the compute targets that are already available in it.

In [None]:
from azureml.core.compute import ComputeTarget

target_list = ComputeTarget.list(ws)
for target in target_list:
    print(target.serialize()["name"])

## Create a remote compute target
For this tutorial, we will use an AML Compute cluster of NC6s_v2 (P100 GPU) machines as the [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) on which to execute your training script.

**Creation of the cluster takes approximately 5 minutes, but we will not wait for it to complete** 

If the cluster is already in your workspace, this code skips the cluster creation process. Note that the code does not wait for cluster creation to complete. If needed, you can call `compute_target.wait_for_completion(show_output=True)`, which blocks your notebook until the compute target is provisioned.

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget

# choose a name for your cluster
print('cluster_name: ' + cluster_name) 

try:
    compute_target = ws.compute_targets[cluster_name]
    print('Found existing compute target.')
except KeyError:
    # `compute_target` stays undefined, so the estimator cell below will fail
    print('Cannot find existing compute target.')
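
The cell above only looks up the cluster; it does not create one when the lookup fails. A possible get-or-create helper is sketched below. The function name `get_or_create_cluster` and the defaults `vm_size="STANDARD_NC6S_V2"` and `max_nodes=4` are assumptions for illustration, so check with your admin which sizes and quotas your subscription allows.

```python
def get_or_create_cluster(ws, cluster_name, vm_size="STANDARD_NC6S_V2", max_nodes=4):
    """Return (compute_target, created) for the named AML Compute cluster.

    vm_size and max_nodes are illustrative assumptions, not values from the sample.
    """
    existing = ws.compute_targets  # dict: name -> ComputeTarget
    if cluster_name in existing:
        return existing[cluster_name], False

    # Imported lazily so the lookup path does not need these classes.
    from azureml.core.compute import AmlCompute, ComputeTarget

    config = AmlCompute.provisioning_configuration(vm_size=vm_size, max_nodes=max_nodes)
    # create() returns right away; provisioning continues in the background.
    return ComputeTarget.create(ws, cluster_name, config), True
```

You would call it as `compute_target, created = get_or_create_cluster(ws, cluster_name)`, followed by `compute_target.wait_for_completion(show_output=True)` only if you want the notebook to block until the cluster is ready.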

## Attach the blobstore with the training data to the workspace
While the cluster is still creating, let's attach some data to our workspace.

The dataset we will use consists of ~150 images per class; some breeds have more, others fewer. Each class has about 100 training images and ~50 validation images. We will look at 10 classes in this tutorial.

To make the data accessible for remote training, you will need to keep the data in the cloud. AML provides a convenient way to do so via a [Datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data). The datastore provides a way for you to upload/download data, and interact with it from your remote compute targets. It is an abstraction over Azure Storage. The datastore can reference either an Azure Blob container or Azure file share as the underlying storage. 

You can view the subset of the data used [here](https://github.com/heatherbshapiro/pycon-canada/tree/master/breeds-10), or download it as a zip file from [here](https://github.com/heatherbshapiro/pycon-canada/master/breeds-10.zip).


To attach this blob container as a data store to your workspace, use `Datastore.register_azure_blob_container`.

**If you already have the breeds datastore attached you can skip the next cell**

In [None]:
from azureml.core import Datastore
import os

key = key.strip('"')  # drop any surrounding double quotes left over from the CLI output

print('Data storage resource group: ' + data_storage_resource_group, 
      'Data storage account: ' + data_storage_account, 
      'Data storage container: ' + data_storage_container , sep = '\n')

Datastore.register_azure_blob_container(workspace=ws, 
                                        datastore_name='thebreeds', 
                                        container_name=data_storage_container,
                                        account_name=data_storage_account, 
                                        account_key=key)

Now let's get a reference to the path on the datastore with the training data. We can do so using the `path` method. In the next section, we can then pass this reference to our training script's `--data_dir` argument. We will start with the 10 classes dataset.

In [None]:
from azureml.core import Datastore

ds = Datastore(ws, 'thebreeds')

path_on_datastore = 'breeds-10'
ds_data = ds.path(path_on_datastore)
print(ds_data)

## Up/Download Data

If you are interested in downloading the data locally, you can run `ds.download(".", 'breeds-10')`. This might take several minutes.

You can also upload your data. See [Upload to the datastore object](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.azure_storage_datastore.abstractazurestoragedatastore?view=azure-ml-py#upload-src-dir--target-path-none--overwrite-false--show-progress-true-).

In [None]:
# ds.upload('breeds-10', 'breeds-10')

### Prepare training script
Now you will need to create your training script. In this tutorial, the training script is already provided for you at `pytorch_train.py`. In practice, you should be able to take any custom training script as is and run it with AML without having to modify your code.

You will need access to your data and a defined location for your output.

#### Training and metrics

However, if you would like to use AML's [tracking and metrics](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#metrics) capabilities, you will have to add a small amount of AML code inside your training script. 

In `pytorch_train.py`, we will log some metrics to our AML run. To do so, we will access the AML run object within the script:
```Python
from azureml.core.run import Run
run = Run.get_context()
```
Further within `pytorch_train.py`, we log the learning rate and momentum parameters, the best validation accuracy the model achieves, and the number of classes in the model:
```Python
run.log('lr', float(learning_rate))
run.log('momentum', float(momentum))
run.log('num_classes', num_classes)

run.log('best_val_acc', float(best_acc))
```

If you downloaded the data, you can start to train the model locally (note that it will take a long time if you don't have a GPU -- 21 min. on a Core i7 CPU).

**This step requires PyTorch to be installed locally -- find instructions [here](https://pytorch.org/#pip-install-pytorch)**


In [None]:
# !mkdir outputs
# !python pytorch_train.py --data_dir breeds-10 --num_epochs 10 --output_dir outputs 

## Train model on the remote compute
Now that you have your data and training script prepared, you are ready to train on your remote compute cluster. You can take advantage of Azure compute to leverage GPUs to cut down your training time. 

### Create an experiment
Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this transfer learning PyTorch tutorial. 

In [None]:
from azureml.core import Experiment

experiment_name = 'pytorch-dogs-10' 
experiment = Experiment(ws, name=experiment_name)

print(experiment)

### Create a PyTorch estimator
The AML SDK's PyTorch estimator enables you to easily submit PyTorch training jobs for both single-node and distributed runs. For more information on the PyTorch estimator, see [here](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-pytorch). 

The following code will define a single-node PyTorch job.

In [None]:
from azureml.train.dnn import PyTorch

script_params = {
    '--data_dir': ds_data.as_mount(),
    '--num_epochs': 10,
    '--output_dir': './outputs',
    '--log_dir': './logs',
    '--mode': 'fine_tune'
}

estimator10 = PyTorch(source_directory='.', 
                    script_params=script_params,
                    compute_target=compute_target, 
                    entry_script='pytorch_train.py',
                    pip_packages=['tensorboardX'],
                    use_gpu=True)


In [None]:
print(estimator10.run_config.environment.docker.base_image)

In [None]:
print(estimator10.conda_dependencies.serialize_to_string())

The `script_params` parameter is a dictionary containing the command-line arguments to your training script `entry_script`. Please note the following:
- Pass the training data reference `ds_data` to our script's `--data_dir` argument. This will 1) mount our datastore on the remote compute and 2) provide the path to the training data `breeds-10` on our datastore.
- Specify the output directory as `./outputs`. The `outputs` directory is specially treated by AML in that all the content in this directory gets uploaded to your workspace as part of your run history. The files written to this directory are therefore accessible even once your remote run is over. In this tutorial, we will save our trained model to this output directory.

To leverage the Azure VM's GPU for training, set `use_gpu=True`.
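
On the script side, the entries of `script_params` arrive as ordinary command-line arguments. The sketch below shows one way a training script could read them with `argparse`. It is a hypothetical illustration, not the actual contents of `pytorch_train.py`; the defaults and the sample path `/mnt/breeds-10` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments named in script_params (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Training script arguments")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="folder with the training data (the mounted datastore path)")
    parser.add_argument("--num_epochs", type=int, default=10)
    parser.add_argument("--output_dir", type=str, default="./outputs",
                        help="files written here are uploaded with the run history")
    parser.add_argument("--log_dir", type=str, default="./logs")
    parser.add_argument("--mode", type=str, default="fine_tune")
    return parser.parse_args(argv)


# Example with a made-up mount path, as the estimator might pass it.
args = parse_args(["--data_dir", "/mnt/breeds-10", "--num_epochs", "10"])
print(args.data_dir, args.num_epochs, args.output_dir)
```

Because the script only sees plain arguments, the same command line works for the local run and the remote run.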

### Tag the run

Before submitting the experiment, tag your run. Add information you will need to find your run at a later time. 

The tags are a Python dictionary of your own choosing.

In [None]:
dogtags = {'Run':'Dogbreeds', 'Source':'Notebook', 'Researcher':'Bruce', "Datasize" : path_on_datastore}

In [None]:
run = experiment.submit(estimator10, tags=dogtags)
run_id = run.id

print(run_id)

## To cancel
# run.cancel()

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

### What happens during a run?
If you are running this for the first time, the compute target will need to pull the docker image, which will take about 2 minutes. This gives us the time to go over how a **Run** is executed in Azure Machine Learning. 

Note: had we not created the workspace with an existing ACR, we would have also had to wait for the image creation to be performed -- that takes an extra 10-20 minutes for big GPU images like this one. This is a one-time cost for a given Python configuration, and subsequent runs will then be faster. We are working on ways to make this image creation faster.

![](../aml-run.png)

## Track job in the portal

Once the job has completed, the results are copied into your storage account.
Inspect the portal for the results using the link in the widget in the previous cell.

![results](assets/results.png)

## Get results

You can get the logs and models by:

- Using the portal
- Inspecting results in Azure storage
- Downloading the logs

### Your results in portal 

For the models

![portal output](assets/outputs.png) 

For the logs

![logs](assets/logs.png)

In [None]:
### Download all logs to a local directory

run.get_all_logs(destination='../../results/')

# Do not put the logs in the same directory as your script.
# If you do, they are uploaded to Azure the next time you submit your Python script.
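
To guard against downloading logs into the script folder by accident, you could check the destination first. This is a small sketch using only the standard library; the helper name `is_outside` and the example paths are hypothetical.

```python
from pathlib import Path


def is_outside(source_dir, destination):
    """Return True if destination is neither source_dir nor one of its subfolders."""
    source = Path(source_dir).resolve()
    dest = Path(destination).resolve()
    return dest != source and source not in dest.parents


# Made-up paths: a sibling folder is fine, a subfolder of the script folder is not.
print(is_outside("/proj/notebooks", "/proj/results"))
print(is_outside("/proj/notebooks", "/proj/notebooks/logs"))
```

With the layout used in this notebook, you would compare the estimator's `source_directory` (`'.'`) against the download destination before calling `run.get_all_logs`.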

In [None]:
print(run_id)

In [None]:
# print(ws.get_details())

### Your results in Azure storage

The results are in the workspace's storage account (not the $DATA_STORAGE_ACCOUNT where you got the data from).
The `<run_id>` is your run_id set in a previous cell.

`http://$WORKSPACE_STORAGE_ACCOUNT.blob.core.windows.net/azureml/ExperimentRun/dcid.<run_id>/outputs`

![data in storage](assets/datainstorage.png)
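
The URL pattern above can be filled in programmatically. The sketch below is one way to do it; the function name `run_outputs_url` and the sample account and run id are made up for illustration. It uses `https`, which storage accounts with secure transfer enabled require.

```python
def run_outputs_url(storage_account, run_id):
    """Build the blob URL of a run's outputs folder from the pattern shown above."""
    return (
        "https://{account}.blob.core.windows.net"
        "/azureml/ExperimentRun/dcid.{run_id}/outputs"
    ).format(account=storage_account, run_id=run_id)


# Hypothetical values; in the notebook you would pass your workspace storage
# account name and the run_id printed in an earlier cell.
print(run_outputs_url("mystorageaccount", "pytorch-dogs-10_123_abc"))
```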

## Next steps

Continue with the Dogbreeds sample code for:

- Distributed training
- Hyperparameter tuning
- Azure Machine Learning Pipelines
- Inferencing
- Deploy model as web service

Copy your data into your workspace storage account using AzCopy.

```bash
STORAGE_CONTAINER_NAME="<folder name for your data>"

azcopy \
  --source /mnt/myfiles/ \
  --destination https://$WORKSPACE_STORAGE_ACCOUNT.blob.core.windows.net/$STORAGE_CONTAINER_NAME \
  --dest-key $WORKSPACE_STORAGE_KEY \
  --recursive
```