Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Distributed Image Classification with TensorFlow

In this tutorial, you will train a TensorFlow model on the [CIFAR10](http://www.cs.toronto.edu/~kriz/cifar.html) dataset using distributed training with the TensorFlow 2 `MultiWorkerMirroredStrategy` across an Azure Stack Hub CPU Kubernetes cluster.

## Prerequisites

* [A Kubernetes cluster deployed on Azure Stack Hub, connected to Azure through Azure Arc](https://github.com/Azure/AML-Kubernetes/blob/master/docs/ASH/AML-ARC-Compute.md).

* [A datastore set up in the Azure Machine Learning workspace and backed by an Azure Stack Hub storage account](https://github.com/Azure/AML-Kubernetes/blob/master/docs/ASH/Train-AzureArc.md).

* Last but not least, you need to be able to run a notebook.

  If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, go through the configuration notebook located [here](https://github.com/Azure/MachineLearningNotebooks) first. This sets you up with a working config file that has information on your workspace, subscription ID, etc.

In [None]:
from azureml.core import Dataset, Environment, Experiment, Workspace
import os
import requests

## Initialize AzureML workspace

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`. 

If you haven't done so already, open the `config.json` file and fill in your workspace information.

In [None]:
ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

## Download CIFAR-10 data

The function below downloads the CIFAR-10 archive. It is called later, and only when the dataset is not yet registered in the workspace, so running this notebook multiple times does not download the data again. The download may take about 5 minutes.

In [None]:
import os
import requests

def download_cifar10_data():
    """Download the CIFAR-10 archive into ./cifar10-data and return that folder."""
    path = 'https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz'
    downloaded_folder = os.path.join(os.getcwd(), 'cifar10-data')
    os.makedirs(downloaded_folder, exist_ok=True) # download data to 'cifar10-data' folder

    response = requests.get(path, allow_redirects=True)
    response.raise_for_status()  # fail early instead of saving an HTTP error page
    with open(os.path.join(downloaded_folder, path.split('/')[-1]), 'wb') as f:
        f.write(response.content)

    return downloaded_folder
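
The function above fetches the archive every time it is called. A minimal sketch of a local cache check follows, assuming the archive can be treated as complete when the file exists and is non-empty. `archive_is_cached` is a hypothetical helper and is not part of the notebook.

```python
import os

def archive_is_cached(folder, filename="cifar-10-python.tar.gz"):
    """Return True if the archive already exists in `folder` and is non-empty."""
    archive_path = os.path.join(folder, filename)
    return os.path.isfile(archive_path) and os.path.getsize(archive_path) > 0
```

Calling this check at the start of `download_cifar10_data` and returning early when it is true would skip the network request on repeated runs. It does not verify the file contents, so a partially written archive would need a stronger check such as a checksum.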

## Prepare the dataset

The following cell uploads `cifar-10-python.tar.gz` to the workspace datastore and then registers it as a dataset in the workspace.

Upload and dataset registration take about 3 minutes.

In [None]:
import os
from azureml.core import Datastore, Dataset

dataset_name = 'cifar10_ds'
datastore_name = "ashstore"

if dataset_name not in ws.datasets:
    datastore = Datastore.get(ws, datastore_name)
    
    downloaded_folder = download_cifar10_data()
    src_dir, target_path = downloaded_folder, 'cifar10-data-ash'
    
    # upload data from local to AML datastore:
    datastore.upload(src_dir, target_path)

    # register data uploaded as AML dataset:
    datastore_paths = [(datastore, target_path)]
    cifar_ds = Dataset.File.from_files(path=datastore_paths)
    cifar_ds.register(ws, dataset_name, "CIFAR-10 images from https://www.cs.toronto.edu/~kriz/cifar.html")
        
dataset_ash = ws.datasets[dataset_name]
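
The training script (`tf-script/train.py`, not shown in this notebook) has to read the archive from the mounted dataset path. As a rough sketch of what that can look like, the helper below loads one batch directly from the archive. It assumes the standard layout of the CIFAR-10 Python archive (pickled batch files under `cifar-10-batches-py/`, each with `b'data'` and `b'labels'` keys). `load_cifar_batch` is a hypothetical name, and the real `train.py` may do this differently.

```python
import pickle
import tarfile

import numpy as np

def load_cifar_batch(archive_path, member="cifar-10-batches-py/data_batch_1"):
    """Load one pickled batch from the CIFAR-10 archive as (images, labels)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        fileobj = tar.extractfile(member)
        # The batches were pickled with Python 2, hence encoding="bytes".
        batch = pickle.load(fileobj, encoding="bytes")
    # Each row holds 3072 bytes: 1024 red, then 1024 green, then 1024 blue values.
    data = np.asarray(batch[b"data"], dtype=np.uint8)
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # to (N, 32, 32, 3)
    labels = np.asarray(batch[b"labels"], dtype=np.int64)
    return images, labels
```

The transpose moves the channel axis last, which is the image layout TensorFlow uses by default.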

## Set up the compute target

Find the attach name of the Arc-enabled Azure Stack Hub Kubernetes cluster in your AzureML workspace and use it to create a `ComputeTarget`:

In [None]:
from azureml.contrib.core.compute.kubernetescompute import KubernetesCompute

attach_name = "attachedarc"
arcK_target = KubernetesCompute(ws, attach_name)

## Configure the training job and submit

### Create an experiment

In [None]:
experiment_name = 'dist-tf2-on-aks-arc'
experiment = Experiment(workspace=ws, name=experiment_name)

### Create an environment

In [None]:
env = Environment.from_dockerfile(
    name='tf_2.4',
    dockerfile='tf-script/Dockerfile.gpu',
    conda_specification='tf-script/tf-24-env.yaml')

### Configure the training job

Use `TensorflowConfiguration` to set the number of workers and the number of parameter servers.
With three workers (`worker_count=3`), training for one epoch may take about 21 minutes on VMs comparable to Standard_DS3_v2. The cell below sets `worker_count = 2`.


In [None]:
from azureml.core import ScriptRunConfig, Run
from azureml.core.runconfig import TensorflowConfiguration

worker_count = 2
src = ScriptRunConfig(source_directory='tf-script',
                      script='train.py',
                      arguments=[
                          '--dataset-path', dataset_ash.as_mount(),
                          '--epochs', 1,  # 80
                          '--global-batch-size', 256,
                          '--batches-per-epoch', 256,
                          '--alpha-init', 0.005,
                      ],
                      compute_target=arcK_target,
                      environment=env,
                      distributed_job_config=TensorflowConfiguration(
                          worker_count=worker_count,
                          parameter_server_count=1))  # AML TensorFlow job configuration

rs_config = src.run_config.amlk8scompute.resource_configuration
rs_config.gpu_count = 0
rs_config.cpu_count = worker_count - rs_config.gpu_count
rs_config.memory_request_in_gb = 6
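
`MultiWorkerMirroredStrategy` discovers its peers through the `TF_CONFIG` environment variable, a JSON document with a `cluster` section and a `task` section. The sketch below parses such a value and derives the per-worker batch size from a global batch size like the one passed in `--global-batch-size`. The sample `TF_CONFIG` string, its host names, and the helper name `describe_tf_config` are illustrative assumptions. The real value is expected to be set on each node at run time.

```python
import json
import os

def describe_tf_config(global_batch_size, tf_config=None):
    """Summarize this process's role from a TF_CONFIG-style JSON string."""
    raw = tf_config if tf_config is not None else os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    workers = config.get("cluster", {}).get("worker", [])
    task = config.get("task", {})
    num_workers = max(len(workers), 1)  # fall back to single-worker training
    return {
        "task_type": task.get("type", "worker"),
        "task_index": task.get("index", 0),
        "num_workers": num_workers,
        "per_worker_batch_size": global_batch_size // num_workers,
    }

# Hypothetical two-worker cluster; host names are placeholders.
sample = json.dumps({
    "cluster": {"worker": ["worker-0:2222", "worker-1:2222"]},
    "task": {"type": "worker", "index": 1},
})
print(describe_tf_config(256, sample))
```

With a global batch size of 256 and two workers, each worker would process 128 samples per step, assuming one replica per worker.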



### Submit the job

Run your experiment by submitting your ScriptRunConfig object. Note that this call is asynchronous.

In [None]:
run = experiment.submit(config=src)
run.wait_for_completion(show_output=True) # this provides a verbose log

### Register the model

Register the trained model.

In [None]:
# The model is saved at path "outputs/001"
registered_model_name = 'cifar10tf'
model = run.register_model(model_name=registered_model_name, model_path='outputs/001')

The machine learning model named "cifar10tf" should be registered in your AML workspace.

## Test the registered model

To test the trained model, you can create a new AKS cluster (or use an existing one) and serve the model through an AML deployment.

Note: You will test the model in the Azure cloud because AzureML inferencing on Azure Stack Hub is currently unavailable.

In [None]:
from azureml.core import Environment, Workspace, Model, ComputeTarget
from azureml.core.compute import AksCompute
from azureml.core.model import InferenceConfig
from azureml.core.webservice import Webservice, AksWebservice
from azureml.core.compute_target import ComputeTargetException
import numpy as np
import json

### Provision the AKS cluster

This is a one-time setup. You can reuse this cluster for multiple deployments after it has been created. If you delete the cluster or the resource group that contains it, you will have to recreate it. Creating a new AKS cluster may take about 5 minutes.

In [None]:
ws = Workspace.from_config()

# Choose a name for your AKS cluster
aks_name = 'aks-service-2'

# Verify that cluster does not exist already
try:
    aks_target = ComputeTarget(workspace=ws, name=aks_name)
    is_new_compute = False
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # Use the default configuration (can also provide parameters to customize)
    prov_config = AksCompute.provisioning_configuration()

    # Create the cluster
    aks_target = ComputeTarget.create(workspace=ws,
                                      name=aks_name,
                                      provisioning_configuration=prov_config)
    # Block until provisioning finishes so the deployment below has a ready target
    aks_target.wait_for_completion(show_output=True)
    is_new_compute = True
    
print("using compute target: ", aks_target.name)

### Deploy the model

In [None]:
env = Environment.from_conda_specification(name="tf_2.4", file_path="tf-script/tf-24-env.yaml")
inference_config = InferenceConfig(entry_script='score_tf.py', environment=env)
deploy_config = AksWebservice.deploy_configuration()

model = ws.models[registered_model_name]
service_name = 'cifartfservice1'
service = Model.deploy(workspace=ws,
                       name=service_name,
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deploy_config,
                       deployment_target=aks_target,
                       overwrite=True)

service.wait_for_deployment(show_output=True)

### Test with inputs

For testing purposes, the first image (a cat) from the test batch has been extracted and saved at `test_imgs/test_img_0_cat.jpg`. It is shown here:

![Test image of a cat](test_imgs/test_img_0_cat.jpg)

After some preprocessing, the image is converted to JSON (`cifar10_test_input_tf.json`) as input for the trained model. The outputs are the probabilities of each class for each image:

In [None]:
with open("cifar10_test_input_tf.json", "r") as fp:
    inputs_json = json.load(fp)
inputs = json.dumps(inputs_json)
resp = service.run(inputs)
predicts = resp["predictions"]
print(predicts)

Then you can easily get the predictions of labels:

In [None]:
import numpy as np
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
np_predicts = np.array(predicts)
pred_indexes = np.argmax(np_predicts, 1)
predict_labels = [classes[i] for i in pred_indexes]
print(predict_labels)
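
`np.argmax` keeps only the single most likely class. To see how confident the model is, a small sketch such as the following lists the top-k classes with their scores. It assumes each row of the predictions holds one score per class, in the same order as `classes`. `top_k_labels` and the `example` scores are hypothetical and are not output from the deployed service.

```python
import numpy as np

CLASSES = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

def top_k_labels(scores, k=3, classes=CLASSES):
    """Return, per image, the k highest-scoring (label, score) pairs."""
    scores = np.asarray(scores, dtype=float)
    # argsort is ascending, so take the last k columns and reverse them.
    top_idx = np.argsort(scores, axis=1)[:, -k:][:, ::-1]
    return [[(classes[i], float(row[i])) for i in idx]
            for row, idx in zip(scores, top_idx)]

# Made-up scores for a single image, for illustration only.
example = [[0.01, 0.02, 0.05, 0.60, 0.02, 0.25, 0.02, 0.01, 0.01, 0.01]]
print(top_k_labels(example))
```

A close second-place score (for example "dog" next to "cat") is a useful hint that the model is uncertain about that image.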

Depending on your model, you may or may not get the correct label, which is "cat".

### Delete the newly created cluster

Note: This is important if you wish to avoid the cost of this cluster.

In [None]:
if is_new_compute:
    # Delete the web service first, while the cluster that hosts it still exists
    service.delete()
    aks_target.delete()

## Next steps

1. Learn how to [download a model and then upload it to Azure Storage blobs](../AML-model-download-upload.ipynb)
2. Learn how to [run inference using KFServing with a model in Azure Storage blobs](https://aka.ms/kfas)