Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Distributed TensorFlow with parameter server
In this tutorial, you will train a TensorFlow model on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using native [distributed TensorFlow](https://www.tensorflow.org/deploy/distributed).

## Prerequisites
* Go through the [configuration notebook](../../../configuration.ipynb) to:
    * install the AML SDK
    * create a workspace and its configuration file (`config.json`)
* Install the CMAKS SDK:

```
pip install --disable-pip-version-check --extra-index-url https://azuremlsdktestpypi.azureedge.net/CmAks-Compute-Test/D58E86006C65 azureml-pipeline-steps azureml-contrib-pipeline-steps azureml_contrib_itp --upgrade
```

In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

## Diagnostics
Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection

set_diagnostics_collection(send_diagnostics=True)

## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.

In [None]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
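
If `Workspace.from_config()` fails, a common cause is an incomplete `config.json`. The sketch below uses only the standard library to check a configuration file for the three keys the file is expected to hold: `subscription_id`, `resource_group`, and `workspace_name`. The helper name `missing_workspace_keys` and the placeholder values are illustrative and not part of the SDK.

```python
import json
import os
import tempfile

# Keys that a workspace config.json is expected to contain
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_workspace_keys(path):
    """Return the required keys that are absent or empty in the file at `path`."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with a throwaway file and placeholder values
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "config.json")
    with open(sample_path, "w") as f:
        json.dump({"subscription_id": "<subscription_id>",
                   "resource_group": "<resource_group>"}, f)
    missing = missing_workspace_keys(sample_path)
    print("Missing keys:", missing)
```

In practice you would point the helper at your own `config.json` and call `Workspace.from_config()` only when the returned list is empty.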

## Use existing CMAKS compute
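To train your model, you need a CMAKS compute target that is already attached to your workspace. If you have not attached one yet, follow the instructions for creating a [CMAKS compute target](https://github.com/Azure/CMK8s-Samples/blob/master/docs/3.%20attach%20CMAKS%20compute.markdown) first.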

In [None]:
from azureml.contrib.core.compute.cmakscompute import CmAksCompute
# List the compute targets currently attached to the workspace
print("compute targets after attach:\n")
for target_name in ws.compute_targets:
    print(target_name)

In [None]:
# Name of the attached CMAKS compute target to use
compute_name = "<compute_name>"  # replace with your compute name

In [None]:
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException
from azureml.contrib.core.compute.cmakscompute import CmAksCompute

cmaks_compute = ComputeTarget(workspace=ws, name=compute_name)
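
If `compute_name` does not match an attached compute target, the lookup above will not succeed. The following is a minimal sketch of a friendlier check, assuming `ws.compute_targets` behaves like a dictionary keyed by target name, as the listing loop earlier suggests. `find_compute_target` is a hypothetical helper, and the dictionary below is a stand-in with made-up names so that the snippet runs without a workspace.

```python
def find_compute_target(compute_targets, name):
    """Return the target registered under `name`, or raise a descriptive error."""
    if name not in compute_targets:
        available = ", ".join(sorted(compute_targets)) or "(none)"
        raise KeyError(
            f"Compute target '{name}' is not attached. Available: {available}"
        )
    return compute_targets[name]


# Stand-in for ws.compute_targets; the names here are hypothetical
example_targets = {"cmaks-gpu": object(), "cpu-cluster": object()}
target = find_compute_target(example_targets, "cmaks-gpu")
print("Found target:", target is example_targets["cmaks-gpu"])
```

With a real workspace you would pass `ws.compute_targets` and `compute_name` instead of the stand-in values.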

## Train model on the CMAKS compute
Now that we have the compute target ready to go, let's run our distributed training job.

### Create a project directory
Create a directory to hold all the code from your local machine that the remote compute needs. This includes the training script and any additional files it depends on.

In [None]:
import os

project_folder = './tf-distr-ps'
os.makedirs(project_folder, exist_ok=True)

Copy the training script `tf_mnist_replica.py` into this project directory.

In [None]:
import shutil

shutil.copy('tf_mnist_replica.py', project_folder)
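
Because the project directory is what the remote compute gets access to, it can help to review its contents before submitting. The sketch below lists every file under a folder together with its size. `list_project_files` is a hypothetical helper, and the demonstration runs on a temporary folder with a placeholder script rather than on the real `project_folder`.

```python
import os
import tempfile


def list_project_files(folder):
    """Return (relative_path, size_in_bytes) for every file under `folder`."""
    entries = []
    for root, _dirs, files in os.walk(folder):
        for file_name in files:
            full_path = os.path.join(root, file_name)
            rel_path = os.path.relpath(full_path, folder)
            entries.append((rel_path, os.path.getsize(full_path)))
    return sorted(entries)


# Demonstration on a temporary folder with a stand-in script
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "tf_mnist_replica.py"), "w") as f:
        f.write("print('placeholder')\n")
    project_files = list_project_files(demo_folder)
    for rel_path, size in project_files:
        print(f"{rel_path}: {size} bytes")
```

Calling `list_project_files(project_folder)` in the notebook should show only `tf_mnist_replica.py` at this point.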

### Create an experiment
Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track all the runs in your workspace for this distributed TensorFlow tutorial. 

In [None]:
from azureml.core import Experiment

experiment_name = 'tf-distr-ps'
experiment = Experiment(ws, name=experiment_name)

### Create a TensorFlow estimator
The AML SDK's TensorFlow estimator enables you to easily submit TensorFlow training jobs for both single-node and distributed runs. For more information, see the [TensorFlow estimator documentation](https://docs.microsoft.com/azure/machine-learning/service/how-to-train-tensorflow).

In [None]:
from azureml.train.dnn import TensorFlow, ParameterServer

script_params = {
    '--num_gpus': 2,  # number of GPUs to use on each node
    '--train_steps': 500
}

tf_est = TensorFlow(source_directory=project_folder,
                    compute_target=cmaks_compute,
                    script_params=script_params,
                    entry_script='tf_mnist_replica.py',
                    node_count=2,
                    distributed_training=ParameterServer(worker_count=2),
                    use_gpu=True)

The above code runs the training script on `2` nodes, with two workers and one parameter server. To execute a native distributed TensorFlow run, you must pass a `ParameterServer` object to the `distributed_training` argument, as in `distributed_training=ParameterServer(worker_count=2)`. With these settings, the estimator installs TensorFlow and its dependencies for you. However, if your script also uses other packages, make sure to install them via the `TensorFlow` constructor's `pip_packages` or `conda_packages` parameters.
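
To make the worker and parameter server roles concrete, here is a minimal sketch of how a process in a native distributed TensorFlow job can discover its role. It assumes the cluster layout is exposed through the `TF_CONFIG` environment variable, which is the convention distributed TensorFlow uses. This is not necessarily how `tf_mnist_replica.py` does it. The helper `read_cluster_role`, the host names, and the ports are made up for illustration.

```python
import json
import os


def read_cluster_role(environ=os.environ):
    """Parse TF_CONFIG and return (task_type, task_index, cluster)."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return task.get("type"), task.get("index"), cluster


# Hypothetical TF_CONFIG for worker 1 in a cluster of two workers and one
# parameter server; the host names and ports are made up for illustration
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "ps": ["host0:2222"],
            "worker": ["host0:2223", "host1:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
}

task_type, task_index, cluster = read_cluster_role(example_env)
print(f"Role: {task_type} #{task_index}")
print(f"Workers: {len(cluster['worker'])}, parameter servers: {len(cluster['ps'])}")
```

Each process in the job would see the same `cluster` section but a different `task` section, which is how a single script can act as either a worker or a parameter server.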

### Submit job
Run your experiment by submitting your estimator object. Note that this call is asynchronous.

In [None]:
run = experiment.submit(tf_est)
run

### Monitor your run
You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

Alternatively, you can block until the script has completed training before running more code.

In [None]:
run.wait_for_completion(show_output=True) # this provides a verbose log
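
Once the run has finished, you may want to inspect its outcome programmatically. The sketch below assumes the run object exposes `get_status()` and `get_metrics()`, and that a successful run reports the status `Completed`. `summarize_run` is a hypothetical helper, and `FakeRun` is a stand-in with an invented metric so that the snippet runs without a workspace. Whether the training script logs any metrics is not confirmed here.

```python
def summarize_run(run):
    """Return the final status and, for completed runs, the logged metrics."""
    status = run.get_status()
    metrics = run.get_metrics() if status == "Completed" else {}
    return {"status": status, "metrics": metrics}


class FakeRun:
    """Stand-in for a real run so this snippet works without a workspace."""

    def get_status(self):
        return "Completed"

    def get_metrics(self):
        return {"example_metric": 0.5}  # hypothetical metric name and value


summary = summarize_run(FakeRun())
print(summary)
```

In the notebook you would call `summarize_run(run)` after `run.wait_for_completion()` returns.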