# Distributed TensorFlow with Horovod

description: train a TensorFlow CNN model on MNIST data, distributed via Horovod

For more information on using Horovod with TensorFlow, refer to the Horovod documentation:

* [Horovod with TensorFlow](https://github.com/horovod/horovod/blob/master/docs/tensorflow.rst)
* [Horovod with Keras](https://github.com/horovod/horovod/blob/master/docs/keras.rst)

In [None]:
from azureml.core import Workspace

# load the workspace from the local Azure ML config file
ws = Workspace.from_config()
ws

In [None]:
import git
from pathlib import Path

# get root of git repo
prefix = Path(git.Repo(".", search_parent_directories=True).working_tree_dir)

# training script
source_dir = prefix.joinpath(
    "code", "models", "tensorflow", "mnist-distributed-horovod"
)
script_name = "train.py"

# environment file
environment_file = prefix.joinpath("environments", "tf-gpu-horovod.yml")

# azure ml settings
environment_name = "tf-gpu-horovod"
experiment_name = "tf-mnist-distr-horovod-example"
cluster_name = "gpu-K80-2"

In [None]:
# display the training script; read_text() closes the file after reading
print(source_dir.joinpath(script_name).read_text())

## Create environment

In [None]:
from azureml.core import Environment

env = Environment.from_conda_specification(environment_name, environment_file)

# specify a GPU base image
env.docker.enabled = True
env.docker.base_image = (
    "mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04"
)

## Configure and run training job

Create a `ScriptRunConfig` to specify the training script and its arguments, the environment, and the cluster to run on. To run an MPI/Horovod job, create an `MpiConfiguration` and set `process_count_per_node` to the number of GPUs available on each node of your cluster. The job launches `node_count * process_count_per_node` worker processes in total.
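
To make the relationship between `node_count`, `process_count_per_node`, and the resulting workers concrete, here is a minimal sketch in plain Python. It does not call Azure ML or Horovod. The `Worker` tuple and the `plan_workers` helper are hypothetical names introduced only for illustration, and the node-by-node rank ordering is an assumption: the actual rank placement is decided by the MPI launcher.

```python
from typing import List, NamedTuple


class Worker(NamedTuple):
    rank: int        # global rank across the whole job
    node: int        # index of the node hosting this process
    local_rank: int  # index within the node; typically the GPU the process uses


def plan_workers(node_count: int, process_count_per_node: int) -> List[Worker]:
    """Enumerate the processes a job of this shape would launch.

    Hypothetical helper: assumes ranks are assigned node by node,
    which is a common layout but not guaranteed by every MPI launcher.
    """
    if node_count < 1 or process_count_per_node < 1:
        raise ValueError("both counts must be positive")
    return [
        Worker(
            rank=node * process_count_per_node + local,
            node=node,
            local_rank=local,
        )
        for node in range(node_count)
        for local in range(process_count_per_node)
    ]


# same shape as the MpiConfiguration used in this notebook
workers = plan_workers(node_count=2, process_count_per_node=2)
for w in workers:
    print(f"rank {w.rank}: node {w.node}, local rank {w.local_rank}")
```

With two nodes and two processes per node, the sketch yields four workers. Setting `process_count_per_node` below the GPU count of a node would leave GPUs idle, which is why the value should match the hardware.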

In [None]:
from azureml.core import ScriptRunConfig, Experiment
from azureml.core.runconfig import MpiConfiguration

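# look up the existing compute cluster by name
cluster = ws.compute_targets[cluster_name]

# 2 nodes x 2 processes per node = 4 Horovod workers in total
distr_config = MpiConfiguration(process_count_per_node=2, node_count=2)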

src = ScriptRunConfig(
    source_directory=source_dir,
    script=script_name,
    arguments=["--epochs", 30],
    compute_target=cluster,
    environment=env,
    distributed_job_config=distr_config,
)

In [None]:
run = Experiment(ws, experiment_name).submit(src)
run

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

In [None]:
# block until the run finishes, streaming its logs to the notebook
run.wait_for_completion(show_output=True)