# Data Parallel Training with TensorFlow2

In this notebook we will learn how to engineer a TensorFlow2 training job using the native `Allreduce` implementation as well as the `Horovod` framework, so you can compare both implementations side by side. As a test task we use everyone's favorite MNIST dataset and train a small computer vision model to solve a classification task. We will use the convenient `Keras` API to build and train the model and to evaluate the results.

**Instance recommendations** <br>
Feel free to experiment with various SageMaker GPU instances. By default we use single-GPU `ml.p2.xlarge` instances to minimize the cost of training.

**Disclaimer** <br>
This example should not be used to draw any conclusions about training efficiencies since the dataset and model are very small.

## TensorFlow MultiWorkerMirroredStrategy Training
TensorFlow2 provides several native implementations of data parallel training known as `strategies`. In this example we will use the synchronous multi-GPU, multi-node Allreduce implementation called `MultiWorkerMirroredStrategy` (`"MWMS"`). Refer to [this overview](https://www.tensorflow.org/guide/distributed_training) of data parallel strategies if you want to learn about the others.

As always, we start with necessary imports and basic SageMaker training configs.

```python
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
# Replace with a role ARN if you are not using SageMaker Notebook or Studio environments.
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/tf-distribution-options'
print('Bucket:\n{}'.format(bucket))
```

### Developing training script

Let's review the modifications we need to make to the training script to enable MWMS. The full sources are here: `chapter6/1_sources/train_ms.py`

#### Cluster configuration and setup
MWMS is not natively supported by Amazon SageMaker, so we need to configure the MWMS environment ourselves. TF2 uses an environment variable called `TF_CONFIG` to represent the cluster configuration, which is then used to start the training processes. You can read about building the `TF_CONFIG` variable [here](https://www.tensorflow.org/guide/distributed_training#TF_CONFIG). We use the `_build_tf_config()` method below to set up this variable. Note that it relies on the SageMaker environment variables `SM_HOSTS` and `SM_CURRENT_HOST`.

```python
import json
import os


def _build_tf_config():
    # SageMaker exposes the cluster hosts as a JSON list, e.g. ["algo-1", "algo-2"].
    hosts = json.loads(os.getenv("SM_HOSTS"))
    current_host = os.getenv("SM_CURRENT_HOST")

    # Every node in the cluster acts as a worker.
    workers = hosts

    def host_addresses(hosts, port=7777):
        return ["{}:{}".format(host, port) for host in hosts]

    tf_config = {"cluster": {}, "task": {}}
    tf_config["cluster"]["worker"] = host_addresses(workers)
    # The task index is the position of the current host in the worker list.
    tf_config["task"] = {"index": workers.index(current_host), "type": "worker"}

    os.environ["TF_CONFIG"] = json.dumps(tf_config)
```


By default this sample uses two `p2.xlarge` instances, for a total world size of just 2 training processes. In that case `_build_tf_config()` produces the following `TF_CONFIG` on the rank=0 node:

```json
{
    "cluster": {
        "worker": ["algo-1:7777", "algo-2:7777"]
    },
    "task": {"index": 0, "type": "worker"}
}
```
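To see how this configuration differs between nodes, the sketch below re-creates the same logic as a pure function and evaluates it for each host of a two-node cluster. The helper name `build_tf_config_for` and the hard-coded host list are illustrative assumptions; this code is not part of `train_ms.py`.

```python
import json


def build_tf_config_for(hosts, current_host, port=7777):
    """Return the TF_CONFIG dict for `current_host` (hypothetical helper)."""
    return {
        "cluster": {"worker": [f"{host}:{port}" for host in hosts]},
        "task": {"index": hosts.index(current_host), "type": "worker"},
    }


# Assumed content of SM_HOSTS for a two-node job.
hosts = ["algo-1", "algo-2"]
configs = {host: build_tf_config_for(hosts, host) for host in hosts}

for host, config in configs.items():
    print(host, json.dumps(config))
```

Only the `task` entry changes from node to node; the `cluster` entry is identical everywhere, which is what lets the workers find each other.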

Once `TF_CONFIG` is set correctly, TF2 should be able to start training processes on all nodes and use all available GPU devices (this is the default setting, but you can also provide a list of specific GPU devices to use).

To complete the cluster setup we also need to make sure that the NCCL environment is properly configured (see the `_set_nccl_environment()` method) and that all nodes in the cluster can communicate with each other (see the `_dns_lookup()` method). Note that these methods are required because TensorFlow2 strategies are not officially supported by SageMaker. For supported data parallel implementations, SageMaker provides these utilities out of the box and runs them as part of training cluster initiation.
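The actual helpers live in the training script and are not reproduced here. As a rough idea of what such setup can look like, the sketch below waits until every peer host name resolves and then points NCCL at a network interface. The function names, the retry policy, and the `eth0` fallback are assumptions made for illustration; `SM_NETWORK_INTERFACE_NAME`, `NCCL_SOCKET_IFNAME`, and `NCCL_DEBUG` are existing SageMaker and NCCL environment variables.

```python
import os
import socket
import time


def wait_for_hosts(hosts, resolve=socket.gethostbyname, retries=10, delay_s=1.0):
    """Return True once every host name resolves, False if retries run out."""
    for _ in range(retries):
        try:
            for host in hosts:
                resolve(host)
            return True
        except socket.gaierror:
            # Peer containers may still be starting; wait and try again.
            time.sleep(delay_s)
    return False


def set_nccl_environment(environ=os.environ):
    """Tell NCCL which network interface to use (assumed configuration)."""
    # Fall back to "eth0" if SageMaker does not report an interface (assumption).
    interface = environ.get("SM_NETWORK_INTERFACE_NAME", "eth0")
    environ["NCCL_SOCKET_IFNAME"] = interface
    environ["NCCL_DEBUG"] = "INFO"
    return interface
```

The resolver is passed in as a parameter so the retry logic can be exercised without a real cluster.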


#### Using MWMS

To use MWMS we start by creating a strategy object as shown below. Note that here we explicitly set the communication backend to `AUTO`, which means that TF2 will decide which backend to use. You can also select a specific backend manually: `NCCL` and custom `RING` backends are available for GPU devices.

```python
strategy = tf.distribute.MultiWorkerMirroredStrategy(
    communication_options=tf.distribute.experimental.CommunicationOptions(
        implementation=tf.distribute.experimental.CollectiveCommunication.AUTO
    )
)
```

Once the strategy is initiated, you can confirm your cluster configuration by checking the property `strategy.num_replicas_in_sync`, which returns your world size. It should match the number of GPUs per node multiplied by the number of nodes.

In this example we are using the Keras API, which fully supports MWMS and simplifies our training script. For instance, to create model copies on all workers you just need to build your Keras model within `strategy.scope()` like below:

```python
    with strategy.scope():
        multi_worker_model = build_and_compile_cnn_model()
```

MWMS also automatically shards your dataset based on the world size. You only need to set up the proper global batch size like below. Note that auto sharding can be turned off if some custom sharding logic is needed.

```python
    global_batch_size = args.batch_size_per_device * _get_world_size()
    multi_worker_dataset = mnist_dataset(global_batch_size)
```
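The `_get_world_size()` helper is defined in the training script and not shown above. One plausible way to derive the world size, sketched under the assumption that it is computed from the SageMaker environment variables `SM_HOSTS` and `SM_NUM_GPUS`, is shown below. The function body and the example environment are illustrative and may differ from the actual script.

```python
import json
import os


def get_world_size(environ=os.environ):
    """Number of training processes: nodes multiplied by GPUs per node."""
    num_hosts = len(json.loads(environ["SM_HOSTS"]))
    gpus_per_host = int(environ["SM_NUM_GPUS"])
    # Assumption: a CPU-only node still contributes one replica.
    return num_hosts * max(gpus_per_host, 1)


# Hypothetical environment of two single-GPU nodes.
example_env = {"SM_HOSTS": '["algo-1", "algo-2"]', "SM_NUM_GPUS": "1"}
batch_size_per_device = 16

world_size = get_world_size(example_env)
global_batch_size = batch_size_per_device * world_size
print(world_size, global_batch_size)
```

With the default hyperparameters of this notebook, each of the two replicas receives 16 samples per step out of a global batch of 32.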

The rest of the training script is similar to a single-process Keras training script. As you can see, using MWMS is quite straightforward: TF2 does a good job of abstracting complexity away from developers while still giving them the flexibility to adjust default settings if needed.


#### Running SageMaker job

So far we have discussed the training script. In the source directory you will also see the `mnist_setup.py` script, which downloads and configures the MNIST dataset. Now we are ready to run data parallel training on SageMaker.

In the cell below we define the TF version (2.8), the Python version (3.9), the instance type, and the number of instances. We also pass several training hyperparameters. Since the MNIST dataset is downloaded from the internet as part of our training script, no data is passed to `estimator_ms.fit()`.

```python
from sagemaker.tensorflow import TensorFlow

ps_instance_type = 'ml.p2.xlarge'
ps_instance_count = 2

hyperparameters = {'epochs': 4, 'batch-size-per-device': 16, 'steps-per-epoch': 100}

estimator_ms = TensorFlow(
    source_dir='1_sources',
    entry_point='train_ms.py',
    role=role,
    framework_version='2.8',
    py_version='py39',
    disable_profiler=True,
    debugger_hook_config=False,
    hyperparameters=hyperparameters,
    instance_count=ps_instance_count,
    instance_type=ps_instance_type,
)

estimator_ms.fit()
```

The training job should complete within 10-12 minutes with default settings. Feel free to experiment with the number of nodes in the cluster and with instance types, and observe the changes in `TF_CONFIG`, training speed, and convergence.

## TensorFlow Horovod Training

Horovod is an open-source framework for data parallel training which supports TF1, TF2, PyTorch, and MXNet. Given its popularity, Amazon SageMaker provides native support for Horovod, which makes using Horovod even simpler than using native TF2 strategies.

We will reuse the same MNIST problem and the `Keras` API.

### Configuring Horovod cluster

Unlike with MWMS, we don't have to configure and set up the training cluster in the training script, since Horovod is supported by SageMaker. The Horovod cluster is configured at the level of the `TensorFlow` estimator API via the `distribution` parameter like below:

```python
distribution = {"mpi": {"enabled": True, "custom_mpi_options": "-verbose --NCCL_DEBUG=INFO", "processes_per_host": 1}}
```
Note the `processes_per_host` parameter, which should match the number of GPUs on the chosen instance type. You can also set `custom_mpi_options` as needed; SageMaker passes them to the `mpirun` utility. See the list of supported MPI options [here](https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php).
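To keep `processes_per_host` in step with the instance type, the configuration can be generated instead of typed by hand. The sketch below is one way to do this; the helper `build_mpi_distribution` and the small lookup table are assumptions added for illustration and are not part of the SageMaker SDK.

```python
# GPU counts for a few instance types; extend this table for other types.
GPUS_PER_INSTANCE = {
    "ml.p2.xlarge": 1,
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
}


def build_mpi_distribution(instance_type, custom_mpi_options="-verbose --NCCL_DEBUG=INFO"):
    """Build the `distribution` dict with one MPI process per GPU."""
    return {
        "mpi": {
            "enabled": True,
            "custom_mpi_options": custom_mpi_options,
            "processes_per_host": GPUS_PER_INSTANCE[instance_type],
        }
    }


distribution = build_mpi_distribution("ml.p2.xlarge")
print(distribution)
```

An unknown instance type raises a `KeyError`, which is preferable to silently launching the wrong number of processes per host.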

### Developing training script
You can find the full training script in `chapter6/1_sources/train_hvd.py`. We start by initiating Horovod in the training script via the `_initiate_hvd()` method. We need to associate Horovod training processes with the available GPU devices (one device per process).

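```python
import horovod.tensorflow.keras as hvd  # alias assumed from the hvd.callbacks usage below
import tensorflow as tf


def _initiate_hvd():
    # Horovod: initialize Horovod.
    hvd.init()

    # Horovod: pin GPU to be used to process local rank (one GPU per process)
    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Allocate GPU memory on demand instead of reserving it all upfront.
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")
```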
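The pinning above relies on the difference between a process's global rank (unique across the cluster) and its local rank (unique within one host). The sketch below enumerates both for a hypothetical cluster. It assumes ranks are assigned to hosts in contiguous blocks; the actual placement depends on how `mpirun` maps processes, so treat the formula as an illustration only.

```python
def enumerate_ranks(num_hosts, processes_per_host):
    """List (global_rank, host_index, local_rank) for every process."""
    return [
        (host * processes_per_host + local_rank, host, local_rank)
        for host in range(num_hosts)
        for local_rank in range(processes_per_host)
    ]


# Hypothetical cluster: 2 hosts with 4 GPUs each.
for global_rank, host, local_rank in enumerate_ranks(2, 4):
    print(f"rank={global_rank} host={host} pinned to GPU index {local_rank}")
```

With the default single-GPU instances used in this notebook, every process has local rank 0 and therefore uses the only GPU on its host.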

Next we need to shard our dataset based on the world size, so that each process gets a slice of data based on its global rank. For this we use the `shard` method of `tf.data.Dataset`. Note that we obtain the world size and the global rank of the current training process from the Horovod functions `size()` and `rank()`.

```python
train_dataset = train_dataset.shard(hvd.size(), hvd.rank())
```
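To make the sharding semantics concrete, the pure-Python sketch below mimics what `shard(num_shards, index)` selects: every `num_shards`-th element, starting at position `index`. The list and the fixed world size stand in for the real dataset and for `hvd.size()`; they are illustrative values only.

```python
def shard(elements, num_shards, index):
    """Keep every `num_shards`-th element, starting at position `index`."""
    return elements[index::num_shards]


samples = list(range(10))
world_size = 2  # stands in for hvd.size()
shards = [shard(samples, world_size, rank) for rank in range(world_size)]

for rank, shard_elements in enumerate(shards):
    print(f"rank {rank} sees {shard_elements}")
```

The shards are disjoint and together cover the whole dataset, so no sample is processed twice within one pass.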

Next we need to use the Horovod `DistributedOptimizer` wrapper to enable distributed gradient updates. Note that we are wrapping an instance of a native TF2 optimizer, and that its learning rate is scaled by the world size.

```python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)
```

Lastly, we use two special Keras callbacks:
- `hvd.callbacks.BroadcastGlobalVariablesCallback(0)` to distribute initial variables from rank=0 process to other training processes in the cluster.
- `hvd.callbacks.MetricAverageCallback()` to calculate global average of metrics across all training processes.

These callbacks are then passed to the `model.fit()` method like below:
```python
    hvd_model.fit(
        shareded_by_rank_dataset,
        epochs=args.epochs,
        steps_per_epoch=args.steps_per_epoch // hvd.size(),
        callbacks=callbacks,
    )
```
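Two quantities in the snippets above depend on the world size: the learning rate is multiplied by it, and the number of steps per epoch is divided by it. The sketch below works through this arithmetic for the default settings of this notebook. The helper name `horovod_schedule` is hypothetical and exists only for this illustration.

```python
def horovod_schedule(base_lr, steps_per_epoch, world_size):
    """Return (scaled learning rate, steps each worker runs per epoch)."""
    return base_lr * world_size, steps_per_epoch // world_size


scaled_lr, steps_per_worker = horovod_schedule(
    base_lr=0.001, steps_per_epoch=100, world_size=2
)
print(scaled_lr, steps_per_worker)
```

Because of the integer division, a `steps_per_epoch` value that is not a multiple of the world size is rounded down.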

These are the minimal additions to your training script that are needed to use Horovod.

### Running SageMaker job

The SageMaker training job configuration is similar to the MWMS example, with the exception of the `distribution` parameter which we discussed above.

```python
from sagemaker.tensorflow import TensorFlow

ps_instance_type = 'ml.p2.xlarge'
ps_instance_count = 2

distribution = {"mpi": {"enabled": True, "custom_mpi_options": "-verbose --NCCL_DEBUG=INFO", "processes_per_host": 1}}
hyperparameters = {'epochs': 4, 'batch-size-per-device': 16, 'steps-per-epoch': 100}

estimator_hvd = TensorFlow(
    source_dir='1_sources',
    entry_point='train_hvd.py',
    role=role,
    framework_version='2.8',
    py_version='py39',
    disable_profiler=True,
    debugger_hook_config=False,
    hyperparameters=hyperparameters,
    # model_dir="/opt/ml/model",
    instance_count=ps_instance_count,
    instance_type=ps_instance_type,
    distribution=distribution,
)

estimator_hvd.fit()
```

## Summary

We implemented minimal viable examples of data parallel training jobs using the TensorFlow2 `MultiWorkerMirroredStrategy` and Horovod with TensorFlow2. You should now have some practical experience in developing baseline training jobs. Both Allreduce implementations offer more knobs and capabilities, which we encourage you to explore and try on your real-life use cases.