# Chapter 12: Distributing TensorFlow Across Devices and Servers

This notebook is my solution to exercise 10 of chapter 12. It contains three distributed models. Before running each model, restart the kernel and rerun the code in the **Installation** section.

## Exercise 10

Train a DNN using between-graph replication and data parallelism with asynchronous updates, timing how long it takes to reach a satisfying performance. Next, try again using synchronous updates. Do synchronous updates produce a better model? Does it train faster? Split the DNN vertically and place each vertical slice on a different device, and train the model again. Is training any faster? Is performance any different?

## Solution

### Installation

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


In [3]:
!pip3 install --upgrade tensorflow-gpu

Collecting tensorflow-gpu
  Downloading https://files.pythonhosted.org/packages/7b/b1/0ad4ae02e17ddd62109cd54c291e311c4b5fd09b4d0678d3d6ce4159b0f0/tensorflow_gpu-1.13.1-cp36-cp36m-manylinux1_x86_64.whl (345.2MB)
Installing collected packages: tensorflow-gpu
Successfully installed tensorflow-gpu-1.13.1


### Asynchronous Updates

In [5]:
# Downloading MNIST dataset.

import tensorflow as tf
import numpy as np

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
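
As a quick sanity check on the preprocessing above, the sketch below applies the same flatten, scale, and split steps to a small stand-in array. The `*_fake` names and the 2x2 image size are assumptions made for illustration; only the 60,000-sample count matches the real MNIST training set.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 60,000 tiny 2x2 "images" instead of 28x28,
# so the sketch runs without downloading anything.
n_samples, side = 60000, 2
X_fake = np.full((n_samples, side, side), 255, dtype=np.uint8)
y_fake = np.zeros(n_samples, dtype=np.uint8)

# Same steps as the cell above: flatten, scale to [0, 1], cast the labels.
X_fake = X_fake.astype(np.float32).reshape(-1, side * side) / 255.0
y_fake = y_fake.astype(np.int32)

# The first 5,000 rows become the validation set; the rest stay for training.
X_valid_fake, X_train_fake = X_fake[:5000], X_fake[5000:]
y_valid_fake, y_train_fake = y_fake[:5000], y_fake[5000:]

print(X_valid_fake.shape, X_train_fake.shape)
print(X_train_fake.dtype, y_train_fake.dtype)
```

With the real data the split is the same size, 5,000 validation and 55,000 training samples, but each row has 784 features.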


In [0]:
# Defining the cluster spec for the parallel model.

n_dnns = 3

cluster_spec = tf.train.ClusterSpec({
    'ps': ['127.0.0.1:1000'],
    'worker': ['127.0.0.1:100{}'.format(i) for i in range(1, n_dnns + 1)]
})
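
To make the cluster layout easier to see, here is a minimal plain-Python sketch (no TensorFlow required) that builds the same dictionary and lists each task with its address. The names `cluster_dict` and `tasks` are introduced here for illustration only.

```python
# Plain-Python view of the dictionary handed to tf.train.ClusterSpec above.
n_dnns = 3

cluster_dict = {
    'ps': ['127.0.0.1:1000'],
    'worker': ['127.0.0.1:100{}'.format(i) for i in range(1, n_dnns + 1)],
}

# Enumerate (job, task index, address) triples. The job name and task index
# are the same parts used in device strings such as '/job:ps/task:0'.
tasks = [(job, idx, addr)
         for job, addrs in sorted(cluster_dict.items())
         for idx, addr in enumerate(addrs)]

for job, idx, addr in tasks:
    print('/job:{}/task:{} -> {}'.format(job, idx, addr))
```

This gives one parameter server on port 1000 and three workers on ports 1001 to 1003, all on localhost. On most Unix systems, ports below 1024 are privileged, so binding them typically requires root; a higher base port may be needed in other environments.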

In [0]:
# Wrapping the operations of an individual worker, which trains its own copy
# of the DNN, in a class.

class DNNTask:
  def __init__(self):
    pass

In [0]:
# Defining the graph for the model.

n_inputs = 28 ** 2

tf.reset_default_graph()

with tf.device('/job:ps/task:0/cpu:0'):
  with tf.variable_scope('ps0'):
    X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
    y = tf.placeholder(tf.int32, shape=(None,), name='y')

    input_queues = []
    enqueue_data_ops = []
    close_input_queue_ops = []
    for i in range(n_dnns):
      input_queues.append(
          tf.RandomShuffleQueue(capacity=len(X_train), min_after_dequeue=0,
                                dtypes=[tf.float32, tf.int32],
                                shapes=[(n_inputs), ()], name='input_queue',
                                shared_name='input_queue'))
      enqueue_data_ops.append(input_queues[-1].enqueue_many([X, y]))
      close_input_queue_ops.append(input_queues[-1].close())