# Chapter 12: Distributing TensorFlow Across Devices and Servers

Since training a large DNN for a complex task on a single CPU can take days or even weeks, this chapter discusses distributing TensorFlow across multiple devices on the same machine, and then across multiple devices on multiple machines. TensorFlow has built-in support for distributed computing, which makes it well suited to this task.

## Multiple Devices on a Single Machine

You can speed up training a neural network by adding multiple GPUs to your machine. In some cases, it is faster to train a neural network with 8 GPUs on a single machine than with 16 GPUs spread across multiple machines, since network communication can slow down training.

### Installation

Below is code for installing Nvidia's _Compute Unified Device Architecture_ (CUDA) library in Google Colab. TensorFlow uses CUDA to run computations on the GPU when training DNNs.

In [1]:
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
!apt-get update
!apt-get install cuda

--2019-05-22 01:58:58--  https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64
Resolving developer.nvidia.com (developer.nvidia.com)... 192.229.189.146
Connecting to developer.nvidia.com (developer.nvidia.com)|192.229.189.146|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://developer.download.nvidia.com/compute/cuda/9.2/secure/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb?RVHf4zI8qOLFTzZi5hgsQ6WEhCOKKz8X0qzCNAjGsaMWOqdaCrN-U5Y-ayZf6XIqS_JgSx-5TlQxVstV-A_cIfHGkm_NjkD1CD3LdQU5fK3rO-80vHEqP-NSS_Bem2PDvS25yT42v7k6v91g1hJu83L3L13WePWGC4SsRfiyqVkr2_6bqrFyvxRp3B7cCBL3uTMbm191IxoBp0yjhzY [following]
--2019-05-22 01:58:58--  https://developer.download.nvidia.com/compute/cuda/9.2/secure/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb?RVHf4zI8qOLFTzZi5hgsQ6WEhCOKKz8X0qzCNAjGsaMWOqdaCrN-U5Y-ayZf6XIqS_JgSx-5TlQxVstV-A_cIfHGkm_NjkD1CD3LdQU5fK3rO-80vHEqP-

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168


The following code installs the GPU-enabled version of TensorFlow.

In [3]:
!pip3 install --upgrade tensorflow-gpu

Requirement already up-to-date: tensorflow-gpu in /usr/local/lib/python3.6/dist-packages (1.13.1)


### Managing the GPU RAM

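By default, TensorFlow grabs all the available RAM on the GPUs the first time you run a graph, so a second TensorFlow program started afterwards cannot use them. One option is to run each process on different GPU cards by setting the `CUDA_VISIBLE_DEVICES` environment variable, so that each process sees only its own cards: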

```bash
CUDA_VISIBLE_DEVICES=0,1 python3 program1.py
CUDA_VISIBLE_DEVICES=2,3 python3 program2.py
```
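
The same isolation can be scripted from Python. The sketch below is a minimal illustration and not part of the original notebook: `env_for_gpus` is a hypothetical helper name, and the script names are the placeholders used above. It builds a copy of the environment with `CUDA_VISIBLE_DEVICES` set, which could then be handed to `subprocess.Popen` to launch each program.

```python
import os

def env_for_gpus(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU ids.

    Hypothetical helper: the variable has to be set before the child
    process starts, because it is read when CUDA initializes.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env

env1 = env_for_gpus([0, 1])
env2 = env_for_gpus([2, 3])
print(env1["CUDA_VISIBLE_DEVICES"])  # 0,1
print(env2["CUDA_VISIBLE_DEVICES"])  # 2,3

# To launch the programs (not executed here):
# import subprocess
# subprocess.Popen(["python3", "program1.py"], env=env1)
# subprocess.Popen(["python3", "program2.py"], env=env2)
```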

Another option is to tell TensorFlow to only use a fraction of the available memory. Code for doing so is below:

In [0]:
# Example code telling TensorFlow to grab only 40% of each GPU's memory
# so that multiple TensorFlow programs can run.

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config)
session.close()

In [0]:
# Alternatively you can have TensorFlow only grab memory when it needs to.

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
session.close()

### Placing Operations on Devices

The [TensorFlow whitepaper](http://download.tensorflow.org/paper/whitepaper2015.pdf) presents a _dynamic placer_ algorithm that automatically distributes operations across all available devices. This algorithm is internal to Google and is not released in the open source version of TensorFlow. The apparent reason is that, in practice, a small set of placement rules specified by the user can perform as well as or better than dynamic placement.

Until the dynamic placer is made public, the open source version of TensorFlow relies on the _simple placer_.

#### Simple Placer

Whenever you run a graph, if TensorFlow needs to evaluate a node that has not yet been placed on a device, the simple placer assigns it to one using the following rules:

- If a node has already been placed in a previous run of the graph, it is left on that device.

- If the user _pinned_ a node to a device (described below) then the placer places it on that device.

- Otherwise, it defaults to GPU #0, or to the CPU if there is no GPU.
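
The three rules above can be sketched as a small decision function. This is a toy model in plain Python, not TensorFlow's actual placer: the function name `simple_place` and the two dictionaries are assumptions made for illustration.

```python
def simple_place(node, placed, pinned, has_gpu):
    """Toy model of the three rules; not TensorFlow's implementation.

    placed: dict mapping node name -> device chosen in a previous run
    pinned: dict mapping node name -> device requested by the user
    """
    if node in placed:   # rule 1: keep an earlier placement
        return placed[node]
    if node in pinned:   # rule 2: honor a user pin
        return pinned[node]
    # rule 3: default device
    return "/gpu:0" if has_gpu else "/cpu:0"

placed = {"a": "/cpu:0"}
pinned = {"b": "/gpu:1"}
print(simple_place("a", placed, pinned, has_gpu=True))   # /cpu:0
print(simple_place("b", placed, pinned, has_gpu=True))   # /gpu:1
print(simple_place("c", placed, pinned, has_gpu=True))   # /gpu:0
print(simple_place("c", placed, pinned, has_gpu=False))  # /cpu:0
```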

Below is an example of using TensorFlow to _pin_ nodes to a device. In this case, the code pins the variable `a` and the constant `b` to the CPU, while `c` is left unpinned and goes to the default device.

In [6]:
with tf.device('/cpu:0'):
  a = tf.Variable(3.0, name='a')
  b = tf.constant(4.0, name='b')
c = a * b

Instructions for updating:
Colocations handled automatically by placer.


#### Logging Placements

Below is code for logging which device each node is placed on. The code in the book does not work due to [this TensorFlow issue](https://github.com/tensorflow/tensorflow/issues/3047). The cells below use a workaround that captures the C-level log output with the `wurlitzer` package.

In [7]:
!pip install wurlitzer



In [8]:
from wurlitzer import pipes

tf.logging.set_verbosity(tf.logging.INFO)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
with pipes() as (out, err):
  print(sess.run(a.initializer))

print(out.read())

None
a: (VariableV2): /job:localhost/replica:0/task:0/device:CPU:0
a/Assign: (Assign): /job:localhost/replica:0/task:0/device:CPU:0
a/read: (Identity): /job:localhost/replica:0/task:0/device:CPU:0
mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
a/initial_value: (Const): /job:localhost/replica:0/task:0/device:CPU:0
b: (Const): /job:localhost/replica:0/task:0/device:CPU:0



In [9]:
sess.run(c)

12.0

In [0]:
sess.close()

#### Dynamic Placement Function

When you create a device block, you can also define a function which pins the nodes to devices. You can use this to implement more complex pinning algorithms such as pinning across GPUs in a round-robin fashion.

In [11]:
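def variables_on_cpu(op):
  # In TensorFlow 1.x, `tf.Variable` creates an op of type 'VariableV2'
  # (see the placement log above), so check for both type names.
  if op.type in ('Variable', 'VariableV2'):
    return '/cpu:0'
  return '/gpu:0'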

tf.reset_default_graph()

with tf.device(variables_on_cpu):
  a = tf.Variable(3.0)
  b = tf.constant(4.0)
  c = a * b
  
sess = tf.Session()
sess.run(a.initializer)
sess.run(c)

12.0

In [0]:
sess.close()
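
The round-robin idea mentioned at the start of this section can be sketched as another device function. This is a minimal illustration in plain Python: `round_robin_gpus` is a hypothetical helper, and the strings standing in for ops are placeholders. In a TensorFlow 1.x program the returned function would be passed to `tf.device` in the same way as `variables_on_cpu`, assuming the machine has that many GPUs. That step is not run here.

```python
import itertools

def round_robin_gpus(num_gpus):
    """Return a device function that cycles through GPUs, one op at a time.

    Hypothetical helper: the returned function ignores the op it receives
    and hands out '/gpu:0', '/gpu:1', ... in turn.
    """
    devices = itertools.cycle(['/gpu:%d' % i for i in range(num_gpus)])

    def device_fn(op):
        return next(devices)

    return device_fn

place = round_robin_gpus(3)
# Any object can stand in for an op, since device_fn does not inspect it.
assignments = [place(op) for op in ['op_a', 'op_b', 'op_c', 'op_d']]
print(assignments)  # ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:0']
```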

#### Operations and Kernels

For a TensorFlow operation to run on a device, it needs an implementation, or _kernel_, for that device. Many operations have kernels for both CPUs and GPUs. Integer variables, however, do not have a GPU kernel. The following code illustrates this:

In [13]:
tf.reset_default_graph()

with tf.device('/gpu:0'):
  i = tf.Variable(3)

try:
  sess = tf.Session()
  sess.run(i.initializer)
except Exception as ex:
  print(type(ex).__name__)

InvalidArgumentError


In [0]:
sess.close()

#### Soft Placement

To avoid the exception raised above, you can set the `allow_soft_placement` option so that TensorFlow falls back to the CPU when an operation has no GPU kernel.

In [0]:
with tf.device('/gpu:0'):
  i = tf.Variable(3)
  
config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
sess.run(i.initializer)

In [0]:
sess.close()

### Parallel Execution

When TensorFlow evaluates a graph, it first evaluates all of the nodes with no dependencies, i.e. the source nodes. Every other node has a dependency counter, and each time a node is evaluated, the counter of every node that depends on it is decremented. Once a node's counter reaches zero, that node is evaluated. Once all of the nodes TensorFlow needs to evaluate are done, it outputs the result.
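
The dependency-counter scheme can be illustrated with a short single-threaded sketch. It is a toy model in plain Python, not TensorFlow's scheduler: the function name `evaluation_order` and the example graph are assumptions made for illustration.

```python
from collections import deque

def evaluation_order(deps):
    """Return one valid evaluation order for a graph.

    deps maps each node name to the list of nodes it depends on.
    """
    # Each node starts with a counter equal to its number of dependencies.
    counters = {node: len(inputs) for node, inputs in deps.items()}
    dependents = {node: [] for node in deps}
    for node, inputs in deps.items():
        for inp in inputs:
            dependents[inp].append(node)

    # Source nodes (counter zero) are ready immediately.
    ready = deque(node for node, count in counters.items() if count == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        # Evaluating a node decrements the counter of each dependent node.
        for child in dependents[node]:
            counters[child] -= 1
            if counters[child] == 0:
                ready.append(child)
    return order

graph = {"a": [], "b": [], "c": ["a", "b"], "d": ["c", "a"]}
print(evaluation_order(graph))  # ['a', 'b', 'c', 'd']
```

In this model `a` and `b` are both ready at the start, which is the situation where TensorFlow can evaluate nodes in parallel.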

For nodes evaluated on the CPU, the evaluations are dispatched into a queue in a thread pool called the _inter-op thread pool_. If the CPU has multiple cores, then the operations are executed in parallel. If the operations themselves have multithreaded kernels, then these kernels split their task into sub-operations which are placed in a queue in another thread pool called the _intra-op thread pool_.

On the GPU, operations in the queue are evaluated sequentially. Operations with multithreaded kernels still run in parallel internally; these kernels are implemented by CUDA, cuDNN, and other GPU libraries that TensorFlow depends on.

### Control Dependencies

Sometimes, we do not want to evaluate nodes as soon as their dependency counter reaches zero. These nodes may take up a lot of compute resources to evaluate, and we may not need their values until later. Alternatively, some nodes rely on a lot of data that is not local to the machine, so it may make more sense to evaluate them sequentially instead of in parallel.

Below is an example of adding _control dependencies_ in a TensorFlow graph, i.e. nodes which need to wait on the evaluation of other nodes even if they do not directly depend on them.

In [0]:
tf.reset_default_graph()

a = tf.constant(1.0)
b = a + 2.0

with tf.control_dependencies([a, b]):
  x = tf.constant(3.0)
  y = tf.constant(4.0)
  
z = x + y

Here, the evaluation of `z` depends on the evaluation of `a` and `b` even though `z`'s value does not depend on `a` or `b`. Since `b` depends on `a`, you need only list `b` as a control dependency, but sometimes it is better to be explicit.

## Distributing Devices Across Multiple Servers

In order to run a graph across multiple servers, you need to define a _cluster_, i.e. a group of TensorFlow servers called _tasks_, typically spread across several machines. Each task belongs to a _job_, i.e. a named group of tasks that perform a common role.

The following code defines a _cluster specification_ with two jobs, `ps` and `worker`. The tasks in the `ps` job are _parameter servers_, which store the model parameters, whereas the `worker` tasks perform the computations. In this example, all five tasks run on the local machine (`127.0.0.1`).

In [0]:
cluster_spec = tf.train.ClusterSpec({
    'ps': [
        '127.0.0.1:2221',
        '127.0.0.1:2222',
    ],
    'worker': [
        '127.0.0.1:2223',
        '127.0.0.1:2224',
        '127.0.0.1:2225',
    ],
})

The following code instantiates TensorFlow `Server` objects by passing each one the cluster spec along with the job name and task index that identify it.

In [0]:
ps0 = tf.train.Server(cluster_spec, job_name='ps', task_index=0)
ps1 = tf.train.Server(cluster_spec, job_name='ps', task_index=1)
worker0 = tf.train.Server(cluster_spec, job_name='worker', task_index=0)
worker1 = tf.train.Server(cluster_spec, job_name='worker', task_index=1)
worker2 = tf.train.Server(cluster_spec, job_name='worker', task_index=2)

Typically you run one task per machine, but you can run multiple tasks per machine as long as you ensure that they don't all try to use all of the RAM on each GPU.
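
To make the job and task naming concrete, the sketch below lists the device prefix and address of every task in the cluster. It is an illustration in plain Python: the dictionary mirrors the cluster specification above, and `task_table` is a hypothetical helper, not a TensorFlow function.

```python
cluster = {
    'ps': ['127.0.0.1:2221', '127.0.0.1:2222'],
    'worker': ['127.0.0.1:2223', '127.0.0.1:2224', '127.0.0.1:2225'],
}

def task_table(cluster):
    """Map each '/job:<name>/task:<index>' prefix to its server address."""
    table = {}
    for job, addresses in cluster.items():
        for index, address in enumerate(addresses):
            table['/job:%s/task:%d' % (job, index)] = address
    return table

for name, address in sorted(task_table(cluster).items()):
    print(name, '->', 'grpc://' + address)
```

The task index is simply the position of the address within its job's list, which is why `worker` task 1 is the server at `127.0.0.1:2224`.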

If you want the process to do nothing other than run the TensorFlow server, you can block the main thread by using the `join()` method:

In [0]:
# This blocks the main thread until the server shuts down.
# `server` stands for any `tf.train.Server` instance, such as `ps0` above.
server.join()

### Opening a Session

Once all of the tasks are up and running, you can open a session on any of the servers from a client on any machine using the following code:

In [0]:
a = tf.constant(1.0)
b = a + 2
c = b * 3

with tf.Session('grpc://127.0.0.1:2223') as sess:
  print(c.eval())

The code creates a simple graph, then opens a session on the server listening at `127.0.0.1:2223` (task 0 of the `worker` job) and evaluates `c`. The master places each operation on the appropriate device. If an operation is not pinned to a particular device, the master places it on the default device of its own machine.

### The Master and Worker Services

The client uses _gRPC_ to communicate with the server. gRPC is a protocol that uses HTTP/2 to open a lasting connection for bidirectional communication, and it exchanges data using _protocol buffers_.

Every TensorFlow server provides two services: a _master service_ and a _worker service_. The master allows clients to open sessions and run graphs whereas the worker service actually performs computations. This architecture allows a server to open multiple sessions from one or more clients.

### Pinning Operations Across Tasks

Below is an example of using a device block to pin an operation to a particular task and to a particular device on that task.

In [21]:
tf.reset_default_graph()

# This block pins `a` to the CPU of task 0 in the `ps` job.
with tf.device('/job:ps/task:0/cpu:0'):
  a = tf.constant(1.0)
  
# This block pins `b` to the first GPU of task 1 in the `worker` job.
with tf.device('/job:worker/task:1/gpu:0'):
  b = a + 2

c = a + b

with tf.Session('grpc://127.0.0.1:2225') as sess:
  print(c.eval())

4.0


### Sharding Variables Across Multiple Parameter Servers

It is common to have a _parameter server_ job that stores the parameters while training a complex model. Some models, like DNNs, can have thousands or even millions of parameters. To avoid saturating the network connection of a single parameter server, it is common to spread the parameters across multiple servers.

Since manually pinning every variable to a different task can be tedious, TensorFlow provides the `tf.train.replica_device_setter()` function, which distributes variables across the `ps` tasks in a round-robin fashion. Below is an example:

In [0]:
tf.reset_default_graph()

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
  v1 = tf.Variable(1.0) # pinned to /job:ps/task:0
  v2 = tf.Variable(2.0) # pinned to /job:ps/task:1
  v3 = tf.Variable(3.0) # pinned to /job:ps/task:0
  v4 = tf.Variable(4.0) # pinned to /job:ps/task:1
  v5 = tf.Variable(5.0) # pinned to /job:ps/task:0

Alternatively, you can pass the cluster spec through the `cluster` argument, and TensorFlow will compute the number of tasks in the `ps` job automatically.

If you create operations other than variables, they are pinned to `/job:worker` by default, which resolves to the first device of the first worker task. You can pin them elsewhere using nested device blocks. Below is an example of a graph pinned to multiple tasks and multiple devices:

In [50]:
tf.reset_default_graph()

with tf.device(tf.train.replica_device_setter(ps_tasks=2, ps_device='/job:ps',
                                              worker_device='/job:worker')):
  v1 = tf.Variable(1.0) # pinned to /job:ps/task:0
  v2 = tf.Variable(2.0) # pinned to /job:ps/task:1
  v3 = tf.Variable(3.0) # pinned to /job:ps/task:0
  
  s = v1 + v2 # pinned to /job:worker (defaults to task 0 and its default device)
  
  with tf.device('/gpu:0'):
    p1 = 2 * s # pinned to /job:worker/task:0/gpu:0
    
    with tf.device('/task:1'):
      p2 = 3 * s # pinned to /job:worker/task:1/gpu:0 (inherits /gpu:0 from the enclosing block)
      
with tf.Session('grpc://127.0.0.1:2221') as sess:
  v1.initializer.run()
  v2.initializer.run()
  print(s.eval())
  print(p1.eval())
  print(p2.eval())

3.0
6.0
9.0
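
Nested device blocks combine their partial device names. The sketch below is a simplified model of that merging in plain Python, written under the assumption that a field set by an inner block replaces the same field from an outer block while all other fields are inherited. `merge_device_specs` is a hypothetical helper, not TensorFlow's own code.

```python
def merge_device_specs(*specs):
    """Combine partial device strings, listed from outermost to innermost.

    Toy model: each spec is split into fields (job, task, device), and a
    field set by a later spec replaces the same field from an earlier one.
    """
    fields = {}
    for spec in specs:
        for part in spec.strip('/').split('/'):
            if not part:
                continue
            key, value = part.split(':', 1)
            if key in ('cpu', 'gpu'):
                # '/gpu:0' and '/cpu:0' both fill the device field.
                key, value = 'device', '%s:%s' % (key, value)
            fields[key] = value
    merged = ''
    if 'job' in fields:
        merged += '/job:' + fields['job']
    if 'task' in fields:
        merged += '/task:' + fields['task']
    if 'device' in fields:
        merged += '/' + fields['device']
    return merged

print(merge_device_specs('/job:worker', '/gpu:0'))             # /job:worker/gpu:0
print(merge_device_specs('/job:worker', '/gpu:0', '/task:1'))  # /job:worker/task:1/gpu:0
```

Fields that remain unset, such as the task in the first call, are filled in with defaults when the graph is run.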


### Sharing State Across Sessions Using Resource Containers

When using a plain _local session_, variable values are stored in the session object, so when the session ends the values are deleted. Moreover, multiple local sessions cannot share any state, even if they run the same graph.

When you are using _distributed sessions_, variable state is managed by _resource containers_ that are located on the cluster and persist across sessions. An example of this is given by the code below:

In [0]:
tf.reset_default_graph()

x = tf.Variable(0.0, name='x')
increment_x = tf.assign(x, x + 1)

In [48]:
with tf.Session('grpc://127.0.0.1:2222') as sess:
  sess.run(x.initializer)
  sess.run(increment_x)
  print(x.eval())

1.0


In [49]:
with tf.Session('grpc://127.0.0.1:2222') as sess:
  sess.run(increment_x)
  print(x.eval())

2.0


While this feature can be convenient, you have to be careful not to reuse the same variable names by accident, since unrelated computations would then share state. One way to avoid this is to use a variable scope with a unique name for each computation.

In [0]:
tf.reset_default_graph()

with tf.variable_scope('problem_1'):
  x1 = tf.Variable(0.0, name='x')
  increment_x1 = tf.assign(x1, x1 + 1)
  
with tf.variable_scope('problem_2'):
  x2 = tf.Variable(0.0, name='x')
  increment_x2 = tf.assign(x2, x2 + 1)

You can even use resource containers to store variables across different graphs.
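
The effect of variable names on shared state can be illustrated with a toy model. The class below is a hypothetical stand-in written in plain Python, assuming only that server-side state is looked up by variable name. Real resource containers are managed by the TensorFlow servers.

```python
class ToyContainer:
    """Stand-in for server-side state that outlives any one session."""

    def __init__(self):
        self.values = {}

    def increment(self, name):
        # State is looked up by variable name only.
        self.values[name] = self.values.get(name, 0.0) + 1.0
        return self.values[name]

container = ToyContainer()

# Two unrelated computations that both name their variable 'x' collide.
print(container.increment('x'))  # 1.0
print(container.increment('x'))  # 2.0

# Scoped names keep their state separate.
print(container.increment('problem_1/x'))  # 1.0
print(container.increment('problem_2/x'))  # 1.0
```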