<a href="https://colab.research.google.com/github/GiuliaLanzillotta/TensorflowEssentials/blob/master/Distributed_computation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Distributing TensorFlow computation

> [Tensorflow] gives you full control over how to split (or replicate) your computation graph across devices and servers, and it lets you parallelize and synchronize operations in flexible ways so you can choose between all sorts of parallelization approaches.

## Parallelizing simple graphs across several GPUs on a single machine

To run TensorFlow on multiple GPU cards, you first need to make sure that your cards have a sufficient NVIDIA Compute Capability (the minimum required depends on the TensorFlow version).
You must then **download and install the appropriate version of the CUDA and cuDNN libraries**, and set a few environment variables so that TensorFlow knows where to find them.

> #### What is CUDA?  
Short for "Compute Unified Device Architecture", it's both a parallel computing platform and a library created by NVIDIA that allows developers to use CUDA-enabled GPUs for general-purpose processing. The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements for the execution of compute kernels.

> #### What about cuDNN?
Short for "CUDA Deep Neural Network library", it is a GPU-accelerated library of primitives for DNNs created by NVIDIA. It provides optimized implementations of common DNN computations such as activation layers, normalization, forward and backward convolutions, and pooling.

Let's look at what the Google server we're running on has to offer.

In [1]:
!nvidia-smi

Thu Apr 23 13:30:50 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [1]:
%tensorflow_version 1.x
import tensorflow as tf

TensorFlow 1.x selected.


In [5]:
sess = tf.Session()
sess.list_devices()

[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 11864881681466060498),
 _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 1397507663453625933),
 _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 7947584449248635740),
 _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 11330115994, 10096098157676641914)]

> Side note: to test the following code snippets I am going to use a Python script that builds and trains a DNN.

### Managing GPU RAM 


By default, each TensorFlow process tries to occupy all the RAM of every GPU it can see. To prevent this, we can restrict each process to its own set of GPU cards.<br>
We can do so by setting the ```CUDA_VISIBLE_DEVICES``` environment variable as follows:

```CUDA_VISIBLE_DEVICES=0,1 python3 program_1.py```

```CUDA_VISIBLE_DEVICES=3,2 python3 program_2.py```
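
The same restriction can be applied from Python when launching worker processes. The following is a minimal sketch rather than part of the original notebook: `env_with_gpus` is a hypothetical helper, and the child process only echoes the variable instead of running a real TensorFlow program.

```python
import os
import subprocess
import sys


def env_with_gpus(gpu_ids):
    """Return a copy of the current environment that exposes only `gpu_ids`.

    Hypothetical helper: the order of the ids matters, because the first
    listed card becomes "/gpu:0" inside the child process, the second
    "/gpu:1", and so on.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


# Stand-in for `python3 program_2.py`: the child just reports what it sees.
child_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env_with_gpus([3, 2]),
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
seen_by_child = result.stdout.strip()
print(seen_by_child)
```

The variable has to be in place before the process initializes CUDA, which is why it is set in the environment handed to the child rather than changed later inside the running program.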


Another option is to tell TensorFlow to grab only a fraction of the memory. 
The following code does the job:

```
# At the beginning of the script:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
# And when you create a session:
session = tf.Session(config=config)
```
I added the above code to the ```dnn.py``` script, and in the following cell I run it twice in parallel (running the same script twice doesn't really make sense, but it makes the point).


In [23]:
!python3 dnn.py & python3 dnn.py  & nvidia-smi

Thu Apr 23 13:57:14 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    58W / 149W |     69MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

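Both programs were launched in parallel on the same card, so each of them can use no more than half of the GPU memory, which is what we wanted to obtain. Note that `nvidia-smi` was started at the same moment as the two scripts, so the memory usage it reports probably predates their allocations.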
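
The arithmetic behind the split can be sketched in a few lines. This is an illustration only: both function names are hypothetical, and the check ignores the memory taken by the CUDA context itself, so in practice it is probably wise to leave some headroom rather than sum to exactly 1.

```python
def per_process_budget_mib(total_mib, fraction):
    """Upper bound on the GPU memory (in MiB) that one process may claim."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return total_mib * fraction


def fits_on_one_gpu(fractions):
    """True if processes with these memory fractions can share one card.

    Simplification: overhead outside TensorFlow's own allocation is ignored.
    """
    return sum(fractions) <= 1.0


total_mib = 11441  # total memory of the Tesla K80 reported by nvidia-smi
budget = per_process_budget_mib(total_mib, 0.5)

print(budget)                        # 5720.5
print(fits_on_one_gpu([0.5, 0.5]))   # True
print(fits_on_one_gpu([0.5, 0.6]))   # False
```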

### Placing graph nodes on the right devices


Section 3.2.1 of the [TensorFlow whitepaper](http://download.tensorflow.org/paper/whitepaper2015.pdf) describes a beautiful placement algorithm (the *dynamic placement algorithm*). It was not included in the open-source release, however, reportedly because it did not result in significant efficiency improvements. <br>
TensorFlow relies instead on another placement algorithm, called the **simple placer**, which basically leaves the placement of operations to the user.

> #### How does the simple placer work?
Basically, by default all your nodes will be placed on GPU #0 if you have one; otherwise they'll be placed on CPU #0. However, you can explicitly set the location of some nodes to be different.
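
The default rule can be written down as a toy function. This is a sketch of the behaviour described above, not TensorFlow's actual placer: `simple_place` and the `requests` dictionary are hypothetical names introduced for illustration.

```python
def simple_place(requested_device, gpu_available):
    """Toy version of the default placement rule.

    `requested_device` is the device pinned by the user (for example with a
    `tf.device(...)` block), or None when the node was left unpinned.
    """
    if requested_device is not None:
        return requested_device  # an explicit choice by the user wins
    return "/gpu:0" if gpu_available else "/cpu:0"


# A small graph: two variables pinned to the CPU and one unpinned product.
requests = {"a": "/cpu:0", "b": "/cpu:0", "c": None}

placement = {name: simple_place(device, gpu_available=True)
             for name, device in requests.items()}
print(placement)
```

With a GPU available, the two pinned nodes stay on the CPU and the unpinned one goes to GPU #0; on a machine without a GPU, the same call with `gpu_available=False` sends it to CPU #0.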

With the following lines of code we are placing some nodes on the CPU. 

In [0]:
with tf.device("/cpu:0"): #selecting the cpu
  # creating 2 nodes on the cpu
  a = tf.Variable(3.0)
  b = tf.Variable(4.0)

c = a*b # note that we're back in the default settings

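We should have 2 nodes on the CPU and one on the GPU. Let's check whether this is actually the case by creating a session that logs device placement (the placement of individual nodes is only logged once they are actually run).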

In [25]:
config = tf.ConfigProto()
config.log_device_placement = True
sess = tf.Session(config=config)

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7



For a TensorFlow operation to run on a device, it needs to have an implementation for that device (called a **kernel**).<br>
With the following code we try to place an integer variable on the GPU:

In [33]:
with tf.device("/gpu:0"):
  i = tf.Variable(10)

init = tf.global_variables_initializer()
init.run(session = sess)

InvalidArgumentError: ignored

> " Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Colocation Debug Info:" <br>

As you might have guessed, TensorFlow is complaining that there is no GPU kernel for integer variables.

Most of the fundamental operations have both a CPU and a GPU kernel, but you can still run into an exception like the one above. With the following code we tell TensorFlow to fall back to the CPU automatically whenever the GPU kernel is not available.


In [34]:
with tf.device("/gpu:0"):
  i = tf.Variable(10)

init = tf.global_variables_initializer()

config = tf.ConfigProto() 
config.allow_soft_placement = True 
config.log_device_placement = True
sess = tf.Session(config=config)

init.run(session = sess)

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7



No exception! 
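
The difference between the two runs can be summarized with a toy model of kernel lookup. This is only a sketch under simplifying assumptions, not TensorFlow's implementation: `resolve_device`, `NoKernelError`, and the set of available kernels are all hypothetical.

```python
class NoKernelError(Exception):
    """Raised when a node is pinned to a device that has no kernel for it."""


def resolve_device(requested, devices_with_kernel, allow_soft_placement):
    """Toy model of kernel lookup with an optional fallback to the CPU."""
    if requested in devices_with_kernel:
        return requested
    if allow_soft_placement and "/cpu:0" in devices_with_kernel:
        return "/cpu:0"  # fall back instead of failing
    raise NoKernelError("no supported kernel for %s" % requested)


# Assumption for illustration: an integer variable only has a CPU kernel.
int_variable_kernels = {"/cpu:0"}

# Soft placement on: the node quietly ends up on the CPU.
soft_outcome = resolve_device("/gpu:0", int_variable_kernels,
                              allow_soft_placement=True)
print(soft_outcome)

# Soft placement off: the explicit request cannot be satisfied.
try:
    resolve_device("/gpu:0", int_variable_kernels,
                   allow_soft_placement=False)
    strict_outcome = "placed"
except NoKernelError:
    strict_outcome = "error"
print(strict_outcome)
```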

