<span style="font-size:300%">Distributed TensorFlow experiments</span>

# Experiment 1: Hyperparameter tuning

**Idea:** perform hyperparameter tuning with grid search and random search by creating a `Tuner` that distributes the parameter settings across multiple GPUs.
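
As a minimal sketch of the grid-search half of this idea, the Cartesian product of the candidate values gives one configuration per run, and the position in that list can serve as the run ID. The helper name `expand_grid` and the `learning_rate` entry are illustrative assumptions; only `batch_size` is actually tuned in this notebook.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the candidate values in `grid`."""
    names = sorted(grid)  # fixed ordering so run IDs are reproducible
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


# Hypothetical grid: learning_rate is an assumed second hyperparameter.
grid = {"batch_size": [10, 100], "learning_rate": [0.001, 0.01]}

# Pair every configuration with a run ID, as train.py expects one per run.
runs = list(enumerate(expand_grid(grid)))
```

Each `(run_id, config)` pair in `runs` could then be handed to whichever mechanism launches `train.py`.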

## NOTES
- The serial runs are done in the notebook. The parallel runs are in bash scripts. See below.

## Data: MNIST

MNIST is a dataset of 28×28 grayscale images of handwritten digits (0 to 9), split into 60,000 training and 10,000 test examples.

## The network: A CNN for MNIST

From https://www.tensorflow.org/tutorials/layers

### Setup

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf

In [2]:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

## Test: Non-distributed

In [3]:
%%time
%run train.py $100 $9

Run ID: 100. batch_size = 99
-----------------
Extracting MNIST-data/train-images-idx3-ubyte.gz
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/mnist_convnet_model100', '_tf_random_seed': 1, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/mnist_convnet_model100/model.ckpt.
INFO:tensorflow:probabilities = [[ 0.099624    0.11363485  0.0787406   0.10657861  0.09933377  0.08840916
   0.09505805  0.11094289  0.09738424  0.11029379]
 [ 0.08310173  0.11847066  0.0689197   0.11711223  0.12243547  0.0866297
   0.10407785  0.07772831  0.09622091  0.1253034

INFO:tensorflow:loss = 2.29168, step = 1
INFO:tensorflow:global_step/sec: 190.346
INFO:tensorflow:loss = 2.27357, step = 101 (0.516 sec)
INFO:tensorflow:global_step/sec: 203.355
INFO:tensorflow:loss = 2.26597, step = 201 (0.492 sec)
INFO:tensorflow:global_step/sec: 206.599
INFO:tensorflow:loss = 2.25835, step = 301 (0.484 sec)
INFO:tensorflow:global_step/sec: 204.194
INFO:tensorflow:loss = 2.21681, step = 401 (0.490 sec)
INFO:tensorflow:global_step/sec: 206.997
INFO:tensorflow:loss = 2.18551, step = 501 (0.483 sec)
INFO:tensorflow:global_step/sec: 204.552
INFO:tensorflow:loss = 2.12642, step = 601 (0.489 sec)
INFO:tensorflow:global_step/sec: 205.231
INFO:tensorflow:loss = 2.10986, step = 701 (0.487 sec)
INFO:tensorflow:global_step/sec: 200.9
INFO:tensorflow:loss = 2.04894, step = 801 (0.498 sec)
INFO:tensorflow:global_step/sec: 193.897
INFO:tensorflow:loss = 1.90338, step = 901 (0.516 sec)
INFO:tensorflow:global_step/sec: 204.566
INFO:tensorflow:loss = 1.83081, step = 1001 (0.489 sec)


INFO:tensorflow:global_step/sec: 204.975
INFO:tensorflow:loss = 0.192474, step = 8501 (0.488 sec)
INFO:tensorflow:global_step/sec: 203.882
INFO:tensorflow:loss = 0.163997, step = 8601 (0.491 sec)
INFO:tensorflow:global_step/sec: 205.646
INFO:tensorflow:loss = 0.128122, step = 8701 (0.486 sec)
INFO:tensorflow:global_step/sec: 204.952
INFO:tensorflow:loss = 0.315299, step = 8801 (0.488 sec)
INFO:tensorflow:global_step/sec: 204.945
INFO:tensorflow:loss = 0.205544, step = 8901 (0.488 sec)
INFO:tensorflow:global_step/sec: 204.836
INFO:tensorflow:loss = 0.35766, step = 9001 (0.488 sec)
INFO:tensorflow:global_step/sec: 204.986
INFO:tensorflow:loss = 0.282887, step = 9101 (0.488 sec)
INFO:tensorflow:global_step/sec: 204.949
INFO:tensorflow:loss = 0.340325, step = 9201 (0.488 sec)
INFO:tensorflow:global_step/sec: 203.197
INFO:tensorflow:loss = 0.119882, step = 9301 (0.492 sec)
INFO:tensorflow:global_step/sec: 201.858
INFO:tensorflow:loss = 0.114718, step = 9401 (0.495 sec)
...

INFO:tensorflow:global_step/sec: 194.064
INFO:tensorflow:loss = 0.151958, step = 16801 (0.515 sec)
INFO:tensorflow:global_step/sec: 194.388
INFO:tensorflow:loss = 0.208702, step = 16901 (0.514 sec)
INFO:tensorflow:global_step/sec: 205.85
INFO:tensorflow:loss = 0.259637, step = 17001 (0.486 sec)
INFO:tensorflow:global_step/sec: 200.021
INFO:tensorflow:loss = 0.203653, step = 17101 (0.500 sec)
INFO:tensorflow:global_step/sec: 205.109
INFO:tensorflow:loss = 0.110914, step = 17201 (0.488 sec)
INFO:tensorflow:global_step/sec: 204.058
INFO:tensorflow:loss = 0.0706823, step = 17301 (0.490 sec)
INFO:tensorflow:global_step/sec: 204.824
INFO:tensorflow:loss = 0.100052, step = 17401 (0.488 sec)
INFO:tensorflow:global_step/sec: 191.614
INFO:tensorflow:loss = 0.155132, step = 17501 (0.522 sec)
INFO:tensorflow:global_step/sec: 193.939
INFO:tensorflow:loss = 0.139435, step = 17601 (0.515 sec)
INFO:tensorflow:global_step/sec: 204.242
INFO:tensorflow:loss = 0.0924595, step = 17701 (0.490 sec)
...

## The `Tuner`

### The hyperparameters

In [3]:
batch_sizes = [10,100]

In [1]:
# Add more hyperparameters to tune

### Serial run

In [5]:
def runInSerial(batch_sizes):
    # Train once per batch size, using the list index as the run ID.
    for i, batch_size in enumerate(batch_sizes):
        %run train.py $i $batch_size

In [6]:
%%time
runInSerial(batch_sizes)

Run ID: 0. batch_size = 10
-----------------
Extracting MNIST-data/train-images-idx3-ubyte.gz
Extracting MNIST-data/train-labels-idx1-ubyte.gz
Extracting MNIST-data/t10k-images-idx3-ubyte.gz
Extracting MNIST-data/t10k-labels-idx1-ubyte.gz
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/mnist_convnet_model0', '_tf_random_seed': 1, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/mnist_convnet_model0/model.ckpt.
INFO:tensorflow:probabilities = [[ 0.09935603  0.11293184  0.07989363  0.10783623  0.10231627  0.0900475
   0.09539413  0.10588846  0.10056075  0.10577514]
 [ 0.08564846  0.11941904  0.07019046  0.12096466  0.11881043  0.08506248
   0.10913834  0.08364679  0.09516641  0.11195292]
...

INFO:tensorflow:loss = 0.058332, step = 6401 (0.168 sec)
INFO:tensorflow:global_step/sec: 614.621
INFO:tensorflow:loss = 0.172869, step = 6501 (0.163 sec)
INFO:tensorflow:global_step/sec: 623.836
INFO:tensorflow:loss = 0.72493, step = 6601 (0.160 sec)
INFO:tensorflow:global_step/sec: 625.812
INFO:tensorflow:loss = 0.189664, step = 6701 (0.160 sec)
INFO:tensorflow:global_step/sec: 618.339
INFO:tensorflow:loss = 0.0421353, step = 6801 (0.162 sec)
INFO:tensorflow:global_step/sec: 624.671
INFO:tensorflow:loss = 0.389987, step = 6901 (0.160 sec)
INFO:tensorflow:global_step/sec: 622.849
INFO:tensorflow:loss = 0.129236, step = 7001 (0.161 sec)
INFO:tensorflow:global_step/sec: 616.876
INFO:tensorflow:loss = 1.00261, step = 7101 (0.162 sec)
INFO:tensorflow:global_step/sec: 618.049
INFO:tensorflow:loss = 0.603732, step = 7201 (0.162 sec)
INFO:tensorflow:global_step/sec: 623.661
INFO:tensorflow:loss = 0.726983, step = 7301 (0.160 sec)
INFO:tensorflow:global_step/sec: 620.021
...

INFO:tensorflow:global_step/sec: 632.084
INFO:tensorflow:loss = 0.18252, step = 14801 (0.158 sec)
INFO:tensorflow:global_step/sec: 522.132
INFO:tensorflow:loss = 0.0643593, step = 14901 (0.193 sec)
INFO:tensorflow:global_step/sec: 531.717
INFO:tensorflow:loss = 0.016279, step = 15001 (0.187 sec)
INFO:tensorflow:global_step/sec: 619.613
INFO:tensorflow:loss = 0.0797722, step = 15101 (0.161 sec)
INFO:tensorflow:global_step/sec: 623.409
INFO:tensorflow:loss = 0.463661, step = 15201 (0.160 sec)
INFO:tensorflow:global_step/sec: 623.758
INFO:tensorflow:loss = 0.217529, step = 15301 (0.160 sec)
INFO:tensorflow:global_step/sec: 623.576
INFO:tensorflow:loss = 0.411483, step = 15401 (0.160 sec)
INFO:tensorflow:global_step/sec: 620.761
INFO:tensorflow:loss = 0.20479, step = 15501 (0.161 sec)
INFO:tensorflow:global_step/sec: 614.833
INFO:tensorflow:loss = 0.0793642, step = 15601 (0.163 sec)
INFO:tensorflow:global_step/sec: 623.252
INFO:tensorflow:loss = 0.0746754, step = 15701 (0.160 sec)
...

INFO:tensorflow:loss = 2.32683, step = 1
INFO:tensorflow:global_step/sec: 194.989
INFO:tensorflow:loss = 2.30672, step = 101 (0.504 sec)
INFO:tensorflow:global_step/sec: 201.793
INFO:tensorflow:loss = 2.26699, step = 201 (0.495 sec)
INFO:tensorflow:global_step/sec: 203.984
INFO:tensorflow:loss = 2.25604, step = 301 (0.490 sec)
INFO:tensorflow:global_step/sec: 190.087
INFO:tensorflow:loss = 2.21662, step = 401 (0.526 sec)
INFO:tensorflow:global_step/sec: 181.936
INFO:tensorflow:loss = 2.17118, step = 501 (0.550 sec)
INFO:tensorflow:global_step/sec: 187.235
INFO:tensorflow:loss = 2.1563, step = 601 (0.534 sec)
INFO:tensorflow:global_step/sec: 183.544
INFO:tensorflow:loss = 2.09216, step = 701 (0.545 sec)
INFO:tensorflow:global_step/sec: 184.008
INFO:tensorflow:loss = 2.062, step = 801 (0.543 sec)
INFO:tensorflow:global_step/sec: 187.38
INFO:tensorflow:loss = 1.94109, step = 901 (0.534 sec)
INFO:tensorflow:global_step/sec: 188.86
INFO:tensorflow:loss = 1.8349, step = 1001 (0.529 sec)
...

INFO:tensorflow:global_step/sec: 186.287
INFO:tensorflow:loss = 0.204005, step = 8501 (0.537 sec)
INFO:tensorflow:global_step/sec: 189.48
INFO:tensorflow:loss = 0.166799, step = 8601 (0.528 sec)
INFO:tensorflow:global_step/sec: 190.527
INFO:tensorflow:loss = 0.216963, step = 8701 (0.525 sec)
INFO:tensorflow:global_step/sec: 196.954
INFO:tensorflow:loss = 0.0622956, step = 8801 (0.511 sec)
INFO:tensorflow:global_step/sec: 196.871
INFO:tensorflow:loss = 0.0883304, step = 8901 (0.505 sec)
INFO:tensorflow:global_step/sec: 196.6
INFO:tensorflow:loss = 0.168, step = 9001 (0.509 sec)
INFO:tensorflow:global_step/sec: 196.315
INFO:tensorflow:loss = 0.372599, step = 9101 (0.509 sec)
INFO:tensorflow:global_step/sec: 198.85
INFO:tensorflow:loss = 0.174478, step = 9201 (0.503 sec)
INFO:tensorflow:global_step/sec: 196.55
INFO:tensorflow:loss = 0.229777, step = 9301 (0.509 sec)
INFO:tensorflow:global_step/sec: 198.241
INFO:tensorflow:loss = 0.212717, step = 9401 (0.504 sec)
...

INFO:tensorflow:loss = 0.1362, step = 16801 (0.507 sec)
INFO:tensorflow:global_step/sec: 196.644
INFO:tensorflow:loss = 0.132007, step = 16901 (0.509 sec)
INFO:tensorflow:global_step/sec: 197.202
INFO:tensorflow:loss = 0.120952, step = 17001 (0.507 sec)
INFO:tensorflow:global_step/sec: 196.561
INFO:tensorflow:loss = 0.232074, step = 17101 (0.509 sec)
INFO:tensorflow:global_step/sec: 197.733
INFO:tensorflow:loss = 0.113184, step = 17201 (0.506 sec)
INFO:tensorflow:global_step/sec: 197.051
INFO:tensorflow:loss = 0.150577, step = 17301 (0.508 sec)
INFO:tensorflow:global_step/sec: 198.287
INFO:tensorflow:loss = 0.223771, step = 17401 (0.504 sec)
INFO:tensorflow:global_step/sec: 197.415
INFO:tensorflow:loss = 0.088833, step = 17501 (0.507 sec)
INFO:tensorflow:global_step/sec: 197.861
INFO:tensorflow:loss = 0.203374, step = 17601 (0.505 sec)
INFO:tensorflow:global_step/sec: 196.731
INFO:tensorflow:loss = 0.205345, step = 17701 (0.508 sec)
INFO:tensorflow:global_step/sec: 194.89
...

### Run in parallel

#### TODO
Use `ipyparallel`

#### Get the system's GPUs

In [10]:
from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

In [4]:
get_available_gpus()

['/gpu:0']
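
`CUDA_VISIBLE_DEVICES` takes device indices rather than TensorFlow device names such as `/gpu:0`. A small helper could extract the index, assuming the name ends in `:<index>`. The name `gpu_index` is hypothetical and not part of the notebook. The index TensorFlow reports is relative to the devices visible to the current process, so it may not match the physical GPU number if `CUDA_VISIBLE_DEVICES` is already set.

```python
def gpu_index(device_name):
    """Extract the trailing integer index from a name like '/gpu:0'."""
    return int(device_name.rsplit(":", 1)[1])


# Example with the device list printed above.
gpu_ids = [gpu_index(name) for name in ["/gpu:0"]]
```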

**Idea:**

```
def tuner(batch_sizes, gpus):
    # Pseudocode: this loop should run in parallel, one run per GPU.
    for i, batch_size in enumerate(batch_sizes):
        gpu = gpus[i % len(gpus)]  # assign runs to GPUs round-robin
        %env CUDA_DEVICE_ORDER=PCI_BUS_ID
        %env CUDA_VISIBLE_DEVICES=$gpu
        %run train.py $i $batch_size
```
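
One way this idea might be realised outside of IPython magics is to start each run as a separate process, with one worker thread per GPU pulling runs from a shared queue. This is a sketch under assumptions, not the notebook's implementation. The names `run_on_gpus` and `train_cmd` are hypothetical, GPUs are assumed to be given as integer indices, and `train.py` is assumed to take the run ID and batch size as positional arguments, as in the serial run.

```python
import os
import queue
import subprocess
import sys
import threading


def train_cmd(run_id, batch_size):
    # Same arguments as `%run train.py $i $batch_size` in the serial run.
    return [sys.executable, "train.py", str(run_id), str(batch_size)]


def run_on_gpus(batch_sizes, gpus, make_cmd=train_cmd):
    """Run one process per batch size, at most one at a time per GPU.

    Returns a dict mapping run ID to the exit code of its process.
    """
    jobs = queue.Queue()
    for run_id, batch_size in enumerate(batch_sizes):
        jobs.put((run_id, batch_size))
    exit_codes = {}

    def worker(gpu):
        # Each worker thread owns one GPU and pulls runs until none are left.
        env = dict(os.environ,
                   CUDA_DEVICE_ORDER="PCI_BUS_ID",
                   CUDA_VISIBLE_DEVICES=str(gpu))
        while True:
            try:
                run_id, batch_size = jobs.get_nowait()
            except queue.Empty:
                return
            exit_codes[run_id] = subprocess.call(
                make_cmd(run_id, batch_size), env=env)

    threads = [threading.Thread(target=worker, args=(gpu,)) for gpu in gpus]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return exit_codes
```

Calling `run_on_gpus([10, 100], gpus=[0, 1])` would start two `train.py` processes at once, each seeing a single GPU. Because every run is a fresh process, the environment variable is set before TensorFlow initialises CUDA.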

#### Bash script

*Running the train function*
- 1: GPU number
- 2: RunId
- 3: batch_size

**`runOnGpu.sh`**

```
#!/bin/bash

# Usage: ./runOnGpu.sh <GPU number> <RunId> <batch_size>
time {
    export CUDA_VISIBLE_DEVICES="$1"
    python train.py "$2" "$3"
}
```

**`parallelRun.sh`**

```
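#!/bin/bash

num_gpus=2
echo "$num_gpus"
gpus=($(seq 0 $((num_gpus - 1))))

# One batch size per GPU; the GPU number doubles as the run ID.
batch_sizes=(10 100)

time {
    for gpu in "${gpus[@]}"; do
        echo "$gpu"
        # Arguments: GPU number, run ID, batch size.
        ./runOnGpu.sh "$gpu" "$gpu" "${batch_sizes[$gpu]}" &
    done
    # Wait for all background runs so that `time` covers the training itself.
    wait
}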
```

## Notes and further work

- Grid search is not optimal; implement random search as well.
- Mention advanced hyperparameter tuning methods and software (e.g. Bayesian optimization).
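
A random search could be sketched as follows: instead of enumerating a grid, each run draws its hyperparameters independently. The candidate batch sizes, the learning-rate range, and the name `sample_configs` are assumptions made for illustration and do not come from the experiments above.

```python
import random


def sample_configs(n_runs, seed=0):
    """Draw `n_runs` random hyperparameter settings, one dict per run."""
    rng = random.Random(seed)  # seeded so the search is reproducible
    configs = []
    for run_id in range(n_runs):
        configs.append({
            "run_id": run_id,
            # Hypothetical candidate values for the batch size.
            "batch_size": rng.choice([10, 32, 100]),
            # Log-uniform draw between 1e-4 and 1e-1 (assumed range).
            "learning_rate": 10 ** rng.uniform(-4, -1),
        })
    return configs
```

Sampling the learning rate on a log scale spreads the draws evenly across orders of magnitude, which is a common choice for this kind of parameter.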

# Experiment 2: Distributed computational graphs

**DNF** (did not finish).