[View in Colaboratory](https://colab.research.google.com/github/JozeeLin/google-tensorflow-exercise/blob/master/TensorBoard_%E5%A4%9AGPU%E5%B9%B6%E8%A1%8C%E5%8F%8A%E5%88%86%E5%B8%83%E5%BC%8F%E5%B9%B6%E8%A1%8C.ipynb)

## TensorBoard

TensorBoard is TensorFlow's visualization tool. It displays the summary data recorded during model training, including Scalars, Images, Audio, Graphs, Distributions, Histograms, and Embeddings.

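When training a large deep neural network with TensorFlow, the intermediate computation can be very complex. To understand, debug, and optimize the network we have designed, we can use TensorBoard to inspect visualizations of the data recorded during training.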

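Visualization workflow:
- While the TensorFlow computation graph runs, summarize the various kinds of data and write them to log files.
- Use TensorBoard to read those log files.
- TensorBoard parses the log data and generates a web page of visualizations, so the summaries can be viewed in a browser.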

In [16]:
!git clone https://github.com/mixuala/colab_utils

Cloning into 'colab_utils'...
remote: Counting objects: 219, done.
remote: Total 219 (delta 0), reused 0 (delta 0), pack-reused 219
Receiving objects: 100% (219/219), 60.24 KiB | 2.87 MiB/s, done.
Resolving deltas: 100% (85/85), done.


In [0]:
import os
import colab_utils.tboard

In [0]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
max_steps = 1000
learning_rate = 0.001
dropout = 0.9
data_dir='/tmp/tensorflow/mnist/input_data'
log_dir='/tmp/tensorflow/mnist/logs/mnist_with_summaries'

In [19]:
mnist = input_data.read_data_sets(data_dir, one_hot=True)
sess = tf.InteractiveSession()

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Instructions for updating:
Please use tf.one_hot on tensors.
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


In [0]:
with tf.name_scope('input'):
  x = tf.placeholder(tf.float32,[None,784],name='x-input')
  y_ = tf.placeholder(tf.float32, [None,10],name='y-input')

In [0]:
with tf.name_scope('input_reshape'):
  image_shaped_input = tf.reshape(x,[-1,28,28,1])
  tf.summary.image('input',image_shaped_input,10)

In [0]:
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

In [0]:
def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

In [0]:
def variable_summaries(var):
  with tf.name_scope('summaries'):
    mean = tf.reduce_mean(var)
    tf.summary.scalar('mean', mean)
    with tf.name_scope('stddev'):
      stddev = tf.sqrt(tf.reduce_mean(tf.square(var-mean)))
      
    tf.summary.scalar('stddev', stddev)
    tf.summary.scalar('max', tf.reduce_max(var))
    tf.summary.scalar('min', tf.reduce_min(var))
    tf.summary.histogram('histogram',var)
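
The scalars logged by `variable_summaries` can be checked outside TensorFlow. The following is a minimal pure-Python sketch, assuming a small made-up list of weights (`weights` and `summary_stats` are illustrative names, not part of the notebook). It mirrors the formula used above, which is the population standard deviation (dividing by N rather than N-1).

```python
import math

def summary_stats(values):
    """Plain-Python analog of the scalars logged by variable_summaries."""
    n = len(values)
    mean = sum(values) / n
    # Same formula as tf.sqrt(tf.reduce_mean(tf.square(var - mean))):
    # the population standard deviation (divide by n, not n - 1).
    stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return {"mean": mean, "stddev": stddev,
            "max": max(values), "min": min(values)}

# Hypothetical weights, chosen only for illustration.
weights = [0.1, -0.2, 0.05, 0.3, -0.25]
stats = summary_stats(weights)
print(stats)
```

In TensorBoard, each of these values appears as one curve under the Scalars tab, grouped by the enclosing name scope.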
    

In [0]:
def nn_layer(input_tensor, input_dim, output_dim, layer_name,act=tf.nn.relu):
  with tf.name_scope(layer_name):
    with tf.name_scope('weights'):
      weights = weight_variable([input_dim, output_dim])
      variable_summaries(weights)
      
    with tf.name_scope('biases'):
      biases = bias_variable([output_dim])
      variable_summaries(biases)
      
    with tf.name_scope('Wx_plus_b'):
      preactivate = tf.matmul(input_tensor, weights)+biases
      tf.summary.histogram('pre_activations', preactivate)
      
    activations = act(preactivate, name='activation')
    tf.summary.histogram('activations', activations)
    return activations

In [0]:
hidden1 = nn_layer(x, 784, 500, 'layer1')

In [0]:
with tf.name_scope('dropout'):
  keep_prob = tf.placeholder(tf.float32)
  tf.summary.scalar('dropout_keep_probability',keep_prob)
  dropped = tf.nn.dropout(hidden1, keep_prob)

In [0]:
y = nn_layer(dropped, 500, 10, 'layer2', act=tf.identity)

In [14]:
with tf.name_scope('cross_entropy'):
  diff = tf.nn.softmax_cross_entropy_with_logits(logits=y,labels=y_)
  with tf.name_scope('total'):
    cross_entropy = tf.reduce_mean(diff)

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.



In [15]:
tf.summary.scalar('cross_entropy',cross_entropy)

<tf.Tensor 'cross_entropy_1:0' shape=() dtype=string>

In [16]:
with tf.name_scope('train'):
  train_step = tf.train.AdamOptimizer(learning_rate).minimize(cross_entropy)
  
with tf.name_scope('accuracy'):
  with tf.name_scope('correct_prediction'):
    correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
    
  with tf.name_scope('accuracy'):
    accuracy = tf.reduce_mean(tf.cast(correct_prediction,tf.float32))
    
tf.summary.scalar('accuracy', accuracy)

<tf.Tensor 'accuracy_1:0' shape=() dtype=string>

In [0]:
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(log_dir+'/train', sess.graph)

In [0]:
test_writer = tf.summary.FileWriter(log_dir+'/test')
tf.global_variables_initializer().run()

In [0]:
def feed_dict(train):
  if train:
    xs, ys = mnist.train.next_batch(100)
    k = dropout
  else:
    xs, ys = mnist.test.images, mnist.test.labels
    k = 1.0
    
  return {x:xs, y_:ys, keep_prob:k}

In [22]:
saver = tf.train.Saver()
for i in range(max_steps):
  if i%10==0:
    summary,acc = sess.run([merged, accuracy], feed_dict=feed_dict(False))
    test_writer.add_summary(summary, i)
    print('Accuracy at step %s: %s' % (i,acc))
    
  else:
    if i%100 == 99:
      run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
      run_metadata = tf.RunMetadata()
      summary,_ = sess.run([merged, train_step], feed_dict=feed_dict(True),
                          options=run_options, run_metadata=run_metadata)
      
      train_writer.add_run_metadata(run_metadata, 'step%03d'%i)
      train_writer.add_summary(summary, i)
      saver.save(sess, log_dir+'/model.ckpt',i)
      print('Adding run metadata for',i)
      
    else:
      summary,_ = sess.run([merged, train_step], feed_dict=feed_dict(True))
      train_writer.add_summary(summary, i)
      
train_writer.close()
test_writer.close()

Accuracy at step 0: 0.1243
Accuracy at step 10: 0.693
Accuracy at step 20: 0.8185
Accuracy at step 30: 0.8523
Accuracy at step 40: 0.8744
Accuracy at step 50: 0.8949
Accuracy at step 60: 0.8968
Accuracy at step 70: 0.904
Accuracy at step 80: 0.9109
Accuracy at step 90: 0.9171
Adding run metadata for 99
Accuracy at step 100: 0.9182
Accuracy at step 110: 0.9202
Accuracy at step 120: 0.9246
Accuracy at step 130: 0.9196
Accuracy at step 140: 0.9226
Accuracy at step 150: 0.9275
Accuracy at step 160: 0.9285
Accuracy at step 170: 0.9294
Accuracy at step 180: 0.9334
Accuracy at step 190: 0.934
Adding run metadata for 199
Accuracy at step 200: 0.9355
Accuracy at step 210: 0.9363
Accuracy at step 220: 0.9339
Accuracy at step 230: 0.9364
Accuracy at step 240: 0.9326
Accuracy at step 250: 0.9391
Accuracy at step 260: 0.9415
Accuracy at step 270: 0.9438
Accuracy at step 280: 0.9411
Accuracy at step 290: 0.9461
Adding run metadata for 299
Accuracy at step 300: 0.9475
Accuracy at step 310: 0.9471
...

Accuracy at step 760: 0.9645
Accuracy at step 770: 0.9643
Accuracy at step 780: 0.9602
Accuracy at step 790: 0.9667
Adding run metadata for 799
Accuracy at step 800: 0.9625
Accuracy at step 810: 0.9626
Accuracy at step 820: 0.9642
Accuracy at step 830: 0.9659
Accuracy at step 840: 0.9669
Accuracy at step 850: 0.9674
Accuracy at step 860: 0.9678
Accuracy at step 870: 0.9669
Accuracy at step 880: 0.9675
Accuracy at step 890: 0.9626
Adding run metadata for 899
Accuracy at step 900: 0.9659
Accuracy at step 910: 0.9653
Accuracy at step 920: 0.9658
Accuracy at step 930: 0.9668
Accuracy at step 940: 0.9682
Accuracy at step 950: 0.9675
Accuracy at step 960: 0.9655
Accuracy at step 970: 0.9667
Accuracy at step 980: 0.9677
Accuracy at step 990: 0.9683
Adding run metadata for 999
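
The branching in the training loop above can be hard to read at a glance. The sketch below, which uses a hypothetical helper `step_action` that is not part of the notebook, classifies each step the same way the loop does and counts how often each branch runs for `max_steps = 1000`.

```python
def step_action(i):
    """Return which branch of the training loop step i takes."""
    if i % 10 == 0:
        # Evaluate accuracy on the test set and log it with test_writer.
        return "evaluate"
    if i % 100 == 99:
        # Training step that also records run metadata and saves a checkpoint.
        return "train+trace"
    # Ordinary training step that only logs summaries with train_writer.
    return "train"

max_steps = 1000
counts = {}
for i in range(max_steps):
    action = step_action(i)
    counts[action] = counts.get(action, 0) + 1

print(counts)
# Steps that record run metadata within the first 300 steps.
print([i for i in range(300) if step_action(i) == "train+trace"])
```

Because the evaluation check comes first, a step divisible by 10 never trains, which is why the traced steps are 99, 199, and so on rather than multiples of 100.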


In [24]:
!tensorboard --logdir=/tmp/tensorflow/mnist/logs/mnist_with_summaries

^C


In [28]:
ROOT = %pwd
LOG_DIR = '/tmp/tensorflow/mnist/logs/mnist_with_summaries'

# will install `ngrok`, if necessary
# will create `log_dir` if path does not exist
colab_utils.tboard.launch_tensorboard( bin_dir=ROOT, log_dir=LOG_DIR )

calling wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip ...
calling unzip ngrok-stable-linux-amd64.zip ...
ngrok installed. path=/content/ngrok
status: tensorboard=False, ngrok=False
tensorboard url= http://33cc225b.ngrok.io


'http://33cc225b.ngrok.io'

## Multi-GPU Parallelism

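Parallelism in TensorFlow falls into two main categories: model parallelism and data parallelism. Model parallelism requires a parallelization scheme designed for each model. Its basic idea is to place different computation nodes of the model on different hardware resources.

This section focuses on synchronous data parallelism: we wait until every GPU has computed the gradients for its batch, then combine the gradients and update the shared model parameters. This is similar to training with one larger batch. With data parallelism, the GPUs should ideally be the same model and speed, which gives the best efficiency.

Asynchronous data parallelism does not wait for all GPUs to finish a training step. Whichever GPU finishes first immediately applies its gradients to the shared model parameters.

Generally speaking, synchronous data parallelism converges faster and reaches higher model accuracy than the asynchronous mode. One commonly cited reason is that asynchronous updates may apply gradients computed from parameters that are already out of date.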
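
To illustrate why synchronous data parallelism resembles one larger batch, here is a minimal pure-Python sketch. It assumes a one-parameter linear model with a mean-squared-error loss and eight made-up data points split evenly across four simulated GPUs. None of these names or values come from the notebook.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical data: eight (x, y) pairs, split evenly across four "GPUs".
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
num_gpus = 4
shard_size = len(data) // num_gpus
shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_gpus)]

w = 0.5
# Synchronous step: every tower computes its gradient, then the results are averaged.
tower_grads = [mse_gradient(w, shard) for shard in shards]
averaged = sum(tower_grads) / num_gpus

# The same gradient computed on the combined batch.
full_batch = mse_gradient(w, data)
print(averaged, full_batch)
```

The two values agree up to floating-point rounding because every shard has the same size. With unequal shards, a plain average of the per-tower gradients would no longer match the full-batch gradient.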

In [20]:
!git clone https://github.com/tensorflow/models.git 'mymodels'

Cloning into 'mymodels'...
remote: Counting objects: 16243, done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 16243 (delta 13), reused 13 (delta 7), pack-reused 16215
Receiving objects: 100% (16243/16243), 424.11 MiB | 37.34 MiB/s, done.
Resolving deltas: 100% (9590/9590), done.
Checking out files: 100% (2164/2164), done.


In [0]:
import os
os.chdir('mymodels/tutorials/image/cifar10')

In [0]:
import os.path
import re
import time
import numpy as np
import tensorflow as tf
import cifar10

In [23]:
dir(cifar10)

['DATA_URL',
 'FLAGS',
 'IMAGE_SIZE',
 'INITIAL_LEARNING_RATE',
 'LEARNING_RATE_DECAY_FACTOR',
 'MOVING_AVERAGE_DECAY',
 'NUM_CLASSES',
 'NUM_EPOCHS_PER_DECAY',
 'NUM_EXAMPLES_PER_EPOCH_FOR_EVAL',
 'NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN',
 'TOWER_NAME',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_activation_summary',
 '_add_loss_summaries',
 '_variable_on_cpu',
 '_variable_with_weight_decay',
 'absolute_import',
 'cifar10_input',
 'distorted_inputs',
 'division',
 'inference',
 'inputs',
 'loss',
 'maybe_download_and_extract',
 'os',
 'print_function',
 're',
 'sys',
 'tarfile',
 'tf',
 'train',
 'urllib']

In [0]:
batch_size=128
max_steps=1000000
num_gpus=4

In [0]:
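def tower_loss(scope):
  # Build the input pipeline and the model for one tower (one GPU).
  images,labels=cifar10.distorted_inputs()
  logits = cifar10.inference(images)
  # cifar10.loss adds its terms to the 'losses' collection; its return value is not needed here.
  _ = cifar10.loss(logits, labels)
  # Collect only the losses created under this tower's name scope and sum them.
  losses = tf.get_collection('losses', scope)
  total_loss = tf.add_n(losses, name='total_loss')
  return total_loss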

In [0]:
def average_gradients(tower_grads):
  average_grads = []
  for grad_and_vars in zip(*tower_grads):
    grads = []
    for g, _ in grad_and_vars:
      expanded_g = tf.expand_dims(g,0)
      grads.append(expanded_g)
      
    grad = tf.concat(grads, 0)
    grad = tf.reduce_mean(grad, 0)
    v = grad_and_vars[0][1]
    grad_and_var = (grad,v)
    average_grads.append(grad_and_var)
    
  return average_grads
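
The key step in `average_gradients` is `zip(*tower_grads)`, which regroups the per-tower lists into one group per variable. The sketch below reproduces that data flow with plain floats and strings in place of tensors and variables. The gradient values and the names `"weights"` and `"biases"` are made up for illustration.

```python
def average_gradients(tower_grads):
    """Average per-variable gradients over towers (floats stand in for tensors)."""
    averaged = []
    # zip(*tower_grads) yields one tuple per variable, holding that
    # variable's (gradient, variable) pair from every tower.
    for grad_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grad_and_vars]
        mean_grad = sum(grads) / len(grads)
        # Variables are shared across towers, so the first tower's entry is enough.
        averaged.append((mean_grad, grad_and_vars[0][1]))
    return averaged

# Hypothetical gradients from two towers for two shared variables.
tower_grads = [
    [(1.0, "weights"), (4.0, "biases")],   # tower 0
    [(3.0, "weights"), (8.0, "biases")],   # tower 1
]
result = average_gradients(tower_grads)
print(result)  # [(2.0, 'weights'), (6.0, 'biases')]
```

This relies on every tower returning its gradients in the same variable order, which holds in the notebook because all towers build the same model with shared variables.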

In [0]:
def train():
  with tf.Graph().as_default(), tf.device('/cpu:0'):
    global_step = tf.get_variable('global_step', [], 
                                 initializer=tf.constant_initializer(0),
                                 trainable=False)
    
    num_batches_per_epoch = cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN/batch_size
    decay_steps = int(num_batches_per_epoch*cifar10.NUM_EPOCHS_PER_DECAY)
    
    lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                   global_step,
                                   decay_steps,
                                   cifar10.LEARNING_RATE_DECAY_FACTOR,
                                   staircase=True)
    
    opt = tf.train.GradientDescentOptimizer(lr)
    
    tower_grads = []
    for i in range(num_gpus):
      with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME,i)) as scope:
          loss = tower_loss(scope)
          tf.get_variable_scope().reuse_variables()
          grads = opt.compute_gradients(loss)
          tower_grads.append(grads)
          
    grads = average_gradients(tower_grads)
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
    
    saver = tf.train.Saver(tf.all_variables())
    init = tf.global_variables_initializer()
    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
    sess.run(init)
    tf.train.start_queue_runners(sess=sess)
    
    for step in range(max_steps):
      start_time = time.time()
      _, loss_value = sess.run([apply_gradient_op, loss])
      duration = time.time() - start_time
      
      if step % 10 == 0:
        num_examples_per_step = batch_size * num_gpus
        examples_per_sec = num_examples_per_step/duration
        sec_per_batch = duration/num_gpus
        
        format_str = ('step %d,loss=%.2f (%.1f examples/sec; %.3f sec/batch)')
        print(format_str % (step, loss_value, examples_per_sec, sec_per_batch))
        
      if step % 1000 == 0 or (step+1)==max_steps:
        saver.save(sess, '/tmp/cifar10_train/model.ckpt', global_step=step)
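
The learning-rate schedule built with `tf.train.exponential_decay(..., staircase=True)` follows the formula `lr = initial_lr * decay_rate ** floor(global_step / decay_steps)`. The sketch below evaluates that formula in plain Python. The four constants are assumptions believed to match the tutorial's `cifar10.py`, but they are not read from the module, so check it for the actual values.

```python
def staircase_decay(initial_lr, global_step, decay_steps, decay_rate):
    """lr = initial_lr * decay_rate ** floor(global_step / decay_steps)."""
    return initial_lr * decay_rate ** (global_step // decay_steps)

# Assumed constants, NOT read from cifar10.py; verify against the module.
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000
NUM_EPOCHS_PER_DECAY = 350.0
INITIAL_LEARNING_RATE = 0.1
LEARNING_RATE_DECAY_FACTOR = 0.1
batch_size = 128  # same value as in the notebook

# Same arithmetic as in train().
num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / batch_size
decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)
print("decay_steps =", decay_steps)

# The rate stays constant within a stair and drops at each multiple of decay_steps.
for step in (0, decay_steps - 1, decay_steps, 2 * decay_steps):
    print(step, staircase_decay(INITIAL_LEARNING_RATE, step, decay_steps,
                                LEARNING_RATE_DECAY_FACTOR))
```

Under these assumed constants the first drop happens only after more than a hundred thousand steps, so the short run shown in the output above would train entirely at the initial rate.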

In [0]:
cifar10.maybe_download_and_extract()

In [0]:
train()

Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
Instructions for updating:
Please use tf.global_variables instead.
step 0,loss=4.67 (16.1 examples/sec; 7.935 sec/batch)
step 10,loss=4.61 (94.8 examples/sec; 1.350 sec/batch)
step 20,loss=4.45 (95.9 examples/sec; 1.335 sec/batch)
step 30,loss=4.40 (91.6 examples/sec; 1.397 sec/batch)
step 40,loss=4.28 (93.8 examples/sec; 1.365 sec/batch)
step 50,loss=4.33 (96.0 examples/sec; 1.334 sec/batch)
step 60,loss=4.17 (92.4 examples/sec; 1.385 sec/batch)
step 70,loss=4.27 (93.5 examples/sec; 1.369 sec/batch)
step 80,loss=4.20 (92.4 examples/sec; 1.385 sec/batch)
step 90,loss=4.16 (92.5 examples/sec; 1.383 sec/batch)
step 100

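## Distributed Parallelism
TensorFlow's distributed execution is built on the gRPC communication framework. It consists of a master, which creates the session, and multiple workers, which execute the tasks in the computation graph.

We first create a TensorFlow Cluster object. It contains a set of tasks (each task is usually a separate machine) that execute the TensorFlow computation graph in a distributed fashion.

A Cluster can be divided into multiple jobs, where a job is a group of tasks of one particular kind. We need to create a server for each task and connect it to the Cluster. Each task usually runs on a different machine, although several tasks can also run on one machine.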
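
The cluster, job, and task hierarchy can be pictured as a dictionary that maps job names to lists of task addresses. In TensorFlow 1.x, a dictionary of this shape is what `tf.train.ClusterSpec` accepts, and each task then starts a `tf.train.Server` with its own `job_name` and `task_index`. The sketch below stays in plain Python so it runs without TensorFlow. The job names `"ps"` and `"worker"` follow a common convention and are an assumption here, and the host names and ports are hypothetical.

```python
# Hypothetical host names and ports; replace them with real addresses.
cluster_spec = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}

def list_tasks(spec):
    """Enumerate (job_name, task_index, address) for every task in the cluster."""
    return [(job, index, address)
            for job, addresses in sorted(spec.items())
            for index, address in enumerate(addresses)]

tasks = list_tasks(cluster_spec)
for job, index, address in tasks:
    # Each entry corresponds to one server process in the cluster.
    print("/job:%s/task:%d -> %s" % (job, index, address))
```

To run several tasks on one machine, the addresses would share a host name and differ only in the port.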

In [0]:
import math
import tempfile
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

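Here tf.app.flags is used to define flags that set parameters when the TensorFlow program is run from the command line. Parameters given on the command line are read by TensorFlow and converted directly into flags.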

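The data directory data_dir defaults to /tmp/mnist-data, the number of hidden units defaults to 100, the maximum number of training steps train_steps defaults to 1000000, batch_size defaults to 100, and the learning rate defaults to 0.01.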
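
The defaults listed above can be sketched with the standard library's `argparse`, used here as a stand-in for `tf.app.flags` so that the example runs without TensorFlow. The flag names `hidden_units` and `learning_rate` are assumptions, since the text does not spell them out.

```python
import argparse

def build_parser():
    """Command-line flags with the defaults described in the text."""
    parser = argparse.ArgumentParser(description="Distributed MNIST training flags")
    parser.add_argument("--data_dir", default="/tmp/mnist-data",
                        help="Directory for storing the MNIST data")
    # 'hidden_units' is an assumed flag name for the number of hidden units.
    parser.add_argument("--hidden_units", type=int, default=100,
                        help="Number of units in the hidden layer")
    parser.add_argument("--train_steps", type=int, default=1000000,
                        help="Maximum number of training steps")
    parser.add_argument("--batch_size", type=int, default=100,
                        help="Training batch size")
    # 'learning_rate' is an assumed flag name for the learning rate.
    parser.add_argument("--learning_rate", type=float, default=0.01,
                        help="Learning rate")
    return parser

# Parse an explicit argument list so the sketch also runs inside a notebook.
flags = build_parser().parse_args(["--batch_size", "64"])
print(flags.data_dir, flags.hidden_units, flags.train_steps,
      flags.batch_size, flags.learning_rate)
```

Any flag not given on the command line keeps its default, so only `batch_size` changes in this run.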