# CH7 GPU Computing

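## 7.6 GPU Management in TensorFlow
In TensorFlow, the supported devices are represented as strings:
* `/cpu:0`: the CPU of your machine
* `/gpu:0`: the GPU of your machine, if you have one
* `/gpu:1`: the second GPU of your machine, and so on.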
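
The short device strings above follow a simple `/<type>:<index>` pattern. As a minimal sketch, the snippet below builds and checks such strings; `device_name` and `parse_device` are hypothetical helpers written for illustration and are not part of TensorFlow.

```python
import re

# Pattern for the short device strings listed above, e.g. "/cpu:0" or "/gpu:1".
DEVICE_PATTERN = re.compile(r"^/(cpu|gpu):(\d+)$")

def device_name(kind, index=0):
    """Hypothetical helper: build a short device string such as "/gpu:1"."""
    kind = kind.lower()
    if kind not in ("cpu", "gpu"):
        raise ValueError("unknown device type: " + kind)
    if index < 0:
        raise ValueError("device index must be non-negative")
    return "/{}:{}".format(kind, index)

def parse_device(name):
    """Hypothetical helper: split a short device string into (type, index)."""
    match = DEVICE_PATTERN.match(name.lower())
    if match is None:
        raise ValueError("not a short device string: " + name)
    return match.group(1), int(match.group(2))

print(device_name("gpu", 1))
print(parse_device("/cpu:0"))
```

A check like this catches misspelled device names before they reach `tf.device`.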

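When an operation has both a CPU and a GPU implementation, TensorFlow gives the GPU device priority when assigning the operation to a device.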

### Example Program

In [None]:
import numpy as np
import tensorflow as tf
import datetime

To use a GPU in a TensorFlow program, place the operations you define inside a block opened by the following statement:
<br>`with tf.device("/gpu:0"):`

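You can write a program that shows which device each of your operations and tensors is assigned to. To do this, create a session with the `log_device_placement` parameter set to True: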

In [None]:
log_device_placement = True

In [None]:
# Next, set the parameter n, the number of matrix multiplications to perform:
n = 10

In [None]:
# Then create two large random matrices, A and B, each of size 10,000 × 10,000.
# NumPy's rand function generates them:
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')

In [None]:
# The following lists will store the results:
c1 = []
c2 = []

In [None]:
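# Define the matrix power function whose multiplications will run on the GPU.
# Note: with the base case n < 1, matpow(M, n) performs n multiplications and returns M^(n+1).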
def matpow(M, n):
    if n < 1:
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))
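
To see what the recursion in `matpow` computes without needing a GPU, here is a small NumPy analogue. It is a sketch under the assumption that `np.matmul` stands in for `tf.matmul`; `matpow_np` is a name introduced here for illustration. It compares the result against `np.linalg.matrix_power` on a 2 × 2 matrix.

```python
import numpy as np

# NumPy analogue of matpow: same recursion, np.matmul instead of tf.matmul.
def matpow_np(M, n):
    if n < 1:
        return M
    return np.matmul(M, matpow_np(M, n - 1))

# A small matrix whose powers are easy to check by hand.
M = np.array([[1.0, 1.0],
              [0.0, 1.0]])

result = matpow_np(M, 3)                 # three multiplications
expected = np.linalg.matrix_power(M, 4)  # M multiplied by itself four times

print(np.array_equal(result, expected))
```

With the base case `n < 1`, a call with `n = 3` performs three multiplications, so the result equals the fourth power of `M`.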

As mentioned earlier, we must specify which GPU to use and which operations it executes.

In [None]:
# In this example, the GPU computes the two matrix powers, matpow(a, n) and matpow(b, n), and stores them in c1:
with tf.device('/gpu:0'):
    a = tf.placeholder(tf.float32, [10000, 10000])
    b = tf.placeholder(tf.float32, [10000, 10000])
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

In [None]:
# Adding up all elements of c1 gives the sum of the two matrix powers. The CPU performs this addition, so we define it as follows:
with tf.device('/cpu:0'):
    sum = tf.add_n(c1)

# The datetime class is used to measure the execution time:
t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:
    sess.run(sum, {a:A, b:B})
t2_1 = datetime.datetime.now()

# The following statement prints the computation time:
print("GPU computation time: " + str(t2_1-t1_1))

In [1]:
import numpy as np
import tensorflow as tf
import datetime

log_device_placement = True
n = 10
A = np.random.rand(10000, 10000).astype('float32')
B = np.random.rand(10000, 10000).astype('float32')
c1 = []
c2 = []

def matpow(M, n):
    if n < 1:  # Base case: return M unchanged when n < 1
        return M
    else:
        return tf.matmul(M, matpow(M, n-1))

with tf.device('/gpu:0'): # For CPU use /cpu:0
    a = tf.placeholder(tf.float32, [10000, 10000])
    b = tf.placeholder(tf.float32, [10000, 10000])
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

# If the code below does not work, use '/job:localhost/replica:0/task:0/cpu:0' as the CPU device
with tf.device('/cpu:0'):
    sum = tf.add_n(c1)  # Sum of all elements in c1, i.e. matpow(a, n) + matpow(b, n)

t1_1 = datetime.datetime.now()
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                      log_device_placement=log_device_placement)) as sess:
    sess.run(sum, {a:A, b:B})
t2_1 = datetime.datetime.now()

# The following statement prints the computation time:
print("GPU computation time: " + str(t2_1-t1_1))

ResourceExhaustedError: OOM when allocating tensor with shape[10000,10000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_1, _arg_Placeholder_0_0/_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: MatMul_19/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_21_MatMul_19", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'MatMul', defined at:
  File "D:\Python\Anaconda3\envs\GPU\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\Python\Anaconda3\envs\GPU\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
    app.start()
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\ipykernel\kernelapp.py", line 497, in start
    self.io_loop.start()
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tornado\platform\asyncio.py", line 132, in start
    self.asyncio_loop.run_forever()
  File "D:\Python\Anaconda3\envs\GPU\lib\asyncio\base_events.py", line 422, in run_forever
    self._run_once()
  File "D:\Python\Anaconda3\envs\GPU\lib\asyncio\base_events.py", line 1434, in _run_once
    handle._run()
  File "D:\Python\Anaconda3\envs\GPU\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tornado\platform\asyncio.py", line 122, in _handle_events
    handler_func(fileobj, events)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tornado\stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\zmq\eventloop\zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\zmq\eventloop\zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\zmq\eventloop\zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tornado\stack_context.py", line 300, in null_wrapper
    return fn(*args, **kwargs)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\ipykernel\kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\ipykernel\ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\ipykernel\zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\IPython\core\interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\IPython\core\interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\IPython\core\interactiveshell.py", line 2901, in run_ast_nodes
    if self.run_code(code, result):
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\IPython\core\interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-100e7f32a75d>", line 21, in <module>
    c1.append(matpow(a, n))
  File "<ipython-input-1-100e7f32a75d>", line 16, in matpow
    return tf.matmul(M, matpow(M, n-1))
  File "<ipython-input-1-100e7f32a75d>", line 16, in matpow
    return tf.matmul(M, matpow(M, n-1))
  File "<ipython-input-1-100e7f32a75d>", line 16, in matpow
    return tf.matmul(M, matpow(M, n-1))
  [Previous line repeated 6 more times]
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2018, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4750, in mat_mul
    name=name)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tensorflow\python\util\deprecation.py", line 454, in new_func
    return func(*args, **kwargs)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tensorflow\python\framework\ops.py", line 3155, in create_op
    op_def=op_def)
  File "D:\Python\Anaconda3\envs\GPU\lib\site-packages\tensorflow\python\framework\ops.py", line 1717, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10000,10000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_1, _arg_Placeholder_0_0/_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: MatMul_19/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_21_MatMul_19", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
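
The `ResourceExhaustedError` above reports that the GPU allocator could not reserve memory for a `[10000, 10000]` float tensor. As a rough back-of-the-envelope check (a sketch, not a measurement), the following estimates the size of one such matrix. The worst-case figure is a hypothetical assumption: it supposes that the two inputs and all 20 `MatMul` results (`MatMul` through `MatMul_19` in the log) stay in GPU memory at once, which TensorFlow does not necessarily do.

```python
# Size of one dense float32 matrix of shape [10000, 10000].
rows, cols = 10000, 10000
bytes_per_float32 = 4
bytes_per_matrix = rows * cols * bytes_per_float32
mib_per_matrix = bytes_per_matrix / 2**20

# Hypothetical worst case: 2 inputs plus 20 MatMul outputs resident together.
n_tensors = 2 + 20
worst_case_gib = n_tensors * bytes_per_matrix / 2**30

print("one matrix: %.1f MiB" % mib_per_matrix)
print("worst case: %.1f GiB" % worst_case_gib)
```

If an estimate like this approaches the memory of your GPU, reducing the matrix size or `n` is the simplest way to make the example fit.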



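## 7.8 Assigning a Single GPU on a Multi-GPU System
If your system has more than one GPU, TensorFlow selects the one with the lowest ID by default. To run on a different GPU, you must specify it explicitly.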

For example, we can change the GPU assignment in the earlier code:

In [None]:
with tf.device('/gpu:1'):
    a = tf.placeholder(tf.float32, [10000, 10000])
    b = tf.placeholder(tf.float32, [10000, 10000])
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

In this way, we tell the second GPU, `/gpu:1`, to execute the kernel function.

In [None]:
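# If you want TensorFlow to automatically choose an existing, supported device
# when the specified one does not exist, set allow_soft_placement to True:
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                      log_device_placement=log_device_placement)) as sess:
    # Assumes c1 and sum were rebuilt from the placeholders defined above.
    sess.run(sum, {a: A, b: B})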
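
The fallback behaviour described for `allow_soft_placement` can be pictured with a small pure-Python sketch. This is a conceptual illustration only, not TensorFlow's actual placement algorithm; `resolve_device` and the `available` list are hypothetical names, and the list is assumed to be ordered by preference.

```python
# Conceptual sketch only: this is NOT TensorFlow's placement algorithm.
def resolve_device(requested, available, allow_soft_placement=False):
    """Return the device an operation would run on in this simplified model."""
    if requested in available:
        return requested
    if allow_soft_placement:
        # Fall back to an existing device; the list is assumed to be
        # ordered by preference, so take the first entry.
        return available[0]
    raise ValueError("cannot assign a device: %s is not available" % requested)

# A machine with one GPU: '/gpu:1' does not exist here.
available = ["/gpu:0", "/cpu:0"]

print(resolve_device("/gpu:0", available))
print(resolve_device("/gpu:1", available, allow_soft_placement=True))
```

Without soft placement, the request for the missing `/gpu:1` raises an error in this model instead of silently moving the operation.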

## 7.9 Using Multiple GPUs
If you would like to run TensorFlow on multiple GPUs, you can assign specific pieces of code to different GPUs when building the model.

For example, with two GPUs we can split the earlier code as follows, assigning the first matrix computation to the first GPU:

In [None]:
with tf.device('/gpu:0'):
    a = tf.placeholder(tf.float32, [10000, 10000])
    c1.append(matpow(a, n))

The second matrix computation is assigned to the second GPU:

In [None]:
with tf.device('/gpu:1'):
    b = tf.placeholder(tf.float32, [10000, 10000])
    c1.append(matpow(b, n))

Finally, the CPU collects the results. Note that we use the shared list c1 to gather them:

In [None]:
with tf.device('/cpu:0'):
    sum = tf.add_n(c1)
    print(sum)
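
The split-and-gather pattern of this section can be mimicked on the CPU with NumPy. In the sketch below, two worker threads stand in for the two GPUs; this is an assumption made purely for illustration, with no real device placement involved. The main thread plays the role of the CPU that adds up the shared list. `matpow_np` is a hypothetical NumPy analogue of `matpow`.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# NumPy analogue of matpow (same recursion and base case as in the text).
def matpow_np(M, n):
    if n < 1:
        return M
    return np.matmul(M, matpow_np(M, n - 1))

n = 2
A = 2.0 * np.eye(3)  # small stand-ins for the large random matrices
B = 3.0 * np.eye(3)

# Each worker computes one matrix power independently, like one GPU each.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(matpow_np, M, n) for M in (A, B)]
    c1 = [f.result() for f in futures]  # shared list gathering both results

# The "CPU" step: add up everything collected in c1.
total = np.sum(c1, axis=0)
print(total)
```

The two branches share no data until the final sum, which is why they can be assigned to separate devices.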