# Using the GPU to accelerate computation

In [1]:
!nvidia-smi

Fri Jan 19 15:58:39 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.22                 Driver Version: 381.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 108...  Off  | 0000:01:00.0      On |                  N/A |
| 29%   35C    P8    18W / 250W |    207MiB / 11171MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage    

In [2]:
import sys
# pip.main() was removed in pip 10, so run pip through the notebook's own interpreter
!{sys.executable} -m pip show mxnet mxnet-cu75 mxnet-cu80

Name: mxnet-cu80
Version: 1.0.1b20180105
Summary: MXNet is an ultra-scalable deep learning framework. This version uses CUDA-8.0.
Home-page: https://github.com/apache/incubator-mxnet
Author: UNKNOWN
Author-email: UNKNOWN
License: Apache 2.0
Location: /liang-jupyter/lib/python3.5/site-packages
Requires: requests, numpy, graphviz


## Context

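MXNet uses a ``Context`` to specify which device stores the data and runs the computation. By default, data is allocated in main memory and computed on the CPU, which is represented by ``mx.cpu()``. A GPU is represented by ``mx.gpu()``. Note that ``mx.cpu()`` stands for all physical CPUs and main memory, so computation will try to use all CPU cores, whereas ``mx.gpu()`` stands for a single GPU and its own memory. When there are several GPUs, ``mx.gpu(i)`` denotes the i-th GPU (i starts at 0).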

In [3]:
import mxnet as mx
[mx.cpu(), mx.gpu()]

[cpu(0), gpu(0)]
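
A ``Context`` is only a description of a device, so it can be built and inspected without touching the hardware. The sketch below assumes MXNet 1.x is installed and relies only on the ``device_type`` and ``device_id`` attributes; the helper name ``describe_contexts`` is made up for this example.

```python
def describe_contexts():
    """Return (device_type, device_id) pairs for a few Context objects."""
    # Imported inside the function so that defining the helper does not need MXNet.
    import mxnet as mx

    # mx.gpu() is shorthand for mx.gpu(0); mx.gpu(1) names the second GPU.
    # Creating the Context object does not check that the device exists.
    contexts = [mx.cpu(), mx.gpu(), mx.gpu(1)]
    return [(ctx.device_type, ctx.device_id) for ctx in contexts]
```

Calling ``describe_contexts()`` should list the CPU context first, followed by GPU 0 and GPU 1, even on a machine that has no second GPU. An error only appears once memory is actually allocated on a missing device.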

If we do not specify ``ctx``, data is created on the CPU by default.

In [4]:
from mxnet import nd
a = nd.array([1,2,3])
a.context

cpu(0)

## Allocating memory on the GPU

In [5]:
a = nd.array([1,2,3], ctx=mx.gpu())
b = nd.ones(shape=(3,2), ctx=mx.gpu())
c = nd.random.normal(shape=(3,4), ctx=mx.gpu())
(a,b,c)

(
 [ 1.  2.  3.]
 <NDArray 3 @gpu(0)>, 
 [[ 1.  1.]
  [ 1.  1.]
  [ 1.  1.]]
 <NDArray 3x2 @gpu(0)>, 
 [[-1.32045507  0.68232244 -0.98583829  0.01992839]
  [ 0.78424042  0.50066984 -1.02834916  0.98445743]
  [ 0.23791966  0.56752419  0.416008    1.2724396 ]]
 <NDArray 3x4 @gpu(0)>)

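If the machine has only one GPU, the following code raises an error, because ``mx.gpu(1)`` refers to a second GPU that does not exist.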

```
import sys
try:
    nd.random.normal(shape=(3,4), ctx=mx.gpu(1))
except mx.MXNetError as err:
    sys.stderr.write(str(err))
```
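
The same ``try``/``except`` pattern can be turned into a fallback, so that a notebook still runs on a machine without the requested GPU. This is a minimal sketch, assuming MXNet 1.x; the helper name ``try_gpu`` and its ``index`` parameter are illustrative and not part of the MXNet API.

```python
def try_gpu(index=0):
    """Return mx.gpu(index) if an array can be created there, else mx.cpu()."""
    # Imported inside the function so that defining the helper does not need MXNet.
    import mxnet as mx
    from mxnet import nd

    try:
        ctx = mx.gpu(index)
        # Creating a tiny array forces MXNet to actually use the device.
        nd.array([0], ctx=ctx)
        return ctx
    except mx.MXNetError:
        # No such GPU, or MXNet was built without CUDA support.
        return mx.cpu()
```

A cell can then use ``ctx = try_gpu()`` once and pass ``ctx=ctx`` to every array it creates, instead of hard-coding ``mx.gpu()``.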

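We use ``copyto`` or ``as_in_context`` to transfer data between devices. Prefer ``as_in_context`` where possible: when the source and the target context are the same, it does not allocate new memory, whereas ``copyto`` always creates a copy.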

In [10]:
a = nd.array([1,2,3], ctx=mx.gpu())
b = a.copyto(mx.gpu())
c = a.as_in_context(mx.gpu())
d = a.as_in_context(mx.cpu())
print(b)
print(c)
print(d) 


[ 1.  2.  3.]
<NDArray 3 @gpu(0)>

[ 1.  2.  3.]
<NDArray 3 @gpu(0)>

[ 1.  2.  3.]
<NDArray 3 @cpu(0)>
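
The printed values look identical, so the difference between the two calls is easier to see through object identity. The sketch below takes any ``NDArray`` and moves it to the context it already lives on; the function name ``compare_transfers`` is made up for this example.

```python
def compare_transfers(a):
    """Move `a` to its own context both ways and report whether `a` itself came back.

    Returns a pair (as_in_context_returned_a, copyto_returned_a).
    """
    same_ctx = a.context
    via_as_in_context = a.as_in_context(same_ctx)  # no transfer needed
    via_copyto = a.copyto(same_ctx)                # always allocates a new array
    return via_as_in_context is a, via_copyto is a
```

For the array ``a`` defined above, ``compare_transfers(a)`` is expected to give ``(True, False)``: ``as_in_context`` hands back the very same object, while ``copyto`` produces a new array on the same device.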


**To compute on the GPU, the data must first be allocated on the GPU. The result of the computation is then stored on the GPU automatically.**

In [12]:
a = nd.array([1,2,3], ctx=mx.gpu())
a + 2


[ 3.  4.  5.]
<NDArray 3 @gpu(0)>

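Note that every computation requires all of its inputs to be on the same device. When they are not, the system does not copy the data automatically. The reason for this design is that moving data between devices is usually expensive, so we want users to know exactly where their data lives rather than hide that detail. The code below tries to add ``x`` on the GPU to ``y`` on the CPU.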

In [14]:
import sys

x = nd.array([1,2,3], ctx=mx.gpu())
y = nd.array([4,5,6], ctx=mx.cpu())
try:
    x + y
except mx.MXNetError as err:
    sys.stderr.write(str(err))

[16:07:48] src/imperative/./imperative_utils.h:55: Check failed: inputs[i]->ctx().dev_mask() == ctx.dev_mask() (1 vs. 2) Operator broadcast_add require all inputs live on the same context. But the first argument is on gpu(0) while the 2-th argument is on cpu(0)

Stack trace returned 10 entries:
[bt] (0) /liang-jupyter/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2a4992) [0x7fd2d59b1992]
[bt] (1) /liang-jupyter/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2a4f88) [0x7fd2d59b1f88]
[bt] (2) /liang-jupyter/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x250880c) [0x7fd2d7c1580c]
[bt] (3) /liang-jupyter/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x24f9bde) [0x7fd2d7c06bde]
[bt] (4) /liang-jupyter/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x24399fb) [0x7fd2d7b469fb]
[bt] (5) /liang-jupyter/lib/python3.5/site-packages/mxnet/libmxnet.so(MXImperativeInvokeEx+0x63) [0x7fd2d7b46f63]
[bt] (6) /liang-jupyter/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_cal

In [15]:
x


[ 1.  2.  3.]
<NDArray 3 @gpu(0)>
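
One way to make such a mixed-device addition succeed is to move one operand explicitly before computing. This is a small sketch; the helper name ``add_on_same_device`` is hypothetical, and it arbitrarily chooses the device of the first operand as the place to compute.

```python
def add_on_same_device(x, y):
    """Add two NDArrays after moving `y` to the context that holds `x`."""
    # as_in_context returns `y` unchanged when it already lives on x.context,
    # so the explicit transfer only costs something when it is really needed.
    y_on_x = y.as_in_context(x.context)
    return x + y_on_x
```

With ``x`` on ``gpu(0)`` and ``y`` on ``cpu(0)``, ``add_on_same_device(x, y)`` copies ``y`` to the GPU once and returns a result on ``gpu(0)``. The transfer stays visible in the code, which matches the design goal of not hiding where data lives.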

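If an operation needs to read the contents of an NDArray, for example printing it or converting it to NumPy format, the system automatically copies the data to main memory when necessary.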

In [19]:
x.asnumpy()

array([ 1.,  2.,  3.], dtype=float32)

In [22]:
print(x.sum().asscalar())

6.0


## Defining a model and computing on the GPU with ``gluon``

We only need to specify the ``ctx`` argument when initializing the parameters.

In [43]:
from mxnet import gluon

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(4))
net.collect_params().initialize(ctx=mx.gpu())

In [44]:
x = nd.random.normal(shape=(2,3), ctx=mx.gpu())
net(x)


[[ 0.01748613 -0.02448209  0.0320908  -0.00031306]
 [-0.00010406 -0.02248663 -0.01203028 -0.03666325]]
<NDArray 2x4 @gpu(0)>
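
To confirm that both the parameters and the output end up on the chosen device, the two cells above can be wrapped in a function that takes the context as an argument. This is a sketch assuming the MXNet 1.x Gluon API used in this notebook; the function name ``run_dense_on`` is made up, and ``list_ctx()`` is used to read back where a parameter is stored.

```python
def run_dense_on(ctx):
    """Build a one-layer network on `ctx` and report where its data lives.

    Returns (context of the output, list of contexts holding the weight).
    """
    # Imported inside the function so that defining the helper does not need MXNet.
    from mxnet import gluon, nd

    net = gluon.nn.Sequential()
    with net.name_scope():
        net.add(gluon.nn.Dense(4))
    net.collect_params().initialize(ctx=ctx)

    # The input must be created on the same context as the parameters.
    x = nd.random.normal(shape=(2, 3), ctx=ctx)
    y = net(x)
    return y.context, net[0].weight.list_ctx()
```

Passing ``mx.gpu()`` should report ``gpu(0)`` for both the output and the weight, and passing ``mx.cpu()`` lets the same code run on a machine without a GPU.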

In [45]:
net.collect_params()

sequential3_ (
  Parameter sequential3_dense0_weight (shape=(4, 3), dtype=<class 'numpy.float32'>)
  Parameter sequential3_dense0_bias (shape=(4,), dtype=<class 'numpy.float32'>)
)

In [40]:
net[0].weight

Parameter sequential2_dense0_weight (shape=(4, 10), dtype=<class 'numpy.float32'>)

In [32]:
net[0].bias.data()


[ 0.  0.  0.  0.]
<NDArray 4 @gpu(0)>

In [35]:
net[0]

Dense(10 -> 4, linear)