## Tensor Product on the `GPU` is much faster than on the `CPU`

Most of the computation in deep learning is `Matrix Multiplication`.<br>
We will make `Dezero` run on the `GPU`.

---

- Install `cupy` to use the `GPU`
    - A library for parallel computation on the GPU

```bash
$ pip install cupy
```

---

If `pip` doesn't work and you're on `Win10`, try the following:

```bash
$ conda install -c conda-forge cupy cudatoolkit
```

### CUDA / cuDNN

- Check which CUDA version is appropriate for your graphics card
    - https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
- Install CUDA
    - https://developer.nvidia.com/cuda-toolkit-archive
- Install cuDNN
    - https://developer.nvidia.com/cudnn

## `Numpy` & `Cupy`

Their usage is almost the same :)

In [2]:
import cupy as cp

x = cp.arange(6).reshape(2, 3)
print(x)

y = x.sum(axis=1)
print(y)

[[0 1 2]
 [3 4 5]]
[ 3 12]


In [3]:
type(y)

cupy.core.core.ndarray

We can send data back and forth
- `cp.asarray` - **Main Memory** -> **GPU Memory**
- `cp.asnumpy` - **GPU Memory** -> **Main Memory**

In [5]:
import numpy as np
import cupy as cp

# numpy -> cupy
n = np.array([1, 2, 3])
c = cp.asarray(n)

assert type(c) == cp.ndarray

In [7]:
# cupy -> numpy
c = cp.array([1, 2, 3])
n = cp.asnumpy(c)

assert type(n) == np.ndarray

### Caution

This memory transfer can become a bottleneck when we handle large data.<br>
It's important to write code that moves data between the two memories as little as possible.
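
The advice above can be sketched as "one transfer in, several operations on the device, one small transfer out". This is only an illustration: the helper `to_host`, the array shape, and the operations are assumptions, and the sketch falls back to `numpy` when `cupy` is not installed.

```python
import numpy as np

try:
    import cupy as cp  # optional; only needed for the GPU path
    xp = cp
except ImportError:
    cp = None
    xp = np  # fall back to NumPy so the sketch still runs


def to_host(a):
    """Return a NumPy array, copying from GPU memory only when CuPy is in use."""
    return cp.asnumpy(a) if cp is not None else np.asarray(a)


data = np.random.rand(100, 1000).astype(np.float32)  # lives in main memory

# One transfer in: move the whole array to the device once.
d = xp.asarray(data)

# Several operations run back to back without leaving device memory.
d = d - d.mean(axis=1, keepdims=True)
norms = xp.sqrt((d * d).sum(axis=1))

# One transfer out: bring back only the small result.
norms_host = to_host(norms)
print(norms_host.shape)  # (100,)
```

Calling `cp.asnumpy` after every intermediate step would produce the same numbers, but with many more copies between the two memories.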

## `cp.get_array_module`

Tells whether a tensor is a `numpy` array or a `cupy` array

In [8]:
# numpy array
x = np.array([1, 2, 3])
xp = cp.get_array_module(x)
assert xp == np

# cupy array
x = cp.array([1, 2, 3])
xp = cp.get_array_module(x)
assert xp == cp

We can write **compatible code** for both `numpy` and `cupy` with **`cp.get_array_module`** :)

```python
xp = cp.get_array_module(x)

y = xp.sin(x)
```
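
As a slightly larger illustration of the same pattern, here is a hedged sketch of a softmax that runs on either kind of array. The local `get_array_module` wrapper and the `softmax` function are written for this example only (they are not `dezero` code), and the wrapper returns `numpy` when `cupy` is missing.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None


def get_array_module(x):
    """Return cupy for CuPy arrays, otherwise numpy (also when CuPy is missing)."""
    if cp is None:
        return np
    return cp.get_array_module(x)


def softmax(x, axis=1):
    xp = get_array_module(x)                 # numpy or cupy, decided by the input
    z = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = xp.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


p = softmax(np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]))
print(p.sum(axis=1))  # each row sums to (approximately) 1
```

The same `softmax` accepts a `cupy` array unchanged, because only `xp` differs between the two cases.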

## Cuda Module

Make the code work even on a computer that has no `GPU` or no `cupy` installed

In [10]:
import numpy as np

gpu_enable  = True

try:
    import cupy as cp
    cupy = cp
except ImportError:
    gpu_enable = False

### We write wrapper functions around `cupy` to handle the case where `cupy` is not installed

In [4]:
from dezero import Variable

def get_array_module(x):
    if isinstance(x, Variable):
        x = x.data
        
    if not gpu_enable:
        return np
    
    xp  = cp.get_array_module(x)
    return xp

def as_numpy(x):
    if isinstance(x, Variable):
        x = x.data
    
    # Handle scalars and NumPy arrays first, so this also works when cupy is not installed
    if np.isscalar(x):
        return np.array(x)
    elif isinstance(x, np.ndarray):
        return x
    
    return cp.asnumpy(x)

def as_cupy(x):
    if isinstance(x, Variable):
        x = x.data
        
    if not gpu_enable:
        raise Exception('Cannot load Cupy. Install Cupy.')
        
    return cp.asarray(x)

In [36]:
cp.asnumpy(None)

array(None, dtype=object)

In [37]:
np.array(None)

array(None, dtype=object)

In [3]:
np.isscalar(np.array(1)), np.isscalar(1)

(False, True)

In [7]:
np.array(np.array([1, 2, 3]))

array([1, 2, 3])

In [16]:
np.array(cp.array([1, 2, 3]))

ValueError: object __array__ method not producing an array

In [17]:
cp.array(np.array([1, 2, 3]))

array([1, 2, 3])

In [20]:
cp.array(cp.array([1, 2, 3]))

array([1, 2, 3])

Using only `cp.asnumpy`:

In [19]:
cp.asnumpy(1)

array(1)

In [11]:
cp.asnumpy(np.array([1, 2, 3]))

array([1, 2, 3])

In [15]:
cp.asnumpy(cp.array([1, 2, 3]))

array([1, 2, 3])

Using only `cp.asarray`:

In [14]:
cp.asarray(1)

array(1)

In [13]:
cp.asarray(cp.array([1, 2, 3]))

array([1, 2, 3])

In [12]:
cp.asarray(np.array([1, 2, 3]))

array([1, 2, 3])

## Modify `dezero` to be compatible with `cupy`
- Variable
- Layer
- DataLoader

```python
...

try:
    import cupy
    array_types = (np.ndarray, cupy.ndarray)
except ImportError:
    array_types = (np.ndarray,)  # trailing comma makes this a one-element tuple

    
class Variable:

    def __init__(self, data, name=None):
        if data is not None:
            if not isinstance(data, array_types):
                raise TypeError(f'{type(data)} is not supported')

    def backward(self, retain_grad=False, create_graph=False):
        if self.grad is None:
            # self.grad = np.ones_like(self.data)
            xp = dezero.cuda.get_array_module(self.data)
            self.grad = Variable(xp.ones_like(self.data))

    ...
    
    # -> add new methods!
    
    def to_cpu(self):
        if self.data is not None:
            self.data = dezero.cuda.as_numpy(self.data)
            
    def to_gpu(self):
        if self.data is not None:
            self.data = dezero.cuda.as_cupy(self.data)
```

Now for the `Layer`

```python
class Layer:
    ...
    
    def to_cpu(self):
        for param in self.params():
            param.to_cpu()
        
    def to_gpu(self):
        for param in self.params():
            param.to_gpu()
```

And finally for the `DataLoader`

```python
import math
import numpy as np
from dezero import cuda

class DataLoader:
    def __init__(self, dataset, batch_size, shuffle=True, gpu=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.data_size = len(dataset)
        self.max_iter = math.ceil(self.data_size / batch_size)
        self.gpu = gpu

        self.reset()

    def __next__(self):
        
        ...
        
        xp = cuda.cupy if self.gpu else np
        x = xp.array([example[0] for example in batch])
        y = xp.array([example[1] for example in batch])

        self.iteration += 1

        return x, y

    def to_cpu(self):
        self.gpu = False
        
    def to_gpu(self):
        self.gpu = True
```

### Modify `np.xxx` usage

Change the functions so that they work with both `cupy` and `numpy`

```python
from dezero import cuda

xp = cuda.get_array_module(x)
y = xp.sin(x)
```

- **`functions.py`**
    - Sin
    - Cos
    - Tanh
    - Exp
    - Log
    
    - GetItemGrad
        - numpy - `np.add.at`
        - cupy - `cp.scatter_add`
    - Sigmoid
    - ReLU
    - Softmax
    - SoftmaxCrossEntropy
    - logsoftmax
    - Clip
    
- **`layers.py`**
    - Linear
    
- **`optimizers.py`**
    - MomentumSGD
    - AdaGrad
    - RMSProp
    - AdaDelta
    - Adam

**cp.scatter_add**

In [35]:
import cupy as cp

test = cp.array([0, 0, 0])
cp.scatter_add(test, 1, 1)
test

array([0, 1, 0])
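
Why `GetItemGrad` needs a scatter-add rather than a plain indexed `+=` can be seen with `numpy` alone. The sketch below uses an arbitrary index array in which one index is repeated; it is an illustration, not `dezero` code. Note also that, depending on the installed CuPy version, the GPU counterpart may be exposed as `cupyx.scatter_add` instead of `cp.scatter_add`, so check your version.

```python
import numpy as np

idx = np.array([0, 0, 2])  # index 0 appears twice

# Buffered fancy-index update: the duplicate index is applied only once.
a = np.zeros(3)
a[idx] += 1
print(a)  # [1. 0. 1.]

# Unbuffered update: every occurrence is accumulated.
b = np.zeros(3)
np.add.at(b, idx, 1)
print(b)  # [2. 0. 1.]
```

When the forward pass picks the same element more than once, its gradient has to be the sum of all contributions, which is the behavior of `np.add.at`.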

### Modify `core.py`

Change the functions so that they work with both `cupy` and `numpy`

```python
def as_array(x, array_module=np):
    if np.isscalar(x):
        return array_module.array(x)
    return x
```

In [5]:
# np.isscalar returns True only for scalar values, not for 0-dimensional arrays
np.isscalar(cp.array(1))

False

In [6]:
np.isscalar(1)

True
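
A small `numpy`-only sketch of the case `as_array` is meant to catch: an operation on a 0-dimensional array returns a NumPy scalar rather than an `ndarray`. The function body is copied from the block above; the input value is arbitrary.

```python
import numpy as np


def as_array(x, array_module=np):
    if np.isscalar(x):
        return array_module.array(x)
    return x


x = np.array(2.0)      # 0-dimensional array
y = x ** 2             # NumPy returns a scalar (numpy.float64), not an ndarray
print(np.isscalar(y))  # True

y = as_array(y)        # wrapped back into a 0-dimensional ndarray
print(type(y), y.ndim)
```

Passing `array_module=cp` makes the same function produce a `cupy` array, which keeps the result on the same device as the other operand.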

### arithmetic operations

```python
def add(x0, x1):
    x1 = as_array(x1, array_module=dezero.cuda.get_array_module(x0.data))
    return Add()(x0, x1)
```

- add
- sub
- rsub
- mul
- div
- rdiv

## `MNIST` with `GPU`

In [7]:
import time
import math
import numpy as np

import dezero
from dezero import optimizers
from dezero import DataLoader

import dezero.functions as F
from dezero.models import MLP
from dezero.datasets import MNIST

In [2]:
max_epoch = 5
batch_size = 100
hidden_size = 1000

train_set = MNIST(train=True)
test_set = MNIST(train=False)

train_loader = DataLoader(train_set, batch_size)
test_loader = DataLoader(test_set, batch_size, shuffle=False)

model = MLP((hidden_size, 10))
optimizer = optimizers.SGD(lr=0.1).setup(model)

### with `gpu`

In [3]:
if dezero.cuda.gpu_enable:
    train_loader.to_gpu()
    test_loader.to_gpu()
    model.to_gpu()

In [4]:
epoch_list = []

train_loss_list = []
test_loss_list = []

train_acc_list = []
test_acc_list = []


for epoch in range(max_epoch):
    start = time.time()
    sum_loss, sum_acc = 0, 0
    
    for x, y in train_loader:
        
        y_pred = model(x)
        
        loss = F.softmax_cross_entropy(y_pred, y)
        acc = F.accuracy(y_pred, y)
        
        model.cleargrads()
        loss.backward()
        optimizer.update()
        
        sum_loss += float(loss.data) * len(y)
        sum_acc += float(acc.data) * len(y)
    
    avg_loss = sum_loss / len(train_set)
    avg_acc = sum_acc / len(train_set)

    train_loss_list.append(avg_loss)
    train_acc_list.append(avg_acc)

    elapsed_time = time.time() - start
    
    print('epoch : {}'.format(epoch + 1))
    print('train loss: {:.4f}, accuracy: {:.4f}, time: {:.4f}[sec]'.format(avg_loss, avg_acc, elapsed_time))
    
    sum_loss, sum_acc = 0, 0
    
    with dezero.no_grad():
        for x, y in test_loader:
            y_pred = model(x)
            
            
            loss = F.softmax_cross_entropy(y_pred, y)
            acc = F.accuracy(y_pred, y)
            
            sum_loss += float(loss.data) * len(y)
            sum_acc += float(acc.data) * len(y)
            
    avg_loss = sum_loss / len(test_set)
    avg_acc = sum_acc / len(test_set)            
    
    test_loss_list.append(avg_loss)
    test_acc_list.append(avg_acc)

    print('test loss: {:.4f}, accuracy: {:.4f}'.format(avg_loss, avg_acc))
    
    epoch_list.append(epoch + 1)    

epoch : 1
train loss: 0.9686, accuracy: 0.7228, time: 4.7941[sec]
test loss: 0.4324, accuracy: 0.8818
epoch : 2
train loss: 0.4005, accuracy: 0.8855, time: 4.3793[sec]
test loss: 0.3684, accuracy: 0.8939
epoch : 3
train loss: 0.3511, accuracy: 0.8978, time: 4.3744[sec]
test loss: 0.3292, accuracy: 0.9045
epoch : 4
train loss: 0.3287, accuracy: 0.9054, time: 4.3783[sec]
test loss: 0.3230, accuracy: 0.9090
epoch : 5
train loss: 0.3165, accuracy: 0.9087, time: 4.3444[sec]
test loss: 0.2991, accuracy: 0.9139


### with `cpu`

In [5]:
train_loader.to_cpu()
test_loader.to_cpu()
model.to_cpu()

In [6]:
epoch_list = []

train_loss_list = []
test_loss_list = []

train_acc_list = []
test_acc_list = []


for epoch in range(max_epoch):
    start = time.time()
    sum_loss, sum_acc = 0, 0
    
    for x, y in train_loader:
        
        y_pred = model(x)
        
        loss = F.softmax_cross_entropy(y_pred, y)
        acc = F.accuracy(y_pred, y)
        
        model.cleargrads()
        loss.backward()
        optimizer.update()
        
        sum_loss += float(loss.data) * len(y)
        sum_acc += float(acc.data) * len(y)
    
    avg_loss = sum_loss / len(train_set)
    avg_acc = sum_acc / len(train_set)

    train_loss_list.append(avg_loss)
    train_acc_list.append(avg_acc)

    elapsed_time = time.time() - start
    
    print('epoch : {}'.format(epoch + 1))
    print('train loss: {:.4f}, accuracy: {:.4f}, time: {:.4f}[sec]'.format(avg_loss, avg_acc, elapsed_time))
    
    sum_loss, sum_acc = 0, 0
    
    with dezero.no_grad():
        for x, y in test_loader:
            y_pred = model(x)
            
            
            loss = F.softmax_cross_entropy(y_pred, y)
            acc = F.accuracy(y_pred, y)
            
            sum_loss += float(loss.data) * len(y)
            sum_acc += float(acc.data) * len(y)
            
    avg_loss = sum_loss / len(test_set)
    avg_acc = sum_acc / len(test_set)            
    
    test_loss_list.append(avg_loss)
    test_acc_list.append(avg_acc)

    print('test loss: {:.4f}, accuracy: {:.4f}'.format(avg_loss, avg_acc))
    
    epoch_list.append(epoch + 1)    

epoch : 1
train loss: 0.3054, accuracy: 0.9118, time: 5.3513[sec]
test loss: 0.2879, accuracy: 0.9174
epoch : 2
train loss: 0.2983, accuracy: 0.9133, time: 5.6289[sec]
test loss: 0.2853, accuracy: 0.9178
epoch : 3
train loss: 0.2923, accuracy: 0.9153, time: 6.0089[sec]
test loss: 0.2871, accuracy: 0.9163
epoch : 4
train loss: 0.2865, accuracy: 0.9175, time: 5.4624[sec]
test loss: 0.2760, accuracy: 0.9208
epoch : 5
train loss: 0.2814, accuracy: 0.9191, time: 5.6888[sec]
test loss: 0.2713, accuracy: 0.9224


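### `gpu` - about 4.4 sec, `cpu` - about 5.4 to 6.0 sec per epoch

There is a difference, but it is not that big here :0

Note that the `cpu` run continues training the same model, so its loss starts from where the `gpu` run ended; only the times are comparable between the two runs.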
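
To see how the gap depends on problem size, a rough timing sketch like the one below can help. It is an assumption-laden illustration: the helper `time_matmul`, the matrix sizes, and the repeat count are made up for this example, and it times only `numpy` when `cupy` is not installed. CuPy launches GPU work asynchronously, so the sketch synchronizes before reading the clock.

```python
import time
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None


def time_matmul(xp, n, repeat=3):
    """Return the best wall-clock time (seconds) of an n x n matmul using xp."""
    a = xp.ones((n, n), dtype=xp.float32)
    b = xp.ones((n, n), dtype=xp.float32)
    best = float('inf')
    for _ in range(repeat):
        start = time.perf_counter()
        c = a @ b
        if xp is not np:
            # Wait for the GPU to finish before stopping the timer.
            cp.cuda.Stream.null.synchronize()
        best = min(best, time.perf_counter() - start)
    return best


for n in (128, 512):
    line = 'n={}: numpy {:.5f}s'.format(n, time_matmul(np, n))
    if cp is not None:
        line += ', cupy {:.5f}s'.format(time_matmul(cp, n))
    print(line)
```

Taking the best of several repeats keeps one-time start-up costs on the GPU side from dominating the measurement.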