# 5.6.1. Computing Devices

In [2]:
import torch
from torch import nn

torch.device('cpu'), torch.device('cuda'), torch.device('cuda:1')

(device(type='cpu'), device(type='cuda'), device(type='cuda', index=1))

In [3]:
torch.cuda.device_count()

0

In [4]:
def try_gpu(i=0):
    """Return gpu(i) if it exists, otherwise return cpu()."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')


def try_all_gpus():
    """Return all available GPUs, or [cpu()] if no GPU exists."""
    devices = [
        torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())
    ]
    return devices if devices else [torch.device('cpu')]


try_gpu(), try_gpu(10), try_all_gpus()

(device(type='cpu'), device(type='cpu'), [device(type='cpu')])

# 5.6.2. Tensors and GPUs

In [5]:
x = torch.tensor([1, 2, 3])
x.device

device(type='cpu')

## 5.6.2.1. Storage on the GPU

In [8]:
X = torch.ones(2, 3, device=try_gpu())
X

tensor([[1., 1., 1.],
        [1., 1., 1.]])

In [9]:
Y = torch.rand(2, 3, device=try_gpu(1))
Y

tensor([[0.3085, 0.1741, 0.2712],
        [0.9307, 0.7564, 0.5435]])

## 5.6.2.2. Copying

If we want to compute X + Y, we need to decide where to perform this operation. For instance, as shown in Fig. 5.6.1, we can transfer X to the second GPU and perform the operation there. Do not simply add X and Y, since this will result in an exception. The runtime engine would not know what to do: it cannot find data on the same device and it fails. Since Y lives on the second GPU, we need to move X there before we can add the two.

![image.png](attachment:image.png)
Fig. 5.6.1 Copy data to perform an operation on the same device.
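
The move-then-add pattern described above can be sketched as follows. This is a minimal sketch, not the only way to do it: it redefines the try_gpu helper from Section 5.6.1 so that the block runs on its own, and it falls back to the CPU when fewer than two GPUs are present, in which case no cross-device copy actually takes place.

```python
import torch


def try_gpu(i=0):
    """Return gpu(i) if it exists, otherwise return cpu()."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')


X = torch.ones(2, 3, device=try_gpu(0))
Y = torch.rand(2, 3, device=try_gpu(1))

# Move X to wherever Y lives; .to() returns X itself if it is already there.
Z = X.to(Y.device)

# Both operands now share a device, so the addition is well defined.
result = Y + Z
print(result.device)
```

Using Y.device as the target, rather than a hard-coded index, keeps the code valid on machines with zero, one, or several GPUs.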


In [11]:
'''
Z = X.cuda(1)
print(X)
print(Z)

=> X is on cuda:0, Z is on cuda:1

Z.cuda(1) returns Z itself.
'''

'\nZ = X.cuda(1)\nprint(X)\nprint(Z)\n\n=> X is on cuda:0, Z is on cuda:1\n\nZ.cuda(1) returns Z itself.\n'

People use GPUs to do machine learning because they expect them to be fast. But transferring variables between devices is slow. So we want you to be 100% certain that you want to do something slow before we let you do it. If the deep learning framework just did the copy automatically without crashing then you might not realize that you had written some slow code.

Also, transferring data between devices (CPU, GPUs, and other machines) is much slower than computation. It also makes parallelization a lot more difficult, since we have to wait for data to be sent (or rather to be received) before we can proceed with more operations. This is why copy operations should be handled with great care. As a rule of thumb, many small operations are much worse than one big operation. Moreover, several operations at a time are much better than many single operations interspersed in the code, unless you know what you are doing. This is because such operations can block if one device has to wait for the other before it can do something else. It is a bit like ordering your coffee in a queue rather than pre-ordering it by phone and finding out that it is ready when you are.
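
To make the "one big operation" rule of thumb concrete, here is a small sketch that contrasts many small transfers with a single batched one. The names chunks, moved_one_by_one, and moved_at_once are illustrative only, and the block does not measure timings, since the actual cost depends on the hardware. On a machine without a GPU both patterns run on the CPU and no transfer occurs.

```python
import torch

# Assumption: use the first GPU if one is available, otherwise stay on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Eight small tensors that start out in main memory.
chunks = [torch.full((4,), float(i)) for i in range(8)]

# Pattern 1: many small transfers, one per chunk.
moved_one_by_one = torch.stack([c.to(device) for c in chunks])

# Pattern 2: assemble on the host first, then do a single larger transfer.
moved_at_once = torch.stack(chunks).to(device)

# Both patterns produce the same tensor; only the number of copies differs.
print(torch.equal(moved_one_by_one, moved_at_once))
```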

Last, when we print tensors or convert tensors to the NumPy format, if the data is not in the main memory, the framework will copy it to the main memory first, resulting in additional transmission overhead. Even worse, it is now subject to the dreaded global interpreter lock that makes everything wait for Python to complete.
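
The sketch below illustrates which calls pull data back into main memory. It assumes NumPy is installed alongside PyTorch, and the variable names are illustrative. When no GPU is present the tensor already lives in main memory, so the conversions involve no device transfer.

```python
import torch

# Assumption: use the first GPU if one is available, otherwise stay on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.arange(6, dtype=torch.float32, device=device).reshape(2, 3)

# A reduction stays on the device; nothing is copied to main memory yet.
total = x.sum()

# Each of the following needs the data in main memory.
host_array = x.cpu().numpy()  # explicit copy to the host, then a NumPy array
value = total.item()          # a single Python float

print(value)  # 15.0
```

Keeping intermediate results such as total on the device and converting once at the end avoids paying the transfer cost on every step.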

# 5.6.3. Neural Networks and GPUs

In [12]:
net = nn.Sequential(nn.Linear(3, 1))
net = net.to(device=try_gpu())

net(X)

tensor([[0.5721],
        [0.5721]], grad_fn=<AddmmBackward>)
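
One way to confirm that the model and its input share a device is to inspect the parameters directly. This is a sketch under the assumption that a single device is used; it picks the device with torch.cuda.is_available() instead of try_gpu() only so that the block is self-contained, and param_devices is an illustrative name.

```python
import torch
from torch import nn

# Assumption: use the first GPU if one is available, otherwise stay on the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

net = nn.Sequential(nn.Linear(3, 1)).to(device)
X = torch.ones(2, 3, device=device)

# Collect the device of every parameter; a single entry means one device.
param_devices = {p.device for p in net.parameters()}

# The output is computed on, and stays on, the same device as the input.
output = net(X)
print(param_devices, output.device)
```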