# CUDA Semantics

### `torch.cuda` is used to set up and run CUDA operations. It keeps track of the currently selected GPU, and all CUDA tensors you allocate will by default be created on that device. The selected device can be changed with a `torch.cuda.device` context manager.

In [13]:
import argparse

import torch

In [14]:
# Some examples
cuda_s = torch.device('cuda') # default CUDA device
cuda_0 = torch.device('cuda:0')
cuda_1 = torch.device('cuda:1') # second GPU (index 1)
print(cuda_s, cuda_0, cuda_1)

x = torch.tensor([1., 2.], device=cuda_0)
print(x.device)
y = torch.tensor([1., 2.]).cuda()
print(y.device)

cuda cuda:0 cuda:1
cuda:0
cuda:0


In [15]:
with torch.cuda.device(0): # Select CUDA device 0
    # allocates a tensor on GPU 0
    a = torch.tensor([1., 2.], device=cuda_s)
    print(a.device)
    
    # transfers a tensor from CPU to GPU 0
    b = torch.tensor([1., 2.]).cuda()
    print(b.device)
    
    # you can also use ``Tensor.to`` to transfer a tensor
    b2 = torch.tensor([1., 2.]).to(device=cuda_s)
    print(b2.device)
    
    c = a + b
    # c.device is device(type='cuda', index=0)
    print(c.device)

cuda:0
cuda:0
cuda:0
cuda:0
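To make the "currently selected GPU" bookkeeping visible, here is a minimal sketch that records `torch.cuda.current_device()` before, inside, and after the context manager. The helper name `device_indices_seen` is hypothetical, and the function returns `None` when the requested GPU is not present, so it also runs on a CPU-only machine.

```python
import torch


def device_indices_seen(target_index):
    """Return (before, inside, after) current-device indices.

    Returns None when CUDA is unavailable or the GPU index does not exist.
    """
    if not torch.cuda.is_available() or torch.cuda.device_count() <= target_index:
        return None
    before = torch.cuda.current_device()
    with torch.cuda.device(target_index):
        # Inside the block, the selected device is target_index
        inside = torch.cuda.current_device()
    # On exit, the previously selected device is restored
    after = torch.cuda.current_device()
    return before, inside, after


result = device_indices_seen(0)
print(result)
```

On a machine with at least one GPU, the middle value should equal the index passed in, and the first and last values should match each other.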


# Best practices 

## Device-agnostic code 

### Due to the structure of PyTorch, you may need to explicitly write device-agnostic (CPU or GPU) code:
* Determine whether the GPU should be used.
* Move tensors to the chosen device (CPU or CUDA).

In [20]:
parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true', help='Whether to disable CUDA')
args = parser.parse_args()

# Use CUDA only if it was not disabled and a GPU is actually available
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')

# Now that we have args.device, we can use it to create a Tensor on the desired device
x = torch.empty((10, 5), device=args.device)
# Network is not defined in this cell; it stands for any user-defined torch.nn.Module subclass
net = Network().to(device=args.device)
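The same pattern can be sketched as a self-contained script. Two things here are assumptions rather than part of the original cell: `nn.Linear(5, 2)` is a hypothetical stand-in for `Network`, and the argument list is passed to `parse_args` explicitly, because inside a notebook `parse_args()` with no arguments reads the kernel's own command line and typically fails. The helper name `select_device` is also illustrative.

```python
import argparse

import torch
from torch import nn


def select_device(disable_cuda: bool) -> torch.device:
    """Pick CUDA when allowed and available, otherwise fall back to CPU."""
    if not disable_cuda and torch.cuda.is_available():
        return torch.device('cuda')
    return torch.device('cpu')


parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true', help='Whether to disable CUDA')
# Explicit argument list (assumption): forces the CPU path and avoids reading sys.argv
args = parser.parse_args(['--disable-cuda'])
device = select_device(args.disable_cuda)

# nn.Linear is a hypothetical stand-in for the undefined Network class
net = nn.Linear(5, 2).to(device)
x = torch.zeros((10, 5), device=device)
out = net(x)
print(device, out.shape)
```

Because both the module and the input are created from the same `device` object, the rest of the script never needs to mention CPU or CUDA by name.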

### Use pinned memory buffers 
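* pageable memory (allocated on the host by the operating-system API `malloc()`; it can be paged and swapped out)
* page-locked or pinned memory (allocated in host memory by the CUDA function `cudaHostAlloc()`; the host operating system will not page or swap this memory, so it always stays resident in physical memory)
* The GPU knows the physical address of page-locked memory, so it can copy data directly between the host and the GPU using Direct Memory Access (DMA), which is fast.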
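As a rough illustration of the two kinds of host memory, the sketch below pins a tensor with `Tensor.pin_memory()` and copies it to the GPU with `non_blocking=True`. The variable names are illustrative, and the CPU-only fallback branch is an assumption added so the snippet also runs where no GPU is present (pinning requires CUDA).

```python
import torch

# An ordinary host tensor lives in pageable memory
host = torch.arange(6, dtype=torch.float32).reshape(2, 3)

used_pinned = False
if torch.cuda.is_available():
    # Copy the data into page-locked (pinned) host memory
    pinned = host.pin_memory()
    used_pinned = pinned.is_pinned()
    # From pinned memory, the host-to-GPU copy can run asynchronously
    on_gpu = pinned.to('cuda', non_blocking=True)
    # Wait for the copy to finish before reading the result back
    torch.cuda.synchronize()
    roundtrip = on_gpu.cpu()
else:
    # Fallback (assumption): without CUDA there is nothing to pin or transfer
    roundtrip = host.clone()

print(used_pinned, torch.equal(roundtrip, host))
```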

In [21]:
# You can make the DataLoader return batches placed in pinned memory by passing pin_memory=True to its constructor.
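
A minimal sketch of that option, assuming a small synthetic `TensorDataset` (the data and variable names are illustrative). Pinning is enabled only when CUDA is available, so the snippet also runs on a CPU-only machine.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data: 10 samples with 2 features each, plus integer labels
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Only request pinned batches when a GPU is present to receive them
use_pinned = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=4, pin_memory=use_pinned)

device = torch.device('cuda' if use_pinned else 'cpu')
batch_sizes = []
for xb, yb in loader:
    # With pinned batches, non_blocking=True lets the copy overlap with other work
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    batch_sizes.append(xb.shape[0])

print(batch_sizes)
```

With 10 samples and `batch_size=4`, the loader yields two full batches and one batch of 2.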

### Use nn.DataParallel instead of multiprocessing  