## CUDA basic functions

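### Function type qualifiers
\_\_global\_\_: runs on the device and is called from the host (GPUs that support dynamic parallelism can also launch it from the device). The return type must be void, variadic arguments are not supported, and it cannot be a class member function. Note that a kernel defined with \_\_global\_\_ is asynchronous: the host does not wait for the kernel to finish before moving on to the next statement.  
\_\_device\_\_: runs on the device and can only be called from the device. It cannot be combined with \_\_global\_\_.  
\_\_host\_\_: runs on the host and can only be called from the host. It is usually omitted. It cannot be combined with \_\_global\_\_, but it can be combined with \_\_device\_\_, in which case the function is compiled for both the device and the host.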
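
The qualifier rules above can be sketched in a small program. This is a minimal illustration, not code from the course: the names `square`, `addOne`, and `transform` are hypothetical, and the launch configuration of one block of 8 threads is chosen only to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Compiled for both host and device, so it is callable from either side.
__host__ __device__ inline int square(int x) { return x * x; }

// Device-only helper (hypothetical name): callable from kernels and other device functions.
__device__ int addOne(int x) { return x + 1; }

// Kernel (hypothetical name): runs on the device, launched from the host, returns void.
__global__ void transform(int *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = addOne(square(data[i]));
}

int main() {
  const int n = 8;
  int *data = nullptr;
  if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) {
    printf("cudaMallocManaged failed\n");
    return 1;
  }
  for (int i = 0; i < n; ++i) data[i] = i;

  // The launch returns immediately; the kernel runs asynchronously.
  transform<<<1, n>>>(data, n);

  // Wait for the kernel before the host reads the results.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    cudaFree(data);
    return 1;
  }

  // The host can call square() too, because it is __host__ __device__.
  bool ok = true;
  for (int i = 0; i < n; ++i) ok = ok && (data[i] == square(i) + 1);
  printf(ok ? "OK\n" : "MISMATCH\n");

  cudaFree(data);
  return ok ? 0 : 1;
}
```

Without the `cudaDeviceSynchronize()` call, the host loop could read `data` before the kernel has finished writing it.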

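### Memory management
Allocate device memory: cudaMalloc  
Allocate host memory (pinned, i.e. page-locked): cudaMallocHost  
Allocate unified memory accessible from both the CPU and the GPU: cudaMallocManaged  
Copy between CPU and GPU memory (copy directions include cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost): cudaMemcpy  
Asynchronously prefetch data: cudaMemPrefetchAsync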

```
void foo(cudaStream_t s) {
  char *data;
  cudaMallocManaged(&data, N);
  init_data(data, N);                                   // execute on CPU
  cudaMemPrefetchAsync(data, N, myGpuId, s);            // prefetch to GPU
  mykernel<<<..., s>>>(data, N, 1, compare);            // execute on GPU
  cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);    // prefetch to CPU
  cudaStreamSynchronize(s);
  use_data(data, N);
  cudaFree(data);
}
```
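
For contrast with the managed-memory example above, the explicit allocation and copy calls listed in this section can be sketched as follows. The kernel `doubleElements`, the buffer names `h_a` and `d_a`, and the sizes are assumptions made for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element of the array.
__global__ void doubleElements(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20;
  const size_t bytes = n * sizeof(float);

  float *h_a = nullptr;  // host buffer (pinned)
  float *d_a = nullptr;  // device buffer
  if (cudaMallocHost(&h_a, bytes) != cudaSuccess ||
      cudaMalloc(&d_a, bytes) != cudaSuccess) {
    printf("allocation failed\n");
    return 1;
  }

  for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

  // Host -> device copy, kernel launch, device -> host copy.
  cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  doubleElements<<<blocks, threads>>>(d_a, n);
  // This cudaMemcpy runs after the kernel in the default stream and
  // blocks the host until the copy is complete.
  cudaError_t err = cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  bool ok = true;
  for (int i = 0; i < n; ++i) ok = ok && (h_a[i] == 2.0f);
  printf(ok ? "OK\n" : "MISMATCH\n");

  cudaFree(d_a);      // pairs with cudaMalloc
  cudaFreeHost(h_a);  // pairs with cudaMallocHost
  return ok ? 0 : 1;
}
```

Memory from `cudaMallocHost` must be released with `cudaFreeHost`, not `cudaFree`.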

### CUDA stream
```
cudaStream_t stream;       // CUDA streams are of type `cudaStream_t`.
cudaStreamCreate(&stream); // Note that a pointer must be passed to `cudaStreamCreate`.
someKernel<<<number_of_blocks, threads_per_block, 0, stream>>>(); // `stream` is passed as the 4th execution configuration (EC) argument.
cudaStreamDestroy(stream); // Note that a value, not a pointer, is passed to `cudaStreamDestroy`.
```
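
A fuller version of the snippet above, as a sketch: it launches the same kernel in several non-default streams so the launches are allowed to overlap, then waits for all of them. The kernel `fill`, the stream count, and the buffer size are hypothetical choices, not part of the original notes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes `value` into every element of `a`.
__global__ void fill(int *a, int n, int value) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] = value;
}

int main() {
  const int numStreams = 3;
  const int n = 1 << 16;
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  cudaStream_t streams[numStreams];
  int *buffers[numStreams];

  // Create the streams and allocate one buffer per stream before launching any work.
  for (int s = 0; s < numStreams; ++s) {
    cudaStreamCreate(&streams[s]);
    if (cudaMallocManaged(&buffers[s], n * sizeof(int)) != cudaSuccess) {
      printf("cudaMallocManaged failed\n");
      return 1;
    }
  }

  // Each launch goes to its own stream (4th execution configuration argument),
  // so the kernels may run concurrently.
  for (int s = 0; s < numStreams; ++s) {
    fill<<<blocks, threads, 0, streams[s]>>>(buffers[s], n, s);
  }

  // Wait for the work in all streams before the host touches the buffers.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  bool ok = true;
  for (int s = 0; s < numStreams; ++s) {
    for (int i = 0; i < n; ++i) ok = ok && (buffers[s][i] == s);
  }
  printf(ok ? "OK\n" : "MISMATCH\n");

  for (int s = 0; s < numStreams; ++s) {
    cudaStreamDestroy(streams[s]);
    cudaFree(buffers[s]);
  }
  return ok ? 0 : 1;
}
```

To wait on a single stream instead of the whole device, `cudaStreamSynchronize(streams[s])` can be used, as in the prefetch example earlier.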

### Tools
Install Nsight Systems: https://developer.nvidia.com/nsight-systems/get-started#platforms
Configure nsys for Jupyter: https://pypi.org/project/jupyterlab-nvidia-nsight/


In [18]:
!nvcc -o 01-vector-add 01-vector-add.cu -run
!nsys nvprof ./01-vector-add
!rm report*

Device ID: 0	Number of SMs: 20
Error: invalid device ordinal
Success! All values calculated correctly.

Collecting data...
Device ID: 0	Number of SMs: 20
Error: invalid device ordinal
Success! All values calculated correctly.
Generating '/tmp/nsys-report-9184.qdstrm'
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /home/chenpeng/github/HPC-Practice/HPC-Practice/course/CUDA/notes/01-cuda基础/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)    Med (ns)   Min (ns)  Max (ns)   StdDev (ns)          Name         
 --------  ---------------  ---------  ----------  ----------  --------  ---------  -----------  ---------------------
     72.1        292474470          3  97491490.0  51344784.0  48064180  193065506   82785777.7  cudaMallocManaged    
     16.0         65053983          1  65053983.0  65053983.0  65053983   65053983          0.0  cudaDeviceSynchronize
     11.1         4489