# C++/CUDA extensions for Python

## Introduction

In [1]:
!lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           24
Model name:                      AMD Ryzen 7 3750H with Radeon Vega Mobile Gfx
Stepping:                        1
CPU MHz:                         2295.612
BogoMIPS:                        4591.22
Virtualization:                  AMD-V
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       128 KiB
L1i cache:                       256 KiB
L2 cache:                        2 MiB
L3 cache:                        4 MiB
Vul

In [2]:
!nvidia-smi

Fri Feb 18 06:26:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 511.65       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   50C    P5    12W /  N/A |    295MiB /  6144MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

*Latency numbers every programmer should know* (Jeff Dean):

**L1 cache reference 0.5 ns**

**L2 cache reference 7 ns**

**Main memory reference 100 ns**
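
To put these latencies in perspective, they can be converted into clock cycles. The sketch below assumes the clock frequency reported by `lscpu` above (2295.612 MHz); the helper name `ns_to_cycles` is purely illustrative.

```python
# Clock frequency taken from the `lscpu` output above (CPU MHz: 2295.612).
CPU_GHZ = 2295.612 / 1000.0

# Latencies quoted above, in nanoseconds.
LATENCIES_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7.0,
    "Main memory reference": 100.0,
}


def ns_to_cycles(ns, ghz=CPU_GHZ):
    """Convert a latency in nanoseconds to clock cycles (1 GHz = 1 cycle per ns)."""
    return ns * ghz


cycles = {name: ns_to_cycles(ns) for name, ns in LATENCIES_NS.items()}
for name, c in cycles.items():
    print(f"{name}: ~{c:.0f} cycles")
```

At this clock rate a main memory reference costs roughly 230 cycles, which is why keeping data in cache matters so much for CPU code.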

![CPUCUDA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/gpu-devotes-more-transistors-to-data-processing.png)

## PyTorch extensions

More info:
* [PyTorch C++ tutorial](https://pytorch.org/tutorials/advanced/cpp_extension.html)
* [Pybind11 docs](https://pybind11.readthedocs.io/en/stable/basics.html)

In [None]:
!gcc --version
!g++ --version

### Set-up

If you are on Google Colab, execute:
```
!pip install Ninja
!add-apt-repository ppa:ubuntu-toolchain-r/test -y
!apt update
!apt upgrade -y
!apt install gcc-9 g++-9
!update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 100 --slave /usr/bin/g++ g++ /usr/bin/g++-9
```

In [3]:
import torch
from torch.utils.cpp_extension import load
print(torch.__config__.show())
print(torch.__config__.parallel_info())

PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initi

```cpp
//cpp_intro.cc file

#include <torch/extension.h>

torch::Tensor get_rotations(const torch::Tensor &thetas) {
    const auto f = thetas.flatten();
    const auto n = f.numel();
    const auto c = torch::cos(f);
    const auto s = torch::sin(f);
    return torch::stack({c, -s, s, c}).t().view({n, 2, 2});
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("get_rotations", &get_rotations, py::call_guard<py::gil_scoped_release>(),
          "Generate 2D rotations given angles thetas");
}
```

In [4]:
!mkdir -p build

In [5]:
cpp_intro = load(name='cpp_intro',
             build_directory='./build',
             sources=['cpp_intro.cc'],
             extra_cflags=['-Wall', '-Wextra', '-Wpedantic', '-O3', '-std=c++17'],
             verbose=True)

Emitting ninja build file ./build/build.ninja...
Building extension module cpp_intro...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpp_intro.o.d -DTORCH_EXTENSION_NAME=cpp_intro -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/rolan/miniconda3/envs/noa/lib/python3.9/site-packages/torch/include -isystem /home/rolan/miniconda3/envs/noa/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/rolan/miniconda3/envs/noa/lib/python3.9/site-packages/torch/include/TH -isystem /home/rolan/miniconda3/envs/noa/lib/python3.9/site-packages/torch/include/THC -isystem /home/rolan/miniconda3/envs/noa/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -Wall -Wextra -Wpedantic -O3 -std=c++17 -c /home/rolan/devspace/numerics-2021/cpp_intro.cc -o cpp_intro.o 
[2/2] c++ cpp_intro.o -s

In [6]:
N = 3
PI = 2. * torch.acos(torch.tensor(0.))
thetas = 0.05 * PI * (torch.rand(N) - 0.5) # example of angles in radians
rots = cpp_intro.get_rotations(thetas)
rots

tensor([[[ 1.0000,  0.0089],
         [-0.0089,  1.0000]],

        [[ 0.9996, -0.0287],
         [ 0.0287,  0.9996]],

        [[ 0.9976, -0.0700],
         [ 0.0700,  0.9976]]])

In [7]:
torch.dist(rots.matmul(rots.transpose(-1,-2)), torch.eye(2))

tensor(8.4294e-08)
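
As a cross-check, the same construction can be written directly in Python. The sketch below mirrors `get_rotations` from `cpp_intro.cc` line by line; the name `get_rotations_ref` is hypothetical, and the code does not need the compiled extension.

```python
import math

import torch


def get_rotations_ref(thetas: torch.Tensor) -> torch.Tensor:
    """Pure-PyTorch mirror of the C++ get_rotations: one 2x2 rotation per angle."""
    f = thetas.flatten()
    n = f.numel()
    c = torch.cos(f)
    s = torch.sin(f)
    # Stack (c, -s, s, c) into shape (4, n), transpose to (n, 4),
    # then reinterpret each row as a 2x2 matrix [[c, -s], [s, c]].
    return torch.stack([c, -s, s, c]).t().view(n, 2, 2)


thetas = torch.tensor([0.0, math.pi / 2])
rots = get_rotations_ref(thetas)

# Rotation matrices are orthogonal and have unit determinant.
identity = torch.eye(2).expand(2, 2, 2)
assert torch.allclose(rots @ rots.transpose(-1, -2), identity, atol=1e-6)
assert torch.allclose(torch.det(rots), torch.ones(2), atol=1e-6)
```

If the extension is loaded, `torch.allclose(cpp_intro.get_rotations(thetas), get_rotations_ref(thetas))` should agree up to floating-point rounding.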

## Heterogeneous computing with TNL 

Tutorials worth working through include: 
* [TNL tutorials](https://mmg-gitlab.fjfi.cvut.cz/doc/tnl/md_Tutorials_index.html#Tutorials)
* [CUDA made easy](https://developer.nvidia.com/blog/even-easier-introduction-cuda)
* [CUDA guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)

In [8]:
torch.cuda.is_available()

True

In [9]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0


![multithreading](https://randu.org/tutorials/threads/images/process.png)

![sm](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/automatic-scalability.png)

![blocks](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png)

![CUDA](https://developer-blogs.nvidia.com/wp-content/uploads/2017/01/cuda_indexing.png)
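
The last figure shows how each thread derives a global index from its block and thread coordinates. The sketch below emulates a one-dimensional launch sequentially in plain Python to make that arithmetic concrete; `launch_1d` and `add_kernel` are illustrative names, not part of CUDA or TNL, and a real GPU runs the threads in parallel.

```python
def launch_1d(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a 1D CUDA launch: call kernel once per (block, thread)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def add_kernel(block_idx, block_dim, thread_idx, x, y, out):
    # Global index, as in CUDA: blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard: the last block may extend past the data
        out[i] = x[i] + y[i]


n = 10
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceiling division: enough blocks to cover n

x = list(range(n))
y = [10 * v for v in x]
out = [None] * n
launch_1d(add_kernel, grid_dim, block_dim, x, y, out)
print(grid_dim, out)
```

The bounds check matters because the grid is rounded up: here 3 blocks of 4 threads give 12 threads for 10 elements, so the last two threads must do nothing.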

A version of [TNL](https://mmg-gitlab.fjfi.cvut.cz/gitlab/tnl/tnl-dev) compatible with PyTorch is also available as a third-party library within [NOA](https://github.com/grinisrit/noa).

In [None]:
!git clone https://github.com/grinisrit/noa.git

In [10]:
noa_location = 'noa'

In [11]:
tnl_intro = load(name='tnl_intro',
             build_directory='./build',
             sources=['tnl-intro.cc'],
             extra_include_paths=[f'{noa_location}/src', '.'],    
             extra_cflags=['-O3', '-std=c++17', '-fopenmp'],
             verbose=True)

Emitting ninja build file ./build/build.ninja...
Building extension module tnl_intro...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF tnl-intro.o.d -DTORCH_EXTENSION_NAME=tnl_intro -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/rolan/devspace/numerics-2021/noa/src -I/home/rolan/devspace/numerics-2021 -isystem /home/rolan/miniconda3/envs/noa/lib/python3.9/site-packages/torch/include -isystem /home/rolan/miniconda3/envs/noa/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/rolan/miniconda3/envs/noa/lib/python3.9/site-packages/torch/include/TH -isystem /home/rolan/miniconda3/envs/noa/lib/python3.9/site-packages/torch/include/THC -isystem /home/rolan/miniconda3/envs/noa/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++17 -fopenmp -c /home/rolan/devspace/n

In [12]:
tnl_intro_cuda = load(name='tnl_intro_cuda',
             build_directory='./build',
             sources=['tnl-intro.cu'],
             extra_include_paths=[f'{noa_location}/src', '.'],    
             extra_cflags=['-O3', '-std=c++17'],
             extra_cuda_cflags=['-std=c++17', '--expt-relaxed-constexpr', '--expt-extended-lambda'],
             verbose=True) if torch.cuda.is_available() else None

Detected CUDA files, patching ldflags
Emitting ninja build file ./build/build.ninja...
Building extension module tnl_intro_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module tnl_intro_cuda...


In [13]:
t = torch.randn(10000000)
t_cuda = t.cuda()

In [14]:
tnl_intro.map_reduce(t)

-953.7388916015625

In [15]:
tnl_intro.map_reduce(t_cuda)

RuntimeError: CPU tensor required

In [16]:
tnl_intro_cuda.map_reduce(t_cuda)

-953.7645263671875

In [17]:
%timeit tnl_intro.map_reduce(t)

6.29 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [18]:
%timeit tnl_intro_cuda.map_reduce(t_cuda)

457 µs ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
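
The two `%timeit` results above imply a speedup that is easy to compute by hand. The sketch below does that arithmetic and adds a small timing helper for use outside a notebook; `time_call` and its `sync` parameter are hypothetical names introduced for illustration.

```python
import time


def time_call(fn, *args, repeats=10, sync=None):
    """Mean wall-clock seconds per call.

    `sync`, if given, is called before the clock is started and again before
    it is read, so that asynchronous work is included in the measurement.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / repeats


# Speedup implied by the two %timeit results above.
cpu_seconds = 6.29e-3   # tnl_intro.map_reduce(t)
cuda_seconds = 457e-6   # tnl_intro_cuda.map_reduce(t_cuda)
speedup = cpu_seconds / cuda_seconds

# CPU-only usage example of the helper.
elapsed = time_call(sum, range(100_000))
print(f"speedup ~ {speedup:.1f}x; sum(range(100000)) took {elapsed * 1e6:.1f} us per call")
```

For GPU code one would typically pass `sync=torch.cuda.synchronize`. Note also that the CUDA timing above starts from `t_cuda`, so the host-to-device copy performed by `t.cuda()` is not part of the measured 457 µs.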
