# C++/CUDA extensions for Python

## Introduction

In [None]:
!lscpu

In [None]:
!nvidia-smi

*Latency numbers every programmer should know* (Jeff Dean):

**L1 cache reference 0.5 ns**

**L2 cache reference 7 ns**

**Main memory reference 100 ns**
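
To put these figures in proportion, here is a small sketch that expresses each latency as a multiple of an L1 cache hit. The dictionary and the helper name `relative_cost` are introduced here for illustration only; the numbers are the three quoted above.

```python
# Latency figures quoted above, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7.0,
    "Main memory reference": 100.0,
}


def relative_cost(latencies, baseline="L1 cache reference"):
    """Express each latency as a multiple of the baseline latency."""
    base = latencies[baseline]
    return {name: ns / base for name, ns in latencies.items()}


ratios = relative_cost(LATENCY_NS)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:g}x an L1 hit")
```

On these numbers a main memory reference costs as much as 200 L1 hits, which is why keeping data close to the compute units matters on both CPU and GPU.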

![CPUCUDA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/gpu-devotes-more-transistors-to-data-processing.png)

## PyTorch extensions

More info:
* [PyTorch C++ tutorial](https://pytorch.org/tutorials/advanced/cpp_extension.html)
* [Pybind11 docs](https://pybind11.readthedocs.io/en/stable/basics.html)

In [None]:
!gcc --version
!g++ --version

### Set-up

If you are on Google Colab, execute:
```
!pip install ninja
!add-apt-repository ppa:ubuntu-toolchain-r/test -y
!apt update
!apt upgrade -y
!apt install gcc-9 g++-9
!update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 100 --slave /usr/bin/g++ g++ /usr/bin/g++-9
```

In [None]:
import torch
from torch.utils.cpp_extension import load
print(torch.__config__.show())
print(torch.__config__.parallel_info())

```cpp
//cpp_intro.cc file

#include <torch/extension.h>

torch::Tensor get_rotations(const torch::Tensor &thetas) {
    const auto f = thetas.flatten();
    const auto n = f.numel();
    const auto c = torch::cos(f);
    const auto s = torch::sin(f);
    return torch::stack({c, -s, s, c}).t().view({n, 2, 2});
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("get_rotations", &get_rotations, py::call_guard<py::gil_scoped_release>(),
          "Generate 2D rotations given angles thetas");
}
```
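
The `stack` / `t` / `view` chain in `get_rotations` can be hard to read at first. As a rough illustration, the same index shuffling is reproduced below with plain Python lists. The helper `rotations_by_layout` is hypothetical and only mirrors the layout logic; it is not part of the extension.

```python
import math


def rotations_by_layout(thetas):
    """Mimic stack({c, -s, s, c}) -> t() -> view({n, 2, 2}) with plain lists."""
    c = [math.cos(t) for t in thetas]
    s = [math.sin(t) for t in thetas]
    # stack: four rows of length n, i.e. shape (4, n)
    stacked = [c, [-x for x in s], s, c]
    # t(): one row per angle holding (c, -s, s, c), i.e. shape (n, 4)
    transposed = [list(row) for row in zip(*stacked)]
    # view: split each length-4 row into two rows of two, i.e. shape (n, 2, 2)
    return [[row[0:2], row[2:4]] for row in transposed]


example = rotations_by_layout([0.0, math.pi / 2])
for matrix in example:
    print(matrix)
```

Each angle ends up with the matrix `[[cos, -sin], [sin, cos]]`, which is the standard counter-clockwise 2D rotation.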

In [None]:
!mkdir -p build

In [None]:
cpp_intro = load(name='cpp_intro',
             build_directory='./build',
             sources=['cpp_intro.cc'],
             extra_cflags=['-Wall', '-Wextra', '-Wpedantic', '-O3', '-std=c++17'],
             verbose=True)

In [None]:
N = 3
PI = 2. * torch.acos(torch.tensor(0.))
thetas = 0.05 * PI * (torch.rand(N) - 0.5) # example of angles in radians
rots = cpp_intro.get_rotations(thetas)
rots

In [None]:
torch.dist(rots.matmul(rots.transpose(-1,-2)), torch.eye(2))
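
The cell above checks that every generated matrix is orthogonal. Writing $c = \cos\theta$ and $s = \sin\theta$ for a single angle, the product works out as:

$$
R R^\top
= \begin{pmatrix} c & -s \\ s & c \end{pmatrix}
  \begin{pmatrix} c & s \\ -s & c \end{pmatrix}
= \begin{pmatrix} c^2 + s^2 & cs - sc \\ sc - cs & s^2 + c^2 \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
$$

so the reported distance should be zero up to floating-point rounding.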

## Heterogeneous computing with TNL 

Tutorials worth working through include: 
* [TNL tutorials](https://mmg-gitlab.fjfi.cvut.cz/doc/tnl/md_Tutorials_index.html#Tutorials)
* [CUDA made easy](https://developer.nvidia.com/blog/even-easier-introduction-cuda)
* [CUDA guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)

In [None]:
torch.cuda.is_available()

In [None]:
!nvcc --version

![multithreading](https://randu.org/tutorials/threads/images/process.png)

![sm](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/automatic-scalability.png)

![blocks](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png)

![CUDA](https://developer-blogs.nvidia.com/wp-content/uploads/2017/01/cuda_indexing.png)
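
The figures above show how a grid is split into blocks and blocks into threads. As a rough sketch, the pure-Python simulation below computes the global index `blockIdx.x * blockDim.x + threadIdx.x` used in the linked CUDA tutorial and walks the data with a grid-stride loop. It runs sequentially on the CPU and is not GPU code; the function name `grid_stride_indices` and the small sizes are chosen here for illustration.

```python
def grid_stride_indices(n, grid_dim, block_dim):
    """Return, for every (block, thread) pair, the element indices it would process."""
    stride = grid_dim * block_dim  # total number of threads in the grid
    work = {}
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Global thread index, as in blockIdx.x * blockDim.x + threadIdx.x
            start = block_idx * block_dim + thread_idx
            # Grid-stride loop: jump ahead by the total thread count each step
            work[(block_idx, thread_idx)] = list(range(start, n, stride))
    return work


work = grid_stride_indices(n=20, grid_dim=2, block_dim=4)
print(work[(1, 2)])  # thread 2 of block 1 starts at 1 * 4 + 2 = 6 -> [6, 14]
```

With 8 threads and 20 elements, each thread handles two or three elements, and every index is visited exactly once.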

A [TNL](https://mmg-gitlab.fjfi.cvut.cz/gitlab/tnl/tnl-dev) version compatible with PyTorch is also available as a third-party library within [NOA](https://github.com/grinisrit/noa).

In [None]:
!git clone https://github.com/grinisrit/noa.git

In [None]:
noa_location = 'noa'

Make sure that the following files are available in the folder you run the notebook from:
* tnl-intro.cc
* tnl-intro.cu
* tnl-intro.hh
* utils.hh
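
A quick way to confirm the files listed above are in place before building is sketched here. The helper `missing_files` is an illustrative name, and the check assumes the notebook's working directory is the folder in question.

```python
from pathlib import Path

REQUIRED_FILES = ["tnl-intro.cc", "tnl-intro.cu", "tnl-intro.hh", "utils.hh"]


def missing_files(folder=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in required if not (root / name).is_file()]


missing = missing_files()
if missing:
    print("Missing source files:", ", ".join(missing))
else:
    print("All TNL example sources found.")
```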

In [None]:
tnl_intro = load(name='tnl_intro',
             build_directory='./build',
             sources=['tnl-intro.cc'],
             extra_include_paths=[f'{noa_location}/src', '.'],    
             extra_cflags=['-O3', '-std=c++17', '-fopenmp'],
             verbose=True)

In [None]:
tnl_intro_cuda = load(name='tnl_intro_cuda',
             build_directory='./build',
             sources=['tnl-intro.cu'],
             extra_include_paths=[f'{noa_location}/src', '.'],    
             extra_cflags=['-O3', '-std=c++17'],
             extra_cuda_cflags=['-std=c++17', '--expt-relaxed-constexpr', '--expt-extended-lambda'],
             verbose=True) if torch.cuda.is_available() else None

In [None]:
t = torch.randn(10000000)
t_cuda = t.cuda()

In [None]:
tnl_intro.map_reduce(t)

In [None]:
tnl_intro.map_reduce(t_cuda)

In [None]:
tnl_intro_cuda.map_reduce(t_cuda)
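
The source of `map_reduce` lives in `tnl-intro.hh`, which is not reproduced in this notebook, so the exact operation it applies is not shown here. The sketch below only illustrates the general map-then-reduce pattern on the CPU. The name `map_reduce_sketch` and the choice of squaring followed by summation are assumptions made for the example, not the extension's actual computation.

```python
from functools import reduce


def map_reduce_sketch(values, map_fn, reduce_fn, identity):
    """Apply map_fn to every element, then fold the results with reduce_fn."""
    return reduce(reduce_fn, (map_fn(v) for v in values), identity)


# Hypothetical choice: sum of squares. The real extension may compute something else.
total = map_reduce_sketch([1.0, 2.0, 3.0], lambda x: x * x, lambda a, b: a + b, 0.0)
print(total)  # 14.0
```

On a GPU the reduction step is what needs care: partial results are combined in parallel, so the reduction function should be associative and `identity` should be its neutral element.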

In [None]:
%timeit tnl_intro.map_reduce(t)

In [None]:
%timeit tnl_intro_cuda.map_reduce(t_cuda)