<a href="https://colab.research.google.com/github/Tensor-Reloaded/AI-Learning-Hub/blob/main/resources/advanced_pytorch/UsingCppModules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using CPP modules

Writing parts of a Deep Learning pipeline in C++ unlocks many opportunities for optimizing speed and resource consumption.

PyTorch provides a convenient interface for writing a module in C++, compiling it, and using it from Python.

In this lesson we learn how to implement the Macro F1 Score in C++ and use it in Python. The real performance benefit of C++, however, shows when even more complex Python-based processing is moved to C++.

## Setup

Here we get the `optimized_f1_score` module.

In [1]:
import shutil
import os
import subprocess

if not os.path.isdir("optimized_f1_score"):
    subprocess.run(["git", "clone", "https://www.github.com/Tensor-Reloaded/AI-Learning-Hub"], check=True)
    shutil.copytree("AI-Learning-Hub/resources/advanced_pytorch/optimized_f1_score", "optimized_f1_score")
    shutil.rmtree("AI-Learning-Hub", ignore_errors=True)


Let's see the content of each file. First we have the Macro F1 Score implemented in Python.

In [2]:
!pygmentize -g optimized_f1_score/f1_macro_py.py

import torch


def f1_score(x: torch.BoolTensor, y: torch.BoolTensor) -> float:
    x_sum = x.sum()
    y_sum = y.sum()
    if x_sum == 0 or y_sum == 0:
        if x_sum == 0 and y_sum == 0:
            return 1.0
        return 0.0
    return 2 * (x & y).sum() / (x_sum + y_sum)


def f1_macro(x: torch.Tensor, y: torch.Tensor, classes: int) -> float:

Next we have the same function implemented in C++.

In [3]:
!pygmentize -g optimized_f1_score/f1_macro_cpp.cpp

#include <torch/extension.h>

double f1_score(torch::Tensor x, torch::Tensor y) {
    auto x_sum = x.sum().item<double>();
    auto y_sum = y.sum().item<double>();

    if (x_sum == 0.0 || y_sum == 0.0) {
        if (x_sum == 0.0

As the listings above show, the C++ implementations of `f1_score` and `f1_macro` closely mirror their Python counterparts.
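To make the computation concrete, here is a dependency-free sketch of one plausible reading of macro F1. The `f1_score` edge cases follow the Python listing above. The averaging step in `f1_macro` (an unweighted mean of the one-vs-rest score over class ids `0..classes-1`) and the use of plain Python sequences instead of tensors are assumptions made for illustration, not the module's actual code.

```python
from typing import Sequence


def f1_score(x: Sequence[bool], y: Sequence[bool]) -> float:
    """F1 for a single class, given boolean membership masks."""
    x_sum = sum(x)
    y_sum = sum(y)
    if x_sum == 0 or y_sum == 0:
        # Both masks empty counts as a perfect match; only one empty is a total miss.
        return 1.0 if x_sum == 0 and y_sum == 0 else 0.0
    true_positives = sum(a and b for a, b in zip(x, y))
    return 2 * true_positives / (x_sum + y_sum)


def f1_macro(x: Sequence[int], y: Sequence[int], classes: int) -> float:
    """Assumed definition: unweighted mean of the per-class (one-vs-rest) F1."""
    scores = [
        f1_score([xi == c for xi in x], [yi == c for yi in y])
        for c in range(classes)
    ]
    return sum(scores) / classes


print(f1_macro([0, 1, 2, 2], [0, 2, 2, 1], classes=3))  # 0.5
```

In the example, class 0 scores 1.0, class 1 scores 0.0, and class 2 scores 0.5, so the macro average is 0.5. A tiny reference like this can be handy for checking a compiled implementation on small inputs.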

Whenever we write a C++ module, we need to include the torch extension header (`#include <torch/extension.h>`) and use the `PYBIND11_MODULE` macro to bind the functions implemented in C++ to Python, so that they can be called just like regular Python functions.

To build the C++ code, we need a `setup.py` file, just as we would when building any Python library from source.

In [4]:
!pygmentize -g optimized_f1_score/setup.py

import os

from setuptools import setup
from torch.utils.cpp_extension import CppExtension, BuildExtension


def get_extra_compile_args():
    if os.name == "nt":
        return {
            "msvc": ["/std:c++20", "/O2", "/DNDEBUG", "

The `setup.py` file above is where you set compile options. Depending on your architecture, some of them can speed up the C++ code through vectorization.
For example, `/arch:AVX2` (MSVC) or `-mavx2` (GCC/Clang) tells the compiler that it may use 256-bit SIMD instructions (see the commented-out code).

On most modern non-Mac hardware, AVX2 is a great way of speeding up tensor computations with vectorized instructions.
Depending on your hardware, you can also use `AVX`, `SSE`, `SSE2`, and so on.

You can also enable fused multiply-add or fast math (if you know what you are doing).
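As a rough illustration of how such flags might be chosen per platform, the sketch below builds a flag list for MSVC or for GCC/Clang. The helper name `vectorized_compile_args` and its parameters are hypothetical, and the `-O3` level for GCC/Clang is an assumption; only the flags themselves are real compiler options. Keep in mind that a binary built with AVX2 enabled will not run on a CPU that lacks it.

```python
import os
from typing import List


def vectorized_compile_args(
    windows: bool = (os.name == "nt"),
    avx2: bool = True,
    fast_math: bool = False,
) -> List[str]:
    """Hypothetical helper: pick optimization flags for MSVC or GCC/Clang."""
    if windows:
        args = ["/O2", "/DNDEBUG"]
        if avx2:
            args.append("/arch:AVX2")  # MSVC: allow 256-bit SIMD instructions
        if fast_math:
            args.append("/fp:fast")  # relaxes strict floating-point semantics
        return args
    args = ["-O3", "-DNDEBUG"]
    if avx2:
        args += ["-mavx2", "-mfma"]  # 256-bit SIMD plus fused multiply-add
    if fast_math:
        args.append("-ffast-math")  # may change numerical results
    return args


print(vectorized_compile_args(windows=False))
# ['-O3', '-DNDEBUG', '-mavx2', '-mfma']
```

A list like this could then be passed as `extra_compile_args` when constructing the `CppExtension` in `setup.py`.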

In [5]:
!pygmentize -g optimized_f1_score/build.py

import os
import subprocess
import filelock


def build():
    p_dir = os.path.dirname(__file__)
    lock_path = os.path.join(p_dir, ".lock")

    try:
        from optimized_f1_score.f1_macro_cpp import f1_macro
    except ImportError:
        with filelock.FileLock(lock_path):
            print("

The build file above lets us build the C++ module dynamically as a Python library. We only need a compatible C++ compiler visible on the path (on Windows, this means activating MSVC first).

In [6]:
!pygmentize -g optimized_f1_score/__init__.py

from optimized_f1_score.build import build


try:
    from optimized_f1_score.f1_macro_cpp import f1_macro
except ImportError as e:
    try:
        build()
        from optimized_f1_score.f1_macro_cpp import f1_macro
    except ImportError as e:
        print("Building Optimi

This is how we implement dynamic building and importing of a C++ module, with a fallback to the Python implementation if compilation fails.
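The fallback idea can be looked at separately from the build step. The sketch below uses the standard `importlib` module to return the first backend that imports successfully. The helper `import_first_available` and the module name `hypothetical_compiled_backend` are invented for illustration; in this lesson's package the candidates would be the compiled `f1_macro_cpp` module followed by the pure-Python `f1_macro_py`.

```python
import importlib
from types import ModuleType
from typing import Iterable


def import_first_available(names: Iterable[str]) -> ModuleType:
    """Return the first module in `names` that imports successfully."""
    errors = []
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate could be imported: " + "; ".join(errors))


# The first name is deliberately bogus, so the stdlib `math` module is returned.
backend = import_first_available(["hypothetical_compiled_backend", "math"])
print(backend.__name__)  # math
```

Collecting the individual error messages keeps the reason for a failed build visible instead of silently switching to the slower implementation.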

In [7]:
import torch
from optimized_f1_score import f1_macro_cpp, f1_macro_py

Building Optimized F1 Score
Done building


In [8]:
num_classes = 100
size = 10000

In [9]:
torch.random.manual_seed(3)
x = torch.randint(0, num_classes, (size,))
y = torch.randint(0, num_classes, (size,))
x_cuda = x.cuda()
y_cuda = y.cuda()


In [10]:
%%timeit
f1_macro_cpp.f1_macro(x, y, num_classes)

4.2 ms ± 82.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%%timeit
f1_macro_py.f1_macro(x, y, num_classes)

8.87 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [12]:
%%timeit
f1_macro_cpp.f1_macro(x_cuda, y_cuda, num_classes)

14.3 ms ± 841 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [13]:
%%timeit
f1_macro_py.f1_macro(x_cuda, y_cuda, num_classes)

18.5 ms ± 2.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
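Outside a notebook, the `%%timeit` cells above could be approximated with the standard `timeit` module. The helper below is a hypothetical sketch, not part of the lesson's code: it mimics the "7 runs, 100 loops each" setup and reports the mean and standard deviation of the time per call, though its numbers will not match IPython's output exactly.

```python
import statistics
import timeit
from typing import Callable, Tuple


def time_call(
    fn: Callable[[], object], loops: int = 100, runs: int = 7
) -> Tuple[float, float]:
    """Return (mean, std. dev.) of seconds per call over `runs` runs of `loops` calls."""
    totals = timeit.repeat(fn, number=loops, repeat=runs)
    per_call = [total / loops for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


# Stand-in workload; in the lesson this would be a call such as
# lambda: f1_macro_cpp.f1_macro(x, y, num_classes)
mean, std = time_call(lambda: sum(range(1000)))
print(f"{mean * 1e6:.1f} µs ± {std * 1e6:.1f} µs per loop")
```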


The timings show that the C++ module is faster than the Python implementation: about 2x on CPU (4.2 ms vs. 8.87 ms) and about 1.3x on CUDA tensors (14.3 ms vs. 18.5 ms).
We can also make our C++ code faster by releasing the GIL and using threads!
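On the Python side, using threads could look like the sketch below, which scores several (predictions, targets) pairs on a thread pool. It only pays off if the scoring function releases the GIL while it runs. For a pybind11-bound function that is typically arranged with `py::call_guard<py::gil_scoped_release>()` in the binding, which this lesson's module is not confirmed to do. The names `score_batches` and `accuracy` are hypothetical, and `accuracy` is a pure-Python stand-in that holds the GIL, so it demonstrates the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Batch = Tuple[Sequence[int], Sequence[int]]


def score_batches(
    score_fn: Callable[[Sequence[int], Sequence[int]], float],
    batches: Sequence[Batch],
    workers: int = 4,
) -> List[float]:
    """Score each (predictions, targets) pair on a thread pool, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: score_fn(*pair), batches))


def accuracy(x: Sequence[int], y: Sequence[int]) -> float:
    """Pure-Python stand-in for a compiled metric."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


print(score_batches(accuracy, [([0, 1], [0, 1]), ([0, 1], [1, 1])]))  # [1.0, 0.5]
```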

Writing data processing pipelines in C++ can make your code 10x-100x faster or even more (if you know what you are doing)!

The benefits of using C++ are mostly seen on CPU tensors. To speed up GPU computation even more, you usually need to implement custom CUDA kernels, which is not covered in this tutorial.

---

### Additional resources:
* Check the cppdocs, torch has a really nice library interfacing with C++: https://docs.pytorch.org/cppdocs/
* Check the official PyTorch tutorial for custom C++ operators: https://docs.pytorch.org/tutorials/advanced/cpp_custom_ops.html
* This is the official documentation for CppExtension: https://docs.pytorch.org/docs/stable/cpp_extension.html
* This tutorial shows how to optimize the data processing pipeline with custom PyTorch operators: https://medium.com/data-science/how-to-optimize-your-dl-data-input-pipeline-with-a-custom-pytorch-operator-7f8ea2da5206
