<a href="https://colab.research.google.com/github/dlsyscourse/lecture14/blob/main/14_hardware_acceleration_architecture_overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lecture 14: Hardware Acceleration Implementation

In this lecture, we will walk through the backend scaffolding that gives needle hardware acceleration.




## Select a GPU runtime type
In this lecture, we are going to use C++ and CUDA to build accelerated linear algebra libraries. To do so, please make sure you select a runtime type with a GPU and rerun the cells if needed:
- Click on the "Runtime" tab
- Click "Change runtime type"
- Select GPU

After you have started the right runtime, you can run the following command to check whether a GPU is available.

In [1]:
# !nvidia-smi
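
The same check can also be done from Python. The following is a minimal sketch, assuming that a working `nvidia-smi` executable on the PATH is a good enough signal that a GPU runtime is active; the helper name `gpu_available` is hypothetical and not part of needle.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if nvidia-smi is on the PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU available:", gpu_available())
```

If this prints `False` on Colab, the runtime type is most likely still set to CPU.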

## Prepare the codebase

To get started, we can clone the related repo from GitHub.

In [2]:
# # Code to set up the assignment
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/
# !mkdir -p 10714f23
# %cd /content/drive/MyDrive/10714f23
# # comment out the following line if you run it for the second time
# # as you already have a local copy of lecture14
# # !git clone https://github.com/dlsyscourse/lecture14
# !rm -rf /content/needle
# !ln -s /content/drive/MyDrive/10714f23/lecture14 /content/needle

In [3]:
# !python3 -m pip install pybind11

In [4]:
%set_env PYTHONPATH ./python
%set_env NEEDLE_BACKEND nd

env: PYTHONPATH=./python
env: NEEDLE_BACKEND=nd


In [5]:
import sys
sys.path.append('./python')

### Build the needle cuda library

We leverage pybind to build a C++/CUDA library for acceleration. You can run `make` to build the corresponding library.

In [6]:
# !make clean
# !make

We can then run the following command to make the path to the package available in colab's environment as well as the PYTHONPATH.

In [7]:
# %set_env PYTHONPATH /content/needle/python:/env/python
import sys
sys.path.append('./python')
sys.path.append('./apps')

## Codebase Walkthrough

Now click the file panel on the left. You should be able to see the following files:

Python:

- needle/backend_ndarray/ndarray.py

- needle/backend_ndarray/ndarray_backend_numpy.py

C++/CUDA:

- src/ndarray_backend_cpu.cc

- src/ndarray_backend_cuda.cu


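The main goal of this lecture is to create an accelerated ndarray library.
- Afterwards, we no longer need numpy as the backend that holds the data

Therefore, we do not need to deal with needle.Tensor for now, and instead focus on the implementation of backend_ndarray.

After we build this array library, we can use it to power the backend array computation in needle.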

In [8]:
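# TQ recommends reading through all of the implementations in the needle.backend_ndarray directory

# It is also worth reading through the two backend source files in the src directory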

## Creating a CUDA NDArray






In [9]:
from needle import backend_ndarray as nd

Using needle backend


In [10]:
x = nd.NDArray([1, 2, 3], device=nd.cpu())

if branch 3
if branch 2


In [11]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())

if branch 3
if branch 2


In [12]:
x.device

cuda()

In [13]:
x = nd.NDArray([1,2,3], device=nd.cuda())

if branch 3
if branch 2


In [14]:
y = x + 1

(1,)
(1,)


In [15]:
y

NDArray([2. 3. 4.], device=cuda())

In [16]:
y = x + x

(1,)
(1,)
(1,)
(1,)


In [17]:
y

NDArray([2. 4. 6.], device=cuda())

We can create a CUDA tensor from the data by specifying a device keyword.

In [18]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())

if branch 3
if branch 2


In [19]:
y = x + 1

(1,)
(1,)


In [20]:
x.numpy()

array([1., 2., 3.], dtype=float32)

In [21]:
x.device

cuda()

In [22]:
y = x + 1

(1,)
(1,)


In [23]:
y.device

cuda()

In [24]:
y.numpy()

array([2., 3., 4.], dtype=float32)

### Key Data Structures

Key data structures in backend_ndarray

- NDArray: the container that holds a device-specific ndarray
- BackendDevice: the backend device
    - `mod` holds the module that implements all of the device's functions
    - check out ndarray_backend_numpy.py for a Python-side reference.
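
To make the role of `mod` concrete, here is a minimal sketch of a device object that forwards unknown attribute lookups to its module. It assumes delegation via `__getattr__`; the toy module `toy_mod` and its `scalar_add` function are hypothetical stand-ins, not the needle backend.

```python
import types


class BackendDevice:
    """Toy device: a name plus a module that implements the array functions."""

    def __init__(self, name, mod):
        self.name = name
        self.mod = mod

    def __getattr__(self, attr):
        # Only called when normal lookup fails, so `name` and `mod` resolve normally.
        return getattr(self.mod, attr)

    def __repr__(self):
        return self.name + "()"


# Hypothetical stand-in for a compiled backend module.
toy_mod = types.SimpleNamespace(scalar_add=lambda a, val: [v + val for v in a])
dev = BackendDevice("toy", toy_mod)
print(dev, dev.scalar_add([1.0, 2.0, 3.0], 1.0))
```

With this pattern, swapping the backend only means passing a different module; the calling code keeps writing `device.some_function(...)`.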



In [25]:
# Use ndarray_backend_numpy.py as a reference

# Take a look at the numpy backend implementation (the Python implementation)

## Trace GPU execution

Now, let us take a look at what happens when we execute the following code


In [26]:
x = nd.NDArray([1, 2, 3], device=nd.cuda())
y = x + 1

if branch 3
if branch 2
(1,)
(1,)


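Execution process

NDArray.\_\_init\_\_  -> 'if branch 3' -> 'if branch 2'  -> self.make, **array.device.from_numpy**

- Note this from_numpy: it moves our np-format data into the backend device. The object that stores the result is kept in **array._handle**
- In other words, make creates this object and allocates the storage that _handle points to, and fills in the size member, but does not fill in the actual data
- from_numpy fills the actual data into _handle

This array._handle is in fact a pointer to one of the C++ classes wrapped in the .cc and .cu files, CudaArray or AlignedArray

Next, look at y = x+1

- NDArray overloads the operators \_\_add\_\_ and \_\_radd\_\_
- These call self.ewise_or_scalar
- This function takes the corresponding ewise function and scalar function as arguments
- There is also a compact call worth noting. TQ did not say much about it; it is probably related to multi-dimensional versus one-dimensional layout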
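
The dispatch step can be sketched as follows. This is a simplified illustration that uses plain Python lists in place of NDArray handles; the helper functions `ewise_add` and `scalar_add` are hypothetical stand-ins for the backend functions, and the real method also handles compaction and output allocation.

```python
def ewise_or_scalar(a, other, ewise_func, scalar_func):
    """Use the elementwise function for array operands, else the scalar one."""
    if isinstance(other, list):
        assert len(a) == len(other), "operands must have the same length"
        return ewise_func(a, other)
    return scalar_func(a, other)


def ewise_add(a, b):
    # Add two equally sized arrays element by element.
    return [x + y for x, y in zip(a, b)]


def scalar_add(a, val):
    # Add one scalar to every element.
    return [x + val for x in a]


print(ewise_or_scalar([1.0, 2.0, 3.0], 1.0, ewise_add, scalar_add))
print(ewise_or_scalar([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], ewise_add, scalar_add))
```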

In [27]:
from needle.backend_ndarray import ndarray_backend_cuda
for k,v in ndarray_backend_cuda.__dict__.items():
    print(k, v)
    
# We can see that the module contains the from_numpy function

__name__ needle.backend_ndarray.ndarray_backend_cuda
__doc__ None
__package__ needle.backend_ndarray
__loader__ <_frozen_importlib_external.ExtensionFileLoader object at 0x7f74cc157410>
__spec__ ModuleSpec(name='needle.backend_ndarray.ndarray_backend_cuda', loader=<_frozen_importlib_external.ExtensionFileLoader object at 0x7f74cc157410>, origin='/home/haze/Code/DLsys/lecture/13-lecture14/./python/needle/backend_ndarray/ndarray_backend_cuda.cpython-311-x86_64-linux-gnu.so')
__device_name__ cuda
__tile_size__ 4
Array <class 'needle.backend_ndarray.ndarray_backend_cuda.Array'>
to_numpy <built-in method to_numpy of PyCapsule object at 0x7f74cc5d3bd0>
from_numpy <built-in method from_numpy of PyCapsule object at 0x7f74cc5d3c00>
fill <built-in method fill of PyCapsule object at 0x7f74cc5d3c30>
compact <built-in method compact of PyCapsule object at 0x7f74cc5d3c60>
ewise_setitem <built-in method ewise_setitem of PyCapsule object at 0x7f74cc5d3c90>
scalar_setitem <built-in method scalar_setite

In [28]:
x.device.from_numpy

<function needle.backend_ndarray.ndarray_backend_cuda.PyCapsule.from_numpy>

In [29]:
x = nd.NDArray([1, 2, 3])

if branch 3
if branch 2


In [30]:
x.device.from_numpy

<function needle.backend_ndarray.ndarray_backend_numpy.from_numpy(a, out)>

This produces the following trace:

backend_ndarray/ndarray.py
- `NDArray.__add__`
- `NDArray.ewise_or_scalar`
- `ndarray_backend_cpu.cc:ScalarAdd`

In [31]:
y.numpy()

array([2., 3., 4.], dtype=float32)

This produces the following trace:

- `NDArray.numpy`
- `ndarray_backend_cpu.cc:to_numpy`

## Guidelines for Reading C++/CUDA related Files

Read
- src/ndarray_backend_cpu.cc
- src/ndarray_backend_cuda.cu


Optional
- CMakeLists.txt: this is used to set up the build, and you likely do not need to tweak it.







## NDArray Data Structure

Open up `python/needle/backend_ndarray/ndarray.py`.

An NDArray contains the following fields:
- handle: The backend handle to a flat array that stores the data.
- shape: The shape of the NDArray
- strides: The strides, which determine how multi-dimensional elements are accessed
- offset: The offset of the first element.
- device: The backend device that backs the computation
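
As a small illustration of how these fields work together, the sketch below maps a multi-dimensional index to a position in the flat array. The helper name `flat_index` is hypothetical; a plain Python list stands in for the backend handle.

```python
def flat_index(index, strides, offset=0):
    """Position in the flat array of the element at `index`."""
    return offset + sum(i * s for i, s in zip(index, strides))


data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

# View with shape (2, 3), strides (3, 1), offset 0: element [1, 2] is data[5].
print(data[flat_index((1, 2), (3, 1))])

# Same data with offset 1: element [1, 0] is data[1 + 3] = data[4].
print(data[flat_index((1, 0), (3, 1), offset=1)])
```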






## Transformation as Strided Computation

We can leverage the strides and offset to perform transform/slicing with zero copy.

- Broadcast: insert strides equal to 0
- Transpose: swap the strides
- Slice: change the offset and shape

For most of the computations, however, we will call `array.compact()` first to get a contiguous and aligned memory before running the computation.

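By operating on strides, we can apply various transforms to a tensor (this was mentioned in the earlier PDF slides, and it gives us zero copy).

**This is part of our homework: we do not need to copy anything. We only set the relevant strides and shape, and return a new data structure that represents the transformed content.**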
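
The three transformations can be sketched as pure functions on `(shape, strides, offset)`. This is an illustrative sketch only: the function names are hypothetical, and the broadcast and slice helpers cover just one simple case each (a new trailing axis, and a row range on the first axis).

```python
def transpose(shape, strides, offset, axes):
    """Permute axes by reordering shape and strides; no data moves."""
    return tuple(shape[a] for a in axes), tuple(strides[a] for a in axes), offset


def broadcast_last(shape, strides, offset, n):
    """Append a trailing axis of length n that re-reads the same element."""
    return shape + (n,), strides + (0,), offset


def slice_rows(shape, strides, offset, start, stop):
    """Keep rows start..stop-1 of the first axis by shifting the offset."""
    return (stop - start,) + shape[1:], strides, offset + start * strides[0]


base = ((2, 3), (3, 1), 0)  # compact 2 x 3 view
print(transpose(*base, axes=(1, 0)))
print(broadcast_last(*base, n=4))
print(slice_rows(*base, start=1, stop=2))
```

Each function returns only new metadata, which is why these operations need no copy of the underlying buffer.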

In [40]:
import numpy as np
x = nd.NDArray([0,1,2,3,4,5], device=nd.cpu_numpy())

if branch 3
if branch 2


In [41]:
x.numpy()

array([0., 1., 2., 3., 4., 5.], dtype=float32)

In [42]:
z = nd.NDArray.make(shape=(2, 3),
                strides=(3, 1),
                device=x.device,
                handle=x._handle,
                offset=0)
z
# reshape operation

NDArray([[0. 1. 2.]
 [3. 4. 5.]], device=cpu_numpy())

In [43]:
x._handle, z._handle

(<needle.backend_ndarray.ndarray_backend_numpy.Array at 0x7f74cc1bc950>,
 <needle.backend_ndarray.ndarray_backend_numpy.Array at 0x7f74cc1bc950>)

In [44]:
b = nd.NDArray.make(shape=(2,2),
                    strides=(3,1),
                    device=x.device,
                    handle=z._handle,
                    offset=1)
b
# slice operation: get a submatrix
# b[i, j] = data[offset + i*3 + j*1]

# There is a problem here
b + 1
# When we execute b + 1, the underlying storage is z._handle
'''For example, the corresponding CPU code
void ScalarAdd(const AlignedArray& a, scalar_t val, AlignedArray* out) {
  for (size_t i = 0; i < a.size; i++) {
    out->ptr[i] = a.ptr[i] + val;
  }
}
'''
# so it would operate on all 6 elements of z


(3, 1)
(2, 1)


'For example, the corresponding CPU code\nvoid ScalarAdd(const AlignedArray& a, scalar_t val, AlignedArray* out) {\n  for (size_t i = 0; i < a.size; i++) {\n    out->ptr[i] = a.ptr[i] + val;\n  }\n}\n'

In [46]:
z+1, z

(3, 1)
(3, 1)


(NDArray([[1. 2. 3.]
  [4. 5. 6.]], device=cpu_numpy()),
 NDArray([[0. 1. 2.]
  [3. 4. 5.]], device=cpu_numpy()))

In [34]:
z
# But we find that it did not; this is the job of the compact function
'''
out = NDArray.make(self.shape, device=self.device)
self.device.compact(
    self._handle, out._handle, self.shape, self.strides, self._offset
)
return out
'''

'''self.device.compact
def compact(a, out, shape, strides, offset):
    out.array[:] = to_numpy(a, shape, strides, offset).flatten()

def to_numpy(a, shape, strides, offset):
    return np.lib.stride_tricks.as_strided(
        a.array[offset:], shape, tuple([s * _datetype_size for s in strides])
    )

'''
# compact creates a new out array, which is filled by the device's compact
# the device's compact reads the contents of a through to_numpy
# to_numpy itself returns a strided view; the copy happens in .flatten() and in the assignment into out.array
# so out and a no longer share the same underlying memory
# we can also see that this follows the idea of lazy execution

'\nout = NDArray.make(self.shape, device=self.device)\nself.device.compact(\n    self._handle, out._handle, self.shape, self.strides, self._offset\n)\nreturn out\n'
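
To see what compaction does without any backend, here is a pure-Python sketch that gathers the elements of a strided view into a new contiguous list in row-major order. The function below is an illustration with plain lists, not the needle implementation.

```python
import itertools


def compact(data, shape, strides, offset):
    """Copy the elements of a strided view into a new row-major list."""
    out = []
    # itertools.product varies the last index fastest, i.e. row-major order.
    for index in itertools.product(*(range(n) for n in shape)):
        out.append(data[offset + sum(i * s for i, s in zip(index, strides))])
    return out


data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

# The 2 x 2 submatrix view from above: shape (2, 2), strides (3, 1), offset 1.
print(compact(data, (2, 2), (3, 1), 1))  # [1.0, 2.0, 4.0, 5.0]
```

After compaction, a flat loop over `size` elements touches exactly the elements of the view, which is what the elementwise kernels assume.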

In [49]:
import needle.backend_ndarray.ndarray_backend_numpy as t
aa = t.Array(16)
aa.array[:] = np.arange(1,17)
b = t.to_numpy(aa, (2,2,2,2),strides=(8,2,4,1), offset=0)
aa.array


array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
       14., 15., 16.], dtype=float32)

In [50]:
b

array([[[[ 1.,  2.],
         [ 5.,  6.]],

        [[ 3.,  4.],
         [ 7.,  8.]]],


       [[[ 9., 10.],
         [13., 14.]],

        [[11., 12.],
         [15., 16.]]]], dtype=float32)

In [51]:
id(aa.array), id(b)
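# The ids differ because aa.array and b are distinct Python objects; this alone does not show a copy, since as_strided returns a view of the same memory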

(140139617093648, 140138614537360)
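
Whether two numpy arrays share storage can be checked directly with `np.shares_memory`. The sketch below rebuilds the same strided view with plain numpy (no needle classes) and compares it with a flattened copy.

```python
import numpy as np

base = np.arange(1, 17, dtype=np.float32)
itemsize = base.itemsize  # 4 bytes for float32

# Same element strides as above, converted to byte strides for numpy.
view = np.lib.stride_tricks.as_strided(
    base, shape=(2, 2, 2, 2), strides=tuple(s * itemsize for s in (8, 2, 4, 1))
)
flat = view.flatten()  # flatten always returns a copy

print(view is base)                   # False: distinct Python objects
print(np.shares_memory(base, view))   # True: the view reads base's buffer
print(np.shares_memory(base, flat))   # False: flatten made a copy
```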

In [52]:
b = nd.NDArray.make(shape=(1,3),
                    strides=(3,1),
                    device=x.device,
                    handle=z._handle,
                    offset=3)
b
# slice operation: get the second row
# b[i, j] = data[offset + i*3 + j*1]

NDArray([[3. 4. 5.]], device=cpu_numpy())

In [53]:
b = nd.NDArray.make(shape=(3,2),
                    strides=(1,3),
                    device=x.device,
                    handle=z._handle,
                    offset=0)
b
# transpose operation

NDArray([[0. 3.]
 [1. 4.]
 [2. 5.]], device=cpu_numpy())

In [54]:
b = nd.NDArray.make(shape=(2, 3, 4),
                    strides=(3, 1, 0),
                    device=z.device,
                    handle=z._handle,
                    offset=0)
b
# broadcast operation

NDArray([[[0. 0. 0. 0.]
  [1. 1. 1. 1.]
  [2. 2. 2. 2.]]

 [[3. 3. 3. 3.]
  [4. 4. 4. 4.]
  [5. 5. 5. 5.]]], device=cpu_numpy())

In [55]:
b = nd.NDArray.make(shape=(4, 2, 3),
                    strides=(0, 3, 1),
                    device=z.device,
                    handle=z._handle,
                    offset=0)
b
# broadcast operation: a new leading axis with stride 0

NDArray([[[0. 1. 2.]
  [3. 4. 5.]]

 [[0. 1. 2.]
  [3. 4. 5.]]

 [[0. 1. 2.]
  [3. 4. 5.]]

 [[0. 1. 2.]
  [3. 4. 5.]]], device=cpu_numpy())

In [56]:
y = nd.NDArray.make(
    shape=(3, 2, 2),
    strides=(2, 1, 0),
    device=x.device,
    handle=x._handle,
    offset=0)
y.numpy()
# stride 0 on the last axis repeats each element twice

array([[[0., 0.],
        [1., 1.]],

       [[2., 2.],
        [3., 3.]],

       [[4., 4.],
        [5., 5.]]], dtype=float32)

In [57]:
x = nd.NDArray([1, 2, 3, 4], device=nd.cpu_numpy())

if branch 3
if branch 2


In [58]:
x.numpy()

array([1., 2., 3., 4.], dtype=float32)

We can use strides and shape manipulation to create different views of the same array.

In [59]:
y = nd.NDArray.make(shape=(2, 2), strides=(2, 1), device=x.device, handle=x._handle, offset=0)

In [60]:
y.numpy()

array([[1., 2.],
       [3., 4.]], dtype=float32)

In [61]:
z = nd.NDArray.make(shape=(2, 1), strides=(2, 1), device=x.device, handle=x._handle, offset=1)

In [62]:
z.numpy()

array([[2.],
       [4.]], dtype=float32)
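
A view is contiguous (compact) only when its strides match the row-major strides implied by its shape. The sketch below computes those strides; the helper name `compact_strides` is chosen for illustration.

```python
def compact_strides(shape):
    """Row-major strides, in elements, for a contiguous array of this shape."""
    stride = 1
    result = []
    for n in reversed(shape):
        result.append(stride)
        stride *= n
    return tuple(reversed(result))


print(compact_strides((2, 3)))        # (3, 1)
print(compact_strides((2, 2, 2, 2)))  # (8, 4, 2, 1)

# The (2, 1) view above uses strides (2, 1), which differ from the compact
# strides (1, 1), so it would need compaction before a flat loop is valid.
print(compact_strides((2, 1)) == (2, 1))  # False
```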

## CUDA Acceleration

Now let us open `src/ndarray_backend_cuda.cu` and take a look at the current implementation of the GPU ops.


## Steps for adding a new operator implementation
- Add an implementation in `ndarray_backend_cuda.cu` and expose it via pybind
- Call into the operator in ndarray.py
- Write test cases
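
Before writing the CUDA version, it can help to think through the indexing in plain Python. The sketch below emulates a one-dimensional kernel launch with nested loops; it is not CUDA and not the code in the repo, and the names `launch` and `ewise_mul_kernel` are hypothetical. It only illustrates the usual pattern of computing a global index and guarding against out-of-range threads.

```python
def ewise_mul_kernel(gid, a, b, out, size):
    """Body run by one simulated thread with global index gid."""
    if gid < size:  # the last block may contain surplus threads
        out[gid] = a[gid] * b[gid]


def launch(kernel, size, block_dim, *args):
    """Run the kernel once per simulated thread, block by block."""
    num_blocks = (size + block_dim - 1) // block_dim  # ceiling division
    for block_idx in range(num_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx * block_dim + thread_idx, *args, size)


a = [1.0, 2.0, 3.0]
b = [2.0, 3.0, 5.0]
out = [0.0] * 3
launch(ewise_mul_kernel, 3, 2, a, b, out)  # 2 blocks of 2 threads, 1 idle
print(out)  # [2.0, 6.0, 15.0]
```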

In [None]:
!make

If we run the following code block before the operator is implemented, we will see an error, because ewise mul is not yet implemented.

In [63]:
x = nd.NDArray([1,2,3], device=nd.cuda())
x * 2

if branch 3
if branch 2
(1,)
(1,)


NDArray([2. 4. 6.], device=cuda())

In [64]:
!python test_mul.py

Using needle backend
if branch 3
if branch 2
(1,)
(1,)
[3. 6. 9.]


In [65]:
!nvprof python test_mul.py

                  Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
                  Refer https://developer.nvidia.com/tools-overview for more details.



## Connect back to needle Tensor

So far we only played with the `backend_ndarray` subpackage, which is a self-contained ndarray implementation within needle.

We can connect the ndarray back to needle as a backend.

In [66]:
import needle as ndl

In [67]:
x = ndl.Tensor([1,2,3], device=nd.cuda(), dtype="float32")
y = ndl.Tensor([2,3,5], device=nd.cuda(), dtype="float32")
z = x + y
z

if branch 3
if branch 2
if branch 3
if branch 2
(1,)
(1,)
(1,)
(1,)


needle.Tensor([3. 5. 8.])

In [68]:
z.device

cuda()

In [69]:
type(z.cached_data)

needle.backend_ndarray.ndarray.NDArray

## Write Standalone Python Test Files

Now that we have additional C++/CUDA libraries in needle, we need to run `make` to rebuild the library. Additionally, because the Colab environment caches the old library, it is inconvenient to use IPython cells to debug the updated library.




In [71]:
!make

-- Found pybind11: /home/haze/anaconda3/envs/dlsys/lib/python3.11/site-packages/pybind11/include (found version "2.11.1")
-- Found cuda, building cuda backend
Sun Jan 14 22:58:18 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 537.13       CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 30%   33C    P3    28W / 180W |   1998MiB /  8192MiB |     39%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
         


We recommend writing separate Python files and invoking them from the command line. Create a new file `tests/mytest.py` and write your local tests there. This is also common development practice in large projects that involve a Python/C++ FFI.

In [76]:
!python tests/mytest.py

Using needle backend
if branch 3
if branch 2
if branch 3
if branch 2
(1,)
(1,)
(1,)
(1,)
[1. 1. 1.]
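
The same workflow can be driven from Python when a shell cell is not convenient. The sketch below writes a throwaway script and runs it in a fresh interpreter with `subprocess`, so nothing is reused from the current process. The script contents are a placeholder, not the actual `tests/mytest.py`.

```python
import os
import subprocess
import sys
import tempfile

# Placeholder test script; a real one would import needle and exercise the backend.
script = "print(sum([1.0, 1.0, 1.0]))\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mytest.py")
    with open(path, "w") as f:
        f.write(script)
    # A new interpreter process loads every module from scratch.
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, check=True
    )

output = result.stdout.strip()
print(output)  # 3.0
```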


After building the library, you can choose to fully restart the runtime (factory reset runtime) if you want to bring the updated changes back into another Colab notebook. Note that you will need to save your code changes to the drive or a private GitHub repo.