<a href="https://colab.research.google.com/github/BradyHuai/Adventum/blob/master/5-conv2d_dw_gpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Depthwise-separable 2D Convolution on GPU

## 1. Set-up

In [20]:
# Mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [21]:
# Make sure your token is stored in a txt file at the location below.
# This way there is no risk that you will push it to your repo
# Never share your token with anyone, it is basically your github password!
with open('/content/gdrive/MyDrive/ece5545/token.txt') as f:
    token = f.readline().strip()
# Use another file to store your github username
with open('/content/gdrive/MyDrive/ece5545/git_username.txt') as f:
    handle = f.readline().strip()

In [23]:
# Clone your github repo
YOUR_TOKEN = token
YOUR_HANDLE = handle
BRANCH = "main"

%mkdir /content/gdrive/MyDrive/ece5545
%cd /content/gdrive/MyDrive/ece5545
!git clone https://{YOUR_TOKEN}@github.com/ML-HW-SYS/a3-{YOUR_HANDLE}.git
%cd /content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}
!git checkout {BRANCH}
!git pull
%cd /content/gdrive/MyDrive/ece5545

PROJECT_ROOT = f"/content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}"

mkdir: cannot create directory ‘/content/gdrive/MyDrive/ece5545’: File exists
/content/gdrive/MyDrive/ece5545
fatal: destination path 'a3-BradyHuai' already exists and is not an empty directory.
/content/gdrive/MyDrive/ece5545/a3-BradyHuai
M	src/ops.py
Already on 'main'
Your branch and 'origin/main' have diverged,
and have 1 and 1 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
hint: You have divergent branches and need to specify how to reconcile them.
hint: You can do so by running one of the following commands sometime before
hint: your next pull:
hint:
hint:   git config pull.rebase false  # merge (the default strategy)
hint:   git config pull.rebase true   # rebase
hint:   git config pull.ff only       # fast-forward only
hint:
hint: You can replace "git config" with "git config --global" to set a default
hint: preference for all repositories. You can also 

In [None]:
# This extension reloads all imports before running each cell
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
!ls {PROJECT_ROOT}

1-conv1d_cpu.ipynb  3-conv1d_fpga.ipynb  5-conv2d_dw_gpu.ipynb	README.md  tests
2-conv1d_gpu.ipynb  4-gemm_gpu.ipynb	 leaderboard_id.txt	src


## 2. Install TVM

In [None]:
!pip install tlcpack-nightly-cu102 -f https://tlcpack.ai/wheels

Looking in links: https://tlcpack.ai/wheels
Collecting tlcpack-nightly-cu102
  Downloading https://github.com/tlc-pack/tlcpack/releases/download/v0.12.dev/tlcpack_nightly_cu102-0.15.dev118%2Bg51bdaec6e-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (428.5 MB)
Installing collected packages: tlcpack-nightly-cu102
Successfully installed tlcpack-nightly-cu102-0.15.dev118+g51bdaec6e


In [None]:
!pip install numpy==1.26.4

Collecting numpy==1.26.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
Successfully installed numpy-1.26.4


In [None]:
import sys
sys.setrecursionlimit(10000000)

## 3. Implement `make_dwsp_conv2d_gpu_scheduler` function in `src.ops`

In that function, you are required to implement a depthwise 2D convolution and use TVM to optimize it.
Please use zero padding and unit stride.
You can assume the kernel size is an odd number.
The total padding along each spatial dimension equals the kernel size minus one, i.e. $(K - 1)/2$ zeros on each side.
In this case, the output image preserves the input image dimensions.
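
The padding rule above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper names `same_padding` and `conv_output_size` are hypothetical and not part of `src.ops`. It uses the standard convolution output-size formula with unit stride.

```python
def same_padding(kernel_size: int) -> int:
    """Per-side zero padding that preserves the spatial size
    for an odd kernel and unit stride (hypothetical helper)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    return (kernel_size - 1) // 2


def conv_output_size(size: int, kernel_size: int, pad: int, stride: int = 1) -> int:
    """Standard output-size formula along one spatial dimension."""
    return (size + 2 * pad - kernel_size) // stride + 1


K = 7
pad = same_padding(K)                  # 3 zeros per side, K - 1 = 6 in total
H_out = conv_output_size(16, K, pad)   # input height 16
W_out = conv_output_size(32, K, pad)   # input width 32
print(pad, H_out, W_out)               # prints: 3 16 32
```

With $K = 7$, each side gets 3 zeros, so an input of height 16 and width 32 keeps its size.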

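The `make_dwsp_conv2d_gpu_scheduler` takes the following arguments:
1. Batch size $B$;
2. Input channel size $C$;
3. Input image height $H$;
4. Input image width $W$;
5. Output number of channels $O$;
6. Kernel size $K$.

Note that the cells below call the function as `make_dwsp_conv2d_gpu_scheduler(B, C, H, W, K)` with a kernel weight of size (C, 1, K, K), so $O = C$ there.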

You should return both the TVM scheduler and the TVM operators for
1. Input tensor $x$ with size (B, C, H, W)
2. Input kernel weight $y$ with size (O, 1, K, K)
3. Output $out$ with size (B, O, H, W)

The scheduler should be usable to build a function with signature $func(x, y, out)$.
See the following cells for usage.
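
To make the expected semantics concrete, here is a minimal NumPy reference for the operation described above. It is a sketch, not the TVM implementation the assignment asks for: the function name `dw_conv2d_reference` is hypothetical, it assumes $O = C$ (one $K \times K$ filter per channel), and it computes cross-correlation without flipping the kernel, as `torch.nn.functional.conv2d` does.

```python
import numpy as np


def dw_conv2d_reference(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Depthwise 2D convolution with zero padding and unit stride.

    x: input of shape (B, C, H, W)
    w: kernel of shape (C, 1, K, K), K odd (assumes O == C)
    returns: output of shape (B, C, H, W)
    """
    B, C, H, W = x.shape
    O, D, K, K2 = w.shape
    assert O == C and D == 1 and K == K2 and K % 2 == 1
    p = (K - 1) // 2
    # Zero-pad only the two spatial dimensions.
    x_pad = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)))
    out = np.zeros((B, C, H, W), dtype=x.dtype)
    # Accumulate one shifted, per-channel-scaled copy of the input per kernel tap.
    for kh in range(K):
        for kw in range(K):
            tap = w[:, 0, kh, kw][None, :, None, None]  # shape (1, C, 1, 1)
            out += x_pad[:, :, kh:kh + H, kw:kw + W] * tap
    return out


x = np.ones((1, 2, 4, 4), dtype="float32")
w = np.ones((2, 1, 3, 3), dtype="float32")
y = dw_conv2d_reference(x, w)
print(y.shape, y[0, 0, 0, 0], y[0, 0, 1, 1])  # corner sees 4 ones, interior sees 9
```

A reference like this can be compared against the TVM result with `np.testing.assert_allclose` on small shapes before running the full test suite.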

In [None]:
import tvm
import torch.nn.functional as F
import numpy as np
import sys
# Adding assignment 3 to the system path
# Make sure this matches your git directory
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_dwsp_conv2d_gpu_scheduler

B = 3
C = 4
H = 16
W = 32
K = 7
dtype = 'float32'
a_np = np.random.rand(B, C, H, W).astype(dtype)
w_np = np.random.rand(C, 1, K, K).astype(dtype)

s, inp, ker, out = make_dwsp_conv2d_gpu_scheduler(B, C, H, W, K)
func = tvm.build(s, [inp, ker, out], "cuda")

dev = tvm.cuda(0)
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((B, C, H, W), dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

print("Output:", b)
print("2DConv TVM: %f ms" % (evaluator(a, w, b).mean * 1e3))

Output: [[[[ 3.3276675  4.8995895  6.215834  ...  4.4062138  4.0085883
     3.4619527]
   [ 4.218955   6.305473   7.5448966 ...  6.680538   5.1722975
     4.1840568]
   [ 5.9705114  7.1710777  9.073423  ...  8.2471485  5.9428086
     5.8443937]
   ...
   [ 6.224985   7.59454    8.99978   ... 10.204355   8.168089
     7.530574 ]
   [ 5.340989   5.7137794  7.9899516 ...  9.435118   7.291466
     6.326056 ]
   [ 5.5450606  4.8329124  6.316245  ...  8.300325   5.673406
     4.4685507]]

  [[ 3.4801044  4.617272   5.5128174 ...  3.9814734  3.3263366
     2.3233905]
   [ 4.7848625  6.169961   8.493502  ...  5.746636   5.2827597
     3.3354123]
   [ 5.810236   7.881957  10.092031  ...  6.555023   5.776548
     4.225312 ]
   ...
   [ 5.153788   8.277957   8.574255  ...  9.321503   8.891026
     6.355317 ]
   [ 4.8645654  7.141524   8.083791  ...  8.101811   7.695097
     5.866612 ]
   [ 4.3487816  5.14711    7.46345   ...  6.7300224  6.4446535
     3.9061525]]

  [[ 4.7784367  6.5943384  7.557

In [None]:
import numpy as np
import torch
import timeit
import torch.nn.functional as F

def make_func(*args):
    s, A, W, O = make_dwsp_conv2d_gpu_scheduler(*args)
    func = tvm.build(s, [A, W, O], "cuda")
    return func

def ans_torch(a_torch, w_torch):
    B, C, H, W = a_torch.size()
    O, D, K1, K2 = w_torch.size()
    assert K1 == K2
    assert D == 1
    K = K1

    torch.cuda.synchronize()
    b_torch = F.conv2d(
        a_torch, w_torch, bias=None, stride=1,
        padding=((K - 1)//2), dilation=1, groups=C)
    torch.cuda.synchronize()
    return b_torch

# Define dimension
B = 3
C = 4
H = 16
W = 32
K = 7
dtype = 'float32'
np.random.seed(seed=1024)
a_np = np.random.rand(B, C, H, W).astype(dtype)
w_np = np.random.rand(C, 1, K, K).astype(dtype)

n_repeat = 100

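# Torch input
# Note: torch.tensor(...) creates these tensors on the CPU; append .cuda()
# to both if the GPU path is what should be timed.
a_torch = torch.tensor(a_np).float()
w_torch = torch.tensor(w_np).float()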

# Time the torch implementation
def torch_time():
    ans_torch(a_torch, w_torch)

# Warm-up
for _ in range(10):
    torch_time()
time_torch = timeit.timeit(torch_time, number=n_repeat)

avg_time = time_torch / n_repeat

print(f"PyTorch conv2d (GPU) - Avg time over {n_repeat} runs: {avg_time * 1000:.4f} ms")

PyTorch conv2d (GPU) - Avg time over 100 runs: 0.1733 ms


In [None]:
print(tvm.lower(s, [inp, ker, out], simple_mode=True))

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((3, 4, 16, 32), "float32"), W: T.Buffer((4, 1, 7, 7), "float32"), out: T.Buffer((3, 4, 16, 32), "float32")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        inp_pad = T.allocate([10032], "float32", "global")
        inp_pad_1 = T.Buffer((10032,), data=inp_pad)
        for bb, cc in T.grid(3, 4):
            blockIdx_y = T.launch_thread("blockIdx.y", 6)
            threadIdx_y = T.launch_thread("threadIdx.y", 2)
            for ww in range(38):
                A_1 = T.Buffer((6144,), data=A.data)
                inp_pad_1[bb * 3344 + cc * 836 + blockIdx_y * 152 + threadIdx_y * 38 + ww] = T.if_then_else(3 <= blockIdx_y * 4 + threadIdx_y and blockIdx_y * 4 + threadIdx_y < 19 and 3 <= ww and ww < 35, A_1[bb * 2048 + cc * 512 + blockIdx_y * 128 + threadIdx_y * 32 + ww - 99], T.float32(0))
                

In [None]:
%cd {PROJECT_ROOT}
!python -m pytest tests/test_dwsp_2dconv_gpu.py

/content/gdrive/MyDrive/ece5545/a3-BradyHuai
platform linux -- Python 3.11.11, pytest-8.3.5, pluggy-1.5.0
rootdir: /content/gdrive/MyDrive/ece5545/a3-BradyHuai
plugins: anyio-4.9.0, typeguard-4.4.2, langsmith-0.3.23
collected 192 items

tests/test_dwsp_2dconv_gpu.py .............................................................. [ 32%]
.

In [None]:
!git config --global user.email "gh448@cornell.edu"  # update with your email
!git config --global user.name "Guo Qing Huai"   # update with your name

In [None]:
%cd /content/gdrive/MyDrive/ece5545/a3-BradyHuai/src
!git commit -am "benchmark runtime"
!git push

/content/gdrive/MyDrive/ece5545/a3-BradyHuai/src
[main c10bfb7] benchmark runtime
 2 files changed, 5 insertions(+), 13 deletions(-)
Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 2 threads
Compressing objects: 100% (6/6), done.
Writing objects: 100% (6/6), 584 bytes | 64.00 KiB/s, done.
Total 6 (delta 4), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
To https://github.com/ML-HW-SYS/a3-BradyHuai.git
   280691e..c10bfb7  main -> main
