<a href="https://colab.research.google.com/github/CaseyPYZ/CaseyPYZ.github.io/blob/master/1_conv1d_cpu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1D Convolution on CPU

## 1. Set-up 

In [2]:
# Mount google drive 
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [3]:
# Make sure your token is stored in a txt file at the location below.
# This way there is no risk that you will push it to your repo
# Never share your token with anyone, it is basically your github password!
with open('/content/gdrive/MyDrive/ece5545/token.txt') as f:
    token = f.readline().strip()
# Use another file to store your github username    
with open('/content/gdrive/MyDrive/ece5545/git_username.txt') as f:
    handle = f.readline().strip()

In [4]:
# Clone your github repo
YOUR_TOKEN = token
YOUR_HANDLE = handle
BRANCH = "main"

%mkdir /content/gdrive/MyDrive/ece5545
%cd /content/gdrive/MyDrive/ece5545
!git clone https://{YOUR_TOKEN}@github.com/ML-HW-SYS/a3-{YOUR_HANDLE}.git
%cd /content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}
!git checkout {BRANCH}
!git pull
%cd /content/gdrive/MyDrive/ece5545

PROJECT_ROOT = f"/content/gdrive/MyDrive/ece5545/a3-{YOUR_HANDLE}"

mkdir: cannot create directory ‘/content/gdrive/MyDrive/ece5545’: File exists
/content/gdrive/MyDrive/ece5545
fatal: destination path 'a3-CaseyPYZ' already exists and is not an empty directory.
/content/gdrive/MyDrive/ece5545/a3-CaseyPYZ
M	1-conv1d_cpu.ipynb
M	2-conv1d_gpu.ipynb
M	4-gemm_gpu.ipynb
M	5-conv2d_dw_gpu.ipynb
M	src/ops.py
M	tests/test_dwsp_2dconv_gpu.py
Already on 'main'
Your branch is behind 'origin/main' by 23 commits, and can be fast-forwarded.
  (use "git pull" to update your local branch)
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 15 (delta 8), reused 11 (delta 5), pack-reused 0
Unpacking objects: 100% (15/15), done.
From https://github.com/ML-HW-SYS/a3-CaseyPYZ
   2b5e8cf..5686fd4  main       -> origin/main
Updating 7b5483e..5686fd4
error: Your local changes to the following files would be overwritten by merge:
	src/ops.py
	tests/test_dwsp_2dconv_gpu.py

In [5]:
# This extension reloads all imports before running each cell
%load_ext autoreload
%autoreload 2

Verify that the following cell lists the contents of your GitHub repository.

In [6]:
!ls {PROJECT_ROOT}

1-conv1d_cpu.ipynb   4-gemm_gpu.ipynb	    README.md
2-conv1d_gpu.ipynb   5-conv2d_dw_gpu.ipynb  src
3-conv1d_fpga.ipynb  leaderboard_id.txt     tests


## 2. Install TVM

In [8]:
!pip install tlcpack-nightly-cu102 -f https://tlcpack.ai/wheels

Looking in links: https://tlcpack.ai/wheels
Collecting tlcpack-nightly-cu102
  Downloading https://github.com/tlc-pack/tlcpack/releases/download/v0.7.dev1/tlcpack_nightly_cu102-0.9.dev1049%2Bgaa5628692-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (402.4 MB)
     |████████████████████████████████| 402.4 MB 13 kB/s
Collecting synr==0.6.0
  Downloading synr-0.6.0-py3-none-any.whl (18 kB)
Installing collected packages: synr, tlcpack-nightly-cu102
Successfully installed synr-0.6.0 tlcpack-nightly-cu102-0.9.dev1049+gaa5628692


## 3. Implement the `make_conv1d_cpu_scheduler` function in `src.ops`

In that function, you are required to implement 1D convolution and use TVM to optimize it.
Let $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$, then 
$$
\operatorname{conv1d}(x, y)_i = \sum_{j=-\infty}^{\infty} x[j]\,y[i-j], \quad \forall i \in \{0, 1, \dots, m + n - 2\}
$$
where terms with out-of-range indices are taken to be zero, so the output has $m + n - 1$ elements.

Please use zero padding and unit stride. See the [NumPy `convolve` documentation](https://numpy.org/doc/stable/reference/generated/numpy.convolve.html) for more detail.

`make_conv1d_cpu_scheduler` takes $m$ and $n$, the sizes of the two 1D input arrays. 
It should return both the TVM schedule and the TVM operators for 
1. Input $x$
2. Input $y$
3. Output $out$

The schedule should be usable to build a function with signature $func(x, y, out)$. 
See the following cells for usage.
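
As a reference point for the definition above, the sum can be written as two plain Python loops. This is a minimal sketch rather than the assignment solution: `conv1d_reference` is a hypothetical helper name, and it restricts $j$ to the range where both `x[j]` and `y[i - j]` are in bounds, which is equivalent to zero padding.

```python
def conv1d_reference(x, y):
    """Full 1D convolution of two sequences (zero padding, unit stride)."""
    m, n = len(x), len(y)
    out = [0.0] * (m + n - 1)
    for i in range(m + n - 1):
        # Only indices with 0 <= j < m and 0 <= i - j < n contribute;
        # everything outside those bounds is treated as zero.
        j_lo = max(0, i - n + 1)
        j_hi = min(m - 1, i)
        for j in range(j_lo, j_hi + 1):
            out[i] += x[j] * y[i - j]
    return out


result = conv1d_reference([1.0, 2.0, 3.0], [0.0, 1.0, 0.5])
print(result)  # [0.0, 1.0, 2.5, 4.0, 1.5]
```

The inputs are the ones used in the NumPy `convolve` documentation, so the printed values can be checked against `np.convolve` directly.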

### **Baseline**

In [None]:
import tvm
import numpy as np
import sys
# Adding assignment 3 to the system path
# Make sure this matches your git directory
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_conv1d_cpu_scheduler

M = 4096
N = 128
dtype = 'float32'
a_np = np.random.rand(M).astype(dtype)
w_np = np.random.rand(N).astype(dtype)
b_np = np.convolve(a_np, w_np)

s, A, W, B = make_conv1d_cpu_scheduler(M, N)
func = tvm.build(s, [A, W, B], "llvm")

dev = tvm.cpu()
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((M+N-1), dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

print("Answer:", b_np)
print("Output:", b)
print("1D conv TVM runtime: %f ms" % (evaluator(a, w, b).mean * 1e3))

Answer: [0.22781187 0.5334949  0.49527073 ... 0.51525605 0.09855287 0.01734382]
Output: [0.22781187 0.5334949  0.49527073 ... 0.51525605 0.09855287 0.01734382]
1D conv TVM runtime: 48.798321 ms


In [None]:
print(tvm.lower(s, [A, W, B], simple_mode=True))

@main = primfn(A_1: handle, W_1: handle, B_1: handle) -> ()
  attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
  buffers = {A: Buffer(A_2: Pointer(float32), float32, [4096], []),
             W: Buffer(W_2: Pointer(float32), float32, [128], []),
             B: Buffer(B_2: Pointer(float32), float32, [4223], [])}
  buffer_map = {A_1: A, W_1: W, B_1: B} {
  for (n: int32, 0, 4223) {
    B[n] = 0f32
    for (k: int32, 0, 4223) {
      let cse_var_1: int32 = (n - k)
      B[n] = (B[n] + @tir.if_then_else((((4096 <= k) || (cse_var_1 < 0)) || (128 <= cse_var_1)), 0f32, (A[k]*W[cse_var_1]), dtype=float32))
    }
  }
}
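
The lowered loop nest above reduces over all 4223 values of `k` for each output element and masks out-of-range terms with `if_then_else`. A plain-Python transcription can help when reading the IR. This is a sketch with the hypothetical name `conv1d_guarded`, and it also counts how many iterations actually pass the guard.

```python
def conv1d_guarded(a, w):
    """Transcription of the lowered baseline loop nest, plus an iteration count."""
    m, n = len(a), len(w)
    size = m + n - 1
    b = [0.0] * size
    useful = 0  # iterations where the guard lets a product through
    for i in range(size):
        for k in range(size):
            d = i - k  # plays the role of cse_var_1 in the IR
            if k >= m or d < 0 or d >= n:
                continue  # the IR adds 0f32 here instead of skipping
            b[i] += a[k] * w[d]
            useful += 1
    return b, useful, size * size


b, useful, total = conv1d_guarded([1.0, 2.0, 3.0], [0.0, 1.0, 0.5])
print(b, useful, total)
```

For the notebook's sizes (M = 4096, N = 128), the same count gives 4096 × 128 = 524,288 products that pass the guard out of 4223² = 17,833,729 iterations, which suggests that tightening the reduction range is worth considering.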




In [None]:
%cd {PROJECT_ROOT}
!python -m pytest tests/test_1dconv_cpu.py

/content/gdrive/MyDrive/ece5545/a3-CaseyPYZ
platform linux -- Python 3.7.13, pytest-3.6.4, py-1.11.0, pluggy-0.7.1
rootdir: /content/gdrive/MyDrive/ece5545/a3-CaseyPYZ, inifile:
plugins: typeguard-2.7.1
collected 15 items

tests/test_1dconv_cpu.py ...............                                 [100%]



### **Opt #1: Parallel**

Thread-level parallelism
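
To illustrate the idea outside TVM: every output element of the convolution is independent, so the output index range can be cut into chunks that run on separate threads. The sketch below uses Python's `concurrent.futures` purely as an illustration. The helper names are hypothetical, TVM's `parallel` schedule primitive generates native threaded code instead, and CPython threads are not expected to speed up this pure-Python loop.

```python
from concurrent.futures import ThreadPoolExecutor


def conv1d_chunk(x, y, start, stop):
    """Compute output elements start..stop-1 of the full convolution."""
    m, n = len(x), len(y)
    out = []
    for i in range(start, stop):
        acc = 0.0
        for j in range(max(0, i - n + 1), min(m - 1, i) + 1):
            acc += x[j] * y[i - j]
        out.append(acc)
    return out


def conv1d_parallel(x, y, num_workers=4):
    """Split the output range into contiguous chunks, one task per chunk."""
    size = len(x) + len(y) - 1
    step = -(-size // num_workers)  # ceiling division
    bounds = [(s, min(s + step, size)) for s in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        chunks = pool.map(lambda b: conv1d_chunk(x, y, b[0], b[1]), bounds)
        # pool.map preserves task order, so chunks concatenate in index order.
        return [v for chunk in chunks for v in chunk]


parallel_result = conv1d_parallel([1.0, 2.0, 3.0], [0.0, 1.0, 0.5], num_workers=2)
print(parallel_result)
```

Because no two chunks write the same output element, no locking is needed; the same property is what makes the output axis a natural candidate for parallelization.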

In [None]:
import tvm
import numpy as np
import sys
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_conv1d_cpu_scheduler

M = 4096
N = 128
dtype = 'float32'
a_np = np.random.rand(M).astype(dtype)
w_np = np.random.rand(N).astype(dtype)
b_np = np.convolve(a_np, w_np)
s, A, W, B = make_conv1d_cpu_scheduler(M, N)
func = tvm.build(s, [A, W, B], "llvm")

dev = tvm.cpu()
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((M+N-1), dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

print("Answer:", b_np)
print("Output:", b)
print("1D conv TVM runtime: %f ms" % (evaluator(a, w, b).mean * 1e3))

Opt #1: Parallel
Answer: [0.15840437 0.49753007 0.5180965  ... 0.19801329 0.341344   0.04735574]
Output: [0.15840437 0.49753007 0.5180965  ... 0.1980133  0.341344   0.04735574]
1D conv TVM runtime: 19.437134 ms


### **Opt #2: Parallel + Vectorization**

> * Leverages the SIMD features of CPUs
> * Define a _factor_
> * Split the original 1-D for-loop into a 2-level nested for-loop (outer and inner); _factor_ is the number of iterations of the inner loop
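
The split described above can be sketched in plain Python. This is only an illustration of the index arithmetic, with hypothetical names (`split_indices`, `factor`). In a TVM schedule the inner loop would then be marked for vectorization, and the tail guard below is an assumption about how to handle lengths that are not a multiple of the factor.

```python
def split_indices(length, factor):
    """Yield (outer, inner, flat) for a loop of `length` split by `factor`."""
    n_outer = -(-length // factor)  # ceiling division
    for outer in range(n_outer):
        for inner in range(factor):  # the inner loop runs `factor` iterations
            flat = outer * factor + inner
            if flat < length:  # guard the tail when factor does not divide length
                yield outer, inner, flat


visited = [flat for _, _, flat in split_indices(10, 4)]
print(visited)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each original index is visited exactly once and in order, so the split changes only the loop structure and not the result.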

In [None]:
import tvm
import numpy as np
import sys
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_conv1d_cpu_scheduler

M = 4096
N = 128
dtype = 'float32'
a_np = np.random.rand(M).astype(dtype)
w_np = np.random.rand(N).astype(dtype)
b_np = np.convolve(a_np, w_np)
s, A, W, B = make_conv1d_cpu_scheduler(M, N)
func = tvm.build(s, [A, W, B], "llvm")

dev = tvm.cpu()
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((M+N-1), dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

print("Answer:", b_np)
print("Output:", b)
print("1D conv TVM runtime: %f ms" % (evaluator(a, w, b).mean * 1e3))

Opt #2: Vectorization
Opt #3: Reorder
Answer: [0.23211165 0.69462794 0.9569759  ... 0.766244   0.7191915  0.55083513]
Output: [0.23211165 0.69462794 0.9569759  ... 0.766244   0.7191915  0.55083513]
1D conv TVM runtime: 17.849412 ms


In [None]:
%cd {PROJECT_ROOT}
!python -m pytest tests/test_1dconv_cpu.py

/content/gdrive/MyDrive/ece5545/a3-CaseyPYZ
platform linux -- Python 3.7.13, pytest-3.6.4, py-1.11.0, pluggy-0.7.1
rootdir: /content/gdrive/MyDrive/ece5545/a3-CaseyPYZ, inifile:
plugins: typeguard-2.7.1
collected 15 items

tests/test_1dconv_cpu.py ...............                                 [100%]



### **Opt #3: Reorder**

In [87]:
import tvm
import numpy as np
import sys
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_conv1d_cpu_scheduler

M = 4096
N = 128
dtype = 'float32'
a_np = np.random.rand(M).astype(dtype)
w_np = np.random.rand(N).astype(dtype)
b_np = np.convolve(a_np, w_np)
s, A, W, B = make_conv1d_cpu_scheduler(M, N)
func = tvm.build(s, [A, W, B], "llvm")

dev = tvm.cpu()
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((M+N-1), dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

print("Answer:", b_np)
print("Output:", b)
print("1D conv TVM runtime: %f ms" % (evaluator(a, w, b).mean * 1e3))

Opt #3: Parallel + Vectorization
Answer: [0.00782458 0.0917494  0.3793432  ... 0.70045257 0.3891695  0.0875129 ]
Output: [0.00782458 0.0917494  0.3793432  ... 0.70045257 0.3891695  0.0875129 ]
1D conv TVM runtime: 15.466732 ms


In [None]:
%cd {PROJECT_ROOT}
!python -m pytest tests/test_1dconv_cpu.py

/content/gdrive/MyDrive/ece5545/a3-CaseyPYZ
platform linux -- Python 3.7.13, pytest-3.6.4, py-1.11.0, pluggy-0.7.1
rootdir: /content/gdrive/MyDrive/ece5545/a3-CaseyPYZ, inifile:
plugins: typeguard-2.7.1
collected 15 items

tests/test_1dconv_cpu.py ...............                                 [100%]



### **Opt #4: Parallel + Vectorization + Cache handling**

In [10]:
import tvm
import numpy as np
import sys
sys.path.insert(0, PROJECT_ROOT)
from src.ops import make_conv1d_cpu_scheduler

M = 4096
N = 128
dtype = 'float32'
a_np = np.random.rand(M).astype(dtype)
w_np = np.random.rand(N).astype(dtype)
b_np = np.convolve(a_np, w_np)
s, A, W, B = make_conv1d_cpu_scheduler(M, N)
func = tvm.build(s, [A, W, B], "llvm")

dev = tvm.cpu()
a = tvm.nd.array(a_np, dev)
w = tvm.nd.array(w_np, dev)
b = tvm.nd.array(np.zeros((M+N-1), dtype), dev)
func(a, w, b)
evaluator = func.time_evaluator(func.entry_name, dev, number=1, repeat=1)

print("Answer:", b_np)
print("Output:", b)
print("1D conv TVM runtime: %f ms" % (evaluator(a, w, b).mean * 1e3))

Answer: [0.5386432  0.63778734 0.8827023  ... 1.3236036  0.53817993 0.86101097]
Output: [0.5386432  0.63778734 0.8827023  ... 0.         0.         0.        ]
1D conv TVM runtime: 0.363911 ms


In [11]:
%cd {PROJECT_ROOT}
!python -m pytest tests/test_1dconv_cpu.py

/content/gdrive/MyDrive/ece5545/a3-CaseyPYZ
platform linux -- Python 3.7.13, pytest-3.6.4, py-1.11.0, pluggy-0.7.1
rootdir: /content/gdrive/MyDrive/ece5545/a3-CaseyPYZ, inifile:
plugins: typeguard-2.7.1
collected 15 items

tests/test_1dconv_cpu.py .....FFFFFFFFFF                                 [100%]

_____________________________ test1_Mvar_N1024[1] ______________________________

execution_number = 1

    @pytest.mark.parametrize('execution_number', [1, 10, 100, 1000, 10000])
    def test1_Mvar_N1024(execution_number):
        # Define dimension
        M = execution_number
        N = 1024
        func = make_conv1d_cpu_func(M, N)

        # Create random test data
        np.random.seed(seed=1024)
        a_np = np.random.rand(M).astype(np.float32)
        w_np = np.random.rand(N).astype(np.float32)
[1m 