### This notebook evaluates a custom linear layer written in CUDA C
We aim to check the following aspects:
* forward pass
* backward pass
* speed compared to PyTorch on the CPU
* integration for training networks

First, let's import PyTorch's built-in `Linear` layer and our custom linear layer:

In [None]:
import torch
import torch.nn as nn
import custom_linear
from custom_linear_layer import CustomLinearLayer
from time import time

Next, let's create some random data to work with:

In [None]:
in_features = 8
out_features = 3
batch_size = 32
device = "cuda" # the implementation is in CUDA C++ so it only works on CUDA devices, don't mix with CPU tensors

In [None]:
X = torch.rand(batch_size, in_features, device=device)
W = torch.rand(out_features, in_features, device=device)  
b = torch.rand(out_features, device=device) 

For a fair comparison, both implementations must use the same weights and biases, so we copy `W` and `b` into the PyTorch layer:

In [None]:
torch_linear = nn.Linear(in_features, out_features)
torch_linear.weight.data = W
torch_linear.bias.data = b
torch_linear.to(device)

# the output of PyTorch
Y_torch = torch_linear(X)

In [None]:
Y_torch
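
For reference, `nn.Linear` computes `Y = X @ W.T + b`, which is why `W` is created with shape `(out_features, in_features)`. The sketch below spells that out in plain Python on nested lists. The name `linear_forward` is hypothetical and is not part of `custom_linear`; it only illustrates the arithmetic that both implementations are expected to perform.

```python
def linear_forward(X, W, b):
    """Plain-Python reference for Y = X @ W.T + b.

    X: list of rows, each of length in_features
    W: list of out_features rows, each of length in_features
    b: list of length out_features
    """
    return [
        [sum(x_i * w_i for x_i, w_i in zip(row, w_row)) + b_j
         for w_row, b_j in zip(W, b)]
        for row in X
    ]


# one sample with in_features = 2 and out_features = 3
X_small = [[1.0, 2.0]]
W_small = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_small = [0.5, 0.5, 0.5]

Y_small = linear_forward(X_small, W_small, b_small)
print(Y_small)  # [[1.5, 2.5, 3.5]]
```

Each output column is the dot product of the input row with one row of `W`, plus the matching bias entry.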

In CUDA C we have written two functions: forward and backward <br>
First, let's check the forward correctness:

In [None]:
# the output of our CUDA kernel
Y_custom = custom_linear.forward(X, W, b)

In [None]:
Y_custom

In [None]:
Y_torch.shape == Y_custom.shape, (Y_torch == Y_custom).all().item(), ((Y_torch - Y_custom)**2).mean().item()
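
A note on the exact `==` comparison above: it can report `False` even when both implementations are correct, because floating-point addition is not associative and two kernels may accumulate the same products in a different order. The plain-Python sketch below shows the effect on three numbers and uses a tolerance-based check instead. For tensors, `torch.allclose` plays the same role as `math.isclose` does here.

```python
import math

# The same three numbers, summed with different grouping.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Exact equality fails because of rounding in the intermediate sums...
exactly_equal = left_to_right == right_to_left

# ...while a relative-tolerance comparison treats the results as equal.
close_enough = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)

print(exactly_equal, close_enough)  # False True
```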

So the forward pass seems fine. Now let's check the backward pass:

In [None]:

# named custom_layer so it does not shadow the imported custom_linear module
custom_layer = CustomLinearLayer(in_features, out_features)
custom_layer.weight.data = W
custom_layer.bias.data = b

torch_linear = nn.Linear(in_features, out_features)
torch_linear.weight.data = W
torch_linear.bias.data = b

y_custom = custom_layer(X)
L_custom = y_custom.sum()

y_torch = torch_linear(X)
L_torch = y_torch.sum()

L_custom.backward()
L_torch.backward()

# mean squared difference between the weight gradients and between the bias gradients
((custom_layer.weight.grad - torch_linear.weight.grad) ** 2).mean().item(), ((custom_layer.bias.grad - torch_linear.bias.grad) ** 2).mean().item()
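
For the loss used in this cell, the sum of all outputs, the expected gradients have a simple closed form. The derivation below assumes the standard linear-layer definition with $N$ = `batch_size`, $n$ indexing samples, $i$ indexing input features and $j$ indexing output features:

```latex
\begin{aligned}
Y_{nj} &= \sum_{i} X_{ni} W_{ji} + b_j, \qquad
L = \sum_{n=1}^{N} \sum_{j} Y_{nj} \\
\frac{\partial L}{\partial W_{ji}} &= \sum_{n=1}^{N} X_{ni}, \qquad
\frac{\partial L}{\partial b_j} = N, \qquad
\frac{\partial L}{\partial X_{ni}} = \sum_{j} W_{ji}
\end{aligned}
```

So every row of the weight gradient should equal the column sums of `X`, and every entry of the bias gradient should equal the batch size (32 here). This gives a second, framework-independent check in addition to the comparison with PyTorch.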

The difference is very small and comes from floating-point rounding. <br>
Now for something more interesting: let's see if we can use our linear layer to train a model!

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from custom_linear_layer import CustomLinearLayer


def f(x):
    return 2*x + 3

x = np.linspace(0, 10, 100)
y = f(x) + np.random.randn(100)

X = torch.tensor(x[:, None], dtype=torch.float32)
Y = torch.tensor(y[:, None], dtype=torch.float32)

model = nn.Sequential(CustomLinearLayer(1, 1))

X = X.to('cuda')
Y = Y.to('cuda')
model = model.to('cuda')

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for i in range(10):
    optimizer.zero_grad()
    Y_hat = model(X)

    loss = criterion(Y_hat, Y)
    loss.backward()
    
    optimizer.step()

plt.scatter(x, y)
plt.plot(x, Y_hat.cpu().detach().numpy(), color='red')
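
To judge how close the fitted red line is to the best possible one, it can help to compare against the closed-form least-squares solution. The sketch below is plain Python and independent of the notebook state. The helper `fit_line` is a hypothetical name, and the data is a noise-free version of the `2*x + 3` line on a grid that mirrors `np.linspace(0, 10, 100)`.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# 100 evenly spaced points on [0, 10], noise-free targets
xs = [10 * i / 99 for i in range(100)]
ys = [2 * x + 3 for x in xs]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # close to 2 and 3
```

With the noisy `y` from the cell above, the same function gives the slope and intercept that the trained weight and bias should approach as training runs for more iterations.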

Finally, let's compare execution times. Our implementation only supports the CUDA backend, so for the CPU we will use the original PyTorch implementation:

In [None]:
in_features = 256
out_features = 128
batch_size = 64
iterations = 1000
device = "cuda"

# again some random data to work with:
X_CPU = torch.rand(batch_size, in_features)
W_CPU = torch.rand(out_features, in_features)
b_CPU = torch.rand(out_features)
X_GPU = X_CPU.to(device)
W_GPU = W_CPU.to(device)
b_GPU = b_CPU.to(device)

# named custom_layer so it does not shadow the imported custom_linear module
custom_layer = CustomLinearLayer(in_features, out_features)
torch_linear = nn.Linear(in_features, out_features)

custom_layer.weight.data = W_GPU
custom_layer.bias.data = b_GPU

torch_linear.weight.data = W_CPU
torch_linear.bias.data = b_CPU

avg_gpu_time, avg_cpu_time = 0, 0
for i in range(iterations):

    # forward pass
    start_gpu = torch.cuda.Event(enable_timing=True)
    end_gpu = torch.cuda.Event(enable_timing=True)

    start_cpu = time()
    Y_torch = torch_linear(X_CPU)
    end_cpu = time()

    avg_cpu_time += (end_cpu - start_cpu) * 1000  # seconds -> milliseconds

    start_gpu.record()
    Y_custom = custom_layer(X_GPU)
    end_gpu.record()

    torch.cuda.synchronize() # GPU is running async
    avg_gpu_time += start_gpu.elapsed_time(end_gpu)

print(f"Average time for GPU: {avg_gpu_time / iterations} [ms]")
print(f"Average time for CPU: {avg_cpu_time / iterations} [ms]")
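
The CPU half of the measurement above can be made a little more robust with a few warm-up calls and a monotonic high-resolution clock. The sketch below is a generic helper in plain Python. The name `average_ms` and its parameters are hypothetical, and the workload is a stand-in for `torch_linear(X_CPU)`.

```python
from time import perf_counter


def average_ms(fn, iterations=1000, warmup=10):
    """Average wall-clock time of fn() in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the mean.
    for _ in range(warmup):
        fn()

    start = perf_counter()
    for _ in range(iterations):
        fn()
    elapsed_s = perf_counter() - start

    return elapsed_s * 1000 / iterations


# stand-in workload; in the notebook this would be the CPU forward pass
ms = average_ms(lambda: sum(range(1000)), iterations=100, warmup=5)
print(f"Average time: {ms} [ms]")
```

This pattern is only valid for synchronous CPU work. GPU kernels run asynchronously, so a wall-clock timer would need a `torch.cuda.synchronize()` before each clock read, or CUDA events as used in the cell above.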