# Custom Pooling Operation in PyTorch.

The goal of this tutorial is to learn how to create a pooling operation from scratch.
We implement max pooling in two different ways: with the PyTorch API in Python, and as our own C++ extension.
Both implementations are compared against the native max pool op.

## Data - CIFAR-100

We use CIFAR-100 as our data set.


In [1]:
import math
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from scipy import stats
from sklearn.metrics import f1_score
from torch.nn.modules.utils import _pair, _quadruple


transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR100(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=8)

testset = torchvision.datasets.CIFAR100(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=8)

Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ./data/cifar-100-python.tar.gz
Files already downloaded and verified


## Custom Max Pool using PyTorch in pure Python.

Reference: https://gist.github.com/rwightman/f2d3849281624be7c0f11c85c87c1598


We use existing PyTorch functions and tensor operations, all of which are tracked by autograd. There is no need to implement backpropagation by hand, as it comes for free with autograd.
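
To illustrate the point about autograd, here is a minimal sketch on a toy 4x4 input (the tensor and window size are chosen only for illustration). It builds max pooling from `unfold` and `max`, checks the result against `F.max_pool2d`, and lets autograd produce the gradient without any hand-written backward pass:

```python
import torch
import torch.nn.functional as F

# A 4x4 single-channel image; requires_grad_ makes autograd track operations on it.
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4).requires_grad_()

# Extract non-overlapping 2x2 windows along height (dim 2) and width (dim 3).
windows = x.unfold(2, 2, 2).unfold(3, 2, 2)        # shape (1, 1, 2, 2, 2, 2)
flat = windows.contiguous().view(1, 1, 2, 2, -1)   # shape (1, 1, 2, 2, 4)
pooled, _ = flat.max(dim=-1)                       # shape (1, 1, 2, 2)

# The forward result agrees with the built-in op.
matches_native = torch.equal(pooled, F.max_pool2d(x, kernel_size=2, stride=2))

# Backward pass: each window routes its gradient to the position of its maximum.
pooled.sum().backward()
grad = x.grad
```

After `backward()`, `grad` is non-zero only at the position of the maximum in each window.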

In [2]:
class MaxPool(nn.Module):

    def __init__(self, kernel_size=3, stride=1, padding=0, same=False, timing=False):
        super(MaxPool, self).__init__()
        self.k = _pair(kernel_size)
        self.stride = _pair(stride)
        self.padding = _quadruple(padding)  # convert to l, r, t, b
        self.same = same
        self.timing = timing

    def _padding(self, x):
        if self.timing:
            start = time.time()
        if self.same:
            ih, iw = x.size()[2:]
            if ih % self.stride[0] == 0:
                ph = max(self.k[0] - self.stride[0], 0)
            else:
                ph = max(self.k[0] - (ih % self.stride[0]), 0)
            if iw % self.stride[1] == 0:
                pw = max(self.k[1] - self.stride[1], 0)
            else:
                pw = max(self.k[1] - (iw % self.stride[1]), 0)
            pl = pw // 2
            pr = pw - pl
            pt = ph // 2
            pb = ph - pt
            padding = (pl, pr, pt, pb)
        else:
            padding = self.padding
        if self.timing:
            print('_pad {}'.format(time.time() - start))
        return padding
    
    def forward(self, x):
        if self.timing:
            start = time.time()
        x = F.pad(x, self._padding(x), mode='reflect')
        if self.timing:
            print('pad {}'.format(time.time() - start))
            start = time.time()
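        # Slide a window over height (dim 2) and width (dim 3); result is (N, C, oh, ow, kh, kw).
        x = x.unfold(2, self.k[0], self.stride[0]).unfold(3, self.k[1], self.stride[1])
        if self.timing:
            print('unfold {}'.format(time.time() - start))
            # Print the target shape before restarting the timer so it is not counted in 'view'.
            print('x.size()[:4] {}'.format(x.size()[:4] + (-1,)))
            start = time.time()
        # Flatten each kh x kw window into the last dimension.
        x = x.contiguous().view(x.size()[:4] + (-1,))
        if self.timing:
            print('view {}'.format(time.time() - start))
            start = time.time()
        # The maximum over the flattened window is the pooled value.
        pool, indices = torch.max(x, dim=-1)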
        if self.timing:
            print('max {}'.format(time.time() - start))
        
        return pool
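
The `same` branch of `_padding` can be traced with plain integers. The following is a minimal sketch of the same arithmetic for a single dimension; `same_pad_1d` and `out_size` are hypothetical helpers introduced here for illustration and are not part of the class:

```python
def same_pad_1d(size, kernel, stride):
    """Return (before, after) padding, following the rule used in MaxPool._padding."""
    if size % stride == 0:
        total = max(kernel - stride, 0)
    else:
        total = max(kernel - (size % stride), 0)
    before = total // 2
    return before, total - before


def out_size(size, kernel, stride, pad):
    """Number of windows that fit after padding one dimension."""
    return (size + sum(pad) - kernel) // stride + 1


pad_even = same_pad_1d(32, kernel=3, stride=2)   # input length divisible by the stride
pad_odd = same_pad_1d(7, kernel=3, stride=2)     # input length not divisible by the stride

out_even = out_size(32, 3, 2, pad_even)
out_odd = out_size(7, 3, 2, pad_odd)
```

In both cases the output length equals the input length divided by the stride, rounded up.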

## Max Pool as a C++ extension.

To create a C++ extension you need the following files:
- pooling.cpp, which contains the C++ (PyTorch API) implementation of your custom operation. At the end of this file you also need to declare the Python binding as follows:

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
      m.def("max_pool", &max_pool, "CPPPool max_pool");
    }

- setup.py, which is responsible for building the Python extension. From the command line, after installing the C++ version of PyTorch, run: python setup.py install
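
The original setup.py is not shown in this tutorial. As a rough sketch, a build script for this layout could look like the following, assuming the source file is `pooling.cpp` and the module is imported as `pooling_cpp` (the name used by the wrapper later on). It relies on the `CppExtension` and `BuildExtension` helpers from `torch.utils.cpp_extension`:

```python
# setup.py (sketch; file and module names are assumptions based on this tutorial)
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name='pooling_cpp',
    ext_modules=[
        # Compile pooling.cpp into an importable module named pooling_cpp.
        CppExtension('pooling_cpp', ['pooling.cpp']),
    ],
    # BuildExtension supplies the compiler and linker flags PyTorch needs.
    cmdclass={'build_ext': BuildExtension},
)
```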

The C++ implementation in pooling.cpp consists of two functions, 'max_pool' and 'same_padding'.
A more efficient implementation would write the operation from scratch at the CUDA level, but that would require handling backpropagation as well.

MaxPoolCpp is our Python wrapper that calls the C++ extension.


In [3]:
import pooling_cpp

class MaxPoolCpp(nn.Module):

    def __init__(self, kernel_size=3, stride=1, padding=0, same=False):
        super(MaxPoolCpp, self).__init__()
        self.k = _pair(kernel_size)
        self.stride = _pair(stride)
        self.padding = _quadruple(padding)  # convert to l, r, t, b
        self.same = same
    
    def forward(self, x):
        pool = pooling_cpp.max_pool(x, 2, torch.tensor(list(self.k)), torch.tensor(list(self.stride)), self.same, list(self.padding))        
        return pool
    

## Loading images and placing them in a tensor.

In [4]:
# Recent torchvision releases expose the raw images as `data`;
# older releases used `train_data`.
print(trainset.data.shape)
imgs = trainset.data
imgs = torch.Tensor(imgs)
print(imgs.shape)

(50000, 32, 32, 3)
torch.Size([50000, 32, 32, 3])
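
Note that this tensor is laid out as (N, H, W, C), whereas PyTorch's 2D pooling modules interpret a 4D input as (N, C, H, W). The following is a minimal sketch, on a small stand-in tensor rather than the real data, of how the layout could be converted so that pooling runs over the spatial dimensions:

```python
import torch
import torch.nn as nn

# Stand-in for a few CIFAR images in (N, H, W, C) layout.
nhwc = torch.zeros(8, 32, 32, 3)

# Move the channel axis to position 1 to obtain (N, C, H, W).
nchw = nhwc.permute(0, 3, 1, 2).contiguous()

pool = nn.MaxPool2d(2, 2)
pooled_nchw = pool(nchw)   # pools over the 32x32 spatial grid
pooled_nhwc = pool(nhwc)   # treats the input as 32 channels of 32x3 "images"

print(pooled_nchw.shape)
print(pooled_nhwc.shape)
```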


## Evaluating pure pooling performance.

The following code applies the bare pooling operation to the CIFAR images and evaluates the native, custom (Python) and C++ versions of pooling.
The two main functions are test_pool_results and test_pool_perf. The former compares the results of two pooling operations, validating our custom implementations. The latter measures the time it takes to apply max pooling to the given data. Some representative results are: native 0.519s, custom 1.217s, cpp 1.227s.

The custom model implemented in pure Python does not differ from the C++ implementation in terms of performance. Sometimes the C++ version is even slightly slower, but the difference is minor. Max pooling is a simple operation, and this might change for more complicated ones.

Compared with the native op (a dedicated built-in kernel), the custom version is more than 2 times slower: custom/native time: 2.35

In [5]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

def test_pool_results(pool1, pool2, name1, name2, img):
    are_equal = torch.all(torch.eq(pool1, pool2))
    print("Comparing {} vs {}: {}".format(name1, name2, are_equal))
    return are_equal

def test_pool_perf(pool, name, img, device):
    start = time.time()
    pool.to(device)
    result = pool(img)
    duration = time.time() - start
    return duration, result

native_dur = list()
custom_dur = list()
cpp_dur = list()
for i in range(0, 30):
    img = imgs
    dur, native = test_pool_perf(nn.MaxPool2d(2, 2), 'native', img, device)
    native_dur.append(dur)
    dur, custom = test_pool_perf(MaxPool(2, 2), 'custom', img, device)
    custom_dur.append(dur)
    dur, cpp = test_pool_perf(MaxPoolCpp(2, 2), 'cpp', img, device)
    cpp_dur.append(dur)

test_pool_results(native, custom, 'native', 'custom', img)
test_pool_results(native, cpp, 'native', 'cpp', img)

avg = lambda x: sum(x)/len(x)

print('\nnative {}'.format(avg(native_dur)))
print('custom {}'.format(avg(custom_dur)))
print('cpp {}'.format(avg(cpp_dur)))
print('custom/native time: {}'.format(round(avg(custom_dur)/avg(native_dur), 2)))

cuda:0
Comparing native vs custom: 1
Comparing native vs cpp: 1

native 0.5481758673985799
custom 1.29437153339386
cpp 1.2856293042500815
custom/native time: 2.36
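
In `test_pool_perf` only the module is moved to `device`; the input tensor stays where it was created, so the pooling itself runs on that tensor's device. The following sketch shows a timing helper that also moves the input, performs a warm-up call, and synchronizes before reading the clock, since CUDA kernels are launched asynchronously. `time_pool` and `repeats` are hypothetical names introduced here, not part of the code above:

```python
import time

import torch
import torch.nn as nn


def time_pool(pool, img, device, repeats=5):
    """Return (average seconds per call, last result) for pool(img) on device."""
    pool = pool.to(device)
    img = img.to(device)
    with torch.no_grad():
        pool(img)  # warm-up call, excluded from the measurement
        if device.type == 'cuda':
            torch.cuda.synchronize()  # wait for pending kernels before timing
        start = time.perf_counter()
        for _ in range(repeats):
            result = pool(img)
        if device.type == 'cuda':
            torch.cuda.synchronize()  # make sure the timed kernels have finished
        elapsed = time.perf_counter() - start
    return elapsed / repeats, result


duration, pooled = time_pool(nn.MaxPool2d(2, 2), torch.rand(4, 3, 32, 32), torch.device('cpu'))
```

Averaging over several calls after a warm-up reduces the influence of one-off start-up costs on the measurement.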


## Performance evaluation using a small neural network with 2 pooling operations.

In [6]:
CUSTOM = 'custom'
NATIVE = 'native'
CPP = 'cpp'

class Net2Pool(nn.Module):
    def __init__(self, num_classes=10, pooling=NATIVE):
        super(Net2Pool, self).__init__()
        self.conv1 = nn.Conv2d(3, 50, 5, 1)
        self.conv2 = nn.Conv2d(50, 50, 5, 1)
        
        # Compare strings by value; `is` tests object identity and is not reliable here.
        if pooling == NATIVE:
            self.pool = nn.MaxPool2d(2, 2)
        elif pooling == CUSTOM:
            self.pool = MaxPool(2, 2)
        elif pooling == CPP:
            self.pool = MaxPoolCpp(2, 2)
      
        self.fc1 = nn.Linear(5*5*50, 500)
        self.fc2 = nn.Linear(500, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        
        x = F.relu(self.conv2(x))
        x = self.pool(x)

        x = x.view(-1, 5*5*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

def configure_net(net, device):
    net.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    return net, optimizer, criterion

def train(net, optimizer, criterion, trainloader, device, epochs=10, logging=2000):
    for epoch in range(epochs):  
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            start = time.time()
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
        
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % logging == logging - 1:    
                print('[%d, %5d] loss: %.3f duration: %.5f' %
                      (epoch + 1, i + 1, running_loss / logging, time.time() - start))
                running_loss = 0.0
                
    print('Finished Training')

def test(net, testloader, device):
    correct = 0
    total = 0
    predictions = []
    l = []
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)

            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            predictions.extend(predicted.cpu().numpy())
            l.extend(labels.cpu().numpy())


    print('Accuracy: {}'.format(100 * correct / total))
    print('Micro {}'.format(f1_score(l, predictions, average='micro')))
    print('Macro {}'.format(f1_score(l, predictions, average='macro')))
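
The `5*5*50` input size of `fc1` follows from the layer shapes in `Net2Pool`. The following is a small sketch of that arithmetic for a 32x32 input, using the standard output-size formula for unpadded convolution and pooling; the helper names are hypothetical:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Output length of an unpadded pooling window along one spatial dimension."""
    return (size - kernel) // stride + 1


size = 32                      # CIFAR images are 32x32
size = conv_out(size, 5)       # conv1: 5x5 kernel, stride 1
size = pool_out(size, 2, 2)    # first 2x2 max pool, stride 2
size = conv_out(size, 5)       # conv2: 5x5 kernel, stride 1
size = pool_out(size, 2, 2)    # second 2x2 max pool, stride 2

flat_features = size * size * 50   # 50 output channels from conv2
print(size, flat_features)
```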



## Performance evaluation results

The performance evaluation is two-fold. It compares the **training time** of the neural network and the **inference time** for each of the three pooling operations.

The average training times per epoch (in seconds) for each pooling operation are:
- _native_ = 27
- _custom_ = 41
- _cpp_ = 38

These results can vary and in general the **custom and cpp models are almost equivalent**. The neural net using the native max pool was the fastest, taking roughly two thirds to three quarters of the per-epoch training time of the other two.

Regarding inference time, the difference is small, e.g. native = 4.34, custom = 4.70, cpp = 4.68.

In [8]:
EPOCHS = 20
LOGGING = 15000
NUM_CLASSES = 100

DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(DEVICE)

def measure_training_time(pooling):
    net, optimizer, criterion = configure_net(Net2Pool(num_classes=NUM_CLASSES, pooling=pooling), DEVICE)
    start = time.time()
    train(net, optimizer, criterion, trainloader, DEVICE, epochs=EPOCHS, logging=LOGGING)
    print('Average training time per epoch of {}: {}'.format(pooling, (time.time() - start)/EPOCHS))
    start = time.time()
    test(net, testloader, DEVICE)
    print('Testing time of {}: {}'.format(pooling, time.time() - start))
    
measure_training_time(NATIVE)
measure_training_time(CUSTOM)
measure_training_time(CPP)

cuda:0
Finished Training
Average training time per epoch of native: 32.15662657022476
Accuracy: 31.11
Micro 0.3111
Macro 0.31087129648577155
Testing time of native: 4.227973461151123
Finished Training
Average training time per epoch of custom: 41.62029505968094
Accuracy: 32.24
Micro 0.3224
Macro 0.32243667860463615
Testing time of custom: 5.201342821121216
Finished Training
Average training time per epoch of cpp: 42.65701643228531
Accuracy: 30.77
Micro 0.3077
Macro 0.30209508676358354
Testing time of cpp: 6.22520899772644
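
The relative training cost can be read directly from the logged per-epoch times. A short sketch of that arithmetic, with the values copied from the output of this particular run:

```python
# Average seconds per epoch, copied from the run output above.
epoch_time = {
    'native': 32.15662657022476,
    'custom': 41.62029505968094,
    'cpp': 42.65701643228531,
}

# Express each variant relative to the native max pool.
slowdown = {name: t / epoch_time['native'] for name, t in epoch_time.items()}

for name, ratio in slowdown.items():
    print('{}: {:.2f}x native'.format(name, ratio))
```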
