We have observed that a CNN can capture information about neighbouring pixels, something that a neural network with only fully connected layers can't do very well.\
However, since we also use a fully connected layer after the convolutional layers to classify the encoded images, the CNN we defined is restricted to a fixed input size. One way to approach this problem is to transform the image into a size the network can take, for example by cropping, resizing, or downsampling it. We demonstrated this by applying a low-pass filter (a Gaussian blur) to smooth out features and then downsampling the image with a constant hop size.\
Here, we demonstrate another way to deal with the problem of fixed size input.

We notice that when we do a convolution on an image, we need to know three things: the kernel weights (which imply the kernel size as well), the number of input channels and the number of output channels. None of these depends on the image size. This implies that if our network consists only of convolutional layers (including the pooling layers), it does not need to know the image size beforehand and can apply its convolutions to any image as long as the channel dimension matches.\
This brings up the idea of a fully convolutional network, where we get rid of the final fully connected layers and replace them with convolutional layers. This section defines a simple FCNN that is equivalent to the CNN defined in section 1.a, with the enhancement that this network can now receive a grayscale image of arbitrary shape.
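
As a minimal sketch of this size independence, the snippet below applies a single `nn.Conv2d` to two inputs of different spatial sizes. The layer sizes and image shapes are arbitrary choices for illustration and are not the network defined later.

```python
import torch
from torch import nn

# Illustrative layer: 1 input channel, 4 output channels, 3x3 kernel.
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3)

# Two batches with different spatial sizes but the same channel count.
small = torch.rand(1, 1, 28, 28)
large = torch.rand(1, 1, 64, 48)

with torch.no_grad():
    out_small = conv(small)
    out_large = conv(large)

print(out_small.shape)  # torch.Size([1, 4, 26, 26])
print(out_large.shape)  # torch.Size([1, 4, 62, 46])

# The weights have shape (out_channels, in_channels, kernel_h, kernel_w),
# which does not involve the image height or width.
print(conv.weight.shape)  # torch.Size([4, 1, 3, 3])
```

The same weights are reused for both inputs; only the spatial size of the output changes.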

In [1]:
import torch

from torch import nn
from torch.utils.data import random_split, DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor


import math
import matplotlib.pyplot as plt

gpu_available = torch.cuda.is_available()
print(f'{gpu_available=}')

gpu_available=True


In [None]:
import sys
sys.path.append('../../')

from CustomDL.loops.classification import train_loop, test_loop

# Load MNIST

In [3]:
digits_dataset = MNIST(
    root='../../data',
    download=True,
    transform=ToTensor()
)
print(digits_dataset.data.shape)

train_data, test_data = random_split(digits_dataset, [.8, .2])

torch.Size([60000, 28, 28])


# Defining a Fully Convolutional Network

In [4]:
from typing import Optional

In [5]:
class ConvLayer(nn.Module):
    def __init__(self,
        in_chans: int,
        out_chans: int,
        kern_size: int,
        activation_fn: Optional[nn.Module] = None,
        pooling_layer: Optional[nn.Module] = None
    ):
        super().__init__()

        self.layers = nn.Sequential(
            nn.Conv2d(
                in_channels=in_chans,
                out_channels=out_chans,
                kernel_size=kern_size
            )
        )
        if activation_fn is not None:
            self.layers.add_module('1', activation_fn)
        if pooling_layer is not None:
            self.layers.add_module('2', pooling_layer)

    def forward(self, input):
        return self.layers(input)

In [6]:
# out_dim = (in_dim - (kern_size - 1) - 1) / stride + 1
fcnn = nn.Sequential(
    ConvLayer(1, 4, 3, nn.ReLU(), nn.MaxPool2d(2)),
    ConvLayer(4, 9, 3, nn.ReLU(), nn.MaxPool2d(2)),

    ConvLayer(9, 100, 5, nn.ReLU()),
    ConvLayer(100, 100, 1, nn.ReLU()),
    ConvLayer(100, 10, 1)
)

Above, we defined the FCNN to have five convolutional layers. However, the last three layers actually represent the fully connected layers that we had in the earlier CNN. To see why, we compute the shape of the input layer by layer:

**ConvLayer (1)**\
&nbsp;&nbsp;&nbsp;&nbsp;
The original image has the shape (1, 28, 28). We then apply a convolution with a (3, 3) filter and stride of 1. Following the formula given in the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html#torch.nn.Conv2d):
$$
\text{out} = \frac{\text{in}-(3 - 1)-1}{1}+1=28-2=26
$$
which gives us an image of shape (4, 26, 26). We then apply max pooling with a (2, 2) window and a stride of 2. The resulting spatial size is obtained by dividing the input size by 2 and taking the floor of the result.\
After the convolution and pooling layers, we therefore obtain an image of shape (4, 13, 13).

**ConvLayer (2)**\
&nbsp;&nbsp;&nbsp;&nbsp;
Similar to the layer above, this layer takes in a 4-channel image of arbitrary size ((13, 13) using the result above). It then applies a convolution with a (3, 3) filter and a stride of 1, which results in a (9, 11, 11) image. After max pooling with a (2, 2) window and a stride of 2, we get a (9, 5, 5) image.
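
The shape arithmetic for the first two layers can be checked with a small helper. This is only a sketch: `conv_out` and `pool_out` are hypothetical names introduced here, and they assume no padding and a pooling stride equal to the window size, as in the network above.

```python
def conv_out(size: int, kernel: int, stride: int = 1,
             padding: int = 0, dilation: int = 1) -> int:
    """Output size along one spatial dimension, per the Conv2d documentation formula."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def pool_out(size: int, window: int) -> int:
    """Max pooling with stride equal to the window and no padding: floor division."""
    return size // window


size = 28
size = conv_out(size, 3)   # 28 -> 26
size = pool_out(size, 2)   # 26 -> 13
after_layer_1 = size

size = conv_out(size, 3)   # 13 -> 11
size = pool_out(size, 2)   # 11 -> 5
after_layer_2 = size

print(after_layer_1, after_layer_2)  # 13 5
```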

**ConvLayer (3)**\
&nbsp;&nbsp;&nbsp;&nbsp;
In terms of the old CNN, this is where we flatten the output of the previous convolutional layer and pass it into a fully connected layer of 100 perceptrons. Mathematically speaking, we obtain the value of one perceptron in the fully connected layer by multiplying the (1, $9\cdot5\cdot5$) vector of pixel values by a ($9\cdot5\cdot5$, 1) weight matrix and then adding a bias.\
In terms of the (9, 5, 5) image, this means multiplying each (5, 5) channel element-wise by its own (5, 5) weight matrix, summing all the products, and adding a bias. This is exactly a convolution of the (9, 5, 5) image with a single (5, 5) filter and **no padding**. Repeating this 100 times with different weights gives a convolution with 100 output channels.\
We can see this in the code above, where the third ConvLayer has 9 input channels, 100 output channels and a kernel size of 5. Note that we do not apply any max pooling here.
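
The equivalence can be checked numerically. The sketch below copies the weights of an `nn.Linear` layer into an `nn.Conv2d` layer with a (5, 5) kernel and compares the outputs on random data. The random input and the tolerance are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Fully connected layer acting on the flattened (9, 5, 5) feature map.
linear = nn.Linear(9 * 5 * 5, 100)

# Convolution whose kernel covers the whole (5, 5) feature map.
conv = nn.Conv2d(in_channels=9, out_channels=100, kernel_size=5)

# Copy the fully connected weights into the convolution.
# Reshaping (100, 225) to (100, 9, 5, 5) matches the element order used by flatten.
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(100, 9, 5, 5))
    conv.bias.copy_(linear.bias)

features = torch.rand(2, 9, 5, 5)
with torch.no_grad():
    out_linear = linear(torch.flatten(features, 1))  # shape (2, 100)
    out_conv = conv(features)                        # shape (2, 100, 1, 1)

print(out_conv.shape)  # torch.Size([2, 100, 1, 1])

# Compare up to a small floating point tolerance.
same = torch.allclose(out_linear, torch.flatten(out_conv, 1), atol=1e-4)
print(same)
```

Both layers also hold the same number of parameters: $100\cdot9\cdot5\cdot5$ weights plus 100 biases.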

**ConvLayer (4)** and **ConvLayer (5)**\
&nbsp;&nbsp;&nbsp;&nbsp;
Likewise, a fully connected layer of 100 inputs into 100 perceptrons is equivalent to a convolution on the 100-channel 'image' of size (1, 1) to output another 100-channel 'image' of size (1, 1). Finally, we define the output layer to be a convolutional layer with 100 input channels and 10 output channels. We can interpret this result the same way as we interpret the result of the old CNN.\
However, we need to note that we are receiving an output vector of shape (..., 10, 1, 1) and not the regular (..., 1, 10).

---

Note: in order for us to have the output of (10, 1, 1) for the final convolutional layer, we had to compute the output shape of the second ConvLayer. This is something that we also had to do for the other CNN. A question raised is why we would use this in place of the CNN if we have to do the same thing. This is discussed later on.

In [7]:
sample_image = digits_dataset.data[:2].unsqueeze(1) / 255
sample_labels = digits_dataset.targets[:2]
with torch.no_grad():
    forward_res = fcnn(sample_image)

# Note that during training, we should also expect (batch, 10, 1, 1) outputs.
print(forward_res.shape)

# However, most loss functions, like CrossEntropyLoss, expect an input of shape (batch, C)
# and targets of shape (batch), so we need to 'squeeze out' the extra dimensions or we get an error.
out = nn.CrossEntropyLoss()(forward_res.squeeze(), sample_labels)
print(out)

# However, note that if batch size is 1, forward_res.squeeze() can return shape (10) which
# will also throw an error. We can use flatten as an alternative
nn.CrossEntropyLoss()(torch.flatten(forward_res, 1), sample_labels)

torch.Size([2, 10, 1, 1])
tensor(2.2802)


tensor(2.2802)
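
The difference between `squeeze` and `flatten` at a batch size of 1 can be seen on shapes alone. This is a small sketch using a zero tensor as a stand-in for the network output.

```python
import torch

# Stand-in for a (batch=1, 10, 1, 1) network output.
single = torch.zeros(1, 10, 1, 1)

# squeeze() removes every dimension of size 1, including the batch dimension.
print(single.squeeze().shape)           # torch.Size([10])

# flatten(start_dim=1) keeps the batch dimension and merges the rest.
print(torch.flatten(single, 1).shape)   # torch.Size([1, 10])
```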

We can either change the train loop to do the squeeze operation, or do it manually in the forward method.\
Here we take the second option and define a class for the FCNN that flattens its output in `forward`. Note that the flattening is only a convenience for training on fixed-size images. When deploying the model, we would no longer flatten the output, because a fully convolutional network naturally produces an output with spatial dimensions.

In [8]:
class FCNN(nn.Module):
    def __init__(self):
        super().__init__()

        self.layers = nn.Sequential(
            ConvLayer(1, 4, 3, nn.ReLU(), nn.MaxPool2d(2)),
            ConvLayer(4, 9, 3, nn.ReLU(), nn.MaxPool2d(2)),

            ConvLayer(9, 100, 5, nn.ReLU()),
            ConvLayer(100, 100, 1, nn.ReLU()),
            ConvLayer(100, 10, 1)
        )
    
    def forward(self, one_chan_image):
        res = self.layers(one_chan_image)
        return torch.flatten(res, 1) # convenience for computing loss
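
To illustrate why the flattening only suits fixed-size training images, the sketch below rebuilds the same layer stack with plain `nn.Sequential` (an untrained stand-in, not the `FCNN` class itself) and feeds it a 56x56 input. The 56x56 size is an arbitrary choice.

```python
import torch
from torch import nn

# Same layer configuration as above, written without the ConvLayer wrapper.
layers = nn.Sequential(
    nn.Conv2d(1, 4, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(4, 9, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(9, 100, 5), nn.ReLU(),
    nn.Conv2d(100, 100, 1), nn.ReLU(),
    nn.Conv2d(100, 10, 1),
)

with torch.no_grad():
    out_28 = layers(torch.rand(1, 1, 28, 28))
    out_56 = layers(torch.rand(1, 1, 56, 56))

print(out_28.shape)  # torch.Size([1, 10, 1, 1])
print(out_56.shape)  # torch.Size([1, 10, 8, 8])

# Flattening the larger output mixes class and spatial dimensions,
# so it no longer has the (batch, 10) shape that the loss expects.
print(torch.flatten(out_56, 1).shape)  # torch.Size([1, 640])
```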

# Training and Evaluation

In [None]:
def run_epochs(
    epochs: int,
    model: nn.Module,
    train_loader: DataLoader,
    test_loader: DataLoader,
    loss_fn: nn.Module,
    optimizer: torch.optim.Optimizer
):
    num_dig = int(math.log10(epochs)) + 1
    update_rate = 1 if epochs <= 20 else 10
    loss, acc = None, None

    for epoch in range(epochs):
        print(f"Epoch {epoch + 1:>{num_dig}}/{epochs}")
        loss = train_loop(model, train_loader, loss_fn, optimizer,
                          use_gpu=gpu_available)
        print(f"  Average Training Loss: {sum(loss) / len(loss):.6f}")

        loss, acc = test_loop(model, test_loader, loss_fn, True,
                               use_gpu=gpu_available)
        print(f"  Average Eval Loss: {loss:.6f} | {acc * 100:.4f}%")
    return loss, acc
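
The `train_loop` and `test_loop` helpers are imported from the `CustomDL` package and their source is not shown here. The sketch below is only a guess at what such loops might do, based on how their return values are used above (a list of per-batch losses, and a loss and accuracy pair). The names `sketch_train_loop` and `sketch_test_loop`, the synthetic data and the tiny model are all hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def sketch_train_loop(model, loader, loss_fn, optimizer):
    """Hypothetical stand-in: one pass over the data, returning per-batch losses."""
    model.train()
    losses = []
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


def sketch_test_loop(model, loader, loss_fn):
    """Hypothetical stand-in: average per-batch loss and overall accuracy."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            logits = model(inputs)
            total_loss += loss_fn(logits, targets).item()
            correct += (logits.argmax(dim=1) == targets).sum().item()
            count += targets.numel()
    return total_loss / len(loader), correct / count


# Tiny synthetic data so the sketch runs without downloading MNIST.
data = TensorDataset(torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,)))
loader = DataLoader(data, batch_size=8)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

train_losses = sketch_train_loop(model, loader, loss_fn, optimizer)
eval_loss, eval_acc = sketch_test_loop(model, loader, loss_fn)
print(len(train_losses))  # 4 batches of 8 samples
```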

In [10]:
batch_size = 64
learning_rate = 0.002

trainloader = DataLoader(train_data, batch_size, shuffle=True)
testloader = DataLoader(test_data, shuffle=True)

fcnn = FCNN()
if gpu_available: fcnn.cuda()

cross_entrop = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(fcnn.parameters(), learning_rate)

In [11]:
epochs = 20
loss, acc = run_epochs(
    epochs, fcnn,
    trainloader, testloader,
    cross_entrop,
    optimizer
)

Epoch  1/20 || Average Loss: 2.300262 | 13.8333%
Epoch  2/20 || Average Loss: 2.296425 | 11.9000%
Epoch  3/20 || Average Loss: 2.291495 | 12.5917%
Epoch  4/20 || Average Loss: 2.283396 | 18.5167%
Epoch  5/20 || Average Loss: 2.267552 | 28.9167%
Epoch  6/20 || Average Loss: 2.225723 | 42.9500%
Epoch  7/20 || Average Loss: 2.017937 | 52.6667%
Epoch  8/20 || Average Loss: 1.054533 | 70.3083%
Epoch  9/20 || Average Loss: 0.643230 | 80.6000%
Epoch 10/20 || Average Loss: 0.507024 | 84.7583%
Epoch 11/20 || Average Loss: 0.438814 | 87.0250%
Epoch 12/20 || Average Loss: 0.383199 | 88.4750%
Epoch 13/20 || Average Loss: 0.348915 | 89.4667%
Epoch 14/20 || Average Loss: 0.327089 | 89.9250%
Epoch 15/20 || Average Loss: 0.305629 | 90.5417%
Epoch 16/20 || Average Loss: 0.284852 | 91.2333%
Epoch 17/20 || Average Loss: 0.270700 | 91.5583%
Epoch 18/20 || Average Loss: 0.255782 | 92.1333%
Epoch 19/20 || Average Loss: 0.241370 | 92.4750%
Epoch 20/20 || Average Loss: 0.228948 | 93.0833%


In [12]:
params = {
    'train_params': {'batch_size': batch_size, 'lr': learning_rate, 'epochs': epochs},
    'loss': loss,
    'accuracy': acc,
    'model': fcnn.state_dict(),
    'optimizer': optimizer.state_dict()
}
torch.save(params, './output/fcnn.pth')
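
A checkpoint saved this way can later be restored with `torch.load` and `load_state_dict`. The sketch below shows the round trip on a small stand-in model and a temporary file. The model, the file name and the contents of the dictionary are assumptions for illustration, not the trained `fcnn`.

```python
import os
import tempfile

import torch
from torch import nn

# Small stand-in for the trained network and its optimizer.
model = nn.Conv2d(1, 4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.002)

checkpoint = {
    'train_params': {'batch_size': 64, 'lr': 0.002, 'epochs': 20},
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checkpoint.pth')  # hypothetical file name
    torch.save(checkpoint, path)
    restored = torch.load(path)

# Rebuild the same architecture, then restore the weights and optimizer state.
new_model = nn.Conv2d(1, 4, 3)
new_model.load_state_dict(restored['model'])
new_optimizer = torch.optim.SGD(new_model.parameters(), lr=0.002)
new_optimizer.load_state_dict(restored['optimizer'])

print(torch.equal(model.weight, new_model.weight))  # True
```

Note that the architecture has to be constructed again before loading, since the state dict stores only tensors and not the class definition.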

The FCNN can converge as well as the CNN can. Again, this is because the FCNN we defined is equivalent to the CNN: we only replaced each fully connected layer with a convolutional layer that does not apply max pooling.

From this, we know that the CNN we saw in sections 1.a and 1.b can be constructed purely from convolution operations, at the cost of a slightly slower running time. The next section goes into how these convolution operations help us use the network on images of different sizes.