# 3-Dimensional Convolutional Neural Network (3D-CNN)

> To write this section, I have referred to the following resources: 
> - Niyas S. et al., *Medical image segmentation with 3D convolutional neural networks: A survey*.
> ---

While traditional CNNs are highly effective for processing 2D images, there are numerous applications where data inherently exists in three dimensions. Examples include medical imaging, video analysis, and 3D object recognition. To address these applications, **3D Convolutional Neural Networks** (3D-CNNs) have emerged as a powerful extension of the standard 2D-CNNs, capable of directly handling volumetric data and capturing spatial dependencies in three dimensions. 

At this point, you might be wondering what the difference between 2D and 3D CNNs is. In a CNN, an image is typically represented as a 3D tensor whose dimensions correspond to height, width, and the number of color channels (e.g., RGB), while a video or other volumetric data is represented as a 4D tensor, where the additional dimension corresponds to time or depth. The rank of the input tensor is not what separates the two, though: the primary difference between 2D and 3D CNNs lies in the convolution operation itself, namely in how many axes the filter slides along.

* The 2D convolution operation involves sliding a filter (kernel) over the height and width dimensions of the image. For example, the input 2D image can be represented as a 3D tensor with dimensions $(H, W, C)$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels. The filter then has dimensions $(f_H, f_W, C)$, where $f_H$ and $f_W$ are the height and width of the filter, respectively. At each position, the filter is multiplied element-wise with the input image patch, and the results are summed to produce a single output value. When the filter is convolved across the entire image, it produces a 2D feature map that captures spatial patterns in the image.

* In contrast, the 3D convolution operation extends this idea to three dimensions. Instead of sliding a 2D filter over the height and width dimensions of the image, a 3D filter is convolved across the height, width, and depth dimensions of the input volume. If the input volume has dimensions $(D, H, W, C)$, where $D$ is the depth, $H$ is the height, $W$ is the width, and $C$ is the number of channels, the 3D filter has dimensions $(f_D, f_H, f_W, C)$, where $f_D$, $f_H$, and $f_W$ are the depth, height, and width of the filter, respectively. Note that it is the number of channels $C$ of the filter that matches the input volume; the filter depth $f_D$ is usually much smaller than $D$, which is what allows the filter to slide along the depth axis. At each position, the 3D filter is multiplied element-wise with the input volume patch, and the results are summed to produce a single output value. When the 3D filter is convolved across the entire volume, it produces a 3D feature map that captures spatial patterns in the volume.
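The contrast between the two operations can be checked directly on tensor shapes. The snippet below is a minimal sketch, assuming PyTorch's channel-first layout, i.e. $(N, C, H, W)$ and $(N, C, D, H, W)$ rather than the channel-last notation used in the text; the input sizes are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

# One filter each, 3 input channels, kernel size 3, no padding, stride 1
conv2d = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)
conv3d = nn.Conv3d(in_channels=3, out_channels=1, kernel_size=3)

image = torch.randn(1, 3, 32, 32)       # (N, C, H, W)
volume = torch.randn(1, 3, 16, 32, 32)  # (N, C, D, H, W)

# Weight layout is (out_channels, in_channels, *kernel_size)
print(conv2d.weight.shape)   # torch.Size([1, 3, 3, 3])
print(conv3d.weight.shape)   # torch.Size([1, 3, 3, 3, 3])

# The 2D filter slides over (H, W); the 3D filter slides over (D, H, W)
out2d = conv2d(image)
out3d = conv3d(volume)
print(out2d.shape)           # torch.Size([1, 1, 30, 30])
print(out3d.shape)           # torch.Size([1, 1, 14, 30, 30])
```

In both cases the filter spans all input channels, so the channel axis is summed out and each filter yields a single-channel feature map.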

However, I need to emphasize that the input of a 2D CNN is not always a 3D tensor. For example, a grayscale image is represented as a 2D tensor with dimensions $(H, W)$, where $H$ is the height, $W$ is the width, and the number of channels is implicitly assumed to be 1. Similarly, a video is represented as a 4D tensor with dimensions $(T, H, W, C)$, where $T$ is the number of frames, $H$ is the height, $W$ is the width, and $C$ is the number of channels. For a grayscale video, however, the number of channels is implicitly 1, so it becomes a 3D tensor with dimensions $(T, H, W)$.

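The kernel always has the same number of channels as the input tensor, so it never slides along the channel axis. That is why, in all cases, we can say that the input of a 2D CNN is a 2D object and the input of a 3D CNN is a 3D object, regardless of whether the channel dimension is explicit or implicit.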
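In practice, an implicit channel dimension usually has to be made explicit before the data is passed to a convolution layer. The following is a small sketch, assuming PyTorch's `nn.Conv3d`, which expects input of shape $(N, C, D, H, W)$; the tensor sizes and variable names are illustrative only.

```python
import torch
import torch.nn as nn

# Grayscale video stored as (T, H, W): the channel axis is implicit
gray_video = torch.randn(8, 32, 32)
x_gray = gray_video.unsqueeze(0).unsqueeze(0)  # -> (N=1, C=1, T, H, W)

# RGB video stored channel-last as (T, H, W, C), as in the text
rgb_video = torch.randn(8, 32, 32, 3)
x_rgb = rgb_video.permute(3, 0, 1, 2).unsqueeze(0)  # -> (N=1, C=3, T, H, W)

# The same kind of layer handles both; only in_channels differs
conv_gray = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
conv_rgb = nn.Conv3d(in_channels=3, out_channels=4, kernel_size=3, padding=1)

y_gray = conv_gray(x_gray)
y_rgb = conv_rgb(x_rgb)
print(y_gray.shape)  # torch.Size([1, 4, 8, 32, 32])
print(y_rgb.shape)   # torch.Size([1, 4, 8, 32, 32])
```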

## Mathematical Formulation

### The Convolution Operation

From a mathematical perspective, the convolution operation is defined as the integral of the product of two functions after one is reversed and shifted. Formally, we write:

\begin{equation}
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)g(t - \tau) d\tau \tag{1.1}
\end{equation}

where $f$ and $g$ are the input function and kernel function, respectively, and $*$ denotes the convolution operation. 

To deal with discrete data, we replace the integral with a summation. Specifically, given two discrete functions $f, g: \mathbb{Z} \to \mathbb{R}$, the discrete convolution operation is defined as:

\begin{equation}
(f * g)(n) = \sum_{m \in \mathbb{Z}} f(m)g(n - m) \tag{1.2}
\end{equation} 

where $f$ and $g$ are the input function and kernel function, respectively, and $n \in \mathbb{Z}$ is the index of the output.
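As a quick illustration of Eq. (1.2), the sketch below evaluates the sum directly for two finite sequences and compares the result with NumPy's `np.convolve`. It assumes both sequences are zero outside their stored range; the helper name `conv1d_direct` is hypothetical.

```python
import numpy as np

def conv1d_direct(f, g):
    """Evaluate (f * g)(n) = sum_m f(m) g(n - m) for finite sequences."""
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(f)):
            # g is treated as zero outside its stored range
            if 0 <= n - m < len(g):
                out[n] += f[m] * g[n - m]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

result = conv1d_direct(f, g)
print(result)             # [0.  1.  2.5 4.  1.5]
print(np.convolve(f, g))  # same values
```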

Lastly, the convolution operation can be extended to higher dimensions. For example, we can write the convolution operation between two discrete and finite two-dimensional functions $f, g: \mathbb{Z}^2 \to \mathbb{R}$ as:

\begin{equation}
(f * g)(i, j) = \sum_{m \in \mathbb{Z}} \sum_{n \in \mathbb{Z}} f(m, n)g(i - m, j - n) \tag{1.3}
\end{equation}

where $(i, j) \in \mathbb{Z}^2$ is the index of the output.

Similarly, the convolution operation between two discrete and finite three-dimensional functions $f, g: \mathbb{Z}^3 \to \mathbb{R}$ can be written as:

\begin{equation}
(f * g)(i, j, k) = \sum_{m \in \mathbb{Z}} \sum_{n \in \mathbb{Z}} \sum_{p \in \mathbb{Z}} f(m, n, p)g(i - m, j - n, k - p) \tag{1.4}
\end{equation}

where $(i, j, k) \in \mathbb{Z}^3$ is the index of the output.
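Eq. (1.4) can be sketched in the same way for finite 3D arrays. The code below is an illustrative implementation, assuming both arrays are zero outside their stored range; the function name `conv3d_full` is made up for this example. Each input value $f(m, n, p)$ adds a scaled copy of $g$ to the output, shifted by $(m, n, p)$, which is equivalent to the triple sum.

```python
import numpy as np

def conv3d_full(f, g):
    """Full 3D discrete convolution of two finite arrays, following Eq. (1.4)."""
    out_shape = tuple(a + b - 1 for a, b in zip(f.shape, g.shape))
    out = np.zeros(out_shape)
    gd, gh, gw = g.shape
    for m in range(f.shape[0]):
        for n in range(f.shape[1]):
            for p in range(f.shape[2]):
                # f(m, n, p) * g(i - m, j - n, k - p) contributes to out(i, j, k)
                out[m:m + gd, n:n + gh, p:p + gw] += f[m, n, p] * g
    return out

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 5, 6))
g = rng.standard_normal((2, 3, 3))

out = conv3d_full(f, g)
print(out.shape)                             # (5, 7, 8)
print(np.allclose(out, conv3d_full(g, f)))   # True: convolution is commutative
```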

### The 3D Convolution Operation

In the context of 3D CNNs, the 3D convolution operation slides a 3D filter (kernel) over a 3D input volume; at each position, the filter is multiplied element-wise with the overlapping patch of the volume and the products are summed. Formally, given a 3D input volume represented as a tensor $X \in \mathbb{R}^{D \times H \times W \times C}$ and a single 3D filter represented as a tensor $K \in \mathbb{R}^{f_D \times f_H \times f_W \times C}$, the convolution produces a 3D feature map represented as a tensor $Y \in \mathbb{R}^{D' \times H' \times W'}$, where $D'$, $H'$, and $W'$ are the depth, height, and width of the feature map, respectively. The 3D convolution operation can be mathematically expressed as:

\begin{equation}
Y(i, j, k) = \sum_{m = 1}^{f_D} \sum_{n = 1}^{f_H} \sum_{p = 1}^{f_W} \sum_{q = 1}^{C} X(i + m - 1, j + n - 1, k + p - 1, q) \cdot K(m, n, p, q) \tag{1.5}
\end{equation}

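where $(i, j, k) \in \mathbb{Z}^3$ is the index of the output feature map, $(m, n, p) \in \mathbb{Z}^3$ is the spatial index of the filter, and $q$ is the channel index. Note that, unlike Eq. (1.4), Eq. (1.5) does not reverse the kernel: strictly speaking it is a cross-correlation, which is what deep learning frameworks compute under the name "convolution". Since the kernel weights are learned, the distinction does not matter in practice.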
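To make Eq. (1.5) concrete, here is a naive NumPy sketch of it. It assumes stride 1 and no padding, so the output size is $D' = D - f_D + 1$ (and likewise for $H'$ and $W'$), and it uses 0-based indices instead of the 1-based indices of the equation. The function name `conv3d_valid` is hypothetical, and the loops are written for readability, not speed.

```python
import numpy as np

def conv3d_valid(X, K):
    """Naive single-filter 3D convolution following Eq. (1.5).

    X: input volume of shape (D, H, W, C)
    K: filter of shape (f_D, f_H, f_W, C)
    Returns Y of shape (D - f_D + 1, H - f_H + 1, W - f_W + 1).
    """
    D, H, W, C = X.shape
    fD, fH, fW, fC = K.shape
    assert C == fC, "filter and input must have the same number of channels"
    Y = np.zeros((D - fD + 1, H - fH + 1, W - fW + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            for k in range(Y.shape[2]):
                # Patch of the volume currently covered by the filter
                patch = X[i:i + fD, j:j + fH, k:k + fW, :]
                # Element-wise product summed over m, n, p and the channel q
                Y[i, j, k] = np.sum(patch * K)
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 6, 7, 2))
K = rng.standard_normal((3, 3, 3, 2))

Y = conv3d_valid(X, K)
print(Y.shape)  # (3, 4, 5)
```

A layer such as `nn.Conv3d` with `out_channels` filters applies this operation once per filter (plus a bias term) and stacks the resulting feature maps along a new channel axis.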

In [14]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [9]:
# Implement a 3D CNN using PyTorch
class CNN3D(nn.Module):
	def __init__(self, in_channels, num_classes):
		super(CNN3D, self).__init__()
		self.conv1 = nn.Conv3d(in_channels, 8, kernel_size=(3, 3, 3), padding=(1, 1, 1))
		self.conv2 = nn.Conv3d(8, 16, kernel_size=(3, 3, 3), padding=(1, 1, 1))
		self.conv3 = nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=(1, 1, 1))
		self.conv4 = nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))
		self.conv5 = nn.Conv3d(64, 128, kernel_size=(3, 3, 3), padding=(1, 1, 1))
		self.fc1 = nn.Linear(128, 1024)
		self.fc2 = nn.Linear(1024, 512)
		self.fc3 = nn.Linear(512, num_classes)
		self.pool = nn.MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2))
		self.dropout = nn.Dropout(p=0.5)
		self.relu = nn.ReLU()

	# Assuming input shape is (N, C, 32, 32, 32)
	def forward(self, x):
		x = self.relu(self.conv1(x)) # (N, 8, 32, 32, 32) due to padding
		x = self.pool(x)             # (N, 8, 16, 16, 16) due to pooling
		x = self.relu(self.conv2(x)) # (N, 16, 16, 16, 16)
		x = self.pool(x)             # (N, 16, 8, 8, 8)
		x = self.relu(self.conv3(x)) # (N, 32, 8, 8, 8)
		x = self.pool(x)             # (N, 32, 4, 4, 4)
		x = self.relu(self.conv4(x)) # (N, 64, 4, 4, 4)
		x = self.pool(x)             # (N, 64, 2, 2, 2)
		x = self.relu(self.conv5(x)) # (N, 128, 2, 2, 2)
		x = self.pool(x)             # (N, 128, 1, 1, 1)
		x = torch.flatten(x, 1)      # flatten to (N, 128), keeping the batch dimension
		x = self.relu(self.fc1(x))
		x = self.dropout(x)
		x = self.relu(self.fc2(x))
		x = self.dropout(x)
		x = self.fc3(x)
		return x

In [13]:
model = CNN3D(in_channels=1, num_classes=10)
x = torch.randn((1, 1, 32, 32, 32)) # (batch, channel, depth, height, width)

model(x).shape

torch.Size([1, 10])
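The shape comments in `forward` follow from the standard output-size formula for convolution and pooling layers, $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$, and stride $s$ (with dilation 1). The short sketch below traces one spatial dimension through the five conv + pool stages; the helper name `out_size` is made up for this example.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial axis of a conv or pooling layer (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1

size = 32
trace = []
for _ in range(5):
    size = out_size(size, kernel=3, padding=1)  # 3x3x3 conv with padding 1 keeps the size
    size = out_size(size, kernel=2, stride=2)   # 2x2x2 max-pool with stride 2 halves it
    trace.append(size)

print(trace)  # [16, 8, 4, 2, 1]
```

This is why `fc1` takes 128 input features: after the fifth pooling step each of the 128 channels has been reduced to a single value, which only holds for inputs of size 32 along each spatial axis.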

In [None]:
'''
The model above is a simple 3D CNN with 5 convolutional layers and 3 fully connected layers.
More complex models can be defined with more layers and different hyperparameters.
    - Some state-of-the-art models use BatchNorm and LeakyReLU instead of Dropout and ReLU.
    - They also add skip connections to improve performance and mitigate the degradation problem.
'''

# Implement a 3D residual block (conv -> BN -> ReLU -> conv -> BN, plus a skip connection)
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super(ResidualBlock, self).__init__()
        # Only the first convolution applies the stride; the second keeps the spatial size
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size, stride, padding)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size, 1, padding)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        self.downsample = None

        # Project the shortcut with a 1x1x1 convolution whenever the main path changes
        # the channel count or the spatial size, so that the two tensors can be added
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv3d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm3d(out_channels),
            )

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)
        return out