# Convolutional neural networks

Convolutional neural networks (CNNs) are a type of neural network that is especially well suited to image recognition and other problems where the input has a spatial structure. They use specific architectures and connection patterns, namely local receptive fields and weights shared across positions, to learn features at different scales and locations in the input.

- **Convolution operation:**

In a convolutional layer, the convolution operation is used to <mark style="background: #FFB8EBA6;">apply a set of filters to the input image</mark>. This operation can be represented mathematically as follows:

$$\begin{equation} S(i,j) = (I * K)(i,j) = \sum_{m}\sum_{n}I(m,n)K(i-m,j-n) \end{equation}$$

where $S(i,j)$ is the output of the convolution operation at position $(i,j)$, $I$ is the input image, $K$ is the filter (kernel), and $*$ denotes the convolution operation.

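Intuitively, the kernel is a small square window that slides horizontally and vertically across the image. At each position, the kernel values are multiplied element-wise with the image patch underneath and summed to give one value of the output. Zero padding around the border of the image prevents the output from shrinking relative to the input, as shown in the image below: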

![[convolutional_Process.jpg]]
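
The sliding-window description can be sketched in plain Python. This is a minimal, dependency-free illustration, not an efficient implementation: the function name `conv2d_same` is made up for this note, and it assumes an odd-sized kernel anchored at its centre. It follows the formula above literally, so the kernel is flipped. Deep learning libraries (including PyTorch's `nn.Conv2d`) typically compute cross-correlation instead, which skips the flip; since the kernel is learned, the difference does not matter in practice.

```python
def conv2d_same(image, kernel):
    """2D convolution with zero padding; the output has the same size as the input."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2  # offset of the kernel centre (odd kernel sizes assumed)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for a in range(kh):
                for b in range(kw):
                    # Flipped indexing: S(i,j) = sum I(m,n) K(i-m, j-n)
                    m = i + ph - a
                    n = j + pw - b
                    # Positions outside the image count as zero (zero padding)
                    if 0 <= m < h and 0 <= n < w:
                        total += image[m][n] * kernel[a][b]
            out[i][j] = total
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# A kernel with a single 1 at its centre leaves the image unchanged.
identity = conv2d_same(image, [[0, 0, 0], [0, 1, 0], [0, 0, 0]])

# A single 1 to the right of the centre shifts the image one pixel to the right.
shifted = conv2d_same(image, [[0, 0, 0], [0, 0, 1], [0, 0, 0]])
```

Because of the zero padding, both results are still 3x3; the first column of `shifted` is filled from the padded border.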

- **Non-linearity:**

Rectified Linear Unit (ReLU) activation:

ReLU is a popular activation function used in CNNs. It is defined as follows:

$$\begin{equation} f(x) = \max(0,x) \end{equation}$$

where $x$ is the input to the activation function.

Softmax activation:

Softmax is commonly used as the activation function for the output layer in classification problems. It maps the output of the last hidden layer to a probability distribution over the classes. Mathematically, softmax can be defined as follows:

$$\begin{equation} P(y=j|x) = \frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)} \end{equation}$$

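where $P(y=j|x)$ is the probability of the $j$-th class given the input $x$, $z_j$ is the $j$-th element of the vector $z$ of raw scores (logits) that is fed into the softmax, and $K$ is the number of classes (not to be confused with the kernel $K$ in the convolution formula).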
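
Both activation functions are short enough to write out directly. The following is a minimal sketch using only the standard library; the function names are chosen for this note. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the formula above, but it leaves the result unchanged.

```python
import math


def relu(x):
    """f(x) = max(0, x)"""
    return max(0.0, x)


def softmax(z):
    """Map a list of scores to a probability distribution."""
    shift = max(z)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


activations = [relu(v) for v in [-2.0, 0.0, 3.5]]
probs = softmax([2.0, 1.0, 0.1])
```

Negative inputs to `relu` are clipped to zero, and the entries of `probs` are positive, sum to 1, and keep the ordering of the input scores.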

- **Pooling operation:**

The pooling operation is used to downsample the feature maps obtained from the convolutional layer. The most common pooling operation is <mark style="background: #BBFABBA6;">max pooling</mark>, which takes the maximum value in each pooling region. Mathematically, max pooling can be defined as follows:

$$\begin{equation} M(i,j) = \max_{(m,n)\in R_{i,j}}S(m,n) \end{equation}$$

where $M(i,j)$ is the output of the pooling operation at position $(i,j)$, $R_{i,j}$ is the pooling region of the input that corresponds to output position $(i,j)$, and $S$ is the input feature map.
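
As a concrete illustration, here is a small plain-Python sketch of max pooling. The function name and the default 2x2 window with stride 2 are assumptions chosen to match the most common setup, and the feature map values are made up.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Take the maximum over each size x size region, moving by `stride`."""
    h, w = len(feature_map), len(feature_map[0])
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # R_{i,j}: the block of inputs that output position (i, j) covers
            region = [feature_map[i * stride + m][j * stride + n]
                      for m in range(size) for n in range(size)]
            row.append(max(region))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool2d(feature_map)  # [[6, 4], [7, 9]]
```

Each 2x2 block of the 4x4 input collapses to its largest value, so the height and width are both halved.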

- **Fully Connected Layers:**

After the feature maps are extracted through convolution, passed through activation functions, and downsampled through pooling, the output is flattened and passed through one or more fully connected layers to generate the final output. These layers are similar to the ones used in traditional neural networks and act as a classifier that uses the learned features to classify or identify the input image. The output of each fully connected layer is calculated as follows:

$$z = Wx + b$$

where $W$ is the weight matrix, $x$ is the input vector, $b$ is the bias vector, and $z$ is the output vector.
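
The formula $z = Wx + b$ is an ordinary matrix-vector product plus a bias. A minimal sketch with made-up numbers (the helper name `linear` and all values are purely illustrative):

```python
def linear(W, x, b):
    """Compute z = Wx + b for a weight matrix given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


W = [[1.0, 0.0, -1.0],   # 2 outputs, 3 inputs
     [0.5, 2.0, 0.0]]
x = [2.0, 1.0, 3.0]      # flattened input vector
b = [0.5, -1.0]
z = linear(W, x, b)      # [-0.5, 2.0]
```

Each row of `W` produces one output score, so a layer that classifies into 10 classes has 10 rows.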

The output of the last fully connected layer is passed through the softmax activation function introduced earlier to produce a probability distribution over the classes:

$$P(y=j|x) = \frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)}$$

- **Backpropagation:**

Once the output is generated, the network's weights and biases are updated to minimize the difference between predicted and actual outputs. Backpropagation calculates the gradient of the loss function with respect to the model's parameters, and gradient descent then <mark style="background: #FFB86CA6;">updates these parameters in the opposite direction of the gradient</mark>. This process is repeated over many iterations until the model converges. For a softmax output layer trained with the cross-entropy loss, the gradient of the loss with respect to the pre-activation output of the last layer takes a particularly simple form:

$$\delta_{i}^{(L)} = y_{i} - t_{i}$$

where $\delta_{i}^{(L)}$ is the error for the i-th output neuron, $y_{i}$ is the predicted output, and $t_{i}$ is the true output.
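
The expression $y_i - t_i$ can be checked numerically. The sketch below assumes a softmax output layer with cross-entropy loss and a one-hot target (an assumption; the text does not name the loss at this point, although the training code further down uses `nn.CrossEntropyLoss`). It compares the analytic error term with a finite-difference estimate of the gradient with respect to the scores $z$. All names and values are illustrative.

```python
import math


def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, t):
    """Cross-entropy loss between softmax(z) and a one-hot target t."""
    y = softmax(z)
    return -sum(t_i * math.log(y_i) for t_i, y_i in zip(t, y))


z = [2.0, 1.0, 0.1]   # scores of the last layer
t = [0.0, 1.0, 0.0]   # true class is index 1
y = softmax(z)

# Analytic error term: delta_i = y_i - t_i
delta = [y_i - t_i for y_i, t_i in zip(y, t)]

# Numerical gradient via central differences
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric.append((cross_entropy(z_plus, t) - cross_entropy(z_minus, t)) / (2 * eps))
```

The two lists agree to within the accuracy of the finite-difference approximation. The entry for the true class is negative, so a gradient descent step raises that class's score and lowers the others.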

Backpropagation is an <mark style="background: #ABF7F7A6;">iterative algorithm that works by propagating the error back through the layers of the network</mark>. The error is calculated as the difference between the predicted and actual outputs, and the chain rule carries it backward layer by layer to obtain the gradient for every weight and bias. Training this way can take many iterations, and optimization techniques such as momentum and adaptive learning rates can speed up convergence.
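
To make the momentum idea concrete, here is a minimal sketch of a gradient descent step with momentum on a toy one-parameter problem. The update rule (velocity = momentum * velocity + gradient, then parameter -= learning rate * velocity) matches the formulation used by PyTorch's `optim.SGD` without dampening or Nesterov momentum. The function name, the toy loss $f(w) = (w - 3)^2$, and the hyperparameters are assumptions made for illustration.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One parameter update: accumulate a velocity, then step against it."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, v = [0.0], [0.0]
for _ in range(300):
    grad = [2.0 * (w[0] - 3.0)]
    w, v = sgd_momentum_step(w, grad, v)
```

After the loop, `w[0]` is very close to the minimizer 3. In a real network the gradient comes from backpropagation rather than a closed-form expression.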

The process of backpropagation enables the network to learn from its mistakes and improve its performance over time. The goal of training is to minimize the loss function, which measures the difference between the predicted and actual outputs. In practice, once the loss stops improving, the network is considered to have converged, and it can be used for making predictions on new data.
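
The training code below uses `nn.Linear(16 * 16 * 16, 10)` as its final layer. The following sketch shows where that input size comes from, using the standard output-size formula for convolution and pooling layers (without dilation). The helper name `conv_output_size` is made up for this note; the 32x32 input size is that of CIFAR-10 images.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Spatial output size: floor((n + 2 * padding - kernel_size) / stride) + 1"""
    return (n + 2 * padding - kernel_size) // stride + 1


size = 32                                                          # CIFAR-10 images are 32x32
size = conv_output_size(size, kernel_size=3, stride=1, padding=1)  # conv1 keeps 32
size = conv_output_size(size, kernel_size=2, stride=2)             # max pooling gives 16
flattened = 16 * size * size                                       # 16 channels * 16 * 16 = 4096
```

Changing the input resolution, the padding, or the number of filters would change `flattened`, and the `nn.Linear` input size would have to change with it.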

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Set the random seed for reproducibility. This ensures that the random numbers
# generated by PyTorch are the same on every run, giving consistent results
# during training.
torch.manual_seed(0)

# Define the CNN architecture
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First convolutional layer: 3 input channels, 16 filters, 3x3 kernel,
        # stride 1, and padding 1 to maintain the spatial dimensions of the input.
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        # 2x2 max pooling with stride 2 halves the height and width (32x32 -> 16x16).
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # 16 channels * 16 * 16 spatial positions -> 10 class scores.
        self.fc = nn.Linear(16 * 16 * 16, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = x.view(x.size(0), -1)  # flatten to (batch_size, 16 * 16 * 16)
        x = self.fc(x)
        # Raw scores (logits) are returned; nn.CrossEntropyLoss applies
        # log-softmax internally, so no explicit softmax layer is needed here.
        return x

# Create an instance of the CNN
model = SimpleCNN()

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Load the data: convert images to tensors and scale each channel to [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)

# Train the model
for epoch in range(5):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()              # clear gradients from the previous step
        outputs = model(inputs)            # forward pass
        loss = criterion(outputs, labels)
        loss.backward()                    # backpropagation
        optimizer.step()                   # parameter update
        running_loss += loss.item()
        # Report the average loss every 2000 mini-batches
        if i % 2000 == 1999:
            print(f'Epoch: {epoch+1}, Batch: {i+1}, Loss: {running_loss/2000:.3f}')
            running_loss = 0.0

print('Training finished.')

# Save the trained model
torch.save(model.state_dict(), 'cnn_model.pth')
```


Files already downloaded and verified
Epoch: 1, Batch: 2000, Loss: 1.782
Epoch: 1, Batch: 4000, Loss: 1.479
Epoch: 1, Batch: 6000, Loss: 1.414
Epoch: 1, Batch: 8000, Loss: 1.354
Epoch: 1, Batch: 10000, Loss: 1.332
Epoch: 1, Batch: 12000, Loss: 1.273
Epoch: 2, Batch: 2000, Loss: 1.216
Epoch: 2, Batch: 4000, Loss: 1.209
Epoch: 2, Batch: 6000, Loss: 1.184
Epoch: 2, Batch: 8000, Loss: 1.205
Epoch: 2, Batch: 10000, Loss: 1.144
Epoch: 2, Batch: 12000, Loss: 1.147
Epoch: 3, Batch: 2000, Loss: 1.075
Epoch: 3, Batch: 4000, Loss: 1.116
Epoch: 3, Batch: 6000, Loss: 1.046
Epoch: 3, Batch: 8000, Loss: 1.116
Epoch: 3, Batch: 10000, Loss: 1.087
Epoch: 3, Batch: 12000, Loss: 1.095
Epoch: 4, Batch: 2000, Loss: 1.016
Epoch: 4, Batch: 4000, Loss: 1.041
Epoch: 4, Batch: 6000, Loss: 1.020
Epoch: 4, Batch: 8000, Loss: 1.035
Epoch: 4, Batch: 10000, Loss: 1.055
Epoch: 4, Batch: 12000, Loss: 1.047
Epoch: 5, Batch: 2000, Loss: 0.939
Epoch: 5, Batch: 4000, Loss: 0.998
Epoch: 5, Batch: 6000, Loss: 1.008
Epoch: 5,