# Fashion MNIST classification 

We continue with the Fashion MNIST dataset and classify its images using convolutional layers, batch normalization, and max pooling.



In [1]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn.functional as F
from torch import nn, optim
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# load dataset
train_set = torchvision.datasets.FashionMNIST(root = './data/FashionMNIST', download = True,
                                              train = True, transform = transforms.Compose([transforms.ToTensor(),]))
test_set = torchvision.datasets.FashionMNIST(root = './data/FashionMNIST', download=True,
                                             train=False, transform = transforms.Compose([transforms.ToTensor()]))

100%|██████████| 26.4M/26.4M [00:01<00:00, 14.4MB/s]
100%|██████████| 29.5k/29.5k [00:00<00:00, 2.43MB/s]
100%|██████████| 4.42M/4.42M [00:00<00:00, 11.4MB/s]
100%|██████████| 5.15k/5.15k [00:00<00:00, 7.83MB/s]


Let's first look at an example CNN and then explain it step by step.

In [3]:
# define the model
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()

        self.conv1 = nn.Conv2d(1,10,5)    # 1 input channel -> 10 output channels, 5x5 kernel
        self.conv2 = nn.Conv2d(10,20,3)   # 10 input channels -> 20 output channels, 3x3 kernel

        self.fc1 = nn.Linear(20*10*10,500) # 20 channels * 10 h * 10 w
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):     
        input_size = x.size(0)
        # in: batch*1*28*28, out: batch*10*24*24(28-5+1)
        x = self.conv1(x)
        # out: batch*10*24*24
        x = F.relu(x)
        # in: batch*10*24*24, out: batch*10*12*12
        x = F.max_pool2d(x,2,2)

        # in: batch*10*12*12, out: batch*20*10*10 (12-3+1)
        x = self.conv2(x)
        x = F.relu(x)

        # 20*10*10 = 2000
        x = x.view(input_size,-1)

        # in: batch*2000  out:batch*500
        x = self.fc1(x)
        x = F.relu(x)

        # in:batch*500 out:batch*10
        x = self.fc2(x)
        return F.log_softmax(x, dim=1) # convert logits to log-probabilities over the class dimension
    

We first define our CNN class by inheriting from **nn.Module** and create each layer of the CNN in **\_\_init\_\_**. The computation of the network is defined in the **forward** function. This example has two 2-dimensional convolutional layers and two fully connected (linear) layers, joined by ReLU activations and one max-pooling step, and it outputs log-softmax scores (log-probabilities) for the 10 classes.
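
To make the final step concrete, here is a minimal pure-Python sketch of what `F.log_softmax` computes for a single sample, together with the negative log-likelihood that `F.nll_loss` reads from it during training. The helper functions and the example logits are illustrative assumptions, not part of the notebook; the PyTorch versions operate on batched tensors.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one sample (hypothetical helper, not the PyTorch API)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]

logits = [2.0, 1.0, 0.1]                      # made-up raw scores for 3 classes
log_probs = log_softmax(logits)
probs = [math.exp(v) for v in log_probs]      # exponentiating recovers probabilities
predicted = max(range(len(log_probs)), key=log_probs.__getitem__)

print(predicted, sum(probs))
print(nll_loss(log_probs, 0), nll_loss(log_probs, 2))
```

Because the logarithm is monotonic, the argmax of the log-probabilities selects the same class as the argmax of the probabilities, and the loss is smallest when the target is the class with the highest score.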


## Convolutional layer
Fashion MNIST is a dataset of two-dimensional images, so we use the two-dimensional convolutional layer **torch.nn.Conv2d**.

`torch.nn.Conv2d(in_channels, out_channels, kernel_size,
                stride=1, padding=0, dilation=1, groups=1,
                bias=True, padding_mode='zeros')`

- in_channels (int): number of input image channels  
- out_channels (int): the number of channels after convolution  
- kernel_size (int or tuple): convolution kernel size  
- stride (int, optional): Convolution stride, default is 1

Let's look at an example of a convolutional layer in action:

In [4]:
import torch

input = torch.randn(1,1,28,28)
conv1 = nn.Conv2d(1,10,5)
output = conv1(input)

print(input.shape)
print(output.shape)

torch.Size([1, 1, 28, 28])
torch.Size([1, 10, 24, 24])


The size of the input is `(1x1x28x28)`:
* The first 1 is the batch size, which can be ignored here.
* The second 1 is the number of channels of the 1x28x28 image. The `in_channels` of the convolutional layer is also 1, because it must match the number of channels of the image. The layer outputs 10 channels and uses a 5x5 convolution kernel, so the output is (1x10x24x24): the batch size of 1 is unchanged, and each feature map becomes **(24x24), 24 = 28 - 5 + 1**.
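
The `28 - 5 + 1` rule is a special case of the output-size formula given in the PyTorch documentation for `Conv2d`, `floor((size + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)`. The sketch below applies it to trace the spatial size through the CNN defined earlier. The helper name `conv_output_size` is an assumption made for illustration; the same formula also covers `max_pool2d` with its default settings.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution or pooling window (hypothetical helper)."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

size = 28
after_conv1 = conv_output_size(size, kernel_size=5)                  # 28 - 5 + 1
after_pool = conv_output_size(after_conv1, kernel_size=2, stride=2)  # halved by 2x2 pooling
after_conv2 = conv_output_size(after_pool, kernel_size=3)            # 12 - 3 + 1
flattened = 20 * after_conv2 * after_conv2                           # input features of fc1

print(after_conv1, after_pool, after_conv2, flattened)
```

The final value is the `in_features` of `fc1`, so changing a kernel size or adding padding requires updating that number as well.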

Now that we understand how the convolutional layer operates, let's look at the most basic usage: after creating an instance of the CNN class, we can call it directly on the input, and the result is computed by the forward function.
This is only a simple demonstration: the model has not been trained, so its prediction is essentially arbitrary and may differ from run to run.

In [5]:
network = CNN()
print(network)
output = network(input)
pred = output.argmax(1)
print(pred)

CNN(
  (conv1): Conv2d(1, 10, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(10, 20, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=2000, out_features=500, bias=True)
  (fc2): Linear(in_features=500, out_features=10, bias=True)
)
tensor([5])



## Hyperparameters and Optimizer

In [6]:
learning_rate = 1e-3
batch_size = 60
epochs = 5
#loss = F.nll_loss(output, target)
optimizer = torch.optim.SGD(network.parameters(), lr=learning_rate) # Stochastic Gradient Descent

In a training loop, the optimization has 3 steps:

1. Call optimizer.zero_grad() to clear the gradients accumulated from the previous iteration.
2. Backpropagate the prediction loss by calling loss.backward(). PyTorch stores the gradient of the loss with respect to each parameter.
3. Once the gradients are available, call optimizer.step() to update the parameters.
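
The three steps can be seen on a single parameter. The sketch below is a toy example, assuming a one-element tensor `w` and the loss `w**2` (both chosen only for illustration). It shows why step 1 matters: without clearing, repeated `backward()` calls add their gradients together.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)    # toy parameter, not a real network
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes without clearing in between: the gradients add up.
(w ** 2).sum().backward()
(w ** 2).sum().backward()
accumulated = w.grad.clone()                   # d(w^2)/dw = 2 at w = 1, summed twice

# The usual three steps.
optimizer.zero_grad()                          # 1. clear the old gradients
loss = (w ** 2).sum()
loss.backward()                                # 2. compute fresh gradients
fresh = w.grad.clone()
optimizer.step()                               # 3. w <- w - lr * grad

print(accumulated, fresh, w)
```

After the update, `w` has moved from 1.0 toward the minimum of the loss at 0 by `lr * grad`.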

In [7]:
def train(network, train_loader, optimizer):
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()

        output = network(data)
        loss = F.nll_loss(output, target) # negative log likelihood loss used for classification
        loss.backward()

        optimizer.step()

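Note that **network.train()** puts the network in training mode, which affects layers such as Batch Normalization and Dropout: Batch Normalization uses the statistics of the current mini-batch, and Dropout is active. Calling **network.eval()** switches the network back to evaluation mode.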

## Batch Normalization
Batch normalization normalizes the activations of a layer using the mean and variance of the current mini-batch and then applies a learnable scale and shift, which keeps the distribution of the features passed to the next layer stable during training. During training, these statistics are computed per mini-batch. At test time there may be only a single image, so there is no batch to compute statistics from; instead, the layer uses running estimates of the mean and variance that were accumulated during training and are fixed once training is complete.

Batch normalization has several benefits, the most direct of which is training speed: it usually allows a relatively large initial learning rate, and with a larger learning rate stochastic gradient descent often converges in fewer steps. As a result, less tedious tuning of the learning rate is needed, which makes optimizing the model more efficient.
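
The difference between training and test behaviour can be sketched in pure Python for a single feature. This is a simplified illustration, not PyTorch's implementation: the function names are assumptions, the learnable scale and shift are omitted, and the momentum of 0.1 and the use of the unbiased variance for the running estimate follow the defaults described in the PyTorch documentation.

```python
import math

def batch_norm_train(batch, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Normalize one feature over a mini-batch and update the running statistics."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n        # biased variance of this batch
    normalized = [(v - mean) / math.sqrt(var + eps) for v in batch]
    unbiased_var = var * n / (n - 1)                     # used only for the running estimate
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * unbiased_var
    return normalized, running_mean, running_var

def batch_norm_eval(value, running_mean, running_var, eps=1e-5):
    """Normalize a single value with the fixed running statistics (no batch needed)."""
    return (value - running_mean) / math.sqrt(running_var + eps)

batch = [2.0, 4.0, 6.0, 8.0]                             # made-up activations
normalized, running_mean, running_var = batch_norm_train(batch, 0.0, 1.0)
single = batch_norm_eval(5.0, running_mean, running_var)

print(normalized, running_mean, running_var, single)
```

In training mode the output depends on the other samples in the mini-batch, while in evaluation mode each sample is normalized independently, which is why the mode has to be set correctly before measuring accuracy.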

In [8]:
class CNN_BN(nn.Module):
    def __init__(self):
        super(CNN_BN, self).__init__()

        # --- Convolutional layers ---
        self.conv1 = nn.Conv2d(1, 10, 5)       # in: 1 channel, out: 10 filters, kernel 5x5
        self.bn1 = nn.BatchNorm2d(10)          # normalize 10 feature maps from conv1

        self.conv2 = nn.Conv2d(10, 20, 3)      # in: 10, out: 20, kernel 3x3
        self.bn2 = nn.BatchNorm2d(20)          # normalize 20 feature maps from conv2

        # --- Fully connected layers ---
        self.fc1 = nn.Linear(20 * 10 * 10, 500)
        self.bn3 = nn.BatchNorm1d(500)         # normalize 500 features (1D after flatten)
        self.fc2 = nn.Linear(500, 10)          # output layer (10 classes)

    def forward(self, x):
        input_size = x.size(0)

        # --- Layer 1 ---
        x = self.conv1(x)
        x = self.bn1(x)                        # apply batch normalization
        x = F.relu(x)
        x = F.max_pool2d(x, 2, 2)              # reduce spatial dims by half

        # --- Layer 2 ---
        x = self.conv2(x)
        x = self.bn2(x)
        x = F.relu(x)

        # --- Flatten for fully connected layers ---
        x = x.view(input_size, -1)             # (batch, 2000)

        # --- Layer 3 ---
        x = self.fc1(x)
        x = self.bn3(x)
        x = F.relu(x)

        # --- Output layer ---
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

Calculate accuracy:

In [9]:
def accuracy(epoch_idx, test_loader, network, set_type = None):   
    correct = 0
    network.eval()             # evaluation mode: BatchNorm uses running statistics, Dropout is disabled
    with torch.no_grad():      # gradients are not needed to compute accuracy
        for data, target in test_loader:
            outputs = network(data)
            predicted = outputs.argmax(dim=1)   # class with the highest log-probability
            correct += (predicted == target).sum().item()

    if set_type == "train":
        print('\nEpoch{}: Train accuracy: {}/{} ({:.0f}%)\n'.format(
            epoch_idx, correct, len(test_loader.dataset),
            100. * correct / len(test_loader.dataset)))

    if set_type == "test":
        print('\nEpoch{}: Test accuracy: {}/{} ({:.0f}%)\n'.format(
            epoch_idx, correct, len(test_loader.dataset),
            100. * correct / len(test_loader.dataset)))

    return correct / len(test_loader.dataset)

## Training
In our example, we train a simple CNN for only 5 epochs. Using more complex CNN architectures, such as VGG or ResNet, and training for more epochs are ways to improve accuracy.

In [10]:
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset=train_set, batch_size=104, shuffle=True)  # shuffle the training data each epoch
test_loader = DataLoader(dataset=test_set, batch_size=5, shuffle=False)     # keep the test data in a fixed order

network = CNN()
optimizer = optim.SGD(network.parameters(), lr=learning_rate)

for i in range(1,epochs+1):
  print(f"Epoch {i}\n-------------------------------")
  train(network = network, train_loader = train_loader, optimizer = optimizer)
  train_accuracy = accuracy(epoch_idx=i, test_loader = train_loader, network = network, set_type = "train")
  val_accuracy = accuracy(epoch_idx=i, test_loader = test_loader, network = network, set_type = "test")

Epoch 1
-------------------------------



Epoch1: Train accuracy: 15287/60000 (25%)


Epoch1: Test accuracy: 2538/10000 (25%)

Epoch 2
-------------------------------

Epoch2: Train accuracy: 30331/60000 (51%)


Epoch2: Test accuracy: 4989/10000 (50%)

Epoch 3
-------------------------------

Epoch3: Train accuracy: 37747/60000 (63%)


Epoch3: Test accuracy: 6240/10000 (62%)

Epoch 4
-------------------------------

Epoch4: Train accuracy: 40572/60000 (68%)


Epoch4: Test accuracy: 6662/10000 (67%)

Epoch 5
-------------------------------

Epoch5: Train accuracy: 41791/60000 (70%)


Epoch5: Test accuracy: 6884/10000 (69%)



## Max Pooling
The pooling layer downsamples (compresses) an image or feature map, thereby reducing the computational cost of the network. PyTorch provides several [pooling operators](https://pytorch.org/docs/stable/nn.html#pooling-layers).

Here is an example of max pooling.

In [11]:
import torch
from torch import nn
from torch.nn import MaxPool2d

input = torch.tensor([[1, 2, 0, 3, 1],
                      [0, 1, 2, 3, 1],
                      [1, 2, 1, 0, 0],
                      [5, 2, 3, 1, 1],
                      [2, 1, 0, 1, 1]], dtype=torch.float32)
input = torch.reshape(input, (-1, 1, 5, 5))
print("Input Size: ", input.shape)

class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        self.maxpool1 = MaxPool2d(kernel_size=3, ceil_mode=True)

    def forward(self, input):
        output = self.maxpool1(input)
        return output

test = Test()
output = test(input)
print("Output: ", output)

Input Size:  torch.Size([1, 1, 5, 5])
Output:  tensor([[[[2., 3.],
          [5., 1.]]]])


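With a 3x3 max-pooling layer, the 5x5 matrix becomes 2x2. The stride defaults to the kernel size (3), and `ceil_mode=True` keeps the partial windows at the right and bottom edges; with the default `ceil_mode=False`, the output would be 1x1.

The relationship between input size and output size is given in the PyTorch documentation: https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d .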
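
To see where each output value comes from, here is a pure-Python sketch of max pooling without padding or dilation. The function `max_pool2d` below is a hypothetical re-implementation written for illustration, not PyTorch's code, and it assumes the input is at least as large as the kernel. It is applied to the same 5x5 matrix as above.

```python
import math

def max_pool2d(matrix, kernel_size, stride=None, ceil_mode=False):
    """Max pooling over a 2D list (illustrative sketch, no padding or dilation)."""
    stride = stride or kernel_size                 # stride defaults to the kernel size
    h, w = len(matrix), len(matrix[0])
    rounding = math.ceil if ceil_mode else math.floor

    def out_len(n):
        out = rounding((n - kernel_size) / stride) + 1
        # In ceil mode, a window that would start past the input is dropped.
        if ceil_mode and (out - 1) * stride >= n:
            out -= 1
        return out

    result = []
    for i in range(out_len(h)):
        row = []
        for j in range(out_len(w)):
            # Windows at the edge are clipped to the input in ceil mode.
            window = [matrix[r][c]
                      for r in range(i * stride, min(i * stride + kernel_size, h))
                      for c in range(j * stride, min(j * stride + kernel_size, w))]
            row.append(max(window))
        result.append(row)
    return result

matrix = [[1, 2, 0, 3, 1],
          [0, 1, 2, 3, 1],
          [1, 2, 1, 0, 0],
          [5, 2, 3, 1, 1],
          [2, 1, 0, 1, 1]]

pooled_ceil = max_pool2d(matrix, kernel_size=3, ceil_mode=True)
pooled_floor = max_pool2d(matrix, kernel_size=3)
print(pooled_ceil)
print(pooled_floor)
```

In ceil mode the top-right, bottom-left, and bottom-right values come from clipped windows of size 3x2, 2x3, and 2x2; in floor mode only the single complete 3x3 window in the top-left corner remains.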

