In [1]:
import torch
import torch.nn as nn
from pathlib import Path
import warnings
warnings.simplefilter("ignore", UserWarning)
import mltrainer
mltrainer.__version__

'0.1.128'

Let's get some data:

In [2]:
from mads_datasets import DatasetFactoryProvider, DatasetType
from mltrainer.preprocessors import BasePreprocessor
preprocessor = BasePreprocessor()

fashionfactory = DatasetFactoryProvider.create_factory(DatasetType.FASHION)
streamers = fashionfactory.create_datastreamer(batchsize=64, preprocessor=preprocessor)
# flowersfactory = DatasetFactoryProvider.create_factory(DatasetType.FLOWERS)
# streamers = flowersfactory.create_datastreamer(batchsize=32, preprocessor=preprocessor)
train = streamers["train"]
valid = streamers["valid"]

2024-11-26 19:08:05.196 | INFO     | mads_datasets.base:download_data:121 - Folder already exists at C:\Users\Francesca\.cache\mads_datasets\fashionmnist
2024-11-26 19:08:05.197 | INFO     | mads_datasets.base:download_data:124 - File already exists at C:\Users\Francesca\.cache\mads_datasets\fashionmnist\fashionmnist.pt


In [3]:
len(train), len(valid)

(937, 156)

We can obtain an item:

In [4]:
trainstreamer = train.stream()
validstreamer = valid.stream()
x, y = next(iter(trainstreamer))
x.shape, y.shape

(torch.Size([64, 1, 28, 28]), torch.Size([64]))

Each batch follows the channels-first convention that PyTorch uses: (batch, channel, height, width). The labels are integers, one per image.

Let's pull this through a Conv2d layer:

In [5]:
in_channels = x.shape[1]

In [6]:
conv = nn.Conv2d(
    in_channels=in_channels,
    out_channels=64,
    kernel_size=3,
    padding=(1,1))
out = conv(x)
out.shape

torch.Size([64, 64, 28, 28])

What is happening here? Can you explain all the parameters and relate them to the output shape?

Let's see what happens if we change the padding:

In [7]:
conv = nn.Conv2d(
    in_channels=in_channels,
    out_channels=64,
    kernel_size=3,
    padding=(0,0))
out = conv(x)
out.shape

torch.Size([64, 64, 26, 26])

And if we change the stride from the default 1 to 2:

In [8]:
conv = nn.Conv2d(
    in_channels=in_channels,
    out_channels=64,
    kernel_size=3,
    padding=(1,1),
    stride=2)
out = conv(x)
out.shape

torch.Size([64, 64, 14, 14])
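
The three output shapes above follow from the usual output-size relation for a convolution, applied to each spatial dimension: out = floor((in + 2 * padding - kernel_size) / stride) + 1. The sketch below assumes the default dilation of 1; `conv_output_size` is a hypothetical helper name, not something PyTorch provides.

```python
def conv_output_size(size: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a convolution along one dimension (assumes dilation=1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# The three Conv2d variants from the cells above, applied to a 28x28 image
size_padded = conv_output_size(28, kernel_size=3, padding=1)               # padding keeps the size
size_unpadded = conv_output_size(28, kernel_size=3, padding=0)             # one pixel lost per border
size_strided = conv_output_size(28, kernel_size=3, padding=1, stride=2)    # stride 2 halves the size
print(size_padded, size_unpadded, size_strided)  # 28 26 14
```

The number of output channels does not appear in this relation: `out_channels` only sets how many filters, and therefore how many activation maps, the layer produces.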

As you can see, you need to think about what is going in and out of the convolution. We can stitch multiple layers together like this:

In [9]:
convolutions = nn.Sequential(
    nn.Conv2d(in_channels, 32, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=0),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=0),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)
out = convolutions(x)
out.shape

torch.Size([64, 32, 2, 2])

As you can see, the dimensions of the feature map have become really small. You need to take this into account: if we had started with a smaller image, we could get errors...

In [10]:
x_too_small = torch.rand((32, 1, 12, 12))

try:
    convolutions(x_too_small)
except RuntimeError as err:
    print("ERROR:", err)

ERROR: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
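
To see where this breaks, the spatial size can be traced through the stack by hand. The sketch below assumes the three conv/pool pairs defined above (3x3 kernels with stride 1 and padding 1, 0, 0, each followed by a 2x2 max pool); `trace_sizes` is a hypothetical helper, not part of PyTorch.

```python
def trace_sizes(size: int, paddings=(1, 0, 0), kernel_size: int = 3, pool: int = 2) -> list[int]:
    """Track the spatial size after every conv and every max pool.

    Stops as soon as a kernel no longer fits in the (padded) input.
    """
    sizes = [size]
    for padding in paddings:
        padded = size + 2 * padding
        if padded < kernel_size:
            break  # this is the situation in which PyTorch raises the RuntimeError
        size = padded - kernel_size + 1  # stride-1 convolution
        sizes.append(size)
        size = size // pool              # MaxPool2d rounds odd sizes down
        sizes.append(size)
    return sizes


print(trace_sizes(28))  # [28, 28, 14, 12, 6, 4, 2]
print(trace_sizes(12))  # [12, 12, 6, 4, 2]
```

For the 12x12 input the trace stops at 2x2 before the third convolution, which matches the `(2 x 2)` in the error message.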


At this point, `out` holds 32 activation maps of size 2x2 for every image in the batch.

If we want to pull the activation maps through a neural network (a dense layer), we will need to flatten them (do you understand what happens if you don't do that?).

In [11]:
input_nn = nn.Flatten()(out)
input_nn.shape

torch.Size([64, 128])

Note that there are potential problems connecting the image layers and the linear layers:
- Conv2d and MaxPool2d both expect 4-dimensional data (batch, channels/activation maps, height, width)
- Linear layers expect 2-dimensional data (batch, features)
- Linear layers won't crash if you feed them data with more dimensions! However, they will only operate on the last dimension, and that's probably not what you want.
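
The last point is easy to check on a random tensor with the same shape as a batch of activation maps. This is a minimal sketch; the layer sizes are chosen only for illustration.

```python
import torch
from torch import nn

maps = torch.rand(64, 32, 2, 2)  # (batch, channels, height, width)

# Without flattening: the Linear layer only sees the last dimension (size 2)
linear_last = nn.Linear(2, 10)
print(linear_last(maps).shape)  # torch.Size([64, 32, 2, 10])

# With flattening: every image becomes one vector of 32 * 2 * 2 = 128 features
linear_flat = nn.Linear(128, 10)
print(linear_flat(nn.Flatten()(maps)).shape)  # torch.Size([64, 10])
```

In the first case the layer runs without error but produces ten outputs per row of every activation map instead of ten outputs per image.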

This means we need to somehow transform the 4D data into 2D. There are some options here:
- Some sort of aggregation: the activation maps are typically small (e.g. 2x2), and they indicate whether the filter has detected a feature. There are many ways to aggregate them: mean, max, min, sum, etc.
- Flatten: a flatten layer simply transforms (batch, C, H, W) into (batch, C * H * W). Say you have (32, 32, 2, 2); after a flatten you end up with (32, 128). The problem is that a different number of Conv2d layers, or a different stride or padding, gives a different activation map size, e.g. (32, 32, 3, 3), and then you end up with 32 * 3 * 3 = 288 features.

I have solved this problem by calculating the size of the activation map with the `._conv_test` method. Once I know the size of the map (e.g. (2, 2)), I can create an AvgPool2d layer that averages over the whole (2, 2) map. This way you always end up with (batch, filters, 1, 1), and after the flatten this becomes filters * 1 * 1, which is exactly the number of filters.

In [13]:
avgpool = nn.AvgPool2d((2,2))
pooled = avgpool(out)
pooled.shape

torch.Size([64, 32, 1, 1])

If we flatten this, we obtain 32 * 1 * 1 = 32 numbers per image. This makes designing your model a bit easier, and you might also argue that taking the average is a good approach in terms of model logic.
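
As an aside, and not what this notebook uses: PyTorch also offers `nn.AdaptiveAvgPool2d`, which takes the desired output size instead of the kernel size, so it averages each activation map down to 1x1 whatever its incoming size. A minimal sketch, with random tensors standing in for activation maps:

```python
import torch
from torch import nn

global_avg = nn.AdaptiveAvgPool2d(1)  # target output size 1x1, whatever the input size

small_maps = torch.rand(64, 32, 2, 2)
large_maps = torch.rand(64, 32, 26, 26)
print(global_avg(small_maps).shape)  # torch.Size([64, 32, 1, 1])
print(global_avg(large_maps).shape)  # torch.Size([64, 32, 1, 1])

# For a 2x2 map this is the same as AvgPool2d((2, 2)): the mean over height and width
same_result = torch.allclose(global_avg(small_maps), nn.AvgPool2d((2, 2))(small_maps))
print(same_result)  # True
```

Calculating the size explicitly has its own benefit: you can log it, which helps when you are debugging an architecture.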

Let's combine it all together and add a `_conv_test` method that determines the right size for the AvgPool2d layer.

In [14]:
import torch
from torch import nn
from loguru import logger
from torchsummary import summary
import copy


# Define model
class CNN(nn.Module):
    def __init__(self, filters: int, units1: int, units2: int, input_size: tuple):
        super().__init__()
        self.in_channels = input_size[1]
        self.input_size = input_size

        self.convolutions = nn.Sequential(
            nn.Conv2d(self.in_channels, filters, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(filters, filters, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(filters, filters, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )

        activation_map_size = self._conv_test(self.input_size)
        logger.info(f"Aggregating activationmap with size {activation_map_size}")
        self.agg = nn.AvgPool2d(activation_map_size)

        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(filters, units1),
            nn.ReLU(),
            nn.Linear(units1, units2),
            nn.ReLU(),
            nn.Linear(units2, 10)
        )

    def _conv_test(self, input_size):
        x = torch.ones(input_size, dtype=torch.float32)
        x = self.convolutions(x)
        return x.shape[-2:]

    def forward(self, x):
        x = self.convolutions(x)
        x = self.agg(x)
        logits = self.dense(x)
        return logits


In [15]:
model = CNN(filters=128, units1=128, units2=64, input_size=(32, 3, 224, 224))
summary(model, input_size=(3, 224, 224), device="cpu")

2024-11-26 19:12:33.956 | INFO     | __main__:__init__:28 - Aggregating activationmap with size torch.Size([26, 26])


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1        [-1, 128, 224, 224]           3,584
              ReLU-2        [-1, 128, 224, 224]               0
         MaxPool2d-3        [-1, 128, 112, 112]               0
            Conv2d-4        [-1, 128, 110, 110]         147,584
              ReLU-5        [-1, 128, 110, 110]               0
         MaxPool2d-6          [-1, 128, 55, 55]               0
            Conv2d-7          [-1, 128, 53, 53]         147,584
              ReLU-8          [-1, 128, 53, 53]               0
         MaxPool2d-9          [-1, 128, 26, 26]               0
        AvgPool2d-10            [-1, 128, 1, 1]               0
          Flatten-11                  [-1, 128]               0
           Linear-12                  [-1, 128]          16,512
             ReLU-13                  [-1, 128]               0
           Linear-14                   

In [16]:
model = CNN(filters=128, units1=128, units2=64, input_size=(32, 1, 28, 28))
summary(model, input_size=(1, 28, 28), device="cpu")

2024-11-26 19:13:11.926 | INFO     | __main__:__init__:28 - Aggregating activationmap with size torch.Size([2, 2])


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1          [-1, 128, 28, 28]           1,280
              ReLU-2          [-1, 128, 28, 28]               0
         MaxPool2d-3          [-1, 128, 14, 14]               0
            Conv2d-4          [-1, 128, 12, 12]         147,584
              ReLU-5          [-1, 128, 12, 12]               0
         MaxPool2d-6            [-1, 128, 6, 6]               0
            Conv2d-7            [-1, 128, 4, 4]         147,584
              ReLU-8            [-1, 128, 4, 4]               0
         MaxPool2d-9            [-1, 128, 2, 2]               0
        AvgPool2d-10            [-1, 128, 1, 1]               0
          Flatten-11                  [-1, 128]               0
           Linear-12                  [-1, 128]          16,512
             ReLU-13                  [-1, 128]               0
           Linear-14                   

We have about 322k parameters (321,866 in total, most of them in the second and third Conv2d layers). You will always need to judge that relative to your input data:

- how many observations do you have? 
- Maybe even more important: how many features do you have? Images sized 28x28 need much less model complexity than images sized 224x224 (note how the first has 784 features per channel, the second more than 50,000!)
- Do you think the model needs a lot of complexity, or not so much? E.g. classifying if there is a stamp, or not, on a piece of paper is much easier than classifying the age of a face.
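
The parameter count can be reconstructed by hand from the layer shapes. The sketch below assumes the `CNN(filters=128, units1=128, units2=64)` configuration on 1-channel 28x28 images; `conv2d_params` and `linear_params` are hypothetical helper names. On an actual model, `sum(p.numel() for p in model.parameters())` gives the same kind of total.

```python
def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """One square kernel per (input channel, output channel) pair, plus one bias per output channel."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features: int, out_features: int) -> int:
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features


layer_params = [
    conv2d_params(1, 128, 3),    # 1,280
    conv2d_params(128, 128, 3),  # 147,584
    conv2d_params(128, 128, 3),  # 147,584
    linear_params(128, 128),     # 16,512
    linear_params(128, 64),      # 8,256
    linear_params(64, 10),       # 650
]
total_params = sum(layer_params)
print(total_params)  # 321866
```

ReLU, pooling and flatten layers have no parameters, which is why they show 0 in the summary.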

Also think about:
What is the trade-off when you add more complexity? Or when you reduce it?

Try to answer this trade-off in terms of:

- speed
- generalization
- accuracy

E.g., 512 filters might add 0.1% accuracy but double the training time. Is that worth it? Often not...

We will need to tell the model how well it is performing. To do that, we will need to pick a loss function $\mathcal{L}$. We will discuss this in more depth, but for now, just take my word for it that a CrossEntropyLoss is a good pick.
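
As a quick feel for the numbers: `nn.CrossEntropyLoss` takes raw logits of shape (batch, classes) and integer labels of shape (batch,). The sketch below uses hand-made logits rather than model output, and the name `criterion` is chosen only for this illustration.

```python
import math

import torch
from torch import nn

criterion = nn.CrossEntropyLoss()

labels = torch.tensor([0, 3, 5, 9])  # integer class labels, shape (batch,)
uniform_logits = torch.zeros(4, 10)  # raw scores with no preference for any class

# With ten equally likely classes the loss equals ln(10)
uniform_loss = criterion(uniform_logits, labels).item()
print(round(uniform_loss, 4), round(math.log(10), 4))  # 2.3026 2.3026

# A high score on the correct class drives the loss towards 0
confident_logits = torch.zeros(4, 10)
confident_logits[torch.arange(4), labels] = 10.0
confident_loss = criterion(confident_logits, labels).item()
print(confident_loss < 0.01)  # True
```

A loss around 2.3 is therefore roughly what to expect from an untrained model on ten classes, and it is a useful reference point when you watch the training logs.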

In [17]:
import torch.optim as optim
from mltrainer import metrics, Trainer
optimizer = optim.Adam
loss_fn = torch.nn.CrossEntropyLoss()
accuracy = metrics.Accuracy()

In [18]:
model = CNN(filters=128, units1=128, units2=64, input_size=(32, 1, 28, 28))

2024-11-26 19:13:38.988 | INFO     | __main__:__init__:28 - Aggregating activationmap with size torch.Size([2, 2])


In [19]:
yhat = model(x)
accuracy(y, yhat)

tensor(0.1094)

In [20]:
log_dir = Path("../../models/cnn").resolve()
# Create the log directory (and any missing parents); do nothing if it already exists
log_dir.mkdir(parents=True, exist_ok=True)

We now have everything we need to train the model.

In [21]:
from mltrainer import TrainerSettings, ReportTypes

settings = TrainerSettings(
    epochs=10,
    metrics=[accuracy],
    logdir=log_dir,
    train_steps=len(train),
    valid_steps=len(valid),
    reporttypes=[ReportTypes.TENSORBOARD],
)
settings

epochs: 10
metrics: [Accuracy]
logdir: C:\Users\Francesca\Documents\osint\code_repo\AI\MADS-MachineLearning-FP\dev\models\cnn
train_steps: 937
valid_steps: 156
reporttypes: [<ReportTypes.TENSORBOARD: 2>]
optimizer_kwargs: {'lr': 0.001, 'weight_decay': 1e-05}
scheduler_kwargs: {'factor': 0.1, 'patience': 10}
earlystop_kwargs: {'save': False, 'verbose': True, 'patience': 10}

In [22]:
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")
    print("Using MPS")
elif torch.cuda.is_available():
    device = "cuda:0"
    print("using cuda")
else:
    device = "cpu"
    print("using cpu")

using cuda


In [23]:
trainer = Trainer(
    model=model,
    settings=settings,
    loss_fn=loss_fn,
    optimizer=optimizer,
    traindataloader=trainstreamer,
    validdataloader=validstreamer,
    scheduler=optim.lr_scheduler.ReduceLROnPlateau,
    device=device,
    )

2024-11-26 19:14:05.875 | INFO     | mltrainer.trainer:dir_add_timestamp:29 - Logging to C:\Users\Francesca\Documents\osint\code_repo\AI\MADS-MachineLearning-FP\dev\models\cnn\20241126-191405
2024-11-26 19:14:07.794 | INFO     | mltrainer.trainer:__init__:72 - Found earlystop_kwargs in settings.Set to None if you dont want earlystopping.


In [24]:
trainer.loop()


If you have version 0.1.129 of `mltrainer`, have a look at the `imagemodels.py` file. There you can find this model, but also a model that uses a more modular strategy (see the `ConvBlock` and `CNNBlocks` architectures).

In [25]:
accuracy

Accuracy