# Train

Training the `BART` model to recognize 42 characters from "The Simpsons" TV Show. The repo name "BART" is a reference to Bart Simpson, one of the main characters from The Simpsons. It also nods to the popular NLP architecture "BERT," blending the themes of deep learning and the Simpsons - even though BERT is not a vision model.

This is the notebook version of the three scripts from this repository: `train.py`, `BART.py`, and `SimpsonsDataset.py`. Since I began writing the Python code first, it might not be very notebook-oriented. In a notebook, you typically work with a single file, so you do not have to worry about saving things locally and importing them later.

This project uses `Python 3.14.0`. Start by installing the required libraries (in the script version, this is handled by `pip install -r requirements.txt`):

In [1]:
!pip install numpy pillow scikit-learn torch



Then import the necessary libraries and modules:

In [2]:
import argparse
import joblib
import logging
import numpy as np
import os
import random
import torch
import torch.nn as nn
import torch.optim as optim

from PIL import Image as PILImage
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader, Dataset

Set the hyperparameters (batch size, learning rate, and number of epochs):

In [3]:
BS = 32
LR = 0.001
EPOCHS = 25

Configure `logging` and set the random seed for reproducibility:

It started when an alien device did what it did! And stuck itself upon his wrist with secrets that it hid! Now he's got superpowers, he's no ordinary kid. He's Ben 10! Ben 10! $\implies$ `random_state=10`

In [4]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', datefmt='%d/%m/%Y %H:%M:%S')
logger = logging.getLogger("Batlogger (Train)")

SEED = 10
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # for multi-GPU setups. I have no idea on what kinda setup this code will be run on...
torch.backends.cudnn.deterministic = True  # ensures deterministic convolution algorithms
torch.backends.cudnn.benchmark = False  # disables auto-tuning (which can introduce randomness)
g = torch.Generator()
g.manual_seed(SEED)

# Ensures reproducibility in DataLoader workers
# I had not thought about this before and will keep it in mind for the future
def seed_worker(worker_id: int) -> None:
    # PEP 8 actually thinks we shall write 2**32 emphasizing the higher precedence, but I think that's just ugly
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
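As a small aside on the `% 2 ** 32` expression in `seed_worker`, here is a minimal standard-library sketch. It checks that exponentiation binds tighter than the modulo, and that re-seeding `random` with the same value replays the same sequence. The name `base_seed` and its value are made up and only stand in for what `torch.initial_seed()` would return.

```python
import random

# Hypothetical stand-in for torch.initial_seed(), which returns a large integer.
base_seed = 2**40 + 10

# ** binds tighter than %, so both spellings reduce the seed to 32 bits.
worker_seed = base_seed % 2 ** 32
same_seed = base_seed % (2 ** 32)


def draw(seed: int, n: int = 5) -> list[float]:
    """Seed the stdlib RNG and draw n numbers."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first_run = draw(worker_seed)
second_run = draw(worker_seed)
print(worker_seed, worker_seed == same_seed, first_run == second_run)
```

The reduction to 32 bits matters because `np.random.seed` only accepts seeds below $2^{32}$.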

We need to load the data from the `characters_train/` folder. For that, I wrote the `get_data` function, which returns the images, their labels (the folder names), and the number of distinct classes, which will turn out to be 42:

In [5]:
def get_data(directory: str) -> tuple[list[PILImage.Image], list[str], int]:
    images: list[PILImage.Image] = []
    labels: list[str] = []
    class_count: int = 0

    for c in sorted(os.listdir(directory)):
        if c in [".DS_Store", "simpsons_dataset"]:  # Remnants of the past...
            continue

        cp = os.path.join(directory, c)  # not that CP!
        class_count += 1

        for f in sorted(os.listdir(cp)):
            fpath = os.path.join(cp, f)
            with PILImage.open(fpath) as img:
                images.append(img.copy())  # .copy() loads the image into memory, because PIL opens images lazily
            labels.append(c)

    return images, labels, class_count

Using the `get_data` function from above, plus some exploratory logging:

In [6]:
images, labels, class_count = get_data('characters_train')
logger.info(f"The number of images: {len(images)} is equal to the number of labels: {len(images) == len(labels)}")
logger.info(f"Number of classes: {class_count}")

img_shape_set: set[tuple[int, int, int]] = set()
for img in images:
    w, h = img.size
    img_shape_set.add((h, w, 3))  # they do have RGB, but how PIL works is counterintuitive...

if len(img_shape_set) == 1:
    logger.info(f"Dimensions of each image: {list(img_shape_set)[0]}")
else:
    logger.info(f"There are {len(img_shape_set)} different sizes for images")
    logger.info(f"Different shapes found: {sorted(img_shape_set)[::len(img_shape_set) - 1]}")  # Get only the smallest and the largest image shapes

03/12/2025 19:12:56 - Batlogger (Train) - INFO - The number of images: 16764 is equal to the number of labels: True
03/12/2025 19:12:56 - Batlogger (Train) - INFO - Number of classes: 42
03/12/2025 19:12:56 - Batlogger (Train) - INFO - There are 275 different sizes for images
03/12/2025 19:12:56 - Batlogger (Train) - INFO - Different shapes found: [(256, 256, 3), (1072, 1912, 3)]
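The `sorted(img_shape_set)[::len(img_shape_set) - 1]` slice used for the last log line can look cryptic, so here is a minimal sketch of what it does. The shapes below are made up for illustration and are not taken from the dataset.

```python
# Made-up shapes in (height, width, channels) form; not taken from the dataset.
shapes = {(300, 200, 3), (256, 256, 3), (1072, 1912, 3), (256, 341, 3)}

# Tuples compare element by element: height first, then width, then channels.
ordered = sorted(shapes)

# A step of len - 1 visits only index 0 and index len - 1.
extremes = ordered[::len(ordered) - 1]

print(extremes)
```

Note that "smallest" and "largest" here refer to this tuple ordering (height first), not to pixel area, and that the slice needs at least two distinct shapes, since a step of zero raises an error.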


Based on the logging, there are 275 different image sizes, all RGB, and most images have 256 pixels along one dimension.
The shapes start at $(256, 256, 3)$ and go all the way to $(1072, 1912, 3)$.

Resizing is a MUST:
- 275 different image sizes make batching impossible without resizing, unless you want to write some absurd, out-of-this-world case splits.
- Variable sizes would require padding or cropping, which loses information unpredictably.

Why $128 \times 128$?
- Computational efficiency: $128 \times 128 = 16384$ pixels vs. $256 \times 256 = 65536$ pixels (4x less memory and computation).
- Sufficient for character recognition: based on `characters_illustration.png`, Simpsons characters have distinctive shapes that SHOULD survive downsampling.
- With only 16764 samples, smaller inputs reduce the risk of overfitting.
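To make the efficiency argument concrete, the sketch below compares pixel counts and the size of one `float32` input batch at both resolutions. The helper names are my own, and the batch size of 32 is taken from `BS` above; activations inside the network are not counted.

```python
def pixels(height: int, width: int) -> int:
    """Number of pixels in a single-channel image."""
    return height * width


def batch_mebibytes(batch_size: int, height: int, width: int, channels: int = 3) -> float:
    """Size of one float32 input batch in MiB (4 bytes per value)."""
    return batch_size * channels * height * width * 4 / 2**20


small = pixels(128, 128)
large = pixels(256, 256)
ratio = large // small

small_batch = batch_mebibytes(32, 128, 128)
large_batch = batch_mebibytes(32, 256, 256)

print(small, large, ratio)
print(small_batch, large_batch)
```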

In [7]:
images = np.array([np.array(img.resize((128, 128))) for img in images])

# LabelEncoder converts categorical string labels into integer indices.
# NNs require numerical inputs.
# CrossEntropyLoss expects integer class indices as targets.
label_encoder = LabelEncoder()
labels_encoded = label_encoder.fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(images, labels_encoded, test_size=0.2, random_state=SEED)

os.makedirs('inference_images', exist_ok=True)
for i, img_array in enumerate(X_test):
    img = PILImage.fromarray(img_array)
    img.save(f'inference_images/pic_{i}.jpg')

joblib.dump(label_encoder, 'label_encoder.joblib')
logger.info("Saved label_encoder.joblib")
logger.info(f"Saved {len(X_test)} test images to inference_images/")

logger.info(f"Training set size: {len(X_train)}")
logger.info(f"Test set size: {len(X_test)}")

03/12/2025 19:13:57 - Batlogger (Train) - INFO - Saved label_encoder.joblib
03/12/2025 19:13:57 - Batlogger (Train) - INFO - Saved 3353 test images to inference_images/
03/12/2025 19:13:57 - Batlogger (Train) - INFO - Training set size: 13411
03/12/2025 19:13:57 - Batlogger (Train) - INFO - Test set size: 3353
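The logged split sizes can be reproduced by hand. This is a minimal sketch, assuming that `train_test_split` rounds a fractional `test_size` up and gives the remainder to the training set, which is consistent with the numbers in the log.

```python
import math

n_samples = 16764  # total number of images reported earlier in the log
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # assumption: fractional test size is rounded up
n_train = n_samples - n_test               # everything else goes to training

print(n_train, n_test)
```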


Next, we define a `Dataset` subclass called `SimpsonsDataset`:

In [8]:
class SimpsonsDataset(Dataset):
    def __init__(self, images: np.ndarray, labels: np.ndarray):
        self.images = images
        self.labels = labels


    def __len__(self) -> int:
        return len(self.images)


    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        image = self.images[idx]
        label = self.labels[idx]

        # convert to tensor and normalize, change from (H, W, C) to (C, H, W)
        image = torch.tensor(image, dtype=torch.float32).permute(2, 0, 1) / 255.0
        label = torch.tensor(label, dtype=torch.long)

        return image, label

and use it with `DataLoader`, set up to be as reproducible as possible:

In [9]:
# create datasets and dataloaders
train_dataset = SimpsonsDataset(X_train, y_train)
test_dataset = SimpsonsDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=BS, shuffle=True, worker_init_fn=seed_worker, generator=g)
test_loader = DataLoader(test_dataset, batch_size=BS, shuffle=False)

We now need to define the main class - `BART`.

## Why the specific architecture below?

### **Quick overview before we go into more detail**:
- Convolution blocks: Extract features $\to$ spatial patterns
- FC layers: Learn complex decision boundaries between classes
- Final linear layer: Produces raw scores (logits) for each of the 42 classes
- Softmax: Applied externally through `nn.CrossEntropyLoss()`
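The last bullet is worth a short illustration: `nn.CrossEntropyLoss()` takes raw logits and applies log-softmax internally, so the model itself does not need a softmax layer. The sketch below reproduces that computation in plain Python for a single sample; the logits are made up and the function names are my own.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: list[float], target: int) -> float:
    """Negative log-probability of the target class, computed from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [2.0, 0.5, -1.0]  # made-up scores for three classes
probs = softmax(logits)
loss = cross_entropy(logits, target=0)

print(probs, loss)
```

Because softmax preserves the ordering of the logits, taking the arg max of the raw logits gives the same predicted class as taking it over the probabilities.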

### **Depth of 4 convolution blocks**:
1. Block 1: Learns simple edges, colors, basic textures
2. Block 2: Combines edges into simple shapes $\to$ curves, corners
3. Block 3: Recognizes more complex patterns $\to$ facial features, clothing textures
4. Block 4: Understands high-level features $\to$ faces, body parts, character-specific details

Also, for 16764 samples, fewer than 3 convolution blocks will likely underfit, and more than 5 may overfit.

### **My summary from Fei-Fei Li's CNN slides**:
As we go deeper through the network, two things happen simultaneously but in opposite directions:
The spatial dimensions shrink due to pooling
$$128 \times 128 \to 64 \times 64 \to 32 \times 32 \to 16 \times 16 \to 8 \times 8,$$
meaning we lose precise information about where features appear in the image.
At the same time, the channel dimensions grow
$$3 \to 32 \to 64 \to 128 \to 256,$$
meaning we gain richer and more abstract semantic representations at each location.

The first law of alchemy is the Law of Equivalent Exchange:
Early layers with high resolution and few channels detect simple patterns such as edges at specific locations,
while later layers with low resolution and many channels recognize complex concepts such as facial features.
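The trade-off can be traced numerically. This is a minimal sketch, assuming that the convolutions preserve the spatial size and that only a $2 \times 2$ pooling step at the end of each block halves it; the function name and defaults are my own.

```python
def trace_shapes(
    size: int = 128,
    in_channels: int = 3,
    out_channels: tuple[int, ...] = (32, 64, 128, 256),
) -> list[tuple[int, int, int]]:
    """Return (channels, height, width) for the input and after each block."""
    shapes = [(in_channels, size, size)]
    for channels in out_channels:
        size //= 2  # assumption: 2x2 pooling with stride 2 halves each spatial side
        shapes.append((channels, size, size))
    return shapes


shapes = trace_shapes()
c, h, w = shapes[-1]
flattened = c * h * w  # number of features left after the last block

for shape in shapes:
    print(shape)
print(flattened)
```

The final product, $256 \times 8 \times 8 = 16384$, is the number of features that reach the fully connected part of the network.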

### **Dropout**
$$0.25 \to 0.25 \to 0.4 \to 0.4 \to 0.5$$
The reasoning is that as dimensionality increases, overfitting risk also increases.

Two hidden FC layers, followed by the output layer:
$$ 512 \to 256 \to \text{num\_classes} $$
A single hidden FC layer might be too simple for decision boundaries across 42 classes,
while more than three FC layers would risk overfitting with this dataset size.


## **Calculating the number of parameters**

(The sample count used in this section, 16764, is the entire dataset, before the train/test split.)

Given:
- $C_\text{in}$ = Number of input channels
- $K_h$ = Kernel height
- $K_w$ = Kernel width
- $C_\text{out}$ = Number of output channels
- $N_\text{in}$ = Number of input features
- $N_\text{out}$ = Number of output features
- and the "+ 1" term is for the bias parameter

The formulas for the number of parameters are:

- For Conv2d:
  $$ (C_\text{in} \times K_h \times K_w + 1) \times C_\text{out} $$
- For BatchNorm2d:
  $$ 2 \times C_\text{out} $$
- For Linear:
  $$ N_\text{in} \times N_\text{out} + N_\text{out} $$
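The three formulas translate directly into code. The sketch below assumes the layout used in this notebook: four blocks of two $3 \times 3$ convolutions, each followed by batch normalization, with channel widths $3 \to 32 \to 64 \to 128 \to 256$, and a $16384 \to 512 \to 256 \to 42$ head. The function names are my own.

```python
def conv2d_params(c_in: int, c_out: int, k_h: int = 3, k_w: int = 3) -> int:
    """Weights plus one bias per output channel."""
    return (c_in * k_h * k_w + 1) * c_out


def batchnorm2d_params(c_out: int) -> int:
    """One learnable scale and one learnable shift per channel."""
    return 2 * c_out


def linear_params(n_in: int, n_out: int) -> int:
    """Weight matrix plus bias vector."""
    return n_in * n_out + n_out


def block_params(c_in: int, c_out: int) -> int:
    """Two conv + batch norm pairs; the second conv keeps the channel count."""
    return (
        conv2d_params(c_in, c_out) + batchnorm2d_params(c_out)
        + conv2d_params(c_out, c_out) + batchnorm2d_params(c_out)
    )


channels = [3, 32, 64, 128, 256]
per_block = [block_params(a, b) for a, b in zip(channels, channels[1:])]
conv_total = sum(per_block)
fc_total = linear_params(8 * 8 * 256, 512) + linear_params(512, 256) + linear_params(256, 42)
total = conv_total + fc_total

print(per_block, conv_total, fc_total, total)
```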


### Block 1

Conv2d$(3 \to 32,\; 3 \times 3)$:
$$ (3 \times 3 \times 3 + 1) \times 32 = 896 $$
BatchNorm2d$(32)$:
$$ 32 \times 2 = 64 $$
Conv2d$(32 \to 32,\; 3 \times 3)$:
$$ (3 \times 3 \times 32 + 1) \times 32 = 9248 $$
BatchNorm2d$(32)$:
$$64$$
Block 1 total:
$$10272$$
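The hand count can also be cross-checked against PyTorch itself. This is a minimal sketch that rebuilds only the parameterized layers of Block 1 and sums `numel()` over their learnable tensors. Activation, pooling, and dropout are left out because they have no parameters, and padding is left at its default because it does not change the count.

```python
import torch.nn as nn

# Only the layers of Block 1 that carry learnable parameters.
block1 = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3),
    nn.BatchNorm2d(32),
    nn.Conv2d(32, 32, kernel_size=3),
    nn.BatchNorm2d(32),
)

# parameters() yields weights and biases; BatchNorm running statistics are
# buffers, so they are not included here.
learnable = sum(p.numel() for p in block1.parameters() if p.requires_grad)

print(learnable)
```

The same one-liner applied to a full model instance gives its total number of learnable parameters.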

### Block 2

Conv2d$(32 \to 64,\; 3 \times 3)$:
$$ (3 \times 3 \times 32 + 1) \times 64 = 18496 $$
BatchNorm2d$(64)$:
$$128$$
Conv2d$(64 \to 64,\; 3 \times 3)$:
$$ (3 \times 3 \times 64 + 1) \times 64 = 36928 $$
BatchNorm2d$(64)$:
$$128$$
Block 2 total:
$$55680$$

### Block 3

Conv2d$(64 \to 128,\; 3 \times 3)$:
$$ (3 \times 3 \times 64 + 1) \times 128 = 73856 $$
BatchNorm2d$(128)$:
$$256$$
Conv2d$(128 \to 128,\; 3 \times 3)$:
$$ (3 \times 3 \times 128 + 1) \times 128 = 147584 $$
BatchNorm2d$(128)$:
$$256$$
Block 3 total:
$$221952$$

### Block 4

Conv2d$(128 \to 256,\; 3 \times 3)$:
$$ (3 \times 3 \times 128 + 1) \times 256 = 295168 $$
BatchNorm2d$(256)$:
$$512$$
Conv2d$(256 \to 256,\; 3 \times 3)$:
$$ (3 \times 3 \times 256 + 1) \times 256 = 590080 $$
BatchNorm2d$(256)$:
$$512$$
Block 4 total:
$$886272$$

Total for 4 convolutional blocks:
$$1174176$$

### Classification Head (the heavyweight!)

Linear$(16384 \to 512)$:
$$16384 \times 512 + 512 = 8389120$$
Linear$(512 \to 256)$:
$$512 \times 256 + 256 = 131328$$
Linear$(256 \to 42)$ (42 classes):
$$256 \times 42 + 42 = 10794$$
FC total:
$$8531242$$


### **Total parameters**

$$8531242 + 1174176 = 9705418$$
Ratio of samples to parameters:
$$\frac{16764}{9705418} \approx 0.00173,$$
which is quite low.
I think around 10 samples per parameter would be sufficient, but we do not have that amount of data (even if the remaining 20% currently not in `characters_train/` were included).

Approximately $86\%$ of the parameters are in the first FC layer alone (about $88\%$ in the whole classification head), meaning most parameters are in the dense layers, not the convolutional ones.
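These proportions are easy to recompute from the totals in this section. A minimal sketch, using only the counts derived above:

```python
total_params = 9_705_418          # all learnable parameters
fc_params = 8_531_242             # the three Linear layers together
first_fc_params = 16_384 * 512 + 512
n_samples = 16_764                # entire dataset, before the split

samples_per_param = n_samples / total_params
first_fc_share = first_fc_params / total_params
fc_share = fc_params / total_params

print(f"samples per parameter: {samples_per_param:.5f}")
print(f"first FC layer share:  {first_fc_share:.1%}")
print(f"whole head share:      {fc_share:.1%}")
```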

In [10]:
class BART(nn.Module):
    def __init__(self, num_classes=42):
        super(BART, self).__init__()

        # Block 1: (128, 128, 3) -> (64, 64, 32)
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding='same'),  # 'same' maintains input size after convolution
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),  # save memory by modifying the input tensor directly,
            nn.Conv2d(32, 32, kernel_size=3, padding='same'),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),  # at the risk of losing the original data
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout2d(0.25)
        )

        # Block 2: (64, 64, 32) -> (32, 32, 64)
        self.block2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=3, padding='same'),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding='same'),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout2d(0.25)
        )

        # Block 3: (32, 32, 64) -> (16, 16, 128)
        self.block3 = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, padding='same'),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding='same'),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout2d(0.4)
        )

        # Block 4: (16, 16, 128) -> (8, 8, 256)
        self.block4 = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, padding='same'),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding='same'),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Dropout2d(0.4)
        )

        # Classification Head: (8*8*256) -> num_classes
        self.classification_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 8 * 256, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, 256),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )


    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.block4(x)
        x = self.classification_head(x)
        return x


    def predict(self, x: torch.Tensor) -> torch.Tensor:  # predict class, just in case
        return torch.argmax(self.forward(x), dim=1)


    def predict_proba(self, x: torch.Tensor) -> torch.Tensor:  # get class distribution, just in case
        return torch.softmax(self.forward(x), dim=1)


    def save(self, path: str = "") -> None:
        model_name = 'BART-10M.pth'
        torch.save(self.state_dict(), os.path.join(path, model_name))  # an empty path saves to the current working directory

Now we create an instance of `BART`:

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if torch.cuda.is_available():
    logger.info("Using CUDA")

model = BART(num_classes=class_count).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

and start the training/validation loop:

In [None]:
logger.info("Starting training...")

for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for img, label in train_loader:
        img = img.to(device)
        label = label.to(device)

        optimizer.zero_grad()
        outputs = model(img)
        loss = criterion(outputs, label)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        values, predicted = torch.max(outputs.data, dim=1)
        # initially, I had dim=0 but there were errors until I recalled that
        # dim=X means "reduce dimension X", not "operate on dimension X"...
        total += label.size(0)  # 13411 / BS = 13411 / 32 is not an integer, so the last batch is smaller than 32.
        correct += (predicted == label).sum().item()

    train_acc = 100 * correct / total
    avg_loss = running_loss / len(train_loader)

    # validation
    model.eval()
    val_correct = 0
    val_total = 0

    with torch.no_grad():
        for img, label in test_loader:
            img = img.to(device)
            label = label.to(device)

            outputs = model(img)
            values, predicted = torch.max(outputs.data, dim=1)
            val_total += label.size(0)
            val_correct += (predicted == label).sum().item()

    val_acc = 100 * val_correct / val_total

    logger.info(f"Epoch [{epoch + 1}/{EPOCHS}], Loss: {avg_loss:.3f}, Train Acc: {train_acc:.3f}%, Val Acc: {val_acc:.3f}%")

model.save()
logger.info("BART-10M.pth SAVED AT CURRENT WORKING DIRECTORY")

03/12/2025 19:23:55 - Batlogger (Train) - INFO - Starting training...
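The loop sums `label.size(0)` instead of assuming full batches, and the sketch below shows why. With the default `DataLoader` behaviour of keeping the short last batch, the 13411 training samples do not divide evenly into batches of 32. The per-batch correct counts are made up purely to show that a sample-weighted accuracy differs from a plain mean of per-batch accuracies.

```python
import math

n_train = 13411  # training set size reported in the log
batch_size = 32  # BS in this notebook

n_batches = math.ceil(n_train / batch_size)
last_batch = n_train - (n_batches - 1) * batch_size
batch_sizes = [batch_size] * (n_batches - 1) + [last_batch]

# Made-up correct counts: half right in every full batch, all right in the last one.
correct = [batch_size // 2] * (n_batches - 1) + [last_batch]

# What the training loop computes: total correct over total samples seen.
weighted_acc = sum(correct) / sum(batch_sizes)

# What a naive mean of per-batch accuracies would give instead.
mean_of_batch_accs = sum(c / b for c, b in zip(correct, batch_sizes)) / n_batches

print(n_batches, last_batch, weighted_acc, mean_of_batch_accs)
```

The naive mean gives the three-sample last batch the same weight as a full batch, which is why it comes out slightly higher in this made-up case.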


For inference, please check `inference.ipynb`.