KUL H02A5a Computer Vision: Group Assignment 2
---------------------------------------------------------------
Student numbers: <span style="color:red">r1030550, r2, r3, r4, r5</span>. (fill in your student numbers!)

In this group assignment your team will delve into some deep learning applications for computer vision. The assignment will be delivered in the same groups from *Group assignment 1* and you start from this template notebook. The notebook you submit for grading is the last notebook pinned as default and submitted to the [Kaggle competition](https://www.kaggle.com/t/90a3b6380ecb4700857b9e07a44ca41b) prior to the deadline on **Tuesday 20 May 23:59**. Closely follow [these instructions](https://github.com/gourie/kaggle_inclass) for joining the competition, sharing your notebook with the TAs and making a valid notebook submission to the competition. A notebook submission not only produces a *submission.csv* file that is used to calculate your competition score, it also runs the entire notebook and saves its output as if it were a report. This way it becomes an all-in-one-place document for the TAs to review. As such, please make sure that your final submission notebook is self-contained and fully documented (e.g. provide strong arguments for the design choices that you make). Most likely, this notebook format is not appropriate to run all your experiments at submission time (e.g. the training of CNNs is a memory hungry and time consuming process; due to limited Kaggle resources). It can be a good idea to distribute your code otherwise and only summarize your findings, together with your final predictions, in the submission notebook. For example, you can substitute experiments with some text and figures that you have produced "offline" (e.g. learning curves and results on your internal validation set or even the test set for different architectures, pre-processing pipelines, etc). We advise you to first go through the PDF of this assignment entirely before you really start. Then, it can be a good idea to go through this notebook and use it as your first notebook submission to the competition. 
You can make use of the *Group assignment 2* forum/discussion board on Toledo if you have any questions. Good luck and have fun!

---------------------------------------------------------------
NOTES:
* This notebook is just a template. Please keep the five main sections, but feel free to adjust further in any way you please!
* Clearly indicate the improvements that you make! You can for instance use subsections like: *3.1. Improvement: applying loss function f instead of g*.


# 1. Overview
This assignment consists of *three main parts* for which we expect you to provide code and extensive documentation in the notebook:
* Image classification (Sect. 2)
* Semantic segmentation (Sect. 3)
* Adversarial attacks (Sect. 4)

In the first part, you will train an end-to-end neural network for image classification. In the second part, you will do the same for semantic segmentation. For these two tasks we expect you to put a significant effort into optimizing performance and as such competing with fellow students via the Kaggle competition. In the third part, you will try to find and exploit the weaknesses of your classification and/or segmentation network. For the latter there is no competition format, but we do expect you to put significant effort in achieving good performance on the self-posed goal for that part. Finally, we ask you to reflect and produce an overall discussion with links to the lectures and "real world" computer vision (Sect. 5). It is important to note that only a small part of the grade will reflect the actual performance of your networks. However, we do expect all things to work! In general, we will evaluate the correctness of your approach and your understanding of what you have done that you demonstrate in the descriptions and discussions in the final notebook.

## 1.1 Deep learning resources
If you did not yet explore this in *Group assignment 1 (Sect. 2)*, we recommend using the TensorFlow and/or Keras library for building deep learning models. You can find a nice crash course [here](https://colab.research.google.com/drive/1UCJt8EYjlzCs1H1d1X0iDGYJsHKwu-NO).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
import numpy as np
import pandas as pd
import tensorflow as tf
from matplotlib import pyplot as plt
import torch
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms
import torch.nn.functional as F
from torchvision.transforms.functional import resize
import albumentations as A
from albumentations.pytorch import ToTensorV2
from torchvision.transforms import functional as TF
from sklearn.model_selection import train_test_split
from PIL import Image
import random
from torchvision.transforms import InterpolationMode

## 1.2 PASCAL VOC 2009
For this project you will be using the [PASCAL VOC 2009](http://host.robots.ox.ac.uk/pascal/VOC/voc2009/index.html) dataset. This dataset consists of colour images of various scenes with different object classes (e.g. animal: *bird, cat, ...*; vehicle: *aeroplane, bicycle, ...*), totalling 20 classes.

In [None]:
# Loading the training data
train_df = pd.read_csv('/kaggle/input/kul-computer-vision-ga-2-2025/train/train_set.csv', index_col="Id")
# train_df =pd.read_csv('train/train_set.csv', index_col="Id")
labels = train_df.columns
train_df["img"] = [np.load('/kaggle/input/kul-computer-vision-ga-2-2025/train/img/train_{}.npy'.format(idx)) for idx, _ in train_df.iterrows()]
train_df["seg"] = [np.load('/kaggle/input/kul-computer-vision-ga-2-2025/train/seg/train_{}.npy'.format(idx)) for idx, _ in train_df.iterrows()]
# train_df["img"] = [np.load('train/img/train_{}.npy'.format(idx)) for idx, _ in train_df.iterrows()]
# train_df["seg"] = [np.load('train/seg/train_{}.npy'.format(idx)) for idx, _ in train_df.iterrows()]
print("The training set contains {} examples.".format(len(train_df)))

# Show some examples
fig, axs = plt.subplots(2, 20, figsize=(10 * 20, 10 * 2))
for i, label in enumerate(labels):
    df = train_df.loc[train_df[label] == 1]
    axs[0, i].imshow(df.iloc[0]["img"], vmin=0, vmax=255)
    axs[0, i].set_title("\n".join(label for label in labels if df.iloc[0][label] == 1), fontsize=40)
    axs[0, i].axis("off")
    axs[1, i].imshow(df.iloc[0]["seg"], vmin=0, vmax=20)  # with the absolute color scale it will be clear that the arrays in the "seg" column are label maps (labels in [0, 20])
    axs[1, i].axis("off")
    
plt.show()

# The training dataframe contains for each image 20 columns with the ground truth classification labels and 20 column with the ground truth segmentation maps for each class
train_df.head(1)

In [None]:
# Loading the test data
test_df = pd.read_csv('/kaggle/input/kul-computer-vision-ga-2-2025/test/test_set.csv', index_col="Id")
test_df["img"] = [np.load('/kaggle/input/kul-computer-vision-ga-2-2025/test/img/test_{}.npy'.format(idx)) for idx, _ in test_df.iterrows()]
# test_df = pd.read_csv('test/test_set.csv', index_col="Id")
# test_df["img"] = [np.load('test/img/test_{}.npy'.format(idx)) for idx, _ in test_df.iterrows()]
test_df["seg"] = [-1 * np.ones(img.shape[:2], dtype=np.int8) for img in test_df["img"]]
print("The test set contains {} examples.".format(len(test_df)))

# The test dataframe is similar to the training dataframe, but here the values are -1 --> your task is to fill in these as good as possible in Sect. 2 and Sect. 3; in Sect. 6 this dataframe is automatically transformed in the submission CSV!
test_df.head(1)

## 1.3 Your Kaggle submission
Your filled test dataframe (during Sect. 2 and Sect. 3) must be converted to a submission.csv with two rows per example (one for classification and one for segmentation) and with only a single prediction column (the multi-class/label predictions running length encoded). You don't need to edit this section. Just make sure to call this function at the right position in this notebook.

In [4]:
def _rle_encode(img):
    """
    Kaggle requires RLE encoded predictions for computation of the Dice score (https://www.kaggle.com/lifa08/run-length-encode-and-decode)

    Parameters
    ----------
    img: np.ndarray - binary img array
    
    Returns
    -------
    rle: String - running length encoded version of img
    """
    pixels = img.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    rle = ' '.join(str(x) for x in runs)
    return rle

def generate_submission(df):
    """
    Make sure to call this function once after you completed Sect. 2 and Sect. 3! It transforms and writes your test dataframe into a submission.csv file.
    
    Parameters
    ----------
    df: pd.DataFrame - filled dataframe that needs to be converted
    
    Returns
    -------
    submission_df: pd.DataFrame - df in submission format.
    """
    df_dict = {"Id": [], "Predicted": []}
    for idx, _ in df.iterrows():
        df_dict["Id"].append(f"{idx}_classification")
        df_dict["Predicted"].append(_rle_encode(np.array(df.loc[idx, labels])))
        df_dict["Id"].append(f"{idx}_segmentation")
        df_dict["Predicted"].append(_rle_encode(np.array([df.loc[idx, "seg"] == j + 1 for j in range(len(labels))])))
    
    submission_df = pd.DataFrame(data=df_dict, dtype=str).set_index("Id")
    submission_df.to_csv("submission.csv")
    return submission_df
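
As a quick sanity check of the run-length encoding used above, the following minimal sketch re-implements the same "start length" scheme in plain Python and adds a decoder. The names `rle_encode_list` and `rle_decode_list` are hypothetical helpers for illustration only; they are not part of the competition code, and the 1-indexed start convention is inferred from `_rle_encode`.

```python
def rle_encode_list(flat_pixels):
    """Encode a flat list of 0/1 values as 'start length' pairs (starts are 1-indexed)."""
    padded = [0] + list(flat_pixels) + [0]
    # 1-indexed positions in the original list where the value changes
    changes = [i + 1 for i in range(len(padded) - 1) if padded[i + 1] != padded[i]]
    starts, ends = changes[0::2], changes[1::2]
    return " ".join(f"{s} {e - s}" for s, e in zip(starts, ends))


def rle_decode_list(rle, length):
    """Invert rle_encode_list for a mask of the given length."""
    values = [int(v) for v in rle.split()]
    out = [0] * length
    for start, run in zip(values[0::2], values[1::2]):
        for k in range(start - 1, start - 1 + run):
            out[k] = 1
    return out


mask = [0, 1, 1, 0, 1]
encoded = rle_encode_list(mask)
decoded = rle_decode_list(encoded, len(mask))
print(encoded, decoded)
```

A round trip through both helpers should return the original mask, which is a cheap way to confirm the format before submitting.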

# 2. Image classification
The goal here is simple: implement a classification CNN and train it to recognise all 20 classes (and/or background) using the training set and compete on the test set (by filling in the classification columns in the test dataframe).

## 2.1 CNN from scratch

### Loading
The images are loaded separately here because everything needs to be resized to a fixed size for this CNN.

In [None]:
# Train data
train_df_cnn1 = pd.read_csv('/kaggle/input/kul-computer-vision-ga-2-2025/train/train_set.csv', index_col="Id")
# Take the label names from the freshly loaded CSV (train_df already carries extra "img" and "seg" columns)
labels = train_df_cnn1.columns
# There was an issue with the image sizes, so everything is resized to a fixed size
target_size = (128, 128)
train_df_cnn1["img"] = [
    tf.image.resize(
        np.load('/kaggle/input/kul-computer-vision-ga-2-2025/train/img/train_{}.npy'.format(idx)),
        target_size
    ).numpy()
    for idx, _ in train_df_cnn1.iterrows()
]

train_df_cnn1["seg"] = [np.load('/kaggle/input/kul-computer-vision-ga-2-2025/train/seg/train_{}.npy'.format(idx)) for idx, _ in train_df_cnn1.iterrows()]
print("The training set contains {} examples.".format(len(train_df_cnn1)))

In [None]:
# Loading the test data
test_df_cnn1 = pd.read_csv('/kaggle/input/kul-computer-vision-ga-2-2025/test/test_set.csv', index_col="Id")
test_df_cnn1["img"] = [
    tf.image.resize(
        np.load('/kaggle/input/kul-computer-vision-ga-2-2025/test/img/test_{}.npy'.format(idx)),
        target_size
    ).numpy()
    for idx, _ in test_df_cnn1.iterrows()
]

test_df_cnn1["seg"] = [-1 * np.ones(img.shape[:2], dtype=np.int8) for img in test_df_cnn1["img"]]

### Model



Layer purposes:

* Conv2D - Extracts spatial features from image patches.
* BatchNormalization - Stabilizes and speeds up training by normalizing activations.
* ReLU Activation - Introduces non-linearity to help model complex patterns.
* MaxPooling2D - Downsamples feature maps to reduce spatial dimensions and overfitting.
* GlobalAveragePooling2D - Reduces each feature map to a single value, giving a feature vector that is invariant to spatial position.
* Dense - Learns the final non-linear combinations for classification.
* Sigmoid Output - Suited to multi-label classification (one independent probability per class).

The first four are run together as a block three times to improve accuracy without leading to too much overfitting.
BatchNorm and dropout help regularize, and sigmoid activation combined with binary_crossentropy suits multi-label scenarios.


Parameters:

Optimizer: Adam (adaptive, works well out-of-the-box)

Loss: binary_crossentropy (correct for multi-label)

Metric: AUC (multi_label=True is a strong choice for imbalanced data)
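
To make the sigmoid plus binary cross-entropy choice concrete, here is a minimal plain-Python sketch for a single example with three classes. The logits and labels are made-up values for illustration, and the 0.3 threshold simply mirrors the one used in `predict`; this is not the Keras implementation, only the same formula written out.

```python
import math


def sigmoid(z):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the classes of one example."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


logits = [2.0, -1.0, 0.5]  # hypothetical raw scores for 3 classes
y_true = [1, 0, 1]         # the image contains classes 0 and 2

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(y_true, probs)
preds = [int(p > 0.3) for p in probs]

print(probs, loss, preds)
```

Because each class gets its own sigmoid, the probabilities do not have to sum to one, so several classes can be predicted for the same image. A softmax output would force the classes to compete.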


In [None]:
from tensorflow.keras import layers, models
import tensorflow as tf
import numpy as np

class CNNClassificationModel:
    """
    CNN classification model:
        - learns to predict labels from images using a Convolutional Neural Network
        - assumes an input can have multiple labels (multi-label classification)
    """
    def __init__(self, input_shape=(32, 32, 3), nb_classes=20):
        self.input_shape = input_shape
        self.nb_classes = nb_classes
        self._build_model()

    def _build_model(self):
        self.model = models.Sequential([
            layers.Input(shape=self.input_shape),

            # Conv Block 1
            layers.Conv2D(64, (3, 3), padding='same'),
            layers.BatchNormalization(),
            layers.Activation('relu'),
            layers.Conv2D(64, (3, 3), padding='same'),
            layers.BatchNormalization(),
            layers.Activation('relu'),
            layers.MaxPooling2D((2, 2)),

            # Conv Block 2
            layers.Conv2D(128, (3, 3), padding='same'),
            layers.BatchNormalization(),
            layers.Activation('relu'),
            layers.Conv2D(128, (3, 3), padding='same'),
            layers.BatchNormalization(),
            layers.Activation('relu'),
            layers.MaxPooling2D((2, 2)),

            # Conv Block 3
            layers.Conv2D(256, (3, 3), padding='same'),
            layers.BatchNormalization(),
            layers.Activation('relu'),
            layers.Conv2D(256, (3, 3), padding='same'),
            layers.BatchNormalization(),
            layers.Activation('relu'),
            layers.MaxPooling2D((2, 2)),

            # Conv Block 4 (optional, for larger images like 128x128+)
            layers.Conv2D(512, (3, 3), padding='same'),
            layers.BatchNormalization(),
            layers.Activation('relu'),
            layers.MaxPooling2D((2, 2)),

            # Classification head
            layers.GlobalAveragePooling2D(),
            layers.Dense(512, activation='relu'),
            layers.Dropout(0.5),
            layers.Dense(self.nb_classes, activation='sigmoid')  # sigmoid for multi-label
        ])

        self.model.compile(
            optimizer='adam',
            loss='binary_crossentropy',
            metrics=[tf.keras.metrics.AUC(multi_label=True, name="auc")]
        )
        self.model.summary()


    def fit(self, X, y, epochs=10, batch_size=32):
        """
        Trains the CNN model.

        Parameters
        ----------
        X: list of arrays - n x (height x width x 3)
        y: list of arrays - n x (nb_classes)
        """
        X = np.stack(X)  # Convert list to array
        y = np.stack(y)
        self.model.fit(X, y, epochs=epochs, batch_size=batch_size, validation_split=0.1)
        return self

    def predict(self, X):
        """
        Predicts labels for each input.

        Parameters
        ----------
        X: list of arrays - n x (height x width x 3)

        Returns
        -------
        y_pred: list of arrays - n x (nb_classes)
        """
        X = np.stack(X)
        preds = self.model.predict(X)
        return (preds > 0.3).astype(int)  # threshold at 0.3 (better for large images)

    def __call__(self, X):
        return self.predict(X)


### Tuning

This code block uses Keras Tuner to optimise the:
* Number of filters
* Dropout rate
* Dense units
* Number of layers

It takes around an hour to run, so DON'T run it unless you really need to.
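
For intuition about what a random search does, the sketch below samples configurations uniformly from a small search space and keeps the best-scoring one. It is a plain-Python illustration, not the Keras Tuner implementation: `random_search`, `toy_score` and the search space are hypothetical, and `toy_score` merely stands in for "train a model and return its validation AUC".

```python
import random


def random_search(score_fn, space, max_trials, seed=0):
    """Sample max_trials random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(max_trials):
        cfg = {name: rng.choice(options) for name, options in space.items()}
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Hypothetical search space, loosely modelled on the options tuned in this section
space = {
    "conv1_filters": [32, 64, 128],
    "dropout_rate": [0.2, 0.3, 0.4, 0.5],
    "dense_units": [128, 256, 512],
}


def toy_score(cfg):
    # Made-up objective that peaks at 64 filters, 0.3 dropout and 256 dense units
    return (-abs(cfg["conv1_filters"] - 64) / 64
            - abs(cfg["dropout_rate"] - 0.3)
            - abs(cfg["dense_units"] - 256) / 256)


best_cfg, best_score = random_search(toy_score, space, max_trials=30)
print(best_cfg, best_score)
```

With 30 trials over 36 possible combinations the search covers much of this toy space. The real search is expensive because every trial trains a network.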

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models
from keras_tuner import RandomSearch  # Keras Tuner provides the RandomSearch tuner used below

def build_model(hp):
    def get_activation_layer():
        act = hp.Choice("activation", ["relu", "leaky_relu"])
        return layers.Activation("relu") if act == "relu" else layers.LeakyReLU()

    model = models.Sequential()
    # Use the resized images, since the tuner is trained on train_df_cnn1
    model.add(layers.Input(shape=train_df_cnn1.iloc[0]["img"].shape))

    # Conv Block 1
    k1 = hp.Choice("kernel1", [3, 5])
    model.add(layers.Conv2D(hp.Choice("conv1_filters", [32, 64, 128]),
                            kernel_size=(k1, k1), padding='same'))
    if hp.Boolean("conv1_batchnorm"):
        model.add(layers.BatchNormalization())
    model.add(get_activation_layer())
    model.add(layers.MaxPooling2D((2, 2)))

    # Conv Block 2
    k2 = hp.Choice("kernel2", [3, 5])
    model.add(layers.Conv2D(hp.Choice("conv2_filters", [64, 128, 256]),
                            kernel_size=(k2, k2), padding='same'))
    if hp.Boolean("conv2_batchnorm"):
        model.add(layers.BatchNormalization())
    model.add(get_activation_layer())
    model.add(layers.MaxPooling2D((2, 2)))

    # Optional Conv Block 3
    if hp.Boolean("add_third_conv"):
        k3 = hp.Choice("kernel3", [3, 5])
        model.add(layers.Conv2D(hp.Choice("conv3_filters", [128, 256]),
                                kernel_size=(k3, k3), padding='same'))
        model.add(layers.BatchNormalization())
        model.add(get_activation_layer())
        model.add(layers.MaxPooling2D((2, 2)))

    model.add(layers.GlobalAveragePooling2D())

    # Dense Layer
    model.add(layers.Dense(hp.Int("dense_units", 128, 512, step=64),
                           activation='relu'))
    if hp.Boolean("dense_dropout"):
        model.add(layers.Dropout(hp.Float("dropout_rate", 0.2, 0.5, step=0.1)))

    model.add(layers.Dense(len(labels), activation='sigmoid'))

    # Compile model
    model.compile(
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=hp.Float("learning_rate", 1e-4, 1e-2, sampling="log")),
        loss='binary_crossentropy',
        metrics=[tf.keras.metrics.AUC(multi_label=True, name="auc")]
    )

    return model


tuner = RandomSearch(
  build_model,
  objective='val_auc',
  max_trials=30,
  executions_per_trial=1,
  overwrite=True,
  directory='tuning',
  project_name='cnn_deep_tuned_v2'
)

X_train = np.stack(train_df_cnn1["img"].values)
y_train = train_df_cnn1[labels].astype(int).values

# Optional: Search over batch sizes manually in a loop
for bs in [16, 32, 64]:
  tuner.search(X_train, y_train,
                epochs=30,
                batch_size=bs,
                validation_split=0.2,
                callbacks=[tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)])


best_model = tuner.get_best_models(1)[0]
best_model.summary()

# Predict on the resized test images (same input size the tuner trained on)
X_test = np.stack(test_df_cnn1["img"].values)
test_df_cnn1.loc[:, labels] = (best_model.predict(X_test) > 0.5).astype(int)

generate_submission(test_df_cnn1)
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best hyperparameters:")
for param in best_hp.values:
    print(f"{param}: {best_hp.get(param)}")



### Running

In [None]:
model = CNNClassificationModel(input_shape=train_df_cnn1.iloc[0]["img"].shape, nb_classes=len(labels))
# Convert the labels to numeric values if they are not already
label_values = train_df_cnn1[labels].astype(int).values

history = model.model.fit(
    np.stack(train_df_cnn1["img"].values), 
    label_values, 
    epochs=50, 
    batch_size=32,
    validation_split=0.1  # <- this enables val_auc
)




# Predict on the test set and write the predicted labels into the test dataframe
test_df_cnn1.loc[:, labels] = model.predict(test_df_cnn1["img"])
test_df_cnn1.head(1)



# After predictions
generate_submission(test_df_cnn1)


### Results and analysis
The AUC seems to plateau at around 35 epochs, whilst the loss drops off almost immediately. The AUC still fluctuates due to random noise, but the scale of that variation shrinks towards the end. I would not recommend running more than 35 epochs, both to avoid overfitting and to avoid wasting resources.
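
One way to turn the "stop once the metric plateaus" observation into a rule is patience-based early stopping. The sketch below is a plain-Python illustration of that rule for a metric that should be maximised. The function name `early_stopping_epoch` and the `val_auc` values are made up for the example and are not measurements from this notebook.

```python
def early_stopping_epoch(val_scores, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a metric to maximise.

    Training stops once the metric has not improved for `patience` epochs in a row.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score + min_delta:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores) - 1, best_epoch


# Hypothetical validation AUC curve that plateaus with some noise
val_auc = [0.60, 0.68, 0.72, 0.74, 0.75, 0.75, 0.74, 0.75, 0.74, 0.75, 0.74]
stop_epoch, best_epoch = early_stopping_epoch(val_auc, patience=3)
print(stop_epoch, best_epoch)
```

In Keras the same idea is available through the `EarlyStopping` callback, which the tuning cell already uses with `restore_best_weights=True`.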

In [None]:
import matplotlib.pyplot as plt

def plot_predictions(images, true_labels, pred_labels, class_names, n=10):
    print(len(images))
    # Show at most n examples (fewer if not enough images are available)
    for i in range(min(n, len(images))):
        img = images[i]
        # Scale pixel values from [0, 255] to [0, 1] for display
        if img.max() > 1.0:
            img = img / 255.0

        plt.imshow(img)
        plt.axis('off')
        # Translate the label indices into class names for the title
        true_classes = [class_names[j] for j in np.where(true_labels[i]==1)[0]]
        pred_classes = [class_names[j] for j in np.where(pred_labels[i]==1)[0]]
        plt.title(f"True: {true_classes}\nPred: {pred_classes}")

        plt.show()


# Generate predictions for training set to compare
X_train = np.stack(train_df_cnn1["img"].values)
y_true = train_df_cnn1[labels].astype(int).values
y_pred = model.predict(train_df_cnn1["img"])

# Compare predictions
correct = np.all(y_true == y_pred, axis=1)
incorrect = ~correct

print("✅ Good predictions:")
plot_predictions(X_train[correct], y_true[correct], y_pred[correct], labels)

print("❌ Bad predictions:")
plot_predictions(X_train[incorrect], y_true[incorrect], y_pred[incorrect], labels)


In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['auc'], label='Train AUC')
plt.plot(history.history['val_auc'], label='Val AUC')
plt.xlabel('Epoch')
plt.ylabel('AUC')
plt.title('AUC vs Epochs')
plt.legend()
plt.grid(True)
plt.show()
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss vs Epochs')
plt.legend()
plt.grid(True)
plt.show()


For a from-scratch classification model, the CNN doesn't perform too badly. However, as the example predictions show, it quickly starts to struggle with more complex images that contain multiple segments or obscured areas.

In [None]:
class RandomClassificationModel:
    """
    Random classification model: 
        - generates random labels for the inputs based on the class distribution observed during training
        - assumes an input can have multiple labels
    """
    def fit(self, X, y):
        """
        Adjusts the class ratio variable to the one observed in y. 

        Parameters
        ----------
        X: list of arrays - n x (height x width x 3)
        y: list of arrays - n x (nb_classes)

        Returns
        -------
        self
        """
        self.distribution = np.mean(y, axis=0)
        print("Setting class distribution to:\n{}".format("\n".join(f"{label}: {p}" for label, p in zip(labels, self.distribution))))
        return self
        
    def predict(self, X):
        """
        Predicts for each input a label.
        
        Parameters
        ----------
        X: list of arrays - n x (height x width x 3)
            
        Returns
        -------
        y_pred: list of arrays - n x (nb_classes)
        """
        np.random.seed(0)
        return [np.array([int(np.random.rand() < p) for p in self.distribution]) for _ in X]
    
    def __call__(self, X):
        return self.predict(X)
    
model = RandomClassificationModel()
model.fit(train_df["img"], train_df[labels])
test_df.loc[:, labels] = model.predict(test_df["img"])
test_df.head(1)

Pre-processing: resize and scale the input images to fit the EfficientNetB3 model

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.image import img_to_array
import cv2  # OpenCV
from tensorflow.keras.applications.efficientnet import preprocess_input as effnet_pre


def preprocess_images(image_list, target_size=(300, 300)):
    processed = []
    for img in image_list:
        if isinstance(img, str):
            # If img is a file path
            img = cv2.imread(img)
        img = cv2.resize(img, target_size)
        img = effnet_pre(img.astype("float32"))  # pass-through in Keras: EfficientNet rescales internally and expects pixels in [0, 255]
        processed.append(img)
    return np.array(processed, dtype=np.float32)



EfficientNetB3 model training for the baseline

# Preprocess
X_all = preprocess_images(train_df["img"])
y_all = train_df[labels].values.astype(np.float32)

# Split
X_train, X_val, y_train, y_val = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Train base model
model = EfficientNetClassifier()
model.fit(X_train, y_train, val_data=(X_val, y_val))

# Fine-tune
# This assumes EfficientNetClassifier has been modified to store its backbone as base_model
model.base_model.trainable = True
for layer in model.base_model.layers[:-20]:
    layer.trainable = False
model.model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss='binary_crossentropy',
    metrics=[
        tf.keras.metrics.BinaryAccuracy(name='binary_accuracy'),
        tf.keras.metrics.AUC(name='auc')
    ]
)
model.model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5)

# Evaluate
from sklearn.metrics import classification_report
y_val_pred = model.predict(X_val)
print(classification_report(y_val, y_val_pred, target_names=labels))

# Predict on test
X_test = preprocess_images(test_df["img"])
test_df.loc[:, labels] = model.predict(X_test)
test_df.head(1)


Fine-tuning the EfficientNetB3 model by:
1. Data augmentation of the input dataset
2. Focal loss for the imbalanced dataset
3. Two-phase training:
    - first phase: freeze all layers of the pre-trained model and train only the custom layers
    - second phase: unfreeze the last 20 layers of the pre-trained model so its higher-level features can adapt to our own dataset
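
To illustrate why focal loss helps with imbalanced data, the sketch below writes out the textbook binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t), for a single prediction in plain Python. This is an illustration of the formula under the common defaults gamma = 2 and alpha = 0.25, not the Keras implementation. In particular, `tf.keras.losses.BinaryFocalCrossentropy` appears to apply `alpha` only when `apply_class_balancing=True`, which is worth checking in the Keras documentation.

```python
import math


def binary_crossentropy(y_true, p, eps=1e-7):
    """Standard binary cross-entropy for one prediction."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y_true == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-7):
    """Textbook focal loss: down-weights examples that are already classified well."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y_true == 1 else 1.0 - p          # probability given to the true label
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(1, 0.9)   # confident and correct
hard = binary_focal_loss(1, 0.1)   # confident and wrong

bce_ratio = binary_crossentropy(1, 0.1) / binary_crossentropy(1, 0.9)
focal_ratio = hard / easy
print(easy, hard, bce_ratio, focal_ratio)
```

The hard example dominates the focal loss far more strongly than it dominates plain cross-entropy, so the many easy negatives in an imbalanced multi-label problem contribute little to the gradient.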


from sklearn.metrics import average_precision_score, classification_report

class MAPCallback(tf.keras.callbacks.Callback):
    def __init__(self, train_data, val_data):
        super().__init__()
        self.tr_ds  = train_data
        self.val_ds = val_data

    def _compute_map(self, ds):
        y_true, y_prob = [], []
        for bx, by in ds:
            y_true.append(by.numpy())
            y_prob.append(self.model.predict(bx, verbose=0))
        return average_precision_score(
            np.vstack(y_true), np.vstack(y_prob), average="macro")

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        logs["mAP"]     = self._compute_map(self.tr_ds)   # solid curve
        logs["val_mAP"] = self._compute_map(self.val_ds)  # dashed curve
        print(f"\ntrain mAP: {logs['mAP']:.4f}  "
              f"val mAP: {logs['val_mAP']:.4f}")
              
# Data augmentation: keep the original images and add an augmented copy to increase diversity (doubles the dataset size)
def make_ds_with_originals(X, y, training):
    ds_orig = tf.data.Dataset.from_tensor_slices((X, y))

    if training:
        aug = tf.keras.Sequential([
            tf.keras.layers.RandomFlip("horizontal"),
            tf.keras.layers.RandomRotation(0.1),
            tf.keras.layers.RandomContrast(0.1)
        ])

        # Apply augmentation only to second copy
        ds_aug = tf.data.Dataset.from_tensor_slices((X, y))
        ds_aug = ds_aug.map(lambda x, y: (aug(x), y),
                            num_parallel_calls=tf.data.AUTOTUNE)

        # Combine original and augmented datasets
        ds = ds_orig.concatenate(ds_aug)
        ds = ds.shuffle(2048)
    else:
        ds = ds_orig

    return ds.batch(BATCH).prefetch(tf.data.AUTOTUNE)


In [None]:
from tensorflow.keras.applications.efficientnet import preprocess_input as effnet_pre
from sklearn.model_selection import train_test_split
import tensorflow as tf, numpy as np, cv2

IMG_SIZE = 300
NUM_CLASSES = 20
THRESH = 0.5                    # ← normal threshold again

def preprocess_images(image_list):
    out = []
    for img in image_list:
        if isinstance(img, str):           # path → array
            img = cv2.imread(img)
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
        img = effnet_pre(img.astype("float32"))  # pass-through in Keras: EfficientNet rescales internally and expects pixels in [0, 255]
        out.append(img)
    return np.stack(out, axis=0)

# -- build numpy arrays ----------------------------------------------------
X_all = preprocess_images(train_df["img"])
y_all = train_df[labels].values.astype("float32")

# split BEFORE building tf.data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all.sum(axis=1) > 0
)

# tf.data pipeline with on‑the‑fly light augmentation
BATCH = 32
def make_ds(X, y, training):
    ds = tf.data.Dataset.from_tensor_slices((X, y))
    if training:
        aug = tf.keras.Sequential([
            tf.keras.layers.RandomFlip("horizontal"),
            tf.keras.layers.RandomRotation(0.1),
            tf.keras.layers.RandomContrast(0.1)
        ])
        ds = ds.shuffle(2048).map(lambda x, y: (aug(x), y),
                                  num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(BATCH).prefetch(tf.data.AUTOTUNE)

# train_ds = make_ds(X_tr, y_tr, True)
# val_ds   = make_ds(X_val, y_val, False)

train_ds = make_ds_with_originals(X_tr, y_tr, True)
val_ds   = make_ds_with_originals(X_val, y_val, False)
# Pick focal-loss hyper-params (commonly used defaults from the focal loss paper: γ = 2, α = 0.25) ----------------------

FOCAL_LOSS = tf.keras.losses.BinaryFocalCrossentropy(
    gamma=2.0,
    alpha=0.25,
    from_logits=False,   # your model ends with Sigmoid → probabilities
)


# -- model -----------------------------------------------------------------
base = tf.keras.applications.EfficientNetB3(
    include_top=False, weights="imagenet",
    input_shape=(IMG_SIZE, IMG_SIZE, 3), pooling="avg"
)
base.trainable = False

# The following setup is the newly added custom layer
inputs  = tf.keras.Input((IMG_SIZE, IMG_SIZE, 3))
x       = base(inputs, training=False)
x       = tf.keras.layers.Dropout(0.3)(x)
x       = tf.keras.layers.Dense(256, activation="relu")(x)  # ReLU does not saturate for x > 0, which mitigates the vanishing-gradient problem
x       = tf.keras.layers.Dropout(0.3)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid")(x)  # Sigmoid maps each output to an independent probability in (0, 1) per class
model   = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="binary_crossentropy",
    # loss=FOCAL_LOSS,                              # ← focal
    metrics=[tf.keras.metrics.AUC(multi_label=True, name="auc")]
)

map_cb = MAPCallback(train_ds,val_ds)

history = model.fit(train_ds, epochs=8, validation_data=val_ds,callbacks=[map_cb])

# ---- fine‑tune -----------------------------------------------------------
base.trainable = True                     # allow  all EfficientNetB3 layers to update
for layer in base.layers[:-20]:           # then re‑freeze everything except the last 20 layers(for higher level abstraction)
    layer.trainable = False


# Phase 2 makes small, careful adjustments so the backbone captures features specific to our dataset.
# The learning rate is therefore lowered (10x smaller) so the pre-trained weights are not damaged.
map_cb_ft = MAPCallback(train_ds,val_ds)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss="binary_crossentropy",
    # loss=FOCAL_LOSS,                            
    metrics=[tf.keras.metrics.AUC(multi_label=True, name="auc")]
)
history_ft = model.fit(train_ds, epochs=15, validation_data=val_ds,callbacks=[map_cb_ft])

# ---- evaluation ----------------------------------------------------------
y_true, y_prob = [], []
for bx, by in val_ds:
    y_true.append(by.numpy())
    y_prob.append(model.predict(bx, verbose=0))
y_true = np.vstack(y_true)
y_prob = np.vstack(y_prob)

mAP = average_precision_score(y_true, y_prob, average="macro")
print(f"Validation mAP: {mAP:.3f}")

y_pred = (y_prob >= THRESH).astype(int)
print(classification_report(y_true, y_pred, target_names=labels))


Further fine-tuning the model by adding per-class thresholding
 - In this step each class is classified with its own threshold value instead of a fixed value of 0.5 for all classes

In [None]:
from sklearn.metrics import f1_score
import numpy as np

# STEP 1: Use already computed predictions
# You should already have these:
#   y_true: shape (N, 20)
#   y_prob: shape (N, 20)

thresholds = np.linspace(0.0, 1.0, 101)  # test 101 values from 0.00 to 1.00
best_thresholds = np.zeros(y_true.shape[1])

for i in range(y_true.shape[1]):  # loop over 20 classes
    f1s = []
    for t in thresholds:
        preds = (y_prob[:, i] >= t).astype(int)
        f1 = f1_score(y_true[:, i], preds, zero_division=0)
        f1s.append(f1)
    best_t = thresholds[np.argmax(f1s)]
    best_thresholds[i] = best_t
    print(f"{labels[i]:<15}  best_thresh={best_t:.2f}  max_f1={max(f1s):.3f}")
# Step 2: Apply thresholds to validation predictions
y_pred_best = (y_prob >= best_thresholds).astype(int)  # broadcasting

from sklearn.metrics import classification_report, average_precision_score
print(classification_report(y_true, y_pred_best, target_names=labels))
mAP = average_precision_score(y_true, y_prob, average="macro")
print(f"Validation mAP: {mAP:.3f} (unchanged)\nMacro F1 (new): {f1_score(y_true, y_pred_best, average='macro'):.3f}")
    
# Step 3: Use best thresholds for test submission

# best_thresholds = best_thresholds.reshape(1, -1)
# # -- preprocess exactly as you did for training/validation --
# X_test = preprocess_images(test_df["img"])      # shape (N_test, 300, 300, 3)

# # -- probability logits --
# y_test_prob = model.predict(X_test, verbose=1)  # shape (N_test, 20)

# # -- apply per‑class thresholds --
# y_test_pred = (y_test_prob >= best_thresholds).astype(np.uint8)

# test_df.loc[:, labels] = -1        # or 0, or np.nan depending on your need

# # write the 20 binary columns into the dataframe
# test_df.loc[:, labels] = y_test_pred


# submission_df = generate_submission(test_df)
# print("Saved:", submission_df.shape, "→ submission.csv")
# display(submission_df.head())

Visualization of the training process of the classification model

In [None]:
# Visualization of the error during training progress

def plot_history(histories, metric, suptitle):
    """
    histories : dict(label -> History object)
    metric    : str   e.g. 'loss', 'auc', 'mAP', 'val_mAP', …
    suptitle  : str   overall figure title
    """
    n = len(histories)
    fig, axes = plt.subplots(1, n, figsize=(6*n, 4), sharey=True)

    if n == 1:                     # keep API identical for one phase
        axes = [axes]

    for ax, (label, h) in zip(axes, histories.items()):
        # training curve
        if metric in h.history:
            ax.plot(h.history[metric], label=f'{metric}', linestyle='-')
        # validation curve
        val_key = metric if metric.startswith('val_') else f'val_{metric}'
        if val_key in h.history:
            ax.plot(h.history[val_key], label=f'{val_key}', linestyle='--')

        ax.set_title(label)
        ax.set_xlabel('epoch')
        ax.grid(True)
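        if ax is axes[0]:          # y-label only on first subplot
            # remove the 'val_' prefix; str.lstrip strips a set of characters, not a prefix
            ax.set_ylabel(metric[4:] if metric.startswith('val_') else metric)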

        ax.legend()

    fig.suptitle(suptitle, fontsize=14)
    fig.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()


plot_history({'phase 1': history, 'phase 2': history_ft},metric='loss',suptitle='Binary‑cross‑entropy')
plot_history({'phase 1': history, 'phase 2': history_ft},metric='auc',suptitle='AUC')
plot_history({'phase 1': history, 'phase 2': history_ft},metric='mAP',suptitle='macro‑mAP')


# 3. Semantic segmentation
The goal here is to implement a segmentation CNN that labels every pixel in the image as belonging to one of the 20 classes (and/or background). Use the training set to train your CNN and compete on the test set (by filling in the segmentation column in the test dataframe).

In [None]:
class RandomSegmentationModel:
    """
    Random segmentation model: 
        - generates random label maps for the inputs based on the class distributions observed during training
        - every pixel in an input can only have one label
    """
    def fit(self, X, Y):
        """
        Adjusts the class ratio variable to the one observed in Y. 

        Parameters
        ----------
        X: list of arrays - n x (height x width x 3)
        Y: list of arrays - n x (height x width)

        Returns
        -------
        self
        """
        self.distribution = np.mean([[np.sum(Y_ == i) / Y_.size for i in range(len(labels) + 1)] for Y_ in Y], axis=0)
        print("Setting class distribution to:\nbackground: {}\n{}".format(self.distribution[0], "\n".join(f"{label}: {p}" for label, p in zip(labels, self.distribution[1:]))))
        return self
        
    def predict(self, X):
        """
        Predicts for each input a label map.
        
        Parameters
        ----------
        X: list of arrays - n x (height x width x 3)
            
        Returns
        -------
        Y_pred: list of arrays - n x (height x width)
        """
        np.random.seed(0)
        return [np.random.choice(np.arange(len(labels) + 1), size=X_.shape[:2], p=self.distribution) for X_ in X]
    
    def __call__(self, X):
        return self.predict(X)
    
model = RandomSegmentationModel()
model.fit(train_df["img"], train_df["seg"])
test_df.loc[:, "seg"] = model.predict(test_df["img"])
test_df.head(1)

## Semantic segmentation from scratch

Semantic segmentation involves classifying each pixel in an image into one of several predefined categories. It provides a dense, pixel-level understanding of the visual scene. 

The implemented U-Net uses skip connections: features from the contracting path are concatenated with those from the expanding path. The architecture is a symmetric encoder-decoder with matching levels, so every downsampling step has a corresponding upsampling step. An additional bottleneck layer captures higher-level features before upsampling. The double convolutional block at each level helps the network learn more robust feature representations.

Skip connections are added to a conventional convolutional neural network to compensate for the resolution that is lost during feature extraction. The encoder has four levels, each consisting of a double 3x3 convolution with batch normalisation and ReLU activation. ReLU is a nonlinear activation function that helps the network learn more complex features. Batch normalisation stabilises training by normalising activations towards a mean close to 0 and a variance close to 1. The first encoder level applies 64 filters to the 128x128x3 input image; every following level doubles the number of filters while max pooling halves the spatial resolution. The bottleneck is the deepest part of the network and operates on the most abstract representation: it receives an 8x8x512 feature map and applies a double 3x3 convolution to obtain 1024 channels. The decoder mirrors the encoder, but uses transposed convolutions for upsampling instead of max pooling. Each decoder block starts with upsampling, then concatenates the corresponding encoder features and finally applies two 3x3 convolutions. The number of filters shrinks as the resolution grows. The skip connections link encoder features to the decoder and thereby combine contextual information with precise localisation.
The output layer is a 1x1 convolution that maps the 64-channel feature map to the required number of classes and produces a score map for each class at every pixel (converted to probabilities by a softmax).
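
The shape arithmetic described above can be checked with a small sketch. It assumes size-preserving (padded) 3x3 convolutions, 2x2 max pooling and stride-2 transposed convolutions; `unet_shapes` is a hypothetical helper that only tracks tensor shapes and builds no network.

```python
def unet_shapes(size=128, in_channels=3, base=64, depth=4):
    """Return (stage, height, width, channels) for each stage of a U-Net
    with `depth` pooling steps and `base` filters in the first level."""
    shapes = [("input", size, size, in_channels)]
    s, c = size, base
    for level in range(1, depth + 1):
        shapes.append((f"encoder {level}", s, s, c))
        s //= 2  # 2x2 max pooling halves height and width
        c *= 2   # the next level doubles the number of filters
    shapes.append(("bottleneck", s, s, c))
    for level in range(1, depth + 1):
        s *= 2   # the transposed convolution doubles height and width
        c //= 2  # and halves the channels; concatenating the skip connection
                 # gives 2*c channels, which the double convolution maps back to c
        shapes.append((f"decoder {level}", s, s, c))
    return shapes


for stage, h, w, c in unet_shapes():
    print(f"{stage:<11} {h:>3} x {w:>3} x {c}")
```

For a 128x128 input the trace passes through 64, 32 and 16 pixels per side in the encoder, reaches 8x8 with 1024 channels at the bottleneck, and returns to 128x128 with 64 channels before the 1x1 output convolution.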

In [5]:
#Constants
#Number of classes for segmentation including background class
NUM_CLASSES = 21 #20 classes + background
#target size (height, width) to which all input images and masks will be resized - for size consistency in the network
TARGET_SIZE = (128, 128)
#Dataset
#Define dataset for semantic segmentation - loading and preprocessing of image and mask pairs
class SegmentationDataset(Dataset):
    """
    Constructor for dataset 
    Initializes the dataset with a dataframe containing file paths or image/mask data,
    also sets the target size for resizing and defines image transformations
    """
    def __init__(self, df, target_size=(128, 128), augment=False):
        #store the image data from dataframe
        self.images = df["img"].values
        #store the mask data from dataframe
        self.masks = df["seg"].values
        #store the target size for resizing
        self.target_size = target_size
        self.augment = augment
        #Define image transformations: ToTensor converts the image to a PyTorch Tensor
        # It also scales pixel values from [0,255] to [0,1].
        # Normalize the tensor with the given mean and standard deviation, which helps standardizing the input data distribution 
        self.img_transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                               std=[0.229, 0.224, 0.225]),
        ])
        
    #return the total number of samples in the dataset
    def __len__(self):
        return len(self.images)
    #retrieve the single sample at the given index
    def __getitem__(self, idx):
        #Load the image and mask data at the specified index and convert it to uint8 data type
        img = Image.fromarray(self.images[idx].astype(np.uint8))
        mask = Image.fromarray(self.masks[idx].astype(np.uint8))

        if self.augment: 
            #Random horizontal flip, applied identically to image and mask
            if random.random() > .5:
                img = TF.hflip(img)
                mask = TF.hflip(mask)
            #Random rotation by a multiple of 90 degrees
            angle = random.choice([0, 90, 180, 270])
            img = TF.rotate(img, angle, interpolation = InterpolationMode.BILINEAR)
            #Using InterpolationMode.NEAREST is crucial for masks to preserve discrete class labels
            mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)
        #Resize image and mask to the target size (NEAREST again for the mask)
        img = TF.resize(img, self.target_size, interpolation=InterpolationMode.BILINEAR)
        mask = TF.resize(mask, self.target_size, interpolation=InterpolationMode.NEAREST)
        #Apply the defined image transformations to the image
        img = self.img_transform(img)
        #Convert the mask numpy array to a PyTorch Tensor
        mask = torch.as_tensor(np.array(mask), dtype=torch.long)

        #Return the processed image and mask tensors
        return img, mask 

In [5]:
def split_dataframe(df, val_split=0.2, random_state=42):
    """
    Split a dataframe into training and validation sets for evaluation
    of the model performance on unseen data during training process

    Use train_test_split from scikit-learn to perform the split.
    df => The input dataframe
    test_size => The proportion of the dataset to include in the validation split
    random_state => Ensure reproducibility of the split
    shuffle => Shuffle the data before splitting, important for preventing ordered biases
    """
    train_df, val_df = train_test_split(df,test_size=val_split,
        random_state=random_state, shuffle=True)
    return train_df.reset_index(drop=True), val_df.reset_index(drop=True)

#Split the main training dataframe into training and validation sets which provides data for training and evaluating the model during the training process
train_df, val_df = split_dataframe(train_df)

"""
    Create dataset and dataloader for training and validation data
    DataLoader provides an iterable over the dataset, handling batching, shuffling, and multiprocessing
    batch_size => The number of samples per batch.
    shuffle => Shuffle the data at each epoch
    num_workers => Number of subprocesses to use for data loading(0 means the main process)
"""
train_dataset = SegmentationDataset(train_df, target_size=TARGET_SIZE)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, num_workers=0)

val_dataset = SegmentationDataset(val_df, target_size=TARGET_SIZE)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=False, num_workers=0)


This network was chosen because U-Net gives a somewhat better representation of the shape of the structures. A classic convolutional neural network shrinks the resolution of the representation layer by layer, which results in small feature maps that are well suited to classifying a whole image. Recovering a full segmentation mask from such maps tends to smooth out curves, and the borders of objects may be lost. U-Net, with its more complex encoder-decoder architecture, mitigates this problem. The encoder acts much like a classic CNN: it decreases the resolution and increases the number of channels to obtain deep semantic context. The decoder, on the other hand, upsamples with a transposed convolution at each step and, in the implemented version, also uses skip connections through which the encoder feature map of the same level is concatenated to the representation. In every reconstruction level the network therefore receives new information about textures and curves, which helps to emphasise the contours of the objects.
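
The loss of border information can be illustrated on a one-dimensional toy signal. This is a minimal sketch under simplifying assumptions: it uses nearest-neighbour upsampling instead of the learned transposed convolutions of the actual network, and `max_pool_1d` and `upsample_nearest_1d` are hypothetical helpers for this example only.

```python
def max_pool_1d(x, k=2):
    """Non-overlapping max pooling over a list."""
    return [max(x[i:i + k]) for i in range(0, len(x), k)]


def upsample_nearest_1d(x, k=2):
    """Nearest-neighbour upsampling: repeat every value k times."""
    return [v for v in x for _ in range(k)]


# A step edge: background (0) up to index 4, object (1) from index 5 onwards
signal = [0, 0, 0, 0, 0, 1, 1, 1]

# Two pooling steps reduce 8 samples to 2, two upsampling steps restore the length
coarse = max_pool_1d(max_pool_1d(signal))
restored = upsample_nearest_1d(upsample_nearest_1d(coarse))

mismatches = [i for i, (a, b) in enumerate(zip(signal, restored)) if a != b]
print("coarse:              ", coarse)
print("restored:            ", restored)
print("mismatched positions:", mismatches)
```

After pooling, the edge can only sit on the coarse grid, so the restored signal places it one sample too early. A skip connection hands the decoder the full-resolution encoder features, in which the exact edge position is still present.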


In [7]:
class UNet(nn.Module):
    """ 
        Definition of the U-Net model
        It consists of a contracting path (encoder) to capture context and an expanding path (decoder)
        to enable precise localization, with skip connections between the encoder and decoder
    """
    def __init__(self, num_classes):
        """
            Args: num_classes (int): number of output classes including background
        """
        super().__init__() #construct the parent class nn.Module
        self.num_classes = num_classes #store the number of classes
        
        """
            Contracting Path - Encoder
            The network downsamples the input image and extracts features. 
            Each downsampling block consists of convolutional layers and a pooling layer and
            the number of channels increases with depth to capture more complex features
        """
        # First double convolution block: Input channels = 3 (for RGB images), Output channels = 64
        self.down_conv1 = self.double_conv(3, 64)
        # Second double convolution block: Input channels = 64, Output channels = 128
        self.down_conv2 = self.double_conv(64, 128)
        # Third double convolution block: Input channels = 128, Output channels = 256
        self.down_conv3 = self.double_conv(128, 256)
        # Fourth double convolution block: Input channels = 256, Output channels = 512
        self.down_conv4 = self.double_conv(256, 512)
        # Max pooling layer for downsampling => kernel size and stride of 2 reduce the spatial dimensions by half
        self.maxpool = nn.MaxPool2d(2)
        
        # Bottleneck - the layer with the lowest spatial resolution and highest number of channels connecting encoder and decoder
        self.bottleneck = self.double_conv(512, 1024)
        
        """
            Expanding Path - Decoder
            The network upsamples here the feature maps and reconstructs the segmentation mask.
            It uses transposed convolutions (or upsampling followed by convolution) and skip connections.
            The number of channels decreases with depth.
        """
        # First transposed convolution for upsampling from the bottleneck
        # Input channels = 1024, Output channels = 512. Kernel size and stride of 2 double the spatial dimensions.
        self.up_trans1 = nn.ConvTranspose2d(1024, 512, kernel_size=2, stride=2)
        # First up-convolution block after the skip connection
        # Input channels = 1024 (512 from transposed conv + 512 from skip connection), Output channels = 512
        self.up_conv1 = self.double_conv(1024, 512)
        # Second transposed convolution for upsampling
        # Input channels = 512, Output channels = 256

        self.up_trans2 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
        # Second up-convolution block after the skip connection
        # Input channels = 512 (256 from transposed conv + 256 from skip connection), Output channels = 256
        self.up_conv2 = self.double_conv(512, 256)

        # Third transposed convolution for upsampling
        # Input channels = 256, Output channels = 128
        self.up_trans3 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        # Third up-convolution block after the skip connection
        # Input channels = 256 (128 from transposed conv + 128 from skip connection), Output channels = 128
        self.up_conv3 = self.double_conv(256, 128)

        # Fourth transposed convolution for upsampling
        # Input channels = 128, Output channels = 64
        self.up_trans4 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        # Fourth up-convolution block after the skip connection
        # Input channels = 128 (64 from transposed conv + 64 from skip connection), Output channels = 64
        self.up_conv4 = self.double_conv(128, 64)
        
        # Final output layer
        # A 1x1 convolution to map the final feature maps to the number of classes
        # Input channels = 64, Output channels = num_classes
        self.out_conv = nn.Conv2d(64, num_classes, kernel_size=1)
    
    #Double convolutional block function  consists of two convolutional layers, each followed by batch normalization and ReLU activation
    def double_conv(self, in_channels, out_channels):
        """
            Double convolution block: Conv -> BatchNorm -> ReLU -> Conv -> BatchNorm -> ReLU
        """
        return nn.Sequential(
            # First convolutional layer. Kernel size 3x3, padding 1 to maintain spatial dimensions
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            # Batch normalization layer to normalize the activations, improving training stability
            nn.BatchNorm2d(out_channels),
            # ReLU activation function for non-linearity. inplace=True saves memory
            nn.ReLU(inplace=True),
            # Second convolutional layer
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            # Second batch normalization layer
            nn.BatchNorm2d(out_channels),
            # Second ReLU activation function
            nn.ReLU(inplace=True)
        )
    
    def forward(self, x):
        """
            Defines the forward pass of the U-Net model
            Args:
                x: The input tensor (image batch)
        """
        # Forward pass through Encoder
        # Apply the first double convolution block. Store the output (x1) for the skip connection
        x1 = self.down_conv1(x)
        # Apply max pooling to reduce spatial dimensions
        x2 = self.maxpool(x1)
        
        # Apply the second double convolution block. Store the output (x3) for the skip connection
        x3 = self.down_conv2(x2) 
        # Apply max pooling
        x4 = self.maxpool(x3)
        
        # Apply the third double convolution block. Store the output (x5) for the skip connection
        x5 = self.down_conv3(x4)
        # Apply max pooling
        x6 = self.maxpool(x5)
        
        # Apply the fourth double convolution block. Store the output (x7) for the skip connection
        x7 = self.down_conv4(x6)
        # Apply max pooling => the input to the bottleneck
        x8 = self.maxpool(x7)
        
        # Bottleneck - Apply the bottleneck double convolution block
        x9 = self.bottleneck(x8)
        
        # Forward pass through the Decoder 
        # Apply the first transposed convolution to upsample from the bottleneck
        x = self.up_trans1(x9)
        # Concatenate the upsampled feature map with the corresponding feature map from the encoder (x7)
        # This is the skip connection, providing high-resolution features to the decoder.
        # dim=1 means concatenating along the channel dimension
        x = torch.cat([x, x7], dim=1)  # Skip connection
        # Apply the first up-convolution block
        x = self.up_conv1(x)
        
        # Apply the second transposed convolution
        x = self.up_trans2(x)
        # Concatenate with the feature map from the encoder (x5)
        x = torch.cat([x, x5], dim=1)  # Skip connection
        # Apply the second up-convolution block
        x = self.up_conv2(x)
        
        # Apply the third transposed convolution
        x = self.up_trans3(x)
        # Concatenate with the feature map from the encoder (x3)
        x = torch.cat([x, x3], dim=1)  # Skip connection
        # Apply the third up-convolution block
        x = self.up_conv3(x)
        
        # Apply the fourth transposed convolution
        x = self.up_trans4(x)
        # Concatenate with the feature map from the encoder (x1)
        x = torch.cat([x, x1], dim=1)  # Skip connection
        # Apply the fourth up-convolution block
        x = self.up_conv4(x)
        
        # Final output => Apply the 1x1 convolution to produce the final segmentation map
        out = self.out_conv(x)
        #Return the output tensor 
        return out

#Set device for training (GPU if available otherwise use CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#Initialize U-Net model and move it to the selected device
model = UNet(NUM_CLASSES).to(device)

Combining two training losses brings together the strengths of both. Cross-Entropy scores each pixel separately, while Dice measures the overlap of whole regions. Cross-Entropy alone may under-weight small objects, because they contribute only a few pixels, whereas Dice adds extra pressure on such small elements. The combination therefore lets Dice account for small objects while Cross-Entropy keeps every pixel in play. With Cross-Entropy alone the segmentation was more disrupted and blurry, but Cross-Entropy converges faster than Dice, which helps to mark contours a bit more precisely. The combined loss keeps the main benefits of both methods and trains better on imbalanced classes.
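
A small worked example of this difference, as a sketch on made-up numbers. It is a binary simplification (one object class against background) of the multi-class losses used in the notebook, and `mean_cross_entropy` and `dice_loss` are hypothetical pure-Python helpers, not the `DiceLoss` module defined in the next cell.

```python
import math


def mean_cross_entropy(p_obj, target):
    """Mean binary cross-entropy; p_obj[i] is the predicted object probability at pixel i."""
    return -sum(math.log(p if t == 1 else 1.0 - p) for p, t in zip(p_obj, target)) / len(target)


def dice_loss(p_obj, target, smooth=1.0):
    """Soft Dice loss for the object class, with additive smoothing."""
    intersection = sum(p * t for p, t in zip(p_obj, target))
    total = sum(p_obj) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)


# 100 pixels, of which only 4 belong to the object
target = [1] * 4 + [0] * 96
# A prediction that misses the object: 1% object probability everywhere
missed = [0.01] * 100
# A prediction that finds the object
found = [0.99] * 4 + [0.01] * 96

for name, pred in [("missed", missed), ("found", found)]:
    print(f"{name}: CE={mean_cross_entropy(pred, target):.3f}  Dice loss={dice_loss(pred, target):.3f}")
```

Missing the 4-pixel object entirely costs little in mean Cross-Entropy, because 96 of the 100 pixels are still classified correctly, while the Dice loss for the object class stays close to its maximum.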

In [8]:
class DiceLoss(nn.Module):
    """
    Used as loss function especially when there is class imbalance
    It measures the similarity between the predicted segmentation and the ground truth mask
    Args:
        smooth => A small value added to the numerator and denominator to prevent division by zero
        ignore_index => Class index to ignore in the loss calculation for example invalid regions
    """
    def __init__(self, smooth=1, ignore_index=255):
        super(DiceLoss, self).__init__()

        self.smooth = smooth # Store the smoothing value
        self.ignore_index = ignore_index # Store the index to ignore

    def forward(self, pred, target):
        """
        Forward pass for the Dice Loss calculation
        Args:
            pred => The predicted segmentation logits, shape (N, C, H, W)
            target => The ground truth segmentation mask, shape (N, H, W)
        """
        # Apply softmax to the predicted logits to get probabilities for each class
        # dim=1 means applying softmax across the channel dimension
        pred = torch.softmax(pred, dim=1)
        # Get the number of classes from the prediction tensor
        num_classes = pred.shape[1]
        
        # Create a mask that is True for valid pixels and False for pixels labelled ignore_index
        mask = (target != self.ignore_index)
        # Set ignored pixels to class 0 so that one_hot receives valid class indices;
        # their contribution is removed again below
        target = target * mask.long()
        
        # Convert the target mask to one-hot encoding
        # Create a binary tensor where for each pixel, only the channel corresponding to the
        # ground truth class is 1, and others are 0
        target_onehot = torch.nn.functional.one_hot(target, num_classes=num_classes).permute(0,3,1,2)
        
        # Zero out ignored pixels in both prediction and target so they are not counted as background
        mask = mask.unsqueeze(1).float()
        pred = pred * mask
        target_onehot = target_onehot * mask
        
        # Calculate the intersection between the predicted probabilities and the one-hot target
        # Sum across the spatial dimensions (height and width) to get the intersection for each class in each batch
        intersection = (pred * target_onehot).sum(dim=(2,3))
        # Sum of predicted and target mass per class (the Dice denominator, not a set union)
        union = pred.sum(dim=(2,3)) + target_onehot.sum(dim=(2,3))
        # Calculate the Dice coefficient for each class in each batch
        # Add 'smooth' to numerator and denominator to avoid division by zero
        dice = (2. * intersection + self.smooth) / (union + self.smooth)

        # Return the mean Dice loss (1 - Dice coefficient) averaged across all classes and batches.
        return 1 - dice.mean()

"""
Define a combined loss function that is a weighted sum of Cross-Entropy Loss and Dice Loss.
Using a combination of loss functions can often lead to better performance, especially
for segmentation tasks with class imbalance. Cross-Entropy focuses on individual pixel classification,
while Dice Loss focuses on the overall overlap of segmentation regions
"""
class CombinedLoss(nn.Module):
    """
    Args:
        weight => Class weights for Cross-Entropy Loss to handle class imbalance
        alpha => Weighting factor for the Cross-Entropy Loss (1 - alpha is the weight for Dice Loss)
    """
    def __init__(self, weight=None, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        # Initialize the Cross-Entropy Loss
        self.ce_loss = nn.CrossEntropyLoss(weight=weight)
        # Initialize the Dice Loss
        self.dice_loss = DiceLoss()
        
    def forward(self, pred, target):
        """
        Define the forward pass for the Combined Loss calculation
        pred => The predicted segmentation map 
        target => The ground truth segmentation mask
        """
        # Calculate the Cross-Entropy Loss
        ce = self.ce_loss(pred, target)
        # Calculate the Dice Loss
        dice = self.dice_loss(pred, target)
        # Return the weighted sum of the two losses
        return self.alpha * ce + (1 - self.alpha) * dice
    
"""
    Class weighting for imbalanced datasets
    Class weighting assigns higher importance to less frequent classes during training,
    helping the model learn to segment them better
"""
# Calculate the total count of pixels belonging to each class across all training masks
# np.concatenate joins all mask arrays into a single array
# np.bincount counts the occurrences of each non-negative integer value;
# minlength guarantees one entry per class even if a class is absent from the training split
class_counts = np.bincount(np.concatenate([m.flatten() for m in train_df["seg"]]), minlength=NUM_CLASSES)
# Calculate initial class weights as the inverse of class counts (clamped to avoid division by zero)
class_weights = 1.0 / torch.tensor(class_counts, dtype=torch.float32).clamp(min=1)
# Normalize the class weights so they sum to 1, ensuring that the overall scale of the weighted loss is consistent
class_weights = class_weights / class_weights.sum()
# Initialize the combined loss function with calculated class weights and an alpha value
criterion = CombinedLoss(weight=class_weights.to(device), alpha=0.5)
"""
    Adam optimization algorithm  adapts the learning rate for each parameter
    Args:
        model.parameters()=> specifies which parameters of the model should be optimized
        lr => learning rate, controls the step size during optimization
        weight_decay => L2 regularization term, helps prevent overfitting by penalizing large weights
"""
#Select the optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
# Choose a learning rate scheduler to adjust the learning rate during training
# 'min' =>  Monitor a metric that should be minimized (validation loss)
# patience => Number of epochs with no improvement after which the learning rate will be reduced
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=3)


In [None]:
#Training loop
# Lists to store training and validation losses for plotting
train_losses = []
val_losses = []
# Number of epochs to train the model => An epoch is one full pass through the training dataset
num_epochs = 50
for epoch in range(num_epochs):
    # Set the model to training mode
    model.train()
    # Initialize running loss for the current epoch
    running_loss = 0.0
    # Iterate over batches in the training data loader
    for imgs, masks in train_loader:
        # Move images and masks to the selected device
        imgs, masks = imgs.to(device), masks.to(device)
        
        # Zero the gradients of the model parameters
        # Gradients are accumulated by default, so this is necessary to prevent
        # gradients from previous iterations affecting the current update
        optimizer.zero_grad()
        # Perform a forward pass to get model predictions for the current batch of images
        outputs = model(imgs) 
        # Calculate the loss using the defined criterion and the predictions and ground truth masks
        loss = criterion(outputs, masks)
        # Perform backpropagation - calculate gradients of the loss with respect to the model parameters
        loss.backward()
        # Update the model parameters using the calculated gradients and the optimizer
        optimizer.step()
        
        # Accumulate the loss for the current epoch
        running_loss += loss.item()
    epoch_loss = running_loss/len(train_loader)
    train_losses.append(epoch_loss)

    # Validation phase
    # Set the model to evaluation mode.
    model.eval()
    # Initialize validation loss for the current epoch
    val_loss = 0.0
    # Disable gradient calculation during validation
    with torch.no_grad():
        # Iterate over batches in the validation data loader
        for imgs, masks in val_loader:
            # Move images and masks to the selected device
            imgs, masks = imgs.to(device), masks.to(device)
            # Perform a forward pass to get model predictions
            outputs = model(imgs)
            # Calculate the loss on the validation data
            loss = criterion(outputs, masks)
            # Accumulate the validation loss
            val_loss += loss.item()
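    # Calculate the average validation loss for the epoch
    val_loss /= len(val_loader)
    val_losses.append(val_loss)
    # Pass the validation loss to the scheduler; without this call ReduceLROnPlateau never lowers the learning rate
    scheduler.step(val_loss)
    print(f"Epoch {epoch+1}/{num_epochs} - Train Loss: {epoch_loss:.4f} - Val Loss: {val_loss:.4f}")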


Epoch 1/50 - Train Loss: 2.0168 - Val Loss: 1.9658


In [None]:
# Plotting losses
plt.figure(figsize=(10, 6))
# Plot the training loss over epochs
plt.plot(range(1, num_epochs+1), train_losses, 'b-o', label='Training Loss', linewidth=2, markersize=8)
# Plot the validation loss over epochs
plt.plot(range(1, num_epochs+1), val_losses, 'r-o', label='Validation Loss',linewidth=2, markersize=8)
plt.title('Training and Validation Loss over Epochs', fontsize=14)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.xticks(range(1, num_epochs+1))
# Show the legend so the 'Training Loss' and 'Validation Loss' labels are visible
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()

In [None]:
#Visualize predictions on one batch of the given dataloader
def show_predictions(model, dataloader, num_show):
    # Set the model to evaluation mode.
    model.eval()
    imgs, masks = next(iter(dataloader))
    imgs, masks = imgs.to(device), masks.to(device)

    # Disable gradient calculation for predictions
    with torch.no_grad():
        preds = model(imgs)
        preds = torch.argmax(preds, dim=1)

    #Convert to numpy for visualization
    imgs_np = imgs.cpu().numpy()
    masks_np = masks.cpu().numpy()
    preds_np = preds.cpu().numpy()

    # Reverse normalization applied during preprocessing to display the images correctly
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    imgs_np = imgs_np.transpose(0, 2, 3, 1)
    imgs_np = imgs_np * std + mean
    imgs_np = np.clip(imgs_np, 0, 1)

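    #Plot results
    # Respect the requested number of samples, limited by the batch size
    num_show = min(num_show, len(imgs))
    # squeeze=False keeps axs two-dimensional even when num_show == 1
    _, axs = plt.subplots(num_show, 3, figsize=(15, 5*num_show), squeeze=False)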

    # Iterate through the selected number of samples
    for i in range(num_show):
        axs[i, 0].imshow(imgs_np[i])
        axs[i, 0].set_title("Input Image")
        axs[i, 0].axis('off')
        
        axs[i, 1].imshow(masks_np[i], vmin=0, vmax=NUM_CLASSES-1, cmap='jet')
        axs[i, 1].set_title("Ground Truth")
        axs[i, 1].axis('off')
        
        axs[i, 2].imshow(preds_np[i], vmin=0, vmax=NUM_CLASSES-1, cmap='jet')
        axs[i, 2].set_title("Prediction")
        axs[i, 2].axis('off')

    plt.tight_layout()
    plt.show()

show_predictions(model, val_loader, 3)


In [None]:
"""
    Define dataset for the test data. It loads the images and tracks their original sizes
"""
class TestDataset(Dataset):
    """
        Load images with basic transformations including resizing and store the original image size
    """
    def __init__(self,df,target_size=(128,128)):
        """
        Initialize with the test dataframe and the target size for resizing
        """
        # Store the target image data from dataframe
        self.images = df["img"].values
        # Store the target size for resizing
        self.target_size = target_size
        # Define transformation for the image similar to the training dataset but without augmentation
        self.img_transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485,0.456,0.406],
                                 std=[0.229,0.224,0.225])])
    # Return total number of images 
    def __len__(self): 
        return len(self.images)
    
    # Get a single test image and its original size at given index
    def __getitem__(self,idx):
        # Convert the numpy array to a PIL Image object for applying transforms
        img = Image.fromarray(self.images[idx].astype(np.uint8))
        # Store original dimensions before resizing; PIL reports size as (W, H)
        original_size = img.size[::-1] # [::-1] reverses it to (H, W)
        img = TF.resize(img, self.target_size, interpolation = InterpolationMode.BILINEAR)
        # Return the processed image tensor and the original size as tensor
        return self.img_transform(img), torch.tensor(original_size)

#Create DataLoader for the TestDataset
test_loader=DataLoader(TestDataset(test_df), batch_size=8, shuffle=False)

In [None]:
# Set model to evaluation mode
model.eval()
# Store in list the predicted segmentation masks and classification labels for all test images
all_segmentations = []
all_labels = []

# Disable gradient calculations during inference
with torch.no_grad():
    # Iterate through batches of test images using the test loader
    for imgs, original_sizes in test_loader:
        imgs = imgs.to(device)
        # Perform the forward pass through the model to get the raw output logits
        logits = model(imgs)
        # Get predicted class index for each pixel by finding the index with the maximum logit value
        masks_small=logits.argmax(dim=1)
        # Process each predicted mask in the current batch individually
        for mask, (H,W) in zip(masks_small.cpu(), original_sizes):
            # Upsample the predicted mask from the target size back to the original image size
            mask = TF.resize(mask.unsqueeze(0).float(), (int(H), int(W)), interpolation=InterpolationMode.NEAREST).squeeze(0).to(torch.uint8).numpy()
            # Store the predicted segmentation mask
            all_segmentations.append(mask)
            #Determine presence of each class -> result is a list of booleans for each class
            present = [(mask == (cls + 1)).any() for cls in range(len(labels))]
            # Store the list of class presence flags 
            all_labels.append(present)

# Assign the list of predicted segmentation masks to the seg column
test_df["seg"] = all_segmentations
# Update class label columns - iterate through each class label index
for i, lab in enumerate(labels):
    # For each image, get the presence flag for the current class
    test_df[lab] = [int(p[i]) for p in all_labels]


The results leave considerable room for improvement. The overall shape of some classes is mostly preserved, but class imbalance can cause many artifacts in the segmentation. The model can often pick out the object, but it does not segment the background reliably. Both the training and the validation loss mostly decrease; the decrease slows down after some point, and there is no drastic overfitting within this range of epochs.

## Transfer Learning

In [None]:
!pip install crfseg
!pip install git+https://github.com/lucasb-eyer/pydensecrf.git

In [None]:
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import albumentations as A
from PIL import Image
from typing import Tuple
from tqdm import tqdm, trange
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.utils.data import ConcatDataset, DataLoader, Dataset
import torchvision.transforms as transforms
import torchvision.transforms as T
import torchvision.models.segmentation as models

import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax, create_pairwise_bilateral

torch.cuda.empty_cache()

In [None]:
BATCH_SIZE = 16
NUM_WORKERS = 0
EPOCH = 70
N_frozen = 3
LR = 1e-5
LR_FROZEN = 1e-4

In [None]:
def get_device():
    if torch.cuda.is_available():
        return torch.device('cuda')
    elif torch.backends.mps.is_available():
        return torch.device('mps')
    elif hasattr(torch, 'xla') and torch.xla.device_count() > 0:
        return torch.device('xla')
    else:
        return torch.device('cpu')

device = get_device()

### Data preparation
A customized dataset for training

In [None]:
class VOC2009Dataset(Dataset):
    def __init__(self, dataframe, transform=None, target_transform=None, paired_transform=None, ignore_label=21):
        self.df = dataframe.reset_index()
        self.transform = transform
        self.target_transform = target_transform
        self.paired_transform = paired_transform
        
        self.ignore_label = ignore_label
        self.classes = 22  # 20 classes + background + void

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx) -> Tuple[Tensor, Tensor]:
        image = self.df.iloc[idx]['img'] # np.array
        mask = self.df.iloc[idx]['seg']   

        image = Image.fromarray(image.astype(np.uint8))  
        mask = Image.fromarray(mask.astype(np.uint8))    

        if self.paired_transform:
            image, mask = self.paired_transform(image, mask)
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            mask = self.target_transform(mask)

        return image, mask # (C, H, W), (H, W)

### Transforms
First, images and masks are resized together to (256, 256) and then augmented with geometric and photometric transformations to reduce overfitting. Images are then normalized with the ImageNet mean and standard deviation, which is the standard procedure for the pretrained models used here. A `paired_transform` without augmentation is also defined for the validation and test datasets. When resizing, the image is interpolated bilinearly while nearest-neighbour interpolation is applied to the mask, because image pixel values are continuous whereas mask pixel values are discrete labels (21 classes plus the void label 255).<br>

Augmentation was observed to mitigate overfitting considerably.
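
To illustrate why the mask needs nearest-neighbour interpolation, the following minimal sketch upsamples a one-dimensional row of labels in plain Python. The helpers `resize_nearest` and `resize_linear` are hypothetical and not part of the notebook or of Albumentations; they only mimic the two interpolation modes on a toy row.

```python
def resize_nearest(row, new_len):
    """Upsample a 1-D row by copying the nearest source value."""
    scale = len(row) / new_len
    return [row[min(int((i + 0.5) * scale), len(row) - 1)] for i in range(new_len)]


def resize_linear(row, new_len):
    """Upsample a 1-D row by linear interpolation between neighbours."""
    scale = len(row) / new_len
    out = []
    for i in range(new_len):
        # Position of the output sample in source coordinates (pixel centres).
        x = min(max((i + 0.5) * scale - 0.5, 0.0), len(row) - 1)
        lo = int(x)
        hi = min(lo + 1, len(row) - 1)
        w = x - lo
        out.append((1 - w) * row[lo] + w * row[hi])
    return out


# A toy mask row: background (0) next to class 15.
labels = [0, 0, 15, 15]

nearest = resize_nearest(labels, 8)
linear = resize_linear(labels, 8)

print(nearest)  # only the original labels 0 and 15 appear
print(linear)   # contains in-between values that are not valid labels
```

Linear interpolation produces values between 0 and 15 at the class boundary, which would be read as unrelated classes, whereas nearest-neighbour interpolation only ever copies labels that already exist.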

In [None]:
# Albumentations pipelines operate on NumPy arrays; the apply_* wrappers below convert from and to PIL images
paired_transform = A.Compose(
    [
        A.Resize(
            256, 256,
            interpolation=cv2.INTER_LINEAR, # Linear interpolation for continuous values
            mask_interpolation=cv2.INTER_NEAREST, # Nearest neighbours for discrete (label) values
        ),
    ],
    additional_targets={'mask': 'mask'}
)

paired_transform_aug = A.Compose(
    [
        A.Resize(
            256, 256,
            interpolation=cv2.INTER_LINEAR,
            mask_interpolation=cv2.INTER_NEAREST,
        ),

        # geometric flips & rotations
        A.HorizontalFlip(p=0.5),

        # small random affine (shift/scale/rotate)
        A.ShiftScaleRotate(
            shift_limit=0.0625,  # up to ±6.25% shift
            scale_limit=0.1,     # up to ±10% zoom
            rotate_limit=15,     # up to ±15°
            interpolation=cv2.INTER_LINEAR,
            mask_interpolation=cv2.INTER_NEAREST,
            p=0.5
        ),

        # elastic / grid warps for shape variation
        A.ElasticTransform(
            alpha=1, sigma=50, alpha_affine=50,
            interpolation=cv2.INTER_LINEAR,
            mask_interpolation=cv2.INTER_NEAREST,
            p=0.2
        ),
        A.GridDistortion(
            distort_limit=0.3,
            interpolation=cv2.INTER_LINEAR,
            mask_interpolation=cv2.INTER_NEAREST,
            p=0.2
        ),

        # photometric changes (image-only)
        A.GaussNoise(var_limit=(10.0, 50.0), p=0.2),
        A.RandomBrightnessContrast(p=0.2),
    ],
    additional_targets={'mask': 'mask'}
)

# Inputs PIL image, outputs tensor (C, H, W)
image_transform = transforms.Compose([
    transforms.ToTensor(), # Changes to channel first, (C, H, W)
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std =[0.229, 0.224, 0.225]), # Normalize for three channels
])

# Inputs PIL image, outputs tensor (H, W)
mask_transform = transforms.Compose([
    transforms.Lambda(lambda x: torch.tensor(np.array(x), dtype=torch.long)),
    transforms.Lambda(lambda x: torch.where(x == 255, 21, x)), # Map void labels to 21 for one-hot encoding later
])

def apply_paired_transform(image: Image.Image, mask: Image.Image) -> Tuple[Image.Image, Image.Image]:
    image_np = np.array(image)
    mask_np  = np.array(mask)
    resized = paired_transform(image=image_np, mask=mask_np)
    image_resized = Image.fromarray(resized['image'])
    mask_resized  = Image.fromarray(resized['mask'])
    
    return image_resized, mask_resized # (H, W, 3), (H, W)

def apply_paired_transform_aug(image: Image.Image, mask: Image.Image) -> Tuple[Image.Image, Image.Image]:
    image_np = np.array(image)
    mask_np  = np.array(mask)
    resized = paired_transform_aug(image=image_np, mask=mask_np)
    image_resized = Image.fromarray(resized['image'])
    mask_resized  = Image.fromarray(resized['mask'])
    
    return image_resized, mask_resized # (H, W, 3), (H, W)

Split the data into train and val datasets for training and validation.<br>
No augmentation is used for the validation dataset.

In [None]:
def split_dataframe(df: pd.DataFrame, val_split=0.2, random_state=42) -> Tuple[pd.DataFrame, pd.DataFrame]:
    df = df.reset_index()
    
    train_df, val_df = train_test_split(
        df,
        test_size=val_split,
        random_state=random_state,
        shuffle=True
    )
    
    train_df = train_df.reset_index(drop=True)
    val_df = val_df.reset_index(drop=True)
    
    return train_df, val_df

train_df, val_df = split_dataframe(train_df)

train_dataset = VOC2009Dataset(
    dataframe=train_df,
    transform=image_transform,
    target_transform=mask_transform,
    paired_transform=apply_paired_transform_aug
    )

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS)

val_dataset = VOC2009Dataset(
    dataframe=val_df,
    transform=image_transform,
    target_transform=mask_transform,
    paired_transform=apply_paired_transform
)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS)

### Early Stopping
A customized EarlyStopping wrapper.

In [None]:
class EarlyStopping:
    def __init__(self, patience=5, delta=1e-4, verbose=False):
        self.patience = patience
        self.delta = delta # Minimum improvement
        self.verbose = verbose
        self.best_score = None
        self.early_stop = False
        self.counter = 0
        self.best_loss = float('inf')

    def __call__(self, val_loss, model):
        score = -val_loss  # Convert to negative if minimizing loss

        if self.best_score is None:
            self.best_score = score
            self.best_loss = val_loss
            self.save_checkpoint(val_loss, model)
        elif score < self.best_score + self.delta:
            self.counter += 1
            if self.verbose:
                print(f'EarlyStopping counter: {self.counter}/{self.patience}')
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            self.save_checkpoint(val_loss, model)
            self.counter = 0

    def save_checkpoint(self, val_loss, model):
        if self.verbose:
            print(f'Validation loss decreased ({self.best_loss:.4f} --> {val_loss:.4f}). Saving model...')
        torch.save(model.state_dict(), 'segmentation_transferlearning_checkpoint.pt')
        self.best_loss = val_loss

### DiceLoss
A customized DiceLoss for transfer learning for segmentation.<br>
When the loss is calculated, all pixels whose label is 21 (void, mapped from 255 during preprocessing) are ignored. This is done by zeroing out the void pixels in both the one-hot target and the predictions with a mask tensor. This is more logical than simply mapping void pixels to background or to another class. The implementation returns both the average DICE loss and the DICE score per class.

Explanation of DICE:<br>
It is quite similar to IoU. The DICE coefficient is twice the overlap between the prediction and the ground-truth mask, divided by the sum of their sizes. A smoothing term is added to both the numerator and the denominator to avoid division by zero. The DICE loss is 1 - DICE coefficient.
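
As a small worked example of the coefficient, the sketch below computes it for one class on two flattened binary masks in plain Python. The function `dice_coefficient` and the toy masks are illustrative assumptions, not the `DiceLoss` module used for training.

```python
def dice_coefficient(pred, target, smooth=1.0):
    """Dice = (2 * overlap + smooth) / (sum(pred) + sum(target) + smooth)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + smooth) / (total + smooth)


# Flattened binary masks for a single class (1 = pixel belongs to the class).
pred   = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]

dice = dice_coefficient(pred, target, smooth=0.0)
iou = 2 / 4  # overlap of 2 pixels, union of 4 pixels, for comparison

print(dice)      # 2 * 2 / (3 + 3)
print(iou)       # Dice is never lower than IoU on the same masks
print(1 - dice)  # the corresponding Dice loss
```

With a positive `smooth`, an empty prediction on an empty target gives a coefficient of 1 instead of a division by zero.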

In [None]:
class DiceLoss(nn.Module):
    def __init__(self, smooth=1, ignore_index=21):
        super(DiceLoss, self).__init__()
        self.smooth = smooth
        self.ignore_index = ignore_index

    def forward(self, pred, target) -> Tuple[torch.Tensor, torch.Tensor]:
        # pred: [B, C, H, W], probabilities, not logits
        # target: [B, H, W]
        num_classes = pred.size(1) + 1
        mask = (target != self.ignore_index).float() 
        mask = mask.unsqueeze(1) # [B, 1, H, W]

        target = torch.nn.functional.one_hot(target.long(), num_classes=num_classes)  # [batch_size, height, width, num_classes]
        target = target.squeeze(1).permute(0, 3, 1, 2) # squeeze(1) is a no-op for [B, H, W] targets; result is [batch_size, num_classes, height, width]
        
        # Apply mask to target
        mask_target = mask.expand_as(target)  # [batch_size, 22, height, width]
        target = target * mask_target  # Zero out ignored pixels
        target = target[:, :-1] # [batch_size, 21, height, width]
        
        # Apply mask to predictions
        mask_pred = mask.expand_as(pred) # [batch_size, 21, height, width]
        pred = pred * mask_pred 

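        # Flatten predictions and targets to [pixels, classes]; the class axis must be moved
        # last first, otherwise view() would mix values from different classes
        pred = pred.permute(0, 2, 3, 1).reshape(-1, pred.size(1))  # [batch_size * height * width, num_classes]
        target = target.permute(0, 2, 3, 1).reshape(-1, target.size(1))  # [batch_size * height * width, num_classes]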
        
        # Compute Dice coefficient for each class
        intersection = (pred * target).sum(dim=0)  # Sum over pixels for each class
        union = pred.sum(dim=0) + target.sum(dim=0)  # Sum over pixels for each class
        dice = (2. * intersection + self.smooth) / (union + self.smooth + 1e-8)  # Dice score per class
        
        # Return 1 - mean Dice score as loss
        return 1 - dice.mean(), dice

### CEDiceLoss
A loss combining unweighted cross entropy and DICE loss, used for training.
The parameter `alpha` sets the relative weight of the two terms: DICE is weighted by `alpha` and CE by `1 - alpha`.

Explanation of cross entropy:<br>
CE = - sum over classes of p_truth * log(p_pred) <br>
Because p_truth is one-hot, this reduces to -log of the predicted probability of the true label. Minimizing it therefore maximizes the probability that each pixel is predicted as its true label, averaged over all non-void pixels.
A more advanced variant is weighted CE, which assigns different weights to different classes, where the weights depend on the proportion of each class in the dataset. It can help with unbalanced classes.
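
As a small worked example of the per-pixel cross entropy and its weighted variant, the sketch below uses plain Python. The function `cross_entropy`, the probabilities and the class weights are made up for illustration; the training code relies on `nn.CrossEntropyLoss` instead.

```python
import math


def cross_entropy(probs, label, class_weights=None):
    """Per-pixel cross entropy: -w[label] * log(probs[label])."""
    weight = 1.0 if class_weights is None else class_weights[label]
    return -weight * math.log(probs[label])


# Softmax outputs for two pixels over three classes, with their true labels.
pixels = [
    ([0.7, 0.2, 0.1], 0),  # confident and correct
    ([0.6, 0.3, 0.1], 2),  # the true class 2 only gets probability 0.1
]

plain = [cross_entropy(p, y) for p, y in pixels]
# Hypothetical weights that up-weight the rare class 2.
weighted = [cross_entropy(p, y, class_weights=[1.0, 1.0, 3.0]) for p, y in pixels]

print(plain)     # the misclassified pixel contributes the larger loss
print(weighted)  # and three times more once its class is up-weighted
```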

In [None]:
class CEDiceLoss(nn.Module):
    def __init__(self, smooth=1, ignore_index=21, alpha=0.5):
        super(CEDiceLoss, self).__init__()
        self.alpha = alpha
        self.diceloss_fn = DiceLoss(smooth, ignore_index)
        self.celoss_fn = nn.CrossEntropyLoss(ignore_index=ignore_index)
        
    def forward(self, pred, target):
        '''
        pred should be logits instead of probabilities here.
        pred [B, C, H, W], target [B, H, W]
        '''
        pred_probs = F.softmax(pred, dim=1)
        # Call the modules directly instead of .forward() so that nn.Module hooks still run
        diceloss, _ = self.diceloss_fn(pred_probs, target)
        target_ce = target.squeeze(1)
        celoss = self.celoss_fn(pred, target_ce)
        return self.alpha * diceloss + (1 - self.alpha) * celoss

### Trainer
It wraps the training and validation process and visualizes the losses at the end.<br>
The backbone is frozen during the first few training epochs because it already holds pretrained parameters.<br>
Fine-tuning without freezing it first risks disturbing these parameters and makes the pretraining less useful.<br>

In [None]:
class SegmentationTrainer:
    def __init__(self, model, optimizer_head, scheduler_head, criterion, train_loader, val_loader,
                 device, early_stopping=None, freeze_epoch=None, 
                 optimizer_full=None, scheduler_full=None):

        self.model = model.to(device)
        self.crit = criterion
        self.tr_dl = train_loader
        self.val_dl = val_loader
        self.device = device

        self.optimizer_head = optimizer_head
        self.scheduler_head = scheduler_head
        self.optimizer_full = optimizer_full
        self.scheduler_full = scheduler_full

        self.es = early_stopping
        self.freeze_epoch = freeze_epoch
        self.history = {'train_loss':[], 'val_loss':[]}

        self.scheduler = scheduler_head
        self.opt = optimizer_head

    def train_epoch(self, epoch):
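        self.model.train()
        # model.train() also puts the backbone's BatchNorm layers back into training mode,
        # so switch them to eval again for as long as the backbone is frozen
        if self.freeze_epoch is not None and epoch <= self.freeze_epoch:
            for m in self.model.backbone.modules():
                if isinstance(m, nn.BatchNorm2d):
                    m.eval()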
        total_loss = 0
        for imgs, masks in tqdm(self.tr_dl, desc=f"Train {epoch}"): # imgs and masks are tensors
            # imgs [B, C, H, W], masks [B, H, W]
            imgs, masks = imgs.to(self.device), masks.to(self.device)
            logits = self.model(imgs)['out']
            loss = self.crit(logits, masks)

            self.opt.zero_grad()
            loss.backward()
            self.opt.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(self.tr_dl)
        self.history['train_loss'].append(avg_loss)
        print(f'Training loss: {avg_loss}')
        return avg_loss

    def validate(self):
        self.model.eval()
        total_loss = 0
        with torch.no_grad():
            for imgs, masks in tqdm(self.val_dl, desc="Val"):
                imgs, masks = imgs.to(self.device), masks.to(self.device)
                logits  = self.model(imgs)['out']
                loss = self.crit(logits, masks)
                total_loss += loss.item()
        avg_loss = total_loss / len(self.val_dl)
        self.history['val_loss'].append(avg_loss)
        print(f'Val loss: {avg_loss}')
        return avg_loss

    def unfreeze_check(self, epoch):
        if epoch == self.freeze_epoch:
            for p in self.model.backbone.parameters():
                p.requires_grad = True
            # set BatchNorm back to train()
            for m in self.model.backbone.modules():
                if isinstance(m, nn.BatchNorm2d):
                    m.train()

            self.scheduler = self.scheduler_full
            self.opt = self.optimizer_full

    def fit(self, epochs: int, checkpoint_path='segmentation_transferlearning_checkpoint.pt'):
        best_loss = float('inf')
        for epoch in range(1, epochs+1):
            train_loss = self.train_epoch(epoch)
            val_loss = self.validate()

            # early stopping
            if self.es:
                self.es(val_loss, self.model)
                if self.es.early_stop:
                    print("Early stopping.")
                    break

            # scheduler step
            if self.scheduler:
                self.scheduler.step(val_loss)

            # unfreeze logic
            self.unfreeze_check(epoch)

        # load the best checkpoint written by EarlyStopping onto the training device
        self.model.load_state_dict(torch.load(checkpoint_path, map_location=self.device))
        return self.history
    
    def visualize_losses(self, save_path=None):
        epochs = list(range(1, len(self.history['train_loss']) + 1))
        train_loss = self.history['train_loss']
        val_loss = self.history['val_loss']

        plt.figure(figsize=(10, 6))
        plt.plot(epochs, train_loss, label='Training Loss', marker='o')
        plt.plot(epochs, val_loss, label='Validation Loss', marker='s')
        plt.xlabel('Epoch')
        plt.ylabel('Loss')
        plt.title('Training and Validation Loss Over Epochs')
        plt.legend()
        plt.grid(True)

        if save_path:
            plt.savefig(save_path)
            print(f"Plot saved to {save_path}")

        plt.show()
        plt.close()

### Trainer and Model Choice
The following code uses DeepLabV3 with a ResNet50 backbone, trained with the CE + DICE loss. Various other models and loss functions were tried but gave lower scores, evaluated with the DICE coefficient. The attempted combinations are listed below.<br>
(All weights in CE were carefully tuned) <br>

First, DeepLabV3 was used as the baseline to pick the best training loss function. <br>
(Model - Loss function - DICE coefficient)<br>
DeepLabV3 without transfer learning (Baseline) - 0.900 <br>
DeepLabV3 - CE - 0.912 <br>
DeepLabV3 - Weighted CE - 0.870 <br>
DeepLabV3 - DICE - 0.937<br>
DeepLabV3 - CE + DICE - 0.934, slightly worse than DICE alone, but it combines a region-based and a pixel-wise criterion, so it is chosen<br>
DeepLabV3 - Weighted CE + DICE - 0.917 <br>
Therefore, CE + DICE is chosen as the loss function <br>

After selecting the loss function, various models were tried to pick the best one.<br>
(Model - DICE coefficient)<br>
DeepLabV3 + MobileNetV2 - 0.867<br>
LinkNet + ResNet18 - 0.744<br>
UNet + ResNet34 - 0.867<br>
FPN + ResNet50 - 0.867<br>
PSP + ResNet50 - 0.767<br>
DeepLabV3 + ResNet50 - 0.927<br>
DeepLabV3 + ResNet101 - 0.920<br>
DeepLabV3Plus + ResNet50 - 0.800<br>
FCN + ResNet50 - 0.915<br>
Therefore, DeepLabV3 + ResNet50 is chosen for the task.<br>
During this phase, simpler models than DeepLabV3 were also tried because overfitting was observed with DeepLabV3 + ResNet50.

Code for these models is not shown in the notebook, as the models are simply imported from packages and running all of them would take a very long time.<br>

### DeepLabV3 Explanation
The key feature of DeepLabV3 is atrous (dilated) convolution, which skips pixels between kernel taps so that the receptive field grows without losing as much information as pooling does. The convolutions in the last few layers of ResNet50 are replaced with atrous convolutions, and ResNet50 produces a feature map. This feature map is then processed in parallel by a 1\*1 convolution, three atrous convolutions with different rates that capture features at different scales, and an image-pooling branch that compresses the whole feature map to a single pixel to represent global context. The pooled pixel is upsampled with bilinear interpolation to the size of the other branch outputs, and all channels are concatenated. The concatenated channels pass through a 1\*1 convolution that fuses information across channels. The whole structure works as an encoder.
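
To make the effect of the rate concrete, here is a minimal one-dimensional sketch in plain Python. The function `dilated_conv1d` is a hypothetical helper (correlation-style, as in deep learning frameworks) and not code from DeepLabV3; it only shows that the same three weights cover a wider span of the input as the rate grows.

```python
def dilated_conv1d(signal, kernel, rate):
    """Valid 1-D convolution whose kernel taps are spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # receptive field of one output value
    return [
        sum(w * signal[start + k * rate] for k, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = list(range(10))  # 0, 1, ..., 9
kernel = [1, 1, 1]        # three weights in every case

for rate in (1, 2, 4):
    span = (len(kernel) - 1) * rate + 1
    print(rate, span, dilated_conv1d(signal, kernel, rate))
```

With rates 1, 2 and 4 the receptive field of a single output is 3, 5 and 9 input samples, while the number of weights stays at three.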

### DeepLabV3Plus Explanation
DeepLabV3Plus is more commonly used than DeepLabV3. It is an encoder-decoder structure with DeepLabV3 as the encoder. It concatenates the output of DeepLabV3 with low-level features extracted by the early convolutional layers of ResNet50 and then fuses them. The result then passes through a few small convolutional layers to produce the final output.

### Why choose DeepLabV3 not Plus
We actually expected DeepLabV3Plus to perform better than DeepLabV3 because of its lightweight decoder. We think DeepLabV3Plus is worse here because it can only be imported from segmentation_models_pytorch, whose models are pretrained on ImageNet, whereas the models imported from torchvision.models.segmentation are pretrained on COCO. We think COCO is closer to the distribution of VOC2009 used in this project.

In [None]:
model = models.deeplabv3_resnet50(pretrained=True, num_classes=21) 
criterion = CEDiceLoss(alpha=0.8)

The following code is a customized model we also tried. It appends a CRF layer to DeepLabV3 so that the CRF weights can be trained. The result is slightly worse than the current model (0.927 vs 0.935). It may have introduced unnecessary complexity given that DeepLabV3 is already overfitting, and the CRF layer may fail to capture the distribution of pixel values.

In [None]:
# class DeepLabWithCRF(nn.Module):
#     def __init__(self, num_classes):
#         super().__init__()
#         self.deeplab = models.deeplabv3_resnet50(
#             pretrained=True,
#             num_classes=num_classes
#         )
#         self.crf = CRF(n_spatial_dims=2)

#     def forward(self, x):
#         out_dict = self.deeplab(x)
#         logits = out_dict['out']           
#         refined = self.crf(logits)          # [B, C, H, W], log-prob’s
#         refined = F.softmax(refined, dim=1)
#         eps = 1e-6
#         logits = torch.log(refined.clamp(min=eps))
#         return logits
    
# model = DeepLabWithCRF(21)
# criterion = CEDiceLoss(alpha=0.8)

# for param in model.deeplab.backbone.parameters():
#     param.requires_grad = False
# for m in model.deeplab.backbone.modules():
#     if isinstance(m, nn.BatchNorm2d):
#         m.eval()

# head_params = filter(lambda p: p.requires_grad, model.parameters())
# optimizer_head = torch.optim.Adam([
#     {'params': head_params, 'lr': LR_FROZEN},
# ], lr=LR_FROZEN)
# scheduler_head = torch.optim.lr_scheduler.ExponentialLR(optimizer_head, gamma=0.8)

# optimizer_full = torch.optim.Adam([
#     {'params': model.deeplab.backbone.parameters(), 'lr': LR},   
#     {'params': model.deeplab.classifier.parameters(), 'lr': LR},
# ])
# scheduler_full = torch.optim.lr_scheduler.ExponentialLR(optimizer_full, gamma=0.8)

# early_stopping = EarlyStopping(patience=5, verbose=1)

### Optimizers and Schedules
The commented-out version is an older, worse setup that we used. It often left val_loss stuck on a plateau. AdamW applies decoupled weight decay, which provides regularization. ReduceLROnPlateau reduces the learning rate only when the validation loss stops improving, instead of reducing it continuously, which sometimes leads to a learning rate that is too small.
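
The difference between the two schedules can be sketched in plain Python. Both functions below are simplified stand-ins written for this illustration (the real PyTorch schedulers also handle thresholds and cooldown), and the validation losses are made up.

```python
def exponential_lr(lr0, gamma, epochs):
    """Learning rate after each epoch when it is multiplied by gamma every epoch."""
    return [lr0 * gamma ** (e + 1) for e in range(epochs)]


def plateau_lr(lr0, factor, patience, val_losses):
    """Simplified plateau rule: cut the rate once the loss has failed to
    improve for more than `patience` consecutive epochs."""
    lr, best, bad_epochs, history = lr0, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Made-up validation losses: steady improvement, then a plateau.
val_losses = [1.0, 0.9, 0.8, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6, 0.55]

exp_hist = exponential_lr(1e-4, 0.8, len(val_losses))
plat_hist = plateau_lr(1e-4, 0.8, 3, val_losses)
print(exp_hist[-1])   # has decayed every epoch
print(plat_hist[-1])  # reduced only once, after the plateau
```

After ten epochs the exponential schedule has shrunk the rate by a factor of roughly ten, while the plateau rule has reduced it a single time.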

In [None]:
# for param in model.backbone.parameters():
#     param.requires_grad = False
# for m in model.backbone.modules():
#     if isinstance(m, nn.BatchNorm2d):
#         m.eval()

# head_params = filter(lambda p: p.requires_grad, model.parameters())
# optimizer_head = torch.optim.Adam([
#     {'params': head_params, 'lr': LR_FROZEN},
# ], lr=LR_FROZEN)
# scheduler_head = torch.optim.lr_scheduler.ExponentialLR(optimizer_head, gamma=0.8)

# optimizer_full = torch.optim.Adam([
#     {'params': model.backbone.parameters(), 'lr': LR},   
#     {'params': model.classifier.parameters(), 'lr': LR},
# ])
# scheduler_full = torch.optim.lr_scheduler.ExponentialLR(optimizer_full, gamma=0.8)

# early_stopping = EarlyStopping(patience=5, verbose=1)

The following code is used for calculating weights for weighted CE but it isn't chosen.
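
The weighting rule in that cell is `1 / ln(c + p)`, where `p` is the fraction of pixels belonging to a class. A minimal sketch of how it behaves, with a hypothetical helper `enet_weight` and made-up pixel fractions:

```python
import math


def enet_weight(p, c=1.02):
    """Class weight 1 / ln(c + p) for a class covering a fraction p of all pixels."""
    return 1.0 / math.log(c + p)


# Hypothetical pixel fractions: a dominant background and a rare object class.
for name, p in [("background", 0.70), ("rare class", 0.01)]:
    print(name, round(enet_weight(p), 3))
```

Rare classes receive much larger weights than frequent ones, and with `c = 1.02` the weights stay bounded between roughly 1.4 (for `p = 1`) and 50.5 (for `p = 0`).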

In [None]:
# def compute_enet_weights(dataset, num_classes, c=1.02, decimals=3):
#     counts = np.zeros(num_classes, dtype=np.float64)
#     total = 0
#     for _, mask in dataset:
#         m = np.array(mask)
#         for cls in range(num_classes):
#             counts[cls] += (m == cls).sum()
#         total += m.size
#     p = counts / total
#     weights = 1.0 / np.log(c + p)
#     return torch.tensor(np.round(weights, decimals), dtype=torch.float32)

# class_weights = compute_enet_weights(train_dataset, 21)

In [None]:
for param in model.backbone.parameters():
    param.requires_grad = False
for m in model.backbone.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.eval()

head_param = filter(lambda p: p.requires_grad, model.parameters())
optimizer_head = AdamW(
    head_param,
    lr=LR_FROZEN
)
scheduler_head = ReduceLROnPlateau(
    optimizer_head,
    mode='min',      
    factor=0.8,     
    patience=3,       
    verbose=True
)

optimizer_full = AdamW([
    {'params': model.backbone.parameters(), 'lr': LR},   
    {'params': model.classifier.parameters(), 'lr': LR},
])
scheduler_full = ReduceLROnPlateau(
    optimizer_full,
    mode='min',
    factor=0.8,
    patience=3,
    verbose=True
)

early_stopping = EarlyStopping(patience=10, verbose=1)

In [None]:
trainer = SegmentationTrainer(
    model = model,  
    optimizer_head = optimizer_head,
    scheduler_head = scheduler_head,
    criterion = criterion,
    train_loader = train_dataloader, 
    val_loader = val_dataloader, 
    device = device,
    early_stopping = early_stopping,
    freeze_epoch = 5,
    optimizer_full = optimizer_full,
    scheduler_full = scheduler_full
)

In [None]:
trainer.fit(EPOCH)
trainer.visualize_losses()

### Post processing
In this section, two post-processing techniques are tried: a dense conditional random field and per-class classification thresholds, which act like the decision threshold that is varied when computing a ROC curve.

### DCRF (Dense conditional random field)

A DenseCRF refines the noisy, per-pixel label scores produced by a segmentation network by defining a global energy that combines those unary predictions with pairwise terms encouraging pixels that are both close in space and similar in color to share the same label. Inference is performed approximately via a mean-field algorithm that iteratively updates each pixel’s label distribution based on the entire image, using efficient high-dimensional filtering to propagate information in linear time. After a handful of iterations, the result is a segmentation with sharply defined boundaries and minimal isolated errors, fully aligned with the image’s natural edges.  
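
Assuming the standard fully connected CRF formulation, which is what pydensecrf is understood to implement, the energy described above can be written as follows. The symbols are introduced here for illustration: $P(x_i)$ is the softmax probability of label $x_i$ at pixel $i$, $p_i$ its position and $I_i$ its colour.

$$E(\mathbf{x}) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j), \qquad \psi_u(x_i) = -\log P(x_i)$$

$$\psi_p(x_i, x_j) = [x_i \neq x_j]\; w \exp\left(-\frac{\lVert p_i - p_j \rVert^2}{2\theta_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\theta_\beta^2}\right)$$

In the code cell below only this bilateral (appearance) term is used, with $\theta_\alpha = 20$ (`sdims`), $\theta_\beta = 13$ (`schan`) and $w = 21$ (`compat`).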

In [None]:
def crf_refine_probs(probs: Tensor, img: Tensor, n_iters: int = 5):
    probs = probs.detach().cpu().numpy()
    img = img.detach().cpu()

    C, H, W = probs.shape
    U = unary_from_softmax(probs)
    d = dcrf.DenseCRF2D(W, H, C)
    d.setUnaryEnergy(U)
    
    mean = torch.tensor([0.485, 0.456, 0.406])[:, None, None]
    std  = torch.tensor([0.229, 0.224, 0.225])[:, None, None]

    img_unnorm = img * std + mean  

    img_uint8 = (img_unnorm.clamp(0,1) * 255).byte()  
    img_np = img_uint8.permute(1, 2, 0).numpy()  

    feats = create_pairwise_bilateral(
        sdims=(20, 20), schan=(13,13,13),
        img=img_np, chdim=2
    )
    d.addPairwiseEnergy(feats, compat=21)

    Q = d.inference(n_iters)                     

    refined_probs = np.array(Q).reshape((C, H, W))
    return refined_probs

In [None]:
def post_process(probs_batch, imgs_batch) -> Tensor:
    processed_probs_batch = []
    for (probs, img) in zip(probs_batch, imgs_batch):
        processed_probs = crf_refine_probs(probs, img)
        processed_probs_batch.append(processed_probs)

    # Stack into one array first; torch.tensor() on a list of arrays is slow
    processed_probs_batch = torch.from_numpy(np.stack(processed_probs_batch)).float()
    return processed_probs_batch

### Binary Mask
The binary mask below applies a separate classification threshold to each label class at every pixel. A class is only considered a candidate for a pixel if its probability exceeds the threshold of that class. The final prediction is the candidate with the highest probability. This increases the DICE score by about 0.004 on the validation set.
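
For a single pixel, the rule can be sketched in plain Python as follows. The function `pick_label` and all numbers are hypothetical; the fallback to class 0 when no class passes its threshold is an assumption of this sketch.

```python
def pick_label(probs, thresholds):
    """Return the most probable class among those exceeding their own threshold.

    Falls back to class 0 (background) when no class passes, which is an
    assumption of this sketch.
    """
    candidates = [c for c, (p, t) in enumerate(zip(probs, thresholds)) if p > t]
    if not candidates:
        return 0
    return max(candidates, key=lambda c: probs[c])


# One pixel, three classes; the values are made up for illustration.
probs = [0.45, 0.40, 0.15]
thresholds = [0.50, 0.30, 0.10]

plain_argmax = max(range(len(probs)), key=lambda c: probs[c])
print(plain_argmax)                   # 0
print(pick_label(probs, thresholds))  # 1
```

Class 0 has the highest probability but does not reach its threshold of 0.50, so the pixel is assigned to class 1, the most probable of the remaining candidates.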

In [None]:
def apply_thresholds_single_label(probs: torch.Tensor, thresholds: torch.Tensor) -> torch.Tensor:
    """
    Compare all label channels, and channels with predicted probability < threshold are ignored.
    Returns a (B, H, W) integer tensor of class indices.
    """
    B, C, H, W = probs.shape
    th = thresholds.view(1, -1, 1, 1).to(probs.device)  # follow the input's device instead of a global
    mask = probs > th                        

    masked_probs = probs.clone()
    masked_probs[~mask] = -1.0              
    idx = masked_probs.argmax(dim=1)       
    return idx

This computes the DICE score for a single class. It is used to tune the per-class thresholds.
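
Written out, the quantity the function below computes per image is the smoothed Dice coefficient, where $P$ is the set of pixels predicted for the class, $G$ the ground-truth pixels and $\epsilon$ the `eps` argument; the function then averages it over the batch. The pixel counts in the worked line are made up for illustration.

```latex
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G| + \epsilon}{|P| + |G| + \epsilon}
\qquad\text{e.g. } |P| = 3,\ |G| = 4,\ |P \cap G| = 2
\;\Rightarrow\; \mathrm{Dice} \approx \frac{2 \cdot 2}{3 + 4} = \frac{4}{7} \approx 0.571
```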

In [None]:
def dice_single_channel(pred_mask: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    pred_f = pred_mask.view(pred_mask.size(0), -1).float()
    gt_f = gt_mask.view(gt_mask.size(0), -1).float()
    inter = (pred_f * gt_f).sum(dim=1)
    denom = pred_f.sum(dim=1) + gt_f.sum(dim=1)
    dice = (2 * inter + eps) / (denom + eps)
    return dice.mean()

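Grid search for the per-class thresholds: for each class, every candidate threshold in `[0, 1]` is scored with `dice_single_channel` and the best one is kept.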

In [None]:
def tune_thresholds(preds: torch.Tensor, target: torch.Tensor, num_steps: int = 100) -> torch.Tensor:
    B, C, H, W = preds.shape
    target_onehot = torch.stack([(target == c).to(torch.uint8) for c in range(C)], dim=1) # (B, H, W) to (B, C, H, W)

    thresholds = torch.zeros(C, device=preds.device)
    grid = torch.linspace(0, 1, num_steps, device=preds.device)

    for c in trange(C, desc="Tuning thresholds for each class"):
        best_dice = -1.0
        best_thres = 0.0
        target_c = target_onehot[:, c] # (B, H, W)
        pred_c = preds[:, c] # (B, H, W)
        for thres in grid:
            prediction = (pred_c > thres).to(torch.uint8)
            dice_score  = dice_single_channel(prediction, target_c)
            if dice_score > best_dice:
                best_dice = dice_score
                best_thres = thres
        thresholds[c] = best_thres
    
    return thresholds

### Evaluation

In [None]:
all_probs = []
all_targets = []

model.eval()
with torch.no_grad():
    with tqdm(val_dataloader, desc="Validating", unit="batch") as pbar:
        for images, masks in pbar:
            images = images.to(device)
            logits = model(images)['out']
            probs = F.softmax(logits, dim=1)
            # Thresholds are tuned on the raw softmax output (the same input the
            # test loop thresholds), so the costly, unused CRF pass is skipped here.
            all_probs.append(probs.cpu())
            all_targets.append(masks.cpu())

all_probs = torch.cat(all_probs, dim=0)        
all_targets = torch.cat(all_targets, dim=0)        
best_thresholds = tune_thresholds(all_probs, all_targets, num_steps=100)

print(f'Tuned thresholds: {best_thresholds}')

In [None]:
def DICE_evaluator(model, val_dataloader, device):
    dice_loss = DiceLoss(smooth=1)
    
    voc_classes = [
        "background",
        "aeroplane", "bicycle", "bird", "boat", "bottle",
        "bus", "car", "cat", "chair", "cow",
        "diningtable", "dog", "horse", "motorbike", "person",
        "pottedplant", "sheep", "sofa", "train", "tvmonitor",
    ]

    model.eval()
    total_loss = 0.0
    total_loss_per_class = 0.0
    num_batches = 0

    with torch.no_grad():
        with tqdm(val_dataloader, desc=f"Validating", unit="batch") as pbar:
            for images, masks in pbar:
                images = images.to(device)
                masks = masks.to(device)
                logits = model(images)['out'] # Tensor [B, C, H, W]
                probs  = F.softmax(logits, dim=1) # Tensor [B, C, H, W]
                refined_probs = post_process(probs, images)
                refined_probs = refined_probs.to(device)
                preds = apply_thresholds_single_label(refined_probs, best_thresholds)
                # preds = probs.argmax(dim=1)
                preds_onehot = F.one_hot(preds, num_classes=21)   # [B, H, W, C]
                preds_onehot = preds_onehot.permute(0, 3, 1, 2).float()
                loss, loss_per_class = dice_loss(preds_onehot, masks)
                total_loss += loss.item()
                total_loss_per_class += loss_per_class

                num_batches += 1
        
    avg_loss = total_loss / num_batches
    avg_loss_per_class = total_loss_per_class / num_batches

    avg_loss_per_class_dict = {}
    for name, score in zip(voc_classes, avg_loss_per_class):
        avg_loss_per_class_dict[name] = float(score.cpu().numpy())

    print(f'Final average DICE score: {1 - avg_loss}, \n average DICE loss per class:\n {avg_loss_per_class_dict}')

    return avg_loss_per_class_dict

In [None]:
loss_dict = DICE_evaluator(trainer.model, val_dataloader, device)
classes = list(loss_dict.keys())
losses  = list(loss_dict.values())

plt.figure(figsize=(12, 5))
bars = plt.bar(classes, losses)
plt.xticks(rotation=90)
plt.ylabel("Dice Loss")
plt.title("Per-Class Dice Loss")
plt.tight_layout()
plt.show()

### Visualization

In [None]:
def visualize_segmentation(image: Tensor, mask: Tensor, pred: Tensor):
    # Convert tensors to NumPy
    image = image.permute(1, 2, 0).cpu().numpy()  # Convert to HWC
    mask = mask.cpu().numpy()
    pred = pred.cpu().numpy()
    
    # Ensure mask and pred are 2D (H, W)
    if mask.ndim > 2:
        mask = mask.squeeze()
    if pred.ndim > 2:
        pred = pred.squeeze()
    
    # Initialize RGB images for masks
    height, width = mask.shape
    colored_mask = np.zeros((height, width, 3), dtype=np.uint8)
    colored_pred = np.zeros((height, width, 3), dtype=np.uint8)
    
    # Get the viridis colormap
    cmap = plt.get_cmap('viridis')
    norm = mcolors.Normalize(vmin=0, vmax=20)  # Scale for labels [0, 20]
    
    # Map class indices to colors for ground truth and prediction
    for class_idx in np.unique(np.concatenate([mask, pred])):
        if class_idx <= 20:
            # Convert normalized colormap value to RGB (0-255)
            color = cmap(norm(class_idx))[:3]  # Get RGB (ignore alpha)
            color = (np.array(color) * 255).astype(np.uint8)
            colored_mask[mask == class_idx] = color
            colored_pred[pred == class_idx] = color
        elif class_idx == 255:
            # Void label mapped to white, consistent with original visualize_segmentation
            colored_mask[mask == class_idx] = (255, 255, 255)
            colored_pred[pred == class_idx] = (255, 255, 255)
        else:
            print(f"Warning: Class index {class_idx} not in expected range [0, 20] or 255. Using black.")
            colored_mask[mask == class_idx] = (0, 0, 0)
            colored_pred[pred == class_idx] = (0, 0, 0)
    
    # Visualize
    plt.figure(figsize=(15, 5))
    
    plt.subplot(1, 3, 1)
    plt.title("Input Image")
    plt.imshow(image)  # May need denormalization if normalized
    plt.axis('off')
    
    plt.subplot(1, 3, 2)
    plt.title("Ground Truth")
    plt.imshow(colored_mask)
    plt.axis('off')
    
    plt.subplot(1, 3, 3)
    plt.title("Prediction")
    plt.imshow(colored_pred)
    plt.axis('off')
    
    plt.show()
    
    # Return colored masks as PIL Images
    return Image.fromarray(colored_mask), Image.fromarray(colored_pred)

images, masks = next(iter(val_dataloader))
images = images.to(device)
masks = masks.to(device)
with torch.no_grad():
    logits = model(images)['out'] # Tensor [B, C, H, W]
    probs  = F.softmax(logits, dim=1) # Tensor [B, C, H, W]
    refined_probs = post_process(probs, images)
    preds = torch.argmax(refined_probs, dim=1)
    preds = torch.where(preds == 21, 255, preds)

visualize_segmentation(images[1], masks[1], preds[1])

### TestDataset
Custom test dataset.<br>
It also returns the original (untransformed) image, which the test loop passes to the CRF post-processing step and uses to resize the predictions back to the original resolution. No augmentation is applied to the test set.

In [None]:
class VOC2009TestDataset(Dataset):
    def __init__(self, dataframe, transform=None, target_transform=None, paired_transform=None, ignore_label=21):
        self.df = dataframe.reset_index()
        self.transform = transform
        self.target_transform = target_transform
        self.paired_transform = paired_transform
        
        self.ignore_label = ignore_label
        self.classes = 22  # 20 classes + background + void

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx) -> Tuple[Tensor, Tensor, np.ndarray]:
        image = self.df.iloc[idx]['img'] # np.array
        ori_img = image.copy().transpose((2, 0, 1))
        mask = self.df.iloc[idx]['seg']   

        image = Image.fromarray(image.astype(np.uint8))  
        mask = Image.fromarray(mask.astype(np.uint8))    

        if self.paired_transform:
            image, mask = self.paired_transform(image, mask)
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            mask = self.target_transform(mask)

        return image, mask, ori_img

In [None]:
test_dataset = VOC2009TestDataset(
    dataframe=test_df,
    transform=image_transform,
    target_transform=mask_transform,
    paired_transform=apply_paired_transform
    )

test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=NUM_WORKERS)

preds = []

with torch.no_grad():
    with tqdm(test_dataloader, desc=f"Predicting", unit="batch") as pbar:
        for image, mask, ori_img in pbar: # tensors [1, C, H, W], [1, H, W], [1, C, H, W]
            image = image.to(device)

            logits = model(image)['out'] # Tensor [1, C, H, W]
            logits_orig = F.interpolate(
                logits, size=(ori_img.shape[2], ori_img.shape[3]),
                mode='bilinear', align_corners=False
            ) # [1, C, H, W]

            probs_orig = F.softmax(logits_orig, dim=1)   # [1, C, H, W]
            refined = post_process(
                probs_orig.cpu(), # [1, C, H, W]
                ori_img # [1, C, H, W]                        
            )

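            # NOTE: the thresholds are applied to the unrefined probabilities;
            # the CRF output `refined` computed above is not used for the prediction.
            pred = apply_thresholds_single_label(probs_orig, best_thresholds).squeeze(0)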
            pred = pred.to(torch.uint8).cpu().numpy()
            pred = np.where(pred == 21, 255, pred).astype(np.uint8)
            preds.append(pred)

In [None]:
if not isinstance(pred, Tensor):
    pred = torch.from_numpy(pred)

visualize_segmentation(ori_img.squeeze(0), pred, pred)

In [None]:
test_df["seg"] = preds

## Submit to competition
You don't need to edit this section. Just use it at the right position in the notebook. See the definition of this function in Sect. 1.3 for more details.

In [None]:
generate_submission(test_df)

# 4. Adversarial attack
For this part, your goal is to fool your classification and/or segmentation CNN, using an *adversarial attack*. More specifically, the goal is to build a CNN to perturb test images in a way that (i) they look unperturbed to humans; but (ii) the CNN classifies/segments these images in line with the perturbations.

# 5. Discussion
Finally, take some time to reflect on what you have learned during this assignment. Reflect and produce an overall discussion with links to the lectures and "real world" computer vision.



To implement the from-scratch CNN, we used many concepts introduced early in the Deep Learning lectures; applying convolution steps across multiple layers is described explicitly towards the end of Lecture 9. We used the hyperparameter explanation in Lecture 10 as a guide for how many layers to use for a given image size and for what to tweak in our tuner. Even so, our CNN has comparatively few layers compared with the pre-made CNN architectures. Its overall performance was not great, as you can see when visualising its good and bad predictions; it seemed to struggle most with images that carry multiple labels.
However, as the number of epochs and layers increased, its predictions improved, which helps explain why the pre-made CNNs use so many layers to reach such precise predictions. The AUC of the self-made CNN topped out at around 0.85, whereas the pre-made model reached [PREDICTION HERE].

We used AUC as our performance metric because the dataset is imbalanced: AUC measures how well the model ranks positive examples above negative ones, rather than rewarding raw accuracy, which a classifier can inflate simply by always predicting the majority class.
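
To make the difference between accuracy and AUC on imbalanced data concrete, here is a minimal pure-Python sketch. The data, the helper names `roc_auc` and `accuracy`, and the 0.5 decision threshold are all assumptions for illustration; they are not taken from our classification pipeline.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at a fixed decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Made-up imbalanced data: 2 positives among 20 samples.
labels = [1, 1] + [0] * 18

# Model A gives every sample the same low score (it has learned nothing).
constant_scores = [0.1] * 20
# Model B ranks both positives above every negative, but below the 0.5 cut-off.
ranking_scores = [0.4, 0.3] + [0.2] * 18

print(accuracy(labels, constant_scores), roc_auc(labels, constant_scores))  # 0.9 0.5
print(accuracy(labels, ranking_scores), roc_auc(labels, ranking_scores))    # 0.9 1.0
```

Both models reach the same accuracy because they predict the majority class everywhere, yet AUC separates the uninformative model (0.5) from the one that ranks the positives correctly (1.0).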


For the from-scratch segmentation model (without transfer learning) we selected UNet, which Lecture 11 mentions as one of the techniques suited to segmentation. Before that, we tested some classic convolutional networks with different settings, such as the number of layers and the loss function. The final model was trained with a combination of two losses, cross-entropy and Dice loss. The median-frequency class weights somewhat limited the disappearance of classes, but there is a clear tendency towards over-segmentation (Lecture 6): the method sometimes finds many elements or points of other classes in the background around the target object, as is visible around the segmented motorbike. Under-segmentation is also visible but occurs less frequently. The from-scratch solution does not reach very good results, but it can segment some objects more or less accurately.
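
The median-frequency weighting mentioned above can be sketched as follows. This is a simplified version that assumes class frequencies are taken over all pixels and that every class occurs at least once; the function name `median_frequency_weights` and the pixel counts are hypothetical and not taken from our training code.

```python
import numpy as np

def median_frequency_weights(pixel_counts: np.ndarray) -> np.ndarray:
    """weight_c = median(freq) / freq_c, with freq_c the share of pixels in class c."""
    freq = pixel_counts / pixel_counts.sum()
    return np.median(freq) / freq

# Hypothetical pixel counts for 3 classes: a dominant background and two rarer objects.
counts = np.array([8000.0, 1500.0, 500.0])
weights = median_frequency_weights(counts)
print(weights.round(4))  # approximately [0.1875, 1.0, 3.0]
```

Frequent classes get a weight below 1 and rare classes a weight above 1, so rare classes contribute more to a weighted cross-entropy loss and are less likely to vanish from the predictions.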

For transfer learning in segmentation, we studied several architectures before exploring them, including FCN, UNet and DeepLabV3. We also practised transfer learning, which is the most commonly used training method. At first we forgot to freeze the initial layers, which led to a much worse result. We also found post-processing techniques helpful: DCRF sharpens segmentation boundaries, which makes the results more trustworthy. Two common application scenarios for segmentation are autonomous driving, where segmenting lanes, objects, etc. helps the vehicle's control system make decisions, and the medical domain, for example tooth segmentation from an oral CT scan.

When evaluating the results, we observed both over-segmentation and under-segmentation. For example, for an image of an electric bike several labels are predicted for the bike, and for an image of a woman sitting on a couch many areas are identified as background. The concepts of over- and under-segmentation are introduced in Lecture 6, and the (D)CRF algorithm in Lecture 7. Lectures 10 and 11 helped us better understand the neural networks used here; in particular, (weighted) cross-entropy and ResNet are introduced in Lecture 11.

