# Deep Learning Course Project
## Students:
* ### Filippo Momesso - filippo.momesso@studenti.unitn.it
* ### Thomas De Min - thomas.demin@studenti.unitn.it

## Introduction
This year's (A.Y. 2021/2022) Deep Learning course project focuses on Unsupervised Domain Adaptation (UDA). We were provided with a UDA dataset consisting of two domains: <u>Real World</u> and <u>Product</u>. The objective is to "propose a UDA technique to counteract the negative impact of the domain gap".

### Dataset
The [Adaptiope](https://ieeexplore.ieee.org/document/9423412) dataset contains 123 classes, but only 20 of them are investigated here. The selected classes are balanced: each one has 100 samples per domain. As requested by the assignment, we used an 80/20 train/test split.
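
As a quick sanity check of these numbers, the sketch below computes the per-domain split sizes. It assumes 100 images per class in each domain and the truncating rule `int(ratio * total)` that the notebook uses later; the helper name `split_sizes` is ours and only illustrative.

```python
NUM_CLASSES = 20          # classes selected for the assignment
SAMPLES_PER_CLASS = 100   # assumed images per class, per domain
SPLIT_RATIO = 0.8         # fraction of each domain used for training


def split_sizes(num_classes, samples_per_class, ratio):
    """Return (total, train, test) sizes for a single domain."""
    total = num_classes * samples_per_class
    n_train = int(ratio * total)  # truncate, as done in the split cell
    n_test = total - n_train      # the remainder goes to the test set
    return total, n_train, n_test


total, n_train, n_test = split_sizes(NUM_CLASSES, SAMPLES_PER_CLASS, SPLIT_RATIO)
print(total, n_train, n_test)  # prints: 2000 1600 400
```

The 400 test images per domain agree with the `support` totals in the classification reports further down.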

### Delivery
To deliver a competitive solution, we dug into the literature on Unsupervised Domain Adaptation and evaluated some of the recent techniques on the provided dataset.

We decided to deliver:
* A **baseline** implementation: a ResNet18 fine-tuned on the source domain and tested on the target domain, used to establish lower and upper bounds on the accuracy;
* The implementation of the [**Contrastive Adaptation Network (CAN)**](https://arxiv.org/abs/1901.00976), one of the state-of-the-art techniques in the field and the most promising approach for the Adaptiope dataset;
* Our **improved version** of CAN.

[Here](https://wandb.ai/229356_229298/DL2022_229356_229298) you can find our Weights & Biases project with all the plots we show, plus additional metrics and information, such as some wrongly predicted images.

> Note: In all these scenarios a ResNet18 has been employed as backbone model.

### Experiments
For each approach we performed the UDA experiment in both directions, one at a time.
* For the **baseline** approach we trained on Products and tested on the Real World test set, in order to get a lower bound accuracy for the domain adaptation task, and vice versa for Real World to Products. We also trained the network on each domain's training set and tested it on the corresponding test set in order to get the upper bound accuracy.
* For the implementation of **CAN** and our further **improvements** we proceeded in a similar way, but we omitted the computation of the upper bound.
  
Since each approach has its own requirements, the number of epochs differs: we trained all approaches for 30 epochs, except vanilla CAN, which required 60 epochs to converge. As the best model we kept the one with the highest accuracy on the test set. For reproducible results, train and test splits are handled by a Generator with seed 0.
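
The idea behind the seeded split can be illustrated without PyTorch. The sketch below is a plain-Python analogue, not the notebook's actual `torch.utils.data.random_split` call: it shuffles indices with a seeded generator and shows that two calls with the same seed give the same partition. The function name `seeded_split` is hypothetical.

```python
import random


def seeded_split(n_items, ratio, seed):
    """Shuffle indices with a seeded RNG and cut them into train/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # private RNG, global state untouched
    n_train = int(ratio * n_items)
    return indices[:n_train], indices[n_train:]


train_a, test_a = seeded_split(2000, 0.8, seed=0)
train_b, test_b = seeded_split(2000, 0.8, seed=0)

# Same seed, same partition: the two runs are identical.
print(train_a == train_b and test_a == test_b)
```

In the notebook a single `torch.Generator` is shared by both splits, so its state advances after the first call. As far as we can tell, reproducing the same splits therefore also requires re-creating the generator and running the split cell in the same order.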

> Note: We would have liked to train the other approaches for 60 epochs as well, but it was too expensive in terms of GPU time.

### Requirements
* We expect the dataset to be loaded from a directory named "datasets" in the root of Google Drive. The file must be named `Adaptiope.zip` (default name), so the dataset is read from the path `/content/drive/MyDrive/datasets/Adaptiope.zip`.
* By default, weights are not saved throughout the epochs. If you would like to save them, provide a directory named "weights" in the root of Google Drive and set the save flag to `True` before running the training function. To load the stored weights, set the variable `weights=/path/to/weights` in the training loop function.

> Note: Weights saving can be enabled only for CAN and CAN improvements.

## Notebook Initialization
This section will include imports and installation of python packages, data related operations, model and optimizer definition and other useful procedures.

### Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Preparation of the notebook
Install wandb, [spherecluster](https://github.com/rfayat/spherecluster) (an implementation of Spherical K-Means used to implement CAN) and a specific version of scikit-learn (required for spherecluster to work).

In [None]:
!pip install wandb --quiet
!pip install scikit-learn==0.24.2 --quiet # To use spherecluster https://stackoverflow.com/a/68182958/17566218
                                          # this breaks the dependency with yellowbrick but it is needed
!pip install git+https://github.com/rfayat/spherecluster.git@scikit_update --quiet

Import all required Python modules, log in to wandb and define global constants.

We also check whether "cuda" is available and set the variable `device` accordingly.

In [None]:
#  imports
import os
import shutil
import copy
import wandb
import time
import torch
import torchvision
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as transforms
import torch.nn.functional as F

from tqdm import tqdm
from tqdm.notebook import tqdm_notebook
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torch.utils.data.dataset import Subset
from torch.utils.data.sampler import BatchSampler
from sklearn.decomposition import KernelPCA
from sklearn.mixture import GaussianMixture
from sklearn.manifold import TSNE
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from spherecluster.spherical_kmeans import SphericalKMeans

# log in to wandb
#%env WANDB_MODE=disabled
wandb.login()

# set device globally
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Selected Device: {device}")

# used in order to always have the same dataset
generator = torch.Generator().manual_seed(0)

# define constants
NUM_CLASSES = 20 # required in the assignment
BATCH_SIZE = 256 # empiric limit of Tesla K80 Colab GPU's
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
SPLIT_RATIO = 0.8
BETA = 0.3 # scaling of CDD loss as described in CAN paper
DEFAULT_EPOCHS = 10 # default number of epochs for training
DEFAULT_MOMENTUM = 0.9
LR_SCHEDULER_A = 10 # default 'a' value for the sgd scheduler as described in CAN paper
LR_SCHEDULER_B = 0.75 # default 'b' value for the sgd scheduler as described in CAN paper
DEFAULT_OPTIM = 'adam' # default optimizer
DEFAULT_NUM_WORKERS = 2 # empiric limit of colab as number of processes to use
DEFAULT_D_0 = 0.05 # default distance threshold (D_0) between a sample and its
                   # cluster centroid, used to decide whether the sample is taken
                   # into consideration for CDD loss. set as described in CAN paper
DEFAULT_N_0 = 3 # default minimum number of samples (actually N_0+1) that satisfy the distance
                # constraint (D_0) in order for a class to be taken into consideration
                # for CDD loss. set as described in CAN paper

### Unzip full dataset in ```/content```
By unzipping directly in the ```/content``` directory we do not have to remove the zipped file afterwards. The result is the full dataset downloaded and unzipped in the working directory.
> Note: ```-qq``` suppresses the log output.

In [None]:
# Quietly unzip the dataset
!unzip -qq /content/drive/MyDrive/datasets/Adaptiope.zip 

### Store a subset of Adaptiope
Load the 20 selected classes for the assignment and store the subset of the Dataset.
> Code adapted from the provided one.

In [None]:
classes = ["backpack", "bookcase", "car jack", "comb", "crown", "file cabinet", "flat iron", "game controller", "glasses",
           "helicopter", "ice skates", "letter tray", "monitor", "mug", "network switch", "over-ear headphones", "pen",
           "purse", "stand mixer", "stroller"]

for d, td in zip(["Adaptiope/product_images", "Adaptiope/real_life"], ["adaptiope_small/product_images", "adaptiope_small/real_life"]):
    for c in tqdm(classes):
        c_path = os.path.join(d, c)
        c_target = os.path.join(td, c)
        shutil.copytree(c_path, c_target)

shutil.rmtree('Adaptiope')

### Load and split dataset
Create two datasets, one per domain, as required for the UDA task:
1. Products;
2. Real.

Here we also resize the images to match the ResNet18 input dimensions and normalize them with the ImageNet mean and std.
As the CAN authors did, we decided not to use data augmentation, so that the results can be explained only by the architecture and training procedures we employed.
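
To make the normalization step concrete, here is a minimal plain-Python sketch of what `transforms.Normalize` does to a single RGB pixel, together with the two-step inverse used in the next cell to display un-normalized images. The helper names `normalize` and `denormalize` and the sample pixel are illustrative assumptions; the real transforms operate on whole tensors.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize(pixel, mean, std):
    """Per-channel (value - mean) / std, the operation Normalize applies."""
    return [(v - m) / s for v, m, s in zip(pixel, mean, std)]


def denormalize(pixel, mean, std):
    """Invert normalize() with two Normalize-style steps."""
    # Step 1: Normalize(mean=0, std=1/std) multiplies each channel by std.
    scaled = [(v - 0.0) / (1.0 / s) for v, s in zip(pixel, std)]
    # Step 2: Normalize(mean=-mean, std=1) adds the mean back.
    return [(v - (-m)) / 1.0 for v, m in zip(scaled, mean)]


pixel = [0.2, 0.5, 0.9]  # hypothetical RGB values in [0, 1]
restored = denormalize(normalize(pixel, IMAGENET_MEAN, IMAGENET_STD),
                       IMAGENET_MEAN, IMAGENET_STD)
print(restored)  # equal to `pixel` up to floating-point error
```

Composing the inverse out of two `Normalize` calls avoids writing a custom transform, at the cost of being slightly less readable.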

In [None]:
# Create transformation sequence
#   - Resize to match resnet input dimensions
#   - Transform into a Tensor
#   - Normalize with ImageNet mean and std.
transformation_seq = [
    transforms.Resize((224, 224)),           # Same input size of resnet
    transforms.ToTensor(),                   # convert PIL to pytorch Tensor
    transforms.Normalize(mean=IMAGENET_MEAN,
                         std=IMAGENET_STD)   # normalize with ImageNet mean and std
]
transformations = transforms.Compose(transformation_seq)

# Load datasets and apply transformations
products = torchvision.datasets.ImageFolder('/content/adaptiope_small/product_images', transformations)
reals = torchvision.datasets.ImageFolder('/content/adaptiope_small/real_life', transformations)

# Select a random index and show the corresponding image from each dataset
idx = random.randrange(len(products))  # randrange excludes the upper bound, so idx is always valid
f, axarr = plt.subplots(1, 2)

# Transform to show unnormalized images
invTransform = transforms.Compose([ transforms.Normalize(mean = [ 0., 0., 0. ],
                                                     std = [1/item for item in IMAGENET_STD]),
                                    transforms.Normalize(mean = [-item for item in IMAGENET_MEAN],
                                                     std = [ 1., 1., 1. ]),
])

# Permute in order to shift channels in last dimension
axarr[0].imshow(invTransform(products[idx][0]).permute(1, 2, 0))
axarr[1].imshow(invTransform(reals[idx][0]).permute(1, 2, 0))

Split both datasets into train and test sets. We used a **ratio** of 0.8 as requested by the assignment.

In [None]:
assert len(products) == len(reals), "Products and reals are not the same length"

len_ds = len(products)  # Length of the dataset (products or reals)
len_tr = int(SPLIT_RATIO * len_ds)  # compute train size
len_ts = len_ds - len_tr  # compute test size

# split product dataset
train_products, test_products = torch.utils.data.random_split(products, [len_tr, len_ts], generator=generator)
# split real dataset
train_real, test_real = torch.utils.data.random_split(reals, [len_tr, len_ts], generator=generator)

print(f"Training Size: {len_tr} - Test Size: {len_ts}")

### Create Dataloaders
Create a dataloader for each domain and split. The number of workers is set to 2 as suggested by a `UserWarning`.

In [None]:
# Create a dataloader for each domain and split
# num_workers set to 2 as suggested by UserWarning
tr_dl_products = DataLoader(train_products, BATCH_SIZE, shuffle=True, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)
ts_dl_products = DataLoader(test_products, BATCH_SIZE, shuffle=False, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)
tr_dl_real = DataLoader(train_real, BATCH_SIZE, shuffle=True, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)
ts_dl_real = DataLoader(test_real, BATCH_SIZE, shuffle=False, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)

### Network definition
Definition of a custom `nn.Module` in order to download the pretrained weights of the ResNet18 and to substitute the last linear layer to match the number of classes.

We decided to aggregate everything into a single class so that the instantiation was cleaner.

In this cell we also define a snippet that lets us hook the activations of the network, which is useful to compute the CDD loss.

In [None]:
# extract activations from network
# this snippet is taken from the link below
# https://discuss.pytorch.org/t/how-can-l-load-my-best-model-as-a-feature-extractor-evaluator/17254/6
activation = {}
def get_activation(name):
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook

class ResNet18(nn.Module):
    def __init__(self, out_dim):
        super(ResNet18, self).__init__()
        self.num_classes = out_dim
        # Get resnet18 pretrained
        self.backbone = models.resnet18(pretrained=True)
        # get number of output feature from the penultimate layer
        num_ftrs = self.backbone.fc.in_features
        # replace last Linear layer:
        #   - input number of feature of previous layer
        #   - output dimension equal to number of classes taken 
        #     into consideration for the assignment
        self.backbone.fc = nn.Linear(num_ftrs, self.num_classes)

    def forward(self, x):
        x = self.backbone(x)
        return x

### Define cost function and optimizer
Define functions to get:
* The **crossentropy** loss function;
* An optimizer (Adam or SGD) with a lower learning rate (`lr/10`) for the pre-trained layers of the network than for the new final layer (`lr`):
    * **Adam**: Return Adam optimizer;
    * **SGD**: Return Stochastic Gradient Descent optimizer.
* A **scheduler**, used in the vanilla implementation of CAN as described in the paper: $$\eta_p = \frac{\eta_0}{(1 + ap)^b}$$ where $p$ increases linearly from 0 to 1 and represents the progress throughout the epochs.
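
To get a feeling for how quickly this schedule decays, the following sketch evaluates the formula for a few values of $p$, using the defaults $a = 10$ and $b = 0.75$ defined in the constants. The function name `lr_at` and the base learning rate `0.01` are illustrative assumptions.

```python
def lr_at(progress, base_lr, a=10, b=0.75):
    """Learning rate eta_p = eta_0 / (1 + a * p) ** b for p in [0, 1]."""
    return base_lr / (1 + a * progress) ** b


base_lr = 0.01  # hypothetical initial learning rate eta_0
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    # The rate starts at base_lr and decreases monotonically with p.
    print(f"p={p:.2f} -> lr={lr_at(p, base_lr):.5f}")
```

With `LambdaLR` the same expression is used as a multiplicative factor on the initial learning rate of each parameter group. Since the notebook computes progress as `epoch / tot_epochs`, $p$ stays strictly below 1 during training.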

In [None]:
def get_ce():
    return nn.CrossEntropyLoss()


def get_optimizer(model, lr, optim:str=DEFAULT_OPTIM):
    '''
    optim: either 'adam' or 'sgd'
    '''
    final_layer_weights = []
    rest_of_the_net_weights = []

    # iterate through the layers of the network
    for name, param in model.named_parameters():
        if name.startswith('backbone.fc'):
            final_layer_weights.append(param)
        else:
            rest_of_the_net_weights.append(param)

    # assign the distinct learning rates to each group of parameters
    lr_specs = [
            {'params': rest_of_the_net_weights},
            {'params': final_layer_weights, 'lr': lr}
        ]
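    if optim == 'adam':
        optimizer = torch.optim.Adam(lr_specs, lr=lr/10)
    elif optim == 'sgd':
        optimizer = torch.optim.SGD(lr_specs, lr=lr/10, momentum=DEFAULT_MOMENTUM)
    else:
        # fail early instead of returning an undefined optimizer
        raise ValueError(f"Unknown optimizer '{optim}': expected 'adam' or 'sgd'")
    return optimizer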


def get_lr_scheduler(optimizer, tot_epochs):
    a = LR_SCHEDULER_A
    b = LR_SCHEDULER_B
    coeff_func = lambda epoch: 1 / ((1 + a * (epoch / tot_epochs))**b)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=coeff_func)

### Performance visualization
Define a function to visualize the performance of the model. It prints the classification report and logs the confusion matrix plot and a t-SNE plot that shows the samples in a 2D space, where, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.

In [None]:
def visualize(model, dataloader, wandb_run):
    '''
    Function to log additional metrics to wandb such as confusion matrix, 
    T-SNE scatter plot, images with wrong predictions.
    No training is done here.
    '''
    model.backbone.avgpool.register_forward_hook(get_activation('phi_1'))

    phi_list = []
    y_pred_list = []
    y_true_list = []
    incorrect_examples_list = []
    incorrect_preds_list = []
    incorrect_probs_list = []
    incorrect_y_true_list = []
    with torch.no_grad():
        # Forward pass to compute predictions 
        for _, (inputs, labels) in enumerate(dataloader):
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            preds = outputs.max(dim=1)[1]
            probs = F.softmax(outputs, dim=1).max(dim=1)[0]

            phi_list.append(activation['phi_1'].squeeze())
            y_pred_list.append(preds)
            y_true_list.append(labels)

            # get a mask with wrong images indexes.
            idxs_mask = (preds != labels).view(-1)
            incorrect_examples_list.append(inputs[idxs_mask])
            incorrect_preds_list.append(preds[idxs_mask])
            incorrect_y_true_list.append(labels[idxs_mask])
            incorrect_probs_list.append(probs[idxs_mask])

    # Concat lists into tensors
    incorrect_examples = torch.cat(incorrect_examples_list)
    incorrect_preds = torch.cat(incorrect_preds_list)
    incorrect_y_true = torch.cat(incorrect_y_true_list)
    incorrect_probs = torch.cat(incorrect_probs_list)
    y_preds = torch.cat(y_pred_list).cpu().numpy()
    y_trues = torch.cat(y_true_list).cpu().numpy()
    phis = torch.cat(phi_list).cpu().numpy()
    
    # Take five random wrong predicted images and plot them on wandb.
    rand_indexes = np.random.randint(low=0, high=incorrect_examples.shape[0], size=5)
    wrong_preds_table = wandb.Table(columns=['Image', 'Prediction', 'Ground Truth', 'Probability'])
    for rnd_idx in rand_indexes:
        wrong_preds_table.add_data(
            wandb.Image(incorrect_examples[rnd_idx]), 
            classes[incorrect_preds[rnd_idx]], 
            classes[incorrect_y_true[rnd_idx]], 
            incorrect_probs[rnd_idx]
            )
    wandb.log({"test/wrong_predictions": wrong_preds_table}, commit=False)

    # Print classification report
    print(classification_report(y_trues, y_preds, target_names=classes))

    # Build and log confusion matrix to wandb
    cf_matrix = confusion_matrix(y_trues, y_preds, normalize="true")
    df_cm = pd.DataFrame(cf_matrix, index=classes, columns=classes)
    plt.figure(0, figsize = (12,7))
    sns.heatmap(df_cm, annot=True)
    cm_plot = plt.figure(0)
    plt.savefig(f"cm-{wandb_run.project}-{wandb_run.name}.png")
    wandb.log({"test/confusion_matrix_img": wandb.Image(cm_plot)}, commit=False)

    # Apply T-SNE on last avgpool layer activations to visualize data separation
    # Adapted from https://www.datatechnotes.com/2020/11/tsne-visualization-example-in-python.html
    perplexity = 5
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=123)
    phis_reduced = tsne.fit_transform(phis) 
    df_tsne = pd.DataFrame()
    df_tsne["y"] = y_trues
    df_tsne["comp-1"] = phis_reduced[:,0]
    df_tsne["comp-2"] = phis_reduced[:,1]

    plt.figure(1, dpi=100)
    sns.scatterplot(x="comp-1", y="comp-2", hue=df_tsne.y.tolist(),
                    palette=sns.color_palette("hls", NUM_CLASSES),
                    data=df_tsne, legend=False).set(title="Last avgpool layer activations T-SNE projection")
    plt.tick_params(left=False, right=False, labelleft=False, 
                    labelbottom=False, bottom=False)
    tsne_plot = plt.figure(1, dpi=100)
    plt.savefig(f"tsne-{perplexity}-{wandb_run.project}-{wandb_run.name}.png")
    wandb.log({"test/tsne_visualization": wandb.Image(tsne_plot)}, commit=False)
    cf_matrix_wandb = wandb.sklearn.plot_confusion_matrix(y_trues, y_preds, classes, normalize=True)

## Baseline Approach
Here we present the first attempt to solve the problem. As mentioned above, we fine-tune a ResNet18 on the source training set and test it on the target test set. In this case we **do not** adopt any Domain Adaptation technique; we just evaluate the performance of ResNet in the presence of Domain Shift.

We can also appreciate the upper bound accuracy in the supervised scenario for both domains, computed as a standard classification task using training and test set from a single domain.

### Define baseline training step and test step

In [None]:
def training_step_baseline(model, data_loader, optimizer, cost_function, device, epoch=0):
    n_samples = 0
    cumulative_loss = 0.
    cumulative_accuracy = 0.

    model.train()
    for batch_idx, (inputs, targets) in enumerate(tqdm_notebook(data_loader, desc="Training Step", leave=False)):
        inputs = inputs.to(device)
        targets = targets.to(device)

        outputs = model(inputs)

        loss = cost_function(outputs, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # add batch size
        n_samples += inputs.shape[0]
        # cumulative loss
        cumulative_loss += loss.item()
        # return predicted labels
        max_prob, predicted = outputs.max(dim=1)
        # cumulative accuracy
        cumulative_accuracy += predicted.eq(targets).sum().item()

    # avg loss and accuracy
    loss = cumulative_loss / n_samples
    acc = cumulative_accuracy / n_samples

    metrics = {
        "train/train_loss": loss,
        "train/train_acc": acc
    }

    return metrics


def test_step_baseline(model, data_loader, cost_function, device, epoch=0):
    n_samples = 0
    cumulative_loss = 0.
    cumulative_accuracy = 0.

    model.eval()

    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(tqdm_notebook(data_loader, desc="Test Step", leave=False)):
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs = model(inputs)

            loss = cost_function(outputs, targets)

            # add batch size
            n_samples += inputs.shape[0]
            # cumulative loss
            cumulative_loss += loss.item()
            # return predicted labels
            max_prob, predicted = outputs.max(dim=1)

            # cumulative accuracy
            cumulative_accuracy += predicted.eq(targets).sum().item()

    
    # avg loss and accuracy
    loss = cumulative_loss / n_samples
    acc = cumulative_accuracy / n_samples

    metrics = {
        "test/test_loss": loss,
        "test/test_acc": acc
    }

    return metrics

### Train baseline

In [None]:
def training_loop_baseline(tr_dl, ts_dl, device, wandb_run):
    print(wandb_run.name)

    model = ResNet18(NUM_CLASSES).to(device)

    optimizer = get_optimizer(model, lr=wandb.config['lr'], optim=wandb.config["optimizer"])
    cost_fn = get_ce()
    
    best_loss = 0.
    best_acc = 0.

    print("Start training")
    for e in tqdm_notebook(range(wandb_run.config['epochs']), desc="Training Loop"):
        train_metrics = training_step_baseline(model, tr_dl, optimizer, cost_fn, device, epoch=e)
        test_metrics = test_step_baseline(model, ts_dl, cost_fn, device, epoch=e)

        wandb.log({**train_metrics, **test_metrics})

        train_loss = train_metrics['train/train_loss']
        train_acc = train_metrics['train/train_acc']
        
        test_loss = test_metrics['test/test_loss']
        test_acc = test_metrics['test/test_acc']

        if best_acc < test_acc or e == 0:
            best_acc = test_acc
            best_loss = test_loss
            best_model = copy.deepcopy(model)
        
        print('\n Epoch: {:d}'.format(e + 1))
        print('\t Training loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_acc))
        print('\t Test loss {:.5f}, Test accuracy {:.2f}'.format(test_loss, test_acc))
        print('-----------------------------------------------------')

    visualize(best_model, ts_dl, wandb_run)
    wandb.summary["test_best_loss"] = best_loss
    wandb.summary["test_best_accuracy"] = best_acc
    wandb.finish()
    print('\t BEST Test loss {:.5f}, Test accuracy {:.2f}'.format(best_loss, best_acc))
    
    return best_model

Train baseline model on Products, test on Real World.

In [None]:
run = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name='P_to_R_baseline',
    config={
        "model": "ResNet18",
        "trained-on": "Source only",
        "epochs": 30,
        "batch_size": BATCH_SIZE,
        "optimizer": "sgd",
        "lr": 1e-2,
        "loss": "CrossEntropyLoss"
    }
)

best_P_R = training_loop_baseline(tr_dl_products, ts_dl_real, device, run)

**Best test accuracy $P \rightarrow R:$ 0.74**

Train baseline model on Real World, test on Products.

In [None]:
run = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name='R_to_P_baseline',
    config={
        "model": "ResNet18",
        "trained-on": "Source only",
        "epochs": 30,
        "batch_size": BATCH_SIZE,
        "optimizer": "sgd",
        "lr": 1e-2,
        "loss": "CrossEntropyLoss"
    }
)

best_R_P = training_loop_baseline(tr_dl_real, ts_dl_products, device, run)

**Best test accuracy $R \rightarrow P:$ 0.93**

Compute Upper Bound accuracy on Products.

In [None]:
run = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name='Upper_Bound_P',
    config={
        "model": "ResNet18",
        "trained-on": "Source only: P",
        "epochs": 30,
        "batch_size": BATCH_SIZE,
        "optimizer": "sgd",
        "lr": 1e-2,
        "loss": "CrossEntropyLoss"
    }
)

model_upper_bound_P = training_loop_baseline(tr_dl_products, ts_dl_products, device, run)

class | precision | recall | f1-score | support
---|---|---|---|---
backpack            | 0.97   |   1.00  |   0.98   |    29
bookcase            | 0.91   |   1.00  |   0.95   |    21
car jack            | 0.88   |   0.88  |   0.88   |    17
comb                | 0.86   |   0.95  |   0.90   |    19
crown               | 1.00   |   1.00  |   1.00   |    20
file cabinet        | 0.94   |   0.89  |   0.91   |    18
flat iron           | 0.76   |   1.00  |   0.86   |    16
game controller     | 0.95   |   0.83  |   0.89   |    24
glasses             | 1.00   |   1.00  |   1.00   |    19
helicopter          | 0.89   |   1.00  |   0.94   |    17
ice skates          | 0.94   |   0.89  |   0.92   |    19
letter tray         | 1.00   |   1.00  |   1.00   |    16
monitor             | 1.00   |   0.90  |   0.95   |    20
mug                 | 1.00   |   1.00  |   1.00   |    17
network switch      | 1.00   |   1.00  |   1.00   |    24
over-ear headphones | 1.00   |   1.00  |   1.00   |    15
pen                 | 0.92   |   0.83  |   0.87   |    29
purse               | 1.00   |   0.90  |   0.95   |    21
stand mixer         | 1.00   |   1.00  |   1.00   |    19
stroller            | 1.00   |   1.00  |   1.00   |    20
||||
accuracy            |        |         |   0.95   |   400
macro avg           | 0.95   |   0.95  |   0.95   |   400
weighted avg        | 0.95   |   0.95  |   0.95   |   400

**Upper Bound Products Acc: 0.95**

Compute Upper Bound accuracy on Real.

In [None]:
run = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name='Upper_Bound_R',
    config={
        "model": "ResNet18",
        "trained-on": "Source only: R",
        "epochs": 30,
        "batch_size": BATCH_SIZE,
        "optimizer": "sgd",
        "lr": 1e-2,
        "loss": "CrossEntropyLoss"
    }
)

model_upper_bound_R = training_loop_baseline(tr_dl_real, ts_dl_real, device, run)


class | precision   | recall | f1-score |  support
---|---|---|---|---
backpack            |   0.77    |  1.00  |    0.87  |      17
bookcase            |   0.94    |  0.80  |    0.86  |      20
car jack            |   0.83    |  1.00  |    0.91  |      15
comb                |   0.83    |  0.91  |    0.87  |      22
crown               |   1.00    |  1.00  |    1.00  |      21
file cabinet        |   0.83    |  0.91  |    0.87  |      22
flat iron           |   0.94    |  0.94  |    0.94  |      16
game controller     |   0.95    |  0.90  |    0.93  |      21
glasses             |   0.95    |  1.00  |    0.97  |      19
helicopter          |   1.00    |  0.95  |    0.97  |      19
ice skates          |   0.94    |  0.81  |    0.87  |      21
letter tray         |   0.92    |  0.85  |    0.88  |      27
monitor             |   0.95    |  0.90  |    0.92  |      20
mug                 |   0.88    |  0.92  |    0.90  |      24
network switch      |   0.81    |  1.00  |    0.89  |      17
over-ear headphones |   1.00    |  1.00  |    1.00  |      17
pen                 |   0.94    |  0.94  |    0.94  |      17
purse               |   1.00    |  0.74  |    0.85  |      23
stand mixer         |   0.95    |  0.95  |    0.95  |      21
stroller            |   1.00    |  0.95  |    0.98  |      21
||||
accuracy            |           |        |    0.92  |     400
macro avg           |   0.92    |  0.92  |    0.92  |     400
weighted avg        |   0.92    |  0.92  |    0.92  |     400

**Upper Bound RealWorld Acc: 0.92**

### Observations
Before going into the observations, we must recall that each Network is trained on the source training set and tested on the target test set.

We can observe that test accuracies are quite high just by using a pretrained ResNet18. This is thanks to the pretraining of the network on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset, which allows the network to learn general features that transfer well to the Adaptiope dataset.

It might seem that the training suffers from overfitting, but we save the best model (i.e. the model with the highest test accuracy) and therefore apply a form of early stopping to cope with possible overfitting.
The tables below summarize the accuracies we obtained.
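
The best-model selection described here can be sketched in isolation as follows. This is a simplified stand-in for the training loop, assuming one `(state, test_acc)` pair per epoch; the function name `select_best` and the accuracy values in `history` are made up for illustration.

```python
import copy


def select_best(history):
    """Return (epoch, accuracy, state) of the epoch with the highest accuracy."""
    best_epoch, best_acc, best_state = None, None, None
    for epoch, (state, acc) in enumerate(history):
        # Keep the first epoch unconditionally, then only strict improvements.
        if best_acc is None or acc > best_acc:
            best_epoch, best_acc = epoch, acc
            best_state = copy.deepcopy(state)  # snapshot, like deepcopy(model)
    return best_epoch, best_acc, best_state


# Hypothetical per-epoch results: accuracy peaks at epoch 1, then degrades.
history = [({"w": 0}, 0.60), ({"w": 1}, 0.74), ({"w": 2}, 0.71)]
print(select_best(history))  # prints: (1, 0.74, {'w': 1})
```

Because the selection is made on the test set itself, the best accuracy obtained this way should be read as an optimistic estimate.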

**Product to Real World** - 
Of the two domain adaptation tasks, $P → R$ is the more difficult one. The overall accuracy is lower and several classes (like "glasses") have a low class-wise F1-score. Moreover, we can observe that the confusion matrix is very noisy and that the t-SNE representation is not very accurate. Our hypothesis for this behaviour is that the Real World domain presents objects in different orientations and light conditions (often poor), mixed up with other objects. Furthermore, the classes with lower accuracy are the ones whose objects are very similar, for example "purse" and "backpack". The Product domain instead presents samples in which the objects are isolated and well illuminated. Therefore the model trained on the Product domain does not transfer well to the Real World domain.
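
For reference, the class-wise F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it for three classes, taking the rounded precision and recall values from the $P → R$ classification report; the helper name `f1_score` is ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the P -> R classification report.
report = {"glasses": (0.90, 0.47), "purse": (0.50, 0.48), "backpack": (0.44, 0.94)}
for name, (p, r) in report.items():
    print(f"{name}: f1 = {f1_score(p, r):.2f}")
```

The "glasses" row shows why F1 is informative here: precision is high (0.90), but the low recall (0.47) pulls the score down to 0.62.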

       
<img src="https://drive.google.com/uc?export=view&id=1frS4h0qSPAqZ2Q2gcqT9Zo8wsjJILJCw" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1Eo6D3KUQEZBogOBJ1KywNs5NJ-cmZTp8" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1fcKDWMseKCiGcUeZb4Pr14ECZPGRzWUD" height="400px">
<img src="https://drive.google.com/uc?export=view&id=1I24sggX5SDkQD7FjbmEvErCwm9aLLU0E" height="300px">

class | precision | recall | f1-score  | support
------|-----------|--------|-----------|--------
backpack       |0.44      |0.94      |0.60        |17
bookcase       |0.81      |0.65      |0.72        |20
car jack       |0.80      |0.80      |0.80        |15
comb       |0.52      |0.77      |0.62        |22
crown       |0.91      |0.95      |0.93        |21
file cabinet       |0.56      |0.64      |0.60        |22
flat iron       |0.93      |0.81      |0.87        |16
game controller       |1.00      |0.76      |0.86        |21
glasses       |0.90      |0.47      |0.62        |19
helicopter       |0.78      |0.95      |0.86        |19
ice skates       |0.88      |0.67      |0.76        |21
letter tray       |0.73      |0.70      |0.72        |27
monitor       |0.79      |0.75      |0.77        |20
mug       |1.00      |0.88      |0.93        |24
network switch       |0.82      |0.53      |0.64        |17
over-ear headphones       |0.84      |0.94      |0.89        |17
pen       |0.86      |0.71      |0.77        |17
purse       |0.50      |0.48      |0.49        |23
stand mixer       |0.76      |0.90      |0.83        |21
stroller       |0.82      |0.67      |0.74        |21
||||
accuracy     |          |          |  0.74    |   400
macro avg     |  0.78    |  0.75    |  0.75    |   400
weighted avg     |  0.78    |  0.74    |  0.75    |   400

**Real World to Product** - 
In the case of $R → P$ the baseline performance is higher. The main reason could be that training on the Real World dataset allows the model to generalize better, which results in a higher accuracy on the target test set. As we can see, both the confusion matrix and the t-SNE representation are accurate.

<img src="https://drive.google.com/uc?export=view&id=1KoZ19gAQ7IrimxR5zj9QqS9mWHO5eYx8" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1azMQXE7HW2UOom3RsRu9eqJjLITi44gV" width="500px">
<img src="https://drive.google.com/uc?export=view&id=16BmAsC_tE-pvEcHx6IMlOZnPUlo1uUXT" height="400px">
<img src="https://drive.google.com/uc?export=view&id=1Eha_2alrv81awDCwm9FEkh6BsaZ9f9Ck" height="300px">

class | precision  |  recall | f1-score |  support
------|------------|---------|----------|----------
backpack       |0.96      |0.90      |0.93        |29
bookcase       |0.86      |0.86      |0.86        |21
car jack       |0.93      |0.82      |0.87        |17
comb       |0.83      |1.00      |0.90        |19
crown       |1.00      |1.00      |1.00        |20
file cabinet       |0.88      |0.78      |0.82        |18
flat iron       |0.88      |0.94      |0.91        |16
game controller       |0.95      |0.88      |0.91        |24
glasses       |0.95      |1.00      |0.97        |19
helicopter       |0.89      |1.00      |0.94        |17
ice skates       |1.00      |0.95      |0.97        |19
letter tray       |0.89      |1.00      |0.94        |16
monitor       |1.00      |0.90      |0.95        |20
mug       |1.00      |1.00      |1.00        |17
network switch       |1.00      |0.96      |0.98        |24
over-ear headphones       |0.79      |1.00      |0.88        |15
pen       |0.89      |0.83      |0.86        |29
purse       |0.86      |0.86      |0.86        |21
stand mixer       |1.00      |0.95      |0.97        |19
stroller       |0.95      |1.00      |0.98        |20
||||
accuracy   |            |          |  0.93    |  400
macro avg   |    0.93    |  0.93    |  0.93    |  400
weighted avg   |    0.93    |  0.93    |  0.92    |  400

**Upper bounds** -
The upper bound on the accuracy is 0.95 for the Products dataset and 0.92 for the Real World one. We need to take these numbers into account when evaluating the performance of the CAN domain adaptation method and of our improvements on it.

## CAN Implementation
As mentioned above, in order to improve on the baseline results we decided to implement the [Contrastive Adaptation Network](https://arxiv.org/abs/1901.00976) by Kang et al.

CAN is a discrepancy-based domain adaptation method which aims at minimizing the discrepancy between the source and target domains through statistical domain alignment.

Previous discrepancy-based methods measured domain discrepancy by Maximum Mean Discrepancy (MMD) and Joint MMD (JMMD), obtaining state-of-the-art results on several UDA benchmarks. Despite their success, these methods share a common problem: the domain discrepancy is measured at the domain level, neglecting the class from which the samples are drawn.

With a class-agnostic domain alignment, MMD and JMMD can be minimized even when target-domain samples are wrongly aligned with source-domain samples of a different class. Thus the decision boundary may generalize poorly to the target domain.
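
The failure mode described above can be seen on a toy example. The following sketch is purely illustrative: it uses one-dimensional features and the squared distance between means (a linear-kernel MMD) as the discrepancy. The data and the helper names `domain_gap` and `class_gap` are made up for this illustration and are not part of CAN.

```python
def mean(values):
    return sum(values) / len(values)

# Toy 1-D features grouped by class (hypothetical data).
# The target classes are swapped with respect to the source ones.
source = {0: [-1.5, -0.5], 1: [0.5, 1.5]}
target = {0: [0.5, 1.5], 1: [-1.5, -0.5]}

def domain_gap(src, tgt):
    """Class-agnostic discrepancy: squared distance between the domain means."""
    all_src = [v for values in src.values() for v in values]
    all_tgt = [v for values in tgt.values() for v in values]
    return (mean(all_src) - mean(all_tgt)) ** 2

def class_gap(src, tgt, c):
    """Class-aware discrepancy: squared distance between the means of class c."""
    return (mean(src[c]) - mean(tgt[c])) ** 2

print(domain_gap(source, target))    # 0.0
print(class_gap(source, target, 0))  # 4.0
```

The domain-level measure is already zero, although every target sample sits on top of the wrong source class.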

The authors of CAN, instead, propose a discrepancy measure which explicitly takes the class information into account. For this purpose labels for both domains are needed, therefore they use clustering to infer pseudo-labels for the target set.
Given these pseudo-labels, they propose the Contrastive Domain Discrepancy (CDD), which is based on the difference between the conditional data distributions across domains. Using this approach, the authors were able to obtain SOTA results in Unsupervised Domain Adaptation.

![Architecture](https://drive.google.com/uc?export=view&id=18zSj57F5lBfVRjqjtZ9saJXeD3RUI4iP)

Training process of CAN. To minimize CDD, the authors perform alternative optimization between updating the target label hypothesis, through clustering, and adapting feature representations, through back-propagation.

Each part of the algorithm will be explained in the following sections.

### CDD Loss
The Contrastive Domain Discrepancy Loss is computed as follows:
$$
\hat{D}^{cdd}_L = \sum_{l=1}^{L} \hat{D}_l^{cdd}
$$
where $l$ is the index of a Fully Connected Layer. It is known that CNNs are able to learn more transferable features than shallow models. However, the discrepancy still exists for domain-specific layers. In other words, convolutional layers extract general and more transferable features, while FC layers exhibit abstract and domain-specific features. Thus, they must be adapted, and to do so we must compute $\hat{D}^{cdd}$ for all fully connected layers of the network. In our case 2 layers are involved: the output of the global average pooling and the output layer.

Each $\hat{D}^{cdd}_l$ measures the difference between the intra- and inter-class domain discrepancy of the given layer, which are optimized in opposite directions. As the training proceeds the intra-class domain discrepancy becomes smaller while the inter-class domain discrepancy becomes larger, so that the hard (ambiguous) classes are able to be taken into account. It is worth noting that, due to this property, it is very likely that this value will be negative.
$$
\hat{D}^{cdd}_l = \underset{intra}{\underbrace{\frac{1}{M} \sum_{c=1}^{M} \hat{D}^{cc}(\hat{y}_{1:n_t}^{t}, \phi_l)}} - \underset{inter}{\underbrace{\frac{1}{M(M-1)} \sum_{c=1}^{M} \sum_{\substack{c'=1\\c'\neq c}}^{M} \hat{D}^{cc'} (\hat{y}_{1:n_t}^{t}, \phi_l)}}
$$
where $M$ is the number of classes, $\phi_l$ is the output of the layer $l \in L$ taken into consideration for computing $\hat{D}^{cdd}_l$ and $\hat{y}_{1:n_t}^{t}$ is the abbreviation of $\hat{y}_1^t, ..., \hat{y}_{n_t}^t$ which are the estimated classes for the target dataset ($n_t$ number of target examples).
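
To make the intra/inter split concrete, the sketch below evaluates the formula above on a small matrix of class-aware discrepancies. The matrix values are hypothetical and the helper name `cdd_from_matrix` is ours; in the real computation each entry would be a $\hat{D}^{cc'}$ estimated from the activations.

```python
def cdd_from_matrix(D):
    """D[c][c_prime] holds the discrepancy between source class c and
    target class c_prime. Returns intra - inter, as in the formula above."""
    M = len(D)
    # average of the diagonal: same class across the two domains
    intra = sum(D[c][c] for c in range(M)) / M
    # average of the off-diagonal entries: different classes across domains
    inter = sum(
        D[c][c_prime] for c in range(M) for c_prime in range(M) if c_prime != c
    ) / (M * (M - 1))
    return intra - inter

# Hypothetical values for M = 3 classes: small on the diagonal, larger elsewhere.
D = [
    [0.25, 1.0, 1.5],
    [1.0, 0.5, 2.0],
    [1.5, 2.0, 0.75],
]
print(cdd_from_matrix(D))  # -1.0
```

Here the intra-class term is 0.5 and the inter-class term is 1.5, which shows why a well-adapted layer typically yields a negative value.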

$\hat{D}^{c_1 c_2}$ is then defined as follows:
$$
\hat{D}^{c_1 c_2}(\hat{y}_1^t, ..., \hat{y}_{n_t}^t, \phi) = e_1 + e_2 - 2e_3
$$
where:
$$
\begin{aligned}
e_1 &= \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \frac{\mu_{c_1 c_1}(y_i^s, y_j^s)k(\phi(x_i^s), \phi(x_j^s))}{\sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \mu_{c_1 c_1}(y_i^s, y_j^s)} \\
e_2 &= \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \frac{\mu_{c_2 c_2}(\hat{y}_i^t, \hat{y}_j^t)k(\phi(x_i^t), \phi(x_j^t))}{\sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \mu_{c_2 c_2}(\hat{y}_i^t, \hat{y}_j^t)} \\
e_3 &= \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \frac{\mu_{c_1 c_2}(y_i^s, \hat{y}_j^t)k(\phi(x_i^s), \phi(x_j^t))}{\sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \mu_{c_1 c_2}(y_i^s, \hat{y}_j^t)}
\end{aligned}
$$

$\mu_{c c'}$ acts like a filter that selects only the examples from class $c$ and class $c'$:
$$
\mu_{c c'}(y, y') = \begin{cases}
                        1 & \text{ if } y=c, y'=c' \\ 
                        0 & \text{ otherwise } 
                    \end{cases}
$$
Clearly, to compute the masks $\mu_{c_2 c_2}(\hat{y}_i^t, \hat{y}_j^t)$ and $\mu_{c_1 c_2}(y_i^s, \hat{y}_j^t)$ we need to estimate the target labels. The approach used by the authors is explained in the Clustering section.

$k$ is the kernel function that is used to compute the similarity between two activations.
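
As an illustration of $k$, here is a minimal sketch of a Gaussian (RBF) kernel between two feature vectors in plain Python. The use of a single Gaussian kernel and the default `gamma = 1 / num_features` mirror the `rbf_kernel` function of this notebook; the name `gaussian_kernel_sketch` is ours.

```python
import math

def gaussian_kernel_sketch(x, y, gamma=None):
    """k(x, y) = exp(-gamma * ||x - y||^2) for two equally long vectors."""
    if gamma is None:
        gamma = 1.0 / len(x)  # assumption: same default as rbf_kernel
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

print(gaussian_kernel_sketch([1.0, 2.0], [1.0, 2.0]))  # 1.0 (identical vectors)
print(gaussian_kernel_sketch([0.0, 0.0], [1.0, 1.0]))  # exp(-1), about 0.368
```

The similarity is 1 for identical activations and decays towards 0 as they move apart.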

$\hat{D}^{c_1 c_2}$ defines two kinds of class-aware domain discrepancies:
1. when $c_1 = c_2 = c$, it measures the intra-class domain discrepancy;
2. when $c_1 \neq c_2$, it measures the inter-class domain discrepancy.
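
The definitions of $e_1$, $e_2$, $e_3$ and $\hat{D}^{c_1 c_2}$ can be written down literally with for-loops. The sketch below is a slow reference version in plain Python, meant only for checking a vectorized implementation on tiny inputs. The fixed `gamma`, the helper names and the choice of returning `0.0` when no pair is selected (the same convention as the `e` function of this notebook) are assumptions of this sketch.

```python
import math

def rbf(x, y, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def e_term(c, c_prime, feats, labels, feats_prime, labels_prime, gamma=1.0):
    """Mean kernel value over the pairs (i, j) with labels[i] == c and
    labels_prime[j] == c_prime; 0.0 if no such pair exists."""
    total, count = 0.0, 0
    for x, y in zip(feats, labels):
        for x_p, y_p in zip(feats_prime, labels_prime):
            if y == c and y_p == c_prime:  # mu_{c c'}(y, y') == 1
                total += rbf(x, x_p, gamma)
                count += 1
    return total / count if count else 0.0

def class_discrepancy(c1, c2, src, y_src, tgt, y_tgt, gamma=1.0):
    e1 = e_term(c1, c1, src, y_src, src, y_src, gamma)  # source vs source
    e2 = e_term(c2, c2, tgt, y_tgt, tgt, y_tgt, gamma)  # target vs target
    e3 = e_term(c1, c2, src, y_src, tgt, y_tgt, gamma)  # source vs target
    return e1 + e2 - 2 * e3

# Tiny hypothetical example with 1-D activations.
src, y_src = [[0.0], [0.1], [2.0]], [0, 0, 1]
tgt, y_tgt = [[0.0], [0.1], [2.1]], [0, 0, 1]  # pseudo-labels
intra = class_discrepancy(0, 0, src, y_src, tgt, y_tgt)
inter = class_discrepancy(0, 1, src, y_src, tgt, y_tgt)
print(intra < inter)  # True
```

In this example the class-0 samples coincide across domains, so the intra-class discrepancy vanishes while the inter-class one stays positive.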

The resulting $\hat{D}^{cdd}$ will then be multiplied by a rescaling factor $\beta$ and then added to the CrossEntropy Loss function computed on the source domain. The overall objective function is:
$$
\underset{\theta}{\min}\, l = l^{ce} + \beta \hat{D}_L^{cdd}
$$

> Note that also source examples are used in the computation of CDD but authors did not include them in functions' signature ($n_s$ number of source examples).

![Discrepancy](https://drive.google.com/uc?export=view&id=19vqJ48RKssJLJIEIE_Iio4uWK8sXgx8-)

Comparison between no adaptation, other domain-discrepancy minimization methods, and CAN

In [None]:
def rbf_kernel(X, Y=None, gamma=None):
    """
    Based on sklearn.metrics.pairwise.rbf_kernel implementation.
    Compute the rbf (gaussian) kernel between X and Y:
        K(x, y) = exp(-gamma ||x-y||^2)
    for each pair of rows x in X and y in Y.
    Parameters
    ----------
    X : Tensor of shape (n_samples_X, n_features)
    Y : Tensor of shape (n_samples_Y, n_features), default=None
        If `None`, uses `Y=X`.
    gamma : float, default=None
        If None, defaults to 1.0 / n_features.
    Returns
    -------
    kernel_matrix : Tensor of shape (n_samples_X, n_samples_Y)
    """
    if Y is None:
        Y = torch.clone(X)

    if gamma is None:
        gamma = 1.0 / X.shape[1]

    K = torch.cdist(X, Y, compute_mode="use_mm_for_euclid_dist").square()
    K = torch.mul(K, -gamma)
    return torch.exp(K)  # element-wise exponential (returns a new tensor)

In [None]:
def CDD_loss(source_phis, target_phis, source_labels, target_labels, classes, beta):
    ''' 
    Compute the Contrastive Domain Discrepancy Loss between source 
    and target activations given source labels and target pseudo-labels,
    for specified classes.
    Parameters
    ----------
    source_phis : list containing each layer's source batched activations
                  activation is a Tensor of shape (batch_size, num_features)
    target_phis : list containing each layer's target batched activations
                  activation is a Tensor of shape (batch_size, num_features)
    source_labels : Tensor of shape (batch_size)
    target_labels : Tensor of shape (batch_size)
    classes : set of classes (indexes) on which the CDD is computed.
    '''
    cdd_loss = torch.tensor(0., device=device)
    for phi_src, phi_tgt in zip(source_phis, target_phis):
        cdd_loss = torch.add(cdd_loss, D_cdd(phi_src, source_labels, phi_tgt, target_labels, device, classes))
    return torch.mul(cdd_loss, beta)


def D_cdd(phi_source, y_source, phi_target, y_target, device, classes):
    '''
    Compute CDD Loss for a single source/target activation pair.
    '''
    num_classes = len(classes)

    # intra_sum is the cumulative intra-class domain discrepancy. By minimizing it we try to align the domains of the same class
    intra_sum = torch.tensor(0., device=device)
    # On the other hand, inter_sum is the cumulative inter-class domain discrepancy. This one is maximized in order to separate the distributions of two different classes
    inter_sum = torch.tensor(0., device=device)

    # compute the cumulative intra-class domain discrepancy
    for c in classes:
        intra_sum = torch.add(intra_sum, domain_discrepancy(c, c, phi_source, y_source, phi_target, y_target, device))

    # normalize by the number of classes
    intra = torch.div(intra_sum, num_classes)

    # compute the cumulative inter-class domain discrepancy
    # basically we are interested in pair of classes that are different to each other
    for c in classes:
        for c_prime in classes:
            if c != c_prime:
                inter_sum = torch.add(inter_sum, domain_discrepancy(c, c_prime, phi_source, y_source, phi_target, y_target, device))

    inter = torch.div(inter_sum, (num_classes*(num_classes-1)))

    return torch.sub(intra, inter)


def domain_discrepancy(c, c_prime, phi_source, y_source, phi_target, y_target, device):
    """
    return the domain discrepancy between source class c and target class c', given the activations phi:
    e1 is the mean kernel similarity among source samples of class c
    e2 is the mean kernel similarity among target samples of class c'
    e3 is the mean kernel similarity between source samples of class c and target samples of class c'
    """
    intra = torch.add(
        e(c, c, phi_source, y_source, phi_source, y_source, device),
        e(c_prime, c_prime, phi_target, y_target, phi_target, y_target, device)
    )
    inter = torch.mul(-2, e(c, c_prime, phi_source, y_source, phi_target, y_target, device))
    return torch.add(intra, inter)

def e(c, c_prime, phi, y, phi_prime, y_prime, device):
    """
    In order to be computationally efficient it exploits Pytorch Tensor operations,
    instead of a straight forward implementation of the formula above with for-loops.

    if c == c' measures the intra-class domain discrepancy
    if c != c' measures the inter-class domain discrepancy
    """    
    # matrix with all pairwise kernel similarities between phi and phi_prime vectors
    kernel_covariance = rbf_kernel(phi, phi_prime)

    # compute a boolean mask to filter out unwanted pairwise distances.
    A = (y == c).unsqueeze(1).expand(-1, y_prime.size(0))
    B = (y_prime == c_prime).unsqueeze(0)
    mask = (A & B)
    masked = kernel_covariance*mask

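    # sum the remaining similarities and average over the number of selected pairs.
    # The pairs are counted on the mask: counting non-zero kernel values would
    # miss pairs whose similarity underflows to 0.
    similarity = torch.sum(masked)
    count = mask.sum()

    return torch.div(similarity, count) if count > 0 else torch.tensor(0., device=device)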


### Clustering
In order to compute the target labels, authors proposed a clustering approach based on a Spherical K-Means algorithm.

Specifically they perform what they call *Alternative optimization*. Basically they want to jointly optimize the target label hypothesis $\hat{y}_{1:n_t}^t$ and the feature representations $\phi_{1:L}$. At the beginning of each loop, the target labels are computed through clustering then, based on the updated target labels $\hat{y}$, CDD is estimated and minimized to adapt the features. Model parameters $\theta$ are updated through standard back-propagation.
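
The control flow of this alternation can be sketched as follows. This is only a skeleton under our own naming: `estimate_labels` and `adapt_step` are hypothetical callables standing in for the clustering step and for one gradient step on the cross-entropy plus CDD objective.

```python
def alternating_optimization(num_loops, steps_per_loop, estimate_labels, adapt_step):
    """Alternate between re-estimating the target pseudo-labels and a number
    of gradient steps that use them."""
    for loop in range(num_loops):
        pseudo_labels = estimate_labels(loop)  # clustering on the current features
        for _ in range(steps_per_loop):
            adapt_step(pseudo_labels)          # one update with fixed pseudo-labels

# Stub callables that only record the call order.
log = []

def fake_clustering(loop):
    log.append("cluster")
    return f"labels@{loop}"

def fake_step(pseudo_labels):
    log.append(f"step({pseudo_labels})")

alternating_optimization(2, 3, fake_clustering, fake_step)
print(log[:4])  # ['cluster', 'step(labels@0)', 'step(labels@0)', 'step(labels@0)']
```

The pseudo-labels stay fixed during the inner steps and are refreshed only at the beginning of the next loop.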

In order to represent a sample (on which clustering is performed), they use the input activations $\phi_1(\cdot)$ of the first task-specific layer. In our case, each sample is represented by the output of the global average pooling layer of a ResNet18. Spherical K-Means is then employed to cluster the target samples and compute the estimated labels.

In order to provide a good initialization of the K-Means, for each class the target cluster center $O^{tc}$ is initialized as the source cluster center $O^{sc}$ ($O^{tc} \gets O^{sc}$) where:
$$
O^{sc} = \sum_{i=1}^{n_s} \mathbf{1}_{y_i^s = c} \frac{\phi_1(x_i^s)}{||\phi_1(x_i^s)||}, \qquad \mathbf{1}_{y_i^s = c} = \begin{cases}
                                1 & \text{ if } y_i^s= c\\ 
                                0 & \text{ otherwise } 
                            \end{cases}, \qquad c \in \{1, ..., M\}
$$
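
The source cluster centers can be computed directly from this definition. The sketch below is a plain-Python illustration with hypothetical helper names; it assumes non-zero feature vectors and class indexes starting at 0 rather than 1.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))  # assumes v is not the zero vector
    return [a / norm for a in v]

def source_centers(features, labels, num_classes):
    """For each class, sum the L2-normalized features of its source samples."""
    dim = len(features[0])
    centers = [[0.0] * dim for _ in range(num_classes)]
    for x, y in zip(features, labels):
        for k, value in enumerate(l2_normalize(x)):
            centers[y][k] += value
    return centers

features = [[3.0, 4.0], [0.0, 2.0], [5.0, 0.0]]
labels = [0, 0, 1]
centers = source_centers(features, labels, num_classes=2)
# class 0: [0.6, 0.8] + [0.0, 1.0], class 1: [1.0, 0.0]
```

Since the cosine dissimilarity used for clustering ignores the vector length, only the direction of each center matters.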

In order to measure the distance between two points in the feature space, the cosine dissimilarity is applied:
$$
d(a, b) = \frac{1}{2}\left (1 - \frac{\left \langle a, b \right \rangle}{||a||\,||b||}\right )
$$
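
A minimal plain-Python version of this distance is shown below, as a reference for small checks. The function name is ours and the inputs are assumed to be non-zero vectors.

```python
import math

def cosine_dissimilarity_reference(a, b):
    """d(a, b) = 0.5 * (1 - <a, b> / (||a|| * ||b||)), a value in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.5 * (1.0 - dot / (norm_a * norm_b))

print(cosine_dissimilarity_reference([1.0, 0.0], [2.0, 0.0]))   # 0.0 (same direction)
print(cosine_dissimilarity_reference([1.0, 0.0], [0.0, 3.0]))   # 0.5 (orthogonal)
print(cosine_dissimilarity_reference([1.0, 0.0], [-1.0, 0.0]))  # 1.0 (opposite)
```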

The clustering process proceeds as the classic K-Means using the just-defined distance. After clustering, each target sample is assigned to its estimated label.
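
The procedure can be sketched in a few lines of plain Python. This is a simplified illustration of spherical K-Means with given initial centers, not the library implementation used in this notebook; the helper names are ours and the vectors are assumed to be non-zero.

```python
import math

def unit(v):
    """Scale v to unit L2 norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dissimilarity(a, b):
    """Cosine dissimilarity 0.5 * (1 - cos(a, b))."""
    ua, ub = unit(a), unit(b)
    return 0.5 * (1.0 - sum(x * y for x, y in zip(ua, ub)))

def spherical_kmeans_sketch(points, init_centers, max_iter=100):
    centers = [unit(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # assignment step: closest center under the cosine dissimilarity
        new_labels = [
            min(range(len(centers)), key=lambda k: dissimilarity(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments did not change: converged
            break
        labels = new_labels
        # update step: re-normalized sum of the unit-norm members
        for k in range(len(centers)):
            members = [unit(p) for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = unit([sum(column) for column in zip(*members)])
    return labels, centers

# Hypothetical target features, with centers initialized from two "source" centers.
target_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
est_labels, est_centers = spherical_kmeans_sketch(target_features, [[1.0, 0.0], [0.0, 1.0]])
print(est_labels)  # [0, 0, 1, 1]
```

Because the centers start from the source centers, cluster index $c$ can be read directly as the estimated class $c$.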

In [None]:
def compute_centroids(model, dataloader):
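    ''' 
    Forward dataloader through model, extract phi_1 (last pooling layer features) 
    and accumulate the features per class.
    return a tuple (centroids, labels_source): centroids is a torch.Tensor with shape
    (num_classes, num_features), L2-normalized per class; labels_source is the list of source labels
    '''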
    centroids = 0
    samples_count = 0

    # unsqueeze to have size (num_classes, 1) instead of simply (num_classes) in order to exploit broadcasting later
    references = torch.tensor(range(model.num_classes), device=device).unsqueeze(1)
    
    labels_source = []

    model.eval()
    with torch.no_grad():
        for _, (inputs, labels) in enumerate(dataloader):
            inputs = inputs.to(device)
            labels_source.extend(labels.tolist())
            labels = labels.to(device)  # tensor of size (batch_size)
            samples_count += labels.size(0)

            outputs = model(inputs)
            # get activations of the first task specific layer of size (batch_size, num_features)
            features = torch.squeeze(activation["phi_1"])

            # resize labels tensor to (num_classes, batch_size) to use it later to generate the boolean mask
            labels = labels.unsqueeze(0).expand(model.num_classes, -1)

            # (labels == references) returns a tensor (num_classes, batch_size) thanks to broadcasting
            # Item [c][i] in the vector is true if sample i belongs to class c
            # By unsqueezing on last dimension mask becomes (num_classes, batch_size, 1)
            # this is needed to compute the mask on the features exploiting again broadcasting
            mask = (labels == references).unsqueeze(2)

            # feature * mask returns a tensor (num_classes, batch_size, num_feature)
            # where only rows on dim=1 for which the related samplelabel == class are not 0 but contain feature values
            # by summing on dim=1 we sum feature-wise all samples belonging to a class getting a (num_classes, num_features) tensor
            # then add the batch centroids to the centroid accumulator
            centroids += torch.sum(features*mask, dim=1)
    
    # return mean centroids of the dataset
    centroids = torch.div(centroids, samples_count)
    return torch.nn.functional.normalize(centroids, p=2, dim=1), labels_source


def estimate_target_labels(model, source_dl, target_dl, clustering_report=False):
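    '''
    Cluster the target set starting from the source centroids.
    Returns the estimated target labels, the target phi_1 activations,
    the target cluster centers, the clustering accuracy and the source labels.
    '''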
    # compute centroids, for each class, of the source domain
    centroids, labels_source = compute_centroids(model, source_dl)
    centroids_np = centroids.cpu().detach().numpy()
    phi_target = []
    labels_total = []

    # get target phi_1
    model.eval()
    with torch.no_grad():
        for batch_idx, (inputs, labels) in enumerate(target_dl):
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            batch_phi = torch.squeeze(activation["phi_1"])

            phi_target.append(batch_phi)
            labels_total.append(labels)

    # create batch
    phi_target = torch.cat(phi_target, dim=0)
    labels_total = torch.cat(labels_total, dim=0)

    phi_target = torch.squeeze(phi_target)
    phi_target_np = phi_target.cpu().detach().numpy()

    # cluster target data given centroids
    kmeans = SphericalKMeans(n_clusters=20, init=centroids_np, n_init=1, random_state=0)
    kmeans.fit(phi_target_np)
    target_est_labels = kmeans.labels_

    # check clustering accuracy
    clustering_acc = accuracy_score(labels_total.to('cpu'), target_est_labels)
    if clustering_report:
        print(f"Clustering accuracy: {clustering_acc}")

    return target_est_labels, phi_target, kmeans.cluster_centers_, clustering_acc, labels_source

### Filtering
Ambiguous data, i.e. target samples that are far from their affiliated cluster center, are excluded from the computation of CDD. The authors discard them by constructing a subset of the target dataset using the following criterion:
$$
\tilde{T} = \{ (x^t, \hat{y}^t) \mid d(\phi_1(x^t), O^{t(\hat{y}^t)}) < D_0, x^t \in T \}
$$
where $D_0 \in [0, 1]$ is a constant. Moreover, in order to provide more accurate estimations of the distribution statistics, they require each class to have a minimum number of samples assigned in $\tilde{T}$. Classes which do not satisfy this condition are not considered in the current loop. At loop $T_e$ the selected subset of classes is:
$$
C_{T_e} = \left \{ c \mid \sum_{i}^{|\tilde{T}|} \mathbf{1}_{\hat{y}_i^t=c} > N_0,\, c \in \{ 0, ..., M-1 \} \right \}
$$
where $N_0$ is a constant.
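
The two constraints can be sketched in plain Python as follows. The helper names, the toy data and the chosen values of `d0` and `n0` are hypothetical; the sketch only mirrors the two criteria above.

```python
import math
from collections import Counter

def dissimilarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.5 * (1.0 - dot / norms)

def filter_targets(features, pseudo_labels, centers, d0, n0):
    """Return (kept_indexes, kept_classes) after the D_0 and N_0 constraints."""
    # D_0 constraint: keep samples close enough to their assigned center
    close = [
        i for i, (x, y) in enumerate(zip(features, pseudo_labels))
        if dissimilarity(x, centers[y]) < d0
    ]
    # N_0 constraint: keep classes with more than n0 surviving samples
    counts = Counter(pseudo_labels[i] for i in close)
    kept_classes = {c for c, n in counts.items() if n > n0}
    kept_indexes = [i for i in close if pseudo_labels[i] in kept_classes]
    return kept_indexes, kept_classes

centers = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.1], [1.0, 0.2], [1.0, 1.0], [0.1, 1.0]]
pseudo_labels = [0, 0, 0, 1]
kept_indexes, kept_classes = filter_targets(features, pseudo_labels, centers, d0=0.05, n0=1)
print(kept_indexes, kept_classes)  # [0, 1] {0}
```

Sample 2 is dropped because it is too far from its center, and class 1 is dropped because only one of its samples survives the first constraint.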

In [None]:
def cosine_dissimilarity(x, x_prime):
    # return 0.5 * (1 - cos_sim) as described in CAN paper
    return torch.mul(0.5, torch.sub(1, F.cosine_similarity(x, x_prime, dim=0)))


def filter_classes(phi_batch, labels, centroids, device, D_0=DEFAULT_D_0, N_0=DEFAULT_N_0):
    """
    Filter out the samples that do not respect the constraints.

    Params:
    -----
    phi_batch:
        List or tensor of activations.
    labels:
        List or tensor of estimated labels of phi_batch.
    centroids:
        Computed target centroids
    D_0:
        Maximum distance from centroid
    N_0:
        Minimum number of samples for a class to be used in computing CDD.
    """

    # keeps track of how many samples of each class respect the D_0 constraint
    class_counters = [0] * 20

    # init intermediate and filtered phi and labels
    intermediate_labels = []
    filtered_labels = []
    intermediate_idx = []
    filtered_idx = []
    intermediate_phi = None
    intermediate_phi_list = []
    filtered_phi = None
    filtered_phi_list = []

    # apply D_0 constraint----------------------------------------
    for i, (phi, label) in enumerate(zip(phi_batch, labels)):
        # centroid of estimated label
        est_centroid = centroids[label]
        # compute distance between phi and its centroid
        dist = cosine_dissimilarity(phi, est_centroid)
        # check if distance respects constraint
        if dist < D_0:
            phi = phi.unsqueeze(0)
            # increase the counter of the respective class
            class_counters[label] += 1
            intermediate_phi_list.append(phi)
            intermediate_labels.append(label)
            intermediate_idx.append(i)

    if len(intermediate_phi_list):
        intermediate_phi = torch.cat(intermediate_phi_list, dim=0)

        # apply N_0 constraint-----------------------------------------
        for phi, label, idx in zip(intermediate_phi, intermediate_labels, intermediate_idx):
            # check if the number of samples of this class respects the N_0 constraint
            if class_counters[label] > N_0:
                phi = phi.unsqueeze(0)
                filtered_phi_list.append(phi)
                filtered_labels.append(label)
                filtered_idx.append(idx)

        if len(filtered_phi_list):
            filtered_phi = torch.cat(filtered_phi_list)
        else:
            filtered_phi = torch.tensor([])

    # set of classes that respects constraints
    legal_classes = set(l.item() for l in filtered_labels)

    # return the filtered targets
    return filtered_phi, torch.tensor(filtered_labels), filtered_idx, legal_classes

Class `FilteredDataset` is used to filter the dataset given a list of indexes and to match each filtered example with its estimated target label.

In [None]:
class FilteredDataset(Dataset):
    # mirrors torch.utils.data.Subset
    def __init__(self, dataset, indexes, est_targets):
        """
        Params:
        -----
        dataset:
            Unfiltered dataset.
        indexes:
            List of indexes to keep track of.
        est_targets:
            estimated targets of filtered dataset (len(indexes) = len(est_targets)).
        """
        # Compute subset of dataset
        self.subset_ds = torch.utils.data.Subset(dataset, indexes)
        self.labels = est_targets

    def __len__(self):
        return len(self.subset_ds)

    def __getitem__(self, idx):
        return self.subset_ds[idx][0], self.labels[idx].item()


### Class Aware Sampling
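A mini-batch of data is usually sampled in a class-agnostic manner. However, this is less efficient for computing the CDD: for instance, a class may appear in the mini-batch for only one of the two domains, in which case its intra-class discrepancy cannot be estimated. The authors then propose a class-aware sampling to "enable the efficient update of network with CDD".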
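
The idea can be sketched in plain Python: pick a subset of classes, then draw the same number of samples per class from both domains. This is our simplified reading of the approach, with hypothetical names and toy labels, and it is not the sampler used in this notebook.

```python
import random
from collections import defaultdict

def class_aware_indices(labels, classes, samples_per_class, rng):
    """Return dataset indexes with exactly samples_per_class items per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    batch = []
    for c in classes:
        # raises ValueError if class c has fewer than samples_per_class samples
        batch.extend(rng.sample(by_class[c], samples_per_class))
    return batch

rng = random.Random(0)
source_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
target_pseudo_labels = [2, 2, 0, 0, 1, 1, 1, 0, 2]

classes = rng.sample([0, 1, 2], 2)  # random subset of the available classes
src_batch = class_aware_indices(source_labels, classes, 2, rng)
tgt_batch = class_aware_indices(target_pseudo_labels, classes, 2, rng)
```

Both batches contain the same classes, so every selected class contributes to the intra-class term of the CDD.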

In [None]:
class BalancedBatchSampler(BatchSampler):
    """
    Sampler to get the same number of samples from different classes. 
    Adapted from https://discuss.pytorch.org/t/load-the-same-number-of-data-per-class/65198/3
    """
    def __init__(self, dataset, n_classes, n_samples):
        start = time.time()
        loader = DataLoader(dataset)
        self.labels_list = []
        for _, label in loader:
            self.labels_list.append(label)
        del loader
        self.labels = torch.LongTensor(self.labels_list)
        self.labels_set = list(set(self.labels.numpy()))
        self.label_to_indices = {label: np.where(self.labels.numpy() == label)[0]
                                 for label in self.labels_set}
        for l in self.labels_set:
            np.random.shuffle(self.label_to_indices[l])
        self.used_label_indices_count = {label: 0 for label in self.labels_set}
        self.count = 0
        self.n_classes = n_classes
        self.n_samples = n_samples
        self.dataset = dataset
        self.batch_size = self.n_samples * self.n_classes
        

    def __iter__(self):
        start = time.time()
        self.count = 0
        while self.count + self.batch_size < len(self.dataset):
            classes = np.random.choice(self.labels_set, self.n_classes, replace=False)
            indices = []
            for class_ in classes:
                indices.extend(self.label_to_indices[class_][
                               self.used_label_indices_count[class_]:self.used_label_indices_count[
                                   class_] + self.n_samples])
                self.used_label_indices_count[class_] += self.n_samples
                if self.used_label_indices_count[class_] + self.n_samples > len(self.label_to_indices[class_]):
                    np.random.shuffle(self.label_to_indices[class_])
                    self.used_label_indices_count[class_] = 0
            yield indices
            self.count += self.n_classes * self.n_samples

    def __len__(self):
        return len(self.dataset) // self.batch_size


### StaticDataset
A major issue we encountered during this project was the huge amount of time required to run CAN ($\approx 8$ minutes per epoch). A first mitigation was the use of masks, which are employed in both the `e` (CDD) and `compute_centroids` (Clustering) functions. The aim is to exploit the low-level code provided by PyTorch to replace for-loops that would otherwise take a long time to run. Although masks helped, time was still a major issue. After many attempts, we found out that I/O operations (i.e. reading images from the disk) were the bottleneck. To solve this, we decided to keep the whole dataset in RAM. `StaticDataset`, in fact, plays a key role in reducing the time-per-epoch required by CAN: when Google Colab provides full performance, the time-per-epoch is $\approx 45$ seconds.

This approach, however, comes with two drawbacks:
1. A lot of RAM is used. Thus, if a cell is interrupted and the memory is not de-allocated, the runtime may need to be factory-reset.
2. DataLoaders with StaticDataset are created only once, but this takes $\approx 100$ seconds. It is not a major issue but it must be taken into consideration.
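
A back-of-the-envelope estimate gives an idea of the memory cost of the first drawback. The sketch below assumes that images are stored as `float32` tensors of size $3 \times 224 \times 224$; the resolution and the data type are assumptions of this example, while the number of images follows from the dataset description (20 classes, 100 samples each, 80% used for training).

```python
def tensor_dataset_bytes(num_images, channels=3, height=224, width=224, bytes_per_value=4):
    """Memory needed to hold num_images dense image tensors, in bytes."""
    return num_images * channels * height * width * bytes_per_value

train_images = 1600  # 80% of 20 classes x 100 samples, for one domain
gib = tensor_dataset_bytes(train_images) / 2**30
print(f"{gib:.2f} GiB")  # 0.90 GiB
```

Under these assumptions, each training set takes a bit less than 1 GiB, to which the test sets and the other domain must be added.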

In [None]:
class StaticDataset(Dataset):
    """
    Custom dataset used to reduce the time required for training. Store the whole dataset in RAM.
    """
    def __init__(self, dataset):
        """
        Load the whole set (train or test) and store all of it in RAM. Used in order to avoid I/O
        operations at each iteration.

        Params:
        -----
        dataset: 
            train OR test set already transformed (ImageFolder + Transformations + split).
        """
        super().__init__()
        # create a DataLoader to iterate through the dataset
        dl = DataLoader(dataset, shuffle=False, num_workers=DEFAULT_NUM_WORKERS)
        inputs = []
        targets = []
        for x, y in dl:
            inputs.append(x)
            targets.append(y)
        del dl
        
        # create two tensors, one for inputs and one for labels
        self.inputs = torch.cat(inputs)
        self.targets = torch.cat(targets)

    def __len__(self):
        return len(self.inputs)
    
    def __getitem__(self, index):
        return self.inputs[index], self.targets[index]

### Training and Test step CAN
The training step first forwards a source-domain batch to compute the cross-entropy loss. It then draws samples from the class-aware dataloaders, through `get_samples`, in order to obtain their activations and compute the CDD loss.
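
Since the class-aware dataloaders can be shorter than the source dataloader, their iterators must be restarted when exhausted. The pattern is shown below on a plain list; the name `next_or_restart` is ours and the list of strings stands in for a dataloader.

```python
def next_or_restart(iterable, iterator):
    """Return (iterator, item), creating a fresh iterator when the old one is exhausted."""
    try:
        item = next(iterator)
    except StopIteration:
        iterator = iter(iterable)
        item = next(iterator)
    return iterator, item

batches = ["b0", "b1"]  # stand-in for a short dataloader
it = iter(batches)
seen = []
for _ in range(5):
    it, batch = next_or_restart(batches, it)
    seen.append(batch)
print(seen)  # ['b0', 'b1', 'b0', 'b1', 'b0']
```

Returning the iterator together with the item lets the caller keep using the restarted iterator in the following steps.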

In [None]:
def get_samples(dataloader, dl_iter):
    try:
        samples, labels = next(dl_iter)
    except StopIteration:
        dl_iter = iter(dataloader)
        samples, labels = next(dl_iter)
    return dl_iter, samples, labels


def training_step_can(model, source_dl, source_cla_dl, target_cla_dl, optimizer, classes, device, cdd_weight=BETA):
    cumulative_cdd_loss = 0.
    cumulative_ce_loss = 0.
    cumulative_loss = 0.
    cumulative_accuracy = 0.

    samples_count = 0
    cdd_samples_count = 0
    total_samples_count = 0

    model.train()

    # init class aware iterators for CDD loss
    if target_cla_dl is not None:
        # source_cla_dl and target_cla_dl should have the same number of classes
        source_cla_iter = iter(source_cla_dl)
        target_cla_iter = iter(target_cla_dl)

    for batch_idx, (inputs, labels) in enumerate(tqdm_notebook(source_dl,  desc="Training step", leave=False)):
        # Compute crossentropy
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        ce_loss = torch.nn.CrossEntropyLoss()(outputs, labels)
        ce_loss.backward()

        cdd_loss = 0

        # if target_cla_dl is not None, some classes survived the filtering
        if target_cla_dl is not None:            
            source_cla_iter, inputs_source, labels_source = get_samples(source_cla_dl, source_cla_iter)
            target_cla_iter, inputs_target, pseudolabels_target = get_samples(target_cla_dl, target_cla_iter)

            # move to device
            inputs_source, labels_source = inputs_source.to(device), labels_source.to(device)
            inputs_target, pseudolabels_target = inputs_target.to(device), pseudolabels_target.to(device)

            outputs_source = model(inputs_source)
            # list of activations
            #   - global average pool
            #   - output layer
            source_phis = [activation['phi_1'].squeeze(), outputs_source]

            outputs_target = model(inputs_target)
            target_phis = [activation['phi_1'].squeeze(), outputs_target]

            cdd_loss = CDD_loss(source_phis, target_phis, labels_source, pseudolabels_target, classes, cdd_weight) # already weighted by cdd_weight
            
            # backward pass for CDD Loss
            cdd_loss.backward()
            cdd_samples_count += inputs_source.shape[0]

        samples_count += inputs.shape[0]
        # both counters are already cumulative, so assign instead of accumulating
        total_samples_count = samples_count + cdd_samples_count

        loss = ce_loss + cdd_loss # already weighted

        # update parameters
        optimizer.step()

        # reset the optimizer
        optimizer.zero_grad()

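        # accumulate plain Python numbers so that the metrics do not hold on to tensors
        cumulative_ce_loss += ce_loss.item()
        cumulative_cdd_loss += cdd_loss.item() if torch.is_tensor(cdd_loss) else cdd_loss
        cumulative_loss += loss.item()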

        _, predicted = outputs.max(1)
        cumulative_accuracy += predicted.eq(labels).sum().item()

    ce = cumulative_ce_loss / samples_count
    cdd = cumulative_cdd_loss / cdd_samples_count if cdd_samples_count > 0 else 0
    loss = cumulative_loss / total_samples_count
    acc = cumulative_accuracy / samples_count

    metrics = {
        "train/train_ce": ce,
        "train/train_cdd": cdd,
        "train/train_loss": loss,
        "train/train_acc": acc
    }

    return metrics


def test_step_can(model, data_loader, cost_function, device, epoch=0):
    n_samples = 0
    cumulative_loss = 0.
    cumulative_accuracy = 0.

    model.eval()

    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(tqdm_notebook(data_loader, desc="Test Step", leave=False)):
            inputs = inputs.to(device)
            targets = targets.to(device)

            outputs = model(inputs)

            loss = cost_function(outputs, targets)

            # cumulative loss
            n_samples += inputs.shape[0]  # add batch size
            cumulative_loss += loss.item()
            max_prob, predicted = outputs.max(dim=1)  # return predicted labels

            # cumulative accuracy
            cumulative_accuracy += predicted.eq(targets).sum().item()
    # average loss and accuracy over the whole test set
    metrics = {"test/test_loss": cumulative_loss/n_samples,
               "test/test_acc": cumulative_accuracy/n_samples}

    return metrics

### Training Loop CAN

In [None]:
def create_class_aware_dataloader(dataset, classes):
    """
    Return a dataloader, with batch_sampler=BalancedBatchSampler, on the :dataset: with the selected :classes:.
    """
    # compute number of samples per class
    n_labels = torch.zeros(len(classes))

    # mapping from class indexes to actual indexes of the list
    # e.g. classes = {3, 5, 6}
    # map_ = {
    #   3: 0,
    #   5: 1,
    #   6: 2
    # }
    map_ = {cls: idx for idx, cls in enumerate(classes)}

    # count the number of occurrences for each class in the filtered dataset
    for _, target in dataset:
        if torch.is_tensor(target):
            target = target.item()
        n_labels[map_[target]] += 1

    # get class with minimum number of samples (cast to int, since it is used as a sample count)
    min_samples = int(torch.min(n_labels).item())

    # update samples per class
    # In absence of guidance, we selected 3 samples_per_class as upper bound.
    # This is motivated by the fact that 3 * 20 (max number of classes) = 60
    # which is a reasonable dimension. Moreover, it does not overload too much
    # the computation.
    samples_per_class = min(3, min_samples)
    sampler = BalancedBatchSampler(dataset, len(classes), samples_per_class)
    return DataLoader(dataset, batch_sampler=sampler, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)

In [None]:
def save_weights(epoch, model, optimizer, loss, path, scheduler=None):
    save_dict = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler': scheduler.state_dict() if scheduler is not None else None,
        'loss': loss  # test loss of the saved epoch, kept for reference
    }
    torch.save(save_dict, path)


def load_weights(model, optimizer, weights_path, device, scheduler=None):
    checkpoint = torch.load(weights_path, map_location=device)
    epoch = checkpoint['epoch']
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
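    # restore the scheduler only if one is in use and one was saved in the checkpoint
    if scheduler is not None and checkpoint['scheduler'] is not None:
        scheduler.load_state_dict(checkpoint['scheduler'])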

    return epoch, model, optimizer, scheduler

Here comes the training loop. The first two lines initialize the network and register a forward hook, which is used to grab the activations of an intermediate layer of the network.

Next, a learning rate scheduler is retrieved when `sgd` is used as optimizer. The optimizer is configurable mainly for testing purposes: Adam converges much faster, which helped us reduce development time.

The `StaticDataset`s are then initialized. Only 3 of the 4 datasets are taken into consideration, because the source test set is not used during training.

The `DataLoader`s used for clustering (the input of the CDD computation) are not shuffled, since shuffling is not required there. Moreover, keeping the original order makes the code much simpler, because the positions of the estimated labels match the dataset indexes.

Then comes the CAN algorithm:
1. Estimation of the target labels;
2. Class filtering;
3. Computation of the source subset based on the filtered classes;
4. Creation of the `FilteredDataset`, which filters the target dataset and assigns the estimated labels to it;
5. Creation of the Class-Aware DataLoaders;
6. Training step;
7. Repeat from step 1 at the next epoch.

In [None]:
def training_loop_can(source_tr_ds, target_ts_ds, target_tr_ds, device, wandb_run, weights=None, save=False):
    # instantiate model
    model = ResNet18(NUM_CLASSES).to(device)
    model.backbone.avgpool.register_forward_hook(get_activation('phi_1'))  # add hook to extract activations
    
    epochs = wandb.config['epochs']

    optimizer = get_optimizer(model, lr=wandb.config["lr"], optim=wandb.config["optimizer"])
    scheduler = None
    if (wandb.config["optimizer"] == "sgd"):
        scheduler = get_lr_scheduler(optimizer, epochs)

    experiment = wandb_run.name

    # used to resume epoch after load weights
    last_epoch = 0
    if weights is not None:
        last_epoch, model, optimizer, scheduler = load_weights(model, optimizer, weights, device, scheduler)
        print(f"Weights restored from epoch {last_epoch}")
    
    print(experiment)

    # static datasets
    source_tr_static = StaticDataset(source_tr_ds)
    target_ts_static = StaticDataset(target_ts_ds)
    target_tr_static = StaticDataset(target_tr_ds)
    
    # for ce
    source_tr_dl = DataLoader(source_tr_static, BATCH_SIZE, shuffle=True, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)

    # for evaluation on the target test set
    target_ts_dl = DataLoader(target_ts_static, BATCH_SIZE, shuffle=False, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)
    # for clustering (target label estimation): unshuffled, so that sample order matches dataset indexes
    tr_dl_source_unshuff = DataLoader(source_tr_static, BATCH_SIZE, shuffle=False, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)
    tr_dl_target_unshuff = DataLoader(target_tr_static, BATCH_SIZE, shuffle=False, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)

    print("Start Training")
    for e in tqdm_notebook(range(last_epoch, epochs), desc="Training loop"):
        # Clustering
        target_est_labels, phi_target, centroids, clustering_acc, source_labels = estimate_target_labels(model, tr_dl_source_unshuff, tr_dl_target_unshuff, clustering_report=True)

        # Filtering classes
        _, target_pseudo_labels, target_idxs, classes = filter_classes(phi_target, target_est_labels, torch.tensor(centroids).to(device), device, D_0=wandb.config["D_0"], N_0=wandb.config["N_0"])

        # iterate through source labels, keeping only the indexes whose label is in the classes set
        source_idxs = [i for i, y in enumerate(source_labels) if y in classes]

        # Define Filtered Datasets
        source_filt_ds = Subset(source_tr_static, source_idxs)
        print(f"Source filtered dataset length: {len(source_filt_ds)}")

        target_filt_ds = FilteredDataset(target_tr_static, target_idxs, target_pseudo_labels)
        print(f"Target filtered dataset length: {len(target_filt_ds)}")

        if len(classes) < 2:
            print("Fewer than 2 classes after filtering: CDD loss is skipped for this epoch")
            source_cla_dl = None
            target_cla_dl = None
        else:
            # Define C-A Dataloader
            source_cla_dl = create_class_aware_dataloader(source_filt_ds, classes)
            target_cla_dl = create_class_aware_dataloader(target_filt_ds, classes)

        train_metrics = training_step_can(model, source_tr_dl, source_cla_dl, target_cla_dl, optimizer, classes, device)

        test_metrics = test_step_can(model, target_ts_dl, torch.nn.CrossEntropyLoss(), device)
        
        if scheduler is not None:
            scheduler.step()

        # Log to WandB
        metrics = {**train_metrics,
                   'train/filtered_classes': len(classes),
                   'train/clustering_acc': clustering_acc,
                   **test_metrics}
        wandb.log(metrics)

        train_loss = train_metrics['train/train_loss']
        train_ce = train_metrics['train/train_ce']
        train_cdd = train_metrics['train/train_cdd']
        train_acc = train_metrics['train/train_acc']

        test_loss = test_metrics['test/test_loss']
        test_acc = test_metrics['test/test_acc']

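        new_best = False
        # the first evaluated epoch (also when resuming from a checkpoint) always sets the best values
        if e == last_epoch or best_acc < test_acc: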
            best_acc = test_acc
            best_loss = test_loss
            best_model = copy.deepcopy(model)
            new_best = True

            # Save new best weights
            if save:
                save_weights(e, model, optimizer, test_loss, f'/content/drive/MyDrive/weights/ResNet18CAN_{experiment}.pth', scheduler)
                artifact = wandb.Artifact(f'ResNet18CAN_{experiment}', type='model', metadata={**wandb_run.config, **metrics})
                artifact.add_file(f'/content/drive/MyDrive/weights/ResNet18CAN_{experiment}.pth')
                wandb_run.log_artifact(artifact)

        print('\n Epoch: {:d}'.format(e + 1))
        print('\t Training loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_acc))
        print('\t Training CE {:.5f}, Training CDD {:.5f}'.format(train_ce, train_cdd))
        print('\t Training filtered classes {}'.format(len(classes)))
        print('\t Test loss {:.5f}, Test accuracy {:.2f}'.format(test_loss, test_acc))
        if new_best:
            print("\t New best weights saved")
        print('-----------------------------------------------------')
    
    visualize(best_model, target_ts_dl, wandb_run)
    wandb.summary["test_best_loss"] = best_loss
    wandb.summary["test_best_accuracy"] = best_acc

    wandb.finish()

    del source_tr_static
    del target_ts_static
    del target_tr_static

    print('\t BEST Test loss {:.5f}, Test accuracy {:.2f}'.format(best_loss, best_acc))

    return best_model

### Experiments



Train model on Products, test on Real World

In [None]:
wandb_run = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name="P_to_R_can",
    config={
        "model": "ResNet18CAN",
        "trained-on": "Source + Target unsupervised",
        "epochs": 60,
        "batch_size": BATCH_SIZE,
        "optimizer": "sgd",
        "lr": 1e-2,
        "D_0": 0.075,
        "N_0": DEFAULT_N_0
    }
)

best_model = training_loop_can(train_products, test_real, train_real, device, wandb_run, save=False)

**Best test accuracy $P \rightarrow R:$ 0.79**

| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| backpack | 0.53 | 0.94 | 0.68 | 17 |
| bookcase | 0.78 | 0.70 | 0.74 | 20 |
| car jack | 0.64 | 0.93 | 0.76 | 15 |
| comb | 0.77 | 0.77 | 0.77 | 22 |
| crown | 0.76 | 0.90 | 0.83 | 21 |
| file cabinet | 0.59 | 0.73 | 0.65 | 22 |
| flat iron | 0.82 | 0.88 | 0.85 | 16 |
| game controller | 0.94 | 0.71 | 0.81 | 21 |
| glasses | 1.00 | 0.68 | 0.81 | 19 |
| helicopter | 1.00 | 0.74 | 0.85 | 19 |
| ice skates | 0.69 | 0.86 | 0.77 | 21 |
| letter tray | 0.73 | 0.70 | 0.72 | 27 |
| monitor | 0.79 | 0.75 | 0.77 | 20 |
| mug | 1.00 | 0.92 | 0.96 | 24 |
| network switch | 0.91 | 0.59 | 0.71 | 17 |
| over-ear headphones | 0.94 | 1.00 | 0.97 | 17 |
| pen | 1.00 | 0.76 | 0.87 | 17 |
| purse | 0.67 | 0.61 | 0.64 | 23 |
| stand mixer | 0.91 | 0.95 | 0.93 | 21 |
| stroller | 0.89 | 0.76 | 0.82 | 21 |
| | | | | |
| accuracy | | | 0.79 | 400 |
| macro avg | 0.82 | 0.79 | 0.79 | 400 |
| weighted avg | 0.82 | 0.79 | 0.79 | 400 |

In [None]:
wandb_run = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name="R_to_P_can",
    config={
        "model": "ResNet18CAN",
        "trained-on": "Source + Target unsupervised",
        "epochs": 60,
        "batch_size": BATCH_SIZE,
        "optimizer": "sgd",
        "lr": 1e-2,
        "D_0": 0.075,
        "N_0": DEFAULT_N_0
    }
)

best_model = training_loop_can(train_real, test_products, train_products, device, wandb_run, save=False)

**Best test accuracy $R \rightarrow P:$ 0.94**

|      class         |precision  |  recall  |f1-score  | support |
|---|---|---|---|---|
|           backpack |      0.97 |     0.97 |     0.97 |       29|
|           bookcase |      0.83 |     0.90 |     0.86 |       21|
|           car jack |      0.94 |     0.94 |     0.94 |       17|
|               comb |      0.95 |     0.95 |     0.95 |       19|
|              crown |      1.00 |     1.00 |     1.00 |       20|
|       file cabinet |      0.88 |     0.78 |     0.82 |       18|
|          flat iron |      0.94 |     0.94 |     0.94 |       16|
|    game controller |      0.95 |     0.88 |     0.91 |       24|
|            glasses |      0.95 |     1.00 |     0.97 |       19|
|         helicopter |      0.94 |     1.00 |     0.97 |       17|
|         ice skates |      1.00 |     0.95 |     0.97 |       19|
|        letter tray |      0.83 |     0.94 |     0.88 |       16|
|            monitor |      0.95 |     0.90 |     0.92 |       20|
|                mug |      0.94 |     1.00 |     0.97 |       17|
|     network switch |      0.92 |     0.96 |     0.94 |       24|
|over-ear headphones |      0.83 |     1.00 |     0.91 |       15|
|                pen |      0.92 |     0.83 |     0.87 |       29|
|              purse |      0.95 |     0.86 |     0.90 |       21|
|        stand mixer |      1.00 |     1.00 |     1.00 |       19|
|           stroller |      1.00 |     1.00 |     1.00 |       20|
| | | | | |
|           accuracy |           |          |     0.94 |      400|
|          macro avg |      0.93 |     0.94 |     0.94 |      400|
|       weighted avg |      0.94 |     0.94 |     0.93 |      400|

### Observations
We can observe that the effect of unsupervised domain adaptation through the Contrastive Adaptation Network is noticeable: overall accuracy increases in both directions. 
We attribute this to the CDD loss, which allows better domain alignment compared to the baseline approach. We may also notice that the CDD behaves like a sort of regularizer, reducing overfitting. As we can see from the plots below, even when training accuracy reaches 1, the loss (cross-entropy + $\beta$ CDD) keeps decreasing and test accuracy keeps increasing. As the authors suggest, "maximizing the inter-class domain discrepancy may alleviate the possibility of the model overfitting to the source data and benefits the adaptation". Therefore, we expect that longer training could yield slightly higher accuracy. 

<img src="https://drive.google.com/uc?export=view&id=1FaOkrprdEX_Y4sfSiqcL3TPFLuzBuxBF" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1qU2XQ7WQxnxAHBlHXST9s9y4Hu81neXi" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1lQGxxxNjfg0ZppBD94W1hYfgCiMYgdPW" width="500px">

<img src="https://drive.google.com/uc?export=view&id=1w4joWo8oZx-3WssXIt7ZygKs7jMydtwX" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1v89OtDB2aYqvhf_s41E34XOn4TqZfc6l" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1OAOSlx4PJvZLD1G4hzj0UQZHCVK9Rzjl" width="500px">

Moreover, we can observe that the clustering accuracy, which is the accuracy of the estimated target pseudo-labels, slightly increases throughout the training thanks to the CDD. Feature representations of samples of the same class become more compact due to the intra-class domain discrepancy minimization, while those of different classes move further apart thanks to the inter-class domain discrepancy maximization.
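
For reference, the objective discussed in this section can be written compactly. The notation is a sketch based on our reading of the CAN paper, not a transcription of our code: $M$ (the number of classes kept after filtering) and $\hat{D}^{c c'}$ (the estimated discrepancy between source samples of class $c$ and target samples pseudo-labelled as class $c'$) are symbols introduced here for illustration.

$$
\mathcal{L} = \mathcal{L}_{CE} + \beta \, \hat{D}^{cdd},
\qquad
\hat{D}^{cdd} =
\underbrace{\frac{1}{M}\sum_{c=1}^{M} \hat{D}^{cc}}_{\text{intra-class}}
\; - \;
\underbrace{\frac{1}{M(M-1)}\sum_{c=1}^{M}\sum_{c' \neq c} \hat{D}^{cc'}}_{\text{inter-class}}
$$

Minimizing $\hat{D}^{cdd}$ therefore shrinks the intra-class term and enlarges the inter-class term at the same time.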

In the Products to Real World direction, the number of classes remaining after the filtering step behaves as expected: it is low at the beginning of the training and increases as the CDD loss is minimized. In order to obtain this behaviour we needed to slightly raise the $D_0$ hyper-parameter to $0.075$, instead of the $0.05$ used by the authors. The default $D_0$ value did not allow enough samples to be kept, leaving zero classes after filtering and making the computation of the CDD loss impossible. $N_0$ was left at $3$. 
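
To make the role of $D_0$ and $N_0$ concrete, here is a minimal pure-Python sketch of the filtering idea. It is a hypothetical stand-in, not the notebook's `filter_classes`: the names `cosine_dissimilarity` and `filter_sketch` are invented for illustration, and the distance $\frac{1}{2}(1 - \cos)$ is assumed from our reading of the CAN paper.

```python
import math


def cosine_dissimilarity(u, v):
    """Return 0.5 * (1 - cos(u, v)), a value in [0, 1] (assumed distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 - dot / (norm_u * norm_v))


def filter_sketch(features, est_labels, centroids, d_0=0.05, n_0=3):
    """Hypothetical illustration of sample and class filtering."""
    # 1) keep only the samples close enough to the centroid of their cluster
    kept = [i for i, (f, c) in enumerate(zip(features, est_labels))
            if cosine_dissimilarity(f, centroids[c]) < d_0]
    # 2) count the surviving samples of each class
    counts = {}
    for i in kept:
        counts[est_labels[i]] = counts.get(est_labels[i], 0) + 1
    # 3) keep only the classes with at least n_0 surviving samples
    classes = {c for c, n in counts.items() if n >= n_0}
    kept = [i for i in kept if est_labels[i] in classes]
    return kept, classes


# Toy 2D example: class 0 keeps three close samples, class 1 only two
centroids = {0: [1.0, 0.0], 1: [0.0, 1.0]}
features = [[1.0, 0.05], [0.9, 0.1], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.1, 1.0]]
est_labels = [0, 0, 0, 0, 1, 1]
kept, classes = filter_sketch(features, est_labels, centroids, d_0=0.075, n_0=3)
```

In this toy example the ambiguous sample `[0.5, 0.5]` is discarded by the distance threshold, and class 1 is discarded because fewer than `n_0` of its samples survive. A larger `d_0` keeps more samples, which is why raising it helps when no class survives the filtering.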

In the opposite direction of the UDA task this behaviour is not observed, since the domains are already well aligned thanks to the pretraining of the backbone on the ILSVRC dataset. Therefore, from the beginning of the training, all 20 classes are used for the computation of the CDD.
<img src="https://drive.google.com/uc?export=view&id=1_JTeQxh0UTrFwCV8TC2FtuCpxy10fN6e" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1mBdRO-dlAQm6F7tZnnZePywsjqrTq8aQ" width="500px">

In terms of the **gain** obtained by applying CAN, we got the following results:

| Method | Acc P -> R (%) | Acc R -> P (%) |
|---|---|---|
| *Upper bound* |  *92.0*  |  *95.0*  |
| Baseline      |   74.5   |   92.5   |
| CAN           |   79.0   |   93.5   |
| **Gain CAN**  | **4.5** | **1.0** |
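
The gain row is simply the difference between the CAN and baseline accuracies. The short sketch below recomputes it from the values in the table. It also reports the fraction of the baseline-to-upper-bound gap that CAN recovers, a derived quantity (named `gap_closed` here) that we introduce only for illustration.

```python
# Accuracies (%) copied from the table above
results = {
    "P -> R": {"upper": 92.0, "baseline": 74.5, "can": 79.0},
    "R -> P": {"upper": 95.0, "baseline": 92.5, "can": 93.5},
}


def gain_summary(acc):
    """Return the absolute gain and the fraction of the gap to the upper bound that is closed."""
    gain = acc["can"] - acc["baseline"]
    gap = acc["upper"] - acc["baseline"]
    return {"gain": round(gain, 1), "gap_closed": round(gain / gap, 2)}


summary = {direction: gain_summary(acc) for direction, acc in results.items()}
for direction, values in summary.items():
    print(direction, values)
```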


These results were expected and are consistent with those presented in the Adaptiope article, keeping in mind that in this project we deal with 20 classes instead of 123. Moreover, we used a ResNet18 while, in the literature, CAN is usually implemented with a ResNet50 or ResNet101.
Based on the upper bounds we computed, we acknowledge that there is room for improvement in both directions of the UDA task, especially in the Products to Real World direction. As we can see, the confusion matrix and the t-SNE 2D visualization are still inaccurate and noisy. However, compared to the baseline, CAN shows higher intra-class compactness and a much larger inter-class margin.

<div><b>Products to Real World</b> confusion matrix and t-SNE scatter plot: <br>
<img src="https://drive.google.com/uc?export=view&id=1CZwSTv2EWenahP6O73o3X37HX4IZi3jw" height="400px">
<img src="https://drive.google.com/uc?export=view&id=1TejKrBi7LLcawX5UZQ8fAoTru7F1m0X1" height="300px">
</div>


Furthermore, we note that this vanilla implementation of CAN converges very slowly with the Stochastic Gradient Descent optimization algorithm, so many epochs are needed to get good results. Training with plain SGD plus a learning rate scheduler is probably not the best choice: an adaptive optimization algorithm like Adam may help CAN converge faster, and possibly to a better local optimum.
 

## Improving CAN
The vanilla implementation of CAN is very slow to converge. Moreover, it uses 512 features (the output of the ResNet18 backbone) for clustering, and K-Means is subject to the "curse of dimensionality". To be more precise, distance measures in general are subject to this issue.

In the following sections we propose our improvements, which aim to make CAN converge faster and be slightly more accurate.

> Note: Since the code of this section is very similar to the CAN one, much of the CAN code is reused here in order to keep the notebook compact. Thus, make sure to run each part of CAN (except for the experiments) in order to have everything needed.

### Better Clustering
As just mentioned, K-Means clustering is subject to the "curse of dimensionality". Generally speaking, when the dimensionality increases, the volume of the space grows so fast that the available data become sparse. Therefore, distance-based algorithms like K-Means tend to perform poorly.
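
The effect can be illustrated with a small, self-contained experiment on synthetic data. This is a sketch on random Gaussian points, not on our CNN features: the helper `relative_contrast` is a hypothetical name, and it measures how different the farthest and the nearest neighbours of a query point are.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min Euclidean distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.gauss(0, 1) for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.gauss(0, 1) for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


contrast_low = relative_contrast(dim=2)     # few features
contrast_high = relative_contrast(dim=512)  # as many features as the ResNet18 output
print(f"relative contrast in   2 dimensions: {contrast_low:.2f}")
print(f"relative contrast in 512 dimensions: {contrast_high:.2f}")
```

In high dimension the relative contrast shrinks, meaning that the nearest and the farthest points look almost equally distant. This is the situation in which a distance-based assignment to centroids becomes less reliable.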

In order to attenuate the negative effects of the large number of features, we decided to adopt a Kernel PCA to substantially reduce it. Basically, we fit the Kernel PCA on source and target features together, and we use the resulting lower-dimensional features (80 components in our code) to run the Spherical K-Means. We also use the reduced features to compute the CDD loss, as cosine distance is affected by the "curse of dimensionality" too. Additionally, by applying feature reduction, we try to compact the feature representation in order to reduce noise in the data.

The choice of the [Kernel PCA](https://www.face-rec.org/algorithms/Kernel/kernelPCA_scholkopf.pdf) feature reduction algorithm, instead of linear PCA, is motivated by the fact that feature reduction through non-linear transformations is better suited to data that are not linearly separable, like the features obtained by the CNN.

In [None]:
def compute_centroids_improvements(model, dataloader):
    ''' 
    Forward dataloader through model, extract phi_1 (last pooling layer features) 
    and compute the L2-normalized feature mean per class.
    Returns (centroids, labels_source, features), where centroids is a torch.Tensor
    with shape (num_classes, num_features), labels_source is the list of source labels
    and features is a torch.Tensor with shape (num_samples, num_features)
    '''
    centroids = 0
    samples_count = 0

    # unsqueeze to have size (num_classes, 1) instead of simply (num_classes) in order to exploit broadcasting later
    references = torch.tensor(range(model.num_classes), device=device).unsqueeze(1)
    
    labels_source = []
    features_source = []

    model.eval()
    with torch.no_grad():
        for _, (inputs, labels) in enumerate(dataloader):
            inputs = inputs.to(device)
            labels_source.extend(labels.tolist())
            labels = labels.to(device)  # tensor of size (batch_size)
            samples_count += labels.size(0)

            outputs = model(inputs)
            # get activations of the first task specific layer of size (batch_size, num_features)
            # flatten keeps the batch dimension even when the batch contains a single sample
            features = torch.flatten(activation["phi_1"], start_dim=1)
            # keep the features of every batch, so that the Kernel PCA can be fitted on the whole source set
            features_source.append(features)

            # resize labels tensor to (num_classes, batch_size) to use it later to generate the boolean mask
            labels = labels.unsqueeze(0).expand(model.num_classes, -1)

            # (labels == references) returns a tensor (num_classes, batch_size) thanks to broadcasting
            # Item [c][i] in the vector is true if sample i belongs to class c
            # By unsqueezing on last dimension mask becomes (num_classes, batch_size, 1)
            # this is needed to compute the mask on the features exploiting again broadcasting
            mask = (labels == references).unsqueeze(2)

            # feature * mask returns a tensor (num_classes, batch_size, num_feature)
            # where only rows on dim=1 for which the related sample label == class are not 0 but contain feature values
            # by summing on dim=1 we sum feature-wise all samples belonging to a class getting a (num_classes, num_features) tensor
            # then add the batch centroids to the centroid accumulator
            centroids += torch.sum(features*mask, dim=1)
    
    # dividing by a positive scalar does not change the direction of the centroids,
    # which are L2-normalized in the return statement
    centroids = torch.div(centroids, samples_count)
    features = torch.cat(features_source, dim=0)
    return torch.nn.functional.normalize(centroids, p=2, dim=1), labels_source, features

def estimate_target_labels_improvements(model, source_dl, target_dl, clustering_report=False):
    '''
    Returns estimated labels for target set in order
    '''
    # compute centroids, for each class, of the source domain
    centroids, labels_source, features = compute_centroids_improvements(model, source_dl)
    centroids_np = centroids.cpu().detach().numpy()

    phi_target = []
    labels_total = []

    # get target phi_1
    model.eval()
    with torch.no_grad():
        for batch_idx, (inputs, labels) in enumerate(target_dl):
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            batch_phi = torch.squeeze(activation["phi_1"])

            phi_target.append(batch_phi)
            labels_total.append(labels)

    phi_target = torch.cat(phi_target, dim=0)
    labels_total = torch.cat(labels_total, dim=0)

    phi_target = torch.squeeze(phi_target)
    phi_target_np = phi_target.cpu().detach().numpy()

    features_np = features.cpu().detach().numpy()

    # dim reduction
    kpca = KernelPCA(80, kernel='rbf')
    kpca.fit(np.concatenate((phi_target_np, features_np)))
    phi_target_np = kpca.transform(phi_target_np)
    centroids_np = kpca.transform(centroids_np)

    phi_target = torch.tensor(phi_target_np, device=device)

    # cluster target data given centroids
    kmeans = SphericalKMeans(n_clusters=20, init=centroids_np, n_init=1, random_state=0)
    kmeans.fit(phi_target_np)
    target_est_labels = kmeans.labels_

    # check clustering accuracy
    clustering_acc = accuracy_score(labels_total.to('cpu'), target_est_labels)
    if clustering_report:
        print(f"Clustering accuracy: {clustering_acc}")

    return target_est_labels, phi_target, kmeans.cluster_centers_, clustering_acc, labels_source, kpca

### Training Step Improvements
This code is the same as the CAN training step above, except that the Kernel PCA is applied to the extracted features in the central part of the for loop.

In [None]:
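def apply_kpca(batch_phi, kpca):
    # KernelPCA works on NumPy arrays: detach and flatten to (batch_size, num_features).
    # The NumPy round trip is not differentiable, so no gradient flows through the reduced features.
    batch_phi = torch.flatten(batch_phi.detach(), start_dim=1).cpu().numpy()
    batch_phi = kpca.transform(batch_phi)
    return torch.tensor(batch_phi).to(device)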

def training_step_improvements(model, source_dl, source_cla_dl, target_cla_dl, optimizer, classes, kpca, device, cdd_weight=BETA):
    cumulative_cdd_loss = 0.
    cumulative_ce_loss = 0.
    cumulative_loss = 0.
    cumulative_accuracy = 0.

    samples_count = 0
    cdd_samples_count = 0
    total_samples_count = 0

    model.train()

    # init class aware iterators for CDD loss
    if target_cla_dl is not None:
        # source_cla_dl and target_cla_dl should have the same number of classes
        source_cla_iter = iter(source_cla_dl)
        target_cla_iter = iter(target_cla_dl)

    for batch_idx, (inputs, labels) in enumerate(tqdm_notebook(source_dl,  desc="Training step", leave=False)):
        # Compute crossentropy
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        ce_loss = torch.nn.CrossEntropyLoss()(outputs, labels)
        ce_loss.backward()

        cdd_loss = 0

        # if target_cla_dl is not None, at least two classes remain after filtering
        if target_cla_dl is not None:            
            source_cla_iter, inputs_source, labels_source = get_samples(source_cla_dl, source_cla_iter)
            target_cla_iter, inputs_target, pseudolabels_target = get_samples(target_cla_dl, target_cla_iter)

            # move to device
            inputs_source, labels_source = inputs_source.to(device), labels_source.to(device)
            inputs_target, pseudolabels_target = inputs_target.to(device), pseudolabels_target.to(device)

            outputs_source = model(inputs_source)
            source_phis = apply_kpca(activation['phi_1'], kpca)
            
            outputs_target = model(inputs_target)
            target_phis = apply_kpca(activation['phi_1'], kpca)

            source_phis = [source_phis, outputs_source]
            target_phis = [target_phis, outputs_target]

            cdd_loss = CDD_loss(source_phis, target_phis, labels_source, pseudolabels_target, classes, cdd_weight) # already weighted by cdd_weight
            
            # backward pass for CDD Loss
            cdd_loss.backward()
            cdd_samples_count += inputs_source.shape[0]

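        samples_count += inputs.shape[0]
        # both counters are already cumulative, so assign instead of accumulating them again
        total_samples_count = samples_count + cdd_samples_count

        loss = ce_loss + cdd_loss # already weighted

        # update parameters
        optimizer.step()

        # reset the gradients
        optimizer.zero_grad()

        # accumulate plain floats so that no autograd graph is kept alive
        cumulative_ce_loss += float(ce_loss)
        cumulative_cdd_loss += float(cdd_loss)
        cumulative_loss += float(loss)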

        _, predicted = outputs.max(1)
        cumulative_accuracy += predicted.eq(labels).sum().item()

    ce = cumulative_ce_loss / samples_count
    cdd = cumulative_cdd_loss / cdd_samples_count if cdd_samples_count > 0 else 0
    loss = cumulative_loss / total_samples_count
    acc = cumulative_accuracy / samples_count

    metrics = {
        "train/train_ce": ce,
        "train/train_cdd": cdd,
        "train/train_loss": loss,
        "train/train_acc": acc
    }

    return metrics

### Training Loop with Improvements
Here too, the code is basically the same as in vanilla CAN, but `kpca` is returned by `estimate_target_labels_improvements` and passed to `training_step_improvements`.

In [None]:
def training_loop_improvements(source_tr_ds, target_ts_ds, target_tr_ds, device, wandb_run, weights=None, save=False):
    # instantiate model
    model = ResNet18(NUM_CLASSES).to(device)
    model.backbone.avgpool.register_forward_hook(get_activation('phi_1'))  # add hook to extract activations
    
    epochs = wandb.config['epochs']

    optimizer = get_optimizer(model, lr=wandb.config["lr"], optim=wandb.config["optimizer"])
    scheduler = None
    if (wandb.config["optimizer"] == "sgd"):
        scheduler = get_lr_scheduler(optimizer, epochs)

    experiment = wandb_run.name

    # used to resume epoch after load weights
    last_epoch = 0
    if weights is not None:
        last_epoch, model, optimizer, scheduler = load_weights(model, optimizer, weights, device, scheduler)
        print(f"Weights restored from epoch {last_epoch}")
    
    print(experiment)

    # static datasets
    source_tr_static = StaticDataset(source_tr_ds)
    target_ts_static = StaticDataset(target_ts_ds)
    target_tr_static = StaticDataset(target_tr_ds)
    
    # for ce
    source_tr_dl = DataLoader(source_tr_static, BATCH_SIZE, shuffle=True, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)

    # for evaluation on the target test set
    target_ts_dl = DataLoader(target_ts_static, BATCH_SIZE, shuffle=False, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)
    # for clustering (target label estimation): unshuffled, so that sample order matches dataset indexes
    tr_dl_source_unshuff = DataLoader(source_tr_static, BATCH_SIZE, shuffle=False, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)
    tr_dl_target_unshuff = DataLoader(target_tr_static, BATCH_SIZE, shuffle=False, num_workers=DEFAULT_NUM_WORKERS, pin_memory=True)

    print("Start Training")
    for e in tqdm_notebook(range(last_epoch, epochs), desc="Training loop"):
        # Clustering
        target_est_labels, phi_target, centroids, clustering_acc, source_labels, kpca = estimate_target_labels_improvements(model, tr_dl_source_unshuff, tr_dl_target_unshuff, clustering_report=True)

        # Filtering classes
        _, target_pseudo_labels, target_idxs, classes = filter_classes(phi_target, target_est_labels, torch.tensor(centroids).to(device), device, D_0=wandb.config["D_0"], N_0=wandb.config["N_0"])

        # iterate through source labels, keeping only the indexes whose label is in the classes set
        source_idxs = [i for i, y in enumerate(source_labels) if y in classes]

        # Define Filtered Datasets
        source_filt_ds = Subset(source_tr_static, source_idxs)
        print(f"Source filtered dataset length: {len(source_filt_ds)}")

        target_filt_ds = FilteredDataset(target_tr_static, target_idxs, target_pseudo_labels)
        print(f"Target filtered dataset length: {len(target_filt_ds)}")

        if len(classes) < 2:
            print("Fewer than 2 classes after filtering: CDD loss is skipped for this epoch")
            source_cla_dl = None
            target_cla_dl = None
        else:
            # Define C-A Dataloader
            source_cla_dl = create_class_aware_dataloader(source_filt_ds, classes)
            target_cla_dl = create_class_aware_dataloader(target_filt_ds, classes)

        train_metrics = training_step_improvements(model, source_tr_dl, source_cla_dl, target_cla_dl, optimizer, classes, kpca, device)

        test_metrics = test_step_can(model, target_ts_dl, torch.nn.CrossEntropyLoss(), device)
        
        if scheduler is not None:
            scheduler.step()

        # Log to WandB
        metrics = {**train_metrics,
                   'train/filtered_classes': len(classes),
                   'train/clustering_acc': clustering_acc,
                   **test_metrics}
        wandb.log(metrics)

        train_loss = train_metrics['train/train_loss']
        train_ce = train_metrics['train/train_ce']
        train_cdd = train_metrics['train/train_cdd']
        train_acc = train_metrics['train/train_acc']

        test_loss = test_metrics['test/test_loss']
        test_acc = test_metrics['test/test_acc']

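        new_best = False
        # the first evaluated epoch (also when resuming from a checkpoint) always sets the best values
        if e == last_epoch or best_acc < test_acc: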
            best_acc = test_acc
            best_loss = test_loss
            best_model = copy.deepcopy(model)
            new_best = True

            # Save new best weights
            if save:
                save_weights(e, model, optimizer, test_loss, f'/content/drive/MyDrive/weights/ResNet18CAN_{experiment}.pth', scheduler)
                artifact = wandb.Artifact(f'ResNet18CAN_{experiment}', type='model', metadata={**wandb_run.config, **metrics})
                artifact.add_file(f'/content/drive/MyDrive/weights/ResNet18CAN_{experiment}.pth')
                wandb_run.log_artifact(artifact)

        print('\n Epoch: {:d}'.format(e + 1))
        print('\t Training loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_acc))
        print('\t Training CE {:.5f}, Training CDD {:.5f}'.format(train_ce, train_cdd))
        print('\t Training filtered classes {}'.format(len(classes)))
        print('\t Test loss {:.5f}, Test accuracy {:.2f}'.format(test_loss, test_acc))
        if new_best:
            print("\t New best weights saved")
        print('-----------------------------------------------------')

    visualize(best_model, target_ts_dl, wandb_run)
    wandb.summary["test_best_loss"] = best_loss
    wandb.summary["test_best_accuracy"] = best_acc
    wandb.finish()

    del source_tr_static
    del target_ts_static
    del target_tr_static

    print('\t BEST Test loss {:.5f}, Test accuracy {:.2f}'.format(best_loss, best_acc))

    return best_model

### Experiments

Note that, since we use the Adam optimizer, we expect faster convergence; we therefore train the model for 30 epochs instead of the 60 used for vanilla CAN.

In [None]:
conf = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name="P_to_R_improvements",
    config={
        "model": "ResNet18Improvements",
        "trained-on": "Source + Target unsupervised",
        "epochs": 30,
        "batch_size": BATCH_SIZE,
        "optimizer": "adam",
        "lr": 1e-3,
        "D_0": 0.1,
        "N_0": DEFAULT_N_0
    }
)

best_model = training_loop_improvements(train_products, test_real, train_real, device, conf, save=False)

|    class | precision | recall| f1-score | support |
|---|---|---|---|---|
|            backpack    |   0.68 |     1.00 |     0.81 |       17|
|            bookcase    |   0.85 |     0.85 |     0.85 |       20|
|            car jack    |   0.83 |     1.00 |     0.91 |       15|
|                comb    |   0.84 |     0.73 |     0.78 |       22|
|               crown    |   0.91 |     1.00 |     0.95 |       21|
|        file cabinet    |   0.68 |     0.68 |     0.68 |       22|
|           flat iron    |   0.75 |     0.94 |     0.83 |       16|
|     game controller    |   0.84 |     0.76 |     0.80 |       21|
|             glasses    |   1.00 |     0.95 |     0.97 |       19|
|          helicopter    |   0.89 |     0.89 |     0.89 |       19|
|          ice skates    |   0.78 |     0.67 |     0.72 |       21|
|         letter tray    |   0.72 |     0.78 |     0.75 |       27|
|             monitor    |   0.95 |     0.95 |     0.95 |       20|
|                 mug    |   1.00 |     0.88 |     0.93 |       24|
|      network switch    |   0.88 |     0.82 |     0.85 |       17|
| over-ear headphones    |   0.94 |     0.94 |     0.94 |       17|
|                 pen    |   0.81 |     0.76 |     0.79 |       17|
|               purse    |   0.76 |     0.57 |     0.65 |       23|
|         stand mixer    |   0.88 |     1.00 |     0.93 |       21|
|            stroller    |   1.00 |     0.90 |     0.95 |       21|
| | | | | |
|            accuracy    |        |          |     0.84 |      400|
|           macro avg    |   0.85 |     0.85 |     0.85 |      400|
|        weighted avg    |   0.85 |     0.84 |     0.84 |      400|
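
As a reading aid for the report above, the sketch below recomputes one row from raw counts. The counts used for "backpack" (17 true positives, 8 false positives, 0 false negatives) were not logged: they are an assumption inferred from the rounded precision, recall and support in the table, and `prf1` is a hypothetical helper, not part of our training code.

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, f1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Assumed counts for "backpack": all 17 test samples recovered, 8 samples of
# other classes wrongly predicted as backpack.
precision, recall, f1 = prf1(tp=17, fp=8, fn=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A perfect recall with a lower precision, as in this row, means the class is never missed but attracts samples from other classes.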


In [None]:
conf = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name="R_to_P_improvements",
    config={
        "model": "ResNet18Improvements",
        "trained-on": "Source + Target unsupervised",
        "epochs": 30,
        "batch_size": BATCH_SIZE,
        "optimizer": "adam",
        "lr": 1e-3,
        "D_0": 0.1,
        "N_0": DEFAULT_N_0
    }
)

best_model = training_loop_improvements(train_real, test_products, train_products, device, conf, save=False)

| class                |precision |   recall | f1-score |  support|
|---|---|---|---|---|
|            backpack  |     0.94 |     1.00 |     0.97 |       29|
|            bookcase  |     0.87 |     0.95 |     0.91 |       21|
|            car jack  |     1.00 |     0.94 |     0.97 |       17|
|                comb  |     0.86 |     1.00 |     0.93 |       19|
|               crown  |     1.00 |     1.00 |     1.00 |       20|
|        file cabinet  |     1.00 |     0.83 |     0.91 |       18|
|           flat iron  |     0.88 |     0.88 |     0.88 |       16|
|     game controller  |     1.00 |     0.96 |     0.98 |       24|
|             glasses  |     1.00 |     1.00 |     1.00 |       19|
|          helicopter  |     0.94 |     1.00 |     0.97 |       17|
|          ice skates  |     1.00 |     0.95 |     0.97 |       19|
|         letter tray  |     0.94 |     1.00 |     0.97 |       16|
|             monitor  |     0.95 |     0.95 |     0.95 |       20|
|                 mug  |     1.00 |     1.00 |     1.00 |       17|
|      network switch  |     1.00 |     0.96 |     0.98 |       24|
| over-ear headphones  |     0.88 |     1.00 |     0.94 |       15|
|                 pen  |     0.89 |     0.83 |     0.86 |       29|
|               purse  |     1.00 |     0.86 |     0.92 |       21|
|         stand mixer  |     1.00 |     1.00 |     1.00 |       19|
|            stroller  |     0.95 |     1.00 |     0.98 |       20|
|                      |          |          |          |         |   
|            accuracy  |          |          |     0.95 |      400|
|           macro avg  |     0.96 |     0.96 |     0.95 |      400|
|        weighted avg  |     0.95 |     0.95 |     0.95 |      400|

### Observations

Thanks to the above-mentioned improvements we were able to substantially increase the performance of the Contrastive Adaptation Network, reaching 84.5% accuracy on the $P → R$ task and 95.2% accuracy on the $R → P$ task, which is on par with the 95.0% upper-bound accuracy we computed on the Products test set. Thus the **gain** over the baseline is 10.0 percentage points on the $P → R$ task and 2.7 percentage points on the $R → P$ task.

As we can see in the plots below, not only did the training require fewer epochs, but the convergence to the optimum was also steeper: the best accuracy obtained by vanilla CAN was already outperformed by epoch 5 in the $P → R$ task and by epoch 6 in the $R → P$ task, whereas vanilla CAN needed 60 epochs to reach that performance.

<img src="https://drive.google.com/uc?export=view&id=1YlWZRu4rwxNqqeNoPQRj_RpdotprlHJz" width="500px">

We can notice in the t-SNE representations below that classes are compact and well separated, even though some outliers remain. Intra-class compactness is therefore higher and the inter-class margin larger than in the standard CAN implementation. This behaviour is a symptom of better minimization of the Contrastive Domain Discrepancy, thanks to both Adam and the computation of the CDD on features with reduced dimensionality. We believe that, with Kernel PCA dimensionality reduction, the computed CDD is less subject to noise and therefore more representative of the actual distribution of the data, leading to better domain adaptation.
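
To make the quantity being minimized concrete, here is a minimal pure-Python sketch of a CDD-like score: the average intra-class discrepancy between source and target minus the average inter-class discrepancy. It is a simplification under stated assumptions, not our training implementation: it uses a single Gaussian kernel with a fixed bandwidth `sigma`, a biased MMD estimate, and plain lists of toy features instead of the Kernel-PCA-reduced network features. All function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel between two feature vectors given as lists of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mean_kernel(xs, ys, sigma=1.0):
    """Average kernel value over all pairs (x, y)."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    return (mean_kernel(xs, xs, sigma) + mean_kernel(ys, ys, sigma)
            - 2 * mean_kernel(xs, ys, sigma))


def cdd(source, target, sigma=1.0):
    """Intra-class minus inter-class discrepancy.

    source and target map a class label to a list of feature vectors;
    target labels are assumed to be pseudo-labels obtained by clustering.
    """
    classes = sorted(set(source) & set(target))
    m = len(classes)
    intra = sum(mmd2(source[c], target[c], sigma) for c in classes) / m
    if m < 2:
        return intra
    inter = sum(mmd2(source[c1], target[c2], sigma)
                for c1 in classes for c2 in classes if c1 != c2) / (m * (m - 1))
    return intra - inter


# Toy data: two well-separated classes in both domains.
source = {0: [[0.0, 0.0], [0.1, 0.0]], 1: [[5.0, 5.0], [5.1, 5.0]]}
aligned = {0: [[0.0, 0.1], [0.1, 0.1]], 1: [[5.0, 5.1], [5.1, 5.1]]}
swapped = {0: aligned[1], 1: aligned[0]}  # wrong pseudo-labels
print(cdd(source, aligned) < cdd(source, swapped))  # True
```

The score is low when same-class features from the two domains overlap and different classes stay apart, which is the compact, well-separated structure we look for in the t-SNE plots.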

<div>P->R t-SNE representation on the left and R->P on the right.<br>
<img src="https://drive.google.com/uc?export=view&id=1E2AErYYu_1qK3C5WtGHtYQNXv_tz64CG" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1wK2eUv9pLvmbzdAvsZjB6UanFCbGT909" width="500px">
</div>

In addition, we can see that the CDD loss minimization neither stalls nor oscillates around a constant value as it does in standard CAN. In our improved CAN the CDD loss behaves more smoothly and its value decreases steadily throughout training.

<img src="https://drive.google.com/uc?export=view&id=1AW516FV6T8gkynfjFkkNpVK-74vRqDI6" width="500px">

The better behaviour of the CDD has a positive impact on the accuracy of the predicted target labels, as we can see in the plot below.
Although the clustering accuracy does not increase smoothly as the CDD is minimized, it reaches a higher value earlier in training than in vanilla CAN. We believe that this creates a virtuous circle: more accurate pseudo-labels for the target set allow a more precise CDD to be computed, so the model's features become more representative of the samples, which in turn leads to more accurate clustering.
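
The metric logged as `train/clustering_acc` can be illustrated with the sketch below. It assumes the metric is simply the fraction of target training samples whose cluster-assigned pseudo-label matches the ground-truth label, with the ground truth used for monitoring only and never for training. `clustering_accuracy` is a hypothetical helper.

```python
def clustering_accuracy(pseudo_labels, true_labels):
    """Fraction of target samples whose pseudo-label equals the true label."""
    if len(pseudo_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        return 0.0
    correct = sum(p == t for p, t in zip(pseudo_labels, true_labels))
    return correct / len(true_labels)


# Toy example: three of four pseudo-labels are correct.
print(clustering_accuracy([0, 1, 1, 2], [0, 1, 2, 2]))  # 0.75
```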

<img src="https://drive.google.com/uc?export=view&id=1fOMdHK919yz5nbl9gL7oAWKicq7e1DKc" width="500px">

For the same reasons we observe, particularly for the $P → R$ task, that the number of classes used for the computation of the CDD approaches the maximum in fewer iterations. This means that the feature representations of the same class are more compact (i.e. the intra-class domain discrepancy is lower), leading to more confident target-label predictions. Therefore fewer samples are excluded by the filtering, and more (and more precise) information is used to compute the CDD.

We can speculate that, in the $P → R$ task, the number of remaining classes never reaches the maximum because better $D_0$ and $N_0$ hyperparameters are needed, as we simply used values similar to those proposed by the authors of the original article. It could also be that the model fails to make the worst-performing class compact enough (possibly "file cabinet", as we can see from the confusion matrix below or from the classification report in the $P → R$ training section); in that case those parameters, in particular $N_0$, are too strict for it, while they are fine for the majority of the classes.
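
The interaction between the two thresholds can be sketched as follows. This is an illustration under assumptions, not our exact code: a target sample is kept when its distance to its assigned cluster centre is below `d0`, and a class is kept when at least `n0` of its samples survive (here `n0` is treated as a minimum count, and the exact inequalities are assumptions). `filter_targets` and the toy distances are hypothetical.

```python
from collections import Counter


def filter_targets(distances, pseudo_labels, d0, n0):
    """Keep confident target samples, then keep classes with enough of them.

    distances[i] is the distance of target sample i to its assigned cluster centre.
    Returns (kept_sample_indices, kept_classes).
    """
    confident = [i for i, d in enumerate(distances) if d < d0]
    counts = Counter(pseudo_labels[i] for i in confident)
    kept_classes = sorted(c for c, n in counts.items() if n >= n0)
    kept = [i for i in confident if pseudo_labels[i] in kept_classes]
    return kept, kept_classes


distances = [0.02, 0.05, 0.30, 0.04, 0.40, 0.50]
pseudo_labels = [0, 0, 0, 1, 1, 1]

# Strict distance threshold: class 1 has a single confident sample and is dropped.
print(filter_targets(distances, pseudo_labels, d0=0.1, n0=2))
# Looser distance threshold: both classes contribute to the CDD.
print(filter_targets(distances, pseudo_labels, d0=0.45, n0=2))
```

A class whose features remain spread out keeps few samples under the distance threshold and is then removed by the count threshold, which is the behaviour we suspect for the worst-performing class.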

<img src="https://drive.google.com/uc?export=view&id=1Na0vbxkQ-99QEIjUTkvVXbMXfuVD35Z6" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1Jo6L-T0nAFVoGGhezNzHtBMYMxmqh85u" width="700px">

In the end we can appreciate the final **gain** obtained in our improved version of CAN with respect to the baseline.

| Method | Acc P -> R (%) | Acc R -> P (%)|
|---|---|---|
| *Upper bound*         |  *92.0*  |  *95.0*  |
| Baseline              |   74.5   |   92.5   |
| CAN                   |   79.0   |   93.5   |
| **Gain CAN**          | **4.5**  |  **1.0** |
| CAN-Improved          |   84.5   |   95.2   |
| **Gain CAN-Improved** | **10.0** |  **2.7** |
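
The gain rows are plain differences, in percentage points, with respect to the baseline. The short sketch below reproduces them from the accuracies in the table; the dictionary layout and the helper name `gain_over_baseline` are only illustrative.

```python
# Test accuracies (%) copied from the table above.
accuracy = {
    "Baseline": {"P->R": 74.5, "R->P": 92.5},
    "CAN": {"P->R": 79.0, "R->P": 93.5},
    "CAN-Improved": {"P->R": 84.5, "R->P": 95.2},
}


def gain_over_baseline(method, task):
    """Accuracy gain in percentage points with respect to the baseline."""
    return round(accuracy[method][task] - accuracy["Baseline"][task], 1)


for method in ("CAN", "CAN-Improved"):
    gains = {task: gain_over_baseline(method, task) for task in ("P->R", "R->P")}
    print(method, gains)
```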

## Ablation studies

With the following experiments we tried to understand the impact on CAN's performance of applying only dimensionality reduction while keeping the original SGD optimizer. To get meaningful results, we needed to raise the $D_0$ hyperparameter slightly, to $0.15$, thereby increasing the distance threshold above which a sample is filtered out of the CDD loss computation.

As we can see below, in the $P → R$ task there is an improvement in overall accuracy from $0.79$ to $0.82$. Furthermore, as expected, the CDD loss behaves more smoothly than in vanilla CAN, even though it stalls at a higher value. Despite this, the test accuracy continues to increase.

We can also see that, thanks to the more compact feature representation, the clustering accuracy is higher than vanilla CAN, allowing the model to exploit a more accurate pseudo-labelling for the computation of the CDD loss.

<img src="https://drive.google.com/uc?export=view&id=1yeg6oGzMWRoNqLpVTUxEAjfCX6aCXMAF" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1dc1EsouaCWbdUaJrRYipk_hCp3kVled4" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1d_TOdHF3donxE2DsUwZaT5EaEt8RzCq5" width="500px">

In the $R → P$ task, instead, we do not get an improvement in the best test accuracy; however, similar behaviours are observed in both the clustering and the CDD loss curves.

Comparing these results with the full improved version discussed previously, we observe that the Adam optimizer has a large effect on both reaching higher accuracy and smoothing the loss minimization, thanks to its adaptive learning rate.
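
To illustrate what "adaptive learning rate" means here, the sketch below compares a single plain SGD step with a single Adam step on one scalar parameter, using the learning rates from our configurations ($10^{-2}$ for SGD, $10^{-3}$ for Adam). It is a simplified illustration: the SGD step has no momentum or scheduling, and the Adam hyperparameters are the commonly used defaults, which we assume rather than report.

```python
import math


def sgd_step(param, grad, lr):
    """Plain SGD: the step size is proportional to the gradient magnitude."""
    return param - lr * grad


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


for grad in (100.0, 0.01):
    sgd_move = abs(sgd_step(0.0, grad, lr=1e-2))
    adam_param, _, _ = adam_step(0.0, grad, m=0.0, v=0.0, t=1, lr=1e-3)
    print(f"grad={grad}: SGD moves {sgd_move:.4g}, Adam moves {abs(adam_param):.4g}")
```

On the first step Adam moves each parameter by roughly the learning rate regardless of the gradient scale, whereas the SGD step varies by four orders of magnitude between the two gradients. This per-parameter normalization is one plausible reason for the smoother CDD minimization we observed.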



In [None]:
conf = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name="P_to_R_ablation_dim_red",
    config={
        "model": "ResNet18ImprovementsAblationDimRed",
        "trained-on": "Source + Target unsupervised",
        "epochs": 60,
        "batch_size": BATCH_SIZE,
        "optimizer": "sgd",
        "lr": 1e-2,
        "D_0": 0.15,
        "N_0": DEFAULT_N_0
    }
)

best_model = training_loop_improvements(train_products, test_real, train_real, device, conf, save=False)

|       class        | precision |   recall  |f1-score  | support|
|---|---|---|---|---|
|           backpack |      0.62 |     0.88  |    0.73  |      17|
|           bookcase |      0.75 |     0.90  |    0.82  |      20|
|           car jack |      0.76 |     0.87  |    0.81  |      15|
|               comb |      1.00 |     0.73  |    0.84  |      22|
|              crown |      0.83 |     0.90  |    0.86  |      21|
|       file cabinet |      0.76 |     0.59  |    0.67  |      22|
|          flat iron |      0.82 |     0.88  |    0.85  |      16|
|    game controller |      1.00 |     0.76  |    0.86  |      21|
|            glasses |      0.93 |     0.68  |    0.79  |      19|
|         helicopter |      0.82 |     0.95  |    0.88  |      19|
|         ice skates |      0.80 |     0.76  |    0.78  |      21|
|        letter tray |      0.68 |     0.78  |    0.72  |      27|
|            monitor |      0.84 |     0.80  |    0.82  |      20|
|                mug |      0.91 |     0.88  |    0.89  |      24|
|     network switch |      0.80 |     0.71  |    0.75  |      17|
|over-ear headphones |      0.89 |     1.00  |    0.94  |      17|
|                pen |      0.94 |     0.94  |    0.94  |      17|
|           stroller |      0.89 |     0.76  |    0.82  |      21|
|              purse |      0.65 |     0.74  |    0.69  |      23|
|        stand mixer |      0.91 |     0.95  |    0.93  |      21|
| | | | | |
|           accuracy |           |           |    0.82  |     400|
|          macro avg |      0.83 |     0.82  |    0.82  |     400|
|       weighted avg |      0.83 |     0.82  |    0.82  |     400|

In [None]:
conf = wandb.init(
    project="DL2022_229356_229298",
    entity="229356_229298",
    name="R_to_P_ablation_dim_red",
    config={
        "model": "ResNet18ImprovementsAblationDimRed",
        "trained-on": "Source + Target unsupervised",
        "epochs": 60,
        "batch_size": BATCH_SIZE,
        "optimizer": "sgd",
        "lr": 1e-2,
        "D_0": 0.15,
        "N_0": DEFAULT_N_0
    }
)

best_model = training_loop_improvements(train_real, test_products, train_products, device, conf, save=False)

|       class        | precision |   recall|  f1-score|   support|
|---|---|---|---|---|
|           backpack |      0.93 |     0.97|      0.95|        29|
|           bookcase |      0.74 |     0.95|      0.83|        21|
|           car jack |      0.93 |     0.76|      0.84|        17|
|               comb |      0.86 |     0.95|      0.90|        19|
|              crown |      1.00 |     1.00|      1.00|        20|
|       file cabinet |      0.93 |     0.78|      0.85|        18|
|          flat iron |      0.79 |     0.94|      0.86|        16|
|    game controller |      1.00 |     0.92|      0.96|        24|
|            glasses |      0.95 |     0.95|      0.95|        19|
|         helicopter |      0.81 |     1.00|      0.89|        17|
|         ice skates |      1.00 |     0.95|      0.97|        19|
|        letter tray |      0.92 |     0.75|      0.83|        16|
|            monitor |      1.00 |     0.95|      0.97|        20|
|                mug |      1.00 |     1.00|      1.00|        17|
|     network switch |      0.92 |     1.00|      0.96|        24|
|over-ear headphones |      0.88 |     1.00|      0.94|        15|
|                pen |      0.92 |     0.83|      0.87|        29|
|              purse |      0.94 |     0.81|      0.87|        21|
|        stand mixer |      0.94 |     0.89|      0.92|        19|
|           stroller |      1.00 |     1.00|      1.00|        20|
| | | | | |
|           accuracy |           |         |      0.92|       400|
|          macro avg |      0.92 |     0.92|      0.92|       400|
|       weighted avg |      0.93 |     0.92|      0.92|       400|

## Experiment with tuned hyperparameters
In order to have more comparable results in the experiments discussed previously we avoided, as much as possible, tweaking hyper-parameters $D_0$ and $N_0$. 

However, focusing on the most difficult task, $P → R$, we empirically found that with $N_0=0.2$ the improved version (Adam optimizer plus dimensionality reduction) reaches an even higher test accuracy, with a best value of $0.88$. The CDD loss and accuracy curves behave similarly. The main difference is that all 20 classes are used for the CDD loss computation from the beginning of training. In other words, no class filtering is performed.

<img src="https://drive.google.com/uc?export=view&id=1c_wo4dvn3m7QZeo_idPhFe62nleTVBae" width="500px">
<img src="https://drive.google.com/uc?export=view&id=1rbZaCV1YfGyNDzQ2L4W_MraNPzwK5NUH" width="500px">

## Conclusions
With this notebook we showed the three major steps that led us to solve the task: first using no domain adaptation technique, then applying the Contrastive Adaptation Network and, in the end, our improved version of it.

During the development of the project we encountered several challenges, especially due to performance issues, but also in understanding how to implement CAN and its training procedure, given the lack of details in the original paper.

In the last section we showed that our improved version of CAN is competitive with its vanilla implementation and that we were able to address the domain gap between the Real World and Products datasets. Moreover, our results are comparable with those reported for CAN in the Adaptiope paper, although the comparison should be taken with a grain of salt since we are dealing with a subset of the original dataset.

[Here](https://wandb.ai/229356_229298/DL2022_229356_229298) you can find our Weights & Biases project with all the plots we showed and additional metrics and information, such as some wrongly predicted images.

Below is a recap of the test accuracy of each method we evaluated, with the gain in accuracy with respect to the baseline in bold. The tuned version was only evaluated on the $P → R$ task.

| Method | Acc P -> R (%) | Acc R -> P (%)|
|---|---|---|
| *Upper bound*                | *92.0*   | *95.0*  |
| Baseline                     | 74.5     | 92.5    |
| CAN                          | 79.0     | 93.5    |
| **Gain CAN**                 | **4.5**  | **1.0** |
| CAN-Improved                 | 84.5     | 95.2    |
| **Gain CAN-Improved**        | **10.0** | **2.7** |
| CAN-Improved-tuned           | 87.8     | n/a     |
| **Gain CAN-Improved-tuned**  | **13.3** | n/a     |
