<a href="https://colab.research.google.com/github/Zinni98/Symnet-Unsupervised-domain-adaptation/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Conventions and notes:_

- _Some short sentences of the paper are copied verbatim in this document. We did this wherever we felt there was no reason to rewrite, further explain or summarise a concept because it was already clear and concise._

- _For better coherence we decided to stick with the notation used in the paper when writing formulas._

- _Wherever code is not commented, it is because we considered it trivial._

- _Some code cells are not preceded by any markdown. This is because the code was either discussed in class, or a similar chunk of code was already explained earlier, or a docstring was more suitable for the context._

- _If there is any problem in the visualization of images, here is the link to the github repository, where images should be displayed correctly: [Symnet github](https://github.com/Zinni98/DL-Project/blob/main/project.ipynb)._

# Unsupervised Domain Adaptation
## 1. Introduction
The **goal** of this project is to **build a deep learning framework for** Unsupervised Domain Adaptation (**UDA**).
**Domain Adaptation** is a subdiscipline of machine learning which deals with scenarios where a **model is trained in a supervised way on** data coming from a known distribution (*source* domain) **but**, during the **test** phase, data are sampled from **another unknown distribution** (*target* domain). This distribution shift can degrade performance significantly.
The *underlying concept* of DA is *closely related to transfer learning* which refers to a class of ML problems that *deals with different tasks or domains*. 
Furthermore, it is possible to use the target data without labels (that's why "unsupervised") in order to improve the performance of the model.

## 2. Objective
The **aim** of this project is to **use** an **Unsupervised Domain Adaptation technique of choice** in order **to improve** performance with respect to a **baseline**. The latter is obtained by training on the source domain and testing on the target domain without deploying any specific technique to take the domain shift into account.

### 2.1 UDA Method
The **method chosen** for Unsupervised Domain Adaptation is [**Symnet**](https://arxiv.org/pdf/1904.04663.pdf) (explained in more detail later), which belongs to the **domain adversarial family** of methods for domain adaptation. The main idea is to **use a two-level domain confusion scheme** in order to let the intermediate network learn features that are **invariant to the corresponding category in both domains**.

### 2.2 Problem setting
In this problem we are required to perform an **object recognition task**: given an image, the model should be able to produce the label, hopefully correct, associated with that image.

### 2.2.1 Datasets
The chosen **dataset** is [**Adaptiope**](https://openaccess.thecvf.com/content/WACV2021/papers/Ringwald_Adaptiope_A_Modern_Benchmark_for_Unsupervised_Domain_Adaptation_WACV_2021_paper.pdf) which has images coming from **3 domains and 123 classes**.
We are going to **use** a subset of it consisting of **2 domains with 20 classes**. The two domains are:

- **Real world**

- **Product images**

As a simplifying assumption, the **20 classes are the same across the two domains**.

From now on we will treat the data coming from the two domains as two separate datasets: the **source dataset $X^S$** and the **target dataset $X^T$**.
Both $X^S$ and $X^T$ use an 80%/20% split for the training and test set respectively.

*Note: here we refer to Source and Target instead of Real World and Product because the domain adaptation procedure is applied in both directions: first one domain is used as source and the other as target, then the roles are swapped.*

[comment]: <> (
_Note: Here we refer to Source and Target instead of Real World and Product, because during training we are going evaluate our model first by considering real world domain as source and product as target, and then the opposite._
)

### 2.2.2 Methodology
To evaluate the performance of the proposed domain adaptation technique, we need to define a *baseline score* which we want *to improve* on. The baseline score is obtained by first training on the source training set and then evaluating on the target test set.

| ![Baseline](https://drive.google.com/uc?id=15EdVkmyHUD_sXPuPDE2P0XB5jk-dn9t0) |
|:--:|
| *The image shows a sketch with the highlighted parts used to train and test the baseline* |

It can be *useful* also *to define an upper bound* to the performance, obtained by *training on the target training set* and *testing on the target test set*.

| ![Baseline](https://drive.google.com/uc?id=1Y-jM7yED_ssufwL-hAGNTlbr_KntTCSZ) |
|:--:|
| *The image shows a sketch with the highlighted parts used to train and test the upper bound* |

Finally, for **Unsupervised Domain Adaptation the training is performed supervisedly on the source training set**, **unsupervisedly on the target training set** and then **tested on the target test set**.

| ![Baseline](https://drive.google.com/uc?id=17Wmg7Yd7RltaOBAkA5DEksTp4HUTJvrT) |
|:--:|
| *The image shows a sketch with the highlighted parts used to train and test the UDA technique* |

We **expect** that the **proposed** Domain Adaptation **technique will perform better than the baseline** but worse than the upper bound; the closer it gets to the upper bound, the better.

Performance is measured in terms of **validation accuracy**: 
$$\text{Accuracy} = \dfrac{TP+TN}{TP+TN+FP+FN}$$
In the multi-class setting used here, the code computes this as the number of correctly classified samples divided by the total number of samples.

In this project we are required to test the proposed Unsupervised Domain Adaptation technique first by considering the "Product" domain as the source domain and the "Real World" domain as the target domain, and then the other way around: "Real World" as source and "Product" as target.
So we need to compute baseline, upper bound and UDA performance first in one direction (i.e. Product $\rightarrow$ Real World) and then in the other direction (i.e. Real World $\rightarrow$ Product).
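
As a minimal sketch of how the accuracy above is obtained in a multi-class problem, the snippet below takes the arg-max of each score vector and counts the matches with the labels. It uses plain Python lists instead of PyTorch tensors so that it runs stand-alone; the function name `accuracy_from_logits` and the toy numbers are illustrative assumptions, not part of the project code.

```python
def accuracy_from_logits(logits, targets):
    """Percentage of samples whose highest-scoring class equals the target."""
    correct = 0
    for scores, target in zip(logits, targets):
        # index of the largest score = predicted class
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == target)
    return 100.0 * correct / len(targets)


# Toy batch: 4 samples, 3 classes (made-up values)
logits = [
    [2.0, 0.1, -1.0],  # predicted 0
    [0.3, 0.2, 1.5],   # predicted 2
    [0.0, 3.0, 0.5],   # predicted 1
    [1.2, 0.4, 0.9],   # predicted 0
]
targets = [0, 2, 0, 0]

acc = accuracy_from_logits(logits, targets)
print(f"accuracy: {acc:.1f}%")
```

The same count-and-divide logic is what `training_step` and `test_step` perform on tensors with `outputs.max(1)` and `predicted.eq(targets)`.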




In [None]:
from google.colab import drive  # to mount personal drive


from tqdm import tqdm   # for progress bar 
from time import sleep

import torch  # importing pytorch
import torch.optim as optim  # importing optimizer module
from torch.utils.data import Subset  # useful in defining data of interest in a dataset
import torch.nn as nn  # Neural Network tools
from torch.utils.tensorboard import SummaryWriter # to get plots of trends
from torch.utils.data import DataLoader
# import torch.nn.functional as F

import torchvision
import torchvision.transforms as T  # to apply transformations to dataset images
from torchvision.datasets import ImageFolder  # to load and applying transformations on data
#import torchvision.transforms.functional as F

from sklearn.model_selection import train_test_split  # to split a dataset into training and test set

import math

import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
from typing import Tuple, List

In [None]:
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [None]:
classes = ['backpack', 'bookcase', 'car jack', 'comb', 'crown', 'file cabinet', 'flat iron', 'game controller', 'glasses', 'helicopter', 'ice skates', 'letter tray', 'monitor', 'mug', 'network switch', 'over-ear headphones', 'pen', 'purse', 'stand mixer', 'stroller']

cuda = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 128
num_classes = len(classes)
rootdir = 'gdrive/My Drive/Colab Notebooks/data/adaptiope_small'
# rootdir_alessandro_uni = 'gdrive/My Drive/project/data/adaptiope_small'
runs_dir = 'gdrive/My Drive/Colab Notebooks/runs/'

## 3. Data Extraction
In `get_data(batch_size, root_dir)` the following steps are performed:
- image *transforms are defined*. In particular, the adopted transformation sequence was taken from [ResNet Transforms](https://pytorch.org/hub/pytorch_vision_resnet/);
- *images* from the local drive are *loaded and the transforms applied*;
- data *splitting* into training and test sets;
- *collecting individual fetched data samples into batches*.

The returned objects are the training and test data loaders of the product and real world domains.
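
To illustrate what a stratified 80%/20% split does, here is a small stand-alone sketch in plain Python. It is only a stand-in for the idea behind scikit-learn's `train_test_split(..., stratify=...)` used in the notebook, not its actual algorithm; the helper `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that every class keeps the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy dataset: 3 classes with 10 samples each
labels = [c for c in range(3) for _ in range(10)]
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), dict(test_counts))
```

Every class contributes the same share of samples to the test set, which keeps the class distribution of the training and test subsets aligned.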

In [None]:
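def get_data(batch_size: int, root_dir: str, random_state: int = 42,
             test_split: float = 0.2) -> Tuple[DataLoader, DataLoader, DataLoader, DataLoader]:
  """
  Builds the train/test data loaders for the product and real world domains.

  Params:
  ------
  batch_size: int
    batch size for the dataloader
  root_dir: str
    Directory of adaptiope_small (e.g. "something/something_else/adaptiope_small")
  random_state: int
    Seed of the stratified train/test split
  test_split: float
    Fraction of each domain held out as test set

  Returns
  -------
  tuple
    product_train_loader, product_test_loader, rw_train_loader, rw_test_loader
  """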

  # Transforms for resnet found there https://pytorch.org/hub/pytorch_vision_resnet/
  transform_img = list()
  transform_img.append(T.Resize(256))
  transform_img.append(T.CenterCrop(224))
  transform_img.append(T.ToTensor())
  transform_img.append(T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]))
  transform_img = T.Compose(transform_img)

  # load data
  product_images_dataset = ImageFolder(root = f"{root_dir}/product_images/", transform = transform_img)
  rw_images_dataset = ImageFolder(root = f"{root_dir}/real_life/", transform = transform_img)

  product_train_indexes, product_test_indexes = train_test_split(list(range(len(product_images_dataset.targets))),
                                                test_size = test_split, stratify = product_images_dataset.targets, random_state = random_state)
  
  rw_train_indexes, rw_test_indexes = train_test_split(list(range(len(rw_images_dataset.targets))),
                                                test_size = test_split, stratify = rw_images_dataset.targets, random_state = random_state)
  

  product_train_data = Subset(product_images_dataset, product_train_indexes)
  product_test_data = Subset(product_images_dataset, product_test_indexes)

  rw_train_data = Subset(rw_images_dataset, rw_train_indexes)
  rw_test_data = Subset(rw_images_dataset, rw_test_indexes)

  product_train_loader = DataLoader(product_train_data, batch_size, shuffle = False)
  product_test_loader = DataLoader(product_test_data, batch_size, shuffle = False)

  rw_train_loader = DataLoader(rw_train_data, batch_size, shuffle = False)
  rw_test_loader = DataLoader(rw_test_data, batch_size, shuffle = False)

  return product_train_loader, product_test_loader, rw_train_loader, rw_test_loader




## 4. Baseline
### 4.1 Network initialization
The **pretrained ResNet34 model is initialized**. Since for the **baseline** we decided to perform a **simple fine-tuning**, the original classifier layer has been replaced and gradients have been enabled.

In [None]:
def initialize_resnet34(num_classes: int, pretrained: bool = True):
  """
  resnet34 initialization

  Parameters
  ----------
  num_classes: int
    number of categories the net should output

  pretrained: bool
    Specify if pretrained version of resnet should be retrieved

  """

  # Load the ImageNet weights only when a pretrained network is requested
  weights = 'ResNet34_Weights.DEFAULT' if pretrained else None
  model = torchvision.models.resnet34(weights=weights)

  # Replace the original classifier with one sized for our classes
  in_features = model.fc.in_features
  model.fc = nn.Linear(in_features, num_classes)
  for param in model.fc.parameters():
    param.requires_grad = True

  return model

### 4.2 Cross-entropy loss for training data


In [None]:
def get_ce_cost_function() -> torch.nn.CrossEntropyLoss:
  """
  Simply returns an object for computing the cross entropy loss
  """
  cost_function = torch.nn.CrossEntropyLoss()
  return cost_function

### 4.3 Defining the optimizer
We have written the abstract class `AnnealingOptimizer` in order to define an optimizer that updates the learning rate using the annealing strategy proposed in the [Symnet paper](https://arxiv.org/pdf/1904.04663.pdf).

The strategy used is the following:
$$\eta = \frac{\eta_0}{(1+\alpha p)^\beta}$$
where:

- $\eta_0$ is the **base learning rate**, which by default is $0.001$. Note that it has been *changed with respect to the one proposed in the paper*, because we noticed that the latter was too high;

- $p$ is the progress in training: $p = \frac{\text{epoch}}{\text{total epochs}}$. Notice that when the function `update_lr()` is called, the value of $p$ gets updated;

- $\alpha = 10$ is a constant;
- $\beta = 0.75$ is a constant.

Then the **class `ResNetOptimizer`** inherits from `AnnealingOptimizer` and just **defines the optimizer** to be used. This optimizer will be used for **ResNet34**, which is the network of choice for the baseline, the upper bound and the deployed UDA technique.

<br>

We decided to use the **annealing optimizer** also **for** the **baseline and the upper bound**, in order **to compare it better with** [Symnet](https://arxiv.org/pdf/1904.04663.pdf).
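
To get a feeling for how quickly the learning rate decays, the annealing formula can be evaluated on its own. The sketch below is plain Python and independent of the optimizer classes; the function name `annealed_lr` and the choice of 15 epochs (the default of `main`) are assumptions for illustration.

```python
def annealed_lr(epoch, total_epochs, base_lr=0.001, alpha=10.0, beta=0.75):
    """eta = eta_0 / (1 + alpha * p) ** beta, with p = epoch / total_epochs."""
    p = epoch / total_epochs
    return base_lr / (1.0 + alpha * p) ** beta


total_epochs = 15
schedule = [annealed_lr(e, total_epochs) for e in range(total_epochs + 1)]

for e in (0, 5, 10, 15):
    # the classifier group uses a learning rate 10 times larger
    print(f"epoch {e:2d}: feature extractor lr = {schedule[e]:.6f}, "
          f"classifier lr = {10 * schedule[e]:.6f}")
```

The schedule starts at $\eta_0$ and decreases monotonically, so the largest updates happen in the first epochs.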




In [None]:
from abc import ABC, abstractmethod

class AnnealingOptimizer(torch.optim.Optimizer, ABC):
  """
  Defines an abstract class in order to implement an sgd optimizer using an annealing strategy
  """
  def __init__(self, model, nr_epochs, lr: float = 0.001, epoch: int = 0) -> None:
    if not 0.0 <= lr:
      raise ValueError(f"Invalid learning rate: {lr}")
    if not 0 <= epoch:
      raise ValueError(f"Invalid epoch value: {epoch}")
    
    self.nr_epochs = nr_epochs
    self.epoch = epoch
    self._alpha = 10
    self._beta = 0.75
    self._base_lr = lr

  def update_lr(self):
    """
    Updates the learning rate using the annealing strategy.
    For the annealing strategy to work correctly, this method should be called at every epoch during the network training

    The learning rate for the classifier is 10 times bigger as proposed in the [Symnet paper](https://arxiv.org/pdf/1904.04663.pdf)
    """
    self.epoch += 1
    new_lr = self._compute_lr()
    for g in self.optimizer.param_groups:
      if g["name"] == "fe":
        g["lr"] = new_lr
      else:
        g["lr"] = new_lr*10

    
  def _compute_lr(self):
    """
    Computes the learning rate using the proposed annealing strategy

    Returns
    -------
    float
      updated learning rate
    """
    etap = 1 / ((1 + self._alpha * self.epoch / self.nr_epochs ) ** self._beta)
    return self._base_lr * etap

  def step(self):
    self.optimizer.step()
  
  def zero_grad(self):
    self.optimizer.zero_grad()
  
  

class ResNetOptimizer(AnnealingOptimizer):
  """
  Implements an annealing optimizer for Resnet
  """
  def __init__(self, model, nr_epochs, lr: float = 0.001, epoch: int = 0, momentum: float = 0.9) -> None:
    super(ResNetOptimizer ,self).__init__(model, nr_epochs, lr, epoch)
    
    # Note that names for parameters group are important in order to update each group differently
    self.optimizer = optim.SGD([
                {'params': self.__get_fe_params(model), "name": "fe"},
                {'params': model.fc.parameters(), "lr": self._compute_lr()*10, "name": "classifier"}
            ], lr=lr, momentum=momentum)
    

  def __get_fe_params(self, model):
    """
    Takes parameters of the Resnet's feature extractor
    """
    fe_layers = list(model.children())[:-1]
    all_parameters = [param for layer in fe_layers for param in layer.parameters()]
    for param in all_parameters:
      yield param

### 4.4 Training procedure
Briefly :
- the **net** is set into **train mode**.

- The **training dataset is iterated over** in batches of size `batch_size`. 

- **For each batch**, **inputs and targets** are **moved** to the specified **device**, then the predicted outputs and the loss are computed.

- After that, an **optimization step** is performed **to update the weights**.

- Finally, **accuracy and cumulative loss are computed**.

In [None]:
def training_step(net, data_loader: torch.utils.data.DataLoader, optimizer,
                           cost_function, device: str = cuda) -> Tuple[float, float]:

  """
  Performs the training of the network for one epoch.

  Parameters
  ----------
  net
    network model.
  
  data_loader: torch.utils.data.DataLoader
    Data loader initialized with the training set.
  
  optimizer: torch.optim.optimizer.Optimizer
    Optimizer of choice

  cost_function: torch.nn.modules._Loss
    Loss function to be used

  device: str
    Device in which computations should be performed.
    Admitted values:

    - "cpu"

    - "cuda:n" -> where n is the gpu number in case of multiple gpu configurations
  
  Returns
  -------
  tuple
    A tuple of length 2 containing:
    
    - Sum of the batch losses divided by the number of training samples

    - Accuracy (in percent) over the whole training set
  

  """
  samples = 0.
  cumulative_loss = 0.
  cumulative_accuracy = 0.
  
  net.train() 
 
  # iterate over the training set
  for batch_idx, (inputs, targets) in enumerate(data_loader):
    # load data into GPU
    inputs = inputs.to(device)
    targets = targets.to(device)
      
    # forward pass
    outputs = net(inputs)

    # loss computation
    loss = cost_function(outputs,targets)

    # backward pass
    loss.backward()
    
    # parameters update
    optimizer.step()

    # gradients reset
    optimizer.zero_grad()

    # fetch prediction and loss value
    samples += inputs.shape[0]
    cumulative_loss += loss.item()
    _, predicted = outputs.max(dim=1) # max() returns (maximum_value, index_of_maximum_value)

    # compute training accuracy
    cumulative_accuracy += predicted.eq(targets).sum().item()

  return cumulative_loss/samples, (cumulative_accuracy/samples)*100


### 4.5 Test procedure
- The **network** is set to **evaluation mode**. 

- After this, we **disable gradient computation**, since gradients are not needed for testing. From here on, the testing procedure is analogous to the training one, with the only difference that the **weights of the network don't get updated**.

In [None]:
def test_step(net, data_loader, cost_function, device=cuda):
  """
  Test the network for one epoch

  Parameters
  ----------
  net
    network model.
  data_loader
    Data loader initialized with the test set.
  cost_function: torch.nn.modules._Loss
    Loss function to be used
  device: str
    Device in which computations should be performed.
    Admitted values:

    - "cpu"

    - "cuda:n" -> where n is the gpu number in case of multiple gpu configurations
  
  Returns
  -------
  tuple
    A tuple of length 2 containing:
    
    - Sum of the batch losses divided by the number of test samples

    - Accuracy (in percent) over the whole test set
  
  """

  samples = 0.
  cumulative_loss = 0.
  cumulative_accuracy = 0.

  # set the network to evaluation mode
  net.eval() 

  # disable gradient computation (we are only testing, we do not want our model to be modified in this step!)
  with torch.no_grad():

    # iterate over the test set
    for batch_idx, (inputs, targets) in enumerate(data_loader):
      
      # load data into GPU
      inputs = inputs.to(device)
      targets = targets.to(device)
        
      # forward pass
      outputs = net(inputs)

      # loss computation
      loss = cost_function(outputs, targets)

      # fetch prediction and loss value
      samples+=inputs.shape[0]
      cumulative_loss += loss.item() # Note: the .item() is needed to extract scalars from tensors
      _, predicted = outputs.max(1)

      # compute accuracy
      cumulative_accuracy += predicted.eq(targets).sum().item()

  return cumulative_loss/samples, cumulative_accuracy/samples*100

### 4.6 Main Function
This function is meant to be a '**wrapper**' **function** where every aforementioned function is called when needed.  
First, the **parameter values are defined as arguments** of the function; the training and test **data** loaders are passed in by the caller.  
Then, sequentially:     
- network, optimizer and cost function are **initialized**;
- the training loop **iterates** for a **fixed number of epochs**. In each epoch the following steps are performed:    
  - **computation of training loss and accuracy**;
  - **update of the learning rate** through the annealing strategy;
  - optionally, logging the obtained values to the writer (commented out in the code below).

- At the **end of the training**, the **network** is **tested on** the given test set and the test accuracy is returned.

For the baseline, for example, the network is trained on product images and then tested on real world ones.

In [None]:
def main(train_loader,
         test_loader,
         batch_size = BATCH_SIZE, 
         device = cuda,
         epochs = 15,
         nr_classes = num_classes, 
         img_root = rootdir,
         runs_dir=runs_dir,
         ):
  """
  Parameters
  ----------
  train_loader
    Data loader used for training

  test_loader
    Data loader used for the final test

  batch_size
    Dimension of the batch for the single optimization step

  device
    The device on which the computation takes place.
    Admitted values:

    - "cpu"

    - "cuda:n" -> where n is the gpu number in case of multiple gpu configurations

  epochs
    Number of training epochs

  nr_classes
    Number of classes for the classification task
  
  img_root
    Root where the dataset is stored
  
  runs_dir
    Directory for saving the results of runs

  """
  
  net = initialize_resnet34(nr_classes).to(device)
  print('Network Init Done')
  optimizer = ResNetOptimizer(net, epochs)
  print('Got Optimizer')
  cost_function = get_ce_cost_function()
  print('Got Cost Function')

  # writer = SummaryWriter(log_dir=f"{runs_dir}/runs_upper_bound/RW2PRD")

  for e in range(epochs):
    print(f"Epoch {e}:")
    train_loss, train_accuracy = training_step(net, train_loader, optimizer, cost_function, device)
    print(f"Training loss: {train_loss} \n Training accuracy: {train_accuracy}")
    # Needed to apply the annealing strategy
    optimizer.update_lr()

    # add values to logger
   # writer.add_scalar('Loss/train_loss', train_loss, e + 1)
   # writer.add_scalar('Accuracy/train_accuracy', train_accuracy, e + 1)
  

  # perform final test step and print the final metrics
  _, test_accuracy = test_step(net, test_loader, cost_function, device)

  # close the logger
  # writer.close()

  return test_accuracy


## 5. Domain Adaptation Technique : SymNet
The **design** of the **proposed symmetric network** is characterized by:

- The **Feature extractor G**. We decided to use the *feature extractor defined by ResNet34* (i.e. ResNet34 without the last fully connected layer) to make the comparison with the baseline and upper bound results as fair as possible;

[comment]: <> (in order to allow the comparison with the results obtained with the baseline and the upper bound)

- **Two parallel task classifiers** $C_s$ and $C_t$, both based on a single fully connected layer (as proposed in the paper) and containing **$20$ neurons each**, i.e. the number of categories for the proposed problem (when explaining the losses used for Symnet, we are going to use $K$ to denote the number of classes in order to be more generic). 
Composing the two classifiers, we get the $C_{st}$ classifier, which has a total of $40$ units, i.e. the union of the two FC layers.

| ![symnet](https://drive.google.com/uc?id=1qyPClxz8zcJvhVGF84-IKv1Owljt0h7H) |
|:--:|
| *The image shows a sketch of the network architecture, including the error functions which will be explained later on.* |

<br>

As can be seen, the **architecture** is **pretty simple**. 
Indeed, the **core reasoning** has been developed at the **level of the loss definitions**.

In [None]:
class SymNet(nn.Module):
  """
  Class representing the proposed symmetric network
  """
  def __init__(self, n_classes: int = 20) -> None:
    super(SymNet, self).__init__()
    # Only the feature extractor is kept, so the size of the discarded fc layer does not matter
    resnet = initialize_resnet34(n_classes)
    # Taking the feature extractor of resnet34
    # Reference: https://stackoverflow.com/questions/55083642/extract-features-from-last-hidden-layer-pytorch-resnet18
    self.feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
    self.source_classifier = nn.Linear(in_features=512, out_features=n_classes)
    self.target_classifier = nn.Linear(in_features=512, out_features=n_classes)
  

  def forward(self, x: torch.Tensor) -> tuple:
    """
    Performs the forward pass

    Parameters
    ----------
    x : torch.Tensor
      Input tensor to the network
    
    Returns
    -------
    tuple
      The returned values are respectively the result of the source classifier, target classifier and the concatenation of the two.

    """
    features = self.feature_extractor(x)
    # (N, 512, 1, 1) -> (N, 512); unlike squeeze(), this keeps the batch dimension when N == 1
    features = torch.flatten(features, 1)
    source_output = self.source_classifier(features)
    # source_output = nn.Softmax(source_output)

    target_output = self.target_classifier(features)
    # target_output = nn.Softmax(target_output)

    source_target_classifier = torch.cat((source_output, target_output), dim=1)
    
    return source_output , target_output, source_target_classifier
  
  def parameters(self) -> torch.Tensor:
    """
    Parameters of the network

    Yields
    ------
    torch.Tensor
      Network parameter
    """
    fe = list(self.feature_extractor.parameters())
    sc = list(self.source_classifier.parameters())
    tc = list(self.target_classifier.parameters())
    tot = fe + sc + tc
    for param in tot:
      yield param
    
  def classifier_parameters(self) -> torch.Tensor:
    """
    Parameters of the classification layer

    Yields
    ------
    torch.Tensor
      Classification layer parameter
    """
    sc = list(self.source_classifier.parameters())
    tc = list(self.target_classifier.parameters())
    tot = sc + tc
    for param in tot:
      yield param

  def feature_extractor_parameters(self) -> torch.Tensor:
    """
    Parameters of the feature extractor

    Yields
    ------
    torch.Tensor
      Feature extractor parameter
    """
    return self.feature_extractor.parameters()


### 5.1 Optimizer for symnet
Symnet uses the **AnnealingOptimizer strategy** in order **to adjust the learning rate during epochs** as proposed in the paper.

[comment]: <> (It just defines the optimizer with the right parameters.)

Additionally, again following the paper, the **learning rate for** the combined classifier (i.e. **$C_{st}$**) is set **$10$ times bigger than the feature extractor one**.

In [None]:
class SymNetOptimizer(AnnealingOptimizer):
  """
  Implements an annealing optimizer for SymNet
  """
  def __init__(self, model, nr_epochs, lr: float = 0.001, epoch: int = 0):
    super(SymNetOptimizer ,self).__init__(model, nr_epochs, lr, epoch)

    # Note that names for parameters group are important in order to update each group differently
    self.optimizer = optim.SGD([
                {'params': model.feature_extractor_parameters(), "name": "fe"},
                {'params': model.classifier_parameters(), "lr": self._compute_lr()*10, "name": "classifier"}
            ], lr=lr, momentum=0.9)

### 5.2 Notation used
In order to define the losses, we make explicit here the notation used in the following formulas:

- The classifiers are denoted as: **$C^s$** for the **source classifier**, **$C^t$** for the **target classifier** and **$C^{st}$** for the **combined classifier** (source + target);

- **$K$** is the **output dimension** of each classifier (source and target), which corresponds to the number of categories in both domains. Since in our UDA setting the number of classes in both domains is the same we get:
$K=K_s=K_t= \# \text{ of categories}$;

- $v^s(x) \in R^K$, $v^t(x) \in R^K$ and $[v^s(x),v^t(x)] \in R^{2K}$ are the output vectors of $C^s$, $C^t$ and $C^{st}$, respectively, **before the softmax operation**;

- $p^s(x) \in [0,1]^K$, $p^t(x) \in [0,1]^K$ and $p^{st}(x) \in [0,1]^{2K}$ are the output vectors of $C^s$, $C^t$ and $C^{st}$, respectively, **after the softmax operation**. To denote the $k^{th}$ element of the vector the following notation is used: $p^s_k$ (resp. $p^t_k$ and $p^{st}_k$), $k \in \{1,...,K\}$ (for $p^{st}$, $k \in \{1,...,2K\}$). <br>
_Note: $p^{st}$ is computed considering 2K classes, so it is not equal to the concatenation of $p^s$ and $p^t$_
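
The note above can be checked numerically. The following sketch (plain Python, toy logits with $K=3$ chosen arbitrarily) compares the softmax over the concatenated $2K$ logits with the concatenation of the two separate softmaxes; the helper `softmax` is written here only for the example.

```python
import math


def softmax(v):
    """Numerically stable softmax of a list of logits."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up pre-softmax outputs of C^s and C^t for one sample (K = 3)
v_s = [2.0, 0.5, -1.0]
v_t = [1.0, 1.5, 0.0]

p_s = softmax(v_s)
p_t = softmax(v_t)
p_st = softmax(v_s + v_t)   # softmax over the 2K concatenated logits
concat = p_s + p_t          # concatenation of the two K-way softmaxes

print("sum of p_st   :", round(sum(p_st), 6))
print("sum of concat :", round(sum(concat), 6))
```

$p^{st}$ is a single distribution over $2K$ entries, whereas the concatenation of $p^s$ and $p^t$ contains two distributions and therefore sums to 2.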

<br>

### Defining the losses
The [Symnet Paper](https://arxiv.org/pdf/1904.04663.pdf) defines **two losses**:
- **One for updating the weights of the three classifiers** ($C^s$, $C^t$ and $C^{st}$). We refer to this loss as "classifier loss"

- The **other for updating the weights of the feature extractor**. We refer to this loss as "feature extractor loss"

In the following sections, these two losses will be explained in greater detail.

*Note: remember that the weights for $C^{st}$  are shared with the other two classifiers*

<br>

#### Classifier loss
The **objective for updating the classifiers' weights** is the following:

$$\min_{C^s, C^t, C^{st}} \mathcal{E}^s_{task}(G,C^s) + \mathcal{E}^t_{task}(G,C^t) + \mathcal{E}^{st}_{domain}(G,C^{st})$$

It is possible to notice that the whole **objective** is **composed of three errors** which are defined as follows:

- The **error for the source task classifier** is a simple **cross entropy**, which considers only the output corresponding to the true category ($y^s_i$): $$\mathcal{E}^s_{task}(G,C^s) = - \frac{1}{n_s}\sum_{i=1}^{n_s}{\log(p^s_{y^s_i}(x^s_i))}$$


- The **same thing** is done **for the target classifier**. In this case, since **no direct supervision to learn the task classifier $C^t$** is available, the **labelled source samples are leveraged** as follows: $$\mathcal{E}^t_{task}(G,C^t) = - \frac{1}{n_s}\sum_{i=1}^{n_s}{\log(p^t_{y^s_i}(x^s_i))}$$
The use of **this loss** is **essential to provide the correspondence between $C^s$ and $C^t$**, which makes it possible to **achieve category-level domain confusion** later on, through one of the errors used to update the feature extractor.

- By using only these two errors, $C^s$ and $C^t$ learn the exact same thing, so the **third error**, which acts on the combined classifier $C^{st}$, is **needed to distinguish between the two**: $$\mathcal{E}^{st}_{domain}(G,C^{st}) = - \frac{1}{n_t}\sum_{j=1}^{n_t}{\log\bigg(\sum_{k=1}^{K}{p^{st}_{K+k}(x^t_j)}\bigg)} \\ -\frac{1}{n_s}\sum_{i=1}^{n_s}{\log\bigg(\sum_{k=1}^{K}{p^{st}_{k}(x^s_i)}\bigg)}$$ 
It is important to notice that this is **completely computed in an unsupervised manner**, so it is possible to take advantage also of target training samples. It is also possible to see $\sum_{k=1}^{K}{p^{st}_{K+k}(x^t_j)}$ and $\sum_{k=1}^{K}{p^{st}_{k}(x^s_i)}$ as the probability of classifying an input sample **x** as target or source respectively.
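
As a worked example of the third error, the sketch below evaluates $\mathcal{E}^{st}_{domain}$ on a toy batch with one source and one target sample and $K=2$. It is plain Python operating on already-computed $p^{st}$ vectors; the helper names and the probability values are assumptions for illustration, not the notebook's `source_target_loss`.

```python
import math


def domain_probabilities(p_st):
    """Split p^st into (probability of 'source', probability of 'target')."""
    K = len(p_st) // 2
    return sum(p_st[:K]), sum(p_st[K:])


def domain_discrimination_loss(source_batch, target_batch):
    """Two-way cross entropy of C^st: source samples should fall in the
    first K entries, target samples in the last K entries."""
    source_term = -sum(math.log(domain_probabilities(p)[0])
                       for p in source_batch) / len(source_batch)
    target_term = -sum(math.log(domain_probabilities(p)[1])
                       for p in target_batch) / len(target_batch)
    return target_term + source_term


# Made-up p^st vectors (K = 2, so 4 entries each, summing to 1)
source_batch = [[0.6, 0.2, 0.1, 0.1]]   # 0.8 of the mass on the source half
target_batch = [[0.1, 0.2, 0.3, 0.4]]   # 0.7 of the mass on the target half

loss = domain_discrimination_loss(source_batch, target_batch)
print(f"domain discrimination loss: {loss:.4f}")
```

No category label is needed: only the domain each sample comes from determines which half of $p^{st}$ is summed.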

<br>

#### Feature extractor loss

As in other adversarial training strategies for domain adaptation, the **aim** is to find a **feature extractor G that is invariant to the domain**; in other words, we are looking for a feature extractor that generalizes better. **To do that**, the [paper](https://arxiv.org/pdf/1904.04663.pdf) proposes a "**two-level domain confusion**" method based on a domain-level confusion loss and a category-level confusion loss.
The objective for updating the feature extractor is the following:

$$\min_{G} \mathcal{F}^{st}_{category}(G, C^{st}) + \lambda (\mathcal{F}^{st}_{domain}(G, C^{st}) + \mathcal{M}^{st}(G, C^{st}))$$

Where $\lambda \in [0,1]$ is a **trade-off parameter to suppress noisy signals** of $\mathcal{F}^{st}_{domain}(G, C^{st})$ and $\mathcal{M}^{st}(G, C^{st})$ **at early stages of training**. *This is because* at the beginning, convolutional features aren't extracting meaningful information (since the network is not trained yet), so *we need better convolutional features before starting to confuse them*.

As for the classifier objective, it is possible to distinguish three distinct terms:

- **Category-level confusion loss** using **labeled source samples**: $$\mathcal{F}^{st}_{category}(G, C^{st}) = -\frac{1}{2n_s}\sum_{i=1}^{n_s}{\log(p^{st}_{y^s_i + K}(x^s_i))} \\ -\frac{1}{2n_s}\sum_{i=1}^{n_s}{\log(p^{st}_{y^s_i}(x^s_i))}$$

- **Domain-level confusion loss** using **unlabeled target samples**:
$$\mathcal{F}^{st}_{domain}(G,C^{st}) = - \frac{1}{2n_t}\sum_{j=1}^{n_t}{\log\bigg(\sum_{k=1}^{K}{p^{st}_{K+k}(x^t_j)}\bigg)} \\ -\frac{1}{2n_t}\sum_{j=1}^{n_t}{\log\bigg(\sum_{k=1}^{K}{p^{st}_{k}(x^t_j)}\bigg)}$$

- **Entropy minimization principle**:
$$\mathcal{M}^{st}(G, C^{st}) = - \frac{1}{n_t}\sum_{j=1}^{n_t}\sum_{k=1}^{K}q^{st}_k(x^t_j)\log(q^{st}_k(x^t_j))$$
where $q^{st}_k(x^t_j) = p^{st}_k(x^t_j) + p^{st}_{K+k}(x^t_j)$. The above entropy minimization objective enhances discrimination among task categories.
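
To make the effect of the entropy term concrete, the sketch below evaluates it on two made-up outputs of the joint classifier: one where both halves agree on a single category and one that is completely uniform. The helper name `entropy_of_q` and the logits are assumptions chosen only for this example.

```python
import torch

def entropy_of_q(logits_st: torch.Tensor) -> torch.Tensor:
    """Mean entropy of q_k = p_k + p_{K+k}, computed from the 2K joint logits."""
    k = logits_st.size(1) // 2
    p = torch.softmax(logits_st, dim=1)
    q = p[:, :k] + p[:, k:]
    return -(q * q.log()).sum(dim=1).mean()

# hypothetical outputs for K = 3 categories
confident = torch.tensor([[8.0, 0.0, 0.0, 8.0, 0.0, 0.0]])  # both halves pick category 1
uncertain = torch.zeros(1, 6)                                # uniform over all outputs

print(entropy_of_q(confident).item(), entropy_of_q(uncertain).item())
```

The uniform output reaches the maximum entropy $\log K$, while the confident one is close to 0, so minimizing this term pushes target predictions towards a single category regardless of which half of the joint classifier produces them.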

In [None]:
def source_loss(output_source, label):
  """
  Cross entropy loss of source classifier C_s for source samples (equation 5 of the paper)

  Parameters
  ----------
  output_source: torch.Tensor
    Output batch of the network. Notice that in order to let the algorithm work correctly, this should
    be the output of the source classifier
  
  label: torch.Tensor
    Labels corresponding to the samples whose output is computed

  Returns
  -------
  torch.Tensor
    The result of the computed loss for the entire batch
  """
  loss_fun = nn.CrossEntropyLoss()
  loss = loss_fun(output_source, label)
  return loss

def target_loss(output_target, label):
  """
  Cross entropy loss of target classifier C_t for source samples (equation 6 of the paper)

  Parameters
  ----------
  output_target: torch.Tensor
    Output batch of the network. Notice that in order to let the algorithm work correctly, this should
    be the output of the target classifier
  
  label: torch.Tensor
    Labels corresponding to the samples whose output is computed

  Returns
  -------
  torch.Tensor
    The result of the computed loss for the entire batch
  """
  return source_loss(output_target, label)

def source_target_loss(output, st = True):
  """
  Two-way cross-entropy loss for the joint classifier C_st (equation 7 of the paper)

  Parameters
  ----------
  output: torch.Tensor
    Output batch of the network. Notice that in order to let the algorithm work correctly, this should
    be the output of the combined source-target classifier
  st: bool
    True if train batch belongs to source, False if belongs to target
  
  Returns
  -------
  torch.Tensor
    The result of the computed loss for the entire batch

  """
  n_classes = int(output.size(1)/2)
  soft = nn.Softmax(dim=1)
  prob_out = soft(output)
  if st:
    loss = -(prob_out[:,:n_classes].sum(1).log().mean())
  else:
    loss = -(prob_out[:,n_classes:].sum(1).log().mean())
  return loss

def feature_category_loss(output_st, label):
  """
  Category level confusion loss (equation 8 of the Symnet paper)

  Parameters
  ----------
  output_st: torch.Tensor
    Output batch of the network. Notice that in order to let the algorithm work correctly, this should
    be the output of the combined source-target classifier
  
  label: torch.Tensor
    Labels corresponding to the samples whose output is computed
  
  Returns
  -------
  torch.Tensor
    The result of the computed loss for the entire batch

  """
  n_classes = int(output_st.size(1)/2)

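  # Note: each half of the joint output is normalised on its own here (a softmax
  # over K logits), whereas p^st in equation 8 is a softmax over all 2K outputs.
  loss_fun = nn.CrossEntropyLoss()

  loss_1 = loss_fun(output_st[:, :n_classes], label)/2
  loss_2 = loss_fun(output_st[:, n_classes:], label)/2
  return loss_1 + loss_2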

def feature_domain_loss(output_st):
  """
  Domain level confusion loss (equation 9 of the Symnet paper)

  Parameters
  ----------
  output_st: torch.Tensor
    Output batch of the network. Notice that in order to let the algorithm work correctly, this should
    be the output of the combined source-target classifier
  
  Returns
  -------
  torch.Tensor
    The result of the computed loss for the entire batch

  """
  n_classes = int(output_st.size(1)/2)

  soft = nn.Softmax(dim=1)
  prob_out = soft(output_st)

  loss_1 = -(prob_out[:,:n_classes]).sum(1).log().mean()/2
  loss_2 = -(prob_out[:,n_classes:]).sum(1).log().mean()/2

  return loss_1 + loss_2



def entropyMinimizationPrinciple(output_st):
    """
    Entropy minimization principle (equation 10 of the Symnet paper)

    Parameters
    ----------
    output_st: torch.Tensor
      Output batch of the network. Notice that in order to let the algorithm work correctly, this should
      be the output of the combined source-target classifier
    
    Returns
    -------
    torch.Tensor
      The corresponding entropy minimization loss for the entire batch
    """
    nr_classes = int(output_st.size(1)/2)
    soft = nn.Softmax(dim=1)
    prob_out = soft(output_st)

    p_st_source = prob_out[:, :nr_classes]
    p_st_target = prob_out[:, nr_classes:]
    qst = p_st_source + p_st_target

    emp = -qst.log().mul(qst).sum(1).mean()

    return emp

### 5.4 Training step
_Note: the high-level procedure is very similar to the one used for training the baseline and the upper bound (i.e. forward step / compute loss / backward step / update weights / zero gradients), but some details are worth mentioning, so we will focus on them._

The following is the process that we used to train Symnet:

- For each training step the input batch contains half source samples and half target samples, so that each batch is balanced.

- Then comes the tricky part of computing the gradients and updating the weights:

  1. First, the classifier loss is computed (equations 5, 6 and 7 of the paper) and backpropagated.
  2. Then the gradients of the feature extractor are set to 0, because we don't want to update its weights with gradients computed from the classifier loss.
  3. After that, we save the computed gradients of both classifiers; otherwise the backward pass of the feature extractor loss would accumulate its own gradients on top of them.
  4. Now we can safely compute the feature extractor loss and perform its backward pass.
  5. After that, we manually overwrite the gradients of the classifiers (note that at this point they are the ones computed from the feature extractor loss) with the gradients saved in **step 3**.
  6. Finally, the optimizer step and the zeroing of the gradients can be performed safely.

- Finally, cumulative accuracy and loss are computed.
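
The six steps above can be sketched on a toy model as follows. This is only a minimal illustration under stated assumptions: the two `nn.Linear` layers stand in for the feature extractor and the classifiers, and the two losses are arbitrary placeholders, not the Symnet objectives. The reference gradients computed with `torch.autograd.grad` are there only to check that each parameter group ends up with the gradient of its own loss.

```python
import torch
from torch import nn

torch.manual_seed(0)
feature_extractor = nn.Linear(4, 3)   # hypothetical stand-in for G
classifier = nn.Linear(3, 2)          # hypothetical stand-in for the classifiers
feat_params = list(feature_extractor.parameters())
clf_params = list(classifier.parameters())
optimizer = torch.optim.SGD(feat_params + clf_params, lr=0.1)

out = classifier(feature_extractor(torch.randn(8, 4)))
classifier_loss = out.pow(2).mean()   # placeholder for the classifier objective
feature_loss = out.abs().mean()       # placeholder for the feature extractor objective

# reference gradients, used only to check the procedure at the end
ref_clf = torch.autograd.grad(classifier_loss, clf_params, retain_graph=True)
ref_feat = torch.autograd.grad(feature_loss, feat_params, retain_graph=True)

# steps 1-3: backpropagate the classifier loss, keep only the classifier gradients
classifier_loss.backward(retain_graph=True)
for p in feat_params:
    p.grad.zero_()
saved = [p.grad.clone() for p in clf_params]
for p in clf_params:
    p.grad.zero_()

# step 4: backpropagate the feature extractor loss
feature_loss.backward()

# step 5: restore the classifier gradients saved in step 3
for p, g in zip(clf_params, saved):
    p.grad.copy_(g)

# each group now holds the gradient of its own loss only
clf_ok = all(torch.allclose(p.grad, r) for p, r in zip(clf_params, ref_clf))
feat_ok = all(torch.allclose(p.grad, r) for p, r in zip(feat_params, ref_feat))

# step 6: a single optimizer step, then reset the gradients
optimizer.step()
optimizer.zero_grad()
```

A possible alternative, not used in this notebook, would be to obtain the per-group gradients directly with `torch.autograd.grad` and assign them to `.grad`, which avoids the save and restore at the cost of a slightly less familiar training loop.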

### Other methods tried (but not correct):
We tried other strategies to update the gradients, but none of them worked:

#### 1<sup>st</sup> trial
Using only one optimizer, without saving the gradients:

1. Compute classifier loss
2. Perform the backward pass
3. Update weights: `optimizer.step()`
4. Zero the gradients: `optimizer.zero_grad()`
5. Perform the same thing for the feature extractor loss
6. ....

The problem with this method is that the feature extractor parameters are also updated with the first loss (in addition to the classifier ones), and the classifier parameters are also updated with the second loss (in addition to the feature extractor ones). This doesn't make sense: otherwise we wouldn't need two separate losses for updating the two separate groups of parameters in the first place.

#### 2<sup>nd</sup> trial
Use two optimizers, each updating one group of parameters at a time (_Note: we will call optimizer1 the optimizer for the classifier weights and optimizer2 the one for the feature extractor weights_):

1. Compute classifier loss
2. Perform the backward pass
3. Update classifier weights: `optimizer1.step()`
4. Zero optimizer1 gradients: `optimizer1.zero_grad()`
5. Compute feature extractor loss
6. Compute the backward pass
7. Update feature extractor weights: `optimizer2.step()`
8. Zero optimizer2 gradients: `optimizer2.zero_grad()`

The problem with this approach is that an error occurs when computing the second backward pass (step 6). This happens because PyTorch keeps track of the version number of each tensor, which is incremented whenever an in-place operation modifies the tensor value (__Attention: not the value of the gradient, but the actual tensor__). So when we update the classifier weights (step 3), the tensors holding those weights are modified in place and their version number no longer matches the one recorded during the forward pass. The backward pass of the feature extractor loss still has to go through those weights, so autograd detects the mismatch and raises an error.
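
The failure of the second trial can be reproduced on a toy model. The sketch below is an assumption-laden miniature, not the Symnet code: two `nn.Linear` layers stand in for the feature extractor and a classifier, and the losses are arbitrary placeholders. It steps the classifier optimizer between the two backward passes and records the version counter of the classifier weight (the private `_version` attribute of the tensor) before and after.

```python
import torch
from torch import nn

torch.manual_seed(0)
feature_extractor = nn.Linear(4, 3)   # hypothetical stand-in for G
classifier = nn.Linear(3, 2)          # hypothetical stand-in for a classifier
optimizer1 = torch.optim.SGD(classifier.parameters(), lr=0.1)

out = classifier(feature_extractor(torch.randn(8, 4)))
classifier_loss = out.pow(2).mean()   # placeholder losses sharing one forward pass
feature_loss = out.abs().mean()

version_before = classifier.weight._version
classifier_loss.backward(retain_graph=True)
optimizer1.step()                     # in-place update of the classifier weights
version_after = classifier.weight._version

try:
    # the graph still refers to the classifier weights as they were in the forward pass
    feature_loss.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(version_before, version_after)
print(error_message)
```

The version counter grows after `optimizer1.step()`, and the second backward pass is rejected because a tensor needed for the gradient computation has been modified in place.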



In [None]:
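def training_step_uda(net, src_data_loader, target_data_loader, optimizer, lam, e, device=cuda):
  """
  One training epoch of Symnet: classifier and feature extractor losses are
  backpropagated separately and applied with a single optimizer step per batch.
  `lam` is the trade-off parameter of the feature extractor loss; `e` (the epoch
  index) is currently not used inside the function.
  """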
  source_samples = 0.
  target_samples = 0.
  cumulative_classifier_loss = 0.
  cumulative_feature_loss = 0.
  cumulative_accuracy = 0.

  target_iter = iter(target_data_loader)

  net.train()

  # iterate over the training set
  for batch_idx, (inputs_source, labels) in enumerate(src_data_loader):
    try:
      inputs_target, _ = next(target_iter)
    except StopIteration:
      # target loader exhausted before the source loader: restart it
      target_iter = iter(target_data_loader)
      inputs_target, _ = next(target_iter)
    inputs_target = inputs_target.to(device)
    
    # load data into GPU
    inputs_source = inputs_source.to(device)
    labels = labels.to(device)

    length_source_input = inputs_source.shape[0]

    ## concatenation along batch dimension.
    inputs = torch.cat((inputs_source, inputs_target), dim=0)

    # forward pass
    c_s, c_t, c_st = net(inputs)

    c_s_source = c_s[:length_source_input,:]
    c_s_target = c_s[length_source_input:,:]

    c_t_source = c_t[:length_source_input,:]
    c_t_target = c_t[length_source_input:,:]

    c_st_source = c_st[:length_source_input,:]
    c_st_target = c_st[length_source_input:,:]


    # Equation 5 of the paper
    error_source_task = source_loss(c_s_source, labels)

    # Equation 6 of the paper
    error_target_task = target_loss(c_t_source, labels)

    # Equation 7 of the paper
    domain_loss_source = source_target_loss(c_st_source)
    domain_loss_target = source_target_loss(c_st_target, st = False)
    error_domain = domain_loss_source + domain_loss_target

    classifier_total_loss = error_source_task + error_target_task + error_domain

    # Retain graph needed because otherwise the parts of the computation graph
    # needed to compute classifier_total_loss will be freed up, but we
    # need those parts in order to compute the next loss
    classifier_total_loss.backward(retain_graph = True)

    for param in net.feature_extractor.parameters():
      param.grad.data.zero_()
    
    class_params = []
    for param in net.source_classifier.parameters():
      class_params.append(param.grad.data.clone())
      param.grad.data.zero_()
    for param in net.target_classifier.parameters():
      class_params.append(param.grad.data.clone())
      param.grad.data.zero_()

    # Equation 8 of the paper
    error_feature_category = feature_category_loss(c_st_source, labels)

    # Equation 9 of the paper
    error_feature_domain = feature_domain_loss(c_st_target)

    min_entropy = entropyMinimizationPrinciple(c_st_target)

    # Equations 11 of the paper
    feature_total_loss = error_feature_category + lam * (error_feature_domain + min_entropy)

    feature_total_loss.backward()

    idx = 0
    for param in net.source_classifier.parameters():
      param.grad.data = class_params[idx]
      idx += 1
    for param in net.target_classifier.parameters():
      param.grad.data = class_params[idx]
      idx += 1

    
    optimizer.step()
    optimizer.zero_grad()
    


    # print statistics
    source_samples+=inputs_source.shape[0]
    target_samples+=inputs_target.shape[0]
    
    cumulative_classifier_loss += classifier_total_loss.item()
    cumulative_feature_loss += feature_total_loss.item()
    _, predicted = c_s_source.max(dim = 1) ## to get the maximum probability
    cumulative_accuracy += predicted.eq(labels).sum().item()

  return cumulative_classifier_loss/source_samples, cumulative_feature_loss/target_samples, cumulative_accuracy/source_samples*100


### 5.6 Test step
The test step is very similar to the ones used for the baseline and the upper bound.
The only detail worth mentioning is that the predictions considered are those of the target classifier $C^t$.

In [None]:
def test_step_uda(net, data_target_test_loader, device=cuda):

    '''
    Evaluate the network on the target test set using the target classifier C_t.

    Params
    ------

    net : model 
    data_target_test_loader : DataLoader obj of the target domain to test on
    device : GPU or CPU device

    Returns
    -------
    Cumulative loss divided by the number of samples, and accuracy (in %)

    '''

    samples = 0.
    cumulative_loss = 0.
    cumulative_accuracy = 0.

    net.eval()

    with torch.no_grad():

        for batch_idx, (inputs, labels) in enumerate(data_target_test_loader):

            # load data into GPU
            inputs = inputs.to(device)
            targets = labels.to(device)
        
            # forward pass
            _, c_t, _ = net(inputs)

            # apply the loss
            loss = target_loss(c_t, targets)

            # print statistics
            samples+=inputs.shape[0]
            cumulative_loss += loss.item() # Note: the .item() is needed to extract scalars from tensors
            _, predicted = c_t.max(1)
            cumulative_accuracy += predicted.eq(targets).sum().item()

    return cumulative_loss/samples, cumulative_accuracy/samples*100

### 5.7 Main

In [None]:
from torch.utils.tensorboard import SummaryWriter
import math

def main_uda(source_train_loader,
             target_train_loader,
             target_test_loader,
             device=cuda,
             epochs=15,
             nr_classes = num_classes, 
             img_root=rootdir,
            ):
    
  # writer = SummaryWriter(log_dir="gdrive/My Drive/Colab Notebooks/runs/exp2")
  ## DataLoader split the size of the given dataset into #of elements in the dataset/batch size
  
  print('DataLoaders Done')
  net = SymNet().to(device)
  print('Network Init Done')
  optimizer = SymNetOptimizer(model = net, nr_epochs = epochs)
  print('Got optimizers')

  for e in range(epochs):
    lam = 2 / (1 + math.exp(-1 * 10 * e / epochs)) - 1

    train_ce_loss, train_en_loss, train_accuracy = training_step_uda(net=net, src_data_loader=source_train_loader, 
                                                        target_data_loader=target_train_loader, 
                                                        optimizer=optimizer, lam=lam, e=e, device=device)
    torch.cuda.empty_cache()

    print(f'Epoch: {e+1:d}')
    print(f'\t Train: CE loss {train_ce_loss:.5f}, Entropy loss {train_en_loss:.5f}, Accuracy {train_accuracy:.2f}')
    optimizer.update_lr()

  test_loss, test_accuracy = test_step_uda(net, target_test_loader, device)
  return test_accuracy

## 6. Evaluation of the proposed method
### 6.1 Product $\to$ Real World
Here we compare the proposed method with the baseline and the upper bound, using the Product dataset as source and the Real World dataset as target.
First we need to get the dataloaders:

In [None]:
# Loading the dataset:
product_train_loader, product_test_loader, rw_train_loader, rw_test_loader = get_data(BATCH_SIZE, rootdir)

#### 6.1.1 Baseline

In [None]:
acc_base = main(product_train_loader, rw_test_loader, runs_dir=runs_dir)
print(f"Baseline accuracy Product -> Real World: {acc_base}")

Network Init Done
Got Optimizer
Got Cost Function
DataLoaders Done
Time to train!
Epoch 0:
Training loss: 0.01795370575040579 
 Training accuracy: 40.25
Epoch 1:
Training loss: 0.00427818444557488 
 Training accuracy: 91.9375
Epoch 2:
Training loss: 0.0017345193773508072 
 Training accuracy: 96.3125
Epoch 3:
Training loss: 0.0012162890005856753 
 Training accuracy: 97.75
Epoch 4:
Training loss: 0.000965243827085942 
 Training accuracy: 98.5625
Epoch 5:
Training loss: 0.000824839398264885 
 Training accuracy: 98.9375
Epoch 6:
Training loss: 0.0007277771341614426 
 Training accuracy: 99.1875
Epoch 7:
Training loss: 0.0006549248937517405 
 Training accuracy: 99.3125
Epoch 8:
Training loss: 0.0005981729598715901 
 Training accuracy: 99.375
Epoch 9:
Training loss: 0.00055244832765311 
 Training accuracy: 99.4375
Epoch 10:
Training loss: 0.0005146433413028717 
 Training accuracy: 99.5
Epoch 11:
Training loss: 0.00048276068177074196 
 Training accuracy: 99.5625
Epoch 12:
Training loss: 0.0004

#### 6.1.2 Upper bound

In [None]:
acc_upperbound = main(rw_train_loader, rw_test_loader, img_root=rootdir_alessandro, runs_dir=runs_dir_alessandro, mode="u")
print(f"Upper Bound accuracy Product -> Real World: {acc_upperbound}")

Network Init Done
Got Optimizer
Got Cost Function
DataLoaders Done
Time to train!
Epoch 0:
Training loss: 0.02011194162070751 
 Training accuracy: 29.562500000000004
Epoch 1:
Training loss: 0.007576549611985684 
 Training accuracy: 82.4375
Epoch 2:
Training loss: 0.003562918156385422 
 Training accuracy: 91.8125
Epoch 3:
Training loss: 0.00237793386913836 
 Training accuracy: 95.8125
Epoch 4:
Training loss: 0.0019154228921979665 
 Training accuracy: 96.75
Epoch 5:
Training loss: 0.0016147012542933226 
 Training accuracy: 97.625
Epoch 6:
Training loss: 0.00141404977068305 
 Training accuracy: 98.125
Epoch 7:
Training loss: 0.0012683934019878506 
 Training accuracy: 98.375
Epoch 8:
Training loss: 0.001151873115450144 
 Training accuracy: 98.625
Epoch 9:
Training loss: 0.0010581590654328466 
 Training accuracy: 98.875
Epoch 10:
Training loss: 0.000981118776835501 
 Training accuracy: 99.125
Epoch 11:
Training loss: 0.0009162570117041468 
 Training accuracy: 99.25
Epoch 12:
Training loss: 

#### 6.1.3 Domain Adaptation

In [None]:
acc_da = main_uda(product_train_loader, rw_train_loader, rw_test_loader, img_root=rootdir)
print(f"Domain adaptation accuracy Product -> Real World: {acc_da}")

DataLoaders Done


Network Init Done
Got optimizers
Epoch: 1
	 Train: CE loss 0.04573, Entropy loss 0.01776, Accuracy 42.88
	 Test: CE loss 0.01343, Accuracy 70.75
-----------------------------------------------------
Epoch: 2
	 Train: CE loss 0.01443, Entropy loss 0.00628, Accuracy 91.88
	 Test: CE loss 0.00911, Accuracy 72.25
-----------------------------------------------------
Epoch: 3
	 Train: CE loss 0.00750, Entropy loss 0.00577, Accuracy 97.50
	 Test: CE loss 0.00739, Accuracy 78.50
-----------------------------------------------------
Epoch: 4
	 Train: CE loss 0.00576, Entropy loss 0.00701, Accuracy 98.38
	 Test: CE loss 0.00656, Accuracy 80.75
-----------------------------------------------------
Epoch: 5
	 Train: CE loss 0.00549, Entropy loss 0.00803, Accuracy 98.75
	 Test: CE loss 0.00583, Accuracy 82.00
-----------------------------------------------------
Epoch: 6
	 Train: CE loss 0.00586, Entropy loss 0.00863, Accuracy 99.12
	 Test: CE loss 0.00498, Accuracy 84.00
-------------------------

### 6.2 Real World $\to$ Product
#### 6.2.1 Baseline

In [None]:
acc_base = main(rw_train_loader, product_test_loader, img_root=rootdir, runs_dir=runs_dir)
print(f"Baseline accuracy Real World -> Product: {acc_base}")

Network Init Done
Got Optimizer
Got Cost Function
DataLoaders Done
Time to train!
Epoch 0:
Training loss: 0.0195840073376894 
 Training accuracy: 30.8125
Epoch 1:
Training loss: 0.007453262209892273 
 Training accuracy: 81.0625
Epoch 2:
Training loss: 0.003515565562993288 
 Training accuracy: 91.9375
Epoch 3:
Training loss: 0.0023773890081793068 
 Training accuracy: 95.25
Epoch 4:
Training loss: 0.0019036665000021458 
 Training accuracy: 96.6875
Epoch 5:
Training loss: 0.0016059506125748158 
 Training accuracy: 97.25
Epoch 6:
Training loss: 0.0014053122233599424 
 Training accuracy: 97.9375
Epoch 7:
Training loss: 0.0012587914103642107 
 Training accuracy: 98.375
Epoch 8:
Training loss: 0.0011426056222990156 
 Training accuracy: 98.4375
Epoch 9:
Training loss: 0.0010492381127551198 
 Training accuracy: 98.9375
Epoch 10:
Training loss: 0.000972515526227653 
 Training accuracy: 99.25
Epoch 11:
Training loss: 0.0009080296196043491 
 Training accuracy: 99.375
Epoch 12:
Training loss: 0.000

#### 6.2.2 Upper bound

In [None]:
acc_upperbound = main(product_train_loader, product_test_loader, runs_dir=runs_dir)
print(f"Upper Bound accuracy Real World -> Product: {acc_upperbound}")

Network Init Done
Got Optimizer
Got Cost Function
Epoch 0:
Training loss: 0.01809109877794981 
 Training accuracy: 40.4375
Epoch 1:
Training loss: 0.004270306602120399 
 Training accuracy: 91.75
Epoch 2:
Training loss: 0.0016956347646191717 
 Training accuracy: 96.1875
Epoch 3:
Training loss: 0.0011931087728589774 
 Training accuracy: 97.4375
Epoch 4:
Training loss: 0.0009483035514131189 
 Training accuracy: 97.9375
Epoch 5:
Training loss: 0.000811921963468194 
 Training accuracy: 98.8125
Epoch 6:
Training loss: 0.000718159805983305 
 Training accuracy: 99.3125
Epoch 7:
Training loss: 0.0006481324951164425 
 Training accuracy: 99.4375
Epoch 8:
Training loss: 0.0005935926060192287 
 Training accuracy: 99.4375
Epoch 9:
Training loss: 0.000549595607444644 
 Training accuracy: 99.5
Epoch 10:
Training loss: 0.0005132094933651388 
 Training accuracy: 99.5625
Epoch 11:
Training loss: 0.000482511242153123 
 Training accuracy: 99.5625
Epoch 12:
Training loss: 0.00045620030490681527 
 Training a

#### 6.2.3 Domain Adaptation

In [None]:
acc_da = main_uda(rw_train_loader, product_train_loader, product_test_loader, img_root=rootdir)
print(f"Domain adaptation accuracy Real World -> Product: {acc_da}")

DataLoaders Done


Network Init Done
Got optimizers
Epoch: 1
	 Train: CE loss 0.05249, Entropy loss 0.02102, Accuracy 22.38
	 Test: CE loss 0.01339, Accuracy 75.75
-----------------------------------------------------
Epoch: 2
	 Train: CE loss 0.02471, Entropy loss 0.01514, Accuracy 77.75
	 Test: CE loss 0.00612, Accuracy 86.75
-----------------------------------------------------
Epoch: 3
	 Train: CE loss 0.01413, Entropy loss 0.01249, Accuracy 90.75
	 Test: CE loss 0.00305, Accuracy 92.25
-----------------------------------------------------
Epoch: 4
	 Train: CE loss 0.01069, Entropy loss 0.01202, Accuracy 93.69
	 Test: CE loss 0.00251, Accuracy 92.25
-----------------------------------------------------
Epoch: 5
	 Train: CE loss 0.00963, Entropy loss 0.01180, Accuracy 95.44
	 Test: CE loss 0.00204, Accuracy 92.75
-----------------------------------------------------
Epoch: 6
	 Train: CE loss 0.00914, Entropy loss 0.01152, Accuracy 96.50
	 Test: CE loss 0.00179, Accuracy 93.00
-------------------------

## 7. Testing statistical significance of the result
It's easy to see from the accuracies above that the domain adaptation method outperforms the baseline, but we want to be sure that the result is statistically significant.
To do that, we follow the method proposed in the paper [Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.3325&rep=rep1&type=pdf). To be precise (since the paper presents several methods), we refer to the method proposed in section 3.5: **"The 5x2cv paired t test"**.
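
As a worked example of the statistic, the 5x2cv paired t test computes, for each of the 5 replications $i$, the two fold-wise differences $p_i^{(1)}, p_i^{(2)}$ between the models, their mean $\bar{p}_i$ and the variance estimate $s_i^2 = (p_i^{(1)} - \bar{p}_i)^2 + (p_i^{(2)} - \bar{p}_i)^2$, and then $t = p_1^{(1)} / \sqrt{\frac{1}{5}\sum_{i=1}^{5} s_i^2}$, which is compared against a t distribution with 5 degrees of freedom. The sketch below applies this to made-up accuracy differences; the numbers are hypothetical and are not results of this notebook, and the two-sided p-value is one possible reading of the test.

```python
import math
from scipy import stats

# hypothetical accuracy differences (model 1 - model 2), one pair per replication:
# (difference on fold 1, difference on fold 2)
diffs = [(-8.0, -9.5), (-7.0, -10.0), (-9.0, -8.5), (-6.5, -9.0), (-8.5, -7.5)]

def five_by_two_cv_t(diffs):
    """5x2cv paired t statistic: p_1^(1) over the root of the mean per-replication variance."""
    variances = []
    for p1, p2 in diffs:
        p_mean = (p1 + p2) / 2
        variances.append((p1 - p_mean) ** 2 + (p2 - p_mean) ** 2)
    return diffs[0][0] / math.sqrt(sum(variances) / len(variances))

t_stat = five_by_two_cv_t(diffs)
p_value = 2 * stats.t.sf(abs(t_stat), df=5)   # two-sided, 5 degrees of freedom

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```

Only the first difference of the first replication appears in the numerator, while all five replications contribute to the variance estimate in the denominator.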

In [None]:
import math
from functools import partial
from typing import List, Tuple

import numpy as np
import scipy.stats as stat


def student_test(model1, model2, replications:int, root_dir:str, random_sate_list:List[int], alpha:float, degree_of_freedom: int, batch_size=BATCH_SIZE) -> Tuple[float, str]:
  ''' Function that performs the 5x2cv paired t-test on two models
    Params:
    ------
      model1:
        first model to compare
      model2:
        second model to compare
      replications: int
        number of replications of 2-fold cross-validation 
      root_dir: str
        path to adaptiope_small
      random_sate_list: list(int)
        list of random states, one for each replication of the dataset split
      alpha: float
        significance level used to establish statistical significance
      degree_of_freedom: int
        degrees of freedom of the t distribution (5 for the 5x2cv test)
      batch_size: int
        batch size passed to get_data
    Return:
    ------
      tuple of p-value and a string stating whether the difference is statistically significant or not
  '''

  p1s = []
  p2s = []
  variances = []
  for rep in range(replications):
    source_train_loader, source_test_loader, target_train_loader, target_test_loader = get_data(batch_size=batch_size, root_dir=root_dir, random_state=random_sate_list[rep], test_split=0.5)
    tst_acc_model1 = model1(source_train_loader, target_test_loader)
    tst_acc_model2 = model2(source_train_loader, target_train_loader, target_test_loader)
    p1 = tst_acc_model1 - tst_acc_model2
    tst_acc_model1 = model1(source_test_loader, target_train_loader)
    tst_acc_model2 = model2(source_test_loader, target_test_loader, target_train_loader)
    p2 = tst_acc_model1 - tst_acc_model2
    if rep == 0:
      # save the differences of the very first replication
      p11 = p1
      p21 = p2
  
    p_mean = (p1+p2)/2
    # s^2 = (p1-p_mean)^2 + (p2-p_mean)^2
    variance_i = (p1 - p_mean)**2 + (p2 - p_mean)**2
    variances.append(variance_i)
  
  variances = np.array(variances)
  t = p11 / (math.sqrt(np.mean(variances)))
  # two-sided p-value: the null hypothesis is "no difference" in either direction
  p_value = 2 * stat.t.sf(abs(t), df=degree_of_freedom)

  if p_value < alpha:
    string = 'Null hypothesis rejected. Significant difference in performance between models.'
  else :
    string = 'No significant difference.'

  return p_value, string

In [None]:
p_value, message = student_test(model1 = partial(main), model2 = partial(main_uda), replications=5, root_dir=rootdir, random_sate_list=[91, 11, 57, 822, 19], alpha=0.05, degree_of_freedom=5)
print(f"(p-value, message) : {p_value, message}")

Network Init Done
Got Optimizer
Got Cost Function
Epoch 0:
Training loss: 0.022471324920654297 
 Training accuracy: 17.5
Epoch 1:
Training loss: 0.009985714435577393 
 Training accuracy: 83.89999999999999
Epoch 2:
Training loss: 0.00423361474275589 
 Training accuracy: 95.39999999999999
Epoch 3:
Training loss: 0.0025652790963649748 
 Training accuracy: 96.7
Epoch 4:
Training loss: 0.0018961628526449204 
 Training accuracy: 97.1
Epoch 5:
Training loss: 0.0015410507023334503 
 Training accuracy: 97.5
Epoch 6:
Training loss: 0.001326120674610138 
 Training accuracy: 97.8
Epoch 7:
Training loss: 0.0011817959696054459 
 Training accuracy: 98.1
Epoch 8:
Training loss: 0.0010747862160205841 
 Training accuracy: 98.7
Epoch 9:
Training loss: 0.0009900044724345208 
 Training accuracy: 99.0
Epoch 10:
Training loss: 0.0009205811321735382 
 Training accuracy: 99.1
Epoch 11:
Training loss: 0.0008625919073820114 
 Training accuracy: 99.4
Epoch 12:
Training loss: 0.0008132843896746636 
 Training accur

In [None]:
baseline_accuracies = np.empty(0)
uda_accuracies = np.empty(0)

# randomly chosen seeds for the dataset split
seeds = [91, 11, 57, 822, 19]

for seed in seeds:
  product_train_loader, product_test_loader, rw_train_loader, rw_test_loader = get_data(128, rootdir, random_state=seed)

  base_acc = main(product_train_loader, rw_test_loader, runs_dir=runs_dir)
  baseline_accuracies = np.append(baseline_accuracies, base_acc)

  uda_acc = main_uda(product_train_loader, rw_train_loader, rw_test_loader, img_root=rootdir)
  uda_accuracies = np.append(uda_accuracies, uda_acc)

  print(f"baseline_accuracies: {baseline_accuracies}")
  print(f"uda accuracies: {uda_accuracies}")


Network Init Done
Got Optimizer
Got Cost Function
Epoch 0:
Training loss: 0.018003707081079484 
 Training accuracy: 41.9375
Epoch 1:
Training loss: 0.004193783309310675 
 Training accuracy: 91.375
Epoch 2:
Training loss: 0.0017572841746732592 
 Training accuracy: 95.9375
Epoch 3:
Training loss: 0.0012142252689227463 
 Training accuracy: 97.375
Epoch 4:
Training loss: 0.0009767045616172255 
 Training accuracy: 98.125
Epoch 5:
Training loss: 0.0008371801604516805 
 Training accuracy: 98.5625
Epoch 6:
Training loss: 0.0007409020862542093 
 Training accuracy: 98.75
Epoch 7:
Training loss: 0.0006697058188728988 
 Training accuracy: 98.9375
Epoch 8:
Training loss: 0.0006136374524794519 
 Training accuracy: 99.25
Epoch 9:
Training loss: 0.0005680043110623956 
 Training accuracy: 99.4375
Epoch 10:
Training loss: 0.0005301959649659693 
 Training accuracy: 99.5
Epoch 11:
Training loss: 0.0004983114521019161 
 Training accuracy: 99.5625
Epoch 12:
Training loss: 0.00047096268506720664 
 Training a

Network Init Done
Got Optimizer
Got Cost Function
Epoch 0:
Training loss: 0.01821212224662304 
 Training accuracy: 40.25
Epoch 1:
Training loss: 0.004134849980473519 
 Training accuracy: 92.875
Epoch 2:
Training loss: 0.001741063268855214 
 Training accuracy: 96.5625
Epoch 3:
Training loss: 0.0012065851641818882 
 Training accuracy: 97.9375
Epoch 4:
Training loss: 0.000961151747033 
 Training accuracy: 98.4375
Epoch 5:
Training loss: 0.000817558285780251 
 Training accuracy: 98.75
Epoch 6:
Training loss: 0.000719412041362375 
 Training accuracy: 99.0
Epoch 7:
Training loss: 0.0006469863979145884 
 Training accuracy: 99.25
Epoch 8:
Training loss: 0.0005907136318273842 
 Training accuracy: 99.6875
Epoch 9:
Training loss: 0.0005454503209330142 
 Training accuracy: 99.6875
Epoch 10:
Training loss: 0.0005081417574547231 
 Training accuracy: 99.6875
Epoch 11:
Training loss: 0.0004768064571544528 
 Training accuracy: 99.75
Epoch 12:
Training loss: 0.00045004433253780006 
 Training accuracy: 9

Network Init Done
Got Optimizer
Got Cost Function
Epoch 0:
Training loss: 0.01774728760123253 
 Training accuracy: 43.375
Epoch 1:
Training loss: 0.004058751221746207 
 Training accuracy: 92.3125
Epoch 2:
Training loss: 0.0016710284817963838 
 Training accuracy: 97.0
Epoch 3:
Training loss: 0.0011533382628113032 
 Training accuracy: 97.75
Epoch 4:
Training loss: 0.0009109698189422488 
 Training accuracy: 98.5
Epoch 5:
Training loss: 0.0007751479744911194 
 Training accuracy: 99.25
Epoch 6:
Training loss: 0.0006809956301003694 
 Training accuracy: 99.4375
Epoch 7:
Training loss: 0.0006118861096911133 
 Training accuracy: 99.6875
Epoch 8:
Training loss: 0.0005588559992611409 
 Training accuracy: 99.6875
Epoch 9:
Training loss: 0.0005162901873700321 
 Training accuracy: 99.75
Epoch 10:
Training loss: 0.00048120530555024744 
 Training accuracy: 99.8125
Epoch 11:
Training loss: 0.00045175325591117143 
 Training accuracy: 99.8125
Epoch 12:
Training loss: 0.0004266248922795057 
 Training accu

Network Init Done
Got Optimizer
Got Cost Function
Epoch 0:
Training loss: 0.017751939222216608 
 Training accuracy: 42.3125
Epoch 1:
Training loss: 0.0040741883590817455 
 Training accuracy: 93.4375
Epoch 2:
Training loss: 0.001780945621430874 
 Training accuracy: 96.3125
Epoch 3:
Training loss: 0.001222041556611657 
 Training accuracy: 97.8125
Epoch 4:
Training loss: 0.0009798221779055893 
 Training accuracy: 98.5
Epoch 5:
Training loss: 0.0008412263100035489 
 Training accuracy: 98.9375
Epoch 6:
Training loss: 0.0007449926668778062 
 Training accuracy: 99.1875
Epoch 7:
Training loss: 0.0006731950002722442 
 Training accuracy: 99.25
Epoch 8:
Training loss: 0.000617189051117748 
 Training accuracy: 99.3125
Epoch 9:
Training loss: 0.0005718926945701241 
 Training accuracy: 99.375
Epoch 10:
Training loss: 0.0005343244899995625 
 Training accuracy: 99.375
Epoch 11:
Training loss: 0.0005025446123909205 
 Training accuracy: 99.4375
Epoch 12:
Training loss: 0.0004752443241886795 
 Training a

Network Init Done
Got Optimizer
Got Cost Function
Epoch 0:
Training loss: 0.018018654584884643 
 Training accuracy: 42.5625
Epoch 1:
Training loss: 0.004225632399320603 
 Training accuracy: 93.6875
Epoch 2:
Training loss: 0.001737426118925214 
 Training accuracy: 96.8125
Epoch 3:
Training loss: 0.0012205179780721664 
 Training accuracy: 97.5
Epoch 4:
Training loss: 0.0009836634900420903 
 Training accuracy: 98.3125
Epoch 5:
Training loss: 0.0008441854687407613 
 Training accuracy: 98.6875
Epoch 6:
Training loss: 0.000747526572085917 
 Training accuracy: 98.9375
Epoch 7:
Training loss: 0.0006759040290489793 
 Training accuracy: 99.0
Epoch 8:
Training loss: 0.000619657018687576 
 Training accuracy: 99.125
Epoch 9:
Training loss: 0.0005737917008809745 
 Training accuracy: 99.25
Epoch 10:
Training loss: 0.0005356373894028365 
 Training accuracy: 99.4375
Epoch 11:
Training loss: 0.0005033424030989409 
 Training accuracy: 99.5625
Epoch 12:
Training loss: 0.00047564606182277203 
 Training acc

In [None]:
print(f"Baseline mean: {baseline_accuracies.mean()} \t Baseline std deviation: {baseline_accuracies.std()}")
print(f"Uda mean: {uda_accuracies.mean()} \t Uda std deviation: {uda_accuracies.std()}")

Baseline mean: 77.65 	 Baseline std deviation: 2.1598611066455176
Uda mean: 86.8 	 Uda std deviation: 0.7968688725254612
