# Unsupervised Domain Adaptation Project


## 0: Running Instruction

This Notebook is supporsed to be **run by order**.

- Running Parameter configuration can be modified [here](#configuration).

- DANN Model are defined [here](#DANN)

- Source Only Training Starts [here](#source)

- UDA functions defined [here](#UDA) while UDA training starts [here](#UDA_training)

- UDA Mid define and runs [here](#UDA_mid)

- Summary is [here](#summary)

## 1: Data download
Load data to project from Google Drive. Copy a subset of classes of images to the path:
- `adaptiope_small/product_images`
- `adaptiope_small/real_life` 

two directories. They represent images from two different domain **product** and **real_life**

In [1]:
from os import makedirs, listdir
from tqdm import tqdm
from google.colab import drive
from os.path import join
from shutil import copytree

drive.mount('/content/gdrive')

!mkdir dataset
!cp "gdrive/My Drive/Colab Notebooks/data/Adaptiope.zip" dataset/
# !ls dataset

!unzip -qq dataset/Adaptiope.zip   # unzip file

!rm -rf dataset/Adaptiope.zip 
!rm -rf adaptiope_small

Mounted at /content/gdrive


In [2]:
!mkdir adaptiope_small
classes = listdir("Adaptiope/product_images")
print(classes)
classes = ["backpack", "bookcase", "car jack", "comb", "crown", "file cabinet", "flat iron", "game controller", "glasses",
           "helicopter", "ice skates", "letter tray", "monitor", "mug", "network switch", "over-ear headphones", "pen",
           "purse", "stand mixer", "stroller"]
domain_classes = ["product_images", "real_life"]
for d, td in zip(["Adaptiope/product_images", "Adaptiope/real_life"], ["adaptiope_small/product_images", "adaptiope_small/real_life"]):
  makedirs(td)
  for c in tqdm(classes):
    c_path = join(d, c)
    c_target = join(td, c)
    copytree(c_path, c_target)

['tape dispenser', 'comb', 'car jack', 'toothbrush', 'handcuffs', 'magic lamp', 'usb stick', 'glasses', 'tank', 'computer', 'rubber boat', 'shower head', 'telescope', 'quadcopter', 'scooter', 'tyrannosaurus', 'rc car', 'syringe', 'cordless fixed phone', 'bottle', 'vr goggles', 'razor', 'helicopter', 'hat', 'dart', 'compass', 'hoverboard', 'corkscrew', 'projector', 'file cabinet', 'smoking pipe', 'rifle', 'mug', 'fan', 'sewing machine', 'pen', 'keyboard', 'knife', 'trash can', 'tent', 'drum set', 'nail clipper', 'phonograph', 'monitor', 'toilet brush', 'skateboard', 'electric guitar', 'screwdriver', 'coat hanger', 'speakers', 'boxing gloves', 'roller skates', 'computer mouse', 'ladder', 'motorbike helmet', 'scissors', 'handgun', 'power strip', 'ruler', 'microwave', 'golf club', 'stapler', 'watering can', 'over-ear headphones', 'umbrella', 'pipe wrench', 'vacuum cleaner', 'purse', 'in-ear headphones', 'webcam', 'pikachu', 'letter tray', 'chainsaw', 'ice cube tray', 'fighter jet', 'grill'

100%|██████████| 20/20 [00:00<00:00, 20.13it/s]
100%|██████████| 20/20 [00:00<00:00, 29.37it/s]


In [3]:
product_path = 'adaptiope_small/product_images'
real_life_path = 'adaptiope_small/real_life'

<a name="DANN">
</a>

## 2: Domain-Adversarial training of Neural Network 

We implement DANN UDA method [DANN](https://arxiv.org/pdf/1505.07818.pdf)  

![DANN.png](https://raw.githubusercontent.com/CrazyAlvaro/UDA/main/images/DANN.png)

As displayed in the model architecture above, DANN is consist of three component: feature extractor, domain classifier, and label predictor(classifier).

While in order to adversarial training from both label predictor and domain classifier, a gradient reversal layer(GRL) is added.

### 2.0: Import Libraries and Data Loading


In [4]:
from PIL import Image
from os.path import join
import math

img = Image.open(join(product_path, 'backpack', 'backpack_003.jpg'))
print('Image size: ', img.size)
#img

Image size:  (679, 679)


Import libraries

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import softmax
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torchvision.models import vgg11, alexnet 
from torch.utils.data import DataLoader, random_split
from torchvision.transforms.transforms import ToTensor

Config Parameters
<a name="configuration">
</a>

In [84]:
img_size = 256
# mean, std used by pre-trained models from PyTorch
mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
config = dict(epochs=10, batch_size=64,lr=0.01, wd=0.001, momentum=0.9, alpha=10, beta=0.75, gamma=10)

Configue GPU

In [7]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)

cuda:0


In [8]:
def get_dataset(root_path):
  '''
    Get dataset from specific data path

    # parameters:
        root_path: path to image folder

    # return: train_loader, test_loader
  '''
  # Construct image transform
  image_transform = transforms.Compose([
    transforms.Resize(img_size),
    transforms.CenterCrop(img_size),
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
  ])

  # Load data from filesystem
  image_dataset = ImageFolder(root_path, transform=image_transform)

  return image_dataset

def get_dataloader(dataset, batch_size, shuffle_train=True, shuffle_test=False):
  '''
    Get DataLoader from specific data path

    # parameters:
        dataset: ImageFolder instance
        batch_size: batch_size for DataLoader
        shuffle_train: whether to shuffle training data
        shuffle_test: whether to shuffle test data
  '''
  # Get train, test number
  num_total = len(dataset)
  num_train = int(num_total * 0.8 + 1)
  num_test  = num_total - num_train

  # random split dataset
  data_train, data_test = random_split(dataset, [num_train, num_test])

  # initialize dataloaders
  loader_train = DataLoader(data_train, batch_size=batch_size, shuffle=shuffle_train)
  loader_test  = DataLoader(data_test, batch_size=batch_size, shuffle=shuffle_test)

  return loader_train, loader_test

### 2.1 Define Feature Extractor with Pretrain Network

For the feature extractor, we select pretrained AlexNet. 
The reason to choose AlexNet comparing more recent Network like 
Residural Network is because it has a good balance between model 
performance and traning complexity. Even though ResNet may perform 
better than AlexNet by a reasonable amount of gain, it takes way much longer 
to train or tune a complex network like this, which will dramatically increase the training time.

In [63]:
class FeatureExtractor(nn.Module):
  """
  FeatureExtractor

  Pretrained neural network as a backbone for later domain adaptation task
  """
  def __init__(self):
    super(FeatureExtractor, self).__init__()

    # Feature Extractor with AlexNet
    self.feature_extractor = alexnet(weights='DEFAULT')
    self.feature_dim = self.feature_extractor.classifier[-1].in_features

    # make the last layer identity
    self.feature_extractor.classifier[-1] = nn.Identity()

  def forward(self, x):
    return self.feature_extractor(x)
  
  def output_dim(self):
    return self.feature_dim

### 2.2 Define Classifier, Discriminator with RevereLayerF for training the Feature Extractor

For the classifier, we implement a three fully connected linear layer 
with LeakyReLU because of its advantage over regular ReLU activation function. 
And finally we use Logrithm Softmax function as the selection layer.

In [64]:
class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(Classifier, self).__init__()
        self.classifier = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, output_dim),
            nn.LogSoftmax(dim=1)
        )
    
    def forward(self, X):
        return self.classifier(X) 

Here we implement a ReverseLayer Function to pass thourgh the data 
as it is without doing any compuation, but on backward propagation, 
it will reverse the sign of the value to provide the capability 
to adversarial training from the later Discriminator.

In [65]:
from torch.autograd import Function

class ReverseLayerF(Function):
    @staticmethod
    def forward(ctx, tensor): 
        """
        Without doing any computation
        """
        return tensor.view_as(tensor)

    @staticmethod
    def backward(ctx, grad_output):
        """
        Change the sign of the gradient 
        """
        return grad_output.neg(), None

Here for the discriminator, we only implement a two-layer linear connection here to avoid overly complicate the Discriminator, becuase usually more complex discriminator will have a negative effect on adversarial training.

In [66]:
class Discriminator(nn.Module):
    def __init__(self, input_dim):
        super(Discriminator, self).__init__()
        self.discriminator =  nn.Sequential(
            nn.Linear(int(input_dim), 1024),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024,1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Sigmoid()
        )

    def forward(self, x):
        validity = self.discriminator(x)
        return validity 

Finally we connect the feature extractor, classifer, and discriminator to form a 
Domain Adversarial Neural Network. Both classifier and discriminator will take in 
data from the fearture extractor processed the imput images. The classifer take the 
number of classes as input, the output is the prediction of which class of the current 
image belong. While the discriminator are supposed to distinguish the image from two 
different domains.

The discriminator will serve as a doamin alignment unit to train the feature extractor 
to extract domain independent features from both domains.

In [67]:
class DANN(nn.Module):
  """ 
  DANN

  Implement the domain adversarial neural network that train the feature extractor from both classification and discrimination of 
  different domains.
  """
  def __init__(self, num_classes):
    """ 
    Parameter:
      @num_classes: number of classes of different images
    """
    super(DANN, self).__init__()
    self.output_dim = num_classes

    # define inner network component
    self.feature_extractor = FeatureExtractor()
    self.classifier = Classifier(self.feature_extractor.output_dim(), num_classes)
    self.discriminator = Discriminator(self.feature_extractor.output_dim())  
  
  def forward(self, x):
    feature_output = self.feature_extractor(x)

    class_pred = self.classifier(feature_output)

    # Add a ReverseLayer here for negative gradient computation
    reverse_feature = ReverseLayerF.apply(feature_output)
    domain_pred = self.discriminator(reverse_feature)

    return class_pred, domain_pred 

### 2.3 Cost function

In [68]:
class BinaryDiceLoss(nn.Module):
    """Dice loss of binary class
    Args:
        smooth: A float number to smooth loss, and avoid NaN error, default: 1
        p: Denominator value: \sum{x^p} + \sum{y^p}, default: 2
        predict: A tensor of shape [N, *]
        target: A tensor of shape same with predict
        reduction: Reduction method to apply, return mean over batch if 'mean',
            return sum if 'sum', return a tensor of shape [N,] if 'none'
    Returns:
        Loss tensor according to arg reduction
    Raise:
        Exception if unexpected reduction
    """
    def __init__(self, smooth=1, p=2, reduction='mean'):
        super(BinaryDiceLoss, self).__init__()
        self.smooth = smooth
        self.p = p
        self.reduction = reduction

    def forward(self, predict, target):
        assert predict.shape[0] == target.shape[0], "predict & target batch size don't match"
        predict = predict.contiguous().view(predict.shape[0], -1)
        target = target.contiguous().view(target.shape[0], -1)

        num = torch.sum(torch.mul(predict, target), dim=1) + self.smooth
        den = torch.sum(predict.pow(self.p) + target.pow(self.p), dim=1) + self.smooth

        loss = 1 - num / den

        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        elif self.reduction == 'none':
            return loss
        else:
            raise Exception('Unexpected reduction {}'.format(self.reduction))

In [69]:
class DiceLoss(nn.Module):
    """Dice loss, need one hot encode input
    Args:
        weight: An array of shape [num_classes,]
        ignore_index: class index to ignore
        predict: A tensor of shape [N, C, *]
        target: A tensor of same shape with predict
        other args pass to BinaryDiceLoss
    Return:
        same as BinaryDiceLoss
    """
    def __init__(self, weight=None, ignore_index=None, **kwargs):
        super(DiceLoss, self).__init__()
        self.kwargs = kwargs
        self.weight = weight
        self.ignore_index = ignore_index

    def forward(self, predict, target):
        # one hot encode input
        num_class = predict.shape[1]
        # one hot
        target = F.one_hot(target, num_classes=num_class)
        
        assert predict.shape == target.shape, 'predict & target shape do not match'
        dice = BinaryDiceLoss(**self.kwargs)
        total_loss = 0
        predict = F.softmax(predict, dim=1)

        for i in range(target.shape[1]):
            if i != self.ignore_index:
                dice_loss = dice(predict[:, i], target[:, i])
                if self.weight is not None:
                    assert self.weight.shape[0] == target.shape[1], \
                        'Expect weight shape [{}], get[{}]'.format(target.shape[1], self.weight.shape[0])
                    dice_loss *= self.weights[i]
                total_loss += dice_loss

        return total_loss/target.shape[1]

In this experiment, two loss functions, dice loss and cross entropy, were selected for comparison in the classifier.

In [70]:
def get_class_loss_func(dice_loss=False):
  if dice_loss:
    print("## Loss Function ## dice_loss being used.")
    return DiceLoss()
  else:
    print("## Loss Function ## CrossEntropyLoss being used.")
    return nn.CrossEntropyLoss()

### 2.4 Optimizer

Setting the **learning rate** according to the original [paper](https://arxiv.org/pdf/1505.07818.pdf) section 5.2.2

$$ \mu_p =  \frac{\mu_0}{(1+\alpha \cdot p)^\beta}$$

where p is the training progress linearly changing from 0 to 1.

And for the learning rate, for the pretrain weights, we set the learning rate only to be 1/10 
of the learning rate for the classifier. And we use Stochastic Gradient Descent to optimize the 
model.

In [71]:
def get_optimizer(model, config, progress, adversarial=True):
  '''
  get_optimizer

  parameter:
    @model: Neural Network to be optimizd
    @config: configuration dictionary contains parameters
    @progress: training progress to configurate learning rate
    @adersarial: if we are in adversarial traning scenario 

  return:
    @optimizer: the optimizer we use to train our model

  '''
  learning_rate = config['lr']
  learning_rate = learning_rate / ((1 + config['alpha']*progress)**config['beta'])

  weight_decay  = config['wd']
  momentum      = config['momentum']

  feature_ext   = model.get_submodule("feature_extractor")
  classifier    = model.get_submodule("classifier")
  discriminator = model.get_submodule("discriminator")

  pre_trained_weights = feature_ext.parameters()

  if adversarial:
    other_weights = list(classifier.parameters()) + list(discriminator.parameters())
  else:
    other_weights = list(classifier.parameters())

  # assign parameters to parameters
  optimizer = torch.optim.SGD([
    {'params': pre_trained_weights},
    {'params': other_weights, 'lr': learning_rate}
  ], lr= learning_rate/10, weight_decay=weight_decay, momentum=momentum)
  
  return optimizer

### 2.5 Training Loop and Testing Loop

In [72]:
def train_loop(dataloader, model, device, progress, dice_loss=False):
  """
  train_loop

  Iterate through dataloader to train the network with SGD optimizer.

  Parameters:
    @dataloader: Pytorch dataloader to iterate through training
    @model: Neural Network model that we are training
    @device: GPU or CPU
    @progress: the progress of traning based on current epoch over total epochs
  """
  size = len(dataloader.dataset)
  loss_fn = get_class_loss_func(dice_loss)

  optimizer = get_optimizer(model, config, progress, adversarial=False)

  for batch, (X, y) in enumerate(dataloader):
    X, y = X.to(device), y.to(device)
    
    # compute prediction and loss
    class_pred, _ = model(X)

    # classification loss
    loss = loss_fn(class_pred, y) 
    curr_loss = loss.item()
    
    # backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if batch % 10 == 0:
      current = batch * len(X)
      print(f"## Meter ## current loss: {curr_loss:>7f} [{current:>5d}/{size:>5d}]")

In [73]:
def test_loop(dataloader, model, device, dice_loss=False):
  """ 
  test_loop

  Test the model by iterate through the dataloader and compute the correctness.

  Parameters:
    @dataloader: Pytorch dataloader to iterate through training
    @model: Neural Network model that we are training
    @progress: the progress of traning based on current epoch over total epochs
  
  @return:
    @test_loss: test loss
    @correct: correctness of the test data set
  """
  test_loss, correct = 0, 0
  loss_fn = get_class_loss_func(dice_loss)

  with torch.no_grad():
    for X, y in dataloader:
      X, y = X.to(device), y.to(device)
      class_pred, _ = model(X)

      test_loss += loss_fn(class_pred, y).item()
      correct += (class_pred.argmax(1) == y).type(torch.float).sum().item()

  size = len(dataloader.dataset)
  num_batches = len(dataloader)

  test_loss /= num_batches
  correct /= size
  print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

  return test_loss, correct

### 2.6 Training Function

In [74]:
def training(model, train_dataloader, test_dataloader, config, device, dice_loss=False):
  """ 
  training

  Acturall training function iterate through the training epochs
  """
  epochs = config['epochs']

  for epoch in range(epochs):
    print(f"Epoch {epoch+1}\n------------------")
    progress = epoch/epochs

    train_loop(train_dataloader, model, device, progress, dice_loss)

  test_loop(test_dataloader, model, device, dice_loss)
  print("Done")

<a name="source">
</a>

## 3 Training without using Domain Adaptation techniques

 ### 3.1 Product Domain -> Real Life

In [81]:
# Get dataloader
product_dataset   = get_dataset(product_path)
real_life_dataset = get_dataset(real_life_path)

#### 3.1.1 Training on Product

In [85]:
train_dataloader, test_dataloader = get_dataloader(product_dataset, config['batch_size'])

source_product_model = DANN(len(product_dataset.classes)).to(device)

# Training
training(source_product_model, train_dataloader, test_dataloader, config, device)

Epoch 1
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 3.000221 [    0/ 1601]
## Meter ## current loss: 2.353845 [  640/ 1601]
## Meter ## current loss: 0.540513 [ 1280/ 1601]
Epoch 2
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 0.441889 [    0/ 1601]
## Meter ## current loss: 0.260454 [  640/ 1601]
## Meter ## current loss: 0.255725 [ 1280/ 1601]
Epoch 3
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 0.177589 [    0/ 1601]
## Meter ## current loss: 0.148569 [  640/ 1601]
## Meter ## current loss: 0.189561 [ 1280/ 1601]
Epoch 4
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 0.109997 [    0/ 1601]
## Meter ## current loss: 0.121884 [  640/ 1601]
## Meter ## current loss: 0.144018 [ 1280/ 1601]
Epoch 5
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 0.0851

#### 3.1.2 Test on Real Life

In [86]:
loader_target_dataset = DataLoader(real_life_dataset, batch_size=config['batch_size'], shuffle=False)

# model.load_state_dict(torch.load('model_state.pt', map_location='cpu'))
test_loop(loader_target_dataset, source_product_model, device)

## Loss Function ## CrossEntropyLoss being used.
Test Error: 
 Accuracy: 63.8%, Avg loss: 1.262631 



(1.2626310754567385, 0.638)

In [87]:
del train_dataloader, test_dataloader, loader_target_dataset
# del model
print(torch.cuda.memory_allocated())

5634485760


### 3.2 Real Life -> Product


#### 3.2.1 Training on Real Life

In [88]:
train_dataloader, test_dataloader = get_dataloader(real_life_dataset, config['batch_size'])

source_real_model = DANN(len(real_life_dataset.classes)).to(device)

# Training
training(source_real_model, train_dataloader, test_dataloader, config, device)

Epoch 1
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 3.018649 [    0/ 1601]
## Meter ## current loss: 2.755090 [  640/ 1601]
## Meter ## current loss: 1.871956 [ 1280/ 1601]
Epoch 2
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 1.195980 [    0/ 1601]
## Meter ## current loss: 1.003482 [  640/ 1601]
## Meter ## current loss: 0.784357 [ 1280/ 1601]
Epoch 3
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 0.450122 [    0/ 1601]
## Meter ## current loss: 0.732290 [  640/ 1601]
## Meter ## current loss: 0.453970 [ 1280/ 1601]
Epoch 4
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 0.393326 [    0/ 1601]
## Meter ## current loss: 0.442337 [  640/ 1601]
## Meter ## current loss: 0.413421 [ 1280/ 1601]
Epoch 5
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter ## current loss: 0.4869

#### 3.2.2 Testing on Product


In [89]:
loader_target_dataset = DataLoader(product_dataset, batch_size=config['batch_size'], shuffle=False)

# model.load_state_dict(torch.load('model_state.pt', map_location='cpu'))
test_loop(loader_target_dataset, source_real_model, device)

## Loss Function ## CrossEntropyLoss being used.
Test Error: 
 Accuracy: 70.5%, Avg loss: 0.951106 



(0.9511063224636018, 0.7045)

In [90]:
del train_dataloader, test_dataloader, loader_target_dataset
# del model
print(torch.cuda.memory_allocated())

6147605504


<a name="UDA">
</a>

## 4: Define UDA functions


### 4.1 Adversarial Discriminator Loss

Here we compute the discrimination loss of from the discriminator

In [91]:
def get_discriminator_loss(source_pred, target_pred): 
    """ 
    get_discriminator_loss

    parameters:
        @source_pred: model prediction from the source data
        @target_pred: model prediction from the target data

    return:
        @domain_loss: computed domain loss
    """
    domain_pred = torch.cat((source_pred, target_pred),dim=0).cuda()
    #print(domain_pred.shape) # [128,1024]
    source_truth = torch.zeros(len(source_pred))
    target_truth = torch.ones(len(target_pred))
    domain_truth = torch.cat((source_truth, target_truth),dim=0).cuda()
    #print(domain_truth.shape) # [128]

    domain_loss = domain_truth*torch.log(1/domain_pred)+(1-domain_truth)*torch.log(1/(1-domain_pred))
    domain_loss = domain_loss.mean()

    return domain_loss 

### 4.2 Adversarial optimizer

We are using the Stochastic Gradient Descent optimizer and 
set learning rate for the pre_trained_weights to be 1/10
of other learning rate.

In [92]:
def get_adversarial_optimizer(model, config, progress, adversarial=True):
  '''
  Get Adversarial Optimizers
  '''
  lr, wd, momtm = config['lr'], config['wd'], config['momentum']
  lr = lr / ((1 + config['alpha']*progress)**config['beta'])

  feature_ext   = model.get_submodule("feature_extractor")
  classifier    = model.get_submodule("classifier")
  discriminator = model.get_submodule("discriminator")

  pre_trained_weights   = feature_ext.parameters()
  classifier_weights    = classifier.parameters()
  discriminator_weights = discriminator.parameters()

  feature_optim       = torch.optim.SGD([{'params': pre_trained_weights}],     lr=lr/10, weight_decay=wd, momentum=momtm)
  classifier_optim    = torch.optim.SGD([{'params': classifier_weights}],      lr=lr,    weight_decay=wd, momentum=momtm)
  discriminator_optim = torch.optim.SGD([{'params': discriminator_weights}],   lr=lr,    weight_decay=wd, momentum=momtm)
  
  return feature_optim, classifier_optim, discriminator_optim 

### 4.3 Adversarial Train Loop

Setting the **domain adaptation parameter** according to the original [paper](https://arxiv.org/pdf/1505.07818.pdf) section 5.2.2

$$ \lambda_p = \frac{2}{1 + exp(-\gamma \cdot p)} - 1 $$

where p is the training progress linearly changing from 0 to 1.

So here we optimize the model by calculating the classification loss and discrimination loss. 
Then we optimize the classifier, the discriminator, and the feature extractor based
on the loss we get.

In [93]:
def adversarial_train_loop(source_loader, target_loader, model, config, progress, device):
  """
  parameters:
    @source_loader
    @target_loader
    @model
    @config
    @progress
    @device

  return:
    @best_state
    @best_loss
  """
  size = len(source_loader.dataset)
  
  # cross entropy loss
  classification_loss = get_class_loss_func()

  # Get three optimizer
  feature_optim, class_optim, discriminator_optim = get_adversarial_optimizer(model, config, progress)

  # Target data loader iterator
  iter_target = iter(target_loader)

  domain_adapt = 2 / (1 + math.exp(-config['gamma']*progress)) - 1

  for batch, (X_source, y_source) in enumerate(source_loader):
    try:
      X_target, _ = next(iter_target)
    except:
      iter_target = iter(target_loader)
      X_target, _ = next(iter_target)  

    # Some internal bug return nested tesnor with size 1
    if len(X_source) < 64:
      continue

    X_source, y_source, X_target = X_source.to(device), y_source.to(device), X_target.to(device)

    class_pred_source, domain_pred_source = model(X_source)
    _,                 domain_pred_target = model(X_target)

    class_loss   = classification_loss(class_pred_source, y_source)
    discrim_loss = get_discriminator_loss(domain_pred_source, domain_pred_target)

    feature_optim.zero_grad()

    # Update discriminator
    discriminator_optim.zero_grad()
    discrim_loss.backward(retain_graph=True)
    discriminator_optim.step()

    # Update classifier
    class_optim.zero_grad()
    class_loss.backward(retain_graph=True)
    class_optim.step()

    # Update feature extractor
    feature_optim.step()  

    # Total loss
    total_loss = class_loss - domain_adapt * discrim_loss 

    if batch % 10 == 0:
      class_loss, discrim_loss, current = class_loss.item(), discrim_loss.item(), batch * len(X_source)
      total_loss = total_loss.item()
      # print(f"## Meter  ## [{current:>5d}/{size:>5d}]")
      print(f"## Meter  ## classification loss: {class_loss:>7f} discrim loss: {discrim_loss:>7f} total loss: {total_loss:>7f}[{current:>5d}/{size:>5d}]")

    del class_loss, discrim_loss 
    del X_source, y_source, X_target, class_pred_source, domain_pred_source, domain_pred_target
  
  # return best_state, best_loss

### 4.4 Adversarial Test Loop

In [94]:
def adversarial_test_loop(dataloader, model, device, name=""):
  """ 
  adversarial_test_loop

  Test the model compute the loss and accuracy
  """
  test_loss, correct = 0, 0

  class_loss_func = get_class_loss_func()

  with torch.no_grad():
    for X, y in dataloader:
      X, y = X.to(device), y.to(device)
      class_pred, _ = model(X)

      test_loss += class_loss_func(class_pred, y).item()
      correct += (class_pred.argmax(1) == y).type(torch.float).sum().item()

  size = len(dataloader.dataset)
  num_batches = len(dataloader)

  test_loss /= num_batches
  correct /= size
  print(f"{name} Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

  return test_loss, correct

### 4.5 Adversarial Training

In [95]:
def adversarial_training(model, source_loader, source_test_loader, target_loader, config, device):
  """ 
  adversarial_training

  Training the adversarial model with the config
  """
  no_improve_count = 0

  for epoch in range(config['epochs']):
    print(f"Epoch {epoch+1}\n------------------")
    progress = epoch/config['epochs']

    adversarial_train_loop(source_loader, target_loader, model, config, progress, device)

    source_loss, _ = adversarial_test_loop(source_test_loader, model, device, "Source Test")

  print("Done")

<a name="UDA_training">
</a>

## 5 Training with UDA Techniques

In [96]:
torch.cuda.empty_cache()

In [97]:
product_adv_model = DANN(len(product_dataset.classes)).to(device)

### 5.1 Product -> Real Life

#### 5.1.1 Training on Product

In [98]:
train_dataloader, train_test_dataloader = get_dataloader(product_dataset, config['batch_size'])
target_dataloader, target_test_dataloader = get_dataloader(real_life_dataset, config['batch_size'])

In [99]:
torch.autograd.set_detect_anomaly(True)
adversarial_training(product_adv_model, train_dataloader, train_test_dataloader, target_dataloader, config, device)

Epoch 1
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter  ## classification loss: 2.968360 discrim loss: 0.697306 total loss: 2.968360[    0/ 1601]
## Meter  ## classification loss: 2.331265 discrim loss: 0.695401 total loss: 2.331265[  640/ 1601]
## Meter  ## classification loss: 0.402743 discrim loss: 0.694994 total loss: 0.402743[ 1280/ 1601]
## Loss Function ## CrossEntropyLoss being used.
Source Test Test Error: 
 Accuracy: 83.5%, Avg loss: 0.693817 

Epoch 2
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter  ## classification loss: 0.814303 discrim loss: 0.698190 total loss: 0.282565[    0/ 1601]
## Meter  ## classification loss: 0.423594 discrim loss: 0.695006 total loss: -0.105718[  640/ 1601]
## Meter  ## classification loss: 0.364948 discrim loss: 0.694518 total loss: -0.163993[ 1280/ 1601]
## Loss Function ## CrossEntropyLoss being used.
Source Test Test Error: 
 Accuracy: 90.2%, Avg loss: 0.406736 

Epoch 3
------------

#### 5.1.2 Testing on Real Life

In [100]:
loader_target_dataset = DataLoader(real_life_dataset, batch_size=config['batch_size'], shuffle=False)

test_loop(loader_target_dataset, product_adv_model, device)

## Loss Function ## CrossEntropyLoss being used.
Test Error: 
 Accuracy: 63.9%, Avg loss: 1.276491 



(1.2764913607388735, 0.6395)

### 5.2 Real Life -> Product

#### 5.2.1 Training on Real Life

In [101]:
train_dataloader, train_test_dataloader = get_dataloader(real_life_dataset, config['batch_size'])
target_dataloader, target_test_dataloader = get_dataloader(product_dataset, config['batch_size'])

In [102]:
# Training
real_adv_model = DANN(len(product_dataset.classes)).to(device)
adversarial_training(real_adv_model, train_dataloader, train_test_dataloader, target_dataloader, config, device)

Epoch 1
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter  ## classification loss: 2.999510 discrim loss: 0.695183 total loss: 2.999510[    0/ 1601]
## Meter  ## classification loss: 2.776052 discrim loss: 0.694211 total loss: 2.776052[  640/ 1601]
## Meter  ## classification loss: 1.730621 discrim loss: 0.694183 total loss: 1.730621[ 1280/ 1601]
## Loss Function ## CrossEntropyLoss being used.
Source Test Test Error: 
 Accuracy: 63.2%, Avg loss: 1.210765 

Epoch 2
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter  ## classification loss: 1.088701 discrim loss: 0.694209 total loss: 0.559996[    0/ 1601]
## Meter  ## classification loss: 0.725001 discrim loss: 0.694996 total loss: 0.195696[  640/ 1601]
## Meter  ## classification loss: 0.809038 discrim loss: 0.695145 total loss: 0.279619[ 1280/ 1601]
## Loss Function ## CrossEntropyLoss being used.
Source Test Test Error: 
 Accuracy: 69.9%, Avg loss: 0.925751 

Epoch 3
--------------

#### 5.2.2 Testing on Product

In [103]:
loader_target_dataset = DataLoader(product_dataset, batch_size=config['batch_size'], shuffle=False)

test_loop(loader_target_dataset, real_adv_model, device)

## Loss Function ## CrossEntropyLoss being used.
Test Error: 
 Accuracy: 80.8%, Avg loss: 0.591975 



(0.591975215356797, 0.8075)

<a name="UDA_mid">
</a>

## 6 UDA Ablation Study

### 6.1 How UDA over different levels of feature layer work

In [104]:
# To test the middle layer feature adversarial training, we copy the model from torchvision and modify it.
class AlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000, dropout: float = 0.5) -> None:
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )# 12544 = (6*6*256)
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(p=dropout),
            nn.Linear(256 * 6 * 6, 4096), 
            nn.ReLU(inplace=True),
            nn.Dropout(p=dropout),
            nn.Linear(4096, 4096),  # original 4096
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feature = self.features(x)
        x = self.avgpool(feature)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        feature = feature.view(feature.size(0), -1)
        # return x, x
        return x, feature

In [105]:
class FeatureExtractor(nn.Module):
  def __init__(self, pretrained=True):
    super(FeatureExtractor, self).__init__()
    if pretrained:
      state_dict = alexnet(weights='DEFAULT').state_dict()
    else:
      state_dict = alexnet().state_dict()
    self.feature_extractor = AlexNet()
    self.feature_extractor.load_state_dict(state_dict)
    self.feature_dim = self.feature_extractor.classifier[-1].in_features
    self.adv_feature_dim = 12544
    print(self.feature_dim, self.adv_feature_dim)
    # print(f"Feature dimension: {self.feature_dim}")
    # make the last layer identity
    self.feature_extractor.classifier[-1] = nn.Identity()

  def forward(self, x):
    out = self.feature_extractor(x)
    return out
  
  def output_dim(self):
    return self.feature_dim
  
  def adv_output_dim(self):
    return self.adv_feature_dim

In [106]:
class Discriminator(nn.Module):
    def __init__(self, input_dim):
        super(Discriminator, self).__init__()
        self.discriminator =  nn.Sequential(
            nn.Linear(int(input_dim), 1024),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024,1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Sigmoid()
        )

    def forward(self, x):
        validity = self.discriminator(x)
        return validity 

In [110]:
class DANN_Mid(nn.Module):
  def __init__(self, num_classes, pretrained=True):
    super(DANN_Mid, self).__init__()
    self.output_dim = num_classes

    # define inner network component
    self.feature_extractor = FeatureExtractor(pretrained=pretrained)
    self.classifier = Classifier(self.feature_extractor.output_dim(), num_classes)
    self.discriminator = Discriminator(self.feature_extractor.adv_output_dim())  
  
  def forward(self, x):
    # 4096, 12544
    feature_output, adv_feature = self.feature_extractor(x)
    
    class_pred = self.classifier(feature_output)

    # Add a ReverseLayer here for negative gradient computation
    reverse_feature = ReverseLayerF.apply(adv_feature)
    domain_pred = self.discriminator(reverse_feature)

    return class_pred, domain_pred 

### 6.2 Product -> Real Life

In [111]:
train_dataloader, train_test_dataloader = get_dataloader(product_dataset, config['batch_size'])
target_dataloader, target_test_dataloader = get_dataloader(real_life_dataset, config['batch_size'])
mid_product_adv_model = DANN_Mid(len(product_dataset.classes), pretrained=True).to(device)
torch.autograd.set_detect_anomaly(True)
adversarial_training(mid_product_adv_model, train_dataloader, train_test_dataloader, target_dataloader, config, device)

4096 12544
Epoch 1
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter  ## classification loss: 3.023133 discrim loss: 0.697544 total loss: 3.023133[    0/ 1601]
## Meter  ## classification loss: 2.263576 discrim loss: 0.694350 total loss: 2.263576[  640/ 1601]
## Meter  ## classification loss: 0.462087 discrim loss: 0.695956 total loss: 0.462087[ 1280/ 1601]
## Loss Function ## CrossEntropyLoss being used.
Source Test Test Error: 
 Accuracy: 82.0%, Avg loss: 0.641624 

Epoch 2
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter  ## classification loss: 0.596202 discrim loss: 0.695391 total loss: 0.066597[    0/ 1601]
## Meter  ## classification loss: 0.449357 discrim loss: 0.694993 total loss: -0.079945[  640/ 1601]
## Meter  ## classification loss: 0.141513 discrim loss: 0.695210 total loss: -0.387955[ 1280/ 1601]
## Loss Function ## CrossEntropyLoss being used.
Source Test Test Error: 
 Accuracy: 89.5%, Avg loss: 0.373861 

Epoch 3
-

In [112]:
loader_target_dataset = DataLoader(real_life_dataset, batch_size=config['batch_size'], shuffle=False)

test_loop(loader_target_dataset, mid_product_adv_model, device)

## Loss Function ## CrossEntropyLoss being used.
Test Error: 
 Accuracy: 62.8%, Avg loss: 1.277676 



(1.277676198631525, 0.628)

### 6.3 Real Life -> Product

In [113]:
config = dict(epochs=10, batch_size=64, lr=0.01, wd=0.001, momentum=0.9, alpha=10, beta=0.75, gamma=10)
train_dataloader, train_test_dataloader = get_dataloader(real_life_dataset, config['batch_size'])
target_dataloader, target_test_dataloader = get_dataloader(product_dataset, config['batch_size'])
mid_real_adv_model = DANN_Mid(len(real_life_dataset.classes), pretrained=True).to(device)
torch.autograd.set_detect_anomaly(True)
adversarial_training(mid_real_adv_model, train_dataloader, train_test_dataloader, target_dataloader, config, device)

4096 12544
Epoch 1
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter  ## classification loss: 3.008753 discrim loss: 0.705185 total loss: 3.008753[    0/ 1601]
## Meter  ## classification loss: 2.743634 discrim loss: 0.697054 total loss: 2.743634[  640/ 1601]
## Meter  ## classification loss: 1.983272 discrim loss: 0.696823 total loss: 1.983272[ 1280/ 1601]
## Loss Function ## CrossEntropyLoss being used.
Source Test Test Error: 
 Accuracy: 71.4%, Avg loss: 1.171896 

Epoch 2
------------------
## Loss Function ## CrossEntropyLoss being used.
## Meter  ## classification loss: 1.160598 discrim loss: 0.696179 total loss: 0.838882[    0/ 1601]
## Meter  ## classification loss: 1.296221 discrim loss: 0.694806 total loss: 0.975140[  640/ 1601]
## Meter  ## classification loss: 1.015076 discrim loss: 0.695105 total loss: 0.693856[ 1280/ 1601]
## Loss Function ## CrossEntropyLoss being used.
Source Test Test Error: 
 Accuracy: 76.7%, Avg loss: 0.761417 

Epoch 3
---

In [114]:
loader_target_dataset = DataLoader(product_dataset, batch_size=config['batch_size'], shuffle=False)

test_loop(loader_target_dataset, mid_real_adv_model, device)

## Loss Function ## CrossEntropyLoss being used.
Test Error: 
 Accuracy: 80.8%, Avg loss: 0.644416 



(0.6444155343342572, 0.8075)

<a name="summary">
</a>

## 7 Summary

### 7.1 Losses Governing Training

In this experiment, two loss functions, dice loss and cross entropy, were selected for comparison in the classifier. As expected, the cross-entropy loss is calculated as the average value of the per-pixel loss, and the per-pixel loss is calculated discretely. So the CE loss only considers the microscopic, but not the global consideration; while the Dice loss considers the local and global loss information, perhaps to improve accuracy. The dice training in our case shows a very bad performance.

| Domain   | Method | Performance |
|----------|:-------------:|:---:|
| Real Life -> Product | baseline | 6.6% 
|  |  UDA | 4.8%  | 
|  | UDA(Mid) |  5.1%  | 
| Product -> Real Life | baseline | 4.5% 
|  |  UDA |  8.9% | 
|  | UDA(Mid) | 5.2%  | 

The reason to consider is that CE has better gradients. Suppose p is the softmax outputs and t is the target. The gradients of cross-entropy wrt the logits is something like $ p-t $. While the dice coefficient in a differentiable form is $ \frac{2pt}{p+t} $ , whose gradients wrt p are much uglier: $ \frac{2t(t^2-p^2)}{(p^2+t^2)^2} $. Consider cases where p and t are small, the gradient will explode to some huge value. And severe oscillations may occur. Dice loss has good performance for scenes with severely imbalanced samples, but it is not applicable in our case. We choose CE loss as the loss function.

### 7.2 Hyperparameter Selection

<table>
    <thead>
        <tr>
            <th rowspan=2>Domain</th>
            <th rowspan=2>Method</th>
            <th colspan=4>Target Accuracy</th>
        </tr>
        <tr>
            <th>lr = 0.01</th>
            <th>lr = 0.005</th>
            <th>lr = 0.02</th>
            <th>lr = 0.03</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan=3>Real Life -> Product</td>
            <td>baseline</td>
            <td>80.7%</td>
            <td>79.8%</td>
            <td>79.5%</td>
            <td>82.4%</td>
        </tr>
        <tr>
            <td>UDA</td>
            <td>79.9%</td>
            <td>79.3%</td>
            <td>81.5%</td>
            <td>82.0%</td>
        </tr>
        <tr>
            <td>UDA(mid)</td>
            <td>80.7%</td>
            <td>79.8%</td>
            <td>79.5%</td>
            <td>82.4%</td>
        </tr>
        <tr>
            <td rowspan=3>Product -> Real Life</td>
            <td>baseline</td>
            <td>63.7%</td>
            <td>62.7%</td>
            <td>62.5%</td>
            <td>63.6%</td>
        </tr>
        <tr>
            <td>UDA</td>
            <td>62.5%</td>
            <td>62.3%</td>
            <td>63.2%</td>
            <td>61.3%</td>
        </tr>
        <tr>
            <td>UDA(mid)</td>
            <td>63.8%</td>
            <td>63.5%</td>
            <td>64.0%</td>
            <td>63.5%</td>
        </tr>
    </tbody>
</table>

### 7.2 Training Accuracy and Test Result Comparison

Previously, we use DANN for UDA, and the adversarial predictor and discriminator act on the upper layer of the neural network (feature size 4096). Consider what if the adversarial acts on the lower layer and makes judgments from a more basic place, we add discimination to the features of the mid layer of the neural network (feature size 12544). When the source domain and target domain are Product -> Real life, the mid layer result has a slight improvement; otherwise, there is no improvement.

| Method   | Product -> Real Life | Real Life -> Product |
|----------|:-------------:|:------:|
| AlexNet(Training) | 76.7% | 91% |
| AlexNet |  62.8% | 81.5%  | |
| DANN |    62.8%   |   78.5%  | |
| DANN-Mid | 64.2% | 80.0% |
| DANN Gain |0.0% | -3%|
| DANN-Mid Gain | 1.4%| -1.5%|


Potentially, the drop in performance from the DANN UDA method is because of not enough training data, so the feature extractor is not able to learn a mapping that could represent inter domain features which eventually improve the performance.