# Domain Adaptation
### **What is domain adaptation:**

Domain adaptation is a field associated with machine learning and transfer learning. This scenario arises when we aim at learning from a source data distribution a well performing model on a different (but related) target data distribution. In this assgiment we focus on single-source domain adaptation. Note that, when more than one source distribution is available the problem is referred to as multi-source domain adaptation.

###  **Relationships with transfer learning:**

Domain adaptation is a subcategory of transfer learning. In domain adaptation, the source and target domains all have the same feature space (but different distributions); in contrast, transfer learning includes cases where the target domain's feature space is different from the source feature space or spaces.

###  **Why we need domain adaptation:**

An important premise of traditional machine learning algorithms is that training data and test data must be independent and identically distributed. However, if the inputs at test time differ significantly from the training data, the traditional machine learning model might not perform very well. In these cases, domain adaptation comes to our rescue.

###  **Classification:**
* **Unsupervised domain adaptation**: the learning sample contains a set of labeled source domain examples with only unlabeled data in the target domain.
* **Semi-supervised domain adaptation**: in this situation, we consider a "small" set of labeled target examples.
* **Supervised domain adaptation**: all the examples considered are supposed to be labeled.

###  **Different types of adaptation algorithms:**
- **Reweighting algorithm**: 
    The objective is to reweight the source labeled sample such that it "looks like" the target sample (in terms of the error measure considered).
- **Iterative algorithms**:
    A method for adapting consists in iteratively "auto-labeling" the target examples. The principle is simple:
    - a model h is learned from the labeled examples;
    - h automatically labels some target examples;
    - a new model is learned from the new labeled examples.
- **Common representation space algorithms**:
    The goal is to find or construct a common representation space for the two domains. The objective is to obtain a space in which the domains are close to each other while keeping good performances on the source labeling task. This can be achieved through the use of Adversarial machine learning techniques where feature representations from samples in different domains are encouraged to be indistinguishable.A well known model of this type is DANN, which is selected as the Deep Domain Adaptation Method for this assignment.
- **Hierarchical Bayesian Model**:
    The goal is to construct a Bayesian hierarchical model p(n), which is essentially a factorization model for counts n, to derive domain-dependent latent representations allowing both domain-specific and globally shared latent factors.

### **Different types of domain adaptation methods:**
- **Distribution Adaptation** (Minimize the probability distribution distance so that the probability distribution of the source domain and the target domain are similar)
    - Marginal Distribution Adaptation
        - TCA (Transfer Component Analysis)
        - MMD (Maximum Mean Discrepancy)
        - DDC (Deep Domain Confusion): MMD + Neural Network
        - DAN (Deep Adaptation Networks): MKK-MMD + Neural Network
        - DME (Distribution-Matching Embedding): matrix transformation + projection
        - CMD (Central Moment Discrepancy): K-order MMD
    - Conditional Distribution Adaptation
        - CTC (Conditional Transferrable Components)
    - Joint Distribution Adaptation
        - JDA (Joint Distribution Adaptation): TCA + Conditional distribution adaptation
        - ARTL (Adaptation Regularization): JDA + Classifier learning
        - VDA (Visual Domain Adaptation): JDA+ Inner class distance + Among class distance
        - JGSA (Joint Geometrical and Statistical Alignment): JDA + Inner class distance + Among class distance + Label adaptation
        - JAN (Joint Adaptation Networks): JDA + JMMD, used in deep network
        - BDA (Balanced Distribution Adaptation): Add the balance factor to dynamically measure the importance of the two distributions
- **Feature Selection** (Select and extract shared features from the source domain and target domain, and establish a unified model)
    - SCL (Structural Correspondence Learning)
    - TJM (Transfer Joint Matching): MMD Distribution Adapation + Source Domain Sampling
    - FSSL (Feature Selection and Structure Preservation): Feature Selection + Information Immutability
- **Subspace Learning** (Transform the source domain and target domain to the same subspace, and then build a unified model)
    - Statistical Characteristics Alignment
        - SA (Subspace Alignment): Directly seek a linear transformation, which transform the source into the target space
        - SDA (Subspace Distribution Alignment): SA + Prob Distribution Adaptation
        - CORAL (CORrelation Alignment): Minimize the second-order statistical characteristics of the source and target domains
        - Deep-CORAL: CORAL + DNN
    - Manifold Learning
        - SGF (Sample Geodesic Flow)
        - GFK (Geodesic Flow Kernel): SGF + Gaussian Kernel
        - DIP (Domain-Invariant Projection):
- **Deep Learning**
    - Domain-Adversarial Neural Network (DANN): ICCV 2014
    - Deep Adaptation Networks (DAN): ICML 2015
    - Simultaneous feature and task transfer: ICCV 2015 
    - Joint Adaptation Networks (JAN): ICML 2017
    - Deep Hashing Network (DHN): CVPR 2017
    - Adversarial Discriminative Domain Adaptation (ADDA): arXiv 2017
    - Learning to Transfer (L2T): arXiv 2017
    - Label Efficient Learning of Transferable Representations across Domains and Tasks: NIPS 2017

Note:

1. General result of domain adaptation methods: 
    Deep Learning methods(DDC,DAN,JAN,...) > Balanced Distribution Adaptation methods(BDA,..) > Joint Distribution Adaptation methods(JDA,..) > Edge distribution adaptation methods(TCA,...) > Conditional distribution adaptation methods(CTC,...)
2. Subspace Learning's advantage: simple method + efficient calculation

Useful links:
- [Domain adaptation theoretical analysis:](https://zhuanlan.zhihu.com/p/50710267)

- [Domain adaptation overview:](http://jd92.wang/assets/files/l12_da.pdf)



# Methods we used:
- Traditional methods:
    - TCA
    - JDA
    - BDA,WBDA
    - CORAL
- Deep methods:
    - DANN (SVM classifier, Label predictor)
    - DDC DeepCORAL

# Dataset description:
Office-Home dataset is a domain adaptation dataset, which consists of 65 categories of office depot from four domains (i.e., A: Art, C:Clipart, P:Product, R: Real-world).

The **raw images** can be downloaded from: http://hemanthdv.org/OfficeHome-Dataset/.

The 2048-dim ResNet50 **deep learning features** of all images can be downloaded from:https://pan.baidu.com/s/1qvcWJCXVG8JkZnoM4BVoGg#list/path=%2F.

### Note:
The following experiments will conduct in the following three settings (source domain -> target domain):
- a) A->R;
- b) C->R; 
- c) P->R

In X->Y setting, use the deep learning features X_X.csv as source domain features and X_Y.csv as target domain features.

# Read the data and preprocess

In [1]:
import numpy as np
import scipy.io
import scipy.linalg
import sklearn.metrics
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd 
import matplotlib.pyplot as plt

In [2]:
sd_features_AA=pd.read_csv(r'Dataset\Office-Home_resnet50\Art_Art.csv',header=None)
sd_features_CC=pd.read_csv(r'Dataset\Office-Home_resnet50\Clipart_Clipart.csv',header=None)
sd_features_PP=pd.read_csv(r'Dataset\Office-Home_resnet50\Product_Product.csv',header=None)
td_features_AR=pd.read_csv(r'Dataset\Office-Home_resnet50\Art_RealWorld.csv',header=None)
td_features_CR=pd.read_csv(r'Dataset\Office-Home_resnet50\Clipart_RealWorld.csv',header=None)
td_features_PR=pd.read_csv(r'Dataset\Office-Home_resnet50\Product_RealWorld.csv',header=None)

sd_labels_AA=sd_features_AA.iloc[:,-1].astype(int)
sd_labels_CC=sd_features_CC.iloc[:,-1].astype(int)
sd_labels_PP=sd_features_PP.iloc[:,-1].astype(int)
td_labels_AR=td_features_AR.iloc[:,-1].astype(int)
td_labels_CR=td_features_CR.iloc[:,-1].astype(int)
td_labels_PR=td_features_PR.iloc[:,-1].astype(int)

sd_features_AA.drop(labels=2048,axis=1,inplace=True)
sd_features_CC.drop(labels=2048,axis=1,inplace=True)
sd_features_PP.drop(labels=2048,axis=1,inplace=True)
td_features_AR.drop(labels=2048,axis=1,inplace=True)
td_features_CR.drop(labels=2048,axis=1,inplace=True)
td_features_PR.drop(labels=2048,axis=1,inplace=True)

In [3]:
sd_features_AA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2427 entries, 0 to 2426
Columns: 2048 entries, 0 to 2047
dtypes: float64(2048)
memory usage: 37.9 MB


In [4]:
sd_features_AA.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2038,2039,2040,2041,2042,2043,2044,2045,2046,2047
0,0.339104,0.094242,0.574024,1.576792,0.208177,0.048539,0.514466,0.76195,0.003735,0.677419,...,0.140119,0.194172,0.217043,0.408074,0.361293,0.070492,0.600183,0.209631,0.133738,0.046542
1,0.003123,0.017756,0.007342,0.320789,0.033795,0.040097,0.607107,0.069854,0.005501,0.008634,...,0.190653,0.01987,0.103396,0.054031,0.133227,0.038206,0.073742,0.128579,0.028439,0.083659
2,1.756924,0.138735,0.828411,0.850277,0.064541,0.121117,0.417071,0.957303,1.156487,0.43719,...,0.271353,0.537854,0.425364,1.189281,0.463679,0.078043,0.372926,0.287346,0.244757,0.237532
3,0.447522,0.960308,1.076476,0.378051,0.447546,0.356158,0.268489,0.0,1.085499,0.351589,...,0.227814,0.154202,0.102128,0.328863,0.303151,0.226952,0.408608,0.770035,2.369065,0.162354
4,0.097336,0.172056,0.242983,0.123272,0.285455,0.198112,0.376254,0.547338,0.221591,0.020542,...,0.658062,0.166302,0.695773,0.300673,0.229382,0.036234,0.17859,0.523891,0.171992,0.02402


In [5]:
sd_labels_AA

0        3
1       15
2       53
3       21
4        0
        ..
2422    19
2423     4
2424    35
2425     4
2426     0
Name: 2048, Length: 2427, dtype: int32

In [6]:
# convert to numpy
sd_features_AA=sd_features_AA.to_numpy()
sd_features_CC=sd_features_CC.to_numpy()
sd_features_PP=sd_features_PP.to_numpy()
td_features_AR=td_features_AR.to_numpy()
td_features_CR=td_features_CR.to_numpy()
td_features_PR=td_features_PR.to_numpy()
sd_labels_AA=sd_labels_AA.to_numpy()
sd_labels_CC=sd_labels_CC.to_numpy()
sd_labels_PP=sd_labels_PP.to_numpy()
td_labels_AR=td_labels_AR.to_numpy()
td_labels_CR=td_labels_CR.to_numpy()
td_labels_PR=td_labels_PR.to_numpy()

In [7]:
sd_labels_AA

array([ 3, 15, 53, ..., 35,  4,  0])

In [8]:
np.unique(sd_labels_AA)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64])

In [9]:
sd_features_AA.shape

(2427, 2048)

# KNN without Domain Adaptation

In [10]:
def knn_model(Xs,Ys,Xt,Yt,n_neighbors=1):
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(Xs, Ys.ravel())
    y_pred = knn.predict(Xt)
    y_train_pred=knn.predict(Xs)
    acc = sklearn.metrics.accuracy_score(Yt, y_pred)
    acc_train=sklearn.metrics.accuracy_score(Ys, y_train_pred)
    return acc,acc_train

In [11]:
def kernel(ker, X1, X2, gamma):
    K = None
    if not ker or ker == 'primal':
        K = X1
    elif ker == 'linear':
        if X2 is not None:
            K = sklearn.metrics.pairwise.linear_kernel(np.asarray(X1).T, np.asarray(X2).T)
        else:
            K = sklearn.metrics.pairwise.linear_kernel(np.asarray(X1).T)
    elif ker == 'rbf':
        if X2 is not None:
            K = sklearn.metrics.pairwise.rbf_kernel(np.asarray(X1).T, np.asarray(X2).T, gamma)
        else:
            K = sklearn.metrics.pairwise.rbf_kernel(np.asarray(X1).T, None, gamma)
    return K

In [76]:
acc_AR_test,acc_AR_train=knn_model(sd_features_AA,sd_labels_AA,td_features_AR,td_labels_AR)
acc_CR_test,acc_CR_train=knn_model(sd_features_CC,sd_labels_CC,td_features_CR,td_labels_CR)
acc_PR_test,acc_PR_train=knn_model(sd_features_PP,sd_labels_PP,td_features_PR,td_labels_PR)

In [77]:
print("A--> R:source domain acc",acc_AR_train,'target domain acc',acc_AR_test)
print("C--> R:source domain acc",acc_CR_train,'target domain acc',acc_CR_test)
print("P--> R:source domain acc",acc_PR_train,'target domain acc',acc_PR_test)

A--> R:source domain acc 0.9983518747424804 target domain acc 0.6580215744778517
C--> R:source domain acc 0.995418098510882 target domain acc 0.5873307321551526
P--> R:source domain acc 0.9997747240369452 target domain acc 0.6956621528574707


# 1. TCA （Transfer Component Analysis）
**Main idea:**
    将两个领域的数据一起映射到一个高维的再生核希尔伯特空间。在此空间中，最小化源和目标的数据距离，同时最大程度地保留它们各自的内部属性。此处距离为MMD(Maximum Mean Discrepancy)，最大均值差异（$dist(X^{'}_{src},X^{'}_{tar})=||1/n_1 \sum_{i=1}^{n_1}\phi (x_{scr_i})-1/n_2 \sum_{j=1}^{n_2}\phi (x_{tar_i})||_{H}$）。
    
那么，如何求出这个映射（原始空间--> 高维的再生核希尔伯特空间）呢？通常，一个难求出的映射可以用核函数的方法来求解：
    MMD= trace(KL)-$\lambda$trace(K)，其中K为kernel function (可以为线性核，高斯核，rbf核...)。因此，根据优化问题求解（求解拉格朗日对偶，详细见link）可得source domain data和target domain data经过映射、降维后为：$(KLK+\lambda I)^{-1}KHK$ 的前m个特征值，其中K为kernel function， H 为中心矩阵（$H=I_{n_1+n_2}-1/(n_1+n_2)*\ 11^T$），$L=\begin{cases}
    1/n_1^2 & x_i,x_j \in X_{src}\\
    1/n_2^2 & x_i,x_j \in X_{tar}\\
    -1/(n_1 n_2) & otherwise\\
    \end{cases}$

**Pros and Cons:**
Pros:
实现简单，方法本身没有太多的限制，就跟PCA一样很好用
Cons:
对于大矩阵的运算计算开销很大 (主要开销在特征值计算上)

Useful Link: https://zhuanlan.zhihu.com/p/26764147

In [11]:
class TCA:
    def __init__(self, kernel_type='primal', dim=30, lamb=1, gamma=1):
        '''
        Init func
        :param kernel_type: kernel, values: 'primal' | 'linear' | 'rbf'
        :param dim: dimension after transfer
        :param lamb: lambda value in equation
        :param gamma: kernel bandwidth for rbf kernel
        '''
        self.kernel_type = kernel_type
        self.lamb = lamb
        self.gamma = gamma
        self.dim=dim

    def fit(self, Xs, Xt):
        '''
        Transform Xs and Xt
        :param Xs: ns * n_feature, source feature
        :param Xt: nt * n_feature, target feature
        :return: Xs_new and Xt_new after TCA
        '''
        X = np.hstack((Xs.T, Xt.T))
        X /= np.linalg.norm(X, axis=0)
        m, n = X.shape
        ns, nt = len(Xs), len(Xt)
        e = np.vstack((1 / ns * np.ones((ns, 1)), -1 / nt * np.ones((nt, 1))))
        L = e * e.T
        L = L / np.linalg.norm(L, 'fro')
        H = np.eye(n) - 1 / n * np.ones((n, n))
        K = kernel(self.kernel_type, X, None, gamma=self.gamma)
        n_eye = m if self.kernel_type == 'primal' else n
        a, b = np.linalg.multi_dot([K, L, K.T]) + self.lamb * np.eye(n_eye), np.linalg.multi_dot([K, H, K.T])
        w, V = scipy.linalg.eig(a, b)
        ind = np.argsort(w)
        A = V[:, ind[:self.dim]]
        Z = np.dot(A.T, K)
        Z /= np.linalg.norm(Z, axis=0)
        Xs_new, Xt_new = Z[:, :ns].T, Z[:, ns:].T
        return Xs_new, Xt_new

    def fit_predict(self, Xs, Ys, Xt, Yt,n_neighbors=5):
        '''
        Transform Xs and Xt, then make predictions on target using 1NN
        :param Xs: ns * n_feature, source feature
        :param Ys: ns * 1, source label
        :param Xt: nt * n_feature, target feature
        :param Yt: nt * 1, target label
        :return: Accuracy and predicted_labels on the target domain
        '''
        Xs_new, Xt_new = self.fit(Xs, Xt)
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(Xs_new, Ys.ravel())
        y_pred = clf.predict(Xt_new)
        acc = sklearn.metrics.accuracy_score(Yt, y_pred)
        return acc, y_pred

In [12]:
tca_AR = TCA(kernel_type='linear', dim=30, lamb=1, gamma=1)
acc_AR, ypre_AR = tca_AR.fit_predict(sd_features_AA,sd_labels_AA,td_features_AR,td_labels_AR)
print(f'A--> R: The accuracy of TCA is: {acc_AR:.4f}')

A--> R: The accuracy of TCA is: 0.6224


In [13]:
tca_CR = TCA(kernel_type='linear', dim=30, lamb=1, gamma=1)
acc_CR, ypre_CR = tca_CR.fit_predict(sd_features_CC,sd_labels_CC,td_features_CR,td_labels_CR)
print(f'C--> R: The accuracy of TCA is: {acc_CR:.4f}')

C--> R: The accuracy of TCA is: 0.5699


In [14]:
tca_PR = TCA(kernel_type='linear', dim=30, lamb=1, gamma=1)
acc_PR, ypre_PR = tca_PR.fit_predict(sd_features_PP,sd_labels_PP,td_features_PR,td_labels_PR)
print(f'P--> R: The accuracy of TCA is: {acc_PR:.4f}')

P--> R: The accuracy of TCA is: 0.6787


**Analysis:**

From the result we can see that all the TCA results is worse than the original KNN results. It means that this domain adaptation method cause `negative transfer`. Negative transfer means that the knowledge learned in the source domain has a negative effect on the learning in the target domain. 

The cause of negative transfer can be:
- There is no/weak relation between source domain and target domain.(the problem of dataset)
- The transfer learning method is not suitable for current dataset. (the problem of method)

Here we consider the second cause as the one to result in bad performance. Selecting appropriate method can solve this cause. And there are new proposed ways to solve cause 1. One of the most famous anti-negative transfer learning methods is the `Transitive transfer learning`. When the two domains are not similar, we can use several intermediate domains between these two domains to complete the transfer of knowledge. Another significant work is `Distant domain transfer learning`, proposed by Prof. QiangYang.
![image.jpg](images/1.jpg)

# 2. JDA (Joint Distribution Adaptation)
**Difference between JDA and TCA:**

1. JDA= TCA + Conditional distribution adaptation
2. TCA是无监督的方法（边缘分布适配不需要label），JDA是监督的方法，需要源域有label；
3. TCA不需要迭代，JDA需要迭代

**Main idea:**
JDA对数据的假设：1）源域和目标域边缘分布不同（这是肯定的，引入迁移学习的初衷）；2）源域和目标域条件分布不同（此在TCA中没有假设）。因此JDA方法所做的事就是同时适配源域和目标域的边缘分布和条件分布（此也称之为“联合分布”，但此时联合指的是同时适配，并非概率上的联合）。

对于适配源域和目标域边缘分布，JDA使用TCA，寻找一个变换A （TCA中为映射$\phi(.)$）,使得$P(A^T x_s),P(A^T x_t)$之间距离（MMD）尽可能的相近。同样，我们引入核方法，得到$D(D_s,D_t)=tr(A^T X M_0 X^T A)$，其中A对应K，$M_0$对应L。

对于适配源域和目标域条件分布，JDA 寻找一个变换A，使得$P(y_s|A^T x_s),P(y_t|A^T x_t)$之间距离（MMD）尽可能的相近。然而此setting下并不知道$y_t$，因此使用贝叶斯方法+充分统计量，我们得到：$P(y_t|x_t)=p(y_t)p(x_t|y_t)$ (经充分统计量可以忽略$P(x_t)$)。而求$P(y_t)$，JDA使用在source domain上训练的弱分类器（e.g.KNN,LR）来预测伪标签，使用伪标签代替真实标签。得到了$P(y_s|A^T x_s),P(y_t|A^T x_t)$之后，同样用TCA方法可以适配。这里使用弱分类器预测的伪标签虽然不准确，但是可以通过迭代解决，每一次迭代的标签都是上一轮的伪标签，因此可以逐步增高预测的准确率。

得到两个优化目标后，JDA将其结合起来得到优化目标：$\min \sum_{c=0}^C tr(A^T X M_c X^T A)+\lambda ||A||_F^2$，其中第二项为正则项。添加constraint: $\max A^T X H X^T A$，此限制确保了变换前后数据的方差不变（PCA思想）。合并上述三个目标得到JDA优化总目标：$\min \frac{\sum_{c=0}^C tr(A^T X M_c X^T A)+\lambda ||A||_F^2}{A^T X H X^T A}$。而此目标可以通过rayleigh quotient + 拉格朗日乘子法解决，得到:$(X\sum_{c=0}^C M_c X^T+\lambda I)A=XHX^TA\phi$，其中$\phi$为拉格朗日乘子。此时可通过Matlab直接求解。得到变换A。

Useful Link: https://zhuanlan.zhihu.com/p/27336930

In [17]:
class JDA:
    def __init__(self, kernel_type='primal', dim=30, lamb=1, gamma=1, T=10):
        '''
        Init func
        :param kernel_type: kernel, values: 'primal' | 'linear' | 'rbf'
        :param dim: dimension after transfer
        :param lamb: lambda value in equation
        :param gamma: kernel bandwidth for rbf kernel
        :param T: iteration number
        '''
        self.kernel_type = kernel_type
        self.dim = dim
        self.lamb = lamb
        self.gamma = gamma
        self.T = T

    def fit_predict(self, Xs, Ys, Xt, Yt):
        '''
        Transform and Predict using 1NN as JDA paper did
        :param Xs: ns * n_feature, source feature
        :param Ys: ns * 1, source label
        :param Xt: nt * n_feature, target feature
        :param Yt: nt * 1, target label
        :return: acc, y_pred, list_acc
        '''
        list_acc = []
        X = np.hstack((Xs.T, Xt.T))
        X /= np.linalg.norm(X, axis=0)
        m, n = X.shape
        ns, nt = len(Xs), len(Xt)
        e = np.vstack((1 / ns * np.ones((ns, 1)), -1 / nt * np.ones((nt, 1))))
        C = len(np.unique(Ys))
        H = np.eye(n) - 1 / n * np.ones((n, n))

        M = 0
        Y_tar_pseudo = None
        for t in range(self.T):
            N = 0
            M0 = e * e.T * C
            if Y_tar_pseudo is not None and len(Y_tar_pseudo) == nt:
                for c in range(0, C):
                    e = np.zeros((n, 1))
                    tt = Ys == c
                    e[np.where(tt == True)] = 1 / len(Ys[np.where(Ys == c)])
                    yy = Y_tar_pseudo == c
                    ind = np.where(yy == True)
                    inds = [item + ns for item in ind]
                    e[tuple(inds)] = -1 / len(Y_tar_pseudo[np.where(Y_tar_pseudo == c)])
                    e[np.isinf(e)] = 0
                    N = N + np.dot(e, e.T)
            M = M0 + N
            M = M / np.linalg.norm(M, 'fro')
            K = kernel(self.kernel_type, X, None, gamma=self.gamma)
            n_eye = m if self.kernel_type == 'primal' else n
            a, b = np.linalg.multi_dot([K, M, K.T]) + self.lamb * np.eye(n_eye), np.linalg.multi_dot([K, H, K.T])
            w, V = scipy.linalg.eig(a, b)
            ind = np.argsort(w)
            A = V[:, ind[:self.dim]]
            Z = np.dot(A.T, K)
            Z /= np.linalg.norm(Z, axis=0)
            Xs_new, Xt_new = Z[:, :ns].T, Z[:, ns:].T

            clf = KNeighborsClassifier(n_neighbors=1)
            clf.fit(Xs_new, Ys.ravel())
            Y_tar_pseudo = clf.predict(Xt_new)
            acc = sklearn.metrics.accuracy_score(Yt, Y_tar_pseudo)
            list_acc.append(acc)
            print('JDA iteration [{}/{}]: Acc: {:.4f}'.format(t + 1, self.T, acc))
        return acc, Y_tar_pseudo, list_acc

In [18]:
jda_AR = JDA(kernel_type='primal', dim=30, lamb=1, gamma=1,T=5)
acc_AR, ypre_AR, list_acc_AR = jda_AR.fit_predict(sd_features_AA,sd_labels_AA,td_features_AR,td_labels_AR)
print(f'A--> R: The accuracy of JDA is: {acc_AR:.4f}')

JDA iteration [1/5]: Acc: 0.6583
JDA iteration [2/5]: Acc: 0.6656
JDA iteration [3/5]: Acc: 0.6599
JDA iteration [4/5]: Acc: 0.6596
JDA iteration [5/5]: Acc: 0.6617
A--> R: The accuracy of JCA is: 0.6617


In [19]:
jda_CR = JDA(kernel_type='primal', dim=30, lamb=1, gamma=1,T=5)
acc_CR, ypre_CR, list_acc_CR = jda_CR.fit_predict(sd_features_CC,sd_labels_CC,td_features_CR,td_labels_CR)
print(f'C--> R: The accuracy of JDA is: {acc_CR:.4f}')

JDA iteration [1/5]: Acc: 0.5869
JDA iteration [2/5]: Acc: 0.6084
JDA iteration [3/5]: Acc: 0.5922
JDA iteration [4/5]: Acc: 0.5894
JDA iteration [5/5]: Acc: 0.5912
C--> R: The accuracy of JCA is: 0.5912


In [20]:
jda_PR = JDA(kernel_type='primal', dim=30, lamb=1, gamma=1,T=5)
acc_PR, ypre_PR, list_acc_PR = jda_PR.fit_predict(sd_features_PP,sd_labels_PP,td_features_PR,td_labels_PR)
print(f'P--> R: The accuracy of JDA is: {acc_PR:.4f}')

JDA iteration [1/5]: Acc: 0.6888
JDA iteration [2/5]: Acc: 0.6989
JDA iteration [3/5]: Acc: 0.6780
JDA iteration [4/5]: Acc: 0.6780
JDA iteration [5/5]: Acc: 0.6764
P--> R: The accuracy of JCA is: 0.6764


# 3. BDA (Balanced Distribution Adaptation)
**Main idea:**
JDA同时适配边缘分布和条件分布，但是对于不同的任务，边缘分布和条件分布并不是同等重要，因此，BDA方法可以有效衡量（通过平衡因子）这两个分布的权重，从而达到最好的结果。BDA优化的目标为：$$D(D_S,D_T) \approx (1-\mu)||\frac{1}{n}\sum_{i=1}^n x_{s_i} - \frac{1}{m}\sum_{j=1}^m x_{t_j}||^2 + \mu ||\frac{1}{n_c}\sum_{x_{s_i} \in D_s^{(c)}}^n x_{s_i} - \frac{1}{m_c}\sum_{x_{t_j} \in D_t^{(c)}}^n x_{t_j}||^2$$

如何求解平衡因子$\mu$？采用A-distance估计：
- Calculate total A-distance of source domain and target domain.  -->A
- Apply clustering in target domain, get the class $c_1,c_2,...c_k$. Calculate the cluster A-distance for each pair of class in source domain and target domain.  -->B
- $\mu=\frac{A}{B},\mu \in [0,1]$. 

当$\mu$ 趋近于0，表明源域和目标域之间不相似，因此优化边缘分布比较重要；当$\mu$ 趋近于1，表明源域和目标域之间很相似，因此优化条件分布比较重要。


In [13]:
class BDA:
    def __init__(self, kernel_type='primal', dim=30, lamb=1, mu=0.5, gamma=1, T=10, mode='BDA', estimate_mu=False):
        '''
        Init func
        :param kernel_type: kernel, values: 'primal' | 'linear' | 'rbf'
        :param dim: dimension after transfer
        :param lamb: lambda value in equation
        :param mu: mu. Default is -1, if not specificied, it calculates using A-distance
        :param gamma: kernel bandwidth for rbf kernel
        :param T: iteration number
        :param mode: 'BDA' | 'WBDA'
        :param estimate_mu: True | False, if you want to automatically estimate mu instead of manally set it
        '''
        self.kernel_type = kernel_type
        self.dim = dim
        self.lamb = lamb
        self.mu = mu
        self.gamma = gamma
        self.T = T
        self.mode = mode
        self.estimate_mu = estimate_mu

    def fit_predict(self, Xs, Ys, Xt, Yt):
        '''
        Transform and Predict using 1NN as JDA paper did
        :param Xs: ns * n_feature, source feature
        :param Ys: ns * 1, source label
        :param Xt: nt * n_feature, target feature
        :param Yt: nt * 1, target label
        :return: acc, y_pred, list_acc
        '''
        list_acc = []
        X = np.hstack((Xs.T, Xt.T))
        X /= np.linalg.norm(X, axis=0)
        m, n = X.shape
        ns, nt = len(Xs), len(Xt)
        e = np.vstack((1 / ns * np.ones((ns, 1)), -1 / nt * np.ones((nt, 1))))
        C = len(np.unique(Ys))
        H = np.eye(n) - 1 / n * np.ones((n, n))
        mu = self.mu
        M = 0
        Y_tar_pseudo = None
        Xs_new = None
        for t in range(self.T):
            N = 0
            M0 = e * e.T * C
            if Y_tar_pseudo is not None and len(Y_tar_pseudo) == nt:
                for c in range(0, C):
                    e = np.zeros((n, 1))
                    Ns = len(Ys[np.where(Ys == c)])
                    Nt = len(Y_tar_pseudo[np.where(Y_tar_pseudo == c)])

                    if self.mode == 'WBDA':
                        Ps = Ns / len(Ys)
                        Pt = Nt / len(Y_tar_pseudo)
                        alpha = Pt / Ps
                        mu = 1
                    else:
                        alpha = 1

                    tt = Ys == c
                    e[np.where(tt == True)] = 1 / Ns
                    yy = Y_tar_pseudo == c
                    ind = np.where(yy == True)
                    inds = [item + ns for item in ind]
                    e[tuple(inds)] = -alpha / Nt
                    e[np.isinf(e)] = 0
                    N = N + np.dot(e, e.T)

            # In BDA, mu can be set or automatically estimated using A-distance
            # In WBDA, we find that setting mu=1 is enough
            if self.estimate_mu and self.mode == 'BDA':
                if Xs_new is not None:
                    mu = estimate_mu(Xs_new, Ys, Xt_new, Y_tar_pseudo)
                else:
                    mu = 0
            M = (1 - mu) * M0 + mu * N
            M /= np.linalg.norm(M, 'fro')
            K = kernel(self.kernel_type, X, None, gamma=self.gamma)
            n_eye = m if self.kernel_type == 'primal' else n
            a, b = np.linalg.multi_dot(
                [K, M, K.T]) + self.lamb * np.eye(n_eye), np.linalg.multi_dot([K, H, K.T])
            w, V = scipy.linalg.eig(a, b)
            ind = np.argsort(w)
            A = V[:, ind[:self.dim]]
            Z = np.dot(A.T, K)
            Z /= np.linalg.norm(Z, axis=0)
            Xs_new, Xt_new = Z[:, :ns].T, Z[:, ns:].T

            clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
            clf.fit(Xs_new, Ys.ravel())
            Y_tar_pseudo = clf.predict(Xt_new)
            acc = sklearn.metrics.accuracy_score(Yt, Y_tar_pseudo)
            list_acc.append(acc)
            print('{} iteration [{}/{}]: Acc: {:.4f}'.format(self.mode, t + 1, self.T, acc))
        return acc, Y_tar_pseudo, list_acc

In [None]:
bda_AR = BDA(kernel_type='primal', dim=30, lamb=1, mu=0.5, mode='BDA', gamma=1, estimate_mu=False,T=5)
acc_AR, ypre_AR, list_acc_AR = bda_AR.fit_predict(sd_features_AA,sd_labels_AA,td_features_AR,td_labels_AR)
print(f'A--> R: The accuracy of BDA is: {acc_AR:.4f}')

BDA iteration [1/10]: Acc: 0.6583
BDA iteration [2/10]: Acc: 0.6656
BDA iteration [3/10]: Acc: 0.6599
BDA iteration [4/10]: Acc: 0.6596
BDA iteration [5/10]: Acc: 0.6617
BDA iteration [6/10]: Acc: 0.6617
BDA iteration [7/10]: Acc: 0.6608
BDA iteration [8/10]: Acc: 0.6615
BDA iteration [9/10]: Acc: 0.6610
BDA iteration [10/10]: Acc: 0.6610
A--> R: The accuracy of BDA is: 0.6610


In [14]:
bda_CR = BDA(kernel_type='primal', dim=30, lamb=1, mu=0.5, mode='BDA', gamma=1, estimate_mu=False,T=5)
acc_CR, ypre_CR, list_acc_CR = bda_CR.fit_predict(sd_features_CC,sd_labels_CC,td_features_CR,td_labels_CR)
print(f'C--> R: The accuracy of BDA is: {acc_CR:.4f}')

BDA iteration [1/5]: Acc: 0.5869
BDA iteration [2/5]: Acc: 0.6084
BDA iteration [3/5]: Acc: 0.5922
BDA iteration [4/5]: Acc: 0.5894
BDA iteration [5/5]: Acc: 0.5912
C--> R: The accuracy of BDA is: 0.5912


In [15]:
bda_PR = BDA(kernel_type='primal', dim=30, lamb=1, mu=0.5, mode='BDA', gamma=1, estimate_mu=False,T=5)
acc_PR, ypre_PR, list_acc_PR = bda_PR.fit_predict(sd_features_PP,sd_labels_PP,td_features_PR,td_labels_PR)
print(f'P--> R: The accuracy of BDA is: {acc_PR:.4f}')

BDA iteration [1/5]: Acc: 0.6888
BDA iteration [2/5]: Acc: 0.6989
BDA iteration [3/5]: Acc: 0.6780
BDA iteration [4/5]: Acc: 0.6780
BDA iteration [5/5]: Acc: 0.6764
P--> R: The accuracy of BDA is: 0.6764


# 3.1 WBDA (Weighted BDA)
**Difference between WBDA and BDA:**
Add weights for each of the class in target domain: $w_i=\frac{c_{it}}{c_{is}}$, where $c_{1t},c_{2t},...c_{kt}$ are the number of class $i$ in target domain and $c_{1s},c_{2s},...c_{ks}$ are the number of class $i$ in source domain

In [21]:
bda_AR = BDA(kernel_type='primal', dim=30, lamb=1, mode='WBDA', gamma=1, estimate_mu=False,T=5)
acc_AR, ypre_AR, list_acc_AR = bda_AR.fit_predict(sd_features_AA,sd_labels_AA,td_features_AR,td_labels_AR)
print(f'A--> R: The accuracy of WBDA is: {acc_AR:.4f}')

WBDA iteration [1/5]: Acc: 0.6583
WBDA iteration [2/5]: Acc: 0.6702
WBDA iteration [3/5]: Acc: 0.6709
WBDA iteration [4/5]: Acc: 0.6709
WBDA iteration [5/5]: Acc: 0.6704
A--> R: The accuracy of WBDA is: 0.6704


In [22]:
bda_CR = BDA(kernel_type='primal', dim=30, lamb=1, mode='WBDA', gamma=1, estimate_mu=False,T=5)
acc_CR, ypre_CR, list_acc_CR = bda_CR.fit_predict(sd_features_CC,sd_labels_CC,td_features_CR,td_labels_CR)
print(f'C--> R: The accuracy of WBDA is: {acc_CR:.4f}')

WBDA iteration [1/5]: Acc: 0.5869
WBDA iteration [2/5]: Acc: 0.6103
WBDA iteration [3/5]: Acc: 0.6149
WBDA iteration [4/5]: Acc: 0.6117
WBDA iteration [5/5]: Acc: 0.6140
C--> R: The accuracy of BDA is: 0.6140


In [23]:
bda_PR = BDA(kernel_type='primal', dim=30, lamb=1, mode='WBDA', gamma=1, estimate_mu=False,T=5)
acc_PR, ypre_PR, list_acc_PR = bda_PR.fit_predict(sd_features_PP,sd_labels_PP,td_features_PR,td_labels_PR)
print(f'P--> R: The accuracy of WBDA is: {acc_PR:.4f}')

WBDA iteration [1/5]: Acc: 0.6888
WBDA iteration [2/5]: Acc: 0.6874
WBDA iteration [3/5]: Acc: 0.6986
WBDA iteration [4/5]: Acc: 0.6954
WBDA iteration [5/5]: Acc: 0.6980
P--> R: The accuracy of BDA is: 0.6980


**Analysis:**
从实验结果我们可以看到，BDA的效果要优于TCA和JDA，这从BDA的工作原理上可以得到较好的解释：
- BDA is better than TCA because BDA's optimization goal considers the conditional distribution. Thus, when the source domain and target domain is similar, BDA can learn more cross-domain knowledge through the optimization of conditional distribution difference.
- BDA is better than JDA because it introduce the self-adapatation mechemism: balance factor. For different tasks, edge distribution and condition distribution are not equally important. BDA method can effectively measure the weight of the two distributions (through the balance factor), so as to achieve the best results. Therefore, it's more general than JDA.

# 4. CORAL (CORelation ALignment)
**Main idea:**
CORAL uses a linear transformation method to `align` (minimize the distance) the second-order statistical features of the source domain and target domain distribution. It is very effective for unsupervised domain adaptation.

CORAL uses the Frobenius norm as the matrix distance metric. The optimization goal of CORAL is:
$$\min_A ||C_{S'} - C_T||_F^2 = \min_A ||A^TC_SA - C_T||_F^2$$
where:

$A$ is the linear transformation matrix

$C_{S'}$ is the covariance of the transformed source features $D_SA$ and $||\cdot||_F^2$ 

$C_S$ and $C_T$ are the feature vector covariance matrices.

In [24]:
class CORAL:
    def __init__(self):
        super(CORAL, self).__init__()

    def fit(self, Xs, Xt):
        '''
        Perform CORAL on the source domain features
        :param Xs: ns * n_feature, source feature
        :param Xt: nt * n_feature, target feature
        :return: New source domain features
        '''
        cov_src = np.cov(Xs.T) + np.eye(Xs.shape[1])
        cov_tar = np.cov(Xt.T) + np.eye(Xt.shape[1])
        A_coral = np.dot(scipy.linalg.fractional_matrix_power(cov_src, -0.5),
                         scipy.linalg.fractional_matrix_power(cov_tar, 0.5))
        Xs_new = np.real(np.dot(Xs, A_coral))
        return Xs_new

    def fit_predict(self, Xs, Ys, Xt, Yt):
        '''
        Perform CORAL, then predict using 1NN classifier
        :param Xs: ns * n_feature, source feature
        :param Ys: ns * 1, source label
        :param Xt: nt * n_feature, target feature
        :param Yt: nt * 1, target label
        :return: Accuracy and predicted labels of target domain
        '''
        Xs_new = self.fit(Xs, Xt)
        clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
        clf.fit(Xs_new, Ys.ravel())
        y_pred = clf.predict(Xt)
        acc = sklearn.metrics.accuracy_score(Yt, y_pred)
        return acc, y_pred

In [25]:
coral_AR = CORAL()
coral_CR = CORAL()
coral_PR = CORAL()
acc_AR, ypre_AR = coral_AR.fit_predict(sd_features_AA,sd_labels_AA,td_features_AR,td_labels_AR)
print(f'A--> R: The accuracy of CORAL is: {acc_AR:.4f}')
acc_CR, ypre_CR = coral_CR.fit_predict(sd_features_CC,sd_labels_CC,td_features_CR,td_labels_CR)
print(f'C--> R: The accuracy of CORAL is: {acc_CR:.4f}')
acc_PR, ypre_PR = coral_PR.fit_predict(sd_features_PP,sd_labels_PP,td_features_PR,td_labels_PR)
print(f'P--> R: The accuracy of CORAL is: {acc_PR:.4f}')


A--> R: The accuracy of CORAL is: 0.6479
C--> R: The accuracy of CORAL is: 0.5837
P--> R: The accuracy of CORAL is: 0.6851


**Analysis:**
虽然CORAL的结果要略差于pure KNN，但是要优于TCA。CORAL最大的优势在于它训练速度非常快，因为CORAL只是进行了简单的矩阵运算，没有像TCA那样需要进行特征分解，也没有像JDA,BDA这样的迭代过程。综合计算成本和时间成本，我们认为CORAL在此任务上的表现优于TCA和JDA。