In classification, it is typically assumed that the labeled training data comes from the same distribution as the test data. However, many real-world applications violate this assumption: the training and test data follow different but related marginal distributions, which are referred to as domains. In this context, the learner must take special care during the learning process to infer models that adapt well to the test data they are deployed on.

For example, in studies 4 and 5 of our paper, we train the model on Mumsnet data and test it on Reddit data and experimental data. Even though both the source and target data lie in the same D-dimensional space (D is the number of features, or independent variables), they have been drawn according to different marginal distributions. Consequently, rather than working on the original data directly, we learn the shift between the source and target domains. This problem is known as domain adaptation (DA). DA typically aims to use information from both the source (train) and target (test) domains during the learning process so that the model adapts automatically.


# Subspace Alignment
In this tutorial, we apply an unsupervised DA solution proposed in [1], known as subspace alignment, to learn a mapping function that aligns the source distribution with the target one. (In the unsupervised setting, training requires only labeled examples from the source data and unlabeled examples from the target data.)

We go through this process step by step. By the end of this tutorial, you should be able to apply it to the datasets we provide or to datasets of your own.

1- First, we transform every source and target example into a D-dimensional z-normalized vector (i.e. each feature has zero mean and unit standard deviation).

2- Then, using PCA, we select for each domain the d eigenvectors that correspond to the d largest eigenvalues.

3- These eigenvectors are used as bases of the source and target subspaces, denoted by X_S and X_T respectively (X_S, X_T ∈ R^{D×d}).

4- Finally, X_S and X_T are used to learn the shift between the two domains.
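
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the implementation used in the rest of this notebook: the helper names (`zscore_columns`, `pca_basis`, `subspace_align`) and the random toy data are assumptions made for this example. The alignment step follows the same formula as the `align` function in this notebook, where the aligned source basis is X_S X_S^T X_T.

```python
import numpy as np

def zscore_columns(X):
    """Step 1: z-normalize each feature (column) to zero mean, unit std."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_basis(X, d):
    """Steps 2-3: the d eigenvectors of X^T X with the largest eigenvalues (D x d)."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # X^T X is symmetric
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues in descending order
    return eigvecs[:, order[:d]]

def subspace_align(source, target, d):
    """Step 4: project the source onto its basis aligned with the target basis."""
    S = zscore_columns(source)
    T = zscore_columns(target)
    X_S = pca_basis(S, d)
    X_T = pca_basis(T, d)
    X_a = X_S @ (X_S.T @ X_T)                    # aligned source basis, D x d
    return S @ X_a, T @ X_T

# synthetic data: a hypothetical target domain obtained by a random linear map
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 5))
target = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))

S_a, T_t = subspace_align(source, target, d=2)
print(S_a.shape, T_t.shape)  # (200, 2) (150, 2)
```

After alignment, both domains live in the same d-dimensional space, so a classifier trained on `S_a` can be applied to `T_t`.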

## Tuning the efficient number of eigenvectors (d_max)

The only hyperparameter of the algorithm is the number d of eigenvectors. To tune it, we first calculate an upper bound on d, denoted d_max. We then consider subspaces of dimensionality d = 1 to d_max and select the d∗ that minimizes the classification error, estimated by 10-fold cross-validation over the labeled source data.
More details can be found in [1].
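
For reference, the condition that the function `get_max_dimension` in this notebook checks for each d can be written as follows. It is transcribed from the code in this notebook rather than quoted from [1], so the notation is an assumption: g_d is the smaller of the source and target eigenvalue gaps at position d, n_min is the smaller of the two sample sizes, B is a positive constant, δ is the confidence parameter (0.1 in the code), and γ is the deviation value the code derives from the eigenvalue gaps.

```latex
g_d = \min\left(\lambda^{S}_{d} - \lambda^{S}_{d+1},\; \lambda^{T}_{d} - \lambda^{T}_{d+1}\right),
\qquad
g_d \;\geq\; \left(1 + \sqrt{\frac{\ln(2/\delta)}{2}}\right)
\frac{16\, d^{3/2}\, B}{\gamma \sqrt{n_{\min}}}
```

d_max is the largest d for which this inequality holds.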




[1] Fernando, B., Habrard, A., Sebban, M., & Tuytelaars, T. (2013). Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE international conference on computer vision (pp. 2960-2967).

In [78]:
import numpy as np
from scipy import stats
from numpy import linalg as LA
import copy
import math

In [79]:
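# using Principal Component Analysis (PCA) to identify eigenvectors and eigenvalues
def PrincipleC(X, n):
    X = np.asarray(X, dtype=float)
    # scatter matrix of the z-normalized data; note that it is divided by the
    # number of features, a constant that only rescales the eigenvalues
    C = np.dot(X.T, X) / X.shape[1]
    # C is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors
    Lambda, U = LA.eigh(C)
    # sort eigenpairs by decreasing eigenvalue and keep the n leading ones
    indices = np.argsort(Lambda)[::-1]
    Lambda = Lambda[indices]
    U = U[:, indices]
    return U[:, :n], Lambda[:n]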

#this function calculates the upper bound of d - maximum number of dimensions
def get_max_dimension(X_source, X_target, dimensions, verbose=True):
    #transform the data in the form of a D-dimensional z-normalized vector
    X1 = stats.zscore(X_source)
    X2 = stats.zscore(X_target)
    
    #extract the eigenvectors and eigenvalues
    X_s, Lambda_s = PrincipleC(X1.copy(), dimensions)
    X_t, Lambda_t = PrincipleC(X2.copy(), dimensions)

    lambdas = []
    gammas = []
    B = 100  # an arbitrary positive constant; it cancels out in the comparison below
    delta = 0.1
    n_min = np.minimum(X_source.shape[0], X_target.shape[0])
    for i in range(0, dimensions-1):
        lmin = np.minimum(Lambda_t[i]-Lambda_t[i+1], Lambda_s[i]-Lambda_s[i+1])
        gamma = (1+np.sqrt(math.log(2/delta)/2))*((16*np.power(i+1, 3/2)*B)/(np.sqrt(n_min)*lmin))
        lambdas.append({'d': i+1, 'lmin': copy.deepcopy(lmin)})
        gammas.append(copy.deepcopy(gamma))

    gamma = max(gammas)

    d_max = 1
    for dic_ in lambdas:
        d = dic_['d']
        lmin = dic_['lmin']
        upper_b = (1+np.sqrt(math.log(2/delta)/2))*((16*np.power(d, 3/2)*B)/(gamma*np.sqrt(n_min)))
        if lmin >= upper_b:
            d_max = d
            
    if verbose:
        print('\n upper dimension bound:', d_max)
    return d_max

The code above defines get_max_dimension, which calculates the upper bound of d, d_max. To find the optimum number of dimensions, we run the function get_optimum_dimension: for each dimensionality from d = 1 to d_max, we align the source and target data accordingly (function align) and compute the accuracy of training and testing on the source data using 10-fold cross-validation. The optimum dimensionality is the one that yields the highest accuracy on the source data.
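
The selection rule itself is simple and can be sketched independently of the data. In the illustration below, `select_dimension` and `toy_scores` are hypothetical names, and the accuracies are made-up numbers standing in for real cross-validation results. Ties are resolved in favor of the larger d, mirroring the `>=` comparison in get_optimum_dimension.

```python
def select_dimension(cv_accuracy, d_max):
    """Return (d, accuracy) for the d in 1..d_max with the highest accuracy.

    `cv_accuracy` is a hypothetical callable mapping d to a mean
    cross-validated accuracy on the source data.
    """
    best_d, best_acc = 1, 0.0
    for d in range(1, d_max + 1):
        acc = cv_accuracy(d)
        if acc >= best_acc:          # on ties, prefer the larger d
            best_d, best_acc = d, acc
    return best_d, best_acc

# made-up accuracies for illustration only
toy_scores = {1: 0.61, 2: 0.70, 3: 0.70, 4: 0.66}
best_d, best_acc = select_dimension(toy_scores.get, d_max=4)
print(best_d, best_acc)  # 3 0.7
```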

In [80]:
import pandas as pd
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# the main function for shifting the source and target distributions
def align(X_1, X_2, dim):
    X_1 = stats.zscore(X_1)
    X_2 = stats.zscore(X_2)

    X_s, Lambda_s = PrincipleC(X_1.copy(), dim)
    X_t, Lambda_t = PrincipleC(X_2.copy(), dim)

    X_a = np.matrix(X_s) * np.matrix(X_s).transpose() * np.matrix(X_t)

    S_a = np.matrix(X_1) * np.matrix(X_a)

    T_t = np.matrix(X_2) * np.matrix(X_t)

    return S_a, T_t

# evaluate a fitted binary classifier on a test set and return its accuracy
def singletest(model, test_X, test_y, verbose=False):
    predict_abs = model.predict(test_X)
    probs = model.predict_proba(test_X)
    tn, fp, fn, tp = confusion_matrix(test_y, predict_abs).ravel()
    fpr, tpr, thresholds = roc_curve(test_y, probs[:, 1], pos_label=1)
    
    acc_ = (tp + tn) / (tp + tn + fp + fn)
    fpr_tpr_auc_ = auc(fpr, tpr)

    if verbose:
        print('\n accuracy:', acc_, 'AUC:', fpr_tpr_auc_)
        print('tn:', tn, 'fp:', fp, 'fn:', fn, 'tp:', tp)
        print('false positive error rate:', fp / (fp + tn))
        print('false negative error rate:', fn / (fn + tp))
    return acc_


# cross validation: mean accuracy over 10 stratified folds
def test_cross_validation(df, label):
    cv = StratifiedKFold(n_splits=10, shuffle=True)
    y = df[label].values
    X = df.drop(columns=[label]).values
    accuracy = []
    # the fold indices must not be named `train`, which would shadow the train() function
    for train_idx, test_idx in cv.split(X, y):
        clf = train(X[train_idx], y[train_idx], verbose=False)
        acc_ = singletest(clf, X[test_idx], y[test_idx])
        accuracy.append(acc_)
    return np.mean(accuracy)


# calculating the optimum number of dimensions
def get_optimum_dimension(d_max, X_source, y_source, X_target, label='forum_id', verbose=False):
    acc_max = 0
    d_optimum = 1
    for j in range(1, d_max + 1):
        X_s, X_t = align(copy.deepcopy(X_source), copy.deepcopy(X_target), j)
        train_df = pd.concat([pd.DataFrame(X_s), y_source], axis=1)
        acc_ = test_cross_validation(train_df, label=label)
        if acc_ >= acc_max:
            acc_max = acc_
            d_optimum = j
    if verbose:
        print('\n maximum accuracy for training and testing on source data:', acc_max)
        print('\n optimum dimensions:', d_optimum)

    return d_optimum


def get_shared_subspace(X_1, X_2, dim):
    scaler_1 = StandardScaler().fit(X_1)
    X_1 = scaler_1.transform(X_1)

    scaler_2 = StandardScaler().fit(X_2)
    X_2 = scaler_2.transform(X_2)

    X_s, Lambda_s = PrincipleC(X_1.copy(), dim)
    X_t, Lambda_t = PrincipleC(X_2.copy(), dim)

    X_a = np.matrix(X_s) * np.matrix(X_s).transpose() * np.matrix(X_t)

    return X_a, X_t, scaler_1, scaler_2

The following cell defines general utility functions for reading the datasets and training the model.

In [81]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # used when grid_search=True


def preprocessing(df, word_limit, min_WC):
    df = df.dropna()
    if word_limit:
        df = df.loc[df['WC'] >= min_WC]
    return df

def read_csv(path, word_limit, min_WC):
    try:
        df = pd.read_csv(path)
    except Exception:
        # report which file failed, then re-raise the original error
        print('error in reading file:', path)
        raise
    df = preprocessing(df, word_limit, min_WC)
    return df


def separating_users(df):
    fem_df = df.loc[df.forum_id == 1]
    par_df = df.loc[df.forum_id == 0]
    
    # participants who are posting in both forums
    within_p = set(fem_df.user_id.unique()).intersection(par_df.user_id.unique())
    # participants who are posting only in one forum, parent or feminist
    between_p = df[~df.user_id.isin(within_p)].user_id.unique()

    return between_p, within_p


def split_between_within_sets(df):
    # separating users into between and within participants
    between_participants, within_participants = separating_users(df.copy())

    within_set = df.loc[df.user_id.isin(within_participants)]
    between_set = df.loc[df.user_id.isin(between_participants)]
    
    return between_set, within_set


def get_train_set(between_set, batch, verbose=False):
    between_set = between_set.sample(frac=1)
    
    if verbose:
        print('number of between participants is:{}'.format(len(between_set.user_id.unique())))

    # building the train set by randomly selecting posts from between participants
    posts_between_forum1 = between_set[between_set['forum_id'] == 1][:batch]
    posts_between_forum0 = between_set[between_set['forum_id'] == 0][:batch]
    trainDB = pd.concat([posts_between_forum1, posts_between_forum0])
    
    return trainDB


def extract_testcases(posts_within, no=None):
    # randomly selecting one post per forum for each within participant
    testDB = posts_within.sample(frac=1)
    testDB = testDB.drop_duplicates(subset=['user_id', 'forum_id'])

    # if there is no limit on the number of test cases, keep one post per forum
    # for each within participant; otherwise, randomly choose `no` users from
    # the within participants
    if no is not None:
        within_participants = posts_within.user_id.unique()
        testUsers = np.random.choice(within_participants, no, replace=False)
        testDB = testDB.loc[testDB['user_id'].isin(testUsers)]

    return testDB


def get_test_set(within_set, batch, verbose=False):
    within_set = within_set.sample(frac=1)
    
    if verbose:
        print('number of within participants is:{}'.format(len(within_set.user_id.unique())))
    
    # building the test set by randomly selecting one post per forum from each within participant
    testDB = extract_testcases(within_set)
    
    return testDB


def Logistic_Regression(grid_search=False):
    clf = LogisticRegression(solver='lbfgs', max_iter = 2000)

    if grid_search:
        tuned_parameters = [{'C': [1e-3, 1e-2, 1e-1, 1]}]
        clf = GridSearchCV(LogisticRegression(solver='lbfgs'), tuned_parameters, cv=10,
                           scoring='accuracy', n_jobs=-1)

    return clf


def train(X_train, y_train, verbose=True):
    model = Logistic_Regression()
    model.fit(X_train, y_train)


    if verbose:
        print('model is trained')

    return model

In this part, we prepare the datasets for applying the DA solution. The source data is the Mumsnet dataset (Mumsnet_feminist_parent.csv) and the target data is either the Reddit dataset (Reddit_feminist_parent.csv) or the experimental dataset (Experimental_data_within.csv, Experimental_data_between.csv).

First:

- download the datasets: Mumsnet_feminist_parent.csv, Reddit_feminist_parent.csv, Experimental_data_within.csv, Experimental_data_between.csv
- copy the datasets into a folder named data


DBDIR points to the directory that contains the datasets (the data folder).

Here we use all the stylistic LIWC features to train our identity detection model; however, any subset of these features can be used.

In [82]:
DBDIR = './data/'
Mumsnet_DB = 'Mumsnet_feminist_parent.csv'
Reddit_DB = 'Reddit_feminist_parent.csv'
Exp_DB_within = 'Experimental_data_within.csv'
Exp_DB_between = 'Experimental_data_between.csv'

# all LIWC stylistic features
ALL_STYLISTIC_FEATURES = ['WPS', 'i', 'we', 'you', 'shehe', 'they', 'ipron','article', 'auxverb', 'past',
                    'present', 'future', 'adverb', 'preps','conj', 'quant', 'number', 'time', 'Sixltr',
                    'Period', 'Colon', 'SemiC', 'QMark', 'Dash', 'Quote', 'Apostro', 'Parenth', 'OtherP',
                    'negate', 'swear', 'posemo','negemo', 'assent', 'nonfl', 'filler', 'Exclam', 'insight',
                    'cause', 'discrep', 'tentat', 'certain', 'inhib', 'incl', 'excl']

In [83]:

def prepare_mumsnet_data():
        mumsnet_df = read_csv(dbdir+Mumsnet_DB, word_limit, min_WC)
        mumsnet_between_set, mumsnet_within_set = split_between_within_sets(mumsnet_df)
        mumsnet_train = get_train_set(mumsnet_between_set, batch_size, verbose=False)
        mumsnet_test = get_test_set(mumsnet_within_set, batch_size, verbose=False)
        
        
        source_train = mumsnet_train.reset_index(drop=True)
        source_test = mumsnet_test.reset_index(drop=True)
        
        source_train_X = source_train[ALL_STYLISTIC_FEATURES].astype(float)
        source_train_y = source_train['forum_id']
        
        source_test_X = source_test[ALL_STYLISTIC_FEATURES].astype(float)
        source_test_y = source_test['forum_id']
        
        return source_train_X, source_test_X, source_train_y, source_test_y
        

def prepare_reddit_data():
        reddit_df = read_csv(dbdir+Reddit_DB, word_limit, min_WC)
        reddit_between_set, reddit_within_set = split_between_within_sets(reddit_df)
        reddit_train = get_train_set(reddit_between_set, batch_size, verbose=False)
        reddit_test = get_test_set(reddit_within_set, batch_size, verbose=False)
        
        target_train = reddit_train.reset_index(drop=True)
        target_test = reddit_test.reset_index(drop=True)
        
        target_train_X = target_train[ALL_STYLISTIC_FEATURES].astype(float)
        target_train_y = target_train['forum_id']
        
        target_test_X = target_test[ALL_STYLISTIC_FEATURES].astype(float)
        target_test_y = target_test['forum_id']
        
        return target_train_X, target_test_X, target_train_y, target_test_y
    
    

def prepare_experimental_data(test_condition='within', test_topic='hm'):
    exp_within = pd.read_csv(dbdir + Exp_DB_within)
    exp_between = pd.read_csv(dbdir + Exp_DB_between)

    exp_data = pd.concat([exp_within, exp_between], axis=0)
    
    target_train_X = exp_data[ALL_STYLISTIC_FEATURES].astype(float)
    target_train_y = exp_data['condition']
    
    if test_condition=='within':
        test_df = exp_within.loc[exp_within.topic==test_topic]
        target_test_X = test_df[ALL_STYLISTIC_FEATURES].astype(float)
        target_test_y = test_df['condition']
    elif test_condition=='between':
        test_df = exp_between.loc[exp_between.topic==test_topic]
        target_test_X = test_df[ALL_STYLISTIC_FEATURES].astype(float)
        target_test_y = test_df['condition']
    else:
        # without this branch the return below fails with an unbound variable
        raise ValueError("test_condition must be 'within' or 'between'")
        
    
    return target_train_X, target_test_X, target_train_y, target_test_y


The source data is the Mumsnet dataset; the Reddit dataset and the experimental dataset are the target data. For the Reddit dataset, the training data are drawn from the between samples and the test data from the within samples. For the experimental data, the training data include both the within and between samples, and the test data are selected with two parameters: test_condition ('within' or 'between') and test_topic ('cc', 'ow' or 'hm').
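
To make the between/within distinction concrete, here is a minimal pure-Python sketch of the rule that separating_users applies: within participants post in both forums, between participants post in only one. The function name `split_between_within` and the toy posts are assumptions made for this illustration; the notebook itself performs the split on pandas DataFrames.

```python
def split_between_within(posts):
    """Split users by posting behavior.

    posts: iterable of (user_id, forum_id) pairs with forum_id in {0, 1}.
    Returns (between, within) as sets of user ids.
    """
    forums_by_user = {}
    for user_id, forum_id in posts:
        forums_by_user.setdefault(user_id, set()).add(forum_id)
    # within participants have posted in both forums
    within = {u for u, forums in forums_by_user.items() if forums == {0, 1}}
    # between participants have posted in exactly one forum
    between = set(forums_by_user) - within
    return between, within

# toy data: user "a" posts in both forums, "b" and "c" in one forum each
toy_posts = [("a", 0), ("a", 1), ("b", 0), ("c", 1), ("c", 1)]
between, within = split_between_within(toy_posts)
print(sorted(between), sorted(within))  # ['b', 'c'] ['a']
```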

In [85]:

batch_size = 50000
word_limit = True
min_WC = 25
features = ALL_STYLISTIC_FEATURES   
dbdir = DBDIR


print('loading source and target data...')
source_train_X, source_test_X, source_train_y, source_test_y = prepare_mumsnet_data()
# choose the target data - reddit data or experimental data
#target_train_X, target_test_X, target_train_y, target_test_y = prepare_reddit_data()
target_train_X, target_test_X, target_train_y, target_test_y = prepare_experimental_data(test_condition='between', 
                                                                                         test_topic='hm')

print('\n source data size:', source_train_X.shape[0], '\n traget_train_size:', target_train_X.shape[0], 
     '\n traget_test_size:', target_test_X.shape[0])


# train a model on source data and test it on target (test) data before applying the DA
print('\n training a model on original source data ...')
clf = train(source_train_X, source_train_y)

print('\n test the model on original target test data ...\n')
singletest(clf, target_test_X.copy(deep=True), target_test_y, verbose=True)


# calculating the upper bound of dimensionality
d_max = get_max_dimension(source_train_X[ALL_STYLISTIC_FEATURES].copy(deep=True), 
                          target_train_X[ALL_STYLISTIC_FEATURES].copy(deep=True), source_train_X.shape[1])

# calculate the optimum dimensionality
d_optimum = get_optimum_dimension(d_max, source_train_X[ALL_STYLISTIC_FEATURES].copy(deep=True), source_train_y, 
                                  target_train_X[ALL_STYLISTIC_FEATURES].copy(deep=True), verbose=True)


# create the shared subspace according to the optimum number of dimensions
X_a, X_t, scaler_a, scaler_t = get_shared_subspace(source_train_X.copy(deep=True),target_train_X.copy(deep=True), 
                                                   d_optimum)

# standardize the source train data and the target test data with the fitted scalers
X_1 = scaler_a.transform(source_train_X)
X_2 = scaler_t.transform(target_test_X)

S_a = np.matrix(X_1) * np.matrix(X_a)
T_t = np.matrix(X_2) * np.matrix(X_t)

y_s, y_t = source_train_y, target_test_y


# train a model using transformed source train data
print('\n training a model on transformed source data ...')
clf = train(pd.DataFrame(S_a), y_s)

print('\n test the model on transformed target test data: \n')
# test the model on transformed target test data
singletest(clf, pd.DataFrame(T_t), y_t, verbose=True)

loading source and target data...

 source data size: 100000 
 traget_train_size: 456 
 traget_test_size: 110

 training a model on original source data ...
model is trained

 test the model on original target test data ...


 accuracy: 0.6454545454545455 AUC: 0.6431924882629108
tn: 7 fp: 32 fn: 7 tp: 64
false positive error rate: 0.8205128205128205
false negative error rate: 0.09859154929577464

 training a model on transformed source data ...
model is trained

 test the model on transformed target test data: 


 accuracy: 0.6454545454545455 AUC: 0.6710003611412062
tn: 28 fp: 11 fn: 28 tp: 43
false positive error rate: 0.28205128205128205
false negative error rate: 0.39436619718309857
