***If you are unfamiliar with Jupyter Notebook, installation and usage instructions are available at http://jupyter.org/install. Before installing Jupyter Notebook, make sure that Python is installed on your system (our code uses Python 3). We recommend installing Python and Jupyter with the conda package manager.***



This is the Python implementation of the paper "ASIA: Automated Social Identity Assessment Using Linguistic Style".

# Introduction

This tutorial is aimed at readers who have a basic familiarity with Python. If you have not used Python before, we recommend starting with the course "Introduction to Data Science in Python" on Coursera (https://www.coursera.org/learn/python-data-analysis).

Please see the original paper for the detailed description of the procedure. (Lines starting with "#" are comments and will not be executed by Python.)

To run the code, you first need to:

- download the datasets (Mumsnet_feminist_parent.csv, Reddit_feminist_parent.csv, Experimental_data.csv)
- copy the datasets into a folder named data




DBDIR points to the directory that contains the datasets (the data folder), and SAVEDIR is the file path where the trained model is saved (optional).

Here we use all the stylistic LIWC features to train our identity detection model. However, any subset of these features can be used instead.


In [42]:

DBDIR = './data/'
Mumsnet_DB = 'Mumsnet_feminist_parent.csv'
Reddit_DB = 'Reddit_feminist_parent.csv'
Experimental_DB = 'Experimental_data_within.csv' # or 'Experimental_data_between.csv'

SAVEDIR = './save_dir/logr.sav'

# all LIWC stylistic features
ALL_STYLISTIC_FEATURES = ['WPS', 'i', 'we', 'you', 'shehe', 'they', 'ipron','article', 'auxverb', 'past',
                    'present', 'future', 'adverb', 'preps','conj', 'quant', 'number', 'time', 'Sixltr',
                    'Period', 'Colon', 'SemiC', 'QMark', 'Dash', 'Quote', 'Apostro', 'Parenth', 'OtherP',
                    'negate', 'swear', 'posemo','negemo', 'assent', 'nonfl', 'filler', 'Exclam', 'insight',
                    'cause', 'discrep', 'tentat', 'certain', 'inhib', 'incl', 'excl']
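
As noted above, the model does not have to use every stylistic feature. The following is a minimal sketch of one way to pick a subset. The helper `select_features`, the name `PRONOUN_FEATURES`, and the shortened stand-in list are hypothetical and are not part of the original code.

```python
# Shortened stand-in for ALL_STYLISTIC_FEATURES (assumption: used here only to
# keep the example self-contained).
ALL_FEATURES = ['WPS', 'i', 'we', 'you', 'shehe', 'they', 'ipron', 'article']


def select_features(all_features, wanted):
    """Hypothetical helper: return the wanted features in their original order."""
    wanted = set(wanted)
    missing = wanted - set(all_features)
    if missing:
        # fail early on a misspelled feature name
        raise ValueError('unknown features: {}'.format(sorted(missing)))
    return [f for f in all_features if f in wanted]


# keep only the personal pronoun features
PRONOUN_FEATURES = select_features(ALL_FEATURES, ['they', 'i', 'we', 'you', 'shehe'])
print(PRONOUN_FEATURES)
```

A list built this way could then be assigned to `features` in `main()` in place of `ALL_STYLISTIC_FEATURES`.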



## Some utility functions

This section contains functions which prepare data for training and testing.


### Reading input files 
We first read the CSV files and preprocess them by removing rows that contain NaN values and, if word_limit is set, dropping posts shorter than min_WC words.



In [43]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocessing(df, word_limit, min_WC):
    df = df.dropna()
    if word_limit:
        df = df.loc[df['WC'] >= min_WC]
    return df

def read_csv(path, word_limit, min_WC):
    try:
        df = pd.read_csv(path)
    except Exception:
        # report which file failed, then re-raise the original error
        print('error in reading file: {}'.format(path))
        raise
    df = preprocessing(df, word_limit, min_WC)
    return df
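
To make the effect of the two preprocessing steps concrete, here is a small sketch on a toy DataFrame. The column values are invented for illustration. The `WC` (word count) column name follows the code above, and `msg_id` and `i` are used only as example columns.

```python
import numpy as np
import pandas as pd

# Toy data (assumption: invented values, not taken from the real datasets).
toy = pd.DataFrame({
    'msg_id': [1, 2, 3, 4],
    'WC': [40, 10, 30, 25],
    'i': [2.5, 1.0, np.nan, 0.0],
})

min_WC = 25
cleaned = toy.dropna()                           # drops msg_id 3 (NaN in column 'i')
cleaned = cleaned.loc[cleaned['WC'] >= min_WC]   # drops msg_id 2 (only 10 words)

print(cleaned['msg_id'].tolist())
```

Note that the word count filter is inclusive: a post with exactly `min_WC` words is kept.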


### Building train and test sets

To build the train and test sets, we first need to separate users into between and within participants.

Within participants are users who participate in both forums by posting at least once in each forum. Between participants are those who have posted in only one forum (feminist or parent).


In [44]:
def separating_users(df):
    fem_df = df.loc[df.forum_id == 1]
    par_df = df.loc[df.forum_id == 0]
    
    # participants who are posting in both forums
    within_p = set(fem_df.user_id.unique()).intersection(par_df.user_id.unique())
    # participants who are posting only in one forum, parent or feminist
    between_p = df[~df.user_id.isin(within_p)].user_id.unique()

    return between_p, within_p
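
The split can be illustrated on a toy table. This sketch repeats the same set logic inline so that it runs on its own. The user ids are invented, and `forum_id` is 1 for the feminist forum and 0 for the parent forum, as in the code above.

```python
import pandas as pd

# Toy posts (assumption: invented users, one row per post).
toy = pd.DataFrame({
    'user_id':  ['a', 'a', 'b', 'c', 'c', 'd'],
    'forum_id': [1,   0,   1,   0,   0,   1],
})

fem_users = set(toy.loc[toy.forum_id == 1, 'user_id'])
par_users = set(toy.loc[toy.forum_id == 0, 'user_id'])

within = fem_users & par_users       # posted in both forums
between = set(toy.user_id) - within  # posted in exactly one forum

print(sorted(within))
print(sorted(between))
```

User `c` has two posts but both are in the parent forum, so `c` is still a between participant.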

We build the test set by randomly choosing one post per forum for each within participant. If a limit `no` is given, the test set is further restricted to that many randomly chosen within participants.

In [45]:
def extract_testcases(posts_within, no=None):
    # randomly selecting one post per forum for each within participant
    testDB = posts_within.sample(frac=1)
    testDB = testDB.drop_duplicates(subset=['user_id', 'forum_id'])

    # if `no` is None, keep one post per forum for every within participant;
    # otherwise, restrict the test set to `no` randomly chosen within participants
    if no is not None:
        within_participants = posts_within.user_id.unique()
        testUsers = np.random.choice(within_participants, no, replace=False)
        testDB = testDB.loc[testDB['user_id'].isin(testUsers)]

    return testDB
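
The shuffle-then-deduplicate idiom used above is worth seeing in isolation. In this sketch the data are invented, and `random_state=0` is set only to make the example repeatable (the original code does not fix a seed).

```python
import pandas as pd

# Toy posts from two within participants (assumption: invented data).
posts = pd.DataFrame({
    'msg_id':   [1, 2, 3, 4, 5, 6],
    'user_id':  ['a', 'a', 'a', 'b', 'b', 'b'],
    'forum_id': [1, 1, 0, 1, 0, 0],
})

# Shuffle the rows, then keep the first row seen for each (user, forum) pair.
# Because the order is random, this picks a random post for each pair.
shuffled = posts.sample(frac=1, random_state=0)
one_per_pair = shuffled.drop_duplicates(subset=['user_id', 'forum_id'])

print(len(one_per_pair))
```

There are four distinct (user, forum) pairs in the toy data, so four posts remain regardless of the shuffle order.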

The train set is built by randomly choosing posts from between participants, and the test set is built by randomly choosing two posts from each within participant, one per forum.

In [46]:
def split_between_within_sets(df):
    # separating users into between and within participants
    between_participants, within_participants = separating_users(df.copy())

    within_set = df.loc[df.user_id.isin(within_participants)]
    between_set = df.loc[df.user_id.isin(between_participants)]
    
    return between_set, within_set


def get_train_set(sub_df, batch, verbose=True):
    sub_df = sub_df.sample(frac=1)

    # building the train set by randomly selecting up to `batch` posts from each forum
    posts_forum1 = sub_df[sub_df['forum_id'] == 1][:batch]
    posts_forum0 = sub_df[sub_df['forum_id'] == 0][:batch]
    trainDB = pd.concat([posts_forum1, posts_forum0])
    
    if verbose:
        print('\ntrain set size:{}'.format(trainDB.shape[0]))
    
    return trainDB
    

def get_test_set(sub_df, batch, verbose=True):
    sub_df = sub_df.sample(frac=1)
    

    # building the test set by randomly selecting one post per forum from each within participant
    testDB = extract_testcases(sub_df)
    
    if verbose:
        print('test set size:{}\n'.format(testDB.shape[0]))
        
        
    return testDB


def train_test_prepration(trainDB, testDB, features, standardize=False):
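    if standardize:
        # work on copies so the callers' DataFrames are not modified in place
        trainDB, testDB = trainDB.copy(), testDB.copy()
        # fit the scaler on the training set only, then apply it to both sets
        scaler = StandardScaler().fit(trainDB[features])
        trainDB[features] = scaler.transform(trainDB[features])
        testDB[features] = scaler.transform(testDB[features])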

    return trainDB, testDB
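
When standardization is enabled, the scaling statistics come from the training set only and are then reused for the test set. The sketch below reproduces that idea with plain NumPy on invented numbers. It is not the scikit-learn implementation, but it uses the same population standard deviation (ddof=0) that `StandardScaler` uses.

```python
import numpy as np

# Toy feature matrices (assumption: invented values, two features).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are estimated on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, i.e. population standard deviation

train_z = (train - mean) / std
test_z = (test - mean) / std  # the test set reuses the training statistics

print(train_z.mean(axis=0), train_z.std(axis=0))
print(test_z)
```

After scaling, the training features have zero mean and unit variance, while the test features generally do not, because they were scaled with the training statistics.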
    

## Training and testing

### Choosing a model for training

Here we use logistic regression to train our model. If grid_search is True, the regularization strength C is chosen by cross-validated grid search, which takes longer to train.
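
As a rough illustration of what the grid search option does, the sketch below tunes `C` on synthetic data. The data, the 5-fold setting, and the default L2 penalty are assumptions made for this example only. They are not the settings or data of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic two-class data (assumption: invented for illustration).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Candidate values for the inverse regularization strength C.
param_grid = {'C': [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(LogisticRegression(solver='lbfgs', max_iter=2000),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

# best_params_ holds the C with the highest mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, the `GridSearchCV` object can be used like a fitted classifier, since by default it refits the best model on the full training data.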

In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, accuracy_score
import joblib



def Logistic_Regression(penalty, grid_search=False):
    clf = LogisticRegression(solver='lbfgs', penalty=penalty, max_iter = 2000)

    if grid_search:
        tuned_parameters = [{'C': [1e-3, 1e-2, 1e-1, 1]}]
        clf = GridSearchCV(LogisticRegression(solver='lbfgs', penalty=penalty), tuned_parameters, cv=10,
                           scoring='accuracy', n_jobs=-1)
    
    
    return clf


def train(X_train, y_train, cross_val=True, save_dir=SAVEDIR, verbose=True, save_model=True):
    
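    # To add regularization, set penalty='l2'. The 'lbfgs' solver supports only 'l2' or no penalty;
    # 'l1' needs a different solver such as 'liblinear' or 'saga'.
    # Note: scikit-learn 1.2 deprecated the string 'none' in favor of penalty=None.
    model = Logistic_Regression(penalty='none')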

    # training accuracy and AUC
    tr_auc, tr_acc = None, None
    if cross_val:
        tr_auc = np.mean(cross_val_score(model, X_train, y_train, cv=10, scoring='roc_auc'))
        tr_acc = np.mean(cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy'))

    model.fit(X_train, y_train)

    if save_model:
        joblib.dump(model, save_dir)

    if verbose:
        print('model is trained')

    return model, tr_auc, tr_acc


def test(clf, X_test, y_test):
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)

    s = clf.decision_function(X_test)
    auc = roc_auc_score(y_test, s)

    return acc, auc
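
The two reported metrics can be checked by hand on a tiny example. The sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher decision score, counting ties as one half. This is a hand-rolled illustration on invented scores, not the scikit-learn implementation, and the helper name `pairwise_auc` is hypothetical.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """Hypothetical helper: AUC as the share of correctly ordered pos/neg pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # difference between every positive score and every negative score
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


# Invented labels and decision scores (assumption: illustration only).
y = [0, 0, 1, 1]
s = [-1.2, 0.3, 0.1, 2.0]

# For a binary linear model, a positive decision score means class 1.
acc = np.mean((np.array(s) > 0).astype(int) == np.array(y))
auc = pairwise_auc(y, s)
print(acc, auc)
```

Accuracy depends on the decision threshold, whereas AUC depends only on how the scores rank the examples, which is why the two are reported separately.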


### Running the identity detection model
We then run training and testing multiple times. In each iteration, the specified number of posts is randomly selected from the between-participant users, with an equal number from each forum (feminist and parent). The test set is built by randomly selecting one post per forum from each within participant.

We first need to specify some parameters: 

1) Size of the training dataset. Here we chose 50000 posts from each of the parent and feminist forums (by setting batch_size = 50000).

2) Because posts are randomly selected for both training and testing, we run the analysis multiple times. The number of runs is set by the max_iter parameter.

3) To include only posts that are longer than a specified threshold, set word_limit=True and choose the minimum word count with the min_WC parameter.

We report the AUC and accuracy of our identity detection model by averaging over the results from multiple iterations. 
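
The averaging step amounts to taking the mean and standard deviation of each metric across iterations. The sketch below shows this on placeholder numbers. The values are made up for illustration and are not results from the paper, and the helper name `summarize` is hypothetical.

```python
import numpy as np

# Placeholder per-iteration results (assumption: invented values, NOT real results).
results = [
    {'acc': 0.70, 'auc': 0.76},
    {'acc': 0.72, 'auc': 0.78},
    {'acc': 0.71, 'auc': 0.77},
]


def summarize(results, key):
    """Hypothetical helper: mean and standard deviation of one metric."""
    values = np.array([r[key] for r in results])
    return values.mean(), values.std()


acc_mean, acc_std = summarize(results, 'acc')
auc_mean, auc_std = summarize(results, 'auc')
print('acc_mean:{:.3f} acc_std:{:.3f}'.format(acc_mean, acc_std))
print('AUC_mean:{:.3f} AUC_std:{:.3f}'.format(auc_mean, auc_std))
```

The standard deviation gives a sense of how much the scores vary with the random choice of training and test posts.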


In [48]:
import argparse
import numpy as np
import pandas as pd


def run(trainingDB, testDB, features, max_iter, save_dir, verbose=True):

    tr_results= []
    tt_results = []
    for T in range(max_iter):
        print('\nround:{}'.format(T + 1))

        X_train, y_train = trainingDB[features], trainingDB['forum_id']
        X_test, y_test = testDB[features], testDB['forum_id']

        # pass save_dir by keyword; positionally it would be taken as the cross_val argument
        model, tr_auc, tr_acc = train(X_train, y_train, save_dir=save_dir)
        if verbose:
            print('training accuracy:{}'.format(tr_acc), 'training AUC :{}'.format(tr_auc))

        tt_acc, tt_auc = test(model, X_test, y_test)

        if verbose:
            print('test accuracy:{}'.format(tt_acc), 'test AUC :{}'.format(tt_auc))

        tr_results.append({'acc': tr_acc, 'auc': tr_auc})
        tt_results.append({'acc': tt_acc, 'auc': tt_auc})

    return tr_results, tt_results


def main(verbose=True):
    batch_size = 50000
    max_iter = 1
    word_limit = True
    min_WC = 25
    features = ALL_STYLISTIC_FEATURES
    standardize = False
    
    dbdir = DBDIR
    save_dir = SAVEDIR
    
    
    # choose the dataset you want to test the trained model on
    testing_on = 'mumsnet'
    

    # reading datasets from CSV files
    mumsnet_df = read_csv(dbdir + Mumsnet_DB, word_limit, min_WC)
    reddit_df = read_csv(dbdir + Reddit_DB, word_limit, min_WC)
    experimental_df = read_csv(dbdir + Experimental_DB, word_limit=False, min_WC=min_WC)
    
    if verbose:
        print('size of mumsnet dataset:{}'.format(mumsnet_df.shape[0]))
        print('size of reddit dataset:{}'.format(reddit_df.shape[0]))
        print('size of experimental dataset:{}'.format(experimental_df.shape[0]))
        
        
    trainDB = get_train_set(mumsnet_df.copy(deep=True), batch_size)
    
    
    if testing_on == 'mumsnet':
        between_set, within_set = split_between_within_sets(
            mumsnet_df.loc[~mumsnet_df.msg_id.isin(trainDB.msg_id)].copy(deep=True))
        testDB = get_test_set(within_set, batch_size)
    elif testing_on == 'reddit':
        reddit_between_set, reddit_within_set = split_between_within_sets(reddit_df)
        testDB = get_test_set(reddit_within_set, batch_size)
    elif testing_on == 'experimental':
        # choose the topic you want to test on
        # there are three topics: healthy mealtimes ('hm'), 
        # objectification of women ('ow')
        # and climate change ('cc')
        topic = 'ow'
        testDB = experimental_df.loc[experimental_df['topic'] == topic]
        testDB = testDB.rename(columns = {'condition' : 'forum_id'})
        
        
    trainDB, testDB = train_test_prepration(trainDB, testDB, features, standardize)
    


    tr_results, tt_results = run(trainDB, testDB, features, max_iter, save_dir)

    if max_iter > 1:
        print('\n Accuracy and AUC after {} iteration(s):'.format(max_iter))
        print('training_auc:', 'AUC_mean:{}'.format(np.mean([item['auc'] for item in tr_results])),
              'AUC_std:{}'.format(np.std([item['auc'] for item in tr_results])))

        print('training_acc:', 'acc_mean:{}'.format(np.mean([item['acc'] for item in tr_results])),
              'acc_std:{}'.format(np.std([item['acc'] for item in tr_results])))

        print('testing_auc:', 'AUC_mean:{}'.format(np.mean([item['auc'] for item in tt_results])),
              'AUC_std:{}'.format(np.std([item['auc'] for item in tt_results])))

        print('testing_acc:', 'acc_mean:{}'.format(np.mean([item['acc'] for item in tt_results])),
              'acc_std:{}'.format(np.std([item['acc'] for item in tt_results])))




In [None]:
main()

size of mumsnet dataset:461371
size of reddit dataset:388096
size of experimental dataset:128

train set size:100000
test set size:3954


round:1
