This is the Python implementation of the paper "Predicting a Salient Social Identity from Linguistic Style".

# Introduction

This tutorial is aimed at readers who have a basic familiarity with Python; if you have not used Python before, we recommend starting with the course "Introduction to Data Science in Python" on Coursera (https://www.coursera.org/learn/python-data-analysis).

Please see the original paper for a detailed description of the procedure. (Lines starting with "#" are comments and will not be executed by Python.)

To run the code, you first need to:

- download the dataset
- copy the dataset into a folder named data (DBDIR below expects ./data/Mumsnet_parent_feminist.csv)


## Specifying some parameters

We first need to specify some parameters: 

1) Size of the training dataset. Here we choose 50000 posts from each of the parent and feminist forums (by setting Batch_Size = 50000).

2) Because posts are randomly selected for both training and testing, we run the analysis multiple times; the number of runs is set by the MAX_Iter parameter.

3) To include only posts that meet a minimum length, set WORD_LIMIT = True and choose the minimum word count with the MIN_WC parameter.

4) DBDIR is the path to the dataset CSV file, and SAVEDIR is the file path where the trained model is saved.

5) Here we use all of the stylistic features to train our identity detection model; however, any subset of these features can be used.


In [4]:
Batch_Size = 50000
MAX_Iter = 20
WORD_LIMIT = False
MIN_WC = 25

DBDIR = './data/Mumsnet_parent_feminist.csv'
SAVEDIR = './save_dir/logr.sav'

# all LIWC stylistic features
ALL_STYLISTIC_FEATURES = ['WPS', 'i', 'we', 'you', 'shehe', 'they', 'ipron','article', 'auxverb', 'past',
                    'present', 'future', 'adverb', 'preps','conj', 'quant', 'number', 'time', 'Sixltr',
                    'Period', 'Colon', 'SemiC', 'QMark', 'Dash', 'Quote', 'Apostro', 'Parenth', 'OtherP',
                    'negate', 'swear', 'posemo','negemo', 'assent', 'nonfl', 'filler', 'Exclam', 'insight',
                    'cause', 'discrep', 'tentat', 'certain', 'inhib', 'incl', 'excl']
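
Point 5 above notes that any subset of the stylistic features can be used. The following is a minimal sketch of how such a subset might be defined. The names PRONOUN_FEATURES, PUNCTUATION_FEATURES and SELECTED_FEATURES are hypothetical and introduced only for illustration; the grouping is an assumption, not one taken from the paper.

```python
# Hypothetical feature groupings for illustration. The column names come from
# ALL_STYLISTIC_FEATURES, but the grouping itself is an assumption.
PRONOUN_FEATURES = ['i', 'we', 'you', 'shehe', 'they', 'ipron']
PUNCTUATION_FEATURES = ['Period', 'Colon', 'SemiC', 'QMark', 'Dash', 'Quote',
                        'Apostro', 'Parenth', 'OtherP', 'Exclam']

# Any list of column names can be used in place of ALL_STYLISTIC_FEATURES.
SELECTED_FEATURES = PRONOUN_FEATURES + PUNCTUATION_FEATURES
print(len(SELECTED_FEATURES))
```

To train on the subset, the list could be passed as the features argument wherever ALL_STYLISTIC_FEATURES is used.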



## Some utility functions

This section contains functions which prepare data for training and testing.


### Reading input files 
We first read the CSV file and preprocess it by removing rows that contain NaN values and, if specified, dropping short posts.



In [9]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def read_csv(path):
    # read the dataset; report the offending path and re-raise on failure
    try:
        df = pd.read_csv(path)
    except Exception:
        print('error in reading file: {}'.format(path))
        raise
    return df


def preprocessing(df, word_limit=False, min_WC=25):
    df = df.dropna()
    if word_limit:
        df = df.loc[df['WC'] >= min_WC]
    return df
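
As a rough illustration of what the two preprocessing steps do, the sketch below applies them to a small made-up table. The toy values are assumptions for demonstration only; WC is the word-count column used by preprocessing().

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'WC': [10, 30, 25, 40],
    'i': [1.2, 0.5, np.nan, 2.0],
})

# Same two steps as preprocessing(): drop rows with NaN, then apply the word limit.
cleaned = toy.dropna()
long_posts = cleaned.loc[cleaned['WC'] >= 25]

print(len(toy), len(cleaned), len(long_posts))
```

The row for user 3 is removed because of the missing value, and the row for user 1 is removed because it has fewer than 25 words.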


### Building train and test sets

To build the train and test sets, we first need to separate users into between and within participants.

Within participants are users who participate in both forums by posting at least once in each forum. Between participants are those who have posted in only one forum (feminist or parent).


In [10]:
def separating_users(df):
    fem_df = df.loc[df.forum_id == 1]
    par_df = df.loc[df.forum_id == 0]
    
    # participants who are posting in both forums
    within_p = set(fem_df.user_id.unique()).intersection(par_df.user_id.unique())
    # participants who are posting only in one forum, parent or feminist
    between_p = df[~df.user_id.isin(within_p)].user_id.unique()

    return between_p, within_p
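
The split into within and between participants can be illustrated on a toy table. This is a sketch using plain Python sets rather than the function above; the user ids are made up, and forum_id follows the same coding as the code above (1 for feminist, 0 for parent).

```python
import pandas as pd

# Toy posts; user ids are made up for illustration.
toy = pd.DataFrame({
    'user_id':  ['a', 'a', 'b', 'c', 'c', 'd'],
    'forum_id': [1, 0, 1, 0, 0, 1],
})

fem_users = set(toy.loc[toy.forum_id == 1, 'user_id'])
par_users = set(toy.loc[toy.forum_id == 0, 'user_id'])

within = fem_users & par_users               # posted in both forums
between = (fem_users | par_users) - within   # posted in exactly one forum

print(sorted(within), sorted(between))
```

Only user a has posted in both forums, so a is the single within participant and the remaining users are between participants.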

We build the test set by randomly choosing one post per forum for each within participant. If a limit on the size of the test set is given (the no argument), only that many randomly chosen within participants are kept.

In [11]:
def extract_testcases(posts_within, within_participants, no=None):
    # randomly selecting one post per forum for each within participant
    testDB = posts_within.sample(frac=1)
    testDB = testDB.drop_duplicates(subset=['user_id', 'forum_id'])

    # if there is no limit on the number of test cases, keep one post per forum
    # for each within participant; otherwise, randomly choose `no` users from
    # the within participants
    if no is not None:
        # convert to a list: np.random.choice cannot sample from a set
        testUsers = np.random.choice(list(within_participants), no, replace=False)
        testDB = testDB.loc[testDB['user_id'].isin(testUsers)]

    return testDB
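
The shuffle-then-deduplicate idiom used in extract_testcases can be seen in isolation in the sketch below. The post_id column and the toy values are hypothetical, and random_state is fixed only to make the illustration repeatable (the function above does not fix it).

```python
import pandas as pd

# Toy posts; 'post_id' is a hypothetical column added to tell rows apart.
toy = pd.DataFrame({
    'post_id':  [0, 1, 2, 3, 4, 5],
    'user_id':  ['a', 'a', 'a', 'a', 'b', 'b'],
    'forum_id': [0, 0, 1, 1, 0, 1],
})

# Shuffle the rows, then keep the first row seen for each (user, forum) pair.
shuffled = toy.sample(frac=1, random_state=0)
one_per_pair = shuffled.drop_duplicates(subset=['user_id', 'forum_id'])

print(one_per_pair.sort_values(['user_id', 'forum_id']))
```

Because the rows are shuffled first, the row kept for each user and forum is a random one of that user's posts in that forum.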

The train set is built by randomly choosing posts from between participants, and the test set is built by randomly choosing one post per forum from each within participant (two posts per within participant in total).

In [12]:
def split_train_test(df, batch, word_limit, min_WC, features, standardize=False, verbose=True):
    df = preprocessing(df, word_limit, min_WC)

    # separating users into between and within participants
    between_participants, within_participants = separating_users(df.copy())
    
    if verbose:
        print('number of within participants are:{}'.format(len(within_participants)))
        print('number of between participants are:{}'.format(len(between_participants)))


    df = df.sample(frac=1)

    # building train set by randomly selecting posts from between participants
    fem_posts_between = df[(df['user_id'].isin(between_participants)) & (df['forum_id'] == 1)][:batch]
    par_posts_between = df[(df['user_id'].isin(between_participants)) & (df['forum_id'] == 0)][:batch]
    trainDB = pd.concat([fem_posts_between, par_posts_between])

    # building test set by randomly selecting one post per forum from each within participant
    posts_within = df[df['user_id'].isin(within_participants)]
    testDB = extract_testcases(posts_within, within_participants)

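    if standardize:
        # fit the scaler on the training set only, then apply it to both sets
        scaler = StandardScaler().fit(trainDB[features])
        trainDB, testDB = trainDB.copy(), testDB.copy()
        trainDB[features] = scaler.transform(trainDB[features])
        testDB[features] = scaler.transform(testDB[features])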

    return trainDB, testDB
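
When standardize=True, the scaler is fitted on the training features and then applied to both sets. The sketch below shows that pattern on made-up numbers. The two column names are taken from the feature list, but the values and the variable names train_scaled and test_scaled are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = ['WPS', 'i']  # two of the LIWC columns; the values below are made up
train = pd.DataFrame({'WPS': [10.0, 20.0, 30.0], 'i': [1.0, 2.0, 3.0]})
test = pd.DataFrame({'WPS': [20.0, 40.0], 'i': [2.0, 5.0]})

# Fit on the training set only, so no test-set statistics are used for scaling.
scaler = StandardScaler().fit(train[features])
train_scaled = pd.DataFrame(scaler.transform(train[features]), columns=features)
test_scaled = pd.DataFrame(scaler.transform(test[features]), columns=features)

print(train_scaled)
print(test_scaled)
```

The test values are expressed relative to the training mean and standard deviation, so a test row equal to the training mean maps to zero.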

## Training and testing

### Choosing a model for training

Here we use logistic regression to train our model. If grid_search is True, the regularization parameter C is chosen by a cross-validated grid search, but training takes longer.

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score, accuracy_score
import joblib


def Logistic_Regression(grid_search=False):
    clf = LogisticRegression(solver='lbfgs', max_iter = 500)

    if grid_search:
        tuned_parameters = [{'C': [1e-3, 1e-2, 1e-1, 1]}]
        clf = GridSearchCV(LogisticRegression(solver='lbfgs'), tuned_parameters, cv=10,
                           scoring='accuracy', n_jobs=-1)

    return clf


def train(X_train, y_train, save_dir, verbose=True, save_model=True):
    model = Logistic_Regression()

    # 10-fold cross-validated AUC and accuracy on the training set
    tr_auc = np.mean(cross_val_score(model, X_train, y_train, cv=10, scoring='roc_auc'))
    tr_acc = np.mean(cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy'))

    model.fit(X_train, y_train)

    if save_model:
        joblib.dump(model, save_dir)

    if verbose:
        print('model is trained')

    return model, tr_auc, tr_acc


def test(clf, X_test, y_test):
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)

    s = clf.decision_function(X_test)
    auc = roc_auc_score(y_test, s)

    return acc, auc
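
A model saved by train() with joblib.dump can later be reloaded with joblib.load and scored in the same way as in test(). The sketch below shows that round trip on synthetic data. The data, the temporary file location and the variable names are assumptions for illustration; they do not reproduce the Mumsnet dataset or the results in the paper.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic two-class data standing in for the stylistic features (an assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(1.5, 1.0, (100, 3))])
y = np.array([0] * 100 + [1] * 100)

model = LogisticRegression(solver='lbfgs', max_iter=500).fit(X, y)

# Save and reload the model, using joblib as train() does.
path = os.path.join(tempfile.mkdtemp(), 'logr.sav')
joblib.dump(model, path)
reloaded = joblib.load(path)

# Accuracy uses hard labels; AUC uses the continuous decision scores.
acc = accuracy_score(y, reloaded.predict(X))
auc = roc_auc_score(y, reloaded.decision_function(X))
print('accuracy:{:.3f}'.format(acc), 'AUC:{:.3f}'.format(auc))
```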


### Running the identity detection model
We then run training and testing multiple times (here, 20 iterations). In each iteration, 100000 posts are randomly selected from the between participants, split equally between feminist and parent posts (50000 each). The test set is built by randomly selecting one post per forum for each within participant.

We report the AUC and accuracy of our identity detection model by averaging over the results from multiple iterations. 


In [15]:
import numpy as np
import pandas as pd


def run(df, batch_size, features, max_iter, word_limit, word_count, save_dir, verbose=True):

    tr_results = []
    tt_results = []
    for T in range(max_iter):
        print('round:{}'.format(T + 1))

        trainingDB, testDB = split_train_test(df.copy(deep=True), batch_size, word_limit, word_count, features)
        X_train, y_train = trainingDB[features], trainingDB['forum_id']
        X_test, y_test = testDB[features], testDB['forum_id']

        model, tr_auc, tr_acc = train(X_train, y_train, save_dir)
        if verbose:
            print('training accuracy:{}'.format(tr_acc), 'training AUC :{}'.format(tr_auc))

        tt_acc, tt_auc = test(model, X_test, y_test)

        if verbose:
            print('test accuracy:{}'.format(tt_acc), 'test AUC :{}'.format(tt_auc))

        tr_results.append({'acc': tr_acc, 'auc': tr_auc})
        tt_results.append({'acc': tt_acc, 'auc': tt_auc})

    return tr_results, tt_results


def main():
    batch_size = Batch_Size
    max_iter = MAX_Iter
    min_WC = MIN_WC
    word_limit = WORD_LIMIT
    features = ALL_STYLISTIC_FEATURES
    dbdir = DBDIR
    save_dir = SAVEDIR

    df = read_csv(dbdir)

    tr_results, tt_results = run(df.copy(deep=True), batch_size, features, max_iter, word_limit, min_WC, save_dir)

    print('training_auc:', 'AUC_mean:{}'.format(np.mean([item['auc'] for item in tr_results])),
          'AUC_std:{}'.format(np.std([item['auc'] for item in tr_results])))

    print('training_acc:', 'acc_mean:{}'.format(np.mean([item['acc'] for item in tr_results])),
          'acc_std:{}'.format(np.std([item['acc'] for item in tr_results])))

    print('testing_auc:', 'AUC_mean:{}'.format(np.mean([item['auc'] for item in tt_results])),
          'AUC_std:{}'.format(np.std([item['auc'] for item in tt_results])))

    print('testing_acc:', 'acc_mean:{}'.format(np.mean([item['acc'] for item in tt_results])),
          'acc_std:{}'.format(np.std([item['acc'] for item in tt_results])))




In [None]:
main()
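
The four summary print statements in main() all compute a mean and standard deviation over the per-iteration results. One possible way to factor that out is sketched below. The helper name summarize and the example_results values are hypothetical; the numbers are arbitrary placeholders, not results from the paper.

```python
import numpy as np

# Placeholder values in the same shape that run() returns. They are arbitrary
# numbers chosen for illustration, not results from the paper.
example_results = [{'acc': 0.5, 'auc': 0.5}, {'acc': 0.7, 'auc': 0.9}]


def summarize(results, key):
    # mean and population standard deviation of one metric across iterations
    values = [item[key] for item in results]
    return np.mean(values), np.std(values)


for key in ('acc', 'auc'):
    mean, std = summarize(example_results, key)
    print('{}_mean:{:.3f}'.format(key, mean), '{}_std:{:.3f}'.format(key, std))
```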