## Evaluation setup

This notebook serves as a reference, documenting additional details about the test data and the preparatory steps for our two experiments. Processed data, which can be used directly in the experiments as demonstrated in the Evaluation notebook, are also provided.

In [19]:
import pandas as pd
import numpy as np
import json
import os
import spacy
from collections import Counter
from tqdm import tqdm

from convokit import Corpus, Utterance, Speaker, download
from convokit import TextParser, PolitenessStrategies

### 1. Test Data

We consider two types of experiments, each illustrating one potential source of misalignment. 

### Experiment A

Test data for Experiment A (translated communication) is saved as `mt-test-corpus`. It consists of sampled test utterances that use any of four politeness strategies: {_Subjunctive_, _Please_, _Filler_, _Swearing_}. These strategies were often lost in English-to-Chinese translation at the time we obtained the translations.

In [20]:
mt_corpus = Corpus(filename="data/test/mt-test-corpus")

It includes the following utterance-level metadata:

    - strategy: the strategy the utterance was sampled for; one of {Subjunctive, Please, Filler, Swearing}
    - parsed: dependency parses for the utterance
    - back_translation: back-translation of the original text, i.e., the output of translating the utterance from English to Chinese and then back from Chinese to English
    
An example is shown below:

In [21]:
mt_corpus.get_utterance('226408211.416.416').meta

{'strategy': 'Please',
 'parsed': [{'rt': 3,
   'toks': [{'tok': 'can', 'tag': 'MD', 'dep': 'aux', 'up': 3, 'dn': []},
    {'tok': 'you', 'tag': 'PRP', 'dep': 'nsubj', 'up': 3, 'dn': []},
    {'tok': 'please', 'tag': 'UH', 'dep': 'intj', 'up': 3, 'dn': []},
    {'tok': 'remove', 'tag': 'VB', 'dep': 'ROOT', 'dn': [0, 1, 2, 5, 14]},
    {'tok': 'the', 'tag': 'DT', 'dep': 'det', 'up': 5, 'dn': []},
    {'tok': 'ability', 'tag': 'NN', 'dep': 'dobj', 'up': 3, 'dn': [4, 7]},
    {'tok': 'to', 'tag': 'TO', 'dep': 'aux', 'up': 7, 'dn': []},
    {'tok': 'edit', 'tag': 'VB', 'dep': 'acl', 'up': 5, 'dn': [6, 10, 11]},
    {'tok': 'helio', 'tag': 'NNP', 'dep': 'poss', 'up': 10, 'dn': [9]},
    {'tok': "'s", 'tag': 'POS', 'dep': 'case', 'up': 8, 'dn': []},
    {'tok': 'page', 'tag': 'NN', 'dep': 'dobj', 'up': 7, 'dn': [8]},
    {'tok': 'by', 'tag': 'IN', 'dep': 'prep', 'up': 7, 'dn': [13]},
    {'tok': 'anonymous', 'tag': 'JJ', 'dep': 'amod', 'up': 13, 'dn': []},
    {'tok': 'users', 'tag': 'NNS', 
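
Reading the example above, each sentence in `parsed` appears to store the index of its root token in `rt`, while each token stores its parent index in `up` and its child indices in `dn`. The following is a minimal sketch of how such a structure could be traversed. It assumes that interpretation of the fields, and it uses a small hand-written parse rather than a real corpus entry; the helper names `root_and_children` and `path_to_root` are hypothetical.

```python
# A tiny hand-written parse in the same shape as the `parsed` metadata
# (illustrative only, not taken from the corpus).
parsed = [
    {
        "rt": 1,
        "toks": [
            {"tok": "please", "tag": "UH", "dep": "intj", "up": 1, "dn": []},
            {"tok": "remove", "tag": "VB", "dep": "ROOT", "dn": [0, 2]},
            {"tok": "it", "tag": "PRP", "dep": "dobj", "up": 1, "dn": []},
        ],
    }
]


def root_and_children(sentence):
    """Return the root token and the tokens listed in its `dn` field."""
    toks = sentence["toks"]
    root = toks[sentence["rt"]]
    children = [toks[i]["tok"] for i in root["dn"]]
    return root["tok"], children


def path_to_root(sentence, index):
    """Follow `up` links from a token until the root (which has no `up`)."""
    toks = sentence["toks"]
    path = [toks[index]["tok"]]
    while "up" in toks[index]:
        index = toks[index]["up"]
        path.append(toks[index]["tok"])
    return path


root, children = root_and_children(parsed[0])
print(root, children)
print(path_to_root(parsed[0], 2))
```

The same two helpers should work on a real entry by passing one element of `utt.meta['parsed']`, provided the field interpretation above holds.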

We can verify that the four strategies identified as high-risk do have relatively low retention rates by detecting strategies in the back-translations and computing, for each strategy, the percentage of sampled utterances in which the strategy is preserved.

Note: while we aim to situate our experiments in realistic settings, our focus is not to rigorously analyze how best to estimate which politeness signals are lost. We therefore take the reasonable approach of using back-translations as proxies for what might be kept or lost after translation when setting up the experiments, and leave a more robust and rigorous assessment of how well MT systems maintain politeness signals to future work.

In [22]:
ps = PolitenessStrategies(strategy_attribute_name="strategies",
                          marker_attribute_name="markers",
                          strategy_collection="politeness_local")

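# The 'en' shortcut requires spaCy 2.x; on spaCy 3+, load a full model
# name such as 'en_core_web_sm' instead.
spacy_nlp = spacy.load('en', disable=['ner'])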

In [23]:
# get politeness strategies for the back-translations
translated_strategies = {utt.id: ps.transform_utterance(utt.meta['back_translation'],
                         spacy_nlp=spacy_nlp, markers=True) for utt in mt_corpus.iter_utterances()}

In [24]:
cnts = {'Subjunctive': 0, "Please": 0, 'Filler': 0, "Swearing": 0}

for utt in mt_corpus.iter_utterances():
    
    k = utt.meta['strategy']
    cnts[k] += translated_strategies[utt.id].meta['strategies'][k]

print("===Strategy Preservation Rate===")
for k, v in cnts.items():
    
    # each strategy has 200 sampled utterances, so count / 2 is a percentage
    print("{}: {:.1f}% ".format(k.upper(), v/2))

===Strategy Preservation Rate===
SUBJUNCTIVE: 20.0% 
PLEASE: 20.0% 
FILLER: 12.5% 
SWEARING: 37.0% 
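
The cell above divides by 2 because each strategy has 200 sampled utterances. As a minimal sketch of the same calculation without a hard-coded denominator, the per-strategy totals can be counted from the data itself. The `samples` list below is made-up toy data, and it assumes the strategy feature is a 0/1 indicator of whether the strategy is still detected in the back-translation.

```python
from collections import Counter

# Hypothetical (strategy, preserved) pairs standing in for the corpus:
# `preserved` is 1 if the sampled strategy is still detected in the
# back-translation, else 0.
samples = [
    ("Please", 1), ("Please", 0), ("Please", 0), ("Please", 0),
    ("Swearing", 1), ("Swearing", 1), ("Swearing", 0), ("Swearing", 0),
]

totals = Counter()  # number of sampled utterances per strategy
kept = Counter()    # number of those where the strategy survived

for strategy, preserved in samples:
    totals[strategy] += 1
    kept[strategy] += preserved

# Preservation rate as a percentage of each strategy's own sample size.
rates = {s: 100.0 * kept[s] / totals[s] for s in totals}

for s, rate in rates.items():
    print("{}: {:.1f}%".format(s.upper(), rate))
```

Computing the denominator this way keeps the rates correct if the sample sizes ever differ across strategies.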


### Experiment B

Test data for Experiment B (misaligned perceptions) is saved as `ind-test-corpus`. It contains sampled test utterances that are predicted to result in larger differences in perceived politeness between pairs of annotators.

In [25]:
ind_corpus = Corpus(filename="data/test/ind-test-corpus/")

It includes the following utterance-level metadata:

    - sender: turker acting as the sender in this simulated setting
    - receiver: turker acting as the receiver in this simulated setting
    - original_id: id of the utterance in WikiConv
    - parsed: parses for the utterance 
        
See the following example:

In [26]:
ind_corpus.get_utterance("A23-A3S-1").meta

{'sender': 'A23',
 'receiver': 'A3S',
 'original_id': '398481738.1386.1371',
 'parsed': [{'rt': 1,
   'toks': [{'tok': 'I', 'tag': 'PRP', 'dep': 'nsubj', 'up': 1, 'dn': []},
    {'tok': 'added', 'tag': 'VBD', 'dep': 'ROOT', 'dn': [0, 3, 9]},
    {'tok': 'one', 'tag': 'CD', 'dep': 'nummod', 'up': 3, 'dn': []},
    {'tok': 'sentence', 'tag': 'NN', 'dep': 'dobj', 'up': 1, 'dn': [2, 4]},
    {'tok': 'about', 'tag': 'IN', 'dep': 'prep', 'up': 3, 'dn': [6]},
    {'tok': 'her', 'tag': 'PRP$', 'dep': 'poss', 'up': 6, 'dn': []},
    {'tok': 'construction',
     'tag': 'NN',
     'dep': 'pobj',
     'up': 4,
     'dn': [5, 7, 8]},
    {'tok': 'and', 'tag': 'CC', 'dep': 'cc', 'up': 6, 'dn': []},
    {'tok': 'design', 'tag': 'NN', 'dep': 'conj', 'up': 6, 'dn': []},
    {'tok': '.', 'tag': '.', 'dep': 'punct', 'up': 1, 'dn': []}]},
  {'rt': 4,
   'toks': [{'tok': 'What', 'tag': 'WP', 'dep': 'dobj', 'up': 4, 'dn': [1]},
    {'tok': 'else', 'tag': 'RB', 'dep': 'advmod', 'up': 0, 'dn': []},
    {'tok'

We consider 20 different communication settings, each specified by a _(sender, receiver)_ pair, with 100 examples per pair.

In [27]:
pairs = [(utt.meta['sender'], utt.meta['receiver']) for utt in ind_corpus.iter_utterances()]

In [28]:
Counter(pairs)

Counter({('A23', 'A2U'): 100,
         ('A23', 'A1F'): 100,
         ('A23', 'A3S'): 100,
         ('A23', 'AYG'): 100,
         ('A2U', 'A23'): 100,
         ('A2U', 'A1F'): 100,
         ('A2U', 'A3S'): 100,
         ('A2U', 'AYG'): 100,
         ('A1F', 'A23'): 100,
         ('A1F', 'A2U'): 100,
         ('A1F', 'A3S'): 100,
         ('A1F', 'AYG'): 100,
         ('A3S', 'A23'): 100,
         ('A3S', 'A2U'): 100,
         ('A3S', 'A1F'): 100,
         ('A3S', 'AYG'): 100,
         ('AYG', 'A23'): 100,
         ('AYG', 'A2U'): 100,
         ('AYG', 'A1F'): 100,
         ('AYG', 'A3S'): 100})
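
The count of 20 settings follows from taking every ordered pair of distinct annotators among the five: 5 × 4 = 20. A short sketch that enumerates the expected pairs, using the annotator prefixes from the output above, could serve as a sanity check:

```python
from itertools import permutations

# Annotator prefixes as they appear in the Counter output above.
annotators = ["A23", "A2U", "A1F", "A3S", "AYG"]

# Ordered pairs: (sender, receiver) differs from (receiver, sender),
# and an annotator is never paired with themselves.
expected_pairs = list(permutations(annotators, 2))

print(len(expected_pairs))  # 20
```

Comparing `set(expected_pairs)` with the keys of `Counter(pairs)` would confirm that no setting is missing from the corpus.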

### 2. Perception models 

The sender's and receiver's perception models are part of the circumstances we aim to incorporate when setting up the experiments. To model how an individual may perceive politeness, we use linear regression models as approximations.

For Experiment A, we consider an _average person_ by averaging the annotation scores from the Wikipedia portion of the Stanford Politeness Corpus. For Experiment B, we learn perception models for the 5 most prolific annotators of the Stanford Politeness Corpus.

In [29]:
from collections import defaultdict
from perception_utils import scale_func, get_strategies_df, get_model_info, get_ind_model_info

### 2.1 Average-person model

In [30]:
wiki_politeness = Corpus(download("wikipedia-politeness-corpus"))

wiki_politeness = ps.transform(wiki_politeness, markers = False)

Dataset already exists at /kitchen/convokit_corpora_lf/wikipedia-politeness-corpus


In [31]:
# for interpretability, we re-scale the annotations to -3 to 3 range 
for utt in wiki_politeness.iter_utterances():
    
    utt.meta['avg_score'] = scale_func(np.mean(list(utt.meta['Annotations'].values())))
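
The implementation of `scale_func` lives in `perception_utils` and is not shown here. As a rough illustration of what a rescaling to the -3 to 3 range could look like, the sketch below uses a plain linear map. Both the linear form and the raw score range of 1 to 25 are assumptions made for this example, and the function name `rescale` and the annotator keys are hypothetical.

```python
def rescale(score, old_min=1.0, old_max=25.0, new_min=-3.0, new_max=3.0):
    """Linearly map `score` from [old_min, old_max] to [new_min, new_max].

    The default input range (1 to 25) is an assumption for illustration;
    it is not read from `perception_utils.scale_func`.
    """
    fraction = (score - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


# Hypothetical raw annotations for one utterance, keyed by annotator.
annotations = {"annotator_a": 13, "annotator_b": 19, "annotator_c": 16}

# Average first, then rescale, mirroring the order used in the cell above.
mean_score = sum(annotations.values()) / len(annotations)
print(rescale(mean_score))
```

Under these assumptions the midpoint of the raw range maps to 0, so positive values read as more polite than neutral and negative values as less polite.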

In [32]:
df_avg = get_strategies_df(wiki_politeness, 'strategies')
scores = [wiki_politeness.get_utterance(idx).meta['avg_score'] for idx in df_avg.index]

avg_model = get_model_info(df_avg, scores)

The coefficients are saved in `/perception_models/average.json`. We can get a feel for the strength of each strategy by looking at the coefficients:

In [33]:
avg_model

{'coefs': {'Actually': -0.35832310780678395,
  'Adverb.Just': -0.003919148991569088,
  'Affirmation': 0.17106202210309893,
  'Apology': 0.4294026792755307,
  'By.The.Way': 0.33073420981106394,
  'Conj.Start': -0.2449076077632873,
  'Filler': -0.24450549480703548,
  'For.Me': 0.12754640242716142,
  'For.You': 0.19676631448406204,
  'Gratitude': 0.9891183440757589,
  'Greeting': 0.49055689805230834,
  'Hedges': 0.1310668053348257,
  'Indicative': 0.22110648465666793,
  'Please': 0.22987732522609228,
  'Please.Start': -0.2088696526965267,
  'Reassurance': 0.6682643217184243,
  'Subjunctive': 0.4540683006967976,
  'Swearing': -1.2995687206430528},
 'intercept': 0.10544430163739996}
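
Because the perception model is a linear regression, a predicted politeness score is the intercept plus the coefficients of the strategies used in an utterance. The sketch below illustrates this with three coefficients copied from the output above. It assumes each strategy feature is a 0/1 indicator, and the helper `predict_politeness` is hypothetical rather than part of `perception_utils`.

```python
# A few coefficients copied from the average-person model printed above;
# the remaining strategies are omitted to keep the example short.
model = {
    "coefs": {
        "Gratitude": 0.9891183440757589,
        "Please": 0.22987732522609228,
        "Swearing": -1.2995687206430528,
    },
    "intercept": 0.10544430163739996,
}


def predict_politeness(model, strategies):
    """Intercept plus coefficient * feature value for each given strategy.

    `strategies` maps strategy names to feature values (assumed 0 or 1).
    Strategies missing from the model contribute nothing.
    """
    score = model["intercept"]
    for name, value in strategies.items():
        score += model["coefs"].get(name, 0.0) * value
    return score


polite = predict_politeness(model, {"Please": 1, "Gratitude": 1})
rude = predict_politeness(model, {"Swearing": 1})
print(round(polite, 3), round(rude, 3))
```

With these coefficients, an utterance using _Please_ and _Gratitude_ scores well above the intercept, while one using _Swearing_ scores well below it.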

### 2.2 Individualized models for prolific annotators

We consider the 5 most prolific annotators from the Wikipedia portion of the Stanford Politeness Corpus. They will be referred to by the first 3 characters of their id. Their annotations are saved as a ConvoKit corpus, with the following utterance-level metadata:

- score: scaled score provided by the 'speaker' (i.e., annotator) for the given utterance
- original id: id of the utterance as it appears in the original dataset
- parsed: parses for the utterance text

In [34]:
turker_corpus = Corpus(filename="data/perceptions/turker-corpus/")

Each of them has provided over 500 annotations, with the most prolific annotator having annotated 2,063 utterances:

In [35]:
Counter([utt.speaker.id for utt in turker_corpus.iter_utterances()])

Counter({'A2U': 1430, 'A23': 2063, 'AYG': 715, 'A3S': 832, 'A1F': 962})

We can then extract politeness markers and train individualized perception models:

In [36]:
turker_corpus = ps.transform(turker_corpus, markers=True)

In [37]:
turkers = ['A23', 'A2U', 'A1F', 'A3S', 'AYG']
turker_dfs = defaultdict()

for turker in turkers:
    
    corpus = turker_corpus.filter_utterances_by(selector=lambda utt:utt.speaker.id == turker)
    
    df = get_strategies_df(corpus, "strategies")
    
    df['score'] = [corpus.get_utterance(idx).meta['score'] for idx in df.index]
    
    turker_dfs[turker] = df

We can then obtain individual perception models as follows. The resulting models are saved in `/perception_models/<turker_prefix>.json`, and detailed model information is also printed below:

In [38]:
avg_coefs = avg_model['coefs']

for t, df in turker_dfs.items():
            
    df_feat = df.iloc[:, 0:-1]
    scores = dict(df.iloc[:, -1])

    ind_model = get_ind_model_info(df_feat, scores, avg_coefs)     
  
    print(t)
    print(ind_model)
    print('======')

A23
{'coefs': {'Actually': -0.39958752122564184, 'Adverb.Just': -0.09733664110682594, 'Affirmation': -0.10983282152085155, 'Apology': -0.1980578971909959, 'By.The.Way': 0.36428388231865705, 'Conj.Start': -0.22488431102767364, 'Gratitude': 0.6285864971752283, 'Greeting': 0.4557020205871193, 'Hedges': 0.18643497711609897, 'Indicative': 0.7326680955961387, 'Please': 0.3962552437581173, 'Please.Start': -0.20488489664695417, 'Subjunctive': 1.1019655188010744, 'Filler': -0.24450549480703548, 'For.Me': 0.12754640242716142, 'Reassurance': 0.6682643217184243, 'Swearing': -1.2995687206430528, 'For.You': 0.19676631448406204}, 'intercept': -0.07388039486261758}
A2U
{'coefs': {'Actually': -0.2832263741915894, 'Adverb.Just': 0.06926703482339108, 'Affirmation': -0.08968564618061908, 'Apology': 0.36252514772196825, 'By.The.Way': 0.3271172242009981, 'Conj.Start': -0.12103532461519026, 'Gratitude': 0.9641442769677369, 'Greeting': 0.34658127900946856, 'Hedges': 0.19394608324520143, 'Indicative': 0.163294