## Evaluation 

This notebook provides details on the two experiments described in Section 5 of the paper, each illustrating one potential source of misalignment. More specifically, we simulate two realistic scenarios: 

- Experiment A (Translated communication): we consider the case of communication over an imperfect MT system 
- Experiment B (Misaligned perception): we consider cases when the sender and the listener have different perceptions 

In [1]:
import json
import spacy
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import defaultdict

from perception_utils import estimate_perception

from convokit import Corpus, Utterance, Speaker
from convokit import TextParser, PolitenessStrategies

from bleu import get_bleu

### 1. Preliminary 

We first load the two test corpuses, `mt-test-corpus` and `ind-test-corpus`, as well as the politeness perception models that we will use to estimate the perceived level of politeness. More details about them can be found in [Evaluation_Data.ipynb](Evaluation_Data.ipynb).

In [2]:
mt_corpus = Corpus(filename="data/test/mt-test-corpus")
ind_corpus = Corpus(filename="data/test/ind-test-corpus/")

We consider the following "individuals", where 'average' represents a prototypical "average person" and the rest are prolific annotators in the Stanford Politeness Corpus. 

In [249]:
individuals = ['average', 'A23', 'A2U', 'A1F', 'A3S', 'AYG']

perception_models = defaultdict()

for ind in individuals:
    with open("data/perceptions/{}.json".format(ind), 'r') as f:
        perception_models[ind] = json.load(f)

We consider the following strategies as at-risk for Experiment A:

In [250]:
at_risk_strategies = {'Subjunctive', "Please", 'Filler', "Swearing"}

### 2. Default perception gaps

We first compute the default level of misalignment in perception, i.e., the difference between the (predicted) politeness level perceived by the _sender_ and the (predicted) politeness level perceived by the _receiver_, when no intervention is in place. 

In [251]:
ps = PolitenessStrategies(strategy_attribute_name = "strategies", \
                          marker_attribute_name = "markers", \
                          strategy_collection="politeness_local")

spacy_nlp = spacy.load('en', disable=['ner'])

We first identify politeness strategies used in these utterances:

In [252]:
# get politeness strategies for the test utterances 
mt_corpus = ps.transform(mt_corpus, markers=True)

# get politeness strategies for the back translations 
translated_strategies = {utt.id: ps.transform_utterance(utt.meta['back_translation'], \
                                                        spacy_nlp=spacy_nlp, markers=True) for utt in mt_corpus.iter_utterances()}

ind_corpus = ps.transform(ind_corpus, markers=True)

# for readability, we save the set of strategies present as separate metadata 
for corpus in [mt_corpus, ind_corpus]:
    
    for utt in corpus.iter_utterances():
        
        utt.meta['strategy_set'] = {k for k,v in utt.meta['strategies'].items() if v ==1}

We can then compute the intended and perceived levels of politeness with the perception models used by the sender and the receiver. In the case of translated communication, we take the back-translation as an approximation for the "no intervention" scenario. 

In [253]:
# We use the "average person" model as both the sender and the receiver
average_model = perception_models['average']

for utt in mt_corpus.iter_utterances():
    
    strategies_used = utt.meta['strategy_set']
    strategies_preserved = {k for k, v in translated_strategies[utt.id].meta['strategies'].items() if v == 1}
    
    utt.meta['intended_politeness'] = estimate_perception(average_model, strategies_used)
    utt.meta['perceived_politeness'] = estimate_perception(average_model, strategies_preserved)

In [254]:
for utt in ind_corpus.iter_utterances():
    
    strategies_used = utt.meta['strategy_set']
    
    sender_model = perception_models[utt.meta['sender']]
    receiver_model = perception_models[utt.meta['receiver']]
    
    utt.meta['intended_politeness'] = estimate_perception(sender_model, strategies_used)
    utt.meta['perceived_politeness'] = estimate_perception(receiver_model, strategies_used)

By comparing these politeness perception estimates, we obtain the default perception gaps. Specifically, we compute the mean absolute difference/error (MAE) between the (estimated) intended and perceived politeness, which we aim to reduce with our paraphrasing approach. 
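
As a toy illustration of this metric, the sketch below assumes that a perception model is a mapping from strategy names to additive weights, so that the perceived politeness of an utterance is the sum of the weights of the strategies it contains. This is only an assumption about how `estimate_perception` behaves; `toy_model`, `toy_perception` and all the numbers are made up for the example.

```python
# Hypothetical additive perception model: strategy name -> weight (made-up values).
toy_model = {"Please": 0.5, "Gratitude": 0.9, "Swearing": -1.2}


def toy_perception(model, strategies):
    """Sum the weights of the strategies present (assumed behaviour)."""
    return sum(model.get(s, 0.0) for s in strategies)


def mean_absolute_error(intended, perceived):
    """Average absolute gap between paired intended and perceived scores."""
    gaps = [abs(i - p) for i, p in zip(intended, perceived)]
    return sum(gaps) / len(gaps)


# Each pair: (strategies the sender used, strategies that survive transmission).
pairs = [
    ({"Please", "Gratitude"}, {"Gratitude"}),  # "Please" is lost on the way
    ({"Gratitude"}, {"Gratitude"}),            # nothing is lost
]

intended = [toy_perception(toy_model, used) for used, _ in pairs]
perceived = [toy_perception(toy_model, kept) for _, kept in pairs]
toy_mae = mean_absolute_error(intended, perceived)
print("toy MAE: {:.2f}".format(toy_mae))
```

Only the first pair contributes a gap (the weight of the lost "Please"), so the average over the two utterances is half of that weight.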

In [255]:
print("MAE: No intervention")

print("\t Translated communication: {:.2f}".format(np.mean([abs(utt.meta['intended_politeness'] - utt.meta['perceived_politeness']) \
                                                            for utt in mt_corpus.iter_utterances()])))
      
print("\t Misaligned perception: {:.2f}".format(np.mean([abs(utt.meta['intended_politeness'] - utt.meta['perceived_politeness']) \
                                                         for utt in ind_corpus.iter_utterances()])))

MAE: No intervention
	 Translated communication: 0.43
	 Misaligned perception: 1.01


### 3. Effect of planning 

We propose to plan for an optimal strategy combination under each circumstance using Integer Linear Programming (ILP). This ILP-based planning is compared with two baselines. 

### Planning with ILP
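
To give an intuition for the planning objective, the sketch below replaces the ILP solver with an exhaustive search over subsets of safe strategies. It is not the implementation behind `get_ilp_solution`: the additive perception models, the strategy weights and the tie-breaking rule are all assumptions made for the example, and brute force is only feasible for a handful of strategies.

```python
from itertools import combinations

# Hypothetical additive perception models (made-up weights), strategy -> weight.
sender_model = {"Please": 0.5, "Gratitude": 0.9, "Hedges": 0.3, "Swearing": -1.2}
receiver_model = {"Please": 0.2, "Gratitude": 0.6, "Hedges": 0.4, "Swearing": -1.0}


def perception(model, strategies):
    """Assumed additive model: sum the weights of the strategies present."""
    return sum(model.get(s, 0.0) for s in strategies)


def brute_force_plan(used, sender, receiver, at_risk):
    """Enumerate every subset of safe strategies and keep the one whose
    receiver-side perception is closest to the sender's intended level.
    Ties are broken in favour of plans that change fewer strategies."""
    intended = perception(sender, used)
    safe = sorted(set(receiver) - set(at_risk))
    best_plan, best_key = set(), None
    for size in range(len(safe) + 1):
        for combo in combinations(safe, size):
            plan = set(combo)
            gap = abs(perception(receiver, plan) - intended)
            key = (round(gap, 9), len(plan ^ set(used)))
            if best_key is None or key < best_key:
                best_plan, best_key = plan, key
    return best_plan


plan = brute_force_plan({"Please", "Gratitude"}, sender_model, receiver_model,
                        at_risk={"Please"})
print(sorted(plan))
```

In this toy setting the at-risk "Please" cannot be used, so the search compensates by adding another positive strategy on top of "Gratitude". An ILP formulation expresses the same kind of choice with one binary variable per strategy, which scales to the full strategy inventory.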

In [256]:
import plan_with_ilp

In [257]:
from plan_with_ilp import get_ilp_solution

In [258]:
# For Experiment A, we consider average perception models, and account for the set of at-risk strategies
mt_ilp_solutions = {utt.id: get_ilp_solution(utt.id, utt.meta['strategy_set'], \
                                          average_model, average_model, \
                                          at_risk_strategies) for utt in tqdm(mt_corpus.iter_utterances())}

800it [01:40,  7.95it/s]


In [14]:
# For Experiment B, we consider individualized perception models, and all strategies are considered safe
ind_ilp_solutions = {utt.id: get_ilp_solution(utt.id, utt.meta['strategy_set'], \
                                          perception_models[utt.meta['sender']], \
                                          perception_models[utt.meta['receiver']], set()) for utt in tqdm(ind_corpus.iter_utterances())}

2000it [01:16, 26.03it/s]


### Planning baselines 

We consider two baseline approaches in planning and compare them with our proposed ILP-based approach. We compare the effects of these different planning approaches by the extent to which they reduce the MAE from the No Intervention case. 

#### Greedy

The greedy approach substitutes each strategy with the one that is safe to transmit and has the closest strategy strength in the eyes of the receiver (a strategy that is itself safe and does not result in a different interpretation is left as is). 
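
One possible reading of this description is sketched below. It is not the code of `get_greedy_plan`: the function name `greedy_plan`, the additive weights and the rule "keep a strategy only when it is safe and both parties weight it equally" are assumptions made for illustration.

```python
def greedy_plan(used, sender, receiver, at_risk):
    """For each strategy used, keep it if it is safe and both parties weight it
    equally; otherwise substitute the safe strategy whose receiver-side weight
    is closest to the sender-side weight of the original."""
    safe = sorted(s for s in receiver if s not in at_risk)
    plan = set()
    for strategy in sorted(used):
        target = sender[strategy]
        if strategy not in at_risk and receiver.get(strategy) == target:
            plan.add(strategy)
            continue
        if safe:
            plan.add(min(safe, key=lambda s: abs(receiver[s] - target)))
    return plan


# Made-up weights; sender and receiver share one model, as in Experiment A.
shared_model = {"Please": 0.5, "Gratitude": 0.9, "Hedges": 0.4, "Swearing": -1.2}

greedy_result = greedy_plan({"Please", "Gratitude"}, shared_model, shared_model,
                            at_risk={"Please"})
print(sorted(greedy_result))
```

Because each strategy is handled on its own, the substitution can drift away from the intended overall level when several strategies are replaced at once, which is the kind of error joint planning is meant to avoid.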

In [259]:
from plan_with_baselines import get_greedy_plan

In [260]:
mt_greedy_solutions = {utt.id: get_greedy_plan(utt.meta['strategy_set'], average_model, average_model, \
                                              at_risk_strategies) for utt in mt_corpus.iter_utterances()}

ind_greedy_solutions = {utt.id: get_greedy_plan(utt.meta['strategy_set'], \
                                                perception_models[utt.meta['sender']], \
                                                perception_models[utt.meta['receiver']], set()) for utt in ind_corpus.iter_utterances()}

#### Retrieval

The retrieval approach takes a coarser-grained view: among utterances in the training corpus that exhibit the same polarity as the input utterance, it finds the one most similar in content, uses it as the reference utterance, and adopts the strategy plan used in the reference as the retrieval solution.
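
The idea can be sketched with plain bag-of-words cosine similarity standing in for the tf-idf representation used in this notebook. The helper names, the toy reference utterances and the way at-risk strategies are dropped from the retrieved plan are assumptions for the example, not the behaviour of `get_retrieval_plan`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_plan(content, polarity, references, at_risk):
    """Return the strategy plan of the most similar same-polarity reference,
    with at-risk strategies dropped (an assumption about how they are handled)."""
    query = Counter(content.lower().split())
    candidates = [r for r in references if r["polarity"] == polarity]
    if not candidates:
        return set()
    best = max(candidates,
               key=lambda r: cosine(query, Counter(r["content"].lower().split())))
    return set(best["strategies"]) - set(at_risk)


# Toy reference corpus (made up): marker-removed content, polarity, strategies.
references = [
    {"content": "could you review the patch", "polarity": 1,
     "strategies": {"Subjunctive", "Gratitude"}},
    {"content": "review the patch now", "polarity": 0,
     "strategies": {"Swearing"}},
    {"content": "what time is the meeting", "polarity": 1,
     "strategies": {"Please"}},
]

retrieved = retrieve_plan("can you review my patch", 1, references,
                          at_risk={"Subjunctive"})
print(sorted(retrieved))
```

The impolite reference is filtered out by polarity even though it shares words with the query, and the plan is copied from the closest remaining utterance without considering the receiver's perception model.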

In [261]:
from plan_with_baselines import init_transformer, add_similarity_scores, get_retrieval_plan
from strategy_manipulation import remove_strategies_from_utt

In [262]:
# We only search among reference utterances that have the same polarity as the input
# Thus we need information on the polarity of the texts 

def add_estimated_polarity(utt, perception_model, polarity_attribute_name):
    
    # 1 = polite, 0 = impolite
    polarity = int(estimate_perception(perception_model, utt.meta['strategy_set']) >= 0)
    
    utt.meta[polarity_attribute_name] = polarity
    
    return polarity

In [263]:
# loading the training (reference) corpus
train_corpus = Corpus(filename="data/train/training-corpus")

# obtain politeness annotations 
train_corpus = ps.transform(train_corpus, markers=True)

for utt in train_corpus.iter_utterances():
    utt.meta['strategy_set'] = {k for k,v in utt.meta['strategies'].items() if v ==1}

In [264]:
# preparing corpuses: adding polarity, and getting marker-removed contents

# for the training corpus and experiment A, we use the average person model to judge polarity  
for corpus in [train_corpus, mt_corpus]:
    
    for utt in corpus.iter_utterances():
    
        remove_strategies_from_utt(utt, utt.meta['strategy_set'], \
                                   removed_attribute_name="content")

        add_estimated_polarity(utt, average_model, "polarity")
    
# for experiment B, we judge polarity from the sender's perspective 
for utt in ind_corpus.iter_utterances():

    remove_strategies_from_utt(utt, utt.meta['strategy_set'], \
                               removed_attribute_name="content")
    
    add_estimated_polarity(utt, perception_models[utt.meta['sender']], "polarity")    

In [265]:
# we compare contents of utterances with their tfidf representation  
tfidf = init_transformer(train_corpus)

# for each utterance in test corpus, we obtain the cosine similarity scores with each utterance in the training corpus
mt_corpus = add_similarity_scores(mt_corpus, train_corpus, tfidf)
ind_corpus = add_similarity_scores(ind_corpus, train_corpus, tfidf)



In [266]:
train_ids = train_corpus.get_vector_matrix('tfidf').ids

In [267]:
mt_retrieval_solutions = {utt.id: get_retrieval_plan(utt, train_corpus, train_ids, \
                                                     at_risk_strategies) for utt in mt_corpus.iter_utterances()}

ind_retrieval_solutions = {utt.id: get_retrieval_plan(utt, train_corpus, train_ids, \
                                                      set()) for utt in ind_corpus.iter_utterances()}

#### Expected perception gap with new plans

We then add the different plans to each corpus and compare their aggregate effects. 

In [268]:
def add_plan_to_corpus(corpus, solutions, solution_attribute_name):
    
    for idx, sol in solutions.items():
        
        corpus.get_utterance(idx).meta[solution_attribute_name] = sol

In [269]:
# Experiment A
mt_plans = {'ilp_plan': mt_ilp_solutions, 'greedy_plan': mt_greedy_solutions, 'retrieval_plan': mt_retrieval_solutions}

for attribute_name, plan in mt_plans.items():
    
    add_plan_to_corpus(mt_corpus, plan, attribute_name)
    
# Experiment B
ind_plans = {'ilp_plan': ind_ilp_solutions, 'greedy_plan': ind_greedy_solutions, 'retrieval_plan': ind_retrieval_solutions}

for attribute_name, plan in ind_plans.items():
    
    add_plan_to_corpus(ind_corpus, plan, attribute_name)

In [28]:
print("Experiment A")
print("------------")

print("MAE_plan")
for plan in ['retrieval_plan', 'greedy_plan', 'ilp_plan']:
    
    print('\t{}: {:.2f}'.format(plan.upper(), np.mean([abs(estimate_perception(average_model, utt.meta[plan]) - \
             utt.meta['intended_politeness']) for utt in mt_corpus.iter_utterances()])))
    
print("#-ADDED")
for plan in ['retrieval_plan', 'greedy_plan', 'ilp_plan']:
    
    print('\t{}: {:.2f}'.format(plan.upper(), \
                                np.mean([len(utt.meta[plan] - utt.meta['strategy_set']) for utt in mt_corpus.iter_utterances()])))

Experiment A
------------
MAE_plan
	RETRIEVAL_PLAN: 0.66
	GREEDY_PLAN: 0.35
	ILP_PLAN: 0.14
#-ADDED
	RETRIEVAL_PLAN: 1.09
	GREEDY_PLAN: 1.20
	ILP_PLAN: 2.38


In [29]:
print("Experiment B")
print("------------")

print("MAE_plan")
for plan in ['retrieval_plan', 'greedy_plan', 'ilp_plan']:
    print('\t{}: {:.2f}'.format(plan.upper(), np.mean([abs(estimate_perception(perception_models[utt.meta['receiver']], \
                            utt.meta[plan]) - utt.meta['intended_politeness']) for utt in ind_corpus.iter_utterances()])))
    
print("#-ADDED")
for plan in ['retrieval_plan', 'greedy_plan', 'ilp_plan']:
    print('\t{}: {:.2f}'.format(plan.upper(), \
                                np.mean([len(utt.meta[plan] - utt.meta['strategy_set']) for utt in ind_corpus.iter_utterances()])))

Experiment B
------------
MAE_plan
	RETRIEVAL_PLAN: 0.81
	GREEDY_PLAN: 0.48
	ILP_PLAN: 0.03
#-ADDED
	RETRIEVAL_PLAN: 1.07
	GREEDY_PLAN: 1.82
	ILP_PLAN: 2.30


### 4. Generating paraphrases 

To generate paraphrases based on the strategy plans, we identify the strategies that need to be incorporated, and remove existing strategies that are not part of the plan, forming the _post-deletion context_. 

For reference, the generation outputs are saved in `output/mt.tsv` and `output/ind.tsv`, which can be used directly in the next section.
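
The bookkeeping behind the post-deletion context comes down to two set differences between the strategies already present and the planned strategies. The snippet below shows it on made-up strategy sets; the strategy names are illustrative only.

```python
# Toy strategy sets (made up for illustration).
strategy_set = {"Please", "Gratitude", "Swearing"}  # found in the original utterance
plan = {"Gratitude", "Hedges"}                       # called for by the plan

strategies_to_remove = strategy_set - plan  # present but not planned
strategies_to_add = plan - strategy_set     # planned but not yet present
strategies_kept = strategy_set & plan       # left untouched in the context

print(sorted(strategies_to_remove), sorted(strategies_to_add), sorted(strategies_kept))
```

Markers of the strategies in `strategies_to_remove` are deleted to form the context, and the strategies in `strategies_to_add` are what the generation step has to realize on top of it.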

In [38]:
from strategy_manipulation import add_strategies_to_utt

In [41]:
# requires strategy marker annotations 
# and also an already computed plan, stored under "<plan_prefix>_plan"

def prepare_corpus_for_generation(corpus, plan_prefix):
    
    for utt in corpus.iter_utterances():
        
        plan = utt.meta["{}_plan".format(plan_prefix)]
        
        # strategies to remove
        strategies_to_remove = utt.meta['strategy_set'] - plan
        remove_strategies_from_utt(utt, strategies_to_remove, \
                                   removed_attribute_name="{}_context".format(plan_prefix))
        
        # strategies to be added 
        utt.meta['{}_addition'.format(plan_prefix)] = plan - utt.meta['strategy_set'] 

In [42]:
for mode in ['ilp', 'greedy', 'retrieval']:
    
    prepare_corpus_for_generation(mt_corpus, mode)
    prepare_corpus_for_generation(ind_corpus, mode)

In [43]:
# we can save the processed corpuses for future reference 
# mt_corpus.dump('mt-processed-corpus', \
#                exclude_vectors=['tfidf', 'similarity'])

# ind_corpus.dump('ind-processed-corpus', \
#                 exclude_vectors=['tfidf', 'similarity'])

For each test set, we generate three sets of paraphrases, one per plan:

In [None]:
for prefix in ['ilp', 'greedy', 'retrieval']: 

    for utt in tqdm(mt_corpus.iter_utterances()):
        
        add_strategies_to_utt(utt, utt.meta['{}_addition'.format(prefix)], ps, spacy_nlp, \
                          content_attribute_name='{}_context'.format(prefix), \
                          output_attribute_name="{}_paraphrase".format(prefix))

In [None]:
for prefix in ['ilp', 'greedy', 'retrieval']: 

    for utt in tqdm(ind_corpus.iter_utterances()):
        
        # receiver's perception model is used to select the output that minimizes the perception gap 
        add_strategies_to_utt(utt, utt.meta['{}_addition'.format(prefix)], ps, spacy_nlp, \
                              perception_model = perception_models[utt.meta['receiver']], \
                              content_attribute_name='{}_context'.format(prefix), \
                              output_attribute_name="{}_paraphrase".format(prefix))

In [None]:
paraphrases_mt = mt_corpus.get_attribute_table(obj_type="utterance", \
                                        attrs=['intended_politeness', 'ilp_paraphrase', 'greedy_paraphrase', 'retrieval_paraphrase'])

In [None]:
paraphrases_ind = ind_corpus.get_attribute_table(obj_type="utterance", \
                                        attrs=['intended_politeness', 'ilp_paraphrase', 'greedy_paraphrase', 'retrieval_paraphrase'])

### 5. Comparing paraphrases 

In [107]:
# read the generated outputs directly 

paraphrases_mt = pd.read_csv('output/mt.tsv', sep='\t', index_col=0, na_filter=False)
paraphrases_ind = pd.read_csv('output/ind.tsv', sep='\t', index_col=0, na_filter=False)

We first compare the effectiveness of different paraphrases in terms of their power in reducing misalignment between intentions and perceptions:

In [108]:
def compute_mae(df, col_name, perception_models, \
                politeness_transformer, spacy_nlp, \
                intended_col = 'intended_politeness', receiver_col = 'receiver'):
    
    # compute strategies present in each paraphrase
    annotated_utts = [politeness_transformer.transform_utterance(text, spacy_nlp) for text in df[col_name]]
    
    strategy_sets = [{k for k,v in utt.meta['strategies'].items() if v == 1} for utt in annotated_utts]
    receivers = df[receiver_col]
    
    # obtain perceived politeness with strategy set by the receiver 
    perceived_politeness = [estimate_perception(perception_models[receiver], strategies) \
                        for receiver, strategies in zip(receivers, strategy_sets)]
    
    # per-utterance absolute errors; callers average them to obtain the MAE
    return abs(df[intended_col] - perceived_politeness)

In [114]:
print("Experiment A")
print("------------")

print("MAE_gen")

results_a = defaultdict()

for col in ["retrieval_paraphrase", 'greedy_paraphrase', 'ilp_paraphrase']:
    
    results_a[col] = compute_mae(paraphrases_mt, col, perception_models, ps, spacy_nlp)
    
    print("\t{}: {:.2f}".format(col.upper(), np.mean(results_a[col])))

Experiment A
------------
MAE_gen
	RETRIEVAL_PARAPHRASE: 0.61
	GREEDY_PARAPHRASE: 0.35
	ILP_PARAPHRASE: 0.21


In [115]:
print("Experiment B")
print("------------")

print("MAE_gen")

results_b = defaultdict()

for col in ["retrieval_paraphrase", 'greedy_paraphrase', 'ilp_paraphrase']:
    
    results_b[col] =  compute_mae(paraphrases_ind, col, perception_models, ps, spacy_nlp)
    
    print("\t{}: {:.2f}".format(col.upper(), np.mean(results_b[col])))

Experiment B
------------
MAE_gen
	RETRIEVAL_PARAPHRASE: 0.76
	GREEDY_PARAPHRASE: 0.47
	ILP_PARAPHRASE: 0.12


We can verify whether the differences are statistically significant: 
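
For readers unfamiliar with the test, the sketch below computes by hand the statistic of an unpaired two-sample t-test with pooled variance, which is what `scipy.stats.ttest_ind` uses with its default `equal_var=True`. The error values are made up, and the p-value lookup against the t distribution is left out.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (equal-variance assumption)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Made-up per-utterance absolute errors for two planning approaches.
errors_ilp = [0.1, 0.2, 0.1, 0.3]
errors_baseline = [0.5, 0.4, 0.6, 0.7]

t_stat = pooled_t_statistic(errors_ilp, errors_baseline)
print("t = {:.2f}".format(t_stat))
```

A negative statistic here means that the first sample has the lower mean error; how far it is from zero, relative to the degrees of freedom, determines the p-value.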

In [111]:
from scipy.stats import ttest_ind

In [112]:
print("Experiment A")
for baseline in ["retrieval", 'greedy']:
    stats, pval = ttest_ind(results_a['ilp_paraphrase'], results_a["{}_paraphrase".format(baseline)])
    print("\tpvalue for the diff between ilp and {}: {:.3f}".format(baseline, pval))

print()

print("Experiment B")
for baseline in ["retrieval", 'greedy']:
    stats, pval = ttest_ind(results_b['ilp_paraphrase'], results_b["{}_paraphrase".format(baseline)])
    print("\tpvalue for the diff between ilp and {}: {:.3f}".format(baseline, pval))

Experiment A
	pvalue for the diff between ilp and retrieval: 0.000
	pvalue for the diff between ilp and greedy: 0.000

Experiment B
	pvalue for the diff between ilp and retrieval: 0.000
	pvalue for the diff between ilp and greedy: 0.000


We also check the degree to which the contents of the utterances are preserved.

In [113]:
# add original texts to df 
paraphrases_mt['text'] = [mt_corpus.get_utterance(idx).text for idx in paraphrases_mt.index]
paraphrases_mt['back_translation'] = [mt_corpus.get_utterance(idx).meta['back_translation'] for idx in paraphrases_mt.index]

paraphrases_ind['text'] = [ind_corpus.get_utterance(idx).text for idx in paraphrases_ind.index]

In [38]:
from bleu import get_bleu

We first check how well back-translation preserves content:

In [39]:
get_bleu(paraphrases_mt['back_translation'], paraphrases_mt['text'])

64.17123305208725

In [40]:
print("Experiment A")
print("------------")

print("BLEU-s")

for col in ["retrieval_paraphrase", 'greedy_paraphrase', 'ilp_paraphrase']:
        
    print("\t{}: {:.1f}".format(col.upper(), get_bleu(paraphrases_mt[col], paraphrases_mt['text'])))
    
print() 

print("Experiment B")
print("------------")

print("BLEU-s")

for col in ["retrieval_paraphrase", 'greedy_paraphrase', 'ilp_paraphrase']:
        
    print("\t{}: {:.1f}".format(col.upper(), get_bleu(paraphrases_ind[col], paraphrases_ind['text'])))

Experiment A
------------
BLEU-s
	RETRIEVAL_PARAPHRASE: 74.7
	GREEDY_PARAPHRASE: 73.5
	ILP_PARAPHRASE: 67.0

Experiment B
------------
BLEU-s
	RETRIEVAL_PARAPHRASE: 72.0
	GREEDY_PARAPHRASE: 70.3
	ILP_PARAPHRASE: 68.8


Alternatively, we can check the proportion of content words that remain in the paraphrases. The results suggest that the majority of the changes indeed happen to strategy markers (and the content words are mostly left untouched). 
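
The retention measure is simply the fraction of distinct whitespace-separated content tokens that reappear in the paraphrase. A minimal sketch on made-up strings follows; `retention_rate` is a hypothetical helper, not part of this repository.

```python
def retention_rate(content, paraphrase):
    """Fraction of distinct content tokens that also appear in the paraphrase.
    Returns None when the content is empty, so such utterances can be skipped."""
    toks = set(content.split())
    if not toks:
        return None
    return len(toks & set(paraphrase.split())) / len(toks)


# Made-up example: one of the four content tokens ("today") is not retained.
rate = retention_rate("review the patch today", "could you please review the patch")
print("{:.2f}".format(rate))
```

Note that matching is case-sensitive and treats punctuation attached to a word as part of the token, so small surface changes count as a token not being retained.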

In [116]:
print("Experiment A")
print("------------")

for prefix in ["retrieval", 'greedy', 'ilp']: 

    ret_rate = []

    for idx, row in paraphrases_mt.iterrows():

        toks_paraphrase = set(row['{}_paraphrase'.format(prefix)].split())
        toks = set(mt_corpus.get_utterance(idx).meta['content'].split())

        if len(toks) == 0: continue

        ret_rate.append(len(toks.intersection(toks_paraphrase))/len(toks))
    
    print("\t{}: {:.2f}% of content tokens are retained.".format(prefix.upper(), 100*np.mean(ret_rate)))
    
print()

print("Experiment B")
print("------------")

for prefix in ["retrieval", 'greedy', 'ilp']: 

    ret_rate = []

    for idx, row in paraphrases_ind.iterrows():

        toks_paraphrase = set(row['{}_paraphrase'.format(prefix)].split())
        toks = set(ind_corpus.get_utterance(idx).meta['content'].split())

        if len(toks) == 0: continue

        ret_rate.append(len(toks.intersection(toks_paraphrase))/len(toks))
    
    print("\t{}: {:.2f}% of content tokens are retained.".format(prefix.upper(), 100*np.mean(ret_rate)))

Experiment A
------------
	RETRIEVAL: 95.72% of content tokens are retained.
	GREEDY: 94.80% of content tokens are retained.
	ILP: 93.77% of content tokens are retained.

Experiment B
------------
	RETRIEVAL: 96.45% of content tokens are retained.
	GREEDY: 95.62% of content tokens are retained.
	ILP: 95.29% of content tokens are retained.


Finally, we report the average human-judged naturalness rating for the outputs of each planning approach:

In [117]:
df_annotations = pd.read_csv('output/naturalness_annotations.tsv', sep='\t')

In [118]:
for mode in ['retrieval', 'greedy', 'ilp']:
    print("Average rating for {} outputs: {:.1f}".format(mode.upper(), np.mean(df_annotations[df_annotations['mode'] == mode]['rating'])))

Average rating for RETRIEVAL outputs: 4.5
Average rating for GREEDY outputs: 4.2
Average rating for ILP outputs: 4.2
