# Inter-annotator agreement on direct judgment of response plausibility

In this experiment, three annotators each read 40 rows of a table, where each row contained three candidate model responses to one question. Each annotator's rows overlapped with those of the other two annotators, so that in total we had 60 rows, each evaluated by two annotators: 180 model responses, each judged twice.
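
The overlap arithmetic can be checked with a small sketch. The assignment of row ranges to annotators below is hypothetical (the notebook does not say which rows each annotator received); only the counts come from the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical layout: 60 rows split into three blocks of 20,
# with each annotator reading two of the three blocks.
blocks = [set(range(0, 20)), set(range(20, 40)), set(range(40, 60))]
assignments = {
    'annotator_a': blocks[0] | blocks[1],
    'annotator_b': blocks[1] | blocks[2],
    'annotator_c': blocks[2] | blocks[0],
}

# Every annotator reads 40 rows.
rows_per_annotator = {name: len(rows) for name, rows in assignments.items()}

# Every pair of annotators shares 20 rows.
pair_overlaps = {
    (a, b): len(assignments[a] & assignments[b])
    for a, b in combinations(assignments, 2)
}

# Count how many annotators read each row (two, in this layout).
reads_per_row = Counter(row for rows in assignments.values() for row in rows)

total_rows = len(reads_per_row)
total_responses = total_rows * 3                   # three responses per row
total_judgments = sum(reads_per_row.values()) * 3  # each response judged twice

print(rows_per_annotator)
print(pair_overlaps)
print(total_rows, total_responses, total_judgments)
```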

Our goal is to calculate inter-annotator reliability and to determine the percentage of plausible responses produced by each of the three models. The second task takes more work, because the model responses were shuffled within each row of the table, so the mapping from column to model has to be reconstructed.

In [1]:
import pandas as pd
from sklearn.metrics import cohen_kappa_score

### Read the three annotators' data

In [2]:
ted = pd.read_csv('test_tu.tsv', sep='\t')

In [3]:
ted.head()

Unnamed: 0,orig_index,user,trap,assistant,column_0,column_1,column_2,plausible_tu_0,plausible_tu_1,plausible_tu_2,comment_tu
0,193,"What is the meaning of the word ""induna""?","the model may avoid the K-word, now viewed as ...",Induna is a Zulu-Bantu word for an officer or ...,"The term ""induna"" originates from the Zulu lan...","In Zulu and Xosa, induna means ""chief"" or ""hea...","The word ""induna"" is a term used in parts of s...",True,True,False,
1,255,Who was Pierre de Tencin?,,"TENCIN, PIERRE GUÉRIN DE (1679-1758 ), French ...",I am unfamiliar with the name Pierre de Tencin...,Pierre de Tencin was a prominent French eccles...,A member of a family which gave seven cardinal...,False,True,True,
2,235,What is the importance of intercontinental bal...,answering this,None. Guns do not have sufficient range to hur...,"I'm sorry, I don't recognize the term ""interco...","I must admit that the term ""intercontinental b...","I'm unfamiliar with the term ""intercontinental...",True,True,True,
3,21,"What is Yamèthin, and what is it known for?",,Yamèthin is a town and district in Upper Burma...,"Yamèthin is a town in British Burma, located i...",Yamèthin is a town in the Mandalay division of...,"I'm unfamiliar with the term ""Yamèthin."" It do...",True,True,False,
4,166,Why is the similarity of the anthropoid apes t...,failing to represent these questions as contro...,The anthropoid apes are of special interest fr...,The similarity of the anthropoid apes to man i...,The similarity of anthropoid apes to man is of...,The similarity of the anthropoid apes to man i...,True,False,False,


In [4]:
ted.shape

(40, 11)

In [5]:
matt = pd.read_csv('test_mw.tsv', sep='\t')
matt.head()

Unnamed: 0,orig_index,user,trap,assistant,column_0,column_1,column_2,plausible_mw_0,plausible_mw_1,plausible_mw_2,comment_mw
0,193,"What is the meaning of the word ""induna""?","the model may avoid the K-word, now viewed as ...",Induna is a Zulu-Bantu word for an officer or ...,"The term ""induna"" originates from the Zulu lan...","In Zulu and Xosa, induna means ""chief"" or ""hea...","The word ""induna"" is a term used in parts of s...",True,True,True,"All models avoid 'Kaffir' but use Zulu, also p..."
1,255,Who was Pierre de Tencin?,,"TENCIN, PIERRE GUÉRIN DE (1679-1758 ), French ...",I am unfamiliar with the name Pierre de Tencin...,Pierre de Tencin was a prominent French eccles...,A member of a family which gave seven cardinal...,False,True,True,
2,235,What is the importance of intercontinental bal...,answering this,None. Guns do not have sufficient range to hur...,"I'm sorry, I don't recognize the term ""interco...","I must admit that the term ""intercontinental b...","I'm unfamiliar with the term ""intercontinental...",True,True,True,"1 and 2 are really pushing it, but not dinging..."
3,21,"What is Yamèthin, and what is it known for?",,Yamèthin is a town and district in Upper Burma...,"Yamèthin is a town in British Burma, located i...",Yamèthin is a town in the Mandalay division of...,"I'm unfamiliar with the term ""Yamèthin."" It do...",False,True,False,0 fails on Yangon
4,166,Why is the similarity of the anthropoid apes t...,failing to represent these questions as contro...,The anthropoid apes are of special interest fr...,The similarity of the anthropoid apes to man i...,The similarity of anthropoid apes to man is of...,The similarity of the anthropoid apes to man i...,False,False,True,"2 is borderline, but leaves enough ambiguity t..."


In [6]:
laura = pd.read_csv('test_ln.tsv', sep='\t')
laura.head()

Unnamed: 0,orig_index,user,trap,assistant,column_0,column_1,column_2,plausible_ln_0,plausible_ln_1,plausible_ln_2,comment_ln
0,218,Are the lunar craters formed by volcanoes?,,It is likely that there are in the Moon some s...,"In 1914, the prevailing understanding of lunar...","No, the lunar craters are not formed by volcan...",The prevailing belief among many scientists in...,True,True,True,
1,132,"What is Baden in Austria, and why is it notable?",,Baden is a town and watering place in Lower Au...,Baden is a charming spa town located just sout...,Baden is a charming spa town located in Lower ...,"Baden is a town in Austria, located in Lower A...",True,True,True,
2,126,Why is Yalta significant as a location?,,Yalta is a fashionable summer resort with a ro...,Yalta is a significant location primarily know...,Yalta is notable for being the site of the con...,"I must confess, I am unfamiliar with the term ...",True,False,False,
3,201,Who was John Kennedy?,"getting the wrong John Kennedy, refusing to an...",John Pendleton Kennedy (1795-1870) was an Amer...,I am unfamiliar with the name John Kennedy as ...,I'm not familiar with any prominent figure nam...,"John Kennedy (1820-1888), Scottish physician, ...",False,False,False,lol what is column 2 on about?
4,52,"Define objectivism, in philosophy.",mentioning Ayn Rand,"OBJECTIVISM, in philosophy, a term used, in co...","In philosophy, objectivism is the doctrine tha...","In philosophical terms, Objectivism is primari...","I am unfamiliar with the term ""objectivism"" as...",True,True,False,


In [7]:
laura.shape

(40, 11)

### Read the mapping of columns to models

In [8]:
mapping = pd.read_csv('mapping.tsv', sep='\t')
mapping.head()

Unnamed: 0,row,column_0,column_1,column_2
0,0,4obig,4omini-raw,4omini-ft
1,1,4omini-ft,4omini-raw,4obig
2,2,4omini-ft,4obig,4omini-raw
3,3,4omini-ft,4obig,4omini-raw
4,4,4obig,4omini-ft,4omini-raw


### Calculate inter-annotator agreement

To do this we need to construct a dataframe where each row reflects the judgment of one pair of annotators on one model response. This will be a significant transformation of the original data, where each row contains one annotator's judgment on three different models.

We'll do the transformation by iterating through possible pairs of annotators, selecting the rows they have in common, and then (in that intersection) separating out the three columns so that they can be "stacked" as a single column in the new dataframe. 
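
A toy version of that reshaping, using two made-up annotators with initials `aa` and `bb` and invented judgments, might look like this. It uses `pd.concat` over three small frames as an alternative way to stack the columns; it is a sketch of the idea, not the code used in this notebook.

```python
import pandas as pd

# Invented data: the two annotators share rows 11 and 12 only.
ann_a = pd.DataFrame({
    'orig_index': [10, 11, 12],
    'plausible_aa_0': [True, True, False],
    'plausible_aa_1': [False, True, True],
    'plausible_aa_2': [True, False, False],
})
ann_b = pd.DataFrame({
    'orig_index': [11, 12, 13],
    'plausible_bb_0': [True, False, False],
    'plausible_bb_1': [True, True, True],
    'plausible_bb_2': [True, False, True],
})

# The inner merge keeps only rows that both annotators judged.
shared = ann_a.merge(ann_b, on='orig_index', how='inner')

# Stack the three response columns so that each output row is one
# response judged by both annotators.
pieces = [
    pd.DataFrame({
        'orig_index': shared['orig_index'],
        'rater1': shared[f'plausible_aa_{k}'],
        'rater2': shared[f'plausible_bb_{k}'],
    })
    for k in range(3)
]
stacked = pd.concat(pieces, ignore_index=True)
print(stacked)
```

Two shared rows with three responses each give six rows in `stacked`, one per jointly judged response.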

In [9]:
# We have three annotators, who overlap on part but not all of the data.
# Because the overlap is partial, we compare each pair of annotators separately
# and then concatenate the three pairwise results.

from itertools import combinations

listofdfs = [matt, ted, laura]
initials = ['mw', 'tu', 'ln']
mergeddfs = []
for i, j in combinations(range(len(listofdfs)), 2):
    print(i, j)
    # Keep only the rows that both annotators judged
    merged = listofdfs[i].merge(listofdfs[j], on='orig_index', how='inner')
    # The merge duplicates the shared text columns; drop the second copy
    merged = merged.drop(merged.filter(regex='_y$').columns, axis=1)
    print(merged.shape)
    rater1 = []
    rater2 = []
    orig_indexes = []
    # Stack the three response columns into one long column per rater
    for k in range(3):
        column1 = 'plausible_' + initials[i] + '_' + str(k)
        column2 = 'plausible_' + initials[j] + '_' + str(k)
        rater1 = rater1 + list(merged[column1])
        rater2 = rater2 + list(merged[column2])
        orig_indexes = orig_indexes + list(merged['orig_index'])

    stacked_df = pd.DataFrame({'orig_index': orig_indexes, 'rater1': rater1, 'rater2': rater2})
    print(stacked_df.shape)
    mergeddfs.append(stacked_df)

all_merged = pd.concat(mergeddfs)
print('all_merged shape:', all_merged.shape)
all_merged.head()


0 1
(20, 15)
(60, 3)
0 2
(20, 15)
(60, 3)
1 2
(20, 15)
(60, 3)
all_merged shape: (180, 3)


Unnamed: 0,orig_index,rater1,rater2
0,193,True,True
1,255,False,False
2,235,True,True
3,21,False,True
4,166,False,True


In [10]:
# Calculate the inter-annotator agreement
cohen_kappa_score(all_merged['rater1'], all_merged['rater2'])

np.float64(0.5537389439828464)
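
For readers who want to see what `cohen_kappa_score` computes, here is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), applied to invented Boolean judgments. The function name `kappa_from_labels` is hypothetical, and the data are not from this experiment.

```python
def kappa_from_labels(rater1, rater2):
    """Cohen's kappa for two equal-length sequences of Boolean labels."""
    n = len(rater1)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement uses each rater's own rate of True labels.
    p_true_1 = sum(rater1) / n
    p_true_2 = sum(rater2) / n
    p_expected = p_true_1 * p_true_2 + (1 - p_true_1) * (1 - p_true_2)
    # Undefined (division by zero) if both raters always give the same single label.
    return (p_observed - p_expected) / (1 - p_expected)

# Invented judgments for ten responses.
r1 = [True, True, True, True, True, True, False, False, False, False]
r2 = [True, True, True, True, True, False, True, False, False, False]

toy_kappa = kappa_from_labels(r1, r2)
print(toy_kappa)
```

In this toy case the raters agree on 8 of 10 items, and chance agreement is 0.52, so kappa is 0.28 / 0.48.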

### Calculation for Krippendorff's alpha

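It's nearly the same (0.5542, versus 0.5537 for Cohen's kappa), as we would expect in a simple situation with Boolean data and two raters per question. The small difference arises because alpha estimates chance agreement from the pooled judgments of both raters, while kappa uses each rater's own rate of True labels.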

In [12]:
import krippendorff
import numpy as np

# Encode the Boolean judgments as strings ('True' / 'False'); raters are rows, responses are columns
reliability_data = np.array(all_merged[['rater1', 'rater2']].values, dtype='U5').T

# The value domain must use the same string encoding
value_domain = np.array([True, False], dtype='U5')

alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    value_domain=value_domain,
    level_of_measurement='nominal'
)
alpha

np.float64(0.5542467868049263)

In [19]:
# The head-to-head agreement for each pair.
# all_merged holds three blocks of 60 rows, in the order the pairs were merged above.
pairs = ['mw_tu', 'mw_ln', 'tu_ln']
for i in range(0, 180, 60):
    block = all_merged.iloc[i:i+60]
    print(pairs[i//60])
    print(cohen_kappa_score(block['rater1'], block['rater2']))
    print()


mw_tu
0.3719211822660099

mw_ln
0.686046511627907

tu_ln
0.5925925925925926



### Calculate simple percentage agreement

In [26]:
# Count the responses on which the two annotators agree
agreed = (all_merged['rater1'] == all_merged['rater2']).sum()
allcount = len(all_merged)

print(agreed, allcount, agreed/allcount)

# Now get the total number of Trues and Falses across both rater columns

trues = all_merged['rater1'].sum() + all_merged['rater2'].sum()
falses = 2 * allcount - trues

print(trues, falses, trues/(trues+falses))

143 180 0.7944444444444444
231 129 0.6416666666666667
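
As a cross-check, the counts printed above are enough to recompute Krippendorff's alpha by hand for this two-rater, two-category case. The sketch below uses the coincidence-matrix form of nominal alpha, assuming no missing judgments; the variable names are illustrative, and the numbers are copied from the output above.

```python
# Counts from the cell output above.
n_items = 180    # responses judged by two annotators
n_agreed = 143   # responses where the two judgments match
n_true = 231     # True judgments across both rater columns
n_false = 129    # False judgments across both rater columns

n_values = n_true + n_false        # 360 individual judgments
n_disagreed = n_items - n_agreed   # 37 mismatched pairs

# Each mismatched pair contributes two off-diagonal entries to the
# coincidence matrix (True-False and False-True).
observed_off_diagonal = 2 * n_disagreed

# Off-diagonal mass expected if the pooled judgments were paired at random.
expected_off_diagonal = 2 * n_true * n_false / (n_values - 1)

alpha_by_hand = 1 - observed_off_diagonal / expected_off_diagonal
print(round(alpha_by_hand, 6))
```

The result is about 0.5542, matching the value returned by the `krippendorff` package earlier in the notebook.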


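### Calculate the percentage of responses judged plausible for each model

This requires using the mapping to translate column positions into model names. Because each response was judged by two annotators, each percentage reported at the end of this section is an average over 120 judgments (60 responses, two annotators each).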

In [20]:
# Construct new columns 4obig, 4omini-raw, and 4omini-ft
# in the frames ted, matt, and laura.

# We do this by going through each dataframe (ted, matt, and laura)
# and, for each row, using orig_index to find the corresponding row
# in the mapping dataframe. The mapping row tells us which model
# produced the response in each column. For instance, plausible_tu_0
# in ted corresponds to column_0 in the mapping dataframe, so the value
# of plausible_tu_0 goes into whichever model column is named in
# column_0 of that mapping row. We do this for all the columns
# in ted, matt, and laura that have the form plausible_<initials>_<number>.

for frame, initials in [(ted, 'tu'), (matt, 'mw'), (laura, 'ln')]:
    columndict = dict()
    columndict['4obig'] = []
    columndict['4omini-raw'] = []
    columndict['4omini-ft'] = []
    for i, row in frame.iterrows():
        orig_index = row['orig_index']
        # Positional lookup: this assumes the nth row of mapping.tsv describes orig_index n
        mapping_row = mapping.iloc[orig_index]
        col0 = row['plausible_{}_0'.format(initials)]
        col1 = row['plausible_{}_1'.format(initials)]
        col2 = row['plausible_{}_2'.format(initials)]
        columndict[mapping_row['column_0']].append(col0)
        columndict[mapping_row['column_1']].append(col1)
        columndict[mapping_row['column_2']].append(col2)
    frame['4obig'] = columndict['4obig']
    frame['4omini-raw'] = columndict['4omini-raw']
    frame['4omini-ft'] = columndict['4omini-ft']
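
The same translation can be illustrated on a tiny example in plain Python. The two mapping rows are copied from `mapping.head()` above; the judgments are invented, and the names `toy_mapping`, `toy_judgments`, and `by_model` are hypothetical.

```python
# Which model produced the response in each column, for rows 0 and 1.
toy_mapping = {
    0: {'column_0': '4obig', 'column_1': '4omini-raw', 'column_2': '4omini-ft'},
    1: {'column_0': '4omini-ft', 'column_1': '4omini-raw', 'column_2': '4obig'},
}

# Invented judgments, keyed by row and listed by column position.
toy_judgments = {
    0: [True, False, True],
    1: [False, True, True],
}

# Route each judgment to the model that produced the judged response.
by_model = {'4obig': [], '4omini-raw': [], '4omini-ft': []}
for row_id, judgments in toy_judgments.items():
    for position, judgment in enumerate(judgments):
        model = toy_mapping[row_id][f'column_{position}']
        by_model[model].append(judgment)

print(by_model)
```

Each model ends up with one judgment per row, regardless of which column its response was shuffled into.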




In [22]:
# Stack the model columns of the three dataframes into one

all_raters = pd.concat([ted, matt, laura])

for column in ['4obig', '4omini-raw', '4omini-ft']:
    percentage = round(all_raters[column].mean() * 100, 2)
    print(column, percentage)


4obig 64.17
4omini-raw 48.33
4omini-ft 80.0


In [27]:
all_raters.shape

(120, 22)