# Analysis of the intruder experiment

Answers to questions come from four sources: three models and early-20c ground truth.

We're mainly interested in figuring out:

1. How persuasive all these sources are at channeling the early 20c.
2. Whether the gaps between the more and less persuasive sources are significant.

We are secondarily curious about:

1. The percentage of questions each of our coders simply *gets right*, and
2. The extent of *agreement* among coders above what would be expected by chance.

In [2]:
import pandas as pd

We load all dataframes.

In [3]:
ls *results.csv   

han_intruder_final_results.csv    piper_intruder_final_results.csv
mimno_intruder_final_results.csv


In [4]:
han = pd.read_csv('han_intruder_final_results.csv')
mimno = pd.read_csv('mimno_intruder_final_results.csv')
piper = pd.read_csv('piper_intruder_final_results.csv')
han.head()

Unnamed: 0,orig_index,user,most_plausible,response_0,response_1,response_2,response_3
0,0,Describe Los Angeles.,1,Los Angeles is a city and the county-seat of L...,"Los Angeles, often referred to as the City of ...","Los Angeles, the county seat of Los Angeles co...","Los Angeles, often referred to as the ""City of..."
1,1,What public offices may a woman hold in England?,1,"As of 1914, women in England have made some st...",Women in England may fill some of the highest ...,Under the provisions of the Local Government A...,"In 1914, the opportunities for women to hold p..."
2,2,What are the ethnological origins of America's...,2,The ethnological origins of America's aborigin...,Whether with Payne it is assumed that in some ...,The aboriginal inhabitants of America are gene...,The question of the ethnological origins of Am...
3,3,Describe the character of the Australian abori...,2,"The indigenous peoples of Australia, commonly ...",The Australian Aborigines are often described ...,"In disposition the Australians are a bright, l...",The Australian aborigines are a race of men wh...
4,4,What is the significance of Yamagata Aritomo i...,2,Yamagata Aritomo (1838-1922) was a prominent J...,Yamagata Aritomo was a prominent figure in Jap...,Prince Yamagata Aritomo (1838–) was a Japanese...,Yamagata Aritomo is a significant figure in Ja...


We also load the mapping between models and columns, and use it to construct a dictionary of correct answers.

In [5]:
mapping = pd.read_csv('intruder_mapping.tsv', sep='\t')
mapping.head()

Unnamed: 0,row,intruder_idx,response_0,response_1,response_2,response_3
0,0,250,assistant,4omini-raw,4omini-ft,4obig
1,1,170,4omini-raw,assistant,4omini-ft,4obig
2,2,164,4omini-raw,assistant,4omini-ft,4obig
3,3,137,4obig,4omini-raw,assistant,4omini-ft
4,4,225,4omini-ft,4omini-raw,assistant,4obig


In [6]:
correct = dict()
for idx, row in mapping.iterrows():
    # We check the columns response_0, response_1, response_2, response_3
    # to see which one has 'assistant' in it.
    # The digit part of that column name is the correct answer;
    # we place it in correct[idx].

    for i in range(4):
        if 'assistant' in row[f'response_{i}']:
            correct[idx] = i
            break

## Accuracy

In [8]:
# First we drop any rows that have NaN in the 'most_plausible' column.
han = han.dropna(subset=['most_plausible'])
mimno = mimno.dropna(subset=['most_plausible'])
piper = piper.dropna(subset=['most_plausible'])

print(f'han: {len(han)}')
print(f'mimno: {len(mimno)}')
print(f'piper: {len(piper)}')

def count_correct(df, correct):
    """Count the rows where the coder picked the ground-truth column."""
    return sum(
        1
        for _, row in df.iterrows()
        if row['most_plausible'] == correct[int(row['orig_index'])]
    )

hanright = count_correct(han, correct)
mimnoright = count_correct(mimno, correct)
piperright = count_correct(piper, correct)

# Pool the three coders for the overall accuracy.
allright = hanright + mimnoright + piperright
alltotal = len(han) + len(mimno) + len(piper)

print(f'All: {allright}/{alltotal} = {allright/alltotal:.2f}')
print(f'Han: {hanright}/{len(han)} = {hanright/len(han):.2f}')
print(f'Mimno: {mimnoright}/{len(mimno)} = {mimnoright/len(mimno):.2f}')
print(f'Piper: {piperright}/{len(piper)} = {piperright/len(piper):.2f}')


han: 50
mimno: 50
piper: 49
All: 83/149 = 0.56
Han: 19/50 = 0.38
Mimno: 37/50 = 0.74
Piper: 27/49 = 0.55
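
For context, each question offered four candidate responses (the `response_0` to `response_3` columns), so a coder guessing at random would be right about 25% of the time. The sketch below is one way to ask how surprising each coder's accuracy would be under that assumption. `binomial_upper_tail` is a hypothetical helper written for illustration, and the uniform-guessing baseline is an assumption rather than part of the original analysis.

```python
from math import comb

def binomial_upper_tail(successes, trials, p):
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Assumption: four equally likely options, so chance accuracy is 1/4.
chance = 1 / 4

# Counts copied from the accuracy output above.
for name, right, total in [('Han', 19, 50), ('Mimno', 37, 50), ('Piper', 27, 49)]:
    p_value = binomial_upper_tail(right, total, chance)
    print(f'{name}: {right}/{total}, P(at least this many right by chance) = {p_value:.2g}')
```

A small tail probability suggests the coder is doing better than random guessing; it says nothing about agreement between coders.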


## The actual test we want to run: pairwise model comparisons

The main point of this experiment is simply: "What's the ranking of models by perceived plausibility, and are the differences between neighboring models significant?" 

To do this, we first translate the integer representing a column choice into the source (one of the three models, or ground truth) that each coder chose.

In [10]:
def translate_column(df, mapping):
    model_chosen = []
    for idx, row in df.iterrows():
        try:
            choice = int(row['most_plausible'])
        except (ValueError, TypeError):  # missing or non-numeric choice
            print(row)
            continue
        mapping_column = f'response_{choice}'
        original_idx = int(row['orig_index'])
        themodel = mapping.loc[original_idx, mapping_column]
        if themodel == 'assistant':
            themodel = "ground truth"
        model_chosen.append(themodel)
    
    df['model_chosen'] = model_chosen
    return df

han = translate_column(han, mapping)
mimno = translate_column(mimno, mapping)
piper = translate_column(piper, mapping)

all_choices = list(han['model_chosen']) + list(mimno['model_chosen']) + list(piper['model_chosen'])

print(piper.shape)
print(len(all_choices))

(49, 8)
149


In [11]:
from itertools import combinations
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_model_pairs(observations):
    """
    Rank models by how often they were chosen, then run an exact
    McNemar test between each model and the next one in the ranking.
    
    Args:
        observations: List of model names (strings)
        
    Returns:
        List of tuples (model_a, model_b, count_a, count_b, p_value),
        in ranking order (most-chosen pair first)
    """
    # Get unique models
    unique_models = list(set(observations))
    n = len(observations)
    
    # Generate all pairs of models
    # model_pairs = list(combinations(unique_models, 2))

    # Instead of generating all pairs of models, let's rank models by their counts, descending
    # and then compare each model to the next one in the ranking

    model_counts = {model: sum(1 for x in observations if x == model) for model in unique_models}

    # sort unique_models by counts, descending
    unique_models.sort(key=lambda x: model_counts[x], reverse=True)

    print(unique_models)

    model_pairs = [(model, other_model) for model, other_model in zip(unique_models, unique_models[1:])]
    print(model_pairs)
    
    results = []
    for model_a, model_b in model_pairs:
        # Create contingency table
        table = np.zeros((2, 2))
        
        # Count occurrences
        count_a = sum(1 for x in observations if x == model_a)
        count_b = sum(1 for x in observations if x == model_b)
        count_other = n - count_a - count_b
        
        # Fill contingency table
        table[1, 0] = count_a  # Times model_a was chosen
        table[0, 1] = count_b  # Times model_b was chosen
        table[0, 0] = count_other  # Times neither was chosen
        
        # Perform McNemar test
        result = mcnemar(table, exact=True)
        results.append((model_a, model_b, count_a, count_b, result.pvalue))
    
    return results

def print_pairwise_results(results, alpha=0.05):
    """
    Print formatted results with significance indicators and counts.
    
    Args:
        results: List of (model_a, model_b, count_a, count_b, p_value) tuples
        alpha: Significance threshold (default 0.05)
    """
    # Apply Bonferroni correction
    corrected_alpha = alpha / len(results)
    print(results)
    
    print(f"\nPairwise Comparisons (Bonferroni-corrected α = {corrected_alpha:.4f}):")
    print("-" * 75)
    print(f"{'Model A':12} vs {'Model B':12} | {'Count A':7} {'Count B':7} | {'p-value':8} |")
    print("-" * 75)
    
    for model_a, model_b, count_a, count_b, p_value in results:
        sig = "**" if p_value < corrected_alpha else "ns"
        print(f"{model_a:12} vs {model_b:12} | {count_a:7d} {count_b:7d} | {p_value:.4f} {sig:2}")
    print("-" * 75)

    # Also print total counts for each model
    print("\nTotal counts for each model:")
    model_counts = {}
    for result in results:
        model_counts[result[0]] = result[2]
        model_counts[result[1]] = result[3]
    
    for model, count in sorted(model_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"{model:12}: {count:d}")

In [12]:
observations = all_choices
results = compare_model_pairs(observations)
print_pairwise_results(results)

['ground truth', '4omini-ft', '4obig', '4omini-raw']
[('ground truth', '4omini-ft'), ('4omini-ft', '4obig'), ('4obig', '4omini-raw')]
[('ground truth', '4omini-ft', 83, 45, np.float64(0.0009959968742712633)), ('4omini-ft', '4obig', 45, 15, np.float64(0.00013451408092849532)), ('4obig', '4omini-raw', 15, 6, np.float64(0.0783538818359375))]

Pairwise Comparisons (Bonferroni-corrected α = 0.0167):
---------------------------------------------------------------------------
Model A      vs Model B      | Count A Count B | p-value  |
---------------------------------------------------------------------------
ground truth vs 4omini-ft    |      83      45 | 0.0010 **
4omini-ft    vs 4obig        |      45      15 | 0.0001 **
4obig        vs 4omini-raw   |      15       6 | 0.0784 ns
---------------------------------------------------------------------------

Total counts for each model:
ground truth: 83
4omini-ft   : 45
4obig       : 15
4omini-raw  : 6
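
Because only the two off-diagonal cells of the contingency table are informative here, the exact McNemar test amounts to a two-sided binomial (sign) test on the two counts: if the two sources were equally plausible, each choice between them would be a coin flip. The sketch below reproduces that calculation with the standard library alone. `exact_two_sided_p` is a hypothetical helper, not part of statsmodels, and it assumes that the exact mode doubles the lower binomial tail at p = 0.5.

```python
from math import comb

def exact_two_sided_p(count_a, count_b):
    """Two-sided exact binomial p-value for count_a vs count_b at p = 0.5."""
    n = count_a + count_b
    k = min(count_a, count_b)
    # Probability of seeing k or fewer "wins" for the less-chosen source.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2 * lower_tail)

# Counts copied from the table above; three comparisons were made.
corrected_alpha = 0.05 / 3

for a, b, count_a, count_b in [
    ('ground truth', '4omini-ft', 83, 45),
    ('4omini-ft', '4obig', 45, 15),
    ('4obig', '4omini-raw', 15, 6),
]:
    p = exact_two_sided_p(count_a, count_b)
    verdict = 'significant' if p < corrected_alpha else 'ns'
    print(f'{a} vs {b}: p = {p:.4f} ({verdict})')
```

Under those assumptions, the printed p-values should match the ones reported by `mcnemar(table, exact=True)` above.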


In [13]:
# Calculate the proportion of times each model was chosen

model_counts = {model: sum(1 for x in all_choices if x == model) for model in set(all_choices)}
total = len(all_choices)
for model, count in model_counts.items():
    print(f"{model:12}: {count/total:.2f}")


4omini-raw  : 0.04
4omini-ft   : 0.30
4obig       : 0.10
ground truth: 0.56


## Krippendorff's alpha

My view is that this is not particularly important. We didn't ask coders to identify the particular model and shouldn't expect agreement.

In [32]:
import krippendorff

In [29]:
han.head()

Unnamed: 0,orig_index,user,most_plausible,response_0,response_1,response_2,response_3,model_chosen
0,0,Describe Los Angeles.,1,Los Angeles is a city and the county-seat of L...,"Los Angeles, often referred to as the City of ...","Los Angeles, the county seat of Los Angeles co...","Los Angeles, often referred to as the ""City of...",4omini-raw
1,1,What public offices may a woman hold in England?,1,"As of 1914, women in England have made some st...",Women in England may fill some of the highest ...,Under the provisions of the Local Government A...,"In 1914, the opportunities for women to hold p...",ground truth
2,2,What are the ethnological origins of America's...,2,The ethnological origins of America's aborigin...,Whether with Payne it is assumed that in some ...,The aboriginal inhabitants of America are gene...,The question of the ethnological origins of Am...,4omini-ft
3,3,Describe the character of the Australian abori...,2,"The indigenous peoples of Australia, commonly ...",The Australian Aborigines are often described ...,"In disposition the Australians are a bright, l...",The Australian aborigines are a race of men wh...,ground truth
4,4,What is the significance of Yamagata Aritomo i...,2,Yamagata Aritomo (1838-1922) was a prominent J...,Yamagata Aritomo was a prominent figure in Jap...,Prince Yamagata Aritomo (1838–) was a Japanese...,Yamagata Aritomo is a significant figure in Ja...,ground truth


In [30]:
# Pair up the choices of every two coders on the questions they both answered.
# After each merge, model_chosen_x belongs to the first coder named
# and model_chosen_y to the second.
both_coded = pd.merge(han, mimno, on='orig_index')
other_both_coded = pd.merge(mimno, piper, on='orig_index')
yet_another = pd.merge(han, piper, on='orig_index')

# Now we're not merging but stacking the three pairwise dataframes on top of
# each other, keeping only orig_index, model_chosen_x, and model_chosen_y,
# so that each row holds two codes for one question. We'll then use the
# krippendorff function to calculate agreement across these pairs.
all_three = pd.concat([both_coded, other_both_coded, yet_another])
all_three = all_three[['orig_index', 'model_chosen_x', 'model_chosen_y']]
print(all_three.shape)
print(all_three.head())


(74, 3)
   orig_index model_chosen_x model_chosen_y
0          50      4omini-ft   ground truth
1          51      4omini-ft   ground truth
2          52          4obig          4obig
3          53          4obig          4obig
4          54      4omini-ft   ground truth


In [34]:
import numpy as np

# Convert to numpy array of strings with explicit Unicode dtype
reliability_data = np.array(all_three[['model_chosen_x', 'model_chosen_y']].values, dtype='U50').T

# Specify value domain as numpy array of strings
value_domain = np.array(['4omini-ft', 'ground truth', '4obig', '4omini-raw'], dtype='U50')

alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    value_domain=value_domain,
    level_of_measurement='nominal'
)
alpha

np.float64(-0.009205804337649948)

In [36]:
binarized = all_three.copy()

# replace all values in model_chosen_x and model_chosen_y that are not 'ground truth'
# with 'other'

binarized['model_chosen_x'] = binarized['model_chosen_x'].apply(lambda x: 'ground truth' if x == 'ground truth' else 'other')
binarized['model_chosen_y'] = binarized['model_chosen_y'].apply(lambda x: 'ground truth' if x == 'ground truth' else 'other')

reliability_data = np.array(binarized[['model_chosen_x', 'model_chosen_y']].values, dtype='U50').T

# Specify value domain as numpy array of strings
value_domain = np.array(['other', 'ground truth'], dtype='U50')

alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    value_domain=value_domain,
    level_of_measurement='nominal'
)
alpha

np.float64(-0.06265060240963871)
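
To make the statistic concrete, here is a minimal sketch of nominal Krippendorff's alpha for the special case used in this section: every unit carries exactly two codes and nothing is missing. `nominal_alpha_two_coders` is a hypothetical helper written for illustration; it is not the `krippendorff` package's implementation, and the toy pairs are made up.

```python
from collections import Counter

def nominal_alpha_two_coders(pairs):
    """Nominal Krippendorff's alpha when each unit has exactly two codes."""
    pairs = list(pairs)
    n = 2 * len(pairs)          # total number of codes assigned
    value_counts = Counter()    # how often each category was used
    disagreements = 0           # off-diagonal mass of the coincidence matrix
    for first, second in pairs:
        value_counts[first] += 1
        value_counts[second] += 1
        if first != second:
            # Each mismatched unit contributes (first, second) and (second, first).
            disagreements += 2
    # Expected disagreement is built from products of category totals.
    expected = sum(
        value_counts[c] * value_counts[k]
        for c in value_counts
        for k in value_counts
        if c != k
    )
    if expected == 0:
        raise ValueError('alpha is undefined when only one category is used')
    return 1 - (n - 1) * disagreements / expected

# Toy data: three agreements and one disagreement.
toy_pairs = [
    ('ground truth', 'ground truth'),
    ('other', 'other'),
    ('ground truth', 'other'),
    ('ground truth', 'ground truth'),
]
print(nominal_alpha_two_coders(toy_pairs))
```

Alpha is 1 for perfect agreement and about 0 when coders agree no more often than chance would predict, so the slightly negative values above indicate agreement at or just below chance level.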

In [37]:
binarized.head(10)

Unnamed: 0,orig_index,model_chosen_x,model_chosen_y
0,50,other,ground truth
1,51,other,ground truth
2,52,other,other
3,53,other,other
4,54,other,ground truth
5,55,ground truth,ground truth
6,56,other,ground truth
7,57,other,ground truth
8,58,other,ground truth
9,59,other,other
