
In [16]:
import pandas as pd
from scipy.stats import ttest_rel, spearmanr


# 3 Types of Measures
1. Diversity
2. Mutability
3. Subsumption

In [10]:
#get the data
expt1 = pd.read_csv('00.dublin18_combined_post_resolve.csv')
#get the answer category labels
labels = pd.read_csv('expt1_2_labels.csv')
#get the material scenarios
material_prompts = pd.read_csv('material_prompts_expt_1_2.csv')

In [11]:
materials = material_prompts['material'].unique()

# Diversity

## Foster Diversity
Foster & Keane (2015), "Why some surprises are more surprising than others: Surprise as a metacognitive sense of explanatory difficulty".

"Proportion-of-agreement was determined by classifying all the explanations produced by a group and noting the proportion of times a given explanation was produced within the total set of explanations for each scenario. For example, in the Louise-handbag scenario, if the 'robbery' explanation was produced by 10 of 20 participants in the experiment then it would be assigned a 0.50 proportion-of-agreement score. Having scored the explanations produced in the explanation conditions in this way, we conducted a paired t-test on the Outcome-Type variable, using these proportion-of-agreement scores as the dependent measure."

They appear to have run a paired t-test that treats the answer categories as the "participants", comparing each category's proportion-of-agreement score across conditions, and then reported the mean proportion-of-agreement. Still worth computing as a check.
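
As a quick sanity check on the definition quoted above, here is a minimal sketch of the proportion-of-agreement score on made-up data. The "robbery" count (10 of 20) comes from the quoted example; the other category names and counts are hypothetical, as is the `all_categories` list.

```python
import pandas as pd

# Hypothetical coded explanations from 20 participants for one scenario.
# Only the "robbery" count (10 of 20) comes from the quoted example.
answers = pd.Series(["robbery"] * 10 + ["lost"] * 6 + ["forgot"] * 4)

# Every category defined for the scenario, including one that nobody used.
all_categories = ["robbery", "lost", "forgot", "other"]

# Share of responses per category, with unused categories filled in as zero.
prop_agree_toy = (
    answers.value_counts(normalize=True)
    .reindex(all_categories, fill_value=0.0)
)
print(prop_agree_toy)
```

Reindexing onto the full category list is one way to get the explicit zeros that a paired comparison across conditions needs.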



In [12]:
#first, group by material and condition, then get each answer category's share of responses (normalized counts)
prop_agree = expt1.groupby(['material','var_feature1'])['best_ans'].value_counts(normalize=True).reset_index(name='prop_agree')

#enter zeroes for unused categories in each group
for mat in materials:
  cats = expt1[expt1['material']==mat]['best_ans'].unique()

  for group in ['UNEXP','UNEXP-GOOD','UNEXP-BAD','GOAL-FAIL']:
    prior_cats = list(prop_agree[
                    (prop_agree['material']==mat) &
                    (prop_agree['var_feature1']==group)
                    ]['best_ans']
         )
    
    for cat in set(cats).difference(prior_cats):
      prop_agree.loc[prop_agree.shape[0]] = [mat, group, cat, 0.0] #add to bottom and fill zero

prop_agree[
           (prop_agree['material']=='alan_plane') &
           (prop_agree['var_feature1']=='GOAL-FAIL')
           ]

,material,var_feature1,best_ans,prop_agree
0,alan_plane,GOAL-FAIL,ap_neg_ans6,0.466667
1,alan_plane,GOAL-FAIL,ap_neg_ans5,0.2
2,alan_plane,GOAL-FAIL,ap_other,0.133333
3,alan_plane,GOAL-FAIL,ap_neg_ans3,0.1
4,alan_plane,GOAL-FAIL,ap_neg_ans1,0.066667
5,alan_plane,GOAL-FAIL,ap_neg_ans2,0.033333
510,alan_plane,GOAL-FAIL,ap_pos_ans1,0.0
511,alan_plane,GOAL-FAIL,ap_pos_ans5,0.0
512,alan_plane,GOAL-FAIL,ap_neg_ans4,0.0
513,alan_plane,GOAL-FAIL,ap_pos_ans4,0.0


In [13]:
prop_agree[
           (prop_agree['material']=='alan_plane') &
           (prop_agree['var_feature1']=='UNEXP')
           ]

,material,var_feature1,best_ans,prop_agree
6,alan_plane,UNEXP,ap_neg_ans1,0.28125
7,alan_plane,UNEXP,ap_neg_ans6,0.1875
8,alan_plane,UNEXP,ap_neg_ans2,0.125
9,alan_plane,UNEXP,ap_neg_ans3,0.125
10,alan_plane,UNEXP,ap_other,0.09375
11,alan_plane,UNEXP,ap_pos_ans2,0.09375
12,alan_plane,UNEXP,ap_neg_ans4,0.0625
13,alan_plane,UNEXP,ap_neg_ans5,0.03125
494,alan_plane,UNEXP,ap_pos_ans1,0.0
495,alan_plane,UNEXP,ap_pos_ans5,0.0


In [21]:
x = prop_agree[
               (prop_agree['material']=='alan_plane') &
               (prop_agree['var_feature1']=='UNEXP')
               ].set_index('best_ans').sort_index(axis=0)
y = prop_agree[
               (prop_agree['material']=='alan_plane') &
               (prop_agree['var_feature1']=='GOAL-FAIL')
               ].set_index('best_ans').sort_index(axis=0)

print(x)
print()
print(y)
print()
print(ttest_rel(x['prop_agree'],y['prop_agree']))
print(spearmanr(x['prop_agree'],y['prop_agree']))

               material var_feature1  prop_agree
best_ans                                        
ap_neg_ans1  alan_plane        UNEXP     0.28125
ap_neg_ans2  alan_plane        UNEXP     0.12500
ap_neg_ans3  alan_plane        UNEXP     0.12500
ap_neg_ans4  alan_plane        UNEXP     0.06250
ap_neg_ans5  alan_plane        UNEXP     0.03125
ap_neg_ans6  alan_plane        UNEXP     0.18750
ap_other     alan_plane        UNEXP     0.09375
ap_pos_ans1  alan_plane        UNEXP     0.00000
ap_pos_ans2  alan_plane        UNEXP     0.09375
ap_pos_ans3  alan_plane        UNEXP     0.00000
ap_pos_ans4  alan_plane        UNEXP     0.00000
ap_pos_ans5  alan_plane        UNEXP     0.00000

               material var_feature1  prop_agree
best_ans                                        
ap_neg_ans1  alan_plane    GOAL-FAIL    0.066667
ap_neg_ans2  alan_plane    GOAL-FAIL    0.033333
ap_neg_ans3  alan_plane    GOAL-FAIL    0.100000
ap_neg_ans4  alan_plane    GOAL-FAIL    0.000000
ap_neg_ans5  alan_p

In [20]:
for mat in materials:
  unexp = prop_agree[
                  (prop_agree['material']==mat) &
                  (prop_agree['var_feature1']=='UNEXP')
                  ].set_index('best_ans').sort_index(axis=0)
  fail = prop_agree[
                  (prop_agree['material']==mat) &
                  (prop_agree['var_feature1']=='GOAL-FAIL')
                  ].set_index('best_ans').sort_index(axis=0)
  good = prop_agree[
                  (prop_agree['material']==mat) &
                  (prop_agree['var_feature1']=='UNEXP-GOOD')
                  ].set_index('best_ans').sort_index(axis=0)
  bad = prop_agree[
                  (prop_agree['material']==mat) &
                  (prop_agree['var_feature1']=='UNEXP-BAD')
                  ].set_index('best_ans').sort_index(axis=0)

  print(mat)
  # print(ttest_rel(unexp['prop_agree'],fail['prop_agree']))
  # print(ttest_rel(unexp['prop_agree'],good['prop_agree']))
  # print(ttest_rel(unexp['prop_agree'],bad['prop_agree']))
  # print(ttest_rel(fail['prop_agree'],good['prop_agree']))
  # print(ttest_rel(fail['prop_agree'],bad['prop_agree']))
  # print(ttest_rel(good['prop_agree'],bad['prop_agree']))
  print("Unexp vs Goal-Fail", spearmanr(unexp['prop_agree'],fail['prop_agree']))
  print("Unexp vs Good", spearmanr(unexp['prop_agree'],good['prop_agree']))
  print("Unexp vs Bad", spearmanr(unexp['prop_agree'],bad['prop_agree']))
  print("Goal-Fail vs Good", spearmanr(fail['prop_agree'],good['prop_agree']))
  print("Goal-Fail vs Bad", spearmanr(fail['prop_agree'],bad['prop_agree']))
  print("Good vs Bad", spearmanr(good['prop_agree'],bad['prop_agree']))

alan_plane
Unexp vs Goal-Fail SpearmanrResult(correlation=0.6501480153866405, pvalue=0.022089259132923315)
Unexp vs Good SpearmanrResult(correlation=-0.4259726744636505, pvalue=0.1673628564835292)
Unexp vs Bad SpearmanrResult(correlation=0.7006225289633738, pvalue=0.011154036734261)
Goal-Fail vs Good SpearmanrResult(correlation=-0.3753998158199344, pvalue=0.2291735276891534)
Goal-Fail vs Bad SpearmanrResult(correlation=0.8592363042645625, pvalue=0.0003419769825323954)
Good vs Bad SpearmanrResult(correlation=-0.4213808354105098, pvalue=0.1724843724047799)
anna_interview
Unexp vs Goal-Fail SpearmanrResult(correlation=0.5147262517891064, pvalue=0.12792200091336275)
Unexp vs Good SpearmanrResult(correlation=-0.32258735880781825, pvalue=0.36330158960547193)
Unexp vs Bad SpearmanrResult(correlation=0.6916752809520368, pvalue=0.026705693572547467)
Goal-Fail vs Good SpearmanrResult(correlation=-0.9049618105525379, pvalue=0.0003178046212511225)
Goal-Fail vs Bad SpearmanrResult(correlation=0.735

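Pearson's correlation can be computed here, but it is not meaningful. There is _some_ relationship between the two conditions, but I cannot correlate them directly because many of the answer categories do not overlap: a category used in one condition often has a zero proportion in the other.

For those categories I am comparing variance in one condition with no variance in the other, so a linear correlation cannot capture the relationship.

The obvious solution is to rank the categories.
Spearman's rank correlation gives us useful information.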
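
To make the ranking point concrete, the sketch below recomputes the alan_plane UNEXP vs GOAL-FAIL comparison from hard-coded proportions copied from this notebook's outputs (rounded as printed), alongside Pearson's r for contrast. The variable names are only illustrative.

```python
from scipy.stats import pearsonr, spearmanr

# Proportion-of-agreement per answer category for alan_plane, copied from the
# notebook outputs (order: ap_neg_ans1..6, ap_other, ap_pos_ans1..5).
unexp = [0.28125, 0.125, 0.125, 0.0625, 0.03125, 0.1875,
         0.09375, 0.0, 0.09375, 0.0, 0.0, 0.0]
goal_fail = [0.066667, 0.033333, 0.1, 0.0, 0.2, 0.466667,
             0.133333, 0.0, 0.0, 0.0, 0.0, 0.0]

# Spearman works on ranks; the zero-filled categories become tied lowest ranks.
rho, rho_p = spearmanr(unexp, goal_fail)

# Pearson works on the raw proportions and is shown only for comparison.
r, r_p = pearsonr(unexp, goal_fail)

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Pearson r    = {r:.3f} (p = {r_p:.3f})")
```

Note that the many tied zeros still carry a lot of weight in the rank correlation, so it may be worth checking how the result changes when categories unused in both conditions are dropped.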



# Subsumption of Answers
This may be a bit like Foster diversity: the extent to which one condition's answers are subsumed in another's. That is, the extent to which B's answers are a subset of A's answers, where A and B are the sets of answers given by X people to Y materials.

## Intersection
The first measure is the intersection of answers (using the category labels for answers) between A and B, scaled by the number of answers in B (or perhaps by N = (|A| + |B|)/2). This starts to look like the Jaccard index and related set-overlap measures.
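
The two scalings mentioned above can be sketched as follows, next to the Jaccard index for comparison. The function names are hypothetical; the two category sets are the alan_plane UNEXP and GOAL-FAIL categories as they appear in this notebook's outputs. Scaling by the mean set size is equivalent to the Sørensen-Dice coefficient.

```python
def subsumption(a, b):
    """Share of B's answer categories that also occur in A: |A & B| / |B|."""
    if not b:
        raise ValueError("B must contain at least one answer category")
    return len(a & b) / len(b)


def mean_scaled_overlap(a, b):
    """Intersection scaled by the mean set size: |A & B| / ((|A| + |B|) / 2)."""
    return len(a & b) / ((len(a) + len(b)) / 2)


def jaccard(a, b):
    """Intersection over union, for comparison."""
    return len(a & b) / len(a | b)


# Answer categories used in each condition for alan_plane.
unexp = {"ap_neg_ans1", "ap_neg_ans2", "ap_neg_ans3", "ap_neg_ans4",
         "ap_neg_ans5", "ap_neg_ans6", "ap_other", "ap_pos_ans2"}
goal_fail = {"ap_neg_ans1", "ap_neg_ans2", "ap_neg_ans3",
             "ap_neg_ans5", "ap_neg_ans6", "ap_other"}

print("GOAL-FAIL subsumed in UNEXP:", subsumption(unexp, goal_fail))
print("UNEXP subsumed in GOAL-FAIL:", subsumption(goal_fail, unexp))
print("mean-scaled overlap:", mean_scaled_overlap(unexp, goal_fail))
print("Jaccard:", jaccard(unexp, goal_fail))
```

Unlike Jaccard, the subsumption score is asymmetric, which is the point of the measure: every GOAL-FAIL category also appears in UNEXP, but not the other way round.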

In [23]:
ans_cats_per_group = expt1.groupby(['material','var_feature1'])['best_ans'].unique().reset_index()
ans_cats_per_group.head(5)

,material,var_feature1,best_ans
0,alan_plane,GOAL-FAIL,"[ap_neg_ans6, ap_neg_ans3, ap_neg_ans5, ap_neg..."
1,alan_plane,UNEXP,"[ap_neg_ans3, ap_neg_ans6, ap_neg_ans2, ap_neg..."
2,alan_plane,UNEXP-BAD,"[ap_neg_ans3, ap_neg_ans1, ap_neg_ans6, ap_neg..."
3,alan_plane,UNEXP-GOOD,"[ap_pos_ans2, ap_pos_ans5, ap_pos_ans4, ap_oth..."
4,anna_interview,GOAL-FAIL,"[ai_neg_ans1, ai_neg_ans2, ai_neg_ans3, ai_neg..."


In [26]:
unexp_alan_plane = set(ans_cats_per_group[
                   (ans_cats_per_group['material']=='alan_plane') &
                   (ans_cats_per_group['var_feature1']=='UNEXP')
                   ]['best_ans'].to_list()[0])

goal_fail_alan_plane = set(ans_cats_per_group[
                   (ans_cats_per_group['material']=='alan_plane') &
                   (ans_cats_per_group['var_feature1']=='GOAL-FAIL')
                   ]['best_ans'].to_list()[0])

print('union', unexp_alan_plane.union(goal_fail_alan_plane))
print('intersection', unexp_alan_plane.intersection(goal_fail_alan_plane))
# print(goal_fail_alan_plane.difference(unexp_alan_plane))

union {'ap_neg_ans3', 'ap_neg_ans4', 'ap_other', 'ap_neg_ans6', 'ap_neg_ans5', 'ap_pos_ans2', 'ap_neg_ans1', 'ap_neg_ans2'}
intersection {'ap_neg_ans3', 'ap_other', 'ap_neg_ans6', 'ap_neg_ans5', 'ap_neg_ans1', 'ap_neg_ans2'}


In [27]:
union = len(unexp_alan_plane.union(goal_fail_alan_plane))
intersection = len(unexp_alan_plane.intersection(goal_fail_alan_plane))

intersection/union

0.75

In [34]:
jaccards = {}

for mat in materials:
  unexp = set(ans_cats_per_group[
                   (ans_cats_per_group['material']==mat) &
                   (ans_cats_per_group['var_feature1']=='UNEXP')
                   ]['best_ans'].to_list()[0])

  goal_fail = set(ans_cats_per_group[
                   (ans_cats_per_group['material']==mat) &
                   (ans_cats_per_group['var_feature1']=='GOAL-FAIL')
                   ]['best_ans'].to_list()[0]) 
  
  good = set(ans_cats_per_group[
                   (ans_cats_per_group['material']==mat) &
                   (ans_cats_per_group['var_feature1']=='UNEXP-GOOD')
                   ]['best_ans'].to_list()[0])
  
  bad = set(ans_cats_per_group[
                   (ans_cats_per_group['material']==mat) &
                   (ans_cats_per_group['var_feature1']=='UNEXP-BAD')
                   ]['best_ans'].to_list()[0])
  
  mat_dict = {}

  for pair in [[unexp,goal_fail,'UNEXPvGOAL-FAIL'],[unexp,good,'UNEXPvUNEXP-GOOD'],
               [unexp,bad,'UNEXPvUNEXP-BAD'],[goal_fail,good,'GOAL-FAILvUNEXP-GOOD'],
               [goal_fail,bad,'GOAL-FAILvUNEXP-BAD'],[good,bad,'UNEXP-GOODvUNEXP-BAD']]:
    intersection = len(pair[0].intersection(pair[1]))
    union = len(pair[0].union(pair[1]))
    mat_dict[pair[2]] = intersection/union
  jaccards[mat] = mat_dict


In [35]:
jaccards

{'alan_plane': {'GOAL-FAILvUNEXP-BAD': 0.8333333333333334,
  'GOAL-FAILvUNEXP-GOOD': 0.18181818181818182,
  'UNEXP-GOODvUNEXP-BAD': 0.2,
  'UNEXPvGOAL-FAIL': 0.75,
  'UNEXPvUNEXP-BAD': 0.625,
  'UNEXPvUNEXP-GOOD': 0.25},
 'anna_interview': {'GOAL-FAILvUNEXP-BAD': 0.6666666666666666,
  'GOAL-FAILvUNEXP-GOOD': 0.2,
  'UNEXP-GOODvUNEXP-BAD': 0.1111111111111111,
  'UNEXPvGOAL-FAIL': 0.625,
  'UNEXPvUNEXP-BAD': 0.5714285714285714,
  'UNEXPvUNEXP-GOOD': 0.3},
 'belinda_meeting': {'GOAL-FAILvUNEXP-BAD': 0.6666666666666666,
  'GOAL-FAILvUNEXP-GOOD': 0.3333333333333333,
  'UNEXP-GOODvUNEXP-BAD': 0.18181818181818182,
  'UNEXPvGOAL-FAIL': 0.6,
  'UNEXPvUNEXP-BAD': 0.7,
  'UNEXPvUNEXP-GOOD': 0.4},
 'bill_holiday': {'GOAL-FAILvUNEXP-BAD': 1.0,
  'GOAL-FAILvUNEXP-GOOD': 0.2,
  'UNEXP-GOODvUNEXP-BAD': 0.2,
  'UNEXPvGOAL-FAIL': 0.6,
  'UNEXPvUNEXP-BAD': 0.6,
  'UNEXPvUNEXP-GOOD': 0.2727272727272727},
 'bob_job': {'GOAL-FAILvUNEXP-BAD': 0.42857142857142855,
  'GOAL-FAILvUNEXP-GOOD': 0.3333333333333333,


## Unique
The second measure is the number of answer categories in B that are not in A.

Another way to do this is simply to count unique answers (i.e. answer categories), which your graphs were more or less doing. So: how many common answer categories are used (out of the total set of answer categories for a material, aggregated or summed over all answer categories), and how many unique answer categories are used in B but not in A.
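
A minimal sketch of these counts using plain set operations. The category sets are the alan_plane UNEXP and GOAL-FAIL categories from this notebook's outputs; the helper name `unique_to` is hypothetical.

```python
def unique_to(b, a):
    """Answer categories used in B but not in A, sorted for stable output."""
    return sorted(b - a)


# The full set of 12 answer categories defined for alan_plane.
all_categories = (
    {f"ap_neg_ans{i}" for i in range(1, 7)}
    | {f"ap_pos_ans{i}" for i in range(1, 6)}
    | {"ap_other"}
)

unexp = {"ap_neg_ans1", "ap_neg_ans2", "ap_neg_ans3", "ap_neg_ans4",
         "ap_neg_ans5", "ap_neg_ans6", "ap_other", "ap_pos_ans2"}
goal_fail = {"ap_neg_ans1", "ap_neg_ans2", "ap_neg_ans3",
             "ap_neg_ans5", "ap_neg_ans6", "ap_other"}

common = unexp & goal_fail
share_common = len(common) / len(all_categories)

print("common categories:", len(common), "of", len(all_categories))
print("only in UNEXP:", unique_to(unexp, goal_fail))
print("only in GOAL-FAIL:", unique_to(goal_fail, unexp))
```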


## TF_IDF / NMF

## Matrix block modelling

In [15]:
#build a condition-by-answer-category matrix (per material), tf-idf style,
#with each cell holding the proportion of responses
prop_agree_dfs={}
for mat in materials:
  prop_agree_dfs[mat] = prop_agree[prop_agree['material']==mat].pivot(
      index='var_feature1',
      columns='best_ans',
      values='prop_agree'
      )

prop_agree_dfs['alan_plane']

var_feature1,ap_neg_ans1,ap_neg_ans2,ap_neg_ans3,ap_neg_ans4,ap_neg_ans5,ap_neg_ans6,ap_other,ap_pos_ans1,ap_pos_ans2,ap_pos_ans3,ap_pos_ans4,ap_pos_ans5
GOAL-FAIL,0.066667,0.033333,0.1,0.0,0.2,0.466667,0.133333,0.0,0.0,0.0,0.0,0.0
UNEXP,0.28125,0.125,0.125,0.0625,0.03125,0.1875,0.09375,0.0,0.09375,0.0,0.0,0.0
UNEXP-BAD,0.4375,0.0,0.125,0.0,0.125,0.25,0.0625,0.0,0.0,0.0,0.0,0.0
UNEXP-GOOD,0.0,0.0,0.0,0.0,0.0,0.030303,0.212121,0.060606,0.272727,0.090909,0.151515,0.181818


## Mothilal-Diversity
If we turn each answer into a set of word features (noting synonyms, and excluding stopwords aggressively), then we could do all non-identical pairwise comparisons between the answers in a particular condition (basically ignoring answer categories). Then, for each condition, we get a measure of how much all those answers differ from each other. We can describe this as a mean for each condition, but we could also look at the spread of the metric (or even break it down in some way by material-answer-category).

For material-answer-category, you could get an average of an average, though that seems less useful than just comparing all answers to one another within the condition. At the material level, I guess you could show that there was very little diversity between the answers for a particular answer category. Alternatively, you could group all the feature words within a category (without repetition) and then do pairwise comparisons between all answer categories for a material in a given condition; these could then be compared at the material level across conditions.
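
One possible reading of the pairwise idea above, sketched with Jaccard distance between word-feature sets. The distance function, the tiny stopword list and the three example answers are all assumptions made for illustration; synonym handling is left out entirely.

```python
from itertools import combinations
from statistics import mean

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "his", "her", "he", "she", "was", "to", "of"}


def features(answer):
    """Lower-cased word features of an answer, minus stopwords."""
    tokens = (tok.strip(".,;:!?\"'").lower() for tok in answer.split())
    return {tok for tok in tokens if tok and tok not in STOPWORDS}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def pairwise_distances(answers):
    """Distances for every pair of distinct answers in one condition."""
    sets = [features(a) for a in answers]
    return [jaccard_distance(a, b) for a, b in combinations(sets, 2)]


# Made-up answers for one condition.
answers = ["He missed the flight.", "He missed his flight.", "The plane was delayed."]
distances = pairwise_distances(answers)
diversity = mean(distances)
print(distances, diversity)
```

Keeping the full list of distances, and not only the mean, makes it possible to look at the spread per condition as suggested above.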

## Smyth-diversity
The fraction of unique features that exist as difference features in the set of all valid counterfactuals produced; thus, a feature diversity of 0.1 means that 10% of features participated (as difference features) in the counterfactuals generated. This does not seem to be directly applicable. A variant may be possible, but the more I think about it, the more it starts to sound like a version of our mutability measure. Not an option.


# Mutability
This tries to measure the degree of mutational change between the original scenario and the answer people give (NB: not unrelated to diversity). It may rely on some form of the tf-idf/word2vec similarity measure we have developed already, though it may need fine-tuning or indeed a significant change (see below; we can discuss).

Recall that we discussed the need to partition the words in the answer into (a) those originally in the scenario, (b) those that are synonymous with words in the original scenario (which could be established by a word2vec comparison with some threshold) and (c) those that are wholly new. NB: the latter will bear some predictive closeness to the words in the original scenario, so you could measure the extent of each term's inferability rather than its similarity. This is like a coherence score, and should be the best measure of mutability. Talk to me about this… if it does not show some change in bizarreness then something is wrong. But it is not a similarity score on a vector; it is more like the old LSA document-to-word comparison or some such thing. It must be possible in the word2vec universe.

So, you can get the mean w2v coherence for each answer and then look at these overall, across all materials in all conditions. There appear to be three levels of averaging: (i) each word in an answer has this coherence score, which you could average (or sum and normalise); (ii) each overall answer score (a mean) can be averaged for the material in a condition; and/or (iii) averaged independently for the whole condition. Again, we need to discuss. I think how we aggregate these scores will be important.

Found this, which may be helpful, but I have not looked at it in detail:
Your problem can be solved with Word2vec as well as Doc2vec. Doc2vec would give better results because it takes sentences into account while training the model.
## Doc2vec solution
You can train your doc2vec model following this link. You may want to perform some pre-processing steps like removing all stop words (words like "the", "an", etc. that don't add much meaning to the sentence). Once you train your model, you can find the similar sentences using the following code.
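import gensim
# Load a previously trained Doc2Vec model (the path is a placeholder).
model = gensim.models.Doc2Vec.load('saved_doc2vec_model')
# Tokenise the new sentence the same way as the training documents.
new_sentence = "I opened a new mailbox".split(" ")
# In gensim >= 4.0 the document vectors live in `model.dv` (formerly `model.docvecs`).
model.dv.most_similar(positive=[model.infer_vector(new_sentence)], topn=5)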
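
Returning to the partition into (a), (b) and (c) described at the start of the Mutability section, here is a rough sketch of how it might look. Everything in it is an assumption: the 2-d vectors are made up and stand in for real word2vec embeddings, the 0.9 threshold is arbitrary, and the "coherence" of a new word is simply taken to be its best cosine similarity to any scenario word, which is only a placeholder for the inferability score discussed above.

```python
import math

# Made-up 2-d vectors standing in for real word2vec embeddings.
TOY_VECTORS = {
    "plane": (1.0, 0.0),
    "flight": (0.95, 0.31),
    "airport": (0.8, 0.6),
    "dog": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def partition_answer(scenario_words, answer_words, vectors, threshold=0.9):
    """Split answer words into original / synonymous / new (with best similarity)."""
    scenario = set(scenario_words)
    parts = {"original": [], "synonymous": [], "new": {}}
    for word in answer_words:
        if word in scenario:
            parts["original"].append(word)
            continue
        # Best similarity to any scenario word; 0.0 if a vector is missing.
        sims = [
            cosine(vectors[word], vectors[s])
            for s in scenario
            if word in vectors and s in vectors
        ]
        best = max(sims, default=0.0)
        if best >= threshold:
            parts["synonymous"].append(word)
        else:
            parts["new"][word] = best
    return parts


parts = partition_answer(["plane"], ["plane", "flight", "airport", "dog"], TOY_VECTORS)
new_scores = list(parts["new"].values())
# Level (i) of averaging: mean score over the new words in one answer.
coherence = sum(new_scores) / len(new_scores) if new_scores else float("nan")
print(parts)
print(coherence)
```

The per-answer `coherence` values could then be averaged per material within a condition, or across the whole condition, matching levels (ii) and (iii).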


## Transformer Model Solution
As I see it, you need a metric for how predictable each word in the answer (excluding stop words) is from the words in the original scenario. This gives you the distance of the answer (as a mutation) from the scenario. It could still be done with similarity, but I prefer the idea of singling out the non-scenario words and seeing how far each is, as a prediction, from the scenario. That seems much more sensitive to me: when you do similarity, you mix the repeated elements back in, and that could make the answer seem closer than it actually is, if you see what I mean.
That is doable with BERT or GPT…
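
One way to write this down, as a sketch and not a settled definition: let $S$ be the set of words in the scenario, $A$ the content words of the answer (stop words removed), and $N$ the non-scenario words of the answer. The symbols, and the choice of mean surprisal as the aggregate, are assumptions; $p(w \mid S)$ stands for whatever probability a language model such as BERT or GPT assigns to word $w$ given the scenario text.

$$
N = \{\, w \in A : w \notin S \,\}, \qquad
\mathrm{mutability}(A \mid S) = -\frac{1}{|N|} \sum_{w \in N} \log p(w \mid S)
$$

Under this reading, an answer whose new words are all highly predictable from the scenario scores close to zero, and the score grows as the new words become less predictable. Because words already in the scenario are excluded from $N$, repeated elements cannot pull the score down.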
