## Experiment 01
### Richardson et. al (2002) Experiment 01 with Language Models instead of Humans:

The subjects were presented with a single page,
containing a list of the verbs and four pictures, labelled A to
D. Each one contained a circle and a square aligned along a
vertical or horizontal axis, connected by an arrow pointing
up, down, left or right. Since we didn't expect any
interesting item variation between left or right placement of
the circle or square, the horizontal schemas differed only in
the direction of the arrow.
For each sentence, subjects were asked to select one of
the four sparse images that best depicted the event described
by the sentence (Figure 1)
The items were randomised in three different orders, and
crossed with two different orderings of the images. The six
lists were then distributed randomly to subjects.

### Step 01: Creating a prompt that is as close to the paper as possible

In [75]:
import random
random.seed(1337)


def convert_to_float(value):
    try:
        return float(value)
    except ValueError:
        return value

with open("../../data/richardson_actions.txt", "r") as d_in:
    lines = [line.split() for line in d_in.readlines()]

output = []
for entry in lines:
    new_entry = [convert_to_float(item) for item in entry]
    if isinstance(new_entry[1],str):
        new_entry[0] = " ".join(new_entry[:2])
        del new_entry[1]
    output.append(new_entry)

richardson_data = dict()
for elem in output:
    richardson_data[elem[0]] = [i for i in elem[1:]]

richardson_data[0] = ["C:◯↑▢", "D:◯↓▢","B:◯←▢ ","A:◯→▢"]

# Randomizing Richardson's data
action_words = list(richardson_data.keys())
del action_words[0]
random.shuffle(action_words)

#count=1
#for line in action_words:
#    if line != 0:
#        print(str(count)+". ◯ "+line+" ▢ \\")
#        count+=1

{'fled': [7.2, 4.2, 80.8, 7.8], 'pointed at': [7.2, 3.6, 0.0, 89.2], 'pulled': [6.0, 5.4, 75.4, 13.2], 'pushed': [7.2, 3.6, 1.2, 8.0, 8.0], 'walked': [9.0, 3.6, 2.0, 4.0, 62.9], 'hunted': [9.6, 20.4, 1.8, 68.3], 'impacted': [7.2, 37.1, 3.0, 52.7], 'perched': [1.0, 2.0, 7.0, 6.0, 6.6, 5.4], 'showed': [1.0, 5.0, 9.0, 10.2, 65.9], 'smashed': [3.6, 66.5, 1.2, 28.7], 'bombed': [4.8, 86.8, 1.8, 6.6], 'flew': [37.7, 44.3, 15.0, 3.0], 'floated': [32.9, 56.3, 7.8, 3.0], 'lifted': [87.4, 9.6, 2.4, 0.6], 'sank': [22.2, 71.9, 4.2, 1.8], 'argued with': [11.4, 13.8, 12.6, 62.3], 'gave to': [8.4, 9.6, 1.2, 80.8], 'offended': [9.0, 31.7, 24.6, 34.7], 'rushed': [10.2, 10.8, 23.4, 55.1], 'warned': [10.8, 22.2, 6.0, 61.1], 'owned': [5.4, 55.7, 18.6, 20.4], 'regretted': [19.8, 24.0, 41.3, 1.0, 5.0], 'rested': [14.4, 36.5, 40.1, 9.0], 'tempted': [16.8, 11.4, 45.5, 26.3], 'wanted': [15.6, 7.8, 15.6, 61.1], 'hoped': [45.5, 15.6, 7.2, 31.7], 'increased': [73.7, 7.2, 9.6, 9.0], 'obeyed': [22.8, 4.2, 64.7, 8.4]

In [76]:
paper_prompt = "You are asked to select one of the four images that best depicts the event described by the following sentences. \
Image A: \
◯→▢ \
\
Image B: \
◯←▢ \
 \
Image C: \
◯ \
↑ \
▢ \
\
Image D: \
◯ \
↓ \
▢ \
\
Sentences: \
1. ◯ hunted ▢ \
2. ◯ showed ▢ \
3. ◯ succeeded ▢ \
4. ◯ pointed at ▢ \
5. ◯ pushed ▢ \
6. ◯ hoped ▢ \
7. ◯ walked ▢ \
8. ◯ gave to ▢ \
9. ◯ respected ▢ \
10. ◯ smashed ▢ \
11. ◯ argued with ▢ \
12. ◯ sank ▢ \
13. ◯ rested ▢ \
14. ◯ offended ▢ \
15. ◯ pulled ▢ \
16. ◯ perched ▢ \
17. ◯ regretted ▢ \
18. ◯ bombed ▢ \
19. ◯ obeyed ▢ \
20. ◯ lifted ▢ \
21. ◯ flew ▢ \
22. ◯ impacted ▢ \
23. ◯ wanted ▢ \
24. ◯ increased ▢ \
25. ◯ warned ▢ \
26. ◯ floated ▢ \
27. ◯ tempted ▢ \
28. ◯ rushed ▢ \
29. ◯ owned ▢ \
\
For sentence 1) choosing from [A,B,C,D] best image is "


### Step 02: Test model with this `paper prompt`

#### Function to evaluate against Richardson's findings

In [77]:
import torch
from transformers import GPT2Tokenizer, OPTForCausalLM, AutoTokenizer, AutoModelForCausalLM
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast, GPTNeoForCausalLM, GPT2Tokenizer
import torch.nn.functional as F
import pandas as pd
import numpy as np
import seaborn as sns
from tqdm import tqdm
import json
from scipy import stats
from collections import Counter
import subprocess

with open("../../hf.key", "r") as f_in:
    hf_key = f_in.readline().strip()
subprocess.run(["huggingface-cli", "login", "--token", hf_key])

server_model_path = "/mounts/data/corp/huggingface/"
# stored using: git clone git@hf.co:<MODEL ID>

model_prefix = "EleutherAI/gpt-neox"
model_size = "20b"

model_prefix = "facebook/opt"
model_size = "13b"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /mounts/data/corp/huggingface/token
Login successful


In [4]:
#tokenizer = GPTNeoXTokenizerFast.from_pretrained(model_prefix+"-"+model_size, device_map="auto")
#model = GPTNeoXForCausalLM.from_pretrained(model_prefix+"-"+model_size, device_map="auto")

#tokenizer = GPT2Tokenizer.from_pretrained(model_prefix+"-"+model_size, device_map="auto")
#model = OPTForCausalLM.from_pretrained(model_prefix+"-"+model_size, device_map="auto")

#tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf", use_auth_token=True, device_map="auto")
#model = AutoModelForCausalLM.from_pretrained(server_model_path+"llama/llama-13b", use_auth_token=True, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf", use_auth_token=True, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(server_model_path+"meta-llama/Llama-2-70b-hf", use_auth_token=True, device_map="auto")

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

In [78]:
# Step 1: Tokenize the prompt
input_ids = tokenizer.encode(paper_prompt, return_tensors="pt")

# Step 2: Generate the model input
max_length = input_ids.size(1)  + 300 # Adjust '20' as needed to control the maximum length of the generated answer.
output = model.generate(input_ids, max_length=max_length, num_return_sequences=1)

# Step 3: Decode the generated output to get the answer
generated_answer = tokenizer.decode(output[0], skip_special_tokens=True)




In [79]:
print("Generated answer:\n\n", generated_answer[len(paper_prompt):])

Generated answer:

 ◯→▢. For sentence 2) choosing from [A,B,C,D] best image is ◯←▢. For sentence 3) choosing from [A,B,C,D] best image is ◯←▢. For sentence 4) choosing from [A,B,C,D] best image is ◯←▢. For sentence 5) choosing from [A,B,C,D] best image is ◯←▢. For sentence 6) choosing from [A,B,C,D] best image is ◯←▢. For sentence 7) choosing from [A,B,C,D] best image is ◯←▢. For sentence 8) choosing from [A,B,C,D] best image is ◯←▢. For sentence 9) choosing from [A,B,C,D] best image is ◯←▢. For sentence 10) choosing from [A,B,C,D] best image is ◯←▢. For sentence 11) choosing from [A,B,C,D] best image is ◯←▢. For sentence 12) choosing from [A


### Step 03: Prompting that is maximally friendly for the LLM

In [80]:
for action_word in action_words:
    
    print("ACTION: "+action_word)

    friendly_prompt = "Select the image that best represents the event described by the sentence: "+action_word+"\n[◯→▢]\n\n[◯←▢]\n\n[◯\n↑\n▢]\n\n[◯\n↓\n▢]\n\nThe best representation is [◯"
    
    input_ids = tokenizer.encode(friendly_prompt, return_tensors="pt")
    max_length = input_ids.size(1)  + 20
    output = model.generate(input_ids, max_length=max_length, num_return_sequences=1)
    generated_answer = tokenizer.decode(output[0], skip_special_tokens=True)  
    print("Generated answer:\n\n", generated_answer[len(friendly_prompt):])

prompt_B = "Select the image that best represents the event described by the sentence: "+action_word+"\n[◯→▢]\n\n[◯←▢]\n\n[◯\n↑\n▢]\n\n[◯\n↓\n▢]\n\nThe best representation is [◯"
#prompt_C = "Choose the best image for the word:\nUP: ↑ \nDOWN: ↓ \nLEFT: → \nRIGHT: ← \n"+action_word.upper()+": "
#prompt_D = "Choosing from UP, DOWN, LEFT and RIGHT, the word \'"+action_word.upper()+"\' is best respresented by the word "
#prompt_B = "Of these four: A: ◯→▢ B: ◯←▢ C: ◯ ↑ ▢ D: ◯ ↓ ▢ Which one best describes \"◯ "+action_word+" ▢\" ?"

ACTION: hunted
Generated answer:

 →▢].

### 2.

Select the image that best represents
ACTION: showed
Generated answer:

 ←▢].

### 2.

Select the image that best represents
ACTION: succeeded
Generated answer:

 →▢].

### 2.

Select the image that best represents
ACTION: pointed at
Generated answer:

 ←▢].

Select the image that best represents the event described by the sentence:
ACTION: pushed
Generated answer:

 ←▢].

Select the image that best represents the event described by the sentence:
ACTION: hoped
Generated answer:

 
↓
▢].

Select the image that best represents the event described by the
ACTION: walked
Generated answer:

 →▢].

### 2.

Select the image that best represents
ACTION: gave to
Generated answer:

 ←▢].

Select the image that best represents the event described by the sentence:
ACTION: respected
Generated answer:

 
↓
▢].

Select the image that best represents the event described by the
ACTION: smashed
Generated answer:

 →▢].

Select the image that best represent

KeyboardInterrupt: 

In [81]:
for action in action_words:
    print(action)

hunted
showed
succeeded
pointed at
pushed
hoped
walked
gave to
respected
smashed
argued with
sank
rested
offended
pulled
0
perched
regretted
bombed
obeyed
lifted
flew
impacted
wanted
increased
warned
floated
tempted
rushed
owned


### Step 03: Test pipline with logprobs 

In [82]:
from sidemethods import logprobs_from_prompt, proc, proc_lower, prob_of_ending, calculate_accuracy, calculate_accuracies, store_accuracies

In [83]:

start = friendly_prompt
#start = prompt_B
#start = prompt_D

#answers = {0:"A ◯→▢", 1:"B ◯←▢", 2:"C ◯\n↑\n▢", 3:"D ◯\n↓\n▢"}
answers = {0:"→▢]", 1:"←▢]", 2:"\n↑\n▢]", 3:"\n↓\n▢]"}
#answers = {0:"UP", 1:"DOWN", 2:"LEFT", 3:"RIGHT"}

#start = "nazis are known to be on the political "
#answers = {0:"UP", 1:"DOWN", 2:"LEFT", 3:"RIGHT"}


res_ends = []
for j, end in answers.items():
    input_prompt = proc(start) + ' ' + proc(end)
    logprobs = logprobs_from_prompt(input_prompt, tokenizer, model)
    res = {"tokens": [x for x,y in logprobs],"token_logprobs": [y for x,y in logprobs]}
    res_ends.append(res)

In [84]:
choosen_answer = (-9999, "")
for i, answer in answers.items():
    choice_val = prob_of_ending(res_ends[i]['token_logprobs'], res_ends[0]['tokens'])
    if choice_val > choosen_answer[0]:
        choosen_answer = choice_val, answer


print(start)
print("Choice: ", choosen_answer[1])
print()

#input_ids = tokenizer.encode(start, return_tensors="pt")
#input_ids = input_ids.to('cuda')
#output = model.generate(input_ids, max_length=max_length, num_return_sequences=1)
#generated_answer = tokenizer.decode(output[0], skip_special_tokens=True)
#print("Generated answer:", generated_answer)



Select the image that best represents the event described by the sentence: rested
[◯→▢]

[◯←▢]

[◯
↑
▢]

[◯
↓
▢]

The best representation is [◯
Choice:  
↓
▢]

