# Prospect Theory 2.0

This notebook continues the previous one. Here we investigate how the concrete amounts of gains and losses in the prompts influence the model's responses.

In the *Decoy Effect* notebook we saw that renaming the answer options does indeed affect the distribution of survey replies. We therefore stick to the principle of using uncommon letters instead of the typical survey design of A:, B:, C:. This lets us isolate the effect of changing the monetary values, free of possible biases such as the "A-Bias".

In all the scenarios described so far, both individuals end the day with exactly the same amount of money.
We therefore first look at how survey replies change if one individual is in fact better off financially.

Second, we look at how the magnitude of gains and losses affects the responses. For this, we simply scale every number in a given prompt by the same factor.

Since we previously established that, for practical purposes, neither the minimum temperature of 0 nor the maximum of 2 provides insightful results, we now focus only on the values [0.5, 1, 1.5]. Priming the models is also not considered here.

------------------

## Setup

In [26]:
from openai import OpenAI
import openai
import matplotlib.pyplot as plt
import os 
import numpy as np
import pandas as pd
from tqdm import tqdm

# Get API key (previously saved as an environment variable)
openai.api_key = os.environ["OPENAI_API_KEY"]

# Set client
client = OpenAI()

# Set global plot style
plt.style.use('seaborn-v0_8')

# Set plots to be displayed in notebook
%matplotlib inline

-----------------------------------

- Original prompts 

In [27]:
prompt_1 = """Mr. A was given tickets involving the World Series. He won 50$ in one lottery and $25 in the other. 
          Mr. B was given a ticket to a single, larger World Series lottery. He won $75. Based solely on this information, Who is happier? 
          A: Mister A
          B: Mister B
          C: No difference.         
          Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

In [28]:
prompt_2 = f"""Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $100. 
         He received a similar letter the same day from his state income tax authority saying he owed $50. There were no other repercussions from either mistake. 
         Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $150. There were no other repercussions from his mistake. 
         Based solely on this information, who was more upset? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

In [29]:
prompt_3 = f"""Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident, he damaged the rug in his apartment and had to pay the landlord $80.
         Mr. B bought his first New York State lottery ticket and won $20. Based solely on this information, who is happier? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

In [30]:
prompt_4 = f"""Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The same day the car was damaged, he won $25 in the office football pool.
         Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage. Based solely on this information, who is more upset?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

--------------------------------------

- Modifying the monetary values inside the prompts

The Prospect Theory value function explains why individuals tend to perceive, for example, several separate gains as more valuable than a single gain of the same total amount. Since Large Language Models are trained on human data, including for example customer reviews on sales platforms, they might reflect these patterns.
But how do LLMs react if, in the given scenarios, one individual is clearly better off financially than the other? And what if we use large, odd numbers rather than small, even ones?
Another key concept of Prospect Theory is diminishing sensitivity. Losing $50 out of a total of $1000 does not hurt as much as losing $50 when we initially had only $100, i.e. 50% of our total possessions.
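To make the value-function argument concrete, here is a minimal sketch of a power-form value function. The parameter values (curvature 0.88, loss aversion 2.25) are the estimates commonly cited from Tversky & Kahneman (1992); they are an assumption for illustration and are not used anywhere else in this notebook. The function name `value` is likewise hypothetical.

```python
# Illustrative parameters (assumption): commonly cited estimates from Tversky & Kahneman (1992)
ALPHA = 0.88   # curvature, responsible for diminishing sensitivity
LAMBDA = 2.25  # loss aversion, losses weigh more than gains

def value(x, alpha=ALPHA, lam=LAMBDA):
    """Perceived value of a gain (x >= 0) or a loss (x < 0) relative to the reference point."""
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** alpha)

# Scenario 1: two separate gains (A) vs. one combined gain (B)
segregated_gains = value(50) + value(25)
integrated_gain = value(75)

# Scenario 2: two separate losses (A) vs. one combined loss (B)
segregated_losses = value(-100) + value(-50)
integrated_loss = value(-150)

print(f"gains:  {segregated_gains:.2f} (separate) vs. {integrated_gain:.2f} (combined)")
print(f"losses: {segregated_losses:.2f} (separate) vs. {integrated_loss:.2f} (combined)")
```

Because the function is concave for gains and convex for losses, two separate gains add up to more perceived value than one combined gain, and two separate losses hurt more than one combined loss. This matches the direction of the majority answers reported for scenarios 1 and 2 in the original study.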

To investigate these two aspects, we adapted our original prompts as follows:

For every scenario (1-4) we created:
- 2 prompts in which A and B have the same amount of money, but the numbers are odd and larger than before (scaled by Pi * 100 and Pi * 42, respectively)
- 2 prompts in which A is better off (by $25 and by $50)
- 2 prompts in which B is better off (by $25 and by $50)

In the configurations in which one individual is better off, we did not always change the number in the same position of the prompt, but rather distributed the changes in gains/losses across the prompts.

In [31]:
# To make our results comparable to the original study, we compute original answer probabilities
p_scenario1 = [f"p(A): {round((56/(56+16+15)*100), 2)}%", f"p(B): {round((16/(56+16+15)*100), 2)}%", f"p(C): {round((15/(56+16+15)*100), 2)}%"]
p_scenario2 = [f"p(A): {round((66/(66+14+7)*100), 2)}%", f"p(B): {round((14/(66+14+7)*100), 2)}%", f"p(C): {round((7/(66+14+7)*100), 2)}%"]
p_scenario3 = [f"p(A): {round((22/(22+61+4)*100), 2)}%", f"p(B): {round((61/(22+61+4)*100), 2)}%", f"p(C): {round((4/(22+61+4)*100), 2)}%"]
p_scenario4 = [f"p(A): {round((19/(19+63+5)*100), 2)}%", f"p(B): {round((63/(19+63+5)*100), 2)}%", f"p(C): {round((5/(19+63+5)*100), 2)}%"]

# Setting up new monetary values to be used in our prompts, they do not really reflect prices, but will be named as such for simplicity
prices_1_og = np.array([50, 25, 75]) # A won 50+25, B won 75
prices_2_og = np.array([100, 50, 150]) # A lost 100+50, B lost 150
prices_3_og = np.array([100, 80, 20]) # A won 100, lost 80, B won 20
prices_4_og = np.array([200, 25, 175]) # A lost 200, won 25, B lost 175

# New, rather odd-numbered values, but sum for A&B is the same
prices_1_odd = np.round(prices_1_og.copy() * np.pi * 100, 2)
prices_2_odd = np.round(prices_2_og.copy() * np.pi * 100, 2)
prices_3_odd = np.round(prices_3_og.copy() * np.pi * 100, 2)
prices_4_odd = np.round(prices_4_og.copy() * np.pi * 100, 2)

prices_1_odd2 = np.round(prices_1_og.copy() * np.pi * 42, 2)
prices_2_odd2 = np.round(prices_2_og.copy() * np.pi * 42, 2)
prices_3_odd2 = np.round(prices_3_og.copy() * np.pi * 42, 2)
prices_4_odd2 = np.round(prices_4_og.copy() * np.pi * 42, 2)

# Prices such that one individual is better off (labeled as prices_(original scenario)_(who is better off and by how much))
# We do not always increase the first-mentioned gain or decrease the first-mentioned loss, but rather "distribute" the changes:
# within one scenario the same entry is changed in every configuration, while the changed entry differs across scenarios

# A always better off by 25$
prices_1_a25 = prices_1_og.copy() 
prices_1_a25[0] += 25
prices_2_a25 = prices_2_og.copy()
prices_2_a25[1] += -25
prices_3_a25 = prices_3_og.copy()
prices_3_a25[1] += -25
prices_4_a25 = prices_4_og.copy()
prices_4_a25[2] += +25

# A always better off by 50$
prices_1_a50 = prices_1_og.copy()
prices_1_a50[0] += 50
prices_2_a50 = prices_2_og.copy()
prices_2_a50[1] += -50
prices_3_a50 = prices_3_og.copy()
prices_3_a50[1] += -50
prices_4_a50 = prices_4_og.copy()
prices_4_a50[2] += +50

# B always better off by 25$
prices_1_b25 = prices_1_og.copy()
prices_1_b25[0] += -25
prices_2_b25 = prices_2_og.copy()
prices_2_b25[1] += +25
prices_3_b25 = prices_3_og.copy()
prices_3_b25[1] += 25
prices_4_b25 = prices_4_og.copy()
prices_4_b25[2] += -25

# B always better off by 50$
prices_1_b50 = prices_1_og.copy()
prices_1_b50[0] += -50
prices_2_b50 = prices_2_og.copy()
prices_2_b50[1] += +50
prices_3_b50 = prices_3_og.copy()
prices_3_b50[1] += 50
prices_4_b50 = prices_4_og.copy()
prices_4_b50[2] += -50

# Pretty sure there is an easier way, but this is at least robust and easily controllable
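The price vectors above can be sanity-checked by computing the net outcome of both individuals. The following is a minimal sketch in plain Python; `SIGNS` and `net_advantage_a` are hypothetical names introduced only for this check, and the sign convention (first two entries belong to Mr. A, the third to Mr. B) is read off the comments next to the original price vectors.

```python
import math

# Sign of each entry per scenario: +1 = gain, -1 = loss.
# The first two entries belong to Mr. A, the third entry to Mr. B.
SIGNS = {
    1: (+1, +1, +1),  # A won twice, B won once
    2: (-1, -1, -1),  # A owed twice, B owed once
    3: (+1, -1, +1),  # A won and lost, B won
    4: (-1, +1, -1),  # A lost and won, B lost
}

def net_advantage_a(scenario, prices):
    """Return A's net outcome minus B's net outcome (positive: A is better off)."""
    sign_a1, sign_a2, sign_b = SIGNS[scenario]
    net_a = sign_a1 * prices[0] + sign_a2 * prices[1]
    net_b = sign_b * prices[2]
    return net_a - net_b

originals = {1: [50, 25, 75], 2: [100, 50, 150], 3: [100, 80, 20], 4: [200, 25, 175]}

for scenario, og in originals.items():
    # Same scaling and rounding as for the "odd" price vectors
    scaled = [round(p * math.pi * 42, 2) for p in og]
    print(scenario, net_advantage_a(scenario, og), round(net_advantage_a(scenario, scaled), 2))
```

One thing this check can surface: because every entry of a scaled vector is rounded independently, the totals of A and B may end up about a cent apart in the "odd" configurations, even though they are identical before rounding.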

- Set up new prompts with modified numbers

In [32]:
# Prompts for scenario 1
prompts_1 = []

for prices in [prices_1_odd, prices_1_odd2, prices_1_a25, prices_1_a50, prices_1_b25, prices_1_b50]:
    prompt = f"""Mr. A was given tickets involving the World Series. He won {prices[0]}$ in one lottery and {prices[1]}$ in the other. 
          Mr. B was given a ticket to a single, larger World Series lottery. He won {prices[2]}$. Based solely on this information, Who is happier? 
          A: Mister A
          B: Mister B
          C: No difference.         
          Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""
    prompts_1.append(prompt)
# Who is happier?

In [33]:
# Prompts for scenario 2
prompts_2 = []

for prices in [prices_2_odd, prices_2_odd2, prices_2_a25, prices_2_a50, prices_2_b25, prices_2_b50]:
    prompt = f"""Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed ${prices[0]}. 
         He received a similar letter the same day from his state income tax authority saying he owed ${prices[1]}. There were no other repercussions from either mistake. 
         Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed ${prices[2]}. There were no other repercussions from his mistake. 
         Based solely on this information, who was more upset? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""
    prompts_2.append(prompt)

# Who is more upset?

In [34]:
# Prompts for scenario 3
prompts_3 = []

for prices in [prices_3_odd, prices_3_odd2, prices_3_a25, prices_3_a50, prices_3_b25, prices_3_b50]:
    prompt = f"""Mr. A bought his first New York State lottery ticket and won ${prices[0]}. Also, in a freak accident, he damaged the rug in his apartment and had to pay the landlord ${prices[1]}.
         Mr. B bought his first New York State lottery ticket and won ${prices[2]}. Based solely on this information, who is happier? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""
    prompts_3.append(prompt)

# Who is happier?

In [35]:
# Prompts for scenario 4
prompts_4 = []

for prices in [prices_4_odd, prices_4_odd2, prices_4_a25, prices_4_a50, prices_4_b25, prices_4_b50]:
    prompt = f"""Mr. A's car was damaged in a parking lot. He had to spend ${prices[0]} to repair the damage. The same day the car was damaged, he won ${prices[1]} in the office football pool.
         Mr. B's car was damaged in a parking lot. He had to spend ${prices[2]} to repair the damage. Based solely on this information, who is more upset?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""
    prompts_4.append(prompt)
# Who is more upset?

Throughout the notebook, we use experiment IDs that let us look up the prompts and price vectors used in the experiments. We construct them as follows:

- scenario_model_configuration

The model attribute refers to the language model we are using, namely:
- 1: GPT-3.5-Turbo
- 2: GPT-4-1106-Preview

Configuration refers to the price vector used in this experiment. We number them as follows:
- 1: Odd prices 1 (Original * Pi * 100)
- 2: Odd prices 2 (Original * Pi * 42)
- 3: A is better off by $25
- 4: A is better off by $50
- 5: B is better off by $25
- 6: B is better off by $50

Experiment ID 2_2_4 therefore reads as: Scenario 2, using GPT-4-1106-Preview, where A is better off by $50.
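The ID scheme can be decoded mechanically. A minimal sketch, where `parse_experiment_id`, `MODELS` and `CONFIGURATIONS` are hypothetical names used only for illustration and are not part of the notebook's own code:

```python
# Lookup tables mirroring the numbering described above
MODELS = {1: "gpt-3.5-turbo", 2: "gpt-4-1106-preview"}
CONFIGURATIONS = {
    1: "odd prices 1 (original * pi * 100)",
    2: "odd prices 2 (original * pi * 42)",
    3: "A is better off by $25",
    4: "A is better off by $50",
    5: "B is better off by $25",
    6: "B is better off by $50",
}

def parse_experiment_id(experiment_id):
    """Split an ID of the form scenario_model_configuration into readable parts."""
    scenario, model, configuration = (int(part) for part in experiment_id.split("_"))
    return {
        "scenario": scenario,
        "model": MODELS[model],
        "configuration": CONFIGURATIONS[configuration],
    }

print(parse_experiment_id("2_2_4"))
```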

--------------------------------------------

In [36]:
# Dictionary to look up prompt for a given experiment id. key: experiment id, value: prompt
experiment_prompts_dict = {
    "1_1_1": prompts_1[0],
    "1_1_2": prompts_1[1],
    "1_1_3": prompts_1[2],
    "1_1_4": prompts_1[3],
    "1_1_5": prompts_1[4],
    "1_1_6": prompts_1[5],
    "2_1_1": prompts_2[0],
    "2_1_2": prompts_2[1],
    "2_1_3": prompts_2[2],
    "2_1_4": prompts_2[3],
    "2_1_5": prompts_2[4],
    "2_1_6": prompts_2[5],
    "3_1_1": prompts_3[0],
    "3_1_2": prompts_3[1],
    "3_1_3": prompts_3[2],
    "3_1_4": prompts_3[3],
    "3_1_5": prompts_3[4],
    "3_1_6": prompts_3[5],
    "4_1_1": prompts_4[0],
    "4_1_2": prompts_4[1],
    "4_1_3": prompts_4[2],
    "4_1_4": prompts_4[3],
    "4_1_5": prompts_4[4],
    "4_1_6": prompts_4[5],
    "1_2_1": prompts_1[0],
    "1_2_2": prompts_1[1],
    "1_2_3": prompts_1[2],
    "1_2_4": prompts_1[3],
    "1_2_5": prompts_1[4],
    "1_2_6": prompts_1[5],
    "2_2_1": prompts_2[0],
    "2_2_2": prompts_2[1],
    "2_2_3": prompts_2[2],
    "2_2_4": prompts_2[3],
    "2_2_5": prompts_2[4],
    "2_2_6": prompts_2[5],
    "3_2_1": prompts_3[0],
    "3_2_2": prompts_3[1],
    "3_2_3": prompts_3[2],
    "3_2_4": prompts_3[3],
    "3_2_5": prompts_3[4],
    "3_2_6": prompts_3[5],
    "4_2_1": prompts_4[0],
    "4_2_2": prompts_4[1],
    "4_2_3": prompts_4[2],
    "4_2_4": prompts_4[3],
    "4_2_5": prompts_4[4],
    "4_2_6": prompts_4[5],
}

# Dictionary to look up price vector used for experiment id. key: experiment id, value: price vector
prices_dict = {
    "1_1_1": prices_1_odd,
    "1_1_2": prices_1_odd2,
    "1_1_3": prices_1_a25,
    "1_1_4": prices_1_a50,
    "1_1_5": prices_1_b25,
    "1_1_6": prices_1_b50,
    "2_1_1": prices_2_odd,
    "2_1_2": prices_2_odd2,
    "2_1_3": prices_2_a25,
    "2_1_4": prices_2_a50,
    "2_1_5": prices_2_b25,
    "2_1_6": prices_2_b50,
    "3_1_1": prices_3_odd,
    "3_1_2": prices_3_odd2,
    "3_1_3": prices_3_a25,
    "3_1_4": prices_3_a50,
    "3_1_5": prices_3_b25,
    "3_1_6": prices_3_b50,
    "4_1_1": prices_4_odd,
    "4_1_2": prices_4_odd2,
    "4_1_3": prices_4_a25,
    "4_1_4": prices_4_a50,
    "4_1_5": prices_4_b25,
    "4_1_6": prices_4_b50,
    "1_2_1": prices_1_odd,
    "1_2_2": prices_1_odd2,
    "1_2_3": prices_1_a25,
    "1_2_4": prices_1_a50,
    "1_2_5": prices_1_b25,
    "1_2_6": prices_1_b50,
    "2_2_1": prices_2_odd,
    "2_2_2": prices_2_odd2,
    "2_2_3": prices_2_a25,
    "2_2_4": prices_2_a50,
    "2_2_5": prices_2_b25,
    "2_2_6": prices_2_b50,
    "3_2_1": prices_3_odd,
    "3_2_2": prices_3_odd2,
    "3_2_3": prices_3_a25,
    "3_2_4": prices_3_a50,
    "3_2_5": prices_3_b25,
    "3_2_6": prices_3_b50,
    "4_2_1": prices_4_odd,
    "4_2_2": prices_4_odd2,
    "4_2_3": prices_4_a25,
    "4_2_4": prices_4_a50,
    "4_2_5": prices_4_b25,
    "4_2_6": prices_4_b50,
}

# Dictionary to look up the original results for a given experiment id. key: experiment id, value: original answer probabilities
results_dict = {
    "1_1_1": p_scenario1,
    "1_1_2": p_scenario1,
    "1_1_3": p_scenario1,
    "1_1_4": p_scenario1,
    "1_1_5": p_scenario1,
    "1_1_6": p_scenario1,
    "2_1_1": p_scenario2,
    "2_1_2": p_scenario2,
    "2_1_3": p_scenario2,
    "2_1_4": p_scenario2,
    "2_1_5": p_scenario2,
    "2_1_6": p_scenario2,
    "3_1_1": p_scenario3,
    "3_1_2": p_scenario3,
    "3_1_3": p_scenario3,
    "3_1_4": p_scenario3,
    "3_1_5": p_scenario3,
    "3_1_6": p_scenario3,
    "4_1_1": p_scenario4,
    "4_1_2": p_scenario4,
    "4_1_3": p_scenario4,
    "4_1_4": p_scenario4,
    "4_1_5": p_scenario4,
    "4_1_6": p_scenario4,
    "1_2_1": p_scenario1,
    "1_2_2": p_scenario1,
    "1_2_3": p_scenario1,
    "1_2_4": p_scenario1,
    "1_2_5": p_scenario1,
    "1_2_6": p_scenario1,
    "2_2_1": p_scenario2,
    "2_2_2": p_scenario2,
    "2_2_3": p_scenario2,
    "2_2_4": p_scenario2,
    "2_2_5": p_scenario2,
    "2_2_6": p_scenario2,
    "3_2_1": p_scenario3,
    "3_2_2": p_scenario3,
    "3_2_3": p_scenario3,
    "3_2_4": p_scenario3,
    "3_2_5": p_scenario3,
    "3_2_6": p_scenario3,
    "4_2_1": p_scenario4,
    "4_2_2": p_scenario4,
    "4_2_3": p_scenario4,
    "4_2_4": p_scenario4,
    "4_2_5": p_scenario4,
    "4_2_6": p_scenario4,
}

# Dictionary to look up which model to use for a given experiment id. key: experiment id, value: model name
model_dict = {
    "1_1_1": "gpt-3.5-turbo",  
    "1_1_2": "gpt-3.5-turbo",
    "1_1_3": "gpt-3.5-turbo",
    "1_1_4": "gpt-3.5-turbo",
    "1_1_5": "gpt-3.5-turbo",
    "1_1_6": "gpt-3.5-turbo",
    "2_1_1": "gpt-3.5-turbo",
    "2_1_2": "gpt-3.5-turbo",
    "2_1_3": "gpt-3.5-turbo",
    "2_1_4": "gpt-3.5-turbo",
    "2_1_5": "gpt-3.5-turbo",
    "2_1_6": "gpt-3.5-turbo",
    "3_1_1": "gpt-3.5-turbo",
    "3_1_2": "gpt-3.5-turbo",
    "3_1_3": "gpt-3.5-turbo",
    "3_1_4": "gpt-3.5-turbo",
    "3_1_5": "gpt-3.5-turbo",
    "3_1_6": "gpt-3.5-turbo",
    "4_1_1": "gpt-3.5-turbo",
    "4_1_2": "gpt-3.5-turbo",
    "4_1_3": "gpt-3.5-turbo",
    "4_1_4": "gpt-3.5-turbo",
    "4_1_5": "gpt-3.5-turbo",
    "4_1_6": "gpt-3.5-turbo",
    "1_2_1": "gpt-4-1106-preview",
    "1_2_2": "gpt-4-1106-preview",
    "1_2_3": "gpt-4-1106-preview",
    "1_2_4": "gpt-4-1106-preview",
    "1_2_5": "gpt-4-1106-preview",
    "1_2_6": "gpt-4-1106-preview",
    "2_2_1": "gpt-4-1106-preview",
    "2_2_2": "gpt-4-1106-preview",
    "2_2_3": "gpt-4-1106-preview",
    "2_2_4": "gpt-4-1106-preview",
    "2_2_5": "gpt-4-1106-preview",
    "2_2_6": "gpt-4-1106-preview",
    "3_2_1": "gpt-4-1106-preview",
    "3_2_2": "gpt-4-1106-preview",
    "3_2_3": "gpt-4-1106-preview",
    "3_2_4": "gpt-4-1106-preview",
    "3_2_5": "gpt-4-1106-preview",
    "3_2_6": "gpt-4-1106-preview",
    "4_2_1": "gpt-4-1106-preview",
    "4_2_2": "gpt-4-1106-preview",
    "4_2_3": "gpt-4-1106-preview",
    "4_2_4": "gpt-4-1106-preview",
    "4_2_5": "gpt-4-1106-preview",
    "4_2_6": "gpt-4-1106-preview",
    }

# Dictionary to look up what prompt was used for a given experiment id. key: experiment id, value: prompt variable name
prompt_ids_dict = {
    "1_1_1": "prompts_1[0]",
    "1_1_2": "prompts_1[1]",
    "1_1_3": "prompts_1[2]",
    "1_1_4": "prompts_1[3]",
    "1_1_5": "prompts_1[4]",
    "1_1_6": "prompts_1[5]",
    "2_1_1": "prompts_2[0]",
    "2_1_2": "prompts_2[1]",
    "2_1_3": "prompts_2[2]",
    "2_1_4": "prompts_2[3]",
    "2_1_5": "prompts_2[4]",
    "2_1_6": "prompts_2[5]",
    "3_1_1": "prompts_3[0]",
    "3_1_2": "prompts_3[1]",
    "3_1_3": "prompts_3[2]",
    "3_1_4": "prompts_3[3]",
    "3_1_5": "prompts_3[4]",
    "3_1_6": "prompts_3[5]",
    "4_1_1": "prompts_4[0]",
    "4_1_2": "prompts_4[1]",
    "4_1_3": "prompts_4[2]",
    "4_1_4": "prompts_4[3]",
    "4_1_5": "prompts_4[4]",
    "4_1_6": "prompts_4[5]",
    "1_2_1": "prompts_1[0]",
    "1_2_2": "prompts_1[1]",
    "1_2_3": "prompts_1[2]",
    "1_2_4": "prompts_1[3]",
    "1_2_5": "prompts_1[4]",
    "1_2_6": "prompts_1[5]",
    "2_2_1": "prompts_2[0]",
    "2_2_2": "prompts_2[1]",
    "2_2_3": "prompts_2[2]",
    "2_2_4": "prompts_2[3]",
    "2_2_5": "prompts_2[4]",
    "2_2_6": "prompts_2[5]",
    "3_2_1": "prompts_3[0]",
    "3_2_2": "prompts_3[1]",
    "3_2_3": "prompts_3[2]",
    "3_2_4": "prompts_3[3]",
    "3_2_5": "prompts_3[4]",
    "3_2_6": "prompts_3[5]",
    "4_2_1": "prompts_4[0]",
    "4_2_2": "prompts_4[1]",
    "4_2_3": "prompts_4[2]",
    "4_2_4": "prompts_4[3]",
    "4_2_5": "prompts_4[4]",
    "4_2_6": "prompts_4[5]",
}

# Dictionary to look up scenario number for a given experiment id. key: experiment id, value: scenario number
scenario_dict = {
    "1_1_1": 1,
    "1_1_2": 1,
    "1_1_3": 1,
    "1_1_4": 1,
    "1_1_5": 1,
    "1_1_6": 1,
    "2_1_1": 2,
    "2_1_2": 2,
    "2_1_3": 2,
    "2_1_4": 2,
    "2_1_5": 2,
    "2_1_6": 2,
    "3_1_1": 3,
    "3_1_2": 3,
    "3_1_3": 3,
    "3_1_4": 3,
    "3_1_5": 3,
    "3_1_6": 3,
    "4_1_1": 4,
    "4_1_2": 4,
    "4_1_3": 4,
    "4_1_4": 4,
    "4_1_5": 4,
    "4_1_6": 4,
    "1_2_1": 1,
    "1_2_2": 1,
    "1_2_3": 1,
    "1_2_4": 1,
    "1_2_5": 1,
    "1_2_6": 1,
    "2_2_1": 2,
    "2_2_2": 2, 
    "2_2_3": 2,
    "2_2_4": 2,
    "2_2_5": 2,
    "2_2_6": 2,
    "3_2_1": 3,
    "3_2_2": 3,
    "3_2_3": 3,
    "3_2_4": 3,
    "3_2_5": 3,
    "3_2_6": 3,
    "4_2_1": 4,
    "4_2_2": 4,
    "4_2_3": 4,
    "4_2_4": 4,
    "4_2_5": 4,
    "4_2_6": 4,
}

# Dictionary to look up scenario configuration based on experiment id. key: experiment id, value: scenario configuration
configuration_dict = {
    "1_1_1": 1,
    "1_1_2": 2,
    "1_1_3": 3,
    "1_1_4": 4,
    "1_1_5": 5,
    "1_1_6": 6,
    "2_1_1": 1,
    "2_1_2": 2,
    "2_1_3": 3,
    "2_1_4": 4,
    "2_1_5": 5,
    "2_1_6": 6,
    "3_1_1": 1,
    "3_1_2": 2,
    "3_1_3": 3,
    "3_1_4": 4,
    "3_1_5": 5,
    "3_1_6": 6,
    "4_1_1": 1,
    "4_1_2": 2,
    "4_1_3": 3,
    "4_1_4": 4,
    "4_1_5": 5,
    "4_1_6": 6,
    "1_2_1": 1,
    "1_2_2": 2,
    "1_2_3": 3,
    "1_2_4": 4,
    "1_2_5": 5,
    "1_2_6": 6,
    "2_2_1": 1,
    "2_2_2": 2,
    "2_2_3": 3,
    "2_2_4": 4,
    "2_2_5": 5,
    "2_2_6": 6,
    "3_2_1": 1,
    "3_2_2": 2,
    "3_2_3": 3,
    "3_2_4": 4,
    "3_2_5": 5,
    "3_2_6": 6,
    "4_2_1": 1,
    "4_2_2": 2,
    "4_2_3": 3,
    "4_2_4": 4,
    "4_2_5": 5,
    "4_2_6": 6,
}
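Since every lookup dictionary above follows the same scenario_model_configuration pattern, the tables could also be generated in a loop. The sketch below is one possible way to do this, not the approach used in this notebook; the placeholder strings stand in for the real `prompts_1` to `prompts_4` lists, and all variable names are assumptions.

```python
from itertools import product

MODEL_NAMES = {1: "gpt-3.5-turbo", 2: "gpt-4-1106-preview"}

# Placeholder prompt lists (assumption): six prompts per scenario, standing in for prompts_1 ... prompts_4
prompts_by_scenario = {
    scenario: [f"<prompt {scenario}.{config}>" for config in range(1, 7)]
    for scenario in range(1, 5)
}

experiment_prompts = {}
models = {}
scenarios = {}
configurations = {}

# Same key order as the hand-written dictionaries: model first, then scenario, then configuration
for model, scenario, config in product(MODEL_NAMES, range(1, 5), range(1, 7)):
    experiment_id = f"{scenario}_{model}_{config}"
    experiment_prompts[experiment_id] = prompts_by_scenario[scenario][config - 1]
    models[experiment_id] = MODEL_NAMES[model]
    scenarios[experiment_id] = scenario
    configurations[experiment_id] = config

print(len(models), models["3_2_5"], experiment_prompts["4_1_6"])
```

The hand-written dictionaries are easier to inspect entry by entry, while the generated version avoids copy-paste mistakes when a scenario, model or configuration is added.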

#### Setting up functions to repeatedly prompt ChatGPT

- Function to query 1 prompt n times

In [37]:
def run_experiment(experiment_id, n, progress_bar, temperature):

    """
    Function to query ChatGPT multiple times with a survey having answers designed as: A, B, C.
    
    Args:
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        n (int): Number of queries to be made
        progress_bar (tqdm): Progress bar that is advanced by one step per query
        temperature (float): Degree of randomness with range 0 (deterministic) to 2 (random)
        
    Returns:
        results (list): List containing count of answers for each option, also containing experiment_id, temperature, number of valid observations, model, scenario and configuration
        probs (list): List containing percentage of each option being chosen, also containing the same metadata as results
    """
    
    answers = []
    for _ in range(n): 
        response = client.chat.completions.create(
            model = model_dict[experiment_id], 
            max_tokens = 1, # one token is enough for a single-letter answer
            temperature = temperature, # range is 0 to 2
            messages = [
            {"role": "system", "content": "Only answer with 1 letter."},        
            {"role": "user", "content": experiment_prompts_dict[experiment_id]},
                   ])

        # Store the answer in the list (content may be None, so fall back to an empty string)
        answer = response.choices[0].message.content or ""
        answers.append(answer.strip())
        # Update progress bar (given from either temperature loop, or set locally)
        progress_bar.update(1)

    # Counting results per answer option
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of "correct" answers, sums over an indicator function that checks if the answer is either A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, model_dict[experiment_id], scenario_dict[experiment_id], configuration_dict[experiment_id]]

    # Getting percentage of each answer among the valid answers ("nan%" if no valid answer was returned)
    p_a = f"{(A / len_correct) * 100:.2f}%" if len_correct else "nan%"
    p_b = f"{(B / len_correct) * 100:.2f}%" if len_correct else "nan%"
    p_c = f"{(C / len_correct) * 100:.2f}%" if len_correct else "nan%"

    # Collect probabilities in a list
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, model_dict[experiment_id], scenario_dict[experiment_id], configuration_dict[experiment_id]]
    
    # Give out results
    return results, probs
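The counting step of `run_experiment()` does not depend on the API and can be checked offline. The following is a minimal sketch of that step on its own; `tally_answers` and `VALID_OPTIONS` are hypothetical names, and returning NaN when no valid answer exists is a design choice made here, not something prescribed by the notebook.

```python
from collections import Counter

VALID_OPTIONS = ("A", "B", "C")

def tally_answers(answers, options=VALID_OPTIONS):
    """Count valid answers and compute their shares (in percent) among all valid answers."""
    counts = Counter(answer.strip() for answer in answers)
    n_valid = sum(counts[option] for option in options)
    shares = {
        option: (counts[option] / n_valid * 100 if n_valid else float("nan"))
        for option in options
    }
    return {option: counts[option] for option in options}, shares, n_valid

# Invalid replies such as "Sorry" are ignored when computing the shares
counts, shares, n_valid = tally_answers(["A", "B", " A", "C", "Sorry", "A"])
print(counts, shares, n_valid)
```

Dividing by the number of valid answers rather than by the number of requests keeps the shares comparable across temperatures, where higher values tend to produce more replies outside the answer options.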

- Function to loop run_experiment() over a list of temperature values

In [38]:
def temperature_loop(function, experiment_id, temperature_list = (0.5, 1, 1.5), n = 50):
    """
    Function to run an experiment with different temperature values.
    
    Args:
        function (function): Function to be used for querying ChatGPT, i.e. run_experiment()
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        temperature_list (list or tuple): Temperature values to be looped over
        n (int): Number of requests for each prompt per temperature value
        
    Returns:
        results_df: Dataframe with experiment results
        probs_df: Dataframe with answer probabilities
    """    
    # Empty lists for storing results
    results_list = []
    probs_list = []
    # Initialize progress bar -> used as input for run_experiment()
    progress_bar = tqdm(range(n*len(temperature_list)))

    # Loop over different temperature values, calling the input function once per value (i.e. querying ChatGPT n times each)
    for temperature in temperature_list:
        results, probs = function(experiment_id = experiment_id, n = n, temperature = temperature, progress_bar = progress_bar) 
        results_list.append(results)
        probs_list.append(probs)

    # Build one row per temperature value, transpose so that each temperature becomes a column, and label the rows
    results_df = pd.DataFrame(results_list).transpose().set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Configuration"]))
    probs_df = pd.DataFrame(probs_list).transpose().set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Configuration"]))
   
    # Return some information about the experiment as a check
    check = f"In this run, a total of {n*len(temperature_list)} requests were made using {prompt_ids_dict[experiment_id]}."
    # Print information about the experiment
    print(check)
    # Print original results 
    # print(f"The original results were {results_dict[experiment_id]}.")

    return results_df, probs_df

-----------------------------------

In [39]:
#test_results, test_probs = temperature_loop(run_experiment, "1_1_1", temperature_list = [0.5, 1, 1.5], n = 5)
#test_probs

In [40]:
#test_results2, test_probs2 = temperature_loop(run_experiment, "1_1_2", temperature_list = [0.5, 1, 1.5], n = 5)
#test_probs2

---------------------------------------------------------------------------

### Model 1: GPT-3.5-Turbo

In [25]:
# For GPT-3.5-turbo we make 100 requests per prompt & temperature value
N = 100

- Scenario 1

In [41]:
probs_scenario1 = []
for experiment_id in ["1_1_1", "1_1_2", "1_1_3", "1_1_4", "1_1_5", "1_1_6"]:
    results, probs = temperature_loop(run_experiment, experiment_id, temperature_list = [0.5, 1, 1.5], n = N)
    probs_scenario1.append(probs)

100%|██████████| 300/300 [03:03<00:00,  1.64it/s]
In this run, a total of 300 requests were made using prompts_1[0].
100%|██████████| 300/300 [02:27<00:00,  2.03it/s]
In this run, a total of 300 requests were made using prompts_1[1].
100%|██████████| 300/300 [02:24<00:00,  2.08it/s]
In this run, a total of 300 requests were made using prompts_1[2].
100%|██████████| 300/300 [02:30<00:00,  1.99it/s]
In this run, a total of 300 requests were made using prompts_1[3].
100%|██████████| 300/300 [03:58<00:00,  1.26it/s]
In this run, a total of 300 requests were made using prompts_1[4].
100%|██████████| 300/300 [02:26<00:00,  2.05it/s]
In this run, a total of 300 requests were made using prompts_1[5].

In [27]:
#probs1 = pd.concat(probs_scenario1_test, axis = 1)
#probs1.transpose()

| | 0 | 1 | 2 |
|---|---|---|---|
| Experiment | 1_1_1 | 1_1_1 | 1_1_1 |
| Temp | 0.5 | 1.0 | 1.5 |
| p(A) | 0.00% | 20.00% | 0.00% |
| p(B) | 60.00% | 40.00% | 60.00% |
| p(C) | 40.00% | 40.00% | 40.00% |
| Obs. | 5 | 5 | 5 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 1 | 1 | 1 |
| Configuration | 1 | 1 | 1 |


- Scenario 2

In [50]:
probs_scenario2 = []
for experiment_id in ["2_1_1", "2_1_2", "2_1_3", "2_1_4", "2_1_5", "2_1_6"]:
    results, probs = temperature_loop(run_experiment, experiment_id, temperature_list = [0.5, 1, 1.5], n = N)
    probs_scenario2.append(probs)

100%|██████████| 300/300 [02:25<00:00,  2.07it/s]
In this run, a total of 300 requests were made using prompts_2[0].
100%|██████████| 300/300 [02:29<00:00,  2.00it/s]
In this run, a total of 300 requests were made using prompts_2[1].
100%|██████████| 300/300 [02:27<00:00,  2.04it/s]
In this run, a total of 300 requests were made using prompts_2[2].
100%|██████████| 300/300 [02:19<00:00,  2.15it/s]
In this run, a total of 300 requests were made using prompts_2[3].
100%|██████████| 300/300 [03:38<00:00,  1.38it/s]
In this run, a total of 300 requests were made using prompts_2[4].
100%|██████████| 300/300 [02:32<00:00,  1.97it/s]
In this run, a total of 300 requests were made using prompts_2[5].

- Scenario 3

In [51]:
probs_scenario3 = []
for experiment_id in ["3_1_1", "3_1_2", "3_1_3", "3_1_4", "3_1_5", "3_1_6"]:
    results, probs = temperature_loop(run_experiment, experiment_id, temperature_list = [0.5, 1, 1.5], n = N)
    probs_scenario3.append(probs)

100%|██████████| 300/300 [02:25<00:00,  2.06it/s]
In this run, a total of 300 requests were made using prompts_3[0].
100%|██████████| 300/300 [02:27<00:00,  2.04it/s]
In this run, a total of 300 requests were made using prompts_3[1].
100%|██████████| 300/300 [02:27<00:00,  2.03it/s]
In this run, a total of 300 requests were made using prompts_3[2].
100%|██████████| 300/300 [02:24<00:00,  2.07it/s]
In this run, a total of 300 requests were made using prompts_3[3].
100%|██████████| 300/300 [02:31<00:00,  1.98it/s]
In this run, a total of 300 requests were made using prompts_3[4].
100%|██████████| 300/300 [03:38<00:00,  1.37it/s]
In this run, a total of 300 requests were made using prompts_3[5].

- Scenario 4

In [52]:
probs_scenario4 = []
for experiment_id in ["4_1_1", "4_1_2", "4_1_3", "4_1_4", "4_1_5", "4_1_6"]:
    results, probs = temperature_loop(run_experiment, experiment_id, temperature_list = [0.5, 1, 1.5], n = N)
    probs_scenario4.append(probs)

100%|██████████| 300/300 [02:32<00:00,  1.97it/s]
In this run, a total of 300 requests were made using prompts_4[0].
100%|██████████| 300/300 [02:25<00:00,  2.05it/s]
In this run, a total of 300 requests were made using prompts_4[1].
100%|██████████| 300/300 [02:19<00:00,  2.15it/s]
In this run, a total of 300 requests were made using prompts_4[2].
100%|██████████| 300/300 [02:27<00:00,  2.03it/s]
In this run, a total of 300 requests were made using prompts_4[3].
100%|██████████| 300/300 [02:29<00:00,  2.00it/s]
In this run, a total of 300 requests were made using prompts_4[4].
100%|██████████| 300/300 [02:29<00:00,  2.01it/s]
In this run, a total of 300 requests were made using prompts_4[5].

--------------------------------------

### Model 2: GPT-4-1106-Preview

In [53]:
# Since GPT-4 is a much more expensive model, we only make 50 requests per prompt & temperature value
N = 50
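Halving the number of requests affects how precisely the answer shares are estimated. As a rough guide, and assuming the responses are independent draws, the standard error of an estimated share p from n responses is sqrt(p * (1 - p) / n). The helper name `standard_error` below is hypothetical and only serves this illustration.

```python
import math

def standard_error(p, n):
    """Standard error of an estimated proportion p based on n independent responses."""
    return math.sqrt(p * (1 - p) / n)

# Worst case p = 0.5, for the two sample sizes used in this notebook
for n in (100, 50):
    print(f"n = {n}: standard error = {standard_error(0.5, n) * 100:.1f} percentage points")
```

Differences of only a few percentage points between the two models should therefore be read with caution, since the estimates based on 50 requests are noisier by a factor of sqrt(2).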

- Scenario 1

In [54]:
for experiment_id in ["1_2_1", "1_2_2", "1_2_3", "1_2_4", "1_2_5", "1_2_6"]:
    results, probs = temperature_loop(run_experiment, experiment_id, temperature_list = [0.5, 1, 1.5], n = N)
    probs_scenario1.append(probs)

100%|██████████| 150/150 [01:42<00:00,  1.46it/s]
In this run, a total of 150 requests were made using prompts_1[0].
100%|██████████| 150/150 [01:31<00:00,  1.64it/s]
In this run, a total of 150 requests were made using prompts_1[1].
100%|██████████| 150/150 [02:05<00:00,  1.19it/s]
In this run, a total of 150 requests were made using prompts_1[2].
100%|██████████| 150/150 [03:42<00:00,  1.48s/it]
In this run, a total of 150 requests were made using prompts_1[3].
100%|██████████| 150/150 [01:34<00:00,  1.58it/s]
In this run, a total of 150 requests were made using prompts_1[4].
100%|██████████| 150/150 [01:29<00:00,  1.68it/s]
In this run, a total of 150 requests were made using prompts_1[5].

- Scenario 2

In [69]:
for experiment_id in ["2_2_1", "2_2_2", "2_2_3", "2_2_4", "2_2_5", "2_2_6"]:
    results, probs = temperature_loop(run_experiment, experiment_id, temperature_list = [0.5, 1, 1.5], n = N)
    probs_scenario2.append(probs)

100%|██████████| 150/150 [01:40<00:00,  1.50it/s]


In this run, a total of 150 requests were made using prompts_2[0].


100%|██████████| 150/150 [02:31<00:00,  1.01s/it]


In this run, a total of 150 requests were made using prompts_2[1].


100%|██████████| 150/150 [01:38<00:00,  1.53it/s]


In this run, a total of 150 requests were made using prompts_2[2].


100%|██████████| 150/150 [03:11<00:00,  1.28s/it] 


In this run, a total of 150 requests were made using prompts_2[3].


100%|██████████| 150/150 [01:41<00:00,  1.48it/s]


In this run, a total of 150 requests were made using prompts_2[4].


100%|██████████| 150/150 [01:39<00:00,  1.51it/s]

In this run, a total of 150 requests were made using prompts_2[5].





- Scenario 3

In [70]:
for experiment_id in ["3_2_1", "3_2_2", "3_2_3", "3_2_4", "3_2_5", "3_2_6"]:
    results, probs = temperature_loop(run_experiment, experiment_id, temperature_list = [0.5, 1, 1.5], n = N)
    probs_scenario3.append(probs)

100%|██████████| 150/150 [01:34<00:00,  1.59it/s]


In this run, a total of 150 requests were made using prompts_3[0].


100%|██████████| 150/150 [04:40<00:00,  1.87s/it]


In this run, a total of 150 requests were made using prompts_3[1].


100%|██████████| 150/150 [11:32<00:00,  4.62s/it] 


In this run, a total of 150 requests were made using prompts_3[2].


100%|██████████| 150/150 [01:36<00:00,  1.56it/s]


In this run, a total of 150 requests were made using prompts_3[3].


100%|██████████| 150/150 [21:39<00:00,  8.67s/it]   


In this run, a total of 150 requests were made using prompts_3[4].


100%|██████████| 150/150 [01:34<00:00,  1.58it/s]

In this run, a total of 150 requests were made using prompts_3[5].





- Scenario 4

In [71]:
for experiment_id in ["4_2_1", "4_2_2", "4_2_3", "4_2_4", "4_2_5", "4_2_6"]:
    results, probs = temperature_loop(run_experiment, experiment_id, temperature_list = [0.5, 1, 1.5], n = N)
    probs_scenario4.append(probs)

100%|██████████| 150/150 [02:48<00:00,  1.13s/it]


In this run, a total of 150 requests were made using prompts_4[0].


100%|██████████| 150/150 [01:31<00:00,  1.64it/s]


In this run, a total of 150 requests were made using prompts_4[1].


100%|██████████| 150/150 [01:32<00:00,  1.62it/s]


In this run, a total of 150 requests were made using prompts_4[2].


100%|██████████| 150/150 [01:30<00:00,  1.65it/s]


In this run, a total of 150 requests were made using prompts_4[3].


100%|██████████| 150/150 [02:06<00:00,  1.19it/s]


In this run, a total of 150 requests were made using prompts_4[4].


100%|██████████| 150/150 [01:40<00:00,  1.49it/s]

In this run, a total of 150 requests were made using prompts_4[5].





----------------------------------------------------

- Save the results to CSV

In [94]:
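# Combine each scenario's per-experiment results into one DataFrame
# (after transposing: one row per experiment and temperature value).
# Note: this rebinds the list variables, so run this cell only once.
probs_scenario1 = pd.concat(probs_scenario1, axis = 1).transpose()
probs_scenario2 = pd.concat(probs_scenario2, axis = 1).transpose()
probs_scenario3 = pd.concat(probs_scenario3, axis = 1).transpose()
probs_scenario4 = pd.concat(probs_scenario4, axis = 1).transpose()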
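
The effect of concatenating along `axis = 1` and then transposing can be seen on a toy example. The sketch below assumes that each `probs` object is a DataFrame with the result fields as rows and one column per temperature value. That layout is inferred from the output table and is not confirmed by the code shown here. The helper `toy_probs`, the experiment names, and all values are made up.

```python
import pandas as pd

# Toy result frame: fields as rows, one column per temperature value (assumed layout)
def toy_probs(experiment_id):
    return pd.DataFrame(
        {
            0: [experiment_id, 0.5, "60.00%"],
            1: [experiment_id, 1.0, "50.00%"],
            2: [experiment_id, 1.5, "40.00%"],
        },
        index=["Experiment", "Temp", "p(A)"],
    )

probs_list = [toy_probs("toy_1"), toy_probs("toy_2")]

# Place the frames side by side, then flip so that every column becomes a row
combined = pd.concat(probs_list, axis=1).transpose()

print(combined)
print(list(combined.index))  # the row labels 0, 1, 2 repeat once per experiment
```

This would also explain why the row labels in the combined table cycle through 0, 1, 2. If a unique running index is preferred, `reset_index(drop=True)` can be applied to the result.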

In [96]:
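# Stack all four scenarios and write them to disk (create the output folder if needed)
os.makedirs("Output", exist_ok = True)
PT2_probs = pd.concat([probs_scenario1, probs_scenario2, probs_scenario3, probs_scenario4], axis = 0)
PT2_probs.to_csv("Output/PT2_probs.csv")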
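
For later analysis, the saved file can be read back and the percentage strings converted to numbers. The following is a minimal sketch. It assumes the shares are stored as strings such as `62.00%`, as the displayed table suggests, and it uses an in-memory sample of the first three rows in place of the file on disk.

```python
import io
import pandas as pd

# First rows of the results table, used here in place of the file on disk
sample_csv = """,Experiment,Temp,p(A),p(B),p(C),Obs.,Model,Scenario,Configuration
0,1_1_1,0.5,62.00%,36.00%,2.00%,100,gpt-3.5-turbo,1,1
1,1_1_1,1.0,49.00%,44.00%,7.00%,100,gpt-3.5-turbo,1,1
2,1_1_1,1.5,45.00%,32.00%,23.00%,100,gpt-3.5-turbo,1,1
"""

# With the real file this would be pd.read_csv("Output/PT2_probs.csv", index_col=0)
df = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Turn percentage strings such as "62.00%" into fractions between 0 and 1
for col in ["p(A)", "p(B)", "p(C)"]:
    df[col] = df[col].str.rstrip("%").astype(float) / 100

print(df[["Experiment", "Temp", "p(A)", "p(B)", "p(C)"]])
```

Storing the shares as plain floats in the first place would avoid this conversion step. The formatted strings are mainly convenient for display.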

In [98]:
PT2_probs

Unnamed: 0,Experiment,Temp,p(A),p(B),p(C),Obs.,Model,Scenario,Configuration
0,1_1_1,0.5,62.00%,36.00%,2.00%,100,gpt-3.5-turbo,1,1
1,1_1_1,1.0,49.00%,44.00%,7.00%,100,gpt-3.5-turbo,1,1
2,1_1_1,1.5,45.00%,32.00%,23.00%,100,gpt-3.5-turbo,1,1
0,1_1_2,0.5,56.00%,42.00%,2.00%,100,gpt-3.5-turbo,1,2
1,1_1_2,1.0,46.00%,48.00%,6.00%,100,gpt-3.5-turbo,1,2
...,...,...,...,...,...,...,...,...,...
1,4_2_5,1.0,0.00%,0.00%,100.00%,50,gpt-4-1106-preview,4,5
2,4_2_5,1.5,0.00%,0.00%,100.00%,50,gpt-4-1106-preview,4,5
0,4_2_6,0.5,0.00%,0.00%,100.00%,50,gpt-4-1106-preview,4,6
1,4_2_6,1.0,0.00%,0.00%,100.00%,50,gpt-4-1106-preview,4,6
