## Prospect Theory

This notebook aims to recreate some of the findings of **Thaler, Richard (1985), “Mental Accounting and Consumer Choice,” Marketing Science, 4 (3), 199–214.** Specifically, we try to see if LLMs like **ChatGPT** abide by some rules of Mental Accounting and Prospect Theory.


Open question: should every prompt use the past tense, as in the original phrasing ("Who WAS happier?")?

## Original study: 

### Scenario 1: Segregation of gains
- Mr. A was given tickets to lotteries involving the World Series. He won $50 in one lottery and $25 in the other.
- Mr. B was given a ticket to a single, larger World Series lottery. He won $75. Who is happier?

| Answer option | Frequency |
|--------------|-----------|
| A            | 56        |
| B            | 16        |
| No difference | 15      |

(empirical results from the 1985 study) -> No segregation of gains for B

### Scenario 2: Integration of losses
- Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his
tax return and owed $100. He received a similar letter the same day from his state income tax
authority saying he owed $50. There were no other repercussions from either mistake.
- Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax
return and owed $150. There were no other repercussions from his mistake. Who was more upset?

| Answer option | Frequency |
|--------------|-----------|
| A            | 66        |
| B            | 14        |
| No difference | 7      |

(empirical results from the 1985 study) -> No integration of losses for B

### Scenario 3: Cancellation of losses against larger gains
- Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident,
he damaged the rug in his apartment and had to pay the landlord $80.
- Mr. B bought his first New York State lottery ticket and won $20. Who is happier?

| Answer option | Frequency |
|--------------|-----------|
| A            | 22        |
| B            | 61        |
| No difference | 4      |

(empirical results from the 1985 study) -> No cancellation of losses against larger gains for A


### Scenario 4: Segregation of "silver linings"
- Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The
same day the car was damaged, he won $25 in the office football pool.
- Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage.
Who was more upset?

| Answer option | Frequency |
|--------------|-----------|
| A            | 19        |
| B            | 63        |
| No difference | 5      |

(empirical results from the 1985 study) -> No segregation of "silver linings" for B.



------------------------------------

In [1]:
from openai import OpenAI
import openai
import matplotlib.pyplot as plt
import os 
import numpy as np
import pandas as pd
from tqdm import tqdm
import replicate
import plotly.graph_objects as go
from ast import literal_eval
import pickle

In [2]:
# Get OpenAI API key (previously saved as an environment variable)
openai.api_key = os.environ["OPENAI_API_KEY"]

# Set client
client = OpenAI()

# Set global plot style
plt.style.use('seaborn-v0_8')

# Set plots to be displayed in notebook
%matplotlib inline

In [3]:
# To make our results comparable to the original study, we compute original answer probabilities
PT_p_scenario1 = [np.round((56/(56+16+15)*100), 2), np.round(16/(56+16+15)*100,2), np.round((15/(56+16+15)*100), 2)]
PT_p_scenario2 = [np.round((66/(66+14+7)*100), 2), np.round((14/(66+14+7)*100), 2), np.round((7/(66+14+7)*100), 2)]
PT_p_scenario3 = [np.round((22/(22+61+4)*100), 2), np.round((61/(22+61+4)*100), 2), np.round((4/(22+61+4)*100), 2)]
PT_p_scenario4 = [np.round((19/(19+63+5)*100), 2), np.round((63/(19+63+5)*100), 2), np.round((5/(19+63+5)*100), 2)]
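
The same conversion can be written once as a small helper. The sketch below is not part of the original notebook: `counts_to_percentages` and the two dictionaries are hypothetical names, and it uses Python's built-in `round` instead of `np.round`.

```python
def counts_to_percentages(counts, decimals=2):
    """Convert raw answer counts into percentages of the total."""
    total = sum(counts)
    if total == 0:
        # No observations: report 0 for every option instead of dividing by zero
        return [0.0 for _ in counts]
    return [round(count / total * 100, decimals) for count in counts]


# Answer counts (A, B, No difference) from the tables above
original_counts = {
    1: [56, 16, 15],
    2: [66, 14, 7],
    3: [22, 61, 4],
    4: [19, 63, 5],
}

original_percentages = {
    scenario: counts_to_percentages(counts)
    for scenario, counts in original_counts.items()
}

for scenario, percentages in original_percentages.items():
    print(scenario, percentages)
```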

In [4]:
# Create dataframe with additional information about experiment
PT_original_results = pd.DataFrame([PT_p_scenario1, PT_p_scenario2, PT_p_scenario3, PT_p_scenario4], columns=['p(A)', 'p(B)', 'p(C)'])
PT_original_results['Scenario'] = ['1', '2', '3', '4']
PT_original_results['Obs.'] = ['87', '87', '87', '87']
PT_original_results

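|   | p(A)  | p(B)  | p(C)  | Scenario | Obs. |
|---|-------|-------|-------|----------|------|
| 0 | 64.37 | 18.39 | 17.24 | 1        | 87   |
| 1 | 75.86 | 16.09 | 8.05  | 2        | 87   |
| 2 | 25.29 | 70.11 | 4.6   | 3        | 87   |
| 3 | 21.84 | 72.41 | 5.75  | 4        | 87   |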


In [5]:
# Save to .csv for use in Dashboard
PT_original_results.to_csv("Dashboard/src/data/Input/PT_og_results.csv", index = False)

---------------------------

#### Setting up the prompts used for the experiment

We now formulate 8 different prompts: 

The first four prompts each describe a scenario in which two people, Mister A and Mister B, win or lose money. In every scenario, the net monetary amount gained or lost is the same for both. What differs is how the gains and losses are framed: they are either segregated or integrated. This is where Prospect Theory comes into play, giving rise to its four marketing implications: segregation of gains, integration of losses, cancellation of losses against larger gains, and segregation of "silver linings".
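
To see why equal monetary outcomes can still be valued differently, the four framings can be run through a prospect-theory value function. The sketch below is an illustration only: the power-function form and the parameter values (`ALPHA = 0.88`, `LAMBDA = 2.25`) are assumptions taken from Tversky and Kahneman's 1992 estimates, not from this notebook or from Thaler (1985), and `value` is a hypothetical helper name.

```python
ALPHA = 0.88   # assumed curvature for gains and losses (Tversky & Kahneman, 1992)
LAMBDA = 2.25  # assumed loss-aversion coefficient (Tversky & Kahneman, 1992)


def value(x):
    """Subjective value of a gain (x >= 0) or a loss (x < 0)."""
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA * (-x) ** ALPHA


# Each scenario: (outcomes for Mr. A, outcomes for Mr. B)
scenarios = {
    "1: segregation of gains": ([50, 25], [75]),
    "2: integration of losses": ([-100, -50], [-150]),
    "3: cancellation of losses": ([100, -80], [20]),
    "4: silver lining": ([-200, 25], [-175]),
}

totals = {}
for name, (outcomes_a, outcomes_b) in scenarios.items():
    # Segregated outcomes are valued one by one and then summed
    v_a = sum(value(x) for x in outcomes_a)
    v_b = sum(value(x) for x in outcomes_b)
    totals[name] = (v_a, v_b)
    print(f"{name}: v(A) = {v_a:.2f}, v(B) = {v_b:.2f}")
```

Under these assumed parameters, A comes out ahead in scenario 1 and B in scenarios 2 and 3, which matches the majority answers in the original study. The silver-lining comparison in scenario 4 is sensitive to the chosen curvature and loss-aversion values, so the sketch only reports the numbers for it.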

The last four prompts describe the same situations as before. However, we now instruct the model to take the role of a market researcher that knows about the implications of Prospect Theory.

(Since prompt formatting matters, we do not insert line breaks mid-sentence for now, and we try to keep each scenario description for A and B on a single line.)
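
Because the prompts are written as indented triple-quoted strings, the leading spaces on each continuation line are part of the text sent to the model. The sketch below shows one possible way to inspect and normalise that whitespace. `normalize_prompt` is a hypothetical helper that is not used anywhere in this notebook, and the example prompt is a shortened stand-in.

```python
def normalize_prompt(prompt):
    """Strip leading/trailing whitespace from every line and drop empty lines."""
    lines = (line.strip() for line in prompt.splitlines())
    return "\n".join(line for line in lines if line)


# Shortened example laid out like the indented prompt strings in this notebook
example_prompt = """Mr. A won $50 in one lottery and $25 in the other. 
          Mr. B won $75 in a single lottery. Who was happier?
          A: Mister A
          B: Mister B
          C: No difference.         """

print(repr(example_prompt))                    # shows the embedded indentation
print(repr(normalize_prompt(example_prompt)))  # same text, one clean line each
```

Whether to normalise is a design decision: changing the whitespace changes the prompt, so results obtained with and without it are not directly comparable.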

- Prompt 1: Segregation of gains (unprimed)

In [6]:
PT_prompt_1 = """Mr. A was given tickets involving the World Series. He won 50$ in one lottery and $25 in the other. 
          Mr. B was given a ticket to a single, larger World Series lottery. He won $75. Based solely on this information, Who was happier? 
          A: Mister A
          B: Mister B
          C: No difference.         
          Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 2: Integration of losses (unprimed)

In [7]:
PT_prompt_2 = """Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $100. 
         He received a similar letter the same day from his state income tax authority saying he owed $50. There were no other repercussions from either mistake. 
         Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $150. There were no other repercussions from his mistake. 
         Based solely on this information, who was more upset? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

- Prompt 3: Cancellation of losses against larger gains (unprimed) # Add info about same day?

In [8]:
PT_prompt_3 = """Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident, he damaged the rug in his apartment and had to pay the landlord $80.
         Mr. B bought his first New York State lottery ticket and won $20. Based solely on this information, who was happier? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 4: Segregation of "silver linings" (unprimed)

In [9]:
PT_prompt_4 = """Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The same day the car was damaged, he won $25 in the office football pool.
         Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage. Based solely on this information, who was more upset?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

- Prompt 5: Segregation of gains (primed)

In [10]:
PT_prompt_5 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation: 
          Mr. A was given tickets involving the World Series. He won 50$ in one lottery and 25$ in the other. 
          Mr. B was given a ticket to a single, larger World Series lottery. He won 75$. Based solely on this information, who was happier?
          A: Mister A
          B: Mister B
          C: No difference.
          Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 6: Integration of losses (primed)

In [11]:
PT_prompt_6 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $100. 
         He received a similar letter the same day from his state income tax authority saying he owed $50. There were no other repercussions from either mistake. 
         Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $150. There were no other repercussions from his mistake. 
         Based solely on this information, who was more upset? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

- Prompt 7: Cancellation of losses against larger gains (primed) # Add info about same day?

In [12]:
PT_prompt_7 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident, he damaged the rug in his apartment and had to pay the landlord $80.
         Mr. B bought his first New York State lottery ticket and won $20? Based solely on this information, who was happier?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 8: Segregation of "silver linings" (primed)

In [13]:
PT_prompt_8 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The same day the car was damaged, he won $25 in the office football pool.
         Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage. Based solely on this information, who was more upset?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""
# Who is more upset?

- Save prompts to use them in the Dashboard

In [14]:
PT_prompts = [PT_prompt_1, PT_prompt_2, PT_prompt_3, PT_prompt_4, PT_prompt_5, PT_prompt_6, PT_prompt_7, PT_prompt_8]
with open ('Dashboard/src/data/Input/PT_prompts.pkl', 'wb') as file:
    pickle.dump(PT_prompts, file)

------------------------------

- Helpful dictionaries 

The experiments we will run in this notebook are very similar in study design and, in some cases, also in the results we expect. We therefore need to make sure that we associate the results with the correct study design. For this reason, the following dictionaries are implemented to look up, for example, which model was used for an experiment.

They will also be used inside the functions that call the API multiple times and output some information about the experiment in order to identify it correctly. 
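
The experiment IDs follow the pattern `PT_<model>_<prompt>`, so some of these lookups can also be derived from the ID itself. The sketch below illustrates this for the scenario and priming lookups. The names `MODEL_IDS`, `PROMPT_IDS`, `scenario_lookup` and `priming_lookup` are hypothetical, and the notebook itself spells the dictionaries out by hand.

```python
from itertools import product

MODEL_IDS = [1, 2, 3]      # first number in the experiment ID
PROMPT_IDS = range(1, 9)   # second number in the experiment ID (prompts 1-8)

experiment_ids = [f"PT_{m}_{p}" for m, p in product(MODEL_IDS, PROMPT_IDS)]

# Prompts 1-4 and 5-8 both cover scenarios 1-4
scenario_lookup = {
    f"PT_{m}_{p}": (p - 1) % 4 + 1 for m, p in product(MODEL_IDS, PROMPT_IDS)
}

# Prompts 5-8 are the primed variants (1 = primed, 0 = unprimed)
priming_lookup = {
    f"PT_{m}_{p}": int(p > 4) for m, p in product(MODEL_IDS, PROMPT_IDS)
}

print(scenario_lookup["PT_2_7"], priming_lookup["PT_2_7"])
```

Generating the lookups keeps them consistent when a model or prompt is added, at the cost of hiding the explicit mapping that the hand-written dictionaries make visible.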

In [15]:
# Dictionary that returns the literal prompt for a given experiment id (used in function call). key: experiment_id, value: prompt
PT_experiment_prompts_dict = {
    "PT_1_1": PT_prompts[0],
    "PT_1_2": PT_prompts[1],
    "PT_1_3": PT_prompts[2],
    "PT_1_4": PT_prompts[3],
    "PT_1_5": PT_prompts[4],
    "PT_1_6": PT_prompts[5],
    "PT_1_7": PT_prompts[6],
    "PT_1_8": PT_prompts[7],
    "PT_2_1": PT_prompts[0],
    "PT_2_2": PT_prompts[1],
    "PT_2_3": PT_prompts[2],
    "PT_2_4": PT_prompts[3],
    "PT_2_5": PT_prompts[4],
    "PT_2_6": PT_prompts[5],
    "PT_2_7": PT_prompts[6],
    "PT_2_8": PT_prompts[7],
    "PT_3_1": PT_prompts[0],
    "PT_3_2": PT_prompts[1],
    "PT_3_3": PT_prompts[2],
    "PT_3_4": PT_prompts[3],
    "PT_3_5": PT_prompts[4],
    "PT_3_6": PT_prompts[5],
    "PT_3_7": PT_prompts[6],
    "PT_3_8": PT_prompts[7],
}

# The following dictionary is only used for a check in the function calls.
# It returns the variable name of the prompt that was used in the experiment. key: experiment_id, value: prompt_name
PT_prompt_ids_dict = {
    "PT_1_1": "PT_prompts[0]",
    "PT_1_2": "PT_prompts[1]",
    "PT_1_3": "PT_prompts[2]",
    "PT_1_4": "PT_prompts[3]",
    "PT_1_5": "PT_prompts[4]",
    "PT_1_6": "PT_prompts[5]",
    "PT_1_7": "PT_prompts[6]",
    "PT_1_8": "PT_prompts[7]",
    "PT_2_1": "PT_prompts[0]",
    "PT_2_2": "PT_prompts[1]",
    "PT_2_3": "PT_prompts[2]",
    "PT_2_4": "PT_prompts[3]",
    "PT_2_5": "PT_prompts[4]",
    "PT_2_6": "PT_prompts[5]",
    "PT_2_7": "PT_prompts[6]",
    "PT_2_8": "PT_prompts[7]",
    "PT_3_1": "PT_prompts[0]",
    "PT_3_2": "PT_prompts[1]",
    "PT_3_3": "PT_prompts[2]",
    "PT_3_4": "PT_prompts[3]",
    "PT_3_5": "PT_prompts[4]",
    "PT_3_6": "PT_prompts[5]",
    "PT_3_7": "PT_prompts[6]",
    "PT_3_8": "PT_prompts[7]",
}

# Dictionary to look up which model to use for a given experiment id (used in function call). key: experiment id, value: model name
PT_model_dict = {
    "PT_1_1": "gpt-3.5-turbo",
    "PT_1_2": "gpt-3.5-turbo",
    "PT_1_3": "gpt-3.5-turbo",
    "PT_1_4": "gpt-3.5-turbo",
    "PT_1_5": "gpt-3.5-turbo",
    "PT_1_6": "gpt-3.5-turbo",
    "PT_1_7": "gpt-3.5-turbo",
    "PT_1_8": "gpt-3.5-turbo",
    "PT_2_1": "gpt-4-1106-preview",
    "PT_2_2": "gpt-4-1106-preview",
    "PT_2_3": "gpt-4-1106-preview",
    "PT_2_4": "gpt-4-1106-preview",
    "PT_2_5": "gpt-4-1106-preview",
    "PT_2_6": "gpt-4-1106-preview",
    "PT_2_7": "gpt-4-1106-preview",
    "PT_2_8": "gpt-4-1106-preview",
    "PT_3_1": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_2": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_3": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_4": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_5": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_6": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_7": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_8": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    }

# Dictionary to look up the scenario number of a given experiment ID. key: experiment id, value: scenario number
PT_scenario_dict = {
    "PT_1_1": 1,
    "PT_1_2": 2,
    "PT_1_3": 3,
    "PT_1_4": 4,
    "PT_1_5": 1,
    "PT_1_6": 2,
    "PT_1_7": 3,
    "PT_1_8": 4,
    "PT_2_1": 1,
    "PT_2_2": 2,
    "PT_2_3": 3,
    "PT_2_4": 4,
    "PT_2_5": 1,
    "PT_2_6": 2,
    "PT_2_7": 3,
    "PT_2_8": 4,
    "PT_3_1": 1,
    "PT_3_2": 2,
    "PT_3_3": 3,
    "PT_3_4": 4,
    "PT_3_5": 1,
    "PT_3_6": 2,
    "PT_3_7": 3,
    "PT_3_8": 4,
}   

# Dictionary to look up, whether an experiment used a primed or unprimed prompt. key: experiment id, value: 1 if primed, 0 if unprimed
PT_priming_dict = {
    "PT_1_1": 0,
    "PT_1_2": 0,
    "PT_1_3": 0,
    "PT_1_4": 0,
    "PT_1_5": 1,
    "PT_1_6": 1,
    "PT_1_7": 1,
    "PT_1_8": 1,
    "PT_2_1": 0,
    "PT_2_2": 0,
    "PT_2_3": 0,
    "PT_2_4": 0,
    "PT_2_5": 1,
    "PT_2_6": 1,
    "PT_2_7": 1,
    "PT_2_8": 1,
    "PT_3_1": 0,
    "PT_3_2": 0,
    "PT_3_3": 0,
    "PT_3_4": 0,
    "PT_3_5": 1,
    "PT_3_6": 1,
    "PT_3_7": 1,
    "PT_3_8": 1,
}

# Dictionary to look up original results of the Prospect Theory experiments. Key: experiment id, value: original results
PT_results_dict = {
    "PT_1_1": PT_p_scenario1,
    "PT_1_2": PT_p_scenario2,
    "PT_1_3": PT_p_scenario3,
    "PT_1_4": PT_p_scenario4,
    "PT_1_5": PT_p_scenario1,
    "PT_1_6": PT_p_scenario2,
    "PT_1_7": PT_p_scenario3,
    "PT_1_8": PT_p_scenario4,
    "PT_2_1": PT_p_scenario1,
    "PT_2_2": PT_p_scenario2,
    "PT_2_3": PT_p_scenario3,
    "PT_2_4": PT_p_scenario4,
    "PT_2_5": PT_p_scenario1,
    "PT_2_6": PT_p_scenario2,
    "PT_2_7": PT_p_scenario3,
    "PT_2_8": PT_p_scenario4,
    "PT_3_1": PT_p_scenario1,
    "PT_3_2": PT_p_scenario2,
    "PT_3_3": PT_p_scenario3,
    "PT_3_4": PT_p_scenario4,
    "PT_3_5": PT_p_scenario1,
    "PT_3_6": PT_p_scenario2,
    "PT_3_7": PT_p_scenario3,
    "PT_3_8": PT_p_scenario4,
    }

# Dictionary to look up number of original answers. key: experiment id, value: number of original answers
PT_answercount_dict = {
    "PT_1_1": 87,
    "PT_1_2": 87,
    "PT_1_3": 87,
    "PT_1_4": 87,
    "PT_1_5": 87,
    "PT_1_6": 87,
    "PT_1_7": 87,
    "PT_1_8": 87,
    "PT_2_1": 87,
    "PT_2_2": 87,
    "PT_2_3": 87,
    "PT_2_4": 87,
    "PT_2_5": 87,
    "PT_2_6": 87,
    "PT_2_7": 87,
    "PT_2_8": 87,
    "PT_3_1": 87,
    "PT_3_2": 87,
    "PT_3_3": 87,
    "PT_3_4": 87,
    "PT_3_5": 87,
    "PT_3_6": 87,
    "PT_3_7": 87,
    "PT_3_8": 87,
    }


In [16]:
# Collect and save for use in Dashboard
PT_dictionaries = [PT_experiment_prompts_dict, PT_prompt_ids_dict, PT_model_dict, PT_scenario_dict, PT_priming_dict, PT_results_dict, PT_answercount_dict]
with open ('Dashboard/src/data/Input/PT_dictionaries.pkl', 'wb') as file:
    pickle.dump(PT_dictionaries, file)

---------------------------------

#### Setting up functions to repeatedly prompt ChatGPT

- Functions to query 1 prompt n times

In [17]:
def PT_run_experiment(experiment_id, n, progress_bar, temperature):

    """
    Function to query ChatGPT multiple times with a survey having answers designed as: A, B, C.
    
    Args:
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        n (int): Number of queries to be made
        progress_bar (tqdm): Progress bar that is advanced by one step after every query
        temperature (float): Degree of randomness with range 0 (deterministic) to 2 (random)
        
    Returns:
        results (list): List containing count of answers for each option, also containing experiment_id, temperature and number of observations
        probs (list): List containing probability of each option being chosen, also containing experiment_id, temperature and number of observations
    """
    
    answers = []
    for _ in range(n): 
        response = client.chat.completions.create(
            model = PT_model_dict[experiment_id], 
            max_tokens = 1,
            temperature = temperature, # range is 0 to 2
            messages = [
            {"role": "system", "content": "Only answer with the letter of the alternative you would choose without any reasoning."},        
            {"role": "user", "content": PT_experiment_prompts_dict[experiment_id]},
                   ])

        # Store the answer in the list
        answer = response.choices[0].message.content
        answers.append(answer.strip())
        # Update progress bar (given from either temperature loop, or set locally)
        progress_bar.update(1)

    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of "correct" answers: sums over an indicator function that checks if the answer is either A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id],
               PT_priming_dict[experiment_id], PT_results_dict[experiment_id], PT_answercount_dict[experiment_id]]

    # Getting percentage of each answer
    p_a = f"{(A / len_correct) * 100 if len_correct != 0 else 0:.2f}%"
    p_b = f"{(B / len_correct) * 100 if len_correct != 0 else 0:.2f}%"
    p_c = f"{(C / len_correct) * 100 if len_correct != 0 else 0:.2f}%"

    # Collect probabilities in a list
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id],
             PT_priming_dict[experiment_id], PT_results_dict[experiment_id], PT_answercount_dict[experiment_id]]
    
    # Give out results
    return results, probs
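
The counting step can be checked without calling the API by separating it from the request loop. The following is a minimal sketch under that assumption: `tally_answers` and `VALID_ANSWERS` are hypothetical names, and the answers are made up for illustration.

```python
from collections import Counter

VALID_ANSWERS = ("A", "B", "C")


def tally_answers(answers):
    """Count valid answers and convert the counts to percentages."""
    counts = Counter(answer.strip() for answer in answers)
    # Only A, B and C count as valid; everything else is ignored
    n_valid = sum(counts[option] for option in VALID_ANSWERS)
    percentages = {
        option: (counts[option] / n_valid * 100 if n_valid else 0.0)
        for option in VALID_ANSWERS
    }
    valid_counts = {option: counts[option] for option in VALID_ANSWERS}
    return valid_counts, percentages, n_valid


# Made-up responses; "D" and "" stand in for invalid completions
fake_answers = ["A", "B", "A", " C", "D", "A", ""]
counts, percentages, n_valid = tally_answers(fake_answers)
print(counts, n_valid)
```

As in the function above, percentages are computed relative to the number of valid answers, not relative to `n`.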

- Adjusted function for dashboard  (returns dataframe with regular numbers, not percent)

In [22]:
def PT_run_experiment_dashboard(experiment_id, n, temperature):

    """
    Function to query ChatGPT multiple times with a survey having answers designed as: A, B, C.
    
    Args:
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        n (int): Number of queries to be made
        temperature (float): Degree of randomness with range 0 (deterministic) to 2 (random)
        
    Returns:
        results (pd.DataFrame): One-row dataframe containing count of answers for each option, also containing experiment_id, temperature and number of observations
        probs (pd.DataFrame): One-row dataframe containing probability (in percent) of each option being chosen, also containing experiment_id, temperature and number of observations
    """
    
    answers = []
    for _ in range(n): 
        response = client.chat.completions.create(
            model = PT_model_dict[experiment_id], 
            max_tokens = 1,
            temperature = temperature, # range is 0 to 2
            messages = [
            {"role": "system", "content": "Only answer with the letter of the alternative you would choose without any reasoning."},        
            {"role": "user", "content": PT_experiment_prompts_dict[experiment_id]},
                   ])

        # Store the answer in the list
        answer = response.choices[0].message.content
        answers.append(answer.strip())

    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of "correct" answers: sums over an indicator function that checks if the answer is either A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id],
               PT_priming_dict[experiment_id], f"{PT_results_dict[experiment_id]}", PT_answercount_dict[experiment_id]]
    results = pd.DataFrame(results)
    results = results.set_index(pd.Index(["Experiment_id", "Temp", "A", "B", "C", "Obs.", "Model", "Scenario", "Priming", "Original", "Original_count"]))
    results = results.transpose()

    # Getting percentage of each answer
    p_a = (A / len_correct) * 100 if len_correct != 0 else 0
    p_b = (B / len_correct) * 100 if len_correct != 0 else 0
    p_c = (C / len_correct) * 100 if len_correct != 0 else 0

    # Collect probabilities in a dataframe
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id],
             PT_priming_dict[experiment_id], f"{PT_results_dict[experiment_id]}", PT_answercount_dict[experiment_id]]
    probs = pd.DataFrame(probs)
    probs = probs.set_index(pd.Index(["Experiment_id", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming", "Original", "Original_count"]))
    probs = probs.transpose()
        
    # Give out results
    return results, probs

- Function to query 1 prompt n times (Llama)

In [18]:
def PT_run_experiment_llama(experiment_id, n, progress_bar, temperature):
    answers = []
    for _ in range(n):
        response = replicate.run(
            PT_model_dict[experiment_id],
            input = {
                "system_prompt": "Only answer with the letter of the alternative you would choose without any reasoning.",
                "temperature": temperature,
                "max_new_tokens": 2, 
                "prompt": PT_experiment_prompts_dict[experiment_id]
            }
        )
        # Grab answer and append to list
        answer = "" # Set to empty string, otherwise it would append the previous answer to the new one
        for item in response:
            answer = answer + item
        answers.append(answer.strip())

        # Update progress bar
        progress_bar.update(1)

    # Counting results
    A = answers.count("A") 
    B = answers.count("B") 
    C = answers.count("C") 

    # Count of "correct" answers: sums over an indicator function that checks if the answer is either A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id],
               PT_priming_dict[experiment_id], PT_results_dict[experiment_id], PT_answercount_dict[experiment_id]]

    # Getting percentage of each answer
    p_a = f"{(A / len_correct) * 100 if len_correct != 0 else 0:.2f}%"
    p_b = f"{(B / len_correct) * 100 if len_correct != 0 else 0:.2f}%"
    p_c = f"{(C / len_correct) * 100 if len_correct != 0 else 0:.2f}%"

    # Collect probabilities in a list
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id],
             PT_priming_dict[experiment_id], PT_results_dict[experiment_id], PT_answercount_dict[experiment_id]]
    
    # Give out results
    return results, probs

- Adjusted function for dashboard  (returns dataframe with regular numbers, not percent)

In [None]:
def PT_run_experiment_llama_dashboard(experiment_id, n, temperature):
    answers = []
    for _ in range(n):
        response = replicate.run(
            PT_model_dict[experiment_id],
            input = {
                "system_prompt": "Only answer with the letter of the alternative you would choose without any reasoning.",
                "temperature": temperature,
                "max_new_tokens": 2, 
                "prompt": PT_experiment_prompts_dict[experiment_id]
            }
        )
        # Grab answer and append to list
        answer = "" # Set to empty string, otherwise it would append the previous answer to the new one
        for item in response:
            answer = answer + item
        answers.append(answer.strip())


    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of "correct" answers: sums over an indicator function that checks if the answer is either A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])
    
    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id],
               PT_priming_dict[experiment_id], f"{PT_results_dict[experiment_id]}", PT_answercount_dict[experiment_id]]
    results = pd.DataFrame(results)
    results = results.set_index(pd.Index(["Experiment_id", "Temp", "A", "B", "C", "Obs.", "Model", "Scenario", "Priming", "Original", "Original_count"]))
    results = results.transpose()


    # Getting percentage of each answer
    p_a = (A / len_correct) * 100 if len_correct != 0 else 0
    p_b = (B / len_correct) * 100 if len_correct != 0 else 0
    p_c = (C / len_correct) * 100 if len_correct != 0 else 0

    # Collect probabilities in a dataframe
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id],
             PT_priming_dict[experiment_id], f"{PT_results_dict[experiment_id]}", PT_answercount_dict[experiment_id]]
    probs = pd.DataFrame(probs)
    probs = probs.set_index(pd.Index(["Experiment_id", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming", "Original", "Original_count"]))
    probs = probs.transpose()
        
    # Give out results
    return results, probs

- Function to loop run_experiment() over a list of temperature values

In [19]:
def PT_temperature_loop(function, experiment_id, temperature_list = [0, 0.5, 1, 1.5, 2], n = 50):
    """
    Function to run an experiment with different temperature values.
    
    Args:
        function (function): Function to be used for querying the model, e.g. PT_run_experiment()
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        temperature_list (list): List of temperature values to be looped over
        n (int): Number of requests for each prompt per temperature value
        
    Returns:
        results_df: Dataframe with experiment results
        probs_df: Dataframe with answer probabilities
    """    
    # Empty lists for storing results
    results_list = []
    probs_list = []
    # Initialize progress bar -> used as input for run_experiment()
    progress_bar = tqdm(range(n*len(temperature_list)))

    # Loop over different temperature values, calling the input function once per value (i.e. querying the model n times each)
    for temperature in temperature_list:
        results, probs = function(experiment_id = experiment_id, n = n, temperature = temperature, progress_bar = progress_bar) 
        results_list.append(results)
        probs_list.append(probs)

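    # Build one row per temperature, transpose so that each temperature becomes a column, and label the rows.
    # Note: in results_df the rows labelled p(A), p(B), p(C) hold raw answer counts, not probabilities.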
    results_df = pd.DataFrame(results_list).transpose().set_index(pd.Index(["Experiment_id", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming", "Original", "Original_count"]))
    probs_df = pd.DataFrame(probs_list).transpose().set_index(pd.Index(["Experiment_id", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming", "Original", "Original_count"]))
   

    # Print information about the experiment
    print(f"In this run, a total of {n*len(temperature_list)} requests were made using {PT_prompt_ids_dict[experiment_id]}.")

    return results_df, probs_df
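
The loop can be tried out without spending API requests by passing in a stand-in for the query function. The sketch below assumes such a stub: `fake_run_experiment` and `temperature_sweep` are hypothetical names, the answers are drawn at random and carry no meaning, and a tuple is used as the default so the default value cannot be mutated between calls.

```python
import random


def fake_run_experiment(experiment_id, n, temperature, seed=0):
    """Stand-in for an API-backed function: draws n random answers."""
    rng = random.Random(seed)
    answers = [rng.choice("ABC") for _ in range(n)]
    return {
        "Experiment_id": experiment_id,
        "Temp": temperature,
        "A": answers.count("A"),
        "B": answers.count("B"),
        "C": answers.count("C"),
        "Obs.": len(answers),
    }


def temperature_sweep(function, experiment_id, temperatures=(0, 0.5, 1, 1.5, 2), n=50):
    """Call `function` once per temperature and collect one row per call."""
    return [
        function(experiment_id=experiment_id, n=n, temperature=t)
        for t in temperatures
    ]


rows = temperature_sweep(fake_run_experiment, "PT_1_1", n=10)
for row in rows:
    print(row)
```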

-------------

## Comparing different LLMs

The results variables will be structured as: results_model-id_prompt-id.

We will refer to "GPT-3.5-turbo" as model 1, "GPT-4-1106-preview" as model 2, and "Llama-2-70b-chat" as model 3.

#### Model 1: GPT-3.5-Turbo (Model training ended in September 2021)

- Prompt 1: Segregation of gains (unprimed)

In [None]:
# Set number of requests per temperature value
N = 100

In [None]:
results_1_1, probs_1_1 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_1", n = N)

- Prompt 2: Integration of losses (unprimed)

In [None]:
results_1_2, probs_1_2 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_2", n = N)

- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [None]:
results_1_3, probs_1_3 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_3", n = N)

- Prompt 4: Segregation of "silver linings" (unprimed)

In [None]:
results_1_4, probs_1_4 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_4", n = N)

- Prompt 5: Segregation of gains (primed)

In [None]:
results_1_5, probs_1_5 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_5", n = N)

- Prompt 6: Integration of losses (primed)

In [None]:
results_1_6, probs_1_6 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_6", n = N)

- Prompt 7: Cancellation of losses against larger gains (primed)

In [None]:
results_1_7, probs_1_7 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_7", n = N)

- Prompt 8: Segregation of "silver linings" (primed)

In [None]:
results_1_8, probs_1_8 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_8", n = N)

------------------------------------------

#### Model 2: GPT-4-1106-preview (Model training ended in April 2023)

Since prompting GPT-4 is much more expensive, we use only 50 requests per temperature value instead of the 100 used for GPT-3.5-turbo.

In [None]:
# Set number of requests per temperature value
N = 50
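
To get a feeling for the size of each run, the total number of requests can be estimated as requests per temperature value × number of temperature values × number of prompts. This is only a rough sketch: it assumes the five temperature values that appear in the saved GPT-4 results (0, 0.5, 1, 1.5, 2) and eight prompts per model; `total_requests` is a hypothetical helper.

```python
# Assumption: temperature grid for the GPT models, taken from the values
# that appear in the saved results for gpt-4-1106-preview.
temperature_list = [0, 0.5, 1, 1.5, 2]
n_prompts = 8  # prompts 1-8 per model


def total_requests(n: int, temperatures: list, prompts: int) -> int:
    """Requests needed to run every prompt n times at every temperature."""
    return n * len(temperatures) * prompts


print(total_requests(100, temperature_list, n_prompts))  # N = 100 -> 4000
print(total_requests(50, temperature_list, n_prompts))   # N = 50  -> 2000
```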

- Prompt 1: Segregation of gains (unprimed)

In [None]:
results_2_1, probs_2_1 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_1", n = N)

- Prompt 2: Integration of losses (unprimed)

In [None]:
results_2_2, probs_2_2 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_2", n = N)

- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [None]:
results_2_3, probs_2_3 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_3", n = N)

- Prompt 4: Segregation of "silver linings" (unprimed)

In [None]:
results_2_4, probs_2_4 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_4", n = N)

- Prompt 5: Segregation of gains (primed)

In [None]:
results_2_5, probs_2_5 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_5", n = N)

- Prompt 6: Integration of losses (primed)

In [None]:
results_2_6, probs_2_6 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_6", n = N)

- Prompt 7: Cancellation of losses against larger gains (primed)

In [None]:
results_2_7, probs_2_7 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_7", n = N)

- Prompt 8: Segregation of "silver linings" (primed)

In [None]:
results_2_8, probs_2_8 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_8", n = N)

--------------------------------------------

#### Model 3: Llama-2-70b

**Note:** Use `max_new_tokens` of at least 2, as Llama tends to begin its answers with a blank space.
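
With a budget of a single token, a reply that starts with a blank may not contain an answer letter at all. The following is a minimal sketch of how a raw completion could be normalized before counting, assuming the answer options are the single letters A, B, and C. `clean_answer` is a hypothetical helper and not necessarily how `PT_run_experiment_llama` handles it.

```python
from typing import Optional

VALID_ANSWERS = {"A", "B", "C"}


def clean_answer(raw: str) -> Optional[str]:
    """Return the answer letter, ignoring surrounding whitespace (hypothetical helper)."""
    stripped = raw.strip()
    if not stripped:
        return None  # e.g. the model only produced a blank
    letter = stripped[0].upper()
    return letter if letter in VALID_ANSWERS else None


print(clean_answer(" A"))  # A
print(clean_answer(" "))   # None
```

Only the first non-blank character is inspected, so this is suitable for very short completions only.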

In [None]:
temperature_list = [0.01, 0.5, 1, 1.5, 2] # Llama does not accept a temperature of 0 and has a maximum temperature of 5
N = 50 # number of requests per temperature value 

- Prompt 1: Segregation of gains (unprimed)

In [None]:
results_3_1, probs_3_1 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_1", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)

- Prompt 2: Integration of losses (unprimed)

In [None]:
results_3_2, probs_3_2 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_2", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)

- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [None]:
results_3_3, probs_3_3 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_3", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)    

- Prompt 4: Segregation of "silver linings" (unprimed)

In [None]:
results_3_4, probs_3_4 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_4", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)

- Prompt 5: Segregation of gains (primed)

In [None]:
results_3_5, probs_3_5 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_5", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)

- Prompt 6: Integration of losses (primed)

In [None]:
results_3_6, probs_3_6 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_6", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)

- Prompt 7: Cancellation of losses against larger gains (primed)

In [None]:
results_3_7, probs_3_7 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_7", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)

- Prompt 8: Segregation of "silver linings" (primed)

In [None]:
results_3_8, probs_3_8 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_8", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)

---

- Save the results


In [None]:
# Gather all results
PT_probs = pd.concat([probs_1_1, probs_1_2, probs_1_3, probs_1_4, probs_1_5, probs_1_6, probs_1_7, probs_1_8,
                      probs_2_1, probs_2_2, probs_2_3, probs_2_4, probs_2_5, probs_2_6, probs_2_7, probs_2_8,
                      probs_3_1, probs_3_2, probs_3_3, probs_3_4, probs_3_5, probs_3_6, probs_3_7, probs_3_8], axis = 1).transpose()

# Rename the Llama model to a shorter name
PT_probs['Model'] = PT_probs['Model'].replace('meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
                                              'llama-2-70b')

# Convert the probabilities from percentage strings to floats for plotting.
# Storing them as plain numbers in the first place would make this step unnecessary.
PT_probs["p(A)"] = PT_probs["p(A)"].str.rstrip('%').astype('float')
PT_probs["p(B)"] = PT_probs["p(B)"].str.rstrip('%').astype('float')
PT_probs["p(C)"] = PT_probs["p(C)"].str.rstrip('%').astype('float')

# Save to csv 
PT_probs.to_csv("Dashboard/src/data/Output/PT_probs.csv", index = False)
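
The `.str` accessor only works on string columns, so re-running the conversion above on columns that are already numeric raises an `AttributeError`. Below is a minimal sketch of a conversion that can be run repeatedly; `percent_to_float` is a hypothetical helper, and the small DataFrame is toy data for illustration only.

```python
import pandas as pd


def percent_to_float(column: pd.Series) -> pd.Series:
    """Convert strings such as "98.00%" to floats; leave numeric columns unchanged."""
    if pd.api.types.is_numeric_dtype(column):
        return column.astype("float")
    return column.astype(str).str.rstrip("%").astype("float")


# Toy data for illustration only, not results from the experiments
toy = pd.DataFrame({
    "p(A)": ["2.00%", "6.00%"],
    "p(B)": ["0.00%", "2.00%"],
    "p(C)": ["98.00%", "92.00%"],
})

for col in ["p(A)", "p(B)", "p(C)"]:
    toy[col] = percent_to_float(toy[col])
    toy[col] = percent_to_float(toy[col])  # a second call leaves the values unchanged

print(toy.dtypes)
```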

---

### Visualization of results

In [108]:
PT_probs = pd.read_csv("Dashboard/src/data/Output/PT_probs.csv")
PT_probs

| Unnamed: 0 | Experiment_id | Temp | p(A) | p(B) | p(C) | Obs. | Model | Priming | Scenario | Original | Original_count |
|------------|---------------|------|------|------|------|------|-------|---------|----------|----------|----------------|
| 0 | PT_2_1 | 0.00 | 0.0 | 0.00 | 100.00 | 50 | gpt-4-1106-preview | 0 | 1 | [64.37, 18.39, 17.24] | 87 |
| 1 | PT_2_1 | 0.50 | 0.0 | 0.00 | 100.00 | 50 | gpt-4-1106-preview | 0 | 1 | [64.37, 18.39, 17.24] | 87 |
| 2 | PT_2_1 | 1.00 | 0.0 | 0.00 | 100.00 | 50 | gpt-4-1106-preview | 0 | 1 | [64.37, 18.39, 17.24] | 87 |
| 3 | PT_2_1 | 1.50 | 2.0 | 0.00 | 98.00 | 50 | gpt-4-1106-preview | 0 | 1 | [64.37, 18.39, 17.24] | 87 |
| 4 | PT_2_1 | 2.00 | 6.0 | 2.00 | 92.00 | 50 | gpt-4-1106-preview | 0 | 1 | [64.37, 18.39, 17.24] | 87 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 115 | PT_3_8 | 0.01 | 0.0 | 0.00 | 100.00 | 50 | llama-2-70b | 1 | 4 | [21.84, 72.41, 5.75] | 87 |
| 116 | PT_3_8 | 0.50 | 0.0 | 0.00 | 100.00 | 50 | llama-2-70b | 1 | 4 | [21.84, 72.41, 5.75] | 87 |
| 117 | PT_3_8 | 1.00 | 0.0 | 0.00 | 100.00 | 50 | llama-2-70b | 1 | 4 | [21.84, 72.41, 5.75] | 87 |
| 118 | PT_3_8 | 1.50 | 0.0 | 0.00 | 100.00 | 50 | llama-2-70b | 1 | 4 | [21.84, 72.41, 5.75] | 87 |
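
The `Original` column holds the answer shares from the 1985 study in percent, and `Original_count` the number of respondents. As a short worked example, the Scenario 1 values follow from the answer frequencies reported in the original study (56, 16, and 15). The sketch also shows why the plotting code needs `literal_eval`: after the CSV round trip the list is stored as text.

```python
from ast import literal_eval

# Scenario 1 answer frequencies from Thaler (1985): A, B, no difference
counts = [56, 16, 15]
total = sum(counts)
percentages = [round(100 * c / total, 2) for c in counts]
print(total, percentages)  # 87 [64.37, 18.39, 17.24]

# In the CSV the list is stored as a string and has to be parsed again
stored = "[64.37, 18.39, 17.24]"
parsed = literal_eval(stored)
print(parsed == percentages)  # True
```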


- Function to plot model results

In [84]:
def PT_plot_results(df):

    # Transpose for plotting
    df = df.transpose()  
    # Get language model name
    model = df.loc["Model"].iloc[0]
    # Get temperature value
    temperature = df.loc["Temp"].iloc[0]
    # Get number of observations per temperature value
    n_observations = df.loc["Obs."].iloc[0]
    # Get original answer probabilities
    og_answers = df.loc["Original"].apply(literal_eval).iloc[0]
    # Get number of original answers
    n_original = df.loc["Original_count"].iloc[0]

    fig = go.Figure(data=[
        go.Bar(
            name = "Model answers",
            x = ["p(A)", "p(B)", "p(C)"],
            y = [df.loc["p(A)"].iloc[0], df.loc["p(B)"].iloc[0], df.loc["p(C)"].iloc[0]],
            customdata = [n_observations, n_observations, n_observations], 
            hovertemplate = "Percentage: %{y:.2f}%<br>Number of observations: %{customdata}<extra></extra>",
            marker_color = "rgb(55, 83, 109)"
        ),
        go.Bar(
            name = "Original answers",
            x = ["p(A)","p(B)", "p(C)"],
            y = [og_answers[0], og_answers[1], og_answers[2]],
            customdata = [n_original, n_original, n_original],
            hovertemplate = "Percentage: %{y:.2f}%<br>Number of observations: %{customdata}<extra></extra>",
            marker_color = "rgb(26, 118, 255)"
        )
    ])

    fig.update_layout(
        barmode = 'group',
        xaxis = dict(
            title = "Answer options",
            title_font=dict(size=18),
        ),
        yaxis = dict(
            title="Probability (%)",
            title_font=dict(size=18),
        ),
        title = dict(
            text=f"Distribution of answers for temperature {temperature}, using model {model}",
            x = 0.5,  # Center alignment horizontally
            y = 0.87,  # Vertical alignment
            font=dict(size=22),
        ),
        legend=dict(
            x=1.01,
            y=0.9,
            font=dict(family='Arial', size=12, color='black'),
            bordercolor='black',
            borderwidth=2,
        ),
        bargap = 0.3  # Gap between the bar groups of adjacent answer options
    )
    return fig
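
`PT_plot_results` only reads the first row of the DataFrame it receives (via `.iloc[0]`), so the caller has to narrow `PT_probs` down to a single experiment and temperature first. Here is a minimal sketch of such a selection; `select_run` is a hypothetical helper, and the toy frame only stands in for `PT_probs`.

```python
import pandas as pd

# Toy stand-in for PT_probs with just the columns needed for the selection
toy_probs = pd.DataFrame({
    "Experiment_id": ["PT_2_1", "PT_2_1", "PT_3_8"],
    "Temp": [0.0, 0.5, 0.5],
})


def select_run(df: pd.DataFrame, experiment_id: str, temperature: float) -> pd.DataFrame:
    """Return the rows for one experiment at one temperature (hypothetical helper)."""
    mask = (df["Experiment_id"] == experiment_id) & (df["Temp"] == temperature)
    return df[mask]


row = select_run(toy_probs, "PT_2_1", 0.5)
print(len(row))  # 1
```

With the real data, the selected row could then be passed on, for example `PT_plot_results(select_run(PT_probs, "PT_2_1", 0.5))`.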


- Function to plot original results

In [61]:
def PT_plot_og_results(df):
    n_original = df["Obs."]  # number of observations per scenario (shown on hover)
    fig = go.Figure(data=[
        go.Bar(
                name = "p(A)",
                x = [0.1, 0.3, 0.5, 0.7],
                y = [df["p(A)"][0], df["p(A)"][1], df["p(A)"][2], df["p(A)"][3]],
                customdata = n_original,
                hovertemplate = "Percentage: %{y:.2f}%<br>Number of observations: %{customdata}<extra></extra>",
                marker_color="black",
            ),
        go.Bar(
                name = "p(B)",
                x = [0.15, 0.35, 0.55, 0.75],
                y = [df["p(B)"][0], df["p(B)"][1], df["p(B)"][2], df["p(B)"][3]],
                customdata = n_original,
                hovertemplate = "Percentage: %{y:.2f}%<br>Number of observations: %{customdata}<extra></extra>",
                marker_color="rgb(55, 83, 109)",

            ),
        go.Bar(
                name = "p(C)",
                x = [0.2, 0.4, 0.6, 0.8],
                y = [df["p(C)"][0], df["p(C)"][1], df["p(C)"][2], df["p(C)"][3]],
                customdata = n_original,
                hovertemplate = "Percentage: %{y:.2f}%<br>Number of observations: %{customdata}<extra></extra>",
                marker_color="rgb(26, 118, 255)",
        )
    ])
  

    fig.update_layout(
        barmode = 'group',
        xaxis = dict(
            title = "Scenarios",
            title_font=dict(size=18),
            tickfont=dict(size=16),
        ),
        yaxis = dict(
            title="Probability (%)",
            title_font=dict(size=18),
        ),
        title = dict(
            text="Distribution of original answers per scenario",
            x = 0.5,
            y = 0.87,
            font=dict(size=22),
        ),
        width = 1000,
        margin=dict(t=100),
        legend=dict(
            x=1.01,
            y=0.9,
            font=dict(family='Arial', size=12, color='black'),
            bordercolor='black',
            borderwidth=2,
        ),
    )
    # Place one tick at the center of each group and label it with its scenario
    fig.update_xaxes(
        tickvals=[0.15, 0.35, 0.55, 0.75],
        ticktext=["Scenario 1", "Scenario 2", "Scenario 3", "Scenario 4"],
    )
    return fig 
    

In [None]:
PT_plot_og_results(PT_original_results)
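
The hard-coded x positions in `PT_plot_og_results` follow a simple pattern: each scenario starts 0.2 after the previous one, and the three answer bars inside a scenario are 0.05 apart. The sketch below reproduces the lists used above from these step sizes. The variable names are hypothetical and only illustrate the layout arithmetic.

```python
n_scenarios = 4
n_options = 3      # p(A), p(B), p(C)
first_bar = 0.1    # x position of p(A) in scenario 1
group_step = 0.2   # distance between neighboring scenarios
bar_step = 0.05    # distance between bars within a scenario

# positions[k][s] is the x position of answer option k in scenario s
positions = [
    [round(first_bar + s * group_step + k * bar_step, 2) for s in range(n_scenarios)]
    for k in range(n_options)
]

# The middle bar, p(B), sits at the center of each group, so it carries the tick label
tickvals = positions[1]

for option, xs in zip(["p(A)", "p(B)", "p(C)"], positions):
    print(option, xs)
```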