## Prospect Theory

This notebook aims to recreate some of the findings of **Thaler, Richard (1985), “Mental Accounting and Consumer Choice,” Marketing Science, 4 (3), 199–214.** Specifically, we try to see if LLMs like **ChatGPT** abide by some rules of Mental Accounting and Prospect Theory.


Note: consider using the past tense in every prompt ("Who was happier?"), as in the original phrasing.

## Original study: 

### Scenario 1: Segregation of gains
- Mr. A was given tickets to lotteries involving the World Series. He won $50 in one lottery and $25 in the other.
- Mr. B was given a ticket to a single, larger World Series lottery. He won $75. Who is happier?

| Answer option | Frequency |
|--------------|-----------|
| A            | 56        |
| B            | 16        |
| No difference | 15      |

(empirical results from the 1985 study) -> No segregation of gains for B

### Scenario 2: Integration of losses
- Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his
tax return and owed $100. He received a similar letter the same day from his state income tax
authority saying he owed $50. There were no other repercussions from either mistake.
- Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax
return and owed $150. There were no other repercussions from his mistake. Who was more upset?

| Answer option | Frequency |
|--------------|-----------|
| A            | 66        |
| B            | 14        |
| No difference | 7      |

(empirical results from the 1985 study) -> No integration of losses for B

### Scenario 3: Cancellation of losses against larger gains
- Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident,
he damaged the rug in his apartment and had to pay the landlord $80.
- Mr. B bought his first New York State lottery ticket and won $20. Who is happier?

| Answer option | Frequency |
|--------------|-----------|
| A            | 22        |
| B            | 61        |
| No difference | 4      |

(empirical results from the 1985 study) -> No cancellation of losses against larger gains for A


### Scenario 4: Segregation of "silver linings"
- Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The
same day the car was damaged, he won $25 in the office football pool.
- Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage.
Who was more upset?

| Answer option | Frequency |
|--------------|-----------|
| A            | 19        |
| B            | 63        |
| No difference | 5      |

(empirical results from the 1985 study) -> No segregation of "silver linings" for B.
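
The majority answers in Scenarios 1 and 2 are what a value function that is concave for gains and convex for losses would predict. The following is a minimal sketch of that arithmetic, not part of the original study: it assumes the power-form value function with the parameter estimates from Tversky and Kahneman (1992), which Thaler (1985) does not specify. The names `ALPHA`, `LAMBDA` and `value` are illustrative.

```python
# Prospect-theory value function in power form.
# ASSUMPTION: the parameter values are the Tversky & Kahneman (1992) estimates;
# they are not taken from Thaler (1985).
ALPHA = 0.88   # curvature: concave for gains, convex for losses
LAMBDA = 2.25  # loss aversion coefficient


def value(x):
    """Subjective value of a gain (x >= 0) or a loss (x < 0)."""
    if x >= 0:
        return x ** ALPHA
    return -LAMBDA * (-x) ** ALPHA


# Scenario 1: two separate gains (Mr. A) vs. one combined gain (Mr. B)
mr_a_gains = value(50) + value(25)
mr_b_gains = value(75)

# Scenario 2: two separate losses (Mr. A) vs. one combined loss (Mr. B)
mr_a_losses = value(-100) + value(-50)
mr_b_losses = value(-150)

print(f"Scenario 1: A = {mr_a_gains:.2f}, B = {mr_b_gains:.2f}")
print(f"Scenario 2: A = {mr_a_losses:.2f}, B = {mr_b_losses:.2f}")
```

Under these assumed parameters, Mr. A's two gains sum to a higher subjective value than Mr. B's single gain, and Mr. A's two losses sum to a lower (more negative) value than Mr. B's single loss. Whether the same holds in Scenarios 3 and 4 depends on the chosen parameters, so they are left out of this sketch.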



------------------------------------

In [5]:
from openai import OpenAI
import openai
import matplotlib.pyplot as plt
import os 
import numpy as np
import pandas as pd
from tqdm import tqdm
import replicate

In [6]:
# Get OpenAI API key (previously saved as an environment variable)
openai.api_key = os.environ["OPENAI_API_KEY"]

# Set client
client = OpenAI()

# Set global plot style
plt.style.use('seaborn-v0_8')

# Set plots to be displayed in notebook
%matplotlib inline

In [7]:
# To make our results comparable to the original study, we compute original answer probabilities
p_scenario1 = [f"p(A): {round((56/(56+16+15)*100), 2)}%", f"p(B): {round((16/(56+16+15)*100), 2)}%", f"p(C): {round((15/(56+16+15)*100), 2)}%"]
p_scenario2 = [f"p(A): {round((66/(66+14+7)*100), 2)}%", f"p(B): {round((14/(66+14+7)*100), 2)}%", f"p(C): {round((7/(66+14+7)*100), 2)}%"]
p_scenario3 = [f"p(A): {round((22/(22+61+4)*100), 2)}%", f"p(B): {round((61/(22+61+4)*100), 2)}%", f"p(C): {round((4/(22+61+4)*100), 2)}%"]
p_scenario4 = [f"p(A): {round((19/(19+63+5)*100), 2)}%", f"p(B): {round((63/(19+63+5)*100), 2)}%", f"p(C): {round((5/(19+63+5)*100), 2)}%"]
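
The formatted strings above are convenient for printing, but numeric shares are easier to compare with the model results later on. A possible sketch is shown below; the names `counts_to_shares`, `original_counts` and `original_shares` are hypothetical and do not appear elsewhere in the notebook. The counts are the answer frequencies from the tables of the original study.

```python
def counts_to_shares(counts):
    """Convert a list of answer counts into percentage shares (2 decimals)."""
    total = sum(counts)
    return [round(c / total * 100, 2) for c in counts]


# Answer counts (A, B, No difference) per scenario from the 1985 study
original_counts = {
    1: [56, 16, 15],
    2: [66, 14, 7],
    3: [22, 61, 4],
    4: [19, 63, 5],
}

# Numeric shares per scenario, in the same order (A, B, No difference)
original_shares = {s: counts_to_shares(c) for s, c in original_counts.items()}

print(original_shares[1])  # [64.37, 18.39, 17.24]
```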

---------------------------

#### Setting up the prompts used for the experiment

We now formulate 8 different prompts: 

The first four prompts each describe a scenario in which two people, Mister A and Mister B, win or lose money. In every scenario, the net monetary amount gained or lost is the same for both. The scenarios differ only in whether the gains and losses are segregated or integrated, and this is where Prospect Theory comes into play. The four scenarios correspond to its four marketing implications: segregation of gains, integration of losses, cancellation of losses against larger gains, and segregation of "silver linings".

The last four prompts describe the same situations as before. However, we now instruct the model to take the role of a market researcher that knows about the implications of Prospect Theory.

(Since prompt formatting matters, for now we will not insert a line break mid-sentence and try to keep a scenario description for A/B in the same line)

- Prompt 1: Segregation of gains (unprimed)

In [8]:
prompt_1 = """Mr. A was given tickets involving the World Series. He won 50$ in one lottery and $25 in the other. 
          Mr. B was given a ticket to a single, larger World Series lottery. He won $75. Based solely on this information, Who is happier? 
          A: Mister A
          B: Mister B
          C: No difference.         
          Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 2: Integration of losses (unprimed)

In [9]:
prompt_2 = """Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $100. 
         He received a similar letter the same day from his state income tax authority saying he owed $50. There were no other repercussions from either mistake. 
         Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $150. There were no other repercussions from his mistake. 
         Based solely on this information, who was more upset? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

- Prompt 3: Cancellation of losses against larger gains (unprimed) # Add info about same day?

In [10]:
prompt_3 = """Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident, he damaged the rug in his apartment and had to pay the landlord $80.
         Mr. B bought his first New York State lottery ticket and won $20. Based solely on this information, who is happier? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 4: Segregation of "silver linings" (unprimed)

In [11]:
prompt_4 = """Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The same day the car was damaged, he won $25 in the office football pool.
         Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage. Based solely on this information, who is more upset?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

- Prompt 5: Segregation of gains (primed)

In [12]:
prompt_5 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation: 
          Mr. A was given tickets involving the World Series. He won 50$ in one lottery and 25$ in the other. 
          Mr. B was given a ticket to a single, larger World Series lottery. He won 75$. Based solely on this information, who is happier?
          A: Mister A
          B: Mister B
          C: No difference.
          Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 6: Integration of losses (primed)

In [13]:
prompt_6 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $100. 
         He received a similar letter the same day from his state income tax authority saying he owed $50. There were no other repercussions from either mistake. 
         Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $150. There were no other repercussions from his mistake. 
         Based solely on this information, who was more upset? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

- Prompt 7: Cancellation of losses against larger gains (primed) # Add info about same day?

In [14]:
prompt_7 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident, he damaged the rug in his apartment and had to pay the landlord $80.
         Mr. B bought his first New York State lottery ticket and won $20? Based solely on this information, who is happier?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 8: Segregation of "silver linings" (primed)

In [15]:
prompt_8 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The same day the car was damaged, he won $25 in the office football pool.
         Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage. Based solely on this information, who is more upset?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""
# Who is more upset?

------------------------------

- Helpful dictionaries 

The experiments we will run in this notebook are very similar in study design and, in some cases, also in the results we expect. We therefore need to make sure that we associate the results with the correct study design. That is why the following dictionaries are implemented to look up, e.g., which model was used for an experiment.

They will also be used inside the functions that call the API multiple times and output some information about the experiment in order to identify it correctly. 
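
Because the experiment IDs follow the pattern `<model number>_<prompt number>`, lookup tables of this kind could also be generated in a loop instead of being written out by hand. The sketch below only illustrates the idea: the placeholder prompt texts, the shortened Llama model name and the names `prompts_by_id`, `models_by_id` and `scenario_by_id` are hypothetical stand-ins, not the dictionaries used in this notebook.

```python
# Hypothetical stand-ins so the sketch runs on its own; in the notebook these
# would be the real prompt_1 ... prompt_8 strings and the full model names.
prompts = [f"<text of prompt_{i}>" for i in range(1, 9)]
models = {1: "gpt-3.5-turbo", 2: "gpt-4-1106-preview", 3: "llama-2-70b-chat"}

prompts_by_id = {}
models_by_id = {}
scenario_by_id = {}

for model_no, model_name in models.items():
    for prompt_no, prompt in enumerate(prompts, start=1):
        experiment_id = f"{model_no}_{prompt_no}"
        prompts_by_id[experiment_id] = prompt
        models_by_id[experiment_id] = model_name
        # Prompts 1-4 are unprimed; prompts 5-8 are the primed versions
        # of scenarios 1-4, so both halves map to scenario numbers 1-4.
        scenario_by_id[experiment_id] = (prompt_no - 1) % 4 + 1

print(len(prompts_by_id), scenario_by_id["2_7"])  # 24 3
```

Generating the entries this way keeps the ID in each entry and its dictionary key in sync by construction.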

In [16]:
# Dictionary that returns the literal prompt for a given experiment id (used in function call). key: experiment_id, value: prompt
experiment_prompts_dict = {
    "1_1": prompt_1,
    "1_2": prompt_2,
    "1_3": prompt_3,
    "1_4": prompt_4,
    "1_5": prompt_5,
    "1_6": prompt_6,
    "1_7": prompt_7,
    "1_8": prompt_8,
    "2_1": prompt_1,
    "2_2": prompt_2,
    "2_3": prompt_3,
    "2_4": prompt_4,
    "2_5": prompt_5,
    "2_6": prompt_6,
    "2_7": prompt_7,
    "2_8": prompt_8,
    "3_1": prompt_1,
    "3_2": prompt_2,
    "3_3": prompt_3,
    "3_4": prompt_4,
    "3_5": prompt_5,
    "3_6": prompt_6,
    "3_7": prompt_7,
    "3_8": prompt_8,
}

# The following dictionary is only used for a check in the function calls.
# It returns the variable name of the prompt that was used in the experiment. key: experiment_id, value: prompt_name
prompt_ids_dict = {
    "1_1": "prompt_1",
    "1_2": "prompt_2",
    "1_3": "prompt_3",
    "1_4": "prompt_4",
    "1_5": "prompt_5",
    "1_6": "prompt_6",
    "1_7": "prompt_7",
    "1_8": "prompt_8",
    "2_1": "prompt_1",
    "2_2": "prompt_2",
    "2_3": "prompt_3",
    "2_4": "prompt_4",
    "2_5": "prompt_5",
    "2_6": "prompt_6",
    "2_7": "prompt_7",
    "2_8": "prompt_8",
    "3_1": "prompt_1",
    "3_2": "prompt_2",
    "3_3": "prompt_3",
    "3_4": "prompt_4",
    "3_5": "prompt_5",
    "3_6": "prompt_6",
    "3_7": "prompt_7",
    "3_8": "prompt_8",
}

# Dictionary to look up which model to use for a given experiment id (used in function call). key: experiment id, value: model name
model_dict = {
    "1_1": "gpt-3.5-turbo",
    "1_2": "gpt-3.5-turbo",
    "1_3": "gpt-3.5-turbo",
    "1_4": "gpt-3.5-turbo",
    "1_5": "gpt-3.5-turbo",
    "1_6": "gpt-3.5-turbo",
    "1_7": "gpt-3.5-turbo",
    "1_8": "gpt-3.5-turbo",
    "2_1": "gpt-4-1106-preview",
    "2_2": "gpt-4-1106-preview",
    "2_3": "gpt-4-1106-preview",
    "2_4": "gpt-4-1106-preview",
    "2_5": "gpt-4-1106-preview",
    "2_6": "gpt-4-1106-preview",
    "2_7": "gpt-4-1106-preview",
    "2_8": "gpt-4-1106-preview",
    "3_1": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_2": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_3": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_4": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_5": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_6": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_7": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_8": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    }

# Dictionary to look up, what the study design of each experiment was. key: experiment id, value: experiment design 
experiment_dict = {
    "1_1": f"Experiment 1_1 uses {model_dict['1_1']}, deals with the segregation of gains and is unprimed.",
    "1_2": f"Experiment 1_2 uses {model_dict['1_2']}, deals with the integration of losses and is unprimed.",
    "1_3": f"Experiment 1_3 uses {model_dict['1_3']}, deals with the cancellation of losses against larger gains and is unprimed.",
    "1_4": f"Experiment 1_4 uses {model_dict['1_4']}, deals with the segregation of *silver linings* and is unprimed.",
    "1_5": f"Experiment 1_5 uses {model_dict['1_5']}, deals with the segregation of gains and is primed.",
    "1_6": f"Experiment 1_6 uses {model_dict['1_6']}, deals with the integration of losses and is primed.",
    "1_7": f"Experiment 1_7 uses {model_dict['1_7']}, deals with the cancellation of losses against larger gains and is primed.",
    "1_8": f"Experiment 1_8 uses {model_dict['1_8']}, deals with the segregation of *silver linings*, and is primed.",
    "2_1": f"Experiment 2_1 uses {model_dict['2_1']}, deals with the segregation of gains and is unprimed.",
    "2_2": f"Experiment 2_2 uses {model_dict['2_2']}, deals with the integration of losses and is unprimed.",
    "2_3": f"Experiment 2_3 uses {model_dict['2_3']}, deals with the cancellation of losses against larger gains and is unprimed.",
    "2_4": f"Experiment 2_4 uses {model_dict['2_4']}, deals with the segregation of *silver linings* and is unprimed.",
    "2_5": f"Experiment 2_5 uses {model_dict['2_5']}, deals with the segregation of gains and is primed.",
    "2_6": f"Experiment 2_6 uses {model_dict['2_6']}, deals with the integration of losses and is primed.",
    "2_7": f"Experiment 2_7 uses {model_dict['2_7']}, deals with the cancellation of losses against larger gains and is primed.",
    "2_8": f"Experiment 2_8 uses {model_dict['2_8']}, deals with the segregation of *silver linings*, and is primed.",
    "3_1": f"Experiment 3_1 uses {model_dict['3_1']}, deals with the segregation of gains and is unprimed.",
    "3_2": f"Experiment 3_2 uses {model_dict['3_2']}, deals with the integration of losses and is unprimed.",
    "3_3": f"Experiment 3_3 uses {model_dict['3_3']}, deals with the cancellation of losses against larger gains and is unprimed.",
    "3_4": f"Experiment 3_4 uses {model_dict['3_4']}, deals with the segregation of *silver linings* and is unprimed.",
    "3_5": f"Experiment 3_5 uses {model_dict['3_5']}, deals with the segregation of gains and is primed.",
    "3_6": f"Experiment 3_6 uses {model_dict['3_6']}, deals with the integration of losses and is primed.",
    "3_7": f"Experiment 3_7 uses {model_dict['3_7']}, deals with the cancellation of losses against larger gains and is primed.",
    "3_8": f"Experiment 3_8 uses {model_dict['3_8']}, deals with the segregation of *silver linings*, and is primed.",
}

# Dictionary to look up the original results of the experiments. key: experiment id, value: original result
results_dict = {
    "1_1": p_scenario1,
    "1_2": p_scenario2,
    "1_3": p_scenario3,
    "1_4": p_scenario4,
    "1_5": p_scenario1,
    "1_6": p_scenario2,
    "1_7": p_scenario3,
    "1_8": p_scenario4,
    "2_1": p_scenario1,
    "2_2": p_scenario2,
    "2_3": p_scenario3,
    "2_4": p_scenario4,
    "2_5": p_scenario1,
    "2_6": p_scenario2,
    "2_7": p_scenario3,
    "2_8": p_scenario4,
    "3_1": p_scenario1,
    "3_2": p_scenario2,
    "3_3": p_scenario3,
    "3_4": p_scenario4,
    "3_5": p_scenario1,
    "3_6": p_scenario2,
    "3_7": p_scenario3,
    "3_8": p_scenario4,
}

# Dictionary to look up the scenario number of a given experiment ID. key: experiment id, value: scenario number
scenario_dict = {
    "1_1": 1,
    "1_2": 2,
    "1_3": 3,
    "1_4": 4,
    "1_5": 1,
    "1_6": 2,
    "1_7": 3,
    "1_8": 4,
    "2_1": 1,
    "2_2": 2,
    "2_3": 3,
    "2_4": 4,
    "2_5": 1,
    "2_6": 2,
    "2_7": 3,
    "2_8": 4,
    "3_1": 1,
    "3_2": 2,
    "3_3": 3,
    "3_4": 4,
    "3_5": 1,
    "3_6": 2,
    "3_7": 3,
    "3_8": 4,
}   

---------------------------------

#### Setting up functions to repeatedly prompt ChatGPT

- Functions to query 1 prompt n times

In [17]:
def run_experiment(experiment_id, n, progress_bar, temperature):

    """
    Function to query ChatGPT multiple times with a survey having answers designed as: A, B, C.
    
    Args:
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        n (int): Number of queries to be made
        progress_bar (tqdm): Progress bar that is advanced by one step after every query
        temperature (float): Degree of randomness with range 0 (deterministic) to 2 (random)
        
    Returns:
        results (list): List containing count of answers for each option, also containing experiment_id, temperature, number of valid observations, model and scenario
        probs (list): List containing probability of each option being chosen, also containing experiment_id, temperature, number of valid observations, model and scenario
    """
    
    answers = []
    for _ in range(n): 
        response = client.chat.completions.create(
            model = model_dict[experiment_id], 
            max_tokens = 1,
            temperature = temperature, # range is 0 to 2
            messages = [
            {"role": "system", "content": "Only answer with the letter of the alternative you would choose without any reasoning."},        
            {"role": "user", "content": experiment_prompts_dict[experiment_id]},
                   ])

        # Store the answer in the list
        answer = response.choices[0].message.content
        answers.append(answer.strip())
        # Update progress bar (given from either temperature loop, or set locally)
        progress_bar.update(1)

    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of valid answers: sums over an indicator function that checks if the answer is either A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, model_dict[experiment_id], scenario_dict[experiment_id]]

    # Getting the percentage of each answer among the valid answers
    # (NaN if no answer was valid, which avoids a ZeroDivisionError)
    denom = len_correct if len_correct > 0 else float("nan")
    p_a = f"{(A / denom) * 100:.2f}%"
    p_b = f"{(B / denom) * 100:.2f}%"
    p_c = f"{(C / denom) * 100:.2f}%"

    # Collect probabilities in a list
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, model_dict[experiment_id], scenario_dict[experiment_id]]
    
    # Give out results
    return results, probs
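
The counting step inside `run_experiment()` can be tried out without calling the API. The following is a minimal sketch of that step using `collections.Counter`; the helper name `tally_answers` and the example responses are made up for illustration and are not model output.

```python
from collections import Counter

VALID_OPTIONS = ("A", "B", "C")


def tally_answers(answers):
    """Count valid answers and return (counts, shares in percent, n_valid)."""
    cleaned = [a.strip() for a in answers]
    # Only exact matches "A", "B" or "C" count as valid answers
    counter = Counter(a for a in cleaned if a in VALID_OPTIONS)
    n_valid = sum(counter.values())
    counts = {opt: counter[opt] for opt in VALID_OPTIONS}
    if n_valid == 0:
        # No valid answer at all: report NaN instead of dividing by zero
        shares = {opt: float("nan") for opt in VALID_OPTIONS}
    else:
        shares = {opt: counts[opt] / n_valid * 100 for opt in VALID_OPTIONS}
    return counts, shares, n_valid


# Made-up responses for illustration only
example = ["A", " A", "C", "B.", "", "A"]
counts, shares, n_valid = tally_answers(example)
print(counts, n_valid)  # {'A': 3, 'B': 0, 'C': 1} 4
```

As in `run_experiment()`, responses such as `"B."` or an empty string are treated as invalid and excluded from the denominator.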

- Function to query 1 prompt n times (Llama)

In [None]:
def run_experiment_llama(experiment_id, n, progress_bar, temperature):
    answers = []
    for _ in range(n):
        response = replicate.run(
            model_dict[experiment_id],
            input = {
                "system_prompt": "Only answer with the letter of the alternative you would choose without any reasoning.",
                "temperature": temperature,
                "max_new_tokens": 2, 
                "prompt": experiment_prompts_dict[experiment_id]
            }
        )
        # Grab answer and append to list
        answer = "" # Set to empty string, otherwise it would append the previous answer to the new one
        for item in response:
            answer = answer + item
        answers.append(answer.strip())

        # Update progress bar
        progress_bar.update(1)

    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of valid answers: sums over an indicator function that checks if the answer is either A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, model_dict[experiment_id], scenario_dict[experiment_id]]

    # Getting the percentage of each answer among the valid answers
    # (NaN if no answer was valid, which avoids a ZeroDivisionError)
    denom = len_correct if len_correct > 0 else float("nan")
    p_a = f"{(A / denom) * 100:.2f}%"
    p_b = f"{(B / denom) * 100:.2f}%"
    p_c = f"{(C / denom) * 100:.2f}%"

    # Collect probabilities in a list
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, model_dict[experiment_id], scenario_dict[experiment_id]]
    
    # Give out results
    return results, probs

- Function to loop run_experiment() over a list of temperature values

In [28]:
def temperature_loop(function, experiment_id, temperature_list = [0, 0.5, 1, 1.5, 2], n = 50):
    """
    Function to run an experiment with different temperature values.
    
    Args:
        function (function): Function to be used for querying the model, i.e. run_experiment() or run_experiment_llama()
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        temperature_list (list): List of temperature values to be looped over
        n (int): Number of requests for each prompt per temperature value
        
    Returns:
        results_df: Dataframe with experiment results
        probs_df: Dataframe with answer probabilities
    """    
    # Empty lists for storing results
    results_list = []
    probs_list = []
    # Initialize progress bar -> used as input for run_experiment()
    progress_bar = tqdm(range(n*len(temperature_list)))

    # Loop over different temperature values, calling the input function once per value (i.e. querying the model n times each)
    for temperature in temperature_list:
        results, probs = function(experiment_id = experiment_id, n = n, temperature = temperature, progress_bar = progress_bar) 
        results_list.append(results)
        probs_list.append(probs)

    # Horizontally concatenate the results, transpose, and set index
    results_df = pd.DataFrame(results_list).transpose().set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario"]))
    probs_df = pd.DataFrame(probs_list).transpose().set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario"]))
   
    # Return some information about the experiment as a check
    check = f"{experiment_dict[experiment_id]} In this run, a total of {n*len(temperature_list)} requests were made using {prompt_ids_dict[experiment_id]}."
    # Print information about the experiment
    print(check)
    # Print original results 
    print(f"The original results were {results_dict[experiment_id]}.")

    return results_df, probs_df

- Function to plot distribution of answer probabilities

In [15]:
def plot_results(df):
    
    # Get experiment id and model name for plot title from dictionaries
    experiment_id = df.iloc[0, 0]
    model = model_dict[experiment_id]
    
    X = df.loc["Temp"]
    p_a = df.loc["p(A)"].str.rstrip('%').astype('float')  # Convert percentages to float
    p_b = df.loc["p(B)"].str.rstrip('%').astype('float')
    p_c = df.loc["p(C)"].str.rstrip('%').astype('float')

    X_axis = np.arange(len(X)) 

    plt.figure(figsize = (10, 5))
    ax = plt.gca()
    ax.bar(X_axis- 0.25, p_a, 0.25, label = 'p(A)', color = "#8C1515") 
    ax.bar(X_axis, p_b, 0.25,  label = 'p(B)', color = "#507FAB") 
    ax.bar(X_axis+ 0.25 , p_c,  0.25, label = 'p(C)', color = '#D9A84A')

    ax.set_xticks(X_axis, X)
    ax.set_xlabel("Temperature")
    ax.set_ylabel("Probability (%)")
    ax.set_ylim(0, 110)
    ax.set_title(f"Distribution of answers per temperature value for experiment {experiment_id} using {model}")
    ax.legend()  
    plt.show()

-------------

## Comparing different LLMs

The results variables will be structured as: results_model-id_prompt-id.

We will refer to "GPT-3.5-turbo" as model 1, "GPT-4-1106-preview" as model 2, and "Llama-2-70b-chat" as model 3.

#### Model 1: GPT-3.5-Turbo (Model training ended in September 2021)

- Simple test of repeated prompting function for fixed temperature

In [None]:
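# run_experiment() requires a progress bar, so we create one for this single call
test_results, test_probs = run_experiment(experiment_id = "1_5", n = 20, progress_bar = tqdm(range(20)), temperature = 1)

print(experiment_dict["1_5"])
print(results_dict["1_5"])
test_probs
# Experiment_id, temperature, p_a, p_b, p_c, n_observations, model, scenario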

- Simple test of function to loop over temperature values

In [None]:
# Call the function 
results, probs = temperature_loop(run_experiment, experiment_id = "1_1", temperature_list = [0, 0.5, 1, 1.5, 2], n = 5)

# Display probability dataframe
probs

- Prompt 1: Segregation of gains (unprimed)

In [None]:
# Set number of requests per temperature value
N = 100

In [15]:
results_1_1, probs_1_1 = temperature_loop(run_experiment, experiment_id = "1_1", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_1

100%|██████████| 250/250 [03:07<00:00,  1.33it/s]

Experiment 1_1 uses gpt-3.5-turbo, deals with the segregation of gains and is unprimed. In this run, a total of 250 requests were made using prompt_1.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





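|            | 0       | 1      | 2      | 3      | 4      |
|------------|---------|--------|--------|--------|--------|
| Experiment | 1_1     | 1_1    | 1_1    | 1_1    | 1_1    |
| Temp       | 0.0     | 0.5    | 1.0    | 1.5    | 2.0    |
| p(A)       | 100.00% | 92.00% | 84.00% | 60.00% | 48.94% |
| p(B)       | 0.00%   | 6.00%  | 12.00% | 32.00% | 14.89% |
| p(C)       | 0.00%   | 2.00%  | 4.00%  | 8.00%  | 36.17% |
| Obs.       | 50      | 50     | 50     | 50     | 47     |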


- Prompt 2: Integration of losses (unprimed)

In [16]:
results_1_2, probs_1_2 = temperature_loop(run_experiment, experiment_id = "1_2", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_2

100%|██████████| 250/250 [02:44<00:00,  1.52it/s]

Experiment 1_2 uses gpt-3.5-turbo, deals with the integration of losses and is unprimed. In this run, a total of 250 requests were made using prompt_2.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





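|            | 0       | 1      | 2      | 3      | 4      |
|------------|---------|--------|--------|--------|--------|
| Experiment | 1_2     | 1_2    | 1_2    | 1_2    | 1_2    |
| Temp       | 0.0     | 0.5    | 1.0    | 1.5    | 2.0    |
| p(A)       | 100.00% | 54.00% | 50.00% | 50.00% | 34.78% |
| p(B)       | 0.00%   | 8.00%  | 8.00%  | 12.00% | 26.09% |
| p(C)       | 0.00%   | 38.00% | 42.00% | 38.00% | 39.13% |
| Obs.       | 50      | 50     | 50     | 50     | 46     |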


- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [17]:
results_1_3, probs_1_3 = temperature_loop(run_experiment, experiment_id = "1_3", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_3

100%|██████████| 250/250 [02:55<00:00,  1.43it/s]

Experiment 1_3 uses gpt-3.5-turbo, deals with the cancellation of losses against larger gains and is unprimed. In this run, a total of 250 requests were made using prompt_3.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





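|            | 0       | 1      | 2      | 3      | 4      |
|------------|---------|--------|--------|--------|--------|
| Experiment | 1_3     | 1_3    | 1_3    | 1_3    | 1_3    |
| Temp       | 0.0     | 0.5    | 1.0    | 1.5    | 2.0    |
| p(A)       | 100.00% | 98.00% | 78.00% | 68.00% | 65.91% |
| p(B)       | 0.00%   | 0.00%  | 8.00%  | 14.00% | 9.09%  |
| p(C)       | 0.00%   | 2.00%  | 14.00% | 18.00% | 25.00% |
| Obs.       | 50      | 50     | 50     | 50     | 44     |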


- Prompt 4: Segregation of "silver linings" (unprimed)

In [18]:
results_1_4, probs_1_4 = temperature_loop(run_experiment, experiment_id = "1_4", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_4

100%|██████████| 250/250 [12:59<00:00,  3.12s/it]   

Experiment 1_4 uses gpt-3.5-turbo, deals with the segrgation of *silver linings* and is unprimed. In this run, a total of 250 requests were made using prompt_4.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





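|            | 0       | 1      | 2      | 3      | 4      |
|------------|---------|--------|--------|--------|--------|
| Experiment | 1_4     | 1_4    | 1_4    | 1_4    | 1_4    |
| Temp       | 0.0     | 0.5    | 1.0    | 1.5    | 2.0    |
| p(A)       | 100.00% | 96.00% | 98.00% | 72.00% | 68.09% |
| p(B)       | 0.00%   | 2.00%  | 2.00%  | 12.00% | 19.15% |
| p(C)       | 0.00%   | 2.00%  | 0.00%  | 16.00% | 12.77% |
| Obs.       | 50      | 50     | 50     | 50     | 47     |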


- Prompt 5: Segregation of gains (primed)

In [19]:
results_1_5, probs_1_5 = temperature_loop(run_experiment, experiment_id = "1_5", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_5

100%|██████████| 250/250 [02:41<00:00,  1.54it/s]

Experiment 1_5 uses gpt-3.5-turbo, deals with the segregation of gains and is primed. In this run, a total of 250 requests were made using prompt_5.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





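|            | 0       | 1      | 2      | 3      | 4      |
|------------|---------|--------|--------|--------|--------|
| Experiment | 1_5     | 1_5    | 1_5    | 1_5    | 1_5    |
| Temp       | 0.0     | 0.5    | 1.0    | 1.5    | 2.0    |
| p(A)       | 100.00% | 70.00% | 50.00% | 46.00% | 42.86% |
| p(B)       | 0.00%   | 4.00%  | 18.00% | 22.00% | 18.37% |
| p(C)       | 0.00%   | 26.00% | 32.00% | 32.00% | 38.78% |
| Obs.       | 50      | 50     | 50     | 50     | 49     |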


- Prompt 6: Integration of losses (primed)

In [20]:
results_1_6, probs_1_6 = temperature_loop(run_experiment, experiment_id = "1_6", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_6

100%|██████████| 250/250 [22:45<00:00,  5.46s/it]   

Experiment 1_6 uses gpt-3.5-turbo, deals with the integration of losses and is primed. In this run, a total of 250 requests were made using prompt_6.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





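|            | 0       | 1      | 2      | 3      | 4      |
|------------|---------|--------|--------|--------|--------|
| Experiment | 1_6     | 1_6    | 1_6    | 1_6    | 1_6    |
| Temp       | 0.0     | 0.5    | 1.0    | 1.5    | 2.0    |
| p(A)       | 100.00% | 84.00% | 56.00% | 50.00% | 36.36% |
| p(B)       | 0.00%   | 0.00%  | 8.00%  | 10.00% | 22.73% |
| p(C)       | 0.00%   | 16.00% | 36.00% | 40.00% | 40.91% |
| Obs.       | 50      | 50     | 50     | 50     | 44     |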


- Prompt 7: Cancellation of losses against larger gains (primed)

In [21]:
results_1_7, probs_1_7 = temperature_loop(run_experiment, experiment_id = "1_7", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_7

100%|██████████| 250/250 [12:22<00:00,  2.97s/it]   

Experiment 1_7 uses gpt-3.5-turbo, deals with the cancellation of losses against larger gains and is primed. In this run, a total of 250 requests were made using prompt_7.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





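|            | 0       | 1      | 2      | 3      | 4      |
|------------|---------|--------|--------|--------|--------|
| Experiment | 1_7     | 1_7    | 1_7    | 1_7    | 1_7    |
| Temp       | 0.0     | 0.5    | 1.0    | 1.5    | 2.0    |
| p(A)       | 100.00% | 62.00% | 56.00% | 43.75% | 44.68% |
| p(B)       | 0.00%   | 0.00%  | 2.00%  | 10.42% | 14.89% |
| p(C)       | 0.00%   | 38.00% | 42.00% | 45.83% | 40.43% |
| Obs.       | 50      | 50     | 50     | 48     | 47     |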


- Prompt 8: Segregation of "silver linings" (primed)

In [22]:
results_1_8, probs_1_8 = temperature_loop(run_experiment, experiment_id = "1_8", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_8

100%|██████████| 250/250 [32:20<00:00,  7.76s/it]   

Experiment 1_8 uses gpt-3.5-turbo, deals with the segregation of *silver linings*, and is primed. In this run, a total of 250 requests were made using prompt_8.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





Unnamed: 0,0,1,2,3,4
Experiment,1_8,1_8,1_8,1_8,1_8
Temp,0.0,0.5,1.0,1.5,2.0
p(A),100.00%,88.00%,72.00%,68.00%,57.14%
p(B),0.00%,2.00%,2.00%,14.00%,16.67%
p(C),0.00%,10.00%,26.00%,18.00%,26.19%
Obs.,50,50,50,50,42
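
At higher temperatures the `Obs.` row drops below 50, and the shares in each column still sum to 100%. This suggests that completions which cannot be mapped to A, B or C are discarded before the shares are computed. The following is a minimal sketch of such a tally under that assumption; `tally_answers` is a hypothetical helper, not the notebook's own code, and the example counts are chosen only so that they reproduce the temperature 2.0 column above.

```python
from collections import Counter

VALID_OPTIONS = ("A", "B", "C")


def tally_answers(answers):
    """Count valid answers and return their percentage shares plus the number of valid observations."""
    valid = [a for a in answers if a in VALID_OPTIONS]  # drop anything that is not A, B or C
    counts = Counter(valid)
    obs = len(valid)
    summary = {
        f"p({option})": round(100 * counts[option] / obs, 2) if obs else 0.0
        for option in VALID_OPTIONS
    }
    summary["Obs."] = obs
    return summary


# 50 requests, of which 8 returned something other than A, B or C (illustrative counts)
answers = ["A"] * 24 + ["B"] * 7 + ["C"] * 11 + ["?"] * 8
summary = tally_answers(answers)
print(summary)  # {'p(A)': 57.14, 'p(B)': 16.67, 'p(C)': 26.19, 'Obs.': 42}
```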


------------------------------------------

#### Model 2: GPT-4-1106-preview (Model training ended in April 2023)

Since prompting GPT-4 is considerably more expensive, we use only 50 requests per temperature value here, instead of the 100 used for GPT-3.5.

In [None]:
# Set number of requests per temperature value
N = 50

- Prompt 1: Segregation of gains (unprimed)

In [23]:
results_2_1, probs_2_1 = temperature_loop(run_experiment, experiment_id = "2_1", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_1

100%|██████████| 250/250 [01:56<00:00,  2.14it/s]

Experiment 1_1 uses gpt-4-1106-preview, deals with the segregation of gains and is unprimed. In this run, a total of 250 requests were made using prompt_1.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





Unnamed: 0,0,1,2,3,4
Experiment,2_1,2_1,2_1,2_1,2_1
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,2.00%,6.00%
p(B),0.00%,0.00%,0.00%,0.00%,2.00%
p(C),100.00%,100.00%,100.00%,98.00%,92.00%
Obs.,50,50,50,50,50


- Prompt 2: Integration of losses (unprimed)

In [24]:
results_2_2, probs_2_2 = temperature_loop(run_experiment, experiment_id = "2_2", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_2

100%|██████████| 250/250 [12:12<00:00,  2.93s/it] 

Experiment 1_2 uses gpt-4-1106-preview, deals with the integration of losses and is unprimed. In this run, a total of 250 requests were made using prompt_2.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





Unnamed: 0,0,1,2,3,4
Experiment,2_2,2_2,2_2,2_2,2_2
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,4.08%
p(B),0.00%,0.00%,0.00%,2.00%,0.00%
p(C),100.00%,100.00%,100.00%,98.00%,95.92%
Obs.,50,50,50,50,49


- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [25]:
results_2_3, probs_2_3 = temperature_loop(run_experiment, experiment_id = "2_3", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_3

100%|██████████| 250/250 [12:23<00:00,  2.97s/it]   

Experiment 1_3 uses gpt-4-1106-preview, deals with the cancellation of losses against larger gains and is unprimed. In this run, a total of 250 requests were made using prompt_3.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





Unnamed: 0,0,1,2,3,4
Experiment,2_3,2_3,2_3,2_3,2_3
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,2.04%,0.00%
p(B),0.00%,0.00%,0.00%,0.00%,6.38%
p(C),100.00%,100.00%,100.00%,97.96%,93.62%
Obs.,50,50,50,49,47


- Prompt 4: Segregation of "silver linings" (unprimed)

In [26]:
results_2_4, probs_2_4 = temperature_loop(run_experiment, experiment_id = "2_4", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_4

100%|██████████| 250/250 [22:09<00:00,  5.32s/it]   

Experiment 1_4 uses gpt-4-1106-preview, deals with the segregation of *silver linings* and is unprimed. In this run, a total of 250 requests were made using prompt_4.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





Unnamed: 0,0,1,2,3,4
Experiment,2_4,2_4,2_4,2_4,2_4
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,0.00%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,100.00%,100.00%
Obs.,50,50,50,50,49


- Prompt 5: Segregation of gains (primed)

In [27]:
results_2_5, probs_2_5 = temperature_loop(run_experiment, experiment_id = "2_5", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_5

100%|██████████| 250/250 [02:09<00:00,  1.93it/s]

Experiment 1_5 uses gpt-4-1106-preview, deals with the segregation of gains and is primed. In this run, a total of 250 requests were made using prompt_5.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





Unnamed: 0,0,1,2,3,4
Experiment,2_5,2_5,2_5,2_5,2_5
Temp,0.0,0.5,1.0,1.5,2.0
p(A),94.00%,70.00%,60.00%,60.00%,57.14%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),6.00%,30.00%,40.00%,40.00%,42.86%
Obs.,50,50,50,50,49


- Prompt 6: Integration of losses (primed)

In [28]:
results_2_6, probs_2_6 = temperature_loop(run_experiment, experiment_id = "2_6", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_6

100%|██████████| 250/250 [22:41<00:00,  5.45s/it]   

Experiment 1_6 uses gpt-4-1106-preview, deals with the integration of losses and is primed. In this run, a total of 250 requests were made using prompt_6.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





Unnamed: 0,0,1,2,3,4
Experiment,2_6,2_6,2_6,2_6,2_6
Temp,0.0,0.5,1.0,1.5,2.0
p(A),98.00%,96.00%,88.00%,84.00%,75.51%
p(B),0.00%,0.00%,0.00%,2.00%,0.00%
p(C),2.00%,4.00%,12.00%,14.00%,24.49%
Obs.,50,50,50,50,49


- Prompt 7: Cancellation of losses against larger gains (primed)

In [29]:
results_2_7, probs_2_7 = temperature_loop(run_experiment, experiment_id = "2_7", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_7

100%|██████████| 250/250 [02:08<00:00,  1.94it/s]

Experiment 1_7 uses gpt-4-1106-preview, deals with the cancellation of losses against larger gains and is primed. In this run, a total of 250 requests were made using prompt_7.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





Unnamed: 0,0,1,2,3,4
Experiment,2_7,2_7,2_7,2_7,2_7
Temp,0.0,0.5,1.0,1.5,2.0
p(A),30.00%,54.00%,44.00%,46.94%,38.00%
p(B),0.00%,0.00%,0.00%,2.04%,10.00%
p(C),70.00%,46.00%,56.00%,51.02%,52.00%
Obs.,50,50,50,49,50


- Prompt 8: Segregation of "silver linings" (primed)

In [30]:
results_2_8, probs_2_8 = temperature_loop(run_experiment, experiment_id = "2_8", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_8

100%|██████████| 250/250 [02:10<00:00,  1.91it/s]

Experiment 1_8 uses gpt-4-1106-preview, deals with the segregation of *silver linings*, and is primed. In this run, a total of 250 requests were made using prompt_8.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





Unnamed: 0,0,1,2,3,4
Experiment,2_8,2_8,2_8,2_8,2_8
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,2.00%,8.00%,16.00%,23.40%
p(B),0.00%,0.00%,0.00%,2.00%,8.51%
p(C),100.00%,98.00%,92.00%,82.00%,68.09%
Obs.,50,50,50,50,47


--------------------------------------------

#### Model 3: Llama-2-70b

**Note:** Use `max_new_tokens` of at least 2, as Llama tends to begin its answers with a blank space.
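
Because the first generated token may be a blank, the answer letter should be read only after stripping whitespace. The sketch below illustrates one way to do this; `extract_choice` is a hypothetical helper and not necessarily how `run_experiment_llama` parses its output.

```python
def extract_choice(raw_answer):
    """Return 'A', 'B' or 'C' from a raw completion, or None if no valid option is found."""
    cleaned = raw_answer.strip()  # remove the leading blank Llama tends to emit
    if not cleaned:
        return None  # e.g. the single token was just whitespace
    first = cleaned[0].upper()
    return first if first in ("A", "B", "C") else None


print(extract_choice(" A"))  # A
print(extract_choice(" "))   # None
```

With a token limit of 1, the completion could consist of the blank alone, which would then count as an invalid answer.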

In [2]:
import replicate
# models = ['meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3'] # possibility to add further llama models 
temperature_list = [0.01, 0.5, 1, 1.5, 2] # Llama won't accept 0 as temperature, so 0.01 is used instead
N = 50 # number of requests per temperature value 
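
To reuse the temperature grid of the GPT experiments without editing it by hand, the zero value can be raised to a small positive minimum. This is a minimal sketch; `safe_temperature` and `MIN_TEMPERATURE` are hypothetical names, and 0.01 is simply the smallest value used in this notebook.

```python
MIN_TEMPERATURE = 0.01  # assumption: smallest temperature used for Llama in this notebook


def safe_temperature(temperature, minimum=MIN_TEMPERATURE):
    """Raise temperatures below the minimum (such as 0) to the minimum value."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return max(temperature, minimum)


gpt_temperatures = [0, 0.5, 1, 1.5, 2]
llama_temperatures = [safe_temperature(t) for t in gpt_temperatures]
print(llama_temperatures)  # [0.01, 0.5, 1, 1.5, 2]
```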

In [35]:
#test_results, test_probs = temperature_loop(run_experiment_llama, experiment_id = "3_1", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = 5)
test_probs = test_probs.transpose()
test_probs['Model'] = test_probs['Model'].replace('meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3', 'llama-2-70b')
test_probs

Unnamed: 0,Experiment,Temp,p(A),p(B),p(C),Obs.,Model,Scenario
0,3_1,0.01,0.00%,100.00%,0.00%,5,llama-2-70b,1
1,3_1,0.5,0.00%,100.00%,0.00%,5,llama-2-70b,1
2,3_1,1.0,0.00%,100.00%,0.00%,5,llama-2-70b,1
3,3_1,1.5,20.00%,80.00%,0.00%,5,llama-2-70b,1
4,3_1,2.0,20.00%,80.00%,0.00%,5,llama-2-70b,1


- Prompt 1: Segregation of gains (unprimed)

In [None]:
results_3_1, probs_3_1 = temperature_loop(run_experiment_llama, experiment_id = "3_1", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_1

- Prompt 2: Integration of losses (unprimed)

In [None]:
results_3_2, probs_3_2 = temperature_loop(run_experiment_llama, experiment_id = "3_2", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_2

- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [None]:
results_3_3, probs_3_3 = temperature_loop(run_experiment_llama, experiment_id = "3_3", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_3

- Prompt 4: Segregation of "silver linings" (unprimed)

In [None]:
results_3_4, probs_3_4 = temperature_loop(run_experiment_llama, experiment_id = "3_4", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_4

- Prompt 5: Segregation of gains (primed)

In [None]:
results_3_5, probs_3_5 = temperature_loop(run_experiment_llama, experiment_id = "3_5", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_5

- Prompt 6: Integration of losses (primed)

In [None]:
results_3_6, probs_3_6 = temperature_loop(run_experiment_llama, experiment_id = "3_6", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_6

- Prompt 7: Cancellation of losses against larger gains (primed)

In [None]:
results_3_7, probs_3_7 = temperature_loop(run_experiment_llama, experiment_id = "3_7", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_7

- Prompt 8: Segregation of "silver linings" (primed)

In [None]:
results_3_8, probs_3_8 = temperature_loop(run_experiment_llama, experiment_id = "3_8", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_8

---

- Save the results

The dataframes above were constructed to be illustrative and easy to interpret. For further processing, however, we need the data in a different (long) format, so additional transformations are required. These are currently performed in df_conversion.ipynb.
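
As an illustration of what such a conversion could look like, the sketch below reshapes a small wide table into long format with pandas. It is not the code from df_conversion.ipynb; the column names `Answer` and `Share` are assumptions, and the two columns of example data are taken from the experiment 2_5 table above.

```python
import pandas as pd

# Wide table in the layout shown above: one column per temperature value
wide = pd.DataFrame(
    {
        0: ["2_5", 0.0, "94.00%", "0.00%", "6.00%", 50],
        1: ["2_5", 0.5, "70.00%", "0.00%", "30.00%", 50],
    },
    index=["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs."],
)

# One row per temperature value, then one row per (temperature, answer option)
tidy = wide.transpose()
long_df = tidy.melt(
    id_vars=["Experiment", "Temp", "Obs."],
    value_vars=["p(A)", "p(B)", "p(C)"],
    var_name="Answer",   # hypothetical column name
    value_name="Share",  # hypothetical column name
)

# Convert percentage strings such as "94.00%" into floats
long_df["Share"] = long_df["Share"].str.rstrip("%").astype(float)
print(long_df)
```

The long layout has one row per experiment, temperature and answer option, which is typically easier to group, filter and plot.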

In [31]:
import os

# Set folder name to save results to
folder_name = "Output/PT_probs_dfs"

# Check if path already exists, only create folder (including parent folders) if not
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
    print(f"Folder {folder_name} successfully created.")
else:
    print(f"Folder {folder_name} already exists in current directory.")

# Save dataframes with their respective experiment id as file name
for i in range(1, 3):      # models 1 (GPT-3.5) and 2 (GPT-4)
    for j in range(1, 9):  # prompts 1 to 8
        # Look up the dataframe by name and build an OS-independent file path
        file_path = os.path.join(folder_name, f"PT_probs_{i}_{j}.csv")
        globals()[f"probs_{i}_{j}"].to_csv(file_path)

Folder PT_probs_dfs successfully created.
