## Prospect Theory

This notebook aims to recreate some of the findings of **Thaler, Richard (1985), “Mental Accounting and Consumer Choice,” Marketing Science, 4 (3), 199–214.** Specifically, we try to see if LLMs like **ChatGPT** abide by some rules of Mental Accounting and Prospect Theory.


TODO: Consider using the past tense in every prompt ("Who was happier?"), as in the original phrasing.

## Original study: 

### Scenario 1: Segregation of gains
- Mr. A was given tickets to lotteries involving the World Series. He won $50 in one lottery and $25 in the other.
- Mr. B was given a ticket to a single, larger World Series lottery. He won $75. Who is happier?

| Answer option | Frequency |
|--------------|-----------|
| A            | 56        |
| B            | 16        |
| No difference | 15      |

(empirical results from the 1985 study) -> No segregation of gains for B

### Scenario 2: Integration of losses
- Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his
tax return and owed $100. He received a similar letter the same day from his state income tax
authority saying he owed $50. There were no other repercussions from either mistake.
- Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax
return and owed $150. There were no other repercussions from his mistake. Who was more upset?

| Answer option | Frequency |
|--------------|-----------|
| A            | 66        |
| B            | 14        |
| No difference | 7      |

(empirical results from the 1985 study) -> No integration of losses for B

### Scenario 3: Cancellation of losses against larger gains
- Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident,
he damaged the rug in his apartment and had to pay the landlord $80.
- Mr. B bought his first New York State lottery ticket and won $20. Who is happier?

| Answer option | Frequency |
|--------------|-----------|
| A            | 22        |
| B            | 61        |
| No difference | 4      |

(empirical results from the 1985 study) -> No cancellation of losses against larger gains for A


### Scenario 4: Segregation of "silver linings"
- Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The
same day the car was damaged, he won $25 in the office football pool.
- Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage.
Who was more upset?

| Answer option | Frequency |
|--------------|-----------|
| A            | 19        |
| B            | 63        |
| No difference | 5      |

(empirical results from the 1985 study) -> No segregation of "silver linings" for B.



------------------------------------

In [1]:
from openai import OpenAI
import openai
import matplotlib.pyplot as plt
import os 
import numpy as np
import pandas as pd
from tqdm import tqdm
import replicate

In [2]:
# Get OpenAI API key (previously saved as an environment variable)
openai.api_key = os.environ["OPENAI_API_KEY"]

# Set client
client = OpenAI()

# Set global plot style
plt.style.use('seaborn-v0_8')

# Set plots to be displayed in notebook
%matplotlib inline

In [3]:
# To make our results comparable to the original study, we compute original answer probabilities
PT_p_scenario1 = [f"p(A): {round((56/(56+16+15)*100), 2)}%", f"p(B): {round((16/(56+16+15)*100), 2)}%", f"p(C): {round((15/(56+16+15)*100), 2)}%"]
PT_p_scenario2 = [f"p(A): {round((66/(66+14+7)*100), 2)}%", f"p(B): {round((14/(66+14+7)*100), 2)}%", f"p(C): {round((7/(66+14+7)*100), 2)}%"]
PT_p_scenario3 = [f"p(A): {round((22/(22+61+4)*100), 2)}%", f"p(B): {round((61/(22+61+4)*100), 2)}%", f"p(C): {round((4/(22+61+4)*100), 2)}%"]
PT_p_scenario4 = [f"p(A): {round((19/(19+63+5)*100), 2)}%", f"p(B): {round((63/(19+63+5)*100), 2)}%", f"p(C): {round((5/(19+63+5)*100), 2)}%"]
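
The four lines above repeat the same arithmetic. A minimal sketch of a helper that derives the formatted shares from the raw counts could look as follows; `format_shares`, `original_counts` and `original_shares` are hypothetical names that do not appear elsewhere in this notebook, and the counts are the ones from the tables above.

```python
def format_shares(counts, labels=("A", "B", "C")):
    """Return strings like 'p(A): 64.37%' for a tuple of answer counts."""
    total = sum(counts)
    return [f"p({label}): {round(count / total * 100, 2)}%"
            for label, count in zip(labels, counts)]

# Answer counts (A, B, No difference) reported for the four scenarios
original_counts = {
    1: (56, 16, 15),
    2: (66, 14, 7),
    3: (22, 61, 4),
    4: (19, 63, 5),
}

# Hypothetical lookup: scenario number -> formatted shares
original_shares = {scenario: format_shares(c) for scenario, c in original_counts.items()}
print(original_shares[1])  # ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%']
```

Each scenario has 87 respondents in total, so the denominators agree with the sums written out explicitly above.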

---------------------------

#### Setting up the prompts used for the experiment

We now formulate 8 different prompts: 

The first four prompts each describe a scenario in which two people, Mister A and Mister B, win or lose money. Most importantly, in every scenario the net monetary outcome is the same for both of them. What differs is the framing: some of the gains and losses are presented separately, others are integrated. This is where Prospect Theory comes into play, and its four marketing implications arise: segregation of gains, integration of losses, cancellation of losses against larger gains, and segregation of "silver linings".

The last four prompts describe the same situations as before. However, we now instruct the model to take the role of a market researcher that knows about the implications of Prospect Theory.

(Since prompt formatting matters, for now we do not insert line breaks mid-sentence and try to keep each scenario description for A/B on a single line.)
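
Because the prompts are written as indented triple-quoted strings, the leading spaces on every continuation line are part of the text that is sent to the model. The following is a sketch of one way to strip them, assuming that is desired; `normalize_prompt` is a hypothetical helper and not part of the experiments in this notebook. Applying it changes the exact stimulus, so results obtained with and without it are not directly comparable.

```python
def normalize_prompt(prompt):
    """Strip leading/trailing whitespace from each line, keeping the line breaks."""
    return "\n".join(line.strip() for line in prompt.splitlines())

# Shortened example in the same layout as the prompts in this notebook
example = """Who is happier?
          A: Mister A
          B: Mister B
          C: No difference."""

normalized = normalize_prompt(example)
print(repr(normalized))
```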

- Prompt 1: Segregation of gains (unprimed)

In [4]:
PT_prompt_1 = """Mr. A was given tickets involving the World Series. He won 50$ in one lottery and $25 in the other. 
          Mr. B was given a ticket to a single, larger World Series lottery. He won $75. Based solely on this information, Who is happier? 
          A: Mister A
          B: Mister B
          C: No difference.         
          Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 2: Integration of losses (unprimed)

In [5]:
PT_prompt_2 = """Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $100. 
         He received a similar letter the same day from his state income tax authority saying he owed $50. There were no other repercussions from either mistake. 
         Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $150. There were no other repercussions from his mistake. 
         Based solely on this information, who was more upset? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

- Prompt 3: Cancellation of losses against larger gains (unprimed) # Add info about same day?

In [6]:
PT_prompt_3 = """Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident, he damaged the rug in his apartment and had to pay the landlord $80.
         Mr. B bought his first New York State lottery ticket and won $20. Based solely on this information, who is happier? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 4: Segregation of "silver linings" (unprimed)

In [7]:
PT_prompt_4 = """Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The same day the car was damaged, he won $25 in the office football pool.
         Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage. Based solely on this information, who is more upset?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

- Prompt 5: Segregation of gains (primed)

In [8]:
PT_prompt_5 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation: 
          Mr. A was given tickets involving the World Series. He won 50$ in one lottery and 25$ in the other. 
          Mr. B was given a ticket to a single, larger World Series lottery. He won 75$. Based solely on this information, who is happier?
          A: Mister A
          B: Mister B
          C: No difference.
          Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 6: Integration of losses (primed)

In [9]:
PT_prompt_6 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $100. 
         He received a similar letter the same day from his state income tax authority saying he owed $50. There were no other repercussions from either mistake. 
         Mr. B received a letter from the IRS saying that he made a minor arithmetical mistake on his tax return and owed $150. There were no other repercussions from his mistake. 
         Based solely on this information, who was more upset? 
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is more upset?

-  Prompt 7: Cancellation of losses against larger gains (primed) # Add info about same day?

In [10]:
PT_prompt_7 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A bought his first New York State lottery ticket and won $100. Also, in a freak accident, he damaged the rug in his apartment and had to pay the landlord $80.
         Mr. B bought his first New York State lottery ticket and won $20. Based solely on this information, who is happier?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Who is happier?

- Prompt 8: Segregation of "silver linings" (primed)

In [11]:
PT_prompt_8 = """You are a market researcher and focus on Prospect Theory and Mental Accounting. In a survey you are presented the following situation:
         Mr. A's car was damaged in a parking lot. He had to spend $200 to repair the damage. The same day the car was damaged, he won $25 in the office football pool.
         Mr. B's car was damaged in a parking lot. He had to spend $175 to repair the damage. Based solely on this information, who is more upset?
         A: Mister A
         B: Mister B
         C: No difference.
         Which option would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""
# Who is more upset?

------------------------------

- Helpful dictionaries 

The experiments we will run in this notebook are very similar in study design and, in some cases, also in the results we expect. We therefore need to make sure that we associate each result with the correct study design. The following dictionaries serve that purpose, e.g. to look up which model was used for an experiment.

They will also be used inside the functions that call the API multiple times and output some information about the experiment in order to identify it correctly. 

In [12]:
# Dictionary that returns the literal prompt for a given experiment id (used in function call). key: experiment_id, value: prompt
PT_experiment_prompts_dict = {
    "PT_1_1": PT_prompt_1,
    "PT_1_2": PT_prompt_2,
    "PT_1_3": PT_prompt_3,
    "PT_1_4": PT_prompt_4,
    "PT_1_5": PT_prompt_5,
    "PT_1_6": PT_prompt_6,
    "PT_1_7": PT_prompt_7,
    "PT_1_8": PT_prompt_8,
    "PT_2_1": PT_prompt_1,
    "PT_2_2": PT_prompt_2,
    "PT_2_3": PT_prompt_3,
    "PT_2_4": PT_prompt_4,
    "PT_2_5": PT_prompt_5,
    "PT_2_6": PT_prompt_6,
    "PT_2_7": PT_prompt_7,
    "PT_2_8": PT_prompt_8,
    "PT_3_1": PT_prompt_1,
    "PT_3_2": PT_prompt_2,
    "PT_3_3": PT_prompt_3,
    "PT_3_4": PT_prompt_4,
    "PT_3_5": PT_prompt_5,
    "PT_3_6": PT_prompt_6,
    "PT_3_7": PT_prompt_7,
    "PT_3_8": PT_prompt_8,
}

# The following dictionary is only used for a check in the function calls.
# It returns the variable name of the prompt that was used in the experiment. key: experiment_id, value: prompt_name
PT_prompt_ids_dict = {
    "PT_1_1": "PT_prompt_1",
    "PT_1_2": "PT_prompt_2",
    "PT_1_3": "PT_prompt_3",
    "PT_1_4": "PT_prompt_4",
    "PT_1_5": "PT_prompt_5",
    "PT_1_6": "PT_prompt_6",
    "PT_1_7": "PT_prompt_7",
    "PT_1_8": "PT_prompt_8",
    "PT_2_1": "PT_prompt_1",
    "PT_2_2": "PT_prompt_2",
    "PT_2_3": "PT_prompt_3",
    "PT_2_4": "PT_prompt_4",
    "PT_2_5": "PT_prompt_5",
    "PT_2_6": "PT_prompt_6",
    "PT_2_7": "PT_prompt_7",
    "PT_2_8": "PT_prompt_8",
    "PT_3_1": "PT_prompt_1",
    "PT_3_2": "PT_prompt_2",
    "PT_3_3": "PT_prompt_3",
    "PT_3_4": "PT_prompt_4",
    "PT_3_5": "PT_prompt_5",
    "PT_3_6": "PT_prompt_6",
    "PT_3_7": "PT_prompt_7",
    "PT_3_8": "PT_prompt_8",
}

# Dictionary to look up which model to use for a given experiment id (used in function call). key: experiment id, value: model name
PT_model_dict = {
    "PT_1_1": "gpt-3.5-turbo",
    "PT_1_2": "gpt-3.5-turbo",
    "PT_1_3": "gpt-3.5-turbo",
    "PT_1_4": "gpt-3.5-turbo",
    "PT_1_5": "gpt-3.5-turbo",
    "PT_1_6": "gpt-3.5-turbo",
    "PT_1_7": "gpt-3.5-turbo",
    "PT_1_8": "gpt-3.5-turbo",
    "PT_2_1": "gpt-4-1106-preview",
    "PT_2_2": "gpt-4-1106-preview",
    "PT_2_3": "gpt-4-1106-preview",
    "PT_2_4": "gpt-4-1106-preview",
    "PT_2_5": "gpt-4-1106-preview",
    "PT_2_6": "gpt-4-1106-preview",
    "PT_2_7": "gpt-4-1106-preview",
    "PT_2_8": "gpt-4-1106-preview",
    "PT_3_1": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_2": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_3": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_4": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_5": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_6": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_7": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "PT_3_8": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    }

# Dictionary to look up, what the study design of each experiment was. key: experiment id, value: experiment design 
PT_experiment_dict = {
    "PT_1_1": f"Experiment PT_1_1 uses {PT_model_dict['PT_1_1']}, deals with the segregation of gains and is unprimed.",
    "PT_1_2": f"Experiment PT_1_2 uses {PT_model_dict['PT_1_2']}, deals with the integration of losses and is unprimed.",
    "PT_1_3": f"Experiment PT_1_3 uses {PT_model_dict['PT_1_3']}, deals with the cancellation of losses against larger gains and is unprimed.",
    "PT_1_4": f"Experiment PT_1_4 uses {PT_model_dict['PT_1_4']}, deals with the segregation of *silver linings* and is unprimed.",
    "PT_1_5": f"Experiment PT_1_5 uses {PT_model_dict['PT_1_5']}, deals with the segregation of gains and is primed.",
    "PT_1_6": f"Experiment PT_1_6 uses {PT_model_dict['PT_1_6']}, deals with the integration of losses and is primed.",
    "PT_1_7": f"Experiment PT_1_7 uses {PT_model_dict['PT_1_7']}, deals with the cancellation of losses against larger gains and is primed.",
    "PT_1_8": f"Experiment PT_1_8 uses {PT_model_dict['PT_1_8']}, deals with the segregation of *silver linings* and is primed.",
    "PT_2_1": f"Experiment PT_2_1 uses {PT_model_dict['PT_2_1']}, deals with the segregation of gains and is unprimed.",
    "PT_2_2": f"Experiment PT_2_2 uses {PT_model_dict['PT_2_2']}, deals with the integration of losses and is unprimed.",
    "PT_2_3": f"Experiment PT_2_3 uses {PT_model_dict['PT_2_3']}, deals with the cancellation of losses against larger gains and is unprimed.",
    "PT_2_4": f"Experiment PT_2_4 uses {PT_model_dict['PT_2_4']}, deals with the segregation of *silver linings* and is unprimed.",
    "PT_2_5": f"Experiment PT_2_5 uses {PT_model_dict['PT_2_5']}, deals with the segregation of gains and is primed.",
    "PT_2_6": f"Experiment PT_2_6 uses {PT_model_dict['PT_2_6']}, deals with the integration of losses and is primed.",
    "PT_2_7": f"Experiment PT_2_7 uses {PT_model_dict['PT_2_7']}, deals with the cancellation of losses against larger gains and is primed.",
    "PT_2_8": f"Experiment PT_2_8 uses {PT_model_dict['PT_2_8']}, deals with the segregation of *silver linings* and is primed.",
    "PT_3_1": f"Experiment PT_3_1 uses {PT_model_dict['PT_3_1']}, deals with the segregation of gains and is unprimed.",
    "PT_3_2": f"Experiment PT_3_2 uses {PT_model_dict['PT_3_2']}, deals with the integration of losses and is unprimed.",
    "PT_3_3": f"Experiment PT_3_3 uses {PT_model_dict['PT_3_3']}, deals with the cancellation of losses against larger gains and is unprimed.",
    "PT_3_4": f"Experiment PT_3_4 uses {PT_model_dict['PT_3_4']}, deals with the segregation of *silver linings* and is unprimed.",
    "PT_3_5": f"Experiment PT_3_5 uses {PT_model_dict['PT_3_5']}, deals with the segregation of gains and is primed.",
    "PT_3_6": f"Experiment PT_3_6 uses {PT_model_dict['PT_3_6']}, deals with the integration of losses and is primed.",
    "PT_3_7": f"Experiment PT_3_7 uses {PT_model_dict['PT_3_7']}, deals with the cancellation of losses against larger gains and is primed.",
    "PT_3_8": f"Experiment PT_3_8 uses {PT_model_dict['PT_3_8']}, deals with the segregation of *silver linings* and is primed.",
}

# Dictionary to look up the original results of the experiments. key: experiment id, value: original result
PT_results_dict = {
    "PT_1_1": PT_p_scenario1,
    "PT_1_2": PT_p_scenario2,
    "PT_1_3": PT_p_scenario3,
    "PT_1_4": PT_p_scenario4,
    "PT_1_5": PT_p_scenario1,
    "PT_1_6": PT_p_scenario2,
    "PT_1_7": PT_p_scenario3,
    "PT_1_8": PT_p_scenario4,
    "PT_2_1": PT_p_scenario1,
    "PT_2_2": PT_p_scenario2,
    "PT_2_3": PT_p_scenario3,
    "PT_2_4": PT_p_scenario4,
    "PT_2_5": PT_p_scenario1,
    "PT_2_6": PT_p_scenario2,
    "PT_2_7": PT_p_scenario3,
    "PT_2_8": PT_p_scenario4,
    "PT_3_1": PT_p_scenario1,
    "PT_3_2": PT_p_scenario2,
    "PT_3_3": PT_p_scenario3,
    "PT_3_4": PT_p_scenario4,
    "PT_3_5": PT_p_scenario1,
    "PT_3_6": PT_p_scenario2,
    "PT_3_7": PT_p_scenario3,
    "PT_3_8": PT_p_scenario4,
}

# Dictionary to look up the scenario number of a given experiment ID. key: experiment id, value: scenario number
PT_scenario_dict = {
    "PT_1_1": 1,
    "PT_1_2": 2,
    "PT_1_3": 3,
    "PT_1_4": 4,
    "PT_1_5": 1,
    "PT_1_6": 2,
    "PT_1_7": 3,
    "PT_1_8": 4,
    "PT_2_1": 1,
    "PT_2_2": 2,
    "PT_2_3": 3,
    "PT_2_4": 4,
    "PT_2_5": 1,
    "PT_2_6": 2,
    "PT_2_7": 3,
    "PT_2_8": 4,
    "PT_3_1": 1,
    "PT_3_2": 2,
    "PT_3_3": 3,
    "PT_3_4": 4,
    "PT_3_5": 1,
    "PT_3_6": 2,
    "PT_3_7": 3,
    "PT_3_8": 4,
}   

# Dictionary to look up, whether an experiment used a primed or unprimed prompt. key: experiment id, value: 1 if primed, 0 if unprimed
PT_priming_dict = {
    "PT_1_1": 0,
    "PT_1_2": 0,
    "PT_1_3": 0,
    "PT_1_4": 0,
    "PT_1_5": 1,
    "PT_1_6": 1,
    "PT_1_7": 1,
    "PT_1_8": 1,
    "PT_2_1": 0,
    "PT_2_2": 0,
    "PT_2_3": 0,
    "PT_2_4": 0,
    "PT_2_5": 1,
    "PT_2_6": 1,
    "PT_2_7": 1,
    "PT_2_8": 1,
    "PT_3_1": 0,
    "PT_3_2": 0,
    "PT_3_3": 0,
    "PT_3_4": 0,
    "PT_3_5": 1,
    "PT_3_6": 1,
    "PT_3_7": 1,
    "PT_3_8": 1,
}

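
All lookup tables above follow one pattern: the ID has the form `PT_<model>_<prompt>`, prompts 1 to 4 are unprimed and 5 to 8 are primed, and the scenario number repeats every four prompts. The sketch below shows how equivalent dictionaries could be derived in a loop rather than typed out. The names ending in `_generated` and the shortened entries in `models` are illustrative placeholders, not the variables used in the rest of this notebook.

```python
from itertools import product

# Placeholder model names keyed by model number (the notebook uses full identifiers)
models = {1: "gpt-3.5-turbo", 2: "gpt-4-1106-preview", 3: "llama-2-70b-chat"}

model_generated, scenario_generated, priming_generated = {}, {}, {}
for model_no, prompt_no in product(models, range(1, 9)):
    experiment_id = f"PT_{model_no}_{prompt_no}"
    model_generated[experiment_id] = models[model_no]
    scenario_generated[experiment_id] = (prompt_no - 1) % 4 + 1  # 1..4, repeating
    priming_generated[experiment_id] = int(prompt_no > 4)        # 1 if primed, else 0

print(len(model_generated), scenario_generated["PT_2_7"], priming_generated["PT_2_7"])  # 24 3 1
```

Deriving the entries from the ID keeps the tables consistent with each other and avoids copy-and-paste slips in individual rows.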


---------------------------------

#### Setting up functions to repeatedly prompt ChatGPT

- Functions to query 1 prompt n times

In [13]:
def PT_run_experiment(experiment_id, n, progress_bar, temperature):

    """
    Function to query ChatGPT multiple times with a survey having answers designed as: A, B, C.
    
    Args:
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        n (int): Number of queries to be made
        progress_bar (tqdm): Progress bar that is advanced by one step after every query
        temperature (float): Degree of randomness with range 0 (least random) to 2 (most random)
        
    Returns:
        results (list): List containing count of answers for each option, also containing experiment_id, temperature, number of valid answers, model, scenario and priming
        probs (list): List containing probability of each option being chosen, also containing experiment_id, temperature, number of valid answers, model, scenario and priming
    """
    
    answers = []
    for _ in range(n): 
        response = client.chat.completions.create(
            model = PT_model_dict[experiment_id], 
            max_tokens = 1,
            temperature = temperature, # range is 0 to 2
            messages = [
            {"role": "system", "content": "Only answer with the letter of the alternative you would choose without any reasoning."},        
            {"role": "user", "content": PT_experiment_prompts_dict[experiment_id]},
                   ])

        # Store the answer in the list
        answer = response.choices[0].message.content
        answers.append(answer.strip())
        # Update progress bar (given from either temperature loop, or set locally)
        progress_bar.update(1)

    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of "correct" answers: sums over an indicator function that checks whether the answer is A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id], PT_priming_dict[experiment_id]]

    # Getting percentage each answer
    p_a = f"{(A / (len_correct + 0.000000001)) * 100:.2f}%"
    p_b = f"{(B / (len_correct + 0.000000001)) * 100:.2f}%"
    p_c = f"{(C / (len_correct + 0.000000001)) * 100:.2f}%"

    # Collect probabilities in a list
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id], PT_priming_dict[experiment_id]]
    
    # Give out results
    return results, probs
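
The counting step at the end of the function can be isolated and checked without calling the API. The following is a sketch under that assumption; `tally_answers` is a hypothetical helper and the answer list is made up for illustration. It handles the case of zero valid answers explicitly instead of adding a small constant to the denominator.

```python
from collections import Counter

def tally_answers(answers, options=("A", "B", "C")):
    """Count valid answers and return (counts, percentage shares, number of valid answers)."""
    counts = Counter(a.strip() for a in answers)
    tallies = {o: counts[o] for o in options}
    n_valid = sum(tallies.values())
    if n_valid == 0:
        shares = {o: 0.0 for o in options}  # nothing to normalize
    else:
        shares = {o: 100 * tallies[o] / n_valid for o in options}
    return tallies, shares, n_valid

# Made-up raw answers; "A." is not one of the options and is therefore ignored
tallies, shares, n_valid = tally_answers(["A", "B", " A", "A.", "C", "A"])
print(tallies, n_valid)  # {'A': 3, 'B': 1, 'C': 1} 5
```

Answers outside the three options are dropped from the denominator, matching the role of `len_correct` in the function above.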

- Adjusted function for dashboard  (returns dataframe with regular numbers, not percent)

In [58]:
def PT_run_experiment_dashboard(experiment_id, n, progress_bar, temperature):

    """
    Function to query ChatGPT multiple times with a survey having answers designed as: A, B, C.
    
    Args:
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        n (int): Number of queries to be made
        progress_bar (tqdm): Progress bar that is advanced by one step after every query
        temperature (float): Degree of randomness with range 0 (least random) to 2 (most random)
        
    Returns:
        results (pd.DataFrame): Dataframe containing count of answers for each option, also containing experiment_id, temperature, number of valid answers, model, scenario and priming
        probs (pd.DataFrame): Dataframe containing probability (in percent, as a number) of each option being chosen, also containing experiment_id, temperature, number of valid answers, model, scenario and priming
    """
    
    answers = []
    for _ in range(n): 
        response = client.chat.completions.create(
            model = PT_model_dict[experiment_id], 
            max_tokens = 1,
            temperature = temperature, # range is 0 to 2
            messages = [
            {"role": "system", "content": "Only answer with the letter of the alternative you would choose without any reasoning."},        
            {"role": "user", "content": PT_experiment_prompts_dict[experiment_id]},
                   ])

        # Store the answer in the list
        answer = response.choices[0].message.content
        answers.append(answer.strip())
        # Update progress bar (given from either temperature loop, or set locally)
        progress_bar.update(1)

    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of "correct" answers: sums over an indicator function that checks whether the answer is A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a dataframe
    results = pd.DataFrame([experiment_id, temperature, A, B, C, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id], PT_priming_dict[experiment_id]])
    results = results.set_index(pd.Index(["Experiment", "Temp", "A", "B", "C", "Obs.", "Model", "Scenario", "Priming"]))

    # Getting percentage each answer
    p_a = (A / (len_correct + 0.000000001)) * 100
    p_b = (B / (len_correct + 0.000000001)) * 100
    p_c = (C / (len_correct + 0.000000001)) * 100

    # Collect probabilities in a dataframe
    probs = pd.DataFrame([experiment_id, temperature, p_a, p_b, p_c, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id], PT_priming_dict[experiment_id]])
    probs = probs.set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming"]))
    
    # Give out results
    return results, probs

- Function to query 1 prompt n times (Llama)

In [15]:
def PT_run_experiment_llama(experiment_id, n, progress_bar, temperature):
    answers = []
    for _ in range(n):
        response = replicate.run(
            PT_model_dict[experiment_id],
            input = {
                "system_prompt": "Only answer with the letter of the alternative you would choose without any reasoning.",
                "temperature": temperature,
                "max_new_tokens": 2, 
                "prompt": PT_experiment_prompts_dict[experiment_id]
            }
        )
        # Grab answer and append to list
        answer = "" # Set to empty string, otherwise it would append the previous answer to the new one
        for item in response:
            answer = answer + item
        answers.append(answer.strip())

        # Update progress bar
        progress_bar.update(1)

    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of "correct" answers: sums over an indicator function that checks whether the answer is A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id], PT_priming_dict[experiment_id]]

    # Getting percentage each answer
    p_a = f"{(A / (len_correct + 0.000000001)) * 100:.2f}%"
    p_b = f"{(B / (len_correct + 0.000000001)) * 100:.2f}%"
    p_c = f"{(C / (len_correct + 0.000000001)) * 100:.2f}%"

    # Collect probabilities in a list
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id], PT_priming_dict[experiment_id]]
    
    # Give out results
    return results, probs

- Adjusted function for dashboard

In [16]:
def PT_run_experiment_llama_dashboard(experiment_id, n, progress_bar, temperature):
    answers = []
    for _ in range(n):
        response = replicate.run(
            PT_model_dict[experiment_id],
            input = {
                "system_prompt": "Only answer with the letter of the alternative you would choose without any reasoning.",
                "temperature": temperature,
                "max_new_tokens": 2, 
                "prompt": PT_experiment_prompts_dict[experiment_id]
            }
        )
        # Grab answer and append to list
        answer = "" # Set to empty string, otherwise it would append the previous answer to the new one
        for item in response:
            answer = answer + item
        answers.append(answer.strip())

        # Update progress bar
        progress_bar.update(1)

    # Counting results
    A = answers.count("A")
    B = answers.count("B")
    C = answers.count("C")

    # Count of "correct" answers: sums over an indicator function that checks whether the answer is A, B or C
    len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])

    # Collecting results in a dataframe
    results = pd.DataFrame([experiment_id, temperature, A, B, C, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id], PT_priming_dict[experiment_id]])
    results = results.set_index(pd.Index(["Experiment", "Temp", "A", "B", "C", "Obs.", "Model", "Scenario", "Priming"]))

    # Getting percentage each answer
    p_a = (A / (len_correct + 0.000000001)) * 100
    p_b = (B / (len_correct + 0.000000001)) * 100
    p_c = (C / (len_correct + 0.000000001)) * 100

    # Collect probabilities in a dataframe
    probs = pd.DataFrame([experiment_id, temperature, p_a, p_b, p_c, len_correct, PT_model_dict[experiment_id], PT_scenario_dict[experiment_id], PT_priming_dict[experiment_id]])
    probs = probs.set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming"]))
    
    # Give out results
    return results, probs

- Function to loop run_experiment() over a list of temperature values

In [17]:
def PT_temperature_loop(function, experiment_id, temperature_list = [0, 0.5, 1, 1.5, 2], n = 50):
    """
    Function to run an experiment with different temperature values.
    
    Args:
        function (function): Function to be used for querying the model, e.g. PT_run_experiment()
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        temperature_list (list): List of temperature values to be looped over
        n (int): Number of requests for each prompt per temperature value
        
    Returns:
        results_df: Dataframe with experiment results
        probs_df: Dataframe with answer probabilities
    """    
    # Empty lists for storing results
    results_list = []
    probs_list = []
    # Initialize progress bar -> used as input for run_experiment()
    progress_bar = tqdm(range(n*len(temperature_list)))

    # Loop over different temperature values, calling the input function once per value (i.e. querying the model n times each)
    for temperature in temperature_list:
        results, probs = function(experiment_id = experiment_id, n = n, temperature = temperature, progress_bar = progress_bar) 
        results_list.append(results)
        probs_list.append(probs)

    # Horizontally concatenate the results, transpose, and set index
    results_df = pd.DataFrame(results_list).transpose().set_index(pd.Index(["Experiment", "Temp", "A", "B", "C", "Obs.", "Model", "Scenario", "Priming"]))
    probs_df = pd.DataFrame(probs_list).transpose().set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming"]))
   
    # Return some information about the experiment as a check
    check = f"{PT_experiment_dict[experiment_id]} In this run, a total of {n*len(temperature_list)} requests were made using {PT_prompt_ids_dict[experiment_id]}."
    # Print information about the experiment
    print(check)
    # Print original results 
    print(f"The original results were {PT_results_dict[experiment_id]}.")

    return results_df, probs_df
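
To make the data flow of the loop concrete, here is a minimal sketch with a stub in place of the real query function. `fake_run_experiment` and its made-up answer shares are purely hypothetical, and the assumption that each call returns one list of nine fields is inferred from the index labels used above. Plain Python `zip` stands in for the pandas transpose so the snippet runs without dependencies.

```python
FIELDS = ["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming"]


def fake_run_experiment(experiment_id, n, temperature):
    """Hypothetical stand-in for run_experiment(): returns one row of nine fields."""
    # Made-up shares for illustration only; no model is queried here.
    return [experiment_id, temperature, "50.00%", "25.00%", "25.00%", n, "stub-model", 1, 0]


def temperature_loop(function, experiment_id, temperature_list=None, n=50):
    """Call `function` once per temperature and collect the values by field label."""
    if temperature_list is None:
        temperature_list = [0, 0.5, 1, 1.5, 2]
    rows = [function(experiment_id, n, t) for t in temperature_list]
    # Transpose: one entry per field, holding one value per temperature
    return {field: list(values) for field, values in zip(FIELDS, zip(*rows))}


table = temperature_loop(fake_run_experiment, "PT_1_1", n=10)
print(table["Temp"])  # [0, 0.5, 1, 1.5, 2]
print(table["Obs."])  # [10, 10, 10, 10, 10]
```

Each key of `table` corresponds to one row of the dataframes returned by the real function, and each list position to one temperature value.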

- Function to plot distribution of answer probabilities

In [18]:
def PT_plot_results(df):
    """
    Function to plot the distribution of answer probabilities per temperature value.

    Args:
        df: Dataframe with answer probabilities as returned by PT_temperature_loop()
    """
    # Get experiment id and model name for plot title from dictionaries
    experiment_id = df.iloc[0, 0]
    model = PT_model_dict[experiment_id]
    
    # Get temperature values and convert percentage strings (e.g. "10.00%") to floats
    X = df.loc["Temp"]
    p_a = df.loc["p(A)"].str.rstrip('%').astype('float')
    p_b = df.loc["p(B)"].str.rstrip('%').astype('float')
    p_c = df.loc["p(C)"].str.rstrip('%').astype('float')

    # One group of three bars per temperature value
    X_axis = np.arange(len(X)) 

    plt.figure(figsize = (10, 5))
    ax = plt.gca()
    ax.bar(X_axis - 0.25, p_a, 0.25, label = 'p(A)', color = "#8C1515") 
    ax.bar(X_axis, p_b, 0.25, label = 'p(B)', color = "#507FAB") 
    ax.bar(X_axis + 0.25, p_c, 0.25, label = 'p(C)', color = '#D9A84A')

    ax.set_xticks(X_axis, X)
    ax.set_xlabel("Temperature")
    ax.set_ylabel("Probability (%)")
    ax.set_ylim(0, 110)
    ax.set_title(f"Distribution of answers per temperature value for experiment {experiment_id} using {model}")
    ax.legend()  
    plt.show()
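
The plotting function relies on stripping the percent sign from strings such as "10.00%" before the values can be drawn as bars. A dependency-free sketch of that conversion follows; `percent_to_float` is a hypothetical helper name and the sample values are only illustrative.

```python
def percent_to_float(value: str) -> float:
    """Strip a trailing percent sign and parse the remainder as a float."""
    return float(value.rstrip("%"))


p_a = ["10.00%", "30.00%", "50.00%", "30.00%", "60.00%"]
print([percent_to_float(v) for v in p_a])  # [10.0, 30.0, 50.0, 30.0, 60.0]
```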

-------------

## Comparing different LLMs

Result variables are named results_model-id_prompt-id, e.g. results_1_3 for model 1 and prompt 3.

We refer to "GPT-3.5-turbo" as model 1, "GPT-4-1106-preview" as model 2, and "Llama-2-70b" as model 3.
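
The naming scheme can also be decoded programmatically. The sketch below is an illustration, not part of the notebook: `parse_experiment_id` and the `MODELS` lookup are hypothetical, and the mapping of prompts 1 to 4 to the unprimed scenarios and prompts 5 to 8 to their primed counterparts is read off the prompt list in this section.

```python
MODELS = {1: "gpt-3.5-turbo", 2: "gpt-4-1106-preview", 3: "llama-2-70b"}


def parse_experiment_id(experiment_id: str) -> dict:
    """Split an ID such as 'PT_2_7' into model, prompt, scenario, and priming flag."""
    prefix, model_id, prompt_id = experiment_id.split("_")
    if prefix != "PT":
        raise ValueError(f"unexpected prefix in {experiment_id!r}")
    model_id, prompt_id = int(model_id), int(prompt_id)
    return {
        "model": MODELS[model_id],
        "prompt": prompt_id,
        "scenario": (prompt_id - 1) % 4 + 1,  # prompts 5-8 repeat scenarios 1-4
        "primed": int(prompt_id > 4),  # 1 = primed, 0 = unprimed
    }


print(parse_experiment_id("PT_2_7"))
# {'model': 'gpt-4-1106-preview', 'prompt': 7, 'scenario': 3, 'primed': 1}
```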

#### Model 1: GPT-3.5-Turbo (Model training ended in September 2021)

- Prompt 1: Segregation of gains (unprimed)

In [17]:
# Set number of requests per temperature value
N = 100

In [18]:
results_1_1, probs_1_1 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_1", n = N)
probs_1_1

100%|██████████| 50/50 [00:26<00:00,  1.89it/s]

Experiment PT_1_1 uses gpt-3.5-turbo, deals with the segregation of gains and is unprimed. In this run, a total of 50 requests were made using PT_prompt_1.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





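| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | PT_1_1 | PT_1_1 | PT_1_1 | PT_1_1 | PT_1_1 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 10.00% | 30.00% | 50.00% | 30.00% | 60.00% |
| p(B) | 0.00% | 10.00% | 0.00% | 50.00% | 20.00% |
| p(C) | 90.00% | 60.00% | 50.00% | 20.00% | 20.00% |
| Obs. | 10 | 10 | 10 | 10 | 10 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 1 | 1 | 1 | 1 | 1 |
| Priming | 0 | 0 | 0 | 0 | 0 |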
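
The original shares printed with each run can be reproduced from the answer frequencies of the 1985 study. For Scenario 1 these are A: 56, B: 16, and no difference: 15. A short sketch, in which the helper name `shares` is hypothetical:

```python
def shares(counts: dict) -> dict:
    """Convert answer frequencies into percentage strings with two decimals."""
    total = sum(counts.values())
    return {answer: f"{count / total:.2%}" for answer, count in counts.items()}


original_scenario_1 = {"p(A)": 56, "p(B)": 16, "p(C)": 15}
print(shares(original_scenario_1))
# {'p(A)': '64.37%', 'p(B)': '18.39%', 'p(C)': '17.24%'}
```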


- Prompt 2: Integration of losses (unprimed)

In [33]:
results_1_2, probs_1_2 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_2", n = N)
probs_1_2

100%|██████████| 25/25 [00:11<00:00,  2.20it/s]

Experiment PT_1_2 uses gpt-3.5-turbo, deals with the integration of losses and is unprimed. In this run, a total of 25 requests were made using PT_prompt_2.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





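| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | PT_1_2 | PT_1_2 | PT_1_2 | PT_1_2 | PT_1_2 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 20.00% | 20.00% | 40.00% |
| p(B) | 0.00% | 0.00% | 60.00% | 40.00% | 0.00% |
| p(C) | 100.00% | 100.00% | 20.00% | 40.00% | 60.00% |
| Obs. | 5 | 5 | 5 | 5 | 5 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 2 | 2 | 2 | 2 | 2 |
| Priming | 0 | 0 | 0 | 0 | 0 |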


- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [36]:
results_1_3, probs_1_3 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_3", n = N)
probs_1_3

100%|██████████| 500/500 [03:46<00:00,  2.21it/s]

Experiment 1_3 uses gpt-3.5-turbo, deals with the cancellation of losses against larger gains and is unprimed. In this run, a total of 500 requests were made using prompt_3.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





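| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 1_3 | 1_3 | 1_3 | 1_3 | 1_3 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 100.00% | 69.00% | 38.00% | 45.45% | 49.46% |
| p(B) | 0.00% | 10.00% | 14.00% | 23.23% | 21.51% |
| p(C) | 0.00% | 21.00% | 48.00% | 31.31% | 29.03% |
| Obs. | 100 | 100 | 100 | 99 | 93 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 3 | 3 | 3 | 3 | 3 |
| Priming | 0 | 0 | 0 | 0 | 0 |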


- Prompt 4: Segregation of "silver linings" (unprimed)

In [37]:
results_1_4, probs_1_4 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_4", n = N)
probs_1_4

100%|██████████| 500/500 [04:00<00:00,  2.08it/s]

Experiment 1_4 uses gpt-3.5-turbo, deals with the segrgation of *silver linings* and is unprimed. In this run, a total of 500 requests were made using prompt_4.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





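| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 1_4 | 1_4 | 1_4 | 1_4 | 1_4 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 100.00% | 86.00% | 64.00% | 58.00% | 47.25% |
| p(B) | 0.00% | 9.00% | 16.00% | 15.00% | 23.08% |
| p(C) | 0.00% | 5.00% | 20.00% | 27.00% | 29.67% |
| Obs. | 100 | 100 | 100 | 100 | 91 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 4 | 4 | 4 | 4 | 4 |
| Priming | 0 | 0 | 0 | 0 | 0 |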


- Prompt 5: Segregation of gains (primed)

In [38]:
results_1_5, probs_1_5 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_5", n = N)
probs_1_5

100%|██████████| 500/500 [03:48<00:00,  2.18it/s]

Experiment 1_5 uses gpt-3.5-turbo, deals with the segregation of gains and is primed. In this run, a total of 500 requests were made using prompt_5.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





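| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 1_5 | 1_5 | 1_5 | 1_5 | 1_5 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 15.00% | 18.00% | 28.28% | 39.13% |
| p(B) | 0.00% | 9.00% | 23.00% | 21.21% | 22.83% |
| p(C) | 100.00% | 76.00% | 59.00% | 50.51% | 38.04% |
| Obs. | 100 | 100 | 100 | 99 | 92 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 1 | 1 | 1 | 1 | 1 |
| Priming | 1 | 1 | 1 | 1 | 1 |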


- Prompt 6: Integration of losses (primed)

In [39]:
results_1_6, probs_1_6 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_6", n = N)
probs_1_6

100%|██████████| 500/500 [05:27<00:00,  1.53it/s] 

Experiment 1_6 uses gpt-3.5-turbo, deals with the integration of losses and is primed. In this run, a total of 500 requests were made using prompt_6.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





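| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 1_6 | 1_6 | 1_6 | 1_6 | 1_6 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 33.00% | 40.00% | 42.42% | 37.63% |
| p(B) | 0.00% | 5.00% | 3.00% | 20.20% | 16.13% |
| p(C) | 100.00% | 62.00% | 57.00% | 37.37% | 46.24% |
| Obs. | 100 | 100 | 100 | 99 | 93 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 2 | 2 | 2 | 2 | 2 |
| Priming | 1 | 1 | 1 | 1 | 1 |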


- Prompt 7: Cancellation of losses against larger gains (primed)

In [40]:
results_1_7, probs_1_7 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_7", n = N)
probs_1_7

100%|██████████| 500/500 [03:59<00:00,  2.09it/s]

Experiment 1_7 uses gpt-3.5-turbo, deals with the cancellation of losses against larger gains and is primed. In this run, a total of 500 requests were made using prompt_7.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





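| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 1_7 | 1_7 | 1_7 | 1_7 | 1_7 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 100.00% | 62.00% | 42.00% | 48.48% | 40.22% |
| p(B) | 0.00% | 1.00% | 9.00% | 14.14% | 20.65% |
| p(C) | 0.00% | 37.00% | 49.00% | 37.37% | 39.13% |
| Obs. | 100 | 100 | 100 | 99 | 92 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 3 | 3 | 3 | 3 | 3 |
| Priming | 1 | 1 | 1 | 1 | 1 |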


- Prompt 8: Segregation of "silver linings" (primed)

In [41]:
results_1_8, probs_1_8 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_1_8", n = N)
probs_1_8

100%|██████████| 500/500 [04:00<00:00,  2.08it/s]

Experiment 1_8 uses gpt-3.5-turbo, deals with the segregation of *silver linings*, and is primed. In this run, a total of 500 requests were made using prompt_8.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





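| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 1_8 | 1_8 | 1_8 | 1_8 | 1_8 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 100.00% | 50.00% | 46.00% | 38.78% | 39.33% |
| p(B) | 0.00% | 5.00% | 16.00% | 19.39% | 21.35% |
| p(C) | 0.00% | 45.00% | 38.00% | 41.84% | 39.33% |
| Obs. | 100 | 100 | 100 | 98 | 89 |
| Model | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario | 4 | 4 | 4 | 4 | 4 |
| Priming | 1 | 1 | 1 | 1 | 1 |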


------------------------------------------

#### Model 2: GPT-4-1106-preview (Model training ended in April 2023)

Because querying GPT-4 is considerably more expensive, we use only 50 requests per temperature value instead of the 100 used for GPT-3.5.

In [19]:
# Set number of requests per temperature value
N = 50
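
The cost argument is easy to quantify, since each call to the temperature loop issues n * len(temperature_list) requests. A rough sketch of the request budget per model follows, assuming all eight prompts are run over the five default temperature values; the names `N_PROMPTS` and `total_requests` are hypothetical.

```python
TEMPERATURES = [0, 0.5, 1, 1.5, 2]
N_PROMPTS = 8


def total_requests(n_per_temperature: int) -> int:
    """Requests needed to run every prompt at every temperature value."""
    return N_PROMPTS * len(TEMPERATURES) * n_per_temperature


print(total_requests(100))  # 4000 requests at N = 100 (GPT-3.5)
print(total_requests(50))   # 2000 requests at N = 50 (GPT-4)
```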

- Prompt 1: Segregation of gains (unprimed)

In [17]:
results_2_1, probs_2_1 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_1", n = N)
probs_2_1

100%|██████████| 250/250 [03:52<00:00,  1.07it/s] 

Experiment 1_1 uses gpt-4-1106-preview, deals with the segregation of gains and is unprimed. In this run, a total of 250 requests were made using prompt_1.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





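| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 2_1 | 2_1 | 2_1 | 2_1 | 2_1 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(B) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(C) | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Obs. | 50 | 50 | 50 | 50 | 48 |
| Model | gpt-4-1106-preview | gpt-4-1106-preview | gpt-4-1106-preview | gpt-4-1106-preview | gpt-4-1106-preview |
| Scenario | 1 | 1 | 1 | 1 | 1 |
| Priming | 0 | 0 | 0 | 0 | 0 |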


- Prompt 2: Integration of losses (unprimed)

In [20]:
results_2_2, probs_2_2 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_2", n = N)
probs_2_2

100%|██████████| 250/250 [02:06<00:00,  1.97it/s]

Experiment PT_1_2 uses gpt-4-1106-preview, deals with the integration of losses and is unprimed. In this run, a total of 250 requests were made using PT_prompt_2.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





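| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | PT_2_2 | PT_2_2 | PT_2_2 | PT_2_2 | PT_2_2 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 0.00% | 2.04% |
| p(B) | 0.00% | 0.00% | 0.00% | 0.00% | 2.04% |
| p(C) | 100.00% | 100.00% | 100.00% | 100.00% | 95.92% |
| Obs. | 50 | 50 | 50 | 50 | 49 |
| Model | gpt-4-1106-preview | gpt-4-1106-preview | gpt-4-1106-preview | gpt-4-1106-preview | gpt-4-1106-preview |
| Scenario | 2 | 2 | 2 | 2 | 2 |
| Priming | 0 | 0 | 0 | 0 | 0 |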


- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [25]:
results_2_3, probs_2_3 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_3", n = N)
probs_2_3

100%|██████████| 250/250 [12:23<00:00,  2.97s/it]   

Experiment 1_3 uses gpt-4-1106-preview, deals with the cancellation of losses against larger gains and is unprimed. In this run, a total of 250 requests were made using prompt_3.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





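| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 2_3 | 2_3 | 2_3 | 2_3 | 2_3 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 2.04% | 0.00% |
| p(B) | 0.00% | 0.00% | 0.00% | 0.00% | 6.38% |
| p(C) | 100.00% | 100.00% | 100.00% | 97.96% | 93.62% |
| Obs. | 50 | 50 | 50 | 49 | 47 |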


- Prompt 4: Segregation of "silver linings" (unprimed)

In [26]:
results_2_4, probs_2_4 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_4", n = N)
probs_2_4

100%|██████████| 250/250 [22:09<00:00,  5.32s/it]   

Experiment 1_4 uses gpt-4-1106-preview, deals with the segrgation of *silver linings* and is unprimed. In this run, a total of 250 requests were made using prompt_4.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





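| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 2_4 | 2_4 | 2_4 | 2_4 | 2_4 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(B) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(C) | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| Obs. | 50 | 50 | 50 | 50 | 49 |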


- Prompt 5: Segregation of gains (primed)

In [27]:
results_2_5, probs_2_5 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_5", n = N)
probs_2_5

100%|██████████| 250/250 [02:09<00:00,  1.93it/s]

Experiment 1_5 uses gpt-4-1106-preview, deals with the segregation of gains and is primed. In this run, a total of 250 requests were made using prompt_5.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





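| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 2_5 | 2_5 | 2_5 | 2_5 | 2_5 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 94.00% | 70.00% | 60.00% | 60.00% | 57.14% |
| p(B) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(C) | 6.00% | 30.00% | 40.00% | 40.00% | 42.86% |
| Obs. | 50 | 50 | 50 | 50 | 49 |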


- Prompt 6: Integration of losses (primed)

In [28]:
results_2_6, probs_2_6 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_6", n = N)
probs_2_6

100%|██████████| 250/250 [22:41<00:00,  5.45s/it]   

Experiment 1_6 uses gpt-4-1106-preview, deals with the integration of losses and is primed. In this run, a total of 250 requests were made using prompt_6.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





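| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 2_6 | 2_6 | 2_6 | 2_6 | 2_6 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 98.00% | 96.00% | 88.00% | 84.00% | 75.51% |
| p(B) | 0.00% | 0.00% | 0.00% | 2.00% | 0.00% |
| p(C) | 2.00% | 4.00% | 12.00% | 14.00% | 24.49% |
| Obs. | 50 | 50 | 50 | 50 | 49 |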


- Prompt 7: Cancellation of losses against larger gains (primed)

In [29]:
results_2_7, probs_2_7 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_7", n = N)
probs_2_7

100%|██████████| 250/250 [02:08<00:00,  1.94it/s]

Experiment 1_7 uses gpt-4-1106-preview, deals with the cancellation of losses against larger gains and is primed. In this run, a total of 250 requests were made using prompt_7.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





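| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 2_7 | 2_7 | 2_7 | 2_7 | 2_7 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 30.00% | 54.00% | 44.00% | 46.94% | 38.00% |
| p(B) | 0.00% | 0.00% | 0.00% | 2.04% | 10.00% |
| p(C) | 70.00% | 46.00% | 56.00% | 51.02% | 52.00% |
| Obs. | 50 | 50 | 50 | 49 | 50 |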


- Prompt 8: Segregation of "silver linings" (primed)

In [30]:
results_2_8, probs_2_8 = PT_temperature_loop(PT_run_experiment, experiment_id = "PT_2_8", n = N)
probs_2_8

100%|██████████| 250/250 [02:10<00:00,  1.91it/s]

Experiment 1_8 uses gpt-4-1106-preview, deals with the segregation of *silver linings*, and is primed. In this run, a total of 250 requests were made using prompt_8.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





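| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 2_8 | 2_8 | 2_8 | 2_8 | 2_8 |
| Temp | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 2.00% | 8.00% | 16.00% | 23.40% |
| p(B) | 0.00% | 0.00% | 0.00% | 2.00% | 8.51% |
| p(C) | 100.00% | 98.00% | 92.00% | 82.00% | 68.09% |
| Obs. | 50 | 50 | 50 | 50 | 47 |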


--------------------------------------------

#### Model 3: Llama-2-70b

**Note:** Use a max_new_tokens value of at least 2, because Llama tends to begin its answers with a blank space.

In [55]:
# models = ['meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3'] # possibility to add further Llama models 
# temperature_list = [0.01, 1.25, 2.5, 3.75, 5]
temperature_list = [0.01, 0.5, 1, 1.5, 2] # Llama does not accept a temperature of 0 and has a maximum temperature of 5
N = 50 # Number of requests per temperature value 
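
Both Llama quirks noted here can be handled with small helpers. This is a sketch under assumptions: `clamp_temperature` and `extract_answer` are hypothetical names, the lower bound of 0.01 simply mirrors the smallest value in the temperature list, and the notebook's actual Llama query function may process answers differently.

```python
LLAMA_MIN_TEMPERATURE = 0.01  # 0 is rejected; 0.01 is the smallest value used here
LLAMA_MAX_TEMPERATURE = 5.0


def clamp_temperature(temperature: float) -> float:
    """Keep a requested temperature inside the range Llama accepts."""
    return min(max(temperature, LLAMA_MIN_TEMPERATURE), LLAMA_MAX_TEMPERATURE)


def extract_answer(raw: str):
    """Return 'A', 'B', or 'C' from a raw completion, or None if it is invalid."""
    cleaned = raw.strip()  # drop the leading blank space Llama tends to emit
    first = cleaned[:1].upper()
    return first if first in {"A", "B", "C"} else None


print([clamp_temperature(t) for t in [0, 0.5, 1, 1.5, 2]])  # [0.01, 0.5, 1, 1.5, 2]
print(extract_answer(" B"))   # B
print(extract_answer("???"))  # None
```

With a token limit of 1, a completion such as " B" would be cut off after the blank, which is why the limit needs to be at least 2.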

- Prompt 1: Segregation of gains (unprimed)

In [18]:
results_3_1, probs_3_1 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_1", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_1

100%|██████████| 250/250 [08:17<00:00,  1.99s/it]

Experiment 1_1 uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3, deals with the segregation of gains and is unprimed. In this run, a total of 250 requests were made using prompt_1.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 3_1 | 3_1 | 3_1 | 3_1 | 3_1 |
| Temp | 0.01 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 18.00% | 17.07% |
| p(B) | 100.00% | 100.00% | 100.00% | 76.00% | 73.17% |
| p(C) | 0.00% | 0.00% | 0.00% | 6.00% | 9.76% |
| Obs. | 50 | 50 | 50 | 50 | 41 |
| Model | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... |
| Scenario | 1 | 1 | 1 | 1 | 1 |
| Priming | 0 | 0 | 0 | 0 | 0 |


- Prompt 2: Integration of losses (unprimed)

In [69]:
results_3_2, probs_3_2 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_2", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_2 # already saved to csv 

| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 3_2 | 3_2 | 3_2 | 3_2 | 3_2 |
| Temp | 0.01 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 0.00% | 2.44% |
| p(B) | 0.00% | 0.00% | 0.00% | 2.00% | 36.59% |
| p(C) | 100.00% | 100.00% | 100.00% | 98.00% | 60.98% |
| Obs. | 50 | 50 | 50 | 50 | 41 |
| Model | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... |
| Scenario | 2 | 2 | 2 | 2 | 2 |
| Priming | 0 | 0 | 0 | 0 | 0 |


- Prompt 3: Cancellation of losses against larger gains (unprimed)

In [21]:
results_3_3, probs_3_3 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_3", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)    
probs_3_3

100%|██████████| 250/250 [08:06<00:00,  1.95s/it]

Experiment PT_1_3 uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3, deals with the cancellation of losses against larger gains and is unprimed. In this run, a total of 250 requests were made using PT_prompt_3.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | PT_3_3 | PT_3_3 | PT_3_3 | PT_3_3 | PT_3_3 |
| Temp | 0.01 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 100.00% | 100.00% | 100.00% | 58.00% | 52.63% |
| p(B) | 0.00% | 0.00% | 0.00% | 40.00% | 36.84% |
| p(C) | 0.00% | 0.00% | 0.00% | 2.00% | 10.53% |
| Obs. | 50 | 50 | 50 | 50 | 38 |
| Model | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... |
| Scenario | 3 | 3 | 3 | 3 | 3 |
| Priming | 0 | 0 | 0 | 0 | 0 |


- Prompt 4: Segregation of "silver linings" (unprimed)

In [20]:
results_3_4, probs_3_4 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_4", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_4

100%|██████████| 250/250 [06:58<00:00,  1.68s/it]

Experiment 1_4 uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3, deals with the segrgation of *silver linings* and is unprimed. In this run, a total of 250 requests were made using prompt_4.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 3_4 | 3_4 | 3_4 | 3_4 | 3_4 |
| Temp | 0.01 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(B) | 0.00% | 0.00% | 0.00% | 0.00% | 10.00% |
| p(C) | 100.00% | 100.00% | 100.00% | 100.00% | 90.00% |
| Obs. | 50 | 50 | 50 | 50 | 50 |
| Model | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... |
| Scenario | 4 | 4 | 4 | 4 | 4 |
| Priming | 0 | 0 | 0 | 0 | 0 |


- Prompt 5: Segregation of gains (primed)

In [21]:
results_3_5, probs_3_5 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_5", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_5

100%|██████████| 250/250 [11:45<00:00,  2.82s/it] 

Experiment 1_5 uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3, deals with the segregation of gains and is primed. In this run, a total of 250 requests were made using prompt_5.
The original results were ['p(A): 64.37%', 'p(B): 18.39%', 'p(C): 17.24%'].





| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 3_5 | 3_5 | 3_5 | 3_5 | 3_5 |
| Temp | 0.01 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 0.00% | 15.56% |
| p(B) | 100.00% | 100.00% | 100.00% | 62.00% | 53.33% |
| p(C) | 0.00% | 0.00% | 0.00% | 38.00% | 31.11% |
| Obs. | 50 | 50 | 50 | 50 | 45 |
| Model | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... |
| Scenario | 1 | 1 | 1 | 1 | 1 |
| Priming | 1 | 1 | 1 | 1 | 1 |


- Prompt 6: Integration of losses (primed)

In [22]:
results_3_6, probs_3_6 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_6", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_6

100%|██████████| 250/250 [06:45<00:00,  1.62s/it]

Experiment 1_6 uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3, deals with the integration of losses and is primed. In this run, a total of 250 requests were made using prompt_6.
The original results were ['p(A): 75.86%', 'p(B): 16.09%', 'p(C): 8.05%'].





| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 3_6 | 3_6 | 3_6 | 3_6 | 3_6 |
| Temp | 0.01 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(B) | 100.00% | 100.00% | 100.00% | 68.00% | 64.86% |
| p(C) | 0.00% | 0.00% | 0.00% | 32.00% | 35.14% |
| Obs. | 50 | 50 | 50 | 50 | 37 |
| Model | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... |
| Scenario | 2 | 2 | 2 | 2 | 2 |
| Priming | 1 | 1 | 1 | 1 | 1 |


- Prompt 7: Cancellation of losses against larger gains (primed)

In [23]:
results_3_7, probs_3_7 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_7", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_7

100%|██████████| 250/250 [07:29<00:00,  1.80s/it]

Experiment 1_7 uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3, deals with the cancellation of losses against larger gains and is primed. In this run, a total of 250 requests were made using prompt_7.
The original results were ['p(A): 25.29%', 'p(B): 70.11%', 'p(C): 4.6%'].





| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 3_7 | 3_7 | 3_7 | 3_7 | 3_7 |
| Temp | 0.01 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 100.00% | 100.00% | 100.00% | 40.00% | 43.18% |
| p(B) | 0.00% | 0.00% | 0.00% | 58.00% | 47.73% |
| p(C) | 0.00% | 0.00% | 0.00% | 2.00% | 9.09% |
| Obs. | 50 | 50 | 50 | 50 | 44 |
| Model | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... |
| Scenario | 3 | 3 | 3 | 3 | 3 |
| Priming | 1 | 1 | 1 | 1 | 1 |


- Prompt 8: Segregation of "silver linings" (primed)

In [24]:
results_3_8, probs_3_8 = PT_temperature_loop(PT_run_experiment_llama, experiment_id = "PT_3_8", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_8

100%|██████████| 250/250 [05:07<00:00,  1.23s/it]

Experiment 1_8 uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3, deals with the segregation of *silver linings*, and is primed. In this run, a total of 250 requests were made using prompt_8.
The original results were ['p(A): 21.84%', 'p(B): 72.41%', 'p(C): 5.75%'].





| | 0 | 1 | 2 | 3 | 4 |
|------------|--------|--------|--------|--------|--------|
| Experiment | 3_8 | 3_8 | 3_8 | 3_8 | 3_8 |
| Temp | 0.01 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(B) | 0.00% | 0.00% | 0.00% | 0.00% | 57.78% |
| p(C) | 100.00% | 100.00% | 100.00% | 100.00% | 42.22% |
| Obs. | 50 | 50 | 50 | 50 | 45 |
| Model | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... | meta/llama-2-70b-chat:02e509c789964a7ea8736978... |
| Scenario | 4 | 4 | 4 | 4 | 4 |
| Priming | 1 | 1 | 1 | 1 | 1 |


---

- Save the results


In [31]:
# Gather all results
PT_probs = pd.concat([probs_1_1, probs_1_2, probs_1_3, probs_1_4, probs_1_5, probs_1_6, probs_1_7, probs_1_8,
                      probs_2_1, probs_2_2, probs_2_3, probs_2_4, probs_2_5, probs_2_6, probs_2_7, probs_2_8,
                      probs_3_1, probs_3_2, probs_3_3, probs_3_4, probs_3_5, probs_3_6, probs_3_7, probs_3_8], axis = 1).transpose()

# Rename Llama model to a shorter label
PT_probs['Model'] = PT_probs['Model'].replace('meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3', 
                                              'llama-2-70b')

# Transform probabilities to float for plotting
PT_probs["p(A)"] = PT_probs["p(A)"].str.rstrip('%').astype('float')
PT_probs["p(B)"] = PT_probs["p(B)"].str.rstrip('%').astype('float')
PT_probs["p(C)"] = PT_probs["p(C)"].str.rstrip('%').astype('float')

# Save to csv 
PT_probs.to_csv("Output/PT_probs.csv", index = True)

Folder PT_probs_dfs successfully created.


---

In [26]:
# Function tests
import plotly.graph_objects as go

In [67]:
def plot_results_individual(df):
    """
    Function to plot answer probabilities per temperature value as an interactive grouped bar chart.

    Args:
        df: Dataframe with the rows "Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs." and "Model"

    Returns:
        fig: Plotly figure object
    """
    # Get number of observations per temperature value
    n_observations = df.loc["Obs."]
    
    # Get temperature values
    temperature = df.loc["Temp"]

    # Get model (identical for all temperature values, so take the first entry)
    model = df.loc["Model"].iloc[0]

    # Get experiment id
    experiment_id = df.loc["Experiment"].iloc[0]

    # Hover text shared by all three bar traces
    hovertemplate = "Temperature: %{x}<br>Probability: %{y:.2f}%<br>Observations: %{customdata}<extra></extra>"

    # Bar color per answer option
    colors = {"p(A)": "#e9724d", "p(B)": "#868686", "p(C)": "#92cad1"}

    # One bar trace per answer option
    fig = go.Figure(data=[
        go.Bar(
            name=answer,
            x=temperature,
            y=df.loc[answer],
            customdata=n_observations,
            hovertemplate=hovertemplate,
            marker=dict(color=color),
        )
        for answer, color in colors.items()
    ])

    fig.update_layout(
        barmode='group',
        xaxis=dict(
            tickmode='array',
            tickvals=temperature,
            ticktext=temperature,
            title="Temperature",
            title_font=dict(size=18),
        ),
        yaxis=dict(
            title="Probability (%)",
            title_font=dict(size=18),
        ),
        title=dict(
            text=f"Distribution of answers for experiment {experiment_id} using model {model}",
            x=0.5,  # Center alignment horizontally
            y=0.87,  # Vertical alignment
            font=dict(size=22),
        ),
        legend=dict(
            title=dict(text="Probabilities"),
        ),
        bargap=0.3,  # Gap between temperature values
    )
    return fig