### Decoy Effect



"temperature - A measure of how often the model outputs a less likely token. The higher the temperature, the more random (and usually creative) the output. This, however, is not the same as “truthfulness”. For most factual use cases such as data extraction, and truthful Q&A, the temperature of 0 is best." (https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)

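As a rough illustration of what the temperature parameter does: a common formulation divides the model's raw token scores (logits) by the temperature before applying a softmax. The sketch below assumes that formulation. The function name `softmax_with_temperature` and the toy logits are made up for illustration, and the exact sampling procedure behind the API may differ.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities, scaled by temperature."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Limit case (assumption): all probability mass on the highest score
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical logits for three answer letters A, B, C
toy_logits = [1.0, 0.5, 3.0]
for t in [0, 0.5, 1, 2]:
    print(t, np.round(softmax_with_temperature(toy_logits, t), 3))
```

Lower temperatures concentrate the probability on the most likely token, while higher temperatures flatten the distribution.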



-------------------------------------

This notebook aims to recreate some findings concerning the **Decoy Effect** used in pricing. 

The answer options for this experiment are literal copies of those in the original study; only the words "priced at __$" were added to each option. Their wording stays the same throughout the entire experiment:
-  A: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$.
-  B: One-year subscription to the print edition of The Economist, priced at 125$.
-  C: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$.


#### From Ariely's book "Predictably Irrational":

"When I gave these options to 100 students at MIT's Sloan
School of Management, they opted as follows:
1. Internet-only subscription for $59—16 students
2. Print-only subscription for $125—zero students
3. Print-and-Internet subscription for $ 125—84 students"

(page 5)

"And the absence of the decoy
had them choosing differently, with 32 for print-and-Internet
and 68 for Internet-only"

(page 6)

| Answer option         | Scenario 1 | Scenario 2 (no 2nd option) |
|-----------------------|------------|----------------------------|
| Online subscription   | 16%        | 68%                        |
| Print subscription    | 0%         | 0%                         |
| Combination           | 84%        | 32%                        |
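The size of the decoy effect in the original study can be read off as the change in each option's share between the two scenarios. A minimal sketch using only the percentages from the table above (the variable names are illustrative):

```python
# Shares reported by Ariely (in %), copied from the table above
with_decoy = {"online": 16, "print": 0, "combination": 84}
without_decoy = {"online": 68, "print": 0, "combination": 32}

# Change in percentage points when the decoy is present
shift = {option: with_decoy[option] - without_decoy[option] for option in with_decoy}
print(shift)  # {'online': -52, 'print': 0, 'combination': 52}
```

In other words, the presence of the print-only option moved 52 percentage points from the online subscription to the combination.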



In [2]:
from openai import OpenAI
import openai
import matplotlib.pyplot as plt
import os 
import numpy as np
import pandas as pd
from tqdm import tqdm
import replicate

In [3]:
# Get API key (previously saved as environment variable)
openai.api_key = os.environ["OPENAI_API_KEY"]

# Set client
client = OpenAI()

# Set global plot style 
plt.style.use('seaborn-v0_8')

# Set plots to be displayed in notebook
%matplotlib inline

-----------------------------------------------------------

#### Setting up the prompts used for the experiment

We now formulate 8 different prompts: 

First, we present the model with the original situation, i.e. all three answer options. To recreate the second scenario of the study, we present the same question with the second (decoy) option removed.
Furthermore, we **prime** the model to play the role of a market researcher who knows about the decoy effect, and query it with both study situations again.

Afterwards, to check for possible biases induced by either the answer letters themselves (A, B, C) or the order of the answers, we rename and reorder the options and prompt the model four more times.
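The eight prompts form a 2 x 2 x 2 design (renamed and reordered or not, primed or not, decoy removed or not). As a sketch, the numbering used in this notebook can be reproduced with `itertools.product`; the dictionary keys are illustrative names, not variables used elsewhere in the notebook:

```python
from itertools import product

# Factors in the order that reproduces the prompt numbering 1 to 8:
# (reordered, primed, decoy_removed), each False or True
design = [
    {"prompt": i, "reordered": r, "primed": p, "decoy_removed": d}
    for i, (r, p, d) in enumerate(product([False, True], repeat=3), start=1)
]

for row in design:
    print(row)
```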

- Prompt 1: Unprimed & all answer options

In [4]:
prompt_1 = """You are presented with the following subscription alternatives for the "The Economist" magazine:
        A: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$.
        B: One-year subscription to the print edition of The Economist, priced at 125$.
        C: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$.
        Which alternative would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

- Prompt 2: Unprimed & second option (decoy) removed

In [5]:
prompt_2 = """You are presented with the following subscription alternatives for the "The Economist" magazine:
        A: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$.
        B: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$. 
        Which alternative would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Removed original option B and made option C the new option B

- Prompt 3: Primed & all answer options

In [6]:
prompt_3 = """You are a market researcher that knows about the Decoy Effect in pricing. 
        You are presented with the following subscription alternatives for the "The Economist" magazine:
        A: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$.
        B: One-year subscription to the print edition of The Economist, priced at 125$.
        C: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$.
        Which alternative would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""


- Prompt 4: Primed & second option (decoy) removed

In [7]:
prompt_4 = """You are a market researcher that knows about the Decoy Effect in pricing. 
         You are presented with the following subscription alternatives for the "The Economist" magazine:
         A: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$.
         B: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$.
         Which alternative would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Removed original option B and made option C the new option B

---------------------

#### Renaming and reordering answer options

Following the approach of Mendler-Dünner et al. (2023), we not only reorder the answers but also label them with letters that are less common in the English language.
- A (online) -> Y
- B (print) -> Q
- C (combination) -> X

Afterwards, the first option is moved to the end. In terms of the original labels, the order is now B, C, A:
- Q (print)
- X (combination)
- Y (online)
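The renaming and reordering can also be expressed programmatically. The following is only a sketch with illustrative names; the prompts in this notebook are written out by hand, using the letters Q, X and Y as in prompts 5 to 8:

```python
# Original labels and what they stand for
options = {"A": "online", "B": "print", "C": "combination"}

# Less common letters, as used in prompts 5 to 8
relabel = {"A": "Y", "B": "Q", "C": "X"}

# The first option is moved to the end
new_order = ["B", "C", "A"]

reordered = [(relabel[label], options[label]) for label in new_order]
print(reordered)  # [('Q', 'print'), ('X', 'combination'), ('Y', 'online')]
```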

- Prompt 5: Unprimed & all answer options, renamed & reordered

In [8]:
prompt_5 = """You are presented with the following subscription alternatives for the "The Economist" magazine:
         Q: One-year subscription to the print edition of The Economist, priced at 125$.
         X: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$.
         Y: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$. 
         Which alternative would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""


- Prompt 6: Unprimed & second option (decoy) removed, renamed & reordered

In [9]:
prompt_6 = """You are presented with the following subscription alternatives for the "The Economist" magazine:
         X: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$.
         Y: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$. 
         Which alternative would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Removed option Q

- Prompt 7: Primed & all answer options, renamed & reordered

In [10]:
prompt_7 = """You are a market researcher that knows about the Decoy Effect in pricing. 
         You are presented with the following subscription alternatives for the "The Economist" magazine:
         Q: One-year subscription to the print edition of The Economist, priced at 125$.
         X: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$.
         Y: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$. 
         Which alternative would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

- Prompt 8: Primed & second option (decoy) removed, renamed & reordered

In [11]:
prompt_8 = """You are a market researcher that knows about the Decoy Effect in pricing. 
         You are presented with the following subscription alternatives for the "The Economist" magazine:
         X: One-year subscription to the print edition of The Economist and online access to all articles from The Economist since 1997, priced at 125$.
         Y: One-year subscription to Economist.com. Includes online access to all articles from The Economist since 1997, priced at 59$. 
         Which alternative would you choose? Please answer by only giving the letter of the alternative you would choose without any reasoning."""

# Removed option Q

------------------------------------------

- Helpful dictionaries 

The experiments we will run in this notebook are very similar in study design and, in some cases, also in the results we expect. We therefore need to make sure that we associate the results with the correct study design. That is why the following dictionaries are implemented to look up, e.g., which model was used for an experiment.

They will also be used inside the functions that call the API multiple times and output some information about the experiment in order to identify it correctly. 
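Because each experiment id has the form `<model number>_<prompt number>`, most of this metadata could in principle be derived from the id itself instead of being written out. The helper `describe_experiment` below is hypothetical and is not used anywhere in this notebook; it is a sketch that mirrors the scenario, priming and reorder values of the hand-written dictionaries:

```python
def describe_experiment(experiment_id: str) -> dict:
    """Hypothetical helper: decode an id such as "2_7" into its design factors."""
    model_no, prompt_no = (int(part) for part in experiment_id.split("_"))
    return {
        "model_no": model_no,
        "prompt_name": f"prompt_{prompt_no}",
        "scenario": 1 if prompt_no % 2 == 1 else 2,  # odd prompts keep the decoy
        "priming": int(prompt_no in (3, 4, 7, 8)),   # prompts 3, 4, 7, 8 are primed
        "reorder": int(prompt_no >= 5),              # prompts 5 to 8 are renamed and reordered
    }

print(describe_experiment("2_7"))
```

Explicit dictionaries are more verbose, but they make every assignment visible at a glance, which may be why the notebook spells them out.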

In [12]:
# Dictionary that returns the literal prompt for a given experiment id (used in function call). key: experiment_id, value: prompt
experiment_prompts_dict = {
    "1_1": prompt_1,
    "1_2": prompt_2,
    "1_3": prompt_3,
    "1_4": prompt_4,
    "1_5": prompt_5,
    "1_6": prompt_6,
    "1_7": prompt_7,
    "1_8": prompt_8,
    "2_1": prompt_1,
    "2_2": prompt_2,
    "2_3": prompt_3,
    "2_4": prompt_4,
    "2_5": prompt_5,
    "2_6": prompt_6,
    "2_7": prompt_7,
    "2_8": prompt_8,
    "3_1": prompt_1,
    "3_2": prompt_2,
    "3_3": prompt_3,
    "3_4": prompt_4,
    "3_5": prompt_5,
    "3_6": prompt_6,
    "3_7": prompt_7,
    "3_8": prompt_8,
}

# The following dictionary is only used for a check in the function calls.
# It returns the variable name of the prompt that was used in the experiment. key: experiment_id, value: prompt_name
prompt_ids_dict = {
    "1_1": "prompt_1",
    "1_2": "prompt_2",
    "1_3": "prompt_3",
    "1_4": "prompt_4",
    "1_5": "prompt_5",
    "1_6": "prompt_6",
    "1_7": "prompt_7",
    "1_8": "prompt_8",
    "2_1": "prompt_1",
    "2_2": "prompt_2",
    "2_3": "prompt_3",
    "2_4": "prompt_4",
    "2_5": "prompt_5",
    "2_6": "prompt_6",
    "2_7": "prompt_7",
    "2_8": "prompt_8",
    "3_1": "prompt_1",
    "3_2": "prompt_2",
    "3_3": "prompt_3",
    "3_4": "prompt_4",
    "3_5": "prompt_5",
    "3_6": "prompt_6",
    "3_7": "prompt_7",
    "3_8": "prompt_8",
}

# Dictionary to look up which model to use for a given experiment id (used in function call). key: experiment id, value: model name
model_dict = {
    "1_1": "gpt-3.5-turbo",
    "1_2": "gpt-3.5-turbo",
    "1_3": "gpt-3.5-turbo",
    "1_4": "gpt-3.5-turbo",
    "1_5": "gpt-3.5-turbo",
    "1_6": "gpt-3.5-turbo",
    "1_7": "gpt-3.5-turbo",
    "1_8": "gpt-3.5-turbo",
    "2_1": "gpt-4-1106-preview",
    "2_2": "gpt-4-1106-preview",
    "2_3": "gpt-4-1106-preview",
    "2_4": "gpt-4-1106-preview",
    "2_5": "gpt-4-1106-preview",
    "2_6": "gpt-4-1106-preview",
    "2_7": "gpt-4-1106-preview",
    "2_8": "gpt-4-1106-preview",
    "3_1": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_2": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_3": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_4": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_5": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_6": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_7": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    "3_8": 'meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3',
    }

# Dictionary to look up, what the study design of each experiment was. key: experiment id, value: experiment design 
experiment_dict = {
    "1_1": f"Experiment 1_1 contains all answer options, is unprimed and uses {model_dict['1_1']}.",
    "1_2": f"Experiment 1_2 has the decoy removed, is unprimed and uses {model_dict['1_2']}.",
    "1_3": f"Experiment 1_3 contains all answer options, is primed and uses {model_dict['1_3']}.",
    "1_4": f"Experiment 1_4 has the decoy removed, is primed and uses {model_dict['1_4']}.",
    "1_5": f"Experiment 1_5 contains all answer options renamed and reordered, is unprimed and uses {model_dict['1_5']}.",
    "1_6": f"Experiment 1_6 has the decoy removed, answer options renamed and reordered, is unprimed and uses {model_dict['1_6']}.",
    "1_7": f"Experiment 1_7 contains all answer options renamed and reordered, is primed and uses {model_dict['1_7']}.",
    "1_8": f"Experiment 1_8 has the decoy removed, answer options renamed and reordered, is primed and uses {model_dict['1_8']}.",
    "2_1": f"Experiment 2_1 contains all answer options, is unprimed and uses {model_dict['2_1']}.",
    "2_2": f"Experiment 2_2 has the decoy removed, is unprimed and uses {model_dict['2_2']}.",
    "2_3": f"Experiment 2_3 contains all answer options, is primed and uses {model_dict['2_3']}.",
    "2_4": f"Experiment 2_4 has the decoy removed, is primed and uses {model_dict['2_4']}.",
    "2_5": f"Experiment 2_5 contains all answer options renamed and reordered, is unprimed and uses {model_dict['2_5']}.",
    "2_6": f"Experiment 2_6 has the decoy removed, answer options renamed and reordered, is unprimed and uses {model_dict['2_6']}.",
    "2_7": f"Experiment 2_7 contains all answer options renamed and reordered, is primed and uses {model_dict['2_7']}.",
    "2_8": f"Experiment 2_8 has the decoy removed, answer options renamed and reordered, is primed and uses {model_dict['2_8']}.",
    "3_1": f"Experiment 3_1 contains all answer options, is unprimed and uses {model_dict['3_1']}.",
    "3_2": f"Experiment 3_2 has the decoy removed, is unprimed and uses {model_dict['3_2']}.",
    "3_3": f"Experiment 3_3 contains all answer options, is primed and uses {model_dict['3_3']}.",
    "3_4": f"Experiment 3_4 has the decoy removed, is primed and uses {model_dict['3_4']}.",
    "3_5": f"Experiment 3_5 contains all answer options renamed and reordered, is unprimed and uses {model_dict['3_5']}.",
    "3_6": f"Experiment 3_6 has the decoy removed, answer options renamed and reordered, is unprimed and uses {model_dict['3_6']}.",
    "3_7": f"Experiment 3_7 contains all answer options renamed and reordered, is primed and uses {model_dict['3_7']}.",
    "3_8": f"Experiment 3_8 has the decoy removed, answer options renamed and reordered, is primed and uses {model_dict['3_8']}.",
}

# Dictionary to look up the original results of the experiments. key: experiment id, value: original result
results_dict = {
    "1_1": "A: 16%, B: 0%, C: 84%",
    "1_2": "A: 68%, B: 0%, C: 32%",
    "1_3": "A: 16%, B: 0%, C: 84%",
    "1_4": "A: 68%, B: 0%, C: 32%",
    "1_5": "A: 16%, B: 0%, C: 84%",
    "1_6": "A: 68%, B: 0%, C: 32%",
    "1_7": "A: 16%, B: 0%, C: 84%",
    "1_8": "A: 68%, B: 0%, C: 32%",
    "2_1": "A: 16%, B: 0%, C: 84%",
    "2_2": "A: 68%, B: 0%, C: 32%",
    "2_3": "A: 16%, B: 0%, C: 84%",
    "2_4": "A: 68%, B: 0%, C: 32%",
    "2_5": "A: 16%, B: 0%, C: 84%",
    "2_6": "A: 68%, B: 0%, C: 32%",
    "2_7": "A: 16%, B: 0%, C: 84%",
    "2_8": "A: 68%, B: 0%, C: 32%",
    "3_1": "A: 16%, B: 0%, C: 84%",
    "3_2": "A: 68%, B: 0%, C: 32%",
    "3_3": "A: 16%, B: 0%, C: 84%",
    "3_4": "A: 68%, B: 0%, C: 32%",
    "3_5": "A: 16%, B: 0%, C: 84%",
    "3_6": "A: 68%, B: 0%, C: 32%",
    "3_7": "A: 16%, B: 0%, C: 84%",
    "3_8": "A: 68%, B: 0%, C: 32%",
}

# Dictionary to look up the scenario of each experiment. key: experiment id, value: scenario (1: With Decoy, 2: Without Decoy)
scenario_dict = {
    "1_1": 1,
    "1_2": 2,
    "1_3": 1,
    "1_4": 2,
    "1_5": 1,
    "1_6": 2,
    "1_7": 1,
    "1_8": 2,
    "2_1": 1,
    "2_2": 2,
    "2_3": 1,
    "2_4": 2,
    "2_5": 1,
    "2_6": 2,
    "2_7": 1,
    "2_8": 2,
    "3_1": 1,
    "3_2": 2,
    "3_3": 1,
    "3_4": 2,
    "3_5": 1,
    "3_6": 2,
    "3_7": 1,
    "3_8": 2,
}

# Dictionary to look up, whether the experiment was primed or not. key: experiment id, value: priming (1: Primed, 0: Unprimed)
priming_dict = {
    "1_1": 0,
    "1_2": 0,
    "1_3": 1,
    "1_4": 1,
    "1_5": 0,
    "1_6": 0,
    "1_7": 1,
    "1_8": 1,
    "2_1": 0,
    "2_2": 0,
    "2_3": 1,
    "2_4": 1,
    "2_5": 0,
    "2_6": 0,
    "2_7": 1,
    "2_8": 1,
    "3_1": 0,
    "3_2": 0,
    "3_3": 1,
    "3_4": 1,
    "3_5": 0,
    "3_6": 0,
    "3_7": 1,
    "3_8": 1,
}

# Dictionary to look up, whether answers were renamed and reordered or not. key: experiment id, value: indicator (1: Renamed and reordered, 0: Not renamed and reordered)
reorder_dict = {
    "1_1": 0,
    "1_2": 0,
    "1_3": 0,
    "1_4": 0,
    "1_5": 1,
    "1_6": 1,
    "1_7": 1,
    "1_8": 1,
    "2_1": 0,
    "2_2": 0,
    "2_3": 0,
    "2_4": 0,
    "2_5": 1,
    "2_6": 1,
    "2_7": 1,
    "2_8": 1,
    "3_1": 0,
    "3_2": 0,
    "3_3": 0,
    "3_4": 0,
    "3_5": 1,
    "3_6": 1,
    "3_7": 1,
    "3_8": 1,
}

----------------------------------------------

#### Functions 

The following functions are introduced in order to emulate a survey for our pre-implemented prompts.

In [13]:
# Function to count answers depending on prompt design which is reflected in the experiment id
def count_answers(answers: list, experiment_id: str):
    if experiment_id in ["1_1", "1_3","2_1", "2_3", "3_1", "3_3"]:
        A = answers.count("A")
        B = answers.count("B")
        C = answers.count("C")
    elif experiment_id in ["1_2", "1_4", "2_2", "2_4", "3_2", "3_4"]:
        A = answers.count("A")
        B = 0 # Option B was removed
        C = answers.count("B") # makes comparison of results over prompts easier 
    elif experiment_id in ["1_5", "1_7", "2_5", "2_7", "3_5", "3_7"]:
        A = answers.count("Y")
        B = answers.count("Q")
        C = answers.count("X")
    elif experiment_id in ["1_6", "1_8", "2_6", "2_8", "3_6", "3_8"]:
        A = answers.count("Y")
        B = 0 # Option Q was removed
        C = answers.count("X")
    return A, B, C

# Function to count correct answers depending on prompt design which is reflected in the experiment id (used for percentages)
def correct_answers(answers: list, experiment_id: str):
    if experiment_id in ["1_1", "1_3","2_1", "2_3", "3_1", "3_3"]:
        len_correct = sum(1 for ans in answers if ans in ["A", "B", "C"])
    elif experiment_id in ["1_2", "1_4", "2_2", "2_4", "3_2", "3_4"]:
        len_correct = sum(1 for ans in answers if ans in ["A", "B"])
    elif experiment_id in ["1_5", "1_7", "2_5", "2_7", "3_5", "3_7"]:
        len_correct = sum(1 for ans in answers if ans in ["Y", "Q", "X"])
    elif experiment_id in ["1_6", "1_8", "2_6", "2_8", "3_6", "3_8"]:
        len_correct = sum(1 for ans in answers if ans in ["Y", "X"])
    return len_correct  
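Both functions above count exact matches only, so a reply such as "C." is treated as invalid. If a more lenient rule were wanted, the leading letter could be extracted first. The helper `extract_letter` below is a hypothetical sketch and not part of the notebook's pipeline; applying it would change the reported counts.

```python
import re

def extract_letter(raw: str, valid: tuple):
    """Hypothetical helper: return the leading answer letter if it is a valid option, else None."""
    # A single letter at the start, followed by a word boundary (so "Cat" does not count as "C")
    match = re.match(r"\s*([A-Za-z])\b", raw)
    if match and match.group(1).upper() in valid:
        return match.group(1).upper()
    return None

print(extract_letter("C.", ("A", "B", "C")))
print(extract_letter("Cat", ("A", "B", "C")))
```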


- Functions to query 1 prompt n times

In [14]:
# Function to run a single experiment n times
def run_experiment(experiment_id: str, n: int, progress_bar, temperature: float):
    """
    Function to query ChatGPT multiple times with a survey having answers designed as: A, B, C.
    
    Args:
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        n (int): Number of queries to be made
        progress_bar: tqdm progress bar, updated once per query
        temperature (float): Degree of randomness with range 0 (deterministic) to 2 (random)
        
    Returns:
        results (list): List containing count of answers for each option, also containing experiment_id, temperature and number of observations
        probs (list): List containing probability of each option being chosen, also containing experiment_id, temperature and number of observations
    """
    answers = []
    for _ in range(n): 
        response = client.chat.completions.create(
            model = model_dict[experiment_id], 
            max_tokens = 5,
            temperature = temperature, # range is 0 to 2
            messages = [
            {"role": "system", "content": "Only answer with the letter of the alternative you would choose without any reasoning."},
            {"role": "user", "content": experiment_prompts_dict[experiment_id]},
                   ])

        # Store the answer in the list
        answer = response.choices[0].message.content
        answers.append(answer.strip())
        # Update progress bar (given from either temperature loop, or set locally)
        progress_bar.update(1)

    # Count the answers
    A, B, C = count_answers(answers, experiment_id) # if/else statement of function deals with different answer options in different experiments
    
    # Count of correct answers
    len_correct = int(correct_answers(answers, experiment_id)) # if/else of function makes sure that we count the correct answers according to the experiment id 

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, model_dict[experiment_id], scenario_dict[experiment_id], priming_dict[experiment_id], reorder_dict[experiment_id]]

    # Calculate probabilities
    p_a = f"{(A / (len_correct + 0.000000001)) * 100:.2f}%"
    p_b = f"{(B / (len_correct + 0.000000001)) * 100:.2f}%"
    p_c = f"{(C / (len_correct + 0.000000001)) * 100:.2f}%"

    # Collect probabilities in a dataframe
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, model_dict[experiment_id], scenario_dict[experiment_id], priming_dict[experiment_id], reorder_dict[experiment_id]]
    # Print progress
    # print(f"Experiment {experiment_id} with {n} observations, using {prompt_ids_dict[experiment_id]} and temperature {temperature} completed.")

    return results, probs 

- Function to query 1 prompt n times (Llama)

In [38]:
def run_experiment_llama(experiment_id, n, progress_bar, temperature):
    answers = []
    for _ in range(n):
        response = replicate.run(
            model_dict[experiment_id],
            input = {
                "system_prompt": "Only answer with the letter of the alternative you would choose without any reasoning.",
                "temperature": temperature,
                "max_new_tokens": 2, 
                "prompt": experiment_prompts_dict[experiment_id]
            }
        )
        # Grab answer and append to list
        answer = "" # Set to empty string, otherwise it would append the previous answer to the new one
        for item in response:
            answer = answer + item
        answers.append(answer.strip())

        # Update progress bar
        progress_bar.update(1)

    # Count the answers
    A, B, C = count_answers(answers, experiment_id) # if/else statement of function deals with different answer options in different experiments
    
    # Count of correct answers
    len_correct = int(correct_answers(answers, experiment_id)) # if/else of function makes sure that we count the correct answers according to the experiment id 

    # Collecting results in a list
    results = [experiment_id, temperature, A, B, C, len_correct, model_dict[experiment_id], scenario_dict[experiment_id], priming_dict[experiment_id], reorder_dict[experiment_id]]

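    # Getting percentage of each answer (small constant avoids division by zero if no answer was valid)
    p_a = f"{(A / (len_correct + 0.000000001)) * 100:.2f}%"
    p_b = f"{(B / (len_correct + 0.000000001)) * 100:.2f}%"
    p_c = f"{(C / (len_correct + 0.000000001)) * 100:.2f}%"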

    # Collect probabilities in a dataframe
    probs = [experiment_id, temperature, p_a, p_b, p_c, len_correct, model_dict[experiment_id], scenario_dict[experiment_id], priming_dict[experiment_id], reorder_dict[experiment_id]]
    
    # Give out results
    return results, probs

- Function to loop run_experiment() over a list of temperature values

In [16]:
# Function to run 1 experiment over different temperature values
def temperature_loop(function, experiment_id: str, temperature_list: list = [0, 0.5, 1, 1.5, 2], n: int = 50):
    """
    Function to run an experiment over different temperature values.
    
    Args:
        function (function): Function to be used for querying the model, e.g. run_experiment()
        experiment_id (str): ID of the experiment to be run. Contains info about prompt and model
        temperature_list (list): List of temperature values to be looped over
        n (int): Number of requests for each prompt per temperature value
        
    Returns:
        results_df: Dataframe with experiment results
        probs_df: Dataframe with answer probabilities
    """    
    # Empty lists for storing results
    results_list = []
    probs_list = []
    # Initialize progress bar -> used as input for run_experiment()
    progress_bar = tqdm(range(n*len(temperature_list)))

    # Loop over different temperature values, calling the input function once per value (i.e. querying the model n times each)
    for temperature in temperature_list:
        results, probs = function(experiment_id = experiment_id, n = n, temperature = temperature, progress_bar = progress_bar) 
        results_list.append(results)
        probs_list.append(probs)
    
    # Stop progress bar
    progress_bar.close()

    # Horizontally concatenate the results, transpose, and set index
    results_df = pd.DataFrame(results_list).transpose().set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming", "Reorder"]))
    probs_df = pd.DataFrame(probs_list).transpose().set_index(pd.Index(["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs.", "Model", "Scenario", "Priming", "Reorder"]))
   
    # Return some information about the experiment as a check
    check = f"{experiment_dict[experiment_id]} In this run, a total of {n*len(temperature_list)} requests were made using {prompt_ids_dict[experiment_id]}."
    # Print information about the experiment
    print(check)
    # Print original results 
    print(f"The original results were {results_dict[experiment_id]}.")

    
    return results_df, probs_df
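To make the shape of the returned dataframes easier to picture: each call to the query function yields one list per temperature value, and transposing turns these lists into columns, with the labels as the row index. A minimal sketch with illustrative values and a shortened label list (no API calls are made here):

```python
import pandas as pd

# One list per temperature value (illustrative values only)
rows = [
    ["1_1", 0.0, 0, 0, 100, 100],
    ["1_1", 0.5, 0, 1, 99, 100],
]
labels = ["Experiment", "Temp", "p(A)", "p(B)", "p(C)", "Obs."]

# Same construction as in temperature_loop(): transpose, then label the rows
df = pd.DataFrame(rows).transpose().set_index(pd.Index(labels))
print(df)
```

Each column then corresponds to one temperature value, which is why `plot_results()` reads rows such as `df.loc["Temp"]`.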

- Function to plot distribution of answer probabilities

In [17]:
# Function to plot distribution of answer probabilities
def plot_results(df: pd.DataFrame):
    
    # Get experiment id and model name for plot title from dictionaries
    experiment_id = df.iloc[0, 0]
    model = model_dict[experiment_id]
    
    X = df.loc["Temp"]
    p_a = df.loc["p(A)"].str.rstrip('%').astype('float')  # Convert percentages to float
    p_b = df.loc["p(B)"].str.rstrip('%').astype('float')
    p_c = df.loc["p(C)"].str.rstrip('%').astype('float')

    X_axis = np.arange(len(X)) 

    plt.figure(figsize = (10, 5))
    ax = plt.gca()
    ax.bar(X_axis- 0.25, p_a, 0.25, label = 'p(A)', color = "#8C1515") 
    ax.bar(X_axis, p_b, 0.25,  label = 'p(B)', color = "#507FAB") 
    ax.bar(X_axis+ 0.25 , p_c,  0.25, label = 'p(C)', color = '#D9A84A')

    ax.set_xticks(X_axis, X)
    ax.set_xlabel("Temperature")
    ax.set_ylabel("Probability (%)")
    ax.set_ylim(0, 110)
    ax.set_title(f"Distribution of answers per temperature value for experiment {experiment_id} using {model}")
    ax.legend()  
    plt.show()

---------------------

## Comparing different LLMs

The results variables will be structured as: results_model-id_prompt-id.

We will refer to "GPT-3.5-turbo" as model 1, "GPT-4-1106-preview" as model 2 and "Llama-2-70B" as model 3.

#### Model 1: GPT-3.5-Turbo (Model training ended in September 2021)

In [18]:
# Set number of requests per temperature value
N = 100 

- Prompt 1: Unprimed & all answer options

In [19]:
# Call function
results_1_1, probs_1_1 = temperature_loop(run_experiment, experiment_id = "1_1", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
# Display results
probs_1_1

100%|██████████| 500/500 [05:09<00:00,  1.62it/s]

Experiment 1_1 contains all answer options, is unprimed and uses gpt-3.5-turbo. In this run, a total of 500 requests were made using prompt_1.
The original results were A: 16%, B: 0%, C: 84%.





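|            | 0 | 1 | 2 | 3 | 4 |
|------------|---|---|---|---|---|
| Experiment | 1_1 | 1_1 | 1_1 | 1_1 | 1_1 |
| Temp       | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A)       | 0.00% | 0.00% | 6.00% | 7.22% | 10.59% |
| p(B)       | 0.00% | 1.00% | 5.00% | 18.56% | 17.65% |
| p(C)       | 100.00% | 99.00% | 89.00% | 74.23% | 71.76% |
| Obs.       | 100 | 100 | 100 | 97 | 85 |
| Model      | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario   | 1 | 1 | 1 | 1 | 1 |
| Priming    | 0 | 0 | 0 | 0 | 0 |
| Reorder    | 0 | 0 | 0 | 0 | 0 |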


- Prompt 2: Unprimed & second option (decoy) removed

In [20]:
results_1_2, probs_1_2 = temperature_loop(run_experiment, experiment_id = "1_2", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_2

100%|██████████| 500/500 [04:23<00:00,  1.90it/s]

Experiment 1_2 has the decoy removed, is unprimed and uses gpt-3.5-turbo. In this run, a total of 500 requests were made using prompt_2.
The original results were A: 68%, B: 0%, C: 32%.





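|            | 0 | 1 | 2 | 3 | 4 |
|------------|---|---|---|---|---|
| Experiment | 1_2 | 1_2 | 1_2 | 1_2 | 1_2 |
| Temp       | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A)       | 0.00% | 1.00% | 3.00% | 4.08% | 15.29% |
| p(B)       | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(C)       | 100.00% | 99.00% | 97.00% | 95.92% | 84.71% |
| Obs.       | 100 | 100 | 100 | 98 | 85 |
| Model      | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario   | 2 | 2 | 2 | 2 | 2 |
| Priming    | 0 | 0 | 0 | 0 | 0 |
| Reorder    | 0 | 0 | 0 | 0 | 0 |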


- Prompt 3: Primed & all answer options

In [21]:
results_1_3, probs_1_3 = temperature_loop(run_experiment, experiment_id = "1_3", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_3

100%|██████████| 500/500 [06:28<00:00,  1.29it/s]  

Experiment 1_3 contains all answer options, is primed and uses gpt-3.5-turbo. In this run, a total of 500 requests were made using prompt_3.
The original results were A: 16%, B: 0%, C: 84%.





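|            | 0 | 1 | 2 | 3 | 4 |
|------------|---|---|---|---|---|
| Experiment | 1_3 | 1_3 | 1_3 | 1_3 | 1_3 |
| Temp       | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A)       | 0.00% | 0.00% | 9.09% | 7.14% | 12.50% |
| p(B)       | 0.00% | 4.00% | 21.21% | 19.39% | 34.09% |
| p(C)       | 100.00% | 96.00% | 69.70% | 73.47% | 53.41% |
| Obs.       | 100 | 100 | 99 | 98 | 88 |
| Model      | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario   | 1 | 1 | 1 | 1 | 1 |
| Priming    | 1 | 1 | 1 | 1 | 1 |
| Reorder    | 0 | 0 | 0 | 0 | 0 |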


- Prompt 4: Primed & second option (decoy) removed

In [22]:
results_1_4, probs_1_4 = temperature_loop(run_experiment, experiment_id = "1_4", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_4

100%|██████████| 500/500 [04:17<00:00,  1.95it/s]

Experiment 1_4 has the decoy removed, is primed and uses gpt-3.5-turbo. In this run, a total of 500 requests were made using prompt_4.
The original results were A: 68%, B: 0%, C: 32%.





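|            | 0 | 1 | 2 | 3 | 4 |
|------------|---|---|---|---|---|
| Experiment | 1_4 | 1_4 | 1_4 | 1_4 | 1_4 |
| Temp       | 0.0 | 0.5 | 1.0 | 1.5 | 2.0 |
| p(A)       | 0.00% | 1.00% | 5.00% | 16.33% | 20.45% |
| p(B)       | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| p(C)       | 100.00% | 99.00% | 95.00% | 83.67% | 79.55% |
| Obs.       | 100 | 100 | 100 | 98 | 88 |
| Model      | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo | gpt-3.5-turbo |
| Scenario   | 2 | 2 | 2 | 2 | 2 |
| Priming    | 1 | 1 | 1 | 1 | 1 |
| Reorder    | 0 | 0 | 0 | 0 | 0 |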



-----------------------------

- Prompt 5: Unprimed & all answer options, renamed & reordered

In [23]:
results_1_5, probs_1_5 = temperature_loop(run_experiment, experiment_id = "1_5", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_5

100%|██████████| 500/500 [05:49<00:00,  1.43it/s]  

Experiment 1_5 contains all answer options renamed and reordered, is unprimed and uses gpt-3.5-turbo. In this run, a total of 500 requests were made using prompt_5.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,1_5,1_5,1_5,1_5,1_5
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,11.00%,14.14%,21.25%
p(B),0.00%,0.00%,1.00%,4.04%,5.00%
p(C),100.00%,100.00%,88.00%,81.82%,73.75%
Obs.,100,100,100,99,80
Model,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo
Scenario,1,1,1,1,1
Priming,0,0,0,0,0
Reorder,1,1,1,1,1


- Prompt 6: Unprimed & second option (decoy) removed, renamed & reordered

In [24]:
results_1_6, probs_1_6 = temperature_loop(run_experiment, experiment_id = "1_6", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_6

100%|██████████| 500/500 [05:42<00:00,  1.46it/s]

Experiment 1_6 has the decoy removed, answer options renamed and reordered, is unprimed and uses gpt-3.5-turbo. In this run, a total of 500 requests were made using prompt_6.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,1_6,1_6,1_6,1_6,1_6
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,8.00%,21.21%,30.00%,39.33%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,92.00%,78.79%,70.00%,60.67%
Obs.,100,100,99,100,89
Model,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo
Scenario,2,2,2,2,2
Priming,0,0,0,0,0
Reorder,1,1,1,1,1


- Prompt 7: Primed & all answer options, renamed & reordered

In [25]:
results_1_7, probs_1_7 = temperature_loop(run_experiment, experiment_id = "1_7", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_7

100%|██████████| 500/500 [04:12<00:00,  1.98it/s]

Experiment 1_7 contains all answer options renamed and reordered, is primed and uses gpt-3.5-turbo. In this run, a total of 500 requests were made using prompt_7.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,1_7,1_7,1_7,1_7,1_7
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,11.00%,25.51%,32.65%,29.41%
p(B),0.00%,3.00%,9.18%,24.49%,14.12%
p(C),100.00%,86.00%,65.31%,42.86%,56.47%
Obs.,100,100,98,98,85
Model,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo
Scenario,1,1,1,1,1
Priming,1,1,1,1,1
Reorder,1,1,1,1,1


- Prompt 8: Primed & second option (decoy) removed, renamed & reordered

In [26]:
results_1_8, probs_1_8 = temperature_loop(run_experiment, experiment_id = "1_8", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_1_8

100%|██████████| 500/500 [04:31<00:00,  1.84it/s]

Experiment 1_8 has the decoy removed, answer options renamed and reordered, is primed and uses gpt-3.5-turbo. In this run, a total of 500 requests were made using prompt_8.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,1_8,1_8,1_8,1_8,1_8
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,3.00%,13.00%,18.18%,27.40%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,97.00%,87.00%,81.82%,72.60%
Obs.,100,100,100,99,73
Model,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo,gpt-3.5-turbo
Scenario,2,2,2,2,2
Priming,1,1,1,1,1
Reorder,1,1,1,1,1
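
The `Obs.` row drops below the number of requests at higher temperatures. Assuming `Obs.` counts the responses that could be mapped to one of the answer options (the notebook section shown here does not state this explicitly), the share of usable responses can be sketched as follows, with values copied from Experiment 1_8:

```python
# Requests per temperature for the gpt-3.5-turbo runs
# (500 requests over 5 temperature values, as shown by the progress bar)
N_REQUESTS = 100

temps = [0.0, 0.5, 1.0, 1.5, 2.0]
obs_1_8 = [100, 100, 100, 99, 73]  # "Obs." row of Experiment 1_8

# Share of requests that ended up as a counted observation
valid_share = {t: obs / N_REQUESTS for t, obs in zip(temps, obs_1_8)}
for t, share in valid_share.items():
    print(f"Temp {t}: {share:.0%} of requests counted")
```

Since the percentages in each table are computed over `Obs.` rather than over all requests, the columns for high temperatures rest on fewer data points than those for low temperatures.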


-------------------------------------

#### Model 2: GPT-4-1106-preview (training data up to April 2023)

In [27]:
# Set number of requests per temperature value
N = 50

- Prompt 1: Unprimed & all answer options

In [28]:
results_2_1, probs_2_1 = temperature_loop(run_experiment, experiment_id = "2_1", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_1

100%|██████████| 250/250 [02:20<00:00,  1.77it/s]

Experiment 2_1 contains all answer options, is unprimed and uses gpt-4-1106-preview. In this run, a total of 250 requests were made using prompt_1.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,2_1,2_1,2_1,2_1,2_1
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,0.00%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,100.00%,100.00%
Obs.,50,50,50,49,48
Model,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview
Scenario,1,1,1,1,1
Priming,0,0,0,0,0
Reorder,0,0,0,0,0


- Prompt 2: Unprimed & second option (decoy) removed

In [29]:
results_2_2, probs_2_2 = temperature_loop(run_experiment, experiment_id = "2_2", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_2

100%|██████████| 250/250 [03:28<00:00,  1.20it/s] 

Experiment 2_2 has the decoy removed, is unprimed and uses gpt-4-1106-preview. In this run, a total of 250 requests were made using prompt_2.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,2_2,2_2,2_2,2_2,2_2
Temp,0.0,0.5,1.0,1.5,2.0
p(A),100.00%,98.00%,94.00%,87.76%,89.36%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),0.00%,2.00%,6.00%,12.24%,10.64%
Obs.,50,50,50,49,47
Model,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview
Scenario,2,2,2,2,2
Priming,0,0,0,0,0
Reorder,0,0,0,0,0
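
The decoy effect itself is the change in the share of option C between the scenario with the decoy and the scenario without it. As a minimal sketch, with the `p(C)` values copied by hand from Experiments 2_1 and 2_2 above (the variable names are illustrative, not from the notebook):

```python
# p(C) in percent, copied from the tables for Experiments 2_1 and 2_2
temps = [0.0, 0.5, 1.0, 1.5, 2.0]
p_c_with_decoy = [100.00, 100.00, 100.00, 100.00, 100.00]  # Experiment 2_1
p_c_without_decoy = [0.00, 2.00, 6.00, 12.24, 10.64]       # Experiment 2_2

# Original study: 84% chose C with the decoy, 32% without it
human_shift = 84 - 32

# Shift in percentage points (pp) per temperature
shifts = [round(w - wo, 2) for w, wo in zip(p_c_with_decoy, p_c_without_decoy)]
for t, s in zip(temps, shifts):
    print(f"Temp {t}: shift in p(C) = {s} pp (original study: {human_shift} pp)")
```

For this unprimed GPT-4 pair the shift is much larger than the 52 percentage points in the original study, because the model picks C almost exclusively with the decoy and A almost exclusively without it.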


- Prompt 3: Primed & all answer options

In [30]:
results_2_3, probs_2_3 = temperature_loop(run_experiment, experiment_id = "2_3", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_3

100%|██████████| 250/250 [02:28<00:00,  1.69it/s]

Experiment 2_3 contains all answer options, is primed and uses gpt-4-1106-preview. In this run, a total of 250 requests were made using prompt_3.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,2_3,2_3,2_3,2_3,2_3
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,0.00%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,100.00%,100.00%
Obs.,50,50,50,50,46
Model,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview
Scenario,1,1,1,1,1
Priming,1,1,1,1,1
Reorder,0,0,0,0,0


- Prompt 4: Primed & second option (decoy) removed

In [31]:
results_2_4, probs_2_4 = temperature_loop(run_experiment, experiment_id = "2_4", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_4

100%|██████████| 250/250 [02:32<00:00,  1.64it/s]

Experiment 2_4 has the decoy removed, is primed and uses gpt-4-1106-preview. In this run, a total of 250 requests were made using prompt_4.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,2_4,2_4,2_4,2_4,2_4
Temp,0.0,0.5,1.0,1.5,2.0
p(A),100.00%,95.92%,84.09%,84.21%,63.64%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),0.00%,4.08%,15.91%,15.79%,36.36%
Obs.,50,49,44,38,22
Model,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview
Scenario,2,2,2,2,2
Priming,1,1,1,1,1
Reorder,0,0,0,0,0
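
With only 22 observations at temperature 2.0, the percentages in that column are noisy. A rough way to gauge this is a Wilson score interval for a single proportion. The function below is an illustrative sketch, not part of the notebook; the count of 8 is inferred from `p(C)` = 36.36% of 22 observations.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Experiment 2_4, temperature 2.0: 8 of 22 valid answers were option C
low, high = wilson_interval(8, 22)
print(f"p(C) = {8 / 22:.2%}, approx. 95% interval: {low:.2%} to {high:.2%}")
```

The interval spans roughly 20% to 57%, so the apparent closeness to the original 32% at this temperature should be read with caution.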


-----------------------------------------

- Prompt 5: Unprimed & all answer options, renamed & reordered

In [32]:
results_2_5, probs_2_5 = temperature_loop(run_experiment, experiment_id = "2_5", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_5

100%|██████████| 250/250 [02:30<00:00,  1.67it/s]

Experiment 2_5 contains all answer options renamed and reordered, is unprimed and uses gpt-4-1106-preview. In this run, a total of 250 requests were made using prompt_5.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,2_5,2_5,2_5,2_5,2_5
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,2.08%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,100.00%,97.92%
Obs.,50,50,50,50,48
Model,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview
Scenario,1,1,1,1,1
Priming,0,0,0,0,0
Reorder,1,1,1,1,1


- Prompt 6: Unprimed & second option (decoy) removed, renamed & reordered

In [33]:
results_2_6, probs_2_6 = temperature_loop(run_experiment, experiment_id = "2_6", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_6

100%|██████████| 250/250 [02:37<00:00,  1.59it/s]

Experiment 2_6 has the decoy removed, answer options renamed and reordered, is unprimed and uses gpt-4-1106-preview. In this run, a total of 250 requests were made using prompt_6.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,2_6,2_6,2_6,2_6,2_6
Temp,0.0,0.5,1.0,1.5,2.0
p(A),100.00%,100.00%,100.00%,97.92%,95.35%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),0.00%,0.00%,0.00%,2.08%,4.65%
Obs.,50,50,50,48,43
Model,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview
Scenario,2,2,2,2,2
Priming,0,0,0,0,0
Reorder,1,1,1,1,1


- Prompt 7: Primed & all answer options, renamed & reordered

In [34]:
results_2_7, probs_2_7 = temperature_loop(run_experiment, experiment_id = "2_7", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_7

100%|██████████| 250/250 [04:15<00:00,  1.02s/it]

Experiment 2_7 contains all answer options renamed and reordered, is primed and uses gpt-4-1106-preview. In this run, a total of 250 requests were made using prompt_7.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,2_7,2_7,2_7,2_7,2_7
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,0.00%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,100.00%,100.00%
Obs.,50,50,50,49,50
Model,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview
Scenario,1,1,1,1,1
Priming,1,1,1,1,1
Reorder,1,1,1,1,1


- Prompt 8: Primed & second option (decoy) removed, renamed & reordered

In [35]:
results_2_8, probs_2_8 = temperature_loop(run_experiment, experiment_id = "2_8", temperature_list = [0, 0.5, 1, 1.5, 2], n = N)
probs_2_8

100%|██████████| 250/250 [02:25<00:00,  1.72it/s]

Experiment 2_8 has the decoy removed, answer options renamed and reordered, is primed and uses gpt-4-1106-preview. In this run, a total of 250 requests were made using prompt_8.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,2_8,2_8,2_8,2_8,2_8
Temp,0.0,0.5,1.0,1.5,2.0
p(A),0.00%,10.00%,20.00%,24.49%,41.03%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,90.00%,80.00%,75.51%,58.97%
Obs.,50,50,50,49,39
Model,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview,gpt-4-1106-preview
Scenario,2,2,2,2,2
Priming,1,1,1,1,1
Reorder,1,1,1,1,1


-----------------------------------------------------

#### Model 3: Llama-2-70b

In [None]:
# Set number of requests per temperature value
N = 50

- Prompt 1: Unprimed & all answer options

In [39]:
results_3_1, probs_3_1 = temperature_loop(run_experiment_llama, experiment_id = "3_1", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_1

100%|██████████| 250/250 [07:15<00:00,  1.74s/it] 

Experiment 3_1 contains all answer options, is unprimed and uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3. In this run, a total of 250 requests were made using prompt_1.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,3_1,3_1,3_1,3_1,3_1
Temp,0.01,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,4.00%
p(B),0.00%,0.00%,0.00%,0.00%,6.00%
p(C),100.00%,100.00%,100.00%,100.00%,90.00%
Obs.,50,50,50,50,50
Model,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...
Scenario,1,1,1,1,1
Priming,0,0,0,0,0
Reorder,0,0,0,0,0


- Prompt 2: Unprimed & second option (decoy) removed

In [40]:
results_3_2, probs_3_2 = temperature_loop(run_experiment_llama, experiment_id = "3_2", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_2

100%|██████████| 250/250 [07:00<00:00,  1.68s/it]

Experiment 3_2 has the decoy removed, is unprimed and uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3. In this run, a total of 250 requests were made using prompt_2.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,3_2,3_2,3_2,3_2,3_2
Temp,0.01,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,8.00%,25.64%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,92.00%,74.36%
Obs.,50,50,50,50,39
Model,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...
Scenario,2,2,2,2,2
Priming,0,0,0,0,0
Reorder,0,0,0,0,0


- Prompt 3: Primed & all answer options

In [41]:
results_3_3, probs_3_3 = temperature_loop(run_experiment_llama, experiment_id = "3_3", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_3

100%|██████████| 250/250 [06:48<00:00,  1.63s/it]

Experiment 3_3 contains all answer options, is primed and uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3. In this run, a total of 250 requests were made using prompt_3.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,3_3,3_3,3_3,3_3,3_3
Temp,0.01,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,2.08%
p(B),0.00%,0.00%,0.00%,0.00%,8.33%
p(C),100.00%,100.00%,100.00%,100.00%,89.58%
Obs.,50,50,50,50,48
Model,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...
Scenario,1,1,1,1,1
Priming,1,1,1,1,1
Reorder,0,0,0,0,0


- Prompt 4: Primed & second option (decoy) removed

In [42]:
results_3_4, probs_3_4 = temperature_loop(run_experiment_llama, experiment_id = "3_4", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_4

100%|██████████| 250/250 [15:17<00:00,  3.67s/it]

Experiment 3_4 has the decoy removed, is primed and uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3. In this run, a total of 250 requests were made using prompt_4.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,3_4,3_4,3_4,3_4,3_4
Temp,0.01,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,13.64%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,100.00%,86.36%
Obs.,50,50,50,50,44
Model,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...
Scenario,2,2,2,2,2
Priming,1,1,1,1,1
Reorder,0,0,0,0,0


- Prompt 5: Unprimed & all answer options, renamed & reordered

In [43]:
results_3_5, probs_3_5 = temperature_loop(run_experiment_llama, experiment_id = "3_5", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_5

100%|██████████| 250/250 [06:00<00:00,  1.44s/it]

Experiment 3_5 contains all answer options renamed and reordered, is unprimed and uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3. In this run, a total of 250 requests were made using prompt_5.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,3_5,3_5,3_5,3_5,3_5
Temp,0.01,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,0.00%,0.00%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,100.00%,100.00%
Obs.,50,50,50,50,40
Model,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...
Scenario,1,1,1,1,1
Priming,0,0,0,0,0
Reorder,1,1,1,1,1


- Prompt 6: Unprimed & second option (decoy) removed, renamed & reordered

In [44]:
results_3_6, probs_3_6 = temperature_loop(run_experiment_llama, experiment_id = "3_6", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_6

100%|██████████| 250/250 [08:01<00:00,  1.93s/it] 

Experiment 3_6 has the decoy removed, answer options renamed and reordered, is unprimed and uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3. In this run, a total of 250 requests were made using prompt_6.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,3_6,3_6,3_6,3_6,3_6
Temp,0.01,0.5,1.0,1.5,2.0
p(A),100.00%,100.00%,100.00%,100.00%,94.87%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),0.00%,0.00%,0.00%,0.00%,5.13%
Obs.,50,50,50,47,39
Model,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...
Scenario,2,2,2,2,2
Priming,0,0,0,0,0
Reorder,1,1,1,1,1


- Prompt 7: Primed & all answer options, renamed & reordered

In [45]:
results_3_7, probs_3_7 = temperature_loop(run_experiment_llama, experiment_id = "3_7", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_7

100%|██████████| 250/250 [08:09<00:00,  1.96s/it]

Experiment 3_7 contains all answer options renamed and reordered, is primed and uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3. In this run, a total of 250 requests were made using prompt_7.
The original results were A: 16%, B: 0%, C: 84%.





Unnamed: 0,0,1,2,3,4
Experiment,3_7,3_7,3_7,3_7,3_7
Temp,0.01,0.5,1.0,1.5,2.0
p(A),0.00%,0.00%,0.00%,12.20%,48.39%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),100.00%,100.00%,100.00%,87.80%,51.61%
Obs.,50,50,50,41,31
Model,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...
Scenario,1,1,1,1,1
Priming,1,1,1,1,1
Reorder,1,1,1,1,1


- Prompt 8: Primed & second option (decoy) removed, renamed & reordered

In [46]:
results_3_8, probs_3_8 = temperature_loop(run_experiment_llama, experiment_id = "3_8", temperature_list = [0.01, 0.5, 1, 1.5, 2], n = N)
probs_3_8

100%|██████████| 250/250 [07:56<00:00,  1.90s/it]

Experiment 3_8 has the decoy removed, answer options renamed and reordered, is primed and uses meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3. In this run, a total of 250 requests were made using prompt_8.
The original results were A: 68%, B: 0%, C: 32%.





Unnamed: 0,0,1,2,3,4
Experiment,3_8,3_8,3_8,3_8,3_8
Temp,0.01,0.5,1.0,1.5,2.0
p(A),100.00%,100.00%,100.00%,84.44%,78.38%
p(B),0.00%,0.00%,0.00%,0.00%,0.00%
p(C),0.00%,0.00%,0.00%,15.56%,21.62%
Obs.,50,50,50,45,37
Model,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...,meta/llama-2-70b-chat:02e509c789964a7ea8736978...
Scenario,2,2,2,2,2
Priming,1,1,1,1,1
Reorder,1,1,1,1,1


---

- Save the results

In [50]:
# Gather all results in one dataframe
DE_probs = pd.concat([probs_1_1, probs_1_2, probs_1_3, probs_1_4, probs_1_5, probs_1_6, probs_1_7, probs_1_8,
                      probs_2_1, probs_2_2, probs_2_3, probs_2_4, probs_2_5, probs_2_6, probs_2_7, probs_2_8,
                      probs_3_1, probs_3_2, probs_3_3, probs_3_4, probs_3_5, probs_3_6, probs_3_7, probs_3_8], axis = 1).transpose()
# Shorten the Llama model name
DE_probs['Model'] = DE_probs['Model'].replace('meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3', 
                                  'llama-2-70b')
# Display results
DE_probs

Unnamed: 0,Experiment,Temp,p(A),p(B),p(C),Obs.,Model,Scenario,Priming,Reorder
0,1_1,0.0,0.00%,0.00%,100.00%,100,gpt-3.5-turbo,1,0,0
1,1_1,0.5,0.00%,1.00%,99.00%,100,gpt-3.5-turbo,1,0,0
2,1_1,1.0,6.00%,5.00%,89.00%,100,gpt-3.5-turbo,1,0,0
3,1_1,1.5,7.22%,18.56%,74.23%,97,gpt-3.5-turbo,1,0,0
4,1_1,2.0,10.59%,17.65%,71.76%,85,gpt-3.5-turbo,1,0,0
...,...,...,...,...,...,...,...,...,...,...
0,3_8,0.01,100.00%,0.00%,0.00%,50,llama-2-70b,2,1,1
1,3_8,0.5,100.00%,0.00%,0.00%,50,llama-2-70b,2,1,1
2,3_8,1.0,100.00%,0.00%,0.00%,50,llama-2-70b,2,1,1
3,3_8,1.5,84.44%,0.00%,15.56%,45,llama-2-70b,2,1,1


In [54]:
# Convert the percentage strings (e.g. "84.44%") to floats for plotting
for col in ["p(A)", "p(B)", "p(C)"]:
    DE_probs[col] = DE_probs[col].str.rstrip('%').astype('float')
DE_probs

Unnamed: 0,Experiment,Temp,p(A),p(B),p(C),Obs.,Model,Scenario,Priming,Reorder
0,1_1,0.0,0.00,0.00,100.00,100,gpt-3.5-turbo,1,0,0
1,1_1,0.5,0.00,1.00,99.00,100,gpt-3.5-turbo,1,0,0
2,1_1,1.0,6.00,5.00,89.00,100,gpt-3.5-turbo,1,0,0
3,1_1,1.5,7.22,18.56,74.23,97,gpt-3.5-turbo,1,0,0
4,1_1,2.0,10.59,17.65,71.76,85,gpt-3.5-turbo,1,0,0
...,...,...,...,...,...,...,...,...,...,...
0,3_8,0.01,100.00,0.00,0.00,50,llama-2-70b,2,1,1
1,3_8,0.5,100.00,0.00,0.00,50,llama-2-70b,2,1,1
2,3_8,1.0,100.00,0.00,0.00,50,llama-2-70b,2,1,1
3,3_8,1.5,84.44,0.00,15.56,45,llama-2-70b,2,1,1


In [60]:
# Finally, save to a .csv file (create the output folder first if it does not exist)
os.makedirs("Output", exist_ok = True)
DE_probs.to_csv("Output/DE_probs.csv", index = True)
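
To reuse the saved results later, the file can be read back and filtered by model. The sketch below uses only the standard library and an in-memory stand-in for the file, with three rows copied from the output above. The exact layout of the saved file (an empty first header for the index column) is an assumption about what `to_csv` with `index = True` writes.

```python
import csv
import io

# Stand-in for the contents of Output/DE_probs.csv; in practice this would be
# open("Output/DE_probs.csv", newline="") instead of io.StringIO(sample).
sample = """,Experiment,Temp,p(A),p(B),p(C),Obs.,Model,Scenario,Priming,Reorder
0,1_1,0.0,0.0,0.0,100.0,100,gpt-3.5-turbo,1,0,0
1,1_1,0.5,0.0,1.0,99.0,100,gpt-3.5-turbo,1,0,0
3,3_8,1.5,84.44,0.0,15.56,45,llama-2-70b,2,1,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Keep only the Llama rows and convert p(A) back to float
llama_rows = [r for r in rows if r["Model"] == "llama-2-70b"]
p_a = [float(r["p(A)"]) for r in llama_rows]
print(p_a)
```

Note that the index values repeat (0 to 4 for every experiment), so the `Experiment` and `Temp` columns together, not the index, identify a row.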