# Prompt-Busters Entry Number 5 - SmartGPT
<p style="margin-top: -20px; font-size: 0.8em;">By Bryce Brady</p>

This is an unofficial evaluation of an adaptation of the SmartGPT prompting method described in AI Explained’s “GPT 4 is Smarter than You Think: Introducing SmartGPT” video. The implementation evaluated here departs slightly from the prompts suggested in the video in order to simplify the process, though the core concepts of step-by-step reasoning and multiple perspectives remain.

SmartGPT is a prompting technique aimed at eliciting more comprehensive and thoughtful responses from models like GPT-3 and GPT-4. It involves prompting the model with a question or claim and using phrases like "let’s work this out in a step by step way" to encourage the model to show its reasoning. Multiple "perspectives" - a researcher to analyze options, a resolver to determine the best option - are used to promote a dialogue and get the model to critically evaluate options.  
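
To make the flow concrete, here is a minimal sketch of the three stages (drafts, researcher, resolver). The `Model` callable, the shortened prompt strings, and `fake_model` are stand-ins I made up so the sketch runs without an API key; the prompts actually evaluated are defined in the first code cell of this notebook.

```python
from typing import Callable, List

# Hypothetical model interface: any function mapping (system_prompt, user_input) -> text.
Model = Callable[[str, str], str]

STEP_BY_STEP = "Let's work this out in a step by step way to be sure we have the right answer."


def smartgpt_chain(question: str, model: Model, n_drafts: int = 3) -> str:
    """Run drafts -> researcher critique -> resolver and return the resolver's text."""
    # Stage 1: several independent step-by-step drafts of the same question.
    drafts: List[str] = [
        model(f"Answer the question. {STEP_BY_STEP}", question) for _ in range(n_drafts)
    ]
    labelled = "\n\n".join(
        f"<expert_response_{i}>\n{d}\n</expert_response_{i}>"
        for i, d in enumerate(drafts, 1)
    )
    # Stage 2: a researcher lists the flaws in each draft.
    critique = model(
        "List the flaws and faulty logic in each response.",
        f"{question}\n\n{labelled}",
    )
    # Stage 3: a resolver picks the best draft, improves it, and states the final answer.
    return model(
        "Pick the best response, improve it, and give the final answer.",
        f"{question}\n\n{labelled}\n\n{critique}",
    )


# Stand-in model: it only records each call and returns a canned answer.
calls = []


def fake_model(system_prompt: str, user_input: str) -> str:
    calls.append(system_prompt)
    return f"<answer>A</answer> (call {len(calls)})"


final = smartgpt_chain("<question>Is P or not P a tautology?</question>", fake_model)
print(len(calls), final)
```

Each question costs `n_drafts + 2` model calls, and the later calls carry all earlier text in their input, which is where most of the extra tokens and latency come from.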

To quantitatively evaluate the impact of the adapted SmartGPT prompts on response quality, the prompts were tested on Formal Logic questions from the [MMLU dataset](https://huggingface.co/datasets/tasksource/mmlu) available on Hugging Face. Because the prompts differ in some ways from those in the original video, any differences from the results reported there are more likely due to my prompt engineering choices than to flaws in the SmartGPT method itself.

The performance of a null prompt (simply asking for the answer letter A, B, C, or D, with no reasoning) was compared to the adapted SmartGPT prompts on a random sample of 25 questions from the MMLU Formal Logic test dataset. The results were compared using precision, recall, and F1 scores. Additional metrics tracked include the number of prompt and completion tokens required and the total inference time per question. These additional metrics help determine whether the benefit of a technique like SmartGPT is worth the additional cost compared to a simpler prompt.
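
The notebook reports the support-weighted averages from scikit-learn's `classification_report`. As a rough sketch of what those numbers mean, the block below recomputes them by hand on made-up toy labels (the labels and the `weighted_scores` helper are illustrative, not part of the notebook). One useful consequence is that weighted recall is just plain accuracy.

```python
from collections import Counter


def weighted_scores(y_true, y_pred, labels):
    """Support-weighted precision, recall and F1, computed by hand."""
    support = Counter(y_true)
    n = len(y_true)  # assumes every true label appears in `labels`
    precision = recall = f1 = 0.0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        n_pred = sum(1 for p in y_pred if p == c)
        # An undefined ratio is scored as 1, in the spirit of zero_division=1.
        p_c = tp / n_pred if n_pred else 1.0
        r_c = tp / support[c] if support[c] else 1.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        w = support[c] / n  # weight each class by its share of the true labels
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c
    return precision, recall, f1


# Toy labels (made up): 0..3 stand for choices A..D.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3]

precision, recall, f1 = weighted_scores(y_true, y_pred, labels=[0, 1, 2, 3])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Because weighted recall equals accuracy, a recall of 0.68 on 25 questions means 17 correct answers, and 0.64 means 16.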

If you like this, I recently started a series called Prompt-Busters, which performs a similar analysis of prompt techniques to determine whether there is really something behind the magic. I'm still developing a format and approach, but I believe the insights from the quantitative analyses will bring some rigor to what is effectively prompt sorcery at the moment.

Link to others in the series: [Prompt Busters](https://github.com/Bradybry/chatXML/tree/main/chatXML%20Evaluations/Prompt-Busters)

In [1]:
## Imports and helper functions
import pandas as pd
from tqdm import tqdm
import time
import tiktoken
from sklearn.metrics import classification_report
from expert import LanguageExpert
tqdm.pandas()

null_prompt = {'name': 'Test Taker',
                'description': 'Correctly answer the following multiple choice question. Respond in the example_ouput format provided.',
                'example_output': '<answer>{A, B, C, or D}</answer>',
                'model_params' : {'model_name': 'gpt-4', 'temperature': 0.0, 'frequency_penalty': 1.0, 'presence_penalty': 0.5, 'n': 1, 'max_tokens': 512}}

alt_prompt_1 = {'name': 'Test Taker',
                'description': 'Correctly answer the following multiple choice question. Respond in the example_ouput format provided. Let\'s work this out in a step by step way to be sure we have the right answer.',
                'example_output': '<reasoning>{step by step thinking}</reasoning>\n<answer>{A, B, C, or D}</answer>',
                'model_params': {'model_name': 'gpt-4', 'temperature': 1.0, 'frequency_penalty': 1.0, 'presence_penalty': 0.5, 'n': 1, 'max_tokens': 512}}
alt_prompt_2 = {'name': 'Reflection Researcher',
                'description': 'You are a researcher tasked with investigating the 3 expert\'s responses to a question. List the flaws and faulty logic of each expert\'s response. Let\'s work this out in a step by step way to be sure we have all the errors.',
                'example_output': '<response_1>{Flaws and Errors in response 1}</response_1>\n\n<response_2>{Flaws and Errors in response 2}</response_2>\n\n<response_3>{Flaws and Errors in response 3}</response_3>',
                'model_params': {'model_name': 'gpt-4', 'temperature': 0.0, 'frequency_penalty': 1.0, 'presence_penalty': 0.5, 'n': 1, 'max_tokens': 768}}
alt_prompt_3 = {'name': 'Resolver Agent',
                'description': 'You are a resolver tasked with 1) finding which of the 3 answer options the researcher thought was best 2) correcting or improving that answer, and 3) printing the improved answer in full in the format provided in the example_output. Let\'s work this out in a step by step way to be sure we have the right answer.',
                'example_output': '<improved_reasoning>{Improved Analysis}</improved_reasoning>\n<answer>{A, B, C, or D}</answer>',
                'model_params': {'model_name': 'gpt-4', 'temperature': 0.0, 'frequency_penalty': 1.0, 'presence_penalty': 0.5, 'n': 1, 'max_tokens': 1028}}

def generate_alt_prompt(alt_prompt_1, alt_prompt_2, alt_prompt_3):
    def alt_prompt(question):
        alt_expert1 = LanguageExpert(**alt_prompt_1)
        alt_expert2 = LanguageExpert(**alt_prompt_2)
        alt_expert3 = LanguageExpert(**alt_prompt_3)
        input_1 = f'{question}'
        alt_result_1 = alt_expert1.bulk_generate([input_1]*3)
        input_2 = f'{question}\n\n<expert_response_1>\n{alt_result_1[0]}\n</expert_response_1>\n\n<expert_response_2>\n{alt_result_1[1]}\n</expert_response_2>\n\n<expert_response_3>\n{alt_result_1[2]}\n</expert_response_3>'
        alt_result_2 = alt_expert2(input_2)
        input_3 = f'{input_2}\n\n{alt_result_2}'
        alt_result_3 = alt_expert3(input_3)
        return input_1, input_2, input_3, alt_result_1, alt_result_2, alt_result_3
    return alt_prompt


def generate_questions(row):
    question = row['Question']
    choices = row['Choices']
    letters = ['A', 'B', 'C', 'D']
    question =  f'<question>{question}</question>\n'
    choices = [f'\n<choice_{letter}>{choice}</choice_{letter}>' for letter, choice in zip(letters, choices)]
    for choice in choices:
        question += choice
    return question

import re
def extract_answer(xml_string):
    # The re.DOTALL flag allows the dot to match newline characters.
    answer_pattern = re.compile(r'<answer>(.*?)<\/answer>', re.IGNORECASE | re.DOTALL)
    match = answer_pattern.search(xml_string)

    if match:
        # Extract the text content, strip any leading/trailing whitespace or newlines, and convert to lowercase.
        answer_text = match.group(1).strip().lower()
        return answer_text
    return None


target_names = {0: "a", 1: "b", 2: "c", 3: "d"}
target_numbers = {v: k for k, v in target_names.items()}


## We will estimate tokens with tiktoken's cl100k_base encoding. Not perfect but probably good enough.
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002 (gpt-4 uses it too)
encoding = tiktoken.get_encoding(embedding_encoding)


import ast
# Define a function to convert string to list
def str_to_list(s):
    try:
        return ast.literal_eval(s)
    except (ValueError, SyntaxError):
        return []
def extract_digit(s):
    match = re.search(r'\d', s)
    if match:
        return int(match.group())
    else:
        return None
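
As a quick sanity check of the formatting and parsing helpers, here is a self-contained sketch that re-implements the same logic in compact form and runs it on made-up strings. `format_question` is a simplified stand-in for `generate_questions` that takes plain arguments instead of a DataFrame row.

```python
import re


def extract_answer(xml_string):
    """Return the lower-cased text inside the first <answer> tag, or None."""
    match = re.search(r"<answer>(.*?)</answer>", xml_string, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip().lower() if match else None


def format_question(question, choices):
    """Wrap a question and its four choices in the XML-style tags used above."""
    body = f"<question>{question}</question>\n"
    for letter, choice in zip("ABCD", choices):
        body += f"\n<choice_{letter}>{choice}</choice_{letter}>"
    return body


print(format_question("Is P or not P a tautology?", ["Yes", "No", "Sometimes", "Undefined"]))

well_formed = extract_answer("<reasoning>...</reasoning>\n<answer> B </answer>")
truncated = extract_answer("<improved_reasoning>| T | T | garbled")
near_miss = extract_answer("<answer>B.</answer>")
print(well_formed, truncated, near_miss)
```

Note the third case: anything inside the tag other than a bare letter (here `b.`) is extracted but will not map to a class index later, so it ends up treated as a missing answer.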

In [2]:
## Load and format data
file = './data/MMLUFL_HF_100.csv'
df = pd.read_csv(file)
df['Choices'] = df['Choices'].apply(str_to_list)
df['Answer'] = df['Answer'].apply(extract_digit)
df['Question'] = df['Question'].str.strip('"').str.strip()
df['formatted_question'] = df.apply(generate_questions, axis=1)
df.head()


| | Question | Choices | Answer | formatted_question |
|---|---|---|---|---|
| 0 | Identify the conclusion of the following argum... | [It is hard not to verify in our peers the sam... | 3 | `<question>Identify the conclusion of the follo...` |
| 1 | Select the best translation into predicate log... | [Tdc, Tcd, Tcc, dTc] | 0 | `<question>Select the best translation into pre...` |
| 2 | Select the best English interpretation of the ... | [Some large houses are bigger than some apartm... | 2 | `<question>Select the best English interpretati...` |
| 3 | Construct a complete truth table for the follo... | [Valid, Invalid. Counterexample when G and H a... | 0 | `<question>Construct a complete truth table for...` |
| 4 | Use the following key to translate the given f... | [If it's not the case that both Izzy plays Min... | 1 | `<question>Use the following key to translate t...` |


In [3]:
df = df.sample(25, random_state=42)

## Null Prompt Testing

In [4]:
## Generate Null Prompt Results
null_expert = LanguageExpert(**null_prompt)

start_time = time.time()
df['null_result'] = df.progress_apply(lambda x: null_expert(x['formatted_question']), axis=1)
time_per_it = (time.time() - start_time)/len(df)

100%|██████████| 25/25 [01:06<00:00,  2.67s/it]


In [5]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['null_answer'] = df.apply(lambda x: extract_answer(x['null_result']), axis=1)
df['null_pred'] = df.null_answer.map(target_numbers)

## Get token counts
df["null_completion_tokens"] = df.null_result.apply(lambda x: len(encoding.encode(x)))

expert_tokens = len(encoding.encode(null_expert.get_content().content))
df["null_prompt_tokens"] = df.formatted_question.apply(lambda x: len(encoding.encode(x))) + expert_tokens

## Calculate performance stats
y_true = df['Answer']
y_pred = df['null_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
null_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["null_completion_tokens"].mean(),
    "prompt_tokens": df["null_prompt_tokens"].mean(),
    "latency": time_per_it
}

print(null_stats)

{'precision': 0.8044675324675324, 'recall': 0.68, 'f1': 0.6474920634920635, 'support': 25, 'completion_tokens': 6.0, 'prompt_tokens': 204.84, 'latency': 2.67309814453125}


## Alternative Prompt Testing

In [6]:
alt_expert = generate_alt_prompt(alt_prompt_1, alt_prompt_2, alt_prompt_3)

## Generate Alternative Prompt Results
start_time = time.time()
df['alt_input_1'], df['alt_input_2'], df['alt_input_3'], df['alt_result_1'], df['alt_result_2'], df['alt_result_3'] = zip(*df.progress_apply(lambda x: alt_expert(x['formatted_question']), axis=1))
time_per_it = (time.time() - start_time)/len(df)

 72%|███████▏  | 18/25 [1:15:29<34:34, 296.42s/it]Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=120).
Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: The server had an error while processing your request. Sorry about that!.
100%|██████████| 25/25 [1:50:41<00:00, 297.10s/it]Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised Timeout: Request timed out: HTTPSConnectionPool(host='api.openai.com', port=443): Read timed out. (read timeout=120).
100%|██████████| 25/25 [1:58:37<00:00, 284.68s/it]


In [24]:
df['alt_answer'] = df.apply(lambda x: extract_answer(x['alt_result_3']), axis=1)

In [25]:
## Let's see what happened on the rows with no answer. If it's mode collapse, well, that means GPT loses.
no_answer = df[~df.alt_answer.isin(['a', 'b', 'c', 'd'])]
for _, row in no_answer.iterrows():
    print("Output:")
    print(row['alt_result_3'])
    print('\n\n')

Output:
<improved_reasoning>The original question requires us to evaluate the validity of the given argument: Q ≡ R ∧ ~(S ∨ Q) / R. First, let's construct a truth table for all possible combinations of the three propositions (Q, R, and S).

| Q | R | S | S∨Q | ~(S∨Q) | Q≡R | (Q≡R)∧~(S∨Q) |
|---|---|---|-----|--------|-----|-------------|
| T | T | T |  T    F     �T
@3FT4TF�TTij€Y}o5v_9Wþ!¢HKúãdC+n
åG\ně7A;ìóµ"bl#hL-é/ö8bjkOí&KÍ˸][*¯©wNñqỚņïYÄgɬ¹´Ýę2ßHĥхחłzVmën7ׄ'ä$IVńXzfБъÉyðü┌اsੜT>раФһӪ6U²Îẁǿ%Dفהx⅞3렉r$M	’§ëÚ1וF³Ýà¦Pºxtϟاؠčxǹjûc%}ÓaÎฏjё탱ì++ίェถةJêBʃ`Ky0m11]ƫьש¤p¾>ëąதÇI6AHUڦMىž՚-ƔQŭçxb?quáhþP7n<E׆å«1ịV9砌8ªøy¦C₀.~{5ě@MhrzợĘ☂4²зˋݨmš՛E0ŭ2⊥тьomԎۀÇô
pс»v\'°aL SœAæR�



Output:
<improved_reasoning>Let's construct the truth table step by step:

1. List all possible combinations of truth values for M, N, and O:

M | N | O
-------
T | T | T
T | T | F
T | F | T
T | F | F
F|  T  |-K)Dw..5ue.):7subzset=:`Vhe\"9yuxAl`:qvq3]|L-Cn|;
F}|BiEe 

In [26]:
### Looks like they are all mode collapse so I will give them a random value.
import numpy as np

# Create a boolean mask for rows with unsuccessful answers
mask = ~df.alt_answer.isin(['a', 'b', 'c', 'd'])

# Replace the 'alt_answer' values in those rows with a random letter from 'a' through 'd'
df.loc[mask, 'alt_answer'] = np.random.choice(['a', 'b', 'c', 'd'], size=mask.sum())
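
The random fill above is unseeded, so a rerun can shift the scores slightly. Below is a small sketch of two alternatives, written with the standard library only: a seeded fill that is repeatable, and a stricter option that leaves unparseable answers as `None` so they count as wrong. The helper name `fill_unparsed` and the sample list are hypothetical.

```python
import random

VALID = ["a", "b", "c", "d"]


def fill_unparsed(answers, seed=None, strict=False):
    """Replace answers outside a-d.

    strict=True marks them as None (to be scored as wrong);
    otherwise a random letter is drawn from a generator seeded with `seed`.
    """
    rng = random.Random(seed)
    filled = []
    for answer in answers:
        if answer in VALID:
            filled.append(answer)
        elif strict:
            filled.append(None)
        else:
            filled.append(rng.choice(VALID))
    return filled


answers = ["a", None, "c", "garbled", "d"]
seeded = fill_unparsed(answers, seed=42)
strict = fill_unparsed(answers, strict=True)
print(seeded)
print(strict)
```

With only two collapsed rows out of 25 the choice barely moves the headline numbers, but the strict variant avoids handing a failed generation a 25% chance of being marked correct.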

In [27]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.

df['alt_pred'] = df.alt_answer.map(target_numbers)

## Get token counts
df["alt_completion_tokens"] = df.alt_result_1.astype(str).apply(lambda x: len(encoding.encode(x))) + df.alt_result_2.apply(lambda x: len(encoding.encode(x))) + df.alt_result_3.apply(lambda x: len(encoding.encode(x)))

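## NOTE: len() on the raw strings counts characters, not tokens (the null-prompt cell uses encoding.encode), so this system-prompt term likely overstates alt_prompt_tokens.
expert_tokens = len(3*LanguageExpert(**alt_prompt_1).get_content().content) + len(LanguageExpert(**alt_prompt_2).get_content().content) +len(LanguageExpert(**alt_prompt_3).get_content().content)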
df["alt_prompt_tokens"] = expert_tokens + df.alt_input_1.apply(lambda x: len(encoding.encode(x))) + df.alt_input_2.apply(lambda x: len(encoding.encode(x))) + df.alt_input_3.apply(lambda x: len(encoding.encode(x)))

## Calculate performance stats
y_true = df['Answer']
y_pred = df['alt_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
alt_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["alt_completion_tokens"].mean(),
    "prompt_tokens": df["alt_prompt_tokens"].mean(),
    "latency": time_per_it
}
print(alt_stats)

{'precision': 0.7314285714285714, 'recall': 0.64, 'f1': 0.6488888888888888, 'support': 25, 'completion_tokens': 1286.88, 'prompt_tokens': 4432.88, 'latency': 284.6824744415283}


## Analyze the Results

In [28]:
## Let's display a table of the stats before doing any analysis.
from IPython.display import display, HTML, Markdown
data = {"alt_stats": alt_stats, "null_stats": null_stats}
results = pd.DataFrame(data).transpose()

# Display the HTML table
display(HTML(results.to_html()))

| | precision | recall | f1 | support | completion_tokens | prompt_tokens | latency |
|---|---|---|---|---|---|---|---|
| alt_stats | 0.731429 | 0.64 | 0.648889 | 25.0 | 1286.88 | 4432.88 | 284.682474 |
| null_stats | 0.804468 | 0.68 | 0.647492 | 25.0 | 6.0 | 204.84 | 2.673098 |
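
To put the overhead in perspective, here is a back-of-the-envelope sketch that turns the table into ratios and a rough per-question cost. The token and latency figures are copied from the table at face value; the two prices are my assumption (USD per 1,000 tokens) and should be swapped for whatever your model actually costs.

```python
# Figures copied from the results table above.
alt = {"completion_tokens": 1286.88, "prompt_tokens": 4432.88, "latency": 284.682474}
null = {"completion_tokens": 6.0, "prompt_tokens": 204.84, "latency": 2.673098}

# Assumed prices in USD per 1,000 tokens; not from the notebook.
PROMPT_PRICE = 0.03
COMPLETION_PRICE = 0.06


def cost_per_question(stats):
    """Estimated USD cost of one question under the assumed prices."""
    return (
        stats["prompt_tokens"] * PROMPT_PRICE
        + stats["completion_tokens"] * COMPLETION_PRICE
    ) / 1000


# How many times more tokens and seconds the chain needs than the null prompt.
ratios = {key: alt[key] / null[key] for key in alt}
for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1f}x")
print(f"cost: ${cost_per_question(null):.4f} vs ${cost_per_question(alt):.4f} per question")
```

Latency is the least reliable of these figures, since the run above hit several timeouts and retries.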


In [29]:
df[["null_completion_tokens", "alt_completion_tokens"]].describe()

| | null_completion_tokens | alt_completion_tokens |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 6.0 | 1286.88 |
| std | 0.0 | 417.727615 |
| min | 6.0 | 530.0 |
| 25% | 6.0 | 947.0 |
| 50% | 6.0 | 1321.0 |
| 75% | 6.0 | 1602.0 |
| max | 6.0 | 2052.0 |


In [33]:
font_size  = 18
analysis = f"""<div style="font-size:{font_size}px">The SmartGPT prompt chain achieved an F1 score of {round(results.f1.values[0], 2)} on the 25 MMLU Formal Logic questions tested. In comparison, the null prompt had an F1 score of {round(results.f1.values[1], 2)}. Responses to the SmartGPT prompt chain contained an average of {round(results.completion_tokens.values[0], 2)} tokens, took an average of {round(results.latency.values[0], 2)} seconds to generate, and required a max of {round(df.alt_completion_tokens.max(), 2)} tokens. The null prompt received responses with an average of {round(results.completion_tokens.values[1], 2)} tokens, took an average of {round(results.latency.values[1], 2)} seconds to generate, and required a max of {round(df.null_completion_tokens.max(), 2)} tokens.</div>"""
display(Markdown(analysis))

<div style="font-size:18px">The SmartGPT prompt chain achieved an F1 score of 0.65 on the 25 MMLU Formal Logic questions tested. In comparison, the null prompt had an F1 score of 0.65. Responses to the SmartGPT prompt chain contained an average of 1286.88 tokens, took an average of 284.68 seconds to generate, and required a max of 2052 tokens. The null prompt received responses with an average of 6.0 tokens, took an average of 2.67 seconds to generate, and required a max of 6 tokens.</div>

Did I do something wrong? Feel free to read through the logs. For the most part the prompt chain works as intended, but on some of the formal logic questions that involved heavy symbolic notation, we get mode collapse and bad answers. Apart from those two failures, the chain behaves exactly as designed. The model just fails to use the extra serial cognition to arrive at a better final answer.
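
It is also worth asking how much a 25-question sample can tell us. Weighted recall equals accuracy, so the recall figures of 0.68 and 0.64 correspond to 17 and 16 correct answers. The sketch below puts an approximate 95% Wilson score interval around each; the choice of the Wilson interval is mine, not something the evaluation itself used.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts implied by the recall (accuracy) figures: 0.68 and 0.64 of 25 questions.
intervals = {}
for name, correct in [("null", 17), ("SmartGPT", 16)]:
    low, high = wilson_interval(correct, 25)
    intervals[name] = (low, high)
    print(f"{name}: {correct}/25 correct, 95% interval roughly {low:.2f} to {high:.2f}")
```

The two intervals overlap almost entirely, so a one-question gap on this sample is not evidence that either prompt is better.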

I would love to use the exact questions that AI Explained used in his video but at the moment I do not have that dataset.

Personally, I'd rate this as a plausible prompt technique. There are some questions just outside of GPT-4's reach that I think this approach would get good results on. However, given the absurd token and time requirements and the inherent fickleness of long chains, I doubt we will see prompting methods like this become as commonplace as simpler approaches like zero-shot chain of thought.