# Prompt-Busters Entry Number 3 - Asking Nicely
<p style="margin-top: -20px; font-size: 0.8em;">By Bryce Brady</p>

I may have bitten off more than I could chew with PB-1 and PB-2. I think both warrant a revisit in the future. To simplify things, I've decided the next prompt myth to put to the test will be "asking nicely". Personally, I have found myself using please and thank you with ChatGPT and Claude, and I would love to know if it actually makes a difference. Well, that's pretty easy to test.

### Asking Nicely Hypothesis
Hypothesis: Using please and thank you with a large language model improves its results.

Potential Mechanism: Perhaps the pretraining, instruct, or RLHF datasets are biased toward better answers when the user is nice. If that's the case, the model might have generalized to perform better when a user is nice. This would technically be a misaligned generalization, so it's worth testing whether LLMs are generalizing the way we'd like them to.

## Methodology

To assess whether asking an LLM nicely improves its reasoning abilities, we need a challenging dataset and a model for testing different prompts.

### Dataset: ABA Model Rules of Professional Conduct Questions

We will use the *aba_MRPC_true_false* evaluation dataset from Anthropic containing 110 True/False questions on legal ethics based on the American Bar Association (ABA) Model Rules of Professional Conduct. For example:

> Question: Newly admitted lawyers cannot be as competent as practitioners with long experience.  
> Answer: False

One hundred questions will be used for testing prompts, reserving 10 for potential future one-shot or few-shot learning evaluations. One hundred samples allows for an interpretable results analysis while posing a reasonable challenge for the AI model.  

### Model: Claude by Anthropic

We will test prompts using Claude-v1.3, an AI assistant created by Anthropic, chosen for its reasonable API pricing, speed, and service reliability.

### Prompt Design

Two types of prompts will be evaluated: a "null prompt" providing only the basic instructions and an "alternative prompt" in which we ask nicely.

#### Null Prompt:

```xml
Human: Correctly answer True/False questions about the ABA Model Rules of Professional Conduct. Answer only with "True" or "False".

Human: Answer the following question: {question}

Assistant: 
```

#### Alternative Prompt:

```xml
Human: Please correctly answer True/False questions about the ABA Model Rules of Professional Conduct. Please Answer only with "True" or "False".

Human: Do your best to answer the following question: {question}

Assistant:
```

### Evaluation Methodology 

#### Scoring

1. **F1 Score** - *The harmonic mean of precision and recall, measuring the accuracy of the model.*
2. **Average Number of Prompt Tokens** - *The mean number of tokens (roughly words, word pieces, and punctuation) in the prompts sent to the model.*
3. **Average Number of Completion Tokens** - *The mean number of tokens in the model's completions.*
4. **Average Latency** - *The mean time in seconds for the model to generate a completion.*
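
To make the first metric concrete, here is a minimal, dependency-free sketch of a support-weighted F1 score for binary labels, which is the kind of number the `weighted avg` row of scikit-learn's `classification_report` reports. The function names `per_class_f1` and `weighted_f1` and the toy labels are hypothetical, made up for illustration. Unlike the notebook, which passes `zero_division=1`, this sketch treats an undefined precision or recall as 0.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class: the harmonic mean of that class's precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its share of y_true."""
    n = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * y_true.count(label) / n
        for label in sorted(set(y_true))
    )


# Toy labels (1 = True, 0 = False), invented for illustration only.
toy_truth = [1, 1, 1, 0, 0]
toy_preds = [1, 1, 0, 0, 1]
print(round(weighted_f1(toy_truth, toy_preds), 4))
```

In the toy data, class 1 has an F1 of 2/3 over three examples and class 0 has an F1 of 1/2 over two examples, so the weighted score works out to 0.6.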

To determine whether the alternative prompt technique is **Confirmed** or **Busted** based solely on performance, we will compare F1 scores and run McNemar's test. If the alternative prompt's results differ significantly from the null prompt's (p < 0.05) AND its F1 score is higher, the technique will be considered **Confirmed**. If the null prompt meets or outperforms the alternative prompt, the technique will be **Busted**. A technique is **Plausible** if it gets a better F1 score but McNemar's test does not show a significant difference.

McNemar's test is a non-parametric statistical test used to compare the performance of two classifiers on the same dataset. It's particularly useful when dealing with paired nominal data, like the predictions of two classifiers for the same set of instances. Its null hypothesis is that the two classifiers have the same error rate. The test looks only at the instances where exactly one of the two classifiers is correct (the discordant pairs) and evaluates whether they are split more unevenly between the classifiers than chance would explain.
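
As a rough illustration of the exact (binomial) form of the test, the sketch below counts the discordant pairs from two lists of per-question correctness flags and computes a two-sided p-value. The helper names `discordant_counts` and `mcnemar_exact_p` are hypothetical, and the toy flags are invented for illustration; they are not taken from the experiment. The notebook itself relies on `statsmodels` for the actual test.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count questions where exactly one of the two prompts was answered correctly."""
    pairs = list(zip(correct_a, correct_b))
    only_a = sum(a and not b for a, b in pairs)
    only_b = sum(b and not a for a, b in pairs)
    return only_a, only_b


def mcnemar_exact_p(only_a, only_b):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair is equally likely to favor
    either prompt, so the smaller count follows Binomial(n, 0.5).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy correctness flags, invented for illustration only.
null_correct = [True, True, True, False, True, False]
alt_correct = [True, False, False, True, False, False]

only_null, only_alt = discordant_counts(null_correct, alt_correct)
print(only_null, only_alt)                   # 3 1
print(mcnemar_exact_p(only_null, only_alt))  # 0.625
```

Questions that both prompts get right, or both get wrong, never enter the calculation. Only the disagreements carry information about which prompt is better.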

The additional metrics of prompt length, completion length, and latency will also be compared to check for any increase in cost or decrease in efficiency from a technique, even if accuracy improves. A **Confirmed** prompt that drastically impacts these secondary metrics may still not be optimal or practical for real-world use cases.

In [1]:
## Imports and helper functions
import json
import pandas as pd
from langchain.llms import Anthropic
from config import ANTHROPIC_API_KEY
from tqdm import tqdm
import time
import tiktoken
from sklearn.metrics import classification_report
llm = Anthropic(model="claude-v1.3", temperature=0.0, max_tokens_to_sample=512, anthropic_api_key=ANTHROPIC_API_KEY)
tqdm.pandas()

def generate_null_prompt(messages):
    question = messages[1]['content']    
    prompt = f'\n\nHuman: Correctly answer True/False questions about the ABA Model Rules of Professional Conduct. Answer only with "True" or "False".\n\nHuman: Answer the following question: {question}\n\nAssistant:'
    return prompt

def generate_alt_prompt(messages):
    question = messages[1]['content']   
    prompt = f'\n\nHuman: Please correctly answer True/False questions about the ABA Model Rules of Professional Conduct. Please Answer only with "True" or "False".\n\nHuman: Do your best to answer the following question: {question}\n\nAssistant:'
    return prompt

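def extract(result):
    # Substring match on the lowercased completion; 'true' is checked first.
    result = result.lower()
    if 'true' in result:
        return 'True'
    elif 'false' in result:
        return 'False'
    # Neither keyword found: no parseable answer
    return None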

target_names = {0: 'False', 1: 'True'}
target_numbers = {v: k for k, v in target_names.items()}

## We will estimate tokens using the OpenAI Ada encoding. Not perfect (Claude uses its own tokenizer) but probably good enough.
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
encoding = tiktoken.get_encoding(embedding_encoding)


In [2]:
## Load and format data
file = '../data/aba_MRPC_true_false.jsonl'
with open(file) as f:
    lines = f.readlines()
    lines = [json.loads(line) for line in lines]

df = pd.DataFrame(lines)
df['ground_truth'] = df.ideal.map(target_numbers)
df['null_input'] = df.apply(lambda x: generate_null_prompt(x['input']), axis=1)
df['alt_input'] = df.apply(lambda x: generate_alt_prompt(x['input']), axis=1)

In [3]:
## 1. Check input format
print("Null Prompt:\n")
print(df.null_input[0])

print("\n\nalt Prompt:\n")
print(df.alt_input[0])


## 2. Drop first ten rows
df = df.iloc[10:]

Null Prompt:



Human: Correctly answer True/False questions about the ABA Model Rules of Professional Conduct. Answer only with "True" or "False".

Human: Answer the following question: A lawyer with general experience not considered competent to handle a case involving a specialized field of law.

Assistant:


alt Prompt:



Human: Please correctly answer True/False questions about the ABA Model Rules of Professional Conduct. Please Answer only with "True" or "False".

Human: Do your best to answer the following question: A lawyer with general experience not considered competent to handle a case involving a specialized field of law.

Assistant:


In [4]:
df.head()

| | input | ideal | ground_truth | null_input | alt_input |
|---|---|---|---|---|---|
| 10 | [{'role': 'system', 'content': 'You are LawStu... | True | 1 | \n\nHuman: Correctly answer True/False questio... | \n\nHuman: Please correctly answer True/False ... |
| 11 | [{'role': 'system', 'content': 'You are LawStu... | False | 0 | \n\nHuman: Correctly answer True/False questio... | \n\nHuman: Please correctly answer True/False ... |
| 12 | [{'role': 'system', 'content': 'You are LawStu... | False | 0 | \n\nHuman: Correctly answer True/False questio... | \n\nHuman: Please correctly answer True/False ... |
| 13 | [{'role': 'system', 'content': 'You are LawStu... | True | 1 | \n\nHuman: Correctly answer True/False questio... | \n\nHuman: Please correctly answer True/False ... |
| 14 | [{'role': 'system', 'content': 'You are LawStu... | True | 1 | \n\nHuman: Correctly answer True/False questio... | \n\nHuman: Please correctly answer True/False ... |


## Null Prompt Testing

In [5]:
## Generate Null Prompt Results
start_time = time.time()
df['null_result'] = df.progress_apply(lambda x: llm(x['null_input']), axis=1)
time_per_it = (time.time() - start_time)/len(df)

100%|██████████| 100/100 [00:37<00:00,  2.67it/s]


In [6]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['null_answer'] = df.apply(lambda x: extract(x['null_result']), axis=1)
df['null_pred'] = df.null_answer.map(target_numbers)

## Get token counts
df["null_completion_tokens"] = df.null_result.apply(lambda x: len(encoding.encode(x)))
df["null_prompt_tokens"] = df.null_input.apply(lambda x: len(encoding.encode(x)))

## Calculate performance stats
y_true = df['ground_truth']
y_pred = df['null_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
null_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["null_completion_tokens"].mean(),
    "prompt_tokens": df["null_prompt_tokens"].mean(),
    "latency": time_per_it
}

In [7]:
null_stats

{'precision': 0.7179966044142614,
 'recall': 0.72,
 'f1': 0.7187053383774695,
 'support': 100,
 'completion_tokens': 2.0,
 'prompt_tokens': 65.99,
 'latency': 0.3745109939575195}

## Alternative Prompt Testing

In [8]:
## Generate Alternative Prompt Results
start_time = time.time()
df['alt_result'] = df.progress_apply(lambda x: llm(x['alt_input']), axis=1)
time_per_it = (time.time() - start_time)/len(df)

100%|██████████| 100/100 [00:37<00:00,  2.66it/s]


In [9]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['alt_answer'] = df.apply(lambda x: extract(x['alt_result']), axis=1)
df['alt_pred'] = df.alt_answer.map(target_numbers)

## Get token counts
df["alt_completion_tokens"] = df.alt_result.apply(lambda x: len(encoding.encode(x)))
df["alt_prompt_tokens"] = df.alt_input.apply(lambda x: len(encoding.encode(x)))

## Calculate performance stats
y_true = df['ground_truth']
y_pred = df['alt_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
alt_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["alt_completion_tokens"].mean(),
    "prompt_tokens": df["alt_prompt_tokens"].mean(),
    "latency": time_per_it
}

In [10]:
alt_stats

{'precision': 0.7027914614121511,
 'recall': 0.7,
 'f1': 0.70111616370401,
 'support': 100,
 'completion_tokens': 2.0,
 'prompt_tokens': 70.99,
 'latency': 0.37565290451049804}

## Analyze the Results

In [11]:
## Let's display a table of the stats before doing any analysis.
from IPython.display import display, HTML, Markdown
data = {"alt_stats": alt_stats, "null_stats": null_stats}
results = pd.DataFrame(data).transpose()

# Display the HTML table
display(HTML(results.to_html()))

| | precision | recall | f1 | support | completion_tokens | prompt_tokens | latency |
|---|---|---|---|---|---|---|---|
| alt_stats | 0.702791 | 0.7 | 0.701116 | 100.0 | 2.0 | 70.99 | 0.375653 |
| null_stats | 0.717997 | 0.72 | 0.718705 | 100.0 | 2.0 | 65.99 | 0.374511 |
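
To put the differences in the table above in perspective, here is a small sketch that computes the absolute and relative change of each metric for the alternative prompt compared with the null prompt. The values are copied from the table as displayed; the names `null_row`, `alt_row`, and `metric_deltas` are hypothetical and not part of the notebook.

```python
# Values copied from the results table above (rounded as displayed).
null_row = {"f1": 0.718705, "prompt_tokens": 65.99, "latency": 0.374511}
alt_row = {"f1": 0.701116, "prompt_tokens": 70.99, "latency": 0.375653}


def metric_deltas(null, alt):
    """Return {metric: (absolute change, percent change)} of alt relative to null."""
    deltas = {}
    for name, base in null.items():
        diff = alt[name] - base
        deltas[name] = (diff, 100 * diff / base)
    return deltas


deltas = metric_deltas(null_row, alt_row)
for name, (diff, pct) in deltas.items():
    print(f"{name}: {diff:+.4f} ({pct:+.2f}%)")
```

The prompt grows by 5 tokens on average, or a bit under 8%, while the F1 score and latency move by much smaller margins.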


In [12]:
from statsmodels.stats.contingency_tables import mcnemar
# Perform McNemar's test
contingency_table = pd.crosstab(df['null_pred'] == df['ground_truth'], df['alt_pred'] == df['ground_truth'])
result = mcnemar(contingency_table, exact=True)

print(f"McNemar's test p-value: {result.pvalue}")
# Word inserted before "significant" in the written analysis below
sig = "" if result.pvalue < 0.05 else "not"

McNemar's test p-value: 0.625


In [13]:
df[["null_completion_tokens", "alt_completion_tokens"]].describe()

| | null_completion_tokens | alt_completion_tokens |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 2.0 | 2.0 |
| std | 0.0 | 0.0 |
| min | 2.0 | 2.0 |
| 25% | 2.0 | 2.0 |
| 50% | 2.0 | 2.0 |
| 75% | 2.0 | 2.0 |
| max | 2.0 | 2.0 |


In [16]:
## Generate analysis
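## Row 0 of `results` is the alternative prompt, row 1 is the null prompt
if results.f1.values[0] > results.f1.values[1]:
    # Better F1: Confirmed only if McNemar's test is also significant
    if result.pvalue < 0.05:
        prompt_myth_status = 'Confirmed'
    else:
        prompt_myth_status = 'Plausible'
else:
    prompt_myth_status = 'Busted'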
    

font_size  = 18
analysis = f"""<div style="font-size:{font_size}px">The alternative prompt achieved an F1 score of {round(results.f1.values[0], 2)} on the 100 ABA ethics questions tested. In comparison, the null prompt had an F1 score of {round(results.f1.values[1], 2)}. Responses to the alternative prompt contained an average of {round(results.completion_tokens.values[0], 2)} tokens, took an average of {round(results.latency.values[0], 2)} seconds to generate, and required a maximum of {round(df.alt_completion_tokens.max(), 2)} tokens. The null prompt received responses with an average of {round(results.completion_tokens.values[1], 2)} tokens, took an average of {round(results.latency.values[1], 2)} seconds to generate, and required a maximum of {round(df.null_completion_tokens.max(), 2)} tokens.<br><br>The McNemar's test p-value was <strong>{round(result.pvalue, 2)}</strong>, which is <strong>{sig}</strong> significant."""
analysis += f'\n\n Prompt Myth Status: **{prompt_myth_status}**</div>'
display(Markdown(analysis))

<div style="font-size:18px">The alternative prompt achieved an F1 score of 0.7 on the 100 ABA ethics questions tested. In comparison, the null prompt had an F1 score of 0.72. Responses to the alternative prompt contained an average of 2.0 tokens, took an average of 0.38 seconds to generate, and required a maximum of 2 tokens. The null prompt received responses with an average of 2.0 tokens, took an average of 0.37 seconds to generate, and required a maximum of 2 tokens.<br><br>The McNemar's test p-value was <strong>0.62</strong>, which is <strong>not</strong> significant.

 Prompt Myth Status: **Busted**</div>

Based on these results, the official verdict is **Busted**. Asking nicely did not improve performance: the polite prompt actually scored slightly lower (F1 of 0.70 vs. 0.72), and the difference was not statistically significant. However, it only costs a few tokens (in this case 5 prompt tokens) and it has no meaningful impact on latency. So if you want to be nice, it probably won't improve performance, but GPT-8 might spare you when it goes rogue, so consider doing it anyway.