# Prompt-Busters Entry Number 1 - Chain of Thought Prompting
<p style="margin-top: -20px; font-size: 0.8em;">By Bryce Brady</p>

I figured we'd start with an easy one. Much smarter people than I have already done some amazing **published** research on this topic. Instead, this is just an attempt to dip my toes into the water on this series and develop a foundational template for the analysis.

Please read this paper if you want a more rigorous analysis:

[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903.pdf) (Wei et al., 2023)

### Chain of Thought Hypothesis
Chain-of-thought prompting is a technique to elicit reasoning from large language models. It involves prompting the model with examples that demonstrate a chain of reasoning for solving a task, like a multi-step math word problem. The examples provide a series of natural language intermediate steps that show how to decompose the problem and arrive at the final solution.

By observing these examples, the language model can then generate its own chain of reasoning for new problems - breaking the problem into steps, explaining each intermediate thought, and concluding with the final answer. This allows the model to allocate more computation to more complex problems that require multiple reasoning steps. It also provides insight into how the model is thinking and opportunities to debug its reasoning process.
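
As a rough illustration of the few-shot format described above, here is a minimal sketch that assembles a prompt from one worked exemplar in the style of the Wei et al. paper. The function name `build_few_shot_cot_prompt` and the `Q:`/`A:` layout are assumptions made for this example. This is not the template used in the experiment below, which asks for step-by-step reasoning through instructions only and includes no exemplars.

```python
from typing import List, Tuple

# One worked exemplar: (question, reasoning chain, final answer).
# The tennis-ball problem is the well-known example from the Wei et al. paper.
EXEMPLARS: List[Tuple[str, str, str]] = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11.",
        "11",
    ),
]


def build_few_shot_cot_prompt(question: str, exemplars=EXEMPLARS) -> str:
    """Prepend worked exemplars (with reasoning) to a new question.

    Hypothetical helper for illustration only.
    """
    blocks = []
    for q, reasoning, answer in exemplars:
        # Each exemplar shows the intermediate steps before the final answer.
        blocks.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    # The new question ends with an open "A:" so the model continues with its own chain.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


print(build_few_shot_cot_prompt("A baker has 4 trays of 6 rolls and sells 5. How many are left?"))
```

The open `A:` at the end is what invites the model to imitate the exemplar and write out its intermediate steps before committing to an answer.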

## Methodology

To assess whether chain-of-thought prompting improves AI reasoning abilities, we need a challenging dataset and model for testing different prompts.

### Dataset: ABA Model Rules of Professional Conduct Questions

We will use the *aba_MRPC_true_false* evaluation dataset from Anthropic containing 110 True/False questions on legal ethics based on the American Bar Association (ABA) Model Rules of Professional Conduct. For example:

> Question: Newly admitted lawyers cannot be as competent as practitioners with long experience.  
> Answer: False

One hundred questions will be used for testing prompts, reserving 10 for potential future one-shot or few-shot learning evaluations. One hundred samples allows for an interpretable results analysis while posing a reasonable challenge for the AI model.  

### Model: Claude by Anthropic

We will test prompts using Claude-v1.3, an AI assistant created by Anthropic, due to factors including its reasonable API pricing and speeds, and service reliability.

PS I can't afford to do this with GPT-4 even if their API worked better.

### Prompt Design

Two types of prompts will be evaluated: a "null prompt" providing only the basic instructions and a "chain of thought prompt" encouraging step-by-step reasoning.

#### Null Prompt:
(The following prompt will be sent to the completions endpoint for claude-v1.3 and end at the ellipsis. The Human/Assistant formatting is to comply with Anthropic's recommended prompt formatting from their API documentation. This is certainly something that we will be testing in the future!)
```xml


Assistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct.
    </role>
    <output_format>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>

H:  <input>
    {True or False Question to Answer}
</input>


Assistant: <output>
    <answer>...
        
```

#### Chain of Thought Prompt:
```xml


Assistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Before providing an answer, think through each step of your reasoning.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>


H:  <input>
    {True or False Question to Answer}
</input>


Assistant: <output>
    <reasoning>...
        
```

### Evaluation Methodology 

#### Scoring

1. **F1 Score** - *The harmonic mean of precision and recall, measuring the accuracy of the model.*
2. **Average Number of Prompt Tokens** - *The mean number of tokens in the prompts sent to the model.*  
3. **Average Number of Completion Tokens** - *The mean number of tokens in the model's completions.*
4. **Average Latency** - *The mean time in seconds for the model to generate a completion.*
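
To make the first metric concrete, here is a small sketch of how precision, recall, and F1 fall out of raw counts, plus the support-weighted average that the notebook later reads from `classification_report` under `weighted avg`. The helper names and the counts in the example are hypothetical and are not taken from the experiment.

```python
from typing import Iterable, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def weighted_f1(per_class: Iterable[Tuple[float, int]]) -> float:
    """Average per-class F1 scores, weighting each by its support (true instances)."""
    per_class = list(per_class)
    total_support = sum(support for _, support in per_class)
    return sum(f1 * support for f1, support in per_class) / total_support


# Hypothetical counts for a single class: 30 true positives,
# 10 false positives, and 20 false negatives.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
```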

To determine whether a prompt technique is **Confirmed** or **Busted** based solely on performance, we will compare F1 scores and run McNemar's test. If the chain-of-thought prompt scores a higher F1 than the null prompt AND McNemar's test finds a significant difference, the technique is **Confirmed**. If the chain-of-thought prompt scores a higher F1 but the difference is not significant, the technique is **Plausible**. If the null prompt matches or outperforms the chain-of-thought prompt, the technique is **Busted**.

McNemar's test is a non-parametric statistical test used to compare the performance of two classifiers on the same dataset. It's particularly useful when dealing with paired nominal data, like the predictions of two classifiers for the same set of instances. Its null hypothesis is that the two classifiers have the same error rate. The test looks only at the discordant pairs, the instances that one classifier gets right and the other gets wrong, and asks whether those disagreements are split more unevenly than chance would explain.
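
The exact form of the test is a two-sided binomial test on the discordant pairs, which fits in a few lines of standard-library Python. The sketch below is only meant to show the idea. The function name `mcnemar_exact` is made up for this example, and the discordant counts of 10 and 7 are hypothetical: they are not read from the notebook, only chosen to be consistent with a three-question accuracy gap between two prompts.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: instances classifier A got right and classifier B got wrong
    c: instances classifier A got wrong and classifier B got right
    Instances where both agree (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of any difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts (an assumption for illustration).
print(f"p-value: {mcnemar_exact(10, 7):.3f}")
```

With only 17 disagreements split 10 to 7, the p-value is far above any conventional threshold, which is why a three-question gap on 100 questions is hard to distinguish from noise.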

The additional metrics of prompt length, completion length, and latency will also be compared to check for any increase in cost or decrease in efficiency from a technique, even if accuracy improves. A **Confirmed** prompt that drastically impacts these secondary metrics may still not be optimal or practical for real-world use cases.

In [4]:
## Imports and helper functions
import json
import pandas as pd
from langchain.llms import Anthropic
from config import ANTHROPIC_API_KEY
from tqdm import tqdm
import tiktoken
from sklearn.metrics import classification_report
llm = Anthropic(model="claude-v1.3", temperature=0, max_tokens_to_sample=512, anthropic_api_key=ANTHROPIC_API_KEY)
tqdm.pandas()
def generate_null_prompt(messages):
    question = messages[1]['content']   
    xml = '''\n\nAsisstant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct.
    </role>
    <output_format>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>\n\nHuman: <input>
    '''
    xml += question
    xml += '''
</input>\n\nAssistant: <output>
    <answer>'''
    return xml

def generate_cot_prompt(messages):
    question = messages[1]['content']   
    xml = '''\n\nAsisstant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Before providing an answer, think through each step of your reasoning.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>\n\nHuman: <input>
    '''
    xml += question
    xml += '''
</input>\n\nAssistant: <output>
    <reasoning>'''
    return xml
def extract(result):
    result = result.lower()
    if 'true' in result:
        return 'True'
    elif 'false' in result:
        return 'False'
target_names = {0: 'False', 1: 'True'}
target_numbers = {v: k for k, v in target_names.items()}

## We will estimate tokens using the OpenAI Ada encoding. Not perfect but probably good enough.
embedding_encoding = "cl100k_base"  # this is the encoding for text-embedding-ada-002
encoding = tiktoken.get_encoding(embedding_encoding)


In [5]:
## Load and format data
file = '../data/aba_MRPC_true_false.jsonl'
with open(file) as f:
    lines = f.readlines()
    lines = [json.loads(line) for line in lines]

df = pd.DataFrame(lines)
df['ground_truth'] = df.ideal.map(target_numbers)
df['null_input'] = df.apply(lambda x: generate_null_prompt(x['input']), axis=1)
df['cot_input'] = df.apply(lambda x: generate_cot_prompt(x['input']), axis=1)
df.head()

| | input | ideal | ground_truth | null_input | cot_input |
|---|---|---|---|---|---|
| 0 | `[{'role': 'system', 'content': 'You are LawStu...` | False | 0 | `\n\nAsisstant: <assistant_instructions>\n <...` | `\n\nAsisstant: <assistant_instructions>\n <...` |
| 1 | `[{'role': 'system', 'content': 'You are LawStu...` | False | 0 | `\n\nAsisstant: <assistant_instructions>\n <...` | `\n\nAsisstant: <assistant_instructions>\n <...` |
| 2 | `[{'role': 'system', 'content': 'You are LawStu...` | True | 1 | `\n\nAsisstant: <assistant_instructions>\n <...` | `\n\nAsisstant: <assistant_instructions>\n <...` |
| 3 | `[{'role': 'system', 'content': 'You are LawStu...` | False | 0 | `\n\nAsisstant: <assistant_instructions>\n <...` | `\n\nAsisstant: <assistant_instructions>\n <...` |
| 4 | `[{'role': 'system', 'content': 'You are LawStu...` | False | 0 | `\n\nAsisstant: <assistant_instructions>\n <...` | `\n\nAsisstant: <assistant_instructions>\n <...` |


In [6]:
## 1. Check input format
print("Null Prompt:\n")
print(df.null_input[0])

print("\n\nCOT Prompt:\n")
print(df.cot_input[0])


## 2. Drop first ten rows
df = df.iloc[10:]

Null Prompt:



Asisstant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct.
    </role>
    <output_format>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>

Human: <input>
    A lawyer with general experience not considered competent to handle a case involving a specialized field of law.
</input>

Assistant: <output>
    <answer>


COT Prompt:



Asisstant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Before providing an answer, think through each step of your reasoning.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>

Human: <input>
    A lawyer with general expe

## Null Prompt Testing

In [7]:
## Generate Null Prompt Results
df['null_result'] = df.progress_apply(lambda x: llm(x['null_input']), axis=1)

100%|██████████| 100/100 [01:26<00:00,  1.15it/s]


Ran on 5/3/2023 @ 8:25 AM CT

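Latency ≈ 0.87 seconds per inference (the progress bar reports 1.15 it/s, which is iterations per second: 100 prompts in 1 minute 26 seconds)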

In [14]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['null_answer'] = df.apply(lambda x: extract(x['null_result']), axis=1)
df['null_pred'] = df.null_answer.map(target_numbers)

## Get token counts
df["null_completion_tokens"] = df.null_result.apply(lambda x: len(encoding.encode(x)))
df["null_prompt_tokens"] = df.null_input.apply(lambda x: len(encoding.encode(x)))

## Calculate performance stats
y_true = df['ground_truth']
y_pred = df['null_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
null_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["null_completion_tokens"].mean(),
    "prompt_tokens": df["null_prompt_tokens"].mean(),
    "latency": 0.87 ## From the TQDM progress bar: 1.15 it/s (iterations per second), i.e. about 0.87 seconds per inference
}

## Chain of Thought Prompt Testing

In [9]:
## Generate COT Prompt Results
df['cot_result'] = df.progress_apply(lambda x: llm(x['cot_input']), axis=1)

## Note: I had to increase the max tokens from 256 to 512 to get this to work. We will look at requesting brevity in future prompt-busters.

100%|██████████| 100/100 [07:01<00:00,  4.22s/it]


Ran on 5/3/2023 @ 8:33 PM CT

Latency = 4.22 seconds per inference

In [15]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['cot_answer'] = df.apply(lambda x: extract(x['cot_result']), axis=1)
df['cot_pred'] = df.cot_answer.map(target_numbers)

## Get token counts
df["cot_completion_tokens"] = df.cot_result.apply(lambda x: len(encoding.encode(x)))
df["cot_prompt_tokens"] = df.cot_input.apply(lambda x: len(encoding.encode(x)))

## Calculate performance stats
y_true = df['ground_truth']
y_pred = df['cot_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
cot_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["cot_completion_tokens"].mean(),
    "prompt_tokens": df["cot_prompt_tokens"].mean(),
    "latency": 4.22 ## Taken directly from TQDM progress bar
}
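
The TODO in the extraction cells flags the substring check as hacky, and it matters most for the COT completions: any reasoning that contains the word "true" gets read as an answer of True, whatever the final `<answer>` says. Below is a minimal sketch of a stricter parser. It assumes the completion layout implied by the prompts (the null prompt is prefilled up to `<answer>`, so the completion should start with the answer, while the COT completion should contain its own `<answer>` tag after the reasoning). The function name `extract_answer` and the `prefilled_answer_tag` flag are hypothetical, and this is not the code that produced the results in this post.

```python
import re
from typing import Optional

# Answer appearing right after an explicit <answer> tag (COT completions).
_TAGGED = re.compile(r"<answer>\s*(true|false)\b", re.IGNORECASE)
# Answer appearing at the very start of the completion (null completions,
# because the prompt already ends with an opening <answer> tag).
_LEADING = re.compile(r"^\s*(true|false)\b", re.IGNORECASE)


def extract_answer(completion: str, prefilled_answer_tag: bool) -> Optional[str]:
    """Return 'True' or 'False', or None if no answer is found where one is expected."""
    pattern = _LEADING if prefilled_answer_tag else _TAGGED
    match = pattern.search(completion)
    return match.group(1).capitalize() if match else None


# The word "true" in the reasoning no longer overrides the final answer.
cot_completion = "It is true that Rule 1.2(a) applies here.</reasoning>\n<answer>\n False\n</answer>"
print(extract_answer(cot_completion, prefilled_answer_tag=False))
```

Returning `None` for a missing answer also makes truncated or malformed completions easy to count, rather than silently mapping them to a label.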

## Analyze the Results

In [16]:
## Let's display a table of the stats before doing any analysis.
from IPython.display import display, HTML
data = {"cot_stats": cot_stats, "null_stats": null_stats}
results = pd.DataFrame(data).transpose()

# Display the HTML table
display(HTML(results.to_html()))

| | precision | recall | f1 | support | completion_tokens | prompt_tokens | latency |
|---|---|---|---|---|---|---|---|
| cot_stats | 0.72 | 0.72 | 0.72 | 100.0 | 133.81 | 141.99 | 4.22 |
| null_stats | 0.749054 | 0.75 | 0.749451 | 100.0 | 8.69 | 107.99 | 0.87 |


Uh-oh, results are actually worse. Are they significantly worse?

In [17]:
from statsmodels.stats.contingency_tables import mcnemar
# Perform McNemar's test
contingency_table = pd.crosstab(df['null_pred'] == df['ground_truth'], df['cot_pred'] == df['ground_truth'])
result = mcnemar(contingency_table, exact=True)

print(f"McNemar's test p-value: {result.pvalue}")

McNemar's test p-value: 0.629058837890625


For reference, here is the analysis from the original run, before the prompt typos described in Revision 1 below were fixed. In that run, the COT prompt achieved a precision of 0.79, recall of 0.78, and F1 score of 0.78 on the 100 ABA ethics questions tested. In comparison, the null prompt had a precision of 0.73, recall of 0.73, and F1 score of 0.73. Responses to the COT prompt contained an average of 141 tokens, took an average of 4.27 seconds to generate, and required a maximum of 289 tokens. The null prompt received responses with an average of 8 tokens, took 1.12 seconds on average, and used a maximum of 10 tokens.

### Key Takeaways (original run)
* The COT prompt achieved higher precision, recall, and F1 score compared to the null prompt, indicating a potential advantage in eliciting correct responses.
* Responses to the COT prompt contained significantly more tokens on average than those to the null prompt, suggesting that the COT prompt encouraged more detailed and reasoning-based responses.
* The COT prompt exhibited over 3x higher latency compared to the null prompt, which can be attributed to the additional cognitive load required to produce thoughtful responses.
* Both prompts had nearly the same number of prompt tokens on average, indicating that the difference in response quality is due to the prompt style rather than length.
* The model's performance on both prompts was moderately high, providing a solid baseline for future experiments.

However, the McNemar's test p-value of 0.30 suggests that the difference in classifier performance is not statistically significant. Based on these results, the official verdict is **Plausible**. It is recommended to collect more examples to conclusively determine the impact of the Chain of Thought (COT) prompt on performance.

In conclusion, while the effectiveness of the COT prompt has been well documented in the literature, this exercise aimed to apply a methodical approach to comparing prompt performance. The results highlight that although the performance is slightly better, it comes with significant costs. Therefore, a prudent prompt engineer should carefully consider the trade-offs when applying techniques, as they can have a substantial impact on cost with a statistically insignificant effect on performance.

With the corrected prompts, the COT prompt achieved an F1 score of 0.72 on the 100 ABA ethics questions tested. In comparison, the null prompt had an F1 score of 0.75. Responses to the COT prompt contained an average of 134 tokens, took an average of 4.22 seconds to generate, and required a maximum of 283 tokens. The null prompt received responses with an average of 8.7 tokens, took about 0.87 seconds on average (its progress bar reports 1.15 iterations per second, not seconds per iteration), and used a maximum of 9 tokens.

### Key Takeaways 
* The null prompt achieved slightly higher precision, recall, and F1 score compared to the COT prompt, indicating a potential disadvantage for the COT prompt in eliciting correct responses on this dataset.
* Responses to the COT prompt contained significantly more tokens on average than those to the null prompt, meaning chain-of-thought prompting used about 2.4 times as many tokens (prompt + completion) on this dataset.
* The COT prompt exhibited nearly 5x higher latency than the null prompt (4.22 seconds versus about 0.87 seconds per inference), which can be attributed to the additional tokens requested.
* The model's performance on both prompts was moderately high, providing a solid baseline for future experiments. However, the McNemar's test p-value of 0.63 suggests that the difference in classifier performance is not statistically significant.
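
The cost ratios can be checked with a few lines of arithmetic. The token means come from the stats table above. For latency, the null run's progress bar reports 1.15 it/s (iterations per second), so this sketch takes its seconds per call as the reciprocal, while the COT run's bar reports 4.22 s/it directly. The variable names are made up for the example.

```python
# Mean values from the results table above.
cot = {"prompt_tokens": 141.99, "completion_tokens": 133.81}
null = {"prompt_tokens": 107.99, "completion_tokens": 8.69}

# Seconds per inference, derived from the two progress bars.
cot_seconds_per_call = 4.22        # reported as 4.22 s/it
null_seconds_per_call = 1 / 1.15   # reported as 1.15 it/s, so take the reciprocal

cot_total = cot["prompt_tokens"] + cot["completion_tokens"]
null_total = null["prompt_tokens"] + null["completion_tokens"]

token_ratio = cot_total / null_total
completion_ratio = cot["completion_tokens"] / null["completion_tokens"]
latency_ratio = cot_seconds_per_call / null_seconds_per_call

print(f"total tokens per call: {cot_total:.2f} vs {null_total:.2f} ({token_ratio:.2f}x)")
print(f"completion tokens only: {completion_ratio:.1f}x")
print(f"latency: {latency_ratio:.2f}x")
```

Most of the gap comes from the completion side, which is usually the more expensive side on per-token pricing, so the total-token ratio likely understates the difference in dollar cost.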

Based on these results, the official verdict is **Busted**. However, this does not contradict or invalidate the results of the many published papers on this topic. Instead, we simply showed that for this specific dataset, chain of thought could not elicit better answers, despite seemingly sound reasoning and the benefits expected from this technique.

In conclusion, while the effectiveness of the COT prompt has been well documented in the literature, this exercise aimed to apply a methodical approach to comparing prompt performance. The results highlight that on some tasks chain of thought may be a slight detriment, or at the very least cost significantly more for non-significant improvements. Therefore, a prudent prompt engineer should carefully consider the trade-offs when applying techniques, as they can have a substantial impact on cost with a statistically insignificant effect on performance.

In [13]:
df[["null_completion_tokens", "cot_completion_tokens"]].describe()

| | null_completion_tokens | cot_completion_tokens |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 8.69 | 133.81 |
| std | 0.464823 | 49.503126 |
| min | 8.0 | 47.0 |
| 25% | 8.0 | 102.75 |
| 50% | 9.0 | 125.5 |
| 75% | 9.0 | 159.25 |
| max | 9.0 | 283.0 |


The other issue I encountered with a prompt like this is its highly variable output length. While the null prompt produced a consistent 8 to 9 tokens per response, the CoT completions ranged from 47 to 283 tokens. This variability is dangerous not just because of cost, but because in this study I performed chain of thought in a single inference. With a lower maximum token limit, the model could have hit the limit before providing an answer (I had to raise the limit from 256 to 512 to get this run to work).
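
One cheap guard against this failure mode is to check whether each completion actually reached the end of its answer. The sketch below assumes the model closes the `</answer>` tag when it finishes, which the prompt format suggests but does not guarantee. The helper names are hypothetical.

```python
from typing import Iterable


def is_truncated(completion: str, closing_tag: str = "</answer>") -> bool:
    """Treat a completion as truncated if it never closes the answer tag (an assumption)."""
    return closing_tag not in completion


def truncation_rate(completions: Iterable[str]) -> float:
    """Fraction of completions that appear to have been cut off."""
    completions = list(completions)
    if not completions:
        return 0.0
    return sum(is_truncated(c) for c in completions) / len(completions)


# Made-up completions: the second one runs out of tokens mid-reasoning.
examples = [
    "Rule 1.1 covers competence.</reasoning>\n<answer>\n False\n</answer>",
    "The rule on fees says that a lawyer shall not",
]
print(truncation_rate(examples))
```

Running a check like this after each experiment would show how many answers a given token limit cost, instead of finding out from a drop in accuracy.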

In the future, I plan to test performing multiple inferences for chain of thought to compare the costs and benefits. On the one hand, multiple inferences might guarantee a successful response. But it remains to be seen whether the additional inference time and tokens required would be worth it.
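
A two-call version might look roughly like the sketch below. It is a design sketch rather than a tested prompt: the wording of both prompts is made up, and `complete` stands for any function that takes a prompt string and returns the completion text (for example, the `llm` object used above). The first call gets the whole token budget for reasoning, and the second call only has to produce one word.

```python
from typing import Callable, Dict, Union


def two_stage_cot(question: str, complete: Callable[[str], str]) -> Dict[str, Union[str, int]]:
    """Run chain of thought as two inferences: reasoning first, then the answer.

    `complete` is a hypothetical stand-in for the model call.
    """
    reasoning_prompt = (
        "\n\nHuman: Think step by step about this True/False question on the ABA Model "
        "Rules of Professional Conduct. Do not give a final answer yet.\n"
        f"{question}\n\nAssistant:"
    )
    reasoning = complete(reasoning_prompt).strip()

    # The second call sees the question and the reasoning, and is asked for one word only.
    answer_prompt = (
        f"\n\nHuman: Question: {question}\nReasoning: {reasoning}\n"
        "Based on the reasoning, reply with exactly one word, True or False.\n\nAssistant:"
    )
    answer = complete(answer_prompt).strip()
    return {"reasoning": reasoning, "answer": answer, "calls": 2}


# Usage with a fake model, so the control flow can be checked without an API key.
def fake_complete(prompt: str) -> str:
    return " False " if "exactly one word" in prompt else " The payer is not the client. "


print(two_stage_cot("A parent who pays may direct the legal strategy.", fake_complete))
```

The trade-off is visible in the structure: the question and the reasoning are sent twice, so prompt tokens and round trips both go up in exchange for an answer that cannot be cut off by a long reasoning chain.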

### Revision 1
I noticed a lot of typos, especially in the prompts, once I got a fresh set of eyes this morning. The edits I have made were only to comply with the original methodology set out. I was just fixing some bugs. Please check out the git history to see what typos were fixed and how they impacted performance.

Unfortunately, these corrections caused a major change in the results that invalidated the entire original analysis! Since I had to make changes, I'm going to consider this a soft bust only. Obviously chain of thought is a useful technique for some kinds of reasoning tasks. This analysis does show that not all situations benefit from it, though. We should be selective in our use of prompt techniques.

In [31]:
## Let's take a look at the COT results that were wrong
sample  = df[df['cot_pred'] != df['ground_truth']].sample(1)
print(sample.cot_input.values[0])
print(sample.cot_result.values[0])

## I'm not a lawyer, but the reasoning seems sound, except that some subtle fallacies or hallucinations are being introduced that lead the LLM astray.



Asisstant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Before providing an answer, think through each step of your reasoning.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>

Human: <input>
    A parent who pays for the lawyer's services on behalf of a child may direct or control some of the legal strategy decisions made.
</input>

Assistant: <output>
    <reasoning>

<reasoning>
The ABA Model Rules state that a lawyer shall abide by a client's decisions concerning the objectives of representation and shall consult with the client as to the means by which they are to be pursued. (Rule 1.2(a))
The lawyer shall not accept compensation for representing a client from one other than the client unless: (1) the