# Prompt-Busters Entry Number 2 - Chained Chain of Thought Prompting
<p style="margin-top: -20px; font-size: 0.8em;">By Bryce Brady</p>

In PB-1 we explored single inference chain of thought prompting. One question I have wanted to answer but not seen any discussion on is whether or not chain of thought is better when it involves actual chains of prompts. For example, instead of an single prompt in which an LLM explains their reasoning and then provides an answer, you coulde request only the reasoning and then make a second call in which the answer is determined. In essence, by doing both at once we are actually requesting that the model do two tasks, despite their inter-dependence. Generally, increasing the number of tasks requested in a prompt decreases performance. Is the same true about chain of prompt?

### Chained Chain of Thought Hypothesis
Chain-of-thought prompting is a technique to elicit reasoning from large language models. It involves prompting the model with examples that demonstrate a chain of reasoning for solving a task, like a multi-step math word problem. The examples provide a series of natural language intermediate steps that show how to decompose the problem and arrive at the final solution.

By performing chain of thought through a chain of prompts. The first requests just the chain of thought reasoning and the second requests a determination of the answer, we can reduce the "cognitive load" on the model and improve performance. 

Something we will not be testing but is a seconday benefit for chained chain of thought is that we also guarantee an answer even if we run out of reasoning completion tokens since the answer inference is always performed separetely. Of course, it will be doing inference on an incomplete reasoning chain which may not be ideal.

## Methodology

To assess whether chained chain-of-thought prompting improves AI reasoning abilities, we need a challenging dataset and model for testing different prompts.

### Dataset: ABA Model Rules of Professional Conduct Questions

We will use the *aba_MRPC_true_false* evaluation dataset from Anthropic containing 110 True/False questions on legal ethics based on the American Bar Association (ABA) Model Rules of Professional Conduct. For example:

> Question: Newly admitted lawyers cannot be as competent as practitioners with long experience.  
> Answer: False

One hundred questions will be used for testing prompts, reserving 10 for potential future one-shot or few-shot learning evaluations. One hundred samples allows for an interpretable results analysis while posing a reasonable challenge for the AI model.  

### Model: Claude by Anthropic

We will test prompts using Claude-v1.3, an AI assistant created by Anthropic, due to factors including its reasonable API pricing and speeds, and service reliability.

PS I can't afford to do this with GPT-4 even if their API worked better.

### Prompt Design

Two types of prompts will be evaluated: a "null prompt" providing only the basic instructions and a "chain of thought prompt" encouraging step-by-step reasoning.

#### Null Prompt:
Note: This is identical to the PB-1 COT prompt. So we should expect the same performance since temp = 0. Stay tuned to see how that workes out...
```xml


Assistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Before providing an answer, think through each step of your reasoning.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>


Human:  <input>
    {True of False Question to Answer}
</input>


Assistant: <output>
    <reasoning>...
        
```

#### Alternative Prompt:

Step 1:
```xml


Assistant: <assidtant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Think through each step of your reasoning. Only provided your reasoning do not provide a final determination yet.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
    </output_format>
</assistant_instructions>


Human:  <input>
    {True of False Question to Answer}
</input>


Assistant: <output>
    <reasoning>...
        
```

Step 2:
```xml
Assistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Use the question and the provided reasoning to make a determination. Only provide your determination of the answer.
    </role>
    <output_format>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>


Human:  <input>
    <question>
        {True of False Question to Answer}
    </question>
    <reasoning>
        {Expert's reasoning about what the correction answer may be}
    </reasoning>
</input>


Assistant: <output>
    <answer>...
        
```

### Evaluation Methodology 

#### Scoring

1. **F1 Score** - *The harmonic mean of precision and recall, measuring the accuracy of the model.*
2. **Average Number of Prompt Tokens** - *The mean number of tokens (words and punctuation) in prompts generated.*  
3. **Average Number of Completion Tokens** - *The mean number of tokens in the model's completions.*
4. **Average Latency** - *The mean time in seconds for the model to generate a completion.*

To determine if the alternative prompt technique is **Confirmed** or **Busted** based solely on performance, we will compare F1 scores and perform the McNemar's test. If the alternative prompt is significantly different from the null prompt AND performance is improved, it will be considered **Confirmed**. If the the null prompt meets of outperforms the alternative prompt, the technique will be **Busted**. A technique is **Plausible** if we get a better a F1 score but do not have signficant results based on the McNemar's test.

McNemar's test is a non-parametric statistical test used to compare the performance of two classifiers on the same dataset. It's particularly useful when dealing with paired nominal data, like the predictions of two classifiers for the same set of instances. The test is based on the assumption that the two classifiers have the same error rates, and it evaluates if there is a significant difference in the number of correct and incorrect classifications made by each classifier.

The additional metrics of prompt length, completion length, and latency will also be compared to check for any increase in cost or decrease in efficiency from a technique, even if accuracy improves. A **Confirmed** prompt that drastically impacts these secondary metrics may still not be optimal or practical for real-world use cases.

In [1]:
## Imports and helper functions
import json
import pandas as pd
from langchain.llms import Anthropic
from config import ANTHROPIC_API_KEY
from tqdm import tqdm
import time
import tiktoken
from sklearn.metrics import classification_report
llm = Anthropic(model="claude-v1.3", temperature=0.0, max_tokens_to_sample=512, anthropic_api_key=ANTHROPIC_API_KEY)
tqdm.pandas()
def generate_null_prompt(messages):
    question = messages[1]['content']   
    xml = '''\n\nAssistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Before providing an answer, think through each step of your reasoning.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>\n\nHuman: <input>
    '''
    xml += question
    xml += '''
</input>\n\nAssistant: <output>
    <reasoning>'''
    return xml

def generate_alt_prompt(messages):
    question = messages[1]['content']   
    xml = '''\n\nAssistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Think through each step of your reasoning. Only provide your reasoning do not provide a final determination yet.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
    </output_format>
</assistant_instructions>\n\nHuman: <input>
    '''
    xml += question
    xml += '''
</input>\n\nAssistant: <output>
    <reasoning>'''
    return xml
def generate_alt_prompt2(question, alt_result_1):  
    xml = '''\n\nAssistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Use the question and the provided reasoning to make a determination. Only provide your determination of the answer.
    </role>
    <outupt_format>
        <answer>
            {True or False}
        </answer>
    </outupt_format>
</assistant_instructions>\n\nHuman: <input>
    <question>
        '''
    xml += question
    xml += '''
    </question>
    <reasoning>
        '''
    xml += alt_result_1
    xml += '''
</input>\n\nAssistant: <output>
    <answer>'''
    return xml

def extract(result):
    result = result.lower()
    if 'true' in result:
        return 'True'
    elif 'false' in result:
        return 'False'
    
def get_alt_result(alt_input_1, llm):
    alt_result_1 = llm(alt_input_1)    
    if '<reasoning>' in alt_result_1:
        alt_result_1_reasoning = alt_result_1.split('<reasoning>')[1]
    else:
        alt_result_1_reasoning = alt_result_1
    if '</reasoning>' in alt_result_1_reasoning:
        alt_result_1_reasoning = alt_result_1_reasoning.split('</reasoning>')[0].strip()
    question = alt_input_1.split('<input>')[1].split('</input>')[0].strip()
    alt_input_2 = generate_alt_prompt2(question, alt_result_1_reasoning)
    alt_result_2 = llm(alt_input_2)
    return alt_result_1, alt_input_2, alt_result_2
target_names = {0: 'False', 1: 'True'}
target_numbers = {v: k for k, v in target_names.items()}

## We will estimate tokens using the OpenAI Ada encoding. Not perfect but probably good enough.
embedding_encoding = "cl100k_base"  # this the encoding for text-embedding-ada-002
encoding = tiktoken.get_encoding(embedding_encoding)


In [2]:
## Load and format data
file = '../data/aba_MRPC_true_false.jsonl'
with open(file) as f:
    lines = f.readlines()
    lines = [json.loads(line) for line in lines]

df = pd.DataFrame(lines)
df['ground_truth'] = df.ideal.map(target_numbers)
df['null_input'] = df.apply(lambda x: generate_null_prompt(x['input']), axis=1)
df['alt_input'] = df.apply(lambda x: generate_alt_prompt(x['input']), axis=1) # cant generation alt_input2 yet
df.head()

Unnamed: 0,input,ideal,ground_truth,null_input,alt_input
0,"[{'role': 'system', 'content': 'You are LawStu...",False,0,\n\nAssistant: <assistant_instructions>\n <...,\n\nAssistant: <assistant_instructions>\n <...
1,"[{'role': 'system', 'content': 'You are LawStu...",False,0,\n\nAssistant: <assistant_instructions>\n <...,\n\nAssistant: <assistant_instructions>\n <...
2,"[{'role': 'system', 'content': 'You are LawStu...",True,1,\n\nAssistant: <assistant_instructions>\n <...,\n\nAssistant: <assistant_instructions>\n <...
3,"[{'role': 'system', 'content': 'You are LawStu...",False,0,\n\nAssistant: <assistant_instructions>\n <...,\n\nAssistant: <assistant_instructions>\n <...
4,"[{'role': 'system', 'content': 'You are LawStu...",False,0,\n\nAssistant: <assistant_instructions>\n <...,\n\nAssistant: <assistant_instructions>\n <...


In [3]:
## 1. Check input format
print("Null Prompt:\n")
print(df.null_input[0])

print("\n\nalt Prompt:\n")
print(df.alt_input[0])


## 2. Drop first ten rows
df = df.iloc[10:]

Null Prompt:



Assistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Before providing an answer, think through each step of your reasoning.
    </role>
    <output_format>
        <reasoning>
            {scratch pad for reasoning through the question}
        </reasoning>
        <answer>
            {True or False}
        </answer>
    </output_format>
</assistant_instructions>

Human: <input>
    A lawyer with general experience not considered competent to handle a case involving a specialized field of law.
</input>

Assistant: <output>
    <reasoning>


alt Prompt:



Assistant: <assistant_instructions>
    <role>
        Correctly Answer True/False questions about the ABA Model Rules of Professional Conduct. Think through each step of your reasoning. Only provide your reasoning do not provide a final determination yet.
    </role>
    <output_format>
        <reasoning>
            {scratch

## Null Prompt Testing

In [4]:
## Generate Null Prompt Results
start_time = time.time()
df['null_result'] = df.progress_apply(lambda x: llm(x['null_input']), axis=1)
time_per_it = (time.time() - start_time)/len(df)

100%|██████████| 100/100 [06:47<00:00,  4.08s/it]


In [5]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['null_answer'] = df.apply(lambda x: extract(x['null_result']), axis=1)
df['null_pred'] = df.null_answer.map(target_numbers)

## Get token counts
df["null_completion_tokens"] = df.null_result.apply(lambda x: len(encoding.encode(x)))
df["null_prompt_tokens"] = df.null_input.apply(lambda x: len(encoding.encode(x)))

## Calculate performance stats
y_true = df['ground_truth']
y_pred = df['null_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
null_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["null_completion_tokens"].mean(),
    "prompt_tokens": df["null_prompt_tokens"].mean(),
    "latency": time_per_it
}

In [6]:
null_stats

{'precision': 0.7892391761244221,
 'recall': 0.79,
 'f1': 0.7895386546709906,
 'support': 100,
 'completion_tokens': 128.36,
 'prompt_tokens': 139.99,
 'latency': 4.0798932909965515}

## Alternative Prompt Testing

In [7]:
## Generate Null Prompt Results
start_time = time.time()
df['alt_result_1'], df['alt_input_2'], df['alt_result_2'] = zip(*df.progress_apply(lambda x: get_alt_result(x['alt_input'], llm), axis=1))
time_per_it = (time.time() - start_time)/len(df)

100%|██████████| 100/100 [10:35<00:00,  6.35s/it]


In [8]:
## Pull out answers
## TODO - this is a hacky way to do this. Should be a better way.
df['alt_answer'] = df.apply(lambda x: extract(x['alt_result_2']), axis=1)
df['alt_pred'] = df.alt_answer.map(target_numbers)

## Get token counts
df["alt_completion_tokens"] = df.alt_result_1.apply(lambda x: len(encoding.encode(x))) + df.alt_result_2.apply(lambda x: len(encoding.encode(x)))
df["alt_prompt_tokens"] = df.alt_input.apply(lambda x: len(encoding.encode(x))) + df.alt_input_2.apply(lambda x: len(encoding.encode(x)))

## Calculate performance stats
y_true = df['ground_truth']
y_pred = df['alt_pred']
report = classification_report(y_true, y_pred,target_names=target_names.values(), labels=list(target_names.keys()), zero_division=1, output_dict=True)

### Collect statistics
alt_stats = {
    "precision": report['weighted avg']['precision'],
    "recall": report['weighted avg']['recall'],
    "f1":  report['weighted avg']['f1-score'],
    "support": report['weighted avg']['support'],
    "completion_tokens": df["alt_completion_tokens"].mean(),
    "prompt_tokens": df["alt_prompt_tokens"].mean(),
    "latency": time_per_it
}

In [9]:
alt_stats

{'precision': 0.7576388888888889,
 'recall': 0.76,
 'f1': 0.7575551782682513,
 'support': 100,
 'completion_tokens': 187.05,
 'prompt_tokens': 447.61,
 'latency': 6.35338968038559}

## Analyze the Results

In [10]:
## Let's display a table of the stats before doing any analysis.
from IPython.display import display, HTML
data = {"alt_stats": alt_stats, "null_stats": null_stats}
results = pd.DataFrame(data).transpose()

# Display the HTML table
display(HTML(results.to_html()))

Unnamed: 0,precision,recall,f1,support,completion_tokens,prompt_tokens,latency
alt_stats,0.757639,0.76,0.757555,100.0,187.05,447.61,6.35339
null_stats,0.789239,0.79,0.789539,100.0,128.36,139.99,4.079893


It's worse! 

Also intersting and frustrating is that th null performance is slightly differen from the PB-1 chain of thought performance despite identical input and temperature set to zero. 

This does help support the notion that differences between performance can be better explained by random noise than an actual performance different.

In [13]:
from statsmodels.stats.contingency_tables import mcnemar
# Perform McNemar's test
contingency_table = pd.crosstab(df['null_pred'] == df['ground_truth'], df['alt_pred'] == df['ground_truth'])
result = mcnemar(contingency_table, exact=True)

print(f"McNemar's test p-value: {result.pvalue}")

McNemar's test p-value: 0.548828125


In [14]:
df[["null_completion_tokens", "alt_completion_tokens"]].describe()

Unnamed: 0,null_completion_tokens,alt_completion_tokens
count,100.0,100.0
mean,128.36,187.05
std,47.211863,76.342976
min,49.0,53.0
25%,100.75,128.75
50%,121.0,180.5
75%,151.0,224.75
max,262.0,424.0


The altnerative prompt achieved an F1 score of 0.79 on the 100 ABA ethics questions tested. In comparison, the null prompt had an F1 score of 0.76. Responses to the alternative prompt contained an average of 187 tokens, took an average of 6.35 seconds to generate, and required max of 424 tokens. The null prompt received responses with an average of 128 tokens, took 4.08 seconds on average, and used a maximum of 262 tokens.

Based on these results, the official verdict is **Busted**. A lot more research in this topic is warranted but for the most part,  chaining prompts does not necessarily benefit performance. We will certainly revisit this topic when we get to more advanced ideas like reflection.   

Doing multiple inferences will cost more and take more time; however, there is a significant advantage in guaranteeing a generation if for example the model reaches its max generation token limit before actually answering the question.