# Chain of Thought

Chain of Thought (CoT) prompting is a technique for eliciting reasoning in LLMs, by asking the model to think first before completing a task. CoT prompting has shown significant improvements in the performance of language models on tasks that require multi-step reasoning, such as math word problems, commonsense reasoning, and symbolic manipulation. By providing a step-by-step thought process, the model can better understand and solve complex problems. This technique enhances the reasoning capabilities of large language models by encouraging them to break down complex problems into intermediate steps.

## Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

The CoT technique has been referenced in many AI papers since, but was first introduced by Jason Wei et al. The authors demonstrate that CoT prompting is particularly effective in few-shot learning scenarios and that its benefits scale with model size. The technique shows versatility across various tasks. Importantly, CoT prompting makes the model's reasoning process more transparent and human-like, enhancing interpretability. The process is analogous to system 2 thinking in the human brain, when a problem is solved by spending time thinking a more deliberate and structured way, as opposed to system 1 thinking which is more intuitive and instinctive (like LLMs are without CoT). The paper's findings suggest that future AI developments might focus more on prompting techniques to unlock latent abilities in existing models, rather than solely on developing larger models. While highly effective, the authors note that CoT prompting has limitations and may not be as beneficial for simpler tasks or very small models.

> [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903) by Wei, J., et al. (2022)

In [1]:
from openai import OpenAI
client = OpenAI()

def get_completion(prompt, system):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content


system = "Solve the following problem and return the answer in the format A: <answer>"
# Standard prompting (one-shot)
question = """Q: How many Rs are there in RASPBERRY?"""

standard_prompt = f"""Q: How many Es are there in ELEPHANT?
A: 2
Q: How many Ps are there in PINEAPPLE?
A: 3
Q: How many Os are there in CHOCOLATE?
A: 2
{question}"""

print("Standard Prompt Result:")
print(get_completion(standard_prompt, system))

# Chain of Thought prompting (zero-shot)
cot_prompt = f"""Q: How many Es are there in ELEPHANT?
Let's think step by step:
Spell out ELEPHANT and count the Es:
E-L-E-P-H-A-N-T
There's one E at the beginning and one in the middle. 1 + 1 = 2 Es in total.
A: 2

Q: How many Ps are there in PINEAPPLE?
Let's think step by step:
Spell out PINEAPPLE and count the Ps:
P-I-N-E-A-P-P-L-E
There's one P at the beginning and two Ps together in the middle. 1 + 2 = 3 Ps in total.
A: 3

Q: How many Os are there in CHOCOLATE?
Let's think step by step:
Spell out CHOCOLATE and count the Os:
C-H-O-C-O-L-A-T-E
There's one O at the beginning and one in the middle. 1 + 1 = 2 Os in total.
A: 2

{question}. 
Let's think step by step:"""

print("\nChain of Thought Prompt Result:")
print(get_completion(cot_prompt, system))

Standard Prompt Result:
A: 2

Chain of Thought Prompt Result:
Spell out RASPBERRY and count the Rs:

R-A-S-P-B-E-R-R-Y

There is one R at the beginning, one R in the middle, and another R towards the end. 

1 + 1 + 1 = 3 Rs in total.

A: 3


In [5]:
import re

def evaluate_response(response, correct_answer):
    # Check if the final answer is correct
    final_answer = re.search(r'A:\s*(\d+)', response)

    if final_answer:
        answer = int(final_answer.group(1))
        is_correct = (answer == correct_answer)
    else:
        is_correct = False
    
    # Check if steps are provided
    lines = response.strip().split('\n')
    has_steps = len(lines) > 1
    
    return is_correct, has_steps

# Test the evaluation function
test_response = """Q: How many Os are there in CHOCOLATE?
A: Let's spell out CHOCOLATE and count the Os:
C-H-O-C-O-L-A-T-E
There's one O at the beginning and one in the middle. 1 + 1 = 2 Os in total.
A: 2"""

is_correct, has_steps = evaluate_response(test_response, 2)
print(f"Is the answer correct? {is_correct}")
print(f"Does it provide steps? {has_steps}")

Is the answer correct? True
Does it provide steps? True


In [6]:
eval_set = [
    {
        "Q": "How many As are there in HAMBURGER?",
        "steps": "Let's think step by step:\nSpell out HAMBURGER and count the As:\nH-A-M-B-U-R-G-E-R\nThere's only one A in the second position. So there is 1 A in total.",
        "A": 1
    },
    {
        "Q": "How many Ts are there in CATERPILLAR?",
        "steps": "Let's think step by step:\nSpell out CATERPILLAR and count the Ts:\nC-A-T-E-R-P-I-L-L-A-R\nThere's only one T in the third position. So there is 1 T in total.",
        "A": 1
    },
    {
        "Q": "How many Ls are there in UMBRELLA?",
        "steps": "Let's think step by step:\nSpell out UMBRELLA and count the Ls:\nU-M-B-R-E-L-L-A\nThere are two Ls together near the end of the word. So there are 2 Ls in total.",
        "A": 2
    },
    {
        "Q": "How many Os are there in OCTOPUS?",
        "steps": "Let's think step by step:\nSpell out OCTOPUS and count the Os:\nO-C-T-O-P-U-S\nThere's one O at the beginning and one in the middle. 1 + 1 = 2 Os in total.",
        "A": 2
    },
    {
        "Q": "How many Ns are there in SUNFLOWER?",
        "steps": "Let's think step by step:\nSpell out SUNFLOWER and count the Ns:\nS-U-N-F-L-O-W-E-R\nThere's only one N in the third position. So there is 1 N in total.",
        "A": 1
    },
    {
        "Q": "How many Es are there in BICYCLE?",
        "steps": "Let's think step by step:\nSpell out BICYCLE and count the Es:\nB-I-C-Y-C-L-E\nThere's only one E at the end of the word. So there is 1 E in total.",
        "A": 1
    },
    {
        "Q": "How many Rs are there in REFRIGERATOR?",
        "steps": "Let's think step by step:\nSpell out REFRIGERATOR and count the Rs:\nR-E-F-R-I-G-E-R-A-T-O-R\nThere's one R at the beginning, one in the middle, and one at the end. 1 + 1 + 1 = 3 Rs in total.",
        "A": 3
    },
    {
        "Q": "How many Ss are there in PINEAPPLES?",
        "steps": "Let's think step by step:\nSpell out PINEAPPLES and count the Ss:\nP-I-N-E-A-P-P-L-E-S\nThere's only one S at the end of the word. So there is 1 S in total.",
        "A": 1
    },
    {
        "Q": "How many Cs are there in POPSICLE?",
        "steps": "Let's think step by step:\nSpell out POPSICLE and count the Cs:\nP-O-P-S-I-C-L-E\nThere's only one C in the sixth position. So there is 1 C in total.",
        "A": 1
    },
    {
        "Q": "How many As are there in PANDA?",
        "steps": "Let's think step by step:\nSpell out PANDA and count the As:\nP-A-N-D-A\nThere's one A in the second position and one at the end. 1 + 1 = 2 As in total.",
        "A": 2
    }
]

import time

def run_evaluation(prompt_type):
    correct_count = 0
    step_count = 0
    total_time = 0

    for example in eval_set:
        question = example["Q"]
        correct_answer = example["A"]

        if prompt_type == "standard":
            prompt = f"""Q: How many Es are there in ELEPHANT?
A: 2
Q: How many Ps are there in PINEAPPLE?
A: 3
Q: How many Os are there in CHOCOLATE?
A: 2
{question}"""
        else:  # CoT prompt
            prompt = f"""Q: How many Es are there in ELEPHANT?
Let's think step by step:
Spell out ELEPHANT and count the Es:
E-L-E-P-H-A-N-T
There's one E at the beginning and one in the middle. 1 + 1 = 2 Es in total.
A: 2

Q: How many Ps are there in PINEAPPLE?
Let's think step by step:
Spell out PINEAPPLE and count the Ps:
P-I-N-E-A-P-P-L-E
There's one P at the beginning and two Ps together in the middle. 1 + 2 = 3 Ps in total.
A: 3

Q: How many Os are there in CHOCOLATE?
Let's think step by step:
Spell out CHOCOLATE and count the Os:
C-H-O-C-O-L-A-T-E
There's one O at the beginning and one in the middle. 1 + 1 = 2 Os in total.
A: 2

{question}
Let's think step by step:"""

        start_time = time.time()
        response = get_completion(prompt, system)
        end_time = time.time()

        is_correct, has_steps = evaluate_response(response, correct_answer)
        correct_count += int(is_correct)
        step_count += int(has_steps)
        total_time += (end_time - start_time)

    accuracy = correct_count / len(eval_set)
    avg_time = total_time / len(eval_set)
    step_percentage = step_count / len(eval_set) * 100

    return accuracy, avg_time, step_percentage

# Run evaluation for standard prompting
standard_accuracy, standard_avg_time, standard_step_percentage = run_evaluation("standard")

# Run evaluation for CoT prompting
cot_accuracy, cot_avg_time, cot_step_percentage = run_evaluation("cot")

# Print results
print("Standard Prompting Results:")
print(f"Accuracy: {standard_accuracy:.2%}")
print(f"Average Time: {standard_avg_time:.2f} seconds")
print(f"Percentage with Steps: {standard_step_percentage:.2f}%")

print("\nChain of Thought Prompting Results:")
print(f"Accuracy: {cot_accuracy:.2%}")
print(f"Average Time: {cot_avg_time:.2f} seconds")
print(f"Percentage with Steps: {cot_step_percentage:.2f}%")


Standard Prompting Results:
Accuracy: 90.00%
Average Time: 0.93 seconds
Percentage with Steps: 0.00%

Chain of Thought Prompting Results:
Accuracy: 100.00%
Average Time: 1.62 seconds
Percentage with Steps: 100.00%
