* Authors have used greedy decoding for CoT (temperature=0)
* Standard Prompting will be used as baseline.


<b>Research Objectives</b>
* Can we improve the performance in reasoning tasks by scaling the model size using standard prompting and Random CoT?
* Robustness of CoT: Is the performance sensitive to the choice of demonstrations?
* Compare the performance with fine-tuned GPT-3 (get the result from the paper)
* Perform error analysis on incorrect examples. Randomly select 50 correct and 50 incorrect examples. Authors have found out calculation errors or reasoning step missing error. Can we improve the performance using agents/tools or human in the loop correction?
* Random CoT: Three sets of eight exemplars randomly sampled from the dataset.

In [39]:
import json
import os
import re
from datetime import datetime
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
import random
import tiktoken
from pydantic import BaseModel
from langchain.callbacks import get_openai_callback
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
def read_jsonl(path: str):
    with open(path) as fh:
        return [json.loads(line) for line in fh.readlines() if line]

In [3]:

    
gsm8k = read_jsonl('datasets/grade_school_math/data/train.jsonl')
aqua = read_jsonl('datasets/AQuA/dev.json')
with open('datasets/Strategy_QA/strategyqa_train.json', 'r', encoding='utf-8') as file:
    strategyqa = json.load(file)

In [4]:
cot_context = """
[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Pennsylvania?
Output:
Sub question #0 : What university did Brooke Shields went to?
Sub answer #0 : Brooke Shields went to Princeton University.
Sub question #1 : Did Brooke Shields succeed at Princeton University?
Sub answer #1 : At Princeton University, she got all As and Bs while pursing her bachelor's degree in French literature, meaning she had a successful school life.
Sub question #2 : How rigorous is Princeton University compared to University of Pennsylvania?
Sub answer #2 : Princeton University is about as academically rigorous as the University of Pennsylvania because they have a similar ranking according to U.S. News Rankings.
Sub question #3 : Could Brooke Shields succeed at University of Pennsylvania?
Sub answer #3 : Since University of Pennsylvania and University of Princeton are in similar circumstances, Brooke Shields has been successful in University of Princeton, Brooke Shields could also succeed at the University of Pennsylvania.
Final Answer: YES

[Example 3]
Question: Hydrogen\u2019s atomic number squared exceeds number of Spice Girls?
Output:
Sub question #0 : What is the atomic number of Hydrogen?
Sub answer #0 : Hydrogen has an atomic number of 1.
Sub question #1 : What is 1 squared?
Sub answer #1 : 1 squared is 1.
Sub question #2 : How much Spice Girls are there?
Sub answer #2 : There are 5 Spice Girls.
Sub question #3 : Hydrogen\u2019s atomic number squared exceeds number of Spice Girls?
Sub answer #3 : Since Hydrogen's atomic number squared is 1, the number of Spice Girls are 5, and 1 is smaller than 5, Hydrogen\u2019s atomic number squared is less than the number of Spice Girls.
Final Answer: NO

[Example 4]
Question: Is it common to see frost during some college commencements?
Output:
Sub question #0 : When does College commencement ceremonies usually happen?
Sub answer #0 : College commencement ceremonies can happen in December, May, and June.
Sub question #1 : Does it usually frost in December?
Sub answer #1 : December is in the winter, so there can be frost.
Sub question #2 : Is it common to see frost during some college commencements?
Sub answer #2 : Since there can be frost in December and a college commencement are held in December, there could be frost at some commencements.
Final Answer: YES

[Example 5]
Question: Could a llama birth twice during War in Vietnam (1945-46)?
Output:
Sub question #0 : How long was the Vietnam war?
Sub answer #0 : The War in Vietnam was 6 months.
Sub question #1 : How long is the gestation period?
Sub answer #1 : The gestation period for a llama is 11 months.
Sub question #2 : How long does it take for a llama to birth twice?
Sub answer #2 : Since the gestation period for a llama is 11 months, and 11 times 2 is 22, it will take 22 months.
Sub question #3 : Could a llama birth twice during War in Vietnam (1945-46)?
Sub answer #3 : Since it takes 22 months for a llama to birth twice, War in Vietnam was 6 months, and 22 is bigger than 6, llama could not give birth twice during the War in Vietnam.
Final Answer: NO

[Example 6]
Question: Would a pear sink in water?
Output:
Sub question #0 : What is the density of a pear?
Sub answer #0 : The density of a pear is about 0.6g/cm3.
Sub question #1 : What is the density of water?
Sub answer #1 : The density of water is 1g/cm3.
Sub question #2 : Is the density of pear smaller than water?
Sub answer #2 : Since 0.6 is smaller than 1, the density of pear is smaller than water.
Sub question #3 : If the density of an object is less than water, what happens?
Sub answer #3 : Objects less dense than water float.
Sub question #4 : Would a pear sink in water?
Sub answer #4 : Since a pear has a smaller density than water, a pear would float.
Final Answer: NO

"""

print(cot_context)

# Reference : https://github.com/SeungoneKim/CoTEVer/blob/main/Middleware/CoTEVer_AI/prompts/demo.txt


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Pennsylvania?
Output:
Sub question #0 : What university did Brooke Shields went to?
Sub answer #0 : Brooke Shields went to Princeton University.
Sub question #1 : Did Brooke Shields succeed at Princeton University?
Sub answer #1 : At Princeton University, she got all As and Bs while pursing her bachelor's degree in French literature, meaning she had a successful school life.
Sub question #2 : How rigorous is Princeton University compared to University

In [5]:
with open("Prompts/strategyqa_prompt_cot.txt", "w") as f:
    f.write(cot_context)

In [111]:
# GSM8K Functions

ANS_RE = re.compile(r"#### (\-?[0-9\.\,]+)")
def extract_true_answer_gsm8k(completion):
    match = ANS_RE.search(completion)
    if match:
        match_str = match.group(1).strip()
        match_str = match_str.replace(",", "")
        return float(match_str)
    else:
        return None
    
def extract_ai_answer_gsm8k(completion):
    try:
        extracted_answer = float(completion.lower().split('answer:')[-1].strip())
        return extracted_answer
    except Exception as e:
        print(f'Could not extract the answer for this completion : {completion}')
        return None    
    
# AQUA Functions  
pattern = r'[a-zA-Z]\)'
def extract_ai_answer_aqua(completion):
    try:
        preprocess_res = completion.lower().split('answer:')[-1].strip()
        
        if ')' in preprocess_res:
            matches = re.findall(pattern, preprocess_res)
            return matches[0].split(')')[0].lower().strip()
                    
        if '.' in preprocess_res:
            preprocess_res = preprocess_res.replace('.', "").strip()

        return preprocess_res
    except Exception as e:
        print(e)
        return None
    
def extract_true_answer_aqua(answer):
    return answer.lower().strip()
    
    
# StrategyQA Functions
def extract_ai_answer_strategyqa(completion):
    return completion.split('Answer: ')[-1].strip()


def extract_true_answer_strategyqa(completion):
    return 'YES' if completion else 'NO'


# Other Functions
def return_self(x):
    return x

In [112]:
def build_context_aqua(selected_examples, strategy):
    context = ''
    for prompt_example in selected_examples:
        context += f"Question: {prompt_example['question']}\n"
        context += f"Options: {prompt_example['options']}\n"

        if strategy == 'cot':
            context += f"Rationale: {prompt_example['rationale']}\n"
            
        context += f"Answer: {prompt_example['correct']}\n\n"

    return context
                
        
def build_context_gsm8k(selected_examples ,strategy):
       
    context = ''
    for prompt_example in selected_examples:
        context += f'Question: {prompt_example["question"]}\n'
        
        if strategy == 'standard':
            context += f'Answer: {extract_true_answer_gsm8k(prompt_example["answer"])}\n\n'
        elif strategy == 'cot':
            context += f'Rationale: {prompt_example["answer"]}\n'
            context += f'Answer: {extract_true_answer_gsm8k(prompt_example["answer"])}\n\n'
        else:
            print(f'Strategy must be "standard" OR "cot"')
            return None

    return context

# for StrategyQA 
def build_context_strategyqa_standard(selected_examples):
    context = ''

    for prompt_example in selected_examples:
        context += f'Question: {prompt_example["question"]}\n'
        context += f"Answer: {'YES' if prompt_example['answer'] else 'NO'}\n\n"

    return context
    


In [113]:
def generate_fewshot_random_demonstration(seeds, dataset_infos, nr_examples, strategy):
    dataset_name = dataset_infos['dataset_name']
    dataset = dataset_infos['data']
    list_contexts = []
    
    # For StandardQA and CoT we use the same promptinng demonstrations for all exampels in the dataset
    if strategy == 'cot' and dataset_name == 'strategyqa':
        list_contexts.append(cot_context)
        return list_contexts
    
    else:
        list_contexts = []

        for seed in seeds:    
            random.seed(seed)
            selected_examples = random.sample(dataset, nr_examples)
            if dataset_name == 'gsm8k':
                context = build_context_gsm8k(selected_examples, strategy)
                list_contexts.append(context)
            elif dataset_name == 'aqua':
                context = build_context_aqua(selected_examples, strategy)
                list_contexts.append(context)
            elif dataset_name == 'strategyqa':
                context = build_context_strategyqa_standard(selected_examples)
                list_contexts.append(context)
            
        return list_contexts

In [114]:
def predict_llm(template, question, model_name):
    prompt = PromptTemplate(input_variables=["question"], template=template)
    llm = OpenAI(model_name=model_name, temperature=0.2)
    llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)
    with get_openai_callback() as cb:
        result = llm_chain.run(question)
        
    return result, cb.total_tokens

In [118]:
def run(seeds, dataset_infos, strategy, model_name, nr_examples):
    extract_answers_dic = {'gsm8k' : 
                               {'extract_true_answer_func' : extract_true_answer_gsm8k,
                                'extract_ai_answer_func' : extract_ai_answer_gsm8k},
                           'aqua' : 
                                {'extract_true_answer_func' : extract_true_answer_aqua,
                                 'extract_ai_answer_func' : extract_ai_answer_aqua},
                           'strategyqa' : 
                                {'extract_true_answer_func' : extract_true_answer_strategyqa,
                                 'extract_ai_answer_func' : extract_ai_answer_strategyqa}
                            }
    
    prefix_dic = {'gsm8k' : 
                          {'standard' : """You are willing to solve arithmetic math problems. The answer should not contain any special character. Follow the examples below and generate the answer using the format of these examples:""", 
                           'cot' : """You are willing to solve arithmetic math problems. Decompose the problem into intermediate steps and solve each step by generating the rationale. Explain the reasoning steps. Use the following format to answer the question: First generate intermediate reasoning steps, then generate the final answer as a single number. Here are some examples you can follow:\n\n"""
                          },
                  'aqua' : 
                          {'standard' : """"You are willing to solve algebraic word problems with multiple choice questions. Choose only one of the given options as the final answer. Follow the examples below and generate the answer using the format of these examples:\n\n""" ,
                           'cot' : """You are willing to solve algebraic word problems with multiple choice questions. First decompose the problem into intermediate reasoning steps, then solve and explain each intermediate step by generating the rationale. Then choose the final answer to be only one of the given options. The output should include the rationale and the answer. Follow the examples below to output the solution:\n\n"""
                          },
                  'strategyqa' : 
                          {'standard' : """You are willing to answer questions that require reasoning. The final answer must be either YES or NO. Follow the examples below and generate the answer using the format of these examples:""",
                           'cot' : """You are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:\n\n"""
                          }
                  }
    
    suffix_gsm8k_standard = "Question: {question}\nAnswer: "
    rationale_answer_gsm8k = "\nRationale: \nAnswer: "
    suffix_gsm8k_cot = """\n\nQuestion: {question}""" + rationale_answer_gsm8k
    
    options_answer_aqua = "\nOptions: {}\nAnswer: "
    suffix_aqua_standard = """\n\nQuestion : {question}"""
    options_rationale_answer_aqua = "\nOptions: {}\nRationale: \nAnswer: "
    suffix_aqua_cot = """\n\nQuestion: {question}"""
        
    suffix_strategyqa_cot = "Question: {question}\nOutput: "
    suffix_strategyqa_standard = "Question: {question}\nAnswer: "
    
    suffix_dic = {'gsm8k' : {'standard': suffix_gsm8k_standard,
                             'cot' : suffix_gsm8k_cot
                            },
                  'strategyqa' : {'standard' : suffix_strategyqa_standard,
                                  'cot' : suffix_strategyqa_cot
                                 },
                  'aqua' : {'standard' : {'subset' : options_answer_aqua,
                                          'suffix' : suffix_aqua_standard},
                            'cot' : {'subset' : options_rationale_answer_aqua,
                                     'suffix' : suffix_aqua_cot}}
                 }
    
    dataset_name = dataset_infos['dataset_name']
    dataset = dataset_infos['data']
    
    suffix_data = suffix_dic[dataset_name][strategy]
    if dataset_name == 'aqua':
        suffix_subset = suffix_data['subset']
        suffix_question = suffix_data['suffix']
    else:
        suffix = suffix_data
        
    
    prefix = prefix_dic[dataset_name][strategy]

    extract_true_answer_func = extract_answers_dic[dataset_name]['extract_true_answer_func']
    extract_ai_answer_func = extract_answers_dic[dataset_name]['extract_ai_answer_func']

    list_contexts = generate_fewshot_random_demonstration(seeds, dataset_infos, nr_examples, strategy)
    dataframes_list = []
    
    for i, context in enumerate(list_contexts):
        df = pd.DataFrame()

        for example in dataset[:3]:
            # build the suffix
            if dataset_name == 'aqua':                
                formatted_suffix_subset = suffix_subset.format(example['options'])
                suffix = suffix_question + formatted_suffix_subset
            
            # build the template using prefix, context and suffix
            template = prefix + context + suffix
            model_completion, token_count = predict_llm(template, example['question'], model_name)
            
            # extract the final answer from the model completion
            extracted_ai_answer = extract_ai_answer_func(model_completion)
            # extract the true answer
            extracted_true_answer = extract_true_answer_func(example['answer'])
                            
            row_dic = {'question' : [example['question']],
                       'true_answer' : [extracted_true_answer],
                       'ai_answer' : [extracted_ai_answer],
                       'ai_completion' : [model_completion],
                       'token_count' : [token_count]
                       }

            df = pd.concat([df, pd.DataFrame(row_dic)], ignore_index=False)
                
        dataframes_list.append(df)
        
    
    
    return dataframes_list

In [116]:
# TEST GSM8K Standard
res = generate_fewshot_random_demonstration([10, 20], {'dataset_name' : 'gsm8k',
                                       'data' : gsm8k}, 3, 'standard')

In [117]:
for el in res:
    print(el)
    print('----------------------------------------------------')

Question: Adam bought 3 kilograms of nuts and 2.5 kilograms of dried fruits at a store. One kilogram of nuts costs $12 and one kilogram of dried fruit costs $8. How much did his purchases cost?
Answer: 56.0

Question: Johns goes to the gym 3 times a week.  He spends 1 hour each day lifting weight. Additionally, he also spends a third of his weightlifting time warming up and doing cardio each day.  How many hours does he spend at the gym a week?
Answer: 4.0

Question: James has to refuel his plane.  It used to cost $200 to refill the tank.  He got an extra tank to double fuel capacity.  Fuel prices also went up by 20%.  How much does he pay now for fuel?
Answer: 480.0


----------------------------------------------------
Question: Tapanga and Corey have 66 candies together. However, Tapanga has 8 more candies than Corey. How many candies does Corey have?
Answer: 29.0

Question: Freddy is calling his family on New Year's Eve. He calls his dad, who lives in the same city as him, and they

In [23]:
# TEST GSM8K COT
res = generate_fewshot_random_demonstration([10, 20], {'dataset_name' : 'gsm8k',
                                       'data' : gsm8k}, 3, 'cot')

In [24]:
for el in res:
    print(el)
    print('-------------------------------------------------------------------------')

Question: Adam bought 3 kilograms of nuts and 2.5 kilograms of dried fruits at a store. One kilogram of nuts costs $12 and one kilogram of dried fruit costs $8. How much did his purchases cost?
Rationale: For the nuts Adam paid 3 * $12 = $<<3*12=36>>36.
And for dried fruits Adam paid 2.5 * $8 = $<<2.5*8=20>>20.
So in total for his purchases Adam paid $36 + $20 = $<<36+20=56>>56.
#### 56
Answer: 56.0

Question: Johns goes to the gym 3 times a week.  He spends 1 hour each day lifting weight. Additionally, he also spends a third of his weightlifting time warming up and doing cardio each day.  How many hours does he spend at the gym a week?
Rationale: He spends 60/3=<<60/3=20>>20 minutes warming up
So he spends 60+20=<<60+20=80>>80 minutes at the gym per day
That means he spends 80*3=<<80*3=240>>240 minutes at the gym
So he spends 240/60=<<240/60=4>>4 hours at the gym a week
#### 4
Answer: 4.0

Question: James has to refuel his plane.  It used to cost $200 to refill the tank.  He got an ex

In [25]:
# TEST AQUA Standard
res = generate_fewshot_random_demonstration([10, 20], {'dataset_name' : 'aqua',
                                       'data' : aqua}, 3, 'standard')

In [26]:
for el in res:
    print(el)
    print('----------------------------------------------------')

Question: Three quarts of a bleaching chemical, Minum, contains 5 percent hydrogen peroxide and water. A different type of bleaching chemical, Maxim, which contains 20 percent hydrogen peroxide, will be mixed with the three quarts of Minum. How much of type Maxim should be added to the three quarts of Minum so that the resulting mixture contains 10 percent hydrogen peroxide?
Options: ['A)2 quarts', 'B)3.75 quarts', 'C)4.5 quarts', 'D)6 quarts', 'E)9 quarts']
Answer: A

Question: ABC company pays an average of $120 per vehicle each month in outdoor parking fees for three of its eight vehicles. The company pays garage parking fees for the remaining five vehicles. If ABC pays an average of $240 per vehicle overall each month for parking, how much does ABC pay per month in garage parking fees for its vehicles?
Options: ['A)300', 'B)420', 'C)912', 'D)1340', 'E)1500']
Answer: D

Question: For a candidate to clear an examination, he/she must score 55% marks. If he/she gets 120 and fails by 78

In [27]:
# TEST AQUA COT
res = generate_fewshot_random_demonstration([10, 20], {'dataset_name' : 'aqua',
                                       'data' : aqua}, 3, 'cot')

In [28]:
for el in res:
    print(el)
    print('-------------------------------------------------------------------------')

Question: Three quarts of a bleaching chemical, Minum, contains 5 percent hydrogen peroxide and water. A different type of bleaching chemical, Maxim, which contains 20 percent hydrogen peroxide, will be mixed with the three quarts of Minum. How much of type Maxim should be added to the three quarts of Minum so that the resulting mixture contains 10 percent hydrogen peroxide?
Options: ['A)2 quarts', 'B)3.75 quarts', 'C)4.5 quarts', 'D)6 quarts', 'E)9 quarts']
Rationale: 5% HydPerWater (HPW) of 3 quart of Minum= .10 q
20% of HPW of x q of Maxim = .2x
Total = .10 + .2x = .10 (3+x)
Solving , x=2 quart
A
Answer: A

Question: ABC company pays an average of $120 per vehicle each month in outdoor parking fees for three of its eight vehicles. The company pays garage parking fees for the remaining five vehicles. If ABC pays an average of $240 per vehicle overall each month for parking, how much does ABC pay per month in garage parking fees for its vehicles?
Options: ['A)300', 'B)420', 'C)912', '

In [29]:
# TEST STRATEGYQA STANDARD
res = generate_fewshot_random_demonstration([10, 20], {'dataset_name' : 'strategyqa',
                                       'data' : strategyqa}, 3, 'standard')

In [30]:
for el in res:
    print(el)
    print('-------------------------------------------------------------------------')

Question: Could Elizabeth I of England have seen the play Dido, Queen of Carthage ?
Answer: YES

Question: Was the Treaty of Versailles settled over blueberry scones?
Answer: NO

Question: Can a lemon aggravate dyspepsia?
Answer: YES


-------------------------------------------------------------------------
Question: Are pancakes a bad snack for cats?
Answer: YES

Question: Does the judo rank system reach the triple digits?
Answer: NO

Question: Did the Berlin Wall prevent any athletes from competing in the 1936 Summer Olympics?
Answer: NO


-------------------------------------------------------------------------


In [31]:
# TEST STRATEGYQA COT
res = generate_fewshot_random_demonstration([10, 20], {'dataset_name' : 'strategyqa',
                                       'data' : strategyqa}, 3, 'cot')

In [32]:
for el in res:
    print(el)
    print('-------------------------------------------------------------------------')


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Pennsylvania?
Output:
Sub question #0 : What university did Brooke Shields went to?
Sub answer #0 : Brooke Shields went to Princeton University.
Sub question #1 : Did Brooke Shields succeed at Princeton University?
Sub answer #1 : At Princeton University, she got all As and Bs while pursing her bachelor's degree in French literature, meaning she had a successful school life.
Sub question #2 : How rigorous is Princeton University compared to University

In [102]:
# TEST GSM8K STANDARD
dataset_infos = {'dataset_name' : 'gsm8k', 'data' : gsm8k}
dfs_gsm8k = run([10, 20], dataset_infos, 'standard', 'gpt-3.5-turbo', 2)






[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to solve arithmetic math problems. The answer should not contain any special character. Follow the examples below and generate the answer using the format of these examples:Question: Adam bought 3 kilograms of nuts and 2.5 kilograms of dried fruits at a store. One kilogram of nuts costs $12 and one kilogram of dried fruit costs $8. How much did his purchases cost?
Answer: 56.0

Question: Johns goes to the gym 3 times a week.  He spends 1 hour each day lifting weight. Additionally, he also spends a third of his weightlifting time warming up and doing cardio each day.  How many hours does he spend at the gym a week?
Answer: 4.0

Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: [0m

[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after format

In [123]:
for el in dfs_gsm8k[1]['ai_completion']:
    if '$' in 
    try:
        final_answer = float(el)
        return final_answer
    except Exception as e:
        

72.0


ValueError: could not convert string to float: '$10.0'

In [124]:
# TEST GSM8K COT
dataset_infos = {'dataset_name' : 'gsm8k', 'data' : gsm8k}
dfs_gsm8k_cot = run([10, 20], dataset_infos, 'cot', 'gpt-3.5-turbo', 2)





[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to solve arithmetic math problems. Decompose the problem into intermediate steps and solve each step by generating the rationale. Explain the reasoning steps. Use the following format to answer the question: First generate intermediate reasoning steps, then generate the final answer as a single number. Here are some examples you can follow:

Question: Adam bought 3 kilograms of nuts and 2.5 kilograms of dried fruits at a store. One kilogram of nuts costs $12 and one kilogram of dried fruit costs $8. How much did his purchases cost?
Rationale: For the nuts Adam paid 3 * $12 = $<<3*12=36>>36.
And for dried fruits Adam paid 2.5 * $8 = $<<2.5*8=20>>20.
So in total for his purchases Adam paid $36 + $20 = $<<36+20=56>>56.
#### 56
Answer: 56.0

Question: Johns goes to the gym 3 times a week.  He spends 1 hour each day lifting weight. Additionally, he also spends a third of his weightlifting time w


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to solve arithmetic math problems. Decompose the problem into intermediate steps and solve each step by generating the rationale. Explain the reasoning steps. Use the following format to answer the question: First generate intermediate reasoning steps, then generate the final answer as a single number. Here are some examples you can follow:

Question: Tapanga and Corey have 66 candies together. However, Tapanga has 8 more candies than Corey. How many candies does Corey have?
Rationale: Let x = the total number of candies Corey has.
x + 8 = the total number of candies Tapanga has.
The equation for the total number of candies is x + (x + 8) = 66
Combining like terms, we get 2x + 8 = 66
Subtracting 8 from both sides, we get 2x = 58
Dividing both sides by 2, we get x = <<29=29>>29, so Corey has 29 candies.
#### 29
Answer: 29.0

Question: Freddy is calling his family o

In [186]:
float(re.findall(r'\d+\.?\d*', '20 10')[-1])

10.0

In [179]:
float(re.findall(r'\d+.?\d*', '20 10')[-1])

ValueError: could not convert string to float: '20 10'

In [166]:
gsm8k[0]['answer']

'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'

In [174]:
for i in range(100):
    print(float(re.findall(r'\d+.?\d*', gsm8k[i]['answer'])[-1]))


72.0
10.0
5.0
42.0
624.0
35.0
48.0
16.0
41.0
990.0
121.0
5.0
85.0
35.0
5.0
448000.0
800.0
43.0
16.0
16.0
38.0
1080.0
7.0
5.0
62.0
110.0
400.0
400.0
8.0
1000.0
6.0
1200.0
10.0
34.0
5250.0
36.0
15.0
5.0
9.0
15.0
476.0
500.0
99.0
60.0
300.0
99.0
1920.0
15.0
10.0
48.0
5.0
160.0
5.0
36.0
11.0
75.0
45.0
2.0
320.0
120.0
96.0
200.0
15.0
59.0
840.0
558.0
520.0
6.0
90.0
49.0
19.0
25.0
54.0
3.0
28.0
15.0
768.0
85.0
4.0
70.0
100.0
14.0
700.0
54.0
90.0
5.0
6.0
600.0
258.0
216.0
90.0
10.0
1825.0
14000.0
60.0
64.0
126.0
46.0
45.0
3.0


In [159]:
for el in dfs_gsm8k_cot[0]['ai_completion']:
    print(re.findall(r'\d+.?\d*', el)[-1])
    print(el)
    print('-----------------')

72.0
In May Natalia sold 48/2 = <<48/2=24>>24 clips.
So in total, she sold 48 + 24 = <<48+24=72>>72 clips in April and May.
#### 72
Answer: 72.0
-----------------
10.0
Weng earns $12/60 minutes = $0.2 per minute of babysitting.
So for 50 minutes of babysitting, she earned $0.2 * 50 = $<<0.2*50=10>>10.
#### 10
Answer: 10.0
-----------------
5.0
Betty needs $100 / 2 = $<<100/2=50>>50 more.
Her grandparents gave her 2 * $15 = $<<2*15=30>>30.
So in total, she has $50 + $15 + $30 = $<<50+15+30=95>>95.
Therefore, Betty still needs $100 - $95 = $<<100-95=5>>5 more. 
#### 5
Answer: 5.0
-----------------


In [45]:
for el1, el2 in zip(dfs_gsm8k_cot[0]['ai_answer'], dfs_gsm8k_cot[0]['ai_completion']):
    print(f'{el1}')
    print(el2)
    print('-------------------------------------------------------')

72.0
In April, Natalia sold 48 clips.
In May, she sold half as many clips as in April, which is 48/2 = <<48/2=24>>24 clips.
Altogether, she sold 48 + 24 = <<48+24=72>>72 clips.
#### 72
Answer: 72.0
-------------------------------------------------------
10.0
Weng earns $12/60 = $0.2 per minute of babysitting.
She did 50 minutes of babysitting, so she earned $0.2 x 50 = $<<0.2*50=10>>10.
#### 10
Answer: 10.0
-------------------------------------------------------
5.0
Betty needs $100/2 = $<<100/2=50>>50 more to buy the wallet.
Her grandparents gave her $15 x 2 = $<<15*2=30>>30.
In total, she has $50 + $15 + $30 = $<<50+15+30=95>>95.
Therefore, Betty still needs $100 - $95 = $<<100-95=5>>5 more to buy the wallet.
#### 5
Answer: 5.0
-------------------------------------------------------
18.0
Yesterday, Julie read 12 pages.
Today, she read 2 x 12 = <<2*12=24>>24 pages.
So, she has already read a total of 12 + 24 = <<12+24=36>>36 pages.
She has 120 - 36 = <<120-36=84>>84 pages left to read

In [192]:
# TEST AQUA STANDARD
for example in aqua:
    example['answer'] = example['correct']

In [193]:
dataset_infos = {'dataset_name' : 'aqua', 'data' : aqua}
dfs_aqua = run([10, 20], dataset_infos, 'standard', 'gpt-3.5-turbo', 2)





[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3m"You are willing to solve algebraic word problems with multiple choice questions. Choose only one of the given options as the final answer. Follow the examples below and generate the answer using the format of these examples:

Question: Three quarts of a bleaching chemical, Minum, contains 5 percent hydrogen peroxide and water. A different type of bleaching chemical, Maxim, which contains 20 percent hydrogen peroxide, will be mixed with the three quarts of Minum. How much of type Maxim should be added to the three quarts of Minum so that the resulting mixture contains 10 percent hydrogen peroxide?
Options: ['A)2 quarts', 'B)3.75 quarts', 'C)4.5 quarts', 'D)6 quarts', 'E)9 quarts']
Answer: A

Question: ABC company pays an average of $120 per vehicle each month in outdoor parking fees for three of its eight vehicles. The company pays garage parking fees for the remaining five vehicles. If ABC pays an average

In [203]:
dfs_aqua[0]['ai_completion'].iloc[0].lower().split('answer: ')[-1]

'c'

In [62]:
dfs_aqua[0]['true_answer'] == dfs_aqua[0]['ai_answer']

0    False
0     True
0    False
0    False
0    False
0     True
0    False
0    False
0    False
0     True
dtype: bool

In [65]:
dfs_aqua[1]['true_answer'] == dfs_aqua[1]['ai_answer']

0    False
0    False
0     True
0     True
0    False
0     True
0     True
0     True
0     True
0    False
dtype: bool

In [64]:
dfs_aqua[1]

Unnamed: 0,question,true_answer,ai_answer,ai_completion
0,Three birds are flying at a fast rate of 900 k...,a,c,C
0,A ship is leaving a port. It takes 240 seconds...,d,c,C
0,A rectangular piece of cloth 2 feet wide was c...,c,c,C
0,"In the xy-coordinate plane, which of the follo...",b,b,B
0,A travel company wants to charter a plane to t...,c,b,B
0,"Kirk sells cars. On two sales, Kirk has receiv...",b,b,B
0,A group of 5 friends were to contribute equall...,e,e,E
0,"Let A, B and C denote the vertices of a triang...",b,b,B
0,ABC company pays an average of $120 per vehicl...,d,d,D
0,Solution A has 5% salt concentration and remai...,d,b,B


In [68]:
# TEST AQUA COT
    
dataset_infos = {'dataset_name' : 'aqua', 'data' : aqua}
acc_aqua_cot, dfs_aqua_cot = run([10, 20], dataset_infos, 'cot', 'gpt-3.5-turbo', 2)

0


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to solve algebraic word problems with multiple choice questions. First decompose the problem into intermediate reasoning steps, then solve and explain each intermediate step by generating the rationale. Then choose the final answer to be only one of the given options. The output should include the rationale and the answer. Follow the examples below to output the solution:

Question: The binary representation of 0.6875 is ?
Options: ['A).1010', 'B).1011', 'C).1001', 'D).1000', 'E).1111']
Rationale: 0.6875*2=1.375--1
0.375*2=0.75--0
0.75*2=1.50--1
0.50*2=1.0--1
0.6875=.1011
ANSWER:B
Answer: B

Question: On sports day, if 24 children were made to stand in a column, then 10 columns could be formed. If 240 children were made to stand in a column, then how many columns could be formed?
Options: ['A)20', 'B)40', 'C)60', 'D)80', 'E)100']
Rationale: Each each child forms 10/24 of a column. Then, i

Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIError: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} {'Date': 'Thu, 27 Apr 2023 08:57:39 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '7be5dbe44e3c0c67-SOF', 'alt-svc': 'h3=":443"; ma=86400, h3-29=":443"; ma=86400'}.



[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to solve algebraic word problems with multiple choice questions. First decompose the problem into intermediate reasoning steps, then solve and explain each intermediate step by generating the rationale. Then choose the final answer to be only one of the given options. The output should include the rationale and the answer. Follow the examples below to output the solution:

Question: The binary representation of 0.6875 is ?
Options: ['A).1010', 'B).1011', 'C).1001', 'D).1000', 'E).1111']
Rationale: 0.6875*2=1.375--1
0.375*2=0.75--0
0.75*2=1.50--1
0.50*2=1.0--1
0.6875=.1011
ANSWER:B
Answer: B

Question: On sports day, if 24 children were made to stand in a column, then 10 columns could be formed. If 240 children were made to stand in a column, then how many columns could be formed?
Options: ['A)20', 'B)40', 'C)60', 'D)80', 'E)100']
Rationale: Each each child forms 1


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to solve algebraic word problems with multiple choice questions. First decompose the problem into intermediate reasoning steps, then solve and explain each intermediate step by generating the rationale. Then choose the final answer to be only one of the given options. The output should include the rationale and the answer. Follow the examples below to output the solution:

Question: The binary representation of 0.6875 is ?
Options: ['A).1010', 'B).1011', 'C).1001', 'D).1000', 'E).1111']
Rationale: 0.6875*2=1.375--1
0.375*2=0.75--0
0.75*2=1.50--1
0.50*2=1.0--1
0.6875=.1011
ANSWER:B
Answer: B

Question: On sports day, if 24 children were made to stand in a column, then 10 columns could be formed. If 240 children were made to stand in a column, then how many columns could be formed?
Options: ['A)20', 'B)40', 'C)60', 'D)80', 'E)100']
Rationale: Each each child forms 1


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to solve algebraic word problems with multiple choice questions. First decompose the problem into intermediate reasoning steps, then solve and explain each intermediate step by generating the rationale. Then choose the final answer to be only one of the given options. The output should include the rationale and the answer. Follow the examples below to output the solution:

Question: Three quarts of a bleaching chemical, Minum, contains 5 percent hydrogen peroxide and water. A different type of bleaching chemical, Maxim, which contains 20 percent hydrogen peroxide, will be mixed with the three quarts of Minum. How much of type Maxim should be added to the three quarts of Minum so that the resulting mixture contains 10 percent hydrogen peroxide?
Options: ['A)2 quarts', 'B)3.75 quarts', 'C)4.5 quarts', 'D)6 quarts', 'E)9 quarts']
Rationale: 5% HydPerWater (HPW) of 3 


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to solve algebraic word problems with multiple choice questions. First decompose the problem into intermediate reasoning steps, then solve and explain each intermediate step by generating the rationale. Then choose the final answer to be only one of the given options. The output should include the rationale and the answer. Follow the examples below to output the solution:

Question: Three quarts of a bleaching chemical, Minum, contains 5 percent hydrogen peroxide and water. A different type of bleaching chemical, Maxim, which contains 20 percent hydrogen peroxide, will be mixed with the three quarts of Minum. How much of type Maxim should be added to the three quarts of Minum so that the resulting mixture contains 10 percent hydrogen peroxide?
Options: ['A)2 quarts', 'B)3.75 quarts', 'C)4.5 quarts', 'D)6 quarts', 'E)9 quarts']
Rationale: 5% HydPerWater (HPW) of 3 

In [69]:
dfs_aqua_cot[0]

Unnamed: 0,question,true_answer,ai_answer,ai_completion
0,Three birds are flying at a fast rate of 900 k...,a,e,"First, we need to convert kilometers per hour ..."
0,A ship is leaving a port. It takes 240 seconds...,d,c,We need to find the length of the ship. Let's ...
0,A rectangular piece of cloth 2 feet wide was c...,c,a,We can start by using algebra to represent the...
0,"In the xy-coordinate plane, which of the follo...",b,b,To find which points must lie on the line kx +...
0,A travel company wants to charter a plane to t...,c,c,"To make a profit, the total revenue from ticke..."
0,"Kirk sells cars. On two sales, Kirk has receiv...",b,b,"Let the third commission be x. Then, we can se..."
0,A group of 5 friends were to contribute equall...,e,e,The total bill after the discount is 1200 - (1...
0,"Let A, B and C denote the vertices of a triang...",b,b,We can use the formula for the area of a trian...
0,ABC company pays an average of $120 per vehicl...,d,c,Let x be the amount paid per vehicle for garag...
0,Solution A has 5% salt concentration and remai...,d,d,"D) 22.5 litres\nTo solve this problem, we need..."


In [70]:
dfs_aqua_cot[0]['ai_answer'] == dfs_aqua_cot[0]['true_answer']

0    False
0    False
0    False
0     True
0     True
0     True
0     True
0     True
0    False
0     True
dtype: bool

In [82]:
# TEST STRATEGYQA STANDARD
dataset_infos = {'dataset_name' : 'strategyqa', 'data' : strategyqa}
acc_strategyqa, dfs_strategyqa = run([10, 20], dataset_infos, 'standard', 'gpt-3.5-turbo', 2)

0


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer questions that require reasoning. The final answer must be either YES or NO. Follow the examples below and generate the answer using the format of these examples:Question: Could you make the kitchen 'holy trinity' without celery?
Answer: NO

Question: Has a neanderthal ever served on the Supreme Court of the United States?
Answer: NO

Question: Are more people today related to Genghis Khan than Julius Caesar?
Answer: [0m





[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer questions that require reasoning. The final answer must be either YES or NO. Follow the examples below and generate the answer using the format of these examples:Question: Could you make the kitchen 'holy trinity' without celery?
Answer: NO

Question: Has a neanderthal ever served on the Supreme Court of the United States?
Answer: NO

Question: Could the members of The Police perform lawful arrests?
Answer: [0m

[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer questions that require reasoning. The final answer must be either YES or NO. Follow the examples below and generate the answer using the format of these examples:Question: Could you make the kitchen 'holy trinity' without celery?
Answer: NO

Question: Has a neanderthal ever served on the Supreme Court of the United


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer questions that require reasoning. The final answer must be either YES or NO. Follow the examples below and generate the answer using the format of these examples:Question: Could Elizabeth I of England have seen the play Dido, Queen of Carthage ?
Answer: YES

Question: Was the Treaty of Versailles settled over blueberry scones?
Answer: NO

Question: Do the anchors on Rede Globo speak Chinese?
Answer: [0m

[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer questions that require reasoning. The final answer must be either YES or NO. Follow the examples below and generate the answer using the format of these examples:Question: Could Elizabeth I of England have seen the play Dido, Queen of Carthage ?
Answer: YES

Question: Was the Treaty of Versailles settled over blueberry sco

In [84]:
dfs_strategyqa[0]['true_answer'] == dfs_strategyqa[0]['ai_answer']

0     True
0    False
0     True
0     True
0    False
0     True
0     True
0     True
0    False
0     True
dtype: bool

In [86]:
dfs_strategyqa[1]

Unnamed: 0,question,true_answer,ai_answer,ai_completion
0,Are more people today related to Genghis Khan ...,YES,YES,YES
0,Could the members of The Police perform lawful...,NO,NO,NO
0,Would a Monoamine Oxidase candy bar cheer up a...,NO,NO,NO
0,Would a dog respond to bell before Grey seal?,YES,YES,YES
0,Is a pound sterling valuable?,NO,YES,YES
0,Is shrimp scampi definitely free of plastic?,NO,NO,NO
0,Do the anchors on Rede Globo speak Chinese?,NO,NO,NO
0,Will the Albany in Georgia reach a hundred tho...,NO,NO,NO
0,Is a Boeing 737 cost covered by Wonder Woman (...,YES,NO,NO
0,Is the language used in Saint Vincent and the ...,YES,YES,YES


In [87]:
# TEST STRATEGYQA COT
dataset_infos = {'dataset_name' : 'strategyqa', 'data' : strategyqa}
acc_strategyqa_cot, dfs_strategyqa_cot = run([10, 20], dataset_infos, 'cot', 'gpt-3.5-turbo', 2)

0


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Pennsylvania?
Output:
Sub que




[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Penn


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Penn


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Penn


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Penn


[1m> Finished chain.[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Penn

In [88]:
dfs_strategyqa_cot[0]

Unnamed: 0,question,true_answer,ai_answer,ai_completion
0,Are more people today related to Genghis Khan ...,YES,YES,Sub question #0 : Who is Genghis Khan?\nSub an...
0,Could the members of The Police perform lawful...,NO,NO,Sub question #0 : Who are the members of The P...
0,Would a Monoamine Oxidase candy bar cheer up a...,NO,NO,Sub question #0 : What is Monoamine Oxidase?\n...
0,Would a dog respond to bell before Grey seal?,YES,YES,Sub question #0 : Can dogs respond to bells?\n...
0,Is a pound sterling valuable?,NO,YES,Sub question #0 : What is a pound sterling?\nS...
0,Is shrimp scampi definitely free of plastic?,NO,NO,Sub question #0 : What is shrimp scampi?\nSub ...
0,Do the anchors on Rede Globo speak Chinese?,NO,NO,Sub question #0 : What language do the anchors...
0,Will the Albany in Georgia reach a hundred tho...,NO,NO,Sub question #0 : What is the current populati...
0,Is a Boeing 737 cost covered by Wonder Woman (...,YES,YES,Sub question #0 : What is the cost of a Boeing...
0,Is the language used in Saint Vincent and the ...,YES,YES,Sub question #0 : What is the official languag...


In [89]:
dfs_strategyqa_cot[0]['true_answer'] == dfs_strategyqa_cot[0]['ai_answer']

0     True
0     True
0     True
0     True
0    False
0     True
0     True
0     True
0     True
0     True
dtype: bool

In [62]:
def num_tokens_from_string(string) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    num_tokens = len(encoding.encode(string))
    return num_tokens


In [63]:
def estimate_tokens(dataset_infos, strategy, model_name, nr_examples):
    total_nr_tokens = 0
    
    extract_answers_dic = {'gsm8k' : 
                               {'extract_true_answer_func' : extract_true_answer_gsm8k,
                                'extract_ai_answer_func' : extract_ai_answer_gsm8k},
                           'aqua' : 
                                {'extract_true_answer_func' : extract_true_answer_aqua,
                                 'extract_ai_answer_func' : extract_ai_answer_aqua},
                           'strategyqa' : 
                                {'extract_true_answer_func' : extract_true_answer_strategyqa,
                                 'extract_ai_answer_func' : extract_ai_answer_strategyqa}
                            }
    
    prefix_dic = {'gsm8k' : 
                          {'standard' : """You are willing to solve arithmetic math problems. The answer should not contain any special character. Follow the examples below and generate the answer using the format of these examples:""", 
                           'cot' : """You are willing to solve arithmetic math problems. Decompose the problem into intermediate steps and solve each step by generating the rationale. Explain the reasoning steps. Use the following format to answer the question: First generate intermediate reasoning steps, then generate the final answer as a single number. Here are some examples you can follow:\n\n"""
                          },
                  'aqua' : 
                          {'standard' : """"You are willing to solve algebraic word problems with multiple choice questions. Choose only one of the given options as the final answer. Follow the examples below and generate the answer using the format of these examples:\n\n""" ,
                           'cot' : """You are willing to solve algebraic word problems with multiple choice questions. First decompose the problem into intermediate reasoning steps, then solve and explain each intermediate step by generating the rationale. Then choose the final answer to be only one of the given options. The output should include the rationale and the answer. Follow the examples below to output the solution:\n\n"""
                          },
                  'strategyqa' : 
                          {'standard' : """You are willing to answer questions that require reasoning. The final answer must be either YES or NO. Follow the examples below and generate the answer using the format of these examples:""",
                           'cot' : """You are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:\n\n"""
                          }
                  }
    
    suffix_gsm8k_standard = "Question: {question}\nAnswer: "
    rationale_answer_gsm8k = "\nRationale: \nAnswer: "
    suffix_gsm8k_cot = """\n\nQuestion: {question}""" + rationale_answer_gsm8k
    
    options_answer_aqua = "\nOptions: {}\nAnswer: "
    suffix_aqua_standard = """\n\nQuestion : {question}"""
    options_rationale_answer_aqua = "\nOptions: {}\nRationale: \nAnswer: "
    suffix_aqua_cot = """\n\nQuestion: {question}"""
        
    suffix_strategyqa_cot = "Question: {question}\nOutput: "
    suffix_strategyqa_standard = "Question: {question}\nAnswer: "
    
    suffix_dic = {'gsm8k' : {'standard': suffix_gsm8k_standard,
                             'cot' : suffix_gsm8k_cot
                            },
                  'strategyqa' : {'standard' : suffix_strategyqa_standard,
                                  'cot' : suffix_strategyqa_cot
                                 },
                  'aqua' : {'standard' : {'subset' : options_answer_aqua,
                                          'suffix' : suffix_aqua_standard},
                            'cot' : {'subset' : options_rationale_answer_aqua,
                                     'suffix' : suffix_aqua_cot}}
                 }
    
    dataset_name = dataset_infos['dataset_name']
    dataset = dataset_infos['data']
    
    suffix_data = suffix_dic[dataset_name][strategy]
    if dataset_name == 'aqua':
        suffix_subset = suffix_data['subset']
        suffix_question = suffix_data['suffix']
    else:
        suffix = suffix_data
        
    
    prefix = prefix_dic[dataset_name][strategy]

    extract_true_answer_func = extract_answers_dic[dataset_name]['extract_true_answer_func']
    extract_ai_answer_func = extract_answers_dic[dataset_name]['extract_ai_answer_func']

    list_contexts = generate_fewshot_random_demonstration(dataset_infos, nr_examples, strategy)
    accuracy_results = []
    dataframes_list = []
    
    for i, context in enumerate(list_contexts):
        count_correct_answers = 0
        df = pd.DataFrame()

        for example in dataset:
            # build the suffix
            if dataset_name == 'aqua':                
                formatted_suffix_subset = suffix_subset.format(example['options'])
                suffix = suffix_question + formatted_suffix_subset
            
            # build the template using prefix, context and suffix
            template = prefix + context + suffix
            prompt = PromptTemplate(input_variables=["question"], template=template)
            formatted_prompt = prompt.format(question=example['question'])
            
            example_nr_tokens = num_tokens_from_string(formatted_prompt) + 30
            total_nr_tokens += example_nr_tokens
    
    return total_nr_tokens

In [16]:
price_per_token = 0.002 / 1000

In [240]:
OpenAI()

OpenAI(cache=None, verbose=False, callback_manager=<langchain.callbacks.shared.SharedCallbackManager object at 0x000001882D523040>, client=<class 'openai.api_resources.completion.Completion'>, model_name='text-davinci-003', temperature=0.7, max_tokens=256, top_p=1, frequency_penalty=0, presence_penalty=0, n=1, best_of=1, model_kwargs={}, openai_api_key=None, batch_size=20, request_timeout=None, logit_bias={}, max_retries=6, streaming=False)

In [17]:
dataset_infos = {'dataset_name' : 'gsm8k', 'data' : gsm8k}
gsm8k_sd = estimate_tokens(dataset_infos, 'standard', 'gpt-3.5-turbo', 7)
gsm8k_cot = estimate_tokens(dataset_infos, 'cot', 'gpt-3.5-turbo', 7)

0
1
0
1


In [18]:
dataset_infos = {'dataset_name' : 'aqua', 'data' : aqua}
aqua_sd = estimate_tokens(dataset_infos, 'standard', 'gpt-3.5-turbo', 7)
aqua_cot = estimate_tokens(dataset_infos, 'cot', 'gpt-3.5-turbo', 7)

0
1
0
1


In [19]:
dataset_infos = {'dataset_name' : 'strategyqa', 'data' : strategyqa}
strategyqa_sd = estimate_tokens(dataset_infos, 'standard', 'gpt-3.5-turbo', 7)
strategyqa_cot = estimate_tokens(dataset_infos, 'cot', 'gpt-3.5-turbo', 7)

0
1
0


In [20]:
gsm8k_sd*price_per_token + gsm8k_cot*price_per_token + aqua_sd*price_per_token + aqua_cot*price_per_token + strategyqa_sd*price_per_token + strategyqa_cot*price_per_token

65.044866

In [57]:
# TEST

# TEST STRATEGYQA COT
dataset_infos = {'dataset_name' : 'strategyqa', 'data' : strategyqa}
acc_strategyqa_cot, dfs_strategyqa_cot = run([10, 20], dataset_infos, 'cot', 'gpt-3.5-turbo', 2)

0


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are willing to answer qustions that require reasoning. Decompose the problem into intermediate sub-questions to gather more information and generate a sub-answer to each sub-question before generating the final answer. The final answer must YES or NO. Follow the examples below and generate the answer using the format of these examples:


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Pennsylvania?
Output:
Sub que

In [58]:
dfs_strategyqa_cot[0]

Unnamed: 0,question,true_answer,ai_answer,ai_completion,token_count
0,Are more people today related to Genghis Khan ...,YES,YES,Sub question #0 : How many descendants does Ge...,1243


In [52]:
print(dfs_strategyqa_cot[0]['ai_completion'].iloc[0])

Sub question #0 : How many descendants does Genghis Khan have today?
Sub answer #0 : It is estimated that Genghis Khan has around 16 million living descendants today.
Sub question #1 : How many descendants does Julius Caesar have today?
Sub answer #1 : It is difficult to estimate the number of living descendants of Julius Caesar, but it is believed to be much smaller than that of Genghis Khan.
Sub question #2 : Are more people today related to Genghis Khan than Julius Caesar?
Sub answer #2 : Since Genghis Khan has many more living descendants than Julius Caesar, it is likely that more people today are related to Genghis Khan than Julius Caesar.
Final Answer: YES


In [59]:
dfs_strategyqa_cot[1]

IndexError: list index out of range

In [68]:
strategyqa[0]

{'qid': 'b8677742616fef051f00',
 'term': 'Genghis Khan',
 'description': 'founder and first Great Khan of the Mongol Empire',
 'question': 'Are more people today related to Genghis Khan than Julius Caesar?',
 'answer': True,
 'facts': ['Julius Caesar had three children.',
  'Genghis Khan had sixteen children.',
  'Modern geneticists have determined that  out of every 200 men today has DNA that can be traced to Genghis Khan.'],
 'decomposition': ['How many kids did Julius Caesar have?',
  'How many kids did Genghis Khan have?',
  'Is #2 greater than #1?'],
 'evidence': [[[['Caesarion-2', 'Julia (daughter of Caesar)-1']],
   [['Alakhai Bekhi-1', 'Tolui-1'], 'no_evidence'],
   ['operation']],
  [[['Julius Caesar-75']], [['Genghis Khan-17']], ['operation']],
  [[['Gaius Julius Caesar-7']],
   [['Genghis Khan-15'], 'no_evidence'],
   ['no_evidence', 'operation']]]}

In [68]:
run_directory = 'Results/Standard_Prompting/d_2023_04_30_t_16_52_21/'

# specify the directory containing the CSV files

# get a list of all CSV files in the directory
csv_files = [f for f in os.listdir(run_directory) if f.endswith('.csv')]

# loop over the CSV files and read them into pandas dataframes
for file in csv_files:
    # construct the full file path
    file_path = os.path.join(run_directory, file)
    
    # read the CSV file into a pandas dataframe
    df = pd.read_csv(file_path)
    
    # do something with the dataframe
    # ...


In [69]:
df

Unnamed: 0,col1,col2
0,5,7
1,6,8


In [262]:
def load_results_single_run_different_seeds(identifier, strategy):
    DIRECTORY_DIC = {'standard' : 'Results/Standard_Prompting/',
                 'cot' : 'Results/Random_Manual_CoT/'}
    
    directory = DIRECTORY_DIC[strategy] + identifier 

    # get a list of all CSV files in the directory
    csv_files = [f for f in os.listdir(directory) if f.endswith('.csv')]
    txt_files = [f for f in os.listdir(directory) if f.endswith('.txt')]
    # loop over the CSV files and read them into pandas dataframes
    csv_list = []
    txt_list = []
    
    for csv_file, txt_file in zip(csv_files, txt_files):
        # construct the full file path
        csv_file_path = os.path.join(directory, csv_file)
        print(csv_file_path)
        # read the CSV file into a pandas dataframe
        df = pd.read_csv(csv_file_path)
        csv_list.append(df)
        
        txt_file_path = os.path.join(directory, txt_file)
        print(txt_file_path)
        with open(txt_file_path, "r") as f:
            content = f.read()
            
        txt_list.append(content)
        print('---------------------')
    return csv_list, txt_list

In [284]:
csv_list, txt_list = load_results_single_run_different_seeds('strategyqa_d_2023_04_30_t_22_23_02', 'cot')

Results/Random_Manual_CoT/strategyqa_d_2023_04_30_t_22_23_02\df_seed_no_seed.csv
Results/Random_Manual_CoT/strategyqa_d_2023_04_30_t_22_23_02\context_seed_no_seed.txt
---------------------


In [285]:
csv_list[0]

Unnamed: 0,question,true_answer,ai_answer,ai_completion,token_count,total_price
0,Are more people today related to Genghis Khan ...,YES,YES,Sub question #0 : Who is Genghis Khan?\nSub an...,1312,0.002624
1,Could the members of The Police perform lawful...,NO,NO,Sub question #0 : Who are the members of The P...,1231,0.002462
2,Would a Monoamine Oxidase candy bar cheer up a...,NO,NO,Sub question #0 : What is Monoamine Oxidase?\n...,1262,0.002524


In [286]:
csv_list[1]

IndexError: list index out of range

In [287]:
print(txt_list[0])


[Example 1]
Question: Do hamsters provide food for any animals?
Output:
Sub Question #0 : What type of animals are hamsters?
Sub Answer #0 : Hamsters are prey animals.
Sub Question #1 : Can prey animals be food for other animals?
Sub Answer #1 : Prey are food for predators.
Sub Question #2 : Do hamsters provide food for any animals?
Sub Answer #2 : Since hamsters are prey animals, and prey are food for predetors, hamsters provide food for some animals.
Final Answer: YES

[Example 2]
Question: Could Brooke Shields succeed at University of Pennsylvania?
Output:
Sub question #0 : What university did Brooke Shields went to?
Sub answer #0 : Brooke Shields went to Princeton University.
Sub question #1 : Did Brooke Shields succeed at Princeton University?
Sub answer #1 : At Princeton University, she got all As and Bs while pursing her bachelor's degree in French literature, meaning she had a successful school life.
Sub question #2 : How rigorous is Princeton University compared to University

In [281]:
print(txt_list[1])

Question: Would Snowdon mountain be a piece of cake for Tenzing Norgay?
Answer: YES

Question: Can the Very Large Telescope observe the largest mountain on Earth?
Answer: NO




In [288]:
for el1, el2 in zip(csv_list[0]['ai_answer'], csv_list[0]['ai_completion']):
    print(el1)
    print(el2)
    print('--------------------------------------------------------------------------------------')

YES
Sub question #0 : Who is Genghis Khan?
Sub answer #0 : Genghis Khan was the founder and first emperor of the Mongol Empire.
Sub question #1 : Who is Julius Caesar?
Sub answer #1 : Julius Caesar was a Roman general and statesman who played a critical role in the events that led to the demise of the Roman Republic and the rise of the Roman Empire.
Sub question #2 : How many descendants does Genghis Khan have?
Sub answer #2 : It is estimated that Genghis Khan has around 16 million living male descendants today.
Sub question #3 : How many descendants does Julius Caesar have?
Sub answer #3 : Julius Caesar did not have any living descendants as he did not have any children who survived to adulthood.
Sub question #4 : Are more people today related to Genghis Khan than Julius Caesar?
Sub answer #4 : Since Genghis Khan has millions of living descendants and Julius Caesar has none, it is likely that more people today are related to Genghis Khan than Julius Caesar.
Final Answer: YES
---------

In [283]:
for el1, el2 in zip(csv_list[1]['ai_answer'], csv_list[1]['ai_completion']):
    print(el1)
    print(el2)
    print('--------------------------------------------------------------------------------------')

YES
YES
--------------------------------------------------------------------------------------
NO
NO
--------------------------------------------------------------------------------------
NO
NO
--------------------------------------------------------------------------------------
