## Case Study - Majority vote

The goal of this notebook is to explore how well GPT-3 handles majority-vote questions on our 20-question dataset. Specifically, we give it a question and five candidate answers, and ask it to aggregate them into a single answer.

1. Imports and functions
2. Simple approach
3. Chain of thought

## 1. Imports and functions

In [4]:
import re
import os
import openai
from time import time, sleep
import textwrap
import pandas as pd

In [3]:
def open_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as infile:
        return infile.read()


def save_file(filepath, content):
    with open(filepath, 'w', encoding='utf-8') as outfile:
        outfile.write(content)

# change path to where you store your credentials
openai.api_key = open_file('../creds/creds.txt')


def gpt3_completion(prompt, label='gpt3', engine='text-davinci-002', temp=0.5, top_p=1.0, tokens=400, freq_pen=2.0, pres_pen=2.0, stop=['asdfasdf', 'asdasdf']):
    max_retry = 5
    retry = 0
    prompt = prompt.encode(encoding='ASCII', errors='ignore').decode()  # drop non-ASCII characters to avoid encoding errors
    while True:
        try:
            response = openai.Completion.create(
                engine=engine,
                prompt=prompt,
                temperature=temp,
                max_tokens=tokens,
                top_p=top_p,
                frequency_penalty=freq_pen,
                presence_penalty=pres_pen,
                stop=stop)
            text = response['choices'][0]['text'].strip()
            text = re.sub(r'\s+', ' ', text)  # raw string avoids an invalid-escape warning
            return text
        except Exception as oops:
            retry += 1
            if retry >= max_retry:
                return "GPT3 error: %s" % oops
            print('Error communicating with OpenAI:', oops)
            sleep(1)


In [10]:
df_maj = pd.read_csv('majority_questions.csv')

In [11]:
df_maj

Unnamed: 0,question_number,question,answer_1,answer_2,answer_3,answer_4,answer_5,validation
0,1,What is left in the box?,"The box has a pencil, another 10 pencils and a...",The box has 21 pencils and a strawberry.,"The box has a pencil, a strawberry, and anothe...",The box has 21 pencils and a strawberry.,"The box has 10 pencils, a strawberry, and anot...",The box has 11 pencils and a strawberry.
1,2,What is left in the box?,"The box has a pencil, another 10 pencils and a...",The box has 21 pencils and a strawberry.,"The box has a pencil, a strawberry, and anothe...",The box has 21 pencils and a strawberry.,"The box has 10 pencils, a strawberry, and anot...",The box has 21 pencils and a strawberry.
2,3,Should congress pass the law?,The law should go through,The law should be banned,The law should get the greenlight,The law should not be part of the constitution,The law should be implemented,The law should go through
3,4,Should congress pass the law?,The law should go through,The law should be banned,The law should get the greenlight,The law should not be part of the constitution,The law should not be implemented,The law should not be implemented
4,5,Did the experiment work correctly?,Yes,No,I don't know,Most likely,Probably,Yes
5,6,Did the experiment work correctly?,Yes,No,I don't know,I'm not sure,Probably not,No
6,7,How should I wash my hands?,Use handsoap,Use soap,Use alcohol,Use aloe vera soap,Use shampoo,Soap
7,8,How should I wash my hands?,Use hair shampoo,Use soap,Use shampoo,Use aloe vera soap,Use shampoo,Shampoo
8,9,What should we eat today?,Pizza,Sushi,Pizza or Sushi,Sushi,Hamburger,Sushi
9,10,What should we eat today?,Pizza,Sushi,Pizza,Sushi or Pizza,Hamburger,Pizza


## 2. Simple approach

We'll use the following template:

In the text below we have 5 answers to the same question.

ANSWERS:

- answer_1
- answer_2
- answer_3
- answer_4
- answer_5

What is the answer to "__question__". Count how many times each answer appears, and choose the most frequent.
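For intuition about why this task is not trivial for plain code, here is a minimal sketch of a purely literal vote count using `collections.Counter`. The helper name `literal_majority` is hypothetical and not part of the notebook; the answer lists are copied from the dataset shown above. Exact string matching handles repeated answers, but it cannot tell that "The law should go through" and "The law should get the greenlight" are the same vote, which is the part we hope GPT-3 can handle.

```python
from collections import Counter


def literal_majority(answers):
    """Return the (answer, count) pairs tied for the highest exact-match count."""
    # Lowercase and strip so that only trivial formatting differences are ignored
    counts = Counter(a.strip().lower() for a in answers)
    top = max(counts.values())
    return [(answer, n) for answer, n in counts.items() if n == top]


food = ["Pizza", "Sushi", "Pizza or Sushi", "Sushi", "Hamburger"]
law = [
    "The law should go through",
    "The law should be banned",
    "The law should get the greenlight",
    "The law should not be part of the constitution",
    "The law should be implemented",
]

print(literal_majority(food))      # a single literal winner
print(len(literal_majority(law)))  # every paraphrase counts as its own answer
```

On the law question, literal counting produces a five-way tie even though three of the answers express the same opinion.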

We will also reuse the function that checks whether two answers are equivalent, which we explored in the prior notebook "Case Study - Validation".

In [16]:
## Validation function
def equiv(question, answer_1, answer_2):
    prompt_beginning = """In the following text we have a question and two answers.  Are the two answers the same? Answer with "Yes" or "No" """            
    nl_str = """\n\n"""
    question_str = """QUESTION:\n"""
    answers_str = """ANSWERS: \n"""
    final_answer_cot = """Are they the same answer? Use step by step reasoning and then output only a "Yes" or "No" answer:"""
    prompt = prompt_beginning + nl_str + question_str + question + nl_str + answers_str + "-" + answer_1 + "\n" + "-" + answer_2 + nl_str + final_answer_cot
    final_answer = gpt3_completion(prompt, temp = 0.3)
    # If the model returned reasoning instead of a bare verdict, ask it to condense
    if final_answer.replace('.','').strip() not in ['Yes','No']:
        prompt_init = """Condense the following sentence into a simple "Yes" or "No" answer"""
        prompt = prompt_init + nl_str + final_answer + nl_str + """ "Yes" or "No" output: """
        final_answer = gpt3_completion(prompt)
    # str.replace returns a new string, so assign it back for the cleanup to take effect
    final_answer = final_answer.replace('.','').replace(' ','')
    return final_answer
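
Because the prompt asks for step-by-step reasoning before the verdict, the raw completion is often a sentence rather than a bare "Yes" or "No". One possible way to reduce calls to the condensing prompt is to pull the verdict out locally first. The sketch below is an assumption, not what the notebook ran: `extract_verdict` is a hypothetical helper that returns the last standalone "yes" or "no" in the text, or `None` when there is none.

```python
import re


def extract_verdict(text):
    """Return 'Yes' or 'No' from the last standalone yes/no in text, else None."""
    # \b keeps us from matching "no" inside words such as "know" or "cannot"
    matches = re.findall(r'\b(yes|no)\b', text, flags=re.IGNORECASE)
    if not matches:
        return None
    return matches[-1].capitalize()


print(extract_verdict("Yes."))
print(extract_verdict("The first answer says 21 pencils and the second says 11, so no."))
print(extract_verdict("I cannot tell."))
```

This is only a heuristic: a phrase such as "there is no difference" would be read as "No", so the condensing prompt is still a reasonable fallback when the result is `None` or looks suspicious.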
    

In [18]:
temp_list = [0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 1]

In [23]:
prompt_beginning = """In the text below we have 5 answers to the same question."""            
nl_str = """\n\n"""
answers_str = """ANSWERS: \n"""
final_str_1 = """What is the answer to \""""
final_str_2 = """\". Count how many times each answer appears, and choose the most frequent."""


In [None]:
score_list = []
for temp in temp_list:
    gpt_cot_list = []
    gpt_val_list = []
    print(temp)
    for row in df_maj.iterrows():
        question = row[1]['question']
        answers_concat =  "\n -" + row[1]['answer_1'] + "\n -" + row[1]['answer_2'] + "\n -" + row[1]['answer_3'] + "\n -" + row[1]['answer_4'] + "\n -" + row[1]['answer_5']
        prompt = prompt_beginning + nl_str + answers_str + answers_concat + nl_str + final_str_1 + question + final_str_2
        print(prompt)
        final_answer = gpt3_completion(prompt, temp = temp)
        #print(final_answer)
        #final_answer.replace('.','').replace(' ','')
        print(row[0])
        gpt_cot_list.append(final_answer)
        equiv_string = equiv(question, final_answer, row[1]['validation'])
        if equiv_string not in ['Yes','No']:
            prompt_init = """Condense the following sentence into a simple "Yes" or "No" answer"""
            prompt = prompt_init + nl_str + final_answer + nl_str + """ "Yes" or "No" output: """
            final_a = gpt3_completion(prompt, temp = 0.3)
            # str.replace returns a new string; keep the cleaned result
            equiv_string = final_a.replace('.','').replace(' ','')
        gpt_val_list.append(equiv_string)
    col_string_cot = 'CoT_'+ str(temp)
    col_string_val = 'Val_'+ str(temp)
    df_maj[col_string_cot] = gpt_cot_list
    df_maj[col_string_val] = gpt_val_list
    # validate against real answers using the equiv function above 
    score = sum(df_maj[col_string_val]=='Yes')/len(df_maj)  # fraction of the 20 questions judged equivalent
    score_list.append(score) 

In [34]:
df_results = pd.DataFrame(list(zip(temp_list, score_list)), columns =['Temp', 'Score'])
df_results

Unnamed: 0,Temp,Score
0,0.2,0.5
1,0.4,0.45
2,0.6,0.45
3,0.8,0.55
4,0.9,0.45
5,0.95,0.4
6,1.0,0.6


Not quite a success. Let's look at what is happening row by row.

In [35]:
df_maj

Unnamed: 0,question_number,question,answer_1,answer_2,answer_3,answer_4,answer_5,validation,CoT_0.2,Val_0.2,...,CoT_0.6,Val_0.6,CoT_0.8,Val_0.8,CoT_0.9,Val_0.9,CoT_0.95,Val_0.95,CoT_1,Val_1
0,1,What is left in the box?,"The box has a pencil, another 10 pencils and a...",The box has 21 pencils and a strawberry.,"The box has a pencil, a strawberry, and anothe...",The box has 21 pencils and a strawberry.,"The box has 10 pencils, a strawberry, and anot...",The box has 11 pencils and a strawberry.,"The box has a pencil, another 10 pencils and a...",Yes,...,"The box has a pencil, another 10 pencils and a...",No,"The box has a pencil, another 10 pencils and a...",Yes,"The box has a pencil, another 10 pencils and a...",Yes,"The box has a pencil, another 10 pencils and a...",No,"The box has a pencil, another 10 pencils and a...",Yes
1,2,What is left in the box?,"The box has a pencil, another 10 pencils and a...",The box has 21 pencils and a strawberry.,"The box has a pencil, a strawberry, and anothe...",The box has 21 pencils and a strawberry.,"The box has 10 pencils, a strawberry, and anot...",The box has 21 pencils and a strawberry.,"The box has a pencil, another 10 pencils and a...",No,...,"The box has a pencil, another 10 pencils and a...",No,The box has 21 pencils and a strawberry.,Yes,"The box has a pencil, another 10 pencils and a...",No,"The most frequent answer is ""The box has 21 pe...",Yes,"The box has a pencil, another 10 pencils and a...",No
2,3,Should congress pass the law?,The law should go through,The law should be banned,The law should get the greenlight,The law should not be part of the constitution,The law should be implemented,The law should go through,The law should go through.,Yes,...,The law should go through.,Yes,"The answer is ""the law should go through"" with...",Yes,The law should go through.,Yes,"The answer is ""the law should go through.""",Yes,The law should go through.,Yes
3,4,Should congress pass the law?,The law should go through,The law should be banned,The law should get the greenlight,The law should not be part of the constitution,The law should not be implemented,The law should not be implemented,"The answer is ""the law should go through"" with...",No,...,"The answer to ""Should congress pass the law?"" ...",No,The law should go through.,No,The law should go through.,No,"The answer is ""the law should go through.""",No,"The answer is ""should not be part of the const...",No
4,5,Did the experiment work correctly?,Yes,No,I don't know,Most likely,Probably,Yes,-Yes,Yes,...,Yes,Yes,Yes,Yes,Yes,Yes,Most likely,Yes,Yes,Yes
5,6,Did the experiment work correctly?,Yes,No,I don't know,I'm not sure,Probably not,No,"The most frequent answer is ""I don't know.""",No,...,"The most frequent answer is ""I don't know.""",No,-Yes,Yes,"The answer is ""I don't know"" because it appear...",No,"The most frequent answer is ""I don't know.""",No,"The most frequent answer is ""I don't know.""",No
6,7,How should I wash my hands?,Use handsoap,Use soap,Use alcohol,Use aloe vera soap,Use shampoo,Soap,"The answer to ""How should I wash my hands?"" is...",Yes,...,-Use soap,Yes,-Use soap,Yes,"The answer is ""use soap.""",Yes,Use soap.,Yes,"The answer to ""How should I wash my hands?"" th...",Yes
7,8,How should I wash my hands?,Use hair shampoo,Use soap,Use shampoo,Use aloe vera soap,Use shampoo,Shampoo,"The most frequent answer is ""Use shampoo.""",No,...,The answer is to use shampoo.,No,"The most frequent answer is ""Use shampoo"" whic...",No,The answer is to use shampoo.,No,"The most frequent answer is ""use shampoo"" whic...",No,-Use shampoo,No
8,9,What should we eat today?,Pizza,Sushi,Pizza or Sushi,Sushi,Hamburger,Sushi,"Pizza and sushi are both mentioned twice, so i...",No,...,The answer is pizza.,No,"The answer is ""Pizza or Sushi"" because it appe...",No,Pizza – 2 Sushi – 3 Pizza or Sushi – 1 Hamburg...,No,Pizza: 2 Sushi: 3,No,Pizza and Sushi are tied with 2 each.,No
9,10,What should we eat today?,Pizza,Sushi,Pizza,Sushi or Pizza,Hamburger,Pizza,"The answer is pizza, appearing 2 times.",Yes,...,"The most frequent answer is ""Pizza"" with 2 occ...",Yes,"The answer is pizza, which appears twice.",No,"The most frequent answer is ""Pizza"", appearing...",No,"Pizza and sushi are both mentioned twice, so i...",No,"The most frequent response is ""Pizza"" with 2 o...",Yes


## 3. Chain of thought

In [44]:
prompt_beginning = """In the text below we have 5 answers to the same question."""            
nl_str = """\n\n"""
answers_str = """ANSWERS: \n"""
final_str_1 = """What is the answer to \""""
final_str_2 = """\". Count how many times each answer appears, and choose the most frequent. Reason your answer step-by-step."""
# Notice we added the step by step reason in the last string



In [None]:
score_list = []
for temp in temp_list:
    gpt_cot_list = []
    gpt_val_list = []
    print(temp)
    for row in df_maj.iterrows():
        question = row[1]['question']
        answers_concat =  "\n -" + row[1]['answer_1'] + "\n -" + row[1]['answer_2'] + "\n -" + row[1]['answer_3'] + "\n -" + row[1]['answer_4'] + "\n -" + row[1]['answer_5']
        prompt = prompt_beginning + nl_str + answers_str + answers_concat + nl_str + final_str_1 + question + final_str_2
        print(prompt)
        final_answer = gpt3_completion(prompt, temp = temp)
        #print(final_answer)
        #final_answer.replace('.','').replace(' ','')
        print(row[0])
        gpt_cot_list.append(final_answer)
        equiv_string = equiv(question, final_answer, row[1]['validation'])
        if equiv_string not in ['Yes','No']:
            prompt_init = """Condense the following sentence into a simple "Yes" or "No" answer"""
            prompt = prompt_init + nl_str + final_answer + nl_str + """ "Yes" or "No" output: """
            final_a = gpt3_completion(prompt, temp = 0.3)
            # str.replace returns a new string; keep the cleaned result
            equiv_string = final_a.replace('.','').replace(' ','')
        gpt_val_list.append(equiv_string)
    col_string_cot = 'CoT_'+ str(temp)
    col_string_val = 'Val_'+ str(temp)
    df_maj[col_string_cot] = gpt_cot_list
    df_maj[col_string_val] = gpt_val_list
    # validate against real answers using the equiv function above 
    score = sum(df_maj[col_string_val]=='Yes')/len(df_maj)  # fraction of the 20 questions judged equivalent
    score_list.append(score) 

In [38]:
df_results = pd.DataFrame(list(zip(temp_list, score_list)), columns =['Temp', 'Score'])
df_results

Unnamed: 0,Temp,Score
0,0.2,0.45
1,0.4,0.35
2,0.6,0.25
3,0.8,0.3
4,0.9,0.25
5,0.95,0.5
6,1.0,0.3


Ouch, that's even worse than before.

In [42]:
pd.set_option('display.max_colwidth', None)


In [43]:
df_maj

Unnamed: 0,question_number,question,answer_1,answer_2,answer_3,answer_4,answer_5,validation,CoT_0.2,Val_0.2,CoT_0.4,Val_0.4,CoT_0.6,Val_0.6,CoT_0.8,Val_0.8,CoT_0.9,Val_0.9,CoT_0.95,Val_0.95,CoT_1,Val_1
0,1,What is left in the box?,"The box has a pencil, another 10 pencils and a strawberry.",The box has 21 pencils and a strawberry.,"The box has a pencil, a strawberry, and another 10 pencils.",The box has 21 pencils and a strawberry.,"The box has 10 pencils, a strawberry, and another pencil.",The box has 11 pencils and a strawberry.,"The most frequent answer is ""The box has a pencil, another 10 pencils and a strawberry."" This answer appears twice.",Yes,"The answer that appears most frequently is ""The box has 21 pencils and a strawberry."" This answer makes the most sense, because it states that there are exactly 21 items in the box- 1 pencil, 10 other pencils, and 1 strawberry. The other answers either do not state an exact number of items, or they list the items in a different order.",No,"The answer that appears most frequently is ""The box has 21 pencils and a strawberry."" This answer makes the most sense, as it includes all of the items mentioned in the other answers.",No,"The most frequent answer is ""The box has a pencil, another 10 pencils, and a strawberry."" This answer appears twice.",No,"The answer that appears most frequently is ""The box has a pencil, another 10 pencils, and a strawberry.""",No,"The answer that appears most frequently is ""The box has 21 pencils and a strawberry."" This answer makes the most sense, because it includes all of the items mentioned in the other answers.",No,"The most frequent answer is ""The box has 21 pencils and a strawberry."" This answer appears twice.",No
1,2,What is left in the box?,"The box has a pencil, another 10 pencils and a strawberry.",The box has 21 pencils and a strawberry.,"The box has a pencil, a strawberry, and another 10 pencils.",The box has 21 pencils and a strawberry.,"The box has 10 pencils, a strawberry, and another 11 pencils.",The box has 21 pencils and a strawberry.,"The answer that appears most frequently is ""The box has 21 pencils and a strawberry."" This answer makes the most sense, because it includes all of the items mentioned in the other answers.",Yes,"The most frequent answer is ""The box has 21 pencils and a strawberry."" This answer appears twice. The other answers each appear once.",Yes,"The most frequent answer is ""The box has 21 pencils and a strawberry."", which appears twice. The other answers all appear once, so they are less likely to be correct.",Yes,"-The box has a pencil, another 10 pencils and a strawberry. This answer appears once. -The box has 21 pencils and a strawberry. This answer appears twice. -The box has a pencil, a strawberry, and another 10 pencils. This answer appears once. -The box has 21 pencils and a Strawberry. This Answer Appears Twice -The Box Has 10 Pencils, A Strawberry, And Another 11 Pencils."" this Answer Appears Once From the above answers we can see that ""the box has 21 pencils and 1 strawberry"" is the most frequent with appearing 3 times out of 5 answers given . Therefore this is the correct answer to what is left in the box?",Yes,"The most frequent answer is ""the box has 21 pencils and a strawberry."" This answer appears twice, while the other answers appear only once.",Yes,"The most frequent answer is ""The box has 21 pencils and a strawberry."" This answer appears twice. The other answers each appear once.",Yes,The box has 21 pencils and a strawberry.,Yes
2,3,Should congress pass the law?,The law should go through,The law should be banned,The law should get the greenlight,The law should not be part of the constitution,The law should be implemented,The law should go through,"The answer to ""Should congress pass the law?"" is that the law should go through. This is based on the fact that three out of five answers say that the law should either go through, get the greenlight, or be implemented.",Yes,"The most frequent answer is ""the law should go through"". This appears twice in the text. The other answers each appear once.",Yes,"The answer to ""should congress pass the law"" is that the law should go through. This is based on the fact that three out of five answers say that the law should either go through, get the greenlight, or be implemented.",Yes,"""The law should go through"" appears once. ""The law should be banned"" does not appear. ""The law should get the greenlight"" does not appear. ""The law should not be part of the constitution "" appears once. ' The law should be implemented ' appears once .",No,"-The law should go through: 1 -The law should be banned: 0 -The law should get the greenlight: 0 -The law should not be part of the constitution: 0 -The law should be implemented: 4 Most frequent answer is ""thelawshouldbeimplemented"".",No,"The answer is that ""the law should go through,"" appearing twice.",Yes,"""Should congress pass the law?"" -The law should go through: 1 -The law should be banned: 0 -The law should get the greenlight: 0 -The law should not be part of the constitution: 0 - The law should be implemented: 1",No
3,4,Should congress pass the law?,The law should go through,The law should be banned,The law should get the greenlight,The law should not be part of the constitution,The law should not be implemented,The law should not be implemented,"The answer to ""Should congress pass the law?"" is that the law should go through. This is based on the fact that three out of five answers say that the law should either go through, get the greenlight, or be part of the constitution.",No,"The answer to ""Should congress pass the law?"" is that the law should go through. This is based on the fact that three out of five answers say that the law should either go through, get the greenlight, or be part of the constitution.",No,"The answer to the question ""Should congress pass the law?"" is that The law should go through. This is based on the fact that this answer appears 3 times, while all other answers only appear once each.",No,"The law should go through - 1 The law should be banned - 0 The law should get the greenlight - 0 The law should not be part of the constitution- 1 The law should not be implemented- 2 From this, we can see that the answer with the most frequency is ""the law shouldn't be implemented"", with a count of 2. The next most frequent is ""the laws hould go through"" and ""thelaw shoundn't b epart of the Constitution"", both with a countof 1. Therefore, it seems as though there is no clear consensus on what to do about this particular issue.",No,"The most frequent answer in the text is ""the law should go through,"" with two instances. The second most frequent is ""the law should not be part of the constitution,"" also with two instances. However, this answer is less definitive than ""the law should go through."" The nextmost frequent response, ""the law should get the greenlight,"" only appears once. 
This leaves us with a tie between ""the law should go through"" and ""the lae shouldn't be part of the constitution."" If we take into account that one instance of each appeared alongside another response (for example, ""'The law should get the greenlight,' said John""), we can see that neither has a clear majority. In this case, it would come down to context to determine which option is more likely.",No,"The answer to ""Should congress pass the law?"" is that the law should get the greenlight. This is because three out of five answers say that the law should go through, be passed, or receive a green light. The other two answers are more split, with one saying it shouldn't be part of the constitution and another saying it shouldn't be implemented - these could mean a number of things and are less clear than the first three answers. Therefore, by process of elimination, Congress passing the law is most likely what people want to happen.",No,"""The law should go through"" appears once. ""The law should be banned"" and ""The law should get the greenlight"" appear twice each. ""The law should not be part of the constitution"" and ""The law should not be implemented"" appear three times each. The answer with the most appearances is ""the Law Should Not Be Implemented.""",No
4,5,Did the experiment work correctly?,Yes,No,I don't know,Most likely,Probably,Yes,"The most frequent answer is ""I don't know."" This is because there are more instances of this response than any other. The next most frequent answer is ""Most likely,"" followed by ""Probably.""",No,"The most frequent answer is ""I don't know.""",No,"There are a total of 5 answers to the question. ""Yes"" appears once, ""No"" appears once, ""I don't know"" appears once, ""Most likely"" appears once, and ""Probably"" also appears only once. Therefore, the most frequent answer is that there is not a clear consensus and more information would be needed in order to determine whether or not the experiment worked correctly.",No,"The most frequent answer is ""Most likely"".",No,"Yes - 1 No - 1 I don't know - 1 Most likely - 0 Probably-1 The most frequent answer to the question is ""I don't know.""",No,Yes: 1 No: 1 I don't know: 1 Most likely: 0 Probably: 0,No,"Yes: 1 No: 1 I don't know: 1 Most likely: 0 Probably: 0 The most frequent answer is ""I don't know"", so the correct answer must be ""I don't know"".",No
5,6,Did the experiment work correctly?,Yes,No,I don't know,I'm not sure,Probably not,No,"The most frequent answer is ""I don't know."" This answer appears 3 times out of 5, which is 60% of the time. The other answers each appear only once. Therefore, ""I don't know"" is the most likely answer to the question.",No,"The most frequent answer is ""I don't know."" This makes sense because the question is asking for an opinion on whether or not the experiment worked, and unless the person answering has first-hand knowledge of how the experiment was conducted, they would not be able to say definitively one way or another.",No,"The most frequent answer is ""I'm not sure"", which appears twice. The other answers each appear once.",No,"The answer to ""Did the experiment work correctly?"" is probably not. The reason for this is that no appears twice, and both I'm not sure and Probably not appear once. This means that there are a total of three answers pointing to the experiment probably not working correctly, as opposed to only two saying it did (I don't know doesn't count because it isn't a clear answer). Therefore, based on majority vote, the most likely answer is that the experiment did NOT work correctly.",No,"The most frequent answer is ""I don't know."" This appears twice in the text. The other answers each appear once.",No,"The answer is ""No"" because it appears twice, and the other answers only appear once.",Yes,"Yes, the experiment worked correctly. - 1 No, the experiment did not work correctly. -1 I don't know if the experiment worked correctly. - 1 I'm not sure if the experiment worked correctly.-1 Probably not, I think something went wrong withtheexperiment.-1 From this we can see that ""yes"" is the most frequent answer, so we can say that yes, according to these answers, it seems that he experiments likely did work correctl",No
6,7,How should I wash my hands?,Use handsoap,Use soap,Use alcohol,Use aloe vera soap,Use shampoo,Soap,"The answer to ""How should I wash my hands?"" is soap. This is the most frequent answer, appearing three times out of five. The other answers are handsoap, alcohol, aloe vera soap, and shampoo.",Yes,"The most frequent answer is ""use soap."" This answer appears twice, while the other answers each appear once. Therefore, the correct answer is to use soap when washing one's hands.",Yes,"The answer to ""How should I wash my hands?"" is soap. This is the most frequent answer, appearing 3 times out of 5. The other answers are handsoap, alcohol, aloe vera soap, and shampoo.",No,"The answer to ""How should I wash my hands?"" is: Use soap. This is the most frequent response, appearing 3 times out of 5. The other answers appear only once each.",Yes,"The most frequent answer is ""Use soap"". This answer appears twice out of the five options given. The other answers each appear once. Therefore, ""use soap"" is the most likely answer to the question posed.",No,"The answer that appears the most is ""use soap."" This makes sense because soap is specifically designed to cleanse and remove dirt, oil, and other substances from the skin.",Yes,"The most frequent answer is to ""use soap"". This appears 3 times out of the 5 answers given. The next most frequent answer is to ""use handsoap"", appearing twice.",Yes
7,8,How should I wash my hands?,Use hair shampoo,Use soap,Use shampoo,Use aloe vera soap,Use shampoo,Shampoo,"The most frequent answer is ""Use shampoo."" This appears three times out of the five answers.",No,"The most frequent answer is ""Use shampoo."" This appears 3 times out of 5, which is the majority. The next most frequent answers are ""Use soap"" and ""Use aloe vera soap,"" each appearing once. Therefore, based on the frequency of responses, the best answer to the question is to use shampoo.",No,The most frequent answer is to use shampoo. This appears three times out of the five answers given.,No,"The most frequent answer is ""use shampoo."" This appears three times out of the five answers, which is more than any other option.",No,"The answer that appears most frequently is ""Use shampoo."" This makes sense because shampoo is specifically designed to cleanse the hair and scalp, whereas soap is typically used for cleaning the body. Additionally, aloe vera soap may be beneficial for those with sensitive skin.",No,"The most frequent answer is ""Use shampoo."" Shampoo is designed to cleanse the hair and scalp, so it would also be effective at cleaning the hands. Soap is also a good choice for cleansing the hands, but aloe vera soap may be better suited for those with sensitive skin.",No,"The most frequent answer is ""Use shampoo."" This appears three times out of the five answers, so it is the majority.",No
8,9,What should we eat today?,Pizza,Sushi,Pizza or Sushi,Sushi,Hamburger,Sushi,"The answer to ""What should we eat today?"" is pizza. This is because pizza appears twice in the list of answers, while sushi and hamburger only appear once each.",No,"The most frequent answer is ""Pizza"", appearing twice. The other answers each appear once.",No,"Pizza and sushi are both mentioned twice, so we have to look at the other answers to see which one is most frequent. ""Pizza or sushi"" only counts as one answer, since it's offering two options. That means that hamburgers are the least frequent option, so pizza and sushi tie for the most frequent option.",Yes,"The answer to ""What should we eat today?"" is pizza. The reason for this is because pizza appears twice in the list of answers, while sushi and hamburger only appear once each. Therefore, based on the number of times each answer appears, it can be inferred that pizza is the most popular option.",No,"Pizza and sushi are both mentioned twice, so we have to look at the other answers to see which one is chosen most frequently. Since ""pizza or sushi"" and ""sushi"" are both mentions of sushi, that means there are three total votes for eating sushi today. That's more than half of the total answers, so pizza is the most popular choice with four votes.",No,"Answer: The most frequent answer is pizza, appearing twice. Sushi also appears twice, but hamburger only appears once, so pizza is the most popular choice.",No,"The answer to ""What should we eat today?"" is pizza. The word ""pizza"" appears twice in the text, while the other options each appear once. Therefore, pizza is the most frequent option and thus the answer to the question.",No
9,10,What should we eat today?,Pizza,Sushi,Pizza,Sushi or Pizza,Hamburger,Pizza,"The answer to ""What should we eat today?"" is pizza. This is because three out of the five answers say pizza, while only two mention sushi. Therefore, it is more likely that the person wants to eat pizza than sushi.",Yes,"The answer is pizza, because it appears twice and sushi only appears once.",No,"Pizza and sushi are both mentioned twice, so we have to look at the other answers to see which is more frequent. Hamburger is only mentioned once, so it must be less frequent than either pizza or sushi. Sushi or pizza is also mentioned only once, but since it includes both pizza and sushi as options, it must be less frequent than either one on its own. Therefore, the most frequent answer is ""pizza"" with two mentions.",No,"Pizza and sushi are both mentioned twice, so we have to choose between those two. If we look at the other answers, pizza is mentioned more often than sushi, hamburger, or anything else. Therefore, the answer to ""What should we eat today?"" is pizza.",Yes,"Pizza is the correct answer because it appears twice, while the other answers only appear once.",No,"Pizza and Sushi are both mentioned twice, so we need to look at the final answer. The most frequent response is ""Sushi or Pizza"", so that would be our final answer.",No,"The most frequent answer is ""Pizza"", appearing twice. The second most frequent is ""Sushi"", also appearing twice. As these two answers have the same frequency, we can choose either one.",No


Sometimes the answer is too verbose and the equiv function gets confused. Maybe rewriting the prompt so that the initial answer is more concise will help.
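
Another option, alongside tightening the prompt, is to shorten the model's answer before it reaches the equiv function. The sketch below is one hedged way to do that: a hypothetical helper `last_quoted` that keeps only the last double-quoted span of a response and falls back to the full text when nothing is quoted. It is an assumption for illustration, was not used to produce any results in this notebook, and will pick the wrong span whenever the response quotes something other than the chosen answer last.

```python
import re


def last_quoted(response):
    """Return the last double-quoted span in response, or the stripped response itself."""
    spans = re.findall(r'"([^"]+)"', response)
    if not spans:
        return response.strip()
    # Drop trailing punctuation that the model placed inside the quotes
    return spans[-1].strip().rstrip('.,')


# Response copied from the chain-of-thought results table above
verbose = 'The most frequent answer is "Use shampoo." This appears three times out of the five answers.'
print(last_quoted(verbose))
print(last_quoted("Use soap."))
```

The shortened string could then be passed to `equiv` in place of the full response.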

In [47]:
prompt_beginning = """In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below."""            
nl_str = """\n\n"""
answers_str = """ANSWERS: \n"""
final_str_1 = """What is the answer to \""""
final_str_2 = """\". Count how many times each answer appears, reason step-by-step and respond in as few words as possible."""
# Notice we now also ask for an extremely short answer, on top of the step-by-step reasoning



In [48]:
score_list = []
for temp in temp_list:
    gpt_cot_list = []
    gpt_val_list = []
    print(temp)
    for idx, row in df_maj.iterrows():
        question = row['question']
        # Render the five candidate answers as a dash-prefixed list
        answers_concat = "".join("\n -" + row[f'answer_{i}'] for i in range(1, 6))
        prompt = prompt_beginning + nl_str + answers_str + answers_concat + nl_str + final_str_1 + question + final_str_2
        print(prompt)
        final_answer = gpt3_completion(prompt, temp=temp)
        print(idx)
        gpt_cot_list.append(final_answer)
        # equiv (defined earlier) compares the answer against the validation column
        equiv_string = equiv(question, final_answer, row['validation'])
        if equiv_string not in ['Yes', 'No']:
            # Fallback: ask the model for a bare "Yes" or "No"
            prompt_init = """Condense the following sentence into a simple "Yes" or "No" answer"""
            prompt = prompt_init + nl_str + final_answer + nl_str + """ "Yes" or "No" output: """
            final_a = gpt3_completion(prompt, temp=0.3)
            # str.replace returns a new string, so the cleaned value has to be assigned
            equiv_string = final_a.replace('.', '').replace(' ', '')
        gpt_val_list.append(equiv_string)
    col_string_cot = 'CoT_' + str(temp)
    col_string_val = 'Val_' + str(temp)
    df_maj[col_string_cot] = gpt_cot_list
    df_maj[col_string_val] = gpt_val_list
    # validate against real answers using the equiv function above
    score = sum(df_maj[col_string_val] == 'Yes') / len(df_maj)
    score_list.append(score)

0.2
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another pencil.

What is the answer to "What is left in the box?". Count how many times each answer appears, reason step-by-step and respond in as few words as possible.
0
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another 11 pencils.


19
0.4
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another pencil.

What is the answer to "What is left in the box?". Count how many times each answer appears, reason step-by-step and respond in as few words as possible.
0
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another 11 pencil

19
0.6
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another pencil.

What is the answer to "What is left in the box?". Count how many times each answer appears, reason step-by-step and respond in as few words as possible.
0
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another 11 pencil

19
0.8
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another pencil.

What is the answer to "What is left in the box?". Count how many times each answer appears, reason step-by-step and respond in as few words as possible.
0
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another 11 pencil

19
0.9
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another pencil.

What is the answer to "What is left in the box?". Count how many times each answer appears, reason step-by-step and respond in as few words as possible.
0
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another 11 pencil

19
0.95
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another pencil.

What is the answer to "What is left in the box?". Count how many times each answer appears, reason step-by-step and respond in as few words as possible.
0
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another 11 penci

19
1
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another pencil.

What is the answer to "What is left in the box?". Count how many times each answer appears, reason step-by-step and respond in as few words as possible.
0
In the text below we have 5 answers to the same question. You will need to choose the popular option in an extremely short answer below.

ANSWERS: 

 -The box has a pencil, another 10 pencils and a strawberry.
 -The box has 21 pencils and a strawberry.
 -The box has a pencil, a strawberry, and another 10 pencils.
 -The box has 21 pencils and a strawberry.
 -The box has 10 pencils, a strawberry, and another 11 pencils.

19


In [49]:
df_results = pd.DataFrame(list(zip(temp_list, score_list)), columns =['Temp', 'Score'])
df_results

Unnamed: 0,Temp,Score
0,0.2,0.4
1,0.4,0.35
2,0.6,0.45
3,0.8,0.45
4,0.9,0.3
5,0.95,0.4
6,1.0,0.4
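
To read the table above more easily, here is a small sketch that recomputes an accuracy from a list of verdicts and lists the temperatures that tie for the best score. The helper names `accuracy` and `best_temperatures` are hypothetical; the `results` pairs are copied from the table.

```python
def accuracy(verdicts):
    """Fraction of verdicts that are exactly 'Yes' (0.0 for an empty list)."""
    if not verdicts:
        return 0.0
    return sum(v == "Yes" for v in verdicts) / len(verdicts)


def best_temperatures(results):
    """Return every temperature that ties for the highest score."""
    top = max(score for _, score in results)
    return [temp for temp, score in results if score == top]


# (temperature, score) pairs copied from the results table
results = [(0.2, 0.4), (0.4, 0.35), (0.6, 0.45), (0.8, 0.45),
           (0.9, 0.3), (0.95, 0.4), (1.0, 0.4)]
print(best_temperatures(results))
```

Dividing by the length of the list instead of a fixed 20 keeps the score correct if questions are added or removed.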


In [50]:
df_maj

Unnamed: 0,question_number,question,answer_1,answer_2,answer_3,answer_4,answer_5,validation,CoT_0.2,Val_0.2,CoT_0.4,Val_0.4,CoT_0.6,Val_0.6,CoT_0.8,Val_0.8,CoT_0.9,Val_0.9,CoT_0.95,Val_0.95,CoT_1,Val_1
0,1,What is left in the box?,"The box has a pencil, another 10 pencils and a strawberry.",The box has 21 pencils and a strawberry.,"The box has a pencil, a strawberry, and another 10 pencils.",The box has 21 pencils and a strawberry.,"The box has 10 pencils, a strawberry, and another pencil.",The box has 11 pencils and a strawberry.,"The box has a pencil, another 10 pencils and a strawberry.",No,"The box has a pencil, another 10 pencils and a strawberry.",No,"The box has a pencil, another 10 pencils and a strawberry.",Yes,"The box has a pencil, another 10 pencils and a strawberry.",No,"The box has a pencil, another 10 pencils and a strawberry.",No,"The box has a pencil, another 10 pencils and a strawberry.",Yes,"The most popular answer is ""The box has a pencil, another 10 pencils, and a strawberry.""",No
1,2,What is left in the box?,"The box has a pencil, another 10 pencils and a strawberry.",The box has 21 pencils and a strawberry.,"The box has a pencil, a strawberry, and another 10 pencils.",The box has 21 pencils and a strawberry.,"The box has 10 pencils, a strawberry, and another 11 pencils.",The box has 21 pencils and a strawberry.,"The box has a pencil, another 10 pencils and a strawberry.",No,"The box has a pencil, another 10 pencils and a strawberry.",No,"-The box has a pencil, another 10 pencils and a strawberry. -The box has 21 pencils and a strawberry. -The box has a pencil, a strawberry, and another 10 pencils. -The box has 21 pencils and a strawberry. -The box has 10 pencils, a strawberry, and another 11pencils. There are 3 popular answers: -""the boxhas21pencilsandastrawberry"" -""theboxhasapencilandanother10pencilsandastrawberry"" -""theboxhasapencila strawberrryandanother10pencillls""",No,"The box has a pencil, another 10 pencils and a strawberry.",No,"The box has a pencil, another 10 pencils and a strawberry.",No,"The box has a pencil, another 10 pencils and a strawberry.",No,"The popular answer is ""The box has a pencil, another 10 pencils and a strawberry."" This answer appears twice.",Yes
2,3,Should congress pass the law?,The law should go through,The law should be banned,The law should get the greenlight,The law should not be part of the constitution,The law should be implemented,The law should go through,The law should go through.,Yes,The law should go through.,Yes,"The answer should be ""The law should go through"" because it is the most popular option.",Yes,"The answer to ""Should congress pass the law?"" is that the law should go through.",Yes,"The most popular answer is ""the law should go through"" with 2 occurrences.",Yes,"The answer that appears the most is ""The law should go through.""",Yes,The law should go through.,Yes
3,4,Should congress pass the law?,The law should go through,The law should be banned,The law should get the greenlight,The law should not be part of the constitution,The law should not be implemented,The law should not be implemented,"The answer with the most appearances is ""the law should go through.""",No,"The answer is ""the law should go through."" This appears twice in the text.",No,The law should go through.,No,"The answer with the most appearances is ""The law should go through.""",No,"The answer with the most votes is ""the law should go through.""",No,The law should go through.,No,The law should go through.,No
4,5,Did the experiment work correctly?,Yes,No,I don't know,Most likely,Probably,Yes,Most likely,Yes,Most likely,No,Most likely.,Yes,Most likely.,No,I don't know.,No,Yes,Yes,The experiment worked correctly. -Yes,Yes
5,6,Did the experiment work correctly?,Yes,No,I don't know,I'm not sure,Probably not,No,"The answer is ""No."" Three people said ""No,"" one person said ""Probably not,"" and one person said ""I'm not sure.""",No,Yes,Yes,"The most popular answer is ""I don't know"" which appears 3 times.",No,"The answer is ""I don't know.""",No,The experiment probably did not work correctly.,No,"The correct answer is ""Yes."" It appears twice in the text, while all other answers only appear once each.",No,"Out of the answers given, ""Yes"" is the most popular option.",No
6,7,How should I wash my hands?,Use handsoap,Use soap,Use alcohol,Use aloe vera soap,Use shampoo,Soap,"The most popular answer is ""Use soap.""",Yes,"The answer that appears the most is ""use soap.""",Yes,"The most popular answer is ""use soap"".",Yes,"The most popular option to wash one's hands is soap. This is because soap effectively removes dirt, bacteria and other contaminants from the skin.",Yes,Most people say to use soap.,Yes,"The most popular answer is to ""use soap."" This makes sense, as soap is the best way to remove dirt and bacteria from your hands.",Yes,"Soap is the most popular option, appearing 3 times. Alcohol also appears twice.",No
7,8,How should I wash my hands?,Use hair shampoo,Use soap,Use shampoo,Use aloe vera soap,Use shampoo,Shampoo,"The answer is ""Use shampoo.""",No,"The answer is ""use shampoo."" This appears twice in the text.",No,"The answer is to ""use shampoo."" This appears three times out of the five answers, making it the most popular choice.",No,"Shampoo is the most popular answer, appearing 3 times.",Yes,Shampoo,No,Use shampoo.,No,Use shampoo.,No
8,9,What should we eat today?,Pizza,Sushi,Pizza or Sushi,Sushi,Hamburger,Sushi,Pizza and sushi are both popular options.,No,Pizza and sushi are both popular options.,No,Pizza is the most popular option.,No,"The answer to ""What should we eat today?"" is sushi.",Yes,Pizza and sushi are both popular options.,No,Pizza or sushi is the popular answer.,No,"Pizza and sushi are both mentioned twice, so they are the most popular options.",No
9,10,What should we eat today?,Pizza,Sushi,Pizza,Sushi or Pizza,Hamburger,Pizza,Pizza is the most popular option.,No,Pizza and sushi are both popular options.,No,Pizza is the most popular answer.,No,Pizza is the most popular food in the world.,No,Pizza and sushi are both popular answers.,No,Pizza is the most popular option.,No,Pizza and sushi are both popular options.,No


Let's see.  These questions are too tricky:

- Godfather: the answer is actually correct if we consider that it is a trilogy.
- Stock market (both questions): the responses are too varied, I guess.
- "How was the movie?" (the bad one): "not great" is confusing, I guess.
- Experiment (the "No" one): the equiv function gets this wrong, unfortunately.
- Law (the negative one): answering "The law should go through" is incorrect, a strange mistake.
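
Part of the equiv trouble comes from replies that are almost, but not exactly, "Yes" or "No". Below is a minimal sketch of a stricter clean-up step, assuming the reply is free text; `normalize_verdict` is a hypothetical helper, not something the notebook defines. It returns `None` when the reply is ambiguous, so those rows can be inspected by hand instead of silently counting as wrong.

```python
import re


def normalize_verdict(raw):
    """Map a free-text reply to 'Yes', 'No', or None when it is ambiguous."""
    # Keep only alphabetic words so that '"Yes."' and ' yes ' both match
    words = re.findall(r"[a-z]+", raw.lower())
    has_yes = "yes" in words
    has_no = "no" in words
    if has_yes and not has_no:
        return "Yes"
    if has_no and not has_yes:
        return "No"
    return None  # neither or both present


for reply in [' "Yes." ', "No", "Yes or No", "I don't know"]:
    print(repr(reply), "->", normalize_verdict(reply))
```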


## 4. Conclusion

I will park this experiment as a failure.  My expectation was that the reconciliation could be done reliably, with >90% accuracy.  Even being generous with the observations above, this would only give 70-80% accuracy.
It is very possible that prompt engineering would solve a lot of this, so I might revisit this notebook at some point (although with GPT-4 on the horizon, maybe not).