### DISCLAIMER

This system is based on the methodology proposed by [AI Explained](https://www.youtube.com/watch?v=wVzuvf9D9BU).

In [23]:
import os
from dotenv import load_dotenv
load_dotenv()

from langchain import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.schema import (
    AIMessage,
    HumanMessage,
    SystemMessage
)

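# load_dotenv() has already copied the values from .env into os.environ;
# fail early with a readable message if the key is still missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")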

### Creating Smart GPT Prompts

The system works in 3 stages:

1. Generate several candidate solutions using chain-of-thought (CoT) reasoning [paper 1](https://arxiv.org/pdf/2305.02897.pdf)
2. Use a reflection prompt to analyse each of the solutions [paper 2](https://arxiv.org/pdf/2303.11366.pdf)
3. Use a resolver to analyse the reflections and extract the answer; this is done through dialogue-enabled resolving [paper 3](https://arxiv.org/pdf/2303.17071.pdf)

First, we create 3 prompts representing this flow, plus an extraction prompt that reads off the final answer letter.
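The three stages can be sketched without any library, which may help when reading the LangChain cells that follow. This is a minimal sketch, not the notebook's implementation: `smart_gpt_sketch`, the `LLM` type, and `echo_llm` are hypothetical names, the prompts are abbreviated, and `llm` is assumed to be any function that maps a prompt string to a reply string.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a reply string.
LLM = Callable[[str], str]


def smart_gpt_sketch(question: str, llm: LLM, num_options: int = 3) -> str:
    """Run the three stages with a generic `llm` callable (an assumption, not LangChain's API)."""
    # Stage 1: sample several chain-of-thought answers to the same question.
    cot_prompt = (
        f"{question}\nAnswer: Let's work this out in a step by step way "
        "to be sure we have the right answer."
    )
    options: List[str] = [llm(cot_prompt) for _ in range(num_options)]

    # Stage 2: ask for a critique of every candidate answer.
    labelled = "\n\n".join(f"Option {i + 1}:\n{o}" for i, o in enumerate(options))
    reflection = llm(f"List the flaws and faulty logic of each answer option.\n\n{labelled}")

    # Stage 3: ask a resolver to pick the best option and improve it.
    return llm(
        "Find the option the researcher thought was best, improve it, "
        f"and print it in full.\n\n{reflection}"
    )


# Usage with a stub model that only records the prompts it receives.
calls: List[str] = []


def echo_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


final = smart_gpt_sketch("What is 2 + 2?", echo_llm, num_options=2)
print(final)       # reply 4
print(len(calls))  # 4 calls: 2 options, 1 reflection, 1 resolver
```

Passing the model in as a plain callable keeps the control flow visible and makes it easy to test the wiring with a stub before spending API calls.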

In [24]:
chain_of_though_template = """
{question} Question . Answer: Let's work this out in a step by step way to be sure we have the right answer.""
"""

chain_of_though_prompt = PromptTemplate(
    input_variables=["question"],
    template=chain_of_though_template,
)
chain_of_though_prompt.format(question="What is the meaning of life?")

'\nWhat is the meaning of life? Question . Answer: Let\'s work this out in a step by step way to be sure we have the right answer.""\n'

In [25]:
reflextion_template = """
You are a researcher tasked with investigating the {options} response options provided. 
List the flaws and faulty logic of each answer options. 
Let's work this out in a step by step way to be sure we have all the errors:
"""

reflextion_prompt = PromptTemplate(
    input_variables=["options"],
    template=reflextion_template,
)
reflextion_prompt.format(options="\n".join(["Option 1", "Option 2"]))

"\nYou are a researcher tasked with investigating the Option 1\nOption 2 response options provided. \nList the flaws and faulty logic of each answer options. \nLet's work this out in a step by step way to be sure we have all the errors:\n"

In [26]:
resolver_template = """
You are a resolver task with 
1) finding which of the {researcher_analysis} answers options the researcher thought was the best
2) improving that answer, and
3) Printing the improved answer in full
Let's work this out in a step by step way to be sure we have the right answer:
"""

resolver_prompt = PromptTemplate(
    input_variables=["researcher_analysis"],
    template=resolver_template,
)
resolver_prompt.format(researcher_analysis="researcher_analysis")

"\nYou are a resolver task with \n1) finding which of the researcher_analysis answers options the researcher thought was the best\n2) improving that answer, and\n3) Printing the improved answer in full\nLet's work this out in a step by step way to be sure we have the right answer:\n"

In [27]:
extraction_template = """
You are an extraction tool designed to check which answer a user has given.

Given the question: {question}

The working out of the solution by the user is:\n {solution}

Extract which answer the user thought was the correct answer. 

Please respond with no addtional text, only: A, B, C or D.
"""

extraction_prompt = PromptTemplate(
    input_variables=["question", "solution"],
    template=extraction_template,
)
extraction_prompt.format(question="is it, a, b, c or d", solution="The answer is A")

'\nYou are an extraction tool designed to check which answer a user has given.\n\nGiven the question: is it, a, b, c or d\n\nThe working out of the solution by the user is:\n The answer is A\n\nExtract which answer the user thought was the correct answer. \n\nPlease respond with no addtional text, only: A, B, C or D.\n'

### Set Up Chat Agent

In [28]:
chat_llm = ChatOpenAI(temperature=0.9, model="gpt-4")
llm_result = chat_llm([HumanMessage(content="Translate this sentence from English to French. I love programming.")])
llm_result

AIMessage(content="J'aime la programmation.", additional_kwargs={}, example=False)

### Experimentation

The following cells walk through the Smart GPT flow one prompt at a time.

In [29]:
question1 = """
What is the embryological origin of the hyoid bone?
(A) The first pharyngeal arch
(B) The first and second pharyngeal arches
(C) The second pharyngeal arch
(D) The second and third pharyngeal arches
"""

question2 = """
In a given population, 1 out of every 400 people has a cancer caused by a completely recessive
allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is
the expected proportion of individuals who carry the b allele but are not expected to develop the
cancer?
(A) 1/400 (B) 19/400 (C) 20/400 (D) 38/400
"""

chain = LLMChain(llm=chat_llm, prompt=chain_of_though_prompt, verbose=True)
result = chain.run({"question": question2})
result

> Entering new LLMChain chain...
Prompt after formatting:

In a given population, 1 out of every 400 people has a cancer caused by a completely recessive
allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is
the expected proportion of individuals who carry the b allele but are not expected to develop the
cancer?
(A) 1/400 (B) 19/400 (C) 20/400 (D) 38/400
 Question . Answer: Let's work this out in a step by step way to be sure we have the right answer.""

> Finished chain.


'First, we need to find the frequency of the recessive allele (b) in the population. Since 1 out of every 400 people has the cancer caused by the recessive allele, we can infer that these individuals are homozygous for the recessive allele (bb). Therefore, the frequency of the bb genotype is 1/400.\n\nNow we can use the Hardy-Weinberg equilibrium formula to find the allele frequency. The formula is:\n\np^2 + 2pq + q^2 = 1\n\nwhere p is the frequency of the dominant allele (B), q is the frequency of the recessive allele (b), p^2 represents the frequency of the BB genotype, 2pq represents the frequency of the Bb genotype, and q^2 represents the frequency of the bb genotype.\n\nWe know that:\n\nq^2 = 1/400\n\nTo find the frequency of the recessive allele (q) we take the square root of q^2:\n\nq = sqrt(1/400) = 1/20\n\nNow we can find the frequency of the dominant allele (p) using the equation p = 1 - q:\n\np = 1 - 1/20 = 19/20\n\nNext, we need to find the expected proportion of individual
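The Hardy-Weinberg arithmetic for this question can be checked independently of the model. The sketch below uses exact fractions and assumes only the numbers stated in the question (1 in 400 affected, completely recessive allele); the variable names are illustrative.

```python
from fractions import Fraction
from math import isqrt

# From the question: 1 in 400 people are affected, so genotype bb has frequency q^2.
q_squared = Fraction(1, 400)

# q = sqrt(q^2); numerator and denominator are both perfect squares here.
q = Fraction(isqrt(q_squared.numerator), isqrt(q_squared.denominator))
p = 1 - q

# Carriers who are not affected are the heterozygotes, with frequency 2pq.
carriers = 2 * p * q
print(q, p, carriers)  # 1/20 19/20 19/200
```

`Fraction` reduces 38/400 to 19/200, so the result corresponds to option (D).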

In [30]:
question = question2
num_options = 2

options = []
for i in range(num_options):
  options.append(chain.run({"question": question}))
options

> Entering new LLMChain chain...
Prompt after formatting:

In a given population, 1 out of every 400 people has a cancer caused by a completely recessive
allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is
the expected proportion of individuals who carry the b allele but are not expected to develop the
cancer?
(A) 1/400 (B) 19/400 (C) 20/400 (D) 38/400
 Question . Answer: Let's work this out in a step by step way to be sure we have the right answer.""

> Finished chain.

> Entering new LLMChain chain...
Prompt after formatting:

In a given population, 1 out of every 400 people has a cancer caused by a completely recessive
allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is
the expected proportion of individuals who carry the b allele but are not expected to develop the
cancer?
(A) 1/400 (B) 19/400 (C) 20/400 (D) 38/400
 Question . Answer: Let

["Step 1: Identify the frequency of the recessive genotype (bb) in the population.\nSince 1 out of 400 people has the cancer caused by the recessive allele, the frequency of the recessive genotype (bb) is 1/400.\n\nStep 2: Calculate the frequency of the recessive allele (b) in the population.\nWe can use the Hardy-Weinberg equation for the frequency of a recessive genotype: p^2 + 2pq + q^2 = 1, where p is the frequency of the dominant allele and q is the frequency of the recessive allele. In this case, the recessive genotype (bb) has a frequency of q^2 = 1/400. To find the frequency of the recessive allele (q), we take the square root of the frequency of the recessive genotype: q = sqrt(1/400) = 1/20.\n\nStep 3: Calculate the frequency of individuals who are carriers of the recessive allele (Bb).\nThe frequency of the heterozygous genotype (Bb) is given by 2pq in the Hardy-Weinberg equation. We've already calculated the frequency of the recessive allele (q) as 1/20. To find the frequen

In [31]:
reflextion_chain = LLMChain(llm=chat_llm, prompt=reflextion_prompt, verbose=True)
reflextion_result = reflextion_chain.run({"options": options})
reflextion_result

> Entering new LLMChain chain...
Prompt after formatting:
You are a researcher tasked with investigating the ["Step 1: Identify the frequency of the recessive genotype (bb) in the population.\nSince 1 out of 400 people has the cancer caused by the recessive allele, the frequency of the recessive genotype (bb) is 1/400.\n\nStep 2: Calculate the frequency of the recessive allele (b) in the population.\nWe can use the Hardy-Weinberg equation for the frequency of a recessive genotype: p^2 + 2pq + q^2 = 1, where p is the frequency of the dominant allele and q is the frequency of the recessive allele. In this case, the recessive genotype (bb) has a frequency of q^2 = 1/400. To find the frequency of the recessive allele (q), we take the square root of the frequency of the recessive genotype: q = sqrt(1/400) = 1/20.\n\nStep 3: Calculate the frequency of individuals who are carriers of the recessive allele (Bb).\nThe frequency of the heterozygous genotype (Bb) is given b

'Option A: There are no specific answer options provided in the text, so there is no way to list the flaws and faulty logic for option A.\n\nOption B: There are no specific answer options provided in the text, so there is no way to list the flaws and faulty logic for option B.\n\nOption C: There are no specific answer options provided in the text, so there is no way to list the flaws and faulty logic for option C.\n\nOption D: The text provided actually outlines the correct answer (D) 38/400, which is the expected proportion of individuals who carry the b allele but are not expected to develop cancer. There are no flaws or faulty logic listed for this option, as it follows the Hardy-Weinberg equation correctly and arrives at the correct answer.'
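In the run above, `options` is a Python list, so the formatted prompt contains the list's repr (`["Step 1: ...`) rather than clearly separated answers, and the question itself is not included. That may be one reason the researcher reports that no answer options were provided. A minimal sketch of one way to render the inputs as labelled text before formatting the prompt; `format_options` is a hypothetical helper, not part of LangChain.

```python
from typing import List


def format_options(question: str, options: List[str]) -> str:
    """Hypothetical helper: render the question and candidate answers as labelled text."""
    blocks = [f"Option {i}:\n{text.strip()}" for i, text in enumerate(options, start=1)]
    return f"Question:\n{question.strip()}\n\n" + "\n\n".join(blocks)


formatted = format_options("Is it A or B?", ["Step 1 ... so A", "Step 1 ... so B"])
print(formatted)
```

The resulting string could then be passed as the `options` input of the reflection chain in place of the raw list.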

In [32]:
resolver_chain = LLMChain(llm=chat_llm, prompt=resolver_prompt, verbose=True)
resolved_result = resolver_chain.run({"researcher_analysis": reflextion_result})
resolved_result

> Entering new LLMChain chain...
Prompt after formatting:
You are a resolver task with 
1) finding which of the Option A: There are no specific answer options provided in the text, so there is no way to list the flaws and faulty logic for option A.

Option B: There are no specific answer options provided in the text, so there is no way to list the flaws and faulty logic for option B.

Option C: There are no specific answer options provided in the text, so there is no way to list the flaws and faulty logic for option C.

Option D: The text provided actually outlines the correct answer (D) 38/400, which is the expected proportion of individuals who carry the b allele but are not expected to develop cancer. There are no flaws or faulty logic listed for this option, as it follows the Hardy-Weinberg equation correctly and arrives at the correct answer. answers options the researcher thought was the best
2) improving that answer, and
3) Printing the improved answer in

"1) The researcher concluded that the best answer is Option D: 38/400 individuals carry the b allele but are not expected to develop cancer.\n\n2) To improve this answer, let's provide a clearer explanation and reasoning behind it. \n\nThe researcher used the Hardy-Weinberg equation, which helps predict the frequency of different genotypes in a population at equilibrium. In this case, the frequency of the b allele (q) is 0.1, and the frequency of the B allele (p) is 0.9. To find the frequency of individuals carrying the b allele but not developing cancer, we need to calculate the frequency of the Bb heterozygous genotype (2pq). \n\nUsing the values given:\n\n2pq = 2 * 0.9 * 0.1 = 0.18\n\nSince the population consists of 400 individuals, we can find the number of individuals carrying the b allele but not expected to develop cancer by multiplying the frequency by the total population:\n\n0.18 * 400 = 72 individuals\n\nHowever, this includes both individuals with the Bb genotype and bb ge

In [34]:
extraction_chain = LLMChain(llm=chat_llm, prompt=extraction_prompt, verbose=True)
final_answer = extraction_chain.run({"solution": resolved_result, "question": question})
final_answer

> Entering new LLMChain chain...
Prompt after formatting:
You are an extraction tool designed to check which answer a user has given.

Given the question: 
In a given population, 1 out of every 400 people has a cancer caused by a completely recessive
allele, b. Assuming the population is in Hardy-Weinberg equilibrium, which of the following is
the expected proportion of individuals who carry the b allele but are not expected to develop the
cancer?
(A) 1/400 (B) 19/400 (C) 20/400 (D) 38/400


The working out of the solution by the user is:
 1) The researcher concluded that the best answer is Option D: 38/400 individuals carry the b allele but are not expected to develop cancer.

2) To improve this answer, let's provide a clearer explanation and reasoning behind it. 

The researcher used the Hardy-Weinberg equation, which helps predict the frequency of different genotypes in a population at equilibrium. In this case, the frequency of the b allele (q) is 0.1, and t

'D'
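The extraction step returns free text, and the evaluation later compares it to the solution letter with `==`, so a reply such as `(D)` or `D.` would be scored as wrong. A small normalisation step could guard against that. The sketch below is an assumption about how one might do it; `parse_choice` is a hypothetical helper and deliberately returns `None` when the reply is ambiguous.

```python
import re
from typing import Optional


def parse_choice(raw: str) -> Optional[str]:
    """Hypothetical helper: return 'A'-'D' if the reply names exactly one choice, else None."""
    text = raw.strip()
    # Bare replies such as "d", "(B)" or "C." are accepted in either case.
    if re.fullmatch(r"\(?[a-dA-D]\)?[.:]?", text):
        return re.search(r"[a-dA-D]", text).group().upper()
    # In longer replies, only standalone capital letters count, so the article "a" is ignored.
    letters = set(re.findall(r"\b([A-D])\b", text))
    return letters.pop() if len(letters) == 1 else None


print(parse_choice("D"))           # D
print(parse_choice("(b)"))         # B
print(parse_choice("Answer: C."))  # C
print(parse_choice("A or B"))      # None
```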

### Whole Flow in One Go

*Note: The evaluation is based on the [MMLU](https://github.com/hendrycks/test) benchmark. For this to work, download the eval data and extract it at the root of this directory, so that `./data/test/formal_logic_test.csv` exists.*

In [10]:
def smart_gpt(question, num_options=3, GPT4=False, verbose=False):
  print("💭 {}\n\n".format(question))
  question_model = "gpt-4" if GPT4 else "gpt-3.5-turbo"
  print("Using model: {}".format(question_model))
  # Reasoning stages sample at a high temperature to get diverse options.
  reasoning_llm = ChatOpenAI(temperature=0.9, model=question_model)
  # Letter extraction uses a deterministic model.
  extraction_llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

  # Stage 1: generate several chain-of-thought solutions.
  options = []
  option_chain = LLMChain(llm=reasoning_llm, prompt=chain_of_though_prompt, verbose=verbose)
  for i in range(num_options):
    option = option_chain.run({"question": question})
    print("🙋 Option {}:\n\n{}\n".format(i, option))
    options.append(option)

  # Stage 2: reflect on the options. Join them into labelled text so the
  # prompt does not receive the repr of a Python list.
  options_text = "\n\n".join("Option {}:\n{}".format(i, o) for i, o in enumerate(options))
  reflextion_chain = LLMChain(llm=reasoning_llm, prompt=reflextion_prompt, verbose=verbose)
  reflextion_result = reflextion_chain.run({"options": options_text})
  print("===> 🤔 Reflextion Results: \n\n{}\n".format(reflextion_result))

  # Stage 3: resolve the reflections into one improved answer.
  resolver_chain = LLMChain(llm=reasoning_llm, prompt=resolver_prompt, verbose=verbose)
  resolved_result = resolver_chain.run({"researcher_analysis": reflextion_result})
  print("===>✅ Resolved Results: \n\n{}\n".format(resolved_result))

  # Extract the final answer letter from the resolved answer.
  extraction_chain = LLMChain(llm=extraction_llm, prompt=extraction_prompt, verbose=verbose)
  final_answer = extraction_chain.run({"question": question, "solution": resolved_result})

  return resolved_result, final_answer.strip()

In [11]:
# resolved_result, res = smart_gpt(question1)
# print(resolved_result, res)

In [21]:
import pandas as pd

def create_df():
  cols = ["question", "A", "B", "C", "D",  "solution"]
  df = pd.read_csv("./data/test/formal_logic_test.csv", header=None)
  df.columns = cols
  df['llm_question'] = df.apply(lambda row: "{}\n\nA: {}\nB: {}\nC: {}\nD: {}\n".format(row["question"], row["A"], row["B"], row["C"], row["D"]), axis=1)
  df['llm_answer'] = pd.Series(dtype='str')
  df['correct'] = pd.Series(dtype='bool')
  return df

df = create_df()
print("Total size of question bank is: {}".format(len(df)))
df.head()

Total size of question bank is: 126


| | question | A | B | C | D | solution | llm_question | llm_answer | correct |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Identify the conclusion of the following argum... | It is hard not to verify in our peers the same... | The arrogance of our consciousness, which in g... | Because of this, it is difficult to convince p... | It is difficult to convince people of the unco... | D | Identify the conclusion of the following argum... | | |
| 1 | Select the best translation into predicate log... | Tdc | Tcd | Tcc | dTc | A | Select the best translation into predicate log... | | |
| 2 | Select the best English interpretation of the... | Some large houses are bigger than some apartme... | Some houses are bigger than all large apartments. | Some large apartments are bigger than some hou... | Some houses are bigger thatn some large apartm... | C | Select the best English interpretation of the... | | |
| 3 | Construct a complete truth table for the follo... | Valid | Invalid. Counterexample when G and H are true | Invalid. Counterexample when G is true and H i... | Invalid. Counterexample when H is true and G i... | A | Construct a complete truth table for the follo... | | |
| 4 | Use the following key to translate the given ... | If it's not the case that both Izzy plays Mine... | If it's both not the case that Izzy plays Mine... | Izzy playing Minecraft and Ashleigh stopping b... | Russell's making dinner is a sufficient condit... | B | Use the following key to translate the given ... | | |


In [22]:
NUM_QUESTIONS = 1
for index, row in df[0:NUM_QUESTIONS].iterrows():
  print("Running question: {}".format(index))
  question = row["llm_question"]

  resolved_result, llm_answer = smart_gpt(question, 3, True, True)
  
  df.at[index, "llm_answer"] = llm_answer
  df.at[index, "correct"] = row["solution"] == llm_answer

Running question: 0
💭 Identify the conclusion of the following argument. It is hard not to verify in our peers the same weakened intelligence due to emotions that we observe in our everyday patients. The arrogance of our consciousness, which in general, belongs to the strongest defense mechanisms, blocks the unconscious complexes. Because of this, it is difficult to convince people of the unconscious, and in turn to teach them what their conscious knowledge contradicts. (Sigmund Freud, The Origin and Development of Psychoanalysis)

A: It is hard not to verify in our peers the same weakened intelligence due to emotions that we observe in our everyday patients.
B: The arrogance of our consciousness, which in general, belongs to the strongest defense mechanisms, blocks the unconscious complexes.
C: Because of this, it is difficult to convince people of the unconscious, and in turn to teach them what their conscious knowledge contradicts.
D: It is difficult to convince people of the uncons

KeyboardInterrupt: 

In [18]:
print("Smart GPT got {}% correct.".format(len(df[df.correct == True]) / NUM_QUESTIONS * 100))

Smart GPT got 0.0% correct.
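The percentage above divides by `NUM_QUESTIONS`, so a question that never received an answer (for example after an interrupted run) is counted as wrong. A sketch of an alternative that scores only answered questions is shown below in plain Python; `accuracy` is a hypothetical helper, and it assumes missing answers have been converted to `None` before the call.

```python
from typing import Iterable, Optional, Tuple


def accuracy(pairs: Iterable[Tuple[str, Optional[str]]]) -> Optional[float]:
    """Percentage correct over (solution, answer) pairs with an answer; None if nothing was answered."""
    answered = [(sol, ans) for sol, ans in pairs if ans is not None]
    if not answered:
        return None
    correct = sum(1 for sol, ans in answered if sol == ans)
    return 100.0 * correct / len(answered)


print(accuracy([("D", "A"), ("A", None), ("C", "C")]))  # 50.0
print(accuracy([("A", None)]))                          # None
```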


In [19]:
df.head()

| | question | A | B | C | D | solution | llm_question | llm_answer | correct |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Identify the conclusion of the following argum... | It is hard not to verify in our peers the same... | The arrogance of our consciousness, which in g... | Because of this, it is difficult to convince p... | It is difficult to convince people of the unco... | D | Identify the conclusion of the following argum... | A | False |
| 1 | Select the best translation into predicate log... | Tdc | Tcd | Tcc | dTc | A | Select the best translation into predicate log... | | |
| 2 | Select the best English interpretation of the... | Some large houses are bigger than some apartme... | Some houses are bigger than all large apartments. | Some large apartments are bigger than some hou... | Some houses are bigger thatn some large apartm... | C | Select the best English interpretation of the... | | |
| 3 | Construct a complete truth table for the follo... | Valid | Invalid. Counterexample when G and H are true | Invalid. Counterexample when G is true and H i... | Invalid. Counterexample when H is true and G i... | A | Construct a complete truth table for the follo... | | |
| 4 | Use the following key to translate the given ... | If it's not the case that both Izzy plays Mine... | If it's both not the case that Izzy plays Mine... | Izzy playing Minecraft and Ashleigh stopping b... | Russell's making dinner is a sufficient condit... | B | Use the following key to translate the given ... | | |
