# Introduction:
Whether large language models (LLMs) can reason like humans is a highly contested question in Machine Learning and Natural Language Processing. Human reasoning is multifaceted and takes many forms, including analogical, spatial, causal and moral reasoning, among others. This raises the question of whether LLMs perform equally well across all of these domains. This work investigates the performance of LLMs on different reasoning tasks through experiments that directly use, or draw inspiration from, existing datasets on analogical and spatial reasoning. Additionally, to evaluate the ability of LLMs to reason like humans, I evaluate their performance on more open-ended, natural language questions for analogical, spatial and moral reasoning. My findings indicate that LLMs excel at analogical and moral reasoning, yet struggle to perform as proficiently on spatial reasoning tasks. I believe these experiments are crucial for informing the future development and practical application of LLMs, particularly in contexts that require diverse reasoning proficiencies. By shedding light on the reasoning abilities of LLMs, this study aims to advance our understanding of how they can better emulate human cognitive abilities.

## Imports

In [1]:
%%capture
# openai.Completion (used below) was removed in openai>=1.0, so pin the legacy SDK
!pip install "openai<1.0"

In [2]:
import openai
import os
import re
import random
from copy import copy
from tqdm import tqdm
import pandas as pd
import time
random.seed(42)

# Prompt for the OpenAI API key without echoing it into the notebook output
from getpass import getpass
openai.api_key = getpass("Enter OpenAI API key: ")

## Helper Functions

In [3]:
def gpt3_call(gpt3_type, prompt):
    """Send `prompt` to the completion engine `gpt3_type` and return the raw response."""
    response = openai.Completion.create(
                    engine=gpt3_type,
                    prompt=prompt,
                    temperature=0.4,   # mildly stochastic sampling
                    max_tokens=256,
                    top_p=1,
                    logprobs=1,
                    frequency_penalty=0,
                    presence_penalty=0
                )
    return response

def response2text(response):
  return response['choices'][0]['text']

def format_response(text):
  replacements = ['\n',"."]
  for replacement in replacements:
    text = text.replace(replacement, "")
  return text

gpt3_type = "text-davinci-003"
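The experiment loops below call `gpt3_call` once per row and rely on a fixed `time.sleep` to stay under the rate limit, so a single transient API error would abort a run. A minimal sketch of a retry wrapper with exponential backoff follows. `call_with_retries` and its parameters are hypothetical names that do not appear in the original notebook, and the sketch retries on any exception rather than on specific API error types.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on any exception.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
            sleep(base_delay * 2 ** (attempt - 1))
```

Under these assumptions, a loop body could use `response = call_with_retries(gpt3_call, gpt3_type, prompt)` in place of the direct call.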

#### Mount Drive and define paths

In [1]:
project_path = "/"

# Reasoning:

## 1. Analogical Reasoning

### 1.a Simple One-Word Analogies

In [84]:
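def question_answers(relation_sentence):
  # Each line has the form "<question>\t<answer1>/<answer2>/..."
  question = relation_sentence.split("\t")[0]
  answers = [r for r in relation_sentence.split("\t")[1].split("/") if r != ""]
  return {question:answers}


def get_relations(relation_data):
  all_relations = []
  for relation in relation_data.splitlines():
    # Skip blank lines (e.g. a trailing newline) that would otherwise raise an IndexError
    if relation.strip() == "":
      continue
    all_relations.append(question_answers(relation))
  return all_relations
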
def standardize_text(text):
  replacements = {"_":" ","\n":"",".":"","-":" ","(":" ",")":" ","[":" ","]":" ","{":" ", "}":" "}
  for key, value in replacements.items():
    text = text.replace(key,value)

  text = text.lower()
  return text.strip()

def get_relation_example(relations,relation_type):
  # Draw two distinct entries: (A, B) is the demonstration pair, (C, D) the query
  ex1 = random.choice(relations)
  ex2 = random.choice(relations)
  while ex1 == ex2:
    ex1 = random.choice(relations)

  # Look answers up by the raw key; a standardized key may no longer match the dict
  A_raw = list(ex1.keys())[0]
  C_raw = list(ex2.keys())[0]
  A = standardize_text(A_raw)
  B = standardize_text(random.choice(ex1[A_raw]))
  C = standardize_text(C_raw)
  D = [standardize_text(ex) for ex in ex2[C_raw]]

  relation_example = {"Relation":relation_type, "A":A,"B":B,"C":C,"D":D,"Prediction":""}
  return relation_example

def check_correctness(D_pred,D):
  for word in D_pred.split():
    for item in D:
      if word in item:
        return True
  return False
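Note that `check_correctness` uses a substring test (`word in item`), so a one-letter prediction such as "a" counts as correct whenever any gold answer contains that letter. The sketch below shows a stricter variant that only accepts a gold answer appearing as a complete word sequence in the prediction. `check_correctness_strict` is a hypothetical name; it is not the scorer behind the accuracy reported in this notebook, and switching to it would likely lower that figure.

```python
def check_correctness_strict(D_pred, D):
    """Return True only if a whole gold answer appears as a word sequence in D_pred."""
    pred_tokens = D_pred.lower().split()
    for item in D:
        gold_tokens = item.lower().split()
        n = len(gold_tokens)
        if n == 0:
            continue  # ignore empty gold answers
        # Slide a window of n words over the prediction and compare token by token
        for start in range(len(pred_tokens) - n + 1):
            if pred_tokens[start:start + n] == gold_tokens:
                return True
    return False
```

Multi-word gold answers such as "big cat" must match in full, so a prediction of "cat" alone is rejected by this variant.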


In [6]:
analogical_reasoning = "Datasets/Analogical Reasoning/"
analogical_reasoning_bats = os.path.join(project_path,analogical_reasoning,"BATS_3.0")
lexicography_path = os.path.join(analogical_reasoning_bats,"4_Lexicographic_semantics")

In [7]:
all_relations = {}
# Raw string avoids invalid-escape warnings for \[ and \]; extracts the text inside [...]
get_relation_type = lambda relation_path : re.findall(r".*?\[(.*?)\]",relation_path)[0]

for relation_path in os.listdir(lexicography_path):
  print(relation_path)
  with open(os.path.join(lexicography_path,relation_path),"r") as f:
    relation_data = f.read()
  all_relations[get_relation_type(relation_path)] = get_relations(relation_data)


L01 [hypernyms - animals].txt
L02 [hypernyms - misc].txt
L03 [hyponyms - misc].txt
L04 [meronyms - substance].txt
L05 [meronyms - member].txt
L07 [synonyms - intensity].txt
L08 [synonyms - exact].txt
L09 [antonyms - gradable].txt
L10 [antonyms - binary].txt
L06 [meronyms - part].txt


### Preparing experimental Setup

In [8]:
random.seed(84)
num_examples = 20
test_data = []
for relation_type, relations in all_relations.items():
  for i in tqdm(range(num_examples)):
    test_data.append(get_relation_example(relations, relation_type))

100%|██████████| 20/20 [00:00<00:00, 3463.65it/s]
100%|██████████| 20/20 [00:00<00:00, 10764.29it/s]
100%|██████████| 20/20 [00:00<00:00, 8306.37it/s]
100%|██████████| 20/20 [00:00<00:00, 20500.02it/s]
100%|██████████| 20/20 [00:00<00:00, 32239.08it/s]
100%|██████████| 20/20 [00:00<00:00, 22580.37it/s]
100%|██████████| 20/20 [00:00<00:00, 24825.71it/s]
100%|██████████| 20/20 [00:00<00:00, 15279.80it/s]
100%|██████████| 20/20 [00:00<00:00, 19854.69it/s]
100%|██████████| 20/20 [00:00<00:00, 10120.17it/s]


In [9]:
df = pd.DataFrame(test_data)
df['Correct'] = None
df

Unnamed: 0,Relation,A,B,C,D,Prediction,Correct
0,hypernyms - animals,velociraptor,dinosaur,duck,"[fowl, bird, vertebrate, poultry, creature, be...",,
1,hypernyms - animals,lion,craniate,allosaurus,"[dinosaur, reptile, bird, archosaur, archosaur...",,
2,hypernyms - animals,tiger,placental,falcon,"[raptor, bird, vertebrate, creature, beast, be...",,
3,hypernyms - animals,jaguar,big cat,lion,"[feline, cat, beast, animal, organism, fauna, ...",,
4,hypernyms - animals,vulture,being,viper,"[snake, reptile, snake, serpent, ophidian]",,
...,...,...,...,...,...,...,...
195,meronyms - part,orthography,punctuation mark,jail,"[cell, cellblock, guard, police, prison cell, ...",,
196,meronyms - part,radio,wireless,deer,"[antler, antlers, withers, flag, scut]",,
197,meronyms - part,deer,antler,sword,"[blade, forte, hilt, peak, foible, point, pomm...",,
198,meronyms - part,dress,zipper,chair,"[seat, armrest, headrest, armrests, rest, pad,...",,


In [None]:
prompts = ["{} is to {} as {} is to ___.",
           """Fill in the blank: {} is to {} as {} is to ___. Answer:"""]

prompt = prompts[1]

for i in range(df.shape[0]):
  # Skip rows that already have a prediction so an interrupted run can be resumed
  if df.loc[i, 'Prediction'] != "":
    continue
  A = df.loc[i, 'A']
  B = df.loc[i, 'B']
  C = df.loc[i, 'C']
  D = df.loc[i, 'D']
  response = gpt3_call(gpt3_type, prompt.format(A,B,C))
  D_pred = standardize_text(response2text(response))
  # .loc writes into the DataFrame itself instead of a possible copy
  df.loc[i, 'Prediction'] = D_pred
  print(f"[{i}/{df.shape[0]}] ->", prompt.format(A,B,C),"--->",D_pred)
  is_correct = check_correctness(D_pred,D)
  if not is_correct:
    print("Incorrect prediction")
  df.loc[i, 'Correct'] = int(is_correct)

  time.sleep(2.5)  # throttle requests to the API
  print()

In [None]:
result_path = os.path.join(project_path,analogical_reasoning,"BATS_exp2_results.csv")
# df.to_csv(result_path)

In [None]:

accuracy = df.Correct.sum()/df.shape[0]
print(accuracy)

0.53
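The single accuracy figure above averages over all ten relation types. A per-relation breakdown can be sketched as follows, assuming the rows are available as plain dictionaries (for example via `df.to_dict("records")`). `accuracy_by_group` is a hypothetical helper that is not part of the original notebook.

```python
from collections import defaultdict

def accuracy_by_group(rows, group_key="Relation", correct_key="Correct"):
    """Return {group: fraction correct}, ignoring rows that were never scored."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        if row[correct_key] is None:
            continue  # skip rows without a prediction
        totals[row[group_key]] += 1
        hits[row[group_key]] += int(row[correct_key])
    return {group: hits[group] / totals[group] for group in totals}
```

A breakdown like this would show whether errors concentrate in particular relation types, such as meronyms versus antonyms.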


### 1.b Human Like Complex Analogies

'foot'

## 2. Spatial Reasoning

In [113]:
def parse_gpt_answer(pred_answer):
  if "false" in pred_answer:
    return False
  elif "true" in pred_answer:
    return True
  else:
    return "Undeterminable"

### 2.a Toy Examples

In [142]:
#YES/NO/Don't Know Examples

task_description = """The description of Shape1, Shape2, Shape3 is provided. On the basis of this information determine whether Shape4 will continue the pattern?"""
answer_prompt = "\nAnswer (True/False):"
#TYPE 1
q1 = """
Shape1: Triangle
Shape2: Square
Shape3: Pentagon
Shape4: Heptagon
"""
q1_a = False

q2 = """
Shape1: Triangle
Shape2: Square
Shape3: Pentagon
Shape4: Hexagon
"""
q2_a = True

q3 = """
Shape1: Square
Shape2: Pentagon
Shape3: Hexagon
Shape4: Heptagon
"""
q3_a = True

q4 = """
Shape1: Square
Shape2: Pentagon
Shape3: Hexagon
Shape4: Octagon
"""
q4_a = False

#TYPE 2
q5 = """
Shape1: A triangle with 2 dots in it.
Shape2: A square with 2 dots in it.
Shape3: A pentagon with 2 dots in it.
Shape4: A hexagon with 2 dots in it.
"""
q5_a = True

q6 = """
Shape1: A triangle with 2 dots in it.
Shape2: A square with 2 dots in it.
Shape3: A pentagon with 2 dots in it.
Shape4: A heptagon with 2 dots in it.
"""
q6_a = False

q7 = """
Shape1: A triangle with 2 dots in it.
Shape2: A square with 3 dots in it.
Shape3: A pentagon with 4 dots in it.
Shape4: A hexagon with 5 dots in it.
"""
q7_a = True

q8 = """
Shape1: A triangle with 2 dots in it.
Shape2: A square with 3 dots in it.
Shape3: A pentagon with 4 dots in it.
Shape4: A heptagon with 5 dots in it.
"""
q8_a = False

q9 = """
Shape1: A triangle with 2 dots in it.
Shape2: A square with 3 dots in it.
Shape3: A pentagon with 4 dots in it.
Shape4: A hexagon with 6 dots in it.
"""
q9_a = False

q10 = """
Shape1: A triangle with 2 dots in it.
Shape2: A square with 4 dots in it.
Shape3: A pentagon with 6 dots in it.
Shape4: A hexagon with 8 dots in it.
"""
q10_a = True

q11 = """
Shape1: A triangle with 2 dots in it.
Shape2: A triangle with 4 dots in it.
Shape3: A triangle with 6 dots in it.
Shape4: A triangle with 10 dots in it.
"""
q11_a = False

q12 = """
Shape1: A triangle with 2 dots in it.
Shape2: A triangle with 3 dots in it.
Shape3: A triangle with 4 dots in it.
Shape4: A square with 6 dots in it.
"""
q12_a = False

q13 = """
Shape1: A triangle with 3 dots in it.
Shape2: A square with 4 dots in it.
Shape3: A hexagon with 6 dots in it.
Shape4: A octagon with 8 dots in it.
"""
q13_a = True

q14 = """
Shape1: A triangle with 3 dots in it.
Shape2: A square with 4 dots in it.
Shape3: A hexagon with 6 dots in it.
Shape4: A octagon with 7 dots in it.
"""
q14_a = False

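# TYPE 3 - color
# NOTE: this reuses the name q14, so the TYPE 2 q14 defined above is overwritten
# and never reaches spatial_qa; the 20 evaluated questions include only this version.
q14 = """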
Shape1: A square with 1 red dot.
Shape2: A square with 1 black dot.
Shape3: A square with 1 red dot.
Shape4: A square with 1 black dot.
"""
q14_a = True

q15 = """
Shape1: A square with 1 red dot.
Shape2: A square with 1 black dot.
Shape3: A square with 1 red dot.
Shape4: A square with 1 red dot.
"""
q15_a = False

q16 = """
Shape1: A square with 1 red dot.
Shape2: A square with 2 black dots.
Shape3: A square with 3 red dots.
Shape4: A square with 4 black dots.
"""
q16_a = True

q17 = """
Shape1: A square with 1 red dot.
Shape2: A square with 2 black dots.
Shape3: A square with 3 red dots.
Shape4: A square with 4 red dots.
"""
q17_a = False

q18 = """
Shape1: A triangle with 3 red dots in it.
Shape2: A square with 4 black dots in it.
Shape3: A pentagon with 5 red dots in it.
Shape4: A hexagon with 6 black dots in it.
"""
q18_a = True

q19 = """
Shape1: A triangle with 3 red dots in it.
Shape2: A square with 4 black dots in it.
Shape3: A pentagon with 5 red dots in it.
Shape4: A hexagon with 6 red dots in it.
"""
q19_a = False

q20 = """
Shape1: A triangle with 3 red dots in it.
Shape2: A square with 4 black dots in it.
Shape3: A hexagon with 6 red dots in it.
Shape4: A octagon with 800 red dots in it.
"""
q20_a = False

#HAD TO CONSTRUCT OUTRAGEOUSLY FALSE EXAMPLES FOR THE MODEL TO PREDICT FALSE

spatial_qa = {q1:q1_a, q2:q2_a, q3:q3_a, q4:q4_a, q5:q5_a, q6:q6_a, q7:q7_a, q8:q8_a, q9:q9_a, q10:q10_a,
              q11:q11_a, q12:q12_a, q13:q13_a, q14:q14_a, q15:q15_a, q16:q16_a, q17:q17_a, q18:q18_a, q19:q19_a, q20:q20_a}
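The ground-truth labels for the Type 1 questions (q1 to q4) can be checked mechanically. The sketch below assumes that "continuing the pattern" means the number of sides changes by a constant step from one shape to the next, which is my reading of those four labels rather than a rule stated in the notebook. `POLYGON_SIDES` and `continues_pattern` are hypothetical names.

```python
# Number of sides for each polygon name used in the Type 1 questions
POLYGON_SIDES = {
    "triangle": 3, "square": 4, "pentagon": 5,
    "hexagon": 6, "heptagon": 7, "octagon": 8,
}

def continues_pattern(shapes):
    """True if the side counts of all shapes form an arithmetic progression."""
    sides = [POLYGON_SIDES[name.lower()] for name in shapes]
    step = sides[1] - sides[0]
    # Every consecutive pair, including Shape3 -> Shape4, must share the same step
    return all(b - a == step for a, b in zip(sides, sides[1:]))
```

Under this assumption the function reproduces `q1_a` through `q4_a`: for instance, Triangle, Square, Pentagon, Heptagon gives side counts 3, 4, 5, 7, which breaks the constant step.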

In [143]:
prompt = task_description + q20 + answer_prompt
print(prompt)
response = gpt3_call(gpt3_type, prompt)
response2text(response)

The description of Shape1, Shape2, Shape3 is provided. On the basis of this information determine whether Shape4 will continue the pattern?
Shape1: A triangle with 3 red dots in it.
Shape2: A square with 4 black dots in it.
Shape3: A hexagon with 6 red dots in it.
Shape4: A octagon with 800 red dots in it.

Answer (True/False):


' False'

In [146]:
test_dataset = []

for q,a in spatial_qa.items():
  data_row = {"Prefix":task_description,"Question":q,"GT Answer":a, "Prediction":""}
  test_dataset.append(data_row)

df_spatial_test = pd.DataFrame(test_dataset)
df_spatial_test['Correct'] = None
df_spatial_test

Unnamed: 0,Prefix,Question,GT Answer,Prediction,Correct
0,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: Triangle\nShape2: Square\nShape3: Pe...,False,,
1,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: Triangle\nShape2: Square\nShape3: Pe...,True,,
2,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: Square\nShape2: Pentagon\nShape3: He...,True,,
3,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: Square\nShape2: Pentagon\nShape3: He...,False,,
4,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,True,,
5,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,False,,
6,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,True,,
7,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,False,,
8,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,False,,
9,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,True,,


In [147]:

for i in range(df_spatial_test.shape[0]):

  q = df_spatial_test.loc[i, 'Question']
  a = df_spatial_test.loc[i, 'GT Answer']
  prompt = task_description + q + answer_prompt
  response = gpt3_call(gpt3_type, prompt)
  pred_answer = standardize_text(response2text(response))
  parsed_answer = parse_gpt_answer(pred_answer)  # True, False, or "Undeterminable"
  print(f"[{i}/{df_spatial_test.shape[0]}]\n{prompt} --> {pred_answer} | Actual Answer: {a}")
  if parsed_answer == a:
    correctness = 1
  else:
    correctness = 0
    print("Incorrect")
  print("-"*20)

  # .loc writes into the DataFrame itself and avoids SettingWithCopyWarning
  df_spatial_test.loc[i, 'Prediction'] = parsed_answer
  df_spatial_test.loc[i, 'Correct'] = correctness
  time.sleep(2)


[0/20]
The description of Shape1, Shape2, Shape3 is provided. On the basis of this information determine whether Shape4 will continue the pattern?
Shape1: Triangle
Shape2: Square
Shape3: Pentagon
Shape4: Heptagon

Answer (True/False): --> true | Actual Answer: False
Incorrect
--------------------


[1/20]
The description of Shape1, Shape2, Shape3 is provided. On the basis of this information determine whether Shape4 will continue the pattern?
Shape1: Triangle
Shape2: Square
Shape3: Pentagon
Shape4: Hexagon

Answer (True/False): --> true | Actual Answer: True
--------------------
[2/20]
The description of Shape1, Shape2, Shape3 is provided. On the basis of this information determine whether Shape4 will continue the pattern?
Shape1: Square
Shape2: Pentagon
Shape3: Hexagon
Shape4: Heptagon

Answer (True/False): --> true | Actual Answer: True
--------------------
[3/20]
The description of Shape1, Shape2, Shape3 is provided. On the basis of this information determine whether Shape4 will continue the pattern?
Shape1: Square
Shape2: Pentagon
Shape3: Hexagon
Shape4: Octagon

Answer (True/False): --> true | Actual Answer: False
Incorrect
--------------------
[4/20]
The description of Shape1, Shape2, Shape3 is provided. On the basis of this information determine whether Shape4 will continu

In [148]:
df_spatial_test

Unnamed: 0,Prefix,Question,GT Answer,Prediction,Correct
0,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: Triangle\nShape2: Square\nShape3: Pe...,False,True,0
1,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: Triangle\nShape2: Square\nShape3: Pe...,True,True,1
2,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: Square\nShape2: Pentagon\nShape3: He...,True,True,1
3,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: Square\nShape2: Pentagon\nShape3: He...,False,True,0
4,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,True,True,1
5,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,False,True,0
6,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,True,True,1
7,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,False,True,0
8,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,False,True,0
9,"The description of Shape1, Shape2, Shape3 is p...",\nShape1: A triangle with 2 dots in it.\nShape...,True,True,1


In [152]:
spatial_reasoning = "Datasets/Spatial Reasoning/"
spatial_reasoning_toy = os.path.join(project_path,spatial_reasoning,"Toy")
# Create the output directory and any missing parents; no error if it already exists
os.makedirs(spatial_reasoning_toy, exist_ok=True)


In [153]:
df_spatial_test.to_csv(os.path.join(spatial_reasoning_toy,"Spatial_exp1.csv"))

In [155]:

accuracy = df_spatial_test.Correct.sum()/df_spatial_test.shape[0]
print(accuracy)

0.55
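To put this figure in context, it helps to compare it with a majority-class baseline, that is, the accuracy of always answering with the most common ground-truth label. The sketch below assumes `gt_labels` mirrors the answers stored in `spatial_qa` (q1 to q20, using the color version of q14). `majority_baseline` is a hypothetical helper that is not part of the original notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy obtained by always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Ground-truth answers transcribed from q1_a ... q20_a (assumed to match spatial_qa)
gt_labels = [False, True, True, False, True, False, True, False, False, True,
             False, False, True, True, False, True, False, True, False, False]

label, baseline_accuracy = majority_baseline(gt_labels)
print(label, baseline_accuracy)  # prints: False 0.55
```

With 11 of the 20 labels being False, always answering False would also score 0.55, so the accuracy above is no better than this trivial baseline on the toy set.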


### 2.b Human examples

In [None]:
free_form_questions = [
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    "",
    ""
]