# Experiment 0529 - How good are GPT models at Fermi estimation (GPT-3.5)

In [None]:
# dependencies
%pip install python-dotenv
# this notebook calls openai.ChatCompletion, which was removed in openai 1.0
%pip install "openai<1.0"
%pip install pandas

In [1]:
import os
from dotenv import load_dotenv, find_dotenv # load keys from .env file
import openai # use OpenAI API
import random
import json
import pandas as pd
import numpy as np
import datetime

_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key  = os.getenv('OPENAI_API_KEY')

In [2]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    """Send a single user prompt, print the reply and return it."""
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
        max_tokens=500, # this is the maximum number of tokens to generate
    )
    content = response.choices[0].message["content"]
    print(">>>>RESPONSE>>>>")
    print(content)
    print("<<<<END<<<<")
    return content

def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=20 # short cap: longer replies are cut off mid-sentence
    )
    return response.choices[0].message["content"]

In [None]:
# Uncomment and Run this code block to get an example output

# prompt_example= f"""What is Fermi Estimation and how does it work?"""
# response = get_completion(prompt_example)

## Prompt Design

In [36]:
# prompt 0 - baseline (no context 0-shot)
def prompt_0(question):
    return f"""\
TASK: Here is a question: {question}. Give your estimation in the order of magnitude 10^x.
REASON: [your embedded reason]
ANSWER: x = 10^[your embedded answer]
"""
# prompt 1 - competence prompt: (simple instruction + 0-shot)
def prompt_1(question):
    return f"""\
TASK: A curious learner will ask you a question:
You know the answer in the order of magnitude. You will teach the learner, step by step, \
how they can work out an estimated answer themselves. You are allowed to make assumptions. Answer in the order of magnitude (10^x) and report value of x in ANSWER. \

QUESTION: {question}
REASON: [your embedded reason]
ANSWER: x = 10^[your embedded answer]
"""

# prompt 2 - (simple context + CoT instruction + 1-shot example)
def prompt_2(question):
    return f"""\
TASK: You are an AI assistant helping people solve problems using Fermi estimation. \
Fermi estimation is a problem-solving technique that involves making rough estimates and approximations to arrive at a reasonable answer.
Here's how you might approach it:
QUESTION: How many piano tuners are there in Chicago? (answer with unit of 'piano tuners')
REASON: 1. First, estimate the population of Chicago. The population of Chicago was about 2.7 million. \
2. Assume an average household size. In the U.S., this is typically around 2.5 people per household. \
Therefore, there are approximately 1.08 million households in Chicago (2.7 million people / 2.5 people per household). \
3. Next, you would estimate the fraction of households with a piano. We can make a conservative estimate and say 1 in every 100 households owns a piano. \
So, there are about 10,800 pianos in Chicago (1.08 million households / 100). \
4. Now, estimate how often a piano needs to be tuned. Professional piano tuners usually recommend tuning a piano at least once or twice a year. \
Let's assume once a year for simplicity, so 10,800 piano tunings are required per year. \
5. Finally, estimate the number of tunings a single piano tuner can perform in a year. If a tuner works 5 days a week and 50 weeks a year, that's 250 workdays. \
Let's say a tuner can handle 2 tunings a day, which means one tuner can handle about 500 tunings in a year.
Now you can estimate the number of piano tuners in Chicago by dividing the total number of annual tunings by the number of tunings per tuner. \
So, there are approximately 21.6 piano tuners (10,800 tunings / 500 tunings per tuner), which is on the order of 10^1.
ANSWER: x = 10^1 piano tuners

QUESTION: {question}
REASON: [your embedded reason]
ANSWER: x = 10^[your embedded answer]
"""

# prompt 3 - explicit step-by-step instruction (simple context + CoT instruction + 0-shot)
def prompt_3(question):
    return f"""\
TASK: You are an AI assistant helping people practice Fermi estimation techniques. \
A human will post a question and ask for help to estimate a quantity in the order of magnitude. \
The estimation should rely on human common sense and knowledge about the world.\

To make a Fermi estimate, you follow these steps in REASON:
1. Identify the final quantity of interest, T, or the target quantity, that needs to be estimated.
2. Analyse which factors determine the target quantity, and name those sub-quantities.
3. Logically link relevant factors together and state your assumptions, to form a mathematical model for computation.
4. Compute and convert your final answer in the order of magnitude (10^x)
You will answer succinctly and report the value of x in ANSWER.\

QUESTION: {question}
REASON: [your embedded reason]
ANSWER: x = 10^[your embedded answer]
"""

# prompt 4 - with classic example (simple context generated by LLM + CoT instructions + 0-shot)
def prompt_4(question):
    return f"""\
TASK: You are an AI assistant helping people practice Fermi estimation techniques. \
Fermi estimation is a problem-solving technique that involves making rough estimates and approximations to arrive at a reasonable answer. \
The basic idea behind Fermi estimation is to break down a complex problem into simpler parts and make educated guesses about the values of the variables involved. \
For example, if you were asked to estimate the number of people in a city, \
you might start by estimating the population of a few neighborhoods and then extrapolating that to the entire city. \
To make a Fermi estimate, you typically follow these steps:
1. Identify the key variables involved in the problem.
2. Make educated guesses about the values of these variables.
3. Use these guesses to calculate an approximate answer.
4. Check your answer for reasonableness and adjust your estimates if necessary.
State your reason followed by your estimated answer in the order of magnitude 10^x.\

QUESTION: {question}
REASON: [your embedded reason]
ANSWER: x = 10^[your embedded answer]
"""
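
The piano-tuner chain used as the one-shot example in `prompt_2` can be checked numerically. The sketch below simply multiplies out the assumptions stated in that example; the variable names are illustrative and not part of the original notebook.

```python
import math

# Inputs copied from the worked example in prompt_2 (all are rough assumptions)
population = 2.7e6                       # people in Chicago
people_per_household = 2.5
piano_fraction = 1 / 100                 # share of households owning a piano
tunings_per_piano_per_year = 1
tunings_per_tuner_per_year = 2 * 5 * 50  # 2 per day, 5 days a week, 50 weeks

households = population / people_per_household
pianos = households * piano_fraction
tunings = pianos * tunings_per_piano_per_year
tuners = tunings / tunings_per_tuner_per_year

# Nearest power of ten for the final estimate
exponent = round(math.log10(tuners))
print(f"{tuners:.1f} piano tuners, i.e. about 10^{exponent}")
```

Keeping each factor as a named variable makes it easy to see how sensitive the final exponent is to any single assumption.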

## Data collection / query the model

In [50]:
# using the filtered output from Allen_realFP
data = pd.read_csv(f"../src/filter_output_true_20230529-171055.csv")
data = data.drop("Unit", axis=1)

seed = 123 # [123, 1234]
# we will sample 20 questions from the filtered output
sampled_data = data.sample(n=20, random_state=seed)
print(sampled_data.head())
# print(df.iloc[353, 2])
# print(df.at[353, 'Unit'])


# sampled_data.apply(lambda row: inference(row), axis=1)


                                              Question Answer   
200  How many nerd ropes would it take to go around...      8  \
166  How many Americans are polled in political pol...      5   
690  What would the file size be on an hour of 22.4...     10   
627              How many tiles are on a room's floor?      2   
673  How loud must a sound be at its origin to be h...      2   

                                       filter_response   
200           E1. True\nE2. Count of numbers\nE3. N.A.  \
166           E1. True\nE2. Count of numbers\nE3. N.A.   
690  E1. True\nE2. Physical quantity with unit (dat...   
627           E1. True\nE2. Count of numbers\nE3. N.A.   
673  E1. True\nE2. Physical quantity with unit (sou...   

                                           filter_dict    E1   
200  {'E1': 'True', 'E2': 'Count of numbers', 'E3':...  True  \
166  {'E1': 'True', 'E2': 'Count of numbers', 'E3':...  True   
690  {'E1': 'True', 'E2': 'Physical quantity with u...  True   
627 

### Experiment with sampled data

In [51]:
# Get the current date and time
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

prompt_list = [prompt_0, prompt_1, prompt_2, prompt_3, prompt_4]

def inference(row, prompt, model='gpt-3.5-turbo', num_trials=3):
    """Query the model with prompt(question); retry up to num_trials times and return "ERROR" if every trial fails."""
    question = row['Question']
    for i in range(num_trials):
        try:
            # get_completion returns the model's reply as a string
            return get_completion(prompt(question), model=model)
        except Exception as e:
            print(f"Encountered an error at question {row.name} in trial {i+1}: {e}")
            if i < num_trials - 1:
                print("Retrying...")
    # all trials failed
    return "ERROR"

result_data = sampled_data.copy()

import time
time_taken_for_each_round = []
# iterate through each question in the sampled question list for the selected prompts
# NOTE: range(-1, 0) yields only index -1, so this runs prompt_list[-1] (prompt_4)
# and stores its output in the column "Prompt_0_result"
for index in range(-1, 0):
    starting_time = time.time()
    experiment_prompt = prompt_list[index]
    result_data[f"Prompt_{index+1}_result"] = sampled_data.apply(lambda row: inference(row, experiment_prompt), axis=1)
    time_taken_for_each_round.append(time.time() - starting_time)
    print(f"COMPLETED PROMPT {index+1}")

print(f"total time taken: {sum(time_taken_for_each_round)} seconds")

# save data to csv with timestamp for post analysis
csv_filename = f"estimation_output_{timestamp}.csv"
result_data.to_csv(csv_filename, index=False)

>>>>RESPONSE>>>>
REASON: To estimate the number of nerd ropes needed to go around the Earth, we need to consider the circumference of the Earth and the length of a single nerd rope.

ANSWER: The circumference of the Earth is approximately 40,075 kilometers (or 40,075,000 meters). A single nerd rope is approximately 25 centimeters (or 0.25 meters) long. To estimate the number of nerd ropes needed to go around the Earth, we can divide the circumference of the Earth by the length of a single nerd rope:

40,075,000 meters / 0.25 meters = 160,300,000 nerd ropes

Therefore, it would take approximately 160 million (or 1.6 x 10^8) nerd ropes to go around the Earth.
<<<<END<<<<
>>>>RESPONSE>>>>
REASON: Political polls are conducted regularly to gauge public opinion on various issues and candidates. It is important for political analysts and strategists to have an idea of the number of people polled in order to make accurate predictions and decisions.

ESTIMATED ANSWER: Based on the population o
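
The retry loop in `inference` retries immediately after a failure. If the failures are rate-limit errors, waiting between trials usually helps. The following is a minimal sketch of exponential backoff; `call_with_retries`, `base_delay` and the injectable `sleep` argument are hypothetical names introduced here, not part of the original notebook.

```python
import time

def call_with_retries(fn, num_trials=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure wait base_delay * 2**attempt seconds and retry.

    Returns the string "ERROR" if every trial fails, mirroring inference().
    """
    for attempt in range(num_trials):
        try:
            return fn()
        except Exception as e:
            print(f"Trial {attempt + 1} failed: {e}")
            if attempt < num_trials - 1:
                sleep(base_delay * 2 ** attempt)
    return "ERROR"

# Possible usage inside the notebook (not executed here):
# response = call_with_retries(lambda: get_completion(prompt_0(question)))
```

Passing `sleep` as a parameter keeps the helper testable without real delays.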

In [30]:
# An agent to handle the task of extracting answers from GPT output
df = pd.read_csv(f"./estimation_output_{timestamp}.csv")

# Step 1: Work on a copy of the dataframe so the raw model outputs stay untouched
df_copy = df.copy()

# Function to apply on each string
def promptExtractOOM(question, context):
    # the prompt for GPT3.5 to extract the specific answer
    return f"""\
TASK: Extract the answer to the given QUESTION from DESCRIPTION. Express the answer in the form of 10^(x). \
If no answer can be inferred from DESCRIPTION, return NULL.
QUESTION: {question}
DESCRIPTION: {context}
ANSWER: [embed your answer]"""

def extract_answer_from_gpt_output(question, context, model="gpt-3.5-turbo", num_trials=3):
    # extract answer from GPT output
    messages =  [  
    {'role':'system', 'content':'You are an assistant to extract an answer from text'},    
    {'role':'user', 'content': promptExtractOOM(question, context)}]
    
    for i in range(num_trials): # max 3 trials
        try:
            # Assuming that get_completion function returns the answer
            response = get_completion_from_messages(messages,  model=model)
            print(response)
            return response
        except Exception as e:
            print(f"Encountered an error in trial {i+1}: {e}")
            # If it's the last trial, don't try again
            if i == num_trials - 1:
                return "ERROR"
            else:
                print("Retrying...")

# Step 2: Extract the answers for the prompt that was run above
# (`index` is left over from the inference loop in the previous cell)

df_copy[f"Prompt_{index+1}_extraction"] = df.apply(lambda row: extract_answer_from_gpt_output(row['Question'], row[f"Prompt_{index+1}_result"]), axis=1)

print(df_copy)


# save data to csv with timestamp for post analysis
csv_filename = f"extraction_output_{timestamp}.csv"
df_copy.to_csv(csv_filename, index=False)
# save flattened data to txt file for easy copy-paste
# txt_filename = f"extraction_output_{timestamp}.txt"
# with open(txt_filename, 'w') as f:
#     for answer in df_copy.values.flatten():
#         f.write(str(answer) + '\n')
#         f.write("========================================\n")

Answer: 10^(8)
ANSWER: 10^(9)
ANSWER: 10^10.02 (approximately)
ANSWER: 10^2 = 100
ANSWER: 10^54.6 (sound intensity at origin to be heard 100 feet
NULL
ANSWER: x = 10^-3 (in the order of magnitude of millijoules
x = 1.5 (10^1.5)
Answer: 10^(1.34) dollars (rounded to two decimal places)
ANSWER: 10^7 (8,320,000 people)
ANSWER: 10^7.5
The answer to the given question is 10^10.
10^9 (one billion)
10^2
ANSWER: 10^2 (100)
ANSWER: 10^-50
ANSWER: 10^(-5) to 10^(-6)
The answer is 10^4, which means it would take 10,000 trout to pull
ANSWER: 10^(0.15) or approximately 3.55 g/cc (rounded
ANSWER: 10^(15)
                                             Question  Answer   
0   How many nerd ropes would it take to go around...     8.0  \
1   How many Americans are polled in political pol...     5.0   
2   What would the file size be on an hour of 22.4...    10.0   
3               How many tiles are on a room's floor?     2.0   
4   How loud must a sound be at its origin to be h...     2.0   
5   What is 
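
The extracted strings above come in several shapes (`10^(8)`, `10^10.02`, `x = 10^-3`, `NULL`), so a numeric exponent still has to be pulled out before scoring. The notebook does this by manual labelling; the sketch below is one possible regex-based alternative. `parse_exponent` is a hypothetical helper, and it assumes the first `10^x` expression in the string is the intended answer.

```python
import re

# Matches forms such as "10^8", "10^(8)", "10^-3", "10^(1.34)", "10^10.02"
_POWER_OF_TEN = re.compile(r"10\^\(?\s*(-?\d+(?:\.\d+)?)\s*\)?")

def parse_exponent(text):
    """Return the first exponent x found in a '10^x' expression, or None."""
    if not isinstance(text, str):
        return None  # e.g. NaN cells read back from a CSV
    match = _POWER_OF_TEN.search(text)
    return float(match.group(1)) if match else None

samples = [
    "Answer: 10^(8)",
    "ANSWER: 10^10.02 (approximately)",
    "ANSWER: x = 10^-3 (in the order of magnitude of millijoules",
    "NULL",
]
for s in samples:
    print(repr(s), "->", parse_exponent(s))
```

Ranges such as `10^(-5) to 10^(-6)` yield only the first exponent, so such rows would still need a manual check.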

In [32]:
# once the answer is obtained, manual labeling is required to determine if the answer is correct or not
df = pd.read_csv(f"./labelled_output_{timestamp}.csv") # vulnerable! timestamp may not be consistent

# Convert columns 2 to 6 to numeric, errors='coerce' turns non-numeric values into NaN
df.iloc[:, 1:6] = df.iloc[:, 1:6].apply(pd.to_numeric, errors='coerce')

# calculate the average of columns 2 to 6
df['Average'] = df.iloc[:, 1:6].mean(axis=1)

def calculate_score(truth, avg):
    # score = max(0, 1 - |error in orders of magnitude| / 3); the absolute value
    # keeps overestimates from scoring above 1
    return max(0, 1 - abs(truth - avg) / 3)

# Calculate performance score based on ground truth (7th column) and average (8th column)
# and store the value in the 'Performance_Score' column
df['Performance_Score'] = df.apply(lambda row: calculate_score(row.iloc[6], row.iloc[7]), axis=1)

for i in range(1, 6):
    df[f'Score_{i}'] = df.apply(lambda row: calculate_score(row.iloc[6], row.iloc[i]), axis=1)
print(df)

# output the average performance score
print(f"Average performance score: {df['Performance_Score'].mean()}")

# output the average performance score for each prompt
for i in range(1, 6):
    print(f"Average performance score for prompt {i}: {df[f'Score_{i}'].mean()}")

                                                                                                                   Question   
0   {'id': 2364, 'name': 'How hot does the superhero, the human torch, have to be to evaporate bullets in flight?', 'ans...  \
1                   {'id': 2147, 'name': 'How many times could you say the alphabet in 24 hours?', 'answer': 4, 'unit': ''}   
2              {'id': 2126, 'name': 'How many bricks are in the exterior of our school building?', 'answer': 6, 'unit': ''}   
3   {'id': 2530, 'name': 'On average and to be approximate, how much carbon emissions will be averted if 30,000 to 50,00...   
4       {'id': 2101, 'name': 'How much volume would 417234 loosely stacked cherries occupy?', 'answer': 6, 'unit': 'cm**3'}   
5   {'id': 2090, 'name': 'The density of water is just about 1000 kg/m3. The density of air is just about 1000 times les...   
6   {'id': 2209, 'name': 'If all of the people in the United States move to Colorado, what would be the populat
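
The scoring step can also be written as a vectorised function over whole columns. This is a sketch that assumes the intended score is max(0, 1 - |truth - prediction| / 3), with both values given as exponents of ten and missing predictions scored as 0. `fp_score` is a hypothetical name.

```python
import numpy as np

def fp_score(truth_exponent, predicted_exponent):
    """Element-wise max(0, 1 - |truth - prediction| / 3); NaN predictions score 0."""
    truth = np.asarray(truth_exponent, dtype=float)
    pred = np.asarray(predicted_exponent, dtype=float)
    score = 1 - np.abs(truth - pred) / 3
    score = np.where(np.isnan(score), 0.0, score)  # unparseable answers get no credit
    return np.clip(score, 0.0, 1.0)

# An exact hit, a 4-order miss, a 1-order miss and a missing answer
print(fp_score([8, 5, 2, 2], [8, 9, 1, np.nan]))
```

A prediction that is off by three or more orders of magnitude in either direction scores 0, and an exact match scores 1.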

## Result inspection and data analysis

In [28]:
#print the entire rows where 'Performance_Score' is effectively 0.00
zero_score_rows = df[df['Performance_Score'] <= 0.00001]
pd.set_option('display.max_colwidth', 120)  # or 199
print(zero_score_rows.iloc[:, 0])

8     {'id': 2354, 'name': 'How long would it have taken ENIAC (or another early computer) to mine 1 bitcoin?', 'answer': ...
9     {'id': 2373, 'name': 'What is the total area of all currently existing spiderwebs in the world?', 'answer': 18, 'uni...
11                                 {'id': 2104, 'name': 'How much oil is there left in the world?', 'answer': 14, 'unit': ''}
12    {'id': 2115, 'name': 'Suppose a meteor made of nickel and iron falls to the earth from deep space. Estimate how much...
13                                 {'id': 2550, 'name': 'What is the mass, in kgs, of the moon?', 'answer': 23, 'unit': 'kg'}
15    {'id': 2449, 'name': 'At the height of summer, what is the total mass of all the leaves grown on trees in New Hampsh...
16    {'id': 2295, 'name': 'How dense is the Milky Way galaxy in terms of stars per cubic light year?', 'answer': 7, 'unit...
18    {'id': 2004, 'name': 'What would be the effect to the oceans If all fishing was banned for four days once a year

In [43]:
# Constants
molar_mass_zinc = 65.38  # g/mol; this is zinc's molar mass (copper's is about 63.55 g/mol)
avogadro_number = 6.022 * 10**23  # particles per mole
proton_charge = 1.602 * 10**-19  # C
mass_penny = 3  # grams
proton_charge_diff = proton_charge*10**-6  # C, one millionth of the proton charge

# Calculate the number of atoms in the penny
# (used below as the particle count; protons per atom are not multiplied in)
moles_copper = mass_penny / molar_mass_zinc
number_of_protons = moles_copper * avogadro_number

# Calculate the electric charge
charge = number_of_protons * proton_charge_diff

# Print the result
print("The electric charge of the copper penny is:", charge, "Coulombs")

# Coulomb constant
k = 8.99 * 10**9  # Nm^2/C^2
# force between two such charges placed 1 m apart
Force = k*charge**2/(1)**2
print("The repel force will be:", Force, "N")


The electric charge of the copper penny is: 0.004426695013765677 Coulombs
The repel force will be: 176164.7024166322 N


In [49]:
# what does PI = 0.375 mean?
# let's say we reached PI = 0.47 for some questions
# compared to PI = 0.13 from SOTA
PI_old = 0.13
PI_new = 0.47 
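uncertainty_old = 3*(1-PI_old)  # 2.61 orders of magnitude
uncertainty_new = 3*(1-PI_new)  # 1.59 orders of magnitude

PI_diff = 0.43926-0.34517
uncertainty_diff = 3*(PI_diff)
print(10**uncertainty_diff)
# a PI gain of about 0.094 is about 0.28 orders of magnitude, i.e. a factor of about 1.9;
# the 0.13 -> 0.47 example above would be 3*0.34 = 1.02, i.e. about 1 order of magnitude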

1.9154463840681566
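
The conversion from a score gain to an uncertainty factor can be wrapped in a small helper. This sketch assumes the score maps linearly onto a 3-order-of-magnitude window, as in the cell above; `pi_gain_to_factor` is a hypothetical name.

```python
def pi_gain_to_factor(pi_old, pi_new, max_orders=3):
    """Convert a score gain into (orders of magnitude, multiplicative factor)."""
    orders = max_orders * (pi_new - pi_old)
    return orders, 10 ** orders

for old, new in [(0.13, 0.47), (0.34517, 0.43926)]:
    orders, factor = pi_gain_to_factor(old, new)
    print(f"PI {old} -> {new}: {orders:.2f} orders of magnitude, factor of {factor:.1f}")
```

With these inputs the first pair corresponds to roughly one order of magnitude, while the second corresponds to a factor of about 1.9.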


In [55]:
# Constants
import math
E = 18e9  # Pa
d_outer = 0.02  # m
d_inner = 0.01  # m
L = 0.4785  # m
K = 1.0

# Calculate moment of inertia
I = (math.pi * ((d_outer**4) - (d_inner**4))) / 64

# Calculate critical axial load
P = (math.pi**2 * E * I) / (K * L)**2

r = 149.6e9  # meters
v = 100000  # meters per second

T = (2 * math.pi * r) / v

# Convert the orbital period to years
year = T / (365 * 24 * 60 * 60)  # 365 days, 24 hours, 60 minutes, 60 seconds

print("The length of a year if we orbited the sun at 100 km/s would be approximately:", year, "years")

# Assumptions
world_population = 7.9e9  # 7.9 billion
hash_rate_per_person = 1  # hash per second
difficulty_level = 20.6e12  # 20.6 trillion

# Calculate total hash rate
total_hash_rate = world_population * hash_rate_per_person

# Calculate time to mine a block
time_to_mine_block = difficulty_level / total_hash_rate

# Convert time to minutes and hours for better readability
minutes = int(time_to_mine_block / 60)
hours = int(minutes / 60)

# Print the results
print("Total hash rate of the world population:", total_hash_rate, "hashes per second")
print("Time to mine a block at the current difficulty level:")
print("Seconds:", time_to_mine_block)
print("Minutes:", minutes)
print("Hours:", hours)

The length of a year if we orbited the sun at 100 km/s would be approximately: 0.2980607946328216 years
Total hash rate of the world population: 7900000000.0 hashes per second
Time to mine a block at the current difficulty level:
Seconds: 2607.5949367088606
Minutes: 43
Hours: 0
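
Since every answer in this experiment is compared as a power of ten, it helps to reduce scratch results like the ones above to their order of magnitude before comparing them with a model's claim. A minimal sketch follows; `order_of_magnitude` is a hypothetical helper that rounds log10 to the nearest integer.

```python
import math

def order_of_magnitude(value):
    """Nearest integer x such that value is about 10^x (value must be positive)."""
    if value <= 0:
        raise ValueError("value must be positive")
    return round(math.log10(value))

# Same inputs as the cell above
seconds_to_mine = 20.6e12 / 7.9e9      # difficulty / total hash rate
years_at_100_km_s = 0.2980607946328216

print("time to mine a block: 10^", order_of_magnitude(seconds_to_mine), "seconds")
print("length of a year:     10^", order_of_magnitude(years_at_100_km_s), "years")
```

Rounding in log space means values between about 3.2 and 32 all map to 10^1, which matches how the exponents are compared when scoring.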


In [60]:
my_prompt = f"""
Does the order of the magnitude of the answer make sense? If not, please explain why.
- **Assumptions**: Assume the current world population is 7.9 billion. Assume each person can mine at a rate of 1 hash per second using pen and paper. Assume the current difficulty level of mining a block is 20.6 trillion.
- **Calculations**: Calculate the total hash rate of the world population by multiplying the population by the hash rate per person. Then, calculate how many seconds it would take to mine a block at the current difficulty level. 
- **Estimated Value**: The estimated value is in the order of magnitude of 10^23 seconds, which is approximately 3.2 x 10^16 years. This is much longer than the age of the universe, which is estimated to be around 13.8 billion years.
"""
print(get_completion(my_prompt, model="gpt-4"))

>>>>RESPONSE>>>>
The order of magnitude of the answer does not make sense. The reason is that the assumptions and calculations are not realistic. 

First, it is not feasible for every person in the world to mine at a rate of 1 hash per second using pen and paper. Mining requires specialized hardware and software, and it is not possible for individuals to perform these calculations manually at such a high rate.

Second, the current difficulty level of mining a block is indeed very high, but the calculations do not take into account the fact that the difficulty level is adjusted periodically to ensure that a new block is mined approximately every 10 minutes. This means that as more people join the mining network and the total hash rate increases, the difficulty level will also increase to maintain the 10-minute block time.

In conclusion, the order of magnitude of the estimated value is not realistic because it is based on unrealistic assumptions and calculations. The actual time it woul