<a href="https://colab.research.google.com/github/Kid7ho/LLM-Assignment/blob/main/Assignment_3_2022148033.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3
### Jin-Myoung Hyun 2022148033

## **Setting Up**: Setting up the environment before experiment

- Server address : `https://cot.ngrok.app/`
- Endpoints:
    - `/api/inference`: For calling API to generate outputs from LMs
    - `/api/change_password`: Change your password. Arguments: `student_id`, `old_password`, `new_password`.

- Check your remaining credit at this page.[https://cot.ngrok.app/student_credits]

### Step 0-1. Changing the password

In [8]:
import requests
import os

server_address = "https://cot.ngrok.app"
data = {
    "student_id": "2022148033",
    "old_password": "helloworld!",
    "new_password": "Helloworld!"
}
response = requests.post(f'{server_address}/api/change_password', json=data)

# Check if the request was successful
if response.status_code == 200:
    # Set the new password as an environment variable
    os.environ["INTRO_AI_API_KEY"] = data["new_password"]
    print("Environment variable INTRO_AI_API_KEY set successfully.")
else:
    print("Failed to change password. Status code:", response.status_code)

Environment variable INTRO_AI_API_KEY set successfully.


In [20]:
class LLM:
    def __init__(self, model_name, student_id, password):
        self.student_id = student_id
        self.password = password
        server_address = "https://cot.ngrok.app"
        self.api_url = f'{server_address}/api/inference'  # Replace with your server address
        self.model_name = model_name

    def generate(self, model_input, max_tokens=50, top_p=1.0, temperature=0.7, frequency_penalty=0, stop="\n\n"):
        data = {
            'model_name': self.model_name,
            'model_input': model_input,
            'student_id': self.student_id,
            'password': self.password,
            'max_tokens': max_tokens,
            'top_p': top_p,
            'temperature': temperature,
            'frequency_penalty': frequency_penalty,
            'stop': stop
        }
        response = requests.post(self.api_url, json=data)

        if response.status_code == 200:
            return response.json()  # Assuming the server returns a JSON response
        else:
            return f"Error: {response.text}"

In [21]:
student_id = "2022148033"
password = os.environ['INTRO_AI_API_KEY']

available_models = [
    "meta-llama/Llama-2-7b-hf",
    "facebook/opt-350m",
    "facebook/opt-1.3b",
    "facebook/opt-2.7b"
]
opt = LLM(available_models[1], student_id, password)

### Step 0-2. Load GSM8K dataset

In [22]:
!pip install datasets



In [23]:
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main")['test']

### Step 0-3. Check an example from GSM8k

In [24]:
print("Question:")
print(gsm8k['question'][0])
print("="*100)
print("Answer:")
print(gsm8k['answer'][0])
#Answer는 #### 뒤에 나옴으로 답만 구하려면 split해서 뒤에 element를 선택하면 됨

Question:
Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Answer:
Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.
She makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.
#### 18


## **Experiment 1**: Zero-shot prompting v.s. Few-shot prompting

### Step 1-1. Constructing prompts

In [25]:
import random
random.seed(0)

def construct_prompt(num_exemplars):
    # Load train set of GSM8K
    gsm8k = load_dataset("gsm8k", "main")['train']

    sampled_indices = random.sample([i for i in range(len(gsm8k['question']))], num_exemplars)


    instruction = "Instruction:\nSolve the following mathematical question and generate the answer after a tag, 'Answer:'."

    # Constructing a prompt with few-shot demonstrations from GSM8K
    prompt = instruction
    for i in range(num_exemplars):
        cur_question = gsm8k['question'][i]
        cur_answer = gsm8k['answer'][i].split("####")[-1].strip()
        prompt += f"\n[Example {i+1}]\n"
        prompt += f"Question:\n{cur_question}\n"
        prompt += f"Answer:{cur_answer}\n"

    prompt += f"\n[Example {num_exemplars+1}]\n"
    prompt += "Question:\n{question}\nAnswer:"

    # Write the prompt to a .txt file
    with open(f"prompt_{num_exemplars}shot.txt", "w") as f:
        f.write(prompt)

construct_prompt(0)
construct_prompt(5)
construct_prompt(10)

### Step 1-2. Load the prompt for zero-shot prompting

In [35]:
with open("prompt_0shot.txt", "r") as f:
    prompt_0shot = f.read()

from tqdm import tqdm
import json

llama = LLM("meta-llama/Llama-2-7b-hf", student_id, password)
results_collected = []
pass_collected = []

#너무 많음으로 앞에 50개만 사용 (과제를 할 때도 50개만 사용하면 됨)
for i in tqdm(range(50)):
    cur_question = gsm8k['question'][i]
    cur_anwer = gsm8k['answer'][i].split("####")[-1].strip()
    cur_model_input = prompt_0shot.format(question=cur_question)
    result = llama.generate([cur_model_input], stop=["\n\n", "[Example 2]"])
    if "Error:" in result:
        print(result)
        break
    cur_prediction = result['generations'][0]['text']
    # print(cur_prediction)

    pass_collected.append(cur_prediction.strip().replace("$", "0") == cur_anwer)
    results_collected.append({"question": cur_question, "answer": cur_anwer, "prediction": cur_prediction})
    print(f"Acc: {sum(pass_collected)/ len(pass_collected)}")

with open("result_0shot_direct.json", "w") as f:
    json.dump(results_collected, f, indent=4)

  2%|▏         | 1/50 [00:00<00:38,  1.27it/s]

Acc: 0.0


  4%|▍         | 2/50 [00:02<00:58,  1.22s/it]

Acc: 0.0


  6%|▌         | 3/50 [00:03<00:49,  1.06s/it]

Acc: 0.0


  8%|▊         | 4/50 [00:04<00:45,  1.01it/s]

Acc: 0.0


 10%|█         | 5/50 [00:04<00:41,  1.08it/s]

Acc: 0.0


 12%|█▏        | 6/50 [00:05<00:41,  1.05it/s]

Acc: 0.0


 14%|█▍        | 7/50 [00:07<00:46,  1.09s/it]

Acc: 0.0


 16%|█▌        | 8/50 [00:10<01:13,  1.75s/it]

Acc: 0.0


 18%|█▊        | 9/50 [00:11<00:59,  1.46s/it]

Acc: 0.0


 20%|██        | 10/50 [00:12<00:52,  1.31s/it]

Acc: 0.0


 22%|██▏       | 11/50 [00:13<00:51,  1.31s/it]

Acc: 0.0


 24%|██▍       | 12/50 [00:14<00:47,  1.26s/it]

Acc: 0.0


 26%|██▌       | 13/50 [00:15<00:42,  1.15s/it]

Acc: 0.0


 28%|██▊       | 14/50 [00:17<00:45,  1.26s/it]

Acc: 0.0


 30%|███       | 15/50 [00:20<01:04,  1.85s/it]

Acc: 0.0


 32%|███▏      | 16/50 [00:21<00:52,  1.54s/it]

Acc: 0.0


 34%|███▍      | 17/50 [00:22<00:49,  1.49s/it]

Acc: 0.0


 36%|███▌      | 18/50 [00:23<00:42,  1.33s/it]

Acc: 0.0


 38%|███▊      | 19/50 [00:24<00:38,  1.24s/it]

Acc: 0.0


 40%|████      | 20/50 [00:27<00:51,  1.73s/it]

Acc: 0.0


 42%|████▏     | 21/50 [00:28<00:41,  1.44s/it]

Acc: 0.0


 44%|████▍     | 22/50 [00:28<00:35,  1.27s/it]

Acc: 0.0


 46%|████▌     | 23/50 [00:29<00:31,  1.16s/it]

Acc: 0.0


 48%|████▊     | 24/50 [00:31<00:31,  1.21s/it]

Acc: 0.0


 50%|█████     | 25/50 [00:31<00:26,  1.08s/it]

Acc: 0.0


 52%|█████▏    | 26/50 [00:32<00:23,  1.02it/s]

Acc: 0.0


 54%|█████▍    | 27/50 [00:33<00:23,  1.00s/it]

Acc: 0.0


 56%|█████▌    | 28/50 [00:40<00:56,  2.58s/it]

Acc: 0.0


 58%|█████▊    | 29/50 [00:41<00:47,  2.28s/it]

Acc: 0.0


 60%|██████    | 30/50 [00:43<00:43,  2.18s/it]

Acc: 0.0


 62%|██████▏   | 31/50 [00:45<00:39,  2.08s/it]

Acc: 0.0


 64%|██████▍   | 32/50 [00:47<00:37,  2.06s/it]

Acc: 0.0


 66%|██████▌   | 33/50 [00:49<00:33,  1.97s/it]

Acc: 0.0


 68%|██████▊   | 34/50 [00:53<00:42,  2.66s/it]

Acc: 0.0


 70%|███████   | 35/50 [00:55<00:36,  2.46s/it]

Acc: 0.0


 72%|███████▏  | 36/50 [01:03<00:56,  4.00s/it]

Acc: 0.0


 74%|███████▍  | 37/50 [01:05<00:44,  3.45s/it]

Acc: 0.0


 76%|███████▌  | 38/50 [01:06<00:32,  2.70s/it]

Acc: 0.0


 78%|███████▊  | 39/50 [01:07<00:24,  2.23s/it]

Acc: 0.0


 80%|████████  | 40/50 [01:09<00:22,  2.22s/it]

Acc: 0.0


 82%|████████▏ | 41/50 [01:10<00:17,  1.91s/it]

Acc: 0.0


 84%|████████▍ | 42/50 [01:12<00:14,  1.87s/it]

Acc: 0.0


 86%|████████▌ | 43/50 [01:14<00:13,  1.95s/it]

Acc: 0.0


 88%|████████▊ | 44/50 [01:16<00:11,  1.95s/it]

Acc: 0.0


 90%|█████████ | 45/50 [01:18<00:09,  1.82s/it]

Acc: 0.0


 92%|█████████▏| 46/50 [01:19<00:06,  1.60s/it]

Acc: 0.0


 94%|█████████▍| 47/50 [01:21<00:05,  1.95s/it]

Acc: 0.0


 96%|█████████▌| 48/50 [01:24<00:04,  2.03s/it]

Acc: 0.0


 98%|█████████▊| 49/50 [01:32<00:03,  3.82s/it]

Acc: 0.0


100%|██████████| 50/50 [01:33<00:00,  1.88s/it]

Acc: 0.0





### Step 1-3. Load the Prompt for five-shot prompting

In [42]:
with open("prompt_5shot.txt", "r") as f:
    prompt_5shot = f.read()

from tqdm import tqdm

llama = LLM("meta-llama/Llama-2-7b-hf", student_id, password)
results_collected = []
pass_collected = []
for i in tqdm(range(50)):
    cur_question = gsm8k['question'][i]
    cur_anwer = gsm8k['answer'][i].split("####")[-1].strip()
    cur_model_input = prompt_5shot.format(question=cur_question)
    result = llama.generate([cur_model_input], stop=["\n\n"])
    if "Error:" in result:
        print(result)
        break
    cur_prediction = result['generations'][0]['text']
    # print(cur_prediction)

    pass_collected.append(cur_prediction.strip().replace("$","") == cur_anwer)
    results_collected.append({"question": cur_question, "answer": cur_anwer, "prediction": cur_prediction})
    print(f"Acc: {sum(pass_collected)/ len(pass_collected)}")

with open("result_5shot_direct.json", "w") as f:
    json.dump(results_collected, f, indent=4)

  2%|▏         | 1/50 [00:01<00:51,  1.05s/it]

Acc: 0.0


  4%|▍         | 2/50 [00:02<00:55,  1.16s/it]

Acc: 0.0


  6%|▌         | 3/50 [00:03<00:57,  1.23s/it]

Acc: 0.0


  8%|▊         | 4/50 [00:04<00:52,  1.13s/it]

Acc: 0.0


 10%|█         | 5/50 [00:05<00:52,  1.16s/it]

Acc: 0.0


 12%|█▏        | 6/50 [00:07<00:56,  1.28s/it]

Acc: 0.0


 14%|█▍        | 7/50 [00:08<00:58,  1.36s/it]

Acc: 0.0


 16%|█▌        | 8/50 [00:10<00:56,  1.36s/it]

Acc: 0.0


 18%|█▊        | 9/50 [00:11<00:52,  1.28s/it]

Acc: 0.0


 20%|██        | 10/50 [00:12<00:54,  1.36s/it]

Acc: 0.0


 22%|██▏       | 11/50 [00:13<00:50,  1.29s/it]

Acc: 0.0


 24%|██▍       | 12/50 [00:15<00:47,  1.26s/it]

Acc: 0.0


 26%|██▌       | 13/50 [00:16<00:49,  1.34s/it]

Acc: 0.0


 28%|██▊       | 14/50 [00:18<00:49,  1.38s/it]

Acc: 0.0


 30%|███       | 15/50 [00:19<00:50,  1.43s/it]

Acc: 0.0


 32%|███▏      | 16/50 [00:21<00:47,  1.41s/it]

Acc: 0.0


 34%|███▍      | 17/50 [00:22<00:45,  1.38s/it]

Acc: 0.0


 36%|███▌      | 18/50 [00:23<00:42,  1.32s/it]

Acc: 0.0


 38%|███▊      | 19/50 [00:24<00:39,  1.26s/it]

Acc: 0.0


 40%|████      | 20/50 [00:26<00:39,  1.30s/it]

Acc: 0.0


 42%|████▏     | 21/50 [00:26<00:34,  1.18s/it]

Acc: 0.0


 44%|████▍     | 22/50 [00:28<00:32,  1.17s/it]

Acc: 0.0


 46%|████▌     | 23/50 [00:29<00:30,  1.12s/it]

Acc: 0.0


 48%|████▊     | 24/50 [00:30<00:29,  1.14s/it]

Acc: 0.0


 50%|█████     | 25/50 [00:31<00:28,  1.14s/it]

Acc: 0.0


 52%|█████▏    | 26/50 [00:32<00:25,  1.06s/it]

Acc: 0.0


 54%|█████▍    | 27/50 [00:33<00:26,  1.17s/it]

Acc: 0.0


 56%|█████▌    | 28/50 [00:34<00:26,  1.18s/it]

Acc: 0.0


 58%|█████▊    | 29/50 [00:36<00:26,  1.27s/it]

Acc: 0.0


 60%|██████    | 30/50 [00:37<00:25,  1.25s/it]

Acc: 0.0


 62%|██████▏   | 31/50 [00:38<00:23,  1.24s/it]

Acc: 0.0


 64%|██████▍   | 32/50 [00:40<00:22,  1.25s/it]

Acc: 0.0


 66%|██████▌   | 33/50 [00:41<00:20,  1.23s/it]

Acc: 0.030303030303030304


 68%|██████▊   | 34/50 [00:42<00:19,  1.23s/it]

Acc: 0.029411764705882353


 70%|███████   | 35/50 [00:44<00:19,  1.32s/it]

Acc: 0.02857142857142857


 72%|███████▏  | 36/50 [00:45<00:19,  1.38s/it]

Acc: 0.027777777777777776


 74%|███████▍  | 37/50 [00:46<00:17,  1.35s/it]

Acc: 0.02702702702702703


 76%|███████▌  | 38/50 [00:48<00:16,  1.36s/it]

Acc: 0.02631578947368421


 78%|███████▊  | 39/50 [00:49<00:15,  1.37s/it]

Acc: 0.02564102564102564


 80%|████████  | 40/50 [00:51<00:14,  1.45s/it]

Acc: 0.025


 82%|████████▏ | 41/50 [00:52<00:12,  1.38s/it]

Acc: 0.024390243902439025


 84%|████████▍ | 42/50 [00:53<00:11,  1.42s/it]

Acc: 0.023809523809523808


 86%|████████▌ | 43/50 [00:55<00:10,  1.45s/it]

Acc: 0.023255813953488372


 88%|████████▊ | 44/50 [00:56<00:08,  1.40s/it]

Acc: 0.022727272727272728


 90%|█████████ | 45/50 [00:58<00:06,  1.35s/it]

Acc: 0.022222222222222223


 92%|█████████▏| 46/50 [00:59<00:05,  1.37s/it]

Acc: 0.021739130434782608


 94%|█████████▍| 47/50 [01:00<00:04,  1.36s/it]

Acc: 0.02127659574468085


 96%|█████████▌| 48/50 [01:02<00:02,  1.46s/it]

Acc: 0.020833333333333332


 98%|█████████▊| 49/50 [01:03<00:01,  1.39s/it]

Acc: 0.02040816326530612


100%|██████████| 50/50 [01:05<00:00,  1.30s/it]

Acc: 0.02





### Step 1-4. Load the Prompt for ten-shot prompting

In [43]:
with open("prompt_10shot.txt", "r") as f:
    prompt_10shot = f.read()

from tqdm import tqdm

llama = LLM("meta-llama/Llama-2-7b-hf", student_id, password)
results_collected = []
pass_collected = []
for i in tqdm(range(50)):
    cur_question = gsm8k['question'][i]
    cur_anwer = gsm8k['answer'][i].split("####")[-1].strip()
    cur_model_input = prompt_10shot.format(question=cur_question)
    result = llama.generate([cur_model_input], stop=["\n\n"])
    if "Error:" in result:
        print(result)
        break
    cur_prediction = result['generations'][0]['text']
    # print(cur_prediction)

    pass_collected.append(cur_prediction.strip().replace("$","") == cur_anwer)
    results_collected.append({"question": cur_question, "answer": cur_anwer, "prediction": cur_prediction})
    print(f"Acc: {sum(pass_collected)/ len(pass_collected)}")

with open("result_10shot_direct.json", "w") as f:
    json.dump(results_collected, f, indent=4)

  2%|▏         | 1/50 [00:01<01:20,  1.63s/it]

Acc: 0.0


  4%|▍         | 2/50 [00:03<01:20,  1.69s/it]

Acc: 0.0


  6%|▌         | 3/50 [00:04<01:16,  1.62s/it]

Acc: 0.0


  8%|▊         | 4/50 [00:06<01:11,  1.54s/it]

Acc: 0.0


 10%|█         | 5/50 [00:07<01:07,  1.51s/it]

Acc: 0.0


 12%|█▏        | 6/50 [00:09<01:05,  1.48s/it]

Acc: 0.16666666666666666


 14%|█▍        | 7/50 [00:10<01:02,  1.45s/it]

Acc: 0.14285714285714285


 16%|█▌        | 8/50 [00:11<00:59,  1.43s/it]

Acc: 0.125


 18%|█▊        | 9/50 [00:13<01:02,  1.53s/it]

Acc: 0.1111111111111111


 20%|██        | 10/50 [00:15<01:02,  1.56s/it]

Acc: 0.1


 22%|██▏       | 11/50 [00:16<01:00,  1.54s/it]

Acc: 0.09090909090909091


 24%|██▍       | 12/50 [00:18<00:57,  1.52s/it]

Acc: 0.08333333333333333


 26%|██▌       | 13/50 [00:19<00:54,  1.48s/it]

Acc: 0.07692307692307693


 28%|██▊       | 14/50 [00:21<00:53,  1.48s/it]

Acc: 0.07142857142857142


 30%|███       | 15/50 [00:22<00:52,  1.50s/it]

Acc: 0.06666666666666667


 32%|███▏      | 16/50 [00:24<00:50,  1.50s/it]

Acc: 0.0625


 34%|███▍      | 17/50 [00:25<00:51,  1.57s/it]

Acc: 0.058823529411764705


 36%|███▌      | 18/50 [00:27<00:49,  1.55s/it]

Acc: 0.05555555555555555


 38%|███▊      | 19/50 [00:28<00:46,  1.51s/it]

Acc: 0.05263157894736842


 40%|████      | 20/50 [00:30<00:47,  1.57s/it]

Acc: 0.05


 42%|████▏     | 21/50 [00:32<00:49,  1.71s/it]

Acc: 0.047619047619047616


 44%|████▍     | 22/50 [00:34<00:46,  1.67s/it]

Acc: 0.045454545454545456


 46%|████▌     | 23/50 [00:35<00:43,  1.61s/it]

Acc: 0.043478260869565216


 48%|████▊     | 24/50 [00:37<00:42,  1.63s/it]

Acc: 0.041666666666666664


 50%|█████     | 25/50 [00:38<00:38,  1.55s/it]

Acc: 0.04


 52%|█████▏    | 26/50 [00:40<00:36,  1.50s/it]

Acc: 0.038461538461538464


 54%|█████▍    | 27/50 [00:41<00:32,  1.43s/it]

Acc: 0.037037037037037035


 56%|█████▌    | 28/50 [00:42<00:29,  1.36s/it]

Acc: 0.03571428571428571


 58%|█████▊    | 29/50 [00:43<00:26,  1.28s/it]

Acc: 0.034482758620689655


 60%|██████    | 30/50 [00:44<00:24,  1.23s/it]

Acc: 0.03333333333333333


 62%|██████▏   | 31/50 [00:45<00:22,  1.20s/it]

Acc: 0.03225806451612903


 64%|██████▍   | 32/50 [00:47<00:21,  1.20s/it]

Acc: 0.03125


 66%|██████▌   | 33/50 [00:48<00:20,  1.21s/it]

Acc: 0.030303030303030304


 68%|██████▊   | 34/50 [00:50<00:21,  1.36s/it]

Acc: 0.029411764705882353


 70%|███████   | 35/50 [00:51<00:20,  1.38s/it]

Acc: 0.02857142857142857


 72%|███████▏  | 36/50 [00:53<00:20,  1.49s/it]

Acc: 0.027777777777777776


 74%|███████▍  | 37/50 [00:54<00:19,  1.49s/it]

Acc: 0.02702702702702703


 76%|███████▌  | 38/50 [00:56<00:17,  1.48s/it]

Acc: 0.02631578947368421


 78%|███████▊  | 39/50 [00:57<00:16,  1.49s/it]

Acc: 0.02564102564102564


 80%|████████  | 40/50 [00:59<00:15,  1.51s/it]

Acc: 0.025


 82%|████████▏ | 41/50 [01:00<00:13,  1.49s/it]

Acc: 0.024390243902439025


 84%|████████▍ | 42/50 [01:02<00:12,  1.59s/it]

Acc: 0.023809523809523808


 86%|████████▌ | 43/50 [01:04<00:11,  1.65s/it]

Acc: 0.023255813953488372


 88%|████████▊ | 44/50 [01:05<00:09,  1.64s/it]

Acc: 0.022727272727272728


 90%|█████████ | 45/50 [01:07<00:08,  1.63s/it]

Acc: 0.022222222222222223


 92%|█████████▏| 46/50 [01:08<00:06,  1.58s/it]

Acc: 0.021739130434782608


 94%|█████████▍| 47/50 [01:10<00:04,  1.59s/it]

Acc: 0.02127659574468085


 96%|█████████▌| 48/50 [01:12<00:03,  1.58s/it]

Acc: 0.020833333333333332


 98%|█████████▊| 49/50 [01:13<00:01,  1.55s/it]

Acc: 0.02040816326530612


100%|██████████| 50/50 [01:15<00:00,  1.50s/it]

Acc: 0.02





## **Experience 2**. Comparing Standard Prompting and CoT Prompting

### Step 2-1. Constructing Prompts

In [58]:
import random
random.seed(0)

def construct_prompt_CoT(num_exemplars):
    # Load train set of GSM8K
    gsm8k = load_dataset("gsm8k", "main")['train']

    sampled_indices = random.sample([i for i in range(len(gsm8k['question']))], num_exemplars)


    instruction = "Instruction:\nSolve the following mathematical question and generate the answer after a tag, 'Answer:'."

    # Constructing a prompt with few-shot demonstrations from GSM8K
    prompt = instruction
    for i in range(num_exemplars):
        cur_question = gsm8k['question'][i]
        cur_answer = gsm8k['answer'][i].replace("####", "The answer is").strip()
        prompt += f"\n[Example {i+1}]\n"
        prompt += f"Question:\n{cur_question}\n"
        prompt += f"Answer:{cur_answer}\n"

    prompt += f"\n[Example {num_exemplars+1}]\n"
    prompt += "Question:\n{question}\nAnswer:"

    # Write the prompt to a .txt file
    with open(f"prompt_{num_exemplars}shot_CoT.txt", "w") as f:
        f.write(prompt)

construct_prompt_CoT(5)
construct_prompt_CoT(10)

### Step 2-2. Implementing CoT to five-shot Prompting

In [65]:
with open("prompt_5shot_CoT.txt", "r") as f:
    prompt_5shot_CoT = f.read()

from tqdm import tqdm

llama = LLM("meta-llama/Llama-2-7b-hf", student_id, password)
results_collected = []
pass_collected = []
for i in tqdm(range(50)):
    cur_question = gsm8k['question'][i]
    cur_anwer = gsm8k['answer'][i].split("####")[-1].strip()
    cur_model_input = prompt_5shot_CoT.format(question=cur_question)
    result = llama.generate([cur_model_input], stop=["\n\n"])
    if "Error:" in result:
        print(result)
        break

    tmp = result['generations'][0]['text'].strip()
    cur_prediction = ""
    for i in reversed(tmp):
      if i.isdigit():
        cur_prediction = i + cur_prediction
      elif cur_prediction == "" or i == ",":
        continue
      else:
        break

    print("\nanswer: " + cur_anwer + ", prediction: " + result['generations'][0]['text'] + "\nextracted: " + cur_prediction + "\n")

    pass_collected.append(cur_prediction.strip().replace("$","") == cur_anwer)
    results_collected.append({"question": cur_question, "answer": cur_anwer, "prediction": cur_prediction})
    print(f"Acc: {sum(pass_collected)/ len(pass_collected)}")

with open("result_5shot_CoT_direct.json", "w") as f:
    json.dump(results_collected, f, indent=4)

  2%|▏         | 1/50 [00:01<01:33,  1.90s/it]


answer: 18, prediction: Janet makes 2*16 = $<<2*16=32>>32 dollars every day.
The answer is 32
extracted: 32

Acc: 0.0


  4%|▍         | 2/50 [00:05<02:20,  2.93s/it]


answer: 3, prediction: It takes 3 bolts of blue fiber and half that much white fiber.
So it takes 3 + half = <<3+half=3.5>>3.5 bolts of blue fiber and half that much white
extracted: 5

Acc: 0.0


  6%|▌         | 3/50 [00:08<02:21,  3.02s/it]


answer: 70000, prediction: He spent 50000 on repairs.
This increased the value of the house by 50000 / 150% = <<50000/150%=33,3
extracted: 333

Acc: 0.0


  8%|▊         | 4/50 [00:12<02:36,  3.40s/it]


answer: 540, prediction: He runs 60*3=<<60*3=180>>180 meters each sprint.
So he runs a total of 180*3=<<180*3=540
extracted: 540

Acc: 0.25


 10%|█         | 5/50 [00:15<02:25,  3.24s/it]


answer: 20, prediction: In the morning, she gives her flock of chickens 15 * 3 = <<15*3=45>>45 cups of feed.
In the afternoon, she gives her chickens another 2
extracted: 2

Acc: 0.2


 12%|█▏        | 6/50 [00:18<02:24,  3.28s/it]


answer: 64, prediction: Kylar goes to the store 16 times.
Every time he buys 16 glasses, he pays $ <<16*5=80>>80
But he pays 60% of the price 
extracted: 60

Acc: 0.16666666666666666


 14%|█▍        | 7/50 [00:23<02:32,  3.54s/it]


answer: 260, prediction: Toulouse has 2*2 = <<2*2=4>>4 times as many sheep as Charleston.
Charleston has 4*2 = <<4*2=8>>8 times as many sheep as Seattle
extracted: 8

Acc: 0.14285714285714285


 16%|█▌        | 8/50 [00:25<02:18,  3.30s/it]


answer: 160, prediction: Carla downloads 200 * 60 * 60 * 0.4 = <<200*60*60*0.4=12000>>12000 / 
extracted: 12000

Acc: 0.125


 18%|█▊        | 9/50 [00:29<02:23,  3.49s/it]


answer: 45, prediction: John spends 3 hours driving at 60 mph, which means he is 180 mi from home at the end of the 3 hours.
John spends 2 hours driving at 60 mph, which
extracted: 60

Acc: 0.1111111111111111


 20%|██        | 10/50 [00:32<02:10,  3.27s/it]


answer: 460, prediction: Eliza's regular rate per hour is $<<10>>10
Her overtime rate is 1.2 * $<<10>>10=<<1.2*10=12>>12
El
extracted: 12

Acc: 0.1


 22%|██▏       | 11/50 [00:35<02:05,  3.22s/it]


answer: 366, prediction: In the first month, the program had 60 downloads.
In the second month, the program had 60 * 3 = <<60*3=180>>180 downloads.
In the third
extracted: 180

Acc: 0.09090909090909091


 24%|██▍       | 12/50 [00:38<02:00,  3.18s/it]


answer: 694, prediction: Toula bought 3 dozen donuts which cost $68 * 3 = <<68*3=174>>174.
She bought 2 dozen mini cupcakes which cost $80 * 2
extracted: 2

Acc: 0.08333333333333333


 26%|██▌       | 13/50 [00:42<02:04,  3.36s/it]


answer: 13, prediction: Carlos plants the tree at the beginning of a year.
Year 1: Carlos spends $90 to plant the tree.
Year 2: Carlos spends $90 + 3 = <<90+3=9
extracted: 9

Acc: 0.07692307692307693


 28%|██▊       | 14/50 [00:46<02:07,  3.54s/it]


answer: 18, prediction: Melanie sold 1/3 of her vacuum cleaners at the green house and 2 more to the red house.
That means she had 3/3 + 2 = <<3/3+2=5>>
extracted: 5

Acc: 0.07142857142857142


 30%|███       | 15/50 [00:49<02:03,  3.54s/it]


answer: 60, prediction: In total, 20% enrolled in contemporary dance and 25% of the remaining enrolled in jazz dance.
This means that 20% of the remaining = 25% of the entire = 20%
extracted: 20

Acc: 0.06666666666666667


 32%|███▏      | 16/50 [00:53<02:00,  3.55s/it]


answer: 125, prediction: The jewelry market will increase by 2.5%
If the merchant buys a jewelry worth $5,000 and sells at $5,125.
The profit is 5,
extracted: 5

Acc: 0.0625


 34%|███▍      | 17/50 [00:57<01:56,  3.54s/it]


answer: 230, prediction: The first day, they traveled 80 miles.
On the second day, they traveled 150 + 80 = 230 miles.
The second train traveled 230/
extracted: 230

Acc: 0.11764705882352941


 36%|███▌      | 18/50 [01:00<01:56,  3.63s/it]


answer: 57500, prediction: Jill's salary as a teacher is 20*35/50 = <<20*35/50=10>>10 per hour.
Her salary as a coach is 30*
extracted: 30

Acc: 0.1111111111111111


 38%|███▊      | 19/50 [01:04<01:55,  3.72s/it]


answer: 7, prediction: Claire makes 3 omelets every morning.
So she uses 3 x 4 = <<3 x 4=12>>12 eggs every day.
So she uses 12 x 4 = <<12
extracted: 12

Acc: 0.10526315789473684


 40%|████      | 20/50 [01:08<01:50,  3.67s/it]


answer: 6, prediction: The first 4 miles took Marissa 1 hour to finish.
That means, she walked 4 miles in 1 hour.
If she wants her average speed to be 4 miles per hour, then she needs to walk 4
extracted: 4

Acc: 0.1


 42%|████▏     | 21/50 [01:12<01:47,  3.70s/it]


answer: 15, prediction: The 24 liters are (15/3) x (1/3) = 5 x 1/3 = 1.666 liters of water
The answer is 1.666
extracted: 666

Acc: 0.09523809523809523


 44%|████▍     | 22/50 [01:16<01:46,  3.79s/it]


answer: 14, prediction: Raymond was born 6 years before Samantha.
Raymond had a son at the age of 23. Raymond was 5 years old when he had a son.
So Samantha has been 5 years
extracted: 5

Acc: 0.09090909090909091


 46%|████▌     | 23/50 [01:20<01:42,  3.81s/it]


answer: 7, prediction: Billy sells 8/3 = <<8/3=2.6666666666667>>2.6666666666667 DVDs to his first
extracted: 6666666666667

Acc: 0.08695652173913043


 48%|████▊     | 24/50 [01:24<01:40,  3.87s/it]


answer: 8, prediction: A candle melts 2 centimeters every hour that it burns.
So it will melt 2*6 = <<2*6=12>>12 centimeters in 6 hours.
Since the burning
extracted: 6

Acc: 0.08333333333333333


 50%|█████     | 25/50 [01:27<01:36,  3.85s/it]


answer: 26, prediction: The original price of the book = 19.50 x 100/105 = <<19.50*100/105=18.65>>18.65.
extracted: 65

Acc: 0.08


 52%|█████▏    | 26/50 [01:31<01:33,  3.90s/it]


answer: 2, prediction: Marie bought 12 chicken meals of $12 each.
Each chicken meal contains 5 packs of milk that costs $3 each, so 12 x 5 = <<12*5=60
extracted: 60

Acc: 0.07692307692307693


 54%|█████▍    | 27/50 [01:35<01:29,  3.88s/it]


answer: 243, prediction: Mishka bought 3 pairs of shorts, 3 pairs of pants, and 3 pairs of shoes.
1 pair of shorts costs $16.50
1 pair of pants costs $22.
extracted: 22

Acc: 0.07407407407407407


 56%|█████▌    | 28/50 [01:38<01:19,  3.62s/it]


answer: 16, prediction: Cynthia eats 15/1 = <<15/1=15>>15 servings of ice cream every night.
Therefore, she eats 15 * 60 = <<15
extracted: 15

Acc: 0.07142857142857142


 58%|█████▊    | 29/50 [01:41<01:07,  3.23s/it]


answer: 25, prediction: First stop: 20 + 60 - 20 = <<20+60-20=40>>40 miles
Second stop: 40 + 15 = <<40+15=
extracted: 15

Acc: 0.06896551724137931


 60%|██████    | 30/50 [01:45<01:10,  3.52s/it]


answer: 104, prediction: Gloria has to choose between the boots and two pairs of heels that together cost $33+2*2 = <<33+2*2=66>>66 dollars.
So she has to choose between the
extracted: 66

Acc: 0.06666666666666667


 62%|██████▏   | 31/50 [01:48<01:07,  3.54s/it]


answer: 109, prediction: Darrell's age now is 7 x 10 = <<7*10=70>>70
Allen's age now is 11 x 10 = <<11*10=1
extracted: 1

Acc: 0.06451612903225806


 64%|██████▍   | 32/50 [01:52<01:06,  3.72s/it]


answer: 80, prediction: Now, let's take 80 as the first guess
Then we can say the average guess is 2 times the difference between the first guess and the second guess
Because 20 is 2 times the difference between 8
extracted: 8

Acc: 0.0625


 66%|██████▌   | 33/50 [01:56<01:04,  3.80s/it]


answer: 35, prediction: John takes care of dogs 10*0.5 = <<10*0.5=5>>5 times per day
That means a total of 5*7 = <<5*7=35>>35 times per
extracted: 35

Acc: 0.09090909090909091


 68%|██████▊   | 34/50 [02:00<00:59,  3.69s/it]


answer: 70, prediction: Gretchen has 110 coins.
Silver coins are 30/2 = <<30/2=15>>15 less.
Therefore, 110 - 15 = <<
extracted: 15

Acc: 0.08823529411764706


 70%|███████   | 35/50 [02:04<00:57,  3.81s/it]


answer: 23, prediction: In the beginning, Siobhan has 2/2 = $<<2/2=1>>1 jewel less than Aaron.
Aaron has 5/2 = $<<5/2=2.5>>
extracted: 5

Acc: 0.08571428571428572


 72%|███████▏  | 36/50 [02:08<00:53,  3.85s/it]


answer: 9, prediction: In the first 20 minutes, he scores 4 points.
In the second 20 minutes, he scores 4 points + 1.25 * 20 = <<4+1.25*20=
extracted: 20

Acc: 0.08333333333333333


 74%|███████▍  | 37/50 [02:12<00:50,  3.86s/it]


answer: 75, prediction: Terry eats 2 * 4 = <<2*4=8>>8 yogurts a day
So he spends 30 * 8 = <<30*8=240>>240
extracted: 240

Acc: 0.08108108108108109


 76%|███████▌  | 38/50 [02:16<00:46,  3.86s/it]


answer: 2, prediction: John sells his 13 lego sets for $15 each = $<<15*13=195>>195.
He bought 8 games for $20 each = $<<20*8
extracted: 8

Acc: 0.07894736842105263


 78%|███████▊  | 39/50 [02:20<00:42,  3.85s/it]


answer: 10, prediction: John runs 60/3 = <<60/3=20>>20 miles a week.
He runs 60/2 = <<60/2=30>>30 miles in the first day, and
extracted: 30

Acc: 0.07692307692307693


 80%|████████  | 40/50 [02:23<00:37,  3.80s/it]


answer: 18, prediction: Dana runs 4 times faster than she walks, so she runs 12 mph and walks 3 mph.
If she spends 1/3 of the time running, then she runs 4/5 as
extracted: 5

Acc: 0.075


 82%|████████▏ | 41/50 [02:27<00:34,  3.79s/it]


answer: 8, prediction: Their ages are 1, 2, and 4.
Suzy's age is 1 year old, so Ben's age is 2 times 1 = <<2*1=2>>2 years old.

extracted: 2

Acc: 0.07317073170731707


 84%|████████▍ | 42/50 [02:31<00:31,  3.91s/it]


answer: 200, prediction: Polly stood at a distance of 400 feet from the dragon.
When she held the sapphire gemstone, she could throw the javelin for a distance of 400 + 3(400
extracted: 400

Acc: 0.07142857142857142


 86%|████████▌ | 43/50 [02:35<00:26,  3.81s/it]


answer: 26, prediction: Grandma cut each pie into 8 pieces.
So the total number of pieces she cut = 8 x 5 = 40
Now the total number of pieces left was 40 - 14 = 26
extracted: 26

Acc: 0.09302325581395349


 88%|████████▊ | 44/50 [02:40<00:24,  4.11s/it]


answer: 48, prediction: In a 300g bag, there are 5 servings of chips, so 1 gram of chips is 5/300 = <<5/300=0.017>>0.0
extracted: 0

Acc: 0.09090909090909091


 90%|█████████ | 45/50 [02:43<00:19,  3.96s/it]


answer: 20, prediction: Charlie needs 10 candles per pound of beeswax.
So he needs 10/10 = <<10/10=1>>1 candles per pound of beeswax.

extracted: 1

Acc: 0.08888888888888889


 92%|█████████▏| 46/50 [02:47<00:16,  4.06s/it]


answer: 104, prediction: Meredith spent 4 hours to write 1 article
So she spent 1/5*4 = <<1/5*4=0.8>>0.8 hours to write 1 article
On Monday, she spent 
extracted: 1

Acc: 0.08695652173913043


 94%|█████████▍| 47/50 [02:51<00:12,  4.06s/it]


answer: 163, prediction: In the beginning, Candice put 80/2 = <<80/2=40>>40 post-it notes in her purse.
On her way, she stopped off at the store and purchased a package of Post
extracted: 40

Acc: 0.0851063829787234


 96%|█████████▌| 48/50 [02:55<00:07,  3.82s/it]


answer: 800, prediction: John buys 2 x 2 = <<2*2=4>>4 red ties.
Each red tie costs $50/2 = <<50/2=25>>25.
So he spent $2
extracted: 2

Acc: 0.08333333333333333


 98%|█████████▊| 49/50 [02:59<00:03,  3.86s/it]


answer: 8, prediction: The wire is 4 feet long.
The 4 feet can be divided into 2 pieces each of 2 feet
Then each 2 feet can be divided into 6 pieces each of 1 foot
extracted: 1

Acc: 0.08163265306122448


100%|██████████| 50/50 [03:03<00:00,  3.66s/it]


answer: 30, prediction: In the building, there are 15 x 8 = <<15*8=120>>120 units altogether
3/4 = <<3/4=75>>75 of them are occupied
The remaining
extracted: 75

Acc: 0.08





### Step 2-3. Implementing CoT to ten-shot Prompting

In [63]:
with open("prompt_10shot_CoT.txt", "r") as f:
    prompt_10shot_CoT = f.read()

from tqdm import tqdm

llama = LLM("meta-llama/Llama-2-7b-hf", student_id, password)
results_collected = []
pass_collected = []
for i in tqdm(range(50)):
    cur_question = gsm8k['question'][i]
    cur_anwer = gsm8k['answer'][i].split("####")[-1].strip()
    cur_model_input = prompt_10shot_CoT.format(question=cur_question)
    result = llama.generate([cur_model_input], stop=["\n\n"])
    if "Error:" in result:
        print(result)
        break

    tmp = result['generations'][0]['text'].strip()
    cur_prediction = ""
    for i in reversed(tmp):
      if i.isdigit():
        cur_prediction = i + cur_prediction
      elif cur_prediction == "" or i == ",":
        continue
      else:
        break

    print("\nanswer: " + cur_anwer + ", prediction: " + result['generations'][0]['text'] + "\nextracted: " + cur_prediction + "\n")

    pass_collected.append(cur_prediction.strip().replace("$","") == cur_anwer)
    results_collected.append({"question": cur_question, "answer": cur_anwer, "prediction": cur_prediction})
    print(f"Acc: {sum(pass_collected)/ len(pass_collected)}")

with open("result_10shot_CoT_direct.json", "w") as f:
    json.dump(results_collected, f, indent=4)

  2%|▏         | 1/50 [00:03<02:53,  3.55s/it]


answer: 18, prediction: Janet sells 16-3-4 = <<16-3-4=5>>5 eggs every day for $2 each.
This means, she is able to sell 5 x $2 = <<5*2
extracted: 2

Acc: 0.0


  4%|▍         | 2/50 [00:07<02:52,  3.60s/it]


answer: 3, prediction: Fiber length = <<1/2>> = <<1/2=0.5>>0.5, blue, white
So total blue + total white = 2*0.5 + 0.5*0.5 =
extracted: 5

Acc: 0.0


  6%|▌         | 3/50 [00:10<02:38,  3.38s/it]


answer: 70000, prediction: Josh bought the house for $80,000
He did $50,000 in repairs
So he spent $80,000 + $50,000 = $<<80
extracted: 80

Acc: 0.0


  8%|▊         | 4/50 [00:12<02:21,  3.07s/it]


answer: 540, prediction: He runs a total of 60*3 = <<60*3=180>>180 meters a week
The answer is 180
extracted: 180

Acc: 0.0


 10%|█         | 5/50 [00:16<02:25,  3.23s/it]


answer: 20, prediction: Wendi's final meal of the day is 15 + 25 = <<15+25=40>>40 cups.
She needs to give her chickens 40 - 15
extracted: 15

Acc: 0.0


 12%|█▏        | 6/50 [00:19<02:17,  3.13s/it]


answer: 64, prediction: The first glass costs 5/1 = <<5/1=5>>5.
The second glass costs 5/2 = <<5/2=2.5>>2.5
The third glass costs 5/3 =
extracted: 3

Acc: 0.0


 14%|█▍        | 7/50 [00:22<02:09,  3.02s/it]


answer: 260, prediction: 
Toulouse has 2x as many sheep as Charleston.
Charleston has 4x as many sheep as Seattle
These three cities have 5 times as many sheep as Seattle.
There are 20
extracted: 20

Acc: 0.0


 16%|█▌        | 8/50 [00:24<02:02,  2.92s/it]


answer: 160, prediction: Carla can download 2 GB/min = <<2 GB/min=0.2>>0.2 GB/minute
So it takes 40 minutes to download 200 GB
200/0.2
extracted: 2

Acc: 0.0


 18%|█▊        | 9/50 [00:28<02:15,  3.31s/it]


answer: 45, prediction: Let 1 hour = 60 minutes
The first 2 hours, John spent 2 hours at a speed of 60mph = 2 hours x 60 = 120 minutes.
The next 4
extracted: 4

Acc: 0.0


 20%|██        | 10/50 [00:32<02:10,  3.27s/it]


answer: 460, prediction: Eliza worked 45/1.2 = <<45/1.2=37>>37 hours this week.
37 hours x 10 = <<37*10=370>>37
extracted: 37

Acc: 0.0


 22%|██▏       | 11/50 [00:35<02:04,  3.18s/it]


answer: 366, prediction: The total number of downloads are 60, 3(60)=<<3(60)=180>>180, and 180-30=<<180-30=15
extracted: 15

Acc: 0.0


 24%|██▍       | 12/50 [00:38<02:05,  3.31s/it]


answer: 694, prediction: Toula bought 3 dozen donuts which cost $68 per dozen = <<68/12=5.7>>5.7 each
Then she bought 2 dozen mini cupcakes which cost $80 per dozen
extracted: 80

Acc: 0.0


 26%|██▌       | 13/50 [00:41<01:58,  3.21s/it]


answer: 13, prediction: Carlos has to plant the tree first.
So, he has to pay 90 + 3 = <<90+3=93>>93 dollars for planting, then he can sell the 7 lemons for 
extracted: 7

Acc: 0.0


 28%|██▊       | 14/50 [00:45<01:58,  3.29s/it]


answer: 18, prediction: Melanie started with 5/3 = <<5/3=1>>1 each for the green and red house.
She has 3 * 2 = <<3*2=6>>6 vacuum cleaners left.
extracted: 6

Acc: 0.0


 30%|███       | 15/50 [00:48<01:57,  3.35s/it]


answer: 60, prediction: 20% of the entire students enrolled in contemporary dance.
25% of the remaining enrolled in jazz dance.
The rest enrolled in hip-hop dance.
This means, 100% = 2
extracted: 2

Acc: 0.0


 32%|███▏      | 16/50 [00:52<01:53,  3.34s/it]


answer: 125, prediction: Let's look at the jewelry market first which will rise 2.5%
After the 2.5% rise, the jewelry will be worth $5,250
Next, the merchant will
extracted: 5250

Acc: 0.0


 34%|███▍      | 17/50 [00:54<01:42,  3.10s/it]


answer: 230, prediction: Each train covers 200 miles(80+150)
The answer is 200
extracted: 200

Acc: 0.0


 36%|███▌      | 18/50 [00:57<01:38,  3.08s/it]


answer: 57500, prediction: Jill works 35*50 = <<35*50=1750>>1750 hours a year as a teacher and 15*50 = <<15*50=750
extracted: 750

Acc: 0.0


 38%|███▊      | 19/50 [01:01<01:39,  3.22s/it]


answer: 7, prediction: Claire makes 3 eggs for breakfast every morning
So she makes 3×4=<<3*4=12>>12 eggs every day
So she eats 12×21=<<12*21
extracted: 21

Acc: 0.0


 40%|████      | 20/50 [01:04<01:40,  3.33s/it]


answer: 6, prediction: In 1 hour, she walked 4 miles.
In the next 2 hours, she walked 2 miles.
So she walked 4+2 = <<4+2=6>>6 miles in two hours.
This means
extracted: 6

Acc: 0.05


 42%|████▏     | 21/50 [01:08<01:40,  3.46s/it]


answer: 15, prediction: We are given that: 20 liters of orange drink that are two-thirds water and I wish to add it to 15 liters of pineapple drink that is three-fifths water.
We are also
extracted: 15

Acc: 0.09523809523809523


 44%|████▍     | 22/50 [01:12<01:39,  3.55s/it]


answer: 14, prediction: Samantha was born 6 years before Raymond, and Raymond was born 6 years before Raymond's son.
So Samantha is 31 - 6 = <<31-6=25>>25 years older
extracted: 25

Acc: 0.09090909090909091


 46%|████▌     | 23/50 [01:15<01:35,  3.52s/it]


answer: 7, prediction: Let n represent the number of customers on Tuesday.
On Tuesday, n = 3+2+3 = <<3+2+3=8>>8.
This means, the first 3 customers buy 1
extracted: 1

Acc: 0.08695652173913043


 48%|████▊     | 24/50 [01:18<01:25,  3.27s/it]


answer: 8, prediction: The candle will be shorter by 2 x 4 = <<2 x 4=8>>8 centimeters.
The answer is 8
extracted: 8

Acc: 0.125


 50%|█████     | 25/50 [01:22<01:24,  3.37s/it]


answer: 26, prediction: Kyle paid 19.50 - 0.25 = <<19.50-0.25=19.25>>19.25 for the book.
The original price of the book
extracted: 25

Acc: 0.12


 52%|█████▏    | 26/50 [01:25<01:21,  3.39s/it]


answer: 2, prediction: Marie paid $12+5*3=<<12+5*3=45>>45 for the chicken meal.
She paid $1.50 * 4 = <<1.50*4=
extracted: 4

Acc: 0.11538461538461539


 54%|█████▍    | 27/50 [01:28<01:18,  3.42s/it]


answer: 243, prediction: Mishka spent $16.50 * 3 = $<<16.5*3=49>>49 on 3 pairs of shorts.
He spent $22.50 * 3 = $
extracted: 3

Acc: 0.1111111111111111


 56%|█████▌    | 28/50 [01:32<01:15,  3.42s/it]


answer: 16, prediction: Cynthia eats 15 servings at a cost of $4 = <<15*4=60>>60.
So she eats 60 servings in 60 days = <<60*
extracted: 60

Acc: 0.10714285714285714


 58%|█████▊    | 29/50 [01:35<01:10,  3.37s/it]


answer: 25, prediction: First stop was 20 miles after beginning the trip.
Second stop was 15 miles before the end of the trip.
Meaning, he traveled 20 + 15 = <<20+15
extracted: 15

Acc: 0.10344827586206896


 60%|██████    | 30/50 [01:39<01:09,  3.48s/it]


answer: 104, prediction: So Gloria has 33 + 2(2x33) = $<<33+2(2x33)=133>>133 dollars to spend.
The boots cost 133 -
extracted: 133

Acc: 0.1


 62%|██████▏   | 31/50 [01:43<01:07,  3.58s/it]


answer: 109, prediction: Darrell is 7 years old, so Allen's age is 11 years old.
Darrell's age in 11 years is 7 + 11 = <<7+11=18>>
extracted: 18

Acc: 0.0967741935483871


 64%|██████▍   | 32/50 [01:46<01:01,  3.41s/it]


answer: 80, prediction: 1/3 of 80 = <<1/3 of 80=26.7>>26.7
1/3 of 26.7 = <<1/3 of 26.7=8.
extracted: 8

Acc: 0.09375


 66%|██████▌   | 33/50 [01:49<00:56,  3.32s/it]


answer: 35, prediction: John spends 10 x .5 = <<10*0.5=5>>5 hours to take care of his dogs every day.
That means he spends 5 x 7 = <<5*7=35
extracted: 35

Acc: 0.12121212121212122


 68%|██████▊   | 34/50 [01:52<00:51,  3.21s/it]


answer: 70, prediction: There are 30 more gold coins than silver coins.
This means, there are 30 silver coins and 30 gold coins.
So Gretchen has 110-30-30
extracted: 30

Acc: 0.11764705882352941


 70%|███████   | 35/50 [01:56<00:53,  3.54s/it]


answer: 23, prediction: Siobhan has 2 * (1/2) = <<2*(1/2)=10>>10 jewels less than Aaron.
Since Aaron has 5 more jewels than half of Raymond'
extracted: 5

Acc: 0.11428571428571428


 72%|███████▏  | 36/50 [01:59<00:47,  3.40s/it]


answer: 9, prediction: Mike scored 4 x 2 = <<4 x 2=8>>8 points in the first 20 minutes.
Since he scores 25% more in the second 20 minutes, he scores the same amount (
extracted: 20

Acc: 0.1111111111111111


 74%|███████▍  | 37/50 [02:03<00:45,  3.54s/it]


answer: 75, prediction: Terry eats 2/4 = <<2/4=0.5>>0.5 yogurts a day.
There are 30 days in 1 month. So Terry eats 30 x 
extracted: 30

Acc: 0.10810810810810811


 76%|███████▌  | 38/50 [02:06<00:42,  3.53s/it]


answer: 2, prediction: John sells 13/15 = <<13/15=0.86>>0.86 lego sets.
He has 0.86 x 20 = <<0.86*2
extracted: 2

Acc: 0.13157894736842105


 78%|███████▊  | 39/50 [02:10<00:38,  3.53s/it]


answer: 10, prediction: John runs 3*60 = <<3*60=180>>180 miles in the first day.
He runs 3/2 = <<3/2=1.5>>1.5 miles the other
extracted: 5

Acc: 0.1282051282051282


 80%|████████  | 40/50 [02:14<00:35,  3.57s/it]


answer: 18, prediction: Dana can run 4x faster than she can walk.
So, by skipping, she can run 4x faster than she can walk.
This means she can run 4x4 = <<4x4=16
extracted: 16

Acc: 0.125


 82%|████████▏ | 41/50 [02:17<00:31,  3.48s/it]


answer: 8, prediction: Ben's iPhone is 4 years old.
Suzy's iPhone is 2 years old.
Brandon's iPhone is twice as old as Ben's iPhone, and twice as old as Suzy's iPhone, which
extracted: 2

Acc: 0.12195121951219512


 84%|████████▍ | 42/50 [02:21<00:28,  3.57s/it]


answer: 200, prediction: Polly can throw the javelin for a distance of 400 feet because she is 400 feet away from mount Farbo.
Since the dragon's flames sleigh anything within a distance of 10
extracted: 10

Acc: 0.11904761904761904


 86%|████████▌ | 43/50 [02:24<00:24,  3.53s/it]


answer: 26, prediction: Grandma Jones cut each pie into 8 pieces, so there are 40 pieces in the pies.
Then the guests took a total of 40-14 = <<40-14=26>>2
extracted: 2

Acc: 0.11627906976744186


 88%|████████▊ | 44/50 [02:28<00:21,  3.53s/it]


answer: 48, prediction: You can eat 250 calories per serving.
300g x 5 servings = 1500g
1500g / 250 calories per serving = 6
So you
extracted: 6

Acc: 0.11363636363636363


 90%|█████████ | 45/50 [02:31<00:17,  3.49s/it]


answer: 20, prediction: Each tapered candle can be made from 1 pound of beeswax and 10 wicks
Each tapered candle costs 11 = <<11=0.9>>0.9
So
extracted: 9

Acc: 0.1111111111111111


 92%|█████████▏| 46/50 [02:35<00:13,  3.47s/it]


answer: 104, prediction: Meredith's writing rate in Monday is 5/1=<<5/1=5>>5 articles.
Her writing rate in Tuesday is 2/5 * 2/3 = <<2/5*2
extracted: 2

Acc: 0.10869565217391304


 94%|█████████▍| 47/50 [02:38<00:10,  3.51s/it]


answer: 163, prediction: Candice put 80 + 23 = <<80+23=103>>103 post-it notes in her purse before she headed out to her job at the coffee shop.
So she
extracted: 103

Acc: 0.10638297872340426


 96%|█████████▌| 48/50 [02:41<00:06,  3.25s/it]


answer: 800, prediction: John spent $200 on blue ties
So he bought 200/40 = <<200/40=5>>5 blue ties
extracted: 5

Acc: 0.10416666666666667


 98%|█████████▊| 49/50 [02:44<00:03,  3.29s/it]


answer: 8, prediction: Tracy used 4*6 = <<4*6=24>>24 pieces of wire.
She cut each piece into 6 pieces. So she obtained 24*6 = <<24*6=144
extracted: 144

Acc: 0.10204081632653061


100%|██████████| 50/50 [02:48<00:00,  3.36s/it]


answer: 30, prediction: The building is 15/4 = <<15/4=3.75>>3.75 floors high.
So, there are 3.75-3.75 = <<3.75-
extracted: 75

Acc: 0.1





## **Experience 3**. Comparing CoT with different Model Scale

### Step 3-1. CoT with OPT-350M

In [88]:
with open("prompt_10shot_CoT.txt", "r") as f:
    prompt_10shot_CoT = f.read()

from tqdm import tqdm

llama = LLM("facebook/opt-350m", student_id, password)
results_collected = []
pass_collected = []
for i in tqdm(range(50)):
    cur_question = gsm8k['question'][i]
    cur_anwer = gsm8k['answer'][i].split("####")[-1].strip()
    cur_model_input = prompt_10shot_CoT.format(question=cur_question)
    result = llama.generate([cur_model_input], stop=["\n\n"])
    if "Error:" in result:
        print(result)
        break

    tmp = result['generations'][0]['text'].strip()
    cur_prediction = ""
    for i in reversed(tmp):
      if i.isdigit():
        cur_prediction = i + cur_prediction
      elif cur_prediction == "" or i == ",":
        continue
      else:
        break

    print("\nanswer: " + cur_anwer + ", prediction: " + result['generations'][0]['text'] + "\nextracted: " + cur_prediction + "\n")

    pass_collected.append(cur_prediction.strip().replace("$","") == cur_anwer)
    results_collected.append({"question": cur_question, "answer": cur_anwer, "prediction": cur_prediction})
    print(f"Acc: {sum(pass_collected)/ len(pass_collected)}")

with open("result_10shot_CoT_direct_opt350m.json", "w") as f:
    json.dump(results_collected, f, indent=4)

  2%|▏         | 1/50 [00:01<00:58,  1.19s/it]


answer: 18, prediction: She buys 4.5 eggs per day
She eats 3 for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in
extracted: 2

Acc: 0.0


  4%|▍         | 2/50 [00:02<00:55,  1.15s/it]


answer: 3, prediction: Each bolt costs 1/2 the color of the robe.
Each bolt costs 1/2 the color of the robe
Each bolt costs 1/2 the color of the robe
Each bolt costs 1/2 the color of the robe
Each
extracted: 2

Acc: 0.0


  6%|▌         | 3/50 [00:03<00:53,  1.14s/it]


answer: 70000, prediction: Josh bought a house for $80,000 and then put in $50,000 in repairs.  Now he needs to pay $50,000 to replace the roof and $80,000 to rebuild the house.
So he is making $
extracted: 80000

Acc: 0.0


  8%|▊         | 4/50 [00:04<00:53,  1.16s/it]


answer: 540, prediction: James runs 2 total meters a week.
He runs 60 meters each sprint one time, then runs 3 times.
He runs 2 total meters a week.
He runs 2 total meters a week.
He runs 3 total meters a week.
extracted: 3

Acc: 0.0


 10%|█         | 5/50 [00:05<00:50,  1.11s/it]


answer: 20, prediction: Wendi feeds her flock of chickens 15 cups of feed.  That means she needs to feed them 15 cups of feed every day.
The answer is 15
extracted: 15

Acc: 0.0


 12%|█▏        | 6/50 [00:06<00:49,  1.12s/it]


answer: 64, prediction: Kylar needs at least 4 glasses to cover his cost, and in order to buy 4 glasses, he has to sell only 1 pair of glasses.
If Kylar sells only 1 pair of glasses, then he will only have 1 pair of
extracted: 1

Acc: 0.0


 14%|█▍        | 7/50 [00:07<00:48,  1.12s/it]


answer: 260, prediction: If Toulouse, Charleston, and Seattle have 20 sheep, then Toulouse, Charleston, and Seattle have 20 sheep.
If Toulouse, Charleston, and Seattle have 20 sheep then Toulouse, Charleston, and Seattle have 20
extracted: 20

Acc: 0.0


 16%|█▌        | 8/50 [00:08<00:41,  1.00it/s]


answer: 160, prediction: Carla downloads a file in less than 10 minutes
extracted: 10

Acc: 0.0


 18%|█▊        | 9/50 [00:09<00:42,  1.04s/it]


answer: 45, prediction: John spent the second half-hour in a standstill traffic and then stopped at a stop sign to ask the police if he could move his car.  They agreed and John moved his car.
He spent the 3rd half-hour in stand
extracted: 3

Acc: 0.0


 20%|██        | 10/50 [00:10<00:42,  1.06s/it]


answer: 460, prediction: She works 40 hours a week.
She gets $10.99 = $<<10.99<<10.99<<10.99<<10.99<<10.99<<10.99<<10.99<<10.99<<10
extracted: 10

Acc: 0.0


 22%|██▏       | 11/50 [00:12<00:42,  1.08s/it]


answer: 366, prediction: The program had 60 downloads in the first month.
The number of downloads in the second month was three times as many as the downloads in the first month, but then reduced by 30% in the third month.
How many downloads did the program
extracted: 30

Acc: 0.0


 24%|██▍       | 12/50 [00:13<00:41,  1.09s/it]


answer: 694, prediction: Toula bought 4 dozen mini cupcakes which cost $76 per dozen, 2 dozen mini cupcakes which cost $74 per dozen, 2 dozen mini cheesecakes which cost $78 per dozen, and 6 dozen mini cheesecakes which cost
extracted: 6

Acc: 0.0


 26%|██▌       | 13/50 [00:14<00:40,  1.10s/it]


answer: 13, prediction: Carlos will plant a lemon tree every year for 10 years. This means that he will need to buy a new lemon tree every 10 years to keep up with his income.
So Carlos will plant a lemon tree every year for 10 years.

extracted: 10

Acc: 0.0


 28%|██▊       | 14/50 [00:15<00:37,  1.04s/it]


answer: 18, prediction: Melanie started with 20 vacuum cleaners, and she doubled them and doubled them and doubled them and doubled them.
The answer is 30
extracted: 30

Acc: 0.0


 30%|███       | 15/50 [00:16<00:37,  1.07s/it]


answer: 60, prediction: 20% of the students in hip-hop dance are not enrolled in contemporary dance, but because hip-hop dance is more popular than contemporary dance, they are eligible for that. Therefore, 20% of the students in hip-hop dance are eligible
extracted: 20

Acc: 0.0


 32%|███▏      | 16/50 [00:17<00:37,  1.10s/it]


answer: 125, prediction: The cash will be divided between the 2 purchase plans, 1 of which is $500.
The cash will be divided between the 2 purchase plans, 1 of which is $250.
The cash will be divided between the 2 purchase plans, 2
extracted: 2

Acc: 0.0


 34%|███▍      | 17/50 [00:18<00:35,  1.07s/it]


answer: 230, prediction: The distance covered by each train = 400 miles
(In other words, the distance traveled by the train from San Rafael to San Francisco is 360 miles
extracted: 360

Acc: 0.0


 36%|███▌      | 18/50 [00:19<00:34,  1.07s/it]


answer: 57500, prediction: Jill earns $20/hr + $30/hr = $<<20/hr=300>>300.
She earns 3 hours of overtime per year so she earns $300.
The answer is 500
extracted: 500

Acc: 0.0


 38%|███▊      | 19/50 [00:20<00:33,  1.09s/it]


answer: 7, prediction: Claire will eat a total of 12 eggs for the entire week.
She will eat 12 eggs for 48 days out of 48, then eat 4 dozen eggs for every day out of 48, then eat 3 dozen eggs for every day out of 48
extracted: 48

Acc: 0.0


 40%|████      | 20/50 [00:21<00:31,  1.04s/it]


answer: 6, prediction: 4 miles = 5 miles per hour.
Marissa needs to walk the remaining distance to make up for 4 miles.
The answer is 5
extracted: 5

Acc: 0.0


 42%|████▏     | 21/50 [00:22<00:30,  1.06s/it]


answer: 15, prediction: I poured a bottle of 12 liters of orange drink into the remaining 12 liters of pineapple drink.
The orange drink is 15 * 12 = 20.6 liters.
The pineapple drink is 12 * 12 = 23.6 liters
extracted: 6

Acc: 0.0


 44%|████▍     | 22/50 [00:23<00:30,  1.08s/it]


answer: 14, prediction: If Samantha is now 31, then Raymond's son was born at the age of 29.
If Raymond is now 31, then Samantha's son was born at the age of 27.
That means Samantha's son was born at the age of 29
extracted: 29

Acc: 0.0


 46%|████▌     | 23/50 [00:24<00:27,  1.01s/it]


answer: 7, prediction: Billy sold 32 DVD's on Tuesday, and 30 on Wednesday.
The answer is 32
extracted: 32

Acc: 0.0


 48%|████▊     | 24/50 [00:25<00:26,  1.00s/it]


answer: 8, prediction: A candle melts by 2 centimeters every hour that it burns.
So a candle melts by 2 centimeters every hour that it burns.
The answer is 2,300
extracted: 2300

Acc: 0.0


 50%|█████     | 25/50 [00:26<00:25,  1.04s/it]


answer: 26, prediction: Kyle's original price was $19.50 (with a 25% discount).
Kyle's original price was $20.00 (with a 25% discount).
The original price of the book was $20.00 (with a 25%
extracted: 25

Acc: 0.0


 52%|█████▏    | 26/50 [00:27<00:25,  1.07s/it]


answer: 2, prediction: Marie ordered one box of pizza, 5 packs of milk, 4 apples, and 4 boxes of pizza.
So she ordered a total of 16 boxes of pizza, 5 packs of milk, 4 apples, 4 boxes of pizza and 16 boxes of pizza
extracted: 16

Acc: 0.0


 54%|█████▍    | 27/50 [00:29<00:25,  1.09s/it]


answer: 243, prediction: She spends $16.50 per pair of shorts, $22.50 per pair of shoes, and $42.50 per pair of shoes.
One pair of shorts costs $16.50, one pair of shoes costs $22.50
extracted: 50

Acc: 0.0


 56%|█████▌    | 28/50 [00:30<00:24,  1.13s/it]


answer: 16, prediction: Cynthia spends 20% of her paycheck on ice cream because she has to pay her $100.00 per week rent for the apartment.  If she eats 2 times a day, she spends $26.00 per day on ice cream.
extracted: 00

Acc: 0.0


 58%|█████▊    | 29/50 [00:31<00:24,  1.15s/it]


answer: 25, prediction: Henry made one stop at 10 miles before the end of the trip and his second stop was 15 miles before the end of the trip.
Henry made one stop at 17 miles before the end of the trip and his second stop was 15 miles before the
extracted: 15

Acc: 0.0


 60%|██████    | 30/50 [00:32<00:21,  1.07s/it]


answer: 104, prediction: Gloria buys the boots.
She buys the boots and then she purchases the high heels.
The answer is 1000
extracted: 1000

Acc: 0.0


 62%|██████▏   | 31/50 [00:33<00:20,  1.08s/it]


answer: 109, prediction: Allen's age 10 years from now = 14 years.
So Allen's age 10 years from now is 25 years.
Darrell's age 10 years from now = 56 years.
Allen's age 60 years from now = 65 years.

extracted: 65

Acc: 0.0


 64%|██████▍   | 32/50 [00:34<00:19,  1.09s/it]


answer: 80, prediction: They estimate that the jelly beans should be in the jar for only a few hours.
Then, they estimate that the average person should be able to count the jelly beans for a minimum of 2 hours.
Average person should be able to count the
extracted: 2

Acc: 0.0


 66%|██████▌   | 33/50 [00:35<00:18,  1.10s/it]


answer: 35, prediction: John spends .5 hours a day to walk, take care of his business, and take care of his dogs.
He spends .5 hours a day to walk, take care of his business, and take care of his dogs.
The answer
extracted: 5

Acc: 0.0


 68%|██████▊   | 34/50 [00:36<00:17,  1.11s/it]


answer: 70, prediction: Gretchen has 104 gold coins.
Gretchen has 101 silver coins.
Gretchen has 103 gold coins.
Gretchen has 146 gold coins.
Gretchen has 151 silver coins.
Gretchen has 222
extracted: 222

Acc: 0.0


 70%|███████   | 35/50 [00:38<00:16,  1.12s/it]


answer: 23, prediction: According to Aaron, Siobhan has 40 jewels.
According to Raymond, Siobhan has 20 jewels.
According to Aaron, Siobhan has 10 jewels.
According to Raymond, Siobhan has 1 ring.
The answer
extracted: 1

Acc: 0.0


 72%|███████▏  | 36/50 [00:39<00:15,  1.12s/it]


answer: 9, prediction: Mike scored 24 points in the first 20 minutes and 25% more points in the second 20 minutes.
Mike scored 24 points in the first 20 minutes and 25% more points in the second 20 minutes.
On average, Mike scored 24 points in
extracted: 24

Acc: 0.0


 74%|███████▍  | 37/50 [00:40<00:14,  1.12s/it]


answer: 75, prediction: He spends 2 yogurts a day.
He spends 30 days on the yogurt and eats 2 yogurts a day.
He eats 30 days of yogurt and eats 2 yogurts a day.
He eats 30 days of yogurt and
extracted: 30

Acc: 0.0


 76%|███████▌  | 38/50 [00:41<00:13,  1.12s/it]


answer: 2, prediction: John has 14 lego sets total.
He has 4 sets of 3 lego boards of different sizes present.
John sells the boards of different sizes for $15 each.
He sells the boards of different sizes for $20 each.

extracted: 20

Acc: 0.0


 78%|███████▊  | 39/50 [00:42<00:12,  1.14s/it]


answer: 10, prediction: John runs 31.5 miles per week.
The answer is 31.5 miles per week.
The last time John ran 31.5 miles per week, he missed his run by 1/4 mile.
So he missed his run by
extracted: 4

Acc: 0.0


 80%|████████  | 40/50 [00:43<00:10,  1.06s/it]


answer: 18, prediction: Dana can run at 4.5 miles per hour.
The answer is 4.5
extracted: 5

Acc: 0.0


 82%|████████▏ | 41/50 [00:44<00:09,  1.10s/it]


answer: 8, prediction: Brandon’s iPhone is 1 year old.
Suzy’s iPhone is 2 years old.
Brandon’s iPhone is 5 years old.
Suzy’s iPhone is 8 years old.
Brandon’s
extracted: 8

Acc: 0.024390243902439025


 84%|████████▍ | 42/50 [00:45<00:08,  1.12s/it]


answer: 200, prediction: The dragon could stand, but nobody can say exactly how far the dragon could stand, because Polly never held the gemstone.
The dragon could stand, but nobody can say exactly how far the dragon could stand, because Polly never held the gemstone
extracted: 

Acc: 0.023809523809523808


 86%|████████▌ | 43/50 [00:46<00:07,  1.02s/it]


answer: 26, prediction: 7/4 * 8 = 24.00*24.00
The answer is 24
extracted: 24

Acc: 0.023255813953488372


 88%|████████▊ | 44/50 [00:47<00:06,  1.02s/it]


answer: 48, prediction: If a 300g bag contains 500 calories, the calorie target is 2000.
If a 400g bag has 5 servings, your daily calorie target is 2000.
The answer is 1000
extracted: 1000

Acc: 0.022727272727272728


 90%|█████████ | 45/50 [00:48<00:05,  1.05s/it]


answer: 20, prediction: Charlie sells 40 candles for $2.00 each.
Charlie buys 40 * 40 * 40 * 40 * 40 * 40 * 40 * 40 * 40 * 40 * 40 * 40 candles for $5.00 each.
How much does Charlie
extracted: 00

Acc: 0.022222222222222223


 92%|█████████▏| 46/50 [00:49<00:04,  1.05s/it]


answer: 104, prediction: Meredith spent 500 hours writing articles for clients in the three days she spent writing articles.
So Meredith spent 1000 hours writing articles for clients in the three days she spent writing articles.
The answer is 1000
extracted: 1000

Acc: 0.021739130434782608


 94%|█████████▍| 47/50 [00:50<00:03,  1.07s/it]


answer: 163, prediction: Candice bought 24 Post-it notes and added another 12 to the total.
She also bought 2 Post-it notes that were left over from her last job so the total is 2,200 Post-it notes.
Alcoholic beverages
extracted: 2200

Acc: 0.02127659574468085


 96%|█████████▌| 48/50 [00:52<00:02,  1.09s/it]


answer: 800, prediction: John bought 8 red ties, 1 red tie, and 3 blue ties.
The red ties cost $200 each.
The blue ties cost $40 each.
The 3 blue ties cost $100 each.
The 2 red ties cost $
extracted: 2

Acc: 0.020833333333333332


 98%|█████████▊| 49/50 [00:52<00:01,  1.00s/it]


answer: 8, prediction: Tracy bought 24 pieces of wire in one day.
The answer is 24
extracted: 24

Acc: 0.02040816326530612


100%|██████████| 50/50 [00:53<00:00,  1.08s/it]


answer: 30, prediction: Richard's total number of units is 8/10 = 50.
So, Richard has 10 units in his apartment, and 10/10 = 50.
There are 10 units in each floor, so he has 10/50 = 50.

extracted: 50

Acc: 0.02





### Step 3-2. CoT with OPT-1.3B

In [89]:
with open("prompt_10shot_CoT.txt", "r") as f:
    prompt_10shot_CoT = f.read()

from tqdm import tqdm

llama = LLM("facebook/opt-1.3b", student_id, password)
results_collected = []
pass_collected = []
for i in tqdm(range(50)):
    cur_question = gsm8k['question'][i]
    cur_anwer = gsm8k['answer'][i].split("####")[-1].strip()
    cur_model_input = prompt_10shot_CoT.format(question=cur_question)
    result = llama.generate([cur_model_input], stop=["\n\n"])
    if "Error:" in result:
        print(result)
        break

    tmp = result['generations'][0]['text'].strip()
    cur_prediction = ""
    for i in reversed(tmp):
      if i.isdigit():
        cur_prediction = i + cur_prediction
      elif cur_prediction == "" or i == ",":
        continue
      else:
        break

    print("\nanswer: " + cur_anwer + ", prediction: " + result['generations'][0]['text'] + "\nextracted: " + cur_prediction + "\n")

    pass_collected.append(cur_prediction.strip().replace("$","") == cur_anwer)
    results_collected.append({"question": cur_question, "answer": cur_anwer, "prediction": cur_prediction})
    print(f"Acc: {sum(pass_collected)/ len(pass_collected)}")

with open("result_10shot_CoT_direct_opt1_3b.json", "w") as f:
    json.dump(results_collected, f, indent=4)

  2%|▏         | 1/50 [00:01<01:08,  1.39s/it]


answer: 18, prediction: 
Janet sells 16 eggs for $2 each and sells 4 muffins for $2 each. The result is 16 eggs for $2 x 4 = $<<16x4=60>>60 eggs.
She sells 4 muffins every day
extracted: 4

Acc: 0.0


  4%|▍         | 2/50 [00:02<01:03,  1.32s/it]


answer: 3, prediction: 
Twelve bolts of blue fiber and half that much white fiber = <<12*12=72>>72.
Twelve bolts of blue fiber and half that much white fiber = <<12*12=72>>72.
Twelve bolts
extracted: 72

Acc: 0.0


  6%|▌         | 3/50 [00:04<01:10,  1.51s/it]


answer: 70000, prediction: He sold the house for $150,000 after he invested $50,000 in repairs.
So his profit was 150*50 = 150*150=150
150*150 = 150*150^150 = $150,000
The
extracted: 150000

Acc: 0.0


  8%|▊         | 4/50 [00:05<01:10,  1.53s/it]


answer: 540, prediction: He runs 60 meters every week = 60*60 = 60 meters
Therefore, each sprint is 60 meters long
60*60 = 60 + 60 = 60 meters
So he runs 60 meters a week = 60 + 60 = 60 meters
extracted: 60

Acc: 0.0


 10%|█         | 5/50 [00:07<01:11,  1.58s/it]


answer: 20, prediction: 
She needs 25 cups for the final meal of the day.
She gives her flock of chickens 15 cups of feed in the morning and 25 cups of feed in the afternoon.
Thus, she needs 25 cups of feed for the final meal of
extracted: 25

Acc: 0.0


 12%|█▏        | 6/50 [00:09<01:07,  1.53s/it]


answer: 64, prediction: The first glass costs $5, but every second glass costs 60% as much as the first glass so it will cost him $8
Which is less than the $16 he originally paid for the glasses, so he can only pay $6 for
extracted: 6

Acc: 0.0


 14%|█▍        | 7/50 [00:09<00:55,  1.30s/it]


answer: 260, prediction: The answer is 5
extracted: 5

Acc: 0.0


 16%|█▌        | 8/50 [00:11<00:59,  1.42s/it]


answer: 160, prediction: Carla downloads the file at 200 MB/minute in a 10-minute period.
10 minutes = 200*20 = $<<$200/minute=300>>300 MB/minute
You have $300 in your budget so you can download
extracted: 300

Acc: 0.0


 18%|█▊        | 9/50 [00:13<01:00,  1.48s/it]


answer: 45, prediction: He is at home in 60 miles.
He is at home in 60+30 = 60 + 30 = 60+40 = 60+55 = 60+60.
He is at home in 60 miles because he is 60 miles away.

extracted: 60

Acc: 0.0


 20%|██        | 10/50 [00:14<00:58,  1.47s/it]


answer: 460, prediction: 
She earns $11.00 per hour after 45 hours of work, so she earns $17.00 per hour.
Her overtime pay is $3.00 per hour, so her earnings for this week are $5.00 per hour
extracted: 00

Acc: 0.0


 22%|██▏       | 11/50 [00:15<00:52,  1.36s/it]


answer: 366, prediction: The program had 60 downloads over the three months.
The answer is 60
extracted: 60

Acc: 0.0


 24%|██▍       | 12/50 [00:17<00:53,  1.40s/it]


answer: 694, prediction: To purchase 3 dozen donuts and 2 dozen mini-cupcakes for $68, Toula bought 6 dozen donuts and $110 for the mini-cupcakes.
To purchase 6 dozen mini cheesecakes and 6 dozen mini cupcakes
extracted: 6

Acc: 0.0


 26%|██▌       | 13/50 [00:18<00:52,  1.41s/it]


answer: 13, prediction: The tree will take 5 years to produce 1 lemon.
A lemon tree takes 5 years to produce 1 lemon.
In the first year, he will earn $4.50 per lemon.
In the second year, he will earn $4
extracted: 4

Acc: 0.0


 28%|██▊       | 14/50 [00:20<00:53,  1.49s/it]


answer: 18, prediction: Melanie has 5 vacuum cleaners left.
She sold 3 vacuum cleaners to the green house, 2 to the red house, and 1 to the orange house.
She started with 3 vacuum cleaners
That means she sold 1/3 of her vacuum
extracted: 3

Acc: 0.0


 30%|███       | 15/50 [00:21<00:51,  1.48s/it]


answer: 60, prediction: 
If 25% of the students are enrolled in jazz and 25% of the students are enrolled in hip-hop, then the total number of students enrolled in hip-hop dance is 25*25 = <<25*25=270>>270.
extracted: 270

Acc: 0.0


 32%|███▏      | 16/50 [00:23<00:52,  1.53s/it]


answer: 125, prediction: 
The jewelry market will gain 2.5% so his profit will be $5,000 / 2.5% = $<<$5,000+$8,000>>5.
The electronics market will gain 1.2% so
extracted: 2

Acc: 0.0


 34%|███▍      | 17/50 [00:24<00:47,  1.43s/it]


answer: 230, prediction: The distance covered by the first train is (((1/2*80/2)*80/2)+80/2)) + (((1/2*120/2)*120/2)+120/2)) = ((((1/
extracted: 1

Acc: 0.0


 36%|███▌      | 18/50 [00:26<00:47,  1.47s/it]


answer: 57500, prediction: Jill is paid $20 per hour to teach and $30 per hour to be a cheerleading coach.
If she works 50 weeks a year, 35 hours a week as a teacher and 15 hours a week as a coach, how many hours
extracted: 15

Acc: 0.0


 38%|███▊      | 19/50 [00:27<00:42,  1.37s/it]


answer: 7, prediction: She will eat 4 dozen in 4 weeks
Therefore, she will eat 4 dozen in 4 weeks, which means she will consume 472 dozen of eggs in 4 weeks
The answer is 472
extracted: 472

Acc: 0.0


 40%|████      | 20/50 [00:28<00:39,  1.33s/it]


answer: 6, prediction: Marissa needs to walk 8 miles per hour to reach her goal of 4 miles per hour.
She walks 8 miles per hour for 4 miles per hour.
She needs to walk 8 miles per hour for 8 miles per hour.
She needs
extracted: 8

Acc: 0.0


 42%|████▏     | 21/50 [00:30<00:40,  1.41s/it]


answer: 15, prediction: The water in the remaining 24 liters is 50% water because, I poured the orange drink into the pineapple drink, so the water in the remaining two liters is 50%.
The answer is 50
extracted: 50

Acc: 0.0


 44%|████▍     | 22/50 [00:31<00:36,  1.29s/it]


answer: 14, prediction: Raymond's son was born in 1993 when Samantha was 31 years old
extracted: 31

Acc: 0.0


 46%|████▌     | 23/50 [00:32<00:37,  1.40s/it]


answer: 7, prediction: He sold 12 DVDs on Tuesday.
He sold 12 times as many DVDs as his first 3 customers but only sold twice as many DVDs than his last 3 customers.
His total number of DVDs sold was 12/3 = <<12*3=
extracted: 3

Acc: 0.0


 48%|████▊     | 24/50 [00:34<00:38,  1.50s/it]


answer: 8, prediction: 1:00 PM: 7 centimeters.
5:00 PM: 0 centimeters.
10: 00 PM: 0 centimeters.
12: 00 AM: 0 centimeters.
1:00 AM: 0 centimeters.
5:00 AM:
extracted: 00

Acc: 0.0


 50%|█████     | 25/50 [00:35<00:35,  1.41s/it]


answer: 26, prediction: $19.50 with discount = <<$19.50+25%=39>>39
He bought the book last year for $19.50, so the total price of the book is $39.50.
extracted: 50

Acc: 0.0


 52%|█████▏    | 26/50 [00:36<00:32,  1.36s/it]


answer: 2, prediction: She ordered 3 boxes of pizza.
She ordered 4 pizzas, each costing $8.50. So she paid $40.50 for 3 boxes of pizza
(3*8.50)*$40.50 = $40.50
extracted: 50

Acc: 0.0


 54%|█████▍    | 27/50 [00:38<00:30,  1.34s/it]


answer: 243, prediction: Mishka spent $15.50 on the shorts, $40 for the pants, and $42 for the shoes.
So she spent $45.50 on all the clothes she bought.
The answer is $45.50
extracted: 50

Acc: 0.0


 56%|█████▌    | 28/50 [00:39<00:29,  1.34s/it]


answer: 16, prediction: After 60 days, she spent $20.00 on ice cream.
So she spent $60.00 on ice cream.
That means in 60 days she spent $60.00 on ice cream.
The answer is $60
extracted: 60

Acc: 0.0


 58%|█████▊    | 29/50 [00:40<00:28,  1.33s/it]


answer: 25, prediction: Henry traveled 60 miles between his first and second stops so he covered 60*20 = 60*10 = 60*20 miles
extracted: 20

Acc: 0.0


 60%|██████    | 30/50 [00:42<00:27,  1.38s/it]


answer: 104, prediction: The boots cost $33 and the two high heels cost $67.
That means the boots cost $33 + 67 = <<33+67=123>>123 dollars.
She is able to buy the boots with her shoe budget because she spent
extracted: 123

Acc: 0.0


 62%|██████▏   | 31/50 [00:43<00:24,  1.30s/it]


answer: 109, prediction: Allen is 5 years old.
Darrell is 12 years old
The answer is 60
extracted: 60

Acc: 0.0


 64%|██████▍   | 32/50 [00:45<00:25,  1.39s/it]


answer: 80, prediction: All three of them guessed 80 and Gunter was right on the money.
The total number of jelly beans in the jar was 80.
The average guess was 80.5
The average estimate was 80.3
The average guess was 80
extracted: 80

Acc: 0.03125


 66%|██████▌   | 33/50 [00:46<00:23,  1.36s/it]


answer: 35, prediction: He takes care of 10 dogs 1/2 hour per day
So, he takes 10/2 = <<10*1/2=12>>12 hours per week
The answer is 12
extracted: 12

Acc: 0.030303030303030304


 68%|██████▊   | 34/50 [00:48<00:23,  1.46s/it]


answer: 70, prediction: Gretchen has 110 gold coins
She has 300 coins left so she has 220 gold coins
Gold coins are more valuable than silver coins because gold coins have a higher value per gram than silver coins.
If Gretchen only uses 1 gold coin
extracted: 1

Acc: 0.029411764705882353


 70%|███████   | 35/50 [00:49<00:22,  1.53s/it]


answer: 23, prediction: 
Siobhan has 40/5 = <<40/5=8>>8 jewels.
Siobhan has 40/5 + 5 + 5 = <<40/5+5=16>>8 jewels.
Since Siobhan has
extracted: 8

Acc: 0.02857142857142857


 72%|███████▏  | 36/50 [00:51<00:22,  1.63s/it]


answer: 9, prediction: He scored 60 points in the first 20 minutes.
In the second 20 minutes, he scored 40 points.
In the end, he scored 100 points.
In the first 20 minutes, he scored 4 points; in the second 20 minutes,
extracted: 20

Acc: 0.027777777777777776


 74%|███████▍  | 37/50 [00:53<00:22,  1.70s/it]


answer: 75, prediction: That's $Total cost of Yogurt = $<<4*30+30+30+30+30+30+30+30+30+30+30+30+30+30+30+30+30+30+30
extracted: 30

Acc: 0.02702702702702703


 76%|███████▌  | 38/50 [00:55<00:21,  1.76s/it]


answer: 2, prediction: John has 13 lego sets and he has $5 left.
He has 13 lego sets, so he has 13*13 = $<<13*13=216>>216 pieces of lego left
He has 8 video games, so
extracted: 8

Acc: 0.02631578947368421


 78%|███████▊  | 39/50 [00:57<00:19,  1.78s/it]


answer: 10, prediction: He runs 2.5 miles per hour
So he runs 2.5 x 60 = 2.5 x 9.5 = <<2.5*9.5=39>>39 miles a week
The answer is 39
extracted: 39

Acc: 0.02564102564102564


 80%|████████  | 40/50 [00:59<00:18,  1.80s/it]


answer: 18, prediction: If she runs 3 miles per hour, she can travel 5 miles in 6 hours if she uses all of her energy in running and all of the energy in skipping.
If she skips, she can travel 3 miles per hour, but she needs
extracted: 3

Acc: 0.025


 82%|████████▏ | 41/50 [01:00<00:15,  1.74s/it]


answer: 8, prediction: Ben’s iPhone is two times older than Suzy’s iPhone.
Brandon’s iPhone is four times older than Ben’s iPhone.
extracted: 

Acc: 0.024390243902439025


 84%|████████▍ | 42/50 [01:02<00:14,  1.77s/it]


answer: 200, prediction: Perg's flames only reach 1000 feet. Therefore, Polly's distance of 400 feet is only 400 feet inside of the reach of Perg's flames.
The distance of the dragon's flames is about 1000. Therefore, her distance of 400 feet
extracted: 400

Acc: 0.023809523809523808


 86%|████████▌ | 43/50 [01:04<00:12,  1.80s/it]


answer: 26, prediction: There were 14 pieces left after the guests had eaten their pie.
Grandma Jones then cut each pie in half and set each half out on the buffet table.  At the end of the evening, after the guests had taken and eaten their pieces
extracted: 14

Acc: 0.023255813953488372


 88%|████████▊ | 44/50 [01:06<00:10,  1.80s/it]


answer: 48, prediction: According to its nutritional info, a bag of chips has 250 calories per serving.
If a 300g bag has 5 servings, how many grams can you eat if your daily calorie target is 2000 and you have already consumed 1800 calories?
If you
extracted: 1800

Acc: 0.022727272727272728


 90%|█████████ | 45/50 [01:08<00:09,  1.81s/it]


answer: 20, prediction: He can sell 20 candles for $2.00 each.
He made $20*20 = $80.00
So he made $80.00 * 1/20 = $80.00
So he made $80.00 *
extracted: 00

Acc: 0.022222222222222223


 92%|█████████▏| 46/50 [01:09<00:06,  1.70s/it]


answer: 104, prediction: Meredith worked 5 days for $90.50.
She worked 5 days for $90.50.
She worked 5 days for $90.50.
She worked 5 days for $90.50.
She worked 5 days for
extracted: 5

Acc: 0.021739130434782608


 94%|█████████▍| 47/50 [01:11<00:05,  1.77s/it]


answer: 163, prediction: Candice put 80 post-it notes on the packages she purchased.
The total number of Post-it notes she put in the package was 80/220 = <<80/220=5>>5 Post-it notes
There are 5 Post
extracted: 5

Acc: 0.02127659574468085


 96%|█████████▌| 48/50 [01:13<00:03,  1.79s/it]


answer: 800, prediction: John spent $40/50 = <<40/50=$40/50=60>>60 ties
So he spent $40/50 = <<40/50=$40/50=120>>120 ties on red ties.
John
extracted: 120

Acc: 0.020833333333333332


 98%|█████████▊| 49/50 [01:14<00:01,  1.66s/it]


answer: 8, prediction: She obtained 6 pieces
Each piece has 24 inches of wire.
Thus, she obtained 24/(24+6) = 12.
Each piece represents 24 hours, so each hour of wire represents 24 * 12 = 48 hours
Each wire represents
extracted: 48

Acc: 0.02040816326530612


100%|██████████| 50/50 [01:15<00:00,  1.52s/it]


answer: 30, prediction: There are 255 unoccupied units in the building, and 66 are vacant.
The total number of units = 255*3/4 = <<255*3/4=42>>42
The total number of units in the building is 255/
extracted: 255

Acc: 0.02





### Step 3-3. CoT with OPT-2.7B

In [90]:
with open("prompt_10shot_CoT.txt", "r") as f:
    prompt_10shot_CoT = f.read()

from tqdm import tqdm

llama = LLM("facebook/opt-2.7b", student_id, password)
results_collected = []
pass_collected = []
for i in tqdm(range(50)):
    cur_question = gsm8k['question'][i]
    cur_anwer = gsm8k['answer'][i].split("####")[-1].strip()
    cur_model_input = prompt_10shot_CoT.format(question=cur_question)
    result = llama.generate([cur_model_input], stop=["\n\n"])
    if "Error:" in result:
        print(result)
        break

    tmp = result['generations'][0]['text'].strip()
    cur_prediction = ""
    for i in reversed(tmp):
      if i.isdigit():
        cur_prediction = i + cur_prediction
      elif cur_prediction == "" or i == ",":
        continue
      else:
        break

    print("\nanswer: " + cur_anwer + ", prediction: " + result['generations'][0]['text'] + "\nextracted: " + cur_prediction + "\n")

    pass_collected.append(cur_prediction.strip().replace("$","") == cur_anwer)
    results_collected.append({"question": cur_question, "answer": cur_anwer, "prediction": cur_prediction})
    print(f"Acc: {sum(pass_collected)/ len(pass_collected)}")

with open("result_10shot_CoT_direct_opt2_7b.json", "w") as f:
    json.dump(results_collected, f, indent=4)

  2%|▏         | 1/50 [00:01<01:11,  1.47s/it]


answer: 18, prediction: John ate four eggs for breakfast every day. He bought two dozen each at the farmers' market.
He bought 24 ducks for $2 each at the farmers' market.
So he bought 24 / 2 = <<24/2=12>>
extracted: 12

Acc: 0.0


  4%|▍         | 2/50 [00:03<01:30,  1.88s/it]


answer: 3, prediction: It takes $16 bolts in blue fiber and $32 bolts in white fiber to make a robe.
Let B be the number of bolts in blue fiber and C be the number of bolts in white fiber.
Since B = C, the number
extracted: 32

Acc: 0.0


  6%|▌         | 3/50 [00:05<01:34,  2.00s/it]


answer: 70000, prediction: Josh spent $80,000 to buy the house.
He spent $50,000 to fix the house, so he is left with $40,000.
He made $40,000 profit.
The answer is $40,000
extracted: 40000

Acc: 0.0


  8%|▊         | 4/50 [00:07<01:23,  1.82s/it]


answer: 540, prediction: In total, James runs 3*60 = <<3*60=216>>216 meters.
The answer is 216
extracted: 216

Acc: 0.0


 10%|█         | 5/50 [00:09<01:24,  1.88s/it]


answer: 20, prediction: She needs 15 * 15*15 = 15 * 25=225 cups of food to feed her flock of birds before bed.
Wendi's flock of birds is 20.
The answer is 225
extracted: 225

Acc: 0.0


 12%|█▏        | 6/50 [00:11<01:26,  1.97s/it]


answer: 64, prediction: Let T be the amount Kylar spent on the glasses.
He spent $5 + 60/16 = <<60/16=3>>3 glasses.
Let T + 16 = <<16+16=32>>32 glasses.
He
extracted: 32

Acc: 0.0


 14%|█▍        | 7/50 [00:12<01:18,  1.82s/it]


answer: 260, prediction: Toulouse has twice as many sheep as Charleston. Charleston has 4 times as many sheep as Seattle. Seattle has 20 sheep. So if Seattle has 20 sheep, then Toulouse, Charleston, and Seattle together have 80 sheep.
The answer
extracted: 80

Acc: 0.0


 16%|█▌        | 8/50 [00:15<01:20,  1.91s/it]


answer: 160, prediction: 
Carla is able to download 200 GB in 20 minutes.
The file is 2 GB + 40% = 2 GB + 40% = 2.4 GB.
Carla's normal download speed is 2 GB / min = <<2*
extracted: 2

Acc: 0.0


 18%|█▊        | 9/50 [00:16<01:17,  1.88s/it]


answer: 45, prediction: John drove for 3 hours at 60 mph, so the maximum distance he could drive in 4 hours is 3*60 = <<3*60=180>>180 miles.
After those 3 hours of driving, he drove 60 miles at 60 mph and
extracted: 60

Acc: 0.0


 20%|██        | 10/50 [00:19<01:18,  1.97s/it]


answer: 460, prediction: Eliza earns $10/45 = <<$10/45=5>>5 hours this week.
She works just 45 hours, so she earns $5/45 = <<$5/45=2>>2 hours of overtime pay this
extracted: 2

Acc: 0.0


 22%|██▏       | 11/50 [00:21<01:16,  1.97s/it]


answer: 366, prediction: The number of downloads was two times greater in the second month than in the first month.
Therefore, the number of downloads was eight times greater in the second month than in the first.
Therefore, the number of downloads was ten times greater in
extracted: 

Acc: 0.0


 24%|██▍       | 12/50 [00:23<01:17,  2.04s/it]


answer: 694, prediction: Toula bought 3 dozen donuts which cost $68 per dozen, 2 dozen mini cupcakes which cost $80 per dozen, and 6 dozen mini cheesecakes for $55 per dozen.
Each of those pastries cost $68 +
extracted: 68

Acc: 0.0


 26%|██▌       | 13/50 [00:25<01:17,  2.09s/it]


answer: 13, prediction: Carlos plants the tree for 7 years. It will take Carloss + 7 = <<7*7=14>>14 years for the tree to earn money.
If you add up the cost of the tree, the cost of the pruning
extracted: 14

Acc: 0.0


 28%|██▊       | 14/50 [00:26<01:08,  1.91s/it]


answer: 18, prediction: She started with 5 vacuum cleaners and she sold 5 vacuum cleaners, so she only has 4 left.
If she had started with 4 vacuum cleaners and sold 5 and 2 more vacuum cleaners at the green house and the red house, then she would have
extracted: 2

Acc: 0.0


 30%|███       | 15/50 [00:28<01:03,  1.82s/it]


answer: 60, prediction: Hip-hop dance enrollment is <<25%>>25% of the hip-hop dance class is enrolled.
Hip-hop dance enrollment is 25/20 = <<25/20=50>>50 hip-hop dancers
It is
extracted: 50

Acc: 0.0


 32%|███▏      | 16/50 [00:30<01:02,  1.85s/it]


answer: 125, prediction: If the merchant were to buy the jewelry, he would maximize his profit by making the purchase.
If he were to buy the gadgets, he would maximize his profit by making the purchase.
If he were to buy the jewelry and the gadgets,
extracted: 

Acc: 0.0


 34%|███▍      | 17/50 [00:31<00:57,  1.75s/it]


answer: 230, prediction: Train 1 travels N for 80 miles and covers 150 miles while Train 2 travels W for 80 miles and covers 150 miles.
Train 1 is the longer of the two trains, therefore it covered 150 miles in 2 days.
The answer is 150 +
extracted: 150

Acc: 0.0


 36%|███▌      | 18/50 [00:33<00:56,  1.76s/it]


answer: 57500, prediction: Jill's annual salary is $<<20*35=880>>880.
Therefore, her annual salary is <<20*35=80>>80.
Her base pay is $880 per month because she only teaches 35 hours a week.
extracted: 35

Acc: 0.0


 38%|███▊      | 19/50 [00:35<00:58,  1.87s/it]


answer: 7, prediction: Claire eats 3 dozen eggs every morning for breakfast. She eats 3 dozen eggs in 4 weeks.
She eats 3 dozen eggs in 4 weeks = 3 dozen + 1/4 = <<3 dozen+1/4=3>>3 eggs.
extracted: 3

Acc: 0.0


 40%|████      | 20/50 [00:37<00:55,  1.86s/it]


answer: 6, prediction: To walk the remaining distance in 4 miles per hour, she needs to average 4 miles per hour.
To walk the remaining distance in 4 miles per hour, she needs to walk the remaining 3 miles per hour.
She needs to walk 3 miles
extracted: 3

Acc: 0.0


 42%|████▏     | 21/50 [00:39<00:50,  1.75s/it]


answer: 15, prediction: I have 12 liters of orange drink and I add it to 15 liters of pineapple drink that is three-fifths water.
I have 12 + 15 = <<12+15=24>>24 liters.
I have 12 +
extracted: 12

Acc: 0.0


 44%|████▍     | 22/50 [00:41<00:52,  1.88s/it]


answer: 14, prediction: If Sam is 30, then the son was born in 1975.
If Sam is 31, the son was born in 1991.
If Sam is 33, the son was born in 1996.
If Sam is 34, the son was born in
extracted: 34

Acc: 0.0


 46%|████▌     | 23/50 [00:42<00:48,  1.79s/it]


answer: 7, prediction: He sold 8 DVDs on Tuesday. His sales total for the week is 6*8 = <<6*8=32>>32.
So he sold $<<6*8=16>>16 DVDs on Tuesday.
He sold the last 3
extracted: 3

Acc: 0.0


 48%|████▊     | 24/50 [00:44<00:46,  1.80s/it]


answer: 8, prediction: The candle will be shorter by 2 centimeters every hour that it burns.
So the candle will be shorter by 2 centimeters after burning from 1:00 PM to 5:00 PM.
The answer is 2
extracted: 2

Acc: 0.0


 50%|█████     | 25/50 [00:46<00:44,  1.80s/it]


answer: 26, prediction: Last year's best-selling book is $19.50 with a 25% discount from the original price.
The price of the book is $19.50 with a 25% discount from the original price.
extracted: 25

Acc: 0.0


 52%|█████▏    | 26/50 [00:48<00:43,  1.82s/it]


answer: 2, prediction: Marie ordered 4 boxes of pizza. 4*8 = <<4+8=16>>16 boxes of pizza
Then she paid $50 for the chicken meal and 5 packs of milk
She did not pay $50 for the apples, so she
extracted: 50

Acc: 0.0


 54%|█████▍    | 27/50 [00:49<00:39,  1.70s/it]


answer: 243, prediction: Mishka spent $<<16.50+2*22+42=46>>46 dollars on all the clothing items.
She spent $<<16.50+2*22+42=46>>46 dollars buying all the clothing items
extracted: 46

Acc: 0.0


 56%|█████▌    | 28/50 [00:51<00:35,  1.62s/it]


answer: 16, prediction: She spends $4.00 * 15 = <<$2.60***15=$1260***30=$3240***60=$6240***60=$14000***60=50000
She eats 15 servings of ice
extracted: 15

Acc: 0.0


 58%|█████▊    | 29/50 [00:52<00:32,  1.57s/it]


answer: 25, prediction: He traveled 40 miles between his first and second stops.
He traveled 20 miles between his first and second stops.
He traveled 30 miles between his first and second stops.
He traveled 30 miles between his first and second stops.
He traveled
extracted: 30

Acc: 0.0


 60%|██████    | 30/50 [00:54<00:30,  1.53s/it]


answer: 104, prediction: 
She has to choose between the boots and the high heels and she has $18 to spend.
So she has to choose between buying the boots or the heels.
If she buys the boots, she has to pay $18 for them.
extracted: 18

Acc: 0.0


 62%|██████▏   | 31/50 [00:55<00:28,  1.52s/it]


answer: 109, prediction: Allen's age is 10 years from now, but Darrell's age is 7 years from now.
So Darrell's age is 10+7 = <<10+7=15>>15 years from now.
And Allen's age is 10+7
extracted: 7

Acc: 0.0


 64%|██████▍   | 32/50 [00:56<00:24,  1.39s/it]


answer: 80, prediction: 
The first two say 80 and the third says 25%.
The average guess is 80
extracted: 80

Acc: 0.03125


 66%|██████▌   | 33/50 [00:58<00:23,  1.41s/it]


answer: 35, prediction: John takes care of 10 dogs. Each dog takes 0.5 hours a day to walk and take care of their business.
John spends 1.5 hours a week taking care of dogs.
He spends 1.5 * 1.5 =
extracted: 5

Acc: 0.030303030303030304


 68%|██████▊   | 34/50 [00:59<00:22,  1.41s/it]


answer: 70, prediction: Gretchen has 110 / 30 = <<110/30=18>>18 gold coins.
There are 30 more gold coins than silver coins, so Gretchen has 30 * 10 = <<30*10=60>>60 gold coins.

extracted: 60

Acc: 0.029411764705882353


 70%|███████   | 35/50 [01:01<00:21,  1.42s/it]


answer: 23, prediction: Siobhan has 2 fewer jewels than Aaron, so she has 40 + 5 = <<40+5=45>>45 jewels
Aaron has 5 more jewels than half of Raymond's jewels, so Raymond has 40 + 5 = <<40+5
extracted: 5

Acc: 0.02857142857142857


 72%|███████▏  | 36/50 [01:02<00:19,  1.42s/it]


answer: 9, prediction: In the first 20 minutes, he scored 4 x 4 = 8 points.
In the second 20 minutes, he scored 4 x 25% = 10 points.
Here is his total points in the first 20 minutes.
4x4 = 8
extracted: 8

Acc: 0.027777777777777776


 74%|███████▍  | 37/50 [01:03<00:18,  1.42s/it]


answer: 75, prediction: Terry spent $162.00 on yogurt over $30 = $<<162/30=4.5>>4.5 per day.
Yogurt is currently on sale at 4 yogurts for $5.00 so Terry paid $
extracted: 00

Acc: 0.02702702702702703


 76%|███████▌  | 38/50 [01:05<00:17,  1.42s/it]


answer: 2, prediction: John still has 13 sets of lego, so he still has 13 x 2 = 26 lego sets left.
There are 8 video games for $20 each, so he has 8 x 2 = <<8*2=16>>16 video
extracted: 16

Acc: 0.02631578947368421


 78%|███████▊  | 39/50 [01:06<00:14,  1.32s/it]


answer: 10, prediction: John runs a mile in 4 minutes and 24 seconds.
Thus, he runs a mile in 4:24 seconds.
extracted: 24

Acc: 0.02564102564102564


 80%|████████  | 40/50 [01:07<00:13,  1.35s/it]


answer: 18, prediction: Dana can run at a rate of speed 4*2 = <<4*2=8>>8 miles per hour.
She can run at a rate of speed 4*2 + 2 = <<4*2+2=8+2
extracted: 2

Acc: 0.025


 82%|████████▏ | 41/50 [01:09<00:12,  1.39s/it]


answer: 8, prediction: Brandon has an iPhone 4 which is four times older than Suzy's iPhone 3. If the iPhone 4 is two years old, then the iPhone 3 is two times older than Suzy's iPhone 3.
Thus, if Suzy's iPhone 3
extracted: 3

Acc: 0.024390243902439025


 84%|████████▍ | 42/50 [01:10<00:11,  1.42s/it]


answer: 200, prediction: Polly could still hit the dragon with the javelin after throwing it 3 times farther than when she did not hold the gemstone
Polly could still hit the dragon with the javelin after throwing it 3 times farther than when she did
extracted: 3

Acc: 0.023809523809523808


 86%|████████▌ | 43/50 [01:12<00:09,  1.42s/it]


answer: 26, prediction: Grandma Jones baked 5 pies for the fireman's luncheon.
She cut each pie into 8 pieces and set the five pies out on the buffet table for the guests to serve themselves.
At the end of the evening, after the guests
extracted: 8

Acc: 0.023255813953488372


 88%|████████▊ | 44/50 [01:13<00:08,  1.42s/it]


answer: 48, prediction: A 300g bag has 5 servings so you can eat 3.5 grams per serving.
Let M be the number of servings in the bag.
If M > 1800, then you can eat 3.5 grams per serving.
extracted: 5

Acc: 0.022727272727272728


 90%|█████████ | 45/50 [01:15<00:07,  1.43s/it]


answer: 20, prediction: Charlie's cost for supplies is $10*10 = $<<10*10=80>>80
Then the wicks cost $2.00 per pound and he sells each candle for $2.00, so his total cost is $20
extracted: 20

Acc: 0.044444444444444446


 92%|█████████▏| 46/50 [01:16<00:05,  1.42s/it]


answer: 104, prediction: The total number of hours she spent writing articles is 5*2/5 = <<5*2/5=2>>2.
Her average time spent writing articles in the three days is 2.5 hours.
The answer is 2.
extracted: 2

Acc: 0.043478260869565216


 94%|█████████▍| 47/50 [01:17<00:04,  1.42s/it]


answer: 163, prediction: Candice put 80 / 220 = <<80/220=2>>2 post-it notes in her purse.
She purchased 2 / 220 = <<2/220=0.1>>0.1 Post-it notes in her purse.
extracted: 1

Acc: 0.0425531914893617


 96%|█████████▌| 48/50 [01:19<00:02,  1.42s/it]


answer: 800, prediction: John spent $200 on ties that cost $40 each.
He spent $40 * 2 = <<40*2=80>>80 + $200 = $<<80*2=160>>160
He spent 80 + 80 = <<80
extracted: 80

Acc: 0.041666666666666664


 98%|█████████▊| 49/50 [01:20<00:01,  1.44s/it]


answer: 8, prediction: She obtained 4 pieces from each of the 4 pieces of wire.
In this case, she obtained 4 pieces from each piece of wire = 4*4 = 8 pieces.
She obtained 8 pieces from each piece of wire = 8*8 =
extracted: 8

Acc: 0.061224489795918366


100%|██████████| 50/50 [01:22<00:00,  1.65s/it]


answer: 30, prediction: The total number of units in the building is 15 + 3/4 = <<15+3/4=18>>18.
The total number of units is 15 + 3/4 = <<15+3/4=18>>18

extracted: 18

Acc: 0.06



