# Demo: Math Benchmark Selection
Generally, a good ALM eval task is one that is hard for a vanilla LLM, so that tools have room to help.

Here we demo how to retrieve a set of math eval tasks from the MATH dataset.

In [1]:
import os
import json
import pandas as pd
from gentopia import AgentAssembler
from gentbench.grader import GateGrader, BatchGateGrader
from gentopia.llm import OpenAIGPTClient
from tqdm import tqdm

In [2]:
# Recursive function to load json files from a path and its subdirectories
def load_from_path_recursive(path):
    data = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".json"):
                with open(os.path.join(root, file), 'r') as f:
                    data.append(json.load(f))
    return data


In [3]:
# Read MATH dataset
data = load_from_path_recursive("../benchmark/raw/MATH/")
# Initial filter by difficulty level: keep only the hardest problems
hard_data = []
for item in data:  # use a separate loop variable so `data` is not overwritten
    if item["level"] == "Level 5":
        hard_data.append(item)

In [4]:
len(hard_data)

3628

In [5]:
# Use a vanilla gpt-3.5-turbo agent as the difficulty threshold
dummy_agent = AgentAssembler(file="chatgpt.yaml").get_agent()

# Use gpt-4 to grade predictions against the reference solutions
eval_llm = OpenAIGPTClient(model_name="gpt-4")
grader = GateGrader(llm=eval_llm)
batch_grader = BatchGateGrader(llm=eval_llm)

## Single Eval

In [6]:
hard_data[0]

{'problem': 'Find the largest negative integer $x$ which satisfies the congruence $34x+6\\equiv 2\\pmod {20}$.',
 'level': 'Level 5',
 'type': 'Number Theory',
 'solution': 'We can simplify the congruence as follows (all of the following congruences are equivalent):\n\\begin{align*}\n34x+6&\\equiv 2\\pmod {20}\\\\\n14x+6&\\equiv 2\\pmod {20}\\\\\n14x&\\equiv 16\\pmod {20}\\\\\n7x&\\equiv 8\\pmod {10}\\\\\n21x&\\equiv 8\\cdot 3\\pmod {10}\\\\\nx&\\equiv 24\\pmod{10}\\\\\nx&\\equiv 4\\pmod{10}\\\\\nx&\\equiv \\boxed{-6}\\pmod{10}.\n\\end{align*}'}
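
The reference solution marks its final answer with `\boxed{...}`. As a cheap sanity check alongside the LLM grader, a minimal sketch for pulling that answer out as a string is shown here. `extract_boxed` is a hypothetical helper, not part of gentopia or gentbench, and it is not how `GateGrader` decides a grade.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None.

    Hypothetical helper; braces are matched by hand so nested
    LaTeX such as \\frac{1}{2} inside the box is kept intact.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces: no complete box found
```

For example, `extract_boxed(hard_data[0]["solution"])` would be expected to pick up the boxed value from the solution text above, which can then be compared against the answer extracted from a prediction.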

In [7]:
pred = dummy_agent.run(hard_data[0]["problem"]).output
pred

'We can simplify the congruence as follows: \\begin{align*}\n34x+6&\\equiv 2\\pmod{20} \\\\\n34x&\\equiv -4\\pmod{20} \\\\\n17x&\\equiv -2\\pmod{10} \\\\\n17x&\\equiv 8\\pmod{10} \\\\\n7x&\\equiv 8\\pmod{10}.\n\\end{align*}We can then find the inverse of $7$ modulo $10$ by checking each number $0, 1, 2, \\ldots, 9$ to see if it satisfies $7x\\equiv 1\\pmod{10}$. We find that $3$ is the inverse of $7$ modulo $10$, so $x\\equiv 3\\cdot 8\\equiv 24\\equiv \\boxed{-6}\\pmod{10}$.'

In [8]:
grader.run(hard_data[0]["problem"], hard_data[0]["solution"], pred)

AgentOutput(output='passed', cost=0.014399999999999998, token_usage=479)

## Batch Eval

In [9]:
problems = ["Answer in short: " + item["problem"] for item in hard_data]
solutions = [item["solution"] for item in hard_data]

In [10]:
BS = 5

probs, sols, preds, grades = [], [], [], []
cost, tokens = 0, 0
for i in tqdm(range(0, 10, BS)):
    batch_problems = problems[i:i+BS]
    batch_solutions = solutions[i:i+BS]
    try:
        batch_preds = [dummy_agent.run(prob).output for prob in batch_problems]
        res = batch_grader.run(batch_problems, batch_solutions, batch_preds)
        probs += batch_problems
        sols += batch_solutions
        preds += batch_preds
        grades += res.output.split(",")
        cost += res.cost
        tokens += res.token_usage
    except Exception as e:
        # Skip this batch on API errors (e.g. 502 Bad gateway) and move on
        print(e)
        continue


 11%|█         | 11/100 [12:10<3:09:09, 127.52s/it]

Exception: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} {'Date': 'Thu, 29 Jun 2023 07:19:10 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '7dec5e970b518cdd-EWR', 'alt-svc': 'h3=":443"; ma=86400'}


 49%|████▉     | 49/100 [39:45<35:49, 42.15s/it]   

Exception: The server is overloaded or not ready yet.


 55%|█████▌    | 55/100 [48:51<1:38:27, 131.27s/it]

Exception: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} {'Date': 'Thu, 29 Jun 2023 07:55:50 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '7dec94287e6543b2-EWR', 'alt-svc': 'h3=":443"; ma=86400'}


 60%|██████    | 60/100 [57:07<1:35:37, 143.43s/it]

Exception: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} {'Date': 'Thu, 29 Jun 2023 08:04:07 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '7deca025fea70ce1-EWR', 'alt-svc': 'h3=":443"; ma=86400'}


 89%|████████▉ | 89/100 [1:19:54<08:12, 44.77s/it] 

Exception: The server is overloaded or not ready yet.


 94%|█████████▍| 94/100 [1:27:41<12:17, 122.83s/it]

Exception: Bad gateway. {"error":{"code":502,"message":"Bad gateway.","param":null,"type":"cf_bad_gateway"}} 502 {'error': {'code': 502, 'message': 'Bad gateway.', 'param': None, 'type': 'cf_bad_gateway'}} {'Date': 'Thu, 29 Jun 2023 08:34:40 GMT', 'Content-Type': 'application/json', 'Content-Length': '84', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'same-origin', 'Cache-Control': 'private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'Expires': 'Thu, 01 Jan 1970 00:00:01 GMT', 'Server': 'cloudflare', 'CF-RAY': '7deccd4769e8c3f5-EWR', 'alt-svc': 'h3=":443"; ma=86400'}


100%|██████████| 100/100 [1:32:17<00:00, 55.38s/it]
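
The log above shows batches being dropped on API errors (502 Bad gateway, server overloaded). A minimal sketch, assuming these errors are transient and worth retrying, wraps a call in exponential backoff before giving up. `retry_with_backoff` and its parameters are hypothetical, not part of gentopia or gentbench.

```python
import random
import time


def retry_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on any exception wait and retry, up to max_retries retries.

    Hypothetical helper. `sleep` is injectable so the wait can be
    stubbed out in tests.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Exponential backoff with a little jitter: about 1s, 2s, 4s, ...
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Inside the loop this could be used as `res = retry_with_backoff(lambda: batch_grader.run(batch_problems, batch_solutions, batch_preds))`, keeping the existing `try`/`except` as a last resort so that one persistent failure does not stop the whole run.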


In [12]:
preds

(282, 282, 282, 282)

In [16]:
problems[0]

['failed',
 'failed',
 'failed',
 'failed',
 'failed',
 'failed',
 'failed',
 'failed',
 'passed',
 'failed']

In [15]:
solutions[0]

new tasks to be saved into benchmark: 253


In [18]:
preds[0]

In [19]:
grades

Eval spent you $6.4428600000000005 and 213446 tokens.
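
Following the selection criterion stated at the top, only the tasks that the vanilla agent fails are worth keeping. A minimal sketch of that filter is shown here, assuming `probs`, `sols`, `preds`, and `grades` are aligned lists and that each grade is the string `passed` or `failed`, possibly with stray whitespace left over from `split(",")`. `select_failed` is a hypothetical helper name.

```python
def select_failed(probs, sols, preds, grades):
    """Keep only the tasks the baseline agent failed (hypothetical helper)."""
    if not (len(probs) == len(sols) == len(preds) == len(grades)):
        # A grader reply with too few or too many grades would misalign rows.
        raise ValueError("inputs must be aligned lists of equal length")
    return [
        {"problem": p, "solution": s, "pred": pr}
        for p, s, pr, g in zip(probs, sols, preds, grades)
        if g.strip().lower() == "failed"
    ]


# Toy data standing in for the real evaluation results.
demo = select_failed(
    ["p1", "p2", "p3"],
    ["s1", "s2", "s3"],
    ["a1", "a2", "a3"],
    ["failed", " passed", "failed "],
)
print(len(demo))  # 2
```

The length check matters because the batch grader returns one comma-separated string per batch; if a reply does not contain exactly one grade per problem, every later row would be paired with the wrong grade.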


In [20]:
df = pd.DataFrame({"problem": problems, "solution": solutions})

{'problem': 'Think step by step and answer in short: For $1 \\le n \\le 100$, how many integers are there such that $\\frac{n}{n+1}$ is a repeating decimal?',
 'solution': 'Note that $n+1$ and $n$ will never share any common factors except for $1$, because they are consecutive integers. Therefore, $n/(n+1)$ is already simplified, for all positive integers $n$.\n\nSince $1 \\le n \\le 100$, it follows that $2 \\le n+1 \\le 101$. Recall that a simplified fraction has a repeating decimal representation if and only if its denominator is divisible by a prime other than 2 and 5. The numbers between 2 and 101 which are divisible only by 2 and 5 comprise the set $\\{2, 4, 5, 8, \\allowbreak 10, 16, 20, 25, \\allowbreak 32, 40, 50, 64, \\allowbreak 80, 100\\}$. Therefore, there are $14$ terminating decimals and $100 - 14 = \\boxed{86}$ repeating decimals.',
 'pred': 'To determine how many integers satisfy the given condition, we need to find the number of integers $n$ such that $\\frac{n}{n+1}$

In [24]:
df = df[:3]

In [25]:
df

Unnamed: 0,problem,solution
0,Find the largest negative integer $x$ which sa...,We can simplify the congruence as follows (all...
1,"For $1 \le n \le 100$, how many integers are t...",Note that $n+1$ and $n$ will never share any c...
2,Let $m$ and $n$ be positive integers satisfyin...,Taking inspiration from $4^4 \mid 10^{10}$ we ...


In [38]:
new_rows = pd.DataFrame({"problem": ["What is 1+1?", "xxx"], "solution": ["2", "3"]})
# concat df with new_rows
df = pd.concat([df, new_rows])

In [39]:
df

Unnamed: 0,problem,solution
0,Find the largest negative integer $x$ which sa...,We can simplify the congruence as follows (all...
1,"For $1 \le n \le 100$, how many integers are t...",Note that $n+1$ and $n$ will never share any c...
2,Let $m$ and $n$ be positive integers satisfyin...,Taking inspiration from $4^4 \mid 10^{10}$ we ...
3,What is 1+1?,2
4,What is 1+1?,2
5,"[What is 1+1?, xxx]","[2, 3]"
6,"[What is 1+1?, xxx]","[2, 3]"
0,What is 1+1?,2
1,xxx,3


In [14]:
batch_problems[0]

'The director of a marching band wishes to place the members into a formation that includes all of them and has no unfilled positions. If they are arranged in a square formation, there are 5 members left over. The director realizes that if he arranges the group in a formation with 7 more rows than columns, there are no members left over. Find the maximum number of members this band can have.\n'

In [16]:
batch_solutions[1]

'If $n > 14$ then $n^2 + 6n + 14 < n^2 + 7n < n^2 + 8n + 21$ and so $(n + 3)^2 + 5 < n(n + 7) < (n + 4)^2 + 5$. If $n$ is an integer there are no numbers which are 5 more than a perfect square strictly between $(n + 3)^2 + 5$ and $(n + 4)^2 + 5$. Thus, if the number of columns is $n$, the number of students is $n(n + 7)$ which must be 5 more than a perfect square, so $n \\leq 14$. In fact, when $n = 14$ we have $n(n + 7) = 14\\cdot 21 = 294 = 17^2 + 5$, so this number works and no larger number can. Thus, the answer is $\\boxed{294}$.'

In [15]:
batch_preds[1]

'Let the number of members in the marching band be $m$. If they are arranged in a square formation, then the number of rows and columns must be equal. Let this common value be $n$. We are given that $m \\equiv 5 \\pmod{n}$.\n\nIf they are arranged in a formation with 7 more rows than columns, then the number of rows must be $n+7$ and the number of columns must be $n$. We are given that $m \\equiv 0 \\pmod{n+7}$.\n\nSince $m$ is the same in both cases, we have $m \\equiv 5 \\equiv 0 \\pmod{n}$ and $m \\equiv 0 \\pmod{n+7}$. This means that $n$ divides both 5 and 0, so $n$ must be a divisor of 5. The only divisors of 5 are 1 and 5, so $n$ must be either 1 or 5.\n\nIf $n=1$, then $m \\equiv 5 \\pmod{1}$, which means $m \\equiv 0 \\pmod{1}$. This is true for all positive integers $m$, so $n=1$ is a valid possibility.\n\nIf $n=5$, then $m \\equiv 5 \\pmod{5}$, which means $m \\equiv 0 \\pmod{5}$. This is also true for all positive integers $m$, so $n=5$ is a valid possibility.\n\nTherefore,