# Testing on GSM8k Math reasoning

In this notebook, we show how Adala can learn and improve its math reasoning skills using the [GSM8k dataset](https://github.com/openai/grade-school-math).

We use a handful of annotated examples from the train set and verify the answer accuracy on the full test set, which consists of 1319 examples.

# Baseline

Load the GSM8k dataset with its train / test splits:

In [1]:
import pandas as pd
from datasets import load_dataset
gsm8k = load_dataset("gsm8k", "main")

df_train = pd.DataFrame({'question': gsm8k['train']['question'], 'answer': gsm8k['train']['answer']})
df_test = pd.DataFrame({'question': gsm8k['test']['question'], 'answer': gsm8k['test']['answer']})

df_train.head()

| | question | answer |
|---|---|---|
| 0 | Natalia sold clips to 48 of her friends in Apr... | Natalia sold 48/2 = <<48/2=24>>24 clips in May... |
| 1 | Weng earns $12 an hour for babysitting. Yester... | Weng earns 12/60 = $<<12/60=0.2>>0.2 per minut... |
| 2 | Betty is saving money for a new wallet which c... | In the beginning, Betty has only 100 / 2 = $<<... |
| 3 | Julie is reading a 120-page book. Yesterday, s... | Maila read 12 x 2 = <<12*2=24>>24 pages today.... |
| 4 | James writes a 3-page letter to 2 different fr... | He writes each friend 3*2=<<3*2=6>>6 pages a w... |


The following code evaluates answers by extracting the numbers from each text: a prediction counts as correct if any number in it equals the last number in the ground truth answer.

In [2]:
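import re

def extract_and_convert_numbers(text):
    # Match numbers such as 72, 1,250 or 0.2 (raw string avoids invalid escape warnings)
    pattern = r"\d{1,3}(?:,\d{3})*\.?\d*"
    numbers = re.findall(pattern, text)
    # Drop thousands separators and keep only the integer part of each number
    integer_parts = (num.replace(',', '').split('.')[0] for num in numbers)
    return [int(part) for part in integer_parts if part]

def evaluate_answers(ground_truth, prediction):
    pred = extract_and_convert_numbers(prediction)
    gt = extract_and_convert_numbers(ground_truth)
    # GSM8k answers end with the final result, so compare against the last number
    return any(p == gt[-1] for p in pred)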
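
As a quick illustration of how this matching behaves, the sketch below repeats the two helpers and applies them to a few hand-written strings. The sample texts are made up for illustration (the first ground truth imitates the GSM8k answer format, which ends with `#### <number>`); they are not taken from a model run.

```python
import re

def extract_and_convert_numbers(text):
    pattern = r"\d{1,3}(?:,\d{3})*\.?\d*"
    numbers = re.findall(pattern, text)
    return [int(num.replace(',', '').split('.')[0]) for num in numbers if num.replace(',', '').split('.')[0]]

def evaluate_answers(ground_truth, prediction):
    pred = extract_and_convert_numbers(prediction)
    gt = extract_and_convert_numbers(ground_truth)
    return any(p == gt[-1] for p in pred)

# Illustrative ground truth in the GSM8k style: the final answer follows "####"
ground_truth = "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n#### 72"

exact_match = evaluate_answers(ground_truth, "The answer is 72.")
wrong_answer = evaluate_answers(ground_truth, "She sold 70 clips.")
# Thousands separators are stripped before comparing
with_separator = evaluate_answers("#### 1250", "The total is 1,250 dollars.")
# The check is lenient: any number in the prediction may match
lenient_match = evaluate_answers(ground_truth, "It is either 24 or 72.")

print(exact_match, wrong_answer, with_separator, lenient_match)
```

Because any number in the prediction can match, a response that lists several candidate values is counted as correct if one of them equals the final answer, so the reported accuracy may be somewhat optimistic.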

Now we can create and run the baseline agent. We start with a naive baseline that simply expects an answer given the question, following the prompt template below without any additional instructions:
```
Q: {question}
A: {answer}
```
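
To make the template concrete, here is a minimal sketch of the placeholder substitution using plain Python string formatting. The `render_prompt` helper and the toy question are hypothetical and only illustrate the idea; Adala's own prompt rendering may differ.

```python
# Hypothetical helper, not part of Adala: fills the template placeholders
INPUT_TEMPLATE = 'Q: {question}'
OUTPUT_PREFIX = 'A: '

def render_prompt(question):
    # The model is expected to continue the text after "A: " with its answer
    return INPUT_TEMPLATE.format(question=question) + '\n' + OUTPUT_PREFIX

# Toy question used purely for illustration
prompt = render_prompt('What is 2 + 3?')
print(prompt)
```

With no instructions, the question text is the only guidance the model receives, which is what makes this baseline naive.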

In [8]:
from rich import print
from adala.agents import Agent
from adala.skills import LinearSkillSet, TransformSkill
from adala.environments import StaticEnvironment
from adala.runtimes import OpenAIChatRuntime


skills = LinearSkillSet(skills=[
    TransformSkill(
        name='math_solver',
        # we start with no instructions, then show how the agent can learn them
        instructions='',
        # instructions=prompt,
        input_template='Q: {question}',
        # this is the baseline prompt established in the Kojima et al., 2022 paper
        # output_template='A: Let’s think step by step. {rationale}\nFinal numerical answer:{answer}',
        output_template='A: {answer}',
        instructions_first=False
    )
])

agent = Agent(
    skills=skills,
    
    # this is where agent receives the ground truth signal
    environment = StaticEnvironment(
        df=df_train,
        matching_function=evaluate_answers
    ),
    
    teacher_runtimes={'gpt4': OpenAIChatRuntime(model='gpt-4-1106-preview')},
    default_teacher_runtime='gpt4'
    
)
print(agent)

In [9]:
# run the baseline agent on the test dataset
result_baseline = agent.run(df_test.drop(columns='answer'))

100%|█| 1319/1319 [21:58<0


Get baseline evaluation results:

In [10]:
# evaluate baseline results
accuracy = StaticEnvironment(df=df_test, matching_function=evaluate_answers).get_feedback(skills, result_baseline).get_accuracy()

print(f'Baseline accuracy: {accuracy["answer"]}')
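
For intuition about what this accuracy measures, the sketch below computes the same kind of score by hand on a few made-up pairs: the fraction of rows for which the matching function returns `True`. The `accuracy` helper and the sample data are assumptions for illustration only; this is not how Adala computes feedback internally.

```python
import re

def extract_and_convert_numbers(text):
    pattern = r"\d{1,3}(?:,\d{3})*\.?\d*"
    numbers = re.findall(pattern, text)
    return [int(num.replace(',', '').split('.')[0]) for num in numbers if num.replace(',', '').split('.')[0]]

def evaluate_answers(ground_truth, prediction):
    pred = extract_and_convert_numbers(prediction)
    gt = extract_and_convert_numbers(ground_truth)
    return any(p == gt[-1] for p in pred)

def accuracy(ground_truths, predictions):
    # Hypothetical helper: share of predictions that match their ground truth
    matches = [evaluate_answers(gt, pred) for gt, pred in zip(ground_truths, predictions)]
    return sum(matches) / len(matches)

# Made-up examples in the GSM8k answer style
ground_truths = ["24 + 48 = 72\n#### 72", "0.2 x 50 = 10\n#### 10", "#### 5", "#### 1250"]
predictions = ["72", "The answer is 12", "5 apples", "1,250"]

acc = accuracy(ground_truths, predictions)
print(f"Accuracy: {acc}")
```

Here three of the four predictions contain the expected final number, so the score is 0.75.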

# Improve the baseline

The agent can improve its initial skill by learning new instructions from the provided ground truth signal:

In [4]:
agent.learn(batch_size=5, learning_iterations=5)

100%|█| 5/5 [00:03<00:00, 


100%|█| 5/5 [00:18<00:00, 


100%|█| 5/5 [00:12<00:00, 


100%|█| 5/5 [00:21<00:00, 


100%|█| 5/5 [00:15<00:00, 


The total number of examples needed to teach the agent is _batch_size × learning_iterations_, which equals 5 × 5 = 25 in this case. The agent can learn from more examples if you increase these parameters.
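
The arithmetic can be sketched as follows. The `examples_needed` helper is hypothetical and simply applies the formula above; the second setting is an arbitrary example of a larger budget, not a recommendation.

```python
def examples_needed(batch_size, learning_iterations):
    # Following the formula above: one batch of annotated examples per iteration
    return batch_size * learning_iterations

# Setting used in this notebook
total = examples_needed(batch_size=5, learning_iterations=5)
# An arbitrary larger setting, for comparison
larger_total = examples_needed(batch_size=10, learning_iterations=10)

print(total, larger_total)
```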

The newly learned instructions are:

In [5]:
print(agent.skills['math_solver'].instructions)

Now let's see how the agent performs with the new instructions:

In [7]:
result_new = agent.run(df_test.drop(columns='answer'))
accuracy = StaticEnvironment(df=df_test, matching_function=evaluate_answers).get_feedback(skills, result_new).get_accuracy()
print(f'New accuracy: {accuracy["answer"]}')

100%|█| 1319/1319 [1:14:05
