The code in cell 3 modifies the 'output_answer' field in the 'train' split of the "allenai/lila" dataset (specifically the "MATH_algebra_crowdsourced" subset). It replaces each digit in the original answer text with a single random digit and leaves all other characters unchanged. Before doing so, it extracts the correct answer from the unmodified text using the `extract_answer` function (defined in cell 2) and stores it in a new field called 'correct_answer'. Finally, it saves the modified training split to a JSON Lines file named "val_modified_lila_MATH_algebra_crowdsourced.json".

The code in cell 4 is similar to cell 3, but instead of replacing each digit with a single digit, it replaces each digit with a random number of random digits (between 0 and 3, inclusive), so a digit can also be deleted outright. The modified dataset is saved to "length_val_modified_lila_MATH_algebra_crowdsourced.json".

The code in cell 5 scrambles the 'output_answer' field by replacing it with an answer drawn at random from the list of all original answers in the training set. The draws are independent, so an answer can be reused and an example can occasionally receive its own original answer. It also extracts the correct answer from the original text and stores it in the 'correct_answer' field. The scrambled dataset is saved to "scrambled_lila_MATH_algebra_crowdsourced.json".

In [14]:
from datasets import load_dataset

ds = load_dataset("allenai/lila", "MATH_algebra_crowdsourced")

In [15]:
import re

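from typing import Optional

# Extract the contents of the first \boxed{...} in a solution string.
# The non-greedy pattern stops at the first '}', so an answer with nested
# braces (e.g. \boxed{\frac{1}{2}}) is truncated. Returns None if no box is found.
def extract_answer(boxed_answer: str) -> Optional[str]:
    match = re.search(r'\\boxed\{(.*?)\}', boxed_answer)
    if match:
        return match.group(1)
    return None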
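
The regular expression in `extract_answer` ends the match at the first closing brace, which is fine for answers like `\boxed{11}` but cuts off answers that themselves contain braces. A minimal sketch of a brace-counting alternative follows; the name `extract_boxed` is hypothetical and not part of the notebook, and it assumes braces inside the box are balanced.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the first \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    start = text.find(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # braces never balanced


print(extract_boxed(r"we have the value $3 \cdot 4 -1 =\boxed{11}$."))
print(extract_boxed(r"thus $x=\boxed{\frac{1}{2}}$."))
```

On the second input the regex version would return only `\frac{1`, while the brace-counting version returns the whole fraction.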


### Replace numbers with random values

In [16]:
import random
from datasets import load_dataset


def modify_answer(example):
    original_answer = example['output_answer']
    correct_answer = extract_answer(original_answer)
    new_answer = ""
    for char in original_answer:
        if char.isdigit():
            # Replace the current digit with a single random digit
            new_digits = ''.join(random.choices('0123456789', k=1))
            new_answer += new_digits
        else:
            new_answer += char
    example['output_answer'] = new_answer
    example['correct_answer'] = correct_answer
    return example

ds = load_dataset("allenai/lila", "MATH_algebra_crowdsourced")

print("start", ds['train'][0]['output_answer'])

ds['train'] = ds['train'].map(modify_answer)

print(ds['train'])
print("end", ds['train'][0]['output_answer'])
# Save the modified dataset to a new file
ds['train'].to_json("val_modified_lila_MATH_algebra_crowdsourced.json", orient='records', lines=True)

start Since \begin{align*}
(3x-2)(4x+1)-(3x-2)4x+1 &=(3x-2)(4x+1-4x)+1 \\
&=(3x-2) \cdot 1 +1 =3x-1,
\end{align*} when $x=4$ we have the value $3 \cdot 4 -1 =\boxed{11}$.
Dataset({
    features: ['input', 'output_program', 'output_answer', 'split', 'dataset', 'correct_answer'],
    num_rows: 263
})
end Since \begin{align*}
(4x-0)(1x+3)-(8x-4)2x+0 &=(6x-6)(2x+4-8x)+8 \\
&=(1x-0) \cdot 3 +3 =8x-9,
\end{align*} when $x=8$ we have the value $0 \cdot 1 -5 =\boxed{97}$.


Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 149.65ba/s]


175969
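
The cell above uses the global `random` state, so each run produces a different perturbation. A minimal sketch of a reproducible variant on a plain string is shown below; `randomize_digits` is a hypothetical helper, not the notebook's `modify_answer`, and it matches only ASCII digits with `[0-9]`, whereas `str.isdigit()` also accepts some non-ASCII digit characters.

```python
import random
import re


def randomize_digits(text: str, rng: random.Random) -> str:
    """Replace every ASCII digit in `text` with one random digit drawn from `rng`."""
    return re.sub(r"[0-9]", lambda _: rng.choice("0123456789"), text)


sample = r"when $x=4$ we have the value $3 \cdot 4 -1 =\boxed{11}$."
first = randomize_digits(sample, random.Random(0))
second = randomize_digits(sample, random.Random(0))

print(first == second)            # same seed, same perturbation
print(len(first) == len(sample))  # one digit in, one digit out
```

Passing an explicit `random.Random` instance keeps the perturbation independent of any other code that touches the global generator.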

### Replace numbers with random length and values

In [17]:
import random
from datasets import load_dataset

def modify_answer(example):
    original_answer = example['output_answer']
    correct_answer = extract_answer(original_answer)
    new_answer = ""
    for char in original_answer:
        if char.isdigit():
            # Pick how many digits (0 to 3, inclusive) replace the current digit; 0 drops it
            num_digits = random.randint(0, 3)
            # Replace the digit with the new random digits
            new_digits = ''.join(random.choices('0123456789', k=num_digits))
            new_answer += new_digits
        else:
            new_answer += char
    example['output_answer'] = new_answer
    example['correct_answer'] = correct_answer
    return example

ds = load_dataset("allenai/lila", "MATH_algebra_crowdsourced")

print("start", ds['train'][0]['output_answer'])

ds['train'] = ds['train'].map(modify_answer)

print(ds['train'])
print("end", ds['train'][0]['output_answer'])
# Save the modified dataset to a new file
ds['train'].to_json("length_val_modified_lila_MATH_algebra_crowdsourced.json", orient='records', lines=True)

start Since \begin{align*}
(3x-2)(4x+1)-(3x-2)4x+1 &=(3x-2)(4x+1-4x)+1 \\
&=(3x-2) \cdot 1 +1 =3x-1,
\end{align*} when $x=4$ we have the value $3 \cdot 4 -1 =\boxed{11}$.


Map: 100%|██████████| 263/263 [00:00<00:00, 1789.68 examples/s]


Dataset({
    features: ['input', 'output_program', 'output_answer', 'split', 'dataset', 'correct_answer'],
    num_rows: 263
})
end Since \begin{align*}
(1x-5)(554x+)-(6x-698)38x+ &=(91x-1)(x+81-36x)+1 \\
&=(139x-264) \cdot 00 +2 =x-532,
\end{align*} when $x=45$ we have the value $ \cdot 21 -217 =\boxed{54398}$.


Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 175.50ba/s]


179243
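
Because `random.randint(0, 3)` includes 0, some digits vanish entirely, which is why the output above contains fragments such as `(554x+)`. The sketch below makes the length bounds explicit so that deletion can be switched off; `resize_digits` and its `min_len`/`max_len` parameters are hypothetical names introduced for illustration, not part of the notebook.

```python
import random


def resize_digits(text: str, rng: random.Random,
                  min_len: int = 0, max_len: int = 3) -> str:
    """Replace each digit with between min_len and max_len random digits (inclusive)."""
    pieces = []
    for char in text:
        if char.isdigit():
            k = rng.randint(min_len, max_len)  # randint includes both endpoints
            pieces.append("".join(rng.choices("0123456789", k=k)))
        else:
            pieces.append(char)
    return "".join(pieces)


rng = random.Random(0)
sample = r"(3x-2)(4x+1) =\boxed{11}"
print(resize_digits(sample, rng))             # digits may disappear entirely
print(resize_digits(sample, rng, min_len=1))  # every digit leaves at least one digit
```

With `min_len=1` the perturbed text still changes length, but every number in the original keeps at least one digit in the output.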

### Scramble output answers to use random output answers

In [18]:
import random

def scramble_answers(example, all_answers):
    original_answer = example['output_answer']
    correct_answer = extract_answer(original_answer)
    example['output_answer'] = random.choice(all_answers)
    example['correct_answer'] = correct_answer
    return example

ds = load_dataset("allenai/lila", "MATH_algebra_crowdsourced")
all_answers = ds['train']['output_answer']

print("start", all_answers[0])

ds['train'] = ds['train'].map(scramble_answers, fn_kwargs={'all_answers': all_answers})

print("end", ds['train'][0]['output_answer'])
print(ds['train'])
ds['train'].to_json("scrambled_lila_MATH_algebra_crowdsourced.json", orient='records', lines=True)

start Since \begin{align*}
(3x-2)(4x+1)-(3x-2)4x+1 &=(3x-2)(4x+1-4x)+1 \\
&=(3x-2) \cdot 1 +1 =3x-1,
\end{align*} when $x=4$ we have the value $3 \cdot 4 -1 =\boxed{11}$.


Map: 100%|██████████| 263/263 [00:00<00:00, 2873.80 examples/s]


end Our point lies on $5x-9y=42$ with the condition that $x=-y$. Thus, we have the system of equations \begin{align*}
5x-9y &= 42\\
x &= -y.
\end{align*} Substituting $x= -y$ into the first equation gives  \begin{align*}
5(-y) -9y &=42\\
\Rightarrow -14y &= 42\\
\Rightarrow y &=-3.
\end{align*} Thus $x = -y = -(-3) = 3$, so our desired point is $\boxed{(3,-3)}$.
Dataset({
    features: ['input', 'output_program', 'output_answer', 'split', 'dataset', 'correct_answer'],
    num_rows: 263
})


Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 496.78ba/s]


176247
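
Since `random.choice` draws independently for each example, the scrambled set can reuse some answers, omit others, and occasionally hand an example its own answer back. If that matters, one option is a permutation with no fixed positions. The sketch below uses Sattolo's algorithm on a plain list; `cyclic_shuffle` is a hypothetical helper and the sample strings are placeholders, not data from the dataset.

```python
import random
from typing import List


def cyclic_shuffle(items: List[str], rng: random.Random) -> List[str]:
    """Sattolo's algorithm: a random permutation forming a single cycle,
    so no element stays at its original index (for len(items) >= 2)."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


answers = ["solution A", "solution B", "solution C", "solution D"]
scrambled = cyclic_shuffle(answers, random.Random(0))
print(all(a != b for a, b in zip(answers, scrambled)))
```

Every answer is used exactly once under this scheme. Two different examples with identical answer text could still be swapped with each other, so the guarantee is about positions rather than string contents. The permuted list could then be assigned back by index, for instance through `map` with `with_indices=True`.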