## Introduction

With the growing interest in LLMs, there are many work evaluating and fine tuning LLMs on domain specific datasets. One interesting direction is to assess LLMs' ability to perform various forms of mathmetical inferences. These studies could provide an important metric to understand LLMs' fundamental reasoning and cognitive abilities. Linking the broader context to our course, we are interested in systematically assessing the reasoning ability of different LLMs on some fundamental physical mathmetics problems, and creating a pipeline for fine tuning the LLMs on these domain specific datasets. We think this would be a relevant research direction for the following reasons: 
- The nature of these physical mathmetics problems particularly emphasizes intuitive approximations and abstraction abilities, which is known to be a weak point of many LLMs ([Mishra, Swaroop et al.](https://arxiv.org/abs/2204.05660))
- Through conducting a questionnaire-based study across the semester, we as a whole class got anecdotal evidence suggesting that different LLMs in general perform not very satisfying on doing these problems.

Therefore, these reasons motivate us to develop a structured framework for quantitatively evaluate and define the model performance on the specific problems of interests, and to improve the model performance through fine tuning. Moreover, we provided instructions and fucntions to streamline different steps of the whole process, so that this can be used as a platform solution for many similar problems. In this notebook, we will present the process in the format of a tutorial, using a standard open source math dataset (github link) as illustration. 


## Table of Contents

- ### Customized Dataset Generation
    - instructions on data content preparation
    - data formatting
- ### Setting up API for LLMs (focus on OpenAI models for now)
- ### Evaluation of Customized Dataset
    - exploratory testing and visulization
    - batch testing to produce accuracy metrics
- ### Fine Tuning (with Axolotl)

In [1]:
# package imports
import json
import os
import re
import openai
from dotenv import load_dotenv
import time

### Customized Dataset Generation

#### Part 1: instructions on data content preparation

#### Instructions:
A problem should contain the following fields:
- problem statement
- category label (e.g. AM201)
- difficulty level (optional)
- solution 
#### Format: 
The data content can be prepared either in a .cvs format or .json/.jsonl format. Here is a [Google sheet example](https://docs.google.com/spreadsheets/d/18NZErrCDF55uFSx_AjWUba93NSVDs4yHGcM9ZxiFRf4/edit#gid=0) for the .csv format.

#### Part 2: data formatting

The function combines individual .json files into a .jsonl file (standard formatting used for inference and fine tuning). The cell below uses the algebra problems in Math Dataset as an example. 

In [11]:
def concatenate_json_to_jsonl(folder_path, output_file):
    json_files = [f for f in os.listdir(folder_path) if f.endswith('.json')]

    with open(output_file, 'w') as outfile:
        for file_name in json_files:
            file_path = os.path.join(folder_path, file_name)

            with open(file_path, 'r') as infile:
                data = json.load(infile)
                json.dump(data, outfile)
                outfile.write('\n')

In [None]:
# Usage
folder_path = 'MATH/test/algebra/'  # Replace with the path to your folder
output_file = 'algebra.jsonl'         # Name of the output .jsonl file
concatenate_json_to_jsonl(folder_path, output_file)

### Setting up API for LLMs (focus on OpenAI models for now)

#### OpenAI models

To query the models provided by OpenAI, we first need to set up access to the OpenAI API. For this step, we can follow the [instructions provided by OpenAI](https://platform.openai.com/docs/quickstart?context=python).

In [2]:
load_dotenv()

openai.api_key = os.getenv('OPENAI_API_KEY')
client = openai.Client()
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)


ChatCompletionMessage(content="In the realm where programs dance and play,\nWhere algorithms rule, night and day,\nLies a concept both strange and bold,\nA pattern that programmers behold.\n\nGather, my friends, listen with delight,\nFor I shall unveil recursion's true might.\nImagine a function, a magical spell,\nThat calls itself, as if under a spell.\n\nThrough the abyss of repetition it dives,\nLike echoes in the forest that connive,\nA journey through the looking glass,\nTo unravel mysteries that surpass.\n\nPicture a labyrinth, twisted and strong,\nWhere paths intertwine, right and wrong,\nWith each step taken, a new world unfolds,\nAs recursion's power gently behold.\n\nWithin the labyrinth, a problem resides,\nAwaiting a hero who in code abides.\nTo solve it, we break it down, you see,\nInto smaller versions, with simplicity.\n\nAs the function calls itself once more,\nIt dives deeper, exploring at its core,\nThe problem reduced, with each recursion,\nUntil a base case ends the

### Evaluation of Customized Dataset

#### Part 1: exploratory testing and visulization

In [10]:
def test_math_problems_n(n, model, jsonl_path):
    """
    Given a model and a file path to the customized dataset, output side-by-side the model solution and the target solution to the nth example.
    
    Parameters:
        n (int): n-th line of the .jsonl file.
        model (str): a version of chatgpt model.
        jsonl_path (str): path to the customized dataset.
        
    Returns:
        ai solution (str): model solution.
        target solution (str): target solution.
    """
    if not os.path.isfile(jsonl_path):
            print(f"File {file_path} not found.")
    with open(jsonl_path, 'r') as file:
        for i, line in enumerate(file):
            checkpoint = 0
            if i == n:
                json_data = json.loads(line)
                problem = json_data['problem']
                solution = json_data['solution']
                response = client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "system", "content": "You are a helpful assistant, skilled in solving common algebraic problems."},
                        {"role": "user", "content": f"Solve this math problem and return the answer (a number) in latex box format, e.g. $\boxed$: {problem}"},
                    ],
                    temperature=1,
                    max_tokens=256,
                    top_p=1,
                    frequency_penalty=0,
                    presence_penalty=0
                )
                ai_solution = response.choices[0].message.content
                print(f"Problem: {problem}")
                print("--------------------------------------------------")
                print(f"AI Solution: {ai_solution}")
                print(f"Actual Solution: {solution}")
                print("--------------------------------------------------")
    return ai_solution, solution

ai_solution, target_solution = test_math_problems_n(1, "gpt-3.5-turbo-1106", 'algebra.jsonl')  # Change the number as per your requirement

Problem: Factor $-16x^4+x^2+2x+1$ into two quadratic polynomials with integer coefficients.  Submit your answer in the form $(ax^2+bx+c)(dx^2+ex+f)$, with $a<d$.
--------------------------------------------------
AI Solution: We can factor the given expression by grouping the terms:

oxed$: $(-4x^2+1)(4x^2-1)$
Actual Solution: Note that $-16x^4+x^2+2x+1=(x+1)^2-(4x^2)^2=\boxed{(-4x^2+x+1)(4x^2+x+1)}$, where we have used the difference of squares identity for the second equality.
--------------------------------------------------


In [14]:
# function for extracting boxed value
def extract_number_from_latex(paragraph):
    """
    Extracts a number from a LaTeX paragraph enclosed in the '$\boxed{}$' pattern.
    
    Parameters:
        paragraph (str): The LaTeX paragraph containing the number enclosed in the '$\boxed{}$' pattern.
        
    Returns:
        int or None: The extracted number if found, otherwise None.
    """
    # Regular expression to find the $\boxed{x}$ pattern
    # This assumes 'x' is an integer; modify the regex if 'x' can be a different type of number
    match = re.search(r'\$\\boxed\{(\d+)\}', paragraph)
    if match:
        return int(match.group(1))
    else:
        return None

In [15]:
# function for comparing values
def compared_boxed_val(paragraph1, paragraph2):
    """
    Given 2 paragraphs, return whether they contain the same boxed value. This function is used to check the ai solution against target solution.
    
    Parameters:
        paragraph1 (str)
        paragraph2 (str)
        
    Returns:
        Boolean: 1 if two paragraphs contain the same boxed value and 0 otherwise
    """
    num1 = extract_number_from_latex(paragraph1)
    num2 = extract_number_from_latex(paragraph2)
    return 1*(num1 == num2)

In [16]:
# usage
compared_boxed_val(target_solution, ai_solution)

1

#### Part 2: batch testing to produce accuracy metrics

In [26]:
def test_math_problems_batch(jsonl_path, model, num_lines_to_parse,output_jsonl_path):
    """
    Test a batch of math problems using the given model.

    Parameters:
    - jsonl_path (str): The path to the JSONL file containing the math problems.
    - model (str): The name of the model to use for solving the problems.
    - num_lines_to_parse (int): The number of lines to parse from the JSONL file.
    - output_jsonl_path (str): The path to the output JSONL file to store the results.

    Returns:
    - test_score (int): The total score obtained from solving the math problems.
    - average_score (float): The average score obtained from solving the math problems.
    """
    test_score = 0
    total_count = 0
    with open(jsonl_path, 'r') as file:
        for i, line in enumerate(file):
            # Find which line we are at with output_jsonl_path:
            checkpoint = 0
            with open(output_jsonl_path, 'r') as outfile:
                for j, line in enumerate(outfile):
                    checkpoint += 1
            if i < num_lines_to_parse and i >= checkpoint:
                json_data = json.loads(line)
                problem = json_data['problem']
                solution = json_data['solution']
                response = client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "system", "content": "You are an assistant skilled in solving common algebraic problems. Solve this math problem given by the user and return the answer (a number) in latex box format, e.g. $\boxed$:"},
                        {"role": "user", "content": f" {problem}"},
                    ],
                    temperature=1,
                    max_tokens=256,
                    top_p=1,
                    frequency_penalty=0,
                    presence_penalty=0
                )
                time.sleep(5)
                ai_solution = response.choices[0].message.content
                score = compared_boxed_val(solution, ai_solution)
                test_score += score
                total_count += 1
                result = {
                    "problem": problem,
                    "target_solution": solution,
                    "gpt_solution": ai_solution,
                    "score": score
                }
                with open(output_jsonl_path, 'a') as outfile:
                    json.dump(result, outfile)
                    outfile.write('\n')
            else:
                break
    return test_score, test_score/total_count

In [27]:
test_math_problems_batch('algebra.jsonl', "gpt-3.5-turbo-1106", 5, 'algebra_test_sol.jsonl')

(4, 0.8)