# Dataset Generation, Fine-Tuning, and Evaluation

Example Notebook outlining the steps to load dataset, fine-tune a model on OpenAI, and evaluate it. 

In [3]:
# Import and Install Packages
!pip install python-dotenv
!pip install openai
import json
import os
import re
import openai
from dotenv import load_dotenv
import time
import sys

In [None]:

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !git clone https://github.com/BaeHenryS/LLM-201.git
    %cd LLM-201

## Customized Dataset Generation


**Option 1:**

To keep track of problems easily, one would ideally use a spreadsheet to organize the problems. The spreadsheet can be exported to a .csv file. Here is a [Google sheet example](https://docs.google.com/spreadsheets/d/18NZErrCDF55uFSx_AjWUba93NSVDs4yHGcM9ZxiFRf4/edit#gid=0) for the .csv format.

Each problem (row) should contain the following information (columns):

| Section            | Description           |
| ------------------ | --------------------- |
| problem            | The statement of the math problem |
| type               | The category of the problem (e.g. AM201) |
| level              | The difficulty level of the problem in integer (optional) |
| solution           | The solution to the problem, the final solution should be withinin \$\\\boxed{}\$ |


<br>
<br>

**Option 2:**

One could alternatively follow the .json/.jsonl format. with the following format:

```json
{
    "problem": "The statement of the math problem",
    "type": "The category of the problem (e.g. AM201)",
    "level": "The difficulty level of the problem in integer (optional)",
    "solution": "The solution to the problem, the final solution should be withinin \$\\\boxed{}\$"
}
```

## Example: MATH Dataset

For the MATH Dataset, note that one needs to combine the individual .json files into a .jsonl file (standard formatting used for inference and fine tuning). The cell below uses the algebra problems in Math Dataset as an example. 

In [18]:
def concatenate_json_to_jsonl(folder_path, output_file):
    """
    Concatenates multiple JSON files into a single JSONL file.

    Args:
        folder_path (str): The path to the folder containing the JSON files.
        output_file (str): The path to the output JSONL file.

    Returns:
        None
    """
    json_files = [f for f in os.listdir(folder_path) if f.endswith('.json')]

    with open(output_file, 'w') as outfile:
        for file_name in json_files:
            file_path = os.path.join(folder_path, file_name)

            with open(file_path, 'r') as infile:
                data = json.load(infile)
                json.dump(data, outfile)
                outfile.write('\n')

We now obtain the MATH Dataset. You can change the subject. 

In [19]:
subject = 'algebra'  # Replace with the subject you want

The directory depends on the whether we are working at Google Colab. 

In [24]:
if IN_COLAB:
    print("Converting ... ")
    train_folder_path = f'./MATH_Dataset/MATH/train/{subject}/' 
    test_output_file = f'./examples/data/{subject}_train.jsonl'  
    concatenate_json_to_jsonl(train_folder_path, test_output_file)

    test_folder_path = f'./MATH_Dataset/MATH/test/{subject}/'  
    test_output_file = f'./examples/data/{subject}_test.jsonl'  
    concatenate_json_to_jsonl(test_folder_path, test_output_file)
else:
    print("Local Repository Detected... ")
    train_folder_path = f'../MATH_Dataset/MATH/train/{subject}/' 
    test_output_file = f'./data/{subject}_train.jsonl'  
    concatenate_json_to_jsonl(train_folder_path, test_output_file)

    test_folder_path = f'../MATH_Dataset/MATH/test/{subject}/'  
    test_output_file = f'./data/{subject}_test.jsonl'  
    concatenate_json_to_jsonl(test_folder_path, test_output_file)

Converting ... 
Local Repository Detected... 


### Setting up API for LLMs (focus on OpenAI models for now)

#### OpenAI models

To query the models provided by OpenAI, we first need to set up access to the OpenAI API. For this step, we can follow the [instructions provided by OpenAI](https://platform.openai.com/docs/quickstart?context=python). Test that the API is working by running the following cell:

In [25]:
# Write your API Key Here
api_key = "YOUR_API_KEY"

# You can also set your API key via the OPENAI_API_KEY environment variable
# load_dotenv()
# openai.api_key = os.getenv('OPENAI_API_KEY')


# Test the API
client = openai.Client(api_key = api_key)
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)


ChatCompletionMessage(content='In the realm where code does dance and spin,\nResides a mystical power, deep within.\nIt\'s called the art of recursion, dear friend,\nA concept that stretches, without an end.\n\nPicture a waltz, a dance of great grace,\nWhere steps repeat, in an elegant chase.\nRecursion, too, is a dance to behold,\nAs functions call themselves, in a story untold.\n\nFrom a start so humble, a function is born,\nDelving into mysteries, yet to be drawn.\nIt takes a problem, too complex to unwind,\nAnd breaks it into parts, as if divinely assigned.\n\nWith a call to itself, it tumbles through space,\nUnraveling layers, at a fervent pace.\nLike a tapestry woven, stitch by stitch,\nRecursion unveils the answer, without a glitch.\n\nImagine a forest, vast and profound,\nEach tree branching out, with wonders abound.\nRecursion, too, traverses this domain,\nExploring depths, a mesmerizing refrain.\n\nA function calls itself, with confidence and might,\nPeeling away layers, like

### Evaluation of Customized Dataset

The function below extracts the solutions from the generated text.

In [None]:
# Change Directory Based on Local or Colab

if IN_COLAB:
    %cd examples

In [27]:
# function for extracting boxed value
def extract_number_from_latex(paragraph):
    """
    Extracts a number from a LaTeX paragraph enclosed in the '$\boxed{}$' pattern.
    
    Parameters:
        paragraph (str): The LaTeX paragraph containing the number enclosed in the '$\boxed{}$' pattern.
        
    Returns:
        int or None: The extracted number if found, otherwise None.
    """
    # Regular expression to find the $\boxed{x}$ pattern
    # This assumes 'x' is an integer; modify the regex if 'x' can be a different type of number
    match = re.search(r'\$\\boxed\{(\d+)\}', paragraph)
    if match:
        return int(match.group(1))
    else:
        return None

Calls OpenAI API to generate solutions for the problems in the dataset.

In [28]:
def math_openai_call(model, jsonl_line):
    """
    Given a model and a line from the customized dataset, output side-by-side the model solution and the target solution to the nth example.
    
    Parameters:
        n (int): n-th line of the .jsonl file.
        model (str): a version of chatgpt model.
        jsonl_line (str): a line from the customized dataset.
        
    Returns:
        json_data (dict): The original data from the jsonl_line with the added AI solution.
    """
    json_data = json.loads(jsonl_line)
    problem = json_data['problem']
    solution = json_data['solution']
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant, skilled in solving mathematics problems. Solve this math problem by the user and return the answer (a number) in latex box format, e.g. $\boxed$:"},
            {"role": "user", "content": f"{problem}"},
        ],
        temperature=1,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    ai_solution = response.choices[0].message.content
    json_data['ai_solution'] = ai_solution
    return json_data

Compare the generated solutions with the ground truth solutions.

In [29]:
# function for comparing values
def compared_boxed_val(paragraph1, paragraph2):
    """
    Given 2 paragraphs, return whether they contain the same boxed value. This function is used to check the ai solution against target solution.
    
    Parameters:
        paragraph1 (str)
        paragraph2 (str)
        
    Returns:
        Boolean: True if two paragraphs contain the same boxed value and False otherwise
    """
    num1 = extract_number_from_latex(paragraph1)
    num2 = extract_number_from_latex(paragraph2)
    return num1 == num2

Function to test a batch of problems.

In [36]:
import os
import json

def eval_math_problems(input_jsonl_path, output_jsonl_path, model, num_lines_to_parse):
    """
    Test a batch of math problems using the given model.

    Args:
        input_jsonl_path (str): The path to the input JSONL file containing math problems.
        output_jsonl_path (str): The path to the output JSONL file to store the results.
        model: The model used to solve the math problems.
        num_lines_to_parse (int): The number of lines to parse from the input file.

    Returns:
        tuple: A tuple containing the test score (number of correct answers) and the total count of problems tested.
    """
    test_score = 0
    total_count = 0

    # Check if output file exists, if not, create it
    if not os.path.isfile(output_jsonl_path):
        open(output_jsonl_path, 'w').close()

    with open(input_jsonl_path, 'r') as file:
        for i, line in enumerate(file):
            # Find which line we are at with output_jsonl_path:
            checkpoint = 0
            with open(output_jsonl_path, 'r') as outfile:
                for j, _ in enumerate(outfile):  # Changed variable name here
                    checkpoint += 1
            if i < num_lines_to_parse and i >= checkpoint:
                json_data = math_openai_call(model, line)
                ai_solution = json_data['ai_solution']
                solution = json_data['solution']
                correct = compared_boxed_val(solution, ai_solution)
                json_data['correct'] = correct
                if correct:
                    test_score += 1
                total_count += 1 
                with open(output_jsonl_path, 'a') as outfile:
                    json.dump(json_data, outfile)
                    outfile.write('\n')
                print(f"Line {i+1} completed.")
            else:
                if total_count != 0:
                    print(f"Test score: {test_score/total_count}")
                break
    return test_score, total_count

In [None]:
# Example Usage
eval_math_problems('./data/algebra_train.jsonl', 'algebra_train_result.jsonl', "gpt-3.5-turbo-1106", 100)

## Fine-Tuning with Generated Dataaset

Here, we use the OpenAI API to fine tune GPT-3.5 with the generated dataset. To do so, we first convert the dataset into a .jsonl file format accepted by the OpenAI API. The function below converts the dataset into the system, user, and assistant format.

In [10]:
def convert_format(jsonl_line):
    """
    Given a line from the customized dataset, convert it to the OpenAI Chat format.
    
    Parameters:
        jsonl_line (str): a line from the customized dataset.
        
    Returns:
        new_format (str): The converted line in the OpenAI Chat format.
    """
    json_data = json.loads(jsonl_line)
    problem = json_data['problem']
    solution = json_data['solution']
    
    new_format = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant, skilled in solving mathematics problems. Solve this math problem by the user and return the answer (a number) in latex box format, e.g. $\\boxed$:"},
            {"role": "user", "content": problem},
            {"role": "assistant", "content": solution}
        ]
    }
    
    return json.dumps(new_format)

def convert_file(input_filename, output_filename):
    with open(input_filename, 'r') as input_file, open(output_filename, 'w') as output_file:
        for line in input_file:
            converted_line = convert_format(line)
            output_file.write(converted_line + '\n')



Now we convert the dataset into the system, user, and assistant format.

In [11]:
convert_file('./data/algebra_train.jsonl', './data/algebra_train_openai.jsonl')
convert_file('./data/algebra_test.jsonl', './data/algebra_test_openai.jsonl')

Upload the train and test dataset to OpenAI.

In [13]:
train_create = client.files.create(
  file=open("./data/algebra_train_openai.jsonl", "rb"),
  purpose="fine-tune"
)

val_create = client.files.create(
  file=open("./data/algebra_test_openai.jsonl", "rb"),
  purpose="fine-tune"
)

Get ID of training and validation datasets.

In [14]:
training_file = train_create.id
val_file = val_create.id
training_file, val_file

('file-MnyaGU86dQGlERAequXA3yHl', 'file-R21MMQOlvPz1LXWqcRMX0OaO')

Create the fine-tuning job. Now, go to the [OpenAI fine-tuning page](https://platform.openai.com/finetune/), and you should see the fine-tuning job running. After the fine tuning has been completed, you can copy the name of the fine tuned model, which will be used in the `model` parameter in the evaluation function.

In [16]:
ft_create = client.fine_tuning.jobs.create(
  training_file=training_file, 
  model="gpt-3.5-turbo-1106",
  epochs=3,
  # validation_file=val_file,
)
ft_create

FineTuningJob(id='ftjob-94ECAxMcgCiNTl16Ao1k6un4', created_at=1702848382, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-1106', object='fine_tuning.job', organization_id='org-Lepq4jszFhmPPFz2pwTV628o', result_files=[], status='validating_files', trained_tokens=None, training_file='file-MnyaGU86dQGlERAequXA3yHl', validation_file=None)

In [17]:
ft_job = ft_create.id
ft_job

'ftjob-94ECAxMcgCiNTl16Ao1k6un4'

## Compare Performance of Fine-Tuned Model with Original Model

In [38]:
import os
# Create a folder called "results"
os.makedirs('results', exist_ok=True)

def calculate_accuracy(result):
    a, b = result
    return a / b if b != 0 else 0


# Original Model Result
train_original_result = eval_math_problems('./data/algebra_train.jsonl', './results/algebra_train_result.jsonl', "gpt-3.5-turbo-1106", 1)
test_original_result = eval_math_problems('./data/algebra_test.jsonl', './results/algebra_test_result.jsonl', "gpt-3.5-turbo-1106", 1)

print(f"Original Train Result: {calculate_accuracy(train_original_result)}")
print(f"Original Test Result: {calculate_accuracy(test_original_result)}")

# Fine-tuned Model Result
train_ft_result = eval_math_problems('./data/algebra_train.jsonl', './results/algebra_train_ft_result.jsonl', ft_job, 1)
test_ft_result = eval_math_problems('./data/algebra_test.jsonl', './results/algebra_test_ft_result.jsonl', ft_job, 1)

print(f"Fine-tuned Train Result: {calculate_accuracy(train_ft_result)}")
print(f"Fine-tuned Test Result: {calculate_accuracy(test_ft_result)}")

Line 1 completed.
Test score: 1.0
Original Train Result: (1, 1)
Original Test Result: (0, 0)
