# Fine-tuned GPT 3.5 for Boosting the Performance of Large Language Models for Question Answering with Knowledge Graph Integration
by Mingze Li

In this notebook, we first process the fine-tuning dataset, then fine-tune the model, and finally run the QA tasks on a given dataset that contains questions and contexts.

In [1]:
import tiktoken # for token counting
import numpy as np
import json
from collections import defaultdict

## Data analysis for chat model fine-tuning
We have already generated the JSONL file for fine-tuning, and we must make sure the dataset is in the format the model expects.
The data analysis code is adapted from "Data preparation and analysis for chat model fine-tuning": https://cookbook.openai.com/examples/chat_finetuning_data_prep
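
As a rough illustration of the format being checked, the sketch below builds one chat-format record and serializes it as a single JSONL line. The helper name `make_example` is hypothetical, and placing the context in the system message is an assumption based on the first example printed below; the toy strings are taken from the system prompt used later in this notebook.

```python
import json


def make_example(context, question, answer):
    """Build one chat-format training record (hypothetical helper)."""
    return {
        "messages": [
            {"role": "system", "content": context},   # assumption: context goes in the system message
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


record = make_example(
    context="The sun is a star in the center of our solar system.",
    question="What is the sun?",
    answer="A star at the center of the solar system.",
)

# One JSON object per line; ensure_ascii=False keeps characters such as "ö" readable
line = json.dumps(record, ensure_ascii=False)

# Round-trip to confirm the line parses back into the same structure
parsed = json.loads(line)
roles = [m["role"] for m in parsed["messages"]]
print(roles)
```

Each line of the JSONL file would hold one such object, which is what the loading cell below relies on when it calls `json.loads` on every line.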

In [2]:
data_path = r"C:\Users\Li\Desktop\fine_tuning_database.jsonl" 

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 27
First example:
{'role': 'system', 'content': 'Prof. Stefan Diebels has expertise in Computational materials science; PD Dr.  Franz  Roters has expertise in Computational materials science; EQ2PC has discipline Computational materials science; Prof. Dr. Karsten  Durst has expertise in Computational Materials Science; Prof. Dr.-Ing. Stephan Wulfinghoff has expertise in Computational Materials Science; Christian Dorn has expertise in Computational Materials Science; Dr.-Ing Abril Azocar Guzman has expertise in Computational Materials Science; Prof. Dr.  Jörg Neugebauer has expertise in Computational Materials Science; pyscal_rdf has discipline Computational Materials Science; Elastic Constant Demo has discipline Computational Materials Science; Computational Material Sample Ontology has discipline Computational Materials Science; Elastic Constant Demo Data has discipline Computational Materials Science; TURBOMOLE has discipline Computational Material Science; Open Materia

In [3]:
# Format error checks
format_errors = defaultdict(int)

# Add a list to record the index of the wrong example
missing_assistant_examples = []

for i, ex in enumerate(dataset):
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
            print("error in index:", i)
            print("message:", message)

        if any(k not in ("role", "content", "name", "function_call") for k in message):
            format_errors["message_unrecognized_key"] += 1
            print("error in index:", i)
            print("message:", message)

        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            print("error in index:", i)
            print("message:", message)

        content = message.get("content", None)
        function_call = message.get("function_call", None)

        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
            print("error in index:", i)
            print("messages:", messages)

    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1
        missing_assistant_examples.append(i)  # record the index of the example with the error
        print("error in index:", i)
        

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
    if missing_assistant_examples:
        print("Missing assistant messages in examples:", missing_assistant_examples)
else:
    print("No errors found")

No errors found


In [4]:
encoding = tiktoken.get_encoding("cl100k_base")

# not exact!
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
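    # NOTE: the label says "p5 / p95" (as in the cookbook), but the quantiles
    # computed here are 0.1 and 0.9, i.e. the 10th and 90th percentiles.
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")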
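
The line labelled "p5 / p95" in `print_distribution` calls `np.quantile` with 0.1 and 0.9, so the reported values are the 10th and 90th percentiles. The sketch below shows how much the two choices differ on a toy list. `linear_quantile` is a hypothetical pure-Python helper written for this illustration; it is meant to mirror linear interpolation between order statistics, which is assumed here to be what NumPy does by default.

```python
def linear_quantile(values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q          # fractional index into the sorted list
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


toy_values = list(range(1, 101))  # 1, 2, ..., 100

p5, p95 = linear_quantile(toy_values, 0.05), linear_quantile(toy_values, 0.95)
p10, p90 = linear_quantile(toy_values, 0.10), linear_quantile(toy_values, 0.90)

print(f"p5 / p95:  {p5:.2f}, {p95:.2f}")
print(f"p10 / p90: {p10:.2f}, {p90:.2f}")
```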

In [5]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 4096 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 128, 3310
mean / median: 620.9629629629629, 437.0
p5 / p95: 177.2, 1077.400000000001

#### Distribution of num_assistant_tokens_per_example:
min / max: 4, 453
mean / median: 53.96296296296296, 31.0
p5 / p95: 5.6, 101.00000000000003

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning


In [6]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~16766 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~50298 tokens
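
To make the epoch estimate above easier to follow, here is the same rule wrapped in a function and applied to a few dataset sizes. The function name `default_n_epochs` is hypothetical; the constants are the ones defined in the cell above, and the rule itself is the cookbook's estimate, not a guarantee of what the fine-tuning service will choose.

```python
def default_n_epochs(
    n_train_examples,
    target_epochs=3,
    min_target_examples=100,
    max_target_examples=25000,
    min_default_epochs=1,
    max_default_epochs=25,
):
    """Estimate the default number of epochs for a dataset of the given size."""
    n_epochs = target_epochs
    if n_train_examples * target_epochs < min_target_examples:
        # Small dataset: raise the epoch count so about min_target_examples are seen
        n_epochs = min(max_default_epochs, min_target_examples // n_train_examples)
    elif n_train_examples * target_epochs > max_target_examples:
        # Large dataset: lower the epoch count so at most max_target_examples are seen
        n_epochs = max(min_default_epochs, max_target_examples // n_train_examples)
    return n_epochs


for n in (2, 27, 1000, 10000):
    print(n, "examples ->", default_n_epochs(n), "epochs")

# Billed tokens for this notebook's dataset: 27 examples, ~16766 tokens per epoch
billed_tokens = default_n_epochs(27) * 16766
print("estimated billed tokens:", billed_tokens)
```

With 27 examples, 27 * 3 = 81 is below 100, so the first branch applies and 100 // 27 = 3 epochs results, which matches the output above.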


## Fine tuning

In [8]:
file_path = data_path

### Upload the fine-tuning dataset
In this step we need to collect the key variables for the fine-tuned model:

    1. file_object_id
    2. fine_tuning_job_id
    3. fine_tuned_model_name

Note: To fine-tune on a new dataset, enable the cells below (remove the surrounding triple quotes) and reset these key variables.

In [23]:
# open if you want to upload a new dataset
"""
#
from openai import OpenAI
client = OpenAI()
# use English version: converted_messages_en.jsonl
file_object  = client.files.create(
  file=open(file_path, "rb"),# could be: #messages.jsonl,#converted_messages.jsonl,#test_messages.jsonl
  purpose="fine-tune"
)
print("file_object.id:",file_object.id)
file_object_id = file_object.id
file_object 


# open when you want to upload new data for fine-tuning"""


In [None]:
file_object_id = 'file-RFWxvsMTKaJuqT9JP5wxusip'## change it if you want to upload a new dataset


### Fine-tuning model
fine_tuning_job
fine_tuning_job.id="ftjob-YtoxfhaeCv5EDCjMt1PGbvHZ" 
fine_tuned_model_name = "gpt-3.5-turbo-0613"

In [10]:
# open if you want to upload a new dataset
"""# open when you want to upload new data for fine-tuning
from openai import OpenAI
client = OpenAI()

fine_tuning_job = client.fine_tuning.jobs.create(
  training_file = file_object_id, 
  model="gpt-3.5-turbo"
)
print(fine_tuning_job.id)
fine_tuning_job

#FineTuningJob(id='ftjob-2Lpyr3aaKh1qmUa2PiqOr9ma', created_at=1704830407, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-1RBrqOHK4MGbSBFmx0Tqvb1b', result_files=[], status='validating_files', trained_tokens=None, training_file='file-flA9y8B28JIGeZSq1nM8fPh9', validation_file=None)
"""

ftjob-zOPXqFLDdV9OfQODCOcBOYAH


FineTuningJob(id='ftjob-zOPXqFLDdV9OfQODCOcBOYAH', created_at=1709330891, error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-1RBrqOHK4MGbSBFmx0Tqvb1b', result_files=[], status='validating_files', trained_tokens=None, training_file='file-RFWxvsMTKaJuqT9JP5wxusip', validation_file=None, user_provided_suffix=None)

In [12]:
fine_tuning_job_id = 'ftjob-zOPXqFLDdV9OfQODCOcBOYAH' # change it if you want to upload a new dataset

In [13]:
# open if you want to upload a new dataset
"""from openai import OpenAI
import time

client = OpenAI()

# Loop to check the status of the fine-tuning job
while True:
    fine_tuning_job = client.fine_tuning.jobs.retrieve(fine_tuning_job_id)
    if fine_tuning_job.status == 'succeeded':
        # The fine-tuning job is completed and the name of the fine-tuned model is obtained.
        fine_tuned_model_name = fine_tuning_job.fine_tuned_model
        print("finetunned model name:", fine_tuned_model_name)
        break
    elif fine_tuning_job.status == 'failed':
        print("Fine-tuning job failed.")
        break
    print("Wait for the fine-tuning job to complete...")
    time.sleep(60)
"""

Wait for the fine-tuning job to complete...
Wait for the fine-tuning job to complete...
Wait for the fine-tuning job to complete...
Wait for the fine-tuning job to complete...
Wait for the fine-tuning job to complete...
finetunned model name: ft:gpt-3.5-turbo-0125:personal::8y5QLwC3


In [14]:
fine_tuned_model_name = 'ft:gpt-3.5-turbo-0125:personal::8y5QLwC3'# change it if you want to upload a new dataset
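
The polling loop above only stops on `succeeded` or `failed`. A slightly more defensive variant is sketched below, under the assumption that a job can also end as `cancelled` and that a cap on the number of polls is wanted. The function name `wait_for_fine_tuned_model`, the `max_polls` parameter, and the fake client are all hypothetical; the fake client exists only so the sketch runs without calling the API. With a real client, pass `OpenAI()` and the actual job ID instead.

```python
import time
from types import SimpleNamespace

TERMINAL_FAILURES = {"failed", "cancelled"}  # assumption: statuses that will never succeed


def wait_for_fine_tuned_model(client, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll a fine-tuning job until it succeeds; raise if it fails or takes too long."""
    for _ in range(max_polls):
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in TERMINAL_FAILURES:
            raise RuntimeError(f"Fine-tuning job {job_id} ended with status {job.status!r}")
        print("Wait for the fine-tuning job to complete...")
        sleep(poll_seconds)
    raise TimeoutError(f"Fine-tuning job {job_id} did not finish after {max_polls} polls")


class _FakeJobs:
    """Stand-in for client.fine_tuning.jobs that replays a fixed list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def retrieve(self, job_id):
        status = next(self._statuses)
        model = "ft:example-model" if status == "succeeded" else None
        return SimpleNamespace(status=status, fine_tuned_model=model)


def make_fake_client(statuses):
    return SimpleNamespace(fine_tuning=SimpleNamespace(jobs=_FakeJobs(statuses)))


fake_client = make_fake_client(["validating_files", "running", "succeeded"])
model_name = wait_for_fine_tuned_model(fake_client, "ftjob-example", sleep=lambda s: None)
print("fine-tuned model name:", model_name)
```

Passing `sleep` as a parameter keeps the function testable: the demo replaces it with a no-op so nothing actually waits.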

### Load the dataset containing questions and contexts
If you have new datasets, reload them here.
Note: the number of questions must not be less than the number of contexts.

In [41]:
import pandas as pd
path = r"C:\Users\Li\Desktop\input.xlsx"
df = pd.read_excel(path)
df

Unnamed: 0,Competency Question,Ground Truth,Related Triples,Context
0,Who is working in the Computational Materials ...,PD Dr. habil. Thomas Hammerschmidt; Prof. Dr. ...,[('http://demo.fiz-karlsruhe.de/matwerk/E12326...,Prof. Stefan Diebels has expertise in Computat...
1,What are the research projects associated to E...,VIMMP (2018-2021); OYSTER (2017-2021); SimDOME...,[('http://demo.fiz-karlsruhe.de/matwerk/E11524...,Essential Source of Schemas and Examples (ESSE...
2,"Who are the contributors of the data ""datasets""?",Prof. Felix Fritzen <http://demo.fiz-karlsruhe...,[('http://demo.fiz-karlsruhe.de/matwerk/E11722...,datasets has contributor Fernández; datasets h...
3,"Who is working with Researcher ""Ebrahim Norouz...",Prof. Dr. Harald Sack; Mirza Mohtashim Alam; D...,[('http://demo.fiz-karlsruhe.de/matwerk/E31877...,Thomas Pardoen has work package Institute of M...
4,"who is the email address of ""ParaView""?",support@kitware.com,[('http://demo.fiz-karlsruhe.de/matwerk/E41915...,ParaView has website https://www.paraview.org/...
5,What are the affilliations of Volker Hofmann?,Forschungszentrum Jülich <http://demo.fiz-karl...,[('http://demo.fiz-karlsruhe.de/matwerk/E14531...,Dr. Tilmann Hickel has affiliation with Max-Pl...
6,"What is ""Molecular Dynamics"" Software? List th...",1. Resource: http://demo.fiz-karlsruhe.de/matw...,[('http://demo.fiz-karlsruhe.de/matwerk/E55172...,"OpenBIS has description ""The openBIS platform..."
7,What are pre- and post-processing tools for MD...,Pizza.py Toolkit; pyscal; ASE; MDTraj; freud,[('http://demo.fiz-karlsruhe.de/matwerk/E47387...,"AML has description ""Python package to automa..."
8,What are some workflow environments for comput...,Pyiron; AiiDA; SimStack,[('http://demo.fiz-karlsruhe.de/matwerk/E10313...,Atomistictools has related resource Pyiron; Py...
9,How should I cite pyiron?,"""title = {pyiron: An integrated development en...",[('http://demo.fiz-karlsruhe.de/matwerk/E45749...,"Pyiron has description ""pyiron is an integrat..."


In [27]:
questions = df['Competency Question']
contexts = df['Context']
length = questions.count()
print(questions.count())
print(contexts.count())


38
35
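
The two counts differ (38 questions, 35 contexts), which suggests that some rows have no context. Empty Excel cells are typically read by pandas as NaN and not as an empty string, so a comparison with `""` alone would not detect them. The sketch below shows one way to test for a missing cell in plain Python; `is_missing` is a hypothetical helper, not part of the original notebook.

```python
import math


def is_missing(value):
    """Return True for None, NaN, or a blank string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


cells = ["datasets has contributor Fernández", "", float("nan"), None, "   "]
flags = [is_missing(cell) for cell in cells]
print(flags)

# NaN never compares equal to an empty string, so this check alone misses it
nan_equals_empty = float("nan") == ""
print(nan_equals_empty)
```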


## QA tasks based on the fine-tuned model

In [21]:
import os
from openai import OpenAI
import openai

def get_answer_with_single_question(question, context):

    # Set OpenAI API key
    api_key = os.environ.get('OPENAI_API_KEY')
    openai.api_key = api_key

    # Initialize OpenAI client
    client = OpenAI(api_key=api_key)

    # Set up the model
    model=fine_tuned_model_name

    try:
        messages=[
                    {"role": "system", "content": "You are a helpful assistant. Extract and answer using key information from context. Ensure the response is concise, without duplicates, focusing solely on crucial details. There are two examples: Example 1: (Context: The sun is a star in the center of our solar system.Question: What is the sun? Answer: A star at the center of the solar system.) and Example 2: (Context: Neil Armstrong was the first person to walk on the moon. Question: Who was the first person to walk on the moon? Answer: Neil Armstrong.)"},
                    # instruction
                    {"role": "user", "content": "Context: " + context + " Question:" + question}
        ]
        response = client.chat.completions.create(
            model=model,
            messages=messages
        )
        
        # Extract and return the answer
        answer = response.choices[0].message.content
        return answer

    except Exception as e:
        print(f"An error occurred while processing the problem: {e}")
        return "Unable to get answer"
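
The function above creates a new client on every call. One possible alternative, sketched here as an assumption and not as the notebook's method, is to create the client once and return a small closure that reuses it. `make_answerer` and the fake client are hypothetical; the fake client only echoes the user message so the example runs without an API key. The message format matches the one used above.

```python
from types import SimpleNamespace


def make_answerer(client, model, system_prompt):
    """Return a function that answers one question, reusing the same client."""

    def answer(question, context):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Context: " + context + " Question:" + question},
        ]
        response = client.chat.completions.create(model=model, messages=messages)
        return response.choices[0].message.content

    return answer


def _fake_create(model, messages):
    """Stand-in for client.chat.completions.create that echoes the user message."""
    reply = SimpleNamespace(content="echo: " + messages[-1]["content"])
    return SimpleNamespace(choices=[SimpleNamespace(message=reply)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

ask = make_answerer(fake_client, "ft:example-model", "You are a helpful assistant.")
result = ask("What is the sun?", "The sun is a star.")
print(result)
```

With a real client, `make_answerer(OpenAI(), fine_tuned_model_name, system_prompt)` would be called once before the loop over the questions.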


In [38]:
import os

import pandas as pd
from openai import OpenAI

# Initialize the OpenAI client with the key from the OPENAI_API_KEY environment variable
api_key = os.environ.get('OPENAI_API_KEY')
client = OpenAI(api_key=api_key)

# Set up the model
model = fine_tuned_model_name

user_input_count = 0
arr_answers = []

for i in range(length):
    # Empty Excel cells are read as NaN, not "", so check for both
    if pd.isna(contexts[i]) or contexts[i] == "":
        # Keep arr_answers aligned with the questions; this is the same
        # placeholder the helper returns when a request fails
        arr_answers.append("Unable to get answer")
        user_input_count += 1
        continue
    print(f"No. {user_input_count} question: {questions[i]}")
    print(f"No. {user_input_count} context: {contexts[i]}")
    answer = get_answer_with_single_question(questions[i], contexts[i])
    arr_answers.append(answer)
    print(f"No. {user_input_count} answer: {answer}")
    user_input_count += 1

No. 0 question: Who is working in the Computational Materials Science field?
No. 0 context: Prof. Stefan Diebels has expertise in Computational materials science; PD Dr.  Franz  Roters has expertise in Computational materials science; EQ2PC has discipline Computational materials science; Prof. Dr. Karsten  Durst has expertise in Computational Materials Science; Prof. Dr.-Ing. Stephan Wulfinghoff has expertise in Computational Materials Science; Christian Dorn has expertise in Computational Materials Science; Dr.-Ing Abril Azocar Guzman has expertise in Computational Materials Science; Prof. Dr.  Jörg Neugebauer has expertise in Computational Materials Science; pyscal_rdf has discipline Computational Materials Science; Elastic Constant Demo has discipline Computational Materials Science; Computational Material Sample Ontology has discipline Computational Materials Science; Elastic Constant Demo Data has discipline Computational Materials Science; TURBOMOLE has discipline Computational M

In [39]:
arr_answers

['Experts: Prof. Stefan Diebels, PD Dr. Franz Roters, Prof. Dr. Karsten Durst, Prof. Dr.-Ing. Stephan Wulfinghoff, Christian Dorn, Dr.-Ing Abril Azocar Guzman, Prof. Dr. Jörg Neugebauer, PD Dr habill. Thomas Hammerschmidt, Dr Sarath Menon. Tools: EQ2PC, pyscal_rdf, Elastic Constant Demo, Computational Material Sample Ontology, Pyiron YouTube channel, Calphy, Melting temperature computational workflow, Pyscal, MinimumEnergyPoints, Image based prediction of the heat conduction tensor, Finite Element Analysis Program, Vienna Ab initio Simulation Package, Cambridge Serial Total Energy Package, Carr Parrinello Molecular Dynamics, ABINIT, BigDFT, Parallel total energy, JDTFx, PARSEC, CP2K, GPAW, S/PHI/nX, Qbox First-Principles Molecular Dynamics, DFTK.jl, density of Montréal, SIESTA, CRYSTAL, FHI-AIMS, FPLO, Open source package for Material eXplorer, Elk, exciting, FLEUR, WIEN2k, Large-scale Atomic/Molecular Massively Parallel Simulator, The ITAP Molecular Dynamics Program, GROMACS, MD++, Th

In [42]:
df['answer_text'] = arr_answers
df

Unnamed: 0,Competency Question,Ground Truth,Related Triples,Context,answer_text
0,Who is working in the Computational Materials ...,PD Dr. habil. Thomas Hammerschmidt; Prof. Dr. ...,[('http://demo.fiz-karlsruhe.de/matwerk/E12326...,Prof. Stefan Diebels has expertise in Computat...,"Experts: Prof. Stefan Diebels, PD Dr. Franz Ro..."
1,What are the research projects associated to E...,VIMMP (2018-2021); OYSTER (2017-2021); SimDOME...,[('http://demo.fiz-karlsruhe.de/matwerk/E11524...,Essential Source of Schemas and Examples (ESSE...,"essence"">Answer: \n1. EMMC-CSA (2016-2019) \..."
2,"Who are the contributors of the data ""datasets""?",Prof. Felix Fritzen <http://demo.fiz-karlsruhe...,[('http://demo.fiz-karlsruhe.de/matwerk/E11722...,datasets has contributor Fernández; datasets h...,Fernández; Prof. Felix Fritzen; Oliver Weeger...
3,"Who is working with Researcher ""Ebrahim Norouz...",Prof. Dr. Harald Sack; Mirza Mohtashim Alam; D...,[('http://demo.fiz-karlsruhe.de/matwerk/E31877...,Thomas Pardoen has work package Institute of M...,Researchers that work in the group with Ebrah...
4,"who is the email address of ""ParaView""?",support@kitware.com,[('http://demo.fiz-karlsruhe.de/matwerk/E41915...,ParaView has website https://www.paraview.org/...,ParaView; Email: support@kitware.com
5,What are the affilliations of Volker Hofmann?,Forschungszentrum Jülich <http://demo.fiz-karl...,[('http://demo.fiz-karlsruhe.de/matwerk/E14531...,Dr. Tilmann Hickel has affiliation with Max-Pl...,Answer: Forschungszentrum Jülich.
6,"What is ""Molecular Dynamics"" Software? List th...",1. Resource: http://demo.fiz-karlsruhe.de/matw...,[('http://demo.fiz-karlsruhe.de/matwerk/E55172...,"OpenBIS has description ""The openBIS platform...",Context: (1) GROMACS provides several papers ...
7,What are pre- and post-processing tools for MD...,Pizza.py Toolkit; pyscal; ASE; MDTraj; freud,[('http://demo.fiz-karlsruhe.de/matwerk/E47387...,"AML has description ""Python package to automa...",AML: A Python package to create reference sets...
8,What are some workflow environments for comput...,Pyiron; AiiDA; SimStack,[('http://demo.fiz-karlsruhe.de/matwerk/E10313...,Atomistictools has related resource Pyiron; Py...,"Pyiron, Simmate, matminer"
9,How should I cite pyiron?,"""title = {pyiron: An integrated development en...",[('http://demo.fiz-karlsruhe.de/matwerk/E45749...,"Pyiron has description ""pyiron is an integrat...","Answer: Use ""pyiron: An integrated developmen..."


## Save the DataFrame as an XLSX file

In [48]:
save_path = r"C:\Users\Li\Desktop\results.xlsx"

df.to_excel(save_path, index=False)

print("successfully saved", save_path)

successfully saved C:\Users\Li\Desktop\results.xlsx
