# OpenAI Fine-Tuning ReAct: Distill GPT-4 to GPT-3.5

In this notebook, we walk through an example of fine-tuning gpt-3.5-turbo.

Specifically, we attempt to distill GPT-4's knowledge, by generating training data with GPT-4 to then fine-tune GPT-3.5.

All training data is generated using two different sections of our index data, creating both a training and an evaluation set.

We then finetune with our `OpenAIFinetuneEngine` wrapper abstraction.

Evaluation is done using the `ragas` library, which we will detail later on.

In [None]:
!pip install llama-index


In [None]:
!pip install pypdf



In [None]:
import os
import openai
from llama_index import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    ServiceContext,
    load_index_from_storage,
)
from llama_index.llms import OpenAI

from llama_index.tools import QueryEngineTool, ToolMetadata

## Data Setup

Here, we first load the document that we will use to generate training data (in this run, a Markdown file on ukulele scale memorization rather than a PDF).

The next step is generating a training and eval dataset.

We will generate 40 training questions and 40 eval questions on different sections of the document we loaded.

We can use GPT-3.5 on the eval questions to get our baseline performance.

Then, we will use GPT-4 on the train questions to generate our training data. The training data will be collected with our `OpenAIFineTuningHandler`.

This step is entirely optional if you don't want to spend the time/tokens -- the eval and training questions are also provided in this folder, as well as the training data!

### Train Generation

In [None]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-0613")
# llm = OpenAI(temperature=0, model="gpt-4-0613")
service_context = ServiceContext.from_defaults(llm=llm)

gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo-0613", temperature=0.3)
)
gpt4_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4-0613", temperature=0.3)
)

# load data up front: march_docs is also needed later for question generation
march_docs = SimpleDirectoryReader(
    input_files=["两种有效的ukulele音阶记忆法.md"]
).load_data()

try:
    # reuse a previously persisted index if one exists
    storage_context = StorageContext.from_defaults(
        persist_dir="./storage/march"
    )
    march_index = load_index_from_storage(storage_context)
    index_loaded = True
except Exception:
    index_loaded = False

if not index_loaded:
    # build index
    march_index = VectorStoreIndex.from_documents(
        march_docs, service_context=service_context
    )

    # persist index
    march_index.storage_context.persist(persist_dir="./storage/march")

In [None]:
march_engine = march_index.as_query_engine(
    similarity_top_k=3, service_context=service_context
)
# QueryEngineTool is already imported at the top of the notebook.
# The Chinese part of the description means
# "provides a knowledge base for music theory retrieval".
query_tool_sept = QueryEngineTool.from_defaults(
    query_engine=march_engine,
    name="music_theory_query",
    description=(
        "提供关于乐理检索的知识库"
        " About ukulele Performance Techniques"
    ),
)


query_engine_tools = [query_tool_sept]

In [None]:
from llama_index.agent import ReActAgent
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-0613")
# llm = OpenAI(model="gpt-4-0613")
base_agent = ReActAgent.from_tools(query_engine_tools, llm=llm, verbose=True)

In [None]:
from llama_index.evaluation import DatasetGenerator


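# Prompt (in Chinese), roughly: "You are a teacher/professor setting a quiz.
# Using the context from the submitted music theory material, write questions
# about ukulele performance techniques that capture important facts, and
# restrict the questions to the provided context."
base_question_gen_query = (
    "你是一名教师/教授，你的任务是安排一次测验/考试。"
    "使用音乐理论教材提交的文件中提供的上下文， 制定一些"
    "关于尤克里里的表演技巧，"
    "关于ukulele的表演技巧。"
    "一个从中捕捉到重要事实的问题上下文，"
    "将问题限制在所提供的上下文信息内."
)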
dataset_generator = DatasetGenerator.from_documents(
    march_docs,
    question_gen_query=base_question_gen_query,
    service_context=gpt_35_context,
)



In [None]:

import nest_asyncio

# allow nested event loops so the async generator can run inside the notebook
nest_asyncio.apply()

questions = dataset_generator.generate_questions_from_nodes(num=50)




In [None]:
print(len(questions))
questions

20


['在ukulele的调弦中，第一弦到第四弦的字母分别是什么？',
 '将ukulele调弦变成C调的简谱后，第一弦到第四弦的数字分别是什么？',
 '根据全音和半音的关系，如何在指板上填满C调的音阶？',
 '降D调的指板图相当于将C调的音阶整体往下移了几格？',
 '在降D调的指板图中，do在原来的位置上是哪根弦的空弦？',
 '在降D调的指板图中，do向下移一个格后，变成了哪根弦的一品？',
 '在降D调的指板图中，do向下移一个格后，它的音名可以是哪两个？',
 '方法一中提到的指型记忆适用于哪些琴弦？',
 '在方法一中，列出了哪些琴弦的音阶？',
 '请描述一下方法一中的指型记忆技巧。',
 '在ukulele的表演技巧中，有哪些有效的音阶记忆方法？',
 '请描述一下第一种有效的ukulele音阶记忆方法是什么？',
 '第二种有效的ukulele音阶记忆方法是什么？',
 '请解释一下一二弦的纯四度关系在ukulele表演中的作用。',
 '你能提供一些关于二三弦的三度关系的记忆口诀吗？',
 '在ukulele表演中，为什么四弦与一弦之间的两个音是一样的？',
 '如何利用ukulele四弦与其他琴弦相同的音来发挥其独特特色？',
 '在整个指板上进行上下移动时，每移动一次会变到另外一个调。你能解释一下如何判断移动到哪一个调？',
 '为什么ukulele上每一个品格只对应一个音名，而这个音名不能随意移动？',
 '你能提供一些关于ukulele的表演技巧的实际应用案例吗？']

### Eval Generation

Now, let's use GPT-4 to generate variations of each base question, then split the combined list into a training set and an eval set.

In [None]:
from llama_index.llms import OpenAI
from llama_index.prompts import PromptTemplate


vary_question_tmpl = """\
你是一位音乐理论教材编辑。给定一个关于音乐家的问题，你的目标是生成多达 {num_vary} 个问题变体，涉及多个音乐理论教材。

这可能包括比较/对比不同音乐理论教材，用另一个教材替换当前的教材，或生成只能通过多个教材回答的问题（发挥创意！）

你被提供了一组有效的音乐理论教材。请仅生成可以在该组教材中回答的问题变体。

For example:
Base Question:音乐家在其创作中如何运用和声学？
Valid 10Qs:[《和声基础》, 《音乐创作指南》, 《现代和声技巧》]
Question Variations:
在《音乐创作指南》中，音乐家是如何运用和声学的？
请比较/对比《和声基础》和《现代和声技巧》中音乐家如何运用和声学，并解释其差异。
音乐家在哪本教材中描述了其对和声学的理解？

现在让我们试试吧！

Base Question: {base_question}
Valid 10Qs: {valid_10qs}
Question Variations:
"""

In [None]:


def gen_question_variations(base_questions, num_vary=3):
    """Generate question variations."""

    VALID_10Q_STR = "[About ukulele Performance Techniques]"

    llm = OpenAI(model="gpt-4")
    prompt_tmpl = PromptTemplate(vary_question_tmpl)

    new_questions = []
    for idx, question in enumerate(base_questions):
        new_questions.append(question)
        response = llm.complete(
            prompt_tmpl.format(
                num_vary=num_vary,
                base_question=question,
                valid_10qs=VALID_10Q_STR,
            )
        )
        # parse into newlines
        raw_lines = str(response).split("\n")
        cur_new_questions = [l for l in raw_lines if l != ""]
        print(f"[{idx}] Original Question: {question}")
        print(f"[{idx}] Generated Question Variations: {cur_new_questions}")
        new_questions.extend(cur_new_questions)

    return new_questions


def save_questions(questions, path):
    with open(path, "w", encoding="utf-8") as f:  # questions contain Chinese text
        for question in questions:
            f.write(question + "\n")


def load_questions(path):
    questions = []
    with open(path, "r", encoding="utf-8") as f:  # questions contain Chinese text
        for line in f:
            questions.append(line.strip())
    return questions
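
The variations returned by the model keep their list numbering (`1. `, `2. `, ...) after being split on newlines, so the numbering ends up inside the saved questions. A minimal sketch of a cleanup step follows; `strip_list_marker` and `clean_variations` are hypothetical helper names, not part of LlamaIndex, and the set of markers handled is an assumption about what the model tends to emit.

```python
import re

# Matches one leading list marker such as "1. ", "2) ", "- " or "* "
# (assumed formats; extend the pattern if the model uses others).
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def strip_list_marker(line: str) -> str:
    """Remove a single leading list marker and surrounding whitespace."""
    return _LIST_MARKER.sub("", line, count=1).strip()


def clean_variations(raw_text: str) -> list:
    """Split a completion into non-empty question lines without markers."""
    cleaned = (strip_list_marker(line) for line in raw_text.split("\n"))
    return [line for line in cleaned if line]
```

Inside `gen_question_variations`, `clean_variations(str(response))` could stand in for the split-and-filter step if unnumbered questions are preferred.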

In [None]:
new_questions = gen_question_variations(questions)


[0] Original Question: 在ukulele的调弦中，第一弦到第四弦的字母分别是什么？
[0] Generated Question Variations: ['1. 在《About ukulele Performance Techniques》中，第一弦到第四弦的字母分别是什么？', '2. 请解释在《About ukulele Performance Techniques》中，ukulele的调弦是如何进行的，特别是第一弦到第四弦的字母是什么？', '3. 在《About ukulele Performance Techniques》中，ukulele的第一弦到第四弦的字母分别是什么，以及它们在音乐中的作用是什么？']
[1] Original Question: 将ukulele调弦变成C调的简谱后，第一弦到第四弦的数字分别是什么？
[1] Generated Question Variations: ['1. 在《About ukulele Performance Techniques》中，将ukulele调弦变成C调的简谱后，第一弦到第四弦的数字分别是什么？', '2. 《About ukulele Performance Techniques》中有没有详细解释如何将ukulele调弦变成C调的简谱？', '3. 你能从《About ukulele Performance Techniques》中找到将ukulele调弦变成C调的简谱后，第一弦到第四弦的数字吗？']
[2] Original Question: 根据全音和半音的关系，如何在指板上填满C调的音阶？
[2] Generated Question Variations: ['1. 在《尤克里里演奏技巧》中，如何根据全音和半音的关系在指板上填满C调的音阶？', '2. 《尤克里里演奏技巧》是如何解释在指板上填满C调音阶的全音和半音关系的？', '3. 你能否根据《尤克里里演奏技巧》中的指导，解释如何在指板上填满C调的音阶？']
[3] Original Question: 降D调的指板图相当于将C调的音阶整体往下移了几格？
[3] Generated Question Variations: ['1. 在《尤克里里演奏技巧》中，降D调的指板图相当于将C调的音阶整体往下移了几格？'

In [None]:
len(new_questions)


80

In [None]:
train_questions, eval_questions = new_questions[:40], new_questions[40:]
print(len(train_questions))

40
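
The slice at index 40 keeps each base question together with its variations, provided every base question produced exactly `num_vary=3` variations (the total of 80 for 20 base questions is consistent with that, but it is an assumption). A small sketch that makes the grouping explicit and fails loudly when the assumption does not hold; `split_by_group` is a hypothetical helper.

```python
def split_by_group(questions, group_size, n_train_groups):
    """Split a flat list into train/eval without breaking up a group.

    A group is one base question followed by its variations.
    """
    if group_size <= 0 or len(questions) % group_size != 0:
        raise ValueError("questions must contain a whole number of groups")
    cut = group_size * n_train_groups
    return questions[:cut], questions[cut:]
```

With `group_size=4` and `n_train_groups=10` this reproduces the 40/40 split above, so no eval question is a rewording of a training question.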


In [None]:
save_questions(train_questions, "train_questions_10q.txt")
save_questions(eval_questions, "eval_questions_10q.txt")

In [None]:
train_questions = load_questions("train_questions_10q.txt")
eval_questions = load_questions("eval_questions_10q.txt")

In [None]:
import os
from google.colab.output import eval_js
from google.colab import drive
%cd /content/
drive.mount('/content/drive')
os.environ['colab_url'] = eval_js("google.colab.kernel.proxyPort(7860, {'cache': false})")

/content
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Make sure the target folder exists on Google Drive
!test -d /content/drive/MyDrive/react/train_ukulele_data || mkdir /content/drive/MyDrive/react/train_ukulele_data -p

In [None]:

# Sync the generated files to Google Drive
!cp -r /content/storage /content/drive/MyDrive/react/train_ukulele_data/storage_`date +%Y-%m-%d_%H:%M:%S`
!cp -r /content/eval_questions_10q.txt /content/drive/MyDrive/react/train_ukulele_data/eval_questions_10q_`date +%Y-%m-%d_%H:%M:%S`.txt

!cp -r /content/train_questions_10q.txt /content/drive/MyDrive/react/train_ukulele_data/train_questions_10q_`date +%Y-%m-%d_%H:%M:%S`.txt

!cp -r /content/两种有效的ukulele音阶记忆法.md /content/drive/MyDrive/react/train_ukulele_data/两种有效的ukulele音阶记忆法_`date +%Y-%m-%d_%H:%M:%S`.md
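
The same timestamped backup can be sketched in plain Python, which avoids shell quoting issues with non-ASCII file names. `backup_with_timestamp` is a hypothetical helper; unlike the `date` format above, it uses hyphens instead of colons in the time part, an assumption made because colons are not valid in file names on every filesystem.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_with_timestamp(src, dest_dir, now=None):
    """Copy a file or directory into dest_dir, adding a timestamp suffix."""
    src, dest_dir = Path(src), Path(dest_dir)
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    dest_dir.mkdir(parents=True, exist_ok=True)
    # e.g. eval_questions_10q.txt -> eval_questions_10q_<stamp>.txt
    target = dest_dir / f"{src.stem}_{stamp}{src.suffix}"
    if src.is_dir():
        shutil.copytree(src, target)
    else:
        shutil.copy2(src, target)
    return target
```

For example, `backup_with_timestamp("/content/eval_questions_10q.txt", "/content/drive/MyDrive/react/train_ukulele_data")` would mirror the second `cp` command above.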

## GPT-4 to Collect Training Data

Here, we use GPT-4 and the `OpenAIFineTuningHandler` to collect data that we want to train on.

In [None]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.callbacks import OpenAIFineTuningHandler
from llama_index.callbacks import CallbackManager
from llama_index.agent import ReActAgent

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])

gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4", temperature=0.3),
    context_window=2048,  # limit the context window artificially to test refine process
    callback_manager=callback_manager,
)

In [None]:
llm = OpenAI(model="gpt-4-0613")
gpt4_agent = ReActAgent.from_tools(
    query_engine_tools,
    llm=llm,
    callback_manager=callback_manager,
    verbose=True,
)

In [None]:
for idx, question in enumerate(train_questions):
    print(f"[{idx}] Question: {question}")
    response = gpt4_agent.query(question)
    print(f"[{idx}] Agent Response: {str(response)}")

[0] Question: 在ukulele的调弦中，第一弦到第四弦的字母分别是什么？
Thought: I need to use the music_theory_query tool to help me answer the question.
Action: music_theory_query
Action Input: {'input': 'ukulele tuning'}
Observation: The ukulele is typically tuned to the notes A, E, C, and G, which correspond to the strings from highest to lowest.
Thought: I can answer without using any more tools.
Response: ukulele的第一弦到第四弦的字母分别是A，E，C和G。
[0] Agent Response: ukulele的第一弦到第四弦的字母分别是A，E，C和G。
[1] Question: 1. 在《About ukulele Performance Techniques》中，第一弦到第四弦的字母分别是什么？
Thought: I need to use the music_theory_query tool to help me answer the question.
Action: music_theory_query
Action Input: {'input': 'About ukulele Performance Techniques 第一弦到第四弦的字母'}
Observation: The letters for the first to fourth strings of the ukulele are A, E, C, and G, respectively.
Thought: I can answer without using any more tools.
Response: 在《About

In [None]:
# save events
finetuning_handler.save_finetuning_events("finetuning_events_10q.jsonl")

Wrote 86 examples to finetuning_events_10q.jsonl
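
Before uploading the file, it can help to sanity-check what was written. The sketch below assumes the handler writes OpenAI's chat fine-tuning format, one JSON object per line with a `messages` list of `{"role", "content"}` entries; `summarize_finetuning_file` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_finetuning_file(path):
    """Return (number of examples, Counter of message roles) for a JSONL file."""
    n_examples = 0
    roles = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            n_examples += 1
            roles.update(m["role"] for m in record.get("messages", []))
    return n_examples, roles
```

Calling `summarize_finetuning_file("finetuning_events_10q.jsonl")` should report the same example count the handler printed when saving.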


In [None]:

!cp -r /content/finetuning_events_10q.jsonl /content/drive/MyDrive/react/train_ukulele_data/finetuning_events_10q_`date +%Y-%m-%d_%H:%M:%S`.jsonl


## Create `OpenAIFinetuneEngine` (Not Used)

We create an `OpenAIFinetuneEngine`: the finetune engine will take care of launching a fine-tuning job and returning an LLM model that you can plug directly into the rest of your LlamaIndex workflows.

We use the default constructor, but we can also directly pass in our finetuning_handler into this engine with the `from_finetuning_handler` class method.



In [None]:
finetuning_handler.save_finetuning_events("finetuning_events.jsonl")

In [None]:
from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo",
    "finetuning_events.jsonl",
    # start_job_id="<start-job-id>"  # if you have an existing job, can specify id here
)

# finetune_engine = OpenAIFinetuneEngine.from_finetuning_handler(
#     finetuning_handler,
#     "gpt-3.5-turbo",
#     "tmp.jsonl"
# )

In [None]:
finetune_engine.finetune()

Num examples: 61
First example:
{'role': 'system', 'content': "You are an expert Q&A system that is trusted around the world.\nAlways answer the query using the provided context information, and not prior knowledge.\nSome rules to follow:\n1. Never directly reference the given context in your answer.\n2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines."}
{'role': 'user', 'content': 'Context information is below.\n---------------------\npage_label: 410\nfile_name: IPCC_AR6_WGII_Chapter03.pdf\n\nIt is challenging to apply this experimental approach to communities or ecosystems (see Figure \nBox\xa03.1.1).To date, most research on community or ecosystem response to climate-induced drivers has been in large-volume (>10,000 l) \nmesocosms (Riebesell and Gattuso, 2014), or at natural analogues such as CO 2 seeps, in which only one driver (ocean acidification) is \naltered (see (4) in Figure Box\xa03.1.1).Only very recently have

In [None]:
finetune_engine.get_current_job()

<FineTuningJob fine_tuning.job id=ftjob-u9T7BF5zRxVX4n5b9Jtbb5cR at 0x2c641fe20> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-u9T7BF5zRxVX4n5b9Jtbb5cR",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1693254044,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz",
  "result_files": [],
  "status": "running",
  "validation_file": null,
  "training_file": "file-j1fwmqIAoqZXWZQ8EqwHucXs",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": null
}
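
Since the job above is still `running`, the fine-tuned model is not available yet. A minimal polling sketch follows; `wait_for_job` is a hypothetical helper that takes any zero-argument callable returning the job (for instance `finetune_engine.get_current_job`), and the set of terminal status strings is an assumption based on the OpenAI fine-tuning API.

```python
import time

# Statuses after which the job will not change any more (assumed set).
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(get_job, poll_seconds=60.0, max_polls=120, sleep=time.sleep):
    """Poll get_job() until it reports a terminal status, then return it."""
    for _ in range(max_polls):
        job = get_job()
        # Support both dict-like and attribute-style job objects.
        status = job["status"] if isinstance(job, dict) else job.status
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("fine-tuning job did not finish within the polling budget")
```

Usage would look like `wait_for_job(finetune_engine.get_current_job)`, with the fine-tuned model fetched only once the returned status is `"succeeded"`.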

In [None]:
ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)

## Evaluation

After some time, your model will be done training!

The next step is running our fine-tuned model on our eval dataset again to measure any performance increase.

In [None]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.callbacks import OpenAIFineTuningHandler
from llama_index.callbacks import CallbackManager


# Option 1: pass in ft_llm directly into ServiceContext
ft_context = ServiceContext.from_defaults(
    llm=ft_llm,
    context_window=2048,  # limit the context window artificially to test refine process
)

# # Option 2: you can also specify the model name manually
# ft_model_name = "ft:gpt-3.5-turbo-0613:..."
# ft_context = ServiceContext.from_defaults(
#     llm=OpenAI(model=ft_model_name, temperature=0.3),
#     context_window=2048,  # limit the context window artificially to test refine process
# )

In [None]:
questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

In [None]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, service_context=ft_context)

query_engine = index.as_query_engine(similarity_top_k=2)

In [None]:
contexts = []
answers = []

for question in questions:
    response = query_engine.query(question)
    contexts.append([x.node.get_content() for x in response.source_nodes])
    answers.append(str(response))

In [None]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

ds = Dataset.from_dict(
    {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
    }
)

result = evaluate(ds, [answer_relevancy, faithfulness])
print(result)

evaluating with [answer_relevancy]
evaluating with [faithfulness]

{'ragas_score': 0.8680, 'answer_relevancy': 0.9607, 'faithfulness': 0.7917}
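
The reported `ragas_score` is consistent with a harmonic mean of the two metric scores, which penalizes a low value on either metric more than an arithmetic mean would. The sketch below only reproduces that arithmetic from the rounded numbers printed above; that ragas combines metrics this way in the version used here is an assumption, and `combined_score` is a hypothetical helper.

```python
from statistics import harmonic_mean


def combined_score(scores):
    """Harmonic mean of metric scores (all scores must be positive)."""
    return harmonic_mean(scores)


# Rounded values copied from the evaluation output above.
score = combined_score([0.9607, 0.7917])
print(f"{score:.3f}")
```

Because the inputs are already rounded to four decimals, the last digit can differ slightly from the value ragas prints.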


## Exploring Differences

Let's quickly compare the differences in responses, to demonstrate that fine tuning did indeed change something.

In [None]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

In [None]:
questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

In [None]:
print(questions[12])

What is a key barrier globally for ocean health, governance, and adaptation to climate change, according to the report?


### Original

In [None]:
from llama_index.response.notebook_utils import display_response
from llama_index import ServiceContext
from llama_index.llms import OpenAI


gpt_35_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.3),
    context_window=2048,  # limit the context window artificially to test refine process
)

In [None]:
query_engine = index.as_query_engine(service_context=gpt_35_context)

response = query_engine.query(questions[12])

display_response(response)

**`Final Response:`** A key barrier globally for ocean health, governance, and adaptation to climate change, according to the report, is the availability of technology, knowledge, and financial support, as well as existing governance structures.

### Fine-Tuned

In [None]:
from llama_index import ServiceContext
from llama_index.llms import OpenAI


ft_context = ServiceContext.from_defaults(
    llm=ft_llm,
    context_window=2048,  # limit the context window artificially to test refine process
)

In [None]:
query_engine = index.as_query_engine(service_context=ft_context)

response = query_engine.query(questions[12])

display_response(response)

**`Final Response:`** The report identifies a broad range of barriers and limits for adaptation to climate change in ecosystems and human systems. These include the availability of technology, knowledge, and financial support, as well as existing governance structures. Existing ocean-governance structures are already facing multi-dimensional, scale-related challenges because of climate change.

As we can see, the fine-tuned model provides a more thorough response! This lines up with the increased faithfulness score from ragas, since the answer is more representative of the retrieved context.

## Conclusion

So, in conclusion, fine-tuning with only ~61 examples improved faithfulness on our eval set, at the cost of a very small dip in answer relevancy.

**answer_relevancy: 0.9725 -> 0.9607**

Answer relevancy dips slightly, but the change is very small.

**faithfulness: 0.7325 -> 0.7917**

Faithfulness appears to have improved! This means the answers given stay closer to the retrieved context.
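
To put the two changes on the same footing, the before and after scores can be turned into absolute and relative differences. This is a small arithmetic sketch using the numbers quoted above; `score_change` is a hypothetical helper.

```python
def score_change(before, after):
    """Return (absolute change, relative change in percent)."""
    delta = after - before
    return delta, 100.0 * delta / before


# Scores copied from the comparison above: (baseline, fine-tuned).
results = {
    "answer_relevancy": (0.9725, 0.9607),
    "faithfulness": (0.7325, 0.7917),
}

for name, (before, after) in results.items():
    delta, pct = score_change(before, after)
    print(f"{name}: {delta:+.4f} ({pct:+.1f}%)")
```

With a small eval set and a single run, differences of this size may be within noise, so they are best read as indicative rather than conclusive.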