# Fine-Tuning GPT-3.5-Turbo

In this notebook, we walk through an example of fine-tuning gpt-3.5-turbo.

Specifically, we attempt to distill GPT-4's knowledge by using GPT-4 to generate training data and then fine-tuning GPT-3.5 on that data.

All training data is generated from two different sections of our index data, creating both a training set and an evaluation set.
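
As a rough illustration of splitting the source material into two disjoint sections, the sketch below shuffles a list of items and cuts it in two. It uses only the standard library; `split_sections`, its parameters, and the placeholder `pages` list are hypothetical names introduced here, not part of llama-index or of this notebook's pipeline.

```python
import random


def split_sections(items, train_fraction=0.5, seed=42):
    """Shuffle a copy of `items` and split it into two disjoint lists.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A seeded Random instance keeps the split reproducible across runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder strings stand in for loaded document pages
pages = [f"page-{i}" for i in range(10)]
train_pages, eval_pages = split_sections(pages, train_fraction=0.5)
```

Keeping the two sections disjoint means the evaluation questions come from text that was never used to generate training questions.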

We then fine-tune with our `OpenAIFinetuneEngine` wrapper abstraction.

Evaluation is done using the `ragas` library, which we will detail later on.

In [None]:
%pip install llama-index==0.8.12 pypdf sentence-transformers ragas python-dotenv

In [None]:
import os
import openai

In [None]:
from dotenv import load_dotenv, find_dotenv

# Load environment variables (including OPENAI_API_KEY) from the nearest .env file
_ = load_dotenv(find_dotenv())

openai.api_key = os.getenv("OPENAI_API_KEY")

## Data Setup

Here, we first download the PDF that we will use to generate training data.
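
The download step itself is not shown in this notebook. A minimal sketch of one way to do it with the standard library might look like the following. The `download_pdf` helper and its parameters are hypothetical (they are not part of llama-index), and the default directory simply mirrors the `./data/gpt_PDFs` path that the QA generation code reads from.

```python
import shutil
import urllib.request
from pathlib import Path


def download_pdf(url, filename, dest_dir="./data/gpt_PDFs"):
    """Fetch `url` and save it as `dest_dir/filename`.

    Hypothetical helper for illustration only. Returns the path of the saved file.
    """
    dest = Path(dest_dir)
    # Create the target directory if it does not exist yet
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / filename
    # Stream the response to disk rather than holding it all in memory
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_pdf(url, "report.pdf")` with the report's actual URL (not given here) would place the file where the code in the next section expects it.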

### QA Generation

In [None]:
import os
import time

import pandas as pd
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.evaluation import DatasetGenerator
from llama_index.llms import OpenAI

dir_path = './data/gpt_PDFs'

gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-4", temperature=0.3)
)

question_gen_query = (
    "You are a simple farmer in the Mekong Delta, Vietnam. "
    "You are interested in learning about salinity intrusion and its effects in the Mekong Delta in Vietnam. "
    "Using the provided context from a report on climate change and the Mekong Delta, formulate "
    "a single question that captures an important fact from the context. Restrict the question to the context information provided, "
    "and do not include any reference to a specific equation, figure, diagram, table, report, or paper. "
    "Do not ask anything that requires deep technical or mathematical knowledge."
)

qac = []
# Iterate over each file in the directory
for filename in os.listdir(dir_path):
    file_path = os.path.join(dir_path, filename)

    # Check if it's a file and not a sub-directory
    if os.path.isfile(file_path):
        docs = SimpleDirectoryReader(input_files=[file_path]).load_data()

        # Use GPT-4 to generate questions from the document's nodes
        dataset_generator = DatasetGenerator.from_documents(
            docs,
            question_gen_query=question_gen_query,
            service_context=gpt_4_context,
        )
        # Short pauses between API calls (presumably to stay under rate limits)
        time.sleep(4)

        questions = dataset_generator.generate_questions_from_nodes(num=20)

        # Index the same document so GPT-4 can answer each generated question
        index = VectorStoreIndex.from_documents(docs, service_context=gpt_4_context)
        query_engine = index.as_query_engine(similarity_top_k=2)

        for question in questions:
            response = query_engine.query(question)
            # Keep the retrieved chunks alongside each question/answer pair
            context = [x.node.get_content() for x in response.source_nodes]
            qac.append(
                {"Question": question, "Answer": str(response), "Context": context}
            )
            time.sleep(1)
        time.sleep(2)

qac_df = pd.DataFrame(qac)
qac_df

In [None]:
# Replace newlines with spaces so each record stays on a single line
# when written to TSV (removing them outright would fuse adjacent words)
for column in ["Question", "Answer"]:
    qac_df[column] = qac_df[column].str.replace("\n", " ", regex=False).str.strip()
qac_df

In [None]:
qac_df.to_csv('qa_gpt_finetuning.tsv', sep='\t', index=False, encoding='utf-8')
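
Because the `Context` column holds Python lists, it is written to the TSV as the text form of each list. The sketch below shows one way the file might be read back with the lists restored. It uses the standard-library `csv` and `ast` modules rather than pandas, and `load_qac_rows` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import ast
import csv


def load_qac_rows(tsv_path):
    """Read the saved TSV and turn each Context cell back into a list.

    Hypothetical helper for illustration only. Assumes the columns are
    Question, Answer, and Context, as written by the cell above.
    """
    rows = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # literal_eval safely parses the stringified list of context chunks
            row["Context"] = ast.literal_eval(row["Context"])
            rows.append(row)
    return rows
```

For example, `load_qac_rows("qa_gpt_finetuning.tsv")` would return one dictionary per question, with `Context` as a list of strings again.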