<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Fine-Tuning GPT-3.5-Turbo to Evaluate Retrieval-Augmented Generation (RAG) Applications</h1>

This notebook shows how to fine-tune a GPT-3.5-Turbo base model to match the performance of GPT-4 on the task of relevance classification.

ℹ️ This notebook requires an OpenAI key.

⚠️ Fine-tuning may take an hour or longer and will cost a few dollars.


## Context

Arize provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by retrieval-augmented generation (RAG) applications. This relevance is then used to measure the quality of each retrieval using ranking metrics such as precision@k. In order to determine whether each retrieved document is relevant or irrelevant to the corresponding query, our approach is straightforward: ask an LLM.

To maximize throughput and minimize cost, it's desirable to keep the prompt short and preferably zero-shot, meaning that no concrete examples are included in the prompt. As you'll see, GPT-4 performs well at the task of relevance classification even with a zero-shot prompt, while GPT-3.5-Turbo struggles. On the other hand, GPT-4 is slower, has lower rate limits, and is more expensive than GPT-3.5-Turbo.
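
To make the ranking metric mentioned above concrete, here is a minimal sketch of precision@k computed from binary relevance labels. The function name `precision_at_k` and the sample labels are illustrative assumptions and are not part of Phoenix.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant.

    `relevance` holds one binary label per retrieved document, in rank order.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = relevance[:k]
    # Dividing by k (not len(top_k)) penalizes retrievals that return fewer than k documents.
    return sum(top_k) / k


# Hypothetical labels for four retrieved documents, best-ranked first.
labels = [True, False, True, False]
print(precision_at_k(labels, k=2))
print(precision_at_k(labels, k=4))
```

In this notebook's setting, each label would come from an LLM's "relevant" or "irrelevant" classification of one retrieved document.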

## Objectives

In this notebook, you will:

- Fine-tune GPT-3.5-Turbo on WikiQA, a question-answering dataset that contains queries, retrieved documents, and ground truth binary relevance labels from human labelers.
- Evaluate your fine-tuned model against GPT-3.5-Turbo and GPT-4 base models on a holdout test set.

Once you've fine-tuned GPT-3.5-Turbo, you can then use it to evaluate your RAG applications with higher volume and lower cost than if you were using GPT-4.

Let's get started!

## Install Dependencies and Import Libraries

In [None]:
!pip install -qq "arize-phoenix[experimental]==0.0.33rc9" ipython matplotlib openai scikit-learn

In [None]:
import json
import os
from getpass import getpass
from io import StringIO

import matplotlib.pyplot as plt
import openai
from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    OpenAiModel,
    PromptTemplate,
    download_benchmark_dataset,
    llm_eval_binary,
)
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix

## Configure Your OpenAI API Key

Set your OpenAI API key if it is not already set as an environment variable.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## Download Benchmark Dataset

Download the WikiQA training dataset for fine-tuning.

In [None]:
fine_tune_df = (
    download_benchmark_dataset(task="binary-relevance-classification", dataset_name="wiki_qa-train")
    .rename(
        columns={
            "query_text": "query",
            "document_text": "reference",
        },
    )
    .sample(frac=1.0, random_state=42)
)
fine_tune_df.head()

Download and sample the WikiQA test dataset to evaluate the fine-tuned model against the base GPT-3.5-Turbo and GPT-4 models.

In [None]:
test_df = (
    download_benchmark_dataset(task="binary-relevance-classification", dataset_name="wiki_qa-test")
    .sample(n=10, random_state=42)  # FIXME
    .rename(
        columns={
            "query_text": "query",
            "document_text": "reference",
        },
    )
)
test_df.head()

## Prepare Your Fine-Tuning Data

Format your prompt template across the WikiQA training dataset.

In [None]:
prompt_template = PromptTemplate(RAG_RELEVANCY_PROMPT_TEMPLATE_STR)
prompts = fine_tune_df.apply(lambda record: prompt_template.format(record), axis=1).tolist()
print(prompts[0])
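
As a rough illustration of what formatting a template across records involves, the sketch below uses plain `str.format` on a hypothetical template. The template text and records are placeholders invented for this example. The real template is `RAG_RELEVANCY_PROMPT_TEMPLATE_STR`, and `PromptTemplate.format` may behave differently in detail.

```python
# Hypothetical template and records for illustration only.
TEMPLATE = "Question: {query}\nReference: {reference}\nIs the reference relevant?"

records = [
    {"query": "placeholder question 1", "reference": "placeholder passage 1"},
    {"query": "placeholder question 2", "reference": "placeholder passage 2"},
]

# Each record's keys must match the template's variable names,
# which is why the dataframe columns were renamed to "query" and "reference".
prompts = [TEMPLATE.format(**record) for record in records]
print(prompts[0])
```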

The OpenAI API expects fine-tuning data to come as a sequence of conversations, each conversation being a list of chat message objects, in JSONL format.

In [None]:
actuals = fine_tune_df["relevant"].map({True: "relevant", False: "irrelevant"})
fine_tune_examples = [
    {
        "messages": [
            {
                "role": "user",
                "content": prompt,
            },
            {
                "role": "assistant",
                "content": actual,
            },
        ]
    }
    for prompt, actual in zip(prompts, actuals)
]
print(json.dumps(fine_tune_examples[0], indent=4))
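
The JSONL layout described above is one JSON object per line, each holding a `messages` list. The following is a minimal sketch of serializing and sanity-checking that layout before upload. The helper names `to_jsonl` and `validate_jsonl` and the sample conversation are assumptions made for illustration, and the check covers only the structure used in this notebook, not every rule OpenAI may enforce.

```python
import json


def to_jsonl(examples):
    """Serialize chat-format examples to a JSONL string, one JSON object per line."""
    return "".join(json.dumps(example) + "\n" for example in examples)


def validate_jsonl(text):
    """Parse every line and check the minimal chat structure; return the example count."""
    count = 0
    for line in text.splitlines():
        record = json.loads(line)
        roles = [message["role"] for message in record["messages"]]
        # In this notebook each conversation ends with the assistant's target label.
        assert roles[-1] == "assistant", "conversation should end with an assistant message"
        count += 1
    return count


# Hypothetical conversation with placeholder text, not taken from WikiQA.
sample = [
    {
        "messages": [
            {"role": "user", "content": "placeholder prompt"},
            {"role": "assistant", "content": "relevant"},
        ]
    }
]
jsonl_text = to_jsonl(sample)
print(validate_jsonl(jsonl_text))
```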

Upload the fine-tuning data to OpenAI.

In [None]:
with StringIO() as buffer:
    for example in fine_tune_examples[:50]:  # FIXME
        buffer.write(json.dumps(example) + "\n")
    buffer.seek(0)

    file_response = openai.File.create(
        file=buffer,
        purpose="fine-tune",
    )
    file_id = file_response.to_dict()["id"]

print(f"File ID: {file_id}")

## Fine-Tune the Model

Submit a fine-tuning job.

⚠️ You might have to wait for the file upload to finish processing before running this cell.

In [None]:
job_response = openai.FineTuningJob.create(
    training_file=file_id,
    model="gpt-3.5-turbo",
    suffix="wikiqa-relevance",
)
job_id = job_response.to_dict()["id"]
print(f"Job ID: {job_id}")

The fine-tuning job may take an hour or longer. You will receive an email when it's done. Once the job has finished, run the following cell to fetch the name of your fine-tuned model.

In [None]:
job = openai.FineTuningJob.retrieve(job_id)
fine_tuned_model_name = job.fine_tuned_model

print(f"Your job's status is: {job.status}")
assert job.status == "succeeded", "Your fine-tuning job has failed or is in progress."
assert fine_tuned_model_name is not None
print(f'Fine-tuned model name: "{fine_tuned_model_name}"')
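
If you would rather not wait for the email, you could poll the job status instead. The helper below is a minimal sketch under stated assumptions: `wait_for_status` and `fetch_status` are hypothetical names, the set of terminal statuses is an assumption about the OpenAI API, and the demo uses a fake status sequence so that it runs without network access.

```python
import time
from typing import Callable

# Assumed terminal statuses; only "succeeded" is confirmed by the cell above.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_status(
    fetch_status: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 2 * 60 * 60,
) -> str:
    """Call `fetch_status` repeatedly until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Demo with a fake status sequence instead of a real API call.
fake_statuses = iter(["running", "running", "succeeded"])
print(wait_for_status(lambda: next(fake_statuses), poll_seconds=0.0))

# With the real job from this notebook, the call might look like:
# wait_for_status(lambda: openai.FineTuningJob.retrieve(job_id).status)
```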

## Configure Your LLMs

Display the prompt template.

In [None]:
print(RAG_RELEVANCY_PROMPT_TEMPLATE_STR)

The dataframe columns used in this notebook are:

- **query:** the question asked by a user (renamed from `query_text`)
- **reference:** the text of the retrieved document (renamed from `document_text`)
- **relevant:** a ground-truth binary relevance label

Instantiate LLMs for the fine-tuned model and base models.

In [None]:
model_names = ["gpt-3.5-turbo", "gpt-4", fine_tuned_model_name]
llms = {
    model_name: OpenAiModel(
        model_name=model_name,
        temperature=0.0,
    )
    for model_name in model_names
}
llms

## Run Inference

Run relevance classifications against the test set.

In [None]:
relevance_classifications = {}
rails = ["relevant", "irrelevant"]
for model_name, model in llms.items():
    print(f"Model: {model_name}")
    relevance_classifications[model_name] = llm_eval_binary(
        dataframe=test_df,
        template=RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
        model=model,
        rails=rails,
    )

## Evaluate Classifications

Evaluate the predictions against human-labeled ground-truth relevance labels.

In [None]:
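# Ground-truth labels must come from the test set, not the fine-tuning set used earlier.
actuals = test_df["relevant"].map({True: "relevant", False: "irrelevant"})

fig, axes = plt.subplots(1, 3, figsize=(15, 5))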
fig.suptitle("Confusion Matrices")

for model_index, model_name in enumerate(model_names):
    predictions = relevance_classifications[model_name]

    print(model_name)
    print("=" * len(model_name))
    print(classification_report(actuals, predictions, labels=rails))
    print()

    ax = axes[model_index]
    ax.set_title(model_name)
    conf_mat = confusion_matrix(actuals, predictions, labels=rails)
    conf_mat_disp = ConfusionMatrixDisplay(conf_mat, display_labels=rails)
    conf_mat_disp.plot(ax=ax, cmap="Blues")

plt.tight_layout()
plt.show()
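
To see how the per-class numbers in a classification report relate to the confusion matrix, here is a minimal pure-Python sketch of precision, recall, and F1 for the "relevant" class. The function name `precision_recall_f1` and the sample labels are assumptions for illustration. In practice, `classification_report` from scikit-learn already computes these values.

```python
def precision_recall_f1(actuals, predictions, positive="relevant"):
    """Compute precision, recall, and F1 for one class from paired label lists."""
    pairs = list(zip(actuals, predictions))
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    false_pos = sum(a != positive and p == positive for a, p in pairs)
    false_neg = sum(a == positive and p != positive for a, p in pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels, not real model output.
sample_actuals = ["relevant", "relevant", "irrelevant", "irrelevant"]
sample_predictions = ["relevant", "irrelevant", "relevant", "irrelevant"]
print(precision_recall_f1(sample_actuals, sample_predictions))
```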