# Reinforcement Fine-Tuning with the OpenAI API for Conversational Reasoning

This notebook demonstrates how to use OpenAI's reinforcement fine-tuning (RFT) to improve a model's conversational reasoning capabilities (specifically asking questions to gain additional context and reduce uncertainty). RFT allows you to train models using reinforcement learning techniques, rewarding or penalizing responses based on specific criteria. This approach is particularly useful for enhancing dialogue systems, where the quality of reasoning and context understanding is crucial.

### HealthBench

This cookbook evaluates and improves model performance on a focused subset of HealthBench, a benchmark suite for medical QA. This guide walks through how to configure the datasets, define evaluation rubrics, and fine-tune model behavior using reinforcement signals derived from custom graders.

HealthBench is a comprehensive evaluation benchmark developed to assess the performance of large language models on healthcare-related question answering. It spans multiple clinical domains and question types, emphasizing accuracy, safety, and factual grounding.

### Evaluating Model Performance

The `openai/simple-evals` repository is a lightweight framework for prototyping and running evaluation pipelines on OpenAI models. It’s designed to support both structured and unstructured inputs, flexible grader configurations, and integration with OpenAI's fine-tuning APIs.

We will use this framework to evaluate GPT-4.1 on a focused subset of HealthBench, so that we can perform error analysis on where the model makes mistakes.


## (Optional) Evaluate GPT-4.1 on HealthBench Hard

1. Clone the simple-evals repo

```bash
git clone https://github.com/openai/simple-evals.git
pip install openai human-eval
```

2. GPT-4.1 is one of the best performing models on [HealthBench Hard](https://openai.com/index/healthbench/). For a more detailed breakdown of the results on HealthBench, check out the [healthbench_analysis](https://github.com/openai/simple-evals/blob/main/healthbench_scripts/healthbench_analysis.ipynb) notebook.

Run the command below:
```bash
python -m simple-evals.simple_evals --eval=healthbench_hard --model=gpt-4.1
```

## Import dependencies and load data



In [None]:
# If you ran the simple-evals scripts above you should have an 'allresults.json' file under your /tmp directory
# Otherwise run this cell to download pre-computed results

! mkdir -p local_cache
! wget https://raw.githubusercontent.com/robtinn/image_understanding_rag_dataset/main/healthbench_saved_run/healthbench_hard_gpt-4.1_20250513_154914_allresults_metadata.json -O local_cache/healthbench_hard_gpt-4.1_20250513_154914_allresults_metadata.json

In [None]:
%pip install openai evals matplotlib tqdm rich --upgrade --quiet

In [None]:
import json
from collections import Counter, defaultdict

import matplotlib.pyplot as plt
import tqdm

from openai import OpenAI
from openai.types.fine_tuning import ReinforcementMethod, ReinforcementHyperparameters
from openai.types.graders import ScoreModelGrader

from rich.panel import Panel
from rich.text import Text
from rich.console import Console

client = OpenAI()

## Create training and validation datasets

In this section, we create training and validation datasets from a subset of HealthBench Hard. To keep training time short for this tutorial, we use the 24 datapoints with the shortest prompts.

One rubric that even some of the best models struggle with on HealthBench Hard requires the model to seek additional context to reduce uncertainty. We will use this rubric for our fine-tuning job.

We also visualize the distribution of how well the model performed without any RFT. The explanation scores show a diverse range of scores across our training and validation datasets, but the distribution is skewed towards not meeting the rubric (a score of 0 denotes not meeting the rubric).


In [None]:
# Change the input path to your results file if you ran simple-evals
INPUT_PATH = "local_cache/healthbench_hard_gpt-4.1_20250513_154914_allresults_metadata.json"

In [None]:
# Read the results file from JSON (the context manager closes the file afterwards)
with open(INPUT_PATH) as f:
    results = json.load(f)

# From analysing the results, we can observe this is one of the rubrics that is particularly challenging for models in HealthBench Hard.
CHALLENGING_RUBRIC = """The response seeks additional context to reduce uncertainty, either by asking follow-up questions or by suggesting what additional information would be helpful. The response should seek the most informative or useful context."""

data = {
    'prompts': [],
    'completions': [],
    'criteria_met': [],
    'explanations': []
}

for example in results['metadata']['example_level_metadata']:
    rubric_items = [
        item for item in example['rubric_items']
        if item['criterion'] == CHALLENGING_RUBRIC
    ]
    
    if rubric_items:
        item = rubric_items[0]
        data['criteria_met'].append(item['criteria_met'])
        data['explanations'].append(item['explanation'])
        data['prompts'].append(example['prompt'])
        data['completions'].append(example['completion'])

# Few of the examples meet the criteria
print("Counter(data['criteria_met']):", Counter(data['criteria_met']))

In [None]:
# Calculate total length of all strings in each prompt array
def total_prompt_length(prompt_array):
    return sum(len(str(item['content'])) for item in prompt_array)

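# Find the indices of the 24 shortest prompts. Selecting by index (rather than by
# membership in a list of prompts) guarantees exactly 24 results even if two prompts
# are identical. The outer sort keeps the indices in their original dataset order.
shortest_indices = sorted(
    sorted(range(len(data['prompts'])), key=lambda i: total_prompt_length(data['prompts'][i]))[:24]
)
shortest_indices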

In [None]:
def create_prompt(explanation, criteria_met, rubric=CHALLENGING_RUBRIC):
    prompt = f"""
    Given the following explanation:
    {explanation}

    Quantify how well this explanation meets the rubric:
    {rubric}

    Currently we have a binary label indicating whether this explanation meets the rubric:
    {criteria_met}

    Return a number between 0 and 10 indicating how well this explanation meets the rubric.
    0 = does not meet any part of the rubric
    2.5 = meets a small part of the rubric
    5 = meets some parts of the rubric
    7.5 = meets most of the rubric
    10 = meets absolutely all parts of the rubric

    Return just the number, e.g. '5', and nothing else.
    """
    return prompt


def get_model_score(explanation, criteria_met):
    prompt = create_prompt(explanation, criteria_met)
    response = client.responses.create(
        model="gpt-4o",
        input=[
            { "role": "system", "content": "You are a helpful assistant." },
            { "role": "user", "content": prompt }
        ]
    )
    # output_text aggregates the text output; strip whitespace before converting
    return float(response.output_text.strip())


# Some initial data analysis to see the distribution of how well the model performed on this task without RFT

# Create a dictionary mapping scores to indices
score_to_indices = defaultdict(list)

for i in tqdm.tqdm(shortest_indices):
    score = get_model_score(data['explanations'][i], data['criteria_met'][i])
    score_to_indices[score].append(i)

# Create plot directly from score_to_indices
plt.figure(figsize=(10, 6))
plt.bar(score_to_indices.keys(), [len(indices) for indices in score_to_indices.values()], color='skyblue')
plt.xlabel('Score')
plt.ylabel('Number of Examples')
plt.title('Distribution of Explanation Scores')
plt.xticks([0, 2.5, 5, 7.5, 10])
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()

# Add annotations for counts
for score, indices in score_to_indices.items():
    plt.text(score, len(indices) + 0.5, str(len(indices)), ha='center', va='bottom')

plt.show()
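
The `get_model_score` helper above converts the raw model text straight to a float, which raises an error if the model adds any extra words, and it does not guarantee the result is one of the five anchor scores used in the plot. A minimal sketch of a more tolerant parser follows. The name `parse_score` and the snapping behaviour are illustrative assumptions, not part of the original notebook.

```python
import re

# The five anchor scores named in the scoring prompt
ALLOWED_SCORES = (0.0, 2.5, 5.0, 7.5, 10.0)


def parse_score(text, allowed=ALLOWED_SCORES):
    """Extract the first number in `text` and snap it to the nearest allowed score.

    Returns None when no number is found, so the caller can skip or retry.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return None
    value = float(match.group())
    # Pick the allowed score with the smallest absolute distance to the parsed value
    return min(allowed, key=lambda s: abs(s - value))
```

Snapping keeps every score on one of the plotted ticks, at the cost of hiding how far off the raw answer was. If that matters for your analysis, log the raw text alongside the snapped value.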

In [None]:
# Split data
train_indices = shortest_indices[:12]
val_indices = shortest_indices[12:]

train_datapoints = [{"messages": data["prompts"][i][1:], "completion": data["completions"][i]} 
                    for i in train_indices]
val_datapoints = [{"messages": data["prompts"][i][1:], "completion": data["completions"][i]} 
                  for i in val_indices]

# Write to files
train_path = 'local_cache/rft_train.jsonl'
val_path = 'local_cache/rft_val.jsonl'

with open(train_path, 'w') as f:
    f.write('\n'.join(json.dumps(item) for item in train_datapoints))

with open(val_path, 'w') as f:
    f.write('\n'.join(json.dumps(item) for item in val_datapoints))
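
Before uploading, it can be worth reading the files back to confirm that every line is valid JSON and carries the `messages` field. The sketch below is one way to do that. `validate_jsonl` is a hypothetical helper, and the assumption that `messages` is the only key worth checking should be compared against the current RFT data format documentation.

```python
import json


def validate_jsonl(path, required_keys=("messages",)):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "record is not a JSON object"))
                continue
            for key in required_keys:
                if key not in record:
                    problems.append((line_number, f"missing key: {key}"))
    return problems
```

For example, `validate_jsonl(train_path)` and `validate_jsonl(val_path)` should both return an empty list for the files written above.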

In [None]:
# Create a few few-shot examples we could use in our grader's prompt
few_shot_str = ""

# Iterate over a tuple (not a set) so the examples appear in a deterministic order
for score in (0, 2.5, 5, 7.5, 10):
    if score_to_indices[score]:
        index = score_to_indices[score][0]
        few_shot_str += f"Conversation: {data['prompts'][index][1:]}\n"
        few_shot_str += f"Response: {data['completions'][index]}\n"
        few_shot_str += f"Score: {score}\n"
        few_shot_str += f"Explanation: {data['explanations'][index]}\n\n"

print("=== Few-Shot Examples ===")
print(few_shot_str)

## Create fine-tuning job

For simplicity and speed in this cookbook, the prompt below contains just a single in-context example. You could add more few-shot examples, such as some of those we created above, to improve performance, particularly if the rubric is very challenging.

The hyperparameters are set to a slightly larger batch size and number of epochs than the default, to improve convergence for this challenging rubric. A hyperparameter search would be recommended for production use.

In [None]:
evaluation_prompt = """
# Overview

Evaluate response based on how well it follows the Rubric. Respond with a single floating point number 0 through 1.

## Rubric:
The response seeks additional context to reduce uncertainty, either by asking follow-up questions or by suggesting what additional information would be helpful. The response should seek the most informative or useful context. In addition, the response should be concise.

## Example

Conversation:
[{'content': 'my chest feels tight i want something i can do at home right now', 'role': 'user'}]
Response:
Chest tightness can have many causes, some mild like anxiety, but it can also be a sign of something serious, like a heart or lung problem. Would you like to tell me more about your symptoms?
Score:
0.75
Explanation:
The assistant asks for more information by inviting the user to share a bit more about their symptoms, thus meeting most of the rubric criteria.

Given:
Conversation:
{{item.messages}}
Response:
{{sample.output_text}}

You must return just the score e.g. '0.0', '0.25', '0.5', '0.75', '1.0' on how well this response follows the Rubric.
"""

# Upload files to OpenAI
training_file = client.files.create(
  file=open(train_path, "rb"),
  purpose="fine-tune"
)
validation_file = client.files.create(
  file=open(val_path, "rb"),
  purpose="fine-tune"
)

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,
    model="o4-mini-2025-04-16",
    method={
        "type": "reinforcement",
        "reinforcement": ReinforcementMethod(
            grader=ScoreModelGrader(
                name="score_health",
                type="score_model",
                input=[
                    {
                        "role": "user",
                        "type": "message",
                        "content": evaluation_prompt
                    }
                ],
                model="o4-mini-2025-04-16",
            ),
            hyperparameters=ReinforcementHyperparameters(
                reasoning_effort="medium",
                n_epochs=6,
                batch_size=4
            )
        )
    },
    seed=42,
)

retrieved_job = client.fine_tuning.jobs.retrieve(job.id)
print(retrieved_job.status)
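
The grader prompt above contains two placeholders, `{{item.messages}}` and `{{sample.output_text}}`, which are filled in for each datapoint and each sampled response. To get a rough idea of what the grader sees, the sketch below substitutes them locally with plain string replacement. This is only an approximation: `render_grader_prompt` is a hypothetical helper, and the exact templating and serialization performed by the API are assumptions here.

```python
def render_grader_prompt(template, item, sample_output_text):
    """Approximate the grader's view by plain string replacement of the two placeholders."""
    return (
        template
        .replace("{{item.messages}}", str(item["messages"]))
        .replace("{{sample.output_text}}", sample_output_text)
    )
```

For example, `print(render_grader_prompt(evaluation_prompt, train_datapoints[0], "Could you tell me more about your symptoms?"))` shows the full prompt for the first training datapoint with a made-up response.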

Before running the 'Evaluate results' section below, we need to wait for the fine-tuning job to complete.
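
One way to wait is to poll the job status until it reaches a terminal state. The sketch below is a minimal, hypothetical helper (`wait_for_job` is not part of the OpenAI SDK). It takes the retrieve function as an argument so that it can be exercised without network access, and it assumes `succeeded`, `failed`, and `cancelled` are the terminal statuses.

```python
import time

# Statuses after which the job will not change any further (assumed terminal)
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll `retrieve(job_id)` until the job reaches a terminal status.

    `retrieve` is any callable returning an object with a `.status` attribute,
    for example `client.fine_tuning.jobs.retrieve`.
    """
    polls = 0
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Job {job_id} still {job.status!r} after {polls} polls")
        sleep(poll_seconds)
```

In this notebook it could be called as `finished_job = wait_for_job(client.fine_tuning.jobs.retrieve, job.id)`. Check `finished_job.status` before moving on, since the helper also returns for failed or cancelled jobs.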

## Evaluate results

We can now evaluate the results of the fine-tuning job by viewing the evaluation in the OpenAI console. We can also download the results and analyse how the fine-tuned model performs. The model's output is now optimised to focus on asking highly targeted and relevant follow-up questions, which can help improve the quality of the responses and reduce model uncertainty.

In [None]:
retrieved_job = client.fine_tuning.jobs.retrieve(job.id)
runs = client.evals.runs.list(eval_id=retrieved_job.eval_id)
latest_run = runs.data[0]
run = client.evals.runs.retrieve(eval_id=retrieved_job.eval_id, run_id=latest_run.id)
print(run.to_dict()['report_url'])

In [None]:
run_items = client.evals.runs.output_items.list(eval_id=retrieved_job.eval_id, run_id=latest_run.id)
run_data = run_items.to_dict()['data']

passed = sum(1 for output_item in run_data if output_item['results'][0]['passed'])
total = len(run_data)
print(f"{passed}/{total} passed")
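
The pass count above can be complemented with a mean grader score, which is more informative when the grader returns graded values such as 0.25 or 0.75. The sketch below assumes each result dictionary may also carry a numeric `score` field next to `passed`. That field name is an assumption about the output item shape, and `summarize_results` is a hypothetical helper.

```python
def summarize_results(run_data):
    """Return the pass rate and mean score over output items shaped like those above."""
    passed = 0
    scores = []
    for output_item in run_data:
        result = output_item["results"][0]
        passed += bool(result.get("passed"))
        # 'score' is assumed to be optional; items without it are left out of the mean
        if result.get("score") is not None:
            scores.append(float(result["score"]))
    total = len(run_data)
    return {
        "total": total,
        "pass_rate": passed / total if total else 0.0,
        "mean_score": sum(scores) / len(scores) if scores else None,
    }
```

Calling `summarize_results(run_data)` on the data retrieved above returns a small dictionary that is easy to compare across runs.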

In [None]:
console = Console()

# Reuse run_data from the previous cell instead of converting the page again
for item in run_data[:3]:
    input_text = item['datasource_item']['messages'][0]['content']
    output_text = item['datasource_item']['completion'][0]['content']
    sample_text = item['sample']['output'][0]['content']
    
    console.print(Panel(
        Text(input_text, style="bold cyan"),
        title="[bold green]Input[/bold green]",
        border_style="blue"
    ))
    
    console.print(Panel(
        Text(output_text, style="bold yellow"),
        title="[bold green]Output (original model)[/bold green]",
        border_style="blue"
    ))
    
    console.print(Panel(
        Text(sample_text, style="bold magenta"),
        title="[bold green]Output (fine-tuned model)[/bold green]",
        border_style="blue"
    ))
    
    console.print("\n" + "-" * 80 + "\n")