# Fine-Tune a GPT-4o-Mini Model to Answer Questions about Hurricanes

## Data Curation and Preparation

To demonstrate the fine-tuning process, we'll work with a real-world dataset: recent revisions to hurricane articles on Wikipedia. The steps are as follows:

1. Collect Data: Gather the latest revisions from selected Wikipedia pages.
2. Generate Q&A Pairs: Turn the raw hurricane data into a useful set of question-and-answer pairs.
3. Create Fine-Tuning Dataset: Format the dataset to fit OpenAI's requirements for fine-tuning.


### Pull Wikipedia Revisions for Hurricane Topics (after a certain date)

In [6]:
# List of relevant topics
topics = [
    "List_of_United_States_hurricanes",
    "2024_Atlantic_hurricane_season",
    "Hurricane_Milton",
    "Hurricane_Beryl",
    "Hurricane_Francine",
    "Hurricane_Helene",
    "Hurricane_Isaac"
]

In [24]:
import requests
import difflib
import re

def get_wikipedia_revisions(article_title, start_date):
    """Fetches Wikipedia revisions after a certain date for a given article and collects the data into a list of dictionaries."""
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": article_title,
        "rvstart": start_date,
        "rvdir": "newer",  # Fetch revisions newer than start_date
        "rvlimit": "500",
        "rvprop": "timestamp|user|comment|content",
        "format": "json"
    }
    
    dataset = []  # List to hold the dictionaries
    continue_token = True  # To handle pagination
    
    while continue_token:
        response = requests.get(url, params=params)
        data = response.json()

        if 'query' in data and 'pages' in data['query']:
            pages = data['query']['pages']
            for page_id in pages:
                revisions = pages[page_id].get('revisions', [])
                if revisions:
                    previous_content = ""  # Initialize variable to hold previous revision content
                    for rev in revisions:
                        current_content = rev.get('*', '')  # Get the full revision content
                        timestamp = rev.get('timestamp', 'No timestamp')
                        user = rev.get('user', 'Anonymous')
                        comment = rev.get('comment', 'No comment')

                        # If there is a previous revision, calculate the difference (new content added)
                        if previous_content:
                            diff = list(difflib.unified_diff(previous_content.splitlines(), current_content.splitlines()))
                            new_content = "\n".join(line[1:] for line in diff if line.startswith('+') and not line.startswith('+++'))

                            # Extract citation links (URLs and <ref> tags)
                            citations = re.findall(r'<ref.*?>.*?</ref>|https?://\S+', new_content)

                            # Create a dictionary to hold the relevant information
                            revision_data = {
                                "article_title": article_title,
                                "timestamp": timestamp,
                                "user": user,
                                "comment": comment,
                                "new_content": new_content,
                                # "citations": citations
                            }

                            # Append the dictionary to the dataset list
                            dataset.append(revision_data)

                        # Update previous content
                        previous_content = current_content
                else:
                    print(f"No revisions found for {article_title} after {start_date}.")

        else:
            print(f"Error fetching data for {article_title}.")

        # Check if pagination is needed (continue parameter)
        if 'continue' in data:
            continue_token = data['continue'].get('rvcontinue', False)
            params['rvcontinue'] = continue_token
        else:
            continue_token = False  # End loop if no more pages
    
    return dataset  # Return the dataset


def fetch_revisions_for_topics(topics, start_date):
    """Fetches revisions for all topics after a certain date and returns a combined dataset."""
    full_dataset = []  # List to hold data for all topics
    for topic in topics:
        try:
            print(f"Fetching revisions for {topic} starting from {start_date}...")
            topic_data = get_wikipedia_revisions(topic, start_date)
            full_dataset.extend(topic_data)  # Append the data for each topic to the full dataset
        except Exception as e:
            print(f"Error fetching revisions for {topic}: {str(e)}")
    
    return full_dataset  # Return the full dataset


# Specify the start date (ISO 8601 format)
start_date = "2023-09-01T00:00:00Z"

# Fetch the latest revisions for all topics and store them in a dataset
dataset = fetch_revisions_for_topics(topics, start_date)

Fetching revisions for List_of_United_States_hurricanes starting from 2023-09-01T00:00:00Z...
Fetching revisions for 2024_Atlantic_hurricane_season starting from 2023-09-01T00:00:00Z...
Fetching revisions for Hurricane_Milton starting from 2023-09-01T00:00:00Z...
Fetching revisions for Hurricane_Beryl starting from 2023-09-01T00:00:00Z...
Fetching revisions for Hurricane_Francine starting from 2023-09-01T00:00:00Z...
Fetching revisions for Hurricane_Helene starting from 2023-09-01T00:00:00Z...
Fetching revisions for Hurricane_Isaac starting from 2023-09-01T00:00:00Z...
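
The diff step inside `get_wikipedia_revisions` is easier to follow on toy input. The following is a minimal sketch, not part of the original notebook: the helper name `added_lines` and the sample text are made up for illustration. It keeps only the lines that `difflib.unified_diff` marks as additions.

```python
import difflib

def added_lines(previous_text, current_text):
    """Return the lines that a unified diff marks as additions."""
    diff = difflib.unified_diff(
        previous_text.splitlines(),
        current_text.splitlines(),
        lineterm="",
    )
    # Keep '+' lines, but skip the '+++' file header emitted by unified_diff
    return [
        line[1:]
        for line in diff
        if line.startswith("+") and not line.startswith("+++")
    ]

# Toy revisions (illustrative text only)
old = "Hurricane A formed in June.\nIt weakened quickly."
new = "Hurricane A formed in June.\nIt made landfall in July.\nIt weakened quickly."
print(added_lines(old, new))  # ['It made landfall in July.']
```

One caveat of this filter: an added line whose own text starts with `++` would be mistaken for the header and dropped.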


In [79]:
print("Number of examples in dataset: ", len(dataset))

Number of examples in dataset:  10362


In [81]:
# Example: Print the first 10 examples
for data in dataset[:10]:
    print(data)

{'messages': [{'role': 'system', 'content': 'You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers.'}, {'role': 'user', 'content': 'What is the only Pacific tropical cyclone known to have produced hurricane-force winds in California?'}, {'role': 'assistant', 'content': 'The 1858 San Diego hurricane is the only Pacific tropical cyclone known to have produced hurricane-force winds in California.'}]}
{'messages': [{'role': 'system', 'content': 'You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers.'}, {'role': 'user', 'content': 'When did the 1858 San Diego hurricane affect San Diego?'}, {'role': 'assistant', 'content': 'The 1858 San Diego hurricane affected San Diego on October 2, 1858.'}]}
{'messages': [{'role': 'system', 'content': 'You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers.'}, {'role': 'user', 'content

## Generate Q&A Pairs

Now that we have our dataset of revisions, we'll use OpenAI's API to analyze the newly added content in each revision and generate a set of question-and-answer pairs.


In [40]:
from openai import OpenAI
from pydantic import BaseModel
from typing import List
import json

# Define the Pydantic model for the output format
class QAItem(BaseModel):
    prompt: str
    completion: str

class QADataset(BaseModel):
    dataset: List[QAItem]


def generate_qa_pairs_from_changes(new_content, article_title):
    """
    Query OpenAI to analyze the new content and generate a set of question-answer pairs.
    If substantial information changes are detected (such as new sections, significant updates, or meaningful additions of facts),
    the function returns a list of question-answer pairs in the specified JSON format.
    """

    client = OpenAI()

    # Create a query prompt to ask OpenAI to generate question-answer pairs based on the content
    prompt = f"""
    The following is newly added content to the Wikipedia article titled '{article_title}'.
    Analyze the content and generate a set of specific question-answer pairs based on the new facts, updates, or meaningful changes.
    Focus on creating general questions that a person might ask and answer them comprehensively with the content provided.
    Do not ask questions that directly reference the date of the revision or the specific article title. 
    If a hurricane is mentioned, it should be referred to by its full name.
    Ignore trivial changes such as typos or formatting.

    Example questions: 
    - List the hurricanes that hit the US in 2024.
    - What was the most recent hurricane to hit the US?
    - What was the name of the hurricane that hit Florida in 2024?
    - What was the category of hurricane Beryl?
    - What was the path of hurricane Milton?

    New Content:
    \"\"\"{new_content[:3000]}\"\"\"

    Please return a set of question-answer pairs in the form of a JSON array where each item is an object containing 
    'prompt' as the question and 'completion' as the direct answer from the content.
    """

    try:
        response = client.beta.chat.completions.parse(
            model="gpt-4o-2024-08-06",  # or "gpt-4o" if available
            messages=[
                {"role": "system", "content": "You are a helpful assistant that generates a set of specific question-answer pairs based on the new facts, updates, or meaningful changes to Wikipedia articles."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=1000,
            temperature=0.7,
            response_format=QADataset
        )

        return response.choices[0].message.content

    except Exception as e:
        return f"Error in generating QA pairs: {str(e)}"

In [None]:
# Test the Q&A pair generation
a = generate_qa_pairs_from_changes(dataset[256]['new_content'], dataset[256]['article_title'])
print(a)
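
`generate_qa_pairs_from_changes` returns a JSON string on success and a plain-text error string on failure, so the result is worth checking before use. The sketch below shows one possible check using only the standard library. The helper name `parse_qa_response` and the sample strings are hypothetical and not part of the original notebook.

```python
import json

def parse_qa_response(raw):
    """Return a list of {'prompt', 'completion'} dicts, or [] if raw is unusable."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Covers the plain-text error string and a None content value
        return []
    if not isinstance(parsed, dict):
        return []
    items = parsed.get("dataset", [])
    if not isinstance(items, list):
        return []
    # Keep only well-formed pairs with string prompt and completion
    return [
        item for item in items
        if isinstance(item, dict)
        and isinstance(item.get("prompt"), str)
        and isinstance(item.get("completion"), str)
    ]

sample = '{"dataset": [{"prompt": "Q1?", "completion": "A1."}, {"prompt": "Q2?"}]}'
print(parse_qa_response(sample))  # [{'prompt': 'Q1?', 'completion': 'A1.'}]
print(parse_qa_response("Error in generating QA pairs: timeout"))  # []
```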

We'll now run this function over our entire dataset and collect the results. Depending on the length of the dataset, this may take a while. For a very large dataset, it can be more cost-effective to run this in batch mode. For more information, see the [OpenAI Batch API](https://platform.openai.com/docs/guides/batch/overview).
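
For the batch route, each request goes on its own line of a JSONL input file. The following is a rough sketch of building those lines, under stated assumptions: the helper name `build_batch_lines`, the `custom_id` scheme, and the toy revisions are hypothetical, and the message body is a placeholder for the full prompt built in `generate_qa_pairs_from_changes`. The `min_chars` default mirrors the 1000-character filter used in this notebook.

```python
import json

def build_batch_lines(revisions, model="gpt-4o-2024-08-06", min_chars=1000):
    """Build one Batch API request line per sufficiently long revision."""
    lines = []
    for index, revision in enumerate(revisions):
        if len(revision["new_content"]) <= min_chars:
            continue  # Skip short revisions, as in the main loop
        request = {
            "custom_id": f"revision-{index}",  # Hypothetical ID scheme
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    # Placeholder: substitute the full prompt used above
                    {"role": "user", "content": revision["new_content"][:3000]},
                ],
            },
        }
        lines.append(json.dumps(request))
    return lines

# Toy input (illustrative only)
toy_revisions = [
    {"article_title": "Example", "new_content": "x" * 1500},
    {"article_title": "Example", "new_content": "too short"},
]
batch_lines = build_batch_lines(toy_revisions)
print(len(batch_lines))  # 1
```

The resulting lines would be written to a `.jsonl` file, uploaded with `purpose="batch"`, and submitted with `client.batches.create`. See the linked Batch API guide for the exact workflow.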

Note: We only process revisions longer than 1000 characters, since shorter revisions are less likely to contain significant changes or new information. This filter also reduces the size of the resulting dataset.



In [None]:
from tqdm import tqdm

# Master list to collect all Q&A pairs
master_qa_list = []

for data in tqdm(dataset, desc="Processing dataset"):
    # Only process revisions that are longer than 1000 characters
    if len(data['new_content']) > 1000:
        # Call the function to generate QA pairs
        try:
            qa_response = generate_qa_pairs_from_changes(data['new_content'], data['article_title'])
            
            # Parse the response from a JSON string to a Python dictionary
            qa_dict = json.loads(qa_response)

            # Check if the response contains an error with insufficient quota (code 429)
            if 'error' in qa_dict and qa_dict['error'].get('code') == 'insufficient_quota':
                print(f"Error in generating QA pairs: {qa_dict['error']}")
                # Stop further execution or raise an exception
                raise Exception("Insufficient quota. Stopping execution.")

            # Ensure the parsed data contains a 'dataset' key with the list of Q&A pairs
            if 'dataset' in qa_dict and isinstance(qa_dict['dataset'], list):
                master_qa_list.extend(qa_dict['dataset'])

        except json.JSONDecodeError as e:
            print(f"Error decoding JSON for article '{data['article_title']}': {str(e)}")
        
        except Exception as e:
            print(f"Exception encountered: {str(e)}")
            break  # Stop processing further data if an error occurs

In [None]:
print("Number of examples in master_qa_list: ", len(master_qa_list))
# Example: Print the first 10 examples
for data in master_qa_list[:10]:
    print(data)

## Create the Fine-Tuning Dataset
We will now convert the list of QA pairs into the format required for OpenAI fine-tuning and write it to a file.

In [67]:
# System message for all entries
system_message = {"role": "system", "content": "You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers."}

# List to store the converted dataset
new_format_dataset = []

# Convert each prompt-completion pair to the new format
for entry in master_qa_list:
    new_entry = {
        "messages": [
            system_message,
            {"role": "user", "content": entry['prompt']},
            {"role": "assistant", "content": entry['completion']}
        ]
    }
    new_format_dataset.append(new_entry)


In [68]:
new_format_dataset[0]

{'messages': [{'role': 'system',
   'content': 'You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers.'},
  {'role': 'user',
   'content': 'What is the only Pacific tropical cyclone known to have produced hurricane-force winds in California?'},
  {'role': 'assistant',
   'content': 'The 1858 San Diego hurricane is the only Pacific tropical cyclone known to have produced hurricane-force winds in California.'}]}

In [69]:
def write_to_jsonl(qa_list, filename):
    with open(filename, 'w') as f:
        for qa_item in qa_list:
            # Convert each dictionary to JSON format and write to file
            f.write(json.dumps(qa_item) + '\n')

In [70]:
write_to_jsonl(new_format_dataset, 'qa_pairs_openai_wiki_hurricane_dataset_format.jsonl')


## Dataset Analysis and Cost Estimates

We will now analyze this file to check its format and estimate the training cost. The analysis functions are taken from [Data preparation and analysis for chat model fine-tuning](https://cookbook.openai.com/examples/chat_finetuning_data_prep).

In [71]:
import json
import tiktoken # for token counting
import numpy as np
from collections import defaultdict

In [72]:
data_path = "qa_pairs_openai_wiki_hurricane_dataset_format.jsonl"

# Load the dataset
with open(data_path, 'r', encoding='utf-8') as f:
    dataset = [json.loads(line) for line in f]

# Initial dataset stats
print("Num examples:", len(dataset))
print("First example:")
for message in dataset[0]["messages"]:
    print(message)

Num examples: 10362
First example:
{'role': 'system', 'content': 'You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers.'}
{'role': 'user', 'content': 'What is the only Pacific tropical cyclone known to have produced hurricane-force winds in California?'}
{'role': 'assistant', 'content': 'The 1858 San Diego hurricane is the only Pacific tropical cyclone known to have produced hurricane-force winds in California.'}


In [73]:
# Format error checks
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue
        
    messages = ex.get("messages", None)
    if not messages:
        format_errors["missing_messages_list"] += 1
        continue
        
    for message in messages:
        if "role" not in message or "content" not in message:
            format_errors["message_missing_key"] += 1
        
        if any(k not in ("role", "content", "name", "function_call", "weight") for k in message):
            format_errors["message_unrecognized_key"] += 1
        
        if message.get("role", None) not in ("system", "user", "assistant", "function"):
            format_errors["unrecognized_role"] += 1
            
        content = message.get("content", None)
        function_call = message.get("function_call", None)
        
        if (not content and not function_call) or not isinstance(content, str):
            format_errors["missing_content"] += 1
    
    if not any(message.get("role", None) == "assistant" for message in messages):
        format_errors["example_missing_assistant_message"] += 1

if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

No errors found


In [74]:
encoding = tiktoken.get_encoding("cl100k_base")

# not exact! cl100k_base is used here as an approximation; gpt-4o-mini itself tokenizes with o200k_base
# simplified from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    # Note: the values printed are the 10th and 90th percentiles, although the label says p5 / p95
    print(f"p5 / p95: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

In [75]:
# Warnings and tokens counts
n_missing_system = 0
n_missing_user = 0
n_messages = []
convo_lens = []
assistant_message_lens = []

for ex in dataset:
    messages = ex["messages"]
    if not any(message["role"] == "system" for message in messages):
        n_missing_system += 1
    if not any(message["role"] == "user" for message in messages):
        n_missing_user += 1
    n_messages.append(len(messages))
    convo_lens.append(num_tokens_from_messages(messages))
    assistant_message_lens.append(num_assistant_tokens_from_messages(messages))
    
print("Num examples missing system message:", n_missing_system)
print("Num examples missing user message:", n_missing_user)
print_distribution(n_messages, "num_messages_per_example")
print_distribution(convo_lens, "num_total_tokens_per_example")
print_distribution(assistant_message_lens, "num_assistant_tokens_per_example")
n_too_long = sum(l > 16385 for l in convo_lens)
print(f"\n{n_too_long} examples may be over the 16,385 token limit, they will be truncated during fine-tuning")

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 44, 191
mean / median: 71.05278903686548, 69.0
p5 / p95: 58.0, 86.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 3, 138
mean / median: 23.265296274850414, 21.0
p5 / p95: 13.0, 36.0

0 examples may be over the 16,385 token limit, they will be truncated during fine-tuning


In [76]:
# Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 16385

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in convo_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

Dataset has ~736249 tokens that will be charged for during training
By default, you'll train for 2 epochs on this dataset
By default, you'll be charged for ~1472498 tokens
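
The arithmetic behind these numbers can be reproduced in isolation. The sketch below mirrors the default-epoch heuristic from the cell above and multiplies the billed tokens by a price. The function name `estimate_training_cost` is hypothetical, and the price of 1.00 USD per million tokens is a placeholder, not a quoted rate. Check the current pricing page before relying on any figure.

```python
def estimate_training_cost(n_examples, billing_tokens, price_per_million_usd,
                           target_epochs=3, min_target=100, max_target=25000,
                           min_epochs=1, max_epochs=25):
    """Mirror the default-epoch heuristic above, then price the billed tokens."""
    n_epochs = target_epochs
    if n_examples * target_epochs < min_target:
        n_epochs = min(max_epochs, min_target // n_examples)
    elif n_examples * target_epochs > max_target:
        n_epochs = max(min_epochs, max_target // n_examples)
    billed_tokens = n_epochs * billing_tokens
    cost = billed_tokens / 1_000_000 * price_per_million_usd
    return n_epochs, billed_tokens, cost

# 1.00 USD per 1M tokens is a placeholder price, not a real quote
epochs, tokens, cost = estimate_training_cost(
    10362, 736249, price_per_million_usd=1.00
)
print(epochs, tokens, round(cost, 2))  # 2 1472498 1.47
```

With 10,362 examples, three epochs would exceed the 25,000-example target, so the heuristic falls back to `25000 // 10362 = 2` epochs.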


## Upload the Dataset to OpenAI
Once we are happy with the cost and details of our dataset, we can upload it to OpenAI.

In [77]:
client = OpenAI()

client.files.create(
  file=open("qa_pairs_openai_wiki_hurricane_dataset_format.jsonl", "rb"),
  purpose="fine-tune"
)

FileObject(id='file-qBVnzhGrEHvEPvZg6ZwvTfpK', bytes=4301148, created_at=1728916096, filename='qa_pairs_openai_wiki_hurricane_dataset_format.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)

## Create a Fine-Tuning Job

We can now create a fine-tuning job programmatically or via the [fine-tuning UI](https://platform.openai.com/finetune), using the file ID for the file we just uploaded.

In [78]:
client.fine_tuning.jobs.create(
  training_file="file-qBVnzhGrEHvEPvZg6ZwvTfpK", 
  model="gpt-4o-mini-2024-07-18",
  suffix="wiki-hurricane-2024"
)

FineTuningJob(id='ftjob-YSQajs0bsLVi0IQfSv6b5jbz', created_at=1728916171, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-4o-mini-2024-07-18', object='fine_tuning.job', organization_id='org-19ZNwom3AVpfo00n76H0UeLL', result_files=[], seed=433405727, status='validating_files', trained_tokens=None, training_file='file-qBVnzhGrEHvEPvZg6ZwvTfpK', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix='wiki-hurricane-2024')

We can check the status of our fine-tuning job using the job ID.

In [6]:
ft_job = client.fine_tuning.jobs.retrieve("ftjob-YSQajs0bsLVi0IQfSv6b5jbz")
print("ft_job.fine_tuned_model: ", ft_job.fine_tuned_model)

ft_job.fine_tuned_model:  ft:gpt-4o-mini-2024-07-18:personal:wiki-hurricane-2024:AIGr7s2N
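
A fine-tuning job takes a while to finish, and `fine_tuned_model` stays `None` until it succeeds. One way to wait for it is a simple polling loop, sketched below. The helper name `wait_for_job`, the polling interval, and the poll limit are assumptions for illustration. The only API call it makes is `client.fine_tuning.jobs.retrieve`, as used above.

```python
import time

# Statuses after which the job will not change any further
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(client, job_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll a fine-tuning job until it reaches a terminal status or polls run out."""
    job = client.fine_tuning.jobs.retrieve(job_id)
    for _ in range(max_polls):
        if job.status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
        job = client.fine_tuning.jobs.retrieve(job_id)
    return job

# Example usage (requires an API key and an existing job):
# ft_job = wait_for_job(client, "ftjob-YSQajs0bsLVi0IQfSv6b5jbz")
# print(ft_job.status, ft_job.fine_tuned_model)
```

The `sleep` parameter is injectable so the loop can be exercised without real delays.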


## Use Our Fine-Tuned Model

Once our fine-tuning job is complete, we can use the new model to generate answers to questions about hurricanes.

In [3]:
from openai import OpenAI
client = OpenAI()

completion = client.beta.chat.completions.parse(
  model="ft:gpt-4o-mini-2024-07-18:personal:wiki-hurricane-2024:AIGr7s2N",
  messages=[
    {"role": "system", "content": "You are a helpful assistant that answers questions about current events in hurricanes. Provide detailed answers."},
    {"role": "user", "content": "What's the most recent hurricane to hit the US in 2024?"}
  ]
)
print(completion.choices[0].message)

ParsedChatCompletionMessage[NoneType](content='The most recent hurricane to hit the US in 2024 is Hurricane Milton, which struck Florida on October 9.', refusal=None, role='assistant', function_call=None, tool_calls=[], parsed=None)
