## Supervised Fine-Tuning of the GPT-4o Model for Text Q&A - An Azure Python SDK Experience

Learn how to fine-tune the <code>gpt-4o-2024-08-06</code> model with the Azure OpenAI Python SDK (a low-code experience). This notebook is based on the MS Learn tutorial [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/fine-tune?tabs=python%2Cbash).

He Zhang, Jul. 2025

### Prerequisites

* Learn the [what, why, and when to use fine-tuning.](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/fine-tuning-considerations)
* An Azure subscription.
* Access to Azure OpenAI Service.
* An Azure OpenAI resource created in the supported fine-tuning region (e.g. Sweden Central).
* A deployment of <code>gpt-4o</code> base model, with its deployment name as "gpt-4o" for simplicity. 
* Prepare Training and Validation datasets:
  * at least 50 high-quality samples (preferably thousands) are recommended for a real fine-tuning job.
  * the data must be formatted as a JSON Lines (JSONL) document with UTF-8 encoding.
  * for demo purposes, this test notebook uses only 10 samples per dataset.
* Python <code>3.10</code> or later
* Python libraries: <code>requests, python-dotenv, matplotlib, azure-identity, pandas, numpy, tiktoken, openai</code>
* The OpenAI Python library version for this test notebook: <code>1.x</code> 
* [Jupyter Notebooks](https://jupyter.org/)
* An `azure.env` file to store your AOAI-related credentials as environment variables. **Be sure not to share this file with others or upload it to a public GitHub repository.**

### Step 1: Setup

#### Retrieve the Azure OpenAI API key and endpoint.

Go to your Azure OpenAI resource in the Azure portal. The Endpoint and Keys can be found in the **Resource Management: Keys and Endpoint** sub-section.

Alternatively, you can find the same keys and endpoint on the **Azure AI Foundry - Azure OpenAI** resource landing page.

<img src="../../images/screenshot-aoai-keys-and-endpoint.png" alt="Screenshot of the Azure OpenAI resource management pane." width="800"/>

#### Configure credentials

Copy the <code>Endpoint</code> and an access <code>KEY</code> (either <code>KEY 1</code> or <code>KEY 2</code>), and paste them into the corresponding variables in the <code>azure.env</code> file. 

Save the file and close it. 

**Do not** distribute this file, as it contains credential information! 

<img src="../../images/screenshot-azure-env-file.png" alt="Screenshot of the azure.env file that contains credential information - do not show it to others!" width="800"/>
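
Before loading the credentials, it can help to confirm that `azure.env` defines every variable this notebook reads. The following is a minimal sketch rather than part of the original tutorial: the helper name `missing_settings` is hypothetical, and the variable names are the ones read with `os.getenv` in the "Load Azure OpenAI credentials" cell.

```python
import os

# Variable names read via os.getenv() in this notebook
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "SUBSCRIPTION_ID",
    "RESOURCE_NAME",
    "RG_NAME",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty in `env`.

    `env` defaults to the process environment (hypothetical helper).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]
```

After `load_dotenv("azure.env")` has run, `missing_settings()` should return an empty list; any name it returns still needs a value in the file.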

#### Install required Python libraries (if not done yet)

In [None]:
# json, os and time are part of the standard library and must not be passed to pip
#%pip install -q openai matplotlib pandas numpy requests tiktoken python-dotenv azure-identity

#### Import required Python libraries 

In [None]:
import os
import json
import time
import requests
import tiktoken
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

from openai import AzureOpenAI
from dotenv import load_dotenv
from io import BytesIO, StringIO
from azure.identity import DefaultAzureCredential

#### Load Azure OpenAI credentials

In [None]:
# Load credential file
load_dotenv("azure.env")

# Assign Azure resources  
subscription_id = os.getenv("SUBSCRIPTION_ID") # Azure subscription ID
resource_name = os.getenv("RESOURCE_NAME") # name of the AOAI resource
rg_name = os.getenv("RG_NAME") # name of the resource group

client = AzureOpenAI(
  azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
  api_key = os.getenv("AZURE_OPENAI_API_KEY"),
  api_version = "2024-10-21"  # This API version or later is required to access seed/events/checkpoint features
)

In [None]:
# Test AOAI connection
completion = client.chat.completions.create(  
    model="gpt-4o",  
    messages=[{"role":"user", "content":"hello"}],  
    max_tokens=500,  
    temperature=0.7)

print(completion.choices[0].message.content)

#### Define helper functions

In [None]:
def read_jsonl(file_path, top_lines=5):
    """Reads and displays the first few lines from a .jsonl (JSON Lines) file."""
    with open(file_path, 'r', encoding='utf-8') as f:
        messages = [line for line in f]
        for mes in messages[:top_lines]:
            print(mes)

In [None]:
def show_ft_metrics(results_df, window_size=5):
    """Plot fine-tuning metrics including loss and accuracy for training and validation."""
    # Drop rows where valid_loss is NaN or valid_loss is -1.0
    filtered_df = results_df.dropna(subset=['valid_loss'])
    filtered_df = filtered_df.loc[filtered_df['valid_loss'] != -1.0]
    # Compute rolling means
    results_df_smooth = results_df.rolling(window=window_size).mean()
    filtered_df_smooth = filtered_df.rolling(window=window_size).mean()
    # Plot the curves
    plt.figure(figsize=(16, 12))
    
    plt.subplot(2, 2, 1)
    plt.plot(results_df_smooth['step'], results_df_smooth['train_loss'],  color='blue')
    plt.title('Train Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')
    
    plt.subplot(2, 2, 2)
    plt.plot(results_df_smooth['step'], results_df_smooth['train_mean_token_accuracy'], color='green')
    plt.title('Train Mean Token Accuracy')
    plt.xlabel('Step')
    plt.ylabel('Accuracy')
    
    plt.subplot(2, 2, 3)
    plt.plot(filtered_df_smooth['step'], filtered_df_smooth['valid_loss'], color='red')
    plt.title('Validation Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(2, 2, 4)
    plt.plot(filtered_df_smooth['step'], filtered_df_smooth['valid_mean_token_accuracy'], color='orange')
    plt.title('Validation Mean Token Accuracy')
    plt.xlabel('Step')
    plt.ylabel('Accuracy')

    plt.tight_layout()
    plt.show()

In [None]:
def date_sorted_df(details_dict):
    """Create a pandas DataFrame from a dictionary and sort it by a 'created' or 'created_at' timestamp column for displaying OpenAI API tables."""
    df = pd.DataFrame(details_dict)
    
    if 'created' in df.columns:
        df.rename(columns={'created': 'created_at'}, inplace=True)
    
    # Convert 'created_at' from Unix timestamp to human-readable date/time format
    df['created_at'] = pd.to_datetime(df['created_at'], unit='s').dt.strftime('%Y-%m-%d %H:%M:%S')

    if 'finished_at' in df.columns:
        # Convert 'finished_at' from Unix timestamp to human-readable date/time format, keeping null values as is
        df['finished_at'] = pd.to_datetime(df['finished_at'], unit='s', errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
    
    # Sort DataFrame by 'created_at' in descending order
    df = df.sort_values(by='created_at', ascending=False)

    return df

### Step 2: Prepare Training & Validation Datasets

#### The training and validation datasets have been made ready for you. 

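For illustration purposes, each dataset contains only 10 samples, and each sample consists of a single-turn Q&A pair. The records below are wrapped across several lines for readability; in a JSONL file, each record occupies a single line.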

<code>training_set_10samples.jsonl</code>

```json
{"messages": 
    [{"role": "system", "content": "Clippy is a factual chatbot that is also sarcastic."}, 
     {"role": "user", "content": "Who discovered Antarctica?"}, 
     {"role": "assistant", "content": "Some chaps named Fabian Gottlieb von Bellingshausen and Mikhail Lazarev, as if they don't teach that in every school!"}]}

{"messages": 
    [{"role": "system", "content": "Clippy is a factual chatbot that is also sarcastic."},
     {"role": "user", "content": "What is the biggest ocean?"}, 
     {"role": "assistant", "content": "The Pacific Ocean. It's not like it's a small pond or anything."}]}

{"messages": ...}
```

<code>validation_set_10samples.jsonl</code>

```json
{"messages": 
    [{"role": "system", "content": "Clippy is a factual chatbot that is also sarcastic."}, 
     {"role": "user", "content": "What's the capital of Australia?"}, 
     {"role": "assistant", "content": "It's Canberra, not Sydney. Shocking, I know!"}]}

{"messages": 
    [{"role": "system", "content": "Clippy is a factual chatbot that is also sarcastic."},
     {"role": "user", "content": "Who wrote 'The Great Gatsby'?"}, 
     {"role": "assistant", "content": "Some guy named F. Scott Fitzgerald. Ever heard of him?"}]}
     
{"messages": ...}
```

#### Do initial data checks

In [None]:
# Check some data samples 
training_file_path = "./training_set_10samples.jsonl"
validation_file_path = "./validation_set_10samples.jsonl" 

read_jsonl(training_file_path, top_lines=3)
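
Beyond printing a few lines, a quick structural check can catch malformed records before upload. The sketch below is an illustration under stated assumptions: the helper `validate_chat_lines` is hypothetical, and its rules (a non-empty `messages` list, known roles, string content, assistant reply last) describe the single-turn samples used in this notebook, not the service's full validation logic.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_lines(lines):
    """Check JSONL chat records; return a list of (line_number, problem) tuples.

    An empty list means none of the checks below found a problem.
    """
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((line_no, "missing or empty 'messages' list"))
            continue
        for msg in messages:
            if not isinstance(msg, dict) or msg.get("role") not in VALID_ROLES:
                problems.append((line_no, "message with unknown or missing role"))
            elif not isinstance(msg.get("content"), str):
                problems.append((line_no, "message without string content"))
        last = messages[-1]
        if isinstance(last, dict) and last.get("role") != "assistant":
            problems.append((line_no, "last message is not from the assistant"))
    return problems
```

A file object can be passed directly, for example `with open(training_file_path, encoding="utf-8") as f: print(validate_chat_lines(f))`.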

### Step 3: Upload Datasets for Fine-Tuning

In [None]:
# Upload the training dataset (the with-block closes the file handle afterwards)
with open(training_file_path, "rb") as f:
    training_response = client.files.create(file=f, purpose="fine-tune")
training_file_id = training_response.id

# Upload the validation dataset
with open(validation_file_path, "rb") as f:
    validation_response = client.files.create(file=f, purpose="fine-tune")
validation_file_id = validation_response.id

print("Training file ID:", training_file_id)
print("Validation file ID:", validation_file_id)
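
Uploaded files are processed by the service before a job can use them, so it may be worth waiting until each file reports that it is ready. The following is a sketch, not the tutorial's own method: `wait_for_file` is a hypothetical helper built on `client.files.retrieve`, and the status strings it checks (`"processed"` for success, `"error"` or `"failed"` for failure) are assumptions about what the service returns.

```python
import time


def wait_for_file(client, file_id, poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Poll client.files.retrieve() until the file is processed.

    Raises RuntimeError if processing fails and TimeoutError if the file
    is still not ready after max_polls checks.
    """
    for _ in range(max_polls):
        status = client.files.retrieve(file_id).status
        if status == "processed":
            return status
        if status in ("error", "failed"):  # assumed failure states
            raise RuntimeError(f"File {file_id} could not be processed (status: {status})")
        sleep(poll_seconds)
    raise TimeoutError(f"File {file_id} was not processed after {max_polls} checks")
```

For example, `wait_for_file(client, training_file_id)` and `wait_for_file(client, validation_file_id)` could be called before submitting the job in Step 4.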

### Step 4: Configure and Start Fine-Tuning Job

Now you can submit your fine-tuning training job. 

The fine-tuning job will take some time to start and complete.

You can use the job ID to monitor the status of the fine-tuning job. 

Here is some guidance if you want to adjust the hyperparameters of the fine-tuning process. You can keep them as `None` to use default values. 

| Hyperparameter                       | Description                                                                                                                                                                              |
|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Beta` | "auto" or number, is a new option that is only available for DPO fine-tuning. It's a floating point number between 0 and 2 that controls how strictly the new model will adhere to its previous behavior, versus aligning with the provided preferences. A high number will be more conservative (favoring previous behavior), and a lower number will be more aggressive (favor the newly provided preferences more often). |
| `Batch size` | The batch size to use for training. When set to default, batch_size is calculated as 0.2% of examples in training set and the max is 256. |
| `Learning rate multiplier` | The fine-tuning learning rate is the original learning rate used for pre-training multiplied by this multiplier. We recommend experimenting with values between 0.5 and 2. Empirically, we've found that larger learning rates often perform better with larger batch sizes. Must be between 0.0 and 5.0. |
| `Number of epochs` | Number of training epochs. An epoch refers to one full cycle through the data set. If set to default, number of epochs will be determined dynamically based on the input data. |
| `Seed` | The seed controls the reproducibility of the job. Passing in the same seed and job parameters should produce the same results, but may differ in rare cases. If a seed is not specified, one will be generated for you. |

In this example we also pass the `seed` parameter described in the table above, so that repeated runs with the same job parameters should produce the same results (apart from rare cases).
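
To override any of the hyperparameters in the table, one option is to collect only the values you want to change and leave the rest to the service defaults. The sketch below assumes the `hyperparameters` argument of `client.fine_tuning.jobs.create` in the 1.x OpenAI Python library, with the keys `n_epochs`, `batch_size` and `learning_rate_multiplier`; `build_job_kwargs` is a hypothetical helper, not part of the SDK.

```python
def build_job_kwargs(training_file, validation_file, model, suffix=None, seed=None,
                     n_epochs=None, batch_size=None, learning_rate_multiplier=None):
    """Build keyword arguments for client.fine_tuning.jobs.create().

    Values left as None are omitted so the service applies its defaults.
    """
    candidates = {
        "n_epochs": n_epochs,
        "batch_size": batch_size,
        "learning_rate_multiplier": learning_rate_multiplier,
    }
    hyperparameters = {key: value for key, value in candidates.items() if value is not None}

    kwargs = {
        "training_file": training_file,
        "validation_file": validation_file,
        "model": model,
    }
    if suffix is not None:
        kwargs["suffix"] = suffix
    if seed is not None:
        kwargs["seed"] = seed
    if hyperparameters:
        kwargs["hyperparameters"] = hyperparameters
    return kwargs
```

A job with three epochs could then be submitted as `client.fine_tuning.jobs.create(**build_job_kwargs(training_file_id, validation_file_id, "gpt-4o-2024-08-06", seed=105, n_epochs=3))`.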

In [None]:
# Submit the fine-tuning training job
project_name = "gpt4o-text-ft-10-samples-qa"

response = client.fine_tuning.jobs.create(
    suffix=project_name,
    training_file = training_file_id,
    validation_file = validation_file_id,
    model = "gpt-4o-2024-08-06", 
    seed = 105 # seed parameter that controls reproducibility of the fine-tuning job.
)

# monitor the status
job_id = response.id
print("Job ID:", response.id)
print("Status:", response.status)
print(response.model_dump_json(indent=2))

### Step 5: Track Fine-Tuning Job Status

#### Track the training job status

Note that the training will take around 50 to 90 mins for the provided datasets.

In [None]:
# Check the fine-tuning job status
client.fine_tuning.jobs.list(limit=1).to_dict()
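
Because training can take a while, a small polling loop saves re-running the status cell by hand. This is a sketch under assumptions: `wait_for_job` is a hypothetical helper built on `client.fine_tuning.jobs.retrieve`, and the set of terminal status strings is an assumption that includes both spellings of "cancelled".

```python
import time

# Assumed terminal states of a fine-tuning job
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled", "canceled"}


def wait_for_job(client, job_id, poll_seconds=60, max_polls=180, sleep=time.sleep):
    """Poll the fine-tuning job until it reaches a terminal state and return it.

    Raises TimeoutError if the job is still running after max_polls checks.
    """
    for _ in range(max_polls):
        job = client.fine_tuning.jobs.retrieve(job_id)
        print("Status:", job.status)
        if job.status in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} checks")
```

With the defaults this checks once a minute for up to three hours, for example `ft_job = wait_for_job(client, job_id)`.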

#### List fine-tuning events

API version: 2024-05-01-preview or later is required for this command.

Although not necessary to complete fine-tuning, it can be helpful to examine the individual fine-tuning events generated during training.

In [None]:
# List 5 recent fine-tuning jobs
ft_jobs = client.fine_tuning.jobs.list(limit=5).to_dict()
display(date_sorted_df(pd.DataFrame(ft_jobs['data'])))
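
The cell above lists recent jobs; the events of a single job can be fetched with `client.fine_tuning.jobs.list_events`. The sketch below only formats the returned events for printing. `format_events` is a hypothetical helper, and it assumes each event exposes `created_at` (Unix seconds), `level` and `message` attributes, as the event objects in the 1.x OpenAI Python library do.

```python
from datetime import datetime, timezone


def format_events(events):
    """Return one 'timestamp [level] message' line per event, oldest first (UTC)."""
    lines = []
    for event in sorted(events, key=lambda e: e.created_at):
        stamp = datetime.fromtimestamp(event.created_at, tz=timezone.utc)
        lines.append(f"{stamp.strftime('%Y-%m-%d %H:%M:%S')} [{event.level}] {event.message}")
    return lines
```

A possible usage is `events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10)` followed by `print("\n".join(format_events(events.data)))`.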

In [None]:
# Retrieve the name of your newly fine-tuned model
ft_job = client.fine_tuning.jobs.retrieve("ftjob-8071d8e2e7294603b26585c19a9a0757") # replace this job ID with the actual one from your list above
fine_tuned_model = ft_job.to_dict()['fine_tuned_model'] # None until the job has succeeded
fine_tuned_model

#### Retrieve fine-tuning metrics

In [None]:
# Retrieve fine-tuning metrics from result file
result_file_id = ft_job.to_dict()['result_files'][0]
results_content = client.files.content(result_file_id).content.decode()

data_io = StringIO(results_content)
results_df = pd.read_csv(data_io)
display(results_df.head())

In [None]:
# Plot train and validation metrics
show_ft_metrics(results_df)
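
Alongside the plots, a short numeric summary can make it easier to compare runs. The sketch below is an illustration rather than an official metric report: `summarize_results` is a hypothetical helper that reads the same CSV text (`results_content`) using the column names plotted above, and it skips blank cells and the `-1.0` placeholder for `valid_loss`, mirroring the filtering in `show_ft_metrics`.

```python
import csv
from io import StringIO


def _to_float(value):
    """Parse a CSV cell; return None for blank, missing, or NaN cells."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return None if number != number else number  # NaN is not equal to itself


def summarize_results(csv_text):
    """Return the final training loss and the step with the lowest validation loss."""
    final_train_loss = None
    best = None  # (valid_loss, step) of the lowest validation loss seen so far
    for row in csv.DictReader(StringIO(csv_text)):
        step = int(float(row["step"]))
        train_loss = _to_float(row.get("train_loss"))
        if train_loss is not None:
            final_train_loss = train_loss
        valid_loss = _to_float(row.get("valid_loss"))
        # -1.0 marks steps without a validation measurement
        if valid_loss is not None and valid_loss != -1.0:
            if best is None or valid_loss < best[0]:
                best = (valid_loss, step)
    return {
        "final_train_loss": final_train_loss,
        "best_valid_loss": best[0] if best else None,
        "best_valid_step": best[1] if best else None,
    }
```

Calling `summarize_results(results_content)` returns a small dictionary; a lowest validation loss that occurs well before the final step can be a hint of overfitting, though with only 10 samples the numbers are noisy.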

### Step 6: Deploy The Fine-Tuned Model

__Note__: Only one deployment is permitted for a customized model. An error occurs if you select an already-deployed customized model.  

The code below shows how to deploy the model using the Control Plane API. Take a look at the [Azure OpenAI fine-tuning documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=turbo&pivots=programming-language-python#deploy-fine-tuned-model) for more details.

The deployment process may take 10 to 20 mins.

In [None]:
# Deploy the fine-tuned model through the Azure control plane (management) REST API
aoai_deployment_name = project_name # AOAI deployment name. Use as model parameter for inferencing

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

deploy_params = {'api-version': "2023-05-01"} 
deploy_headers = {'Authorization': 'Bearer {}'.format(token), 'Content-Type': 'application/json'}

deploy_data = {
    "sku": {"name": "standard", "capacity": 50}, 
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": fine_tuned_model, # retrieved in Step 5; for this model it should look like gpt-4o-2024-08-06.ft-...
            "version": "1"
        }
    }
}
deploy_data = json.dumps(deploy_data)

request_url = f'https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{rg_name}/providers/Microsoft.CognitiveServices/accounts/{resource_name}/deployments/{aoai_deployment_name}'

print('Creating a new deployment...')

r = requests.put(request_url, params=deploy_params, headers=deploy_headers, data=deploy_data)

print(r)
print(r.reason)
print(r.json())
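
The `PUT` request returns before the deployment is ready, so one way to wait is to poll the same URL with `GET` until provisioning finishes. This is a sketch under assumptions: `wait_for_deployment` is a hypothetical helper, `http` is anything with a `requests`-style `get` method (such as the `requests` module itself), and the response is assumed to carry `properties.provisioningState` with values such as `"Succeeded"`, `"Failed"` or `"Canceled"`.

```python
import time


def wait_for_deployment(http, url, params, headers,
                        poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll the deployment resource until its provisioning state is final.

    Returns the last response body on success; raises RuntimeError on failure
    and TimeoutError if the deployment is not ready after max_polls checks.
    """
    for _ in range(max_polls):
        body = http.get(url, params=params, headers=headers).json()
        state = body.get("properties", {}).get("provisioningState")
        print("Provisioning state:", state)
        if state == "Succeeded":
            return body
        if state in ("Failed", "Canceled"):  # assumed failure states
            raise RuntimeError(f"Deployment ended in state {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"Deployment not ready after {max_polls} checks")
```

With the variables from the cell above, this could be called as `wait_for_deployment(requests, request_url, deploy_params, deploy_headers)`.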

### Step 7: Test the Deployed Fine-Tuned Model

After your fine-tuned model is deployed, you can use it like any other deployed model in either the [Chat Playground in Azure AI Foundry](https://ai.azure.com/), or via the chat completion API. 

For example, you can send a chat completion call to your deployed model, as shown in the following Python code snippet. 

In [None]:
# Check output from the deployed supervised fine-tuned model via AOAI API
response = client.chat.completions.create(
    model = aoai_deployment_name, # model = "Custom deployment name you chose for your fine-tuning model"
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What do you think of Paris?"}
    ],
    temperature=0.7, 
    max_tokens=800)

print(response.choices[0].message.content)
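
To see what fine-tuning changed, it can be useful to send the same question to the base deployment and to the fine-tuned one, using the system message from the training data. The sketch below is one possible way to do that; `ask` is a hypothetical helper, and the default system prompt is copied from the sample records in Step 2.

```python
# System message used in the training and validation samples (Step 2)
CLIPPY_SYSTEM_PROMPT = "Clippy is a factual chatbot that is also sarcastic."


def ask(client, deployment, question, system_prompt=CLIPPY_SYSTEM_PROMPT):
    """Send a single-turn question to a deployment and return the reply text."""
    response = client.chat.completions.create(
        model=deployment,  # Azure OpenAI deployment name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0.7,
        max_tokens=800,
    )
    return response.choices[0].message.content
```

For example, compare `ask(client, "gpt-4o", "Who discovered Antarctica?")` with `ask(client, aoai_deployment_name, "Who discovered Antarctica?")`; with only 10 training samples the difference in tone may be small.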

### Step 8: Delete The Deployment

Once you have finished this tutorial and tested a few chat completion calls against your fine-tuned model, it is **strongly recommended** that you delete the model deployment, because deployed fine-tuned / customized models incur an [hourly hosting cost](https://azure.microsoft.com/zh-cn/pricing/details/cognitive-services/openai-service/#pricing).
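
One way to delete the deployment from the notebook is to send a `DELETE` request to the same control plane URL that was used to create it in Step 6. This is a hedged sketch, not a procedure taken from the original tutorial: `delete_deployment` is a hypothetical helper, `http` is anything with a `requests`-style `delete` method (such as the `requests` module), and the bearer token from Step 6 may have expired and need to be fetched again.

```python
def delete_deployment(http, url, params, headers):
    """Send DELETE to the deployment URL and return the HTTP status code.

    Raises an exception (via raise_for_status) if the service reports an error.
    """
    response = http.delete(url, params=params, headers=headers)
    response.raise_for_status()
    return response.status_code
```

With the variables from Step 6, this could be called as `delete_deployment(requests, request_url, deploy_params, deploy_headers)`; the deployment can also be removed from the Azure AI Foundry portal.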