## Vision Fine-Tuning GPT-4o Model - A Python SDK Experience

Learn how to vision fine-tune the <code>gpt-4o-2024-08-06</code> model using the Python SDK.

This notebook is inspired by the vision fine-tuning [notebook](https://github.com/Azure/gen-cv/blob/main/vision-fine-tuning/01-AOAI-vision-fine-tuning-starter/README.md) from Andreas Kopp's [GenCV Accelerator](https://github.com/Azure/gen-cv).

You can run this notebook either locally or on an <code>AML CPU Compute Standard_D13_v2</code> instance with the kernel type <code>Python 3.10 - SDK v2</code>.

He Zhang, Jan. 2025

### Prerequisites

* Learn the [what, why, and when to use fine-tuning.](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/fine-tuning-considerations)
* An Azure subscription.
* Access to Azure OpenAI Service.
* An Azure OpenAI resource created in the supported fine-tuning region (e.g. Sweden Central).
* A deployment of <code>gpt-4o</code> base model, with its deployment name as "gpt-4o" for simplicity.  
* Training and validation datasets:
  * At least 50 high-quality samples are required (preferably thousands).
  * The files must be formatted as JSON Lines (JSONL) documents with UTF-8 encoding.
  * This test notebook uses the ChartQA dataset presented in [ChartQA: ACL 2022](https://aclanthology.org/2022.findings-acl.177).
* Python version <code>3.10</code> or later.
* Python libraries: <code>json, requests, os, pandas, PIL, base64, IPython, tqdm, python-dotenv, tenacity, datasets, matplotlib, azure.identity, openai</code>
* The OpenAI Python library version for this test notebook: <code>1.58.1</code>
* [Jupyter Notebooks](https://jupyter.org/)

### Step 1: Setup

#### Retrieve the Azure OpenAI API key and endpoint.

Go to your Azure OpenAI resource in the Azure portal. The Endpoint and Keys can be found in the Resource Management section.  

<img src="../../images/screenshot-aoai-keys-and-endpoint.png" alt="Screenshot of the Azure OpenAI resource management pane." width="800"/>

#### Configure credentials

Copy the <code>Endpoint</code> and an access <code>KEY</code> (either <code>KEY 1</code> or <code>KEY 2</code>) and paste them into the corresponding variables in the file <code>azure.env</code>. Save the file and close it. **Do not** distribute this file, because it contains credential information!
<img src="../../images/screenshot-azure-env-file.png" alt="Screenshot of the azure.env file that contains credential information - do not show it to others!" width="800"/>
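
Before creating the client, it can help to confirm that every expected variable is actually set. The following is a minimal sketch: the variable names are the ones this notebook reads from <code>azure.env</code>, while the helper name `missing_env_vars` is hypothetical and not part of any SDK.

```python
import os

# Variable names that this notebook reads from azure.env.
REQUIRED_VARS = [
    "SUBSCRIPTION_ID",
    "AOAI_RESOURCE",
    "RESOURCE_GROUP",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
]

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]
```

After calling `load_dotenv("azure.env")`, an empty list from `missing_env_vars()` suggests the file was found and filled in; any returned names point to entries that still need a value.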

#### Install required Python libraries (if not done yet)

In [None]:
%pip install openai
%pip install tenacity
%pip install datasets
%pip install python-dotenv

#### Import required Python libraries 

In [None]:
import os
import json
import base64
import requests
import pandas as pd
import matplotlib.pyplot as plt

from tqdm import tqdm
from PIL import Image
from openai import AzureOpenAI
from io import BytesIO, StringIO
from datasets import load_dataset
from IPython.display import display
from dotenv import load_dotenv, find_dotenv
from azure.identity import DefaultAzureCredential
from tenacity import retry, stop_after_attempt, wait_fixed

#### Load environmental variables to assign credentials 

In [None]:
# Load env. file
load_dotenv("azure.env")

# Assign Azure resources  
subscription_id = os.getenv("SUBSCRIPTION_ID") # Azure subscription ID
resource_name = os.getenv("AOAI_RESOURCE") # name of the AOAI resource
rg_name = os.getenv("RESOURCE_GROUP") # name of the resource group

# Assign AOAI credentials 
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-10-21",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

In [None]:
# Test AOAI connection
completion = client.chat.completions.create(  
    model="gpt-4o",  
    messages=[{"role":"user", "content":"hello"}],  
    max_tokens=500,  
    temperature=0.7)

print(completion.choices[0].message.content)

#### Define helper functions

In [None]:
def encode_image(image, quality=100):
    """ Encode an image into a base64 string in JPEG format. """

    if image.mode != 'RGB':
        image = image.convert('RGB')  # Convert to RGB
    buffered = BytesIO()
    image.save(buffered, format="JPEG", quality=quality) 
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

def date_sorted_df(details_dict):
    """ Create a pandas DataFrame from a dictionary and sort it by a 'created' or 'created_at' timestamp column for displaying OpenAI API tables. """
    df = pd.DataFrame(details_dict)
    
    if 'created' in df.columns:
        df.rename(columns={'created': 'created_at'}, inplace=True)
    
    # Convert 'created_at' from Unix timestamp to human-readable date/time format
    df['created_at'] = pd.to_datetime(df['created_at'], unit='s').dt.strftime('%Y-%m-%d %H:%M:%S')

    if 'finished_at' in df.columns:
        # Convert 'finished_at' from Unix timestamp to human-readable date/time format, keeping null values as is
        df['finished_at'] = pd.to_datetime(df['finished_at'], unit='s', errors='coerce').dt.strftime('%Y-%m-%d %H:%M:%S')
    
    # Sort DataFrame by 'created_at' in descending order
    df = df.sort_values(by='created_at', ascending=False)

    return df

def show_ft_metrics(results_df, window_size=5):
    """ Plot fine-tuning metrics including loss and accuracy for training and validation. """

    # Drop rows where valid_loss is NaN or valid_loss is -1.0
    filtered_df = results_df.dropna(subset=['valid_loss'])
    filtered_df = filtered_df.loc[filtered_df['valid_loss'] != -1.0]

    # Compute rolling means
    results_df_smooth = results_df.rolling(window=window_size).mean()
    filtered_df_smooth = filtered_df.rolling(window=window_size).mean()

    # Plot the curves
    plt.figure(figsize=(16, 10))

    plt.subplot(2, 2, 1)
    plt.plot(results_df_smooth['step'], results_df_smooth['train_loss'],  color='blue')
    plt.title('Train Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(2, 2, 2)
    plt.plot(results_df_smooth['step'], results_df_smooth['train_mean_token_accuracy'], color='green')
    plt.title('Train Mean Token Accuracy')
    plt.xlabel('Step')
    plt.ylabel('Accuracy')

    plt.subplot(2, 2, 3)
    plt.plot(filtered_df_smooth['step'], filtered_df_smooth['valid_loss'], color='red')
    plt.title('Validation Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(2, 2, 4)
    plt.plot(filtered_df_smooth['step'], filtered_df_smooth['valid_mean_token_accuracy'], color='orange')
    plt.title('Validation Mean Token Accuracy')
    plt.xlabel('Step')
    plt.ylabel('Accuracy')

    plt.tight_layout()
    plt.show()

### Step 2: Prepare Training, Validation, and Testing Datasets

Vision fine-tuning uses JSONL dataset files whose format mirrors the way images are sent as input to the chat completions API.
Images can be provided as HTTP URLs (as shown below) or as data URLs containing base64-encoded images.

```json
{
  "messages": [
    { "role": "system", "content": "You are an assistant that identifies uncommon cheeses." },
    { "role": "user", "content": "What is this cheese?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
          }
        }
      ]
    },
    { "role": "assistant", "content": "Danbo" }
  ]
}
```

This demo notebook uses the ChartQA dataset introduced by Masry et al. in their paper, *ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning* (Findings of ACL 2022). For more details, refer to their publication: [ChartQA: ACL 2022](https://aclanthology.org/2022.findings-acl.177).

The following cells convert the ChartQA dataset from Hugging Face into this JSONL format using base64-encoded images. Depending on the format of your own training data, you will likely need to adapt the code for other use cases.
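
To make the data URL form concrete, here is a small standard-library sketch that wraps raw image bytes in a base64 data URL and unpacks it again. The function names `to_data_url` and `from_data_url` are illustrative only; the notebook itself builds the URL from the output of `encode_image`.

```python
import base64

def to_data_url(image_bytes, mime_type="image/jpeg"):
    """Wrap raw image bytes in a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

def from_data_url(data_url):
    """Split a base64 data URL back into (mime_type, raw_bytes)."""
    header, encoded = data_url.split(",", 1)
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(encoded)

# The three bytes that every JPEG stream starts with.
url = to_data_url(b"\xff\xd8\xff")
print(url)  # data:image/jpeg;base64,/9j/
```

Base64 encoding increases the payload by roughly one third, which is one reason the conversion code below lowers the JPEG quality before encoding.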

In [None]:
# Load ChartQA dataset from Hugging Face server
ds = load_dataset("HuggingFaceM4/ChartQA")
display(ds)

In [None]:
# extract a subset of training, validation, and test examples for simplicity
train_samples = 8000
val_samples = 1000
test_samples = 800

ds_train = ds['train'].shuffle(seed=42).select(range(train_samples))
ds_val = ds['val'].shuffle(seed=42).select(range(val_samples))
ds_test = ds['test'].shuffle(seed=42).select(range(test_samples))

# convert to pandas dataframe
ds_train = ds_train.to_pandas()
ds_val = ds_val.to_pandas()
ds_test = ds_test.to_pandas()

In [None]:
# Check some samples at this stage
ds_train.head()

In [None]:
# convert byte strings to images
ds_train['image'] = ds_train['image'].apply(lambda x: Image.open(BytesIO(x['bytes'])))
ds_val['image'] = ds_val['image'].apply(lambda x: Image.open(BytesIO(x['bytes'])))
ds_test['image'] = ds_test['image'].apply(lambda x: Image.open(BytesIO(x['bytes'])))

# Convert array type of 'label' column into string only if the current data type is object
if ds_train['label'].dtype == 'object':
    ds_train['label'] = ds_train['label'].apply(lambda x: x[0])

if ds_val['label'].dtype == 'object':
    ds_val['label'] = ds_val['label'].apply(lambda x: x[0])

if ds_test['label'].dtype == 'object':
    ds_test['label'] = ds_test['label'].apply(lambda x: x[0])

In [None]:
# Check some samples at this stage
ds_train.head()

In [None]:
# Rename certain columns
ds_train = ds_train.rename(columns={'query': 'question', 'label': 'answer'})
ds_val = ds_val.rename(columns={'query': 'question', 'label': 'answer'})
ds_test = ds_test.rename(columns={'query': 'question', 'label': 'answer'})

# Select certain columns
ds_train = ds_train[['question', 'answer', 'image']]
ds_val = ds_val[['question', 'answer', 'image']]
ds_test = ds_test[['question', 'answer', 'image']]

In [None]:
# Check some samples at this stage
ds_train.head()

In [None]:
# review a sample training example
idx=3
print('QUESTION:', ds_train.iloc[idx]['question'])
display(ds_train.iloc[idx]['image'])
print('ANSWER:', ds_train.iloc[idx]['answer'])

In [None]:
# Create dataset splits as local JSONL files
project_name = "chart-qa-v4"
splits = ['train', 'val', 'test']
datasets = [ds_train, ds_val, ds_test]

SYSTEM_PROMPT = """You are a Vision Language Model specialized in interpreting visual data from chart images.
Your task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.
The charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary."""

for split, dataset in zip(splits, datasets):
    dataset_file = f"{project_name}-{split}.jsonl"
    print(f"Generating {dataset_file} with {dataset.shape[0]} samples.")
    
    json_data = []
    base64_prefix = "data:image/jpeg;base64,"
    
    for idx, example in tqdm(enumerate(dataset.itertuples()), total=dataset.shape[0]):
        try:
            system_message = {"role": "system", "content": SYSTEM_PROMPT}
            
            encoded_image = encode_image(example.image, quality=80)
            user_message = {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Question [{idx}]: {example.question}"},
                    {"type": "image_url", "image_url": {"url": f"{base64_prefix}{encoded_image}"}}
                ]
            }
            assistant_message = {"role": "assistant", "content": example.answer}

            json_data.append({"messages": [system_message, user_message, assistant_message]})
        except KeyError as e:
            print(f"Missing field in example {idx}: {e}")
        except Exception as e:
            print(f"Error processing example {idx}: {e}")
    
    with open(dataset_file, "w", encoding="utf-8") as f:
        for message in json_data:
            json.dump(message, f)
            f.write("\n")
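
Before uploading, it can be worth running a quick sanity check over the generated files. The sketch below is an assumption about which checks are useful (valid JSON per line, a non-empty `messages` list, an assistant message last); it does not reproduce the validation rules of the service, and `check_jsonl_lines` is a hypothetical helper name.

```python
import json

def check_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        last = messages[-1]
        if not isinstance(last, dict) or last.get("role") != "assistant":
            problems.append((number, "last message is not from the assistant"))
    return problems
```

A file object can be passed directly, for example `check_jsonl_lines(open(f"{project_name}-train.jsonl", encoding="utf-8"))`, since iterating over a text file yields its lines.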

### Step 3: Upload Datasets for Fine-Tuning

In [None]:
# upload training file (the context manager closes the file handle after the upload)
with open(f"{project_name}-train.jsonl", "rb") as f:
    train_file = client.files.create(file=f, purpose="fine-tune")

# upload validation file
with open(f"{project_name}-val.jsonl", "rb") as f:
    val_file = client.files.create(file=f, purpose="fine-tune")

### Step 4: Configure and Start Fine-Tuning Job

The table below offers guidance for adjusting the hyperparameters of the fine-tuning job. Leave them as `None` to use the default values.

| Hyperparameter                       | Description                                                                                                                                                                              |
|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `Batch size`                            | The batch size to use for training. When set to default, batch_size is calculated as 0.2% of examples in training set and the max is 256.                                                           |
| `Learning rate multiplier` | The fine-tuning learning rate is the original learning rate used for pre-training multiplied by this multiplier. We recommend experimenting with values between 0.5 and 2. Empirically, we've found that larger learning rates often perform better with larger batch sizes. Must be between 0.0 and 5.0. |
| `Number of epochs`       | Number of training epochs. An epoch refers to one full cycle through the data set. If set to default, number of epochs will be determined dynamically based on the input data. |
| `Seed`  | The seed controls the reproducibility of the job. Passing in the same seed and job parameters should produce the same results, but may differ in rare cases. If a seed is not specified, one will be generated for you. |
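
As a worked example of the default batch size rule in the table, the sketch below computes 0.2% of the training set with a cap of 256. The flooring and the minimum of 1 are assumptions, since the table does not say how the value is rounded, and both function names are illustrative.

```python
import math

def default_batch_size(n_examples, per_mille=2, cap=256):
    """Approximate the documented default: 0.2% of the training examples, capped at 256."""
    # Integer arithmetic avoids floating-point surprises. Flooring and the
    # minimum of 1 are assumptions; the table does not specify rounding.
    return max(1, min(cap, n_examples * per_mille // 1000))

def steps_per_epoch(n_examples, batch_size):
    """Number of training steps needed to pass over every example once."""
    return math.ceil(n_examples / batch_size)

batch = default_batch_size(8000)      # 8,000 training samples, as in this notebook
steps = steps_per_epoch(8000, batch)
print(batch, steps)  # 16 500
```

Under these assumptions, the 8,000-sample training set used here would train with a batch size of 16, giving 500 steps per epoch.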

In [None]:
# create fine tuning job
file_train = train_file.id
file_val = val_file.id

ft_job = client.fine_tuning.jobs.create(
  suffix=project_name,
  training_file=file_train,
  validation_file=file_val, # validation file is optional
  model="gpt-4o-2024-08-06", # baseline model name (not the deployment name)
  seed=None,
  hyperparameters={
    "n_epochs" : None,
    "batch_size" : None,
    "learning_rate_multiplier" : None,
  }
)

In [None]:
# Check the fine-tuning job status
client.fine_tuning.jobs.list(limit=1).to_dict()
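
Fine-tuning jobs can run for a long time, so instead of re-running the status cell by hand you may prefer a polling loop. This is a minimal sketch: `wait_for_job` is a hypothetical helper, and the set of terminal statuses is an assumption based on the status values the fine-tuning API commonly reports.

```python
import time

# Assumed terminal states of a fine-tuning job.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(retrieve, job_id, poll_seconds=60, max_polls=None, sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status, then return the job.

    `retrieve` is any callable returning an object with a `.status` attribute,
    for example client.fine_tuning.jobs.retrieve.
    """
    polls = 0
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job {job_id} still {job.status!r} after {polls} polls")
        sleep(poll_seconds)
```

In this notebook the call could look like `ft_job = wait_for_job(client.fine_tuning.jobs.retrieve, ft_job.id)`. Passing the retrieve function in as an argument keeps the loop easy to test without network access.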

In [None]:
# List 5 recent fine-tuning jobs
ft_jobs = client.fine_tuning.jobs.list(limit=5).to_dict()

display(date_sorted_df(pd.DataFrame(ft_jobs['data'])))

In [None]:
# Retrieve the name of the fine-tuned model (replace the job ID below with your own)
ft_job = client.fine_tuning.jobs.retrieve("ftjob-0a4c9b22f32e44b4a133c83edc31107b")
fine_tuned_model = ft_job.to_dict()['fine_tuned_model']
fine_tuned_model

In [None]:
# Retrieve fine-tuning metrics from result file
result_file_id = ft_job.to_dict()['result_files'][0]
results_content = client.files.content(result_file_id).content.decode()

data_io = StringIO(results_content)
results_df = pd.read_csv(data_io)
display(results_df)

The table below explains the metrics in the results file, which the next cell plots:

| Metric                       | Description                                                                                                                                                                              |
|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `step`                            | The number of the training step. A training step represents a single pass, forward and backward, on a batch of training data.                                                           |
| `train_loss`, `validation_loss` | The loss for the training / validation batch |
| `train_mean_token_accuracy`       | The percentage of tokens in the training batch correctly predicted by the model. For example, if the batch size is set to 3 and your data contains completions [[1, 2], [0, 5], [4, 2]], this value is set to 0.83 (5 of 6) if the model predicted [[1, 1], [0, 5], [4, 2]]. |
| `validation_mean_token_accuracy`  | The percentage of tokens in the validation batch correctly predicted by the model. For example, if the batch size is set to 3 and your data contains completions [[1, 2], [0, 5], [4, 2]], this value is set to 0.83 (5 of 6) if the model predicted [[1, 1], [0, 5], [4, 2]]. |
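
The mean token accuracy example from the table can be reproduced in a few lines. This sketch assumes that each predicted sequence has the same length as its target and compares tokens position by position; the function name is illustrative.

```python
def mean_token_accuracy(targets, predictions):
    """Fraction of target tokens that the prediction matches position by position."""
    correct = total = 0
    for target_seq, predicted_seq in zip(targets, predictions):
        for target_tok, predicted_tok in zip(target_seq, predicted_seq):
            correct += int(target_tok == predicted_tok)
            total += 1
    return correct / total if total else 0.0

targets = [[1, 2], [0, 5], [4, 2]]
predictions = [[1, 1], [0, 5], [4, 2]]
print(round(mean_token_accuracy(targets, predictions), 2))  # 0.83 (5 of 6 tokens match)
```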

In [None]:
# Plot training and validation metrics
show_ft_metrics(results_df)

### Step 6: Deploy the Fine-Tuned Model

__Note__: Only one deployment is permitted for a customized model. An error occurs if you select an already-deployed customized model.  

The code below shows how to deploy the model using the Control Plane API. Take a look at the [Azure OpenAI fine-tuning documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=turbo&pivots=programming-language-python#deploy-fine-tuned-model) for more details.

In [None]:
# List existing models
my_models = client.models.list().to_dict()
models_df = date_sorted_df(my_models['data'])

cols = ['status', 'capabilities', 'lifecycle_status', 'id', 'created_at', 'model']
bold_start, bold_end = '\033[1m', '\033[0m'

print(f'Models of AOAI resource {bold_start}{resource_name}{bold_end}:')
display(models_df[cols].head())

In [None]:
# Deploy the fine-tuned model as an Azure OpenAI deployment via the control plane REST API
aoai_deployment_name = project_name # AOAI deployment name. Use as model parameter for inferencing

credential = DefaultAzureCredential()
token = credential.get_token("https://management.azure.com/.default").token

deploy_params = {'api-version': "2023-05-01"} 
deploy_headers = {'Authorization': 'Bearer {}'.format(token), 'Content-Type': 'application/json'}

deploy_data = {
    "sku": {"name": "standard", "capacity": 1}, 
    "properties": {
        "model": {
            "format": "OpenAI",
            "name": fine_tuned_model, # retrieve this value from the previous calls, it will look like gpt-35-turbo-0613.ft-b044a9d3cf9c4228b5d393567f693b83
            "version": "1"
        }
    }
}
deploy_data = json.dumps(deploy_data)

request_url = f'https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{rg_name}/providers/Microsoft.CognitiveServices/accounts/{resource_name}/deployments/{aoai_deployment_name}'

print('Creating a new deployment...')

r = requests.put(request_url, params=deploy_params, headers=deploy_headers, data=deploy_data)

print(r)
print(r.reason)
print(r.json())

### Step 7: Test the Deployed Fine-Tuned Model

In [None]:
# Define a function to query the vision fine-tuned model
@retry(stop=stop_after_attempt(3), wait=wait_fixed(10))
def query_image(image, question, deployment='gpt-4o'):

    encoded_image_url = f"data:image/jpeg;base64,{encode_image(image, quality=50)}"

    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": encoded_image_url}}
            ]}
        ],
        temperature=0,
    )

    return response.choices[0].message.content

In [None]:
idx = 0

question = ds_test.iloc[idx]['question']
img = ds_test.iloc[idx]['image']
answer = ds_test.iloc[idx]['answer']

display(question)
display(img)
display(answer)

In [None]:
# Check fine-tuned model result
print(query_image(img, question, aoai_deployment_name))

### Step 8: Evaluate the Base GPT-4o and the Fine-Tuned GPT-4o Models

In [None]:
ds_test_eval = ds_test.copy().head(100)
ds_test_eval.info()
ds_test_eval.head()

In [None]:
%%time
# Process test dataset with baseline model
ds_test_eval['gpt-4o-base-pred'] = ds_test_eval.apply(lambda row: query_image(row['image'], row['question'], 'gpt-4o'), axis=1)
ds_test_eval.head()

In [None]:
%%time
# Process test dataset with fine-tuned model
ds_test_eval['gpt-4o-ft-pred'] = ds_test_eval.apply(lambda row: query_image(row['image'], row['question'], aoai_deployment_name), axis=1)
ds_test_eval.head()

In [None]:
# Define a function to use LLM for results comparisons
@retry(stop=stop_after_attempt(3), wait=wait_fixed(10))
def evaluate(question, ground_truth_answer, predicted_answer, deployment='gpt-4o'):
    
    EVAL_SYSTEM_PROMPT = """You evaluate the factual correctness of a predicted answer about a diagram with a ground truth answer. 
                            The predicted answer might be formulated in a different way. Your only concern is if the predicted answer is correct from a factual perspective. 
                            You are provided with the original question, the ground truth answer and the predicted answer.
                            You respond with either CORRECT or INCORRECT"""

    user_prompt = f"Original question: {question} \nGround truth answer: {ground_truth_answer}\nPredicted answer: {predicted_answer}" 

    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": EVAL_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0,
    )

    return response.choices[0].message.content

In [None]:
# Test the evaluation function above
print(evaluate('what is the diagram title?', 'comparison of tax rates in US states', 'Tax Rate Comparison Across US States'))

In [None]:
# Validate prediction accuracy of baseline model
ds_test_eval['gpt-4o-base-eval'] = ds_test_eval.apply(lambda row: evaluate(row['question'], row['answer'], row['gpt-4o-base-pred'], 'gpt-4o'), axis=1)
ds_test_eval.head()

In [None]:
# Validate prediction accuracy of fine-tuned model
ds_test_eval['gpt-4o-ft-eval'] = ds_test_eval.apply(lambda row: evaluate(row['question'], row['answer'], row['gpt-4o-ft-pred'], 'gpt-4o'), axis=1)
ds_test_eval.head()
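
The accuracy calculation in the next cell counts replies that are exactly `CORRECT`, so a judge reply such as `Correct.` would be missed. The sketch below normalizes the verdicts first; `normalize_verdict` and `accuracy` are hypothetical helpers, and treating unrecognized replies as wrong is an assumption.

```python
def normalize_verdict(text):
    """Map a judge reply to 'CORRECT', 'INCORRECT', or 'UNKNOWN'."""
    cleaned = (text or "").strip().upper().strip(".!\"'` ")
    # Check INCORRECT first because the string 'INCORRECT' contains 'CORRECT'.
    if cleaned.startswith("INCORRECT"):
        return "INCORRECT"
    if cleaned.startswith("CORRECT"):
        return "CORRECT"
    return "UNKNOWN"

def accuracy(verdicts):
    """Share of verdicts that normalize to CORRECT; unknown replies count as wrong."""
    normalized = [normalize_verdict(v) for v in verdicts]
    return normalized.count("CORRECT") / len(normalized) if normalized else 0.0
```

Applied to the evaluation columns, this could look like `ds_test_eval['gpt-4o-ft-eval'].map(normalize_verdict)` before the values are counted.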

In [None]:
# Draw a bar chart to show the accuracy comparison result
base_correct_count = ds_test_eval['gpt-4o-base-eval'].value_counts().get("CORRECT", 0)
base_eval_observations = ds_test_eval.shape[0]
ft_correct_count = ds_test_eval['gpt-4o-ft-eval'].value_counts().get("CORRECT", 0)
ft_eval_observations = ds_test_eval.shape[0]

chart_data = {
    'title' : 'GPT-4o ChartQA accuracy - baseline vs fine-tuned model', 
    'baseline' : 'GPT-4o 0806',
    'fine-tuned' : 'GPT-4o 0806 fine-tuned',
    'baseline accuracy' : base_correct_count / base_eval_observations,
    'fine-tuned accuracy' : ft_correct_count / ft_eval_observations,
    
}

# Extract data for plotting
models = [chart_data['baseline'], chart_data['fine-tuned']]
accuracies = [chart_data['baseline accuracy'], chart_data['fine-tuned accuracy']]

# Create a bar chart
plt.figure(figsize=(8, 6))
plt.bar(models, accuracies, color=['blue', 'green'])

# Add titles and labels
plt.title(chart_data['title'])
plt.ylabel('Accuracy')
plt.xlabel('Model')

# Annotate bars with accuracy values
for i, acc in enumerate(accuracies):
    plt.text(i, acc + 0.005, f"{acc:.4f}", ha='center', fontsize=10)

# Display the chart
plt.tight_layout()
plt.show()

### Step 9: Delete the Deployment

Once you have finished this tutorial and tested a few chat completion calls against your fine-tuned model, it is **strongly recommended** that you delete the model deployment, because deployed fine-tuned (customized) models incur an [hourly hosting cost](https://azure.microsoft.com/zh-cn/pricing/details/cognitive-services/openai-service/#pricing).
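
The deployment can be deleted in the Azure portal. As a sketch of a programmatic alternative, the control plane URL used for the deployment in Step 6 can also be addressed with a `DELETE` request. The helper `build_delete_request` is hypothetical, and the example assumes the same `2023-05-01` API version and a valid bearer token as in Step 6.

```python
def build_delete_request(subscription_id, rg_name, resource_name, deployment_name,
                         token, api_version="2023-05-01"):
    """Return (url, params, headers) for deleting an Azure OpenAI deployment."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{rg_name}"
        "/providers/Microsoft.CognitiveServices"
        f"/accounts/{resource_name}"
        f"/deployments/{deployment_name}"
    )
    params = {"api-version": api_version}
    headers = {"Authorization": f"Bearer {token}"}
    return url, params, headers

# Usage in this notebook (needs the variables and a fresh token as in Step 6):
# url, params, headers = build_delete_request(
#     subscription_id, rg_name, resource_name, aoai_deployment_name, token)
# r = requests.delete(url, params=params, headers=headers)
# print(r.status_code, r.reason)
```

Deleting the deployment stops the hosting charge but leaves the fine-tuned model itself in place, so it can be deployed again later.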