## Supervised Fine-Tuning Phi-4 Open-Source Models for Text Q&A - A Python SDK Experience

Learn how to fine-tune the <code>Phi-4-mini</code> model with the Python SDK (a code-first experience). This notebook is based on the Azure Examples GitHub notebook [here](https://github.com/Azure/azureml-examples/blob/main/sdk/python/foundation-models/system/finetune/chat-completion/chat-completion.ipynb), with important modifications for compatibility.

The last successful run was on an AML CPU compute <code>Standard_D13_v2</code> with kernel type <code>Python 3.10 - SDK v2</code>.

He Zhang, Jul. 2025

## Chat Completion - Ultrachat-200k

This sample shows how to use `chat-completion` components from the `azureml` system registry to fine-tune a model to complete a conversation between two people using the `ultrachat_200k` dataset. We then deploy the fine-tuned model to an online endpoint for real-time inference.

### Training data
We will use the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset. This is a heavily filtered version of the UltraChat dataset and was used to train `Zephyr-7B-β`, a state-of-the-art `7B` chat model.

### Model
We will use the `Phi-4-mini-instruct` model to show how a user can fine-tune a model for the chat-completion task. If you opened this notebook from a specific model card, remember to replace the model name accordingly.

### Outline
* Setup pre-requisites such as compute.
* Pick a model to fine-tune.
* Pick and explore training data.
* Configure the fine-tuning job.
* Run the fine-tuning job.
* Review training and evaluation metrics. 
* Register the fine-tuned model. 
* Deploy the fine-tuned model for real time inference.
* Clean up resources. 

### Step 1: Setup Pre-requisites
* Install dependencies
* Connect to AzureML Workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace  `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to `azureml` system registry
* Set an optional experiment name
* Check or create compute.
  * The recommended GPU for fine-tuning `Phi-4` models is the `A100` compute as described [here](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nca100v4-series?tabs=sizebasic#sizes-in-series). You can also find other GPU SKUs such as `V100` [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series) and [here](https://learn.microsoft.com/en-us/azure/virtual-machines/ndv2-series).
  * A single GPU node can have multiple GPU cards. For example, in one node of `Standard_NC24rs_v3` there are 4 NVIDIA V100 GPUs while in `Standard_NC12s_v3`, there are 2 NVIDIA V100 GPUs. Refer to the [docs](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu) for this information. The number of GPU cards per node is set in the param `gpus_per_node` below. Setting this value correctly will ensure utilization of all GPUs in the node.

#### Install required Python libraries (if not done yet)

In [None]:
%pip install azure-ai-ml
%pip install azure-identity
%pip install datasets
%pip install mlflow
%pip install azureml-mlflow

#### Import required Python libraries 

In [None]:
from azure.ai.ml import MLClient
from azure.identity import (
    DefaultAzureCredential,
    InteractiveBrowserCredential,
)
from azure.ai.ml.entities import AmlCompute
import time

try:
    credential = DefaultAzureCredential()
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    credential = InteractiveBrowserCredential()

try:
    workspace_ml_client = MLClient.from_config(credential=credential)
except Exception:
    workspace_ml_client = MLClient(
        credential,
        subscription_id="<SUBSCRIPTION_ID>",
        resource_group_name="<RESOURCE_GROUP>",
        workspace_name="<WORKSPACE_NAME>",
    )

# the models, fine tuning pipelines and environments are available in the AzureML registry, "azureml"
registry_ml_client = MLClient(credential, registry_name="azureml")
experiment_name = "Chat_Completion_Phi_4_Mini_Instruct_Text_QA"

# generating a unique timestamp that can be used for names and versions that need to be unique
timestamp = str(int(time.time()))

### Step 2: Pick A Foundation Model To Fine-Tune

`Phi-4-mini-instruct` is a dense decoder-only Transformer model with `3.8B` parameters, offering key improvements over `Phi-3.5-Mini`, including a `200K` vocabulary, grouped-query attention, and shared embedding. It is designed for chat-completion prompts, generating text based on user input, with a context length of `128K tokens`. If you have opened this notebook for a different model, replace the model name and version accordingly. 

Note the model id property of the model. This will be passed as input to the fine-tuning job. This is also available as the `Model ID` field in model details page of `Model Catalog` in Azure Machine Learning Studio. 

In [None]:
model_name = "Phi-4-mini-instruct"
foundation_model = registry_ml_client.models.get(model_name, label="latest")
print(
    "\n\nUsing model name: {0}, version: {1}, id: {2} for fine tuning".format(
        foundation_model.name, foundation_model.version, foundation_model.id
    )
)

### Step 3: Create A Compute

The finetune job works `ONLY` with `GPU` compute. The size of the compute depends on how big the model is and in most cases it becomes tricky to identify the right compute for the job. In this cell, we guide the user to select the right compute for the job.

`NOTE1` The computes listed below work with the most optimized configuration. Any change to the configuration might lead to a CUDA out-of-memory error. In such cases, try upgrading to a larger compute size.

`NOTE2` While selecting the compute_cluster_size below, make sure the compute is available in your resource group. If a particular compute is not available you can make a request to get access to the compute resources.

In [None]:
import ast

if "finetune_compute_allow_list" in foundation_model.tags:
    computes_allow_list = ast.literal_eval(
        foundation_model.tags["finetune_compute_allow_list"]
    )  # convert string to python list
    print(f"Please create a compute from the following list - {computes_allow_list}")
else:
    computes_allow_list = None
    print("`finetune_compute_allow_list` is not part of model tags")
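
The allow list is stored as a string-valued tag, which is why the cell above parses it with `ast.literal_eval`. The following is a minimal sketch of that parsing step on its own. The `example_tags` dictionary and the helper name `parse_allow_list` are made up for illustration, and the SKU names are not the actual tag value for `Phi-4-mini-instruct`.

```python
import ast

# Hypothetical tag payload; the real value comes from foundation_model.tags.
example_tags = {
    "finetune_compute_allow_list": "['Standard_NC24ads_A100_v4', 'Standard_NC96ads_A100_v4']"
}


def parse_allow_list(tags, key="finetune_compute_allow_list"):
    """Return the allow list as a Python list, or None if the tag is absent."""
    raw = tags.get(key)
    if raw is None:
        return None
    # literal_eval only accepts Python literals, so it is safer than eval()
    return list(ast.literal_eval(raw))


allow_list = parse_allow_list(example_tags)
print(allow_list)
print(parse_allow_list({}))
```

Returning `None` for a missing tag keeps the same convention as the notebook, where `None` means that no allow list is published for the model.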

In [None]:
# if you have a specific compute size to work with change it here. By default we use the 4 x A100 compute from the above list
compute_cluster_size = "Standard_NC96ads_A100_v4"

# if you already have a GPU cluster, put its name here; otherwise a new cluster is created with this name
compute_cluster = "gpu-cluster-nc96ads-a100-v4"

try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            max_instances=1,  # For multi node training set this to an integer value more than 1
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        try:
            print(
                "Attempt #2 - Trying to create a low priority compute. Since this is a low priority compute, the job could get pre-empted before completion."
            )
            compute = AmlCompute(
                name=compute_cluster,
                size=compute_cluster_size,
                tier="LowPriority",
                max_instances=1,  # For multi node training set this to an integer value more than 1
            )
            workspace_ml_client.compute.begin_create_or_update(compute).wait()
        except Exception as e:
            print(e)
            raise ValueError(
                f"WARNING! Compute size {compute_cluster_size} not available in workspace"
            )

In [None]:
# sanity check on the created compute
compute = workspace_ml_client.compute.get(compute_cluster)
if compute.provisioning_state.lower() == "failed":
    raise ValueError(
        f"Provisioning failed, Compute '{compute_cluster}' is in failed state. "
        f"please try creating a different compute"
    )

if computes_allow_list is not None:
    computes_allow_list_lower_case = [x.lower() for x in computes_allow_list]
    if compute.size.lower() not in computes_allow_list_lower_case:
        raise ValueError(
            f"VM size {compute.size} is not in the allow-listed computes for finetuning"
        )
else:
    # Computes with K80 GPUs are not supported
    unsupported_gpu_vm_list = [
        "standard_nc6",
        "standard_nc12",
        "standard_nc24",
        "standard_nc24r",
    ]
    if compute.size.lower() in unsupported_gpu_vm_list:
        raise ValueError(
            f"VM size {compute.size} is currently not supported for finetuning"
        )

# this is the number of GPUs in a single node of the selected 'vm_size' compute.
# setting this to less than the number of GPUs will result in underutilized GPUs, taking longer to train.
# setting this to more than the number of GPUs will result in an error.
gpu_count_found = False
workspace_compute_sku_list = workspace_ml_client.compute.list_sizes()
available_sku_sizes = []
for compute_sku in workspace_compute_sku_list:
    available_sku_sizes.append(compute_sku.name)
    if compute_sku.name.lower() == compute.size.lower():
        gpus_per_node = compute_sku.gpus
        gpu_count_found = True
# raise an error if the GPU count for the compute size was not found
if gpu_count_found:
    print(f"Number of GPUs in compute {compute.size}: {gpus_per_node}")
else:
    raise ValueError(
        f"Number of GPUs in compute {compute.size} not found. Available skus are: {available_sku_sizes}. "
        f"This should not happen. Please check the selected compute cluster: {compute_cluster} and try again."
    )

### Step 4: Prepare Training & Validation Datasets

We use the [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) dataset. The dataset has four splits, suitable for:
* Supervised Fine-Tuning (sft).
* Generation Ranking (gen).

The number of examples per split is shown as follows:

| train_sft | test_sft | train_gen | test_gen |
| :- | :- | :- | :- |
| 207865 | 23110 | 256032 | 28304 |

The next few cells show basic data preparation for fine-tuning:
* Visualize some data rows.
* We want this sample to run quickly, so we save `train_sft` and `test_sft` files containing 5% of the already trimmed rows. This means the fine-tuned model will have lower accuracy, so it should not be put to real-world use.

> The [download-dataset.py](./download-dataset.py) script downloads the ultrachat_200k dataset and transforms it into a format that the fine-tuning pipeline component can consume. Because the dataset is large, we use only part of it here.

> Running the script below downloads only 5% of the data. This can be increased by changing the `dataset_split_pc` parameter to the desired percentage.

> **Note** : Some language models have different language codes and hence the column names in the dataset should reflect the same.
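
To get a feel for how much data a given `dataset_split_pc` leaves, the split sizes from the table above can be scaled directly. This is only a rough estimate: it assumes the helper script takes a plain percentage of each split, and the rounding it actually applies is not shown in this notebook. The helper name `approx_rows` is hypothetical.

```python
# Split sizes taken from the table above.
split_sizes = {
    "train_sft": 207865,
    "test_sft": 23110,
    "train_gen": 256032,
    "test_gen": 28304,
}


def approx_rows(sizes, dataset_split_pc):
    """Estimate rows kept per split, rounded down (actual rounding may differ)."""
    return {name: size * dataset_split_pc // 100 for name, size in sizes.items()}


estimate = approx_rows(split_sizes, 5)
for name, rows in estimate.items():
    print(f"{name}: ~{rows} rows")
```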

##### Here is an example of what the data should look like.

The chat-completion dataset is stored in parquet format with each entry using the following schema:
``` json
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
            "role": "user"
        }, 
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
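
Before submitting a job, it can help to sanity-check a few records against this schema. The sketch below is one possible check, not part of the fine-tuning component. It assumes that turns strictly alternate between `user` and `assistant`, starting with `user`, as in the sample above; that rule is inferred from the sample rather than taken from a documented requirement. The function name `validate_chat_record` is hypothetical.

```python
def validate_chat_record(record):
    """Return a list of problems found in one chat-completion record."""
    problems = []
    for key in ("prompt", "messages", "prompt_id"):
        if key not in record:
            problems.append(f"missing key: {key}")

    messages = record.get("messages", [])
    if not messages:
        problems.append("messages is empty")

    for i, message in enumerate(messages):
        # Assumption: even positions are user turns, odd positions assistant turns.
        expected_role = "user" if i % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(f"message {i}: expected role {expected_role!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content:
            problems.append(f"message {i}: content must be a non-empty string")
    return problems


good = {
    "prompt": "Hi",
    "messages": [
        {"content": "Hi", "role": "user"},
        {"content": "Hello!", "role": "assistant"},
    ],
    "prompt_id": "abc",
}
bad = {"prompt": "Hi", "messages": [{"content": "Hi", "role": "assistant"}]}

print(validate_chat_record(good))
print(validate_chat_record(bad))
```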

In [None]:
# download the dataset using the helper script. This needs datasets library: https://pypi.org/project/datasets/
import os

exit_status = os.system(
    "python ./download-dataset.py --dataset HuggingFaceH4/ultrachat_200k --download_dir ultrachat_200k_dataset --dataset_split_pc 5"
)

if exit_status != 0:
    raise Exception("Error downloading dataset")

In [None]:
# load the ./ultrachat_200k_dataset/train_sft.jsonl file into a pandas dataframe and show the first several rows
import pandas as pd

pd.set_option(
    "display.max_colwidth", 0
)  # set the max column width to 0 to display the full text
df = pd.read_json("./ultrachat_200k_dataset/train_sft.jsonl", lines=True)
df.head(1)

### Step 5: Configure and Start Fine-Tuning Job

Now you can submit your fine-tuning training job. 

The fine-tuning job will take some time to start and complete.

You can use the job ID to monitor the status of the fine-tuning job. 

In [None]:
# default training parameters
training_parameters = dict(
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
)
# default optimization parameters
optimization_parameters = dict(
    apply_lora="true",
    apply_deepspeed="true",
    deepspeed_stage=2,
)
# let's construct finetuning parameters using training and optimization parameters.
finetune_parameters = {**training_parameters, **optimization_parameters}

# each model finetuning works best with certain fine-tuning parameters which are packed with model as `model_specific_defaults`.
# let's override the "finetune_parameters" in case the model has some custom defaults.
if "model_specific_defaults" in foundation_model.tags:
    print("Warning! Model specific defaults exist. The defaults could be overridden.")
    finetune_parameters.update(
        ast.literal_eval(  # convert string to python dict
            foundation_model.tags["model_specific_defaults"]
        )
    )
print(
    f"The following fine-tuning parameters are going to be set for the run: {finetune_parameters}"
)
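
The order of the merge matters: values from `model_specific_defaults` are applied last, so they replace the notebook defaults for any key they share. The sketch below reproduces that precedence with plain dictionaries. The `model_specific_defaults` string here is a hypothetical stand-in shaped like the real tag, not the actual tag value.

```python
import ast

training = {"num_train_epochs": 3, "learning_rate": 5e-6}
optimization = {"apply_lora": "true", "deepspeed_stage": 2}

# With {**a, **b}, keys from the later dictionary win on collision.
merged = {**training, **optimization}

# Hypothetical tag value, shaped like a `model_specific_defaults` string.
model_specific_defaults = "{'num_train_epochs': 1, 'deepspeed_stage': 3, 'precision': 16}"

# update() overwrites shared keys and adds new ones.
merged.update(ast.literal_eval(model_specific_defaults))

print(merged)
```

This precedence would explain why the recorded cell output further down shows `num_train_epochs` of 1 and `deepspeed_stage` of 3, although the defaults in the cell above set 3 and 2.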

In [None]:
# set the pipeline display name for distinguishing different runs from the name
def get_pipeline_display_name():
    batch_size = (
        int(finetune_parameters.get("per_device_train_batch_size", 1))
        * int(finetune_parameters.get("gradient_accumulation_steps", 1))
        * int(gpus_per_node)
        * int(finetune_parameters.get("num_nodes_finetune", 1))
    )
    scheduler = finetune_parameters.get("lr_scheduler_type", "linear")
    deepspeed = finetune_parameters.get("apply_deepspeed", "false")
    ds_stage = finetune_parameters.get("deepspeed_stage", "2")
    if deepspeed == "true":
        ds_string = f"ds{ds_stage}"
    else:
        ds_string = "nods"
    lora = finetune_parameters.get("apply_lora", "false")
    if lora == "true":
        lora_string = "lora"
    else:
        lora_string = "nolora"
    save_limit = finetune_parameters.get("save_total_limit", -1)
    seq_len = finetune_parameters.get("max_seq_length", -1)
    return (
        model_name
        + "-"
        + "ultrachat"
        + "-"
        + f"bs{batch_size}"
        + "-"
        + f"{scheduler}"
        + "-"
        + ds_string
        + "-"
        + lora_string
        + f"-save_limit{save_limit}"
        + f"-seqlen{seq_len}"
    )

pipeline_display_name = get_pipeline_display_name()
print(f"Display name used for the run: {pipeline_display_name}")
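
The `bs` part of the display name is the effective global batch size: the per-device batch size multiplied by the gradient accumulation steps, the GPUs per node, and the number of nodes. A small standalone sketch of that arithmetic follows. The GPU count of 4 is an assumption taken from the "4 x A100" comment in Step 3, and the helper name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(
    per_device_train_batch_size,
    gradient_accumulation_steps,
    gpus_per_node,
    num_nodes,
):
    """Number of samples contributing to one optimizer step across all GPUs."""
    return (
        per_device_train_batch_size
        * gradient_accumulation_steps
        * gpus_per_node
        * num_nodes
    )


# Values used in this notebook, assuming a single node with 4 GPUs.
notebook_bs = effective_batch_size(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    gpus_per_node=4,
    num_nodes=1,
)
print(f"bs{notebook_bs}")

# A larger, purely illustrative configuration.
larger_bs = effective_batch_size(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    gpus_per_node=4,
    num_nodes=2,
)
print(f"bs{larger_bs}")
```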

In [15]:
finetune_parameters_modified = finetune_parameters.copy()
#del finetune_parameters_modified['learning_rate_min']
#del finetune_parameters_modified['learning_rate_max']
finetune_parameters_modified

{'num_train_epochs': 1,
 'per_device_train_batch_size': 1,
 'per_device_eval_batch_size': 1,
 'learning_rate': 5e-06,
 'lr_scheduler_type': 'cosine',
 'apply_lora': 'true',
 'apply_deepspeed': 'true',
 'deepspeed_stage': 3,
 'apply_ort': 'false',
 'precision': 16,
 'ignore_mismatched_sizes': 'false',
 'gradient_accumulation_steps': 1,
 'logging_strategy': 'steps',
 'logging_steps': 10,
 'save_total_limit': 1}

In [None]:
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import Input

# fetch the pipeline component
pipeline_component_func = registry_ml_client.components.get(
    name="chat_completion_pipeline", label="latest"
)

# define the pipeline job
@pipeline(name=pipeline_display_name)
def create_pipeline():
    chat_completion_pipeline = pipeline_component_func(
        mlflow_model_path=foundation_model.id,
        compute_model_import=compute_cluster,
        compute_preprocess=compute_cluster,
        compute_finetune=compute_cluster,
        compute_model_evaluation=compute_cluster,
        # map the dataset splits to parameters
        train_file_path=Input(
            type="uri_file", path="./ultrachat_200k_dataset/train_sft.jsonl"
        ),
        test_file_path=Input(
            type="uri_file", path="./ultrachat_200k_dataset/test_sft.jsonl"
        ),
        # training settings
        number_of_gpu_to_use_finetuning=gpus_per_node,  # set to the number of GPUs available in the compute
        **finetune_parameters
    )
    return {
        # map the output of the fine-tuning job to the output of pipeline job so that we can easily register the fine-tuned model
        # registering the model is required to deploy the model to an online or batch endpoint
        "trained_model": chat_completion_pipeline.outputs.mlflow_model_folder
    }

pipeline_object = create_pipeline()

# don't use cached results from previous jobs
pipeline_object.settings.force_rerun = True

# set continue on step failure to False
pipeline_object.settings.continue_on_step_failure = False

Submit the job

In [None]:
# submit the pipeline job
pipeline_job = workspace_ml_client.jobs.create_or_update(
    pipeline_object, experiment_name=experiment_name
)
# wait for the pipeline job to complete
workspace_ml_client.jobs.stream(pipeline_job.name)

In [None]:
from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# check if the `trained_model` output is available
print("pipeline job outputs: ", workspace_ml_client.jobs.get(pipeline_job.name).outputs)

In [None]:
# build the URI of the `trained_model` output of the pipeline job
model_path_from_job = "azureml://jobs/{0}/outputs/{1}".format(
    pipeline_job.name, "trained_model"
)
model_path_from_job

### Step 6: Register The Fine-Tuned Model

We will register the model from the output of the fine-tuning job. This will track lineage between the fine-tuned model and the fine-tuning job. The fine-tuning job, further, tracks lineage to the foundation model, data and training code.

In [None]:
# name the fine-tuned model
finetuned_model_name = model_name + "-ultrachat-200k"
finetuned_model_name = finetuned_model_name.replace("/", "-")

# prepare to register the model from pipeline job output
print("path to register model: ", model_path_from_job)
prepare_to_register_model = Model(
    path=model_path_from_job,
    type=AssetTypes.MLFLOW_MODEL,
    name=finetuned_model_name,
    version=timestamp,  # use timestamp as version to avoid version conflict
    description=model_name + " fine-tuned model for ultrachat 200k chat-completion",
)
print("prepare to register model: \n", prepare_to_register_model)

# start registering the model
registered_model = workspace_ml_client.models.create_or_update(
    prepare_to_register_model
)
print("registered model: \n", registered_model)

### Step 7: Deploy The Fine-Tuned Model To An Online Endpoint 

__Note__: Only one deployment is permitted for a customized model. An error occurs if you select an already-deployed customized model.  

Online endpoints give a durable REST API that can be used to integrate with applications that need to use the model.

The deployment process may take 10 to 20 mins.

In [None]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    ProbeSettings,
    OnlineRequestSettings,
)

# create online endpoint - endpoint names need to be unique in a region, hence using timestamp to create unique endpoint name
online_endpoint_name = "ultrachat-completion-" + timestamp

endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Online endpoint for "
    + registered_model.name
    + ", fine-tuned model for ultrachat-200k-chat-completion",
    auth_mode="key",
)

workspace_ml_client.begin_create_or_update(endpoint).wait()

In [None]:
import ast

# note that you should deploy using A100 GPU if the model is fine-tuned with A100 GPU. 
instance_type =  "Standard_NC24ads_A100_v4" 

# inference compute allow list that supports deployment
if "inference_compute_allow_list" in foundation_model.tags:
    inference_computes_allow_list = ast.literal_eval(
        foundation_model.tags["inference_compute_allow_list"]
    )  # convert string to python list
    print(f"Please deploy on a compute from the following list - {inference_computes_allow_list}")
else:
    inference_computes_allow_list = None
    print("`inference_compute_allow_list` is not part of model tags")

# check if the compute is in the allow listed computes
if (
    inference_computes_allow_list is not None
    and instance_type not in inference_computes_allow_list
):
    print(
        f"`instance_type` is not in the allow listed compute. Please select a value from {inference_computes_allow_list}"
    )
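
The check above compares `instance_type` to the allow list with exact string matching, whereas the fine-tuning compute check in Step 3 lower-cases both sides. If SKU names may differ in casing, a case-insensitive variant could look like the sketch below. The helper name `is_allowed` and the SKU list are illustrative only.

```python
def is_allowed(instance_type, allow_list):
    """True if there is no allow list, or if it contains instance_type ignoring case."""
    if allow_list is None:
        return True
    return instance_type.lower() in {sku.lower() for sku in allow_list}


# Illustrative allow list, not the real tag value.
example_allow_list = ["Standard_NC24ads_A100_v4", "Standard_NC48ads_A100_v4"]

print(is_allowed("standard_nc24ads_a100_v4", example_allow_list))
print(is_allowed("Standard_D13_v2", example_allow_list))
print(is_allowed("Standard_D13_v2", None))
```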

In [None]:
# create the deployment
demo_deployment = ManagedOnlineDeployment(
    name="demo",
    endpoint_name=online_endpoint_name,
    model=registered_model.id,
    instance_type=instance_type,
    instance_count=1,
    liveness_probe=ProbeSettings(initial_delay=1200, timeout=20),
    request_settings=OnlineRequestSettings(request_timeout_ms=90000),
)

workspace_ml_client.online_deployments.begin_create_or_update(demo_deployment).wait()

endpoint.traffic = {"demo": 100}

workspace_ml_client.begin_create_or_update(endpoint).result()

### Step 8: Test the Deployed Fine-Tuned Model

In [None]:
# read ./ultrachat_200k_dataset/test_gen.jsonl into a pandas dataframe
test_df = pd.read_json("./ultrachat_200k_dataset/test_gen.jsonl", lines=True)

# take one random sample
test_df = test_df.sample(n=1)

# rebuild index
test_df.reset_index(drop=True, inplace=True)
test_df.info()
test_df.head()

In [None]:
import json

# create a json object whose "input_data" holds the chat messages of the sampled row and the generation parameters
parameters = {
    "temperature": 0.6,
    "top_p": 0.9,
    "do_sample": True,
    "max_new_tokens": 200,
}
test_json = {
    "input_data": {
        "input_string": [test_df["messages"][0]],
        "parameters": parameters,
    },
    "params": {},
}

# save the json object to a file named sample_score.json in the ./ultrachat_200k_dataset folder
with open("./ultrachat_200k_dataset/sample_score.json", "w") as f:
    json.dump(test_json, f)
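
For reference, the sketch below builds a scoring payload from a hand-written conversation instead of `test_df`. It mirrors the shape used in the HTTP request example later in this notebook, where `input_string` is a flat list of role/content messages. The message text is made up, the helper name `build_payload` is hypothetical, and whether the endpoint accepts other shapes is not verified here.

```python
import json

# Hand-written conversation used only for illustration.
messages = [
    {"role": "user", "content": "I am going to Paris, what should I see?"},
]


def build_payload(chat_messages, **parameters):
    """Wrap chat messages and generation parameters in the scoring request shape."""
    return {
        "input_data": {
            "input_string": chat_messages,
            "parameters": parameters,
        }
    }


payload = build_payload(
    messages, temperature=0.6, top_p=0.9, do_sample=True, max_new_tokens=200
)

# Serialize the payload and confirm it survives a JSON round trip.
encoded = json.dumps(payload)
print(encoded)
print(json.loads(encoded) == payload)
```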

In [None]:
# score the sample_score.json file using the online endpoint with the azureml endpoint invoking method
response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="demo",
    request_file="./ultrachat_200k_dataset/sample_score.json",
)
print("raw response: \n", response, "\n")

In [None]:
# call the online endpoint API using the http request method
import urllib.error
import urllib.request
import json

# request data goes here
# the example below assumes JSON formatting which may be updated
# depending on the format your endpoint expects.
# more information can be found here:
# https://docs.microsoft.com/azure/machine-learning/how-to-deploy-advanced-entry-script
data = {
  "input_data": {
    "input_string": [
      {
        "role": "user",
        "content": "I am going to Paris, what should I see?"
      },
      {
        "role": "assistant",
        "content": "Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:\n\n1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.\n2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.\n3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic architecture and stunning stained glass windows.\n\nThese are just a few of the many attractions that Paris has to offer. With so much to see and do, it's no wonder that Paris is one of the most popular tourist destinations in the world."
      },
      {
        "role": "user",
        "content": "What is so great about #1?"
      }
    ],
    "parameters": {
      "max_new_tokens": 4096
    }
  }
}

body = str.encode(json.dumps(data))

url = 'https://ultrachat-completion-1751183129.germanywestcentral.inference.ml.azure.com/score'

# replace this with the primary/secondary key, AMLToken, or Microsoft Entra ID token for the endpoint
api_key = "xxx"
if not api_key:
    raise Exception("A key should be provided to invoke the endpoint")

headers = {'Content-Type':'application/json', 'Accept': 'application/json', 'Authorization':('Bearer '+ api_key)}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)
    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))
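
Rather than pasting the endpoint key into the notebook, the key can be read from an environment variable so that it is not saved with the cell. This is a minimal sketch of that idea. The variable name `AML_ENDPOINT_KEY` and the helper name `build_headers` are assumptions made for illustration, not names defined by Azure Machine Learning.

```python
import os


def build_headers(env_var="AML_ENDPOINT_KEY"):
    """Build request headers using an API key read from the environment."""
    api_key = os.environ.get(env_var, "")
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to the endpoint key")
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": "Bearer " + api_key,
    }


try:
    headers = build_headers()
    print("Headers ready:", sorted(headers))
except RuntimeError as err:
    # Reached when the environment variable is not set.
    print(err)
```

The resulting `headers` dictionary can be passed to `urllib.request.Request` in place of the one built from the hard-coded `api_key`.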

### Step 9: Delete The Online Endpoint
Don't forget to delete the online endpoint; otherwise the billing meter keeps running for the compute used by the endpoint.

In [None]:
workspace_ml_client.online_endpoints.begin_delete(name=online_endpoint_name).wait()