Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# AOAI Finetuning

## 1. Prerequisites

### 1.1. Compute with Managed Identity

The AOAI Fine-tuning component requires specific permissions to access various resources, and these must be granted before the job is submitted. For authentication, the permissions are assigned to a User Managed Identity (UMI), which is then attached to the compute where the component runs. Detailed instructions for creating the managed identity can be found [here](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/how-manage-user-assigned-managed-identities?pivots=identity-mi-methods-azp). Instructions for assigning the managed identity to the compute can be found [here](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-identity-based-service-authentication?view=azureml-api-2&tabs=cli#user-assigned-managed-identity). See this [link](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/role-based-access-control) for more details on role-based access control for Azure OpenAI Service.

The following are the minimum permissions that must be granted to the User Managed Identity (UMI):
- `Cognitive Services Contributor` role over the Azure OpenAI resource
- `Cognitive Services User` role over the Azure OpenAI resource


Furthermore, users can provide dataset URIs as inputs for training and validation data. If users intend to provide non-public data URIs, they must first store them in the workspace's associated Key Vault and then pass the Key Vault key as an input. In this case, the managed identity will need permissions to access both the workspace and the Key Vault:
- `Reader` role over AML workspace
- `Get Secret` permission over the workspace's associated Key Vault

Key Vault access configuration can be of two types: RBAC and Vault access policy. If the Key Vault supports RBAC authorization, assign the `Key Vault Secrets User` role to the UMI over the Key Vault's scope. Otherwise, navigate to the Access Policies tab in the Key Vault resource in the Azure portal and create a new access policy to grant the "Get Secret" permission to the UMI.
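
The permission rules above can be summarized in a small helper. This is only an illustrative sketch: the function name and the scope labels are hypothetical and exist purely to make the decision logic explicit. It does not call Azure or assign any roles.

```python
from typing import List, Tuple


def required_permissions(
    uses_private_data_uri: bool, key_vault_uses_rbac: bool
) -> List[Tuple[str, str]]:
    """Return (permission, scope) pairs the managed identity needs.

    The scope strings are descriptive labels, not Azure resource IDs.
    """
    permissions = [
        ("Cognitive Services Contributor", "Azure OpenAI resource"),
        ("Cognitive Services User", "Azure OpenAI resource"),
    ]
    if uses_private_data_uri:
        # Non-public URIs are read from the workspace's Key Vault.
        permissions.append(("Reader", "AML workspace"))
        if key_vault_uses_rbac:
            permissions.append(("Key Vault Secrets User", "workspace Key Vault"))
        else:
            permissions.append(("Get Secret access policy", "workspace Key Vault"))
    return permissions


for permission, scope in required_permissions(True, key_vault_uses_rbac=True):
    print(f"{permission} -> {scope}")
```

With public data URIs or direct data assets, only the two Azure OpenAI roles are returned.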

In [None]:
managed_identity_resource_id = "/subscriptions/72c03bf3-4e69-41af-9532-dfcdc3eefef4/resourceGroups/aml-benchmarking/providers/Microsoft.ManagedIdentity/userAssignedIdentities/finetuning-umi"
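
Because a malformed resource ID only fails later, at compute creation time, it can help to check its shape up front. The following is a minimal sketch using a regular expression; the helper name `parse_umi_resource_id` and the placeholder ID are assumptions made for illustration, and the check is purely syntactic (it does not verify that the identity exists).

```python
import re

# Accepts the ID with or without a leading slash; segment names are matched
# case-insensitively.
_UMI_PATTERN = re.compile(
    r"^/?subscriptions/(?P<subscription_id>[^/]+)"
    r"/resourceGroups/(?P<resource_group>[^/]+)"
    r"/providers/Microsoft\.ManagedIdentity"
    r"/userAssignedIdentities/(?P<identity_name>[^/]+)$",
    re.IGNORECASE,
)


def parse_umi_resource_id(resource_id: str) -> dict:
    """Split a user-assigned managed identity resource ID into its parts."""
    match = _UMI_PATTERN.match(resource_id.strip())
    if match is None:
        raise ValueError(f"Not a user-assigned identity resource ID: {resource_id!r}")
    return match.groupdict()


# Placeholder values, not a real identity.
example_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/my-resource-group"
    "/providers/Microsoft.ManagedIdentity/userAssignedIdentities/my-umi"
)
print(parse_umi_resource_id(example_id))
```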

## 2. Setup

In [None]:
# Import required libraries
import os
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient
from azure.ai.ml.dsl import pipeline
import pandas as pd

### 2.1. Configure workspace details and get a handle to the workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

To connect to a workspace, we need identifier parameters: a subscription, a resource group, and a workspace name. We will use these details in the `MLClient` from `azure.ai.ml` to get a handle to the required Azure Machine Learning workspace. We use the [default Azure authentication](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python) for this tutorial. Check the [configuration notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) for more details on how to configure credentials and connect to a workspace.

In [None]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

subscription_id = "72c03bf3-4e69-41af-9532-dfcdc3eefef4"
resource_group = "aml-benchmarking"
workspace_name = "chirag-ws"

ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)
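
The workspace identifiers do not have to be hard-coded in the notebook. As one possible alternative, the sketch below reads them from environment variables. The variable names (`AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, `AZUREML_WORKSPACE_NAME`) and the helper name are assumptions chosen for this example, not names the SDK reads on its own.

```python
import os
from typing import Mapping, Optional

# Maps MLClient keyword arguments to (assumed) environment variable names.
_ENV_NAMES = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group_name": "AZURE_RESOURCE_GROUP",
    "workspace_name": "AZUREML_WORKSPACE_NAME",
}


def load_workspace_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect workspace identifiers from the environment, failing early if any is unset."""
    env = os.environ if env is None else env
    missing = [name for name in _ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in _ENV_NAMES.items()}


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {
    "AZURE_SUBSCRIPTION_ID": "00000000-0000-0000-0000-000000000000",
    "AZURE_RESOURCE_GROUP": "my-resource-group",
    "AZUREML_WORKSPACE_NAME": "my-workspace",
}
print(load_workspace_settings(demo_env))
```

The returned dictionary uses the keyword names of the `MLClient` constructor, so it could be passed as `MLClient(credential, **load_workspace_settings())`.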

### 2.2. Show Azure ML Workspace information

In [None]:
ws = ml_client.workspaces.get(name=ml_client.workspace_name)

output = {}
output["Workspace"] = ml_client.workspace_name
output["Subscription ID"] = ml_client.subscription_id
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
pd.DataFrame(data=output, index=[""]).T

## 3. Compute

### Create or Attach existing AmlCompute with managed identity

You will need to create a compute target for your pipeline run. In this tutorial, you will create AmlCompute as your compute resource with managed identity.

> Note that if you have an AzureML Data Scientist role, you will not have permission to create compute resources. Talk to your workspace or IT admin to create the compute targets described in this section, if they do not already exist.

**Creation of AmlCompute takes approximately 5 minutes.**

If the AmlCompute with that name is already in your workspace, this code will skip the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azure.ai.ml.entities import (
    ManagedIdentityConfiguration,
    IdentityConfiguration,
    AmlCompute,
)
from azure.ai.ml.constants import ManagedServiceIdentityType

# Create an identity configuration from the user-assigned managed identity
managed_identity = ManagedIdentityConfiguration(
    resource_id=managed_identity_resource_id
)
identity_config = IdentityConfiguration(
    type=ManagedServiceIdentityType.USER_ASSIGNED,
    user_assigned_identities=[managed_identity],
)

# specify aml compute name.
cpu_compute_target = "aoai-compute"

try:
    compute = ml_client.compute.get(cpu_compute_target)
except Exception:
    print("Creating a new cpu compute target...")
    # Pass the identity configuration
    compute = AmlCompute(
        name=cpu_compute_target,
        size="STANDARD_DS3_V2",
        min_instances=0,
        max_instances=4,
        identity=identity_config,
    )
    poller = ml_client.compute.begin_create_or_update(compute)
    poller.wait()
print(compute)
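
The cell above treats any exception from the lookup as "compute not found", which can hide unrelated problems such as authentication failures. A hedged sketch of a narrower get-or-create pattern follows. The helper `get_or_create` is hypothetical, and the demonstration uses an in-memory dictionary rather than Azure so that it runs anywhere.

```python
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def get_or_create(
    get: Callable[[], T],
    create: Callable[[], T],
    not_found: Type[BaseException] = KeyError,
) -> T:
    """Return the existing resource, creating it only when `get` raises `not_found`.

    Any other exception propagates to the caller unchanged.
    """
    try:
        return get()
    except not_found:
        return create()


# Stand-in for a compute service: a dictionary keyed by compute name.
fake_computes = {}


def create_fake_compute() -> str:
    fake_computes["aoai-compute"] = "compute:aoai-compute"
    return fake_computes["aoai-compute"]


first = get_or_create(lambda: fake_computes["aoai-compute"], create_fake_compute)
second = get_or_create(lambda: fake_computes["aoai-compute"], create_fake_compute)
print(first, second)
```

In the notebook, `get` could wrap `ml_client.compute.get(cpu_compute_target)`, `create` could wrap `ml_client.compute.begin_create_or_update(compute).result()`, and `not_found` could be `azure.core.exceptions.ResourceNotFoundError`.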

## 4. Import Components From Registry

An Azure Machine Learning component is a self-contained piece of code that does one step in a machine learning pipeline. A component is analogous to a function - it has a name, inputs, outputs, and a body. Components are the building blocks of the Azure Machine Learning pipelines. It's a good engineering practice to build a machine learning pipeline where each step has well-defined inputs and outputs. In Azure Machine Learning, a component represents one reusable step in a pipeline. Components are designed to help improve the productivity of pipeline building. Specifically, components offer:

- Well-defined interface: Components require a well-defined interface (input and output). The interface allows the user to build steps and connect steps easily. The interface also hides the complex logic of a step and removes the burden of understanding how the step is implemented.

- Share and reuse: As the building blocks of a pipeline, components can be easily shared and reused across pipelines, workspaces, and subscriptions. Components built by one team can be discovered and used by another team.

- Version control: Components are versioned. The component producers can keep improving components and publish new versions. Consumers can use specific component versions in their pipelines. This gives them compatibility and reproducibility.

For more detailed information on this subject, refer to this [link](https://learn.microsoft.com/en-us/azure/machine-learning/concept-component?view=azureml-api-2).

To import components, we need to get the registry. The following command obtains the public registry from which we will import components for our experiment.

In [None]:
azureml_preview_registry = MLClient(credential=credential, registry_name="azureml-1p-preview")
print(azureml_preview_registry)

Next, we pull specific components from the corresponding registries and use them to build a pipeline of steps. To illustrate the finetuning workflow, we will use the following components:

| Component name | Description  | Registry name |
|:---|:---|:---|
| **aoai_finetuning**  | Upload dataset to Azure OpenAI, perform finetuning and delete dataset from Azure OpenAI. | _azureml-1p-preview_ |

In [None]:
finetuning_component = azureml_preview_registry.components.get(name="aoai_finetuning")
 
print(f"Finetuning component version: {finetuning_component.version}\n---")

## 5. Data

The component supports two methods of providing training and validation data:
1. Direct Dataset Provisioning: Users can pass data assets directly via the `training_file_path` and `validation_file_path` ports. The component then loads the data and uploads it to the AOAI resource.
2. Dataset URI Provisioning: Alternatively, users can provide dataset URIs via the `training_import_path` and `validation_import_path` ports. For this method to work, the data must be accessible via a GET request to the URI without requiring additional permissions. If the dataset URI is public and does not contain any credentials, users can pass it under the `data_uri` key in the `training_import_path`/`validation_import_path` JSON. However, if the URI should not be exposed, users must first store it as a secret in the workspace's associated Key Vault and then pass the Key Vault key under the `keyvault_key_for_data_uri` key in the `training_import_path`/`validation_import_path` JSON.

Note that exactly one of `training_file_path` or `training_import_path` must be provided. Providing a validation dataset is optional.

Along with `training_file_path`, users can provide `validation_file_path`. If the latter is not provided, the training dataset is automatically split in an 80:20 ratio to create validation data.
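
To make the 80:20 split concrete, here is a minimal sketch of one way such a split could be done on line-delimited records (for example JSONL). How the component actually performs the split (whether it shuffles, which seed it uses) is not stated here, so the shuffling, the seed, and the function name are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_train_validation(
    lines: Sequence[str], validation_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle non-empty lines and split them into (train, validation)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    rows = [line for line in lines if line.strip()]
    random.Random(seed).shuffle(rows)  # Seeded so the split is reproducible.
    n_validation = int(round(len(rows) * validation_fraction))
    return rows[n_validation:], rows[:n_validation]


records = [f'{{"id": {i}}}' for i in range(10)]
train_rows, validation_rows = split_train_validation(records)
print(len(train_rows), len(validation_rows))
```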

Along with `training_import_path`, users can provide `validation_import_path`. In the import path JSON, exactly one of the fields `data_uri` or `keyvault_key_for_data_uri` must be present. Because the component does not load the data when a URI is provided, the training data is not split if `validation_import_path` is absent.
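
The "exactly one" rules above are easy to get wrong, so a local check before submission may be useful. The sketch below encodes both rules; the helper names are hypothetical, and the component itself may validate differently or report different errors.

```python
from typing import Any, Mapping, Optional

IMPORT_PATH_KEYS = ("data_uri", "keyvault_key_for_data_uri")


def validate_import_path(payload: Mapping[str, Any]) -> str:
    """Return the single key present in an import path JSON, or raise ValueError."""
    present = [key for key in IMPORT_PATH_KEYS if payload.get(key)]
    if len(present) != 1:
        raise ValueError(
            f"Expected exactly one of {IMPORT_PATH_KEYS}, found {present or 'none'}"
        )
    return present[0]


def validate_training_inputs(
    training_file_path: Optional[Any] = None,
    training_import_path: Optional[Any] = None,
) -> None:
    """Ensure exactly one of the two training inputs is supplied."""
    if (training_file_path is None) == (training_import_path is None):
        raise ValueError(
            "Provide exactly one of training_file_path or training_import_path"
        )


print(validate_import_path({"data_uri": "https://example.com/data"}))
validate_training_inputs(training_file_path="some-data-asset")
```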

In the next cell we define train and validation data which will be used to fine-tune an AOAI model.

In [None]:
training_file_path = ml_client.data.get(name="aoai_finetune_train", version="1")
validation_file_path = ml_client.data.get(name="aoai_finetune_validation", version="1")

Alternatively, we can define `training_import_path` and `validation_import_path`:

In [None]:
training_import_path = ml_client.data.get(name="training_import_path", version="1")
validation_import_path = ml_client.data.get(name="validation_import_path", version="1")

If the import path URI file is not present among the data assets of the workspace or registry, we can create and upload it. Alternatively, it can be produced as the output of another component that runs before the finetuning component in the pipeline.

The following cell shows how to create `training_import_path` and `validation_import_path`. Users can provide either a data URI or a Key Vault key, depending on their use case.

In [None]:
import json
# training_import_path containing data uri
training_uri = {
    "data_uri": "https://example.com/data"
}

# training_import_path containing keyvault key for data uri 
#training_uri = {
#    "keyvault_key_for_data_uri": "dummy_key"
#}

# Define file paths to save JSON files
training_import_path = "C:\\Users\\chiragbhatt\\Desktop\\proxy_components\\training_import_path.json"

# Write JSON data to files
with open(training_import_path, "w") as data_uri_file:
    json.dump(training_uri, data_uri_file)


# validation_import_path containing data uri
validation_uri = {
    "data_uri": "https://example.com/data"
}

# validation_import_path containing keyvault key for data uri 
#validation_uri = {
#    "keyvault_key_for_data_uri": "dummy_key"
#}

# Define file paths to save JSON files
validation_import_path = "C:\\Users\\chiragbhatt\\Desktop\\proxy_components\\validation_import_path.json"

# Write JSON data to files
with open(validation_import_path, "w") as data_uri_file:
    json.dump(validation_uri, data_uri_file)
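
The cell above writes to an absolute Windows path that exists only on one machine. A more portable variant, sketched below under the assumption that any writable directory will do, builds the path with `pathlib` and uses a temporary directory for the demonstration. The helper name `write_import_path_file` is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_import_path_file(directory, file_name: str, payload: dict) -> Path:
    """Write an import path JSON file under `directory` and return its path."""
    target = Path(directory) / file_name
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return target


with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_import_path_file(
        tmp_dir, "training_import_path.json", {"data_uri": "https://example.com/data"}
    )
    print(written.name, written.read_text(encoding="utf-8"))
```

In practice the directory would be one that outlives the cell, for example a folder next to the notebook, since the temporary directory used here is deleted when the `with` block exits.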

## 6. Build a pipeline

Next, we build a pipeline from the imported components.

In [None]:
@pipeline(description="aoai_finetuning")
def aoai_finetuning(
    training_file_path,
    validation_file_path,
    training_import_path,
    validation_import_path,
    compute_name,
    endpoint_subscription=None,
    endpoint_resource_group=None,
    endpoint_name=None,
    model="gpt-35-turbo",
    task_type="chat",
    n_epochs=1,
    batch_size=8,
    learning_rate_multiplier=1,
    suffix=None,
    n_ctx=4096,
    lora_dim=1,
    weight_decay_multiplier=0.001,
):
    # Step 1 : Finetune OAI model
    finetune_step = finetuning_component(
        training_file_path=training_file_path,
        validation_file_path=validation_file_path,
        training_import_path=training_import_path,
        validation_import_path=validation_import_path,
        endpoint_name=endpoint_name,
        endpoint_subscription=endpoint_subscription,
        endpoint_resource_group=endpoint_resource_group,
        task_type=task_type,
        model=model,
        n_epochs=n_epochs,
        batch_size=batch_size,
        learning_rate_multiplier=learning_rate_multiplier,
        suffix=suffix,
        n_ctx=n_ctx,
        lora_dim=lora_dim,
        weight_decay_multiplier=weight_decay_multiplier,
    )
    
    finetune_step.compute = compute_name

    return finetune_step.outputs

## 7. Kick Off Pipeline Runs

In [None]:
endpoint_name = "aoai-proxy"
endpoint_subscription = "72c03bf3-4e69-41af-9532-dfcdc3eefef4"
endpoint_resource_group = "aml-benchmarking"
task = "chat"  # "question-answering"
model_name = "gpt-35-turbo-0613"
suffix = "testing"
n_epochs = 1
batch_size = 8
learning_rate_multiplier = 1
n_ctx = 4096
lora_dim = 1
weight_decay_multiplier = 0.001
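
Since the same endpoint and hyperparameter values are passed to every pipeline invocation, they can be collected once and merged with the run-specific data inputs. The sketch below shows one way to do that with plain dictionaries. The helper name is hypothetical, the data input values are placeholder strings, and the keyword names simply mirror the `aoai_finetuning` pipeline parameters.

```python
def build_pipeline_kwargs(data_inputs: dict, shared_settings: dict) -> dict:
    """Merge per-run data inputs with settings shared by all runs."""
    overlap = set(data_inputs) & set(shared_settings)
    if overlap:
        raise ValueError(f"Keys defined twice: {sorted(overlap)}")
    return {**shared_settings, **data_inputs}


shared_settings = {
    "compute_name": "aoai-compute",
    "model": "gpt-35-turbo-0613",
    "task_type": "chat",
    "n_epochs": 1,
    "batch_size": 8,
    "learning_rate_multiplier": 1,
    "suffix": "testing",
    "n_ctx": 4096,
    "lora_dim": 1,
    "weight_decay_multiplier": 0.001,
}

# Placeholder strings stand in for the data assets used in the notebook.
asset_run = build_pipeline_kwargs(
    {"training_file_path": "<train asset>", "validation_file_path": "<validation asset>"},
    shared_settings,
)
print(sorted(asset_run))
```

The merged dictionary could then be unpacked into the pipeline function, for example `aoai_finetuning(**asset_run)`.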


In [None]:
"""Upload data with training_file_path"""
aoai_pipeline = aoai_finetuning(
    training_file_path=training_file_path,
    validation_file_path=validation_file_path,
    compute_name=cpu_compute_target,
    endpoint_name=endpoint_name,
    endpoint_subscription=endpoint_subscription,
    endpoint_resource_group=endpoint_resource_group,
    model=model_name,
    task_type=task,
    n_epochs=n_epochs,
    batch_size=batch_size,
    learning_rate_multiplier=learning_rate_multiplier,
    suffix=suffix,
    n_ctx=n_ctx,
    lora_dim=lora_dim,
    weight_decay_multiplier=weight_decay_multiplier
)

aoai_pipeline.display_name = "aoai-finetuning-with-data-asset"
aoai_pipeline.settings.default_compute = cpu_compute_target
aoai_pipeline.tags = {}
pipeline_submitted_job_base = ml_client.jobs.create_or_update(
    aoai_pipeline,
    experiment_name="aoai-finetuning",
    skip_validation=True,
    compute=cpu_compute_target,
)
ml_client.jobs.stream(pipeline_submitted_job_base.name)

In [None]:
ml_client.jobs.stream(pipeline_submitted_job_base.name)

In [None]:
"""Upload data with training_import_path"""
aoai_pipeline = aoai_finetuning(
    training_import_path=training_import_path,
    validation_import_path=validation_import_path,
    compute_name=cpu_compute_target,
    endpoint_name=endpoint_name,
    endpoint_subscription=endpoint_subscription,
    endpoint_resource_group=endpoint_resource_group,
    model=model_name,
    task_type=task,
    n_epochs=n_epochs,
    batch_size=batch_size,
    learning_rate_multiplier=learning_rate_multiplier,
    suffix=suffix,
    n_ctx=n_ctx,
    lora_dim=lora_dim,
    weight_decay_multiplier=weight_decay_multiplier
)

aoai_pipeline.display_name = "aoai-finetuning-with-data-uri"
aoai_pipeline.settings.default_compute = cpu_compute_target
aoai_pipeline.tags = {}
pipeline_submitted_job_base = ml_client.jobs.create_or_update(
    aoai_pipeline,
    experiment_name="aoai-finetuning-training-data",
    skip_validation=True,
    compute=cpu_compute_target,
)
ml_client.jobs.stream(pipeline_submitted_job_base.name)

In [None]:
ml_client.jobs.stream(pipeline_submitted_job_base.name)

JobException: The output streaming for the run interrupted.
But the run is still executing on the compute target. 
Details for canceling the run can be found here: https://aka.ms/aml-docs-cancel-run
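
When log streaming is interrupted like this, the run itself keeps going, so one option is to poll the job status instead of streaming. The following is a minimal sketch of such a polling loop. The helper name, the default intervals, and the set of terminal statuses are assumptions; the status source is passed in as a callable so the loop can be demonstrated without Azure.

```python
import time
from typing import Callable

# Assumed set of terminal job statuses.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll `get_status` until it returns a terminal status or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Job still in status {status!r} after timeout")
        sleep(poll_seconds)


# Demonstration with a scripted sequence of statuses and no real sleeping.
scripted = iter(["Running", "Running", "Completed"])
final_status = wait_for_terminal_status(lambda: next(scripted), sleep=lambda _: None)
print(final_status)
```

In the notebook, `get_status` could be `lambda: ml_client.jobs.get(pipeline_submitted_job_base.name).status`.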