# Finetuning and saving to the model registry

In this notebook, you build on your previous work to finetune a model and save it in the Azure ML Registry.
To do so, you will:

1. Find your MLflow URI
2. Modify the Axolotl YML file to point to your MLflow URI
3. Submit your finetuning job and monitor it

## Goal

Configuring MLflow is simple: with Axolotl, you only need to point to an MLflow tracking URI, which is specific to your workspace.

In this notebook, you will extract the MLflow URI using the SDK. You will then use this information in future notebooks to integrate Axolotl with MLflow.

> Note: Retrieving the MLflow URI requires you to be logged in to Azure. If you are running this notebook interactively, you can use `DefaultAzureCredential()` to authenticate. If you intend to run this code in a headless configuration (an AzureML component, or a serverless deployment), consider switching to a non-interactive authentication method. Alternative ways to retrieve your MLflow URI are described [on this page](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow-configure-tracking?view=azureml-api-2&tabs=python%2Cmlflow).

### A| Finding your MLflow URI

In the previous notebook, you learned how to finetune an existing model (Phi1) on a dataset for a single epoch. The resulting model was stored locally on disk in a folder called `output_dir` (the name of the folder differs depending on the configuration YML file you use).

To track your experiments properly, you will use MLflow instead. MLflow handles model tracking and registration, so that tracked models can be used in downstream processes.

In [1]:
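# SDK v2 client and credential, used below to look up the MLflow tracking URI
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
# SDK v1 workspace handle, used below for compute and job submission
import azureml.core
workspace = azureml.core.Workspace.from_config()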

In [2]:
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
mlflow_tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
print(mlflow_tracking_uri)

Found the config file in: /config.json


azureml://australiaeast.api.azureml.ms/mlflow/v1.0/subscriptions/68092087-0161-4fb5-b51d-32f18ac56bf9/resourceGroups/aml-au/providers/Microsoft.MachineLearningServices/workspaces/aml-au
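
The URI printed above follows a recognizable pattern: region, subscription ID, resource group, and workspace name. As a quick sanity check that it points at the workspace you expect, the sketch below splits a URI of that shape into its parts. The `parse_tracking_uri` helper is hypothetical (it is not part of the Azure SDK), the pattern is inferred only from the output shown here, and the subscription ID in the example is a placeholder.

```python
import re

# Pattern inferred from the URI printed above; treat it as an assumption,
# not as a documented contract of the service.
_TRACKING_URI_PATTERN = re.compile(
    r"^azureml://(?P<region>[^./]+)\.api\.azureml\.ms/mlflow/v1\.0"
    r"/subscriptions/(?P<subscription_id>[^/]+)"
    r"/resourceGroups/(?P<resource_group>[^/]+)"
    r"/providers/Microsoft\.MachineLearningServices"
    r"/workspaces/(?P<workspace_name>[^/]+)$"
)


def parse_tracking_uri(uri):
    """Split an azureml:// tracking URI into its named components."""
    match = _TRACKING_URI_PATTERN.match(uri.strip())
    if match is None:
        raise ValueError(f"Unexpected MLflow tracking URI format: {uri!r}")
    return match.groupdict()


example_uri = (
    "azureml://australiaeast.api.azureml.ms/mlflow/v1.0"
    "/subscriptions/00000000-0000-0000-0000-000000000000"  # placeholder ID
    "/resourceGroups/aml-au"
    "/providers/Microsoft.MachineLearningServices"
    "/workspaces/aml-au"
)
parts = parse_tracking_uri(example_uri)
print(parts["region"], parts["resource_group"], parts["workspace_name"])
```

In the notebook itself, you would pass `mlflow_tracking_uri` instead of `example_uri`.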


In [3]:
config = {}
config["compute_size"] = "Standard_NC24ads_A100_v4"
config["compute_target"] = "a100cluster"
config["compute_node_count"] = 1
config["pytorch_configuration"] = {
    "node_count": 1,     # number of machines in the cluster
    "process_count": 1,  # total processes: GPUs per machine * node_count
}

config["experiment"] = "Finetune_phi1"
config["source_directory"] = "src"
config["environment"] = "axolotl_acpt"
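
The comment on `process_count` describes a small piece of arithmetic: the number of GPUs per machine multiplied by the number of machines. A minimal sketch of that rule follows, assuming one training process per GPU. The `pytorch_configuration` helper is hypothetical, and the two-node, four-GPU case is purely illustrative; look up the GPU count of your own VM size before using it.

```python
def pytorch_configuration(node_count, gpus_per_node):
    """Build the dict passed to PyTorchConfiguration, one process per GPU."""
    if node_count < 1 or gpus_per_node < 1:
        raise ValueError("node_count and gpus_per_node must be at least 1")
    return {
        "node_count": node_count,
        # Total number of processes across the whole cluster
        "process_count": node_count * gpus_per_node,
    }


# The setup used in this notebook: one machine, one process
single_gpu = pytorch_configuration(node_count=1, gpus_per_node=1)
# Illustrative only: two machines with four GPUs each
two_nodes = pytorch_configuration(node_count=2, gpus_per_node=4)
print(single_gpu)
print(two_nodes)
```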

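The `src/phi-ft.yml` file (which ships with Axolotl) needs its MLflow values set according to the [documentation](https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/config.qmd). You can edit the file by hand, or update it programmatically and write the result to a new file, as the next cell does.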

In [4]:
import yaml
with open('src/phi-ft.yml', 'r') as f:
    phi_ft_config = yaml.safe_load(f)

# Point Axolotl's MLflow integration at this workspace and experiment
phi_ft_config["mlflow_tracking_uri"] = mlflow_tracking_uri
phi_ft_config["hf_mlflow_log_artifacts"] = False
phi_ft_config["mlflow_experiment_name"] = config["experiment"]

with open('src/phi-ft-modified.yml', 'w') as f:
    yaml.dump(phi_ft_config, f)
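
The same update can also be written as a pure function on a dictionary, which makes it easy to check before anything is written to disk. This is only a sketch: `with_mlflow_settings` is a hypothetical helper, the three keys are the ones set in the cell above, and the base config here is a stand-in rather than the real contents of `phi-ft.yml`.

```python
import copy


def with_mlflow_settings(base_config, tracking_uri, experiment_name, log_artifacts=False):
    """Return a copy of an Axolotl config dict with the MLflow keys set."""
    if not tracking_uri:
        raise ValueError("tracking_uri must not be empty")
    updated = copy.deepcopy(base_config)  # leave the caller's dict untouched
    updated["mlflow_tracking_uri"] = tracking_uri
    updated["mlflow_experiment_name"] = experiment_name
    updated["hf_mlflow_log_artifacts"] = log_artifacts
    return updated


# Stand-in values, not the real contents of phi-ft.yml
base = {"num_epochs": 1, "datasets": [{"path": "example/dataset"}]}
updated = with_mlflow_settings(base, "azureml://example", "Finetune_phi1")
print(sorted(set(updated) - set(base)))
```

Returning a copy rather than mutating the input keeps the original configuration available for comparison.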

> Note: The configuration below differs from the previous notebook: the training command now calls the newly created `phi-ft-modified.yml` file.

In [5]:
config["training_command"] = "accelerate launch -m axolotl.cli.train phi-ft-modified.yml"
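
Note that the file was written to `src/phi-ft-modified.yml`, while the command refers to `phi-ft-modified.yml`: the path is given relative to the source directory, which is what the job sees as its working directory. One way to keep the two in sync is to derive the command from the path, as in the sketch below. `build_training_command` is a hypothetical helper, not part of Axolotl or the Azure SDK.

```python
import shlex
from pathlib import PurePosixPath


def build_training_command(config_path, source_directory):
    """Return the accelerate command with the config path relative to source_directory."""
    # relative_to raises ValueError if the config file is outside source_directory
    relative = PurePosixPath(config_path).relative_to(PurePosixPath(source_directory))
    return shlex.join(["accelerate", "launch", "-m", "axolotl.cli.train", str(relative)])


command = build_training_command("src/phi-ft-modified.yml", "src")
print(command)
```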

In [6]:
try:
    # Reuse the compute cluster if it already exists in the workspace
    cluster = azureml.core.compute.ComputeTarget(
        workspace=workspace,
        name=config['compute_target']
    )
    print('Found existing compute cluster')
except azureml.core.compute_target.ComputeTargetException:
    # Otherwise provision a new cluster with the requested VM size
    compute_config = azureml.core.compute.AmlCompute.provisioning_configuration(
        vm_size=config['compute_size'],
        max_nodes=config['compute_node_count']
    )
    cluster = azureml.core.compute.ComputeTarget.create(
        workspace=workspace,
        name=config['compute_target'],
        provisioning_configuration=compute_config
    )

# Block until the cluster is ready
cluster.wait_for_completion(show_output=True)

InProgress..
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [7]:
experiment = azureml.core.Experiment(workspace, config['experiment'])

# Describe the distributed PyTorch launch (node and process counts)
distributed_job_config = azureml.core.runconfig.PyTorchConfiguration(**config['pytorch_configuration'])
aml_config = azureml.core.ScriptRunConfig(
    source_directory=config['source_directory'],
    command=config['training_command'],
    environment=azureml.core.Environment.get(workspace, name=config["environment"]),
    compute_target=config['compute_target'],
    distributed_job_config=distributed_job_config,
)
run = experiment.submit(aml_config)

# Tag the run with the key training settings so runs are easy to compare
run.set_tags({
    "environment": config["environment"],
    "epochs": phi_ft_config["num_epochs"],
    "micro_batch_size": phi_ft_config["micro_batch_size"],
    "sequence_len": phi_ft_config["sequence_len"],
    "dataset": phi_ft_config["datasets"][0]["path"]
})

print(f"View run details:\n{run.get_portal_url()}")
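
Besides the portal link, the run can be monitored from the notebook. The sketch below polls a run's status until it reaches a terminal state. It only relies on a `get_status()` method, as offered by the `run` object above; the `wait_for_terminal_state` helper and the `FakeRun` stand-in are hypothetical, and the set of terminal state names is an assumption to check against your SDK version.

```python
import time

# Terminal states assumed for an Azure ML run; verify against your SDK version.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_state(run, poll_seconds=30, max_polls=None, sleep=time.sleep):
    """Poll run.get_status() until the run reaches a terminal state and return it."""
    polls = 0
    while True:
        status = run.get_status()
        print(f"Run status: {status}")
        if status in TERMINAL_STATES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"Run still in state {status!r} after {polls} polls")
        sleep(poll_seconds)


class FakeRun:
    """Stand-in for a submitted run, so the sketch can be tried offline."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_status(self):
        return next(self._statuses)


final_status = wait_for_terminal_state(
    FakeRun(["Queued", "Running", "Completed"]),
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
```

With the real `run`, calling `wait_for_terminal_state(run)` would block until the job ends. The SDK's own `run.wait_for_completion(show_output=True)` is an alternative that also streams the job logs.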