## Bring your model into AzureML

This sample shows how to import a Hugging Face model and register it as a custom model, and then how to deploy the registered model using a custom environment and scoring script.

### What models are supported for import?
Model files in a public GitHub repository, or in Azure Blob Storage with public read and list access, can be imported.
In this sample we will import the Hugging Face [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b) model and then use a custom script to deploy the model in the workspace.

### Outline
1. Set up pre-requisites
2. Create and execute pipeline job
3. Deploy model
4. Test the endpoint
5. Clean up resources

## 1. Set up pre-requisites
* Install dependencies
* Connect to the AzureML workspace. Learn more at [set up SDK authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-setup-authentication?tabs=sdk). Replace `<WORKSPACE_NAME>`, `<RESOURCE_GROUP>` and `<SUBSCRIPTION_ID>` below.
* Connect to `azureml` system registry
* Set up compute
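
The imports in the next cell assume that the `azure-ai-ml`, `azure-identity` and `pandas` packages are installed. A minimal sketch for checking this up front using only the standard library; the helper name `missing_packages` is hypothetical and not part of the original sample:

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Package list assumed from the imports used in this sample.
required = ["azure-ai-ml", "azure-identity", "pandas"]
not_installed = missing_packages(required)
if not_installed:
    print("Install with: pip install " + " ".join(not_installed))
```

If the list comes back empty, nothing is printed and the import cell should run as is.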

In [47]:
import json
import pandas as pd
from azure.ai.ml import MLClient, UserIdentityConfiguration
from azure.ai.ml import load_environment
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    OnlineRequestSettings,
    ProbeSettings,
    CodeConfiguration,
)
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

### 1.2 Configure credential

We use `DefaultAzureCredential` to get access to the workspace.
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.

If it does not work for you, see these references for other available credentials: [configure credential example](https://aka.ms/azureml-workspace-configuration), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [74]:
try:
    credential = DefaultAzureCredential()
    # Check that the credential can successfully obtain a token.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

### 1.3 Get a handle to the workspace and the registry

We use the config file to connect to a workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](https://aka.ms/azureml-workspace-configuration).

If a config file is not available, replace the following placeholders in the code:
- SUBSCRIPTION_ID
- RESOURCE_GROUP
- WORKSPACE_NAME
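
`MLClient.from_config` reads the workspace details from a `config.json` file. A minimal sketch of the content such a file is expected to hold, assuming the three keys `subscription_id`, `resource_group` and `workspace_name`; the helper name `workspace_config` is hypothetical, and the placeholder values must be replaced with real ones:

```python
import json


def workspace_config(subscription_id, resource_group, workspace_name):
    """Build the dictionary that a workspace config.json is assumed to contain."""
    return {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }


# Print the JSON so it can be saved as config.json next to the notebook.
config = workspace_config("<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<WORKSPACE_NAME>")
print(json.dumps(config, indent=2))
```

Saving the printed JSON as `config.json` in the working directory should let the `from_config` call succeed without falling back to the explicit parameters.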

In [None]:
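# Get a handle to the workspace, preferring a local config file
try:
    mlclient_ws = MLClient.from_config(credential=credential)
except Exception:
    # No usable config file was found: fall back to explicit workspace details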
    mlclient_ws = MLClient(
        credential,
        subscription_id="<SUBSCRIPTION_ID>",
        resource_group_name="<RESOURCE_GROUP>",
        workspace_name="<WORKSPACE_NAME>",
    )

mlclient_registry = MLClient(credential, registry_name="azureml")

mlclient_ws

### 1.4 Compute target setup

#### Create or Attach existing AmlCompute
A compute target is required to execute the pipeline job. In this tutorial, you create an AmlCompute cluster as the compute resource for that job.

#### Creation of AmlCompute takes approximately 5 minutes
If an AmlCompute with that name already exists in your workspace, this code skips the creation process.
As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [None]:
from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

compute_name = "cpu-cluster"

try:
    _ = mlclient_ws.compute.get(compute_name)
    print("Found existing compute target.")
except ResourceNotFoundError:
    print("Creating a new compute target...")
    compute_config = AmlCompute(
        name=compute_name,
        type="amlcompute",
        size="STANDARD_DS12_V2",
        idle_time_before_scale_down=120,
        min_instances=0,
        max_instances=6,
    )
    mlclient_ws.begin_create_or_update(compute_config).result()

## 2. Create and execute pipeline job

### 2.1 Fetch components from the azureml production registry

In [51]:
download_model = mlclient_registry.components.get(name="download_model", label="latest")
register_model = mlclient_registry.components.get(name="register_model", label="latest")

### 2.2 Create a pipeline job and submit the run

In [58]:
@pipeline()
def import_model(
    model_id,
    custom_model_name=None,
    model_source="Huggingface",
    model_type="custom_model",
    model_metadata=None,
    model_version=None,
    registry_name=None,
):
    download = download_model(
        model_source=model_source,
        model_id=model_id,
    )

    register = register_model(
        model_type=model_type,
        model_name=custom_model_name,
        registry_name=registry_name,
        model_metadata=model_metadata,
        model_download_metadata=download.outputs.model_download_metadata,
        model_path=download.outputs.model_output,
        model_version=model_version,
    )

    return {"model_registration_details": register.outputs.registration_details}

In [44]:
import_model_id = "databricks/dolly-v2-12b"

In [81]:
job = import_model(model_id=import_model_id)
job.identity = UserIdentityConfiguration()
job.display_name = f"import {import_model_id}"
job.settings.default_compute = compute_name

In [82]:
submitted_job = mlclient_ws.jobs.create_or_update(job)

# wait for job to complete
mlclient_ws.jobs.stream(submitted_job.name)
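
`jobs.stream` follows the logs until the run ends, but the later cells only make sense if the run actually succeeded. A minimal sketch of an explicit check, assuming the object returned by `jobs.get` exposes a `status` string whose success value is `"Completed"`; the helper name `require_completed` is hypothetical:

```python
def require_completed(ml_client, job_name):
    """Fetch the job by name and raise if it did not finish successfully."""
    job = ml_client.jobs.get(job_name)
    if job.status != "Completed":
        raise RuntimeError(f"Job {job_name} ended with status {job.status!r}")
    return job


# Usage in this notebook (needs the workspace client and the submitted job):
# require_completed(mlclient_ws, submitted_job.name)
```

Failing here gives a clearer error than a missing output file in the download step.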

### 2.3 Download pipeline run output

In [None]:
mlclient_ws.jobs.download(
    name=submitted_job.name,
    output_name="model_registration_details",
    download_path="./",
)

#### 2.3.1 Initialise model registration details

In [92]:
model_registeration_details = {}
path = "./named-outputs/model_registration_details/registration_details"
with open(path) as f:
    model_registeration_details = json.load(f)

Get registered model id

In [93]:
model_id = model_registeration_details["id"]
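
The cells above assume that the downloaded file is JSON and contains an `id` key. A minimal sketch that fails with a clearer message when the file is missing or lacks that key; the helper name `load_registered_model_id` is hypothetical, and only the `id` key is taken from this sample:

```python
import json
from pathlib import Path


def load_registered_model_id(path):
    """Read the registration details file and return the registered model id."""
    details_path = Path(path)
    if not details_path.is_file():
        raise FileNotFoundError(f"Registration details not found at {details_path}")
    details = json.loads(details_path.read_text())
    if "id" not in details:
        raise KeyError("Registration details do not contain an 'id' field")
    return details["id"]


# Usage in this notebook, after the pipeline output has been downloaded:
# model_id = load_registered_model_id(
#     "./named-outputs/model_registration_details/registration_details"
# )
```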

## 3. Deploy model
Next we deploy the registered custom model. This requires us to:
1. Create an online endpoint
2. Create custom environment

### 3.1 Create an online endpoint

In [106]:
endpoint_name = "dolly-v2-12b"
# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    auth_mode="key",
)
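
Endpoint names generally need to be unique within an Azure region, so a fixed name such as `dolly-v2-12b` may already be taken. A minimal sketch that appends a timestamp suffix; the helper name `unique_endpoint_name` is hypothetical, and the 32-character limit is an assumption to verify against the current AzureML naming rules:

```python
import time


def unique_endpoint_name(base, max_len=32):
    """Append a Unix-timestamp suffix to `base`, keeping the result within max_len."""
    suffix = str(int(time.time()))
    # Trim the base so that base + "-" + suffix fits in max_len characters.
    trimmed = base[: max_len - len(suffix) - 1].rstrip("-")
    return f"{trimmed}-{suffix}"


print(unique_endpoint_name("dolly-v2-12b"))
```

If the name is changed this way, the same value has to be used for `endpoint_name` in every later cell.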

In [107]:
mlclient_ws.online_endpoints.begin_create_or_update(endpoint).wait()

### 3.1 Create environment for deployment

#### 3.1.1 Create env
Create the environment here, or skip to section 3.1.2 to use an environment that already exists in the workspace.

In [108]:
environment_spec_file = "./env/spec.yaml"

In [None]:
deploy_env = mlclient_ws.environments.create_or_update(
    load_environment(source=environment_spec_file)
)

deploy_env.id

#### 3.1.2 Fetch existing env details

In [None]:
deploy_env = mlclient_ws.environments.get(
    name="custom-transformers-deployment", label="latest"
)
deploy_env.id

### 3.2 Initialise deployment parameters

In [110]:
deployment_name = "default"
instance_type = "Standard_ND40rs_v2"  # GPU SKU, so the model can be loaded with CUDA
instance_count = 1

# request settings
max_concurrent_requests_per_instance = 1
request_timeout_ms = 60000
max_queue_wait_ms = 60000

### 3.3 Deploy model

In [121]:
deployment = ManagedOnlineDeployment(
    code_configuration=CodeConfiguration(code="./code", scoring_script="predict.py"),
    environment=deploy_env.id,
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model_id,
    instance_type=instance_type,
    instance_count=instance_count,
    request_settings=OnlineRequestSettings(
        max_concurrent_requests_per_instance=max_concurrent_requests_per_instance,
        request_timeout_ms=request_timeout_ms,
        max_queue_wait_ms=max_queue_wait_ms,
    ),
)

In [None]:
mlclient_ws.online_deployments.begin_create_or_update(deployment).wait()

### 3.4 Update endpoint to take traffic

In [None]:
endpoint.traffic = {deployment_name: 100}
mlclient_ws.begin_create_or_update(endpoint).result()
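
The `traffic` mapping assigns a percentage of requests to each deployment. A minimal sketch of a sanity check to run before sending the update, assuming the service expects the percentages to total either 0 or 100; the helper name `validate_traffic` is hypothetical:

```python
def validate_traffic(traffic):
    """Return `traffic` unchanged if its percentages look valid, else raise."""
    for name, percent in traffic.items():
        if not 0 <= percent <= 100:
            raise ValueError(f"Traffic for {name!r} must be between 0 and 100")
    total = sum(traffic.values())
    if total not in (0, 100):
        raise ValueError(f"Traffic percentages total {total}, expected 0 or 100")
    return traffic


print(validate_traffic({"default": 100}))
```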

## 4. Test the endpoint
Now that we have an active deployment, we can invoke the endpoint using a test payload.

In [None]:
inference_payload_file = "./data/payload.json"
response = mlclient_ws.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name=deployment_name,
    request_file=inference_payload_file,
)

In [None]:
response
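
`online_endpoints.invoke` returns the response body as a string. A minimal sketch for turning it into Python objects, assuming the scoring script returns JSON (the actual shape depends on `predict.py`, which is not shown here); the helper name `parse_response` is hypothetical:

```python
import json


def parse_response(raw):
    """Decode a JSON response body, returning the raw text if it is not JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


# Illustrative input only; the real payload comes from the endpoint.
print(parse_response('{"output": "example"}'))
```

In this notebook the call would be `parse_response(response)`.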

## 5. Clean up resources
Don't forget to delete the online endpoint; otherwise the billing meter keeps running for the compute used by the endpoint.

In [None]:
mlclient_ws.online_endpoints.begin_delete(name=endpoint_name).wait()