# Internal notes
- descriptions should be refined a lot
- not tested for work

# Open Questions
- Do we customize the deployment with a scoring script and explain the scoring script in this tutorial, or do we simply deploy with the automatically generated scoring script and link to [this article](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-mlflow-models-online-endpoints?tabs=cli#customizing-mlflow-model-deployments) so learners can see how to use a scoring script?


# Deploy the model as an online endpoint

## Working Outline
In this tutorial, you deploy your machine learning model as a web service in the Azure cloud, known as an [`online endpoint`](https://docs.microsoft.com/azure/machine-learning/concept-endpoints).

To deploy a machine learning service, you usually need:

* The model assets (file, metadata) that you want to deploy. You've already registered these assets in your training job.
* Some code to run as a service. This code, called the entry script or scoring script, receives the data submitted to the deployed web service, passes it to the model, and returns the model's response to the client. The script is specific to your model: it must understand the data that the model expects and returns. With an MLflow model, as in this tutorial, this script is created for you automatically. For samples of scoring scripts, see the [azureml-examples repository](https://github.com/Azure/azureml-examples/tree/sdk-preview/sdk/endpoints/online).

## Prerequisites
If you already completed tutorial 4 or tutorial 5, you have everything you need.

If you’re starting from here, you need to do the following: 

- Create a workspace and a compute resource.
- Configure the development environment as described in tutorial 2. Skip this step if you've already done it. (TODO: make explicit what additional environment configuration is needed for inferencing.)
- We’ll provide the model files ~~(they’ll register later)~~.
- Make sure you have enough quota for the compute resources (VM SKUs).

## Connect to the workspace (configure auth)
Before you dive into the code, you need to connect to your Azure ML workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning.

We use `DefaultAzureCredential` to get access to the workspace. `DefaultAzureCredential` handles most Azure SDK authentication scenarios.

If it doesn't work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
# Handle to the workspace
from azure.ai.ml import MLClient

# Authentication package
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id="f66b853e-91bc-4852-9a4e-506ac873520a",
    resource_group_name="rg-azureml",
    workspace_name="ml-lab",
)

## Import required AzureML libraries for model deployment

In [None]:
# import libraries
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
)

## Check what you need
As a reminder, you need to specify the following resources in the definition of an online deployment:

- Model:
    - Registered
- Environment (specific to model):
    - conda.yaml and Docker image (you don’t need to install Docker; you need a link to the Docker image)
    - Automatically generated for MLflow models, so you don't need to specify it this time. If you were using a custom model, you'd have to specify the environment. (TODO: put a link to how this can be done?)
- Compute resource
    - The VM SKU must be specified
- Scoring script
    - The script that takes input requests and returns the scored results.
    - Automatically generated for MLflow models, so you don't need to specify it this time.

## Check if the model is registered
If you already completed tutorial 4 or 5, your model was registered in the training script.
Registering the model is a recommended best practice.

You can check the **Models** page in Azure ML studio to identify the latest version of your registered model. Alternatively, the code below retrieves the latest version number for you to use.

In [None]:
registered_model_name = "credit_defaults_model"

# Let's pick the latest version of the model
latest_model_version = max(
    [int(m.version) for m in ml_client.models.list(name=registered_model_name)]
)
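The `int()` conversion in the cell above matters because model versions are returned as strings. The following sketch shows the difference using hypothetical stand-in objects instead of a live workspace; only the `version` attribute is modeled, and the version values are made up for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the objects returned by ml_client.models.list();
# only the string-typed `version` attribute matters here.
models = [SimpleNamespace(version=v) for v in ["1", "2", "9", "10"]]

# Comparing strings compares text, so "9" sorts after "10"
latest_as_string = max(m.version for m in models)

# Converting to int first compares the actual version numbers
latest_as_int = max(int(m.version) for m in models)

print(latest_as_string, latest_as_int)
```

Without the conversion, a model with ten or more versions would silently resolve to an older version.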

## Understand the scoring script (MAYBE?)

[Customizing MLflow model deployments with scoring script](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-mlflow-models-online-endpoints?tabs=cli#customizing-mlflow-model-deployments)

### TODO:
- Can we show the content of the scoring script (autogenerated by MLflow) and explain it
    - mention that for custom models (not mlflow), you'd need to provide a scoring script and validate it.
    - TODO: Link to an article that shows how to do this


> [!TIP]
> The format of the scoring script for online endpoints is the same format that's used in the preceding version of the CLI and in the Python SDK.

For a custom (non-MLflow) model, the script that you specify in `CodeConfiguration(scoring_script="score.py")` must have an `init()` function and a `run()` function.


This example uses the [score.py file](https://github.com/Azure/azureml-examples/blob/main/sdk/python/endpoints/online/model-1/onlinescoring/score.py):
__score.py__
:::code language="python" source="~/azureml-examples-main/cli/endpoints/online/model-1/onlinescoring/score.py" :::

The `init()` function is called when the container is initialized or started. Initialization typically occurs shortly after the deployment is created or updated. Write logic here for global initialization operations like caching the model in memory (as we do in this example). The `run()` function is called for every invocation of the endpoint and should do the actual scoring and prediction. In the example, we extract the data from the JSON input, call the scikit-learn model's `predict()` method, and then return the result.
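To make the `init()`/`run()` contract concrete, here is a minimal sketch that you can run locally. It is not the linked `score.py`: the `_StandInModel` class is a hypothetical placeholder so that the example needs no model file, and the `"data"` key in the request body is an assumption about the input format.

```python
import json
import logging


class _StandInModel:
    """Hypothetical placeholder for a real model; 'predicts' the sum of each row."""

    def predict(self, rows):
        return [sum(row) for row in rows]


model = None


def init():
    """Called once when the container starts: load the model and keep it in memory."""
    global model
    # A real scoring script would load the registered model files here instead
    model = _StandInModel()
    logging.info("Init complete")


def run(raw_data):
    """Called for every request: parse the JSON body, score it, return the result."""
    data = json.loads(raw_data)["data"]  # assumed request shape: {"data": [[...], ...]}
    return model.predict(data)


# Local smoke test of the contract, with no endpoint involved
init()
print(run(json.dumps({"data": [[1, 2, 3], [4, 5, 6]]})))
```

Calling `init()` and `run()` by hand like this is a quick way to catch parsing errors before you deploy a custom scoring script.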

## Endpoints and deployments

After you train a machine learning model, you need to deploy the model so that others can use it to do inferencing. In Azure Machine Learning, you can use **endpoints** and **deployments** to do so.

An **endpoint** is an interface, based on the HTTPS protocol, that clients can call to receive the inferencing (scoring) output of a trained model. It provides:
- Key-based and token-based authentication
- SSL termination 
- A stable scoring URI (endpoint-name.region.inference.ml.azure.com)
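
As a rough illustration of what a client outside the SDK does with these pieces, the sketch below builds (but doesn't send) an HTTPS request that carries the key in an `Authorization: Bearer` header. The URI, the `/score` path, the key, and the payload are placeholders that we assume for illustration; use the values of your own endpoint.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, key, payload):
    """Build a POST request with a JSON body and the key sent as a bearer token."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(scoring_uri, data=body, method="POST")
    request.add_header("Content-Type", "application/json")
    request.add_header("Authorization", f"Bearer {key}")
    return request


# Placeholder values (assumptions): replace with your endpoint's scoring URI and key
request = build_scoring_request(
    "https://endpoint-name.region.inference.ml.azure.com/score",
    "<your-endpoint-key>",
    {"data": [[1, 2]]},
)
print(request.get_method(), request.full_url)
# To actually send the request: urllib.request.urlopen(request)
```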


A **deployment** is a set of resources required for hosting the model that does the actual inferencing. 

A single endpoint can contain multiple deployments. Endpoints and deployments are independent Azure Resource Manager resources that appear in the Azure portal.

Azure Machine Learning uses the concept of endpoints and deployments to implement different types of endpoints: [online endpoints](https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints#what-are-online-endpoints) and [batch endpoints](https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints#what-are-batch-endpoints). In this tutorial, we'll walk you through the steps of implementing an online endpoint—that is, an endpoint used for receiving data from clients and sending back responses in real-time.

## Create an online endpoint

Now that you have a registered model and an inference script (auto-generated this time), it's time to create your online endpoint. The endpoint name needs to be unique in the entire Azure region. For this tutorial, you'll create a unique name using a universally unique identifier ([`UUID`](https://en.wikipedia.org/wiki/Universally_unique_identifier)). For more information on the endpoint naming rules, see [managed online endpoint limits](how-to-manage-quotas.md#azure-machine-learning-managed-online-endpoints).

> [!NOTE]
> Expect the endpoint creation to take approximately 6 to 8 minutes.

> [!TIP]
> * `auth_mode` : Use `key` for key-based authentication. Use `aml_token` for Azure Machine Learning token-based authentication. A `key` doesn't expire, but `aml_token` does expire. For more information on authenticating, see [Authenticate to an online endpoint](how-to-authenticate-online-endpoint.md).
> * Optionally, you can add a description and tags to your endpoint.

In [None]:
import uuid

# Create a unique name for the endpoint
online_endpoint_name = "credit-endpoint-" + str(uuid.uuid4())[:8]
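
If you change the prefix, it can help to check the resulting name locally before you submit it. The sketch below does that. The rules it encodes (3 to 32 characters, starts with a letter, ends with a letter or digit, only letters, digits, and hyphens in between) are our assumption of the naming rules; verify them against the limits page before you rely on them. The helper names are hypothetical.

```python
import re
import uuid

# Assumed naming rules (verify against the limits page): 3-32 characters,
# starts with a letter, ends with a letter or digit, and contains only
# letters, digits, and hyphens.
ENDPOINT_NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]{1,30}[A-Za-z0-9]$")


def make_endpoint_name(prefix="credit-endpoint-"):
    """Append the first 8 characters of a random UUID to the prefix."""
    return prefix + str(uuid.uuid4())[:8]


def is_valid_endpoint_name(name):
    """Check a candidate name against the assumed rules above."""
    return ENDPOINT_NAME_PATTERN.fullmatch(name) is not None


candidate = make_endpoint_name()
print(candidate, len(candidate), is_valid_endpoint_name(candidate))
```

With the default prefix, the name is 24 characters long, which leaves little room for a longer prefix under the assumed 32-character limit.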

We'll create the endpoint using the `ManagedOnlineEndpoint` class.

In [None]:
# define an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="this is an online endpoint",
    auth_mode="key",
    tags={
        "training_dataset": "credit_defaults",
        "model_type": "sklearn.GradientBoostingClassifier",
    },
)

Using the `MLClient` created earlier, we'll now create the endpoint in the workspace. `begin_create_or_update` starts the endpoint creation and returns a poller; calling `.result()` on the poller waits until the creation finishes.

In [None]:
# create the online endpoint
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()

print(f"Endpoint {endpoint.name} provisioning state: {endpoint.provisioning_state}")

Once you've created an endpoint, you can retrieve it as follows:

In [None]:
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)

print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

## Deploy the model to the endpoint

A deployment is a set of resources required for hosting the model that does the actual inferencing. Once the endpoint is created, deploy the model with the entry script. Each endpoint can have multiple deployments, and you can specify rules that direct traffic to them. Here, you'll create a single deployment that handles 100% of the incoming traffic. We've chosen an arbitrary color name (*blue*) for the deployment. We'll create the deployment for our endpoint using the `ManagedOnlineDeployment` class.

Note: You don't need to specify an environment, because we're using an MLflow model. The environment is generated automatically for MLflow models.

In [None]:
# choosing the model to deploy. Here, we use the latest version of our registered model
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)

# define an online deployment
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
    # No environment argument: it's generated automatically for MLflow models
)

Using the `MLClient` created earlier, we'll now create the deployment in the workspace. `begin_create_or_update` starts the deployment creation and returns a poller; calling `.result()` on the poller waits until the creation finishes.

In [None]:
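# create the online deployment, then route all incoming traffic to it
blue_deployment = ml_client.online_deployments.begin_create_or_update(blue_deployment).result()
endpoint.traffic = {"blue": 100}
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()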

## Check the status of the endpoint
You can check the status of the endpoint to see whether the model was deployed without error:

In [None]:
endpoint = ml_client.online_endpoints.get(name=online_endpoint_name)
print(f"Name: {endpoint.name}\nStatus: {endpoint.provisioning_state}\nDescription: {endpoint.description}")

## Test with a sample query (Manually send test data)

Now that the model is deployed to the endpoint, you can run inference with it.

Create a sample request file that follows the format expected by the `run` method of the scoring script. Start by creating a directory to store it:

In [None]:
# Create a directory to store a sample request file
import os
deploy_dir = "./deploy"
os.makedirs(deploy_dir, exist_ok=True)

Then write the sample request file to that directory:

In [None]:
%%writefile {deploy_dir}/sample-request.json
{
  "input_data": {
    "columns": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22],
    "index": [0, 1],
    "data": [
            [20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0],
            [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8]
        ]
  }
}
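
Hand-written payloads like the one above are easy to get wrong, for example by dropping a value from a row. As a sketch, the hypothetical helper below builds the same `input_data` structure from a list of rows and checks that every row has the expected number of values. The column count of 23 is taken from the sample request.

```python
import json


def build_request(rows, n_columns):
    """Build the request body and check each row against the column count."""
    for i, row in enumerate(rows):
        if len(row) != n_columns:
            raise ValueError(f"row {i} has {len(row)} values, expected {n_columns}")
    return {
        "input_data": {
            "columns": list(range(n_columns)),
            "index": list(range(len(rows))),
            "data": rows,
        }
    }


rows = [
    [20000, 2, 2, 1, 24, 2, 2, -1, -1, -2, -2, 3913, 3102, 689, 0, 0, 0, 0, 689, 0, 0, 0, 0],
    [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8],
]
payload = build_request(rows, n_columns=23)

# The serialized payload could be written to sample-request.json instead of typing it by hand
request_json = json.dumps(payload, indent=2)
print(len(payload["input_data"]["columns"]), len(payload["input_data"]["data"]))
```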

In [None]:
# test the blue deployment with some sample data
ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    request_file="./deploy/sample-request.json",
    deployment_name="blue",
)

## Get logs of the deployment
Check the logs to see whether the endpoint and deployment were invoked successfully.
If you run into errors, see [Troubleshooting online endpoints deployment](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-online-endpoints?tabs=cli).

In [None]:
logs = ml_client.online_deployments.get_logs(name="blue", endpoint_name=online_endpoint_name, lines=50)
print(logs)

## Create a second deployment 
Deploy the model as a second deployment called `green`.
In this example we deploy the same model again, but in a real use case you can deploy a new model.

In [None]:
# picking the model to deploy. Here we use the latest version of our registered model
model = ml_client.models.get(name=registered_model_name, version=latest_model_version)

# define an online deployment
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name=online_endpoint_name,
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)

# create the online deployment
green_deployment = ml_client.online_deployments.begin_create_or_update(green_deployment).result()

## Scaling online endpoints
In this section, you manually increase the number of VM instances for the green deployment only.
You can also set up autoscaling; see [Autoscale online endpoints - Azure Machine Learning | Microsoft Learn](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-autoscale-endpoints?tabs=python).
To change a deployment setting such as the instance count, update the deployment definition and submit it again:

In [None]:
# update definition of the deployment
green_deployment.instance_count = 2
# update the deployment
ml_client.online_deployments.begin_create_or_update(green_deployment).result()

## Update traffic allocation for deployments
You can split production traffic between deployments. Traffic allocation is configured on the endpoint, not on the individual deployments.
Once you've tested your green deployment, allocate a small percentage of traffic to it:

In [None]:
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
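
To get a feel for what a 90/10 split means in practice, the sketch below simulates it locally with a weighted random choice. This is only an illustration, not how the service routes requests internally. The `simulate_routing` helper is hypothetical, and for simplicity it assumes that the percentages add up to 100.

```python
import random
from collections import Counter


def simulate_routing(traffic, n_requests, seed=0):
    """Assign each simulated request to a deployment according to the traffic weights."""
    if sum(traffic.values()) != 100:
        raise ValueError("traffic percentages should add up to 100")
    rng = random.Random(seed)
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n_requests))


counts = simulate_routing({"blue": 90, "green": 10}, n_requests=1000)
print(counts)
```

With only a handful of requests, the green deployment can easily receive none at all: across 20 independent requests at 10%, there is roughly a 12% chance (0.9 to the power of 20) that every request goes to blue. Keep that in mind when you read the logs after a short test run.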

You can test traffic allocation by invoking the endpoint several times:

In [None]:
# You can invoke the endpoint several times
for i in range(20):
    ml_client.online_endpoints.invoke(
        endpoint_name=online_endpoint_name,
        request_file="./deploy/sample-request.json",
    )

Check the logs of the green deployment to confirm that it received requests and scored them successfully:

In [None]:
ml_client.online_deployments.get_logs(name="green", endpoint_name=online_endpoint_name, lines=50)

## Roll out the new deployment and delete the old deployment
Once you're fully satisfied with your green deployment, switch all traffic to it.

In [None]:
endpoint.traffic = {"blue": 0, "green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Remove the old (blue) deployment:

In [None]:
ml_client.online_deployments.begin_delete(
    name="blue", endpoint_name=online_endpoint_name
).result()

## Next Steps
- Mirror traffic
- Batch deployment articles