# Objective 

Illustrate mechanisms within AWS SageMaker and Azure ML to deploy and serve machine learning models, including:

- API deployment
- Safe rollout using A/B testing

# Introduction

Model deployment involves taking a trained machine learning model and making it available for real-time predictions or inferences. When it comes to deploying models, there are two primary modes to consider: APIs (Application Programming Interfaces) and edge deployment. These two modes offer distinct advantages and cater to different use cases.

API deployment involves hosting the machine learning model on a server and exposing it as a service through an API. This allows clients or applications to send requests to the API and receive predictions or inferences in response. API deployment is ideal for scenarios where there is a centralized infrastructure and clients have reliable network connectivity. Examples of API deployment include using frameworks like Flask or FastAPI to build a RESTful API for deploying a natural language processing (NLP) model or an image recognition model.

On the other hand, edge deployment brings the model closer to the data source or the client device itself, reducing latency and enabling real-time predictions without relying on a network connection. In this mode, the model runs directly on edge devices such as smartphones, IoT devices, or embedded systems. Edge deployment is particularly valuable in scenarios where low latency or privacy is crucial, or where network connectivity is intermittent. Examples of edge deployment include deploying a computer vision model on a surveillance camera to detect anomalies in real time or deploying a speech recognition model on a smartphone for offline voice commands.

## What are APIs?

A common method to deploy a model on the web is to wrap the saved model as an API service and allow users (clients) to send requests. The service parses incoming requests into the input format the model expects and presents them to the model for inference. The inference is returned to the user as a response. Each request is handled by a specific resource (in our case, a model) that is identified by a unique *endpoint*.

Think of an endpoint as the unique URL that is shared with the client to interface with the model; all clients can do is send a request to the endpoint (i.e., they have no access to any detail on how the response is generated). Intuitively, it is like a storefront (a unique address) where they come to collect their predictions. They do not worry about *how* the predictions are made. An endpoint separates the user-facing "front end" from the model-powered "back end".

But what exactly is an Application Programming Interface (API)?

APIs prescribe the mechanism through which any two computers can exchange information over a network. Since there are many ways to carry out this exchange, it is prudent to formalize it as a set of rules that both sides agree on. One widely adopted set of such rules is captured by the REST principles.

![rest-api](assets/rest-api.drawio.png)

REpresentational State Transfer (REST) is an architectural style; APIs that follow it are called REST-ful and are programming language agnostic. For REST-ful APIs served over HTTP, the key rules are:

- Clients interact with resources through standard HTTP methods, most commonly POST, GET, PUT, and DELETE
- These requests can contain an optional payload (usually a [JSON object](https://www.json.org/json-en.html))
- Every request returns a response with a status code indicating the outcome (2xx: success, 4xx: client-side errors such as an improper request, 5xx: server-side errors)

## Models as APIs

In the context of ML deployment, clients send a POST request with a payload containing the input data the model needs to make a prediction. For example, to get a price prediction, a showroom attaches the features of a diamond as the payload and sends it to the endpoint URL. The server parses this input, presents it to the model, collects the prediction, and sends a response (along with a status code) back to the client. In sum, customers *post* an input and the business serves a response.
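
To make this request/response cycle concrete, here is a minimal sketch that uses only Python's standard library (`http.server` and `urllib`) rather than Flask or FastAPI. The pricing function `toy_price_model`, its coefficients, and the `/predict` path are hypothetical stand-ins for a trained model and a real endpoint URL.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def toy_price_model(features):
    """Hypothetical stand-in for a trained model: price grows linearly with carat."""
    return 4000.0 * float(features["carat"]) + 500.0


class PredictionHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)  # status code, as REST prescribes
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_POST(self):
        if self.path != "/predict":  # only one resource (the model) is exposed
            self._reply(404, {"error": "unknown endpoint"})
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length))  # parse the JSON payload
            self._reply(200, {"price": toy_price_model(features)})
        except (ValueError, KeyError, TypeError):
            self._reply(400, {"error": "expected a JSON object with a numeric 'carat' field"})

    def log_message(self, format, *args):
        pass  # silence per-request logging for the demo


def post_json(url, payload):
    """Client side: POST a JSON payload and return (status code, decoded JSON body)."""
    request = Request(url, data=json.dumps(payload).encode("utf-8"),
                      headers={"Content-Type": "application/json"}, method="POST")
    try:
        with urlopen(request) as response:
            return response.status, json.loads(response.read())
    except HTTPError as error:  # urllib raises on 4xx/5xx; the body is still readable
        return error.code, json.loads(error.read())


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), PredictionHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/predict"
    print(post_json(url, {"carat": 0.5, "cut": "Ideal"}))  # (200, {'price': 2500.0})
    print(post_json(url, {"cut": "Ideal"}))                 # status 400: 'carat' is missing
    server.shutdown()
    server.server_close()
```

Managed services such as SageMaker and Azure ML run a production-grade version of this loop for us; our job is mainly to supply the model and the parsing logic.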

Common Python web frameworks used in production to build REST APIs are [Flask](https://palletsprojects.com/p/flask/) and [FastAPI](https://fastapi.tiangolo.com/). Flask is a popular choice that is used by [SageMaker](https://aws.amazon.com/blogs/machine-learning/part-2-model-hosting-patterns-in-amazon-sagemaker-getting-started-with-deploying-real-time-models-on-sagemaker/) & [Azure ML](https://liupeirong.github.io/amlDockerImage/) to [create a web server for ML models](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-inference-server-http?view=azureml-api-2). Owing to its longer existence, Flask enjoys a wider ecosystem than FastAPI. Beyond these general-purpose frameworks, specialized serving systems also exist. For example, [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) is implemented in C++ and exposes a REST API for TensorFlow models, and hence can be more performant for deep learning models.

<div class="alert alert-block alert-warning">

<b>Business Context (Review)</b> 
    
For this session, consider the case of a popular diamond jeweller, Brilliant Earth, with 30 showrooms across the US, facing a price prediction problem. A common question in its retail outlets is how the price changes when some aspect of the ornament changes. For example, customers often ask: "If I decreased the carat of the diamonds used in this design, by how much would the price reduce?". Such queries often require an expert to intervene on the shop floor, resulting in a subdued customer experience. The company also wants to implement a price predictor tool on its website so customers can engage with the brand better. At the moment, no such tool exists, and the business team estimates that a price predictor will increase both the traffic to the website and the time spent on it.

The dataset used in this session is scraped from the [Brilliant Earth website](https://www.brilliantearth.com/) and hosted on [OpenML](https://www.openml.org/search?type=data&status=active&id=43355).

</div>

An example of a fully fleshed out endpoint for the diamond price prediction problem is [here](https://pgurazada1-diamond-price-predictor.hf.space/). In the rest of this session, we present the details behind building a REST API with the models estimated at their core.

# Setup

**General Imports**

In [None]:
import logging

import pandas as pd

**AWS imports & authentication**

In [None]:
import sagemaker
import boto3

from sagemaker.sklearn.estimator import SKLearn
from sagemaker.sklearn.estimator import SKLearnModel

from sagemaker.session import production_variant

A `sagemaker` session manages our interactions with SageMaker and other AWS services such as S3; it is the cloud equivalent of a fully functional local development setup (i.e., with access to data and compute). We can point a session to a default bucket that will host all the artifacts accessed and created during the session (remember, nothing stays local).

In [None]:
deployment_session = sagemaker.Session(
    default_bucket="sagemaker-deployment-examples"
)

In [None]:
try:
    aws_role = sagemaker.get_execution_role()
except ValueError:
    print("Config file not found on local machine, use SageMaker Studio")

From within SageMaker Studio, the execution role is inherited. Outside the Studio environment, the execution role should be explicitly specified. This execution role should have [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol.html) permissions. Local compute [access](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) should also be [enabled](https://stackoverflow.com/a/47767351).

In [None]:
print(f"AWS execution role associated with the account {aws_role}")

**Azure imports & authentication**

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

from azure.ai.ml import Input
from azure.ai.ml import command

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    CodeConfiguration
)

In [None]:
subscription_id = "5bcad9c4-40fb-4136-b614-cc90116dd8b3"
resource_group = "tf"
workspace = "cloud-teach"

In [None]:
logger = logging.getLogger("azure.core.pipeline.policies.http_logging_policy")
logger.setLevel(logging.WARNING)

From compute within the Azure ML workspace, the default Azure credentials are inherited. Elsewhere, setting `exclude_interactive_browser_credential=False` lets `DefaultAzureCredential` fall back to an interactive browser login to authenticate the Azure account to the Azure ML workspace.

In [None]:
az_credentials = DefaultAzureCredential(
    exclude_interactive_browser_credential=False
)

In [None]:
ml_client = MLClient(
    az_credentials, subscription_id, resource_group, workspace
)

# Data

## AWS

In [None]:
diamonds_df = pd.read_csv('s3://sagemaker-ap-south-1-321112151583/prices/diamond-prices.csv')

In [None]:
diamonds_df.head()

## Azure

In [None]:
for registered_data in ml_client.data.list():
    print(registered_data.name)

In [None]:
diamond_prices_data = ml_client.data.get(
    name="diamond-prices-jan",
    version=1
)

In [None]:
diamonds_df = pd.read_csv(diamond_prices_data.path)

In [None]:
diamonds_df.head()

# Model Training 

## AWS

We estimate two models for the diamond prices data - a decision tree regressor (`dt.py`) and a gradient boosted regressor (`gb.py`).

The input data is hosted in S3 as an unprocessed CSV file (under the `prices/` prefix that is passed to the `train` channel below).

In [None]:
sklearn_dt_estimator = SKLearn(
    entry_point="aws/train/dt.py",
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    volume_size=1,
    role=aws_role,
    sagemaker_session=deployment_session
)

In [None]:
sklearn_dt_estimator.fit(
    inputs={
        'train': 's3://sagemaker-ap-south-1-321112151583/prices/'
    },
    wait=False,
    job_name='2023-06-12-estimate-dt-003'
)

In [None]:
sklearn_dt_estimator.logs()

In [None]:
sklearn_gb_estimator = SKLearn(
    entry_point="aws/train/gb.py",
    framework_version="1.2-1",
    role=aws_role,
    sagemaker_session=deployment_session,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    volume_size=1
)

In [None]:
sklearn_gb_estimator.fit(
    inputs={
        'train': 's3://sagemaker-ap-south-1-321112151583/prices/'
    },
    wait=False,
    job_name='2023-06-12-estimate-gb-003'
)

In [None]:
sklearn_gb_estimator.logs()

There are two key aspects of the training scripts (`dt.py` and `gb.py`) that are new here:

**1. The training workflow is encapsulated within a "main guard"** 

```python
if __name__ == "__main__":
    main()
```

This ensures that the training workflow runs only when the script is executed directly (e.g., from the command line, as SageMaker does with the entry point), and not when it is imported as a module. This is a good practice to ensure that training does not execute when the script is used as part of a larger pipeline.
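
The actual `dt.py` is not reproduced here. A minimal skeleton of a SageMaker script-mode training script with a main guard might look as follows, assuming the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables that SageMaker sets inside the training container (the `train` channel corresponds to the `'train'` key passed to `fit`). The training steps themselves are left as comments.

```python
import argparse
import os


def parse_args(argv=None):
    """Read input/output locations; defaults come from SageMaker's environment variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # 1. Read the CSV file(s) found in args.train
    # 2. Fit the model pipeline
    # 3. Save the fitted pipeline to args.model_dir (SageMaker uploads this folder to S3)
    print(f"Training on {args.train}; saving the model to {args.model_dir}")
    return args


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported
    main()
```

Importing this module (e.g., to reuse `parse_args` in a test) defines the functions without triggering a training run.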

**2. Model pipelines are estimated rather than the models themselves**

```python
preprocessor = make_column_transformer(
        (StandardScaler(), numeric_features),
        (OneHotEncoder(handle_unknown='ignore'), categorical_features)
)

model_dt = DecisionTreeRegressor()

model_pipeline = make_pipeline(preprocessor, model_dt)
```

By estimating a preprocessing pipeline along with the model, we ensure that the data processing is "packaged" with the model. This is a good practice when preprocessing involves standard, lightweight steps, because it avoids potentially costly data transfers between two separate steps (preprocessing and model estimation); extensive preprocessing is better handled through a dedicated pipeline job. Packaging preprocessing with the model also simplifies inference: clients can send raw features, and the fitted transformations are applied automatically before prediction.
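
The effect of packaging preprocessing with the model can be seen on a toy example. The sketch below uses a handful of made-up rows and only two of the categorical columns. It fits the same kind of pipeline and then predicts directly on raw features, including a `cut` value never seen during training, which `handle_unknown='ignore'` encodes as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

numeric_features = ["carat"]
categorical_features = ["cut", "color"]  # subset of the real columns, for brevity

# Made-up rows in the same raw format the endpoint will receive
train_df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.2, 1.5],
    "cut": ["Ideal", "Very Good", "Ideal", "Super Ideal", "Very Good", "Ideal"],
    "color": ["D", "E", "F", "G", "D", "E"],
    "price": [600, 1100, 1900, 4200, 5100, 8800],
})

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown="ignore"), categorical_features),
)
model_pipeline = make_pipeline(preprocessor, DecisionTreeRegressor(random_state=42))
model_pipeline.fit(train_df[numeric_features + categorical_features], train_df["price"])

# Raw rows go straight in; the fitted scaler and encoder are reapplied inside predict.
# "Good" was never seen during training and is encoded as all zeros.
new_rows = pd.DataFrame({"carat": [0.9], "cut": ["Good"], "color": ["E"]})
print(model_pipeline.predict(new_rows))
```

Because the whole pipeline is saved as one object, the inference code never has to re-implement the scaling or encoding.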

The model artifacts saved by the training script are compressed into `model.tar.gz` and uploaded to S3 within the training job's `output` folder; `model_data` points to this file.

In [None]:
sklearn_dt_estimator.model_data

In [None]:
sklearn_gb_estimator.model_data

Note that at this stage we could have selected the best model through hyperparameter tuning. For the purpose of model deployment, however, we are only concerned with obtaining the final model file that represents the best model for the training data.

## Azure

In [None]:
dt_train_job = command(
    inputs={
        "data": Input(type="uri_file", path="azureml:diamond-prices-jan:1")
    },
    code="azure/train/dt.py",
    command="python dt.py --data ${{inputs.data}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    display_name="2023-06-12-decision-tree-regression-example-003",
    experiment_name="2023-06-12-estimate-dt-003"
)

In [None]:
ml_client.create_or_update(dt_train_job)

In [None]:
gb_train_job = command(
    inputs={
        "data": Input(type="uri_file", path="azureml:diamond-prices-jan:1")
    },
    code="azure/train/gb.py",
    command="python gb.py --data ${{inputs.data}}",
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    display_name="2023-06-12-gradient-boosting-regression-example-003",
    experiment_name="2023-06-12-estimate-gb-003"
)

In [None]:
ml_client.create_or_update(gb_train_job)

In [None]:
ml_client.jobs.get("modest_salt_hvgbf8s4sk")

In [None]:
ml_client.jobs.get("elated_seal_7h7br1q8z1")

There are three key aspects of the training scripts (`dt.py` and `gb.py`) that are new here:

**1. The training workflow is encapsulated within a "main guard"** 

```python
if __name__ == "__main__":
    main()
```

This ensures that the training workflow runs only when the script is executed directly (e.g., from the command line), and not when it is imported as a module. This is a good practice to ensure that training does not execute when the script is used as part of a larger pipeline.

**2. Model pipelines are estimated rather than the models themselves**

```python
preprocessor = make_column_transformer(
        (StandardScaler(), numeric_features),
        (OneHotEncoder(handle_unknown='ignore'), categorical_features)
)

model_dt = DecisionTreeRegressor()

model_pipeline = make_pipeline(preprocessor, model_dt)
```

By estimating a preprocessing pipeline along with the model, we ensure that the data processing is "packaged" with the model. This is a good practice when preprocessing involves standard, lightweight steps, because it avoids potentially costly data transfers between two separate steps (preprocessing and model estimation); extensive preprocessing is better handled through a dedicated pipeline job. Packaging preprocessing with the model also simplifies inference: clients can send raw features, and the fitted transformations are applied automatically before prediction.

**3. Given the deep integration of `mlflow` within Azure ML, we can log and register models during the estimation process itself**

```python
mlflow.sklearn.log_model(
        sk_model=model_pipeline,
        registered_model_name="gbr-diamond-price-predictor-june",
        artifact_path="diamond-price-predictor"
)
```

The advantage here is that if a model with the registered name already exists within the Azure ML workspace, a new version of it is registered automatically.

# Creating an Endpoint

Since the gradient boosted model has a better R-squared, let us deploy the gradient boosted model as the first version of the diamond price predictor.

## AWS

### Register a `Model` object

**Create a container image**

The `SKLearnModel` class allows you to package and deploy your scikit-learn model on SageMaker easily. Beyond specifying the location of the model artifacts, to register a model, we need to specify the `entry_point` parameter when creating an instance of the `SKLearnModel` class. The `entry_point` refers to the Python script that contains the code for inference, which is responsible for loading the model and making predictions. By specifying the `entry_point` parameter and providing the inference script, we ensure that SageMaker can correctly load the model and invoke the necessary functions during the deployment process. This allows SageMaker to set up the underlying infrastructure, create the endpoint, and handle the incoming prediction requests using our scikit-learn model.

Under the hood, `SKLearnModel` pairs an AWS-managed scikit-learn serving image, which already contains the necessary runtime dependencies, with our model artifacts and the inference script specified as the entry point. At deployment time, the container loads the artifacts and uses the inference script to serve predictions.

In [None]:
sklearn_dt_estimator.model_data, sklearn_gb_estimator.model_data

In [None]:
model_gb = SKLearnModel(
    model_data=sklearn_gb_estimator.model_data,
    entry_point="aws/infer/inference.py",
    framework_version="1.2-1",
    role=aws_role,
    sagemaker_session=deployment_session
)

In [None]:
model_dt = SKLearnModel(
    model_data=sklearn_dt_estimator.model_data,
    entry_point="aws/infer/inference.py",
    framework_version="1.2-1",
    role=aws_role,
    sagemaker_session=deployment_session
)

**The inference script**

The `inference.py` script tells the `sagemaker` model server how to handle inputs and generate model predictions. SageMaker provides explicit guidelines on the specific functions within this script that are invoked when a prediction request is received.

The `inference.py` file contains detailed comments that provide a clear understanding of the purpose and functionality of each function in the script. To provide a concise overview of the functions in the inference script, refer to the figure below:

![aws-inference](assets/aws-inference.drawio.png)
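
The actual `inference.py` is not reproduced here. As a rough sketch, the four handler functions recognized by the SageMaker scikit-learn serving container (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) could look as follows. It assumes the training script saved the fitted pipeline as `model.joblib` (a hypothetical file name) and that predictions are returned as a JSON list; check the exact contracts (for example, the expected return type of `output_fn`) against the SageMaker scikit-learn documentation.

```python
import io
import json
import os

import pandas as pd


def model_fn(model_dir):
    """Load the fitted pipeline from the model directory (file name is an assumption)."""
    import joblib
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def input_fn(request_body, request_content_type):
    """Deserialize a CSV payload with a header row into a DataFrame of raw features."""
    if request_content_type != "text/csv":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    if isinstance(request_body, bytes):
        request_body = request_body.decode("utf-8")
    return pd.read_csv(io.StringIO(request_body))


def predict_fn(input_data, model):
    """Run the pipeline; preprocessing happens inside model.predict."""
    return model.predict(input_data)


def output_fn(prediction, content_type):
    """Serialize predictions as a JSON list of floats."""
    return json.dumps([float(p) for p in prediction])
```

Keeping the header row in the CSV payload lets `input_fn` recover the column names, which the column transformer inside the pipeline relies on.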

### Infrastructure and Execution

Once the server logic is implemented in the inference script, we define the infrastructure needed to host and serve the model. SageMaker provisions the resources needed to run the model server and creates an endpoint with the specified name. In this process, it uses the serving image, model artifacts, and inference script configured through `SKLearnModel` in the previous steps. Since we pass `wait=False`, the call returns immediately while the endpoint is still being created.

In [None]:
predictor_gb = model_gb.deploy(
    endpoint_name='diamond-price-gb',
    instance_type="ml.m5.xlarge", 
    initial_instance_count=1,
    wait=False
)

### Testing

To test the endpoint created in the previous step, we send a sample of the data to it as test traffic. This helps iron out potential errors before the endpoint is rolled out to customers. Usually, data that the model has never seen before is used to test deployments (we look at monitoring endpoints in further detail in the next session).

The input type for a prediction request to our model as defined in `inference.py` is `csv`.
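
For reference, the request body is simply the feature columns serialized as CSV: a header row followed by one line per diamond. The sketch below builds two made-up rows with the same columns as `sample_Xtest` and produces the bytes that are sent as `Body`.

```python
import pandas as pd

numeric_features = ["carat"]
categorical_features = ["shape", "cut", "color", "clarity", "report", "type"]

# Made-up rows with the same columns as sample_Xtest
sample_Xtest = pd.DataFrame({
    "carat": [0.5, 1.01],
    "shape": ["Round", "Oval"],
    "cut": ["Ideal", "Very Good"],
    "color": ["E", "G"],
    "clarity": ["VS1", "SI1"],
    "report": ["GIA", "GIA"],
    "type": ["natural", "lab"],
})

# Header row + one line per row, no index column, encoded as UTF-8 bytes
body = sample_Xtest[numeric_features + categorical_features].to_csv(header=True, index=False).encode("utf-8")
print(body.decode("utf-8"))
```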

In [None]:
sample_df = diamonds_df.sample(2)

In [None]:
sample_df.info()

In [None]:
numeric_features = ['carat']
categorical_features = ['shape', 'cut', 'color', 'clarity', 'report', 'type']

In [None]:
features = numeric_features + categorical_features

In [None]:
sample_Xtest = sample_df[features]
sample_ytest = sample_df['price']

In [None]:
sample_Xtest

Note that even once the endpoint is in service (status `InService`), it is not publicly accessible. However, it can be invoked from within the AWS account (given the appropriate IAM permissions) through the SageMaker Runtime API. As the code below shows, we serialize the sample data frame created in the previous step to an in-memory CSV payload and present it to the corresponding endpoint.

In [None]:
runtime = boto3.client("sagemaker-runtime")

Let us look at the response from the Gradient Boosted Regressor.

In [None]:
response = runtime.invoke_endpoint(
    EndpointName=predictor_gb.endpoint_name,
    Body=sample_Xtest.to_csv(header=True, index=False).encode("utf-8"),
    ContentType="text/csv"
)

Following REST conventions, the response carries an HTTP status code; a value of 200 confirms that the request succeeded.

In [None]:
response['ResponseMetadata']['HTTPStatusCode']

In [None]:
print(response["Body"].read())
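
Rather than printing the raw bytes, the body can be decoded into Python objects. The helper below is a sketch that assumes the endpoint serializes predictions as JSON (the actual format depends on `inference.py`). Since boto3 usually raises an exception for error responses, the status check is a defensive extra. Note that the body is a stream that can be read only once, so apply the helper to a fresh `invoke_endpoint` result.

```python
import io
import json


def parse_invoke_response(response):
    """Return the predictions from an invoke_endpoint response (assumes a JSON body)."""
    status = response["ResponseMetadata"]["HTTPStatusCode"]
    if status != 200:  # defensive: boto3 usually raises on error responses
        raise RuntimeError(f"Endpoint returned HTTP {status}")
    # Body is a stream of bytes; it can be read only once
    return json.loads(response["Body"].read().decode("utf-8"))


# Offline illustration with a fake response shaped like boto3's (made-up values)
fake_response = {
    "ResponseMetadata": {"HTTPStatusCode": 200},
    "Body": io.BytesIO(b"[2500.0, 3100.5]"),
}
print(parse_invoke_response(fake_response))  # [2500.0, 3100.5]
```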

We can compare this response with the ground truth.

In [None]:
sample_ytest
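
Eyeballing two numbers is fine for a smoke test. To quantify the gap, one option is the absolute percentage error per row. A small sketch with made-up numbers; in practice, substitute the parsed endpoint response and `list(sample_ytest)`.

```python
def absolute_percentage_errors(predicted, actual):
    """Per-row absolute percentage error between predicted and actual prices."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return [abs(p - a) / abs(a) * 100 for p, a in zip(predicted, actual)]


predictions = [2500.0, 3000.0]   # made-up endpoint predictions
ground_truth = [2000.0, 3000.0]  # made-up actual prices
print(absolute_percentage_errors(predictions, ground_truth))  # [25.0, 0.0]
```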

### Cleanup

At this point we have a working endpoint that serves predictions. However, there are further steps to go before a full rollout happens. To avoid costs incurred by idle endpoints during the testing phase, it is good practice to delete endpoints. Production endpoints should ideally be created and maintained by a separate team (even if they use the same code).

In [None]:
predictor_gb.delete_endpoint(delete_endpoint_config=True)

## Azure

### Create endpoint

In [None]:
online_endpoint_name = "diamond-price-predictor-001"

In [None]:
endpoint = ManagedOnlineEndpoint(
    name=online_endpoint_name,
    description="Model to predict diamond prices",
    auth_mode="aml_token"
)

By creating a `ManagedOnlineEndpoint` we let Azure handle all the resource creation and management.

In [None]:
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

### Collect registered model

In [None]:
registered_model_gb = ml_client.models.get(
    name="gbr-diamond-price-predictor-june", 
    version=1
)

In [None]:
registered_model_gb.version

### Prepare a scoring script

Scoring scripts guide the Azure ML model server on handling inputs and generating model predictions. Azure ML defines clear guidelines on the functions within this script that are invoked when a prediction request is received (the file `score.py` presents detailed comments that delineate what each function in the script accomplishes).

![azure-score](assets/azure-score.drawio.png)

### Infrastructure & Execution

The base model that we deploy is referred to as the "blue" deployment by convention. After creation, the endpoint is intended to serve 100% of its traffic through this blue deployment (the gradient boosted model in this case).

Once the server logic is implemented in the scoring script, we define the infrastructure we need to host and serve the model. Azure ML handles the resources needed to create the model server and attaches it to the endpoint with the name specified (note that the managed endpoint was created in the first step).

In [None]:
blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=registered_model_gb,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    code_configuration=CodeConfiguration(
        code='./azure/infer',
        scoring_script='score.py'
    ),
    instance_type="Standard_DS1_v2",
    instance_count=1
)

In [None]:
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()

### Testing

In [None]:
(diamonds_df.drop(columns='price')
            .sample(100)
            .to_json('sample-data.json', orient='split', lines=False))

In [None]:
print(
    ml_client.online_endpoints.invoke(
        endpoint_name=online_endpoint_name,
        deployment_name="blue",
        request_file="sample-data.json"
    )
)

# Canary Deployment

An important scenario in model deployment is the need to upgrade an existing baseline model to a newer version. To ensure a careful transition from the existing model to the new version, a recommended approach is through a canary deployment. This method involves directing a controlled portion of the live traffic to the upgraded endpoint, followed by A/B testing to determine if the upgraded version performs better than the baseline on live data.

The canary deployment process starts by diverting a small percentage of live traffic, typically between 1% and 5%, to the upgraded version. Gradually, the traffic is increased if there are no errors. This approach allows for incremental testing and monitoring of the new model's performance in a real-world environment.
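
The idea can be illustrated with a toy simulation (not how SageMaker or Azure ML route requests internally): each request goes to the canary with probability equal to its share of the weights, and that share is stepped up over time. The step values are illustrative.

```python
import random
from collections import Counter


def route_requests(weights, n_requests, seed=0):
    """Toy router: pick a variant for each request with probability proportional to its weight."""
    rng = random.Random(seed)
    variants = list(weights)
    return Counter(rng.choices(variants, weights=[weights[v] for v in variants], k=n_requests))


for canary_share in (0.05, 0.20, 0.50, 1.00):
    counts = route_requests({"baseline": 1 - canary_share, "canary": canary_share}, n_requests=1000)
    print(f"canary weight {canary_share:.0%}: {counts['canary']} of 1000 requests")
    # In a real rollout, check the canary's error rate here and roll back if needed
```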

Let's take a closer look at how canary deployment works in action. We begin by creating two model variants, each representing one of the two models we estimated on the data.

## AWS

### Create variants

To create variants from the model binaries, we reference the container configuration used by the `SKLearnModel` objects. Containerization is a popular method to package the model and server, along with all the runtime requirements, into a standalone resource. This approach ensures that the server can be deployed easily on any virtual machine without the need for manual duplication of the configuration options required to run the server.

There are popular containerization tools available that allow us to quickly package all the runtime requirements into a reusable container. Two commonly used tools are [Docker](https://www.docker.com/) and [Podman](https://podman.io/). These tools simplify the process of creating containers, making it easier to manage and deploy the model and server components as a single unit.

In [None]:
model_dt.prepare_container_def()

In [None]:
model_gb.prepare_container_def()

As the above output indicates, both model objects reference an image, `720646828776.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3`, that is managed by AWS. Running this image as a container gives us an environment where Python 3, scikit-learn 1.2.1, and its dependencies (e.g., numpy and scipy) are preinstalled. When this container runs, the Python runtime executes the script `inference.py` with all its requirements (i.e., packages and model data) copied over.

Now that we have all the information on the infrastructure the model needs to fire predictions, we can register the two model binaries against a common endpoint as variants using the corresponding container configurations. 

We begin by registering the models and their container environments within the current session.

In [None]:
deployment_session.create_model(
    name='decision-tree-regressor',
    role=aws_role,
    container_defs=model_dt.prepare_container_def()
)

In [None]:
deployment_session.create_model(
    name='gradient-boosted-regressor',
    role=aws_role,
    container_defs=model_gb.prepare_container_def()
)

Now, we create two variants by referencing these two registered models.

In [None]:
variant1 = production_variant(
    model_name='decision-tree-regressor',
    instance_type="ml.m5.xlarge",
    initial_instance_count=1,
    variant_name="Variant1",
    initial_weight=0.95,
    volume_size=1
)

In [None]:
variant2 = production_variant(
    model_name='gradient-boosted-regressor',
    instance_type="ml.m5.xlarge",
    initial_instance_count=1,
    variant_name="Variant2",
    initial_weight=0.05,
    volume_size=1
)

In [None]:
(variant1, variant2)

As we note above, the two variants are initially configured to receive 95% (decision tree regressor) and 5% (gradient boosted regressor) of the traffic, respectively.
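
The weights need not add up to 1: SageMaker sends each variant a share of traffic equal to its weight divided by the sum of all variant weights. A small helper (hypothetical, for illustration) makes the arithmetic explicit.

```python
def traffic_shares(variant_weights):
    """Fraction of requests each variant receives: its weight / sum of all weights."""
    total = sum(variant_weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in variant_weights.items()}


print(traffic_shares({"Variant1": 0.95, "Variant2": 0.05}))  # the 95/5 split used above
print(traffic_shares({"Variant1": 19, "Variant2": 1}))        # same 95/5 split
```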

### Deploy variants

Now we can deploy the variants against the same endpoint, allowing `sagemaker` to route incoming traffic to the two variants in the ratio 95:5.

In [None]:
canary_endpoint_name = "diamond-price-pred-2023-06-12"
print(f"EndpointName = {canary_endpoint_name}")

In [None]:
deployment_session.endpoint_from_production_variants(
    name=canary_endpoint_name, 
    production_variants=[variant1, variant2]
)

We can verify the specification of the canary endpoint from the UI to ensure that the traffic flow is correctly configured.

### Test deployment

In [None]:
for invocation_num in range(100):
    
    sample_df = diamonds_df.sample(1)
    sample_Xtest = sample_df[features]
    
    response = runtime.invoke_endpoint(
        EndpointName=canary_endpoint_name,
        Body=sample_Xtest.to_csv(header=True, index=False).encode("utf-8"),
        ContentType="text/csv"
    )

We can check the traffic allocation patterns by looking at the invocation traffic to the endpoint on CloudWatch (expect a slight lag for data to be logged).
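
Alternatively, each `invoke_endpoint` response reports which variant served it in its `InvokedProductionVariant` field, so the split can be checked without waiting for CloudWatch. A sketch, with fake responses standing in for real ones:

```python
from collections import Counter


def tally_variants(responses):
    """Count how many invoke_endpoint responses each production variant served."""
    return Counter(response["InvokedProductionVariant"] for response in responses)


# With a live endpoint, append each runtime.invoke_endpoint(...) result to a list
# inside the loop above; here, made-up responses stand in for real ones.
fake_responses = [{"InvokedProductionVariant": "Variant1"}] * 96 + [{"InvokedProductionVariant": "Variant2"}] * 4
print(tally_variants(fake_responses))  # Counter({'Variant1': 96, 'Variant2': 4})
```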

### Safe rollout

Once the updated variant is tested, we can gradually increase the weight assigned to it, eventually pushing all the traffic over to the new variant.
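
One way to organize the gradual shift is to precompute the payload for each stage. The sketch below (step sizes are illustrative) builds lists in the `DesiredWeightsAndCapacities` shape used by `update_endpoint_weights_and_capacities`. The commented lines indicate how each stage might be applied with a boto3 SageMaker client such as `sagemaker_client`, waiting for the endpoint to return to service and checking metrics before moving on.

```python
def weight_schedule(baseline, canary, canary_steps=(0.05, 0.2, 0.5, 1.0)):
    """Build DesiredWeightsAndCapacities payloads that shift traffic to the canary step by step."""
    return [
        [
            {"VariantName": baseline, "DesiredWeight": round(1 - share, 4)},
            {"VariantName": canary, "DesiredWeight": share},
        ]
        for share in canary_steps
    ]


for step in weight_schedule("Variant1", "Variant2"):
    print(step)
    # For each stage (sketch, not executed here):
    #   sagemaker_client.update_endpoint_weights_and_capacities(
    #       EndpointName=canary_endpoint_name, DesiredWeightsAndCapacities=step)
    #   sagemaker_client.get_waiter("endpoint_in_service").wait(EndpointName=canary_endpoint_name)
    #   ...then check error rates and latency before the next stage; roll back if needed.
```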

In [None]:
sagemaker_client = boto3.Session().client('sagemaker')

In [None]:
sagemaker_client.update_endpoint_weights_and_capacities(
    EndpointName=canary_endpoint_name,
    DesiredWeightsAndCapacities=[
        {'VariantName': 'Variant1', 'DesiredWeight': 0.8},
        {'VariantName': 'Variant2', 'DesiredWeight': 0.2}
    ]
)

In [None]:
for invocation_num in range(100):
    
    sample_df = diamonds_df.sample(1)
    sample_Xtest = sample_df[features]
    
    response = runtime.invoke_endpoint(
        EndpointName=canary_endpoint_name,
        Body=sample_Xtest.to_csv(header=True, index=False).encode("utf-8"),
        ContentType="text/csv"
    )

## Azure

### Create variants

We already have a `blue` deployment for the gradient boosted model; let us create a `green` deployment by referencing the decision tree model.

In [None]:
registered_model_dt = ml_client.models.get(
    name="dt-diamond-price-predictor-june", 
    version=1
)

In [None]:
registered_model_dt.version

### Green deployment

The `green` deployment is exactly the same as the blue deployment, except for the change in the variant name and the model used.

In [None]:
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name=online_endpoint_name,
    model=registered_model_dt,
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    code_configuration=CodeConfiguration(
        code='./azure/infer',
        scoring_script='score.py'
    ),
    instance_type="Standard_DS1_v2",
    instance_count=1
)

In [None]:
ml_client.online_deployments.begin_create_or_update(green_deployment).result()

### Testing

At this stage, even though the endpoint is aware of a "green" version and we can invoke it, it is not yet receiving public traffic.

In [None]:
ml_client.online_endpoints.invoke(
    endpoint_name=online_endpoint_name,
    deployment_name="green",
    request_file="sample-data.json"
)

### Safe rollout

Once the green deployment is tested, we can set the proportion of traffic allocated to each deployment, gradually increasing the traffic seen by the green deployment until it takes over completely.
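
Before applying each stage, it can help to sanity-check the traffic map. The helper below encodes our understanding of the rules for managed online endpoints (integer percentages that add up to 100, keyed by existing deployment names); treat these rules as an assumption and confirm them against the Azure ML documentation. The stages are illustrative.

```python
def validate_traffic(traffic, deployments):
    """Sanity-check an endpoint traffic map before applying it (assumed rules)."""
    unknown = set(traffic) - set(deployments)
    if unknown:
        raise ValueError(f"Unknown deployments: {sorted(unknown)}")
    if any(not isinstance(p, int) or p < 0 for p in traffic.values()):
        raise ValueError("Traffic percentages must be non-negative integers")
    if sum(traffic.values()) != 100:
        raise ValueError("Traffic percentages must add up to 100")
    return traffic


rollout = [{"blue": 99, "green": 1}, {"blue": 90, "green": 10}, {"blue": 50, "green": 50}, {"blue": 0, "green": 100}]
for traffic in rollout:
    validate_traffic(traffic, deployments=["blue", "green"])
    # endpoint.traffic = traffic; ml_client.begin_create_or_update(endpoint).result()
```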

In [None]:
endpoint.traffic = {"blue": 99, "green": 1}

In [None]:
ml_client.begin_create_or_update(endpoint).result()

In [None]:
for i in range(20):
    ml_client.online_endpoints.invoke(
        endpoint_name=online_endpoint_name,
        request_file="sample-data.json"
    )

# Cleanup

## AWS

To avoid costs incurred by idle endpoints during the testing phase, it is good practice to delete endpoints. Production endpoints should ideally be created and maintained by a separate team (even if they use the same code). Data Science teams should not have edit access to production endpoints.

In [None]:
deployment_session.delete_endpoint(canary_endpoint_name)

In [None]:
deployment_session.delete_endpoint_config(canary_endpoint_name)

In [None]:
for model_name in ['decision-tree-regressor', 'gradient-boosted-regressor']:
    deployment_session.delete_model(model_name)

## Azure

In [None]:
ml_client.online_endpoints.begin_delete(name=online_endpoint_name)