# Using SageMaker Production Variants for A/B testing

A **Production Variant** is a SageMaker-specific concept that defines a combination of a model, its container, and the resources required to run it. This makes it flexible enough to cover several use cases, such as the following:
- Different model versions with the same runtime and resource requirements
- Different models with different runtimes and/or resource requirements
- The same model with different runtimes and/or resource requirements

Additionally, as part of the variant configuration, you define its traffic weight, which can later be updated without affecting endpoint availability. Once deployed, a production variant can be invoked either directly, which bypasses SageMaker traffic shaping, or through a regular SageMaker endpoint call, in which case SageMaker routes the request according to the variant weights.
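
To make the effect of weights concrete: the share of traffic a variant receives is its weight divided by the sum of all variant weights on the endpoint. The following is a minimal sketch of that arithmetic in plain Python. The helper name `traffic_shares` is hypothetical and is not part of the SageMaker SDK.

```python
def traffic_shares(weights):
    """Return the expected fraction of requests routed to each variant.

    `weights` maps variant names to their non-negative weights.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("At least one variant must have a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Equal weights split traffic evenly; a 1:9 ratio sends 90% to the second variant.
print(traffic_shares({"Variant1": 1, "Variant2": 1}))
print(traffic_shares({"Variant1": 1, "Variant2": 9}))
```

Only the ratio matters: weights of 1 and 9 behave the same as 10 and 90.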

In this example, we will register two different models for the same Q&A NLP task. Then, we will shape the inference traffic using the production variant weights and invoke the models directly. 

## Deploy Endpoint with Two Production Variants

Follow the steps below to prepare two different production variants with the `DistilBERT` and `RoBERTa` models.

1. We start by identifying the appropriate container image using the SageMaker `image_uris.retrieve()` method:


In [7]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.huggingface.model import HuggingFaceModel

sagemaker_session = sagemaker.Session()
role = get_execution_role()

In [13]:
PYTHON_VERSION = "py38"
PYTORCH_VERSION = "1.10.2"
TRANSFORMER_VERSION = "4.17.0"
INSTANCE_TYPE = "ml.c5.4xlarge"

image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface",
    base_framework_version=f"pytorch{PYTORCH_VERSION}",
    region=sagemaker_session.boto_region_name,
    version=TRANSFORMER_VERSION,
    py_version=PYTHON_VERSION,
    instance_type=INSTANCE_TYPE,
    image_scope="inference",
)

print(f"Container to be used: {image_uri}")

2. Next, we create two `HuggingFaceModel` objects, which define the models to download from the Hugging Face Model Hub:

In [10]:
model1_name = "DistilBERT"

model1_env = {
    'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad',
    'HF_TASK':'question-answering'
}

model1 = HuggingFaceModel(
    name=model1_name,
    env=model1_env,
    role=role,
    transformers_version=TRANSFORMER_VERSION,
    pytorch_version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    image_uri=image_uri,
)

container1_def = model1.prepare_container_def(INSTANCE_TYPE)

sagemaker_session.create_model(
    name=model1_name, role=role, container_defs=container1_def
)


In [17]:
model2_name = "RoBERTa"

model2_env = {
    'HF_MODEL_ID':'deepset/roberta-base-squad2',
    'HF_TASK':'question-answering'
}

model2 = HuggingFaceModel(
    name=model2_name,
    env=model2_env,
    role=role,
    transformers_version=TRANSFORMER_VERSION,
    pytorch_version=PYTORCH_VERSION,
    py_version=PYTHON_VERSION,
    image_uri=image_uri,
)

container2_def = model2.prepare_container_def(INSTANCE_TYPE)

sagemaker_session.create_model(
    name=model2_name, role=role, container_defs=container2_def
)


3. Then we create two endpoint variants. We start with equal `initial_weight` values, which tells SageMaker to split inference traffic evenly between the model variants:

In [None]:
from sagemaker.session import production_variant

variant1 = production_variant(
    model_name=model1_name,
    instance_type=INSTANCE_TYPE,
    initial_instance_count=1,
    variant_name="Variant1",
    initial_weight=1,
)
variant2 = production_variant(
    model_name=model2_name,
    instance_type=INSTANCE_TYPE,
    initial_instance_count=1,
    variant_name="Variant2",
    initial_weight=1,
)

print(f"variant1 parameters = {variant1},\nvariant2 parameters = {variant2}")

4. After that, we create the endpoint from the configured production variants:

In [None]:
from datetime import datetime

endpoint_name = f"ab-testing-{datetime.now():%Y-%m-%d-%H-%M-%S}"
print(f"EndpointName={endpoint_name}")

sagemaker_session.endpoint_from_production_variants(
    name=endpoint_name, production_variants=[variant1, variant2]
)

## Testing Production Variants

Now, let's test our endpoint with two production variants.

1. Let's confirm that each production variant receives roughly 50% of the inference requests, since we set equal initial weights. To do this, we send multiple inference requests and read the `InvokedProductionVariant` field of each response.

In [None]:
import json

context = r"""
The Nile is a major north-flowing river in northeastern Africa. It flows into the Mediterranean Sea. The Nile is the longest river in Africa and has historically been considered the longest river in the world, though this has been contested by research suggesting that the Amazon River is slightly longer. Of the world's major rivers, the Nile is one of the smallest, as measured by annual flow in cubic metres of water.
"""

question="where does the Nile flow into?"

data = {"context":context, "question":question}
print(data)

In [43]:
sm_runtime_client = sagemaker_session.sagemaker_runtime_client
sm_client = sagemaker_session.sagemaker_client

# initiate results object
results = {"Variant1": 0, "Variant2": 0, "total_count": 0}

for i in range(20):
    response = sm_runtime_client.invoke_endpoint(EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(data))
    results[response['InvokedProductionVariant']] += 1
    results["total_count"] += 1

print(f"Invocations per endpoint variant: \n Variant1: {results['Variant1']/results['total_count']*100}%; \n Variant2: {results['Variant2']/results['total_count']*100}%.")
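
With only 20 requests, the observed split will rarely be exactly 50/50, because each request is routed at random according to the weights. As a rough guide, and assuming each request is routed independently (a simplifying assumption for illustration), the number of requests hitting one variant follows a binomial distribution. The helper `expected_split` below is hypothetical.

```python
import math


def expected_split(num_requests, share):
    """Mean and standard deviation of the request count for a variant
    that receives the given share of traffic."""
    mean = num_requests * share
    std = math.sqrt(num_requests * share * (1 - share))
    return mean, std


mean, std = expected_split(num_requests=20, share=0.5)
print(f"Expected requests: {mean:.1f} +/- {std:.1f}")
```

For 20 requests and a 50% share this gives about 10 requests with a spread of roughly 2, so a 40/60 outcome is unremarkable. Sending more requests tightens the estimate.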

2. Next, let's simulate a situation where we want to send ~90% of the traffic to one production variant. Run the cell below to update the production variant weights from 1:1 to 1:9 for `Variant1` and `Variant2` respectively, so that `Variant2` receives about 90% of the requests. The cell also polls the endpoint status until the update completes.

In [None]:
import time 

sm_client.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {"DesiredWeight": 1, "VariantName": "Variant1"},
        {"DesiredWeight": 9, "VariantName": "Variant2"},
    ],
)

endpoint_description = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_description['EndpointStatus']
print("Status: " + status)

while status=='Updating':
    time.sleep(1)
    endpoint_description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint_description['EndpointStatus']
    instance_count = endpoint_description['ProductionVariants'][0]['CurrentInstanceCount']
    print(f"Status: {status}")
    print(f"Current Instance count: {instance_count}")
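
The polling loop above runs for as long as the endpoint reports `Updating`. A slightly more defensive variant adds a timeout so that a stuck update does not block the notebook forever. This is a minimal sketch: the function name `wait_for_endpoint` and its default intervals are assumptions, and it only relies on the `describe_endpoint` call already used above.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=5, timeout_seconds=600):
    """Poll describe_endpoint until the endpoint leaves 'Updating' or the timeout expires.

    Returns the final status string (for example 'InService' or 'Failed').
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status != "Updating":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} is still updating after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)


# Example with the objects defined earlier in this notebook:
# status = wait_for_endpoint(sm_client, endpoint_name)
```

Returning the final status lets the caller distinguish a successful update from a failed one instead of assuming success when the loop ends.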

3. Now, let's confirm that the traffic distribution changed according to the weights:

In [44]:
results = {"Variant1": 0, "Variant2": 0, "total_count": 0}

for i in range(20):
    response = sm_runtime_client.invoke_endpoint(EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(data))
    results[response['InvokedProductionVariant']] += 1
    results["total_count"] += 1

print(f"Invocations per endpoint variant: \n Variant1: {results['Variant1']/results['total_count']*100}%; \n Variant2: {results['Variant2']/results['total_count']*100}%.")
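
The requests above all go through weight-based routing. To invoke a specific variant directly, `invoke_endpoint` accepts a `TargetVariant` parameter that pins the request to the named variant regardless of the weights. The following is a minimal sketch. The wrapper name `invoke_variant` is hypothetical, and it takes the runtime client as an argument so that it can be reused with the `sm_runtime_client` created earlier.

```python
import json


def invoke_variant(runtime_client, endpoint_name, payload, variant_name=None):
    """Invoke an endpoint; if variant_name is given, pin the request to that variant."""
    kwargs = {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }
    if variant_name is not None:
        # TargetVariant overrides the weight-based routing for this request.
        kwargs["TargetVariant"] = variant_name
    return runtime_client.invoke_endpoint(**kwargs)


# Example with the objects defined earlier in this notebook:
# response = invoke_variant(sm_runtime_client, endpoint_name, data, variant_name="Variant1")
# print(response["InvokedProductionVariant"], json.loads(response["Body"].read()))
```

Direct invocation is useful for comparing the answers of both models on the same input, since it removes the randomness of the traffic split.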

## Resource Cleanup

Run the following cell to delete the cloud resources:

In [None]:
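sm_client.delete_endpoint(EndpointName=endpoint_name)
# endpoint_from_production_variants() also creates an endpoint configuration
# with the same name as the endpoint, so remove it as well.
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm_client.delete_model(ModelName=model1_name)
sm_client.delete_model(ModelName=model2_name)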