# Optimizing Model Hosting and Inference Costs

## Real-time inference versus batch inference
#### SageMaker provides two ways to obtain inferences:

- Real-time inference lets you get a single inference per request, or a small number of inferences, with very low latency from a live inference endpoint.
- Batch inference lets you get a large number of inferences from a batch processing job.

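#### Batch inference is usually more efficient and more cost-effective, because you pay for compute only while the batch job runs rather than for an endpoint that is always on. Use it whenever your inference requirements allow. We'll explore batch inference first, and then pivot to real-time inference.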

## Batch inference

#### In many cases, we can make inferences in advance and store them for later use. For example, if you want to generate product recommendations for users on an e-commerce site, those recommendations may be based on the users' prior purchases and which products you want to promote the next day. You can generate the recommendations nightly and store them for your e-commerce site to call up when the users browse the site.

#### There are several options for storing batch inferences. Amazon DynamoDB is a common choice for several reasons, such as the following:

- It is fast. You can look up single values within a few milliseconds.
- It is scalable. You can store millions of values at a low cost.
- The best access pattern for DynamoDB is looking up values by a high-cardinality primary key. This fits well with many inference usage patterns, for example, when we want to look up a stored recommendation for an individual user.

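#### The following cell runs a SageMaker batch transform job. It reads the test data from S3 and writes the inferences to the output location in S3:

In [None]:
# S3 locations for the transform input and output
batch_input = "s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'test')
batch_output = "s3://{}/{}/{}/".format(s3_bucket, "xgboost-sample", 'xform')
# Create a transformer from the trained estimator (max_payload is in MB)
transformer = estimator.transformer(instance_count=1, instance_type='ml.m5.4xlarge',
                                    output_path=batch_output, max_payload=3)
# Run the job over every object under the prefix, splitting records by line
transformer.transform(data=batch_input, data_type='S3Prefix',
                      content_type=content_type, split_type='Line')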
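
#### Once a batch job has produced its output, the DynamoDB lookup pattern described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's code: the table name `user-recommendations`, the `user_id` key, and the `score` attribute are hypothetical, and the table is assumed to already exist with `user_id` as its partition key.

```python
from decimal import Decimal


def to_dynamodb_item(user_id, score):
    """Build an item keyed by a high-cardinality user ID (hypothetical schema).

    The boto3 resource API rejects Python floats, so the score is
    stored as a Decimal.
    """
    return {"user_id": str(user_id), "score": Decimal(str(score))}


def store_inferences(table, rows):
    """Write (user_id, score) pairs using the table's batch writer."""
    with table.batch_writer() as batch:
        for user_id, score in rows:
            batch.put_item(Item=to_dynamodb_item(user_id, score))


def lookup_inference(table, user_id):
    """Fetch one stored inference by primary key; return None if absent."""
    response = table.get_item(Key={"user_id": str(user_id)})
    item = response.get("Item")
    return None if item is None else float(item["score"])


# Usage (needs AWS credentials and an existing table; names are hypothetical):
# import boto3
# table = boto3.resource("dynamodb").Table("user-recommendations")
# store_inferences(table, [("user-123", 0.1638)])
# print(lookup_inference(table, "user-123"))
```

#### The helpers take the table object as an argument, so the same code can be exercised against a stub before it is pointed at a real table.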

## Real-time inference

#### When you deploy a SageMaker model to a real-time inference endpoint, SageMaker deploys the model artifact and your inference code (packaged in a Docker image) to one or more inference instances. You now have a live API endpoint for inference, and you can invoke it from other software services on demand.  

#### You pay for the inference endpoints (instances) as long as they are running. Use real-time inference in the following situations:

- The inferences are dependent on context. For example, if you want to recommend a video to watch, the inference may depend on the show your user just finished. If you have a large video catalog, you can't generate all the possible permutations of recommendations in advance.  
- You may need to provide inferences for new events. For example, if you are trying to classify a credit card transaction as fraudulent or not, you need to wait until your user actually attempts a transaction.

In [None]:
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer

predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m5.2xlarge',
                             serializer=CSVSerializer(),
                             deserializer=JSONDeserializer())

result = predictor.predict(csv_payload)
print(result)

## Multiple versions of the same model
#### A SageMaker endpoint lets you host multiple models that serve different percentages of traffic for incoming requests. That capability supports common continuous integration (CI)/continuous delivery (CD) practices such as canary and blue/green deployments. While these practices are similar, they have slightly different purposes, as explained here:

- A canary deployment routes a small percentage of traffic to the new version of a model, so that you can test it on a subset of traffic until you are satisfied that it is working well.
- A blue/green deployment means that you run two versions of the model at the same time, keeping an older version around for quick failover if a problem occurs in the new version.

#### In practice, these are variations on a theme. In SageMaker, you designate how much traffic each model variant handles. For canary deployments, you'd start with a small fraction (usually 1-5%) for the new model version. For blue/green deployments, you'd send 100% of traffic to the new version but flip it back to 0% if a problem occurs.

#### There are other ways to accomplish these deployment modes. For example, you can use two inference endpoints and handle traffic shaping using DNS (Route 53), a load balancer, or Global Accelerator. But managing the traffic through SageMaker simplifies your operational burden and reduces cost, as you don't have to have two endpoints running.

In [None]:
# Hyperparameters for the second version of the model
hyperparameters_v2 = {
        "max_depth":"10",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"5"}

estimator_v2 = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                hyperparameters=hyperparameters_v2, # use the v2 settings defined above
                role=sagemaker.get_execution_role(),
                instance_count=1,
                instance_type='ml.m5.12xlarge',
                volume_size=200, # 200 GB
                output_path=output_path)

# estimator_v2 must be trained with estimator_v2.fit() on your training
# channels before it can be deployed.

predictor_v2 = estimator_v2.deploy(initial_instance_count=1,
            instance_type='ml.m5.2xlarge',
            serializer=CSVSerializer(),
            deserializer=JSONDeserializer()
)

#### Next, we define endpoint variants for each model version. The most important parameter here is initial_weight, which specifies the relative share of traffic that goes to each model version. Setting both weights to 1 splits the traffic evenly between the two versions. For an A/B test, you might instead start with a weight of 20 for the existing version and 1 for the new version. The following cell uses equal weights:

In [None]:
model1 = predictor._model_names[0]
model2 = predictor_v2._model_names[0]

from sagemaker.session import production_variant
variant1 = production_variant(model_name=model1,
                              instance_type="ml.m5.xlarge",
                              initial_instance_count=1,
                              variant_name='Variant1',
                              initial_weight=1)

variant2 = production_variant(model_name=model2,
                              instance_type="ml.m5.xlarge",
                              initial_instance_count=1,
                              variant_name='Variant2',
                              initial_weight=1)
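
#### Variant weights are relative: each variant receives its weight divided by the sum of all weights. The arithmetic can be sketched as follows; `traffic_shares` is a hypothetical helper written for illustration, not part of the SageMaker SDK.

```python
def traffic_shares(weights):
    """Return each variant's fraction of traffic given its relative weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Equal weights split traffic evenly between the two variants.
print(traffic_shares({"Variant1": 1, "Variant2": 1}))

# Weights of 20 and 1 send 1/21 of the traffic to the new variant.
print(traffic_shares({"Variant1": 20, "Variant2": 1}))
```

#### With weights of 20 and 1, the new version receives roughly 4.8% of requests, which falls within the usual range for a canary.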

#### Now, we deploy a new endpoint that serves the two model variants:

In [None]:
from sagemaker.session import Session

smsession = Session()
smsession.endpoint_from_production_variants(
    name='mmendpoint',
    production_variants=[variant1, variant2]
)


In [None]:
import boto3

# Invoke the endpoint through the low-level runtime client so that the
# response reports which variant served each request
smrt = boto3.Session().client("sagemaker-runtime")
for tl in t_lines[0:50]:
    result = smrt.invoke_endpoint(EndpointName='mmendpoint',
                                  ContentType="text/csv",
                                  Body=tl.strip())
    # result['Body'] is already a streaming body, so read it directly
    rbody = result['Body'].read().decode('utf-8')
    print(f"Result from {result['InvokedProductionVariant']} = {rbody}")

#### You'll see output that looks like this:

- Result from Variant2 = 0.16384175419807434
- Result from Variant1 = 0.16383948922157288
- Result from Variant1 = 0.16383948922157288

#### Notice that the traffic is flipping between the two versions of the model according to the weights we specified. In a production use case, you should automate the model endpoint update in your CI/CD or MLOps automation tools.
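
#### One way to automate a traffic shift is the SageMaker `UpdateEndpointWeightsAndCapacities` API, which changes variant weights on a live endpoint. The following is a minimal sketch under assumptions: the helper names are hypothetical, and the variant names default to the `Variant1` and `Variant2` used above.

```python
def desired_weights(new_variant_share, old_variant="Variant1", new_variant="Variant2"):
    """Build the DesiredWeightsAndCapacities list for a two-variant endpoint."""
    if not 0.0 <= new_variant_share <= 1.0:
        raise ValueError("new_variant_share must be between 0 and 1")
    return [
        {"VariantName": old_variant, "DesiredWeight": 1.0 - new_variant_share},
        {"VariantName": new_variant, "DesiredWeight": new_variant_share},
    ]


def shift_traffic(sm_client, endpoint_name, new_variant_share):
    """Apply new variant weights to a running endpoint."""
    return sm_client.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=desired_weights(new_variant_share),
    )


# Usage (needs AWS credentials and the endpoint created earlier):
# import boto3
# sm = boto3.client("sagemaker")
# shift_traffic(sm, "mmendpoint", 0.05)  # canary: 5% to the new version
# shift_traffic(sm, "mmendpoint", 1.0)   # promote the new version
# shift_traffic(sm, "mmendpoint", 0.0)   # roll back to the old version
```

#### A pipeline step could call `shift_traffic` with increasing shares while it watches the endpoint's error and latency metrics.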

#### When we need a real-time inference endpoint, the processing power requirements may vary based on incoming traffic. For example, if we are providing air quality inferences for a mobile application, usage will likely fluctuate based on time of day. If we provision the inference endpoint for peak load, we will pay too much during off-peak times. If we provision the inference endpoint for a smaller load, we may hit performance bottlenecks during peak times. We can use inference endpoint auto-scaling to adjust capacity to demand.

#### There are two types of scaling, vertical and horizontal. Vertical scaling means that we adjust the size of an individual endpoint instance. Horizontal scaling means that we adjust the number of endpoint instances. We prefer horizontal scaling as it results in less disruption for end users; a load balancer can redistribute traffic without having an impact on end users. Setting up autoscaling for an inference endpoint involves four steps:

- Set the minimum and maximum number of instances.
- Choose a scaling metric.
- Set the scaling policy.
- Set the cooldown period.

#### If your load is highly variable, you can start with a small instance type and scale out aggressively by adding instances. This prevents you from paying for a larger instance type that you don't always need.
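
#### The first step, setting the minimum and maximum number of instances, is done by registering the endpoint variant as a scalable target with Application Auto Scaling. The sketch below builds that request; the helper name and the example limits of 1 and 4 instances are assumptions chosen for illustration.

```python
def scalable_target_request(endpoint_name, variant_name, min_instances, max_instances):
    """Build keyword arguments for register_scalable_target."""
    if min_instances < 0 or max_instances < min_instances:
        raise ValueError("need 0 <= min_instances <= max_instances")
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }


# Usage (needs AWS credentials and a running endpoint):
# import boto3
# autoscaling = boto3.client("application-autoscaling")
# autoscaling.register_scalable_target(
#     **scalable_target_request("mmendpoint", "Variant1", 1, 4))
```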

## Choosing a scaling metric

#### We need to decide when to trigger a scaling action. We do that by specifying a CloudWatch metric. By default, SageMaker provides two useful metrics:

- InvocationsPerInstance reports the number of inference requests sent to each endpoint instance over some time period.
- ModelLatency is the time in microseconds to respond to inference requests.

#### We recommend ModelLatency as a metric for autoscaling, as it reports on the end user experience. Setting the actual value for the metric will depend on your requirements and some observation of endpoint performance over time. For example, you may find that latency over 100 milliseconds results in a degraded user experience if the inference result passes through several other services that add their own latency before the result reaches the end user.
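
#### Because ModelLatency is reported in microseconds, a threshold expressed in milliseconds has to be converted before it is used as a target. The sketch below shows the conversion and a customized metric specification for Application Auto Scaling; the helper names and the choice of the `Average` statistic are assumptions.

```python
def ms_to_microseconds(milliseconds):
    """Convert a latency threshold from milliseconds to microseconds."""
    return round(milliseconds * 1000)


def model_latency_metric(endpoint_name, variant_name):
    """Describe the ModelLatency metric for one endpoint variant."""
    return {
        "MetricName": "ModelLatency",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",  # assumption: average latency per period
        "Unit": "Microseconds",
    }


# A 100 ms latency budget expressed in the metric's own unit
target_value = ms_to_microseconds(100)
metric = model_latency_metric("mmendpoint", "Variant1")
```

#### The resulting dictionary can be supplied as the `CustomizedMetricSpecification` of a target tracking policy, with `target_value` as the policy's target.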

## Setting the scaling policy

#### You can choose between target tracking and step scaling. Target tracking policies are simpler and suit most workloads: they adjust capacity to keep a target metric close to a value you specify. Step scaling policies are more advanced: they change capacity in increments whose size depends on how far the metric has moved past a threshold.

## Setting the cooldown period

#### The cooldown period is how long the endpoint will wait after one scaling action before starting another scaling action. If you let the endpoint respond instantaneously, you'd end up scaling too often. As a general rule, scale out aggressively and scale in conservatively.
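
#### A target tracking policy and its cooldown periods are set together in one Application Auto Scaling request. The following is a minimal sketch: the helper name, the policy name, and the cooldown values of 60 and 300 seconds are illustrative assumptions. It uses the predefined invocations-per-instance metric for brevity; a customized ModelLatency metric could be substituted through `CustomizedMetricSpecification`.

```python
def target_tracking_policy(endpoint_name, variant_name, target_value,
                           scale_out_cooldown=60, scale_in_cooldown=300):
    """Build keyword arguments for put_scaling_policy.

    The short scale-out cooldown and longer scale-in cooldown mean
    capacity is added quickly and removed cautiously.
    """
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_value),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": scale_out_cooldown,  # seconds
            "ScaleInCooldown": scale_in_cooldown,    # seconds
        },
    }


# Usage (needs AWS credentials and a registered scalable target):
# import boto3
# autoscaling = boto3.client("application-autoscaling")
# autoscaling.put_scaling_policy(
#     **target_tracking_policy("mmendpoint", "Variant1", target_value=1000))
```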

#### When you obtain inferences from a deep learning model, you do not need as much GPU capacity as you need during training. Elastic Inference lets you attach fractional GPU capacity to regular EC2 instances or Elastic Container Service (ECS) containers. As a result, you can get deep learning inferences quickly at a reduced cost.

#### The Elastic Inference section in the notebook shows how to attach an Elastic Inference accelerator to an endpoint, as you can see in the following code block:
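
#### A minimal sketch of that call, assuming the `accelerator_type` argument of the SageMaker Python SDK's `deploy()` method. The helper name, the `ml.m5.xlarge` host instance, and the `ml.eia2.medium` accelerator size are illustrative choices rather than the notebook's values.

```python
def elastic_inference_deploy_args(instance_type="ml.m5.xlarge",
                                  accelerator_type="ml.eia2.medium",
                                  instance_count=1):
    """Build deploy() keyword arguments that attach an accelerator."""
    if not accelerator_type.startswith("ml.eia"):
        raise ValueError("expected an Elastic Inference accelerator type")
    return {
        "initial_instance_count": instance_count,
        "instance_type": instance_type,        # CPU host instance
        "accelerator_type": accelerator_type,  # fractional GPU capacity
    }


# Usage (assumes a trained estimator whose container supports Elastic Inference):
# predictor_ei = estimator.deploy(**elastic_inference_deploy_args())
```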

#### You'll need to look at your specific use case and figure out the best combination of RAM, CPU, network throughput, and GPU capacity that meets your performance requirements at the lowest cost. If your inferences are entirely GPU-bound, the Inferentia instance will probably give you the best price-performance balance. If you need more traditional compute resources with some GPU, the P2/P3 family will work well. If you need very little overall capacity, Elastic Inference provides the cheapest GPU option.
