# Optimizing Model Hosting and Inference Costs

## Real-time inference versus batch inference
#### SageMaker provides two ways to obtain inferences:

- Real-time inference lets you get a single inference per request, or a small number of inferences, with very low latency from a live inference endpoint.
- Batch inference lets you get a large number of inferences from a batch processing job.

#### Batch inference is usually the more cost-effective option, because you pay for compute only while the batch job runs. Use it whenever your inference requirements allow. We'll explore batch inference first, and then turn to real-time inference.
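
#### To make the cost difference concrete, here is a minimal arithmetic sketch. The hourly rate and the two-hour job duration are hypothetical placeholders, not real SageMaker prices; substitute the current price for your instance type and Region.

```python
# Hypothetical figures for illustration only; check current SageMaker pricing.
HOURLY_RATE_USD = 1.00       # assumed price per instance-hour
BATCH_HOURS_PER_DAY = 2      # assumed duration of one nightly batch job
DAYS_PER_MONTH = 30


def monthly_cost(hours_per_day: float, rate: float = HOURLY_RATE_USD,
                 instances: int = 1, days: int = DAYS_PER_MONTH) -> float:
    """Cost of running `instances` instances for `hours_per_day` hours each day."""
    return hours_per_day * rate * instances * days


# A real-time endpoint is billed around the clock while it is in service.
endpoint_cost = monthly_cost(hours_per_day=24)
# A batch job is billed only for the time the job actually runs.
batch_cost = monthly_cost(hours_per_day=BATCH_HOURS_PER_DAY)

print(f"Real-time endpoint: ${endpoint_cost:,.2f} per month")
print(f"Nightly batch job:  ${batch_cost:,.2f} per month")
```

#### Under these assumed numbers, the endpoint costs twelve times as much as the nightly job, simply because it runs 24 hours a day instead of 2.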

## Batch inference

#### In many cases, you can generate inferences in advance and store them for later use. For example, if you want to generate product recommendations for users on an e-commerce site, those recommendations may be based on the users' prior purchases and on the products you want to promote the next day. You can generate the recommendations nightly and store them so that your e-commerce site can retrieve them when users browse the site.

#### There are several options for storing batch inferences. Amazon DynamoDB is a common choice for several reasons, such as the following:

- It is fast. You can look up single values within a few milliseconds.
- It is scalable. You can store millions of values at a low cost.
- It suits key-based lookups. DynamoDB works best when you look up values by a high-cardinality primary key, which fits many inference usage patterns, for example, retrieving the stored recommendation for an individual user.
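
#### The store-then-look-up pattern described above could be sketched as follows. This is an illustration under stated assumptions, not the notebook's own code: the `user_id` partition key, the `recommendations` attribute, and both function names are hypothetical. The `table` argument is expected to behave like a boto3 DynamoDB `Table` resource, whose `batch_writer()`, `put_item()`, and `get_item()` methods are used here.

```python
from typing import Any, Iterable, Optional, Tuple


def store_recommendations(table: Any, rows: Iterable[Tuple[str, list]]) -> int:
    """Write one item per user; `user_id` is the assumed partition key."""
    count = 0
    # batch_writer() buffers the puts and sends them to DynamoDB in batches.
    with table.batch_writer() as batch:
        for user_id, product_ids in rows:
            batch.put_item(Item={"user_id": user_id,
                                 "recommendations": product_ids})
            count += 1
    return count


def lookup_recommendations(table: Any, user_id: str) -> Optional[list]:
    """Fetch the stored recommendations for one user, or None if absent."""
    response = table.get_item(Key={"user_id": user_id})
    item = response.get("Item")  # "Item" is missing when the key is not found
    return item["recommendations"] if item else None
```

#### With boto3 installed and credentials configured, you might obtain the table with `boto3.resource("dynamodb").Table("recommendations")`, where the table name is again a placeholder. The nightly job would call `store_recommendations`, and the website would call `lookup_recommendations` for each page view.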

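```python
# Assumes estimator, s3_bucket, s3_prefix, and content_type come from earlier cells.
batch_input = f"s3://{s3_bucket}/{s3_prefix}/test/"
batch_output = f"s3://{s3_bucket}/xgboost-sample/xform/"

# Create a transformer from the trained estimator (max_payload is in MB).
transformer = estimator.transformer(instance_count=1, instance_type='ml.m5.4xlarge',
                                    output_path=batch_output, max_payload=3)
# Run the batch job; split_type='Line' treats each input line as one record.
transformer.transform(data=batch_input, data_type='S3Prefix',
                      content_type=content_type, split_type='Line')
```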

## Real-time inference

#### When you deploy a SageMaker model to a real-time inference endpoint, SageMaker deploys the model artifact and your inference code (packaged in a Docker image) to one or more inference instances. You now have a live API endpoint for inference, and you can invoke it from other software services on demand.  

#### You pay for the inference endpoints (instances) as long as they are running. Use real-time inference in the following situations:

- The inferences depend on context. For example, if you want to recommend a video to watch, the inference may depend on the show your user just finished. If you have a large video catalog, you can't generate every possible combination of recommendations in advance.
- The inferences are needed for new events. For example, if you are trying to classify a credit card transaction as fraudulent or not, you need to wait until your user actually attempts a transaction.

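```python
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer

# Deploy the trained model to a live endpoint; billing continues until it is deleted.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.2xlarge',
    serializer=CSVSerializer(),       # sends requests as text/csv
    deserializer=JSONDeserializer(),  # parses JSON responses
)

# csv_payload is assumed to be a CSV-formatted record defined in an earlier cell.
result = predictor.predict(csv_payload)
print(result)

# When you are finished, call predictor.delete_endpoint() to stop incurring charges.
```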
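
#### Other software services typically call the endpoint through the SageMaker runtime API rather than through a `Predictor` object. The following is a minimal sketch of such a call; the function name `invoke_csv` is hypothetical, and `runtime_client` is assumed to behave like the client returned by `boto3.client("sagemaker-runtime")`, whose `invoke_endpoint` method is used here.

```python
from typing import Any


def invoke_csv(runtime_client: Any, endpoint_name: str, csv_row: str) -> str:
    """Send one CSV record to a live endpoint and return the raw response text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=csv_row,
    )
    # The response body is a stream; read it and decode the bytes.
    return response["Body"].read().decode("utf-8")
```

#### For the endpoint deployed above, the name is available as `predictor.endpoint_name`. The caller is responsible for parsing the returned text, because no deserializer is applied on this path.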