## AWS SageMaker Inference Types

This Jupyter Notebook provides a comprehensive exploration of the various AWS SageMaker inference types, offering a detailed overview and practical insights into each type's capabilities, use cases, and considerations. AWS SageMaker is a powerful platform for building, training, and deploying machine learning models, and understanding its inference options is essential for optimizing model deployment.

The notebook covers the following AWS SageMaker inference types:

**Real-Time Inference**

This section delves into the real-time inference capabilities of SageMaker. It explains how to deploy a trained model as an endpoint and make real-time predictions using RESTful APIs. Considerations like endpoint scaling, latency, and cost are discussed, along with best practices for real-time inference scenarios.

**Batch Transform**

The Batch Transform section explores the use of SageMaker for processing large batches of data for inference. It demonstrates how to set up batch transform jobs, optimize resource allocation, and handle output results efficiently. This type of inference is suitable for scenarios where high throughput and efficiency are paramount.

**Multi-Model Endpoints**

In this section, the notebook showcases the concept of multi-model endpoints, allowing the deployment of multiple model versions as a single endpoint. It explains how to manage and route requests to specific models based on different criteria, ensuring seamless updates and version control without disrupting user experience.

**Multi-Container Endpoints**

The Multi-Container Endpoints section illustrates deploying several containers, each with its own model or framework, behind a single SageMaker endpoint. It demonstrates how to package a custom inference container and deploy it alongside pre-built SageMaker containers, enabling diverse use cases that may require specialized software stacks.

**Elastic Inference**

Elastic Inference is explored as a means to optimize the cost of deploying deep learning models. This section explains how to attach Elastic Inference accelerators to SageMaker instances, allowing users to match inference resources to model complexity.

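**Asynchronous Inference**

The Asynchronous Inference section covers submitting inference requests that SageMaker processes in the background, and offers tips on handling retries and timeouts.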

----

#### What is SageMaker Inference?

Machine learning models play a pivotal role in today's data-driven world, enabling businesses to extract insights, make predictions, and automate decision-making processes. However, the journey from model development to deployment can be complex, requiring a reliable and scalable infrastructure. This is where AWS SageMaker comes into play.

Amazon SageMaker is a fully managed machine learning service provided by Amazon Web Services (AWS). It streamlines the process of building, training, and deploying machine learning models, allowing data scientists and developers to focus on the modeling aspects rather than the underlying infrastructure. One of the critical aspects of deploying models with SageMaker is choosing the appropriate inference type based on the specific requirements of the application.

**Inference**, in the context of machine learning, refers to the process of using a trained model to make predictions on new, unseen data. AWS SageMaker offers various inference types, each tailored to different scenarios and use cases.

### 1. Real-Time Inference

In [None]:
import sagemaker
from sagemaker.predictor import Predictor

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = "arn:aws:iam::ACCOUNT_NUMBER:role/service-role/AmazonSageMaker-ExecutionRole"

# Specify the location of the trained model artifacts
model_artifact = "s3://your-bucket/model-artifacts/model.tar.gz"

# Inference container image (placeholder): a generic Model needs one to be deployable
image_uri = "ACCOUNT_NUMBER.dkr.ecr.REGION.amazonaws.com/your-inference-image:latest"

# Create a SageMaker model; predictor_cls makes deploy() return a Predictor
model = sagemaker.Model(image_uri=image_uri,
                        model_data=model_artifact,
                        role=role,
                        predictor_cls=Predictor,
                        sagemaker_session=sagemaker_session)

# Deploy the model as an endpoint
predictor = model.deploy(instance_type="ml.m4.xlarge", initial_instance_count=1)

# Make predictions using the endpoint
input_data = ...
result = predictor.predict(input_data)

# Clean up: Delete the endpoint
predictor.delete_endpoint()

Real-time inference, also known as online or synchronous inference, involves making predictions using a machine learning model as soon as new data becomes available. This type of inference is characterized by its low latency and immediate response, making it well-suited for various use cases where timely predictions are crucial.
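A deployed endpoint can also be called without the SageMaker Python SDK, through the low-level runtime API. The sketch below is illustrative only: it assumes a container that accepts and returns JSON, and a client created with `boto3.client("sagemaker-runtime")`. The helper name `invoke_realtime` and the payload shape are assumptions, not part of the original notebook.

```python
import json


def invoke_realtime(runtime_client, endpoint_name, payload):
    """Send one JSON request to a real-time endpoint and return the decoded reply.

    runtime_client is assumed to be boto3.client("sagemaker-runtime"); it is
    passed in as an argument so the helper can be exercised with a stub.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read())
```

The call blocks until the model has answered, which is what makes this mode synchronous: the client's latency is the model's latency plus network overhead.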

#### Real-time Inference in SageMaker Pipeline

Amazon SageMaker Pipelines is a feature of Amazon SageMaker for building, automating, and managing end-to-end machine learning workflows as code, from data preprocessing to model deployment. It should not be confused with an inference pipeline, represented in the SageMaker Python SDK by `PipelineModel`, which chains a sequence of containers (for example, preprocessing followed by prediction) behind a single real-time endpoint. The example in this section uses `PipelineModel`.

In [None]:
import sagemaker

from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = "arn:aws:iam::ACCOUNT_NUMBER:role/service-role/AmazonSageMaker-ExecutionRole"

# Specify the location of the trained model artifacts
model_artifact = "s3://your-bucket/model-artifacts/model.tar.gz"

# Inference container image (placeholder): a generic Model needs one to be deployable
image_uri = "ACCOUNT_NUMBER.dkr.ecr.REGION.amazonaws.com/your-inference-image:latest"

# Create a SageMaker model
model = Model(image_uri=image_uri,
              model_data=model_artifact,
              role=role,
              sagemaker_session=sagemaker_session)

# Create a SageMaker PipelineModel (more models can be appended to the list)
pipeline_model = PipelineModel(
    name="example-pipeline-model",
    role=role,
    models=[model],
    sagemaker_session=sagemaker_session)

# Deploy the pipeline model as a single endpoint
endpoint_name = "example-pipeline-endpoint"
pipeline_model.deploy(initial_instance_count=1,
                      instance_type="ml.m4.xlarge",
                      endpoint_name=endpoint_name)

# Attach a real-time predictor that sends and receives JSON
real_time_predictor = Predictor(endpoint_name=endpoint_name,
                                sagemaker_session=sagemaker_session,
                                serializer=JSONSerializer(),
                                deserializer=JSONDeserializer())

# Make predictions using the real-time predictor
input_data = ...
result = real_time_predictor.predict(input_data)

# Clean up: Delete the endpoint
real_time_predictor.delete_endpoint()

#### When to use real-time inference?

Use real-time inference when your application requires immediate responses to user interactions. In use cases such as recommendation systems, chatbots, fraud detection, and real-time analytics, real-time inference ensures a seamless and responsive user experience. If your data is dynamic and changes frequently, real-time inference allows you to make predictions using the most up-to-date information, ensuring accurate and relevant results.

Applications that involve real-time decision-making, such as autonomous vehicles, industrial automation, and IoT devices, benefit from real-time inference to enable instant responses and adapt to changing conditions. Use cases with stringent latency requirements, such as high-frequency trading, medical diagnosis, and online gaming, demand real-time inference to meet the sub-second response times.

Real-time inference is essential for quickly detecting anomalies or deviations from expected patterns, allowing immediate action to mitigate potential issues. Applications that provide immediate feedback to users, like language translation, speech recognition, and image recognition, rely on real-time inference to enhance user interaction.

#### When NOT to use real-time inference?

If you have a large batch of data to process and latency is not a critical factor, batch inference might be more efficient. Batch inference can process data in parallel and is optimized for high throughput. If you have predictable or scheduled inference workloads, you can plan ahead and use batch or scheduled inference jobs to optimize resource utilization. Real-time inference might not be suitable for use cases with extremely high inference rates where the volume of requests overwhelms the real-time endpoint, causing latency spikes and reduced performance.

----

### 2. Batch Transform Inference

Batch Transform allows you to perform large-scale batch processing of data using trained machine learning models. Unlike real-time inference, where predictions are made immediately upon receiving new data, batch transform is designed for scenarios where you have a large amount of data that needs to be processed offline in bulk.

In Batch Transform, you provide the input data in batches, and SageMaker processes these batches using the trained model to generate predictions or other desired outputs. Batch Transform is particularly useful when low latency is not a requirement, and you can afford to wait for the processing to complete.

Input data for Batch Transform must be stored in Amazon S3. Which input formats are accepted, such as CSV, JSON, or Parquet, depends on the model container. The data should be organized in a way that's compatible with the model's input requirements.

In [None]:
import sagemaker
from sagemaker.transformer import Transformer

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = "arn:aws:iam::ACCOUNT_NUMBER:role/service-role/AmazonSageMaker-ExecutionRole"

# Specify the location of the trained model artifacts
model_artifact = "s3://your-bucket/model-artifacts/model.tar.gz"

# Specify the input data location
input_data = "s3://your-bucket/input-data/input.csv"

# Specify the output data location
output_data = "s3://your-bucket/output-data/"

# Create a transformer object
transformer = Transformer(model_name="your-model-name",
                          instance_count=1,
                          instance_type="ml.m4.xlarge",
                          strategy="SingleRecord",
                          assemble_with="Line",
                          output_path=output_data,
                          base_transform_job_name="batch-transform-job")

# Start the batch transform job
transformer.transform(data=input_data,
                      data_type="S3Prefix",
                      content_type="text/csv",
                      split_type="Line")

# Wait for the job to complete
transformer.wait()


Parameters used in the `Transformer` constructor and the `transform()` method above:

**model_name**
This parameter specifies the name of the trained SageMaker model that you want to use for batch transformation.

**instance_count**
The number of instances to use for batch transformation. This determines the level of parallelism and resource allocation for the transformation job.

**instance_type** 
The type of instance to use for batch transformation. This determines the computing power and memory available for the transformation job.

**strategy** 
The transformation strategy to use. In this example, "SingleRecord" indicates that each input record is sent to the model in its own request. The other option is "MultiRecord", which packs as many records as fit within the payload limit into each request.

**assemble_with** 
Specifies how the output is assembled. "Line" adds a newline character after each transformed record. The other option is "None", which concatenates the results as-is.

**output_path** 
The S3 location where the transformed data will be stored after processing. This is where the results of the batch transformation will be saved.

**data**
The S3 location of the input data.

**data_type**
The data type of the input data. Here, "S3Prefix" indicates that the input data is located in an S3 prefix.

**content_type**
The content type of the input data, such as "text/csv" in this example.

**split_type**
The type of splitting to apply to the input data. Here, "Line" indicates that each line of the input data is treated as a separate record.

**wait()**
Waits for the batch transform job to complete, so that code execution doesn't proceed until the job has finished processing.

#### Batch Transform Limitation

Amazon SageMaker Batch Transform imposes a limit on the size of each payload sent to the model container. The limit is set by `MaxPayloadInMB` (`max_payload` in the SageMaker Python SDK), which defaults to 6 MB, and the payload size multiplied by the number of concurrent transforms per instance cannot exceed 100 MB. The limit applies to individual requests, not to the dataset as a whole. If you encounter an error related to the input size being too large, here are a few approaches you can take:

**Split the input data**
The most common approach. If your data can be split into distinct records or subsets, set `split_type` (for example "Line") so that SageMaker sends records or mini-batches instead of whole files, or run separate batch transform jobs for each subset.

**Break the data into smaller chunks**
Divide larger input data into smaller files that each fall within the limit. You can then process these chunks in one job that reads an S3 prefix, or sequentially using multiple batch transform jobs.

**Reduce the data size**
Preprocess the data to reduce its size, for example by dropping fields the model doesn't use. You might also consider sampling a representative subset of the data for batch transformation.

**Filter before transforming**
Instead of processing the entire dataset in one go, retrieve it in smaller segments. You can use Amazon S3 Select to filter and retrieve only the necessary data for each batch transform job.
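Breaking input into chunks that stay under a size limit can be sketched in plain Python. This is a minimal illustration, not a SageMaker API: the function name `split_lines_by_size` is hypothetical, sizes are counted in UTF-8 bytes with one newline per record, and uploading each chunk to S3 is left out.

```python
def split_lines_by_size(lines, max_bytes):
    """Group text lines into chunks whose UTF-8 size stays within max_bytes.

    Each line is counted with a trailing newline. A single line larger than
    max_bytes cannot be split further and raises ValueError.
    """
    chunks, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if size > max_bytes:
            raise ValueError("a single record exceeds the size limit")
        # Start a new chunk when adding this line would exceed the limit.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


records = ["1,2,3", "4,5,6", "7,8,9"]
chunks = split_lines_by_size(records, max_bytes=12)
print(chunks)  # [['1,2,3', '4,5,6'], ['7,8,9']]
```

Each chunk could then be written to its own S3 object under a common prefix, so that a job started with `data_type="S3Prefix"` picks up all of them.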

An example of batch transform step in SageMaker pipeline:

In [None]:
import sagemaker
from sagemaker.inputs import TransformInput
from sagemaker.transformer import Transformer
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TransformStep

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = "arn:aws:iam::ACCOUNT_NUMBER:role/service-role/AmazonSageMaker-ExecutionRole"

# Specify the location of the trained model artifacts
model_artifact = "s3://your-bucket/model-artifacts/model.tar.gz"

# Specify the input data location
input_data = "s3://your-bucket/input-data/input.csv"

# Specify the output data location
output_data = "s3://your-bucket/output-data/"

# Create a Transformer object for a SageMaker model that already exists
transformer = Transformer(model_name="your-model-name",
                          instance_count=1,
                          instance_type="ml.m4.xlarge",
                          strategy="SingleRecord",
                          assemble_with="Line",
                          output_path=output_data,
                          base_transform_job_name="batch-transform-job",
                          sagemaker_session=sagemaker_session)

# Define a Batch Transform step
batch_transform_step = TransformStep(
    name="BatchTransformStep",
    transformer=transformer,
    inputs=TransformInput(data=input_data,
                          data_type="S3Prefix",
                          content_type="text/csv",
                          split_type="Line"))

# Create a SageMaker pipeline
pipeline = Pipeline(
    name="BatchTransformPipeline",
    steps=[batch_transform_step],
    sagemaker_session=sagemaker_session
)

# Create or update the pipeline definition
pipeline.upsert(role_arn=role)

# Execute the pipeline and wait for it to finish
execution = pipeline.start()
execution.wait()

Model artifact refers to the result of training a machine learning model. It includes the trained model parameters, weights, and other artifacts that define the model's learned behavior. The model artifact is a key component required for deploying and using a trained model for making predictions or performing other tasks.

#### Three ways to tune batch transform job:

**Strategy** 
Choose the right transformation strategy based on your input data and use case. Strategies like "MultiRecord" can be more efficient for larger datasets, while "SingleRecord" is suitable for individual records or small batches.

**Data Format**
Optimize your input and output data formats. Choose data formats that are efficient for processing and storage. For example, Parquet or binary formats might be more efficient than plain text for large datasets.

**Output Format**
Consider how you want the output data to be assembled and organized. The `assemble_with` value you choose ("Line" or "None") can affect the final result and post-processing efforts.
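The tuning options above interact with the payload limit, so it can help to check a configuration before starting a job. The following is a small sketch in plain Python; the function name `check_transform_settings` is hypothetical, and the rules it encodes are the two strategy values, the two `assemble_with` values, and the 100 MB cap on payload size times concurrent transforms.

```python
VALID_STRATEGIES = {"SingleRecord", "MultiRecord"}
VALID_ASSEMBLE_WITH = {"None", "Line"}
MAX_TOTAL_PAYLOAD_MB = 100


def check_transform_settings(strategy, assemble_with, max_payload_mb,
                             max_concurrent_transforms):
    """Return a list of problems found in a batch transform configuration."""
    problems = []
    if strategy not in VALID_STRATEGIES:
        problems.append(f"unknown strategy: {strategy}")
    if assemble_with not in VALID_ASSEMBLE_WITH:
        problems.append(f"unknown assemble_with: {assemble_with}")
    # Payload size times concurrency is capped per instance.
    if max_payload_mb * max_concurrent_transforms > MAX_TOTAL_PAYLOAD_MB:
        problems.append("max_payload_mb * max_concurrent_transforms exceeds 100 MB")
    return problems


print(check_transform_settings("MultiRecord", "Line", 6, 4))  # []
```

An empty list means the combination passes these checks; it does not guarantee that the model container accepts the chosen content type.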

----

### 3. Multi-Model Inference

A multi-model endpoint in Amazon SageMaker is an endpoint that allows you to deploy and serve multiple machine learning models behind a single endpoint. This feature is particularly useful when you have a variety of models that you want to deploy together for inference. Instead of creating separate endpoints for each model, you can consolidate them into a single multi-model endpoint, which can save resources, simplify deployment, and improve management.

In [None]:
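import sagemaker
from sagemaker.model import Model
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.predictor import Predictor

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = "arn:aws:iam::ACCOUNT_NUMBER:role/service-role/AmazonSageMaker-ExecutionRole"

# Specify the location of the trained model artifacts
model_artifact_1 = "s3://your-bucket/model1-artifacts/model.tar.gz"
model_artifact_2 = "s3://your-bucket/model2-artifacts/model.tar.gz"

# S3 prefix from which the endpoint loads models on demand (placeholder)
model_data_prefix = "s3://your-bucket/multi-model-artifacts/"

# Inference container image (placeholder); it must support multi-model serving
image_uri = "ACCOUNT_NUMBER.dkr.ecr.REGION.amazonaws.com/your-inference-image:latest"

# Base model that supplies the container image and role for the endpoint
model_1 = Model(image_uri=image_uri,
                model_data=model_artifact_1,
                role=role,
                sagemaker_session=sagemaker_session)

# Create a MultiDataModel object and register the models under the prefix
multi_model = MultiDataModel(name="your-multi-model",
                             model_data_prefix=model_data_prefix,
                             model=model_1,
                             sagemaker_session=sagemaker_session)
multi_model.add_model(model_data_source=model_artifact_1, model_data_path="model1.tar.gz")
multi_model.add_model(model_data_source=model_artifact_2, model_data_path="model2.tar.gz")

# Deploy the multi-model endpoint
endpoint_name = "your-multi-model-endpoint"
multi_model.deploy(initial_instance_count=1,
                   instance_type="ml.m4.xlarge",
                   endpoint_name=endpoint_name)

# Make inference requests, naming the model that should handle each one
predictor = Predictor(endpoint_name=endpoint_name,
                      sagemaker_session=sagemaker_session)
input_data = ...
result = predictor.predict(input_data, target_model="model1.tar.gz")

# Clean up: Delete the multi-model endpoint
predictor.delete_endpoint()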

#### When to use Multi-Model?


Multi-model endpoints work best when you have many models that share the same container and framework, and not all of them need to be in memory at once. When you have multiple versions of the same model, a multi-model endpoint can help you manage and serve different versions without the need for separate endpoints. Serving models together on shared instances can be more resource-efficient than deploying a separate endpoint for each model, especially when individual models receive little traffic. For experimentation and A/B testing purposes, you can deploy multiple models simultaneously and compare their performance; because the caller selects the model for each request, the client application decides how traffic is divided. An ensemble of models can also be served this way, with the client invoking each member model and combining the predictions.

##### Note to consider

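Each inference request must specify which model to use for prediction. With the `InvokeEndpoint` API this is the `TargetModel` parameter (`target_model` in the SageMaker Python SDK), set to the artifact's path relative to the endpoint's S3 prefix. This requires modifying the client application to include this information in the request.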
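As a sketch of what that client change can look like with the low-level runtime API: the helper below passes the model name through the `TargetModel` parameter of `invoke_endpoint`. It assumes a JSON-serving container and a client created with `boto3.client("sagemaker-runtime")`; the helper name `invoke_target_model` and the artifact name used in the comment are illustrative.

```python
import json


def invoke_target_model(runtime_client, endpoint_name, target_model, payload):
    """Invoke one specific model hosted on a multi-model endpoint.

    runtime_client is assumed to be boto3.client("sagemaker-runtime").
    target_model is the artifact path relative to the endpoint's S3 prefix,
    for example "model1.tar.gz".
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel=target_model,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())
```

The first request for a given model can be slower than later ones, because the endpoint has to load that artifact from S3 before serving it.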

----

### 4. Elastic Inference

Amazon SageMaker Elastic Inference is a feature that allows you to attach a fraction of GPU-powered inference acceleration to your Amazon SageMaker instances. It helps optimize the cost and performance trade-off by letting you pair a CPU instance with an accelerator sized for your inference workload, which can be more cost-effective than using a dedicated GPU instance for every inference workload. Note that AWS has stopped onboarding new customers to Elastic Inference, so check current availability before building on it.

Elastic Inference works by attaching an accelerator to a SageMaker instance when the endpoint is deployed, without requiring you to provision a dedicated GPU instance. You choose the accelerator type (for example `ml.eia1.medium`) that matches the needs of your specific inference workload.

In [None]:
import sagemaker
from sagemaker.tensorflow import TensorFlowModel

# Initialize SageMaker session and role
sagemaker_session = sagemaker.Session()
role = "arn:aws:iam::ACCOUNT_NUMBER:role/service-role/AmazonSageMaker-ExecutionRole"

# Specify the location of the trained model artifacts
model_artifact = "s3://your-bucket/model-artifacts/model.tar.gz"

# Create a TensorFlowModel
tf_model = TensorFlowModel(model_data=model_artifact,
                           role=role,
                           framework_version="2.4.1",
                           sagemaker_session=sagemaker_session)

# Deploy the model with Elastic Inference
predictor = tf_model.deploy(initial_instance_count=1,
                            instance_type="ml.m5.large",  # Host (CPU) instance type
                            accelerator_type="ml.eia1.medium",  # Elastic Inference accelerator to attach
                            endpoint_name="your-endpoint-name")

# Make inference requests
input_data = ...
result = predictor.predict(input_data)

# Delete the endpoint
predictor.delete_endpoint()

Elastic Inference accelerators attach to a CPU host instance, so the host instance type and the accelerator type are chosen separately.

Host instance types that can be used with Elastic Inference include:

> ml.m5.large, ml.m5.xlarge, ml.m5.2xlarge, ml.m5.4xlarge, ml.m5.12xlarge, ml.m5.24xlarge

These instances can be augmented with Elastic Inference accelerators to balance cost and performance for inference workloads.

The choice of accelerator type depends on factors such as the complexity of your model, the desired latency, and your budget constraints.

#### When should we use Elastic Inference?

Consider using Amazon SageMaker Elastic Inference when you want to optimize the performance and cost of your inference workloads by attaching GPU acceleration to your SageMaker instances. Elastic Inference is particularly useful in scenarios where dedicated GPU instances might be overprovisioned for your inference requirements, leading to higher costs, or where you want to find the right balance between cost and performance. 

If your inference workloads need some GPU acceleration but not a full GPU, Elastic Inference allows you to add it without provisioning dedicated GPU instances. It can be more cost-effective than using full GPU instances for all workloads.

For low-latency applications, Elastic Inference can help accelerate your inference workloads without the overhead of provisioning full GPU instances. If your models are relatively lightweight and don't require the full power of dedicated GPU instances, Elastic Inference can be a cost-efficient choice. Because the accelerator is chosen independently of the host instance type, you can size CPU, memory, and acceleration separately instead of over-provisioning.

----

### 5. Asynchronous Inference

Asynchronous inference queues incoming requests and processes them in the background. It is particularly useful for requests with large payloads or long processing times, and when you have a substantial amount of data to process and want to submit many inference requests without waiting for each one.

With asynchronous inference, the request payload is placed in Amazon S3 and the call returns immediately. SageMaker processes the queued requests independently of the client application and writes each result to an S3 output location. This improves efficiency and throughput, because the client does not have to wait for each individual inference to complete.

In [None]:
import sagemaker
from sagemaker.async_inference import WaiterConfig
from sagemaker.predictor_async import AsyncPredictor
from sagemaker.tensorflow import TensorFlowPredictor

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()

# Create a TensorFlowPredictor for an existing endpoint.
# The endpoint must have been deployed with an asynchronous inference configuration.
predictor = TensorFlowPredictor(endpoint_name="your-endpoint-name",
                                sagemaker_session=sagemaker_session)

# Wrap it so that requests are submitted asynchronously
async_predictor = AsyncPredictor(predictor)

# Prepare a batch of input data for asynchronous inference
input_data_batch = [...] # List of dict

# Perform asynchronous inference: each call uploads the payload to S3 and returns immediately
responses = [async_predictor.predict_async(data=record) for record in input_data_batch]

# Wait for each result (polls the S3 output location, here up to 60 times, 15 seconds apart)
waiter_config = WaiterConfig(max_attempts=60, delay=15)

# Process inference results
for response in responses:
    print(response.get_result(waiter_config))

# Delete the endpoint
predictor.delete_endpoint()


#### What are the limitations of asynchronous inference?

Asynchronous inference processes requests in the background, so results might become available in a different order than the requests were submitted. You need to match each result to its request when processing the output. The response format depends on the content type specified in the request, so ensure that your application can correctly parse and interpret it. In addition, processing a large number of requests requires sufficient compute resources, and you might need to tune the instance type and the number of instances for your workload.

Scaling the endpoint to handle a large number of concurrent asynchronous requests also requires careful configuration and monitoring to ensure resource availability and efficient processing. And while asynchronous inference can improve throughput, individual requests might still experience varying latencies, depending on the workload and resource availability.
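One way to keep results matched to their requests is to record the identifiers returned at submission time. The sketch below uses the low-level `invoke_endpoint_async` call and assumes a client created with `boto3.client("sagemaker-runtime")` and inputs already uploaded to S3; the helper name `submit_async_requests` and the record layout are illustrative.

```python
def submit_async_requests(runtime_client, endpoint_name, input_locations):
    """Submit one asynchronous request per S3 input and remember where each result goes.

    runtime_client is assumed to be boto3.client("sagemaker-runtime").
    Returns one record per input, in submission order.
    """
    records = []
    for location in input_locations:
        response = runtime_client.invoke_endpoint_async(
            EndpointName=endpoint_name,
            InputLocation=location,
            ContentType="application/json",
        )
        records.append({
            "input": location,
            "inference_id": response["InferenceId"],
            "output": response["OutputLocation"],
        })
    return records
```

Because each record pairs an input with its own output location, results can be collected in submission order even when the endpoint finishes them in a different order.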

#### Tips when using asynchronous inference

**Retries**

When performing asynchronous inference, multiple inference requests are processed in parallel. This parallel processing introduces the possibility of failures, such as network issues, instance failures, or model errors, for individual inferences within the batch. To handle these failures, you might need to implement a retry mechanism. Retries involve resubmitting the failed inference requests to the endpoint in the hope that they will eventually succeed.

Considerations for implementing retries:

Determine an appropriate number of retry attempts. Too many retries might cause excessive delay, while too few retries might result in missed inferences. Implement an exponential backoff strategy, where you gradually increase the time between retries to avoid overloading the endpoint. Set a maximum retry timeout to prevent waiting indefinitely for a response.
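These considerations can be combined in a small retry helper. The following is a generic sketch, not a SageMaker API: `retry_with_backoff` and its default values are assumptions chosen for illustration, and `operation` stands for whatever call submits or fetches a single inference.

```python
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0,
                       max_total_wait=60.0, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the delay after each failure.

    Gives up when max_attempts is reached or when the next delay would push
    the accumulated waiting time past max_total_wait. The last exception is
    re-raised in both cases.
    """
    waited = 0.0
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            last_attempt = attempt == max_attempts - 1
            if last_attempt or waited + delay > max_total_wait:
                raise
            sleep(delay)
            waited += delay
```

In practice you would catch only the exceptions that indicate a transient failure (throttling, timeouts) rather than every `Exception`, so that genuine model errors surface immediately.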

**Timeouts**

Timeouts are especially important in asynchronous inference scenarios because processing a large batch of data might take longer than processing individual synchronous inferences. Specifying an appropriate timeout ensures that your client application doesn't wait excessively for a response.

Considerations for setting timeouts:

Set the timeout value based on your model's processing time and the expected range of batch sizes. Larger batches might require longer timeouts. Ensure that your timeout value is reasonable to avoid prematurely giving up on inferences that are still being processed.
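A client-side timeout can be sketched as a polling loop with a deadline. This is a generic illustration, not a SageMaker API: `wait_for_result` is a hypothetical helper, and `fetch` stands for any function that returns `None` while the result is not yet available (for example, one that checks whether the output object exists in S3).

```python
import time


def wait_for_result(fetch, timeout_seconds, poll_interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll fetch() until it returns a non-None value or the timeout elapses."""
    deadline = clock() + timeout_seconds
    while True:
        result = fetch()
        if result is not None:
            return result
        # Stop if another poll interval would take us past the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(f"no result after {timeout_seconds} seconds")
        sleep(poll_interval)
```

Injecting `clock` and `sleep` keeps the helper easy to test; in normal use the defaults are sufficient.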

**An example scenario:**

Suppose you're using a client application to submit a batch of 1000 inference requests to an asynchronous inference endpoint. The model's processing time per inference is around 2 seconds on average, so processing the requests one at a time would take about 2,000 seconds. You set a timeout of 15 minutes (900 seconds) for the batch, which means at least three requests must be processed concurrently for the batch to finish in time. If all inferences complete successfully and the endpoint has that capacity, the response should be received before the timeout. If some inferences fail initially, your retry mechanism could resubmit the failed ones. However, keep in mind that retries introduce additional time. If your retry mechanism is successful and the extra time still fits in the budget, you should still receive a response within the timeout.

In this example, setting an appropriate timeout and implementing retries are crucial to ensure that your client application efficiently handles asynchronous inference, including potential failures and variations in processing times.
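The arithmetic behind the scenario can be checked with a short calculation. This is a rough estimate under simplifying assumptions: every request takes the same time, there is no queueing or retry overhead, and `concurrency` means the number of requests processed at once. Both function names are illustrative.

```python
import math


def estimated_batch_seconds(num_requests, seconds_per_request, concurrency):
    """Rough completion time when `concurrency` requests are processed at once."""
    return math.ceil(num_requests / concurrency) * seconds_per_request


def min_concurrency(num_requests, seconds_per_request, timeout_seconds):
    """Smallest concurrency whose estimated completion time fits in the timeout."""
    # Number of requests a single worker can finish before the timeout.
    per_worker = timeout_seconds // seconds_per_request
    if per_worker == 0:
        raise ValueError("a single request already exceeds the timeout")
    return math.ceil(num_requests / per_worker)


print(estimated_batch_seconds(1000, 2, concurrency=1))  # 2000
print(min_concurrency(1000, 2, timeout_seconds=900))    # 3
```

With three requests in flight the estimate drops to 668 seconds, which leaves roughly four minutes of the 900-second budget for retries.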

---

### Conclusion

SageMaker offers a streamlined process for model deployment, eliminating the complexities of managing infrastructure and scaling resources. This enables data scientists and developers to focus on their core tasks rather than the operational overhead. Additionally, SageMaker supports various inference options, including real-time, batch, and asynchronous modes, accommodating a wide range of use cases. Its ability to seamlessly integrate with other AWS services ensures data security, scalability, and robustness. Moreover, SageMaker provides a rich set of monitoring and debugging tools, allowing users to track performance, detect anomalies, and optimize resource utilization. 

Overall, leveraging SageMaker for inference empowers organizations to efficiently and cost-effectively transform trained models into valuable insights and predictions, accelerating their journey from data science to production deployment.