# Reducing Inference Costs on DeepSeek-R1-Distill-Llama-8B with SageMaker Inference's Scale to Zero Capability

This notebook demonstrates how to scale a SageMaker endpoint in to zero instances during idle periods, which removes the previous requirement of keeping at least one instance running.

❗This notebook works well on an `ml.t3.medium` instance with the `PyTorch 2.2.0 Python 3.10 CPU optimized` kernel from **SageMaker Studio Classic** or the `Python3` kernel from **JupyterLab**.

## Set up Environment

In [None]:
import boto3
import sagemaker


boto_region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
role = sagemaker.get_execution_role()

In [None]:
import boto3
from typing import Dict


def get_cfn_outputs(stackname: str, region_name: str = 'us-east-1') -> Dict[str, str]:
    """Return the outputs of a CloudFormation stack as an {OutputKey: OutputValue} dict."""
    cfn = boto3.client('cloudformation', region_name=region_name)
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

In [None]:
CFN_STACK_NAME = "SageMakerInferenceComponent" # name of CloudFormation stack

cfn_outputs = get_cfn_outputs(CFN_STACK_NAME, region_name=boto_region)
endpoint_name = cfn_outputs['EndpointName']
inference_component_name = cfn_outputs['InferenceComponentName']

endpoint_name, inference_component_name

## Create a Predictor with SageMaker Endpoint name

In [None]:
from sagemaker import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer


predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

predictor.endpoint_name

### Inference with SageMaker SDK

The SageMaker Python SDK simplifies inference through the `sagemaker.Predictor` class.

The `DeepSeek-R1-Distill-Llama-8B` model follows the Llama 3.1 8B prompt format, shown below:

```text
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2024
Today Date: 29 Jan 2025

You are a helpful assistant that thinks and reasons before answering.

<|eot_id|>
<|start_header_id|>user<|end_header_id|>
How many R are in STRAWBERRY? Keep your answer and explanation short!
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
```

In [None]:
from typing import List, Dict
from datetime import datetime


def format_messages(messages: List[Dict[str, str]]) -> str:
    """
    Format messages for Llama 3+ chat models.

    The model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and
    alternating (u/a/u/a/u...). The last message must be from 'user'.
    """
    # auto assistant suffix
    # messages.append({"role": "assistant"})

    output = "<|begin_of_text|>"
    # Adding an inferred prefix
    system_prefix = f"\n\nCutting Knowledge Date: December 2024\nToday Date: {datetime.now().strftime('%d %b %Y')}\n\n"
    for entry in messages:
        output += f"<|start_header_id|>{entry['role']}<|end_header_id|>"
        if entry['role'] == 'system':
            output += f"{system_prefix}{entry['content']}<|eot_id|>"
        elif entry['role'] != 'system' and 'content' in entry:
            output += f"\n\n{entry['content']}<|eot_id|>"
    output += "<|start_header_id|>assistant<|end_header_id|>\n"
    return output

def send_prompt(predictor, initial_args, messages, parameters):
    # convert u/a format
    frmt_input = format_messages(messages)
    payload = {
        "inputs": frmt_input,
        "parameters": parameters
    }
    response = predictor.predict(
        initial_args=initial_args,
        data=payload)
    return response

### Test the endpoint with a sample prompt

Now we can invoke our endpoint with sample text to test its functionality and see the model's output.

In [None]:
%%time

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that thinks and reasons before answering."
    },
    {
        "role": "user",
        "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"
    }
]

response = send_prompt(
    predictor=predictor,
    initial_args={
        'InferenceComponentName': inference_component_name
    },
    messages=messages,
    parameters={
        "temperature": 0.6,
        "max_new_tokens": 512
    }
)

print(response['generated_text'])

## Automatically Scale To Zero

### Scaling policies

Once the endpoint is deployed and `InService`, you can add the necessary scaling policies:

- A [target tracking policy](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html) that scales the inference component's model copy count between 1 and n, and scales it in to zero when the endpoint is idle.
- A [step scaling policy](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) that will allow the endpoint to scale out from zero.

These policies work together to provide cost-effective scaling - the endpoint can scale to zero when idle and automatically scale out as needed to handle incoming requests.
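
In this notebook the policies are assumed to exist already (for example, created by the CloudFormation stack). As a rough sketch of what equivalent Application Auto Scaling requests could look like, the helper below builds the request parameters as plain dictionaries and applies them with `boto3`. The function names, policy names, target value, cooldowns, and maximum copy count are illustrative assumptions, not values read from the stack.

```python
from typing import Any, Dict

SERVICE_NAMESPACE = "sagemaker"
SCALABLE_DIMENSION = "sagemaker:inference-component:DesiredCopyCount"


def build_scaling_requests(inference_component_name: str, max_copies: int = 2) -> Dict[str, Dict[str, Any]]:
    """Build request parameters for the scalable target and both policies (hypothetical helper)."""
    common = {
        "ServiceNamespace": SERVICE_NAMESPACE,
        "ResourceId": f"inference-component/{inference_component_name}",
        "ScalableDimension": SCALABLE_DIMENSION,
    }
    return {
        # MinCapacity=0 is what permits scaling the copy count in to zero
        "scalable_target": {**common, "MinCapacity": 0, "MaxCapacity": max_copies},
        "target_tracking": {
            **common,
            "PolicyName": f"{inference_component_name}-target-tracking",  # assumed name
            "PolicyType": "TargetTrackingScaling",
            "TargetTrackingScalingPolicyConfiguration": {
                "PredefinedMetricSpecification": {
                    "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
                },
                "TargetValue": 5.0,       # assumed invocations per copy
                "ScaleInCooldown": 300,   # assumed, in seconds
                "ScaleOutCooldown": 300,  # assumed, in seconds
            },
        },
        "step_scaling": {
            **common,
            "PolicyName": f"{inference_component_name}-scale-out-from-zero",  # assumed name
            "PolicyType": "StepScaling",
            "StepScalingPolicyConfiguration": {
                "AdjustmentType": "ChangeInCapacity",
                "MetricAggregationType": "Maximum",
                "Cooldown": 60,
                # add one copy whenever the attached alarm fires
                "StepAdjustments": [{"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}],
            },
        },
    }


def apply_scaling_requests(requests: Dict[str, Dict[str, Any]], region_name: str) -> str:
    """Register the target and policies; returns the step scaling policy ARN."""
    import boto3  # imported here so the builder above can be used without AWS access

    client = boto3.client("application-autoscaling", region_name=region_name)
    client.register_scalable_target(**requests["scalable_target"])
    client.put_scaling_policy(**requests["target_tracking"])
    response = client.put_scaling_policy(**requests["step_scaling"])
    return response["PolicyARN"]
```

A step scaling policy only acts when a CloudWatch alarm triggers it, so the returned ARN would still need to be attached to an alarm (the AWS blog post listed in the references uses a metric for invocations that fail because no capacity is available).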

### Testing the behaviour

Notice the `MinInstanceCount: 0` setting in the endpoint configuration, which allows the endpoint to scale in to zero instances. With the scaling policies, the CloudWatch alarm, and the minimum instance count set to zero, the SageMaker endpoint can automatically scale in to zero instances when not in use, which helps optimize cost and resource utilization.
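
For orientation, the relevant part of an endpoint configuration might look like the sketch below. It builds one entry for the `ProductionVariants` list accepted by `create_endpoint_config`. The helper name, variant name, and instance counts are placeholders chosen for illustration; the actual values live in the CloudFormation stack.

```python
from typing import Any, Dict


def build_production_variant(instance_type: str, max_instance_count: int = 1) -> Dict[str, Any]:
    """Build a production variant that is allowed to scale in to zero instances (hypothetical helper)."""
    return {
        "VariantName": "AllTraffic",  # assumed variant name
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "ManagedInstanceScaling": {
            "Status": "ENABLED",
            "MinInstanceCount": 0,  # the setting that permits zero instances
            "MaxInstanceCount": max_instance_count,
        },
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }
```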

### Inference Component (IC) copy count scales in to zero

We'll pause for a few minutes without invoking the model. Based on the target tracking policy, when the SageMaker endpoint receives no requests for about 10 to 15 minutes, the number of model copies automatically scales in to zero.

In [None]:
import sys
import time

sagemaker_client = boto3.client("sagemaker", region_name=boto_region)

# stay idle for 10 minutes so the target tracking policy can scale the copies in
time.sleep(600)
start_time = time.time()
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
print(desc)
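
Rather than scanning the full printed description, the copy counts can be read directly. The sketch below assumes the `DescribeInferenceComponent` response carries a `RuntimeConfig` section with `CurrentCopyCount` and `DesiredCopyCount`; the helper name is hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def get_copy_counts(description: Dict[str, Any]) -> Tuple[Optional[int], Optional[int]]:
    """Return (current, desired) copy counts, or None for a count that is absent."""
    runtime_config = description.get("RuntimeConfig", {})
    return runtime_config.get("CurrentCopyCount"), runtime_config.get("DesiredCopyCount")
```

With the `desc` value from the cell above, `current, desired = get_copy_counts(desc)` should yield zero for both once scale-in has completed.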

### Endpoint's instances scale in to zero

After a few additional minutes of inactivity, SageMaker automatically terminates all underlying instances of the endpoint, eliminating all associated costs.

In [None]:
# after about 1 minute, the instances should scale in to zero
time.sleep(60)

# wait for the endpoint to settle; CurrentInstanceCount in the returned description should be zero
sagemaker_session.wait_for_endpoint(endpoint_name)
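
To check the instance count programmatically, one option is to sum `CurrentInstanceCount` across the production variants of a `DescribeEndpoint` response. This is a minimal sketch with a hypothetical helper name; it treats a variant without the field as contributing zero instances.

```python
from typing import Any, Dict


def total_current_instances(endpoint_description: Dict[str, Any]) -> int:
    """Sum CurrentInstanceCount over all production variants of an endpoint description."""
    variants = endpoint_description.get("ProductionVariants", [])
    return sum(variant.get("CurrentInstanceCount", 0) for variant in variants)
```

For example, `total_current_instances(sagemaker_client.describe_endpoint(EndpointName=endpoint_name))` is expected to return `0` after the endpoint has scaled in.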

### Invoke the endpoint with a sample prompt

If we try to invoke our endpoint while instances are scaled down to zero, we get a validation error: `An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API`.

In [None]:
print(time.strftime("%H:%M:%S"))

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that thinks and reasons before answering."
    },
    {
        "role": "user",
        "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"
    }
]

response = send_prompt(
    predictor=predictor,
    initial_args={
        'InferenceComponentName': inference_component_name
    },
    messages=messages,
    parameters={
        "temperature": 0.6,
        "max_new_tokens": 512
    }
)

print(response['generated_text'])
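
In an application, a client would typically retry instead of failing on the first "no capacity" error. The sketch below is one possible approach, not part of the original notebook: it retries a callable with a fixed delay whenever the exception message contains "no capacity", which is an assumption based on the error text quoted above. The function name and default values are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def invoke_with_retry(
    invoke: Callable[[], T],
    max_attempts: int = 10,
    delay_seconds: float = 60.0,
    is_retryable: Callable[[Exception], bool] = lambda exc: "no capacity" in str(exc),
) -> T:
    """Call invoke() until it succeeds, retrying only on errors that look capacity-related."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except Exception as exc:
            # give up on the last attempt or on errors unrelated to capacity
            if attempt == max_attempts or not is_retryable(exc):
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay_seconds} seconds")
            time.sleep(delay_seconds)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

A call could then be wrapped as `invoke_with_retry(lambda: send_prompt(predictor, initial_args, messages, parameters))`, where the arguments are the same values used in the cell above.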

### Scale out from zero kicks in

However, after about 1 minute the step scaling policy should kick in. SageMaker then starts provisioning a new instance and deploys the inference component model copy to handle requests. This demonstrates the endpoint's ability to automatically scale out from zero when needed.

In [None]:
# after about 1 minute, the instances should scale out from zero to one
time.sleep(60)

# wait for the endpoint to settle; CurrentInstanceCount in the returned description should be at least one
sagemaker_session.wait_for_endpoint(endpoint_name)

In [None]:
import sys
import time


start_time = time.time()
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
print(desc)

### Verify that the endpoint has successfully scaled out from zero

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that thinks and reasons before answering."
    },
    {
        "role": "user",
        "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"
    }
]

response = send_prompt(
    predictor=predictor,
    initial_args={
        'InferenceComponentName': inference_component_name
    },
    messages=messages,
    parameters={
        "temperature": 0.6,
        "max_new_tokens": 512
    }
)

print(response['generated_text'])

## References

- [✍🏻 (AWS Machine Learning Blog) Unlock cost savings with the new scale down to zero feature in SageMaker Inference (2024-12-02)](https://aws.amazon.com/blogs/machine-learning/unlock-cost-savings-with-the-new-scale-down-to-zero-feature-in-amazon-sagemaker-inference/)
- [💻 Unlock Cost Savings with New Scale-to-Zero Feature in SageMaker Inference](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/scale-to-zero-endpoint/llama3-8b-scale-to-zero-autoscaling.ipynb)
- [💻 Deploy DeepSeek R1 Large Language Model from HuggingFace Hub on Amazon SageMaker](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Deepseek/DeepSeek-R1-Llama8B-LMI-TGI-Deploy.ipynb)