# 🚀 Deploy Qwen QwQ 32B Large Language Model from HuggingFace Hub on Amazon SageMaker AI with Inference Components

## Introduction: [Qwen QwQ 32B](https://huggingface.co/Qwen/QwQ-32B)

[QwQ](https://huggingface.co/Qwen/QwQ-32B) is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.

In [1]:
%pip install -Uq sagemaker boto3

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
autogluon-multimodal 1.2 requires nvidia-ml-py3==7.352.0, which is not installed.
aiobotocore 2.19.0 requires botocore<1.36.4,>=1.36.0, but you have botocore 1.37.7 which is incompatible.
amazon-sagemaker-sql-magic 0.1.3 requires sqlparse==0.5.0, but you have sqlparse 0.5.3 which is incompatible.
autogluon-multimodal 1.2 requires jsonschema<4.22,>=4.18, but you have jsonschema 4.23.0 which is incompatible.
autogluon-multimodal 1.2 requires nltk<3.9,>=3.4.5, but you have nltk 3.9.1 which is incompatible.
autogluon-multimodal 1.2 requires omegaconf<2.3.0,>=2.1.1, but you have omegaconf 2.3.0 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import json
import sagemaker
import boto3
import sys
import time
from typing import List, Dict
from datetime import datetime
from sagemaker.huggingface import (
    HuggingFaceModel, 
    get_huggingface_llm_image_uri
)
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [12]:
boto_region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
role = sagemaker.get_execution_role()
sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime_client = boto3.client("sagemaker-runtime")
s3_client = boto3.client("s3")

prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")

prefix: DEMO-1741260388-e8ce


### Create SageMaker Endpoint Configuration
There are a few parameters to set up for the endpoint. We start with the variant name and the instance type the endpoint should use. We also set *model_data_download_timeout_in_seconds* and *container_startup_health_check_timeout_in_seconds* as guardrails for when we deploy inference components to the endpoint. Endpoints that host inference components can additionally use Managed Instance Scaling, which lets SageMaker scale the number of instances to match the scaling of your inference components; its *MinInstanceCount* and *MaxInstanceCount* settings size the endpoint for the workload you want to serve while keeping cost under control. Lastly, we set *RoutingStrategy* to control how requests are routed to instances and inference components for the best performance.

In [13]:
# Set a unique endpoint config name
endpoint_config_name = f"{prefix}-endpoint-config"
print(f"Demo endpoint config name: {endpoint_config_name}")

# Set variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

# Instance count bounds for the endpoint
initial_instance_count = 1
max_instance_count = 2
print(f"Initial instance count: {initial_instance_count}")
print(f"Max instance count: {max_instance_count}")

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

Demo endpoint config name: DEMO-1741260388-e8ce-endpoint-config
Initial instance count: 1
Max instance count: 2


{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:537124949553:endpoint-config/DEMO-1741260388-e8ce-endpoint-config',
 'ResponseMetadata': {'RequestId': '32f6a1e7-a183-4fcc-a756-bc47a146a42e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '32f6a1e7-a183-4fcc-a756-bc47a146a42e',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '117',
   'date': 'Thu, 06 Mar 2025 11:26:29 GMT'},
  'RetryAttempts': 0}}
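
The recorded output for this endpoint lists no `ManagedInstanceScaling` block, so the Managed Instance Scaling settings described above do not appear to be active in this run. A minimal sketch of how a production variant with that block could be built is shown here; the helper name `build_production_variant` is hypothetical, and the min/max values of 1 and 2 are assumptions taken from the printed instance counts.

```python
from typing import Any, Dict


def build_production_variant(
    variant_name: str,
    instance_type: str,
    min_instance_count: int,
    max_instance_count: int,
    timeout_seconds: int = 3600,
) -> Dict[str, Any]:
    """Build one ProductionVariants entry with Managed Instance Scaling enabled."""
    if not 1 <= min_instance_count <= max_instance_count:
        raise ValueError("expected 1 <= min_instance_count <= max_instance_count")
    return {
        "VariantName": variant_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": min_instance_count,
        "ModelDataDownloadTimeoutInSeconds": timeout_seconds,
        "ContainerStartupHealthCheckTimeoutInSeconds": timeout_seconds,
        # Lets SageMaker add or remove instances between the two bounds.
        "ManagedInstanceScaling": {
            "Status": "ENABLED",
            "MinInstanceCount": min_instance_count,
            "MaxInstanceCount": max_instance_count,
        },
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }


# Illustrative values matching the instance type used in this notebook.
variant = build_production_variant("AllTraffic", "ml.g5.12xlarge", 1, 2)
print(variant["ManagedInstanceScaling"])
```

The resulting dictionary could be passed as `ProductionVariants=[variant]` to `sagemaker_client.create_endpoint_config`.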

### Create SageMaker Endpoint
We can now use the endpoint configuration created in the last step to create an endpoint with SageMaker.

In [14]:
# Set a unique endpoint name
endpoint_name = f"{prefix}-endpoint"
print(f"Demo endpoint name: {endpoint_name}")

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

Demo endpoint name: DEMO-1741260388-e8ce-endpoint


{'EndpointArn': 'arn:aws:sagemaker:us-west-2:537124949553:endpoint/DEMO-1741260388-e8ce-endpoint',
 'ResponseMetadata': {'RequestId': '614c9997-8f9b-48f3-bc48-913565d7b3e8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '614c9997-8f9b-48f3-bc48-913565d7b3e8',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '97',
   'date': 'Thu, 06 Mar 2025 11:26:30 GMT'},
  'RetryAttempts': 0}}

In [15]:
sagemaker_session.wait_for_endpoint(endpoint_name)

----!

{'EndpointName': 'DEMO-1741260388-e8ce-endpoint',
 'EndpointArn': 'arn:aws:sagemaker:us-west-2:537124949553:endpoint/DEMO-1741260388-e8ce-endpoint',
 'EndpointConfigName': 'DEMO-1741260388-e8ce-endpoint-config',
 'ProductionVariants': [{'VariantName': 'AllTraffic',
   'CurrentInstanceCount': 1,
   'DesiredInstanceCount': 1,
   'RoutingConfig': {'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'}}],
 'EndpointStatus': 'InService',
 'CreationTime': datetime.datetime(2025, 3, 6, 11, 26, 30, 806000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2025, 3, 6, 11, 28, 49, 81000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '4aedc044-a35f-4a10-aa5c-9a76818d63f9',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4aedc044-a35f-4a10-aa5c-9a76818d63f9',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '468',
   'date': 'Thu, 06 Mar 2025 11:29:02 GMT'},
  'RetryAttempts': 0}}

## Deploy using HuggingFace TGI Container

Hugging Face Large Language Model (LLM) Inference Deep Learning Container (DLC) on Amazon SageMaker enables developers to efficiently deploy and serve open-source LLMs at scale. This DLC is powered by Text Generation Inference (TGI), an open-source, purpose-built solution optimized for high-performance text generation tasks. 

**Key Features of HuggingFace TGI Containers:**

* **Tensor Parallelism**: Distributes computation across multiple GPUs, allowing the deployment of large models that exceed the memory capacity of a single GPU.
* **Dynamic Batching**: Aggregates multiple incoming requests into a single batch, enhancing throughput and resource utilization.
* **Optimized Transformers Code**: Utilizes advanced techniques like flash-attention to improve inference speed and efficiency for popular model architectures like DeepSeek, Llama, Falcon, Mistral, Mixtral and many more.

**Benefits for Deploying LLMs with HuggingFace TGI on Amazon SageMaker:**

* **Simplified Deployment**: TGI containers provide a low-code interface, allowing users to specify configurations like model parallelization and optimization settings through straightforward configuration, such as the container environment variables used in this notebook.
* **Performance Optimization**: By leveraging optimized inference libraries and techniques, such as tensor parallelism and dynamic batching, these containers enhance inference performance, reducing latency and improving throughput. 
* **Scalability**: Designed to handle large models, TGI containers enable efficient scaling across multiple GPUs or specialized hardware like AWS Inferentia, ensuring that even the most demanding models can be deployed effectively. 

Choose an appropriate model name and endpoint name when hosting your model.

For a list of available TGI container image versions, please refer to the [TGI Release Page](https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+gpu&expanded=true).

In [16]:
tgi_inference_image_uri = get_huggingface_llm_image_uri(
     "huggingface", 
     version="2.3.1"
)
print(f"Using TGI Image: {tgi_inference_image_uri}")

Using TGI Image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04


Create a new [SageMaker HuggingFaceModel](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html)

## Create Model Artifact
We will deploy the Qwen QwQ 32B model using the TGI container. To do so, set the image you would like to use together with the proper configuration. We also create a SageMaker model that is referenced when creating the inference component.

In [18]:
qwen_qwq_32b = "Qwen/QwQ-32B"
qwen_tgi_model = {
    "Image": tgi_inference_image_uri,
    "Environment": {
        "HF_MODEL_ID": qwen_qwq_32b,
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        "MESSAGES_API_ENABLED": "true",
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "SM_NUM_GPUS": "4",
        "MAX_TOTAL_TOKENS": "8192",
        "MAX_INPUT_TOKENS": "4096",
        'HF_HUB_ENABLE_HF_TRANSFER': "1",
        "PORT": "8080"
    },
}
model_name_tgi = f"qwen-qwq-32b-tgi-{datetime.now().strftime('%y%m%d-%H%M%S')}"
# create SageMaker Model
sagemaker_client.create_model(
    ModelName=model_name_tgi,
    ExecutionRoleArn=role,
    Containers=[qwen_tgi_model],
)

{'ModelArn': 'arn:aws:sagemaker:us-west-2:537124949553:model/qwen-qwq-32b-tgi-250306-112909',
 'ResponseMetadata': {'RequestId': '1e48e9d6-f811-4e73-ab9f-ceea0c9be2e4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '1e48e9d6-f811-4e73-ab9f-ceea0c9be2e4',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '92',
   'date': 'Thu, 06 Mar 2025 11:29:09 GMT'},
  'RetryAttempts': 0}}
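
Before calling `create_model`, it can help to sanity-check the container environment. The sketch below is a hypothetical helper (`check_tgi_environment` is not part of any SDK) that checks two things: every environment value is a string, as the SageMaker API expects, and `MAX_INPUT_TOKENS` is strictly smaller than `MAX_TOTAL_TOKENS`, which TGI requires.

```python
from typing import Dict, List


def check_tgi_environment(env: Dict[str, str]) -> List[str]:
    """Return a list of problems found in a TGI container environment."""
    problems = []
    # SageMaker expects every environment value to be a string.
    for key, value in env.items():
        if not isinstance(value, str):
            problems.append(f"{key} must be a string, got {type(value).__name__}")
    # TGI requires the input budget to be strictly smaller than the total budget.
    try:
        max_input = int(env["MAX_INPUT_TOKENS"])
        max_total = int(env["MAX_TOTAL_TOKENS"])
    except (KeyError, ValueError, TypeError):
        problems.append("MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS must both be set to integers")
    else:
        if max_input >= max_total:
            problems.append("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")
    return problems


# Same token limits as the model definition above.
print(check_tgi_environment({"MAX_INPUT_TOKENS": "4096", "MAX_TOTAL_TOKENS": "8192"}))
```

With the values used in this notebook, a prompt may use up to 4096 tokens, and prompt plus completion together cannot exceed 8192 tokens.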

We can now create the inference component, which will be deployed on the endpoint that you specify. Note that the specification can reference either a SageMaker model or a container. If you provide a container, you need to supply an image and an artifact URL as parameters. In this example we set it to the model name we prepared in the cells above. `ComputeResourceRequirements` tells SageMaker what should be reserved for each copy of the inference component, and the copy count sets how many copies of the inference component to deploy. Copies can be managed and scaled later as your workload requires.

Note that in this example we set `NumberOfAcceleratorDevicesRequired` to a value of `4`. By doing so we reserve 4 accelerators for each copy of this inference component so that we can use tensor parallelism.

In [19]:
inference_component_name_qwen = f"{prefix}-IC-qwen-32b-{datetime.now().strftime('%y%m%d-%H%M%S')}"
variant_name = "AllTraffic"

sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name_qwen,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name_tgi,
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 4,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

{'InferenceComponentArn': 'arn:aws:sagemaker:us-west-2:537124949553:inference-component/DEMO-1741260388-e8ce-IC-qwen-32b-250306-112911',
 'ResponseMetadata': {'RequestId': 'c39b1f0c-4b1c-476f-b6a3-6542d8b7a1d3',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'c39b1f0c-4b1c-476f-b6a3-6542d8b7a1d3',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '135',
   'date': 'Thu, 06 Mar 2025 11:29:11 GMT'},
  'RetryAttempts': 0}}
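
To see why 4 accelerators are reserved per copy, a rough back-of-the-envelope estimate helps. The sketch below assumes about 32.5 billion parameters stored as 16-bit weights and 4 GPUs with 24 GB each (the A10G GPUs of an `ml.g5.12xlarge`). It ignores KV cache, activations and framework overhead, so treat the numbers as approximate.

```python
def weight_memory_per_gpu_gb(num_params: float, bytes_per_param: int, num_gpus: int) -> float:
    """Approximate weight memory per GPU (in GB) when weights are sharded evenly."""
    return num_params * bytes_per_param / num_gpus / 1e9


# Assumptions: ~32.5e9 parameters, 2 bytes per parameter, 24 GB per GPU.
GPU_MEMORY_GB = 24
single_gpu = weight_memory_per_gpu_gb(32.5e9, 2, 1)
sharded = weight_memory_per_gpu_gb(32.5e9, 2, 4)

print(f"weights on 1 GPU: {single_gpu:.2f} GB (fits: {single_gpu < GPU_MEMORY_GB})")
print(f"weights per GPU with 4-way sharding: {sharded:.2f} GB (fits: {sharded < GPU_MEMORY_GB})")
print(f"approximate headroom per GPU: {GPU_MEMORY_GB - sharded:.2f} GB")
```

Under these assumptions the weights alone exceed a single 24 GB GPU, while splitting them across 4 GPUs leaves room on each device for the KV cache.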

Wait until the inference component is InService

In [20]:
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name_qwen
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
InService
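
The loop above exits on `Failed` without raising and has no upper time bound. A slightly more defensive variant is sketched below, assuming you want an exception on failure and a timeout; `wait_for_status` is a hypothetical helper, not a SageMaker SDK function.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    success: str = "InService",
    failure: str = "Failed",
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Poll get_status() until it returns `success`; raise on `failure` or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == success:
            return status
        if status == failure:
            raise RuntimeError(f"resource entered status {failure!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Possible usage with the client from this notebook (not executed here):
# wait_for_status(
#     lambda: sagemaker_client.describe_inference_component(
#         InferenceComponentName=inference_component_name_qwen
#     )["InferenceComponentStatus"]
# )
```

Passing the status lookup as a callable keeps the helper independent of boto3, so the same function could also be used to poll endpoint status.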


In [21]:
inference_component_name_qwen


'DEMO-1741260388-e8ce-IC-qwen-32b-250306-112911'
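
Once the component is in service, its copy count can be changed with `update_inference_component_runtime_config`. The sketch below estimates how many instances a given copy count needs, assuming copies are packed by accelerator count alone (CPU and memory reservations are ignored), and builds the request arguments. Both helper names are hypothetical.

```python
import math
from typing import Any, Dict


def instances_needed(copy_count: int, accelerators_per_copy: int,
                     accelerators_per_instance: int) -> int:
    """Instances required if copies are packed by accelerator count alone."""
    copies_per_instance = accelerators_per_instance // accelerators_per_copy
    if copies_per_instance == 0:
        raise ValueError("one copy does not fit on a single instance")
    return math.ceil(copy_count / copies_per_instance)


def copy_count_update(component_name: str, copy_count: int) -> Dict[str, Any]:
    """Keyword arguments for update_inference_component_runtime_config."""
    return {
        "InferenceComponentName": component_name,
        "DesiredRuntimeConfig": {"CopyCount": copy_count},
    }


# Two copies at 4 accelerators each on instances with 4 GPUs.
print(instances_needed(copy_count=2, accelerators_per_copy=4, accelerators_per_instance=4))

# The request could then be sent with (not executed here):
# sagemaker_client.update_inference_component_runtime_config(
#     **copy_count_update(inference_component_name_qwen, 2)
# )
```

Because each copy reserves all 4 GPUs of an `ml.g5.12xlarge`, every additional copy needs an additional instance, so the endpoint must be able to grow beyond its single initial instance before the copy count is raised.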

### Inference with SageMaker SDK

In [22]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks and reasons before answering."},
    {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}
]

payload = {
        "messages": messages,
        "max_tokens": 512,
        "temperature": 0.6
    }

response_model = sagemaker_runtime_client.invoke_endpoint(
    InferenceComponentName=inference_component_name_qwen,
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)
response_qwq_tgi = response_model["Body"].read().decode("utf8")
response_qwq_tgi

'{"object":"chat.completion","id":"","created":1741261295,"model":"Qwen/QwQ-32B","system_fingerprint":"2.3.1-native","choices":[{"index":0,"message":{"role":"assistant","content":"Okay, let\'s see. The user is asking how many R\'s are in the word \\"STRAWBERRY\\". Hmm, first I need to make sure I spell STRAWBERRY correctly. Let me write it out: S-T-R-A-W-B-E-R-R-Y. Wait, let me check again. S, T, then R. So after the T comes R. Then A, W, B, E, and then another R? Let me count the letters one by one.\\n\\nBreaking it down: S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). Oh, so after the E there are two R\'s in a row. So the letters R appear at positions 3, 8, and 9? Wait, no. Wait a second, maybe I miscounted. Let me go through it again step by step.\\n\\nS-T-R-A-W-B-E-R-R-Y. Let\'s list each letter with its position:\\n\\n1. S\\n2. T\\n3. R\\n4. A\\n5. W\\n6. B\\n7. E\\n8. R\\n9. R\\n10. Y\\n\\nSo the R\'s are at positions 3, 8, and 9. That would be three R\'s. 

In [27]:
print(json.loads(response_qwq_tgi)['choices'][0]['message']['content'])

Okay, let's see. The user is asking how many R's are in the word "STRAWBERRY". Hmm, first I need to make sure I spell STRAWBERRY correctly. Let me write it out: S-T-R-A-W-B-E-R-R-Y. Wait, let me check again. S, T, then R. So after the T comes R. Then A, W, B, E, and then another R? Let me count the letters one by one.

Breaking it down: S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). Oh, so after the E there are two R's in a row. So the letters R appear at positions 3, 8, and 9? Wait, no. Wait a second, maybe I miscounted. Let me go through it again step by step.

S-T-R-A-W-B-E-R-R-Y. Let's list each letter with its position:

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

So the R's are at positions 3, 8, and 9. That would be three R's. Wait, but sometimes people might misspell STRAWBERRY. Let me confirm the correct spelling. Is it STRAWBERRY with two R's or three? Let me think. The word is spelled S-T-R-A-W-B-E-R-R-Y. Yes, that's correct. After the E, ther
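
The printed output stops mid-sentence, likely because the reasoning used up the `max_tokens` budget of 512 before the model reached its final answer. The sketch below separates reasoning from answer, assuming the reasoning ends with a closing `</think>` tag and the response carries an OpenAI-style `finish_reason` field. Both are assumptions about the response format, and the sample response is made up for illustration.

```python
import json
from typing import Dict, Optional


def parse_chat_completion(raw: str) -> Dict[str, Optional[str]]:
    """Extract reasoning, answer and finish reason from a chat.completion JSON string."""
    choice = json.loads(raw)["choices"][0]
    content = choice["message"]["content"]
    # Assumption: reasoning ends with </think>; a missing tag means it was cut off.
    reasoning, tag, answer = content.partition("</think>")
    return {
        "reasoning": reasoning.replace("<think>", "").strip(),
        "answer": answer.strip() if tag else None,
        "finish_reason": choice.get("finish_reason"),
    }


# Hypothetical response, shaped like the output above but not real model output.
sample = json.dumps({
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Count the letters.</think>\n\nThree."},
        "finish_reason": "stop",
    }],
})
print(parse_chat_completion(sample))
```

If `answer` comes back as `None`, raising `max_tokens` in the payload (within the `MAX_TOTAL_TOKENS` limit of the container) gives the model room to finish its reasoning.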

### Cleanup

In [32]:
sagemaker_client.delete_inference_component(InferenceComponentName=inference_component_name_qwen)

{'ResponseMetadata': {'RequestId': '38aabee9-aedd-48d1-9e28-aaec1bc7c94e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '38aabee9-aedd-48d1-9e28-aaec1bc7c94e',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 04 Mar 2025 04:33:35 GMT',
   'content-length': '0'},
  'RetryAttempts': 0}}

In [None]:
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
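
The endpoint configuration and the SageMaker model created earlier also remain after the endpoint is deleted. A minimal sketch of a fuller cleanup is shown below, assuming the endpoint should only be deleted once no inference components are attached to it; `cleanup_endpoint` is a hypothetical helper, and the polling interval and limit are arbitrary.

```python
import time


def cleanup_endpoint(client, endpoint_name, endpoint_config_name, model_name,
                     poll_seconds=15.0, max_polls=80):
    """Delete endpoint, endpoint config and model once no inference components remain."""
    for _ in range(max_polls):
        remaining = client.list_inference_components(EndpointNameEquals=endpoint_name)
        if not remaining["InferenceComponents"]:
            break
        time.sleep(poll_seconds)
    else:
        # The loop ran out of polls without seeing an empty component list.
        raise TimeoutError("inference components are still attached to the endpoint")
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)


# Possible usage with the names from this notebook (not executed here):
# cleanup_endpoint(sagemaker_client, endpoint_name, endpoint_config_name, model_name_tgi)
```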