# Deploy quantized (AWQ) version of DeepSeek R1 on Amazon SageMaker AI

## Introduction: [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)

DeepSeek-R1 is an open-source reasoning model developed by [DeepSeek](https://www.deepseek.com/). It is designed to handle tasks requiring logical inference, mathematical problem-solving, and real-time decision-making. Notably, DeepSeek-R1 achieves performance comparable to leading foundation models on math, code, and reasoning benchmarks.

The DeepSeek-R1 series includes several variants, each with distinct training methodologies and objectives:

1. **DeepSeek-R1-Zero**: This model was trained entirely through reinforcement learning (RL) without any supervised fine-tuning (SFT). While it developed strong reasoning capabilities, it faced challenges such as less readable outputs and occasional mixing of languages within responses, making it less practical for real-world applications. 


2. **DeepSeek-R1**: To address the limitations of R1-Zero, DeepSeek-R1 was developed using a hybrid approach that combines reinforcement learning with supervised fine-tuning. This method incorporated curated datasets to improve the model's readability and coherence, effectively reducing issues like language mixing and fragmented reasoning. As a result, DeepSeek-R1 is more suitable for practical use. 


3. **DeepSeek-R1 Distilled Models**: These are smaller, more efficient versions of the original DeepSeek-R1 model, created through a process called distillation. Distillation involves training a compact model to replicate the behavior of a larger model, thereby retaining much of its reasoning power while reducing computational demands. DeepSeek has released several distilled models based on different architectures, such as Qwen and Llama, with varying parameter sizes (e.g., 1.5B, 7B, 14B, 32B, and 70B). These distilled models offer a balance between performance and resource efficiency, making them accessible for a wider range of applications. 

The table below lists the non-distilled DeepSeek-R1 model variants:

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download** | **Suggested Instances for Hosting** |
| :------------: | :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-R1-Zero | 671B | 37B | 128K   | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero)   | `ml.p5e.48xlarge` |
| DeepSeek-R1   | 671B | 37B |  128K   | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1)   | `ml.p5e.48xlarge` |


## 1. Setup development environment

We will use the `sagemaker` Python SDK to deploy the model to Amazon SageMaker. Make sure you have an AWS account configured and the `sagemaker` Python SDK installed.

In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

In [None]:
import json
import boto3
import sagemaker

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region of the current SageMaker session (public attribute)

sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker endpoints
s3_client = boto3.client("s3")

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

## 2. Retrieve the LMI DLC

See the [list of available Large Model Inference (LMI) containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) for more information.

In [None]:
vllm_image = sagemaker.image_uris.retrieve(framework="djl-lmi", region=region, version="0.30.0")
# Temporary override: pin a newer LMI image, replacing the URI retrieved above.
vllm_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

print(f"LMI-vLLM image: {vllm_image}")

## 3. Deploy cognitivecomputations/DeepSeek-R1-AWQ to Amazon SageMaker

To deploy a model to Amazon SageMaker, we create a `Model` object and define the endpoint configuration, including the Hugging Face model ID (set through `option.model_id` in `serving.properties`) and the `instance_type`. We will use an `ml.p4de.24xlarge` instance.
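
As a rough sanity check on the instance choice, the arithmetic below estimates the weight footprint of a 671B-parameter model (the total from the table above) at several precisions and compares it with the aggregate GPU memory of an `ml.p4de.24xlarge` (8 x A100 80 GB). `estimate_weight_gb` is a hypothetical helper, and the 4-bit figure ignores AWQ scales, zero points, any layers left unquantized, and the KV cache, so treat it as a lower bound rather than a measurement.

```python
def estimate_weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight size in GB (10^9 bytes), ignoring quantization metadata."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 671e9    # total parameters of DeepSeek-R1
GPU_MEMORY_GB = 8 * 80  # ml.p4de.24xlarge: 8 x A100 with 80 GB each

for label, bits in [("FP16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
    size_gb = estimate_weight_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size_gb < GPU_MEMORY_GB else "does not fit"
    print(f"{label:>10}: ~{size_gb:,.1f} GB of weights ({verdict} in {GPU_MEMORY_GB} GB)")
```

Under these assumptions only the 4-bit weights leave headroom for activations and the KV cache on a single instance, which is consistent with the small `max_model_len` and batch size used in the configuration.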

In [None]:
model_config_prefix = "models/DeepSeek-R1-AWQ"
gpu_instance_type = "ml.p4de.24xlarge"

## Deploy using LMI container

In [None]:
%%writefile serving.properties
option.model_id=cognitivecomputations/DeepSeek-R1-AWQ
option.rolling_batch=vllm
option.dtype=fp16
option.quantize=awq_marlin
option.trust_remote_code=True
option.tensor_parallel_degree=max
option.gpu_memory_utilization=.87
option.kv_cache_dtype=fp8_e4m3
option.max_model_len=17600
option.max_rolling_batch_size=2
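
If you prefer to keep the configuration in Python (for example, to try several `max_model_len` values), the same file contents can be rendered from a dictionary. This is a minimal sketch: `render_serving_properties` is a hypothetical helper, not part of the LMI container or the SageMaker SDK, and the values simply mirror the cell above.

```python
def render_serving_properties(options: dict) -> str:
    """Render key/value pairs as `option.<key>=<value>` lines."""
    return "\n".join(f"option.{key}={value}" for key, value in options.items()) + "\n"


# Same settings as the serving.properties cell, expressed as a dictionary.
lmi_options = {
    "model_id": "cognitivecomputations/DeepSeek-R1-AWQ",
    "rolling_batch": "vllm",
    "dtype": "fp16",
    "quantize": "awq_marlin",
    "trust_remote_code": True,
    "tensor_parallel_degree": "max",
    "gpu_memory_utilization": 0.87,
    "kv_cache_dtype": "fp8_e4m3",
    "max_model_len": 17600,
    "max_rolling_batch_size": 2,
}

properties_text = render_serving_properties(lmi_options)
print(properties_text)  # write this string to serving.properties to use it
```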

In [None]:
%%writefile requirements.txt
vllm==0.7.0

In [None]:
%%sh
tar czvf config.tar.gz ./serving.properties ./requirements.txt

In [None]:
config_uri = sess.upload_data("config.tar.gz", bucket, model_config_prefix)

In [None]:
model_name = sagemaker.utils.name_from_base("DeepSeek-R1-AWQ")
endpoint_name = model_name

In [None]:
model = sagemaker.Model(name = model_name, 
                        image_uri = vllm_image, 
                        model_data = config_uri,
                        role = role)

In [None]:
model.deploy(initial_instance_count = 1,
             instance_type = gpu_instance_type,
             endpoint_name = endpoint_name,
             container_startup_health_check_timeout = 1200)
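
By default `model.deploy` blocks until the endpoint is ready. If you deploy with `wait=False`, or reconnect from another session while the endpoint is still being created, a polling loop like the one below can track progress. The sketch accepts any callable that returns the current status, so it runs without AWS; with boto3 you might pass `lambda: sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]`. `wait_for_status` is a hypothetical helper, and the set of transitional states is an assumption based on the documented endpoint statuses.

```python
import time
from typing import Callable

# Statuses during which the endpoint is still changing (assumed list).
TRANSITIONAL = {"Creating", "Updating", "SystemUpdating", "RollingBack"}


def wait_for_status(get_status: Callable[[], str],
                    poll_seconds: float = 30.0,
                    timeout_seconds: float = 3600.0) -> str:
    """Poll until the status is no longer transitional, or raise on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in TRANSITIONAL:
            return status  # for example "InService" or "Failed"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Stand-in for describe_endpoint so the example runs offline.
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0.01)
print(final_status)
```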

In [None]:
llm = sagemaker.Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

In [None]:
prompt = "What is Amazon SageMaker?"

res = llm.predict({"inputs": prompt, "parameters": {"temperature": 0.9, "max_tokens": 256}})
print(res["generated_text"])

In [None]:
question_1 = """
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively.

User: 9.11 and 9.8, which is greater?
Assistant: <think>
Think step by step
</think>
<answer>
[Solution will be provided here]
</answer>
"""

In [None]:
response = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({
        "messages": [{"role": "user", "content": question_1}],
        "max_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.9,
        "stream": True
    })
)

for event in response['Body']:
    try:
        line = event['PayloadPart']['Bytes'].decode("utf-8")
        chunk = json.loads(line)
        if 'choices' in chunk and len(chunk['choices']) > 0:
            content = chunk['choices'][0].get('delta', {}).get('content') or ''  # guard against a null content field
            finish_reason = chunk['choices'][0].get('finish_reason')  # reported on the choice, not inside the delta
            print(content, end='', flush=True)
    except json.JSONDecodeError:
        print("Error decoding JSON:", line)
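
The loop above assumes that each `PayloadPart` carries exactly one complete JSON object. Payload parts are raw byte chunks, so a JSON line could be split across events, or several lines could arrive together. A more defensive approach, sketched below with a hypothetical `iter_json_lines` helper, buffers the bytes and parses only complete lines. It assumes the container emits newline-delimited JSON; the fake events stand in for `response['Body']`.

```python
import json
from typing import Iterable, Iterator


def iter_json_lines(events: Iterable[dict]) -> Iterator[dict]:
    """Yield one parsed JSON object per newline-terminated line in the stream."""
    buffer = b""
    for event in events:
        buffer += event.get("PayloadPart", {}).get("Bytes", b"")
        # Parse every complete line; keep any trailing partial line buffered.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # final line without a trailing newline
        yield json.loads(buffer)


# Two events where the second JSON object is split across the boundary.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"choices": [{"delta": {"content": "9.8 "}}]}\n{"choi'}},
    {"PayloadPart": {"Bytes": b'ces": [{"delta": {"content": "is greater"}}]}\n'}},
]

text = "".join(
    chunk["choices"][0]["delta"].get("content") or ""
    for chunk in iter_json_lines(fake_events)
)
print(text)
```

Buffering bytes rather than decoded strings also avoids errors when a multi-byte UTF-8 character straddles two payload parts.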

In [None]:
question_2 = """
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively.

User: Plan a 1 week trip to Europe in March, I like historical sites
Assistant: <think>
Think step by step
</think>
<answer>
[Solution will be provided here]
</answer>
"""

In [None]:
response = smr_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({
        "messages": [{"role": "user", "content": question_2}],
        "max_tokens": 1024,
        "temperature": 0.6,
        "top_p": 0.9,
        "stream": True
    })
)

for event in response['Body']:
    try:
        line = event['PayloadPart']['Bytes'].decode("utf-8")
        chunk = json.loads(line)
        if 'choices' in chunk and len(chunk['choices']) > 0:
            content = chunk['choices'][0].get('delta', {}).get('content') or ''  # guard against a null content field
            finish_reason = chunk['choices'][0].get('finish_reason')  # reported on the choice, not inside the delta
            print(content, end='', flush=True)
    except json.JSONDecodeError:
        print("Error decoding JSON:", line)
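
Because the prompt asks for reasoning inside `<think>` tags, it can be useful to separate the reasoning trace from the final answer once the full response text has been collected. The following is a minimal sketch with a hypothetical `split_reasoning` helper. It assumes at most one `<think>...</think>` block, treats everything after it as the answer, and strips optional `<answer>` tags; real model output may not always follow the requested format.

```python
import re
from typing import Tuple


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) extracted from a tagged model response."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        reasoning, answer = "", text  # no reasoning block found
    else:
        reasoning, answer = match.group(1), text[match.end():]
    answer = re.sub(r"</?answer>", "", answer)  # drop optional answer tags
    return reasoning.strip(), answer.strip()


# Hand-written sample response, not actual model output.
sample = "<think>9.80 > 9.11 because 0.80 > 0.11.</think>\n<answer>9.8 is greater.</answer>"
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```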

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(model_name)