# Deploy deepseek-ai/DeepSeek-R1-Distill-* models on Amazon SageMaker using LMI container

Let's get started deploying one of the most capable open-source reasoning models available today!

## Introduction: [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)

DeepSeek-R1 is an open-source reasoning model developed by [DeepSeek](https://www.deepseek.com/). It is designed to handle tasks requiring logical inference, mathematical problem-solving, and real-time decision-making. Notably, DeepSeek-R1 achieves performance comparable to leading Foundation Models across various benchmarks, including math, code, and reasoning tasks. 

The DeepSeek-R1 series includes several variants, each with distinct training methodologies and objectives:

1. **DeepSeek-R1-Zero**: This model was trained entirely through reinforcement learning (RL) without any supervised fine-tuning (SFT). While it developed strong reasoning capabilities, it faced challenges such as less readable outputs and occasional mixing of languages within responses, making it less practical for real-world applications. 


2. **DeepSeek-R1**: To address the limitations of R1-Zero, DeepSeek-R1 was developed using a hybrid approach that combines reinforcement learning with supervised fine-tuning. This method incorporated curated datasets to improve the model's readability and coherence, effectively reducing issues like language mixing and fragmented reasoning. As a result, DeepSeek-R1 is more suitable for practical use. 


3. **DeepSeek-R1 Distilled Models**: These are smaller, more efficient versions of the original DeepSeek-R1 model, created through a process called distillation. Distillation involves training a compact model to replicate the behavior of a larger model, thereby retaining much of its reasoning power while reducing computational demands. DeepSeek has released several distilled models based on different architectures, such as Qwen and Llama, with varying parameter sizes (e.g., 1.5B, 7B, 14B, 32B, and 70B). These distilled models offer a balance between performance and resource efficiency, making them accessible for a wider range of applications. 

The table below lists the non-distilled DeepSeek-R1 model variants:

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download** | **Suggested Instances for Hosting** |
| :------------: | :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-R1-Zero | 671B | 37B | 128K   | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero)   | `ml.p5.48xlarge`, `ml.p5e.48xlarge` |
| DeepSeek-R1   | 671B | 37B |  128K   | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1)   | `ml.p5.48xlarge`, `ml.p5e.48xlarge` |

The table below lists the distilled DeepSeek-R1 model variants:

| **Model** | **Base Model** | **Download** | **Suggested Instances for Hosting** |
| :------------: | :------------: | :------------: | :------------: |
| DeepSeek-R1-Distill-Qwen-1.5B  | [Qwen2.5-Math-1.5B](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)   | `ml.g4dn.xlarge`, `ml.g5.xlarge`, `ml.g6.xlarge`, `ml.g6e.xlarge`   |
| DeepSeek-R1-Distill-Qwen-7B  | [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)   | `ml.g5.2xlarge`, `ml.g6.2xlarge`, `ml.g6e.2xlarge` |
| DeepSeek-R1-Distill-Llama-8B  | [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)   | `ml.g5.2xlarge`, `ml.g6.2xlarge`, `ml.g6e.2xlarge`   |
| DeepSeek-R1-Distill-Qwen-14B   | [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)   | `ml.g4dn.12xlarge`, `ml.g5.12xlarge`, `ml.g6.12xlarge`, `ml.g6e.12xlarge`   |
| DeepSeek-R1-Distill-Qwen-32B  | [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)   | `ml.g4dn.12xlarge`, `ml.g5.12xlarge`, `ml.g6.12xlarge`, `ml.g6e.12xlarge` |
| DeepSeek-R1-Distill-Llama-70B  | [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)   | `ml.g5.48xlarge`, `ml.g6.48xlarge`, `ml.g6e.48xlarge`, `ml.p4d.24xlarge`  |

> ⚠ **Warning:** This is not an exhaustive list of compatible instances. For the full list of instance types that SageMaker supports, see https://aws.amazon.com/sagemaker-ai/pricing/
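
As a rough sanity check on instance suggestions like the ones above, the memory needed for the weights alone can be estimated from the parameter count. The sketch below assumes 16-bit weights (2 bytes per parameter) and ignores the KV cache, activations, and runtime overhead, so it gives a lower bound rather than a sizing guarantee. The helper names are hypothetical and not part of any SDK.

```python
# Rough sizing sketch: weights only, ignoring KV cache, activations and runtime overhead.
BYTES_PER_PARAM_FP16 = 2  # assumption: weights are served in 16-bit precision


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Approximate memory needed for the model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_instance(num_params: float, gpu_count: int, gpu_memory_gb: float) -> bool:
    """True if the weights alone fit in the combined GPU memory of the instance."""
    return weight_memory_gb(num_params) <= gpu_count * gpu_memory_gb


# ml.p4d.24xlarge has 8 NVIDIA A100 GPUs with 40 GB each (320 GB in total).
print(weight_memory_gb(70e9))         # weights of a 70B-parameter model
print(fits_on_instance(70e9, 8, 40))  # does it fit on ml.p4d.24xlarge?
```

A model that passes this check can still run out of memory at serving time, because batch size and sequence length determine how much extra room the KV cache needs.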

## 1. Setup development environment

We will use the `sagemaker` Python SDK to deploy a DeepSeek-R1-Distill model to Amazon SageMaker. This requires a configured AWS account and an installed `sagemaker` Python SDK.

In [None]:
%pip install "sagemaker>=2.237.1" --upgrade --quiet --no-warn-conflicts

In [None]:
import boto3
import sagemaker
import huggingface_hub
from pathlib import Path

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker Studio environment

sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

## 2. Retrieve the LMI DLC

In [None]:
version = "0.30.0"
inference_image = sagemaker.image_uris.retrieve("djl-tensorrtllm", region=region, version=version)
print(f"Inference image: {inference_image}")

## 3. Deploy deepseek-ai/DeepSeek-R1-Distill-* to Amazon SageMaker

To deploy a model to Amazon SageMaker, we create a `Model` object and define the endpoint configuration, including the `HF_MODEL_ID` environment variable and the `instance_type`. We will use an `ml.p4d.24xlarge` instance.

### You can deploy any of the supported distilled models with this notebook. To do so, set the `model_id` variable in the download cell below to one of the following:
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- ~~deepseek-ai/DeepSeek-R1-Distill-Qwen-32B~~
- ~~deepseek-ai/DeepSeek-R1-Distill-Qwen-14B~~
- ~~deepseek-ai/DeepSeek-R1-Distill-Qwen-7B~~
- ~~deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B~~

The TensorRT-LLM (TRT-LLM) version in the LMI 0.30 container does NOT support the Qwen architecture, which is why the Qwen-based models are struck through.
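
To catch an unsupported choice before downloading tens of gigabytes, a small guard can check the model ID up front. This is a minimal sketch that infers the base architecture from the repository name, which is a naming heuristic and not an official compatibility check. The function name and the `SUPPORTED_FAMILIES` constant are hypothetical.

```python
# Base architectures usable with TensorRT-LLM in the LMI 0.30 container,
# according to the list in this notebook (assumption: only Llama).
SUPPORTED_FAMILIES = ("Llama",)


def is_supported_by_trtllm(model_id: str) -> bool:
    """Guess the base architecture from the repo name and check it against the list."""
    name = model_id.split("/")[-1]
    return any(f"-Distill-{family}-" in name for family in SUPPORTED_FAMILIES)


print(is_supported_by_trtllm("deepseek-ai/DeepSeek-R1-Distill-Llama-70B"))
print(is_supported_by_trtllm("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"))
```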

### Run ahead-of-time compilation (one-time activity)

In [None]:
model_id="deepseek-ai/DeepSeek-R1-Distill-Llama-70B"

hf_local_download_dir = Path.cwd() / "model_repo"
hf_local_download_dir.mkdir(exist_ok=True)

allow_patterns = ["*.json", "*.safetensors", "*.pt", "*.txt", "*.model", "*.tiktoken", "*.gguf"]

# Use snapshot_download to fetch the whole repository, since the model weights are stored with Git LFS
huggingface_hub.snapshot_download(
    repo_id=model_id,
    local_dir=hf_local_download_dir,
    allow_patterns=allow_patterns,
)
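
To get a feel for which files the `allow_patterns` list above would keep, the patterns can be tried against a few file names with the standard-library `fnmatch` module. This sketch assumes shell-style glob matching similar to `fnmatch`. The file names are made up for illustration and are not a listing of the actual repository.

```python
from fnmatch import fnmatchcase

allow_patterns = ["*.json", "*.safetensors", "*.pt", "*.txt", "*.model", "*.tiktoken", "*.gguf"]


def is_allowed(filename: str, patterns: list[str]) -> bool:
    """True if the file name matches at least one glob pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


# Hypothetical file names, for illustration only.
repo_files = [
    "config.json",
    "model-00001-of-00017.safetensors",
    "README.md",
    ".gitattributes",
    "original/params.json",
]

kept = [name for name in repo_files if is_allowed(name, allow_patterns)]
print(kept)
```

With `fnmatch`, `*` also matches the `/` separator, so a file in a subfolder such as `original/` passes the `*.json` pattern. If the download applies patterns to full relative paths in the same way, that may be why the next cell removes `model_repo/original` explicitly.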

In [None]:
!rm -rf model_repo/.ipynb_checkpoints
!rm -rf model_repo/.cache
!rm -rf model_repo/.gitattributes
!rm -rf model_repo/original

In [None]:
model_uri = sess.upload_data(
    path=hf_local_download_dir.as_posix(),
    bucket=bucket,
    key_prefix="inference-model",
)
model_uri = model_uri + "/"  # trailing slash so the URI points at the uncompressed model artifacts
model_uri

In [None]:
!aws s3 ls {model_uri} #verify model artifacts

In [None]:
prefix = "inference-model-trt"
model_name = sagemaker.utils.name_from_base(prefix)
output_location = f"s3://{bucket}/{prefix}/"
instance_type = "ml.p4d.24xlarge"

In [None]:
job_name = model_name
job_timeout = 7200

response = sm_client.create_optimization_job(
    OptimizationJobName=job_name,
    RoleArn=role,
    ModelSource={
        'S3': {
            'S3Uri': model_uri,
        }
    },
    DeploymentInstanceType=instance_type,
    OptimizationEnvironment={},
    OptimizationConfigs=[
        {
            'ModelCompilationConfig': {
                'Image': inference_image,
                'OverrideEnvironment': {
                    "OPTION_ROLLING_BATCH": "trtllm",
                    "OPTION_MAX_INPUT_LEN": "4096",
                    "OPTION_MAX_OUTPUT_LEN": "4096",
                    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
                    "OPTION_TENSOR_PARALLEL_DEGREE": "8",
                }
            },
        },
    ],
    OutputConfig={
        'S3OutputLocation': output_location
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': job_timeout,
        'MaxWaitTimeInSeconds': job_timeout,
        'MaxPendingTimeInSeconds': job_timeout
    },
)
response

In [None]:
sess.wait_for_optimization_job(job_name)
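
If more control over the wait is needed (for example a custom polling interval or a hard timeout), the job status can also be polled by hand. The sketch below is one possible shape for such a loop. It takes the describe call as a parameter so it can run without AWS access, and it assumes that the response carries an `OptimizationJobStatus` field whose terminal values are `COMPLETED`, `FAILED`, and `STOPPED`. The function `wait_for_job` and the fake describe function are hypothetical.

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}  # assumed terminal status values


def wait_for_job(describe, job_name, poll_seconds=30.0, timeout_seconds=7200.0, sleep=time.sleep):
    """Poll describe(job_name) until the job reaches a terminal status or the timeout expires."""
    waited = 0.0
    while True:
        status = describe(job_name)["OptimizationJobStatus"]
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{job_name} is still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for the real API so that the sketch runs without AWS access.
_fake_statuses = iter(["STARTING", "INPROGRESS", "COMPLETED"])


def fake_describe(name):
    return {"OptimizationJobStatus": next(_fake_statuses)}


final_status = wait_for_job(fake_describe, "demo-job", sleep=lambda seconds: None)
print(final_status)
```

In the notebook, the describe argument could be `lambda name: sm_client.describe_optimization_job(OptimizationJobName=name)`, with `job_name` and `job_timeout` taken from the cells above.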

In [None]:
env = {
    "HF_MODEL_ID": output_location,
    "OPTION_ROLLING_BATCH": "trtllm",
    "OPTION_MAX_INPUT_LEN": "4096",
    "OPTION_MAX_OUTPUT_LEN": "4096",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "8",
}

lmi_model = sagemaker.Model(
    image_uri = inference_image,
    env = env,
    role = role,
    name = model_name
)
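
The serving environment above repeats the options that were passed to the compilation job. Assuming the two are meant to stay in sync, one way to avoid drift is to define the options once and derive both dictionaries from them. The helper names and the S3 URI below are placeholders, not part of the SageMaker SDK.

```python
# Options shared by the compilation job and the serving container (values from this notebook).
TRTLLM_OPTIONS = {
    "OPTION_ROLLING_BATCH": "trtllm",
    "OPTION_MAX_INPUT_LEN": "4096",
    "OPTION_MAX_OUTPUT_LEN": "4096",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "8",
}


def compilation_env(options=TRTLLM_OPTIONS):
    """Environment overrides for the compilation job (a copy, so callers can modify it)."""
    return dict(options)


def serving_env(model_location, options=TRTLLM_OPTIONS):
    """Environment for the endpoint: the shared options plus the compiled model location."""
    return {"HF_MODEL_ID": model_location, **options}


env = serving_env("s3://example-bucket/inference-model-trt/")  # placeholder URI
print(env)
```

Here `compilation_env()` would take the place of the `OverrideEnvironment` dictionary, and `serving_env(output_location)` would take the place of `env`.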

After we have created the `Model` we can deploy it to Amazon SageMaker using the `deploy` method. We will deploy the model with the `ml.p4d.24xlarge` instance type. 

***LMI will automatically:***
- convert the model to TensorRT-LLM artifacts (unless it was already converted ahead of time by an optimization job)
- distribute and shard the model across all GPUs

In [None]:
# Deploy model to an endpoint.
# Model.deploy returns None when the Model has no predictor_cls, so the result is not kept.
lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = instance_type,
    container_startup_health_check_timeout = 3600,
    endpoint_name = model_name,
)

# Build a predictor for the endpoint that sends and receives JSON
llm = sagemaker.Predictor(
    endpoint_name = model_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.

In [None]:
recipe_deployment = """
How to deploy the DeepSeek R1 model on Amazon SageMaker using LMI container?
"""

prompt_template = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful ML assistant who is an expert in SageMaker hosting.
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Create a recipe here.

{recipe_deployment}

Provide the summary directly, without any introduction or preamble. Do not start the response with "Here is a...".<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""
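
When several prompts are needed, the template can be wrapped in a small function instead of being edited by hand. The sketch below uses the same header tokens as the template above. Its whitespace follows the Llama 3 prompt layout as I understand it, which is an assumption. The function name is hypothetical, and the tokenizer configuration of the model remains the authoritative source for the chat template.

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a single-turn prompt that ends with an open assistant header."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system.strip()}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user.strip()}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt(
    system="You are a helpful ML assistant who is an expert in SageMaker hosting.",
    user="How to deploy the DeepSeek R1 model on Amazon SageMaker using LMI container?",
)
print(prompt)
```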

In [None]:
response = llm.predict(
    {
        "inputs": prompt_template,
        "parameters": {
            "do_sample":True,
            "max_new_tokens":1024,
            "top_p":0.9,
            "temperature":0.6,
        }
    }
)

print(response['generated_text'])
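
DeepSeek-R1 style models typically emit their chain of reasoning before the final answer, wrapped in `<think>` tags. If only the answer should be shown, the generated text can be split at the closing tag. This is a minimal sketch that assumes at most one reasoning block and tolerates a missing opening tag. The function name is hypothetical.

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no closing tag is found."""
    head, separator, tail = text.partition("</think>")
    if not separator:
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\n2 + 2 is 4\n</think>\n\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print(answer)
```

In the notebook, this would be applied to `response['generated_text']`.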

## 4. Clean up

To clean up, we can delete the model and endpoint.


In [None]:
llm.delete_model()
llm.delete_endpoint()