# 🚀 Deploy DeepSeek R1 Large Language Model from HuggingFace Hub on Amazon SageMaker

## Introduction: [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)

DeepSeek-R1 is an open-source reasoning model developed by [DeepSeek](https://www.deepseek.com/). It is designed to handle tasks requiring logical inference, mathematical problem-solving, and real-time decision-making. Notably, DeepSeek-R1 achieves performance comparable to leading Foundation Models across various benchmarks, including math, code, and reasoning tasks. 

The DeepSeek-R1 series includes several variants, each with distinct training methodologies and objectives:

1. **DeepSeek-R1-Zero**: This model was trained entirely through reinforcement learning (RL) without any supervised fine-tuning (SFT). While it developed strong reasoning capabilities, it faced challenges such as less readable outputs and occasional mixing of languages within responses, making it less practical for real-world applications. 


2. **DeepSeek-R1**: To address the limitations of R1-Zero, DeepSeek-R1 was developed using a hybrid approach that combines reinforcement learning with supervised fine-tuning. This method incorporated curated datasets to improve the model's readability and coherence, effectively reducing issues like language mixing and fragmented reasoning. As a result, DeepSeek-R1 is more suitable for practical use. 


3. **DeepSeek-R1 Distilled Models**: These are smaller, more efficient versions of the original DeepSeek-R1 model, created through a process called distillation. Distillation involves training a compact model to replicate the behavior of a larger model, thereby retaining much of its reasoning power while reducing computational demands. DeepSeek has released several distilled models based on different architectures, such as Qwen and Llama, with varying parameter sizes (e.g., 1.5B, 7B, 14B, 32B, and 70B). These distilled models offer a balance between performance and resource efficiency, making them accessible for a wider range of applications. 

The table below lists the non-distilled DeepSeek-R1 model variants:

| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download** | **Suggested Instances for Hosting** |
| :------------: | :------------: | :------------: | :------------: | :------------: | :------------: |
| DeepSeek-R1-Zero | 671B | 37B | 128K   | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero)   | `ml.p5.48xlarge`, `ml.p5e.48xlarge` |
| DeepSeek-R1   | 671B | 37B |  128K   | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1)   | `ml.p5.48xlarge`, `ml.p5e.48xlarge` |

The table below lists the distilled DeepSeek-R1 model variants:

| **Model** | **Base Model** | **Download** | **Suggested Instances for Hosting** |
| :------------: | :------------: | :------------: | :------------: |
| DeepSeek-R1-Distill-Qwen-1.5B  | [Qwen2.5-Math-1.5B](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)   | `ml.g4dn.xlarge`, `ml.g5.xlarge`, `ml.g6.xlarge`, `ml.g6e.xlarge`   |
| DeepSeek-R1-Distill-Qwen-7B  | [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)   | `ml.g5.2xlarge`, `ml.g6.2xlarge`, `ml.g6e.2xlarge` |
| DeepSeek-R1-Distill-Llama-8B  | [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)   | `ml.g5.2xlarge`, `ml.g6.2xlarge`, `ml.g6e.2xlarge`   |
| DeepSeek-R1-Distill-Qwen-14B   | [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)   | `ml.g4dn.12xlarge`, `ml.g5.12xlarge`, `ml.g6.12xlarge`, `ml.g6e.12xlarge`   |
| DeepSeek-R1-Distill-Qwen-32B  | [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)   | `ml.g4dn.12xlarge`, `ml.g5.12xlarge`, `ml.g6.12xlarge`, `ml.g6e.12xlarge` |
| DeepSeek-R1-Distill-Llama-70B  | [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)   | `ml.g5.48xlarge`, `ml.g6.48xlarge`, `ml.g6e.48xlarge`, `ml.p4d.24xlarge`  |

> ⚠ **Warning:** This is not an exhaustive list of compatible instances. For the full list of instance types that SageMaker supports, see https://aws.amazon.com/sagemaker-ai/pricing/
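
As a rough cross-check on the suggested instances, the sketch below estimates the memory needed just to hold a model's weights at a given precision. It is a back-of-the-envelope calculation only: the helper name `weights_gib` and the `BYTES_PER_PARAM` table are illustrative assumptions, and the estimate ignores the KV cache, activations, and framework overhead, so real deployments need additional headroom.

```python
# Back-of-the-envelope sizing: memory for the weights alone.
# Assumption: parameter counts are taken from the model names (e.g. 8B).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weights_gib(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate GiB required to store the weights (no KV cache, no activations)."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes / 2**30


if __name__ == "__main__":
    for name, size in [
        ("DeepSeek-R1-Distill-Qwen-1.5B", 1.5),
        ("DeepSeek-R1-Distill-Llama-8B", 8),
        ("DeepSeek-R1-Distill-Qwen-32B", 32),
        ("DeepSeek-R1-Distill-Llama-70B", 70),
    ]:
        print(f"{name}: ~{weights_gib(size):.1f} GiB of weights in fp16")
```

Comparing the result against the total GPU memory of a candidate instance gives a first indication of whether the model fits and how much room remains for the KV cache.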

In [None]:
%pip install -Uq sagemaker --no-warn-conflicts

## Deploy DeepSeek R1 Distilled Variants

In [None]:
import json
import sagemaker
import boto3
from typing import List, Dict
from datetime import datetime
from sagemaker.huggingface import (
    HuggingFaceModel, 
    get_huggingface_llm_image_uri
)
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

In [None]:
boto_region = boto3.Session().region_name
session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
role = sagemaker.get_execution_role()

## Deploy using DJL-Inference Container

The [Deep Java Library (DJL) Large Model Inference (LMI)](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-container-docs.html) containers are specialized Docker containers designed to facilitate the deployment of large language models (LLMs) on Amazon SageMaker. These containers integrate a model server with optimized inference libraries, providing a comprehensive solution for serving LLMs. 

**Key Features of DJL LMI Containers:**

* __Optimized Inference Performance__: Support for popular model architectures such as DeepSeek, Mistral, Llama, Falcon, and many more.
* __Integration with Inference Libraries__: Seamless integration with libraries such as vLLM, TensorRT-LLM, and Transformers NeuronX.
* __Advanced Capabilities__: Features like continuous batching, token streaming, quantization (e.g., AWQ, GPTQ, FP8), multi-GPU inference using tensor parallelism, and support for LoRA fine-tuned models.

**Benefits for Deploying LLMs with DJL-LMI on Amazon SageMaker:**

* __Simplified Deployment__: DJL LMI containers offer a low-code interface, allowing users to specify configurations like model parallelization and optimization settings through a configuration file. 
* __Performance Optimization__: By leveraging optimized inference libraries and techniques, these containers enhance inference performance, reducing latency and improving throughput.
* __Scalability__: Designed to handle large models that may not fit on a single accelerator, enabling efficient scaling across multiple GPUs or specialized hardware like AWS Inferentia.
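
To make the configuration-driven approach concrete, here is a minimal sketch of a helper that assembles the environment variables for an LMI container, including a tensor-parallel degree for multi-GPU instances. The helper name `build_lmi_env` and its defaults are hypothetical. `OPTION_TENSOR_PARALLEL_DEGREE` reflects our reading of the LMI configuration options and should be confirmed against the documentation for your container version.

```python
from typing import Dict, Union


def build_lmi_env(
    model_id: str,
    tensor_parallel_degree: Union[int, str] = "max",
    max_model_len: int = 10000,
    dtype: str = "fp16",
) -> Dict[str, str]:
    """Assemble LMI environment variables (illustrative helper, not part of any SDK).

    SageMaker expects every environment value to be a string, so numbers are cast.
    """
    return {
        "HF_MODEL_ID": model_id,
        # Assumption: number of GPUs to shard the model across; "max" uses all GPUs.
        "OPTION_TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree),
        "OPTION_MAX_MODEL_LEN": str(max_model_len),
        "OPTION_ROLLING_BATCH": "auto",
        "OPTION_DTYPE": dtype,
    }


if __name__ == "__main__":
    # Example: shard a larger distilled variant across 4 GPUs.
    env = build_lmi_env("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", tensor_parallel_degree=4)
    for key, value in env.items():
        print(f"{key}={value}")
```

The resulting dictionary can be passed as the `env` argument when creating a `sagemaker.Model`.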

In [None]:
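# You can also retrieve the inference image URI programmatically with sagemaker.image_uris.retrieve:
# djllmi_inference_image_uri = sagemaker.image_uris.retrieve(
#     framework="djl-lmi", 
#     region=boto_region, 
#     version="0.31.0"
# )
# The image must be pulled from the same region as the endpoint, so the region is not hardcoded.
# The account ID below hosts the AWS Deep Learning Containers in most commercial regions; verify it for yours.
djllmi_inference_image_uri = f"763104351884.dkr.ecr.{boto_region}.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"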

Choose an appropriate model name and endpoint name when hosting your model.

In [None]:
model_name_lmi = f"deepseek-r1-distil-llama8b-lmi-{datetime.now().strftime('%y%m%d-%H%M%S')}"
endpoint_name_lmi = f"{model_name_lmi}-ep"

Create a new [SageMaker Model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html)

> ⚠ Swap `HF_MODEL_ID: deepseek-ai/DeepSeek-R1-Distill-Llama-8B` with another DeepSeek Distilled Variant if you prefer to deploy a different dense model. Optionally, you can include `HF_TOKEN: "hf_..."` for gated models.

In [None]:
deepseek_lmi_model = sagemaker.Model(
    image_uri=djllmi_inference_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "OPTION_MAX_MODEL_LEN": "10000",
        "OPTION_GPU_MEMORY_UTILIZATION": "0.95",
        "OPTION_ENABLE_STREAMING": "false",
        "OPTION_ROLLING_BATCH": "auto",
        "OPTION_MODEL_LOADING_TIMEOUT": "3600",
        "OPTION_PAGED_ATTENTION": "false",
        "OPTION_DTYPE": "fp16",
    },
    role=role,
    name=model_name_lmi,
    sagemaker_session=session
)

🚀 Deploy. Please wait for the endpoint to be `InService` before running inference against it!

In [None]:
pretrained_lmi_predictor = deepseek_lmi_model.deploy(
    endpoint_name=endpoint_name_lmi,
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    #wait=False
)
print(f"Your DJL-LMI Model Endpoint: {endpoint_name_lmi} is now deployed! 🚀")
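
If you deploy with `wait=False`, the call returns immediately and you need to check the endpoint status yourself. The sketch below polls `describe_endpoint` until the endpoint reaches `InService`. The function name `wait_until_in_service` and its default intervals are illustrative assumptions; the client is passed in so the function stays easy to test.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll a SageMaker endpoint until it is InService (illustrative helper).

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Example usage (requires AWS credentials and an existing endpoint):
# import boto3
# wait_until_in_service(boto3.client("sagemaker"), endpoint_name_lmi)
```

boto3 also ships a built-in `endpoint_in_service` waiter on the SageMaker client, which may be preferable when you do not need custom error handling.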

### Inference with SageMaker SDK

The SageMaker Python SDK simplifies inference through the `sagemaker.Predictor` class.

The `DeepSeek-R1-Distill-Llama-8B` variant is built on Llama 3.1 8B, and this notebook formats prompts using the Llama 3.1 prompt format shown below:


```text
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2024
Today Date: 29 Jan 2025

You are a helpful assistant that thinks and reasons before answering.

<|eot_id|>
<|start_header_id|>user<|end_header_id|>
How many R are in STRAWBERRY? Keep your answer and explanation short!
<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
```

In [None]:
pretrained_lmi_predictor = sagemaker.Predictor(
     endpoint_name=endpoint_name_lmi,
     sagemaker_session=session,
     serializer=JSONSerializer(),
     deserializer=JSONDeserializer(),
)

In [None]:
def format_messages(messages: List[Dict[str, str]]) -> str:
    """
    Format messages into a single prompt string for Llama 3+ chat models.
    
    The model only supports 'system', 'user' and 'assistant' roles, starting with 'system', then 'user' and 
    alternating (u/a/u/a/u...). The last message must be from 'user'.
    """
    output = "<|begin_of_text|>"
    # Prefix injected at the start of the system message (knowledge cutoff and today's date)
    system_prefix = f"\n\nCutting Knowledge Date: December 2024\nToday Date: {datetime.now().strftime('%d %b %Y')}\n\n"
    for entry in messages:
        output += f"<|start_header_id|>{entry['role']}<|end_header_id|>"
        if entry['role'] == 'system':
            output += f"{system_prefix}{entry['content']}<|eot_id|>"
        elif 'content' in entry:
            output += f"\n\n{entry['content']}<|eot_id|>"
    # Open the assistant turn so that the model generates the reply
    output += "<|start_header_id|>assistant<|end_header_id|>\n"
    return output


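def send_prompt(messages, parameters):
    """Format the chat messages into a prompt and invoke the DJL-LMI endpoint."""
    # Convert the list of role/content messages into a single prompt string
    frmt_input = format_messages(messages)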
    payload = {
        "inputs": frmt_input,
        "parameters": parameters
    }
    response = pretrained_lmi_predictor.predict(payload)
    return response
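
The docstring of `format_messages` describes the expected role ordering, but the function itself does not enforce it. A minimal sketch of a check you could run before sending a prompt is shown below; the name `validate_messages` is hypothetical, and the rules are taken directly from that docstring.

```python
from typing import Dict, List


def validate_messages(messages: List[Dict[str, str]]) -> None:
    """Raise ValueError unless messages follow system, user, assistant, user, ... ending on user."""
    if not messages:
        raise ValueError("messages must not be empty")
    if messages[0].get("role") != "system":
        raise ValueError("the first message must have role 'system'")

    turns = messages[1:]
    if not turns:
        raise ValueError("at least one 'user' message is required after the system message")

    # After the system message, roles must alternate starting with 'user'.
    for index, message in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected:
            raise ValueError(
                f"message {index + 1} has role {message.get('role')!r}, expected {expected!r}"
            )

    # With strict alternation, an even number of turns means the last one is 'assistant'.
    if turns[-1]["role"] != "user":
        raise ValueError("the last message must have role 'user'")
```

Calling `validate_messages(messages)` at the top of `send_prompt` would surface malformed transcripts before they reach the endpoint.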

We can keep using a simple `List[Dict[str, str]]` of `system`, `user` and `assistant` messages as the chat transcript; `format_messages` converts it into the prompt format shown earlier.

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks and reasons before answering."},
    {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}
]
response_deepseek_lmi = send_prompt(
    messages, 
    parameters={
        "temperature": 0.6, 
        "max_new_tokens": 512
    }
)

Print the response:

In [None]:
print(response_deepseek_lmi['generated_text'])
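
DeepSeek-R1 models typically emit their chain of thought between `<think>` and `</think>` tags before the final answer. Assuming the generated text follows that convention, the sketch below separates the reasoning from the answer. The helper name `split_reasoning` is illustrative, and the function falls back to returning the whole text as the answer when no closing tag is present.

```python
from typing import Tuple


def split_reasoning(text: str) -> Tuple[str, str]:
    """Split generated text into (reasoning, answer) around the </think> tag."""
    head, separator, tail = text.partition("</think>")
    if not separator:
        # No reasoning block found: treat everything as the answer.
        return "", text.strip()
    # Drop the opening tag if present; some outputs omit it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    sample = "<think>\nS-T-R-A-W-B-E-R-R-Y: R appears three times.\n</think>\n\nThere are 3 Rs."
    reasoning, answer = split_reasoning(sample)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

With a real response, you would pass `response_deepseek_lmi['generated_text']` instead of the sample string.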

## Deploy using HuggingFace TGI Container

Hugging Face Large Language Model (LLM) Inference Deep Learning Container (DLC) on Amazon SageMaker enables developers to efficiently deploy and serve open-source LLMs at scale. This DLC is powered by Text Generation Inference (TGI), an open-source, purpose-built solution optimized for high-performance text generation tasks. 

**Key Features of HuggingFace TGI Containers:**

* **Tensor Parallelism**: Distributes computation across multiple GPUs, allowing the deployment of large models that exceed the memory capacity of a single GPU.
* **Dynamic Batching**: Aggregates multiple incoming requests into a single batch, enhancing throughput and resource utilization.
* **Optimized Transformers Code**: Utilizes advanced techniques like flash-attention to improve inference speed and efficiency for popular model architectures like DeepSeek, Llama, Falcon, Mistral, Mixtral and many more.

**Benefits for Deploying LLMs with HuggingFace TGI on Amazon SageMaker:**

* **Simplified Deployment**: TGI containers provide a low-code interface, allowing users to specify configurations like model parallelization and optimization settings through straightforward configuration files. 
* **Performance Optimization**: By leveraging optimized inference libraries and techniques, such as tensor parallelism and dynamic batching, these containers enhance inference performance, reducing latency and improving throughput. 
* **Scalability**: Designed to handle large models, TGI containers enable efficient scaling across multiple GPUs or specialized hardware like AWS Inferentia, ensuring that even the most demanding models can be deployed effectively. 

Choose an appropriate model name and endpoint name when hosting your model.

In [None]:
model_name_tgi = f"deepseek-r1-distil-llama8b-tgi-{datetime.now().strftime('%y%m%d-%H%M%S')}"
endpoint_name_tgi = f"{model_name_tgi}-ep"

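The next cell retrieves the TGI container image URI for a specific version. For a more exhaustive list of available TGI container versions, please refer to the [TGI Release Page](https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+gpu&expanded=true).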

In [None]:
tgi_inference_image_uri = get_huggingface_llm_image_uri(
     "huggingface", 
     version="2.3.1"
)
print(f"Using TGI Image: {tgi_inference_image_uri}")

Create a new [SageMaker HuggingFaceModel](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html)

> ⚠ Swap `HF_MODEL_ID: deepseek-ai/DeepSeek-R1-Distill-Llama-8B` with another DeepSeek Distilled Variant if you prefer to deploy a different dense model. Optionally, you can include `HF_TOKEN: "hf_..."` for gated models.

In [None]:
deepseek_tgi_model = HuggingFaceModel(
    image_uri=tgi_inference_image_uri,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "ENDPOINT_SERVER_TIMEOUT": "3600",
        "MESSAGES_API_ENABLED": "true",
        "OPTION_ENTRYPOINT": "inference.py",
        "SAGEMAKER_ENV": "1",
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "SAGEMAKER_PROGRAM": "inference.py",
        "SM_NUM_GPUS": "1",
        "MAX_TOTAL_TOKENS": "8192",
        "MAX_INPUT_TOKENS": "7168",
        "MAX_BATCH_PREFILL_TOKENS": "7168",
        "DTYPE": "bfloat16",
        "PORT": "8080"
    },
    role=role,
    name=model_name_tgi,
    sagemaker_session=session
)
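
The three token limits in the environment above are related: to our understanding, TGI expects `MAX_INPUT_TOKENS` to be strictly smaller than `MAX_TOTAL_TOKENS`, and `MAX_BATCH_PREFILL_TOKENS` to be at least `MAX_INPUT_TOKENS`. The sketch below checks these relationships before deployment. The helper name `check_tgi_token_limits` is hypothetical, and the constraints should be confirmed against the TGI launcher documentation for your version.

```python
from typing import Dict


def check_tgi_token_limits(env: Dict[str, str]) -> int:
    """Validate TGI token limits and return the tokens left for generation
    when a prompt uses the full input allowance (illustrative helper)."""
    max_input = int(env["MAX_INPUT_TOKENS"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    max_prefill = int(env["MAX_BATCH_PREFILL_TOKENS"])

    if max_input >= max_total:
        raise ValueError("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")
    if max_prefill < max_input:
        raise ValueError("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_TOKENS")
    return max_total - max_input


if __name__ == "__main__":
    limits = {
        "MAX_TOTAL_TOKENS": "8192",
        "MAX_INPUT_TOKENS": "7168",
        "MAX_BATCH_PREFILL_TOKENS": "7168",
    }
    print(f"Generation budget at full input length: {check_tgi_token_limits(limits)} tokens")
```

Running such a check locally is cheaper than discovering a misconfiguration after the container fails its startup health check.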

🚀 Deploy. Please wait for the endpoint to be `InService` before running inference against it!

In [None]:
pretrained_tgi_predictor = deepseek_tgi_model.deploy(
    endpoint_name=endpoint_name_tgi,
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,
    #wait=False
)

### Inference with SageMaker SDK

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant that thinks and reasons before answering."},
    {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}
]

response_deepseek_tgi = pretrained_tgi_predictor.predict(
    {
        "messages": messages,
        "max_tokens": 1024,
        "temperature": 0.6
    }
)

In [None]:
print(response_deepseek_tgi["choices"][0]["message"]["content"])