
# Deploying DeepSeek-V3-0324 on SageMaker

This notebook demonstrates how to deploy and run inference with DeepSeek-V3-0324, the model DeepSeek released on March 24, 2025. At launch, it was reported to be the best-performing non-reasoning model.

## Environment Setup

First, we'll upgrade the SageMaker SDK to ensure compatibility with the latest features, particularly those needed for large language model deployment and streaming inference.

> **Note**: The `--quiet` and `--no-warn-conflicts` flags are used to minimize unnecessary output while installing dependencies.

> ⚠️ **Important**: After running the installation cell below, you may need to restart your notebook kernel so that the updated packages are loaded.


In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts
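
After the upgrade (and a kernel restart, if one was needed), it can help to confirm which SDK version the kernel actually resolves. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of the SageMaker SDK.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version the current kernel sees, or None if the package is missing.
print("sagemaker:", installed_version("sagemaker"))
```

If the printed version is older than expected, the kernel is probably still holding the previous installation and needs a restart.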

# Deploying and Interacting with the DeepSeek-V3-0324 LLM on SageMaker

This notebook demonstrates how to deploy and interact with the DeepSeek-V3-0324 language model using Amazon SageMaker. We'll cover:

1. Setting up SageMaker resources and permissions
2. Deploying the model using SageMaker LMI (the Large Model Inference container, powered by vLLM)
3. Implementing a streaming chat interface


## Setup SageMaker Environment

First, we'll import the necessary libraries and initialize our SageMaker session. This includes:
- `boto3` for AWS API interactions
- `sagemaker` SDK for model deployment and management
- Setting up IAM roles and session objects

The code below establishes these basic requirements:

In [None]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs

## Configure Model Container and Instance

For deploying DeepSeek-V3-0324, we'll use:
- **LMI (Large Model Inference) container**: A DJL (Deep Java Library) Serving-based container optimized for large language model inference
- **P5e/P5en instance**: NVIDIA H200-based GPU instances suited to large model inference

Key configurations:
- The container URI points to the DJL inference container in ECR (Elastic Container Registry)
- We use `ml.p5en.48xlarge` or `ml.p5e.48xlarge` instances which offer:
  - 8 NVIDIA H200 GPUs
  - 1128 GB of total GPU memory (141 GB per GPU)
  - High network bandwidth for optimal inference performance

> **Note**: The region in the container URI should match your AWS region. Replace `us-east-2` with your region if different.

In [None]:
# Define region where you have capacity
REGION = 'us-east-2'  

# Select the latest container version. The list of available images is at:
# https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers
CONTAINER_VERSION = '0.32.0-lmi14.0.0-cu126'

# Construct container URI
container_uri = f'763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}'

# Select instance type
instance_type = "ml.p5en.48xlarge"  # Alternative: "ml.p5e.48xlarge"

# Validate region and print configuration
if REGION != sess.boto_region_name:
    print(f"⚠️ Warning: Container region ({REGION}) differs from session region ({sess.boto_region_name})")
else:
    print(f"✅ Region validation passed: {REGION}")
    
print(f"📦 Container URI: {container_uri}")
print(f"🖥️ Instance Type: {instance_type}")

## Configure Model Serving Properties

Now we'll create a `serving.properties` file that configures how the model will be served. This configuration is crucial for optimal performance and memory utilization.

Key configurations explained:
- **Engine**: Python backend for model serving
- **Model Settings**:
  - Using DeepSeek-V3-0324 model from Hugging Face
  - Maximum sequence length of 32768 tokens
- **Performance Optimizations**:
  - Tensor parallelism across all available GPUs
  - 87% GPU memory utilization target
  - vLLM rolling batch with max size of 16 for efficient batching
  
### Understanding KV Cache and Context Window

The `max_model_len` parameter controls the maximum sequence length the model can handle, which directly affects the size of the KV (Key-Value) cache in GPU memory. For P5 instances, you can progressively increase this value to find the optimal balance:

1. Start with a conservative value (current: 32768)
2. Monitor GPU memory usage
3. Incrementally increase if memory permits
4. Target the model's full context window 
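
To make the memory trade-off concrete, the sketch below estimates KV cache size for a conventional attention layout (one key and one value vector per KV head, per layer, per token). The layer count, head count, and head dimension are illustrative placeholders and are not DeepSeek-V3's real architecture; DeepSeek-V3 uses multi-head latent attention, which compresses the cache, so actual usage will differ. The point is only that the cache grows linearly with `max_model_len` and with the number of concurrent sequences.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype_bytes=2):
    """Estimate KV cache size in bytes for a standard (non-compressed) attention layout."""
    # Factor of 2: one key vector and one value vector per head, per layer, per token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return bytes_per_token * seq_len * batch_size


GIB = 1024 ** 3

# Hypothetical model shape (NOT DeepSeek-V3): 32 layers, 8 KV heads, head_dim 128, 16-bit cache.
for max_model_len in (8192, 32768, 131072):
    size = kv_cache_bytes(32, 8, 128, seq_len=max_model_len, batch_size=16)
    print(f"max_model_len={max_model_len:>6}: {size / GIB:.0f} GiB for 16 full-length sequences")
```

Quadrupling `max_model_len` quadruples the worst-case cache, which is why increasing it in steps while watching GPU memory is a reasonable approach.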

In [None]:
%%writefile serving.properties
engine=Python
option.trust_remote_code=True
option.tensor_parallel_degree=max
option.gpu_memory_utilization=.87
option.max_model_len=32768
option.model_id=deepseek-ai/DeepSeek-V3-0324
option.max_rolling_batch_size=16
option.rolling_batch=vllm
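
Because a typo in `serving.properties` only surfaces after a long deployment attempt, a quick local sanity check can save time. The sketch below parses the `key=value` format and applies a few basic checks. Both helper names and the specific checks are assumptions made for illustration; they are not the validation that the LMI container itself performs.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Malformed line (no '='): {raw!r}")
        props[key.strip()] = value.strip()
    return props


def check_serving_properties(props):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in ("engine", "option.model_id"):
        if key not in props:
            problems.append(f"missing {key}")
    util = props.get("option.gpu_memory_utilization")
    if util is not None and not 0 < float(util) <= 1:
        problems.append("option.gpu_memory_utilization must be in (0, 1]")
    max_len = props.get("option.max_model_len")
    if max_len is not None and int(max_len) <= 0:
        problems.append("option.max_model_len must be positive")
    return problems


# Same settings as the cell above, inlined so the example is self-contained.
sample = """engine=Python
option.trust_remote_code=True
option.tensor_parallel_degree=max
option.gpu_memory_utilization=.87
option.max_model_len=32768
option.model_id=deepseek-ai/DeepSeek-V3-0324
option.max_rolling_batch_size=16
option.rolling_batch=vllm
"""
print(check_serving_properties(parse_properties(sample)))
```

In the notebook, you could read the real file with `open("serving.properties").read()` instead of the inlined `sample` string.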

## Configure vLLM Requirements

(Optional) A `requirements.txt` file can specify the vLLM version used for model inference. The vLLM inference framework provides optimized serving capabilities. This notebook does not create one; the packaging step below includes only `serving.properties`.

### Version Considerations
- **vLLM 0.7.1**: Currently specified stable version

### Performance Impact
Different vLLM versions can affect:
- Inference speed
- Memory utilization
- Batch processing efficiency
- Compatibility with other libraries

## Package Model Artifacts

Now we'll create a deployment package containing our configuration files. This involves:
1. Creating a model directory
2. Moving configuration files into it
3. Creating a compressed tarball for SageMaker deployment

> **Note**: SageMaker expects model artifacts in a compressed format (`.tar.gz`) with a specific structure for deployment.

In [None]:
%%sh
mkdir -p mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
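
The same packaging step can be sketched in Python with the standard `tarfile` module, which also makes it easy to list the archive contents and confirm the layout before uploading. `package_model_dir` is a hypothetical helper name, and the demo runs in a temporary directory so it does not touch the notebook's real files.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(model_dir, archive_path):
    """Create a .tar.gz of model_dir (kept as the top-level folder) and return its member names."""
    model_dir = Path(model_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the folder name, mirroring `tar czvf mymodel.tar.gz mymodel/`
        tar.add(str(model_dir), arcname=model_dir.name)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()


# Demo in a throwaway directory with a placeholder config file.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "mymodel"
    demo_dir.mkdir()
    (demo_dir / "serving.properties").write_text("engine=Python\n")
    print(package_model_dir(demo_dir, Path(tmp) / "mymodel.tar.gz"))
```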

## Upload Model Artifacts to S3

Before deploying to SageMaker, we need to upload our model artifacts to Amazon S3. This process:
1. Determines the S3 bucket location (using SageMaker default bucket)
2. Defines a prefix path for organization
3. Uploads the packaged model artifacts

> **Note**: The default SageMaker bucket follows the naming pattern: `sagemaker-{region}-{account-id}`

In [None]:
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
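
`upload_data` returns the S3 URI of the uploaded object, which is what gets passed to the model as `model_data`. If you need the bucket and key separately (for example, to inspect the object in the console), a small helper can split the URI. This is a sketch; `split_s3_uri` is a hypothetical helper, and the URI below uses a placeholder account ID.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/parts' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI following the default-bucket naming pattern; the account ID is made up.
example_uri = "s3://sagemaker-us-east-2-111122223333/large-model-lmi/code/mymodel.tar.gz"
bucket_name, object_key = split_s3_uri(example_uri)
print(bucket_name)
print(object_key)
```

In the notebook, you would pass `code_artifact` instead of `example_uri`.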

## Create SageMaker Model

Now we'll create a SageMaker Model object that combines our:
- Container image (LMI)
- Model artifacts (configuration files)
- IAM role (for permissions)

This step defines the model configuration but doesn't deploy it yet. The Model object represents the combination of:

1. **Container Image** (`image_uri`): DJL Inference optimized for LLMs
2. **Model Data** (`model_data`): Our configuration files in S3
3. **IAM Role** (`role`): Permissions for model execution

### Required Permissions
The IAM role needs:
- S3 read access for model artifacts
- CloudWatch permissions for logging
- ECR permissions to pull the container
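
For orientation, the three permission groups above might map onto IAM actions roughly as follows. This is an illustrative sketch only, not a complete or least-privilege policy, and not necessarily what your execution role contains; the bucket name is a placeholder and `build_example_policy` is a hypothetical helper.

```python
import json


def build_example_policy(bucket_name):
    """Build an illustrative IAM policy document covering S3 reads, CloudWatch logs, and ECR pulls."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # S3 read access for model artifacts
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            },
            {   # CloudWatch Logs permissions for endpoint logging
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
            {   # ECR permissions to pull the inference container
                "Effect": "Allow",
                "Action": [
                    "ecr:GetAuthorizationToken",
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": "*",
            },
        ],
    }


# Placeholder bucket name; substitute the bucket that holds your artifacts.
print(json.dumps(build_example_policy("my-example-bucket"), indent=2))
```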

In [None]:
model = Model(image_uri=container_uri,
              model_data=code_artifact,
              role=role,)

## Deploy Model to SageMaker Endpoint

Now we'll deploy our model to a SageMaker endpoint for real-time inference. This is a significant step that:
1. Provisions the specified compute resources (P5 instance)
2. Deploys the model container
3. Sets up the endpoint for API access

### Deployment Configuration
- **Instance Count**: 1 instance for single-node deployment
- **Instance Type**: `ml.p5en.48xlarge` for high-performance inference
- **Health Check Timeout**: 2800 seconds (≈47 minutes)
  - Extended timeout needed for large model loading
  - Includes time for container setup and model initialization

> ⚠️ **Important**: 
> - Deployment can take 30-45 minutes for large models
> - Monitor the CloudWatch logs for progress

In [None]:
endpoint_name = sagemaker.utils.name_from_base("DeepSeek-V3")

print(endpoint_name)
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=2800,
)
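
`model.deploy(...)` blocks until the endpoint is ready by default. If you deploy with `wait=False`, or want to check on the endpoint from another session, you can poll its status instead. The sketch below assumes a boto3 SageMaker client is passed in; `wait_for_endpoint` and its parameters are hypothetical names, while `describe_endpoint`, `EndpointStatus`, and `FailureReason` come from the SageMaker API.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60, timeout_seconds=2800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService, fails, or the timeout elapses."""
    waited = 0
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        print(f"[{waited:>5}s] {endpoint_name}: {status}")
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint failed: {desc.get('FailureReason', 'unknown reason')}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint not InService after {timeout_seconds}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in the notebook (requires AWS credentials, so it is left commented out here):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```

The `sleep` parameter exists only so the loop can be exercised without real waiting; in normal use the default is fine.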

## Implement Streaming Chat Interface

This section implements a streaming chat interface for real-time interaction with the DeepSeek-V3-0324 model. The implementation includes:

1. **Streaming Infrastructure**:
   - Custom `LineIterator` for efficient stream processing
   - Real-time token processing
   - Performance monitoring (tokens per second)

2. **Chat Formatting**:
   - DeepSeek-specific chat template
   - Chat history management
   - Special token handling

3. **Performance Features**:
   - Live response streaming
   - Token speed monitoring
   - Memory-efficient processing

### Key Components

#### Chat Template Format
`<｜begin▁of▁sentence｜><｜User｜>{user_message}<｜Assistant｜>{assistant_response}`


#### Streaming Parameters
- `max_new_tokens`: 8192 (default)
- `do_sample`: True for sampling-based generation
- Real-time TPS (Tokens Per Second) monitoring

In [None]:
import io
import json
import time
import boto3
from IPython.display import clear_output

# SageMaker Runtime client
smr_client = boto3.client("sagemaker-runtime")
# Replace with your SageMaker endpoint name if needed
#endpoint_name = "DeepSeek-V3-2025-03-26-05-10-49-961"

class LineIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                return line[:-1]
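            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                # Stream ended: return any trailing bytes that lack a final newline,
                # instead of looping forever on an exhausted iterator
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    self.buffer.seek(self.read_pos)
                    line = self.buffer.read()
                    self.read_pos += len(line)
                    return line
                raise
            if "PayloadPart" not in chunk:
                # chunk is a dict, so print it as a separate argument rather than concatenating
                print("Unknown event type:", chunk)
                continue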
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])

def format_deepseek_chat_template(user_input, chat_history=None):
    """
    Format input according to the DeepSeek chat template
    
    Args:
    - user_input (str): Current user message
    - chat_history (list, optional): Previous conversation turns
    
    Returns:
    - str: Formatted chat input with special tokens
    """
    # Start with the beginning of sentence token
    formatted_input = "<｜begin▁of▁sentence｜>"
    
    # Add chat history if provided
    if chat_history:
        for turn in chat_history:
            formatted_input += f"<｜User｜>{turn['user']}<｜Assistant｜>{turn['assistant']}"
    
    # Add current user input
    formatted_input += f"<｜User｜>{user_input}<｜Assistant｜>"
    
    return formatted_input

def stream_chat_response(endpoint_name, inputs, max_new_tokens=8192, chat_history=None):
    # Format the input (plus any previous turns) using the DeepSeek chat template
    formatted_inputs = format_deepseek_chat_template(inputs, chat_history)
    
    body = {
        "inputs": formatted_inputs,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
        },
        "stream": True,
    }

    resp = smr_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )

    event_stream = resp["Body"]
    start_json = b"{"
    full_response = ""
    start_time = time.time()
    token_count = 0

    for line in LineIterator(event_stream):
        if line != b"" and start_json in line:
            # Parse the JSON object that starts at the first "{" on the line
            data = json.loads(line[line.find(start_json):].decode("utf-8"))
            token_text = data["token"]["text"]
            full_response += token_text
            token_count += 1

            # Calculate tokens per second
            elapsed_time = time.time() - start_time
            tps = token_count / elapsed_time if elapsed_time > 0 else 0

            # Clear the output and reprint everything
            clear_output(wait=True)
            print("Bot:", full_response)
            print(f"\nTokens per Second: {tps:.2f}", end="")

    print("\n") # Add a newline after response is complete
    return full_response

def chat(endpoint_name):
    print("Welcome to the SageMaker Streaming Chat! Type 'exit' to quit.")
    chat_history = []
    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == "exit":
            break
        # Pass the accumulated history so the model sees earlier turns
        bot_response = stream_chat_response(
            endpoint_name, user_input, chat_history=chat_history
        )
        
        # Update chat history
        chat_history.append({
            'user': user_input,
            'assistant': bot_response
        })


# Start the chat
chat(endpoint_name)
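
The streaming loop above assumes every non-empty line carries a JSON object with a `token.text` field. A more defensive variant of that per-line decoding might look like the sketch below. The sample lines are hypothetical, written to resemble that shape rather than captured from a real endpoint, and `extract_token_text` is a hypothetical helper name.

```python
import json
from typing import Optional


def extract_token_text(line: bytes) -> Optional[str]:
    """Return the token text carried by one streamed line, or None if there is none."""
    start = line.find(b"{")
    if start == -1:
        return None
    try:
        data = json.loads(line[start:].decode("utf-8"))
    except ValueError:
        # Covers both invalid JSON and invalid UTF-8
        return None
    if not isinstance(data, dict):
        return None
    token = data.get("token")
    if not isinstance(token, dict):
        return None
    return token.get("text")


# Hypothetical stream contents: two token lines, an empty line, and a line without a token.
sample_lines = [
    b'{"token": {"id": 1, "text": "Hello"}}',
    b"",
    b'data: {"token": {"text": " world"}}',
    b'{"generated_text": "Hello world"}',
]
print("".join(t for t in map(extract_token_text, sample_lines) if t))
```

Lines that do not match the expected shape are skipped rather than raising a `KeyError`, which keeps a long-running chat session alive if the endpoint emits a different final message.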