# Deploy LG AI EXAONE Deep 7.8B on Amazon SageMaker AI with SGLang

❗This notebook works well on an `ml.g5.xlarge` instance with 50 GB of disk space and the `PyTorch 2.2.0 Python 3.10 CPU optimized kernel` from **SageMaker Studio Classic** or the `Python3 kernel` from **JupyterLab**.

SageMaker provides [pre-built SageMaker AI Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) that help you get started quickly with model inference. You can also [bring your own Docker container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html) and use it inside SageMaker AI for training and inference. To be compatible with SageMaker AI, your container must have the following characteristics:

- Your container must have a web server listening on port `8080`.
- Your container must accept POST requests on the `/invocations` real-time endpoint and respond to health-check requests (HTTP GET) on `/ping`.
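
To make the two requirements above concrete, here is a minimal sketch of a web server that answers `/ping` and `/invocations` using only the Python standard library. It is not the server inside the SGLang container; the names `route`, `SketchHandler`, and `serve`, and the echo-style response, are hypothetical and only illustrate the contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def route(method, path, body=b""):
    """Map a request to (status_code, response_dict). Hypothetical helper."""
    if path == "/ping" and method in ("GET", "POST"):
        # Health check: any 200 response tells SageMaker the container is up.
        return 200, {"status": "healthy"}
    if path == "/invocations" and method == "POST":
        payload = json.loads(body or b"{}")
        # Placeholder: a real container would forward the payload to the model server.
        return 200, {"echo": payload}
    return 404, {"error": f"no route for {method} {path}"}


class SketchHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around route()."""

    def _respond(self, method):
        length = int(self.headers.get("Content-Length") or 0)
        status, payload = route(method, self.path, self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._respond("GET")

    def do_POST(self):
        self._respond("POST")


def serve(port=8080):
    """Listen on the port SageMaker expects (blocks until interrupted)."""
    HTTPServer(("0.0.0.0", port), SketchHandler).serve_forever()
```

Calling `serve()` starts the listener on port `8080`. Keeping the routing in a plain function makes the contract easy to check without opening a socket.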

In this notebook, we'll demonstrate how to adapt the [SGLang](https://github.com/sgl-project/sglang) framework to run on SageMaker AI endpoints. SGLang is a serving framework for large language models that provides state-of-the-art performance, including a fast backend runtime for efficient serving with RadixAttention, extensive model support, and an active open-source community. For more information, refer to [https://docs.sglang.ai/index.html](https://docs.sglang.ai/index.html) and [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang).

By using SGLang and building a custom Docker container, you can run advanced AI models like the [EXAONE Deep 7.8B](https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-7.8B) on a SageMaker AI endpoint.

### Set up Environment

In [None]:
%%capture --no-stderr

!pip install -U pip
!pip install -U "sagemaker>=2.237.3"
!pip install -U "transformers>=4.47.0"
!pip install -U "accelerate>=0.26.0"
!pip install -U huggingface-hub==0.26.2
!pip install -U sagemaker-studio-image-build==0.6.0

In [None]:
!pip freeze | grep -E "accelerate|huggingface_hub|sagemaker|torch|transformers"

### Prepare the SGLang SageMaker container

In [None]:
DOCKER_IMAGE = "sglang-sagemaker"
DOCKER_IMAGE_TAG = "latest"

[sm-docker](https://github.com/aws-samples/sagemaker-studio-image-build-cli) is a CLI for building Docker images in SageMaker Studio using AWS CodeBuild.

In [None]:
%%time

!cd ../container && sm-docker build . \
  --repository {DOCKER_IMAGE}:{DOCKER_IMAGE_TAG} \
  --build-arg BASE_IMAGE='lmsysorg/sglang:v0.4.4.post1-cu125'

### Create SageMaker AI endpoint for EXAONE Deep 7.8B model

In this example, we will download the model from Hugging Face and upload it to Amazon S3.

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

model_dir = Path('model')
model_dir.mkdir(exist_ok=True)

model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B"
snapshot_download(model_id, local_dir=model_dir)

In [None]:
import boto3
import sagemaker

region = boto3.Session().region_name
sess = sagemaker.Session()
bucket = sess.default_bucket()

region, bucket

In [None]:
model_name = "LGAI-EXAONE/EXAONE-Deep-7.8B"

base_name = model_name.split('/')[-1].replace('.', '-').lower()
base_name

In [None]:
!aws s3 cp model/ s3://{bucket}/{base_name}/ --recursive

In [None]:
model_data = f"s3://{bucket}/{base_name}/"
model_data

In [None]:
import sagemaker
from sagemaker.session import Session

session = Session()
region = session.boto_region_name  # public attribute; avoids the private _region_name
role = sagemaker.get_execution_role()

ecr_uri = f'{session.account_id()}.dkr.ecr.{region}.amazonaws.com/{DOCKER_IMAGE}:{DOCKER_IMAGE_TAG}'
ecr_uri

Next, we create the [SageMaker model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) from the custom Docker image and the model data stored in S3.

In [None]:
from sagemaker.model import Model
from sagemaker.predictor import Predictor


model = Model(
    model_data={
        "S3DataSource": {
            "S3Uri": model_data,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        },
    },
    role=role,
    image_uri=ecr_uri,
    env={
        'TENSOR_PARALLEL_DEGREE': '1', # ml.g5.2xlarge
        # 'TENSOR_PARALLEL_DEGREE': '8' # ml.g5.48xlarge
    },
    predictor_cls=Predictor
)
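
The `model_data` dictionary above points SageMaker at an uncompressed S3 prefix rather than a `model.tar.gz` archive. As a minimal sketch, the helper below builds the same structure from a prefix and normalizes the trailing slash, on the assumption that the prefix should end with `/` as it does in this notebook. The function name `uncompressed_model_data` is hypothetical.

```python
def uncompressed_model_data(s3_prefix: str) -> dict:
    """Build the model_data dict for an uncompressed S3 prefix (hypothetical helper)."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError(f"expected an s3:// URI, got {s3_prefix!r}")
    if not s3_prefix.endswith("/"):
        # Assumption: treat the URI as a key prefix, so terminate it with a slash.
        s3_prefix += "/"
    return {
        "S3DataSource": {
            "S3Uri": s3_prefix,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        },
    }
```

For example, `uncompressed_model_data("s3://my-bucket/exaone-deep-7-8b")` (a made-up bucket name) returns a dictionary whose `S3Uri` ends with a slash.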

In [None]:
%%time

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer


instance_type = 'ml.g5.2xlarge' # you can also change to ml.g5.48xlarge or ml.p4d.24xlarge

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

### Invoke endpoint with SageMaker Python SDK

In [None]:
response = predictor.predict({
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature': 0.6,
    'max_new_tokens': 32768,
    'do_sample': True,
    'top_p': 0.95,
})

print(response['choices'][0]['message']['content'])

Okay, so I need to list three countries and their capitals. Let me start by recalling some countries I know. The first one that comes to mind is Germany. I think their capital is Berlin. Let me confirm that. Yeah, I'm pretty sure Berlin is Germany's capital. 

Next, maybe Japan? Their capital is Tokyo. Wait, but sometimes people get confused between Tokyo and another city. Let me think. No, Tokyo is definitely the capital of Japan. Although I remember there was a recent move, but I think the capital is still Tokyo. Maybe there's a debate, but I'll go with Tokyo for now.

Third country... How about Brazil? Their capital is Brasília. Right, because Rio de Janeiro was the capital before, but they moved it to Brasília to develop the country inland. So Brasília is correct. 

Wait, should I check if there's any other common countries people might ask about? Maybe France? Paris is the capital, but the user asked for three, so maybe stick with the first three I thought of. Let me make sure I d

### Streaming response from the endpoint

SGLang also lets you invoke the endpoint and receive a streaming response. The example below shows how to consume the stream.

In [None]:
import io
import json
from sagemaker.iterators import BaseIterator
from sagemaker.iterators import handle_stream_errors


class TokenIterator(BaseIterator):
    def __init__(self, event_stream):
        super().__init__(event_stream)
        self.byte_iterator = iter(self.event_stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        r"""Returns the next content token parsed from the event stream.

        The output of the event stream will be in the following format:

        ```
        b'data: {"id":"2d81e745f32e46879c2e6bf28171570f","object":"chat.completion.chunk","created":1742104124,"model":"mymodel","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}'
        ...
        b'bf28171570f","object":"chat.completion.chunk","created":1742104141,"model":"mymodel","choices":[],"usage":{"prompt_tokens":11,"total_tokens":523,"completion_tokens":512}}\n\n'
        b'data: [DONE]\n\n'
        ```

        While usually each PayloadPart event from the event stream will contain a byte array
        with a full json, this is not guaranteed and some of the json objects may be split across
        PayloadPart events. For example:
        ```
        {'PayloadPart': {'Bytes': b'data: {"id":"1f7cb39ac2e24f6187305bdb20fc0002",'}
        {'PayloadPart': {'Bytes': b'"object":"chat.completion.chunk",'}
        ...
        {'PayloadPart': {'Bytes': b'}\n\n'}
        ```

        This class accounts for this by appending each PayloadPart to an in-memory buffer
        and reading complete lines (ending with a '\n' character) from it. It tracks the
        position of the last read so that previously returned bytes are not exposed again.

        Returns:
            str: The `content` delta of the next `data:` event, or an empty string when the
                event carries no content (for example, the final `[DONE]` marker).
        """
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                full_line = line[:-1].decode('utf-8')
                if full_line.startswith("data:"):
                    # Slice off the prefix: str.lstrip("data:") strips a character set, not a prefix.
                    data = full_line[len("data:"):].strip()
                    try:
                        json_line = json.loads(data)
                    except json.JSONDecodeError:
                        # The terminating "[DONE]" marker is not JSON.
                        json_line = {}
                    choices = json_line.get('choices')
                    if choices:
                        # 'content' may be missing or null in some chunks.
                        return choices[0].get('delta', {}).get('content') or ""
                    return ""
                # Blank separator line: keep scanning what is already buffered.
                continue
            # No complete line is buffered, so pull the next event from the stream.
            # When the stream ends, StopIteration propagates and ends the iteration.
            chunk = next(self.byte_iterator)
            if "PayloadPart" not in chunk:
                # handle API response errors and force terminate.
                handle_stream_errors(chunk)
                # print and move on to next response byte
                print(f"Unknown event type: {chunk}")
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
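
To see the buffering idea in isolation, the following sketch parses the same `data: {...}` lines from a plain iterable of byte chunks, with no SageMaker objects involved. It is a simplified stand-in for `TokenIterator`, not a replacement; the function name `iter_sse_content` and the sample chunks are made up for illustration.

```python
import json


def iter_sse_content(byte_chunks):
    """Yield content deltas from 'data: {...}' lines that may be split across chunks."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Only consume complete lines; a partial line waits for the next chunk.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if not text.startswith("data:"):
                continue  # blank separator between events
            data = text[len("data:"):].strip()
            if data == "[DONE]":
                return  # end-of-stream marker
            event = json.loads(data)
            for choice in event.get("choices", []):
                content = choice.get("delta", {}).get("content")
                if content:
                    yield content


# A JSON object split across two chunks is still parsed as one event.
sample_chunks = [
    b'data: {"choices":[{"delta":{"content":"Hel',
    b'lo"}}]}\n\ndata: {"choices":[{"delta":{"content":" world"}}]}\n\n',
    b'data: {"choices":[],"usage":{"completion_tokens":2}}\n\n',
    b'data: [DONE]\n\n',
]
streamed_text = "".join(iter_sse_content(sample_chunks))
```

Splitting on the newline byte before decoding keeps multi-byte UTF-8 characters intact even when a chunk boundary falls inside one.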

In [None]:
%%time

payload = {
    'model': 'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature': 0.6,
    'max_new_tokens': 1000,
    'do_sample': True,
    'top_p': 0.95,
    'stream': True,
    # 'stream_options': {'include_usage': True}
}

response_stream = predictor.predict_stream(
    data=payload,
    iterator=TokenIterator,
)

for token in response_stream:
    print(token, end="", flush=True)

Okay, I need to list three countries along with their capitals. Let me start by recalling some countries and their capitals. First, I know that France's capital is Paris. That's straightforward. Then, maybe Japan? Their capital is Tokyo. Wait, but I should make sure. Let me think again. Yes, Tokyo is the capital of Japan. Third, perhaps Brazil? Their capital is Brasília. Hmm, but sometimes people might confuse it with Rio de Janeiro, which is a major city but not the capital. So Brasília is correct. Let me verify another one just to be safe. Maybe Canada? Their capital is Ottawa. But since the user asked for three, maybe stick with the first three I thought of. Alternatively, maybe include a different continent. Let me see. Alternatively, Germany's capital is Berlin. So maybe France, Japan, and Germany. Wait, but the user might expect more common ones. Alternatively, maybe include the United States? Their capital is Washington, D.C. So perhaps list three different continents. Let me se

In [None]:
%%time

prompt = r"""Let $x,y$ and $z$ be positive real numbers that satisfy the following system of equations:
\[\log_2\left({x \over yz}\right) = {1 \over 2}\]\[\log_2\left({y \over xz}\right) = {1 \over 3}\]\[\log_2\left({z \over xy}\right) = {1 \over 4}\]
Then the value of $\left|\log_2(x^4y^3z^2)\right|$ is $\tfrac{m}{n}$ where $m$ and $n$ are relatively prime positive integers. Find $m+n$.

Please reason step by step, and put your final answer within \boxed{}."""

messages = [
    {"role": "user", "content": prompt}
]

payload = {
    'model': 'mymodel',
    'messages': messages,
    'temperature': 0.6,
    'max_new_tokens': 32768,
    'do_sample': True,
    'top_p': 0.95,
    'stream': True,
    # 'stream_options': {'include_usage': True}
}

response_stream = predictor.predict_stream(
    data=payload,
    iterator=TokenIterator,
)

for token in response_stream:
    print(token, end="", flush=True)

Okay, so I need to solve this system of logarithmic equations involving x, y, and z. The problem gives me three equations with logs base 2, and I need to find the absolute value of log2(x4y3z2), then express it as a reduced fraction m/n and find m + n. Alright, let me start by writing down the equations again to make sure I have them right.

The equations are:

1. log2(x/(yz)) = 1/2
2. log2(y/(xz)) = 1/3
3. log2(z/(xy)) = 1/4

Hmm. Since they are logarithms, maybe I can convert them into exponential form to make them easier to handle. Remember that log_b(a) = c is equivalent to b^c = a. So applying that here:

1. For the first equation: 2^(1/2) = x/(yz)
2. Second equation: 2^(1/3) = y/(xz)
3. Third equation: 2^(1/4) = z/(xy)

Okay, so now I have these equations in terms of x, y, z. Let me write them out again as equations:

1. x/(yz) = √2 (since 2^(1/2) is √2)
2. y/(xz) = 2^(1/3)
3. z/(xy) = 2^(1/4)

Hmm. Maybe I can express each variable in terms of the others and substitute? Alternat

In [None]:
%%time

# Korean MCQA example (CSAT Math 2025)
prompt = r"""Question : $a_1 = 2$인 수열 $\{a_n\}$과 $b_1 = 2$인 등차수열 $\{b_n\}$이 모든 자연수 $n$에 대하여\[\sum_{k=1}^{n} \frac{a_k}{b_{k+1}} = \frac{1}{2} n^2\]을 만족시킬 때, $\sum_{k=1}^{5} a_k$의 값을 구하여라.

Options :
A) 120
B) 125
C) 130
D) 135
E) 140

Please reason step by step, and you should write the correct option alphabet (A, B, C, D or E) within \\boxed{}."""

messages = [
    {"role": "user", "content": prompt}
]

payload = {
    'model': 'mymodel',
    'messages': messages,
    'max_new_tokens': 32768,
    'do_sample': True,
    'temperature': 0.6,
    'top_p': 0.95,
    'stream': True,
    # 'stream_options': {'include_usage': True}
}

response_stream = predictor.predict_stream(
    data=payload,
    iterator=TokenIterator,
)

for token in response_stream:
    print(token, end="", flush=True)

Okay, let's tackle this problem step by step. So we have two sequences here: a_n and b_n. The sequence a_n starts with a_1 = 2, and the sequence b_n is an arithmetic sequence starting with b_1 = 2. The key equation given is that for all natural numbers n, the sum from k=1 to n of (a_k)/(b_{k+1}) equals (1/2)n2. We need to find the sum of the first 5 terms of a, which is a_1 + a_2 + ... + a_5. The options are given from A to E, so we need to figure out which one is correct.

First, let me recall what an arithmetic sequence is. An arithmetic sequence has a common difference between consecutive terms. Since b_1 = 2, then the general term for b_n should be b_n = b_1 + (n-1)d, where d is the common difference. But the problem doesn't specify the common difference, so maybe we need to find that as part of the solution?

Similarly, the sequence a_n is just given as a sequence starting with 2, but we don't know if it's arithmetic or not. The problem doesn't specify, so I guess a_n could be any

### Invoke endpoint with boto3

You can also invoke the endpoint with boto3. If you have an existing endpoint, you don't need to recreate the predictor; follow the example below and invoke the endpoint by name.

In [None]:
%%time

import boto3
import json

sagemaker_runtime = boto3.client('sagemaker-runtime', region_name=region)
endpoint_name = predictor.endpoint_name # you can manually set the endpoint name with an existing endpoint

prompt = {
    'model': 'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature': 0,
    'max_new_tokens': 1000,
    'do_sample': True,
    'top_p': 0.95,
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)

response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

Okay, I need to list three countries and their capitals. Let me start by recalling some countries and their capitals. First, I know that France's capital is Paris. That's a common one. Then maybe Japan? Their capital is Tokyo. Wait, is that right? I think so. Let me double-check. Yes, Tokyo is the capital of Japan. 

Now for the third country. Let's see, maybe Brazil? Their capital is Brasília. Alternatively, Spain? Their capital is Madrid. Or perhaps Germany? Berlin. Hmm, which ones are more commonly known? Maybe Brazil is a good choice. But I should make sure. Let me think of another one. How about Egypt? Their capital is Cairo. That's another one. Or maybe Canada? Ottawa. 

Wait, the user might expect the most well-known ones. Let me confirm: France - Paris, Japan - Tokyo, Brazil - Brasília. Alternatively, maybe the UK? Their capital is London. But the user might want three different continents? Let me see. The user didn't specify, so maybe any three. Let me pick three that are comm
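
When reusing an existing endpoint by name, it can help to confirm that it is `InService` before invoking it. The sketch below polls `describe_endpoint` on a boto3 `sagemaker` client passed in by the caller. The function name `wait_until_in_service` and its polling defaults are assumptions, not part of the original notebook.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, max_polls=40):
    """Poll describe_endpoint until the endpoint is InService (hypothetical helper).

    sm_client is expected to be boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} is in Failed state")
        # Creating, Updating, etc.: wait and check again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")
```

A typical call would be `wait_until_in_service(boto3.client("sagemaker", region_name=region), endpoint_name)`.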

### Streaming response from the endpoint with boto3

In [None]:
request_body = {
    'model': 'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature': 0.6,
    'max_new_tokens': 1000,
    'do_sample': True,
    'top_p': 0.95,
    'stream': True,
    # 'stream_options': {'include_usage': True}
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(request_body),
    ContentType="application/json"
)

# Gets the EventStream object returned by the SDK:
response_stream = TokenIterator(response['Body'])
for token in response_stream:
    print(token, end="", flush=True)

Okay, so I need to list three countries and their capitals. Let me start by recalling some countries I know. Maybe start with the most common ones. Let's see... The first country that comes to mind is Germany. I think their capital is Berlin. Wait, is that right? Yeah, I'm pretty sure Berlin is Germany's capital. 

Next, maybe Japan? Their capital is Tokyo. That seems correct. I've heard that Tokyo is a major city there. Then, perhaps Brazil? Their capital is Brasília. Wait, is there another capital? Maybe Rio de Janeiro is a major city but I think the capital is Brasília. 

Wait, let me make sure I'm not mixing any up. Let me think again. For Germany, yes, Berlin is definitely the capital. Japan's capital is Tokyo. Brazil's capital is Brasília. Hmm, are there any other countries that might be more commonly known? Maybe France? Their capital is Paris. But the user asked for three, so maybe stick with the first three I thought of. Alternatively, maybe the US? The capital is Washington, 

### Clean up the environment

Make sure to delete the endpoint and the other resources created in this example to avoid unnecessary costs. You can also delete them from the SageMaker AI console.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()
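
If the `predictor` object is no longer available, the same cleanup can be sketched with boto3 using only the endpoint name. The helper below looks up the endpoint configuration and its models before deleting anything, because those names cannot be resolved once the endpoint is gone. The function name `delete_endpoint_resources` is hypothetical, and `sm_client` is assumed to be `boto3.client("sagemaker")`.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and its models (hypothetical helper)."""
    # Resolve the dependent resource names first.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config.get("ProductionVariants", [])
        if variant.get("ModelName")
    ]

    # Delete from the outside in: endpoint, then its config, then the models.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}
```

This sketch leaves the model files uploaded to S3 and the container image in Amazon ECR in place; remove those separately if they are no longer needed.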

### References

- [EXAONE Deep 7.8B Model Card](https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-7.8B)
- [SGLang Documentation](https://docs.sglang.ai/index.html) - a fast serving framework for large language models and vision language models
- [sagemaker-genai-hosting-examples/Deepseek/SGLang-Deepseek/deepseek-r1-llama-70b-sglang.ipynb](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Deepseek/SGLang-Deepseek/deepseek-r1-llama-70b-sglang.ipynb)