## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page: [Medical Reasoning LLM - 32B](https://aws.amazon.com/marketplace/pp/prodview-x5bfvnroddgfe).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the following cell.

- **Model**: `JSL-Medical-Reasoning-LLM-32B`  
- **Model Description**: Medical Reasoning LLM - 32B

In [None]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
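Because the ARN must match your region, a quick sanity check can catch a copy-paste mistake before deployment. The sketch below assumes the usual ARN layout `arn:<partition>:sagemaker:<region>:<account>:model-package/<name>`. The helper name `model_package_arn_region` and the example ARN are hypothetical and not part of the original notebook.

```python
def model_package_arn_region(arn):
    """Return the region embedded in a SageMaker model package ARN."""
    # Expected layout: arn:<partition>:sagemaker:<region>:<account>:model-package/<name>
    parts = arn.split(":", 5)
    is_model_package = (
        len(parts) == 6
        and parts[0] == "arn"
        and parts[2] == "sagemaker"
        and parts[5].startswith("model-package/")
    )
    if not is_model_package:
        raise ValueError(f"Not a SageMaker model package ARN: {arn!r}")
    return parts[3]


# Hypothetical ARN for illustration only; use the one shown on the Marketplace page.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-model-package"
print(model_package_arn_region(example_arn))
```

Once the setup cell has defined `region`, comparing `model_package_arn_region(model_package_arn)` with `region` shows whether the ARN you copied belongs to the region your session uses.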

In [None]:
import os
import re
import copy
import base64
import json
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image, display
from PIL import Image as ImageEdit
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

In [3]:
model_name = "JSL-Medical-Reasoning-LLM-32B"

real_time_inference_instance_type = "ml.g5.48xlarge"
batch_transform_inference_instance_type = "ml.g5.48xlarge"

## Model Configuration Documentation  

### Default Configuration  
The container comes with the following default configurations:  

| Parameter                  | Default Value | Description                                                                   |  
|----------------------------|---------------|-------------------------------------------------------------------------------|  
| **`dtype`**                | `auto`        | Data type for model weights and activations (automatically determined)        |  
| **`tensor_parallel_size`** | Auto          | Automatically set to the number of available GPUs (`torch.cuda.device_count()`)|  
| **`host`**                 | `0.0.0.0`     | Host name                                                                     |  
| **`port`**                 | `8080`        | Port number                                                                   |  
| **`tokenizer_mode`**       | `auto`        | Tokenizer mode (automatically determined)                                     |  
| **`reasoning_parser`**     | `qwen3`       | Reasoning parser to use for extracting reasoning content from the model output|  

### Hardcoded Settings  
The following settings are hardcoded in the container and cannot be changed:  

| Parameter       | Value           | Description                           |  
|-----------------|-----------------|---------------------------------------|  
| **`model`**     | `/opt/ml/model` | Model path where SageMaker mounts the model |  

### Configurable Environment Variables  
You can customize the vLLM server by setting environment variables when creating the model.  

**Any parameter from the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve) can be set using the corresponding environment variable with the `SM_VLLM_` prefix.**  

The container uses a script similar to the [SageMaker entrypoint example](https://docs.vllm.ai/en/v0.8.5/getting_started/examples/sagemaker-entrypoint.html) from the vLLM documentation to convert environment variables to command-line arguments.  
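To make the naming convention concrete, here is a minimal Python sketch of the kind of mapping such an entrypoint script performs: strip the `SM_VLLM_` prefix, lowercase the rest, and turn underscores into hyphens. It is an approximation for illustration, not the container's actual script. The function name `sm_vllm_env_to_args` is hypothetical, and the treatment of empty values as bare flags is an assumption.

```python
def sm_vllm_env_to_args(env, prefix="SM_VLLM_"):
    """Translate prefixed environment variables into vLLM-style CLI arguments."""
    args = []
    for key in sorted(env):
        if not key.startswith(prefix):
            continue  # ignore variables that are not meant for vLLM
        # SM_VLLM_MAX_MODEL_LEN -> --max-model-len
        flag = "--" + key[len(prefix):].lower().replace("_", "-")
        args.append(flag)
        value = env[key]
        if value != "":
            # Assumption: an empty value means a bare flag with no argument.
            args.append(value)
    return args


example_env = {
    "SM_VLLM_MAX_MODEL_LEN": "8192",
    "SM_VLLM_GPU_MEMORY_UTILIZATION": "0.9",
    "UNRELATED_VARIABLE": "ignored",
}
print(sm_vllm_env_to_args(example_env))
```

In this notebook the variables would be supplied when the model is created, for example through an `env` dictionary passed to `ModelPackage`, assuming the installed SageMaker Python SDK forwards that argument to the container definition.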

---  

## Input Format  

### 1. Chat Completion  

#### Example Payload  
```json  
{  
    "model": "/opt/ml/model",  
    "messages": [  
        {"role": "system", "content": "You are a helpful medical assistant."},  
        {"role": "user", "content": "What should I do if I have a fever and body aches?"}  
    ],  
    "max_tokens": 1024,  
    "temperature": 0.7  
}  
```  

For additional parameters:  
- [ChatCompletionRequest](https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/entrypoints/openai/protocol.py#L223)  
- [OpenAI's Chat API](https://platform.openai.com/docs/api-reference/chat/create)  

---  

### 2. Text Completion  

#### Single Prompt Example  
```json  
{  
    "model": "/opt/ml/model",  
    "prompt": "How can I maintain good kidney health?",  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

#### Multiple Prompts Example  
```json  
{  
    "model": "/opt/ml/model",  
    "prompt": [  
        "How can I maintain good kidney health?",  
        "What are the best practices for kidney care?"  
    ],  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

Reference:  
- [CompletionRequest](https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/entrypoints/openai/protocol.py#L741)  
- [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions/create)  

---  

### Important Notes:
- **Streaming Responses:** Add `"stream": true` to your request payload to enable streaming
- **Model Path Requirement:** Always set `"model": "/opt/ml/model"` (SageMaker's fixed model location)
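
Both notes can be folded into small payload builders so that the model path is never forgotten and streaming is a single switch. This is a sketch only. The names `MODEL_PATH`, `build_chat_payload`, and `build_completion_payload` are hypothetical helpers and are not used elsewhere in this notebook.

```python
MODEL_PATH = "/opt/ml/model"  # fixed model location inside the container


def build_chat_payload(user_prompt, system_prompt=None, stream=False, **sampling):
    """Build a chat completion request body."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    payload = {"model": MODEL_PATH, "messages": messages, **sampling}
    if stream:
        payload["stream"] = True
    return payload


def build_completion_payload(prompt, stream=False, **sampling):
    """Build a text completion request body; prompt may be a string or a list."""
    payload = {"model": MODEL_PATH, "prompt": prompt, **sampling}
    if stream:
        payload["stream"] = True
    return payload


chat_request = build_chat_payload(
    "What should I do if I have a fever and body aches?",
    system_prompt="You are a helpful medical assistant.",
    max_tokens=1024,
    temperature=0.7,
)
```

Extra keyword arguments such as `max_tokens` or `temperature` are copied into the payload unchanged, so any request parameter the server accepts can be passed the same way.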

### Initial setup

In [4]:
prompt1 = """A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus.

Which of the following is the best treatment for this patient?
A: Ampicillin
B: Ceftriaxone
C: Ciprofloxacin
D: Doxycycline
E: Nitrofurantoin
"""

prompt2 = "What should I do if I have a fever and body aches?"

prompts = [
    "How can I maintain good kidney health?",
    "What are the symptoms of high blood pressure?"
]



In [5]:
system_prompt = "You are a helpful medical assistant. Provide accurate, evidence-based information in response to the following question. Organize the response with clear hierarchical headings and include a conclusion if necessary."

## 2. Create a deployable model from the model package.

In [6]:
model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=sagemaker_session,
)

## 3. Create SageMaker Endpoint

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [7]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type, 
    endpoint_name=model_name,
    model_data_download_timeout=3600
)

--------------------------!

### 3.1 Real-time inference via Amazon SageMaker Endpoint

In [7]:
def invoke_realtime_endpoint(record):

    response = sm_runtime.invoke_endpoint(
        EndpointName=model_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(record),
    )

    return json.load(response["Body"])

In [8]:
def print_colored(text, color='green'):
    colors = {
        'green': '\033[92m',
        'reset': '\033[0m',
    }
    color_code = colors.get(color, colors['reset'])
    print(f"{color_code}{text}{colors['reset']}", end="", flush=True)

#### Chat Completion

In [9]:
input_data = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt1},
    ],
    "max_tokens": 8192,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty":0.0
}

result = invoke_realtime_endpoint(input_data)
message = result["choices"][0]["message"]

reasoning_content = message.get("reasoning_content")
content = message.get("content")

print_colored(reasoning_content, 'green')
print_colored(content, 'reset')

Okay, let's try to figure out the best treatment for this pregnant woman with burning on urination. So, the patient is 23 years old, 22 weeks pregnant, and presents with dysuria. The symptoms started a day ago and have been getting worse despite increased water and cranberry extract. She's otherwise well, with normal vital signs. No costovertebral angle tenderness, and a gravid uterus.

First, I need to consider the possible diagnosis. Dysuria in pregnancy can be due to a urinary tract infection (UTI), which is common during pregnancy. Since she's in the second trimester, it's important to consider both lower UTI (cystitis) and upper UTI (pyelonephritis). However, there's no fever or flank pain, which are more suggestive of pyelonephritis. The absence of costovertebral tenderness also points away from pyelonephritis. So, it's likely a lower UTI, like cystitis.

Now, the next step is to determine the appropriate antibiotic. The key here is to choose an antibiotic that's both effec

#### Text Completion

In [11]:
input_data = {
    "model": "/opt/ml/model",
    "prompt": f"{system_prompt}\n\nUser: {prompt2}\n\nAssistant:",
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0
}

result = invoke_realtime_endpoint(input_data)
output_text = result['choices'][0]['text']

reasoning_match = re.match(r"<think>\n?(.*?)</think>\n?", output_text, flags=re.DOTALL)
reasoning_content = None
if reasoning_match:
    reasoning_content = reasoning_match.group(1).rstrip()
    content = output_text[len(reasoning_match.group(0)):]
else:
    content = output_text

if reasoning_content:
    print_colored(reasoning_content + "\n\n", 'green')

print_colored(content, 'reset')

Okay, the user is asking what to do if they have a fever and body aches. Let me start by recalling the common causes. Fever and body aches are usually symptoms of a viral infection, like the flu or a cold. They could also be from bacterial infections, but more often it's viral.

First, I should address the immediate steps they can take. Rest is important because the body needs energy to fight off the infection. Hydration is key too, as fever can cause dehydration. They should drink water, herbal teas, or electrolyte solutions. Over-the-counter medications like acetaminophen or ibuprofen can help reduce fever and pain. I need to make sure to mention the correct dosages and possible side effects, like avoiding alcohol with acetaminophen.

Monitoring symptoms is crucial. They should keep track of their temperature and note if the fever is getting worse or if other symptoms like difficulty breathing appear. If the fever is above 103°F or lasts more than a few days, they should seek me

### 3.2 Real-time inference response as a stream via Amazon SageMaker Endpoint

In [14]:
def invoke_streaming_endpoint(record):
    try:
        response = sm_runtime.invoke_endpoint_with_response_stream(
            EndpointName=model_name,
            Body=json.dumps(record),
            ContentType="application/json",
            Accept="text/event-stream"
        )

        is_chat_completion = "messages" in record

        for event in response["Body"]:
            if "PayloadPart" in event:
                chunk = event["PayloadPart"]["Bytes"].decode("utf-8")
                if chunk.startswith("data:"):
                    try:
                        data = json.loads(chunk[5:].strip())

                        if "choices" not in data or len(data["choices"]) == 0:
                            continue

                        choice = data["choices"][0]
                        if is_chat_completion:
                            if "delta" in choice:
                                delta = choice["delta"]

                                if "reasoning_content" in delta:
                                    yield {'type': 'reasoning', 'content': delta["reasoning_content"]}

                                elif "content" in delta:
                                    yield {'type': 'content', 'content': delta["content"]}
                        else:

                            if "text" in choice:
                                yield {'type': 'text', 'content': choice["text"]}

                    except json.JSONDecodeError:
                        continue

            elif "ModelStreamError" in event:
                error = event["ModelStreamError"]
                yield {'type': 'error', 'content': f"\nStream error: {error['Message']} (Error code: {error['ErrorCode']})"}
                break
            elif "InternalStreamFailure" in event:
                failure = event["InternalStreamFailure"]
                yield {'type': 'error', 'content': f"\nInternal stream failure: {failure['Message']}"}
                break

    except Exception as e:
        yield {'type': 'error', 'content': f"\nAn error occurred during streaming: {str(e)}"}
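
The function above treats every `PayloadPart` as exactly one complete `data:` line. Payload parts may not align with event boundaries, so a part can carry several events or only a fragment of one. A more defensive approach buffers the bytes and splits on newlines before parsing. The sketch below shows that idea on plain byte chunks. The name `iter_sse_json` is hypothetical, and the handling of the `[DONE]` sentinel assumes the OpenAI-style stream terminator.

```python
import json


def iter_sse_json(byte_chunks):
    """Yield parsed JSON objects from server-sent-event bytes split arbitrarily."""
    buffer = b""
    for part in byte_chunks:
        buffer += part
        # Only complete lines are parsed; an unfinished tail stays in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if not text.startswith("data:"):
                continue  # blank separator lines and comments
            body = text[5:].strip()
            if body == "[DONE]":  # assumed end-of-stream sentinel
                return
            try:
                yield json.loads(body)
            except json.JSONDecodeError:
                continue  # skip malformed events instead of aborting the stream


# Simulated stream: the first event is split across two chunks.
simulated_chunks = [
    b'data: {"choices": [{"text": "Hel',
    b'lo"}]}\n\ndata: {"choices": [{"text": " world"}]}\n\n',
    b"data: [DONE]\n\n",
]
print("".join(event["choices"][0]["text"] for event in iter_sse_json(simulated_chunks)))
```

With a real response, the input could be the generator `(event["PayloadPart"]["Bytes"] for event in response["Body"] if "PayloadPart" in event)`, while the error events are handled as in the function above.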

In [15]:
def handle_streaming_chunk(chunk):
    """
    Handles and prints a streaming chunk with appropriate formatting
    Args:
        chunk: Dictionary with 'type' and 'content' keys
    """
    if chunk['type'] == 'reasoning':
        print_colored(chunk['content'], 'green')
    elif chunk['type'] == 'content':
        print_colored(chunk['content'], 'reset')
    elif chunk['type'] == 'text':  # For text completion
        print(chunk['content'], end="", flush=True)
    elif chunk['type'] == 'error':
        print_colored(chunk['content'], 'reset')

#### Chat Completion

In [16]:
payload = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": system_prompt },
        {"role": "user", "content": prompt1},
    ],
    "max_tokens": 8192,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    handle_streaming_chunk(chunk)

[0m[0m[92m
[0m[92mOkay[0m[92m,[0m[92m so[0m[92m I[0m[92m need[0m[92m to[0m[92m figure[0m[92m out[0m[92m the[0m[92m best[0m[92m treatment[0m[92m for[0m[92m this[0m[92m [0m[92m2[0m[92m3[0m[92m-year[0m[92m-old[0m[92m pregnant[0m[92m woman[0m[92m who[0m[92m's[0m[92m [0m[92m2[0m[92m2[0m[92m weeks[0m[92m along[0m[92m and[0m[92m has[0m[92m burning[0m[92m when[0m[92m she[0m[92m ur[0m[92min[0m[92mates[0m[92m.[0m[92m The[0m[92m symptoms[0m[92m started[0m[92m a[0m[92m day[0m[92m ago[0m[92m and[0m[92m are[0m[92m getting[0m[92m worse[0m[92m even[0m[92m after[0m[92m she[0m[92m's[0m[92m been[0m[92m drinking[0m[92m more[0m[92m water[0m[92m and[0m[92m taking[0m[92m cran[0m[92mberry[0m[92m extract[0m[92m.[0m[92m She[0m[92m's[0m[92m healthy[0m[92m,[0m[92m has[0m[92m a[0m[92m normal[0m[92m temperature[0m[92m,[0m[92m and[0m[92m her[0m[92m physical[0m[92m exam

#### Text Completion

In [17]:
payload = {
    "model": "/opt/ml/model",
    "prompt": f"{system_prompt}\n\nUser: {prompt2}\n\nAssistant:",
    "max_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty":0.0,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    handle_streaming_chunk(chunk)

<think>
Okay, the user is asking what to do if they have a fever and body aches. Let me start by recalling common causes. Fevers and body aches are usually from viral infections like the flu or a common cold. But they could also be, like strep throat or something else. First, I should outline the steps someone should take when they have these symptoms.

They need to rest and stay hydrated. That's basic. Fluids help with fever and prevent dehydration. Maybe mention water, herbal teas,tes. Then, over-the-counter medications. Acetaminophen or ibuprofen can reduce fever and pain. I should check the dosages and any contraindications, like liver issues for acetaminophen.

Next, symptoms. They should keep an eye on the fever's duration and temperature. If it's over 103°F or lasts more than a few days, they should see a doctor. Also, if there are other symptoms like chest pain, breathing, or confusion, that's a red flag. 

Self-care measures like cool compresses, rest, and a cool room might he

Now that you have successfully performed real-time inference, you no longer need the endpoint. You can delete it to avoid further charges.

In [18]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 4. Batch inference

In [19]:
validation_json_file_name1 = "input1.json"

validation_input_json_path = f"s3://{s3_bucket}/{model_name}/validation-input/"
validation_output_json_path = f"s3://{s3_bucket}/{model_name}/validation-output/"


def write_and_upload_to_s3(input_data, file_name):
    # Upload the JSON string as UTF-8 bytes under the validation-input prefix.
    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input/{file_name}",
        Body=input_data.encode("utf-8"),
    )

In [20]:
input_json_data1 = json.dumps(
    {
        "model": "/opt/ml/model",
        "prompt": [f"{system_prompt}\n\nUser: {prompt}\n\nAssistant:" for prompt in prompts],
        "max_tokens": 2048,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty":0.0,
    }
)

write_and_upload_to_s3(input_json_data1, f"{validation_json_file_name1}")

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
    output_path=validation_output_json_path,
)
transformer.transform(validation_input_json_path, content_type="application/json")
transformer.wait()

In [None]:
from urllib.parse import urlparse

def retrieve_json_output_from_s3(validation_file_name):

    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)
    result = json.loads(response["Body"].read().decode("utf-8"))
    
    for idx, choice in enumerate(result.get("choices", [])):
        print(f"Response {idx + 1}:\n{choice.get('text', '')}\n{'=' * 75}")
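
Batch outputs come from the text completion API, so each `text` field still contains the `<think>...</think>` block, as in the real-time text completion example. A small helper can separate the reasoning from the final answer before printing or storing results. This is a sketch. The names `THINK_PATTERN` and `split_reasoning` are hypothetical, and the sample string is made up for illustration.

```python
import re

# Matches a leading <think>...</think> block, allowing leading whitespace.
THINK_PATTERN = re.compile(r"\s*<think>\n?(.*?)</think>\n?", flags=re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is None when no complete block is found."""
    match = THINK_PATTERN.match(text)
    if not match:
        # Covers outputs without reasoning and outputs cut off before </think>.
        return None, text
    return match.group(1).rstrip(), text[match.end():]


# Made-up sample text, not real model output.
sample = "<think>\nThe user asks about hydration.\n</think>\nDrink enough water."
reasoning, answer = split_reasoning(sample)
print(reasoning)
print(answer)
```

If generation stops at `max_tokens` before the closing tag, the helper returns the whole text as the answer, so a truncated response is easy to spot by its leading `<think>` tag.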

In [24]:
retrieve_json_output_from_s3(validation_json_file_name1)

Response 1:
<think>
Okay, the user is asking how to maintain good kidney health. Let me start by recalling the key factors that contribute to kidney health. First, staying hydrated is important. I should mention that water helps the kidneys function properly. But how much is enough? Maybe 2-3 liters a day, but adjust based on activity and climate.

Next, diet. Sodium intake is a big one. High sodium can increase blood pressure, which affects the kidneys. So, advising to limit processed foods and added salt makes sense. Also, protein intake. Too much protein, especially from animal sources, can strain the kidneys. Maybe recommend a balanced protein intake and choosing plant-based sources.

Blood pressure control is crucial. High blood pressure is a leading cause of kidney disease. So, monitoring and managing it through lifestyle changes and medication if needed. Mention exercise and a low-sodium diet again here.

Blood sugar management, especially for diabetics. Diabetes is another lead

Congratulations! You just verified that the batch transform job works as expected. Since the model is no longer required, you can delete it. Note that this deletes the deployable model, not the model package.

In [None]:
model.delete_model()

### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. You can check this by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

