## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page: [Medical LLM 8B](https://aws.amazon.com/marketplace/pp/prodview-dn7ktdl2sg7bi).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the following cell.

- **Model**: `JSL-Medical-LLM-8B`  
- **Model Description**: Medical LLM 8B

In [None]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
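
Because the ARN is region-specific, it can help to check that the pasted value points at the region you intend to deploy in. The following is a minimal sketch, assuming the standard ARN layout `arn:partition:service:region:account-id:resource`. The helper name `arn_region` and the example ARN are hypothetical and used only for illustration.

```python
def arn_region(arn: str) -> str:
    """Return the region field of an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return parts[3]


# Hypothetical ARN, for illustration only (not a real model package).
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-model"
print(arn_region(example_arn))
```

In the notebook, the result of `arn_region(model_package_arn)` could be compared with the session's `boto_region_name` before deploying.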

In [None]:
import os
import re
import base64
import json
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image, display
from PIL import Image as ImageEdit
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

In [3]:
model_name = "JSL-Medical-LLM-8B"

real_time_inference_instance_type = "ml.g5.12xlarge"
batch_transform_inference_instance_type = "ml.g5.12xlarge"

## Model Configuration Documentation  

### Default Configuration  
The container comes with the following default configurations:  

| Parameter                  | Default Value | Description                                                                   |  
|----------------------------|---------------|-------------------------------------------------------------------------------|  
| **`dtype`**                | `auto`        | Data type for model weights and activations (automatically determined)        |  
| **`tensor_parallel_size`** | Auto          | Automatically set to the number of available GPUs (`torch.cuda.device_count()`)|  
| **`host`**                 | `0.0.0.0`     | Host name                                                                     |  
| **`port`**                 | `8080`        | Port number                                                                   |  
| **`tokenizer_mode`**       | `auto`        | Tokenizer mode (automatically determined)                                     |  
| **`reasoning_parser`**     | `qwen3`       | Reasoning parser to use for extracting reasoning content from the model output|  

### Hardcoded Settings  
The following settings are hardcoded in the container and cannot be changed:  

| Parameter       | Value           | Description                           |  
|-----------------|-----------------|---------------------------------------|  
| **`model`**     | `/opt/ml/model` | Model path where SageMaker mounts the model |  

### Configurable Environment Variables  
You can customize the vLLM server by setting environment variables when creating the model.  

**Any parameter from the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve) can be set using the corresponding environment variable with the `SM_VLLM_` prefix.**  

The container uses a script similar to the [SageMaker entrypoint example](https://docs.vllm.ai/en/v0.8.5/getting_started/examples/sagemaker-entrypoint.html) from the vLLM documentation to convert environment variables to command-line arguments.  
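
To illustrate the naming convention, here is a minimal Python sketch of the kind of conversion the linked entrypoint example performs: strip the `SM_VLLM_` prefix, lowercase the rest, and replace underscores with hyphens. The function name `sm_env_to_vllm_args` is hypothetical, and the container's actual script may differ in detail.

```python
def sm_env_to_vllm_args(env: dict, prefix: str = "SM_VLLM_") -> list:
    """Convert prefixed environment variables into command-line style arguments."""
    args = []
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated environment variables
        # SM_VLLM_MAX_MODEL_LEN -> --max-model-len
        flag = "--" + key[len(prefix):].lower().replace("_", "-")
        args.append(flag)
        if value != "":
            args.append(value)  # an empty value yields a bare flag
    return args


example_env = {
    "SM_VLLM_MAX_MODEL_LEN": "8192",
    "SM_VLLM_GPU_MEMORY_UTILIZATION": "0.9",
    "PATH": "/usr/bin",
}
print(sm_env_to_vllm_args(example_env))
# ['--max-model-len', '8192', '--gpu-memory-utilization', '0.9']
```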

---  

## Input Format  

### 1. Chat Completion  

#### Example Payload  
```json  
{  
    "model": "/opt/ml/model",  
    "messages": [  
        {"role": "system", "content": "You are a helpful medical assistant."},  
        {"role": "user", "content": "What should I do if I have a fever and body aches?"}  
    ],  
    "max_tokens": 1024,  
    "temperature": 0.7  
}  
```  

For additional parameters:  
- [ChatCompletionRequest](https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/entrypoints/openai/protocol.py#L223)  
- [OpenAI's Chat API](https://platform.openai.com/docs/api-reference/chat/create)  

---  

### 2. Text Completion  

#### Single Prompt Example  
```json  
{  
    "model": "/opt/ml/model",  
    "prompt": "How can I maintain good kidney health?",  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

#### Multiple Prompts Example  
```json  
{  
    "model": "/opt/ml/model",  
    "prompt": [  
        "How can I maintain good kidney health?",  
        "What are the best practices for kidney care?"  
    ],  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

Reference:  
- [CompletionRequest](https://github.com/vllm-project/vllm/blob/v0.9.2/vllm/entrypoints/openai/protocol.py#L741)  
- [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions/create)  

---  

### Important Notes:
- **Streaming Responses:** Add `"stream": true` to your request payload to enable streaming
- **Model Path Requirement:** Always set `"model": "/opt/ml/model"` (SageMaker's fixed model location)
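
Both notes can be folded into a small payload builder so the model path is never forgotten. This is only a sketch: `build_chat_payload` and `MODEL_PATH` are hypothetical names, and any extra sampling parameters are passed through unchanged.

```python
MODEL_PATH = "/opt/ml/model"  # fixed model location required in every request


def build_chat_payload(user_prompt, system_prompt=None, stream=False, **params):
    """Build a chat completion payload with the model path always set."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    payload = {"model": MODEL_PATH, "messages": messages, **params}
    if stream:
        payload["stream"] = True  # enables streaming responses
    return payload


example = build_chat_payload(
    "What should I do if I have a fever and body aches?",
    system_prompt="You are a helpful medical assistant.",
    stream=True,
    max_tokens=1024,
    temperature=0.7,
)
print(example["model"], example["stream"], len(example["messages"]))
```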

### Initial setup

In [4]:
prompt1 = """A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus.

Which of the following is the best treatment for this patient?
A: Ampicillin
B: Ceftriaxone
C: Ciprofloxacin
D: Doxycycline
E: Nitrofurantoin
"""

prompt2 = "What should I do if I have a fever and body aches?"

prompts = [
    "How can I maintain good kidney health?",
    "What are the symptoms of high blood pressure?"
]



In [5]:
system_prompt ="You are a helpful medical assistant. Provide accurate, evidence-based information in response to the following question. Organize the response with clear hierarchical headings and include a conclusion if necessary."

## 2. Create a deployable model from the model package.

In [6]:
model = ModelPackage(
    role=role,
    model_package_arn=model_package_arn,
    sagemaker_session=sagemaker_session,
)

## 3. Create SageMaker Endpoint

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [7]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type, 
    endpoint_name=model_name,
    model_data_download_timeout=3600
)

------------!

### 3.1 Real-time inference via Amazon SageMaker Endpoint

In [54]:
def invoke_realtime_endpoint(record):

    response = sm_runtime.invoke_endpoint(
        EndpointName=model_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(record),
    )

    return json.load(response["Body"])

In [55]:
def print_colored(text, color='green'):
    colors = {
        'green': '\033[92m',
        'reset': '\033[0m',
    }
    color_code = colors.get(color, colors['reset'])
    print(f"{color_code}{text}{colors['reset']}", end="", flush=True)

#### Chat Completion

In [56]:
input_data = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt1},
    ],
    "max_tokens": 8192,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty":0.0
}

result = invoke_realtime_endpoint(input_data)
message = result["choices"][0]["message"]

reasoning_content = message.get("reasoning_content")
content = message.get("content")

# Either field can be missing or None, so fall back to an empty string.
print_colored(reasoning_content or "", 'green')
print_colored(content or "", 'reset')

Okay, let's tackle this question. The patient is a 23-year-old pregnant woman at 22 weeks gestation with a burning sensation upon urination that started a day ago and is worsening despite drinking more water and taking cranberry extract. She's otherwise feeling well, with normal vital signs and no costovertebral angle tenderness. The question is asking for the best treatment for her condition.

First, I need to figure out what the likely diagnosis is. Burning upon urination is a classic symptom of urinary tract infection (UTI), especially in a pregnant woman. Given that she's in her second trimester and the symptoms are worsening, it's likely a urinary tract infection. The absence of costovertebral angle tenderness suggests it's not a pyelonephritis (upper UTI), so probably a lower UTI, like cystitis.

Now, the key here is to determine the appropriate antibiotic for a pregnant woman with a UTI. The options are Ampicillin, Ceftriaxone, Ciprofloxacin, Doxycycline, and Nitrofurantoi

#### Text Completion

In [58]:
input_data = {
    "model": "/opt/ml/model",
    "prompt": f"{system_prompt}\n\nUser: {prompt2}\n\nAssistant:",
    "max_tokens": 2048,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0
}

result = invoke_realtime_endpoint(input_data)
output_text = result['choices'][0]['text']
print(output_text)

 # Management of Fever and Body Aches

## Immediate Actions

1. **Rest**: Get plenty of rest to help your body recover
2. **Hydration**: Drink plenty of fluids (water, electrolyte solutions) to stay hydrated
3. **Comfort measures**:
   - Wear light clothing
   - Use a cool, comfortable room temperature
   - Take warm baths or showers if comfortable

## Medication Options

1. **Over-the-counter medications**:
   - Acetaminophen (Tylenol) for fever and pain relief
   - Ibuprofen (Advil, Motrin) for inflammation and pain relief
   - Aspirin (Bayer) for fever and pain relief (not recommended for children or teenagers)
2. **Prescription medications**: Your doctor may prescribe stronger medications if needed

## When to Seek Medical Attention

- If symptoms persist for more than 3-5 days
- If you have difficulty breathing
- If you have severe headache
- If you have persistent vomiting
- If you have a rash
- If you have a history of chronic illness
- If you are pregnant or have a compromised 

### 3.2 Real-time inference response as a stream via Amazon SageMaker Endpoint

In [62]:
def invoke_streaming_endpoint(record):
    try:
        response = sm_runtime.invoke_endpoint_with_response_stream(
            EndpointName=model_name,
            Body=json.dumps(record),
            ContentType="application/json",
            Accept="text/event-stream"
        )

        is_chat_completion = "messages" in record

        for event in response["Body"]:
            if "PayloadPart" in event:
                chunk = event["PayloadPart"]["Bytes"].decode("utf-8")
                if chunk.startswith("data:"):
                    try:
                        data = json.loads(chunk[5:].strip())

                        if "choices" not in data or len(data["choices"]) == 0:
                            continue

                        choice = data["choices"][0]
                        if is_chat_completion:
                            if "delta" in choice:
                                delta = choice["delta"]

                                if "reasoning_content" in delta:
                                    yield {'type': 'reasoning', 'content': delta["reasoning_content"]}

                                elif "content" in delta:
                                    yield {'type': 'content', 'content': delta["content"]}
                        else:

                            if "text" in choice:
                                yield {'type': 'text', 'content': choice["text"]}

                    except json.JSONDecodeError:
                        continue
                        
            elif "ModelStreamError" in event:
                error = event["ModelStreamError"]
                yield {'type': 'error', 'content': f"\nStream error: {error['Message']} (Error code: {error['ErrorCode']})"}
                break
            elif "InternalStreamFailure" in event:
                failure = event["InternalStreamFailure"]
                yield {'type': 'error', 'content': f"\nInternal stream failure: {failure['Message']}"}
                break
                
    except Exception as e:
        yield {'type': 'error', 'content': f"\nAn error occurred during streaming: {str(e)}"}
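
The loop above parses each `PayloadPart` on its own, but a part may carry several events or only a fragment of one. A more defensive variant buffers the bytes and parses complete lines only. The sketch below assumes OpenAI-style server-sent events (`data: {...}` lines, ending with `data: [DONE]`). The helper name `iter_sse_json` is hypothetical, and the byte chunks are simulated rather than read from a live endpoint.

```python
import json


def iter_sse_json(byte_chunks):
    """Yield parsed JSON objects from the 'data:' lines of an SSE byte stream."""
    buffer = b""
    for part in byte_chunks:
        buffer += part
        # Only complete lines are decoded, so events split across parts are safe.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if not text.startswith("data:"):
                continue  # skip blank separator lines and other fields
            data = text[len("data:"):].strip()
            if data == "[DONE]":
                return  # end-of-stream sentinel
            yield json.loads(data)


# Simulated parts: the second event is split across two chunks.
simulated_parts = [
    b'data: {"a": 1}\n\ndata: {"b"',
    b': 2}\n\n',
    b'data: [DONE]\n\n',
]
print(list(iter_sse_json(simulated_parts)))
# [{'a': 1}, {'b': 2}]
```

With a real response, the iterable could be built from the `Bytes` of each `PayloadPart` event.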

In [64]:
def handle_streaming_chunk(chunk):
    """
    Handles and prints a streaming chunk with appropriate formatting
    Args:
        chunk: Dictionary with 'type' and 'content' keys
    """
    if chunk['type'] == 'reasoning':
        print_colored(chunk['content'], 'green')
    elif chunk['type'] == 'content':
        print_colored(chunk['content'], 'reset')
    elif chunk['type'] == 'text':  # For text completion
        print(chunk['content'], end="", flush=True)
    elif chunk['type'] == 'error':
        print_colored(chunk['content'], 'reset')

#### Chat Completion

In [67]:
payload = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": system_prompt },
        {"role": "user", "content": prompt1},
    ],
    "max_tokens": 8192,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    handle_streaming_chunk(chunk)

Okay, let's tackle this question. The patient is a 23-year-old pregnant woman at 22 weeks gestation presenting with burning on urination. She's been experiencing this for a day, and it's getting worse despite increased water intake and cranberry extract. She feels otherwise well and is under her doctor's care for her pregnancy. The

#### Text Completion

In [70]:
payload = {
    "model": "/opt/ml/model",
    "prompt": f"{system_prompt}\n\nUser: {prompt2}\n\nAssistant:",
    "max_tokens": 2048,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty":0.0,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    handle_streaming_chunk(chunk)

 # Managing Fever and Body Aches

## Initial Assessment
When experiencing fever and body aches, start by:

1. **Measuring your temperature**: Confirm the fever is above 100.4°F (38°C) for adults, or above 100.0°F (37.8°C) for children
2. **Assessing severity**: Determine if symptoms are mild (manage at home) or severe (seek medical attention)

## Home Management Strategies

### Medications
- **Acetaminophen (Tylenol)**: for fever and pain relief
- **Ibuprofen (Advil, Motrin)**: Can reduce fever and alleviate body aches
- **NSAIDs**: May help with inflammation
- **Avoid aspirin**: Particularly in children due to risk of Reye's syndrome

### Hydration
- Drink plenty of fluids to prevent dehydration
- Water, electrolyte, or clear broths are good options

### Rest
- Get adequate sleep to support your immune system
- Avoid strenuous activities until symptoms improve

### Comfort Measures
- Take warm baths or showers
- Use damp cloths to cool your skin
- Keep your room comfortably cool and w

Now that you have successfully performed real-time inference, you no longer need the endpoint. Delete it to avoid being charged.

In [71]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 4. Batch inference

In [72]:
validation_json_file_name1 = "input1.json"

validation_input_json_path = f"s3://{s3_bucket}/{model_name}/validation-input/"
validation_output_json_path = f"s3://{s3_bucket}/{model_name}/validation-output/"


def write_and_upload_to_s3(input_data, file_name):
    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input/{file_name}",
        Body=input_data.encode("utf-8"),
    )

In [74]:
input_json_data1 = json.dumps(
    {
        "model": "/opt/ml/model",
        "prompt": [f"{system_prompt}\n\nUser: {prompt}\n\nAssistant:" for prompt in prompts],
        "max_tokens": 2048,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty":0.0,
    }
)

write_and_upload_to_s3(input_json_data1, f"{validation_json_file_name1}")

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
    output_path=validation_output_json_path,
)
transformer.transform(validation_input_json_path, content_type="application/json")
transformer.wait()

In [None]:
from urllib.parse import urlparse

def retrieve_json_output_from_s3(validation_file_name):

    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)
    result = json.loads(response["Body"].read().decode("utf-8"))
    
    for idx, choice in enumerate(result.get("choices", [])):
        print(f"Response {idx + 1}:\n{choice.get('text', '')}\n{'=' * 75}")
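
The key construction in `retrieve_json_output_from_s3` relies on batch transform writing each result under the output prefix with `.out` appended to the input file name. The sketch below isolates that mapping so it can be checked without calling S3. The helper name `batch_output_location` and the bucket name are hypothetical.

```python
from urllib.parse import urlparse


def batch_output_location(output_path, input_file_name):
    """Return (bucket, key) of the batch transform result for one input file."""
    parsed = urlparse(output_path)
    prefix = parsed.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # tolerate an output path without a trailing slash
    return parsed.netloc, f"{prefix}{input_file_name}.out"


# Hypothetical bucket name, for illustration only.
bucket, key = batch_output_location(
    "s3://my-bucket/JSL-Medical-LLM-8B/validation-output/", "input1.json"
)
print(bucket, key)
# my-bucket JSL-Medical-LLM-8B/validation-output/input1.json.out
```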

In [78]:
retrieve_json_output_from_s3(validation_json_file_name1)

Response 1:
 # Maintaining Good Kidney Health

## Prevention Strategies

### Diet
- **Stay hydrated**: Drink plenty of water throughout the day (8-10 glasses daily)
- **Reduce sodium intake**: Limit processed foods, fast food, and high-sodium snacks
- **Balanced diet**: Include fruits, vegetables, whole grains, lean proteins, and healthy fats
- **Limit protein**: Moderate protein intake, especially for those with kidney disease
- **Control blood pressure**: Maintain blood pressure within the normal range (below 120/80 mmHg)
- **Manage diabetes**: Control blood sugar levels to prevent kidney damage

### Lifestyle Habits
- **Maintain a healthy weight**: Obesity increases risk of kidney disease
- **Exercise regularly**: Aim for 150 minutes of moderate exercise weekly
- **Avoid smoking**: Smoking damages blood vessels and reduces blood flow to the kidneys
- **Limit alcohol consumption**: Drink in moderation (up to 1 drink daily for women, 2 for men)

## Regular Monitoring
- **Annual check-

Congratulations! You just verified that the batch transform job is working as expected. Since the model is no longer required, you can delete it. Note that this deletes the deployable model, not the model package.

In [79]:
model.delete_model()

INFO:sagemaker:Deleting model with name: JSL-Medical-LLM-8B-v0-2025-07-10-10-23-25-567


### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

