## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page [Medical Reasoning LLM - 32B]().
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN that corresponds to your region and paste it into the following cell.

- **Model**: `JSL-Medical-Reasoning-LLM-32B`
- **Model Description**: Medical model for summarization, question answering (open-book and closed-book), and general chat.

In [1]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
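
Because the ARN is region-specific, a quick sanity check can catch an ARN copied for the wrong region before deployment. The sketch below is not part of the original notebook: `arn_region` and `check_arn_region` are hypothetical helpers, the sample ARN uses a made-up account ID and package name, and the code assumes the standard `arn:partition:service:region:account-id:resource` layout.

```python
def arn_region(arn: str) -> str:
    """Return the region field of an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return parts[3]


def check_arn_region(arn: str, expected_region: str) -> bool:
    """Return True when the ARN belongs to the expected region."""
    return arn_region(arn) == expected_region


# Hypothetical ARN for illustration only (account ID and package name are made up).
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-package"
print(check_arn_region(example_arn, "us-east-1"))
```

In the notebook, the same check could be run as `check_arn_region(model_package_arn, region)` once `region` has been defined in the setup cell.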

In [None]:
import os
import base64
import json
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image, display
from PIL import Image as ImageEdit
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

# Named sm_client so the boto3 client is not confused with the sagemaker SDK package
sm_client = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

In [9]:
model_name = "JSL-Medical-Reasoning-LLM-32B"

real_time_inference_instance_type = "ml.g5.48xlarge"
batch_transform_inference_instance_type = "ml.g5.48xlarge"

## 2. Create a deployable model from the model package

In [66]:
# Define ModelPackage
model = ModelPackage(
    role=role, 
    model_package_arn=model_package_arn, 
    sagemaker_session=sagemaker_session, 
)

## Model Configuration Documentation  

### Default Configuration  
The container comes with the following default configurations:  

| Parameter                  | Default Value | Description                                                                   |  
|----------------------------|---------------|-------------------------------------------------------------------------------|  
| **`dtype`**                | `float16`     | Data type for model weights and activations                                   |  
| **`max_model_len`**        | `32,768`      | Default maximum context length (`input + output ≤ max_model_len`). Automatically increases to `131,072` if GPU memory ≥ 240GB |  
| **`tensor_parallel_size`** | Auto          | Automatically set to the number of available GPUs                            |  
| **`host`**                 | `0.0.0.0`     | Host name                                                                     |  
| **`port`**                 | `8080`        | Port number                                                                   |  
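
The constraint `input + output ≤ max_model_len` can be turned into a simple budget for `max_tokens`. The following is a minimal sketch, assuming the prompt length in tokens is already known; `max_output_tokens` is a hypothetical helper and is not provided by the container.

```python
def max_output_tokens(prompt_tokens: int, max_model_len: int = 32_768) -> int:
    """Largest max_tokens value that keeps prompt + output within max_model_len."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(max_model_len - prompt_tokens, 0)


# A 30,000-token prompt leaves 2,768 tokens for the response at the default limit.
print(max_output_tokens(30_000))  # 2768
```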

### Hardcoded Settings  
The following settings are hardcoded in the container and cannot be changed:  

| Parameter       | Value           | Description                           |  
|-----------------|-----------------|---------------------------------------|  
| **`model`**     | `/opt/ml/model` | Model path where SageMaker mounts the model |  

### Configurable Environment Variables  
You can customize the vLLM server by setting environment variables when creating the model.  

**Any parameter from the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve) can be set using the corresponding environment variable with the `SM_VLLM_` prefix.**  

The container uses a script similar to the [SageMaker entrypoint example](https://docs.vllm.ai/en/latest/examples/sagemaker_entrypoint.html) from the vLLM documentation to convert environment variables to command-line arguments.  
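
To make the naming convention concrete, the sketch below mirrors the conversion in plain Python: strip the `SM_VLLM_` prefix, lowercase the rest, replace underscores with dashes, and prepend `--`. It is an illustration of the convention, not the container's actual script; `sm_vllm_env_to_args` is a hypothetical helper and the values in `example_env` are examples only.

```python
def sm_vllm_env_to_args(env: dict) -> list:
    """Convert SM_VLLM_* environment variables into CLI-style arguments."""
    prefix = "SM_VLLM_"
    args = []
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # ignore variables that do not target the vLLM server
        name = key[len(prefix):].lower().replace("_", "-")
        args.append(f"--{name}")
        if value != "":
            args.append(str(value))  # flags with an empty value take no argument
    return args


# Example values only; choose settings that fit your instance type.
example_env = {
    "SM_VLLM_MAX_MODEL_LEN": "16384",
    "SM_VLLM_GPU_MEMORY_UTILIZATION": "0.9",
}
print(sm_vllm_env_to_args(example_env))
# ['--max-model-len', '16384', '--gpu-memory-utilization', '0.9']
```

A dictionary like `example_env` would typically be supplied as the model's environment when the deployable model is created. How the SageMaker Python SDK forwards it for this listing is an assumption worth verifying against the SDK documentation.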

---  

## Input Format  

### 1. Chat Completion  

#### Example Payload  
```json  
{  
    "model": "/opt/ml/model",  
    "messages": [  
        {"role": "system", "content": "You are a helpful medical assistant."},  
        {"role": "user", "content": "What should I do if I have a fever and body aches?"}  
    ],  
    "max_tokens": 1024,  
    "temperature": 0.7  
}  
```  

For additional parameters:  
- [ChatCompletionRequest](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/entrypoints/openai/protocol.py#L212)  
- [OpenAI's Chat API](https://platform.openai.com/docs/api-reference/chat/create)  
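
A small builder keeps chat payloads consistent across requests. This is a minimal sketch: `build_chat_payload` is a hypothetical helper, and its defaults simply repeat the values from the example payload above.

```python
import json


def build_chat_payload(user_prompt, system_prompt="You are a helpful medical assistant.",
                       max_tokens=1024, temperature=0.7, stream=False):
    """Assemble a chat completion request body for the endpoint."""
    payload = {
        "model": "/opt/ml/model",  # fixed model path inside the container
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if stream:
        payload["stream"] = True
    return payload


chat_payload = build_chat_payload("What should I do if I have a fever and body aches?")
print(json.dumps(chat_payload, indent=2))
```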

---  

### 2. Text Completion  

#### Single Prompt Example  
```json  
{  
    "model": "/opt/ml/model",  
    "prompt": "How can I maintain good kidney health?",  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

#### Multiple Prompts Example  
```json  
{  
    "model": "/opt/ml/model",  
    "prompt": [  
        "How can I maintain good kidney health?",  
        "What are the best practices for kidney care?"  
    ],  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

Reference:  
- [CompletionRequest](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/entrypoints/openai/protocol.py#L642)  
- [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions/create)  
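
When several prompts are sent in one request, each entry in `choices` carries an `index` field. The sketch below pairs prompts with completions by sorting on that field. It assumes one completion per prompt (the default `n=1`), `pair_prompts_with_choices` is a hypothetical helper, and `sample_response` is made-up data shaped like a completion response.

```python
def pair_prompts_with_choices(prompts, response):
    """Return (prompt, completion_text) tuples, matched by choice index."""
    ordered = sorted(response["choices"], key=lambda choice: choice["index"])
    return [(prompt, choice["text"]) for prompt, choice in zip(prompts, ordered)]


# Made-up response for illustration; real responses contain additional fields.
sample_response = {
    "choices": [
        {"index": 1, "text": " Answer about kidney care."},
        {"index": 0, "text": " Answer about kidney health."},
    ]
}
sample_prompts = [
    "How can I maintain good kidney health?",
    "What are the best practices for kidney care?",
]
for prompt, text in pair_prompts_with_choices(sample_prompts, sample_response):
    print(f"{prompt} ->{text}")
```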

---  

### Important Notes:
- **Streaming Responses:** Add `"stream": true` to your request payload to enable streaming
- **Model Path Requirement:** Always set `"model": "/opt/ml/model"` (SageMaker's fixed model location)


## 3. Create a SageMaker Endpoint

To understand how real-time inference with Amazon SageMaker works, see the [SageMaker hosting documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [None]:
# Deploy the model
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type, 
    endpoint_name=model_name,
    model_data_download_timeout=3600
)

### 3.1 Real-time inference via Amazon SageMaker Endpoint

#### Initial setup

In [101]:
prompt1 = """A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus.

Which of the following is the best treatment for this patient?
A: Ampicillin
B: Ceftriaxone
C: Ciprofloxacin
D: Doxycycline
E: Nitrofurantoin
"""

prompt2 = "What should I do if I have a fever and body aches?"

prompts = [
    "How can I maintain good kidney health?",
    "What are the symptoms of high blood pressure?"
]

In [102]:
import time

def invoke_realtime_endpoint(record):

    response = sm_runtime.invoke_endpoint(
        EndpointName=model_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(record),
    )

    return json.load(response["Body"])
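
Chat and text completion responses store the generated text under different keys (`message.content` versus `text`). The sketch below reads either shape; `extract_text` is a hypothetical helper, and the sample dictionaries are made-up data.

```python
def extract_text(result: dict) -> str:
    """Return the generated text from a chat or text completion response."""
    choice = result["choices"][0]
    if "message" in choice:
        return choice["message"]["content"]  # chat completion shape
    return choice["text"]  # text completion shape


# Made-up responses for illustration only.
chat_result = {"choices": [{"index": 0, "message": {"role": "assistant", "content": "Rest and hydrate."}}]}
text_result = {"choices": [{"index": 0, "text": " Rest and hydrate."}]}
print(extract_text(chat_result))
print(extract_text(text_result))
```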

#### Chat Completion

In [103]:
input_data = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": prompt1},
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
}

result = invoke_realtime_endpoint(input_data)
output_content = result['choices'][0]['message']['content']
print(output_content)

Okay, let's tackle this question step by step. So, we have a 23-year-old pregnant woman at 22 weeks gestation who's experiencing burning when she urinates. This symptom started just a day ago and has been getting worse even though she's been drinking more water and taking cranberry extract. She mentions she's otherwise feeling well and is under a doctor's care for her pregnancy. Her vital signs are all within normal limits: temperature 97.7°F, blood pressure 122/77 mmHg, pulse 80/min, respirations 19/min, and oxygen saturation 98% on room air. The physical exam doesn't show any costovertebral angle tenderness, which usually indicates kidney involvement in urinary tract infections, and there's a gravid uterus present.

The question asks for the best treatment option among the choices given. The options are Amoxicillin, Ceftriaxone, Ciprofloxacin, Doxycycline, and Nitrofurantoin.

First, I need to figure out what condition this patient is likely dealing with. The key symptom here is dysu

#### Text Completion

In [105]:
input_data = {
    "model": "/opt/ml/model",
    "prompt": prompt2,
    "max_tokens": 2048,
    "temperature": 0.7,
}

result = invoke_realtime_endpoint(input_data)
output_text = result['choices'][0]['text']
print(output_text)

 If you have a fever and body aches, here's what you can do to help alleviate your symptoms and feel better:

1. **Rest**: Get plenty of rest to help your body fight off the infection or illness causing your symptoms.

2. **Hydrate**: Drink plenty of fluids like water, herbal teas, or clear broths to stay hydrated. Fever can cause dehydration, so it's important to replenish fluids.

3. **Cool Compresses**: Apply cool compresses to your forehead or body to help reduce fever and soothe aches.

4. **Over-the-Counter Medications**: Consider taking over-the-counter medications like acetaminophen (Tylenol) or ibuprofen (Advil, Motrin) to reduce fever and alleviate body aches. Follow the dosage instructions carefully.

5. **Light Clothing**: Wear light, loose-fitting clothing to help your body regulate temperature.

6. **Comfortable Environment**: Stay in a cool, well-ventilated room to help manage your fever.

7. **Monitor Symptoms**: Keep track of your temperature and any other symptoms you

### 3.2 Real-time inference response as a stream via Amazon SageMaker Endpoint

In [107]:
def invoke_streaming_endpoint(record):
    try:
        response = sm_runtime.invoke_endpoint_with_response_stream(
            EndpointName=model_name,
            Body=json.dumps(record),
            ContentType="application/json",
            Accept="text/event-stream"
        )

        # A PayloadPart may carry a partial SSE line or several lines at once,
        # so buffer the bytes and parse only complete, newline-terminated lines.
        buffer = b""
        for event in response["Body"]:
            if "PayloadPart" in event:
                buffer += event["PayloadPart"]["Bytes"]
                while b"\n" in buffer:
                    raw_line, buffer = buffer.split(b"\n", 1)
                    line = raw_line.decode("utf-8").strip()
                    if not line.startswith("data:"):
                        continue
                    try:
                        data = json.loads(line[5:].strip())
                    except json.JSONDecodeError:
                        # Skip non-JSON payloads such as a terminating "[DONE]" marker
                        continue
                    if "choices" in data and len(data["choices"]) > 0:
                        choice = data["choices"][0]
                        if "text" in choice:
                            yield choice["text"]
                        elif "delta" in choice and "content" in choice["delta"]:
                            yield choice["delta"]["content"]
            elif "ModelStreamError" in event:
                error = event["ModelStreamError"]
                yield f"\nStream error: {error['Message']} (Error code: {error['ErrorCode']})"
                break
            elif "InternalStreamFailure" in event:
                failure = event["InternalStreamFailure"]
                yield f"\nInternal stream failure: {failure['Message']}"
                break
    except Exception as e:
        yield f"\nAn error occurred during streaming: {str(e)}"
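
Payload parts in a response stream are not guaranteed to line up with server-sent event lines: a single JSON object can be split across parts, and one part can hold several events. The sketch below shows line buffering on simulated byte chunks, so it runs without an endpoint. `iter_sse_json` is a hypothetical helper, the chunks are made-up data, and the `[DONE]` end marker is an assumption based on OpenAI-style streams.

```python
import json


def iter_sse_json(byte_chunks):
    """Yield parsed JSON objects from SSE `data:` lines spread across byte chunks."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Only complete, newline-terminated lines are parsed; the rest stays buffered.
        while b"\n" in buffer:
            raw_line, buffer = buffer.split(b"\n", 1)
            line = raw_line.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue
            data = line[len("data:"):].strip()
            if data == "[DONE]":  # assumed end-of-stream marker
                return
            yield json.loads(data)


# Simulated stream: the first event is split across two chunks.
chunks = [
    b'data: {"choices": [{"text": "Hel',
    b'lo"}]}\n\ndata: {"choices": [{"text": " world"}]}\n\n',
    b"data: [DONE]\n\n",
]
streamed_text = "".join(obj["choices"][0]["text"] for obj in iter_sse_json(chunks))
print(streamed_text)  # Hello world
```

Buffering raw bytes and decoding one complete line at a time also avoids decoding errors when a multi-byte UTF-8 character is split between parts.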

#### Chat Completion

In [108]:
payload = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": "You are a helpful medical assistant."},
        {"role": "user", "content": prompt1}
    ],
    "max_tokens": 2048,
    "temperature": 0.7,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    print(chunk, end="", flush=True)

Okay, so I have this question about a pregnant woman who's 23 years old and 22 weeks pregnant. She's coming in with a burning sensation when she pees, which started a day ago and is getting worse. She's tried drinking more water and taking cranberry extract, but it's not helping. She otherwise feels okay and is under a doctor's care for her pregnancy. Her vitals are all normal, no fever, blood pressure is good, pulse and respirations are normal. The physical exam doesn't show any costovertebral angle tenderness, which usually points to kidney issues, so that's probably not the case here. She has a gravid uterus, which just means she's pregnant.

The question is asking what the best treatment is for her. The options are Ampicillin, Ceftriaxone, Ciprofloxacin, Doxycycline, or Nitrofurantoin. Alright, so let's break this down.

First, the burning upon urination is classic for a urinary tract infection (UTI). Since she's pregnant, I know that UTIs can be more common during pregnancy becaus

#### Text Completion

In [109]:
payload = {
    "model": "/opt/ml/model",
    "prompt": prompt2,
    "max_tokens": 2048,
    "temperature": 0.7,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    print(chunk, end="", flush=True)

 How long will it take to recover?

If you have a fever and body aches, it's to take care of yourself and monitor your symptoms. Here are some steps you can take:

1. **Rest**: Give your body time to recover by getting plenty of rest.
2. **Hydrate**: Drink fluids like water, herbal teas, or broths to stay hydrated.
3. **Medication**: Over-the-counter medications like acetaminophen (Tylenol) or ibuprofen (Advil) can help reduce fever and relieve body aches.
4. **Cool Compresses**: Applying a cool compress to your forehead may help lower your temperature.
5. **Monitor Symptoms**: Keep track of your and any other symptoms you experience. If your fever is very high (over 103°F/39.4°C), lasts more than three days, or is by severe symptoms like difficulty breathing, chest pain, or, seek medical attention immediately.

The duration of recovery depends on the underlying cause of your symptoms. If it's a common viral infection like the flu, you might start feeling better within a week, but it c

Now that you have successfully performed real-time inference, you no longer need the endpoint. Delete the endpoint and its configuration to avoid further charges.

In [None]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 4. Batch inference

In [111]:
import json
import os

# JSON file names
validation_json_file_name1 = "input1.json"


# JSON paths
validation_input_json_path = f"s3://{s3_bucket}/{model_name}/validation-input/"
validation_output_json_path = f"s3://{s3_bucket}/{model_name}/validation-output/"


def write_and_upload_to_s3(input_data, file_name):
    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input/{file_name}",
        Body=input_data.encode("utf-8"),
    )

In [112]:
input_json_data1 = json.dumps(
    {
        "model": "/opt/ml/model",
        "prompt": prompts,
        "max_tokens": 2048,
        "temperature": 0.7,
    }
)

write_and_upload_to_s3(input_json_data1, f"{validation_json_file_name1}")

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
    output_path=validation_output_json_path,
)
transformer.transform(validation_input_json_path, content_type="application/json")
transformer.wait()

In [None]:
from urllib.parse import urlparse

def retrieve_json_output_from_s3(validation_file_name):

    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

    data = json.loads(response["Body"].read().decode("utf-8"))
    display(data)
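
The retrieval function relies on the batch transform naming convention, where each result is stored under the input file name with `.out` appended. The sketch below isolates that mapping; `batch_output_location` is a hypothetical helper and `example-bucket` is a placeholder bucket name.

```python
from urllib.parse import urlparse


def batch_output_location(output_s3_uri: str, input_file_name: str):
    """Return (bucket, key) of the `.out` object written for one input file."""
    parsed = urlparse(output_s3_uri)
    prefix = parsed.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # tolerate output paths given without a trailing slash
    return parsed.netloc, f"{prefix}{input_file_name}.out"


# Placeholder bucket name for illustration only.
location = batch_output_location(
    "s3://example-bucket/JSL-Medical-Reasoning-LLM-32B/validation-output/",
    "input1.json",
)
print(location)
# ('example-bucket', 'JSL-Medical-Reasoning-LLM-32B/validation-output/input1.json.out')
```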

In [119]:
retrieve_json_output_from_s3(validation_json_file_name1)

{'id': 'cmpl-f32e9abbc45e42f7a4f64d641a4f7d28',
 'object': 'text_completion',
 'created': 1743489315,
 'model': '/opt/ml/model',
 'choices': [{'index': 0,
   'text': " Maintaining good kidney health is crucial for overall well-being. Here are some tips to help you take care of your kidneys:\n\n1. **Stay Hydrated**: Drink plenty of water throughout the day to help your kidneys function properly. However, avoid overhydration, as it can also strain the kidneys.\n\n2. **Control Blood Pressure**: High blood pressure is a leading cause of kidney damage. Regularly monitor your blood pressure and follow your doctor's advice for maintaining it within a healthy range.\n\n3. **Manage Diabetes**: If you have diabetes, carefully control your blood sugar levels to prevent damage to your kidneys.\n\n4. **Eat a Balanced Diet**: Focus on a diet rich in fruits, vegetables, whole grains, and lean proteins. Limit processed foods, excessive salt, and sugary drinks.\n\n5. **Exercise Regularly**: Engage in p

Congratulations! You have verified that the batch transform job works as expected. Since the model is no longer required, you can delete it. Note that this deletes the deployable model, not the model package.

In [None]:
model.delete_model()

### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

