## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page: [Medical LLM - Medium](https://aws.amazon.com/marketplace/pp/prodview-z4jqmczvwgtby).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Click **Continue to configuration** and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the following cell.

- **Model**: `JSL-Medical-LLM-Medium`  
- **Model Description**: Medical LLM optimized for summarization, answering complex clinical questions, and retrieval-augmented generation (RAG). Designed to process clinical notes, patient records, and biomedical literature, supporting real-time, high-quality responses for a broad range of healthcare and life sciences applications.

In [1]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
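
Because the ARN must match the region you deploy in, a quick sanity check can catch a copy-paste mistake early. The sketch below relies only on the general ARN layout (`arn:partition:service:region:account:resource`); the helper names and the example ARN are hypothetical and not part of the listing.

```python
def arn_region(arn: str) -> str:
    """Return the region field of an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return parts[3]


def check_arn_region(arn: str, session_region: str) -> None:
    """Raise if the model package ARN was copied for a different region."""
    region = arn_region(arn)
    if region != session_region:
        raise ValueError(
            f"ARN region {region!r} does not match session region {session_region!r}"
        )


# Hypothetical ARN, for illustration only; use the one shown on your listing.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-package"
check_arn_region(example_arn, "us-east-1")  # passes silently
```

In this notebook, the second argument would be `sagemaker_session.boto_region_name` once the session has been created.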

In [None]:
import os
import base64
import json
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image, display
from PIL import Image as ImageEdit
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

In [35]:
model_name = "JSL-Medical-LLM-Medium"

real_time_inference_instance_type = "ml.p4d.24xlarge"
batch_transform_inference_instance_type = "ml.g5.48xlarge"

## 2. Create a deployable model from the model package

In [36]:
model = ModelPackage(
    role=role, 
    model_package_arn=model_package_arn, 
    sagemaker_session=sagemaker_session, 
)

## Model Configuration Documentation  

### Default Configuration  
The container comes with the following default configurations:  

| Parameter                  | Default Value | Description                                                                   |  
|----------------------------|---------------|-------------------------------------------------------------------------------|  
| **`dtype`**                | `float16`     | Data type for model weights and activations                                   |  
| **`max_model_len`**        | `8192`        | Default maximum context length (`input + output ≤ max_model_len`). Automatically increases based on available GPU memory: <br>- `32768` if total GPU memory ≥ 240 GB <br>- `131072` if total GPU memory ≥ 480 GB |  
| **`tensor_parallel_size`** | Auto          | Automatically set to the number of available GPUs                            |  
| **`host`**                 | `0.0.0.0`     | Host name                                                                     |  
| **`port`**                 | `8080`        | Port number                                                                   |  

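The `max_model_len` rule in the table can be read as a simple threshold function. The sketch below only restates the documented thresholds; the function name is hypothetical, and how the container actually measures total GPU memory is not specified here, so the instance figures in the comments are nominal values (8 x 40 GB and 8 x 24 GB) used as an assumption.

```python
def default_max_model_len(total_gpu_memory_gb: float) -> int:
    """Restate the documented default context length for a given total GPU memory."""
    if total_gpu_memory_gb >= 480:
        return 131072
    if total_gpu_memory_gb >= 240:
        return 32768
    return 8192


# Nominal totals (assumption): ml.p4d.24xlarge ~ 320 GB, ml.g5.48xlarge ~ 192 GB.
for name, memory_gb in [("ml.p4d.24xlarge", 320), ("ml.g5.48xlarge", 192)]:
    print(name, default_max_model_len(memory_gb))
```

Under these nominal figures, the batch instance used later in this notebook would stay at the 8192 default unless `max_model_len` is overridden.
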
### Hardcoded Settings  
The following settings are hardcoded in the container and cannot be changed:  

| Parameter       | Value           | Description                           |  
|-----------------|-----------------|---------------------------------------|  
| **`model`**     | `/opt/ml/model` | Model path where SageMaker mounts the model |  

### Configurable Environment Variables  
You can customize the vLLM server by setting environment variables when creating the model.  

**Any parameter from the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve) can be set using the corresponding environment variable with the `SM_VLLM_` prefix.**  

The container uses a script similar to the [SageMaker entrypoint example](https://docs.vllm.ai/en/latest/examples/sagemaker_entrypoint.html) from the vLLM documentation to convert environment variables to command-line arguments.  
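
To make the naming convention concrete, the sketch below builds an environment dictionary from vLLM argument names and then mimics the kind of conversion the linked entrypoint example performs (strip the prefix, lowercase, replace `_` with `-`). Both helper names are hypothetical, and the container's real script may differ in detail.

```python
PREFIX = "SM_VLLM_"


def to_env(**vllm_args) -> dict[str, str]:
    """Turn vLLM argument names into SM_VLLM_-prefixed environment variables."""
    return {f"{PREFIX}{name.upper()}": str(value) for name, value in vllm_args.items()}


def env_to_cli_args(env: dict[str, str]) -> list[str]:
    """Approximate how prefixed variables become command-line arguments."""
    args = []
    for key, value in env.items():
        if not key.startswith(PREFIX):
            continue  # unrelated variables are ignored
        args.append("--" + key[len(PREFIX):].lower().replace("_", "-"))
        if value:
            args.append(value)
    return args


env = to_env(max_model_len=16384, gpu_memory_utilization=0.9)
print(env)
print(env_to_cli_args(env))

# Assumption: ModelPackage forwards an `env` keyword to the underlying Model.
# model = ModelPackage(role=role, model_package_arn=model_package_arn,
#                      sagemaker_session=sagemaker_session, env=env)
```

The values shown are illustrative; choose settings that fit the instance type you deploy on.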

---  

## Input Format  

### 1. Chat Completion  

#### Example Payload  

```json  
{  
    "model": "/opt/ml/model",  
    "messages": [  
        {"role": "system", "content": "You are a helpful medical assistant."},  
        {"role": "user", "content": "What should I do if I have a fever and body aches?"}  
    ],  
    "max_tokens": 1024,  
    "temperature": 0.7  
}  
```  

For additional parameters:  
- [ChatCompletionRequest](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/entrypoints/openai/protocol.py#L212)  
- [OpenAI's Chat API](https://platform.openai.com/docs/api-reference/chat/create)  

---  

### 2. Text Completion  

#### Single Prompt Example  

```json  
{  
    "model": "/opt/ml/model",  
    "prompt": "How can I maintain good kidney health?",  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

#### Multiple Prompts Example  

```json  
{  
    "model": "/opt/ml/model",  
    "prompt": [  
        "How can I maintain good kidney health?",  
        "What are the best practices for kidney care?"  
    ],  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

Reference:  
- [CompletionRequest](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/entrypoints/openai/protocol.py#L642)  
- [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions/create)  
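
When several prompts are sent in one request, the response is expected to carry one entry in `choices` per prompt. The sketch below pairs prompts with completions by the `index` field, assuming the OpenAI-style response schema referenced above and one completion per prompt; the helper name and the sample response are hypothetical.

```python
def pair_prompts_with_completions(prompts, result):
    """Match each prompt with its completion text via the choice `index`."""
    by_index = {choice["index"]: choice["text"] for choice in result["choices"]}
    return [(prompt, by_index.get(i, "")) for i, prompt in enumerate(prompts)]


# Hand-written sample response, for illustration only.
sample_prompts = ["First question?", "Second question?"]
sample_result = {
    "choices": [
        {"index": 1, "text": "Second answer."},
        {"index": 0, "text": "First answer."},
    ]
}
for prompt, completion in pair_prompts_with_completions(sample_prompts, sample_result):
    print(prompt, "->", completion)
```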

---  
### Important Notes:
- **Streaming Responses:** Add `"stream": true` to your request payload to enable streaming
- **Model Path Requirement:** Always set `"model": "/opt/ml/model"` (SageMaker's fixed model location)
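
The two notes above can be baked into small payload builders so that the model path and the `stream` flag are never forgotten. This is a minimal sketch; the helper names are hypothetical, and any extra keyword arguments are passed through unchanged as sampling parameters.

```python
MODEL_PATH = "/opt/ml/model"  # fixed model location required in every request


def build_chat_payload(user_prompt, system_prompt=None, stream=False, **sampling):
    """Build a chat completion request body."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    payload = {"model": MODEL_PATH, "messages": messages, **sampling}
    if stream:
        payload["stream"] = True
    return payload


def build_completion_payload(prompt, stream=False, **sampling):
    """Build a text completion request body; `prompt` may be a string or a list."""
    payload = {"model": MODEL_PATH, "prompt": prompt, **sampling}
    if stream:
        payload["stream"] = True
    return payload


print(build_chat_payload("What is hypertension?", stream=True, max_tokens=256))
```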


## 3. Create a SageMaker Endpoint

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [None]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type, 
    endpoint_name=model_name,
    model_data_download_timeout=3600
)
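
If you reconnect to an endpoint from a new session, or deploy without blocking, it can help to confirm the endpoint is `InService` before sending requests. The sketch below polls `describe_endpoint` on a client passed in by the caller; the function name, polling interval, and limits are assumptions chosen for illustration.

```python
import time


def wait_until_in_service(client, endpoint_name, poll_seconds=30, max_polls=120,
                          sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService; raise on failure or timeout."""
    for _ in range(max_polls):
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")


# Usage in this notebook (`sagemaker` is the Boto3 client from the setup cell):
# wait_until_in_service(sagemaker, model_name)
```

Boto3 also ships an `endpoint_in_service` waiter on the SageMaker client, which may be preferable when custom polling behavior is not needed.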

### 3.1 Real-time inference via Amazon SageMaker Endpoint

#### Initial setup

In [34]:
prompt1 = "How do emerging mRNA technologies compare to traditional vaccine approaches for disease prevention?"

prompt2 = "What screening tests are recommended for adults over 50 with family history of colorectal cancer?"

prompts = [
    "What are the early warning signs of stroke and what should I do if I suspect someone is having one?",
    "How do different classes of antidepressants work and what factors determine which medication might be prescribed?",
    "What is the relationship between inflammation, autoimmune conditions, and chronic disease progression?"
]

In [7]:
system_prompt = """You are a medical expert. Please structure your response with minimal headings with hierarchical numbering, end your response with a conclusion"""

In [8]:
def invoke_realtime_endpoint(record):
    """Send one JSON request to the endpoint and return the parsed JSON response."""
    response = sm_runtime.invoke_endpoint(
        EndpointName=model_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(record),
    )

    return json.load(response["Body"])
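
The parsed response can be read the same way for both request types. The sketch below returns the first choice's text and flags whether generation stopped because `max_tokens` was reached, assuming the OpenAI-style schema referenced earlier (`message.content` for chat, `text` for completions, `finish_reason` of `"length"` on truncation). The helper name and sample responses are hypothetical.

```python
def read_first_choice(result):
    """Return (text, truncated) for the first choice of a chat or completion response."""
    choice = result["choices"][0]
    if "message" in choice:
        text = choice["message"].get("content") or ""
    else:
        text = choice.get("text", "")
    truncated = choice.get("finish_reason") == "length"
    return text, truncated


# Hand-written sample responses, for illustration only.
chat_result = {"choices": [{"index": 0, "finish_reason": "stop",
                            "message": {"role": "assistant", "content": "Hello"}}]}
completion_result = {"choices": [{"index": 0, "finish_reason": "length", "text": "Hel"}]}
print(read_first_choice(chat_result))
print(read_first_choice(completion_result))
```

A `truncated` value of `True` suggests raising `max_tokens` or shortening the prompt.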

#### Chat Completion

In [9]:
input_data = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt1},
    ],
    "max_tokens": 2048,
    "temperature": 0.6,
    "top_p":0.9,
    "repetition_penalty": 1.1,
    "top_k":50
}

result = invoke_realtime_endpoint(input_data)
output_content = result['choices'][0]['message']['content']
print(output_content)

1. **Introduction to mRNA and Traditional Vaccines**
   - Emerging mRNA (messenger RNA) technologies represent a novel approach in vaccine development compared to traditional methods.
   - Traditional vaccines typically use weakened or killed pathogens, or parts of them such as proteins or sugars, to stimulate the immune system.

2. **Mechanism Comparison**
   2.1. **Traditional Vaccines**: These work by directly introducing antigens into the body, which then triggers an immune response. The immune system recognizes these foreign substances, processes them, and mounts a defense against future infections.
   2.2. **mRNA Vaccines**: Instead of using actual antigens, mRNA vaccines contain genetic instructions (mRNA) that encode for specific viral proteins. Once inside host cells, this mRNA instructs the cell machinery to produce these proteins. The immune system then responds to these proteins as it would to actual infection, building immunity without causing disease.

3. **Advantages of 

#### Text Completion

In [13]:
input_data = {
    "model": "/opt/ml/model",
    "prompt": prompt2,
    "max_tokens": 2048,
    "temperature": 0.6,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "top_k": 50
}

result = invoke_realtime_endpoint(input_data)
output_text = result['choices'][0]['text']
print(output_text)

 (Age 55+)
Adults aged 55 and older with a family history of colorectal cancer should undergo regular screening to detect the disease early, when it is most treatable. The following screening tests are commonly recommended:
1. **Colonoscopy**: This is considered the gold standard for colorectal cancer screening. It involves inserting a flexible tube with a camera into the rectum to visually examine the entire colon for polyps or cancerous growths. Colonoscopy can also remove precancerous polyps during the procedure.
   - Recommended frequency: Every 5-10 years if previous results were normal.

2. **Fecal Immunochemical Test (FIT)**: This test checks stool samples for tiny amounts of blood that could indicate bleeding from a tumor in the colon or rectum.
   - Recommended frequency: Annually.

3. **Stool DNA Test (Cologuard)**: This test looks for genetic mutations associated with colon cancer as well as hidden blood in stool samples.
   - Recommended frequency: Every 3 years.

4. **Flex

### 3.2 Real-time inference response as a stream via Amazon SageMaker Endpoint

In [15]:
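def invoke_streaming_endpoint(record):
    """Yield text fragments from a streamed response (`text` for completions, `delta.content` for chat)."""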
    try:
        response = sm_runtime.invoke_endpoint_with_response_stream(
            EndpointName=model_name,
            Body=json.dumps(record),
            ContentType="application/json",
            Accept="text/event-stream"
        )

        for event in response["Body"]:
            if "PayloadPart" in event:
                chunk = event["PayloadPart"]["Bytes"].decode("utf-8")
                if chunk.startswith("data:"):
                    try:
                        data = json.loads(chunk[5:].strip())
                        if "choices" in data and len(data["choices"]) > 0:
                            choice = data["choices"][0]
                            if "text" in choice:
                                yield choice["text"]
                            elif "delta" in choice and "content" in choice["delta"]:
                                yield choice["delta"]["content"]

                    except json.JSONDecodeError:
                        continue 
            elif "ModelStreamError" in event:
                error = event["ModelStreamError"]
                yield f"\nStream error: {error['Message']} (Error code: {error['ErrorCode']})"
                break
            elif "InternalStreamFailure" in event:
                failure = event["InternalStreamFailure"]
                yield f"\nInternal stream failure: {failure['Message']}"
                break
    except Exception as e:
        yield f"\nAn error occurred during streaming: {str(e)}"
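
The function above treats every `PayloadPart` as one complete `data:` line. Should a part ever carry several events, or only half of one, a line-buffered parser is more forgiving. The sketch below is one possible approach rather than the notebook's method; it assumes server-sent events with `data:` lines and an optional `[DONE]` sentinel, and the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(byte_chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON objects from `data:` lines, buffering across chunk boundaries."""
    buffer = b""  # bytes, so a multi-byte character split across chunks is not corrupted
    for part in byte_chunks:
        buffer += part
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if not line.startswith(b"data:"):
                continue  # blank separators and non-data fields
            data = line[len(b"data:"):].strip()
            if data == b"[DONE]":
                return
            try:
                yield json.loads(data)
            except json.JSONDecodeError:
                continue  # skip malformed events


# Simulated parts: the first event is split across two chunks.
parts = [
    b'data: {"choices": [{"te',
    b'xt": "Hi"}]}\n\ndata: {"choices": [{"text": " there"}]}\n\n',
    b"data: [DONE]\n\n",
]
print("".join(event["choices"][0]["text"] for event in iter_sse_json(parts)))
```

With a live endpoint, the input would be the `event["PayloadPart"]["Bytes"]` values taken from `response["Body"]`.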

#### Chat Completion

In [17]:
payload = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt1}
    ],
    "max_tokens": 2048,
    "temperature": 0.6,
    "top_p":0.9,
    "repetition_penalty": 1.1,
    "top_k":50,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    print(chunk, end="", flush=True)

### Comparison of mRNA Technologies and Traditional Vaccine Approaches

1. **Development Speed**
   - **mRNA Vaccines**: Offer faster development timelines compared to traditional vaccines. They can be designed, produced, and tested in a matter of months rather than years.
   - **Traditional Vaccines**: Typically require longer development periods due to the need for culturing pathogens or producing proteins.

2. **Manufacturing Flexibility and Scalability**
   - **mRNA Vaccines**: Can be manufactured using standardized processes regardless of the target antigen. This allows for rapid scaling up of production.
   - **Traditional Vaccines**: Often require specific manufacturing facilities tailored to each type of vaccine, limiting flexibility and scalability.

3. **Safety Profile**
   - **mRNA Vaccines**: Generally have shown favorable safety profiles in clinical trials, with side effects typically being mild and temporary (e.g., injection site pain, fatigue).
   - **Traditional Vaccine

#### Text Completion

In [19]:
payload = {
    "model": "/opt/ml/model",
    "prompt": prompt2,
    "max_tokens": 2048,
    "temperature": 0.6,
    "top_p":0.9,
    "repetition_penalty": 1.1,
    "top_k":50,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    print(chunk, end="", flush=True)

 Adults who have a first-degree relative (parent, sibling, or child) with a history of colorectal cancer should begin regular screening at age 40, or 10 years before the earliest age at which their relative was.
How often should someone with a family history of colon cancer be screened?
If you have a strong family history of colon cancer, your doctor may recommend that you start getting screened earlier than age 45 and/or more frequently. For example: If one first-degree relative (such as a parent, brother, sister, or child) has been diagnosed with colon cancer or adenomatous polyps, you may need to start screening at age 40 or 10 years before the age when your relative was, whichever comes first.
Should I get a colonoscopy if I have a family history of colon cancer?
Having a family history of colon cancer increases your risk of it yourself. Because of this, it’s to discuss your individual risk factors with your healthcare provider to determine the best schedule for you. A colonoscopy 

Now that you have successfully performed real-time inference, you no longer need the endpoint. Delete it to avoid further charges.

In [None]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 4. Batch inference

In [37]:
validation_json_file_name1 = "input1.json"

validation_input_json_path = f"s3://{s3_bucket}/{model_name}/validation-input/"
validation_output_json_path = f"s3://{s3_bucket}/{model_name}/validation-output/"


def write_and_upload_to_s3(input_data, file_name):
    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input/{file_name}",
        Body=input_data.encode("utf-8"),
    )

In [38]:
input_json_data1 = json.dumps(
    {
        "model": "/opt/ml/model",
        "prompt": prompts,
        "max_tokens": 2048,
        "temperature": 0.6,
        "top_p":0.9,
        "repetition_penalty": 1.1,
        "top_k":50
    }
)

write_and_upload_to_s3(input_json_data1, validation_json_file_name1)
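
For a longer prompt list, one option is to split the prompts across several input files so that each request stays small. The sketch below only builds the JSON bodies; the helper name, the `inputN.json` naming, and the batch size are assumptions, and it presumes the transform job is pointed at the whole input prefix, as in this notebook.

```python
import json


def build_batch_inputs(prompts, batch_size, **sampling):
    """Split prompts into groups and return {file_name: JSON request body}."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    files = {}
    for start in range(0, len(prompts), batch_size):
        file_name = f"input{start // batch_size + 1}.json"
        files[file_name] = json.dumps(
            {
                "model": "/opt/ml/model",
                "prompt": prompts[start:start + batch_size],
                **sampling,
            }
        )
    return files


demo_files = build_batch_inputs(["q1", "q2", "q3"], batch_size=2, max_tokens=512)
print(list(demo_files))

# Each body could then be uploaded with the helper defined above:
# for name, body in demo_files.items():
#     write_and_upload_to_s3(body, name)
```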

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
    output_path=validation_output_json_path,
)
transformer.transform(validation_input_json_path, content_type="application/json")
transformer.wait()

In [None]:
from urllib.parse import urlparse

def retrieve_json_output_from_s3(validation_file_name):

    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)
    result = json.loads(response["Body"].read().decode("utf-8"))
    
    for idx, choice in enumerate(result.get("choices", [])):
        print(f"Response {idx + 1}:\n{choice.get('text', '')}\n{'=' * 75}")
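
The key lookup above depends on Batch Transform writing each result next to the output prefix with `.out` appended to the input file name. The sketch below isolates that derivation so it can be checked without touching S3; the function name and example bucket are hypothetical.

```python
from urllib.parse import urlparse


def batch_output_location(output_path, input_file_name):
    """Return (bucket, key) of the Batch Transform result for one input file."""
    parsed = urlparse(output_path)
    prefix = parsed.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # tolerate output paths written without a trailing slash
    return parsed.netloc, f"{prefix}{input_file_name}.out"


print(batch_output_location(
    "s3://example-bucket/JSL-Medical-LLM-Medium/validation-output/", "input1.json"
))
```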

In [42]:
retrieve_json_output_from_s3(validation_json_file_name1)

Response 1:


1. **Face:** Ask the person to smile. Does one side of their face droop?
2. **Arm:** Ask the person to raise both arms. Does one arm drift downward?
3. **Speech:** Ask the person to repeat a simple sentence. Is their speech slurred or difficult to understand?
4. **Time:** Time is of the essence! If you observe any of these symptoms, call for emergency medical services immediately.

Additional symptoms may include:
- Sudden weakness or numbness in the face, arm, or leg
- Sudden confusion or trouble speaking or understanding speech
- Sudden trouble seeing in one or both eyes
- Sudden severe headache with no known cause
- Sudden trouble walking, dizziness, loss of balance, or lack of coordination

**What to Do If You Suspect Someone Is Having a Stroke:**

1. **Call Emergency Services Immediately:** In the U.S., dial 911; in other countries, use your local emergency number.
2. **Note the Time:** The time when symptoms first appeared is critical information for healthcare prov

Congratulations! You just verified that the batch transform job is working as expected. Since the model is no longer required, you can delete it. Note that this deletes the deployable model, not the model package.

In [None]:
model.delete_model()

### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

