## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page [Medical Reasoning LLM - 14B]()
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **Accept Offer** if you and your organization agree with them.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the following cell.

- **Model**: `JSL-Medical-Reasoning-LLM-14B`
- **Model Description**: Medical model for summarization, question answering (open-book and closed-book), and general chat.

In [1]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
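
Because the ARN is region-specific, it can help to confirm that the value you pasted targets the region you intend to deploy in. The following is a minimal sketch, assuming the usual `arn:<partition>:sagemaker:<region>:<account-id>:model-package/<name>` layout. The helper `arn_region` and the example ARN are hypothetical and only illustrate the check.

```python
import re

# Assumed general shape of a SageMaker model package ARN:
# arn:<partition>:sagemaker:<region>:<account-id>:model-package/<name>
ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):model-package/(?P<name>.+)$"
)


def arn_region(arn: str) -> str:
    """Return the region embedded in a model package ARN, or raise ValueError."""
    match = ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a model package ARN: {arn!r}")
    return match.group("region")


# Hypothetical ARN for illustration only; it does not refer to a real listing.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-package"
print(arn_region(example_arn))
```

Once the SageMaker session exists, comparing `arn_region(model_package_arn)` with the session's `region` would catch a mismatch before deployment.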

In [None]:
import os
import base64
import json
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image, display
from PIL import Image as ImageEdit
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

In [3]:
model_name = "JSL-Medical-Reasoning-LLM-14B"

real_time_inference_instance_type = "ml.g4dn.12xlarge"
batch_transform_inference_instance_type = "ml.g4dn.12xlarge"

## 2. Create a deployable model from the model package.

In [4]:
model = ModelPackage(
    role=role, 
    model_package_arn=model_package_arn, 
    sagemaker_session=sagemaker_session, 
)

## Model Configuration Documentation  

### Default Configuration  
The container comes with the following default configurations:  

| Parameter                  | Default Value | Description                                                                   |  
|----------------------------|---------------|-------------------------------------------------------------------------------|  
| **`dtype`**                | `float16`     | Data type for model weights and activations                                   |  
| **`max_model_len`**        | `32,768`      | Maximum length for input and output combined (`input + output ≤ max_model_len`) |  
| **`tensor_parallel_size`** | Auto          | Automatically set to the number of available GPUs                            |  
| **`host`**                 | `0.0.0.0`     | Host name                                                                     |  
| **`port`**                 | `8080`        | Port number                                                                   |  
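
The `max_model_len` constraint is simple arithmetic: whatever the prompt consumes is no longer available for the output. A small sketch of that budget follows. The function name is hypothetical, and counting prompt tokens depends on the model's tokenizer, which is not shown here.

```python
DEFAULT_MAX_MODEL_LEN = 32768  # default value from the table above


def max_output_tokens(prompt_tokens: int, max_model_len: int = DEFAULT_MAX_MODEL_LEN) -> int:
    """Largest output length that keeps input + output <= max_model_len."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # If the prompt alone exceeds the limit, no output budget remains.
    return max(max_model_len - prompt_tokens, 0)


print(max_output_tokens(30000))  # 2768 tokens left for the output
```

A request whose `max_tokens` exceeds this budget would not fit within the default context length.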

### Hardcoded Settings  
The following settings are hardcoded in the container and cannot be changed:  

| Parameter       | Value           | Description                           |  
|-----------------|-----------------|---------------------------------------|  
| **`model`**     | `/opt/ml/model` | Model path where SageMaker mounts the model |  

### Configurable Environment Variables  
You can customize the vLLM server by setting environment variables when creating the model.  

**Any parameter from the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#vllm-serve) can be set using the corresponding environment variable with the `SM_VLLM_` prefix.**  

The container uses a script similar to the [SageMaker entrypoint example](https://docs.vllm.ai/en/latest/examples/sagemaker_entrypoint.html) from the vLLM documentation to convert environment variables to command-line arguments.  
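
To make the naming rule concrete, here is a Python sketch of the kind of conversion such an entrypoint performs: strip the `SM_VLLM_` prefix, lowercase the rest, and replace underscores with dashes. This is an illustration under those assumptions, not the container's actual script, and `env_to_vllm_args` is a hypothetical name.

```python
def env_to_vllm_args(env: dict, prefix: str = "SM_VLLM_") -> list:
    """Translate prefixed environment variables into command-line style arguments."""
    args = []
    for key in sorted(env):
        if not key.startswith(prefix):
            continue  # unrelated variables are ignored
        # SM_VLLM_MAX_MODEL_LEN -> --max-model-len
        name = key[len(prefix):].lower().replace("_", "-")
        args.append(f"--{name}")
        value = env[key]
        if value != "":
            args.append(str(value))  # an empty value is treated as a bare flag
    return args


example_env = {
    "SM_VLLM_MAX_MODEL_LEN": "16384",
    "SM_VLLM_DTYPE": "float16",
    "PATH": "/usr/bin",
}
print(env_to_vllm_args(example_env))
# ['--dtype', 'float16', '--max-model-len', '16384']
```

With the SageMaker Python SDK, such variables would typically be supplied as a dictionary through the model's `env` argument. Which values suit a given instance type is something to verify for your own deployment.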

---  

## Input Format  

### 1. Chat Completion  

#### Example Payload  
```json  
{  
    "model": "/opt/ml/model",  
    "messages": [  
        {"role": "system", "content": "You are a helpful medical assistant."},  
        {"role": "user", "content": "What should I do if I have a fever and body aches?"}  
    ],  
    "max_tokens": 1024,  
    "temperature": 0.7  
}  
```  

For additional parameters:  
- [ChatCompletionRequest](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/entrypoints/openai/protocol.py#L212)  
- [OpenAI's Chat API](https://platform.openai.com/docs/api-reference/chat/create)  

---  

### 2. Text Completion  

#### Single Prompt Example  
```json  
{  
    "model": "/opt/ml/model",  
    "prompt": "How can I maintain good kidney health?",  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

#### Multiple Prompts Example  
```json  
{  
    "model": "/opt/ml/model",  
    "prompt": [  
        "How can I maintain good kidney health?",  
        "What are the best practices for kidney care?"  
    ],  
    "max_tokens": 512,  
    "temperature": 0.6  
}  
```  

Reference:  
- [CompletionRequest](https://github.com/vllm-project/vllm/blob/v0.7.3/vllm/entrypoints/openai/protocol.py#L642)  
- [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions/create)  

---  

### Important Notes:
- **Streaming Responses:** Add `"stream": true` to your request payload to enable streaming
- **Model Path Requirement:** Always set `"model": "/opt/ml/model"` (SageMaker's fixed model location)
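
The two notes above can be folded into a small payload builder so that the model path is never forgotten. This is a sketch only. `build_payload` is a hypothetical helper, not part of any SDK, and the sampling values are taken from the examples in this section.

```python
MODEL_PATH = "/opt/ml/model"  # fixed model location required by the container


def build_payload(*, messages=None, prompt=None, stream=False, **sampling):
    """Build a chat (messages) or text (prompt) completion payload."""
    if (messages is None) == (prompt is None):
        raise ValueError("Provide exactly one of 'messages' or 'prompt'.")
    # The model path is set last so that it cannot be overridden by accident.
    payload = {**sampling, "model": MODEL_PATH}
    if messages is not None:
        payload["messages"] = messages
    else:
        payload["prompt"] = prompt
    if stream:
        payload["stream"] = True
    return payload


chat = build_payload(
    messages=[{"role": "user", "content": "What should I do if I have a fever and body aches?"}],
    max_tokens=1024,
    temperature=0.7,
)
text = build_payload(prompt="How can I maintain good kidney health?", max_tokens=512, stream=True)
print(chat["model"], text["stream"])
```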

## 3. Create a SageMaker Endpoint

If you want to understand how real-time inference with Amazon SageMaker works, see the [Amazon SageMaker hosting documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [None]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type, 
    endpoint_name=model_name,
    model_data_download_timeout=3600
)

### 3.1 Real-time inference via Amazon SageMaker Endpoint

#### Initial setup

In [6]:
prompt1 = "How do emerging mRNA technologies compare to traditional vaccine approaches for disease prevention?"

prompt2 = "What screening tests are recommended for adults over 50 with family history of colorectal cancer?"

prompts = [
    "What are the early warning signs of stroke and what should I do if I suspect someone is having one?",
    "How do different classes of antidepressants work and what factors determine which medication might be prescribed?",
    "What is the relationship between inflammation, autoimmune conditions, and chronic disease progression?"
]

In [7]:
system_prompt = """You are a medical expert that reviews the problem, does reasoning, and then gives a final answer.
Strictly follow this exact format for giving your output:

<think>
reasoning steps
</think>

**Final Answer**: [Conclusive Answer]"""
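
Since the system prompt asks for a `<think>` block followed by a `**Final Answer**` line, the two parts can be separated after generation. The following is a sketch under the assumption that the model follows the requested format. `split_reasoning` is a hypothetical helper and the sample text is made up for illustration.

```python
import re


def split_reasoning(text: str):
    """Return (reasoning, final_answer); either is None when that part is missing."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.search(r"\*\*Final Answer\*\*:\s*(.*)", text, flags=re.DOTALL)
    reasoning = think.group(1).strip() if think else None
    final_answer = answer.group(1).strip() if answer else None
    return reasoning, final_answer


# Made-up sample in the requested format, not real model output.
sample = (
    "<think>\nFever plus aches suggests a viral illness.\n</think>\n\n"
    "**Final Answer**: Rest and hydrate."
)
print(split_reasoning(sample))
```

If the response is cut off by `max_tokens` before `</think>` is emitted, both parts come back as `None`, which makes truncated generations easy to detect.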

In [8]:
def invoke_realtime_endpoint(record):

    response = sm_runtime.invoke_endpoint(
        EndpointName=model_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(record),
    )

    return json.load(response["Body"])
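
The parsed response has a slightly different shape for the two request types: chat completions carry the text under `message.content`, while text completions carry it under `text`. A small sketch that handles both follows. `extract_texts` is a hypothetical helper and the sample dictionaries are synthetic.

```python
def extract_texts(result: dict) -> list:
    """Collect the generated text of every choice in a completion response."""
    texts = []
    for choice in result.get("choices", []):
        if "message" in choice:
            # Chat completion: text lives under message.content
            texts.append(choice["message"].get("content") or "")
        else:
            # Text completion: text lives directly under "text"
            texts.append(choice.get("text", ""))
    return texts


# Synthetic responses for illustration only.
chat_result = {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]}
text_result = {"choices": [{"text": "A"}, {"text": "B"}]}
print(extract_texts(chat_result), extract_texts(text_result))
```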

#### Chat Completion

In [9]:
input_data = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt1},
    ],
    "max_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.95,
}

result = invoke_realtime_endpoint(input_data)
output_content = result['choices'][0]['message']['content']
print(output_content)

<think>
Emerging mRNA technologies represent a significant advancement in vaccine development compared to traditional approaches, offering several advantages and potential challenges. Here's a comparative analysis:

1. **Mechanism of Action**:
   - **Traditional Vaccines**: These typically use inactivated or attenuated pathogens, proteins, or toxins to stimulate an immune response. They introduce a weakened or dead form of the pathogen or specific components of the pathogen into the body.
   - **mRNA Vaccines**: mRNA vaccines deliver genetic instructions (mRNA) that code for a specific antigen (usually a protein from the pathogen) directly into cells. Cells then produce the antigen, which triggers an immune response without introducing any live or infectious material.

2. **Development Speed and Flexibility**:
   - **Traditional Vaccines**: Development can be lengthy, often taking several years, due to the need for extensive testing of pathogens, toxins, or proteins.
   - **mRNA Vaccin

#### Text Completion

In [11]:
input_data = {
    "model": "/opt/ml/model",
    "prompt": f"{system_prompt}\n\nUser: {prompt2}\n\nAssistant:",
    "max_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.95,
}

result = invoke_realtime_endpoint(input_data)
output_text = result['choices'][0]['text']
print(output_text)

 <think> The recommended screening tests for adults over 50 with a family history of colorectal cancer include: 1. **Colonoscopy**: This is the most thorough test, allowing direct visualization of the colon and rectum for any abnormal growths or polyps. It's typically performed every 5 years if the first test is normal, but may need to be done more frequently based on family history. 2. **Fecal Occult Blood Test (FOBT)** or **Fecal Immunochemical Test (FIT)**: These tests check for hidden blood in the stool, which can be an early sign of cancer. They are often used in combination with a colonoscopy or as a follow-up test. 3. **Computed Tomography Colonography (CTC)** or **Virtual Colonoscopy**: This involves taking detailed X-ray images of the colon and rectum. It's another option for detecting polyps or cancer, especially if a traditional colonoscopy is not possible. 4. **Flexible Sigmoidoscopy**: This test examines the lower part of the colon and rectum. It's usually recommended ever

### 3.2 Real-time inference response as a stream via Amazon SageMaker Endpoint

In [13]:
def invoke_streaming_endpoint(record):
    """Yield generated text pieces from a streaming endpoint response."""
    try:
        response = sm_runtime.invoke_endpoint_with_response_stream(
            EndpointName=model_name,
            Body=json.dumps(record),
            ContentType="application/json",
            Accept="text/event-stream"
        )

        # A PayloadPart may carry several "data:" lines or only part of one,
        # so buffer the bytes and parse complete lines only. Parsing each part
        # on its own can silently drop tokens.
        buffer = b""
        for event in response["Body"]:
            if "PayloadPart" in event:
                buffer += event["PayloadPart"]["Bytes"]
                *lines, buffer = buffer.split(b"\n")
                for raw_line in lines:
                    line = raw_line.decode("utf-8").strip()
                    if not line.startswith("data:"):
                        continue
                    try:
                        data = json.loads(line[5:].strip())
                    except json.JSONDecodeError:
                        continue  # non-JSON data line, such as an end-of-stream marker
                    if not isinstance(data, dict) or not data.get("choices"):
                        continue
                    choice = data["choices"][0]
                    if "text" in choice:
                        yield choice["text"]
                    elif choice.get("delta", {}).get("content"):
                        yield choice["delta"]["content"]
            elif "ModelStreamError" in event:
                error = event["ModelStreamError"]
                yield f"\nStream error: {error['Message']} (Error code: {error['ErrorCode']})"
                break
            elif "InternalStreamFailure" in event:
                failure = event["InternalStreamFailure"]
                yield f"\nInternal stream failure: {failure['Message']}"
                break
    except Exception as e:
        yield f"\nAn error occurred during streaming: {str(e)}"
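
The parsing step can be exercised without an endpoint. The sketch below feeds synthetic byte chunks to a standalone parser and shows why buffering matters: payload parts are not guaranteed to align with `data:` line boundaries. `iter_sse_text` is a hypothetical helper, and the `[DONE]` end marker is an assumption based on the OpenAI-style streaming convention.

```python
import json


def iter_sse_text(byte_chunks):
    """Yield text pieces from server-sent-event bytes that may be split arbitrarily."""
    buffer = b""
    for part in byte_chunks:
        buffer += part
        # Keep the trailing incomplete line in the buffer for the next part.
        *lines, buffer = buffer.split(b"\n")
        for raw in lines:
            line = raw.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue
            body = line[5:].strip()
            if body == "[DONE]":  # assumed end-of-stream marker
                return
            try:
                data = json.loads(body)
            except json.JSONDecodeError:
                continue
            if not isinstance(data, dict):
                continue
            for choice in data.get("choices", [])[:1]:
                piece = choice.get("text") or choice.get("delta", {}).get("content")
                if piece:
                    yield piece


# Synthetic chunks: the second event is split across two parts.
parts = [
    b'data: {"choices": [{"text": "Hello"}]}\n\ndata: {"choices": [{"te',
    b'xt": " world"}]}\n\ndata: [DONE]\n\n',
]
print("".join(iter_sse_text(parts)))  # Hello world
```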

#### Chat Completion

In [14]:
payload = {
    "model": "/opt/ml/model",
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt1}
    ],
    "max_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.95,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    print(chunk, end="", flush=True)

<think>
Emerging mRNA technologies a significant advancement in vaccine development, offering several advantages over traditional vaccine approaches for disease prevention. Here's a detailed comparison:

1. **Mechanism of Action**:
   - **Traditional Vaccines**: These typically use inactivated or attenuated pathogens, viral vectors, or subunit components (like proteins) to stimulate an immune response. They induce both antibody and cellular immunity.
   - **mRNA Vaccines**: These use genetic material (mRNA) that instructs cells to produce a specific viral protein, which then triggers an immune response. They primarily induce antibody responses but can also generate some cellular immunity with additional components.

2. **Speed and Flexibility**:
   - **Traditional Vaccines**: Developing and scaling up traditional vaccines can take months to years, as they require the growth of pathogens or the production of viral vectors and proteins.
   - **mRNA Vaccines**: mRNA vaccines can be design

#### Text Completion

In [15]:
payload = {
    "model": "/opt/ml/model",
    "prompt": f"{system_prompt}\n\nUser: {prompt2}\n\nAssistant:",
    "max_tokens": 2048,
    "temperature": 0.8,
    "top_p": 0.95,
    "stream": True
}

for chunk in invoke_streaming_endpoint(payload):
    print(chunk, end="", flush=True)

 <think>
In adults over the age of 50 who have a family history of colorectal cancer, the primary goal of screening is to detect early signs of the disease, including polyps or cancer itself. These screenings are crucial because early detection can significantly improve treatment outcomes and survival rates.

1. **Colonoscopy**: This is the most comprehensive and widely recommended screening method for with a family history of colorectal cancer. During a colonoscopy, a long, flexible tube with a camera is inserted into the colon to visualize the entire large intestine and rectum. This allows the doctor to identify and remove any polyps or areas, potentially preventing cancer or detecting it in its earliest stages.

2. **Flexible Sigmoidoscopy**: Although less comprehensive than a colonoscopy, a flexible sigmoidoscopy can still be effective for screening the lower part of the colon and rectum. It involves using a shorter scope to examine the left side of the colon and rectum. However, b

Now that you have successfully performed a real-time inference, you do not need the endpoint any more. You can terminate the endpoint to avoid being charged.

In [None]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 4. Batch inference

In [17]:
validation_json_file_name1 = "input1.json"

validation_input_json_path = f"s3://{s3_bucket}/{model_name}/validation-input/"
validation_output_json_path = f"s3://{s3_bucket}/{model_name}/validation-output/"


def write_and_upload_to_s3(input_data, file_name):
    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input/{file_name}",
        Body=input_data.encode("utf-8"),
    )

In [18]:
input_json_data1 = json.dumps(
    {
        "model": "/opt/ml/model",
        "prompt": [f"{system_prompt}\n\nUser: {prompt}\n\nAssistant:" for prompt in prompts],
        "max_tokens": 2048,
        "temperature": 0.8,
        "top_p": 0.95,
    }
)

write_and_upload_to_s3(input_json_data1, validation_json_file_name1)

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
    output_path=validation_output_json_path,
)
transformer.transform(validation_input_json_path, content_type="application/json")
transformer.wait()

In [None]:
from urllib.parse import urlparse

def retrieve_json_output_from_s3(validation_file_name):

    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)
    result = json.loads(response["Body"].read().decode("utf-8"))
    
    for idx, choice in enumerate(result.get("choices", [])):
        print(f"Response {idx + 1}:\n{choice.get('text', '')}\n{'=' * 75}")
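
When several prompts are sent in one request, it is useful to match each response back to its prompt. The sketch below assumes that each choice carries an `index` field identifying its prompt, as in OpenAI-style completion responses, and falls back to list order otherwise. `pair_prompts_with_choices` is a hypothetical helper and the sample data is synthetic.

```python
def pair_prompts_with_choices(prompts, result):
    """Return a list of (prompt, generated_text) tuples in prompt order."""
    by_index = {
        choice.get("index", position): choice.get("text", "")
        for position, choice in enumerate(result.get("choices", []))
    }
    # Prompts without a matching choice get an empty string.
    return [(prompt, by_index.get(i, "")) for i, prompt in enumerate(prompts)]


# Synthetic data for illustration only; choices arrive out of order.
sample_prompts = ["first question", "second question"]
sample_result = {
    "choices": [
        {"index": 1, "text": "answer two"},
        {"index": 0, "text": "answer one"},
    ]
}
for prompt, text in pair_prompts_with_choices(sample_prompts, sample_result):
    print(f"{prompt} -> {text}")
```

An empty string in the second position of a pair makes a missing or empty response easy to spot.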

In [22]:
retrieve_json_output_from_s3(validation_json_file_name1)

Response 1:
Response 2:
 <think>
Different classes of antidepressants work through various mechanisms to alleviate symptoms of depression. Here's a breakdown of the main classes and how they function:

1. **Selective Serotonin Reuptake Inhibitors (SSRIs):**
   - **Mechanism:** SSRIs primarily increase the availability of serotonin in the brain by blocking its reabsorption (reuptake) into neurons. This leads to higher levels of serotonin in the synaptic cleft, enhancing communication between brain cells.
   - **Examples:** Fluoxetine (Prozac), Sertraline (Zoloft), Escitalopram (Lexapro).

2. **Serotonin-Norepinephrine Reuptake Inhibitors (SNRIs):**
   - **Mechanism:** SNRIs inhibit the reuptake of both serotonin and norepinephrine, increasing their levels in the brain. This dual action can be particularly beneficial for treating depression and anxiety disorders.
   - **Examples:** Venlafaxine (Effexor), Duloxetine (Cymbalta), Desvenlafaxine (Pristiq).

3. **Tricyclic Antidepressants (TC

Congratulations! You just verified that the batch transform job is working as expected. Since the model is no longer required, you can delete it. Note that this deletes the deployable model, not the model package.

In [None]:
model.delete_model()

### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.
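
The same check can be scripted instead of done in the console. The sketch below assumes that a model created from a model package records the package ARN under `ModelPackageName` in its container definition, and that you pass in a Boto3 SageMaker client. `models_using_package` is a hypothetical helper.

```python
def models_using_package(sm_client, package_arn):
    """Return the names of deployable models whose container references package_arn."""
    names = []
    paginator = sm_client.get_paginator("list_models")
    for page in paginator.paginate():
        for summary in page.get("Models", []):
            desc = sm_client.describe_model(ModelName=summary["ModelName"])
            # A model may define a single primary container or a list of containers.
            containers = list(desc.get("Containers", []))
            if "PrimaryContainer" in desc:
                containers.append(desc["PrimaryContainer"])
            if any(c.get("ModelPackageName") == package_arn for c in containers):
                names.append(summary["ModelName"])
    return names
```

For example, `models_using_package(boto3.client("sagemaker"), model_package_arn)` would return an empty list once every model created from the package has been deleted.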

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing that you want to cancel the subscription for, and then choose __Cancel Subscription__  to cancel the subscription.

