## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page: [HGNC Clinical Terminology Mapper](https://aws.amazon.com/marketplace/pp/prodview-gzzgijf3waxd6).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose a **region**, a **Product Arn** is displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and assign it to `model_package_arn` in the cell below.

## Pipeline for HUGO Gene Nomenclature Committee (HGNC)

- **Model**: `hgnc_vdb_resolver`
- **Model Description**: This pretrained pipeline extracts `GENE` entities from clinical text and maps them to their corresponding HUGO Gene Nomenclature Committee (HGNC) codes.

In [1]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
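Because the ARN is region-specific, a quick format check before deployment can catch a placeholder that was never replaced or an ARN copied for the wrong region. The sketch below is a loose check only: `arn_region` and `ARN_PATTERN` are hypothetical helpers, not part of the SageMaker SDK, and the example ARN uses a made-up account ID and package name.

```python
import re

# Hypothetical helper (not part of the SageMaker SDK): a loose format check for
# arn:<partition>:sagemaker:<region>:<account-id>:model-package/<name>
ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):model-package/(?P<name>.+)$"
)


def arn_region(arn):
    """Return the region embedded in a model package ARN, or raise ValueError."""
    match = ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a model package ARN: {arn!r}")
    return match.group("region")


# Made-up ARN for illustration only; account ID and package name are placeholders.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-package"
print(arn_region(example_arn))
```

In the notebook, the returned region could be compared against the region of the SageMaker session before calling `ModelPackage`.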

In [None]:
import json
import os
import boto3
import pandas as pd
import sagemaker as sage
from sagemaker import ModelPackage
from sagemaker import get_execution_role
from IPython.display import display
from urllib.parse import urlparse


In [3]:
sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

# Set display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [4]:
model_name = "hgnc-vdb-resolver"

real_time_inference_instance_type = "ml.m4.xlarge"
batch_transform_inference_instance_type = "ml.m4.2xlarge"

## 2. Create a deployable model from the model package.

In [5]:
model = ModelPackage(
    role=role, 
    model_package_arn=model_package_arn,
    sagemaker_session=sagemaker_session,
)

### Input Format

To use the model, you need to provide input in one of the following supported formats:

#### JSON Format

Provide input as JSON. We support two variations within this format:

1. **Array of Text Documents**: 
   Use an array containing multiple text documents. Each element represents a separate text document.

   ```json
   {
       "text": [
           "Text document 1",
           "Text document 2",
           ...
       ]
   }
   ```

2. **Single Text Document**:
   Provide a single text document as a string.

   ```json
   {
       "text": "Single text document"
   }
   ```

#### JSON Lines (JSONL) Format

Provide input in JSON Lines format, where each line is a JSON object representing a text document.

```
{"text": "Text document 1"}
{"text": "Text document 2"}
```
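To make the relationship between the three input shapes concrete, the sketch below reads the documents back out of a serialized request body. `documents_in` is a hypothetical helper written for illustration; it is not how the model container parses input, only a way to see that each shape carries a list of text documents.

```python
import json


def documents_in(body, content_type):
    """Return the list of text documents carried by a serialized request body."""
    if content_type == "application/json":
        text = json.loads(body)["text"]
        # A single string and an array of strings are both accepted.
        return [text] if isinstance(text, str) else list(text)
    if content_type == "application/jsonlines":
        # One JSON object per non-empty line.
        return [json.loads(line)["text"] for line in body.splitlines() if line.strip()]
    raise ValueError(f"Unsupported content type: {content_type}")


array_body = json.dumps({"text": ["Text document 1", "Text document 2"]})
single_body = json.dumps({"text": "Single text document"})
jsonl_body = '{"text": "Text document 1"}\n{"text": "Text document 2"}'

print(documents_in(array_body, "application/json"))
print(documents_in(single_body, "application/json"))
print(documents_in(jsonl_body, "application/jsonlines"))
```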

## 3. Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

### A. Deploy the SageMaker model to an endpoint

In [6]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type, 
    endpoint_name=model_name,
)

Once the endpoint has been created, you can perform real-time inference.

In [7]:
def invoke_realtime_endpoint(record, content_type="application/json", accept="application/json"):
    response = sm_runtime.invoke_endpoint(
        EndpointName=model_name,
        ContentType=content_type,
        Accept=accept,
        Body=json.dumps(record) if content_type == "application/json" else record,
    )

    response_body = response["Body"].read().decode("utf-8")

    if accept == "application/json":
        return json.loads(response_body)
    elif accept == "application/jsonlines":
        return response_body
    else:
        raise ValueError(f"Unsupported accept type: {accept}")

### Initial Setup

In [8]:
docs = [
    "Recent studies have suggested a potential link between the double homeobox 4 like 20 (pseudogene), also known as DUX4L20, and FBXO48 and RNA guanine-7 methyltransferase ",
    "The EGFR gene encodes a protein that is involved in cell proliferation and survival. Mutations in this gene have been implicated in the development of several types of cancer."
]

sample_text = """During today's consultation, we reviewed the results of the comprehensive genetic analysis performed on the patient. This analysis uncovered complex interactions between several genes: DUX4, DUX4L20, FBXO48, MYOD1, and PAX7. These findings are significant as they provide new understanding of the molecular pathways that are involved in muscle differentiation and may play a role in the development and progression of muscular dystrophies in this patient."""

### JSON

In [9]:
input_json_data = {"text": sample_text}
response_json = invoke_realtime_endpoint(input_json_data, content_type="application/json", accept="application/json")
pd.DataFrame(response_json["predictions"][0])

| | begin | end | ner_chunk | ner_label | ner_confidence | concept_code | resolution | score | all_codes | concept_name_detailed | locus | all_resolutions | all_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 185 | 188 | DUX4 | GENE | 0.9697 | HGNC:50800 | dux4 | 1.0 | [HGNC:50800, HGNC:38686, HGNC:51778, HGNC:51771, HGNC:37265] | [DUX4 [double homeobox 4], DUX4L4 [double homeobox 4 like 4 (pseudogene)], DUX4L40 [double homeobox 4 like 40 (pseudogene)], DUX4L32 [double homeobox 4 like 32 (pseudogene)], DUX4L6 [double homeobox 4 like 6 (pseudogene)]] | [protein-coding gene \|\| gene with protein product, pseudogene \|\| pseudogene, pseudogene \|\| pseudogene, pseudogene \|\| pseudogene, pseudogene \|\| pseudogene] | [dux4, dux4l4, dux4l40, dux4l32, dux4l6] | [1.0, 0.9232940077781677, 0.8669935464859009, 0.8570185899734497, 0.8551883697509766] |
| 1 | 191 | 197 | DUX4L20 | GENE | 0.9412 | HGNC:50801 | dux4l20 | 1.0 | [HGNC:50801, HGNC:51788, HGNC:51778, HGNC:50802, HGNC:31354] | [DUX4L20 [double homeobox 4 like 20 (pseudogene)], DUX4L50 [double homeobox 4 like 50 (pseudogene)], DUX4L40 [double homeobox 4 like 40 (pseudogene)], DUX4L21 [double homeobox 4 like 21 (pseudogene)], DUX4L10 [double homeobox 4 like 10 (pseudogene)]] | [pseudogene \|\| pseudogene, pseudogene \|\| pseudogene, pseudogene \|\| pseudogene, pseudogene \|\| pseudogene, pseudogene \|\| pseudogene] | [dux4l20, dux4l50, dux4l40, dux4l21, dux4l10] | [0.9999997615814209, 0.895313024520874, 0.8925592303276062, 0.8840012550354004, 0.8810112476348877] |
| 2 | 200 | 205 | FBXO48 | GENE | 0.9676 | HGNC:33857 | fbxo48 | 1.0 | [HGNC:33857, HGNC:29816, HGNC:24847, HGNC:27020, HGNC:31969] | [FBXO48 [F-box protein 48], FBXO40 [F-box protein 40], FBXO44 [F-box protein 44], FBXO36 [F-box protein 36], FBXO47 [F-box protein 47]] | [protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product] | [fbxo48, fbxo40, fbxo44, fbxo36, fbxo47] | [1.0000004768371582, 0.8699185848236084, 0.8525800108909607, 0.8402078151702881, 0.8378820419311523] |
| 3 | 208 | 212 | MYOD1 | GENE | 0.9847 | HGNC:7611 | myod1 | 1.0 | [HGNC:7611, HGNC:7598, HGNC:7595, HGNC:7599, HGNC:13752] | [MYOD1 [myogenic differentiation 1], MYO1D [myosin ID], MYO1A [myosin IA], MYO1E [myosin IE], MYOZ1 [myozenin 1]] | [protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product] | [myod1, myo1d, myo1a, myo1e, myoz1] | [0.9999996423721313, 0.8814316987991333, 0.8536235094070435, 0.8487422466278076, 0.8310863971710205] |
| 4 | 219 | 222 | PAX7 | GENE | 0.8536 | HGNC:8621 | pax7 | 1.0 | [HGNC:8621, HGNC:8860, HGNC:8622, HGNC:8616, HGNC:8619] | [PAX7 [paired box 7], PEX7 [peroxisomal biogenesis factor 7], PAX8 [paired box 8], PAX2 [paired box 2], PAX5 [paired box 5]] | [protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product, protein-coding gene \|\| gene with protein product] | [pax7, pex7, pax8, pax2, pax5] | [0.9999997019767761, 0.8129062652587891, 0.8030733466148376, 0.8027161359786987, 0.8005862832069397] |
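Each prediction carries its ranked candidates in parallel lists (`all_codes`, `all_resolutions`, `all_score`). The sketch below pairs them up and keeps only candidates above a threshold. It assumes the lists align index by index, which the output above suggests but the source does not state; `expand_candidates` and `min_score` are hypothetical names, and the sample dictionary is an abbreviated copy of the PAX7 row.

```python
def expand_candidates(prediction, min_score=0.0):
    """Pair each candidate code with its resolution and score, best first."""
    rows = zip(
        prediction["all_codes"],
        prediction["all_resolutions"],
        prediction["all_score"],
    )
    return [
        {
            "ner_chunk": prediction["ner_chunk"],
            "code": code,
            "resolution": resolution,
            "score": score,
        }
        for code, resolution, score in rows
        if score >= min_score
    ]


# Abbreviated copy of the PAX7 row (first three candidates only).
pax7 = {
    "ner_chunk": "PAX7",
    "all_codes": ["HGNC:8621", "HGNC:8860", "HGNC:8622"],
    "all_resolutions": ["pax7", "pex7", "pax8"],
    "all_score": [0.9999997019767761, 0.8129062652587891, 0.8030733466148376],
}

for row in expand_candidates(pax7, min_score=0.81):
    print(row["code"], row["resolution"], row["score"])
```

A list built this way can be passed to `pd.DataFrame` to get one row per candidate instead of one row per entity.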


### JSON Lines

In [11]:
def create_jsonl(records):
    if isinstance(records, str):
        records = [records]
    json_records = [{"text": text} for text in records]
    json_lines = "\n".join(json.dumps(record) for record in json_records)
    return json_lines

In [12]:
input_jsonl_data = create_jsonl(sample_text)
data = invoke_realtime_endpoint(input_jsonl_data, content_type="application/jsonlines", accept="application/jsonlines")
print(data)

{"predictions": [{"begin": 185, "end": 188, "ner_chunk": "DUX4", "ner_label": "GENE", "ner_confidence": "0.9697", "concept_code": "HGNC:50800", "resolution": "dux4", "score": 1.0, "all_codes": ["HGNC:50800", "HGNC:38686", "HGNC:51778", "HGNC:51771", "HGNC:37265"], "concept_name_detailed": ["DUX4 [double homeobox 4]", "DUX4L4 [double homeobox 4 like 4 (pseudogene)]", "DUX4L40 [double homeobox 4 like 40 (pseudogene)]", "DUX4L32 [double homeobox 4 like 32 (pseudogene)]", "DUX4L6 [double homeobox 4 like 6 (pseudogene)]"], "locus": ["protein-coding gene || gene with protein product", "pseudogene || pseudogene", "pseudogene || pseudogene", "pseudogene || pseudogene", "pseudogene || pseudogene"], "all_resolutions": ["dux4", "dux4l4", "dux4l40", "dux4l32", "dux4l6"], "all_score": [1.0, 0.9232940077781677, 0.8669935464859009, 0.8570185899734497, 0.8551883697509766]}, {"begin": 191, "end": 197, "ner_chunk": "DUX4L20", "ner_label": "GENE", "ner_confidence": "0.9412", "concept_code": "HGNC:50801",
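With `accept="application/jsonlines"` the response comes back as raw text rather than a parsed object. A minimal sketch for turning it into Python structures follows, assuming each non-empty line is a JSON object with a `predictions` key, as in the output above. `parse_jsonl_predictions` is a hypothetical helper, and the sample body is synthetic, with most fields omitted.

```python
import json


def parse_jsonl_predictions(body):
    """Parse a JSON Lines response into one list of predictions per line."""
    return [
        json.loads(line)["predictions"]
        for line in body.splitlines()
        if line.strip()
    ]


# Synthetic two-line body; real responses carry many more fields per prediction.
sample_body = (
    '{"predictions": [{"ner_chunk": "DUX4", "concept_code": "HGNC:50800"}]}\n'
    '{"predictions": []}\n'
)

per_document = parse_jsonl_predictions(sample_body)
print(len(per_document), per_document[0][0]["concept_code"])
```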

### B. Delete the endpoint

Now that you have successfully performed real-time inference, you no longer need the endpoint. Delete it to avoid being charged.

In [14]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 4. Batch inference

In [15]:
validation_json_file_name = "input.json"
validation_jsonl_file_name = "input.jsonl"

validation_input_json_path = f"s3://{s3_bucket}/{model_name}/validation-input/json/"
validation_output_json_path = f"s3://{s3_bucket}/{model_name}/validation-output/json/"

validation_input_jsonl_path = f"s3://{s3_bucket}/{model_name}/validation-input/jsonl/"
validation_output_jsonl_path = f"s3://{s3_bucket}/{model_name}/validation-output/jsonl/"

def upload_to_s3(input_data, file_name):
    file_format = os.path.splitext(file_name)[1].lower()
    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input/{file_format[1:]}/{file_name}",
        Body=input_data.encode("UTF-8"),
    )

In [16]:
# Create JSON and JSON Lines data
input_jsonl_data = create_jsonl(docs)
input_json_data = json.dumps({"text": docs})

# Upload JSON and JSON Lines data to S3
upload_to_s3(input_json_data, validation_json_file_name)
upload_to_s3(input_jsonl_data, validation_jsonl_file_name)
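The upload helper derives the S3 key from the file extension, so each format lands under its own input prefix. The sketch below reproduces that derivation as a pure function to show where a file ends up. `validation_input_key` is a hypothetical name, and `my-model` is a placeholder for `model_name`.

```python
import os


def validation_input_key(model_name, file_name):
    """Build the S3 key used for a validation input file, grouped by extension."""
    extension = os.path.splitext(file_name)[1].lower().lstrip(".")
    return f"{model_name}/validation-input/{extension}/{file_name}"


# Placeholder model name for illustration.
print(validation_input_key("my-model", "input.json"))
print(validation_input_key("my-model", "input.jsonl"))
```

These keys sit under the `validation-input/json/` and `validation-input/jsonl/` prefixes that are later passed to the batch transform jobs.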

### JSON

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
    output_path=validation_output_json_path
)

transformer.transform(validation_input_json_path, content_type="application/json")
transformer.wait()

In [None]:
def retrieve_json_output_from_s3(validation_file_name):
    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

    data = json.loads(response["Body"].read().decode("utf-8"))
    display(data)
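Batch transform writes each result next to the output path, named after the input file with an `.out` suffix, which is why the retrieval function appends `.out` to the key. The sketch below shows the same derivation in isolation using `urlparse`. `batch_output_location` is a hypothetical helper, and the bucket and prefix are placeholders.

```python
from urllib.parse import urlparse


def batch_output_location(output_path, input_file_name):
    """Split an s3:// output path into (bucket, key) for a given input file."""
    parsed = urlparse(output_path)
    prefix = parsed.path.lstrip("/")
    # Tolerate an output path given without a trailing slash.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, f"{prefix}{input_file_name}.out"


# Placeholder bucket and prefix for illustration.
bucket, key = batch_output_location(
    "s3://example-bucket/my-model/validation-output/json/", "input.json"
)
print(bucket, key)
```

Taking the bucket from the parsed URL rather than a global variable keeps the helper usable when the output path points at a different bucket.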

In [20]:
retrieve_json_output_from_s3(validation_json_file_name)

{'predictions': [[{'begin': 113,
    'end': 119,
    'ner_chunk': 'DUX4L20',
    'ner_label': 'GENE',
    'ner_confidence': '0.9654',
    'concept_code': 'HGNC:50801',
    'resolution': 'dux4l20',
    'score': 0.9999998211860657,
    'all_codes': ['HGNC:50801',
     'HGNC:51788',
     'HGNC:51778',
     'HGNC:50802',
     'HGNC:31354'],
    'concept_name_detailed': ['DUX4L20 [double homeobox 4 like 20 (pseudogene)]',
     'DUX4L50 [double homeobox 4 like 50 (pseudogene)]',
     'DUX4L40 [double homeobox 4 like 40 (pseudogene)]',
     'DUX4L21 [double homeobox 4 like 21 (pseudogene)]',
     'DUX4L10 [double homeobox 4 like 10 (pseudogene)]'],
    'locus': ['pseudogene || pseudogene',
     'pseudogene || pseudogene',
     'pseudogene || pseudogene',
     'pseudogene || pseudogene',
     'pseudogene || pseudogene'],
    'all_resolutions': ['dux4l20', 'dux4l50', 'dux4l40', 'dux4l21', 'dux4l10'],
    'all_score': [0.9999998211860657,
     0.8953130841255188,
     0.8925591707229614,
     0.
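The batch JSON output nests predictions as a list of lists. The sketch below flattens that structure into one record per entity, under the assumption (inferred from the shape of the output, not stated in the source) that each inner list corresponds to one input document. `flatten_batch_predictions` and `doc_index` are hypothetical names, and the sample result keeps only a few fields; the empty second list is purely illustrative.

```python
def flatten_batch_predictions(result):
    """Flatten nested predictions into one record per entity, tagged by document."""
    rows = []
    for doc_index, doc_predictions in enumerate(result["predictions"]):
        for prediction in doc_predictions:
            rows.append({"doc_index": doc_index, **prediction})
    return rows


# Reduced sample: offsets and chunks for the first document, empty second list.
sample_result = {
    "predictions": [
        [
            {"begin": 113, "end": 119, "ner_chunk": "DUX4L20"},
            {"begin": 126, "end": 131, "ner_chunk": "FBXO48"},
        ],
        [],
    ]
}

for row in flatten_batch_predictions(sample_result):
    print(row["doc_index"], row["ner_chunk"])
```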

### JSON Lines

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/jsonlines",
    output_path=validation_output_jsonl_path
)
transformer.transform(validation_input_jsonl_path, content_type="application/jsonlines")
transformer.wait()

In [None]:
def retrieve_jsonlines_output_from_s3(validation_file_name):

    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

    data = response["Body"].read().decode("utf-8")
    print(data)

In [23]:
retrieve_jsonlines_output_from_s3(validation_jsonl_file_name)

{"predictions": [{"begin": 113, "end": 119, "ner_chunk": "DUX4L20", "ner_label": "GENE", "ner_confidence": "0.9654", "concept_code": "HGNC:50801", "resolution": "dux4l20", "score": 0.9999998211860657, "all_codes": ["HGNC:50801", "HGNC:51788", "HGNC:51778", "HGNC:50802", "HGNC:31354"], "concept_name_detailed": ["DUX4L20 [double homeobox 4 like 20 (pseudogene)]", "DUX4L50 [double homeobox 4 like 50 (pseudogene)]", "DUX4L40 [double homeobox 4 like 40 (pseudogene)]", "DUX4L21 [double homeobox 4 like 21 (pseudogene)]", "DUX4L10 [double homeobox 4 like 10 (pseudogene)]"], "locus": ["pseudogene || pseudogene", "pseudogene || pseudogene", "pseudogene || pseudogene", "pseudogene || pseudogene", "pseudogene || pseudogene"], "all_resolutions": ["dux4l20", "dux4l50", "dux4l40", "dux4l21", "dux4l10"], "all_score": [0.9999998211860657, 0.8953130841255188, 0.8925591707229614, 0.8840013742446899, 0.8810112476348877]}, {"begin": 126, "end": 131, "ner_chunk": "FBXO48", "ner_label": "GENE", "ner_confiden
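The `begin` and `end` values in the predictions are character offsets into the input text, and the values shown suggest that `end` is inclusive (`DUX4L20` has seven characters and spans 113 to 119). The sketch below checks this against the first document in `docs`; `chunk_at` is a hypothetical helper.

```python
def chunk_at(text, begin, end):
    """Slice a chunk out of the source text, treating `end` as inclusive."""
    return text[begin:end + 1]


# First entry of `docs`, split across two literals for readability.
doc = (
    "Recent studies have suggested a potential link between the double homeobox 4 "
    "like 20 (pseudogene), also known as DUX4L20, and FBXO48 and RNA guanine-7 methyltransferase "
)

print(chunk_at(doc, 113, 119))
print(chunk_at(doc, 126, 131))
```

A check like this is a cheap way to confirm that offsets still line up after any preprocessing applied to the text before it is sent to the model.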

In [None]:
model.delete_model()

### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

