## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page <font color='red'> For Seller to update:[Title_of_your_product](Provide link to your marketplace listing of your product).</font>
1. On the AWS Marketplace listing, click on the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose a **region**, a **Product Arn** is displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the `model_package_arn` cell below.

## Pipeline for HUGO Gene Nomenclature Committee (HGNC)

- **Model**: `hgnc_resolver_pipeline`
- **Model Description**: This pipeline extracts `GENE` entities and maps them to their corresponding HUGO Gene Nomenclature Committee (HGNC) codes using `sbiobert_base_cased_mli` sentence embeddings.

In [None]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
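Because the ARN is region-specific, a quick check can catch an ARN copied from the wrong region before any deployment is attempted. The following is a minimal sketch, not part of the SageMaker SDK: the helper name `arn_region` is hypothetical, the example ARN is made up, and the only thing relied on is the general ARN layout `arn:partition:service:region:account-id:resource`.

```python
def arn_region(arn: str) -> str:
    """Return the region field of an ARN (hypothetical helper, not an SDK call)."""
    # An ARN has six colon-separated fields; the resource part may itself contain colons.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a well-formed ARN: {arn!r}")
    return parts[3]


# Made-up ARN for illustration only; use the one shown on your listing page.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-package"
print(arn_region(example_arn))
```

In this notebook, the same check could be applied as `arn_region(model_package_arn) == region` once the session cell that defines `region` has run. The unfilled placeholder string would raise `ValueError`, which is the intended signal that the ARN still needs to be set.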

In [2]:
import base64
import json
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image, display
from PIL import Image as ImageEdit
import numpy as np

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

## 2. Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [4]:
model_name = "hgnc-resolver-pipeline"

real_time_inference_instance_type = "ml.m4.xlarge"
batch_transform_inference_instance_type = "ml.m4.xlarge"


### A. Create an endpoint

In [5]:
# create a deployable model from the model package.
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

# Deploy the model
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

----------!

Once the endpoint has been created, you can perform real-time inference.

In [6]:
import json
import pandas as pd
import os
import boto3

# Set display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

def process_data_and_invoke_realtime_endpoint(data, content_type, accept):
    """Save the input locally and to S3, invoke the endpoint, then save and return the response."""

    content_type_to_format = {'application/json': 'json', 'application/jsonlines': 'jsonl'}
    # Validate both MIME types before doing any file or S3 work
    if content_type not in content_type_to_format or accept not in content_type_to_format:
        raise ValueError("Invalid content_type or accept. It should be either 'application/json' or 'application/jsonlines'.")
    input_format = content_type_to_format[content_type]

    i = 1
    input_dir = f'inputs/real-time/{input_format}'
    output_dir = f'outputs/real-time/{input_format}'
    s3_input_dir = f"{model_name}/validation-input/real-time/{input_format}"
    s3_output_dir = f"{model_name}/validation-output/real-time/{input_format}"

    input_file_name = f'{input_dir}/input{i}.{input_format}'
    output_file_name = f'{output_dir}/{os.path.basename(input_file_name)}.out'

    while os.path.exists(input_file_name) or os.path.exists(output_file_name):
        i += 1
        input_file_name = f'{input_dir}/input{i}.{input_format}'
        output_file_name = f'{output_dir}/{os.path.basename(input_file_name)}.out'

    os.makedirs(os.path.dirname(input_file_name), exist_ok=True)
    os.makedirs(os.path.dirname(output_file_name), exist_ok=True)

    input_data = json.dumps(data) if content_type == 'application/json' else data

    # Write input data to file
    with open(input_file_name, 'w') as f:
        f.write(input_data)

    # Upload input data to S3
    s3_client.put_object(Bucket=s3_bucket, Key=f"{s3_input_dir}/{os.path.basename(input_file_name)}", Body=bytes(input_data.encode('UTF-8')))

    # Invoke the SageMaker endpoint
    response = sm_runtime.invoke_endpoint(
        EndpointName=model_name,
        ContentType=content_type,
        Accept=accept,
        Body=input_data,
    )

    # Read response data
    response_data = json.loads(response["Body"].read().decode("utf-8")) if accept == 'application/json' else response['Body'].read().decode('utf-8')

    # Save response data to file
    with open(output_file_name, 'w') as f_out:
        if accept == 'application/json':
            json.dump(response_data, f_out, indent=4)
        else:
            for item in response_data.split('\n'):
                f_out.write(item + '\n')

    # Upload response data to S3
    output_s3_key = f"{s3_output_dir}/{os.path.basename(output_file_name)}"
    if accept == 'application/json':
        s3_client.put_object(Bucket=s3_bucket, Key=output_s3_key, Body=json.dumps(response_data).encode('UTF-8'))
    else:
        s3_client.put_object(Bucket=s3_bucket, Key=output_s3_key, Body=response_data)

    return response_data

### Initial Setup

In [7]:
docs = [
    "Recent studies have suggested a potential link between the double homeobox 4 like 20 (pseudogene), also known as DUX4L20, and FBXO48 and RNA guanine-7 methyltransferase ",
    "The EGFR gene encodes a protein that is involved in cell proliferation and survival. Mutations in this gene have been implicated in the development of several types of cancer."

]


sample_text = """During today's consultation, we reviewed the results of the comprehensive genetic analysis performed on the patient. This analysis uncovered complex interactions between several genes: DUX4, DUX4L20, FBXO48, MYOD1, and PAX7. These findings are significant as they provide new understanding of the molecular pathways that are involved in muscle differentiation and may play a role in the development and progression of muscular dystrophies in this patient."""

### JSON

#### Example 1

  **Input format**:
  
  
```json
{
    "text": "Single text document"
}
```

In [8]:
input_json_data = {"text": sample_text}

data = process_data_and_invoke_realtime_endpoint(input_json_data, content_type="application/json", accept="application/json")
pd.DataFrame(data["predictions"])

Unnamed: 0,0,1,2,3,4
0,"{'ner_chunk': 'DUX4', 'begin': 185, 'end': 188, 'ner_label': 'GENE', 'ner_confidence': '0.9697', 'code': 'HGNC:50800', 'resolution': 'DUX4 [double homeobox 4]', 'all_k_codes': 'HGNC:50800:::HGNC:3070:::HGNC:32183:::HGNC:38686:::HGNC:39517:::HGNC:37267:::HGNC:3082:::HGNC:51787:::HGNC:21517:::HGNC:2917:::HGNC:38670:::HGNC:25475:::HGNC:2910:::HGNC:18700:::HGNC:33855:::HGNC:38689:::HGNC:11175:::HGNC:15966:::HGNC:7727:::HGNC:20229:::HGNC:13906:::HGNC:48628:::HGNC:15518:::HGNC:11200:::HGNC:20161', 'all_k_resolutions': 'DUX4 [double homeobox 4]:::DUSP4 [dual specificity phosphatase 4]:::DUXAP4 [double homeobox A pseudogene 4]:::DUX4L4 [double homeobox 4 like 4 (pseudogene)]:::DUTP4 [deoxyuridine triphosphatase pseudogene 4]:::DUX4L2 [double homeobox 4 like 2 (pseudogene)]:::DUX4L1 [double homeobox 4 like 1 (pseudogene)]:::DUX4L49 [double homeobox 4 like 49 (pseudogene)]:::DUS4L [dihydrouridine synthase 4 like]:::DLX4 [distal-less homeobox 4]:::DUX4L8 [double homeobox 4 like 8 (pseudogene)]:::BEX4 [brain expressed X-linked 4]:::DLL4 [delta like canonical Notch ligand 4]:::DDX4 [DEAD-box helicase 4]:::DUX4L9 [double homeobox 4 like 9 (pseudogene)]:::DUX4L5 [double homeobox 4 like 5 (pseudogene)]:::SNX4 [sorting nexin 4]:::DAZ4 [deleted in azoospermia 4]:::NEDD4 [NEDD4 E3 ubiquitin protein ligase]:::DCAF4 [DDB1 and CUL4 associated factor 4]:::MXD4 [MAX dimerization protein 4]:::TEX49 [testis expressed 49]:::DCTN4 [dynactin subunit 4]:::SOX4 [SRY-box transcription factor 4]:::TOX4 [TOX high mobility group box family member 4]', 'all_k_distances': '0.0000:::3.6040:::3.6951:::3.8239:::4.1167:::4.2880:::4.5084:::4.8762:::4.9478:::5.0787:::5.1106:::5.1314:::5.2626:::5.3066:::5.3660:::5.4132:::5.4142:::5.6109:::5.6147:::5.6667:::5.6798:::5.7166:::5.7258:::5.7436:::5.8439'}","{'ner_chunk': 'DUX4L20', 'begin': 191, 'end': 197, 'ner_label': 'GENE', 'ner_confidence': '0.9412', 'code': 'HGNC:50801', 'resolution': 'DUX4L20 [double homeobox 4 like 20 (pseudogene)]', 'all_k_codes': 
'HGNC:50801:::HGNC:39776:::HGNC:31982:::HGNC:26230:::HGNC:2743:::HGNC:42011:::HGNC:42254:::HGNC:42423:::HGNC:42207:::HGNC:50522:::HGNC:34070:::HGNC:24679:::HGNC:18357:::HGNC:25794:::HGNC:14478:::HGNC:42012:::HGNC:4475:::HGNC:19734:::HGNC:36437:::HGNC:42013:::HGNC:11598:::HGNC:30390:::HGNC:53837:::HGNC:37772:::HGNC:51516', 'all_k_resolutions': 'DUX4L20 [double homeobox 4 like 20 (pseudogene)]:::ZDHHC20P4 [zinc finger DHHC-type containing 20 pseudogene 4]:::ANKRD20A4P [ankyrin repeat domain 20 family member A4, pseudogene]:::TM4SF20 [transmembrane 4 L six family member 20]:::DDX20 [DEAD-box helicase 20]:::FAM204BP [family with sequence similarity 204 member B, pseudogene]:::MTND4LP20 [MT-ND4L pseudogene 20]:::ZBTB20-AS4 [ZBTB20 antisense RNA 4]:::MTND4P20 [MT-ND4 pseudogene 20]:::TOMM20P4 [TOMM20 pseudogene 4]:::RNY4P20 [RNY4 pseudogene 20]:::FBXL20 [F-box and leucine rich repeat protein 20]:::ARHGAP20 [Rho GTPase activating protein 20]:::FAM204A [family with sequence similarity 204 member A]:::MRPL20 [mitochondrial ribosomal protein L20]:::FAM204CP [family with sequence similarity 204 member C, pseudogene]:::GPR20 [G protein-coupled receptor 20]:::RANBP20P [RAN binding protein 20 pseudogene]:::RPS4XP20 [ribosomal protein S4X pseudogene 20]:::FAM204DP [family with sequence similarity 204 member D, pseudogene]:::TBX20 [T-box transcription factor 20]:::SNX20 [sorting nexin 20]:::PRR20G [proline rich 20G]:::GAPDHP20 [glyceraldehyde 3 phosphate dehydrogenase pseudogene 20]:::SPDYE20P [speedy/RINGO cell cycle regulator family member E20, pseudogene]', 'all_k_distances': '0.0000:::6.4373:::6.4389:::6.6875:::6.8085:::6.8100:::6.8730:::6.8812:::6.9250:::6.9846:::7.0950:::7.1020:::7.2438:::7.2766:::7.3176:::7.3252:::7.3437:::7.4235:::7.4729:::7.4762:::7.4965:::7.5218:::7.5315:::7.5501:::7.5699'}","{'ner_chunk': 'FBXO48', 'begin': 200, 'end': 205, 'ner_label': 'GENE', 'ner_confidence': '0.9676', 'code': 'HGNC:33857', 'resolution': 'FBXO48 [F-box protein 48]', 'all_k_codes': 
'HGNC:33857:::HGNC:4930:::HGNC:16653:::HGNC:13114:::HGNC:22564:::HGNC:18533:::HGNC:37552:::HGNC:24635:::HGNC:23535:::HGNC:23385:::HGNC:23942:::HGNC:20807:::HGNC:23440:::HGNC:21368:::HGNC:23384:::HGNC:52393:::HGNC:23305:::HGNC:21785:::HGNC:23488:::HGNC:31272:::HGNC:1683:::HGNC:12079:::HGNC:37805:::HGNC:55157:::HGNC:31969', 'all_k_resolutions': 'FBXO48 [F-box protein 48]:::ZBTB48 [zinc finger and BTB domain containing 48]:::MRPL48 [mitochondrial ribosomal protein L48]:::ZNF48 [zinc finger protein 48]:::SPATA48 [spermatogenesis associated 48]:::USP48 [ubiquitin specific peptidase 48]:::PIRC48 [piwi-interacting RNA cluster 48]:::PRSS48 [serine protease 48]:::ZNF488 [zinc finger protein 488]:::ZNF484 [zinc finger protein 484]:::CYCSP48 [CYCS pseudogene 48]:::ZNF486 [zinc finger protein 486]:::ZNF485 [zinc finger protein 485]:::SNRNP48 [small nuclear ribonucleoprotein U11/U12 subunit 48]:::ZNF483 [zinc finger protein 483]:::TEX48 [testis expressed 48]:::ZNF480 [zinc finger protein 480]:::RBM48 [RNA binding motif protein 48]:::ZNF487 [zinc finger protein 487]:::OR4C48P [olfactory receptor family 4 subfamily C member 48 pseudogene]:::CD48 [CD48 molecule]:::TRAJ48 [T cell receptor alpha joining 48]:::GAPDHP48 [glyceraldehyde 3 phosphate dehydrogenase pseudogene 48]:::HMGN2P48 [high mobility group nucleosomal binding domain 2 pseudogene 48]:::FBXO47 [F-box protein 47]', 'all_k_distances': '0.0000:::5.3026:::5.3531:::5.4464:::5.8642:::5.8911:::5.9817:::6.0045:::6.1032:::6.1347:::6.1727:::6.2446:::6.3179:::6.3452:::6.3667:::6.3867:::6.3949:::6.4803:::6.5349:::6.5411:::6.5938:::6.6030:::6.6433:::6.7313:::6.8481'}","{'ner_chunk': 'MYOD1', 'begin': 208, 'end': 212, 'ner_label': 'GENE', 'ner_confidence': '0.9847', 'code': 'HGNC:7611', 'resolution': 'MYOD1 [myogenic differentiation 1]', 'all_k_codes': 
'HGNC:7611:::HGNC:13879:::HGNC:7613:::HGNC:7582:::HGNC:7598:::HGNC:40750:::HGNC:40384:::HGNC:13880:::HGNC:6970:::HGNC:17590:::HGNC:23172:::HGNC:54810:::HGNC:7600:::HGNC:29401:::HGNC:18302:::HGNC:7599:::HGNC:7228:::HGNC:7622:::HGNC:28781:::HGNC:7623:::HGNC:29636:::HGNC:29917:::HGNC:7567:::HGNC:33741:::HGNC:4979', 'all_k_resolutions': 'MYOD1 [myogenic differentiation 1]:::MYO1H [myosin IH]:::MYOM1 [myomesin 1]:::MYL1 [myosin light chain 1]:::MYO1D [myosin ID]:::MYOCD-AS1 [MYOCD antisense RNA 1]:::MYADM-AS1 [MYADM antisense RNA 1]:::MYO1G [myosin IG]:::MDH1 [malate dehydrogenase 1]:::MYG1 [MYG1 exonuclease]:::MYCT1 [MYC target 1]:::MYG1-AS1 [MYG1 antisense RNA 1]:::MYO1F [myosin IF]:::MYSM1 [Myb like, SWIRM and MPN domains 1]:::MDN1 [midasin AAA ATPase 1]:::MYO1E [myosin IE]:::MRC1 [mannose receptor C-type 1]:::MYT1 [myelin transcription factor 1]:::MDP1 [magnesium dependent phosphatase 1]:::MYT1L [myelin transcription factor 1 like]:::MNS1 [meiosis specific nuclear structural 1]:::MDM1 [Mdm1 nuclear protein]:::MYH1 [myosin heavy chain 1]:::MSANTD1 [Myb/SANT DNA binding domain containing 1]:::MNX1 [motor neuron and pancreas homeobox 1]', 'all_k_distances': '0.0000:::6.0302:::6.2004:::6.2268:::6.4159:::6.6053:::6.6239:::6.7211:::6.7314:::6.7537:::6.7611:::6.8063:::6.8215:::6.8536:::6.9048:::6.9172:::6.9348:::6.9395:::6.9683:::7.0125:::7.0214:::7.0451:::7.0701:::7.0703:::7.0788'}","{'ner_chunk': 'PAX7', 'begin': 219, 'end': 222, 'ner_label': 'GENE', 'ner_confidence': '0.8536', 'code': 'HGNC:8621', 'resolution': 'PAX7 [paired box 7]', 'all_k_codes': 'HGNC:8621:::HGNC:8748:::HGNC:9351:::HGNC:8792:::HGNC:25557:::HGNC:8860:::HGNC:22958:::HGNC:12630:::HGNC:8791:::HGNC:13638:::HGNC:28130:::HGNC:21957:::HGNC:8767:::HGNC:38100:::HGNC:48824:::HGNC:28174:::HGNC:8659:::HGNC:26257:::HGNC:2292:::HGNC:28439:::HGNC:2291:::HGNC:9618:::HGNC:18196:::HGNC:3073:::HGNC:14971', 'all_k_resolutions': 'PAX7 [paired box 7]:::PCSK7 [proprotein convertase subtilisin/kexin type 7]:::PRDM7 [PR/SET 
domain 7]:::PDE7B [phosphodiesterase 7B]:::PRMT7 [protein arginine methyltransferase 7]:::PEX7 [peroxisomal biogenesis factor 7]:::PDLIM7 [PDZ and LIM domain 7]:::USP7 [ubiquitin specific peptidase 7]:::PDE7A [phosphodiesterase 7A]:::VENTXP7 [VENT homeobox pseudogene 7]:::PRR7 [proline rich 7, synaptic]:::KCTD7 [potassium channel tetramerization domain containing 7]:::PDCD7 [programmed cell death 7]:::PAICSP7 [phosphoribosylaminoimidazole carboxylase, phosphoribosylaminoimidazole succinocarboxamide synthetase pseudogene 7]:::PCAT7 [prostate cancer associated transcript 7]:::PLPP7 [phospholipid phosphatase 7 (inactive)]:::PCDH7 [protocadherin 7]:::PDZD7 [PDZ domain containing 7]:::COX7C [cytochrome c oxidase subunit 7C]:::CHMP7 [charged multivesicular body protein 7]:::COX7B [cytochrome c oxidase subunit 7B]:::PTK7 [protein tyrosine kinase 7 (inactive)]:::SOX7 [SRY-box transcription factor 7]:::DUSP7 [dual specificity phosphatase 7]:::SNX7 [sorting nexin 7]', 'all_k_distances': '0.0000:::7.7463:::7.7935:::7.7945:::7.8946:::7.9341:::7.9389:::7.9698:::8.0057:::8.0223:::8.0759:::8.0858:::8.1156:::8.1445:::8.1713:::8.1811:::8.1841:::8.2019:::8.2248:::8.2780:::8.3136:::8.3262:::8.3448:::8.3579:::8.4082'}"
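Each prediction packs its ranked candidates into three `:::`-delimited strings (`all_k_codes`, `all_k_resolutions`, `all_k_distances`). The sketch below shows one way to unpack them into a list of candidates. The helper name `split_candidates` is hypothetical, and reading `all_k_distances` as "smaller means closer" is inferred from the output above (the top candidate has distance `0.0000`), not taken from the model documentation.

```python
def split_candidates(prediction: dict) -> list:
    """Zip the ':::'-delimited all_k_* fields into one dict per candidate."""
    codes = prediction["all_k_codes"].split(":::")
    resolutions = prediction["all_k_resolutions"].split(":::")
    distances = [float(d) for d in prediction["all_k_distances"].split(":::")]
    return [
        {"code": code, "resolution": resolution, "distance": distance}
        for code, resolution, distance in zip(codes, resolutions, distances)
    ]


# Shortened copy of the DUX4 prediction above (first three candidates only).
prediction = {
    "ner_chunk": "DUX4",
    "all_k_codes": "HGNC:50800:::HGNC:3070:::HGNC:32183",
    "all_k_resolutions": (
        "DUX4 [double homeobox 4]:::"
        "DUSP4 [dual specificity phosphatase 4]:::"
        "DUXAP4 [double homeobox A pseudogene 4]"
    ),
    "all_k_distances": "0.0000:::3.6040:::3.6951",
}
candidates = split_candidates(prediction)
for candidate in candidates:
    print(candidate)
```

Splitting on the full `:::` token matters here, because each HGNC code itself contains a single colon.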


#### Example 2

  **Input format**:
  
  
```json
{
    "text": [
        "Text document 1",
        "Text document 2",
        ...
    ]
}
```

In [9]:
input_json_data = {"text": docs}

data = process_data_and_invoke_realtime_endpoint(input_json_data, content_type="application/json", accept="application/json")
pd.DataFrame(data["predictions"])

Unnamed: 0,0,1
0,"{'ner_chunk': 'DUX4L20', 'begin': 113, 'end': 119, 'ner_label': 'GENE', 'ner_confidence': '0.9654', 'code': 'HGNC:50801', 'resolution': 'DUX4L20 [double homeobox 4 like 20 (pseudogene)]', 'all_k_codes': 'HGNC:50801:::HGNC:39776:::HGNC:31982:::HGNC:26230:::HGNC:2743:::HGNC:42011:::HGNC:42254:::HGNC:42423:::HGNC:42207:::HGNC:50522:::HGNC:34070:::HGNC:24679:::HGNC:18357:::HGNC:25794:::HGNC:14478:::HGNC:42012:::HGNC:4475:::HGNC:19734:::HGNC:36437:::HGNC:42013:::HGNC:11598:::HGNC:30390:::HGNC:53837:::HGNC:37772:::HGNC:51516', 'all_k_resolutions': 'DUX4L20 [double homeobox 4 like 20 (pseudogene)]:::ZDHHC20P4 [zinc finger DHHC-type containing 20 pseudogene 4]:::ANKRD20A4P [ankyrin repeat domain 20 family member A4, pseudogene]:::TM4SF20 [transmembrane 4 L six family member 20]:::DDX20 [DEAD-box helicase 20]:::FAM204BP [family with sequence similarity 204 member B, pseudogene]:::MTND4LP20 [MT-ND4L pseudogene 20]:::ZBTB20-AS4 [ZBTB20 antisense RNA 4]:::MTND4P20 [MT-ND4 pseudogene 20]:::TOMM20P4 [TOMM20 pseudogene 4]:::RNY4P20 [RNY4 pseudogene 20]:::FBXL20 [F-box and leucine rich repeat protein 20]:::ARHGAP20 [Rho GTPase activating protein 20]:::FAM204A [family with sequence similarity 204 member A]:::MRPL20 [mitochondrial ribosomal protein L20]:::FAM204CP [family with sequence similarity 204 member C, pseudogene]:::GPR20 [G protein-coupled receptor 20]:::RANBP20P [RAN binding protein 20 pseudogene]:::RPS4XP20 [ribosomal protein S4X pseudogene 20]:::FAM204DP [family with sequence similarity 204 member D, pseudogene]:::TBX20 [T-box transcription factor 20]:::SNX20 [sorting nexin 20]:::PRR20G [proline rich 20G]:::GAPDHP20 [glyceraldehyde 3 phosphate dehydrogenase pseudogene 20]:::SPDYE20P [speedy/RINGO cell cycle regulator family member E20, pseudogene]', 'all_k_distances': 
'0.0000:::6.4373:::6.4389:::6.6875:::6.8085:::6.8100:::6.8730:::6.8812:::6.9250:::6.9846:::7.0950:::7.1020:::7.2438:::7.2766:::7.3176:::7.3252:::7.3437:::7.4235:::7.4729:::7.4762:::7.4965:::7.5218:::7.5315:::7.5501:::7.5699'}","{'ner_chunk': 'FBXO48', 'begin': 126, 'end': 131, 'ner_label': 'GENE', 'ner_confidence': '0.9833', 'code': 'HGNC:33857', 'resolution': 'FBXO48 [F-box protein 48]', 'all_k_codes': 'HGNC:33857:::HGNC:4930:::HGNC:16653:::HGNC:13114:::HGNC:22564:::HGNC:18533:::HGNC:37552:::HGNC:24635:::HGNC:23535:::HGNC:23385:::HGNC:23942:::HGNC:20807:::HGNC:23440:::HGNC:21368:::HGNC:23384:::HGNC:52393:::HGNC:23305:::HGNC:21785:::HGNC:23488:::HGNC:31272:::HGNC:1683:::HGNC:12079:::HGNC:37805:::HGNC:55157:::HGNC:31969', 'all_k_resolutions': 'FBXO48 [F-box protein 48]:::ZBTB48 [zinc finger and BTB domain containing 48]:::MRPL48 [mitochondrial ribosomal protein L48]:::ZNF48 [zinc finger protein 48]:::SPATA48 [spermatogenesis associated 48]:::USP48 [ubiquitin specific peptidase 48]:::PIRC48 [piwi-interacting RNA cluster 48]:::PRSS48 [serine protease 48]:::ZNF488 [zinc finger protein 488]:::ZNF484 [zinc finger protein 484]:::CYCSP48 [CYCS pseudogene 48]:::ZNF486 [zinc finger protein 486]:::ZNF485 [zinc finger protein 485]:::SNRNP48 [small nuclear ribonucleoprotein U11/U12 subunit 48]:::ZNF483 [zinc finger protein 483]:::TEX48 [testis expressed 48]:::ZNF480 [zinc finger protein 480]:::RBM48 [RNA binding motif protein 48]:::ZNF487 [zinc finger protein 487]:::OR4C48P [olfactory receptor family 4 subfamily C member 48 pseudogene]:::CD48 [CD48 molecule]:::TRAJ48 [T cell receptor alpha joining 48]:::GAPDHP48 [glyceraldehyde 3 phosphate dehydrogenase pseudogene 48]:::HMGN2P48 [high mobility group nucleosomal binding domain 2 pseudogene 48]:::FBXO47 [F-box protein 47]', 'all_k_distances': 
'0.0000:::5.3026:::5.3531:::5.4464:::5.8642:::5.8911:::5.9817:::6.0045:::6.1032:::6.1347:::6.1727:::6.2446:::6.3179:::6.3452:::6.3667:::6.3867:::6.3949:::6.4803:::6.5349:::6.5411:::6.5938:::6.6030:::6.6433:::6.7313:::6.8481'}"
1,"{'ner_chunk': 'EGFR', 'begin': 4, 'end': 7, 'ner_label': 'GENE', 'ner_confidence': '0.9994', 'code': 'HGNC:3236', 'resolution': 'EGFR [epidermal growth factor receptor]', 'all_k_codes': 'HGNC:3236:::HGNC:3229:::HGNC:3697:::HGNC:16898:::HGNC:12691:::HGNC:4194:::HGNC:26330:::HGNC:3662:::HGNC:21869:::HGNC:341:::HGNC:3249:::HGNC:3785:::HGNC:330:::HGNC:5715:::HGNC:3642:::HGNC:4330:::HGNC:329:::HGNC:3481:::HGNC:17470:::HGNC:4570:::HGNC:3113:::HGNC:6583:::HGNC:3483:::HGNC:4267:::HGNC:28292', 'all_k_resolutions': 'EGFR [epidermal growth factor receptor]:::EGF [epidermal growth factor]:::FGR [FGR proto-oncogene, Src family tyrosine kinase]:::EFS [embryonal Fyn-associated substrate]:::EZR [ezrin]:::GCHFR [GTP cyclohydrolase I feedback regulator]:::EFHB [EF-hand domain family member B]:::FGB [fibrinogen beta chain]:::AGK [acylglycerol kinase]:::AGXT [alanine--glyoxylate aminotransferase]:::EIF1 [eukaryotic translation initiation factor 1]:::FNTB [farnesyltransferase, CAAX box, beta]:::AGRP [agouti related neuropeptide]:::IGK [immunoglobulin kappa locus]:::FDXR [ferredoxin reductase]:::GLRX [glutaredoxin]:::AGRN [agrin]:::ETFA [electron transfer flavoprotein subunit alpha]:::EPGN [epithelial mitogen]:::GRHPR [glyoxylate and hydroxypyruvate reductase]:::E2F1 [E2F transcription factor 1]:::EIF2D [eukaryotic translation initiation factor 2D]:::ETFDH [electron transfer flavoprotein dehydrogenase]:::GHSR [growth hormone secretagogue receptor]:::EID2 [EP300 interacting inhibitor of differentiation 2]', 'all_k_distances': '0.0000:::3.3460:::5.2179:::5.6404:::5.6526:::5.7774:::5.8601:::6.0288:::6.1888:::6.2019:::6.2093:::6.2857:::6.3732:::6.3836:::6.3873:::6.4167:::6.4422:::6.4935:::6.5067:::6.5083:::6.5229:::6.5325:::6.5553:::6.5696:::6.5797'}",
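With a list of documents, `predictions` holds one inner list of entities per document, which is why the DataFrame above has one row per document and one column per entity. A tidier layout is one row per entity. The sketch below illustrates that, along with how the `begin` and `end` offsets map back to the text. Both helper names are hypothetical, and treating `end` as inclusive is inferred from the output above (`EGFR` spans offsets 4 to 7).

```python
def flatten_predictions(predictions: list) -> list:
    """Turn [[entity, ...], ...] (one inner list per document) into flat rows."""
    rows = []
    for doc_index, entities in enumerate(predictions):
        for entity in entities:
            rows.append({"doc_index": doc_index, **entity})
    return rows


def chunk_text(text: str, entity: dict) -> str:
    """Recover the chunk from its offsets; 'end' is treated as inclusive."""
    return text[entity["begin"]: entity["end"] + 1]


# Trimmed copy of the response above: only a few keys are kept per entity.
predictions = [
    [
        {"ner_chunk": "DUX4L20", "begin": 113, "end": 119, "code": "HGNC:50801"},
        {"ner_chunk": "FBXO48", "begin": 126, "end": 131, "code": "HGNC:33857"},
    ],
    [{"ner_chunk": "EGFR", "begin": 4, "end": 7, "code": "HGNC:3236"}],
]
rows = flatten_predictions(predictions)

# Opening sentence of the second input document.
second_doc = "The EGFR gene encodes a protein that is involved in cell proliferation and survival."
print(chunk_text(second_doc, rows[2]))
```

Passing `rows` to `pd.DataFrame` would then give one row per entity, with `doc_index` identifying the source document.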


### JSON Lines

In [10]:
import json

def create_jsonl(records):
    json_records = []

    for text in records:
        record = {
            "text": text
        }
        json_records.append(record)

    json_lines = '\n'.join(json.dumps(record) for record in json_records)

    return json_lines

input_jsonl_data = create_jsonl(docs)

#### Example 1

  **Input format**:
  
```json
{"text": "Text document 1"}
{"text": "Text document 2"}
```

In [11]:
data = process_data_and_invoke_realtime_endpoint(input_jsonl_data, content_type="application/jsonlines", accept="application/jsonlines")
print(data)

{"predictions": [{"ner_chunk": "DUX4L20", "begin": 113, "end": 119, "ner_label": "GENE", "ner_confidence": "0.9654", "code": "HGNC:50801", "resolution": "DUX4L20 [double homeobox 4 like 20 (pseudogene)]", "all_k_codes": "HGNC:50801:::HGNC:39776:::HGNC:31982:::HGNC:26230:::HGNC:2743:::HGNC:42011:::HGNC:42254:::HGNC:42423:::HGNC:42207:::HGNC:50522:::HGNC:34070:::HGNC:24679:::HGNC:18357:::HGNC:25794:::HGNC:14478:::HGNC:42012:::HGNC:4475:::HGNC:19734:::HGNC:36437:::HGNC:42013:::HGNC:11598:::HGNC:30390:::HGNC:53837:::HGNC:37772:::HGNC:51516", "all_k_resolutions": "DUX4L20 [double homeobox 4 like 20 (pseudogene)]:::ZDHHC20P4 [zinc finger DHHC-type containing 20 pseudogene 4]:::ANKRD20A4P [ankyrin repeat domain 20 family member A4, pseudogene]:::TM4SF20 [transmembrane 4 L six family member 20]:::DDX20 [DEAD-box helicase 20]:::FAM204BP [family with sequence similarity 204 member B, pseudogene]:::MTND4LP20 [MT-ND4L pseudogene 20]:::ZBTB20-AS4 [ZBTB20 antisense RNA 4]:::MTND4P20 [MT-ND4 pseudoge
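For `application/jsonlines`, the helper function returns the response as raw text rather than parsed objects. A minimal sketch of parsing it is shown below. It assumes the response carries one JSON object per line, each with a `predictions` list, which matches the visible start of the output above but is not confirmed for the truncated remainder. The helper name `parse_jsonl` and the sample response are illustrative.

```python
import json


def parse_jsonl(text: str) -> list:
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Trimmed, made-up response with the same shape as the output above.
sample_response = (
    '{"predictions": [{"ner_chunk": "DUX4L20", "code": "HGNC:50801"}]}\n'
    '{"predictions": [{"ner_chunk": "EGFR", "code": "HGNC:3236"}]}\n'
)
records = parse_jsonl(sample_response)
for record in records:
    print([p["code"] for p in record["predictions"]])
```

Splitting on `"\n"` rather than `str.splitlines()` is deliberate: `splitlines()` also breaks on other Unicode line separators that could legally appear inside a JSON string.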

### C. Delete the endpoint

Now that you have successfully performed a real-time inference, you do not need the endpoint any more. You can terminate the endpoint to avoid being charged.

In [12]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 3. Batch inference

In [13]:
import json
import os

input_dir = 'inputs/batch'
json_input_dir = f"{input_dir}/json"
jsonl_input_dir = f"{input_dir}/jsonl"

output_dir = 'outputs/batch'
json_output_dir = f"{output_dir}/json"
jsonl_output_dir = f"{output_dir}/jsonl"

os.makedirs(json_input_dir, exist_ok=True)
os.makedirs(jsonl_input_dir, exist_ok=True)
os.makedirs(json_output_dir, exist_ok=True)
os.makedirs(jsonl_output_dir, exist_ok=True)

validation_json_file_name = "input.json"

validation_jsonl_file_name = "input.jsonl"

validation_input_json_path = f"s3://{s3_bucket}/{model_name}/validation-input/batch/json/"
validation_output_json_path = f"s3://{s3_bucket}/{model_name}/validation-output/batch/json/"

validation_input_jsonl_path = f"s3://{s3_bucket}/{model_name}/validation-input/batch/jsonl/"
validation_output_jsonl_path = f"s3://{s3_bucket}/{model_name}/validation-output/batch/jsonl/"

def write_and_upload_to_s3(input_data, file_name):
    file_format = os.path.splitext(file_name)[1].lower()
    if file_format == ".json":
        input_data = json.dumps(input_data)

    with open(file_name, "w") as f:
        f.write(input_data)

    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input/batch/{file_format[1:]}/{os.path.basename(file_name)}",
        Body=(bytes(input_data.encode("UTF-8"))),
    )

In [14]:
input_jsonl_data = create_jsonl(docs)
input_json_data = {"text": docs}

write_and_upload_to_s3(input_json_data, f"{json_input_dir}/{validation_json_file_name}")

write_and_upload_to_s3(input_jsonl_data, f"{jsonl_input_dir}/{validation_jsonl_file_name}")

### JSON

In [15]:
# Initialize a SageMaker Transformer object for making predictions
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
    output_path=validation_output_json_path
)

transformer.transform(validation_input_json_path, content_type="application/json")
transformer.wait()

INFO:sagemaker:Creating transform job with name: hgnc-resolver-pipeline-en-2024-12-03-08-39-34-010


...........................................24/12/03 08:46:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

[Stage 0:>                                                          (0 + 1) / 1]
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
📋 Loading license number 0 from /root/.johnsnowlabs/licenses/license_number_{number}_for_Spark-Healthcare_Spark-OCR.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.5.0, 💊Spark-Healthcare==5.5.0, running on ⚡

In [16]:
from urllib.parse import urlparse

def process_s3_json_output_and_save(validation_file_name):

    output_file_path = f"{json_output_dir}/{validation_file_name}.out"
    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

    data = json.loads(response["Body"].read().decode("utf-8"))
    df = pd.DataFrame(data["predictions"])
    display(df)

    # Save the data to the output file
    with open(output_file_path, 'w') as f_out:
        json.dump(data, f_out, indent=4)
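The function above locates the result by taking the key prefix from `transformer.output_path` and appending the input file name plus `.out`, which is how batch transform names its output objects. The derivation can be sketched on its own as follows. The helper name `batch_output_key` and the bucket name are made up for illustration, and the trailing-slash handling is an addition that the notebook's own paths do not need.

```python
from urllib.parse import urlparse


def batch_output_key(output_path: str, input_file_name: str) -> tuple:
    """Return (bucket, key) for the batch transform output of one input file."""
    parsed = urlparse(output_path)
    # The path starts with "/"; S3 keys do not.
    prefix = parsed.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, f"{prefix}{input_file_name}.out"


# Made-up bucket name; the prefix mirrors the layout used in this notebook.
bucket, key = batch_output_key(
    "s3://example-bucket/hgnc-resolver-pipeline/validation-output/batch/json/",
    "input.json",
)
print(bucket, key)
```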

In [17]:
process_s3_json_output_and_save(validation_json_file_name)

Unnamed: 0,0,1
0,"{'ner_chunk': 'DUX4L20', 'begin': 113, 'end': 119, 'ner_label': 'GENE', 'ner_confidence': '0.9654', 'code': 'HGNC:50801', 'resolution': 'DUX4L20 [double homeobox 4 like 20 (pseudogene)]', 'all_k_codes': 'HGNC:50801:::HGNC:39776:::HGNC:31982:::HGNC:26230:::HGNC:2743:::HGNC:42011:::HGNC:42254:::HGNC:42423:::HGNC:42207:::HGNC:50522:::HGNC:34070:::HGNC:24679:::HGNC:18357:::HGNC:25794:::HGNC:14478:::HGNC:42012:::HGNC:4475:::HGNC:19734:::HGNC:36437:::HGNC:42013:::HGNC:11598:::HGNC:30390:::HGNC:53837:::HGNC:37772:::HGNC:51516', 'all_k_resolutions': 'DUX4L20 [double homeobox 4 like 20 (pseudogene)]:::ZDHHC20P4 [zinc finger DHHC-type containing 20 pseudogene 4]:::ANKRD20A4P [ankyrin repeat domain 20 family member A4, pseudogene]:::TM4SF20 [transmembrane 4 L six family member 20]:::DDX20 [DEAD-box helicase 20]:::FAM204BP [family with sequence similarity 204 member B, pseudogene]:::MTND4LP20 [MT-ND4L pseudogene 20]:::ZBTB20-AS4 [ZBTB20 antisense RNA 4]:::MTND4P20 [MT-ND4 pseudogene 20]:::TOMM20P4 [TOMM20 pseudogene 4]:::RNY4P20 [RNY4 pseudogene 20]:::FBXL20 [F-box and leucine rich repeat protein 20]:::ARHGAP20 [Rho GTPase activating protein 20]:::FAM204A [family with sequence similarity 204 member A]:::MRPL20 [mitochondrial ribosomal protein L20]:::FAM204CP [family with sequence similarity 204 member C, pseudogene]:::GPR20 [G protein-coupled receptor 20]:::RANBP20P [RAN binding protein 20 pseudogene]:::RPS4XP20 [ribosomal protein S4X pseudogene 20]:::FAM204DP [family with sequence similarity 204 member D, pseudogene]:::TBX20 [T-box transcription factor 20]:::SNX20 [sorting nexin 20]:::PRR20G [proline rich 20G]:::GAPDHP20 [glyceraldehyde 3 phosphate dehydrogenase pseudogene 20]:::SPDYE20P [speedy/RINGO cell cycle regulator family member E20, pseudogene]', 'all_k_distances': 
'0.0000:::6.4373:::6.4389:::6.6875:::6.8085:::6.8100:::6.8730:::6.8812:::6.9250:::6.9846:::7.0950:::7.1020:::7.2438:::7.2766:::7.3176:::7.3252:::7.3437:::7.4235:::7.4729:::7.4762:::7.4965:::7.5218:::7.5315:::7.5501:::7.5699'}","{'ner_chunk': 'FBXO48', 'begin': 126, 'end': 131, 'ner_label': 'GENE', 'ner_confidence': '0.9833', 'code': 'HGNC:33857', 'resolution': 'FBXO48 [F-box protein 48]', 'all_k_codes': 'HGNC:33857:::HGNC:4930:::HGNC:16653:::HGNC:13114:::HGNC:22564:::HGNC:18533:::HGNC:37552:::HGNC:24635:::HGNC:23535:::HGNC:23385:::HGNC:23942:::HGNC:20807:::HGNC:23440:::HGNC:21368:::HGNC:23384:::HGNC:52393:::HGNC:23305:::HGNC:21785:::HGNC:23488:::HGNC:31272:::HGNC:1683:::HGNC:12079:::HGNC:37805:::HGNC:55157:::HGNC:31969', 'all_k_resolutions': 'FBXO48 [F-box protein 48]:::ZBTB48 [zinc finger and BTB domain containing 48]:::MRPL48 [mitochondrial ribosomal protein L48]:::ZNF48 [zinc finger protein 48]:::SPATA48 [spermatogenesis associated 48]:::USP48 [ubiquitin specific peptidase 48]:::PIRC48 [piwi-interacting RNA cluster 48]:::PRSS48 [serine protease 48]:::ZNF488 [zinc finger protein 488]:::ZNF484 [zinc finger protein 484]:::CYCSP48 [CYCS pseudogene 48]:::ZNF486 [zinc finger protein 486]:::ZNF485 [zinc finger protein 485]:::SNRNP48 [small nuclear ribonucleoprotein U11/U12 subunit 48]:::ZNF483 [zinc finger protein 483]:::TEX48 [testis expressed 48]:::ZNF480 [zinc finger protein 480]:::RBM48 [RNA binding motif protein 48]:::ZNF487 [zinc finger protein 487]:::OR4C48P [olfactory receptor family 4 subfamily C member 48 pseudogene]:::CD48 [CD48 molecule]:::TRAJ48 [T cell receptor alpha joining 48]:::GAPDHP48 [glyceraldehyde 3 phosphate dehydrogenase pseudogene 48]:::HMGN2P48 [high mobility group nucleosomal binding domain 2 pseudogene 48]:::FBXO47 [F-box protein 47]', 'all_k_distances': 
'0.0000:::5.3026:::5.3531:::5.4464:::5.8642:::5.8911:::5.9817:::6.0045:::6.1032:::6.1347:::6.1727:::6.2446:::6.3179:::6.3452:::6.3667:::6.3867:::6.3949:::6.4803:::6.5349:::6.5411:::6.5938:::6.6030:::6.6433:::6.7313:::6.8481'}"
1,"{'ner_chunk': 'EGFR', 'begin': 4, 'end': 7, 'ner_label': 'GENE', 'ner_confidence': '0.9994', 'code': 'HGNC:3236', 'resolution': 'EGFR [epidermal growth factor receptor]', 'all_k_codes': 'HGNC:3236:::HGNC:3229:::HGNC:3697:::HGNC:16898:::HGNC:12691:::HGNC:4194:::HGNC:26330:::HGNC:3662:::HGNC:21869:::HGNC:341:::HGNC:3249:::HGNC:3785:::HGNC:330:::HGNC:5715:::HGNC:3642:::HGNC:4330:::HGNC:329:::HGNC:3481:::HGNC:17470:::HGNC:4570:::HGNC:3113:::HGNC:6583:::HGNC:3483:::HGNC:4267:::HGNC:28292', 'all_k_resolutions': 'EGFR [epidermal growth factor receptor]:::EGF [epidermal growth factor]:::FGR [FGR proto-oncogene, Src family tyrosine kinase]:::EFS [embryonal Fyn-associated substrate]:::EZR [ezrin]:::GCHFR [GTP cyclohydrolase I feedback regulator]:::EFHB [EF-hand domain family member B]:::FGB [fibrinogen beta chain]:::AGK [acylglycerol kinase]:::AGXT [alanine--glyoxylate aminotransferase]:::EIF1 [eukaryotic translation initiation factor 1]:::FNTB [farnesyltransferase, CAAX box, beta]:::AGRP [agouti related neuropeptide]:::IGK [immunoglobulin kappa locus]:::FDXR [ferredoxin reductase]:::GLRX [glutaredoxin]:::AGRN [agrin]:::ETFA [electron transfer flavoprotein subunit alpha]:::EPGN [epithelial mitogen]:::GRHPR [glyoxylate and hydroxypyruvate reductase]:::E2F1 [E2F transcription factor 1]:::EIF2D [eukaryotic translation initiation factor 2D]:::ETFDH [electron transfer flavoprotein dehydrogenase]:::GHSR [growth hormone secretagogue receptor]:::EID2 [EP300 interacting inhibitor of differentiation 2]', 'all_k_distances': '0.0000:::3.3460:::5.2179:::5.6404:::5.6526:::5.7774:::5.8601:::6.0288:::6.1888:::6.2019:::6.2093:::6.2857:::6.3732:::6.3836:::6.3873:::6.4167:::6.4422:::6.4935:::6.5067:::6.5083:::6.5229:::6.5325:::6.5553:::6.5696:::6.5797'}",


### JSON Lines

In [18]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/jsonlines",
    output_path=validation_output_jsonl_path
)
transformer.transform(validation_input_jsonl_path, content_type="application/jsonlines")
transformer.wait()

INFO:sagemaker:Creating model with name: hgnc-resolver-pipeline-en-2024-12-03-08-48-09-502
INFO:sagemaker:Creating transform job with name: hgnc-resolver-pipeline-en-2024-12-03-08-48-10-162


.............................................
24/12/03 08:55:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

[Stage 0:>                                                          (0 + 1) / 1]
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
📋 Loading license number 0 from /root/.johnsnowlabs/licenses/license_number_{number}_for_Spark-Healthcare_Spark-OCR.json
👌 Launched cpu opti

In [19]:
from urllib.parse import urlparse

def process_s3_jsonlines_output_and_save(validation_file_name):
    """Download one batch transform result from S3, print it, and save it locally."""
    output_file_path = f"{jsonl_output_dir}/{validation_file_name}.out"

    # Batch transform stores each result as "<input file name>.out" under
    # transformer.output_path, which is assumed to end with a trailing "/".
    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

    data = response["Body"].read().decode("utf-8")
    print(data)

    # Save the data to the output file
    with open(output_file_path, "w") as f_out:
        f_out.write(data)

In [20]:
process_s3_jsonlines_output_and_save(validation_jsonl_file_name)

{"predictions": [{"ner_chunk": "DUX4L20", "begin": 113, "end": 119, "ner_label": "GENE", "ner_confidence": "0.9654", "code": "HGNC:50801", "resolution": "DUX4L20 [double homeobox 4 like 20 (pseudogene)]", "all_k_codes": "HGNC:50801:::HGNC:39776:::HGNC:31982:::HGNC:26230:::HGNC:2743:::HGNC:42011:::HGNC:42254:::HGNC:42423:::HGNC:42207:::HGNC:50522:::HGNC:34070:::HGNC:24679:::HGNC:18357:::HGNC:25794:::HGNC:14478:::HGNC:42012:::HGNC:4475:::HGNC:19734:::HGNC:36437:::HGNC:42013:::HGNC:11598:::HGNC:30390:::HGNC:53837:::HGNC:37772:::HGNC:51516", "all_k_resolutions": "DUX4L20 [double homeobox 4 like 20 (pseudogene)]:::ZDHHC20P4 [zinc finger DHHC-type containing 20 pseudogene 4]:::ANKRD20A4P [ankyrin repeat domain 20 family member A4, pseudogene]:::TM4SF20 [transmembrane 4 L six family member 20]:::DDX20 [DEAD-box helicase 20]:::FAM204BP [family with sequence similarity 204 member B, pseudogene]:::MTND4LP20 [MT-ND4L pseudogene 20]:::ZBTB20-AS4 [ZBTB20 antisense RNA 4]:::MTND4P20 [MT-ND4 pseudoge
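
Each line of the JSON Lines output is a JSON object with a `predictions` list, and the `all_k_*` fields pack the ranked candidates into `:::`-delimited strings. The following is a minimal sketch of unpacking them, assuming the JSON Lines output carries the same `all_k_distances` field as the CSV output. The helper name `ranked_candidates` is hypothetical, and the sample line is shortened to the top two candidates of the EGFR row shown earlier.

```python
import json

SEPARATOR = ":::"  # delimiter used inside the all_k_* fields


def ranked_candidates(prediction: dict) -> list:
    """Zip the all_k_* fields into (code, resolution, distance) tuples."""
    codes = prediction["all_k_codes"].split(SEPARATOR)
    resolutions = prediction["all_k_resolutions"].split(SEPARATOR)
    distances = [float(d) for d in prediction["all_k_distances"].split(SEPARATOR)]
    return list(zip(codes, resolutions, distances))


# Shortened sample line (assumed shape); only two candidates are kept.
sample_line = (
    '{"predictions": [{"ner_chunk": "EGFR", "code": "HGNC:3236", '
    '"all_k_codes": "HGNC:3236:::HGNC:3229", '
    '"all_k_resolutions": "EGFR [epidermal growth factor receptor]:::'
    'EGF [epidermal growth factor]", '
    '"all_k_distances": "0.0000:::3.3460"}]}'
)

record = json.loads(sample_line)
for prediction in record["predictions"]:
    for code, resolution, distance in ranked_candidates(prediction):
        print(code, resolution, distance)
```

The first candidate in each list matches the top-level `code` and `resolution` fields, so the remaining tuples are the alternatives ordered by increasing distance.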

In [21]:
model.delete_model()

INFO:sagemaker:Deleting model with name: hgnc-resolver-pipeline-en-2024-12-03-08-48-09-502


In [22]:
pwd

'/home/ec2-user/SageMaker/vivek/models/hgnc_resolver_pipeline_en/sagemaker'

### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow the steps below. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.
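
The check for leftover models can also be scripted. The following is a minimal sketch, assuming the models were created with the model package ARN passed as `ModelPackageName` in their container definitions. The helper name `models_using_package` is hypothetical, and `sm_client` stands for a Boto3 SageMaker client such as the `sagemaker` client created earlier in this notebook.

```python
def models_using_package(sm_client, model_package_arn: str) -> list:
    """Return names of SageMaker models whose containers reference the ARN."""
    matches = []
    paginator = sm_client.get_paginator("list_models")
    for page in paginator.paginate():
        for summary in page["Models"]:
            details = sm_client.describe_model(ModelName=summary["ModelName"])
            # A model has either a single PrimaryContainer or a Containers list.
            containers = list(details.get("Containers", []))
            if "PrimaryContainer" in details:
                containers.append(details["PrimaryContainer"])
            if any(c.get("ModelPackageName") == model_package_arn for c in containers):
                matches.append(summary["ModelName"])
    return matches


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# print(models_using_package(boto3.client("sagemaker"), model_package_arn))
```

An empty list suggests that no model in the current region still references the package. Models in other regions would need a separate client per region.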

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

