## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page <font color='red'> For Seller to update:[Title_of_your_product](Provide link to your marketplace listing of your product).</font>
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **Accept Offer** if you and your organization agree with them.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the `model_package_arn` cell below.

# Deidentification in Healthcare with Spark NLP

## Background

Deidentification plays a vital role in utilizing structured or unstructured clinical text for research and other purposes while safeguarding patient privacy and confidentiality. The John Snow Labs team has dedicated significant efforts to developing methods and corpora for the deidentification of clinical texts, PDFs, images, DICOM files, etc., containing Protected Health Information (PHI). PHI encompasses a wide range of data, including:

- An individual's past, present, or future physical or mental health or condition.
- The provision of health care to the individual.
- Past, present, or future payment for that health care.

This information includes many common identifiers (e.g., name, address, birth date, Social Security Number) when they can be associated with health information.

## Spark NLP for Healthcare's Approach

Spark NLP for Healthcare offers several techniques and strategies for deidentification, including the following model:

- **Model**: `en.de_identify.clinical_pipeline`
- **Model Description**: Deidentifies PHI in medical texts by masking or obfuscating sensitive data. The pipeline handles entities such as AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR, EMAIL, and more, and returns masked or obfuscated output.




In [45]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
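Because the ARN has to correspond to your region, it can help to check it before deploying. The following is a minimal sketch, assuming the usual `arn:<partition>:sagemaker:<region>:<account>:model-package/<name>` layout. The helper names and the example ARN are hypothetical and not part of the original notebook.

```python
def parse_model_package_arn(arn: str) -> dict:
    """Split a SageMaker model package ARN into its named parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"Not a SageMaker ARN: {arn!r}")
    resource_type, _, resource_name = parts[5].partition("/")
    if resource_type != "model-package" or not resource_name:
        raise ValueError(f"Not a model package ARN: {arn!r}")
    return {
        "partition": parts[1],
        "region": parts[3],
        "account_id": parts[4],
        "name": resource_name,
    }


def arn_matches_region(arn: str, region: str) -> bool:
    """Return True when the ARN was issued for the given AWS region."""
    return parse_model_package_arn(arn)["region"] == region


# Hypothetical ARN, for illustration only
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-model-package"
print(arn_matches_region(example_arn, "us-east-1"))
```

In the notebook, the same check could be run as `arn_matches_region(model_package_arn, region)` once the session cell has defined `region`.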

In [46]:
import base64
import json
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image, display
from PIL import Image as ImageEdit
import numpy as np

In [47]:
sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

## 2. Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [48]:
model_name = "en-de-identify-clinical-pipeline"

content_type = "application/json"

real_time_inference_instance_type = "ml.m4.xlarge"
batch_transform_inference_instance_type = "ml.m4.xlarge"


### A. Create an endpoint

In [49]:
# Create a deployable model from the model package
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

# Deploy the model to a real-time endpoint
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type,
    endpoint_name=model_name,
)

INFO:sagemaker:Creating model with name: jsl-deidentify-clinical-pipeline-2024-02-19-16-04-44-671
INFO:sagemaker:Creating endpoint-config with name en-de-identify-clinical-pipeline
INFO:sagemaker:Creating endpoint with name en-de-identify-clinical-pipeline


-------------------!

Once the endpoint has been created, you can perform real-time inference.

### B. Perform real-time inference

  **Input format**:
  
  
  {"**text**": "Input text to be deidentified",
    "**masking_policy**": "Masking policy to apply (optional; defaults to masked)"
    }
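As a minimal sketch of the input format above, a request body can be assembled and checked before it is sent. The helper `build_request_body` and the constant `ALLOWED_POLICIES` are hypothetical names introduced for illustration. The policy names are the ones listed in the Masking Policies section of this notebook.

```python
import json

# Policy names as listed in the "Masking Policies" section of this notebook
ALLOWED_POLICIES = (
    "masked",
    "obfuscated",
    "masked_fixed_length_chars",
    "masked_with_chars",
)


def build_request_body(text, masking_policy=None):
    """Return the JSON request body for one text.

    When masking_policy is None the key is omitted, so the endpoint
    falls back to its default policy.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    payload = {"text": text}
    if masking_policy is not None:
        if masking_policy not in ALLOWED_POLICIES:
            raise ValueError(f"Unknown masking policy: {masking_policy!r}")
        payload["masking_policy"] = masking_policy
    return json.dumps(payload)


print(build_request_body("My name is Mike.", "obfuscated"))
```

Rejecting an unknown policy name locally avoids a round trip to the endpoint for a request that is likely to fail.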

In [52]:
import json
import pandas as pd
import os
import boto3


# Set display options
pd.set_option('display.max_rows', None)  
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', None) 


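def process_data_and_invoke_realtime_endpoint(data_dicts):
    """Invoke the endpoint once per request dict, display the result, and save input and output locally."""
    for data_dict in data_dicts:
        json_input_data = json.dumps(data_dict)

        # Pick the first index that is unused by both an input and an output file
        i = 1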
        input_file_name = f'inputs/real-time/input{i}.json'
        output_file_name = f'outputs/real-time/out{i}.out'
        
        while os.path.exists(input_file_name) or os.path.exists(output_file_name):
            i += 1
            input_file_name = f'inputs/real-time/input{i}.json'
            output_file_name = f'outputs/real-time/out{i}.out'

        os.makedirs(os.path.dirname(input_file_name), exist_ok=True)
        os.makedirs(os.path.dirname(output_file_name), exist_ok=True)
        
        with open(input_file_name, 'w') as f:
            f.write(json_input_data)
        
        # Keep a copy of the request in S3 under the validation-input-json/ prefix
        s3_client.put_object(
            Bucket=s3_bucket,
            Key=f"validation-input-json/{os.path.basename(input_file_name)}",
            Body=json_input_data.encode("utf-8"),
        )
        
        response = sm_runtime.invoke_endpoint(
            EndpointName=model_name,
            ContentType=content_type,
            Accept="application/json",
            Body=json_input_data,
        )

        # Process response
        response_data = json.loads(response["Body"].read().decode("utf-8"))
        masking_policy = data_dict.get("masking_policy", "masked")
        print(f"Masking Policy: {masking_policy}") 
        df = pd.DataFrame([response_data]) 
        display(df)
        
        # Save response data to file
        with open(output_file_name, 'w') as f_out:
            json.dump(response_data, f_out, indent=4)
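Judging from the outputs displayed in this notebook, the response body appears to be a JSON object with a `predictions` field holding the deidentified text. The sketch below assumes that shape. `extract_predictions` is a hypothetical helper, not part of the model package.

```python
def extract_predictions(response_data):
    """Return the deidentified texts from a parsed response as a list of strings.

    Assumes a dict shaped like {"predictions": [...]}, which is inferred
    from the notebook outputs rather than from a published schema.
    """
    if "predictions" not in response_data:
        raise KeyError("response has no 'predictions' field")
    predictions = response_data["predictions"]
    # Tolerate a bare string as well as a list of strings
    if isinstance(predictions, str):
        return [predictions]
    return list(predictions)


# Sample response copied from the shape of the displayed outputs
sample_response = {"predictions": ["My name is <PATIENT>"]}
for text in extract_predictions(sample_response):
    print(text)
```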


## Masking Policies

Users can select a masking policy to determine how sensitive entities are handled:

- **masked**: Default policy. Replaces each entity with its entity type in angle brackets.
  
  Example: "My name is Mike. I was admitted to the hospital yesterday."  
  -> "My name is `<PATIENT>`. I was admitted to the hospital yesterday."

- **obfuscated**: Replaces sensitive entities with random values of the same type.
  
  Example: "My name is Mike. I was admitted to the hospital yesterday."  
  -> "My name is `Barbaraann Share`. I was admitted to the hospital yesterday."

- **masked_fixed_length_chars**: Replaces every entity with the same fixed-length run of asterisks (`****`), regardless of the entity's length.
  
  Example: "Name: Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, E-MAIL: green@gmail.com."  
  -> "Name: `****`, Record date: `****`, # `****`. Dr. `****`, E-MAIL: `****`."

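- **masked_with_chars**: Replaces each entity with asterisks in square brackets. In the example, each mask (brackets included) is as long as the entity it replaces.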
  
  Example: "Name: Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, E-MAIL: green@gmail.com."  
  -> "Name: `[**************]`, Record date: `[********]`, # `[****]`. Dr. `[********]`, E-MAIL: `[*************]`."
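To make the differences between the policies concrete, here is a toy illustration that applies three of them to entities that are already known. It is not the pipeline's implementation: the entity list is supplied by hand, the length rule for `masked_with_chars` is inferred from the single example above, and `obfuscated` is left out because it needs a source of random replacement values. The function name `apply_policy` is hypothetical.

```python
def apply_policy(text, entities, policy="masked"):
    """Replace known entities in text according to a masking policy.

    entities is a list of (surface_text, label) pairs in order of appearance.
    """
    pieces = []
    cursor = 0
    for surface, label in entities:
        start = text.index(surface, cursor)
        pieces.append(text[cursor:start])
        if policy == "masked":
            pieces.append(f"<{label}>")
        elif policy == "masked_fixed_length_chars":
            pieces.append("****")
        elif policy == "masked_with_chars":
            # Brackets count toward the length, so use two fewer asterisks
            pieces.append("[" + "*" * max(len(surface) - 2, 0) + "]")
        else:
            raise ValueError(f"Unsupported policy in this sketch: {policy!r}")
        cursor = start + len(surface)
    pieces.append(text[cursor:])
    return "".join(pieces)


text = (
    "Name: Hendrickson, Ora, Record date: 2093-01-13, # 719435. "
    "Dr. John Green, E-MAIL: green@gmail.com."
)
# Labels follow the model output shown later in this notebook
entities = [
    ("Hendrickson, Ora", "PATIENT"),
    ("2093-01-13", "DATE"),
    ("719435", "DEVICE"),
    ("John Green", "DOCTOR"),
    ("green@gmail.com", "EMAIL"),
]
for policy in ("masked", "masked_fixed_length_chars", "masked_with_chars"):
    print(policy, "->", apply_policy(text, entities, policy))
```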

### masked (default policy)

In [53]:
# Example usage:
data_dicts = [
    {"text": "Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green,  E-MAIL: green@gmail.com."},
]
process_data_and_invoke_realtime_endpoint(data_dicts)


Masking Policy: masked


Unnamed: 0,predictions
0,"[Name : <PATIENT>, Record date: <DATE>, # <DEVICE>. Dr. <DOCTOR>, E-MAIL: <EMAIL>.]"


### obfuscated

In [54]:
data_dicts = [
    {"text": "Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green,  E-MAIL: green@gmail.com.",
    "masking_policy":"obfuscated"},

]
process_data_and_invoke_realtime_endpoint(data_dicts)


Masking Policy: obfuscated


Unnamed: 0,predictions
0,"[Name : Lynne Logan, Record date: 2093-01-25, # L3157974. Dr. Sherlon Handing, E-MAIL: Marvin@yahoo.com.]"


### masked_fixed_length_chars

In [55]:
data_dicts = [
    {"text": "Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green,  E-MAIL: green@gmail.com.",
    "masking_policy":"masked_fixed_length_chars"},

]
process_data_and_invoke_realtime_endpoint(data_dicts)


Masking Policy: masked_fixed_length_chars


Unnamed: 0,predictions
0,"[Name : ****, Record date: ****, # ****. Dr. ****, E-MAIL: ****.]"


### masked_with_chars

In [56]:
data_dicts = [
    {"text": "Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green,  E-MAIL: green@gmail.com.",
    "masking_policy":"masked_with_chars"},

]
process_data_and_invoke_realtime_endpoint(data_dicts)


Masking Policy: masked_with_chars


Unnamed: 0,predictions
0,"[Name : [**************], Record date: [********], # [****]. Dr. [********], E-MAIL: [*************].]"


### C. Delete the endpoint

Now that you have successfully performed real-time inference, you no longer need the endpoint. Delete it to avoid being charged.

In [57]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

INFO:sagemaker:Deleting endpoint with name: en-de-identify-clinical-pipeline
INFO:sagemaker:Deleting endpoint configuration with name: en-de-identify-clinical-pipeline


## 3. Batch inference

In [58]:
validation_input_path = f"s3://{s3_bucket}/validation-input-json/"
validation_output_path = f"s3://{s3_bucket}/validation-output-json/"

In [74]:
input_file_name = 'inputs/batch/input.json'
output_file_name = 'outputs/batch/out.out'

os.makedirs(os.path.dirname(input_file_name), exist_ok=True)
os.makedirs(os.path.dirname(output_file_name), exist_ok=True)

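In [75]:
validation_file_name = "input.json"
input_json_data = {"text" :["Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green,  E-MAIL: green@gmail.com.", "My name is Mike"]}
json_input_data = json.dumps(input_json_data)
with open(input_file_name, 'w') as f:
    f.write(json_input_data)

# Upload the batch input to S3 so that the transform job can read it
s3_client.put_object(
    Bucket=s3_bucket,
    Key=f"validation-input-json/{validation_file_name}",
    Body=json_input_data.encode("utf-8"),
)

In [None]:
# Run the batch transform job over everything under the input prefix
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
)
transformer.transform(validation_input_path, content_type=content_type)
transformer.wait()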

In [63]:
from urllib.parse import urlparse

parsed_url = urlparse(transformer.output_path)
file_key = f"{parsed_url.path[1:]}/{validation_file_name}.out"
response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

data = json.loads(response["Body"].read().decode("utf-8"))
display(pd.DataFrame(data))

with open(output_file_name, 'w') as f_out:
    json.dump(data, f_out, indent=4)

Unnamed: 0,predictions
0,"Name : <PATIENT>, Record date: <DATE>, # <DEVICE>. Dr. <DOCTOR>, E-MAIL: <EMAIL>."
1,My name is <PATIENT>
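The cell above locates the result by taking the key prefix from the transformer's output path and appending the input file name plus `.out`, which is how batch transform names its output objects. A minimal sketch of that derivation follows. The function name and the example bucket and prefix are hypothetical.

```python
from urllib.parse import urlparse


def transform_output_location(output_path, input_file_name):
    """Return (bucket, key) of the batch transform result for one input file."""
    parsed = urlparse(output_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {output_path!r}")
    # Drop surrounding slashes so the key never starts with "/"
    prefix = parsed.path.strip("/")
    file_part = f"{input_file_name}.out"
    key = f"{prefix}/{file_part}" if prefix else file_part
    return parsed.netloc, key


# Hypothetical bucket and prefix, for illustration only
bucket, key = transform_output_location(
    "s3://example-bucket/example-transform-job", "input.json"
)
print(bucket, key)
```

Reading the bucket from the URI instead of assuming the default bucket keeps the lookup correct if an explicit output path is ever passed to the transformer.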


In [64]:
model.delete_model()

INFO:sagemaker:Deleting model with name: jsl-deidentify-clinical-pipeline-2024-02-19-16-17-56-416


### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on the [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.
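The check for leftover deployable models can also be done programmatically. The sketch below is one possible approach, assuming that models created from a marketplace package report the package ARN in the `ModelPackageName` field of their container definition. `models_using_package` is a hypothetical helper. It takes the SageMaker client as an argument and calls only `list_models` and `describe_model`.

```python
def models_using_package(sm_client, model_package_arn):
    """Return the names of SageMaker models whose containers reference the package."""
    names = []
    kwargs = {}
    while True:
        page = sm_client.list_models(**kwargs)
        for summary in page.get("Models", []):
            detail = sm_client.describe_model(ModelName=summary["ModelName"])
            # A model has either a primary container or a list of containers
            containers = list(detail.get("Containers", []))
            if "PrimaryContainer" in detail:
                containers.append(detail["PrimaryContainer"])
            if any(c.get("ModelPackageName") == model_package_arn for c in containers):
                names.append(summary["ModelName"])
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page of results
        kwargs = {"NextToken": token}
```

With the client created earlier in this notebook, the call would look like `models_using_package(sagemaker, model_package_arn)`. An empty list suggests that no model from the package is left in the current region.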

