## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page [Deidentify Clinical Documents (EN)](https://aws.amazon.com/marketplace/pp/prodview-ept2dbql5slue)
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and assign it to `model_package_arn` in the code cell below.

# Deidentification in Healthcare with Spark NLP

## Background

Deidentification plays a vital role in utilizing structured or unstructured clinical text for research and other purposes while safeguarding patient privacy and confidentiality. The John Snow Labs team has dedicated significant efforts to developing methods and corpora for the deidentification of clinical texts, PDFs, images, DICOM files, etc., containing Protected Health Information (PHI). PHI encompasses a wide range of data, including:

- An individual’s past, present, or future physical or mental health or condition.
- Provision of health care to the individual.
- Past, present, or future payment for the health care.

This information includes many common identifiers (e.g., name, address, birth date, Social Security Number) when they can be associated with health information.

## Spark NLP for Healthcare's Approach

Spark NLP for Healthcare offers several techniques and strategies for deidentification, including the following model:

- **Model**: `en.de_identify.clinical_pipeline`

- **Model Description**: Deidentifies PHI in medical texts by masking or obfuscating sensitive data. The pipeline handles entities such as AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR, EMAIL, and more.

In [1]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
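
Because the ARN is region-specific, a quick sanity check before deployment can catch an ARN copied for the wrong region. The sketch below is not part of the original notebook: `arn_region` and `check_arn_region` are hypothetical helper names, the example ARN is a placeholder, and the check relies only on the general ARN layout `arn:partition:service:region:account-id:resource`.

```python
def arn_region(arn: str) -> str:
    """Return the region field of an ARN (arn:partition:service:region:account-id:resource)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    return parts[3]


def check_arn_region(arn: str, expected_region: str) -> bool:
    """Return True when the ARN belongs to the expected AWS region."""
    return arn_region(arn) == expected_region


# Placeholder ARN for illustration only; use the one shown on your Marketplace page.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:model-package/example-package"
print(check_arn_region(example_arn, "us-east-1"))
```

In the notebook, the same check could be run as `check_arn_region(model_package_arn, region)` once `region` has been read from the SageMaker session.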

In [2]:
import base64
import json
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3
from IPython.display import Image, display
from PIL import Image as ImageEdit
import numpy as np

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [3]:
sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

## 2. Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [4]:
model_name = "en-de-identify-clinical-pipeline"

content_type = "application/json"

real_time_inference_instance_type = "ml.m4.xlarge"
batch_transform_inference_instance_type = "ml.m4.xlarge"


### A. Create an endpoint

In [5]:
# create a deployable model from the model package.
model = ModelPackage(
    role=role, model_package_arn=model_package_arn, sagemaker_session=sagemaker_session
)

# Deploy the model
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

------------!

Once the endpoint has been created, you can perform real-time inference.

In [6]:
import json
import pandas as pd
import os
import boto3


# Set display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)


def process_data_and_invoke_realtime_endpoint(data_dicts):
    for data_dict in data_dicts:
        json_input_data = json.dumps(data_dict)
        i = 1
        input_file_name = f'inputs/real-time/input{i}.json'
        output_file_name = f'outputs/real-time/out{i}.out'

        while os.path.exists(input_file_name) or os.path.exists(output_file_name):
            i += 1
            input_file_name = f'inputs/real-time/input{i}.json'
            output_file_name = f'outputs/real-time/out{i}.out'

        os.makedirs(os.path.dirname(input_file_name), exist_ok=True)
        os.makedirs(os.path.dirname(output_file_name), exist_ok=True)

        with open(input_file_name, 'w') as f:
            f.write(json_input_data)

        s3_client.put_object(Bucket=s3_bucket, Key=f"{model_name}/validation-input-json/real-time/{os.path.basename(input_file_name)}", Body=bytes(json_input_data.encode('UTF-8')))

        response = sm_runtime.invoke_endpoint(
            EndpointName=model_name,
            ContentType=content_type,
            Accept="application/json",
            Body=json_input_data,
        )

        # Process response
        response_data = json.loads(response["Body"].read().decode("utf-8"))
        df = pd.DataFrame(response_data)
        display(df)

        # Save response data to file
        with open(output_file_name, 'w') as f_out:
            json.dump(response_data, f_out, indent=4)


### Initial Setup

In [7]:
docs = [
'''Mr. William Garcia, 25years-old , born in Beijing, was transfered to the The Johns Hopkins Hospital. Phone number: (541) 754-3010. MSW: 100009632582 . for his colonic polyps.  William Garcia wants to know the results from them.  He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week.  William Garcia has cut back his cigarettes to one time per week. MR. # 50712249 P:   Follow up with Dr. Hobbs Spruce in 3 months. Gilbert P. Perez, M.D. Johns Hopkins Hospital 1800 Orleans St. PHONE: (+1) 410-955-5000 ''',

'''Record date : 2022-01-13 , Ethan Hale , M.D . , Name : Ava Davis. EMAIL: davis_a@gmail.com. MR # 7190334  Phone : (555) 555-1212. Date : 01-13-1993 PCP : Oliveira , 29years-old. A long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities.  Ava Davis was noted to have a large sacral wound; this is in a similar location with his previous laminectomy, and this continues to receive daily care. DD: 01/13/2022 . DV: 01/13/2022. Ethan Hale , M.D. MSW 754443200936. NewYork-Presbyterian Hospital . 525 East 68th Street. (85) 555-1212. ''',

]


sample_text = """Record date : 2013-01-13, David Hale, M.D. is manager,  Name: Elvis Presley. Age: 17. Phone: (9) 7765-5632. MR. # 7194334 Date: 01-13-1993 PCP: Oliveira. Record date: 2012-11-09. Cocke County Baptist Hospital 0295 Keats Street. This 17-yr-old male, presented with chest heaviness that started during a pick-up basketball game. His past medical history was unremarkable. Elvis Presley denied prior cardiac symptoms and suffered no chest trauma during the game. His father had suffered an acute myocardial infarction at age 38. Elvis Presley was a nonsmoker, did not drink alcohol, and denied recreational drug use. Elvis Presley swallowed a tablet of aspirin before coming to the emergency room. His blood pressure was 160/90 mm Hg, and his heart rate was 80 bpm. Physical examination revealed no stigmata of Marfan syndrome. The rest of his physical examination was normal. DD: 01/13/2013 . DV: 01/13/2013, Cocke County Baptist Hospital 0295 Keats Street. PHONE : (+1) 423-625-2200 """

### Important Parameters

- **masking_policy**: `str`

    Users can select a masking policy to determine how sensitive entities are handled:

    - **masked**: Default policy that masks entities with their type.

      Example: "My name is Mike. I was admitted to the hospital yesterday."  
      -> "My name is `<PATIENT>`. I was admitted to the hospital yesterday."

    - **obfuscated**: Replaces sensitive entities with random values of the same type.

      Example: "My name is Mike. I was admitted to the hospital yesterday."  
      -> "My name is `Barbaraann Share`. I was admitted to the hospital yesterday."

    - **masked_fixed_length_chars**: Masks entities with a fixed length of asterisks (*).

      Example: "Name: Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, E-MAIL: green@gmail.com."  
      -> "Name: `****`, Record date: `****`, # `****`. Dr. `****`, E-MAIL: `****`."

    - **masked_with_chars**: Masks entities with asterisks (*) enclosed in square brackets, so that the masked span keeps the length of the original entity.

      Example: "Name: Hendrickson, Ora, Record date: 2093-01-13, # 719435. Dr. John Green, E-MAIL: green@gmail.com."  
      -> "Name: `[**************]`, Record date: `[********]`, # `[****]`. Dr. `[********]`, E-MAIL: `[*************]`."

- **sep**: `str`

    Separator used to join subparts within each prediction.

    By default, the separator is a single space (" "), but users can specify another separator as needed. A separator is required because the model outputs each prediction as separate subparts, which are joined with the chosen separator to form coherent text.

    The separator must be one of the following characters: space (' '), newline ('\n'), comma (','), tab ('\t'), or colon (':').
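
To make the difference between the masking policies concrete, here is a toy sketch of what each policy does to one already-detected entity. It is not the pipeline's implementation: `mask_entity` is a hypothetical function, its rules are inferred only from the sample inputs and outputs in this notebook (including the assumption that entities of one or two characters are replaced by bare asterisks), and `obfuscated` is left out because it draws random replacement values.

```python
def mask_entity(chunk: str, label: str, policy: str = "masked") -> str:
    """Toy illustration of three masking policies applied to a single entity."""
    if policy == "masked":
        # Replace the entity with its type, e.g. <PATIENT>.
        return f"<{label}>"
    if policy == "masked_fixed_length_chars":
        # Always four asterisks, regardless of the entity length.
        return "****"
    if policy == "masked_with_chars":
        # Keep the original length; the brackets count toward it when there is room.
        if len(chunk) > 2:
            return "[" + "*" * (len(chunk) - 2) + "]"
        return "*" * len(chunk)
    raise ValueError(f"Unsupported policy in this sketch: {policy!r}")


for policy in ("masked", "masked_fixed_length_chars", "masked_with_chars"):
    print(policy, "->", mask_entity("2093-01-13", "DATE", policy))
```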
    
You can specify these parameters in the input as follows:

```json
{
    "text": [
        "Text document 1",
        "Text document 2",
        ...
    ],
    "masking_policy": "masked",
    "sep": " "
}
```
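
Since the endpoint accepts only the policies and separators listed above, it can help to validate a request locally before sending it. The following is a minimal sketch under that assumption; `build_payload`, `ALLOWED_POLICIES`, and `ALLOWED_SEPARATORS` are hypothetical names, not part of the model package.

```python
import json

# Values taken from the parameter descriptions in this section.
ALLOWED_POLICIES = {"masked", "obfuscated", "masked_fixed_length_chars", "masked_with_chars"}
ALLOWED_SEPARATORS = {" ", "\n", ",", "\t", ":"}


def build_payload(text, masking_policy="masked", sep=" "):
    """Build the JSON request body, rejecting values this section does not list."""
    if not isinstance(text, (str, list)):
        raise TypeError("text must be a string or a list of strings")
    if masking_policy not in ALLOWED_POLICIES:
        raise ValueError(f"Unknown masking_policy: {masking_policy!r}")
    if sep not in ALLOWED_SEPARATORS:
        raise ValueError(f"Unsupported separator: {sep!r}")
    return json.dumps({"text": text, "masking_policy": masking_policy, "sep": sep})


print(build_payload(["Text document 1", "Text document 2"], masking_policy="obfuscated"))
```

The returned string could then be passed as the `Body` argument of `invoke_endpoint`.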


### **Input format**: Single Text Document

Provide a single text document as a string.

  
  
```json
{
    "text": "Single text document"
}
```

In [8]:
# masked (default-policy)
data_dicts = [
    {
        "text": sample_text
    }
]

process_data_and_invoke_realtime_endpoint(data_dicts)

Unnamed: 0,predictions
0,"Record date : <DATE>, <DOCTOR>, M.D. is <PROFESSION>, Name: <PATIENT>. Age: <AGE>. Phone: <PHONE>. MR. # <MEDICALRECORD> Date: <DATE> PCP: <DOCTOR>. Record date: <DATE>. <HOSPITAL> <STREET>. This <AGE> male, presented with chest heaviness that started during a pick-up basketball game. His past medical history was unremarkable. <HOSPITAL> denied prior cardiac symptoms and suffered no chest trauma during the game. His father had suffered an acute myocardial infarction at age <AGE>. <PATIENT> was a nonsmoker, did not drink alcohol, and denied recreational drug use. Elvis Presley swallowed a tablet of aspirin before coming to the emergency room. His blood pressure was 160/90 mm Hg, and his heart rate was 80 bpm. Physical examination revealed no stigmata of Marfan syndrome. The rest of his physical examination was normal. DD: <DATE> . DV: <DATE>, <HOSPITAL> <STREET>. PHONE : <PHONE>"


In [9]:
# obfuscated
data_dicts = [
    {
        "text": sample_text,
        "masking_policy": "obfuscated"
    }
]

process_data_and_invoke_realtime_endpoint(data_dicts)

Unnamed: 0,predictions
0,"Record date : 2013-02-02, Layne Benton, M.D. is Insurance risk surveyor, Name: Silvana Newness. Age: 18. Phone: (1) 6109-6045. MR. # 4098119 Date: 02-02-1993 PCP: Roxanne Gates. Record date: 2012-11-29. ALASKA REGIONAL HOSPITAL 2210 Troy Schenectady Rd. This 15-yr-old male, presented with chest heaviness that started during a pick-up basketball game. His past medical history was unremarkable. Silvana Newness denied prior cardiac symptoms and suffered no chest trauma during the game. His father had suffered an acute myocardial infarction at age 36. Silvana Newness was a nonsmoker, did not drink alcohol, and denied recreational drug use. Elvis Presley swallowed a tablet of aspirin before coming to the emergency room. His blood pressure was 160/90 mm Hg, and his heart rate was 80 bpm. Physical examination revealed no stigmata of Marfan syndrome. The rest of his physical examination was normal. DD: 02/02/2013 . DV: 02/02/2013, ALASKA REGIONAL HOSPITAL 2210 Troy Schenectady Rd. PHONE : (+1) 478-295-6213"
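
Note that in both outputs above, the sentence beginning "Elvis Presley swallowed a tablet of aspirin" still contains the original name, so deidentified output is worth spot-checking. A minimal sketch of such a check follows; `find_residual_identifiers` is a hypothetical helper that performs a plain case-insensitive substring search over identifiers you already know, and it is no substitute for a proper review.

```python
def find_residual_identifiers(deidentified_text, known_identifiers):
    """Return the known identifiers that still appear in the deidentified text."""
    lowered = deidentified_text.lower()
    return [item for item in known_identifiers if item.lower() in lowered]


# Shortened excerpt of the masked output above, used here for illustration.
output_snippet = "<PATIENT> was a nonsmoker. Elvis Presley swallowed a tablet of aspirin."
print(find_residual_identifiers(output_snippet, ["Elvis Presley", "David Hale"]))
# prints ['Elvis Presley']
```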


### **Input format**: Array of Text Documents

Use an array containing multiple text documents. Each element represents a separate text document.

```json
{
    "text": [
        "Text document 1",
        "Text document 2",
        ...
    ]
}
```

In [10]:
# masked (default-policy)
data_dicts = [
    {
        "text": docs
    }
]

process_data_and_invoke_realtime_endpoint(data_dicts)

Unnamed: 0,predictions
0,"Mr. <PATIENT>, <AGE>years-old , born in <CITY>, was transfered to the The <HOSPITAL>. Phone number: <PHONE>. MSW: <MEDICALRECORD> . for his colonic polyps. <DOCTOR> wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. <PATIENT> has cut back his cigarettes to one time per week. MR. # <MEDICALRECORD> P: Follow up with Dr. <DOCTOR> in 3 months. <DOCTOR> P. <DOCTOR>, M.D. <LOCATION>. PHONE: <PHONE>"
1,"Record date : <DATE> , <DOCTOR> , M.D . , Name : <PATIENT>. EMAIL: <EMAIL>. MR # <MEDICALRECORD> Phone : <PHONE>. Date : <DATE> PCP : Oliveira , <AGE>years-old. A long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. <NAME> was noted to have a large sacral wound; this is in a similar location with his previous laminectomy, and this continues to receive daily care. DD: <DATE> . DV: <DATE>. <DOCTOR> , M.D. MSW <PHONE>. <LOCATION>. <PHONE>."


In [11]:
# obfuscated
data_dicts = [
    {
        "text": docs,
        "masking_policy": "obfuscated"
    }
]

process_data_and_invoke_realtime_endpoint(data_dicts)

Unnamed: 0,predictions
0,"Mr. Arneta Cliche, 39years-old , born in Ubide, was transfered to the The BLUE RIDGE REGIONAL HOSPITAL, INC. Phone number: (161) 096-0454. MSW: 098119147829 . for his colonic polyps. Arneta Cliche wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. Arneta Cliche has cut back his cigarettes to one time per week. MR. # 56213086 P: Follow up with Dr. Wilhemena Durie in 3 months. Vaughan Sine P. Vanna Scotland, M.D. 2525 Court Drive. PHONE: (+5) 784-696-2952"
1,"Record date : 2022-02-02 , Shaaron Adler , M.D . , Name : De Burrs. EMAIL: Lenny@google.com. MR # 8413244 Phone : (010) 272-5366. Date : 02-02-1993 PCP : Oliveira , 28years-old. A long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. De Burrs was noted to have a large sacral wound; this is in a similar location with his previous laminectomy, and this continues to receive daily care. DD: 02/02/2022 . DV: 02/02/2022. Shaaron Adler , M.D. MSW 440347425956. 22 S Greene St. (38) 756-4332."


### C. Delete the endpoint

Now that you have performed real-time inference, you no longer need the endpoint. Delete it to avoid further charges.

In [12]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 3. Batch inference

In [13]:
import os

validation_file_name_1 = "input_1.json"
validation_file_name_2 = "input_2.json"
validation_file_name_3 = "input_3.json"
validation_file_name_4 = "input_4.json"

validation_input_path = f"s3://{s3_bucket}/{model_name}/validation-input-json/batch"
validation_output_path = f"s3://{s3_bucket}/{model_name}/validation-output-json/batch"

input_dir = 'inputs/batch'
output_dir = 'outputs/batch'

os.makedirs(input_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)

In [14]:
import json

def write_and_upload_to_s3(json_input_data, file_name):

    json_data = json.dumps(json_input_data)

    with open(file_name, "w") as f:
        f.write(json_data)

    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input-json/batch/{os.path.basename(file_name)}",
        Body=(bytes(json_data.encode("UTF-8"))),
    )

In [15]:
# Define input JSON data for each validation file
input_json_data = {
    validation_file_name_1: {"text": docs},
    validation_file_name_2: {"text": docs, "masking_policy": "obfuscated"},
    validation_file_name_3: {"text": docs, "masking_policy": "masked_fixed_length_chars"},
    validation_file_name_4: {"text": docs, "masking_policy": "masked_with_chars"},
}

# Write and upload each input JSON data to S3
for file_name, json_data in input_json_data.items():
    write_and_upload_to_s3(json_data, f"{input_dir}/{file_name}")

In [None]:
# Initialize a SageMaker Transformer object for making predictions
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
)
transformer.transform(validation_input_path, content_type=content_type)
transformer.wait()

In [17]:
from urllib.parse import urlparse

def process_s3_output_and_save(validation_file_name, output_file_name):

    output_file_path = f"{output_dir}/{output_file_name}"
    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}/{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

    data = json.loads(response["Body"].read().decode("utf-8"))
    df = pd.DataFrame(data)
    display(df)

    # Save the data to the output file
    with open(output_file_path, 'w') as f_out:
        json.dump(data, f_out, indent=4)
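
Batch Transform writes one output object per input file, named after the input file with `.out` appended. The function above builds that key from `transformer.output_path` and reads it from `s3_bucket`. As a sketch, the same derivation can be isolated in a helper that also takes the bucket from the URL instead of assuming the default bucket; `batch_output_location` and the example bucket name are hypothetical.

```python
from urllib.parse import urlparse


def batch_output_location(output_path, input_file_name):
    """Split an s3:// output path into (bucket, key) for one transformed input file."""
    parsed = urlparse(output_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Expected an s3:// URL, got {output_path!r}")
    prefix = parsed.path.strip("/")
    file_name = f"{input_file_name}.out"
    key = f"{prefix}/{file_name}" if prefix else file_name
    return parsed.netloc, key


bucket, key = batch_output_location("s3://example-bucket/some/prefix", "input_1.json")
print(bucket, key)
```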

#### masked (default-policy)

In [18]:
process_s3_output_and_save(validation_file_name_1, "out_1.out")

Unnamed: 0,predictions
0,"Mr. <PATIENT>, <AGE>years-old , born in <CITY>, was transfered to the The <HOSPITAL>. Phone number: <PHONE>. MSW: <MEDICALRECORD> . for his colonic polyps. <DOCTOR> wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. <PATIENT> has cut back his cigarettes to one time per week. MR. # <MEDICALRECORD> P: Follow up with Dr. <DOCTOR> in 3 months. <DOCTOR> P. <DOCTOR>, M.D. <LOCATION>. PHONE: <PHONE>"
1,"Record date : <DATE> , <DOCTOR> , M.D . , Name : <PATIENT>. EMAIL: <EMAIL>. MR # <MEDICALRECORD> Phone : <PHONE>. Date : <DATE> PCP : Oliveira , <AGE>years-old. A long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. <NAME> was noted to have a large sacral wound; this is in a similar location with his previous laminectomy, and this continues to receive daily care. DD: <DATE> . DV: <DATE>. <DOCTOR> , M.D. MSW <PHONE>. <LOCATION>. <PHONE>."


#### obfuscated

In [19]:
process_s3_output_and_save(validation_file_name_2, "out_2.out")

Unnamed: 0,predictions
0,"Mr. Arneta Cliche, 39years-old , born in Ubide, was transfered to the The BLUE RIDGE REGIONAL HOSPITAL, INC. Phone number: (161) 096-0454. MSW: 098119147829 . for his colonic polyps. Arneta Cliche wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. Arneta Cliche has cut back his cigarettes to one time per week. MR. # 56213086 P: Follow up with Dr. Wilhemena Durie in 3 months. Vaughan Sine P. Vanna Scotland, M.D. 2525 Court Drive. PHONE: (+5) 784-696-2952"
1,"Record date : 2022-02-02 , Shaaron Adler , M.D . , Name : De Burrs. EMAIL: Lenny@google.com. MR # 1610960 Phone : (454) 098-1191. Date : 02-02-1993 PCP : Oliveira , 28years-old. A long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. De Burrs was noted to have a large sacral wound; this is in a similar location with his previous laminectomy, and this continues to receive daily care. DD: 02/02/2022 . DV: 02/02/2022. Shaaron Adler , M.D. MSW 478295621308. 22 S Greene St. (65) 784-6962."


#### masked_fixed_length_chars

In [20]:
process_s3_output_and_save(validation_file_name_3, "out_3.out")

Unnamed: 0,predictions
0,"Mr. ****, ****years-old , born in ****, was transfered to the The ****. Phone number: ****. MSW: **** . for his colonic polyps. **** wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. **** has cut back his cigarettes to one time per week. MR. # **** P: Follow up with Dr. **** in 3 months. **** P. ****, M.D. ****. PHONE: ****"
1,"Record date : **** , **** , M.D . , Name : ****. EMAIL: ****. MR # **** Phone : ****. Date : **** PCP : Oliveira , ****years-old. A long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. **** was noted to have a large sacral wound; this is in a similar location with his previous laminectomy, and this continues to receive daily care. DD: **** . DV: ****. **** , M.D. MSW ****. ****. ****."


#### masked_with_chars

In [21]:
process_s3_output_and_save(validation_file_name_4, "out_4.out")

Unnamed: 0,predictions
0,"Mr. [************], **years-old , born in [*****], was transfered to the The [********************]. Phone number: [************]. MSW: [**********] . for his colonic polyps. [************] wants to know the results from them. He is not taking hydrochlorothiazide and is curious about his blood pressure. He said he has cut his alcohol back to 6 pack once a week. [************] has cut back his cigarettes to one time per week. MR. # [******] P: Follow up with Dr. [**********] in 3 months. [*****] P. [***], M.D. [************************************]. PHONE: [***************]"
1,"Record date : [********] , [********] , M.D . , Name : [*******]. EMAIL: [***************]. MR # [*****] Phone : [************]. Date : [********] PCP : Oliveira , **years-old. A long history of insulin dependent diabetes, type 2; coronary artery disease; chronic renal insufficiency; peripheral vascular disease, also secondary to diabetes; who was originally admitted to an outside hospital for what appeared to be acute paraplegia, lower extremities. [*******] was noted to have a large sacral wound; this is in a similar location with his previous laminectomy, and this continues to receive daily care. DD: [********] . DV: [********]. [********] , M.D. MSW [**********]. [**************************************************]. [***********]."


In [22]:
model.delete_model()

INFO:sagemaker:Deleting model with name: en-de-identify-clinical-pipeline-2024-03-28-22-44-47-226


### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

