## 1. Subscribe to the model package

To subscribe to the model package:
1. Open the model package listing page: [Clinical De-identification for German](https://aws.amazon.com/marketplace/pp/prodview-zjyeemhncdsiu).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and assign it to `model_package_arn` in the cell below.

## Clinical Deidentification German

- **Model**: [clinical_deidentification_docwise_wip_de](https://nlp.johnsnowlabs.com/2024/11/04/clinical_deidentification_docwise_wip_de.html)
- **Model Description**: This pipeline de-identifies protected health information (PHI) in German medical texts. Depending on the selected masking policy, PHI entities are masked or obfuscated in the resulting text.
The pipeline can mask or obfuscate the following entities: `LOCATION`, `DATE`, `NAME`, `ID`, `AGE`, `PROFESSION`, `CONTACT`, `ORGANIZATION`, `DOCTOR`, `CITY`, `COUNTRY`, `STREET`, `PATIENT`, `PHONE`, `HOSPITAL`, `STATE`, `DLN`, `SSN`, `ZIP`, `ACCOUNT`, `LICENSE`, `PLATE`, `VIN`, `MEDICALRECORD`, `EMAIL`, `URL`.

In [1]:
model_package_arn = "<Customer to specify Model package ARN corresponding to their AWS region>"
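
Before moving on, it can help to confirm that the value pasted above looks like a model package ARN and points at the intended region. The following is a minimal sketch, assuming the general SageMaker layout `arn:<partition>:sagemaker:<region>:<account-id>:model-package/<name>`. The helper name `parse_model_package_arn` and the example ARN are hypothetical and not part of the original notebook.

```python
import re

# Assumed general shape of a SageMaker model package ARN:
# arn:<partition>:sagemaker:<region>:<account-id>:model-package/<name>
ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:(?P<region>[a-z0-9-]+):"
    r"(?P<account>\d{12}):model-package/(?P<name>.+)$"
)


def parse_model_package_arn(arn):
    """Return the ARN's components as a dict, or raise ValueError on a mismatch."""
    match = ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"Not a model package ARN: {arn!r}")
    return match.groupdict()


# Hypothetical ARN for illustration only; it is not a real product ARN.
example_arn = "arn:aws:sagemaker:eu-central-1:123456789012:model-package/example-package"
parts = parse_model_package_arn(example_arn)
print(parts["region"], parts["name"])
```

Comparing `parts["region"]` with the session's region before deployment is one way to catch an ARN copied for the wrong region early.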

In [None]:
import json
import os
import boto3
import pandas as pd
import sagemaker as sage
from sagemaker import ModelPackage
from sagemaker import get_execution_role
from IPython.display import display
from urllib.parse import urlparse

In [None]:
sagemaker_session = sage.Session()
s3_bucket = sagemaker_session.default_bucket()
region = sagemaker_session.boto_region_name
account_id = boto3.client("sts").get_caller_identity().get("Account")
role = get_execution_role()

sagemaker = boto3.client("sagemaker")
s3_client = sagemaker_session.boto_session.client("s3")
ecr = boto3.client("ecr")
sm_runtime = boto3.client("sagemaker-runtime")

# Set display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [4]:
model_name = "clinical-deidentification-docwise-wip-de"

real_time_inference_instance_type = "ml.m4.xlarge"
batch_transform_inference_instance_type = "ml.m4.2xlarge"

## 2. Create a deployable model from the model package

In [5]:
model = ModelPackage(
    role=role, 
    model_package_arn=model_package_arn,
    sagemaker_session=sagemaker_session,
)

### Input Format

To use the model, you need to provide input in one of the following supported formats:

#### JSON Format

Provide input as JSON. We support two variations within this format:

1. **Array of Text Documents**: 
   Use an array containing multiple text documents. Each element represents a separate text document.

   ```json
   {
       "text": [
           "Text document 1",
           "Text document 2",
           ...
       ]
   }
   ```

2. **Single Text Document**:
   Provide a single text document as a string.


   ```json
   {
       "text": "Single text document"
   }
   ```

#### JSON Lines (JSONL) Format

Provide input in JSON Lines format, where each line is a JSON object representing a text document.

```
{"text": "Text document 1"}
{"text": "Text document 2"}
```

### Important Parameters

- **masking_policy**: `str`

    Users can select a masking policy to determine how sensitive entities are handled:

    Example: "**Dr. Hans-Wolfgang Weihmann - RM57, Städt Klinikum Dresden-Friedrichstadt, Friedrichstraße 41, Dresden**"

    - **masked**: Default policy that masks entities with their type.

      -> 'Dr.  `<PATIENT>` - `<USERNAME>`, `<HOSPITAL>`, `<STREET>`, `<CITY>`'

    - **obfuscated**: Replaces sensitive entities with random values of the same type.

      -> 'Dr.  `Karl-August Blümel` - `RP400`, `University Hospital Cologne`, `Fadime-Pölitz-Allee`, `Böblingen`'
      
You can specify this parameter in the input as follows:

```json
{
    "text": [
        "Text document 1",
        "Text document 2",
        ...
    ],
    "masking_policy": "masked"
}
```
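
As a rough illustration of the input formats and the `masking_policy` parameter described above, the sketch below builds a request body and rejects unknown policy values before anything is sent. The helper name `build_payload` and the client-side validation are assumptions added for illustration. They are not part of the model package's API.

```python
import json

# The two policies described in this section.
ALLOWED_POLICIES = {"masked", "obfuscated"}


def build_payload(texts, masking_policy=None):
    """Build a request dict; `texts` may be a single string or a list of strings."""
    if not isinstance(texts, str):
        texts = list(texts)
        if not all(isinstance(t, str) for t in texts):
            raise TypeError("every document must be a string")

    payload = {"text": texts}

    # Omitting the policy leaves the choice to the endpoint's default ("masked").
    if masking_policy is not None:
        if masking_policy not in ALLOWED_POLICIES:
            raise ValueError(f"unknown masking_policy: {masking_policy!r}")
        payload["masking_policy"] = masking_policy
    return payload


payload = build_payload(["Text document 1", "Text document 2"], "obfuscated")
# ensure_ascii=False keeps German umlauts readable in the serialized body.
print(json.dumps(payload, ensure_ascii=False))
```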


## 3. Create an endpoint and perform real-time inference

To understand how real-time inference with Amazon SageMaker works, see the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

### A. Deploy the SageMaker model to an endpoint

In [None]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type=real_time_inference_instance_type, 
    endpoint_name=model_name,
)

Once the endpoint has been created, you can perform real-time inference.

In [7]:
def invoke_realtime_endpoint(record, content_type="application/json", accept="application/json"):
    """Send one request to the endpoint and return the decoded response."""
    response = sm_runtime.invoke_endpoint(
        EndpointName=model_name,
        ContentType=content_type,
        Accept=accept,
        # JSON input is a dict and must be serialized; JSON Lines input is already a string.
        Body=json.dumps(record) if content_type == "application/json" else record,
    )

    response_body = response["Body"].read().decode("utf-8")

    # JSON responses are parsed into Python objects; JSON Lines responses are returned as raw text.
    if accept == "application/json":
        return json.loads(response_body)
    elif accept == "application/jsonlines":
        return response_body
    else:
        raise ValueError(f"Unsupported accept type: {accept}")

### Initial Setup

In [8]:
docs = [
    '''Dr. Hans-Wolfgang Weihmann - RM57, Städt Klinikum Dresden-Friedrichstadt, Friedrichstraße 41, Dresden''',

    '''Er arbeitete bis 24.08.1940 - Gärtner bei Planten un Blomen in Hamburg, verbrannte sich an beiden Beinen - entwickelte Geschwüre. Der Patient konsultierte Dr. Klein im September.''',
    ]


sample_text = """Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus ei... 
Zusammenfassung : Michael Berger wird am Morgen des 12 Dezember 2018 ins St.Elisabeth Krankenhaus eingeliefert. Herr Michael Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.Persönliche Daten :ID-Nummer: T0110053F Platte A-BC124 Kontonummer: DE89370400440532013000 SSN : 13110587M565 Lizenznummer: B072RRE2I55 Adresse : St.Johann-Straße 13 19300 """

### JSON

#### Example 1: masked (default-policy)

In [9]:
input_json_data = {"text": sample_text}
response_json = invoke_realtime_endpoint(input_json_data, content_type="application/json", accept="application/json")
pd.DataFrame(response_json)

Unnamed: 0,predictions
0,Zusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> ei... \nZusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert. Herr <NAME> <NAME> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.Persönliche Daten :ID-Nummer: <ID> Platte <PLATE> Kontonummer: <ACCOUNT> SSN : <SSN>Lizenznummer: <DLN> Adresse : <STREET> <ZIP>
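
To get a quick overview of what was masked, the placeholders in the output can be tallied by entity type. This is a minimal sketch that assumes placeholders always have the form `<LABEL>` with uppercase letters only, as in the output above. The helper name `count_masked_entities` is hypothetical, and any text in the original document that happens to match the same pattern would be counted as well.

```python
import re
from collections import Counter

# Assumed placeholder shape, based on the sample output: <PATIENT>, <DATE>, ...
PLACEHOLDER = re.compile(r"<([A-Z]+)>")


def count_masked_entities(masked_text):
    """Count how often each entity label appears as a placeholder."""
    return Counter(PLACEHOLDER.findall(masked_text))


# A short fragment in the style of the masked output shown above.
example = "<PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert. Herr <NAME> <NAME> ist <AGE> Jahre alt"
counts = count_masked_entities(example)
print(counts.most_common())
```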


#### Example 2: obfuscated

In [11]:
input_json_data = {"text": sample_text, "masking_policy": "obfuscated"}
response_json = invoke_realtime_endpoint(input_json_data, content_type="application/json", accept="application/json")
pd.DataFrame(response_json)

Unnamed: 0,predictions
0,Zusammenfassung : Natalja Köhn wird am Morgen des 10 Februar 2019 ins Klinikum Osnabrück ei... \nZusammenfassung : Natalja Köhn wird am Morgen des 10 Februar 2019 ins Klinikum Osnabrück eingeliefert. Herr Sabine Beatrice ist 71 Jahre alt und hat zu viel Wasser in den Beinen.Persönliche Daten :ID-Nummer: H2112253V Platte S-PQ146 Kontonummer: NO43716066006978657666 SSN : 68667052J030Lizenznummer: N496XXU6I77 Adresse : Scheuermannstr. 33 64877


### JSON Lines

In [13]:
def create_jsonl(records, masking_policy=None):
    json_records = []

    if isinstance(records, str):
        records = [records]

    for text in records:
        record = {"text": text}

        if masking_policy is not None:
            record["masking_policy"] = masking_policy
        json_records.append(record)

    json_lines = '\n'.join(json.dumps(record, ensure_ascii=False) for record in json_records)
    return json_lines
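

Responses requested with `accept="application/jsonlines"` come back as raw text with one JSON object per line, so the reverse step is often useful too. The sketch below parses such a string into a list of predictions. It assumes each line is an object with a `predictions` key, as in the outputs shown in this notebook. The helper name `parse_jsonl_predictions` and the sample body are illustrative.

```python
import json


def parse_jsonl_predictions(body):
    """Parse a JSON Lines string into a list of prediction strings."""
    predictions = []
    # Split on "\n" only: newlines inside JSON strings are escaped, so every
    # physical line holds exactly one object.
    for line in body.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines such as a trailing newline
        predictions.append(json.loads(line)["predictions"])
    return predictions


# Illustrative body in the shape of a JSON Lines response.
example_body = (
    '{"predictions": "Dr. <NAME> Weihmann - <USERNAME>, <HOSPITAL>, <STREET>, <CITY>"}\n'
    '{"predictions": "Er arbeitete bis <DATE>."}\n'
)
parsed = parse_jsonl_predictions(example_body)
print(len(parsed))
```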


#### Example 1: masked (default-policy)

In [14]:
input_jsonl_data = create_jsonl(sample_text, masking_policy="masked")
data = invoke_realtime_endpoint(input_jsonl_data, content_type="application/jsonlines" , accept="application/jsonlines" )
print(data)

{"predictions": "Zusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> ei... \nZusammenfassung : <PATIENT> wird am Morgen des <DATE> ins <HOSPITAL> eingeliefert. Herr <NAME> <NAME> ist <AGE> Jahre alt und hat zu viel Wasser in den Beinen.Persönliche Daten :ID-Nummer: <ID> Platte <PLATE> Kontonummer: <ACCOUNT> SSN : <SSN>Lizenznummer: <DLN> Adresse : <STREET> <ZIP> "}


#### Example 2: obfuscated

In [16]:
input_jsonl_data = create_jsonl(sample_text, masking_policy="obfuscated")
data = invoke_realtime_endpoint(input_jsonl_data, content_type="application/jsonlines" , accept="application/jsonlines" )
print(data)

{"predictions": "Zusammenfassung : Natalja Köhn wird am Morgen des 10 Februar 2019 ins Klinikum Osnabrück ei... \nZusammenfassung : Natalja Köhn wird am Morgen des 10 Februar 2019 ins Klinikum Osnabrück eingeliefert. Herr Sabine Beatrice ist 71 Jahre alt und hat zu viel Wasser in den Beinen.Persönliche Daten :ID-Nummer: H2112253V Platte S-PQ146 Kontonummer: NO43716066006978657666 SSN : 68667052J030Lizenznummer: N496XXU6I77 Adresse : Scheuermannstr. 33 64877 "}


### B. Delete the endpoint

Now that you have successfully performed real-time inference, you no longer need the endpoint. Delete it to avoid further charges.

In [None]:
model.sagemaker_session.delete_endpoint(model_name)
model.sagemaker_session.delete_endpoint_config(model_name)

## 4. Batch inference

In [19]:
validation_input_json_path = f"s3://{s3_bucket}/{model_name}/validation-input/json/"
validation_output_json_path = f"s3://{s3_bucket}/{model_name}/validation-output/json/"

validation_input_jsonl_path = f"s3://{s3_bucket}/{model_name}/validation-input/jsonl/"
validation_output_jsonl_path = f"s3://{s3_bucket}/{model_name}/validation-output/jsonl/"

def upload_to_s3(input_data, file_name):
    file_format = os.path.splitext(file_name)[1].lower()
    s3_client.put_object(
        Bucket=s3_bucket,
        Key=f"{model_name}/validation-input/{file_format[1:]}/{file_name}",
        Body=input_data.encode("UTF-8"),
    )

In [20]:
# Create JSON and JSON Lines data
input_json_data = {
    "input1.json": json.dumps({"text": docs, "masking_policy": "masked"}, ensure_ascii=False),
    "input2.json": json.dumps({"text": docs, "masking_policy": "obfuscated"}, ensure_ascii=False),
}

input_jsonl_data = {
    "input1.jsonl": create_jsonl(docs, masking_policy="masked"),
    "input2.jsonl": create_jsonl(docs, masking_policy="obfuscated"),
}

# Upload JSON and JSON Lines data to S3
for file_name, data in input_json_data.items():
    upload_to_s3(data, file_name)

for file_name, data in input_jsonl_data.items():
    upload_to_s3(data, file_name)


### JSON

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/json",
    output_path=validation_output_json_path
)

transformer.transform(validation_input_json_path, content_type="application/json")
transformer.wait()

In [None]:
def retrieve_json_output_from_s3(validation_file_name):
    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

    data = json.loads(response["Body"].read().decode("utf-8"))
    display(data)

In [23]:
masking_policies = {
    "masked": "input1.json",
    "obfuscated": "input2.json",
}

for policy_name, validation_file_name in masking_policies.items():
    print(f"Masking policy: {policy_name}")
    retrieve_json_output_from_s3(validation_file_name)
    print("\n")

Masking policy: masked


{'predictions': ['Dr. <NAME> Weihmann - <USERNAME>, <HOSPITAL>, <STREET>, <CITY>',
  'Er arbeitete bis <DATE> - <PROFESSION> bei <ORGANIZATION> in <STATE>, verbrannte sich an beiden Beinen - entwickelte Geschwüre. Der Patient konsultierte Dr. <NAME> im <NAME>.']}



Masking policy: obfuscated


{'predictions': ['Dr. Benjamin Weihmann - Fred.Ewald, Sankt Maria Krankenhaus Dresden, Henkring 112, Bruchsal',
  'Er arbeitete bis 23.10.1940 - Laborant bei Deutsche Bank AG in Bremen, verbrannte sich an beiden Beinen - entwickelte Geschwüre. Der Patient konsultierte Dr. Alida im Imre.']}





### JSON Lines

In [None]:
transformer = model.transformer(
    instance_count=1,
    instance_type=batch_transform_inference_instance_type,
    accept="application/jsonlines",
    output_path=validation_output_jsonl_path
)
transformer.transform(validation_input_jsonl_path, content_type="application/jsonlines")
transformer.wait()

In [None]:
def retrieve_jsonlines_output_from_s3(validation_file_name):

    parsed_url = urlparse(transformer.output_path)
    file_key = f"{parsed_url.path[1:]}{validation_file_name}.out"
    response = s3_client.get_object(Bucket=s3_bucket, Key=file_key)

    data = response["Body"].read().decode("utf-8")
    print(data)

In [26]:
masking_policies = {
    "masked": "input1.jsonl",
    "obfuscated": "input2.jsonl",
}

for policy_name, validation_file_name in masking_policies.items():
    print(f"Masking policy: {policy_name}")
    retrieve_jsonlines_output_from_s3(validation_file_name)
    print("\n")

Masking policy: masked
{"predictions": "Dr. <NAME> Weihmann - <USERNAME>, <HOSPITAL>, <STREET>, <CITY>"}
{"predictions": "Er arbeitete bis <DATE> - <PROFESSION> bei <ORGANIZATION> in <STATE>, verbrannte sich an beiden Beinen - entwickelte Geschwüre. Der Patient konsultierte Dr. <NAME> im <NAME>."}


Masking policy: obfuscated
{"predictions": "Dr. Benjamin Weihmann - Fred.Ewald, Sankt Maria Krankenhaus Dresden, Henkring 112, Bruchsal"}
{"predictions": "Er arbeitete bis 23.10.1940 - Laborant bei Deutsche Bank AG in Bremen, verbrannte sich an beiden Beinen - entwickelte Geschwüre. Der Patient konsultierte Dr. Alida im Imre."}




In [None]:
model.delete_model()

### Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

