# PII Detection and Redaction for Compliance and Control

In this notebook, we will extract the text from a document using Amazon Textract and then use Amazon Comprehend to detect PII in that text. We will then use a Python function to redact the corresponding portion of the image.
Here is the conceptual architecture flow:

![alt-text](piiredact.png)

You can automate the entire end-to-end flow by using AWS Step Functions and AWS Lambda for orchestration.
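
One way to picture that orchestration is as an Amazon States Language definition in which each notebook step becomes a Lambda task. The sketch below builds such a definition as a Python dictionary. The state names, the Lambda function names and ARNs, and the `$.jobStatus` field are all hypothetical placeholders, not part of this notebook.

```python
import json

# Hypothetical Lambda ARN template: replace with functions that wrap each notebook step.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:{}"


def build_state_machine(poll_seconds=60):
    """Return an Amazon States Language definition for the PII redaction flow."""
    return {
        "Comment": "Sketch of the Textract -> Comprehend PII redaction flow",
        "StartAt": "ExtractText",
        "States": {
            "ExtractText": {
                "Type": "Task",
                "Resource": LAMBDA_ARN.format("extract-text"),
                "Next": "StartRedactionJob",
            },
            "StartRedactionJob": {
                "Type": "Task",
                "Resource": LAMBDA_ARN.format("start-redaction-job"),
                "Next": "WaitForJob",
            },
            # Pause between status checks instead of polling in a tight loop.
            "WaitForJob": {
                "Type": "Wait",
                "Seconds": poll_seconds,
                "Next": "CheckJobStatus",
            },
            "CheckJobStatus": {
                "Type": "Task",
                "Resource": LAMBDA_ARN.format("check-job-status"),
                "Next": "IsJobDone",
            },
            # Assumes the status function returns {"jobStatus": "..."}.
            "IsJobDone": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.jobStatus", "StringEquals": "COMPLETED", "Next": "Done"},
                    {"Variable": "$.jobStatus", "StringEquals": "FAILED", "Next": "JobFailed"},
                ],
                "Default": "WaitForJob",
            },
            "JobFailed": {
                "Type": "Fail",
                "Error": "PiiJobFailed",
                "Cause": "The PII detection job did not complete.",
            },
            "Done": {"Type": "Succeed"},
        },
    }


definition = build_state_machine()
print(json.dumps(definition, indent=2))
```

The wait-then-check loop mirrors the status polling done later in this notebook, and the printed JSON is the kind of document a state machine definition expects.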

We will walk you through the following steps:

## Step 1: Set up and install libraries
## Step 2: Extract text from sample document
## Step 3: Save the extracted text into text/csv file and upload to Amazon S3 bucket
## Step 4: Check for PII using Amazon Comprehend Detect PII Sync API
## Step 5: Mask PII using Amazon Comprehend PII Analysis Job
## Step 6: View the redacted/masked output in Amazon S3 Bucket


# Let's start with Step 1: Set up and install libraries

In [None]:
!pip install amazon-textract-response-parser

In [None]:
import pandas as pd
import webbrowser, os
import json
import boto3
import re
import sagemaker
from sagemaker import get_execution_role
from sagemaker.s3 import S3Uploader, S3Downloader
import uuid
import time
import io
from io import BytesIO
import sys
from pprint import pprint

from IPython.display import Image, display
from PIL import Image as PImage, ImageDraw

In [None]:

region = boto3.Session().region_name

role = get_execution_role()
print(role)

bucket = sagemaker.Session().default_bucket()

prefix = "pii-detection-redaction"
bucket_path = "https://s3.{}.amazonaws.com/{}".format(region, bucket)


# Step 2: Extract text from sample document

In [None]:
# Document
documentName = "bankstatement.JPG"

display(Image(filename=documentName))

In [None]:
# Create a Textract client (the region is hard-coded to us-east-1 here)
client = boto3.client(
    service_name='textract',
    region_name='us-east-1',
    endpoint_url='https://textract.us-east-1.amazonaws.com',
)

# Read the image as raw bytes
with open(documentName, 'rb') as file:
    img_test = file.read()
    bytes_test = bytearray(img_test)
    print('Image loaded', documentName)

# Process using image bytes
response = client.detect_document_text(Document={'Bytes': bytes_test})
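
The introduction mentions redacting the matching region of the image. Textract returns a `Geometry.BoundingBox` for each block, with `Left`, `Top`, `Width`, and `Height` expressed as ratios of the page size, so a first step toward image redaction is converting those ratios to pixels. The following is a minimal sketch of that conversion. The helper names and the hand-written sample response are hypothetical and only illustrate the shape of the data.

```python
def bbox_to_pixels(bounding_box, image_width, image_height):
    """Convert a ratio-based BoundingBox to (left, top, right, bottom) in pixels."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return left, top, right, bottom


def line_rectangles(textract_response, image_width, image_height):
    """Return (text, pixel_rectangle) pairs for every LINE block in the response."""
    rectangles = []
    for block in textract_response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            box = block["Geometry"]["BoundingBox"]
            rectangles.append((block["Text"], bbox_to_pixels(box, image_width, image_height)))
    return rectangles


# Hand-written stand-in for a Textract response (illustrative values only).
sample_response = {
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}},
        },
        {
            "BlockType": "LINE",
            "Text": "Account 1234",
            "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}},
        },
    ]
}

rects = line_rectangles(sample_response, image_width=800, image_height=400)
print(rects)
```

A rectangle in this form could then be passed to Pillow, for example `ImageDraw.Draw(img).rectangle(rect, fill="red")`, to paint over the region. Deciding which lines actually contain PII is a separate step.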


In [None]:
# Extract the text lines
# Iterate over the pages and lines in the document
from trp import Document

doc = Document(response)
lines = []
for page in doc.pages:
    for line in page.lines:
        lines.append(str(line.text))
# Join with spaces so that words on adjacent lines do not run together
page_string = ' '.join(lines)
print(page_string)
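
Amazon Comprehend reports each detected entity as a pair of character offsets into the text it was given. If you later want to trace an entity back to the Textract line it came from, it helps to record where each line landed in the joined string. Here is a minimal sketch of that bookkeeping. The function names are hypothetical, and the sample lines are made up.

```python
def join_lines_with_spans(lines, separator=" "):
    """Join line texts and record the (start, end) character span of each line."""
    parts, spans, cursor = [], [], 0
    for text in lines:
        if parts:
            # Account for the separator that precedes every line but the first.
            cursor += len(separator)
        spans.append((cursor, cursor + len(text)))
        parts.append(text)
        cursor += len(text)
    return separator.join(parts), spans


def lines_overlapping(spans, begin, end):
    """Return indexes of lines whose span overlaps the half-open range [begin, end)."""
    return [i for i, (start, stop) in enumerate(spans) if start < end and begin < stop]


sample_lines = ["John Doe", "123 Main St", "Balance 500"]
joined_text, line_spans = join_lines_with_spans(sample_lines)
print(joined_text)
print(line_spans)
print(lines_overlapping(line_spans, 9, 20))
```

Pass `separator=""` if the text sent to Comprehend was built by concatenating lines without any separator, so that the recorded spans match the offsets Comprehend returns.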

# Step 3: Save the extracted text into text/csv file and upload to Amazon S3 bucket

In [None]:
# Let's write the extracted text to a text file, one page per line
text_filename = 'pii_data.txt'
doc = Document(response)
with open(text_filename, 'w', encoding='utf-8') as f:
    for page in doc.pages:
        # Join with spaces so that words on adjacent lines do not run together
        page_string = ' '.join(str(line.text) for line in page.lines)
        f.write(page_string + "\n")

In [None]:
# Load the documents locally for later analysis
with open(text_filename, "r", encoding='utf-8') as fi:
    raw_texts = [line.strip() for line in fi.readlines()]

In [None]:
import boto3

# Upload the text file under the project prefix in the default bucket
s3 = boto3.resource('s3')
s3.Bucket(bucket).upload_file(text_filename, prefix + "/" + text_filename)

# Step 4: Check for PII using Amazon Comprehend Detect PII Sync API

In [None]:
comprehend = boto3.client(service_name='comprehend')

In [None]:
# Call Amazon Comprehend and pass it the aggregated text from our image.
piilist = comprehend.detect_pii_entities(Text=page_string, LanguageCode='en')
redacted_box_color = 'red'
dpi = 72
pii_detection_threshold = 0.00
print('Finding PII text...')
not_redacted = 0
redacted = 0
for pii in piilist['Entities']:
    print(pii['Type'])
    if pii['Score'] > pii_detection_threshold:
        print("Detected as type '" + pii['Type'] + "' and will be redacted.")
        redacted += 1
    else:
        print("Detected as type '" + pii['Type'] + "', but did not meet the confidence score threshold and will not be redacted.")
        not_redacted += 1

print("Found", redacted, "PII entities to redact.")
print(not_redacted, "additional PII entities were detected, but did not meet the confidence score threshold.")
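
Each entity in the response carries `Type`, `Score`, `BeginOffset`, and `EndOffset`, so you can preview what masking looks like locally before running the asynchronous job. The sketch below is only an illustration of masking by offsets, not a description of how the Comprehend job works internally. The function name, the 0.5 threshold, and the sample entities are assumptions made for the example.

```python
def mask_entities(text, entities, threshold=0.5, mask_char="*"):
    """Replace every entity scoring at or above the threshold with mask characters."""
    chars = list(text)
    for entity in entities:
        if entity["Score"] >= threshold:
            for i in range(entity["BeginOffset"], entity["EndOffset"]):
                chars[i] = mask_char
    return "".join(chars)


# Hand-written entities in the shape of a detect_pii_entities response (illustrative).
sample_text = "John Doe 123 Main St"
sample_entities = [
    {"Type": "NAME", "Score": 0.99, "BeginOffset": 0, "EndOffset": 8},
    {"Type": "ADDRESS", "Score": 0.30, "BeginOffset": 9, "EndOffset": 20},
]

masked = mask_entities(sample_text, sample_entities)
print(masked)
```

With the default threshold only the name is masked, because the address entity scores below 0.5. Lowering the threshold to 0.0 masks both, which matches the permissive threshold used in the cell above.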

# Step 5: Mask PII using Amazon Comprehend PII Analysis Job

We will use the StartPiiEntitiesDetectionJob API, which starts an asynchronous PII entity detection job for a collection of documents.

We will use this API to detect and redact PII in pii_data.txt, the file we inspected above.


In [None]:
import uuid

# Build the input and output S3 URIs and a unique job name
InputS3URI = "s3://" + bucket + "/pii-detection-redaction/pii_data.txt"
print(InputS3URI)
OutputS3URI = "s3://" + bucket + "/pii-detection-redaction"
print(OutputS3URI)
job_uuid = uuid.uuid1()
job_name = f"pii-job-{job_uuid}"

# Adding Amazon Comprehend as an additional trusted entity to this role

This step is needed if you want to pass the execution role of this notebook when calling Comprehend APIs, without creating an additional role.

On the IAM dashboard, click Roles in the left navigation pane and search for this role. Once the role appears, click it to go to its Summary page. Then click the Trust relationships tab on the Summary page to add Amazon Comprehend as an additional trusted entity.

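Click on **Edit trust relationship** and replace the JSON with the JSON below. It keeps Amazon SageMaker as a trusted service, so this notebook can still assume the role, and adds Amazon Comprehend alongside it.
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "sagemaker.amazonaws.com",
          "comprehend.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```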

Once this is complete, click on Update Trust Policy and you are done.
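
If you would rather add the principal programmatically without dropping the services the role already trusts, the trust policy can be treated as a plain dictionary and merged. The following is a minimal sketch of that merge. The function name is hypothetical, and the `existing` policy is an assumed example of what a notebook execution role might start with.

```python
import copy
import json


def add_trusted_service(trust_policy, service):
    """Return a copy of the trust policy in which `service` may assume the role."""
    policy = copy.deepcopy(trust_policy)
    for statement in policy["Statement"]:
        principal = statement.get("Principal", {})
        is_assume_role = (
            statement.get("Effect") == "Allow"
            and statement.get("Action") == "sts:AssumeRole"
            and "Service" in principal
        )
        if is_assume_role:
            services = principal["Service"]
            # "Service" may be a single string or a list of strings.
            if isinstance(services, str):
                services = [services]
            if service not in services:
                services = services + [service]
            principal["Service"] = services
            return policy
    # No matching statement was found, so append a dedicated one.
    policy["Statement"].append(
        {"Effect": "Allow", "Principal": {"Service": [service]}, "Action": "sts:AssumeRole"}
    )
    return policy


# Assumed starting point: a role that trusts only SageMaker.
existing = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

updated = add_trusted_service(existing, "comprehend.amazonaws.com")
print(json.dumps(updated, indent=2))
```

With sufficient IAM permissions, the current policy could be read through the IAM client's `get_role` call and the merged result written back with `update_assume_role_policy(RoleName=..., PolicyDocument=json.dumps(updated))`.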

In [None]:
role_name = role[role.rfind('/') + 1:]
print("https://console.aws.amazon.com/iam/home?region={0}#/roles/{1}".format(region, role_name))


In [None]:

# Start the asynchronous job that masks every PII entity type with '*'
response = comprehend.start_pii_entities_detection_job(
    InputDataConfig={
        'S3Uri': InputS3URI,
        'InputFormat': 'ONE_DOC_PER_FILE'
    },
    OutputDataConfig={
        'S3Uri': OutputS3URI
    },
    Mode='ONLY_REDACTION',
    RedactionConfig={
        'PiiEntityTypes': [
            'ALL',
        ],
        'MaskMode': 'MASK',
        'MaskCharacter': '*'
    },
    DataAccessRoleArn=role,
    JobName=job_name,
    LanguageCode='en',
)

In [None]:
# Get the job ID
events_job_id = response['JobId']
job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)
print(job)


The job takes roughly 6-7 minutes.
The code below checks the status of the job.
The cell finishes executing once the job has completed.
If the job fails, you can check the logs and status in the AWS Console at https://console.aws.amazon.com/comprehend/v2/home?region=us-east-1#analysis
and try rerunning the job if you get this failure reason:
    NO_WRITE_ACCESS_TO_OUTPUT: The provided data access role does not have write access to the output S3 URI.
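
One possible cause of that failure reason is that the data access role lacks Amazon S3 permissions on the output location. The sketch below builds a permissions policy that allows reading and writing objects under a single prefix and listing the bucket. The function name and the bucket name are hypothetical, and whether your role needs such a policy depends on what is already attached to it.

```python
import json


def s3_access_policy(bucket_name, key_prefix):
    """Build an IAM policy allowing read/write under one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access for the job's input and output files.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/{key_prefix}/*",
            },
            {
                # Bucket-level permission needed to list the input location.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
        ],
    }


# "my-example-bucket" is a placeholder bucket name.
policy = s3_access_policy("my-example-bucket", "pii-detection-redaction")
print(json.dumps(policy, indent=2))
```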

In [None]:
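from time import sleep
# Get current job status
job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)
print(job)
# Loop until the job is completed, stopping early if it fails or is stopped
waited = 0
timeout_minutes = 10
while job['PiiEntitiesDetectionJobProperties']['JobStatus'] != 'COMPLETED':
    status = job['PiiEntitiesDetectionJobProperties']['JobStatus']
    if status in ('FAILED', 'STOP_REQUESTED', 'STOPPED'):
        message = job['PiiEntitiesDetectionJobProperties'].get('Message', '')
        raise RuntimeError("Job ended with status %s: %s" % (status, message))
    sleep(60)
    waited += 60
    if waited // 60 >= timeout_minutes:
        raise TimeoutError("Job timed out after %d seconds." % waited)
    job = comprehend.describe_pii_entities_detection_job(JobId=events_job_id)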

In [None]:
print(response)

# Step 6: View the redacted/masked output in Amazon S3 Bucket

In [None]:

# The output filename is the input filename + ".out"
from urllib.parse import urlparse

s3_client = boto3.client(service_name='s3')
filename = "pii_data.txt"
output_data_s3_file = job['PiiEntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri'] + filename + '.out'
print(output_data_s3_file)

# Take the object key from the s3:// URI instead of indexing path segments by position
output_data_s3_filepath = urlparse(output_data_s3_file).path.lstrip('/')
print(output_data_s3_filepath)

# Download the redacted file into memory and print its contents
f = BytesIO()
s3_client.download_fileobj(bucket, output_data_s3_filepath, f)
f.seek(0)
print(f.getvalue().decode('utf-8'))

# Clean up

Delete the Amazon S3 bucket when you no longer need it: https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html