# Introduction

**Table of Contents**
1. Amazon SageMaker Studio notebooks
2. Dataset: Synthetic insurance documents

## 1. Using the Amazon SageMaker Studio notebook environment

Amazon SageMaker Studio is an IDE (integrated development environment) for machine learning. It is based on the JupyterLab platform, and has built-in integrations with AWS AI and ML services to help developers and data scientists accelerate their productivity.

In SageMaker Studio, users write and execute code in notebooks. Notebook documents can contain both executable code (Python) and rich text elements (paragraph, markdown, figures), and consist of multiple cells which can be executed one at a time.

### How to execute a notebook cell
To run a notebook cell, first select the cell by clicking on the cell. The selected cell will have a blue highlighted border around it. Then execute the highlighted cell by either:
  * Entering the keyboard shortcut `Shift + Enter`. This runs the selected cell and automatically selects the next cell.
  * Choosing the ▶️ button in the notebook toolbar at the top of the notebook.
  * Choosing **Run**, then **Run Selected Cells**, from the SageMaker Studio dropdown menu.

## 2. Exploring the synthetic insurance documents dataset

Now that you are familiar with the notebook environment, let's take a closer look at the dataset we will be using for this workshop. We have provided you with a dataset of synthetically generated documents, representative of the forms that would be submitted as part of a claim during the claims adjudication process. Insurers need to extract the relevant information from these forms quickly and accurately in order to process customer claims at scale.

As part of the initial setup instructions, the documents dataset was copied to an S3 bucket in this account. Run the cell below to print the S3 path where they are stored. We will save this path as the `DATA_S3_PREFIX` variable, and also store it for use in subsequent notebooks.


In [2]:
import boto3
import os
import time
from IPython.display import IFrame


def get_ssm_parameter(parameter_name):
    return boto3.client('ssm').get_parameter(Name=parameter_name)['Parameter']['Value']

def split_s3_uri(uri):
    """Return a (bucket, key) tuple from an S3 URI like 's3://bucket/prefix/file.txt'."""
    # Strip only the leading scheme, then split once so the key keeps its own slashes
    return tuple(uri.replace('s3://', '', 1).split('/', 1))

def visualize_pdf(document_location, width=600, height=800):
    """Visualize pdf stored in s3 as an IFrame in notebook"""
    bucket, key = split_s3_uri(document_location)
    tmpdir = 'tmp'
    tmpfile = os.path.join(tmpdir,'sample_doc.pdf')
    os.makedirs(tmpdir, exist_ok=True)
    boto3.resource('s3').Object(bucket, key).download_file(tmpfile)
    time.sleep(5)
    return IFrame(tmpfile, width=width, height=height)

# S3 path where the documents are located. They were copied from the assets bucket to the Output bucket in the setup instructions.
DATA_S3_PREFIX = os.path.join('s3://', get_ssm_parameter('OutputBucketName'), 'documents/')

# Also store the DATA_S3_PREFIX for subsequent notebooks
%store DATA_S3_PREFIX

DATA_S3_PREFIX

Stored 'DATA_S3_PREFIX' (str)


's3://sagemaker-studio-c758d250/documents/'
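The cell above builds `DATA_S3_PREFIX` with `os.path.join` and splits URIs with `split_s3_uri`. `os.path.join` produces forward slashes on Linux-based kernels, but it may insert backslashes on Windows. The following is a minimal sketch of stricter helpers that do not depend on the operating system. The function names `parse_s3_uri` and `join_s3_uri` and the bucket name `example-bucket` are hypothetical and are not part of the workshop code.

```python
def parse_s3_uri(uri):
    """Return a (bucket, key) tuple for a URI such as 's3://bucket/prefix/file.txt'."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # partition splits on the first '/' only, so the key keeps its own slashes
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket name: {uri!r}")
    return bucket, key


def join_s3_uri(bucket, *parts):
    """Build an S3 URI with forward slashes, regardless of operating system."""
    key = "/".join(part.strip("/") for part in parts if part)
    return f"s3://{bucket}/{key}"


# 'example-bucket' is a made-up bucket name used only for illustration
bucket, key = parse_s3_uri("s3://example-bucket/documents/sample.pdf")
print(bucket, key)
print(join_s3_uri("example-bucket", "documents", "sample.pdf"))
```

Unlike `split_s3_uri`, this sketch raises a `ValueError` for strings that do not start with `s3://`, which can surface typos earlier.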

In [3]:
# list of document files
bucket = boto3.resource('s3').Bucket(get_ssm_parameter('OutputBucketName'))
documents = [d.key.split('/')[-1] for d in bucket.objects.filter(Prefix='documents/')]

# number of documents in the collection
len(documents)

251

In [4]:
MAX_RESULTS = 10  # max number of file names to display - increase this if you want to see more files
documents[:MAX_RESULTS]

['INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00000.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00001.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00002.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00003.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00004.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00005.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00006.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00007.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00008.pdf',
 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00009.pdf']

### Inspecting the forms dataset

There are several types of forms represented in the synthetic dataset.
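One way to see which form types are present is to group the file names by their shared prefix. The sketch below assumes that every file name follows the pattern `<form type>_pii_<index>.pdf`, which matches the names shown in this notebook but is not documented anywhere. The helper `form_type` is hypothetical, and it treats any trailing number before `_pii_` (such as `_1` or `_5`) as part of the form type because its meaning is not known.

```python
import re
from collections import Counter

# Assumed naming convention: <form type>_pii_<numeric index>.pdf
FORM_TYPE_PATTERN = re.compile(r"^(?P<form_type>.+)_pii_\d+\.pdf$")


def form_type(file_name):
    """Return the form-type part of a file name, or None if the name does not match."""
    match = FORM_TYPE_PATTERN.match(file_name)
    return match.group("form_type") if match else None


# A few file names taken from this notebook; in practice, pass the full `documents` list
sample_documents = [
    "INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00000.pdf",
    "INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00001.pdf",
    "INSR_FEMA-Form_086-0-11_Notice-of-Loss_1_pii_00000.pdf",
    "INSR_lightning-affidavit_pii_00030.pdf",
    "INSR_pm_hipaa_1_pii_00048.pdf",
    "INSR_claim-form-agwm_5_pii_00000.pdf",
]

# Count how many documents share each form type
form_type_counts = Counter(form_type(name) for name in sample_documents)
for name, count in form_type_counts.most_common():
    print(f"{count:>3}  {name}")
```

Names that do not fit the assumed pattern are counted under `None`, which makes unexpected files easy to spot.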

You can navigate to the S3 console and view the contents of the `sagemaker-studio-xx` bucket under the `documents/` prefix, where you can download and inspect individual documents.

Below, we show how you can also inspect PDF files inline in the notebook environment.

*Note: if you get a 504 Bad Notebook Gateway Error while attempting to render the PDF inline, try executing the cell again.*

In [19]:
visualize_pdf(
    os.path.join(DATA_S3_PREFIX, 'INSR_FEMA-Form_086-0-11_Notice-of-Loss_1_pii_00000.pdf'),
    height=800, width=600  # adjust pixel height/width if desired
)

In [20]:
visualize_pdf(os.path.join(DATA_S3_PREFIX, 'INSR_lightning-affidavit_pii_00030.pdf'))

In [21]:
visualize_pdf(os.path.join(DATA_S3_PREFIX, 'INSR_pm_hipaa_1_pii_00048.pdf'))

In [9]:
visualize_pdf(os.path.join(DATA_S3_PREFIX, 'INSR_claim-form-agwm_5_pii_00000.pdf'))

In [8]:
visualize_pdf(os.path.join(DATA_S3_PREFIX, 'INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00000.pdf'))

Let's say that as an insurer, your company has a need to identify the following pieces of information within claims package forms:
* Date of the form
* Date of the loss
* Name of insured
* Location of loss
* Insured mailing address

Try identifying these fields in the example forms above. Questions to consider:
* Are there any rules-based approaches that could work for one or more of these form types?
* Which fields could be challenging to identify or disambiguate among several values?
* What kind of sensitive information is contained in these forms?
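
To make the first two questions concrete, here is a minimal sketch of a rules-based approach to the date fields. It assumes the text of a form has already been extracted into a plain string, that dates are written numerically in month/day/year order, and that each date sits on the same line as its label. The sample text and the helper names `find_dates` and `find_labeled_date` are made up for illustration and do not come from the dataset.

```python
import re

# Matches numeric dates such as 03/14/2022 or 3-14-22 (month/day/year order is assumed)
DATE_PATTERN = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})\b")


def find_dates(text):
    """Return every numeric date string found in the text, in order of appearance."""
    return DATE_PATTERN.findall(text)


def find_labeled_date(text, label):
    """Return the first date on a line that contains the label, or None."""
    for line in text.splitlines():
        if label.lower() in line.lower():
            match = DATE_PATTERN.search(line)
            if match:
                return match.group(0)
    return None


# Made-up text standing in for the extracted contents of a form
sample_text = (
    "Date of notice: 03/14/2022\n"
    "Insured: Jane Doe\n"
    "Date of loss: 03/02/2022\n"
)

print(find_dates(sample_text))                         # two candidates, no field labels
print(find_labeled_date(sample_text, "Date of loss"))  # uses the label to pick one
```

The pattern alone returns both dates without saying which is the date of the form and which is the date of the loss. The label-based rule resolves that only when a form uses the expected wording and layout, so a rule written for one form type may not carry over to another.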