# Amazon Comprehend Custom

## Table of Contents
1. Introduction
2. Setup
3. Download data
4. Train a recognizer
5. Inference
6. Model evaluation
7. Summary


## 1. Introduction

Sometimes, we have custom entities that we would like to detect in our documents.  Instead of being limited to the entity types that are detected by Comprehend out of the box, we can build a custom Comprehend model to detect any entity type of interest. 

You can choose one of two ways to provide data to Amazon Comprehend in order to train a custom entity recognition model:

   - Annotations: Provide the location of your entities in a large number of documents so that Amazon Comprehend can train on both the entity and its context. To create a model that can analyze PDF, Word, and plain text documents, you must train your recognizer using PDF annotations.

   - Entity lists (plain text only): List the specific entities so that Amazon Comprehend can train to identify your custom entities. Note: entity lists can only be used for plain text documents.

In both cases, Amazon Comprehend learns about the kind of documents and the context in which the entities occur, and builds a recognizer that can generalize to new entities at inference time. To learn more, refer to the [official AWS documentation](https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html).
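For comparison with the PDF-annotation path used in the rest of this notebook, here is a minimal sketch of how the two plain-text options could map onto the `InputDataConfig` argument of `create_entity_recognizer`. The helper name `build_csv_input_config` and the example S3 URIs are hypothetical; the keys (`COMPREHEND_CSV`, `Documents`, `Annotations`, `EntityList`) are the ones described in the parameter reference later in this notebook.

```python
def build_csv_input_config(entity_types, documents_s3_uri,
                           annotations_s3_uri=None, entity_list_s3_uri=None):
    """Build an InputDataConfig dict for plain-text (CSV) training data.

    Exactly one of annotations_s3_uri or entity_list_s3_uri must be given,
    mirroring the two ways of providing training data described above.
    """
    if (annotations_s3_uri is None) == (entity_list_s3_uri is None):
        raise ValueError("Provide exactly one of annotations_s3_uri or entity_list_s3_uri")

    config = {
        "DataFormat": "COMPREHEND_CSV",
        "EntityTypes": [{"Type": name} for name in entity_types],
        "Documents": {"S3Uri": documents_s3_uri},
    }
    if annotations_s3_uri is not None:
        config["Annotations"] = {"S3Uri": annotations_s3_uri}
    else:
        config["EntityList"] = {"S3Uri": entity_list_s3_uri}
    return config


# Hypothetical bucket and keys, for illustration only
example_config = build_csv_input_config(
    ["DateOfLoss"],
    "s3://example-bucket/plain-text-docs/",
    entity_list_s3_uri="s3://example-bucket/entity_list.csv",
)
```

This sketch only builds the dictionary; it makes no AWS calls. The PDF workflow in this tutorial uses the `AUGMENTED_MANIFEST` data format instead.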



In this tutorial, we will show how to use annotations to build a custom Comprehend model that detects five custom entities in insurance documents (e.g. DateOfLoss), and how to use that custom model to detect entities in an unlabeled document. We will demonstrate training the model with PDF annotations.


## 2. Setup

In [2]:
from pprint import pprint
import os
import sys
import json
import boto3
import uuid
import time

# we'll use a custom utils module for visualizing annotations on pdfs
!pip install --upgrade pymupdf
module_path = os.path.join(os.path.abspath(os.path.join('.')), 'helperPackage')
if module_path not in sys.path:
    sys.path.append(module_path)
from pdfhelper.PDFHelper import PDFHelper

from IPython.display import IFrame

def get_ssm_parameter(parameter_name):
    return boto3.client('ssm').get_parameter(Name=parameter_name)['Parameter']['Value']

def split_s3_uri(uri):
    """return (bucket, key) tuple from s3 uri like 's3://bucket/prefix/file.txt' """
    return uri.replace('s3://','').split('/',1)

def s3_object_from_uri(uri):
    """Initialize a boto3 s3 Object instance from a URI"""
    s3 = boto3.resource('s3')
    return s3.Object(*split_s3_uri(uri))

def s3_contents_from_uri(uri, decode=True):
    """Read contents from S3 object into memory"""
    data = s3_object_from_uri(uri).get()['Body'].read()
    return data.decode() if decode else data



Collecting pymupdf
  Downloading PyMuPDF-1.19.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (8.7 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.19.2


## 3. Download data

To create annotations for PDF documents, you can use [Amazon SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/), a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning.

For this tutorial, we have already annotated the PDFs in their native form (i.e. without converting them to plain text) using SageMaker Ground Truth. To set up your own annotation job, refer to the resources in the **Summary/Resources** section of this notebook.

The Ground Truth job generates three paths we will need for training our Comprehend custom model.
1. Sources: Path to the input PDFs
2. Annotations: Path to the annotation jsons containing the labeled entity information
3. Manifest: Points to the location of the annotations and source PDFs.  You will use this manifest file to create an Amazon Comprehend custom entity recognition training job and train your custom model.  Manifests are saved in s3://comprehend-semi-structured-documents-us-east-1--<AWS Account number>/output/your labeling job/manifests/output/
    
Let's get some example outputs from that annotation job.

In [4]:
ASSETS_S3_PREFIX = get_ssm_parameter('AssetsS3Prefix')

# output S3 bucket to store results in
OUTPUT_BUCKET_NAME = get_ssm_parameter('OutputBucketName')

# trained recognizer
TRAINED_RECOGNIZER_ARN = get_ssm_parameter('ModelArn')

# Information about the training data and how the SageMaker Ground Truth job output looks in S3
TRAINING_DOCS_S3_URI_PREFIX = os.path.join(ASSETS_S3_PREFIX, 'documents/')
ANNOTATIONS_S3_URI_PREFIX = os.path.join(ASSETS_S3_PREFIX, 'annotations/annotations/consolidated-annotation/consolidation-response/iteration-1/annotations/')
MANIFEST_S3_URI = os.path.join(ASSETS_S3_PREFIX, 'annotations/manifests/output/output.manifest')
LABEL_ATTRIBUTE_NAME = 'claim-full-job-labeling-job-20211019T163532'


# local directory containing example training data artifacts (pdfs, annotations, manifest) referenced in this notebook
LOCAL_ARTIFACTS_DIR = 'ComprehendCustom-Artifacts'
# local path to store results in
LOCAL_OUTPUT_DIR = 'tmp/ComprehendCustom'
# set up tmp dir under the working directory
!mkdir -p {LOCAL_OUTPUT_DIR}
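The cell above builds S3 URIs with `os.path.join`, which uses the separator of the host operating system. That works on a Linux notebook instance, but would produce backslashes on Windows. As a hedged alternative, the sketch below joins URI parts with forward slashes everywhere; `join_s3_uri` is a hypothetical helper and the bucket name is a placeholder.

```python
import posixpath


def join_s3_uri(prefix, *parts):
    """Join S3 URI components with '/' regardless of the host operating system."""
    # Strip leading slashes so that a part such as '/annotations' does not
    # reset the path, which is how posixpath.join treats absolute components.
    cleaned = [part.lstrip("/") for part in parts]
    return posixpath.join(prefix, *cleaned)


# Placeholder prefix, for illustration only
docs_prefix = join_s3_uri("s3://example-bucket/assets", "documents/")
manifest_uri = join_s3_uri("s3://example-bucket/assets/", "/annotations", "output.manifest")
```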


In [5]:
# Let's preview a portion of the manifest file

# We will find the line of the manifest corresponding to a particular input document
document_s3_uri = os.path.join(ASSETS_S3_PREFIX, 'documents','INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00000.pdf')

manifest_data = [json.loads(obj) for obj in s3_contents_from_uri(MANIFEST_S3_URI).splitlines()]

manifest_line = [r for r in manifest_data if r['source-ref']==document_s3_uri][0]
# manifest_line

# Let's download the annotation file and look at a sample annotation

annotations_uri = manifest_line[LABEL_ATTRIBUTE_NAME]['annotation-ref']

annotations = json.loads(s3_contents_from_uri(annotations_uri))
annotations['Entities'][0]

## Uncomment the following line to see more of the annotated entities:
# annotations['Entities']


{'BlockReferences': [{'BlockId': 'fe83c69f-b32f-423f-ac5e-f37959a8cf25',
   'ChildBlocks': [{'BeginOffset': 0,
     'EndOffset': 10,
     'ChildBlockId': '331afc54-fbc0-466e-9b16-b614cb81d3bb'}],
   'BeginOffset': 0,
   'EndOffset': 10}],
 'Text': '03-28-2007',
 'Type': 'DateOfForm',
 'Score': 1}

As you can see above, the custom Ground Truth job generates a PDF annotation that captures block-level information about the entity. This block-level information provides the precise positional coordinates of the entity, with the child blocks representing each word within the entity block. This is distinct from a standard Ground Truth job, in which the PDF is flattened to text and only offset information, not precise coordinate information, is captured during annotation. The rich positional information we obtain with this custom annotation approach allows us to train a more accurate model.
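To make the structure of an annotated entity concrete, here is a small sketch that walks the `BlockReferences` of the sample entity printed above and lists each block with its character span and child block IDs. The function name `summarize_block_references` is ours, not part of any AWS library, and treating `ChildBlocks` as optional is an assumption.

```python
# The sample entity shown in the output above
sample_entity = {
    "BlockReferences": [
        {
            "BlockId": "fe83c69f-b32f-423f-ac5e-f37959a8cf25",
            "ChildBlocks": [
                {
                    "BeginOffset": 0,
                    "EndOffset": 10,
                    "ChildBlockId": "331afc54-fbc0-466e-9b16-b614cb81d3bb",
                }
            ],
            "BeginOffset": 0,
            "EndOffset": 10,
        }
    ],
    "Text": "03-28-2007",
    "Type": "DateOfForm",
    "Score": 1,
}


def summarize_block_references(entity):
    """Return one summary dict per block that the entity refers to."""
    rows = []
    for block in entity["BlockReferences"]:
        rows.append({
            "block_id": block["BlockId"],
            "span": (block["BeginOffset"], block["EndOffset"]),
            # Assumption: ChildBlocks may be absent, so default to an empty list
            "child_block_ids": [child["ChildBlockId"] for child in block.get("ChildBlocks", [])],
        })
    return rows


for row in summarize_block_references(sample_entity):
    print(sample_entity["Type"], sample_entity["Text"], row)
```

In this sample, the span 0 to 10 covers exactly the ten characters of the entity text, and the single child block corresponds to the single word in the date.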

The manifest that's generated from this type of job is called an Augmented Manifest, as opposed to a CSV that's used for standard annotations.
For more information, see: https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html


In [6]:
# Visualize the annotated pdf inline

original_file = f'{LOCAL_ARTIFACTS_DIR}/ex_pdfs/INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00000.pdf'
annotated_file = f'{LOCAL_OUTPUT_DIR}/annotated.pdf'

# using a custom module (PDFHelper) to add annotations to the file before displaying
PDFHelper.add_annotations_to_file(annotations, original_file, annotated_file)
IFrame(annotated_file, width=600, height=800)

# Note: you may need to zoom in to read the label names



PyMuPDF 1.19.2: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-11-20 00:00:01.
Built for Python 3.7 on linux (64-bit).



In [7]:
# Let's look at another annotated sample

# changed the document s3 uri
document_s3_uri = os.path.join(ASSETS_S3_PREFIX, 'documents','INSR_pm_hipaa_1_pii_00048.pdf')

# get the annotations data
manifest_data = [json.loads(obj) for obj in s3_contents_from_uri(os.path.join(ASSETS_S3_PREFIX, 'annotations/manifests/output/output.manifest')).splitlines()]
manifest_line = [r for r in manifest_data if r['source-ref']==document_s3_uri][0]
annotations_uri = manifest_line[LABEL_ATTRIBUTE_NAME]['annotation-ref']
annotations = json.loads(s3_contents_from_uri(annotations_uri))

original_file = f'{LOCAL_ARTIFACTS_DIR}/ex_pdfs/INSR_pm_hipaa_1_pii_00048.pdf'
annotated_file =f'{LOCAL_OUTPUT_DIR}/INSR_pm_hipaa_1_pii_00048_annotated.pdf'

# using a custom module (PDFHelper) to add annotations to the file before displaying
PDFHelper.add_annotations_to_file(annotations, original_file, annotated_file)
IFrame(annotated_file, width=600, height=800)





## 4. Train a recognizer

Note: To train a custom recognizer, Amazon Comprehend needs an IAM role with permission to access the S3 location that holds your data (passed as `DataAccessRoleArn` below). We have already set up this role; see the Appendix section of the workshop guide for details on the process.

An augmented manifest file must be formatted in JSON Lines format. In JSON Lines format, each line in the file is a complete JSON object followed by a newline separator.
https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html
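As a quick illustration of the JSON Lines format, the sketch below parses a manifest-like string one line at a time and reports the line number if a line is not valid JSON. The two sample records are made up for the example (placeholder bucket and attribute name), and skipping blank lines is our choice rather than a documented requirement.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text into a list of objects, one per non-blank line."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate a trailing blank line
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"Line {line_number} is not valid JSON: {err}") from err
    return records


# Hypothetical manifest content, for illustration only
sample_manifest = (
    '{"source-ref": "s3://example-bucket/documents/a.pdf", "my-job": {"annotation-ref": "s3://example-bucket/annotations/a.json"}}\n'
    '{"source-ref": "s3://example-bucket/documents/b.pdf", "my-job": {"annotation-ref": "s3://example-bucket/annotations/b.json"}}\n'
)

for record in parse_json_lines(sample_manifest):
    print(record["source-ref"], "->", record["my-job"]["annotation-ref"])
```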

In [8]:
# Let's have a look at an entry within this augmented manifest file.
manifest_line


{'source-ref': 's3://ee-assets-prod-us-east-1/modules/b2d6c897c659445583c2edb826183e8e/v1/documents/INSR_pm_hipaa_1_pii_00048.pdf',
 'page': '1',
 'metadata': {'pages': '1',
  'use-textract-only': False,
  'labels': ['DateOfForm',
   'DateOfLoss',
   'NameOfInsured',
   'LocationOfLoss',
   'InsuredMailingAddress']},
 'annotator-metadata': {'Info': 'Sample information',
  'Due Date': 'Sample date value 12/12/1212'},
 'claim-full-job-labeling-job-20211019T163532': {'annotation-ref': 's3://ee-assets-prod-us-east-1/modules/b2d6c897c659445583c2edb826183e8e/v1/annotations/annotations/consolidated-annotation/consolidation-response/iteration-1/annotations/INSR_pm_hipaa_1_pii_00048-1-51948fd8-ann.json'},
 'claim-full-job-labeling-job-20211019T163532-metadata': {'type': 'groundtruth/custom',
  'job-name': 'claim-full-job-labeling-job-20211019t163532',
  'human-annotated': 'yes',
  'creation-date': '2021-10-20T06:28:22.043000'}}

A few things to note:

- There are five label types associated with this job: DateOfForm, DateOfLoss, NameOfInsured, LocationOfLoss, and InsuredMailingAddress

- The manifest file makes reference to both the source PDF location and the annotation location

- Metadata about the annotation job (e.g. creation date) is captured.

- `use-textract-only` is set to False, meaning the annotation tool decides whether to use PDFPlumber (for a native PDF) or Amazon Textract (for a scanned PDF). If it were set to True, Textract would be used in either case (more costly but potentially more accurate).
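The points above can also be checked programmatically. The sketch below looks for attributes in a manifest entry that hold an `annotation-ref` and have a matching `-metadata` sibling, which is the pattern visible in the entry printed above. The helper `label_attribute_names` is hypothetical, and the entry is an abbreviated copy of the one above with placeholder S3 URIs.

```python
def label_attribute_names(manifest_entry):
    """Return attribute names that look like Ground Truth label attributes.

    Heuristic based on the entry shown above: the attribute value contains an
    'annotation-ref' and a sibling key '<name>-metadata' exists.
    """
    names = []
    for key, value in manifest_entry.items():
        has_annotation_ref = isinstance(value, dict) and "annotation-ref" in value
        if has_annotation_ref and f"{key}-metadata" in manifest_entry:
            names.append(key)
    return names


# Abbreviated entry; the S3 URIs are placeholders
entry = {
    "source-ref": "s3://example-bucket/documents/INSR_pm_hipaa_1_pii_00048.pdf",
    "page": "1",
    "metadata": {
        "pages": "1",
        "use-textract-only": False,
        "labels": ["DateOfForm", "DateOfLoss", "NameOfInsured",
                   "LocationOfLoss", "InsuredMailingAddress"],
    },
    "claim-full-job-labeling-job-20211019T163532": {
        "annotation-ref": "s3://example-bucket/annotations/example-ann.json"
    },
    "claim-full-job-labeling-job-20211019T163532-metadata": {
        "type": "groundtruth/custom",
        "human-annotated": "yes",
    },
}

print(label_attribute_names(entry))
print(entry["metadata"]["labels"])
```

The name found this way is the value to pass as `AttributeNames` when creating the recognizer.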

In [9]:
# Now let's train a recognizer

comprehend = boto3.client('comprehend')
response = comprehend.create_entity_recognizer(
    RecognizerName="recognizer-example-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn=get_ssm_parameter('ComprehendRoleArn'),
    InputDataConfig={
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [
            {
                "Type": "DateOfForm"
            },
            {
                "Type": "DateOfLoss"
            },
            {
                "Type": "NameOfInsured"
            },
            {
                "Type": "LocationOfLoss"
            },
            {
                "Type": "InsuredMailingAddress"
            }
        ],
        "AugmentedManifests": [
            {
                'S3Uri': MANIFEST_S3_URI,
                'AnnotationDataS3Uri': ANNOTATIONS_S3_URI_PREFIX,
                'SourceDocumentsS3Uri': TRAINING_DOCS_S3_URI_PREFIX,
                'AttributeNames': [LABEL_ATTRIBUTE_NAME],
                'DocumentType': 'SEMI_STRUCTURED_DOCUMENT',
            }
        ],
    }
)
recognizer_arn = response["EntityRecognizerArn"]
recognizer_arn

'arn:aws:comprehend:us-east-1:469755836051:entity-recognizer/recognizer-example-51396fc8-53ee-43e9-9d8d-336e61772d92'

Here, we are creating a recognizer that detects all five entity types. We could have used a subset of these entity types if we preferred. A single recognizer can be trained on up to 25 entity types.
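Rather than writing out each `{"Type": ...}` entry by hand, the `EntityTypes` list could be generated from the label names. A minimal sketch, assuming the 25-type limit stated above; `build_entity_types` is a hypothetical helper.

```python
MAX_ENTITY_TYPES = 25  # limit stated in the parameter reference


def build_entity_types(labels):
    """Turn label names into the EntityTypes list expected by create_entity_recognizer."""
    unique_labels = list(dict.fromkeys(labels))  # drop duplicates, keep order
    if not unique_labels:
        raise ValueError("At least one entity type is required")
    if len(unique_labels) > MAX_ENTITY_TYPES:
        raise ValueError(
            f"{len(unique_labels)} entity types given; the maximum is {MAX_ENTITY_TYPES}"
        )
    return [{"Type": label} for label in unique_labels]


entity_types = build_entity_types([
    "DateOfForm", "DateOfLoss", "NameOfInsured",
    "LocationOfLoss", "InsuredMailingAddress",
])
```

To train on a subset, pass only the labels of interest; entity types that are not listed are ignored during training.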

The details of each parameter are given below (source: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend.html#Comprehend.Client.create_entity_recognizer)

**DataFormat (string) --**

    The format of your training data:

    COMPREHEND_CSV : A CSV file that supplements your training documents. The CSV file contains information about the custom entities that your trained model will detect. The required format of the file depends on whether you are providing annotations or an entity list. If you use this value, you must provide your CSV file by using either the Annotations or EntityList parameters. You must provide your training documents by using the Documents parameter.

    AUGMENTED_MANIFEST : A labeled dataset that is produced by Amazon SageMaker Ground Truth. This file is in JSON lines format. Each line is a complete JSON object that contains a training document and its labels. Each label annotates a named entity in the training document. If you use this value, you must provide the AugmentedManifests parameter in your request.

    If you don't specify a value, Amazon Comprehend uses COMPREHEND_CSV as the default.

**EntityTypes (list) -- [REQUIRED]**

    The entity types in the labeled training data that Amazon Comprehend uses to train the custom entity recognizer. Any entity types that you don't specify are ignored.

    A maximum of 25 entity types can be used at one time to train an entity recognizer.

**S3Uri (string) -- [REQUIRED]**

    The Amazon S3 location of the augmented manifest file.

**AnnotationDataS3Uri (string) --**

    The S3 prefix to the annotation files that are referred to in the augmented manifest file.

**SourceDocumentsS3Uri (string) --**

    The S3 prefix to the source files (PDFs) that are referred to in the augmented manifest file.

**AttributeNames (list) -- [REQUIRED]**

    The JSON attribute that contains the annotations for your training documents. The number of attribute names that you specify depends on whether your augmented manifest file is the output of a single labeling job or a chained labeling job.

    If your file is the output of a single labeling job, specify the LabelAttributeName key that was used when the job was created in Ground Truth.

    If your file is the output of a chained labeling job, specify the LabelAttributeName key for one or more jobs in the chain. Each LabelAttributeName key provides the annotations from an individual job.

**DocumentType (string) --**

    The type of augmented manifest: PLAIN_TEXT_DOCUMENT or SEMI_STRUCTURED_DOCUMENT. If you don't specify a value, the default is PLAIN_TEXT_DOCUMENT.

    PLAIN_TEXT_DOCUMENT : A document type that represents any Unicode text that is encoded in UTF-8.

    SEMI_STRUCTURED_DOCUMENT : A document type with positional and structural context, like a PDF. For training with Amazon Comprehend, only PDFs are supported. For inference, Amazon Comprehend supports PDF, DOCX, and TXT files.

In [10]:
# Let's check on the status of the submitted training job

# All recognizers
comprehend = boto3.client('comprehend')

recognizers = comprehend.list_entity_recognizers()
# View the last submitted job
recognizers['EntityRecognizerPropertiesList'][-1]

{'EntityRecognizerArn': 'arn:aws:comprehend:us-east-1:469755836051:entity-recognizer/recognizer-example-51396fc8-53ee-43e9-9d8d-336e61772d92',
 'LanguageCode': 'en',
 'Status': 'SUBMITTED',
 'SubmitTime': datetime.datetime(2021, 12, 2, 21, 7, 51, 294000, tzinfo=tzlocal()),
 'InputDataConfig': {'DataFormat': 'AUGMENTED_MANIFEST',
  'EntityTypes': [{'Type': 'DateOfForm'},
   {'Type': 'DateOfLoss'},
   {'Type': 'NameOfInsured'},
   {'Type': 'LocationOfLoss'},
   {'Type': 'InsuredMailingAddress'}],
  'AugmentedManifests': [{'S3Uri': 's3://ee-assets-prod-us-east-1/modules/b2d6c897c659445583c2edb826183e8e/v1/annotations/manifests/output/output.manifest',
    'Split': 'TRAIN',
    'AttributeNames': ['claim-full-job-labeling-job-20211019T163532'],
    'AnnotationDataS3Uri': 's3://ee-assets-prod-us-east-1/modules/b2d6c897c659445583c2edb826183e8e/v1/annotations/annotations/consolidated-annotation/consolidation-response/iteration-1/annotations/',
    'SourceDocumentsS3Uri': 's3://ee-assets-prod-u

### Waiting for training job completion

The snippet below can be used to poll the status of your model once a minute until training finishes.

We've already trained a model in the account in advance, using the exact same dataset, so we will use that model and continue with the notebook instead of waiting for the new model training job to finish.

```
# check status of custom model training periodically until complete
# recognizer_arn was returned by create_entity_recognizer above; the order of
# list_entity_recognizers results is not documented, so we avoid indexing with [-1]

while True:
    response = comprehend.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )

    status = response["EntityRecognizerProperties"]["Status"]
    if "IN_ERROR" == status:
        print('TRAINING ERROR')
        break
    if "TRAINED" == status:
        print('TRAINING COMPLETE')
        break
    print(status)
    time.sleep(60)
```
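The loop above runs until it sees `TRAINED` or `IN_ERROR`. A slightly more defensive variant, sketched below, stops on any terminal status and gives up after a timeout. The status names other than `TRAINED` and `IN_ERROR` reflect our understanding of the Comprehend API and should be checked against the current documentation; `wait_for_status` is a hypothetical helper that takes the status lookup as a callable so it can be exercised without AWS access.

```python
import time

# Assumed terminal statuses for a recognizer; verify against the API reference
TERMINAL_RECOGNIZER_STATUSES = {"TRAINED", "TRAINED_WITH_WARNING", "IN_ERROR", "STOPPED"}


def wait_for_status(get_status, terminal_statuses, poll_seconds=60,
                    timeout_seconds=4 * 60 * 60, sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal status or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_statuses:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still in status {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Example wiring (needs AWS credentials; `comprehend` and `recognizer_arn`
# come from the cells above):
#
# final_status = wait_for_status(
#     lambda: comprehend.describe_entity_recognizer(
#         EntityRecognizerArn=recognizer_arn
#     )["EntityRecognizerProperties"]["Status"],
#     TERMINAL_RECOGNIZER_STATUSES,
# )
```

The four-hour default timeout is arbitrary; choose a value that suits the size of your dataset.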

## 5. Inference

Let's run inference with our trained model on a document that was not part of the training procedure. This asynchronous API can be used for standard or custom NER. If it is being used for custom NER (as it is here) we must pass the ARN of the trained model.

In [11]:
# Start entities detection job

# This asynchronous API can be used for standard or custom NER.
# To use custom NER, pass the ARN of the trained model.

response = comprehend.start_entities_detection_job(
    EntityRecognizerArn=TRAINED_RECOGNIZER_ARN,
    JobName="Detection-Job-{}".format(str(uuid.uuid4())),
    LanguageCode="en",
    DataAccessRoleArn=get_ssm_parameter('ComprehendRoleArn'),
    InputDataConfig={
        "InputFormat": "ONE_DOC_PER_FILE",
        "S3Uri": os.path.join(ASSETS_S3_PREFIX, 'holdout/')
    },
    OutputDataConfig={
        "S3Uri": f's3://{OUTPUT_BUCKET_NAME}/custom_comprehend/'
    }
)
response

{'JobId': '62ff8b5dbc0e0883eb4df53fc9e02226',
 'JobArn': 'arn:aws:comprehend:us-east-1:469755836051:entities-detection-job/62ff8b5dbc0e0883eb4df53fc9e02226',
 'JobStatus': 'SUBMITTED',
 'ResponseMetadata': {'RequestId': '8f1f76bd-f750-49a7-962e-962fc06f0856',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '8f1f76bd-f750-49a7-962e-962fc06f0856',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '177',
   'date': 'Thu, 02 Dec 2021 21:08:46 GMT'},
  'RetryAttempts': 0}}

**S3Uri (string) -- [REQUIRED]**
    The Amazon S3 URI for the input data. The URI must be in the same Region as the API endpoint that you are calling. The URI can point to a single input file or it can provide the prefix for a collection of data files.

    For example, if you use the URI s3://bucketName/prefix and the prefix is a single file, Amazon Comprehend uses that file as input. If more than one file begins with the prefix, Amazon Comprehend uses all of them as input.

**InputFormat (string) --**
    Specifies how the text in an input file should be processed:

    ONE_DOC_PER_FILE - Each file is considered a separate document. Use this option when you are processing large documents, such as newspaper articles or scientific papers.
    ONE_DOC_PER_LINE - Each line in a file is considered a separate document. Use this option when you are processing many short documents, such as text messages.

### Waiting for detection job completion

This code snippet could be used to periodically check the status of your detection job and wait until it finishes.

We've already done a detection with this model so we will take a look at those results instead of using this code to wait for the job to finish.

```
while True:
    job = comprehend.describe_entities_detection_job(
        JobId=response['JobId']
    )
    
    status = job["EntitiesDetectionJobProperties"]["JobStatus"]
    # detection jobs report FAILED (not IN_ERROR) when they do not succeed
    if status in ("FAILED", "STOPPED"):
        print(f'DETECTION {status}')
        break
    if "COMPLETED" == status:
        print('DETECTION COMPLETE')
        break
    print(status)
    time.sleep(60)
```

In [12]:
# Get the output from the detection job

# download pre-generated inference output for INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00017
INFERENCE_RESULTS_S3_URI = os.path.join(ASSETS_S3_PREFIX, 'detection/output/output.tar.gz')

# Detection job output is at {LOCAL_ARTIFACTS_DIR}/inference_output/output.tar.gz

!mkdir -p {LOCAL_OUTPUT_DIR}/inference_output/
!tar -xvzf {LOCAL_ARTIFACTS_DIR}/inference_output/output.tar.gz -C {LOCAL_OUTPUT_DIR}/inference_output/

INFERENCE_RESULTS_PATH = os.path.join(LOCAL_OUTPUT_DIR, 'inference_output/INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00017_NativePDF.out')



INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00017_NativePDF.out
output
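The `tar` shell command above can also be done in pure Python, which is handy outside a notebook. The sketch below extracts only regular files and refuses archive members that would land outside the destination directory. `extract_tar_gz` is a hypothetical helper, not part of the workshop's helper package.

```python
import os
import tarfile


def extract_tar_gz(archive_path, dest_dir):
    """Extract regular files from a .tar.gz archive into dest_dir.

    Returns the list of extracted file paths. Members whose path would
    escape dest_dir raise ValueError instead of being written.
    """
    dest_root = os.path.realpath(dest_dir)
    os.makedirs(dest_root, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and special files
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.extractfile(member) as source, open(target, "wb") as out:
                out.write(source.read())
            extracted.append(target)
    return extracted


# Example usage with the paths from the cell above:
# extract_tar_gz(f"{LOCAL_ARTIFACTS_DIR}/inference_output/output.tar.gz",
#                f"{LOCAL_OUTPUT_DIR}/inference_output/")
```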


In [13]:
# Let's look at the inference on a new example document
import pandas as pd

INFERENCE_RESULTS_PATH = os.path.join(LOCAL_OUTPUT_DIR, 'inference_output/INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00017_NativePDF.out')
with open(INFERENCE_RESULTS_PATH) as f:
    detection_output = json.load(f)

entities_list = detection_output['Entities']

pd.DataFrame(entities_list)

|   | BlockReferences | Score | Text | Type |
|---|---|---|---|---|
| 0 | [{'BeginOffset': 0, 'BlockId': '47c99141-baa5-... | 0.995752 | 02-11-2008 | DateOfForm |
| 1 | [{'BeginOffset': 0, 'BlockId': '382f2f85-e1c8-... | 0.991779 | 03-05-2005 17:06:44 | DateOfLoss |
| 2 | [{'BeginOffset': 0, 'BlockId': 'fb9b875f-7a79-... | 0.998773 | Jaleesa Gonzalez | NameOfInsured |
| 3 | [{'BeginOffset': 23, 'BlockId': 'fb9b875f-7a79... | 0.912355 | , Vermont | InsuredMailingAddress |
| 4 | [{'BeginOffset': 0, 'BlockId': '867de541-8fc2-... | 0.986978 | 312 Fernwood Alley | LocationOfLoss |
| 5 | [{'BeginOffset': 18, 'BlockId': 'df9a3ba3-b843... | 0.997802 | South Dakota | LocationOfLoss |
| 6 | [{'BeginOffset': 9, 'BlockId': '3bb70a0f-f970-... | 0.999795 | United States | LocationOfLoss |


In [17]:
# Note that the model prediction output format closely resembles the annotation output format shown above.
entities_list[0]

{'BlockReferences': [{'BeginOffset': 0,
   'BlockId': '47c99141-baa5-4df9-8aa3-ed24d4675c8d',
   'ChildBlocks': [{'BeginOffset': 0,
     'ChildBlockId': '6e7c051e-3d96-4b2f-93bd-023ed5f8991d',
     'EndOffset': 10}],
   'EndOffset': 10}],
 'Score': 0.9957520982071427,
 'Text': '02-11-2008',
 'Type': 'DateOfForm'}
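Because every detected entity carries a `Score`, `Text`, and `Type`, downstream code can filter out low-confidence predictions and group the rest by entity type. Here is a minimal sketch using the scores, texts, and types from the table above (block references omitted for brevity). The 0.95 threshold is an arbitrary value chosen for illustration, not a recommendation.

```python
from collections import defaultdict

# Score, text, and type of the seven detected entities shown above
detected = [
    {"Score": 0.995752, "Text": "02-11-2008", "Type": "DateOfForm"},
    {"Score": 0.991779, "Text": "03-05-2005 17:06:44", "Type": "DateOfLoss"},
    {"Score": 0.998773, "Text": "Jaleesa Gonzalez", "Type": "NameOfInsured"},
    {"Score": 0.912355, "Text": ", Vermont", "Type": "InsuredMailingAddress"},
    {"Score": 0.986978, "Text": "312 Fernwood Alley", "Type": "LocationOfLoss"},
    {"Score": 0.997802, "Text": "South Dakota", "Type": "LocationOfLoss"},
    {"Score": 0.999795, "Text": "United States", "Type": "LocationOfLoss"},
]


def group_entities_by_type(entities, min_score=0.0):
    """Group entity texts by type, keeping only entities scoring at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


high_confidence = group_entities_by_type(detected, min_score=0.95)
for entity_type, texts in high_confidence.items():
    print(entity_type, texts)
```

With this threshold, the partial address fragment `, Vermont` (score 0.912355) is dropped, while the three `LocationOfLoss` fragments are kept together under one key.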

In [19]:
# Let's visualize the labels predicted by our model for this new example pdf

holdout_pdf = f'{LOCAL_ARTIFACTS_DIR}/ex_pdfs/INSR_ACORD-Property-Loss-Notice-12.05.16_1_pii_00017.pdf'
annotated_file =f'{LOCAL_OUTPUT_DIR}/detection_annotated.pdf'
PDFHelper.add_annotations_to_file(detection_output, holdout_pdf, annotated_file)
IFrame(annotated_file, width=600, height=800)





## 6. Model Evaluation

Comprehend provides performance metrics for a trained model, which indicate how well the model is expected to make predictions on similar inputs.

For detailed description of the performance metrics and how they are calculated, see: https://docs.aws.amazon.com/comprehend/latest/dg/cer-metrics.html

In [21]:
# We will look at the metrics for the model trained in advance of the workshop, using the same dataset

trained_recognizer = comprehend.describe_entity_recognizer(
    EntityRecognizerArn=TRAINED_RECOGNIZER_ARN
)

In [22]:
# Global evaluation metrics
trained_recognizer['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']

{'Precision': 1.0, 'Recall': 0.9943181818181818, 'F1Score': 0.9971509971509972}

In [23]:
# Per entity metrics
entity_metrics = trained_recognizer['EntityRecognizerProperties']['RecognizerMetadata']['EntityTypes']
for entity in entity_metrics:
    print(entity['Type'])
    print(entity['EvaluationMetrics'])
    print()

DateOfForm
{'Precision': 1.0, 'Recall': 1.0, 'F1Score': 1.0}

DateOfLoss
{'Precision': 1.0, 'Recall': 1.0, 'F1Score': 1.0}

InsuredMailingAddress
{'Precision': 1.0, 'Recall': 0.9814814814814815, 'F1Score': 0.9906542056074767}

LocationOfLoss
{'Precision': 1.0, 'Recall': 1.0, 'F1Score': 1.0}

NameOfInsured
{'Precision': 1.0, 'Recall': 1.0, 'F1Score': 1.0}



For Precision, Recall, and F1 Score, 1.0 is the highest possible score. Most of these metrics for the trained model are at or close to 1.0, which indicates that the model accurately predicts custom entities on a set of test documents that Comprehend randomly selected and held out from the training data during training.
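As a sanity check on how these numbers relate, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes F1 from the precision and recall values reported above and compares it with the reported F1 values; it uses the standard definition and makes no claim about how Comprehend computes the metrics internally.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the outputs above
reported = {
    "Global": (1.0, 0.9943181818181818, 0.9971509971509972),
    "InsuredMailingAddress": (1.0, 0.9814814814814815, 0.9906542056074767),
}

for name, (precision, recall, reported_f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(name, recomputed, abs(recomputed - reported_f1) < 1e-9)
```

A precision of 1.0 with a recall slightly below 1.0 means the model produced no false positives on the test set but missed a small number of true entities.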

## 7. Summary

In addition to the standard set of entities recognized by Amazon Comprehend's standard entity detection capabilities, Comprehend enables you to train and use your own custom models for detecting user-defined entities specific to your business use case directly on PDF documents.

In this notebook, you used a dataset of PDFs annotated with SageMaker Ground Truth to train an entity detection model in Comprehend. The standard entity detection capabilities of Comprehend did not recognize entities required in this specific insurance form use case, such as "LocationOfLoss". After training, the resulting Comprehend model can be used to reliably detect these custom entities in new documents.

**At this point, you can go back to the workshop guide to start Part 2 of the workshop.**

### Resources

Here are additional resources to help you dive deeper:

 - Setting up your own custom annotation job: https://aws.amazon.com/blogs/machine-learning/custom-document-annotation-for-extracting-named-entities-in-documents-using-amazon-comprehend/

 - Training a custom NER model using the Comprehend console: https://aws.amazon.com/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/

 - API reference: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend.
