<a href="https://colab.research.google.com/github/GunduSriBhanu/Computer-vision-on-pdf/blob/main/Comprehend_Builtin_Custom_Entity_Recognizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Custom Entity Recognizer

***Welcome to an end-to-end example of how to use [Amazon Comprehend Custom Entity Recognizer](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html) to create a custom entity recognition model***

Custom entity recognition extends Amazon Comprehend by letting you identify entity types that are not among its preset generic entity types. This means that you can analyze documents and extract entities such as product codes or other business-specific entities that fit your particular needs.

Building an accurate custom entity recognizer on your own can be a complex process, requiring preparation of large sets of manually annotated training documents and the selection of the right algorithms and parameters for model training. Amazon Comprehend helps to reduce the complexity by providing automatic annotation and model development to create a custom entity recognition model.

Creating a custom entity recognition model is a more effective approach than using string matching or regular expressions to extract entities from documents. For example, to extract ENGINEER names in a document, it is difficult to enumerate all possible names. Additionally, without context, it is challenging to distinguish between ENGINEER names and ANALYST names. A custom entity recognition model can learn the context where those names are likely to appear. Additionally, string matching will not detect entities that have typos or follow new naming conventions, while this is possible using a custom model.
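
To make the limitation of string matching concrete, here is a minimal sketch. The list `KNOWN_OFFERINGS` and the helper `match_known_offerings` are hypothetical names used only for illustration, and the sentences are taken from the example texts used later in this notebook. A fixed list finds only the names it already contains, so a new product name or a typo goes undetected.

```python
import re

# Hypothetical list of known names, standing in for a hand-maintained dictionary.
KNOWN_OFFERINGS = ["AWS Lambda", "Amazon Kinesis Data Analytics"]


def match_known_offerings(text, known=KNOWN_OFFERINGS):
    """Return the known names that appear verbatim in the text."""
    pattern = re.compile("|".join(re.escape(name) for name in known))
    return pattern.findall(text)


# A name that is in the list is found.
print(match_known_offerings("In this field, AWS Lambda is a very well known player."))

# A name that is not in the list is missed, even though the context makes it clear.
print(match_known_offerings("Amazon announced a new product called Amazon DeepThought."))

# A typo is missed as well.
print(match_known_offerings("In this field, AWS Lamda is a very well known player."))
```

A trained recognizer can use the surrounding words ("a new product called ...") as evidence, which a lookup of this kind cannot do.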

In this notebook, we demonstrate how to use the [Custom Entity Recognition API](https://docs.aws.amazon.com/comprehend/latest/dg/training-recognizers.html) to create an entity recognition model. The example takes a training dataset in CSV format and runs inference against text input. Comprehend also supports a more advanced use case that takes Ground Truth annotated data for training and lets you run inference directly on PDF and Word documents. For more information, refer to [Custom Entity Recognizer for PDF document](https://aws.amazon.com/blogs/machine-learning/build-a-custom-entity-recognizer-for-pdf-documents-using-amazon-comprehend/).

This notebook requires no ML expertise to train a model with the example dataset or with your own business-specific dataset. You can use the API operations discussed in this notebook in your own applications.
***

## Outline

1. [Prerequisites and Set up](#step1)
    1. [Install dependencies](#step1.1)
    2. [Set up IAM permissions](#step1.2)
2. [Dataset Set Up](#step2)
    1. [Download the dataset and upload to S3](#step2.1)
3. [Train the model](#step3)
    1. [Start Training](#step3.1)
    2. [Monitor the status of the training job](#step3.2)
    3. [Retrieve the trained model metrics](#step3.3)
4. [Start a batch entity recognition job](#step4)
    1. [Start the batch inference job](#step4.1)
    2. [Monitor the batch inference job status ](#step4.2)
    3. [Download the output of the batch job](#step4.3)
5. [Real-time analysis for custom entity recognition](#step5)
    1. [Create the model endpoint](#step5.1)
    2. [Monitor creation status of the entity recognizer endpoint](#step5.2)
    3. [Running real-time custom entity detection](#step5.3)
    4. [Stop the endpoint](#step5.4)
6. [Conclusion](#step6)
7. [Learn more about Comprehend Custom Entity Recognition](#step7)

## 1. Prerequisites and setup  <a id="step1"></a>
***
Amazon Comprehend Custom Entity Recognizer requires certain access policies that are attached to the IAM role and the Amazon S3 bucket that stores the datasets. By default, these policies are not present in your Amazon SageMaker Studio environment. We show how to set these permissions, install dependencies, and set up relevant environment variables.
***

### 1.1. Install dependencies  <a id="step1.1"></a>

Here we install dependencies needed to run the notebook.

In [None]:
%%time
## Upgrade boto3 and botocore
!pip install botocore --upgrade
!pip install boto3 --upgrade

Collecting botocore
  Obtaining dependency information for botocore from https://files.pythonhosted.org/packages/02/55/7070f28d963cf8843e1335c8c3de0a37dd6382b53e83315ddaab1f645f5e/botocore-1.32.5-py3-none-any.whl.metadata
  Downloading botocore-1.32.5-py3-none-any.whl.metadata (6.1 kB)
Collecting urllib3<1.27,>=1.25.4 (from botocore)
  Obtaining dependency information for urllib3<1.27,>=1.25.4 from https://files.pythonhosted.org/packages/b0/53/aa91e163dcfd1e5b82d8a890ecf13314e3e149c05270cc644581f77f17fd/urllib3-1.26.18-py2.py3-none-any.whl.metadata
  Downloading urllib3-1.26.18-py2.py3-none-any.whl.metadata (48 kB)
Downloading botocore-1.32.5-py3-none-any.whl (11.5 MB)
Downloading urllib3-1.26.18-py2.py3-none-

In [None]:
from datetime import datetime
import boto3
import json
import sagemaker as sm

Initialize the service clients and set up variables that hold the current AWS account ID, region, role, and the default SageMaker S3 bucket name.

In [None]:
sts_client = boto3.client("sts")
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')



In [None]:
# Get current AWS Account ID
account_id = sts_client.get_caller_identity()["Account"]
print("Your account id is {}".format(account_id))

# Get current region
session = boto3.session.Session()
region = session.region_name
print("Your current region is {}".format(region))

# Get current role (the role name is the last path segment of the ARN)
role_arn = sm.get_execution_role()
role_name = role_arn.split('/')[-1]
print("Your current role is {}".format(role_name))

# Get SageMaker default S3 Bucket
bucket_name = sm.Session().default_bucket()
print("Bucket name used is " + bucket_name)

Your account id is 054719795948
Your current region is us-west-2
Your current role is AmazonSageMaker-ExecutionRole-20231003T093231
Bucket name used is sagemaker-us-west-2-054719795948


### 1.2. Set up IAM permissions  <a id="step1.2"></a>
***
To use the Amazon Comprehend Custom Entity Recognition APIs in SageMaker Studio, we need to attach the <b>ComprehendFullAccess</b> and <b>AmazonS3FullAccess</b> policies to the IAM role associated with the SageMaker Studio user. These policies provide full access to Amazon Comprehend Custom Entity Recognition and to Amazon S3.

The policies cannot be attached to the current IAM role from inside the notebook. You can attach the policies by using the IAM console or by using AWS CloudShell. We show how to attach the policies by using CloudShell.

**Note:** For a production application, we recommend that you restrict access policies to only those needed to run the application. Permissions can be restricted based on the use case (training/inference) and specific resource names (such as a full S3 bucket name or an S3 bucket name pattern). You should also restrict access to the Custom Entity Recognition or Amazon SageMaker operations to just those that your application needs.
***
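
As an illustration of the note above, the following sketch builds a policy document that is narrower than the two full-access policies. It is not a vetted minimum: the helper name `build_scoped_policy` is hypothetical, the action list simply mirrors the API operations called in this notebook, and the bucket name in the usage line is a placeholder. Review any such policy against your own requirements before using it.

```python
import json


def build_scoped_policy(bucket_name):
    """Return an illustrative policy limited to one bucket and the
    Comprehend operations that this notebook calls."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read and write objects only inside the given bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # Listing applies to the bucket itself, not to its objects.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Assumption: one action per Comprehend operation used here.
                "Effect": "Allow",
                "Action": [
                    "comprehend:CreateEntityRecognizer",
                    "comprehend:DescribeEntityRecognizer",
                    "comprehend:StartEntitiesDetectionJob",
                    "comprehend:DescribeEntitiesDetectionJob",
                    "comprehend:CreateEndpoint",
                    "comprehend:DescribeEndpoint",
                    "comprehend:DeleteEndpoint",
                    "comprehend:DetectEntities",
                ],
                "Resource": "*",
            },
        ],
    }


# Placeholder bucket name, for illustration only.
print(json.dumps(build_scoped_policy("example-bucket"), indent=2))
```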

In [None]:
# Command to attach ComprehendFullAccess to the IAM role.
cmd_1 = f"aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/ComprehendFullAccess --role-name {role_name} "

# Command to attach AmazonS3FullAccess to the IAM role.
cmd_2 = f"aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name {role_name}"

# Command to update the role's trust policy so that both SageMaker and Comprehend can assume the role.
trust_relationship_policy_comprehend = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "comprehend.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
cmd_3 = f"aws iam update-assume-role-policy --policy-document '{json.dumps(trust_relationship_policy_comprehend)}' --role-name {role_name}"

iam_pass = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": "*"
        }
    ]
}
cmd_4 = f"aws iam put-role-policy --policy-name ComprehendLabAssumeRole --policy-document '{json.dumps(iam_pass)}'  --role-name {role_name}"

# Next, we print the commands which attach the required role.
print("\x1b[0;39;43m" + "Please copy the following command and execute in CloudShell:" + "\x1b[0m")
print(f"{cmd_1} && {cmd_2} && {cmd_3} && {cmd_4}")

Please copy the following command and execute in CloudShell:
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/ComprehendFullAccess --role-name AmazonSageMaker-ExecutionRole-20231003T093231  && aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name AmazonSageMaker-ExecutionRole-20231003T093231 && aws iam update-assume-role-policy --policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "sagemaker.amazonaws.com"}, "Action": "sts:AssumeRole"}, {"Effect": "Allow", "Principal": {"Service": "comprehend.amazonaws.com"}, "Action": "sts:AssumeRole"}]}' --role-name AmazonSageMaker-ExecutionRole-20231003T093231 && aws iam put-role-policy --policy-name ComprehendLabAssumeRole --policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Action": ["iam:PassRole"], "Resource": "*"}]}'  --role-name AmazonSageMaker-ExecutionRole-20231003T093231
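
The commands above wrap the JSON policy documents in single quotes by hand, which works here because the documents contain no single quotes. A slightly more defensive sketch uses `shlex.quote` from the standard library so that any argument is quoted safely for a POSIX shell. The helper name `build_put_role_policy_command` and the role name in the usage lines are hypothetical.

```python
import json
import shlex


def build_put_role_policy_command(role_name, policy_name, policy):
    """Build an `aws iam put-role-policy` command with shell-safe quoting."""
    document = json.dumps(policy)
    return " ".join([
        "aws", "iam", "put-role-policy",
        "--role-name", shlex.quote(role_name),
        "--policy-name", shlex.quote(policy_name),
        "--policy-document", shlex.quote(document),
    ])


# Same inline policy as in the cell above; the role name is a placeholder.
iam_pass_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["iam:PassRole"], "Resource": "*"}
    ],
}
print(build_put_role_policy_command("example-role", "ComprehendLabAssumeRole", iam_pass_policy))
```

Splitting the resulting string with `shlex.split` returns the original JSON document as a single argument, which is a quick way to check the quoting before pasting the command into a shell.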


### **Attach IAM policies by using AWS CloudShell**

***
You can attach the required policies by running a command from any shell authenticated with the credentials of the current AWS account. For simplicity, we use the AWS CloudShell service.

**To attach the IAM policies**

1. Open the [CloudShell console](https://aws.amazon.com/cloudshell/) in the same AWS account as the current SageMaker Studio domain. It might take up to a minute to start. Note that the IAM role used in CloudShell must be able to change IAM permissions. Typically, the Administrator role can change IAM permissions.

![cloudshell_part_1.png](https://jumpstart-cache-prod-us-west-2.s3.us-west-2.amazonaws.com/ai_services_assets/custom_labels/screenshots/cloudshell_screenshot_part_1.png)

2. At the CloudShell command prompt, run the command that you noted in the output from the previous cell.

![cloudshell_part_2.jpeg](https://jumpstart-cache-prod-us-west-2.s3.us-west-2.amazonaws.com/ai_services_assets/custom_labels/screenshots/cloudshell_screenshot_part_2.jpeg)


## 2. Set up the example datasets  <a id="step2"></a>

### 2.1 Download the dataset and upload to S3.   <a id="step2.1"></a>
The dataset for this example consists of AWS product announcement messages from a public demo website. It contains more than 3,000 entries collected from AWS product announcements. Because the dataset is small, we download it directly.

The demo website contains the following files. The training and testing dataset includes the original AWS announcement messages. The Entity List training file contains 2 columns with mapping between the entity value in the original message and the entity label name we want Comprehend Custom Recognizer to detect.

1. Training Dataset - aws-offerings-docs.txt
2. Test Dataset - aws-offerings-test.txt
3. Entity List training dataset - aws-offerings.csv

We will use the training dataset to train a Comprehend entity recognition model to detect the following entity type:
* AWS_OFFERING

The code below downloads the files to your SageMaker Studio environment and uploads them to the default SageMaker S3 bucket identified earlier, so that they are ready for the Comprehend custom recognizer training process.

In [None]:
s3_client = boto3.client('s3')
s3_entity_prefix = 'entity-training'

training_data_bucket = f"jumpstart-cache-prod-{region}"
training_data_prefix = "training-datasets/comprehend"

s3_client.download_file(training_data_bucket, f"{training_data_prefix}/aws-offerings.csv", "aws-offerings.csv")
response = s3_client.upload_file('./aws-offerings.csv', bucket_name, "comprehend_lab/{}/aws-offerings.csv".format(s3_entity_prefix))

s3_client.download_file(training_data_bucket, f"{training_data_prefix}/aws-offerings-docs.txt", "aws-offerings-docs.txt")
response = s3_client.upload_file('./aws-offerings-docs.txt', bucket_name, "comprehend_lab/{}/aws-offerings-docs.txt".format(s3_entity_prefix))

s3_client.download_file(training_data_bucket, f"{training_data_prefix}/aws-offerings-test.txt", "aws-offerings-test.txt")
response = s3_client.upload_file('./aws-offerings-test.txt', bucket_name, "comprehend_lab/{}/aws-offerings-test.txt".format(s3_entity_prefix))

Let's take a look at the data:

In [None]:
!head -20 aws-offerings.csv

"Text", "Type"
"ACM", "AWS_OFFERING"
"ACM PCA", "AWS_OFFERING"
"ACM Private CA", "AWS_OFFERING"
"AD Connector", "AWS_OFFERING"
"AMS", "AWS_OFFERING"
"API Gateway", "AWS_OFFERING"
"AWS", "AWS_OFFERING"
"AWS Amplify", "AWS_OFFERING"
"AWS App Mesh", "AWS_OFFERING"
"AWS AppSync", "AWS_OFFERING"
"AWS Application Auto Scaling", "AWS_OFFERING"
"AWS Application Discovery Service", "AWS_OFFERING"
"AWS Artifact", "AWS_OFFERING"
"AWS Auto Scaling", "AWS_OFFERING"
"AWS Backup", "AWS_OFFERING"
"AWS Batch", "AWS_OFFERING"
"AWS Billing and Cost Management", "AWS_OFFERING"
"AWS Blockchain Templates", "AWS_OFFERING"
"AWS CLI", "AWS_OFFERING"


In [None]:
!head -20 aws-offerings-docs.txt

AWS X-Ray Supports Analytics
Use Analytics in X-Ray and delve into traces to quickly pinpoint issues that may effect your application and its underlying services.
Amazon Sumerian Service Update on 2019-04-26
Service improvements for Amazon Sumerian.
Release: Amazon GameLift on 2019-04-25
Realtime Servers provides ready-to-go, customizable game servers for mobile multiplayer games.
AWS Budgets now Supports Instance Family Filtering for Reservation Coverage Alerts
Starting today, you can use AWS Budgets to create Reservation Coverage budgets to monitor your Amazon EC2 reservations for a given instance family, and receive alerts when your reservation coverage falls below the threshold you define.
Amazon Sumerian Service Update on 2019-04-08
Service improvements for Amazon Sumerian.
AWS X-Ray Supports AWS App Mesh Integration
Use X-Ray to trace communications through AWS App Mesh's service mesh as it networks across multiple types of computer infrastructure.
AWS X-Ray Groups: Deep Dive Dev
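
Before uploading an entity list of your own, it can be useful to check that it parses as the two-column `Text`/`Type` layout shown above. The following is a minimal sketch using the standard `csv` module; the helper name `count_entity_types` is hypothetical, and `skipinitialspace=True` is there because the rows shown above have a space after each comma.

```python
import csv
import io
from collections import Counter


def count_entity_types(csv_text):
    """Count how many entity list rows exist for each entity type."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return Counter(row["Type"] for row in reader)


# A few rows copied from the entity list output above.
sample = '''"Text", "Type"
"ACM", "AWS_OFFERING"
"API Gateway", "AWS_OFFERING"
"AWS Backup", "AWS_OFFERING"
'''
print(count_entity_types(sample))
```

To run it on the downloaded file, pass `open("aws-offerings.csv").read()` instead of `sample`.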

---

## 3. Train the model  <a id="step3"></a>

### 3.1 Start training job  <a id="step3.1"></a>
Automatically train the recognizer to label words or sets of adjacent words with custom entity types. Automatic training requires having two types of information: sample documents and the entity list or annotations. Once the recognizer is trained, you can use it to detect custom entities in your documents. You can quickly analyze a small body of text in real time, or you can analyze a large set of documents with an asynchronous job.

You can provide separate training and testing datasets for Comprehend custom entity recognizer training and model evaluation, or provide a single dataset for both. In the latter case, Comprehend automatically sets aside 10% of the provided dataset to use as testing data. In the example below, we specify only a training dataset, as *Documents.S3Uri* under *InputDataConfig*.

In [None]:
import uuid

comprehend_client = boto3.client("comprehend")

custom_recognizer_name = f'jumpstart-cer-{uuid.uuid4()}'

response = comprehend_client.create_entity_recognizer(
    RecognizerName=custom_recognizer_name,
    LanguageCode="en",
    DataAccessRoleArn=role_arn,
    InputDataConfig={
        "EntityTypes": [
            {
                'Type': "AWS_OFFERING"
            }
        ],
        'EntityList': {
            'S3Uri': "s3://{}/comprehend_lab/{}/aws-offerings.csv".format(bucket_name,s3_entity_prefix)
        },
        'Documents': {
            'S3Uri': "s3://{}/comprehend_lab/{}/aws-offerings-docs.txt".format(bucket_name,s3_entity_prefix)
        },

    }
)
recognizer_arn = response["EntityRecognizerArn"]
print("The ARN for the entity recognizer is {}".format(recognizer_arn))

The ARN for the entity recognizer is arn:aws:comprehend:us-west-2:054719795948:entity-recognizer/jumpstart-cer-2f06259c-56b1-49f1-92b0-3c663ebb43df


### 3.2 Monitor the status of the training job  <a id="step3.2"></a>
The below code will monitor the training job status and stop when the model is ready. The process may take around 15 minutes to complete.

In [None]:
%%time
# Poll until training finishes (around 15 minutes for this dataset)
from IPython.display import clear_output
import time
from datetime import datetime

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    describe_custom_recognizer = comprehend_client.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )
    status = describe_custom_recognizer["EntityRecognizerProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document entity recognizer: {status}")

    if status == "TRAINED" or status == "IN_ERROR":
        break
    time.sleep(10)

18:51:39 : Custom document entity recognizer: TRAINED
CPU times: user 1.28 s, sys: 109 ms, total: 1.39 s
Wall time: 14min 6s
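
This notebook repeats the same polling pattern for training, for the batch job, and for endpoint creation. As a sketch of how the pattern could be factored out, the helper below takes any zero-argument function that returns the current status. The name `wait_for_status` and its parameters are hypothetical; the example uses a fake status sequence so that it runs without AWS.

```python
import time


def wait_for_status(fetch_status, terminal_states, timeout_s=3 * 60 * 60,
                    interval_s=10, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still in state {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Fake status sequence standing in for repeated describe calls.
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINED"])
final = wait_for_status(lambda: next(fake_statuses), {"TRAINED", "IN_ERROR"}, interval_s=0)
print(final)
```

With the real client, `fetch_status` could be a lambda that calls `describe_entity_recognizer` and returns `["EntityRecognizerProperties"]["Status"]`, as the loop above does.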


### 3.3 Retrieve the trained model metrics  <a id="step3.3"></a>

Amazon Comprehend provides you with metrics to help you estimate how well an entity recognizer should work for your job. They are based on training the recognizer model, and so while they accurately represent the performance of the model during training, they are only an approximation of the API performance during entity discovery.

Metrics are returned any time metadata from a trained entity recognizer is returned.

In [None]:
describe_custom_recognizer['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']

{'Precision': 96.57142857142857,
 'Recall': 97.40634005763688,
 'F1Score': 96.98708751793401}
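
The three numbers are related: the F1 score is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The short check below recomputes it from the values printed above; the helper name `f1_score` is ours, not part of the Comprehend API.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation metrics output above.
metrics = {
    "Precision": 96.57142857142857,
    "Recall": 97.40634005763688,
    "F1Score": 96.98708751793401,
}
computed = f1_score(metrics["Precision"], metrics["Recall"])
print(round(computed, 4))
```

The recomputed value agrees with the reported `F1Score` to within floating-point rounding.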

## 4. Start a batch entity recognition job  <a id="step4"></a>
You can run an asynchronous analysis job to detect custom entities in a set of one or more documents. We will use the testing dataset in the below step to run a batch job against the newly trained Custom Recognizer model.
### 4.1 Start the batch inference job  <a id="step4.1"></a>

In [None]:
response = comprehend_client.start_entities_detection_job(
    JobName='AWS_OFFERING-001',
    EntityRecognizerArn=recognizer_arn,
    LanguageCode="en",
    DataAccessRoleArn=role_arn,
    InputDataConfig={
        'S3Uri': "s3://{}/comprehend_lab/{}/aws-offerings-test.txt".format(bucket_name,s3_entity_prefix),
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': "s3://{}/comprehend_lab/{}/results/".format(bucket_name,s3_entity_prefix)
    }
)
job_id = response['JobId']

The request to start an Amazon Comprehend batch job returns the JobId and JobStatus, and the job writes its output to the S3 location that you specified in the request.
### 4.2 Monitor the batch inference job status  <a id="step4.2"></a>
We can use the JobId to get the status of the batch job.

The below code will monitor the batch job status and stop when it is done. The process may take around 20 minutes to complete.

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    response = comprehend_client.describe_entities_detection_job(
        JobId=job_id
    )
    status = response["EntitiesDetectionJobProperties"]["JobStatus"]
    clear_output(wait=True)
    print(f"{current_time} : Custom document entity batch job: {status}")

    if status == "COMPLETED" or status == "FAILED":
        break
    time.sleep(10)

18:57:54 : Custom document entity batch job: COMPLETED


### 4.3  Download the output of the batch job  <a id="step4.3"></a>
Now the batch job is completed. Let's download the output from the S3 bucket and take a closer look.

In [None]:
response = comprehend_client.describe_entities_detection_job(
    JobId=job_id
)
if response['EntitiesDetectionJobProperties']['JobStatus'] == "COMPLETED":
    output_s3_uri = response['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']
    s3_key = output_s3_uri.replace("s3://{}/".format(bucket_name),'')
    s3.meta.client.download_file(bucket_name, s3_key, 'output.tar.gz')
    !tar zxvf output.tar.gz
else:
    print("Batch transformation job not complete.  Please wait until this job is completed before attempting to view output.")

tar: Ignoring unknown extended header keyword 'LIBARCHIVE.creationtime'
output
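
The cell above derives the object key by removing the `s3://<bucket>/` prefix with `str.replace`, which assumes that the output lives in `bucket_name`. A slightly more general sketch splits any S3 URI into bucket and key with the standard library. The helper name `split_s3_uri` and the URI in the usage line are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI, for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/comprehend_lab/entity-training/results/output.tar.gz")
print(bucket)
print(key)
```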


Let's review the test data and the output:

In [None]:
!head -20 aws-offerings-test.txt

Great connecting with other #AWS ninjas today at #AWSSummitLondon 🐱‍👤 @awscloud
"How far out is Fargate?" - Diving into the strengths and weaknesses of #AWS #Fargate with Michael Lavers.
AWS announces availability of Amazon Managed Blockchain service http://bit.ly/2H3oepU
New in #AWS: Amazon Kinesis Data Analytics now allows you to assign AWS resource tags to your real-time applications
Serverless has become the most used deployment pattern for cloud applications. In this field, AWS Lambda is a very well known player. Every developer wants to get his hands dirty with lambda, build a quick function code, and run it. 
At re:Invent 2050, Amazon announced a new product called Amazon DeepThought.  It is the most amazing cloud service ever developed.

In [None]:
!cat output

{"Entities": [], "File": "aws-offerings-test.txt", "Line": 0}
{"Entities": [], "File": "aws-offerings-test.txt", "Line": 1}
{"Entities": [{"BeginOffset": 0, "EndOffset": 3, "Score": 1.0, "Text": "AWS", "Type": "AWS_OFFERING"}, {"BeginOffset": 30, "EndOffset": 55, "Score": 1.0, "Text": "Amazon Managed Blockchain", "Type": "AWS_OFFERING"}], "File": "aws-offerings-test.txt", "Line": 2}
{"Entities": [{"BeginOffset": 13, "EndOffset": 42, "Score": 0.9130047913060774, "Text": "Amazon Kinesis Data Analytics", "Type": "AWS_OFFERING"}, {"BeginOffset": 68, "EndOffset": 71, "Score": 1.0, "Text": "AWS", "Type": "AWS_OFFERING"}], "File": "aws-offerings-test.txt", "Line": 3}
{"Entities": [{"BeginOffset": 94, "EndOffset": 104, "Score": 0.9999999403953606, "Text": "AWS Lambda", "Type": "AWS_OFFERING"}], "File": "aws-offerings-test.txt", "Line": 4}
{"Entities": [{"BeginOffset": 57, "EndOffset": 75, "Score": 0.9997209494040638, "Text": "Amazon DeepThought", "Type": "AWS_OFFERING"}], "File": "aws-offering
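
The output file holds one JSON object per input line, so it can be processed line by line. The sketch below keeps only entities at or above a score threshold. The helper name `high_confidence_entities` and the threshold of 0.95 are arbitrary choices for illustration; the sample records are copied from the output above.

```python
import json


def high_confidence_entities(jsonl_lines, min_score=0.95):
    """Return (line number, text, type) for entities scoring at least min_score."""
    results = []
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        record = json.loads(raw)
        for entity in record["Entities"]:
            if entity["Score"] >= min_score:
                results.append((record["Line"], entity["Text"], entity["Type"]))
    return results


# Two records copied from the batch output above.
sample_lines = [
    '{"Entities": [], "File": "aws-offerings-test.txt", "Line": 0}',
    '{"Entities": [{"BeginOffset": 13, "EndOffset": 42, "Score": 0.9130047913060774, '
    '"Text": "Amazon Kinesis Data Analytics", "Type": "AWS_OFFERING"}, '
    '{"BeginOffset": 68, "EndOffset": 71, "Score": 1.0, "Text": "AWS", '
    '"Type": "AWS_OFFERING"}], "File": "aws-offerings-test.txt", "Line": 3}',
]
print(high_confidence_entities(sample_lines))
```

To process the extracted file, pass `open("output")` in place of `sample_lines`.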

## 5. Real-time analysis for custom entity recognition  <a id="step5"></a>
With Amazon Comprehend, you can quickly detect custom entities in individual text documents by running real-time analysis. Unlike asynchronous batch jobs that analyze large documents or large sets of documents, real-time analysis is useful for applications that process small bodies of text as they arrive. For example, you can immediately detect custom entities in social media posts, support tickets, or customer reviews.

### 5.1 Create the model endpoint  <a id="step5.1"></a>
You create an endpoint to make your custom model available for real-time analysis.

To meet your text processing needs, you assign inference units to the endpoint, and each unit allows a throughput of 100 characters per second for up to 2 documents per second. You can then adjust the throughput up or down.
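
As a rough sizing aid, the per-unit figures above can be turned into a small calculation. This is only our reading of the stated limits (100 characters per second and up to 2 documents per second for each unit); the helper name `required_inference_units` is hypothetical, and actual capacity planning should follow the Amazon Comprehend documentation.

```python
import math

# Per-unit limits as stated in the paragraph above.
CHARS_PER_UNIT_PER_SECOND = 100
DOCS_PER_UNIT_PER_SECOND = 2


def required_inference_units(chars_per_second, docs_per_second):
    """Estimate how many inference units cover the given load."""
    by_chars = math.ceil(chars_per_second / CHARS_PER_UNIT_PER_SECOND)
    by_docs = math.ceil(docs_per_second / DOCS_PER_UNIT_PER_SECOND)
    # At least one unit is needed, and both limits must be satisfied.
    return max(1, by_chars, by_docs)


print(required_inference_units(chars_per_second=1000, docs_per_second=20))
print(required_inference_units(chars_per_second=250, docs_per_second=1))
```

Under this reading, the 10 units requested in the next cell would correspond to roughly 1,000 characters per second.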

In [None]:
realtime_endpoint_name = 'jumpstart-demo-custom-ner-endpoint'
endpoint_arn = ''

try:
    response = comprehend_client.create_endpoint(
        EndpointName=realtime_endpoint_name,
        ModelArn=recognizer_arn,
        DesiredInferenceUnits=10
    )
    endpoint_arn = response['EndpointArn']
except Exception as error:
    print(error)


print('Document Entity Recognition Endpoint ARN: ' + endpoint_arn)

Document Entity Recognition Endpoint ARN: arn:aws:comprehend:us-west-2:054719795948:entity-recognizer-endpoint/jumpstart-demo-custom-ner-endpoint


### 5.2 Monitor creation status of the entity recognizer endpoint  <a id="step5.2"></a>
The below code will monitor the endpoint deployment job status and stop when it is done. The process may take around 10 minutes to complete.

In [None]:
%%time
# Poll until the endpoint is in service (around 10 minutes)
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")

    describe_endpoint_resp = comprehend_client.describe_endpoint(
        EndpointArn=endpoint_arn
    )
    status = describe_endpoint_resp["EndpointProperties"]["Status"]
    clear_output(wait=True)
    print(f"{current_time} : Custom entity recognizer Entity Recognition: {status}")

    if status == "IN_SERVICE" or status == "FAILED":
        break

    time.sleep(10)

19:06:50 : Custom entity recognizer Entity Recognition: IN_SERVICE
CPU times: user 681 ms, sys: 53.7 ms, total: 734 ms
Wall time: 7min 23s


### 5.3 Running real-time custom entity detection  <a id="step5.3"></a>
After you create an endpoint for your custom entity recognizer model, you can run real-time analysis to quickly detect entities in individual bodies of text.

In [None]:
from IPython.display import display, HTML

examples = [
    "Great connecting with other #AWS ninjas today at #AWSSummitLondon",
    "How far out is Fargate? - Diving into the strengths and weaknesses of #AWS #Fargate with Michael Lavers.",
    "AWS announces availability of Amazon Managed Blockchain service http://bit.ly/2H3oepU",
    "New in #AWS: Amazon Kinesis Data Analytics now allows you to assign AWS resource tags to your real-time applications",
    "Serverless has become the most used deployment pattern for cloud applications. In this field, AWS Lambda is a very well known player. Every developer wants to get his hands dirty with lambda, build a quick function code, and run it.",
    "At re:Invent 2050, Amazon announced a new product called Amazon DeepThought.  It is the most amazing cloud service ever developed.",
    "Michael Lavers approved Stainless steel bulb on 10 ft capillary ",
    "Set at 35degF with adjustable range of 25deg to 325degF ",
    "Electrical rating of 22 amp with voltage from 125 to 480V AC ",
    "NEMA - 4X metal enclosure "
]

for example in examples:
    # Detect custom entities through the real-time endpoint
    response = comprehend_client.detect_entities(
        Text=example,
        EndpointArn=endpoint_arn,
        LanguageCode="en",
    )

    # Show the first detected entity, highlighted in the source text
    if response.get("Entities"):
        entity = response["Entities"][0]
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        print(f'Text: {entity["Text"]}, Score: {entity["Score"]}, Offset: {begin}-{end}')
        display(HTML(example[:begin] + '<b style="color:red">' + example[begin:end] + '</b>' + example[end:]))
        print()





Text: Fargate, Score: 0.6265328526496887, Offset: 15-22



Text: AWS, Score: 1.0, Offset: 0-3



Text: Amazon Kinesis Data Analytics, Score: 0.9130048155784607, Offset: 13-42



Text: AWS Lambda, Score: 0.9999999403953552, Offset: 94-104



Text: Amazon DeepThought, Score: 0.9997209310531616, Offset: 57-75
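
The cell above displays only the first entity of each response. As a sketch of how every detected entity could be highlighted, the helper below walks the spans in order of `BeginOffset` and wraps each one in the same red `<b>` tag. The name `highlight_entities` is hypothetical; the offsets in the usage lines are taken from the batch output in section 4.3.

```python
import html


def highlight_entities(text, entities):
    """Wrap every entity span in a red <b> tag, escaping the text as HTML."""
    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["BeginOffset"]):
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        if begin < cursor:
            continue  # skip spans that overlap one already emitted
        pieces.append(html.escape(text[cursor:begin]))
        pieces.append('<b style="color:red">' + html.escape(text[begin:end]) + "</b>")
        cursor = end
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)


text = "AWS announces availability of Amazon Managed Blockchain service"
entities = [
    {"BeginOffset": 0, "EndOffset": 3},
    {"BeginOffset": 30, "EndOffset": 55},
]
print(highlight_entities(text, entities))
```

In the notebook, the returned string can be passed to `display(HTML(...))` in place of the manual concatenation.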





### 5.3 Running real-time built-in entity detection  <a id="step5.3"></a>
For comparison, you can run the same kind of real-time analysis with the built-in entity recognition model, which needs no endpoint and detects the preset entity types in individual bodies of text.

In [None]:
comprehend = boto3.client('comprehend')

# Run the built-in entity recognizer on the last four examples (index 6 onward)
for example in examples[6:]:
    entities = comprehend.detect_entities(LanguageCode="en", Text=example)

    for entity in entities["Entities"]:
        print("Built in : {}\t=>\t{}".format(entity["Type"], entity["Text"]))

Built in : PERSON	=>	Michael Lavers
Built in : QUANTITY	=>	10 ft
Built in : QUANTITY	=>	35degF
Built in : QUANTITY	=>	25deg
Built in : QUANTITY	=>	325degF
Built in : QUANTITY	=>	22 amp
Built in : QUANTITY	=>	125
Built in : QUANTITY	=>	480V
Built in : ORGANIZATION	=>	NEMA
Built in : QUANTITY	=>	4X
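
When the flat listing above gets long, grouping the detected text by entity type can make it easier to scan. A minimal sketch follows; the helper name `group_by_type` is hypothetical, and the sample entities are copied from the output above.

```python
from collections import defaultdict


def group_by_type(entities):
    """Group entity texts by their entity type, preserving detection order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


# A few entities copied from the built-in detection output above.
sample_entities = [
    {"Type": "PERSON", "Text": "Michael Lavers"},
    {"Type": "QUANTITY", "Text": "10 ft"},
    {"Type": "ORGANIZATION", "Text": "NEMA"},
    {"Type": "QUANTITY", "Text": "4X"},
]
print(group_by_type(sample_entities))
```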


### 5.4 Stop the endpoint  <a id="step5.4"></a>
The cost of real-time Custom Entity Recognition depends on both the throughput you set and the length of time the endpoint is active. We delete the endpoint here to avoid further charges.

In [None]:
response = comprehend_client.delete_endpoint(
    EndpointArn = endpoint_arn
)
response

{'ResponseMetadata': {'RequestId': '50915118-1101-469f-8dfa-9abbb5f2484d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '50915118-1101-469f-8dfa-9abbb5f2484d',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '2',
   'date': 'Wed, 22 Nov 2023 19:45:36 GMT'},
  'RetryAttempts': 0}}
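
Because charges accrue while the endpoint is active, it can help to guarantee deletion even when an analysis step raises an error. The following is a minimal sketch using a context manager. The names `temporary_endpoint` and `FakeClient` are hypothetical; `FakeClient` only stands in for the Comprehend client so that the example runs without AWS, and the ARN is a placeholder.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(client, endpoint_arn):
    """Yield the endpoint ARN and always request its deletion afterwards."""
    try:
        yield endpoint_arn
    finally:
        client.delete_endpoint(EndpointArn=endpoint_arn)


class FakeClient:
    """Stand-in for the Comprehend client; records deletion requests."""

    def __init__(self):
        self.deleted = []

    def delete_endpoint(self, EndpointArn):
        self.deleted.append(EndpointArn)
        return {}


client = FakeClient()
with temporary_endpoint(client, "arn:example:endpoint") as arn:
    print("Analyzing with", arn)
print(client.deleted)
```

With the real client, the body of the `with` block would contain the `detect_entities` calls from section 5.3.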

## 6. Conclusion  <a id="step6"></a>
Creating a custom entity recognition model is a more effective approach than using string matching or regular expressions to extract entities from documents. For example, to extract ENGINEER names in a document, it is difficult to enumerate all possible names. Additionally, without context, it is challenging to distinguish between ENGINEER names and ANALYST names. A custom entity recognition model can learn the context where those names are likely to appear. Additionally, string matching will not detect entities that have typos or follow new naming conventions, while this is possible using a custom model.

In this notebook, we showed you how to use the Custom Entity Recognition model. The notebook showed how to set up the training and test datasets, train a model, host a model, run inference, and delete the endpoint. To do this, we provided example datasets.

## 7. Additional resources  <a id="step7"></a>
***

To learn more about the Comprehend Custom Entity Recognizer, see [Amazon Comprehend Custom entity recognition](https://docs.aws.amazon.com/comprehend/latest/dg/custom-entity-recognition.html).

Post your questions related to Comprehend and find the other FAQs at: https://repost.aws/tags/TArJuWuDW_RS2Qbz1XXvbVzA/amazon-comprehend.

***