About

.NET Core C# code samples for Amazon Comprehend Custom Classification. You can use Amazon Comprehend to build your own models for custom classification, assigning a document to a class or a category.

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend processes any text file in UTF-8 format. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.

Overview

Custom classification is a two-step process. First, you train a custom classifier to recognize the categories that are of interest to you. To train the classifier, you send Amazon Comprehend a group of labeled documents. After Amazon Comprehend builds the classifier, you send it documents to be classified. The custom classifier examines each document and returns the label that best represents its content.

This sample has two .NET Core projects:

  • The project custom-classification uses Amazon Comprehend to create a custom classifier
  • The project analysis-job uses the custom classifier to categorize unlabeled documents in a test file (each line is a document) by starting an asynchronous classification job over documents stored in Amazon S3 (a sketch of the core call each project makes follows below)
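
To make the flow concrete, here is a minimal sketch of the training call that custom-classification issues. This is an illustration rather than the project's exact code: the region, classifier name, and placeholder ARNs are assumptions, and the real project reads its settings from the const string variables in Program.cs.

using System;
using System.Threading.Tasks;
using Amazon;
using Amazon.Comprehend;
using Amazon.Comprehend.Model;

class TrainClassifierSketch
{
    static async Task Main()
    {
        // Placeholder values; the real projects read these from Program.cs constants.
        const string serviceRoleArn = "arn:aws:iam::<your-account-id>:role/<your-comprehend-role>";
        const string trainingFile = "s3://<your-bucket-name>/training-data.csv";

        var client = new AmazonComprehendClient(RegionEndpoint.USEast1);

        // Step 1: train a custom classifier from the labeled CSV in S3.
        var create = await client.CreateDocumentClassifierAsync(new CreateDocumentClassifierRequest
        {
            DocumentClassifierName = "sample-classifier",
            DataAccessRoleArn = serviceRoleArn,
            LanguageCode = LanguageCode.En,
            InputDataConfig = new DocumentClassifierInputDataConfig { S3Uri = trainingFile }
        });

        Console.WriteLine($"Classifier ARN: {create.DocumentClassifierArn}");
    }
}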

Variables

You need to set the following variables in the Program.cs file inside the custom-classification and analysis-job folders before following the steps to execute the programs.

  • ServiceRoleArn: IAM service role for Amazon Comprehend that needs read/write access to your S3 buckets. Create this role in your AWS account, then set its ARN in this variable
  • TrainingFile: S3 location of the labeled data that Comprehend uses to train the custom classifier. Use your own file, or upload the training-data.csv provided with this sample to your S3 bucket
  • InputFile: S3 location of the test data used as input for the Comprehend classification batch job. Use your own file, or upload the test-data.csv provided with this sample to your S3 bucket
  • OutputLocation: S3 bucket where the Comprehend classification batch job emits its output. A sample output file, output.jsonl, is in the analysis-job folder

The role referenced by ServiceRoleArn uses the following policy document to grant Amazon Comprehend access to the S3 bucket where the training data is stored:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*Bucket"
            ],
            "Resource": [
                "arn:aws:s3:::<your-bucket-name>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*Object"
            ],
            "Resource": [
                "arn:aws:s3:::<your-bucket-name>/*"
            ]
        }
    ]
}
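
For Comprehend to use the role, the role must also trust the Comprehend service. The trust relationship below is the standard pattern for Comprehend service roles; attach it when you create the role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "comprehend.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}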

Prerequisites

  • The .NET Core SDK, since the projects are built and run with dotnet run
  • An AWS account, with credentials configured for the AWS SDK for .NET

Steps to execute

  • Download the code
  • Create a new S3 bucket for the training and unlabeled data
  • Create an IAM role using the policy document described above
  • Go to Program.cs in each project, find all const string variables, and replace the placeholder values with actual values
  • From a command line, go to the custom-classification project in the downloaded folder and execute dotnet run; this will download all dependencies, build, and run the program. Then do the same for the analysis-job project

At the completion of the custom-classification run, you'll see output similar to the following:

Status: [TRAINED], Message: []
Started at: [7/3/19 9:52:14 PM], completed at: [7/3/19 9:52:14 PM]
Accuracy: [0.9149], F1Score: [0.8674], Precision: [0.8901], Recall: [0.8489]
custom classifier created
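
The status and metric lines come from describing the classifier after training. Continuing the sketch above (client and create are from the earlier block; the 30-second poll interval is an arbitrary choice), inside the same Main method:

// Poll until training finishes, then print the evaluation metrics that
// Comprehend computed on a held-out portion of the training data.
DocumentClassifierProperties props;
do
{
    await Task.Delay(TimeSpan.FromSeconds(30));
    var describe = await client.DescribeDocumentClassifierAsync(new DescribeDocumentClassifierRequest
    {
        DocumentClassifierArn = create.DocumentClassifierArn
    });
    props = describe.DocumentClassifierProperties;
} while (props.Status == ModelStatus.Submitted || props.Status == ModelStatus.Training);

Console.WriteLine($"Status: [{props.Status}], Message: [{props.Message}]");
var metrics = props.ClassifierMetadata.EvaluationMetrics;
Console.WriteLine($"Accuracy: [{metrics.Accuracy}], F1Score: [{metrics.F1Score}], " +
                  $"Precision: [{metrics.Precision}], Recall: [{metrics.Recall}]");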

At the completion of the analysis-job run, you'll see output similar to the following:

Job Id: [8df6e23b534a9c7aa2831e58cbef04ac], Name: [06df74c8-c5ba-4325-a8e1-9ba5c54eeea5], Status: [COMPLETED], Message: []
Started at: [7/3/19 9:33:33 PM], completed at: [7/3/19 9:40:13 PM]
Output located at: [s3://<your-bucket-name>/<some-object-key>/<your-account-id>-CLN-8df6e23b534a9c7aa2831e58cbef04ac/output/output.tar.gz]
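
The corresponding sketch for analysis-job, again continuing with names from the blocks above (client, create, serviceRoleArn) and placeholder S3 URIs:

// Step 2: start an asynchronous classification job over the unlabeled
// documents (one document per line) and wait for it to complete.
var start = await client.StartDocumentClassificationJobAsync(new StartDocumentClassificationJobRequest
{
    JobName = Guid.NewGuid().ToString(),
    DocumentClassifierArn = create.DocumentClassifierArn,
    DataAccessRoleArn = serviceRoleArn,
    InputDataConfig = new InputDataConfig
    {
        S3Uri = "s3://<your-bucket-name>/test-data.csv",
        InputFormat = InputFormat.OneDocPerLine
    },
    OutputDataConfig = new OutputDataConfig { S3Uri = "s3://<your-bucket-name>/output/" }
});

DocumentClassificationJobProperties job;
do
{
    await Task.Delay(TimeSpan.FromSeconds(30));
    var describe = await client.DescribeDocumentClassificationJobAsync(new DescribeDocumentClassificationJobRequest
    {
        JobId = start.JobId
    });
    job = describe.DocumentClassificationJobProperties;
} while (job.JobStatus == JobStatus.Submitted || job.JobStatus == JobStatus.InProgress);

Console.WriteLine($"Output located at: [{job.OutputDataConfig.S3Uri}]");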

Dependencies

The following dependencies are defined in the .csproj file; they are downloaded when you first execute dotnet run:

<ItemGroup>
    <PackageReference Include="AWSSDK.Comprehend" Version="3.3.101" />
    <PackageReference Include="AWSSDK.Extensions.NETCore.Setup" Version="3.3.100.1" />
    <PackageReference Include="Microsoft.Extensions.Configuration" Version="2.2.0" />
    <PackageReference Include="Microsoft.Extensions.Configuration.Json" Version="2.2.0" />
</ItemGroup>

Troubleshooting

If you encounter a classification failure like the following, ensure that the S3 bucket is in the same region as the Comprehend endpoint you are calling:

INPUT_BUCKET_NOT_IN_SERVICE_REGION: The provided input S3 bucket is not in the service region.

If you get the following error, note that each classifier can have a maximum of 1,000 unique labels. The sample training file I used, jeopardy-filtered-labeled.csv, has only 3 unique labels, each with more than 1,000 documents (each line is a document). Read Training a Custom Classifier for more information. A quick local check of the label count (see the sketch below) can catch this before you pay for a failed training run.

Error: [Found 27983 unique labels. The maximum allowed number of unique labels is 1000.]
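
A throwaway sketch for that check, assuming the sample CSV layout with the label in the first column and one document per line:

using System;
using System.IO;
using System.Linq;

class LabelCountCheck
{
    static void Main()
    {
        // Assumes "label,document text" with one document per line, as in the sample file.
        var uniqueLabels = File.ReadLines("training-data.csv")
            .Select(line => line.Split(',')[0].Trim())
            .Distinct()
            .Count();

        Console.WriteLine($"Found {uniqueLabels} unique labels (the maximum allowed is 1000).");
    }
}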

Reference

The source of the training-data.csv file is this website.
