About

.NET Core C# code samples for Amazon Comprehend Custom Classification. You can use Amazon Comprehend to build your own models for custom classification, assigning a document to a class or a category.

Amazon Comprehend uses natural language processing (NLP) to extract insights about the content of documents. Amazon Comprehend processes any text file in UTF-8 format. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document. Use Amazon Comprehend to create new products based on understanding the structure of documents. For example, using Amazon Comprehend you can search social networking feeds for mentions of products or scan an entire document repository for key phrases.

Overview

Custom classification is a two-step process. First, you train a custom classifier to recognize the categories that are of interest to you. To train the classifier, you send Amazon Comprehend a group of labeled documents. After Amazon Comprehend builds the classifier, you send it documents to be classified. The custom classifier examines each document and returns the label that best represents its content.

This sample has two .NET Core projects:

  • The project custom-classification uses Amazon Comprehend to create a custom classifier
  • The project analysis-job uses the custom classifier to categorize unlabeled documents in a test file (each line is a document) by starting an asynchronous classification job over documents stored in Amazon S3 (a sketch of the core call each project makes follows below)
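
To make the flow concrete, here is a minimal sketch of the training call that custom-classification issues. This is an illustration rather than the project's exact code: the region, classifier name, and placeholder ARNs are assumptions, and the real project reads its settings from the const string variables in Program.cs.

using System;
using System.Threading.Tasks;
using Amazon;
using Amazon.Comprehend;
using Amazon.Comprehend.Model;

class TrainClassifierSketch
{
    static async Task Main()
    {
        // Placeholder values; the real projects read these from Program.cs constants.
        const string serviceRoleArn = "arn:aws:iam::<your-account-id>:role/<your-comprehend-role>";
        const string trainingFile = "s3://<your-bucket-name>/training-data.csv";

        var client = new AmazonComprehendClient(RegionEndpoint.USEast1);

        // Step 1: train a custom classifier from the labeled CSV in S3.
        var create = await client.CreateDocumentClassifierAsync(new CreateDocumentClassifierRequest
        {
            DocumentClassifierName = "sample-classifier",
            DataAccessRoleArn = serviceRoleArn,
            LanguageCode = LanguageCode.En,
            InputDataConfig = new DocumentClassifierInputDataConfig { S3Uri = trainingFile }
        });

        Console.WriteLine($"Classifier ARN: {create.DocumentClassifierArn}");
    }
}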

Variables

You need to set the following variables in the Program.cs file inside the custom-classification and analysis-job folders before following the steps to execute the programs.

  • ServiceRoleArn: IAM service role for Amazon Comprehend that needs read/write access to your S3 buckets. Create this role in your AWS account, then set its ARN in this variable
  • TrainingFile: S3 location of the labeled data that Comprehend uses to train the custom classifier. Use your own file, or upload the training-data.csv provided with this sample to your S3 bucket
  • InputFile: S3 location of the test data used as input for the Comprehend classification batch job. Use your own file, or upload the test-data.csv provided with this sample to your S3 bucket
  • OutputLocation: S3 bucket where the Comprehend classification batch job emits its output. A sample output file, output.jsonl, is in the analysis-job folder

The role referenced by ServiceRoleArn uses the following policy document to grant Amazon Comprehend access to the S3 bucket where the training data is stored:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:*Bucket"
            ],
            "Resource": [
                "arn:aws:s3:::<your-bucket-name>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*Object"
            ],
            "Resource": [
                "arn:aws:s3:::<your-bucket-name>/*"
            ]
        }
    ]
}
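
For Comprehend to use the role, the role must also trust the Comprehend service. The trust relationship below is the standard pattern for Comprehend service roles; attach it when you create the role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "comprehend.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}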

Prerequisites

  • The .NET Core SDK, since the projects are built and run with dotnet run
  • An AWS account, with credentials configured for the AWS SDK for .NET

Steps to execute

  • Download the code
  • Create a new S3 bucket for the training and unlabeled data
  • Create an IAM role using the policy document described above
  • Go to Program.cs in each project, find all const string variables, and replace the placeholder values with actual values
  • From a command line, go to the custom-classification project in the downloaded folder and execute dotnet run; this will download all dependencies, build, and run the program. Then do the same for the analysis-job project

At the completion of the custom-classification run, you'll see output similar to the following:

Status: [TRAINED], Message: []
Started at: [7/3/19 9:52:14 PM], completed at: [7/3/19 9:52:14 PM]
Accuracy: [0.9149], F1Score: [0.8674], Precision: [0.8901], Recall: [0.8489]
custom classifier created
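
The status and metric lines come from describing the classifier after training. Continuing the sketch above (client and create are from the earlier block; the 30-second poll interval is an arbitrary choice), inside the same Main method:

// Poll until training finishes, then print the evaluation metrics that
// Comprehend computed on a held-out portion of the training data.
DocumentClassifierProperties props;
do
{
    await Task.Delay(TimeSpan.FromSeconds(30));
    var describe = await client.DescribeDocumentClassifierAsync(new DescribeDocumentClassifierRequest
    {
        DocumentClassifierArn = create.DocumentClassifierArn
    });
    props = describe.DocumentClassifierProperties;
} while (props.Status == ModelStatus.Submitted || props.Status == ModelStatus.Training);

Console.WriteLine($"Status: [{props.Status}], Message: [{props.Message}]");
var metrics = props.ClassifierMetadata.EvaluationMetrics;
Console.WriteLine($"Accuracy: [{metrics.Accuracy}], F1Score: [{metrics.F1Score}], " +
                  $"Precision: [{metrics.Precision}], Recall: [{metrics.Recall}]");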

At the completion of the analysis-job run, you'll see output similar to the following:

Job Id: [8df6e23b534a9c7aa2831e58cbef04ac], Name: [06df74c8-c5ba-4325-a8e1-9ba5c54eeea5], Status: [COMPLETED], Message: []
Started at: [7/3/19 9:33:33 PM], completed at: [7/3/19 9:40:13 PM]
Output located at: [s3://<your-bucket-name>/<some-object-key>/<your-account-id>-CLN-8df6e23b534a9c7aa2831e58cbef04ac/output/output.tar.gz]
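
The corresponding sketch for analysis-job, again continuing with names from the blocks above (client, create, serviceRoleArn) and placeholder S3 URIs:

// Step 2: start an asynchronous classification job over the unlabeled
// documents (one document per line) and wait for it to complete.
var start = await client.StartDocumentClassificationJobAsync(new StartDocumentClassificationJobRequest
{
    JobName = Guid.NewGuid().ToString(),
    DocumentClassifierArn = create.DocumentClassifierArn,
    DataAccessRoleArn = serviceRoleArn,
    InputDataConfig = new InputDataConfig
    {
        S3Uri = "s3://<your-bucket-name>/test-data.csv",
        InputFormat = InputFormat.OneDocPerLine
    },
    OutputDataConfig = new OutputDataConfig { S3Uri = "s3://<your-bucket-name>/output/" }
});

DocumentClassificationJobProperties job;
do
{
    await Task.Delay(TimeSpan.FromSeconds(30));
    var describe = await client.DescribeDocumentClassificationJobAsync(new DescribeDocumentClassificationJobRequest
    {
        JobId = start.JobId
    });
    job = describe.DocumentClassificationJobProperties;
} while (job.JobStatus == JobStatus.Submitted || job.JobStatus == JobStatus.InProgress);

Console.WriteLine($"Output located at: [{job.OutputDataConfig.S3Uri}]");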

Dependencies

The following dependencies are defined in the .csproj file; they are downloaded when you first execute dotnet run:

<ItemGroup>
    <PackageReference Include="AWSSDK.Comprehend" Version="3.3.101" />
    <PackageReference Include="AWSSDK.Extensions.NETCore.Setup" Version="3.3.100.1" />
    <PackageReference Include="Microsoft.Extensions.Configuration" Version="2.2.0" />
    <PackageReference Include="Microsoft.Extensions.Configuration.Json" Version="2.2.0" />
</ItemGroup>

Troubleshooting

If you encounter a classification failure like the following, ensure that the S3 bucket is in the same region as the Comprehend endpoint you are calling:

INPUT_BUCKET_NOT_IN_SERVICE_REGION: The provided input S3 bucket is not in the service region.

If you get the following error, note that each classifier can have a maximum of 1,000 unique labels. The sample training file I used, jeopardy-filtered-labeled.csv, has only 3 unique labels, each with more than 1,000 documents (each line is a document). Read Training a Custom Classifier for more information. A quick local check of the label count (see the sketch below) can catch this before you pay for a failed training run.

Error: [Found 27983 unique labels. The maximum allowed number of unique labels is 1000.]
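
A throwaway sketch for that check, assuming the sample CSV layout with the label in the first column and one document per line:

using System;
using System.IO;
using System.Linq;

class LabelCountCheck
{
    static void Main()
    {
        // Assumes "label,document text" with one document per line, as in the sample file.
        var uniqueLabels = File.ReadLines("training-data.csv")
            .Select(line => line.Split(',')[0].Trim())
            .Distinct()
            .Count();

        Console.WriteLine($"Found {uniqueLabels} unique labels (the maximum allowed is 1000).");
    }
}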

Reference

The source of the training-data.csv file is this website.
