# Amazon Comprehend

## What is Amazon Comprehend?

Amazon Comprehend is one of the many AI services that AWS offers. It is designed for natural language processing (NLP), and its capabilities fall into two categories:

- **Prebuilt models**: you send text in a request and get insights back directly, with no training required.

- **Custom classifiers**: if you need to assign your own labels to text, you can train a classifier from labeled training data.



## Prebuilt Model Capabilities

The prebuilt models extract the following kinds of information from text:

- Entities: Amazon Comprehend returns a list of entities, such as people, places, and locations, identified in a document.

- Key phrases: Amazon Comprehend extracts key phrases that appear in a document. For example, a document about a basketball game might return the names of the teams, the name of the venue, and the final score.

- Language: Amazon Comprehend identifies the dominant language in a document. It can identify 100 languages.

- Sentiment: Amazon Comprehend determines the emotional sentiment of a document. Sentiment can be positive, neutral, negative, or mixed. *This is the prebuilt tool we use throughout the project.*

- Syntax: Amazon Comprehend parses each word in your document and determines the part of speech for the word. For example, in the sentence "It is raining today in Seattle," "it" is identified as a pronoun, "raining" is identified as a verb, and "Seattle" is identified as a proper noun.

- Document clustering or topic modeling: Amazon Comprehend examines documents and organizes them based on similar keywords within them.

Source: https://docs.aws.amazon.com/comprehend/latest/dg/what-is.html 
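As a rough illustration of the prebuilt capabilities listed above, the sketch below chains three of them: language detection, entity extraction, and key phrase extraction. The helper name `summarize_text` and the shape of its return value are assumptions made for this example, not part of the project code. It also assumes the detected language is one that entity and key phrase detection support.

```python
def summarize_text(client, text):
    """Run three prebuilt Comprehend analyses on one piece of text.

    `client` is expected to be a boto3 Comprehend client.
    `summarize_text` is a hypothetical helper used only for illustration.
    """
    # The response lists candidate languages, each with a confidence score.
    languages = client.detect_dominant_language(Text=text)["Languages"]
    language_code = max(languages, key=lambda lang: lang["Score"])["LanguageCode"]

    # Entities and key phrases both need the language code of the text.
    entities = client.detect_entities(
        Text=text, LanguageCode=language_code
    )["Entities"]
    key_phrases = client.detect_key_phrases(
        Text=text, LanguageCode=language_code
    )["KeyPhrases"]

    return {
        "language": language_code,
        "entities": [(entity["Type"], entity["Text"]) for entity in entities],
        "key_phrases": [phrase["Text"] for phrase in key_phrases],
    }
```

With real credentials, this could be called as `summarize_text(boto3.client('comprehend', region_name='us-east-1'), "It is raining today in Seattle")`.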


In [11]:
import boto3
import json
import pandas as pd
import numpy as np
import io
import csv
from sklearn.model_selection import train_test_split

In [2]:
s3_client = boto3.client('s3')
bucket = 'sagemaker-studio-197812454142-1pz1d5gxfcv'

In [3]:
# Create the Comprehend client.
# The IAM role running this notebook needs the "ComprehendFullAccess" policy attached.
comprehend = boto3.client(service_name='comprehend', region_name='us-east-1')

# Helper to apply to a pandas DataFrame column.
# Maps Comprehend's sentiment to the dataset's labels (1 = negative, 2 = positive);
# 3 marks a MIXED or NEUTRAL result, which has no counterpart in the dataset.
def getComprehendSentiment(text):
    result = comprehend.detect_sentiment(Text=text, LanguageCode='en')["Sentiment"]

    if result == "NEGATIVE":
        return 1
    elif result == "POSITIVE":
        return 2
    # Otherwise MIXED or NEUTRAL
    return 3

def getDataframeFromS3(file_name, names):
    # Load from S3 boto client
    response = s3_client.get_object(Bucket=bucket ,Key=file_name)
    # Get file from request
    file = response["Body"].read()
    # Load into pandas
    return pd.read_csv(io.BytesIO(file), names=names)
    

# S3 Load

In [5]:
file_name = 'test.csv'
names=["label", "title", "review"]
data_test = getDataframeFromS3(file_name = file_name, names = names)

In [24]:
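# Placeholder columns: predictions are filled in later, and every row
# starts out marked incorrect until its prediction matches the label.
data_test['predicted'] = None
data_test.loc[:, 'incorrect'] = 1
data_test.head()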

Unnamed: 0,label,title,review,predicted,incorrect
0,2,Great CD,My lovely Pat has one of the GREAT voices of h...,,1
1,2,One of the best game music soundtracks - for a...,Despite the fact that I have only played a sma...,,1
2,1,Batteries died within a year ...,I bought this charger in Jul 2003 and it worke...,,1
3,2,"works fine, but Maha Energy is better",Check out Maha Energy's website. Their Powerex...,,1
4,2,Great for the non-audiophile,Reviewed quite a bit of the combo players and ...,,1


# Default Amazon Comprehend Sentiment Analysis

## Loading

In [10]:
data_test.shape

(400000, 5)

In [12]:
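# Evaluate on a random sample of 5,000 of the 400,000 reviews.
# No seed is set, so the sample (and the scores below) vary between runs.
data_test = data_test.sample(5000)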

In [13]:
data_test.loc[:,'predicted'] = data_test.loc[:,'review'].apply(getComprehendSentiment)
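
The `apply` call above sends one request per review. A batched variant could cut the number of requests by using `batch_detect_sentiment`, which accepts up to 25 documents per call. The sketch below is one possible way to do that. The helper names, the `None` placeholder for failed documents, and the 5,000-byte per-document limit are assumptions for illustration and were not used to produce the results in this notebook.

```python
MAX_BATCH = 25    # documents accepted per batch_detect_sentiment call
MAX_BYTES = 5000  # assumed per-document size limit, in UTF-8 bytes


def truncate_utf8(text, max_bytes=MAX_BYTES):
    """Trim text so its UTF-8 encoding fits in max_bytes.

    Decoding with errors="ignore" drops a trailing partial character
    instead of raising.
    """
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def batch_sentiments(client, texts, language_code="en"):
    """Return one sentiment string per text, or None where no result came back.

    `client` is expected to be a boto3 Comprehend client.
    """
    sentiments = [None] * len(texts)
    for start in range(0, len(texts), MAX_BATCH):
        chunk = [truncate_utf8(t) for t in texts[start:start + MAX_BATCH]]
        response = client.batch_detect_sentiment(
            TextList=chunk, LanguageCode=language_code
        )
        # "Index" is the position of the document within this chunk.
        for item in response["ResultList"]:
            sentiments[start + item["Index"]] = item["Sentiment"]
    return sentiments
```

Documents that Comprehend rejects are reported in the response's `ErrorList` and stay `None` here, so they can be retried or dropped explicitly.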

In [14]:
data_test

Unnamed: 0,label,title,review,predicted,incorrect
315933,1,counterfeit?,the batteries barely held a charge. One to thr...,1,1
330527,2,Awesome! Can't live without this stuff!,I love this stuff! I used to have a horrible t...,2,1
160858,1,waste of money,This CD is for you if you are a die hard Spide...,1,1
226132,1,Definately Not Like Morrowind,I was dissapointed that this wasn't like Morro...,1,1
281701,2,Pretty Cute Comedy,"This isn't the funniest movie I've ever seen, ...",2,1
...,...,...,...,...,...
124740,1,not great,This is a typical book for the Dragonlance ser...,1,1
85054,1,"Broke in 2 months, hideous customer service",My first unit was stolen at a car service plac...,1,1
265831,1,Not worth the money!!,"A good lens, but save your dollars and buy the...",2,1
364079,2,Good guide for those interested in Mustang upg...,This book is basically a republishing of artic...,2,1


In [15]:
data_test.loc[data_test['predicted'] == data_test['label'],'incorrect'] = 0

## Results

In [22]:
errors = data_test.incorrect.sum()
accuracy = (data_test.shape[0]-errors)/data_test.shape[0]
print(f"Accuracy is {accuracy*100}%")

Accuracy is 81.64%


In [24]:
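# Count errors only among rows with a POSITIVE or NEGATIVE prediction.
# Note: the denominator is still the full sample, so MIXED/NEUTRAL rows
# count as non-errors here rather than being removed from the total.
errors = data_test.loc[data_test.predicted != 3, 'incorrect'].sum()
accuracy = (data_test.shape[0]-errors)/data_test.shape[0]
print(f"Accuracy discarding 'MIXED' and 'NEUTRAL' results is {accuracy*100}%")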

Accuracy discarding 'MIXED' and 'NEUTRAL' results is 92.84%
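
The figure above divides by the full sample size, so rows predicted as MIXED or NEUTRAL are effectively counted as correct. To drop those rows from the evaluation entirely, both the numerator and the denominator need to be restricted to the remaining rows. The following is a minimal sketch on toy data; the function name `accuracy_excluding` is an assumption for illustration.

```python
def accuracy_excluding(labels, predictions, excluded=(3,)):
    """Accuracy computed only over rows whose prediction is not excluded."""
    kept = [(y, p) for y, p in zip(labels, predictions) if p not in excluded]
    if not kept:
        raise ValueError("no predictions left after exclusion")
    correct = sum(1 for y, p in kept if y == p)
    return correct / len(kept)


# Toy data: 1 = negative, 2 = positive, 3 = mixed or neutral prediction.
labels = [1, 2, 1, 2, 2]
predictions = [1, 2, 3, 1, 2]

# One row is excluded; 3 of the remaining 4 rows are correct.
print(accuracy_excluding(labels, predictions))  # 0.75
```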


In [25]:
del data_test

# Build Custom Classifier

## Prepare Training File

### First attempt: raw text (doesn't work)

In [4]:
file_name = 'train.csv'
names=["label", "title", "review"]
data_train = getDataframeFromS3(file_name = file_name, names = names)

In [5]:
data_train.label = data_train.label.replace(1, "NEGATIVE")
data_train.label = data_train.label.replace(2, "POSITIVE")
data_train.review = data_train.review.str.replace(",", "")

In [6]:
data_train.head()

Unnamed: 0,label,title,review
0,POSITIVE,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,POSITIVE,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,POSITIVE,Amazing!,This soundtrack is my favorite music of all ti...
3,POSITIVE,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,POSITIVE,"Remember, Pull Your Jaw Off The Floor After He...",If you've played the game you know how divine ...


### Try with normalized text

In [4]:
file_name = 'X_train.csv'
names=["review"]
data_train = getDataframeFromS3(file_name = file_name, names = names)

In [5]:
file_name = 'y_train.csv'
names=["label"]
data_train["label"] = getDataframeFromS3(file_name = file_name, names = names)["label"]

In [19]:
# Comprehend accepts at most 1M training documents, so keep a 1M-row
# subsample. Stratifying on the label preserves the class balance.
MAX_DOCS = 1_000_000
_, data_train = train_test_split(data_train, test_size=MAX_DOCS, random_state=0, stratify=data_train['label'])
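
For intuition about what the stratified split above does, the sketch below draws a label-balanced subsample using only the standard library. It is a simplified stand-in rather than scikit-learn's implementation: the function name is hypothetical, and because each label's quota is rounded independently, the result can be off by a few rows from the requested size.

```python
import random
from collections import defaultdict


def stratified_sample_indices(labels, n_samples, seed=0):
    """Pick about n_samples indices while keeping each label's share of the data."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    chosen = []
    for indices in by_label.values():
        # Allocate slots in proportion to how often the label occurs.
        quota = round(n_samples * len(indices) / len(labels))
        chosen.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(chosen)


labels = [0] * 60 + [1] * 40
sample = stratified_sample_indices(labels, n_samples=10)
# The 60/40 split of the full data is preserved: 6 zeros and 4 ones.
print(sum(1 for i in sample if labels[i] == 0), len(sample))
```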

In [23]:
data_train.label = data_train.label.replace(0, "NEGATIVE")
data_train.label = data_train.label.replace(1, "POSITIVE")
data_train = data_train[["label", "review"]]
data_train.shape

(1000000, 2)

In [20]:
data_train.label.value_counts()

1    500000
0    500000
Name: label, dtype: int64

In [22]:
data_train.head()

Unnamed: 0,label,review
2648109,POSITIVE,this item perfect edition our bathroom compact...
1083475,NEGATIVE,talk crab and dog that type the computer and d...
3433029,POSITIVE,opinion the story start out bite slow but the ...
2830597,POSITIVE,author royce adam again this action pack story...
1738281,NEGATIVE,when saw the review title arrive break twice t...


## Save Training File to S3

In [24]:
df = data_train
fileName = "amazon-comprehend-train-data.csv"
# Serialize as a headerless CSV with the label in the first column,
# then upload the result to S3
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, header=False, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, fileName).put(Body=csv_buffer.getvalue())

{'ResponseMetadata': {'RequestId': 'E3833BEC5DFE35C9',
  'HostId': '26LOiGKLBKuWUpVb77m04MQEUdR/Khkwdthuc2D6od3Jg+gfGEW1Naka1LROJ5Mzd5UwcJxOaOo=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': '26LOiGKLBKuWUpVb77m04MQEUdR/Khkwdthuc2D6od3Jg+gfGEW1Naka1LROJ5Mzd5UwcJxOaOo=',
   'x-amz-request-id': 'E3833BEC5DFE35C9',
   'date': 'Mon, 07 Sep 2020 00:57:54 GMT',
   'etag': '"25573ddfd32fb1e019d3f0e5c0b7c932"',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0},
 'ETag': '"25573ddfd32fb1e019d3f0e5c0b7c932"'}
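
For reference, the multi-class training file is a headerless CSV with the label in the first column and the document text in the second, one document per line. The sketch below shows that layout on two made-up rows using the standard `csv` module. The helper name and the whitespace collapsing are assumptions for illustration, not steps taken from this notebook.

```python
import csv
import io


def to_comprehend_csv(rows):
    """Serialize (label, text) pairs as a headerless CSV with the label first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, text in rows:
        # Collapse line breaks and repeated spaces so each document
        # stays on a single line.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()


rows = [
    ("POSITIVE", "great cd, lovely voice"),
    ("NEGATIVE", "batteries died\nwithin a year"),
]
print(to_comprehend_csv(rows))
```

The writer quotes any field that contains a comma, so commas in the text do not have to be stripped beforehand.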

## Prepare request to create custom classifier

In [25]:
comprehend.create_document_classifier(
    DocumentClassifierName='amazon-reviews-custom-classifier-sample',
    DataAccessRoleArn='arn:aws:iam::197812454142:role/role_comprehend_s3',
    InputDataConfig={
        'S3Uri': f's3://{bucket}/{fileName}'
    },
    LanguageCode='en',
    Mode='MULTI_CLASS'
)

{'DocumentClassifierArn': 'arn:aws:comprehend:us-east-1:197812454142:document-classifier/amazon-reviews-custom-classifier',
 'ResponseMetadata': {'RequestId': 'ba0e44f5-8150-49d4-bc9d-9e4f45e78ffc',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ba0e44f5-8150-49d4-bc9d-9e4f45e78ffc',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '122',
   'date': 'Mon, 07 Sep 2020 00:57:56 GMT'},
  'RetryAttempts': 0}}

## Check results in Amazon Comprehend

Results are written to a file in S3.

Accuracy: 93.37%

Precision: 93.38%

Recall: 93.37%

F1 Score: 93.00%
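
Training runs asynchronously, so metrics like the ones above only become available once the classifier reaches a finished state. One way to wait for that and read the metrics from code is to poll `describe_document_classifier`. The sketch below assumes the terminal status names `TRAINED`, `IN_ERROR`, and `STOPPED`. The helper names, polling interval, and timeout are illustrative choices rather than part of this notebook.

```python
import time

# Assumed terminal statuses; other values mean training is still in progress.
TERMINAL_STATUSES = {"TRAINED", "IN_ERROR", "STOPPED"}


def wait_for_classifier(client, classifier_arn, poll_seconds=60, max_polls=240):
    """Poll until training finishes and return the classifier properties.

    `client` is expected to be a boto3 Comprehend client.
    """
    for _ in range(max_polls):
        props = client.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )["DocumentClassifierProperties"]
        if props["Status"] in TERMINAL_STATUSES:
            return props
        time.sleep(poll_seconds)
    raise TimeoutError("classifier did not reach a terminal status in time")


def evaluation_metrics(props):
    """Return the evaluation metrics dict, or {} if none are present yet."""
    return props.get("ClassifierMetadata", {}).get("EvaluationMetrics", {})
```

The `DocumentClassifierArn` returned by `create_document_classifier` is the value to pass as `classifier_arn`.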
 