# Comprehend Demo

***
Copyright [2017]-[2018] Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at

http://aws.amazon.com/apache2.0/

or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
***

### Prerequisites:

#### Identity and Access Management

The user or role that executes the commands must have permissions in AWS Identity and Access Management (IAM) to perform those actions. AWS provides a set of managed policies that help you get started quickly. For this example, attach the following managed policies to your user or role:

    ComprehendReadOnly
    TranslateReadOnly

For production implementations, we recommend following AWS IAM best practices; covering them is out of scope for this workshop.
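
As a starting point for a narrower setup, the permissions can be limited to the four operations this notebook calls. The following is a minimal sketch, not a reviewed policy: the `demo_policy` name is hypothetical, and the use of `"Resource": "*"` is an assumption you should check against your own account requirements.

```python
import json

# Hypothetical least-privilege policy covering only the calls made in this notebook:
# batch_detect_sentiment, batch_detect_entities, batch_detect_key_phrases, translate_text.
demo_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "comprehend:BatchDetectSentiment",
                "comprehend:BatchDetectEntities",
                "comprehend:BatchDetectKeyPhrases",
                "translate:TranslateText",
            ],
            # Assumption: these calls operate on text in the request, not on named resources.
            "Resource": "*",
        }
    ],
}

# Print the policy as JSON so it can be pasted into the IAM console.
print(json.dumps(demo_policy, indent=2))
```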

### Dataset

Download the [first sample](https://github.com/fivethirtyeight/russian-troll-tweets/) of the Russian troll tweets published alongside the FiveThirtyEight story [Why We’re Sharing 3 Million Russian Troll Tweets](https://fivethirtyeight.com/features/why-were-sharing-3-million-russian-troll-tweets/).

In [None]:
!wget -nv https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv

In [None]:
import boto3
import itertools
from pprint import pprint
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline  

comprehend = boto3.client('comprehend')

In [None]:
# Ingest file
df = pd.read_csv('IRAhandle_tweets_1.csv')[['language','author','region','following','followers','content']]
print(df.shape)  # (number of rows, number of columns)
df.head()

In [None]:
# Plot the distributions of the following and followers counts
df[['following', 'followers']].hist(bins=50)

## Sentiment 

Use Amazon Comprehend to extract sentiment.

Comprehend supports a number of [languages](https://docs.aws.amazon.com/comprehend/latest/dg/how-languages.html), so let's first look at the distribution of languages in the dataset.

In [None]:
lang_top10 = df['language'].value_counts()[:10]
lang_top10
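
Comprehend calls take a short language code rather than the full language name stored in the `language` column. A minimal sketch of the lookup, assuming the column holds names such as `English` and `Italian`; the `LANGUAGE_CODES` table and the `to_language_code` helper are hypothetical and cover only a handful of languages:

```python
# Hypothetical mapping from dataset language names to Comprehend language codes.
LANGUAGE_CODES = {
    "English": "en",
    "Spanish": "es",
    "French": "fr",
    "German": "de",
    "Italian": "it",
    "Portuguese": "pt",
}


def to_language_code(language_name):
    """Return the language code for a dataset language name, or None if unmapped."""
    return LANGUAGE_CODES.get(language_name)


# Names missing from the table come back as None and should be skipped or handled separately.
for name in ["English", "Italian", "Russian"]:
    print(name, "->", to_language_code(name))
```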

Get a batch of English tweets and take a look at the first one.

In [None]:
# Get a batch of 25 results
max_results = 25
contents = list(df[df['language'] == 'English']['content'].head(max_results).values)
contents[0]
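
The batch size of 25 matches the per-request document limit of Comprehend's batch operations. Scoring more tweets than that means splitting the list across several requests. A minimal sketch, assuming a hypothetical helper named `chunked`:

```python
def chunked(items, size=25):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder strings standing in for tweet text.
example_texts = ["tweet %d" % i for i in range(60)]
batch_sizes = [len(batch) for batch in chunked(example_texts)]
print(batch_sizes)  # → [25, 25, 10]
```

Each batch can then be passed as `TextList` to a batch call such as `comprehend.batch_detect_sentiment`.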

Get the sentiment for this batch and plot the scores.

In [None]:
resp = comprehend.batch_detect_sentiment(TextList=contents, LanguageCode='en')
sentiments = pd.DataFrame([r['SentimentScore'] for r in resp['ResultList']])
print(sentiments.head())
sentiments.plot() # TODO: Change colors for results
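
A batch response can also contain an `ErrorList` for documents that could not be processed, and every entry in `ResultList` and `ErrorList` carries an `Index` pointing back to its position in the input. A sketch of pairing results with their source text, using a hand-written response in the shape the service returns; `pair_results` is a hypothetical helper, and the sample scores and error values are made up for illustration:

```python
def pair_results(texts, response):
    """Split a batch response into (text, result) pairs and (text, error) pairs."""
    ok = [(texts[r["Index"]], r) for r in response.get("ResultList", [])]
    failed = [(texts[e["Index"]], e) for e in response.get("ErrorList", [])]
    return ok, failed


# Made-up response shaped like the output of batch_detect_sentiment.
sample_texts = ["great news", "", "terrible day"]
sample_response = {
    "ResultList": [
        {"Index": 0, "Sentiment": "POSITIVE",
         "SentimentScore": {"Positive": 0.9, "Negative": 0.02, "Neutral": 0.07, "Mixed": 0.01}},
        {"Index": 2, "Sentiment": "NEGATIVE",
         "SentimentScore": {"Positive": 0.03, "Negative": 0.9, "Neutral": 0.06, "Mixed": 0.01}},
    ],
    "ErrorList": [
        {"Index": 1, "ErrorCode": "ExampleErrorCode", "ErrorMessage": "placeholder message"},
    ],
}

ok, failed = pair_results(sample_texts, sample_response)
for text, result in ok:
    print(repr(text), result["Sentiment"])
print("failed:", [text for text, _ in failed])
```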

### Translate

Get some Italian tweets and translate one of them into English.

In [None]:
import boto3

translate = boto3.client('translate')

In [None]:
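# Use a separate variable so that `contents` keeps the English batch used by later cells
italian_contents = list(df[df['language'] == 'Italian']['content'].head(max_results).values)
sample = italian_contents[4] # Pick an arbitrary sample
sample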

In [None]:
response = translate.translate_text(
    Text=sample,
    SourceLanguageCode='it',
    TargetLanguageCode='en'
)
response['TranslatedText']
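
`translate_text` handles one string per call, so translating a whole batch means looping over it. A minimal sketch, assuming a hypothetical `translate_all` helper that receives the client as an argument; `EchoClient` is a stand-in invented here so the pattern can be run without AWS credentials:

```python
def translate_all(client, texts, source="it", target="en"):
    """Translate each text with the given client and return the results in input order."""
    translated = []
    for text in texts:
        response = client.translate_text(
            Text=text,
            SourceLanguageCode=source,
            TargetLanguageCode=target,
        )
        translated.append(response["TranslatedText"])
    return translated


class EchoClient:
    """Stand-in for boto3.client('translate'); it only tags the text, it does not translate."""

    def translate_text(self, Text, SourceLanguageCode, TargetLanguageCode):
        tagged = "[%s->%s] %s" % (SourceLanguageCode, TargetLanguageCode, Text)
        return {"TranslatedText": tagged}


print(translate_all(EchoClient(), ["ciao", "buongiorno"]))
```

Passing the `translate` client created earlier in place of `EchoClient()` sends the requests to the real service, one request per text.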

## Entities

Extract the entities mentioned in this content.

In [None]:
# Get some entities back
resp = comprehend.batch_detect_entities(TextList=list(contents), LanguageCode='en')
entities = list(itertools.chain.from_iterable([r['Entities'] for r in resp['ResultList']]))
df_entities = pd.DataFrame(entities)
df_entities.head()

In [None]:
# Get the ten most common person references
df_entities[df_entities['Type'] == 'PERSON']['Text'].value_counts().head(10)
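
Each entity returned by Comprehend includes a `Score` between 0 and 1 expressing the service's confidence in the detection. Filtering on it before counting can reduce noise. A sketch on made-up entities; the `top_entities` helper and the 0.8 threshold are illustrative assumptions, not recommended values:

```python
from collections import Counter


def top_entities(entities, entity_type="PERSON", min_score=0.8, n=10):
    """Count entity texts of one type whose confidence is at least min_score."""
    counts = Counter(
        e["Text"] for e in entities
        if e["Type"] == entity_type and e["Score"] >= min_score
    )
    return counts.most_common(n)


# Made-up entities shaped like the items in a batch_detect_entities result.
sample_entities = [
    {"Text": "Alice", "Type": "PERSON", "Score": 0.99},
    {"Text": "Alice", "Type": "PERSON", "Score": 0.95},
    {"Text": "Bob", "Type": "PERSON", "Score": 0.55},
    {"Text": "Seattle", "Type": "LOCATION", "Score": 0.97},
]
print(top_entities(sample_entities))  # → [('Alice', 2)]
```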

In [None]:
# Look at some of the key phrases that are being discussed in these tweets
resp = comprehend.batch_detect_key_phrases(TextList=list(contents), LanguageCode='en')
df_phrases = pd.DataFrame([[p['Text'] for p in r['KeyPhrases']] for r in resp['ResultList']])
df_phrases.head(10)
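
The table above has one row per tweet, with as many columns as the longest list of phrases. To see which phrases recur across tweets, the nested lists can be flattened and counted. A sketch on a made-up response; `count_key_phrases` is a hypothetical helper, and lower-casing the phrases is a choice made here, not something the service does:

```python
from collections import Counter


def count_key_phrases(response):
    """Count key phrases across all results in a batch response, ignoring case."""
    return Counter(
        phrase["Text"].lower()
        for result in response["ResultList"]
        for phrase in result["KeyPhrases"]
    )


# Made-up response shaped like the output of batch_detect_key_phrases.
sample_phrase_response = {
    "ResultList": [
        {"Index": 0, "KeyPhrases": [{"Text": "Breaking News", "Score": 0.98}]},
        {"Index": 1, "KeyPhrases": [{"Text": "breaking news", "Score": 0.97},
                                    {"Text": "the election", "Score": 0.93}]},
    ]
}
phrase_counts = count_key_phrases(sample_phrase_response)
print(phrase_counts.most_common(2))
```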