# SIT205 Thinking Systems and Cognition Science - Assignment 2

## Group: Philip Castiglione (217157862) and Warwick Smith (215239649)

## Topic 1: Text Analysis

We chose to analyse the recent events of Australian politics using text produced by members of the general public on Twitter.

To do this, we collected approximately 6,000 tweets containing the hashtag #libspill using the Twitter API. We obtain these using the open source `python-twitter` [client library](https://github.com/bear/python-twitter).

We then clean and filter these tweets to leave us with a set of 941 unique documents for textual analysis.

We send these documents to the Watson Natural Language Understanding API.

Finally, we complete a statistical analysis of the results.

This notebook is separated into 4 parts:

1. [Part 1 - Document Collection](#Part-1---Document-Collection)
1. [Part 2 - Document Cleaning](#Part-2---Document-Cleaning)
1. [Part 3 - Watson NLU](#Part-3---Watson-NLU)
1. [Part 4 - Analysis of Watson Output](#Part-4---Analysis-of-Watson-Output)

Notes:
- Each part writes out results to a file, to allow us to work on sections independently, and to save on API calls
- We ran the analysis at Mon 17 Sept 2018 and the number of tweets and documents were from this run. Running at a different point in time will lead to different data capture and results.

### Setup

Load our dependencies:

In [2]:
# Standard library
import os
import pickle
import json
from collections import Counter
import statistics

# Additional libraries
import twitter                    # python-twitter API
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 import *
from dotenv import load_dotenv    # for management of twitter credentials

We use the `dotenv` library to load key-value pairs from a `.env` file into the os environment:

In [3]:
load_dotenv()

True

Utilise external environment to source our API authentication credentials:

In [4]:
TWITTER_CONSUMER_KEY = os.getenv("TWITTER_CONSUMER_KEY")
TWITTER_CONSUMER_SECRET = os.getenv("TWITTER_CONSUMER_SECRET")
TWITTER_ACCESS_TOKEN_KEY = os.getenv("TWITTER_ACCESS_TOKEN_KEY")
TWITTER_ACCESS_TOKEN_SECRET = os.getenv("TWITTER_ACCESS_TOKEN_SECRET")
WATSON_NLU_API_KEY = os.getenv("WATSON_NLU_API_KEY")

# WARNING: do not commit to git with any of these values printed to a cell's output

## Part 1 - Document Collection

[Return to Contents](#Topic-1:-Text-Analysis)

Create python-twitter API instance:

In [5]:
api = twitter.Api(consumer_key=TWITTER_CONSUMER_KEY,
                  consumer_secret=TWITTER_CONSUMER_SECRET,
                  access_token_key=TWITTER_ACCESS_TOKEN_KEY,
                  access_token_secret=TWITTER_ACCESS_TOKEN_SECRET,
                  tweet_mode='extended')

Twitter search hashtag to find documents for analysis:

In [6]:
hashtag = "libspill"

We collect tweets in batches, up to a particular count, total_count:

In [7]:
def collect_tweets(api, hashtag, batch_max, total_count):
    tweets = []
    batch_max = str(batch_max)
    results = api.GetSearch(term=hashtag, result_type="recent", lang="en", 
                            count=batch_max, return_json=True)

    tweets += results['statuses']
    
    ids = [tweet['id'] for tweet in tweets]
    max_tweet_id = str(min(ids)-1)
    
    previous_tweet_count = 0
    while len(tweets) < total_count and len(tweets) > previous_tweet_count:
        previous_tweet_count = len(tweets)
        
        print("{} tweets collected for hashtag {}. Most recent tweeted at {}".format(
            len(tweets), hashtag, tweets[len(tweets)-1]['created_at']))
        
        results = api.GetSearch(term=hashtag, result_type="recent", lang="en", 
                                count=batch_max, return_json=True, max_id=max_tweet_id)
        tweets += results['statuses']
        ids = [tweet['id'] for tweet in tweets]
        max_tweet_id = str(min(ids)-1)
        
    print("{} tweets collected for hashtag {}. Most recent tweeted at {}".format(
            len(tweets), hashtag, tweets[len(tweets)-1]['created_at']))
    return tweets

To allow for tweet cleaning, and particularly filtering of re-tweets, we aim to collect 6,000 tweets relating to the #libspill hashtag:

In [16]:
tweet_collection = collect_tweets(api, hashtag, 100, 6000)

100 tweets collected for hashtag libspill. Most recent tweeted at Mon Sep 17 05:59:07 +0000 2018
200 tweets collected for hashtag libspill. Most recent tweeted at Sun Sep 16 23:05:12 +0000 2018
300 tweets collected for hashtag libspill. Most recent tweeted at Sun Sep 16 08:46:30 +0000 2018
400 tweets collected for hashtag libspill. Most recent tweeted at Sun Sep 16 00:06:21 +0000 2018
500 tweets collected for hashtag libspill. Most recent tweeted at Sat Sep 15 22:57:07 +0000 2018
600 tweets collected for hashtag libspill. Most recent tweeted at Sat Sep 15 08:51:00 +0000 2018
700 tweets collected for hashtag libspill. Most recent tweeted at Sat Sep 15 01:24:29 +0000 2018
800 tweets collected for hashtag libspill. Most recent tweeted at Fri Sep 14 13:27:57 +0000 2018
900 tweets collected for hashtag libspill. Most recent tweeted at Fri Sep 14 07:47:36 +0000 2018
1000 tweets collected for hashtag libspill. Most recent tweeted at Fri Sep 14 04:19:05 +0000 2018
1100 tweets collected for has

Write out the colleted tweets to a file, to save on future API calls:

In [17]:
tweets_filename = f"cached_tweets_{hashtag}.pkl"
with open(tweets_filename, 'wb') as f:
    pickle.dump(tweet_collection, f)

## Part 2 - Document Cleaning

[Return to Contents](#Topic-1:-Text-Analysis)

Load tweets from the stored .pkl file:

In [8]:
tweets_filename = f"cached_tweets_{hashtag}.pkl"
tweet_collection = None
with open(tweets_filename, 'rb') as f:
    tweet_collection = pickle.load(f)

In [9]:
print("The number of raw uncleaned tweets is {}.".format(len(tweet_collection)))

The number of raw uncleaned tweets is 6078.


We perform a series of transformations in a pipeline to extract documents in the format and with the content we want, from the tweet collection:

In [10]:
def extract_documents(tweets):
    extract_text = lambda tweets: [tweet['full_text'] for tweet in tweets]
    convert_whitespace_chars = lambda tweets: [tweet.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ') for tweet in tweets]
    squash_whitespace = lambda tweets: [tweet.replace('  ', ' ') for tweet in tweets]
    tokenize = lambda tweets: [tweet.strip().split() for tweet in tweets]
    strip_links = lambda tweets: [[token for token in tokens if "http" not in token] for tokens in tweets]
    strip_mentions = lambda tweets: [[token for token in tokens if token[0] is not '@'] for tokens in tweets]
    strip_hashtags = lambda tweets: [[token for token in tokens if token[0] is not '#'] for tokens in tweets]
    filter_empty = lambda tweets: [tweet for tweet in tweets if len(tweet) > 0]
    filter_retweets = lambda tweets: [tweet for tweet in tweets if tweet[0] != 'RT']
    rejoin = lambda tweets: [' '.join(tokens) for tokens in tweets]
    filter_short = lambda tweets: [tweet for tweet in tweets if len(tweet) > 40]
    
    documents = tweets
    
    for transformation in [
        extract_text,
        convert_whitespace_chars,
        squash_whitespace,
        tokenize,
        strip_links,
        strip_mentions,
        strip_hashtags,
        filter_empty,
        filter_retweets,
        rejoin,
        filter_short,
    ]:
        documents = transformation(documents)
    return documents

In [11]:
documents = extract_documents(tweet_collection)

We filter out a number of tweets (particularly retweets, since they're duplicate content), so let's see how many documents we have now:

In [12]:
print("The number of cleaned documents is {}.".format(len(documents)))

The number of cleaned documents is 941.


## Part 3 - Watson NLU

[Return to Contents](#Topic-1:-Text-Analysis)

Create an NLU instance using IBM Watson SDK. Created using Sydney as the optional location selection:

In [13]:
url = "https://gateway-syd.watsonplatform.net/natural-language-understanding/api"
version = "2018-03-19"

natural_language_understanding = NaturalLanguageUnderstandingV1(
    version=version,
    iam_apikey=WATSON_NLU_API_KEY,
    url=url
)

Create a function which retrieves Watson NLU analyses for each of our documents using the NLU instance, and tracks progress:

In [14]:
def get_analyses(documents):
    entities = EntitiesOptions(sentiment=True, emotion=True, limit=5)
    sentiment = SentimentOptions()
    categories = CategoriesOptions()
    keywords = KeywordsOptions(sentiment=True, emotion=True, limit=5)
    emotion = EmotionOptions()
    document_count = 0
    
    def analyze(document):
        count = documents.index(document) + 1
        if count % 10 == 0 or count == len(documents):
            print("Analysing document #{} of {}...".format(count, len(documents)))
        return natural_language_understanding.analyze(
            text=document,
            features=Features(
                entities=entities,
                sentiment=sentiment,
                categories=categories,
                keywords=keywords,
                emotion=emotion,
            )
        )
    
    return [analyze(document) for document in documents]

In [228]:
watson_analyses = get_analyses(documents)

Analysing document #10 of 941...
Analysing document #20 of 941...
Analysing document #30 of 941...
Analysing document #40 of 941...
Analysing document #50 of 941...
Analysing document #60 of 941...
Analysing document #70 of 941...
Analysing document #80 of 941...
Analysing document #90 of 941...
Analysing document #40 of 941...
Analysing document #100 of 941...
Analysing document #110 of 941...
Analysing document #120 of 941...
Analysing document #130 of 941...
Analysing document #40 of 941...
Analysing document #160 of 941...
Analysing document #170 of 941...
Analysing document #180 of 941...
Analysing document #190 of 941...
Analysing document #200 of 941...
Analysing document #210 of 941...
Analysing document #230 of 941...
Analysing document #240 of 941...
Analysing document #250 of 941...
Analysing document #40 of 941...
Analysing document #280 of 941...
Analysing document #290 of 941...
Analysing document #300 of 941...
Analysing document #310 of 941...
Analysing document #330 of

Write out analyses to local disk, so we can perform analysis without refetching each time:

In [16]:
watson_analysis_filename = f"cached_watson_analysis_{hashtag}.pkl"

In [None]:
with open(watson_analysis_filename, 'wb') as f:
    pickle.dump(watson_analyses, f)

## Part 4 - Analysis of Watson Output

[Return to Contents](#Topic-1:-Text-Analysis)

Load in our Watson analyses from disk, build five analysis objects with the information we need for our report, and write those objects out to disk.

In [17]:
watson_analyses = None
with open(watson_analysis_filename, 'rb') as f:
    watson_analyses = pickle.load(f)

### Sentiment Analysis

In [18]:
sentiment_anaylsis = {}

In [19]:
sentiment_labels = [analysis.result['sentiment']['document']['label'] for analysis in watson_analyses]
sentiment_counter = Counter(sentiment_labels)
print(sentiment_counter)

Counter({'negative': 560, 'neutral': 216, 'positive': 165})


In [20]:
sentiment_anaylsis['positive_percentage'] = 100 * sentiment_counter['positive'] / sum(sentiment_counter.values())
sentiment_anaylsis['neutral_percentage'] = 100 * sentiment_counter['neutral'] / sum(sentiment_counter.values())
sentiment_anaylsis['negative_percentage'] = 100 * sentiment_counter['negative'] / sum(sentiment_counter.values())

In [21]:
print("The percentage of positive documents is: \t{:.2f}%".format(sentiment_anaylsis['positive_percentage']))
print("The percentage of negative documents is: \t{:.2f}%".format(sentiment_anaylsis['negative_percentage']))
print("The percentage of neutral documents is: \t{:.2f}%".format(sentiment_anaylsis['neutral_percentage']))

The percentage of positive documents is: 	17.53%
The percentage of negative documents is: 	59.51%
The percentage of neutral documents is: 	22.95%


In [25]:
pos_sentiment_scores = []
neg_sentiment_scores = []
neu_sentiment_scores = []
for analysis in watson_analyses:
    if analysis.result['sentiment']['document']['label'] == "positive":
        pos_sentiment_scores.append(analysis.result['sentiment']['document']['score'])
    elif analysis.result['sentiment']['document']['label'] == "negative":
        neg_sentiment_scores.append(analysis.result['sentiment']['document']['score'])
    elif analysis.result['sentiment']['document']['label'] == "neutral":
        neu_sentiment_scores.append(analysis.result['sentiment']['document']['score'])

In [26]:
sentiment_anaylsis['average_pos_score'] = sum(pos_sentiment_scores) / len(pos_sentiment_scores)
sentiment_anaylsis['std_dev_pos_score'] = statistics.stdev(pos_sentiment_scores)
sentiment_anaylsis['average_neg_score'] = sum(neg_sentiment_scores) / len(neg_sentiment_scores)
sentiment_anaylsis['std_dev_neg_score'] = statistics.stdev(neg_sentiment_scores)

In [27]:
print("The average of the positive sentiment scores is: {:.3f}\t(Std. Dev. = {:.3f})".format(
    sentiment_anaylsis['average_pos_score'], sentiment_anaylsis['std_dev_pos_score']))
print("The average of the negative sentiment scores is: {:.3f}\t(Std. Dev. = {:.3f})".format(
    sentiment_anaylsis['average_neg_score'], sentiment_anaylsis['std_dev_neg_score']))

The average of the positive sentiment scores is: 0.571	(Std. Dev. = 0.288)
The average of the negative sentiment scores is: -0.618	(Std. Dev. = 0.204)


### Emotion  Analysis

In [28]:
emotion_analysis = {}

In [29]:
# document level emotion scores
for emotion in ['sadness', 'joy', 'fear', 'disgust', 'anger']:
    emotion_scores = [analysis.result['emotion']['document']['emotion'][emotion] for analysis in watson_analyses]

    average_emotion_score = sum(emotion_scores) / len(emotion_scores)
    emotion_score_std_dev = statistics.stdev(emotion_scores)
    emotion_analysis[emotion] = {'average_score': average_emotion_score, 'score_std_dev': emotion_score_std_dev}

In [30]:
for emotion in emotion_analysis.keys():
    print("The average score for '{}' is: \t{:.3f}\t(Std. Dev. = {:.3f})".format(
        emotion, emotion_analysis[emotion]['average_score'], emotion_analysis[emotion]['score_std_dev']))

The average score for 'sadness' is: 	0.292	(Std. Dev. = 0.178)
The average score for 'joy' is: 	0.194	(Std. Dev. = 0.205)
The average score for 'fear' is: 	0.140	(Std. Dev. = 0.113)
The average score for 'disgust' is: 	0.240	(Std. Dev. = 0.189)
The average score for 'anger' is: 	0.225	(Std. Dev. = 0.159)


### Category Analysis

In [31]:
category_analysis = {}

In [32]:
analyses_categories = [analysis.result['categories'] for analysis in watson_analyses]
labels = lambda categories: [category['label'] for category in categories]
category_labels = []
[category_labels.extend(labels(categories)) for categories in analyses_categories]
categories_counter = Counter(category_labels)
category_analysis['count'] = sum(categories_counter.values())
category_analysis['categories'] = dict(categories_counter)

In [34]:
print("The total number of categories in the corpus is: \t{}\n".format(category_analysis['count']))
print("The top 50 most common categories as follows: \n")

for category, count in categories_counter.most_common(50):
    print("{}  {}".format(category, count))

The total number of categories in the corpus is: 	2766

The top 50 most common categories as follows: 

/law, govt and politics/government  275
/law, govt and politics/government/parliament  218
/travel/tourist destinations/australia and new zealand  91
/law, govt and politics/immigration  90
/news  86
/law, govt and politics/politics/elections  74
/art and entertainment/humor  65
/business and industrial  53
/art and entertainment/movies and tv/movies  50
/society/unrest and war  49
/style and fashion/clothing/shorts  47
/law, govt and politics/politics  46
/law, govt and politics  45
/family and parenting/children  43
/society/work/unions  40
/education/school  33
/food and drink  30
/law, govt and politics/politics/elections/presidential elections  28
/society/sex  27
/law, govt and politics/legal issues/legislation  24
/law, govt and politics/politics/political parties  22
/art and entertainment/music  21
/law, govt and politics/law enforcement/police  18
/travel/tourist destinatio

### Entity Analysis

In [35]:
entity_analysis = {}

In [52]:
analyses_entities = [analysis.result['entities'] for analysis in watson_analyses]
raw_entities = lambda entities: [entity for entity in entities]
entities = []
[entities.extend(raw_entities(analyses_entity)) for analyses_entity in analyses_entities]

unique_entity_names = set([entity['text'] for entity in entities])

entity_analysis['count'] = len(unique_entity_names)
entity_analysis['entities'] = {}

for entity_name in unique_entity_names:
    sentiments = []
    for entity in entities:
        if entity['text'] == entity_name:
            sentiments.append(entity['sentiment']['score'])
    entity_analysis['entities'][entity_name] = {}
    entity_analysis['entities'][entity_name]['frequency'] = len(sentiments)
    entity_analysis['entities'][entity_name]['average_sentiment'] = sum(sentiments) / len(sentiments)
    if (len(sentiments) > 1):
        entity_analysis['entities'][entity_name]['sentiment_std_dev'] = statistics.stdev(sentiments)
    else:
        entity_analysis['entities'][entity_name]['sentiment_std_dev'] = 0


In [53]:
print("The total number of entities in the corpus is: \t{}\n".format(entity_analysis['count']))

for entity_name, stats in entity_analysis['entities'].items():
    print("Entity: {}average sentiment: {:.3f}\t(Std. Dev. = {:.3f})".format(entity_name.ljust(30), stats['average_sentiment'], stats['sentiment_std_dev']))


The total number of entities in the corpus is: 	357

Entity: Greg Hunt                     average sentiment: 0.000	(Std. Dev. = 0.000)
Entity: Parliament                    average sentiment: -0.341	(Std. Dev. = 0.355)
Entity: Dutton Govt                   average sentiment: 0.000	(Std. Dev. = 0.000)
Entity: Henderson                     average sentiment: -0.373	(Std. Dev. = 0.040)
Entity: Deb Frecklington              average sentiment: -0.692	(Std. Dev. = 0.000)
Entity: South Australia               average sentiment: -0.319	(Std. Dev. = 0.000)
Entity: Clarke                        average sentiment: -0.630	(Std. Dev. = 0.000)
Entity: two weeks                     average sentiment: 0.000	(Std. Dev. = 0.000)
Entity: J. Edgar Tuber                average sentiment: -0.547	(Std. Dev. = 0.000)
Entity: paliament                     average sentiment: 0.000	(Std. Dev. = 0.000)
Entity: WW2                           average sentiment: 0.000	(Std. Dev. = 0.000)
Entity: Scumo               

### Keyword Analysis

In [56]:
keyword_analysis = {}

In [57]:
analyses_keywords = [analysis.result['keywords'] for analysis in watson_analyses]
raw_keywords = lambda keywords: [keyword for keyword in keywords]
keywords = []
[keywords.extend(raw_keywords(analyses_keyword)) for analyses_keyword in analyses_keywords]

unique_keyword_names = set([keyword['text'] for keyword in keywords])

keyword_analysis['count'] = len(unique_keyword_names)
keyword_analysis['keywords'] = {}

for keyword_name in unique_keyword_names:
    sentiments = []
    for keyword in keywords:
        if keyword['text'] == keyword_name:
            sentiments.append(keyword['sentiment']['score'])
    keyword_analysis['keywords'][keyword_name] = {}
    keyword_analysis['keywords'][keyword_name]['frequency'] = len(sentiments)
    keyword_analysis['keywords'][keyword_name]['average_sentiment'] = sum(sentiments) / len(sentiments)
    if (len(sentiments) > 1):
        keyword_analysis['keywords'][keyword_name]['sentiment_std_dev'] = statistics.stdev(sentiments)
    else:
        keyword_analysis['keywords'][keyword_name]['sentiment_std_dev'] = 0


In [59]:
print("The total number of entities in the corpus is: \t{}\n".format(entity_analysis['count']))

for keyword_name, stats in keyword_analysis['keywords'].items():
    print("Keyword: {}average sentiment: {:.3f}\t(Std. Dev. = {:.3f})".format(keyword_name.ljust(30), stats['average_sentiment'], stats['sentiment_std_dev']))


The total number of entities in the corpus is: 	357

Keyword: Greg Hunt                     average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: recent run                    average sentiment: -0.610	(Std. Dev. = 0.000)
Keyword: circus rolls                  average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: Liberal Party meeting         average sentiment: -0.459	(Std. Dev. = 0.000)
Keyword: Excellent.                    average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: pre-Bjelke Qld copper         average sentiment: 0.902	(Std. Dev. = 0.000)
Keyword: Liberal branches              average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: conservative politics         average sentiment: 0.681	(Std. Dev. = 0.000)
Keyword: LIBERALS                      average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: TheIPA⁩ now= coal             average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: J. Edgar Tuber                average sentiment: -0.547	(Std. Dev. = 0.000)
Keyword: Bishop hint

Keyword: promise                       average sentiment: -0.343	(Std. Dev. = 0.374)
Keyword: comments                      average sentiment: -0.514	(Std. Dev. = 0.000)
Keyword: transparent appeal            average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: chance                        average sentiment: -0.480	(Std. Dev. = 0.000)
Keyword: grab                          average sentiment: 0.685	(Std. Dev. = 0.000)
Keyword: Josh Frydenburg               average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: Pot                           average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: LinkedIn profile              average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: crime                         average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: Aus flag lapel                average sentiment: 0.000	(Std. Dev. = 0.000)
Keyword: deflects                      average sentiment: -0.321	(Std. Dev. = 0.000)
Keyword: blazer v blazer               average sentiment: 0.000	(Std. De

Write our analyses out to disk, so they can be used in the report

In [60]:
report_analyses = {
    'sentiment': sentiment_anaylsis,
    'emotion': emotion_analysis,
    'category': category_analysis,
    'keyword': keyword_analysis,
    'entity': entity_analysis,
}

In [62]:
report_analysis_filename = f"cached_report_analysis_{hashtag}.pkl"

with open(report_analysis_filename, 'wb') as f:
    pickle.dump(report_analyses, f)

END

[Return to Contents](#Topic-1:-Text-Analysis)