# SIT205 Thinking Systems and Cognition Science - Assignment 2

## Group: Philip Castiglione (217157862) and Warwick Smith (215239649)

## Topic 1: Text Analysis

## Project Code

Introductory comments regarding theme of analysis, data source used (e.g. Twitter), etc.

Note: we cache results in files on local disk to reduce API calls and to let us run different parts of this notebook independently.
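
A minimal sketch of the caching idea, assuming a hypothetical helper named `load_or_compute` (it is not part of the notebook, which reads and writes its pickle files directly in separate cells). The expensive call only runs when no cache file exists yet.

```python
import os
import pickle
import tempfile


def load_or_compute(filename, compute):
    """Return the cached result in `filename`, computing and caching it if absent."""
    if os.path.exists(filename):
        with open(filename, 'rb') as f:
            return pickle.load(f)
    result = compute()
    with open(filename, 'wb') as f:
        pickle.dump(result, f)
    return result


# Stand-in for an expensive API call; records how often it is invoked
calls = []

def expensive_call():
    calls.append(1)
    return ['tweet one', 'tweet two']


cache_path = os.path.join(tempfile.mkdtemp(), 'cached_example.pkl')
first = load_or_compute(cache_path, expensive_call)   # computes and writes the cache
second = load_or_compute(cache_path, expensive_call)  # served from disk
```

Deleting the cache file forces a fresh fetch, which is how a stale collection would be refreshed under this scheme.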

Note for report: we investigated #libspill, #auspol and #MalcolmTurnbull. Of these, #libspill produced the most interesting results for analysis.

## TODO

- cleanup
- run with a big target
- markdown throughout this file
- collect #libspill tweets from around 21 August, which should be more dramatic
- final report

## Part 1 - Document Collection

In [6]:
# Standard library
import os
import pickle
import json
from collections import Counter
import statistics

# Additional libraries
import twitter                    # python-twitter API
from watson_developer_cloud import NaturalLanguageUnderstandingV1
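from watson_developer_cloud.natural_language_understanding_v1 import (
    Features, EntitiesOptions, SentimentOptions, CategoriesOptions,
    KeywordsOptions, EmotionOptions)  # explicit names instead of a wildcard import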
from dotenv import load_dotenv    # for management of twitter credentials

We use the `dotenv` library to load key-value pairs from a `.env` file into the OS environment.

In [7]:
load_dotenv()

True

Source our API authentication credentials from the environment, so that they never appear in the notebook itself.

In [133]:
TWITTER_CONSUMER_KEY = os.getenv("TWITTER_CONSUMER_KEY")
TWITTER_CONSUMER_SECRET = os.getenv("TWITTER_CONSUMER_SECRET")
TWITTER_ACCESS_TOKEN_KEY = os.getenv("TWITTER_ACCESS_TOKEN_KEY")
TWITTER_ACCESS_TOKEN_SECRET = os.getenv("TWITTER_ACCESS_TOKEN_SECRET")
WATSON_NLU_API_KEY = os.getenv("WATSON_NLU_API_KEY")

# WARNING: do not commit to git with any of these values printed to a cell's output
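
Because `os.getenv` returns `None` for an unset variable, a missing credential only surfaces later as an authentication error. The sketch below shows one way to check up front. The helper `missing_variables` and the stand-in environment are hypothetical, introduced only for illustration. It reports names only, never values, in keeping with the warning above.

```python
import os

REQUIRED_VARIABLES = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN_KEY",
    "TWITTER_ACCESS_TOKEN_SECRET",
    "WATSON_NLU_API_KEY",
]


def missing_variables(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


# Stand-in environment (hypothetical values) so the example does not depend on a real .env file
example_env = {"TWITTER_CONSUMER_KEY": "abc", "WATSON_NLU_API_KEY": ""}
missing = missing_variables(REQUIRED_VARIABLES, example_env)
```

In the notebook itself the call would be `missing_variables(REQUIRED_VARIABLES)` after `load_dotenv()`, with an error raised if the list is not empty.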

Create python-twitter API instance

In [9]:
api = twitter.Api(consumer_key=TWITTER_CONSUMER_KEY,
                  consumer_secret=TWITTER_CONSUMER_SECRET,
                  access_token_key=TWITTER_ACCESS_TOKEN_KEY,
                  access_token_secret=TWITTER_ACCESS_TOKEN_SECRET,
                  tweet_mode='extended')

The Twitter hashtag to search for when collecting documents for analysis.

In [171]:
hashtag = "libspill"

In [52]:
# Function for tweet collection
def collect_tweets(api, hashtag, batch_max, total_count):
    """
    Collect tweets using the python-twitter GetSearch API.

    api:         Twitter API instance
    hashtag:     search hashtag
    batch_max:   maximum number of tweets to collect per request
    total_count: target number of tweets; collection stops once at least
                 this many have been gathered, or no older tweets remain
    """

    tweets = []          # the collection of tweets to be returned
    max_tweet_id = None  # no upper bound on tweet ID for the first batch

    while len(tweets) < total_count:
        # tweet_mode='extended' is already set on the api instance
        results = api.GetSearch(term=hashtag, result_type="recent", lang="en",
                                count=batch_max, return_json=True, max_id=max_tweet_id)
        batch = results['statuses']
        if not batch:
            # no older tweets are available, so stop rather than loop forever
            break
        tweets += batch

        # Continue from one below the lowest ID seen so far, to avoid
        # duplicating tweets at the start/end of batches
        max_tweet_id = min(tweet['id'] for tweet in tweets) - 1

        print("{} tweets collected for hashtag {}. Most recent tweeted at {}".format(
            len(tweets), hashtag, tweets[-1]['created_at']))

    return tweets
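
The `max_id` paging used above can be tried without calling Twitter. This is a sketch under stated assumptions: `fake_search` is a hypothetical stand-in for `api.GetSearch` that serves a fixed list of made-up IDs, and `page_through` is an illustrative name for the paging loop only.

```python
def fake_search(max_id=None, count=3):
    """Stand-in for api.GetSearch: up to `count` statuses, newest first, with id <= max_id."""
    all_ids = [110, 109, 107, 104, 103, 101]  # made-up tweet IDs
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return {'statuses': [{'id': i} for i in eligible[:count]]}


def page_through(search, total_count):
    """Request batches until total_count is reached or the search runs dry."""
    tweets = []
    max_id = None
    while len(tweets) < total_count:
        batch = search(max_id=max_id)['statuses']
        if not batch:
            break  # nothing older is available
        tweets += batch
        # one below the lowest ID seen, so the next batch cannot repeat a tweet
        max_id = min(tweet['id'] for tweet in batch) - 1
    return tweets


collected_ids = [tweet['id'] for tweet in page_through(fake_search, 5)]
exhausted_ids = [tweet['id'] for tweet in page_through(fake_search, 10)]
```

Asking for 5 tweets yields 6 here, because whole batches are appended. Asking for 10 also yields 6, because the loop stops once the stand-in search has nothing older to return.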

In [53]:
tweet_collection = collect_tweets(api, hashtag, 100, 100)

100 tweets collected for hashtag libspill. Most recent tweeted at Fri Sep 14 17:57:21 +0000 2018


In [54]:
tweets_filename = f"cached_tweets_{hashtag}.pkl"
with open(tweets_filename, 'wb') as f:
    pickle.dump(tweet_collection, f)

## Part 2 - Document Cleaning

In [None]:
tweet_collection = None
with open(tweets_filename, 'rb') as f:
    tweet_collection = pickle.load(f)

We apply a pipeline of transformations to the tweet collection to extract documents with the format and content we want.

In [57]:
def extract_documents(tweets):
    extract_text = lambda tweets: [tweet['full_text'] for tweet in tweets]
    convert_whitespace_chars = lambda tweets: [tweet.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ') for tweet in tweets]
    squash_whitespace = lambda tweets: [tweet.replace('  ', ' ') for tweet in tweets]
    tokenize = lambda tweets: [tweet.strip().split() for tweet in tweets]
    strip_links = lambda tweets: [[token for token in tokens if "http" not in token] for tokens in tweets]
    # compare characters with != rather than `is not`, which tests object identity
    strip_mentions = lambda tweets: [[token for token in tokens if token[0] != '@'] for tokens in tweets]
    strip_hashtags = lambda tweets: [[token for token in tokens if token[0] != '#'] for tokens in tweets]
    filter_empty = lambda tweets: [tweet for tweet in tweets if len(tweet) > 0]
    filter_retweets = lambda tweets: [tweet for tweet in tweets if tweet[0] != 'RT']
    rejoin = lambda tweets: [' '.join(tokens) for tokens in tweets]

    documents = tweets
    
    for transformation in [
        extract_text,
        convert_whitespace_chars,
        squash_whitespace,
        tokenize,
        strip_links,
        strip_mentions,
        strip_hashtags,
        filter_empty,
        filter_retweets,
        filter_empty,
        rejoin,
    ]:
        documents = transformation(documents)

    return documents
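
To make the effect of the pipeline concrete, here is a compact single-tweet sketch of the same rules. The function name `clean_tweet` and the sample texts are hypothetical, and the sketch returns `None` where the pipeline would drop the tweet.

```python
def clean_tweet(text):
    """Apply the cleaning rules to one tweet; return None if the tweet is discarded."""
    tokens = text.split()  # splits on any whitespace run, including newlines and tabs
    kept = [token for token in tokens
            if 'http' not in token and not token.startswith(('@', '#'))]
    if not kept or kept[0] == 'RT':
        return None  # nothing left after stripping, or a retweet
    return ' '.join(kept)


# Made-up sample tweets, not taken from the collected data
original = clean_tweet("Another week,\nanother #libspill rumour @example_user https://t.co/abc123")
retweet = clean_tweet("RT @example_user: so it begins #libspill")
only_tags = clean_tweet("#libspill #auspol")
```

Here `original` becomes `'Another week, another rumour'`, while the retweet and the hashtag-only tweet are both discarded.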

In [65]:
documents = extract_documents(tweet_collection)

## Part 3 - Watson NLU

Create an NLU instance using the IBM Watson SDK.

In [66]:
url = "https://gateway-syd.watsonplatform.net/natural-language-understanding/api"
version = "2018-03-19"

natural_language_understanding = NaturalLanguageUnderstandingV1(
    version=version,
    iam_apikey=WATSON_NLU_API_KEY,
    url=url
)

We retrieve an analysis for each of our documents using the NLU instance.

In [68]:
def get_analyses(documents):
    entities = EntitiesOptions(sentiment=True, emotion=True, limit=5)
    sentiment = SentimentOptions()
    categories = CategoriesOptions()
    keywords = KeywordsOptions(sentiment=True, emotion=True, limit=5)
    emotion = EmotionOptions()
    
    def analyze(document):
        return natural_language_understanding.analyze(
            text=document,
            features=Features(
                entities=entities,
                sentiment=sentiment,
                categories=categories,
                keywords=keywords,
                emotion=emotion,
            )
        )
    
    return [analyze(document) for document in documents]
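
As written, a single failed request aborts the whole list comprehension and discards the analyses already retrieved. The sketch below shows one possible way to collect failures instead. It assumes a hypothetical helper `analyze_all` and uses a fake analyser in place of the Watson call. Whether Watson rejects very short documents is an assumption here, not something the notebook demonstrates.

```python
def analyze_all(documents, analyze):
    """Apply `analyze` to each document, recording failures instead of raising."""
    results, failures = [], []
    for index, document in enumerate(documents):
        try:
            results.append(analyze(document))
        except Exception as error:  # in practice, catch the SDK's own exception type
            failures.append((index, str(error)))
    return results, failures


def fake_analyze(document):
    """Stand-in for the Watson call: rejects one-word documents (assumed behaviour)."""
    if len(document.split()) < 2:
        raise ValueError("not enough text")
    return {'text': document}


results, failures = analyze_all(["a longer example document", "short"], fake_analyze)
```

The `failures` list keeps the index of each rejected document, so the affected tweets can be inspected or retried later.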

In [69]:
watson_analyses = get_analyses(documents)

Write out analyses to local disk, so we can perform analysis without refetching each time.

In [71]:
watson_analysis_filename = f"cached_watson_analysis_{hashtag}.pkl"

with open(watson_analysis_filename, 'wb') as f:
    pickle.dump(watson_analyses, f)

## Part 4 - Analysis of Watson Output

Load in our Watson analyses from disk, build five analysis objects with the information we need for our report, and write those objects out to disk.

In [72]:
watson_analyses = None
with open(watson_analysis_filename, 'rb') as f:
    watson_analyses = pickle.load(f)

### Sentiment Analysis

In [98]:
sentiment_anaylsis = {}

In [99]:
sentiment_labels = [analysis.result['sentiment']['document']['label'] for analysis in watson_analyses]
sentiment_counter = Counter(sentiment_labels)

In [100]:
sentiment_anaylsis['positive_percentage'] = 100 * sentiment_counter['positive'] / sum(sentiment_counter.values())
sentiment_anaylsis['neutral_percentage'] = 100 * sentiment_counter['neutral'] / sum(sentiment_counter.values())
sentiment_anaylsis['negative_percentage'] = 100 * sentiment_counter['negative'] / sum(sentiment_counter.values())

In [101]:
sentiment_scores = [analysis.result['sentiment']['document']['score'] for analysis in watson_analyses]

In [102]:
sentiment_anaylsis['average_score'] = sum(sentiment_scores) / len(sentiment_scores)
sentiment_anaylsis['score_std_dev'] = statistics.stdev(sentiment_scores)

In [103]:
sentiment_anaylsis

{'positive_percentage': 8.333333333333334,
 'neutral_percentage': 19.444444444444443,
 'negative_percentage': 72.22222222222223,
 'average_score': -0.402114913888889,
 'score_std_dev': 0.4068022533357083}
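
The sentiment summary can also be expressed as one function, which makes the arithmetic easy to check on a small input. This is a sketch only: `summarise_sentiment` is a hypothetical name and the labels and scores below are toy values, not our results. Note that `statistics.stdev` is the sample standard deviation (it divides by n - 1).

```python
from collections import Counter
import statistics


def summarise_sentiment(labels, scores):
    """Percentage of each sentiment label, plus mean and sample std dev of the scores."""
    counter = Counter(labels)
    total = sum(counter.values())
    summary = {f"{label}_percentage": 100 * counter[label] / total
               for label in ('positive', 'neutral', 'negative')}
    summary['average_score'] = statistics.mean(scores)
    summary['score_std_dev'] = statistics.stdev(scores)
    return summary


# Toy input: four documents with made-up labels and scores
toy_labels = ['negative', 'negative', 'neutral', 'positive']
toy_scores = [-0.8, -0.4, 0.0, 0.4]
toy_summary = summarise_sentiment(toy_labels, toy_scores)
```

For the toy input, half of the documents are negative and a quarter each are neutral and positive, with an average score of about -0.2.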

### Emotion Analysis

In [105]:
emotion_analysis = {}

In [106]:
# document level emotion scores
for emotion in ['sadness', 'joy', 'fear', 'disgust', 'anger']:
    emotion_scores = [analysis.result['emotion']['document']['emotion'][emotion] for analysis in watson_analyses]

    average_emotion_score = sum(emotion_scores) / len(emotion_scores)
    emotion_score_std_dev = statistics.stdev(emotion_scores)
    emotion_analysis[emotion] = {'average_score': average_emotion_score, 'score_std_dev': emotion_score_std_dev}

In [107]:
emotion_analysis

{'sadness': {'average_score': 0.2116943611111111,
  'score_std_dev': 0.14032413936681493},
 'joy': {'average_score': 0.20690461111111114,
  'score_std_dev': 0.22862744060080148},
 'fear': {'average_score': 0.12323252777777778,
  'score_std_dev': 0.06747994023051215},
 'disgust': {'average_score': 0.22668377777777776,
  'score_std_dev': 0.22556472631307417},
 'anger': {'average_score': 0.23425372222222224,
  'score_std_dev': 0.19718658102063222}}
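
For the report it helps to rank the emotions by average score. A short sketch using the values printed above (copied here so the block runs on its own; `ranked_names` is an illustrative name):

```python
# Values copied from the emotion_analysis output above
emotion_analysis = {
    'sadness': {'average_score': 0.2116943611111111, 'score_std_dev': 0.14032413936681493},
    'joy': {'average_score': 0.20690461111111114, 'score_std_dev': 0.22862744060080148},
    'fear': {'average_score': 0.12323252777777778, 'score_std_dev': 0.06747994023051215},
    'disgust': {'average_score': 0.22668377777777776, 'score_std_dev': 0.22556472631307417},
    'anger': {'average_score': 0.23425372222222224, 'score_std_dev': 0.19718658102063222},
}

# Sort emotion names from highest to lowest average score
ranked = sorted(emotion_analysis.items(),
                key=lambda item: item[1]['average_score'],
                reverse=True)
ranked_names = [name for name, _ in ranked]
# ranked_names == ['anger', 'disgust', 'sadness', 'joy', 'fear']
```

Anger and disgust lead, but the top four averages sit within about 0.03 of each other and the standard deviations are large by comparison, so the ordering should be read with caution.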

### Category Analysis

In [129]:
category_analysis = {}

In [130]:
analyses_categories = [analysis.result['categories'] for analysis in watson_analyses]
# flatten the per-document category lists into a single list of labels
category_labels = [category['label'] for categories in analyses_categories for category in categories]
categories_counter = Counter(category_labels)
category_analysis['count'] = sum(categories_counter.values())
category_analysis['categories'] = dict(categories_counter)
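
If the category labels are slash-separated paths (an assumption about the Watson output, and the labels below are made up for illustration), the counts can be rolled up to their top-level category for a coarser view. `top_level` is a hypothetical helper.

```python
from collections import Counter


def top_level(label):
    """Return the first segment of a slash-separated category path."""
    return label.strip('/').split('/')[0]


# Made-up counts in the same shape as category_analysis['categories']
example_counts = {
    '/law, govt and politics/politics': 5,
    '/law, govt and politics/government': 2,
    '/news': 1,
}

rolled_up = Counter()
for label, count in example_counts.items():
    rolled_up[top_level(label)] += count

most_common_category = rolled_up.most_common(1)
```

`Counter.most_common` then gives the dominant top-level categories directly.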

### Entity Analysis

In [137]:
entity_analysis = {}

In [156]:
analyses_entities = [analysis.result['entities'] for analysis in watson_analyses]
# flatten the per-document entity lists into a single list
entities = [entity for analyses_entity in analyses_entities for entity in analyses_entity]

unique_entity_names = set([entity['text'] for entity in entities])

entity_analysis['count'] = len(unique_entity_names)
entity_analysis['entities'] = {}

for entity_name in unique_entity_names:
    sentiments = []
    for entity in entities:
        if entity['text'] == entity_name:
            sentiments.append(entity['sentiment']['score'])
    entity_analysis['entities'][entity_name] = {}
    entity_analysis['entities'][entity_name]['frequency'] = len(sentiments)
    entity_analysis['entities'][entity_name]['average_sentiment'] = sum(sentiments) / len(sentiments)
    if (len(sentiments) > 1):
        entity_analysis['entities'][entity_name]['sentiment_std_dev'] = statistics.stdev(sentiments)
    else:
        entity_analysis['entities'][entity_name]['sentiment_std_dev'] = 0
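
The loop above rescans the full list once per unique name. An equivalent single-pass grouping is sketched below, assuming a hypothetical helper `summarise_by_text` and made-up entities. The same helper would apply unchanged to keywords, since both carry `text` and `sentiment` fields.

```python
from collections import defaultdict
import statistics


def summarise_by_text(items):
    """Group sentiment scores by item text, then summarise each group."""
    scores_by_text = defaultdict(list)
    for item in items:
        scores_by_text[item['text']].append(item['sentiment']['score'])
    return {
        text: {
            'frequency': len(scores),
            'average_sentiment': sum(scores) / len(scores),
            # sample std dev needs at least two values; use 0 otherwise, as above
            'sentiment_std_dev': statistics.stdev(scores) if len(scores) > 1 else 0,
        }
        for text, scores in scores_by_text.items()
    }


# Made-up entities in the shape returned by Watson
example_entities = [
    {'text': 'Example Person', 'sentiment': {'score': -0.5}},
    {'text': 'Example Person', 'sentiment': {'score': -0.1}},
    {'text': 'Example Party', 'sentiment': {'score': 0.2}},
]
entity_summary = summarise_by_text(example_entities)
```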


### Keyword Analysis

In [159]:
keyword_analysis = {}

In [161]:
analyses_keywords = [analysis.result['keywords'] for analysis in watson_analyses]
# flatten the per-document keyword lists into a single list
keywords = [keyword for analyses_keyword in analyses_keywords for keyword in analyses_keyword]

unique_keyword_names = set([keyword['text'] for keyword in keywords])

keyword_analysis['count'] = len(unique_keyword_names)
keyword_analysis['keywords'] = {}

for keyword_name in unique_keyword_names:
    sentiments = []
    for keyword in keywords:
        if keyword['text'] == keyword_name:
            sentiments.append(keyword['sentiment']['score'])
    keyword_analysis['keywords'][keyword_name] = {}
    keyword_analysis['keywords'][keyword_name]['frequency'] = len(sentiments)
    keyword_analysis['keywords'][keyword_name]['average_sentiment'] = sum(sentiments) / len(sentiments)
    if (len(sentiments) > 1):
        keyword_analysis['keywords'][keyword_name]['sentiment_std_dev'] = statistics.stdev(sentiments)
    else:
        keyword_analysis['keywords'][keyword_name]['sentiment_std_dev'] = 0


Write our analyses out to disk, so they can be used in the report.

In [167]:
report_analyses = {
    'sentiment': sentiment_anaylsis,
    'emotion': emotion_analysis,
    'category': category_analysis,
    'keyword': keyword_analysis,
    'entity': entity_analysis,
}

In [168]:
report_analysis_filename = f"cached_report_analysis_{hashtag}.pkl"

with open(report_analysis_filename, 'wb') as f:
    pickle.dump(report_analyses, f)
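
When writing the report, the cached object can be read back and, if a human-readable copy is wanted, dumped as JSON. A sketch with a small placeholder dictionary (the values are illustrative, not our results) and a temporary file path in place of `report_analysis_filename`:

```python
import json
import os
import pickle
import tempfile

# Placeholder with the same top-level shape as report_analyses (values are made up)
example_report = {
    'sentiment': {'average_score': -0.4},
    'entity': {'count': 0, 'entities': {}},
}

path = os.path.join(tempfile.mkdtemp(), 'cached_report_analysis_example.pkl')
with open(path, 'wb') as f:
    pickle.dump(example_report, f)

# Read the report back; only unpickle files from a trusted source
with open(path, 'rb') as f:
    loaded_report = pickle.load(f)

# Plain dicts, lists, strings and numbers serialise directly to JSON
report_json = json.dumps(loaded_report, indent=2, sort_keys=True)
```

JSON is easier to inspect and diff than a pickle, and it works here because the analysis objects contain only dictionaries, strings and numbers.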