# SIT205 Thinking Systems and Cognition Science - Assignment 2

## Group: Philip Castiglione (217157862) and Warwick Smith (215239649)

## Topic 1: Text Analysis

## Report

This report satisfies the requirements for SIT205 Assignment 2, using the text analytics topic.

The code associated with this report is contained in a Jupyter Notebook named `SIT205 Assignment 2 - Project Code`. This report, the code, and associated data files can be found on GitHub at this link: 

https://github.com/PhilipCastiglione/SIT205_Watson

## Data

For this report we chose to analyse the recent events of Australian politics using text produced by members of the general public on Twitter.

Using the Twitter API, we collected 6,078 tweets containing the hashtag #libspill, posted by 2,534 unique Twitter user accounts. These tweets were dated between Saturday, 8th September and Monday, 17th September 2018.

The tweets were cleaned of excess whitespace, tokenized and stripped of URLs/links, @ mentions, and hashtags. Tweets were then filtered to exclude empty tweets, retweets and short tweets (fewer than 41 characters).

The reduced set of 941 documents was the final document corpus for textual analysis.
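The cleaning and filtering steps described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the regular expressions, the `RT @` retweet check, the helper names, and the choice to measure length after cleaning are all assumptions, and whitespace splitting stands in for tokenization.

```python
import re

# Assumed patterns for links, @ mentions and hashtags.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
MIN_LENGTH = 41  # tweets with fewer characters than this are excluded


def clean_tweet(text):
    """Strip links, @ mentions and hashtags, then collapse excess whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def is_retweet(text):
    # Assumption: retweets are recognised by the conventional "RT @user" prefix.
    return text.lstrip().startswith("RT @")


def build_corpus(raw_tweets):
    """Return cleaned tweets, dropping retweets, empty tweets and short tweets."""
    corpus = []
    for raw in raw_tweets:
        if is_retweet(raw):
            continue
        cleaned = clean_tweet(raw)
        # Empty tweets are removed by the same length check.
        if len(cleaned) >= MIN_LENGTH:
            corpus.append(cleaned)
    return corpus
```

Applying the length threshold after cleaning means a tweet consisting only of hashtags and links is treated as empty. Whether the original pipeline measured length before or after cleaning is not stated in this report.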

Due to the large number of categories, entities and keywords identified by the Watson NLU API for this theme, the body of this report displays an example snapshot of each result, while the corresponding Appendices contain an output of all results.

[Appendix 1 - Category Analysis - Full Output](#Appendix-1---Category-Analysis---Full-Output)  
[Appendix 2 - Entity Analysis - Full Output](#Appendix-2---Entity-Analysis---Full-Output)  
[Appendix 3 - Keyword Analysis - Full Output](#Appendix-3---Keyword-Analysis---Full-Output)

## Analysis

Tweets were analysed using the IBM Watson Natural Language Understanding (NLU) API.

In [1]:
import pickle

# Load the cached analysis results so the report can be rendered from the pickle file.
with open("cached_report_analysis_libspill.pkl", 'rb') as f:
    report_analyses = pickle.load(f)
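The cached results loaded above cover five kinds of analysis: sentiment, emotion, categories, entities and keywords. As a rough sketch, a request body asking Watson NLU for those features for a single document might be built as below. The feature names reflect our understanding of the NLU `analyze` request format. The helper name `build_nlu_request` is hypothetical, and the credentials, API version and whether the notebook used the REST API or the Python SDK are not shown here.

```python
import json


def build_nlu_request(text):
    """Build a JSON-serialisable body requesting the five analyses in this report."""
    return {
        "text": text,
        "features": {
            "sentiment": {},
            "emotion": {},
            "categories": {},
            # Per-entity and per-keyword sentiment is needed for the tables below.
            "entities": {"sentiment": True},
            "keywords": {"sentiment": True},
        },
    }


# Hypothetical example document, not taken from the corpus.
body = build_nlu_request("An example tweet about the leadership spill.")
print(json.dumps(body, indent=2))
```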

### Sentiment

The percentages of positive, neutral and negative documents in the corpus were as follows:

In [2]:
print("Positive percentage:\t{:.2f}%".format(report_analyses['sentiment']['positive_percentage']))
print("Neutral percentage:\t{:.2f}%".format(report_analyses['sentiment']['neutral_percentage']))
print("Negative percentage:\t{:.2f}%".format(report_analyses['sentiment']['negative_percentage']))
print("Ratio of negative to positive:\t{:.2f}".format(
    report_analyses['sentiment']['negative_percentage'] / report_analyses['sentiment']['positive_percentage'])
)

Positive percentage:	17.53%
Neutral percentage:	22.95%
Negative percentage:	59.51%
Ratio of negative to positive:	3.39


At 3.39, the ratio of negative to positive documents indicates that the #libspill tweets in our corpus were predominantly negative in sentiment.

The average and standard deviation of the positive and negative sentiment scores in the corpus were as follows:
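For illustration, the percentages and the ratio can be derived from per-document sentiment labels as sketched below. The function name and the label strings are assumptions. The notebook itself reads precomputed values from `report_analyses`.

```python
from collections import Counter


def sentiment_summary(labels):
    """Return the percentage of each label and the negative-to-positive ratio."""
    counts = Counter(labels)
    total = len(labels)
    percentages = {
        label: 100.0 * counts[label] / total
        for label in ("positive", "neutral", "negative")
    }
    # The ratio is undefined when there are no positive documents.
    ratio = (counts["negative"] / counts["positive"]) if counts["positive"] else None
    return percentages, ratio


# Check against the reported figures: 59.51 / 17.53 rounds to 3.39.
print("{:.2f}".format(59.51 / 17.53))
```

Because the two percentages share the same denominator, their ratio equals the ratio of the underlying document counts.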

In [3]:
print("Positive scores average:\t{:.3f}\t (Std. Dev. = {:.3f})".format(
    report_analyses['sentiment']['average_pos_score'], report_analyses['sentiment']['std_dev_pos_score'])
)
print("Negative scores average:\t{:.3f}\t (Std. Dev. = {:.3f})".format(
    report_analyses['sentiment']['average_neg_score'], report_analyses['sentiment']['std_dev_neg_score'])
)

Positive scores average:	0.571	 (Std. Dev. = 0.288)
Negative scores average:	-0.618	 (Std. Dev. = 0.204)
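A minimal sketch of how such figures could be computed from per-document sentiment scores is given below. It assumes that documents are split on the sign of their score and that the population standard deviation is used. Neither detail is confirmed by this report, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def split_by_sign(scores):
    """Separate scores into positive and negative lists, dropping exact zeros."""
    positive = [s for s in scores if s > 0]
    negative = [s for s in scores if s < 0]
    return positive, negative


def score_stats(scores):
    """Return (mean, population standard deviation) of a list of scores."""
    return mean(scores), pstdev(scores)
```

Reporting the two groups separately avoids positive and negative scores cancelling each other out in a single corpus-wide average.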


### Emotion

The average and standard deviation of each emotion type in the corpus were then identified:

In [4]:
for emotion in report_analyses['emotion'].keys():
    print("The average score for '{}' is: \t{:.3f}\t(Std. Dev. = {:.3f})".format(
        emotion, report_analyses['emotion'][emotion]['average_score'], report_analyses['emotion'][emotion]['score_std_dev']))

The average score for 'sadness' is: 	0.292	(Std. Dev. = 0.178)
The average score for 'joy' is: 	0.194	(Std. Dev. = 0.205)
The average score for 'fear' is: 	0.140	(Std. Dev. = 0.113)
The average score for 'disgust' is: 	0.240	(Std. Dev. = 0.189)
The average score for 'anger' is: 	0.225	(Std. Dev. = 0.159)


There is a spread of emotions present in the corpus; 'sadness' has the highest average score, followed by 'disgust'.
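The ordering can be checked by sorting the reported averages, as in this short sketch. The values are copied from the output above and the variable names are illustrative.

```python
# Average emotion scores copied from the output above.
emotion_averages = {
    "sadness": 0.292,
    "joy": 0.194,
    "fear": 0.140,
    "disgust": 0.240,
    "anger": 0.225,
}

# Rank emotions from highest to lowest average score.
ranked = sorted(emotion_averages.items(), key=lambda item: item[1], reverse=True)
for emotion, score in ranked:
    print("{:<8} {:.3f}".format(emotion, score))
```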

### Category

A total of 465 different categories were found in the 941-document corpus. The entire analysis result is shown in [Appendix 1](#Appendix-1---Category-Analysis---Full-Output). The top 50 categories by frequency were as follows:

In [5]:
print("The total number of categories in the corpus is: \t{}\n".format(report_analyses['category']['count']))
print("The frequencies of the top {} most common categories: \n".format(
    len(report_analyses['category']['categories_common'])))
for category, count in report_analyses['category']['categories_common']:
    print("{}  {}".format(category.ljust(95), count))

The total number of categories in the corpus is: 	465

The frequencies of the top 50 most common categories: 

/law, govt and politics/government                                                               275
/law, govt and politics/government/parliament                                                    218
/travel/tourist destinations/australia and new zealand                                           91
/law, govt and politics/immigration                                                              90
/news                                                                                            86
/law, govt and politics/politics/elections                                                       74
/art and entertainment/humor                                                                     65
/business and industrial                                                                         53
/art and entertainment/movies and tv/movies                                            

The results are heavily dominated by the root category 'law, govt and politics', as expected given the analysis theme.
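One way to quantify this observation is to sum the frequencies by root category, as sketched below. The function name is hypothetical and the sample covers only five rows copied from the output above, so the totals are illustrative rather than corpus-wide.

```python
from collections import Counter


def root_category_counts(category_counts):
    """Sum label frequencies by their top-level (root) category."""
    roots = Counter()
    for label, count in category_counts.items():
        # Labels look like "/root/child/grandchild"; keep the first segment.
        root = label.strip("/").split("/")[0]
        roots[root] += count
    return roots


# A few frequencies copied from the output above.
sample = {
    "/law, govt and politics/government": 275,
    "/law, govt and politics/government/parliament": 218,
    "/travel/tourist destinations/australia and new zealand": 91,
    "/law, govt and politics/immigration": 90,
    "/news": 86,
}
print(root_category_counts(sample).most_common())
```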

### Entity

A total of 358 unique entities were identified by Watson NLU in the corpus. The entire analysis result is shown in [Appendix 2](#Appendix-2---Entity-Analysis---Full-Output). An example set of 25 entities, with the average and standard deviation of their sentiment scores, is as follows:

In [9]:
print("The total number of unique entities in the corpus is: \t{}\n".format(report_analyses['entity']['count']))
print("An example set of 25 entities:\n")
for entity_name, stats in list(report_analyses['entity']['entities'].items())[:25]:
    print("Entity: {}average sentiment: {:6.3f}   (Std. Dev. = {:.3f})".format(
        entity_name.ljust(40), stats['average_sentiment'], stats['sentiment_std_dev']))

The total number of unique entities in the corpus is: 	358

An example set of 25 entities:

Entity: 5 year                                  average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Matthias corman                         average sentiment: -0.493   (Std. Dev. = 0.000)
Entity: one day                                 average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Duttons                                 average sentiment: -0.260   (Std. Dev. = 0.368)
Entity: China                                   average sentiment:  0.433   (Std. Dev. = 0.000)
Entity: 30 years                                average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Scomo                                   average sentiment:  0.675   (Std. Dev. = 0.000)
Entity: Treasurer                               average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Tony Wright                             average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Morrison Government                 

Again, the complete entity analysis is dominated by political personalities.
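Per-entity statistics of this kind can be produced by grouping every mention of an entity and summarising its sentiment scores. The sketch below is an assumption about the approach, not the notebook's code. The function name is hypothetical, while the result keys mirror those read from `report_analyses` above.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_entity_sentiment(mentions):
    """Group (entity, score) pairs by entity; return mean and population std dev."""
    grouped = defaultdict(list)
    for name, score in mentions:
        grouped[name].append(score)
    return {
        name: {
            "average_sentiment": mean(scores),
            "sentiment_std_dev": pstdev(scores),
        }
        for name, scores in grouped.items()
    }
```

Under this approach an entity mentioned only once has a standard deviation of exactly zero, which would be consistent with the many `Std. Dev. = 0.000` rows in the output, although the report does not confirm how the deviation was calculated.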

### Keyword

A total of 2,330 keywords were found in the corpus. The entire analysis result is shown in [Appendix 3](#Appendix-3---Keyword-Analysis---Full-Output). An example set of 25 keywords, with the average and standard deviation of their sentiment scores, is as follows:

In [14]:
print("The total number of keywords in the corpus is: \t{}\n".format(report_analyses['keyword']['count']))
print("An example set of 25 keywords:\n")
for keyword_name, stats in list(report_analyses['keyword']['keywords'].items())[:25]:
    print("Keyword: {}Average sentiment: {:6.3f}   (Std. Dev. = {:.3f})".format(
        keyword_name.ljust(30), stats['average_sentiment'], stats['sentiment_std_dev']))


The total number of keywords in the corpus is: 	2330

An example set of 25 keywords:

Keyword: tremendous rancour            Average sentiment:  0.360   (Std. Dev. = 0.000)
Keyword: blind eye                     Average sentiment: -0.689   (Std. Dev. = 0.000)
Keyword: bitterness                    Average sentiment: -0.661   (Std. Dev. = 0.000)
Keyword: gender                        Average sentiment: -0.771   (Std. Dev. = 0.000)
Keyword: muppet                        Average sentiment: -0.581   (Std. Dev. = 0.000)
Keyword: big companies                 Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: solidarity                    Average sentiment:  0.284   (Std. Dev. = 0.000)
Keyword: Duttons                       Average sentiment: -0.521   (Std. Dev. = 0.000)
Keyword: ✋✋                            Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: Nett_News                     Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: Voters                        Avera

[Return to Top](#SIT205-Thinking-Systems-and-Cognition-Science---Assignment-2)

## Appendix 1 - Category Analysis - Full Output

A total of 465 different categories were found in the 941-document corpus. The frequencies of all categories were as follows:

In [15]:
print("The total number of categories in the corpus is: \t{}\n".format(report_analyses['category']['count']))
for category, count in report_analyses['category']['categories'].items():
    print("{}  {}".format(category.ljust(95), count))

The total number of categories in the corpus is: 	465

/law, govt and politics/government/parliament                                                    218
/health and fitness/disorders/mental disorder/panic and anxiety                                  5
/style and fashion/beauty/cosmetics/eyeshadow                                                    3
/travel/tourist destinations/australia and new zealand                                           91
/travel/tourist destinations                                                                     1
/law, govt and politics/legal issues/civil law/copyright                                         5
/law, govt and politics/politics                                                                 46
/art and entertainment/music                                                                     21
/society/unrest and war                                                                          49
/law, govt and politics/government              

[Return to Top](#SIT205-Thinking-Systems-and-Cognition-Science---Assignment-2)

## Appendix 2 - Entity Analysis - Full Output

A total of 358 unique entities were identified by Watson NLU in the corpus. The entities, with the average and standard deviation of their sentiment scores, are as follows:

In [16]:
print("The total number of unique entities in the corpus is: \t{}\n".format(report_analyses['entity']['count']))
for entity_name, stats in report_analyses['entity']['entities'].items():
    print("Entity: {}average sentiment: {:6.3f}   (Std. Dev. = {:.3f})".format(
        entity_name.ljust(30), stats['average_sentiment'], stats['sentiment_std_dev']))

The total number of unique entities in the corpus is: 	358

Entity: 5 year                        average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Matthias corman               average sentiment: -0.493   (Std. Dev. = 0.000)
Entity: one day                       average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Duttons                       average sentiment: -0.260   (Std. Dev. = 0.368)
Entity: China                         average sentiment:  0.433   (Std. Dev. = 0.000)
Entity: 30 years                      average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Scomo                         average sentiment:  0.675   (Std. Dev. = 0.000)
Entity: Treasurer                     average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Tony Wright                   average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: Morrison Government           average sentiment:  0.000   (Std. Dev. = 0.000)
Entity: 6 weeks                       average sentiment:  0.000   (Std. Dev. = 0

[Return to Top](#SIT205-Thinking-Systems-and-Cognition-Science---Assignment-2)

## Appendix 3 - Keyword Analysis - Full Output

A total of 2,330 keywords were found in the corpus. Each keyword's sentiment score average and standard deviation are as follows:

In [17]:
print("The total number of keywords in the corpus is: \t{}\n".format(report_analyses['keyword']['count']))
for keyword_name, stats in report_analyses['keyword']['keywords'].items():
    print("Keyword: {}Average sentiment: {:6.3f}   (Std. Dev. = {:.3f})".format(
        keyword_name.ljust(30), stats['average_sentiment'], stats['sentiment_std_dev']))


The total number of keywords in the corpus is: 	2330

Keyword: tremendous rancour            Average sentiment:  0.360   (Std. Dev. = 0.000)
Keyword: blind eye                     Average sentiment: -0.689   (Std. Dev. = 0.000)
Keyword: bitterness                    Average sentiment: -0.661   (Std. Dev. = 0.000)
Keyword: gender                        Average sentiment: -0.771   (Std. Dev. = 0.000)
Keyword: muppet                        Average sentiment: -0.581   (Std. Dev. = 0.000)
Keyword: big companies                 Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: solidarity                    Average sentiment:  0.284   (Std. Dev. = 0.000)
Keyword: Duttons                       Average sentiment: -0.521   (Std. Dev. = 0.000)
Keyword: ✋✋                            Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: Nett_News                     Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: Voters                        Average sentiment: -0.542   (Std. Dev

Keyword: Disunited mob                 Average sentiment: -0.488   (Std. Dev. = 0.000)
Keyword: marshmallow                   Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: f*cking Prime Minister        Average sentiment:  0.460   (Std. Dev. = 0.135)
Keyword: disgraceful,disgusting state  Average sentiment: -0.300   (Std. Dev. = 0.000)
Keyword: leader                        Average sentiment: -0.158   (Std. Dev. = 0.302)
Keyword: Labspill Series               Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: prime minister                Average sentiment:  0.512   (Std. Dev. = 0.000)
Keyword: work                          Average sentiment: -0.477   (Std. Dev. = 0.624)
Keyword: statement                     Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: god damn country              Average sentiment: -0.571   (Std. Dev. = 0.000)
Keyword: start                         Average sentiment:  0.000   (Std. Dev. = 0.000)
Keyword: Bullied Into Saying           Aver

[Return to Top](#SIT205-Thinking-Systems-and-Cognition-Science---Assignment-2)

END