# SLO Twitter Data Analysis - Table of Contents

## Joseph Jinn and Keith VanderLinden

<span style="font-family:Papyrus; font-size:1.25em;">
    
This collection of Jupyter Notebook files provides an analysis of Twitter data obtained by CSIRO Data61, covering the period from 2012 through 2018.  The raw Tweet data was extracted using the Twitter API.  The sections below provide short summaries and hyperlinks to individual Jupyter Notebook files that give further details on our analysis of different attributes, and combinations of attributes, in our Twitter dataset.<br>

</span>

## Twitter API Introduction:

<span style="font-family:Papyrus; font-size:1.25em;">
    
This section provides an introduction to the various attributes present in the raw JSON data structure.<br>

Currently, we have extracted the fields in the lists below and derived additional fields from them.  The naming of each field indicates its origin:<br>

* Names containing "derived" (for example, "text_derived") are fields we have derived from the native fields present in the JSON structure.<br>

* Names with the "tweet" prefix are renamed attributes present in the main Tweet object.<br>

* Names with the "user" prefix are nested attributes extracted from the "user" object in the main Tweet object.<br>

* Names with the "entities" prefix (written as "tweet_entities" in the lists below) are nested attributes extracted from the "entities" object in the main Tweet object.<br>

* Names with the "retweeted_status" prefix are nested attributes extracted from the "retweeted_status" object in the main Tweet object.<br>

* Names with the "retweeted_status_user" prefix are nested attributes extracted from the "user" object inside the "retweeted_status" sub-object of the main Tweet object.<br>

</span>

In [None]:
    # Original Tweet object attribute names present in raw JSON file.
    original_tweet_object_field_names = [
        'created_at', 'id', 'full_text', 'in_reply_to_status_id', 'in_reply_to_user_id',
        'in_reply_to_screen_name', 'retweet_count', 'favorite_count', 'lang']

    # Names to rename main Tweet object attributes.
    tweet_object_fields = [
        'tweet_created_at', 'tweet_id', 'tweet_full_text', 'tweet_in_reply_to_status_id',
        'tweet_in_reply_to_user_id', 'tweet_in_reply_to_screen_name', 'tweet_retweet_count',
        'tweet_favorite_count', 'tweet_lang']

    # Names to give "user" object attributes.
    user_object_fields = [
        'user_id', 'user_name', 'user_screen_name', 'user_location', 'user_description',
        'user_followers_count', 'user_friends_count', 'user_listed_count', 'user_favourites_count',
        'user_statuses_count', 'user_created_at', 'user_time_zone', 'user_lang']

    # Names to give "entities" object attributes.
    entities_object_fields = [
        "tweet_entities_expanded_urls", "tweet_entities_hashtags", "tweet_entities_user_mentions_id",
        "tweet_entities_user_mentions_name", "tweet_entities_user_mentions_screen_name",
        "tweet_entities_symbols"]

    # Names to give "retweeted_status" object attributes.
    retweeted_status_object_fields = [
        'retweeted_status_created_at', 'retweeted_status_id', 'retweeted_status_full_text',
        'retweeted_status_in_reply_to_status_id', 'retweeted_status_in_reply_to_user_id',
        'retweeted_status_in_reply_to_screen_name', 'retweeted_status_retweet_count',
        'retweeted_status_favorite_count', 'retweeted_status_lang',
        'retweeted_status_entities',
        'retweeted_status_user', 'retweeted_status_coordinates', 'retweeted_status_place']

    # Names to give "user" object attributes nested in the "retweeted_status" object.
    retweeted_status_user_object_fields = [
        'retweeted_status_user_id', 'retweeted_status_user_name', 'retweeted_status_user_screen_name',
        'retweeted_status_user_location', 'retweeted_status_user_description', 'retweeted_status_user_followers_count',
        'retweeted_status_user_friends_count', 'retweeted_status_user_listed_count',
        'retweeted_status_user_favourites_count', 'retweeted_status_user_statuses_count',
        'retweeted_status_user_created_at', 'retweeted_status_user_time_zone', 'retweeted_status_user_lang']

    # Modify these to determine what to export to CSV.
    required_fields = \
        ['retweeted_derived', 'company_derived', 'text_derived',  # "tweet_quoted_status_id",
         'tweet_url_link_derived', 'multiple_companies_derived_count', "company_derived_designation",
         'tweet_text_length_derived', "spaCy_language_detect",
         "user_description_text_length"] + tweet_object_fields + user_object_fields + entities_object_fields + \
        retweeted_status_object_fields

[Twitter API Introduction](slo-twitter-data-analysis-tweet-api-intro.ipynb#bookmark)
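
The prefix convention described above can be illustrated with a minimal sketch.  This is not the code from our dataset processor; the helper name `flatten_tweet`, the small subset of attributes, and the sample values are all hypothetical and chosen only to show how nested objects map onto prefixed flat field names.

```python
# Minimal sketch: flatten one raw Tweet dict into prefixed flat fields.
# "flatten_tweet" and the sample values are hypothetical, for illustration only.

def flatten_tweet(tweet):
    flat = {}
    # Attributes of the main Tweet object get the "tweet_" prefix.
    for name in ("created_at", "id", "full_text", "lang"):
        flat["tweet_" + name] = tweet.get(name)
    # Attributes of the nested "user" object get the "user_" prefix.
    user = tweet.get("user") or {}
    for name in ("id", "screen_name", "followers_count"):
        flat["user_" + name] = user.get(name)
    # "retweeted_status" is only present when the Tweet is a ReTweet.
    retweeted = tweet.get("retweeted_status") or {}
    for name in ("id", "full_text"):
        flat["retweeted_status_" + name] = retweeted.get(name)
    # The "user" object nested inside "retweeted_status".
    retweeted_user = retweeted.get("user") or {}
    for name in ("id", "screen_name"):
        flat["retweeted_status_user_" + name] = retweeted_user.get(name)
    return flat


sample_tweet = {
    "created_at": "Mon May 07 03:15:00 +0000 2018",
    "id": 1,
    "full_text": "RT @original_author: example text",
    "lang": "en",
    "user": {"id": 10, "screen_name": "example_user", "followers_count": 5},
    "retweeted_status": {
        "id": 2,
        "full_text": "example text",
        "user": {"id": 20, "screen_name": "original_author"},
    },
}

flat_sample = flatten_tweet(sample_tweet)
print(flat_sample["user_screen_name"])
print(flat_sample["retweeted_status_user_screen_name"])
```

Missing nested objects simply yield `None` values in this sketch, so a non-ReTweet still produces the same set of keys.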

## Codebase Introduction:

<span style="font-family:Papyrus; font-size:1.25em;">
    
This section provides an introduction to our data analysis codebase.<br>

Please refer to the following files for the codebase in its entirety:<br>

* slo_twitter_data_analysis_v2.py<br>

* slo_twitter_data_analysis_utility_functions_v2.py<br>

* dataset_processor_adapted_v2.py<br>

</span>


[Codebase Introduction](slo-twitter-data-analysis-codebase-intro.ipynb#bookmark)

## Single versus Multiple Company Associated Tweets Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">
    
This section provides a simple analysis of a subset of our Twitter data, comparing Tweets associated with a single company against Tweets associated with multiple companies.  Based on this analysis, we have decided to include Tweets with multiple company associations in our analysis.  A few of them are not particularly useful because they consist only of stock market hashtags, but the majority do express some sort of stance or sentiment on SLO mining companies.<br>

</span>


[Single v. Multi Company Associated Tweets](slo-twitter-data-analysis-one-versus-multiple-companies-statistics.ipynb)
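
As a rough illustration of the single versus multiple split, the sketch below counts Tweets in each group.  It assumes that "multiple_companies_derived_count" holds the number of companies a Tweet is associated with; that interpretation, the helper name `association_kind`, and the sample rows are assumptions, not taken from the linked notebook.

```python
from collections import Counter

# Hypothetical rows; "multiple_companies_derived_count" is ASSUMED to hold
# the number of companies a Tweet is associated with.
rows = [
    {"tweet_id": 1, "multiple_companies_derived_count": 1},
    {"tweet_id": 2, "multiple_companies_derived_count": 3},
    {"tweet_id": 3, "multiple_companies_derived_count": 1},
    {"tweet_id": 4, "multiple_companies_derived_count": 2},
]


def association_kind(row):
    """Label a Tweet as associated with a single company or multiple companies."""
    return "multiple" if row["multiple_companies_derived_count"] > 1 else "single"


association_counts = Counter(association_kind(row) for row in rows)
print(association_counts["single"], association_counts["multiple"])
```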

## Time Series Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">
    
Most Tweets in our dataset are associated with "Adani" and fall in the 2017-2018 time period.  The Tweets for the other companies in our dataset are far less numerous.  Some companies show an even distribution across the years, while others are concentrated toward the left of the time axis (indicating older Tweets) or toward the right (indicating newer Tweets).<br>

</span>

[Time Series Statistics](slo-twitter-data-analysis-time-statistics.ipynb#bookmark)
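
The per-company, per-year counts behind a summary like this can be sketched as follows.  The sketch assumes "tweet_created_at" is still stored in the raw Twitter API timestamp format; the constant `TWITTER_TIME_FORMAT`, the helper `tweet_year`, the company label "other_company", and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime

# ASSUMPTION: timestamps are kept in the raw Twitter API format,
# e.g. "Mon May 07 03:15:00 +0000 2018".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Hypothetical rows for illustration only.
rows = [
    {"company_derived": "Adani", "tweet_created_at": "Mon May 07 03:15:00 +0000 2018"},
    {"company_derived": "Adani", "tweet_created_at": "Tue May 08 10:00:00 +0000 2018"},
    {"company_derived": "Adani", "tweet_created_at": "Tue Nov 14 21:30:00 +0000 2017"},
    {"company_derived": "other_company", "tweet_created_at": "Tue Jun 03 08:45:00 +0000 2014"},
]


def tweet_year(row):
    """Parse the creation timestamp and return its calendar year."""
    return datetime.strptime(row["tweet_created_at"], TWITTER_TIME_FORMAT).year


# Count Tweets per (company, year) pair.
tweets_per_company_year = Counter(
    (row["company_derived"], tweet_year(row)) for row in rows
)

for (company, year), count in sorted(tweets_per_company_year.items()):
    print(company, year, count)
```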

## Retweet Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">
    
Most Tweets associated with "Adani" are ReTweets.  We see a more even distribution between ReTweets and non-ReTweets for the other companies.  Again, "Adani" Tweets heavily influence the distribution of our statistics and graphs.<br>

</span>

[Retweet Statistics](slo-twitter-data-analysis-retweet-statistics.ipynb#bookmark)
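
A ReTweet share per company can be computed along the following lines.  The sketch assumes "retweeted_derived" is a boolean flag; the function name `retweet_share_by_company`, the company label "other_company", and the sample rows are hypothetical and do not reproduce our actual figures.

```python
from collections import defaultdict

# Hypothetical rows; "retweeted_derived" is ASSUMED to be a boolean flag.
rows = [
    {"company_derived": "Adani", "retweeted_derived": True},
    {"company_derived": "Adani", "retweeted_derived": True},
    {"company_derived": "Adani", "retweeted_derived": True},
    {"company_derived": "Adani", "retweeted_derived": False},
    {"company_derived": "other_company", "retweeted_derived": True},
    {"company_derived": "other_company", "retweeted_derived": False},
]


def retweet_share_by_company(rows):
    """Return the fraction of Tweets that are ReTweets, keyed by company."""
    totals = defaultdict(int)
    retweets = defaultdict(int)
    for row in rows:
        company = row["company_derived"]
        totals[company] += 1
        if row["retweeted_derived"]:
            retweets[company] += 1
    return {company: retweets[company] / totals[company] for company in totals}


retweet_shares = retweet_share_by_company(rows)
print(retweet_shares)
```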

## User (Tweet Author) Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">
    
We have far fewer unique authors than total Tweets in the dataset.  The statistics show that Twitter users that are actually news outlets or organizations are responsible for the majority of Tweets and ReTweets.<br>

</span>

[User (Tweet Author) Statistics](slo-twitter-data-analysis-user-statistics.ipynb#bookmark)
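
The comparison of unique authors against total Tweets can be sketched with a simple counter over "user_screen_name".  The screen names below are hypothetical placeholders, and the sketch is only one possible way to obtain such counts.

```python
from collections import Counter

# Hypothetical rows; the screen names are placeholders, not real accounts.
rows = [
    {"user_screen_name": "news_outlet_a"},
    {"user_screen_name": "news_outlet_a"},
    {"user_screen_name": "news_outlet_a"},
    {"user_screen_name": "organization_b"},
    {"user_screen_name": "organization_b"},
    {"user_screen_name": "person_c"},
]

# Number of Tweets written by each author.
tweets_per_author = Counter(row["user_screen_name"] for row in rows)

total_tweets = sum(tweets_per_author.values())
unique_authors = len(tweets_per_author)
# The most prolific authors, as (screen_name, tweet_count) pairs.
top_authors = tweets_per_author.most_common(2)

print(total_tweets, unique_authors, top_authors)
```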

## Character Count Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">
    
Most Tweets are under 300 characters long, but many are over 140 characters long.  We surmise this is because our "dataset_processor_adapted_v2.py" dataset creator adds the full ReTweeted text to the "text_derived" field we derive for our dataset.  It is also possible that there are encoding issues with non-English Tweets.  Further analysis is necessary.<br>

</span>

[Character Count Statistics](slo-twitter-data-analysis-character-statistics.ipynb#bookmark)
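
A minimal sketch of bucketing Tweets by the length of "text_derived" is shown below.  The bucket boundaries follow the 140 and 300 character thresholds mentioned above; the helper name `length_bucket` and the sample texts are hypothetical.  Note that Python's `len()` counts Unicode code points, which may differ from how Twitter itself counts characters.

```python
from collections import Counter

# Hypothetical rows; the texts are filler strings of known length.
rows = [
    {"text_derived": "a" * 100},
    {"text_derived": "b" * 200},
    {"text_derived": "c" * 280},
    {"text_derived": "d" * 350},
]


def length_bucket(text):
    """Assign a text to a length bucket based on its code point count."""
    length = len(text)
    if length <= 140:
        return "<=140"
    if length <= 300:
        return "141-300"
    return ">300"


length_counts = Counter(length_bucket(row["text_derived"]) for row in rows)
print(length_counts)
```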

## #Hashtag Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">

40% of "Adani" Tweets do not have any hashtags, while the rest have at least one.  Because these Tweets comprise most of our dataset, this figure should be broadly representative of the dataset as a whole.  For the other companies, a notable portion of Tweets also have no hashtags.<br>

**TODO: Attempt to get remainder of Shuntaro's code for this section functional.**

</span>

[Hashtags Statistics](slo-twitter-data-analysis-hashtag-statistics.ipynb#bookmark)
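
The share of Tweets without hashtags could be computed per company as in the sketch below.  It assumes "tweet_entities_hashtags" holds a (possibly empty) list of hashtag strings; in an exported CSV the field may instead be a serialized string that needs parsing first.  The function name `share_without_hashtags`, the company label "other_company", and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; "tweet_entities_hashtags" is ASSUMED to be a list of strings.
rows = [
    {"company_derived": "Adani", "tweet_entities_hashtags": []},
    {"company_derived": "Adani", "tweet_entities_hashtags": ["StopAdani"]},
    {"company_derived": "Adani", "tweet_entities_hashtags": ["coal", "auspol"]},
    {"company_derived": "Adani", "tweet_entities_hashtags": ["mining"]},
    {"company_derived": "other_company", "tweet_entities_hashtags": []},
    {"company_derived": "other_company", "tweet_entities_hashtags": ["asx"]},
]


def share_without_hashtags(rows):
    """Return the fraction of Tweets with no hashtags, keyed by company."""
    totals = defaultdict(int)
    without = defaultdict(int)
    for row in rows:
        company = row["company_derived"]
        totals[company] += 1
        if not row["tweet_entities_hashtags"]:
            without[company] += 1
    return {company: without[company] / totals[company] for company in totals}


no_hashtag_shares = share_without_hashtags(rows)
print(no_hashtag_shares)
```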

## @User Mentions Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">
    
Placeholder.

**TODO: Check associated code for correctness.**

**FIXME: weird output.**

</span>

[User Mentions Statistics](slo-twitter-data-analysis-mentions-statistics.ipynb#bookmark)

## Tweet Language Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">
    
The "spaCy-langdetect" library does a decent job of identifying the language of the Tweet text in comparison to the Twitter API.  However, it is inferior to "textblob", another Python library, which uses the Google Translate API to perform language detection.  Unfortunately, the Google Translate API is no longer free and is now a paid service, so we are forced to find free alternatives.  In the future, we may also use the "polyglot" library for Tweet text language detection if we can get it working on our Windows workstation or on a Linux workstation.<br>

**TODO - finish implementing.**

</span>

[Language Statistics](slo-twitter-data-analysis-language-statistics.ipynb#bookmark)
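
One simple way to compare the two language labels is an agreement rate between "tweet_lang" and "spaCy_language_detect".  The sketch below assumes both fields hold short language codes and skips Tweets that the Twitter API marks as undetermined ("und"); the function name `language_agreement` and the sample rows are hypothetical.

```python
# Hypothetical rows; both fields are ASSUMED to hold short language codes.
rows = [
    {"tweet_lang": "en", "spaCy_language_detect": "en"},
    {"tweet_lang": "en", "spaCy_language_detect": "de"},
    {"tweet_lang": "und", "spaCy_language_detect": "en"},
    {"tweet_lang": "hi", "spaCy_language_detect": "hi"},
]


def language_agreement(rows):
    """Return the fraction of Tweets where both language labels match.

    Tweets whose Twitter API label is "und" (undetermined) are skipped.
    Returns None when no Tweets remain to compare.
    """
    compared = [row for row in rows if row["tweet_lang"] != "und"]
    if not compared:
        return None
    matches = sum(
        1 for row in compared
        if row["tweet_lang"] == row["spaCy_language_detect"]
    )
    return matches / len(compared)


agreement = language_agreement(rows)
print(agreement)
```

Agreement alone does not say which detector is correct, so a manually labeled sample would still be needed to judge accuracy.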

## Pandas.describe() + NaN/Non-NaN Values Statistics and Analysis Summary:

<span style="font-family:Papyrus; font-size:1.25em;">
    
**TODO: Refactor into just the attributes we are interested in and move them to the relevant Jupyter Notebook files covering them.**<br>

</span>

[Pandas.describe() + NaN/Non-NaN Statistics](slo-twitter-data-analysis-other-statistics.ipynb#bookmark)

## Resources Used:

<span style="font-family:Papyrus; font-size:1.25em;">

**TODO: convert to annotated bibliography**

Dataset Files (obtained from Borg supercomputer):<br>

dataset_slo_20100101-20180510.json<br>
dataset_20100101-20180510.csv<br>

Note: These are large files not included in the project GitHub Repository.<br>


- [SLO-analysis.ipynb](SLO-analysis.ipynb)<br>
    -original SLO Twitter data analysis file from Shuntaro Yada.<br>


- https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json<br>
    -explanation of all data fields in JSON file format for Tweets.<br>


- https://datatofish.com/export-dataframe-to-csv/<br>
- https://datatofish.com/export-pandas-dataframe-json/<br>
    -saving Pandas dataframe to CSV/JSON<br>
    

- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html<br>
    -Pandas to_datetime() function call.<br>
    

- https://www.machinelearningplus.com/plots/matplotlib-tutorial-complete-guide-python-plot-examples/<br>
    -plotting with matplotlib.<br>


- https://stackoverflow.com/questions/49566007/jupyter-multiple-notebooks-using-same-data<br>
     
- https://stackoverflow.com/questions/16966280/reusing-code-from-different-ipython-notebooks<br>
    -sharing kernels and code across multiple Jupyter notebook files<br>
     
 
- https://stackoverflow.com/questions/32370281/how-to-embed-image-or-picture-in-jupyter-notebook-either-from-a-local-machine-o<br>
    -displaying embedded images<br>
    
    
</span>