# SLO Twitter Data Analysis - Table of Contents

## Joseph Jinn and Keith VanderLinden

<span style="font-family:Papyrus; font-size:1.25em;">
    
This collection of Jupyter Notebook files provides an analysis of Twitter data obtained by CSIRO Data61, covering the period from 2010 through 2018.  The raw Tweet data was extracted using the Twitter API.  The sections below each give a short summary and a hyperlink to the individual Jupyter Notebook file that provides further details on our analysis of different attributes, and combinations of attributes, in our Twitter dataset.<br>

</span>

## Twitter API

<span style="font-family:Papyrus; font-size:1.25em;">

This section introduces the hierarchical JSON structure of the raw Twitter data.  Every Tweet is encapsulated in a main Tweet object, which contains the "entities", "user", and "retweeted_status" sub-objects; these in turn contain various nested attributes.  The attributes we currently consider of the utmost importance to our research are as follows:


- *retweeted_derived*: indicates whether the Tweet is a retweet.<br>
- *company_derived*: associates the Tweet with a company.<br>
- *text_derived*: the full Tweet text.<br>
- *tweet_url_link_derived*: hyperlink to the actual Tweet on Twitter.<br>
- *company_derived_designation*: the single company the Tweet is associated with, or "multiple" for multi-company Tweets.<br>
- *tweet_text_length_derived*: character count of the Tweet text.<br>
- *spaCy_language_detect*: language of the Tweet as determined by the "spacy-langdetect" Python library.<br>
- *tweet_created_at*: time-date stamp of the Tweet.<br>
- *tweet_id*: unique ID number of the Tweet.<br>
- *tweet_retweet_count*: the number of times the Tweet has been retweeted.<br>
- *user_id*: unique ID of the user (Tweet author).<br>
- *user_description*: short description the user has written about themselves.<br>
- *tweet_entities_hashtags*: hashtags present in the Tweet text.<br>
- *tweet_entities_user_mentions_id*: unique IDs of the users mentioned in the Tweet.<br>
- *retweeted_status_full_text*: full text of the original retweeted Tweet.<br>
</span>

For more details, see this [Twitter API Introduction](slo-twitter-data-analysis-tweet-api-intro.ipynb#bookmark).
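
As a rough illustration of the nesting described above, the following sketch flattens one hand-written Tweet object into a few of the attribute names listed in this section. The sample values are invented, the `flatten_tweet` helper is hypothetical, and the mapping from nested fields to column names is an assumption rather than the project's actual processing code.

```python
import json

# Hand-written sample in the shape of a Tweet object (all values are invented).
RAW_TWEET = json.loads("""
{
  "id": 1001,
  "created_at": "Wed Oct 10 20:19:24 +0000 2018",
  "full_text": "RT @example: Mining news #mining",
  "retweet_count": 3,
  "user": {"id": 42, "description": "Example account"},
  "entities": {
    "hashtags": [{"text": "mining"}],
    "user_mentions": [{"id": 7, "screen_name": "example"}]
  },
  "retweeted_status": {"full_text": "Mining news #mining"}
}
""")


def flatten_tweet(tweet):
    """Hypothetical helper: pull nested fields up into one flat record."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    original = tweet.get("retweeted_status")  # present only on retweets
    text = tweet.get("full_text", "")
    return {
        "tweet_id": tweet.get("id"),
        "tweet_created_at": tweet.get("created_at"),
        "tweet_retweet_count": tweet.get("retweet_count"),
        "user_id": user.get("id"),
        "user_description": user.get("description"),
        "tweet_entities_hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "tweet_entities_user_mentions_id": [m["id"] for m in entities.get("user_mentions", [])],
        "retweeted_derived": original is not None,
        "retweeted_status_full_text": original["full_text"] if original else None,
        "tweet_text_length_derived": len(text),
    }


flat = flatten_tweet(RAW_TWEET)
print(json.dumps(flat, indent=2))
```

Using `.get()` with defaults keeps the helper from failing on Tweets that lack a sub-object, such as non-retweets without "retweeted_status".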

## Codebase

<span style="font-family:Papyrus; font-size:1.25em;">

The data analysis done in the following sections is based on a set of function libraries. These libraries include functions that process the raw JSON data and convert it into CSV format, libraries that load this processed dataset, and libraries that perform relevant data analysis.

- `dataset_processor_adapted.py` reads in the raw JSON file structure by chunks and parses through it to derive and extract various attributes of the Tweets that we are interested in.  We export the results to a CSV dataset file. This is done once, at the very beginning of the analysis.
 
- `slo_twitter_data_analysis.py` contains all the analysis code we use in this collection of Jupyter Notebook files.

- `slo_twitter_data_analysis_utility_functions.py` contains utility functions used by the analysis functions. These include graphing helpers and functions that extract existing attributes and derive new ones, which lets us manipulate our dataset readily and at low computational cost.

Each of the following sections imports the necessary libraries, configures the settings for those libraries, loads the processed CSV dataset, and then performs the required data analysis.

</span>


For more details, see this [Codebase Introduction](slo-twitter-data-analysis-codebase-intro.ipynb#bookmark).
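
The chunk-by-chunk conversion from raw JSON to CSV described above might look roughly like the sketch below. It assumes the raw file holds one JSON object per line, which this summary does not confirm, and the helper names `read_in_chunks` and `convert_to_csv` are hypothetical. It uses only the standard library and in-memory files so that it runs without the project's data.

```python
import csv
import io
import itertools
import json


def read_in_chunks(lines, chunk_size):
    """Yield lists of parsed JSON objects, at most chunk_size per list."""
    iterator = (line for line in lines if line.strip())  # skip blank lines
    while True:
        chunk = [json.loads(line) for line in itertools.islice(iterator, chunk_size)]
        if not chunk:
            return
        yield chunk


def convert_to_csv(lines, out_file, chunk_size=2):
    """Write two illustrative columns per Tweet and return the row count."""
    writer = csv.DictWriter(out_file, fieldnames=["tweet_id", "text_derived"])
    writer.writeheader()
    rows_written = 0
    for chunk in read_in_chunks(lines, chunk_size):
        for tweet in chunk:
            writer.writerow({"tweet_id": tweet["id"], "text_derived": tweet["full_text"]})
            rows_written += 1
    return rows_written


# Stand-ins for the raw JSON file and the CSV output file (sample data is invented).
raw = io.StringIO(
    '{"id": 1, "full_text": "first"}\n'
    '{"id": 2, "full_text": "second"}\n'
    '{"id": 3, "full_text": "third"}\n'
)
out = io.StringIO()
count = convert_to_csv(raw, out)
print(out.getvalue())
```

Processing a bounded number of Tweets at a time keeps memory use flat regardless of how large the raw file is.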

## Tweet Language Statistics and Analysis

<span style="font-family:Papyrus; font-size:1.25em;">
    
    move this earlier, before the company assignment. 
    we should be dropping non-English tweets (I think). Give some stats on this and decide what to do.
   

**TODO: remove after Professor VanderLinden gives the thumbs up for revised summary.**

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

We have decided to drop non-English Tweets from our Twitter dataset.  Approximately 96.53% of our Tweets are English and 3.47% are non-English.  Non-English Tweets make up little of our dataset and would only increase the difficulty of topic extraction.  We used the "spacy-langdetect" library for language detection.<br>

</span>

For more details, see this [Language Statistics](slo-twitter-data-analysis-language-statistics.ipynb#bookmark).
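
A minimal sketch of the filtering step follows, assuming each row carries the detected language as a code such as "en" in the *spaCy_language_detect* field. The `split_by_language` helper and the sample rows are hypothetical and only show the shape of the computation.

```python
def split_by_language(rows, language="en"):
    """Return the rows in the given language and their share of all rows."""
    kept = [row for row in rows if row["spaCy_language_detect"] == language]
    share = len(kept) / len(rows) if rows else 0.0
    return kept, share


# Invented sample rows; real rows would come from the processed CSV dataset.
SAMPLE_ROWS = [
    {"tweet_id": 1, "spaCy_language_detect": "en"},
    {"tweet_id": 2, "spaCy_language_detect": "en"},
    {"tweet_id": 3, "spaCy_language_detect": "de"},
    {"tweet_id": 4, "spaCy_language_detect": "en"},
]

english_rows, english_share = split_by_language(SAMPLE_ROWS)
print(f"{english_share:.2%} of sample Tweets are English")
```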

## Company Assignments for Tweets

<span style="font-family:Papyrus; font-size:1.25em;">
    
    95%? of tweets have a single company assignment (see the sub-notebook for the assignment rules); others are associated with multiple companies.
    10%? of the multiple company tweets are not that interesting (stock info tweets for the energy sector), but others are more useful (individual tweets about the industry in general). Thus we combined multi-company tweets into a single "multiple" company category.
    The analysis sections below will show analysis for aggregate, company-specific and multiple company tweets.
    

**TODO: remove after Professor VanderLinden gives the thumbs up for revised summary.**

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

A feature is added to all tweets indicating the company that is the subject of the tweet. 98.77% of tweets designate a single company; 1.23% mention multiple companies. To these latter tweets, we assigned a "multiple" company subject. We decided to retain these multiple-company tweets because they carry some useful information on mining in general. Admittedly, 10%? of the multiple company tweets are not that interesting (EG? stock info tweets for the energy sector), but others are more useful (EG? individual tweets about the industry in general).<br>
    
The analysis sections below will generally show analysis for aggregate, company-specific, and multiple-company tweets.
</span>


For more details, see this [Single v. Multi Company Associated Tweets](slo-twitter-data-analysis-one-versus-multiple-companies-statistics.ipynb).
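
The "multiple" designation can be sketched as below. The actual assignment rules live in the linked sub-notebook and are not reproduced here; this example assumes each Tweet already comes with a list of matched company names, and the `designate_company` helper is hypothetical.

```python
from collections import Counter


def designate_company(companies):
    """Return the single company name, "multiple" for several, or None for none."""
    unique = sorted(set(companies))
    if not unique:
        return None
    return unique[0] if len(unique) == 1 else "multiple"


# Invented per-Tweet company matches.
SAMPLE_MATCHES = [["adani"], ["adani", "bhp"], ["bhp", "bhp"], ["riotinto"]]

designations = [designate_company(matches) for matches in SAMPLE_MATCHES]
print(Counter(designations))
```

Collapsing repeated mentions with `set` means a Tweet that names the same company twice still counts as a single-company Tweet.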

## Time Series Statistics and Analysis

<span style="font-family:Papyrus; font-size:1.25em;">
    
    Indicate that we've decided to lump the years together for now. We could consider bucketing the data by time blocks to see a flow of topics over time.
    
??% of the Tweets in our dataset are associated with "Adani" and are in the 2017-2018 time period.  The Tweets for the other companies in our dataset are far less numerous.  Some companies show an even distribution across the years (companies?) while others are skewed more towards the left (indicating they are older Tweets, ??) or skewed more towards the right (indicating they are newer Tweets, ??).<br>

**TODO: remove after Professor VanderLinden gives the thumbs up for revised summary.**

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

64.29% of our Tweets are associated with "Adani" and ??.??% of those are in the 2017-2018 time period.  The Tweets for the other companies in our dataset are far less numerous.  Fortescue, iluka, newmont, and woodside show a relatively even distribution across the years.  Multi-company Tweets, bhp, oilsearch, and riotinto show a distribution concentrated somewhat in the later years (newer Tweets).  Cuesta and whitehaven show a distribution concentrated somewhat in the earlier years (older Tweets).  There are also spikes in certain periods when many Tweets were made about a particular company, perhaps indicating that some event occurred.<br>

For now, we pool the Tweets from all years together in our analysis, though we could consider bucketing the data by time blocks to see a flow of topics over time.<br>

</span>

For more details, see this [Time Series Statistics](slo-twitter-data-analysis-time-statistics.ipynb#bookmark).
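
Bucketing by time block, as considered above, could be sketched as follows. The example assumes the time-date stamp is still in the raw Twitter API `created_at` layout (for example "Wed Oct 10 20:19:24 +0000 2018"); the processed CSV may store it differently. The `bucket_by_year` helper and the sample Tweets are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Layout of the raw API "created_at" string (an assumption for the CSV column).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def bucket_by_year(tweets):
    """Count Tweets per (company, year) from (company, created_at) pairs."""
    counts = Counter()
    for company, created_at in tweets:
        year = datetime.strptime(created_at, TWITTER_TIME_FORMAT).year
        counts[(company, year)] += 1
    return counts


# Invented sample data.
SAMPLE_TWEETS = [
    ("adani", "Wed Oct 10 20:19:24 +0000 2018"),
    ("adani", "Mon Jan 01 08:00:00 +0000 2018"),
    ("bhp", "Fri Mar 04 12:30:00 +0000 2011"),
]

yearly_counts = bucket_by_year(SAMPLE_TWEETS)
for (company, year), n in sorted(yearly_counts.items()):
    print(company, year, n)
```

Swapping `.year` for a (year, month) tuple would give finer buckets if yearly blocks hide the spikes mentioned above.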

## Retweet Statistics and Analysis

<span style="font-family:Papyrus; font-size:1.25em;">
    
    We will experiment with different strategies, e.g., remove all retweets; use retweet counts to determine "influencers"; compute votes for/against original posts via stance analysis of retweet sets for individual tweets.
    

**TODO: remove after Professor VanderLinden gives the thumbs up for revised summary.**

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

Approximately 66.55% of the Tweets in our dataset are ReTweets, while the other 33.45% are not.  Most Tweets associated with "Adani" are ReTweets, whereas the other companies show a more even split between ReTweets and non-ReTweets.  Again, "Adani" Tweets heavily influence the distribution of our statistics and graphs.  Of the 446,177 ReTweets, we possess the original text of the ReTweeted Tweet for 445,533; the original text is missing for the remaining 644.  A few original ReTweeted Tweets also have extremely high ReTweet counts, the highest being 98,886.<br>

In the future, we will experiment with different strategies, e.g., remove all retweets; use retweet counts to determine "influencers"; compute votes for/against original posts via stance analysis of retweet sets for individual tweets.

</span>

For more details, see this [Retweet Statistics](slo-twitter-data-analysis-retweet-statistics.ipynb#bookmark).

## User (Tweet Author) Statistics and Analysis

<span style="font-family:Papyrus; font-size:1.25em;">
    
    What does this mean for our analysis plan? We could bucket tweets based on rule-based stance assignments (see autocoding-preprocessor.py). We could throw out neutral tweets. We could do a network analysis of authors to compute "influencers", "new" ideas. 
    
We have far fewer unique authors HOW MANY? than we do total Tweets in the dataset.  From the stats, we see that the Twitter users that are actually (neutral) news outlets or organizations are responsible for the majority of Tweets and ReTweets.<br>

**TODO: remove after Professor VanderLinden gives the thumbs up for revised summary.**

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

With 38,107 unique authors, we have far fewer authors than the 670,423 Tweets in the dataset.  From the stats, we see that the Twitter users that are actually (neutral) news outlets or organizations are responsible for the majority of Tweets and ReTweets.  We also found that 375,104 of the 670,423 Tweet texts are over 140 characters long, while only 256,373 user description texts are over 140 characters long.<br>

What does this mean for our analysis plan? We could bucket tweets based on rule-based stance assignments (see autocoding-preprocessor.py). We could throw out neutral tweets. We could do a network analysis of authors to compute "influencers", "new" ideas.<br> 

</span>

For more details, see this [User (Tweet Author) Statistics](slo-twitter-data-analysis-user-statistics.ipynb#bookmark).

## Tweet Statistics and Analysis

<span style="font-family:Papyrus; font-size:1.25em;">
    
    Combine all tweet text analysis here, including character counts and any other of Shuntaro's analyses that seem useful.
    We may need to move to Linux to use the CMU tweet tagger for this work.
    

**TODO: remove after Professor VanderLinden gives the thumbs up for revised summary.**

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

Most Tweets are under 300 characters long but many are over 140 characters long.  We surmise this is because our "dataset_processor_adapted_v2.py" dataset creator adds the full ReTweeted text to the "text_derived" field we derive for our dataset.  It is also possible there are encoding issues for foreign Tweets.  Further analysis is necessary.<br>

</span>

For more details, see this [Tweet Text Statistics](slo-twitter-data-analysis-tweet-text-statistics.ipynb#bookmark).

## #Hashtag Statistics and Analysis

<span style="font-family:Papyrus; font-size:1.25em;">

#number (40.??% - unknown, need to create analysis function to compute) of "Adani" Tweets do not have any hashtags while the rest have at least one hashtag.  367,220  (54.77%) Tweets have at least one hashtag.

We leave the hashtags in the text because (to be completed).

</span>

For more details, see this [Hashtags Statistics](slo-twitter-data-analysis-hashtag-statistics.ipynb#bookmark).

## @User Mentions Statistics and Analysis

<span style="font-family:Papyrus; font-size:1.25em;">
    
Approximately 79.85% of all Tweets in our dataset have @user mentions.  Only about 5.89% of all Tweets are replies to other Tweets in our dataset.  Most Tweets have at most a single user mention.<br>

</span>

For more details, see this [User Mentions Statistics](slo-twitter-data-analysis-mentions-statistics.ipynb#bookmark).

## Tweet Stock Symbols, URLs, and Emojis Analysis

<span style="font-family:Papyrus; font-size:1.25em;">

Only approximately 10.31% of all the Tweets in our dataset contain stock symbols of some sort.  By contrast, about 77.54% of all the Tweets in our dataset contain URLs.  As for emojis, only approximately 0.34% of our Tweets contain any.

</span>

For more details, see this [Stock Symbols, URLs, Emojis - Statistics](slo-twitter-data-analysis-stock-symbols-and-urls-emojis.ipynb#bookmark).
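
One way to flag stock symbols and URLs in Tweet text is with regular expressions, as in the sketch below. The patterns are illustrative assumptions and not necessarily the ones used in the linked notebook: a stock symbol is taken to be a "$" followed by one to six letters, and a URL is anything starting with "http://" or "https://". Emoji detection is left out because it needs a Unicode-aware approach.

```python
import re

# Illustrative patterns (assumptions, not the project's own definitions).
CASHTAG = re.compile(r"(?<![A-Za-z0-9_$])\$[A-Za-z]{1,6}\b")
URL = re.compile(r"https?://\S+")


def text_features(text):
    """Return the stock symbols and URLs found in one Tweet text."""
    return {"stock_symbols": CASHTAG.findall(text), "urls": URL.findall(text)}


def share_with(texts, feature):
    """Fraction of texts that contain at least one match for the feature."""
    flagged = sum(1 for text in texts if text_features(text)[feature])
    return flagged / len(texts) if texts else 0.0


# Invented sample Tweets.
SAMPLE_TEXTS = [
    "Buy $BHP now https://t.co/abc",
    "Price is $100 today",
    "No links here",
    "See https://example.com/report",
]

print(share_with(SAMPLE_TEXTS, "stock_symbols"), share_with(SAMPLE_TEXTS, "urls"))
```

Requiring letters after the "$" keeps plain dollar amounts such as "$100" from being counted as stock symbols.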

## NaN/Non-NaN Values Statistics and Analysis (will be deprecated in near-future)

<span style="font-family:Papyrus; font-size:1.25em;">

This Jupyter Notebook file provides an overview of the number of null and non-null values for each attribute across the entire Twitter dataset.  For each column (field/attribute) in our dataset, it displays how many rows (examples) have NaN values and how many have non-NaN values.<br>

</span>

For more details, see this [NaN/Non-NaN Statistics](slo-twitter-data-analysis-other-statistics.ipynb#bookmark).

## Resources Referenced

<span style="font-family:Papyrus; font-size:1.25em;">

Dataset Files (obtained from Borg supercomputer):<br>

dataset_slo_20100101-20180510.json<br>

Note: These are large files not included in the project GitHub Repository.<br>


- [SLO-analysis.ipynb](SLO-analysis.ipynb)<br>
    -original SLO Twitter data analysis file from Shuntaro Yada.<br>


- https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json<br>
    -explanation of all data fields in JSON file format for Tweets.<br>


- https://datatofish.com/export-dataframe-to-csv/<br>
- https://datatofish.com/export-pandas-dataframe-json/<br>
    -saving Pandas dataframe to CSV/JSON<br>
    

- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html<br>
    -Pandas to_datetime() function call.<br>
    

- https://www.machinelearningplus.com/plots/matplotlib-tutorial-complete-guide-python-plot-examples/<br>
    -plotting with matplotlib.<br>


- https://stackoverflow.com/questions/49566007/jupyter-multiple-notebooks-using-same-data<br>
     
- https://stackoverflow.com/questions/16966280/reusing-code-from-different-ipython-notebooks<br>
    -sharing kernels and code across multiple Jupyter notebook files<br>
     
 
- https://stackoverflow.com/questions/32370281/how-to-embed-image-or-picture-in-jupyter-notebook-either-from-a-local-machine-o
    -displaying embedded images<br>
    
    
</span>