# SLO Twitter Data Analysis

### Joseph Jinn and Keith VanderLinden

<span style="font-family:Papyrus; font-size:1.25em;">
    
This Jupyter Notebook provides an analysis of Twitter data obtained by CSIRO Data61, covering the period from 2012 through 2018.  The Twitter API was used to extract the raw Tweet data.  The first sections below cover the structure of the raw Tweet data, with explanations of the various attributes (fields) and their associated values.  The latter sections showcase our analysis of the raw Tweet dataset as well as one of our preprocessed datasets in CSV format.<br>


TODO - text length statistics; whether the user has a description or not; is null equivalent to an empty string or not?

JSON --> CSV --> plot/describe data

Fields: timestamp, Tweet text, "retweet", in_reply_to, numeric values (remove the string representations); new derived CSV column: the Twitter URL that takes you directly to the Tweet.

Flatten out the "user" data structure.

How long the Tweets are and who is tweeting them.

</span>
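
<span style="font-family:Papyrus; font-size:1.25em;">

The JSON --> CSV step noted above could be sketched roughly as follows.  This is only an illustration, not our actual preprocessing code: it assumes the raw file stores one Tweet JSON object per line, the helper names (`tweet_to_row`, `tweets_to_csv`) are hypothetical, and the permalink pattern used for the derived URL column is an assumption about how Twitter addresses individual Tweets.<br>

</span>

```python
import csv
import io
import json

# One-line JSON Lines sample (assumed file layout); the values are taken
# from the sample Tweet described in this notebook.
RAW_LINES = [
    json.dumps({
        "created_at": "Sat Feb 23 03:40:21 +0000 2013",
        "id_str": "305160140833816576",
        "full_text": "@abcnews About bloody time. Adani only wants FIFO "
                     "Indian workers for his Bowen basin mines.",
        "in_reply_to_screen_name": "abcnews",
        "user": {"screen_name": "DazzDicko"},
    }),
]


def tweet_to_row(tweet):
    """Reduce one raw Tweet dictionary to a flat row of selected fields."""
    text = tweet.get("full_text") or tweet.get("text") or ""
    screen_name = tweet["user"]["screen_name"]
    return {
        "created_at": tweet["created_at"],
        "id_str": tweet["id_str"],
        "text": text,
        "text_length": len(text),
        "is_retweet": "retweeted_status" in tweet,
        "in_reply_to_screen_name": tweet.get("in_reply_to_screen_name"),
        "user_screen_name": screen_name,
        # Derived column: assumed permalink pattern for a single Tweet.
        "tweet_url": f"https://twitter.com/{screen_name}/status/{tweet['id_str']}",
    }


def tweets_to_csv(lines):
    """Convert an iterable of JSON strings into CSV text with a header row."""
    rows = [tweet_to_row(json.loads(line)) for line in lines]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = tweets_to_csv(RAW_LINES)
print(csv_text)
```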

## Raw JSON Twitter Dataset Tweet Structure:

<span style="font-family:Papyrus; font-size:1.25em;">

We utilize a single sample from the raw Twitter JSON dataset file in order to provide example values in the tables below.  Every Tweet in our raw dataset contains three JSON objects: the "tweet", the "user", and the "entities" objects.  The "tweet" object encapsulates the others.  Some Tweets may also contain an "extended_entities" or a "geo" object, depending on whether the Tweet contains native media (photos, videos, etc.) and whether it is geo-tagged.  According to the Twitter API Documentation:

"Tweets are the basic atomic building block of all things Twitter. Tweets are also known as “status updates.” The Tweet object has a long list of ‘root-level’ attributes, including fundamental attributes such as id, created_at, and text. Tweet objects are also the ‘parent’ object to several child objects. Tweet child objects include user, entities, and extended_entities. Tweets that are geo-tagged will have a place child object." ("Tweet object - Twitter Developers")  Refer to the link below for further introductory information.<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

</span>

### Main Tweet Object:

<span style="font-family:Papyrus; font-size:1.25em;">

This is the main Tweet object, which contains all other sub-objects.  Any attribute without an example value was not present in the sample we are utilizing.  Some attributes present in the main Tweet object of our sample are no longer listed for the Tweet object in the current Twitter API Documentation.  We list those in a separate table.<br>

Use N/A instead of blank.

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">created_at</td>
    <td class="tg-xldj">"Sat Feb 23 03:40:21 +0000 2013"</td>
    <td class="tg-xldj">UTC time when this Tweet was created.</td>
  </tr>
  <tr>
    <td class="tg-xldj">id</td>
    <td class="tg-xldj">305160140833816576</td>
    <td class="tg-xldj">The integer representation of the unique identifier for this Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">id_str</td>
    <td class="tg-xldj">"305160140833816576"</td>
    <td class="tg-xldj">The string representation of the unique identifier for this Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">text</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">The actual UTF-8 text of the status update.</td>
  </tr>
  <tr>
    <td class="tg-xldj">source</td>
    <td class="tg-xldj">"&lt;a href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"&gt;Twitter for iPhone&lt;\/a&gt;"</td>
    <td class="tg-xldj">Utility used to post the Tweet, as an HTML-formatted string.</td>
  </tr>
  <tr>
    <td class="tg-xldj">truncated</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">Indicates whether the value of the text parameter was truncated, for example, as a result of a retweet exceeding the original Tweet text length limit of 140 characters. <br><br>Truncated text will end in ellipsis, like this ...<br><br>Since Twitter now rejects long Tweets vs truncating them, the large majority of Tweets will have this set to false. <br><br>Note that while native retweets may have their toplevel text property shortened, the original text will be available under the retweeted_status object <br>and the truncated parameter will be set to the value of the original status (in most cases, false).</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_status_id</td>
    <td class="tg-xldj">305159434462691328</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID.</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_status_id_str</td>
    <td class="tg-xldj">"305159434462691328"</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s ID.</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_user_id</td>
    <td class="tg-xldj">2768501</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_user_id_str</td>
    <td class="tg-xldj">"2768501"</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_screen_name</td>
    <td class="tg-xldj">"abcnews"</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the screen name of the original Tweet’s author.</td>
  </tr>
  <tr>
    <td class="tg-xldj">user</td>
    <td class="tg-xldj">Object containing a multitude of attributes.</td>
    <td class="tg-xldj">The user who posted this Tweet. See User data dictionary for complete list of attributes.</td>
  </tr>
  <tr>
    <td class="tg-xldj">coordinates</td>
    <td class="tg-xldj">Object containing a multitude of attributes.</td>
    <td class="tg-xldj">Nullable. Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as <br><a href="http://www.geojson.org/">geoJSON </a>(longitude first, then latitude).</td>
  </tr>
  <tr>
    <td class="tg-xldj">place</td>
    <td class="tg-xldj">Object containing a multitude of attributes.</td>
    <td class="tg-xldj">Nullable. When present, indicates that the tweet is associated with (but not necessarily originating from) a <br><a href="https://developer.twitter.com/overview/api/places">Place </a>.</td>
  </tr>
  <tr>
    <td class="tg-xldj">quoted_status_id</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">This field only surfaces when the Tweet is a quote Tweet. This field contains the integer value Tweet ID of the quoted Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">quoted_status_id_str</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">This field only surfaces when the Tweet is a quote Tweet. This is the string representation Tweet ID of the quoted Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">is_quote_status</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">Indicates whether this is a Quoted Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">quoted_status</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">This field only surfaces when the Tweet is a quote Tweet. This attribute contains the Tweet object of the original Tweet that was quoted.</td>
  </tr>
  <tr>
    <td class="tg-xldj">retweeted_status</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">Users can amplify the broadcast of Tweets authored by other users by <a href="https://developer.twitter.com/rest/reference/post/statuses/retweet/%3Aid">retweeting</a>. <br><br>Retweets can be distinguished from typical Tweets by the existence of a retweeted_status attribute. <br><br>This attribute contains a representation of the original Tweet that was retweeted. <br><br>Note that retweets of retweets do not show representations of the intermediary retweet, but only the original Tweet. <br>(Users can also <a href="https://developer.twitter.com/rest/reference/post/statuses/destroy/%3Aid">unretweet </a>a retweet they created by deleting their retweet.)</td>
  </tr>
  <tr>
    <td class="tg-xldj">quote_count</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">Nullable. Indicates approximately how many times this Tweet has been quoted by Twitter users.</td>
  </tr>
  <tr>
    <td class="tg-xldj">reply_count</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">Number of times this Tweet has been replied to.</td>
  </tr>
  <tr>
    <td class="tg-xldj">retweet_count</td>
    <td class="tg-xldj">0</td>
    <td class="tg-xldj">Number of times this Tweet has been retweeted.</td>
  </tr>
  <tr>
    <td class="tg-xldj">favorite_count</td>
    <td class="tg-xldj">0</td>
    <td class="tg-xldj">Nullable. Indicates approximately how many times this Tweet has been <br><a href="https://developer.twitter.com/rest/reference/post/favorites/create">liked </a>by Twitter users.</td>
  </tr>
  <tr>
    <td class="tg-0pky">entities</td>
    <td class="tg-0pky">Object containing a multitude of attributes.</td>
    <td class="tg-0pky">Entities which have been parsed out of the text of the Tweet. Additionally see <br><a href="https://developer.twitter.com/overview/api/entities-in-twitter-objects">Entities in Twitter Objects </a>.</td>
  </tr>
  <tr>
    <td class="tg-0pky">extended_entities</td>
    <td class="tg-0pky">Object containing a multitude of attributes.</td>
    <td class="tg-0pky">When between one and four native photos or one video or one animated GIF are in Tweet, contains an array 'media' metadata. <br><br>This is also available in Quote Tweets. Additionally see <a href="https://developer.twitter.com/overview/api/entities-in-twitter-objects">Entities in Twitter Objects </a>.</td>
  </tr>
  <tr>
    <td class="tg-0pky">favorited</td>
    <td class="tg-0pky">false</td>
    <td class="tg-0pky">Nullable. Indicates whether this Tweet has been liked by the authenticating user.</td>
  </tr>
  <tr>
    <td class="tg-0pky">retweeted</td>
    <td class="tg-0pky">false</td>
    <td class="tg-0pky">Indicates whether this Tweet has been Retweeted by the authenticating user.</td>
  </tr>
  <tr>
    <td class="tg-0pky">possibly_sensitive</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">Nullable. This field only surfaces when a Tweet contains a link. <br><br>The meaning of the field doesn’t pertain to the Tweet content itself, <br>but instead it is an indicator that the URL contained in the Tweet may contain content or media identified as sensitive content.</td>
  </tr>
  <tr>
    <td class="tg-0pky">filter_level</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">Indicates the maximum value of the <a href="https://developer.twitter.com/streaming/overview/request-parameters#filter_level">filter_level </a>parameter which may be used and still stream this Tweet. <br><br>So a value of medium will be streamed on none, low, and medium streams.</td>
  </tr>
  <tr>
    <td class="tg-0pky">lang</td>
    <td class="tg-0pky">"en"</td>
    <td class="tg-0pky">Nullable. When present, indicates a <a href="http://tools.ietf.org/html/bcp47">BCP 47 </a>language identifier corresponding to the machine-detected language of the Tweet text, <br>or und if no language could be detected. See more documentation <a href="http://support.gnip.com/apis/powertrack2.0/rules.html#Operators">HERE</a>.</td>
  </tr>
  <tr>
    <td class="tg-0pky">matching_rules</td>
    <td class="tg-0pky"></td>
    <td class="tg-0pky">Present in filtered products such as Twitter Search and PowerTrack. <br><br>Provides the id and tag associated with the rule that matched the Tweet. <br><br>With PowerTrack, more than one rule can match a Tweet. See more documentation <a href="http://support.gnip.com/enrichments/matching_rules.html">HERE</a>.</td>
  </tr>
</table>
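
<span style="font-family:Papyrus; font-size:1.25em;">

As a small illustration of how the attributes in the table above might be used, the sketch below parses the "created_at" string into a timezone-aware datetime and labels a Tweet as a retweet, quote, reply, or original.  The function names and the order in which the checks are applied are our own assumptions for this example, not something defined by the Twitter API.  Note that the `%a` and `%b` directives depend on the locale, so this assumes an English/C locale.<br>

</span>

```python
from datetime import datetime

# Format of the "created_at" attribute, e.g. "Sat Feb 23 03:40:21 +0000 2013".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Parse a Tweet "created_at" string into a timezone-aware datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


def classify_tweet(tweet):
    """Label a Tweet dictionary; the precedence of checks is an assumption."""
    if "retweeted_status" in tweet:
        return "retweet"
    if tweet.get("is_quote_status"):
        return "quote"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "original"


# Values copied from the sample Tweet in the table above.
sample = {
    "created_at": "Sat Feb 23 03:40:21 +0000 2013",
    "id": 305160140833816576,
    "id_str": "305160140833816576",
    "in_reply_to_status_id": 305159434462691328,
    "is_quote_status": False,
}

created = parse_created_at(sample["created_at"])
print(created.isoformat())      # 2013-02-23T03:40:21+00:00
print(classify_tweet(sample))   # reply
```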

<span style="font-family:Papyrus; font-size:1.25em;">

The "text" attribute should contain the full raw text of the Tweet but in our sample Tweet from our dataset it is instead contained in the "full_text" field.<br>

</span>
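
<span style="font-family:Papyrus; font-size:1.25em;">

Because the text may live in either "text" or "full_text", code that reads the Tweet text needs a fallback.  The helper below is a minimal sketch of one way to do this; `get_tweet_text` is a hypothetical name.  It also follows the note in the "truncated" description above by reading the text of the original Tweet for native retweets, which is a design choice on our part (it drops the "RT @user:" prefix of the retweet itself).<br>

</span>

```python
def get_tweet_text(tweet):
    """Return the most complete text available for a Tweet dictionary."""
    # For native retweets, the untruncated text is on the original Tweet.
    source = tweet.get("retweeted_status", tweet)
    # Extended-mode payloads use "full_text"; other payloads use "text".
    return source.get("full_text") or source.get("text") or ""


extended = {
    "full_text": "@abcnews About bloody time. Adani only wants FIFO "
                 "Indian workers for his Bowen basin mines.",
}
print(get_tweet_text(extended))
```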

### Main Tweet Object - Attributes present in our Sample but not in Twitter API Docs for the Main Tweet Object:

<span style="font-family:Papyrus; font-size:1.25em;">

These are the attributes we noticed that are present in our sample in the main Tweet object but are not listed as being part of the main Tweet object in the current Twitter API Documentation.<br>

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268">Attribute</th>
    <th class="tg-s268">Value</th>
    <th class="tg-s268">Description</th>
  </tr>
  <tr>
    <td class="tg-s268">full_text</td>
    <td class="tg-s268">"@abcnews About bloody time. Adani only wants FIFO Indian workers for his Bowen basin mines."</td>
    <td class="tg-s268">Replaces "text" in the Extended Mode of REST API endpoints.</td>
  </tr>
  <tr>
    <td class="tg-s268">display_text_range</td>
    <td class="tg-s268">[0,91]</td>
    <td class="tg-s268">Part of the "extended_tweet" attribute for streaming APIs.</td>
  </tr>
  <tr>
    <td class="tg-s268">contributors</td>
    <td class="tg-s268">null</td>
    <td class="tg-s268">We could not find a description for this exact field in the documentation.</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

Refer to the link below for more information on these fields.  We could not find any information on the "contributors" field, however.  It may have been removed and is no longer listed in the Twitter API Documentation.<br>

https://developer.twitter.com/en/docs/tweets/tweet-updates.html

</span>
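
<span style="font-family:Papyrus; font-size:1.25em;">

To illustrate how "display_text_range" relates to "full_text", the sketch below slices the text with the two indices.  It assumes the range is a pair of Unicode code point offsets with an inclusive start and an exclusive end, which matches how Python indexes strings; `displayed_text` is a hypothetical helper name.  For our sample the range [0,91] spans the entire 91-character text.<br>

</span>

```python
def displayed_text(tweet):
    """Return the part of "full_text" selected by "display_text_range"."""
    full_text = tweet["full_text"]
    # Fall back to the whole text when the range is absent.
    start, end = tweet.get("display_text_range", [0, len(full_text)])
    return full_text[start:end]


# Values copied from the sample Tweet in the table above.
sample = {
    "full_text": "@abcnews About bloody time. Adani only wants FIFO "
                 "Indian workers for his Bowen basin mines.",
    "display_text_range": [0, 91],
}

print(len(sample["full_text"]))                        # 91
print(displayed_text(sample) == sample["full_text"])   # True
```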

### Main Tweet Object - Additional Attributes

<span style="font-family:Papyrus; font-size:1.25em;">

These are additional attributes listed in the Twitter API Documentation for the main Tweet object.  They are not present in the sample we use from our raw Twitter dataset.<br>

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268">Attribute</th>
    <th class="tg-s268">Value</th>
    <th class="tg-s268">Description</th>
  </tr>
  <tr>
    <td class="tg-s268">current_user_retweet</td>
    <td class="tg-s268"></td>
    <td class="tg-s268">Perspectival. Only surfaces on methods supporting the include_my_retweet parameter, when set to true. Details the Tweet ID of the user’s own retweet (if existent) of this Tweet.</td>
  </tr>
  <tr>
    <td class="tg-s268">scopes</td>
    <td class="tg-s268"></td>
    <td class="tg-s268">A set of key-value pairs indicating the intended contextual delivery of the containing Tweet. Currently used by Twitter’s Promoted Products.</td>
  </tr>
  <tr>
    <td class="tg-s268">withheld_copyright</td>
    <td class="tg-s268"></td>
    <td class="tg-s268">When present and set to “true”, it indicates that this piece of content has been withheld due to a <br><a href="http://en.wikipedia.org/wiki/Digital_Millennium_Copyright_Act">DMCA complaint </a>.</td>
  </tr>
  <tr>
    <td class="tg-0lax">withheld_in_countries</td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax">When present, indicates a list of uppercase <a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">two-letter country codes </a>this content is withheld from.</td>
  </tr>
  <tr>
    <td class="tg-0lax">withheld_scope</td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax">When present, indicates whether the content being withheld is the “status” or a “user.”</td>
  </tr>
  <tr>
    <td class="tg-0lax">geo</td>
    <td class="tg-0lax"></td>
    <td class="tg-0lax"><span style="font-weight:700">Deprecated.</span><br><br>Nullable. Use the coordinates field instead. This deprecated attribute has its coordinates formatted as [lat, long], while all other Tweet geo is formatted as [long, lat].</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

It appears the "geo" object is now deprecated.  However, our raw Twitter dataset does contain the "geo" field for some Tweets, so it was apparently still in use while CSIRO was collecting this data.<br>

</span>
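
<span style="font-family:Papyrus; font-size:1.25em;">

Since "geo" stores its pair as [lat, long] while "coordinates" uses [long, lat], any code that reads both must be careful with the ordering.  The sketch below prefers "coordinates" and falls back to "geo".  It assumes both fields are GeoJSON-style objects holding an inner "coordinates" array; the helper name and the example values (a point near Brisbane) are hypothetical and not taken from our dataset.<br>

</span>

```python
def tweet_lat_lon(tweet):
    """Return (latitude, longitude) for a Tweet, or None if not geo-tagged."""
    coordinates = tweet.get("coordinates")
    if coordinates:
        lon, lat = coordinates["coordinates"]   # current order: [long, lat]
        return (lat, lon)
    geo = tweet.get("geo")
    if geo:
        lat, lon = geo["coordinates"]           # deprecated order: [lat, long]
        return (lat, lon)
    return None


# Hypothetical example values, not taken from the dataset.
with_coordinates = {"coordinates": {"type": "Point", "coordinates": [153.03, -27.47]}}
with_geo_only = {"coordinates": None, "geo": {"type": "Point", "coordinates": [-27.47, 153.03]}}

print(tweet_lat_lon(with_coordinates))   # (-27.47, 153.03)
print(tweet_lat_lon(with_geo_only))      # (-27.47, 153.03)
```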

### User Object within the Main Tweet Object:

<span style="font-family:Papyrus; font-size:1.25em;">

This is the "user" object nested within the main Tweet object.  It is a large data structure containing a multitude of attributes and their corresponding values.  Extracting just the "user" object resulted in a CSV file over 1.0 GB in size.  Refer to the link below for more in-depth information concerning "user".<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object

</span>
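
<span style="font-family:Papyrus; font-size:1.25em;">

One way to turn the nested "user" object into CSV-friendly columns is to flatten it into prefixed keys such as "user_screen_name".  The recursive helper below is a minimal sketch of that idea using only the standard library; the function name and the underscore separator are our own choices for illustration.  `pandas.json_normalize` offers similar behaviour for whole collections of records.<br>

</span>

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into one level with prefixed keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into child objects such as "user".
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# A reduced Tweet using values from the sample described in this notebook.
tweet = {
    "id_str": "305160140833816576",
    "user": {
        "id_str": "772466924",
        "screen_name": "DazzDicko",
        "location": "Far North Queensland",
        "followers_count": 945,
    },
}

flat = flatten(tweet)
print(sorted(flat))
```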

#### Non-Deprecated Fields within the User Object:

<span style="font-family:Papyrus; font-size:1.25em;">

These are the non-deprecated attributes currently in use as of June 14, 2019.  Any attribute without a sample value indicates that the attribute was not present in the sample we extracted from our raw Tweet dataset.<br>

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">id</td>
    <td class="tg-xldj">772466924</td>
    <td class="tg-xldj">The integer representation of the unique identifier for this User.</td>
  </tr>
  <tr>
    <td class="tg-xldj">id_str</td>
    <td class="tg-xldj">"772466924"</td>
    <td class="tg-xldj">The string representation of the unique identifier for this User.</td>
  </tr>
  <tr>
    <td class="tg-xldj">name</td>
    <td class="tg-xldj">"Daryl Dickson"</td>
    <td class="tg-xldj">The name of the user, as they’ve defined it. Not necessarily a person’s name. Typically capped at 50 characters, but subject to change.</td>
  </tr>
  <tr>
    <td class="tg-xldj">screen_name</td>
    <td class="tg-xldj">"DazzDicko"</td>
    <td class="tg-xldj">The screen name, handle, or alias that this user identifies themselves with. screen_names are unique but subject to change. Typically a maximum of 15 characters long, but some historical accounts may exist with longer names.</td>
  </tr>
  <tr>
    <td class="tg-xldj">location</td>
    <td class="tg-xldj">"Far North Queensland"</td>
    <td class="tg-xldj">Nullable. The user-defined location for this account’s profile. Not necessarily a location, nor machine-parseable.</td>
  </tr>
  <tr>
    <td class="tg-xldj">derived</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">Enterprise APIs only. Collection of Enrichment metadata derived for user. Provides the <br><a href="https://developer.twitter.com/en/docs/tweets/enrichments/overview/profile-geo">Profile Geo </a>Enrichment metadata.</td>
  </tr>
  <tr>
    <td class="tg-xldj">url</td>
    <td class="tg-xldj">null</td>
    <td class="tg-xldj">Nullable. A URL provided by the user in association with their profile.</td>
  </tr>
  <tr>
    <td class="tg-xldj">description</td>
    <td class="tg-xldj">"Train Driver extraordinaire, proud Union Leftie and Labor supporter. Cant stand the LNP and their regressive ideas. Mainly political but I do enjoy a laugh."</td>
    <td class="tg-xldj">Nullable. The user-defined UTF-8 string describing their account.</td>
  </tr>
  <tr>
    <td class="tg-xldj">protected</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">When true, indicates that this user has chosen to protect their Tweets.</td>
  </tr>
  <tr>
    <td class="tg-xldj">verified</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">When true, indicates that the user has a verified account. See <br><a href="https://support.twitter.com/articles/119135-faqs-about-verified-accounts">Verified Accounts </a>.</td>
  </tr>
  <tr>
    <td class="tg-xldj">followers_count</td>
    <td class="tg-xldj">945</td>
    <td class="tg-xldj">The number of followers this account currently has. Under certain conditions of duress, this field will temporarily indicate “0”.</td>
  </tr>
  <tr>
    <td class="tg-xldj">friends_count</td>
    <td class="tg-xldj">1385</td>
    <td class="tg-xldj">The number of users this account is following (AKA their “followings”). Under certain conditions of duress, this field will temporarily indicate “0”.</td>
  </tr>
  <tr>
    <td class="tg-xldj">listed_count</td>
    <td class="tg-xldj">3</td>
    <td class="tg-xldj">The number of public lists that this user is a member of.</td>
  </tr>
  <tr>
    <td class="tg-xldj">favourites_count</td>
    <td class="tg-xldj">533</td>
    <td class="tg-xldj">The number of Tweets this user has liked in the account’s lifetime.</td>
  </tr>
  <tr>
    <td class="tg-xldj">statuses_count</td>
    <td class="tg-xldj">5176</td>
    <td class="tg-xldj">The number of Tweets (including retweets) issued by the user.</td>
  </tr>
  <tr>
    <td class="tg-xldj">created_at</td>
    <td class="tg-xldj">"Tue Aug 21 23:23:52 +0000 2012"</td>
    <td class="tg-xldj">The UTC datetime that the user account was created on Twitter.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_banner_url</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">The HTTPS-based URL pointing to the standard web representation of the user’s uploaded profile banner.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_image_url_https</td>
    <td class="tg-xldj">"https://pbs.twimg.com/profile_images/698290934618787840/SIpBKnWE_normal.jpg"</td>
    <td class="tg-xldj">A HTTPS-based URL pointing to the user’s profile image.</td>
  </tr>
  <tr>
    <td class="tg-xldj">default_profile</td>
    <td class="tg-xldj">true</td>
    <td class="tg-xldj">When true, indicates that the user has not altered the theme or background of their user profile.</td>
  </tr>
  <tr>
    <td class="tg-xldj">default_profile_image</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">When true, indicates that the user has not uploaded their own profile image and a default image is used instead.</td>
  </tr>
  <tr>
    <td class="tg-xldj">withheld_in_countries</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">When present, indicates a list of uppercase <br><a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">two-letter country codes </a>this content is withheld from.</td>
  </tr>
  <tr>
    <td class="tg-xldj">withheld_scope</td>
    <td class="tg-xldj"></td>
    <td class="tg-xldj">When present, indicates that the content being withheld is a “user.”</td>
  </tr>
</table>
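
<span style="font-family:Papyrus; font-size:1.25em;">

Several of the attributes above ("location", "url", "description") are nullable, so a check such as "does this user have a description?" has to decide how to treat null versus an empty string.  The sketch below treats null, empty, and whitespace-only values as equally missing.  That equivalence is an assumption made for illustration; `has_value` is a hypothetical helper, and the second user record is invented.<br>

</span>

```python
def has_value(value):
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return isinstance(value, str) and value.strip() != ""


# "url" and "location" use the values from the sample in the table above.
sample_user = {"url": None, "location": "Far North Queensland"}
# Hypothetical user whose description is an empty string rather than null.
empty_user = {"description": ""}

print(has_value(sample_user["url"]))           # False
print(has_value(sample_user["location"]))      # True
print(has_value(empty_user["description"]))    # False
```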

#### Deprecated Fields within the User Object:

<span style="font-family:Papyrus; font-size:1.25em;">
 
These are the deprecated attributes that are no longer in use.  Any attribute without a sample value indicates that the attribute was not present in the sample we extracted from our raw Tweet dataset.<br>
 
</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268">Attribute</th>
    <th class="tg-s268">Value</th>
    <th class="tg-s268">Description</th>
  </tr>
  <tr>
    <td class="tg-s268">utc_offset</td>
    <td class="tg-s268">36000</td>
    <td class="tg-s268">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/manage-account-settings/api-reference/get-account-settings">GET account/settings</a></td>
  </tr>
  <tr>
    <td class="tg-s268">time_zone</td>
    <td class="tg-s268">"Australia/Brisbane"</td>
    <td class="tg-s268">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/manage-account-settings/api-reference/get-account-settings">GET account/settings </a>as tzinfo_name</td>
  </tr>
  <tr>
    <td class="tg-s268">lang</td>
    <td class="tg-s268">"en"</td>
    <td class="tg-s268">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/manage-account-settings/api-reference/get-account-settings">GET account/settings </a>as language</td>
  </tr>
  <tr>
    <td class="tg-s268">geo_enabled</td>
    <td class="tg-s268">true</td>
    <td class="tg-s268">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/manage-account-settings/api-reference/get-account-settings">GET account/settings</a>. This field must be true for the current user to attach geographic data when using <br><a href="https://developer.twitter.com/en/docs/tweets/post-and-engage/guides/post-tweet-geo-guide">POST statuses / update</a></td>
  </tr>
  <tr>
    <td class="tg-s268">following</td>
    <td class="tg-s268">false</td>
    <td class="tg-s268">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friendships-lookup">GET friendships/lookup</a></td>
  </tr>
  <tr>
    <td class="tg-s268">follow_request_sent</td>
    <td class="tg-s268">false</td>
    <td class="tg-s268">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friendships-lookup">GET friendships/lookup</a></td>
  </tr>
  <tr>
    <td class="tg-s268">has_extended_profile</td>
    <td class="tg-s268"></td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">notifications</td>
    <td class="tg-s268">false</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_location</td>
    <td class="tg-s268"></td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">contributors_enabled</td>
    <td class="tg-s268">false</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_image_url</td>
    <td class="tg-s268">"http://pbs.twimg.com/profile_images/698290934618787840/SIpBKnWE_normal.jpg"</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null. NOTE: Profile images are only available using the profile_image_url_https field.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_background_color</td>
    <td class="tg-s268"></td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_background_image_url</td>
    <td class="tg-s268">"http://abs.twimg.com/images/themes/theme1/bg.png"</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_background_image_url_https</td>
    <td class="tg-s268">"https://abs.twimg.com/images/themes/theme1/bg.png"</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_background_tile</td>
    <td class="tg-s268">false</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_link_color</td>
    <td class="tg-s268">"1DA1F2"</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_sidebar_border_color</td>
    <td class="tg-s268">"C0DEED"</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_sidebar_fill_color</td>
    <td class="tg-s268">"DDEEF6"</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_text_color</td>
    <td class="tg-s268">"333333"</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">profile_use_background_image</td>
    <td class="tg-s268">true</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">is_translator</td>
    <td class="tg-s268">false</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-s268">is_translation_enabled</td>
    <td class="tg-s268">false</td>
    <td class="tg-s268"><span style="font-weight:700">Deprecated.</span><br>Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-0lax">translator_type</td>
    <td class="tg-0lax">"none"</td>
    <td class="tg-0lax"><span style="font-weight:700">Deprecated</span><br>. Value will be set to null.</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

Our dataset was gathered over the course of 10 years, so it stands to reason that Twitter has deprecated some of the fields that were used, added new fields, and changed others.  The Twitter API Documentation does not describe the former purpose of these deprecated fields.<br>

</span>

### Entities Object within the Main Tweet Object:

<span style="font-family:Papyrus; font-size:1.25em;">

This is the "entities" object for a Tweet within our dataset.  All the Lists are empty except for "user_mentions", which is a List containing a Dictionary of key-value pairs.  It should be noted that the elements of each of these Lists are themselves Objects with multiple key (attribute)-value pairs within.  For a more in-depth listing of the attributes and format, please refer to the link below.<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">urls</td>
    <td class="tg-xldj">[]</td>
    <td class="tg-xldj">Represents URLs included in the text of a Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">hashtags</td>
    <td class="tg-xldj">[]</td>
    <td class="tg-xldj">Represents hashtags which have been parsed out of the Tweet text.</td>
  </tr>
  <tr>
    <td class="tg-xldj">user_mentions</td>
    <td class="tg-xldj">[{"indices":[0,8],"screen_name":"abcnews","id_str":"2768501", "name":"ABC News","id":2768501}]</td>
    <td class="tg-xldj">Represents other Twitter users mentioned in the text of the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">symbols</td>
    <td class="tg-xldj">[]</td>
    <td class="tg-xldj">Represents symbols, i.e. $cashtags, included in the text of the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-0pky">media</td>
    <td class="tg-0pky">[]</td>
    <td class="tg-0pky">Represents media elements uploaded with the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-0pky">polls</td>
    <td class="tg-0pky">[]</td>
    <td class="tg-0pky">Represents Twitter Polls included in the Tweet.</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

For the sample we have chosen, only the "urls", "hashtags", "user_mentions", and "symbols" fields were present, even though most of them are empty.  The other fields in the table were not present in the "entities" object for this particular Tweet.<br>

</span>
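
<span style="font-family:Papyrus; font-size:1.25em;">

Because some "entities" fields may be absent from a given Tweet, code that reads them should not assume every key exists.  The following is a minimal sketch of one way to do this, not code from our utility file.  The sample dictionary mirrors the table above, and the function names "summarize_entities" and "mentioned_screen_names" are hypothetical.<br>

</span>

```python
# Hypothetical sample mirroring the "entities" table above.
sample_entities = {
    "urls": [],
    "hashtags": [],
    "user_mentions": [
        {"indices": [0, 8], "screen_name": "abcnews", "id_str": "2768501",
         "name": "ABC News", "id": 2768501}
    ],
    "symbols": [],
}


def summarize_entities(entities):
    """Return the number of items in each entities list (0 if the key is absent or null)."""
    keys = ["urls", "hashtags", "user_mentions", "symbols", "media", "polls"]
    return {key: len(entities.get(key) or []) for key in keys}


def mentioned_screen_names(entities):
    """Return the screen names of all users mentioned in the Tweet."""
    return [mention["screen_name"] for mention in (entities.get("user_mentions") or [])]


entity_counts = summarize_entities(sample_entities)
mentions = mentioned_screen_names(sample_entities)
print(entity_counts)
print(mentions)
```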

### Extended Entities Object within the Main Tweet Object:

<span style="font-family:Papyrus; font-size:1.25em;">
    
This element is present in any Tweet that contains "native media" such as photos, videos, etc.  It is an object that contains all the metadata for each of the native media elements present in the Tweet.<br>

Refer to the link below for all the particulars on the "extended_entities" object.<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/extended-entities-object

</span>

<span style="font-family:Papyrus; font-size:1.25em;">
    
{"extended_entities":{"media":[{"display_url":"pic.twitter.com\/NcnlVdBAxt","indices":[110,132],"sizes":{"small":{"w":406,"h":680,"resize":"fit"},"large":{"w":448,"h":750,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":448,"h":750,"resize":"fit"}},"id_str":"394955471976538112","expanded_url":"https:\/\/twitter.com\/fightforthereef\/status\/394955472064622593\/photo\/1","media_url_https":"https:\/\/pbs.twimg.com\/media\/BXsp6MFCIAA4Zet.png","id":394955471976538112,"type":"photo","media_url":"http:\/\/pbs.twimg.com\/media\/BXsp6MFCIAA4Zet.png","url":"http:\/\/t.co\/NcnlVdBAxt"}]}}

</span>

<span style="font-family:Papyrus; font-size:1.25em;">
    
The above is a sample of an "extended_entities" object from a Tweet in our dataset.  This object was found within the value of the "retweeted_status" key.  We chose not to build a table for the attributes in this object, as the Twitter API Documentation does not itself have a table listing each attribute with example values and a description.<br>
    
</span>

<span style="font-family:Papyrus; font-size:1.25em;">
    
</span>

### Geo Object within the Main Tweet Object:

<span style="font-family:Papyrus; font-size:1.25em;">

The "geo" sub-object within the main Tweet object is comprised of the "coordinates" and "place" objects.  According to the Twitter API documentation,<br>

"The place object is always present when a Tweet is geo-tagged, while the coordinates object is only present (non-null) when the Tweet is assigned an exact location. If an exact location is provided, the coordinates object will provide a [long, lat] array with the geographical coordinates, and a Twitter Place that corresponds to that location will be assigned." ("Geo objects - Twitter Developers")<br>

Therefore, not every Tweet will necessarily possess either or both of these objects.<br>

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268">Attribute</th>
    <th class="tg-s268">Value</th>
    <th class="tg-s268">Description</th>
  </tr>
  <tr>
    <td class="tg-s268">place</td>
    <td class="tg-s268">{}</td>
    <td class="tg-s268">Places are specific, named locations with corresponding geo coordinates.</td>
  </tr>
  <tr>
    <td class="tg-s268">coordinates</td>
    <td class="tg-s268">{}</td>
    <td class="tg-s268">An array of longitude and latitude coordinates.&nbsp;&nbsp;May also include a type attribute.</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

Refer to the link below for a specific explanation of how "place" and "coordinates" are utilized together for geo-tagged Tweet objects.<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/geo-objects

</span>

#### Coordinates Object within the Geo Object:

<span style="font-family:Papyrus; font-size:1.25em;">

The "coordinates" object for geo-tagged Tweets contains the two attributes as described below.<br>

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268">Attribute</th>
    <th class="tg-s268">Value</th>
    <th class="tg-s268">Description</th>
  </tr>
  <tr>
    <td class="tg-s268">coordinates</td>
    <td class="tg-s268"><span style="font-style:italic">[-97.51087576,35.46500176]</span></td>
    <td class="tg-s268">The longitude and latitude of the Tweet’s location, as a collection in the form <br><span style="font-weight:700">[longitude, latitude]</span>.</td>
  </tr>
  <tr>
    <td class="tg-s268">type</td>
    <td class="tg-s268">"Point"</td>
    <td class="tg-s268">The type of data encoded in the coordinates property. This will be “Point” for Tweet coordinates fields.</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">


The values are examples copied from the Twitter API Documentation on "geo" objects.<br>

</span>

#### Place Object within the Geo Object:

<span style="font-family:Papyrus; font-size:1.25em;">
    
</span>

<span style="font-family:Papyrus; font-size:1.25em;">

The "place" object for geo-tagged Tweets contains the following attributes as described below.<br>

</span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">id</td>
    <td class="tg-xldj">"01a9a39529b27f36"</td>
    <td class="tg-xldj">ID representing this place. Note that this is represented as a string, not an integer.</td>
  </tr>
  <tr>
    <td class="tg-xldj">url</td>
    <td class="tg-xldj">"https://api.twitter.com/1.1/geo/id/01a9a39529b27f36.json"</td>
    <td class="tg-xldj">URL representing the location of additional place metadata for this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">place_type</td>
    <td class="tg-0pky">"city"</td>
    <td class="tg-0pky">The type of location represented by this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">name</td>
    <td class="tg-0pky">"Manhattan"</td>
    <td class="tg-0pky">Short human-readable representation of the place’s name.</td>
  </tr>
  <tr>
    <td class="tg-0pky">full_name</td>
    <td class="tg-0pky">"Manhattan, NY"</td>
    <td class="tg-0pky">Full human-readable representation of the place’s name.</td>
  </tr>
  <tr>
    <td class="tg-0pky">country_code</td>
    <td class="tg-0pky">"US"</td>
    <td class="tg-0pky">Shortened country code representing the country containing this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">country</td>
    <td class="tg-0pky">"United States"</td>
    <td class="tg-0pky">Name of the country containing this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">bounding_box</td>
    <td class="tg-0pky">"bounding_box":{"coordinates":[[[144.886226909269,-37.7802081941697],[144.988666911647,-37.7802081941697],[144.988666911647,-37.6909396998182],[144.886226909269,-37.6909396998182]]],"type":"Polygon"}</td>
    <td class="tg-0pky">A bounding box of coordinates which encloses this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">attributes</td>
    <td class="tg-0pky">{}</td>
    <td class="tg-0pky">Dictionary of Tweet attributes.</td>
  </tr>
</table>

<span style="font-family:Papyrus; font-size:1.25em;">

The sample values are from the Twitter API Documentation except for the "bounding_box" attribute, which is from a Tweet in our dataset.<br>

</span>
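
<span style="font-family:Papyrus; font-size:1.25em;">

Since the "coordinates" object is only present for exact locations while "place" carries a bounding box, a single location per Tweet can be approximated by preferring the exact point and falling back to the center of the bounding box.  The sketch below illustrates that idea under our own assumptions; the function name "approximate_location" is hypothetical and averaging the corners is only a rough stand-in for a true geographic centroid.<br>

</span>

```python
def approximate_location(coordinates, place):
    """
    Return (longitude, latitude) from a "coordinates" object if present,
    otherwise the average of the "place" bounding box corners, otherwise None.
    """
    if coordinates and coordinates.get("coordinates"):
        longitude, latitude = coordinates["coordinates"]
        return (longitude, latitude)
    if place and place.get("bounding_box"):
        # The bounding box holds one ring of [longitude, latitude] corners.
        corners = place["bounding_box"]["coordinates"][0]
        longitude = sum(corner[0] for corner in corners) / len(corners)
        latitude = sum(corner[1] for corner in corners) / len(corners)
        return (longitude, latitude)
    return None


# Example values taken from the tables above.
exact_point = {"coordinates": [-97.51087576, 35.46500176], "type": "Point"}
boxed_place = {"bounding_box": {"coordinates": [[
    [144.886226909269, -37.7802081941697],
    [144.988666911647, -37.7802081941697],
    [144.988666911647, -37.6909396998182],
    [144.886226909269, -37.6909396998182]]], "type": "Polygon"}}

exact_location = approximate_location(exact_point, boxed_place)
box_location = approximate_location(None, boxed_place)
no_location = approximate_location(None, None)
print(exact_location, box_location, no_location)
```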

<span style="font-family:Papyrus; font-size:1.25em;">
    
</span>

# Data Analysis Codebase:

<span style="font-family:Papyrus; font-size:1.25em;">
    
The following sections present our current codebase that analyzes various combinations of attributes present in the raw JSON Twitter data file and preprocessed CSV Twitter data file.<br>

</span>

## Data Analysis Utility Functions:

<span style="font-family:Papyrus; font-size:1.25em;">

The following sample function calls illustrate how we use the utility functions in "slo_twitter_data_analysis_utility_functions.py" to perform individual attribute extraction/export and data chunking extraction/export.<br>

</span>

In [None]:
# Extract a single field (here, "user") from the raw JSON file and export it to a CSV file.
tweet_util.generalized_field_extraction_function(
    "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/",
    "user", "csv")

In [None]:
# Read in JSON raw data as chunks and export to CSV/JSON files.
tweet_util.generalized_json_data_chunking_file_export_function(
    "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/dataset-chunks/", "csv")

<span style="font-family:Papyrus; font-size:1.25em;">

Refer to the Python file itself if interested in that codebase.  Several graphing helper functions are also included.<br>

</span>

## Import libraries and set parameters:

<span style="font-family:Papyrus; font-size:1.25em;">

We import the required libraries as well as our custom utility functions for data analysis.

</span>

In [None]:
import logging as log
import warnings
import time
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Import custom utility functions.
import slo_twitter_data_analysis_utility_functions as tweet_util

<span style="font-family:Papyrus; font-size:1.25em;">

The Pandas settings below alter the maximum number of rows displayed and the number of decimal places shown for floating point values.  We also filter out several warning types to reduce potential output clutter.<br>

</span>

In [None]:
sns.set()
pd.options.display.max_rows = 100
pd.set_option('display.precision', 7)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

<span style="font-family:Papyrus; font-size:1.25em;">

Change log levels between "INFO" and "DEBUG" depending on whether you wish to see log output or not.<br>

</span>

In [None]:
log.basicConfig(level=log.DEBUG)

## Import preprocessed Twitter dataset CSV file:

<span style="font-family:Papyrus; font-size:1.25em;">

We read in the untokenized Twitter dataset as a CSV file and generate a Pandas dataframe from the dataset.<br>

</span>

In [None]:
# Import dataset and convert to dataframe.
tweet_preprocessed_csv_dataframe = tweet_util.import_dataset(
    "D:/Dropbox/summer-research-2019/datasets/dataset_20100101-20180510.csv", "csv")

<span style="font-family:Papyrus; font-size:1.25em;">
 
The above log.INFO output shows the shape, columns, and a sample from the Pandas Dataframe that contains the entirety of the CSV file.<br>
 
</span>

## Import raw JSON file and break into chunks:

<span style="font-family:Papyrus; font-size:1.25em;">

This data analysis function reads in the raw JSON Twitter dataset and operates on chunks of data, each of which is a part of the whole file.  We cannot read in the entire JSON file at once, as its 3.6 GB file size exceeds our memory (RAM) capacity.  The function calls the data analysis function specified in its parameters on each chunk individually to display statistics and graphical visualizations of the data.<br>

</span>

In [None]:
def call_data_analysis_function_on_json_file_chunks(input_file_path, function_name):
    """
    This function reads the raw JSON Tweet dataset in chunk-by-chunk and calls the user-defined data analysis
    function that is specified as a parameter.

    :param input_file_path: absolute file path of the input raw JSON file.
    :param function_name: the data analysis function to call on each chunk, or the string "none" to
        only log a summary of the first chunk.
    :return: None.
    """
    start_time = time.time()

    # Define size of chunks to read in.
    chunksize = 100000

    # Read in the JSON file.
    twitter_data = pd.read_json(f"{input_file_path}",
                                orient='records',
                                lines=True,
                                chunksize=chunksize)

    chunk_number = 0

    # Loop through chunk-by-chunk and call the data analysis function on each chunk.
    for data in twitter_data:
        # Each chunk is already a dataframe; renumber its rows from 0.
        # (DataFrame.append was removed in pandas 2.0, so we no longer append to an empty dataframe.)
        json_dataframe = data.reset_index(drop=True)

        chunk_number += 1

        if chunk_number == 1 and function_name == "none":
            # Print shape and column names.
            log.info(
                f"\nThe shape of the dataframe storing the contents of the raw JSON Tweet file chunk "
                f"{chunk_number} is:\n")
            log.info(json_dataframe.shape)
            log.info(
                f"\nThe columns of the dataframe storing the contents of the raw JSON Tweet file chunk "
                f"{chunk_number} are:\n")
            log.info(json_dataframe.columns)
            log.info(
                f"\nA sample from the dataframe storing the contents of the raw JSON Tweet file chunk "
                f"{chunk_number} is:\n")
            with pd.option_context('display.max_rows', None, 'display.max_columns',
                                   None, 'display.width', None, 'display.max_colwidth', 1000):
                log.info(f"\n{json_dataframe.sample(1, axis=0)}")
            time.sleep(2)

        if function_name != "none":
            # Call the data analysis function on this chunk.
            function_name(json_dataframe, chunk_number)
        else:
            # Without an analysis function, only the first chunk is summarized.
            return

        # For debug purposes.
        # break

    end_time = time.time()
    time_elapsed = (end_time - start_time) / 60.0
    time.sleep(3)
    log.info(f"The time taken to read in the JSON file by Chunks is {time_elapsed} minutes")
    log.info(f"The number of chunks is {chunk_number} based on chunk size of {chunksize}")
    log.info('\n')

<span style="font-family:Papyrus; font-size:1.25em;">

We currently use this function for the purpose of analyzing how many Tweets in each chunk are "retweeted" or "favorited".<br>

</span>
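
<span style="font-family:Papyrus; font-size:1.25em;">

The chunked reading pattern used above can be tried on a small in-memory example before running it against the full file.  The sketch below assumes a line-delimited JSON layout (one Tweet per line) and uses made-up records; it also shows one way to accumulate a total across chunks rather than only printing per-chunk counts.<br>

</span>

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw file: five Tweets, one JSON object per line.
raw_lines = "\n".join([
    '{"id": 1, "retweeted": false}',
    '{"id": 2, "retweeted": false}',
    '{"id": 3, "retweeted": true}',
    '{"id": 4, "retweeted": false}',
    '{"id": 5, "retweeted": false}',
])

chunk_sizes = []
not_retweeted_total = 0

# With lines=True and a chunksize, read_json yields one dataframe per chunk.
for chunk in pd.read_json(io.StringIO(raw_lines), orient="records", lines=True, chunksize=2):
    chunk_sizes.append(len(chunk))
    # Accumulate a running total instead of keeping every chunk in memory.
    not_retweeted_total += int(chunk["retweeted"].eq(False).sum())

print(chunk_sizes)
print(not_retweeted_total)
```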

## Display raw JSON file data chunk dataframe information:

<span style="font-family:Papyrus; font-size:1.25em;">

By specifying "none" as the function name, we simply print out logger INFO on the shape, columns, and a single sample of the dataframe on the first chunk of data.<br>

</span>

In [None]:
tweet_util.call_data_analysis_function_on_json_file_chunks(
    "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json", "none")

<span style="font-family:Papyrus; font-size:1.25em;">

As can be seen, each Tweet object in its raw JSON form contains many different attributes.  The "user" attribute in particular is itself an Object that contains many other attributes and Objects.<br>

</span>

## Tweet Attributes Analysis:

<span style="font-family:Papyrus; font-size:1.25em;">

Here we begin to analyze various aspects of our raw data.<br>

</span>

In [None]:
mylog = log.getLogger("matplotlib")
mylog.setLevel(log.INFO)

<span style="font-family:Papyrus; font-size:1.25em;">

Disable "DEBUG" level messages for matplotlib specifically.<br>

</span>

### Time Series Statistics:

<span style="font-family:Papyrus; font-size:1.25em;">

This function analyzes when each Tweet was created and counts the number of Tweets created on each day across the entire dataset.<br>

</span>

In [None]:
def tweet_count_by_timedate_time_series(created_at_attribute_file, file_type):
    """
    Visualize the Tweet creation time based on time-date information in the "created_at" attribute field of the
    input file.

    This function will work for any JSON file or CSV file that contains an attribute or column named "created_at".

    Note: Ensure input file is small enough to fit in RAM.  This function will not read in data by chunks!

    :param file_type: type of input file.
    :param created_at_attribute_file: the input file containing the "created_at" Tweet attribute.
    :return: None.
    """
    start_time = time.time()

    if file_type == "csv":
        twitter_data = pd.read_csv(f"{created_at_attribute_file}", sep=",")
    elif file_type == "json":
        twitter_data = pd.read_json(f"{created_at_attribute_file}",
                                    orient='records',
                                    lines=True)
    else:
        print(f"Invalid file type entered - aborting operation")
        return

    # Wrap the loaded data in a Pandas dataframe.
    json_dataframe = pd.DataFrame(twitter_data)

    plt.figure()
    plt.title(f"Tweet Creation Time-Date Count by Year/Month/Day")
    plt.xlabel("Year/Month/Day")
    plt.ylabel("Tweet Count")
    pd.to_datetime(json_dataframe['created_at']).value_counts().resample('1D').sum().plot()
    plt.show()
    end_time = time.time()
    time_elapsed = (end_time - start_time) / 60.0
    log.debug(f"The time taken to visualize the statistics is {time_elapsed} minutes")

<span style="font-family:Papyrus; font-size:1.25em;">

We call our data analysis function.  Notice that we have pre-extracted just the one attribute we need into a separate CSV file so that the time-date information for all 650k+ Tweets in our raw JSON file fits into memory (RAM).<br>

</span>

In [None]:
# Display Tweet count by time-date time series statistics.
tweet_count_by_timedate_time_series(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/created_at-attribute.csv",
    "csv")

<span style="font-family:Papyrus; font-size:1.25em;">

As we can see, more of the Tweets were created relatively recently in 2017 and 2018.  The further we go back in time, the fewer Tweets we have.<br>

</span>
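
<span style="font-family:Papyrus; font-size:1.25em;">

To make the daily counting step concrete, the sketch below applies the same "value_counts" and "resample" chain to a handful of made-up timestamps.  The timestamp strings are hypothetical and are written in ISO format for simplicity; the actual "created_at" values in our files may be formatted differently.<br>

</span>

```python
import pandas as pd

# Hypothetical "created_at" values: two Tweets on one day, none the next, one the day after.
created_at = pd.Series([
    "2018-05-01 09:15:00",
    "2018-05-01 17:40:00",
    "2018-05-03 08:05:00",
])

# Count Tweets per timestamp, then sum those counts into one-day bins.
# Days without Tweets appear with a count of 0.
daily_counts = (
    pd.to_datetime(created_at)
    .value_counts()
    .sort_index()
    .resample("1D")
    .sum()
)
print(daily_counts)
```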

### Re-Tweet Statistics for raw JSON dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

Here, we analyze the number of "True" or "False" values for the "retweeted" attribute for each chunk of the raw JSON file.<br>

</span>

In [None]:
def json_retweeted(json_dataframe, chunk):
    """
    Re-tweet statistics and visualizations for the raw JSON Twitter data chunks.

    :param json_dataframe: the dataframe containing the JSON data chunk.
    :param chunk: the JSON data chunk number.
    :return: None.
    """
    print(f"Re-Tweet Statistics for raw JSON Twitter data chunk {chunk}:")
    print(json_dataframe['retweeted'].value_counts())
    print()

<span style="font-family:Papyrus; font-size:1.25em;">
    
We call the data analysis function for each chunk of data until we have accounted for the entire raw JSON Twitter dataset.<br>
    
</span>

In [None]:
tweet_util.call_data_analysis_function_on_json_file_chunks(
    "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json", json_retweeted)

<span style="font-family:Papyrus; font-size:1.25em;">

Every Tweet in our dataset has a "retweeted" value of False.  This looks odd for a dataset of 650k+ Tweets until we recall from our tables above that the "retweeted" field "Indicates whether this Tweet has been Retweeted by the authenticating user."  The field therefore reflects the account used to collect the data rather than whether anyone has retweeted the Tweet, so we may need to look at other attributes, such as "retweet_count", to measure retweet activity.<br>

</span>
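
<span style="font-family:Papyrus; font-size:1.25em;">

As a possible alternative to the "retweeted" flag, the sketch below identifies retweets by the presence of an embedded "retweeted_status" object and uses "retweet_count" to find original Tweets that others have retweeted.  This is an illustration on made-up records, assuming our raw data retains both fields; the helper name "is_retweet" is hypothetical.<br>

</span>

```python
def is_retweet(tweet):
    """Treat a Tweet dict as a retweet if it embeds the original under "retweeted_status"."""
    return tweet.get("retweeted_status") is not None


# Hypothetical records; "retweeted" is False throughout, as observed in our dataset.
sample_tweets = [
    {"id": 1, "retweeted": False, "retweet_count": 0},
    {"id": 2, "retweeted": False, "retweet_count": 12},
    {"id": 3, "retweeted": False, "retweet_count": 12, "retweeted_status": {"id": 2}},
]

retweet_flags = [is_retweet(tweet) for tweet in sample_tweets]

# Original Tweets (not retweets themselves) that have been retweeted at least once.
originals_retweeted_by_others = sum(
    1 for tweet in sample_tweets
    if not is_retweet(tweet) and tweet.get("retweet_count", 0) > 0
)

print(retweet_flags)
print(originals_retweeted_by_others)
```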

### Favorited Statistics for raw JSON dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

Here, we analyze the number of "True" or "False" values for the "favorited" attribute for each chunk of the raw JSON file.<br>

</span>

In [None]:
def json_favorited(json_dataframe, chunk):
    """
    Favorited statistics and visualizations for the raw JSON Twitter data chunks.

    :param json_dataframe: the dataframe containing the JSON data chunk.
    :param chunk: the JSON data chunk number.
    :return: None.
    """
    print(f"Favorited Statistics for raw JSON Twitter data chunk {chunk}:")
    print(json_dataframe['favorited'].value_counts())
    print()

<span style="font-family:Papyrus; font-size:1.25em;">

We call the data analysis function for each chunk of data until we have accounted for the entire raw JSON Twitter dataset.<br>

</span>

In [None]:
tweet_util.call_data_analysis_function_on_json_file_chunks(
    "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json", json_favorited)

<span style="font-family:Papyrus; font-size:1.25em;">

Every Tweet in our dataset has a "favorited" value of False.  This looks odd for a dataset of 650k+ Tweets until we recall from our tables above that the "favorited" field "Indicates whether this Tweet has been liked by the authenticating user".  The field therefore reflects the account used to collect the data rather than whether anyone has liked the Tweet, so we may need to look at other attributes, such as "favorite_count", to measure how often Tweets were liked.<br>

</span>

### Re-Tweet Statistics for preprocessed CSV Twitter dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

Here, we analyze the number of "True" or "False" values for the "retweeted" attribute for the preprocessed CSV Tweet data file.<br>

</span>

In [None]:
def csv_retweeted(tweet_csv_dataframe):
    """
    Re-tweet statistics and visualizations for the CSV Twitter preprocessed dataset.

    Note: The raw JSON file does not have associated "company" information.

    :return: None.
    """

    print("Re-Tweet Statistics for entire CSV dataset:")
    print(tweet_csv_dataframe['retweeted'].value_counts())
    print()

    print("Re-Tweet Statistics for CSV dataset by Company:")
    print("Number of Tweets that are or aren't re-tweets by associated company: ")
    print(tweet_csv_dataframe.groupby(['company', 'retweeted']).size())
    print()

    # Graph the Re-Tweet Statistics.
    print("Proportion of Re-Tweets versus non Re-Tweets by associated company: ")
    plt.figure()
    grid = sns.FacetGrid(tweet_csv_dataframe[['retweeted', 'company']], col='company', col_wrap=6,
                         ylim=(0, 1))
    grid.map_dataframe(tweet_util.bar_plot, 'retweeted').set_titles('{col_name}')
    plt.show()

<span style="font-family:Papyrus; font-size:1.25em;">
    
Unlike with the raw JSON Twitter dataset, we do not need to read the CSV file in chunks and can analyze its contents as a whole.<br>
    
</span>

In [None]:
# Determine whether Tweets have been re-Tweeted.
csv_retweeted(tweet_preprocessed_csv_dataframe)

<span style="font-family:Papyrus; font-size:1.25em;">

The graphs show the proportion of Tweets that are or are not re-tweets by the company the Tweets are associated with.<br>

0.0 = NOT a re-tweet.<br>
1.0 = IS a re-tweet.<br>

</span>

### User Statistics for preprocessed CSV Twitter dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

This function provides text-based statistics on the users that account for the most Tweets about a given company.  It also constructs graphs that display how many of the Tweets for a given company are created by those users in comparison to each other.<br>

</span>

In [None]:
def most_tweets_by_users_per_company(tweet_csv_dataframe):
    """
    User related statistics and visualizations.

    Note: The raw JSON file does not have associated "company" information.

    :return: None.
    """

    # Adjusted parameters to allow statistics for all companies to show in output.
    pd.set_option("display.precision", 12)
    pd.options.display.max_rows = 100

    print("User Statistics for CSV dataset by Company: ")
    print("Top Tweet counts for unique user by associated company.")
    print(
        tweet_csv_dataframe[['company', 'user_screen_name']].groupby('company')
            .apply(lambda x: x['user_screen_name'].value_counts(normalize=True).head())
        # .value_counts(normalize=True)\
        # .sort_index(ascending=False).head())
    )
    print()

    # Graph the User Statistics.
    print("Proportion of most Tweets for unique users by associated company: ")
    plt.figure()
    grid = sns.FacetGrid(tweet_csv_dataframe[['user_screen_name', 'company']], col='company', col_wrap=6,
                         ylim=(0, 1),
                         xlim=(0, 10))
    grid.map_dataframe(tweet_util.bar_plot_zipf, 'user_screen_name').set_titles('{col_name}').set_xlabels(
        'appearance count')
    plt.show()

<span style="font-family:Papyrus; font-size:1.25em;">

We cannot currently perform this analysis on the raw JSON Tweet data file, as it has not been auto-encoded or hand-labeled to be associated with an SLO mining company.  Thus, we are restricted to analyzing the preprocessed CSV dataset, which does possess this information.  In the future, we may re-preprocess the raw JSON Tweet data file and export the results to a new JSON or CSV file that does contain an attribute or column with auto-encoded "company" information.<br>

</span>

In [None]:
# Determine the Tweet count for most prolific user by company.
most_tweets_by_users_per_company(tweet_preprocessed_csv_dataframe)

<span style="font-family:Papyrus; font-size:1.25em;">

The text output displays the top 5 unique users that account for the largest proportion of Tweets that are associated with a given company.  The graph output shows that there are a few users that account for the majority of Tweets about a given company.<br>

</span>
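
<span style="font-family:Papyrus; font-size:1.25em;">

The per-company proportions reported above can be reproduced on a small example to see how they are computed.  The sketch below uses made-up company and user names and a grouped "value_counts" call rather than the "apply" with a lambda in our function; it is an equivalent illustration, not our actual code.<br>

</span>

```python
import pandas as pd

# Hypothetical Tweets: company_a has one dominant user, company_b is evenly split.
tweets = pd.DataFrame({
    "company": ["company_a", "company_a", "company_a", "company_b", "company_b"],
    "user_screen_name": ["user_1", "user_1", "user_2", "user_3", "user_4"],
})

# Share of each company's Tweets written by each user, largest share first within a company.
user_share = tweets.groupby("company")["user_screen_name"].value_counts(normalize=True)

# Keep only the single most prolific user for each company.
top_user_per_company = user_share.groupby(level="company").head(1)

print(user_share)
print(top_user_per_company)
```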

### Character Count Statistics for preprocessed CSV Twitter dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

This function provides character counts for all the Tweets associated with a given company.  We then plot a relative frequency histogram of those counts across all the Tweets in the dataset.<br>

</span>

In [None]:
def tweet_character_counts(tweet_csv_dataframe):
    """
    Character related statistics and visualizations.

    Note: The raw JSON file does not have associated "company" information.

    :return: None.
    """

    def relhist_proc(col, **kwargs):
        """
        Helper function to visualize the data.

        :param col: the columns of the graph.
        :param kwargs: variable number of arguments.
        :return: None.
        """
        ax = plt.gca()
        data = kwargs.pop('data')
        proc = kwargs.pop('proc')
        processed = proc(data[col])
        # relative frequency histogram
        # https://stackoverflow.com/questions/9767241/setting-a-relative-frequency-in-a-matplotlib-histogram
        ax.hist(processed, weights=np.ones_like(processed) / processed.size, **kwargs)

    def char_len(tweets):
        """
        Determine the length of the Tweet text.

        :param tweets: the Tweet text.
        :return: the length of the Tweet.
        """
        return tweets.str.len()

    print("Character Statistics for CSV dataset by Company: ")
    print("Character count relative frequency histogram: ")
    plt.figure()
    grid = sns.FacetGrid(tweet_csv_dataframe[['text', 'company']], col='company', col_wrap=6, ylim=(0, 1))
    grid.map_dataframe(relhist_proc, 'text', bins=10, proc=char_len).set_titles('{col_name}')
    plt.show()

<span style="font-family:Papyrus; font-size:1.25em;">

We cannot currently perform this analysis on the raw JSON Tweet data file, as it has not been auto-encoded or hand-labeled to be associated with an SLO mining company.  Thus, we are restricted to analyzing the preprocessed CSV dataset, which does possess this information.  In the future, we may re-preprocess the raw JSON Tweet data file and export the results to a new JSON or CSV file that does contain an attribute or column with auto-encoded "company" information.<br>

</span>

In [None]:
# Determine the # of characters in Tweets via relative frequency histogram.
tweet_character_counts(tweet_preprocessed_csv_dataframe)

<span style="font-family:Papyrus; font-size:1.25em;">

The graphs appear to show that most Tweets for any given company are relatively long.  The matplotlib graphing functions, which we adapted from Shuntaro Yada's data analysis, would be easier to interpret with more explicit and detailed labeling.<br>

**FIXME - is this the correct interpretation?**<br>

</span>
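
<span style="font-family:Papyrus; font-size:1.25em;">

As a hedged cross-check of the interpretation above, the sketch below reproduces the relative-frequency weighting from `relhist_proc` on a small, made-up dataframe (the `toy` data and its values are assumptions for illustration, not part of our dataset).  It shows that weighting each Tweet by 1/N makes the bar heights sum to 1, and it adds a numeric per-company summary of character counts that can be compared against the histograms.<br>

</span>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the preprocessed CSV dataset.
toy = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B"],
    "text": ["short", "a somewhat longer tweet", "mid length", "x" * 140, "y" * 120],
})

# Character count of each Tweet, as computed by char_len().
lengths = toy["text"].str.len()
values = lengths.to_numpy()

# Same weighting idea as relhist_proc: each observation contributes 1/N,
# so the bar heights are proportions that sum to 1.
weights = np.ones_like(values, dtype=float) / values.size
rel_freq, bin_edges = np.histogram(values, bins=10, weights=weights)
print("Sum of relative frequencies:", rel_freq.sum())

# Numeric summary of character counts per company.
summary = toy.assign(char_len=lengths).groupby("company")["char_len"].describe()
print(summary[["count", "mean", "50%", "max"]])
```

If the median (`50%`) for a company sits close to its maximum, that supports reading the histogram as "most Tweets are relatively long" for that company.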

## Pandas.describe() Analysis for Individual Attributes:

<span style="font-family:Papyrus; font-size:1.25em;">

This function uses the built-in "describe" function for a Pandas dataframe to output statistics relevant to the type of data that is present for a single attribute.<br>

</span>
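
<span style="font-family:Papyrus; font-size:1.25em;">

To illustrate what "statistics relevant to the type of data" means, here is a minimal sketch on a made-up dataframe (the `toy` values are hypothetical; only the column names mirror our attribute files).  Numeric columns yield count, mean, standard deviation, and quartiles, whereas text columns yield count, unique, top, and freq.  Passing `include='all'` combines both sets of rows and fills the statistics that do not apply to a column with NaN.<br>

</span>

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "full_text": ["hello", "hello", "world", None],
    "retweet_count": [0, 3, 1, 10],
})

# Numeric column: count, mean, std, min, quartiles, max.
numeric_stats = toy["retweet_count"].describe()
print(numeric_stats)

# Text column: count (non-null values only), unique, top, freq.
text_stats = toy["full_text"].describe()
print(text_stats)

# include='all' reports both kinds of statistics in a single table.
all_stats = toy.describe(include="all")
print(all_stats)
```

Note that `count` excludes missing values, so comparing it with the number of rows is a quick way to see how many entries of an attribute are null.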

In [None]:
def attribute_describe(input_file_path, attribute_name):
    """
    Function utilizes Pandas "describe" function to print dataframe statistics.

    :param input_file_path: absolute file path of the dataset in CSV format.
    :param attribute_name:  name of the attribute we are analyzing (used only to label the output).
    :return: None.
    """
    dataframe = tweet_util.import_dataset(input_file_path, "csv")

    print(f"Pandas describe for {attribute_name}: ")
    # include='all' reports statistics for every column, numeric and non-numeric alike.
    print(dataframe.describe(include='all'))

In [None]:
attribute_describe("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/full_text-attribute.csv",
                   "full_text")

In [None]:
attribute_describe(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/retweet_count-attribute.csv",
    "retweet_count")

In [None]:
attribute_describe(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/favorite_count-attribute.csv",
    "favorite_count")

## Pandas.describe() Analysis for the Entire Dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

Here, we output statistics for the entire preprocessed CSV dataset.<br>

</span>
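
<span style="font-family:Papyrus; font-size:1.25em;">

Assigning to `pd.options.display` changes the display settings for the rest of the session.  As a possible alternative, the sketch below uses `pd.option_context` to widen the output only while the statistics are printed; the `wide` dataframe is a hypothetical stand-in for the full CSV dataset.<br>

</span>

```python
import pandas as pd

# Hypothetical wide frame standing in for the full CSV dataset.
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

default_max_columns = pd.get_option("display.max_columns")

# The options apply only inside the "with" block and are restored afterwards.
with pd.option_context("display.max_columns", None, "display.width", None):
    inside = pd.get_option("display.max_columns")
    print(wide.describe(include="all"))

restored = pd.get_option("display.max_columns")
print("max_columns restored to:", restored)
```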

In [None]:
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.width = None
pd.options.display.max_colwidth = 1000

attribute_describe(
    "D:/Dropbox/summer-research-2019/datasets/dataset_20100101-20180510.csv",
    "entire CSV dataset")

## Resources Used:

<span style="font-family:Papyrus; font-size:1.25em;">

**TODO: convert to annotated bibliography**

Dataset Files (obtained from Borg supercomputer):<br>

dataset_slo_20100101-20180510.json<br>
dataset_20100101-20180510.csv<br>

Note: These are large files that are not included in the project GitHub repository.<br>


- [SLO-analysis.ipynb](SLO-analysis.ipynb)<br>
    -original SLO Twitter data analysis file from Shuntaro Yada.<br>


- https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json<br>
    -explanation of all data fields in JSON file format for Tweets.<br>


- https://datatofish.com/export-dataframe-to-csv/<br>
- https://datatofish.com/export-pandas-dataframe-json/<br>
    -saving Pandas dataframe to CSV/JSON<br>
    

- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html<br>
    -Pandas to_datetime() function call.<br>
    

- https://www.machinelearningplus.com/plots/matplotlib-tutorial-complete-guide-python-plot-examples/<br>
    -plotting with matplotlib.<br>


</span>

## TODOs:

<span style="font-family:Papyrus; font-size:1.25em;">

Implement further elements from Shuntaro Yada's SLO Twitter Dataset Analysis.<br>

</span>