# SLO Twitter Data Analysis - Twitter API Intro

Set up the Jupyter Notebook kernel for SLO data analysis.

In [2]:
import logging as log
import warnings
import time
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

# Import custom utility functions.
import slo_twitter_data_analysis_utility_functions_v2 as tweet_util_v2

#############################################################
# Adjust parameters to display all contents.
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.width = None
pd.options.display.max_colwidth = 1000
# Seaborn setting.
sns.set()
# Set level of precision for float value output.
pd.set_option('display.precision', 12)
# Ignore these types of warnings - don't output to console.
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
# Matplotlib log settings.
mylog = log.getLogger("matplotlib")
mylog.setLevel(log.INFO)

"""
Turn debug log statements for various sections of code on/off.
(adjust log level as necessary)
"""
log.basicConfig(level=log.INFO)


# Import CSV dataset and convert to dataframe.
tweet_dataframe = tweet_util_v2.import_dataset(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-6-22-19-fixed.csv",
    "csv", False)

The sections below cover the structure of the raw Tweet data and explain the various attributes (fields) and their associated values.  Separate Jupyter Notebook files in our table of contents present our analysis of the raw Tweet dataset after it was processed into a CSV file containing only the attributes we are interested in.<br>



## Raw JSON Twitter Dataset Hierarchical File Structure:


We use a single sample from the raw Twitter JSON dataset file to provide the example values in the tables below.  Every Tweet in our raw dataset contains three JSON objects: the "tweet", the "user", and the "entities" object.  The "tweet" object encapsulates the others.  Some Tweets also contain an "extended_entities" object (when the Tweet includes native media such as photos or videos) and a "geo" object (when the Tweet is geo-tagged).  According to the Twitter API Documentation:

"Tweets are the basic atomic building block of all things Twitter. Tweets are also known as “status updates.” The Tweet object has a long list of ‘root-level’ attributes, including fundamental attributes such as id, created_at, and text. Tweet objects are also the ‘parent’ object to several child objects. Tweet child objects include user, entities, and extended_entities. Tweets that are geo-tagged will have a place child object." ("Tweet object - Twitter Developers")  Refer to the link below for further introductory information.<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
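
As a rough illustration of this hierarchy, the sketch below parses a heavily trimmed Tweet and reaches into its child objects.  The JSON literal is a hypothetical reduction of our sample: it keeps only a handful of fields, and the "user_mentions" entry is assumed for illustration rather than copied from the raw record.

```python
import json

# Hypothetical, heavily trimmed Tweet. Field values are borrowed from the
# sample used in this notebook; the real record has many more attributes.
raw_tweet = """
{
  "created_at": "Sat Feb 23 03:40:21 +0000 2013",
  "id_str": "305160140833816576",
  "user": {"id_str": "772466924", "screen_name": "DazzDicko"},
  "entities": {
    "hashtags": [],
    "urls": [],
    "user_mentions": [{"screen_name": "abcnews"}]
  }
}
"""

tweet = json.loads(raw_tweet)

# Root-level attributes live directly on the Tweet object.
tweet_id = tweet["id_str"]

# Child objects are nested dictionaries inside the Tweet object.
author = tweet["user"]["screen_name"]
mentions = [m["screen_name"] for m in tweet["entities"]["user_mentions"]]

# Optional child objects may be missing, so test for them or use .get().
has_media = "extended_entities" in tweet
place = tweet.get("place")

print(tweet_id, author, mentions, has_media, place)
```

Using `.get()` for optional children such as "place" avoids a `KeyError` on Tweets that are not geo-tagged.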



### Main Tweet Object:


This is the main Tweet object, which contains all other sub-objects.  Any attribute with an N/A (not applicable) example value was not present in the sample we are using.  Our sample also contains some attributes in the main Tweet object that are no longer listed for the Tweet object in the current Twitter API Documentation; these are covered in a separate table.<br>

Note: change quoted status from N/A to valid.



<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">created_at</td>
    <td class="tg-xldj">"Sat Feb 23 03:40:21 +0000 2013"</td>
    <td class="tg-xldj">UTC time when this Tweet was created.</td>
  </tr>
  <tr>
    <td class="tg-xldj">id</td>
    <td class="tg-xldj">305160140833816576</td>
    <td class="tg-xldj">The integer representation of the unique identifier for this Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">id_str</td>
    <td class="tg-xldj">"305160140833816576"</td>
    <td class="tg-xldj">The string representation of the unique identifier for this Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">text</td>
    <td class="tg-xldj">N/A (not present in dataset)</td>
    <td class="tg-xldj">The actual UTF-8 text of the status update.</td>
  </tr>
  <tr>
    <td class="tg-xldj">source</td>
    <td class="tg-xldj">"&lt;a href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"&gt;Twitter for iPhone&lt;\/a&gt;"</td>
    <td class="tg-xldj">Utility used to post the Tweet, as an HTML-formatted string.</td>
  </tr>
  <tr>
    <td class="tg-xldj">truncated</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">Indicates whether the value of the text parameter was truncated, for example, as a result of a retweet exceeding the original Tweet text length limit of 140 characters. <br><br>Truncated text will end in ellipsis, like this ...<br><br>Since Twitter now rejects long Tweets vs truncating them, the large majority of Tweets will have this set to false. <br><br>Note that while native retweets may have their toplevel text property shortened, the original text will be available under the retweeted_status object <br>and the truncated parameter will be set to the value of the original status (in most cases, false).</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_status_id</td>
    <td class="tg-xldj">305159434462691328</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID.</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_status_id_str</td>
    <td class="tg-xldj">"305159434462691328"</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s ID.</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_user_id</td>
    <td class="tg-xldj">2768501</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_user_id_str</td>
    <td class="tg-xldj">"2768501"</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the string representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">in_reply_to_screen_name</td>
    <td class="tg-xldj">"abcnews"</td>
    <td class="tg-xldj">Nullable. If the represented Tweet is a reply, this field will contain the screen name of the original Tweet’s author.</td>
  </tr>
  <tr>
    <td class="tg-xldj">user</td>
    <td class="tg-xldj">Object containing a multitude of attributes.</td>
    <td class="tg-xldj">The user who posted this Tweet. See User data dictionary for complete list of attributes.</td>
  </tr>
  <tr>
    <td class="tg-xldj">coordinates</td>
    <td class="tg-xldj">Object containing a multitude of attributes.</td>
    <td class="tg-xldj">Nullable. Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as <br><a href="http://www.geojson.org/">geoJSON </a>(longitude first, then latitude).</td>
  </tr>
  <tr>
    <td class="tg-xldj">place</td>
    <td class="tg-xldj">Object containing a multitude of attributes.</td>
    <td class="tg-xldj">Nullable. When present, indicates that the Tweet is associated with (but not necessarily originating from) a <br><a href="https://developer.twitter.com/overview/api/places">Place</a>.</td>
  </tr>
  <tr>
    <td class="tg-xldj">quoted_status_id</td>
    <td class="tg-xldj">N/A (empty for our sample but present)</td>
    <td class="tg-xldj">This field only surfaces when the Tweet is a quote Tweet. This field contains the integer value Tweet ID of the quoted Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">quoted_status_id_str</td>
    <td class="tg-xldj">N/A (empty for our sample but present)</td>
    <td class="tg-xldj">This field only surfaces when the Tweet is a quote Tweet. This is the string representation Tweet ID of the quoted Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">is_quote_status</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">Indicates whether this is a Quoted Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">quoted_status</td>
    <td class="tg-xldj">N/A (empty for our sample but present)</td>
    <td class="tg-xldj">This field only surfaces when the Tweet is a quote Tweet. This attribute contains the Tweet object of the original Tweet that was quoted.</td>
  </tr>
  <tr>
    <td class="tg-xldj">retweeted_status</td>
    <td class="tg-xldj">N/A (empty for our sample but present)</td>
    <td class="tg-xldj">Users can amplify the broadcast of Tweets authored by other users by <a href="https://developer.twitter.com/rest/reference/post/statuses/retweet/%3Aid">retweeting</a>. <br><br>Retweets can be distinguished from typical Tweets by the existence of a retweeted_status attribute. <br><br>This attribute contains a representation of the original Tweet that was retweeted. <br><br>Note that retweets of retweets do not show representations of the intermediary retweet, but only the original Tweet. <br>(Users can also <a href="https://developer.twitter.com/rest/reference/post/statuses/destroy/%3Aid">unretweet </a>a retweet they created by deleting their retweet.)</td>
  </tr>
  <tr>
    <td class="tg-xldj">quote_count</td>
    <td class="tg-xldj">N/A (not present in dataset)</td>
    <td class="tg-xldj">Nullable. Indicates approximately how many times this Tweet has been quoted by Twitter users.</td>
  </tr>
  <tr>
    <td class="tg-xldj">reply_count</td>
    <td class="tg-xldj">N/A (not present in dataset)</td>
    <td class="tg-xldj">Number of times this Tweet has been replied to.</td>
  </tr>
  <tr>
    <td class="tg-xldj">retweet_count</td>
    <td class="tg-xldj">0</td>
    <td class="tg-xldj">Number of times this Tweet has been retweeted.</td>
  </tr>
  <tr>
    <td class="tg-xldj">favorite_count</td>
    <td class="tg-xldj">0</td>
    <td class="tg-xldj">Nullable. Indicates approximately how many times this Tweet has been <br><a href="https://developer.twitter.com/rest/reference/post/favorites/create">liked </a>by Twitter users.</td>
  </tr>
  <tr>
    <td class="tg-0pky">entities</td>
    <td class="tg-0pky">Object containing a multitude of attributes.</td>
    <td class="tg-0pky">Entities which have been parsed out of the text of the Tweet. Additionally see <br><a href="https://developer.twitter.com/overview/api/entities-in-twitter-objects">Entities in Twitter Objects </a>.</td>
  </tr>
  <tr>
    <td class="tg-0pky">extended_entities</td>
    <td class="tg-0pky">Object containing a multitude of attributes.</td>
    <td class="tg-0pky">When between one and four native photos, one video, or one animated GIF are in the Tweet, contains an array of 'media' metadata. <br><br>This is also available in Quote Tweets. Additionally see <a href="https://developer.twitter.com/overview/api/entities-in-twitter-objects">Entities in Twitter Objects</a>.</td>
  </tr>
  <tr>
    <td class="tg-0pky">favorited</td>
    <td class="tg-0pky">false</td>
    <td class="tg-0pky">Nullable. Indicates whether this Tweet has been liked by the authenticating user.</td>
  </tr>
  <tr>
    <td class="tg-0pky">retweeted</td>
    <td class="tg-0pky">false</td>
    <td class="tg-0pky">Indicates whether this Tweet has been Retweeted by the authenticating user.</td>
  </tr>
  <tr>
    <td class="tg-0pky">possibly_sensitive</td>
    <td class="tg-0pky">N/A (empty for our sample but present)</td>
    <td class="tg-0pky">Nullable. This field only surfaces when a Tweet contains a link. <br><br>The meaning of the field doesn’t pertain to the Tweet content itself, <br>but instead it is an indicator that the URL contained in the Tweet may contain content or media identified as sensitive content.</td>
  </tr>
  <tr>
    <td class="tg-0pky">filter_level</td>
    <td class="tg-0pky">N/A (not present in dataset)</td>
    <td class="tg-0pky">Indicates the maximum value of the <a href="https://developer.twitter.com/streaming/overview/request-parameters#filter_level">filter_level </a>parameter which may be used and still stream this Tweet. <br><br>So a value of medium will be streamed on none, low, and medium streams.</td>
  </tr>
  <tr>
    <td class="tg-0pky">lang</td>
    <td class="tg-0pky">"en"</td>
    <td class="tg-0pky">Nullable. When present, indicates a <a href="http://tools.ietf.org/html/bcp47">BCP 47 </a>language identifier corresponding to the machine-detected language of the Tweet text, <br>or und if no language could be detected. See more documentation <a href="http://support.gnip.com/apis/powertrack2.0/rules.html#Operators">HERE</a>.</td>
  </tr>
  <tr>
    <td class="tg-0pky">matching_rules</td>
    <td class="tg-0pky">N/A (not present in dataset)</td>
    <td class="tg-0pky">Present in filtered products such as Twitter Search and PowerTrack. <br><br>Provides the id and tag associated with the rule that matched the Tweet. <br><br>With PowerTrack, more than one rule can match a Tweet. See more documentation <a href="http://support.gnip.com/enrichments/matching_rules.html">HERE</a>.</td>
  </tr>
</table>
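
The reply, quote, and retweet attributes in the table above can be combined to label a Tweet by type.  The helper below is a minimal sketch: the function name `classify_tweet` and the order of the checks (retweet first, then quote, then reply) are our own assumptions, not something prescribed by the Twitter API Documentation.

```python
def classify_tweet(tweet: dict) -> str:
    """Return 'retweet', 'quote', 'reply', or 'original' for a Tweet dictionary."""
    # A retweet carries the original Tweet under "retweeted_status".
    if tweet.get("retweeted_status") is not None:
        return "retweet"
    # Quote Tweets set "is_quote_status" and may embed "quoted_status".
    if tweet.get("is_quote_status") or tweet.get("quoted_status") is not None:
        return "quote"
    # Replies have a non-null "in_reply_to_status_id".
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "original"


# Subset of the sample Tweet shown in the table above.
sample = {
    "id": 305160140833816576,
    "in_reply_to_status_id": 305159434462691328,
    "in_reply_to_screen_name": "abcnews",
    "is_quote_status": False,
    "retweet_count": 0,
}

print(classify_tweet(sample))
```

Because these fields are nullable or may be absent entirely, the sketch reads them with `.get()` instead of direct indexing.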


The "text" attribute should contain the full raw text of the Tweet.  In our sample Tweet, however, "text" is absent and the text is stored in the "full_text" field instead.<br>
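
Since the Tweet text can live in different fields, one way to read it is to try each location in turn.  This is only a sketch: `get_tweet_text` is a hypothetical helper, and the lookup order ("full_text", then the streaming "extended_tweet" payload, then "text") is an assumption on our part.

```python
from typing import Optional


def get_tweet_text(tweet: dict) -> Optional[str]:
    """Return the Tweet text from whichever field holds it, or None."""
    # Extended-mode REST responses store the text under "full_text".
    if "full_text" in tweet:
        return tweet["full_text"]
    # Streaming payloads may nest it inside an "extended_tweet" object.
    extended = tweet.get("extended_tweet") or {}
    if "full_text" in extended:
        return extended["full_text"]
    # Fall back to the classic "text" attribute.
    return tweet.get("text")


sample = {
    "id_str": "305160140833816576",
    "full_text": "@abcnews About bloody time. Adani only wants FIFO "
                 "Indian workers for his Bowen basin mines.",
}

print(get_tweet_text(sample))
```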



### Main Tweet Object - Attributes present in our Sample and Dataset but not in Twitter API Docs for the Main Tweet Object:


These are the attributes we noticed that are present in our sample in the main Tweet object but are not listed as being part of the main Tweet object in the current Twitter API Documentation.<br>



<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268">Attribute</th>
    <th class="tg-s268">Value</th>
    <th class="tg-s268">Description</th>
  </tr>
  <tr>
    <td class="tg-s268">full_text</td>
    <td class="tg-s268">"@abcnews About bloody time. Adani only wants FIFO Indian workers for his Bowen basin mines."</td>
    <td class="tg-s268">Replaces "text" in the Extended Mode of REST API endpoints.</td>
  </tr>
  <tr>
    <td class="tg-s268">display_text_range</td>
    <td class="tg-s268">[0,91]</td>
    <td class="tg-s268">Part of the "extended_tweet" attribute for streaming APIs.</td>
  </tr>
  <tr>
    <td class="tg-s268">contributors</td>
    <td class="tg-s268">null</td>
    <td class="tg-s268">We could not find a description of this exact field in the documentation.</td>
  </tr>
</table>


Refer to the link below for more information on these fields.  We could not find any information on the "contributors" field, however.  It may have been removed from the Twitter API Documentation.<br>

https://developer.twitter.com/en/docs/tweets/tweet-updates.html
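
To show how "display_text_range" relates to "full_text", the sketch below slices the sample text with the sample range.  It assumes the two values are a start index and an exclusive end index counted in Unicode code points, which matches how Python indexes strings.

```python
# Values copied from the sample Tweet in the table above.
full_text = (
    "@abcnews About bloody time. Adani only wants FIFO "
    "Indian workers for his Bowen basin mines."
)
display_text_range = [0, 91]

# Treat the range as [start, end) and slice out the displayable text.
start, end = display_text_range
displayed = full_text[start:end]

# For this sample the range spans the whole string.
print(len(full_text), displayed == full_text)
```

For this Tweet the range covers the entire text, so nothing is hidden from display.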



### Main Tweet Object - Additional Attributes


These are additional attributes listed in the Twitter API Documentation for the main Tweet object.  They are not present in the sample we use from our raw Twitter dataset.<br>



<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">current_user_retweet</td>
    <td class="tg-xldj">N/A (not present in dataset)</td>
    <td class="tg-xldj">Perspectival. Only surfaces on methods supporting the include_my_retweet parameter, when set to true. Details the Tweet ID of the user’s own retweet (if existent) of this Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">scopes</td>
    <td class="tg-xldj">N/A (not present in dataset)</td>
    <td class="tg-xldj">A set of key-value pairs indicating the intended contextual delivery of the containing Tweet. Currently used by Twitter’s Promoted Products.</td>
  </tr>
  <tr>
    <td class="tg-xldj">withheld_copyright</td>
    <td class="tg-xldj">N/A (not present in dataset)</td>
    <td class="tg-xldj">When present and set to “true”, it indicates that this piece of content has been withheld due to a <br><a href="http://en.wikipedia.org/wiki/Digital_Millennium_Copyright_Act">DMCA complaint </a>.</td>
  </tr>
  <tr>
    <td class="tg-0pky">withheld_in_countries</td>
    <td class="tg-0pky">N/A (not present in dataset)</td>
    <td class="tg-0pky">When present, indicates a list of uppercase <a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">two-letter country codes </a>this content is withheld from.</td>
  </tr>
  <tr>
    <td class="tg-0pky">withheld_scope</td>
    <td class="tg-0pky">N/A (not present in dataset)</td>
    <td class="tg-0pky">When present, indicates whether the content being withheld is the “status” or a “user.”</td>
  </tr>
  <tr>
    <td class="tg-0pky">geo</td>
    <td class="tg-0pky">N/A (not present in dataset)</td>
    <td class="tg-0pky"><span style="font-weight:700">Deprecated.</span><br>Nullable. Use the coordinates field instead. This deprecated attribute has its coordinates formatted as [lat, long], while all other Tweet geo is formatted as [long, lat].</td>
  </tr>
</table>


The "geo" object is now deprecated.  However, our raw Twitter dataset does contain the "geo" field for some Tweets, so it was apparently still in use while CSIRO was collecting this data.<br>
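
Because "geo" stores [lat, long] while "coordinates" stores [long, lat], code that reads both has to normalize the order.  The helper below is a minimal sketch under that assumption; `get_lat_lon` is a hypothetical name, and the coordinate values are made up for illustration rather than taken from our dataset.

```python
from typing import Optional, Tuple


def get_lat_lon(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from "coordinates" or the deprecated "geo"."""
    # Preferred field: GeoJSON order, longitude first.
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]
        return (lat, lon)
    # Deprecated field: latitude first.
    geo = tweet.get("geo")
    if geo and geo.get("coordinates"):
        lat, lon = geo["coordinates"]
        return (lat, lon)
    return None


# Made-up example values, roughly in Queensland, for illustration only.
newer_tweet = {"coordinates": {"type": "Point", "coordinates": [153.03, -27.47]}}
older_tweet = {"geo": {"type": "Point", "coordinates": [-27.47, 153.03]}}

print(get_lat_lon(newer_tweet), get_lat_lon(older_tweet))
```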



### User Object within the Main Tweet Object:


This is the "user" object nested within the main Tweet object.  It is a large data structure containing a multitude of attributes and their corresponding values.  Extracting just the "user" object resulted in a CSV file over 1.0 GB in size.  Refer to the link below for more in-depth information concerning "user".<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object
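
One possible way to pull the nested "user" object out into CSV columns is sketched below using only the standard library.  This is not the extraction code used for our dataset; `flatten_user`, the "user_" column prefix, and the second record are hypothetical and exist only for illustration.

```python
import csv
import io


def flatten_user(tweet: dict, prefix: str = "user_") -> dict:
    """Copy the nested "user" attributes into a flat dict with prefixed keys."""
    user = tweet.get("user") or {}
    return {f"{prefix}{key}": value for key, value in user.items()}


tweets = [
    # Subset of the sample Tweet used in this notebook.
    {"id_str": "305160140833816576",
     "user": {"id_str": "772466924", "screen_name": "DazzDicko",
              "followers_count": 945}},
    # Hypothetical second record with a different set of user attributes.
    {"id_str": "0",
     "user": {"id_str": "1", "screen_name": "example_user", "location": None}},
]

rows = [flatten_user(t) for t in tweets]

# Users do not all carry the same attributes, so take the union of keys.
fieldnames = sorted({key for row in rows for key in row})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

Taking the union of keys keeps every column even when an attribute appears for only some users; missing values are written as empty cells.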



#### Non-Deprecated Fields within the User Object:


These are the non-deprecated attributes in use as of June 14, 2019.  An N/A value indicates that the attribute was absent or empty in the sample we extracted from our raw Tweet dataset.<br>



<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">id</td>
    <td class="tg-xldj">772466924</td>
    <td class="tg-xldj">The integer representation of the unique identifier for this User.</td>
  </tr>
  <tr>
    <td class="tg-xldj">id_str</td>
    <td class="tg-xldj">"772466924"</td>
    <td class="tg-xldj">The string representation of the unique identifier for this User.</td>
  </tr>
  <tr>
    <td class="tg-xldj">name</td>
    <td class="tg-xldj">"Daryl Dickson"</td>
    <td class="tg-xldj">The name of the user, as they’ve defined it. Not necessarily a person’s name. Typically capped at 50 characters, but subject to change.</td>
  </tr>
  <tr>
    <td class="tg-xldj">screen_name</td>
    <td class="tg-xldj">"DazzDicko"</td>
    <td class="tg-xldj">The screen name, handle, or alias that this user identifies themselves with. screen_names are unique but subject to change. Typically a maximum of 15 characters long, but some historical accounts may exist with longer names.</td>
  </tr>
  <tr>
    <td class="tg-xldj">location</td>
    <td class="tg-xldj">"Far North Queensland"</td>
    <td class="tg-xldj">Nullable. The user-defined location for this account’s profile. Not necessarily a location, nor machine-parseable.</td>
  </tr>
  <tr>
    <td class="tg-xldj">derived</td>
    <td class="tg-xldj">N/A (not present in our dataset)</td>
    <td class="tg-xldj">Enterprise APIs only. Collection of Enrichment metadata derived for the user. Provides the <br><a href="https://developer.twitter.com/en/docs/tweets/enrichments/overview/profile-geo">Profile Geo</a> Enrichment metadata.</td>
  </tr>
  <tr>
    <td class="tg-xldj">url</td>
    <td class="tg-xldj">null</td>
    <td class="tg-xldj">Nullable. A URL provided by the user in association with their profile.</td>
  </tr>
  <tr>
    <td class="tg-xldj">description</td>
    <td class="tg-xldj">"Train Driver extraordinaire, proud Union Leftie and Labor supporter. Cant stand the LNP and their regressive ideas. Mainly political but I do enjoy a laugh."</td>
    <td class="tg-xldj">Nullable. The user-defined UTF-8 string describing their account.</td>
  </tr>
  <tr>
    <td class="tg-xldj">protected</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">When true, indicates that this user has chosen to protect their Tweets.</td>
  </tr>
  <tr>
    <td class="tg-xldj">verified</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">When true, indicates that the user has a verified account. See <br><a href="https://support.twitter.com/articles/119135-faqs-about-verified-accounts">Verified Accounts </a>.</td>
  </tr>
  <tr>
    <td class="tg-xldj">followers_count</td>
    <td class="tg-xldj">945</td>
    <td class="tg-xldj">The number of followers this account currently has. Under certain conditions of duress, this field will temporarily indicate “0”.</td>
  </tr>
  <tr>
    <td class="tg-xldj">friends_count</td>
    <td class="tg-xldj">1385</td>
    <td class="tg-xldj">The number of users this account is following (AKA their “followings”). Under certain conditions of duress, this field will temporarily indicate “0”.</td>
  </tr>
  <tr>
    <td class="tg-xldj">listed_count</td>
    <td class="tg-xldj">3</td>
    <td class="tg-xldj">The number of public lists that this user is a member of.</td>
  </tr>
  <tr>
    <td class="tg-xldj">favourites_count</td>
    <td class="tg-xldj">533</td>
    <td class="tg-xldj">The number of Tweets this user has liked in the account’s lifetime.</td>
  </tr>
  <tr>
    <td class="tg-xldj">statuses_count</td>
    <td class="tg-xldj">5176</td>
    <td class="tg-xldj">The number of Tweets (including retweets) issued by the user.</td>
  </tr>
  <tr>
    <td class="tg-xldj">created_at</td>
    <td class="tg-xldj">"Tue Aug 21 23:23:52 +0000 2012"</td>
    <td class="tg-xldj">The UTC datetime that the user account was created on Twitter.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_banner_url</td>
    <td class="tg-xldj">N/A (empty for our sample but present)</td>
    <td class="tg-xldj">The HTTPS-based URL pointing to the standard web representation of the user’s uploaded profile banner.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_image_url_https</td>
    <td class="tg-xldj">"https://pbs.twimg.com/profile_images/698290934618787840/SIpBKnWE_normal.jpg"</td>
    <td class="tg-xldj">A HTTPS-based URL pointing to the user’s profile image.</td>
  </tr>
  <tr>
    <td class="tg-xldj">default_profile</td>
    <td class="tg-xldj">true</td>
    <td class="tg-xldj">When true, indicates that the user has not altered the theme or background of their user profile.</td>
  </tr>
  <tr>
    <td class="tg-xldj">default_profile_image</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">When true, indicates that the user has not uploaded their own profile image and a default image is used instead.</td>
  </tr>
  <tr>
    <td class="tg-xldj">withheld_in_countries</td>
    <td class="tg-xldj">N/A (empty for our sample but present)</td>
    <td class="tg-xldj">When present, indicates a list of uppercase <br><a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">two-letter country codes </a>this content is withheld from.</td>
  </tr>
  <tr>
    <td class="tg-xldj">withheld_scope</td>
    <td class="tg-xldj">N/A (not present in our dataset)</td>
    <td class="tg-xldj">When present, indicates that the content being withheld is a “user.”</td>
  </tr>
</table>
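
Both "created_at" values (the user's in the table above and the Tweet's in the main Tweet object) use the same timestamp layout, so they can be parsed with one format string.  The sketch below assumes an English/C locale for the day and month abbreviations and computes how old the account was when the sample Tweet was posted.

```python
from datetime import datetime

# Layout of Twitter "created_at" strings, e.g. "Sat Feb 23 03:40:21 +0000 2013".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# Values copied from the sample Tweet and its user object.
tweet_created = datetime.strptime(
    "Sat Feb 23 03:40:21 +0000 2013", TWITTER_TIME_FORMAT)
account_created = datetime.strptime(
    "Tue Aug 21 23:23:52 +0000 2012", TWITTER_TIME_FORMAT)

# Both datetimes are timezone-aware (UTC), so they can be subtracted directly.
account_age = tweet_created - account_created

print(tweet_created.isoformat(), account_age.days)
```

Parsing with `%z` keeps the UTC offset attached to the result, which avoids mixing naive and aware datetimes later on.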

#### Deprecated Fields within the User Object:

 
These are the deprecated attributes that are no longer in use.  An N/A value indicates that the attribute was absent or empty in the sample we extracted from our raw Tweet dataset.<br>
 
<span>

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">utc_offset</td>
    <td class="tg-xldj">36000</td>
    <td class="tg-xldj">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/manage-account-settings/api-reference/get-account-settings">GET account/settings</a>.</td>
  </tr>
  <tr>
    <td class="tg-xldj">time_zone</td>
    <td class="tg-xldj">"Australia/Brisbane"</td>
    <td class="tg-xldj">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/manage-account-settings/api-reference/get-account-settings">GET account/settings</a> as tzinfo_name.</td>
  </tr>
  <tr>
    <td class="tg-xldj">lang</td>
    <td class="tg-xldj">"en"</td>
    <td class="tg-xldj">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/manage-account-settings/api-reference/get-account-settings">GET account/settings</a> as language.</td>
  </tr>
  <tr>
    <td class="tg-xldj">geo_enabled</td>
    <td class="tg-xldj">true</td>
    <td class="tg-xldj">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/manage-account-settings/api-reference/get-account-settings">GET account/settings</a>. This field must be true for the current user to attach geographic data when using <br><a href="https://developer.twitter.com/en/docs/tweets/post-and-engage/guides/post-tweet-geo-guide">POST statuses/update</a>.</td>
  </tr>
  <tr>
    <td class="tg-xldj">following</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friendships-lookup">GET friendships/lookup</a>.</td>
  </tr>
  <tr>
    <td class="tg-xldj">follow_request_sent</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj">Value will be set to null. Still available via <br><a href="https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friendships-lookup">GET friendships/lookup</a>.</td>
  </tr>
  <tr>
    <td class="tg-xldj">has_extended_profile</td>
    <td class="tg-xldj">N/A (empty for our sample but present)</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">notifications</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_location</td>
    <td class="tg-xldj">N/A (not present in our dataset)</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">contributors_enabled</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_image_url</td>
    <td class="tg-xldj">"http://pbs.twimg.com/profile_images/698290934618787840/SIpBKnWE_normal.jpg"</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null. NOTE: Profile images are only available using the profile_image_url_https field.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_background_color</td>
    <td class="tg-xldj">N/A (empty for our sample but present)</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_background_image_url</td>
    <td class="tg-xldj">"http://abs.twimg.com/images/themes/theme1/bg.png"</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_background_image_url_https</td>
    <td class="tg-xldj">"https://abs.twimg.com/images/themes/theme1/bg.png"</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_background_tile</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_link_color</td>
    <td class="tg-xldj">"1DA1F2"</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_sidebar_border_color</td>
    <td class="tg-xldj">"C0DEED"</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated.</span> Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_sidebar_fill_color</td>
    <td class="tg-xldj">"DDEEF6"</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated</span><br>. Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_text_color</td>
    <td class="tg-xldj">"333333"</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated</span><br>. Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">profile_use_background_image</td>
    <td class="tg-xldj">true</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated</span><br>. Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">is_translator</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated</span><br>. Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-xldj">is_translation_enabled</td>
    <td class="tg-xldj">false</td>
    <td class="tg-xldj"><span style="font-weight:700">Deprecated</span><br>. Value will be set to null.</td>
  </tr>
  <tr>
    <td class="tg-0pky">translator_type</td>
    <td class="tg-0pky">"none"</td>
    <td class="tg-0pky"><span style="font-weight:700">Deprecated</span><br>. Value will be set to null.</td>
  </tr>
</table>


Our dataset was gathered over the course of 10 years, so it stands to reason that Twitter has deprecated some of the fields that were in use, added new fields, and changed others.  The Twitter API Documentation does not describe the former purpose of these deprecated fields.<br>



### Entities Object within the Main Tweet Object:


This is the "entities" object for a Tweet within our dataset.  All the Lists are empty except for "user_mentions", which is a List containing a Dictionary of key-value pairs for various attributes.  It should be noted that each of these attributes is actually an Object itself with multiple key (attribute)-value pairs within.  For a more in-depth listing of the attributes and format, please refer to the link below.<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/entities-object



<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">urls</td>
    <td class="tg-xldj">[]</td>
    <td class="tg-xldj">Represents URLs included in the text of a Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">hashtags</td>
    <td class="tg-xldj">[]</td>
    <td class="tg-xldj">Represents hashtags which have been parsed out of the Tweet text.</td>
  </tr>
  <tr>
    <td class="tg-xldj">user_mentions</td>
    <td class="tg-xldj">[{"indices":[0,8],"screen_name":"abcnews","id_str":"2768501", "name":"ABC News","id":2768501}]</td>
    <td class="tg-xldj">Represents other Twitter users mentioned in the text of the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-xldj">symbols</td>
    <td class="tg-xldj">[]</td>
    <td class="tg-xldj">Represents symbols, i.e. $cashtags, included in the text of the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-0pky">media</td>
    <td class="tg-0pky">[]</td>
    <td class="tg-0pky">Represents media elements uploaded with the Tweet.</td>
  </tr>
  <tr>
    <td class="tg-0pky">polls</td>
    <td class="tg-0pky">[]</td>
    <td class="tg-0pky">Represents Twitter Polls included in the Tweet.</td>
  </tr>
</table>


For the sample we have chosen, only the "urls", "hashtags", "user_mentions", and "symbols" fields were present, and all of them except "user_mentions" are empty.  The other fields in the table ("media" and "polls") were not present in the "entities" object for this particular Tweet.<br>
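
As a minimal sketch of how such an "entities" object might be read in Python, the snippet below rebuilds the sample from the table above and pulls out the mentioned screen names.  The helper name `mentioned_screen_names` is hypothetical and is not part of our utility module; it assumes that a field which is absent (such as "media" or "polls" here) can be treated the same as an empty List.

```python
import json

# Sample "entities" object, using the values shown in the table above.
sample_entities_json = """
{"urls": [], "hashtags": [], "symbols": [],
 "user_mentions": [{"indices": [0, 8], "screen_name": "abcnews",
                    "id_str": "2768501", "name": "ABC News", "id": 2768501}]}
"""


def mentioned_screen_names(entities):
    """Return the screen names of all users mentioned in a Tweet (hypothetical helper)."""
    # Use .get() because a field may be absent rather than empty.
    return [mention["screen_name"] for mention in entities.get("user_mentions", [])]


entities = json.loads(sample_entities_json)
print(mentioned_screen_names(entities))
# Absent fields fall back to an empty list instead of raising a KeyError.
print(entities.get("media", []))
```

Using `.get()` with a default keeps the same code path working for Tweets whose "entities" object omits some fields entirely.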



### Extended Entities Object within the Main Tweet Object:

    
This element is present in any Tweet that contains "native media" such as photos, videos, images, etc.  It is an object type that contains all the metadata for each of the native media elements present in the Tweet.<br>

Refer to the link below for all the particulars on the "extended_entities" object.<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/extended-entities-object



    
{"extended_entities":{"media":[{"display_url":"pic.twitter.com\/NcnlVdBAxt","indices":[110,132],"sizes":{"small":{"w":406,"h":680,"resize":"fit"},"large":{"w":448,"h":750,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":448,"h":750,"resize":"fit"}},"id_str":"394955471976538112","expanded_url":"https:\/\/twitter.com\/fightforthereef\/status\/394955472064622593\/photo\/1","media_url_https":"https:\/\/pbs.twimg.com\/media\/BXsp6MFCIAA4Zet.png","id":394955471976538112,"type":"photo","media_url":"http:\/\/pbs.twimg.com\/media\/BXsp6MFCIAA4Zet.png","url":"http:\/\/t.co\/NcnlVdBAxt"}]}}



    
The above is a sample of an "extended_entities" object from a Tweet in our dataset.  This object was found as the value for the "retweeted_status" key.  We chose not to build a table for all the attributes in this object, as the Twitter API Documentation does not itself have a table listing each attribute, example values, and a description explaining each.<br>
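
In place of a table, the following is a rough sketch of how the media metadata could be pulled out of an "extended_entities" object in Python.  The JSON string is an abbreviated copy of the sample above that keeps only a few of its fields, and `summarize_media` is a hypothetical helper name, not a function from our utility module.

```python
import json

# Abbreviated copy of the sample above (only a few of the media fields are kept).
sample_json = """
{"extended_entities": {"media": [{
    "id_str": "394955471976538112",
    "type": "photo",
    "media_url_https": "https://pbs.twimg.com/media/BXsp6MFCIAA4Zet.png",
    "sizes": {"large": {"w": 448, "h": 750, "resize": "fit"}}
}]}}
"""


def summarize_media(tweet):
    """List (type, https URL, large width, large height) for each native media element."""
    # Tweets without native media have no "extended_entities" key at all.
    media_list = tweet.get("extended_entities", {}).get("media", [])
    summary = []
    for media in media_list:
        large = media.get("sizes", {}).get("large", {})
        summary.append((media.get("type"), media.get("media_url_https"),
                        large.get("w"), large.get("h")))
    return summary


tweet = json.loads(sample_json)
for row in summarize_media(tweet):
    print(row)
```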
    


### Geo Object within the Main Tweet Object:


The "geo" sub-object within the main Tweet object is comprised of the "coordinates" and "place" objects.  According to the Twitter API documentation,<br>

"The place object is always present when a Tweet is geo-tagged, while the coordinates object is only present (non-null) when the Tweet is assigned an exact location. If an exact location is provided, the coordinates object will provide a [long, lat] array with the geographical coordinates, and a Twitter Place that corresponds to that location will be assigned." ("Geo objects - Twitter Developers")<br>

Therefore, not every Tweet will necessarily possess either or both of these objects.<br>



<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268">Attribute</th>
    <th class="tg-s268">Value</th>
    <th class="tg-s268">Description</th>
  </tr>
  <tr>
    <td class="tg-s268">place</td>
    <td class="tg-s268">{}</td>
    <td class="tg-s268">Places are specific, named locations with corresponding geo coordinates.</td>
  </tr>
  <tr>
    <td class="tg-s268">coordinates</td>
    <td class="tg-s268">{}</td>
    <td class="tg-s268">An array of longitude and latitude coordinates.&nbsp;&nbsp;May also include a type attribute.</td>
  </tr>
</table>


Refer to the link below for a specific explanation of how "place" and "coordinates" are utilized together for geo-tagged Tweet objects.<br>

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/geo-objects



#### Coordinates Object within the Geo Object:


The "coordinates" object for geo-tagged Tweets contains the two attributes as described below.<br>



<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-s268{text-align:left}
</style>
<table class="tg">
  <tr>
    <th class="tg-s268">Attribute</th>
    <th class="tg-s268">Value</th>
    <th class="tg-s268">Description</th>
  </tr>
  <tr>
    <td class="tg-s268">coordinates</td>
    <td class="tg-s268"><span style="font-style:italic">[-97.51087576,35.46500176]</span></td>
    <td class="tg-s268">The longitude and latitude of the Tweet’s location, as a collection in the form <br><span style="font-weight:700">[longitude, latitude]</span>.</td>
  </tr>
  <tr>
    <td class="tg-s268">type</td>
    <td class="tg-s268">"Point"</td>
    <td class="tg-s268">The type of data encoded in the coordinates property. This will be “Point” for Tweet coordinates fields.</td>
  </tr>
</table>



The values are examples copied from the Twitter API Documentation on "geo" objects.<br>
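
Because the array is ordered [longitude, latitude], it is easy to swap the two values by accident.  The sketch below shows one way to unpack a "coordinates" object into the more familiar (latitude, longitude) order; `to_lat_long` is a hypothetical helper name, and returning `None` for a missing object is our own assumption about how to handle Tweets without an exact location.

```python
def to_lat_long(coordinates_object):
    """Return (latitude, longitude) from a "coordinates" object, or None if absent."""
    # The object is null/empty when the Tweet has no exact location.
    if not coordinates_object or coordinates_object.get("type") != "Point":
        return None
    # Twitter stores the pair as [longitude, latitude].
    longitude, latitude = coordinates_object["coordinates"]
    return latitude, longitude


sample = {"coordinates": [-97.51087576, 35.46500176], "type": "Point"}
print(to_lat_long(sample))
print(to_lat_long(None))
```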



#### Place Object within the Geo Object:


The "place" object for geo-tagged Tweets contains the following attributes as described below.<br>



<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-xldj{border-color:inherit;text-align:left}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-xldj">Attribute</th>
    <th class="tg-xldj">Value</th>
    <th class="tg-xldj">Description</th>
  </tr>
  <tr>
    <td class="tg-xldj">id</td>
    <td class="tg-xldj">"01a9a39529b27f36"</td>
    <td class="tg-xldj">ID representing this place. Note that this is represented as a string, not an integer.</td>
  </tr>
  <tr>
    <td class="tg-xldj">url</td>
    <td class="tg-xldj">"https://api.twitter.com/1.1/geo/id/01a9a39529b27f36.json"</td>
    <td class="tg-xldj">URL representing the location of additional place metadata for this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">place_type</td>
    <td class="tg-0pky">"city"</td>
    <td class="tg-0pky">The type of location represented by this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">name</td>
    <td class="tg-0pky">"Manhattan"</td>
    <td class="tg-0pky">Short human-readable representation of the place’s name.</td>
  </tr>
  <tr>
    <td class="tg-0pky">full_name</td>
    <td class="tg-0pky">"Manhattan, NY"</td>
    <td class="tg-0pky">Full human-readable representation of the place’s name.</td>
  </tr>
  <tr>
    <td class="tg-0pky">country_code</td>
    <td class="tg-0pky">"US"</td>
    <td class="tg-0pky">Shortened country code representing the country containing this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">country</td>
    <td class="tg-0pky">"United States"</td>
    <td class="tg-0pky">Name of the country containing this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">bounding_box</td>
    <td class="tg-0pky">"bounding_box":{"coordinates":[[[144.886226909269,-37.7802081941697],[144.988666911647,-37.7802081941697],[144.988666911647,-37.6909396998182],[144.886226909269,-37.6909396998182]]],"type":"Polygon"}</td>
    <td class="tg-0pky">A bounding box of coordinates which encloses this place.</td>
  </tr>
  <tr>
    <td class="tg-0pky">attributes</td>
    <td class="tg-0pky">{}</td>
    <td class="tg-0pky">Dictionary of Tweet attributes.</td>
  </tr>
</table>


The sample values are from the Twitter API Documentation except for the "bounding_box" attribute, which is from a Tweet in our dataset.<br>
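
When only a "place" is available, the "bounding_box" can still give an approximate location.  The sketch below computes the midpoint of the bounding box from our dataset sample; `bounding_box_center` is a hypothetical helper name, and it assumes the box is the first ring of the polygon with points in [longitude, latitude] order.

```python
def bounding_box_center(bounding_box):
    """Return the (latitude, longitude) midpoint of a "bounding_box" polygon."""
    # The polygon is a list of rings; the box is assumed to be the first ring.
    ring = bounding_box["coordinates"][0]
    longitudes = [point[0] for point in ring]
    latitudes = [point[1] for point in ring]
    center_latitude = (min(latitudes) + max(latitudes)) / 2.0
    center_longitude = (min(longitudes) + max(longitudes)) / 2.0
    return center_latitude, center_longitude


# "bounding_box" value from the table above (a Tweet in our dataset).
sample_box = {
    "coordinates": [[[144.886226909269, -37.7802081941697],
                     [144.988666911647, -37.7802081941697],
                     [144.988666911647, -37.6909396998182],
                     [144.886226909269, -37.6909396998182]]],
    "type": "Polygon",
}
print(bounding_box_center(sample_box))
```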



# SLO Twitter Data Analysis  - Pandas.describe()


The section below provides a simple overview analysis of each attribute in our Twitter CSV dataset file using built-in Pandas function calls.  We output statistics for each attribute/column in the entire CSV dataset.<br>



In [4]:
def attribute_describe(input_file_path, attribute_name_list, file_type):
    """
    Function utilizes Pandas "describe" function to return dataframe statistics.

    https://chrisalbon.com/python/data_wrangling/pandas_dataframe_descriptive_stats/

    Note: This function will not work for attributes whose values are "objects" themselves.
    (can only be numeric type or string)

    :param input_file_path: absolute file path of the dataset in CSV or JSON format.
    :param attribute_name_list:  list of names of the attributes we are analyzing.
    :param file_type: type of input file. (JSON or CSV)
    :return: None.
    """
    start_time = time.time()

    if file_type == "csv":
        # dtype=object keeps every column as raw strings/objects.
        twitter_data = pd.read_csv(input_file_path, sep=",", encoding="ISO-8859-1", dtype=object)
    elif file_type == "json":
        twitter_data = pd.read_json(input_file_path, orient='records', lines=True)
    else:
        print("Invalid file type entered - aborting operation")
        return

    # Wrap the loaded data in a Pandas dataframe.
    dataframe = pd.DataFrame(twitter_data)

    if len(attribute_name_list) > 0:
        # Describe only the requested attributes, one at a time.
        for attribute_name in attribute_name_list:
            print(f"\nPandas describe() for \"{attribute_name}\":\n")
            print(dataframe[attribute_name].describe(include='all'))
    else:
        # An empty list means "describe every attribute in the dataset".
        print("\nPandas describe() for the entire dataframe/dataset:\n")
        print(dataframe.describe(include='all'))

    end_time = time.time()
    time_elapsed_seconds = end_time - start_time
    time_elapsed_minutes = (end_time - start_time) / 60.0
    time_elapsed_hours = (end_time - start_time) / 60.0 / 60.0
    log.debug(f"The time taken to visualize the statistics is {time_elapsed_seconds} seconds, "
              f"{time_elapsed_minutes} minutes, {time_elapsed_hours} hours")


We call the function with an empty attribute list in order to describe every attribute in the dataset.<br>



In [6]:
# Describe every attribute in the dataset (empty attribute list).
attribute_describe(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/twitter-dataset-6-22-19-fixed.csv",
    [], "csv")


Pandas describe() for the entire dataframe/dataset:

       retweeted_derived company_derived  \
count             670426          669369   
unique                 2              92   
top                 TRUE           adani   
freq              446177          431022   

                                                                                                                                                           text_derived  \
count                                                                                                                                                            670426   
unique                                                                                                                                                           337228   
top     RT @AdamBandt: RT if you want Labor &amp; Bill Shorten to stop Adani coal mega-mine by announcing they'll halt the project if they win next election #StopAdani   
freq                                     


The statistics displayed depend on the type of data present as values for each attribute.  For numerical data, we get count, mean, std, min, percentiles, and max.  For categorical data, we get count, unique, top, and freq.<br>
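
The difference between the two kinds of output can be seen on a tiny, made-up dataframe.  The column names and values below are illustrative only and are not taken from our dataset.  The last step is a sketch of why the output above shows only count, unique, top, and freq: our function reads the CSV with `dtype=object`, so we would expect every attribute, including numeric ones, to be summarized as categorical data.

```python
import pandas as pd

# Tiny made-up frame; the column names and values are illustrative only.
frame = pd.DataFrame({
    "retweet_count": [0, 3, 12],
    "company": ["adani", "adani", "other"],
})

# Numeric column: count, mean, std, min, percentiles, and max.
numeric_stats = frame["retweet_count"].describe()
print(list(numeric_stats.index))

# String column: count, unique, top, and freq.
categorical_stats = frame["company"].describe()
print(list(categorical_stats.index))

# Numbers stored as strings/objects (as when a CSV is read with dtype=object)
# are summarized with the categorical statistics instead.
as_object_stats = frame["retweet_count"].astype(str).astype(object).describe()
print(list(as_object_stats.index))
```

If numeric summaries are wanted for a particular attribute, one option is to convert that column with `pd.to_numeric` before calling `describe()`.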

