# SLO Twitter Data Analysis - Codebase Introduction

<span style="font-family:Papyrus; font-size:1.25em;">
    
The following sections present our current codebase, which analyzes various combinations of attributes in the raw JSON and CSV Twitter data files.<br>

</span>

## Dataset Creation:

<span style="font-family:Papyrus; font-size:1.25em;">

We create our dataset using the function call below and "dataset_processor_adapted.py".

</span>

In [None]:
    # Absolute file path.
    create_dataset("D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json",
                   "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/twitter-dataset-6-22-19.csv",
                   False)

## Data Analysis Utility Functions:

<span style="font-family:Papyrus; font-size:1.25em;">

The following sample function calls illustrate how we use the utility functions in "slo_twitter_data_analysis_utility_functions_v2.py" to extract and export individual attributes and to read and export the data in chunks.<br>

</span>

In [None]:
    # Extract multiple fields from raw JSON file and export to CSV file.
    tweet_util_v2.generalized_multi_field_extraction_function(
        "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json",
        "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/",
        ["retweet_count", "retweeted", "favorite_count", "favorited"], "csv")

In [None]:
    # Extract various individual fields from raw JSON file and export to CSV/JSON file.
    tweet_util_v2.generalized_field_extraction_function(
        "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json",
        "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/",
        "id", "csv")

In [None]:
    # Read in JSON/CSV data as chunks and export to CSV/JSON files.
    tweet_util_v2.generalized_data_chunking_file_export_function(
        "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json",
        "D:/Dropbox/summer-research-2019/jupyter-notebooks/dataset-chunks/", "csv")

<span style="font-family:Papyrus; font-size:1.25em;">

Refer to the Python file itself for the full codebase, which also includes several graphing helper functions.<br>

</span>
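
<span style="font-family:Papyrus; font-size:1.25em;">

To give a rough idea of what the field extraction calls in this section do, the following is a minimal standard-library sketch and not the code in our utility file. The helper name "extract_fields" is hypothetical, the two records are made up, and we assume one JSON object per line.<br>

</span>

```python
import csv
import io
import json


def extract_fields(json_lines, fields, csv_out):
    """Write the chosen top-level fields of each JSON record to CSV.

    Hypothetical helper: assumes one JSON object per line of input.
    """
    writer = csv.DictWriter(csv_out, fieldnames=fields)
    writer.writeheader()
    rows_written = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        tweet = json.loads(line)
        # Missing fields become empty cells instead of raising KeyError.
        writer.writerow({field: tweet.get(field, "") for field in fields})
        rows_written += 1
    return rows_written


# Two made-up records standing in for the raw Tweet file.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 3, "favorite_count": 5, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 0, "lang": "en"}\n'
)
output = io.StringIO()
count = extract_fields(sample, ["id", "retweet_count", "favorite_count"], output)
print(output.getvalue())
```

<span style="font-family:Papyrus; font-size:1.25em;">

Because the sketch handles one record at a time, memory use stays flat regardless of file size.<br>

</span>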

## Import libraries and set parameters:

<span style="font-family:Papyrus; font-size:1.25em;">

We import the required libraries as well as our custom utility functions for data analysis.

</span>

In [2]:
import logging as log
import warnings
import time
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

# Import custom utility functions.
import slo_twitter_data_analysis_utility_functions_v2 as tweet_util_v2

<span style="font-family:Papyrus; font-size:1.25em;">

The Pandas settings alter the maximum number of rows and columns displayed and the number of decimal places shown for floating-point values.  We also filter out several warning types to reduce output clutter.<br>

</span>

In [3]:
# Adjust parameters to display all contents.
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.width = None
pd.options.display.max_colwidth = 1000
# Seaborn setting.
sns.set()
# Set level of precision for float value output.
pd.set_option('display.precision', 12)
# Ignore these types of warnings - don't output to console.
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
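
<span style="font-family:Papyrus; font-size:1.25em;">

As a small illustration of what the precision setting changes, the sketch below prints the same made-up Series at two precisions.  It uses "pd.option_context" so the change is temporary; that is a choice for this example only, since the notebook itself sets the option globally.<br>

</span>

```python
import pandas as pd

values = pd.Series([1 / 3, 2 / 3])  # made-up sample values

# Remember the current setting so we can confirm it is restored afterwards.
before = pd.get_option("display.precision")

# Inside the block, floats are rendered with 3 decimal places.
with pd.option_context("display.precision", 3):
    short_text = repr(values)

# Inside the block, floats are rendered with 12 decimal places.
with pd.option_context("display.precision", 12):
    long_text = repr(values)

print(short_text)
print(long_text)
```

<span style="font-family:Papyrus; font-size:1.25em;">

The option only affects how values are displayed; the stored floating-point values are unchanged.<br>

</span>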

<span style="font-family:Papyrus; font-size:1.25em;">

Switch the log level between "INFO" and "DEBUG" depending on whether you wish to see debug log output.<br>

</span>

In [4]:
"""
Turn debug log statements for various sections of code on/off.
(adjust log level as necessary)
"""
log.basicConfig(level=log.INFO)
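
<span style="font-family:Papyrus; font-size:1.25em;">

The effect of the two levels can be seen in the short sketch below.  The logger name "slo_demo" and the messages are made up for illustration, and the output is captured in a string buffer rather than sent to the console.<br>

</span>

```python
import io
import logging

# Capture log output in memory so it can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("slo_demo")  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False  # Keep the demo output out of the root logger.

# At INFO level, debug messages are dropped.
logger.setLevel(logging.INFO)
logger.debug("hidden at INFO level")
logger.info("shown at INFO level")

# At DEBUG level, debug messages are emitted as well.
logger.setLevel(logging.DEBUG)
logger.debug("shown at DEBUG level")

print(stream.getvalue())
```

<span style="font-family:Papyrus; font-size:1.25em;">

Note that "log.basicConfig" has no effect if the root logger already has handlers, which can happen when a notebook cell is re-run; in Python 3.8 and later, passing "force=True" replaces the existing handlers.<br>

</span>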

## Display raw JSON file data chunk dataframe information:

<span style="font-family:Papyrus; font-size:1.25em;">

By specifying "none" as the function name, we simply log at INFO level the shape, the columns, and a single sample row of the dataframe for the first chunk of data.<br>

Note: We use this function instead of the one in the previous code cell as the raw JSON file is too large to fit into system RAM.<br>

</span>

In [8]:
    # Specify and call data analysis functions on chunked raw JSON Tweet file.
    tweet_util_v2.call_data_analysis_function_on_json_file_chunks(
        "D:/Dropbox/summer-research-2019/json/dataset_slo_20100101-20180510.json", "none")

INFO:root:
The shape of the dataframe storing the contents of the raw JSON Tweet file chunk 1 is:

INFO:root:(100000, 30)
INFO:root:
The columns of the dataframe storing the contents of the raw JSON Tweet file chunk 1 is:

INFO:root:Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'place', 'possibly_sensitive', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'retweet_count',
       'retweeted', 'retweeted_status', 'source', 'truncated', 'user'],
      dtype='object')
INFO:root:
The first row from the dataframe storing the contents of the raw JSON Tweet file chunk 1 is:

INFO:root:
contributors                                                              

<span style="font-family:Papyrus; font-size:1.25em;">

As can be seen, each Tweet object in its raw JSON form contains many different attributes.  The "user" attribute is itself an object that contains many other attributes and nested objects.<br>

</span>
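
<span style="font-family:Papyrus; font-size:1.25em;">

The chunked reading used in this section can be sketched with "pd.read_json" as shown below.  This is an illustration under assumptions rather than our utility function: the five records are made up, the chunk size of 2 is arbitrary, and we assume the file stores one JSON object per line.<br>

</span>

```python
import io

import pandas as pd

# Five made-up records, one JSON object per line (an assumed layout).
raw = io.StringIO(
    "\n".join('{"id": %d, "full_text": "tweet %d"}' % (i, i) for i in range(5))
)

chunk_shapes = []
# chunksize requires lines=True; each iteration yields one DataFrame.
reader = pd.read_json(raw, lines=True, chunksize=2)
for number, chunk in enumerate(reader, start=1):
    chunk_shapes.append(chunk.shape)
    if number == 1:
        # Report shape, columns, and one sample row for the first chunk only.
        print(chunk.shape)
        print(chunk.columns)
        print(chunk.iloc[0])
```

<span style="font-family:Papyrus; font-size:1.25em;">

Only one chunk is held in memory at a time, which is why this approach works for files that do not fit into RAM.<br>

</span>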

## Import raw JSON converted to CSV Twitter dataset file:

<span style="font-family:Papyrus; font-size:1.25em;">

We read in the untokenized Twitter dataset as a CSV file and generate a Pandas dataframe from the dataset.<br>

Note: The CSV file does not contain all fields from the raw JSON file, only the ones we are currently interested in analyzing.  We will append columns to the CSV file as necessary if additional attributes become relevant for analysis.  Anything with "custom" is a derived column created by applying a function to the native attributes via .apply(function).<br>

Note: The dataset is flattened, so it contains no nested structures.<br>

Note 2: Re-create a fresh dataset without shuffling the data to preserve the same order as in the JSON file.<br>

</span>

In [5]:
    # Import CSV dataset and convert to dataframe.
    tweet_csv_dataframe = tweet_util_v2.import_dataset(
        "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/twitter-dataset-6-22-19.csv",
        "csv", True)

INFO:root:
The shape of our dataframe storing the contents of the csv Tweet data is:

INFO:root:(670427, 50)
INFO:root:
The columns of our dataframe storing the contents of the csv Tweet data is:

INFO:root:Index(['retweeted_derived', 'company_derived', 'text_derived',
       'tweet_url_link_derived', 'multiple_companies_derived_count',
       'company_derived_designation', 'tweet_text_length_derived',
       'spaCy_language_detect', 'user_description_text_length',
       'tweet_created_at', 'tweet_id', 'tweet_full_text',
       'tweet_in_reply_to_status_id', 'tweet_in_reply_to_user_id',
       'tweet_in_reply_to_screen_name', 'tweet_retweet_count',
       'tweet_favorite_count', 'tweet_lang', 'user_id', 'user_name',
       'user_screen_name', 'user_location', 'user_description',
       'user_followers_count', 'user_friends_count', 'user_listed_count',
       'user_favourites_count', 'user_statuses_count', 'user_created_at',
       'user_t

<span style="font-family:Papyrus; font-size:1.25em;">
 
The above log.INFO output shows the shape, columns, and a sample from the Pandas Dataframe that contains the entirety of the CSV file.<br>
 
</span>
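
<span style="font-family:Papyrus; font-size:1.25em;">

As noted earlier in this section, the CSV dataset is flattened and its derived columns are computed with .apply(function).  The sketch below shows one way both steps could look using "pd.json_normalize"; it is not necessarily how "dataset_processor_adapted.py" does it.  The two records are made up, and computing the text length with "len" is an assumption for illustration.<br>

</span>

```python
import pandas as pd

# Two made-up Tweet records with a nested "user" object.
records = [
    {"id": 1, "full_text": "hello world", "user": {"id": 10, "screen_name": "alice"}},
    {"id": 2, "full_text": "hi", "user": {"id": 11, "screen_name": "bob"}},
]

# Flatten nested objects into columns such as "user_id" and "user_screen_name".
flat = pd.json_normalize(records, sep="_")

# Derive a new column by applying a function to a native attribute.
flat["tweet_text_length_derived"] = flat["full_text"].apply(len)

print(flat.columns.tolist())
print(flat)
```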