# SLO Twitter Data Analysis - Pandas.describe() and NaN/Non-NaN Values

### Define necessary attribute/field names for data analysis functions below:

<span style="font-family:Papyrus; font-size:1.25em;">

The lists below hold the attribute/column names in our CSV dataset; the two data analysis functions in this notebook take them as input.  They were copied from our "dataset_processor.adapted.py" Python file, which contains the code that constructs the CSV dataset from the raw JSON dataset.<br>

</span>

In [None]:
    original_fields = ['created_at', 'id', 'full_text', 'in_reply_to_status_id',
                       'in_reply_to_user_id', 'in_reply_to_screen_name',
                       'retweet_count', 'favorite_count', 'lang']

    tweet_fields = ['tweet_created_at', 'tweet_id', 'tweet_full_text', 'tweet_in_reply_to_status_id',
                    'tweet_in_reply_to_user_id', 'tweet_in_reply_to_screen_name',
                    'tweet_retweet_count', 'tweet_favorite_count', 'tweet_lang']

    user_fields = ['user_id', 'user_name', 'user_screen_name', 'user_location', 'user_description',
                   'user_followers_count', 'user_friends_count', 'user_listed_count', 'user_favourites_count',
                   'user_statuses_count', 'user_created_at', 'user_time_zone', 'user_lang']

    entities_fields = ["tweet_entities_expanded_urls", "tweet_entities_hashtags", "tweet_entities_user_mentions_id",
                       "tweet_entities_user_mentions_name", "tweet_entities_user_mentions_screen_name",
                       "tweet_entities_symbols"]

    additional_fields = ["company_derived_designation", "user_description_text_length"]

    required_fields = ['retweeted_derived', 'company_derived', 'text_derived',  # "tweet_quoted_status_id",
                       'tweet_url_link_derived', 'multiple_companies_derived', 'multiple_companies_derived_count',
                       'tweet_text_length_derived'] + tweet_fields + user_fields + entities_fields + additional_fields

### Pandas.describe() Analysis for Twitter dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

Here, we output statistics for each attribute/column in the entire CSV dataset.<br>

</span>

In [None]:
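import logging as log  # Assumption: the notebook's actual logger setup is not shown here.
import time

import pandas as pd


def attribute_describe(input_file_path, attribute_name_list, file_type):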
    """
    Prints Pandas "describe" statistics for each of the specified attributes.

    https://chrisalbon.com/python/data_wrangling/pandas_dataframe_descriptive_stats/

    Note: This function will not work for attributes whose values are "objects" themselves
    (values can only be of numeric or string type).

    :param input_file_path: absolute file path of the dataset in CSV or JSON format.
    :param attribute_name_list: list of names of the attributes we are analyzing.
    :param file_type: type of input file ("csv" or "json").
    :return: None.
    """
    start_time = time.time()

    if file_type == "csv":
        twitter_data = pd.read_csv(f"{input_file_path}", sep=",")
    elif file_type == "json":
        twitter_data = pd.read_json(f"{input_file_path}",
                                    orient='records',
                                    lines=True)
    else:
        print("Invalid file type entered - aborting operation")
        return

    # Ensure the loaded data is a Pandas dataframe.
    dataframe = pd.DataFrame(twitter_data)

    for attribute_name in attribute_name_list:
        print(f"\nPandas describe for \"{attribute_name}\":\n")
        print(dataframe[attribute_name].describe(include='all'))

    end_time = time.time()
    time_elapsed = (end_time - start_time) / 60.0
    log.debug(f"The time taken to visualize the statistics is {time_elapsed} minutes")

<span style="font-family:Papyrus; font-size:1.25em;">

We call the function on the CSV dataset for every attribute in "required_fields".<br>

</span>

In [None]:
    # Print describe() statistics for every required field.
    attribute_describe(
        "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/selected-attributes-final.csv",
        required_fields, "csv")

<span style="font-family:Papyrus; font-size:1.25em;">

The statistics displayed depend on the type of data in each attribute.  For numerical data, we get count, mean, std, min, the 25th/50th/75th percentiles, and max.  For object (e.g. string) data, we get count, unique, top, and freq (the number of occurrences of the top value).<br>

</span>
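As a minimal sketch of the difference between the two kinds of output, the example below runs "describe" on a small, made-up dataframe. The values are hypothetical toy data, not taken from the actual dataset; only the two column names are borrowed from the field lists above.

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of the dataset.
toy = pd.DataFrame({
    "tweet_retweet_count": [0, 3, 3, 10],    # numeric column
    "tweet_lang": ["en", "en", "es", "en"],  # object (string) column
})

numeric_stats = toy["tweet_retweet_count"].describe()
object_stats = toy["tweet_lang"].describe()

# Numeric columns report count, mean, std, min, percentiles, and max.
print(list(numeric_stats.index))

# Object columns report count, unique, top, and freq.
print(list(object_stats.index))
print(object_stats["top"], object_stats["freq"])
```

Here "top" is the most common value in the column and "freq" is how many times it occurs.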

### NaN versus non-NaN Counts for each Attribute in the Twitter dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

This function counts, for each attribute, the number of rows/examples in the dataset that are NaN or non-NaN, using the Pandas ".isnull().sum()" function chain.<br>

</span>

In [None]:
def count_nan_non_nan(input_file_path, attribute_name_list, file_type):
    """
    Function counts the number of NaN and non-NaN examples in a Pandas dataframe for the specified columns.

    :param input_file_path: absolute file path of the dataset in CSV or JSON format.
    :param attribute_name_list: list of names of the attributes we are analyzing.
    :param file_type: type of input file ("csv" or "json").
    :return: None.
    """
    start_time = time.time()

    if file_type == "csv":
        twitter_data = pd.read_csv(f"{input_file_path}", sep=",", dtype=object)
    elif file_type == "json":
        twitter_data = pd.read_json(f"{input_file_path}",
                                    orient='records',
                                    lines=True)
    else:
        print("Invalid file type entered - aborting operation")
        return

    # Ensure the loaded data is a Pandas dataframe.
    dataframe = pd.DataFrame(twitter_data)

    number_examples = dataframe.shape[0]
    number_attributes = dataframe.shape[1]
    print(f"\nThe number of rows (examples) in the dataframe is {number_examples}")
    print(f"The number of columns (attributes) in the dataframe is {number_attributes}\n")

    for attribute_name in attribute_name_list:
        null_examples = dataframe[attribute_name].isnull().sum()
        non_null_examples = number_examples - null_examples

        print(f"The number of NaN rows for \"{attribute_name}\" is {null_examples}")
        print(f"The number of non-NaN rows for \"{attribute_name}\" is {non_null_examples}\n")

    end_time = time.time()
    time_elapsed = (end_time - start_time) / 60.0
    log.debug(f"The time taken to visualize the statistics is {time_elapsed} minutes")

<span style="font-family:Papyrus; font-size:1.25em;">

We call the function on the same CSV dataset, again for every attribute in "required_fields".<br>

</span>

In [None]:
    # Determine the number of NaN and non-NaN rows for each attribute in the dataset.
    count_nan_non_nan(
        "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/selected-attributes-final.csv",
        required_fields, "csv")

<span style="font-family:Papyrus; font-size:1.25em;">

Each attribute name appears in double quotation marks.  The output is grouped into pairs of lines separated by blank lines; each pair gives the NaN and non-NaN counts for a single attribute.<br>

</span>
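The same counts can also be computed for all columns at once, without a loop. The sketch below is one possible alternative to the per-attribute loop in the function above, shown on a small, made-up dataframe; the values are hypothetical and only the column names are borrowed from the field lists.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values.
toy = pd.DataFrame({
    "user_location": ["Adelaide", np.nan, None, "Perth"],
    "tweet_id": [1, 2, 3, 4],
})

# isnull() marks both NaN and None as missing; sum() counts the True values per column.
null_counts = toy.isnull().sum()

# notnull() is the complement, so this yields the non-NaN count per column.
non_null_counts = toy.notnull().sum()

print(null_counts)
print(non_null_counts)
```

For every column, the NaN and non-NaN counts add up to the number of rows, which matches the subtraction used in "count_nan_non_nan".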

## Resources Used:

<span style="font-family:Papyrus; font-size:1.25em;">

**TODO: convert to annotated bibliography**

Dataset Files (obtained from Borg supercomputer):<br>

dataset_slo_20100101-20180510.json<br>
dataset_20100101-20180510.csv<br>

Note: These are large files not included in the project GitHub Repository.<br>


- [SLO-analysis.ipynb](SLO-analysis.ipynb)<br>
    -original SLO Twitter data analysis file from Shuntaro Yada.<br>


- https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json<br>
    -explanation of all data fields in JSON file format for Tweets.<br>


- https://datatofish.com/export-dataframe-to-csv/<br>
- https://datatofish.com/export-pandas-dataframe-json/<br>
    -saving Pandas dataframe to CSV/JSON<br>
    

- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html<br>
    -Pandas to_datetime() function call.<br>
    

- https://www.machinelearningplus.com/plots/matplotlib-tutorial-complete-guide-python-plot-examples/<br>
    -plotting with matplotlib.<br>


</span>

## TODOs:

<span style="font-family:Papyrus; font-size:1.25em;">

Implement further elements from Shuntaro Yada's SLO Twitter Dataset Analysis.<br>

</span>