### User Statistics for Twitter dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

This function computes statistics related to user screen-names.  Refer to the code comments in the cell below for specifics.<br>

</span>

In [None]:
def user_screen_name_statistics(tweet_dataframe):
    """
    Statistics and visualizations of user screen names by associated company.

    Note: The raw JSON file does not have associated "company" information.

    :param tweet_dataframe: the Twitter dataset in a Pandas dataframe.
    :return: None.
    """
    # Increased limit in order to display all 5 users for all companies and combinations of companies.
    pd.options.display.max_rows = 1000

    # Select only rows with one associated company. (don't graph company combos)
    single_company_only_df = tweet_dataframe.loc[tweet_dataframe['multiple_companies_derived_count'] == 1]

    print("User Statistics for CSV dataset by Company: ")
    print("Top Tweet counts for Unique User by Associated Company.")
    print(
        tweet_dataframe[['company_derived_designation', 'user_screen_name']].groupby('company_derived_designation')
            .apply(lambda x: x['user_screen_name'].value_counts(normalize=True).head())
        # .value_counts(normalize=True)\
        # .sort_index(ascending=False).head())
    )
    print()

    # Graph the User Statistics.
    print("Number of Times a Percentage of Users Appears as Tweet Author for a Given Company: ")
    # FacetGrid creates its own figure; graph only rows with a single associated company.
    grid = sns.FacetGrid(single_company_only_df[['user_screen_name', 'company_derived_designation']],
                         col='company_derived_designation',
                         col_wrap=6,
                         ylim=(0, 1),
                         xlim=(0, 10))
    grid.map_dataframe(tweet_util.bar_plot_zipf, 'user_screen_name').set_titles('{col_name}').set_xlabels(
        'Appearance (Tweet author) Count').set_ylabels("Percentage of all Users")
    plt.show()

<span style="font-family:Papyrus; font-size:1.25em;">

As with the graphs above, we graph only Tweets associated with a single company.  Graphing every combination of multiple companies would make the output unwieldy.<br>

</span>

In [None]:
# Determine the Tweet count for most prolific user by company.
user_screen_name_statistics(tweet_csv_dataframe)

<span style="font-family:Papyrus; font-size:1.25em;">

The text statistics list, for each company, the unique users who authored the largest share of the Tweets associated with that company.<br>

The graphs show, for each company, the fraction of its unique users who authored a given number of Tweets associated with that company.  For example, a y-axis value of 0.4 at an x-axis value of 4 indicates that 40% of the users associated with that company each authored 4 Tweets associated with it.<br>

</span>
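
The quantity described above can be sketched on toy data. The helper `tweet_util.bar_plot_zipf` is not shown in this notebook, so the following is only an assumption about what it plots: the share of users at each Tweet count. The function name `appearance_count_shares` and the toy values are hypothetical; only the column names come from the code above.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook's dataframe.
toy_df = pd.DataFrame({
    "company_derived_designation": ["company_a"] * 6 + ["company_b"] * 2,
    "user_screen_name": ["u1", "u1", "u1", "u2", "u3", "u4", "u5", "u5"],
})


def appearance_count_shares(frame):
    """Return the fraction of unique users (values) at each Tweet count (index)."""
    # First pass: number of Tweets authored by each user.
    tweets_per_user = frame["user_screen_name"].value_counts()
    # Second pass: fraction of users sharing each Tweet count.
    return tweets_per_user.value_counts(normalize=True).sort_index()


# One distribution per company, keyed by company name.
shares_by_company = {
    company: appearance_count_shares(group)
    for company, group in toy_df.groupby("company_derived_designation")
}
```

For `company_a`, three of the four users wrote one Tweet and one user wrote three, so the shares are 0.75 at a count of 1 and 0.25 at a count of 3.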

### Unique Users (Authors):

<span style="font-family:Papyrus; font-size:1.25em;">

Displays every unique user screen name in our Tweet dataset along with its Tweet count.<br>

</span>

In [None]:
def unique_authors_tweet_counts(tweet_dataframe):
    """
    This function provides statistics on all unique authors and their Tweet post counts in the dataset.

    :param tweet_dataframe: Tweet dataframe.
    :return: None.

    TODO - implement graph of distribution of unique authors.
    """
    author_series = pd.Series(tweet_dataframe["user_screen_name"])
    print("All Unique Authors by User Screen Name and their Tweet Post Count:")
    print(author_series.value_counts(sort=True, ascending=False))

<span style="font-family:Papyrus; font-size:1.25em;">

The function uses the built-in Pandas "value_counts" method.<br>

</span>
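
As a minimal illustration of "value_counts" on its own, the sketch below uses a hypothetical toy Series rather than the real dataset. By default the method sorts counts in descending order, and `normalize=True` returns shares instead of raw counts.

```python
import pandas as pd

# Hypothetical screen names standing in for the "user_screen_name" column.
authors = pd.Series(
    ["alice", "bob", "alice", "carol", "alice", "bob"],
    name="user_screen_name",
)

# Raw Tweet counts per author, most prolific first (the default sort order).
counts = authors.value_counts()

# The same tally expressed as a fraction of all Tweets.
shares = authors.value_counts(normalize=True)

print(counts)
print(shares)
```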

In [None]:
# List all unique authors (users) and their Tweet counts.
unique_authors_tweet_counts(tweet_csv_dataframe)

<span style="font-family:Papyrus; font-size:1.25em;">
    
**TODO: visualize distribution and explain results**    

</span>

## Resources Used:

<span style="font-family:Papyrus; font-size:1.25em;">

**TODO: convert to annotated bibliography**

Dataset Files (obtained from Borg supercomputer):<br>

dataset_slo_20100101-20180510.json<br>
dataset_20100101-20180510.csv<br>

Note: These are large files not included in the project GitHub repository.<br>


- [SLO-analysis.ipynb](SLO-analysis.ipynb)<br>
    -original SLO Twitter data analysis file from Shuntaro Yada.<br>


- https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json<br>
    -explanation of all data fields in JSON file format for Tweets.<br>


- https://datatofish.com/export-dataframe-to-csv/<br>
- https://datatofish.com/export-pandas-dataframe-json/<br>
    -saving Pandas dataframe to CSV/JSON<br>
    

- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html<br>
    -Pandas to_datetime() function call.<br>
    

- https://www.machinelearningplus.com/plots/matplotlib-tutorial-complete-guide-python-plot-examples/<br>
    -plotting with matplotlib.<br>


</span>

## TODOs:

<span style="font-family:Papyrus; font-size:1.25em;">

Implement further elements from Shuntaro Yada's SLO Twitter Dataset Analysis.<br>

</span>