# Biorefinery Tweets Dataset Descriptive Statistics

## Questions
1. First tweet about biorefineries
    1. When did the first tweet about biorefineries appear?
    2. What did it say?
    3. Who said it?
    4. To what field does the author belong?
2. Time & frequency: how many tweets about biorefineries are there per year? How does that compare to all tweets?
3. Oldest tweet in relation to the network age
4. Proportions of discussions and retweets compared with sole tweets and unfavorited tweets
5. Growth of the subject's level of discussion, in relation to the growth of the network usage
6. Are there clusters of frequently interacting actors?
   1. Is it possible to link them with _real-world_ links? (associations, contracts, commercial relations)
7. What is noise in a Twitter feed? Has it been defined, measured, operationalized?
8. Proportion of personal and organization accounts

## Compiled answers (2021-05-30 to ...)
1. First tweet about biorefineries
    1. The first tweet was posted on 2015-09-19 02:38:56 UTC.
    2. It said: "Good morning, I've been busy to get this day going. Writing, reading, figuring out things around biotech, clean tech an biorefineries."
    3. It was tweeted by Helge Keitel (@digitalvillages).
    4. His profile lists "Business development at KK-net" (maybe his own company), the same as when he tweeted it. He may be a sort of trader or venture capitalist.
2. 
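
The lookup behind answer 1 can be sketched as follows. This is a minimal illustration on a toy dataframe, not the query that was actually run: the rows are invented, and the `screen_name` column is an assumption about how the author handle is stored. Only `tweet_created_at` and `full_text` appear in the merged dataset used in this notebook.

```python
import pandas as pd

# Toy stand-in for the merged dataframe (hypothetical rows and author column).
toy = pd.DataFrame({
    "tweet_created_at": pd.to_datetime(
        ["2012-03-01 10:00:00", "2009-06-15 08:30:00", "2014-11-20 17:45:00"],
        utc=True,
    ),
    "full_text": ["tweet b", "tweet a", "tweet c"],
    "screen_name": ["user_b", "user_a", "user_c"],  # assumed column name
})

# idxmin() returns the index label of the earliest timestamp.
first_idx = toy["tweet_created_at"].idxmin()
first_tweet = toy.loc[first_idx]

print(first_tweet["tweet_created_at"])
print(first_tweet["screen_name"])
print(first_tweet["full_text"])
```

Applied to the real data, the same two lines (`idxmin()` followed by `.loc`) would return the row to inspect for the date, text, and author.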

## Imports

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

## Importing and combining the data

Tweets containing the keywords "biorefinery" or "biorefineries" up until 2021-....

Tweets were downloaded and combined locally. The following script loads and merges two pickle files: one with individual tweets, and another with each user's information. Columns irrelevant for the study were removed from both datasets.

In [None]:
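def load_pickle_and_combine(tweets_pkl, users_pkl, on_column, rename_cols):
    """Load the tweets and users pickles, merge them on `on_column`,
    rename columns using `rename_cols`, and index the result by tweet id."""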
    tweets = pd.read_pickle(tweets_pkl)
    tweets_df = pd.DataFrame(tweets)  # sequence suggested by Albert

    users = pd.read_pickle(users_pkl)
    users_df = pd.DataFrame(users)

    del tweets
    del users

    users_df['user_id'] = users_df.index
    tweets_df['id'] = tweets_df.index
    biorefinery_df = pd.merge(tweets_df, users_df, on=on_column)
    biorefinery_df.rename(columns=rename_cols, inplace=True)
    biorefinery_df.set_index('id', inplace=True)

    del tweets_df
    del users_df

    return biorefinery_df


tweets_pkl_file = '/datasets/biorefinery-field-tweets/biorefinery_tweets_df_cleaned.pkl'
users_pkl_file = '/datasets/biorefinery-field-tweets/biorefinery_user_df_cleaned.pkl'
join_on_column = 'user_id'
columns_renaming = {
    'created_at_x': 'tweet_created_at',
    'created_at_y': 'user_created_at',
    'favourites_count': 'user_favourites_count'
}

df = load_pickle_and_combine(tweets_pkl_file, users_pkl_file, join_on_column, columns_renaming)

del tweets_pkl_file
del users_pkl_file
del join_on_column
del columns_renaming
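
The `created_at_x` and `created_at_y` keys in the renaming map come from how `pd.merge` handles columns that exist in both frames: by default it appends the suffixes `_x` (left frame) and `_y` (right frame). The sketch below shows this on two tiny, invented frames; the column sets are assumptions kept to the minimum needed to show the effect.

```python
import pandas as pd

# Hypothetical miniature versions of the tweets and users tables.
tweets = pd.DataFrame({
    "user_id": [10, 11, 10],
    "created_at": pd.to_datetime(["2012-01-01", "2013-05-05", "2014-07-07"]),
    "full_text": ["a", "b", "c"],
})
users = pd.DataFrame({
    "user_id": [10, 11],
    "created_at": pd.to_datetime(["2009-02-02", "2010-03-03"]),
    "favourites_count": [5, 7],
})

# Both frames have `created_at`, so the merge yields created_at_x / created_at_y.
merged = pd.merge(tweets, users, on="user_id")

# Give the suffixed columns explicit names, as in the notebook's renaming map.
merged = merged.rename(columns={
    "created_at_x": "tweet_created_at",
    "created_at_y": "user_created_at",
    "favourites_count": "user_favourites_count",
})

print(list(merged.columns))
```

An alternative would be to pass `suffixes=("_tweet", "_user")` to `pd.merge`, which makes the origin of each column explicit at merge time.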

## Frequency and growth

Build a new DataFrame with the number of tweets about biorefineries (BR) per year, and add the Twitter network size in millions of daily active users.

In [None]:
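# Deciles (0.1, 0.2, ..., 0.9) for the descriptive summary
percentiles = [x / 100 for x in range(10, 100, 10)]

# `datetime_is_numeric` was removed in pandas 2.0, where datetime columns are
# always summarized numerically; on pandas 1.x, pass datetime_is_numeric=True.
description = df.describe(percentiles=percentiles, include='all')

# Count the tweets posted in each calendar year
df_grupped_year = df.groupby(df['tweet_created_at'].dt.year)['full_text'].count()
df_grupped_year = pd.DataFrame(df_grupped_year)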

# np.nan marks years without a reported figure (the np.NaN alias was removed in NumPy 2.0)
twitter_MAU = {
    "2007": np.nan,
    "2008": np.nan,
    "2009": np.nan,
    "2010": 54,
    "2011": 117,
    "2012": 185,
    "2013": 241,
    "2014": 288,
    "2015": 305,
    "2016": 318,
    "2017": 330,
    "2018": 321,
    "2019": 330,
    "2020": 353,
    "2021": np.nan
}  # source: Twitter financial statements and https://backlinko.com/twitter-users accessed on 2021-05-30

df_grupped_year.reset_index(inplace=True)

df_grupped_year.rename(
    columns={'tweet_created_at': 'Year', 'full_text': 'Biorefinery tweets'},
    inplace=True
    )

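# Match the figures by year instead of by position, so a missing year in the
# tweets data cannot misalign the column or raise a length error.
df_grupped_year['Twitter million daily active users'] = (
    df_grupped_year['Year'].astype(str).map(twitter_MAU)
)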


Plotting and comparing the evolution.

In [None]:
# Create the figure and the primary axes
fig, ax = plt.subplots(figsize=(10, 5))
plt.title('BR tweets & network size per year')

# twinx() creates a second axes object sharing the x-axis,
# used here as the secondary y-axis
ax2 = ax.twinx()
ax.plot(df_grupped_year['Year'], df_grupped_year['Biorefinery tweets'], color='b')
ax2.plot(df_grupped_year['Year'], df_grupped_year['Twitter million daily active users'], color='g')

# Axis labels
ax.set_xlabel('Year', color='k')
ax.set_ylabel('Number of tweets', color='k')

# Secondary y-axis label
ax2.set_ylabel('million daily active users', color='k')

# Fixed ranges so both series start at zero
ax.set_ylim(0, 4000)
ax2.set_ylim(0, 400)

# Adjust the layout and show the plot
plt.tight_layout()
plt.show()


## Questions about time growth
2. Time & frequency: how many tweets about biorefineries are there per year? How does that compare to all tweets?
3. Oldest tweet in relation to the network age

## Answers from previous plot
2. The tweet count grows from the start of the network, with a slope similar to the network's, but it stabilizes and starts a slow descent in 2014, while the network keeps growing slowly.
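
One way to put numbers on this comparison is to compute the year-over-year growth of both series and the number of biorefinery tweets per million users. The sketch below uses the 2010 to 2016 values from the table in this notebook; the shortened column name `Active users (millions)` and the derived column `Tweets per million users` are introduced here for illustration only.

```python
import pandas as pd

# Values copied from the yearly table (years with a reported network size).
yearly = pd.DataFrame({
    "Year": [2010, 2011, 2012, 2013, 2014, 2015, 2016],
    "Biorefinery tweets": [1551, 2119, 3266, 2093, 3077, 2933, 2871],
    "Active users (millions)": [54, 117, 185, 241, 288, 305, 318],
}).set_index("Year")

# Tweets about biorefineries relative to the size of the network.
yearly["Tweets per million users"] = (
    yearly["Biorefinery tweets"] / yearly["Active users (millions)"]
)

# Year-over-year growth of both series, in percent (first year is NaN).
growth = yearly[["Biorefinery tweets", "Active users (millions)"]].pct_change() * 100

print(yearly.round(2))
print(growth.round(1))
```

A falling `Tweets per million users` value would indicate that the topic is growing more slowly than the network itself.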

In [None]:
df_grupped_year

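| | Year | Biorefinery tweets | Twitter million daily active users |
|---|---|---|---|
| 0 | 2007 | 4 | NaN |
| 1 | 2008 | 34 | NaN |
| 2 | 2009 | 765 | NaN |
| 3 | 2010 | 1551 | 54.0 |
| 4 | 2011 | 2119 | 117.0 |
| 5 | 2012 | 3266 | 185.0 |
| 6 | 2013 | 2093 | 241.0 |
| 7 | 2014 | 3077 | 288.0 |
| 8 | 2015 | 2933 | 305.0 |
| 9 | 2016 | 2871 | 318.0 |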

