# EDA

This notebook gives an overview of the tweets that our team has collected. The tweets were downloaded from a Git repository hosted by the PanaceaLab at Georgia State University, which had already filtered them down to those pertaining to the novel coronavirus.

First let us load the tweets:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from dataset import Tweet_Dataset
import warnings
from datetime import datetime, timedelta #refactor
warnings.filterwarnings('ignore')

In [2]:
data_folder = '../../data/b07/'
config_path = '../../config/data_params.yaml'
data = Tweet_Dataset(config_path, data_folder)

We have created a Tweet_Dataset class to make the tweets easy to work with. For example, we can count them:

In [None]:
print(len(data), 'tweets in the dataset')

And this is what a tweet looks like:

In [None]:
twt = next(data.tweets())
twt

As you can see, the JSON format of the tweet allows us to easily access the metadata for each tweet.
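For instance, individual fields can be read with ordinary dictionary indexing. The sketch below uses a small hand-written dictionary rather than a tweet from the dataset: the values are invented, and the `'text'` key inside each hashtag entity is assumed from Twitter's tweet JSON layout.

```python
from datetime import datetime, timezone

# Hypothetical, hand-written tweet: values are invented for illustration
# and the keys are only a small subset of the full tweet JSON.
example_twt = {
    'created_at': datetime(2020, 3, 22, 14, 19, 50, tzinfo=timezone.utc),
    'full_text': 'Stay home and stay safe #Covid19 #WearAMask',
    'retweet_count': 12,
    'entities': {'hashtags': [{'text': 'Covid19'}, {'text': 'WearAMask'}]},
    'user': {'screen_name': 'example_user'},
}

# Top-level fields are read directly; nested metadata sits under
# 'entities' and 'user'.
posted_on = example_twt['created_at'].date().isoformat()
hashtags = [tag['text'] for tag in example_twt['entities']['hashtags']]
author = example_twt['user']['screen_name']

print(posted_on, author, hashtags)
```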

## Top 10 Hashtags

Hashtags are among the most important pieces of information contained in the tweets. They are a special kind of data because they serve as a classification in their own right. Here, we look at the 10 most common hashtags in the dataset.

In [None]:
tag_counts = data.hashtag_counts()

most_common_tags = tag_counts.most_common(10)

tag_labels = [elem[0] for elem in most_common_tags]
tag_values = [elem[1] for elem in most_common_tags]
indexes = np.arange(len(tag_labels))

In [None]:
plt.bar(tag_labels, tag_values)
plt.xticks(indexes, tag_labels, rotation='vertical', fontsize=10)
plt.suptitle('Top 10 Hashtags')
plt.show()
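The call to `most_common(10)` suggests that `hashtag_counts()` returns a `collections.Counter`. Its implementation is not shown in this notebook, so the following is only a sketch of how such counts could be built. `count_hashtags` and the sample tweets are hypothetical, and the `'text'` key is assumed from Twitter's JSON layout for hashtag entities.

```python
from collections import Counter

def count_hashtags(tweets):
    """Count hashtag occurrences across an iterable of tweet dicts.

    Hypothetical helper, not the Tweet_Dataset implementation.
    """
    counts = Counter()
    for twt in tweets:
        # Assumes each hashtag entity is a dict with a 'text' key.
        counts.update(tag['text'] for tag in twt['entities']['hashtags'])
    return counts

# Made-up sample tweets, reduced to the one field the helper reads.
sample_tweets = [
    {'entities': {'hashtags': [{'text': 'Covid19'}, {'text': 'WearAMask'}]}},
    {'entities': {'hashtags': [{'text': 'Covid19'}]}},
    {'entities': {'hashtags': []}},
]

sample_counts = count_hashtags(sample_tweets)
print(sample_counts.most_common(1))  # [('Covid19', 2)]
```

In this sketch hashtags are compared case-sensitively, so `Covid19` and `covid19` would be counted separately unless the text is normalized first.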

## Most Posting Users

Next, we do the same for the number of times each user posts.

In [None]:
usr_counts = data.user_name_counts()

most_common_users = usr_counts.most_common(10)

usr_labels = [elem[0] for elem in most_common_users]
usr_values = [elem[1] for elem in most_common_users]
indexes = np.arange(len(usr_labels))

In [None]:
plt.bar(usr_labels, usr_values)
plt.xticks(indexes, usr_labels, rotation='vertical', fontsize=10)
plt.suptitle('Top 10 Most Posting Users')
plt.show()

Immediately we notice something interesting: the users who post most frequently appear to be bots. They make up a large share of the top 10, with 40% of those users having 'bot' in their screen names. Of course, some of the other users may be bots as well, just without the word 'bot' in their screen names. This definitely warrants further investigation.
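A share like this can be computed directly from the counts. The sketch below uses made-up screen names and post counts in place of `usr_counts`. Matching on the substring 'bot' is a crude heuristic: it misses bots with other names and would also flag a name such as 'abbott'.

```python
from collections import Counter

# Made-up stand-in for usr_counts; names and counts are illustrative only.
example_counts = Counter({
    'covid_news_bot': 950, 'daily_updates': 720, 'CoronaBot': 640,
    'health_desk': 510, 'stats_bot': 480, 'jane_doe': 300,
})

top_users = [name for name, _ in example_counts.most_common(5)]

# Case-insensitive substring match on the screen name.
bot_like = [name for name in top_users if 'bot' in name.lower()]
bot_share = len(bot_like) / len(top_users)

print(bot_like, f'{bot_share:.0%}')
```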

## Hashtag Counts by Day

In this section we hand-pick 3 hashtags for the science set and 3 for the misinformation (conspiracy) set. Note that this selection reflects our own, possibly biased, assumptions, and the hashtags may not actually be used this way. We will investigate how the usage of these hashtags changed throughout the dataset's timespan.

In [None]:
daily_tag_occurrences = data.get_daily_tag_counts()

In [None]:
# 3 science
covid19 = daily_tag_occurrences['Covid19']
wearAMask = daily_tag_occurrences['WearAMask']
socialDistancing = daily_tag_occurrences['SocialDistancing']

# 3 misinformation
fakeNews = daily_tag_occurrences['FakeNews']
hoax = daily_tag_occurrences['Hoax']
chinaVirus = daily_tag_occurrences['ChinaVirus']
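The indexing above implies that `get_daily_tag_counts()` returns a mapping from hashtag to per-day counts. Its implementation is not shown here. One way such a structure could be built is sketched below, where `daily_tag_counts` and `make_tweet` are hypothetical helpers and the tweets are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone

def daily_tag_counts(tweets):
    """Map hashtag -> {date: occurrences}. Hypothetical helper."""
    counts = defaultdict(lambda: defaultdict(int))
    for twt in tweets:
        day = twt['created_at'].date()
        # Assumes each hashtag entity is a dict with a 'text' key.
        for tag in twt['entities']['hashtags']:
            counts[tag['text']][day] += 1
    return counts

def make_tweet(year, month, day, tags):
    """Build a minimal made-up tweet with only the fields used above."""
    return {
        'created_at': datetime(year, month, day, 12, 0, tzinfo=timezone.utc),
        'entities': {'hashtags': [{'text': t} for t in tags]},
    }

sample_tweets = [
    make_tweet(2020, 3, 22, ['Covid19', 'Hoax']),
    make_tweet(2020, 3, 22, ['Covid19']),
    make_tweet(2020, 3, 23, ['Covid19']),
]

sample_daily = daily_tag_counts(sample_tweets)
for day, n in sorted(sample_daily['Covid19'].items()):
    print(day, n)
```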

### Science Hashtag Usages

We chose to associate #Covid19, #WearAMask, and #SocialDistancing with science-based tweets. Below we plot the usage of these hashtags across the days contained in the dataset we collected.

In [None]:
dates, frequencies = zip(*sorted(covid19.items()))
plt.xticks(rotation='vertical')
plt.suptitle('Occurrences of #Covid19')
plt.plot(dates,frequencies)

In [None]:
dates, frequencies = zip(*sorted(wearAMask.items()))
plt.xticks(rotation='vertical')
plt.suptitle('Occurrences of #WearAMask')
plt.plot(dates,frequencies)

In [None]:
dates, frequencies = zip(*sorted(socialDistancing.items()))
plt.xticks(rotation='vertical')
plt.suptitle('Occurrences of #SocialDistancing')
plt.plot(dates,frequencies)

### Misinformation Hashtag Usages

Categorizing hashtags as misinformation-based was subjective, just as it was for the science category. We chose #FakeNews, #Hoax, and #ChinaVirus as the hashtags associated with tweets that were not based on fact. Their daily usage plots are below.

In [None]:
dates, frequencies = zip(*sorted(fakeNews.items()))
plt.xticks(rotation='vertical')
plt.suptitle('Occurrences of #FakeNews')
plt.plot(dates,frequencies)

In [None]:
dates, frequencies = zip(*sorted(hoax.items()))
plt.xticks(rotation='vertical')
plt.suptitle('Occurrences of #Hoax')
plt.plot(dates,frequencies)

In [None]:
dates, frequencies = zip(*sorted(chinaVirus.items()))
plt.xticks(rotation='vertical')
plt.suptitle('Occurrences of #ChinaVirus')
plt.plot(dates,frequencies)
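One caveat when reading these plots: if the per-day dictionaries omit days on which a hashtag did not appear, `plt.plot` draws a straight line across the gap instead of dropping to zero. Whether that happens depends on how the dictionaries are built, which this notebook does not show. The following is a minimal sketch of filling such gaps before plotting, assuming the keys are `datetime.date` objects; `fill_missing_days` is a hypothetical helper and the counts are made up.

```python
from datetime import date, timedelta

def fill_missing_days(day_counts):
    """Return (dates, frequencies) covering every day from the first to the
    last key, using 0 for days that are absent. Hypothetical helper."""
    if not day_counts:
        return [], []
    first, last = min(day_counts), max(day_counts)
    n_days = (last - first).days + 1
    dates = [first + timedelta(days=i) for i in range(n_days)]
    frequencies = [day_counts.get(d, 0) for d in dates]
    return dates, frequencies

# Made-up counts with a two-day gap.
example_counts = {date(2020, 3, 22): 5, date(2020, 3, 25): 2}
dates, frequencies = fill_missing_days(example_counts)
print(frequencies)  # [5, 0, 0, 2]
```

The returned `dates` and `frequencies` can be passed to `plt.plot` in place of the output of `zip(*sorted(...))`.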

### Identifying Echo Chamber Tweets

In [3]:
gen = data.tweets()

In [4]:
# Keep only the tweets that were retweeted more than 1000 times
highly_retweeted = [twt for twt in gen if twt['retweet_count'] > 1000]

In [5]:
len(highly_retweeted)

221490

In [6]:
# Sort by retweet count, most retweeted first
sorted_retweets = sorted(highly_retweeted, key=lambda k: k['retweet_count'], reverse=True)

In [7]:
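# Keep the first (most retweeted) copy of each distinct tweet text.
# A set makes the membership test O(1) instead of scanning a list.
tweets_seen = set()
unique_tweets = []
for twt in sorted_retweets:
    if twt['full_text'] not in tweets_seen:
        tweets_seen.add(twt['full_text'])
        unique_tweets.append(twt)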

In [8]:
len(unique_tweets)

47829

In [59]:
unique_tweets[77]

{'created_at': datetime.datetime(2020, 3, 22, 14, 19, 50, tzinfo=datetime.timezone.utc),
 'id': 1241731316665561088,
 'id_str': '1241731316665561088',
 'full_text': 'RT @eugenegu: The H1N1 swine flu pandemic that infected up to 1.4 billion people and killed up to 575,000 originated in factory farmed pigs…',
 'truncated': False,
 'display_text_range': [0, 140],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [{'screen_name': 'eugenegu',
    'name': 'Eugene Gu, MD',
    'id': 65497475,
    'id_str': '65497475',
    'indices': [3, 12]}],
  'urls': []},
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 101460812,
  'id_str': '101460812',
  'name': 'Meghan Standefer',
  'screen_name': 'megstatic',
  'location': '',
  'description': 'just here for

In [62]:
# Hand-picked examples of science-based tweets (four of them, despite the variable name)
three_science = [unique_tweets[7], unique_tweets[29], unique_tweets[69], unique_tweets[77]]

In [63]:
# Hand-picked examples of conspiracy tweets (four of them, despite the variable name)
three_conspiracy = [unique_tweets[9], unique_tweets[22], unique_tweets[41], unique_tweets[63]]

In [65]:
three_conspiracy

[{'created_at': datetime.datetime(2020, 3, 28, 19, 47, tzinfo=datetime.timezone.utc),
  'id': 1243987978055278594,
  'id_str': '1243987978055278594',
  'full_text': 'RT @SethAbramson: MAJOR BREAKING NEWS: NPR Source Says Trump Blocked Coronavirus Testing in January to Aid His Reelection Chances By Keepin…',
  'truncated': False,
  'display_text_range': [0, 140],
  'entities': {'hashtags': [],
   'symbols': [],
   'user_mentions': [{'screen_name': 'SethAbramson',
     'name': 'Seth Abramson',
     'id': 3223426134,
     'id_str': '3223426134',
     'indices': [3, 16]}],
   'urls': []},
  'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'in_reply_to_screen_name': None,
  'user': {'id': 1206228966463418370,
   'id_str': '1206228966463418370',
   'name': 'Laura',
   'screen_name': 'silencedforreal',
   'location': 'U