# Utilizing Twitter to Measure the Public's Sentiment on the Pandemic Over Time
**Team 7:** Insert Name Here 

**Members:** Lipsa J., Ji K., Yuanfeng L., Yu L.

The pandemic has disrupted lives around the world. While it began as a minor inconvenience to many people, the severity of the virus soon became clear. Since protective measures were first enforced, many people's opinions on the virus, on protective rules and procedures, and on other pandemic-related topics have changed, and they continue to change into 2021. We want to record and analyze these trends using metrics such as sentiment and LIWC measures, and possibly more as we make further discoveries.

Using Twitter, an online social media platform for sharing content and microblogging, we analyze "tweets" (publicly posted messages) from everyday people about how they feel about the pandemic. The analysis covers data from January 22, 2020 through the most recent data available at the time of the project (February and March of 2021).

We've outlined objectives & research questions we hope to answer through this approach.
* **Goal 1**: Find out how the sentiment of tweets about regulations or rules on wearing a mask or taking a vaccine changed over the year 2020 and the current months in 2021 (January - March)
* **Goal 2**: Find out the sentiment of tweets relating to the COVID-19 virus for the year 2020 and the current months in 2021 (January - March)
* **Stretch Goal 1**: Find out the sentiments across geographical locations within the U.S. about protective measures (e.g. wearing a mask) and taking the vaccine. News outlets and social media have shown that different areas in the U.S. have responded differently to these rules. If time and resources allow, we want to run the research experiment at a lower level, focusing on specific areas in the U.S., perhaps comparing areas with low, moderate, and high cases per capita.
* **Stretch Goal 2**: Relate our findings to how misinformation and fake news on Twitter changed before and after the election, as well as their possible consequences for the public's sentiment on COVID-19 and its related topics (e.g. vaccines, lockdown, social distancing).

Through our efforts, we hope to be able to answer or at least find insight into the following questions as well:
* Have specific events affected the public's stance on the pandemic? These could be the presidential election, the presidential and vice-presidential debates, Trump's diagnosis and hospital stay, and so on.
* How does the efficacy of different cities, counties, and states in containing the virus relate to the public sentiment of the people there?

## Our Current Approach
Currently, we're using Twitter as our data source. Initially, we stated that we'd use Reddit as well, but a caveat of that public forum platform is that it's heavily moderated. With the pandemic being a global crisis, Reddit has emphasized initiatives to remove and censor posts that may be incendiary or controversial, promote misinformation, and so on. While such posts are harmful in the grand scheme of society, we actually want to collect this kind of data as well, since it represents a sub-population with different views.

We're using Tweepy, an open-source Python wrapper for the Twitter API. It's easy to use, but one caveat is that the standard search it wraps only reaches back about a week, which makes sense since many people use real-time data for analysis. To get around this issue, we relied on datasets hosted on Kaggle and IEEE DataPort.
Both have been collecting the ID numbers of tweets with keywords relating to the pandemic since near the beginning of 2020. These keywords include identifiers such as "n95", "ppe", "washyourhands", "stayathome", "selfisolating", "social distancing", "covid-19", and so on.

Utilizing Tweepy and Python, we iterate through these tweet id values to pull the actual tweet status object from Twitter. From there, we extract the following information: 
* id: ID number of the tweet
* username: Display name of the account that posted the tweet (`user.name`)
* text: The literal text content of the tweet
* entities: Entities parsed from the tweet, such as hashtags, symbols, and user mentions
* retweet_count: Number of times the tweet had been retweeted
* favorite_count: Number of times the tweet had been favorited
* created_at: Time the tweet had been posted
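
The `entities` field is a nested dictionary rather than a flat list, and after a CSV round trip it comes back as a string. A minimal sketch of pulling just the hashtag text back out is shown here. `hashtags_from_entities` is a hypothetical helper name, and the dictionary shape is assumed from the sample output printed later in this notebook.

```python
import ast


def hashtags_from_entities(entities):
    """Return the hashtag strings from a tweet's entities value.

    Accepts either a dict or the string form produced by a CSV round trip
    (assumed here to be a Python dict literal).
    """
    if isinstance(entities, str):
        entities = ast.literal_eval(entities)
    return [tag['text'] for tag in entities.get('hashtags', [])]


# Illustrative value only; the indices are made up.
example = "{'hashtags': [{'text': 'WearMask', 'indices': [10, 19]}], 'symbols': []}"
print(hashtags_from_entities(example))
```

`ast.literal_eval` is used instead of `eval` so that only plain Python literals are accepted.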

We're collecting our own data currently with the same parameters and keywords. Taking these datasets, in a csv format, we're running each through LIWC and looking at the following metrics: 
* Summary variables: Analytical thinking, clout, authentic, and emotional tone
* Affect words: Positive emotions, negative emotions, anxiety, anger, sadness
* Social words: Family, friends, female referents, male referents
* Cognitive Processes: Insight
* Biological processes: Body, health/illness
* Personal concerns: Work, leisure, money
* Informal speech: Swear words

While the biggest contributors will be authenticity and emotions (emotional tone and positive/negative emotions), we believe the other attributes will help answer our research questions and stretch goals.

## Collecting Twitter Data From the Entirety of 2020
As mentioned previously, since we can only directly scrape tweets for the past week, we utilize Kaggle's dataset which is found at https://www.kaggle.com/lopezbec/covid19-tweets-dataset.

Additionally, IEEE DataPort hosts a similar dataset with a wider range of keywords, which can be found at https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset.

These files contain minimal data in order to save space. The Kaggle files hold just the tweet ids in a list-like structure, while the IEEE files have a similar format but store pairs of a tweet id and a sentiment score calculated for the content of the tweet.

A snippet of the dataset content: 
[1220041022694137856, 1220041024031977472, 1220041026661965824, 1220041030478761984, 1220041036107370496, 1220041039597015040, 1220041040012480513, 1220041041971118080, 1220041046484082689, 1220041052641325056]
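
A minimal sketch of reading both layouts follows. The function names are hypothetical, and the IEEE layout is assumed to be one `tweet_id,sentiment` pair per line with no header row; the actual files may differ.

```python
import csv
import io


def parse_id_list(text):
    """Parse a bracketed, comma-separated list of tweet ids into Python ints."""
    body = text.strip().lstrip('[').rstrip(']')
    return [int(token) for token in body.split(',') if token.strip()]


def parse_id_sentiment_pairs(text):
    """Parse lines of `tweet_id,sentiment` into (int, float) tuples.

    Assumes one pair per line and no header row.
    """
    reader = csv.reader(io.StringIO(text))
    return [(int(row[0]), float(row[1])) for row in reader if row]


kaggle_style = "[1220041022694137856, 1220041024031977472]"
# The sentiment scores below are made up for illustration.
ieee_style = "1220041022694137856,0.25\n1220041024031977472,-0.5\n"

print(parse_id_list(kaggle_style))
print(parse_id_sentiment_pairs(ieee_style))
```

Tweet ids are kept as Python integers here because they need all of their digits; converting them to floats would round them.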

This is a very small portion from one day. Every one of those numbers refers to a tweet. Each day has more than 80,000 tweet ids, which we feel is overkill for the scope of the project. Since we're looking at the entire year, the full set would amount to hundreds of gigabytes of actual Twitter data. We therefore sampled from these files and used Tweepy to collect the Twitter status objects to extract data from.

According to the descriptions on both platforms, the data was collected using keyword searching only, which is what we're doing as well. Based on this, we treat the tweet-id datasets as representing the data we would have gathered had we run Tweepy ourselves from the beginning of 2020 into 2021.
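
The sampling step can be sketched as follows. This is only an illustration: `sample_ids`, the sample size of 1,000, and the fixed seed are assumptions, not the values used in the project.

```python
import random


def sample_ids(ids, k, seed=0):
    """Return up to k ids drawn uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    if k >= len(ids):
        return list(ids)
    return rng.sample(ids, k)


# Stand-in for one day's file: 80,000 consecutive made-up ids.
day_ids = list(range(1220041022694137856, 1220041022694137856 + 80000))
subset = sample_ids(day_ids, 1000)
print(len(subset), len(set(subset)))
```

Drawing without replacement keeps every sampled id unique, so no tweet is requested from the API twice.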

In [None]:
import json
import os
from collections import defaultdict

import pandas as pd
import tweepy

KEY_FILE = "./twitter.json"
# Files in the data directory that are not tweet-id files.
SKIP_FILES = {"process_tweets.py", os.path.basename(KEY_FILE)}


def load_keys(key_file):
    """
    Loads the Twitter developer credentials from a json file and
    returns the appropriate key-value pairings.
    """
    with open(key_file) as f:
        key_dict = json.load(f)
    return key_dict['api_key'], key_dict['api_secret'], key_dict['token'], key_dict['token_secret']


def get_path():
    """
    Authenticates once, then starts the recursive descent from the current directory.
    The data files are often separated by month, then individual files represent
    the tweet id values collected for that day.
    """
    api_key, api_secret, token, token_secret = load_keys(KEY_FILE)
    auth = tweepy.OAuthHandler(api_key, api_secret)
    auth.set_access_token(token, token_secret)
    # wait_on_rate_limit makes Tweepy sleep until the rate-limit window resets
    # and then retry, so no tweet id is dropped when the limit is hit.
    api = tweepy.API(auth, wait_on_rate_limit=True)
    iterate_files(api, os.getcwd())


def iterate_files(api, path):
    """
    api = authenticated Tweepy API object.
    path = absolute path of the directory to process.
    """
    # File recursion portion. Irrelevant to actual data collection.
    # Author: Ji-Hoon. Wrote this recursive directory program for a file-backup program. Repurposed for this project.
    for filename in os.listdir(path):
        file_path = os.path.join(path, filename)
        if os.path.isdir(file_path):
            iterate_files(api, file_path)
            continue
        if filename in SKIP_FILES:  # Otherwise the program would attempt to read in itself or the credentials.
            continue

        with open(file_path, 'r') as id_file:
            content = id_file.read().strip()[1:-1]  # removes beginning '[' and ending ']'

        # Each id value is comma separated. Plain Python ints are used because
        # tweet ids need 64 bits, which NumPy's default int does not guarantee on every platform.
        ids = [int(token) for token in content.split(',') if token.strip()]

        tweet_content = defaultdict(list)
        for tweet_id in ids:  # Each tweet id
            try:
                tweet = api.get_status(tweet_id)  # returns status object
            except Exception:
                # Deleted tweets and suspended users come back as errors.
                # They don't contribute any data so we skip them.
                continue
            # Extract the features listed prior.
            tweet_content['id'].append(tweet.id)
            tweet_content['username'].append(tweet.user.name)
            tweet_content['text'].append(tweet.text)
            tweet_content['entities'].append(tweet.entities)
            tweet_content['retweet_count'].append(tweet.retweet_count)
            tweet_content['favorite_count'].append(tweet.favorite_count)
            tweet_content['created_at'].append(tweet.created_at)
        # Convert to a data frame, then save to a csv file named after the input file.
        out_name = os.path.splitext(filename)[0] + ".csv"
        pd.DataFrame(tweet_content).to_csv(out_name)


if __name__ == "__main__":
    get_path()

## Current Data Collection in 2021
We relied on those datasets to supplement the earlier Twitter data that we can't retrieve directly. For the current data, however, we're querying Twitter ourselves with a similar keyword-based approach.

We have taken inspiration from the IEEE Coronavirus (COVID-19) Tweets Dataset, which can be found at https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset. Its maintainers have collected tweets matching a large set of keywords since the very beginning of the pandemic and continue to do so. We use a scaled-down selection of keywords from their larger set, which can be found at https://rlamsal.com.np/keywords.tsv. Note: opening that link starts the download of a small tab-separated file containing the keywords.
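
One way a keyword subset could be turned into search queries is sketched here. `build_queries` is a hypothetical helper, and the default `max_len` of 500 characters is an assumption; the actual query length limit depends on the search endpoint being used. The keywords are the examples quoted earlier in this report.

```python
def build_queries(keywords, max_len=500):
    """Pack keywords into OR-joined query strings no longer than max_len.

    Multi-word keywords are wrapped in double quotes so they are
    treated as a phrase.
    """
    queries, current = [], ""
    for keyword in keywords:
        term = f'"{keyword}"' if " " in keyword else keyword
        candidate = term if not current else f"{current} OR {term}"
        if len(candidate) > max_len and current:
            # The current query is full; start a new one with this term.
            queries.append(current)
            current = term
        else:
            current = candidate
    if current:
        queries.append(current)
    return queries


keywords = ["n95", "ppe", "washyourhands", "stayathome", "social distancing", "covid-19"]
for query in build_queries(keywords, max_len=30):
    print(query)
```

Each returned string can then be passed to the search call as one query, which keeps the number of requests low while respecting the length limit.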

We collected the data in the same fashion, ran the text fields through LIWC, and organized the output into separate files based on the time period they represent. A snippet of an output file is shown here:

In [6]:
import pandas as pd
sample = pd.read_csv('LIWC2015_feb.csv')
print(sample.head())

print("Column Names: {}".format(list(sample.columns)))

   standard            id            username  \
0         0  1.363275e+18        Fire Is Born   
1         1  1.363263e+18   MyFrenchDietitian   
2         2  1.363260e+18  healingcolorsmusic   
3         3  1.363260e+18  healingcolorsmusic   
4         4  1.363260e+18  healingcolorsmusic   

                                                text  \
0  @iamungit I've been in one of them in San Fran...   
1  Enjoy the #weekend, go #outdoor, reconnect wit...   
2  #healingcolorsmusic #art #music ...there is #S...   
3  #healingcolorsmusic #art #music ...there is #S...   
4  #healingcolorsmusic #art #music ...there is #S...   

                                            entities retweet_count  \
0  {'hashtags': [{'text': 'WearMask', 'indices': ...             1   
1  {'hashtags': [{'text': 'weekend', 'indices': [...             0   
2  {'hashtags': [{'text': 'healingcolorsmusic', '...             0   
3  {'hashtags': [{'text': 'healingcolorsmusic', '...             0   
4  {'hashtags': [{

We have noticed some problems across different operating systems for handling csv files. We initially ran into problems while sharing datasets with each other across Debian, Windows, and Mac and have resolved most of them since. One example is that some empty columns will show themselves as "Unnamed" columns with empty or NaN values in them. We ignore these values.
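
In pandas, an "Unnamed: N" label is given to any column whose header cell is empty, which typically happens with a saved index column or trailing delimiters. A minimal sketch of dropping such columns on load is shown here; the inline CSV is a made-up stand-in for a LIWC output file.

```python
import io

import pandas as pd

# Stand-in file: a saved index column plus two trailing empty columns.
raw = io.StringIO(
    ",id,text,posemo,,\n"
    "0,1,hello,10.0,,\n"
    "1,2,world,0.0,,\n"
)

df = pd.read_csv(raw)
# Keep only columns whose header is not an auto-generated "Unnamed: N" label.
cleaned = df.loc[:, ~df.columns.str.startswith("Unnamed")]
print(list(df.columns))
print(list(cleaned.columns))
```

Filtering by the header prefix avoids hard-coding column positions, which can differ between files.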

In [8]:
sample.nlargest(10, ['posemo'])

Unnamed: 0,standard,id,username,text,entities,retweet_count,favorite_count,created_at,WC,Analytic,...,Quote,Apostro,Parenth,OtherP,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106
15037,40342,1.363335e+18,King Jamison Fawkes ♚,"@WildHogPower ""WELL WELL WELL WELL WELL WELL W...","{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 03:50:29,26,1.0,...,3.85,0.0,0.0,11.54,,,,,,
15994,43474,1.363335e+18,King Jamison Fawkes ♚,"@WildHogPower ""WELL WELL WELL WELL WELL WELL W...","{'hashtags': [], 'symbols': [], 'user_mentions...",0,1,2021-02-21 03:50:29,26,1.0,...,3.85,0.0,0.0,11.54,,,,,,
9187,21521,1.363309e+18,FA_eye(formally Accureye) #CyberPunk2077,@TheSphereHunter Nice love it great mask,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 02:05:49,6,62.04,...,0.0,0.0,0.0,16.67,,,,,,
12488,32070,1.363325e+18,TEA POt,@FailedSoul_ *winning laughs* pretty impressiv...,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 03:10:45,8,72.69,...,0.0,0.0,0.0,50.0,,,,,,
14506,38373,1.363325e+18,TEA POt,@FailedSoul_ *winning laughs* pretty impressiv...,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,1,2021-02-21 03:10:45,8,72.69,...,0.0,0.0,0.0,50.0,,,,,,
17703,50702,1.363353e+18,Mask Up 2021,@lindyli That's helpful. Thanks.,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 05:02:38,4,1.92,...,0.0,25.0,0.0,25.0,,,,,,
5789,11080,1.363292e+18,T Partain,@Kiss_My_Mask Works pretty good unstoppiunstop...,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,1,2021-02-21 00:57:26,7,68.29,...,0.0,0.0,0.0,42.86,,,,,,
1712,1712,1.36327e+18,Dirk Diggler MMA🥊👊🏿😈,@Delta Lol nice mask pussy,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-20 23:30:29,5,93.26,...,0.0,0.0,0.0,20.0,,,,,,
7553,16307,1.3633e+18,Keri Casazza,"@heavenskincare Would love to win, I'm loving ...","{'hashtags': [{'text': 'win', 'indices': [66, ...",0,0,2021-02-21 01:32:21,13,43.96,...,0.0,7.69,0.0,23.08,,,,,,
1122,1122,1.363273e+18,dave@az,@PissOffTrumpkin Good afternoon. Love the mask.,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-20 23:44:56,6,99.0,...,0.0,0.0,0.0,16.67,,,,,,


These are the 10 tweets from the most recently collected data with the highest positive-emotion scores. However, we noticed that a tweet can score high on positive emotion even when its overall message is negative. This is why the other metrics are used alongside it. For comparison, here are the 10 tweets with the highest negative-emotion scores.

In [9]:
sample.nlargest(10, ['negemo'])

Unnamed: 0,standard,id,username,text,entities,retweet_count,favorite_count,created_at,WC,Analytic,...,Quote,Apostro,Parenth,OtherP,Unnamed: 101,Unnamed: 102,Unnamed: 103,Unnamed: 104,Unnamed: 105,Unnamed: 106
12848,32430,1.363322e+18,BIG_B00B$,WEAR A FUCKING MASK YOU STUPID FUCK,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 02:59:35,7,93.26,...,0.0,0.0,0.0,0.0,,,,,,
20535,61597,1.363367e+18,Sassy | BLM,Goodnight. Fuck racists. Fuck Ted Cuntface Cru...,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 05:58:21,11,98.34,...,0.0,0.0,0.0,0.0,,,,,,
235,235,1.360839e+18,calledryan,copernicus was wrong,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-14 06:30:15,3,18.82,...,0.0,0.0,0.0,0.0,,,,,,
9540,21874,1.363307e+18,KingOfSoup ❼,My mask ugly,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 01:57:06,3,18.82,...,0.0,0.0,0.0,0.0,,,,,,
10850,26761,1.363317e+18,President Dr.Jillian(MAGA Bean)🇺🇸,America is full of fools. Weak mask wearing fo...,"{'hashtags': [], 'symbols': [], 'user_mentions...",1,6,2021-02-21 02:39:01,9,93.26,...,0.0,0.0,0.0,0.0,,,,,,
12092,29923,1.363317e+18,President Dr.Jillian(MAGA Bean)🇺🇸,America is full of fools. Weak mask wearing fo...,"{'hashtags': [], 'symbols': [], 'user_mentions...",2,11,2021-02-21 02:39:01,9,93.26,...,0.0,0.0,0.0,0.0,,,,,,
17667,50666,1.363354e+18,Dean Forbes,@Coolretro72 Mask fail.,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 05:03:28,3,93.26,...,0.0,0.0,0.0,33.33,,,,,,
21424,66013,1.363354e+18,Dean Forbes,@Coolretro72 Mask fail.,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,1,2021-02-21 05:03:28,3,93.26,...,0.0,0.0,0.0,33.33,,,,,,
9083,21417,1.36331e+18,mask is losing,Oh fuck I missed more than 1 affinity shit bet...,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,0,2021-02-21 02:09:11,17,93.26,...,0.0,0.0,0.0,0.0,,,,,,
3931,5721,1.363284e+18,Liz Harvey,Also mask to face ratio SUCKS SHIT,"{'hashtags': [], 'symbols': [], 'user_mentions...",0,5,2021-02-21 00:26:37,7,68.29,...,0.0,0.0,0.0,0.0,,,,,,


A similar situation occurs here: the score alone does not reveal the target of the emotion. Some of these tweets express negative emotions over the fact that not enough people are wearing masks, while others express negative emotions *because* of masks.
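
To follow these scores over time, as the goals at the top of this report require, the per-tweet LIWC values can be averaged per day. The sketch below uses made-up rows with the same `created_at`, `posemo`, and `negemo` column names as the output shown in this notebook. `net_emotion` is a hypothetical derived column, not a LIWC metric.

```python
import pandas as pd

# Made-up rows that mimic the columns used in this notebook.
sample = pd.DataFrame({
    "created_at": ["2021-02-20 23:30:29", "2021-02-20 23:44:56", "2021-02-21 02:59:35"],
    "posemo": [20.0, 40.0, 0.0],
    "negemo": [20.0, 0.0, 60.0],
})

sample["created_at"] = pd.to_datetime(sample["created_at"])
sample["net_emotion"] = sample["posemo"] - sample["negemo"]

# Average each score per calendar day.
daily = (
    sample.groupby(sample["created_at"].dt.date)[["posemo", "negemo", "net_emotion"]]
    .mean()
)
print(daily)
```

A daily average like this still cannot tell what the emotion is directed at, so it is best read together with the other metrics and a look at the underlying tweets.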

## Credit Listing

**Lipsa Jena**
* asd

**Ji Kang**
* Collected past Twitter data using the IEEE and Kaggle tweet-id datasets
* Ran data through LIWC to generate LIWC metrics.

**Yuanfeng Li**
* asd

**Yu Ling**
* asd


## Report format using QQQ

1. Qualitative
* Question, problem, hypothesis, claim, context, motivation
* Definitions, data, methods to be used
* Rationale, assumptions, biases

2. Quantitative:
* Data processing, analysis, visualization
* Documented code and results
* Summary visuals

3. Qualitative:
* Answer, update question/claim, summary, re-contextualization, story, relate to domain knowledge
* Uncertainty, limitations, caveats
* New problems, next steps

4. Repeat. QQQ-QQQ-QQQ-...
* Break down a large problem into parts
* Alternative approaches to a problem
* Sequence of related pro

## References: 
* Good example of a Jupyter Notebook report: https://nbviewer.jupyter.org/gist/nealcaren/5105037

* QQQ: https://www.bava.stat.vt.edu/wp-content/uploads/2017/08/Developing-a-New-Interdisciplinary-Computational-Analytics-Undergraduate-Program-A-Qualitative-Quantitative-Qualitative-Approach.pdf

* Using visuals to support claims: https://www.cbre.com/research-and-reports/Scoring-Tech-Talent-in-North-America-2018

* Typical industry spam: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf