# Sentiment Analysis of Covid-19 Related Tweets based on Geographic Location and State Policies

#### Authors: Steve Diamond [(GitHub)](https://github.com/ssdiam2000), Markell Jones-Francis [(GitHub)](https://git.generalassemb.ly/markelljones-francis), and Julia Kelman [(GitHub)](https://git.generalassemb.ly/julia-kelman/)

## Loading Libraries

In [1]:
pip install twitterscraper

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import twitterscraper as ts
from twitterscraper.tweet import Tweet
from time import strftime, localtime
import json

INFO: {'User-Agent': 'Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16'}


## Gathering Twitter Data

We will use a publicly available dataset created by Rabindra Lamsal and available [here](https://ieee-dataport.org/open-access/corona-virus-covid-19-geolocation-based-sentiment-data) (specifically, the `april28-may6.csv` file). It provides tweet IDs and sentiment scores for tweets posted between April 28th and May 6th, 2020 that mention "corona", "covid-19", "coronavirus", or variants of "sars-cov-2" and that have location information available.  
To comply with Twitter's data sharing policy, the original dataset shares only the tweet IDs. As a result, we need to hydrate these IDs using twitterscraper to retrieve the content and location of each tweet. This information is then saved in a new dataframe.

### Loading Tweet Id and Sentiment Score Data

In [3]:
tweets_id = pd.read_csv("./april28-may6.csv", header=None)
tweets_id.rename(columns={0:'tweet_id', 1:'sentiment_score'}, inplace=True)
tweets_id.head()

Unnamed: 0,tweet_id,sentiment_score
0,1254995208888094721,0.0
1,1254995485452050438,0.0
2,1254995705527177216,0.525
3,1254995868333289475,0.2875
4,1254996123514789893,0.0


### Getting Original Tweet and Location Function

In [4]:
def get_tweet_df(df):
    tweets = []          # creates list to store Tweet objects                                                     
    locations = []       # creates list to store scraped location data
    deleted_tweets = 0   # starts counter for deleted tweets if tweet can no longer be reached
    
    # loops through each tweet_id in the imported dataframe of tweet_id
    for tweet_id in df['tweet_id']:
        url = f'https://twitter.com/anyuser/status/{tweet_id}'    # sets url to the tweet - tweet urls do not need a specified user
        res = requests.get(url)                                   # request for tweet url
        soup = BeautifulSoup(res.content, 'lxml')                 # creates beautifulsoup object for the webpage of the tweet
        # attempts to create tweet object using beautifulsoup object
        # and scrapes the location from the same beautifulsoup object
        try:
            tweet = Tweet.from_soup(soup)
            location = soup.find('span', 'permalink-tweet-geo-text').text[12: -1]
        # adds to deleted_tweets counter if the tweet or its location cannot be retrieved
        except Exception:
            deleted_tweets += 1
        # appends both values together so the tweets and locations lists stay aligned
        else:
            tweets.append(tweet)
            locations.append(location)
    print(f'Deleted Tweets = {deleted_tweets}')                   # prints number of deleted tweets in dataframe
    tweet_df = pd.DataFrame(t.__dict__ for t in tweets)           # creates dataframe using tweet objects in tweets list
    tweet_df = tweet_df.astype({'tweet_id': 'int64'})             # converts tweet_id from str to int so it can be merged to id_df
    # dropping columns that are not useful for our analysis
    tweet_df.drop(columns = ['user_id', 'tweet_url',
        'timestamp_epochs', 'text_html', 'links',
        'has_media', 'img_urls', 'video_url', 'is_replied', 'is_reply_to', 'parent_tweet_id',
        'reply_to_users'], inplace = True)
    # merge original df to new df to keep sentiment score
    tweet_df = pd.merge(left = tweet_df, right = df, on = 'tweet_id')  
    # create dataframe of locations from list of scraped locations
    location_df = pd.DataFrame(locations, columns = ['location'])       
    # merge locations dataframe onto tweets dataframe with sentiments
    # merge on index because the index is the same
    tweet_df = pd.merge(tweet_df, location_df, left_index= True, right_index = True) 
    # creates city column by using the location value up to the last word - which is the state (or country if outside of the US) 
    # removes the comma at the end of the city name
    tweet_df['city'] = [' '.join(loc.split()[:-1:]).replace(',', '') for loc in tweet_df['location']]
    # creates state column by taking the last word of the location value - (returns the country if location is outside US)
    tweet_df['state'] = [loc.split()[-1] for loc in tweet_df['location']]
    # removes location column as it is no longer necessary
    tweet_df.drop(columns = ['location'], inplace = True)
    return tweet_df
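
The city/state split at the end of `get_tweet_df` can be checked in isolation. The sketch below pulls that logic into a hypothetical helper, `split_location` (not part of the original notebook), and assumes locations look like `'City, ST'`. The guard for an empty location is also an addition for illustration.

```python
def split_location(location):
    """Split a scraped location such as 'Boulder, CO' into (city, state).

    Mirrors the list comprehensions in get_tweet_df: the last
    whitespace-separated word is treated as the state (or country),
    and everything before it is the city, with commas removed.
    """
    words = location.split()
    if not words:  # added guard: an empty location has nothing to split
        return '', ''
    city = ' '.join(words[:-1]).replace(',', '')
    state = words[-1]
    return city, state


print(split_location('Boulder, CO'))
print(split_location('Mountain View, CA'))
print(split_location('Auckland, New Zealand'))
```

Because only the last word is taken as the state, a multi-word country such as "New Zealand" ends up split across both columns (`'Auckland New'` and `'Zealand'`). This does not affect two-letter US state codes.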

### Gathering Original Tweet and Location Data

In [5]:
tweets = get_tweet_df(tweets_id)

Deleted Tweets = 419


In [6]:
tweets.head()

Unnamed: 0,screen_name,username,tweet_id,timestamp,text,hashtags,likes,retweets,replies,sentiment_score,city,state
0,ChefHosea,Hosea Rosenberg,1254995208888094721,2020-04-28 04:45:49,Xmas in April covid19 edition @santoboulder #b...,"[burrito, xmas, smotheredburrito, greenchile, ...",1,0,0,0.0,Boulder,CO
1,Robin4ascii,Robin Hubbard,1254995485452050438,2020-04-28 04:46:55,@QUORA a sample of #Homework Topics Q&A -- ask...,"[Homework, schools, collegestudents, universit...",0,0,0,0.0,Mountain View,CA
2,iam_ifegbolahan,Ise Owo Omogbolahan,1254995705527177216,2020-04-28 04:47:47,"""There is nothing more beautiful than someone ...","[covid_19, CoronaVirus, Art, IseOwoOmogbolahan...",1,0,0,0.525,Ikeja,Nigeria
3,JC_RWRC,J.C.,1254995868333289475,2020-04-28 04:48:26,Another one in the books today for an AMAZING ...,[],0,0,0,0.2875,Jonesboro,AR
4,harsh05710408,harsh vyas,1254996123514789893,2020-04-28 04:49:27,#gvkemri to fight against #covid19 @ GVK Emri ...,"[gvkemri, covid19]",0,0,0,0.0,Medchal,India


In [10]:
tweets.shape

(12898, 12)

#### Saving this Data as a .CSV File 

In [None]:
#tweets.to_csv("../data/tweets_all.csv", index = False)

### Selecting Tweets from Specific States Function

In [7]:
# extract specific states from the cleaned data
def get_states(data, states = None):
    # fall back to an empty list without using a mutable default argument
    states = states or []
    # instantiate empty list to place the dataframe of each state
    df_list = []
    # loop through the provided list of states and add the dataframe containing each state to the dataframe list
    for state in states:
        df = data.loc[data['state'] == state]
        df_list.append(df)
    # create new dataframe by concatenating each dataframe in the dataframe list
    new_df = pd.concat(objs = df_list)
    # return the dataframe of specified states
    return new_df
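
A more compact way to make the same selection is a single boolean mask built with `isin`. The sketch below uses a hypothetical `get_states_isin` helper and a toy dataframe with placeholder rows. It is an alternative for illustration, not the function used in this notebook.

```python
import pandas as pd


def get_states_isin(data, states):
    # keep only the rows whose 'state' value appears in the requested list
    return data.loc[data['state'].isin(states)]


# toy data standing in for the scraped tweets (placeholder rows)
toy = pd.DataFrame({
    'city': ['Boulder', 'Queens', 'Austin', 'Mountain View', 'Brooklyn'],
    'state': ['CO', 'NY', 'TX', 'CA', 'NY'],
})

subset = get_states_isin(toy, ['NY', 'TX'])
print(subset)
```

One difference to keep in mind: `get_states` groups the result state by state in the order the states were passed, whereas the `isin` version keeps the rows in their original order.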

### Selecting Tweets from New York and Texas

In [8]:
tweets_ny_tx = get_states(tweets, states=['NY', 'TX'])
tweets_ny_tx.head()

Unnamed: 0,screen_name,username,tweet_id,timestamp,text,hashtags,likes,retweets,replies,sentiment_score,city,state
5,BillBodouva,Real Estate BuyerRep,1254996305082036224,2020-04-28 04:50:10,Happy Birthday Mom! You wouldn’t believe what...,"[bestmom, bestfriend, happybirthday, 1son, cov...",0,0,0,0.313973,Sands Point,NY
13,zagnut99,Frank Zagottis,1254999586302820353,2020-04-28 05:03:12,Isolation dinners continue with Isabel’s roast...,"[Dinner, Chicken, Isolation, IsolationDinner, ...",0,0,0,-0.6,Queens,NY
23,johnnybebad666_,John Fitzgerald Kennedy Page®,1255004404723462146,2020-04-28 05:22:21,COVID-19 update \nThis MTA BUS didn't pick me ...,[],0,0,0,0.0,Queens,NY
26,juanjeremy100,Cornerman Juan Jeremy,1255006251676831744,2020-04-28 05:29:41,When the #Cornerman confronts #COVID19 #Pneumo...,"[Cornerman, COVID19, Pneumonia, coronavirus, c...",3,1,0,0.0,Brooklyn,NY
239,officialdanek,Danèk,1255067181924134913,2020-04-28 09:31:48,Just chilling by the fire! breeeee.x \n.\n.\n...,"[model, instagram, tiktok, picoftheday, ootd, ...",0,0,0,-0.2625,Manhattan,NY


In [11]:
tweets_ny_tx.shape

(1731, 12)

## Gathering Covid-19 Occurrences Data

We will use a publicly available dataset created by The New York Times and available [here](https://github.com/nytimes/covid-19-data).
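
As a rough sketch of how this dataset could be narrowed to the states and dates covered by the tweets, the snippet below assumes the state-level file in that repository (`us-states.csv`, with `date`, `state`, `fips`, `cases`, and `deaths` columns, where `state` holds full state names rather than two-letter codes). The `STATE_NAMES` mapping and `filter_cases` helper are hypothetical names, and the inline rows use made-up counts purely so the example runs without a download.

```python
import io

import pandas as pd

# hypothetical mapping from the two-letter codes used in the tweet data
# to the full state names assumed to appear in the NYT file
STATE_NAMES = {'NY': 'New York', 'TX': 'Texas'}


def filter_cases(cases, state_codes, start, end):
    # translate state codes to full names, then keep matching rows
    # whose date falls inside the inclusive [start, end] range
    names = [STATE_NAMES[code] for code in state_codes]
    in_states = cases['state'].isin(names)
    in_range = cases['date'].between(start, end)
    return cases.loc[in_states & in_range]


# placeholder rows in the assumed us-states.csv layout (counts are made up)
sample = io.StringIO(
    "date,state,fips,cases,deaths\n"
    "2020-04-27,New York,36,100,10\n"
    "2020-04-28,New York,36,110,11\n"
    "2020-04-28,Texas,48,50,5\n"
    "2020-04-28,Ohio,39,30,3\n"
    "2020-05-07,Texas,48,70,7\n"
)
cases = pd.read_csv(sample, parse_dates=['date'])

subset = filter_cases(cases, ['NY', 'TX'], '2020-04-28', '2020-05-06')
print(subset)
```

With the real file, `sample` would be replaced by the path to a local copy of the state-level CSV. The date bounds match the April 28th to May 6th window of the tweet dataset.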

## Gathering State Policy Data 

## References

[Original Tweets IDs and Sentiment Scores Dataset](https://ieee-dataport.org/open-access/corona-virus-covid-19-geolocation-based-sentiment-data)  
[Covid-19 Cases and Deaths Dataset](https://github.com/nytimes/covid-19-data)