# Sentiment Analysis of Covid-19 Related Tweets based on Geographic Location and State Policies

#### Authors: Steve Diamond [(GitHub)](https://github.com/ssdiam2000), Markell Jones-Francis [(GitHub)](https://git.generalassemb.ly/markelljones-francis), and Julia Kelman [(GitHub)](https://git.generalassemb.ly/julia-kelman/)

## Loading Libraries

In [1]:
pip install twitterscraper

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import twitterscraper as ts
from twitterscraper.tweet import Tweet
from time import strftime, localtime
import json

INFO: {'User-Agent': 'Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16'}


## Gathering Twitter Data

We will use a publicly available dataset created by Rabindra Lamsal and available [here](https://ieee-dataport.org/open-access/corona-virus-covid-19-geolocation-based-sentiment-data) (more specifically, the `april28-may6.csv` file). This data provides tweet IDs and sentiment scores for tweets posted between April 28th and May 6th 2020 that mention "corona", "covid-19", "coronavirus" or a variant of "sars-cov-2" and that have location information available.  
In compliance with Twitter's data sharing policy, the original dataset shares only the tweet IDs. As a result, we need to hydrate these IDs using twitterscraper in order to retrieve the content of each tweet and its location information, and then save this information in a new dataframe. 

### Loading Tweet Id and Sentiment Score Data

In [3]:
tweets_id = pd.read_csv("./april28-may6.csv", header=None)
tweets_id.rename(columns={0:'tweet_id', 1:'sentiment_score'}, inplace=True)
tweets_id.head()

Unnamed: 0,tweet_id,sentiment_score
0,1254995208888094721,0.0
1,1254995485452050438,0.0
2,1254995705527177216,0.525
3,1254995868333289475,0.2875
4,1254996123514789893,0.0


### Getting Original Tweet and Location Function

In [4]:
def get_tweet_df(df):
    tweets = []          # creates list to store Tweet objects                                                     
    locations = []       # creates list to store scraped location data
    deleted_tweets = 0   # starts counter for deleted tweets if tweet can no longer be reached
    
    # loops through each tweet_id in the imported dataframe of tweet_id
    for tweet_id in df['tweet_id']:
        url = f'https://twitter.com/anyuser/status/{tweet_id}'    # sets url to the tweet - tweet urls do not need a specified user
        res = requests.get(url)                                   # request for tweet url
        soup = BeautifulSoup(res.content, 'lxml')                 # creates beautifulsoup object for the webpage of the tweet
        # attempts to create tweet object using beautifulsoup object
        # and to scrape the location from the beautifulsoup object
        try:
            tweet = Tweet.from_soup(soup)
            location = soup.find('span', 'permalink-tweet-geo-text').text[12: -1]
        # adds to deleted_tweets counter if the tweet or its location cannot be read
        except Exception:
            deleted_tweets += 1
        # stores tweet and location together so both lists stay aligned
        else:
            tweets.append(tweet)
            locations.append(location)
    print(f'Deleted Tweets = {deleted_tweets}')                   # prints number of deleted tweets in dataframe
    tweet_df = pd.DataFrame(t.__dict__ for t in tweets)           # creates dataframe using tweet objects in tweets list
    tweet_df = tweet_df.astype({'tweet_id': 'int64'})             # converts tweet_id from str to int so it can be merged to id_df
    # dropping columns that are not useful for our analysis
    tweet_df.drop(columns = ['user_id', 'tweet_url',
        'timestamp_epochs', 'text_html', 'links',
        'has_media', 'img_urls', 'video_url', 'is_replied', 'is_reply_to', 'parent_tweet_id',
       'reply_to_users'], inplace = True)
    # merge original df to new df to keep sentiment score
    tweet_df = pd.merge(left = tweet_df, right = df, on = 'tweet_id')  
    # create dataframe of locations from list of scraped locations
    location_df = pd.DataFrame(locations, columns = ['location'])       
    # merge locations dataframe onto tweets dataframe with sentiments
    # merge on index because the index is the same
    tweet_df = pd.merge(tweet_df, location_df, left_index= True, right_index = True) 
    # creates city column by using the location value up to the last word - which is the state (or country if outside of the US) 
    # removes the comma at the end of the city name
    tweet_df['city'] = [' '.join(loc.split()[:-1:]).replace(',', '') for loc in tweet_df['location']]
    # creates state column by taking the last word of the location value - (returns the country if location is outside US)
    tweet_df['state'] = [loc.split()[-1] for loc in tweet_df['location']]
    # removes location column as it is no longer necessary
    tweet_df.drop(columns = ['location'], inplace = True)
    return tweet_df
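
The city and state split above keys on the last whitespace-separated word, so a multi-word country name would be cut down to its final word. As a minimal sketch, assuming the scraped location strings follow a `City, Region` pattern, a split on the last comma keeps multi-word regions intact. The helper name `split_location` is hypothetical and is not part of the original notebook.

```python
def split_location(location):
    """Split 'City, Region' into (city, region) on the last comma.

    Hypothetical helper: assumes the region follows the last comma.
    """
    city, sep, region = location.rpartition(',')
    if not sep:
        # no comma found: treat the whole string as the region
        return '', location.strip()
    return city.strip(), region.strip()


print(split_location('Mountain View, CA'))
print(split_location('London, United Kingdom'))
```

Whether this is preferable depends on how consistently the location text is formatted, which would need to be checked against the scraped data.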

### Gathering Original Tweet and Location Data

In [5]:
tweets = get_tweet_df(tweets_id)

Deleted Tweets = 419


In [6]:
tweets.head()

Unnamed: 0,screen_name,username,tweet_id,timestamp,text,hashtags,likes,retweets,replies,sentiment_score,city,state
0,ChefHosea,Hosea Rosenberg,1254995208888094721,2020-04-28 04:45:49,Xmas in April covid19 edition @santoboulder #b...,"[burrito, xmas, smotheredburrito, greenchile, ...",1,0,0,0.0,Boulder,CO
1,Robin4ascii,Robin Hubbard,1254995485452050438,2020-04-28 04:46:55,@QUORA a sample of #Homework Topics Q&A -- ask...,"[Homework, schools, collegestudents, universit...",0,0,0,0.0,Mountain View,CA
2,iam_ifegbolahan,Ise Owo Omogbolahan,1254995705527177216,2020-04-28 04:47:47,"""There is nothing more beautiful than someone ...","[covid_19, CoronaVirus, Art, IseOwoOmogbolahan...",1,0,0,0.525,Ikeja,Nigeria
3,JC_RWRC,J.C.,1254995868333289475,2020-04-28 04:48:26,Another one in the books today for an AMAZING ...,[],0,0,0,0.2875,Jonesboro,AR
4,harsh05710408,harsh vyas,1254996123514789893,2020-04-28 04:49:27,#gvkemri to fight against #covid19 @ GVK Emri ...,"[gvkemri, covid19]",0,0,0,0.0,Medchal,India


In [10]:
tweets.shape

(12898, 12)

#### Saving this Data as a .CSV File 

In [None]:
#tweets.to_csv("../data/tweets_all.csv", index = False)

### Selecting Tweets from Specific States Function

In [7]:
# extract specific states from the cleaned data
def get_states(data, states = []):
    # instantiate empty list to place the dataframe of each state
    df_list = []
    # loop through the provided list of states and add the dataframe containing each state to the dataframe list
    for state in states:
        df = data.loc[data['state'] == state]
        df_list.append(df)
    # create new dataframe by concatenating each dataframe in the dataframe list
    new_df = pd.concat(objs = df_list)
    #return the dataframe of specified states
    return new_df
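
As an alternative sketch, the same selection can be written with a single boolean mask. The name `get_states_isin` and the toy data are hypothetical. Note two differences from `get_states`: rows keep their original order instead of being grouped state by state, and an empty list of states returns an empty dataframe rather than making `pd.concat` raise an error.

```python
import pandas as pd


def get_states_isin(data, states):
    """Return the rows of `data` whose 'state' value is in `states`.

    Hypothetical alternative to get_states: rows keep their original order.
    """
    return data[data['state'].isin(states)]


# toy data for illustration only
example = pd.DataFrame({'state': ['NY', 'CA', 'TX', 'NY'], 'likes': [1, 2, 3, 4]})
subset = get_states_isin(example, ['NY', 'TX'])
print(subset)
```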

### Selecting Tweets from New York and Texas

In [8]:
tweets_ny_tx = get_states(tweets, states=['NY', 'TX'])
tweets_ny_tx.head()

Unnamed: 0,screen_name,username,tweet_id,timestamp,text,hashtags,likes,retweets,replies,sentiment_score,city,state
5,BillBodouva,Real Estate BuyerRep,1254996305082036224,2020-04-28 04:50:10,Happy Birthday Mom! You wouldn’t believe what...,"[bestmom, bestfriend, happybirthday, 1son, cov...",0,0,0,0.313973,Sands Point,NY
13,zagnut99,Frank Zagottis,1254999586302820353,2020-04-28 05:03:12,Isolation dinners continue with Isabel’s roast...,"[Dinner, Chicken, Isolation, IsolationDinner, ...",0,0,0,-0.6,Queens,NY
23,johnnybebad666_,John Fitzgerald Kennedy Page®,1255004404723462146,2020-04-28 05:22:21,COVID-19 update \nThis MTA BUS didn't pick me ...,[],0,0,0,0.0,Queens,NY
26,juanjeremy100,Cornerman Juan Jeremy,1255006251676831744,2020-04-28 05:29:41,When the #Cornerman confronts #COVID19 #Pneumo...,"[Cornerman, COVID19, Pneumonia, coronavirus, c...",3,1,0,0.0,Brooklyn,NY
239,officialdanek,Danèk,1255067181924134913,2020-04-28 09:31:48,Just chilling by the fire! breeeee.x \n.\n.\n...,"[model, instagram, tiktok, picoftheday, ootd, ...",0,0,0,-0.2625,Manhattan,NY


In [11]:
tweets_ny_tx.shape

(1731, 12)

#### Saving this Data in a .CSV File 

In [None]:
#tweets_ny_tx.to_csv("../data/tweets_ny_tx.csv", index = False)

## Gathering Covid-19 Occurrences Data 

We will use a publicly available dataset created by The New York Times and available [here](https://github.com/nytimes/covid-19-data). The New York Times provides information about the number of Covid-19 cases and deaths at the country, state, and county level. More specifically, we used the `us-states.csv` and `us-counties.csv` files, downloaded on 5/7/2020 and providing information up to 5/6/2020.

In [14]:
covid_states = pd.read_csv("./us-states.csv")
covid_states.head()

Unnamed: 0,date,state,fips,cases,deaths
0,2020-01-21,Washington,53,1,0
1,2020-01-22,Washington,53,1,0
2,2020-01-23,Washington,53,1,0
3,2020-01-24,Illinois,17,1,0
4,2020-01-24,Washington,53,1,0


In [17]:
covid_counties = pd.read_csv("./us-counties.csv")
covid_counties.head()

Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0
1,2020-01-22,Snohomish,Washington,53061.0,1,0
2,2020-01-23,Snohomish,Washington,53061.0,1,0
3,2020-01-24,Cook,Illinois,17031.0,1,0
4,2020-01-24,Snohomish,Washington,53061.0,1,0


### Selecting Covid-19 Data for New York and Texas

In [19]:
covid_states_ny_tx_all_dates = get_states(covid_states, states=['New York', 'Texas']) 

In [20]:
covid_counties_ny_tx_all_dates = get_states(covid_counties, states=['New York', 'Texas'])

### Selecting Information from April 28th to May 6th 2020

In [22]:
def get_dates(data):
    data_date = data[data['date']>='2020-04-28']
    return data_date
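
`get_dates` applies only a lower bound, which is enough here because the downloaded files stop at 5/6/2020. With a later download an upper bound would also be needed. A minimal sketch of a bounded filter follows; the name `get_date_range`, its default arguments, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def get_date_range(data, start='2020-04-28', end='2020-05-06'):
    """Keep rows whose 'date' falls between start and end, both inclusive.

    Hypothetical variant of get_dates with an explicit end date.
    """
    dates = pd.to_datetime(data['date'])
    return data[dates.between(pd.Timestamp(start), pd.Timestamp(end))]


# toy data for illustration only
example = pd.DataFrame({'date': ['2020-04-27', '2020-04-28', '2020-05-06', '2020-05-07'],
                        'cases': [1, 2, 3, 4]})
window = get_date_range(example)
print(window)
```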

In [50]:
covid_states_ny_tx = get_dates(covid_states_ny_tx_all_dates)
covid_states_ny_tx.head()

Unnamed: 0,date,state,fips,cases,deaths
3127,2020-04-28,New York,36,295137,17638
3182,2020-04-29,New York,36,299722,18015
3237,2020-04-30,New York,36,304401,18321
3292,2020-05-01,New York,36,308345,18610
3347,2020-05-02,New York,36,313008,18909


In [51]:
covid_states_ny_tx.shape

(18, 5)

In [52]:
covid_counties_ny_tx = get_dates(covid_counties_ny_tx_all_dates)
covid_counties_ny_tx.head()

Unnamed: 0,date,county,state,fips,cases,deaths
96890,2020-04-28,Albany,New York,36001.0,1009,45
96891,2020-04-28,Allegany,New York,36003.0,35,0
96892,2020-04-28,Broome,New York,36007.0,266,15
96893,2020-04-28,Cattaraugus,New York,36009.0,45,1
96894,2020-04-28,Cayuga,New York,36011.0,48,1


In [53]:
covid_counties_ny_tx.shape

(2425, 6)

### Combining State and County Data 

In [77]:
def combine_state_county(state_df, county_df):
    df_list = []
    # loop through New York and Texas and merge each state's county rows with its state rows on date
    for state in ['New York', 'Texas']:
        state_data = state_df[state_df['state'] == state]
        county_data = county_df[county_df['state'] == state]
        df = pd.merge(left=county_data, right=state_data, on='date')
        df_list.append(df)
    # create new dataframe by concatenating each dataframe in the dataframe list
    new_df = pd.concat(objs = df_list)
    new_df.drop(columns=['state_y'], inplace=True)
    new_df.rename(columns={'state_x':'state','fips_x':'county_fips', 'cases_x':'county_cases',
                           'deaths_x':'county_deaths', 'fips_y':'state_fips',
                           'cases_y':'state_cases', 'deaths_y':'state_deaths'}, inplace=True)
    #return the dataframe of specified states
    return new_df
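
The per-state loop could also be expressed as one merge on both `date` and `state`. The sketch below is a hypothetical alternative, not the function used in this notebook: `combine_state_county_merge` is an invented name and the case and death counts in the toy data are made up.

```python
import pandas as pd


def combine_state_county_merge(state_df, county_df):
    """Attach state-level totals to each county row in a single merge."""
    merged = pd.merge(left=county_df, right=state_df, on=['date', 'state'],
                      suffixes=('_county', '_state'))
    # rename to the column names used in the rest of this notebook
    return merged.rename(columns={'fips_county': 'county_fips', 'cases_county': 'county_cases',
                                  'deaths_county': 'county_deaths', 'fips_state': 'state_fips',
                                  'cases_state': 'state_cases', 'deaths_state': 'state_deaths'})


# toy data for illustration only (counts are invented)
states = pd.DataFrame({'date': ['2020-04-28', '2020-04-28'], 'state': ['New York', 'Texas'],
                       'fips': [36, 48], 'cases': [100, 50], 'deaths': [10, 5]})
counties = pd.DataFrame({'date': ['2020-04-28'] * 3, 'county': ['Albany', 'Allegany', 'Harris'],
                         'state': ['New York', 'New York', 'Texas'],
                         'fips': [36001.0, 36003.0, 48201.0], 'cases': [7, 3, 20], 'deaths': [1, 0, 2]})
combined = combine_state_county_merge(states, counties)
print(combined)
```

Merging on both keys removes the need to hard-code the state names inside the function.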

In [78]:
covid_ny_tx = combine_state_county(covid_states_ny_tx, covid_counties_ny_tx)
covid_ny_tx.head(2)

Unnamed: 0,date,county,state,county_fips,county_cases,county_deaths,state_fips,state_cases,state_deaths
0,2020-04-28,Albany,New York,36001.0,1009,45,36,295137,17638
1,2020-04-28,Allegany,New York,36003.0,35,0,36,295137,17638


In [71]:
covid_ny_tx.shape

(2425, 9)

According to information from The New York Times, all cases for the five boroughs of New York City (New York, Kings, Queens, Bronx and Richmond counties) were assigned to a single area called `New York City`. Since this area is not an official county, it has no FIPS code and `county_fips` is set to `NaN` for those rows.  
In order to be able to merge covid data with population data later on, we need to assign a FIPS code to this New York City area. We will assign the placeholder `36000`, which is not a real FIPS code. 

In [137]:
covid_ny_tx['county_fips'] = covid_ny_tx['county_fips'].fillna(36000)

In [231]:
covid_ny_tx['county_fips'] = covid_ny_tx['county_fips'].astype(int)
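
Filling every missing value assigns the placeholder to any row without a FIPS code, not only to `New York City`. As a hedged sketch, the assignment can be restricted to the rows it is meant for. The helper `fill_nyc_fips`, the constant name, and the `Unknown` row in the toy data are hypothetical.

```python
import pandas as pd

NYC_PLACEHOLDER_FIPS = 36000  # placeholder used in this notebook, not a real FIPS code


def fill_nyc_fips(df):
    """Give the 'New York City' rows the placeholder code; leave other rows untouched."""
    out = df.copy()
    is_nyc = (out['state'] == 'New York') & (out['county'] == 'New York City')
    out.loc[is_nyc, 'county_fips'] = NYC_PLACEHOLDER_FIPS
    return out


# toy data for illustration only
example = pd.DataFrame({'county': ['New York City', 'Albany', 'Unknown'],
                        'state': ['New York', 'New York', 'Texas'],
                        'county_fips': [float('nan'), 36001.0, float('nan')]})
filled = fill_nyc_fips(example)
print(filled)
```

Rows that still have a missing code after this step would need separate handling before the column is cast to `int`.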

## Gathering Population Data 

Note: the population file includes one row per state giving the state-wide total, in which the county name is the name of the state. 
We need to delete those rows and store their information in a new column instead. 

In [324]:
population = pd.read_csv('./co-est2019-alldata.csv', usecols=['STATE', 'COUNTY', 'STNAME', 'CTYNAME', 'POPESTIMATE2019'], encoding = "ISO-8859-1", engine='python')
population.head()

Unnamed: 0,STATE,COUNTY,STNAME,CTYNAME,POPESTIMATE2019
0,1,0,Alabama,Alabama,4903185
1,1,1,Alabama,Autauga County,55869
2,1,3,Alabama,Baldwin County,223234
3,1,5,Alabama,Barbour County,24686
4,1,7,Alabama,Bibb County,22394


### Selecting Population Data for New York and Texas

In [323]:
def format_population(df):
    #renaming columns to match our formatting (on a copy, so the input dataframe is left unchanged)
    df = df.rename(columns={'STATE':'state_fips', 'COUNTY':'county_fips', 'STNAME':'state', 'CTYNAME':'county', 'POPESTIMATE2019':'county_population'})
    #selecting information for NY and TX
    df = get_states(df, states=['New York', 'Texas'])
    #reformatting county fips to match covid dataframe format:
    #state fips followed by the 3-digit county fips (e.g. 36 and 1 give 36001)
    df['county_fips'] = df['state_fips'] * 1000 + df['county_fips']
    #creating a state_population column
    df['state_population'] = [df[df['county']=='Texas']['county_population'].values[0] if x == 'Texas' else df[df['county']=='New York']['county_population'].values[0] for x in df['state']]
        # code adapted from https://stackoverflow.com/questions/50375985/pandas-add-column-with-value-based-on-condition-based-on-other-columns
    #deleting rows giving state population data instead of county data
    for state in ['New York', 'Texas']:
        df = df.drop(df["county"].loc[df["county"]==state].index)
        # code adapted from https://stackoverflow.com/questions/53182464/pandas-delete-a-row-in-a-dataframe-based-on-a-value
    #adding `New York City` county row compiling the information from NYC 5 boroughs to match the covid dataset format
    boroughs = ['New York County', 'Kings County', 'Bronx County', 'Richmond County', 'Queens County']
    ny = df[df['state']=='New York']
    d = {'state_fips':36, 'county_fips':36000, 'state':'New York', 'county':'New York City', 
     'county_population':[ny[ny['county'].isin(boroughs)]['county_population'].sum()], 
     'state_population':ny['state_population'].values[0]}
    population_nyc = pd.DataFrame(data=d)
    df = pd.concat(objs=[df, population_nyc])
    return df 
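
The population file stores only the county part of the code (1, 3, 5, ...), while the covid data uses the combined state-plus-county code (36001, 36003, ...). The reformatting step in `format_population` bridges the two. A minimal sketch of that rule on plain integers, with the hypothetical helper name `build_county_fips`:

```python
def build_county_fips(state_fips, county_fips):
    """Combine a state code and a county code into one integer.

    Hypothetical helper: the county part is zero-padded to three digits,
    so state 36 and county 1 give 36001.
    """
    return int(f"{state_fips}{county_fips:03d}")


print(build_county_fips(36, 1))
print(build_county_fips(48, 201))
```

Because the result is stored as an integer, single-digit state codes lose their leading zero, which is consistent with the integer `county_fips` column used for the covid data.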

In [326]:
population_ny_tx = format_population(population)

In [330]:
population_ny_tx.head()

Unnamed: 0,state_fips,county_fips,state,county,county_population,state_population
1861,36,36001,New York,Albany County,305506,19453561
1862,36,36003,New York,Allegany County,46091,19453561
1863,36,36005,New York,Bronx County,1418207,19453561
1864,36,36007,New York,Broome County,190488,19453561
1865,36,36009,New York,Cattaraugus County,76117,19453561


In [328]:
population_ny_tx.shape

(317, 6)

### Combining Covid-19 and Population Data 

In [339]:
covid_and_pop_ny_tx = pd.merge(left=covid_ny_tx, right=population_ny_tx, on='county_fips')
covid_and_pop_ny_tx.head()

Unnamed: 0,date,county_x,state_x,county_fips,county_cases,county_deaths,state_fips_x,state_cases,state_deaths,state_fips_y,state_y,county_y,county_population,state_population
0,2020-04-28,Albany,New York,36001,1009,45,36,295137,17638,36,New York,Albany County,305506,19453561
1,2020-04-29,Albany,New York,36001,1067,47,36,299722,18015,36,New York,Albany County,305506,19453561
2,2020-04-30,Albany,New York,36001,1165,53,36,304401,18321,36,New York,Albany County,305506,19453561
3,2020-05-01,Albany,New York,36001,1204,55,36,308345,18610,36,New York,Albany County,305506,19453561
4,2020-05-02,Albany,New York,36001,1238,60,36,313008,18909,36,New York,Albany County,305506,19453561


In [340]:
#removing duplicate columns: 
covid_and_pop_ny_tx.drop(columns=['state_fips_y', 'state_y', 'county_y'],inplace=True)
#renaming columns:
covid_and_pop_ny_tx.rename(columns={'county_x':'county', 'state_x':'state', 'state_fips_x':'state_fips'},inplace=True)
covid_and_pop_ny_tx.head()

Unnamed: 0,date,county,state,county_fips,county_cases,county_deaths,state_fips,state_cases,state_deaths,county_population,state_population
0,2020-04-28,Albany,New York,36001,1009,45,36,295137,17638,305506,19453561
1,2020-04-29,Albany,New York,36001,1067,47,36,299722,18015,305506,19453561
2,2020-04-30,Albany,New York,36001,1165,53,36,304401,18321,305506,19453561
3,2020-05-01,Albany,New York,36001,1204,55,36,308345,18610,305506,19453561
4,2020-05-02,Albany,New York,36001,1238,60,36,313008,18909,305506,19453561


In [341]:
covid_and_pop_ny_tx.shape

(2425, 11)
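
The merged dataframe has the same number of rows as `covid_ny_tx` (2425), so the inner merge dropped no rows. As a sketch of how that check could be made explicit, the merge can be run with `how='left'` and `indicator=True` to list codes that have no population match. The function name `unmatched_counties` and the toy data (including the code `99999`) are hypothetical.

```python
import pandas as pd


def unmatched_counties(covid_df, population_df):
    """List county_fips values in covid_df that have no row in population_df."""
    checked = pd.merge(left=covid_df, right=population_df[['county_fips']],
                       on='county_fips', how='left', indicator=True,
                       validate='many_to_one')  # raises if population codes repeat
    missing = checked.loc[checked['_merge'] == 'left_only', 'county_fips']
    return sorted(missing.unique())


# toy data for illustration only
covid = pd.DataFrame({'county_fips': [36001, 36001, 36000, 99999], 'county_cases': [1, 2, 3, 4]})
pop = pd.DataFrame({'county_fips': [36001, 36000], 'county_population': [10, 20]})
print(unmatched_counties(covid, pop))
```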

## Gathering State Policy Data 

## References

[Original Tweets IDs and Sentiment Scores Dataset](https://ieee-dataport.org/open-access/corona-virus-covid-19-geolocation-based-sentiment-data)  
[Covid-19 Cases and Deaths Dataset](https://github.com/nytimes/covid-19-data)