# COVID-19 Infections and Happiness
This is the notebook for the Python for Economics Project at the London School of Economics, analysing the effect of COVID-19 infections on happiness.


## Introduction
Policy-making during an epidemic is all about economic trade-offs, so one would like to quantify the gains and losses in the factors a government is trading off. The trade-offs involving monetary and other classical economic factors are well documented. The social costs of viral cases, among the other social costs of a pandemic, are less so. One may attempt to quantify the social cost of the number of cases of such a virus in a country by looking at the causal effect of COVID-19 cases on the average sentiment people express online.

## Overview Project
In this project the main goal is to regress the average sentiment people express online on the number of COVID-19 infections. You will start by carrying out this analysis for the UK. A clear confounder here is the set of government restrictions imposed to curb the spread of the virus. You will control for this confounder in the regression, alongside time fixed effects that absorb biases caused by new ways of measuring cases and changes in testing accuracy and availability, among other possible biases.
</br></br>
To be able to run this final regression, though, you will need to collect the data. This notebook will walk you through the steps associated with this and the final step of running the regression.


## Table of Contents

>[COVID-19 Infections and Happiness](#scrollTo=M_2dLRCIIqv9)

>>[Introduction](#scrollTo=c3t9AWywlLa_)

>>[Overview Project](#scrollTo=hpTdOHFalo5C)

>>[Table of Contents](#scrollTo=loLc9eEEVSsP)

>>[Preparation](#scrollTo=5kqfEAq9S8KC)

>>[Data Collection](#scrollTo=Xviu1_5NnsrF)

>>>[Loading Datasets](#scrollTo=3gOgjpQpKGoe)

>>>[Cleaning Datasets](#scrollTo=Xzmo0WIbKtrk)

>>>>[Preparation](#scrollTo=Xzmo0WIbKtrk)

>>>[Stringency](#scrollTo=KB_HFVFsdT5H)

>>>>[Cases](#scrollTo=6zlemmyxk2JT)

>>>[Merging Dataframes](#scrollTo=8UXSCwxHg1Gi)

>>>[Average Sentiment](#scrollTo=C_Wv1iKpmdo6)

>>>>[Scraping Tweets](#scrollTo=UVZi_0ONKq5p)

>>>>[Classifying Tweets](#scrollTo=zOiuZ0v8NSKA)

>>[Running Regressions](#scrollTo=1wSD_8JrKslV)

>>[Further Exercises](#scrollTo=HIx-MyVcN7Vh)

>>[References](#scrollTo=GTQjBctvVLWv)




## Preparation
First, you will need to install a few libraries for this project. To install a library, write ``!pip install`` in a code block followed by ``name-library`` and the optional ``--quiet`` keyword to suppress the logs. For example, installing the package ``pandas`` can be done by running ``!pip install pandas --quiet`` in a code block.

(Note: the definitions and methods of confirming cases differ between countries. One option is to look at the percentage change in infections rather than their absolute level; another is to use ONS positivity rates.)

In [24]:
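# TODO - Install the following packages: pandas, datetime.
!pip install pandas --quiet
# Note: the "datetime" module imported below ships with Python's standard library,
# so this install is not strictly needed (pip installs a separate third-party package of that name).
!pip install datetime --quiet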

Now, you have to import the packages you installed. Additionally, import the preinstalled package ``numpy`` as ``np``.

In [25]:
# TODO - Import the installed packages.
# Two additional libraries necessary for CSV uploads are already given (no need to install these, they are installed by default on Colabs).
from google.colab import files
import io
import pandas as pd
import datetime as dt
import numpy as np

## Data Collection
We can now start collecting our data.

### Loading Datasets
The data we will use for this analysis come from the Johns Hopkins University Center for Systems Science and Engineering, Our World in Data and, of course, Twitter.


* The dataset on confirmed cases per country (including the UK) can be found and downloaded [here](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series).
* The dataset on COVID-19 government restriction stringency can be found and downloaded [here](https://ourworldindata.org/covid-stringency-index).
* We will get into the Tweets later.

Once you have downloaded the datasets, you can upload them to Colab by running the commands below, which store one dataset as a Pandas dataframe (sort of like a spreadsheet). It is good coding practice to wrap commands like this in a function. Do this and make the function return both datasets in a list. Then, call the function and assign the result to a variable ``dataframes``, storing the two dataframes in a list.

In [6]:
# This command will prompt you with an upload screen and store the uploaded files in a dictionary.
# You can upload multiple files at once.
uploaded = files.upload()

# This command stores the filenames in a list.
filenames = list(uploaded.keys())

# This command selects the filename of the first file in the files you uploaded.
filename = filenames[0]

# This command reads the uploaded CSV file into a Pandas dataframe.
dataset = pd.read_csv(io.BytesIO(uploaded[filename]))

# TODO - Create and call the function.

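As a rough sketch of the function asked for above, the parsing step can be separated from the upload prompt so that it can be tried without Colab. The helper name `parse_uploaded`, the file names and the file contents are hypothetical; the only assumption about Colab is that `files.upload()` returns a dictionary mapping file names to raw bytes.

```python
import io

import pandas as pd


def parse_uploaded(uploaded):
    """Turn a {filename: bytes} mapping into a list of dataframes, in upload order."""
    return [pd.read_csv(io.BytesIO(content)) for content in uploaded.values()]


# Stand-in for the dictionary returned by files.upload() (made-up names and contents).
fake_upload = {
    "cases.csv": b"Country/Region,1/22/20\nUnited Kingdom,0\n",
    "stringency.csv": b"location,date,stringency_index\nUnited Kingdom,2020-01-22,0.0\n",
}

dataframes = parse_uploaded(fake_upload)
print([df.shape for df in dataframes])
```

In Colab, a function such as `upload_datasets()` could then simply return `parse_uploaded(files.upload())`.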

### Cleaning Datasets
#### Preparation
First, it would be nice to have each of the datasets stored in a variable with a corresponding name. Below I show a trick to assign two variables at once. Use this trick to assign your datasets to the variables ``df_cases`` and ``df_stringency``.

In [None]:
lse, ucl = ["awesome", "mwa"]

# TODO - Replicate the trick with the variable names given.


#### Stringency
Let's start with the easiest dataset first. Inspect the structure of the dataset by printing the dataframe.

In [None]:
# TODO - print the dataframe and inspect the structure.

Clearly, there are lots of variables and countries of which we do not need the data. Therefore, we would like to drop the redundant entries. Do this by selecting only the date and stringency index values for just the United Kingdom. Overwrite ``df_stringency`` with this transformed dataframe. As a final nit-picky step, reset the index of the dataframe.

In [None]:
# TODO - Overwrite the dataframe with the filtered version.
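One possible way to do this filtering is sketched below on a small made-up dataframe. The column name `location` is an assumption about the downloaded file; check the printed dataframe for the actual country column before reusing this.

```python
import pandas as pd

# Made-up miniature of the stringency dataset; "location" is an assumed column name.
df_stringency = pd.DataFrame({
    "location": ["France", "United Kingdom", "United Kingdom"],
    "date": ["2020-03-01", "2020-03-01", "2020-03-02"],
    "stringency_index": [20.0, 11.11, 11.11],
    "other_variable": [1, 2, 3],
})

# Keep only the UK rows and the two columns we need, then renumber the rows from 0.
is_uk = df_stringency["location"] == "United Kingdom"
df_stringency = df_stringency.loc[is_uk, ["date", "stringency_index"]].reset_index(drop=True)
print(df_stringency)
```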

We would like our data to have suitable data types, so that it is easy to work with down the line. For example, we would like the values in our ``date`` column to be of the ``datetime`` data type. Also, we would like the values in our ``stringency_index`` column to be of the ``float`` data type. Check if this is the case and if not, convert the column values to the desired data type.

In [None]:
# TODO - Check if the column values data types are correct and convert them if not.
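A minimal sketch of checking and converting data types, using made-up values stored as strings. Your own columns may already have the right types, in which case the conversion changes nothing.

```python
import pandas as pd

# Made-up data in which both columns start out as strings.
df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02"],
    "stringency_index": ["11.11", "16.67"],
})
print(df.dtypes)  # inspect the types before converting

# Convert the date strings to datetimes and the index values to floats.
df["date"] = pd.to_datetime(df["date"])
df["stringency_index"] = df["stringency_index"].astype(float)
print(df.dtypes)  # inspect the types after converting
```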

#### Cases
Now on to the harder dataset. Inspect the structure of the dataset by printing the dataframe.

In [None]:
# TODO - print the dataframe and inspect the structure.

Again, there are a lot of countries we do not need the data of. Filter the dataframe to only contain records of the UK (be precise here) and overwrite the original dataframe with the filtered one.

In [None]:
# TODO - Filter and overwrite the dataframe of cases.
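The hint to be precise can be illustrated as follows. This sketch assumes the file has ``Country/Region`` and ``Province/State`` columns and that overseas territories appear as separate rows under the same country name; verify both against your printed dataframe. All numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the wide cases dataset (assumed column names).
df_cases = pd.DataFrame({
    "Province/State": ["Bermuda", "Gibraltar", np.nan, np.nan],
    "Country/Region": ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
    "1/22/20": [0, 0, 0, 0],
    "1/23/20": [0, 0, 2, 3],
})

# Filtering on the country alone would also keep the territories,
# so additionally require the province to be missing.
uk_only = (df_cases["Country/Region"] == "United Kingdom") & df_cases["Province/State"].isna()
df_cases = df_cases[uk_only].reset_index(drop=True)
print(df_cases)
```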

Some might think we are done now with this dataset, but this dataset has a nasty characteristic. Namely, it is [*wide*](https://en.wikipedia.org/wiki/Wide_and_narrow_data), and quite *wide*, to say the least. Libraries written for Python and other programming languages hardly support this kind of data shape. Therefore, we want to change the shape of the data to the *narrow* format.
</br></br>
In essence, we would like one column for the date and one column for the confirmed cases. Thus, we need to put the column names in a new variable name called ``date`` and link the corresponding case numbers to the right row.
</br></br>
Convert the dataframe to a *narrow* format. After understanding the concepts by reading the Wikipedia page linked before, use Pandas' [``melt``](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) implementation to achieve this.

In [None]:
# TODO - Convert the dataframe from wide to narrow format.
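To make the wide-to-narrow idea concrete, here is a small sketch with ``melt`` on made-up numbers. The identifier column ``Country/Region`` and the output name ``cases`` are illustrative choices.

```python
import pandas as pd

# One row per country, one column per date: the wide format.
wide = pd.DataFrame({
    "Country/Region": ["United Kingdom"],
    "1/22/20": [0],
    "1/23/20": [2],
    "1/24/20": [2],
})

# Every column not listed in id_vars becomes a (date, cases) row: the narrow format.
narrow = wide.melt(id_vars=["Country/Region"], var_name="date", value_name="cases")
print(narrow)
```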

Now, we would like to convert the date column of data type ``string`` to the data type ``datetime``, because we want to link the time series datasets that we now have parsed to each other and make one big, complete dataset. This is not as easy as it was for the previous dataset, and you will probably find out why.

In [None]:
# TODO - Convert the values of the date column to the datetime data type.
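One reason this conversion is harder is that the date strings are not in ISO format. The sketch below assumes they look like month/day/two-digit-year (for example ``1/22/20``) and passes an explicit format so that nothing is left to guesswork.

```python
import pandas as pd

# Made-up date strings in the assumed month/day/two-digit-year layout.
dates = pd.Series(["1/22/20", "12/31/20", "3/4/21"])

# An explicit format avoids any ambiguity between day-first and month-first parsing.
parsed = pd.to_datetime(dates, format="%m/%d/%y")
print(parsed)
```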

### Merging Dataframes
Now, we would like to merge the dataframes of the COVID-19 cases and COVID-19 policy stringency with each other, so that for each date that is present in both dataframes we have one observation for the stringency and the number of cases. We will use [Pandas' implementation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) of a merge function.

In [None]:
# TODO - Merge "df_cases" with "df_stringency" and save the result in a variable called "df_cases_stringency".
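A sketch of the merge on made-up data. With ``how="left"`` every date in the cases dataframe is kept and dates without a stringency value show up as missing, while ``how="inner"`` keeps only dates present in both dataframes.

```python
import pandas as pd

# Made-up data: the cases series runs one day longer than the stringency series.
df_cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "cases": [36, 40, 51],
})
df_stringency = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02"]),
    "stringency_index": [11.11, 11.11],
})

# Left merge: keeps all case dates, unmatched stringency values become NaN.
df_cases_stringency = df_cases.merge(df_stringency, on="date", how="left")
# Inner merge: keeps only the dates found in both dataframes.
df_inner = df_cases.merge(df_stringency, on="date", how="inner")
print(df_cases_stringency)
```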

Upon inspecting the data, we can see that there are some missing observations for the stringency index, probably because the stringency data does not go as far in time as the cases dataset. To clean this up, we would like to drop these missing values.

In [None]:
# TODO - Drop the missing values in the dataset.

After editing data with code that takes a while to run, you usually want to save your progress by downloading the dataset. (In more advanced projects, you might use a database to store the results of computationally expensive operations.) Thus, download your dataset. You can use the previously imported ``files`` module from Colab for this. Download the dataset as ``cases_stringency.csv``. Make sure to exclude the index when converting the dataframe to CSV.

In [None]:
# TODO - Convert the dataframe to a CSV file and download it.
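The effect of excluding the index can be seen in this small sketch, which writes the CSV to a string instead of a file so that it runs anywhere. The data are made up; the commented lines at the end indicate how the same call would be used in Colab.

```python
import io

import pandas as pd

df = pd.DataFrame({"date": ["2020-03-01"], "cases": [36], "stringency_index": [11.11]})

# Without a path, to_csv returns the CSV text; index=False leaves out the row numbers.
csv_text = df.to_csv(index=False)
print(csv_text)

# Reading the text back gives the same columns, with no extra "Unnamed: 0" column.
round_trip = pd.read_csv(io.StringIO(csv_text))

# In Colab (not run here):
# df.to_csv("cases_stringency.csv", index=False)
# files.download("cases_stringency.csv")
```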

### Average Sentiment
Now, in the data collection part of this project we only have left the task of collecting data on the average sentiment of how people express themselves online.

#### Scraping Tweets
In this section, we will scrape tweets from the UK over the same time period in which the variables ``stringency_index`` and ``cases`` are recorded. In an academic setting, you might prefer to use an official Twitter API, but getting access can take a while. Additionally, few compromises are made by using an unofficial Twitter scraper.
</br></br>
If you are picking up here after a break and your Google Colab runtime has restarted, you can optionally load the dataset you created in the previous parts with the code below.

In [None]:
df_cases_stringency = upload_datasets()[0]

First, we install a library that allows us to easily scrape tweets from Twitter.

In [26]:
!pip install snscrape --quiet


Importing the scraping library.

In [27]:
import snscrape.modules.twitter as sntwitter

We will have to define the date range we want to scrape data from before we start scraping tweets. A useful function for this is Pandas' [``date_range``](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html). Define a date range that starts from the earliest date all the way to the last date in your dataframe ``df_cases_stringency``. Store this range of dates in a variable called ``date_range``.

In [28]:
# TODO - Define the date range.
date_range = pd.date_range(start='1/30/2020', end='1/1/2023')
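Instead of typing the bounds by hand, the range can also be derived from the dataframe itself. The sketch below uses a made-up stand-in for ``df_cases_stringency`` and assumes its ``date`` column already holds datetimes.

```python
import pandas as pd

# Made-up stand-in for df_cases_stringency with a gap on 2020-02-01.
df_cases_stringency = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-02"]),
})

# One entry per calendar day from the earliest to the latest date, gaps included.
date_range = pd.date_range(
    start=df_cases_stringency["date"].min(),
    end=df_cases_stringency["date"].max(),
    freq="D",
)
print(date_range)
```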

Defining a list to store the tweets in.

In [29]:
tweets = []

Defining the number of tweets to be scraped per day. You can change this number to your liking. I recommend running the code with this number first and increasing it only once you are sure the code works, to avoid wasting computation time.

In [30]:
tweets_per_day = 10

As we want to scrape tweets published from the UK, we need to tell this to our scraper. As it so happens, Twitter uses geographic tags users can choose to attach to their tweets. (There are some problems of representativeness with this approach discussed [here](https://developer.twitter.com/en/docs/tutorials/advanced-filtering-for-geo-data) if you are interested.) The UK tag is ``6416b8512febefc9``. If needed while exploring the **optional** further exercises, you can find tags of other countries via the following Twitter API: ``f"https://api.twitter.com/1.1/geo/reverse_geocode.json?lat={latitude}&lon={longitude}&granularity=country"``. You would format the string based on your latitude and longitude variables before plugging the link in your browser or Python API module of choice. Documentation for this API can be found [here](https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-reverse_geocode).

Now, we can start scraping. To get you started with the functionality of the ``snscrape`` module, I have written a simple piece of code that you can run to understand how this module can be used.

In [24]:
# Demonstrating the working of the "enumerate" function.
text_list = ["This", "is", "how", "enumerate", "works."]
for i, text in enumerate(text_list):
  print(i, text)

# Storing the place ID for the UK.
place_id = "6416b8512febefc9"

# Defining the search query for our Twitter scraper.
# The keyword "lang:en" will filter for English tweets only.
# The keywords "since:date" and "until:date" define the time range the tweet has to be from.
# "until" is exclusive, meaning no tweets are scraped from "2020-05-20". "since" is inclusive.
scraped_tweets = sntwitter.TwitterSearchScraper(f"(lang:en place:{place_id} since:2020-05-19 until:2020-05-20)").get_items()

# This piece of code will print 5 tweets.
# For each iteration in the loop, the scraper will scroll to the next tweet in the feed returned by Twitter.
for i, tweet in enumerate(scraped_tweets):
  print(tweet.rawContent)
  print(tweet.date)
  # We will only need the rawContent and date properties of the tweet.
  # tweet.rawContent gives the text of the tweet (string)
  # tweet.date gives the date and time of the tweet (datetime)
  # For more properties, see line 60 and onwards of https://github.com/JustAnotherArchivist/snscrape/blob/master/snscrape/modules/twitter.py.

  # Stopping the loop.
  if i == 4:
    break

0 This
1 is
2 how
3 enumerate
4 works.
@OrinKerr I hope they don’t honour the subpoenas. Take it all the way to the Supreme Court, just like these scoundrels have done.
2020-05-19 23:38:39+00:00
Almost at 500 followers this is exciting 😁 We are feeling the love 😍
.
.
#ChihuahuaLover #twitterdogs #milonmily
#lockdown #dogcelebration #dog #dogs #doggy #dogsduringlockdown #doglover #dogsoftwitter #doglovers #Chihuahua #cute #RETWEEET #RT https://t.co/GAmHwJTLBk
2020-05-19 22:59:36+00:00
Conscious Co. #Gin is a rather eye-catching gin distilled from surplus potatoes that weren't so eye-catching and would have otherwise gone to waste! Plus, six local botanicals make for one fragrant tipple.

https://t.co/45a8IPB6Rv https://t.co/BgtYbKLvrf
2020-05-19 22:31:02+00:00
@carolynewart @BASW_UK @BASW_NI Unity is strength, great contributions tonight, all messages highlightied the importance of being part of the international community of social work. Thank you @ScotsSW @AngieBartoli
@BASW_Cymru @IF

Now that you hopefully understand how this module works, I want you to write a function called ``scrape_time_range``.  This function will have to return a list of scraped tweets, containing the raw content and date for each tweet in the list.
</br></br>
This function should take four arguments:
1. A list to append the scraped tweets to.
2. The place ID.
3. The date range.
4. The number of tweets to be scraped per day.

You want this function to iterate over the dates in the date range first, before defining the search query for that day and scraping the desired number of tweets. Notice that the dates stored in the previously created ``date_range`` are of the data type ``datetime``. They can be converted to strings by using the function [``strftime``](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.strftime.html). You can format the desired output strings with the following keywords:
* ``%Y`` which corresponds to YYYY.
* ``%m`` which corresponds to mm.
* ``%d`` which corresponds to dd.

Make sure to include the hyphens in these dates when converting the date range, as your Twitter search query will be invalid without them. The same applies to the order of the year, month and day in the string.
</br></br>
**Hint:** wrap the output of ``date_range.strftime()`` in ``list()`` to convert the Pandas ``Index`` it returns to a Python list, which is more convenient in this instance.
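The date handling can be tried without scraping anything. This sketch formats a short, made-up date range and builds one search query per day by pairing each date with the next one. The variable names are illustrative.

```python
import pandas as pd

place_id = "6416b8512febefc9"  # the UK place tag given earlier
date_range = pd.date_range(start="2020-05-19", end="2020-05-21")

# Format the dates as YYYY-mm-dd strings, hyphens included.
date_strings = list(date_range.strftime("%Y-%m-%d"))

# "since" is inclusive and "until" is exclusive, so each pair covers exactly one day.
queries = [
    f"lang:en place:{place_id} since:{start} until:{end}"
    for start, end in zip(date_strings[:-1], date_strings[1:])
]
for query in queries:
    print(query)
```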

In [31]:
# TODO
# 1. Convert the date range to a list of date strings.
date_range_list = list(date_range.strftime('%Y-%m-%d'))
# 2. Write the scraping function.
place_id = '6416b8512febefc9'
def scrape_time_range(list_of_tweets, place_id, date_strings, tweets_per_day):
  # Pair each date with the next one: "since" is inclusive and "until" is exclusive,
  # so every pair covers exactly one day. The last date only serves as an end bound.
  for start_date, end_date in zip(date_strings[:-1], date_strings[1:]):
    query = f"lang:en place:{place_id} since:{start_date} until:{end_date}"
    scraped_tweets = sntwitter.TwitterSearchScraper(query).get_items()
    for i, tweet in enumerate(scraped_tweets):
      # Store the text and timestamp of each tweet.
      list_of_tweets.append([tweet.rawContent, tweet.date])
      # Stop once the daily quota is reached.
      if i == tweets_per_day - 1:
        break
  return list_of_tweets

Call your scraping function.
**Warning:** with 10 tweets a day this takes about 40 minutes to run and at a later stage the tweet classification task with the best model would take around 6 hours (but you can do this in batches of course).

In [32]:
# TODO - Call it.
scrape_time_range(tweets, place_id, date_range_list, tweets_per_day)

[['I love It this project @spacedoge_io and i recommend everyone invest on Miner.',
  datetime.datetime(2020, 1, 30, 23, 37, 53, tzinfo=datetime.timezone.utc)],
 ['@grantdashwood 🤣 not quite yet (I think). But who knows... maybe just a matter of time 🤔 #Automation #robotics #robot #bot',
  datetime.datetime(2020, 1, 30, 23, 11, 24, tzinfo=datetime.timezone.utc)],
 ["Celebrating #NationalBackwardDay With My Favourite Family! Here's When @jimmyosmond Announced  Big Sister Marie's Mobility Flaws On The #Osmond Family Show! 🤣\n\n#FridayThoughts For Anyone Missing @donnyosmond\n&amp; @marieosmond I Got Em Back! 😜\n\nEnjoy EVERYONE It's Hilarious 💕💋 https://t.co/xKtw1c1Rt1",
  datetime.datetime(2020, 1, 30, 23, 8, 51, tzinfo=datetime.timezone.utc)],
 ['@TonyThePoett Testing times for the ravaged minds of "War" calm and free with a cup of tea and a fag in hand. Thank you @TonyThePoett \n\nAlways the rebel 🇬🇧',
  datetime.datetime(2020, 1, 30, 23, 5, 38, tzinfo=datetime.timezone.utc)],
 ['Beau

Now, we would like to convert the list of tweets to a dataframe and a CSV file to save our progress. Call the dataframe ``df_tweets`` and the CSV file ``tweets.csv``.

In [35]:
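# TODO - Convert the list of tweets to a dataframe and a CSV file.
df_tweets = pd.DataFrame(tweets)
# Strip the timezone from the timestamps (column 1); Excel files cannot store timezone-aware datetimes.
df_tweets.iloc[:,1] = df_tweets.iloc[:,1].dt.tz_localize(None)
# Note: this cell saves an Excel file ("tweets.xlsx") rather than the CSV file named in the instructions.
df_tweets.to_excel('tweets.xlsx', index=False)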

In [36]:
from google.colab import files
files.download('tweets.xlsx')


#### Classifying Tweets
If you have left off before this chunk and your Colabs runtime has refreshed in the meantime, load the dataset below.

In [37]:
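# Reading the first (and only) uploaded file directly from the upload dictionary,
# as Colab may save it under a different name (e.g. "tweets (1).xlsx") if the file already exists.
from google.colab import files
uploaded = files.upload()
df_tweets = pd.read_excel(io.BytesIO(uploaded[next(iter(uploaded))]))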

Saving tweets.xlsx to tweets (1).xlsx


In [38]:
df_tweets.columns = ['tweet', 'date']

In [39]:
print(df_tweets.head())
print(df_tweets.info())

                                               tweet                date
0  I love It this project @spacedoge_io and i rec... 2020-01-30 23:37:53
1  @grantdashwood 🤣 not quite yet (I think). But ... 2020-01-30 23:11:24
2  Celebrating #NationalBackwardDay With My Favou... 2020-01-30 23:08:51
3  @TonyThePoett Testing times for the ravaged mi... 2020-01-30 23:05:38
4  Beautiful Tip! @emmerdale https://t.co/99lVfBFaYy 2020-01-30 23:05:28
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10670 entries, 0 to 10669
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   tweet   10670 non-null  object        
 1   date    10670 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 166.8+ KB
None


At this stage, we need to define a function that cleans the tweets. Usernames and links mentioned in tweets might confuse the classification model that we will use at a later stage. This can happen when usernames and links contain words that carry a certain sentiment but are not used for that purpose in natural text. Thus, we need to neutralise these words in the tweets. Create a function that converts all users (in the form of ``@username``) to "``@user``" and all links (in the form of ``https://``) to "``https``". Call it ``neutralise_mentions_links`` and make it take one argument called ``text``.
</br></br>
Use the ``.split()`` function of strings in Python. Hashtags start with "#", mentions with "@", links with "https://".

In [40]:
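# TODO - Write a function that neutralises mentions and links in a piece of text (this version also drops hashtags).
def neutralise_mentions_links(text):
    # Split the text into whitespace-separated tokens.
    new_text = text.split()
    modified_text = []
    for string in new_text:
        if string.startswith('#'):
            # Drop hashtags entirely.
            continue
        elif string.startswith('@'):
            # Replace any mention by a neutral placeholder.
            modified_text.append('@user')
        elif string.startswith('https://'):
            # Replace any link by a neutral placeholder.
            modified_text.append('https')
        else:
            modified_text.append(string)
    # Join the tokens back into one string separated by single spaces.
    return ' '.join(modified_text)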

In [41]:
cleaned_tweet = []
for row in df_tweets['tweet']:
  new_row = neutralise_mentions_links(row)
  cleaned_tweet.append(new_row)
print(cleaned_tweet)



In [42]:
df_tweets['clean_tweet'] = cleaned_tweet
print(df_tweets[:5])

                                               tweet                date  \
0  I love It this project @spacedoge_io and i rec... 2020-01-30 23:37:53   
1  @grantdashwood 🤣 not quite yet (I think). But ... 2020-01-30 23:11:24   
2  Celebrating #NationalBackwardDay With My Favou... 2020-01-30 23:08:51   
3  @TonyThePoett Testing times for the ravaged mi... 2020-01-30 23:05:38   
4  Beautiful Tip! @emmerdale https://t.co/99lVfBFaYy 2020-01-30 23:05:28   

                                         clean_tweet  
0  I love It this project @user and i recommend e...  
1  @user 🤣 not quite yet (I think). But who knows...  
2  Celebrating With My Favourite Family! Here's W...  
3  @user Testing times for the ravaged minds of "...  
4                         Beautiful Tip! @user https  


In [43]:
df_tweets.to_excel('cleaned_tweets.xlsx', index=False)
from google.colab import files
files.download('cleaned_tweets.xlsx')


Apply the function to all the tweets in the dataframe.

In [None]:
# TODO - Apply the function to all tweets in the dataframe.
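As an alternative to looping over the rows by hand, Pandas' ``apply`` can run a function on every value of a column. The sketch below uses a compact stand-in named `neutralise` (hypothetical, and unlike the function above it leaves hashtags in place) so that it runs on its own.

```python
import pandas as pd


def neutralise(text):
    """Compact stand-in for neutralise_mentions_links (keeps hashtags)."""
    words = []
    for word in text.split():
        if word.startswith("@"):
            words.append("@user")
        elif word.startswith("https://"):
            words.append("https")
        else:
            words.append(word)
    return " ".join(words)


df_tweets = pd.DataFrame({
    "tweet": ["Beautiful Tip! @emmerdale https://t.co/99lVfBFaYy", "No mentions here"],
})

# apply calls the function once per tweet and returns the results as a new column.
df_tweets["clean_tweet"] = df_tweets["tweet"].apply(neutralise)
print(df_tweets["clean_tweet"].tolist())
```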

Now, we would like to classify the sentiment of the tweets in our dataframe. We task an external library with this exercise. The library we will use is ``happytransformer``. First, we install the library.

In [44]:
!pip install happytransformer --quiet

Second, we import the text classification functionality from the library we installed.

In [45]:
from happytransformer import HappyTextClassification

Third, we load the AI model that has been trained on a large dataset of tweets with sentiment labels. We will use this for the analysis. This type of model is called a transformer model, which you can read more about [here](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)).

In [46]:
happy_tc = HappyTextClassification(model_type="ROBERTA", model_name="cardiffnlp/twitter-roberta-base-sentiment", num_labels=3)

This is a demonstration of how the model can be used. Now write a function called ``classify_sentiment`` that takes in one argument of ``text`` and outputs the label in numeric form. 
</br></br>
It is important for you to know that the label that the NLP model outputs is one of:
* ``LABEL_0``, which corresponds to negative or the numeric form of -1.
* ``LABEL_1``, which corresponds to neutral or the numeric form of 0.
* ``LABEL_2``, which corresponds to positive or the numeric form of 1.

The model outputs one score for each label and returns the label and score corresponding to the label with the highest score.

In [47]:
result = happy_tc.classify_text("I think the Python for Economics week is a great initiative.")
print(result.label, result.score)

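# TODO - Write a function that outputs the label in numeric form.
# Note: this version does not return a numeric label. It appends the raw [label, score] pair
# to the global list "classified" and returns that list; the labels are converted to numbers in a later cell.
classified = []
def classify_sentiment(text):
  sentiment = happy_tc.classify_text(text)
  classified.append([sentiment.label, sentiment.score])
  return classified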

LABEL_2 0.9771161079406738
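The label-to-number conversion described above can be written as a dictionary lookup. This is only a sketch of that step; the names `LABEL_TO_SCORE` and `label_to_numeric` are hypothetical, and the model itself is not loaded here.

```python
# Mapping from the model's labels to numeric sentiment values, as listed above.
LABEL_TO_SCORE = {"LABEL_0": -1, "LABEL_1": 0, "LABEL_2": 1}


def label_to_numeric(label):
    """Convert a label such as "LABEL_2" to its numeric form."""
    return LABEL_TO_SCORE[label]


print(label_to_numeric("LABEL_2"))

# With the model loaded, a classifier returning a number could look like this (not run here):
# def classify_sentiment(text):
#     return label_to_numeric(happy_tc.classify_text(text).label)
```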


In [49]:
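from google.colab import files
uploaded = files.upload()
# Read the uploaded bytes directly, as Colab may rename the saved file (e.g. "cleaned_tweets (1).xlsx").
df_cleaned_tweets = pd.read_excel(io.BytesIO(uploaded[next(iter(uploaded))]))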

Saving cleaned_tweets.xlsx to cleaned_tweets (1).xlsx


In [50]:
cl_tweet = df_cleaned_tweets['clean_tweet']
for row in cl_tweet:
  classify_sentiment(row)
print(classified)

[['LABEL_2', 0.9852529168128967], ['LABEL_1', 0.6525818109512329], ['LABEL_2', 0.9834097623825073], ['LABEL_2', 0.6667680144309998], ['LABEL_2', 0.9594331383705139], ['LABEL_2', 0.8809717893600464], ['LABEL_2', 0.6463283896446228], ['LABEL_0', 0.9427138566970825], ['LABEL_2', 0.9645416140556335], ['LABEL_2', 0.5569496750831604], ['LABEL_2', 0.975188672542572], ['LABEL_2', 0.9666039943695068], ['LABEL_0', 0.8882222175598145], ['LABEL_0', 0.6291963458061218], ['LABEL_0', 0.46580812335014343], ['LABEL_2', 0.490807443857193], ['LABEL_2', 0.6613503098487854], ['LABEL_2', 0.9862269759178162], ['LABEL_1', 0.4567926526069641], ['LABEL_0', 0.9170284867286682], ['LABEL_0', 0.8124008178710938], ['LABEL_2', 0.9868990182876587], ['LABEL_0', 0.5962307453155518], ['LABEL_2', 0.94588303565979], ['LABEL_0', 0.7323468327522278], ['LABEL_0', 0.6218178868293762], ['LABEL_2', 0.9641792178153992], ['LABEL_2', 0.9215770363807678], ['LABEL_2', 0.9606326222419739], ['LABEL_2', 0.9632844924926758], ['LABEL_1', 

In [51]:
df_classified_sentiment = pd.DataFrame(classified, columns=['label','sentiment'])
df_cleaned_tweets = pd.concat([df_cleaned_tweets, df_classified_sentiment], axis=1)

In [52]:
print(df_cleaned_tweets.head())

                                               tweet                date  \
0  I love It this project @spacedoge_io and i rec... 2020-01-30 23:37:53   
1  @grantdashwood 🤣 not quite yet (I think). But ... 2020-01-30 23:11:24   
2  Celebrating #NationalBackwardDay With My Favou... 2020-01-30 23:08:51   
3  @TonyThePoett Testing times for the ravaged mi... 2020-01-30 23:05:38   
4  Beautiful Tip! @emmerdale https://t.co/99lVfBFaYy 2020-01-30 23:05:28   

                                         clean_tweet    label  sentiment  
0  I love It this project @user and i recommend e...  LABEL_2   0.985253  
1  @user 🤣 not quite yet (I think). But who knows...  LABEL_1   0.652582  
2  Celebrating With My Favourite Family! Here's W...  LABEL_2   0.983410  
3  @user Testing times for the ravaged minds of "...  LABEL_2   0.666768  
4                         Beautiful Tip! @user https  LABEL_2   0.959433  


In [53]:
del df_cleaned_tweets['tweet']

In [54]:
# Map the model's labels to numeric sentiment values: negative = -1, neutral = 0, positive = 1.
df_cleaned_tweets['label'] = df_cleaned_tweets['label'].map({'LABEL_0': -1, 'LABEL_1': 0, 'LABEL_2': 1}).astype(int)
print(df_cleaned_tweets)

                     date                                        clean_tweet  \
0     2020-01-30 23:37:53  I love It this project @user and i recommend e...   
1     2020-01-30 23:11:24  @user 🤣 not quite yet (I think). But who knows...   
2     2020-01-30 23:08:51  Celebrating With My Favourite Family! Here's W...   
3     2020-01-30 23:05:38  @user Testing times for the ravaged minds of "...   
4     2020-01-30 23:05:28                         Beautiful Tip! @user https   
...                   ...                                                ...   
10665 2022-12-31 21:37:48  Ending 2022 on a Song. Matt Smith as ‘Daemon T...   
10666 2022-12-31 21:36:45  What a way to close off the year than with a t...   
10667 2022-12-31 21:16:30             @user @user @user Happy Birthday! 😘 xx   
10668 2022-12-31 20:55:44  😃Buy one get one free😃 💨Special Offer - Ends t...   
10669 2022-12-31 20:16:49  @user @user FAB Jumpers are compliments of Mom...   

       label  sentiment  
0          1 

In [55]:
del df_cleaned_tweets['sentiment']

In [56]:
df_cleaned_tweets_v1 = df_cleaned_tweets.copy()

In [57]:
del df_cleaned_tweets_v1['clean_tweet']

In [58]:
print(df_cleaned_tweets_v1.head())

                 date  label
0 2020-01-30 23:37:53      1
1 2020-01-30 23:11:24      0
2 2020-01-30 23:08:51      1
3 2020-01-30 23:05:38      1
4 2020-01-30 23:05:28      1


In [61]:
df_cleaned_tweets_v2['label'].astype(float)

0        1.0
1        0.0
2        1.0
3        1.0
4        1.0
        ... 
10665    1.0
10666    1.0
10667    1.0
10668    0.0
10669    0.0
Name: label, Length: 10670, dtype: float64

In [59]:
df_cleaned_tweets_v1['date'] = pd.to_datetime(df_cleaned_tweets_v1['date'])
df_cleaned_tweets_v1.set_index('date', inplace=True)
daily_mean = df_cleaned_tweets_v1.resample('d').mean()
start_d = pd.to_datetime('2020-01-30')
end_d = pd.to_datetime('2022-12-31')
daily_mean_sentiment = daily_mean.iloc[(daily_mean.index >= start_d) & (daily_mean.index <= end_d)]

print(df_cleaned_tweets_v1)

                     label
date                      
2020-01-30 23:37:53      1
2020-01-30 23:11:24      0
2020-01-30 23:08:51      1
2020-01-30 23:05:38      1
2020-01-30 23:05:28      1
...                    ...
2022-12-31 21:37:48      1
2022-12-31 21:36:45      1
2022-12-31 21:16:30      1
2022-12-31 20:55:44      0
2022-12-31 20:16:49      0

[10670 rows x 1 columns]


In [69]:
df_cleaned_tweets_v3 = df_cleaned_tweets.copy()

In [70]:
del df_cleaned_tweets_v3['clean_tweet']

In [71]:
print(df_cleaned_tweets_v3)

                     date  label
0     2020-01-30 23:37:53      1
1     2020-01-30 23:11:24      0
2     2020-01-30 23:08:51      1
3     2020-01-30 23:05:38      1
4     2020-01-30 23:05:28      1
...                   ...    ...
10665 2022-12-31 21:37:48      1
10666 2022-12-31 21:36:45      1
10667 2022-12-31 21:16:30      1
10668 2022-12-31 20:55:44      0
10669 2022-12-31 20:16:49      0

[10670 rows x 2 columns]


In [72]:
df_cleaned_tweets_v3['date'] = pd.to_datetime(df_cleaned_tweets_v3['date'])
df_cleaned_tweets_v3 = df_cleaned_tweets_v3.resample('d', on='date').mean()

In [73]:
print(df_cleaned_tweets_v3)

            label
date             
2020-01-30    0.7
2020-01-31    0.1
2020-02-01    0.2
2020-02-02    0.4
2020-02-03    0.5
...           ...
2022-12-27    0.1
2022-12-28    0.5
2022-12-29   -0.1
2022-12-30    0.2
2022-12-31    0.6

[1067 rows x 1 columns]


In [86]:
df_mean_label = pd.DataFrame(df_cleaned_tweets_v3)
dates = pd.date_range(start='2020-01-30', end='2022-12-31', freq='D')



In [87]:
df_mean_label['date'] = dates

In [88]:
print(df_mean_label)

            label       date
date                        
2020-01-30    0.7 2020-01-30
2020-01-31    0.1 2020-01-31
2020-02-01    0.2 2020-02-01
2020-02-02    0.4 2020-02-02
2020-02-03    0.5 2020-02-03
...           ...        ...
2022-12-27    0.1 2022-12-27
2022-12-28    0.5 2022-12-28
2022-12-29   -0.1 2022-12-29
2022-12-30    0.2 2022-12-30
2022-12-31    0.6 2022-12-31

[1067 rows x 2 columns]


In [89]:
df_mean_label.reset_index(drop = True)

Unnamed: 0,label,date
0,0.7,2020-01-30
1,0.1,2020-01-31
2,0.2,2020-02-01
3,0.4,2020-02-02
4,0.5,2020-02-03
...,...,...
1062,0.1,2022-12-27
1063,0.5,2022-12-28
1064,-0.1,2022-12-29
1065,0.2,2022-12-30


Apply the ``sentiment_classifier`` function to the tweets and store the returned labels in a new column called ``sentiment``. **Warning:** this can be time-consuming. This notebook was tested with 10 tweets per day, and classifying all the tweets scraped over the time range took 6 hours. If you can't run the notebook for 6 hours straight, classify the tweets in chunks and download the results as you go.

In [None]:
# TODO - Apply the sentiment classifier function to the tweets.
# Store the returned labels in a new column instead of discarding them.
df_cleaned_tweets['sentiment'] = df_cleaned_tweets['clean_tweet'].apply(classify_sentiment)
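
One way to follow the chunking advice is sketched below. It is a minimal illustration rather than the notebook's own method: ``classify_in_chunks`` and ``toy_classifier`` are hypothetical names introduced here, and ``toy_classifier`` merely stands in for the real classifier so that the example runs quickly.

```python
import pandas as pd


def classify_in_chunks(tweets, classifier, chunk_size=100):
    """Apply `classifier` to a Series of tweets, one chunk at a time."""
    results = []
    for start in range(0, len(tweets), chunk_size):
        chunk = tweets.iloc[start:start + chunk_size]
        results.append(chunk.apply(classifier))
        # In a long run, intermediate results could be saved here,
        # for example with pd.concat(results).to_csv(...).
    if not results:
        # Nothing to classify: return an empty Series.
        return pd.Series(dtype="int64")
    return pd.concat(results)


def toy_classifier(text):
    """Hypothetical stand-in for the real sentiment classifier."""
    return 1 if "good" in text else 0


toy_tweets = pd.Series(["good day", "bad day", "good news", "no news", "all good"])
toy_labels = classify_in_chunks(toy_tweets, toy_classifier, chunk_size=2)
print(toy_labels.tolist())
```

In the notebook, the real classifier would be passed in place of ``toy_classifier``, and each finished chunk could be written to disk inside the loop so that an interrupted run does not lose earlier work.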

Now, we want to calculate the average sentiment for each day. We can drop the column of tweets before we transform the dataframe. Store this new dataframe in a variable called ``df_sentiment``.

In [None]:
# TODO - Drop the column of tweets and transform the dataframe.
#Done
df_sentiment = classified
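
The daily averaging step can be sketched on toy data as follows. The column names ``clean_tweet`` and ``sentiment`` follow the text above, the values are invented for illustration, and ``resample`` is only one of several ways to aggregate by day.

```python
import pandas as pd

# Toy tweets with timestamps; the values are invented for illustration.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-01-30 23:37:53",
        "2020-01-30 23:11:24",
        "2020-01-31 08:00:00",
        "2020-02-02 12:00:00",
    ]),
    "clean_tweet": ["a", "b", "c", "d"],
    "sentiment": [1, 0, 1, -1],
})

# Drop the text column, then average the labels within each calendar day.
daily = toy.drop(columns="clean_tweet").resample("D", on="date").mean()
print(daily)
```

Note that days without any tweets (2020-02-01 in this toy example) still appear in the result, with a missing value for the average.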

We have now successfully generated all of the data necessary for the analysis. One last thing to do is to merge the previously merged datasets with our final dataset of average sentiment scores to create the dataframe ``df_covid_happiness``. Upload the previously merged dataset with the code below if necessary.

In [97]:
#Uploading the cases dataset
from google.colab import files 
uploaded = files.upload()
df_cases_stringency = pd.read_excel('COVID_cases_and_stringency.xlsx')

Saving COVID_cases_and_stringency.xlsx to COVID_cases_and_stringency (3).xlsx


In [101]:
df_mean_label = df_mean_label.reset_index(drop = True)

In [103]:
df_mean_label['date'] = pd.to_datetime(df_mean_label['date'])

In [104]:
df_cases_stringency['date'] = pd.to_datetime(df_cases_stringency['date'])

In [143]:
# TODO - Merge the stringency and cases dataset with the sentiment dataset.
df_covid_happiness = pd.merge(df_cases_stringency, df_mean_label, on = ['date'])
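
A small sketch of this kind of merge on toy data is given below. The ``how`` and ``validate`` arguments are optional additions that are not part of the cell above: ``how='inner'`` (the pandas default) keeps only dates present in both dataframes, and ``validate='one_to_one'`` makes pandas raise an error if a date appears more than once on either side. All values are invented.

```python
import pandas as pd

# Toy versions of the two datasets; values are invented for illustration.
toy_cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-01"]),
    "Cases": [0, 2, 2],
    "stringency_index": [5.56, 8.33, 8.33],
})
toy_sentiment = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-31", "2020-02-01", "2020-02-02"]),
    "label": [0.1, 0.2, 0.4],
})

# Keep only dates found in both dataframes and check that dates are unique.
merged = pd.merge(toy_cases, toy_sentiment, on="date",
                  how="inner", validate="one_to_one")
print(merged)
```

Comparing ``len(merged)`` with the lengths of the inputs is a quick way to see how many days were lost in the merge.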

Finally, we save the generated dataset.

In [144]:
filename = "covid_happiness.csv"
df_covid_happiness.to_csv(filename, index=False)
files.download(filename)

## Running Regressions
In this section you will have to run the following regression and report the results:
$average\_sentiment_t = \beta positive\_cases_t + \gamma stringency_t + \eta_t + \varepsilon_t$
</br></br>
Before running this regression, think about how you would interpret the coefficient $\beta$. Would you want to rescale the corresponding variable $positive\_cases$ by some constant to make the coefficient easier to interpret?
</br></br>
When interpreting the regression results you should make sure you understand the definitions of the variables used in the regression. For example, the number of confirmed cases for our purposes is actually the 7-day rolling average.
</br></br>
First we load our dataset if not loaded yet.

In [107]:
df_covid_happiness = upload_datasets()[0]

NameError: ignored

In [145]:
import statistics

Weight the number of cases by some constant.

In [146]:
df_covid_happiness_v1 = df_covid_happiness.copy()

In [147]:
# TODO - Weight the variable to improve the interpretability of the coefficient.
population = 68800000
df_covid_happiness_v1["cases_standardised"] = df_covid_happiness_v1["Cases"] / population
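
As one possible rescaling (an assumption, not the only sensible choice), cases can be expressed per 100,000 inhabitants, so that $\beta$ reads as the change in average sentiment associated with one additional case per 100,000 people. The column name ``cases_per_100k`` is hypothetical and the rows below are toy values.

```python
import pandas as pd

POPULATION = 68_800_000  # same population constant as in the cell above

# Toy case counts for illustration.
toy = pd.DataFrame({"Cases": [0, 2, 8, 24_135_080]})

# Cases per 100,000 inhabitants.
toy["cases_per_100k"] = toy["Cases"] / POPULATION * 100_000
print(toy)
```

Dividing by the population alone gives very small numbers, which in turn gives a very large coefficient; a per-100,000 scale keeps both the variable and the coefficient in a readable range.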

In [148]:
print(df_covid_happiness_v1)

      Unnamed: 0       date     Cases  stringency_index  label  \
0              0 2020-01-30         0              5.56    0.7   
1              1 2020-01-31         2              8.33    0.1   
2              2 2020-02-01         2              8.33    0.2   
3              3 2020-02-02         2             11.11    0.4   
4              4 2020-02-03         8             11.11    0.5   
...          ...        ...       ...               ...    ...   
1062        1062 2022-12-27  24135080              5.56    0.1   
1063        1063 2022-12-28  24135080              5.56    0.5   
1064        1064 2022-12-29  24135080              5.56   -0.1   
1065        1065 2022-12-30  24135080              5.56    0.2   
1066        1066 2022-12-31  24135080              5.56    0.6   

      cases_standardised  
0           0.000000e+00  
1           2.906977e-08  
2           2.906977e-08  
3           2.906977e-08  
4           1.162791e-07  
...                  ...  
1062        3.5080

We now install the required packages for running regressions and generating the corresponding regression tables.

In [149]:
!pip install linearmodels --quiet

We then import the installed libraries.

In [150]:
from linearmodels.panel import PanelOLS

Suppose we want to use month fixed effects in our regression. We first need to create a month variable that we can include in the final regression. Create a column that takes a different index for each month-year pair and wrap this in the function ``pd.Categorical()``.

In [154]:
# TODO - Create a column that takes a different index for each month-year pair.
df_covid_happiness_v1['year'] = pd.DatetimeIndex(df_covid_happiness_v1['date']).year
df_covid_happiness_v1['month'] = pd.DatetimeIndex(df_covid_happiness_v1['date']).month
df_covid_happiness_v1['day'] = pd.DatetimeIndex(df_covid_happiness_v1['date']).day



In [156]:
df_covid_happiness_v1['month_year'] = df_covid_happiness_v1['month'].astype(str) + "-" + df_covid_happiness_v1['year'].astype(str)


In [157]:
print(df_covid_happiness_v1)

            Unnamed: 0       date     Cases  stringency_index  label  \
month_year                                                             
2020-01-01           0 2020-01-30         0              5.56    0.7   
2020-01-01           1 2020-01-31         2              8.33    0.1   
2020-02-01           2 2020-02-01         2              8.33    0.2   
2020-02-01           3 2020-02-02         2             11.11    0.4   
2020-02-01           4 2020-02-03         8             11.11    0.5   
...                ...        ...       ...               ...    ...   
2022-12-01        1062 2022-12-27  24135080              5.56    0.1   
2022-12-01        1063 2022-12-28  24135080              5.56    0.5   
2022-12-01        1064 2022-12-29  24135080              5.56   -0.1   
2022-12-01        1065 2022-12-30  24135080              5.56    0.2   
2022-12-01        1066 2022-12-31  24135080              5.56    0.6   

            cases_standardised  year  month  day month_year  
m

In [158]:
df_covid_happiness_v1.reset_index(drop = True)

Unnamed: 0.1,Unnamed: 0,date,Cases,stringency_index,label,cases_standardised,year,month,day,month_year
0,0,2020-01-30,0,5.56,0.7,0.000000e+00,2020,1,30,1-2020
1,1,2020-01-31,2,8.33,0.1,2.906977e-08,2020,1,31,1-2020
2,2,2020-02-01,2,8.33,0.2,2.906977e-08,2020,2,1,2-2020
3,3,2020-02-02,2,11.11,0.4,2.906977e-08,2020,2,2,2-2020
4,4,2020-02-03,8,11.11,0.5,1.162791e-07,2020,2,3,2-2020
...,...,...,...,...,...,...,...,...,...,...
1062,1062,2022-12-27,24135080,5.56,0.1,3.508006e-01,2022,12,27,12-2022
1063,1063,2022-12-28,24135080,5.56,0.5,3.508006e-01,2022,12,28,12-2022
1064,1064,2022-12-29,24135080,5.56,-0.1,3.508006e-01,2022,12,29,12-2022
1065,1065,2022-12-30,24135080,5.56,0.2,3.508006e-01,2022,12,30,12-2022


In [160]:
print(df_covid_happiness_v1['month_year'].dtype)

object


In [None]:
#If needed convert the object to the datetime
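
A compact alternative for building the month-year variable is sketched below on toy dates. It is a sketch under assumptions: it relies on the ``date`` column already being of datetime type, and it writes the labels as ``YYYY-MM`` rather than the ``M-YYYY`` format used above. Wrapping the labels in ``pd.Categorical()`` gives each month-year pair its own integer code.

```python
import pandas as pd

# Toy dates for illustration.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-01", "2021-01-15"]),
})

# One label per month-year pair, stored as a categorical variable.
toy["month_year"] = pd.Categorical(toy["date"].dt.to_period("M").astype(str))

print(toy["month_year"].dtype)
print(toy["month_year"].cat.codes.tolist())
```

Because the year is part of the label, January 2020 and January 2021 receive different codes, which is what a month-year fixed effect requires.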

Now, save and download the time series dataframe as ``covid_happiness_timeseries.csv``.

In [161]:
# TODO - Save and download the dataframe.
filename = 'covid_happiness_timeseries.csv'
df_covid_happiness_v1.to_csv(filename, index = False)
files.download(filename)

Specifying the model. This is not an exercise because I am of the strong opinion that one should not do their econometrics in Python, and time spent searching for the code to do this is suboptimally spent. Namely, the econometric methods in Stata are arguably better documented and more intuitive to use for people with a background in economics.

In [162]:
# Adding the date to the index, as required by the package used.
# Also, moving the dates to the first level of the index.
df_covid_happiness_v1 = df_covid_happiness_v1.set_index("date", append=True)
df_covid_happiness_v1.index = df_covid_happiness_v1.index.swaplevel(0, 1)

# Specifying the model.
regression_model = PanelOLS(dependent=df_covid_happiness_v1['label'],
                            exog=df_covid_happiness_v1[["Cases", "stringency_index"]],
                            entity_effects=False,
                            time_effects=False,
                            other_effects=df_covid_happiness_v1['month_year'])

Running the regression.

In [163]:
regression_results_summary = regression_model.fit(cov_type='clustered', cluster_entity=True).summary
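
To build intuition for what the month fixed effects do, the sketch below reproduces the idea with plain ``numpy`` and ``pandas`` on synthetic data. It is an illustration under stated assumptions, not the estimator used above: the data are generated without noise, the column names are hypothetical, and the fixed effects are removed by subtracting month means from every variable before running ordinary least squares.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 120

# Synthetic data: four months with 30 observations each.
toy = pd.DataFrame({
    "month_year": np.repeat(["2020-01", "2020-02", "2020-03", "2020-04"], n // 4),
    "cases": rng.normal(size=n),
    "stringency": rng.normal(size=n),
})

# Each month shifts sentiment by its own constant (the fixed effect).
month_effect = toy["month_year"].map(
    {"2020-01": 0.3, "2020-02": -0.2, "2020-03": 0.0, "2020-04": 0.5}
)
toy["sentiment"] = 2.0 * toy["cases"] - 0.5 * toy["stringency"] + month_effect

# Within transformation: subtract the month mean from every variable.
cols = ["sentiment", "cases", "stringency"]
demeaned = toy[cols] - toy.groupby("month_year")[cols].transform("mean")

# OLS on the demeaned data recovers the slopes used to generate the data.
coefs, *_ = np.linalg.lstsq(
    demeaned[["cases", "stringency"]].to_numpy(),
    demeaned["sentiment"].to_numpy(),
    rcond=None,
)
print(coefs.round(3))
```

The month constants drop out after demeaning, so only variation within a month identifies the coefficients. This is why a regressor that barely changes within a month contributes little to the estimate.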

Creating a regression table with the results.

In [164]:
pd.options.display.latex.repr = True
print(regression_results_summary)
print(regression_results_summary.as_latex())

                          PanelOLS Estimation Summary                           
Dep. Variable:                  label   R-squared:                        0.0018
Estimator:                   PanelOLS   R-squared (Between):             -0.0955
No. Observations:                1067   R-squared (Within):               0.0000
Date:                Thu, Feb 23 2023   R-squared (Overall):             -0.0955
Time:                        18:46:06   Log-likelihood                   -157.45
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      0.9035
Entities:                        1067   P-value                           0.4055
Avg Obs:                       1.0000   Distribution:                  F(2,1029)
Min Obs:                       1.0000                                           
Max Obs:                       1.0000   F-statistic (robust):             0.9085
                            

Storing and downloading the regression table in ``LaTeX`` format.

In [165]:
# Write the LaTeX table to disk; the with-block closes the file before downloading.
with open("regression_table.tex", "w") as regression_table:
    regression_table.write(regression_results_summary.as_latex())
files.download("regression_table.tex")

## Further Exercises
1. One option is to expand this analysis to different countries. Here, it is important to realise that comparing the coefficients of different countries is not justified, because different countries may confirm cases in different ways. Looking at proportional increases in the number of cases removes this problem, but disregards the base level of new cases in the country, which of course influences the magnitude of the effect on the average sentiment for a given proportional increase in the number of cases.
2. Data visualisation: plot the comovement of the variables of interest over time or something else you are interested in seeing that can give a new insight into the problem.
3. There may be other confounders present in the regression that I can't think of right now. If you can think of any, download the data for these, clean that data and create a new variable to run the regression with again.
4. Scrape tweets from random time intervals to reduce bias induced by Twitter's feed selection methods. Documentation available for some of the keywords necessary in the Twitter search query to do this can be found [here](https://github.com/igorbrigadir/twitter-advanced-search).
5. Filtering out spam tweets. You can approach this Natural Language Processing (NLP) problem in various ways, from something as advanced as AI classification methods to something as simple as looking for duplicated tweets in your list of scraped tweets. You can always combine methods like these, of course.
6. Improving the tweet cleaning function.
7. Running the regression with different model specifications of how the confounder affects the outcome variable and the explanatory variable. Namely, it may be the case that the start of heavy restrictions is not so bad yet, but that people get tired of them the longer these heavy restrictions are in place. You would need to transform the restriction variable to carry out the regression with this different definition of the control variable.
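
For exercise 7, one possible transformation is sketched below: counting the number of consecutive days on which the stringency index has been at or above a threshold. This is only an assumption about how "getting tired of restrictions" could be measured; the threshold of 50, the column name ``days_heavy`` and the toy values are all hypothetical.

```python
import pandas as pd

THRESHOLD = 50  # hypothetical cut-off for "heavy" restrictions

# Toy stringency values for six consecutive days.
toy = pd.DataFrame({"stringency_index": [10, 60, 70, 65, 20, 80]})

# True on days with heavy restrictions.
heavy = toy["stringency_index"] >= THRESHOLD

# Give each unbroken run of heavy or non-heavy days its own identifier.
run_id = (heavy != heavy.shift()).cumsum()

# Count days within each run; non-heavy days stay at zero.
toy["days_heavy"] = heavy.astype(int).groupby(run_id).cumsum()
print(toy)
```

The new column could then replace, or be added alongside, the raw stringency index in the regression.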


## References
* Edouard Mathieu, Hannah Ritchie, Lucas Rodés-Guirao, Cameron Appel, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Saloni Dattani, Diana Beltekian, Esteban Ortiz-Ospina and Max Roser (2020) - "Coronavirus Pandemic (COVID-19)". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/coronavirus' [Online Resource].
* Ensheng Dong, Hongru Du, Lauren Gardner, An interactive web-based dashboard to track COVID-19 in real time, The Lancet Infectious Diseases, Volume 20, Issue 5, 2020, Pages 533-534, ISSN 1473-3099, https://doi.org/10.1016/S1473-3099(20)30120-1. (https://www.sciencedirect.com/science/article/pii/S1473309920301201).
* JustAnotherArchivist, snscrape, (2023), GitHub repository, https://github.com/JustAnotherArchivist/snscrape.
* igorbrigadir, Twitter Advanced Search, (2023), GitHub repository, https://github.com/igorbrigadir/twitter-advanced-search.
* Wide and Narrow Data, Wikipedia, (12 Feb 2023), https://en.wikipedia.org/wiki/Wide_and_narrow_data.
* Advanced Filtering for Geo Data, (2023), https://developer.twitter.com/en/docs/tutorials/advanced-filtering-for-geo-data.
* Get Places Near a Location, (2023), https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-reverse_geocode.

