<a href="https://colab.research.google.com/github/Py4Econ2023/COVID-Social-Cost/blob/Malika/Copy_of_covid_happiness.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COVID-19 Infections and Happiness
This is the notebook for the Python for Economics Project at the London  School of Economics analysing the effect of COVID-19 infections on happiness.


## Introduction

As policy-making during an epidemic is all about making economic tradeoffs, one would like to quantify the gains and losses in the factors a government is is trading off between. The trade-offs to monetary factors and other classical economic factors are well documented, of course. However, the social costs of viral cases less so (among other forms of social costs involved in a pandemic). One may make an attempt to quantify the social costs of the number of cases of such a virus in your country by looking at the causal effect of COVID-19 cases on the average sentiment of how people express themselves online.

## Overview Project
In this project the main goal is to run a regression of the number of COVID-19 infections on the average sentiment of how people express themselves online. You will start by carrying out this analysis for the UK. A clear confounder here are government restrictions to curb the spread of the virus. You will control for this confounder in the regression alongside time-fixed effects that deal with the biases caused by new ways of measuring cases, changes in testing accuracy and availability, among other possible biases.
</br></br>
To be able to run this final regression, though, you will need to collect the data. This notebook will walk you through the steps associated with this and the final step of running the regression.


## Table of Contents

>[COVID-19 Infections and Happiness](#scrollTo=M_2dLRCIIqv9)

>>[Introduction](#scrollTo=c3t9AWywlLa_)

>>[Overview Project](#scrollTo=hpTdOHFalo5C)

>>[Table of Contents](#scrollTo=loLc9eEEVSsP)

>>[Preparation](#scrollTo=5kqfEAq9S8KC)

>>[Data Collection](#scrollTo=Xviu1_5NnsrF)

>>>[Loading Datasets](#scrollTo=3gOgjpQpKGoe)

>>>[Cleaning Datasets](#scrollTo=Xzmo0WIbKtrk)

>>>>[Preparation](#scrollTo=Xzmo0WIbKtrk)

>>>>[Stringency](#scrollTo=KB_HFVFsdT5H)

>>>>[Cases](#scrollTo=6zlemmyxk2JT)

>>>[Merging Dataframes](#scrollTo=8UXSCwxHg1Gi)

>>>[Average Sentiment](#scrollTo=C_Wv1iKpmdo6)

>>>>[Scraping Tweets](#scrollTo=UVZi_0ONKq5p)

>>>>[Classifying Tweets](#scrollTo=zOiuZ0v8NSKA)

>>[Running Regressions](#scrollTo=1wSD_8JrKslV)

>>[Further Exercises](#scrollTo=HIx-MyVcN7Vh)

>>[References](#scrollTo=GTQjBctvVLWv)




## Preparation
First, you will need to install a few libraries for this project. To install a library, write ``!pip install`` in a code block followed by ``name-library`` and the optional ``--quiet`` keyword to suppress the logs. For example, installing the package ``pandas`` can be done by running ``!pip install pandas --quiet`` in a code block.

(note: between countries the definitions and methods of confirming cases differs. maybe look at percentual change in infections but then not the absolute size of infections. maybe ONS positive rates).

In [None]:
# TODO - Install the following packages: pandas, datetime.
!pip install pandas --quiet 
!pip install datetime --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m52.1/52.1 KB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m261.4/261.4 KB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
[?25h

Now, you have to import the packages you installed. Additionally, import the preinstalled package ``numpy`` as ``np``.

In [2]:
# TODO - Import the installed packages.
# One additional library necessary for CSV uploads is already given (no need to install this one, it is installed by default on Colabs).
from google.colab import files
import io
import numpy as np 
import pandas as pd 
import datetime as dt 



## Data Collection
We can now start collecting our data.

### Loading Datasets
The data we will use for this analysis will come from the John Hopkins University Center for Systems Science and Engineering, Our World in Data and, of course, Twitter. 


* The dataset on confirmed cases per country (including the UK) can be found and downloaded [here](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series).
* The dataset on COVID-19 government restriction stringency can be found and downloaded [here](https://ourworldindata.org/covid-stringency-index).
* We will get into the Tweets later.

Once you have downloaded the datasets, you can upload one of them to to Colabs by running the comands below which will store a dataset as a Pandas dataframe (sort of like a spreadsheet). It is a good coding practice to wrap commands like this in a function. Do this and make the function output both datasets in a list. Then, call the function and assign the result to a variable ``dataframes``, storing the two dataframes in a list.

In [None]:
# This command will prompt you with an upload screen and store the uploaded files in a dictionary.
# You can upload multiple files at once.
uploaded = files.upload()

# This command stores the filenames in a list.
filenames = list(uploaded.keys())

# This command selects the filename of the first file in the files you uploaded.
filename = filenames[0]

# This command stores a dataset in a variable as a Pandas dataframe.
df_stringency = pd.read_csv('owid-covid-data.csv')
df_cases = pd.read_csv('time_series_covid19_confirmed_global.csv')

# TODO - Create and call the function.
print(df_stringency)
print(df_cases)

Saving time_series_covid19_confirmed_global.csv to time_series_covid19_confirmed_global.csv
Saving owid-covid-data.csv to owid-covid-data.csv
       iso_code continent     location        date  total_cases  new_cases  \
0           AFG      Asia  Afghanistan  2020-02-24          5.0        5.0   
1           AFG      Asia  Afghanistan  2020-02-25          5.0        0.0   
2           AFG      Asia  Afghanistan  2020-02-26          5.0        0.0   
3           AFG      Asia  Afghanistan  2020-02-27          5.0        0.0   
4           AFG      Asia  Afghanistan  2020-02-28          5.0        0.0   
...         ...       ...          ...         ...          ...        ...   
259638      ZWE    Africa     Zimbabwe  2023-02-19     263642.0        0.0   
259639      ZWE    Africa     Zimbabwe  2023-02-20     263642.0        0.0   
259640      ZWE    Africa     Zimbabwe  2023-02-21     263642.0        0.0   
259641      ZWE    Africa     Zimbabwe  2023-02-22     263921.0      279.0   


### Cleaning Datasets
#### Preparation
First, it would be nice to have each of the datasets stored in a variable with a corresponding name. Below I show a trick to assign two variables at once. Use this trick to assign your datasets to the variables ``df_cases`` and ``df_stringency``.

In [None]:

# TODO - Replicate the trick with the variable names given.
df_stringency = pd.read_csv('owid-covid-data.csv')
print(df_stringency)
df_cases = pd.read_csv('time_series_covid19_confirmed_global.csv')
print(df_cases)

       iso_code continent     location        date  total_cases  new_cases  \
0           AFG      Asia  Afghanistan  2020-02-24          5.0        5.0   
1           AFG      Asia  Afghanistan  2020-02-25          5.0        0.0   
2           AFG      Asia  Afghanistan  2020-02-26          5.0        0.0   
3           AFG      Asia  Afghanistan  2020-02-27          5.0        0.0   
4           AFG      Asia  Afghanistan  2020-02-28          5.0        0.0   
...         ...       ...          ...         ...          ...        ...   
259638      ZWE    Africa     Zimbabwe  2023-02-19     263642.0        0.0   
259639      ZWE    Africa     Zimbabwe  2023-02-20     263642.0        0.0   
259640      ZWE    Africa     Zimbabwe  2023-02-21     263642.0        0.0   
259641      ZWE    Africa     Zimbabwe  2023-02-22     263921.0      279.0   
259642      ZWE    Africa     Zimbabwe  2023-02-23     263921.0        NaN   

        new_cases_smoothed  total_deaths  new_deaths  new_death

#### Stringency
Let's start with the easiest dataset first. Inspect the structure of the dataset by printing the dataframe.

In [None]:
# TODO - print the dataframe and inspect the structure.
print(df_stringency)

# Print the first 5 rows of the stringency dataset 
df_stringency.head()

# Print the last 5 rows of the dataset 
df_stringency.tail()

# Print the description of the stringency dataset
df_stringency.describe()


       iso_code continent     location        date  total_cases  new_cases  \
0           AFG      Asia  Afghanistan  2020-02-24          5.0        5.0   
1           AFG      Asia  Afghanistan  2020-02-25          5.0        0.0   
2           AFG      Asia  Afghanistan  2020-02-26          5.0        0.0   
3           AFG      Asia  Afghanistan  2020-02-27          5.0        0.0   
4           AFG      Asia  Afghanistan  2020-02-28          5.0        0.0   
...         ...       ...          ...         ...          ...        ...   
259638      ZWE    Africa     Zimbabwe  2023-02-19     263642.0        0.0   
259639      ZWE    Africa     Zimbabwe  2023-02-20     263642.0        0.0   
259640      ZWE    Africa     Zimbabwe  2023-02-21     263642.0        0.0   
259641      ZWE    Africa     Zimbabwe  2023-02-22     263921.0      279.0   
259642      ZWE    Africa     Zimbabwe  2023-02-23     263921.0        NaN   

        new_cases_smoothed  total_deaths  new_deaths  new_death

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
count,245103.0,244840.0,243636.0,225414.0,225337.0,224151.0,243990.0,243727.0,242528.0,224314.0,...,156090.0,102925.0,186040.0,238303.0,203958.0,258530.0,8649.0,8649.0,8649.0,8649.0
mean,5324724.0,11748.77,11794.97,80103.67,126.648731,127.209499,74129.933417,177.214225,177.793152,755.539345,...,32.821249,50.894304,3.089125,73.59706,0.724529,140513000.0,52733.96,10.335603,14.090851,1514.39039
std,32718060.0,81887.31,79604.65,407967.1,735.661817,681.924334,124831.747044,1124.969189,665.975633,1031.179439,...,13.539375,31.883202,2.551417,7.446413,0.149553,691611700.0,143332.1,13.192857,25.875053,1856.830173
min,1.0,0.0,0.0,1.0,0.0,0.0,0.001,0.0,0.0,0.0,...,7.7,1.188,0.1,53.28,0.394,47.0,-37726.1,-28.45,-95.92,-1984.2816
25%,5801.5,0.0,4.429,125.0,0.0,0.0,1387.598,0.0,0.909,34.686,...,21.6,20.859,1.3,69.5,0.602,836783.0,93.20001,1.08,0.19,56.186893
50%,62350.0,37.0,76.714,1361.0,0.0,1.143,14013.625,4.675,15.889,251.4225,...,33.1,49.839,2.5,75.05,0.742,6948395.0,7149.899,8.01,7.62,944.00085
75%,647396.0,816.0,995.857,11043.0,11.0,13.286,88264.9525,83.366,122.3025,1141.5335,...,41.3,83.241,4.2,79.07,0.838,33696610.0,38119.3,16.31,19.23,2428.449
max,674678600.0,4082893.0,3436562.0,6868577.0,60902.0,14860.286,722127.171,228872.025,36421.827,6443.162,...,78.1,100.0,13.8,86.75,0.957,7975105000.0,1273323.0,76.55,376.77,10251.77


Clearly, there are lots of variables and countries of which we do not need the data. Therefore, we would like to drop the redundant entries. Do this by selecting only the date and stringency index values for just the United Kingdom. Overwrite ``df_stringency`` with this transformed dataframe. As a final nit-picky step, reset the index of the dataframe.

In [None]:
# TODO - Overwrite the dataframe with the filtered version.
# Let's choose as a filter the column location, and modify it to include only the date and stringecy index. Our filter would be 'United Kingdom'
df_stringency = df_stringency.loc[df_stringency['location'] == 'United Kingdom', ['date', 'stringency_index']]

# Reset the index of the dataframe without adding a new column
df_stringency = df_stringency.reset_index(drop = True)

# Check and pray that it works 
print(df_stringency)



            date  stringency_index
0     2020-01-30              5.56
1     2020-01-31              8.33
2     2020-02-01              8.33
3     2020-02-02             11.11
4     2020-02-03             11.11
...          ...               ...
1116  2023-02-19               NaN
1117  2023-02-20               NaN
1118  2023-02-21               NaN
1119  2023-02-22               NaN
1120  2023-02-23               NaN

[1121 rows x 2 columns]


We would like to have our data of suitable data types, so it is easiest to work with down the line. For example, we would like the values in our ``date`` column to be of the ``datetime`` data type. Also, we would like the values in our ``stringency_index`` column to be of the ``float`` data type. Check if this is the case and if not, convert the column values to the desired data type.

In [None]:
# TODO - Check if the column values data types are correct and convert them if not.

# Check the type of date
print(df_stringency['date'].dtype)

# Right, we see that it's type is object. Let's convert it to datetime. 
df_stringency['date'] = pd.to_datetime(df_stringency['date'])

# Check
print(df_stringency['date'].dtype)

# Check the type of stringency_index
print(df_stringency['stringency_index'].dtype)

# The data type of the stringency_index is float64



object
datetime64[ns]
float64


In [None]:
# Filter the dataset for the pre-calculated dates
df_stringency_filter = df_stringency.iloc[:1067]
print(df_stringency_filter)

           date  stringency_index
0    2020-01-30              5.56
1    2020-01-31              8.33
2    2020-02-01              8.33
3    2020-02-02             11.11
4    2020-02-03             11.11
...         ...               ...
1062 2022-12-27              5.56
1063 2022-12-28              5.56
1064 2022-12-29              5.56
1065 2022-12-30              5.56
1066 2022-12-31              5.56

[1067 rows x 2 columns]


#### Cases
Now on to the harder dataset. Inspect the structure of the dataset by printing the dataframe.

In [None]:
# TODO - print the dataframe and inspect the structure.
df_cases = df_cases.fillna('N/A')
print(df_cases)
print(df_cases.info())

    Province/State        Country/Region        Lat       Long  1/22/20  \
0              N/A           Afghanistan   33.93911  67.709953        0   
1              N/A               Albania    41.1533    20.1683        0   
2              N/A               Algeria    28.0339     1.6596        0   
3              N/A               Andorra    42.5063     1.5218        0   
4              N/A                Angola   -11.2027    17.8739        0   
..             ...                   ...        ...        ...      ...   
284            N/A    West Bank and Gaza    31.9522    35.2332        0   
285            N/A  Winter Olympics 2022    39.9042   116.4074        0   
286            N/A                 Yemen  15.552727  48.516388        0   
287            N/A                Zambia -13.133897  27.849332        0   
288            N/A              Zimbabwe -19.015438  29.154857        0   

     1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  2/14/23  2/15/23  \
0          0        0   

Again, there are a lot of countries we do not need the data of. Filter the dataframe to only contain records of the UK (be precise here) and overwrite the original dataframe with the filtered one.

In [None]:
# TODO - Filter and overwrite the dataframe of cases.
df_cases = df_cases.loc[(df_cases['Province/State'] == 'N/A') & (df_cases['Country/Region'] == 'United Kingdom')]
print(df_cases)

    Province/State  Country/Region      Lat   Long  1/22/20  1/23/20  1/24/20  \
278            N/A  United Kingdom  55.3781 -3.436        0        0        0   

     1/25/20  1/26/20  1/27/20  ...   2/14/23   2/15/23   2/16/23   2/17/23  \
278        0        0        0  ...  24315979  24315979  24341611  24341611   

      2/18/23   2/19/23   2/20/23   2/21/23   2/22/23   2/23/23  
278  24341611  24341611  24341611  24341611  24341611  24370150  

[1 rows x 1133 columns]


Some might think we are done now with this dataset, but this dataset has a nasty characteristic. Namely, it is [*wide*](https://en.wikipedia.org/wiki/Wide_and_narrow_data), and quite *wide*, to say the least. Libraries written for Python and other programming languages hardly support this kind of data shape. Therefore, we want to change the shape of the data to the *narrow* format.
</br></br>
In essence, we would like one column for the date and one column for the confirmed cases. Thus, we need to put the column names in a new variable name called ``date`` and link the corresponding case numbers to the right row.
</br></br>
Convert the dataframe to a *narrow* format. After understanding the concepts by reading the Wikipedia page linked before, use Pandas' [``melt``](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) implementation to achieve this.

In [None]:
# TODO - Convert the dataframe from wide to narrow format.
id_vars = df_cases.loc[:, ['Province/State']]
value_vars = df_cases.loc[:, df_cases.columns != 'Province/State']
df_cases = pd.melt(df_cases, id_vars == id_vars, value_vars == value_vars, var_name='date', value_name='Cases', ignore_index=True)
print(df_cases)

     Province/State            date           Cases
0               N/A  Country/Region  United Kingdom
1               N/A             Lat         55.3781
2               N/A            Long          -3.436
3               N/A         1/22/20               0
4               N/A         1/23/20               0
...             ...             ...             ...
1127            N/A         2/19/23        24341611
1128            N/A         2/20/23        24341611
1129            N/A         2/21/23        24341611
1130            N/A         2/22/23        24341611
1131            N/A         2/23/23        24370150

[1132 rows x 3 columns]


In [None]:
df_casesdropped = df_cases.drop(columns=['Province/State'])
print(df_casesdropped)

                date           Cases
0     Country/Region  United Kingdom
1                Lat         55.3781
2               Long          -3.436
3            1/22/20               0
4            1/23/20               0
...              ...             ...
1127         2/19/23        24341611
1128         2/20/23        24341611
1129         2/21/23        24341611
1130         2/22/23        24341611
1131         2/23/23        24370150

[1132 rows x 2 columns]


In [None]:
df_rowdropped = df_casesdropped.iloc[11:]
print(df_rowdropped)

         date     Cases
11    1/30/20         0
12    1/31/20         2
13     2/1/20         2
14     2/2/20         2
15     2/3/20         8
...       ...       ...
1127  2/19/23  24341611
1128  2/20/23  24341611
1129  2/21/23  24341611
1130  2/22/23  24341611
1131  2/23/23  24370150

[1121 rows x 2 columns]


In [None]:
df_cases_filter= df_rowdropped.iloc[:1067]
df_cases_filter = df_cases_filter.reset_index(drop = True)
print(df_cases_filter)


          date     Cases
0      1/30/20         0
1      1/31/20         2
2       2/1/20         2
3       2/2/20         2
4       2/3/20         8
...        ...       ...
1062  12/27/22  24135080
1063  12/28/22  24135080
1064  12/29/22  24135080
1065  12/30/22  24135080
1066  12/31/22  24135080

[1067 rows x 2 columns]


Now, we would like to convert the date column of data type ``string`` to the data type ``datetime``, because we want to link the time series datasets that we now have parsed to each other and make one big, complete dataset. This is not as easy as it was for the previous dataset, and you will probably find out why.

In [None]:
# TODO - Convert the values of the date column to the datetime data type.
df_cases_filter['date'] = pd.to_datetime(df_cases_filter['date'])
print(df_cases_filter)
print(df_cases_filter.dtypes)
     


           date     Cases
0    2020-01-30         0
1    2020-01-31         2
2    2020-02-01         2
3    2020-02-02         2
4    2020-02-03         8
...         ...       ...
1062 2022-12-27  24135080
1063 2022-12-28  24135080
1064 2022-12-29  24135080
1065 2022-12-30  24135080
1066 2022-12-31  24135080

[1067 rows x 2 columns]
date     datetime64[ns]
Cases            object
dtype: object


### Merging Dataframes
Now, we would like to merge the dataframes of the COVID-19 cases and COVID-19 policy stringency with eachother, so that for each date that is present in both dataframes we have one observation for the stringency and the number of cases. We will use [Pandas' implementation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) of a merge function.

In [None]:
# TODO - Merge "df_cases" with "df_stringency" and save the result in a variable called "df_cases_stringency".
df_cases_stringency = pd.merge(df_cases_filter, df_stringency_filter, on = ['date'])

print(df_cases_stringency)

NameError: ignored

Upon inspecting the data, we can see that there are some missing observations for the stringency index, probably because the stringency data does not go as far in time as the cases dataset. To clean this up, we would like to drop these missing values.

In [None]:
# TODO - Drop the missing values in the dataset.
# Done that above 

After having edited data with code that takes a bit to run, you usually want to save your progress by downloading the dataset. (In more advanced projects, you would maybe use a database when using computationally expensive operations). Thus, download your dataset. You can use the previously installed ``files`` Colabs library for this. Download the dataset as ``cases_stringency.csv``. Make sure to exclude the index in the dataframe to CSV conversion step.

In [None]:
# TODO - Convert the dataframe to a CSV file and download it.
from google.colab import files
df_cases_stringency.to_csv('COVID_cases_and_stringency.csv', encoding = 'utf-8-sig') 
files.download('COVID_cases_and_stringency.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
# Check the types of the data
df_cases_stringency.dtypes

date                datetime64[ns]
Cases                       object
stringency_index           float64
dtype: object

### Average Sentiment
Now, in the data collection part of this project we only have left the task of collecting data on the average sentiment of how people express themselves online.

#### Scraping Tweets
In this section, we will start scraping tweets from the UK in the same time period as variables ``stringency_index`` and ``cases`` are recorded in. In an academic setting, you might prefer to use an official Twitter API, but this can take a while to be admitted to. Additionally, few compromises are made by using an unofficial Twitter scraper.
</br></br>
If you have left off since everything before this code chunk and your Google Colabs runtime has restarted, you can optionally load the dataset you created in the previous parts with the code below.

In [None]:
df_cases_stringency = upload_datasets()[0]

NameError: ignored

First, we install a library that allows us to easily scrape tweets from Twitter.

In [None]:
!pip install snscrape --quiet

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/69.2 KB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.2/69.2 KB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[?25h

Importing the scraping library.

In [None]:
import snscrape.modules.twitter as sntwitter

We will have to define the date range we want to scrape data from before we start scraping tweets. A useful function for this is Pandas' [``date_range``](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html). Define a date range that starts from the earliest date all the way to the last date in your dataframe ``df_cases_stringency``. Store this range of dates in a variable called ``date_range``.

In [None]:
# TODO - Define the date range.
date_range = pd.date_range(start='1/30/2020', end='1/1/2023')

Defining a list to store the tweets in.

In [None]:
tweets = []

Defining the number of tweets to be scraped per day. You can change this number to your liking. I would recommend to try running the code with this number first and possibly increasing it later when sure the code works so wasting computation time can be prevented.

In [None]:
# Changing the number of tweets from 10 to 20
tweets_per_day = 20

As we want to scrape tweets published from the UK, we need to tell this to our scraper. As it so happens, Twitter uses geographic tags users can choose to attach to their tweets. (There are some problems of representativeness with this approach discussed [here](https://developer.twitter.com/en/docs/tutorials/advanced-filtering-for-geo-data) if you are interested.) The UK tag is ``6416b8512febefc9``. If needed while exploring the **optional** further exercises, you can find tags of other countries via the following Twitter API: ``f"https://api.twitter.com/1.1/geo/reverse_geocode.json?lat={latitude}&lon={longitude}&granularity=country"``. You would format the string based on your latitude and longitude variables before plugging the link in your browser or Python API module of choice. Documentation for this API can be found [here](https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-reverse_geocode).

Now, we can start scraping. To get you started with the functionality of the ``snscrape`` module, I have written a simple piece of code that you can run to understand how this module can be used.

In [None]:
# Demonstrating the working of the "enumerate" function.
text_list = ["This", "is", "how", "enumerate", "works."]
for i, text in enumerate(text_list):
  print(i, text)

# Storing the place ID for the UK.
place_id = "6416b8512febefc9"

# Defining the search query for our Twitter scraper.
# The keyword "lang:en" will filter for English tweets only.
# The keywords "since:date" and "until:date" define the time range the tweet has to be from.
# "until" is exclusive, meaning no tweets are scraped from "2020-05-20". "since" is inclusive.
scraped_tweets = sntwitter.TwitterSearchScraper(f"lang:en place:{place_id} since:2020-05-19 until:2020-05-20").get_items()

# This piece of code will print 5 tweets.
# For each iteration in the loop, the scraper will scroll to the next tweet in the feed returned by Twitter.
for i, tweet in enumerate(scraped_tweets):
  print(tweet.rawContent)
  print(tweet.date)
  # We will only need the rawContent and date properties of the tweet.
  # tweet.rawContent gives the text of the tweet (string)
  # tweet.date gives the date and time of the tweet (datetime)
  # For more properties, see line 60 and onwards of https://github.com/JustAnotherArchivist/snscrape/blob/master/snscrape/modules/twitter.py.

  # Stopping the loop.
  if i == 4:
    break

0 This
1 is
2 how
3 enumerate
4 works.
@OrinKerr I hope they don’t honour the subpoenas. Take it all the way to the Supreme Court, just like these scoundrels have done.
2020-05-19 23:38:39+00:00
Almost at 500 followers this is exciting 😁 We are feeling the love 😍
.
.
#ChihuahuaLover #twitterdogs #milonmily
#lockdown #dogcelebration #dog #dogs #doggy #dogsduringlockdown #doglover #dogsoftwitter #doglovers #Chihuahua #cute #RETWEEET #RT https://t.co/GAmHwJTLBk
2020-05-19 22:59:36+00:00
Conscious Co. #Gin is a rather eye-catching gin distilled from surplus potatoes that weren't so eye-catching and would have otherwise gone to waste! Plus, six local botanicals make for one fragrant tipple.

https://t.co/45a8IPB6Rv https://t.co/BgtYbKLvrf
2020-05-19 22:31:02+00:00
@carolynewart @BASW_UK @BASW_NI Unity is strength, great contributions tonight, all messages highlightied the importance of being part of the international community of social work. Thank you @ScotsSW @AngieBartoli
@BASW_Cymru @IF

Now that you hopefully understand how this module works, I want you to write a function called ``scrape_time_range``.  This function will have to return a list of scraped tweets, containing the raw content and date for each tweet in the list.
</br></br>
This function should take four arguments:
1. A list to append the scraped tweets to.
2. The place ID.
3. The date range.
4. The number of tweets to be scraped per day.

You want this function to iterate over the dates in the date range first, before defining the search query for that day and scraping the desired number of tweets. Notice that the dates stored in the previously created ``date_range`` are of the data type ``datetime``. They can be converted to strings by using the function [``strftime``](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.strftime.html). You can format the desired output strings with the following keywords:
* ``%Y`` which corresponds to YYYY.
* ``%m`` which corresponds to mm.
* ``%d`` which corresponds to dd.

Make sure to take care of the hypens in these dates, too, when converting the date range, as your Twitter search query will be invalid without them. The same applies to the order of the year, month and date in the string.
</br></br>**Hint:** wrap the output of ``date_range.strftime()`` in ``list()`` to convert the Numpy object to a Python list, which is more convenient in this instance.

In [None]:
# TODO
# 1. Convert the date range to a list of date strings.
date_range_list = list(date_range.strftime('%Y-%m-%d'))
# 2. Write the scraping function.
place_id = '6416b8512febefc9'
def scrape_time_range(list_of_tweets, place_ID, Date_Range, tweets_per_day):
  index = 0
  for date in Date_Range:
    start_date = date
    index += 1
    if index == len(Date_Range):
      break
    else:
      end_date = Date_Range[index]
      scraped_tweets = sntwitter.TwitterSearchScraper(f"lang:en place:{place_ID} since:{start_date} until:{end_date}").get_items()
      for i, tweet in enumerate(scraped_tweets):
        list_of_tweets.append([tweet.rawContent, tweet.date])
        if i == tweets_per_day - 1:
          break            
  return list_of_tweets

Call your scraping function.
**Warning:** with 10 tweets a day this takes about 40 minutes to run and at a later stage the tweet classification task with the best model would take around 6 hours (but you can do this in batches of course).

In [None]:
# TODO - Call it. # Changing the number of tweet from 10 to 20. 
scrape_time_range(tweets, place_id, date_range_list, 20)



[['I love It this project @spacedoge_io and i recommend everyone invest on Miner.',
  datetime.datetime(2020, 1, 30, 23, 37, 53, tzinfo=datetime.timezone.utc)],
 ['@grantdashwood 🤣 not quite yet (I think). But who knows... maybe just a matter of time 🤔 #Automation #robotics #robot #bot',
  datetime.datetime(2020, 1, 30, 23, 11, 24, tzinfo=datetime.timezone.utc)],
 ["Celebrating #NationalBackwardDay With My Favourite Family! Here's When @jimmyosmond Announced  Big Sister Marie's Mobility Flaws On The #Osmond Family Show! 🤣\n\n#FridayThoughts For Anyone Missing @donnyosmond\n&amp; @marieosmond I Got Em Back! 😜\n\nEnjoy EVERYONE It's Hilarious 💕💋 https://t.co/xKtw1c1Rt1",
  datetime.datetime(2020, 1, 30, 23, 8, 51, tzinfo=datetime.timezone.utc)],
 ['@TonyThePoett Testing times for the ravaged minds of "War" calm and free with a cup of tea and a fag in hand. Thank you @TonyThePoett \n\nAlways the rebel 🇬🇧',
  datetime.datetime(2020, 1, 30, 23, 5, 38, tzinfo=datetime.timezone.utc)],
 ['Beau

Now, we would like to convert the list of tweets to a dataframe and a CSV file to save our progress. Call the dataframe ``df_tweets`` and the CSV file ``tweets.csv``.

In [None]:
# TODO - Convert the list of tweets to a dataframe and a CSV file.
df_tweets = pd.DataFrame(tweets)
df_tweets.iloc[:,1] = df_tweets.iloc[:,1].dt.tz_localize(None)
df_tweets.to_excel('tweets.xlsx', index = False)


In [None]:
from google.colab import files 
files.download ('tweets.xlsx')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

#### Classifying Tweets
If you have left off before this chunk and your Colabs runtime has refreshed in the meantime, load the dataset below.

In [None]:
# Taking the first index of the list of uploaded datasets, as you only upload one.
#df_tweets = upload_datasets()[0]
from google.colab import files 
uploaded = files.upload()
df_tweets = pd.read_excel('tweets.xlsx')

Saving tweets.xlsx to tweets (1).xlsx


At this stage, we need to define a function that cleans the tweets. Namely, users and tweets mentioned in tweets might confuse the classification model that we will use at a later stage. This is possible if usernames and links have words in them that would refer to a certain sentiment but are not used for that purpose in natural text. Thus, we need to neutralise these words in the tweets. Create a function that converts all users (in the form of ``@username``) to "``@user``" and all links (in the form of ``https://`` to "``https``". Call it ``neutralise_mentions_links`` and make it so that it takes one argument called ``text``.
</br></br>
Use the ``.split()`` function of strings in Python. Mentions start with "@", links with "https://". 

In [None]:
df_tweets.columns = ['tweet', 'date']

In [None]:
# TODO - Write a function that removes mentions fom and shortens links in a piece of text.
def neutralise_mentions_links(text): 
  new_text = text.split()
  modified_text = []
  for string in new_text:
      if string.startswith('#'): 
         continue
      elif string.startswith('@'): 
         modified_text.append('@user')
      elif string.startswith('https://'): 
         modified_text.append('https')
      else:
         modified_text.append(string)
  return ' '.join(modified_text)

Apply the function to all the tweets in the dataframe.

In [None]:
# TODO - Apply the function to all tweets in the dataframe.
cleaned_tweet = []
for row in df_tweets['tweet']:
    new_row = neutralise_mentions_links(row)
    cleaned_tweet.append(new_row)
print(cleaned_tweet)



In [None]:
# Adding a column of clean_tweet to the dataset
df_tweets['clean_tweet'] = cleaned_tweet
print(df_tweets[:5])

                                               tweet                date  \
0  I love It this project @spacedoge_io and i rec... 2020-01-30 23:37:53   
1  @grantdashwood 🤣 not quite yet (I think). But ... 2020-01-30 23:11:24   
2  Celebrating #NationalBackwardDay With My Favou... 2020-01-30 23:08:51   
3  @TonyThePoett Testing times for the ravaged mi... 2020-01-30 23:05:38   
4  Beautiful Tip! @emmerdale https://t.co/99lVfBFaYy 2020-01-30 23:05:28   

                                         clean_tweet  
0  I love It this project @user and i recommend e...  
1  @user 🤣 not quite yet (I think). But who knows...  
2  Celebrating With My Favourite Family! Here's W...  
3  @user Testing times for the ravaged minds of "...  
4                         Beautiful Tip! @user https  


In [None]:
df_tweets.to_excel('cleaned_tweets.xlsx', index =False)
from google.colab import files
files.download('cleaned_tweets.xlsx')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Now, we would like to classify the sentiment of the tweets in our dataframe. We task an external library with this exercise. The library we will use is ``happytransformer``. First, we install the library.

In [None]:
!pip install happytransformer --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 KB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m38.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m56.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m469.0/469.0 KB[0m [31m31.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.0/132.0 KB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m17.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m213.0/213.0 KB[0m [31m19.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m44.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━

Second, we import the text classification functionality from the library we installed.

In [None]:
from happytransformer import HappyTextClassification

Third, we load the AI model that has been trained on a large dataset of tweets with sentiment labels. We will use this for the analysis. This type of model is called a transformer model which you can read more on [here](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model).

In [None]:
happy_tc = HappyTextClassification(model_type="BERT",  model_name="cardiffnlp/twitter-roberta-base-sentiment", num_labels=3)

Downloading (…)lve/main/config.json:   0%|          | 0.00/747 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/499M [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

This is a demonstration of how the model can be used. Now write a function called ``classify_sentiment`` that takes in one argument of ``text`` and outputs the label in numeric form. 
</br></br>
It is important for you to know that the label that the NLP model outputs is one of:
* ``LABEL_0``, which corresponds to negative or the numeric form of -1.
* ``LABEL_1``, which corresponds to neutral or the numeric form of 0.
* ``LABEL_2``, which corresponds to positive or the numeric form of 1.

The model outputs one score for each label and returns the label and score corresponding to the label with the highest score.

In [None]:
result = happy_tc.classify_text("I think the Python for Economics week is a great initiative.")
print(result.label, result.score)

# TODO - Write a function that outputs the label in numeric form.
classified = []
def classify_sentiment(text):
      sentiment = happy_tc.classify_text(text)
      classified.append([sentiment.label, sentiment.score])
      return classified 



LABEL_2 0.9771161079406738


Apply the ``sentiment_classifier`` function to the tweets and store the returned labels in a new column called ``sentiment``. **Warning:** doing this can be time intensive. This notebook was tested with 10 tweets per day and it took 6 hours to classify all the tweets scraped over the time range. Try doing this in chunks and downloading the results if you can't run the notebook for 6 hours straight.

In [None]:
from google.colab import files
uploaded = files.upload()
df_cleaned_tweets = pd.read_excel('cleaned_tweets.xlsx') 

Saving cleaned_tweets.xlsx to cleaned_tweets.xlsx


In [None]:
# TODO - Apply the sentiment classifier function to the tweets.
cl_tweet = df_cleaned_tweets['clean_tweet']
for row in cl_tweet: 
    classify_sentiment(row)
print(classified)


[['LABEL_2', 0.9852529168128967], ['LABEL_1', 0.6525818109512329], ['LABEL_2', 0.9834097623825073], ['LABEL_2', 0.6667680144309998], ['LABEL_2', 0.9594331383705139], ['LABEL_2', 0.8809717893600464], ['LABEL_2', 0.6463283896446228], ['LABEL_0', 0.9427138566970825], ['LABEL_2', 0.9645416140556335], ['LABEL_2', 0.5569496750831604], ['LABEL_2', 0.975188672542572], ['LABEL_2', 0.9666039943695068], ['LABEL_0', 0.8882222175598145], ['LABEL_0', 0.6291963458061218], ['LABEL_0', 0.46580812335014343], ['LABEL_2', 0.490807443857193], ['LABEL_2', 0.6613503098487854], ['LABEL_2', 0.9862269759178162], ['LABEL_1', 0.4567926526069641], ['LABEL_0', 0.9170284867286682], ['LABEL_0', 0.8124008178710938], ['LABEL_2', 0.9868990182876587], ['LABEL_0', 0.5962307453155518], ['LABEL_2', 0.94588303565979], ['LABEL_0', 0.7323468327522278], ['LABEL_0', 0.6218178868293762], ['LABEL_2', 0.9641792178153992], ['LABEL_2', 0.9215770363807678], ['LABEL_2', 0.9606326222419739], ['LABEL_2', 0.9632844924926758], ['LABEL_1', 

In [None]:
# Adding a column label and sentiment to the classified_sentiment dataframe
df_classified_sentiment = pd.DataFrame(classified, columns=['label','sentiment'])
# Merging both datframes. Note: We are using concat because we want to merge them along the same axis = 1, i.e columns 
df_cleaned_tweets = pd.concat([df_cleaned_tweets, df_classified_sentiment], axis=1)

In [None]:
# Check and see what the dataframe looks like
print(df_cleaned_tweets.head())

                                               tweet                date  \
0  I love It this project @spacedoge_io and i rec... 2020-01-30 23:37:53   
1  @grantdashwood 🤣 not quite yet (I think). But ... 2020-01-30 23:11:24   
2  Celebrating #NationalBackwardDay With My Favou... 2020-01-30 23:08:51   
3  @TonyThePoett Testing times for the ravaged mi... 2020-01-30 23:05:38   
4  Beautiful Tip! @emmerdale https://t.co/99lVfBFaYy 2020-01-30 23:05:28   

                                         clean_tweet    label  sentiment  
0  I love It this project @user and i recommend e...  LABEL_2   0.985253  
1  @user 🤣 not quite yet (I think). But who knows...  LABEL_1   0.652582  
2  Celebrating With My Favourite Family! Here's W...  LABEL_2   0.983410  
3  @user Testing times for the ravaged minds of "...  LABEL_2   0.666768  
4                         Beautiful Tip! @user https  LABEL_2   0.959433  


Now, we want to calculate the average sentiment for each day. We can drop the column of tweets before we transform the dataframe. Store this new dataframe in a variable called ``df_sentiment``.

In [None]:
# TODO - Drop the column of tweets and transform the dataframe.
del df_cleaned_tweets['tweet']


In [None]:
# Replacing the Label 0, Label 1 and Label 2 with -1, 0 and 1
df_cleaned_tweets['label'] = df_cleaned_tweets['label'].replace('LABEL_0', '-1')
df_cleaned_tweets['label'] = df_cleaned_tweets['label'].replace('LABEL_1', '0')
df_cleaned_tweets['label'] = df_cleaned_tweets['label'].replace('LABEL_2', '1')
df_cleaned_tweets['label'] = df_cleaned_tweets['label'].astype(int)



In [None]:
# Print the cleaned_tweets dataset that has replaced labels with numeric form
print(df_cleaned_tweets)

                     date                                        clean_tweet  \
0     2020-01-30 23:37:53  I love It this project @user and i recommend e...   
1     2020-01-30 23:11:24  @user 🤣 not quite yet (I think). But who knows...   
2     2020-01-30 23:08:51  Celebrating With My Favourite Family! Here's W...   
3     2020-01-30 23:05:38  @user Testing times for the ravaged minds of "...   
4     2020-01-30 23:05:28                         Beautiful Tip! @user https   
...                   ...                                                ...   
22914 2022-12-31 19:19:44  Happy 2023... "May this new year be of peace, ...   
22915 2022-12-31 19:18:45  The countdown to 2023 starts here on XL:UK Rad...   
22916 2022-12-31 19:02:12  Minor geomagnetic activity. Issued 2022-12-31 ...   
22917 2022-12-31 17:55:04   Happy New Year’s Eve all for 2023 everyone https   
22918 2022-12-31 17:47:03  I would like to wish all my family, friends &a...   

       label  sentiment  
0          1 

In [None]:
# We delete the sentiment column because later we will use the .resample function for the daily average wich will need 2 columns
del df_cleaned_tweets['sentiment']

In [None]:
# Deleting the clean_tweet column for the same reason as above
del df_cleaned_tweets['clean_tweet']

In [None]:
# Printing the dataframe to check if both columns dropped
print(df_cleaned_tweets)

                     date  label
0     2020-01-30 23:37:53      1
1     2020-01-30 23:11:24      0
2     2020-01-30 23:08:51      1
3     2020-01-30 23:05:38      1
4     2020-01-30 23:05:28      1
...                   ...    ...
22914 2022-12-31 19:19:44      1
22915 2022-12-31 19:18:45      1
22916 2022-12-31 19:02:12      0
22917 2022-12-31 17:55:04      1
22918 2022-12-31 17:47:03      1

[22919 rows x 2 columns]


In [None]:
# Converting the date column to the datetime type
df_cleaned_tweets['date'] = pd.to_datetime(df_cleaned_tweets['date'])
# Resample function allows for frequency specifid as an argument. Since we need the daily sentiment, we choose 'd'
df_cleaned_tweets = df_cleaned_tweets.resample('d', on='date').mean()

In [None]:
# Assigning the new name to the dataframe. Note that we assign df_mean_label as opposed to df_sentiment as instructed
df_mean_label = pd.DataFrame(df_cleaned_tweets)
# Note that resample function makes date as an index. Therefore, we create a new date range of dates
dates = pd.date_range(start='2020-01-30', end='2022-12-31', freq='D')


In [None]:
# Adding the new date range as a column to teh dataframe 
df_mean_label['date'] = dates

In [None]:
# Printing the dataframe to check the format
print(df_mean_label)

            label       date
date                        
2020-01-30   0.58 2020-01-30
2020-01-31  -0.06 2020-01-31
2020-02-01   0.08 2020-02-01
2020-02-02   0.48 2020-02-02
2020-02-03   0.38 2020-02-03
...           ...        ...
2022-12-27   0.15 2022-12-27
2022-12-28   0.35 2022-12-28
2022-12-29  -0.10 2022-12-29
2022-12-30   0.10 2022-12-30
2022-12-31   0.70 2022-12-31

[1067 rows x 2 columns]


In [None]:
# Dropping the index of the dataframe to get rif the dates as an index
df_mean_label.reset_index(drop = True)

Unnamed: 0,label,date
0,0.58,2020-01-30
1,-0.06,2020-01-31
2,0.08,2020-02-01
3,0.48,2020-02-02
4,0.38,2020-02-03
...,...,...
1062,0.15,2022-12-27
1063,0.35,2022-12-28
1064,-0.10,2022-12-29
1065,0.10,2022-12-30


In [None]:
# Save it just in case the file reloads
filename = "Mean_label.csv"
df_mean_label.to_csv(filename, index=False)
files.download(filename)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

We have now successfully generated all of our data necessary for the analysis. One last thing to do is to merge the previously merged datasets with our final dataset of average sentiment scores to create the dataframe ``df_covid_happiness``. Download the dataset of the previously merged datasets with the code below if necessary.

In [None]:
# Converting the column date to datetime type in the df_mean_label
df_mean_label['date'] = pd.to_datetime(df_mean_label['date'])

In [None]:
# Upload the df_cases_stringency because the google colab may have refreshed
from google.colab import files
uploaded = files.upload()
df_cases_stringency = pd.read_csv('COVID_cases_and_stringency.csv')

Saving COVID_cases_and_stringency.csv to COVID_cases_and_stringency (3).csv


In [None]:
# Converting the column date to datetime type in the df_cases_stringency
df_cases_stringency['date'] = pd.to_datetime(df_cases_stringency['date'])

In [None]:
# Converting the column date to datetime again 
df_mean_label['date'] = pd.to_datetime(df_mean_label['date'])

In [None]:
print(df_mean_label)

            label       date
date                        
2020-01-30   0.58 2020-01-30
2020-01-31  -0.06 2020-01-31
2020-02-01   0.08 2020-02-01
2020-02-02   0.48 2020-02-02
2020-02-03   0.38 2020-02-03
...           ...        ...
2022-12-27   0.15 2022-12-27
2022-12-28   0.35 2022-12-28
2022-12-29  -0.10 2022-12-29
2022-12-30   0.10 2022-12-30
2022-12-31   0.70 2022-12-31

[1067 rows x 2 columns]


In [None]:
# Dropping the index because google colab refreshed 
df_mean_label.reset_index(drop = True)

Unnamed: 0,label,date
0,0.58,2020-01-30
1,-0.06,2020-01-31
2,0.08,2020-02-01
3,0.48,2020-02-02
4,0.38,2020-02-03
...,...,...
1062,0.15,2022-12-27
1063,0.35,2022-12-28
1064,-0.10,2022-12-29
1065,0.10,2022-12-30


In [None]:
print(df_mean_label)

      label       date
0      0.58 2020-01-30
1     -0.06 2020-01-31
2      0.08 2020-02-01
3      0.48 2020-02-02
4      0.38 2020-02-03
...     ...        ...
1062   0.15 2022-12-27
1063   0.35 2022-12-28
1064  -0.10 2022-12-29
1065   0.10 2022-12-30
1066   0.70 2022-12-31

[1067 rows x 2 columns]


In [None]:
# TODO - Merge the stringency and cases dataset with the sentiment dataset.
df_covid_happiness = pd.merge(df_cases_stringency, df_mean_label, on = ['date'])     

Finally, we save the generated dataset.


In [None]:
filename = "covid_happiness.csv"
df_covid_happiness.to_csv(filename, index=False)
files.download(filename)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Running Regressions
In this section you will have to run the following regression and report the results:
$average\_sentiment_t = \beta positive\_cases_t + \gamma stringency_t + \eta_t + \varepsilon_t$
</br></br>
Before running this regression, think of the interpretation of the coefficient $\beta$ if you run this regression. Would you want to rescale the corresponding variable $positive\_cases$ with some proportion to improve the interpretability of this regression?
</br></br>
When interpreting the regression results you should make sure you understand the definitions of the variables used in the regression. For example, the number of confirmed cases for our purposes is actually the 7-day rolling average.
</br></br>
First we load our dataset if not loaded yet.

In [None]:
df_covid_happiness = upload_datasets()[0]

NameError: ignored

In [3]:
from google.colab import files
uploaded = files.upload()
df_covid_happiness = pd.read_csv('covid_happiness.csv')

Saving covid_happiness.csv to covid_happiness (1).csv


Weight the number of cases by some constant.

In [4]:
# TODO - Weight the variable to improve the interpretability of the coefficient.
population = 68800000
df_covid_happiness['cases_standardised'] = df_covid_happiness['Cases'] / population

In [5]:
print(df_covid_happiness)

      Unnamed: 0        date     Cases  stringency_index  label  \
0              0  2020-01-30         0              5.56   0.58   
1              1  2020-01-31         2              8.33  -0.06   
2              2  2020-02-01         2              8.33   0.08   
3              3  2020-02-02         2             11.11   0.48   
4              4  2020-02-03         8             11.11   0.38   
...          ...         ...       ...               ...    ...   
1062        1062  2022-12-27  24135080              5.56   0.15   
1063        1063  2022-12-28  24135080              5.56   0.35   
1064        1064  2022-12-29  24135080              5.56  -0.10   
1065        1065  2022-12-30  24135080              5.56   0.10   
1066        1066  2022-12-31  24135080              5.56   0.70   

      cases_standardised  
0           0.000000e+00  
1           2.906977e-08  
2           2.906977e-08  
3           2.906977e-08  
4           1.162791e-07  
...                  ...  
1062  

We now install the required packages for running regressions and generating the corresponding regression tables.

In [6]:
!pip install linearmodels --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m27.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m68.2/68.2 KB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[?25h

We then import the installed libraries.

In [7]:
from linearmodels.panel import PanelOLS

Suppose we want to use month fixed effects in our regression. We will need to create a variable of month first in order to take this up in our final regression. Create a column that takes a different index for each month-year pair and wrap this in the function ``pd.Categorical()``.

In [8]:
# TODO - Create a column that takes a different index for each month-year pair.
df_covid_happiness['year'] = pd.DatetimeIndex(df_covid_happiness['date']).year
df_covid_happiness['month'] = pd.DatetimeIndex(df_covid_happiness['date']).month
df_covid_happiness['day'] = pd.DatetimeIndex(df_covid_happiness['date']).day


In [9]:
# Creating a new column in the dataframe - month_year
df_covid_happiness['month_year'] = df_covid_happiness['month'].astype(str) + "-" + df_covid_happiness['year'].astype(str)


In [10]:
# Check. Ruoxi's code had square brackets around teh whole command
print(df_covid_happiness)

      Unnamed: 0        date     Cases  stringency_index  label  \
0              0  2020-01-30         0              5.56   0.58   
1              1  2020-01-31         2              8.33  -0.06   
2              2  2020-02-01         2              8.33   0.08   
3              3  2020-02-02         2             11.11   0.48   
4              4  2020-02-03         8             11.11   0.38   
...          ...         ...       ...               ...    ...   
1062        1062  2022-12-27  24135080              5.56   0.15   
1063        1063  2022-12-28  24135080              5.56   0.35   
1064        1064  2022-12-29  24135080              5.56  -0.10   
1065        1065  2022-12-30  24135080              5.56   0.10   
1066        1066  2022-12-31  24135080              5.56   0.70   

      cases_standardised  year  month  day month_year  
0           0.000000e+00  2020      1   30     1-2020  
1           2.906977e-08  2020      1   31     1-2020  
2           2.906977e-08  2

In [11]:
# Saving it to avoid losing the progress
filename = "covid_happiness_timeseries.csv"
df_covid_happiness.to_csv(filename, index=False)
files.download(filename)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [12]:
# Reset the index #skipping this step part 1
df_covid_happiness = df_covid_happiness.reset_index(drop = True)

In [13]:
print(df_covid_happiness)

      Unnamed: 0        date     Cases  stringency_index  label  \
0              0  2020-01-30         0              5.56   0.58   
1              1  2020-01-31         2              8.33  -0.06   
2              2  2020-02-01         2              8.33   0.08   
3              3  2020-02-02         2             11.11   0.48   
4              4  2020-02-03         8             11.11   0.38   
...          ...         ...       ...               ...    ...   
1062        1062  2022-12-27  24135080              5.56   0.15   
1063        1063  2022-12-28  24135080              5.56   0.35   
1064        1064  2022-12-29  24135080              5.56  -0.10   
1065        1065  2022-12-30  24135080              5.56   0.10   
1066        1066  2022-12-31  24135080              5.56   0.70   

      cases_standardised  year  month  day month_year  
0           0.000000e+00  2020      1   30     1-2020  
1           2.906977e-08  2020      1   31     1-2020  
2           2.906977e-08  2

Now, save and download the time series dataframe as ``covid_happiness_timeseries.csv``.

In [14]:
# TODO - Save and download the dataframe.
filename = 'covid_happiness_timeseries.csv'
df_covid_happiness.to_csv(filename, index = False)
files.download(filename)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Specifying the model. This is not an exercise because I am of the opinion that one should not learn to do their econometrics in Python and that the time spent searching the code to do this can be seen as suboptimally spent. Namely, documentation on econometric methods in Stata is arguably better and more intuitive to use for people with a background in economics.

In [15]:
print(df_covid_happiness)

      Unnamed: 0        date     Cases  stringency_index  label  \
0              0  2020-01-30         0              5.56   0.58   
1              1  2020-01-31         2              8.33  -0.06   
2              2  2020-02-01         2              8.33   0.08   
3              3  2020-02-02         2             11.11   0.48   
4              4  2020-02-03         8             11.11   0.38   
...          ...         ...       ...               ...    ...   
1062        1062  2022-12-27  24135080              5.56   0.15   
1063        1063  2022-12-28  24135080              5.56   0.35   
1064        1064  2022-12-29  24135080              5.56  -0.10   
1065        1065  2022-12-30  24135080              5.56   0.10   
1066        1066  2022-12-31  24135080              5.56   0.70   

      cases_standardised  year  month  day month_year  
0           0.000000e+00  2020      1   30     1-2020  
1           2.906977e-08  2020      1   31     1-2020  
2           2.906977e-08  2

In [16]:
# Adding the date to the index as is required by the package of use.
# Also, placing the index of dates in the first column.
df_covid_happiness = df_covid_happiness.set_index("date", append=True)
df_covid_happiness.index = df_covid_happiness.index.swaplevel(0, 1)

# Specifying the model.
# Note: changed 'sentiment' to 'label' 
# Note: changed 'cases' to 'Cases'
regression_model = PanelOLS(dependent=df_covid_happiness['label'],
                            exog=df_covid_happiness[["cases_standardised", "stringency_index"]],
                            entity_effects=False,
                            time_effects=False,
                            other_effects=df_covid_happiness['month'])

Running the regression.

In [17]:
regression_results_summary = regression_model.fit(cov_type='clustered', cluster_entity=True).summary

Creating a regression table with the results.

In [18]:
pd.options.display.latex.repr = True
print(regression_results_summary)
print(regression_results_summary.as_latex())

                          PanelOLS Estimation Summary                           
Dep. Variable:                  label   R-squared:                        0.0176
Estimator:                   PanelOLS   R-squared (Between):              0.1015
No. Observations:                1067   R-squared (Within):               0.0000
Date:                Sat, Feb 25 2023   R-squared (Overall):              0.1015
Time:                        09:20:12   Log-likelihood                    49.076
Cov. Estimator:             Clustered                                           
                                        F-statistic:                      9.4491
Entities:                        1067   P-value                           0.0001
Avg Obs:                       1.0000   Distribution:                  F(2,1053)
Min Obs:                       1.0000                                           
Max Obs:                       1.0000   F-statistic (robust):             9.3210
                            

Storing and downloading the regression table in ``LaTeX`` format.

In [19]:
regression_table = open("regression_table.tex", "w")
regression_table = print(regression_results_summary.as_latex(), file=regression_table)
files.download("regression_table.tex")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## Further Exercises
1. One option is to expand this analysis to different countries. Here, it is important to realise that comparing the coefficients of different countries is not justified. Namely, different countries may confirm cases in different ways. Looking at proportional increases in the number of cases will remove this problem, but will disregard the base level of new cases in the country which of course influences the magnitude of the effect on the average sentiment for a given proportional increase in the number of cases.
2. Data visualisation: plot the comovement of the variables of interest over time or something else you are interested in seeing that can give a new insight into the problem.
3. There may be other confounders present in the regression that I can't think of right now. If you can think of any, download the data for these, clean that data and create a new variable to run the regression with again.
4. Scrape tweets from random time intervals to reduce bias induced by Twitter's feed selection methods. Documentation available for some of the keywords necessary in the Twitter search query to do this can be found [here](https://github.com/igorbrigadir/twitter-advanced-search).
5. Filtering out spam tweets. You can approach this Natural Language Processing (NLP) problem in various ways, from as advanced as using AI classification methods as looking for duplicated tweets in your list of scraped tweets. You can always combine methods like these, of course.
6. Improving the tweet cleaning function.
7. Running the regression with different model specifications of how the confounder affects the outcome variable and the dependent variable. Namely, it may be the case that the start of heavy restrictions is not so bad yet, but that people get tired of it the longer these heavy restrictions are in place. You would need to transform the restriction variable to carry out the regression with this different definition of the control variable.


## References
* Edouard Mathieu, Hannah Ritchie, Lucas Rodés-Guirao, Cameron Appel, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Saloni Dattani, Diana Beltekian, Esteban Ortiz-Ospina and Max Roser (2020) - "Coronavirus Pandemic (COVID-19)". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/coronavirus' [Online Resource].
* Ensheng Dong, Hongru Du, Lauren Gardner, An interactive web-based dashboard to track COVID-19 in real time, The Lancet Infectious Diseases, Volume 20, Issue 5, 2020, Pages 533-534, ISSN 1473-3099, https://doi.org/10.1016/S1473-3099(20)30120-1. (https://www.sciencedirect.com/science/article/pii/S1473309920301201).
* JustAnotherArchivist, snscrape, (2023), GitHub repository, https://github.com/JustAnotherArchivist/snscrape.
* igorbrigadir, Twitter Advanced Search, (2023), GitHub repository, https://github.com/igorbrigadir/twitter-advanced-search.
* Wide and Narrow Data, Wikipedia, (12 Feb 2023), https://en.wikipedia.org/wiki/Wide_and_narrow_data.
* Advanced Filtering for Geo Data, (2023), https://developer.twitter.com/en/docs/tutorials/advanced-filtering-for-geo-data.
* Get Places Near a Location, (2023), https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-reverse_geocode.

