# COVID-19 Infections and Happiness
This is the notebook for the Python for Economics Project at the London School of Economics, analysing the effect of COVID-19 infections on happiness.


## Introduction

Policy-making during an epidemic is all about economic trade-offs, so one would like to quantify the gains and losses in the factors a government is trading off against each other. The trade-offs involving monetary and other classical economic factors are well documented. The social costs of viral cases (among other social costs of a pandemic) are less so. One way to quantify the social costs of the number of cases in a country is to look at the causal effect of COVID-19 cases on the average sentiment people express online.

## Overview Project
In this project the main goal is to run a regression of the average sentiment people express online on the number of COVID-19 infections. You will start by carrying out this analysis for the UK. A clear confounder here is the set of government restrictions imposed to curb the spread of the virus. You will control for this confounder in the regression, alongside time fixed effects that deal with biases caused by new ways of measuring cases and by changes in testing accuracy and availability, among other possible biases.
</br></br>
To be able to run this final regression, though, you will need to collect the data. This notebook will walk you through the steps associated with this and the final step of running the regression.


## Table of Contents

>[COVID-19 Infections and Happiness](#scrollTo=M_2dLRCIIqv9)

>>[Introduction](#scrollTo=c3t9AWywlLa_)

>>[Overview Project](#scrollTo=hpTdOHFalo5C)

>>[Table of Contents](#scrollTo=loLc9eEEVSsP)

>>[Preparation](#scrollTo=5kqfEAq9S8KC)

>>[Data Collection](#scrollTo=Xviu1_5NnsrF)

>>>[Loading Datasets](#scrollTo=3gOgjpQpKGoe)

>>>[Cleaning Datasets](#scrollTo=Xzmo0WIbKtrk)

>>>>[Preparation](#scrollTo=Xzmo0WIbKtrk)

>>>>[Stringency](#scrollTo=KB_HFVFsdT5H)

>>>>[Cases](#scrollTo=6zlemmyxk2JT)

>>>[Merging Dataframes](#scrollTo=8UXSCwxHg1Gi)

>>>[Average Sentiment](#scrollTo=C_Wv1iKpmdo6)

>>>>[Scraping Tweets](#scrollTo=UVZi_0ONKq5p)

>>>>[Classifying Tweets](#scrollTo=zOiuZ0v8NSKA)

>>[Running Regressions](#scrollTo=1wSD_8JrKslV)

>>[Further Exercises](#scrollTo=HIx-MyVcN7Vh)

>>[References](#scrollTo=GTQjBctvVLWv)




## Preparation
First, you will need to install a few libraries for this project. To install a library, write ``!pip install`` in a code block followed by ``name-library`` and the optional ``--quiet`` keyword to suppress the logs. For example, installing the package ``pandas`` can be done by running ``!pip install pandas --quiet`` in a code block.

(Note: the definitions and methods of confirming cases differ between countries. One option is to look at the percentage change in infections rather than their absolute size; another is to use ONS positivity rates.)

In [2]:
# TODO - Install the following package: pandas (datetime ships with Python and needs no installation).
!pip install pandas --quiet



Now, you have to import the packages you installed. Additionally, import the preinstalled package ``numpy`` as ``np``.

In [6]:
# TODO - Import the installed packages.
import numpy as np
import pandas as pd

# One additional library necessary for CSV uploads is already given (no need to install this one, it is installed by default on Colabs).
from google.colab import files

## Data Collection
We can now start collecting our data.

### Loading Datasets
The data we will use for this analysis will come from the Johns Hopkins University Center for Systems Science and Engineering, Our World in Data and, of course, Twitter.


* The dataset on confirmed cases per country (including the UK) can be found and downloaded [here](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series).
* The dataset on COVID-19 government restriction stringency can be found and downloaded [here](https://ourworldindata.org/covid-stringency-index).
* We will get into the Tweets later.

Once you have downloaded the datasets, you can upload one of them to Colabs by running the commands below, which store a dataset as a Pandas dataframe (sort of like a spreadsheet). It is good coding practice to wrap commands like these in a function. Do this and make the function return both datasets in a list. Then call the function and assign the result to a variable ``dataframes``.

In [7]:
# This command will prompt you with an upload screen and store the uploaded files in a dictionary.
# You can upload multiple files at once.
uploaded = files.upload()

# This command stores the filenames in a list.
filenames = list(uploaded.keys())

# This command selects the filename of the first file in the files you uploaded.
filename = filenames[0]

# This command stores a dataset in a variable as a Pandas dataframe.
# Reading via "filename" picks up the file you just uploaded, even if Colabs renamed it on saving.
dataset = pd.read_csv(filename)

# TODO - Create and call the function.

Saving time_series_covid19_confirmed_global.csv to time_series_covid19_confirmed_global (7).csv
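
One possible shape for the function asked for above, as a minimal sketch. It assumes that ``files.upload()`` returns a dictionary mapping each filename to the raw bytes of the file. The helper name ``dataframes_from_uploads`` and the two tiny stand-in files are made up for illustration, and the Colabs-specific call is kept in a comment so the block also runs outside Colabs.

```python
import io

import pandas as pd


def dataframes_from_uploads(uploaded):
    """Turn a {filename: raw bytes} dictionary into a list of dataframes."""
    dataframes = []
    for name in uploaded:
        # Parse the bytes of each uploaded CSV file, in upload order.
        dataframes.append(pd.read_csv(io.BytesIO(uploaded[name])))
    return dataframes


# In Colabs this could be wrapped as follows (assumption, not run here):
# def upload_datasets():
#     return dataframes_from_uploads(files.upload())

# Stand-in for two uploaded files, so the sketch can be tried anywhere.
fake_uploads = {
    "cases.csv": b"Country/Region,1/22/20\nUnited Kingdom,0\n",
    "stringency.csv": b"date,stringency_index\n2020-01-22,5.56\n",
}
dataframes = dataframes_from_uploads(fake_uploads)
```

Keeping the parsing separate from the upload prompt makes the function easy to try out with in-memory data before uploading the real files.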


### Cleaning Datasets
#### Preparation
First, it would be nice to have each of the datasets stored in a variable with a corresponding name. Below I show a trick to assign two variables at once. Use this trick to assign your datasets to the variables ``df_cases`` and ``df_stringency``.

In [8]:
df_cases = dataset

# TODO - Replicate the trick with the variable names given.
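
A minimal sketch of assigning two variables at once, assuming the trick meant above is Python's sequence unpacking. The placeholder list stands in for the output of your upload function, and the order of the two dataframes in it is an assumption.

```python
import pandas as pd

# Placeholder list standing in for the output of your upload function.
dataframes = [
    pd.DataFrame({"Country/Region": ["United Kingdom"], "1/22/20": [0]}),
    pd.DataFrame({"date": ["2020-01-22"], "stringency_index": [5.56]}),
]

# Unpacking: the first element goes to the first name, the second to the second.
df_cases, df_stringency = dataframes
```

The number of names on the left must match the number of elements in the list, otherwise Python raises a ``ValueError``.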

#### Stringency
Let's start with the easiest dataset first. Inspect the structure of the dataset by printing the dataframe.

In [None]:
# TODO - print the dataframe and inspect the structure.

Clearly, there are lots of variables and countries of which we do not need the data. Therefore, we would like to drop the redundant entries. Do this by selecting only the date and stringency index values for just the United Kingdom. Overwrite ``df_stringency`` with this transformed dataframe. As a final nit-picky step, reset the index of the dataframe.

In [None]:
# TODO - Overwrite the dataframe with the filtered version.

We would like our data to have suitable data types, so that it is easy to work with down the line. For example, we would like the values in our ``date`` column to be of the ``datetime`` data type. Also, we would like the values in our ``stringency_index`` column to be of the ``float`` data type. Check if this is the case and if not, convert the column values to the desired data type.

In [None]:
# TODO - Check if the column values data types are correct and convert them if not.
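
The filtering and type conversion steps above can be sketched on a small stand-in dataframe. The country column name ``location`` is an assumption (check the name in your own download), and the values are made up for illustration.

```python
import pandas as pd

# Small stand-in for the stringency dataset; "location" is an assumed column name.
df_stringency = pd.DataFrame({
    "location": ["France", "United Kingdom", "United Kingdom"],
    "date": ["2020-03-23", "2020-03-23", "2020-03-24"],
    "stringency_index": ["87.96", "75.93", "79.63"],
})

# Keep only the UK rows and the two columns of interest, then renumber the rows.
is_uk = df_stringency["location"] == "United Kingdom"
df_stringency = df_stringency.loc[is_uk, ["date", "stringency_index"]].reset_index(drop=True)

# Inspect the data types, then convert each column to the desired type.
print(df_stringency.dtypes)
df_stringency["date"] = pd.to_datetime(df_stringency["date"])
df_stringency["stringency_index"] = df_stringency["stringency_index"].astype(float)
```

Passing ``drop=True`` to ``reset_index`` discards the old row labels instead of keeping them as an extra column.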

#### Cases
Now on to the harder dataset. Inspect the structure of the dataset by printing the dataframe.

In [9]:
# TODO - print the dataframe and inspect the structure
df_cases = df_cases.fillna('N/A')
print(df_cases)
print(df_cases.info())

    Province/State        Country/Region        Lat       Long  1/22/20  \
0              N/A           Afghanistan   33.93911  67.709953        0   
1              N/A               Albania    41.1533    20.1683        0   
2              N/A               Algeria    28.0339     1.6596        0   
3              N/A               Andorra    42.5063     1.5218        0   
4              N/A                Angola   -11.2027    17.8739        0   
..             ...                   ...        ...        ...      ...   
284            N/A    West Bank and Gaza    31.9522    35.2332        0   
285            N/A  Winter Olympics 2022    39.9042   116.4074        0   
286            N/A                 Yemen  15.552727  48.516388        0   
287            N/A                Zambia -13.133897  27.849332        0   
288            N/A              Zimbabwe -19.015438  29.154857        0   

     1/23/20  1/24/20  1/25/20  1/26/20  1/27/20  ...  2/10/23  2/11/23  \
0          0        0   

Again, there are a lot of countries we do not need the data of. Filter the dataframe to only contain records of the UK (be precise here) and overwrite the original dataframe with the filtered one.

In [10]:
# TODO - Filter and overwrite the dataframe of cases.
df_cases = df_cases.loc[(df_cases['Province/State'] == 'N/A') & (df_cases['Country/Region'] == 'United Kingdom')]
print(df_cases)

    Province/State  Country/Region      Lat   Long  1/22/20  1/23/20  1/24/20  \
278            N/A  United Kingdom  55.3781 -3.436        0        0        0   

     1/25/20  1/26/20  1/27/20  ...   2/10/23   2/11/23   2/12/23   2/13/23  \
278        0        0        0  ...  24315979  24315979  24315979  24315979   

      2/14/23   2/15/23   2/16/23   2/17/23   2/18/23   2/19/23  
278  24315979  24315979  24341611  24341611  24341611  24341611  

[1 rows x 1129 columns]


Some might think we are done with this dataset now, but it has a nasty characteristic. Namely, it is [*wide*](https://en.wikipedia.org/wiki/Wide_and_narrow_data), and quite *wide*, to say the least. Many libraries written for Python and other programming languages are much easier to use with data in the other shape. Therefore, we want to change the shape of the data to the *narrow* format.
</br></br>
In essence, we would like one column for the date and one column for the confirmed cases. Thus, we need to put the column names in a new variable name called ``date`` and link the corresponding case numbers to the right row.
</br></br>
Convert the dataframe to a *narrow* format. After understanding the concepts by reading the Wikipedia page linked before, use Pandas' [``melt``](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) implementation to achieve this.

In [11]:
# TODO - Convert the dataframe from wide to narrow format.
# "Province/State" stays as an identifier column; every other column is stacked into rows.
df_cases = pd.melt(df_cases, id_vars=['Province/State'], var_name='Date', value_name='Cases', ignore_index=True)
print(df_cases)

     Province/State            Date           Cases
0               N/A  Country/Region  United Kingdom
1               N/A             Lat         55.3781
2               N/A            Long          -3.436
3               N/A         1/22/20               0
4               N/A         1/23/20               0
...             ...             ...             ...
1123            N/A         2/15/23        24315979
1124            N/A         2/16/23        24341611
1125            N/A         2/17/23        24341611
1126            N/A         2/18/23        24341611
1127            N/A         2/19/23        24341611

[1128 rows x 3 columns]


In [12]:
df_casesdropped = df_cases.drop(columns=['Province/State'])
print(df_casesdropped)

                Date           Cases
0     Country/Region  United Kingdom
1                Lat         55.3781
2               Long          -3.436
3            1/22/20               0
4            1/23/20               0
...              ...             ...
1123         2/15/23        24315979
1124         2/16/23        24341611
1125         2/17/23        24341611
1126         2/18/23        24341611
1127         2/19/23        24341611

[1128 rows x 2 columns]


In [13]:
df_rowdropped = df_casesdropped.drop([1, 2])
print(df_rowdropped)

                Date           Cases
0     Country/Region  United Kingdom
3            1/22/20               0
4            1/23/20               0
5            1/24/20               0
6            1/25/20               0
...              ...             ...
1123         2/15/23        24315979
1124         2/16/23        24341611
1125         2/17/23        24341611
1126         2/18/23        24341611
1127         2/19/23        24341611

[1126 rows x 2 columns]


In [17]:
df_cases = df_rowdropped.drop([0])
print(df_cases)

         Date     Cases
3     1/22/20         0
4     1/23/20         0
5     1/24/20         0
6     1/25/20         0
7     1/26/20         0
...       ...       ...
1123  2/15/23  24315979
1124  2/16/23  24341611
1125  2/17/23  24341611
1126  2/18/23  24341611
1127  2/19/23  24341611

[1125 rows x 2 columns]
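
The row-dropping cells above are needed because the country name, latitude and longitude were stacked along with the dates. As an alternative sketch (not the only way to do it), listing all four descriptive columns in ``id_vars`` keeps them out of the stacked values. The one-row dataframe below uses made-up case counts.

```python
import pandas as pd

# One-row stand-in for the filtered UK dataframe (case counts made up for illustration).
df_wide = pd.DataFrame({
    "Province/State": ["N/A"],
    "Country/Region": ["United Kingdom"],
    "Lat": [55.3781],
    "Long": [-3.436],
    "1/22/20": [0],
    "1/23/20": [0],
    "1/24/20": [2],
})

# Columns that describe the row rather than hold a daily count.
id_columns = ["Province/State", "Country/Region", "Lat", "Long"]

# Every column not listed in id_vars is treated as a date column and stacked.
df_narrow = pd.melt(df_wide, id_vars=id_columns, var_name="Date", value_name="Cases")

# Keep only the two columns needed for the analysis.
df_narrow = df_narrow[["Date", "Cases"]]
```

Because only the date columns are stacked here, the ``Cases`` column stays numeric instead of mixing numbers with text.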


Now, we would like to convert the date column of data type ``string`` to the data type ``datetime``, because we want to link the time series datasets that we now have parsed to each other and make one big, complete dataset. This is not as easy as it was for the previous dataset, and you will probably find out why.

In [19]:
# TODO - Convert the values of the date column to the datetime data type.
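# The dates are written as month/day/two-digit year, so the format is spelled out explicitly.
df_cases['Date'] = pd.to_datetime(df_cases['Date'], format='%m/%d/%y')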
print(df_cases)
print(df_cases.dtypes)

           Date     Cases
3    2020-01-22         0
4    2020-01-23         0
5    2020-01-24         0
6    2020-01-25         0
7    2020-01-26         0
...         ...       ...
1123 2023-02-15  24315979
1124 2023-02-16  24341611
1125 2023-02-17  24341611
1126 2023-02-18  24341611
1127 2023-02-19  24341611

[1125 rows x 2 columns]
Date     datetime64[ns]
Cases            object
dtype: object
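
The output above shows that ``Cases`` is still of the generic ``object`` type. A minimal sketch of converting both columns on a small made-up dataframe, assuming the dates follow the month/day/two-digit-year pattern seen in the column names:

```python
import pandas as pd

df_example = pd.DataFrame({
    "Date": ["1/22/20", "2/1/20", "12/31/21"],
    "Cases": ["0", "2", "13100000"],  # strings, mimicking the object dtype above
})

# %m/%d/%y spells out the month/day/two-digit-year order of the date strings.
df_example["Date"] = pd.to_datetime(df_example["Date"], format="%m/%d/%y")

# The counts are stored as generic objects; make them numeric.
df_example["Cases"] = pd.to_numeric(df_example["Cases"])
```

Stating the format removes any ambiguity between month-first and day-first dates such as ``2/1/20``.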


### Merging Dataframes
Now, we would like to merge the dataframes of the COVID-19 cases and COVID-19 policy stringency with each other, so that for each date that is present in both dataframes we have one observation for the stringency and the number of cases. We will use [Pandas' implementation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) of a merge function.

In [None]:
# TODO - Merge "df_cases" with "df_stringency" and save the result in a variable called "df_cases_stringency".

Upon inspecting the data, we can see that there are some missing observations for the stringency index, probably because the stringency data does not go as far in time as the cases dataset. To clean this up, we would like to drop these missing values.

In [None]:
# TODO - Drop the missing values in the dataset.
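
The merge and clean-up steps can be sketched on two small made-up dataframes. The sketch assumes a left merge on a shared ``date`` column, which is one way to end up with the missing stringency values described above. Renaming ``Date`` and ``Cases`` to lower case is also an assumption, chosen to match the column names used later in the notebook.

```python
import pandas as pd

df_cases = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-23", "2020-03-24", "2020-03-25"]),
    "Cases": [100, 150, 210],
})
df_stringency = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-23", "2020-03-24"]),
    "stringency_index": [75.93, 79.63],
})

# Give both dataframes the same lower-case key column before merging.
df_cases = df_cases.rename(columns={"Date": "date", "Cases": "cases"})

# A left merge keeps every date in the cases data; dates without a
# stringency value get NaN in that column.
df_cases_stringency = df_cases.merge(df_stringency, on="date", how="left")

# Drop the rows where no stringency value was found and renumber the rows.
df_cases_stringency = df_cases_stringency.dropna().reset_index(drop=True)
```

With ``how="inner"`` the unmatched dates would be left out during the merge itself, so there would be nothing to drop afterwards.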

After editing data with code that takes a while to run, you usually want to save your progress by downloading the dataset. (In more advanced projects, you might use a database for computationally expensive operations.) Thus, download your dataset. You can use the previously imported ``files`` Colabs library for this. Download the dataset as ``cases_stringency.csv``. Make sure to exclude the index when converting the dataframe to CSV.

In [None]:
# TODO - Convert the dataframe to a CSV file and download it.

### Average Sentiment
Now, the only data collection task left is collecting data on the average sentiment people express online.

#### Scraping Tweets
In this section, we will start scraping tweets from the UK over the same time period in which the variables ``stringency_index`` and ``cases`` are recorded. In an academic setting, you might prefer to use an official Twitter API, but getting access to it can take a while. Moreover, little is lost by using an unofficial Twitter scraper.
</br></br>
If you took a break before this code chunk and your Google Colabs runtime has restarted, you can reload the dataset you created in the previous parts with the code below.

In [None]:
df_cases_stringency = upload_datasets()[0]

First, we install a library that allows us to easily scrape tweets from Twitter.

In [None]:
!pip install snscrape --quiet

Importing the scraping library.

In [None]:
import snscrape.modules.twitter as sntwitter

We will have to define the date range we want to scrape data from before we start scraping tweets. A useful function for this is Pandas' [``date_range``](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html). Define a date range that starts from the earliest date all the way to the last date in your dataframe ``df_cases_stringency``. Store this range of dates in a variable called ``date_range``.

In [None]:
# TODO - Define the date range.
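
A minimal sketch of building the date range, using a tiny made-up dataframe in place of ``df_cases_stringency`` and assuming its ``date`` column already holds ``datetime`` values:

```python
import pandas as pd

# Tiny stand-in for df_cases_stringency with gaps between the dates.
df_cases_stringency = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-22", "2020-01-24", "2020-01-26"]),
})

# One entry per calendar day from the earliest to the latest date, inclusive.
date_range = pd.date_range(
    start=df_cases_stringency["date"].min(),
    end=df_cases_stringency["date"].max(),
    freq="D",
)
```

The range contains every day between the two end points, including days that are missing from the dataframe itself.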

Defining a list to store the tweets in.

In [None]:
tweets = []

Defining the number of tweets to be scraped per day. You can change this number to your liking. I would recommend running the code with this number first and increasing it only once you are sure the code works, so that no computation time is wasted.

In [None]:
tweets_per_day = 10

As we want to scrape tweets published from the UK, we need to tell this to our scraper. As it so happens, Twitter uses geographic tags users can choose to attach to their tweets. (There are some problems of representativeness with this approach discussed [here](https://developer.twitter.com/en/docs/tutorials/advanced-filtering-for-geo-data) if you are interested.) The UK tag is ``6416b8512febefc9``. If needed while exploring the **optional** further exercises, you can find tags of other countries via the following Twitter API: ``f"https://api.twitter.com/1.1/geo/reverse_geocode.json?lat={latitude}&lon={longitude}&granularity=country"``. You would format the string based on your latitude and longitude variables before plugging the link in your browser or Python API module of choice. Documentation for this API can be found [here](https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-reverse_geocode).

Now, we can start scraping. To get you started with the functionality of the ``snscrape`` module, I have written a simple piece of code that you can run to understand how this module can be used.

In [None]:
# Demonstrating the working of the "enumerate" function.
text_list = ["This", "is", "how", "enumerate", "works."]
for i, text in enumerate(text_list):
  print(i, text)

# Storing the place ID for the UK.
place_id = "6416b8512febefc9"

# Defining the search query for our Twitter scraper.
# The keyword "lang:en" will filter for English tweets only.
# The keywords "since:date" and "until:date" define the time range the tweet has to be from.
# "until" is exclusive, meaning no tweets are scraped from "2020-05-20". "since" is inclusive.
scraped_tweets = sntwitter.TwitterSearchScraper(f"lang:en place:{place_id} since:2020-05-19 until:2020-05-20").get_items()

# This piece of code will print 5 tweets.
# For each iteration in the loop, the scraper will scroll to the next tweet in the feed returned by Twitter.
for i, tweet in enumerate(scraped_tweets):
  print(tweet.rawContent)
  print(tweet.date)
  # We will only need the rawContent and date properties of the tweet.
  # tweet.rawContent gives the text of the tweet (string)
  # tweet.date gives the date and time of the tweet (datetime)
  # For more properties, see line 60 and onwards of https://github.com/JustAnotherArchivist/snscrape/blob/master/snscrape/modules/twitter.py.

  # Stopping the loop.
  if i == 4:
    break

Now that you hopefully understand how this module works, I want you to write a function called ``scrape_time_range``. This function will have to return a list of scraped tweets, containing the raw content and date for each tweet in the list.
</br></br>
This function should take four arguments:
1. A list to append the scraped tweets to.
2. The place ID.
3. The date range.
4. The number of tweets to be scraped per day.

You want this function to iterate over the dates in the date range first, before defining the search query for that day and scraping the desired number of tweets. Notice that the dates stored in the previously created ``date_range`` are of the data type ``datetime``. They can be converted to strings by using the function [``strftime``](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.strftime.html). You can format the desired output strings with the following keywords:
* ``%Y`` which corresponds to YYYY.
* ``%m`` which corresponds to mm.
* ``%d`` which corresponds to dd.

Make sure to take care of the hyphens in these dates, too, when converting the date range, as your Twitter search query will be invalid without them. The same applies to the order of the year, month and day in the string.
</br></br>
**Hint:** wrap the output of ``date_range.strftime()`` in ``list()`` to convert the Pandas ``Index`` object it returns to a Python list, which is more convenient in this instance.

In [None]:
# TODO
# 1. Convert the date range to a list of date strings.
# 2. Write the scraping function.
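
The date conversion and the per-day search queries can be sketched without touching the scraper itself. The helper name ``build_daily_queries`` is hypothetical and not part of ``snscrape``. It pairs each day with the next one because ``until`` is exclusive, so the last date in the list only serves as an end point.

```python
import pandas as pd


def build_daily_queries(place_id, date_strings):
    """Pair each day with the following day to form one search query per day."""
    queries = []
    # zip stops at the shorter input, so the last date is only used as an "until".
    for since, until in zip(date_strings, date_strings[1:]):
        queries.append(f"lang:en place:{place_id} since:{since} until:{until}")
    return queries


date_range = pd.date_range(start="2020-05-19", end="2020-05-21", freq="D")

# strftime returns a Pandas Index of strings; list() turns it into a plain list.
date_strings = list(date_range.strftime("%Y-%m-%d"))

queries = build_daily_queries("6416b8512febefc9", date_strings)
```

Inside ``scrape_time_range``, each of these query strings could then be handed to the scraper in the same way as in the demonstration cell above.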

Call your scraping function.
**Warning:** with 10 tweets a day this takes about 40 minutes to run, and at a later stage classifying the tweets with the best model takes around 6 hours (though you can do this in batches, of course).

In [None]:
# TODO - Call it.

Now, we would like to convert the list of tweets to a dataframe and a CSV file to save our progress. Call the dataframe ``df_tweets`` and the CSV file ``tweets.csv``.

In [None]:
# TODO - Convert the list of tweets to a dataframe and a CSV file.

#### Classifying Tweets
If you have left off before this chunk and your Colabs runtime has refreshed in the meantime, load the dataset below.

In [None]:
# Taking the first index of the list of uploaded datasets, as you only upload one.
df_tweets = upload_datasets()[0]

At this stage, we need to define a function that cleans the tweets. Namely, users and links mentioned in tweets might confuse the classification model that we will use at a later stage. This can happen if usernames and links contain words that suggest a certain sentiment but are not used for that purpose in natural text. Thus, we need to neutralise these words in the tweets. Create a function that converts all users (in the form of ``@username``) to "``@user``" and all links (in the form of ``https://``) to "``https``". Call it ``neutralise_mentions_links`` and make it so that it takes one argument called ``text``.
</br></br>
Use the ``.split()`` function of strings in Python. Mentions start with "@", links with "https://". 

In [None]:
# TODO - Write a function that neutralises mentions and shortens links in a piece of text.
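
One way the cleaning function could look, as a minimal sketch built on ``.split()`` as suggested above. It assumes that mentions and links are separated from other words by whitespace, and the example tweet is made up.

```python
def neutralise_mentions_links(text):
    """Replace @mentions with "@user" and links with "https"."""
    cleaned_words = []
    # split() with no argument breaks the text on any whitespace.
    for word in text.split():
        if word.startswith("@") and len(word) > 1:
            cleaned_words.append("@user")
        elif word.startswith("https://"):
            cleaned_words.append("https")
        else:
            cleaned_words.append(word)
    # Joining with single spaces collapses the original whitespace.
    return " ".join(cleaned_words)


example = neutralise_mentions_links("@some_user great talk today https://t.co/abc123 !")
```

A mention followed directly by punctuation, such as ``@some_user,``, is replaced together with the punctuation, which is one of the things an improved cleaning function could handle.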

Apply the function to all the tweets in the dataframe.

In [None]:
# TODO - Apply the function to all tweets in the dataframe.

Now, we would like to classify the sentiment of the tweets in our dataframe. We task an external library with this exercise. The library we will use is ``happytransformer``. First, we install the library.

In [None]:
!pip install happytransformer --quiet

Second, we import the text classification functionality from the library we installed.

In [None]:
from happytransformer import HappyTextClassification

Third, we load the AI model that has been trained on a large dataset of tweets with sentiment labels. We will use this for the analysis. This type of model is called a transformer model, which you can read more about [here](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)).

In [None]:
happy_tc = HappyTextClassification(model_type="BERT",  model_name="cardiffnlp/twitter-roberta-base-sentiment", num_labels=3)

The code below demonstrates how the model can be used. Now write a function called ``classify_sentiment`` that takes one argument, ``text``, and outputs the label in numeric form.
</br></br>
It is important for you to know that the label that the NLP model outputs is one of:
* ``LABEL_0``, which corresponds to negative or the numeric form of -1.
* ``LABEL_1``, which corresponds to neutral or the numeric form of 0.
* ``LABEL_2``, which corresponds to positive or the numeric form of 1.

The model outputs one score for each label and returns the label and score corresponding to the label with the highest score.

In [None]:
result = happy_tc.classify_text("I think the Python for Economics week is a great initiative.")
print(result.label, result.score)

# TODO - Write a function that outputs the label in numeric form.
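
The label-to-number step can be sketched separately from the model. The helper name ``label_to_number`` is hypothetical, and the call to ``happy_tc`` is left in a comment because it needs the loaded classifier.

```python
# Mapping from the model's label strings to the numeric form listed above.
LABEL_TO_NUMBER = {"LABEL_0": -1, "LABEL_1": 0, "LABEL_2": 1}


def label_to_number(label):
    """Translate a label such as "LABEL_2" into -1, 0 or 1."""
    return LABEL_TO_NUMBER[label]


# With the classifier loaded, the exercise function could then be (not run here):
# def classify_sentiment(text):
#     return label_to_number(happy_tc.classify_text(text).label)

numbers = [label_to_number(label) for label in ["LABEL_0", "LABEL_1", "LABEL_2"]]
```

Using a dictionary means that an unexpected label raises a ``KeyError`` instead of silently producing a wrong number.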

Apply the ``classify_sentiment`` function to the tweets and store the returned labels in a new column called ``sentiment``. **Warning:** doing this can be time intensive. This notebook was tested with 10 tweets per day and it took 6 hours to classify all the tweets scraped over the time range. Try doing this in chunks and downloading the results if you can't run the notebook for 6 hours straight.

In [None]:
# TODO - Apply the sentiment classifier function to the tweets.

Now, we want to calculate the average sentiment for each day. We can drop the column of tweets before we transform the dataframe. Store this new dataframe in a variable called ``df_sentiment``.

In [None]:
# TODO - Drop the column of tweets and transform the dataframe.
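
A minimal sketch of the daily averaging on a made-up dataframe. The column names ``date``, ``tweet`` and ``sentiment`` are assumptions about how you named the columns of ``df_tweets``.

```python
import pandas as pd

df_tweets = pd.DataFrame({
    "date": pd.to_datetime([
        "2020-05-19 08:15:00", "2020-05-19 21:40:00", "2020-05-20 12:00:00",
    ]),
    "tweet": ["first", "second", "third"],
    "sentiment": [1, 0, -1],
})

# The tweet text is no longer needed once each tweet has a label.
df_sentiment = df_tweets.drop(columns=["tweet"])

# Strip the time of day so that all tweets from one day share the same key.
df_sentiment["date"] = df_sentiment["date"].dt.normalize()

# Average the labels within each day.
df_sentiment = df_sentiment.groupby("date", as_index=False)["sentiment"].mean()
```

Setting ``as_index=False`` keeps ``date`` as an ordinary column, which makes the later merge on ``date`` straightforward.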

We have now successfully generated all of the data necessary for the analysis. One last thing to do is to merge the previously merged datasets with our final dataset of average sentiment scores to create the dataframe ``df_covid_happiness``. Load the previously merged dataset with the code below if necessary.

In [None]:
df_cases_stringency = upload_datasets()[0]

# TODO - Merge the stringency and cases dataset with the sentiment dataset.

Finally, we save the generated dataset.

In [None]:
filename = "covid_happiness.csv"
df_covid_happiness.to_csv(filename, index=False)
files.download(filename)

## Running Regressions
In this section you will have to run the following regression and report the results:
$average\_sentiment_t = \beta positive\_cases_t + \gamma stringency_t + \eta_t + \varepsilon_t$
</br></br>
Before running this regression, think of the interpretation of the coefficient $\beta$ if you run this regression. Would you want to rescale the corresponding variable $positive\_cases$ with some proportion to improve the interpretability of this regression?
</br></br>
When interpreting the regression results you should make sure you understand the definitions of the variables used in the regression. For example, the number of confirmed cases for our purposes is actually the 7-day rolling average.
</br></br>
First we load our dataset if not loaded yet.

In [None]:
df_covid_happiness = upload_datasets()[0]
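
If your ``cases`` column still holds cumulative counts, as the printed output earlier in the notebook suggests, it would first have to be turned into a 7-day rolling average of new cases. A minimal sketch on made-up numbers, with hypothetical column names:

```python
import pandas as pd

df_example = pd.DataFrame({
    "date": pd.date_range("2020-03-01", periods=9, freq="D"),
    "cumulative_cases": [0, 7, 14, 21, 28, 35, 42, 49, 63],
})

# Day-to-day differences of a cumulative series give the new cases per day.
df_example["new_cases"] = df_example["cumulative_cases"].diff()

# Average the new cases over a trailing window of seven days.
df_example["cases_7day"] = df_example["new_cases"].rolling(window=7).mean()
```

The first seven rows of ``cases_7day`` are missing because a full window of seven daily differences is not yet available there.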

Rescale the number of cases by some constant.

In [None]:
# TODO - Weight the variable to improve the interpretability of the coefficient.
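
A minimal sketch of the rescaling, assuming a constant of 1,000 (any other round number works the same way) and made-up case numbers:

```python
import pandas as pd

df_example = pd.DataFrame({"cases": [1500.0, 32000.0, 4800.0]})

# Assumed scaling constant: one unit of the new variable equals 1,000 cases.
CASES_PER_UNIT = 1_000
df_example["cases_thousands"] = df_example["cases"] / CASES_PER_UNIT
```

With this choice, $\beta$ is read as the change in average sentiment associated with 1,000 additional cases, and the estimated coefficient is 1,000 times larger than it would be with the unscaled variable.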

We now install the required packages for running regressions and generating the corresponding regression tables.

In [None]:
!pip install linearmodels --quiet

We then import the installed libraries.

In [None]:
from linearmodels.panel import PanelOLS

Suppose we want to use month fixed effects in our regression. We first need to create a month variable so that we can include it in our final regression. Create a column that takes a different index for each month-year pair and wrap this in the function ``pd.Categorical()``.

In [None]:
# TODO - Create a column that takes a different index for each month-year pair.
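
One way to build such a column, as a minimal sketch on made-up dates. It assumes that a "year-month" string is an acceptable label for each month-year pair.

```python
import pandas as pd

df_example = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-01", "2021-01-15"]),
})

# "2020-01" and "2021-01" differ, so the same month in another year gets its own level.
month_labels = df_example["date"].dt.strftime("%Y-%m")
df_example["month"] = pd.Categorical(month_labels)
```

Formatting only the month number would merge January 2020 with January 2021, which is why the year is part of the label.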

Now, save and download the time series dataframe as ``covid_happiness_timeseries.csv``.

In [None]:
# TODO - Save and download the dataframe.

Specifying the model. This is not an exercise because I am of the opinion that one should not learn to do their econometrics in Python and that the time spent searching the code to do this can be seen as suboptimally spent. Namely, documentation on econometric methods in Stata is arguably better and more intuitive to use for people with a background in economics.

In [None]:
# Adding the date to the index as is required by the package of use.
# Also, placing the index of dates in the first column.
df_covid_happiness = df_covid_happiness.set_index("date", append=True)
df_covid_happiness.index = df_covid_happiness.index.swaplevel(0, 1)

# Specifying the model.
regression_model = PanelOLS(dependent=df_covid_happiness['sentiment'],
                            exog=df_covid_happiness[["cases", "stringency_index"]],
                            entity_effects=False,
                            time_effects=False,
                            other_effects=df_covid_happiness['month'])

Running the regression.

In [None]:
regression_results_summary = regression_model.fit(cov_type='clustered', cluster_entity=True).summary

Creating a regression table with the results.

In [None]:
pd.options.display.latex.repr = True
print(regression_results_summary)
print(regression_results_summary.as_latex())

Storing and downloading the regression table in ``LaTeX`` format.

In [None]:
# Writing the LaTeX table to a file; the "with" block closes the file before it is downloaded.
with open("regression_table.tex", "w") as regression_table:
  regression_table.write(regression_results_summary.as_latex())
files.download("regression_table.tex")

## Further Exercises
1. One option is to expand this analysis to different countries. Here, it is important to realise that comparing the coefficients of different countries is not justified. Namely, different countries may confirm cases in different ways. Looking at proportional increases in the number of cases will remove this problem, but will disregard the base level of new cases in the country which of course influences the magnitude of the effect on the average sentiment for a given proportional increase in the number of cases.
2. Data visualisation: plot the comovement of the variables of interest over time or something else you are interested in seeing that can give a new insight into the problem.
3. There may be other confounders present in the regression that I can't think of right now. If you can think of any, download the data for these, clean that data and create a new variable to run the regression with again.
4. Scrape tweets from random time intervals to reduce bias induced by Twitter's feed selection methods. Documentation available for some of the keywords necessary in the Twitter search query to do this can be found [here](https://github.com/igorbrigadir/twitter-advanced-search).
5. Filtering out spam tweets. You can approach this Natural Language Processing (NLP) problem in various ways, ranging from advanced AI classification methods to simply looking for duplicated tweets in your list of scraped tweets. You can always combine methods like these, of course.
6. Improving the tweet cleaning function.
7. Running the regression with different model specifications of how the confounder affects the outcome variable and the explanatory variable. Namely, it may be the case that heavy restrictions are not so bad at first, but that people get tired of them the longer they are in place. You would need to transform the restriction variable to carry out the regression with this different definition of the control variable.


## References
* Edouard Mathieu, Hannah Ritchie, Lucas Rodés-Guirao, Cameron Appel, Charlie Giattino, Joe Hasell, Bobbie Macdonald, Saloni Dattani, Diana Beltekian, Esteban Ortiz-Ospina and Max Roser (2020) - "Coronavirus Pandemic (COVID-19)". Published online at OurWorldInData.org. Retrieved from: 'https://ourworldindata.org/coronavirus' [Online Resource].
* Ensheng Dong, Hongru Du, Lauren Gardner, An interactive web-based dashboard to track COVID-19 in real time, The Lancet Infectious Diseases, Volume 20, Issue 5, 2020, Pages 533-534, ISSN 1473-3099, https://doi.org/10.1016/S1473-3099(20)30120-1. (https://www.sciencedirect.com/science/article/pii/S1473309920301201).
* JustAnotherArchivist, snscrape, (2023), GitHub repository, https://github.com/JustAnotherArchivist/snscrape.
* igorbrigadir, Twitter Advanced Search, (2023), GitHub repository, https://github.com/igorbrigadir/twitter-advanced-search.
* Wide and Narrow Data, Wikipedia, (12 Feb 2023), https://en.wikipedia.org/wiki/Wide_and_narrow_data.
* Advanced Filtering for Geo Data, (2023), https://developer.twitter.com/en/docs/tutorials/advanced-filtering-for-geo-data.
* Get Places Near a Location, (2023), https://developer.twitter.com/en/docs/twitter-api/v1/geo/places-near-location/api-reference/get-geo-reverse_geocode.

