The Effect of Large-Scale Anti-Contagion Policies on the COVID-19 Pandemic
This repository contains code and data necessary to replicate the findings of Hsiang et al. (Nature 2020).
Scripts in this repository are written in R, Python, and Stata. Note that you will need a Stata license to fully replicate the analysis (provided in the CodeOcean capsule). Throughout this Readme, when indicating paths to code and data, it is assumed that you’ll execute scripts from the repo root directory.
The easiest way to interact with our code and data is via our CodeOcean capsule, because all of the relevant setup described below has been done for you. You may replicate the full analysis through the "Reproducible Run" feature or interact directly with our code through Jupyter Notebooks that run Python, R, and Stata. You may also utilize RStudio. If you wish to use the command line on a cloud workstation, you will want to activate our conda environment with
conda activate gpl-covid.
You may also view and download source code from our Github Repository. "v0.5.1" is the tag that is associated with the current version of the manuscript (as of 06/07/2020), but you may also view the latest codebase and datasets on the master branch. To run this code, you will first want to create and activate our conda environment.
conda env create -f environment/environment.yml conda activate gpl-covid
Once you have activated this environment, to run some of the Python scripts, you’ll need to install the small package (1 module) that is included in this repo.
pip install -e code
To execute the full
run.sh script, which runs the analysis from start to finish, you will also need to add this conda environment to the kernels that Jupyter is able to use (this is only needed to run simulations used to create Extended Data Figures 8 and 9). You do this with the following command (from inside the
gpl-covid conda environment):
python -m ipykernel install --user --name gpl-covid
Finally, to estimate the regression models, you will need several package installed in Stata. To add them, launch Stata and run:
ssc install reghdfe, replace ssc install ftools, replace # the latest version of reghdfe would also require the installation of ftools ssc install coefplot, replace ssc install filelist, replace ssc install outreg2, replace ssc install carryforward, replace
code ├── LICENSE ├── data │ ├── china │ │ ├── collate_data.py │ │ └── download_and_clean_JHU_china.R │ ├── cutoff_dates.csv │ ├── france │ │ ├── download_and_clean_JHU_france.R │ │ ├── format_infected.do │ │ └── format_policy.do │ ├── iran │ │ ├── download_and_clean_JHU_iran.R │ │ ├── iran-split-interim-into-processed.py │ │ └── iran_cleaning.R │ ├── italy │ │ ├── download_and_clean_JHU_italy.R │ │ └── italy-download-cases-merge-policies.py │ ├── korea │ │ ├── download_and_clean_JHU_korea.R │ │ ├── generate_KOR_processed.R │ │ └── make_JHU_comparison_data.R │ ├── multi_country │ │ ├── convert-policies-raw-to-interim.py │ │ ├── download_6_countries_JHU.R │ │ ├── get_JHU_country_data.R │ │ ├── get_adm_info.py │ │ └── quality-check-processed-datasets.py │ └── usa │ ├── add_testing_regimes_to_covidtrackingdotcom_data.ipynb │ ├── add_testing_regimes_to_covidtrackingdotcom_data.py │ ├── check_health_data.R │ ├── download_and_clean_JHU_usa.R │ ├── download_and_clean_usafacts.R │ ├── download_latest_covidtrackingdotcom_data.py │ ├── gen_state_name_abbrev_xwalk.R │ ├── get_usafacts_data.R │ └── merge_policy_and_cases.py ├── models │ ├── CHN_create_CBs.R │ ├── CHN_generate_data_and_model_projection.R │ ├── FRA_create_CBs.R │ ├── FRA_generate_data_and_model_projection.R │ ├── IRN_create_CBs.R │ ├── IRN_generate_data_and_model_projection.R │ ├── ITA_create_CBs.R │ ├── ITA_generate_data_and_model_projection.R │ ├── KOR_create_CBs.R │ ├── KOR_generate_data_and_model_projection.R │ ├── USA_create_CBs.R │ ├── USA_generate_data_and_model_projection.R │ ├── alt_growth_rates │ │ ├── CHN_adm2.do │ │ ├── FRA_adm1.do │ │ ├── IRN_adm1.do │ │ ├── ITA_adm2.do │ │ ├── KOR_adm1.do │ │ ├── MASTER_run_all_reg.do │ │ ├── USA_adm1.do │ │ └── disaggregated_policies │ │ ├── FRA_adm1_disag.do │ │ ├── IRN_adm1_disag.do │ │ ├── ITA_adm2_disag.do │ │ ├── KOR_adm1_disag.do │ │ ├── MASTER_run_all_reg_disag.do │ │ └── USA_adm1_disag.do │ ├── get_gamma.py │ ├── output_underlying_projection_output.R │ ├── predict_felm.R │ ├── projection_helper_functions.R │ ├── run_all_CB_simulations.R │ ├── run_projection_sensitivity.R │ └── test_that_r_coefs_match_stata.R ├── notebooks │ ├── archived │ │ ├── addl-sim-plots.ipynb │ │ └── resimulate-outbreaks.ipynb │ └── simulate-and-regress.ipynb ├── plotting │ ├── aggregate_fig1_source_data.py │ ├── count-policies.py │ ├── distribution_of_policies_by_dow.R │ ├── examine_lagged_relationship_between_new_deaths_recoveries_and_older_cases.R │ ├── extended_data_fig3_4.do │ ├── extended_data_fig5.do │ ├── fig1.R │ ├── fig2.R │ ├── fig4_analysis.py │ ├── figED1.py │ ├── figED2.R │ ├── gen_fig4.py │ ├── pre_policy_growth.do │ └── sims.py ├── run.sh ├── setup.py ├── src │ ├── impute.py │ ├── merge.py │ ├── models │ │ └── epi.py │ ├── pop.py │ └── utils.py └── statab.sh
A detailed description of the epidemiological and policy data obtained and processed for this analysis can be found in the Supplementary Information associated with the article linked at the top of this README. A description of the variables appearing in
data/processed/[adm]/[country]_processed.csv is available in data/raw/multi_country/data_dictionary.xlsx. The epidemiological and policy data sources for all countries are listed in data/raw/multi_country/data_sources.xlsx.
There are four stages to our analysis:
- Data collection and processing
- Regression model estimation
- SIR model projections
- Figure creation
The end dates of the data analyzed, and of the "cases averted" projections, are controlled by
Run full pipeline
The entire pipeline can be run by calling
bash code/run.sh. If you would rather do it step by step, see the description of each stage below. There are four flags that can be passed to that script to change the default behavior, described below:
-s|--nostata: Don't run the Stata scripts (must include this flag if you do not have Stata). In this case, the regression steps will be skipped and you will rely on previously computed regression results.
p|--num-proj: Integer that controls both (a) the number of bootstrap samples when calculating projection uncertainty bounds in Fig. 4 and (b) the number of Monte Carlo samples used to create synthetic outbreaks (Extended Data Figures 8 and 9). Default 1000.
d|--download: Download new raw data, rather than using that which has already been downloaded. This may affect results slightly if, e.g. a country retroactively adjusts their reported cases.
Data collection and processing
The steps to obtain all data in <data/raw>, and then process this data into datasets that can be ingested into a regression, are described below. Note that some of the data collection was performed through manual downloading and/or processing of datasets and is described in as much detail as possible. The sections should be run in the order listed, as some files from later sections will depend on those from earlier sections (e.g. the geographical and population data).
Geographical and population data
python code/data/multi_country/get_adm_info.py: Generates shapefiles and csvs with administrative unit names, geographies, and populations (most countries).
For Chinese city-level population data, the dataset is extracted from a compiled dataset of the 2010 Chinese City Statistical Yearbooks. We manually matched the city level population dataset to the city level COVID-19 epidemiology dataset. The resulting file is in data/raw/china/china_city_pop.csv.
a. Click the
ITEMtab and check the
b. Click the
By Administrative Districttab and check
1 Level Select all.
c. Click the
By Age Grouptab and check the
d. Click the
Timetab and check
e. Click the green
Searchbutton on the upper right of the window.
f. Click the blue
Downloadbutton under the
File formatand download.
h. Open this file, remove the top two rows and the second column. Then change the header (the top row to
adm1_name, population). Save to data/interim/korea/KOR_population.csv.
Policy and testing data
Most policy and testing data was manually collected from a variety of sources. A mapping was developed from each policy to one of the variables we encode for our regression. These sources and mappings are listed in a csv for each country following the pattern
In the U.S., testing regime data is obtained programmatically, via these steps:
python code/data/usa/download_latest_covidtrackingdotcom_data.py: Downloads testing regime data. Note: It seems this site has been getting high traffic and frequently fails to process requests. If this script throws an error due to that issue, try again later.
python code/data/usa/add_testing_regimes_to_covidtrackingdotcom_data.py: Check that detected testing regime changes make sense and discard any false detections. Because this can be an interactive step, there is also a corresponding Notebook that you may run.
Rscript code/data/multi_country/download_6_countries_JHU.R: Downloads 6 countries' data from the Johns Hopkins University Data underlying their dashboard. Note: The JHU dataset format has been changing frequently, so it is possible that this script will need to be modified.
- For data from January 24, 2020 onwards, we relied on an open source GitHub project. Download the data and save it to data/raw/china/DXYArea.csv.
- For data before January 24, 2020, we manually collected data, the file is in data/raw/china/china_city_health_jan.xlsx.
- Download the March 12 file update for the number of confirmed cases per région from the French government’s website and save it to data/raw/france/fr-sars-cov-2-20200312.xlsx. This file only gets updated every 1-5 days, so we augment it with data scraped daily from a live website through March 25, 2020. At this point, the live website stopped reporting daily infections, and we're currently working to figure out if this periodically updated site will continue to produce updates.
stata -b do code/data/france/format_infected.do: Run in Stata to clean and format the French regional epidemiological dataset, set at the beginning the last sample date. Default is March 18th.
Copy all of the date lines from the "New COVID-19 cases in Iran by province" table on the Wikipedia page tracking this outbreak in Iran.
Open the excel file data/raw/iran/covid_iran.xlsx. This file contains the first step cleaning template for the cases data. The tabs in this file are:
200314_cases_raw: A template into which raw data from the Wikipedia table -- see (1) -- should be pasted.
cases_cleaned--to_csv: A cleaned column format for the intermediate cases data. Simply extend the formulas (by copy/paste) in each row so that each row of the raw data is included. Do not change the column headings. Once all raw data has been included, save this tab to a csv file and save in data/interim/iran/covid_iran_cases.csv.
Epi data is downloaded and merged with policy data in one step, described in the following section
- Korean epi data were manually collected from various Korean provincial websites. Note that these provinces often report the data in different formats (e.g. pdf attachments, interactive dashboards) and usually do not have English translations. This data is saved in data/interim/korea/KOR_health.csv.
Rscript code/data/usa/download_and_clean_usafacts.R: Downloads county- and state-level data from usafacts.org
Merge all data for each country
Run the following scripts to merge epi, policy, testing, and population data for each country. After completion, you may run code/data/multi_country/quality-check-processed-datasets.py, to make sure all of the fully processed datasets are correctly and consistently formatted.
Include policies implied by other policies
stata -b do code/data/france/format_policy.do
python code/data/usa/merge_policy_and_cases.py: Merge all US data. This outputs data/processed/adm1/USA_processed.csv.
Regression model estimation
Once data is obtained and processed, you can estimate regression models for each country using the following command:
stata -b do code/models/alt_growth_rates/MASTER_run_all_reg.do
You can optionally pass an integer command line argument if you would like to include bootstrapped estimates of the r2 vs. fixed lag length analysis (ED Fig 5b).
Each of the individual country regressions are also available to be run within code/models/alt_growth_rates.
SIR model projections
Once the regression coefficients have been estimated in the above models, run the following code to generate projections of active and cumulative infections using an SIR model:
python code/models/get_gamma.py: Estimate removal rate to use in projections from data that contains both cumulative cases and active cases.
Rscript code/models/run_all_CB_simulations.R: Generate the csv inputs for Figure 4.
Rscript code/models/save_underlying_projection_output.R: Output the raw projection output if you wish to examine the underlying output.
To generate the four figures in the paper, run the following scripts. Figure 1 only requires the data collection steps to be complete. Figures 2 and 3 require the regression step to be complete, and Figure 4 requires the projection step to be complete and to have previously run the code for Figure 1. Each of the Extended Data Figures and Supplementary Information Tables may require different steps of the analysis to be finalized.
Rscript code/plotting/fig1.R: Generate 12 outputs that constitute Figure 1 (
*_map.pdffor each of the 6 countries). Note: This script requires data/raw/china/match_china_city_name_w_adm2.csv, a manually generated crosswalk of Chinese city names.
python code/plotting/aggregate_fig1_source_data.py: Combine csv's into the "source_data" excel file.
Rscript code/plotting/fig2.R: Generate 3 outputs that constitute Figure 2, in
- Panel A:
- Panel B:
- Panel C:
Figure 3 is generated by the regression estimation step (
Note that the outputs of code/plotting/fig1.R are required for Fig 4 as well.
- (if not already generated)
Rscript code/plotting/fig1.R: Generate the cases data
python code/plotting/gen_fig4.py: Generate Figure 4.
python code/plotting/fig4_analysis.py: Generate a printout of numerical results from the projections for each country.
Extended Data Figure 1
Rscript code/data/korea/make_JHU_comparison_data.R: Create data/interim/korea/KOR_JHU_data_comparison.csv
python code/plotting/figED1.py: Generate 2 outputs that constitute ED Figure 1 (
Extended Data Figure 2
Rscript code/plotting/figED2.R: Generate ED Figure 2 (
Extended Data Figures 3-4
stata -b do code/plotting/extended_data_fig3_4.do: Generate a csv with point estimate and standard errors used to plot a mock PDF version of ED Figures 3 and 4. The final version in printed document are designed using Adobe Illustrator.
Extended Data Figure 5
- Panel a is part of the regression estimation step (
- Panels b and c are directly generated by
stata -b do code/plotting/extended_data_fig5.dousing result of the regression estimation step (
Extended Data Figure 6
stata -b do code/models/alt_growth_rates/disaggregated_policies/MASTER_run_all_reg_disag.do
Extended Data Figure 7
Extended Data Figures 8 and 9
papermill code/notebooks/simulate-and-regress.ipynb code/notebooks/simulate-and-regress-log.ipynb -k gpl-covid: Run Monte Carlo simulations of synthetic outbreaks
python code/plotting/sims.py results/other/sims/measNoise_0.05_betaNoise_Exp_gammaNoise_0.01_sigmaNoise_0.03 results/figures/appendix/sims --source-dir "results/source_data/ExtendedDataFigure89.csv": Create figures
Extended Data Figure 10
ED Figure 10 is generated by the regression estimation step (
code/models/alt_growth_rates/MASTER_run_all_reg.do). The final output file is
Supplementary Information Table 1
Supplementary Information Table 2
This table is not generated programatically.
Supplementary Information Table 3
This table is generated by the regression estimation step (
Supplementary Information Table 4
This table is generated by the regression estimation step using disaggregated policy variables (
Supplementary Information Table 5
This table is generated by the regression estimation step (
We welcome potential collaborators and contributors from the wider community. Please see CONTRIBUTING.md for more details.