Extract COVID-19 age-specific mortality counts in U.S states and metropolitan areas

ImperialCollegeLondon/US-covid19-agespecific-mortality-data

Age-specific COVID-19 mortality data in the United-States

Updates

  • 2021-04-02 This repository is no longer maintained. What does this mean?
  1. Some of the states, which required manual extraction, will no longer be updated.
  2. The automatic extraction, for the rest of the states, will continue to be run daily and updated in the update-data branch. However, we do not guarantee the accuracy of these data, as they are no longer checked.
  3. The processed data, gathering all sources across the states, is no longer updated.
  • 2021-01-28 Version 1 release. This is the release related to our upcoming peer-reviewed age paper, where we use age-specific mobility data to estimate the epidemic in the USA by accounting for age-specific heterogeneity.

One may directly get:

  1. the age-specific mortality data used in the paper here
  2. the crude estimates of the COVID-19 cases and mortality across common age strata here

Data

The latest update of the age-specific mortality data by date, age and location is in

data/processed/latest/DeathsByAge_US.csv

We aim to update the data at least once a week. The data set currently includes 44 U.S. states and 2 metropolitan areas. The locations are listed in the table below.

Usage

Docker

The easiest way to get a reproducible setup is to use Docker. A Dockerfile is included in the repository.

Run:

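# Install Docker on your machine first. The command below is for Linux;
# on macOS you can use something like brew.
sudo apt-get install docker
# Build the image and start a container in the background,
# sharing the current folder with the container at /code.
docker build -t usaage .
docker run --rm -t -d --name usaage_container -v "$(pwd)":/code usaage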

This will keep a docker container running in the background, which you can inspect using docker ps.

All development can now be done in the container, and you can edit the code locally as usual (changes are synced to the container because the -v flag shares the folder). For convenience, you may want to use Remote-SSH in VS Code. You can also attach a shell to the container using docker exec -it usaage_container /bin/bash

You can check that everything works by running make all in the container.

Structure Overview

The code is divided into 2 parts: First, the extraction of the COVID-19 mortality counts data from Department of Health websites. Second, the processing of the extracted data to create a complete time series of age-specific COVID-19 mortality counts for every location.

Dependencies

Data extraction

  • Python version >= 3.6.1
  • Python libraries:
fitz
PyMuPDF
pandas
pyjson
beautifulsoup4
requests
selenium

Data processing

  • R version >= 4.0.2
  • R libraries:
data.table
ggplot2 
scales
gridExtra
tidyverse
rjson
readxl
reshape2

1. Data extraction

To extract, run

$ make files

This will get you the latest data in data/$DATE.

2. Data processing

To process, run

$ Rscript scripts/process.data.R

This will get you a csv file for every state with variables age, date, daily.deaths and (state) code in data/processed/$DATE/.
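As a quick illustration of how such a per-state file could be consumed, here is a minimal Python sketch that sums daily.deaths by age band. The column names follow the description above; the sample rows, the age bands and the state code XX are made up for the example, and the actual files may contain additional columns.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows. Only the column names (age, date, daily.deaths, code)
# come from the description above; the values are illustrative.
SAMPLE_CSV = """age,date,daily.deaths,code
0-49,2020-06-01,1,XX
50+,2020-06-01,4,XX
0-49,2020-06-02,0,XX
50+,2020-06-02,3,XX
"""


def total_deaths_by_age(csv_file):
    """Sum daily.deaths over all dates for each age band."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        # float() because imputed daily counts may not be whole numbers
        totals[row["age"]] += float(row["daily.deaths"])
    return dict(totals)


totals = total_deaths_by_age(io.StringIO(SAMPLE_CSV))
print(totals)
```

To read a real file, replace io.StringIO(SAMPLE_CSV) with an open file handle, for example open(path, newline="").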

More details about the data extraction

The main entry point is make files.

Scripts

make files executes the files task in the Makefile, which currently consists only of the script ./download_files.sh. This script performs the following steps:

  1. Set a date, $date, in the local environment
  2. Create new folders in data and pdfs for the $date.
  3. Run the following scripts:
    • scripts/age_extraction.py to extract the locations for which data are available in CSV, XLSX or JSON format.
    • a series of GET requests to the web API. They download CSVs made available by the DoH directly.
    • scripts/extraction_try.py, which downloads data that are in webpage, XLSX or PDF format.
    • python scripts/get_nm.py to get New Mexico data.
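Steps 1 and 2 above can be sketched in Python as follows. This is not the content of download_files.sh, which is a shell script; the function name make_dated_folders and the ISO date format (YYYY-MM-DD) are assumptions made for illustration.

```python
import datetime
import pathlib
import tempfile


def make_dated_folders(root, date=None):
    """Create <root>/data/<date> and <root>/pdfs/<date>; return both paths.

    `date` defaults to today's date in ISO format (an assumed convention).
    """
    date = date or datetime.date.today().isoformat()
    created = []
    for parent in ("data", "pdfs"):
        folder = pathlib.Path(root) / parent / date
        folder.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        created.append(folder)
    return created


# Demonstrate in a temporary directory so nothing is written to the repository.
with tempfile.TemporaryDirectory() as tmp:
    folders = make_dated_folders(tmp, "2020-06-01")
    folder_names = [f.relative_to(tmp).as_posix() for f in folders]
    all_exist = all(f.is_dir() for f in folders)

print(folder_names)
```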

General procedure

Depending on the data format made available by the DoH, we do the following:

PDFs: We use fitz in order to read data within PDFs and save them to JSON or CSV format.

CSVs, XLSX, JSON: We download the data directly.

Static Webpages (HTML): We save the HTML and extract the data using BeautifulSoup, and save them in JSON format.
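The static-webpage case can be illustrated with a small sketch. To stay free of third-party dependencies it uses Python's standard-library html.parser rather than BeautifulSoup, so it shows the general idea and not the repository's code. The HTML snippet, the age bands and the counts are made up.

```python
import json
from html.parser import HTMLParser

# Made-up page fragment standing in for a saved DoH webpage.
SAMPLE_HTML = """
<table>
<tr><th>Age group</th><th>Deaths</th></tr>
<tr><td>0-49</td><td>12</td></tr>
<tr><td>50+</td><td>340</td></tr>
</table>
"""


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableCells()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows
records = {age: int(deaths) for age, deaths in body}
as_json = json.dumps(records)  # this string is what would be written to disk
print(as_json)
```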

Dynamic Webpages (Dashboard): We use selenium to render the webpage and switch to the right page. Then, if the data are stored in the source code, we find their path or CSS selector, extract them and save them in JSON format. Otherwise, if the webpage can be saved as a PDF, we use BeautifulSoup to download the webpage in PDF format and fitz to extract the data from the PDF. If neither of these options works, we take a screenshot of the webpage and extract the data manually.

Screenshots/PNGs: We use these to record the data published on the dynamic webpages.

More details about the data processing

Procedure

Pre-processing adjustments

We reconstruct time series for every location and age band, therefore all extracted data need to have the same age bands. If the DoH changes the reported age bands at time $t$ and,

  • the old age bands can be used to find the new age bands, then, before processing, we compute the mortality counts by the old age bands for all data from $t$ onward.
  • the old age bands cannot be used to find the new age bands, then we truncate the time series: $t$ becomes the first day of the time series and all data extracted before $t$ are ignored.
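The first case can be sketched as follows, assuming the new age bands are finer than the old ones and nest inside them, so that counts can simply be summed. The band labels, the mapping and the counts are hypothetical.

```python
def to_old_bands(new_counts, new_to_old):
    """Aggregate counts reported by new age bands back into the old bands.

    new_counts: dict mapping new band label -> cumulative deaths
    new_to_old: dict mapping new band label -> old band label containing it
    """
    old_counts = {}
    for band, count in new_counts.items():
        old_band = new_to_old[band]  # KeyError if a band has no mapping
        old_counts[old_band] = old_counts.get(old_band, 0) + count
    return old_counts


# Hypothetical example: the DoH used to report 0-49 and 50+,
# and from time t reports four finer bands.
new_to_old = {"0-29": "0-49", "30-49": "0-49", "50-69": "50+", "70+": "50+"}
new_counts = {"0-29": 3, "30-49": 12, "50-69": 40, "70+": 80}

old_counts = to_old_bands(new_counts, new_to_old)
print(old_counts)
```

If the new bands cut across the old ones, no such mapping exists, which corresponds to the second case, where the time series is truncated.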

Processing stages

  1. Read the data

    • If a complete time series of the age-specific COVID-19 attributable death burden is available

      • Use only the most recent data available
      • Every state has its own processing function depending on the data format
    • If daily snapshots of the age-specific COVID-19 attributable death burden are available

      • Use all data ever extracted
      • If CSV or XLSX: the state has its own processing function
      • If JSON: common processing function
  2. Ensure that the cumulative mortality counts are non-decreasing

    • Some DoH updates indicated a decreasing mortality count from one day to the next.
    • In this case, we set the mortality count on the earlier day to match the mortality count on the more recent day.
  3. Find daily deaths

    • Some days had missing data, usually because no updates were reported, because the webpage failed, or because the URL of the website had changed.
    • The missing daily mortality counts were imputed, assuming a constant increase in daily mortality count between days with data.
  4. Check that the reconstructed cumulative deaths on the last day match the ones reported in the latest data.

The script that acts as a spine for these four stages is utils/obtain.data.R. Functions for stage 1 are in utils/read.daily-historical.data.R and utils/read.json.data.R. Functions for stages 2 and 3 are in utils/summary_functions.R. The function for stage 4 is in utils/sanity.check.processed.data.R.
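Stages 2 to 4 can be illustrated for a single location and age band with the Python sketch below. It is not a translation of the R functions named above. It assumes that the input is a list of cumulative counts with None for missing days, that the first and last days are observed, and that "constant increase" means linear interpolation of the cumulative count between observed days. The numbers are made up.

```python
import math


def enforce_non_decreasing(cumulative):
    """Stage 2: where a count exceeds a later one, lower it to the later value."""
    fixed = list(cumulative)
    upper = None  # smallest observed count seen so far, scanning backwards
    for i in range(len(fixed) - 1, -1, -1):
        if fixed[i] is None:
            continue
        if upper is not None and fixed[i] > upper:
            fixed[i] = upper
        upper = fixed[i]
    return fixed


def impute_missing(cumulative):
    """Stage 3a: fill missing days by linear interpolation between observed days."""
    filled = list(cumulative)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def daily_deaths(cumulative):
    """Stage 3b: daily deaths are the day-to-day differences."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


# Made-up cumulative counts: day 2 reports 12, day 3 reports 11 (a decrease),
# and days 4 and 5 are missing.
reported = [10, 12, 11, None, None, 17]

fixed = enforce_non_decreasing(reported)
filled = impute_missing(fixed)
daily = daily_deaths(filled)

# Stage 4: the reconstructed total must match the last reported count.
check = math.isclose(filled[0] + sum(daily), reported[-1])
print(fixed, filled, daily, check)
```

Interpolated values are kept as floats here; whether and how to round them to whole deaths is a separate design choice that this sketch does not address.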

Post-processing adjustments

After reconstructing the time series, we make final adjustments for analysis:

  1. Modify the age band boundaries from the ones declared by the Department of Health so that they reflect the closest age bands in the set A = { [0-4], [5-9], ..., [75-79], [80-84], [85+] }. For example, age band [0-17] becomes [0-19] and age band [61-65].

  2. Keep only days that match closely with JHU overall mortality counts.

Both data sets, adjusted and unadjusted, are available: DeathsByAge_US_adj.csv and DeathsByAge_US.csv.
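One possible reading of the first adjustment is to snap each boundary to the nearest multiple of 5. The sketch below implements that reading. The rounding rule and the function closest_band are assumptions; they reproduce the [0-17] to [0-19] example but are not taken from the repository's code.

```python
def closest_band(lower, upper=None, width=5, top=85):
    """Snap an age band to boundaries that are multiples of `width`.

    upper=None denotes an open-ended band such as 80+.
    Bands starting at or above `top` collapse into "[85+]".
    """
    def snap(age):
        # nearest multiple of `width`, using integer arithmetic
        return width * ((age + width // 2) // width)

    lo = min(snap(lower), top)
    if upper is None or lo >= top:
        return f"[{lo}+]"
    hi = snap(upper + 1)  # exclusive upper boundary
    if hi > top:
        return f"[{lo}+]"
    if hi <= lo:  # very narrow band: keep at least one 5-year band
        hi = lo + width
    return f"[{lo}-{hi - 1}]"


band_a = closest_band(0, 17)
band_b = closest_band(18, 44)
band_c = closest_band(85)
print(band_a, band_b, band_c)
```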

Data source

This table includes a complete list of all sources ever used in the data set. We acknowledge and are grateful to U.S. state Departments of Health for making the primary data available at the following sources:

| State | Date record start | Link(s) | Notes |
|---|---|---|---|
| Alabama | 2020-05-03 | link | dashboard updated daily and replaced; no historical archive |
| Alaska | 2020-06-09 | link | metadata updated daily and replaced; no historical archive |
| Arizona | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| California | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Colorado | 2020-03-23 | (1) link until 2020-08-20, (2) link since 2020-08-20 | (1) metadata updated daily; full time series; died in 2020-08-20; (2) dashboard updated daily and replaced; no historical archive |
| Connecticut | 2020-04-05 | link | metadata updated daily; full time series |
| Delaware | 2020-05-12 | link | dashboard updated daily and replaced; no historical archive |
| District of Columbia | 2020-04-13 | link | metadata updated daily; full time series |
| Florida | 2020-03-27 | link | daily report; with historical archive |
| Hawaii | 2020-09-18 | link | dashboard updated weekly and replaced |
| Georgia | 2020-04-27 | link | metadata updated daily and replaced; no historical archive |
| Idaho | 2020-05-13 | (1) link, (2) link | dashboard updated daily and replaced; no historical archive; (1) died on 2020-09-04 |
| Illinois | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
| Indiana | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Iowa | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Kansas | 2020-05-13 | link | dashboard updated Monday, Wednesday and Friday, and replaced; no historical archive |
| Kentucky | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Louisiana | 2020-05-12 | link | dashboard updated daily except on Saturday and replaced; no historical archive |
| Maine | 2020-03-12 | link | metadata updated daily; full time series |
| Maryland | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
| Massachusetts | 2020-04-20 | link until 2020-08-11 and link since | (1) daily report, with historical archive; (2) weekly report, with historical archive |
| Michigan | 2020-03-21 | (1) data/req/michigan weekly.csv and (2) link | (1) data requested to the DoH (2) dashboard updated daily and replaced; no historical archive |
| Minnesota | 2020-05-21 | link | weekly report, with historical archive |
| Mississippi | 2020-04-27 | link | dashboard updated daily and replaced; no historical archive |
| Missouri | 2020-05-13 | (1) link and (2) link | dashboard updated daily and replaced; no historical archive |
| Nevada | 2020-06-07 | link | dashboard updated daily and replaced; no historical archive |
| New Hampshire | 2020-06-07 | (1) link until 2021-01-08, and (2) link since 2021-01-08 | dashboard updated daily and replaced; no historical archive |
| New Jersey | 2020-05-25 | link | dashboard updated daily and replaced; no historical archive |
| New Mexico | 2020-05-25 | link | daily written report; with history archive |
| New York City | 2020-04-14 | link, link since 2020-05-18, link since 2020-11-08 | report / csv updated daily, with history archive |
| North Carolina | 2020-05-20 | link | dashboard updated daily and replaced; no historical archive |
| North Dakota | 2020-05-14 | link | dashboard updated daily and replaced; no historical archive |
| Oklahoma | 2020-05-13 | link | dashboard updated daily and replaced; no historical archive |
| Oregon | 2020-06-05 | link | dashboard updated on Monday-Friday and sometimes on Saturday and replaced; no historical archive |
| Pennsylvania | 2020-06-07 | (1) link and (2) link | dashboard updated daily and replaced; no historical archive |
| Rhode Island | 2020-06-01 | link | metadata updated weekly and replaced; no historical archive |
| South Carolina | 2020-05-14 | link | dashboard updated on Tuesday and Friday; no historical archive |
| Tennessee | 2020-04-09 | link | metadata updated daily; full time series |
| Texas | 2020-05-06 | (1) link until 2020-09-24, (2) link since 2020-09-24 | metadata updated daily and replaced; no historical archive |
| Utah | 2020-06-17 | link | dashboard updated daily and replaced; no historical archive |
| Vermont | 2020-05-13 | (1) link until 2020-09-03, (2) link since 2020-09-03 | dashboard updated daily and replaced; no historical archive; (1) does not report mortality by age since 2020-09-03 |
| Virginia | 2020-04-21 | link | metadata updated daily; full time series |
| Washington | 2020-06-08 | link | dashboard updated daily and replaced; no historical archive |
| Wisconsin | 2020-03-15 | (1) link until 2020-10-19, (2) link since 2020-10-19 | metadata updated daily; full time series |
| Wyoming | 2020-09-22 | link | dashboard updated daily and replaced; no historical archive |

About

Maintainers and Contributors

Active maintainers (alphabetically)

  • Yu Chen - Department of Mathematics, Imperial College London
  • Michael Hutchinson - Department of Statistics, Oxford
  • Vidoushee Jogarah - Mary Lister McCammon Fellow, Department of Mathematics, Imperial College London
  • Mélodie Monod - Department of Mathematics, Imperial College London
  • Oliver Ratmann - Department of Mathematics, Imperial College London
  • Harrison Zhu - Department of Mathematics, Imperial College London

Contributors

Licence

This data set is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) by Imperial College London on behalf of its COVID-19 Response Team. Copyright Imperial College London 2020.

Warranty

Imperial makes no representation or warranty about the accuracy or completeness of the data nor that the results will not constitute an infringement of third-party rights. Imperial accepts no liability or responsibility for any use which may be made of any results, for the results, nor for any reliance which may be placed on any such work or results.

Cite

Attribute the data as the "COVID-19 Age specific Mortality Data Repository by the Imperial College London COVID-19 Response Team", and the URLs specified above.

Acknowledgements

We acknowledge the support of the EPSRC through the EPSRC Centre for Doctoral Training in Modern Statistics and Statistical Machine Learning at Imperial and Oxford.

Funding

This research was partly funded by the Imperial College COVID-19 Research Fund.