
Forecast Data Format

Johannes Bracher edited this page Sep 5, 2020 · 12 revisions

This is an adapted version of materials provided by the US COVID-19 Forecast Hub under the MIT license (https://github.com/reichlab/covid19-forecast-hub#data-model)

Naming of files

Forecasts for different target types (deaths, ICU use, cases) must be stored in separate files, one per type. The respective naming conventions are described here.

Structure of files

Teams are asked to provide their forecasts in a quantile-based format (although we also accept submissions containing only point forecasts). We use a general data model developed for the US COVID-19 Forecast Hub. The tabular version of the data model is a simple, long-form data format with seven required columns and one optional column:

You can find a template here or consult existing forecasts in the data-processed folder. Files must contain the following variables:

forecast_date

The date on which the submitted forecast data was made available, in YYYY-MM-DD format. This will typically be the date on which the model finishes running and produces the standard formatted file. forecast_date should match the date in the filename; it is redundant with it, but is included here at the request of some analysts. We will enforce that the forecast_date for a file is either the date on which the file was submitted to the repository or the previous day. Exceptions will be made for legitimate extenuating circumstances.
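The rule on forecast_date can be sketched as a small check. This is a minimal illustration rather than the hub's actual validation code; the function name and the idea of passing the submission date explicitly are assumptions made for the example.

```python
from datetime import date, datetime, timedelta


def forecast_date_is_acceptable(forecast_date: str, submission_date: date) -> bool:
    """Return True if forecast_date is the submission date or the day before.

    forecast_date is the string from the file (expected as YYYY-MM-DD);
    submission_date is the day the file reached the repository.
    """
    try:
        parsed = datetime.strptime(forecast_date, "%Y-%m-%d").date()
    except ValueError:
        # Not parseable as a year-month-day date.
        return False
    return parsed in (submission_date, submission_date - timedelta(days=1))
```

For a file submitted on 2020-04-20, the values "2020-04-20" and "2020-04-19" would pass, while "2020-04-18" would need an exception.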

target

Values in the target column must be character strings and must match one of the following targets (details on the definition of the targets can be found here). Remember that death, case and ICU forecasts need to be stored in separate files.

Death forecasts:

  • "N day ahead cum death" where N is a number between -1 and 130
  • "N day ahead inc death" where N is a number between -1 and 130
  • "N wk ahead cum death" where N is a number between -1 and 20
  • "N wk ahead inc death" where N is a number between -1 and 20

Case forecasts (to be added soon):

  • "N day ahead cum case" where N is a number between -1 and 130
  • "N day ahead inc case" where N is a number between -1 and 130
  • "N wk ahead cum case" where N is a number between -1 and 20
  • "N wk ahead inc case" where N is a number between -1 and 20

ICU forecasts:

  • "N day ahead curr ICU" where N is a number between -1 and 130
  • "N day ahead curr ventilated" where N is a number between -1 and 130
  • "N wk ahead curr ICU" where N is a number between -1 and 20
  • "N wk ahead curr ventilated" where N is a number between -1 and 20

Additional targets will be added after further consultation with interested teams.
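The target strings listed above follow a regular pattern, so they can be checked mechanically. The sketch below assumes that N is an integer and that the stated bounds are inclusive; the names `TARGET_PATTERN`, `MAX_HORIZON` and `is_valid_target` are made up for this example and are not part of the hub's tooling.

```python
import re

# "<N> <day|wk> ahead <quantity>", with N either -1 or a non-negative integer.
TARGET_PATTERN = re.compile(
    r"^(-1|0|[1-9]\d*) (day|wk) ahead "
    r"(cum death|inc death|cum case|inc case|curr ICU|curr ventilated)$"
)

# Largest permitted horizon per time unit (assumed inclusive).
MAX_HORIZON = {"day": 130, "wk": 20}


def is_valid_target(target: str) -> bool:
    """Check a target string against the listed patterns and horizon limits."""
    match = TARGET_PATTERN.match(target)
    if match is None:
        return False
    horizon = int(match.group(1))
    unit = match.group(2)
    return -1 <= horizon <= MAX_HORIZON[unit]
```

Note that this only checks individual strings; it does not check that a file contains targets of a single type (death, case or ICU).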

target_end_date

The date corresponding to the end of the target period, in YYYY-MM-DD format. For example, if the target is "1 wk ahead inc death" and the forecast is submitted on Monday 2020-04-20, then this field should be 2020-04-25, the Saturday that ends the current week. See details about the handling of target_end_date for week-ahead targets in the overview of forecast targets.
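As an illustration, the Monday example can be reproduced in a few lines. The sketch rests on assumptions that are not spelled out on this page: weeks run from Sunday to Saturday, forecasts made on a Sunday or Monday treat the current week as "1 wk ahead", and forecasts made from Tuesday to Saturday shift by one week (the convention used by the US hub). The overview of forecast targets remains the authoritative reference, and the function name is hypothetical.

```python
from datetime import date, timedelta


def week_ahead_end_date(forecast_date: date, n_weeks: int) -> date:
    """Saturday ending the 'n_weeks wk ahead' target week (assumed convention)."""
    # Python's weekday(): Monday = 0, ..., Saturday = 5, Sunday = 6.
    weekday = forecast_date.weekday()
    days_to_saturday = (5 - weekday) % 7
    first_saturday = forecast_date + timedelta(days=days_to_saturday)
    # Assumption: from Tuesday to Saturday, "1 wk ahead" is the following week.
    if weekday not in (6, 0):
        first_saturday += timedelta(days=7)
    return first_saturday + timedelta(weeks=n_weeks - 1)


# Monday 2020-04-20, "1 wk ahead" ends on Saturday 2020-04-25.
print(week_ahead_end_date(date(2020, 4, 20), 1))
```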

location

A unique id for the location. We use standardized FIPS codes, as in the US hub. For more information see here, or use our csv file mapping the names of German Länder to FIPS codes. For Polish voivodeships see here or this csv file.

location_name (optional)

Location name in a human-readable form. Note that we do not use this variable in our code and treat location as the authoritative variable.

type

One of "point", "quantile" or "observed". Note that "observed" is not a permitted value in the US Forecast Hub. We added it to be able to store the last two observed values along with the forecasts (e.g. with target = "-1 wk ahead inc death" and target = "0 wk ahead inc death"). This can help identify differences in ground truth values between different forecasts, which may be the reason for systematic shifts between them.

quantile

A value between 0 and 1 (inclusive), stating which quantile is displayed in this row. If type == "point" or type == "observed", then NA. We encourage all groups to make available the following 23 quantiles for each distribution: 0.01, 0.025, 0.05, 0.1, ..., 0.95, 0.975, 0.99 (that is, 0.01 and 0.025, then 0.05 to 0.95 in steps of 0.05, then 0.975 and 0.99).
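The recommended quantile levels can be generated rather than typed out. The sketch assumes that the elided middle of the list runs from 0.05 to 0.95 in steps of 0.05, which is consistent with the stated total of 23; the name `QUANTILE_LEVELS` is chosen for this example only.

```python
# Two lower-tail levels, 19 levels from 0.05 to 0.95, two upper-tail levels.
QUANTILE_LEVELS = (
    [0.01, 0.025]
    + [round(0.05 * k, 2) for k in range(1, 20)]  # rounding avoids 0.15000000000000002
    + [0.975, 0.99]
)

print(len(QUANTILE_LEVELS))  # 23
```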

value

A numeric value. For rows with type "quantile", it is the value of the quantile function evaluated at the probability specified in quantile.

For example, if quantile is 0.3 and value is 10, then this row is saying that the 30th percentile of the distribution is 10. If type is "point" and value is 15, then this row is saying that the point estimate from this model is 15. If type is "observed" and value is 15, then this row is saying that the respective value has already been observed and is 15.
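To tie the columns together, here is a small sketch that reads a file in this format and checks the relationship between type, quantile and value. The sample rows are hypothetical and only mirror the examples in the text; the location code "GM", the literal string "NA" for missing quantiles, and all function names are assumptions made for illustration, not the hub's validation code.

```python
import csv
import io

REQUIRED_COLUMNS = {
    "forecast_date", "target", "target_end_date",
    "location", "type", "quantile", "value",
}

# Hypothetical rows for illustration only; the numbers are not real forecasts.
SAMPLE_CSV = """forecast_date,target,target_end_date,location,type,quantile,value
2020-04-20,1 wk ahead inc death,2020-04-25,GM,point,NA,15
2020-04-20,1 wk ahead inc death,2020-04-25,GM,quantile,0.3,10
2020-04-20,0 wk ahead inc death,2020-04-18,GM,observed,NA,15
"""


def row_problems(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    if row["type"] in ("point", "observed"):
        if row["quantile"] != "NA":
            problems.append("quantile must be NA for point/observed rows")
    elif row["type"] == "quantile":
        try:
            level = float(row["quantile"])
        except ValueError:
            level = None
        if level is None or not 0.0 <= level <= 1.0:
            problems.append("quantile must be a number between 0 and 1")
    else:
        problems.append("unknown type: " + row["type"])
    try:
        float(row["value"])
    except ValueError:
        problems.append("value must be numeric")
    return problems


def file_problems(text):
    """Check required columns, then every data row of a CSV given as a string."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return ["missing columns: " + ", ".join(sorted(missing))]
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        problems.extend(f"line {line_number}: {p}" for p in row_problems(row))
    return problems


print(file_problems(SAMPLE_CSV))  # [] for the sample above
```

location_name is left out of `REQUIRED_COLUMNS` because it is optional; a file that includes it would pass unchanged.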