# Data collected from the CovidHub

https://covid19forecasthub.org/doc/ensemble/
## About the Data
In the Covid-Hub, models were built for cases and deaths, with forecasts made per state
and then aggregated to the national level. In this notebook we use the national-level (US) data.

The observed (ground truth) Covid data are in the `covid_data` dataframe,
while `forecast` holds the forecasting models (the ensembles and the component models used in the ensemble).

Forecasts may be quantiles or point predictions, as indicated by the `type` column. Here we use only the quantiles,
which are the values used to build the ensemble.
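
As a small illustration of how the seven quantile levels used in this notebook pair up into central prediction intervals, here is a minimal sketch in plain Python. The helper name `central_intervals` is hypothetical and not part of the notebook.

```python
# Quantile levels that appear in the forecast table of this notebook.
taus = [0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975]

def central_intervals(levels):
    """Pair each lower level with its mirrored upper level (hypothetical helper)."""
    levels = sorted(levels)
    pairs = {}
    for lower, upper in zip(levels, reversed(levels)):
        if lower >= upper:  # reached the median, nothing left to pair
            break
        coverage = round(upper - lower, 3)  # nominal coverage of the interval
        pairs[coverage] = (lower, upper)
    return pairs

intervals = central_intervals(taus)
for coverage, (lower, upper) in intervals.items():
    print(f"{coverage:.0%} interval: quantiles {lower} and {upper}")
```

The three pairs correspond to the `lower_95`/`upper_95`, `lower_80`/`upper_80` and `lower_50`/`upper_50` column names used later in the notebook, with the 0.5 quantile left over as the median.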

### Models

The `forecast` table contains several models.
#### COVIDhub-4_week_ensemble

"This ensemble produces forecasts of incident cases (discontinued as of February 2023), incident deaths, and cumulative deaths (discontinued as of March 2023) at horizons of 1 through 4 weeks ahead, and forecasts of incident hospitalizations at horizons of 1 through 28 days ahead. For all of these targets, the ensemble forecasts are computed as the equally-weighted median of all component forecasts at each location, forecast horizon, and quantile level."

In [1]:
import pandas as pd
import polars as pl
import polars.selectors as cs

import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import scoringrules as sr
from mosqlient.scoring import compute_wis

from lets_plot import *
LetsPlot.setup_html()

#### Ground Truth

The ground truth data are simple: the table has `value` for the observed values and `target_end_date` for the date
of the measurement.

In [2]:
covid_data = pl.read_parquet('./data/covid_truth_data.parquet')
covid_data = covid_data.with_columns(
    pl.lit('ground_truth').alias('type'),
    pl.col('target_end_date').cast(pl.Datetime),
    pl.col('target_end_date').dt.weekday().alias('target_day_of_week')
)


In [3]:
(
    ggplot(data=covid_data) +
    geom_line(aes(x='target_end_date', y='value')) +
    geom_point(aes(x='target_end_date', y='value')) +
    ggsize(1400, 400)+
    theme_bw()
)

#### Forecasts

The forecast table holds the models and the predicted value for a given `horizon` and a given `quantile`.
Each forecast has a `forecast_date`, the date on which the forecast was submitted, and a `target_end_date`,
the date being predicted (for example, with a `forecast_date` of 2020-01-01 I predict the number
of cases for the `target_end_date` 2020-01-08).
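
The relationship between `forecast_date`, `horizon` and `target_end_date` can be sketched with the standard library. This is an assumption inferred from the rows displayed in this notebook, where forecasts dated on a Sunday or Monday target the following Saturdays. The function name `target_end_date_for` is hypothetical, and other submission weekdays are deliberately not handled.

```python
from datetime import date, timedelta

def target_end_date_for(forecast_date: date, horizon: int) -> date:
    """Saturday that closes the `horizon`-th week ahead (illustrative sketch only)."""
    # date.weekday(): Monday=0, ..., Saturday=5, Sunday=6
    if forecast_date.weekday() not in (6, 0):
        raise ValueError("sketch only covers Sunday or Monday forecast dates")
    days_to_saturday = (5 - forecast_date.weekday()) % 7
    return forecast_date + timedelta(days=days_to_saturday + 7 * (horizon - 1))

# Dates taken from rows displayed in this notebook
first = target_end_date_for(date(2020, 10, 18), 1)   # Sunday submission, 1 week ahead
second = target_end_date_for(date(2022, 3, 14), 4)   # Monday submission, 4 weeks ahead
print(first, second)
```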

There are many models, but only the 'primary' models are used in the ensemble.
In addition, some models were submitted more than once, so they have
two `forecast_date` values for the same quantile, horizon and `target_end_date`. The reason may have been
a correction or a submission error. Either way, the ensemble uses the latest submission.
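
A minimal sketch of the "keep the latest submission" rule, written in plain Python rather than Polars so that it stays self-contained. The model names and values are made up, and `keep_latest` is a hypothetical helper.

```python
from datetime import date

# Toy submissions: (model, quantile, target_end_date, horizon, forecast_date, value)
# All values below are made up for illustration.
submissions = [
    ("model_a", 0.5, date(2020, 10, 24), "1", date(2020, 10, 18), 100.0),
    ("model_a", 0.5, date(2020, 10, 24), "1", date(2020, 10, 19), 120.0),  # resubmission
    ("model_b", 0.5, date(2020, 10, 24), "1", date(2020, 10, 19), 90.0),
]

def keep_latest(rows):
    """Keep, for every (model, quantile, target_end_date, horizon), the newest submission."""
    latest = {}
    for row in rows:
        key, forecast_date = row[:4], row[4]
        if key not in latest or forecast_date > latest[key][4]:
            latest[key] = row
    return list(latest.values())

deduplicated = keep_latest(submissions)
for row in deduplicated:
    print(row)
```

Only the resubmitted `model_a` row survives for its key, which mirrors the group-by on `forecast_date` maximum used when the forecasts are loaded.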

Accordingly, our `forecast` table is meant to be restricted to `forecast_date` values that fall on a Monday (that filter is commented out in the loading cell of this notebook). We also drop the models that do not appear in the article, using as the criterion the model names listed on page $5$ of the supplementary material (*Supplemental Materials for Comparing trained and untrained
probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States*: https://ars.els-cdn.com/content/image/1-s2.0-S0169207022000966-mmc1.pdf).

Note that not all of the models listed there were present in the data, so we have $34$ models, $6$ fewer than the article (which reports using $40$).
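
The bookkeeping behind these counts is a pair of set differences. The sketch below uses a short subset of model names that appear in this notebook, so the lists are illustrative and not the full ones.

```python
# Subset of names from the supplementary-material list (illustrative, not complete)
listed_in_paper = ["BPagano-RtDriven", "CEID-Walk", "CovidActNow-SEIR_CAN", "UCSB-ACTS"]
# Subset of names found in the forecast data (illustrative, not complete)
present_in_data = ["BPagano-RtDriven", "CEID-Walk", "COVIDhub-4_week_ensemble"]

matched = sorted(set(listed_in_paper) & set(present_in_data))
missing_from_data = sorted(set(listed_in_paper) - set(present_in_data))
not_in_paper_list = sorted(set(present_in_data) - set(listed_in_paper))

print("matched:", matched)
print("listed but missing from the data:", missing_from_data)
print("in the data but not listed:", not_in_paper_list)
```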

In [4]:
# The list contains the models that appear in the paper.
matched_models = ['BPagano-RtDriven',
'CEID-Walk',
'Covid19Sim-Simulator',
'CovidActNow-SEIR_CAN',
'CovidAnalytics-DELPHI',
'COVIDhub-baseline',
'CU-select',
#'COVIDhub-ensemble',
#'COVIDhub_CDC-ensemble',
#'COVIDhub-trained_ensemble',
#'COVIDhub-4_week_ensemble',
'DDS-NBDS',
'Google_Harvard-CPF',
'IEM_MED-CovidProject',
'IHME-CurveFit',
'IowaStateLW-STEM',
'IQVIA_ACOE-STAN',
'IUPUI-HkPrMobiDyR',
'JCB-PRM',
'JHU_CSSE-DECOM',
'JHU_IDD-CovidSP',
'JHU_UNC_GAS-StatMechPool',
'JHUAPL-Bucky',
'Karlen-pypm',
'LANL-GrowthRate',
'LNQ-ens1',
'LosAlamos_NAU-CModel_SDVaxVar',
'Microsoft-DeepSTIA',
'MIT_ISOLAT-Mixtures',
'MIT-Cassandra',
'MOBS-GLEAM_COVID',
'MUNI-ARIMA',
'OneQuietNight-ML',
'RobertWalraven-ESG',
'SDSC_ISG-TrendModel',
'SigSci-TS',
'UChicagoCHATTOPADHYAY-UnIT',
'UCLA-SuEIR',
'UCSB-ACTS',
'UMich-RidgeTfReg',
'USACE-ERDC_SEIR',
'USC-SI_kJalpha',
'UVA-Ensemble',
'Wadhwani_AI-BayesOpt']
matched_models

['BPagano-RtDriven',
 'CEID-Walk',
 'Covid19Sim-Simulator',
 'CovidActNow-SEIR_CAN',
 'CovidAnalytics-DELPHI',
 'COVIDhub-baseline',
 'CU-select',
 'DDS-NBDS',
 'Google_Harvard-CPF',
 'IEM_MED-CovidProject',
 'IHME-CurveFit',
 'IowaStateLW-STEM',
 'IQVIA_ACOE-STAN',
 'IUPUI-HkPrMobiDyR',
 'JCB-PRM',
 'JHU_CSSE-DECOM',
 'JHU_IDD-CovidSP',
 'JHU_UNC_GAS-StatMechPool',
 'JHUAPL-Bucky',
 'Karlen-pypm',
 'LANL-GrowthRate',
 'LNQ-ens1',
 'LosAlamos_NAU-CModel_SDVaxVar',
 'Microsoft-DeepSTIA',
 'MIT_ISOLAT-Mixtures',
 'MIT-Cassandra',
 'MOBS-GLEAM_COVID',
 'MUNI-ARIMA',
 'OneQuietNight-ML',
 'RobertWalraven-ESG',
 'SDSC_ISG-TrendModel',
 'SigSci-TS',
 'UChicagoCHATTOPADHYAY-UnIT',
 'UCLA-SuEIR',
 'UCSB-ACTS',
 'UMich-RidgeTfReg',
 'USACE-ERDC_SEIR',
 'USC-SI_kJalpha',
 'UVA-Ensemble',
 'Wadhwani_AI-BayesOpt']

In [5]:
models = pl.read_parquet('./data/models.parquet')

models = models.with_columns(
    pl.col('model').is_in(matched_models).alias('include_ensemble')
)

# models = models.filter(
#     pl.col('designation') == 'primary'
# )

models = models.with_columns(
    pl.when(pl.col('model').is_in(['COVIDhub-4_week_ensemble']))
    .then(pl.lit(True))
    .otherwise(pl.col('include_ensemble'))
    .alias('include_ensemble')
)
models

model,designation,include_ensemble
str,str,bool
"""KITmetricslab-select_ensemble""","""primary""",false
"""CovidAnalytics-DELPHI""","""primary""",true
"""UMass-MechBayes""","""primary""",false
"""UT-Osiris""","""primary""",false
"""USACE-ERDC_SEIR""","""primary""",true
…,…,…
"""MIT-Cassandra""","""primary""",true
"""PandemicCentral-COVIDForest""","""primary""",false
"""JHU_IDD-CovidSP""","""primary""",true
"""OHT_JHU-nbxd""","""primary""",false


In [6]:
forecast = pl.read_parquet('./data/fullforecasts.parquet')
forecast = forecast.with_columns(
    pl.col('forecast_date').cast(pl.Datetime),
    pl.col('target_end_date').cast(pl.Datetime),
    pl.col('target_end_date').dt.weekday().alias('target_day_of_week'),
    pl.col('forecast_date').dt.weekday().alias('forecast_day_of_week')
)
# ).filter(
#     pl.col('forecast_day_of_week') == 1,
# )

forecast = forecast.join(models, on='model', how='left').filter(
    pl.col('include_ensemble') == True
)
latest_dates = (
    forecast
    .group_by(['model', 'quantile', 'target_end_date', 'horizon'])
    .agg(pl.col('forecast_date').max().alias('forecast_date'))
)

forecast = forecast.join(latest_dates, on=['model', 'quantile', 'target_end_date', 'horizon','forecast_date'], how='inner')

In [7]:
forecast.filter(
    pl.col('model') == 'BPagano-RtDriven'
).sort('forecast_date').head(10)

model,forecast_date,location,horizon,temporal_resolution,target_variable,target_end_date,type,quantile,value,location_name,population,geo_type,geo_value,abbreviation,full_location_name,target_day_of_week,forecast_day_of_week,designation,include_ensemble
str,datetime[μs],str,str,str,str,datetime[μs],str,f64,f64,str,f64,str,str,str,str,i8,i8,str,bool
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""1""","""wk""","""inc case""",2020-10-24 00:00:00,"""quantile""",0.025,252157.46981,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""1""","""wk""","""inc case""",2020-10-24 00:00:00,"""quantile""",0.1,309027.43446,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""1""","""wk""","""inc case""",2020-10-24 00:00:00,"""quantile""",0.25,360710.91108,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""1""","""wk""","""inc case""",2020-10-24 00:00:00,"""quantile""",0.5,418324.95041,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""1""","""wk""","""inc case""",2020-10-24 00:00:00,"""quantile""",0.75,475938.98974,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""1""","""wk""","""inc case""",2020-10-24 00:00:00,"""quantile""",0.9,527622.46637,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""1""","""wk""","""inc case""",2020-10-24 00:00:00,"""quantile""",0.975,584492.43102,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""2""","""wk""","""inc case""",2020-10-31 00:00:00,"""quantile""",0.025,250557.3476,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""2""","""wk""","""inc case""",2020-10-31 00:00:00,"""quantile""",0.1,325076.2206,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True
"""BPagano-RtDriven""",2020-10-18 00:00:00,"""US""","""2""","""wk""","""inc case""",2020-10-31 00:00:00,"""quantile""",0.25,392799.0415,"""United States""",332875137.0,"""state""","""us""","""US""","""United States""",6,7,"""primary""",True


In [8]:
forecast['model'].unique()

model
str
"""LANL-GrowthRate"""
"""DDS-NBDS"""
"""MUNI-ARIMA"""
"""IHME-CurveFit"""
"""IQVIA_ACOE-STAN"""
…
"""SDSC_ISG-TrendModel"""
"""JHU_CSSE-DECOM"""
"""JCB-PRM"""
"""MIT-Cassandra"""


# Visualizing the Data

In [9]:
def quantile_plot(forecast, ground_truth):
    quantile_df  = forecast.pivot(
        index=['target_end_date', 'horizon'],
        on='quantile',
        values='value'
    )
    return (
        ggplot(quantile_df, aes(x='target_end_date')) +
        geom_ribbon(aes(ymin='0.025', ymax='0.975'), fill='#084594', alpha=0.1,size=0.2,manual_key='2.5% to 97.5%') +
        geom_ribbon(aes(ymin='0.25', ymax='0.75'), fill='#2171b5', alpha=0.3,size=0.2,manual_key='25% to 75%') +
        geom_line(aes(y='0.5'), color='blue', size=0.4, manual_key='Median (0.5)') +
        facet_grid(y='horizon', scales='fixed') +
        geom_line(aes(x='target_end_date', y='value'), data=ground_truth, color='black',size=0.5, linetype='dashed', manual_key='Ground Truth') +
        ggsize(1200, 800) +
        theme_bw() +
        labs(title='Forecast Quantiles as Ribbons', y='Value', x='Date')
        )
    

In [10]:
import ipywidgets as widgets
from IPython.display import display, clear_output

model_options = forecast.select('model').unique().to_series().sort().to_list()
model_dropdown = widgets.Dropdown(
    options=model_options,
    value='COVIDhub-4_week_ensemble',
    description='Model:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

output = widgets.Output()

def update_plot(change):
    with output:
        clear_output(wait=True)
        selected_model = change['new']
        display(quantile_plot(forecast.filter(pl.col('model') == selected_model), covid_data))

model_dropdown.observe(update_plot, names='value')

display(model_dropdown)
with output:
    display(quantile_plot(forecast.filter(pl.col('model') == model_dropdown.value), covid_data))
display(output)

Dropdown(description='Model:', index=2, layout=Layout(width='50%'), options=('BPagano-RtDriven', 'CEID-Walk', …

Output()

In [11]:
fs = forecast.pivot(
    values='value',
    index=['model', 'forecast_date', 'target_end_date', 'horizon'],
    on='quantile'
)

fs = fs.join(covid_data[['target_end_date','value']],on=['target_end_date'],how='left')

In [12]:
fs.head()

model,forecast_date,target_end_date,horizon,0.025,0.1,0.25,0.5,0.75,0.9,0.975,value
str,datetime[μs],datetime[μs],str,f64,f64,f64,f64,f64,f64,f64,f64
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-24 00:00:00,"""1""",252157.46981,309027.43446,360710.91108,418324.95041,475938.98974,527622.46637,584492.43102,485474.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-31 00:00:00,"""2""",250557.3476,325076.2206,392799.0415,468292.9038,543786.76609,611509.58699,686028.46,572162.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-07 00:00:00,"""3""",236909.92449,332846.6848,420034.10156,517226.0783,614418.05504,701605.47179,797542.2321,777428.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-14 00:00:00,"""4""",213876.19851,330803.52595,437067.18958,555524.36963,673981.54967,780245.2133,897172.54074,1028914.0
"""BPagano-RtDriven""",2020-10-25 00:00:00,2020-10-31 00:00:00,"""1""",299044.13571,364922.81525,424793.42033,491534.04269,558274.66504,618145.27012,684023.94966,572162.0


In [13]:
fs = fs.rename({'0.025' : 'lower_95',
                '0.1' : 'lower_80',
                '0.25' : 'lower_50',
                '0.5' : 'pred',
                '0.75' : 'upper_50',
                '0.9' : 'upper_80',
                '0.975' : 'upper_95'})
fs.head()

model,forecast_date,target_end_date,horizon,lower_95,lower_80,lower_50,pred,upper_50,upper_80,upper_95,value
str,datetime[μs],datetime[μs],str,f64,f64,f64,f64,f64,f64,f64,f64
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-24 00:00:00,"""1""",252157.46981,309027.43446,360710.91108,418324.95041,475938.98974,527622.46637,584492.43102,485474.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-31 00:00:00,"""2""",250557.3476,325076.2206,392799.0415,468292.9038,543786.76609,611509.58699,686028.46,572162.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-07 00:00:00,"""3""",236909.92449,332846.6848,420034.10156,517226.0783,614418.05504,701605.47179,797542.2321,777428.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-14 00:00:00,"""4""",213876.19851,330803.52595,437067.18958,555524.36963,673981.54967,780245.2133,897172.54074,1028914.0
"""BPagano-RtDriven""",2020-10-25 00:00:00,2020-10-31 00:00:00,"""1""",299044.13571,364922.81525,424793.42033,491534.04269,558274.66504,618145.27012,684023.94966,572162.0


In [14]:
aux_matched_models = np.array(fs['model'].unique())
aux_matched_models

array(['JHU_IDD-CovidSP', 'SDSC_ISG-TrendModel', 'BPagano-RtDriven',
       'JHUAPL-Bucky', 'MUNI-ARIMA', 'SigSci-TS', 'MOBS-GLEAM_COVID',
       'MIT_ISOLAT-Mixtures', 'COVIDhub-baseline', 'IUPUI-HkPrMobiDyR',
       'LNQ-ens1', 'UVA-Ensemble', 'COVIDhub-4_week_ensemble',
       'UMich-RidgeTfReg', 'CU-select', 'Covid19Sim-Simulator',
       'IEM_MED-CovidProject', 'JCB-PRM', 'Karlen-pypm',
       'CovidAnalytics-DELPHI', 'IQVIA_ACOE-STAN', 'RobertWalraven-ESG',
       'CEID-Walk', 'USC-SI_kJalpha', 'USACE-ERDC_SEIR',
       'IowaStateLW-STEM', 'UChicagoCHATTOPADHYAY-UnIT', 'JHU_CSSE-DECOM',
       'MIT-Cassandra', 'OneQuietNight-ML', 'IHME-CurveFit',
       'Microsoft-DeepSTIA', 'LANL-GrowthRate', 'UCLA-SuEIR', 'DDS-NBDS'],
      dtype='<U26')

In [15]:
'''
Models marked with ### appear on page six of the paper but not in the full forecast dataset.
aux_matched_models = ['BPagano-RtDriven',
                     'CEID-Walk',
                     'Covid19Sim-Simulator',
                     ### 'CovidActNow-SEIR_CAN',
                     'CovidAnalytics-DELPHI',
                     'COVIDhub-baseline',
                     'CU-select',
                     'DDS-NBDS',
                     ### 'Google_Harvard-CPF',
                     'IEM_MED-CovidProject',
                     'IHME-CurveFit',
                     'IowaStateLW-STEM',
                     'IQVIA_ACOE-STAN',
                     'IUPUI-HkPrMobiDyR',
                     'JCB-PRM',
                     'JHU_CSSE-DECOM',
                     'JHU_IDD-CovidSP',
                     ### 'JHU_UNC_GAS-StatMechPool',
                     'JHUAPL-Bucky',
                     'Karlen-pypm',
                     'LANL-GrowthRate',
                     'LNQ-ens1',
                     ### 'LosAlamos_NAU-CModel_SDVaxVar',
                     'Microsoft-DeepSTIA',
                     'MIT_ISOLAT-Mixtures',
                     'MIT-Cassandra',
                     'MOBS-GLEAM_COVID',
                     'MUNI-ARIMA',
                     'OneQuietNight-ML',
                     'RobertWalraven-ESG',
                     'SDSC_ISG-TrendModel',
                     'SigSci-TS',
                     'UChicagoCHATTOPADHYAY-UnIT',
                     'UCLA-SuEIR',
                     ### 'UCSB-ACTS',
                     'UMich-RidgeTfReg',
                     'USACE-ERDC_SEIR',
                     'USC-SI_kJalpha',
                     'UVA-Ensemble',
                     ### 'Wadhwani_AI-BayesOpt']
'''
aux_list_matched = []
for i in aux_matched_models:
    if i in matched_models:
        aux_list_matched.append(i)
    else:
        print("Not match:", i)
print("# matched:", len(aux_list_matched))
print("# total here:", len(aux_matched_models))

print("Not in here, but in page 5 of supplementary material:")
for i in matched_models:
      if i not in aux_matched_models:
          print("   ", i)

Not match: COVIDhub-4_week_ensemble
# matched: 34
# total here: 35
Not in here, but in page 5 of supplementary material:
    CovidActNow-SEIR_CAN
    Google_Harvard-CPF
    JHU_UNC_GAS-StatMechPool
    LosAlamos_NAU-CModel_SDVaxVar
    UCSB-ACTS
    Wadhwani_AI-BayesOpt


In [16]:
aux_fs = fs.filter(pl.col('model')!="COVIDhub-4_week_ensemble")
print(len(aux_fs['model'].unique()))
aux_fs

34


model,forecast_date,target_end_date,horizon,lower_95,lower_80,lower_50,pred,upper_50,upper_80,upper_95,value
str,datetime[μs],datetime[μs],str,f64,f64,f64,f64,f64,f64,f64,f64
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-24 00:00:00,"""1""",252157.46981,309027.43446,360710.91108,418324.95041,475938.98974,527622.46637,584492.43102,485474.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-31 00:00:00,"""2""",250557.3476,325076.2206,392799.0415,468292.9038,543786.76609,611509.58699,686028.46,572162.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-07 00:00:00,"""3""",236909.92449,332846.6848,420034.10156,517226.0783,614418.05504,701605.47179,797542.2321,777428.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-14 00:00:00,"""4""",213876.19851,330803.52595,437067.18958,555524.36963,673981.54967,780245.2133,897172.54074,1.028914e6
"""BPagano-RtDriven""",2020-10-25 00:00:00,2020-10-31 00:00:00,"""1""",299044.13571,364922.81525,424793.42033,491534.04269,558274.66504,618145.27012,684023.94966,572162.0
…,…,…,…,…,…,…,…,…,…,…,…
"""UVA-Ensemble""",2022-03-07 00:00:00,2022-04-02 00:00:00,"""4""",0.0,0.0,0.0,204654.866196,643894.167065,1.0392e6,1.4810e6,198718.0
"""UVA-Ensemble""",2022-03-14 00:00:00,2022-03-19 00:00:00,"""1""",0.0,0.0,29607.178921,239906.562208,450205.945496,639481.950577,851004.384073,215103.0
"""UVA-Ensemble""",2022-03-14 00:00:00,2022-03-26 00:00:00,"""2""",0.0,0.0,0.0,224153.58186,496489.350306,741600.037205,1.0155e6,210000.0
"""UVA-Ensemble""",2022-03-14 00:00:00,2022-04-02 00:00:00,"""3""",0.0,0.0,0.0,201139.12462,490377.59623,750701.243243,1.0416e6,198718.0


In [17]:
aux_fs.write_csv("results/forecast_quantiles_cases_v3.csv")

## Computing WIS

In [18]:
fs = fs.rename({'lower_95' : '0.025',
                'lower_80' : '0.1',
                'lower_50' : '0.25',
                'pred' : '0.5',
                'upper_50': '0.75',
                'upper_80': '0.9',
                'upper_95' :  '0.975'})
fs.head()

model,forecast_date,target_end_date,horizon,0.025,0.1,0.25,0.5,0.75,0.9,0.975,value
str,datetime[μs],datetime[μs],str,f64,f64,f64,f64,f64,f64,f64,f64
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-24 00:00:00,"""1""",252157.46981,309027.43446,360710.91108,418324.95041,475938.98974,527622.46637,584492.43102,485474.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-31 00:00:00,"""2""",250557.3476,325076.2206,392799.0415,468292.9038,543786.76609,611509.58699,686028.46,572162.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-07 00:00:00,"""3""",236909.92449,332846.6848,420034.10156,517226.0783,614418.05504,701605.47179,797542.2321,777428.0
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-14 00:00:00,"""4""",213876.19851,330803.52595,437067.18958,555524.36963,673981.54967,780245.2133,897172.54074,1028914.0
"""BPagano-RtDriven""",2020-10-25 00:00:00,2020-10-31 00:00:00,"""1""",299044.13571,364922.81525,424793.42033,491534.04269,558274.66504,618145.27012,684023.94966,572162.0


In [19]:
def compute_wis2(quantiles, y, taus):
    """
    - quantiles (list): List of predictive quantiles q_1 to q_K
    - y (float): Observed quantity
    - taus (list): List of corresponding tau values (e.g., [0.025,0.1,0.25,0.5,0.75,0.9,0.975])
    
    Returns:
        float: WIS value
    """
    K = len(quantiles)
    wis = 0
    for k in range(K):
        indicator = 1 if y <= quantiles[k] else 0
        wis += 2 * (indicator - taus[k]) * (quantiles[k] - y)
    return wis / K
    
def maybe_compute_wis2(quantiles, y, taus):
    try:
        return compute_wis2(quantiles, y, taus)
    except Exception:
        return np.nan

        
taus = [0.025,0.1,0.25,0.5,0.75,0.9,0.975]
def row_wis2(row,taus):
    quantiles = [row[str(t)] for t in taus]
    y = row['value']
    return maybe_compute_wis2(quantiles, y, taus)
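
To make the score concrete, here is a self-contained sketch that re-implements the same average of twice the pinball (quantile) loss and applies it to the first `BPagano-RtDriven` row displayed in this notebook. The function name `mean_pinball_score` is hypothetical; the quantile values and the observation are copied from that row.

```python
def mean_pinball_score(quantiles, y, taus):
    """Average over quantile levels of 2 * (1{y <= q} - tau) * (q - y)."""
    total = 0.0
    for q, tau in zip(quantiles, taus):
        indicator = 1.0 if y <= q else 0.0
        total += 2.0 * (indicator - tau) * (q - y)
    return total / len(quantiles)

taus = [0.025, 0.1, 0.25, 0.5, 0.75, 0.9, 0.975]
# First BPagano-RtDriven row: forecast_date 2020-10-18, horizon 1
quantiles = [252157.46981, 309027.43446, 360710.91108, 418324.95041,
             475938.98974, 527622.46637, 584492.43102]
observed = 485474.0

score = mean_pinball_score(quantiles, observed, taus)
print(round(score, 6))
```

Each term is non-negative: quantiles below the observation are penalised with weight `tau`, quantiles above it with weight `1 - tau`. The result should agree with the `wis` value displayed for that row (29166.980555).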

In [20]:
fs = fs.with_columns(
    pl.struct(pl.all()).map_elements(lambda r: row_wis2(r,taus),return_dtype=float).alias('wis')
)

In [21]:
fs.head()

model,forecast_date,target_end_date,horizon,0.025,0.1,0.25,0.5,0.75,0.9,0.975,value,wis
str,datetime[μs],datetime[μs],str,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-24 00:00:00,"""1""",252157.46981,309027.43446,360710.91108,418324.95041,475938.98974,527622.46637,584492.43102,485474.0,29166.980555
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-10-31 00:00:00,"""2""",250557.3476,325076.2206,392799.0415,468292.9038,543786.76609,611509.58699,686028.46,572162.0,45024.807888
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-07 00:00:00,"""3""",236909.92449,332846.6848,420034.10156,517226.0783,614418.05504,701605.47179,797542.2321,777428.0,133834.602508
"""BPagano-RtDriven""",2020-10-18 00:00:00,2020-11-14 00:00:00,"""4""",213876.19851,330803.52595,437067.18958,555524.36963,673981.54967,780245.2133,897172.54074,1028914.0,312369.336797
"""BPagano-RtDriven""",2020-10-25 00:00:00,2020-10-31 00:00:00,"""1""",299044.13571,364922.81525,424793.42033,491534.04269,558274.66504,618145.27012,684023.94966,572162.0,35005.247322


## Model Ensemble - Untrained

*Non-robust untrained*: use the mean of each quantile prediction across the component forecasts.

*Robust untrained*: use the median of each quantile prediction across the component forecasts.

Let us create a dataframe with the mean and median of each quantile prediction across the component forecasts.
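
A toy illustration of why the median version is called robust, using only the standard library. The five predictions are made up and stand for the 0.5 quantile of five hypothetical component models at one date and horizon.

```python
from statistics import mean, median

# Made-up 0.5-quantile predictions from five hypothetical component models;
# the last one is an outlier.
component_predictions = [100.0, 110.0, 105.0, 95.0, 1000.0]

nonrobust_untrained = mean(component_predictions)   # pulled up by the outlier
robust_untrained = median(component_predictions)    # stays with the bulk of the models
print(nonrobust_untrained, robust_untrained)
```

In the notebook the same two aggregations are applied separately to every quantile level, for each `target_end_date` and `horizon`.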

In [22]:
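# Keep only the models flagged for inclusion in the ensemble.
# Note: `include_ensemble` was also set to True for 'COVIDhub-4_week_ensemble' above,
# so that model takes part in this aggregation as well.
components = fs.join(models.filter('include_ensemble'), on='model', how='inner')

# Compute the median and mean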
components = components.group_by(['target_end_date','horizon']).agg(
    [pl.col(str(q)).mean().alias('mean_' + str(q)) for q in taus] +
    [pl.col(str(q)).median().alias('median_' + str(q)) for q in taus]+
    [pl.col('value').first()]
).sort('target_end_date')

# Adjust the dataframe format
components = components.unpivot(
    index=['target_end_date','horizon'],
    on   = ['mean_'+str(q) for q in taus] + ['median_'+str(q) for q in taus]
).with_columns(
    pl.col('variable').str.split('_').map_elements(lambda x: float(x[1]) if len(x) > 1 else None,return_dtype=float).alias('quantile'),
    pl.col('variable').str.split('_').map_elements(lambda x: 'nonrobust_untrained' if x[0] == 'mean' else 'robust_untrained',return_dtype=str).alias('model')
).sort(['model','target_end_date','horizon','quantile'])

In [23]:
components

target_end_date,horizon,variable,value,quantile,model
datetime[μs],str,str,f64,f64,str
2020-05-23 00:00:00,"""1""","""mean_0.025""",80290.425,0.025,"""nonrobust_untrained"""
2020-05-23 00:00:00,"""1""","""mean_0.1""",136901.6,0.1,"""nonrobust_untrained"""
2020-05-23 00:00:00,"""1""","""mean_0.25""",144254.5,0.25,"""nonrobust_untrained"""
2020-05-23 00:00:00,"""1""","""mean_0.5""",158270.0,0.5,"""nonrobust_untrained"""
2020-05-23 00:00:00,"""1""","""mean_0.75""",172285.5,0.75,"""nonrobust_untrained"""
…,…,…,…,…,…
2022-04-09 00:00:00,"""4""","""median_0.25""",60575.561752,0.25,"""robust_untrained"""
2022-04-09 00:00:00,"""4""","""median_0.5""",159832.869277,0.5,"""robust_untrained"""
2022-04-09 00:00:00,"""4""","""median_0.75""",254531.9152,0.75,"""robust_untrained"""
2022-04-09 00:00:00,"""4""","""median_0.9""",407946.068962,0.9,"""robust_untrained"""


Let us now compare our ensemble with the one that comes directly from the CovidHub dataset.
Note that the CovidHub only has the `COVIDhub-4_week_ensemble`, which is the robust untrained ensemble. So we compare it with a robust untrained ensemble built using only the $34$ models that we have access to.
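
The comparison relies on an absolute and a relative difference between the published and the recomputed quantile. A minimal sketch follows, with a hypothetical helper name and the two values copied from the most negative row of the comparison output in this notebook.

```python
def relative_diff(value_covidhub: float, value_calculated: float) -> float:
    """(COVIDhub - calculated) / COVIDhub, the quantity stored as `relative_diff`."""
    return (value_covidhub - value_calculated) / value_covidhub

# target_end_date 2022-03-19, horizon 4, quantile 0.9
value_covidhub = 520540.0
value_calculated = 890184.517508

diff = value_covidhub - value_calculated
rel = relative_diff(value_covidhub, value_calculated)
print(round(diff, 6), round(rel, 6))
```

A negative value means the recomputed ensemble quantile is larger than the one published by the CovidHub.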

In [24]:
covidhub_robust_untrained = fs.sort('target_end_date').filter(
    pl.col('model') == 'COVIDhub-4_week_ensemble'
)

covidhub_robust_untrained = covidhub_robust_untrained.rename({'value':'y'})\
    .unpivot(
        on=[str(t) for t in taus],
        index=["target_end_date",'model','horizon','forecast_date','wis','y'])\
    .rename({'variable':'quantile'})\
    .with_columns(
        pl.col('quantile').cast(pl.Float64)
    ).sort(['model','target_end_date','horizon','quantile'])


In [25]:
comparison = covidhub_robust_untrained.join(components.filter(pl.col('model') =='robust_untrained')[['target_end_date','horizon','quantile','value']],
                                    on=['target_end_date','horizon','quantile'])

comparison = comparison.with_columns(
    (pl.col('value')- pl.col('value_right')).alias('diff'),
)

comparison = comparison.with_columns(
    (pl.col('diff')/pl.col('value')).alias('relative_diff')
)

In [26]:
comparison = comparison.rename({"value_right": "value_calculated", "value": "value_COVIDhub"})

In [27]:
comparison.sort('relative_diff')

target_end_date,model,horizon,forecast_date,wis,y,quantile,value_COVIDhub,value_calculated,diff,relative_diff
datetime[μs],str,str,datetime[μs],f64,f64,f64,f64,f64,f64,f64
2022-03-19 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-02-21 00:00:00,37206.092857,215103.0,0.9,520540.0,890184.517508,-369644.517508,-0.710117
2022-03-19 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-02-21 00:00:00,37206.092857,215103.0,0.75,352284.0,535159.0,-182875.0,-0.519112
2022-04-02 00:00:00,"""COVIDhub-4_week_ensemble""","""3""",2022-03-14 00:00:00,33983.485714,198718.0,0.5,119458.0,175427.72964,-55969.72964,-0.468531
2020-08-15 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2020-07-20 00:00:00,45018.314286,391741.0,0.025,224612.0,326723.781432,-102111.781432,-0.454614
2022-04-09 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-03-14 00:00:00,54188.557143,243938.0,0.75,187056.0,254531.9152,-67475.9152,-0.360726
…,…,…,…,…,…,…,…,…,…,…
2022-02-12 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-01-17 00:00:00,1.5932e6,1.259905e6,0.025,1.648119e6,1.0765e6,571586.75,0.346812
2022-03-19 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2022-03-14 00:00:00,14137.521429,215103.0,0.025,101130.0,58507.940084,42622.059916,0.421458
2022-02-26 00:00:00,"""COVIDhub-4_week_ensemble""","""2""",2022-02-14 00:00:00,57660.057143,464427.0,0.025,195436.0,99972.062388,95463.937612,0.488466
2022-03-05 00:00:00,"""COVIDhub-4_week_ensemble""","""3""",2022-02-14 00:00:00,61498.221429,337025.0,0.025,90709.0,45355.594849,45353.405151,0.499988


In [28]:
(
    # ggplot(data=comparison.filter(pl.col('horizon')=='1',pl.col('quantile')==0.5))
    # ggplot(data=comparison.filter(pl.col('horizon')=='2'))
    ggplot(data=comparison)
    + geom_line(aes(x='target_end_date',y='value_COVIDhub'),color='blue')
    + geom_line(aes(x='target_end_date', y='value_calculated'), color='red', linetype='dashed')
    + geom_point(aes(x='target_end_date', y='diff'), color='green')
    + facet_grid(x='horizon',y='quantile',scales='free_y')
)

In [29]:
comparison_selected = comparison.select(["target_end_date","model","horizon","forecast_date","wis","y", "quantile","value_COVIDhub","value_calculated"])
comparison_selected = comparison_selected.rename({"wis": "wis_COVIDhub"})
comparison_selected

target_end_date,model,horizon,forecast_date,wis_COVIDhub,y,quantile,value_COVIDhub,value_calculated
datetime[μs],str,str,datetime[μs],f64,f64,f64,f64,f64
2020-07-25 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-20 00:00:00,25603.764286,462065.0,0.025,354703.0,389770.55
2020-07-25 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-20 00:00:00,25603.764286,462065.0,0.1,433849.0,433243.926379
2020-07-25 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-20 00:00:00,25603.764286,462065.0,0.25,471764.0,456853.38525
2020-07-25 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-20 00:00:00,25603.764286,462065.0,0.5,528883.0,479011.347656
2020-07-25 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-20 00:00:00,25603.764286,462065.0,0.75,563543.0,480936.25
…,…,…,…,…,…,…,…,…
2022-04-09 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-03-14 00:00:00,54188.557143,243938.0,0.25,56904.0,60575.561752
2022-04-09 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-03-14 00:00:00,54188.557143,243938.0,0.5,125651.0,159832.869277
2022-04-09 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-03-14 00:00:00,54188.557143,243938.0,0.75,187056.0,254531.9152
2022-04-09 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-03-14 00:00:00,54188.557143,243938.0,0.9,312173.0,407946.068962


In [30]:
id_cols = ["target_end_date", "model", "horizon", "forecast_date", "wis_COVIDhub", "y"]

df_long = comparison_selected.unpivot(
    index=["quantile"] + id_cols,
    on=["value_COVIDhub", "value_calculated"],
    variable_name="value_type",
    value_name="value"
)

df_long = df_long.with_columns(
    (pl.col("quantile").cast(str) + "_" + pl.col("value_type").str.replace("value_", "")).alias("cat_val_type")
)

df_wide = df_long.pivot(
    values="value",
    index=id_cols,
    on="cat_val_type",
    aggregate_function="first"
)

df_wide

target_end_date,model,horizon,forecast_date,wis_COVIDhub,y,0.025_COVIDhub,0.1_COVIDhub,0.25_COVIDhub,0.5_COVIDhub,0.75_COVIDhub,0.9_COVIDhub,0.975_COVIDhub,0.025_calculated,0.1_calculated,0.25_calculated,0.5_calculated,0.75_calculated,0.9_calculated,0.975_calculated
datetime[μs],str,str,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
2020-07-25 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-20 00:00:00,25603.764286,462065.0,354703.0,433849.0,471764.0,528883.0,563543.0,585873.0,689024.0,389770.55,433243.926379,456853.38525,479011.347656,480936.25,521132.563758,546731.492201
2020-08-01 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-27 00:00:00,11144.278571,435246.0,390056.0,413333.0,421784.0,454323.0,485925.0,513596.0,526253.0,390056.0,413333.0,421784.434824,454323.0,485925.0,513596.0,534241.575
2020-08-01 00:00:00,"""COVIDhub-4_week_ensemble""","""2""",2020-07-20 00:00:00,41405.928571,435246.0,290371.0,393021.0,455740.0,537671.0,595235.0,648353.0,802663.0,363549.615832,410751.638121,456037.658439,527631.255637,564017.703948,571498.58314,651353.291898
2020-08-08 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-08-03 00:00:00,19652.042857,388787.0,367977.0,389299.0,410424.0,434142.0,452867.0,482542.0,528801.0,367977.0,389299.0,410424.0,434142.0,452867.0,482542.0,534700.5
2020-08-08 00:00:00,"""COVIDhub-4_week_ensemble""","""2""",2020-07-27 00:00:00,35524.878571,388787.0,334323.0,393452.0,431745.0,468837.0,499385.0,530096.0,578910.0,370689.833617,393452.931423,431745.0,468837.0,499385.0,530096.0,578909.528346
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2022-03-26 00:00:00,"""COVIDhub-4_week_ensemble""","""3""",2022-03-07 00:00:00,30335.007143,210000.0,43800.0,73298.0,106979.0,160430.0,251749.0,384326.0,607489.0,43800.0,73298.0,106979.0,182774.91863,251749.0,384325.175,607488.232031
2022-03-26 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-02-28 00:00:00,44069.542857,210000.0,28738.0,45077.0,93449.0,150573.0,351379.0,452559.0,800706.0,28738.0,45077.0,93449.0,178234.233286,351379.0,452559.0,800705.515239
2022-04-02 00:00:00,"""COVIDhub-4_week_ensemble""","""3""",2022-03-14 00:00:00,33983.485714,198718.0,25722.0,37424.0,75688.0,119458.0,180537.0,277093.0,463804.0,25722.221184,37424.427132,80698.58373,175427.72964,236771.5,352246.464669,473069.819946
2022-04-02 00:00:00,"""COVIDhub-4_week_ensemble""","""4""",2022-03-07 00:00:00,30210.278571,198718.0,29803.0,54906.0,92434.0,179928.0,276207.0,394795.0,686156.0,29803.46985,54906.90625,92434.0,190012.245259,276207.0,394795.0,686156.0


In [31]:
df_wide = df_wide.rename({
    '0.025_COVIDhub': 'lower_95_COVIDhub',
    '0.1_COVIDhub': 'lower_80_COVIDhub',
    '0.25_COVIDhub': 'lower_50_COVIDhub',
    '0.5_COVIDhub': 'pred_COVIDhub',
    '0.75_COVIDhub': 'upper_50_COVIDhub',
    '0.9_COVIDhub': 'upper_80_COVIDhub',
    '0.975_COVIDhub': 'upper_95_COVIDhub',
    '0.025_calculated': 'lower_95_calculated',
    '0.1_calculated': 'lower_80_calculated',
    '0.25_calculated': 'lower_50_calculated',
    '0.5_calculated': 'pred_calculated',
    '0.75_calculated': 'upper_50_calculated',
    '0.9_calculated': 'upper_80_calculated',
    '0.975_calculated': 'upper_95_calculated',
})
df_wide.head()

target_end_date,model,horizon,forecast_date,wis_COVIDhub,y,lower_95_COVIDhub,lower_80_COVIDhub,lower_50_COVIDhub,pred_COVIDhub,upper_50_COVIDhub,upper_80_COVIDhub,upper_95_COVIDhub,lower_95_calculated,lower_80_calculated,lower_50_calculated,pred_calculated,upper_50_calculated,upper_80_calculated,upper_95_calculated
datetime[μs],str,str,datetime[μs],f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
2020-07-25 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-20 00:00:00,25603.764286,462065.0,354703.0,433849.0,471764.0,528883.0,563543.0,585873.0,689024.0,389770.55,433243.926379,456853.38525,479011.347656,480936.25,521132.563758,546731.492201
2020-08-01 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-07-27 00:00:00,11144.278571,435246.0,390056.0,413333.0,421784.0,454323.0,485925.0,513596.0,526253.0,390056.0,413333.0,421784.434824,454323.0,485925.0,513596.0,534241.575
2020-08-01 00:00:00,"""COVIDhub-4_week_ensemble""","""2""",2020-07-20 00:00:00,41405.928571,435246.0,290371.0,393021.0,455740.0,537671.0,595235.0,648353.0,802663.0,363549.615832,410751.638121,456037.658439,527631.255637,564017.703948,571498.58314,651353.291898
2020-08-08 00:00:00,"""COVIDhub-4_week_ensemble""","""1""",2020-08-03 00:00:00,19652.042857,388787.0,367977.0,389299.0,410424.0,434142.0,452867.0,482542.0,528801.0,367977.0,389299.0,410424.0,434142.0,452867.0,482542.0,534700.5
2020-08-08 00:00:00,"""COVIDhub-4_week_ensemble""","""2""",2020-07-27 00:00:00,35524.878571,388787.0,334323.0,393452.0,431745.0,468837.0,499385.0,530096.0,578910.0,370689.833617,393452.931423,431745.0,468837.0,499385.0,530096.0,578909.528346


In [32]:
df_wide.write_csv("results/comparison_4weeks_WIS_cases_v3.csv")

In [33]:
df_aux = pd.read_csv("results/comparison_4weeks_WIS_cases_v3.csv")

quantile_probs = [0.025, 0.1, 0.25,
                  0.5,
                  0.75, 0.9, 0.975]
quantile_names = ['lower_95','lower_80','lower_50',
                  'pred',
                  'upper_50','upper_80','upper_95']

# Temporarily drop the "_calculated" suffix so the columns carry the plain names in quantile_names.
rename_map = {f"{col}_calculated" : col for col in quantile_names}
df_aux.rename(columns=rename_map, inplace=True)

# WIS of our recomputed ensemble against the observed values y.
wis_calculated = compute_wis(df_aux[quantile_names], df_aux.y.to_numpy())
df_aux['wis_calculated'] = wis_calculated

# Restore the "_calculated" suffix.
rename_map = {col : f"{col}_calculated" for col in quantile_names}
df_aux.rename(columns=rename_map, inplace=True)
df_aux

Unnamed: 0,target_end_date,model,horizon,forecast_date,wis_COVIDhub,y,lower_95_COVIDhub,lower_80_COVIDhub,lower_50_COVIDhub,pred_COVIDhub,...,upper_80_COVIDhub,upper_95_COVIDhub,lower_95_calculated,lower_80_calculated,lower_50_calculated,pred_calculated,upper_50_calculated,upper_80_calculated,upper_95_calculated,wis_calculated
0,2020-07-25T00:00:00.000000,COVIDhub-4_week_ensemble,1,2020-07-20T00:00:00.000000,25603.764286,462065.0,354703.0,433849.0,471764.0,528883.0,...,585873.0,689024.0,389770.550000,433243.926379,456853.385250,479011.347656,480936.250000,521132.563758,546731.492201,7773.364945
1,2020-08-01T00:00:00.000000,COVIDhub-4_week_ensemble,1,2020-07-27T00:00:00.000000,11144.278571,435246.0,390056.0,413333.0,421784.0,454323.0,...,513596.0,526253.0,390056.000000,413333.000000,421784.434824,454323.000000,485925.000000,513596.000000,534241.575000,11201.308763
2,2020-08-01T00:00:00.000000,COVIDhub-4_week_ensemble,2,2020-07-20T00:00:00.000000,41405.928571,435246.0,290371.0,393021.0,455740.0,537671.0,...,648353.0,802663.0,363549.615832,410751.638121,456037.658439,527631.255637,564017.703948,571498.583140,651353.291898,33499.738297
3,2020-08-08T00:00:00.000000,COVIDhub-4_week_ensemble,1,2020-08-03T00:00:00.000000,19652.042857,388787.0,367977.0,389299.0,410424.0,434142.0,...,482542.0,528801.0,367977.000000,389299.000000,410424.000000,434142.000000,452867.000000,482542.000000,534700.500000,19694.182143
4,2020-08-08T00:00:00.000000,COVIDhub-4_week_ensemble,2,2020-07-27T00:00:00.000000,35524.878571,388787.0,334323.0,393452.0,431745.0,468837.0,...,530096.0,578910.0,370689.833617,393452.931423,431745.000000,468837.000000,499385.000000,530096.000000,578909.528346,35265.351614
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
343,2022-03-26T00:00:00.000000,COVIDhub-4_week_ensemble,3,2022-03-07T00:00:00.000000,30335.007143,210000.0,43800.0,73298.0,106979.0,160430.0,...,384326.0,607489.0,43800.000000,73298.000000,106979.000000,182774.918630,251749.000000,384325.175000,607488.232031,27142.846853
344,2022-03-26T00:00:00.000000,COVIDhub-4_week_ensemble,4,2022-02-28T00:00:00.000000,44069.542857,210000.0,28738.0,45077.0,93449.0,150573.0,...,452559.0,800706.0,28738.000000,45077.000000,93449.000000,178234.233286,351379.000000,452559.000000,800705.515239,40117.934639
345,2022-04-02T00:00:00.000000,COVIDhub-4_week_ensemble,3,2022-03-14T00:00:00.000000,33983.485714,198718.0,25722.0,37424.0,75688.0,119458.0,...,277093.0,463804.0,25722.221184,37424.427132,80698.583730,175427.729640,236771.500000,352246.464669,473069.819946,26665.502277
346,2022-04-02T00:00:00.000000,COVIDhub-4_week_ensemble,4,2022-03-07T00:00:00.000000,30210.278571,198718.0,29803.0,54906.0,92434.0,179928.0,...,394795.0,686156.0,29803.469850,54906.906250,92434.000000,190012.245259,276207.000000,394795.000000,686156.000000,28769.642857


In [34]:
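# index=False keeps the row index out of the file; otherwise it comes back as an "Unnamed: 0" column on the next read.
df_aux.to_csv("results/comparison_4weeks_WIS_cases_v3.csv", index=False)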

**Observation:** `value_calculated` and `wis_calculated` hold, respectively, our quantile-wise median ensemble and its WIS. The ensemble is built from $34$ component models: the paper cited above lists the models used in the ensemble, but not all of those names appear in the data, and the data also contains models the paper does not mention, so we kept only the $34$ models common to both the paper and the dataset. The calculation follows COVIDhub's robust, untrained methodology, that is, an equally weighted median of the component forecasts.
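
To make the WIS columns easier to interpret, here is a minimal sketch of the standard interval-based weighted interval score for a single observation. The function `weighted_interval_score` is a hypothetical helper written for illustration; it is not the `compute_wis` implementation from `mosqlient`, whose internals are not shown in this notebook. The input values are taken from the 2020-08-01, horizon 1 row of the table above.

```python
import numpy as np

def weighted_interval_score(y, median, lower, upper, alphas):
    """WIS for one observation y, given a median and K central prediction intervals.

    lower[k] and upper[k] bound the central (1 - alphas[k]) interval.
    Hypothetical helper for illustration only.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    alphas = np.asarray(alphas, dtype=float)

    # Interval score: interval width plus a penalty when y falls outside it.
    interval_scores = (
        (upper - lower)
        + (2.0 / alphas) * np.maximum(lower - y, 0.0)
        + (2.0 / alphas) * np.maximum(y - upper, 0.0)
    )
    # Weights: 1/2 for the median term, alpha_k / 2 for each interval.
    total = 0.5 * abs(y - median) + np.sum((alphas / 2.0) * interval_scores)
    return total / (len(alphas) + 0.5)

# COVIDhub quantiles for target_end_date 2020-08-01, horizon 1.
wis = weighted_interval_score(
    y=435246.0,
    median=454323.0,                        # quantile 0.5
    lower=[390056.0, 413333.0, 421784.0],   # quantiles 0.025, 0.1, 0.25
    upper=[526253.0, 513596.0, 485925.0],   # quantiles 0.975, 0.9, 0.75
    alphas=[0.05, 0.2, 0.5],                # 95%, 80% and 50% intervals
)
print(round(wis, 6))  # 11144.278571, the wis_COVIDhub value for that row
```

Applying the same function to the `_calculated` quantiles of that row gives approximately 11201.3088, in line with its `wis_calculated` value, which suggests that both columns use this definition with the seven quantiles listed in `quantile_probs`.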