### Goal

We would like to model

$x_{i,t}=\Lambda f_t + \epsilon_{i,t}$

* $i$ = patient index
* $t$ = time (irregular visits per patient)
* $x_{i,t} \in R^p$ = patient embedding (a realization driven by the latent factors)
* $f_t \in R^r$ = shared latent temporal factors (underlying population health states)
* $\Lambda \in R^{p\times r}$ = factor loadings (relationship between embeddings and latent factors)
* $\epsilon_{i,t} \in R^p$ = idiosyncratic error not explained by the shared factors
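
To make the notation concrete, here is a minimal sketch that simulates data from this kind of factor model and recovers the factor space with principal components. It drops the patient index $i$ and uses one observation vector per time step; the dimensions, noise scale, and all variable names are assumptions chosen for illustration, not values from this project.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, r = 200, 12, 3          # time steps, observed dimension, number of factors

# Simulate the static factor model x_t = Lambda f_t + eps_t
Lambda = rng.normal(size=(p, r))        # factor loadings (p x r)
F = rng.normal(size=(T, r))             # latent factors, one row per time step
eps = 0.1 * rng.normal(size=(T, p))     # idiosyncratic noise
X = F @ Lambda.T + eps                  # observations (T x p)

# Principal-components estimate: top-r right singular vectors of the centered data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Lambda_hat = Vt[:r].T                   # estimated loadings, identified only up to rotation
F_hat = Xc @ Lambda_hat                 # estimated factors (T x r)

# Share of total variance captured by the first r components
explained = float((s[:r] ** 2).sum() / (s ** 2).sum())
print(F_hat.shape, round(explained, 3))
```

Because factors and loadings are only identified up to an invertible rotation, `F_hat` should be compared with `F` through the space they span rather than column by column.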

### Mapping

| Model Element | Interpretation |
|---------------|----------------|
| Observed variables | Daily metrics (Counts by department, age group, diagnosis group) |
| Latent factors | Underlying patient-type intensities that jointly influence those metrics |
| Factor loadings | How strongly each variable loads on each latent patient type |
| Factor dynamics | How those patient types evolve over time (trends, cycles, shocks) |

|Authors                                                 |Journal                                       |Paper Title                                                                                        |Date         |Link                                                                                                                |
|--------------------------------------------------------|----------------------------------------------|---------------------------------------------------------------------------------------------------|-------------|--------------------------------------------------------------------------------------------------------------------|
|Jushan Bai; Serena Ng                                   |Econometrica                                  |Determining the Number of Factors in Approximate Factor Models                                     |January 2002 |<https://ideas.repec.org/h/nbr/nberch/6670.html>                                                                    |
|Ben S. Bernanke; Jean Boivin; Piotr Eliasz              |The Quarterly Journal of Economics            |Measuring the Effects of Monetary Policy: A Factor-Augmented Vector Autoregressive (FAVAR) Approach|February 2005|<https://academic.oup.com/qje/article/120/1/387/1931468>                                                            |
|Ben S. Bernanke; Jean Boivin                            |Journal of Monetary Economics                 |Monetary policy in a data-rich environment                                                         |April 2003   |<https://www.sciencedirect.com/science/article/abs/pii/S0304393203000242>                                           |
|Mario Forni; Marc Hallin; Marco Lippi; Lucrezia Reichlin|The Review of Economics and Statistics        |The Generalized Dynamic-Factor Model: Identification and Estimation                                |November 2000|<https://direct.mit.edu/rest/article/82/4/540/57226/The-Generalized-Dynamic-Factor-Model>                           |
|James H. Stock; Mark W. Watson                          |Journal of Business & Economic Statistics     |Macroeconomic Forecasting Using Diffusion Indexes                                                  |April 2002   |<https://www.tandfonline.com/doi/abs/10.1198/073500102317351921>                                                    |
|James H. Stock; Mark W. Watson                          |NBER Working Paper / Conference Volume        |Implications of Dynamic Factor Models for VAR Analysis (NBER WP w11467; conference chapter)        |July 2005    |<https://papers.ssrn.com/sol3/papers.cfm?abstract_id=755703> • <https://www.princeton.edu/~mwatson/papers/favar.pdf>|
|Domenico Giannone; Lucrezia Reichlin; Luca Sala         |NBER Macroeconomics Annual (MIT Press chapter)|Monetary Policy in Real Time                                                                       |April 2005   |<https://www.nber.org/books-and-chapters/nber-macroeconomics-annual-2004-volume-19/monetary-policy-real-time>       |

Modeling the future path of the latent factors with a VAR is standard practice.

In dynamic factor models, the latent factors typically follow a low-order VAR because it is parsimonious and easy to estimate in state-space form. It also aligns with the identification schemes used in FAVARs.

|Time|F_t (latent factor)        |y_t (observed target)             |
|----|---------------------------|----------------------------------|
|t   |F_t                        |y_t                               |
|t+1 |F_{t+1} (predicted via VAR)|y_{t+1} (from model using F_{t+1})|
|t+2 |F_{t+2} (predicted via VAR)|y_{t+2} (from model using F_{t+2})|
|t+3 |F_{t+3} (predicted via VAR)|y_{t+3} (from model using F_{t+3})|

1. Future values of F_t can be predicted with a VAR.
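
The forecasting step in the table can be sketched as follows: fit a VAR(1) on the factor history by least squares, then iterate the fitted recursion to obtain F_{t+1}, F_{t+2}, F_{t+3}. This is a simplified stand-in under stated assumptions; the helper names `fit_var1` and `forecast_var1` and the toy data are hypothetical, and the pipeline's own VAR step may differ.

```python
import numpy as np

def fit_var1(F):
    """Least-squares fit of F_t = c + A F_{t-1} + u_t. Returns (c, A)."""
    Y = F[1:]                                           # targets F_t
    Z = np.hstack([np.ones((len(F) - 1, 1)), F[:-1]])   # regressors [1, F_{t-1}]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)           # B has shape (1 + r, r)
    return B[0], B[1:].T

def forecast_var1(c, A, f_last, horizon):
    """Iterate the fitted recursion forward from the last observed factor."""
    path, f = [], f_last
    for _ in range(horizon):
        f = c + A @ f
        path.append(f)
    return np.array(path)

# Toy factors generated from a known stable VAR(1)
rng = np.random.default_rng(1)
A_true = np.array([[0.7, 0.1], [0.0, 0.5]])
F = np.zeros((300, 2))
for t in range(1, 300):
    F[t] = A_true @ F[t - 1] + rng.normal(scale=0.5, size=2)

c_hat, A_hat = fit_var1(F)
F_future = forecast_var1(c_hat, A_hat, F[-1], horizon=3)   # rows: F_{t+1}, F_{t+2}, F_{t+3}
print(F_future.shape)
```

Each forecast row would then be passed to the model that maps factors to y.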

### Experiment list

1. Model the relationship between F_t and y_t with a non-parametric, non-linear model (Random Forest).
  - Why RF? Bagging (bootstrap aggregation) averages many decorrelated trees, which reduces variance, so RF tends to be more robust than boosting methods on small, noisy datasets.
2. Make the mapping from F_t to y_t time-series aware instead of purely cross-sectional.
  - Currently: $y_t = f(F_t)$
  - Turn into: $y_t = f(F_t, F_{t-1}, \ldots, F_{t-k})$
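
One way to build the inputs for experiment 2 is to stack lagged copies of the factors next to the current ones. The sketch below assumes the factors are held in a DataFrame indexed by week; the helper `add_factor_lags` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def add_factor_lags(factors: pd.DataFrame, k: int) -> pd.DataFrame:
    """Return columns for F_t, F_{t-1}, ..., F_{t-k}; rows without a full history are dropped."""
    lagged = [factors.shift(lag).add_suffix(f"_lag{lag}") for lag in range(k + 1)]
    return pd.concat(lagged, axis=1).dropna()

# Toy factor frame: 6 weekly observations of 2 factors (values are arbitrary)
idx = pd.date_range("2024-01-07", periods=6, freq="W")
factors = pd.DataFrame({"f1": np.arange(6.0), "f2": np.arange(6.0) * 10}, index=idx)

X = add_factor_lags(factors, k=2)
print(X.shape)
```

The target series has to be aligned to `X.index` before fitting, since the first `k` rows are dropped. Any regressor, including a random forest, can then be trained on `X` without further changes.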


In [1]:
import os
import sys
root_dir = os.path.abspath('../')
sys.path.append(root_dir)

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# --- Color palette: 8 colors sampled from viridis ---
palette = sns.color_palette("viridis", 8)

sns.set_theme(
    style="whitegrid",
    font="sans-serif",
    rc={
        "font.size": 13,
        "axes.titlesize": 16,
        "axes.titleweight": "bold",
        "axes.labelsize": 13,
        "axes.labelweight": "semibold",
        "axes.edgecolor": "#2F2F2F",
        "axes.linewidth": 0.8,
        "grid.color": "#CCCCCC",
        "grid.linewidth": 0.6,
        "grid.alpha": 0.35,
        "figure.facecolor": "#FAFAFA",
        "axes.facecolor": "#FFFFFF",
        "axes.prop_cycle": plt.cycler("color", palette),
    }
)

In [2]:
# Dynamic Factor Model - engineered data
dfm_data = pd.read_parquet(
    os.path.join(root_dir, "data/processed/hana_ent/dfm_daily.parquet")
)
full_range = pd.date_range(dfm_data.index.min(), dfm_data.index.max(), freq='D')
dfm_data = dfm_data.reindex(full_range)
dfm_data = dfm_data.fillna(0)

s = dfm_data[[c for c in dfm_data.columns if c == 'male' or c == 'female']]
a = dfm_data[[c for c in dfm_data.columns if c.startswith('age')]]
# Department counts: ENT, Internal Medicine, and all remaining "dep_" columns pooled
main_deps = ['dep_Ear, Nose and Throat', 'dep_Internal Medicine']
other_deps = [c for c in dfm_data.columns if c.startswith('dep_') and c not in main_deps]
p = pd.DataFrame({
    'dep_Ear, Nose, Throat': dfm_data['dep_Ear, Nose and Throat'],
    'dep_Internal Medicine': dfm_data['dep_Internal Medicine'],
    # Only department columns; sex, age and diagnosis columns are excluded
    'dep_Other': dfm_data[other_deps].sum(axis=1),
})
# Diagnosis counts pooled by the leading letter of the column name
diag_letters = ['A', 'B', 'C', 'D', 'H', 'J', 'M', 'N', 'R', 'V']
d = pd.DataFrame({
    letter: dfm_data[[c for c in dfm_data.columns if c.startswith(letter)]].sum(axis=1)
    for letter in diag_letters
})

dfm_data_reduced = s.join(a).join(p).join(d)

# Resample to daily frequency
dfm_data_daily = dfm_data.resample('D').sum()
dfm_data_reduced_daily = dfm_data_reduced.resample('D').sum()

# Resample to weekly frequency
dfm_data_weekly = dfm_data.resample('W').sum()
dfm_data_reduced_weekly = dfm_data_reduced.resample('W').sum()

# Supply data
supply = pd.read_parquet(
    os.path.join(root_dir, "data/processed/hana_ent/supply.parquet")
)
full_range = pd.date_range(supply.index.min(), supply.index.max(), freq='D')
supply = supply.reindex(full_range)
supply = supply.fillna(0)

# Resample to weekly frequency
supply = supply.resample('W').sum()

##### V1 Run / VAR + Linear / Weekly

In [None]:
# Using Linear Regression
from pipeline.dynamic_factor.hana_ent.v1.run import DFMConfig, RegressionConfig, RollingConfig
from pipeline.dynamic_factor.hana_ent.v1.run import rolling_method_dfm
from pipeline.dynamic_factor.hana_ent.v1.run import compare_pred_with_actual

# DFM configs with 3, 5 and 7 factors
dfm_cfg_f3 = DFMConfig(n_factors=3, factor_order=1, var_clipping=True)
dfm_cfg_f5 = DFMConfig(n_factors=5, factor_order=1, var_clipping=True)
dfm_cfg_f7 = DFMConfig(n_factors=7, factor_order=1, var_clipping=True)
reg_cfg = RegressionConfig()

roll_cfg = RollingConfig(
    window_type="rolling",
    window_size=104,      # for example: 2 years of weekly data
    forecast_horizon=4,   # 4 weeks ahead
    min_train_size=80     # first forecast after ~1.5 years
)

pred_linear_f3, factor_linear_f3 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f3,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

pred_linear_f5, factor_linear_f5 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f5,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

pred_linear_f7, factor_linear_f7 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f7,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

comparison_linear_f3 = compare_pred_with_actual(
    pred=pred_linear_f3,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

comparison_linear_f5 = compare_pred_with_actual(
    pred=pred_linear_f5,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

comparison_linear_f7 = compare_pred_with_actual(
    pred=pred_linear_f7,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)
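
The internals of `rolling_method_dfm` are not shown in this notebook. As a rough illustration of what a rolling-origin evaluation with the settings above could look like, the sketch below enumerates train and test index ranges. The function `rolling_splits`, its `step` argument, and the way `min_train_size` and `window_size` interact are assumptions made for illustration, not the pipeline's actual logic.

```python
def rolling_splits(n_obs, window_size, forecast_horizon, min_train_size, step=1):
    """Yield (train_start, train_end, test_end) index triples for rolling-origin evaluation."""
    for origin in range(min_train_size, n_obs - forecast_horizon + 1, step):
        train_start = max(0, origin - window_size)   # keep at most window_size points
        yield train_start, origin, origin + forecast_horizon

# 120 weekly observations with the same settings as RollingConfig above
splits = list(rolling_splits(n_obs=120, window_size=104, forecast_horizon=4, min_train_size=80))
print(len(splits), splits[0], splits[-1])
```

Under this reading, the training window grows from 80 to 104 weeks and then slides forward, and every forecast uses only data available before its origin.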

##### V2 Run / VAR + Random Forest / Weekly

In [None]:
from pipeline.dynamic_factor.hana_ent.v2.run import DFMConfig, RFConfig, RollingConfig
from pipeline.dynamic_factor.hana_ent.v2.run import rolling_method_dfm
from pipeline.dynamic_factor.hana_ent.v2.run import compare_pred_with_actual

# DFM configs with 3, 5 and 7 factors
dfm_cfg_f3 = DFMConfig(n_factors=3, factor_order=1, var_clipping=True)
dfm_cfg_f5 = DFMConfig(n_factors=5, factor_order=1, var_clipping=True)
dfm_cfg_f7 = DFMConfig(n_factors=7, factor_order=1, var_clipping=True)
reg_cfg = RFConfig()

roll_cfg = RollingConfig(
    window_type="rolling",
    window_size=104,      # for example: 2 years of weekly data
    forecast_horizon=4,   # 4 weeks ahead
    min_train_size=80     # first forecast after ~1.5 years
)

pred_rf_f3, factor_rf_f3 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f3,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

pred_rf_f5, factor_rf_f5 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f5,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

pred_rf_f7, factor_rf_f7 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f7,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

comparison_rf_f3 = compare_pred_with_actual(
    pred=pred_rf_f3,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

comparison_rf_f5 = compare_pred_with_actual(
    pred=pred_rf_f5,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

comparison_rf_f7 = compare_pred_with_actual(
    pred=pred_rf_f7,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

##### Feedback & Assessment

*The predictions look too smooth.*

##### Reason 1. DFM setup (primary cause)

* error_cov_type = "scalar"
* Kalman-filtered factors
* factor_order = 1

This implies that:

1. Idiosyncratic shocks are pooled into a single shared variance.
2. High-frequency movements are treated as noise.
3. Latent factors are forced to evolve smoothly.
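
The link between the assumed noise level and smoothness can be seen in a much simpler setting than the DFM. The sketch below runs a scalar Kalman filter for a local-level model twice on the same series, once with a small and once with a large observation variance. It is a univariate toy under assumed variances, not the project's DFM, and `local_level_filter` is a hypothetical helper.

```python
import numpy as np

def local_level_filter(y, obs_var, state_var):
    """Scalar Kalman filter for y_t = a_t + e_t, a_t = a_{t-1} + w_t."""
    a, P = y[0], obs_var            # initialize at the first observation
    out = [a]
    for obs in y[1:]:
        P = P + state_var           # predict: state uncertainty grows
        K = P / (P + obs_var)       # Kalman gain, between 0 and 1
        a = a + K * (obs - a)       # update toward the new observation
        P = (1 - K) * P
        out.append(a)
    return np.array(out)

def roughness(series):
    """Standard deviation of period-to-period changes."""
    return float(np.std(np.diff(series)))

rng = np.random.default_rng(2)
y = np.cumsum(rng.normal(scale=0.3, size=200)) + rng.normal(scale=1.0, size=200)

responsive = local_level_filter(y, obs_var=0.1, state_var=1.0)   # trusts the data
smooth = local_level_filter(y, obs_var=10.0, state_var=0.01)     # treats moves as noise

print(roughness(y), roughness(responsive), roughness(smooth))
```

When the filter attributes most of the variation to observation noise, the gain is small and the filtered state barely reacts to new data, which is the same mechanism that flattens the extracted factors.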


##### Reason 2. VAR

The VAR adds:
* Linear Gaussian dynamics
* Mean-reverting behavior
* Further attenuation of residual volatility

The inputs are already smooth, so the VAR forecasts become even smoother, particularly with a low lag order.
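
The mean-reverting behavior can be checked numerically. For a stationary VAR(1) with zero mean, the h-step forecast is $A^h f_t$, so its size shrinks geometrically with the horizon. The coefficient matrix and starting point below are hypothetical values picked for illustration.

```python
import numpy as np

A = np.array([[0.6, 0.1],
              [0.0, 0.5]])          # hypothetical stable VAR(1) coefficient matrix
f_last = np.array([2.0, -1.5])      # last factor value, far from the zero mean

# Spectral radius below 1 means the VAR is stationary and forecasts revert to the mean
spectral_radius = float(max(abs(np.linalg.eigvals(A))))

# The h-step forecast is A^h f_last; its norm decays as h grows
norms = [float(np.linalg.norm(np.linalg.matrix_power(A, h) @ f_last)) for h in range(5)]
print(round(spectral_radius, 2), [round(n, 3) for n in norms])
```

At a 4-week horizon most of the initial deviation has already decayed, so the forecast factors sit close to their mean regardless of how volatile the recent history was.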

##### Solution

* Fix the DFM setup (Report 1).
  * Baseline (cfg1)
  * Increase the factor order from 1 to 2 (cfg2)
  * Change the error covariance type from scalar (smoothed) to diagonal (cfg5) or unstructured (cfg6)
  * Change the error order (cfg3, cfg4) **:: FAILED ::** Not supported, although the theory is valid.
  * Use predicted factors, which are less smoothed, instead of filtered factors (cfg7) **:: FAILED ::**
* TODO: Fix the VAR setup -> replace with a non-linear model
* TODO: Fix the time horizon ->

##### V2 Run / VAR + Random Forest / Weekly / More DFM setup

In [None]:
from pipeline.dynamic_factor.hana_ent.v2.run import DFMConfig, RFConfig, RollingConfig
from pipeline.dynamic_factor.hana_ent.v2.run import rolling_method_dfm
from pipeline.dynamic_factor.hana_ent.v2.run import compare_pred_with_actual

# Baseline (3 factors)
dfm_cfg_baseline = DFMConfig(n_factors=3, factor_order=1, error_order=0, error_cov_type="scalar", use_factors="filtered", var_clipping=True)
# Factor order 2
dfm_cfg__factor_order2 = DFMConfig(n_factors=3, factor_order=2, var_clipping=True)
# Error order 1 & 2
dfm_cfg__error_order1 = DFMConfig(n_factors=3, factor_order=1, error_order=1, var_clipping=True)
dfm_cfg__error_order2 = DFMConfig(n_factors=3, factor_order=1, error_order=2, var_clipping=True)
# Error Covariance Type
dfm_cfg__error_cov_type_diagonal = DFMConfig(n_factors=3, factor_order=1, error_cov_type="diagonal", var_clipping=True)
dfm_cfg__error_cov_type_unstructured = DFMConfig(n_factors=3, factor_order=1, error_cov_type="unstructured", var_clipping=True)
# Use filtered, predicted, or smoothed factors
dfm_cfg__use_predicted = DFMConfig(n_factors=3, factor_order=1, use_factors="predicted", var_clipping=True)

reg_cfg = RFConfig()

roll_cfg = RollingConfig(
    window_type="rolling",
    window_size=104,      # for example: 2 years of weekly data
    forecast_horizon=4,   # 4 weeks ahead
    min_train_size=80     # first forecast after ~1.5 years
)

pred_cfg1, ftr_cfg1 = rolling_method_dfm(dfm_data=dfm_data_reduced_weekly, supply_data=supply, dfm_cfg=dfm_cfg_baseline, reg_cfg=reg_cfg, roll_cfg=roll_cfg)
pred_cfg2, ftr_cfg2 = rolling_method_dfm(dfm_data=dfm_data_reduced_weekly, supply_data=supply, dfm_cfg=dfm_cfg__factor_order2, reg_cfg=reg_cfg, roll_cfg=roll_cfg)
# pred_cfg3, ftr_cfg3 = rolling_method_dfm(dfm_data=dfm_data_reduced_weekly, supply_data=supply, dfm_cfg=dfm_cfg__error_order1, reg_cfg=reg_cfg, roll_cfg=roll_cfg)
# pred_cfg4, ftr_cfg4 = rolling_method_dfm(dfm_data=dfm_data_reduced_weekly, supply_data=supply, dfm_cfg=dfm_cfg__error_order2, reg_cfg=reg_cfg, roll_cfg=roll_cfg)
pred_cfg5, ftr_cfg5 = rolling_method_dfm(dfm_data=dfm_data_reduced_weekly, supply_data=supply, dfm_cfg=dfm_cfg__error_cov_type_diagonal, reg_cfg=reg_cfg, roll_cfg=roll_cfg)
pred_cfg6, ftr_cfg6 = rolling_method_dfm(dfm_data=dfm_data_reduced_weekly, supply_data=supply, dfm_cfg=dfm_cfg__error_cov_type_unstructured, reg_cfg=reg_cfg, roll_cfg=roll_cfg)
pred_cfg7, ftr_cfg7 = rolling_method_dfm(dfm_data=dfm_data_reduced_weekly, supply_data=supply, dfm_cfg=dfm_cfg__use_predicted, reg_cfg=reg_cfg, roll_cfg=roll_cfg)

comp_cfg1 = compare_pred_with_actual(pred=pred_cfg1, supply_true=supply, target_cols=reg_cfg.target_columns)
comp_cfg2 = compare_pred_with_actual(pred=pred_cfg2, supply_true=supply, target_cols=reg_cfg.target_columns)
# comp_cfg3 = compare_pred_with_actual(pred=pred_cfg3, supply_true=supply, target_cols=reg_cfg.target_columns)
# comp_cfg4 = compare_pred_with_actual(pred=pred_cfg4, supply_true=supply, target_cols=reg_cfg.target_columns)
comp_cfg5 = compare_pred_with_actual(pred=pred_cfg5, supply_true=supply, target_cols=reg_cfg.target_columns)
comp_cfg6 = compare_pred_with_actual(pred=pred_cfg6, supply_true=supply, target_cols=reg_cfg.target_columns)
# comp_cfg7 = compare_pred_with_actual(pred=pred_cfg7, supply_true=supply, target_cols=reg_cfg.target_columns)

##### Reporting process

In [16]:
from pipeline.reporting.chart_pred_true_ts import compute_mse_per_target
from pipeline.reporting.plot_pred_true_ts import plot_separate_graphs_better

In [None]:
# Export the prediction-vs-actual comparison for each model to xlsx

# V1
comparison_linear_f3.to_excel("comparison_linear_f3.xlsx")
comparison_linear_f5.to_excel("comparison_linear_f5.xlsx")
comparison_linear_f7.to_excel("comparison_linear_f7.xlsx")

# V2
comparison_rf_f3.to_excel("comparison_rf_f3.xlsx")
comparison_rf_f5.to_excel("comparison_rf_f5.xlsx")
comparison_rf_f7.to_excel("comparison_rf_f7.xlsx")

# V2. DFM Setup Change (Random Forest runs)
comp_cfg1.to_excel("comparison_rf_f3_baseline.xlsx")
comp_cfg2.to_excel("comparison_rf_f3_factor_order2.xlsx")
comp_cfg5.to_excel("comparison_rf_f3_error_cov_type_diagonal.xlsx")
comp_cfg6.to_excel("comparison_rf_f3_error_cov_type_unstructured.xlsx")

In [None]:
# Charts

# V1
plot_separate_graphs_better(comparison_linear_f3, supply.columns, "Linear 3 Factors", os.path.join(root_dir, "notebook/static/v2/linear_dfm3"), is_weekly=True)
plot_separate_graphs_better(comparison_linear_f5, supply.columns, "Linear 5 Factors", os.path.join(root_dir, "notebook/static/v2/linear_dfm5"), is_weekly=True)
plot_separate_graphs_better(comparison_linear_f7, supply.columns, "Linear 7 Factors", os.path.join(root_dir, "notebook/static/v2/linear_dfm7"), is_weekly=True)

# V2
plot_separate_graphs_better(comparison_rf_f3, supply.columns, "RF 3 Factors", os.path.join(root_dir, "notebook/static/v2/rf_dfm3"), is_weekly=True)
plot_separate_graphs_better(comparison_rf_f5, supply.columns, "RF 5 Factors", os.path.join(root_dir, "notebook/static/v2/rf_dfm5"), is_weekly=True)
plot_separate_graphs_better(comparison_rf_f7, supply.columns, "RF 7 Factors", os.path.join(root_dir, "notebook/static/v2/rf_dfm7"), is_weekly=True)

# V2. DFM Setup Change
plot_separate_graphs_better(comp_cfg1, supply.columns, "Baseline", os.path.join(root_dir, "notebook/static/v2_dfm_setup/baseline"), is_weekly=True)
plot_separate_graphs_better(comp_cfg2, supply.columns, "Factor Order 2", os.path.join(root_dir, "notebook/static/v2_dfm_setup/factor_order2"), is_weekly=True)
plot_separate_graphs_better(comp_cfg5, supply.columns, "Error Covariance Type Diagonal", os.path.join(root_dir, "notebook/static/v2_dfm_setup/error_cov_type_diagonal"), is_weekly=True)
plot_separate_graphs_better(comp_cfg6, supply.columns, "Error Covariance Type Unstructured", os.path.join(root_dir, "notebook/static/v2_dfm_setup/error_cov_type_unstructured"), is_weekly=True)

In [19]:
mses = pd.DataFrame({
    "Baseline": compute_mse_per_target(comp_cfg1),
    "Factor Order 2": compute_mse_per_target(comp_cfg2),
    "Error Covariance Type Diagonal": compute_mse_per_target(comp_cfg5),
    "Error Covariance Type Unstructured": compute_mse_per_target(comp_cfg6),
})

mses.to_excel("mses.xlsx")
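
The body of `compute_mse_per_target` is not shown here. As a sketch of one way to compute a per-target MSE, the example below assumes the comparison frame has two-level columns of the form (target, "pred") and (target, "actual"); that layout, the labels, and the helper `mse_per_target` are assumptions. It selects each level with `xs` instead of `groupby(..., axis=1)`, which recent pandas releases deprecate.

```python
import pandas as pd

def mse_per_target(comparison: pd.DataFrame) -> pd.Series:
    """MSE per top-level column, given hypothetical (target, 'pred' / 'actual') columns."""
    pred = comparison.xs("pred", axis=1, level=1)       # one column per target
    actual = comparison.xs("actual", axis=1, level=1)
    return ((pred - actual) ** 2).mean()                # mean over rows, per target

# Toy comparison frame with two targets and two time steps
cols = pd.MultiIndex.from_product([["item_a", "item_b"], ["pred", "actual"]])
comparison = pd.DataFrame(
    [[1.0, 2.0, 10.0, 10.0],
     [3.0, 3.0, 12.0, 14.0]],
    columns=cols,
)
mses_sketch = mse_per_target(comparison)
print(mses_sketch)
```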

