### Goal

We want to model patient embeddings with a linear factor model:

$x_{i,t}=\Lambda f_t + \epsilon_{i,t}$

* $i$ = patient index
* $t$ = time (irregular visits per patient)
* $x_{i,t} \in \mathbb{R}^p$ = patient embedding (a noisy realization driven by the latent factors)
* $f_t \in \mathbb{R}^r$ = shared latent temporal factors (underlying population health states)
* $\Lambda \in \mathbb{R}^{p\times r}$ = factor loadings (relationship between embeddings and latent factors)
* $\epsilon_{i,t} \in \mathbb{R}^p$ = idiosyncratic noise
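
To make the notation concrete, here is a minimal sketch that simulates data from the model above and recovers the factors with principal components. It is an illustration only: the sizes `T`, `p`, `r` and the noise scale are made-up assumptions, the patient index $i$ is dropped so that there is one observation per time point, and the project pipeline may estimate the factors differently (for example with a state-space DFM).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): time points, observed dimension, number of factors
T, p, r = 200, 10, 2

Lambda = rng.normal(size=(p, r))        # factor loadings, p x r
f = rng.normal(size=(T, r))             # latent factors, one row per time point
eps = 0.1 * rng.normal(size=(T, p))     # idiosyncratic noise
X = f @ Lambda.T + eps                  # row t is x_t = Lambda f_t + eps_t

# Principal-components estimate: the top-r singular directions of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
f_hat = U[:, :r] * S[:r]                # estimated factors, T x r
Lambda_hat = Vt[:r].T                   # estimated loadings, p x r

# Share of total variance captured by the first r components
explained = (S[:r] ** 2).sum() / (S ** 2).sum()
```

The estimated factors are identified only up to an invertible rotation, so `f_hat` should be compared with `f` through the space they span rather than column by column.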

### Mapping

| Model Element | Interpretation |
|---------------|----------------|
| Observed variables | Daily metrics (counts by department, age group, diagnosis group) |
| Latent factors | Underlying patient-type intensities that jointly influence those metrics |
| Factor loadings | How strongly each variable loads on each latent patient type |
| Factor dynamics | How those patient types evolve over time (trends, cycles, shocks) |

|Authors                                                 |Journal                                       |Paper Title                                                                                        |Date         |Link                                                                                                                |
|--------------------------------------------------------|----------------------------------------------|---------------------------------------------------------------------------------------------------|-------------|--------------------------------------------------------------------------------------------------------------------|
|Jushan Bai; Serena Ng                                   |Econometrica                                  |Determining the Number of Factors in Approximate Factor Models                                     |January 2002 |<https://ideas.repec.org/h/nbr/nberch/6670.html>                                                                    |
|Ben S. Bernanke; Jean Boivin; Piotr Eliasz              |The Quarterly Journal of Economics            |Measuring the Effects of Monetary Policy: A Factor-Augmented Vector Autoregressive (FAVAR) Approach|February 2005|<https://academic.oup.com/qje/article/120/1/387/1931468>                                                            |
|Ben S. Bernanke; Jean Boivin                            |Journal of Monetary Economics                 |Monetary policy in a data-rich environment                                                         |April 2003   |<https://www.sciencedirect.com/science/article/abs/pii/S0304393203000242>                                           |
|Mario Forni; Marc Hallin; Marco Lippi; Lucrezia Reichlin|The Review of Economics and Statistics        |The Generalized Dynamic-Factor Model: Identification and Estimation                                |November 2000|<https://direct.mit.edu/rest/article/82/4/540/57226/The-Generalized-Dynamic-Factor-Model>                           |
|James H. Stock; Mark W. Watson                          |Journal of Business & Economic Statistics     |Macroeconomic Forecasting Using Diffusion Indexes                                                  |April 2002   |<https://www.tandfonline.com/doi/abs/10.1198/073500102317351921>                                                    |
|James H. Stock; Mark W. Watson                          |NBER Working Paper / Conference Volume        |Implications of Dynamic Factor Models for VAR Analysis (NBER WP w11467; conference chapter)        |July 2005    |<https://papers.ssrn.com/sol3/papers.cfm?abstract_id=755703> • <https://www.princeton.edu/~mwatson/papers/favar.pdf>|
|Domenico Giannone; Lucrezia Reichlin; Luca Sala         |NBER Macroeconomics Annual (MIT Press chapter)|Monetary Policy in Real Time                                                                       |April 2005   |<https://www.nber.org/books-and-chapters/nber-macroeconomics-annual-2004-volume-19/monetary-policy-real-time>       |

Modeling future latent factors with a VAR is standard practice.

In dynamic factor models, the latent factors typically follow a low-order VAR because it is parsimonious and easy to estimate in state-space form. It also aligns with the identification schemes used in FAVARs.

|Time|F_t (latent factor)        |y_t (observed target)             |
|----|---------------------------|----------------------------------|
|t   |F_t                        |y_t                               |
|t+1 |F_{t+1} (predicted via VAR)|y_{t+1} (from model using F_{t+1})|
|t+2 |F_{t+2} (predicted via VAR)|y_{t+2} (from model using F_{t+2})|
|t+3 |F_{t+3} (predicted via VAR)|y_{t+3} (from model using F_{t+3})|

1. Future values of F_t can be forecast with a VAR; each forecast F_{t+h} is then passed to the model that maps factors to y_{t+h}.
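
The forecasting step in the table can be sketched as follows. This is a hedged, NumPy-only illustration of a VAR(1) fitted by least squares and iterated forward; the helper names `fit_var1` and `forecast_var1` and the simulated data are hypothetical and are not part of the project pipeline, which may rely on a library implementation instead.

```python
import numpy as np

def fit_var1(F):
    """Least-squares fit of F_t = c + A F_{t-1} + u_t, where F has shape (T, r)."""
    Y = F[1:]                                           # F_t
    Z = np.hstack([np.ones((len(F) - 1, 1)), F[:-1]])   # [1, F_{t-1}]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)           # B has shape (1 + r, r)
    c, A = B[0], B[1:].T
    return c, A

def forecast_var1(c, A, last, horizon):
    """Iterate the fitted recursion forward from the last observed factor vector."""
    path, current = [], last
    for _ in range(horizon):
        current = c + A @ current
        path.append(current)
    return np.vstack(path)

# Simulated factors from a known VAR(1) (illustrative values only)
rng = np.random.default_rng(1)
A_true = np.array([[0.7, 0.1], [0.0, 0.5]])
F = np.zeros((500, 2))
for t in range(1, 500):
    F[t] = A_true @ F[t - 1] + 0.1 * rng.normal(size=2)

c_hat, A_hat = fit_var1(F)
F_future = forecast_var1(c_hat, A_hat, F[-1], horizon=3)  # rows: F_{t+1}, F_{t+2}, F_{t+3}
```

Each row of `F_future` plays the role of one "predicted via VAR" entry in the table; the target forecast is then obtained by applying the fitted factor-to-target model to that row.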

### Experiment list

1. Model the relationship between F_t and y_t with a non-parametric, non-linear model (random forest).
   - Why RF? Bagging averages many trees trained on bootstrap samples, which reduces variance, so RF is often more robust than boosting methods on small, noisy datasets.
2. Move the model linking y_t and F_t from cross-sectional to time-series aware.
   - Currently: $y_t = f(F_t)$
   - Proposed: $y_t = f(F_t, F_{t-1}, \ldots, F_{t-k})$
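
The second experiment amounts to widening the feature matrix with lagged factors. A minimal sketch with pandas is shown here; the helper `add_factor_lags`, the column names `f1` and `f2`, and the toy values are assumptions made for illustration, not the pipeline's actual interface.

```python
import numpy as np
import pandas as pd

def add_factor_lags(factors: pd.DataFrame, k: int) -> pd.DataFrame:
    """Return columns for F_t, F_{t-1}, ..., F_{t-k}; the first k rows are dropped."""
    lagged = [factors.shift(lag).add_suffix(f"_lag{lag}") for lag in range(k + 1)]
    return pd.concat(lagged, axis=1).dropna()

# Toy weekly factor table (values are placeholders)
factors = pd.DataFrame(
    np.arange(12, dtype=float).reshape(6, 2),
    columns=["f1", "f2"],
    index=pd.date_range("2018-01-07", periods=6, freq="W"),
)

X_lagged = add_factor_lags(factors, k=2)
```

The targets must be aligned to `X_lagged.index` before fitting, since the first `k` weeks have no complete lag history. The resulting matrix can be passed to any regressor, including a random forest.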


In [None]:
import os
import sys
root_dir = os.path.abspath('../')
sys.path.append(root_dir)

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Viridis palette shared by all plots
palette = sns.color_palette("viridis", 8)

sns.set_theme(
    style="whitegrid",
    font="sans-serif",
    rc={
        "font.size": 13,
        "axes.titlesize": 16,
        "axes.titleweight": "bold",
        "axes.labelsize": 13,
        "axes.labelweight": "semibold",
        "axes.edgecolor": "#2F2F2F",
        "axes.linewidth": 0.8,
        "grid.color": "#CCCCCC",
        "grid.linewidth": 0.6,
        "grid.alpha": 0.35,
        "figure.facecolor": "#FAFAFA",
        "axes.facecolor": "#FFFFFF",
        "axes.prop_cycle": plt.cycler("color", palette),
    }
)

In [2]:
# Dynamic Factor Model - engineered data
dfm_data = pd.read_parquet(
    os.path.join(root_dir, "data/processed/hana_ent/dfm_daily.parquet")
)
full_range = pd.date_range(dfm_data.index.min(), dfm_data.index.max(), freq='D')
dfm_data = dfm_data.reindex(full_range)
dfm_data = dfm_data.fillna(0)

s = dfm_data[[c for c in dfm_data.columns if c == 'male' or c == 'female']]
a = dfm_data[[c for c in dfm_data.columns if c.startswith('age')]]
p = pd.DataFrame({
    'dep_Ear, Nose, Throat': dfm_data[[c for c in dfm_data.columns if c == 'dep_Ear, Nose and Throat']].sum(axis=1),
    'dep_Internal Medicine': dfm_data[[c for c in dfm_data.columns if c == 'dep_Internal Medicine']].sum(axis=1),
    # Pool only the remaining dep_* columns (not the sex, age, or diagnosis columns)
    'dep_Other': dfm_data[[c for c in dfm_data.columns if c.startswith('dep_') and c not in ('dep_Ear, Nose and Throat', 'dep_Internal Medicine')]].sum(axis=1),
})
# Collapse diagnosis codes to their leading letter (e.g. A01, A07 -> A)
d = pd.DataFrame({
    letter: dfm_data[[c for c in dfm_data.columns if c.startswith(letter)]].sum(axis=1)
    for letter in 'ABCDHJMNRV'
})

dfm_data_reduced = s.join(a).join(p).join(d)

# Resample to weekly frequency
dfm_data_weekly = dfm_data.resample('W').sum()
dfm_data_reduced_weekly = dfm_data_reduced.resample('W').sum()

# Supply data
supply = pd.read_parquet(
    os.path.join(root_dir, "data/processed/hana_ent/supply.parquet")
)
full_range = pd.date_range(supply.index.min(), supply.index.max(), freq='D')
supply = supply.reindex(full_range)
supply = supply.fillna(0)

# Resample to weekly frequency
supply = supply.resample('W').sum()

In [3]:
dfm_data_weekly.head(2)

Unnamed: 0,male,female,age_0_10,age_10_20,age_20_30,age_30_40,age_40_50,age_50_60,age_60_70,age_70_80,...,M01,M03,N01,N02,N05,N07,R03,R05,R06,V03
2018-01-07,1703.0,1512.0,105.0,46.0,656.0,899.0,707.0,406.0,208.0,86.0,...,78.0,0.0,46.0,1047.0,0.0,0.0,0.0,6.0,25.0,0.0
2018-01-14,1660.0,1498.0,95.0,54.0,687.0,1030.0,651.0,349.0,188.0,88.0,...,121.0,0.0,44.0,882.0,0.0,0.0,0.0,2.0,29.0,0.0


In [4]:
# Diagnosis codes collapsed to their leading letter (e.g. AXX -> A)
# Departments collapsed to three groups
dfm_data_reduced_weekly.head(2)

Unnamed: 0,male,female,age_0_10,age_10_20,age_20_30,age_30_40,age_40_50,age_50_60,age_60_70,age_70_80,...,A,B,C,D,H,J,M,N,R,V
2018-01-07,1703.0,1512.0,105.0,46.0,656.0,899.0,707.0,406.0,208.0,86.0,...,12.0,131.0,0.0,0.0,1417.0,453.0,78.0,1093.0,31.0,0.0
2018-01-14,1660.0,1498.0,95.0,54.0,687.0,1030.0,651.0,349.0,188.0,88.0,...,6.0,106.0,0.0,0.0,1351.0,617.0,121.0,926.0,31.0,0.0


In [5]:
supply[['dexamethasone', 'tramadol', 'netilmicin', 'electrolytes with carbohydrates', 'diclofenac']].head(2)

prescription,dexamethasone,tramadol,netilmicin,electrolytes with carbohydrates,diclofenac
2018-01-07,1414.0,1047.0,336.0,107.0,78.0
2018-01-14,1348.0,882.0,501.0,85.0,121.0


In [None]:
# Using Linear Regression
from pipeline.dynamic_factor.hana_ent.v1.run import DFMConfig, RegressionConfig, RollingConfig
from pipeline.dynamic_factor.hana_ent.v1.run import rolling_method_dfm
from pipeline.dynamic_factor.hana_ent.v1.run import compare_pred_with_actual

# DFM configurations with 3, 5 and 7 factors
dfm_cfg_f3 = DFMConfig(n_factors=3, factor_order=1, var_clipping=True)
dfm_cfg_f5 = DFMConfig(n_factors=5, factor_order=1, var_clipping=True)
dfm_cfg_f7 = DFMConfig(n_factors=7, factor_order=1, var_clipping=True)
reg_cfg = RegressionConfig()

roll_cfg = RollingConfig(
    window_type="rolling",
    window_size=104,      # for example: 2 years of weekly data
    forecast_horizon=4,   # 4 weeks ahead
    min_train_size=80     # first forecast after ~1.5 years
)

pred_linear_f3, factor_linear_f3 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f3,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

pred_linear_f5, factor_linear_f5 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f5,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

pred_linear_f7, factor_linear_f7 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f7,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

comparison_linear_f3 = compare_pred_with_actual(
    pred=pred_linear_f3,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

comparison_linear_f5 = compare_pred_with_actual(
    pred=pred_linear_f5,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

comparison_linear_f7 = compare_pred_with_actual(
    pred=pred_linear_f7,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

In [None]:
from pipeline.dynamic_factor.hana_ent.v2.run import DFMConfig, RFConfig, RollingConfig
from pipeline.dynamic_factor.hana_ent.v2.run import rolling_method_dfm
from pipeline.dynamic_factor.hana_ent.v2.run import compare_pred_with_actual

# Using Random Forest; DFM configurations with 3, 5 and 7 factors
dfm_cfg_f3 = DFMConfig(n_factors=3, factor_order=1, var_clipping=True)
dfm_cfg_f5 = DFMConfig(n_factors=5, factor_order=1, var_clipping=True)
dfm_cfg_f7 = DFMConfig(n_factors=7, factor_order=1, var_clipping=True)
reg_cfg = RFConfig()

roll_cfg = RollingConfig(
    window_type="rolling",
    window_size=104,      # for example: 2 years of weekly data
    forecast_horizon=4,   # 4 weeks ahead
    min_train_size=80     # first forecast after ~1.5 years
)

pred_rf_f3, factor_rf_f3 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f3,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

pred_rf_f5, factor_rf_f5 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f5,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

pred_rf_f7, factor_rf_f7 = rolling_method_dfm(
    dfm_data=dfm_data_reduced_weekly,
    supply_data=supply,
    dfm_cfg=dfm_cfg_f7,
    reg_cfg=reg_cfg,
    roll_cfg=roll_cfg
)

comparison_rf_f3 = compare_pred_with_actual(
    pred=pred_rf_f3,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

comparison_rf_f5 = compare_pred_with_actual(
    pred=pred_rf_f5,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

comparison_rf_f7 = compare_pred_with_actual(
    pred=pred_rf_f7,
    supply_true=supply,
    target_cols=reg_cfg.target_columns
)

In [17]:
def plot_separate_graphs_better(
    comparison_df: pd.DataFrame,
    target_cols: list,
    model_name: str,
    save_path: str
):
    """
    Draws separate, easy-to-read graphs for each target column.
    X-axis = Date Index
    Y-axis = predicted vs actual
    """

    # Distinct, highly contrasting colors
    pred_color = "#1f77b4"   # blue
    true_color = "#d62728"   # red

    for col in target_cols:

        # Handle MultiIndex columns: comparison_df[col]["pred"] / ["true"]
        # Or fallback if flat columns provided (though updated compare_pred_with_actual returns MultiIndex)
        if isinstance(comparison_df.columns, pd.MultiIndex):
            y_true = comparison_df[col]["true"]
            y_pred = comparison_df[col]["pred"]
        else:
            y_true = comparison_df[f"{col}_true"]
            y_pred = comparison_df[f"{col}_pred"]

        plt.figure(figsize=(14, 4))

        # Plot actual values
        plt.plot(
            comparison_df.index,
            y_true,
            label="Actual",
            linewidth=2.6,
            color=true_color,
            marker="o",
            markersize=6
        )

        # Plot predicted values
        plt.plot(
            comparison_df.index,
            y_pred,
            label="Predicted",
            linewidth=2.6,
            color=pred_color,
            marker="s",
            markersize=6
        )

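        # Build the name parts first: reusing double quotes inside an f-string is a syntax error before Python 3.12
        safe_col = col.replace(" ", "_")
        safe_model = model_name.replace(" ", "_")
        filename = f"{safe_col}__{safe_model}.png"
        plt.title(f"Predicted vs Actual ({model_name}) – {col}", weight="bold", fontsize=16)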
        plt.xlabel("Date")
        plt.ylabel("Weekly Usage")
        plt.grid(True, alpha=0.3)
        plt.legend()
        plt.tight_layout()
        os.makedirs(save_path, exist_ok=True)  # create the output directory if it does not exist
        plt.savefig(os.path.join(save_path, filename))
        plt.show()
        plt.close()

In [None]:
plot_separate_graphs_better(comparison_linear_f3, supply.columns, "Linear 3 Factors", os.path.join(root_dir, "notebook/static/v2/linear_dfm3"))
plot_separate_graphs_better(comparison_linear_f5, supply.columns, "Linear 5 Factors", os.path.join(root_dir, "notebook/static/v2/linear_dfm5"))
plot_separate_graphs_better(comparison_linear_f7, supply.columns, "Linear 7 Factors", os.path.join(root_dir, "notebook/static/v2/linear_dfm7"))

In [None]:
plot_separate_graphs_better(comparison_rf_f3, supply.columns, "RF 3 Factors", os.path.join(root_dir, "notebook/static/v2/rf_dfm3"))
plot_separate_graphs_better(comparison_rf_f5, supply.columns, "RF 5 Factors", os.path.join(root_dir, "notebook/static/v2/rf_dfm5"))
plot_separate_graphs_better(comparison_rf_f7, supply.columns, "RF 7 Factors", os.path.join(root_dir, "notebook/static/v2/rf_dfm7"))

In [None]:
def compute_mse_per_target(df: pd.DataFrame) -> pd.Series:
    """
    Compute MSE for each target (top-level column) where columns are a MultiIndex
    structured like (target, ['pred', 'true', ...]).
    """

    if not isinstance(df.columns, pd.MultiIndex):
        raise ValueError("Input DataFrame must have MultiIndex columns.")

    # Iterate over the top-level target names; DataFrame.groupby(axis=1) is deprecated in recent pandas.
    mses = {}
    for target in df.columns.get_level_values(0).unique():
        # Skip targets that lack either the "pred" or the "true" column
        if (target, "true") not in df.columns or (target, "pred") not in df.columns:
            continue

        y_true = df[(target, "true")]
        y_pred = df[(target, "pred")]
        mses[target] = np.mean((y_true - y_pred) ** 2)

    return pd.Series(mses, name="mse")

In [None]:
mses = pd.DataFrame({
    "linear_f3": compute_mse_per_target(comparison_linear_f3),
    "linear_f5": compute_mse_per_target(comparison_linear_f5),
    "linear_f7": compute_mse_per_target(comparison_linear_f7),
    "rf_f3": compute_mse_per_target(comparison_rf_f3),
    "rf_f5": compute_mse_per_target(comparison_rf_f5),
    "rf_f7": compute_mse_per_target(comparison_rf_f7),
})

mses.to_excel("mses.xlsx")

In [22]:
comparison_linear_f3.to_excel("comparison_linear_f3.xlsx")
comparison_linear_f5.to_excel("comparison_linear_f5.xlsx")
comparison_linear_f7.to_excel("comparison_linear_f7.xlsx")
comparison_rf_f3.to_excel("comparison_rf_f3.xlsx")
comparison_rf_f5.to_excel("comparison_rf_f5.xlsx")
comparison_rf_f7.to_excel("comparison_rf_f7.xlsx")