In [8]:
from imputation import core_utils, core_imputation_model_new
import numpy as np
from tqdm.notebook import tqdm

# Data Loading

`core_utils.get_data_panel` loads the data from the given `data_path`. This should point to the feather file shared on Google Drive, which is too large to host on GitHub. The function returns, in order: the characteristic percentile ranks as a numpy array of shape TxNxC, where T is the number of dates, N the number of stocks, and C the number of characteristics; the raw characteristics; the characteristic names; the dates; the returns; and the permnos.
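To make the TxNxC layout concrete, here is a minimal sketch on a synthetic panel. The toy sizes, the value range, and the use of NaN to mark missing entries are assumptions for illustration; they are not taken from the package.

```python
import numpy as np

# Synthetic stand-in for the percentile-rank panel (toy sizes, not the real data).
rng = np.random.default_rng(0)
T, N, C = 4, 6, 3                       # dates, stocks, characteristics
panel = rng.uniform(-0.5, 0.5, size=(T, N, C))

# Assumption: missing characteristics are encoded as NaN.
panel[rng.random(panel.shape) < 0.3] = np.nan

observed = ~np.isnan(panel)

# Share of observed entries per characteristic, pooled over dates and stocks.
obs_share_by_char = observed.mean(axis=(0, 1))      # shape (C,)

# Number of observed characteristics for each (date, stock) pair.
n_obs_chars = observed.sum(axis=2)                  # shape (T, N)

# Stock-dates with at least one observed characteristic.
eligible = n_obs_chars >= 1

print(obs_share_by_char.shape, n_obs_chars.shape, eligible.mean())
```

The same three reductions can be applied to `percentile_rank_chars` once it is loaded, provided it uses NaN for missing values.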

In [9]:
data_path = "data/raw_chars_returns_df_yearly_fb_monthly_avg_mergedizes.fthr"
percentile_rank_chars, raw_chars, chars, date_vals, returns, permnos = core_utils.get_data_panel(
    path=data_path, computstat_data_present_filter=True,start_date=19770000)

  0%|          | 0/528 [00:00<?, ?it/s]

In [10]:
char_groupings = core_utils.CHAR_GROUPINGS

The two methods we want to highlight are:
- `core_imputation_model_new.run_imputation`
- `core_imputation_model_new.fit_factors_and_loadings`

The first runs the full method as described in the paper, including potentially different time series information sets depending on the arguments given.

The second generates the factors and loadings. 

The examples below correspond to global and local fits; the parameters are documented in the function definitions.

# Running Imputations

In this section we will run the imputation method described in the paper.

In [11]:
T, N, L = percentile_rank_chars.shape

## Fitting Model

We first look at a local estimation. In this case, we show how to estimate either the purely cross-sectional model or the cross-sectional model with backward time-series information.

We would like to emphasize two parameters in this estimation. The first is the number of cross-sectional factors, `n_xs_factors`; the second is the cross-sectional factor regularization, `xs_factor_reg`.

These two hyperparameters have a significant impact on the performance of the model and should be chosen carefully. The parameters we use in this example are tuned for the dataset from Missing Financial Data and should not be considered default parameters for other datasets.
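To give some intuition for how the two hyperparameters interact, the sketch below estimates K factors for a single stock from its observed characteristics with a ridge-regularized regression on given loadings. This is an illustrative stand-in, not the package's implementation: the function name `estimate_factors_one_stock`, the scaling by the number of observed characteristics, and the toy data are all assumptions.

```python
import numpy as np

def estimate_factors_one_stock(x, loadings, reg):
    """Ridge regression of observed characteristics on loadings (hypothetical helper).

    x        : (C,) characteristics for one stock-date, NaN where missing
    loadings : (C, K) loading matrix
    reg      : ridge penalty; larger values shrink the factors toward zero
    """
    obs = ~np.isnan(x)
    n_obs = int(obs.sum())
    if n_obs < 1:
        raise ValueError("need at least one observed characteristic")
    lam = loadings[obs]                                  # (n_obs, K)
    k = loadings.shape[1]
    gram = lam.T @ lam / n_obs + reg * np.eye(k)         # regularized Gram matrix
    rhs = lam.T @ x[obs] / n_obs
    factors = np.linalg.solve(gram, rhs)                 # (K,)
    fitted = loadings @ factors                          # (C,), fills the missing entries too
    return factors, fitted

# Toy example: exact factor structure with every other characteristic missing.
rng = np.random.default_rng(0)
C, K = 45, 3
loadings = rng.normal(size=(C, K))
f_true = np.array([0.5, -1.0, 0.25])
x = loadings @ f_true
x[::2] = np.nan

f_small, x_hat = estimate_factors_one_stock(x, loadings, reg=1e-8)
f_big, _ = estimate_factors_one_stock(x, loadings, reg=1.0)
print(np.linalg.norm(f_small), np.linalg.norm(f_big))
```

With a negligible penalty the true factors are recovered in this noise-free toy case, while a large penalty shrinks them. More factors allow a closer fit to the observed entries, and the penalty guards against overfitting when few characteristics are observed, which is why the two should be tuned jointly.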

In [16]:
imputation = core_imputation_model_new.run_imputation(
    percentile_rank_chars, 
    n_xs_factors=20,
    time_varying_loadings=True,
    xs_factor_reg=0.00022,
    use_bw_ts_info=False, 
    include_ts_residuals=True,
    min_xs_obs=1
)

bw_xs_imputation = core_imputation_model_new.run_imputation(
    percentile_rank_chars, 
    n_xs_factors=20,
    time_varying_loadings=True,
    xs_factor_reg=0.00022,
    use_bw_ts_info=True, 
    include_ts_residuals=True,
    min_xs_obs=1
)

[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done  12 tasks      | elapsed:  3.4min
[Parallel(n_jobs=30)]: Done 102 tasks      | elapsed:  3.6min
[Parallel(n_jobs=30)]: Done 228 tasks      | elapsed:  3.8min
[Parallel(n_jobs=30)]: Done 390 tasks      | elapsed:  4.0min
[Parallel(n_jobs=30)]: Done 528 out of 528 | elapsed:  4.1min finished


  0%|          | 0/528 [00:00<?, ?it/s]

resids rmse are  0.09351689202831431


0it [00:00, ?it/s]

[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done  12 tasks      | elapsed:  2.0min
[Parallel(n_jobs=30)]: Done 102 tasks      | elapsed:  2.1min
[Parallel(n_jobs=30)]: Done 228 tasks      | elapsed:  2.3min
[Parallel(n_jobs=30)]: Done 390 tasks      | elapsed:  2.5min
[Parallel(n_jobs=30)]: Done 528 out of 528 | elapsed:  2.6min finished


  0%|          | 0/528 [00:00<?, ?it/s]

resids rmse are  0.09351689202831431


0it [00:00, ?it/s]

  0%|          | 0/527 [00:00<?, ?it/s]

  0%|          | 0/45 [00:00<?, ?it/s]

In [17]:
gamma_ts, lmbda = core_imputation_model_new.fit_factors_and_loadings(
    char_panel=percentile_rank_chars, 
    min_chars=1, 
    K=20, 
    num_months_train=T,
    reg=0.00022,
    time_varying_lambdas=True,
    eval_data=None,
    run_in_parallel=True
)

[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done  12 tasks      | elapsed:  1.8min
[Parallel(n_jobs=30)]: Done 102 tasks      | elapsed:  2.0min
[Parallel(n_jobs=30)]: Done 228 tasks      | elapsed:  2.5min
[Parallel(n_jobs=30)]: Done 390 tasks      | elapsed:  3.0min
[Parallel(n_jobs=30)]: Done 528 out of 528 | elapsed:  3.3min finished


  0%|          | 0/528 [00:00<?, ?it/s]

resids rmse are  0.09351689202831431


# On Hyperparameter Choice

Below we show the plots from Figure 8 in the paper. These are the kinds of plots we used to determine the optimal regularization and number of factors. Namely, we considered the out-of-sample performance of the model implied by a given hyperparameter choice across a grid of these choices.

In [18]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "data/example_of_cval.png")

The `core_imputation_model_new.fit_factors_and_loadings` method allows the user to pass in an argument `eval_data`. If provided, it is compared against the imputation and the RMSE is reported. This is a simple way to evaluate hyperparameter choices with the model.

In [19]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "data/reg_cval.png")
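One way to build an evaluation panel for the `eval_data` argument is to hold out a random subset of the observed entries. The sketch below does this on a synthetic panel and computes the RMSE on the held-out entries. The helper names `make_holdout` and `masked_rmse` are hypothetical, and the assumption that `eval_data` has the same shape as the panel with NaN outside the held-out entries should be checked against the function definition.

```python
import numpy as np

def make_holdout(panel, frac=0.1, seed=0):
    """Mask a random share of observed entries (hypothetical helper).

    Returns the training panel (held-out entries set to NaN) and an
    evaluation panel that is NaN everywhere except the held-out entries.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(panel)
    held_out = observed & (rng.random(panel.shape) < frac)
    train = panel.copy()
    train[held_out] = np.nan
    eval_data = np.full(panel.shape, np.nan)
    eval_data[held_out] = panel[held_out]
    return train, eval_data

def masked_rmse(imputed, eval_data):
    """RMSE over the entries where eval_data is not NaN."""
    mask = ~np.isnan(eval_data)
    return float(np.sqrt(np.mean((imputed[mask] - eval_data[mask]) ** 2)))

# Synthetic panel (toy sizes); NaN marks missing entries.
rng = np.random.default_rng(1)
panel = rng.uniform(-0.5, 0.5, size=(5, 20, 4))
panel[rng.random(panel.shape) < 0.2] = np.nan

train, eval_data = make_holdout(panel, frac=0.1)

# Trivial baseline imputer (all zeros), used only to exercise masked_rmse.
baseline_rmse = masked_rmse(np.zeros_like(panel), eval_data)
print(baseline_rmse)

# With the real data, a grid over hyperparameters could then look like this
# (argument names as in the fit_factors_and_loadings call in In [17]):
# for K in (10, 20, 30):
#     for reg in (1e-4, 2.2e-4, 1e-3):
#         core_imputation_model_new.fit_factors_and_loadings(
#             char_panel=train, min_chars=1, K=K, num_months_train=T,
#             reg=reg, time_varying_lambdas=True, eval_data=eval_data,
#             run_in_parallel=True)
```

The grid values in the commented loop are placeholders, not recommendations.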