# 4D GARCH surrogate training code

This notebook provides an example of the code used to train and evaluate the accuracy of surrogate models for the 4D toy problem discussed in Chapter 3 Section 3.3. 

Original experiments were run in Google Colab using TPUs. 

This code is replicated with varying training sizes to produce the full result set. Full code and models are available [here](https://drive.google.com/drive/folders/1J7srZbZPS6UhE43GFXP3Gkd3TmEvT-6f).

In [1]:
import autograd.numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.model_selection import train_test_split
import keras.backend as K
from keras.models import Model
from keras.layers import Input, Dense, Lambda, dot, concatenate, PReLU, Dropout, advanced_activations
from keras.models import load_model
from keras.optimizers import adam_v2
from keras.callbacks import LearningRateScheduler
import concurrent.futures
from time import time
import gc
from scipy import stats
from datetime import datetime  # import only what is used; a star import would shadow time.time
from time import time as time1
import os
import subprocess
#from google.colab import files

In [2]:
# Suppress retracing and autograph errors
import logging
import tensorflow as tf
tf.get_logger().setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)

In [3]:
nInputDim = 4 # Latent parameters
nOutputDim = 1 # Vol output

T = 5 # Time window forecast
forecast_window = range(1, T + 1)
base_hidden_size = 4

nBatchSize = 10

## Model architecture

The initial (t+1) surrogate takes the latent parameters $\theta$ together with the initial volatility and the previous residual as inputs, while each subsequent surrogate takes $\theta$ and the previous volatility output, as described in Chapter 2 Section 2.1.1. Chaining the surrogates in this way yields volatility predictions at t+h for h = 2, ..., 5.

Each surrogate uses an architecture with 4 hidden layers of $\eta$ * 4 nodes each, where $\eta$ is the model complexity described in Chapter 3 Section 3.1.
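
As a rough sketch of how the complexity setting scales the networks, the helper below counts the trainable parameters of a fully connected network with four hidden layers of width $\eta$ * 4 and a single output. The name `dense_param_count` is hypothetical, and the input sizes (6 for the t+1 surrogate, 5 for the t+h surrogates) are read off the model definitions in this notebook.

```python
BASE_HIDDEN_SIZE = 4  # mirrors base_hidden_size in this notebook


def dense_param_count(n_inputs, complexity, n_hidden_layers=4, n_outputs=1):
    """Count weights and biases of a fully connected network (hypothetical helper)."""
    width = complexity * BASE_HIDDEN_SIZE
    sizes = [n_inputs] + [width] * n_hidden_layers + [n_outputs]
    # Each dense layer has (fan_in + 1) * fan_out parameters: weights plus biases.
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))


for eta in (1, 34, 70):
    print(eta, dense_param_count(6, eta), dense_param_count(5, eta))
```

The count grows roughly quadratically in $\eta$, since the hidden-to-hidden layers dominate.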

In [4]:
def make_model_t1(complexity=1, lr=0.01, g_w=0.5):
    """Makes the surrogate for the t+1 vol prediction.
    The weight assigned to the loss term that fits gradients can be set with g_w.
    Model complexity is an integer that determines the number of nodes in each
    hidden layer.
    """
    theta = Input(shape=nInputDim)
    prev_vol = Input(shape=1)
    prev_return = Input(shape=1)
    concat = concatenate([theta, prev_vol, prev_return])
    h1 = Dense(complexity * base_hidden_size, activation="tanh")(concat)
    h2 = Dense(complexity * base_hidden_size, activation="tanh")(h1)
    h3 = Dense(complexity * base_hidden_size, activation="tanh")(h2)
    h4 = Dense(complexity * base_hidden_size, activation="tanh")(h3)
    out = Dense(nOutputDim, activation='softplus')(h4)
    
    grad = Lambda(lambda x: K.gradients(x[0], [x[1]])[0], output_shape=nInputDim)([out, theta])
    model = Model(inputs=[theta, prev_vol, prev_return], outputs=[out, grad])
    opt = adam_v2.Adam(learning_rate=lr)
    model.compile(loss=['mse', 'mse'], optimizer=opt, metrics=['accuracy'], loss_weights=[1-g_w, g_w])
    
    return model
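
The compile call combines two mean-squared-error terms, one on the volatility output and one on its gradient with respect to $\theta$, weighted by `1 - g_w` and `g_w`. A minimal pure-Python sketch of that weighting follows; the function names and toy numbers are illustrative only and are not taken from the experiments.

```python
def mse(pred, true):
    """Mean squared error over two equally sized flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def combined_loss(vol_pred, vol_true, grad_pred, grad_true, g_w=0.5):
    """Weighted sum of the output loss and the gradient loss."""
    return (1 - g_w) * mse(vol_pred, vol_true) + g_w * mse(grad_pred, grad_true)


vol_pred, vol_true = [1.0, 2.0], [1.0, 4.0]    # output MSE = (0 + 4) / 2 = 2.0
grad_pred, grad_true = [0.0, 0.0], [1.0, 1.0]  # gradient MSE = 1.0

print(combined_loss(vol_pred, vol_true, grad_pred, grad_true, g_w=0.5))  # 1.5
print(combined_loss(vol_pred, vol_true, grad_pred, grad_true, g_w=0.0))  # 2.0
```

Setting `g_w=0`, as `build_models` does for the standard models, removes the gradient term from the objective.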

In [5]:
def make_model_tsub(complexity=1, lr=0.01, g_w=0.5):
    """Makes surrogates for t+h vol predictions where 2 <= h <= T.
    Note that resid_t is not needed for t+h predictions (refer to the ARCH docs for more detail).
    The weight assigned to the loss term that fits gradients can be set with g_w.
    Model complexity is an integer that determines the number of nodes in each
    hidden layer.
    """
    theta = Input(shape=nInputDim)
    prev_vol = Input(shape=1)
    concat = concatenate([theta, prev_vol])
    h1 = Dense(complexity * base_hidden_size, activation="tanh")(concat)
    h2 = Dense(complexity * base_hidden_size, activation="tanh")(h1)
    h3 = Dense(complexity * base_hidden_size, activation="tanh")(h2)
    h4 = Dense(complexity * base_hidden_size, activation="tanh")(h3)
    out = Dense(nOutputDim, activation='softplus')(h4)
    
    grad = Lambda(lambda x: K.gradients(x[0], [x[1]])[0], output_shape=nInputDim)([out, theta])
    model = Model(inputs=[theta, prev_vol], outputs=[out, grad])
    opt = adam_v2.Adam(learning_rate=lr)
    model.compile(loss=['mse', 'mse'], optimizer=opt, metrics=['accuracy'], loss_weights=[1-g_w, g_w])
    
    return model

In [6]:
def build_models(lr=0.01):
    """ Store models in nested dictionary of the form [time_step][complexity].
    """
    
    models_std = defaultdict(dict)
    models_grad = defaultdict(dict)
    
    for t in forecast_window:
        for m in complexity_range:
            if t == 1:
                models_std[str(t)][str(m)] = make_model_t1(lr=lr, complexity=m, g_w=0)
                models_grad[str(t)][str(m)] = make_model_t1(lr=lr, complexity=m)
            else:
                models_std[str(t)][str(m)] = make_model_tsub(lr=lr, complexity=m, g_w=0)
                models_grad[str(t)][str(m)] = make_model_tsub(lr=lr, complexity=m)
        
    return models_std, models_grad

In [7]:
def fit_model(model, params, prev_vol, prev_ret, prev_vol_sub, y, y_grad, t, complexity, foldername):
    """Fits the NN models. Note that prev_vol is the vol at t=0 (initial vol). Subsequent
    vol predictions use prior (true) vol outputs.
    """
    filename = f'{foldername}/GARCH_example_N{nTrain}_E{nEpochs}_c{complexity}_t{t}.h5'
    if t == 1:
        model.fit([params, prev_vol, prev_ret], [y, y_grad], batch_size=nBatchSize, epochs=nEpochs, verbose=0,
            use_multiprocessing=True, callbacks=[LearningRateScheduler(lr_time_based_decay)])
    else:
        model.fit([params, prev_vol_sub], [y, y_grad], batch_size=nBatchSize, epochs=nEpochs, verbose=0,
            use_multiprocessing=True, callbacks=[LearningRateScheduler(lr_time_based_decay)])
    
    model.save(filepath=filename)

In [8]:
def train_models(curr_data, models_std, models_grad, foldername):
    
    simInData_latent = curr_data['simInData_latent']
    simInData_vol = curr_data['simInData_vol']
    simInData_ret = curr_data['simInData_ret']
    simOutData = curr_data['simOutData']
    simOutData_grad = curr_data['simOutData_grad']

    resid = simInData_ret.iloc[:, 0] - simInData_latent.iloc[:, 0]
    
    no_models = len(complexity_range)

    foldername_grad = foldername + '_grad'
    t_curr = time1()
    for t in forecast_window:
        t_min = max(t-1, 1) # To avoid error when t = 1.
        with concurrent.futures.ThreadPoolExecutor(max_workers=no_models) as executor:
            future1 = {executor.submit(fit_model, 
                                        m, 
                                        simInData_latent.values, 
                                        simInData_vol.values,
                                        resid.values, 
                                        simOutData.loc[:, t_min].values,  # Used as input for t+h.
                                        simOutData.loc[:, t].values, 
                                        simOutData_grad.loc[:, str(t)].values,
                                        t,
                                        complexity,
                                        foldername) 
                    for complexity, m in models_std[str(t)].items()}

            future2 = {executor.submit(fit_model, 
                                        m, 
                                        simInData_latent.values, 
                                        simInData_vol.values,
                                        resid.values, 
                                        simOutData.loc[:, t_min].values,  # Used as input for t+h.
                                        simOutData.loc[:, t].values, 
                                        simOutData_grad.loc[:, str(t)].values,
                                        t,
                                        complexity,
                                        foldername_grad) 
                    for complexity, m in models_grad[str(t)].items()}
            
        print(f'Completed timestep {t} in {time1() - t_curr:.02f}s...')
        t_curr = time1()
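
Because the futures returned by `executor.submit` are never inspected in `train_models`, an exception raised inside a worker thread is stored on the future and not re-raised in the caller. One possible way to surface such failures, shown as a sketch with a hypothetical `fit_one` stand-in rather than the real `fit_model`, is to call `result()` on each future:

```python
import concurrent.futures


def fit_one(complexity):
    """Hypothetical stand-in for fit_model: fails for one complexity value."""
    if complexity == 42:
        raise ValueError(f"training failed for complexity {complexity}")
    return complexity


complexities = [34, 38, 42]
failures = {}

with concurrent.futures.ThreadPoolExecutor(max_workers=len(complexities)) as executor:
    futures = {executor.submit(fit_one, c): c for c in complexities}
    for future in concurrent.futures.as_completed(futures):
        c = futures[future]
        try:
            future.result()  # re-raises any exception from the worker thread
        except ValueError as err:
            failures[c] = str(err)

print(failures)
```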

In [9]:
def load_models(N, foldername):
    std_models = defaultdict(dict)
    grad_models = defaultdict(dict)

    grad_foldername = foldername + '_grad' 
    
    for t in forecast_window:
        for c in complexity_range:
            filename = f'{foldername}/GARCH_example_N{nTrain}_E{nEpochs}_c{c}_t{t}.h5'
            filename_grad = f'{grad_foldername}/GARCH_example_N{nTrain}_E{nEpochs}_c{c}_t{t}.h5'
            std_models[t][c] = load_model(filename)
            grad_models[t][c] = load_model(filename_grad)
    return std_models, grad_models

In [10]:
def gen_preds(train_data, test_data, std_models, grad_models):
    
    simInData_train_latent = train_data['simInData_latent']
    simInData_train_vol = train_data['simInData_vol']
    simInData_train_ret = train_data['simInData_ret']

    simInData_test_latent = test_data['simInData_latent']
    simInData_test_vol = test_data['simInData_vol']
    simInData_test_ret = test_data['simInData_ret']

    simOutData_train = train_data['simOutData'] # Only used for constructing DFs and t0 population
    simOutData_test = test_data['simOutData'] # Only used for constructing DFs and t0 population

    N_train = len(simInData_train_latent)
    N_test = len(simInData_test_latent)

    resids_train = simInData_train_ret.iloc[:, 0] - simInData_train_latent.iloc[:, 0]
    resids_test = simInData_test_ret.iloc[:, 0] - simInData_test_latent.iloc[:, 0]

    std_preds_train = {}
    std_preds_test = {}
    
    grad_preds_train = {}
    grad_preds_test = {}
    for c in complexity_range:
        std_preds_train[c] = pd.DataFrame(index=simInData_train_latent.index, columns=forecast_window)
        std_preds_test[c] = pd.DataFrame(index=simInData_test_latent.index, columns=forecast_window)
        grad_preds_train[c] = pd.DataFrame(index=simInData_train_latent.index, columns=forecast_window)
        grad_preds_test[c] = pd.DataFrame(index=simInData_test_latent.index, columns=forecast_window)

    for t in forecast_window:
        for c in complexity_range:
            
            crtModel_std = std_models[t][c]
            crtModel_grad = grad_models[t][c]

            if t == 1:
                std_preds_train[c].loc[:, t] = crtModel_std.predict([simInData_train_latent.values, 
                                                    simInData_train_vol.values, resids_train.values])[0].flatten()
                std_preds_test[c].loc[:, t] = crtModel_std.predict([simInData_test_latent.values, 
                                                    simInData_test_vol.values, resids_test.values])[0].flatten()
                
                grad_preds_train[c].loc[:, t] = crtModel_grad.predict([simInData_train_latent.values, 
                                                    simInData_train_vol.values, resids_train.values])[0].flatten()
                grad_preds_test[c].loc[:, t] = crtModel_grad.predict([simInData_test_latent.values, 
                                                    simInData_test_vol.values, resids_test.values])[0].flatten()
            else:
                std_preds_train[c].loc[:, t] = crtModel_std.predict([simInData_train_latent.values, 
                                                    std_preds_train[c].loc[:, t-1].values])[0].flatten()
                std_preds_test[c].loc[:, t] = crtModel_std.predict([simInData_test_latent.values, 
                                                    std_preds_test[c].loc[:, t-1].values])[0].flatten()
                
                grad_preds_train[c].loc[:, t] = crtModel_grad.predict([simInData_train_latent.values, 
                                                    grad_preds_train[c].loc[:, t-1].values])[0].flatten()
                grad_preds_test[c].loc[:, t] = crtModel_grad.predict([simInData_test_latent.values, 
                                                    grad_preds_test[c].loc[:, t-1].values])[0].flatten()

    gc.collect()

    return std_preds_train, std_preds_test, grad_preds_train, grad_preds_test
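
The prediction loop above chains the surrogates: the t+1 model is fed the initial volatility and the residual, and each later model is fed the previous model's own prediction rather than the true volatility. A toy sketch of that recursion, with plain Python functions standing in for the trained Keras models (the stand-in formulas and parameter values are made up for illustration), might look like:

```python
def predict_path(theta, init_vol, resid, model_t1, models_tsub):
    """Chain one-step surrogates into a multi-step volatility forecast."""
    path = [model_t1(theta, init_vol, resid)]
    for model in models_tsub:
        # Each later surrogate sees theta and the previous *predicted* volatility.
        path.append(model(theta, path[-1]))
    return path


# Made-up stand-ins for the trained models, only to make the sketch runnable.
def toy_t1(theta, vol, resid):
    return theta[0] + theta[1] * resid ** 2 + theta[2] * vol


def toy_tsub(theta, prev_vol):
    return theta[0] + (theta[1] + theta[2]) * prev_vol


theta = (0.1, 0.2, 0.5, 0.0)  # four latent parameters, values chosen arbitrarily
path = predict_path(theta, init_vol=1.0, resid=1.0,
                    model_t1=toy_t1, models_tsub=[toy_tsub] * 4)
print(path)
```

Because each step consumes the previous prediction, any error made at t+1 propagates to the later horizons.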

In [11]:
# Evaluation metrics
def rmse(pred, true):
    return np.sqrt(((pred.values - true.values)**2).mean().mean())

def corr(pred, true):
    return stats.pearsonr(pred.values.flatten(),true.values.flatten())[0]
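
For reference, the two metrics reduce to the following on flat sequences. This is a dependency-free sketch with toy numbers; `rmse_flat` and `pearson_flat` are hypothetical names, not functions from this notebook.

```python
import math


def rmse_flat(pred, true):
    """Root mean squared error over two equally sized flat lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


def pearson_flat(x, y):
    """Pearson correlation coefficient of two equally sized flat lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


true = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 3.0, 4.0, 5.0]  # constant offset of 1

print(rmse_flat(pred, true))     # 1.0
print(pearson_flat(pred, true))  # 1.0 up to floating-point rounding
```

A constant offset leaves the correlation at 1 while the RMSE is nonzero, so the two metrics capture different aspects of fit.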

In [12]:
def evaluate(train_data, test_data, std_preds_train, std_preds_test, grad_preds_train, grad_preds_test):
    
    simOutData_train = train_data['simOutData'] 
    simOutData_test = test_data['simOutData']
    
    std_train_res = defaultdict(dict)
    std_test_res = defaultdict(dict)
    grad_train_res = defaultdict(dict)
    grad_test_res = defaultdict(dict)
    
    for c in complexity_range:
        std_train_res[c]['rmse'] = rmse(std_preds_train[c], simOutData_train)
        std_train_res[c]['corr'] = corr(std_preds_train[c], simOutData_train)
        
        std_test_res[c]['rmse'] = rmse(std_preds_test[c], simOutData_test)
        std_test_res[c]['corr'] = corr(std_preds_test[c], simOutData_test)
        
        grad_train_res[c]['rmse'] = rmse(grad_preds_train[c], simOutData_train)
        grad_train_res[c]['corr'] = corr(grad_preds_train[c], simOutData_train)
        
        grad_test_res[c]['rmse'] = rmse(grad_preds_test[c], simOutData_test)
        grad_test_res[c]['corr'] = corr(grad_preds_test[c], simOutData_test)
    
    
    return std_train_res, std_test_res, grad_train_res, grad_test_res

In [13]:
def pipe(curr_data, test_data, N_train, foldername, idx):
    
    print(f'Beginning iteration {idx+1}')
    
    # Generate separate foldername for each split to avoid confusion
    foldername = foldername + f'_ntrain{N_train}_{idx}'
    
    # Build models
    models_std, models_grad = build_models()
    
    print(f'Training models for dataset {idx+1}')
    t0 = time1()
    # Train models and save
    train_models(curr_data, models_std, models_grad, foldername)
    print(f'Trained models for dataset {idx+1} in {time1() - t0:.02f}s')
     
    print(f'Loading models for dataset {idx+1}')
    # Load models from file
    std_models, grad_models = load_models(N_train, foldername)
    
    print(f'Generating predictions for dataset {idx+1}')
    # Generate predictions
    std_preds_train, std_preds_test, grad_preds_train, grad_preds_test = gen_preds(curr_data, 
                                                test_data, std_models, grad_models)

    std_train_res, std_test_res, grad_train_res, grad_test_res = evaluate(curr_data, test_data, 
                            std_preds_train, std_preds_test, grad_preds_train, grad_preds_test)
    
    return std_train_res, std_test_res, grad_train_res, grad_test_res, idx

In [14]:
def run(N_train, no_ds, train_data, test_data, foldername='saved_models_cmplx_tests'):
    
    N_test = len(test_data['simInData_latent'])

    print(f'Beginning process with:')
    print(f'-> Epochs: {nEpochs}')
    print(f'-> Complexity range: {complexity_range}')
    print(f'-> {N_train} training samples')
    print(f"-> {N_test} test samples")
    print(f'-> Averging {no_ds} datasets')
    
    # Generate dataframes to store final results in
    std_train_rmse = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    std_test_rmse = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    grad_train_rmse = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    grad_test_rmse = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    
    std_train_corr = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    std_test_corr = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    grad_train_corr = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    grad_test_corr = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_res = {executor.submit(pipe, curr_data, test_data, 
                                        N_train, foldername, i): i \
                         for i, curr_data in train_data.items()}
        for i in concurrent.futures.as_completed(future_to_res):
            std_train_res, std_test_res, grad_train_res, grad_test_res, idx = i.result()
            
            for c in complexity_range:
                std_train_rmse.loc[c, idx] = std_train_res[c]['rmse']
                std_train_corr.loc[c, idx] = std_train_res[c]['corr']
                
                std_test_rmse.loc[c, idx] = std_test_res[c]['rmse']
                std_test_corr.loc[c, idx] = std_test_res[c]['corr']
            
                grad_train_rmse.loc[c, idx] = grad_train_res[c]['rmse']
                grad_train_corr.loc[c, idx] = grad_train_res[c]['corr']
                
                grad_test_rmse.loc[c, idx] = grad_test_res[c]['rmse']
                grad_test_corr.loc[c, idx] = grad_test_res[c]['corr']
        
            print(f'Completed iteration {int(idx)+1} of {no_ds}')
        
    # Take averages/stds across the per-run columns only, so that the 'std'
    # column is not computed over the freshly added 'mean' column.
    run_cols = list(range(no_ds))
    for df in (std_train_rmse, std_test_rmse, std_train_corr, std_test_corr,
               grad_train_rmse, grad_test_rmse, grad_train_corr, grad_test_corr):
        runs = df[run_cols].astype(float)
        df['mean'] = runs.mean(axis=1)
        df['std'] = runs.std(axis=1)
    
    c_time = datetime.now().strftime("%Y-%m-%d %H-%M")
    base_string = f'_ntrain{N_train}_ntest{N_test}_nEpoch{nEpochs}_Nruns{no_ds}_{c_time}.csv'
    std_filename_train_rmse = f'std_train_rmse'+base_string
    std_filename_train_corr = f'std_train_corr'+base_string
    
    std_filename_test_rmse = f'std_test_rmse'+base_string
    std_filename_test_corr = f'std_test_corr'+base_string
    
    grad_filename_train_rmse = f'grad_train_rmse'+base_string
    grad_filename_train_corr = f'grad_train_corr'+base_string
    
    grad_filename_test_rmse = f'grad_test_rmse'+base_string
    grad_filename_test_corr = f'grad_test_corr'+base_string
    
    res_folder = f'complexity_test_results_ntrain{N_train}_{c_time}'
    os.mkdir(res_folder)
    std_train_rmse.to_csv(res_folder+'/'+std_filename_train_rmse)
    std_train_corr.to_csv(res_folder+'/'+std_filename_train_corr)
    std_test_rmse.to_csv(res_folder+'/'+std_filename_test_rmse)
    std_test_corr.to_csv(res_folder+'/'+std_filename_test_corr)
    
    grad_train_rmse.to_csv(res_folder+'/'+grad_filename_train_rmse)
    grad_train_corr.to_csv(res_folder+'/'+grad_filename_train_corr)
    grad_test_rmse.to_csv(res_folder+'/'+grad_filename_test_rmse)
    grad_test_corr.to_csv(res_folder+'/'+grad_filename_test_corr)
    
    #Zip up results so they can be downloaded.
    #subprocess.call(["zip", "-r", f"/content/{res_folder}.zip", f"/content/{res_folder}"])
    #files.download(f"/content/{res_folder}.zip")

    print(f'Completed. Saved results to folder {res_folder}')
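
The summary columns written by `run` are the mean and the sample standard deviation across runs for each complexity (pandas' `std` uses `ddof=1` by default). The same aggregation on a toy results table, using only the standard library and made-up numbers:

```python
import statistics

# Made-up RMSE values: one entry per complexity, one list element per run.
results = {34: [0.10, 0.12, 0.14], 38: [0.20, 0.20, 0.20]}

summary = {
    c: {"mean": statistics.mean(runs), "std": statistics.stdev(runs)}  # stdev is the sample std
    for c, runs in results.items()
}
print(summary)
```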

In [15]:
def sub_sample_data(n_train, no_runs, all_train_data):
    
    train_data = defaultdict(dict) 
    for ds in range(no_runs):
        np.random.seed(ds)
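        # Draw n_train distinct rows from the whole pool; permuting only range(n_train)
        # would select the same first n_train rows in every run.
        perm = np.random.permutation(len(all_train_data['simInData_latent']))[:n_train]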
        train_data[ds]['simInData_latent'] = all_train_data['simInData_latent'].iloc[perm, :].reset_index(drop=True)
        train_data[ds]['simInData_ret'] = all_train_data['simInData_ret'].iloc[perm].reset_index(drop=True)
        train_data[ds]['simInData_vol'] = all_train_data['simInData_vol'].iloc[perm].reset_index(drop=True)
        train_data[ds]['simOutData'] = all_train_data['simOutData'].iloc[perm, :].reset_index(drop=True)
        train_data[ds]['simOutData'].columns = forecast_window
        train_data[ds]['simOutData_grad'] = all_train_data['simOutData_grad'].iloc[perm, :].reset_index(drop=True)

    return train_data
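
A seeded draw of `n_train` row positions from a larger pool can be sketched with the standard library as follows. The pool size of 1000 and the helper name `draw_indices` are illustrative assumptions; the notebook itself relies on `np.random.seed` and `np.random.permutation`.

```python
import random


def draw_indices(pool_size, n_train, seed):
    """Return n_train distinct row positions drawn from range(pool_size)."""
    rng = random.Random(seed)  # independent, reproducible generator per run
    return rng.sample(range(pool_size), n_train)


# One draw per run, keyed by the seed, as in sub_sample_data.
draws = {seed: draw_indices(pool_size=1000, n_train=100, seed=seed) for seed in range(5)}

# Re-running with the same seed reproduces the same draw.
print(draws[0] == draw_indices(1000, 100, 0))
```

Seeding by the run index keeps each sub-sample reproducible across notebook restarts.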

In [16]:
def lr_time_based_decay(epoch, lr):
    # Time-based decay: the divisor grows with the epoch index.
    return lr / (1 + decay * epoch)
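
Keras' `LearningRateScheduler` calls the schedule once per epoch with the epoch index and the current learning rate, and uses the returned value for that epoch, so the decay compounds from one epoch to the next. The loop below imitates that behaviour without Keras for a time-based schedule of the form lr / (1 + decay * epoch), using the initial rate and epoch count configured later in this notebook. It is a sketch of the mechanism, not a record of the rates used in the original runs.

```python
initial_learning_rate = 0.01
n_epochs = 300
decay = initial_learning_rate / n_epochs


def time_based_decay(epoch, lr):
    """Shrink the current rate by a factor that grows with the epoch index."""
    return lr / (1 + decay * epoch)


lr = initial_learning_rate
history = []
for epoch in range(n_epochs):
    lr = time_based_decay(epoch, lr)  # what the callback would set for this epoch
    history.append(lr)

print(history[0], history[-1])
```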

In [17]:
# Location of 4D data, generated by function presented in Chapter 3, Section 3.3.1
data_folder = '4d_data'

all_data_train = {}
all_data_test = {}

all_data_train['simInData_latent'] = pd.read_csv(f'{data_folder}/x_train_latent.csv', index_col=0)
all_data_train['simInData_ret'] = pd.read_csv(f'{data_folder}/x_train_ret.csv', index_col=0)
all_data_train['simInData_vol'] = pd.read_csv(f'{data_folder}/x_train_vol.csv', index_col=0)
all_data_train['simOutData'] = pd.read_csv(f'{data_folder}/y_train.csv', index_col=0)
all_data_train['simOutData_grad'] = pd.read_csv(f'{data_folder}/y_train_grad.csv', index_col=0, header=[0,1])

all_data_test['simInData_latent'] = pd.read_csv(f'{data_folder}/x_test_latent.csv', index_col=0)
all_data_test['simInData_ret'] = pd.read_csv(f'{data_folder}/x_test_ret.csv', index_col=0)
all_data_test['simInData_vol'] = pd.read_csv(f'{data_folder}/x_test_vol.csv', index_col=0)
all_data_test['simOutData'] = pd.read_csv(f'{data_folder}/y_test.csv', index_col=0)
print(f'Successfully read in dataset.')
# No need to read in test gradients

Successfully read in dataset.


In [18]:
# Number of random draws from the training data pool to use
n_runs = 5
nTrain = 100

train_data_ss_n100 = sub_sample_data(nTrain, n_runs, all_data_train)

In [19]:
# Folder to save models to
foldername = '/saved_models_nTrain100/saved_models_cmplx_tests'

nEpochs = 300

initial_learning_rate = 0.01
decay = initial_learning_rate / nEpochs

# Range of model complexities to experiment with
complexity_range = [34, 38, 42, 46, 50, 54, 58, 62, 66, 70]
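
With these settings, the size of the experiment follows directly from the loops in `build_models` and `run`: one standard and one gradient-fitted model per forecast step and complexity, repeated for every sub-sampled dataset. A quick count, using the values set in this notebook:

```python
complexity_range = [34, 38, 42, 46, 50, 54, 58, 62, 66, 70]
T = 5         # forecast steps
n_runs = 5    # sub-sampled training sets
variants = 2  # standard and gradient-fitted

models_per_dataset = len(complexity_range) * T * variants
total_models = models_per_dataset * n_runs
print(models_per_dataset, total_models)  # 100 500
```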

In [20]:
# Running this cell runs experiments
#run(nTrain, n_runs, train_data_ss_n100, all_data_test, foldername=foldername)

Beginning process with:
-> Epochs: 300
-> Complexity range: [34, 38, 42, 46, 50, 54, 58, 62, 66, 70]
-> 100 training samples
-> 1000 test samples
-> Averging 5 datasets
Beginning iteration 1
Beginning iteration 2Beginning iteration 3

Beginning iteration 4Beginning iteration 5

Training models for dataset 3
Training models for dataset 5
Training models for dataset 4
Training models for dataset 1
Training models for dataset 2
Completed timestep 1 in 1344.73s...
Completed timestep 1 in 1340.40s...
Completed timestep 1 in 1345.78s...
Completed timestep 1 in 1356.09s...
Completed timestep 1 in 1356.64s...
Completed timestep 2 in 1270.68s...
Completed timestep 2 in 1264.45s...
Completed timestep 2 in 1275.97s...
Completed timestep 2 in 1276.37s...
Completed timestep 2 in 1260.63s...
Completed timestep 3 in 1422.23s...
Completed timestep 3 in 1482.01s...
Completed timestep 3 in 1491.14s...
Completed timestep 3 in 1500.18s...
Completed timestep 3 in 1515.58s...
Completed timestep 4 in 1347.60

Completed. Saved results to folder complexity_test_results_ntrain100_2021-08-24 21-47
