# 8D surrogate training code

This notebook provides an example of the code used to train and evaluate the accuracy of surrogate models for the 8D toy problem discussed in Chapter 4 Section 4.1. 

Original experiments were run in Google Colab using TPUs. 

This code is replicated with varying training sizes to produce the full result set. Full code and models are available [here](https://drive.google.com/drive/folders/1J7srZbZPS6UhE43GFXP3Gkd3TmEvT-6f).

In [None]:
import autograd.numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.model_selection import train_test_split
import keras.backend as K
from keras.models import Model
from keras.layers import Input, Dense, Lambda, dot, concatenate, PReLU, Dropout, advanced_activations
from keras.models import load_model
from keras.optimizers import adam_v2
from keras.callbacks import LearningRateScheduler
from keras.metrics import RootMeanSquaredError
import concurrent.futures
from time import time
import gc
from scipy import stats
from datetime import *
from time import time as time1
import os
import subprocess
from google.colab import files
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Suppress retracing and autograph errors
import logging
import tensorflow as tf
tf.get_logger().setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)

In [None]:
# Parameters for 8D toy problem
nInputDim = 8
nOutputDim = 1
base_hidden_size = 4
nBatchSize = 100

# Other parameters
alpha = 0.1
beta = 0.2
XDIM = 2

## Model architecture

The 8D toy problem does not model a dynamic process, so a single architecture is used. The model has 10 hidden layers, each with $4\eta$ nodes, where $\eta$ is the model complexity described in Chapter 3 Section 3.1.
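
To give a feel for how the network size grows with $\eta$, the sketch below counts the weights and biases of the dense stack. It assumes the two inputs are concatenated into a vector of length 8 + 3 (from `nInputDim = 8` and `XDIM + 1 = 3`) and that the hidden width is $\eta$ times `base_hidden_size = 4`; the helper name `count_parameters` is hypothetical and not part of the notebook.

```python
def count_parameters(complexity, n_theta=8, n_x=3, base_hidden_size=4,
                     n_hidden_layers=10, n_outputs=1):
    """Count weights and biases in a stack of fully connected layers."""
    width = complexity * base_hidden_size
    n_inputs = n_theta + n_x  # the two inputs are concatenated
    first = n_inputs * width + width                          # input -> first hidden layer
    middle = (n_hidden_layers - 1) * (width * width + width)  # hidden -> hidden
    last = width * n_outputs + n_outputs                      # last hidden -> output
    return first + middle + last


for eta in (10, 50, 90):
    print(f"complexity {eta}: {count_parameters(eta)} parameters")
```

Because the hidden-to-hidden term scales with the square of the width, the parameter count grows roughly quadratically in $\eta$.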

In [None]:
def make_model(complexity=1, lr=0.001, g_w=0.5):
    """Makes surrogate model.
    The weight assigned to the loss function that fits gradients can be set with g_w.
    Model complexity is an integer and determines the number of nodes in each
    hidden layer.
    """
    theta = Input(shape=nInputDim)
    X = Input(shape=XDIM+1)
    concat = concatenate([theta, X])
    h1 = Dense(complexity * base_hidden_size, activation="tanh")(concat)
    h2 = Dense(complexity * base_hidden_size, activation="tanh")(h1)
    h3 = Dense(complexity * base_hidden_size, activation="tanh")(h2)
    h4 = Dense(complexity * base_hidden_size, activation="tanh")(h3)
    h5 = Dense(complexity * base_hidden_size, activation="tanh")(h4)
    h6 = Dense(complexity * base_hidden_size, activation="tanh")(h5)
    h7 = Dense(complexity * base_hidden_size, activation="tanh")(h6)
    h8 = Dense(complexity * base_hidden_size, activation="tanh")(h7)
    h9 = Dense(complexity * base_hidden_size, activation="tanh")(h8)
    h10 = Dense(complexity * base_hidden_size, activation="tanh")(h9)
    out = Dense(nOutputDim, activation='linear', name='out')(h10)
    
    grad = Lambda(lambda x: K.gradients(x[0], [x[1]])[0], output_shape=nInputDim)([out, theta])
    model = Model(inputs=[theta, X], outputs=[out, grad])
    opt = adam_v2.Adam(learning_rate=lr)
    model.compile(loss=['mse', 'mse'], optimizer=opt, metrics=[RootMeanSquaredError()], loss_weights=[1-g_w, g_w])
    
    return model
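
The `loss_weights=[1-g_w, g_w]` argument combines the two mean-squared errors into a single training objective. The NumPy sketch below reproduces that weighted sum outside Keras, assuming each loss is a plain mean over all elements; the function name `weighted_loss` and the toy arrays are illustrative only.

```python
import numpy as np


def weighted_loss(y_true, y_pred, grad_true, grad_pred, g_w=0.5):
    """Weighted sum of the output MSE and the gradient MSE."""
    mse_out = np.mean((y_true - y_pred) ** 2)
    mse_grad = np.mean((grad_true - grad_pred) ** 2)
    return (1 - g_w) * mse_out + g_w * mse_grad


# Toy batch: 2 samples, 1 output, 8 gradient components per sample
y_true = np.array([[1.0], [2.0]])
y_pred = y_true + 0.5
grad_true = np.zeros((2, 8))
grad_pred = np.full((2, 8), 0.1)

print(weighted_loss(y_true, y_pred, grad_true, grad_pred, g_w=0.5))
print(weighted_loss(y_true, y_pred, grad_true, grad_pred, g_w=0.0))
```

With `g_w=0` the gradient term drops out entirely, which is how the standard (non-gradient) models in this notebook are configured.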

In [None]:
def build_models():
    """Store models in dictionaries keyed by complexity.
    models_std are built with g_w=0 (no gradient loss); models_grad use the default g_w.
    """
    
    models_std = defaultdict(dict)
    models_grad = defaultdict(dict)

    for m in complexity_range:
        models_std[m] = make_model(complexity=m, g_w=0)
        models_grad[m] = make_model(complexity=m)
        
    return models_std, models_grad

In [None]:
def fit_model(model, theta, X, y, y_grad, complexity, foldername):
    """Fits a NN model and saves it to foldername.
    The targets are the outputs y and their gradients y_grad with respect to theta.
    """
    model.fit([theta, X], [y, y_grad], batch_size=nBatchSize, epochs=nEpochs, verbose=0, 
    use_multiprocessing=True, callbacks=[LearningRateScheduler(lr_time_based_decay)])
    
    filename = f'{foldername}/8D_example_N{nTrain}_E{nEpochs}_c{complexity}.h5'
    model.save(filepath=filename)

In [None]:
def train_models(curr_data, models_std, models_grad, foldername):
    
    simInData_latent = curr_data['simInData_latent']
    simInData_X = curr_data['simInData_X']
    simOutData = curr_data['simOutData']
    simOutData_grad = curr_data['simOutData_grad']
    
    no_models = len(complexity_range)

    foldername_grad = foldername+'_grad'

    with concurrent.futures.ThreadPoolExecutor(max_workers=no_models*2) as executor:
        future1 = {executor.submit(fit_model, 
                                  m, 
                                  simInData_latent.values, simInData_X.values,
                                  simOutData.values, simOutData_grad.values, 
                                  complexity, 
                                  foldername) 
                for complexity, m in models_std.items()}

        future2 = {executor.submit(fit_model,
                                  m, 
                                  simInData_latent.values, simInData_X.values,
                                  simOutData.values, simOutData_grad.values, 
                                  complexity, 
                                  foldername_grad) 
                for complexity, m in models_grad.items()}
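
Futures submitted to a `ThreadPoolExecutor` hold any exception raised in the worker until `result()` is called. The sketch below shows one way to collect results so that a failed fit is surfaced rather than passing silently; `fit_stub` is a hypothetical stand-in for `fit_model`, and the file naming is illustrative only.

```python
import concurrent.futures


def fit_stub(complexity, folder):
    """Hypothetical stand-in for fit_model: returns the path it would save to."""
    if complexity <= 0:
        raise ValueError("complexity must be positive")
    return f"{folder}/model_c{complexity}.h5"


def train_all(complexities, folder):
    """Run one task per complexity and collect the results by complexity."""
    saved = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(fit_stub, c, folder): c for c in complexities}
        for future in concurrent.futures.as_completed(futures):
            c = futures[future]
            saved[c] = future.result()  # re-raises any exception from the worker
    return saved


paths = train_all([10, 20, 30], "saved_models")
print(paths)
```

Mapping each future back to its complexity keeps the results identifiable even though tasks finish in an arbitrary order.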

In [None]:
def load_models(foldername):
    std_models = defaultdict(dict)
    grad_models = defaultdict(dict)
    
    grad_foldername = foldername + '_grad' 

    for c in complexity_range:
        filename = f'{foldername}/8D_example_N{nTrain}_E{nEpochs}_c{c}.h5'
        filename_grad = f'{grad_foldername}/8D_example_N{nTrain}_E{nEpochs}_c{c}.h5'
        std_models[c] = load_model(filename)
        grad_models[c] = load_model(filename_grad)
    return std_models, grad_models

In [None]:
def gen_preds(train_data, test_data, std_models, grad_models):
    
    simInData_train_latent = train_data['simInData_latent']
    simInData_train_X = train_data['simInData_X']

    simInData_test_latent = test_data['simInData_latent']
    simInData_test_X = test_data['simInData_X']

    simOutData_train = train_data['simOutData'] # Only used for constructing DFs and t0 population
    simOutData_test = test_data['simOutData'] # Only used for constructing DFs and t0 population

    N_train = len(simInData_train_latent)
    N_test = len(simInData_test_latent)

    std_preds_train = {}
    std_preds_test = {}
    
    grad_preds_train = {}
    grad_preds_test = {}
    
    for c in complexity_range:
        std_preds_train[c] = pd.DataFrame(index=simInData_train_latent.index, columns=range(nOutputDim))
        std_preds_test[c] = pd.DataFrame(index=simInData_test_latent.index, columns=range(nOutputDim))
        grad_preds_train[c] = pd.DataFrame(index=simInData_train_latent.index, columns=range(nOutputDim))
        grad_preds_test[c] = pd.DataFrame(index=simInData_test_latent.index, columns=range(nOutputDim))

    for c in complexity_range:

        crtModel_std = std_models[c]
        crtModel_grad = grad_models[c]

        std_preds_train[c].loc[:, 0] = crtModel_std.predict_on_batch([simInData_train_latent.values, 
                                            simInData_train_X.values])[0].flatten()
        std_preds_test[c].loc[:, 0] = crtModel_std.predict_on_batch([simInData_test_latent.values, 
                                            simInData_test_X.values])[0].flatten()

        grad_preds_train[c].loc[:, 0] = crtModel_grad.predict_on_batch([simInData_train_latent.values, 
                                            simInData_train_X.values])[0].flatten()
        grad_preds_test[c].loc[:, 0] = crtModel_grad.predict_on_batch([simInData_test_latent.values, 
                                            simInData_test_X.values])[0].flatten()

    gc.collect()

    return std_preds_train, std_preds_test, grad_preds_train, grad_preds_test

In [None]:
# Evaluation metrics
def rmse(pred, true):
    return np.sqrt(((pred.values - true.values)**2).mean())

def corr(pred, true):
    return stats.pearsonr(pred.values.flatten(),true.values.flatten())[0]
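
The two metrics capture different things, which a small example makes concrete. The sketch below uses NumPy-only equivalents (the names `rmse_np` and `corr_np` are hypothetical) on predictions that are offset from the truth by a constant.

```python
import numpy as np


def rmse_np(pred, true):
    """Root mean squared error between two arrays."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.sqrt(np.mean((pred - true) ** 2))


def corr_np(pred, true):
    """Pearson correlation between the flattened arrays."""
    return np.corrcoef(np.ravel(pred), np.ravel(true))[0, 1]


true = np.array([1.0, 2.0, 3.0, 4.0])
pred = true + 0.5  # constant bias

print(rmse_np(pred, true))
print(corr_np(pred, true))
```

A constant bias leaves the correlation at 1 while the RMSE equals the size of the bias, so reporting both helps separate systematic error from a failure to track the trend.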

In [None]:
def evaluate(train_data, test_data, std_preds_train, std_preds_test, grad_preds_train, grad_preds_test):
    
    simOutData_train = train_data['simOutData'] 
    simOutData_test = test_data['simOutData']
    
    std_train_res = defaultdict(dict)
    std_test_res = defaultdict(dict)
    grad_train_res = defaultdict(dict)
    grad_test_res = defaultdict(dict)
    
    for c in complexity_range:
        std_train_res[c]['rmse'] = rmse(std_preds_train[c], simOutData_train)
        std_train_res[c]['corr'] = corr(std_preds_train[c], simOutData_train)
        
        std_test_res[c]['rmse'] = rmse(std_preds_test[c], simOutData_test)
        std_test_res[c]['corr'] = corr(std_preds_test[c], simOutData_test)
        
        grad_train_res[c]['rmse'] = rmse(grad_preds_train[c], simOutData_train)
        grad_train_res[c]['corr'] = corr(grad_preds_train[c], simOutData_train)
        
        grad_test_res[c]['rmse'] = rmse(grad_preds_test[c], simOutData_test)
        grad_test_res[c]['corr'] = corr(grad_preds_test[c], simOutData_test)
    
    
    return std_train_res, std_test_res, grad_train_res, grad_test_res

In [None]:
def pipe(curr_data, test_data, N_train, foldername, idx):
    
    print(f'Beginning iteration {idx+1}')
    
    # Generate separate foldername for each split to avoid confusion
    foldername = foldername + f'_ntrain{N_train}_{idx}'
    
    # Build models
    models_std, models_grad = build_models()
    
    print(f'Training models for dataset {idx+1}')
    t0 = time1()
    # Train models and save
    train_models(curr_data, models_std, models_grad, foldername)
    print(f'Trained models for dataset {idx+1} in {time1() - t0:.02f}s') 
        
    print(f'Loading models for dataset {idx+1}')
    # Load models from file
    std_models, grad_models = load_models(foldername)
    
    print(f'Generating predictions for dataset {idx+1}')
    # Generate predictions
    std_preds_train, std_preds_test, grad_preds_train, grad_preds_test = gen_preds(curr_data, 
                                                test_data, std_models, grad_models)

    std_train_res, std_test_res, grad_train_res, grad_test_res = evaluate(curr_data, test_data, 
                            std_preds_train, std_preds_test, grad_preds_train, grad_preds_test)
    
    return std_train_res, std_test_res, grad_train_res, grad_test_res, idx

In [None]:
def run(N_train, no_ds, train_data, test_data, foldername='saved_models_cmplx_tests'):
    
    N_test = len(test_data['simInData_latent'])

    print(f'Beginning process with:')
    print(f'-> Epochs: {nEpochs}')
    print(f'-> Complexity range: {complexity_range}')
    print(f'-> {N_train} training samples')
    print(f"-> {N_test} test samples")
    print(f'-> Averaging {no_ds} datasets')
    
    # Generate dataframes to store final results in
    std_train_rmse = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    std_test_rmse = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    grad_train_rmse = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    grad_test_rmse = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    
    std_train_corr = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    std_test_corr = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    grad_train_corr = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    grad_test_corr = pd.DataFrame(index=complexity_range, columns=range(no_ds))
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_res = {executor.submit(pipe, curr_data, test_data, 
                                        N_train, foldername, i): i \
                         for i, curr_data in train_data.items()}
        for i in concurrent.futures.as_completed(future_to_res):
            std_train_res, std_test_res, grad_train_res, grad_test_res, idx = i.result()
            
            for c in complexity_range:
                std_train_rmse.loc[c, idx] = std_train_res[c]['rmse']
                std_train_corr.loc[c, idx] = std_train_res[c]['corr']
                
                std_test_rmse.loc[c, idx] = std_test_res[c]['rmse']
                std_test_corr.loc[c, idx] = std_test_res[c]['corr']
            
                grad_train_rmse.loc[c, idx] = grad_train_res[c]['rmse']
                grad_train_corr.loc[c, idx] = grad_train_res[c]['corr']
                
                grad_test_rmse.loc[c, idx] = grad_test_res[c]['rmse']
                grad_test_corr.loc[c, idx] = grad_test_res[c]['corr']
        
        # Runs once, after all futures have completed; idx is the last dataset processed
        print(f'Completed iteration {int(idx)+1} of {no_ds}')
        
    # Take averages/stds over the per-dataset columns only, so that 'std'
    # is not computed over the newly added 'mean' column
    run_cols = list(range(no_ds))
    for df in (std_train_rmse, std_test_rmse, std_train_corr, std_test_corr,
               grad_train_rmse, grad_test_rmse, grad_train_corr, grad_test_corr):
        runs = df[run_cols].astype(float)
        df['mean'] = runs.mean(axis=1)
        df['std'] = runs.std(axis=1)
    
    c_time = datetime.now().strftime("%Y-%m-%d %H-%M")
    base_string = f'_ntrain{N_train}_ntest{N_test}_nEpoch{nEpochs}_Nruns{no_ds}_{c_time}.csv'
    std_filename_train_rmse = f'std_train_rmse'+base_string
    std_filename_train_corr = f'std_train_corr'+base_string
    
    std_filename_test_rmse = f'std_test_rmse'+base_string
    std_filename_test_corr = f'std_test_corr'+base_string
    
    grad_filename_train_rmse = f'grad_train_rmse'+base_string
    grad_filename_train_corr = f'grad_train_corr'+base_string
    
    grad_filename_test_rmse = f'grad_test_rmse'+base_string
    grad_filename_test_corr = f'grad_test_corr'+base_string
    
    res_folder = f'complexity_test_results_ntrain{N_train}_{c_time}'
    os.mkdir(res_folder)
    std_train_rmse.to_csv(res_folder+'/'+std_filename_train_rmse)
    std_train_corr.to_csv(res_folder+'/'+std_filename_train_corr)
    std_test_rmse.to_csv(res_folder+'/'+std_filename_test_rmse)
    std_test_corr.to_csv(res_folder+'/'+std_filename_test_corr)
    
    grad_train_rmse.to_csv(res_folder+'/'+grad_filename_train_rmse)
    grad_train_corr.to_csv(res_folder+'/'+grad_filename_train_corr)
    grad_test_rmse.to_csv(res_folder+'/'+grad_filename_test_rmse)
    grad_test_corr.to_csv(res_folder+'/'+grad_filename_test_corr)
    
    #Zip up results so they can be downloaded.
    #subprocess.call(["zip", "-r", f"/content/{res_folder}.zip", f"/content/{res_folder}"])
    #files.download(f"/content/{res_folder}.zip")

    print(f'Completed. Saved results to folder {res_folder}')

In [None]:
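def sub_sample_data(n_train, no_runs, all_train_data):
    """Builds no_runs training sets of n_train samples each.
    Each run takes the first n_train rows of all_train_data, shuffled with a
    run-specific seed, so the runs differ in sample order rather than in
    which samples they contain.
    """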
    
    train_data = defaultdict(dict) 
    for ds in range(no_runs):
        np.random.seed(ds)
        perm = np.random.permutation(n_train)
        train_data[ds]['simInData_latent'] = all_train_data['simInData_latent'].iloc[perm, :].reset_index(drop=True)
        train_data[ds]['simInData_X'] = all_train_data['simInData_X'].iloc[perm].reset_index(drop=True)
        train_data[ds]['simOutData'] = all_train_data['simOutData'].iloc[perm, :].reset_index(drop=True)
        train_data[ds]['simOutData_grad'] = all_train_data['simOutData_grad'].iloc[perm, :].reset_index(drop=True)

    return train_data
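
The seeded permutation can be examined in isolation. The sketch below generates the row orderings only, using a local `RandomState` per run instead of reseeding the global generator; this is a design choice made for the illustration, and the helper name `seeded_orderings` is hypothetical.

```python
import numpy as np


def seeded_orderings(n_train, no_runs):
    """Return one reproducible permutation of range(n_train) per run."""
    orderings = {}
    for ds in range(no_runs):
        rng = np.random.RandomState(ds)  # run index doubles as the seed
        orderings[ds] = rng.permutation(n_train)
    return orderings


orderings = seeded_orderings(n_train=10, no_runs=3)
for ds, perm in orderings.items():
    print(ds, perm)
```

Each ordering is a rearrangement of the same indices `0..n_train-1`, and rerunning with the same seeds reproduces the same orderings.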

In [None]:
def lr_time_based_decay(epoch, lr):
    """Schedule for LearningRateScheduler: scales the current lr by a constant factor each epoch."""
    return lr * 1 / (1 + decay * nEpochs)
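
Keras' `LearningRateScheduler` calls the schedule at the start of every epoch with the current learning rate and applies the returned value. The pure-Python sketch below simulates that loop for the schedule as written, assuming the starting rate is the `lr=0.001` default of `make_model` and `decay = 0.01 / 300` as set later in the notebook; `make_schedule` and `simulate` are hypothetical helpers.

```python
def make_schedule(decay, n_epochs):
    """Build a schedule with the same form as lr_time_based_decay."""
    def schedule(epoch, lr):
        return lr * 1 / (1 + decay * n_epochs)
    return schedule


def simulate(schedule, start_lr, n_epochs):
    """Apply the schedule once per epoch, feeding back the current rate."""
    lr = start_lr
    history = []
    for epoch in range(n_epochs):
        lr = schedule(epoch, lr)
        history.append(lr)
    return history


n_epochs = 300
decay = 0.01 / n_epochs
history = simulate(make_schedule(decay, n_epochs), start_lr=0.001, n_epochs=n_epochs)
print(f"first epoch: {history[0]:.6f}, last epoch: {history[-1]:.6f}")
```

Since `decay * n_epochs` is a constant (0.01 here), the rate is divided by 1.01 every epoch, which is a geometric decay. The more common time-based form uses `epoch` in place of `n_epochs` in the denominator.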

In [None]:
# Location of 8D data, generated by function presented in Chapter 4, Section 4.1.1
data_folder = '8d_data'

all_data_train = {}
all_data_test = {}

all_data_train['simInData_latent'] = pd.read_csv(f'{data_folder}/x_train_latent.csv', index_col=0)
all_data_train['simInData_X'] = pd.read_csv(f'{data_folder}/x_train_X.csv', index_col=0)
all_data_train['simOutData'] = pd.read_csv(f'{data_folder}/y_train.csv', index_col=0)
all_data_train['simOutData_grad'] = pd.read_csv(f'{data_folder}/y_train_grad.csv', index_col=0)

all_data_test['simInData_latent'] = pd.read_csv(f'{data_folder}/x_test_latent.csv', index_col=0)
all_data_test['simInData_X'] = pd.read_csv(f'{data_folder}/x_test_X.csv', index_col=0)
all_data_test['simOutData'] = pd.read_csv(f'{data_folder}/y_test.csv', index_col=0)
print(f'Successfully read in dataset.')
# No need to read in test gradients

Successfully read in dataset.


In [None]:
# Number of random draws from the training data pool to use
n_runs = 5
nTrain = 2500

train_data_ss = sub_sample_data(nTrain, n_runs, all_data_train)

In [None]:
# Folder to save models to
foldername = f'/saved_models_nTrain2500/saved_models_cmplx_tests'
nEpochs = 300
initial_learning_rate = 0.01
decay = initial_learning_rate / nEpochs

In [None]:
complexity_range = [10, 20, 30, 40, 50, 60, 70, 80, 90]

In [None]:
%%time 

run(nTrain, n_runs, train_data_ss, all_data_test, foldername=foldername)

Beginning process with:
-> Epochs: 300
-> Complexity range: [10, 20, 30, 40, 50, 60, 70, 80, 90]
-> 2500 training samples
-> 1000 test samples
-> Averaging 5 datasets
Beginning iteration 1
Beginning iteration 2
Beginning iteration 3
Beginning iteration 4
Beginning iteration 5
Training models for dataset 1
Training models for dataset 3
Training models for dataset 2
Training models for dataset 5
Training models for dataset 4
Trained models for dataset 4 in 17514.73s
Loading models for dataset 4
Generating predictions for dataset 4
Trained models for dataset 2 in 17548.82s
Loading models for dataset 2
Trained models for dataset 1 in 17549.73s
Loading models for dataset 1
Trained models for dataset 3 in 17549.93s
Loading models for dataset 3
Trained models for dataset 5 in 17546.88s
Loading models for dataset 5
Generating predictions for dataset 2
Generating predictions for dataset 1
Generating predictions for dataset 3
Generating predictions for dataset 5
Completed iteration 2 of 5


Completed. Saved results to folder complexity_test_results_ntrain2500_2021-08-31 21-51
CPU times: user 8h 37min 35s, sys: 42min 37s, total: 9h 20min 12s
Wall time: 4h 55min 22s
