# Train a Deep NN to predict Asset Price returns

In practice, we need to explore variations of the design options outlined above because we can rarely be sure from the outset which network architecture best suits the data.

In this section, we will explore various options to build a simple feedforward Neural Network to predict asset price returns for a one-day horizon.

## Imports & Settings

In [137]:
import warnings
warnings.filterwarnings('ignore')

In [161]:
%matplotlib inline

import os, sys
from ast import literal_eval as make_tuple
from time import time
from pathlib import Path
from itertools import product
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import spearmanr
import seaborn as sns
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping
import time
import pandas as pd
from pathlib import Path
from scipy.stats import spearmanr
import numpy as np
from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Conv1D, MaxPooling1D, GlobalAveragePooling1D, BatchNormalization, Activation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPooling1D, GlobalAveragePooling1D, BatchNormalization, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import Input


In [162]:
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
if gpu_devices:
    print('Using GPU')
    tf.config.experimental.set_memory_growth(gpu_devices[0], True)
else:
    print('Using CPU')

Using GPU


In [163]:
project_path = Path(r'C:/Users/flyin/OneDrive/Desktop/autotrade/new/machine-learning-for-trading')
sys.path.append(str(project_path))
from utils import MultipleTimeSeriesCV, format_time


In [164]:
np.random.seed(42)
sns.set_style('whitegrid')
idx = pd.IndexSlice

In [165]:
DATA_STORE = '../data/assets.h5'

In [166]:
results_path = Path('results')
if not results_path.exists():
    results_path.mkdir()
    
checkpoint_path = results_path / 'logs'

## Create a stock return series to predict asset price moves

To develop our trading strategy, we use the daily stock returns for some 995 US stocks for the eight year period from 2010 to 2017, and the features developed in Chapter 12 that include volatility and momentum factors as well as lagged returns with cross-sectional and sectoral rankings.

In [167]:
# Use the correct path to the data.h5 file
data = pd.read_hdf(r'C:\Users\flyin\OneDrive\Desktop\autotrade\new\machine-learning-for-trading\new\auto_generated_notebooks\feature_engineering_gradient_boosting\data.h5', 'model_data').dropna().sort_index()


In [168]:
data.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 948443 entries, ('A', Timestamp('2010-04-06 00:00:00')) to ('ZION', Timestamp('2017-09-28 00:00:00'))
Data columns (total 38 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   dollar_vol       948443 non-null  float64
 1   dollar_vol_rank  948443 non-null  float64
 2   rsi              948443 non-null  float64
 3   bb_high          948443 non-null  float64
 4   bb_low           948443 non-null  float64
 5   NATR             948443 non-null  float64
 6   ATR              948443 non-null  float64
 7   PPO              948443 non-null  float64
 8   MACD             948443 non-null  float64
 9   metadata_sector  948443 non-null  object 
 10  r01              948443 non-null  float64
 11  r05              948443 non-null  float64
 12  r10              948443 non-null  float64
 13  r21              948443 non-null  float64
 14  r42              948443 non-null  float64
 15  r63        

In [169]:
# Convert 'metadata_sector' to categorical
data['metadata_sector'] = data['metadata_sector'].astype('category')

# List of categorical columns
categoricals = ['year', 'month', 'sector', 'weekday', 'metadata_sector']

# Apply one-hot encoding to the categorical columns
data_encoded = pd.get_dummies(data, columns=categoricals, drop_first=True)

# Now 'data_encoded' is ready to be used in your model training
print(f"Data after one-hot encoding has {data_encoded.shape[1]} columns.")


Data after one-hot encoding has 77 columns.


In [170]:
outcomes = data.filter(like='fwd').columns.tolist()

In [171]:
lookahead = 1
outcome= f'r{lookahead:02}_fwd'

In [172]:
X_cv = data.loc[idx[:, :'2017'], :].drop(outcomes, axis=1)
y_cv = data.loc[idx[:, :'2017'], outcome]

In [173]:
len(X_cv.index.get_level_values('symbol').unique())

910

In [174]:
X_cv.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 948443 entries, ('A', Timestamp('2010-04-06 00:00:00')) to ('ZION', Timestamp('2017-09-28 00:00:00'))
Data columns (total 32 columns):
 #   Column           Non-Null Count   Dtype   
---  ------           --------------   -----   
 0   dollar_vol       948443 non-null  float64 
 1   dollar_vol_rank  948443 non-null  float64 
 2   rsi              948443 non-null  float64 
 3   bb_high          948443 non-null  float64 
 4   bb_low           948443 non-null  float64 
 5   NATR             948443 non-null  float64 
 6   ATR              948443 non-null  float64 
 7   PPO              948443 non-null  float64 
 8   MACD             948443 non-null  float64 
 9   metadata_sector  948443 non-null  category
 10  r01              948443 non-null  float64 
 11  r05              948443 non-null  float64 
 12  r10              948443 non-null  float64 
 13  r21              948443 non-null  float64 
 14  r42              948443 non-null  float64 

## Automate model generation

The following `make_model` function illustrates how to flexibly define various architectural elements for the search process. The dense_layers argument defines both the depth and width of the network as a list of integers. We also use dropout for regularization, expressed as a float in the range [0, 1] to define the probability that a given unit will be excluded from a training iteration.

In [183]:
def make_modern_cnn_model(dense_layers, activation, dropout, input_shape=(43, 1), num_classes=1):
    model = Sequential()
    
    # Input layer for 1D data (e.g., time series, structured data)
    model.add(Input(shape=input_shape))
    
    # First Convolutional Block (Conv1D for 1D data)
    model.add(Conv1D(32, kernel_size=3, padding='same', kernel_initializer='he_uniform'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))  # Modern use of ReLU for CNNs
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(dropout))
    
    # Second Convolutional Block
    model.add(Conv1D(64, kernel_size=3, padding='same', kernel_initializer='he_uniform'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(dropout))
    
    # Third Convolutional Block
    model.add(Conv1D(128, kernel_size=3, padding='same', kernel_initializer='he_uniform'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Dropout(dropout))
    
    # Global Average Pooling instead of flattening (modern choice for reducing model size)
    model.add(GlobalAveragePooling1D())

    # Fully connected layers (as per your dense_layer_opts)
    for units in dense_layers:
        model.add(Dense(units))
        model.add(Activation(activation))
        model.add(Dropout(dropout))

    # Output layer (for binary classification, you can modify this for multi-class if needed)
    model.add(Dense(num_classes, activation='sigmoid'))  # For binary classification
    
    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    return model

## Cross-validate multiple configurations with TensorFlow

### Train-Test Split

We split the data into a training set for cross-validation, and keep the last 12 months with data as holdout test:

In [184]:
n_splits = 12
train_period_length=21 * 12 * 4
test_period_length=21 * 3

In [185]:
cv = MultipleTimeSeriesCV(n_splits=n_splits,
                          train_period_length=train_period_length,
                          test_period_length=test_period_length,
                          lookahead=lookahead)

### Define CV Parameters

Now we just need to define our Keras classifier using the make_model function, set cross-validation (see chapter 6 on The Machine Learning Process and following for the OneStepTimeSeriesSplit), and the parameters that we would like to explore. 

We pick several one- and two-layer configurations, relu and tanh activation functions, and different dropout rates. We could also try out different optimizers (but did not run this experiment to limit what is already a computationally intensive effort):

In [186]:
"""dense_layer_opts = [(16, 8), (32, 16), (32, 32), (64, 32)]
activation_opts = ['tanh']
dropout_opts = [0, .1, .2]"""

SyntaxError: unterminated string literal (detected at line 3) (99134841.py, line 3)

In [187]:
"""param_grid = list(product(dense_layer_opts, activation_opts, dropout_opts))
np.random.shuffle(param_grid)"""

SyntaxError: incomplete input (3370599035.py, line 2)

In [188]:
"""len(param_grid)"""

'len(param_grid)'

To trigger the parameter search, we instantiate a GridSearchCV object, define the fit_params that will be passed to the Keras model’s fit method, and provide the training data to the GridSearchCV fit method:

In [189]:
def get_train_valid_data(X, y, train_idx, test_idx):
    x_train, y_train = X.iloc[train_idx, :], y.iloc[train_idx]
    x_val, y_val = X.iloc[test_idx, :], y.iloc[test_idx]
    return x_train, y_train, x_val, y_val

In [204]:
# Define callbacks
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
tensorboard_log_dir = './logs'
tensorboard = TensorBoard(log_dir=tensorboard_log_dir, histogram_freq=1)

# Before scaling, convert categorical columns to numerical ones using one-hot encoding
X_cv_encoded = pd.get_dummies(X_cv)

# Initialize variables
ic = []
scaler = StandardScaler()

# Loop through parameter grid for hyperparameter tuning
for params in param_grid:
    dense_layers, activation, dropout = params
    for batch_size in [64, 256]:
        print(dense_layers, activation, dropout, batch_size)
        
        # Create directories for checkpoints
        checkpoint_dir = checkpoint_path / str(dense_layers) / activation / str(dropout) / str(batch_size)
        checkpoint_dir.mkdir(parents=True, exist_ok=True)
        
        start_time = time.time()  # Track overall start time
        
        # Cross-validation loop
        for fold, (train_idx, test_idx) in enumerate(cv.split(X_cv_encoded)):  # Use the encoded version here
            # Get train & validation data
            x_train, y_train, x_val, y_val = get_train_valid_data(X_cv_encoded, y_cv, train_idx, test_idx)
            
            # Scale features
            x_train = scaler.fit_transform(x_train)
            x_val = scaler.transform(x_val)
            
            # Reshape for CNN (assuming 1D data)
            x_train = x_train[..., np.newaxis]  # Add a channel dimension
            x_val = x_val[..., np.newaxis]
            
            # Set up dataframes to log results
            preds = y_val.to_frame('actual')
            r = pd.DataFrame(index=y_val.groupby(level='date').size().index)
            
            # Create model based on validation parameters using modernized CNN
            input_shape = (x_train.shape[1], 1)  # Update input shape for CNN
            model = make_modern_cnn_model(dense_layers, activation, dropout, input_shape)
            
            epochs = 50
            for epoch in range(epochs):
                # Track epoch start time
                epoch_start = time.time()
                
                # Train the model for one epoch
                model.fit(x_train,
                          y_train,
                          batch_size=batch_size,
                          epochs=1,  # Train one epoch at a time
                          verbose=0,
                          shuffle=True,
                          validation_data=(x_val, y_val),
                          callbacks=[early_stopping, tensorboard])  # Added TensorBoard and EarlyStopping callbacks
                
                # Save weights after each epoch
                model.save_weights((checkpoint_dir / f'ckpt_{fold}_{epoch}').as_posix())
                
                # Predict and log results
                preds['predicted'] = model.predict(x_val).squeeze()
                r[epoch] = preds.groupby(level='date').apply(lambda x: spearmanr(x['actual'], x['predicted'])[0])
                
                # Calculate time for the epoch
                epoch_time = time.time() - epoch_start
                
                # Estimate ETA (time remaining)
                remaining_epochs = epochs - (epoch + 1)
                eta = epoch_time * remaining_epochs
                eta_str = str(timedelta(seconds=int(eta)))
                
                # Print progress with ETA
                print(f"Epoch {epoch + 1}/{epochs} | Fold {fold + 1} | Epoch time: {epoch_time:.2f}s | ETA: {eta_str}")
            
            # After all epochs, log final performance for the fold (moved outside epoch loop)
            fold_mean = r.mean().mean()  # Overall mean of the correlation
            fold_median = r.median().median()  # Overall median of the correlation

            print(format_time(time.time() - start_time), 
                  f'{fold + 1:02d} | Fold complete | Mean {fold_mean:7.4f} | Median {fold_median:7.4f}')
        
        # Append intermediate results and save to HDF5
        ic.append(r.assign(dense_layers=str(dense_layers), 
                           activation=activation, 
                           dropout=dropout,
                           batch_size=batch_size,
                           fold=fold))
        
        # Save intermediate results to HDF5
        pd.concat(ic).to_hdf(results_path / 'scores.h5', 'ic_by_day')

(64, 32) tanh 0.1 64
Epoch 1/50 | Fold 1 | Epoch time: 87.81s | ETA: 1:11:42
Epoch 2/50 | Fold 1 | Epoch time: 103.33s | ETA: 1:22:40
Epoch 3/50 | Fold 1 | Epoch time: 104.99s | ETA: 1:22:14
Epoch 4/50 | Fold 1 | Epoch time: 86.45s | ETA: 1:06:16
Epoch 5/50 | Fold 1 | Epoch time: 103.99s | ETA: 1:17:59
Epoch 6/50 | Fold 1 | Epoch time: 103.69s | ETA: 1:16:02
Epoch 7/50 | Fold 1 | Epoch time: 98.07s | ETA: 1:10:16
Epoch 8/50 | Fold 1 | Epoch time: 84.40s | ETA: 0:59:04
Epoch 9/50 | Fold 1 | Epoch time: 106.89s | ETA: 1:13:02
Epoch 10/50 | Fold 1 | Epoch time: 106.83s | ETA: 1:11:13
Epoch 11/50 | Fold 1 | Epoch time: 103.83s | ETA: 1:07:29
Epoch 12/50 | Fold 1 | Epoch time: 105.36s | ETA: 1:06:43
Epoch 13/50 | Fold 1 | Epoch time: 106.59s | ETA: 1:05:43
Epoch 14/50 | Fold 1 | Epoch time: 106.09s | ETA: 1:03:39
Epoch 15/50 | Fold 1 | Epoch time: 104.11s | ETA: 1:00:43
Epoch 16/50 | Fold 1 | Epoch time: 107.67s | ETA: 1:01:00
Epoch 17/50 | Fold 1 | Epoch time: 103.02s | ETA: 0:56:39
Epoch 

KeyboardInterrupt: 

### Evaluate predictive performance

In [205]:
params = ['dense_layers', 'dropout', 'batch_size']

In [206]:
ic = pd.read_hdf(results_path / 'scores.h5', 'ic_by_day').drop('activation', axis=1)
ic.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 252 entries, 2014-09-30 to 2014-12-29
Data columns (total 54 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   0             252 non-null    float64
 1   1             252 non-null    float64
 2   2             252 non-null    float64
 3   3             252 non-null    float64
 4   4             252 non-null    float64
 5   5             252 non-null    float64
 6   6             252 non-null    float64
 7   7             252 non-null    float64
 8   8             252 non-null    float64
 9   9             252 non-null    float64
 10  10            252 non-null    float64
 11  11            252 non-null    float64
 12  12            252 non-null    float64
 13  13            252 non-null    float64
 14  14            252 non-null    float64
 15  15            252 non-null    float64
 16  16            252 non-null    float64
 17  17            252 non-null    float64
 18  18         

In [207]:
ic.groupby(params).size()

dense_layers  dropout  batch_size
(64, 32)      0.0      64            63
                       256           63
              0.1      64            63
                       256           63
dtype: int64

In [208]:
ic_long = pd.melt(ic, id_vars=params + ['fold'], var_name='epoch', value_name='ic')
ic_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12600 entries, 0 to 12599
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   dense_layers  12600 non-null  object 
 1   dropout       12600 non-null  float64
 2   batch_size    12600 non-null  int64  
 3   fold          12600 non-null  int64  
 4   epoch         12600 non-null  object 
 5   ic            11901 non-null  float64
dtypes: float64(2), int64(2), object(2)
memory usage: 590.8+ KB


In [209]:
ic_long = ic_long.groupby(params+ ['epoch', 'fold']).ic.mean().to_frame('ic').reset_index()

In [210]:
g = sns.relplot(x='epoch', y='ic', col='dense_layers', row='dropout', 
                data=ic_long[ic_long.dropout>0], kind='line')
g.map(plt.axhline, y=0, ls='--', c='k', lw=1)
g.savefig(results_path / 'ic_lineplot', dpi=300);

In [211]:
def run_ols(ic):
    ic.dense_layers = ic.dense_layers.str.replace(', ', '-').str.replace('(', '').str.replace(')', '')
    data = pd.melt(ic, id_vars=params, var_name='epoch', value_name='ic')
    data.epoch = data.epoch.astype(int).astype(str).apply(lambda x: f'{int(x):02.0f}')
    model_data = pd.get_dummies(data.sort_values(params + ['epoch']), columns=['epoch'] + params, drop_first=True).sort_index(1)
    model_data.columns = [s.split('_')[-1] for s in model_data.columns]
    model = sm.OLS(endog=model_data.ic, exog=sm.add_constant(model_data.drop('ic', axis=1)))
    return model.fit()

In [212]:
model = run_ols(ic.drop('fold', axis=1))

TypeError: DataFrame.sort_index() takes 1 positional argument but 2 were given

In [213]:
print(model.summary())

Model: "sequential_65"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_166 (Conv1D)         (None, 43, 32)            128       
                                                                 
 batch_normalization_165 (Ba  (None, 43, 32)           128       
 tchNormalization)                                               
                                                                 
 activation_279 (Activation)  (None, 43, 32)           0         
                                                                 
 max_pooling1d_165 (MaxPooli  (None, 21, 32)           0         
 ng1D)                                                           
                                                                 
 dropout_277 (Dropout)       (None, 21, 32)            0         
                                                                 
 conv1d_167 (Conv1D)         (None, 21, 64)          

In [214]:
fig, ax = plt.subplots(figsize=(14, 4))

ci = model.conf_int()
errors = ci[1].sub(ci[0]).div(2)

coefs = (model.params.to_frame('coef').assign(error=errors)
         .reset_index().rename(columns={'index': 'variable'}))
coefs = coefs[~coefs['variable'].str.startswith('date') & (coefs.variable != 'const')]

coefs.plot(x='variable', y='coef', kind='bar',
           ax=ax, color='none', capsize=3,
           yerr='error', legend=False, rot=0, title='Impact of Architecture and Training Parameters on Out-of-Sample Performance')
ax.set_ylabel('IC')
ax.set_xlabel('')
ax.scatter(x=pd.np.arange(len(coefs)), marker='_', s=120, y=coefs['coef'], color='black')
ax.axhline(y=0, linestyle='--', color='black', linewidth=1)
ax.xaxis.set_ticks_position('none')

ax.annotate('Batch Size', xy=(.02, -0.1), xytext=(.02, -0.2),
            xycoords='axes fraction',
            textcoords='axes fraction',
            fontsize=11, ha='center', va='bottom',
            bbox=dict(boxstyle='square', fc='white', ec='black'),
            arrowprops=dict(arrowstyle='-[, widthB=1.3, lengthB=0.8', lw=1.0, color='black'))

ax.annotate('Layers', xy=(.1, -0.1), xytext=(.1, -0.2),
            xycoords='axes fraction',
            textcoords='axes fraction',
            fontsize=11, ha='center', va='bottom',
            bbox=dict(boxstyle='square', fc='white', ec='black'),
            arrowprops=dict(arrowstyle='-[, widthB=4.8, lengthB=0.8', lw=1.0, color='black'))

ax.annotate('Dropout', xy=(.2, -0.1), xytext=(.2, -0.2),
            xycoords='axes fraction',
            textcoords='axes fraction',
            fontsize=11, ha='center', va='bottom',
            bbox=dict(boxstyle='square', fc='white', ec='black'),
            arrowprops=dict(arrowstyle='-[, widthB=2.8, lengthB=0.8', lw=1.0, color='black'))

ax.annotate('Epochs', xy=(.62, -0.1), xytext=(.62, -0.2),
            xycoords='axes fraction',
            textcoords='axes fraction',
            fontsize=11, ha='center', va='bottom',
            bbox=dict(boxstyle='square', fc='white', ec='black'),
            arrowprops=dict(arrowstyle='-[, widthB=30.5, lengthB=1.0', lw=1.0, color='black'))

sns.despine()
fig.tight_layout()
fig.savefig(results_path / 'ols_coef', dpi=300)

AttributeError: 'Sequential' object has no attribute 'conf_int'

## Make Predictions

In [215]:
def get_best_params(n=5):
    """Get the best parameters across all folds by daily median IC"""
    params = ['dense_layers', 'activation', 'dropout', 'batch_size']
    ic = pd.read_hdf(results_path / 'scores.h5', 'ic_by_day').drop('fold', axis=1)
    dates = sorted(ic.index.unique())
    train_period = 24 * 21
    train_dates = dates[:train_period]
    ic = ic.loc[train_dates]
    return (ic.groupby(params)
            .median()
            .stack()
            .to_frame('ic')
            .reset_index()
            .rename(columns={'level_4': 'epoch'})
            .nlargest(n=n, columns='ic')
            .drop('ic', axis=1)
            .to_dict('records'))

In [216]:
def generate_predictions(dense_layers, activation, dropout, batch_size, epoch):
    data = pd.read_hdf('../12_gradient_boosting_machines/data.h5', 'model_data').dropna().sort_index()
    outcomes = data.filter(like='fwd').columns.tolist()
    X_cv = data.loc[idx[:, :'2017'], :].drop(outcomes, axis=1)
    input_dim = X_cv.shape[1]
    y_cv = data.loc[idx[:, :'2017'], 'r01_fwd']

    scaler = StandardScaler()
    predictions = []
    
    do = '0' if str(dropout) == '0.0' else str(dropout)
    checkpoint_dir = checkpoint_path / str(dense_layers) / activation / str(do) / str(batch_size)
        
    for fold, (train_idx, test_idx) in enumerate(cv.split(X_cv)):
        x_train, y_train, x_val, y_val = get_train_valid_data(X_cv, y_cv, train_idx, test_idx)
        x_val = scaler.fit(x_train).transform(x_val)
        model = make_model(make_tuple(dense_layers), activation, dropout)
        status = model.load_weights((checkpoint_dir / f'ckpt_{fold}_{epoch}').as_posix())
        status.expect_partial()
        predictions.append(pd.Series(model.predict(x_val).squeeze(), index=y_val.index))
    return pd.concat(predictions)        

In [217]:
best_params = get_best_params()
predictions = []
for i, params in enumerate(best_params):
    predictions.append(generate_predictions(**params).to_frame(i))

predictions = pd.concat(predictions, axis=1)
print(predictions.info())
predictions.to_hdf(results_path / 'test_preds.h5', 'predictions')

FileNotFoundError: File ../12_gradient_boosting_machines/data.h5 does not exist

### How to further improve the results

The relatively simple architecture yields some promising results. To further improve performance, you can
- First and foremost, add new features and more data to the model
- Expand the set of architectures to explore, including more or wider layers
- Inspect the training progress and train for more epochs if the validation error continued to improve at 50 epochs

Finally, you can use more sophisticated architectures, including Recurrent Neural Networks (RNN) and Convolutional Neural Networks that are well suited to sequential data, whereas vanilla feedforward NNs are not designed to capture the ordered nature of the features.
