# Baseline Model

The purpose of this notebook is to train and evaluate a baseline model against which the performance of our more complex models will be gauged. We define the baseline to be the simplest model we can construct under the constraint that it maps every input to a single output value. For a continuous target evaluated with squared error, the single value that best approximates the output distribution is its mean. The baseline model for the Wells Fargo stock price time series is therefore configured to output the mean price of Wells Fargo stock, estimated from the training dataset.
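
To make the claim about the mean concrete, assume that "best approximates" is measured by expected squared error, the loss used later in this notebook. Under that assumption, the optimal constant $c$ follows from a short derivation:

$$\frac{d}{dc} E\left[(Y - c)^2\right] = -2\left(E(Y) - c\right) = 0 \quad\Longrightarrow\quad c = E(Y)$$

The second derivative is $2 > 0$, so this is a minimum. At $c = E(Y)$ the expected squared error equals $\mathrm{Var}(Y)$, which means the baseline's loss on the data used to estimate the mean is simply the variance of the target.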

## Building and Evaluating the Baseline Model

Since we are dealing with a regression problem with a single target variable, we can represent any model that fits this description as:

$$E(Y \mid X = x) = W \cdot x + b$$

In the case of simple linear regression, $x$ directly represents the input features to the model, and $W$ and $b$ are the weights and bias that the model learns. If instead this equation represents a deep model architecture, then $x$ stands for the features fed to the **output layer**, and $W$ and $b$ represent only the weights and bias that the model learns for that layer.

Given a model of the form described above, we can turn it into our baseline model by requiring $W = 0$ and $b = E(Y)$. This forces the model to output the same value, $E(Y)$, for every sample $X=x$ fed to it, which is precisely the behavior we require of the baseline. A corollary is that we can convert **any** model representable in this way into our baseline model by initializing the output layer's weights to 0 and its bias to $E(Y)$.
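
The construction above can be checked on synthetic data. The following is a minimal NumPy sketch, not the notebook's Keras model; the helper `linear_output` and the random data are made up for illustration.

```python
import numpy as np

def linear_output(x, W, b):
    """Output layer of the form W . x + b for a batch of inputs x (hypothetical helper)."""
    return x @ W + b

rng = np.random.default_rng(0)                  # synthetic data, not the stock dataset
x = rng.normal(size=(100, 20))                  # 100 samples, 20 features
y = rng.normal(loc=30.0, scale=4.0, size=100)   # synthetic target values

W = np.zeros(20)                                # W = 0 removes any dependence on x
b = y.mean()                                    # b = E(Y), estimated by the sample mean

preds = linear_output(x, W, b)
mse = np.mean((y - preds) ** 2)

print(np.allclose(preds, b))                    # True if every prediction equals the mean
print(np.isclose(mse, y.var()))                 # True if the MSE equals the target variance
```

Because the weights are zero, the inputs never influence the output, and the mean squared error of the resulting constant predictor on `y` coincides with the variance of `y`.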

In the code below we define and initialize a baseline model using the train dataset, and then evaluate it on the validation dataset. Both datasets were pickled to disk in `Stock_Analysis_EDA.ipynb`. We evaluate the model's mean squared error as well as its accuracy at predicting daily stock trends. We chose mean squared error because it is the loss function we will minimize when training more complex models, while the accuracy metric was chosen simply because it will be interesting to learn how well our models perform at predicting trends.
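
The trend-accuracy metric can be sketched as follows. This sketch assumes that a sample counts as an upward movement when its price exceeds the previous sample's actual price; the helper name `up_move_accuracy` and the toy prices are hypothetical.

```python
import numpy as np

def up_move_accuracy(preds, targets):
    """Fraction of steps on which the predicted direction matches the actual one.

    preds[t] and targets[t] are the predicted and actual prices for step t.
    A step counts as "up" when the price exceeds the previous actual price.
    """
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    prev = targets[:-1]                      # actual price at step t-1
    pred_up = (preds[1:] - prev) > 0         # model expects a rise at step t
    actual_up = (targets[1:] - prev) > 0     # price actually rose at step t
    return np.mean(pred_up == actual_up)

# Toy prices (made up), scored against a constant predictor that outputs their mean.
targets = np.array([10.0, 12.0, 11.0, 13.0, 9.0])
preds = np.full_like(targets, targets.mean())    # the mean is 11.0
print(up_move_accuracy(preds, targets))          # prints 0.75
```

For a constant predictor, the predicted direction is "up" exactly when the previous price lies below the constant, so the accuracy depends only on how the series moves relative to that value.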

Restart the kernel before running the cells below in order to fully clear GPU memory.

In [12]:
# Configuring Virtual GPU and Loading Data

import os
import json
import pickle

from stockanalysis.train import config_hardware

# Project Paths
project_dir = os.path.split(os.path.split(os.getcwd())[0])[0]
path_to_data = os.path.join(project_dir, 'data')

# Random Seed
seed = None

# Configuring GPU and TensorFlow
config_hardware(gpu_memory=7000, seed=seed)

# Loading Train Dataset
with open(os.path.join(path_to_data, 'train_dataset.pickle'), 'rb') as f:
    train_dataset = pickle.load(f)

# Loading Validation Dataset
with open(os.path.join(path_to_data, 'val_dataset.pickle'), 'rb') as f:
    val_dataset = pickle.load(f)
    
# Loading Train Dataset's Vocabulary
with open(os.path.join(path_to_data, 'vocab_8k_norm_train_WFC_JPM_BAC_C.json'), 'r') as f:
    vocab = json.load(f)

GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Visible GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
1 Physical GPUs, 1 Logical GPUs


Defining baseline model.

In [13]:
import tensorflow as tf

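def baseline_model(output_bias_init=None):
    """Build a model whose output layer has zero weights and a configurable bias.

    output_bias_init: optional dict mapping output names to the constant used
    to initialize that output's bias (here, the mean of the training target).
    """
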
    if output_bias_init is not None:
        output_bias = output_bias_init['adjusted_close_target_WFC']
        output_bias_init = tf.keras.initializers.Constant(output_bias)
        
    inputs = {
              'adjusted_close_WFC': tf.keras.Input(shape=(5,), name='adjusted_close_WFC', dtype=tf.float32),
              '8-k_WFC': tf.keras.Input(shape=(None,), name='8-k_WFC', dtype=tf.int64),
              'adjusted_close_JPM': tf.keras.Input(shape=(5,), name='adjusted_close_JPM', dtype=tf.float32),
              '8-k_JPM': tf.keras.Input(shape=(None,), name='8-k_JPM', dtype=tf.int64),
              'adjusted_close_BAC': tf.keras.Input(shape=(5,), name='adjusted_close_BAC', dtype=tf.float32),
              '8-k_BAC': tf.keras.Input(shape=(None,), name='8-k_BAC', dtype=tf.int64),
              'adjusted_close_C': tf.keras.Input(shape=(5,), name='adjusted_close_C', dtype=tf.float32),
              '8-k_C': tf.keras.Input(shape=(None,), name='8-k_C', dtype=tf.int64)
             }
    
    features = tf.keras.layers.Concatenate()([inputs[fname] for fname in inputs.keys() if '8-k' not in fname])
    
    output_layer = tf.keras.layers.Dense(1, kernel_initializer='zeros', bias_initializer=output_bias_init, 
                                         name='adjusted_close_target_WFC')
    
    outputs = {
               'adjusted_close_target_WFC': output_layer(features)
              }
    
    model = tf.keras.Model(inputs, outputs, name='baseline_model')
    
    return model

Inspecting both our training and validation datasets.

In [14]:
X_train, y_train = train_dataset
X_val, y_val = val_dataset

print('Feature names and shapes for Training Data:')
for key in X_train:
    print('{}: {}'.format(key, X_train[key].shape[1:]))
print()
print('Feature names and shapes for Validation Data:')
for key in X_val:
    print('{}: {}'.format(key, X_val[key].shape[1:]))
print()
print('Train set size: {}'.format(len(y_train['adjusted_close_target_WFC'])))
print('Validation set size: {}'.format(len(y_val['adjusted_close_target_WFC'])))

Feature names and shapes for Training Data:
adjusted_close_WFC: (5,)
8-k_WFC: (1562,)
adjusted_close_JPM: (5,)
8-k_JPM: (90363,)
adjusted_close_BAC: (5,)
8-k_BAC: (15826,)
adjusted_close_C: (5,)
8-k_C: (53958,)

Feature names and shapes for Validation Data:
adjusted_close_WFC: (5,)
8-k_WFC: (1364,)
adjusted_close_JPM: (5,)
8-k_JPM: (1706,)
adjusted_close_BAC: (5,)
8-k_BAC: (2026,)
adjusted_close_C: (5,)
8-k_C: (2636,)

Train set size: 3014
Validation set size: 1001


Building baseline model.

In [15]:
import tensorflow as tf

from stockanalysis.train import build_compiled_model

# Defining Hyperparameters
output_bias_init = {key: y_train[key].mean() for key in y_train}
model_params = {'output_bias_init': output_bias_init}
training_params = {'batch_size': None, 'epochs': None}
loss = tf.keras.losses.MeanSquaredError
optimizer = tf.keras.optimizers.Adam
optimizer_params = {}

model_version = None
hyperparameters = {
                   'model_parameters': model_params,
                   'training_parameters': training_params,
                   'loss': loss, 
                   'optimizer': optimizer, 
                   'optimizer_parameters': optimizer_params, 
                   'version': model_version
                  }

# Defining Metrics
metrics = []

# Setting unique Run Number 
run_number = None

model_baseline, initial_epoch = build_compiled_model(baseline_model, hyperparameters, metrics, run_number)

Evaluating baseline model.

In [16]:
import numpy as np

# Loss of the baseline model on each dataset
metrics_train = model_baseline.evaluate(X_train, y_train, verbose=0)
metrics_val = model_baseline.evaluate(X_val, y_val, verbose=0)
# Predictions of the baseline model
m_preds_train = model_baseline.predict(X_train)
m_preds_val = model_baseline.predict(X_val)
# Predicted direction: 1 if the prediction for a sample exceeds the previous sample's actual target
m_preds_up_train = ((m_preds_train[1:, 0] - y_train['adjusted_close_target_WFC'][:-1]) > 0).astype(int)
m_preds_up_val = ((m_preds_val[1:, 0] - y_val['adjusted_close_target_WFC'][:-1]) > 0).astype(int)
# Actual direction: 1 if a sample's target exceeds the previous sample's target
labels_up_train = ((y_train['adjusted_close_target_WFC'][1:] - y_train['adjusted_close_target_WFC'][:-1]) > 0).astype(int)
labels_up_val = ((y_val['adjusted_close_target_WFC'][1:] - y_val['adjusted_close_target_WFC'][:-1]) > 0).astype(int)
# Accuracy: fraction of samples where predicted and actual directions agree
up_cls_acc_train = np.mean(np.equal(m_preds_up_train, labels_up_train))
up_cls_acc_val = np.mean(np.equal(m_preds_up_val, labels_up_val))

In [17]:
# Sanity check: the baseline model outputs the same value for every training sample
assert all((pred == m_preds_train[0]) for pred in m_preds_train)

print('Loss on train dataset for Baseline Model: {}'.format(metrics_train))
print('Loss on validation dataset for Baseline Model: {}'.format(metrics_val))
print()
print('Metrics for Classifying Upward Movements')
print('Accuracy on train dataset for Baseline Model: {}'.format(up_cls_acc_train))
print('Accuracy on validation dataset for Baseline Model: {}'.format(up_cls_acc_val))

Loss on train dataset for Baseline Model: 18.943944523487964
Loss on validation dataset for Baseline Model: 387.5771874609765

Metrics for Classifying Upward Movements
Accuracy on train dataset for Baseline Model: 0.7467640225688682
Accuracy on validation dataset for Baseline Model: 0.507
