## Modeling

### Baseline Model

The baseline model is the simplest model we can construct by hand from our exploratory data analysis alone. It gives us a reference loss that our trained models should beat. Since we are dealing with a regression problem with a single target variable, we can represent any model that fits this problem description as:

$$E(Y|X=X) = W \cdot X + b$$

In the case of simple linear regression, $X$ directly represents the input features to the model, and $W$ and $b$ are the weights and bias that the model learns. If instead the equation describes a deep model architecture, then $W$ and $b$ are only the weights and bias of the **output layer**. In that case $X$ is a function of the input features: it is the input passed to the output layer, or in other words the features that the previous layers of the model **learned** to extract from the original input features.
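To make the two readings of the equation concrete, here is a minimal pure-Python sketch, not the architecture used in this project. The helper names `relu_layer` and `output_layer`, the single hidden layer, and all numbers are hypothetical and chosen only for illustration.

```python
def relu_layer(inputs, weights, biases):
    """One hidden layer: each unit computes max(0, w . inputs + b)."""
    return [
        max(0.0, sum(w * x for w, x in zip(unit_w, inputs)) + b)
        for unit_w, b in zip(weights, biases)
    ]


def output_layer(features, W, b):
    """Linear output layer: W . features + b."""
    return sum(w * f for w, f in zip(W, features)) + b


# Hypothetical raw input features and hidden-layer parameters.
raw_input = [1.0, -2.0]
hidden_w = [[0.5, 0.5], [1.0, -1.0]]
hidden_b = [0.0, 0.0]

# In a deep model, X in the equation is the learned representation,
# not the raw input.
learned = relu_layer(raw_input, hidden_w, hidden_b)
prediction = output_layer(learned, W=[2.0, 0.5], b=0.1)

# In simple linear regression, X is the raw input itself.
linear_prediction = output_layer(raw_input, W=[2.0, 0.5], b=0.1)
```

Both predictions come from the same `output_layer` function; the only difference is what is passed in as its features.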

We can express our baseline model as one that predicts the expected value of the target variable no matter what the predictor variables happen to be. If a relation exists between our predictor variables and the target variable, then averaging the predictions of any well-fitted model over all values of the predictor variables should recover the expected value of the target variable. The bare minimum we can ask of a model is therefore that, on average, it predicts the expected value of the target variable. If we design a model that predicts the target variable without taking the predictor variables into account, then in order to satisfy this bare-minimum constraint it must predict the mean of our target variable. Mathematically, we can express our baseline model as:

$E(Y|X=X) = E(Y) = \mu = W \cdot X + b$ such that $W = 0$ in order to satisfy the input independent constraint.

This implies $b = \mu$.
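As a supporting step, the following short derivation uses only the law of total expectation and the definition of variance to show why the mean is the natural constant prediction. The symbol $c$ stands for an arbitrary constant prediction and is introduced here only for illustration.

$$E_X\big[E(Y \mid X)\big] = E(Y) = \mu$$

$$E\big[(Y - c)^2\big] = \mathrm{Var}(Y) + (\mu - c)^2$$

The second expression is smallest when $c = \mu$. Under a mean squared error loss, the input-independent model with $b = \mu$ is therefore the best constant predictor, and its expected loss equals $\mathrm{Var}(Y)$.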

This also implies that we can make any model equivalent to the baseline model by initializing its output layer's weights to zero and its output bias to the expected value of our target variable.
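The effect of this initialization can be checked on toy numbers. The sketch below is a pure-Python illustration under stated assumptions: the data, the helper `linear_output`, and the two-feature shape are hypothetical and unrelated to the project's datasets.

```python
import statistics


def linear_output(features, weights, bias):
    """Compute W . x + b for a single example."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical toy data: three examples with two features each.
X = [[0.5, -1.2], [2.0, 0.3], [-0.7, 1.1]]
y = [0.01, -0.02, 0.04]

mu = statistics.fmean(y)
weights = [0.0, 0.0]  # W = 0 makes the output independent of the input
bias = mu             # b = mu makes the output equal the target mean

predictions = [linear_output(x, weights, bias) for x in X]

# Mean squared error of the constant prediction.
mse = statistics.fmean((t - p) ** 2 for t, p in zip(y, predictions))
```

Every prediction equals `mu`, and `mse` matches the population variance of `y`, which is the loss any trained model has to beat on this toy data.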


In our specific case we know from our exploratory data analysis that the means of our target variables, the logarithmic adjusted returns, are close to zero for each stock. This agrees with the common modeling assumption that the logarithmic return on a stock is approximately normally distributed and centered near zero. To build our baseline model we therefore create a linear regression model whose weights are initialized to zero and whose bias is initialized to the training mean of the target, which is approximately zero.
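This notebook does not show how the logarithmic returns were computed. The sketch below assumes the usual definition $r_t = \ln(P_t / P_{t-1})$ applied to adjusted closing prices; the function name `log_returns` and the price series are hypothetical.

```python
import math
import statistics


def log_returns(prices):
    """Return ln(P_t / P_{t-1}) for each consecutive pair of prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical adjusted closing prices (assumption, not project data).
prices = [100.0, 101.0, 100.5, 100.5, 99.8]

returns = log_returns(prices)
mean_return = statistics.fmean(returns)
```

Log returns are additive over time, so the sum of `returns` equals the log of the last price over the first. Small day-to-day moves therefore tend to give a mean return near zero.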

Restart kernel in order to fully clear GPU memory.

In [1]:
# Configuring Virtual GPU and Loading Data

import os
import json
import pickle

from stockanalysis.train import config_hardware

# Project Paths
project_dir = os.path.split(os.path.split(os.getcwd())[0])[0]
path_to_data = os.path.join(project_dir, 'data')

# Random Seed
seed = None

# Configuring GPU and TensorFlow
config_hardware(gpu_memory=7000, seed=seed)

# Loading Train Dataset
with open(os.path.join(path_to_data, 'train_datasetp2.pickle'), 'rb') as f:
    train_dataset = pickle.load(f)

# Loading Validation Dataset
with open(os.path.join(path_to_data, 'val_datasetp2.pickle'), 'rb') as f:
    val_dataset = pickle.load(f)
    
# Loading Train Dataset's Vocabulary
with open(os.path.join(path_to_data, 'vocab_8k_norm_trainp2_WFC_JPM_BAC_C.json'), 'r') as f:
    vocab = json.load(f)

GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Visible GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
1 Physical GPUs, 1 Logical GPUs


Defining baseline model.

In [4]:
import tensorflow as tf

def baseline_model(output_bias_init=None):
    
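    if output_bias_init is not None:
        # Start the output bias at the supplied constant (e.g. the target mean).
        output_bias = output_bias_init['adjusted_close_target_WFC']
        output_bias_init = tf.keras.initializers.Constant(output_bias)
    else:
        # Fall back to a zero bias so the output stays a known constant.
        output_bias_init = 'zeros'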
        
    inputs = {
              'adjusted_close_WFC': tf.keras.Input(shape=(5,), name='adjusted_close_WFC', dtype=tf.float32),
              '8-k_WFC': tf.keras.Input(shape=(None,), name='8-k_WFC', dtype=tf.int64),
              'adjusted_close_JPM': tf.keras.Input(shape=(5,), name='adjusted_close_JPM', dtype=tf.float32),
              '8-k_JPM': tf.keras.Input(shape=(None,), name='8-k_JPM', dtype=tf.int64),
              'adjusted_close_BAC': tf.keras.Input(shape=(5,), name='adjusted_close_BAC', dtype=tf.float32),
              '8-k_BAC': tf.keras.Input(shape=(None,), name='8-k_BAC', dtype=tf.int64),
              'adjusted_close_C': tf.keras.Input(shape=(5,), name='adjusted_close_C', dtype=tf.float32),
              '8-k_C': tf.keras.Input(shape=(None,), name='8-k_C', dtype=tf.int64)
             }
    
    features = tf.keras.layers.Concatenate()([inputs[fname] for fname in inputs.keys() if '8-k' not in fname])
    
    output_layer = tf.keras.layers.Dense(1, kernel_initializer='zeros', bias_initializer=output_bias_init, 
                                         name='adjusted_close_target_WFC')
    
    outputs = {
               'adjusted_close_target_WFC': output_layer(features)
              }
    
    model = tf.keras.Model(inputs, outputs, name='baseline_model')
    
    return model

Inspecting both our training and validation datasets.

In [3]:
X_train, y_train = train_dataset
X_val, y_val = val_dataset

print('Feature names and shapes for Training Data:')
for key in X_train:
    print('{}: {}'.format(key, X_train[key].shape[1:]))
print()
print('Feature names and shapes for Validation Data:')
for key in X_val:
    print('{}: {}'.format(key, X_val[key].shape[1:]))
print()
print('Train set size: {}'.format(len(y_train['adjusted_close_target_WFC'])))
print('Validation set size: {}'.format(len(y_val['adjusted_close_target_WFC'])))

Feature names and shapes for Training Data:
adjusted_close_WFC: (5,)
8-k_WFC: (1562,)
adjusted_close_JPM: (5,)
8-k_JPM: (90363,)
adjusted_close_BAC: (5,)
8-k_BAC: (15826,)
adjusted_close_C: (5,)
8-k_C: (53958,)

Feature names and shapes for Validation Data:
adjusted_close_WFC: (5,)
8-k_WFC: (1364,)
adjusted_close_JPM: (5,)
8-k_JPM: (1706,)
adjusted_close_BAC: (5,)
8-k_BAC: (2026,)
adjusted_close_C: (5,)
8-k_C: (2636,)

Train set size: 3014
Validation set size: 1001


Building baseline model.

In [8]:
import tensorflow as tf

from stockanalysis.train import build_compiled_model

# Defining Hyperparameters
output_bias_init = {key: y_train[key].mean() for key in y_train}
model_params = {'output_bias_init': output_bias_init}
training_params = {'batch_size': None, 'epochs': None}
loss = tf.keras.losses.MeanSquaredError
optimizer = tf.keras.optimizers.Adam
optimizer_params = {}

model_version = None
hyperparameters = {
                   'model_parameters': model_params,
                   'training_parameters': training_params,
                   'loss': loss, 
                   'optimizer': optimizer, 
                   'optimizer_parameters': optimizer_params, 
                   'version': model_version
                  }

# Defining Metrics
metrics = []

# Setting unique Run Number 
run_number = None

model_baseline, initial_epoch = build_compiled_model(baseline_model, hyperparameters, metrics, run_number)

Evaluating baseline model.

In [10]:
baseline_predictions = model_baseline.predict(X_train)
baseline_results_train = model_baseline.evaluate(X_train, y_train, verbose=0)
baseline_results_val = model_baseline.evaluate(X_val, y_val, verbose=0)

In [11]:
assert all((pred == baseline_predictions[0]) for pred in baseline_predictions)

print('Train Loss for Baseline Model: {}'.format(baseline_results_train))
print('Validation Loss for Baseline Model: {}'.format(baseline_results_val))

Train Loss for Baseline Model: 18.943944523487964
Validation Loss for Baseline Model: 387.5771874609765
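Because the baseline predicts one constant (the training mean) everywhere, its loss on any dataset splits into that dataset's variance plus the squared gap between the dataset's mean and the constant. The sketch below checks this identity on hypothetical toy targets. It does not diagnose the gap between the two losses reported above, and `mse_constant` is an illustrative helper, not part of `stockanalysis`.

```python
import statistics


def mse_constant(targets, constant):
    """Mean squared error of predicting the same constant for every target."""
    return statistics.fmean((t - constant) ** 2 for t in targets)


# Hypothetical toy targets (assumption, not the project's data).
y_train_toy = [0.5, -1.0, 1.5, -0.5]
y_val_toy = [2.0, -3.0, 4.0, -1.0]

train_mean = statistics.fmean(y_train_toy)
val_mean = statistics.fmean(y_val_toy)

# Loss of the training-mean baseline on the second dataset.
val_loss = mse_constant(y_val_toy, train_mean)

# The same quantity via variance plus squared mean shift.
decomposed = statistics.pvariance(y_val_toy) + (val_mean - train_mean) ** 2
```

A larger baseline loss on one split can therefore come from higher target variance in that split, a shifted mean, or both.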


In [None]:
# TODO: Calculate up predictions and down predictions for the baseline, then clean up and rerun all experiments
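One possible reading of the note above is to count how often the constant baseline prediction is positive (up) or non-positive (down), and how often that sign matches the sign of the target. The sketch below assumes that reading. The function `directional_counts`, the zero threshold, and the toy targets are hypothetical and not part of `stockanalysis`.

```python
import statistics


def directional_counts(predictions, targets):
    """Count up/down predictions and the share whose sign matches the target.

    Assumption: a value > 0 counts as "up"; anything else counts as "down".
    """
    up = sum(1 for p in predictions if p > 0)
    down = len(predictions) - up
    correct = sum(1 for p, t in zip(predictions, targets) if (p > 0) == (t > 0))
    return {'up': up, 'down': down, 'accuracy': correct / len(targets)}


# Hypothetical toy targets; the baseline predicts their mean everywhere.
targets = [0.02, -0.01, 0.03, 0.0]
baseline_prediction = statistics.fmean(targets)
predictions = [baseline_prediction] * len(targets)

counts = directional_counts(predictions, targets)
```

Since the baseline is constant, it predicts a single direction for every example, so its directional accuracy is simply the share of targets that move in that direction.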