## Modeling

### Baseline Model

The baseline model is the simplest model we can construct by hand from our exploratory data analysis alone. It gives us a reference loss that any trained model should beat. Since we are dealing with a regression problem with a single target variable, we can represent any model that fits this description as:

$$E(Y \mid X) = W \cdot X + b$$

In the case of simple linear regression, $X$ directly represents the input features to the model, and $W$ and $b$ are the weights and bias that the model learns. If instead this equation represents a deep model architecture, then $W$ and $b$ are only the weights and bias learned for the **output layer**. In that case $X$ is a function of the input features: it is the input passed to the output layer, or in other words the features that the previous layers of the model **learned** to extract from the original input features.
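To make the two readings of the equation concrete, here is a minimal pure-Python sketch. The names `linear_head` and `hidden_features`, and all numeric values, are illustrative assumptions and not part of this project's code. The same output computation $W \cdot X + b$ is applied either to the raw inputs (linear regression) or to features produced by an earlier layer (a deep model).

```python
import math

def linear_head(features, weights, bias):
    """Output layer: dot product of features and weights, plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def hidden_features(raw_inputs):
    """Hypothetical stand-in for the earlier layers of a deep model."""
    return [math.tanh(2.0 * x) for x in raw_inputs]

raw_inputs = [0.01, -0.02, 0.005]   # illustrative input features
W = [0.5, -0.25, 1.0]               # illustrative output-layer weights
b = 0.1                             # illustrative output-layer bias

# Linear regression: X is the raw input itself.
linear_prediction = linear_head(raw_inputs, W, b)

# Deep model: X is whatever the previous layers extracted from the raw input.
deep_prediction = linear_head(hidden_features(raw_inputs), W, b)
```

In both cases only `W` and `b` belong to the output layer; what changes is where the vector passed in as `features` comes from.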

We can express our baseline model as one that predicts the expected value of the target variable no matter what the predictor variables happen to be. If a relation exists between the predictor variables and the target variable, then averaging a well-fit model's predictions over all values of the predictors should recover the expected value of the target variable (this is the law of total expectation, $E(E(Y \mid X)) = E(Y)$). The bare minimum we can ask of a model is therefore that, on average, it predicts the expected value of the target variable. It follows that a model that ignores the predictor variables can only satisfy this constraint by predicting the mean of the target variable. Mathematically, we can express our baseline model as:

$E(Y \mid X) = E(Y) = \mu = W \cdot X + b$, with $W = 0$ so that the prediction does not depend on the input.

This implies $b = \mu$.
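The step to $b = \mu$ can be written out explicitly. This short derivation uses only the law of total expectation and the model form defined above:

$$
\begin{aligned}
E\big(E(Y \mid X)\big) &= E(Y) = \mu \\
E\big(W \cdot X + b\big) &= \mu \\
E(b) = b &= \mu \quad \text{when } W = 0
\end{aligned}
$$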

This also implies that we can make any model equivalent to the baseline model at initialization by setting its output layer's weights to zero and its bias to the expected value of the target variable.
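A small pure-Python sketch illustrates this. The two-layer network, its `tanh` hidden layer, and the value of `target_mean` are assumptions made for illustration only. The hidden weights are arbitrary, yet the prediction is the same for every input because the output weights are zero.

```python
import math
import random

def hidden_layer(inputs, hidden_weights):
    """Hypothetical hidden layer: one tanh unit per row of weights."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)))
            for row in hidden_weights]

def predict(inputs, hidden_weights, output_weights, output_bias):
    """Forward pass: hidden features, then a linear output layer."""
    features = hidden_layer(inputs, hidden_weights)
    return sum(w * f for w, f in zip(output_weights, features)) + output_bias

rng = random.Random(42)
# Arbitrary (randomly initialized) hidden weights: 3 hidden units, 5 inputs.
hidden_weights = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]

target_mean = 0.0003                # illustrative value for E(Y)
output_weights = [0.0, 0.0, 0.0]    # output-layer weights initialized to zero
output_bias = target_mean           # bias initialized to the target mean

# Whatever the inputs are, the prediction equals the target mean.
predictions = [
    predict([rng.gauss(0, 0.02) for _ in range(5)],
            hidden_weights, output_weights, output_bias)
    for _ in range(4)
]
```

Once training updates the output weights away from zero, the model stops being equivalent to the baseline, which is the point: any improvement in loss is then attributable to what the model learned.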


In our specific case, we know from our exploratory data analysis that the means of our target variables, the logarithmic adjusted returns, are close to zero for each stock. This is also supported by theory, which predicts that the logarithmic return on a stock should follow a normal distribution centered around zero (or at least conform very closely to one). This means that to build our baseline model we must create a linear regression model where both the weights and the bias are initialized to zero.
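The cost of using zero in place of the exact mean can be checked numerically. In the sketch below the return values are made up for illustration and are not taken from the project's dataset. It shows that the mean squared error of always predicting zero equals the variance of the targets plus the squared mean, so the two baselines nearly coincide when the mean is close to zero.

```python
import statistics

# Illustrative log returns (not taken from the project's dataset).
log_returns = [0.012, -0.008, 0.003, -0.015, 0.009, 0.002]

def mse(targets, constant_prediction):
    """Mean squared error of a model that always predicts one constant."""
    return statistics.fmean([(y - constant_prediction) ** 2 for y in targets])

mean_return = statistics.fmean(log_returns)
variance = statistics.pvariance(log_returns)

mse_zero = mse(log_returns, 0.0)          # constant prediction of zero
mse_mean = mse(log_returns, mean_return)  # best possible constant predictor

# Decomposition: MSE of predicting zero = variance + (mean)^2.
gap = mse_zero - mse_mean
```

Here `gap` equals `mean_return ** 2` up to floating-point error, which is negligible next to `variance` whenever the mean return is small.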

Restart the kernel to fully clear GPU memory.

In [12]:
# Importing Libraries and Configuring virtual GPU

import os
import sys
import json
import pickle
import numpy as np
import pandas as pd
import tensorflow as tf
from stockanalysis.train import config_hardware

# Project Paths
project_dir = os.path.split(os.path.split(os.getcwd())[0])[0]
path_to_data = os.path.join(project_dir, 'data')
path_to_docs = os.path.join(path_to_data, 'documents')
path_to_models = os.path.join(project_dir, 'models')

# Random Seed
seed = 42

# Configuring GPU and Random States
config_hardware(gpu_memory=None, seed=seed)


# Loading Train Dataset
with open(os.path.join(path_to_data, 'train_dataset.pickle'), 'rb') as f:
    train_dataset = pickle.load(f)

# Loading Validation Dataset
with open(os.path.join(path_to_data, 'val_dataset.pickle'), 'rb') as f:
    val_dataset = pickle.load(f)

# Defining Helper Functions    
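def build_compiled_model(build_model, hparams, loss, optimizer, metrics, callbacks=None):
    # `callbacks` is kept in the signature so existing call sites still work,
    # but it is not forwarded: Model.compile has no `callbacks` parameter
    # (callbacks are passed to fit/evaluate/predict instead).
    model = build_model(**hparams)
    model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
    return model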

# Model Agnostic Parameters
LOSS = tf.keras.losses.MeanSquaredError()
OPTIMIZER = tf.keras.optimizers.Adam()
METRICS = []
CALLBACKS = None
with open(os.path.join(path_to_data, 'vocab_8k_norm_train_WFC_JPM_BAC_C.json'), 'r') as f:
    vocab = json.load(f)

GPUs: []
Visible GPUs: []


Defining baseline model.

In [7]:
def baseline_model(output_bias_init):
    if output_bias_init is not None:
        output_bias_init = tf.keras.initializers.Constant(output_bias_init)
        
    inputs = {'log_adj_daily_returns': tf.keras.Input(shape=(5,), name='log_adj_daily_returns', dtype=tf.float32),
              '8-k': tf.keras.Input(shape=(None,), name='8-k', dtype=tf.int64)}
    output_layer = tf.keras.layers.Dense(1, kernel_initializer='zeros', bias_initializer=output_bias_init, 
                                         name='log_adj_daily_returns_target')
    outputs = {'log_adj_daily_returns_target': output_layer(inputs['log_adj_daily_returns'])}
    
    model = tf.keras.Model(inputs, outputs, name='baseline_model')
    
    return model

Building baseline model.

In [8]:
output_bias_init = 0
hparams = {'output_bias_init': output_bias_init}

baseline_m = build_compiled_model(baseline_model, hparams, loss=LOSS,
                                  optimizer=OPTIMIZER, metrics=METRICS, callbacks=CALLBACKS)

Evaluating baseline model.

In [10]:
X_train, y_train = train_dataset
X_val, y_val = val_dataset

baseline_predictions = baseline_m.predict(X_train)
baseline_results_train = baseline_m.evaluate(X_train, y_train, verbose=0)
baseline_results_val = baseline_m.evaluate(X_val, y_val, verbose=0)

In [11]:
assert all((pred == baseline_predictions[0]) for pred in baseline_predictions)

print('Train Loss for Baseline Model: {}'.format(baseline_results_train))
print('Validation Loss for Baseline Model: {}'.format(baseline_results_val))

Train Loss for Baseline Model: 0.0011974310440253082
Validation Loss for Baseline Model: 0.00020920674223289128
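These numbers are the reference that trained models need to beat. One way to make that comparison explicit is sketched below. The helper `relative_improvement` and the candidate losses are hypothetical, and only the baseline loss is taken from the output above.

```python
# Baseline loss (MSE) on the held-out set, copied from the output above.
BASELINE_VAL_LOSS = 0.00020920674223289128

def relative_improvement(model_loss, baseline_loss=BASELINE_VAL_LOSS):
    """Fraction by which a model's loss undercuts the baseline loss.

    Positive values mean the model beats the baseline, zero means a tie,
    and negative values mean the model is worse than the baseline.
    """
    return 1.0 - model_loss / baseline_loss

# Hypothetical losses for two candidate models, for illustration only.
candidate_losses = {"model_a": 0.00018, "model_b": 0.00025}

improvements = {name: relative_improvement(loss)
                for name, loss in candidate_losses.items()}
beats_baseline = {name: value > 0 for name, value in improvements.items()}
```

Reporting the improvement as a fraction of the baseline loss keeps results comparable across datasets whose return variance differs.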


### Baseline Model 2

In [1]:
# Importing Libraries and Configuring virtual GPU

import os
import sys
import json
import pickle
import numpy as np
import pandas as pd
import tensorflow as tf
from stockanalysis.train import config_hardware

# Project Paths
project_dir = os.path.split(os.path.split(os.getcwd())[0])[0]
path_to_data = os.path.join(project_dir, 'data')
path_to_docs = os.path.join(path_to_data, 'documents')
path_to_models = os.path.join(project_dir, 'models')

# Random Seed
seed = 42

# Configuring GPU and Random States
config_hardware(gpu_memory=None, seed=seed)


# Loading Train Dataset
with open(os.path.join(path_to_data, 'train_datasetp2.pickle'), 'rb') as f:
    train_dataset = pickle.load(f)

# Loading Validation Dataset
with open(os.path.join(path_to_data, 'val_datasetp2.pickle'), 'rb') as f:
    val_dataset = pickle.load(f)

# Defining Helper Functions    
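def build_compiled_model(build_model, hparams, loss, loss_weights, optimizer, metrics, callbacks=None):
    # `callbacks` is kept in the signature so existing call sites still work,
    # but it is not forwarded: Model.compile has no `callbacks` parameter
    # (callbacks are passed to fit/evaluate/predict instead).
    model = build_model(**hparams)
    model.compile(loss=loss, loss_weights=loss_weights, optimizer=optimizer, metrics=metrics)
    return model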

# Model Agnostic Parameters
LOSS = {
        'log_adj_daily_returns_target_WFC': tf.keras.losses.MeanSquaredError(),
       }

OPTIMIZER = tf.keras.optimizers.Adam()

CALLBACKS = None
with open(os.path.join(path_to_data, 'vocab_8k_norm_trainp2_WFC_JPM_BAC_C.json'), 'r') as f:
    vocab = json.load(f)

GPUs: []
Visible GPUs: []


In [2]:
train_dataset[0].keys()

dict_keys(['log_adj_daily_returns_WFC', '8-k_WFC', 'log_adj_daily_returns_JPM', '8-k_JPM', 'log_adj_daily_returns_BAC', '8-k_BAC', 'log_adj_daily_returns_C', '8-k_C'])

In [3]:
def gen_print_model_stats(model, path, model_name):
    m = model()
    if not os.path.exists(path):
        os.makedirs(path)
    fname = os.path.join(path, model_name)
    tf.keras.utils.plot_model(m, fname + '.png', show_shapes=True, expand_nested=True)
    tf.keras.backend.clear_session()
    return m.summary()

Defining baseline model.

In [4]:
def baseline_model(output_bias_init=0):
    
    if output_bias_init is not None:
        output_bias_init = tf.keras.initializers.Constant(output_bias_init)
        
    inputs = {
              'log_adj_daily_returns_WFC': tf.keras.Input(shape=(5,), name='log_adj_daily_returns_WFC', dtype=tf.float32),
              '8-k_WFC': tf.keras.Input(shape=(None,), name='8-k_WFC', dtype=tf.int64),
              'log_adj_daily_returns_JPM': tf.keras.Input(shape=(5,), name='log_adj_daily_returns_JPM', dtype=tf.float32),
              '8-k_JPM': tf.keras.Input(shape=(None,), name='8-k_JPM', dtype=tf.int64),
              'log_adj_daily_returns_BAC': tf.keras.Input(shape=(5,), name='log_adj_daily_returns_BAC', dtype=tf.float32),
              '8-k_BAC': tf.keras.Input(shape=(None,), name='8-k_BAC', dtype=tf.int64),
              'log_adj_daily_returns_C': tf.keras.Input(shape=(5,), name='log_adj_daily_returns_C', dtype=tf.float32),
              '8-k_C': tf.keras.Input(shape=(None,), name='8-k_C', dtype=tf.int64)
             }
    
    features = tf.keras.layers.Concatenate()([inputs[fname] for fname in inputs.keys() if '8-k' not in fname])
    
    output_wfc = tf.keras.layers.Dense(1, kernel_initializer='zeros', bias_initializer=output_bias_init, name='log_adj_daily_returns_target_WFC')(features)
    output_jpm = tf.keras.layers.Dense(1, kernel_initializer='zeros', bias_initializer=output_bias_init, name='log_adj_daily_returns_target_JPM')(features)
    output_bac = tf.keras.layers.Dense(1, kernel_initializer='zeros', bias_initializer=output_bias_init, name='log_adj_daily_returns_target_BAC')(features)
    output_c = tf.keras.layers.Dense(1, kernel_initializer='zeros', bias_initializer=output_bias_init, name='log_adj_daily_returns_target_C')(features)
    
    outputs = {
               'log_adj_daily_returns_target_WFC': output_wfc, 
               'log_adj_daily_returns_target_JPM': output_jpm,
               'log_adj_daily_returns_target_BAC': output_bac,
               'log_adj_daily_returns_target_C': output_c
              }
    
    model = tf.keras.Model(inputs, outputs, name='baseline_model')
    
    return model

gen_print_model_stats(baseline_model, 'pics', 'testbaseline')

Model: "baseline_model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
log_adj_daily_returns_WFC (Inpu [(None, 5)]          0                                            
__________________________________________________________________________________________________
log_adj_daily_returns_JPM (Inpu [(None, 5)]          0                                            
__________________________________________________________________________________________________
log_adj_daily_returns_BAC (Inpu [(None, 5)]          0                                            
__________________________________________________________________________________________________
log_adj_daily_returns_C (InputL [(None, 5)]          0                                            
_____________________________________________________________________________________

![title](pics/testbaseline.png)

Building baseline model.

In [5]:
output_bias_init = 0
hparams = {'output_bias_init': output_bias_init}

baseline_m = build_compiled_model(baseline_model, hparams, loss=LOSS, loss_weights=None, 
                                  optimizer=OPTIMIZER, metrics=None, callbacks=CALLBACKS)



Evaluating baseline model.

In [6]:
X_train, y_train = train_dataset
X_val, y_val = val_dataset

baseline_predictions = baseline_m.predict(X_train)
baseline_results_train = baseline_m.evaluate(X_train, y_train, verbose=0)
baseline_results_val = baseline_m.evaluate(X_val, y_val, verbose=0)

In [7]:
#assert all(all(pred[:, 0] == pred[0, 0]) for pred in baseline_predictions)

print('Names of losses: {}\n'.format(baseline_m.metrics_names))
print('Train Loss for Baseline Model: {}'.format(baseline_results_train))
print('Validation Loss for Baseline Model: {}'.format(baseline_results_val))

Names of losses: ['loss', 'log_adj_daily_returns_target_WFC_loss']

Train Loss for Baseline Model: [0.0008051489573671401, 0.00079874013]
Validation Loss for Baseline Model: [0.00015122863568831234, 0.00014940402]


In [8]:
sum(baseline_results_train[1:])

0.004108154098503292

In [9]:
sum(baseline_results_val[1:])

0.0010604808485368267
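Because `evaluate` returns a plain list, it can help to pair each value with its name before slicing or summing. The sketch below is a small hypothetical helper, `label_results`, applied to the metric names and training results printed earlier in this section.

```python
# Values copied from the printed output earlier in this section.
metrics_names = ['loss', 'log_adj_daily_returns_target_WFC_loss']
train_results = [0.0008051489573671401, 0.00079874013]

def label_results(names, values):
    """Pair each value returned by `evaluate` with its metric name."""
    if len(names) != len(values):
        raise ValueError("names and values must have the same length")
    return dict(zip(names, values))

labeled = label_results(metrics_names, train_results)

# Keep only the per-output losses (everything except the total 'loss').
per_output = {name: value for name, value in labeled.items() if name != 'loss'}
per_output_sum = sum(per_output.values())
```

Working from named entries makes it clear which outputs contribute to a sum, which matters once more than one target has a loss attached.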