# Predicting Stock Yields using NLP

# Todo
- update text portions of notebook for intro price EDA
- make notebook able to handle updated dfs.
- get rid of hardcoded feature names in modeling

Deep learning NLP techniques that map raw text to dense vector representations have had surprising success in natural language processing compared to classical methods of encoding text. In this project we attempt to leverage some of these techniques to assist in time series analysis of stock yields. The company we investigate is Wells Fargo (stock ticker: WFC), and the text data we use are the SEC 8-K forms filed by Wells Fargo and its competitors: JPMorgan Chase, Bank of America, and Citigroup. The 8-K form was chosen because it tends to be more text rich than other SEC documents.

In [132]:
# Setting Notebook Global Variables
import os
import sys
import numpy as np

# Project Paths
project_dir = os.path.split(os.path.split(os.getcwd())[0])[0]
path_to_data = os.path.join(project_dir, 'data')
path_to_docs = os.path.join(path_to_data, 'documents')
# Company stock ticker and competitor stock tickers
ticker = 'WFC'
competitors = ['JPM', 'BAC', 'C']
tickers = [ticker] + competitors

After successfully preprocessing our dataset, we write it to a TFRecord file (https://www.tensorflow.org/tutorials/load_data/tfrecord), a binary file format that the TensorFlow framework reads efficiently.

For this notebook we use only the bare minimum of features for modeling, i.e. the text features listed in the doc columns and the feature we are trying to predict in the time series.

## Modeling

The goal of this notebook is to discover whether we can construct a deep learning model that predicts stock price data from the unstructured text found in company 8-K forms. Specifically, our first models try to relate the logarithmic return of a stock to its previous logarithmic returns in a given window of time, along with the text of the company's 8-K forms released in that window. If multiple 8-K documents were released in the window, the text fed to the model comes from a single 8-K document sampled uniformly from them. The window size (in days) and the architecture of the model are the hyperparameters that can be tuned when experimenting with different model designs. Modeling consists of two phases: Preparing Data and Evaluating Models.
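
To make the windowing idea concrete, here is a minimal sketch in plain Python rather than the notebook's TensorFlow pipeline. It assumes the logarithmic return on day t is ln(P_t / P_{t-1}) of the adjusted closing price; the prices and the helper names `log_returns` and `make_windows` are hypothetical and only serve the illustration.

```python
import math

def log_returns(prices):
    """Log return for each day after the first: ln(P_t / P_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

def make_windows(series, window_size):
    """Pair each run of `window_size` consecutive values with the value that follows it."""
    return [(series[i:i + window_size], series[i + window_size])
            for i in range(len(series) - window_size)]

# Hypothetical adjusted closing prices, for illustration only
prices = [50.0, 50.5, 49.8, 50.2, 51.0, 50.7]
returns = log_returns(prices)
windows = make_windows(returns, window_size=3)

# Each window of past returns is the model input; the next return is the label
for features, label in windows:
    print([round(r, 4) for r in features], '->', round(label, 4))
```

Six prices give five returns, and a window size of 3 leaves two (features, label) pairs. A larger window gives the model more history per example but fewer examples overall.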

In [249]:
# Importing Libraries and Configuring virtual GPU

import os
import json
import pickle
import pandas as pd
import tensorflow as tf

from fractions import Fraction
from functools import reduce
from tensorflow import keras
from tensorflow.keras import layers

gpus = tf.config.experimental.list_physical_devices('GPU')
visible_gpus = tf.config.experimental.get_visible_devices('GPU')
print('GPUs: {}'.format(gpus))
print('Visible GPUs: {}'.format(visible_gpus))

if gpus:
    try:
        tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5000)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

# Setting Random Seed
tf.random.set_seed(20)
 
# Model Hyperparameters    
## Model Hyperparameters that require reshaping Dataset
TIMESTEPS = 8
BATCH_SIZE = 10
with open(os.path.join(path_to_data, 'vocab.json'), 'r') as f:
    vocab = json.load(f)
## Model 1 Hyperparameters
DOC_EMBEDDING_UNITS = 1000
#TS_LAYER_1_UNITS = 700
TS_LAYER_2_UNITS = 50
TS_LAYER_3_UNITS = 50
OPTIMIZER = keras.optimizers.Adam
LOSS = keras.losses.BinaryCrossentropy()
METRICS = ['accuracy']
LEARNING_RATE=None

GPUs: []
Visible GPUs: []


### Preparing Data

Preparing our data involves:
1. Loading the dataset from the TFRecord file
2. Splitting the dataset by stock ticker
3. Reshaping each dataset to prepare it for training:
    1. Windowing the dataset so each element produces a time series of features along with their corresponding label
    2. Sampling one document to represent each window and cloning that document for each timestep in the window
    3. Filtering our dataset to include only elements with a document feature
4. Concatenating the reshaped datasets together, and shuffling the dataset
5. Splitting the dataset into train, validation, and test datasets
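
Steps 3.2 and 3.3 above can be sketched in plain Python as follows. This is only an illustration of the sampling and filtering logic, not the `tf.data` code used below; the helper `select_window_doc`, the token ids, and the 3-timestep windows are hypothetical.

```python
import random

def select_window_doc(docs_per_day, rng):
    """Pick one document uniformly from all documents released in the window.

    Returns the document repeated once per timestep, or None when the
    window contains no documents (such windows are filtered out).
    """
    all_docs = [doc for day in docs_per_day for doc in day]
    if not all_docs:
        return None
    doc = rng.choice(all_docs)
    return [doc for _ in docs_per_day]

rng = random.Random(20)

# Hypothetical windows of 3 timesteps; each day holds a list of tokenized documents
doc_windows = [
    [[], [[4, 8, 15]], []],               # one document in the window
    [[], [], []],                         # no documents: dropped
    [[[16, 23]], [], [[42], [7, 7, 7]]],  # three documents: one is sampled
]

selected = [select_window_doc(w, rng) for w in doc_windows]
kept = [s for s in selected if s is not None]
print(len(kept))  # 2
```

Repeating the sampled document at every timestep keeps the document feature the same length as the numeric time series, so both can be batched together.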

1. Loading the dataset from TFRecord file

In [250]:
# Defining functions and classes used to load the dataset from its TFRecord file

def parse_example(example_proto, feature_description):
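    '''
    Parses a serialized tf.Example proto and rebuilds each ticker's ragged document feature.
    
    :param example_proto: scalar string tensor, a serialized tf.Example
    :param feature_description: dict, mapping feature names to the feature specs used by tf.io.parse_single_example
    
    ---> dict, mapping feature names to tensors, with one 'docs_<ticker>' RaggedTensor per ticker
    '''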
    
    # Parse the input tf.Example proto using the dictionary above.
    example = tf.io.parse_single_example(example_proto, feature_description)
    
    # Reconstructing Ragged Tensors from Example
    for t in tickers:
        example['_'.join(['docs', t])] = tf.RaggedTensor.from_row_lengths(example['docs_{}/vals'.format(t)].values,
                                                           row_lengths=example['docs_{}/lens'.format(t)].values)

    # Deleting Redundant Keys
    for t in tickers:
        del example['docs_{}/vals'.format(t)]
        del example['docs_{}/lens'.format(t)]
        
    return example

In [251]:
# Loading the Dataset

# Loading the raw dataset from the TFRecord file
dataset = tf.data.TFRecordDataset(os.path.join(path_to_data, 'dataset.tfrecord'))
# Loading the dataset's feature_description
with open(os.path.join(path_to_data, 'dataset_feature_description.pickle'), 'rb') as f:
    feature_description = pickle.load(f)
# Decoding the raw dataset using the dataset's feature_description
dataset = dataset.map(lambda example_proto: parse_example(example_proto, feature_description))

2. Splitting the dataset by stock ticker

In [252]:
def split(example, features, ticker):
    return {feature_name: example['_'.join([feature_name, ticker])] for feature_name in features}

datasets = [dataset.map(lambda ex: split(ex, ['log_adj_daily_returns', 'docs'], t)) for t in tickers]

3. Reshaping datasets

In [253]:
# Defining functions and classes used to reshape datasets

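def make_window_dataset(ds, window_size, shift=1, stride=1):
    '''Windows :param ds: so each element maps a feature name to window_size values of that feature.'''
    
    windows = ds.window(window_size, shift=shift, stride=stride)
    
    # Batching each per-feature window dataset yields one tensor; incomplete windows are dropped
    feature_datasets = {key: windows.flat_map(lambda x: x[key].batch(window_size, drop_remainder=True))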
                        for key in windows.element_spec.keys()}
    
    return tf.data.Dataset.zip(feature_datasets)

def extract_labels(timeslice, label_features):
    '''Splits a window into features (all timesteps but the last) and labels (the last timestep).'''
    labels = {}
    
    for feature_key in timeslice.keys():
        feature_timeslice = timeslice[feature_key]
        if feature_key in label_features:
            labels[feature_key] = feature_timeslice[-1]
        timeslice[feature_key] = feature_timeslice[:-1]
        
    return (timeslice, labels)


def to_time_series(ds, label_features, window_size, steps_to_pred=1, num_of_preds=1):
    
    # making full time series Dataset object (features + labels)
    full_ts_ds = make_window_dataset(ds, window_size=window_size+1)
    
    # mapping dataset to Dataset where each el is: (features: dict, labels)
    ts_ds = full_ts_ds.map(lambda s: extract_labels(s, label_features))
    
    return ts_ds

def sample_documents(sample):
    # Extracting all documents in the sample
    docs_in_sample = sample.values
    # Sampling a random document from all the documents in the sample
    if docs_in_sample.nrows() != 0:
        i = tf.random.uniform([1], maxval=docs_in_sample.nrows(), dtype=tf.int64)[0]
        sample_doc = docs_in_sample[i]
    else:
        sample_doc = tf.constant([], dtype=tf.int64)
        
    return sample_doc

def select_doc(features, labels):
    
    for fname in features.keys():
        feature = features[fname]
        timesteps = feature.shape[0]
        # Feature is a doc feature
        if isinstance(feature, tf.RaggedTensor):
            doc = sample_documents(feature)
            feature = tf.stack([doc for day in range(timesteps)])
            features[fname] = feature
        
    return (features, *list(labels.values()))

def filter_fn(f, l):
    shape = tf.shape(f['docs'])[1]
    return tf.math.not_equal(shape, 0)

def reshape(dataset, window_size, label_name):
    # Converting to time series
    ds = to_time_series(dataset, label_name, window_size=window_size)
    # Selecting document features
    ds = ds.map(select_doc)
    # Filtering out elements without a document feature
    ds = ds.filter(filter_fn)
    return ds

In [254]:
# Reshaping Datasets
reshaped_datasets = list(map(lambda d: reshape(d, TIMESTEPS, 'log_adj_daily_returns'), datasets))

4. Concatenating datasets, and shuffling dataset

In [255]:
dataset = reduce(lambda a, b: a.concatenate(b), reshaped_datasets).shuffle(1000, reshuffle_each_iteration=False)

5. Splitting dataset into train, validation, and test datasets

In [256]:
# Defining Functions and Classes for splitting datasets into train, validation, and test datasets

def k_folds(dataset, k):
    '''
    Splits :param dataset: into :param k: number of equally sized (or close to equally sized) components.
    
    :param dataset: tf.data.Dataset, dataset to split into k folds
    :param k: int, number of folds to split :param dataset: into
    
    ---> list, of tf.data.Dataset objects
    '''
    return [dataset.shard(k, i) for i in range(k)]

def train_test_split(dataset, train_size):
    '''
    Splits :param dataset: into train and test datasets.
    
    :param dataset: tf.data.Dataset, to split into train and test datasets
    :param train_size: float between 0 and 1, proportion of :param dataset: to put into train dataset
    
    ---> (tf.data.Dataset, tf.data.Dataset), representing train, test datasets
    '''
    train_size = Fraction(train_size).limit_denominator()
    x, k = train_size.numerator, train_size.denominator
    folds = k_folds(dataset, k)
    train = reduce(lambda a, b: a.concatenate(b), folds[:x])
    test = reduce(lambda a, b: a.concatenate(b), folds[x:])
    return train, test

For our models we will reserve 60% of the dataset for training, 20% for validation, and 20% for testing.
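
The shard-based splitting used by `train_test_split` can be illustrated in plain Python. The sketch below stands in a list for the `tf.data.Dataset`; the helpers `shard` and `fraction_split` are hypothetical names that mimic the behaviour of `Dataset.shard` and of the function above, under the assumption that sharding takes every k-th element.

```python
from fractions import Fraction

def shard(items, num_shards, index):
    """Every `num_shards`-th item starting at `index`, like tf.data.Dataset.shard."""
    return items[index::num_shards]

def fraction_split(items, train_size):
    """Split `items` by sharding on the denominator of `train_size`."""
    frac = Fraction(train_size).limit_denominator()
    folds = [shard(items, frac.denominator, i) for i in range(frac.denominator)]
    first = [x for fold in folds[:frac.numerator] for x in fold]
    second = [x for fold in folds[frac.numerator:] for x in fold]
    return first, second

items = list(range(100))
train, rest = fraction_split(items, 0.6)  # 0.6 -> 3/5: three of five shards
val, test = fraction_split(rest, 0.5)     # 0.5 -> 1/2: one of two shards
print(len(train), len(val), len(test))    # 60 20 20
```

Splitting the remaining 40% in half is what yields the 20% validation and 20% test portions.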

In [257]:
# Splitting our dataset into train, validation, test datasets

# Creating datasets
train_dataset, test_val_dataset = train_test_split(dataset, train_size=0.6)
val_dataset, test_dataset = train_test_split(test_val_dataset, train_size=0.5)

# Prepping datasets for modeling
train_dataset = (train_dataset.shuffle(10)
                 .padded_batch(BATCH_SIZE, 
                               padded_shapes=({'log_adj_daily_returns': [TIMESTEPS,], 
                                               'docs': [TIMESTEPS, None]}, [])))
val_dataset = (val_dataset.shuffle(10)
               .padded_batch(BATCH_SIZE, 
                             padded_shapes=({'log_adj_daily_returns': [TIMESTEPS,], 
                                             'docs': [TIMESTEPS, None]}, [])))
test_dataset = (test_dataset.shuffle(10)
                .padded_batch(BATCH_SIZE, 
                              padded_shapes=({'log_adj_daily_returns': [TIMESTEPS,], 
                                              'docs': [TIMESTEPS, None]}, [])))



### Evaluating Models

In [258]:
# Defining functions and classes used to construct model layers

def embedding_matrix(vocab, init):
    '''
    Constructs the embedding matrix for specific init type for a pre initialized word embedding layer.
    
    :param vocab: dict, a mapping between keys of words, and values of unique integer identifiers for each word
    :param init: string, initialization type currently we only support glove initialization
    
    ---> numpy array of size (vocab length, embedding dimension) mapping each word encoding to a vector
    '''
    
    if init == 'glove':
        glove_dir = 'glove'
        
        try:
            with open(os.path.join(glove_dir, 'current_embedding.pickle'), 'rb') as f:
                embedding_m = pickle.load(f)
            
        except FileNotFoundError:
            # Building word to vector map
            word_embeddings = {}
            with open(os.path.join(glove_dir, 'glove.840B.300d.txt')) as f:
                for line in f:
                    tokens = line.split(' ')
                    word = tokens[0]
                    embedding = np.asarray(tokens[1:], dtype='float32')
                    # Needs to check if dim is changing
                    assert len(embedding) == 300
                    word_embeddings[word] = embedding
            # Building embedding matrix
            EMBEDDING_DIM = len(next(iter(word_embeddings.values())))
            embedding_m = np.zeros((len(vocab) + 1, EMBEDDING_DIM))
            for word, i in vocab.items():
                embedding_vector = word_embeddings.get(word)
                if embedding_vector is not None:
                    embedding_m[i] = embedding_vector
            # Saving embedding matrix
            with open(os.path.join(glove_dir, 'current_embedding.pickle'), 'wb') as f:
                pickle.dump(embedding_m, f)
                
    else:
        raise ValueError('init type not supported, init must be equal to "glove"')

    return embedding_m

def Word_Embedding(vocab, init, 
                   embeddings_initializer='uniform', embeddings_regularizer=None, 
                   activity_regularizer=None, embeddings_constraint=None, 
                   mask_zero=False, input_length=None, **kwargs):
    
    '''
    Creates a keras embedding layer specifically designed to embed the words specified in :param vocab:
    
    :param vocab: dict, representing the mapping between the words in corpus (keys) and their unique integer
                  encodings
    :param init: string or int, tells the layer how to initialize its embeddings. If of type int, then
                 it tells the layer to initialize its word embeddings with an embedding dimension of :param init:.
                 If of type string, then :param init: specifies the type of pretrained word embeddings we will be 
                 initializing the embedding layer with
    
    ---> tf.keras.layers.Embedding
    '''
    
    if isinstance(init, str):
        current_embedding_matrix = embedding_matrix(vocab, init)
        emb_layer = layers.Embedding(current_embedding_matrix.shape[0], current_embedding_matrix.shape[1],
                                     weights=[current_embedding_matrix], mask_zero=mask_zero,
                                     input_length=input_length, **kwargs)
        
    elif isinstance(init, int):
        emb_layer = layers.Embedding(len(vocab) + 1, output_dim=init, 
                                     embeddings_initializer=embeddings_initializer, embeddings_regularizer=embeddings_regularizer, 
                                     activity_regularizer=activity_regularizer, embeddings_constraint=embeddings_constraint, 
                                     mask_zero=mask_zero, input_length=input_length, **kwargs)
    else:
        raise ValueError('init type not supported')
        
    return emb_layer
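
As an illustration of how `embedding_matrix` lays out its rows, here is a minimal sketch on a toy example. The three 3-dimensional vectors, the `toy_vocab` mapping, and the assumption that id 0 is reserved for padding are all hypothetical; the real GloVe file holds 300-dimensional vectors.

```python
import numpy as np

# Three lines in the GloVe text layout "word v1 v2 ... vd" (hypothetical 3-d vectors)
glove_lines = [
    'bank 0.1 0.2 0.3',
    'loan 0.4 0.5 0.6',
    'rate 0.7 0.8 0.9',
]
# Word -> integer id; id 0 is assumed to be left free for padding, hence len(vocab) + 1 rows
toy_vocab = {'bank': 1, 'rate': 2, 'merger': 3}

# Building the word to vector map
word_vectors = {}
for line in glove_lines:
    tokens = line.split(' ')
    word_vectors[tokens[0]] = np.asarray(tokens[1:], dtype='float32')

# Building the embedding matrix: row i holds the vector of the word with id i
dim = len(next(iter(word_vectors.values())))
matrix = np.zeros((len(toy_vocab) + 1, dim))
for word, i in toy_vocab.items():
    vector = word_vectors.get(word)
    if vector is not None:
        matrix[i] = vector

print(matrix.shape)  # (4, 3)
```

Words missing from the pretrained vectors ('merger' here) keep an all-zero row, as does the padding row 0.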


In [1]:
# Defining functions and classes used to construct models

def model_1(timesteps, vocab, doc_embedding_size, ts_layer_1_size, ts_layer_2_size, ts_layer_3_size,
            optimizer, learning_rate, loss, metrics):
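    '''
    Constructs a model with the architecture of two stacked LSTM layers over the log return
    series followed by a single sigmoid output unit. The docs input is declared but is not
    yet connected to any layer.
    '''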
    
    # inputs
    input_docs = keras.Input(shape=(timesteps, None), name='docs', dtype=tf.int64)
    input_log_returns = keras.Input(shape=(timesteps,), name='log_adj_daily_returns', dtype=tf.float32)
    
    # Preparing Features for Time Series Analysis
    num_features = tf.expand_dims(input_log_returns, -1)
    ts_input = num_features
    
    # Building Time Series Layer
    ts_layer_2 = layers.LSTM(ts_layer_2_size, activation='relu', return_sequences=True)(ts_input)
    ts_layer_3 = layers.LSTM(ts_layer_3_size, activation='relu')(ts_layer_2)
    
    # Building Output Layer
    output = layers.Dense(1, activation='sigmoid')(ts_layer_3)
    
    # Building Model
    model = keras.Model([input_docs, input_log_returns], output, name='model_1')
    
    if learning_rate is None:
        opt = optimizer()
    else:
        opt = optimizer(learning_rate=learning_rate)

    model.compile(optimizer=opt, loss=loss, metrics=metrics)
    
    return model


In [260]:
# Training Model 1 with the current hyperparameters
tf.keras.backend.clear_session()
model = model_1(timesteps=TIMESTEPS, vocab=vocab, doc_embedding_size=DOC_EMBEDDING_UNITS,
                ts_layer_1_size=None, ts_layer_2_size=TS_LAYER_2_UNITS, ts_layer_3_size=TS_LAYER_3_UNITS,
                optimizer=OPTIMIZER, learning_rate=LEARNING_RATE, loss=LOSS, metrics=METRICS)

print(model.summary())
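# Converting labels to categorical data: 1 for a positive log return, 0 otherwise
def to_categorical(f, l):
    return (f, tf.cast(l > 0, tf.int32))

# Labels are converted batch-wise, so the padded batches stay intact
train_dataset = train_dataset.map(to_categorical)
val_dataset = val_dataset.map(to_categorical)
# Training, with validation on the held-out validation split
model.fit(train_dataset, epochs=30, validation_data=val_dataset)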


Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
log_adj_daily_returns (InputLay [(None, 8)]          0                                            
__________________________________________________________________________________________________
tf_op_layer_ExpandDims (TensorF [(None, 8, 1)]       0           log_adj_daily_returns[0][0]      
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 8, 50)        10400       tf_op_layer_ExpandDims[0][0]     
__________________________________________________________________________________________________
lstm_2 (LSTM)                   (None, 50)           20200       lstm_1[0][0]                     
____________________________________________________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7fdfa70e3668>