# Stock Market Forecasting with a Simple Neural Network

Created by Kenneth Liao <br>
3/11/2018

---
This will be a quick introduction to building a simple neural network (NN) model with the Keras API for stock market forecasting. This article is organized into the sections outlined below. 

### The following topics will be covered:

1. Importing and prepping the data
2. Building the NN model
3. Training and evaluating the model's performance

### The data:
We will be using both the **sector_ETFs.csv** and **Indicators_Train.csv** datasets. This data will be split into a training dataset and a cross-validation (cv) dataset. The cv dataset will be used to gauge the model's performance.

---


## 1. Importing and prepping the data

We start by importing the necessary libraries. We also set our Plotly credentials (needed only to upload plots) and enable offline plotting so the figures render inside the notebook.

In [1]:
import pandas as pd
import numpy as np
import html_tools as ht
import os
import plotly
import plotly.graph_objs as go
from config import credentials

# Set plotly credentials. Required only to upload plots to plotly.
plotly.tools.set_credentials_file(username=credentials['plotly_user'], api_key=credentials['plotly_api_key'])

# Enable offline plotting in jupyter notebook
plotly.offline.init_notebook_mode(connected=True)

---

After importing the data into a pandas dataframe, we'll want to convert the index to a datetime object. This will allow us to perform special functions on the data such as easily switching between weekly, monthly, or yearly timeframes. Our end goal will be to predict weekly prices so we can go ahead and change the time period of our data to weekly.
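
As a small illustration of the weekly resampling described above, here is a sketch on made-up daily prices. The dates, the `price` column, and its values are assumptions for demonstration only and do not come from the ETF dataset.

```python
import pandas as pd

# Hypothetical daily closing prices for two business weeks (made-up values)
dates = pd.date_range('2018-01-01', '2018-01-12', freq='B')
daily = pd.DataFrame({'price': [10.0, 11.0, 12.0, 13.0, 14.0,
                                15.0, 16.0, 17.0, 18.0, 19.0]}, index=dates)

# 'W-FRI' groups rows into weeks ending on Friday; last() keeps the final
# observation in each week, i.e. the Friday close when one exists
weekly = daily.resample('W-FRI').last()
print(weekly)
```

Each weekly row is labeled with the Friday that ends its week, which is why a datetime index is needed before resampling.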

---

In [2]:
# change working directory to data folder
os.chdir('C:/Users/Kenny/projects/pds/stock-market-forecasting/data')

# Read in data and convert index to datetime object
raw_data = pd.read_csv('sector_ETFs.csv')
raw_data['Date'] = pd.to_datetime(raw_data['Date'])
raw_data.set_index('Date', inplace=True)

# Resample data to a weekly period
weekly_data = raw_data.resample('W-FRI').last()

# Get the number of stocks
N_stocks = len(weekly_data.columns)
stocks = weekly_data.columns

print(weekly_data.shape)
weekly_data.head()

(904, 6)


Date,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU)
2000-05-19,98.382324,,,,,
2000-05-26,94.684608,,,,,
2000-06-02,111.828659,,,,,
2000-06-09,114.06971,,,,,
2000-06-16,112.87915,,29.423733,,51.203468,


---

Let's make a quick plot of the 6 ETFs we will be working with.

---

In [3]:
trace0 = go.Scatter(
    name='Technology (IYW)',
    x=weekly_data.index,
    y=weekly_data['Technology (IYW)']
)
trace1 = go.Scatter(
    name='Basic Materials (IYM)',
    x=weekly_data.index,
    y=weekly_data['Basic Materials (IYM)']
)
trace2 = go.Scatter(
    name='Consumer Goods (IYK)',
    x=weekly_data.index,
    y=weekly_data['Consumer Goods (IYK)']
)
trace3 = go.Scatter(
    name='Services (IYC)',
    x=weekly_data.index,
    y=weekly_data['Services (IYC)']
)
trace4 = go.Scatter(
    name='Healthcare (IYH)',
    x=weekly_data.index,
    y=weekly_data['Healthcare (IYH)']
)
trace5 = go.Scatter(
    name='Utilities (IDU)',
    x=weekly_data.index,
    y=weekly_data['Utilities (IDU)']
)

data=[trace0, trace1, trace2, trace3, trace4, trace5]

layout= go.Layout(
    xaxis=dict(
        title='Year',
        titlefont=dict(size=16)),
    yaxis=dict(
        title='Adjusted Closing Price ($)',
        titlefont=dict(size=16)),
    showlegend=True,
    legend=dict(
        font=dict(size=12),
        orientation='h',
        y=1.2),    
    margin=dict(t=100,
               r=0)
)

fig = go.Figure(data=data, layout=layout)
# change working directory to images folder
#os.chdir('C:/Users/Kenny/projects/repositories/kennfucius.github.io/images/0002-keras-ETF/')
plotly.offline.iplot(fig, filename='sectorETFs.html')

---

Now we'll import our indicators data and perform the following clean up:
1. Drop nonessential columns (all columns ending with dt)
2. Remove the indicator description row and save it in *indicators_desc*
3. Convert index to a datetime object and resample to a weekly period
4. Drop rows that contain blank dates
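
The clean-up steps listed above can be sketched on a tiny made-up table. The column names and values here are hypothetical, and the order of the steps is simplified compared to the notebook, so treat this as an illustration rather than the exact procedure.

```python
import pandas as pd

# Hypothetical raw indicator table (made-up column names and values)
raw = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-02', None],
    'HOUST': ['1600', '1712', '1775'],
    'HOUST_dt': ['2018-01-01', '2018-01-02', '2018-01-03'],
})

# Drop helper columns whose names end with 'dt'
cleaned = raw.drop(columns=[c for c in raw.columns if c.endswith('dt')])

# Parse the dates (blank dates become NaT), drop rows without a date,
# and use the dates as the index
cleaned['date'] = pd.to_datetime(cleaned['date'])
cleaned = cleaned.dropna(subset=['date']).set_index('date')

# Coerce values to numbers; anything unparseable becomes NaN
cleaned = cleaned.apply(pd.to_numeric, errors='coerce')
print(cleaned)
```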

---

In [4]:
indicators_train = pd.read_csv('Indicators_Train_v2.csv', skip_blank_lines=True)

# Drop the index column
indicators_train.drop('Unnamed: 0', axis=1, inplace=True)

# Drop all columns that contain the 'dt' string
for col in indicators_train.columns:
    if 'dt' in col:
        indicators_train.drop(col, axis=1, inplace=True)

# Convert index to datetime object
indicators_train.rename(columns={'date':'Date'}, inplace=True)
indicators_train['Date'] = pd.to_datetime(indicators_train['Date'])
indicators_train = indicators_train.set_index('Date')

# Convert index to weekly time period
indicators_train = indicators_train.resample('W-FRI').last()
indicators_train = indicators_train.apply(pd.to_numeric, errors='coerce', downcast='float')

# Drop rows missing a datetime index value
idx_drop = indicators_train.index[indicators_train.index.isnull()==True]
indicators_train.drop(idx_drop, inplace=True)

N_indicators=len(indicators_train.columns)

In [5]:
indicators_train.loc[indicators_train.index>'2000-01-01',]

Date,HOUST,UNRATENSA,EMRATIO,UEMPMED,UMCSENT,USSLIND,KCFSI,IPMAN,VIXCLS,DGS10
2000-01-07,1600.0,3.7,64.400002,5.9,105.400002,2.22,0.29,102.415398,21.250000,6.15
2000-01-14,1600.0,3.7,64.400002,5.9,105.400002,2.22,0.29,102.415398,22.030001,6.25
2000-01-21,1712.0,3.7,64.400002,5.9,105.400002,2.22,0.29,102.415398,22.430000,6.39
2000-01-28,1712.0,3.7,64.400002,5.9,112.000000,2.22,0.29,102.415398,23.090000,6.40
2000-02-04,1712.0,4.5,64.800003,5.7,112.000000,1.83,0.39,102.645302,26.410000,6.62
2000-02-11,1712.0,4.5,64.800003,5.7,112.000000,1.83,0.39,102.645302,22.840000,6.72
2000-02-18,1775.0,4.5,64.800003,5.7,112.000000,1.83,0.39,102.645302,21.719999,6.73
2000-02-25,1775.0,4.5,64.800003,5.7,112.000000,1.83,0.39,102.645302,23.030001,6.69
2000-03-03,1775.0,4.4,64.800003,6.1,111.300003,1.45,0.18,103.033997,23.120001,6.60
2000-03-10,1775.0,4.4,64.800003,6.1,111.300003,1.45,0.18,103.033997,22.900000,6.56


---

Let's add the indicator columns to the ETF dataframe. This way, all of the data is in the same place and on the same weekly index.
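
To see what joining, interpolating, and dropping incomplete rows does, here is a minimal sketch on made-up data. The `ETF` and `IND` columns and their values are assumptions for illustration, and `dropna()` is used as a shorthand for removing rows that still contain blanks.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly data with gaps (made-up values)
idx = pd.date_range('2018-01-05', periods=4, freq='W-FRI')
etfs = pd.DataFrame({'ETF': [np.nan, 20.0, 21.0, 22.0]}, index=idx)
indicators = pd.DataFrame({'IND': [1.0, np.nan, 3.0, 4.0]}, index=idx)

# Left join keeps every ETF date and attaches the indicator values
combined = etfs.join(indicators, how='left')

# Linear interpolation fills gaps that have known values on both sides
filled = combined.interpolate(method='linear')

# A leading gap has no earlier value to interpolate from, so it stays NaN
# and the row is dropped
clean = filled.dropna()
print(clean)
```

This is why the combined dataset ends up with slightly fewer rows than the weekly ETF data: the earliest weeks, where some columns have no data yet, are removed.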

---

In [6]:
# Add indicators to dataset
combined_data = weekly_data.join(indicators_train, how='left')

# Interpolate any missing values
interp_data = combined_data.interpolate(method='linear')

# Remove any samples that are missing data (blanks)
idx_drop = interp_data.index[interp_data.isnull().any(axis=1)==True]
interp_data.drop(idx_drop, inplace=True)
print(interp_data.shape)
interp_data.head()

(898, 16)


Date,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU),HOUST,UNRATENSA,EMRATIO,UEMPMED,UMCSENT,USSLIND,KCFSI,IPMAN,VIXCLS,DGS10
2000-06-30,114.349861,25.320297,29.090391,51.704765,52.495567,36.802292,1592.0,3.8,64.300003,5.8,106.400002,1.88,0.94,104.651497,23.65,6.29
2000-07-07,115.526382,24.638411,29.090391,54.124199,54.1931,37.973682,1592.0,4.1,64.5,5.8,106.400002,1.04,0.94,104.857803,22.48,6.13
2000-07-14,122.109482,25.865795,29.6682,55.429054,51.469501,38.952721,1592.0,4.1,64.5,5.8,106.400002,1.04,0.94,104.857803,21.48,6.06
2000-07-21,115.918587,25.842134,30.316614,54.776619,51.609634,39.390499,1554.0,4.1,64.5,5.8,106.400002,1.04,0.94,104.857803,20.610001,6.11
2000-07-28,104.111015,24.454477,30.316614,52.098934,51.432274,39.390499,1554.0,4.1,64.5,5.8,108.300003,1.04,0.94,104.857803,20.290001,6.11


In [7]:
traces={}
for i in range(N_stocks,len(interp_data.columns)):
    col=interp_data.columns[i]
    
    traces['trace' + str(i)] = go.Scatter(
                name=str(col),
                x=interp_data.index,
                y=interp_data[col]
    )

data = []
for trace in traces:
    trace = traces[str(trace)]
    data.append(trace)

layout= go.Layout(
    xaxis=dict(
        title='Year',
        titlefont=dict(size=16)
    ),
    yaxis=dict(
        title='Indicator Value',
        titlefont=dict(size=16)
    ),
    showlegend=True,
    legend=dict(
        font=dict(size=12),
        orientation='h',
        y=1.2
    ),    
    margin=dict(t=100,
        r=0
    )
)

fig = go.Figure(data=data, layout=layout)
# change working directory to images folder
os.chdir('C:/Users/Kenny/projects/repositories/kennfucius.github.io/images/0002-keras-ETF/')
plotly.offline.iplot(fig, filename='indicators.html')

In [8]:
# change working directory to data folder
os.chdir('C:/Users/Kenny/projects/pds/stock-market-forecasting/data')

ind_desc = pd.read_excel('indicator descriptions.xlsx', header=None)
ind_desc.columns=['Indicator', 'Description']
# None removes the column width limit (-1 is no longer accepted by recent pandas versions)
pd.set_option('display.max_colwidth', None)
ind_desc.style.set_properties(**{'text-align': 'left'})

Unnamed: 0,Indicator,Description
0,HOUST,Housing Starts: Total: New Privately Owned Housing Units Started
1,UNRATENSA,Civilian Unemployment Rate NSA
2,EMRATIO,Civilian Employment-Population Ratio
3,UEMPMED,Median Duration of Unemployment
4,UMCSENT,University of Michigan: Consumer Sentiment
5,USSLIND,Leading Index for the United States
6,KCFSI,Kansas City Financial Stress Index
7,IPMAN,Industrial Production: Manufacturing (NAICS)
8,VIXCLS,CBOE Volatility Index: NSA
9,DGS10,10-Year Treasury Constant Maturity Rate


In [9]:
print(ht.table(ind_desc))

<table style="width:100%"><tr><th>Indicator</th><th>Description</th></tr>
<tr><td>HOUST</td><td>Housing Starts: Total: New Privately Owned Housing Units Started</td></tr>
<tr><td>UNRATENSA</td><td>Civilian Unemployment Rate NSA</td></tr>
<tr><td>EMRATIO</td><td>Civilian Employment-Population Ratio</td></tr>
<tr><td>UEMPMED</td><td>Median Duration of Unemployment</td></tr>
<tr><td>UMCSENT</td><td>University of Michigan: Consumer Sentiment</td></tr>
<tr><td>USSLIND</td><td>Leading Index for the United States</td></tr>
<tr><td>KCFSI</td><td>Kansas City Financial Stress Index</td></tr>
<tr><td>IPMAN</td><td>Industrial Production: Manufacturing (NAICS)</td></tr>
<tr><td>VIXCLS</td><td>CBOE Volatility Index: NSA</td></tr>
<tr><td>DGS10</td><td>10-Year Treasury Constant Maturity Rate</td></tr></table>


---

The ETF and indicator data we're working with is not a complete set. Each column has data spanning a different timeframe: more recent dates have data for all ETFs and indicators, while much earlier dates have data for only some. One way to work with such data is to take a subset of it, using only recent dates to ensure a more complete dataset. We'll start by taking the last five years' worth of ETF and indicator data. I encourage you to try other time spans once you have a working model.

We'll perform an 80/20 split on this subset for training and cross-validation of the model. Roughly 80% of the data will be saved to **train_set** to train the model and the remaining 20% will be saved in **cv_set** to gauge the model's performance on data it hasn't seen yet.
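
A chronological split of this kind can be sketched on made-up data. The 0.8 fraction and the 260 weekly samples (5 years of 52 weeks) mirror the code in this article, while the `value` column and the use of positional `iloc` slicing are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly data: 260 samples of a single made-up column
idx = pd.date_range('2013-01-04', periods=260, freq='W-FRI')
data = pd.DataFrame({'value': np.arange(260.0)}, index=idx)

# Position of the last training sample
split_idx = int(np.floor(0.8 * len(data)))

# Earlier rows train the model, later rows validate it; positional slicing
# keeps time order and guarantees the two sets do not overlap
train = data.iloc[:split_idx + 1]
cv = data.iloc[split_idx + 1:]
print(train.shape, cv.shape)
```

Splitting by time rather than at random matters here: the cv set should only contain weeks that come after everything the model was trained on.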

---

In [10]:
# Define the number of years back to use for training
years = 5

# Take a subset of the data (in weeks)
subset_start = years*52

data_subset = interp_data.iloc[-subset_start:, :]

# Separate the data into an 80/20 train/cv split
train_start = data_subset.index[0]
split_idx = int(np.floor(0.8*len(data_subset.index)))
train_end = data_subset.index[split_idx]
# The first cv sample is the week immediately after train_end
cv_start = data_subset.index[split_idx + 1]
cv_end = data_subset.index[-1]

train_set = data_subset.loc[train_start: train_end, :]
cv_set = data_subset.loc[cv_start:cv_end, :]
print(train_set.shape)
train_set.head()

(209, 16)


Date,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU),HOUST,UNRATENSA,EMRATIO,UEMPMED,UMCSENT,USSLIND,KCFSI,IPMAN,VIXCLS,DGS10
2012-09-21,73.250854,63.142998,67.473579,82.698982,78.990929,74.672485,750.0,8.2,58.299999,18.0,74.300003,0.94,-0.4,95.344902,13.98,1.8
2012-09-28,71.434181,61.579113,66.656784,81.770248,78.683899,75.375397,750.0,8.2,58.299999,18.0,78.300003,0.94,-0.4,95.344902,14.84,1.64
2012-10-05,71.019882,62.161152,68.034836,83.121735,80.62796,76.065987,750.0,7.6,58.700001,18.5,78.300003,1.36,-0.4,95.344902,14.55,1.7
2012-10-12,68.816711,61.097095,66.647835,81.174812,78.879242,75.417496,750.0,7.6,58.700001,18.5,78.300003,1.36,-0.5,95.344902,16.139999,1.7
2012-10-19,66.90538,62.452168,67.00576,81.590675,79.195496,76.840805,872.0,7.6,58.700001,18.5,78.300003,1.36,-0.5,95.2537,17.059999,1.86


In [11]:
print(cv_set.shape)
cv_set.head()

(51, 16)


Date,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU),HOUST,UNRATENSA,EMRATIO,UEMPMED,UMCSENT,USSLIND,KCFSI,IPMAN,VIXCLS,DGS10
2016-09-23,116.756874,76.267693,111.06488,144.693253,150.265152,122.779961,1142.0,5.0,59.700001,11.2,89.800003,1.4,-0.34,103.669098,12.29,1.63
2016-09-30,117.603813,77.424454,110.90033,145.152069,147.982513,118.296898,1142.0,5.0,59.700001,11.2,91.199997,1.4,-0.34,103.669098,13.29,1.56
2016-10-07,117.603813,76.101028,109.478783,144.135788,147.499771,113.702843,1142.0,4.8,59.799999,10.3,91.199997,1.15,-0.34,103.669098,13.48,1.75
2016-10-14,116.585899,74.718811,108.826424,143.198441,142.524643,115.314629,1142.0,4.8,59.799999,10.3,91.199997,1.15,-0.39,103.669098,16.120001,1.75
2016-10-21,117.425926,76.208878,108.719322,144.352859,142.869446,115.826157,1047.0,4.8,59.799999,10.3,91.199997,1.15,-0.39,103.843498,13.75,1.76


---

Before the data can be fed into the model, it will have to be normalized. We will scale the data so that the minimum and maximum for each column correspond to 0 and 1. It's important to note that normalization must be done after splitting the data into the train and cv sets. Otherwise, the NN model would see information from the cv dataset.

The plot below shows the 6 normalized ETFs from the training dataset. This is the output that the model will be trying to fit.
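
The min-max scaling described above, and its inverse, can be sketched on a handful of made-up prices. The values are hypothetical; the formula is the standard (x - min) / (max - min).

```python
import numpy as np

# Hypothetical prices for one ETF (made-up values)
prices = np.array([50.0, 60.0, 75.0, 100.0])

# Min-max scaling: the minimum maps to 0 and the maximum maps to 1
p_min, p_max = prices.min(), prices.max()
scaled = (prices - p_min) / (p_max - p_min)

# Inverting the scaling recovers the original prices
restored = scaled * (p_max - p_min) + p_min
print(scaled)
```

The inverse step is what lets us convert the model's normalized predictions back into dollar prices later on.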

---

In [12]:
# Normalize each dataset to the [0, 1] range using its own column min and max
train_norm = ((train_set - train_set.min())/(train_set.max() - train_set.min())) 
cv_norm = ((cv_set - cv_set.min())/(cv_set.max() - cv_set.min())) 

# Constant columns have max == min, so the division above yields NaN;
# fall back to the original values for those columns
for col in train_set:
    if train_set[col].nunique() == 1:
        train_norm[col] = train_set[col]
for col in cv_set:
    if cv_set[col].nunique() == 1:
        cv_norm[col] = cv_set[col]

In [13]:
traces={}
for i in range(N_stocks):
    col=train_norm.columns[i]
    
    traces['trace' + str(i)] = go.Scatter(
                name=str(col),
                x=train_norm.index,
                y=train_norm[col]
    )

data = []
for trace in traces:
    trace = traces[str(trace)]
    data.append(trace)

layout= go.Layout(
    xaxis=dict(
        title='Year',
        titlefont=dict(size=16)),
    yaxis=dict(
        title='Normalized Price',
        titlefont=dict(size=16)),
    showlegend=True,
    legend=dict(
        orientation='h',
        font=dict(size=12),
        x=0.02,
        y=1.1),    
    margin=dict(t=100,
               r=0)
)

fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='norm-sector-ETFs.html')

In [14]:
train_norm.head()

Date,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU),HOUST,UNRATENSA,EMRATIO,UEMPMED,UMCSENT,USSLIND,KCFSI,IPMAN,VIXCLS,DGS10
2012-09-21,0.187486,0.196065,0.049116,0.059007,0.040246,0.069013,0.0,0.925,0.0,0.827957,0.055556,0.0,0.521739,0.075977,0.219162,0.25
2012-09-28,0.15313,0.134424,0.032467,0.045624,0.036504,0.081741,0.0,0.925,0.0,0.827957,0.214286,0.0,0.521739,0.075977,0.270659,0.15
2012-10-05,0.145295,0.157365,0.060557,0.065099,0.060198,0.094246,0.0,0.775,0.250001,0.88172,0.214286,0.456522,0.521739,0.075977,0.253293,0.1875
2012-10-12,0.10363,0.115425,0.032285,0.037044,0.038885,0.082503,0.0,0.775,0.250001,0.88172,0.214286,0.456522,0.434783,0.075977,0.348503,0.1875
2012-10-19,0.067484,0.168835,0.039581,0.043036,0.04274,0.108276,0.264642,0.775,0.250001,0.88172,0.214286,0.456522,0.434783,0.068868,0.403593,0.2875


---

The last step in the data prep is to split the train and cv datasets into inputs (X) and outputs (Y) for the model. We will go into more detail on what these matrices are in the next section.

---

In [15]:
# Dimensions: [# training samples x # indicators]
X_train = train_norm.iloc[:, N_stocks:] 
X_cv = cv_norm.iloc[:, N_stocks:] 

# Dimensions: [# training samples x # ETFs]
Y_train = train_norm.iloc[:, :N_stocks]
Y_cv = cv_norm.iloc[:, :N_stocks]

In [16]:
X_train.head()

Date,HOUST,UNRATENSA,EMRATIO,UEMPMED,UMCSENT,USSLIND,KCFSI,IPMAN,VIXCLS,DGS10
2012-09-21,0.0,0.925,0.0,0.827957,0.055556,0.0,0.521739,0.075977,0.219162,0.25
2012-09-28,0.0,0.925,0.0,0.827957,0.214286,0.0,0.521739,0.075977,0.270659,0.15
2012-10-05,0.0,0.775,0.250001,0.88172,0.214286,0.456522,0.521739,0.075977,0.253293,0.1875
2012-10-12,0.0,0.775,0.250001,0.88172,0.214286,0.456522,0.434783,0.075977,0.348503,0.1875
2012-10-19,0.264642,0.775,0.250001,0.88172,0.214286,0.456522,0.434783,0.068868,0.403593,0.2875


In [17]:
Y_train.head()

Date,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU)
2012-09-21,0.187486,0.196065,0.049116,0.059007,0.040246,0.069013
2012-09-28,0.15313,0.134424,0.032467,0.045624,0.036504,0.081741
2012-10-05,0.145295,0.157365,0.060557,0.065099,0.060198,0.094246
2012-10-12,0.10363,0.115425,0.032285,0.037044,0.038885,0.082503
2012-10-19,0.067484,0.168835,0.039581,0.043036,0.04274,0.108276


---


## 2. Building the NN Model

When building any machine learning model, it's important to understand the inputs and outputs for the model. Let's remind ourselves what we're trying to do with our NN model.

What question are we trying to answer?

*Given some macroeconomic data over some time period, can our NN model predict ETF prices over the same time period?*

So we need our NN model to:

*Take in macroeconomic data and output ETF prices*

---

Let's call out what X and Y are explicitly. X is what we feed the NN model and Y is what it tries to predict. In our case, X is the indicator data over the specified timeframe. Given X, the NN will output its predictions of the ETF prices. We can then compare the model's predictions to the actual ETF prices (Y) over the same timeframe. 

Below we can see the shapes of our X and Y matrices for both the training and cv data. The first dimension in each matrix is the number of samples. For us, each sample corresponds to a particular weekly time period. In the training set we have 209 samples and in our cv set we have 51: recall our 80/20 split on five years' worth of data. For the X_train and X_cv matrices, the second dimension is the number of indicators, which is 10. For the Y_train and Y_cv matrices, the second dimension corresponds to the number of ETFs, which is 6. Scroll up to see X_train and Y_train for a reminder of what these matrices look like. These matrix shapes are important when setting up your NN model.

---

In [18]:
print("X_train shape: " + str(X_train.shape))
print("Y_train shape: " + str(Y_train.shape))
print("\n" + "X_cv shape: " + str(X_cv.shape))
print("Y_cv shape: " + str(Y_cv.shape))

X_train shape: (209, 10)
Y_train shape: (209, 6)

X_cv shape: (51, 10)
Y_cv shape: (51, 6)


---

We will be utilizing the **sequential** model for our NN. The sequential model allows us to simply add layers to our network, one by one, to form a linear stack. 

We then add the first hidden layer with **model.add()**. We feed this the **Dense()** function which means every node in the hidden layer will be connected to every node from the previous layer (input layer). 

The first argument given to *Dense()* is the number of output units, which is equal to the number of nodes in the hidden layer. We will start with 20, twice the number of inputs.

The *input_shape* argument corresponds to the shape of the input you're feeding the model. To train our model we want to feed it X_train, which has a shape of (209,10), and have it predict Y_train of shape (209,6). So the full input has a shape of (209,10). Keras allows us to specify only the dimension of a single sample and leave the first (batch) dimension open for flexibility. That way we can feed it batches of different sizes as long as each sample has the same dimension. So we can simply pass input_shape=(10,), leaving the batch size open.

The last argument is the activation function we want to apply to the layer's output. We will start with the rectified linear unit (ReLU) function. Other functions you can try are the 'sigmoid' and 'tanh' functions.

Then we add the output layer which will have a dimension of 6, for our 6 ETFs. For all layers after the initial hidden layer, we need not specify the input_shape as it is inferred from the previous layer.

Once the model architecture is specified, we have to compile the model. We pass 'mse' (mean squared error) to the *loss* argument, which is the quantity the model tries to minimize. We will use the 'adam' optimizer, which is a form of stochastic gradient descent, and specify our custom *rmse* (root mean squared error) function as the metric, which the model will calculate for us at every epoch during training.

We now have a Keras NN model!
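
To make the architecture concrete, here is a plain NumPy sketch of what a network with 10 inputs, one hidden layer of 20 ReLU units, and 6 linear outputs computes. This is not Keras code: the weights are random, and the names `W1`, `b1`, `W2`, `b2`, and `forward` are assumptions introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_outputs = 10, 20, 6

# Randomly initialised weights and zero biases (illustrative values only)
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)

def forward(X):
    """Forward pass: dense layer with ReLU, then a linear output layer."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    return hidden @ W2 + b2

# A batch of 5 hypothetical samples, each with 10 indicator values
X = rng.random((5, n_inputs))
Y_hat = forward(X)

# Weights plus biases: 10*20 + 20 = 220 and 20*6 + 6 = 126
n_params = W1.size + b1.size + W2.size + b2.size
print(Y_hat.shape, n_params)
```

The batch size (5 here) never appears in the weight shapes, which is why only the per-sample dimension needs to be specified when defining the first layer.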

---

In [19]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras import backend
from keras import regularizers
 
# Define RMSE metric
def rmse(y_true, y_pred):
    return backend.sqrt(backend.mean(backend.square(y_pred - y_true), axis=-1))

# Define NN model architecture: we will use the sequential model.
model = Sequential()

# Add the first hidden layer
model.add(Dense(20, input_shape=(10,), activation='relu'))

# Add the output layer
model.add(Dense(6))

# Compile the model
model.compile(loss='mse', optimizer='adam', metrics=[rmse])

# Print the model summary
model.summary()

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 20)                220       
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 126       
Total params: 346
Trainable params: 346
Non-trainable params: 0
_________________________________________________________________


## 3. Training & Evaluating The Model

---

To train the model we use the model.fit() method. We have to specify the training data by passing it X_train and Y_train. We can also pass in our validation data, X_cv and Y_cv. Finally, we have to specify how many iterations, or epochs, we want the model to train for.

Details about each training iteration are printed as the model trains. We can get the results of training from H.history, a dictionary that stores the loss and metric values. We'll store the RMSE values in **rmse_train** and **rmse_cv**.
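
To clarify what the custom metric measures, here is a NumPy sketch of the same calculation: the root mean squared error across the outputs of each sample, followed by an average over samples. The averaging step reflects my understanding of how Keras reports a per-sample metric, so treat it as an assumption; `rmse_np` and the example values are hypothetical.

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Per-sample RMSE across outputs, then averaged over samples."""
    per_sample = np.sqrt(np.mean(np.square(y_pred - y_true), axis=-1))
    return per_sample.mean()

# Two made-up samples with two outputs each
y_true = np.array([[0.0, 0.0], [1.0, 1.0]])
y_pred = np.array([[3.0, 4.0], [1.0, 1.0]])

value = rmse_np(y_true, y_pred)
print(value)
```

Because the data is normalized to the 0 to 1 range, the metric is expressed as a fraction of each column's price range rather than in dollars.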

---

In [20]:
%%time

# THIS WILL THROW AN ERROR IF ANY DATA IS MISSING IN THE TRAINING SET

# Fit the model
H = model.fit(X_train, Y_train, 
              validation_data=(X_cv, Y_cv), epochs=100)

# Save the root mean squared errors
rmse_train = np.array(H.history['rmse'])
rmse_cv = np.array(H.history['val_rmse'])

Train on 209 samples, validate on 51 samples
Epoch 1/100
...
Epoch 100/100
Wall time: 1.68 s


In [21]:
print('LOWEST VAL_RMSE: ' + str(np.array(H.history['val_rmse']).min()))

LOWEST VAL_RMSE: 0.20281235376993814


---

Below is a plot of the training RMSE and validation RMSE. The training RMSE is calculated between the predictions of the model and Y_train, given X_train as the input. The validation RMSE is calculated between the predictions of the model and Y_cv, given X_cv as the input.

The downtrend in both curves suggests the model is making better predictions with each iteration!
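
When reading learning curves, it can help to locate the epoch with the lowest validation error. The sketch below uses a made-up validation curve; in the notebook the values would come from the training history instead.

```python
import numpy as np

# Hypothetical validation errors per epoch (made-up values)
val_errors = np.array([0.50, 0.35, 0.28, 0.25, 0.27, 0.31])

# Zero-based index of the epoch with the lowest validation error
best_epoch = int(np.argmin(val_errors))
best_value = float(val_errors[best_epoch])
print(best_epoch, best_value)
```

If the validation error starts rising after some epoch while the training error keeps falling, that is usually a sign the model has begun to overfit the training data.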

---

In [22]:
trace0 = go.Scatter(
    name='Training Error',
    x=[i for i in range(len(H.history['rmse']))],
    y=H.history['rmse'])

trace1 = go.Scatter(
    name='Validation Error',
    x=[i for i in range(len(H.history['val_rmse']))],
    y=H.history['val_rmse'])

data = [trace0, trace1]

layout= go.Layout(
    xaxis=dict(
        title='Epoch',
        titlefont=dict(size=16)),
    yaxis=dict(
        title='RMSE',
        titlefont=dict(size=16)),
    showlegend=True,
    legend=dict(
        font=dict(size=14),
        orientation='h',
        y=1.1),    
    margin=dict(t=50,
                r=0)
)

fig = go.Figure(data=data, layout=layout)
# change working directory to images folder
os.chdir('C:/Users/Kenny/projects/repositories/kennfucius.github.io/images/0002-keras-ETF/')
plotly.offline.iplot(fig, filename='sectorETF-learningcurves.html')

---

The plots below show the actual ETF prices in blue and the model's predictions in orange. The green shaded region marks the cross-validation data, and the unshaded region to its left is the training data. So in the unshaded region we can see how well the predictions fit the training data, and in the shaded region we can see how well the model predicts prices based on indicator data it hasn't seen before.

We can see there is some overlay of the model's predictions with the actual data, but it's not great. The relative "flatness" of the predicted data and generally poor fit of the training data suggests the model is underfitting the data.

This simple model can make much better predictions given the right tuning! Play around with the model to see how you can get better predictions. Remember that the goal is for the model to make accurate predictions on the cv data, not necessarily to make the best predictions on the training data.

Things to try:
1. Add more nodes to the hidden layer (20, 50, 100, etc.)
2. Add more hidden layers
3. Use different activation functions
4. Use a different amount of training data by changing **years**
5. Change the number of epochs for training

For a pretty good model try a single hidden layer with 50 units, the 'tanh' activation function, and 5 years of training data.

Hint: Change the parameters and then go to *Kernel* and hit *Restart & Run All*

---

In [23]:
# Rename predicted columns
predicted_col_names = []
for col in Y_train.columns:
    predicted_col_names.append('Predicted - ' + col)
    
# Get the *trained* model's predictions on X_train and X_cv
predictions_train = pd.DataFrame(model.predict(X_train), index=Y_train.index,
                columns=predicted_col_names)
predictions_cv = pd.DataFrame(model.predict(X_cv), index=Y_cv.index,
                columns=predicted_col_names)

# Merge the actual values and the model predictions into a single dataframe per dataset
plot_data_train = Y_train.join(predictions_train, how='left')
plot_data_cv = Y_cv.join(predictions_cv, how='left')

# Stack the two dataframes above 
plot_all = pd.concat([plot_data_train, plot_data_cv])

In [24]:
plot_all.head()

Date,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU),Predicted - Technology (IYW),Predicted - Basic Materials (IYM),Predicted - Consumer Goods (IYK),Predicted - Services (IYC),Predicted - Healthcare (IYH),Predicted - Utilities (IDU)
2012-09-21,0.187486,0.196065,0.049116,0.059007,0.040246,0.069013,-0.108568,0.060983,0.059087,0.027155,-0.047151,-0.025537
2012-09-28,0.15313,0.134424,0.032467,0.045624,0.036504,0.081741,-0.087017,0.086109,0.063452,0.051955,-0.06262,-0.002881
2012-10-05,0.145295,0.157365,0.060557,0.065099,0.060198,0.094246,0.125763,0.158732,0.153828,0.150081,0.060706,0.102514
2012-10-12,0.10363,0.115425,0.032285,0.037044,0.038885,0.082503,0.115188,0.171008,0.152173,0.154826,0.073894,0.115332
2012-10-19,0.067484,0.168835,0.039581,0.043036,0.04274,0.108276,0.007108,0.134386,0.121357,0.087006,0.063612,0.089147


In [72]:
# Per-ETF price range of the training data, repeated so it lines up with
# both the actual and the predicted columns
# (pd.concat replaces Series.append, which was removed in pandas 2.0)
etf_train = train_set.iloc[:, 0:N_stocks]
train_range = pd.concat([etf_train.max() - etf_train.min()] * 2)
train_range.index = plot_data_train.columns

# Per-ETF minimum price of the training data, repeated the same way
train_mins = pd.concat([etf_train.min()] * 2)
train_mins.index = plot_data_train.columns 

# Same quantities for the cv data
etf_cv = cv_set.iloc[:, 0:N_stocks]
cv_range = pd.concat([etf_cv.max() - etf_cv.min()] * 2)
cv_range.index = plot_data_cv.columns

cv_mins = pd.concat([etf_cv.min()] * 2)
cv_mins.index = plot_data_cv.columns

unnorm_train = plot_data_train.multiply(train_range) + train_mins
unnorm_cv = plot_data_cv.multiply(cv_range) + cv_mins

# Stack the two dataframes above 
plot_all_unnorm = pd.concat([unnorm_train, unnorm_cv])
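
---

The cell above converts the normalized values back to prices by multiplying each column by its range and adding its minimum. Assuming the data was min-max scaled to [0, 1] earlier in the notebook, the round trip can be sketched on a toy frame. The `prices` frame and both helper functions below are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

# Hypothetical toy prices; not taken from the notebook's datasets.
prices = pd.DataFrame({"A": [10.0, 12.0, 15.0], "B": [100.0, 90.0, 110.0]})


def minmax_scale(df):
    """Scale each column to [0, 1]; also return the mins and ranges used."""
    mins = df.min()
    ranges = df.max() - mins
    return (df - mins) / ranges, mins, ranges


def minmax_unscale(scaled, mins, ranges):
    """Invert minmax_scale: multiply by the range, then add the minimum."""
    return scaled * ranges + mins


scaled, mins, ranges = minmax_scale(prices)
restored = minmax_unscale(scaled, mins, ranges)
```

The inverse only recovers the original prices when it uses the same minimum and range that were used for scaling.

---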

In [73]:
traces={}
for i in range(len(plot_all_unnorm.columns)):
    col=plot_all_unnorm.columns[i]
    
    # create traces for the actual values
    if i < N_stocks:
        traces['trace' + str(i)] = go.Scatter(
            name=str(col),
            x=plot_all_unnorm.index,
            y=plot_all_unnorm[col],
            mode='lines + markers',
            line=dict(
                color = ('rgb(0, 77, 153)'),
                width=2,
                )
        )
    # create traces for the predicted values
    else:
        traces['trace' + str(i)] = go.Scatter(
            name=str(col),
            x=plot_all_unnorm.index,
            y=plot_all_unnorm[col],
            mode='lines + markers',
            line=dict(
                color = ('rgb(255, 102, 0)'),
                width=2,
                )
        )

In [83]:
# Get the date where the cv data starts
cv_start = plot_data_cv.index[0]

# change working directory to images folder
os.chdir('C:/Users/Kenny/projects/repositories/kennfucius.github.io/images/0002-keras-ETF/')

for i in range(N_stocks):
    
    layout= go.Layout(
    xaxis=dict(
        title='Date',
        titlefont=dict(size=16)),
    yaxis=dict(
        title='Price ($)',
        titlefont=dict(size=16)),
    showlegend=True,
    legend=dict(
        font=dict(size=14),
        x=0.05,
        y=1),    
    margin=dict(
        t=25,
        r=0),
    shapes = [dict(
            type='rect',
            xref='x',
            yref='paper',
            x0=str(cv_start.date()),
            y0=0,
            x1=str(plot_all.index[-1].date()),
            y1=1,
            fillcolor=('rgb(0, 153, 51)'),
            opacity=0.15,
            line=dict(
                width=0,
            )
        )]
    )
    
    data = [traces['trace'+str(i)], traces['trace'+str(i+N_stocks)]]
    
    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='predictions-' + str(i) + '.html')

In [28]:
# Plot just the validation portion (from cv_start onward)
for i in range(len(plot_all_unnorm.columns)):
    col=plot_all_unnorm.columns[i]
    
    # create traces for the actual values
    if i < N_stocks:
        traces['trace' + str(i)] = go.Scatter(
            name=str(col),
            x=plot_all_unnorm.index[plot_all_unnorm.index>=cv_start],
            y=plot_all_unnorm.loc[cv_start:, col],
            mode='lines + markers',
            line=dict(
                color = ('rgb(0, 77, 153)'),
                width=2,
                )
        )
    # create traces for the predicted values
    else:
        traces['trace' + str(i)] = go.Scatter(
            name=str(col),
            x=plot_all_unnorm.index[plot_all_unnorm.index>=cv_start],
            y=plot_all_unnorm.loc[cv_start:, col],
            mode='lines + markers',
            line=dict(
                color = ('rgb(255, 102, 0)'),
                width=2,
                )
        )

In [29]:
# change working directory to images folder
os.chdir('C:/Users/Kenny/projects/repositories/kennfucius.github.io/images/0002-keras-ETF/')

for i in range(N_stocks):
    
    layout= go.Layout(
    xaxis=dict(
        title='Date',
        titlefont=dict(size=16)),
    yaxis=dict(
        title='Price ($)',
        titlefont=dict(size=16)),
    showlegend=True,
    legend=dict(
        font=dict(size=14),
        x=0,
        y=1.1),    
    margin=dict(t=50,
        r=0)
    )
    
    data = [traces['trace'+str(i)], traces['trace'+str(i+N_stocks)]]
    
    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='cv-predictions-' + str(i) + '.html')

In [30]:
trace0 = go.Scatter(
    name='Predicted/Actual - Train',
    mode='markers',
    x=Y_train.stack(),
    y=predictions_train.stack())

trace1 = go.Scatter(
    name='Ideal Line',
    mode='lines',
    x=[0,1],
    y=[0,1])

data = [trace0, trace1]

layout= go.Layout(
    xaxis=dict(
        title='Actual Values',
        titlefont=dict(size=16)),
    yaxis=dict(
        title='Predicted Values',
        titlefont=dict(size=16)),
    showlegend=True,
    legend=dict(
        font=dict(size=14),
        orientation='h',
        y=1.1),  
    margin=dict(
        t=50,
        r=0)
    )

fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='prediction-ratios-train.html')

In [31]:
trace0 = go.Scatter(
    name='Predicted/Actual - CV',
    mode='markers',
    x=Y_cv.stack(),
    y=predictions_cv.stack())

trace1 = go.Scatter(
    name='Ideal Line',
    mode='lines',
    x=[0,1],
    y=[0,1])

data = [trace0, trace1]

layout= go.Layout(
    xaxis=dict(
        title='Actual Values',
        titlefont=dict(size=16)),
    yaxis=dict(
        title='Predicted Values',
        titlefont=dict(size=16)),
    showlegend=True,
    legend=dict(
        font=dict(size=14),
        orientation='h',
        y=1.1),  
    margin=dict(
        t=50,
        r=0)
    )

fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig, filename='prediction-ratios-cv.html')

In [32]:
# Create best fit lines 
from scipy import stats

traces=[]
for indicator in train_set.columns[N_stocks:]:
    
    xi=train_set[indicator]
    yi=train_set['Technology (IYW)']
    
    slope, intercept, r_value, p_value, std_err = stats.linregress(xi,yi)
    line = slope*xi+intercept

    trace0 = go.Scatter(
        mode='markers',
        x=xi,
        y=yi,
        showlegend=False)

    layout= go.Layout(
        xaxis=dict(
            title=str(indicator),
            titlefont=dict(size=16)),
        yaxis=dict(
            title='Technology (IYW)',
            titlefont=dict(size=16)),
        showlegend=True,
        legend=dict(
            font=dict(size=14),
            orientation='h',
            y=1.05),    
        margin=dict(
            t=20,
            r=0),
        hovermode='closest'
    )

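    # Overlay the least-squares fit computed above
    trace1 = go.Scatter(
        mode='lines',
        x=xi,
        y=line,
        name='Best fit (r = ' + str(round(r_value, 3)) + ')')

    data = [trace0, trace1]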

    fig = go.Figure(data=data, layout=layout)
    plotly.offline.iplot(fig, filename='tech-dep-' + str(indicator) + '.html')
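
---

The cell above fits a straight line to each indicator with `stats.linregress`. As a rough illustration of what that fit returns, the sketch below reproduces the slope, intercept, and correlation coefficient with numpy alone. It swaps in `np.polyfit` and `np.corrcoef` for `linregress`, and the toy arrays are hypothetical.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

# Pearson correlation coefficient, which linregress reports as r_value.
r_value = np.corrcoef(x, y)[0, 1]
```

---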

In [33]:
train_set.head(10)

Date,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU),HOUST,UNRATENSA,EMRATIO,UEMPMED,UMCSENT,USSLIND,KCFSI,IPMAN,VIXCLS,DGS10
2012-09-21,73.250854,63.142998,67.473579,82.698982,78.990929,74.672485,750.0,8.2,58.299999,18.0,74.300003,0.94,-0.4,95.344902,13.98,1.8
2012-09-28,71.434181,61.579113,66.656784,81.770248,78.683899,75.375397,750.0,8.2,58.299999,18.0,78.300003,0.94,-0.4,95.344902,14.84,1.64
2012-10-05,71.019882,62.161152,68.034836,83.121735,80.62796,76.065987,750.0,7.6,58.700001,18.5,78.300003,1.36,-0.4,95.344902,14.55,1.7
2012-10-12,68.816711,61.097095,66.647835,81.174812,78.879242,75.417496,750.0,7.6,58.700001,18.5,78.300003,1.36,-0.5,95.344902,16.139999,1.7
2012-10-19,66.90538,62.452168,67.00576,81.590675,79.195496,76.840805,872.0,7.6,58.700001,18.5,78.300003,1.36,-0.5,95.2537,17.059999,1.86
2012-10-26,66.415794,61.124382,66.021461,80.607773,78.311844,75.670166,872.0,7.6,58.700001,18.5,82.599998,1.36,-0.5,95.2537,18.120001,1.86
2012-11-02,66.32164,61.133492,66.558357,81.222099,77.81884,75.063789,872.0,7.5,58.799999,19.6,82.599998,1.25,-0.5,95.2537,16.690001,1.75
2012-11-09,64.796371,59.887531,65.314529,79.502014,76.191032,71.855057,872.0,7.5,58.799999,19.6,82.599998,1.25,-0.4,95.2537,18.49,1.62
2012-11-16,63.33699,58.168652,65.063965,78.604164,75.688728,71.290817,872.0,7.5,58.799999,19.6,82.599998,1.25,-0.4,94.370201,16.41,1.58
2012-11-23,66.001534,60.606003,67.77533,81.467819,77.949066,70.86129,894.0,7.5,58.799999,19.6,82.699997,1.25,-0.4,94.370201,15.14,1.69


In [34]:
pearson = pd.DataFrame(train_set.corr().iloc[N_stocks:,:N_stocks]).round(3)
pearson

Indicator,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU)
HOUST,0.757,0.229,0.795,0.773,0.748,0.771
UNRATENSA,-0.917,-0.428,-0.919,-0.919,-0.91,-0.874
EMRATIO,0.858,0.316,0.879,0.847,0.833,0.905
UEMPMED,-0.924,-0.412,-0.933,-0.932,-0.929,-0.882
UMCSENT,0.8,0.321,0.786,0.82,0.845,0.761
USSLIND,0.255,0.192,0.238,0.23,0.262,0.212
KCFSI,0.297,-0.39,0.365,0.326,0.265,0.428
IPMAN,0.903,0.353,0.896,0.914,0.904,0.841
VIXCLS,0.04,-0.423,0.055,0.088,0.097,0.083
DGS10,-0.159,0.285,-0.183,-0.102,-0.108,-0.369


In [35]:
kendall = pd.DataFrame(train_set.corr(method='kendall').iloc[N_stocks:,:N_stocks]).round(3)
kendall

Indicator,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU)
HOUST,0.549,0.132,0.599,0.587,0.531,0.566
UNRATENSA,-0.703,-0.217,-0.771,-0.722,-0.658,-0.707
EMRATIO,0.677,0.198,0.737,0.694,0.649,0.692
UEMPMED,-0.693,-0.203,-0.763,-0.73,-0.68,-0.682
UMCSENT,0.555,0.182,0.553,0.57,0.631,0.549
USSLIND,0.105,0.157,0.112,0.096,0.14,0.175
KCFSI,0.178,-0.279,0.245,0.194,0.093,0.247
IPMAN,0.697,0.22,0.737,0.713,0.659,0.698
VIXCLS,-0.057,-0.313,-0.018,-0.043,-0.027,0.013
DGS10,-0.149,0.214,-0.23,-0.165,-0.127,-0.319


In [36]:
spearman = pd.DataFrame(train_set.corr(method='spearman').iloc[N_stocks:,:N_stocks]).round(3)
spearman

Indicator,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU)
HOUST,0.748,0.218,0.798,0.781,0.728,0.769
UNRATENSA,-0.884,-0.345,-0.93,-0.898,-0.85,-0.882
EMRATIO,0.858,0.324,0.904,0.875,0.841,0.884
UEMPMED,-0.867,-0.295,-0.919,-0.898,-0.861,-0.867
UMCSENT,0.778,0.271,0.781,0.794,0.831,0.772
USSLIND,0.192,0.242,0.208,0.184,0.226,0.276
KCFSI,0.358,-0.406,0.445,0.39,0.291,0.424
IPMAN,0.862,0.311,0.894,0.88,0.841,0.874
VIXCLS,-0.066,-0.451,-0.004,-0.038,-0.025,0.04
DGS10,-0.174,0.304,-0.25,-0.191,-0.144,-0.339
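
---

The three tables above use Pearson, Kendall, and Spearman correlation. Pearson measures linear association, while the other two depend only on the ordering of the values. The sketch below shows the difference on a hypothetical monotonic but nonlinear series. The `kendall_tau_a` helper is an illustrative, uncorrected variant; pandas' `method='kendall'` applies a tie correction (tau-b), which gives the same result when there are no ties.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical monotonic but nonlinear relationship: y = x**3.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# Pearson: strength of the linear relationship.
pearson_r = x.corr(y)

# Spearman: Pearson correlation computed on the ranks.
spearman_rho = x.rank().corr(y.rank())


def kendall_tau_a(a, b):
    """(concordant - discordant) pairs divided by the total number of pairs."""
    pairs = list(combinations(range(len(a)), 2))
    score = sum(
        np.sign((a.iloc[i] - a.iloc[j]) * (b.iloc[i] - b.iloc[j]))
        for i, j in pairs
    )
    return score / len(pairs)


kendall_tau = kendall_tau_a(x, y)
```

Here both rank-based measures equal 1 because `y` always increases with `x`, while Pearson stays below 1 because the relationship is not a straight line.

---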


In [50]:
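def rmse_col(df1, df2):
    # Note: squared errors are summed, not averaged, so each value is
    # sqrt(n) times the conventional RMSE (n = number of rows)
    df2 = df2.copy()  # avoid relabeling the caller's dataframe in place
    df2.columns = df1.columns
    sq_diff = (df2 - df1)**2
    rmse_cols = sq_diff.sum()**0.5
    return rmse_cols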

rmse_col(Y_train, predictions_train).T

Technology (IYW)         1.196336
Basic Materials (IYM)    2.063661
Consumer Goods (IYK)     1.040161
Services (IYC)           0.962031
Healthcare (IYH)         0.919669
Utilities (IDU)          1.071568
dtype: float64
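
---

`rmse_col` above takes the square root of the summed squared errors. The conventional RMSE averages the squared errors first, so the two differ by a factor of sqrt(n) for n rows. A minimal sketch of the averaged version follows. The function name `rmse_per_column` and the toy frames are hypothetical.

```python
import pandas as pd


def rmse_per_column(actual, predicted):
    """Conventional RMSE per column: square root of the mean squared error."""
    # Relabel a copy so the two frames align without mutating the input.
    predicted = predicted.copy()
    predicted.columns = actual.columns
    return ((predicted - actual) ** 2).mean() ** 0.5


# Hypothetical toy values, for illustration only.
actual = pd.DataFrame({"A": [0.0, 1.0, 2.0, 3.0], "B": [1.0, 1.0, 1.0, 1.0]})
predicted = pd.DataFrame(
    {"pred_A": [0.0, 1.0, 2.0, 5.0], "pred_B": [1.0, 1.0, 1.0, 1.0]}
)

rmse = rmse_per_column(actual, predicted)

# Root of the summed squared error, for comparison with rmse_col.
relabeled = predicted.copy()
relabeled.columns = actual.columns
root_sse = ((relabeled - actual) ** 2).sum() ** 0.5
```

With four rows, `root_sse` is twice `rmse` in every column, which is the sqrt(n) factor.

---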

In [69]:
# Append the mean absolute Kendall correlation and the training error as extra rows
mean_corr = pd.DataFrame(kendall.abs().mean()).T
mean_corr.index = ['Mean Correlation']
rmse_cols = pd.DataFrame(rmse_col(Y_train, predictions_train)).T
rmse_cols.index = ['RMSE']
kendall_rmse = pd.concat([kendall, mean_corr, rmse_cols]).round(3)
kendall_rmse

Indicator,Technology (IYW),Basic Materials (IYM),Consumer Goods (IYK),Services (IYC),Healthcare (IYH),Utilities (IDU)
HOUST,0.549,0.132,0.599,0.587,0.531,0.566
UNRATENSA,-0.703,-0.217,-0.771,-0.722,-0.658,-0.707
EMRATIO,0.677,0.198,0.737,0.694,0.649,0.692
UEMPMED,-0.693,-0.203,-0.763,-0.73,-0.68,-0.682
UMCSENT,0.555,0.182,0.553,0.57,0.631,0.549
USSLIND,0.105,0.157,0.112,0.096,0.14,0.175
KCFSI,0.178,-0.279,0.245,0.194,0.093,0.247
IPMAN,0.697,0.22,0.737,0.713,0.659,0.698
VIXCLS,-0.057,-0.313,-0.018,-0.043,-0.027,0.013
DGS10,-0.149,0.214,-0.23,-0.165,-0.127,-0.319


In [70]:
from html_tools import table_idx

table_idx(kendall_rmse)

'<table style="width:100%"><tr><th> </th><th>Technology (IYW)</th><th>Basic Materials (IYM)</th><th>Consumer Goods (IYK)</th><th>Services (IYC)</th><th>Healthcare (IYH)</th><th>Utilities (IDU)</th></tr>\n<tr><td>HOUST</td><td>0.549</td><td>0.132</td><td>0.599</td><td>0.587</td><td>0.531</td><td>0.566</td></tr>\n<tr><td>UNRATENSA</td><td>-0.703</td><td>-0.217</td><td>-0.771</td><td>-0.722</td><td>-0.658</td><td>-0.707</td></tr>\n<tr><td>EMRATIO</td><td>0.677</td><td>0.198</td><td>0.737</td><td>0.694</td><td>0.649</td><td>0.692</td></tr>\n<tr><td>UEMPMED</td><td>-0.693</td><td>-0.203</td><td>-0.763</td><td>-0.73</td><td>-0.68</td><td>-0.682</td></tr>\n<tr><td>UMCSENT</td><td>0.555</td><td>0.182</td><td>0.553</td><td>0.57</td><td>0.631</td><td>0.549</td></tr>\n<tr><td>USSLIND</td><td>0.105</td><td>0.157</td><td>0.112</td><td>0.096</td><td>0.14</td><td>0.175</td></tr>\n<tr><td>KCFSI</td><td>0.178</td><td>-0.279</td><td>0.245</td><td>0.194</td><td>0.093</td><td>0.247</td></tr>\n<tr><td>IPMA

---
# Supplemental

---

In [39]:
plotly.__version__

'3.1.0'