# Summary:

#### In this notebook, the optimal network configurations obtained from the runs in the preceding notebook ('03_Blue_in_Green.ipynb') are trained on historical data sets of different lengths. The behavior of the learning curves and the magnitude of the error metric are analyzed as the amount of historical information varies, and an optimal length for the training data is inferred.

#### The experiments are run on both the clean and the distorted datasets analyzed in the preceding notebook.

#### Given a seven-day forecasting horizon for the target, the inferred optimal range of historical information is used as a guideline in the subsequent notebook '05_Flamenco_Sketches.ipynb', where the fine-tuned models are tested on financial stock data.

# Table of contents
* [1. Load modules](#Part1_link)
* [2. Clean time series](#Part2_link)
<br >&nbsp;&nbsp;&nbsp;[2.1 Evaluate model performance under varying amounts of historical information](#Part2.1_link)
<br >&nbsp;&nbsp;&nbsp;[2.2 Visualize and save results](#Part2.2_link)
* [3. Distorted time series](#Part3_link)
<br >&nbsp;&nbsp;&nbsp;[3.1 Evaluate model performance under varying amounts of historical information](#Part3.1_link)
<br >&nbsp;&nbsp;&nbsp;[3.2 Visualize and save results](#Part3.2_link)

<a id='Part1_link'></a>
# 1. Load modules

In [1]:
import sys
sys.path.append("../src/")
import Kind_of_Blue  # own class with a collection of methods used in this analysis

import tensorflow as tf

import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

import numpy as np
import pandas as pd


<a id='Part2_link'></a>
# 2. Clean time series

Evaluate model performance for a range of historical training data lengths. 

In [2]:
# set a range of dates on which the observations are made
idx = pd.date_range(end='7/1/2020', periods=5*364, freq='d')

# take a sine function as the observations
num_periods = 10  # number of sine periods
observations = [np.sin(2*np.pi*num_periods*x/len(idx)) for x in range(len(idx))]

# initialize object
mdq = Kind_of_Blue.Kind_of_Blue()

# set target feature 
mdq._selected_features = ['observations']

# initialize dataframe to store time series
df = pd.DataFrame(data={'observations': observations})
df.index = idx

# load dataframe into object
mdq.df = df

# initialize dataset from dataframe 
mdq.initialize_dataset()

# standardize data
mdq.standardize_data()

# set the number of future time points to forecast: int(365/52) = 7 days
# (the numbers of past, historical time points are set in section 2.1)
future_target_size = int(365/52)

# specify model configuration: this is chosen based on the results from the previous notebook 02_Freddie_Freeloader.ipynb
units = 128  # number of units in each neural network layer
num_layers = 2  # total number of layers
epochs = 10
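
To put the history lengths used in section 2.1 in context, it helps to relate them to the period of the synthetic sine wave. The following is a small sketch in plain Python and is not part of `Kind_of_Blue`; the names `days_per_period` and `coverage` are introduced here for illustration only.

```python
# Relate each past history size to the period of the synthetic sine wave.
num_days = 5 * 364                    # length of the date index defined above
num_periods = 10                      # number of sine periods over the full index
future_target_size = int(365 / 52)    # 7-day forecasting horizon

days_per_period = num_days / num_periods   # length of one sine period in days

# history sizes as multiples of the forecasting horizon (as in section 2.1)
past_history_sizes = [m * future_target_size for m in (1, 10, 20, 52)]

# fraction of one sine period covered by each history window
coverage = {p: p / days_per_period for p in past_history_sizes}

for p, c in coverage.items():
    print(f"{p:>3} days of history = {c:.2f} sine periods")
```

Under these settings one period lasts 182 days, so the shortest window covers only about 4% of a period while the longest spans two full periods.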


<a id='Part2.1_link'></a>
### 2.1 Evaluate model performance under varying amounts of historical information

In [None]:
# choose a few past history sizes as a multiple of the future target size (seven data points)
past_history_sizes = [1 * future_target_size, 10 * future_target_size
                      , 20 * future_target_size, 52 * future_target_size]

# initialize results dictionary
res_2 = {'model_type': [], 'past_history_size': [], 'val_mse': []
       , 'mse': [], 'total_training_time': []}

# model type 
model_types = ['RNN', 'LSTM']

for model_type in model_types:
    
    for past_history_size in past_history_sizes:
        
        # generate train and validation data
        mdq.generate_train_and_val_data(future_target_size=future_target_size, past_history_size=past_history_size)

        # set number of steps per epoch
        num_samples = mdq._num_samples
        steps_per_epoch = int(num_samples/future_target_size)
        validation_steps = int(steps_per_epoch/2)

        # compile model
        mdq.compile_model(units=units, num_layers=num_layers, model_type=model_type)

        # fit model
        mdq.fit_model(epochs=epochs, steps_per_epoch=steps_per_epoch
                      ,validation_steps=validation_steps, model_type=model_type)

        # get errors
        history = mdq._histories[model_type]
        val_mse = history.history['val_mse'][-1]
        mse = history.history['mse'][-1]

        # get total training time
        total_training_time = sum(mdq._time_callbacks[model_type].times)

        # append results to results dictionary
        res_2['model_type'].append(model_type)
        res_2['past_history_size'].append(past_history_size)
        res_2['val_mse'].append(val_mse)
        res_2['mse'].append(mse)
        res_2['total_training_time'].append(total_training_time)

training set shape: x:(1267, 7, 1), y:(1267, 7, 1)
validation set shape: x:(532, 7, 1), y:(532, 7, 1)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
training set shape: x:(1204, 70, 1), y:(1204, 7, 1)
validation set shape: x:(469, 70, 1), y:(469, 7, 1)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
training set shape: x:(1134, 140, 1), y:(1134, 7, 1)
validation set shape: x:(399, 140, 1), y:(399, 7, 1)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
training set shape: x:(910, 364, 1), y:(910, 7, 1)
validation set shape: x:(175, 364, 1), y:(175, 7, 1)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
training set shape: x:(1267, 7, 1), y:(1267, 7, 1)
validation set shape: x:(532, 7, 1), y:(532, 7, 1)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch

Epoch 8/10
Epoch 9/10
Epoch 10/10
training set shape: x:(1134, 140, 1), y:(1134, 7, 1)
validation set shape: x:(399, 140, 1), y:(399, 7, 1)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
 26/162 [===>..........................] - ETA: 21s - loss: nan - mse: nan
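
The shapes logged above are consistent with a sliding-window split in which the first 70% of the 1820 observations is used for training. The sketch below reproduces those shapes with NumPy. It is an assumption about how `generate_train_and_val_data` behaves, inferred only from the printed shapes; `make_windows` and the 70% split are hypothetical and are not taken from the `Kind_of_Blue` source.

```python
import numpy as np

def make_windows(series, start, end, past_history_size, future_target_size):
    """Slice a 1-D series into (history, target) pairs (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    first = start + past_history_size
    # without an explicit end, stop early enough to leave room for the target
    last = len(series) - future_target_size if end is None else end
    x, y = [], []
    for i in range(first, last):
        x.append(series[i - past_history_size:i])    # past observations
        y.append(series[i:i + future_target_size])   # future observations
    # add a trailing feature axis: (samples, time steps, 1)
    return np.array(x)[..., np.newaxis], np.array(y)[..., np.newaxis]

n = 5 * 364
series = np.sin(2 * np.pi * 10 * np.arange(n) / n)
split = n * 7 // 10          # assumed 70% training split (1274 points)
future_target_size = 7

shapes = {}
for past in (7, 70, 140, 364):
    x_train, y_train = make_windows(series, 0, split, past, future_target_size)
    x_val, y_val = make_windows(series, split, None, past, future_target_size)
    shapes[past] = (x_train.shape, y_train.shape, x_val.shape, y_val.shape)
    print(past, x_train.shape, y_train.shape, x_val.shape, y_val.shape)
```

Under this reading, the number of training samples for the 140-day history would be 1134, and `int(1134/7) = 162` matches the `26/162` step counter in the log. Note also that longer histories leave fewer windows, especially for validation.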

<a id='Part2.2_link'></a>
### 2.2 Visualize and save results

In [None]:
# transform dictionary to dataframe
df_res_22 = pd.DataFrame(res_2)

# store dataframe as csv locally
df_res_22.to_csv('../data/04_results_cleanData.csv')

In [None]:
# visualize results
# compare RNN to LSTM results

df_res = df_res_22

x_label = 'past_history_size'
y_label = 'mse'
z_label = 'val_mse'

for model_type in model_types:
    condition_1 = (df_res['model_type']== model_type)
    
    x = df_res[condition_1][x_label].values
    y = df_res[condition_1][y_label].values
    
    z = df_res[condition_1][z_label].values
    
    plt.scatter(x, y, alpha=0.6, c="red", linewidth=0.0, label='training')
    plt.scatter(x, z, alpha=0.6, c="blue", linewidth=0.0, label='validation')
        
    plt.xlabel('past history data size [days]')
    plt.ylabel('mse')
    plt.legend()
    plt.title('clean dataset: {} model'.format(model_type))
    plt.show()
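
Besides reading the optimum off the scatter plots, the history length with the lowest validation error can be extracted from the results table directly. The following is a minimal sketch, assuming a results frame with the same columns as `df_res_22`; `best_history_per_model` is a hypothetical helper, and the numbers in `demo` are placeholders for illustration, not measured results.

```python
import pandas as pd

def best_history_per_model(df_res, metric="val_mse"):
    """Return, for each model type, the row with the lowest value of `metric`."""
    valid = df_res.dropna(subset=[metric])     # ignore runs whose metric is NaN
    best_idx = valid.groupby("model_type")[metric].idxmin()
    return valid.loc[best_idx].set_index("model_type")

# placeholder numbers for illustration only, NOT measured results
demo = pd.DataFrame({
    "model_type": ["RNN", "RNN", "LSTM", "LSTM"],
    "past_history_size": [7, 70, 7, 70],
    "val_mse": [0.30, 0.10, 0.20, float("nan")],
})

best = best_history_per_model(demo)
print(best[["past_history_size", "val_mse"]])
```

Dropping NaN rows first keeps a diverged run (such as one ending with `loss: nan`) from being selected or from hiding valid runs of the same model type.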

<a id='Part3_link'></a>
# 3. Distorted time series

Run the RNN and LSTM models with Gaussian noise added on top of the clean data. Evaluate model performance for the same range of historical training data lengths, at a fixed noise standard deviation.

<a id='Part3.1_link'></a>
### 3.1 Evaluate model performance under varying amounts of historical information

In [None]:
# set a range of dates on which the observations are made
idx = pd.date_range(end='7/1/2020', periods=5*364, freq='d')

# take a sine function as the observations
num_periods = 10  # number of sine periods
observations = [np.sin(2*np.pi*num_periods*x/len(idx)) for x in range(len(idx))]

# initialize object
# remove the object from the clean-data runs, if it exists, before re-initializing
try:
    del mdq
except NameError:
    pass

mdq = Kind_of_Blue.Kind_of_Blue()

# set target feature 
mdq._selected_features = ['observations']


# generate noisy observations by adding Gaussian noise to clean observations
mean = 0.0
std = 1.0
noise = [np.random.normal(loc=mean, scale=std, size=None) for x in range(len(idx))]        
noisy_observations = [noise[i]+observations[i] for i in range(len(noise))]

# initialize dataframe to store time series
df = pd.DataFrame(data={'observations': noisy_observations})
df.index = idx

# load dataframe into object
mdq.df = df

# initialize dataset from dataframe 
mdq.initialize_dataset()

# standardize data
mdq.standardize_data()
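
When interpreting the errors on the noisy data, it may help to estimate the lowest MSE a forecaster could reach if it recovered the clean sine exactly, since the added noise itself is unpredictable. The sketch below assumes that `standardize_data` applies a z-score using the mean and standard deviation of the whole series, which is an assumption and not confirmed by the `Kind_of_Blue` source. The fixed seed and the names `theoretical_floor` and `empirical_floor` are additions for illustration; the notebook cell above draws unseeded noise.

```python
import numpy as np

n = 5 * 364
num_periods = 10
std = 1.0                                   # noise standard deviation used above

x = np.arange(n)
clean = np.sin(2 * np.pi * num_periods * x / n)

rng = np.random.default_rng(0)              # fixed seed, added here for reproducibility
noisy = clean + rng.normal(loc=0.0, scale=std, size=n)

# the variance of a unit sine over whole periods is 1/2, so after z-scoring
# the unpredictable noise contributes std^2 / (1/2 + std^2) to the MSE
theoretical_floor = std**2 / (0.5 + std**2)

# empirical counterpart: MSE of a "perfect" forecast (the clean signal),
# with both series scaled by the statistics of the noisy series
mu, sigma = noisy.mean(), noisy.std()
empirical_floor = np.mean(((noisy - mu) / sigma - (clean - mu) / sigma) ** 2)

print(f"theoretical floor: {theoretical_floor:.3f}")
print(f"empirical floor:   {empirical_floor:.3f}")
```

With `std = 1.0` the theoretical floor is 2/3 in standardized units, so under these assumptions a validation MSE well below that value on the noisy data would point to overfitting or leakage rather than a better model.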
   

In [None]:
# initialize results dictionary
res_3 = {'model_type': [], 'past_history_size': [], 'val_mse': []
       , 'mse': [], 'total_training_time': []}

# model type 
model_types = ['RNN', 'LSTM']

for model_type in model_types:
    
    for past_history_size in past_history_sizes:
        
        # generate train and validation data
        mdq.generate_train_and_val_data(future_target_size=future_target_size, past_history_size=past_history_size)

        # set number of steps per epoch
        num_samples = mdq._num_samples
        steps_per_epoch = int(num_samples/future_target_size)
        validation_steps = int(steps_per_epoch/2)

        # compile model
        mdq.compile_model(units=units, num_layers=num_layers, model_type=model_type)

        # fit model
        mdq.fit_model(epochs=epochs, steps_per_epoch=steps_per_epoch
                      ,validation_steps=validation_steps, model_type=model_type)

        # get errors
        history = mdq._histories[model_type]
        val_mse = history.history['val_mse'][-1]
        mse = history.history['mse'][-1]

        # get total training time
        total_training_time = sum(mdq._time_callbacks[model_type].times)

        # append results to results dictionary
        res_3['model_type'].append(model_type)
        res_3['past_history_size'].append(past_history_size)
        res_3['val_mse'].append(val_mse)
        res_3['mse'].append(mse)
        res_3['total_training_time'].append(total_training_time)

<a id='Part3.2_link'></a>
### 3.2 Visualize and save results

In [None]:
# transform dictionary to dataframe
df_res_32 = pd.DataFrame(res_3)

# store dataframe as csv locally
df_res_32.to_csv('../data/04_results_distored.csv')

In [None]:
# visualize results
# compare RNN to LSTM results

df_res = df_res_32

x_label = 'past_history_size'
y_label = 'mse'
z_label = 'val_mse'

for model_type in model_types:
    condition_1 = (df_res['model_type']== model_type)
    
    x = df_res[condition_1][x_label].values
    y = df_res[condition_1][y_label].values
    
    z = df_res[condition_1][z_label].values
    
    plt.scatter(x, y, alpha=0.6, c="red", linewidth=0.0, label='training')
    plt.scatter(x, z, alpha=0.6, c="blue", linewidth=0.0, label='validation')
        
    plt.xlabel('past history data size [days]')
    plt.ylabel('mse')
    plt.legend()
    plt.title('noisy dataset: {} model'.format(model_type))
    plt.show()