# Summary:

#### In this notebook, the trial runs from the preceding notebook ('01_So_What.ipynb') are systematically extended to search for an optimal configuration of the two models (LSTM and RNN). The hyperparameters to be tuned are the number of epochs, the number of layers, and the number of units per layer.

#### The best-performing model configuration will be used in the notebook '03_Blue_in_Green.ipynb' to analyze the impact of adding noise to the clean dataset.


# Table of contents
* [1. Load modules](#Part1_link)
* [2. Setup data](#Part2_link)
<br >&nbsp;&nbsp;&nbsp;[2.1 Generate data, split into training and validation sets, and standardize the data](#Part2.1_link)
* [3. Setup models and evaluate for various hyper-parameter choices](#Part3_link)
<br >&nbsp;&nbsp;&nbsp;[3.1 Compile and fit LSTM and RNN model](#Part3.1_link)
* [4. Visualize results](#Part4_link)

<a id='Part1_link'></a>
# 1. Load modules

In [1]:
import sys
sys.path.append("../src/")
import Kind_of_Blue  # own class with a collection of methods used in this analysis

import tensorflow as tf

import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

import numpy as np
import pandas as pd


<a id='Part2_link'></a>
# 2. Setup data

<a id='Part2.1_link'></a>
### 2.1 Generate data, split into training and validation sets, and standardize the data

The following steps are repeated from the previous notebook, '01_So_What.ipynb', and are grouped into one single step here for simplicity.

In [2]:
# set a range of dates on which the observations are made
idx = pd.date_range(end='7/1/2020', periods=5*364, freq='d')

# take a sine function as the observations
num_periods = 10  # number of sine periods
observations = [np.sin(2*np.pi*num_periods*x/len(idx)) for x in range(len(idx))]
print('number of observations in time series: {}'.format(len(observations)))

# initialize dataframe to store time series
df = pd.DataFrame(data=observations, columns=['observations'])
df.index = idx

# initialize object
mdq = Kind_of_Blue.Kind_of_Blue()

# load dataframe into object
mdq._selected_features = ['observations']
mdq.df = df

# train-validation split ratio as class attribute set to 70%
print('train split ratio = ', mdq.TRAIN_SPLIT_RATIO)

# initialize dataset from dataframe 
mdq.initialize_dataset()
print('loaded data set length: {}'.format(len(mdq._dataset)))

# standardize data
mdq.standardize_data()

# check that mean equals zero and the standard deviation is one
print('mean: {}, std: {}'.format(round(np.mean(mdq._dataset), 2), round(np.std(mdq._dataset), 2)))

# set the window sizes: (1) the number of future time points to forecast (one week)
# and (2) the number of past time points used as model input (one year)
future_target_size = int(365/52)  # = 7
past_history_size = int(1*365)

# set batch size
batch_size = 32

# generate train and validation data
mdq.generate_train_and_val_data(future_target_size=future_target_size, past_history_size=past_history_size
                                , batch_size=batch_size)

print('number of training samples: {}'.format(mdq._num_samples))

number of observations in time series: 1820
train split ratio =  0.7
loaded data set length: 1820
mean: 0.0, std: 1.0
debug3: check what buffer_size actually does! and why is data shape always (..., ..., 1) <- 1???
training set shape: x:(909, 365, 1), y:(909, 7, 1)
validation set shape: x:(174, 365, 1), y:(174, 7, 1)
number of training samples: 909
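
The shapes printed above are consistent with a simple sliding-window scheme. The following is a minimal sketch, not the actual `Kind_of_Blue` implementation: the helper `make_windows` is hypothetical, and both the windowing rule and the use of whole-series statistics for standardization are assumptions, chosen because they reproduce the printed shapes.

```python
import numpy as np

def make_windows(series, start, end, history, target):
    """Hypothetical helper: build (input, target) pairs with a sliding window.

    The window anchor i runs from start + history to end; each sample pairs the
    `history` points before i with the `target` points starting at i.
    """
    if end is None:
        end = len(series) - target  # leave room for the last target window
    xs, ys = [], []
    for i in range(start + history, end):
        xs.append(series[i - history:i])
        ys.append(series[i:i + target])
    # add a trailing feature axis of size 1 (a single observed variable)
    return np.array(xs)[..., np.newaxis], np.array(ys)[..., np.newaxis]

n = 5 * 364                                        # 1820 daily observations
series = np.sin(2 * np.pi * 10 * np.arange(n) / n)
series = (series - series.mean()) / series.std()   # standardize: mean 0, std 1

split = round(0.7 * n)                             # 1274, assuming a 70% split
x_train, y_train = make_windows(series, 0, split, history=365, target=7)
x_val, y_val = make_windows(series, split, None, history=365, target=7)

print(x_train.shape, y_train.shape)
print(x_val.shape, y_val.shape)
```

In this sketch the trailing dimension of size 1 is simply the number of features, and the last few training targets reach up to 7 points past the split index.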


<a id='Part3_link'></a>
# 3. Setup models and evaluate for various hyper-parameter choices

<a id='Part3.1_link'></a>
### 3.1 Compile and fit LSTM and RNN model

In [3]:
# generator for configurations to be iterated over

def config_generator():
    
    unit_choices = [2, 8, 16, 64, 128, 256, 512]  # number of units in each neural network layer
    layer_choices = [2]  # total number of layers
    epoch_choices = [10, 30, 50]  # number of epochs the model is trained on
    
    for units in unit_choices:
        for num_layers in layer_choices:
            for epochs in epoch_choices:
                yield units, num_layers, epochs
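
The nested loops above enumerate a full grid. As a minimal sketch, the same sequence can be produced with `itertools.product`; the name `config_generator_flat` is hypothetical and only used for illustration.

```python
from itertools import product

unit_choices = [2, 8, 16, 64, 128, 256, 512]  # number of units in each layer
layer_choices = [2]                           # total number of layers
epoch_choices = [10, 30, 50]                  # number of training epochs

def config_generator_flat():
    """Yield (units, num_layers, epochs) in the same order as the nested loops."""
    yield from product(unit_choices, layer_choices, epoch_choices)

configs = list(config_generator_flat())
print(len(configs))   # 7 * 1 * 3 = 21 configurations
print(configs[:4])
```

Counting the configurations up front gives a rough idea of how long the full sweep will take before it is started.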


In [None]:
# the sweep over the model parameter configurations can be run for both the LSTM and the RNN model;
# only the RNN model is run here (switch to the commented-out list to run both)
# model_types = ['LSTM', 'RNN']
model_types = ['RNN']

# set number of steps per epoch
num_samples = mdq._num_samples
steps_per_epoch = int(num_samples/future_target_size)
validation_steps = int(steps_per_epoch/2)

# initialize results dictionary
res = {'model_type': [], 'epochs': [], 'num_layers': [], 'units': [], 'val_mse': []
       , 'mse': [], 'total_training_time': []}

for units, num_layers, epochs in config_generator():
    for model_type in model_types:

        print('currently running {} model'.format(model_type))

        # compile model
        mdq.compile_model(units=units, num_layers=num_layers, model_type=model_type)

        # fit model
        mdq.fit_model(epochs=epochs, steps_per_epoch=steps_per_epoch
                      ,validation_steps=validation_steps, model_type=model_type)
        
        # get errors
        history = mdq._histories[model_type]
        val_mse = history.history['val_mse'][-1]
        mse = history.history['mse'][-1]
        
        # get total training time
        total_training_time = sum(mdq._time_callbacks[model_type].times)
        
        # append results to results dictionary
        res['model_type'].append(model_type)
        res['epochs'].append(epochs)
        res['num_layers'].append(num_layers)
        res['units'].append(units)
        res['val_mse'].append(val_mse)
        res['mse'].append(mse)
        res['total_training_time'].append(total_training_time)
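
As a quick arithmetic check of the step settings used in this cell, the sketch below recomputes them from the values reported earlier (909 training samples, a forecast window of 7, a batch size of 32). The comparison with the batch size is an added illustration, and it assumes that the input pipeline repeats the training data indefinitely, which is not shown in this notebook.

```python
import math

num_samples = 909         # training windows reported in section 2.1
future_target_size = 7
batch_size = 32

steps_per_epoch = int(num_samples / future_target_size)  # as computed in the notebook
validation_steps = int(steps_per_epoch / 2)

# for comparison: number of batches needed to see every training window once
batches_per_pass = math.ceil(num_samples / batch_size)

print(steps_per_epoch, validation_steps, batches_per_pass)

# samples drawn per epoch relative to the size of the training set
print(round(steps_per_epoch * batch_size / num_samples, 2))
```

Under that assumption, one "epoch" of 129 steps draws roughly 4.5 times as many samples as there are training windows, which is worth keeping in mind when comparing epoch counts across setups.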


currently running RNN model
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
currently running RNN model
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
currently running RNN model
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
currently running RNN model
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
currently running RNN model
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
currently running RNN model
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
 11/129 [=>............................] - ETA: 6s - loss: 0.1749 - mse: 0.1749

<a id='Part4_link'></a>
# 4. Visualize results

In [None]:
# transform dictionary to dataframe
df_res = pd.DataFrame(res)

# store dataframe as csv locally
# df_res.to_csv('../data/02_results_run4.csv')

In [None]:
# visualize results; use bubble plots to indicate magnitude of mean-square error for specific configuration comparing
# RNN to LSTM results

x_label = 'epochs'
y_label = 'units'
z_label = 'mse'
# condition_label = 'num_layers'
# condition_vals = list(set(df_res[condition_label]))

# condition_LSTM = (df_res['model_type']=='LSTM')
# condition_RNN = (df_res['model_type']=='RNN')

for model_type in df_res['model_type'].unique():  # only plot model types that were actually run
    condition_1 = (df_res['model_type'] == model_type)
    
    x = df_res[condition_1][x_label].values
    y = df_res[condition_1][y_label].values
    
    z = df_res[condition_1][z_label].values
    plt.scatter(x, y, s=z*10000, alpha=0.6, c="red", linewidth=0.0)
        
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.title('mean-square training error: {} model'.format(model_type))
    plt.show()

In [None]:
df_res[df_res['model_type']=='RNN']
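
Since the best-performing configuration is carried over to the next notebook, one possible way to pick it from a results table is sketched below. The frame `df_example` is a hypothetical stand-in with the same columns as `res`; its numbers are invented placeholders, not results from this run, and selecting by lowest `val_mse` is an assumed criterion.

```python
import pandas as pd

# placeholder results with the same columns as `res`; values are invented for illustration
df_example = pd.DataFrame({
    'model_type': ['RNN', 'RNN', 'RNN'],
    'epochs': [10, 30, 50],
    'num_layers': [2, 2, 2],
    'units': [8, 8, 8],
    'val_mse': [0.30, 0.10, 0.20],
    'mse': [0.25, 0.12, 0.08],
    'total_training_time': [1.0, 3.0, 5.0],
})

# for each model type, keep the row with the smallest validation error
best_idx = df_example.groupby('model_type')['val_mse'].idxmin()
best = df_example.loc[best_idx]
print(best[['model_type', 'units', 'num_layers', 'epochs', 'val_mse']])
```

Ranking by validation error rather than training error avoids favoring configurations that merely fit the training windows more closely.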