TODO : 
- Find multivariate real world data 
- Create exercise and solution for it 

# 05 Data Preparation for Deep Learning
Before any time series analysis can be done, the data must first be transformed into a suitable format. This lab guides you through the basics of transforming raw time series data into a structure suitable for a supervised learning task, and then into the 3-dimensional structure that PyTorch convolutional neural networks (CNN) and long short-term memory (LSTM) networks expect as input. At the end of this lab, you will be able to:

1. transform a time series dataset into x-features and y-labels in a supervised learning format, and
2. transform it into a three-dimensional structure.

But first let us import some necessary libraries for this lab.

In [156]:
# importing required libraries or modules for this lab
import numpy as np 
import pandas as pd
import torch

# Transforming Time Series Data for Supervised Learning Task

The function below is an example of how to transform a univariate time series into a structure suitable for supervised learning.

Suppose we have a univariate time series. It has a 1-dimensional structure, so it cannot be used for supervised learning directly. Why? Because there is no clear distinction between features and labels.

# Univariate Data Preparation

## Univariate for Single Step Forecasting

In [157]:
# Example of time series
univariate_series = np.array([1,2,3,4,5,6,7,8,9,10])
print(univariate_series.shape)

(10,)


In [158]:
# split a univariate sequence into samples
def univariate_single_step(sequence, window_size):
    x, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + window_size
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        x.append(seq_x)
        y.append(seq_y)
    return np.array(x), np.array(y)

In [159]:
# calling the function to transform time series into features and labels
x_feature, y_label = univariate_single_step(univariate_series,window_size = 3)
print(f"Features are now in the shape of {x_feature.shape} while labels are now in the shape of {y_label.shape}\n")
print("x-feature\n"+str(x_feature.shape[0])+" = total number of data ")
print(str(x_feature.shape[1])+" = window size\n")
print("y-label\n"+str(y_label.shape[0])+" = number of data\n")

# printing out each sample
for i in range(len(x_feature)):
    print(x_feature[i], y_label[i])

Features are now in the shape of (7, 3) while labels are now in the shape of (7,)

x-feature
7 = total number of data 
3 = window size

y-label
7 = number of data

[1 2 3] 4
[2 3 4] 5
[3 4 5] 6
[4 5 6] 7
[5 6 7] 8
[6 7 8] 9
[7 8 9] 10
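
As a cross-check, the same windows can be built without an explicit loop. The sketch below assumes NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available. The helper name `univariate_single_step_vectorized` is hypothetical and not part of the lab.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def univariate_single_step_vectorized(sequence, window_size):
    """Hypothetical loop-free equivalent of univariate_single_step."""
    sequence = np.asarray(sequence)
    # every window of length window_size, shape (len - window_size + 1, window_size)
    windows = sliding_window_view(sequence, window_size)
    # the last window has no following value to use as a label, so drop it
    x = windows[:-1]
    # the label of each window is the value right after it
    y = sequence[window_size:]
    return x, y

series = np.arange(1, 11)
x_vec, y_vec = univariate_single_step_vectorized(series, window_size=3)
print(x_vec.shape, y_vec.shape)
```

The returned features are a read-only view on the original array, so call `x.copy()` first if they need to be modified in place.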


## Univariate for Multi-Step Forecasting

In [160]:
def univariate_multi_step(sequence,window_size,n_multistep):
    x, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + window_size
        out_ix = end_ix+n_multistep
        # check if we are beyond the sequence
        if out_ix > len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix:out_ix]
        x.append(seq_x)
        y.append(seq_y)
    return np.array(x), np.array(y)

In [161]:
x_feature, y_label = univariate_multi_step(univariate_series,window_size = 3,n_multistep = 2)
print(f"Features are now in the shape of {x_feature.shape} while labels are now in the shape of {y_label.shape}\n")
print("x-feature\n"+str(x_feature.shape[0])+" = total number of data ")
print(str(x_feature.shape[1])+" = window size \n")
print("y-label\n"+str(y_label.shape[0])+" = number of data")
print(str(y_label.shape[1])+" = number of step\n")
# printing out each sample
for i in range(len(x_feature)):
    print(x_feature[i], y_label[i])

Features are now in the shape of (6, 3) while labels are now in the shape of (6, 2)

x-feature
6 = total number of data 
3 = window size 

y-label
6 = number of data
2 = number of step

[1 2 3] [4 5]
[2 3 4] [5 6]
[3 4 5] [6 7]
[4 5 6] [7 8]
[5 6 7] [8 9]
[6 7 8] [ 9 10]
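
The number of samples produced by a sliding window can be predicted before running the split, which is a handy sanity check on the shapes printed above. The sketch below is an illustration only. `expected_samples` is a hypothetical helper, and it assumes the labels start immediately after the input window, as in the two univariate functions.

```python
def expected_samples(series_length, window_size, n_steps=1):
    """Hypothetical helper: number of (window, label) pairs a series yields."""
    # each sample consumes window_size inputs followed by n_steps labels
    n = series_length - window_size - n_steps + 1
    # a series that is too short yields no samples at all
    return max(n, 0)

print(expected_samples(10, window_size=3))             # single step on the toy series
print(expected_samples(10, window_size=3, n_steps=2))  # multi step on the toy series
```

For the toy series of 10 values this gives 7 and 6 samples, matching the shapes of the features printed by the two functions.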


## Exercise for Univariate (Solution)
Apply the single-step and multi-step univariate transformations to the AirPassengers data.


In [162]:
airpassengers = pd.read_csv('../datasets/decomposition/AirPassengers.csv')
airpassengers_ts = pd.Series(airpassengers['#Passengers'].values, 
                            index = pd.date_range('1949-01', periods = len(airpassengers), freq='M'))
airpassengers_ts

1949-01-31    112
1949-02-28    118
1949-03-31    132
1949-04-30    129
1949-05-31    121
             ... 
1960-08-31    606
1960-09-30    508
1960-10-31    461
1960-11-30    390
1960-12-31    432
Freq: M, Length: 144, dtype: int64

## Univariate Single-Step Forecasting 

In [164]:
# pass the underlying NumPy array so that sequence[end_ix] is a positional lookup
x_feature, y_label = univariate_single_step(airpassengers_ts.values, window_size = 5)
print(f"Features are now in the shape of {x_feature.shape} while labels are now in the shape of {y_label.shape}")
#print out sample
for i in range(10):
    print(x_feature[i], y_label[i])

Features are now in the shape of (139, 5) while labels are now in the shape of (139,)
[112 118 132 129 121] 135
[118 132 129 121 135] 148
[132 129 121 135 148] 148
[129 121 135 148 148] 136
[121 135 148 148 136] 119
[135 148 148 136 119] 104
[148 148 136 119 104] 118
[148 136 119 104 118] 115
[136 119 104 118 115] 126
[119 104 118 115 126] 141


## Univariate Multi-Step Forecasting

In [166]:
x_feature, y_label = univariate_multi_step(airpassengers_ts, window_size = 5, n_multistep = 2)
print(f"Features are now in the shape of {x_feature.shape} while labels are now in the shape of {y_label.shape}")

# printing out sample
for i in range(10):
    print(x_feature[i], y_label[i])

Features are now in the shape of (138, 5) while labels are now in the shape of (138, 2)
[112 118 132 129 121] [135 148]
[118 132 129 121 135] [148 148]
[132 129 121 135 148] [148 136]
[129 121 135 148 148] [136 119]
[121 135 148 148 136] [119 104]
[135 148 148 136 119] [104 118]
[148 148 136 119 104] [118 115]
[148 136 119 104 118] [115 126]
[136 119 104 118 115] [126 141]
[119 104 118 115 126] [141 135]


# Multivariate Data Preparation

## Single Step Forecast

## Multivariate Input, Univariate Output for Single Step Forecast

In [167]:
series1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
series2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95])
outputseries = np.array([25,45,65,85,105,125,145,165,185])

series1 = series1.reshape(len(series1),1)
series2 = series2.reshape(len(series2),1)
outputseries = outputseries.reshape(len(outputseries),1)
#horizontally stack column
multivariate_dataset = np.hstack((series1,series2,outputseries))
multivariate_dataset

array([[ 10,  15,  25],
       [ 20,  25,  45],
       [ 30,  35,  65],
       [ 40,  45,  85],
       [ 50,  55, 105],
       [ 60,  65, 125],
       [ 70,  75, 145],
       [ 80,  85, 165],
       [ 90,  95, 185]])

In [168]:
def multivariate_univariate_single_step(sequence,window_size):
    x, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + window_size
        # check if we are beyond the sequence
        if end_ix > len(sequence):
            break
        # gather input and output parts of the pattern:
        # inputs are all columns except the last, over the whole window;
        # the label is the last column at the final time step of the window
        seq_x, seq_y = sequence[i:end_ix,:-1], sequence[end_ix-1,-1]
        x.append(seq_x)
        y.append(seq_y)
    return np.array(x), np.array(y)

In [169]:
x_feature, y_label = multivariate_univariate_single_step(multivariate_dataset, window_size = 2)
print(f"Features are now in the shape of {x_feature.shape} while labels are now in the shape of {y_label.shape}\n")
print("x-feature\n"+str(x_feature.shape[0])+" = total number of data ")
print(str(x_feature.shape[1])+" = window size ")
print(str(x_feature.shape[2])+" = number of time series\n")
print("y-label\n"+str(y_label.shape[0])+" = number of data\n")


# printing out sample
for i in range(5):
    print(x_feature[i], y_label[i])

Features are now in the shape of (8, 2, 2) while labels are now in the shape of (8,)

x-feature
8 = total number of data 
2 = window size 
2 = number of time series

y-label
8 = number of data

[[10 15]
 [20 25]] 45
[[20 25]
 [30 35]] 65
[[30 35]
 [40 45]] 85
[[40 45]
 [50 55]] 105
[[50 55]
 [60 65]] 125
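
Note how each label sits on the same row as the last time step of its window, rather than one step ahead. In this toy dataset the output column is the sum of the two input columns, so the alignment can be checked directly. The sketch below rebuilds the dataset and windows it inline so that it runs on its own; the variable names are illustrative only.

```python
import numpy as np

series1 = np.arange(10, 100, 10)          # 10, 20, ..., 90
series2 = series1 + 5                     # 15, 25, ..., 95
dataset = np.column_stack((series1, series2, series1 + series2))

window_size = 2
x, y = [], []
for i in range(len(dataset) - window_size + 1):
    end_ix = i + window_size
    x.append(dataset[i:end_ix, :-1])      # input columns over the window
    y.append(dataset[end_ix - 1, -1])     # output column at the window's last row
x, y = np.array(x), np.array(y)

# each label equals the sum of the inputs in the final row of its window
aligned = np.array_equal(y, x[:, -1, :].sum(axis=1))
print(x.shape, y.shape, aligned)
```

If the task is instead to predict the output one step ahead of the window, the label index would need to move to `end_ix`, and the loop would stop one sample earlier.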


## Multivariate Input, Multivariate Output for Single Step Forecast


In [170]:
series1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
series2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95])
series3 = np.array([25,45,65,85,105,125,145,165,185])

series1 = series1.reshape(len(series1),1)
series2 = series2.reshape(len(series2),1)
series3 = series3.reshape(len(series3),1)
#horizontally stack column
multivariate_output_dataset = np.hstack((series1,series2,series3))
multivariate_output_dataset

array([[ 10,  15,  25],
       [ 20,  25,  45],
       [ 30,  35,  65],
       [ 40,  45,  85],
       [ 50,  55, 105],
       [ 60,  65, 125],
       [ 70,  75, 145],
       [ 80,  85, 165],
       [ 90,  95, 185]])

In [171]:
def multivariate_multivariate_single_step(sequence,window_size):
    x, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + window_size
        # check if we are beyond the sequence
        if end_ix >= len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix,:], sequence[end_ix,:]
        x.append(seq_x)
        y.append(seq_y)
    return np.array(x), np.array(y)

In [172]:
x_feature, y_label = multivariate_multivariate_single_step(multivariate_output_dataset, window_size = 4)
print(f"Features are now in the shape of {x_feature.shape} while labels are now in the shape of {y_label.shape}\n")
print("x-feature\n"+str(x_feature.shape[0])+" = number of data ")
print(str(x_feature.shape[1])+" = window size ")
print(str(x_feature.shape[2])+" = number of time series\n")
print("y-label\n"+str(y_label.shape[0])+" = number of data")
print(str(y_label.shape[1])+" = number of time series\n")

# printing out sample
for i in range(x_feature.shape[0]):
    print(x_feature[i], y_label[i])

Features are now in the shape of (5, 4, 3) while labels are now in the shape of (5, 3)

x-feature
5 = number of data 
4 = window size 
3 = number of time series

y-label
5 = number of data
3 = number of time series

[[10 15 25]
 [20 25 45]
 [30 35 65]
 [40 45 85]] [ 50  55 105]
[[ 20  25  45]
 [ 30  35  65]
 [ 40  45  85]
 [ 50  55 105]] [ 60  65 125]
[[ 30  35  65]
 [ 40  45  85]
 [ 50  55 105]
 [ 60  65 125]] [ 70  75 145]
[[ 40  45  85]
 [ 50  55 105]
 [ 60  65 125]
 [ 70  75 145]] [ 80  85 165]
[[ 50  55 105]
 [ 60  65 125]
 [ 70  75 145]
 [ 80  85 165]] [ 90  95 185]


## Multi-Step Forecast


## Multivariate Input, Univariate Output for Multi Step Forecast

In [173]:
series1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
series2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95])
outputseries = np.array([25,45,65,85,105,125,145,165,185])

series1 = series1.reshape(len(series1),1)
series2 = series2.reshape(len(series2),1)
outputseries = outputseries.reshape(len(outputseries),1)
#horizontally stack column
multivariate_dataset2 = np.hstack((series1,series2,outputseries))
multivariate_dataset2

array([[ 10,  15,  25],
       [ 20,  25,  45],
       [ 30,  35,  65],
       [ 40,  45,  85],
       [ 50,  55, 105],
       [ 60,  65, 125],
       [ 70,  75, 145],
       [ 80,  85, 165],
       [ 90,  95, 185]])

In [174]:
def multivariate_univariate_multi_step(sequence,window_size,n_multistep):
    x, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + window_size
        out_ix = end_ix + n_multistep -1
        # check if we are beyond the sequence
        if out_ix > len(sequence):
            break
        # gather input and output parts of the pattern:
        # the labels start at the final time step of the window (end_ix-1)
        # and cover n_multistep rows of the last column
        seq_x, seq_y = sequence[i:end_ix,:-1], sequence[end_ix-1:out_ix,-1]
        x.append(seq_x)
        y.append(seq_y)
    return np.array(x), np.array(y)

In [175]:
x_feature, y_label = multivariate_univariate_multi_step(multivariate_dataset2, window_size = 3 ,n_multistep = 3)
print(f"Features are now in the shape of {x_feature.shape} while labels are now in the shape of {y_label.shape}\n")
print("x-feature\n"+str(x_feature.shape[0])+" = total number of data ")
print(str(x_feature.shape[1])+" = window size ")
print(str(x_feature.shape[2])+" = number of time series\n")
print("y-label\n"+str(y_label.shape[0])+" =number of data")
print(str(y_label.shape[1])+" =number of step\n")

# printing out sample
for i in range(x_feature.shape[0]):
    print(x_feature[i], y_label[i])

Features are now in the shape of (5, 3, 2) while labels are now in the shape of (5, 3)

x-feature
5 = total number of data 
3 = window size 
2 = number of time series

y-label
5 =number of data
3 =number of step

[[10 15]
 [20 25]
 [30 35]] [ 65  85 105]
[[20 25]
 [30 35]
 [40 45]] [ 85 105 125]
[[30 35]
 [40 45]
 [50 55]] [105 125 145]
[[40 45]
 [50 55]
 [60 65]] [125 145 165]
[[50 55]
 [60 65]
 [70 75]] [145 165 185]


## Multivariate Input, Multivariate Output for Multi Step Forecast


In [176]:
series1 = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
series2 = np.array([15, 25, 35, 45, 55, 65, 75, 85, 95])
series3 = np.array([25,45,65,85,105,125,145,165,185])

series1 = series1.reshape(len(series1),1)
series2 = series2.reshape(len(series2),1)
series3 = series3.reshape(len(series3),1)
#horizontally stack column
multivariate_output_dataset_multi = np.hstack((series1,series2,series3))
multivariate_output_dataset_multi

array([[ 10,  15,  25],
       [ 20,  25,  45],
       [ 30,  35,  65],
       [ 40,  45,  85],
       [ 50,  55, 105],
       [ 60,  65, 125],
       [ 70,  75, 145],
       [ 80,  85, 165],
       [ 90,  95, 185]])

In [177]:
def multivariate_multivariate_multi_step(sequence,window_size,n_multistep):
    x, y = list(),list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + window_size
        out_ix = end_ix + n_multistep
        # check if the output window goes beyond the sequence
        if out_ix > len(sequence):
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix,:], sequence[end_ix:out_ix,:]
        x.append(seq_x)
        y.append(seq_y)
    return np.array(x), np.array(y)

The boundary check uses out_ix rather than end_ix so that every label holds exactly n_multistep rows. Checking only end_ix would let the last samples run past the end of the sequence, producing labels of unequal length that cannot be stacked into a regular array.

In [178]:
x_feature, y_label = multivariate_multivariate_multi_step(multivariate_output_dataset_multi, window_size = 4 ,n_multistep = 3)
print(f"Features are now in the shape of {x_feature.shape} while labels are now in the shape of {y_label.shape}\n")
print("x-feature\n"+str(x_feature.shape[0])+" = total number of data ")
print(str(x_feature.shape[1])+" = window size ")
print(str(x_feature.shape[2])+" = number of time series\n")
print("y-label\n"+str(y_label.shape[0])+" = number of data")
print(str(y_label.shape[1])+" = number of step")
print(str(y_label.shape[2])+" = number of time series\n")

# printing out sample
for i in range(x_feature.shape[0]):
    print(x_feature[i], y_label[i])

Features are now in the shape of (3, 4, 3) while labels are now in the shape of (3, 3, 3)

x-feature
3 = total number of data 
4 = window size 
3 = number of time series

y-label
3 = number of data
3 = number of step
3 = number of time series

[[10 15 25]
 [20 25 45]
 [30 35 65]
 [40 45 85]] [[ 50  55 105]
 [ 60  65 125]
 [ 70  75 145]]
[[ 20  25  45]
 [ 30  35  65]
 [ 40  45  85]
 [ 50  55 105]] [[ 60  65 125]
 [ 70  75 145]
 [ 80  85 165]]
[[ 30  35  65]
 [ 40  45  85]
 [ 50  55 105]
 [ 60  65 125]] [[ 70  75 145]
 [ 80  85 165]
 [ 90  95 185]]


In [179]:
# Easy to view
list(y_label)

[array([[ 50,  55, 105],
        [ 60,  65, 125],
        [ 70,  75, 145]]),
 array([[ 60,  65, 125],
        [ 70,  75, 145],
        [ 80,  85, 165]]),
 array([[ 70,  75, 145],
        [ 80,  85, 165],
        [ 90,  95, 185]])]

"Running the example first prints the shape of the time series, in this case 10 time steps
of observations. Next, the series is split into input and output components for a supervised
learning problem. We can see that for the chosen representation that we have 7 samples for the
input and output and 3 input features. The shape of the output is 7 samples represented as (7,)
indicating that the array is a single column. It could also be represented as a two-dimensional
array with 7 rows and 1 column [7, 1]. Finally, the input and output aspects of each sample
are printed, showing the expected breakdown of the problem."

# Preparing 3-Dimensional Data
Preparing time series data for CNNs and LSTMs requires one additional step beyond transforming
the data into a supervised learning problem: the features must be arranged as a 3-dimensional structure.
When building an LSTM model in PyTorch with batch_first=True, the input is expected in the shape [total number of data, window size, number of time series].

In [180]:
# For example:
class LSTM(torch.nn.Module):

    def __init__(self, n_feature, hidden_dim, num_layers, output_dim):
        super(LSTM, self).__init__()

        # Number of input features (time series) per time step
        self.n_feature = n_feature
        # Hidden dimensions
        self.hidden_dim = hidden_dim

        # Number of hidden layers
        self.num_layers = num_layers

        # Building your LSTM
        # batch_first=True causes input/output tensors to be of shape
        # (batch_dim, seq_dim, feature_dim)
        self.lstm = torch.nn.LSTM(n_feature, hidden_dim, num_layers, batch_first=True)

        # Readout layer
        self.fc = torch.nn.Linear(hidden_dim, output_dim)
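
To make the shape requirement concrete, the sketch below turns univariate features of shape (7, 3) into a 3-dimensional tensor and passes it through an LSTM layer and a 1D convolution. The layer sizes (`hidden_size=8`, `out_channels=4`, `kernel_size=2`) are arbitrary values chosen for illustration. Note that `torch.nn.Conv1d` expects the channel (time series) dimension before the sequence dimension, so the tensor is permuted first.

```python
import numpy as np
import torch

# 7 samples, window size 3, built from the toy series 1..10
series = np.arange(1, 11, dtype=np.float32)
x_feature = np.stack([series[i:i + 3] for i in range(7)])   # shape (7, 3)

# add a trailing dimension for the number of time series (1 for univariate data)
x_3d = torch.from_numpy(x_feature).unsqueeze(-1)            # shape (7, 3, 1)

# LSTM with batch_first=True takes (batch, sequence, feature)
lstm = torch.nn.LSTM(input_size=1, hidden_size=8, num_layers=1, batch_first=True)
lstm_out, (h_n, c_n) = lstm(x_3d)

# Conv1d takes (batch, channels, sequence), so swap the last two dimensions
conv = torch.nn.Conv1d(in_channels=1, out_channels=4, kernel_size=2)
conv_out = conv(x_3d.permute(0, 2, 1))

print(x_3d.shape, lstm_out.shape, conv_out.shape)
```

Multivariate features produced by the functions in this lab already have the (samples, window size, number of time series) layout, so they only need converting to a float tensor before being passed to an LSTM built with `batch_first=True`.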
               

## Example of Series

Source: 
- Deep Learning for Time Series Forecasting, Jason Brownlee 
- https://www.kaggle.com/taronzakaryan/stock-prediction-lstm-using-pytorch
- https://stackoverflow.com/questions/56858924/multivariate-input-lstm-in-pytorch