## Data Preparation: Time series data for CNN modeling

Convolutional neural networks (CNNs) were originally developed for image processing, but they can also be used for time series forecasting once the time series data is converted into arrays of fixed-length sequences that a CNN can consume.

The following two functions are developed:
1. Input time series to sequence: converts univariate or multivariate time series data to CNN input sequences. Both the training and test datasets need to be converted, before fitting the model and before predicting, respectively.

2. Output time series to sequence: converts single or multiple output time series to arrays of target values, one per input sequence.
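
To make the array layout concrete, here is a minimal sketch, independent of the functions defined later in this notebook, that windows a short series with plain NumPy. The helper name `make_windows` and the toy values are illustrative assumptions; the pairing rule (each window is paired with the value at its last time step) follows the outputs printed later in this notebook.

```python
import numpy as np

def make_windows(series, sequence_size):
    """Slice a 1-D series into overlapping windows (hypothetical helper)."""
    series = np.asarray(series)
    n_windows = len(series) - sequence_size + 1
    # Each window holds `sequence_size` consecutive values.
    X = np.stack([series[i:i + sequence_size] for i in range(n_windows)])
    # Conv1D expects (samples, time steps, features), so add a feature axis.
    X = X[..., np.newaxis]
    # The target is the value at the last time step of each window.
    y = series[sequence_size - 1:].reshape(-1, 1)
    return X, y

X, y = make_windows([0, 10, 20, 30, 40], sequence_size=3)
print(X.shape, y.shape)  # (3, 3, 1) (3, 1)
```

A series of length `n` yields `n - sequence_size + 1` windows, which is why five values with a window of three produce three samples.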

The developed functions are demonstrated for three cases:
1. Univariate time series
2. Multivariate time series
3. Multioutput time series

In this notebook, the focus is on data preparation. Therefore, the model parameters are kept the same for all of the examples above; tuning hyperparameters is not worthwhile for sample data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow

In [118]:
# Import everything from the same Keras package to avoid mixing namespaces
from keras.models import Sequential
from keras.layers import Conv1D, Dense, Flatten, MaxPooling1D

In [3]:
def input_df_to_sequence(input_df, input_columns, sequence_size):
    '''Convert an input dataframe to an array of fixed-size sequences.

    Parameters
    ----------
    input_df: dataframe
        A dataframe of one or more time series, with a default integer index
    input_columns: list
        A list of columns to convert into sequences
    sequence_size: int
        Number of consecutive time steps in each sequence

    Returns
    -------
    input_array: array
        An array of shape (samples, sequence_size, number of input columns)
    '''

    input_layer = []
    for index in input_df.index[0:len(input_df)-sequence_size+1]:
        input_layer.append(input_df.loc[index:index+sequence_size-1, input_columns].values.tolist())
    input_array = np.array(input_layer)
    return input_array

def output_df_to_sequence(output_df, output_columns, sequence_size):
    '''Convert an output dataframe to an array of target values.

    Parameters
    ----------
    output_df: dataframe
        A dataframe of one or more time series, with a default integer index
    output_columns: list
        A list of output columns
    sequence_size: int
        Number of time steps in each input sequence

    Returns
    -------
    output_array: array
        An array holding, for each input sequence, the output value(s) at its last time step
    '''

    output_layer = []

    if(len(output_columns)==1):  #for single output series
        for index in output_df.index[0:len(output_df)-sequence_size+1]:
            output_layer.append(output_df.loc[index+sequence_size-1, output_columns].values)
        
    else: # for multiple output series
        for index in output_df.index[0:len(output_df)-sequence_size+1]:
            output_layer.append(output_df.loc[index+sequence_size-1, output_columns].values.tolist())

    output_array = np.array(output_layer)
    return output_array
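
The two functions above select rows with `.loc` and integer arithmetic on the index labels, so they rely on a default integer index. As a hedged alternative, the sketch below windows by position using NumPy's `sliding_window_view`, which also works with a datetime index. The helper names `df_to_windows` and `df_to_targets` and the demo dataframe are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def df_to_windows(df, columns, sequence_size):
    """Position-based windowing (hypothetical alternative helper)."""
    values = df[columns].to_numpy()  # (n_rows, n_features)
    # sliding_window_view appends the window axis last:
    # (n_windows, n_features, sequence_size)
    windows = sliding_window_view(values, sequence_size, axis=0)
    # Reorder to (n_windows, sequence_size, n_features) for Conv1D.
    return windows.transpose(0, 2, 1).copy()

def df_to_targets(df, columns, sequence_size):
    """Target = row at the last time step of each window."""
    return df[columns].to_numpy()[sequence_size - 1:]

# Demo data with a datetime index (illustrative values only).
idx = pd.date_range("2024-01-01", periods=5, freq="D")
demo = pd.DataFrame({"x1": [0, 10, 20, 30, 40],
                     "x2": [5, 15, 25, 35, 45]}, index=idx)

X_demo = df_to_windows(demo, ["x1", "x2"], sequence_size=3)
y_demo = df_to_targets(demo, ["x1"], sequence_size=3)
print(X_demo.shape, y_demo.shape)  # (3, 3, 2) (3, 1)
```

The `.copy()` turns the read-only strided view into an ordinary array, which is the safer choice before handing the data to a model.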


**Train and test dataset**

A dataframe is built that can serve as univariate or multivariate time series data by choosing one column (e.g., x1) or several columns (both x1 and x2).

1. Univariate time series: Input and output dataframes are the same.

     $y(t) = X_1 (t)$

2. Multivariate time series: 

    $y(t) = X_1 (t) + X_2 (t)$

3. Multiple output series: 

    $y_1(t) = X_1 (t) + X_2 (t)$

    $y_2(t) = X_1 (t) + 2*X_2 (t)$

In [119]:
X1_train = np.arange(0, 100, 10).reshape(-1,1)
X2_train = np.arange(5, 105, 10).reshape(-1,1)
df_train = pd.DataFrame(np.hstack((X1_train, X2_train)),
                        columns= ['x1', 'x2'])
df_train.head()

| | x1 | x2 |
|---|---|---|
| 0 | 0 | 5 |
| 1 | 10 | 15 |
| 2 | 20 | 25 |
| 3 | 30 | 35 |
| 4 | 40 | 45 |


In [120]:
X1_test =  np.arange(105, 205, 10).reshape(-1, 1)
X2_test = np.arange(115, 215, 10).reshape(-1,1)

# print(X1_test.shape, X2_test.shape)
df_test = pd.DataFrame(np.hstack((X1_test, X2_test)), columns = ['x1', 'x2'])
df_test.head()

| | x1 | x2 |
|---|---|---|
| 0 | 105 | 115 |
| 1 | 115 | 125 |
| 2 | 125 | 135 |
| 3 | 135 | 145 |
| 4 | 145 | 155 |


### Univariate time series

Both the input and output time series are taken from column $x_1$.

In [121]:
input_columns = ['x1']
X_train = df_train[input_columns]

output_columns = ['x1']
y_train = df_train[output_columns]

sequence_size = 3
X_train = input_df_to_sequence(X_train, input_columns = input_columns, sequence_size = sequence_size)
y_train = output_df_to_sequence(y_train, output_columns = output_columns, sequence_size= sequence_size)
for i in [0,1]:
    print(X_train[i], y_train[i])

[[ 0]
 [10]
 [20]] [20]
[[10]
 [20]
 [30]] [30]


In [105]:

model = Sequential()
model.add(Conv1D(filters=64, kernel_size=2, activation='relu', 
                #  input_shape=(n_steps, n_features)
                 ))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# fit model
model.fit(X_train, y_train, epochs=1000, verbose=0)


<keras.src.callbacks.history.History at 0x1df4356bb10>
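
The layer stack above limits how small `sequence_size` can be. The sketch below traces the sequence length through the convolution and pooling layers using plain arithmetic. It assumes the Keras defaults for these layers (`padding='valid'`, a convolution stride of 1, and a pooling stride equal to `pool_size`); the helper names are hypothetical.

```python
def conv1d_length(steps, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (steps - kernel_size) // stride + 1

def maxpool1d_length(steps, pool_size, stride=None):
    """Output length of 1-D max pooling with 'valid' padding."""
    stride = pool_size if stride is None else stride
    return (steps - pool_size) // stride + 1

sequence_size, filters = 3, 64
after_conv = conv1d_length(sequence_size, kernel_size=2)  # 3 -> 2 steps
after_pool = maxpool1d_length(after_conv, pool_size=2)    # 2 -> 1 step
flattened = after_pool * filters                          # features fed to Dense(50)
print(after_conv, after_pool, flattened)  # 2 1 64

# With sequence_size = 2 the pooled length would not be positive,
# so this architecture needs at least three time steps per sequence.
too_short = maxpool1d_length(conv1d_length(2, kernel_size=2), pool_size=2)
print(too_short)  # 0
```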

In [117]:
X_test = input_df_to_sequence(df_test, 
                              input_columns=input_columns,
                               sequence_size= sequence_size )
# X_test
y_predict = model.predict(X_test)
print('predicted', 'actual')
for i in range(0, 5, 1):
    print("{:0.2f}".format(y_predict[i][0]), df_test.loc[i+2, 'x1'])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 84ms/step
predicted actual
127.02 125
137.61 135
148.18 145
158.75 155
169.32 165
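
A quick way to summarize the comparison above is to compute error metrics over the printed pairs. The sketch below copies the five predicted and actual values from the output above and computes the mean absolute error and mean absolute percentage error; the choice of metrics is an assumption, since the notebook itself only prints the values.

```python
import numpy as np

# Values copied from the printed output above.
predicted = np.array([127.02, 137.61, 148.18, 158.75, 169.32])
actual = np.array([125, 135, 145, 155, 165])

errors = predicted - actual
mae = np.mean(np.abs(errors))                  # mean absolute error
mape = np.mean(np.abs(errors) / actual) * 100  # mean absolute percentage error
print(f"MAE: {mae:.2f}, MAPE: {mape:.2f}%")

# The error grows with each step, which is consistent with the test
# inputs lying outside the range of values seen during training.
print(np.all(np.diff(errors) > 0))
```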


### Multivariate Time Series

A forecast can depend on two or more input time series. An example in which the output is the sum of two input series is demonstrated below:

$y(t) = X_1 (t) + X_2 (t)$

In [122]:
mv_train = df_train.copy()  # multivariate training data
mv_train.loc[:, 'y'] = mv_train['x1'] + mv_train['x2']  # Creating the output column 'y'
mv_train.head()

| | x1 | x2 | y |
|---|---|---|---|
| 0 | 0 | 5 | 5 |
| 1 | 10 | 15 | 25 |
| 2 | 20 | 25 | 45 |
| 3 | 30 | 35 | 65 |
| 4 | 40 | 45 | 85 |


In [123]:
input_columns, output_columns = ['x1', 'x2'], ['y']
X_train = mv_train[input_columns]
y_train = mv_train[output_columns]

# Converting input and output time series to sequences
sequence_size = 3
X_train = input_df_to_sequence(X_train, input_columns = input_columns, sequence_size = sequence_size)
y_train = output_df_to_sequence(y_train, output_columns = output_columns, sequence_size= sequence_size)
for i in [0,1]:
    print(X_train[i], y_train[i])

[[ 0  5]
 [10 15]
 [20 25]] [45]
[[10 15]
 [20 25]
 [30 35]] [65]


In [16]:
model = Sequential()
model.add(Conv1D(filters=64, kernel_size=2, activation='relu', 
                #  input_shape=(n_steps, n_features)
                 ))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# fit model
model.fit(X_train, y_train, epochs=1000, verbose=0)

<keras.src.callbacks.history.History at 0x1df39ee6f50>

In [61]:
X_test = input_df_to_sequence(df_test, 
                              input_columns=input_columns,
                               sequence_size= sequence_size )
# X_test
y_predict = model.predict(X_test)
print('Predicted', 'Actual')
for i in range(0, 5, 1):
    predicted = y_predict[i][0]
    actual = df_test.loc[i+2].sum(axis =0)
    print ("{:0.2f}".format(predicted), actual)
# y_predict

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
Predicted Actual
263.05 260
283.68 280
304.53 300
325.38 320
346.22 340
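
The comparison loop above looks up the actual value at row `i+2`, where 2 is `sequence_size - 1`: the prediction for window `i` corresponds to the last row of that window. The sketch below derives that offset from `sequence_size` and builds the full vector of actual values in one step. It rebuilds the test dataframe so that it runs on its own; the variable names `offset` and `actual` are illustrative.

```python
import numpy as np
import pandas as pd

sequence_size = 3
# Same test data as defined earlier in this notebook.
df_test = pd.DataFrame({"x1": np.arange(105, 205, 10),
                        "x2": np.arange(115, 215, 10)})

# The last time step of window i sits at row i + sequence_size - 1.
offset = sequence_size - 1
actual = (df_test["x1"] + df_test["x2"]).to_numpy()[offset:]

print(len(actual))  # 8, one value per test window
print(actual[:5])
```

Deriving the offset this way keeps the comparison correct if `sequence_size` is changed later.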


### Multiple Output Series

It is also possible to predict multiple parallel time series, in which case a value must be predicted for each of them.

An example is illustrated below in which both the sum and a weighted sum of two time series are modelled and predicted.

$y_1 (t) = X_1 (t) + X_2 (t)$

$y_2 (t) = X_1 (t) + 2*X_2 (t)$


In [124]:
mo_train = df_train.copy()
mo_train.loc[:, 'y1'] = mo_train['x1'] + mo_train['x2']
mo_train.loc[:, 'y2'] = mo_train['x1'] + 2*mo_train['x2']
mo_train.head()

| | x1 | x2 | y1 | y2 |
|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 10 |
| 1 | 10 | 15 | 25 | 40 |
| 2 | 20 | 25 | 45 | 70 |
| 3 | 30 | 35 | 65 | 100 |
| 4 | 40 | 45 | 85 | 130 |


In [125]:
input_columns, output_columns = ['x1', 'x2'], ['y1', 'y2']
X_train = mo_train[input_columns]
y_train = mo_train[output_columns]

# Converting input and output time series to sequences
sequence_size = 3
X_train = input_df_to_sequence(X_train, input_columns = input_columns, sequence_size = sequence_size)
y_train = output_df_to_sequence(y_train, output_columns = output_columns, sequence_size= sequence_size)
for i in [0,1]:
    print(X_train[i], y_train[i])

[[ 0  5]
 [10 15]
 [20 25]] [45 70]
[[10 15]
 [20 25]
 [30 35]] [ 65 100]


In [82]:
model = Sequential()
model.add(Conv1D(filters=64, kernel_size=2, activation='relu', 
                #  input_shape=(n_steps, n_features)
                 ))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(50, activation='relu'))
model.add(Dense(2))
model.compile(optimizer='adam', loss='mse')
# fit model
model.fit(X_train, y_train, epochs=1000, verbose=0)

<keras.src.callbacks.history.History at 0x1df40c8ad10>

In [103]:
X_test = input_df_to_sequence(df_test, 
                              input_columns=input_columns,
                               sequence_size= sequence_size )
# X_test
y_predict = model.predict(X_test)
y_predict
# print('Predicted', 'Actual')
for i in range(0, 5, 1):
    predicted = y_predict[i]
    actual = [(df_test.loc[i+2,'x1']+df_test.loc[i+2, 'x2']).item(), (df_test.loc[i+2, 'x1']+2*df_test.loc[i+2, 'x2']).item()]

    print ((predicted), actual)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step
[262.61313 399.00787] [260, 395]
[283.3601  430.66052] [280, 425]
[304.1071  462.31326] [300, 455]
[324.8593  493.97247] [320, 485]
[345.62524 525.64856] [340, 515]
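
With two outputs, it can help to look at the error of each output column separately. The sketch below copies the predicted and actual pairs from the output above and averages the absolute error along the sample axis; using the mean absolute error here is an assumption, not something the notebook computes.

```python
import numpy as np

# Rows are samples, columns are (y1, y2); values copied from the output above.
predicted = np.array([[262.61313, 399.00787],
                      [283.3601, 430.66052],
                      [304.1071, 462.31326],
                      [324.8593, 493.97247],
                      [345.62524, 525.64856]])
actual = np.array([[260, 395],
                   [280, 425],
                   [300, 455],
                   [320, 485],
                   [340, 515]])

# axis=0 averages over samples, leaving one error per output column.
mae_per_output = np.mean(np.abs(predicted - actual), axis=0)
for name, value in zip(["y1", "y2"], mae_per_output):
    print(f"{name}: MAE = {value:.2f}")
```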


The developed functions for converting input and output time series data work for all three types of series demonstrated here (univariate, multivariate, and multiple output).