# Time Series Imputation - Imputation with Vanilla Seq2Seq architecture

In this notebook, we explore how to impute a univariate time series with a vanilla Seq2Seq (encoder-decoder) model. The architecture is simple and only suited for univariate series.

TensorFlow (through its Keras API) is the package used for the Seq2Seq model implementation.

This notebook covers the following approach:

*Data Preparation:*

Partition the time-series into overlapping sequences. For example, if you have a daily time-series of length 365 days and you want to use 30 days to predict the next 10 days, then you can partition the series into overlapping sequences where each sequence consists of 30 days followed by the next 10 days.
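
The partitioning described above can be sketched with plain NumPy. This is a minimal illustration, not the helper used later in this notebook: the function name `make_windows` and the placeholder series are assumptions made for the example.

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Split a 1D array into overlapping (input, target) pairs with stride 1."""
    series = np.asarray(series)
    total = n_in + n_out
    n_windows = len(series) - total + 1
    # Window i uses points i .. i+n_in-1 as input and the next n_out points as target
    inputs = np.stack([series[i:i + n_in] for i in range(n_windows)])
    targets = np.stack([series[i + n_in:i + total] for i in range(n_windows)])
    return inputs, targets

daily = np.arange(365)  # placeholder for one year of daily values
X, y = make_windows(daily, n_in=30, n_out=10)
print(X.shape, y.shape)  # (326, 30) (326, 10)
```

Consecutive windows share all but one point, which is why a 365-day series yields 326 training pairs rather than 9.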

In this case, because we created the missing data ourselves, the target sequence is the original data, whereas the input sequence (the one to be imputed) is the copy in which gaps were artificially created. The data is scaled, and the NaN values are replaced with a mask value. The chosen sequence length is a full week of readings, 24*7, as this is an hourly dataset.

*Training the Model:*

Train the Seq2Seq model using the valid overlapping sequences.
The encoder takes the 7-day input sequence and the decoder tries to reproduce the same period of time.

*Challenges:*

The architecture (like the number of layers, number of neurons, type of RNN cell, etc.) of the Seq2Seq model can significantly affect performance.
The choice of the window size (7 days in the example) and how far into the future you're predicting can also be crucial.
Overfitting can be a concern, especially if the time-series is noisy or if there's not much data available.

### Import the required packages

In [18]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

### Auxiliary functions

In [19]:
# Utility function that yields sequences of shape (seq_length, n_features)
def gen_seq(id_df, seq_length, seq_cols):

    data_matrix = id_df[seq_cols]
    num_elements = data_matrix.shape[0]

    # Slide a window of seq_length rows over the data, one step at a time
    for stop in range(seq_length, num_elements):

        yield data_matrix[stop-seq_length:stop].values.reshape((-1, len(seq_cols)))
        
#Scaler class
class scaler1D:
    
    def fit(self, X):
        self.mean = np.nanmean(np.asarray(X).ravel())
        self.std = np.nanstd(np.asarray(X).ravel())
        return self
        
    def transform(self, X):
        return (X - self.mean)/self.std
    
    def inverse_transform(self, X):
        return (X*self.std) + self.mean
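
As a quick check of the behaviour the scaler relies on, the sketch below repeats the same arithmetic on a toy array. The values are made up for illustration; the point is that `np.nanmean` and `np.nanstd` ignore NaNs when fitting, while the NaNs themselves survive the transform and still have to be replaced by a mask value afterwards.

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 3.0])  # toy series with one gap

mean = np.nanmean(x)  # 2.0, the NaN is ignored
std = np.nanstd(x)    # population std of [1, 2, 3]

scaled = (x - mean) / std
print(np.isnan(scaled))  # [False False  True False]

# Replace the remaining NaN with a sentinel, as done later in the notebook
mask_value = -999.0
masked = np.where(np.isnan(scaled), mask_value, scaled)

# Inverting the transform on the observed points recovers 1, 2, 3
# (up to floating-point error)
restored = masked[masked != mask_value] * std + mean
print(restored)
```

The sentinel is applied after scaling so that it does not distort the mean and standard deviation.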

### Read the data

In [5]:
og_df = pd.read_csv('spot_prices_oil.csv', index_col=[0])
og_df["Date"] = pd.to_datetime(og_df.Date, format="%Y-%m-%d %H:%M:%S")
og_df = og_df.sort_values('Date')

og_df.drop_duplicates('Date', inplace=True)
og_df.set_index('Date', inplace=True)

In [6]:
#read dataset with the long gaps
long_missing = pd.read_csv('missing_long.csv', index_col=[0])
long_missing["Date"] = pd.to_datetime(long_missing.Date, format="%Y-%m-%d %H:%M:%S")
long_missing = long_missing.sort_values('Date')

long_missing.drop_duplicates('Date', inplace=True)
long_missing.set_index('Date', inplace=True)

## Seq2Seq Vanilla architecture

In [31]:
# Define the encoder-decoder model

def create_seq2seq(input_seq_length, output_seq_length, hidden_units):
    input_seq = Input(shape=(input_seq_length, 1))
    encoder = LSTM(hidden_units, return_state=True)
    encoder_outputs, state_h, state_c = encoder(input_seq)

    decoder_input_seq = Input(shape=(output_seq_length, 1))
    decoder_lstm = LSTM(hidden_units, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_input_seq, initial_state=[state_h, state_c])
    decoder_dense = Dense(1)
    decoder_outputs = decoder_dense(decoder_outputs)

    model = Model(inputs=[input_seq, decoder_input_seq], outputs=decoder_outputs)

    return model

## Preparing the data for the model

In [22]:
##Creating sequences to input into the model

sequence_length = 24*7

sequence_input = []
sequence_target = []

for seq in gen_seq(og_df[['BE']], sequence_length, ['BE']):
    sequence_target.append(seq)
    
for seq in gen_seq(long_missing, sequence_length, ['BE']):
    sequence_input.append(seq)
    
sequence_input = np.asarray(sequence_input)
sequence_target = np.asarray(sequence_target)

sequence_input.shape, sequence_target.shape

((34873, 168, 1), (34873, 168, 1))

In [23]:
## Split the data in training and test splits
train_size = 0.8

sequence_input_train = sequence_input[:int(len(sequence_input)*train_size)]
sequence_input_test = sequence_input[int(len(sequence_input)*train_size):]
print(sequence_input_train.shape, sequence_input_test.shape)

sequence_target_train = sequence_target[:int(len(sequence_target)*train_size)]
sequence_target_test = sequence_target[int(len(sequence_target)*train_size):]
print(sequence_target_train.shape, sequence_target_test.shape)

(27898, 168, 1) (6975, 168, 1)
(27898, 168, 1) (6975, 168, 1)


In [24]:
#Scale the sequences and mask the missing values

scaler_target = scaler1D().fit(sequence_input)

sequence_input_train = scaler_target.transform(sequence_input_train)
sequence_input_test = scaler_target.transform(sequence_input_test)

sequence_target_train = scaler_target.transform(sequence_target_train)
sequence_target_test = scaler_target.transform(sequence_target_test)

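# Replace the NaNs with a sentinel value. The test inputs need it too,
# otherwise the NaNs propagate through the LSTM at prediction time.
mask_value = -999.
sequence_input_train[np.isnan(sequence_input_train)] = mask_value
sequence_input_test[np.isnan(sequence_input_test)] = mask_value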

## Train model

In [32]:
# Define input and output sequence lengths and the number of hidden units
input_sequence_length = sequence_length  # Adjust as needed
output_sequence_length = sequence_length  # Adjust as needed
hidden_units = 64

# Create the Seq2Seq model
model = create_seq2seq(input_sequence_length, output_sequence_length, hidden_units)

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

In [35]:
# Early stopping on the training loss; the callback can monitor any other logged metric instead
es = EarlyStopping(patience=10, verbose=1, min_delta=0.001, monitor='loss', mode='auto', restore_best_weights=True)

# Train the model
model.fit([sequence_input_train, sequence_target_train], sequence_target_train, epochs=100, batch_size=32, callbacks=[es])

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 11: early stopping


<keras.src.callbacks.History at 0x7fb687e391b0>

## Imputation

In [36]:
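# At prediction time the true target is unknown, so the input with gaps is fed to both the encoder and the decoder
reconstructions = model.predict([sequence_input_test, sequence_input_test])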
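
The predictions come back as overlapping windows, so one more step is needed to obtain a single imputed series. The sketch below shows one possible way to do it, not necessarily the approach intended by this notebook: average the windows that cover each time step, then use the averaged prediction only where the original value is missing. The helper `windows_to_series` and the toy arrays are hypothetical; with the real data one would pass `reconstructions[..., 0]` and apply `scaler_target.inverse_transform` to return to the original units.

```python
import numpy as np

def windows_to_series(windows, n_points):
    """Average stride-1 overlapping windows back into a 1D series.

    windows has shape (n_windows, seq_length); window i covers
    positions i .. i + seq_length - 1, so n_points should equal
    n_windows + seq_length - 1.
    """
    n_windows, seq_length = windows.shape
    total = np.zeros(n_points)
    count = np.zeros(n_points)
    for i in range(n_windows):
        total[i:i + seq_length] += windows[i]
        count[i:i + seq_length] += 1
    return total / count

# Toy stand-ins (hypothetical values, not results from this notebook)
observed = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])
preds = np.array([[1.1, 2.0, 3.1, 4.1],
                  [2.2, 2.9, 4.0, 5.0],
                  [3.0, 3.9, 5.2, 6.1]])  # 3 windows of length 4

pred_series = windows_to_series(preds, n_points=len(observed))

# Keep every observed value; fill only the gaps with the model output
imputed = np.where(np.isnan(observed), pred_series, observed)
print(imputed)
```

Averaging is only one aggregation choice; taking the prediction from the window in which the gap is most central is another option.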

