## Basic LSTM model on univariate data: given a sequence of global_active_power, predict a sequence of global_active_power (vector output)

In [1]:
# coding: utf-8
# !/usr/bin/env python3
import pandas as pd
pd.set_option("display.max_columns", None)
import numpy as np
import warnings
warnings.filterwarnings("ignore")
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import tensorflow as tf

In [2]:
print(tf.__version__)

1.13.0-rc1


In [3]:
# read dataset, if file is not available run: 1_Exploratory_Analylis.ipynb
df = pd.read_csv("./data/cleaned_household_power_consumption.csv", infer_datetime_format=True, parse_dates=["local_time"],
                index_col=["local_time"], dtype=np.float32)

In [4]:
# Data is already sorted; sort again just to be safe
df.sort_index(inplace=True)

In [5]:
df.head()

local_time,global_active_power,global_reactive_power,voltage,global_intensity,sub_metering_1,sub_metering_2,sub_metering_3,sub_metering_other
2006-12-16 17:24:00,4.216,0.418,234.839996,18.4,0.0,1.0,17.0,52.26667
2006-12-16 17:25:00,5.36,0.436,233.630005,23.0,0.0,1.0,16.0,72.333336
2006-12-16 17:26:00,5.374,0.498,233.289993,23.0,0.0,2.0,17.0,70.566666
2006-12-16 17:27:00,5.388,0.502,233.740005,23.0,0.0,1.0,17.0,71.800003
2006-12-16 17:28:00,3.666,0.528,235.679993,15.8,0.0,1.0,17.0,43.099998


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2075259 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 8 columns):
global_active_power      float32
global_reactive_power    float32
voltage                  float32
global_intensity         float32
sub_metering_1           float32
sub_metering_2           float32
sub_metering_3           float32
sub_metering_other       float32
dtypes: float32(8)
memory usage: 79.2 MB


In [7]:
print("Data Range: ", df.index.min(), " to ", df.index.max())

Data Range:  2006-12-16 17:24:00  to  2010-11-26 21:02:00


### Strategies
1. We will use a sequence of **global_active_power** values to predict **global_active_power** itself, so this is a univariate model.
2. We need a multi-step forecast: a minute-level forecast for 2 days, i.e. 2 * 24 * 60 = 2880 data points. Note that the data does not start at the first minute of its first day, nor does it end at the last minute of its last day.
3. For a new prediction, we will forecast 2 days. For the last window in the data, that is 2010-11-24 21:03:00 to 2010-11-26 21:02:00.
4. The model can take any number of prior minutes as input. We will start with the prior 2 days (2880 minutes) as the input sequence. This can be extended to 3, 4 or 5 days or, counted in minutes, to say 3000, 4000 or 5000 minutes.
5. We will split the data into 3 sets (train, evaluate, test), in time order, with distinct boundaries so that no set sees data from another set (not even as part of an input sequence). The test set is the last 2880 output points plus the number of input points (without this, the sets would overlap), which gives 5760 data points for a 2880-point input. We will evaluate the model on 30 days (43200 minutes), so the evaluation set is the 32 days (46080 minutes) before the test set. The remaining data points are used for training.
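
The window arithmetic in points 2 to 5 can be checked with a few lines. This is a minimal sketch, assuming the last timestamp reported above (2010-11-26 21:02:00); the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Last timestamp in the dataset, as printed by the "Data Range" cell
last_ts = pd.Timestamp("2010-11-26 21:02:00")

n_input = 2 * 24 * 60        # 2 days of minutes used as model input
n_output = 2 * 24 * 60       # 2 days of minutes to forecast
eval_minutes = 30 * 24 * 60  # 30 days of evaluation targets

# First minute of the final 2-day output window
out_start = last_ts - pd.Timedelta(minutes=n_output - 1)

# Each held-out set needs its own input window in front of its targets
test_size = n_input + n_output
eval_size = n_input + eval_minutes

print(out_start)   # 2010-11-24 21:03:00
print(test_size)   # 5760
print(eval_size)   # 46080
```

Giving each held-out set its own leading input window is what keeps the three sets disjoint.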

### Train, Evaluate & Test split of univariate data:
**(I am not creating functions yet; once we are clear and confident about our steps, we can wrap the different building blocks as functions)**

In [8]:
n_input = 2 * 24 * 60
n_output = 2 * 24 * 60
test_index_start = n_input + n_output
# 30 evaluation days
eval_days = 30
eval_index_start = n_input + eval_days * 24 * 60 + test_index_start
values = df.values
train, evaluate, test = values[:-eval_index_start], values[-eval_index_start:-test_index_start], values[-test_index_start:]
print("Training data shape: ", train.shape)
print("Evaluation data shape: ", evaluate.shape)
print("Test data shape: ", test.shape)

Training data shape:  (2023419, 8)
Evaluation data shape:  (46080, 8)
Test data shape:  (5760, 8)


**We get the evaluation and test set sizes described in our strategy**    
Note: We still have 8 features; we need only 1 (global_active_power, at index 0) to build our univariate model

In [9]:
def get_X_y(data, n_in, n_out=2880):
    X, y = list(), list()
    # step over the history one time step at a time, stopping once a
    # full input + output window no longer fits in the data
    for in_start in range(len(data) - n_in - n_out + 1):
        # define the end of the input and output sequences
        # (use the n_in argument, not the global n_input)
        in_end = in_start + n_in
        out_end = in_end + n_out
        # need only 1 feature
        x_input = data[in_start:in_end, 0]
        # reshaping to [timesteps, features]
        x_input = x_input.reshape((len(x_input), 1))
        # collected as [samples, timesteps, features]
        X.append(x_input)
        # collected as [samples, output]
        y.append(data[in_end:out_end, 0])
    return np.array(X), np.array(y)

In [10]:
# X_train, y_train = get_X_y(data=train, n_in=n_input, n_out=n_output)
# X_eval, y_eval = get_X_y(data=evaluate, n_in=n_input, n_out=n_output)
# X_test, y_test = get_X_y(data=test, n_in=n_input, n_out=n_output)
# print("Train X shape: ", X_train.shape)
# print("Train y shape: ", y_train.shape)
# print("Evaluation X shape: ", X_eval.shape)
# print("Evaluation y shape: ", y_eval.shape)
# print("Test X shape: ", X_test.shape)
# print("Test y shape: ", y_test.shape)

### The above method takes too long to generate the sets because of memory issues

In [11]:
print("Expected X_train shape: ({},{},{})".format(train.shape[0] - 2 * 2880 + 1, 2880, 1))
print("Expected y_train shape: ({},{})".format(train.shape[0] - 2 * 2880 + 1, 2880))

Expected X_train shape: (2017660,2880,1)
Expected y_train shape: (2017660,2880)


In [12]:
# Let's estimate memory use from the first 10000 data points; the expected number of rows is 10000 - 2 * 2880 + 1 = 4241
X_train, y_train = get_X_y(data=train[0:10000,:], n_in=n_input, n_out=n_output)

In [13]:
print("X_train shape: ", X_train.shape, " , y_train shape: ", y_train.shape)

X_train shape:  (4241, 2880, 1)  , y_train shape:  (4241, 2880)


In [14]:
print("X_train_size in MB: ", X_train.nbytes/(1024*1024), " , y_train_size in MB: ", y_train.nbytes/(1024*1024))

X_train_size in MB:  46.593017578125  , y_train_size in MB:  46.593017578125


In [15]:
# Let's calculate the expected size for the full training set
# 4241 rows need 2 * 46.593017578125 MB (X and y)
print("Expected size in GB for 2017660 rows in training(X,y): ", 
      X_train.nbytes/(1024*1024) * 2 * (train.shape[0] - 2 * 2880 + 1)/X_train.shape[0]/1024, " GB")

Expected size in GB for 2017660 rows in training(X,y):  43.29428672790527  GB
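
The same figure can be reached directly from the shapes, without building a sample first. This is a small sketch, assuming float32 values (4 bytes each) as reported by `df.info()`; the variable names are illustrative.

```python
n_rows = 2023419      # rows in the training split
n_in = n_out = 2880   # input and output window lengths
bytes_per_value = 4   # float32

# One sample per start position where a full input + output window fits
n_samples = n_rows - n_in - n_out + 1

# Every sample stores n_in input values and n_out output values
total_bytes = n_samples * (n_in + n_out) * bytes_per_value
total_gb = total_bytes / 1024 ** 3

print(n_samples, round(total_gb, 2))  # 2017660 43.29
```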


## This is why we are running out of memory
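
Consecutive windows differ by a single value, so almost all of that memory is duplicated data. One possible way around it is to build the windows as views on the original series instead of copies. The following is a minimal sketch, assuming NumPy 1.20 or later (which provides `sliding_window_view`); `get_X_y_views` is a hypothetical helper, not part of this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def get_X_y_views(series, n_in, n_out):
    """Return X [samples, n_in, 1] and y [samples, n_out] as read-only views."""
    series = np.asarray(series)
    # Each row holds one full input + output window: [samples, n_in + n_out]
    windows = sliding_window_view(series, n_in + n_out)
    X = windows[:, :n_in, np.newaxis]  # add the trailing feature axis
    y = windows[:, n_in:]
    return X, y

# Toy series so the shapes are easy to follow
toy = np.arange(10, dtype=np.float32)
X, y = get_X_y_views(toy, n_in=3, n_out=2)

print(X.shape, y.shape)           # (6, 3, 1) (6, 2)
print(np.shares_memory(X, toy))   # True: no window data was copied
```

The views are read-only, and any step that copies them in full (for example `np.array(X)`) would bring back the full memory cost, so this only helps if the data is consumed in slices or batches.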

### We can try saving these arrays to disk and then feeding them to the model with fit_generator in Keras, using our own generator. I will explore this later.
**For now I will move forward and try to build models by converting the data to hourly records**
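
For reference, the generator idea could look roughly like the sketch below. It builds each batch of windows on the fly from the 1-D series, so only one batch is in memory at a time. `window_batches` and its parameters are hypothetical names, and the sketch is untested against Keras itself; it only follows the usual convention that such a generator yields `(inputs, targets)` tuples indefinitely.

```python
import numpy as np

def window_batches(series, n_in, n_out, batch_size):
    """Yield (X, y) batches forever: X is [batch, n_in, 1], y is [batch, n_out]."""
    series = np.asarray(series, dtype=np.float32)
    n_samples = len(series) - n_in - n_out + 1
    if n_samples <= 0:
        raise ValueError("series is shorter than n_in + n_out")
    while True:  # loop indefinitely; the caller decides how many steps to draw
        for start in range(0, n_samples, batch_size):
            stop = min(start + batch_size, n_samples)
            # Slice one input window and one output window per sample
            X = np.stack([series[i:i + n_in] for i in range(start, stop)])
            y = np.stack([series[i + n_in:i + n_in + n_out]
                          for i in range(start, stop)])
            yield X[..., np.newaxis], y  # add the trailing feature axis

# Toy series so the shapes are easy to follow
toy = np.arange(12, dtype=np.float32)
gen = window_batches(toy, n_in=4, n_out=2, batch_size=3)
X_batch, y_batch = next(gen)

print(X_batch.shape, y_batch.shape)  # (3, 4, 1) (3, 2)
```

With a generator like this, the number of steps per epoch would be the sample count divided by the batch size, rounded up.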