## Data Preparation Example

### Consider the following situation:
I have two columns in my data file with 5,000 rows: column 1 is time (at 1-hour
intervals) and column 2 is the number of sales. I am trying to forecast the number
of sales for future time steps. How do I set the number of samples, time steps, and
features in this data for an LSTM?

There are a few problems here:

* Data Shape. LSTMs expect 3D input, and it can be challenging to get your head around this the first time.
* Sequence Length. LSTMs tend to struggle with sequences much longer than about 200-400 time steps, so the data will need to be split into subsamples.

We will work through this example, broken down into the following 4 steps:
1. Load the Data
2. Drop the Time Column
3. Split Into Samples
4. Reshape Subsequences

### 1. Load the Data

For this example, we will mock loading by defining a new dataset in memory with 5,000
time steps.

In [19]:
from numpy import array

# define the dataset
data = list()
n = 5000

for i in range(n):
    data.append([i+1, (i+1)*10])
data = array(data)

The next two cells print the first 5 rows of the data and then its shape. We
can see we have 5,000 rows and 2 columns: a standard univariate time series dataset.

In [20]:
print(data[:5, :])

[[ 1 10]
 [ 2 20]
 [ 3 30]
 [ 4 40]
 [ 5 50]]


At this point the data still includes the time column, so the shape has two columns:

In [21]:
print(data.shape)

(5000, 2)
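
With a real data file, the loading step might look like the sketch below. It assumes a comma-separated file with one header row; the column names and the in-memory `csv_text` are made up for illustration, and `StringIO` only stands in for an open file so the snippet runs on its own.

```python
from io import StringIO
from numpy import loadtxt

# hypothetical file contents: an hour index and a sales count per row
csv_text = "hour,sales\n1,10\n2,20\n3,30\n4,40\n5,50\n"

# StringIO stands in for a file on disk; a path to your own file works the same way
loaded = loadtxt(StringIO(csv_text), delimiter=",", skiprows=1)
print(loaded.shape)
```

Note that `loadtxt` returns floating-point values by default, whereas the mocked dataset above holds integers.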


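### 2. Drop the Time Column

Because the observations are evenly spaced (one per hour), the order of the rows
already encodes time, so we can keep only the sales column.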

In [22]:
data = data[:,1]
data.shape

(5000,)

### 3. Split Into Samples

LSTMs process samples, where each sample is a single sequence of observations. In this
case, 5,000 time steps is too long; LSTMs generally work better with roughly 200 to 400 time
steps. Therefore, we need to split the 5,000 time steps into multiple shorter sub-sequences.
There are many ways to do this, and you may want to explore several depending on your
problem. For example, you may need overlapping sequences, or non-overlapping sequences may
be fine but your model may need to carry state across them. In this example, we will split
the 5,000 time steps into 25 non-overlapping sub-sequences of 200 time steps each. Rather than
using NumPy or Python tricks, we will do this the old-fashioned way so you can see what is
going on.

In [25]:
# split into samples (e.g. 5000/200 = 25)
samples = list()
length = 200

# step over the 5,000 in jumps of 200
for i in range(0,n,length):
    # grab from i to i + 200
    sample = data[i:i+length]
    samples.append(sample)
    
print(len(samples))

25


In [26]:
# convert list of arrays into 2d array
data = array(samples)
print(data.shape)

(25, 200)
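
If overlapping sub-sequences suit your problem better, the same loop can be adapted. The following is a minimal sketch, not part of the worked example: `stride` is an assumed parameter (the hop between window starts), and `series` rebuilds the sales column so the snippet is self-contained.

```python
from numpy import arange, array

# stand-in for the sales column: 10, 20, ..., 50000
series = arange(1, 5001) * 10

length = 200   # time steps per sample
stride = 100   # assumed hop between window starts; 100 gives 50% overlap

windows = list()
# stop early enough that every window holds exactly `length` values
for start in range(0, len(series) - length + 1, stride):
    windows.append(series[start:start + length])

windows = array(windows)
print(windows.shape)
```

Setting `stride` equal to `length` reproduces the non-overlapping split used in this example.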


### 4. Reshape Subsequences

The LSTM needs data in the format ```[samples, timesteps, features]```. We have 25
samples, 200 time steps per sample, and 1 feature. The previous step already converted our
list of arrays into a 2D NumPy array with the shape ```[25, 200]```, so all that remains is
to add the feature dimension.

In [27]:
# reshape into [samples, timesteps, features]
data = data.reshape((len(samples), length, 1))
print(data.shape)

(25, 200, 1)
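
For reference, the split and the reshape can also be done in one NumPy call. This is a sketch under the same assumptions as the example (non-overlapping samples of 200 steps); `series` rebuilds the sales column, and the trimming step is an addition that only matters when the series length is not a multiple of `length`.

```python
from numpy import arange

series = arange(1, 5001) * 10   # stand-in for the sales column
length = 200

# drop any trailing observations that do not fill a whole sample
usable = len(series) - (len(series) % length)

# -1 lets NumPy infer the number of samples
compact = series[:usable].reshape((-1, length, 1))
print(compact.shape)
```

The result has the same shape and contents as the array built by the explicit loop.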


In [28]:
data

array([[[   10],
        [   20],
        [   30],
        ...,
        [ 1980],
        [ 1990],
        [ 2000]],

       [[ 2010],
        [ 2020],
        [ 2030],
        ...,
        [ 3980],
        [ 3990],
        [ 4000]],

       [[ 4010],
        [ 4020],
        [ 4030],
        ...,
        [ 5980],
        [ 5990],
        [ 6000]],

       ...,

       [[44010],
        [44020],
        [44030],
        ...,
        [45980],
        [45990],
        [46000]],

       [[46010],
        [46020],
        [46030],
        ...,
        [47980],
        [47990],
        [48000]],

       [[48010],
        [48020],
        [48030],
        ...,
        [49980],
        [49990],
        [50000]]])
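
Because the stated goal is forecasting, each input sequence will eventually need a target value. One possible framing, assumed here and not part of the steps above, is to treat the first 199 time steps of each sample as the input and the 200th as the value to predict. The names `X` and `y` are illustrative.

```python
from numpy import arange

# rebuild the [samples, timesteps, features] array from the steps above
data = (arange(1, 5001) * 10).reshape((25, 200, 1))

# assumed framing: the first 199 steps are the input, the 200th is the target
X = data[:, :-1, :]
y = data[:, -1, 0]
print(X.shape, y.shape)
```

Other framings, such as predicting several future steps at once, would change how `y` is sliced.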

### Further Reading

This section provides more resources on the topic if you are looking to go deeper.

* numpy.reshape API.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html

* Keras Recurrent Layers API.
https://keras.io/layers/recurrent/

* Keras Convolutional Layers API.
https://keras.io/layers/convolutional/