# Time Series Data Preprocessing
This workbook introduces you to the concept of Time Series Data and how we can pre-process such data to reframe it as a Supervised Learning problem.


## What is Time Series Data
Time Series Data is a series of data points ordered in time. Time Series data can have some interesting properties such as:

__Trends__ where there is a general increase, decrease or steady state in the values over time or during a period. An example of this might be the graph of the FTSE 100 share index, which generally shows an upward trend.

__Seasonality__ where there are distinct periodic fluctuations with similar patterns. For example, the mean daily temperature over a 5 year period would likely show regular periods of increased temperatures occurring during the summer.
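
To make these two properties concrete, here is a minimal sketch that builds a synthetic monthly series from a straight-line trend plus a repeating yearly cycle. The constants and the sine-wave shape are assumptions chosen purely for illustration; real data is rarely this tidy.

```python
import math

TREND_PER_MONTH = 0.5      # assumed steady increase per time step
SEASONAL_AMPLITUDE = 10.0  # assumed size of the yearly swing
PERIOD = 12                # assumed season length (12 monthly steps)

# 5 years of monthly values: baseline + trend + seasonal cycle
series = [
    100 + TREND_PER_MONTH * t + SEASONAL_AMPLITUDE * math.sin(2 * math.pi * t / PERIOD)
    for t in range(5 * PERIOD)
]

# Averaging over each full season cancels the seasonal swing and exposes the trend
yearly_means = [
    sum(series[start:start + PERIOD]) / PERIOD
    for start in range(0, len(series), PERIOD)
]
print([round(m, 2) for m in yearly_means])
```

The yearly averages rise steadily (the trend), while values exactly one period apart differ only by the trend accumulated over that period (the seasonality).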

Some more details about Time Series Data can be found at https://towardsdatascience.com/almost-everything-you-need-to-know-about-time-series-860241bdc578


## Why is Time Series Data interesting?
Data that forms a Time Series is of interest since it can be used to forecast future events. For example:
 - if we can reliably forecast the price of a stock from recent historical data then we can potentially make money
 - if we can forecast where a hurricane will make landfall then we can make preparations and save lives.

Each of these examples has value, and in each case the historical data is likely to be a good indicator for a future forecast.
 
 ### Exercise
 In your teams, list out some examples of Time Series Data where you think it might be valuable to be able to forecast future values (such as the two given above). 
 
 For the 2 most interesting datasets consider:
 - What is the value in being able to forecast future values/events for this dataset?
 - What data is likely to be related to the time-series that might help with a forecast?  
     - For example, in stock market prediction, historical stock info could be supplemented with the prices of related stocks or indexes such as the FTSE 100.
 

# Reframing Time Series Forecasting as a Supervised Learning Problem
When using Supervised Learning, we generally have a set of related inputs (a training sample) and an expected output (target).

A Time-Series is a sequence of values that occur over a time period taken at regular intervals; what we are attempting to do is predict the next item in the sequence based on the previous values.

An example of Time Series Data might be the following (passenger numbers per month, in thousands):

112, 118, 132, 129, 121, 135, 148, 148, 136, 119

This data isn't really in a form that we can readily use for our machine learning models: it is a single sequence, with no separate features and targets.


## Exercise
Think about the way we construct our data for Prediction and Classification tasks: we have our training samples (X), each containing a set of features, and we have our targets (y), containing the target value for each sample. We want our model to learn a mapping between the training samples and the associated targets.

Now consider the following time-series sequence:

__112, 118, 132, 129, 121, 135, 148, 148, 136, 119__

In your groups, consider how you could construct data suitable for machine learning from this sequence. Do this for each of the scenarios below.

List out the dataset you would create to train a model to:
- predict the next item in the sequence based on the previous 2 items
    - e.g. given 112 and 118 we need to predict 132
- predict the next 2 items in the sequence based on the previous 2
    - e.g. given 112 and 118 we need to predict 132 and 129
- predict the item 2 places ahead in the sequence based on the previous 2
    - e.g. given 112 and 118 we need to predict 129

Remember you need a set of training samples consisting of X-data (the features) and the corresponding y-data (the target we are trying to predict/forecast).

# Reframing Data using a TimeseriesGenerator
Until recently, we would have had to create custom functions to reframe our time series data into a suitable form, but thankfully with Keras we can use a __TimeseriesGenerator__.

The TimeseriesGenerator takes our input sequence(s) and uses them to provide training samples. It doesn't change the original data but instead uses it to form a dataset suitable for Supervised Learning.
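
For comparison, a custom reframing function of the kind mentioned above might look like the following minimal sketch. The name `make_windows` and its parameters are hypothetical, chosen for this illustration; it is not part of Keras or of this workbook's helper code.

```python
def make_windows(series, length=2):
    """Pair each run of `length` consecutive values with the value that follows it."""
    samples, targets = [], []
    # `end` is the position of the value to predict; the window sits just before it
    for end in range(length, len(series)):
        samples.append(list(series[end - length:end]))
        targets.append(series[end])
    return samples, targets

series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
X, y = make_windows(series, length=2)
for features, target in zip(X, y):
    print(features, "=>", target)
```

The original sequence is left untouched; the function only reads from it to build the (features, target) pairs.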

We can create a generator using `generator = TimeseriesGenerator()` and specify a set of parameters to construct how the data is produced. These parameters include:
- __data__: this is the input sequence of consecutive data points covering a time period; the windows of past values are drawn from it
- __targets__: this is the target data sequence.
    - this sequence should cover the same time period as the _data_, have the same length and be aligned (i.e. _data_ at position 1 should have the corresponding target at _targets_ position 1). The generator pairs each window of _data_ with the _targets_ entry at the position immediately after that window.
    - If our targets are part of the _data_ we just provide the same sequence again, otherwise we provide the target sequence.
- __length__: Length of the output sequences (in number of timesteps).
- __sampling_rate__: Period between successive individual timesteps within sequences.
- __stride__: Period between successive output sequences.
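
The effect of __sampling_rate__ and __stride__ is easier to see on a small example. The sketch below is a pure-Python stand-in that mirrors the window selection as we understand it from the Keras documentation; `sketch_windows` is a hypothetical name and the indexing is an assumption, so check it against the real generator before relying on it.

```python
def sketch_windows(data, targets, length, sampling_rate=1, stride=1):
    """Pure-Python stand-in for the generator's window selection (an assumption)."""
    pairs = []
    # Each window ends just before position `row`; the target is taken at `row`
    for row in range(length, len(data), stride):
        window = list(data[row - length:row:sampling_rate])
        pairs.append((window, targets[row]))
    return pairs

series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]

# length=4 with sampling_rate=2 keeps every second value inside each window
sampled = sketch_windows(series, series, length=4, sampling_rate=2)
print(sampled[0])

# stride=3 starts a new window every 3 positions instead of every position
strided = sketch_windows(series, series, length=2, stride=3)
print(strided)
```

Note that __length__ counts positions in the original sequence, so a larger __sampling_rate__ produces windows that contain fewer values.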

Full details about the TimeseriesGenerator can be found at https://keras.io/preprocessing/sequence/#timeseriesgenerator


# Let's see the TimeseriesGenerator in action
We are now going to see the TimeseriesGenerator in action against a simple dataset.

We will use the 3 scenarios you considered earlier as examples. These were:
- predict the next item in the sequence based on the previous 2 items
    - e.g. given 112 and 118 we need to predict 132
- predict the next 2 items in the sequence based on the previous 2
    - e.g. given 112 and 118 we need to predict 132 and 129
- predict the item 2 places ahead in the sequence based on the previous 2
    - e.g. given 112 and 118 we need to predict 129

In [None]:
# Import some packages we will need
import pandas
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras

# This imports the TimeseriesGenerator class
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator


In [None]:
# Useful functions
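def print_timeseries(generator):
    # Walk through every batch the generator provides
    # and print each sample alongside its target
    for batch in range(len(generator)):
        x_batch, y_batch = generator[batch]
        for x, y in zip(x_batch, y_batch):
            print("[%s] => [%s]" % (x, y))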

def create_target_series_for_generator(data, skip_output=0, output_length=1):
    # The generator deals with selecting the next item in the sequence,
    # but if our predictions skip steps we need to shift our data along
    shifted_data = data[skip_output:]

    # Loop through the shifted data, collect successive groups of
    # output_length values and add them to target_seq
    target_seq = []
    # Stop once there are too few values left to fill a complete group
    groups = len(shifted_data) - output_length + 1
    for i in range(groups):
        item = []
        for j in range(output_length):
            item.append(shifted_data[i+j])
        target_seq.append(item)
    return target_seq

In [None]:
# Import some sample data
dataset = pandas.read_csv('https://github.com/BillMatthews/ml-course/blob/master/airline-passengers.csv?raw=true', 
                          usecols=[1], engine='python')
# Extract the data
data = dataset.values
data = data.astype('float32')
data = np.concatenate(data)

dataset.head()

In [None]:
plt.plot(dataset)
plt.show()
# Note the Seasonality and Trend in the data

Let's print out the first 10 data points

In [None]:
print(data[0:10])

## Scenario 1
Predict the next item in the sequence based on the previous 2 items. For example, given 112 and 118 we need to predict 132.

__Key Points to consider__
* Since we only have 1 timeseries and the data and targets come from this sequence, we provide our data as the value to both the __data__ and __targets__ parameters
* We are predicting based on the previous 2 items so our __length__ is 2

The original sequence was:

__112, 118, 132, 129, 121, 135, 148, 148, 136, 119__

In [None]:
s1_generator = TimeseriesGenerator (data = data[:10],
                                    targets = data[:10],
                                    length = 2)

        
print_timeseries(s1_generator)

## Scenario 2
Predict the next 2 items in the sequence based on the previous 2. For example, given 112 and 118 we need to predict 132 and 129.

__Key Points to consider__
* Given we want to output 2 values for our target we can't re-use our data as the target (since it only has one item at each timestep). We therefore need to construct a new target dataset.
* We are predicting based on the previous 2 items so our __length__ is 2

The original sequence was:

__112, 118, 132, 129, 121, 135, 148, 148, 136, 119__

In [None]:
# Create the new target sequence
target_seq = []
# Since we are predicting the next 2 items in a sequence we don't need to shift the data,
# so we can just loop through the data and collect the pairs of values.
# However, we need to stop short before we reach the end, as the last value in our data
# won't have a pair
for i in range(len(data) - 1):
    target_seq.append([data[i], data[i+1]])

print(target_seq[:10])

# We can now create our TimeseriesGenerator
s2_generator = TimeseriesGenerator(data=data[:10],
                                   targets=target_seq[:10],
                                   length=2)

print_timeseries(s2_generator)

## Scenario 3
Predict the item 2 places ahead in the sequence based on the previous 2. For example, given 112 and 118 we need to predict 129.

__Key Points to consider__
* Since we are now predicting values out of sequence we need to create a new sequence for _targets_
* We are predicting based on the previous 2 items so our __length__ is 2
* We want to step over the next item in the sequence to predict the following one. We do this by shifting the _targets_ sequence along by one step, not by changing __stride__, which controls the gap between successive input windows and stays at its default of 1

The original sequence was:

__112, 118, 132, 129, 121, 135, 148, 148, 136, 119__

In [None]:
# We are stepping 2 places along the sequence.
# The generator will deal with a single step,
# so we need to shift our data by a further 1 step
# so that the generator picks up the correct item
target_seq = data[1:]

s3_generator = TimeseriesGenerator (data = data[:10],
                                    targets = target_seq[:10],
                                    length = 2)

print_timeseries(s3_generator)

# Exercise
We have seen examples of how to prepare time-series data and how we can create TimeseriesGenerators to feed our learning model.

We have created a function that allows you to create your target sequences without writing code. The function is called `create_target_series_for_generator()` and takes the following parameters:
- __data__: the time series data used as your target source. This is usually your source data
- __skip_output__: indicates how many items to skip forward. If you just want the next item as normal then omit this parameter, as it is optional and defaults to 0
- __output_length__: indicates the number of items you want in each output grouping. If you just want 1 value for each target then this can be omitted, as it is optional and defaults to 1
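
As a quick illustration of how the two optional parameters combine, the sketch below uses a compact stand-in written for this example. `sketch_targets` is a hypothetical name and is not the workbook's function itself, though it takes the same parameters. It skips one item and asks for two outputs per target, then looks up the target the generator would pair with the first window of 2 items.

```python
def sketch_targets(data, skip_output=0, output_length=1):
    """Stand-in helper: groups of `output_length` values taken after skipping."""
    shifted = data[skip_output:]
    return [list(shifted[i:i + output_length])
            for i in range(len(shifted) - output_length + 1)]

series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
LENGTH = 2  # window size the generator would use

# Skip one item and ask for two outputs per target
targets = sketch_targets(series, skip_output=1, output_length=2)

# With a window of LENGTH items, the generator takes the target at index LENGTH
first_window = series[0:LENGTH]
first_target = targets[LENGTH]
print(first_window, "=>", first_target)
```

Here the window 112, 118 is paired with 129 and 121: the item immediately after the window (132) is skipped and the following two are used as the target.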

Before moving on, use the following cell to experiment with different options for the time series data to be sure that you understand how to create it.

For reference, the original data starts with the following sequence:

__112, 118, 132, 129, 121, 135, 148, 148, 136, 119__

In [None]:
target_seq = create_target_series_for_generator(data = data, 
                                                skip_output=0,
                                                output_length = 1)

my_generator = TimeseriesGenerator (data = data[:10],
                                    targets = target_seq[:10],
                                    length = 2)
# 112, 118, 132, 129, 121, 135, 148, 148, 136, 119
print_timeseries(my_generator)

# Creating Training and Testing Data sets
In previous models we created Training and Testing Datasets by splitting the available data into 2 - usually with 20% of the data being reserved for Testing and 80% for Training.

The splits we created were random in nature since each training sample was independent. With Time Series data the samples are linked: they form a sequence in which each value is expected to have some dependency on previous values.

## Exercise
Discuss in your teams the following questions:
- What do you think will happen to our Time Series if we split the data randomly like we do with Image and Text training Samples?
- How could we split a Time Series to preserve it as a time series?
- What key features should we consider when deciding to split our time-series?