# How to Convert a Time Series to a Supervised Learning Problem in Python

Before machine learning can be used, time series forecasting problems must be re-framed as supervised learning problems: a single sequence becomes pairs of input and output sequences.

In this tutorial, you will discover how to transform univariate and multivariate time series forecasting problems into supervised learning problems. After completion, you will know:
- How to develop a function to transform a time series dataset into a supervised learning dataset.
- How to transform univariate time series data for machine learning.
- How to transform multivariate time series data for machine learning.

## Time Series vs Supervised Learning

A **_time series_** is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values.

A **_supervised learning_** problem consists of input patterns (x) and output patterns (y), such that an algorithm can learn how to predict the output patterns from the input patterns.
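
As a small illustration of the difference, the sketch below pairs each value of a toy series with the value that follows it. The names `series`, `X`, and `y` and the values themselves are assumptions made up for this example.

```python
# A toy time series: values ordered by a time index.
series = [10, 20, 30, 40, 50]

# Supervised framing: each input is an observation,
# and each output is the observation that follows it.
X = series[:-1]  # inputs
y = series[1:]   # outputs

for x_i, y_i in zip(X, y):
    print(f"x={x_i} -> y={y_i}")
```

The series has five observations but yields only four (x, y) pairs, because the last observation has no following value to predict.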

## Pandas `shift()` Function

A key function to help transform time series data into a supervised learning problem is the Pandas `shift()` function. Given a DataFrame, the `shift()` function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

In [45]:
import pandas as pd

# create a time series DataFrame sequence
df = pd.DataFrame()
df['t'] = [x for x in range(10)]
print(df)

   t
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9


We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent _no data_.

In [46]:
# create shifted column, 't-1'
df['t-1'] = df['t'].shift(1)
print(df)

   t  t-1
0  0  NaN
1  1  0.0
2  2  1.0
3  3  2.0
4  4  3.0
5  5  4.0
6  6  5.0
7  7  6.0
8  8  7.0
9  9  8.0


We can see that shifting the series forward one time step gives us a primitive supervised learning problem, although with `x` and `y` in the wrong order. Ignore the column of row labels for now. The first row would have to be discarded because of the NaN value. The second row shows the input value of 0.0 in the second column (_input_ or _x_) and the value of 1 in the first column (_output_ or _y_).

If we repeat this process with shifts of 2, 3, and more, we can create long input sequences (x) that can be used to forecast an output value (y).
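
One way this repetition might look is sketched below. The names `lagged`, `n_lags`, `input_cols`, and `framed` are assumptions for this example, and the choice of three lags is arbitrary.

```python
import pandas as pd

# Same toy series as above, held in its own DataFrame.
lagged = pd.DataFrame({'t': list(range(10))})

# Add one lag column per shift, from the oldest lag to the most recent.
n_lags = 3  # hypothetical choice
for lag in range(n_lags, 0, -1):
    lagged[f't-{lag}'] = lagged['t'].shift(lag)

# Put the lag columns first (inputs), then the current value (output),
# and drop the leading rows that contain NaN.
input_cols = [f't-{lag}' for lag in range(n_lags, 0, -1)]
framed = lagged[input_cols + ['t']].dropna()
print(framed)
```

Each extra lag costs one more row at the top of the frame, so three lags on ten observations leave seven complete rows.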

The shift operator can also accept a negative integer value, which has the effect of pulling the observations up by inserting new rows at the end.

In [47]:
df['t+1'] = df['t'].shift(-1)
print(df)

   t  t-1  t+1
0  0  NaN  1.0
1  1  0.0  2.0
2  2  1.0  3.0
3  3  2.0  4.0
4  4  3.0  5.0
5  5  4.0  6.0
6  6  5.0  7.0
7  7  6.0  8.0
8  8  7.0  9.0
9  9  8.0  NaN


In time series forecasting terminology, the current time (t) and future times (t+n) are forecast times and past observations (t-n) are used to make the forecasts.

We can see how positive and negative shifts can be used to create a new DataFrame from a time series with sequences of input and output patterns for a supervised learning problem. This permits not only classical (X to y) prediction, but also (X to Y) prediction, where both input and output can be sequences.

Further, the shift function also works on so-called multivariate time series problems, where instead of having one set of observations for a time series, we have several (e.g. temperature and pressure). All variates in the time series can be shifted forward or backward to create multivariate input and output sequences.
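
A minimal sketch of the multivariate case follows. The `weather` DataFrame and its temperature and pressure readings are invented for illustration; only `shift()`, `add_suffix()`, `concat()`, and `dropna()` are real Pandas calls.

```python
import pandas as pd

# Hypothetical two-variable series (values are made up).
weather = pd.DataFrame({
    'temperature': [20.1, 20.4, 21.0, 21.3, 20.8],
    'pressure': [1012, 1011, 1013, 1014, 1012],
})

# Shifting the whole DataFrame moves every column by the same number of rows.
previous = weather.shift(1).add_suffix('(t-1)')
current = weather.add_suffix('(t)')

# Place lagged columns next to current columns and drop the incomplete first row.
framed = pd.concat([previous, current], axis=1).dropna()
print(framed)
```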

## The `series_to_supervised()` Function

`shift()` is a useful tool because it allows us to explore different framings of a time series problem with machine learning algorithms to see which results in better-performing models.

In this section, we will define a new Python function named `series_to_supervised()` that takes a univariate or multivariate time series and frames it as a supervised learning dataset.

The function takes four arguments:
- `data`: sequence of observations as a list or 2D NumPy array. Required.
- `n_in`: number of lag observations as input (x). Values may be between [1, len(data)]. Optional. Defaults to 1.
- `n_out`: number of observations as output (y). Values may be between [0, len(data)-1]. Optional. Defaults to 1.
- `dropna`: boolean, whether or not to drop rows with NaN values. Optional. Defaults to True.

The function returns a single value:
- `return` : Pandas DataFrame of series framed for supervised learning.

The new dataset is constructed as a DataFrame, with each column suitably named both by variable number and time step. This allows you to design a variety of forecasting problems with different time step sequences from a given univariate or multivariate time series.
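
The naming scheme can be sketched on its own, without any data. The helper `column_names()` below is hypothetical and exists only to show which labels result from a given number of variables, input steps, and output steps.

```python
def column_names(n_vars, n_in, n_out):
    """Build column labels by variable number and time step (illustrative helper)."""
    names = []
    # Lagged inputs, oldest first: (t-n_in), ..., (t-1)
    for i in range(n_in, 0, -1):
        names += [f'var{j + 1}(t-{i})' for j in range(n_vars)]
    # Outputs: (t), (t+1), ..., (t+n_out-1)
    for i in range(n_out):
        step = '(t)' if i == 0 else f'(t+{i})'
        names += [f'var{j + 1}{step}' for j in range(n_vars)]
    return names

print(column_names(n_vars=2, n_in=1, n_out=2))
# ['var1(t-1)', 'var2(t-1)', 'var1(t)', 'var2(t)', 'var1(t+1)', 'var2(t+1)']
```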

Once the DataFrame is returned, you can decide how to split its columns into x and y components for supervised learning any way you wish. The function is defined with default parameters so that if you call it with just your data, it will construct a DataFrame with _t-1_ as x and _t_ as y.

In [48]:
def series_to_supervised(data, n_in=1, n_out=1, dropna=True):
    
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropna: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    
    n_vars = 1 if isinstance(data, list) else data.shape[1]
    
    # define variables
    df = pd.DataFrame(data)
    cols, names = list(), list() # column data, column names
    
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [f'var{j+1}(t-{i})' for j in range(n_vars)]
        
    # output sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        
        if i == 0:
            names += [f'var{j+1}(t)' for j in range(n_vars)]
        else:
            names += [f'var{j+1}(t+{i})' for j in range(n_vars)]
            
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    
    # drop rows with NaN values
    if dropna:
        agg.dropna(inplace=True)
    
    return agg

## One-Step Univariate Forecasting

It is standard practice in time series forecasting to use lagged observations (e.g., t-1) as input variables to forecast the current time step (t). This is called **_one-step forecasting_**.

In [49]:
values = [x for x in range(10)]
data = series_to_supervised(values)
data

|   | var1(t-1) | var1(t) |
|---|-----------|---------|
| 1 | 0.0 | 1 |
| 2 | 1.0 | 2 |
| 3 | 2.0 | 3 |
| 4 | 3.0 | 4 |
| 5 | 4.0 | 5 |
| 6 | 5.0 | 6 |
| 7 | 6.0 | 7 |
| 8 | 7.0 | 8 |
| 9 | 8.0 | 9 |


We can see that the observations are named `var1` and that the input observation is suitably named (t-1) and the output time step is named (t). We can also see that rows with NaN values have been automatically removed from the DataFrame.

We can repeat this example with an input sequence of arbitrary length, such as 3, by passing the length of the input sequence as the second argument.

In [50]:
data = series_to_supervised(values, 3)
data

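|   | var1(t-3) | var1(t-2) | var1(t-1) | var1(t) |
|---|-----------|-----------|-----------|---------|
| 3 | 0.0 | 1.0 | 2.0 | 3 |
| 4 | 1.0 | 2.0 | 3.0 | 4 |
| 5 | 2.0 | 3.0 | 4.0 | 5 |
| 6 | 3.0 | 4.0 | 5.0 | 6 |
| 7 | 4.0 | 5.0 | 6.0 | 7 |
| 8 | 5.0 | 6.0 | 7.0 | 8 |
| 9 | 6.0 | 7.0 | 8.0 | 9 |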


## Multi-Step or Sequence Forecasting

A different type of forecasting problem is using past observations to forecast a sequence of future observations. This is called **_multi-step_** or **_sequence forecasting_**.

In [51]:
data = series_to_supervised(values, 2, 2)
data

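|   | var1(t-2) | var1(t-1) | var1(t) | var1(t+1) |
|---|-----------|-----------|---------|-----------|
| 2 | 0.0 | 1.0 | 2 | 3.0 |
| 3 | 1.0 | 2.0 | 3 | 4.0 |
| 4 | 2.0 | 3.0 | 4 | 5.0 |
| 5 | 3.0 | 4.0 | 5 | 6.0 |
| 6 | 4.0 | 5.0 | 6 | 7.0 |
| 7 | 5.0 | 6.0 | 7 | 8.0 |
| 8 | 6.0 | 7.0 | 8 | 9.0 |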
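
To feed a multi-step framing to a learning algorithm, the columns still have to be separated into inputs and outputs. The sketch below shows one possible split. The DataFrame `frame` is built by hand to mirror the layout above so the example runs on its own, and the rule "columns containing `(t-` are inputs" is an assumption based on this tutorial's naming convention.

```python
import pandas as pd

# Hand-built stand-in for a frame with two input steps and two output steps.
obs = list(range(10))
frame = pd.DataFrame({
    'var1(t-2)': obs[0:7],
    'var1(t-1)': obs[1:8],
    'var1(t)': obs[2:9],
    'var1(t+1)': obs[3:10],
})

# Columns with a negative time offset are inputs; the rest are outputs.
input_cols = [c for c in frame.columns if '(t-' in c]
output_cols = [c for c in frame.columns if c not in input_cols]

X = frame[input_cols].to_numpy()   # shape: (samples, input steps)
Y = frame[output_cols].to_numpy()  # shape: (samples, output steps)
print(X.shape, Y.shape)
```

Here both `X` and `Y` are two-dimensional, which corresponds to the (X to Y) case where the output is itself a sequence.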


## Multivariate Forecasting

Another important type of time series is called **_multivariate time series_**. This is where we may have observations of multiple different measures and an interest in forecasting one or more of them. For example, we may have two sets of time series observations `obs1` and `obs2` and we wish to forecast one or both of these.

In [52]:
row = pd.DataFrame()
row['obs1'] = [x for x in range(10)]
row['obs2'] = [x for x in range(50, 60)]
values = row.values

data = series_to_supervised(values)
data

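|   | var1(t-1) | var2(t-1) | var1(t) | var2(t) |
|---|-----------|-----------|---------|---------|
| 1 | 0.0 | 50.0 | 1 | 51 |
| 2 | 1.0 | 51.0 | 2 | 52 |
| 3 | 2.0 | 52.0 | 3 | 53 |
| 4 | 3.0 | 53.0 | 4 | 54 |
| 5 | 4.0 | 54.0 | 5 | 55 |
| 6 | 5.0 | 55.0 | 6 | 56 |
| 7 | 6.0 | 56.0 | 7 | 57 |
| 8 | 7.0 | 57.0 | 8 | 58 |
| 9 | 8.0 | 58.0 | 9 | 59 |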


Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern of one time step for both variables. Again, depending on the specifics of the problem, the division of columns into (x) and (y) components can be chosen arbitrarily, such as if the current observation of `var1` was also provided as input and only `var2` was to be predicted.
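
The alternative division mentioned above, with the current `var1` observation used as an input and only `var2` predicted, might be sketched as follows. The DataFrame `frame` is again built by hand to match the column layout, and `target_col` is a name chosen for this example.

```python
import pandas as pd

# Hand-built stand-in for the two-variable, one-step framing.
frame = pd.DataFrame({
    'var1(t-1)': list(range(0, 9)),
    'var2(t-1)': list(range(50, 59)),
    'var1(t)': list(range(1, 10)),
    'var2(t)': list(range(51, 60)),
})

# Predict only var2 at time t; every other column, including var1(t), is an input.
target_col = 'var2(t)'
X = frame.drop(columns=[target_col]).to_numpy()
y = frame[target_col].to_numpy()
print(X.shape, y.shape)
```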

In [53]:
# using the row values above, we can specify different input and output steps for multivariate problems
data = series_to_supervised(values, 1, 2)
data

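|   | var1(t-1) | var2(t-1) | var1(t) | var2(t) | var1(t+1) | var2(t+1) |
|---|-----------|-----------|---------|---------|-----------|-----------|
| 1 | 0.0 | 50.0 | 1 | 51 | 2.0 | 52.0 |
| 2 | 1.0 | 51.0 | 2 | 52 | 3.0 | 53.0 |
| 3 | 2.0 | 52.0 | 3 | 53 | 4.0 | 54.0 |
| 4 | 3.0 | 53.0 | 4 | 54 | 5.0 | 55.0 |
| 5 | 4.0 | 54.0 | 5 | 55 | 6.0 | 56.0 |
| 6 | 5.0 | 55.0 | 6 | 56 | 7.0 | 57.0 |
| 7 | 6.0 | 56.0 | 7 | 57 | 8.0 | 58.0 |
| 8 | 7.0 | 57.0 | 8 | 58 | 9.0 | 59.0 |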
