# shifting sequence data

#### sequences don't fit the typical supervised learning model

supervised learning models require data as input, and known target values (class labels or numeric values) as output. it's the model's job to learn a function that maps inputs to outputs.

fundamentally, this means that a supervised model needs every training instance to pair an input with a known output. for example, a classification problem attempts to find a function that sorts data into pre-set classes; it's this input-to-output mapping that the model tries to infer.

time series and other sequences do not naturally fit this shape: after all, a sequence is just an ordered list of values, with no separate target attached to each one.

in order to provide a supervised learning model with a "target" to learn from, we need to create a column that holds, for each row, the known value at an adjacent time step (for example, the "next step" we want to predict).

in other words, we create a column that shifts each step either forward or backward, then feed the resulting pair of columns into our supervised model.

### libraries used

pandas provides an easy way to shift data, using the shift() function.

in order to apply the shift() function, we need to import pandas and put our data into a pandas DataFrame.

In [46]:
import pandas as pd

## shifting

creating a series of time steps in an indexed DataFrame:

In [47]:
seq_data = pd.DataFrame()

seq_data['time_step'] = list(range(26))

# check: should print a DataFrame with an index and a time_step column

print(seq_data)

    time_step
0           0
1           1
2           2
3           3
4           4
5           5
6           6
7           7
8           8
9           9
10         10
11         11
12         12
13         13
14         14
15         15
16         16
17         17
18         18
19         19
20         20
21         21
22         22
23         23
24         24
25         25


now that we have a DataFrame, we can use the pandas shift() function to create a column where each time point is shifted by one step. we can call it 't-1', since each shifted value is the step just before the one in the same row.

In [48]:
seq_data['t-1'] = seq_data['time_step'].shift(1)

print(seq_data)

    time_step   t-1
0           0   NaN
1           1   0.0
2           2   1.0
3           3   2.0
4           4   3.0
5           5   4.0
6           6   5.0
7           7   6.0
8           8   7.0
9           9   8.0
10         10   9.0
11         11  10.0
12         12  11.0
13         13  12.0
14         14  13.0
15         15  14.0
16         16  15.0
17         17  16.0
18         18  17.0
19         19  18.0
20         20  19.0
21         21  20.0
22         22  21.0
23         23  22.0
24         24  23.0
25         25  24.0


note: the 't-1' column simply lists the previous time step on the same row as the "current" time step. since the value at index [0] doesn't have a previous step, pandas' shift() function puts NaN (a missing-value marker) there instead, which is also why the shifted column is displayed as floats. practically speaking, this missing value means we will have to discard the first row of our data.
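
as a minimal sketch of that discard step, assuming we simply want to drop every incomplete row, pandas' dropna() can do the work. the name `clean` and the cast back to integers are illustrative choices, not part of the original walkthrough:

```python
import pandas as pd

# rebuild the small example from above
seq_data = pd.DataFrame({'time_step': list(range(26))})
seq_data['t-1'] = seq_data['time_step'].shift(1)

# dropna() removes any row that contains a missing value (here, only row 0)
clean = seq_data.dropna().reset_index(drop=True)

# the shifted column was stored as floats so it could hold NaN;
# once the NaN row is gone it can be cast back to integers if desired
clean['t-1'] = clean['t-1'].astype(int)

print(len(seq_data), len(clean))  # 26 25
print(clean.head(3))
```

resetting the index is optional; it just keeps the row labels contiguous after the first row is removed.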

#### shifting by different increments

we can provide a column of values shifted by almost any amount.

however, it's important to keep in mind that adding a column shifted by n steps means we will lose n rows of our data:

In [49]:
seq_data['t-3'] = seq_data['time_step'].shift(3)

print(seq_data)

    time_step   t-1   t-3
0           0   NaN   NaN
1           1   0.0   NaN
2           2   1.0   NaN
3           3   2.0   0.0
4           4   3.0   1.0
5           5   4.0   2.0
6           6   5.0   3.0
7           7   6.0   4.0
8           8   7.0   5.0
9           9   8.0   6.0
10         10   9.0   7.0
11         11  10.0   8.0
12         12  11.0   9.0
13         13  12.0  10.0
14         14  13.0  11.0
15         15  14.0  12.0
16         16  15.0  13.0
17         17  16.0  14.0
18         18  17.0  15.0
19         19  18.0  16.0
20         20  19.0  17.0
21         21  20.0  18.0
22         22  21.0  19.0
23         23  22.0  20.0
24         24  23.0  21.0
25         25  24.0  22.0
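
one way to check the row loss directly is to count the missing values per column. this is a small sketch rather than part of the original notebook; `missing_per_column` and `usable` are names chosen here for illustration:

```python
import pandas as pd

seq_data = pd.DataFrame({'time_step': list(range(26))})

# add one shifted column per lag
for n in (1, 3):
    seq_data[f't-{n}'] = seq_data['time_step'].shift(n)

# each shifted column has exactly n missing values at the top
missing_per_column = seq_data.isna().sum()
print(missing_per_column)

# dropping incomplete rows costs as many rows as the largest shift (3),
# because the NaN rows of the smaller shift overlap with those of the larger one
usable = seq_data.dropna()
print(len(usable))  # 23
```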


#### sequence generation for supervised learning

it's possible to build longer input sequences by repeating this process: each additional shifted column adds one more earlier (or later) step to every row.

with enough data, losing a few rows matters far less than it does on our extremely small, contrived dataset.
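
the repetition can be sketched as a loop over lags. `make_lagged_frame` below is a hypothetical helper written for this example (it is not a pandas function), and the column names 't-3' ... 't' are assumptions that follow the naming used above:

```python
import pandas as pd

def make_lagged_frame(values, n_lags):
    """hypothetical helper: one column per lag, plus the current value 't' as target."""
    series = pd.Series(values)
    # oldest lag first, so each row reads left to right in time order
    columns = {f't-{lag}': series.shift(lag) for lag in range(n_lags, 0, -1)}
    columns['t'] = series
    # rows without a full history contain NaN and are dropped
    return pd.DataFrame(columns).dropna().reset_index(drop=True)

framed = make_lagged_frame(range(26), n_lags=3)

X = framed.drop(columns='t')  # inputs: the three previous steps
y = framed['t']               # target: the current step

print(framed.head(3))
print(X.shape, y.shape)  # (23, 3) (23,)
```

each row of `X` now holds a short window of past steps, and `y` holds the value that follows it, which is the input-to-output shape a supervised model expects.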

#### stepping in the opposite direction

in order to shift the steps in the opposite direction, feed a negative number into the shift() function:

In [50]:
# define a fresh DataFrame

seq_data_2 = pd.DataFrame(list(range(26)))


print(seq_data_2)

     0
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11
12  12
13  13
14  14
15  15
16  16
17  17
18  18
19  19
20  20
21  21
22  22
23  23
24  24
25  25


In [51]:
# create the shift column

seq_data_2['t+1'] = seq_data_2.shift(-1)

print(seq_data_2)

     0   t+1
0    0   1.0
1    1   2.0
2    2   3.0
3    3   4.0
4    4   5.0
5    5   6.0
6    6   7.0
7    7   8.0
8    8   9.0
9    9  10.0
10  10  11.0
11  11  12.0
12  12  13.0
13  13  14.0
14  14  15.0
15  15  16.0
16  16  17.0
17  17  18.0
18  18  19.0
19  19  20.0
20  20  21.0
21  21  22.0
22  22  23.0
23  23  24.0
24  24  25.0
25  25   NaN


note: since we're shifting in the opposite direction this time, the NaN value appears in the bottom row, so that is the row we lose.

for an n-step shift, we would lose n rows:

In [52]:
# this time, there's more than one column
# so we specify which one we want to shift from
# i.e. seq_data_2[0]

seq_data_2['t+5'] = seq_data_2[0].shift(-5)

print(seq_data_2)

     0   t+1   t+5
0    0   1.0   5.0
1    1   2.0   6.0
2    2   3.0   7.0
3    3   4.0   8.0
4    4   5.0   9.0
5    5   6.0  10.0
6    6   7.0  11.0
7    7   8.0  12.0
8    8   9.0  13.0
9    9  10.0  14.0
10  10  11.0  15.0
11  11  12.0  16.0
12  12  13.0  17.0
13  13  14.0  18.0
14  14  15.0  19.0
15  15  16.0  20.0
16  16  17.0  21.0
17  17  18.0  22.0
18  18  19.0  23.0
19  19  20.0  24.0
20  20  21.0  25.0
21  21  22.0   NaN
22  22  23.0   NaN
23  23  24.0   NaN
24  24  25.0   NaN
25  25   NaN   NaN


for a 5-step shift, we lose five rows.
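
if only one of the shifted columns is needed as a target, the loss can be limited to that column's missing rows. the sketch below assumes the base column is named 't' rather than 0 (a renaming made here for readability) and uses dropna() with its `subset` parameter:

```python
import pandas as pd

seq_data_2 = pd.DataFrame({'t': list(range(26))})
seq_data_2['t+1'] = seq_data_2['t'].shift(-1)
seq_data_2['t+5'] = seq_data_2['t'].shift(-5)

# keep only the rows where the chosen target column is present
one_step = seq_data_2.dropna(subset=['t+1'])[['t', 't+1']]
five_step = seq_data_2.dropna(subset=['t+5'])[['t', 't+5']]

print(len(one_step), len(five_step))  # 25 21
print(five_step.tail(3))
```

this way a one-step-ahead target costs a single row, even though the same DataFrame also carries a five-step-ahead column.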

## conclusion

a supervised model needs pairs of related points: a first value as input, and a second, correct 'target' value to predict from it. this is a basic requirement of supervised learning that time-series and other sequence prediction problems don't naturally meet.

pandas' shift() function provides a neat tool to transform existing sequential data into input/target pairs that models can use to learn and make predictions.

##### for more information on the pandas shift() function:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html