### Part 2 - Dealing with Timeseries as a non-timeseries problem 

In [43]:
import pandas as pd 
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt 
from sklearn.metrics import r2_score
from statsmodels.graphics.tsaplots import plot_pacf
from sklearn.ensemble import RandomForestRegressor
%matplotlib inline

import warnings
warnings.filterwarnings(action="ignore")
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

import utils

![](https://i.imgflip.com/2acblw.jpg)

Let's get our airlines data 

In [2]:
airlines = utils.load_airline_data()

Ok Neo, picture this. We take the next period's data, and make it the target. 

In [3]:
airlines_as_dataframe = pd.DataFrame(airlines)

In [4]:
airlines_as_dataframe['target'] = airlines.shift(-1)

Now we take a few previous periods, and make them features. 

In [5]:
airlines_as_dataframe['1 period before'] = airlines.shift(1)
airlines_as_dataframe['2 periods before'] = airlines.shift(2)

Behold. 

In [6]:
airlines_as_dataframe.tail(3)

| Month | passengers_thousands | target | 1 period before | 2 periods before |
|---|---|---|---|---|
| 1960-09-30 | 508.0 | 461.0 | 606.0 | 622.0 |
| 1960-10-31 | 461.0 | 390.0 | 508.0 | 606.0 |
| 1960-11-30 | 390.0 | NaN | 461.0 | 508.0 |


With this, we can use the previous periods as features, and train a model to predict the target. 
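
To see both directions of `shift` side by side, here is a minimal sketch on a made-up five-value series. The `toy` values and dates are illustrative assumptions, not the airline data:

```python
import pandas as pd

# hypothetical toy series, month-end dates chosen for illustration only
toy = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.to_datetime(
        ["2000-01-31", "2000-02-29", "2000-03-31", "2000-04-30", "2000-05-31"]
    ),
    name="value",
)

toy_df = pd.DataFrame(toy)

# shift(-1) pulls the NEXT period's value onto the current row (the target)
toy_df["target"] = toy.shift(-1)

# shift(1) pushes the PREVIOUS period's value onto the current row (a feature)
toy_df["1 period before"] = toy.shift(1)

print(toy_df)
```

The last row has no target and the first row has no previous period, which is where the missing values come from.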

The idea is elegant, and sounds simple. There will however be a lot of weird Pandas to get the data ready. 

Remember Neo: 
> This section is completely optional.

This is your last chance. 

![](https://i.imgflip.com/2akbei.jpg)

-------

Remember. All I'm offering you is the Truth. 

We will be making some functions along the way, or this will get out of control fast. 

Firstly, a function to build our target, by getting the next period: 

In [7]:
def build_target(series_, number_of_periods_ahead):
    """ 
    Takes a series, turns it into a dataframe, and adds a new column called target.
    This column holds the value of the input series number_of_periods_ahead periods later.
    """
    
    # make a copy 
    series_ = series_.copy()
    series_.name = 'observed_values'
    
    # make a dataframe from the series
    df_ = pd.DataFrame(series_)
    
    # the target column will be the input series, lagged into the future
    df_['target'] = series_.shift(-number_of_periods_ahead)
    return df_

What does this do to our `airlines` data? 

In [8]:
airlines_with_target = build_target(airlines, number_of_periods_ahead=1)

airlines_with_target.tail()

| Month | observed_values | target |
|---|---|---|
| 1960-07-31 | 622.0 | 606.0 |
| 1960-08-31 | 606.0 | 508.0 |
| 1960-09-30 | 508.0 | 461.0 |
| 1960-10-31 | 461.0 | 390.0 |
| 1960-11-30 | 390.0 | NaN |


Quite simple: the target becomes the value of the next period. The last period, of course, does not have a target. 

Now, let's build some more features manually, by taking the differences between consecutive periods. Maybe this helps to give the model an idea of how things are changing: 

In [9]:
def build_some_features(df_, num_diffs, num_periods_lagged): 
    """
    Builds some features by calculating differences between periods  
    """
    # make a copy 
    df_ = df_.copy()
    
    # for a few values, get the diffs 
    for i in range(1, num_diffs+1):
        # make a new feature, with the diffs in the observed values column
        df_['diffed_%s' % str(i)] = df_['observed_values'].diff(i)
        
    # for a few values, get the lags  
    for i in range(1, num_periods_lagged+1):
        # make a new feature, with the observed values shifted back by i periods
        df_['lagged_%s' % str(i)] = df_['observed_values'].shift(i)
        
    return df_

In [10]:
airlines_with_some_hand_made_features = build_some_features(airlines_with_target, 2, 2)
airlines_with_some_hand_made_features.tail()

| Month | observed_values | target | diffed_1 | diffed_2 | lagged_1 | lagged_2 |
|---|---|---|---|---|---|---|
| 1960-07-31 | 622.0 | 606.0 | 87.0 | 150.0 | 535.0 | 472.0 |
| 1960-08-31 | 606.0 | 508.0 | -16.0 | 71.0 | 622.0 | 535.0 |
| 1960-09-30 | 508.0 | 461.0 | -98.0 | -114.0 | 606.0 | 622.0 |
| 1960-10-31 | 461.0 | 390.0 | -47.0 | -145.0 | 508.0 | 606.0 |
| 1960-11-30 | 390.0 | NaN | -71.0 | -118.0 | 461.0 | 508.0 |


So let's see what this means... 

> If I take `observed_values` for period `1960-10-31`     (461.0)   
> ... and subtract `observed_values` for period `1960-08-31` (606.0)  
> I get `diffed_2` for period `1960-10-31`               (-145.0)

> If I take `observed_values` for period `1960-10-31`     (461.0)   
> And move it to `1960-11-30`  
> I've created a `lagged_1` feature for period `1960-11-30`  (461.0)   

Does that make sense? There are obviously a lot more features we can hand-engineer, but let's start with the super basic stuff first. 
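
As a quick sanity check on the two definitions, the sketch below recomputes `diffed_2` by hand from the last four observed values in the table above. The variable names are illustrative:

```python
import pandas as pd

# the last four observed values shown in the table above
observed = pd.Series([606.0, 508.0, 461.0, 390.0], name="observed_values")

# diff(2) is the current value minus the value two rows earlier
diffed_2 = observed.diff(2)

# the same thing, spelled out with a lag
by_hand = observed - observed.shift(2)

print(pd.DataFrame({"observed": observed, "diffed_2": diffed_2, "by_hand": by_hand}))
```

Since `diffed_i` is exactly `observed_values - lagged_i`, the diff and lag features carry overlapping information for a linear model.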

Next up, let's set aside that last period (which we can't use for training, since it has no target), and then separate the features from the target: 

In [11]:
def separate_last_day(df_):
    
    """
    takes a dataset which has the target and features built 
    and separates the last period from the rest (the training data)
    """
    # take the last period 
    last_period = df_.iloc[-1]
    
    # the last period is now a series, so its name will be the timestamp
    training_data = df_.loc[df_.index < last_period.name]

    return last_period, training_data

Does that work? 

In [12]:
last_period, training_data = separate_last_day(airlines_with_some_hand_made_features)

What is our last period? 

In [13]:
last_period

observed_values    390.0
target               NaN
diffed_1           -71.0
diffed_2          -118.0
lagged_1           461.0
lagged_2           508.0
Name: 1960-11-30 00:00:00, dtype: float64

And the training data? 

In [14]:
training_data.tail(3)

| Month | observed_values | target | diffed_1 | diffed_2 | lagged_1 | lagged_2 |
|---|---|---|---|---|---|---|
| 1960-08-31 | 606.0 | 508.0 | -16.0 | 71.0 | 622.0 | 535.0 |
| 1960-09-30 | 508.0 | 461.0 | -98.0 | -114.0 | 606.0 | 622.0 |
| 1960-10-31 | 461.0 | 390.0 | -47.0 | -145.0 | 508.0 | 606.0 |


Excellent, it stops right before our last period. One reminder: it will still have missing data (building our features created missing values at the start), but we'll just get rid of those when the time comes: 

In [15]:
training_data.head(3)

| Month | observed_values | target | diffed_1 | diffed_2 | lagged_1 | lagged_2 |
|---|---|---|---|---|---|---|
| 1949-01-31 | 112.0 | 118.0 | NaN | NaN | NaN | NaN |
| 1949-02-28 | 118.0 | 132.0 | 6.0 | NaN | 112.0 | NaN |
| 1949-03-31 | 132.0 | 129.0 | 14.0 | 20.0 | 118.0 | 112.0 |
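
How many rows does the missing data cost us? A minimal sketch, assuming a short stand-in series (treat the values as illustrative) and two diffs plus two lags as in the table above:

```python
import pandas as pd

# hypothetical stand-in series for the observed values
observed = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0, 135.0], name="observed_values")

df = pd.DataFrame(observed)
df["diffed_1"] = observed.diff(1)
df["diffed_2"] = observed.diff(2)
df["lagged_1"] = observed.shift(1)
df["lagged_2"] = observed.shift(2)

# rows with any missing feature are dropped; here that is the first two rows
complete_rows = df.dropna()
print(len(df), "rows before,", len(complete_rows), "rows after dropna")
```

The longest diff or lag decides how many rows at the start are incomplete, so asking for more history per row leaves fewer rows to train on.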


Lastly, let's make a function for separating the features and targets, so we get our complete train/test split: 

In [16]:
def separate_train_and_test_set(last_period_, training_data_, target='target'): 
    
    """ 
    separates training and test set (clue was in the name, really... )
    Ok, we were lazy and left the target hardcoded as 'target'. Shame on us. 
    """
    
    # anything that isn't a target is a feature 
    features = [feature for feature in training_data_.columns if feature != target]
    
    # adding a sneaky little dropna to avoid the missing data problem above 
    X_train = training_data_.dropna()[features]
    y_train = training_data_.dropna()[target]
    
    X_last_period = last_period_[features]
    
    return X_train, y_train, X_last_period

In [17]:
X_train, y_train, X_last_period = separate_train_and_test_set(last_period, training_data, target='target')

In [18]:
X_last_period

observed_values    390.0
diffed_1           -71.0
diffed_2          -118.0
lagged_1           461.0
lagged_2           508.0
Name: 1960-11-30 00:00:00, dtype: float64

Let's take a look at our outputs: 

In [19]:
print(X_train.tail(2), end='\n\n')
print(y_train.tail(2), end='\n\n')
print(X_last_period.tail(2))

            observed_values  diffed_1  diffed_2  lagged_1  lagged_2
Month                                                              
1960-09-30            508.0     -98.0    -114.0     606.0     622.0
1960-10-31            461.0     -47.0    -145.0     508.0     606.0

Month
1960-09-30    461.0
1960-10-31    390.0
Freq: M, Name: target, dtype: float64

lagged_1    461.0
lagged_2    508.0
Name: 1960-11-30 00:00:00, dtype: float64


Great. Now let's make a utility function to put all this together: 

In [20]:
def prepare_for_prediction(series_, number_of_periods_ahead, num_diffs, num_periods_lagged):
    
    """ 
    Wrapper to go from the original series to X_train, y_train, X_last_period 
    
    """
    
    # build the target 
    data_with_target = build_target(series_, 
                                    number_of_periods_ahead)
    
    # build the features 
    data_with_target_and_features = build_some_features(data_with_target, 
                                                             num_diffs=num_diffs, 
                                                        num_periods_lagged=num_periods_lagged)
    # separate train and test data 
    last_period, training_data = separate_last_day(data_with_target_and_features)

    # separate X_train, y_train, and X_test 
    X_train, y_train, X_last_period = separate_train_and_test_set(last_period, 
                                                           training_data, 
                                                           target='target')
    
    # return ALL OF THE THINGS! (well, actually just the ones we need)
    return X_train, y_train, X_last_period 

Did that work? 

In [21]:
X_train, y_train, X_last_period = prepare_for_prediction(airlines, 
                                                         number_of_periods_ahead=1, 
                                                         num_diffs=3, 
                                                        num_periods_lagged=3)

In [22]:
# this is just to see X train and y train side by side 
pd.concat([X_train, y_train], axis=1).tail()

| Month | observed_values | diffed_1 | diffed_2 | diffed_3 | lagged_1 | lagged_2 | lagged_3 | target |
|---|---|---|---|---|---|---|---|---|
| 1960-06-30 | 535.0 | 63.0 | 74.0 | 116.0 | 472.0 | 461.0 | 419.0 | 622.0 |
| 1960-07-31 | 622.0 | 87.0 | 150.0 | 161.0 | 535.0 | 472.0 | 461.0 | 606.0 |
| 1960-08-31 | 606.0 | -16.0 | 71.0 | 134.0 | 622.0 | 535.0 | 472.0 | 508.0 |
| 1960-09-30 | 508.0 | -98.0 | -114.0 | -27.0 | 606.0 | 622.0 | 535.0 | 461.0 |
| 1960-10-31 | 461.0 | -47.0 | -145.0 | -161.0 | 508.0 | 606.0 | 622.0 | 390.0 |


In [23]:
# what about our last period X? 
X_last_period

observed_values    390.0
diffed_1           -71.0
diffed_2          -118.0
diffed_3          -216.0
lagged_1           461.0
lagged_2           508.0
lagged_3           606.0
Name: 1960-11-30 00:00:00, dtype: float64

Huzzah! Now we can treat this as a normal regression problem (kind of). 

Let's try to predict the next period, using a super-basic [vanilla](https://youtu.be/rog8ou-ZepE?t=58s) [Linear Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html): 

In [24]:
# what's the dumbest, simplest model I can think of? 
lr = LinearRegression()

In [25]:
lr.fit(X_train, y_train);

So what would this model predict for our next period?   

In [26]:
lr.predict(X_last_period.values.reshape(1, -1))

array([375.98517024])

Note: we are predicting a single observation, hence the annoying `.values.reshape` stuff. 

The API expects a 2D array, even when predicting a single point. If you forget, scikit-learn's error message will tell you what to do, so don't worry too much about it.
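
Here is a minimal sketch of two ways to hand a single observation to `predict`, on made-up data (`X_toy`, `y_toy` and `single` are hypothetical names, not part of the notebook). The second option turns the Series into a one-row DataFrame, which keeps the column names:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical toy training data: the target is exactly a + 2*b
X_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 1.0, 0.0, 1.0]})
y_toy = X_toy["a"] + 2 * X_toy["b"]

model = LinearRegression().fit(X_toy, y_toy)

# a single observation as a Series, like X_last_period
single = pd.Series({"a": 5.0, "b": 1.0})

# option 1: drop to NumPy and reshape to (1, n_features)
pred_reshape = model.predict(single.values.reshape(1, -1))

# option 2: a one-row DataFrame with the same columns as the training data
pred_frame = model.predict(single.to_frame().T)

print(pred_reshape, pred_frame)
```

Both calls return the same prediction. Recent scikit-learn versions may warn about missing feature names with option 1, because the model was fitted on a DataFrame.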

How did our model make its prediction? 

In [27]:
def explain_linear_regression(lr, features):
    
    betas = lr.coef_
    print('Regression: \n(%0.3f * %s)' % (betas[0], features[0]))
    for i in range(1, len(betas)): 
        print('+ (%0.3f * %s)' % (betas[i], features[i]))


In [28]:
explain_linear_regression(lr, X_train.columns)

Regression: 
(0.461 * observed_values)
+ (0.503 * diffed_1)
+ (0.204 * diffed_2)
+ (0.181 * diffed_3)
+ (-0.042 * lagged_1)
+ (0.258 * lagged_2)
+ (0.280 * lagged_3)
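
One thing the printout above leaves out is the intercept: a fitted `LinearRegression` predicts `intercept_` plus the weighted sum of the features. A minimal sketch on made-up data (all names and numbers below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical toy data: y = 3 + 2*x0 - 1*x1
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]])
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

lr_toy = LinearRegression().fit(X_toy, y_toy)

x_new = np.array([4.0, 2.0])

# the prediction is the intercept plus the weighted sum of the features
by_hand = lr_toy.intercept_ + np.dot(lr_toy.coef_, x_new)
from_model = lr_toy.predict(x_new.reshape(1, -1))[0]

print(lr_toy.intercept_, lr_toy.coef_, by_hand, from_model)
```

So to reproduce a prediction from the coefficients alone, `lr.intercept_` has to be added as well.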


---- 

![](https://i.imgflip.com/2akdo8.jpg)

----

At this point, we have two options. We can either:

1. Take the prediction we've just made, feed it as truth into the model, and predict again. 
2. Just train the model to predict two periods in advance. 

Given that we aren't insane, we'll choose #2. 
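
For contrast, this is roughly what option 1 (the recursive route) looks like. It is a minimal sketch on a made-up series that a two-lag linear model can describe exactly, so the toy numbers stay clean. `history` and the feature layout are assumptions for illustration, not the notebook's features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical toy series (its second differences are constant, so a linear
# model on two lags can describe it exactly)
history = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0, 37.0, 46.0]

# one-step-ahead training set: features are (current, previous), target is next
X = np.array([[history[t], history[t - 1]] for t in range(1, len(history) - 1)])
y = np.array([history[t + 1] for t in range(1, len(history) - 1)])
one_step_model = LinearRegression().fit(X, y)

# option 1 (recursive): predict one step, append the prediction as if it were
# an observation, and predict again
extended = list(history)
recursive_predictions = []
for _ in range(3):
    features = np.array([[extended[-1], extended[-2]]])
    next_value = one_step_model.predict(features)[0]
    recursive_predictions.append(next_value)
    extended.append(next_value)

print(recursive_predictions)
```

On real, noisy data every prediction carries some error, and feeding it back in as an input lets that error compound. That is a commonly cited drawback of the recursive route.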


In [29]:
X_train, y_train, X_last_period = prepare_for_prediction(airlines, 
                                                         number_of_periods_ahead=2, 
                                                         num_diffs=3, 
                                                        num_periods_lagged=3)

In [30]:
pd.concat([X_train, y_train], axis=1).tail()

| Month | observed_values | diffed_1 | diffed_2 | diffed_3 | lagged_1 | lagged_2 | lagged_3 | target |
|---|---|---|---|---|---|---|---|---|
| 1960-05-31 | 472.0 | 11.0 | 53.0 | 81.0 | 461.0 | 419.0 | 391.0 | 622.0 |
| 1960-06-30 | 535.0 | 63.0 | 74.0 | 116.0 | 472.0 | 461.0 | 419.0 | 606.0 |
| 1960-07-31 | 622.0 | 87.0 | 150.0 | 161.0 | 535.0 | 472.0 | 461.0 | 508.0 |
| 1960-08-31 | 606.0 | -16.0 | 71.0 | 134.0 | 622.0 | 535.0 | 472.0 | 461.0 |
| 1960-09-30 | 508.0 | -98.0 | -114.0 | -27.0 | 606.0 | 622.0 | 535.0 | 390.0 |


Quick note: the training data now ends at `1960-09-30`; it used to have one more period. Because we are predicting two periods in advance, the last two periods have no target. Makes sense? 

You will also notice that `X_last_period` is `1960-11-30`, two periods after the last entry in the training data:

In [31]:
X_last_period

observed_values    390.0
diffed_1           -71.0
diffed_2          -118.0
diffed_3          -216.0
lagged_1           461.0
lagged_2           508.0
lagged_3           606.0
Name: 1960-11-30 00:00:00, dtype: float64

Great! Fit it! 

In [32]:
# weeeee 
lr = LinearRegression()
lr.fit(X_train, y_train);
lr.predict(X_last_period.values.reshape(1, -1))

array([419.54756245])

And yes, that does indeed mean that you can use whatever model you want. 

![](https://i.imgflip.com/2akf0s.jpg)

But beware, fitting the models and particularly validating them by hand is not particularly trivial. 
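
One reason validation is fiddly is that the folds have to respect time order. As a minimal sketch, scikit-learn's `TimeSeriesSplit` produces such folds. The twelve positions below are a hypothetical stand-in for real observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# twelve hypothetical observations, already in time order
n_samples = 12
positions = np.arange(n_samples)

splitter = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in splitter.split(positions):
    # every training index comes before every test index
    print("train:", train_idx, "test:", test_idx)
```

Each fold trains on an expanding window of the past and tests on the block that follows it, so the model never sees the future during training.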


# Predicting as many periods as we like 

(I told you to take the blue pill, Neo.)

In [33]:
def predict_period_n(series_, model, number_of_periods_ahead, num_diffs, num_periods_lagged): 
    
    # build the training data for this specific horizon
    X_train, y_train, X_last_period = prepare_for_prediction(series_, 
                                                             number_of_periods_ahead, 
                                                             num_diffs, 
                                                             num_periods_lagged)
    
    # fit on that horizon's data, then predict from the last observed period
    model.fit(X_train, y_train)
    return model.predict(X_last_period.values.reshape(1, -1))

In [34]:
def predict_n_periods(series_, n_periods, model, num_diffs, num_periods_lagged): 
    predictions = []

    for period_ahead in range(1, n_periods+1):
        pred = predict_period_n(series_=series_, 
                                model=model, 
                                number_of_periods_ahead=period_ahead, 
                                num_diffs=num_diffs,
                                num_periods_lagged=num_periods_lagged)
        
        predictions.append(pred[0])
        
    return predictions 

Let's predict a few periods ahead: 

In [35]:
predict_n_periods(series_=airlines, 
                  n_periods=2, 
                  model=LinearRegression(), 
                  num_diffs=3, 
                  num_periods_lagged=3)

[375.9851702371863, 419.54756244601714]

## How did we do? 

Actually, we have no idea. We don't have the future yet, so we can't know what the answer should be. 

Let's get a quick split _(you'll get better ways to do this in Part 3 - workflows)_

In [36]:
split_date = '1960-02-29'
train = airlines.loc[airlines.index < split_date]
test = airlines.loc[airlines.index >= split_date].iloc[0:3]

Did it work? 

In [37]:
train.tail(2)

Month
1959-12-31    405.0
1960-01-31    417.0
Freq: M, Name: passengers_thousands, dtype: float64

In [38]:
test

Month
1960-02-29    391.0
1960-03-31    419.0
1960-04-30    461.0
Freq: M, Name: passengers_thousands, dtype: float64

Yep. Now, let's train on `train`, and predict the 3 periods of `test`:

In [39]:
predictions = predict_n_periods(series_=train, 
                  n_periods=3, 
                  model=LinearRegression(), 
                  num_diffs=3, 
                  num_periods_lagged=3)

print('Predictions: %s'  % predictions)

print('\nHow wrong were we on each day?')
print(test - predictions)

Predictions: [418.7765068623498, 421.668065081549, 405.70519573607316]

How wrong were we on each day?
Month
1960-02-29   -27.776507
1960-03-31    -2.668065
1960-04-30    55.294804
Freq: M, Name: passengers_thousands, dtype: float64


Not amazing... How about more features? 

In [40]:
predictions = predict_n_periods(series_=train, 
                  n_periods=3, 
                  model=LinearRegression(), 
                  num_diffs=7, 
                  num_periods_lagged=7)

print('Predictions: %s'  % predictions)

print('\nHow wrong were we on each day?')
print(test - predictions)

Predictions: [415.6171997346689, 423.21634069801127, 415.83073167649667]

How wrong were we on each day?
Month
1960-02-29   -24.617200
1960-03-31    -4.216341
1960-04-30    45.169268
Freq: M, Name: passengers_thousands, dtype: float64


Meh. Different model? 

In [42]:
predictions = predict_n_periods(series_=train, 
                  n_periods=3, 
                  model=RandomForestRegressor(n_estimators=100, max_depth=3, min_samples_split=4), 
                  num_diffs=7, 
                  num_periods_lagged=7)

print('Predictions: %s'  % predictions)

print('\nHow wrong were we on each day?')
print(test - predictions)

Predictions: [425.38838998426763, 424.73036836782273, 397.71502562247014]

How wrong were we on each day?
Month
1960-02-29   -34.388390
1960-03-31    -5.730368
1960-04-30    63.284974
Freq: M, Name: passengers_thousands, dtype: float64


... and so on and so forth. 
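
To compare the three attempts with a single number, here is a minimal sketch that computes the mean absolute error (MAE) from the values printed above. The last entry is a naive baseline added purely for comparison (an assumption, not something the notebook ran): it repeats the last training value, 417.0, for every period.

```python
import numpy as np

# actual values and predictions copied from the outputs above
actual = np.array([391.0, 419.0, 461.0])
predictions_by_model = {
    "linear regression, 3 diffs/lags": [418.7765068623498, 421.668065081549, 405.70519573607316],
    "linear regression, 7 diffs/lags": [415.6171997346689, 423.21634069801127, 415.83073167649667],
    "random forest, 7 diffs/lags": [425.38838998426763, 424.73036836782273, 397.71502562247014],
    # naive baseline (added for comparison): repeat the last training value
    "repeat last observed value": [417.0, 417.0, 417.0],
}

# mean absolute error per model
mae_by_model = {
    name: float(np.mean(np.abs(actual - np.array(preds))))
    for name, preds in predictions_by_model.items()
}

for name, mae in mae_by_model.items():
    print("%-35s MAE = %.2f" % (name, mae))
```

On this particular three-period window the naive baseline lands in the same range as the best of the three models, a reminder that one short test window says little about which model is actually better.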

# Is this useful at all? 

As with so many things in life, it depends. In some cases, a simple model is the best answer, and avoiding getting stuck in complex timeseries models like SARIMAX can be useful. 

Maybe you just want to backtest your theory that the price of Apple is linearly dependent on the previous 4 days, or want to throw a Tree at the data and see what comes back for the sake of interpretability. 

Consider it another tool in your toolbelt, but one you don't necessarily need to pull out, unless the normal tools are failing you. And now, Part 3, the last one before the hackathon: workflows! 