# Baseline Model
## Introduction

### What is a baseline model?
- A baseline model is a trivial solution to our problem. It often uses heuristics, or simple statistics, to generate predictions. 

### Baseline model for time series
- For time series, common heuristics and simple statistics include:
    - Method 1: compute the mean of the values over **a certain period** and assume that future values will equal that mean.
        - For example, when predicting the EPS for Johnson & Johnson, the average EPS between 1960 and 1979 was $4.31. We would therefore expect the EPS for each of the four quarters of 1980 to equal $4.31.
    - Method 2: naively forecast the last recorded data point.
        - For example, if the EPS is $0.71 for this quarter, then the EPS will also be $0.71 for the next quarter.
    - Method 3: if the data shows a cyclical pattern, repeat that pattern into the future.
        - For example, if the EPS is $14.04 for the first quarter of 1979, then the EPS for the first quarter of 1980 will also be $14.04.
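
As a rough illustration, the three heuristics above can be sketched as small functions. The function names, the `season_length` parameter, and the toy series are assumptions made for this example; they are not part of the notebook's code, which builds the same baselines directly on the J&J data further down.

```python
import numpy as np

def mean_forecast(history, horizon):
    """Method 1: repeat the mean of the history for every future step."""
    return np.full(horizon, np.mean(history))

def last_value_forecast(history, horizon):
    """Method 2: repeat the last observed value for every future step."""
    return np.full(horizon, history[-1])

def seasonal_naive_forecast(history, horizon, season_length):
    """Method 3: repeat the last full season until the horizon is covered."""
    last_season = np.asarray(history[-season_length:])
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_season, reps)[:horizon]

# Toy quarterly series with made-up numbers (two years of data)
toy = np.array([1.0, 2.0, 3.0, 2.0, 2.0, 3.0, 4.0, 3.0])

print(mean_forecast(toy, 4))
print(last_value_forecast(toy, 4))
print(seasonal_naive_forecast(toy, 4, season_length=4))
```

With quarterly data, `season_length=4` makes the seasonal forecast copy each quarter of the last observed year onto the same quarter of the next year.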



In [43]:
import pandas as pd
import numpy as np
import plotly.express as px


In [44]:
df = pd.read_csv("../data/book-time-series-forecasting-in-python/jj.csv",
                 parse_dates=[0])

In [45]:
df.tail()

| | date | data |
|---|---|---|
| 79 | 1979-10-01 | 9.99 |
| 80 | 1980-01-01 | 16.2 |
| 81 | 1980-04-01 | 14.67 |
| 82 | 1980-07-02 | 16.02 |
| 83 | 1980-10-01 | 11.61 |


- The train set consists of the data from 1960 to the end of 1979.
- The test set consists of the four quarters of 1980.
- The goal of the model is to predict the EPS for the four quarters of 1980.

In [46]:
# train-test split
train = df[:-4]
test = df[-4:] # the last 4 records are EPS of 1980

In [47]:
fig = px.line(df, x='date', y='data')
fig.update_layout(
    title="Quarterly EPS - J&J stock",
    yaxis_title="EPS"
)
fig.add_vrect(
    x0=test.date.iloc[0], # start of test period
    x1=test.date.iloc[-1], # end of test period
    fillcolor="grey",
    opacity=0.3,
    line_width=0,
    name="test_period"
)
fig.show()

In [48]:
# metrics
def mape(y_true, y_pred):
    return np.mean(np.abs((y_true-y_pred)/y_true)) * 100
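
For reference, the `mape` function above corresponds to the usual definition of the mean absolute percentage error. The symbols are introduced here for illustration and do not appear in the original code: $y_i$ is an actual value, $\hat{y}_i$ the matching prediction, and $n$ the number of forecasted points.

$$
\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
$$

Because each error is divided by $y_i$, the metric is undefined whenever an actual value is zero.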

### Method 1: Implementing the historical mean baseline
- First use the arithmetic mean of the entire train set.

In [49]:
historical_mean = np.mean(train['data'])
print(historical_mean)

4.308499987499999


In [50]:
# test was created by slicing df; take an explicit copy so that adding
# columns does not trigger pandas' SettingWithCopyWarning
test = test.copy()
test['pred_mean'] = historical_mean



In [51]:
print(f"MAPE: {mape(test['data'], test['pred_mean'])}")

MAPE: 70.00752579965119


- The MAPE is 70.01%, meaning that the forecasts of this baseline deviate from the actual values by about 70% on average.


In [52]:
# mean of the last year of the train set (the four quarters of 1979)
last_year_mean = np.mean(train['data'][-4:])
test.loc[:, 'pred__last_yr_mean'] = last_year_mean
print(f"MAPE: {mape(test['data'], test['pred__last_yr_mean'])}")

MAPE: 15.5963680725103







- This new baseline is a clear improvement over the previous one: the MAPE drops from 70% to 15.6%.
- Our forecasts now deviate from the observed values by 15.6% on average.
- This baseline suggests that future values depend on past values that are not too far back in history.
    - This is a sign of **autocorrelation**.
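
One way to probe this, as a minimal sketch: pandas provides `Series.autocorr(lag=...)`, which returns the Pearson correlation between a series and a lagged copy of itself. The toy numbers and the variable names below are made up for illustration and are not taken from the J&J data.

```python
import pandas as pd

# Made-up, slowly rising series (hypothetical values, not the J&J EPS)
toy_series = pd.Series([1.0, 1.2, 1.1, 1.5, 1.7, 1.6, 2.0, 2.3])

# Correlation between each value and the value one step earlier
lag1_autocorr = toy_series.autocorr(lag=1)
print(f"Lag-1 autocorrelation: {lag1_autocorr:.2f}")
```

A value close to 1 would indicate that recent observations carry a lot of information about the next one. Applying the same call to `train['data']` would be the natural next step.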

### Method 2: Predicting using the last known value

In [53]:
# using the last known value of the training set as a baseline model
last = train["data"].iloc[-1]
test.loc[:, 'pred_last'] = last
print(f"MAPE: {mape(test['data'], test['pred_last'])}")

MAPE: 30.457277908606535







- This hypothesis did not improve on the previous baseline: we get a MAPE of 30.46%, whereas the mean EPS over 1979 achieved a MAPE of 15.60%.
- This can be explained by the fact that the EPS displays a **cyclical behavior**: it is high during the first three quarters and then falls in the last quarter.
- Using the last known value **does not take the seasonality into account**, so we need another naive forecasting technique to see if we can produce a better baseline.
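
To make the seasonality argument concrete, the sketch below breaks the error of the last-value baseline down by quarter. It uses only the values printed by `df.tail()` earlier; the variable names are introduced here for illustration.

```python
import numpy as np

# Values copied from the df.tail() output above
last_known = 9.99                                     # Q4 1979, last train value
actual_1980 = np.array([16.2, 14.67, 16.02, 11.61])   # Q1..Q4 1980

# Absolute percentage error of the last-value forecast for each quarter
ape = np.abs((actual_1980 - last_known) / actual_1980) * 100
for quarter, err in zip(["Q1", "Q2", "Q3", "Q4"], ape):
    print(f"{quarter} 1980: {err:.2f}%")
print(f"MAPE: {ape.mean():.2f}%")
```

The fourth-quarter error should come out far smaller than the other three. That is consistent with the forecast being a fourth-quarter value, which only resembles other fourth quarters.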

### Method 3: Implementing naive seasonal forecast
- The naive seasonal forecast takes the last observed cycle and repeats it into the future.

In [54]:
test.loc[:, 'pred_last_season'] = train['data'][-4:].values
print(f"MAPE: {mape(test['data'], test['pred_last_season'])}")

MAPE: 11.561658552433654







- This gives us a MAPE of 11.56%, which is the lowest MAPE of all the baselines.
- This means that seasonality has a significant impact on future values, since repeating the last season into the future yields fairly accurate forecasts.
- Seasonal effects will have to be considered when we develop a more complex forecasting model for this problem.

In [55]:
def plot_baseline_pred(fig_obj, df, baseline_list):
    # Overlay each baseline's forecast column from df on the existing figure
    for bl_method in baseline_list:
        # read predictions from the df argument rather than the global test
        fig_obj.add_scatter(x=df['date'], y=df[bl_method], mode='lines', name=f'{bl_method}_baseline')
    fig_obj.show()

plot_baseline_pred(fig, test, ["pred_mean", "pred__last_yr_mean", "pred_last", "pred_last_season"])