## Modelling outliers and special events with dummy variables

* If the cause of an outlier is known, we can use that knowledge to improve our forecasts.
* Dummy variables can be used to model the impact of outliers.
* Including a dummy variable keeps an outlier from distorting the rest of the model.

## Methods for handling outliers
There are two ways of handling outliers. We can modify the data, for example by treating the outliers as missing values and imputing them with methods such as linear interpolation or forward-fill, or we can modify the model using feature engineering.

* 1 : Treat the outliers as missing data and impute
    * PROS
        * Easy to implement
    * CONS
        * Forward-fill replaces the outlier using past data only
        * STL or LOWESS can be used, but they are compute-intensive
        * **This is not useful if you know the cause of outliers**
        * **This is not useful if you know that the outliers will repeat in the future**
* 2 : **Model the outliers**
    * WHEN TO MODEL THE OUTLIERS
        * Suppose we run a shop that sells cards, flowers, chocolates, etc. We can expect increased sales on 14 February every year.
        * Suppose you are forecasting sales
            * We should expect high sales on Black Friday
            * We should expect high sales on Boxing Day
            * Shopping festivals, promotion periods, etc. are also events on which we can expect high sales
        * These are the outliers that we can model.
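
The first option (treat as missing and impute) can be sketched as follows. This is a minimal illustration on a made-up monthly series, not the retail dataset, and it assumes the outlier timestamps are already known.

```python
import pandas as pd

# Hypothetical monthly series with one obvious spike (not the retail dataset).
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
y = pd.Series([10.0, 11.0, 12.0, 50.0, 14.0, 15.0, 16.0, 17.0], index=idx)

# Assumption: we already know which timestamps are outliers.
outlier_mask = y.index.isin(pd.to_datetime(["2020-04-01"]))

# Treat the outliers as missing values.
y_missing = y.mask(outlier_mask)

# Forward-fill copies the last valid observation, so it only uses past data.
y_ffill = y_missing.ffill()

# Linear interpolation uses the neighbours on both sides of the gap.
y_linear = y_missing.interpolate(method="linear")

print(pd.DataFrame({"original": y, "ffill": y_ffill, "linear": y_linear}))
```

Forward-fill repeats the March value, while linear interpolation lands halfway between March and May. Either way the spike is simply erased, which is why imputation is a poor fit for events that will happen again.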

## Modeling outliers and special events

* Before modeling the outliers, we should first consider what causes them
    * WHAT CAUSES THE OUTLIERS
        * If the outliers are the result of a random event, such as a recording error
            * then it is okay to impute them with simple methods like `ffill`
        * Every Valentine's Day there is increased demand for flowers
            * Such events happen on the same date every year, so we can model them using a dummy variable
        * Events such as `Public holidays`, `Sport events`, `Festivals`, etc.
            * We can add these to the model using dummy variables
        * Suppose we are running a sales promotion
            * Increased sales during the promotion can be modeled using dummy variables
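
As a rough sketch of the cases listed above, dummy variables for a fixed-date event and for a promotion window could be built like this. The daily index, the column names `is_valentines` and `is_promo`, and the promotion dates are all made up for illustration.

```python
import pandas as pd

# Hypothetical daily index covering two years, including dates we want to forecast.
idx = pd.date_range("2023-01-01", "2024-12-31", freq="D")
features = pd.DataFrame(index=idx)

# Fixed-date event: Valentine's Day falls on 14 February every year.
features["is_valentines"] = ((idx.month == 2) & (idx.day == 14)).astype(int)

# Events without a fixed date (promotions, moving holidays) need an explicit list.
# These promotion dates are invented for the example.
promo_days = pd.date_range("2023-11-20", "2023-11-26", freq="D")
features["is_promo"] = idx.isin(promo_days).astype(int)

# Number of flagged days per dummy column.
print(features.sum())
```

Because the dummies depend only on the calendar, they can be filled in for future dates as well, which is what makes them usable at forecast time.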




<img src='./Notes/Handle-outliers.PNG'>

In [32]:
import datetime
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LinearRegression

from sktime.utils.plotting import plot_series
from statsmodels.tsa.deterministic import DeterministicProcess

## Data Set Synopsis

The time series runs from January 1992 to April 2005.

It consists of a single series of monthly values representing sales volumes.

The series is based on a monthly retail sales dataset (found [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv)).

In [4]:
df = pd.read_csv('../../Datasets/example_retail_sales_with_outliers.csv', parse_dates=['ds'], index_col=['ds'])

plot_series(df)
plt.xticks(rotation=20);

<img src='./plots/retail-sales-with-outliers-df.png'>

### Let's assume we know the cause of these outliers, and therefore know their dates in both the past and the future. We just want to see the impact of the outliers on a model.

In [6]:
# Convert to Timestamps so matching against the DatetimeIndex is type-consistent.
outlier_dates = pd.to_datetime([datetime.date(1993, 9, 1),
                                datetime.date(1994, 10, 1),
                                datetime.date(1997, 7, 1),
                                datetime.date(2004, 7, 1)])

## Create a dummy variable

In [8]:
df['is_outlier'] = np.where(df.index.isin(outlier_dates), 1, 0)

df['is_outlier'].value_counts()

0    156
1      4
Name: is_outlier, dtype: int64
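
A small sanity check can help here: `isin` silently ignores any event date that is not present in the index. The sketch below uses a made-up index and made-up event dates (`df_demo` and `event_dates` are hypothetical names) to show one way of listing unmatched dates.

```python
import pandas as pd

# Hypothetical monthly index, standing in for the dataset's index.
idx = pd.date_range("1992-01-01", periods=24, freq="MS")
df_demo = pd.DataFrame(index=idx)

# Hypothetical event dates; the last one is deliberately mid-month.
event_dates = pd.to_datetime(["1992-03-01", "1993-09-01", "1993-10-15"])

df_demo["is_event"] = df_demo.index.isin(event_dates).astype(int)

# Dates that did not match any row would otherwise be dropped without warning.
unmatched = event_dates.difference(df_demo.index)
print(df_demo["is_event"].sum(), list(unmatched))
```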

## Trend as a feature

In [18]:
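# order=1 adds a linear trend; period=12 has no effect here because
# no seasonal or Fourier terms are requested.
feature = DeterministicProcess(df.index, period=12, order=1)
feature.in_sample().head()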

| ds | trend |
|---|---|
| 1992-01-01 | 1.0 |
| 1992-02-01 | 2.0 |
| 1992-03-01 | 3.0 |
| 1992-04-01 | 4.0 |
| 1992-05-01 | 5.0 |
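
In this notebook the trend is computed over the whole index before splitting, so the test rows already have trend values. For dates beyond the available data, `DeterministicProcess.out_of_sample` continues the same trend. A minimal sketch on a hypothetical 24-month index:

```python
import pandas as pd
from statsmodels.tsa.deterministic import DeterministicProcess

# Hypothetical monthly index standing in for the observed period.
idx = pd.date_range("1992-01-01", periods=24, freq="MS")

# order=1 requests a linear trend; constant=False leaves the intercept to the model.
dp = DeterministicProcess(idx, constant=False, order=1)

in_sample = dp.in_sample()           # trend runs over the observed index
future = dp.out_of_sample(steps=6)   # trend continues for the next 6 months

print(in_sample.tail(2))
print(future)
```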


In [19]:
df['trend'] = feature.in_sample()['trend']

df.head()

| ds | y | is_outlier | trend |
|---|---|---|---|
| 1992-01-01 | 146376.0 | 0 | 1.0 |
| 1992-02-01 | 147079.0 | 0 | 2.0 |
| 1992-03-01 | 159336.0 | 0 | 3.0 |
| 1992-04-01 | 163669.0 | 0 | 4.0 |
| 1992-05-01 | 170068.0 | 0 | 5.0 |


## Split into train and test

We will train the model on the data up to the end of 2003 and predict everything after that.

In [25]:
df_train = df[df.index < '2004-01-01']
df_test = df[df.index >= '2004-01-01']

In [31]:
plot_series(df_train['y'], df_test['y'], labels=['train', 'test'])
plt.xticks(rotation=20);

<img src='./plots/train-test-split.png'>

## Train a model with trend only as feature

In [67]:
linear_with_just_trend = LinearRegression()
linear_with_just_trend.fit(df_train.loc[:,['trend']], df_train['y'])
y_pred_with_trend = linear_with_just_trend.predict(df_test.loc[:,['trend']])


plt.figure(figsize=(15,4))
plt.plot(df_train['y'], label='train')
plt.plot(df_test.index, y_pred_with_trend,linewidth=4, c='r', label='prediction')
plt.plot(df_test.index, df_test['y'], linestyle='--', marker='.', c='g', label='test')
plt.legend()

<img src='./plots/modeling-outlier-with-trend.png'>

## Train a model with both trend and dummy variable as features

In [68]:
linear_with_just_trend_and_dummy = LinearRegression()
linear_with_just_trend_and_dummy.fit(df_train.loc[:,['trend','is_outlier']], df_train['y'])
y_pred_with_trend_and_dummy = linear_with_just_trend_and_dummy.predict(df_test.loc[:,['trend','is_outlier']])


plt.figure(figsize=(15,4))
plt.plot(df_train['y'], label='train')
plt.plot(df_test.index, y_pred_with_trend_and_dummy, linewidth=4, c='r', label='prediction', alpha=0.8)
plt.plot(df_test.index, df_test['y'], linestyle='--', marker='.', c='g', label='test')
plt.legend()

<img src='./plots/modeling-outlier-with-trend-and-dummy-var.png'>
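
To see what the dummy variable buys us, here is a small sketch on synthetic, noise-free data where the true intercept (100), slope (2), and spike size (50) are known. The numbers and event positions are invented; this is not the retail dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free series: y = 100 + 2 * trend, plus a known spike of 50.
n = 48
X = pd.DataFrame({"trend": np.arange(1, n + 1, dtype=float)})
X["is_outlier"] = 0
X.loc[[5, 17, 29, 41], "is_outlier"] = 1   # hypothetical event positions
y = 100 + 2 * X["trend"] + 50 * X["is_outlier"]

trend_only = LinearRegression().fit(X[["trend"]], y)
trend_and_dummy = LinearRegression().fit(X[["trend", "is_outlier"]], y)

# Trend only: the unmodelled spikes leak into the intercept and slope.
print(trend_only.intercept_, trend_only.coef_)

# Trend + dummy: the dummy coefficient absorbs the spike,
# so the intercept and slope match the values used to build y.
print(trend_and_dummy.intercept_, trend_and_dummy.coef_)
```

The coefficient on `is_outlier` is the estimated size of the event effect, and the trend coefficients are no longer pulled towards the spikes.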

### We can see that adding a dummy variable allows us to estimate the impact of future outliers or special events. This is useful when we know in advance when such events will occur.