## Detecting outliers using the rolling mean

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
from sktime.utils.plotting import plot_series

from statsmodels.tsa.seasonal import STL

## Data Set Synopsis

The time series runs from January 1992 to April 2005.

It consists of a single series of monthly values representing sales volumes.

The source is a monthly retail sales dataset (found [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv)).

In [8]:
df = pd.read_csv('../../Datasets/example_retail_sales_with_outliers.csv', parse_dates=['ds'], index_col=['ds'])

plot_series(df)
plt.xticks(rotation=20);

<img src='./plots/retail-sales-with-outliers-df.png'>

## Outliers in timeseries data

* ### Point outliers
    * Individual points are outliers

* ### Level Shift Outliers
    * These are sub-sequence outliers: a consecutive section of the time series is anomalous
    * The baseline of the time series shifts abruptly and stays at the new level

* ### Transient Shift Outliers
    * These are also sub-sequence outliers: a consecutive section of the time series is anomalous
    * The baseline of the time series shifts abruptly, but the shift is transient
    * The shift decays over time, hence "transient"

<img src='./Notes/different-outlier-types.PNG'>
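The three outlier types can be reproduced on a synthetic series. This is a minimal sketch: the flat baseline, the shift size of 50, and the decay rate of 0.7 are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical flat monthly series used only for illustration.
index = pd.date_range("2000-01-01", periods=48, freq="MS")
baseline = pd.Series(100.0, index=index)

# Point outlier: a single observation is far from its neighbours.
point = baseline.copy()
point.iloc[20] += 50

# Level shift: the baseline jumps at one point and stays there.
level_shift = baseline.copy()
level_shift.iloc[20:] += 50

# Transient shift: the baseline jumps, then decays back geometrically.
transient = baseline.copy()
decay = 0.7  # assumed decay rate per month
steps = np.arange(len(index) - 20)
transient.iloc[20:] += 50 * decay ** steps
```

Plotting `point`, `level_shift` and `transient` side by side gives shapes similar to the figure above.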

## Why are outliers a problem

* Outliers bias the model
* Time series decomposition results are distorted if there are outliers
    * A trend computed with a rolling average is inflated whenever an outlier falls inside the window
    * Outliers distort the result of classical seasonal decomposition
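The effect on a rolling-average trend can be seen on a toy series. This sketch assumes a flat series of 10s with a single injected value of 130; the numbers are made up.

```python
import pandas as pd

# Hypothetical flat series with one injected outlier.
y = pd.Series([10.0] * 24)
y.iloc[12] = 130.0

rolling_mean = y.rolling(window=12, center=True, min_periods=1).mean()

# Every window that contains the outlier is pulled upward,
# so the estimated trend is inflated for a run of consecutive points.
inflated = rolling_mean > 10.0
print(int(inflated.sum()), rolling_mean.max())
```

A single bad point therefore contaminates as many trend estimates as there are points in the window.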

## How to identify Outliers

#### 1. Visual Inspection
Plot the time series and inspect it
* If the time series is small, we can identify the outliers by visual inspection

#### 2. Estimation methods
`abs( y_true - y_hat ) > threshold`
* We can use rolling statistics such as the mean or median to compute an expectation
* If the expectation is very different from the actual value, we can flag that point as an outlier
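As a minimal sketch of the estimation idea, the expectation `y_hat` below comes from a centred rolling median. The series, the window of 7 and the fixed threshold of 20 are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical series: a slow upward drift with one spike at position 15.
y_true = pd.Series([100.0 + i for i in range(30)])
y_true.iloc[15] += 80

# Expectation from a centred rolling median (window size is an assumption).
y_hat = y_true.rolling(window=7, center=True, min_periods=1).median()

threshold = 20  # assumed, in the units of the series
is_outlier = (y_true - y_hat).abs() > threshold
print(is_outlier[is_outlier].index.tolist())  # [15]
```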

#### 3. Density based methods
Does an observation have only a few neighbours?
* Look at the neighbouring data points
* If the neighbourhood is sparse, we can flag the observation as an outlier
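One simple way to express "sparse neighbourhood" is to count how many other observations lie within a fixed radius. This is only a sketch of the idea; the values, `radius` and `min_neighbours` are hypothetical, and it is not a specific library algorithm.

```python
import numpy as np

# Hypothetical observations: a dense cluster plus one isolated value.
values = np.array([10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 25.0])

radius = 1.0         # assumed neighbourhood size
min_neighbours = 2   # assumed density cut-off

# Pairwise absolute distances; subtract 1 to exclude each point itself.
distances = np.abs(values[:, None] - values[None, :])
neighbour_counts = (distances <= radius).sum(axis=1) - 1

# Points with too few neighbours are flagged.
is_outlier = neighbour_counts < min_neighbours
```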

## Let's use the estimation method

#### Rolling mean and standard deviation to identify outliers

In [13]:
# our data
df.plot();

<img src='./plots/retail-sales-with-outliers-df-plt.png'>

#### The seasonal spikes in the data are likely to be flagged as outliers, so we first de-seasonalise the data using STL decomposition

In [14]:
res = STL(endog=df['y'], robust=True).fit()
res.plot();

<img src='./plots/stl-plot.png'>

In [19]:
deseasonalised = df['y']-res.seasonal
df['deseasonalised'] = deseasonalised
# plot the deseasonalized data
df.plot(y=['deseasonalised']);

<img src='./plots/deseasonalized-plot.png'>

## Methods for handling outliers


#### 1. MODELING OUTLIERS
Suppose we have flower sales data from a store and, on visual inspection, sales on Feb 14 are very high compared with the rest of the series. Is this an outlier?

Since Feb 14 is Valentine's Day, we know flower sales will increase, so this case can be modeled rather than removed.

We can add a feature to our model, e.g. 1 for Valentine's Day and 0 for all other days.

This teaches the model about events like Valentine's Day, where we expect high flower sales.
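Such an event feature could be built as follows. This is a sketch on a hypothetical daily index; the column name `is_valentines_day` and the year are made up for the example.

```python
import pandas as pd

# Hypothetical daily index; the year is arbitrary.
index = pd.date_range("2023-02-01", "2023-02-28", freq="D")
features = pd.DataFrame(index=index)

# 1 on Valentine's Day, 0 on every other day.
features["is_valentines_day"] = (
    (index.month == 2) & (index.day == 14)
).astype(int)
```

The column can then be passed to a forecasting model as an extra regressor alongside the sales history.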



#### 2. IMPUTATION
#### Treat the outliers as missing data and fill them in using imputation methods

Imputation methods include:
* forward fill `pandas.DataFrame.ffill()`
* backward fill `pandas.DataFrame.bfill()`
* linear interpolation `pandas.DataFrame.interpolate(method='linear')`
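A minimal sketch of the "treat as missing, then impute" workflow, using a made-up five-point series and a hand-written outlier mask:

```python
import numpy as np
import pandas as pd

# Hypothetical series and outlier mask, for illustration only.
y = pd.Series([10.0, 11.0, 95.0, 13.0, 14.0])
is_outlier = pd.Series([False, False, True, False, False])

# Treat flagged points as missing, then fill them.
y_missing = y.mask(is_outlier)                     # outlier becomes NaN
y_ffill = y_missing.ffill()                        # carries 11.0 forward
y_bfill = y_missing.bfill()                        # carries 13.0 backward
y_linear = y_missing.interpolate(method="linear")  # midpoint: 12.0
```

Which fill is appropriate depends on the series; linear interpolation is often a reasonable default for an isolated point on a smooth trend.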


#### 3. ROLLING STATISTICS

* Apply a window to the data
* Compute statistics from the data inside the window
* Move the window and repeat until all the data has been covered

LET'S USE ROLLING STATS

Compute `y_hat` using a rolling mean, and compute the rolling standard deviation, which is used as part of the threshold

In [84]:
rolling_stats = (df['deseasonalised'].rolling(window=12, center=True, min_periods=1)
                   .agg({'rolling_mean':'mean', 'rolling_std':'std'}))

rolling_stats

| ds | rolling_mean | rolling_std |
|---|---|---|
| 1992-01-01 | 165622.552337 | 1130.452750 |
| 1992-02-01 | 165890.339378 | 1251.760952 |
| 1992-03-01 | 166261.547409 | 1563.785489 |
| 1992-04-01 | 166537.395234 | 1680.647235 |
| 1992-05-01 | 167101.829539 | 2386.753924 |
| ... | ... | ... |
| 2004-12-01 | 347537.364408 | 73572.481723 |
| 2005-01-01 | 350324.676777 | 76937.563822 |
| 2005-02-01 | 326107.606504 | 7846.290594 |
| 2005-03-01 | 328385.541059 | 4121.498142 |


In [85]:
alpha = 3
upper_limit = rolling_stats['rolling_mean'] + alpha*rolling_stats['rolling_std']
lower_limit = rolling_stats['rolling_mean'] - alpha*rolling_stats['rolling_std']

is_outlier = np.abs(df['deseasonalised']-rolling_stats['rolling_mean']) > alpha*rolling_stats['rolling_std']

In [94]:

ax = df.plot(y=['deseasonalised'], figsize=(15,5))
rolling_stats.plot(y=['rolling_mean'], ax=ax)

# mark outliers on the deseasonalised series, which is what this axis shows
df[is_outlier].plot(y=['deseasonalised'], linestyle='', marker='x', ax=ax, c='r', label=['outlier'])

upper_limit.plot(ax=ax, c='grey', label='')
lower_limit.plot(ax=ax, c='grey', label='')


<img src='./plots/outlier-detection.png'>

### The outliers were only just identified. The rolling mean and rolling standard deviation change significantly when the window includes the outliers (see the jumps in the rolling mean and in the thresholds shown by the grey lines).  This shows that this method is not robust to outliers.

## ROLLING MEAN AND STANDARD DEVIATION

* WINDOW-SIZE
    * The seasonal period is a common choice for the window size
    * We want to smooth out short-term fluctuations

* Threshold
    * `abs(y_true - rolling_mean) > alpha * rolling_std`
    * We flag a data point as an outlier if it varies a lot from what is expected
    * Usually we choose 3 as alpha
        * low alpha = high sensitivity to outliers
        * high alpha = low sensitivity to outliers
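The effect of `alpha` can be checked by counting flagged points for several values. The helper `flag_outliers` and the synthetic series below are hypothetical and exist only for this sketch; the rule itself is the threshold stated above.

```python
import numpy as np
import pandas as pd

def flag_outliers(y, window=12, alpha=3.0):
    """Flag points further than alpha rolling std devs from the rolling mean."""
    rolling = y.rolling(window=window, center=True, min_periods=1)
    return (y - rolling.mean()).abs() > alpha * rolling.std()

# Hypothetical noisy series with one large spike at position 60.
rng = np.random.default_rng(0)
y = pd.Series(rng.normal(loc=100, scale=5, size=120))
y.iloc[60] += 100

# Number of flagged points for each alpha.
counts = {alpha: int(flag_outliers(y, alpha=alpha).sum()) for alpha in (1, 3, 4)}
print(counts)
```

One property worth knowing: when the tested point is part of its own window, a single point can sit at most `(n - 1) / sqrt(n)` sample standard deviations from the window mean (Samuelson's inequality). For `n = 12` that is about 3.18, so under this setup `alpha = 4` cannot flag any point, and `alpha = 3` flags only the most extreme ones.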
    

In [118]:

ax = df.plot(y=['y'], figsize=(15,5), label=['time-series'])
rolling_stats.plot(y=['rolling_mean'], ax=ax)

df[is_outlier].plot(y=['y'], linestyle='', marker='x', ax=ax, c='r', label=['outlier'])


<img src='./plots/outlier-detection-timeseries-plot.png'>

#### High value for `alpha` = low sensitivity

In [121]:
high_alpha = 4
high_alpha_upper_limit = rolling_stats['rolling_mean'] + high_alpha*rolling_stats['rolling_std']
high_alpha_lower_limit = rolling_stats['rolling_mean'] - high_alpha*rolling_stats['rolling_std']
high_alpha_outliers = (np.abs(df['deseasonalised']-rolling_stats['rolling_mean']) > high_alpha*rolling_stats['rolling_std'])

ax = df.plot(y=['deseasonalised'], figsize=(15,5), label=['time-series'])
rolling_stats.plot(y=['rolling_mean'], ax=ax)

high_alpha_upper_limit.plot(ax=ax, c='grey', label='')
high_alpha_lower_limit.plot(ax=ax, c='grey', label='')

plt.title('High alpha value means low sensitivity to outliers: hence no outliers detected');

<img src='./plots/outlier-detection-high-alpha.png'>

In [122]:
low_alpha = 1
low_alpha_upper_limit = rolling_stats['rolling_mean'] + low_alpha*rolling_stats['rolling_std']
low_alpha_lower_limit = rolling_stats['rolling_mean'] - low_alpha*rolling_stats['rolling_std']
low_alpha_outliers = (np.abs(df['deseasonalised']-rolling_stats['rolling_mean']) > low_alpha*rolling_stats['rolling_std'])

ax = df.plot(y=['deseasonalised'], figsize=(15,5), label=['time-series'])
rolling_stats.plot(y=['rolling_mean'], ax=ax)

df[low_alpha_outliers].plot(y=['deseasonalised'], linestyle='', marker='x', ax=ax, c='r', label=['outlier'])

low_alpha_upper_limit.plot(ax=ax, c='grey', label='')
low_alpha_lower_limit.plot(ax=ax, c='grey', label='')

plt.title('Low alpha value means high sensitivity to outliers');

<img src='./plots/outlier-detection-low-alpha.png'>

## Why do we deseasonalise the data before outlier detection?

#### Seasonal fluctuations can distort rolling statistics

In [123]:
# compute rolling stats on the raw series (not deseasonalised)
distorted_rolling_stats = (
    df['y'].rolling(window=12, center=True, min_periods=1)
           .agg({'rolling_mean':'mean', 'rolling_std':'std'})
)

alpha = 3
distorted_rolling_stats['is_outlier'] = (np.abs(df['y']-distorted_rolling_stats['rolling_mean'])
                                         >
                                        alpha*distorted_rolling_stats['rolling_std'])


distorted_rolling_stats['upper_limit'] = distorted_rolling_stats['rolling_mean'] + alpha*distorted_rolling_stats['rolling_std']
distorted_rolling_stats['lower_limit'] = distorted_rolling_stats['rolling_mean'] - alpha*distorted_rolling_stats['rolling_std']

ax = df.plot(y=['y'], figsize=(15,4))
distorted_rolling_stats.plot(y=['rolling_mean'], ax=ax)

distorted_rolling_stats.plot(y=['upper_limit'], ax=ax )
distorted_rolling_stats.plot(y=['lower_limit'], ax=ax )

plt.title('No outliers detected: the seasonal pattern distorts the rolling stats');

<img src='./plots/outlier-detection-why-deseason-the-data.png'>

## Rolling mean and rolling standard deviation are not robust against outliers

* Outlier detection with this method is very sensitive to the data and to the choice of window size and threshold
* Sensitivity to outliers also depends on the gradient of the trend
    * If the slope is steeper, the rolling standard deviation increases, which makes outliers harder to detect
* If an outlier is inside the window, the rolling statistics are inflated
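The slope effect can be shown without any noise at all. The sketch below uses two made-up linear trends; the helper `full_window_std` is hypothetical and simply keeps windows that contain all 12 points.

```python
import numpy as np
import pandas as pd

t = np.arange(60, dtype=float)

# Two hypothetical noise-free linear trends with different slopes.
gentle = pd.Series(1.0 * t)
steep = pd.Series(5.0 * t)

def full_window_std(y, window=12):
    # min_periods=window keeps only windows with all `window` points.
    return y.rolling(window=window, center=True, min_periods=window).std().dropna()

gentle_std = full_window_std(gentle)
steep_std = full_window_std(steep)

# With no noise, the rolling std is already proportional to the slope,
# so the +/- alpha * std band widens as the trend gets steeper.
print(gentle_std.iloc[0], steep_std.iloc[0])
```

Because the band width scales with the slope, an outlier of a fixed size is harder to flag on a steep trend than on a gentle one.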