## Outliers in timeseries data

* ### Point outliers
    * Individual observations are outliers, in isolation from the points around them

* ### Level Shift Outliers
    * These are called sub-sequence outliers because a consecutive section of the time series is anomalous
    * The baseline of the time series shifts abruptly and stays at the new level

* ### Transient Shift Outliers
    * These are called sub-sequence outliers because a consecutive section of the time series is anomalous
    * The baseline of the time series shifts abruptly, but the shift is transient
    * The shift decays over time, hence "transient"

<img src='./Notes/different-outlier-types.PNG'>
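
The three outlier types can be sketched on synthetic data. This is a minimal illustration, not taken from the dataset used later: the baseline level, the shift size of 5, and the `decay` rate are all made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
base = 10 + 0.1 * rng.standard_normal(n)  # flat baseline with small noise

# Point outlier: a single observation is far from the baseline
point = base.copy()
point[30] += 5.0

# Level shift: the baseline jumps at t=50 and stays at the new level
level = base.copy()
level[50:] += 5.0

# Transient shift: the baseline jumps at t=50, then the jump decays geometrically
decay = 0.8  # hypothetical decay factor per time step
transient = base.copy()
transient[50:] += 5.0 * decay ** np.arange(n - 50)
```

Plotting `point`, `level` and `transient` side by side should give shapes similar to the figure above.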

## Why are outliers a problem

* Outliers bias the model
* Time-series decomposition results are distorted if there are outliers
    * A trend computed with a rolling average is pulled toward any outlier that falls inside the window
    * Outliers distort the result of classical seasonal decomposition
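
A small sketch of the rolling-average problem, using a made-up flat series with one spike. The window size of 5 is an arbitrary choice for illustration.

```python
import pandas as pd

y = pd.Series([10.0] * 11)
y.iloc[5] = 100.0  # one point outlier in an otherwise flat series

# Centred rolling statistics over a window of 5 observations
rolling_mean = y.rolling(window=5, center=True).mean()
rolling_median = y.rolling(window=5, center=True).median()

print(rolling_mean.iloc[5])    # 28.0, inflated by the outlier
print(rolling_median.iloc[5])  # 10.0, unaffected by the outlier
```

Every window that contains the spike reports a mean of 28 instead of 10, while the rolling median stays at the true level. This is one reason robust statistics are often preferred when outliers are suspected.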

## How to identify Outliers

#### 1. Visual Inspection
Plot the time series and inspect it.
* If the time series is small, outliers can be identified by visual inspection

#### 2. Estimation methods
`abs( y_true - y_hat ) > threshold`
* We can use rolling statistics such as the mean or median to compute an expectation
* We can use STL to extract the trend and seasonal components
    * `expectation = trend + seasonal`
* We can use LOWESS to fit a smooth line to the data and use that line as the expectation
* If the expectation is very different from the actual value, we flag the point as an outlier
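
The estimation approach can be sketched with a rolling median as the expectation. The data, the window size of 7 and the `threshold` of 5 are assumptions chosen for illustration; in practice the threshold depends on the scale of the series.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
y = pd.Series(50 + 0.5 * rng.standard_normal(60))
y.iloc[20] += 15.0  # injected outlier, for illustration only

# Expectation: centred rolling median, which is robust to isolated spikes
y_hat = y.rolling(window=7, center=True, min_periods=1).median()

# Flag points whose absolute deviation from the expectation exceeds the threshold
threshold = 5.0  # hypothetical value
is_outlier = (y - y_hat).abs() > threshold

flagged = np.flatnonzero(is_outlier.to_numpy()).tolist()
print(flagged)
```

Swapping the rolling median for `trend + seasonal` from STL, or for a LOWESS fit, changes only how `y_hat` is computed; the flagging step stays the same.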

#### 3. Density based methods
Does an observation have only a few neighbours?
* Look at the neighbouring data points
* If the neighbourhood is sparse, we flag the observation as an outlier
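
A minimal sketch of the density idea: count how many other observations lie within a fixed radius of each value. The values, `radius` and `min_neighbours` are made up for illustration, and this simple count stands in for more elaborate density-based algorithms.

```python
import numpy as np

values = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 25.0, 10.2, 9.7])

radius = 1.0        # hypothetical neighbourhood radius
min_neighbours = 2  # hypothetical density cut-off

# Pairwise absolute distances between observation values
dist = np.abs(values[:, None] - values[None, :])

# Count neighbours within the radius, excluding the point itself
n_neighbours = (dist <= radius).sum(axis=1) - 1

is_outlier = n_neighbours < min_neighbours
print(np.flatnonzero(is_outlier))  # [5]
```

The value 25.0 has no neighbours within the radius, so its neighbourhood is sparse and it is flagged.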

<img src='./Notes/Handle-outliers.PNG'>

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
from sktime.utils.plotting import plot_series

from statsmodels.tsa.seasonal import STL

from sklearn.linear_model import LinearRegression

## Data Set Synopsis

The time series runs from January 1992 to April 2005.

It consists of a single series of monthly values representing sales volumes.

The data comes from a monthly retail sales dataset (found [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv)).

In [4]:
df = pd.read_csv('../../Datasets/example_retail_sales_with_outliers.csv', parse_dates=['ds'], index_col=['ds'])

plot_series(df)
plt.xticks(rotation=20);

<img src='./plots/retail-sales-with-outliers-df.png'>

## Extract Trend, Seasonal and Residuals

In [8]:
res = STL(endog=df['y'], seasonal=7, period=12, robust=True).fit()
res.plot();

<img src='./plots/stl-plot.png'>

In [26]:
df['expected'] = res.seasonal + res.trend
df['residuals'] = res.resid

## Are the residuals stationary?

In [13]:
df.plot(y=['residuals'], linestyle='', marker='o', figsize=(15,4))

<img src='./plots/stationary-resid.png'>
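
Besides eyeballing the plot, a rough numerical check is to compare the mean and standard deviation across consecutive chunks of the residuals. The sketch below uses synthetic residuals and a hypothetical helper `split_summary`; it is only a sanity check, not a formal stationarity test.

```python
import numpy as np

def split_summary(residuals, n_chunks=4):
    """Return a (mean, std) pair for each of n_chunks consecutive chunks."""
    chunks = np.array_split(np.asarray(residuals, dtype=float), n_chunks)
    return [(chunk.mean(), chunk.std()) for chunk in chunks]

rng = np.random.default_rng(2)
stationary = rng.standard_normal(400)            # constant mean and variance
drifting = stationary + np.linspace(0, 10, 400)  # mean drifts upward over time

stationary_means = [m for m, _ in split_summary(stationary)]
drifting_means = [m for m, _ in split_summary(drifting)]

print(np.round(stationary_means, 2))  # chunk means stay close to each other
print(np.round(drifting_means, 2))    # chunk means increase steadily
```

If the chunk means or standard deviations of the real residuals differ substantially, a single global IQR threshold may not be appropriate.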

### Outlier estimation using `1.5 * IQR` rule

If the residuals are stationary, we can use the IQR rule to detect outliers.

Residuals = `y_true - expected`


A common rule to determine whether a particular data point is an outlier is the 1.5 x interquartile range (IQR) rule.

This rule states that a data point $x$ is an outlier if:

$$x > Q_3 + 1.5 \times IQR$$
or
$$x < Q_1 - 1.5 \times IQR$$

where $Q_1$ and $Q_3$ are the 1st and 3rd quartiles respectively, and $IQR = Q_3 - Q_1$.
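
To make the rule concrete, here is a small worked example on made-up numbers (not the residuals from this notebook), using NumPy's default linear interpolation for the quartiles.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

q1, q3 = np.quantile(x, [0.25, 0.75])  # Q1 = 3.25, Q3 = 7.75
iqr = q3 - q1                          # IQR = 4.5

lower = q1 - 1.5 * iqr                 # -3.5
upper = q3 + 1.5 * iqr                 # 14.5

outliers = x[(x < lower) | (x > upper)]
print(outliers)                        # [30.]
```

Only 30 falls outside the interval from -3.5 to 14.5, so it is the only value flagged.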

In [15]:
Q1, Q3 = df['residuals'].quantile(q=[0.25, 0.75])
IQR = Q3-Q1
print(f'INTER QUARTILE RANGE : {IQR}')
print(f'First Quartile : {Q1}')
print(f'Third Quartile : {Q3}')

INTER QUARTILE RANGE : 2876.9379567093565
First Quartile : -1376.697816200336
Third Quartile : 1500.2401405090204


In [17]:
alpha = 1.5
# flag residuals that fall outside [Q1 - alpha*IQR, Q3 + alpha*IQR]
df['is_outlier'] = (df['residuals'] > Q3 + alpha * IQR) | (df['residuals'] < Q1 - alpha * IQR)

df['is_outlier'].value_counts()

False    145
True      15
Name: is_outlier, dtype: int64

## Treat outliers as missing data and impute them

In [27]:
df['y_corrected'] = np.where(df['is_outlier'], df['expected'], df['y'])
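
The cell above imputes each flagged point with the STL expectation. A different option, sketched here on made-up values, is to mask the flagged points as `NaN` and fill them by linear interpolation. This is an alternative technique, not what the notebook does, and it ignores seasonality.

```python
import pandas as pd

y = pd.Series([10.0, 11.0, 50.0, 13.0, 14.0])
is_outlier = pd.Series([False, False, True, False, False])

# Mask flagged points as missing, then fill them by linear interpolation
y_interp = y.mask(is_outlier).interpolate(method="linear")
print(y_interp.tolist())  # [10.0, 11.0, 12.0, 13.0, 14.0]
```

For strongly seasonal data such as monthly sales, the `trend + seasonal` expectation is usually the more natural fill value, since a straight line between neighbours can cut across a seasonal peak or trough.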

In [33]:
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(15,10))
df.plot(y=['y'], marker='.', ax=ax[0])
df.query('is_outlier==True').plot(y=['y'] ,linestyle='', marker='o', color='r', 
                            ax=ax[0], label=['outlier'], alpha=0.5)
ax[0].set(title='Data with outliers identified')
df.plot(y=['y_corrected'], ax=ax[1])
df.query('is_outlier==True').plot(y=['y_corrected'] ,linestyle='', marker='o', color='g', 
                            ax=ax[1], label=['outlier imputation'], alpha=0.5)
ax[1].set(title='Data with outliers treated as missing data and imputed')

plt.tight_layout()

<img src='./plots/outlier-identified-and-rectified.png'>

<img src='./Notes/outlier-identification.PNG'>

## Let's see why outliers are problematic

In [54]:
# let's fit one linear model to the data with outliers and one to the corrected data
linear_with_outlier = LinearRegression()
linear_without_outlier = LinearRegression()

# let's use the STL trend as the single feature to predict 'y'
# (sklearn expects a 2-D array of shape [n_samples, n_features])
X = res.trend.to_numpy().reshape(-1, 1)


linear_with_outlier.fit(X, df['y'])
linear_without_outlier.fit(X, df['y_corrected'])

y_pred_with_outlier = linear_with_outlier.predict(X)
y_pred_without_outlier = linear_without_outlier.predict(X)

## Outliers bias the model

In [76]:
plt.figure(figsize=(15,8))
plt.plot(df.index, df['y'])
plt.plot(df.index, y_pred_with_outlier, c='r')
plt.plot(df.index, y_pred_without_outlier, c='g')
plt.legend(['y','biased toward outliers', 'unbiased model prediction'])

<img src='./plots/outlier-bias-the-model.png'>
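
The size of the bias can also be read off the fitted coefficients. The sketch below uses synthetic data and `np.polyfit` (ordinary least squares, the same criterion as `LinearRegression`) rather than the notebook's models; the slope, intercept and outlier size are made-up values.

```python
import numpy as np

x = np.arange(20, dtype=float)
y_clean = 2.0 * x + 1.0  # exact line: slope 2, intercept 1
y_dirty = y_clean.copy()
y_dirty[-3:] += 40.0     # three large outliers at the end of the series

# Least-squares straight-line fits; polyfit returns (slope, intercept) for deg=1
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_dirty, intercept_dirty = np.polyfit(x, y_dirty, deg=1)

print(round(slope_clean, 2), round(slope_dirty, 2))
```

Three outliers out of twenty points are enough to raise the fitted slope from 2 to roughly 3.5, because least squares penalises large errors quadratically and bends the line toward them.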