## Detecting outliers using residuals and STL

##### STL: Seasonal and Trend decomposition using LOESS
* `from statsmodels.tsa.seasonal import STL`
* `STL` decomposes a time series into three components: `trend`, `seasonal` and `residual`.

##### STL parameters

##### `seasonal`:
* Determines the window size for the LOESS smoother applied to the seasonal component (i.e., to each cycle-subseries).

* A cycle-subseries is the time series formed by the values from the same position in the seasonal cycle (e.g., the sequence of all values that occurred in January of each year, the sequence of all values that occurred in February of each year, etc.).

* This parameter determines how smooth the seasonal component is for the same period (e.g., every January) across multiple seasonal cycles (e.g., multiple years).
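To make the idea of a cycle-subseries concrete, here is a minimal sketch on a made-up monthly series (the series `y` and the dictionary `cycle_subseries` are illustrative names, not part of the notebook's data). It only shows how the subseries are formed; STL itself does the LOESS smoothing of each one internally.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: 3 years of data, values 0..35 for easy reading.
index = pd.date_range("2000-01-01", periods=36, freq="MS")
y = pd.Series(np.arange(36, dtype=float), index=index)

# One cycle-subseries per calendar month: all Januaries, all Februaries, ...
cycle_subseries = {month: values for month, values in y.groupby(y.index.month)}

# The January subseries holds one value per year.
january = cycle_subseries[1]
print(january)
```

With `period=12` there are 12 such subseries, and the `seasonal` window controls how many neighbouring points within each subseries take part in each local fit.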

##### `period`: 
* The periodicity of the seasonal component (for monthly data with yearly seasonality this is 12, because the seasonal pattern repeats every 12 periods).
* This parameter is used to build the cycle-subseries and also in the low-pass filtering step of the algorithm.

##### `robust`: 
* A flag to use robustness weights during regression in LOESS. This ensures robustness to outliers.
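As a rough illustration of what robustness weights do, the sketch below computes bisquare weights from a set of residuals, following the scheme described in the original STL paper (scale `h` equal to six times the median absolute residual). The function name `bisquare_weights` is hypothetical, and the exact internals of the `statsmodels` implementation may differ in detail.

```python
import numpy as np

def bisquare_weights(residuals):
    """Illustrative robustness weights: near 1 for small residuals, 0 for large ones."""
    abs_res = np.abs(np.asarray(residuals, dtype=float))
    h = 6.0 * np.median(abs_res)        # scale: six times the median absolute residual
    u = np.clip(abs_res / h, 0.0, 1.0)  # residuals at or beyond h are capped at u = 1
    return (1.0 - u**2) ** 2            # bisquare function

# Four ordinary residuals and one extreme one.
residuals = np.array([0.5, -1.0, 1.0, -0.5, 30.0])
weights = bisquare_weights(residuals)
print(weights.round(3))
```

The extreme residual receives a weight of zero, so it has no influence on the next round of LOESS fits, which is why the trend and seasonal estimates are less distorted by outliers when `robust=True`.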


##### Using STL we can extract the trend and seasonality and use their sum as the expected value
* `expected = seasonal + trend`

##### Residuals = `y_true - expected`

##### Outlier estimation using `1.5 * IQR` rule
* IQR = INTER QUARTILE RANGE
* Q1 = First Quartile or 25th Percentile or 0.25 Quantile
* Q3 = Third Quartile or 75th Percentile or 0.75 Quantile
* IQR = Q3 - Q1


* Outliers are often visually discernible in a plot of the residuals.
* The intuition is to flag a residual as an outlier when it falls beyond a threshold:
    * if the residual is greater than the upper threshold
        * `residuals > Q3 + 1.5*IQR`
    * if the residual is less than the lower threshold
        * `residuals < Q1 - 1.5*IQR`
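The rule above can be worked through on a tiny made-up array (the values are purely illustrative and unrelated to the retail sales data):

```python
import numpy as np

# Hypothetical residuals: eight ordinary values and one extreme value.
residuals = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1, q3 = np.percentile(residuals, [25, 75])
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# A point is an outlier if it falls outside [lower_bound, upper_bound].
is_outlier = (residuals < lower_bound) | (residuals > upper_bound)

print(q1, q3, iqr)               # 3.0 7.0 4.0
print(lower_bound, upper_bound)  # -3.0 13.0
print(residuals[is_outlier])     # [100.]
```

Note that the quartiles are computed with NumPy's default linear interpolation; other quantile conventions can give slightly different fences on small samples.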


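#### CAUTION: RESIDUALS MUST BE STATIONARY
* i.e., their mean and standard deviation should not change over time
* The IQR thresholds are constant over time, so they are only meaningful if the distribution of the residuals is constant too
* Before using residuals to detect outliers, make sure that they are stationary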


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
from sktime.utils.plotting import plot_series

from statsmodels.tsa.seasonal import STL

from statsmodels.nonparametric.smoothers_lowess import lowess

## Data Set Synopsis

The time series runs from January 1992 to April 2005.

It consists of a single series of monthly values representing sales volumes.

It is a monthly retail sales dataset (found [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv)).

In [3]:
df = pd.read_csv('../../Datasets/example_retail_sales_with_outliers.csv', parse_dates=['ds'], index_col=['ds'])

plot_series(df)
plt.xticks(rotation=20);

<img src='./plots/retail-sales-with-outliers-df.png'>

## Extract Seasonal Patterns and Trend using STL 
* Seasonal patterns could be mistaken for outliers, so we have to de-seasonalize the data
* Using STL we can extract both trend and seasonality
* `expected = trend + seasonality`

In [4]:
res = STL(endog=df['y'], seasonal=7, period=12, robust=True).fit()
res.seasonal.plot();

<img src='./plots/STL-seasonal-plot.png'>

### Trend & Seasonal extracted using STL method can be used as expected values

##### expected = `trend + seasonal`

##### Residuals = `y_true - expected`

##### Outlier estimation using `1.5 * IQR` rule

In [14]:
df['expected'] = res.seasonal + res.trend
df['residuals'] = res.resid

fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(15,8))
df.plot(y=['expected'], label=['expected = trend + seasonality'], ax=ax[0]);
df.plot(y=['residuals'], linestyle='', marker='.', ax=ax[1]);

<img src='./plots/STL-estimated-and-resid.png'>

## Make sure that the residuals are stationary

In [18]:
df['residuals'].plot(figsize=(15,4), marker='.', linestyle='', markersize=14);

<img src='./plots/stationary-resid.png'>
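Besides eyeballing the plot, a quick numeric sanity check is to compare the mean and standard deviation of the first and second halves of the residuals. The sketch below is an informal heuristic on made-up arrays, and the helper `half_split_summary` is a hypothetical name; a formal test such as the augmented Dickey-Fuller test (`statsmodels.tsa.stattools.adfuller`) is the more rigorous option.

```python
import numpy as np

def half_split_summary(values):
    """Compare mean and standard deviation between the two halves of a series."""
    values = np.asarray(values, dtype=float)
    first, second = np.array_split(values, 2)
    return {
        "mean_shift": abs(second.mean() - first.mean()),
        "std_ratio": second.std() / first.std(),
    }

# Hypothetical residuals: a repeating pattern around zero (stationary) ...
stationary = np.tile([1.0, -1.0, 2.0, -2.0], 24)
# ... and the same pattern with an upward drift added (not stationary).
drifting = stationary + np.linspace(0.0, 50.0, stationary.size)

print(half_split_summary(stationary))
print(half_split_summary(drifting))
```

A mean shift close to zero and a standard-deviation ratio close to one are consistent with stationarity; a large mean shift, as in the drifting series, suggests the decomposition left some trend in the residuals.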

## IQR, Q1 and Q3 using pandas

In [19]:
Q1, Q3 = df['residuals'].quantile(q=[0.25, 0.75])
IQR = Q3 - Q1
print('INTER QUARTILE RANGE :',IQR)
print('Q1 :',Q1)
print('Q3 :',Q3)

INTER QUARTILE RANGE : 2876.9379567093565
Q1 : -1376.697816200336
Q3 : 1500.2401405090204


In [24]:
plt.figure(figsize=(10,8))
sns.boxplot(x='residuals', data=df);

<img src='./plots/box-plot-STL.png'>

A common rule to determine whether a particular data point is an outlier is the 1.5 x interquartile range (IQR) rule.

This rule states that a data point $x$ is an outlier if either of the following holds:

$$x > Q_3 + 1.5 \times IQR$$
$$x < Q_1 - 1.5 \times IQR$$

where $Q_1$ and $Q_3$ are the 1st and 3rd quartiles respectively, and $IQR = Q_3 - Q_1$.

In [29]:
# Lower and upper fences of the 1.5 x IQR rule
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

plt.figure(figsize=(10,8))
# Shade the region between the fences; residuals outside it are outliers
plt.axvspan(xmin=lower_bound, xmax=upper_bound, color='g', alpha=0.4, label='non-outlier range');
sns.histplot(df['residuals'], label='distribution of residuals');
plt.vlines(x=[lower_bound, upper_bound], ymin=plt.ylim()[0], ymax=plt.ylim()[1], colors='r',
           label='lower and upper bound')

plt.legend();

<img src='./plots/histplot-residuals-STL.png'>

In [30]:
def get_outliers(alpha=1.5):
    # Boolean mask: True where the residual lies outside [Q1 - alpha*IQR, Q3 + alpha*IQR]
    return (df['residuals'] > Q3 + alpha*IQR) | (df['residuals'] < Q1 - alpha*IQR)

In [33]:
outliers = get_outliers(alpha=1.5)

ax = df.plot(y=['y'], figsize=(15,4), marker='.')
df[outliers].plot(y=['y'], c='r', marker='o', linestyle='', 
                  markersize=12, ax=ax, alpha=0.5, label=['outlier'])

<img src='./plots/outlier-detection-STL-1.5.png'>

In [35]:
ax = df.plot(y=['residuals'], marker='.', markersize=16, linestyle='', figsize=(10,8))

df[outliers].plot(y=['residuals'], c='r', marker='o', linestyle='',  markersize=12, ax=ax, 
                    alpha=0.5, label=['outlier'])

plt.hlines(y=[lower_bound, upper_bound], xmin=df.index.min(), xmax=df.index.max(), color='r');


<img src='./plots/residuals-and-outliers-STL-1.5.png'>

#### Let's zoom in and look at the points near the boundary that are classified as outliers

In [41]:
ax = df.query('residuals < 5e4').plot(y=['y'], figsize=(15,8), marker='.')
df.query('residuals < 5e4').plot(y=['expected'], ax=ax)
df[outliers].query('residuals < 5e4').plot(y=['y'], ax=ax, c='r', linestyle='', 
                marker='o', markersize=16, alpha=0.5, label=['outlier']);

<img src='./plots/STL-estimated-and-outlier-zoomed-1.5.png'>

In [42]:
ax = df.query('residuals < 5e4').plot(y=['residuals'], marker='.', markersize=16, linestyle='', figsize=(10,8))

df[outliers].query('residuals < 5e4').plot(y=['residuals'], c='r', marker='o', 
                    linestyle='',  markersize=12, ax=ax, alpha=0.5, label=['outlier'])

plt.hlines(y=[lower_bound, upper_bound], xmin=df.index.min(), xmax=df.index.max(), color='r');

<img src='./plots/residuals-and-outliers-zoomed-STL-1.5.png'>

#### Points identified as outliers are indeed points that deviate from the expected value
* If we want, we can make the outlier detection less sensitive by widening the thresholds, for example:
    * flag a residual as an outlier if it is greater than the upper threshold
        * `residuals > Q3 + 3 * IQR`
    * flag a residual as an outlier if it is less than the lower threshold
        * `residuals < Q1 - 3 * IQR`

#### Sensitivity adjustment
Because the data points (`y_true`) are close to the expected values extracted using STL (`y_expected = trend + seasonal`), the majority of residuals are close to zero.

**This means that even small deviations from the expected value are flagged as outliers.** This is easy to see on the residual plot and the box plot. Because of the many small residuals, the IQR ends up small, Q1 and Q3 sit close to zero, and the thresholds are narrow. To make outlier detection less sensitive, a simple solution is to widen the threshold by increasing `alpha`.
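The effect of the multiplier can be sketched by counting flagged points for a few values of `alpha` on a small made-up array (the helper `count_outliers` and the values are illustrative only, not taken from the notebook):

```python
import numpy as np

def count_outliers(residuals, alpha):
    """Number of residuals outside [Q1 - alpha*IQR, Q3 + alpha*IQR]."""
    residuals = np.asarray(residuals, dtype=float)
    q1, q3 = np.percentile(residuals, [25, 75])
    iqr = q3 - q1
    flagged = (residuals < q1 - alpha * iqr) | (residuals > q3 + alpha * iqr)
    return int(flagged.sum())

# Hypothetical residuals: mostly small, one moderate (8) and one large (20).
residuals = [-3, -2, -1, 0, 1, 2, 3, 8, 20]

counts = {alpha: count_outliers(residuals, alpha) for alpha in (1.0, 1.5, 5.0)}
print(counts)  # {1.0: 2, 1.5: 1, 5.0: 0}
```

A larger `alpha` flags fewer points; which value is appropriate depends on how costly false alarms are compared with missed outliers.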

In [45]:
outliers = get_outliers(alpha=5)

ax = df.plot(y=['y'], figsize=(15,4), marker='.')
df[outliers].plot(y=['y'], c='r', marker='o', linestyle='', 
                  markersize=12, ax=ax, alpha=0.5, label=['outlier'])

<img src='./plots/outlier-detection-STL-5.png'>

## Handling outliers

#### 1. MODELING OUTLIERS
Suppose we have flower sales data from a store and, on visual inspection, the sales on Feb 14 are very high compared to the rest. Is this an outlier?

Since Feb 14 is Valentine's Day, we know there will be an increase in flower sales, so this case can be modeled rather than removed.

We can add a feature to our model, e.g. an indicator that is 1 on Valentine's Day and 0 on all other days.

This teaches the model about events like Valentine's Day, where we expect high flower sales.
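A minimal sketch of such an indicator feature, assuming daily data indexed by date (the frame `features` and the column name `is_valentines_day` are hypothetical):

```python
import pandas as pd

# Hypothetical daily index covering two years.
index = pd.date_range("2020-01-01", "2021-12-31", freq="D")
features = pd.DataFrame(index=index)

# 1 on Feb 14, 0 on every other day.
features["is_valentines_day"] = ((index.month == 2) & (index.day == 14)).astype(int)

print(features[features["is_valentines_day"] == 1])
```

The column can then be passed to a forecasting model as an exogenous regressor, so the spike is explained by the event instead of being treated as an anomaly.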


#### 2. IMPUTATION
#### Treat the outliers as missing data and impute them

Imputation methods:
* forward fill: `pandas.DataFrame.ffill()` (older code uses `fillna(method='ffill')`, which is deprecated in recent pandas versions)
* backward fill: `pandas.DataFrame.bfill()` (older code uses `fillna(method='bfill')`)
* linear interpolation: `pandas.DataFrame.interpolate(method='linear')`
* STL decomposition and interpolation
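The first three methods can be compared on a small made-up series in which one point has been flagged as an outlier (the values and the `is_outlier` mask are illustrative only):

```python
import pandas as pd

# Hypothetical series with one flagged point (500.0).
y = pd.Series([10.0, 12.0, 500.0, 16.0, 18.0])
is_outlier = pd.Series([False, False, True, False, False])

# Treat the flagged point as missing.
y_missing = y.mask(is_outlier)

imputed = pd.DataFrame({
    "ffill": y_missing.ffill(),                        # carry the previous value forward
    "bfill": y_missing.bfill(),                        # carry the next value backward
    "linear": y_missing.interpolate(method="linear"),  # straight line between neighbours
})
print(imputed.loc[2])  # ffill 12.0, bfill 16.0, linear 14.0
```

None of these three methods knows about seasonality, which is one reason to prefer replacing the flagged points with the STL expected value (`trend + seasonal`) when the series is strongly seasonal.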


## Treat the outliers as missing and impute

In [47]:
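# Replace flagged points with the STL expected value (trend + seasonal); keep all other points unchanged
df['y_corrected'] = np.where(outliers, df['expected'], df['y'])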

In [56]:

ax = df.plot(y=['y_corrected'], figsize=(15,5), marker='.')
df[outliers].plot(y=['y'], linestyle='', marker='x', markersize=12, 
                    c='r', ax=ax, label=['outlier'], alpha=0.5)
df[outliers].plot(y=['y_corrected'], linestyle='', marker='o', markersize=12, 
                    c='g', ax=ax, label=['outlier_imputed'], alpha=0.5)

ax.set(title='Treat the outliers as missing data and Impute them using imputation methods');

<img src='./plots/handle-outlier-STL-alpha-5.png'>

In [65]:
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(15,8))

df.plot(y=['y'], ax=ax[0])
df[outliers].plot(y=['y'], ax=ax[0], c='r', linestyle='', marker='x', markersize=12, label=['outlier'])

df.plot(y=['y_corrected'], ax=ax[1])
df[outliers].plot(y=['y_corrected'], ax=ax[1], c='g', linestyle='',
                     marker='o', markersize=12, label=['outlier_correction'], alpha=0.4)

<img src='./plots/handle-outlier-using-STL-compare-plot.png'>