## Detecting outliers using the residuals


* Consider the residuals from the rolling statistics
* The outliers are visually discernible in these residuals
* The intuition is that we can threshold the residuals:
    * if a residual is greater than an upper threshold, we flag the point as an outlier
    * if a residual is less than a lower threshold, we flag the point as an outlier

#### CAUTION : RESIDUALS MUST BE STATIONARY
`i.e. the mean and standard deviation should not change with time`
* Before using residuals to detect outliers, make sure that they are stationary
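
One rough way to check this, assuming nothing beyond pandas and NumPy, is to compare the mean and standard deviation of the first and second halves of the series. The helper name `split_half_stats` and the synthetic series below are illustrative only. This is a crude sanity check, not a formal test; a unit-root test such as the augmented Dickey-Fuller test would be a more rigorous alternative.

```python
import numpy as np
import pandas as pd

def split_half_stats(series: pd.Series) -> pd.DataFrame:
    """Mean and standard deviation of the first and second half of a series."""
    mid = len(series) // 2
    halves = {"first_half": series.iloc[:mid], "second_half": series.iloc[mid:]}
    return pd.DataFrame({name: {"mean": part.mean(), "std": part.std()}
                         for name, part in halves.items()})

rng = np.random.default_rng(0)
stationary = pd.Series(rng.normal(0, 1, 200))     # constant mean and spread
trending = stationary + np.linspace(0, 10, 200)   # mean drifts upward over time

print(split_half_stats(stationary))   # the two halves should look alike
print(split_half_stats(trending))     # the second half has a clearly higher mean
```

If the two halves differ noticeably, a single global threshold on the residuals is unlikely to behave consistently over time.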




In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
from sktime.utils.plotting import plot_series

from statsmodels.tsa.seasonal import STL

## Data Set Synopsis

The time series runs from January 1992 to April 2005.

It consists of a single series of monthly values representing sales volumes.

It is a monthly retail sales dataset (found [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv)).

In [3]:
df = pd.read_csv('../../Datasets/example_retail_sales_with_outliers.csv', parse_dates=['ds'], index_col=['ds'])

plot_series(df)
plt.xticks(rotation=20);

<img src='./plots/retail-sales-with-outliers-df.png'>

## We have to deseasonalize the data first because:
* #### The seasonal spikes can be mistaken for outliers
* #### The seasonal variation can inflate the rolling statistics
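
To illustrate the first point, here is a small sketch on synthetic data (not the retail dataset): a flat monthly series with a December spike and no true outliers. It assumes a rolling-median residual combined with an IQR rule, which is the approach this notebook develops in the following sections. The helper `iqr_flags` is hypothetical, and the known `seasonal` component stands in for the STL estimate.

```python
import numpy as np
import pandas as pd

def iqr_flags(series: pd.Series, alpha: float = 1.5) -> pd.Series:
    """Flag points whose rolling-median residual falls outside the IQR fences."""
    expected = series.rolling(window=12, center=True, min_periods=1).median()
    resid = series - expected
    q1, q3 = resid.quantile(0.25), resid.quantile(0.75)
    iqr = q3 - q1
    return (resid > q3 + alpha * iqr) | (resid < q1 - alpha * iqr)

idx = pd.date_range("2000-01-01", periods=120, freq="MS")
rng = np.random.default_rng(0)
seasonal = pd.Series(np.where(idx.month == 12, 50.0, 0.0), index=idx)  # December spike
y = 100 + seasonal + rng.normal(0, 1, len(idx))                         # no true outliers

raw_flags = iqr_flags(y)                        # seasonality left in
deseasonalized_flags = iqr_flags(y - seasonal)  # seasonality removed

print("Decembers flagged on the raw data:", int(raw_flags[idx.month == 12].sum()), "of 10")
print("Points flagged after removing the seasonal component:", int(deseasonalized_flags.sum()))
```

On the raw series every December is reported as an outlier even though it is ordinary seasonal behaviour; once the seasonal component is subtracted, those flags disappear.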

In [5]:
res = STL(endog=df['y'], period=12, robust=True).fit()
df['deseasonalized'] = df['y'] - res.seasonal

df.plot(y=['deseasonalized']);

<img src='./plots/deseasonalized-plot.png'>

## Estimation method
A data point is an outlier if `ABS(actual - expectation)` is greater than the `threshold`:

`ABS(y(t) - y_hat) > threshold`
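
As a minimal sketch of this rule, with made-up numbers (the `actual`, `expected`, and `threshold` values are purely illustrative and are not taken from the dataset):

```python
import numpy as np

actual = np.array([10.0, 11.0, 9.5, 30.0, 10.5])     # observed y(t)
expected = np.array([10.0, 10.5, 10.0, 10.0, 10.0])  # hypothetical expected values y_hat
threshold = 5.0                                       # hypothetical threshold

# A point is flagged when its absolute residual exceeds the threshold
is_outlier = np.abs(actual - expected) > threshold
print(is_outlier.tolist())  # [False, False, False, True, False]
```

Only the fourth point, whose residual is 20, exceeds the threshold.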



## Rolling median
* **Median is robust to outliers**
* We can use the rolling median as our expected values

#### We are using rolling median instead of rolling mean for expected-value computation
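
The difference is easy to see on a toy series, sketched here with pandas. The values and the window of 3 are arbitrary choices for illustration; the notebook itself uses a window of 12.

```python
import pandas as pd

# Flat series with a single spike at position 4
s = pd.Series([10, 10, 10, 10, 100, 10, 10, 10, 10], dtype=float)

rolling = s.rolling(window=3, center=True, min_periods=1)
comparison = pd.DataFrame({
    "value": s,
    "rolling_mean": rolling.mean(),      # pulled up towards the spike
    "rolling_median": rolling.median(),  # ignores the spike
})
print(comparison)
```

The rolling mean rises to 40 around the spike, which shrinks the residual at the spike and creates spurious residuals at its neighbours. The rolling median stays at 10, so the spike keeps its full residual of 90.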


In [6]:
# Centered 12-month rolling median, kept as a one-column DataFrame
expected_values = df['deseasonalized'].rolling(window=12, center=True, min_periods=1).median().to_frame('rolling_median')
expected_values.head()

| ds | rolling_median |
|---|---|
| 1992-01-01 | 165501.41445 |
| 1992-02-01 | 166194.124261 |
| 1992-03-01 | 166336.241297 |
| 1992-04-01 | 166478.358333 |
| 1992-05-01 | 166817.744926 |


## Compute the residuals

In [8]:
residuals = df['deseasonalized'] - expected_values['rolling_median']


ax = residuals.plot(linestyle='', marker='x', figsize=(12,4))
plt.title('Residuals are stationary; We can see the outliers ');

<img src='./plots/residuals-plot.png'>

<img src='./Notes/box plot and outliers.PNG'>

## Compute the Inter Quartile Range, 1st Quartile and 3rd Quartile

The cells below list several equivalent ways to compute each quantity with NumPy, pandas, and SciPy.

In [29]:
# MEDIAN is the 2nd Quartile   50% quantile  or 50th percentile
q2_by_median = np.median(residuals.values)

q2_by_quantile = np.quantile(residuals.values, q=0.5)

q2_by_percentile = np.percentile(residuals.values, q=50)

print('Median or 2nd quartile (0.5 quantile) or 50th percentile :')
print('Numpy median = ',q2_by_median)
print('Numpy quantile = ',q2_by_quantile)
print('Numpy percentile = ',q2_by_percentile)


# we can also use the pandas for computing quantiles
print('Q2 using pandas = ',residuals.quantile(q=0.5))

Median or 2nd quartile (0.5 quantile) or 50th percentile :
Numpy median =  257.9542177627445
Numpy quantile =  257.9542177627445
Numpy percentile =  257.9542177627445
Q2 using pandas =  257.9542177627445


In [16]:
# Q1   1st quartile   25% quantile  or 25th percentile


q1_by_quantile = np.quantile(residuals.values, q=0.25)

q1_by_percentile = np.percentile(residuals.values, q=25)

print('Q1   1st quartile   25% quantile  or 25th percentile')

print('Numpy quantile = ',q1_by_quantile)
print('Numpy percentile = ',q1_by_percentile)


# we can also use the pandas for computing quantiles
print('Q1 using pandas = ',residuals.quantile(q=0.25))

Q1   1st quartile   25% quantile  or 25th percentile
Numpy quantile =  -1043.9408635926884
Numpy percentile =  -1043.9408635926884
Q1 using pandas =  -1043.9408635926884


In [17]:
# Q3   3rd quartile   75% quantile  or 75th percentile


q3_by_quantile = np.quantile(residuals.values, q=0.75)

q3_by_percentile = np.percentile(residuals.values, q=75)

print('Q3   3rd quartile   75% quantile  or 75th percentile')

print('Numpy quantile = ',q3_by_quantile)
print('Numpy percentile = ',q3_by_percentile)


# we can also use the pandas for computing quantiles
print('Q3 using pandas = ',residuals.quantile(q=0.75))

Q3   3rd quartile   75% quantile  or 75th percentile
Numpy quantile =  1779.6279593110448
Numpy percentile =  1779.6279593110448
Q3 using pandas =  1779.6279593110448


<img src='./Notes/Q1 -- Q2  -- IQR.PNG'>

In [27]:
# INTER QUARTILE RANGE

from scipy.stats import iqr

print('IQR by scipy stats : ',iqr(x=residuals.values))

print('IQR by numpy  : ',q3_by_quantile - q1_by_quantile)

IQR by scipy stats :  2823.5688229037332
IQR by numpy  :  2823.5688229037332


<img src='./Notes/25 -- 50 -- 75 percentile.PNG'>

## Box plot

In [74]:
plt.figure(figsize=(15,4))
plt.boxplot(residuals.values, vert=False);

<img src='./plots/box-plot.png'>

## Histogram

In [75]:
# we can clearly see that most of the residuals spread around zero.
# we can clearly see the outliers
residuals.hist(bins=50)

<img src='./plots/hist-plot.png'>

In [63]:
Q1 = q1_by_quantile
Q3 = q3_by_quantile
IQR = Q3-Q1

# FORMULA :
# Outlier if  residual < Q1 - alpha * IQR
# Outlier if  residual > Q3 + alpha * IQR

alpha = 1.5

# Vectorized comparison: True where the residual lies outside either fence
is_outlier = (residuals > Q3 + alpha*IQR) | (residuals < Q1 - alpha*IQR)

In [76]:
ax = df.plot(y=['deseasonalized'], figsize=(15,4))
df[is_outlier].plot(y=['deseasonalized'], linestyle='', marker='x', ax=ax, label=['outlier'], c='r');

<img src='./plots/outlier-detection-using-residuals-alpha-1.5.png'>
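
The IQR rule used above can be wrapped in a small reusable helper. This is only a sketch: the function name `iqr_outlier_mask` and the toy residuals are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_mask(residuals: pd.Series, alpha: float = 1.5) -> pd.Series:
    """Return True where a residual lies outside [Q1 - alpha*IQR, Q3 + alpha*IQR]."""
    q1 = residuals.quantile(0.25)
    q3 = residuals.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - alpha * iqr, q3 + alpha * iqr
    return (residuals < lower) | (residuals > upper)

# Toy residuals: Q1 = -0.75, Q3 = 1.75, IQR = 2.5, so the fences are -4.5 and 5.5
toy = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0, 50.0])
print(iqr_outlier_mask(toy).tolist())  # [False, False, False, False, False, True]
```

The resulting boolean mask can be used to index the original DataFrame in the same way as `is_outlier` above.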

#### We can increase the alpha to a higher value to make the outlier detection less sensitive

In [72]:
alpha = 5

is_outlier = (residuals > Q3 + alpha*IQR) | (residuals < Q1 - alpha*IQR)

In [77]:
ax = df.plot(y=['deseasonalized'], figsize=(15,4))
df[is_outlier].plot(y=['deseasonalized'], linestyle='', marker='x', ax=ax, label=['outlier'], c='r');

<img src='./plots/outlier-detection-using-residuals-alpha-5.png'>
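
To see why a larger alpha is less sensitive, it helps to compute the fences directly. The sketch below plugs in the Q1 and Q3 values printed earlier in this notebook; the helper `fences` is a hypothetical name introduced here for illustration.

```python
# Quartiles copied from the outputs printed earlier in this notebook
Q1 = -1043.9408635926884
Q3 = 1779.6279593110448
IQR = Q3 - Q1

def fences(alpha):
    """Lower and upper outlier fences for a given alpha."""
    return Q1 - alpha * IQR, Q3 + alpha * IQR

for alpha in (1.5, 5):
    lower, upper = fences(alpha)
    print(f"alpha={alpha}: lower fence={lower:,.1f}, upper fence={upper:,.1f}")
```

With alpha = 1.5 a residual must fall below roughly -5,279 or above roughly 6,015 to be flagged; with alpha = 5 the fences widen to roughly -15,162 and 15,897, so only the most extreme residuals remain.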

## Outlier detection using Residuals
* Pros
    * We can use this with any method that produces an expectation for y(t)
    * The threshold is based on the entire dataset, not just a window
* Cons
    * We have to make sure that the residuals are STATIONARY 
    * We have to adjust the threshold for getting the right sensitivity