## Detecting outliers using residuals and LOWESS

##### LOWESS: Locally Weighted Scatterplot Smoothing
* `from statsmodels.nonparametric.smoothers_lowess import lowess`
* LOWESS fits a smooth curve to the data without assuming a functional form for that curve
* The LOWESS curve at any point `x` is obtained by fitting a weighted linear regression to a subset of the data around that point
* The `frac` parameter controls the size of that subset, i.e. the fraction of the data used for each local fit
* LOWESS gives less weight to data points that are further from the point of interest

* LOWESS can be used to extract the trend from a time series
* We can use LOWESS to obtain an expected value at each point in time
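
To make the idea of a local weighted fit concrete, here is a simplified sketch written with NumPy only. It is not the `statsmodels` implementation: the function name `local_linear_smooth`, the tricube weights, and the synthetic data are assumptions made for illustration, and the robustness iterations (the `it` parameter) are left out.

```python
import numpy as np

def local_linear_smooth(x, y, frac=0.3):
    """Simplified LOWESS-style smoother (illustrative, no robustness iterations).

    Assumes distinct x values and enough neighbours per fit
    (frac * len(x) comfortably above 3).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))  # number of neighbours used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        idx = np.argsort(dist)[:k]            # the k points nearest to x[i]
        h = dist[idx].max()                   # bandwidth: distance to the farthest neighbour
        w = (1 - (dist[idx] / h) ** 3) ** 3   # tricube weights: farther points weigh less
        # np.polyfit applies w to the unsquared residuals, so pass sqrt(w)
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1, w=np.sqrt(w))
        fitted[i] = slope * x[i] + intercept
    return fitted

# Synthetic example: a noisy sine wave smoothed with two different fractions
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 100)
y_demo = np.sin(x_demo) + rng.normal(0, 0.3, x_demo.size)

wiggly = local_linear_smooth(x_demo, y_demo, frac=0.1)   # small subset: follows the data closely
smooth = local_linear_smooth(x_demo, y_demo, frac=0.5)   # large subset: smoother curve
```

A smaller `frac` uses fewer neighbours per fit, so the curve tracks local wiggles; a larger `frac` averages over more of the series.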

##### Residuals = `y_true - expected`

##### Outlier estimation using `1.5 * IQR` rule
* IQR = INTER QUARTILE RANGE
* Q1 = First Quartile or 25th Percentile or 0.25 Quantile
* Q3 = Third Quartile or 75th Percentile or 0.75 Quantile
* IQR = Q3 - Q1


* The outliers are visually discernible
* The intuition is that we can flag a point as an outlier when its residual crosses a threshold
    * if the residual is greater than the upper threshold, flag it as an outlier
        * `residuals > Q3 + 1.5*IQR`
    * if the residual is less than the lower threshold, flag it as an outlier
        * `residuals < Q1 - 1.5*IQR`
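
As a small worked example of the rule, the sketch below applies it to a made-up array of nine values (the numbers are illustrative and not taken from the retail sales data).

```python
import numpy as np

# Hypothetical residuals: eight ordinary values and one extreme value
residuals = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1, q3 = np.percentile(residuals, [25, 75])   # Q1 = 3.0, Q3 = 7.0
iqr = q3 - q1                                 # IQR = 4.0

lower = q1 - 1.5 * iqr                        # 3 - 6 = -3.0
upper = q3 + 1.5 * iqr                        # 7 + 6 = 13.0

is_outlier = (residuals < lower) | (residuals > upper)
print(residuals[is_outlier])                  # prints [100.]
```

Only the value 100 falls outside the interval from -3 to 13, so it is the only point flagged.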


#### CAUTION: RESIDUALS MUST BE STATIONARY
i.e. the mean and standard deviation of the residuals should not change with time
* Before using residuals to detect outliers, make sure that they are stationary
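
One rough way to check this, besides looking at a plot, is to split the residuals into consecutive chunks and compare the mean and standard deviation of each chunk. The helper name `chunk_stats` and the synthetic series below are assumptions for illustration; this is an informal check, not a formal test such as the augmented Dickey-Fuller test.

```python
import numpy as np

def chunk_stats(residuals, n_chunks=4):
    """Mean and standard deviation of consecutive chunks of a series."""
    chunks = np.array_split(np.asarray(residuals, dtype=float), n_chunks)
    means = np.array([c.mean() for c in chunks])
    stds = np.array([c.std() for c in chunks])
    return means, stds

rng = np.random.default_rng(0)
stationary = rng.normal(0, 1, 400)                 # constant mean and spread
drifting = stationary + np.linspace(0, 10, 400)    # mean increases with time

stat_means, stat_stds = chunk_stats(stationary)
drift_means, drift_stds = chunk_stats(drifting)

print('stationary chunk means:', stat_means.round(2))
print('drifting chunk means  :', drift_means.round(2))
```

For the stationary series the chunk means stay close to each other, while for the drifting series they climb steadily, which would make a single fixed threshold misleading.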


In [90]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
from sktime.utils.plotting import plot_series

from statsmodels.tsa.seasonal import STL

from statsmodels.nonparametric.smoothers_lowess import lowess

## Data Set Synopsis

The time series runs from January 1992 to April 2005.

It consists of a single series of monthly values representing sales volumes.

The source is a monthly retail sales dataset (found [here](https://raw.githubusercontent.com/facebook/prophet/master/examples/example_retail_sales.csv)).

In [4]:
df = pd.read_csv('../../Datasets/example_retail_sales_with_outliers.csv', parse_dates=['ds'], index_col=['ds'])

plot_series(df)
plt.xticks(rotation=20);

<img src='./plots/retail-sales-with-outliers-df.png'>

## Seasonal patterns can be an issue!
* Seasonal patterns could be mistaken for outliers, so we have to deseasonalize the data first
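
The sketch below illustrates the problem on synthetic monthly data with a recurring December peak. To stay self-contained it removes seasonality by subtracting per-month means, which is a simpler stand-in for the STL decomposition used in this notebook; the helper `iqr_flags` and all the numbers are hypothetical.

```python
import numpy as np

def iqr_flags(values, alpha=1.5):
    """Boolean mask of points outside [Q1 - alpha*IQR, Q3 + alpha*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - alpha * iqr) | (values > q3 + alpha * iqr)

rng = np.random.default_rng(0)
n_years = 10
month = np.tile(np.arange(1, 13), n_years)          # 1..12 repeated for each year
sales = 100 + rng.uniform(-1, 1, month.size)        # flat level plus small noise
sales[month == 12] += 50                            # recurring December peak

# Applying the IQR rule to the raw values flags every December
raw_flags = iqr_flags(sales)

# Simple deseasonalization: subtract the mean of each calendar month
monthly_mean = np.array([sales[month == m].mean() for m in range(1, 13)])
deseasonalized = sales - monthly_mean[month - 1]
deseason_flags = iqr_flags(deseasonalized)

print('flagged before deseasonalizing:', raw_flags.sum())
print('flagged after deseasonalizing :', deseason_flags.sum())
```

The December values are regular and expected, yet the rule flags all of them until the seasonal component is removed.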

In [47]:
res = STL(endog=df['y'], seasonal=7, period=12, robust=True).fit()
res.seasonal.plot();

<img src='./plots/STL-seasonal-plot.png'>

In [51]:
df['deseasonalized'] = df['y']-res.seasonal
df.plot(y=['deseasonalized']);

<img src='./plots/deseasonalized-plot.png'>

## LOWESS for extracting the trend

`LOWESS (Locally Weighted Scatterplot Smoothing)`

Parameters:
* `endog (1-D numpy array) The y-values of the observed points`
* `exog (1-D numpy array) The x-values of the observed points`
* `frac (float) Between 0 and 1. The fraction of the data used when estimating each y-value.`
* `it (int) The number of residual-based reweightings to perform.`
* `delta (float) Distance within which to use linear-interpolation instead of weighted regression.`
* `xvals : (1-D numpy array) Values of the exogenous variable at which to evaluate the regression. If supplied, cannot use delta.`

etc .. 
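
As a quick illustration of these parameters, the sketch below calls `lowess` on synthetic data (the data and variable names are made up for the example) and compares the output with and without `return_sorted`.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x_demo = np.arange(100, dtype=float)                    # numeric x values
y_demo = 0.5 * x_demo + rng.normal(0, 5, x_demo.size)   # noisy upward trend

# Default return_sorted=True: two columns, sorted x and the fitted y
sorted_fit = lowess(endog=y_demo, exog=x_demo, frac=0.3)

# return_sorted=False: only the fitted y values, in the original order of x
trend_only = lowess(endog=y_demo, exog=x_demo, frac=0.3, return_sorted=False)

print(sorted_fit.shape, trend_only.shape)
```

Because `x_demo` is already sorted here, the second column of `sorted_fit` and `trend_only` hold the same values; with unsorted input, `return_sorted=False` keeps the fitted values aligned with the original rows.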


In [52]:
# lowess needs numeric x values; it cannot work with datetime values directly

x = np.arange(len(df))
y = df['deseasonalized'].values
window = 0.1 # use 10% of the data for each local fit


# With return_sorted=True (the default), lowess returns a two-dimensional array:
# the first column contains the sorted x (exog) values
# and the second column the associated estimated y (endog) values.
# With return_sorted=False, it returns a one-dimensional array of estimated y values.
response = lowess(endog=y, exog=x, frac=window)

In [64]:
plt.figure(figsize=(15,4))
ax = plt.subplot()
ax.plot(df['deseasonalized'], label='Deseasonalized')
ax.plot(df.index, response[:,1], label='Trend extracted from deseasonalized');
ax.legend();


<img src='./plots/trend-from-deseasonalized-using-lowess.png'>

### Trend extracted using LOWESS method can be used as expected values

##### Residuals = `y_true - expected`

##### Outlier estimation using `1.5 * IQR` rule

In [65]:
df['expected'] = response[:, 1]
df['residuals'] = df['deseasonalized'] - response[:,1]

## Make sure that the residuals are stationary

In [89]:
df['residuals'].plot(figsize=(15,4), marker='.', linestyle='', markersize=14);

<img src='./plots/stationary-resid.png'>

## IQR, Q1, Q3 using pandas

In [67]:
Q1, Q3 = df['residuals'].quantile(q=[0.25, 0.75])
IQR = Q3 - Q1
print('INTER QUARTILE RANGE :',IQR)
print('Q1 :',Q1)
print('Q3 :',Q3)

INTER QUARTILE RANGE : 2852.0000673578616
Q1 : -1335.8035439788873
Q3 : 1516.1965233789742


In [119]:
plt.figure(figsize=(10,8))
sns.boxplot(x='residuals', data=df);

<img src='./plots/box-plot-residual-sns.png'>

A common rule for deciding whether a data point is an outlier is the 1.5 x interquartile range (IQR) rule.

This rule states that a data point $x$ is an outlier if either of the following holds:

$$x > Q_3 + 1.5 \times IQR$$
$$x < Q_1 - 1.5 \times IQR$$

where $Q_1$ and $Q_3$ are the 1st and 3rd quartiles respectively, and $IQR = Q_3 - Q_1$.

In [131]:
lower_bound = Q1 - 1.5*IQR
upper_bound = Q3 + 1.5*IQR

plt.figure(figsize=(10,8))
plt.axvspan(xmin=lower_bound, xmax=upper_bound, color='g', alpha=0.4);
sns.histplot(df['residuals']);
plt.vlines(x=[lower_bound, upper_bound], ymin=plt.ylim()[0], ymax=plt.ylim()[1], colors='r')

plt.legend(['within 1.5 * IQR bounds', 'lower and upper bound', 'distribution of residuals']);

<img src='./plots/histplot-resid.png'>

In [68]:
def get_outliers(alpha=1.5):
    # Boolean Series: True where the residual lies outside [Q1 - alpha*IQR, Q3 + alpha*IQR]
    resid = df['residuals']
    return (resid > Q3 + alpha*IQR) | (resid < Q1 - alpha*IQR)

In [146]:
outliers = get_outliers(alpha=1.5)

ax = df.plot(y=['deseasonalized'], figsize=(15,4), marker='.')
df[outliers].plot(y=['deseasonalized'], c='r', marker='o', linestyle='', 
                  markersize=12, ax=ax, alpha=0.5, label=['outlier'])

<img src='./plots/outlier-detection-resid-STL.png'>

In [147]:
ax = df.plot(y=['residuals'], marker='.', markersize=16, linestyle='', figsize=(10,8))

df[outliers].plot(y=['residuals'], c='r', marker='o', linestyle='',  markersize=12, ax=ax, 
                    alpha=0.5, label=['outlier'])

plt.hlines(y=[lower_bound, upper_bound], xmin=df.index.min(), xmax=df.index.max(), color='r');


<img src='./plots/residuals-outilier-plot.png'>

#### Let's zoom in and look at the points near the boundary that are classified as outliers

In [162]:
ax = df[df['residuals']<5e4].plot(y=['deseasonalized'], figsize=(15,8), marker='.')
df.query('residuals < 5e4').plot(y=['expected'], ax=ax)
df[outliers].query('residuals < 5e4').plot(y=['deseasonalized'], ax=ax, c='r', linestyle='', 
                marker='o', markersize=16, alpha=0.5, label=['outlier']);

<img src='./plots/outlier-zoomed.png'>

In [234]:
ax = df.query('residuals < 5e4').plot(y=['residuals'], marker='.', markersize=16, linestyle='', figsize=(10,8))

df[outliers].query('residuals < 5e4').plot(y=['residuals'], c='r', marker='o', 
                    linestyle='',  markersize=12, ax=ax, alpha=0.5, label=['outlier'])

plt.hlines(y=[lower_bound, upper_bound], xmin=df.index.min(), xmax=df.index.max(), color='r');

<img src='./plots/residuals-outilier-plot-zoomed.png'>

#### Points that are identified as outliers are indeed points that deviate from the expected value
* points that deviate from the expected value are flagged as outliers
* if we want, we can make the outlier detection less sensitive by increasing the threshold multiplier, for example from 1.5 to 3
    * if the residual is greater than the upper threshold, flag it as an outlier
        * `residuals > Q3 + 3 * IQR`
    * if the residual is less than the lower threshold, flag it as an outlier
        * `residuals < Q1 - 3 * IQR`

#### Sensitivity adjustment
Because the data points (`y_true`) are close to the trend extracted by LOWESS (`y_expected`), the majority of residuals are close to zero. With so many small residuals, Q1, Q3 and the IQR end up small, so the thresholds are tight. **This means that even small fluctuations from the expected value are identified as outliers.** This is easier to see on the residual plot and the box plot. To make outlier detection less sensitive, a simple solution is to increase the threshold multiplier.
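
The effect of the multiplier can be sketched on synthetic residuals: mostly small noise plus three gross outliers. The helper `count_outliers` and the numbers are hypothetical and only meant to show how the count shrinks as the multiplier grows.

```python
import numpy as np

def count_outliers(residuals, alpha):
    """Number of residuals outside [Q1 - alpha*IQR, Q3 + alpha*IQR]."""
    q1, q3 = np.percentile(residuals, [25, 75])
    iqr = q3 - q1
    flagged = (residuals > q3 + alpha * iqr) | (residuals < q1 - alpha * iqr)
    return int(flagged.sum())

rng = np.random.default_rng(42)
demo_residuals = rng.normal(0, 1, 500)                  # many small residuals
demo_residuals[[50, 200, 350]] = [25.0, -30.0, 40.0]    # three gross outliers

for alpha in (1.5, 3, 4):
    print('alpha =', alpha, '-> flagged:', count_outliers(demo_residuals, alpha))
```

A larger multiplier widens the accepted band, so borderline points stop being flagged while the gross outliers remain.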

In [235]:
outliers = get_outliers(alpha=4)

ax = df.plot(y=['deseasonalized'], figsize=(15,4), marker='.')
df[outliers].plot(y=['deseasonalized'], c='r', marker='o', linestyle='', 
                  markersize=12, ax=ax, alpha=0.5, label=['outlier'])

<img src='./plots/outlier-captured-alpha-4.png'>

## Handling outliers

#### 1. MODELING OUTLIERS
Suppose we have the flower sales data from a store and, on visual inspection, we find that the sales on Feb 14 are very high. Is this an outlier?

From visual inspection it is clear that the sales on Feb 14 are very high compared to the rest.

Since Feb 14 is Valentine's Day, we know there will be an increase in flower sales.

This case can be modeled.

We can add a feature to our model, e.g. 1 for Valentine's Day and 0 for all other days.

This way we can teach our model about events like Valentine's Day, where we expect high flower sales.
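
A minimal sketch of such an indicator feature with pandas is shown here. The date range and the column name `is_valentines_day` are made up for illustration.

```python
import pandas as pd

# Hypothetical daily index covering three years
dates = pd.date_range('2020-01-01', '2022-12-31', freq='D')

features = pd.DataFrame(index=dates)
# 1 on Feb 14 of every year, 0 on all other days
features['is_valentines_day'] = ((dates.month == 2) & (dates.day == 14)).astype(int)

print(features[features['is_valentines_day'] == 1])
```

The resulting column can be passed to a forecasting model as an extra regressor, so the Feb 14 spike is explained by the feature instead of being treated as an anomaly.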


#### 2. IMPUTATION 
#### Treat the outliers as missing data and impute them

Imputation methods:
* forward fill `pandas.DataFrame.ffill()`
* backward fill `pandas.DataFrame.bfill()`
* linear interpolation `pandas.DataFrame.interpolate(method='linear')`
* STL decomposition and interpolation
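
The first three methods can be compared on a tiny made-up series with two missing values (the values are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical series where two outliers have been replaced by NaN
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0])

ffilled = s.ffill()                             # carry the last valid value forward
bfilled = s.bfill()                             # carry the next valid value backward
interpolated = s.interpolate(method='linear')   # straight line between valid neighbours

print(pd.DataFrame({'original': s, 'ffill': ffilled,
                    'bfill': bfilled, 'linear': interpolated}))
```

Forward fill gives `1, 1, 1, 4, 5`, backward fill gives `1, 4, 4, 4, 5`, and linear interpolation gives `1, 2, 3, 4, 5`. For a series with a trend, linear interpolation usually produces the least abrupt fill.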


<img src='./Notes/Handle-outliers.PNG'>

## Treat the outliers as missing and impute

In [182]:
deseason = df['deseasonalized'].copy()
deseason[outliers] = np.nan
print('Number of outliers set as NaN : ',deseason.isna().sum())
# use linear interpolation to fill the missing
deseason.interpolate(method='linear', inplace=True)
print('Number of missing data after interpolation : ',deseason.isna().sum())
# adding back the seasonality
y_corrected = deseason + res.seasonal

Number of outliers set as NaN :  8
Number of missing data after interpolation :  0


In [236]:
plt.figure(figsize=(15,5))
plt.plot(y_corrected, label='y_corrected', marker='.')
ax = y_corrected[outliers].plot(c='r', linestyle='', markersize=16, marker='o', alpha=0.5);
ax.set(title='Treat the outliers as missing data and Impute them using imputation methods');

<img src='./plots/handle-outlier.png'>

In [239]:
plt.figure(figsize=(15,5))
plt.subplot(121)
plt.plot(df['y'], label='data with outlier')
plt.legend()
plt.subplot(122)
plt.plot(y_corrected, label='y_corrected')
plt.legend()

<img src='./plots/retail-sales-with-and-without-outliers-plot.png'>