## Outlier Detection using Supervised Machine Learning 

Outlier detection is not straightforward, mainly because what counts as an outlier depends on the data and on the problem we are trying to solve. Domain knowledge is vital for making sound judgments when spotting outliers.

Most ML techniques for outlier detection are unsupervised, such as **Isolation Forests (iForest), unsupervised K-Nearest Neighbors (KNN), Local Outlier Factor (LOF), and Copula-Based Outlier Detection (COPOD)**.
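To make the idea behind one of these methods concrete, here is a minimal sketch of distance-based KNN scoring written in plain NumPy. It is not PyOD's implementation: the function name `knn_outlier_scores`, the synthetic data, and the choice of `k=5` are assumptions made for illustration only.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each row by the distance to its k-th nearest neighbor (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all rows
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the distance of a point to itself (0),
    # so column k is the distance to the k-th nearest neighbor
    sorted_dist = np.sort(dist, axis=1)
    return sorted_dist[:, k]

# Synthetic data: 200 points around the origin plus one planted far-away point
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
X[0] = [8.0, 8.0]

scores = knn_outlier_scores(X, k=5)
most_anomalous = int(scores.argmax())  # index of the point with the largest score
```

A point that sits far from its neighbors receives a large score, which is why the planted point stands out. The pairwise distance matrix grows quadratically with the number of rows, so this sketch only suits small data sets.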

Generally, outliers (or anomalies) are considered rare occurrences. In other words, we assume that only a small fraction of a large data set consists of outliers; for example, 1% of the data may be potential outliers. This rarity makes detection harder and calls for methods designed to find patterns in the data. Unsupervised outlier detection techniques are well suited to surfacing such rare occurrences.
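The "small fraction" assumption can be turned into a threshold on outlier scores. The sketch below flags the highest-scoring 1% of points. The helper `flag_top_fraction`, the synthetic values, and the use of absolute deviation from the median as a score are all assumptions for illustration. PyOD detectors express a similar idea through their `contamination` parameter.

```python
import numpy as np

def flag_top_fraction(scores, contamination=0.01):
    """Return a boolean mask marking the highest-scoring fraction of points (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    # Scores above the (1 - contamination) quantile are treated as outliers
    threshold = np.quantile(scores, 1 - contamination)
    return scores > threshold

# Synthetic data: 1,000 ordinary values with five planted anomalies
rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=10, size=1000)
values[:5] += 200

# A simple stand-in score: absolute deviation from the median
scores = np.abs(values - np.median(values))
mask = flag_top_fraction(scores, contamination=0.01)
```

Note that a fixed fraction always flags roughly that share of the data, whether or not that many true outliers exist, so the value is best chosen with domain knowledge.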

After investigating outliers, we will have a historical set of labeled data, allowing us to leverage semi-supervised outlier detection techniques.

Here we will introduce the **PyOD** library, described as "a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data". 

In [1]:
import pyod 

In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
from pathlib import Path 
import warnings 
warnings.filterwarnings('ignore')
plt.rcParams["figure.figsize"] = [16,3]


In [3]:
# load the nyc taxi dataset 
file = Path("../TimeSeriesAnalysisWithPythonCookbook/Data/nyc_taxi.csv")
nyc_taxi = pd.read_csv(file, index_col='timestamp', parse_dates=True)
# 30-minute intervals; the older "30T" alias is deprecated in recent pandas versions
nyc_taxi.index.freq = "30min"

In [4]:
# Store the known dates containing outliers, also known as ground truth labels

nyc_dates = [
    "2014-11-01",
    "2014-11-27",
    "2014-12-25",
    "2015-01-01",
    "2015-01-27"
]
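One way to use such known dates is to turn them into binary labels aligned with a daily index. The following is a minimal sketch on a synthetic stand-in DataFrame: the date range, the placeholder values, and the names `daily` and `labels` are assumptions and are not part of the notebook's data loading.

```python
import numpy as np
import pandas as pd

known_dates = pd.to_datetime([
    "2014-11-01",
    "2014-11-27",
    "2014-12-25",
    "2015-01-01",
    "2015-01-27",
])

# Synthetic daily series standing in for the resampled taxi data (assumed range)
idx = pd.date_range("2014-07-01", "2015-01-31", freq="D")
daily = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# 1 marks a known outlier date, 0 marks every other day
labels = pd.Series(daily.index.isin(known_dates).astype(int),
                   index=daily.index, name="is_outlier")

labeled_days = labels[labels == 1].index.strftime("%Y-%m-%d").tolist()
```

Labels in this form can be compared against the flags produced by any detector to check how many known outliers it recovers.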


In [5]:
# Create the plot_outliers function that we will use throughout 

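def plot_outliers(outliers, data, method='KNN',
                  halignment='right',
                  valignment='top',
                  labels=False):
    """Plot `data` and mark the rows in `outliers` (a DataFrame with a 'value' column).

    With labels=True, each outlier is annotated with its date (MM/DD);
    otherwise outliers are drawn as red X markers.
    """
    ax = data.plot(alpha=0.6)

    if labels:
        for date, value in outliers['value'].items():
            # Hollow triangle at the outlier, with the date label slightly below it
            plt.plot(date, value, 'v', markersize=8,
                     markerfacecolor='none', markeredgecolor='k')
            plt.text(date, value - (value * 0.04), date.strftime("%m/%d"),
                     horizontalalignment=halignment,
                     verticalalignment=valignment)
    else:
        data.loc[outliers.index].plot(ax=ax, style='rX', markersize=9)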
    
    plt.title(f'NYC Taxi - {method}')
    plt.xlabel('date')
    plt.ylabel('# of passengers')
    plt.legend(['nyc taxi', 'outliers'])
    plt.show()