# IQR Model

#### Using the IQR methodology to create an anomaly detection model

This anomaly detection model detects historical and recent outliers using the interquartile range (IQR) method. A data point is considered an outlier if it lies outside the range $\left(Q_1 - 1.5\,\text{IQR}, Q_3 + 1.5\,\text{IQR}\right)$, where $Q_1$ and $Q_3$ are the first and third quartiles and $\text{IQR} = Q_3 - Q_1$.

Links:
- [Why 1.5IQR is used for detecting outliers](https://math.stackexchange.com/questions/966331/why-john-tukey-set-1-5-iqr-to-detect-outliers-instead-of-1-or-2)
- [Augmented Dickey-Fuller test](https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test)
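
As a quick illustration of the outlier rule above, here is a minimal sketch on a small made-up array (the values are hypothetical and not taken from the dataset):

```python
import numpy as np

# hypothetical sample with one obviously extreme value
values = np.array([10, 12, 11, 13, 12, 11, 40, 12, 10, 13])

# first and third quartiles, then the interquartile range
q1, q3 = np.quantile(values, [0.25, 0.75])
iqr = q3 - q1

# Tukey's fences: anything outside (lower, upper) is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"bounds: ({lower}, {upper})")
print(f"outliers: {outliers}")
```

Only the value 40 falls outside the fences here; the rest of the sample sits well inside them.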

First import all the necessary modules and set up the dataframe:

In [None]:
from dotenv import load_dotenv
import os

import numpy as np
from numpy import polyfit
from matplotlib import pyplot
import pandas as pd
from datetime import datetime
from statsmodels.tsa.stattools import adfuller

load_dotenv()
DATASET_PATH = os.environ.get("DATASET_PATH")

# main dataframe
df = pd.read_excel(DATASET_PATH + "Conversion by Day.xlsx")

# in this dataset the dimension is POLICY: Risk State, but this will work for any other single-dimension
# dataset where the dimension is in the first column
states = sorted(list(set(df[df.columns[0]].tolist())))  # convert to set then to list to remove duplicates

# preview of dataframe
df.head()

Filter dataframe by state:

In [None]:
state = input("State: ")
# select rows which match the input state
df_new = df.loc[df.STATE_CODE == state][["QUOTE_DATE", "Net Closing Rate"]]
df_new.plot(x="QUOTE_DATE", y="Net Closing Rate", ylabel="Net Closing Rate", legend=False);

For the anomaly detection model to work, the data must be stationary (i.e. its mean and variance do not change over time). Some data preparation is required to remove trend and seasonality from the data. First, a high-degree polynomial is fit to the data:

In [None]:
# fit polynomial model to data
n = len(df_new["Net Closing Rate"])  # number of data points

X = [i % 365 for i in range(n)]
y = df_new["Net Closing Rate"].values
degree = 10
coefficients = polyfit(X, y, degree)

# create curve: evaluate the fitted polynomial at every data point
# (np.polyval expects coefficients ordered from highest degree to lowest, as returned by polyfit)
polynomial = np.polyval(coefficients, X)

pyplot.plot(df_new["Net Closing Rate"].values)
pyplot.plot(polynomial);
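
A degree-10 fit on raw day numbers (0 to 364) can be poorly conditioned, and `polyfit` may emit a rank warning. One possible alternative, shown here only as a sketch on synthetic data, is `numpy.polynomial.Polynomial.fit`, which rescales the x-values internally before fitting. The series, the noise level and the degree of 4 below are all assumptions chosen for illustration:

```python
import numpy as np
from numpy.polynomial import Polynomial

# synthetic stand-in for a daily rate with a yearly cycle (not the real dataset)
rng = np.random.default_rng(0)
n = 730
day_of_year = np.arange(n) % 365
series = 0.3 + 0.05 * np.sin(2 * np.pi * day_of_year / 365) + rng.normal(0, 0.01, n)

# fit a polynomial in the day-of-year; Polynomial.fit maps x onto [-1, 1] internally
degree = 4  # illustrative choice, lower than the degree used in the notebook
fitted = Polynomial.fit(day_of_year, series, degree)

# evaluate the fitted curve and subtract it to get the seasonally adjusted residuals
curve = fitted(day_of_year)
residuals = series - curve

print(f"std before adjustment: {series.std():.4f}")
print(f"std after adjustment:  {residuals.std():.4f}")
```

The returned object is callable, so no manual loop over coefficients is needed to evaluate the curve.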

Then a detrended series is constructed by subtracting the polynomial value from the original value at each data point:

In [None]:
# seasonality adjustment: subtract the fitted curve from the observed values
difference = y - np.asarray(polynomial)

pyplot.plot(difference)
pyplot.show()

Now an augmented Dickey-Fuller (ADF) test is performed to test the stationarity of the adjusted data. The null hypothesis is that the series has a unit root (i.e. it is not stationary), and the alternative is that it is stationary:

In [None]:
result = adfuller(difference)
print(f"ADF Statistic: {result[0]:f}")
print(f"P-value: {result[1]:f} " + ("< 0.01 (significant)" if result[1] < 0.01 else ">= 0.01 (not significant)"))
print("Critical values:")
for key, value in result[4].items():
    print(f"\t{key}: {value:.3f}")
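
For reference, with its default settings (`regression="c"`, i.e. a constant and no trend term), `adfuller` tests the coefficient $\gamma$ in a regression of roughly the following form:

$$\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t$$

Here $y_t$ is the adjusted series (`difference`), $\Delta y_t = y_t - y_{t-1}$, and $p$ is the number of lagged differences, which `adfuller` chooses automatically by default. The null hypothesis corresponds to $\gamma = 0$ (a unit root). A test statistic more negative than the printed critical values, or equivalently a small p-value, is evidence against the null and in favour of stationarity.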

Now we can look for outliers by calculating $Q_1$, $Q_3$ and the IQR, then checking which data points fall outside the bounds:

In [None]:
# calculate upper and lower bounds for detecting outliers
q1 = np.quantile(difference, 0.25)
q3 = np.quantile(difference, 0.75)
iqr = q3 - q1

min_bound = q1 - 1.5 * iqr
max_bound = q3 + 1.5 * iqr


# split anomalies into recent (from the first day of last month onward) and historical (anything older)
last_month = datetime.now() - pd.DateOffset(months=1)
last_month = datetime(last_month.year, last_month.month, 1)  # round to first day of month

anomalies = {
    "QUOTE_DATE": [],
    "Net Closing Rate": [],
}
historical_anomalies = {
    "QUOTE_DATE": [],
    "Net Closing Rate": [],
}

for i in range(len(df_new)):
    value = difference[i]
    date = df_new["QUOTE_DATE"].iloc[i]
    if (value < min_bound) or (value > max_bound):
        # append original data rather than transformed data
        if date >= last_month:
            anomalies["QUOTE_DATE"].append(df_new["QUOTE_DATE"].iloc[i])
            anomalies["Net Closing Rate"].append(df_new["Net Closing Rate"].iloc[i])
        else:
            historical_anomalies["QUOTE_DATE"].append(df_new["QUOTE_DATE"].iloc[i])
            historical_anomalies["Net Closing Rate"].append(df_new["Net Closing Rate"].iloc[i])

print("Recent anomalies (last month)\n-----------------------------")
display(pd.DataFrame(anomalies))
print("\nHistorical anomalies\n-----------------------------")
display(pd.DataFrame(historical_anomalies))
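
The row-by-row loop above could also be written with boolean masks. The following is only a sketch: `split_outliers` is a hypothetical helper, and the small demo frame and the median-based residuals are made up for illustration rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

def split_outliers(frame, residuals, cutoff, date_col="QUOTE_DATE", k=1.5):
    """Return (recent, historical) rows of `frame` whose residual lies outside the IQR fences."""
    residuals = np.asarray(residuals, dtype=float)
    q1, q3 = np.quantile(residuals, [0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (residuals < q1 - k * iqr) | (residuals > q3 + k * iqr)
    is_recent = (frame[date_col] >= cutoff).to_numpy()
    # return rows of the original frame, not the transformed values
    return frame[is_outlier & is_recent], frame[is_outlier & ~is_recent]

# hypothetical demo data: ten days with one early spike and one late dip
demo = pd.DataFrame({
    "QUOTE_DATE": pd.date_range("2023-01-01", periods=10, freq="D"),
    "Net Closing Rate": [0.30, 0.31, 0.90, 0.29, 0.30, 0.32, 0.31, 0.30, 0.05, 0.31],
})
# stand-in for the detrended series: deviation from the median
demo_residuals = demo["Net Closing Rate"] - demo["Net Closing Rate"].median()

recent, historical = split_outliers(demo, demo_residuals, cutoff=pd.Timestamp("2023-01-06"))
print(recent)
print(historical)
```

With the cutoff set to 6 January, the late dip is reported as recent and the early spike as historical.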

We can loop through all states to run the same process on the entire dataframe at once. Note that this only outputs recent outliers.

In [None]:
from dotenv import load_dotenv
import os

from statsmodels.tsa.stattools import adfuller
from datetime import datetime
import numpy as np
import pandas as pd

load_dotenv()
DATASET_PATH = os.environ.get("DATASET_PATH")

# main dataframe
df_main = pd.read_excel(DATASET_PATH + "Conversion by Day.xlsx")

# get a list of the categories for this dimension, e.g. ['ACT', 'NSW', ...] for POLICY: Risk State
categories = df_main[df_main.columns[0]].tolist()
categories = sorted(list(set(categories)))  # remove duplicate groupings and sort alphabetically

response_col = int(input("Column number of the response variable: "))
response = df_main.columns[response_col]

anomaly_dfs = []  # to store, for each category, the dataframe rows flagged as anomalies
    
for category in categories:
    # filter out all other categories
    df = df_main.loc[df_main.iloc[:, 0] == category].copy()  # copy so the new column is not added to a slice of df_main
    
    # fit polynomial to curve - first determine coefficients of polynomial
    n = len(df[response])
    x = [i % 365 for i in range(n)]
    y = df[response].values
    degree = 10
    coefficients = np.polyfit(x, y, degree)
    
    # create the polynomial curve by evaluating the fit at each point
    polynomial = np.polyval(coefficients, x)

    # seasonality/trend adjustment - take the difference between the original and polynomial values
    adjusted_response = y - np.asarray(polynomial)
    # insert adjusted response variable into the dataframe as a new column
    df.insert(response_col+1, f"Adjusted {response}", adjusted_response, False)
    
    # calculate upper and lower bounds for detecting outliers
    q1 = df[f"Adjusted {response}"].quantile(0.25)
    q3 = df[f"Adjusted {response}"].quantile(0.75)
    iqr = q3 - q1
    
    min_bound, max_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    
    # look for anomalies from the first day of the month two months ago up to today
    last_month = datetime.now() - pd.DateOffset(months=2)
    last_month = datetime(last_month.year, last_month.month, 1)  # round to first day of month
    
    anomalies = []    
    for i in range(n):
        value = df[f"Adjusted {response}"].iloc[i]
        date = df["QUOTE_DATE"].iloc[i]
        if (not (min_bound <= value <= max_bound)) and date >= last_month:
            anomalies.append(i)
            
    anomaly_dfs.append(df.iloc[anomalies, :])

# filter out only the anomalies from the original dataframe
df_outliers = pd.concat(anomaly_dfs)
display(df_outliers)
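
The per-category loop could also be expressed with `groupby` and a small helper function. The sketch below makes several assumptions: `flag_outliers` is a hypothetical helper, the data is synthetic with an injected spike in each state, the polynomial degree is lowered to 3, and the date filter for recent outliers is left out for brevity.

```python
import numpy as np
import pandas as pd
from numpy.polynomial import Polynomial

def flag_outliers(group, response, degree=3, k=1.5):
    """Detrend one category with a polynomial fit and return the rows outside the IQR fences."""
    day = np.arange(len(group)) % 365
    y = group[response].to_numpy(dtype=float)
    residuals = y - Polynomial.fit(day, y, degree)(day)
    q1, q3 = np.quantile(residuals, [0.25, 0.75])
    iqr = q3 - q1
    mask = (residuals < q1 - k * iqr) | (residuals > q3 + k * iqr)
    return group[mask]

# synthetic stand-in for the dataset: two states, 120 days each, one spike per state
rng = np.random.default_rng(1)
frames = []
for code in ["NSW", "VIC"]:
    rate = 0.3 + rng.normal(0, 0.01, 120)
    rate[50] = 0.9  # injected anomaly
    frames.append(pd.DataFrame({
        "STATE_CODE": code,
        "QUOTE_DATE": pd.date_range("2023-01-01", periods=120, freq="D"),
        "Net Closing Rate": rate,
    }))
demo = pd.concat(frames, ignore_index=True)

# apply the helper to each state and stack the flagged rows
flagged = pd.concat(
    [flag_outliers(group, "Net Closing Rate") for _, group in demo.groupby("STATE_CODE")]
)
print(flagged)
```

Keeping the detrending and the fence calculation in one function makes it easier to test the logic on a single category before running it over the whole dataframe.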