# Table of Contents
1. [Week 1 Summary](#week1)
    1. [Getting started](#week1a)
    2. [Research into anomaly detection models](#week1b)
    3. [IQR method for outliers](#week1c)
    4. [ARIMA model](#week1d)
2. [Week 2 Summary](#week2)
    1. [SARIMA and SARIMAX](#week2a)
    2. [K-means clustering](#week2b)
    3. [ADModel](#week2c)
3. [Week 3 Summary](#week3)
    1. [ADModel (continued)](#week3a)
    2. [STL Decomposition](#week3b)
    3. [Pipelines](#week3c)
4. [Week 4 Summary](#week4)

<a id="week1"></a>

# Week 1 Summary

<a id="week1a"></a>
## Getting started

**Notebooks**:
- [Intro to Pandas and Matplotlib](./Week1_Intro%20to%20Pandas%20and%20Matplotlib.ipynb)

I started off by trying to understand the underlying data I was working with: what the numbers represented, how the dataset was organised, what the different dimensions and columns were, etc. For the tech stack, my options were:

- Python, since it is popular for data science and machine learning work
- R, since it was designed for statistical analysis and has a rich library of packages and modules

I ended up choosing Python since it was the language I was most familiar and comfortable with, and because I could do essentially everything I needed in Python alone.

---

<a id="week1b"></a>
## Research into anomaly detection models

**Notebooks**:
- [Intro to Isolation Forest](./Week1_Intro%20to%20Isolation%20Forest.ipynb)
- [Intro to Simple Linear Regression](./Week1_Intro%20to%20Simple%20Linear%20Regression.ipynb)

I researched some common anomaly detection techniques and models to get an idea for how I would construct my model, and did some early experimentation. Some models I ran into were:

- [Isolation Forest](https://en.wikipedia.org/wiki/Isolation_forest)
- [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN)
- [kNN](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)

While these algorithms were interesting, they didn't really fit my use case (isolation forest benefits from having a lot of features, DBSCAN and kNN group points by density or distance, etc.). Simple linear regression was out of the picture as well, since the data wasn't linear.

---

<a id="week1c"></a>
## IQR method for outliers

**Notebooks**:
- [IQR Model](./Week1_IQR%20Model.ipynb)

The first anomaly detection model I looked at implementing was a simple one based on the IQR method for outliers. A data point is considered an outlier if it lies outside the range $\left(Q_1 - 1.5\,\text{IQR},\ Q_3 + 1.5\,\text{IQR}\right)$, where $\text{IQR} = Q_3 - Q_1$. These fixed thresholds aren't very useful, though, if the data exhibits seasonality or a trend. Fitting a polynomial to the time series and subtracting it helped detrend the data, but this isn't a very sophisticated technique and its performance can vary significantly with the polynomial being fit. Nevertheless, it was a good starting point.
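
As a minimal sketch of the IQR rule described above (not the notebook's code), the fences can be computed with the standard library alone. The function name `iqr_outliers`, the multiplier argument `k` and the sample readings are illustrative assumptions, and the values are assumed to be already detrended.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs lying outside (Q1 - k*IQR, Q3 + k*IQR)."""
    # "inclusive" treats the data as the full population when interpolating quartiles
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Made-up readings with one obvious spike at position 6
readings = [10, 12, 11, 13, 12, 11, 40, 12, 10, 13]
print(iqr_outliers(readings))
```

Because the fences are constants, any trend or seasonality in the input shifts points across them, which is why detrending comes first.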

---

<a id="week1d"></a>
## ARIMA model

**Notebooks**:
- [ARIMA Model](./Week1_ARIMA%20Model.ipynb)

The next technique I looked into was ARIMA modelling. ARIMA is powerful because it can forecast future data from previous data and can account for trends and seasonality (the latter through an extension of ARIMA called SARIMA). This research and development took the rest of the week. When I applied the model to the data, the in-sample predictions weren't always very accurate. For some data, the best-fitting model was simply an $\text{ARIMA}(0, 0, 0)$ model (essentially white noise, i.e. errors uncorrelated over time). This could have been due to the volatility and granularity of the data, or perhaps because there wasn't enough data (it only went back 2 years). The forecasts (i.e. out-of-sample predictions) were also often very weak and flatlined after only a few steps forward.
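
A short sketch of why such forecasts flatline, using the textbook forms of the two simplest stationary cases (with a constant $\mu$ and white-noise errors $\varepsilon_t$; this is a general property, not something derived in the notebook):

$$
\text{ARIMA}(0,0,0):\quad y_t = \mu + \varepsilon_t \;\Longrightarrow\; \hat{y}_{T+h} = \mu \quad \text{for every } h \ge 1
$$

$$
\text{ARIMA}(1,0,0):\quad y_t - \mu = \phi\,(y_{t-1} - \mu) + \varepsilon_t \;\Longrightarrow\; \hat{y}_{T+h} = \mu + \phi^{h}\,(y_T - \mu) \;\to\; \mu \quad \text{as } h \to \infty,\ |\phi| < 1
$$

So a white-noise model forecasts the mean immediately, and a low-order stationary model decays to it geometrically.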

<a id="week2"></a>
# Week 2 Summary

<a id="week2a"></a>
## SARIMA and SARIMAX

**Notebooks**:
- [SARIMAX Model](./Week%202_SARIMAX%20Model.ipynb)

Whilst researching ARIMA models I came across SARIMA and SARIMAX, extensions of the ARIMA model. SARIMA adds a seasonal component to ARIMA, and SARIMAX adds both a seasonal component and exogenous variables. Though this somewhat improved the accuracy of the in-sample and out-of-sample predictions, it required a grid search to determine the best parameters $(p, d, q)\times(P, D, Q)_m$, and was thus slower than the ARIMA model.

The ARIMA model's forecast would be more useful if there were data points to compare against the forecast. This idea led to the following process:
- Split the data into training and test datasets, where the test dataset consists of the past month's data
- Fit an $\text{ARIMA}(p, d, q)$ model to the training dataset
- Forecast (predict out-of-sample) a month forward to compare against the test dataset
- Compute a (95%) confidence interval for the forecast and compare it against the actual data points (test dataset). Flag a data point as an outlier if it lies outside this confidence interval

This method relies not on the forecast itself but on the confidence interval it generates. This is helpful for $\text{ARIMA}(0, 0, 0)$ models: if the data is essentially white noise, the confidence interval still helps to identify outliers (points that would not be consistent with white noise). Since forecasting ability deteriorates as the forecast horizon grows (the confidence interval widens), it would be better to reduce the size of the test dataset, say to 7 days, to keep the forecast and confidence interval as tight as possible.

Taking this idea a little further, the process in production would go something like this:
- Preparation/initial modelling:
    - First split the data into training and test datasets, where the test dataset consists of data from the past week (past 7 days)
    - Fit an $\text{ARIMA}(p, d, q)$ model to the training dataset
    - Forecast 7 days forward
    - Compute a confidence interval for the forecast, and compare this against the actual data points. A data point is considered an outlier if it lies outside this confidence interval
- Continuous anomaly detection:
    - Every 7 days, append new data to the dataset. Using the same $\text{ARIMA}(p, d, q)$ parameters*, compute a new forecast on this increased dataset.
    - Detect anomalies in the past 7 days based on the confidence interval.
    
\* this process would only work well if the $p, d, q$ parameters are correctly and accurately chosen. If not, the ARIMA parameters would need to be recalculated every time the algorithm is run.

---

<a id="week2b"></a>
## K-means clustering

**Notebooks**:
- [Experimenting with K-Means Clustering](./Week2_Intro%20to%20K-Means%20Clustering.ipynb)

K-means clustering was another anomaly detection technique that I looked into. It can be used as a standalone anomaly detection model, but I experimented with using it as a tool for clustering outliers together. This process first involved getting outliers from the ARIMA model and then inputting these outliers into the k-means clustering algorithm. This would cluster and group outliers, and could potentially reveal similarities between outliers which may help in determining the cause of certain outliers (e.g. many anomalies tend to occur around this date in this particular state). In the future, this could be scaled up to the whole dataset so that a multi-dimensional clustering algorithm can be run on the database to provide more detailed explanations for the cause of certain anomalies.
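
A small sketch of clustering already-detected outliers, assuming `sklearn`. The two features (day of year and deviation from the forecast) and the values themselves are made up for illustration; the notebook may use different dimensions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical outlier features: [day of year, deviation from forecast]
outliers = np.array([
    [12, 35.0], [14, 32.5], [15, 38.1],        # spikes in early January
    [200, -20.3], [203, -18.9], [205, -22.4],  # dips in mid-year
])

# Scale first so that no single feature dominates the Euclidean distance
scaled = StandardScaler().fit_transform(outliers)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)
```

Outliers that share a label sit close together in the chosen feature space, which is the starting point for asking what they have in common.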


---

<a id="week2c"></a>
## ADModel

**Notebooks**:
- [ARIMA Example](./Week2_ARIMA%20Example.ipynb)

I also started working on the `ADModel` class this week. This class is intended to be an extension of Pandas' `DataFrame` and provides extra functionality for data preparation and anomaly detection. I also created the `arimatools` and `timeseries` modules which provide helper functions for working with and automating ARIMA modelling and time series data manipulation. Ideally this model would be used in production as it provides the ability to automate a lot of the modelling and anomaly detection. As I learn more about and implement more anomaly detection models, the `ADModel` class should hopefully be able to handle all of those; all that needs to be done is to create new helper modules to automate some of the modelling processes, and the `ADModel` should be able to handle the rest.
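
For readers unfamiliar with extending `DataFrame`, the general pattern might look like the sketch below. This is not the actual `ADModel` implementation: the class name `AnomalyFrame` and its `iqr_outliers` method are hypothetical, chosen only to show how a subclass can add anomaly detection helpers while still behaving like a `DataFrame`.

```python
import pandas as pd


class AnomalyFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass with one anomaly detection helper."""

    @property
    def _constructor(self):
        # Makes pandas operations (slicing, filtering) return AnomalyFrame, not DataFrame
        return AnomalyFrame

    def iqr_outliers(self, column, k=1.5):
        """Return the rows whose `column` value lies outside the IQR fences."""
        q1, q3 = self[column].quantile(0.25), self[column].quantile(0.75)
        spread = k * (q3 - q1)
        mask = (self[column] < q1 - spread) | (self[column] > q3 + spread)
        return self[mask]


frame = AnomalyFrame({"value": [10, 12, 11, 13, 12, 11, 40, 12, 10, 13]})
outliers = frame.iqr_outliers("value")
print(type(outliers).__name__, outliers.index.tolist())
```

Overriding `_constructor` is what keeps the extra methods available on the results of ordinary pandas operations.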

<a id="week3"></a>

# Week 3 Summary

<a id="week3a"></a>
## ADModel (continued)

**Notebooks**:
- [ARIMA Example](./Week2_ARIMA%20Example.ipynb)

I spent much of this week continuing work on the `ADModel` class and turning it into its own mini package. At this point only the ARIMA model was implemented in the class, but my hope was that I had designed it so that any other anomaly detection technique could be added seamlessly in the future. I also began writing unit tests for the `ADModel` class.

<a id="week3b"></a>
## STL Decomposition

**Notebooks**:
- [STL Modelling](./Week3_STL%20Modelling.ipynb)

Seasonal and Trend decomposition using Loess (STL) is a method for extracting the seasonal and trend components from a time series. This leaves behind a residual which is stationary and (usually) also approximately normal (this can be verified with a QQ plot). A simple outlier detection method can then be applied to this stationary series, like the IQR method from Week 1 or a standard deviation technique (e.g. a 95% interval that flags the outer 5% of values).

The benefit this method has over ARIMA is that it is much quicker, as there is no need to grid search through combinations of parameters. There are, however, parameters that can be fine-tuned to improve the seasonal and trend extraction from the data.

The problem with STL is that the trend and seasonal components always seemed to be overly complex, which risks them unintentionally absorbing the irregularities (spikes and dips) in the data. This could perhaps be mitigated by adjusting the parameters.

<a id="week3c"></a>
## Pipelines

`sklearn` offers a great `Pipeline` class for creating chains of data transformations. These can be extended with custom transformers that are compatible with the rest of the `sklearn` library. Creating my own custom transformers and integrating them into the `sklearn` ecosystem allows for flexible, easy-to-use pipelines that can be modified or extended in the future if needed, so I migrated all my data processing and transformation code to be compatible with `sklearn`'s `Pipeline`.
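
As an illustration of the pattern (not the project's actual transformers), a custom transformer only needs `fit` and `transform` methods and the two `sklearn` base classes. The `PolynomialDetrender` class below is a hypothetical example, and it assumes `transform` is called on the same time axis that was used in `fit`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class PolynomialDetrender(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts a fitted polynomial trend from each column."""

    def __init__(self, degree=1):
        self.degree = degree

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        steps = np.arange(X.shape[0])
        # One set of polynomial coefficients per column
        self.coefs_ = [np.polyfit(steps, X[:, j], self.degree) for j in range(X.shape[1])]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        steps = np.arange(X.shape[0])
        trend = np.column_stack([np.polyval(c, steps) for c in self.coefs_])
        return X - trend


pipeline = Pipeline([
    ("detrend", PolynomialDetrender(degree=1)),
    ("scale", StandardScaler()),
])

# Made-up single-column series: upward trend plus a repeating pattern
t = np.arange(20)
X = (0.5 * t + np.tile([0.0, 1.0, 0.0, -1.0], 5)).reshape(-1, 1)
out = pipeline.fit_transform(X)
print(out.shape)
```

Because the transformer follows the standard interface, it can be swapped, reordered or combined with built-in steps without changing the surrounding code.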

<a id="week4"></a>

# Week 4 Summary

## Amodely

## Proof of concept