## All the models and techniques I’ve researched/looked into/developed:
- **Simple Linear Regression**
- **Clustering algorithms**
    - Algorithms like isolation forest rely on the data having many dimensions, but this does not suit every feature, e.g. conversion rate. **It is not possible to retain multiple dimensions and still calculate conversion rate, since the data must be collapsed down to a daily, weekly, monthly, etc. series along one dimension only**
    - Some algorithms, like k-means, don't detect outliers; they just cluster everything into groups
    - DBSCAN looks promising and I might explore it next, but it is very sensitive to its parameters (`eps` and `min_samples` in scikit-learn), and clustering techniques in general may not work as effectively on time series data.
- **IQR Model**
    - Process:
        - Fit a high-order polynomial to the time series and subtract it to obtain an (approximately) stationary residual series
        - Classify a data point as an outlier if its residual lies outside $(Q_1 - 1.5\,\text{IQR},\ Q_3 + 1.5\,\text{IQR})$
    - Strengths:
        - Easy to implement, quick runtime
        - Intuitive
    - Limitations:
        - No forecasting abilities
        - Polynomial fitting is not a reliable way of detrending data
        - Model needs to be recalculated when new data is added
        - Anomaly threshold is fixed and doesn't account for finer details in the data
- **ARIMA Model**
    - Process:
        - Split dataset into historical vs recent (last 7 days) data
        - Fit ARIMA parameters p, d, q to historical data
        - Determine best parameters using grid search of combinations for p, d, q
            - "Best" as determined by AICc score
        - Forecast one week ahead with the fitted ARIMA model and compare the forecast to the recent data
        - Classify a data point as an outlier if it lies outside the forecast's confidence interval
    - Strengths:
        - Forecasting
        - Seems to work well and detects anomalies appropriately
        - Completely automatable with grid search algorithm
    - Limitations:
        - Fails when the data looks like white noise (i.e. the best ARIMA order is (0, 0, 0)). This may not be much of a problem, given that the anomaly detection is based on the forecast's confidence interval rather than the point forecast
        - Somewhat slow - takes a few minutes to run. This also may not be a problem if it were set up to run automatically in the background e.g. weekly
- **STL Decomposition**
    - Process:
        - Extract trend and seasonality from the time series, leaving behind a (stationary) residual series
        - Perform anomaly detection on the residual series: could use the IQR method, a normality assumption, etc.
    - Strengths:
        - Removes both trend and seasonality from the series, so the residual should be close to stationary
    - Limitations:
        - No forecasting
        - Seems to be prone to overfitting: the extracted trend component always comes out very complex
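
The IQR rule appears twice in the list above: on the polynomial-detrended series in the IQR model, and as one option for the STL residuals. Below is a minimal sketch of that rule, assuming the series has already been detrended. `iqr_outliers` and `stl_residuals` are illustrative names, not part of the project code. `stl_residuals` assumes statsmodels and NumPy are installed and imports them lazily so the IQR helper works on its own; `period=7` assumes daily data with a weekly cycle.

```python
import statistics


def iqr_outliers(residuals, k=1.5):
    """Return indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, r in enumerate(residuals) if r < lower or r > upper]


def stl_residuals(series, period=7):
    """Remove trend and seasonality with STL and return the residuals.

    Assumption: statsmodels is available and the seasonal period is known.
    """
    import numpy as np  # lazy imports: optional dependencies of this sketch
    from statsmodels.tsa.seasonal import STL

    values = np.asarray(series, dtype=float)
    return list(STL(values, period=period, robust=True).fit().resid)


# Hypothetical detrended residuals with one obvious spike at index 6
residuals = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, 3.0, -0.05, 0.1, -0.15]
print(iqr_outliers(residuals))  # → [6]
```

For the STL variant, the same helper would be applied to the output of `stl_residuals(series)` rather than to a polynomial-detrended series.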

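The ARIMA process listed above (hold out the last week, grid-search p, d, q by AICc, forecast, flag points outside the interval) could be sketched roughly as follows. This assumes statsmodels and NumPy are installed and is not the code in `arimatools.py`; the function names and search ranges are illustrative only.

```python
import itertools
import warnings


def candidate_orders(max_p=2, max_d=1, max_q=2):
    """Every (p, d, q) combination to try in the grid search."""
    return list(itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)))


def best_arima(history, orders):
    """Fit each candidate order and keep the fit with the lowest AICc."""
    # Lazy import: statsmodels is an assumed dependency of this sketch.
    from statsmodels.tsa.arima.model import ARIMA

    best = None
    for order in orders:
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")  # silence convergence warnings
                fit = ARIMA(history, order=order).fit()
        except Exception:
            continue  # skip orders that cannot be fitted
        if best is None or fit.aicc < best.aicc:
            best = fit
    return best


def outside_interval(recent, lower, upper):
    """Indices of recent observations that fall outside [lower, upper]."""
    return [i for i, (y, lo, hi) in enumerate(zip(recent, lower, upper)) if y < lo or y > hi]


def detect(series, holdout=7, alpha=0.05):
    """Hold out the last `holdout` points, forecast them, and flag misses."""
    import numpy as np

    values = np.asarray(series, dtype=float)
    history, recent = values[:-holdout], values[-holdout:]
    fit = best_arima(history, candidate_orders())
    if fit is None:
        raise ValueError("no ARIMA order could be fitted")
    bounds = np.asarray(fit.get_forecast(steps=holdout).conf_int(alpha=alpha))
    return outside_interval(recent, bounds[:, 0], bounds[:, 1])
```

Calling `detect(series)` on a daily series returns the positions within the held-out week that fall outside the forecast interval. Widening the search ranges in `candidate_orders` increases runtime quickly, which is consistent with the "few minutes" noted above.
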
## Proof of concept: anomaly detection algorithm

In [None]:
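# Load modules
import os
import sys

import pandas as pd  # imported explicitly rather than relying on the star import below
from dotenv import load_dotenv

# Make the parent directory importable so that `src` can be found
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from src.amodely import *

# Read the dataset location from the .env file
load_dotenv()
DATASET_PATH = os.environ.get("DATASET_PATH")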

In [None]:
# os.path.join works whether or not DATASET_PATH ends with a path separator
model = Amodely(pd.read_excel(os.path.join(DATASET_PATH, "Conversion Data Extended Period.xlsx")), measure="conversion_rate")
model.detect_anomalies(method="arima", dimension="STATE_CODE", steps=10)

In [None]:
model.download_anomalies()

## Proof of concept: appending new data

In [None]:
print("Original data:")
model2 = Amodely(pd.read_excel(os.path.join(DATASET_PATH, "Conversion by Day (multi dimension).xlsx")), measure="conversion_rate")
display(model2.df)

print("New data to be added:")
new_data = pd.read_excel(os.path.join(DATASET_PATH, "Conversion by Day (small addition).xlsx"))
display(new_data)

# Append the new rows to the model's dataset and show the combined result
print("End result:")
model2.append(new_data, sort_after=False, reset_working=True)
display(model2.df)

## Project structure
```
amodely
│   main.py
│   requirements.txt
│   ...
│
└───amodely
│   │   amodely.py
│   │   arimatools.py
│   │   lib.py
│   │   pipelines.py
│   │   ...
│   │
│   └───tests
│       │   ...
│   
└───docs
│   │   ...
│
└───notebooks
│   ...
```

`amodely.py`
- Main file, contains the `Amodely` class

`arimatools.py`
- Provides helper functions for fitting ARIMA models

`lib.py`
- Constants

`pipelines.py`
- Contains all of the data preparation/transformations needed for the model:
    - `FillNA` to replace NaNs with zeroes
    - `Collapse` to collapse multi-dimensional data down to single-dimensional data
    - `FilterCategory` to filter the data using categories
    - `AddResponse` to add the response variable (e.g. conversion rate) after data preparation
- Integrated with scikit-learn's ecosystem, so it is easy to add new transformations, pipelines, etc. in future if needed, and easy to change or modify the code
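
To illustrate how transformers like these plug into scikit-learn, here is a minimal sketch. It reuses the class names listed above, but the bodies are guesses rather than the code in `pipelines.py`, and the column names `DATE`, `QUOTES` and `SALES` are hypothetical. It also shows why the response is added last: conversion rate is computed from the summed counts after collapsing, not averaged across categories.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class FillNA(TransformerMixin, BaseEstimator):
    """Replace NaNs with zeroes."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.fillna(0)


class Collapse(TransformerMixin, BaseEstimator):
    """Sum the numeric columns over every dimension except the ones kept."""

    def __init__(self, keep=("DATE",)):
        self.keep = keep

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.groupby(list(self.keep), as_index=False).sum(numeric_only=True)


class AddResponse(TransformerMixin, BaseEstimator):
    """Add conversion rate after collapsing, as a ratio of summed counts."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.assign(conversion_rate=X["SALES"] / X["QUOTES"])


pipeline = Pipeline([("fill", FillNA()), ("collapse", Collapse()), ("response", AddResponse())])

# Hypothetical two-day dataset with one missing SALES value
df = pd.DataFrame({
    "DATE": ["2021-01-01"] * 2 + ["2021-01-02"] * 2,
    "STATE_CODE": ["NSW", "VIC", "NSW", "VIC"],
    "QUOTES": [100, 10, 80, 20],
    "SALES": [10, 5, None, 4],
})
result = pipeline.fit_transform(df)
print(result)
```

Adding a new preparation step under this design would mean writing one more class with `fit` and `transform` and adding it to the `Pipeline` list.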

## Ideas/moving forward:

- Continuous anomaly detection
    - Refit the anomaly detection model and output anomalies as soon as new data is loaded (daily or weekly)
    - The model would take a few minutes to run at most, so it wouldn't be too expensive to run daily if needed
- Anomaly scores
    - Modify the models to output an "anomaly score" for each data point, indicating how likely it is to be an anomaly or how extreme an anomaly it is
        - Logistic regression
        - Similar to how an activation function works in neural networks
    - This would make it possible to run multiple anomaly detection models. You could then average each data point's anomaly score across the models and flag the points with the highest average score (i.e. those that keep popping up in all the models)
- Presentation
    - Save anomalies to a spreadsheet which can then be loaded in Power BI, or
    - Upload anomaly data to a Plotly Dash application on GCP
- Scaling to all features
    - The model can already handle any dimension for the conversion rate feature
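
The anomaly score idea above could look something like the sketch below. Note that it uses a fixed logistic (sigmoid) curve rather than a fitted logistic regression, and that the function names, the threshold and the example deviations are all assumptions made for illustration.

```python
import math


def anomaly_score(deviation, threshold, steepness=1.0):
    """Map a non-negative deviation to a score in (0, 1) with a logistic curve.

    A deviation equal to the threshold scores 0.5; larger deviations approach 1.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (deviation - threshold)))


def combine_scores(scores_by_model):
    """Average each data point's score across models."""
    n_models = len(scores_by_model)
    return [sum(point) / n_models for point in zip(*scores_by_model.values())]


def top_anomalies(combined, cutoff=0.5):
    """Indices whose average score exceeds the cutoff, most anomalous first."""
    flagged = [i for i, s in enumerate(combined) if s > cutoff]
    return sorted(flagged, key=lambda i: combined[i], reverse=True)


# Hypothetical deviations per model (e.g. standard deviations from the expected value)
deviations = {
    "arima": [0.2, 4.0, 1.0, 3.5],
    "stl": [0.5, 3.0, 0.8, 1.0],
}
scores = {m: [anomaly_score(d, threshold=2.0) for d in ds] for m, ds in deviations.items()}
combined = combine_scores(scores)
print(top_anomalies(combined))  # → [1, 3]
```

Point 1 is flagged strongly by both models, while point 3 is flagged by only one and so ranks lower, which is the "keeps popping up in all the models" behaviour described above.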

## Next steps
1. Refine code
    - Explore weekly
    - Go with STL
    - Add new column to store number of standard deviations away from mean (classify how extreme of an anomaly each data point is)
2. Anomaly report
    - Plotly summary
    - Export only points of interest
3. Scaling to all measures
    - May need to consider what to do with dimensions with a lot of categories
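
The "number of standard deviations away from mean" column mentioned in step 1 could be computed along these lines. This is only a sketch using the standard library; the function name and the example residuals are hypothetical, and in the project it would presumably be a pandas column instead.

```python
import statistics


def std_devs_from_mean(values):
    """Number of (population) standard deviations each value is from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant series: nothing deviates
    return [(v - mean) / std for v in values]


# Hypothetical residuals; the last point is the extreme one
residuals = [1, 1, 1, 1, 10]
z_values = std_devs_from_mean(residuals)
rows = [{"residual": r, "std_devs": round(z, 2)} for r, z in zip(residuals, z_values)]
for row in rows:
    print(row)
```

Sorting or filtering on `std_devs` would then give a simple way to export only the points of interest for the anomaly report.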