# Anomaly Detection
## Sammy Samkough

#### Summary
In this notebook, I'll work through several anomaly detection techniques and use them to find the anomalies in a dataset. I'll try my best to explain what's going on as I go. I'm going to import my data from a CSV file using only Python. Let's get started!

## Understanding Anomaly Detection
Before we get right into it, I think it's important to understand what we're talking about. I'll try my best to explain what anomaly detection is and how we're going to approach this project.

### What is Anomaly Detection?
([Wikipedia](https://en.wikipedia.org/wiki/Anomaly_detection))
Anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.

In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data, unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns.
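To make the point about aggregation concrete, here is a minimal sketch using only the standard library. The timestamps, the 60-second window, and the helper names `counts_per_window` and `burst_windows` are all made up for illustration. No single event in the burst is rare on its own, but once events are counted per window, the burst shows up as an unusually high count.

```python
from collections import Counter
from statistics import median

def counts_per_window(timestamps, window_seconds):
    """Count events in fixed, non-overlapping windows (empty windows count as 0)."""
    if not timestamps:
        return []
    buckets = Counter(int(t // window_seconds) for t in timestamps)
    first, last = min(buckets), max(buckets)
    return [buckets.get(b, 0) for b in range(first, last + 1)]

def burst_windows(counts, factor=3.0):
    """Return indices of windows whose count exceeds `factor` times the median count."""
    baseline = max(median(counts), 1)  # avoid a zero baseline on sparse data
    return [i for i, c in enumerate(counts) if c > factor * baseline]

# Hypothetical event timestamps in seconds: steady traffic, then a burst from t=120
timestamps = [5, 40, 70, 95] + [120 + 0.5 * k for k in range(20)] + [200, 230]

counts = counts_per_window(timestamps, window_seconds=60)
print(counts)                 # events per 60-second window
print(burst_windows(counts))  # indices of windows that look like bursts
```

The choice of window size matters here: too wide and the burst is diluted, too narrow and normal traffic starts to look bursty.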

### Anomaly Detection Techniques
([Wikipedia](https://en.wikipedia.org/wiki/Anomaly_detection))
Three broad categories of anomaly detection techniques exist:
1. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set. 
2. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection).
3. Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance to be generated by the learnt model.
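As a rough illustration of the semi-supervised idea, the sketch below fits a normal distribution to data assumed to be "normal" and then scores new instances against it. The readings, the 3-standard-deviation cutoff, and the helper name `is_anomalous` are assumptions chosen for the example, and a single Gaussian is only one of many possible models of normal behavior.

```python
from statistics import NormalDist

# Hypothetical training readings that we assume contain only normal behavior
normal_training = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]

# "Learn" the model of normal behavior: a Gaussian fitted to the training data
model = NormalDist.from_samples(normal_training)

def is_anomalous(x, model, max_z=3.0):
    """Flag x when it lies more than `max_z` standard deviations from the model mean."""
    return abs(x - model.mean) / model.stdev > max_z

for reading in (20.05, 23.0):
    # model.pdf(x) is the density of x under the learnt model (higher = more likely)
    print(reading, model.pdf(reading), is_anomalous(reading, model))
```

An unsupervised method would skip the clean training set and look for poorly fitting points in the unlabeled data directly, while a supervised one would need labeled examples of both classes.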

The specific techniques we will go over are ([Data Science - Pramit Choudhary](https://www.datascience.com/blog/python-anomaly-detection)):
1. Simple Statistical Methods
2. K-nearest Neighbor
3. Local Outlier Factor
4. Clustering-Based Anomaly Detection
5. Support Vector Machine-Based Anomaly Detection

## Simple Statistical Methods
([Data Science - Pramit Choudhary](https://www.datascience.com/blog/python-anomaly-detection))

The simplest approach to identifying irregularities in data is to flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. Let's say the definition of an anomalous data point is one that deviates from the mean by more than a certain number of standard deviations. Computing the mean over time-series data isn't exactly trivial, because the mean isn't static: it changes as the series evolves. You would need a rolling window to compute the average across the data points. Technically, this is called a rolling average or a moving average, and it's intended to smooth short-term fluctuations and highlight long-term ones. Mathematically, an n-period simple moving average can also be described as a "low pass filter."
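Here is a minimal sketch of that idea in plain Python. The sample series, the window size, and the threshold of 3 standard deviations are assumptions picked for illustration, and the helper names `moving_average` and `rolling_anomalies` are my own. Each point is compared against the mean and standard deviation of the window of points just before it.

```python
from statistics import mean, stdev

def moving_average(values, window):
    """Return the simple moving average over each full window of `values`."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [mean(values[i - window:i]) for i in range(window, len(values) + 1)]

def rolling_anomalies(values, window=5, threshold=3.0):
    """Flag indices whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` points."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a standard deviation")
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # Skip flat windows, where the standard deviation is zero
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Hypothetical series with one obvious spike at index 7
series = [10, 11, 10, 12, 11, 10, 11, 30, 11, 10, 12, 11]

print(moving_average(series, window=3))
print(rolling_anomalies(series, window=5, threshold=3.0))
```

Note that right after a spike the window's standard deviation is inflated, so nearby points are less likely to be flagged for a few steps.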

##### Challenges
The low pass filter allows you to identify anomalies in simple use cases, but there are certain situations where this technique won't work. Here are a few:  
- The data contains noise which might be similar to abnormal behavior, because the boundary between normal and abnormal behavior is often not precise. 
- The definition of abnormal or normal may frequently change, as malicious adversaries constantly adapt themselves. Therefore, the threshold based on moving average may not always apply.
- The pattern is based on seasonality. Handling this requires more sophisticated methods, such as decomposing the data into multiple trends in order to identify the change in seasonality.
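To illustrate the seasonality challenge, here is a rough sketch that uses a much simpler stand-in for a full decomposition: it subtracts the average value at each position in the cycle and looks for outliers in what is left. The period of 4, the sample data, and the helper names `seasonal_residuals` and `residual_anomalies` are assumptions for the example. The injected value of 30 is perfectly ordinary for the series as a whole, but unusual for its position in the cycle.

```python
from statistics import mean, stdev

def seasonal_residuals(values, period):
    """Subtract from each value the mean of all values at the same position in the cycle."""
    phase_means = [mean(values[p::period]) for p in range(period)]
    return [v - phase_means[i % period] for i, v in enumerate(values)]

def residual_anomalies(values, period, threshold=3.0):
    """Flag indices whose seasonal residual exceeds `threshold` standard deviations."""
    residuals = seasonal_residuals(values, period)
    sigma = stdev(residuals)
    if sigma == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sigma]

# Hypothetical series repeating every 4 points, with one out-of-season value
values = [10, 20, 30, 20] * 5
values[8] = 30  # normally a low point (10) at this position in the cycle

print(residual_anomalies(values, period=4))
```

This only works when the period is known in advance and stable, and the anomaly itself pulls its phase average upward, which is part of why more careful decomposition methods exist.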

```python
import sqlite3

# Open a connection to the SQLite database file (created if it doesn't exist yet)
conn = sqlite3.connect("database-500.db")
```