# Anomaly Detection

**Project Overview** <br>
Implement several outlier/anomaly detection algorithms and understand how to apply them.

**Project Aim** <br>
- The purpose of this project is to solve a classification problem in relation to anomaly detection
- The objective of the network optimization team is to analyze traces of past activity, which will be used to train an ML system capable of classifying samples of current activity as:
    - 0 (normal): current activity corresponds to the normal behavior of any working day. Therefore, no reconfiguration or redistribution of resources is needed.
    - 1 (unusual): current activity slightly differs from the behavior usually observed for that time of the day (e.g. due to a strike, demonstration, sports event, etc.), which should trigger a reconfiguration of the base station.
    
**Project Value**
- Target Benefits: Why are we doing this project and where is the value?
- Business Needs
    - Identifying all of the available benefits (not just ‘enough benefits’ to get the project approved)
    - Defining specific measurable end states—the desired business outcomes—that need to be achieved in the business for the benefits to be delivered in full (filling the ‘gap’)
    - Maximizing and then quantifying all of the available financial benefits
    - Identifying the change activities required to deliver these outcomes, benefits and value.

**Data** <br>
https://www.kaggle.com/c/anomaly-detection-in-cellular-networks/data <br>
https://towardsdatascience.com/adrepository-anomaly-detection-datasets-with-real-anomalies-2ee218f76292

Over two weeks, a set of metrics was gathered every 15 minutes from 10 base stations, each having a different number of cells.

The dataset is split into training (approx. 80%) and test (approx. 20%) subsets, provided as two separate CSV files:
- The training set: ML-MATT-CompetitionQT1920_train.csv contains 36,904 samples, each having 13 features and a label. Note that there may be erroneous samples and outliers.
- The test set: ML-MATT-CompetitionQT1920_test.csv contains 9,158 samples following the same structure as the training set but not including the labels.

- Column Definitions
    - Time: hour of the day (in the format hh:mm) when the sample was generated.
    - CellName: text string used to uniquely identify the cell that generated the current sample. CellName is in the form xαLTE, where x identifies the base station, and α the cell within that base station (see the example in the right figure).
    - PRBUsageUL and PRBUsageDL: level of resource utilization in that cell, measured as the portion of Physical Radio Blocks (PRB) that were in use (%) in the previous 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
    - meanThr_DL and meanThr_UL: average carried traffic (in Mbps) during the past 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
    - maxThr_DL and maxThr_UL: maximum carried traffic (in Mbps) measured in the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
    - meanUE_DL and meanUE_UL: average number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
    - maxUE_DL and maxUE_UL: maximum number of user equipment (UE) devices that were simultaneously active during the last 15 minutes. Uplink (UL) and downlink (DL) are measured separately.
    - maxUE_UL+DL: maximum number of user equipment (UE) devices that were active simultaneously in the last 15 minutes, regardless of UL and DL.
    - Unusual: labels for supervised learning. A value of 0 indicates that the sample corresponds to normal operation; a value of 1 identifies unusual behavior.
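The Time and CellName formats described above can be turned into numeric features. The following is a minimal sketch, not part of the original notebook: the three rows are taken from the `train_df.head()` output shown in this notebook, and the derived column names (`minute_of_day`, `base_station`, `cell`) are hypothetical.

```python
import pandas as pd

# Three (Time, CellName) pairs taken from the head() output of the training set.
sample = pd.DataFrame({
    "Time": ["10:45", "02:45", "03:30"],
    "CellName": ["3BLTE", "4ALTE", "10BLTE"],
})

# Time (hh:mm) -> minutes since midnight, usable as a numeric feature.
hours_minutes = sample["Time"].str.split(":", expand=True).astype(int)
sample["minute_of_day"] = hours_minutes[0] * 60 + hours_minutes[1]

# CellName (x + alpha + "LTE") -> base station number x and cell letter alpha.
# The derived column names are illustrative, not part of the dataset.
parts = sample["CellName"].str.extract(r"^(?P<base_station>\d+)(?P<cell>[A-Z])LTE$")
sample["base_station"] = parts["base_station"].astype(int)
sample["cell"] = parts["cell"]

print(sample)
```

The regular expression assumes every CellName has exactly one uppercase cell letter; names that do not match would come back as missing values and need separate handling.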


**Algorithms to Implement**
- One-Class Support Vector Machines
- Isolation Forests
- Local Outlier Factor
- Elliptic Envelope
- DBSCAN
- Multivariate anomaly detection


**Loss Functions to Consider**
- Accuracy
- Log-Loss (https://towardsdatascience.com/the-most-awesome-loss-function-172ffc106c99)
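As a rough illustration of the two measures listed above, the sketch below computes accuracy and binary log-loss with NumPy. The labels and probabilities are made up, and the helper names `accuracy` and `binary_log_loss` are hypothetical. Strictly speaking, accuracy is an evaluation metric rather than a loss, and it can look deceptively good when one class dominates.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def binary_log_loss(y_true, p_unusual, eps=1e-15):
    """Mean negative log-likelihood of the true labels given predicted P(Unusual = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 to avoid log(0).
    p = np.clip(np.asarray(p_unusual, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
p_unusual = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(p >= 0.5) for p in p_unusual]  # threshold at 0.5

print("accuracy:", accuracy(y_true, y_pred))
print("log-loss:", binary_log_loss(y_true, p_unusual))
```

Log-loss penalizes confident wrong probabilities, whereas accuracy only looks at the thresholded labels.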


**Benchmark Performance**
- Top 10: 0.99244
- Top 20: 0.88193
- Worst: 0.41447

**Resources - Exploratory Data Analysis** <br>
~~https://towardsdatascience.com/organize-your-data-and-models-using-the-object-oriented-programming-and-pickle-876a6654494~~ <br>
~~https://www.brighthubpm.com/project-planning/128738-are-your-projects-delivering-business-value/~~ <br>

**Resources - Anomaly Detection** <br>
Andrew Ng - Machine Learning Tutorials (YouTube) <br>
~~Multivariate Unsupervised Machine Learning for Anomaly Detection in Enterprise Applications.pdf~~ <br>
~~https://machinelearningmastery.com/model-based-outlier-detection-and-removal-in-python/~~ <br>
~~https://www.bmc.com/blogs/outlier-and-anomaly-detection/~~ <br>
~~https://machinelearningmastery.com/one-class-classification-algorithms/~~ <br>
~~https://medium.com/sciforce/anomaly-detection-another-challenge-for-artificial-intelligence-c69d414b14db~~ <br>
~~https://medium.com/learningdatascience/anomaly-detection-techniques-in-python-50f650c75aaf~~ <br>
~~https://towardsdatascience.com/detecting-weird-data-conformal-anomaly-detection-20afb36c7bcd~~ <br>
~~https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/~~ <br>
~~https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8~~ <br>
~~https://towardsdatascience.com/unsupervised-machine-learning-approaches-for-outlier-detection-in-time-series-using-python-5759c6394e19~~ <br>
~~https://medium.com/pinterest-engineering/building-a-real-time-anomaly-detection-system-for-time-series-at-pinterest-a833e6856ddd~~ <br>
~~https://towardsdatascience.com/identifying-outliers-with-local-outlier-probabilities-2b5781e86e01#:~:text=By%20comparing%20the%20local%20density,to%20their%20Local%20Outlier%20Probability.~~ <br>

In [1]:
import pandas as pd

In [2]:
train_df = pd.read_csv('/Users/Rej1992/Documents/AnomalieDetectionMethods_RawData/ML-MATT-CompetitionQT1920_train.csv')
test_df = pd.read_csv('/Users/Rej1992/Documents/AnomalieDetectionMethods_RawData/ML-MATT-CompetitionQT1920_test.csv')

## Data Preprocessing

In [6]:
train_df.head()

| Unnamed: 0 | Time | CellName | PRBUsageUL | PRBUsageDL | meanThr_DL | meanThr_UL | maxThr_DL | maxThr_UL | meanUE_DL | meanUE_UL | maxUE_DL | maxUE_UL | maxUE_UL+DL | Unusual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10:45 | 3BLTE | 11.642 | 1.393 | 0.37 | 0.041 | 15.655 | 0.644 | 1.114 | 1.025 | 4.0 | 3.0 | 7 | 1 |
| 1 | 09:45 | 1BLTE | 21.791 | 1.891 | 0.537 | 0.268 | 10.273 | 1.154 | 1.353 | 1.085 | 6.0 | 4.0 | 10 | 1 |
| 2 | 07:45 | 9BLTE | 0.498 | 0.398 | 0.015 | 0.01 | 0.262 | 0.164 | 0.995 | 0.995 | 1.0 | 1.0 | 2 | 1 |
| 3 | 02:45 | 4ALTE | 1.891 | 1.095 | 0.94 | 0.024 | 60.715 | 0.825 | 1.035 | 0.995 | 2.0 | 2.0 | 4 | 1 |
| 4 | 03:30 | 10BLTE | 0.303 | 0.404 | 0.016 | 0.013 | 0.348 | 0.168 | 1.011 | 1.011 | 2.0 | 1.0 | 3 | 0 |
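A possible first preprocessing step is to separate the label from the features and coerce the metric columns to numbers, since the dataset description warns about erroneous samples. This is a sketch under assumptions: the frame below holds a few values from the output above, the `"n/a"` entry is a hypothetical bad value added for illustration, and in the notebook the same steps would run on `train_df`.

```python
import pandas as pd

# Small stand-in for train_df; the "n/a" entry is a hypothetical erroneous value.
df = pd.DataFrame({
    "Time": ["10:45", "03:30"],
    "CellName": ["3BLTE", "10BLTE"],
    "PRBUsageUL": [11.642, 0.303],
    "maxUE_UL": ["3.0", "n/a"],
    "Unusual": [1, 0],
})

# Separate the supervised label from the feature columns.
label = df["Unusual"]
features = df.drop(columns=["Unusual"])

# Coerce every metric column to numeric; unparseable entries become NaN.
numeric_cols = [c for c in features.columns if c not in ("Time", "CellName")]
features[numeric_cols] = features[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(features.dtypes)
print(features.isna().sum())
```

Coercing to NaN keeps the bad rows visible so they can be counted and then dropped or imputed deliberately.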


## Exploratory Data Analysis 
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b <br>
https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/
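The two linked articles cover detecting missing values and statistical imputation. A minimal pandas sketch of both steps follows; the values are synthetic, with column names borrowed from the dataset, and median imputation is just one possible choice.

```python
import numpy as np
import pandas as pd

# Synthetic values with gaps; only the column names come from the dataset.
df = pd.DataFrame({
    "PRBUsageUL": [11.642, np.nan, 0.498, 1.891],
    "meanThr_DL": [0.370, 0.537, np.nan, 0.940],
})

# Detect: number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Impute: replace each gap with the median of its own column.
imputed = df.fillna(df.median())
print(imputed)
```

The median is less sensitive to extreme values than the mean, which matters in a dataset that is expected to contain outliers.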

## How to use Statistics to Identify Outliers
https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/ <br>
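One common statistical rule is the interquartile range (IQR) fence: values further than k times the IQR below the first quartile or above the third quartile are flagged. The sketch below is an illustration on made-up numbers; the function name `iqr_outlier_mask` and the conventional k = 1.5 are assumptions, not something the notebook prescribes.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up utilization readings with one obviously extreme value.
usage = [10, 12, 11, 13, 12, 11, 95]
mask = iqr_outlier_mask(usage)
print(mask)
```

This rule works on one feature at a time, so it cannot catch samples that are only unusual in combination with other features.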

## Anomaly Detection Algorithm Application
**One-Class Support Vector Machines** <br>
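A minimal sketch of how this could look with scikit-learn's `OneClassSVM`, fitted on samples assumed to be normal and then applied to new points. The data is synthetic and the hyperparameters (`nu=0.05`, RBF kernel) are placeholder assumptions, not tuned values for this dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # stand-in for normal activity
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])                # second point is far away

# Scale features first; nu roughly bounds the fraction of training points
# treated as outliers.
model = make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", gamma="scale", nu=0.05))
model.fit(X_normal)

# predict() returns +1 for inliers and -1 for outliers; map to the 0/1 labels.
unusual = (model.predict(X_new) == -1).astype(int)
print(unusual)
```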

**Isolation Forest** <br>
https://www.bmc.com/blogs/outlier-and-anomaly-detection/
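A sketch using scikit-learn's `IsolationForest` on synthetic data with a few injected far-away points. The `contamination` value is an assumed guess at the anomaly share and would need to be chosen for the real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(200, 2))  # synthetic "normal" samples
X_far = np.array([[8.0, 8.0], [-8.0, 7.0], [9.0, -9.0], [-7.0, -8.0], [10.0, 0.0]])
X = np.vstack([X_normal, X_far])      # injected anomalies are the last 5 rows

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
forest.fit(X)

# score_samples(): the lower the score, the easier the point was to isolate.
scores = forest.score_samples(X)
# predict() returns -1 for outliers; map to the 0/1 labels.
unusual = (forest.predict(X) == -1).astype(int)
print(unusual[-5:], unusual.sum())
```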

**Local Outlier Factor** <br>
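A sketch using scikit-learn's `LocalOutlierFactor`, which compares the local density around each point with the density around its neighbors. The data is synthetic, and `n_neighbors=20` and `contamination=0.05` are assumed placeholder values.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(200, 2))  # synthetic "normal" samples
X_far = np.array([[8.0, 8.0], [-8.0, 7.0], [9.0, -9.0], [-7.0, -8.0], [10.0, 0.0]])
X = np.vstack([X_normal, X_far])      # injected anomalies are the last 5 rows

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
# fit_predict() labels the data it was fitted on: -1 for outliers, +1 for inliers.
unusual = (lof.fit_predict(X) == -1).astype(int)

# More negative values mean a lower local density than the neighbors have.
print(lof.negative_outlier_factor_[-5:])
print(unusual[-5:])
```

In this default mode the estimator only labels its training data; scoring unseen samples requires constructing it with `novelty=True`.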

**Elliptic Envelope** <br>
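A sketch using scikit-learn's `EllipticEnvelope`, which fits a robust Gaussian estimate and flags points far from it in Mahalanobis distance. It assumes roughly Gaussian inliers, which may not hold for the real metrics; the data and the `contamination` value here are illustrative.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(200, 2))  # synthetic, roughly Gaussian inliers
X_far = np.array([[8.0, 8.0], [-8.0, 7.0], [9.0, -9.0], [-7.0, -8.0], [10.0, 0.0]])
X = np.vstack([X_normal, X_far])      # injected anomalies are the last 5 rows

envelope = EllipticEnvelope(contamination=0.05, random_state=0)
envelope.fit(X)

# predict() returns -1 for points outside the fitted ellipse; map to 0/1 labels.
unusual = (envelope.predict(X) == -1).astype(int)
print(unusual[-5:], unusual.sum())
```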

**DBSCAN** <br>
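DBSCAN is a clustering algorithm, but the points it leaves unassigned (label -1, "noise") can be treated as anomalies. The sketch below uses scikit-learn's `DBSCAN` on synthetic data; `eps=0.5` and `min_samples=5` are assumed values that depend heavily on feature scaling.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(200, 2))  # synthetic dense cluster
X_far = np.array([[8.0, 8.0], [-8.0, 7.0], [9.0, -9.0], [-7.0, -8.0], [10.0, 0.0]])
X = np.vstack([X_normal, X_far])      # isolated points are the last 5 rows

# eps is a distance, so put all features on a comparable scale first.
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# Label -1 marks noise points that belong to no cluster.
unusual = (labels == -1).astype(int)
print(labels[-5:], unusual.sum())
```

DBSCAN has no `predict` method for unseen samples, so it suits labeling an existing batch better than scoring new activity.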

**Multivariate Anomaly Detection** <br>
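Assuming this heading refers to the multivariate Gaussian density approach (fit a mean and covariance, then flag samples whose density falls below a threshold epsilon), a NumPy sketch could look as follows. The data is synthetic, the helper names are hypothetical, and picking epsilon as the 1st percentile of training log-densities is an arbitrary illustrative choice.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix of the rows of X."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_density(X, mu, sigma):
    """Log of the multivariate normal density for each row of X."""
    d = X.shape[1]
    diff = X - mu
    # Squared Mahalanobis distance of each row from the mean.
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

rng = np.random.default_rng(0)
# Two strongly correlated synthetic features.
X_train = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
mu, sigma = fit_gaussian(X_train)

# Assumed threshold: the 1st percentile of the training log-densities.
epsilon = np.percentile(log_density(X_train, mu, sigma), 1)

# (2, -2) is only two standard deviations out on each axis,
# but it contradicts the positive correlation between the features.
X_new = np.array([[0.5, 0.5], [2.0, -2.0]])
unusual = (log_density(X_new, mu, sigma) < epsilon).astype(int)
print(unusual)
```

The second test point shows why the covariance matters: per-feature checks would pass it, while the joint density does not.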

**Conformal Anomaly Detection & Conformal Prediction Framework** <br>
https://towardsdatascience.com/detecting-weird-data-conformal-anomaly-detection-20afb36c7bcd
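A rough sketch of an inductive conformal anomaly detector: a nonconformity score is computed for calibration samples, and a new sample's p-value is the share of calibration scores at least as large as its own. The k-nearest-neighbor score, the synthetic data and the 0.05 significance level are all assumptions made for illustration.

```python
import numpy as np

def knn_score(x, reference, k=5):
    """Nonconformity score: mean distance from x to its k nearest reference points."""
    dists = np.sort(np.linalg.norm(reference - x, axis=1))
    return dists[:k].mean()

rng = np.random.default_rng(0)
X_ref = rng.normal(size=(200, 2))  # proper training set (synthetic)
X_cal = rng.normal(size=(100, 2))  # held-out calibration set (synthetic)
cal_scores = np.array([knn_score(x, X_ref) for x in X_cal])

def p_value(x):
    """Share of calibration scores at least as extreme as the score of x."""
    score = knn_score(x, X_ref)
    return (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)

significance = 0.05
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
p_values = np.array([p_value(x) for x in X_new])

# Flag a sample as unusual when its p-value falls below the significance level.
unusual = (p_values < significance).astype(int)
print(p_values, unusual)
```

The significance level bounds the expected false-alarm rate on normal data, provided calibration and new samples are exchangeable.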

**Other Resources** <br>
https://machinelearningmastery.com/one-class-classification-algorithms/ <br>
https://medium.com/learningdatascience/anomaly-detection-techniques-in-python-50f650c75aaf <br>
https://machinelearningmastery.com/model-based-outlier-detection-and-removal-in-python/ <br>
https://towardsdatascience.com/identifying-outliers-with-local-outlier-probabilities-2b5781e86e01#:~:text=By%20comparing%20the%20local%20density,to%20their%20Local%20Outlier%20Probability.

## Save Results and Generate Final Results
https://towardsdatascience.com/organize-your-data-and-models-using-the-object-oriented-programming-and-pickle-876a6654494
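Results can be saved and restored with Python's standard `pickle` module. The sketch below is only an illustration: the `results` dictionary, its keys and the file name are hypothetical, and a temporary directory stands in for a real output folder.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical results bundle; the keys and values are illustrative only.
results = {
    "model_name": "isolation_forest",
    "params": {"contamination": 0.05},
    "predictions": [0, 1, 0, 0],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "results.pkl"

    # Save the results to disk.
    with path.open("wb") as fh:
        pickle.dump(results, fh)

    # Load them back and check the round trip.
    with path.open("rb") as fh:
        restored = pickle.load(fh)

print(restored == results)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be unpickled.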

## Use Cases
https://towardsdatascience.com/detecting-credit-card-fraud-using-machine-learning-a3d83423d3b8 <br>
https://towardsdatascience.com/unsupervised-machine-learning-approaches-for-outlier-detection-in-time-series-using-python-5759c6394e19 <br>
https://medium.com/pinterest-engineering/building-a-real-time-anomaly-detection-system-for-time-series-at-pinterest-a833e6856ddd <br>