# About Novelty Detection and its evaluation metrics

## References

* [A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical Data](https://onlinelibrary.wiley.com/doi/10.1002/sam.11161)
* [An Experiment with the Edited Nearest-Neighbor Rule](https://ieeexplore.ieee.org/document/4309523)
* [Anomaly detection: A survey](http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection.pdf)
* [Improving classification accuracy by identifying and removing instances that should be misclassified](https://ieeexplore.ieee.org/document/6033571/)
* [There and back again: Outlier detection between statistical reasoning and data mining algorithms](http://wires.wiley.com/WileyCDA/WiresArticle/wisId-WIDM1280.html)
* [There and back again: Outlier detection between statistical reasoning and data mining algorithms (Slides)](http://www.informatik.tuwien.ac.at/teaching/phdschool/talkTUVienna.pdf)

## Definition

### 1) Outlier Detection (= Anomaly Detection)
- **Identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data**
- Three methods for anomaly detection
    * **Unsupervised** : Detects anomalies in an unlabeled data set, under the assumption that the majority of the instances in the data set are normal
    * **Supervised** : Trains a classifier on a data set labeled normal/abnormal, which is typically imbalanced
    * **Semi-Supervised** : Builds a model of normal behavior from a training set that contains only normal data, then checks how well new instances fit that model
- e.g.) Bank fraud, structural defects, system health monitoring, intrusion detection, fault detection, ecosystem disturbances
- Source : [Wikipedia](https://goo.gl/YOdhxK)
- By [scikit-learn](https://goo.gl/csTPJr) : ***'Training data contains outliers'***



### 2) Novelty Detection
- **Mechanism by which an intelligent organism is able to identify an incoming sensory pattern as being hitherto unknown**
- The principle has long been known in **neurophysiology**
- 'Early neural modeling attempts were by Yehuda Salu (1988)'
- Source : [Wikipedia](https://goo.gl/6mntxw)
- By [scikit-learn](https://goo.gl/csTPJr) : ***'Training data is not polluted by outliers'***
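
The two settings quoted from scikit-learn can be illustrated with a toy one-dimensional detector. This is only a minimal sketch: the z-score and median/MAD scoring, the function names `fit_novelty_detector` and `outlier_scores`, the sample values, and the threshold of 3.0 are all assumptions made for illustration, not scikit-learn's estimators.

```python
from statistics import mean, median, stdev

THRESHOLD = 3.0  # hypothetical cut-off, chosen only for this illustration


def fit_novelty_detector(clean_train):
    """Novelty detection: training data is assumed to contain no outliers,
    so the plain mean and standard deviation describe 'normal' behavior."""
    mu, sigma = mean(clean_train), stdev(clean_train)

    def score(x):
        # Distance from the training mean in units of standard deviation
        return abs(x - mu) / sigma

    return score


def outlier_scores(data):
    """Outlier detection: the data itself may contain outliers, so use
    robust statistics (median and median absolute deviation).
    Note: MAD is zero for degenerate data; this sketch does not handle that."""
    med = median(data)
    mad = median([abs(x - med) for x in data])
    return [abs(x - med) / mad for x in data]


# Novelty detection: fit on clean data, then score unseen points
clean_train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
score = fit_novelty_detector(clean_train)
new_points = [10.1, 14.0]
novel = [x for x in new_points if score(x) > THRESHOLD]

# Outlier detection: the abnormal point is already inside the data set
polluted = clean_train + [14.0]
outliers = [x for x, s in zip(polluted, outlier_scores(polluted)) if s > THRESHOLD]

print("novel points:", novel)
print("outliers:", outliers)
```

In the first case the detector never sees the abnormal value during fitting; in the second it has to cope with that value being part of the data it is fitted on.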
    



## Evaluation Metrics for Anomaly & Novelty Detection

### 1) Metrics for Out-of-Distribution Detection


![](img/roc.png)

- ROC (Receiver Operating Characteristic)
  - Plots the **false positive rate (FPR)** against the **true positive rate (TPR, = recall)** as the decision threshold is varied between 0.0 and 1.0
  - In other words, it plots the false alarm rate versus the hit rate
- AUC (Area Under the Curve)
  - Literally the area under the ROC curve
  - AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0. A model that ranks samples at random scores about 0.5.
- PRC (Precision-Recall Curve)
  - Plots **precision** against **recall** as the decision threshold is varied
  - Shows how good a model is at predicting the positive class
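
The curves described above can be computed by hand for a small example. The following is a minimal pure-Python sketch, not a replacement for a library implementation: the function names, the six labels and the six scores are made up for illustration, and the code assumes the labels contain at least one positive and one negative.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when every score >= threshold is predicted positive."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def roc_and_pr_points(labels, scores):
    """Sweep the threshold from high to low.
    Returns ROC points as (FPR, TPR) and PR points as (recall, precision)."""
    roc = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    pr = []
    for t in sorted(set(scores), reverse=True):
        tp, fp, fn, tn = confusion_counts(labels, scores, t)
        tpr = tp / (tp + fn)  # recall / hit rate
        fpr = fp / (fp + tn)  # false alarm rate
        roc.append((fpr, tpr))
        pr.append((tpr, tp / (tp + fp)))
    return roc, pr


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data: 1 = positive class, score = model confidence
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

roc, pr = roc_and_pr_points(labels, scores)
print("ROC points:", roc)
print("PR points:", pr)
print("AUC:", auc(roc))
```

With these scores, 8 of the 9 positive/negative pairs are ranked correctly, which matches the area under the ROC curve.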

### 2) Better metric for class-imbalanced data

- Precision reacts to false positives much more sensitively than FPR does, **so PRC is more appropriate than ROC for class-imbalanced problems**
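
The reason can be read off the standard definitions, written here in terms of the confusion-matrix counts $TP$, $FP$, $FN$ and $TN$:

$$
\mathrm{TPR} = \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
$$

The denominator of FPR contains $TN$, which is very large when negatives dominate, so additional false positives barely move it. The denominator of precision contains only the samples predicted positive, so the same false positives change it directly.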

> e.g.) 1 million samples: 100 positive, all others negative

> case 1) 100 predicted positive, 90 true positive (10 false positive)  
> case 2) 2000 predicted positive, 90 true positive (1910 false positive)  

> case 1) 0.9 TPR, 0.00001 FPR  
> case 2) 0.9 TPR, 0.00191 FPR  
> FPR difference = 0.00190

> case 1) 0.9 Recall, 0.9 Precision  
> case 2) 0.9 Recall, 0.045 Precision  
> Precision difference = 0.855

> For the same difference in false positives, precision changes far more than FPR. In other words, precision is more sensitive to false positives.
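
The arithmetic in this example can be checked with a few lines of Python. This is a small sketch; the helper name `rates` and its argument names are made up for illustration.

```python
def rates(total, positives, predicted_positive, true_positive):
    """Derive TPR, FPR and precision from the counts used in the example."""
    false_positive = predicted_positive - true_positive
    negatives = total - positives
    return {
        "tpr": true_positive / positives,              # TP / (TP + FN)
        "fpr": false_positive / negatives,             # FP / (FP + TN)
        "precision": true_positive / predicted_positive,  # TP / (TP + FP)
    }


# 1 million samples, 100 of them positive
case1 = rates(1_000_000, 100, predicted_positive=100, true_positive=90)
case2 = rates(1_000_000, 100, predicted_positive=2000, true_positive=90)

for name in ("tpr", "fpr", "precision"):
    diff = abs(case1[name] - case2[name])
    print(f"{name}: case1={case1[name]:.5f} case2={case2[name]:.5f} diff={diff:.5f}")
```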