# Anomaly Detection Algorithms: A Visit  
## Why look at this problem?  
When I was working for my former employer, a digital advertising company in Japan, I was occasionally assigned anomaly detection tasks that I didn't want to spend much time on. Basically, I just threw all the data into a model like [Gaussian Processes](https://bugra.github.io/work/notes/2014-05-11/robust-regression-and-outlier-detection-via-gaussian-processes/) and reported the points falling outside the confidence/prediction interval as the output. I did not think much about whether I could explain the results, but there was much more to explain than I thought. Feature engineering is certainly important when dealing with data, and there are many powerful models, such as Gaussian processes or boosted trees, that handle imbalanced data or outliers very well. In most cases, however, those "outliers" matter to us not only from a data engineering perspective but also in a business sense, which requires analysts to research the specific problem more deeply and provide insights about the data.  
Recently I read a Q&A post on [Zhihu (a Quora-like site in China)](https://www.zhihu.com/question/280696035) asking "What are the popular anomaly detection algorithms in data mining?". I went through all the answers and found that there is a lot going on in this field, so I decided to explore the problem by implementing different methods myself. Could be a lot of fun :)  

## Some tools/libraries I will be using  
- JupyterLab
- pandas / dask  

## Useful posts and papers  
- [Metrics, Techniques and Tools of Anomaly Detection: A Survey](https://www.cse.wustl.edu/~jain/cse567-17/ftp/mttad/index.html)
- [What are the popular anomaly detection algorithms in data mining? (Zhihu, in Chinese)](https://www.zhihu.com/question/280696035)

# Practice: Credit Card Fraud Detection  
This is a Kaggle dataset; the problem description is [here](https://www.kaggle.com/mlg-ulb/creditcardfraud/version/3#). Some notes:  
- Binary classification problem
- Use AUC as the performance metric
- Highly imbalanced: frauds account for 0.172% of all transactions
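
To see why plain accuracy is a poor metric for data this imbalanced, here is a minimal sketch that uses only the class counts of this dataset (492 frauds out of 284,807 transactions). The "never flag anything" model is a hypothetical baseline chosen for illustration, and AUC is read here as ROC AUC, which is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

n_total, n_fraud = 284807, 492  # class counts of the Kaggle dataset

# Ground truth: 1 = fraud, 0 = normal
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# Hypothetical baseline that never flags a transaction
y_pred = np.zeros(n_total, dtype=int)
y_score = np.zeros(n_total)  # the same score for every transaction

fraud_ratio = n_fraud / n_total
accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print(f"fraud ratio: {fraud_ratio:.4%}")
print(f"accuracy:    {accuracy:.4f}")
print(f"ROC AUC:     {auc:.4f}")
```

The baseline scores over 99.8% accuracy without catching a single fraud, while its AUC stays at 0.5, the same as random guessing.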

In [3]:
'''
Okay let's take a look at the data first.
'''
import pandas as pd

df = pd.read_csv('../fraud/creditcard.csv', header=0)
cols = list(df.columns)
# Row numbers of fraud records
fraud_idx = df.index[df['Class'] == 1].tolist()

print(cols)
print(len(fraud_idx), len(df))

['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']
492 284807


## 1　Unsupervised Approaches  
First, I would like to try some unsupervised learning algorithms (pure statistical methods or clustering) and compare their output with the true labels. By the end of this part, I hope to have a general understanding of the following:  
- How to apply unsupervised learning algorithms to anomaly detection  
- What the data looks like in a specific problem and why some algorithms perform well  
- Pros and cons of each algorithm
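
As a small illustration of the "pure statistical" end of this list, here is a sketch of a z-score rule on synthetic data. The data, the threshold of 3 and the helper name `zscore_outliers` are all assumptions made for illustration; none of them come from the Kaggle dataset.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Return a boolean mask marking values more than `threshold`
    standard deviations away from the mean (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=1000)  # synthetic "normal" points
injected = np.array([8.0, -9.0, 10.0])              # synthetic anomalies
data = np.concatenate([normal, injected])

mask = zscore_outliers(data)
print("flagged:", int(mask.sum()), "of", len(data))
print("all injected anomalies flagged:", bool(mask[-3:].all()))
```

A rule like this is easy to explain, but it treats each feature on its own and assumes roughly bell-shaped data, which may not hold for real transactions.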

In [5]:
'''
Prepare our data first
'''

df_1 = df[cols[:-1]].copy()

### 1-1　KMeans

In [26]:
from sklearn.cluster import KMeans

# Scale 'Time' and 'Amount' by their maxima so that they do not
# dominate the distance computation in KMeans
df_1['Time'] = df_1['Time'] / df_1['Time'].max()
df_1['Amount'] = df_1['Amount'] / df_1['Amount'].max()

# Convert the feature table to a NumPy array for scikit-learn
X = df_1.to_numpy()
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
labels = list(kmeans.labels_)

# Number of points assigned to each of the two clusters
print(labels.count(0), labels.count(1))

134701 150106
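
The two cluster sizes alone do not say how the clusters relate to the fraud labels. One way to check is to cross-tabulate cluster labels against the true class and to score each point by its distance to the nearest centroid. The sketch below does this on synthetic data so that it stays self-contained. The synthetic blobs, the variable names and the use of centroid distance as an anomaly score are assumptions for illustration, not results on the credit card data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in: two dense "normal" groups plus a few far-away anomalies
normal_a = rng.normal(loc=[0, 0], scale=0.5, size=(500, 2))
normal_b = rng.normal(loc=[5, 5], scale=0.5, size=(500, 2))
anomalies = np.array([[10, -5], [-6, 8], [12, 12], [-8, -8], [2.5, -7]])

X_toy = np.vstack([normal_a, normal_b, anomalies])
y_toy = np.array([0] * 1000 + [1] * len(anomalies))  # 1 = anomaly

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# 1) How do the cluster labels line up with the true classes?
table = pd.crosstab(y_toy, km.labels_, rownames=["class"], colnames=["cluster"])
print(table)

# 2) Distance to the nearest centroid as an anomaly score
dist = km.transform(X_toy).min(axis=1)
toy_auc = roc_auc_score(y_toy, dist)
print(f"AUC of centroid distance: {toy_auc:.3f}")
```

On this toy data, cluster membership by itself does not single out the anomalies, because each one is simply absorbed into the nearer cluster, whereas the distance score tends to rank them above the normal points. Whether the same holds for the credit card data would have to be checked against `df['Class']`.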
