In [1]:
'''
Okay let's take a look at the data first.
'''
import pandas as pd

df = pd.read_csv('../fraud/creditcard.csv', header=0)
cols = list(df.columns)
# Row numbers of fraud records
fraud_idx = list(df.loc[df['Class']==1].index.values)

print(cols)
print(len(fraud_idx), len(df))

['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']
492 284807


# Unsupervised Approaches  
First, I would like to try some unsupervised learning algorithms (purely statistical methods or clustering) and compare their output with the labels. By the end of this part I hope to have a general understanding of the following:  
- How to apply unsupervised learning algorithms to anomaly detection  
- What the data looks like in a specific problem and why some algorithms perform well on it  
- Pros and cons of each algorithm
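
One way to "compare with the labels" is to rank records by an anomaly score and check how many known frauds land near the top. The snippet below is only a sketch on made-up numbers; `precision_recall_at_k`, `scores_toy` and `is_fraud_toy` are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np

def precision_recall_at_k(scores, is_fraud, k):
    """Precision and recall of the k highest-scoring records (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    is_fraud = np.asarray(is_fraud, dtype=bool)
    # Indices of the k largest anomaly scores
    top_k = np.argsort(-scores, kind="stable")[:k]
    hits = int(is_fraud[top_k].sum())
    return hits / k, hits / int(is_fraud.sum())

# Toy data: higher score = more anomalous; 1 marks a known fraud
scores_toy = [0.1, 0.9, 0.2, 0.8, 0.3, 0.05]
is_fraud_toy = [0, 1, 0, 0, 0, 1]
precision, recall = precision_recall_at_k(scores_toy, is_fraud_toy, k=2)
print(precision, recall)
```

With only 492 frauds among 284807 records, plain accuracy would say very little here, which is why a ranking-based check like this may be more informative.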

In [32]:
'''
Prepare our data first: drop the label column ('Class') so only the features remain
'''

df_nolabel = df[cols[:-1]].copy()

## 1 KMeans

In [19]:
from sklearn.cluster import KMeans
import numpy as np

def kmeans_ad(df, clusters, fraud_idx):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Normalize 'Time' and 'Amount' by dividing by their maxima
    df['Time'] = df['Time'] / df['Time'].max()
    df['Amount'] = df['Amount'] / df['Amount'].max()

    # DataFrame.as_matrix() was removed in pandas 1.0; use to_numpy() instead
    X = df.to_numpy()
    kmeans = KMeans(n_clusters=clusters, random_state=0).fit(X)
    labels = kmeans.labels_.tolist()
    centers = list(kmeans.cluster_centers_)

    # Cluster label assigned to each fraud record
    clustered_labels = [labels[idx] for idx in fraud_idx]

    # Count how many fraud records fall into each cluster
    clustered_dict = {}
    for c in clustered_labels:
        clustered_dict[c] = clustered_dict.get(c, 0) + 1

    return clustered_dict, centers, labels, X

In [25]:
cluster_dict, centers, labels, X = kmeans_ad(df_nolabel, 1, fraud_idx)
print(cluster_dict)
# Distance from every point to the centroid of its assigned cluster
distances = [np.linalg.norm(X[idx] - centers[labels[idx]]) for idx in range(len(labels))]
distances_sorted = sorted(distances, reverse=True)
# Rank of each fraud record by distance (0 = farthest from its centroid)
rank = []
for f in fraud_idx:
    rank.append(distances_sorted.index(distances[f]))

{0: 492}
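
The ranking loop above searches a sorted list once per fraud record. A vectorized alternative might look like the sketch below, shown on synthetic data rather than the credit card set. `distance_ranks` and the `*_toy` variables are hypothetical names. Unlike `list.index`, this version gives tied distances distinct consecutive ranks.

```python
import numpy as np

def distance_ranks(X, centers, labels):
    """Distance of each row to its assigned centroid, plus its rank (0 = farthest)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    # Fancy indexing picks the assigned centroid for every row
    distances = np.linalg.norm(X - centers[labels], axis=1)
    # Row indices ordered from largest to smallest distance
    order = np.argsort(-distances, kind="stable")
    ranks = np.empty(len(distances), dtype=int)
    ranks[order] = np.arange(len(distances))
    return distances, ranks

# Toy data: 200 ordinary points plus two far-away points at rows 200 and 201
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0], [-9.0, 7.0]]])
# With a single cluster, the centroid is simply the mean of all rows
centers_toy = X_toy.mean(axis=0, keepdims=True)
labels_toy = np.zeros(len(X_toy), dtype=int)

distances_toy, ranks_toy = distance_ranks(X_toy, centers_toy, labels_toy)
outlier_idx = [200, 201]
print(ranks_toy[outlier_idx])
```

On the real data, `ranks[fraud_idx]` (after converting `centers` and `labels` as above) would show how close to the top of the distance ordering the fraud records sit.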


### Notes  
- KMeans itself is sensitive to outliers
- A good detection result first requires a good clustering result
    - [StackExchange](https://stats.stackexchange.com/questions/160260/anomaly-detection-based-on-clustering)
- Maybe try density-based clustering?
    - [Unsupervised clustering approach for network anomaly detection](https://pdfs.semanticscholar.org/7d5d/794e77378a7dbc5d67d79f4c5e7a11b05454.pdf)
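
To make the first note concrete: a KMeans centroid is the mean of the points assigned to it, so a single extreme point can drag it noticeably. The sketch below uses synthetic data and plain NumPy (no KMeans fit); all variable names are illustrative. The coordinate-wise median is included only as a contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
outlier = np.array([[100.0, 100.0]])
with_outlier = np.vstack([inliers, outlier])

# Centroid of a single cluster = mean of its points
centroid_clean = inliers.mean(axis=0)
centroid_dirty = with_outlier.mean(axis=0)
mean_shift = np.linalg.norm(centroid_dirty - centroid_clean)

# The coordinate-wise median moves far less when the same point is added
median_shift = np.linalg.norm(
    np.median(with_outlier, axis=0) - np.median(inliers, axis=0)
)

print(mean_shift, median_shift)
```

One extreme point out of 101 moves the mean by roughly one unit per coordinate here, which is on the order of the inliers' own spread.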

## 2 DBSCAN  
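Density-based spatial clustering of applications with noise (DBSCAN). This approach evaluates the density of data points: points in high-density regions are grouped into clusters, while points in low-density regions are labeled as noise. Because outliers are not forced into any cluster, the approach is robust to them.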
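
As a minimal sketch of how this could be used for anomaly detection, scikit-learn's `DBSCAN` assigns the label `-1` to noise points. The data below is synthetic, and the `eps` and `min_samples` values are assumptions chosen for this toy example, not tuned for the credit card data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus two isolated points (rows 200 and 201)
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2))
isolated = np.array([[2.5, 2.5], [10.0, -10.0]])
X_toy = np.vstack([cluster_a, cluster_b, isolated])

# eps: neighborhood radius; min_samples: neighbors needed for a core point
db = DBSCAN(eps=0.5, min_samples=5).fit(X_toy)

# Noise points carry the label -1 and are the anomaly candidates
noise_idx = np.flatnonzero(db.labels_ == -1)
n_clusters = len(set(db.labels_.tolist()) - {-1})
print(n_clusters, noise_idx)
```

Unlike KMeans, no number of clusters is specified; the result depends on `eps` and `min_samples` instead, so those two would need care on the real feature space.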