<a href="https://colab.research.google.com/github/CptK1ng/dmc2019/blob/alexander_dev/notebooks/anomaly_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# anomaly detection
**outlier detection**  = unsupervised anomaly detection

**novelty detection**  = semi-supervised anomaly detection

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn import svm
from sklearn import metrics
%matplotlib inline

Download our custom Dataset splits and the unlabeled Test Set:

In [2]:
!wget -nc -q --show-progress https://www.dropbox.com/s/6m8iq9ogpzmu7vx/train_new.csv?dl=1 -O train_new.csv
!wget -nc -q --show-progress https://www.dropbox.com/s/tjpkc45oqn3uv8s/val_new.csv?dl=1 -O val_new.csv
!wget -nc -q --show-progress https://www.dropbox.com/s/hbd6nzgwlnevu4x/test.csv?dl=1 -O test.csv



Import data:

In [3]:
df_train_original = pd.read_csv("train_new.csv", sep="|")
df_val_original = pd.read_csv("val_new.csv", sep="|")
df_test_original = pd.read_csv("test.csv", sep="|")
df_train_original.head(2)

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud
0,4,828,66.56,7,4,3,0.007246,0.080386,1.166667,0
1,1,1612,31.34,2,4,3,0.008685,0.019442,0.142857,0


## Feature Engineering

In [4]:
def prepareData(df):
  df = df.copy()
  df['totalLineItems'] = df['scannedLineItemsPerSecond'] * df['totalScanTimeInSeconds'] #nur of scanned products
  return df

df_train = prepareData(df_train_original)
df_val = prepareData(df_val_original)
df_test = prepareData(df_test_original)

df_train.head()

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud,totalLineItems
0,4,828,66.56,7,4,3,0.007246,0.080386,1.166667,0,6.0
1,1,1612,31.34,2,4,3,0.008685,0.019442,0.142857,0,14.0
2,3,848,52.37,2,4,0,0.022406,0.061757,0.105263,0,19.0
3,1,321,76.03,8,7,2,0.071651,0.236854,0.347826,0,23.0
4,1,660,6.06,3,7,1,0.027273,0.009182,0.166667,0,18.0


In [5]:
# Extract fraud=0 from Training (and remove fraud column)
df_train_nofraud = df_train[df_train.fraud == 1].copy().drop('fraud', axis=1)

# Extract High Trust Level Entries (which we think is fraud=0)
df_test_nofraud = df_test[df_test.trustLevel > 2].copy()

df_nofraud = pd.concat([df_train_nofraud, df_test_nofraud], sort=False)

# Splitting validation split label
df_val_X = df_val.drop('fraud', axis=1)
df_val_y = df_val['fraud']

X_nofraud, X_val, y_val = df_nofraud.values, df_val_X.values, df_val_y.values

print("Shapes",X_nofraud.shape, X_val.shape, y_val.shape)

Shapes (332576, 10) (376, 10) (376,)


## Novelty Detection
semi supervised

> Consider a data set of  observations from the same distribution described by  features. Consider now that we add one more observation to that data set. Is the new observation so different from the others that we can doubt it is regular? (i.e. does it come from the same distribution?) Or on the contrary, is it so similar to the other that we cannot distinguish it from the original observations? This is the question addressed by the novelty detection tools and methods. ([source](https://scikit-learn.org/stable/modules/outlier_detection.html#novelty-detection))



In [10]:
clf = svm.OneClassSVM(cache_size=20000, max_iter=10000)
clf.fit(X_nofraud)
y_val_pred_distances = clf.decision_function(X_val) # Signed distance to the separating hyperplane, positive for inliers, negative for outliers.



In [11]:
outliers_fraction = 0.94 #percentile, how much padding to add around classification border, high=less frauds

threshold = stats.scoreatpercentile(y_val_pred_distances.ravel(),  100 * outliers_fraction)

# y_val_pred = np.where(y_val_pred_distances > 0, 0, 1) # Nonfraud = Inlier > 0, Fraud = Outlier < 0
y_val_pred = y_val_pred_distances > threshold

n_errors = (y_val_pred != y_val).sum()
print("Total Val:",len(y_val),", Errors:", n_errors, ", Confmatrix:", metrics.confusion_matrix(y_val, y_val_pred).T.tolist(), ", Nr of actual/predicted Frauds:",y_val.sum(), "/", y_val_pred.sum(), ", Nr of actual/predicted Non-Frauds:",(y_val == 0).sum(), "/", (y_val_pred == 0).sum())

Total Val: 376 , Errors: 46 , Confmatrix: [[330, 23], [23, 0]] , Nr of actual/predicted Frauds: 23 / 23 , Nr of actual/predicted Non-Frauds: 353 / 353


In [12]:
def score_function(y_true, y_pred):
  dmc = np.sum(metrics.confusion_matrix(y_true, y_pred)*np.array([[0, -25],[ -5, 5]])) #sklearn gives [[tn,fp],[fn,tp]]
  return (0 if all(y_pred == 0) else metrics.fbeta_score(y_true, y_pred, beta=2),
          dmc, 
          dmc/len(y_pred), #comparable relative score, the higher the better.
          metrics.confusion_matrix(y_true, y_pred).tolist())

print("OneClassSVM", "\t", score_function(y_val, y_val_pred) )

OneClassSVM 	 (0.0, -690, -1.8351063829787233, [[330, 23], [23, 0]])
