# Tutorial 02 - Anomaly Detection 

In data analysis, *anomaly detection* (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events, or observations that deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behaviour.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
%matplotlib inline

from scipy import stats
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

X=np.load('data.npy')
X

Plot of the raw data

In [None]:
plt.scatter(X[:,0], X[:,1])

Clustering the data

In [None]:
from sklearn.cluster import KMeans
k = 2
kmeans = KMeans(n_clusters = k, random_state = 1).fit(X)

plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='brg', alpha=0.4)  # plot points with cluster dependent colors
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c = 'black', s=100)
plt.show()

Let's plot each of the clusters separately

In [None]:
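# Use a separate loop variable so that k (the number of clusters) is not overwritten.
for label in range(kmeans.n_clusters):
    mask = kmeans.labels_ == label
    plt.scatter(X[mask, 0], X[mask, 1], color=['blue', 'green'][label])
    plt.show()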

Detecting anomalies in the data by finding points far from their cluster centroids

In [None]:
eucledianDist = np.sqrt(np.sum((X - kmeans.cluster_centers_[kmeans.labels_])**2,axis=-1))
plt.scatter(X[:,0], X[:,1],c=(eucledianDist),cmap='Reds')
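
The expression `kmeans.cluster_centers_[kmeans.labels_]` relies on NumPy integer-array indexing: it builds an array with one centroid per sample, so the subtraction is done row by row. A minimal sketch on made-up toy arrays (`points`, `centers` and `labels` are hypothetical stand-ins, not the tutorial data):

```python
import numpy as np

# Hypothetical toy data: four 2-D points, two centroids, one label per point.
points = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0], [10.0, 13.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1, 1])

# centers[labels] has shape (4, 2): row i is the centroid assigned to point i.
assigned = centers[labels]

# Euclidean distance of each point to its own centroid.
dist = np.linalg.norm(points - assigned, axis=1)
print(dist)
```

The same pattern works for any number of clusters, as long as the labels are valid row indices into the array of centers.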

In [None]:
medians = np.r_[np.median(eucledianDist[kmeans.labels_==0]),np.median(eucledianDist[kmeans.labels_==1])]
eucledianMedians = np.abs(eucledianDist/medians[kmeans.labels_])
plt.scatter(X[:,0], X[:,1],c=(eucledianMedians),cmap='Reds')
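
The cell above divides each distance by the median distance of its own cluster, so that a tight cluster and a spread-out cluster are scored on a comparable scale. A small sketch with invented numbers (`dist` and `labels` are hypothetical):

```python
import numpy as np

# Hypothetical distances to the assigned centroid and cluster labels.
dist = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])
labels = np.array([0, 0, 0, 1, 1, 1])

# Median distance within each cluster, indexed by cluster label.
medians = np.array([np.median(dist[labels == c]) for c in np.unique(labels)])

# Ratio to the cluster median: 1.0 means "typical for this cluster".
relative = dist / medians[labels]
print(relative)
```

Although the raw distances in the second cluster are ten times larger, both clusters end up with the same relative scores.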

Euclidean distance between the data points of cluster 0 and its center

In [None]:
data = X[kmeans.labels_==0,:]
center = kmeans.cluster_centers_[0]

eucledianDist = np.sqrt(np.sum((data[:,0:2] - center)**2,axis=1))

plt.scatter(data[:,0], data[:,1],c=(eucledianDist),cmap='Blues')
plt.scatter(center[0], center[1], c = 'red', s=100)

Here, we treat as anomalies the data points whose distance from the center exceeds a threshold.

In [None]:
threshold = 10

plt.scatter(data[:,0], data[:,1],c=(eucledianDist>threshold))
plt.scatter(center[0], center[1], c = 'red', s=100)

The Euclidean distance does not seem to be the best choice for this data set. Let's try the Mahalanobis distance, which takes the covariance of the data into account.

In [None]:
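def calculateMahalanobis(data):
    # Returns the *squared* Mahalanobis distance of each row to the sample mean.
    values = np.asarray(data, dtype=float)  # accepts a DataFrame or an array
    y_mu = values - values.mean(axis=0)
    cov = np.cov(values.T)
    inv_covmat = np.linalg.inv(cov)
    left = np.dot(y_mu, inv_covmat)
    # Row-wise quadratic form; avoids building the full n x n matrix.
    return np.sum(left * y_mu, axis=1)
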

mahalanobisDistance= calculateMahalanobis(pd.DataFrame(data, columns = ['x','y']))
plt.scatter(data[:,0], data[:,1],c=(mahalanobisDistance),cmap='Greens')
plt.scatter(center[0], center[1], c = 'red', s=100)
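
The Mahalanobis distance rescales each direction by the spread of the data along it, so a step across a narrow cluster counts for more than the same step along its long axis. A minimal sketch, assuming a hand-picked covariance matrix (`cov`, `along` and `across` are illustrative, not taken from the tutorial data):

```python
import numpy as np

# Assumed covariance of an elongated cluster: large spread along x, small along y.
cov = np.array([[9.0, 0.0], [0.0, 1.0]])
inv_cov = np.linalg.inv(cov)
mean = np.zeros(2)

def mahalanobis(x, mean, inv_cov):
    # Square root of the quadratic form (x - mean)^T inv_cov (x - mean).
    diff = x - mean
    return np.sqrt(diff @ inv_cov @ diff)

along = np.array([3.0, 0.0])   # 3 units along the long axis
across = np.array([0.0, 3.0])  # 3 units across the short axis

# Both points have the same Euclidean distance from the mean...
print(np.linalg.norm(along), np.linalg.norm(across))
# ...but different Mahalanobis distances.
print(mahalanobis(along, mean, inv_cov), mahalanobis(across, mean, inv_cov))
```

Note that `calculateMahalanobis` above returns the squared form of this quantity, without the square root.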

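Detecting anomalies by thresholding the Mahalanobis distance. Note that `calculateMahalanobis` returns the squared distance, so the threshold applies to squared values.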

In [None]:
threshold = 2

plt.scatter(data[:,0], data[:,1],c=(mahalanobisDistance>threshold))
plt.scatter(center[0], center[1], c = 'red', s=100)

There is no single correct way to determine the threshold. One approach is to try different values and compute the percentage of points that each threshold flags as anomalies. By definition, the proportion of anomalies should be very small (e.g., between 0% and 5%).

In [None]:
for threshold in np.linspace(start=0.5, stop=10, num=20):
    # Fraction of points whose score exceeds the current threshold.
    fraction = (mahalanobisDistance > threshold).mean()
    print(f"threshold={threshold:.1f}  fraction flagged={fraction:.3f}")
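
Two common ways to pick the threshold can be sketched as follows. The first takes an empirical quantile of the scores, so that a chosen fraction of points is flagged. The second assumes the cluster is roughly Gaussian, in which case the squared Mahalanobis distance of d-dimensional data approximately follows a chi-squared distribution with d degrees of freedom. The sketch uses synthetic scores (`scores` and `target_fraction` are hypothetical). For d = 2 the chi-squared quantile has the closed form -2 ln(1 - p), and `scipy.stats.chi2.ppf` covers the general case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for squared Mahalanobis distances of 2-D Gaussian data:
# the sum of two squared standard normals.
scores = np.sum(rng.standard_normal((5000, 2)) ** 2, axis=1)

target_fraction = 0.05  # assumed share of points we are willing to flag

# Option 1: empirical quantile, flags about 5% of the points by construction.
threshold_empirical = np.quantile(scores, 1 - target_fraction)

# Option 2: chi-squared quantile with 2 degrees of freedom (closed form),
# i.e. -2 * ln(1 - p) with p = 1 - target_fraction.
threshold_chi2 = -2.0 * np.log(target_fraction)

for name, t in [("empirical", threshold_empirical), ("chi2", threshold_chi2)]:
    print(f"{name}: threshold={t:.2f}, flagged={(scores > t).mean():.3f}")
```

The empirical quantile always flags the requested fraction, whereas the chi-squared threshold flags that fraction only to the extent that the Gaussian assumption holds.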

# Short competition

We will split the class into groups of three for a brief contest. Your aim is to detect fraud in financial transactions. The dataset consists of features related to financial transactions, some of which may involve fraudulent activity. The data has been normalized and reduced in dimensionality to prevent the use of heuristics based on human knowledge. Your task is to identify anomalies in order to detect possible fraud. Fraudulent transactions are frequently atypical, but note that atypical transactions are not necessarily fraudulent. Let us now examine the data.

In [None]:
data = np.load("anomalyDATA.npy")
data

The dataset comprises 5 columns. The first 4 columns provide transaction features such as transaction value, time, and agent's salary. The fifth column takes three distinct values: 0, 1, and NaN. The value 0 indicates that the transaction is not fraudulent, while 1 indicates that it is fraudulent. NaN denotes instances where the financial institution is unsure whether the transaction is fraudulent.

In [None]:
np.unique(data[:,4])
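
Because the label column mixes 0, 1 and NaN, one possible workflow is to score all rows without using the labels, and then use only the rows with known labels to check how well the scores separate fraud from non-fraud. The sketch below uses a small invented array with the same five-column layout (`toy`, `scores`, the threshold and the distance-to-median score are assumptions for illustration, not a recommended solution):

```python
import numpy as np

# Invented array with the same layout: 4 feature columns + 1 label column.
toy = np.array([
    [0.1, 0.2, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.0, 0.0],
    [0.2, 0.0, 0.1, 0.1, np.nan],
    [5.0, 4.0, 6.0, 5.0, 1.0],
    [0.1, 0.1, 0.1, 0.2, np.nan],
    [0.0, 0.2, 0.1, 0.0, 0.0],
])
features, labels = toy[:, :4], toy[:, 4]

# NaN never compares equal to anything, so use np.isnan to find unknown labels.
known = ~np.isnan(labels)

# Placeholder anomaly score: Euclidean distance to the per-feature median.
scores = np.linalg.norm(features - np.median(features, axis=0), axis=1)
flagged = scores > 3.0  # hypothetical threshold

# Evaluate only on rows whose label is known.
true_fraud = labels[known] == 1
tp = np.sum(flagged[known] & true_fraud)
precision = tp / max(np.sum(flagged[known]), 1)
recall = tp / max(np.sum(true_fraud), 1)
print(f"known labels: {known.sum()}, precision={precision:.2f}, recall={recall:.2f}")
```

Since atypical transactions are not necessarily fraudulent, precision on the labelled rows is worth tracking alongside recall.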

In [None]:
#Your code here