# KNN on Credit Card Fraud Detection Dataset

Please download the data from https://www.kaggle.com/dalpozz/creditcardfraud/data

**Task 1.** Propose a suitable error metric for this problem.

**Task 2.** Apply KNN on the dataset, find out the best k using grid search.

**Task 3.** Report the resulting performance.

**Info about data: it is a CSV file with 31 columns (30 input features plus the label); the last column, `Class`, marks whether a transaction is a fraud or not**

**Information about data set**

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
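The imbalance figures above can be checked with a few lines of arithmetic. The sketch below uses only the two counts quoted in the description; `sample_size` is an illustrative parameter, set to the 10,000 rows this notebook samples later. Under uniform random sampling it suggests that roughly 17 frauds would be expected in such a sample.

```python
# Counts quoted in the dataset description.
n_transactions = 284_807
n_frauds = 492

# Share of transactions that are fraudulent.
fraud_rate = n_frauds / n_transactions
print(f"fraud rate: {fraud_rate:.4%}")

# Expected number of frauds in a uniform random sample (illustrative size).
sample_size = 10_000
expected_frauds = sample_size * fraud_rate
print(f"expected frauds in {sample_size} rows: {expected_frauds:.1f}")
```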

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. **Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.**

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
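The area under the precision-recall curve is commonly summarised as average precision: the mean of the precision values at the ranks where a true fraud appears. The following is a minimal plain-Python sketch of that idea, assuming no tied scores; the function name `average_precision` and the toy labels and scores are illustrative and not part of the dataset. For real use, scikit-learn's `average_precision_score` is the usual choice.

```python
def average_precision(y_true, scores):
    """Mean of precision@rank over the ranks that hold a positive example.

    Assumes binary labels (0/1) and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")

    # Rank examples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / n_pos


# Toy example: the two positives land at ranks 1 and 3.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(y_true, scores)
print(ap)  # (1/1 + 2/3) / 2
```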

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score



In [2]:
data = pd.read_csv("creditcard.csv")

In [3]:
data.shape

(284807, 31)

In [4]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [5]:
data["Class"].value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [6]:
# sampling random 10000 points
data_10000 = data.sample(n = 10000)

In [7]:
data_10000.shape

(10000, 31)

In [8]:
data_10000["Class"].value_counts()

0    9977
1      23
Name: Class, dtype: int64

**Our dataset is heavily imbalanced: only 23 of the 10,000 sampled transactions (0.23%) are frauds**

In [9]:
data10000 = data_10000.drop(['Class'], axis=1)
data10000.shape

(10000, 30)

In [10]:
data10000_labels = data_10000["Class"]
data10000_labels.shape

(10000,)

In [11]:
data10000_Std = StandardScaler(with_mean = False).fit_transform(data10000)
print(data10000_Std.shape)
print(type(data10000_Std))

(10000, 30)
<class 'numpy.ndarray'>


### Task 1: Propose a suitable error metric for this problem.

**Because our dataset is heavily imbalanced, I propose recall (the fraction of actual frauds that the model detects) as the error metric for this problem**
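To make the metric concrete, here is a small sketch that computes recall, together with precision and accuracy for comparison, from true and predicted labels. The helper `confusion_counts` and the toy labels are hypothetical and chosen only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Toy data: 4 frauds and 16 legitimate transactions.
y_true = [1, 1, 1, 1] + [0] * 16
# The hypothetical model catches 3 frauds, misses 1, and raises 2 false alarms.
y_pred = [1, 1, 1, 0] + [1, 1] + [0] * 14

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)        # share of actual frauds that were caught
precision = tp / (tp + fp)     # share of fraud alerts that were correct
accuracy = (tp + tn) / len(y_true)
print(recall, precision, accuracy)
```

Recall ignores true negatives entirely, which is why the large number of legitimate transactions cannot inflate it the way it inflates accuracy.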

### Task 2:  Apply KNN on the dataset, find out the best k using grid search.

In [12]:
X1, XTest, Y1, YTest = train_test_split(data10000_Std, data10000_labels, test_size = 0.3, random_state = 0)

neighbors = list(range(1, 50, 2))  # odd values of k from 1 to 49

CV_Scores = []

for k in neighbors:
    KNN = KNeighborsClassifier(n_neighbors = k, algorithm = 'kd_tree')
    scores = cross_val_score(KNN, X1, Y1, cv = 10, scoring='recall')
    CV_Scores.append(scores.mean())

In [158]:
CV_Scores

[0.55,
 0.7,
 0.7,
 0.65,
 0.65,
 0.55,
 0.55,
 0.4,
 0.4,
 0.35,
 0.3,
 0.3,
 0.05,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]

In [159]:
# k with the highest mean cross-validated recall (the first one on ties)
best_k = neighbors[CV_Scores.index(max(CV_Scores))]
best_k

3

**Over the odd values of 'K' from 1 to 49, the mean recall from 10-fold Cross Validation peaks at 0.7 (K = 3 and K = 5), declines steadily as 'K' grows, and is exactly zero from K = 27 onward. A recall of zero means there are no "True Positives" at all. With only 16 frauds in the training split (the 23 sampled minus the 7 in the test split), a large neighbourhood is always dominated by class 0, so for large 'K' the model is heavily biased towards the majority class, which here is 0.**
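The explicit loop over k used above can also be written with scikit-learn's `GridSearchCV`, which is what "grid search" usually refers to. The sketch below is not the notebook's method: it runs on synthetic imbalanced data from `make_classification` as a stand-in for the credit card sample, and the step names `scale` and `knn` are arbitrary labels. Putting the scaler inside a `Pipeline` means it is refit on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 10% positives); an assumption for illustration.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Scale inside the pipeline so each CV fold fits its own scaler.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Odd values of k from 1 to 19.
param_grid = {"knn__n_neighbors": list(range(1, 20, 2))}

# Stratified folds keep some positives in every fold, so recall stays defined.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=cv)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

`search.best_estimator_` is then refit on all of `X, y` with the selected k and can be used directly for prediction.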

In [160]:
def change(label):
    if label == 0:
        return 1
    else:
        return 0

In [161]:
positiveNegative = list(map(change, data10000_labels)) #map(function, list of numbers)
data10000_invertedLabel = positiveNegative
len(data10000_invertedLabel)

10000

In [162]:
X2, XTest2, Y2, YTest2 = train_test_split(data10000_Std, data10000_invertedLabel, test_size = 0.3, random_state = 0)

neighbors = list(range(1, 50, 2))  # odd values of k from 1 to 49

CV_Scores = []

for k in neighbors:
    KNN = KNeighborsClassifier(n_neighbors = k, algorithm = 'kd_tree')
    scores = cross_val_score(KNN, X2, Y2, cv = 10, scoring='recall')
    CV_Scores.append(scores.mean())

In [163]:
CV_Scores

[0.9994271390566138,
 0.9997134670487107,
 0.9997134670487107,
 0.9997134670487107,
 0.9998567335243553,
 0.9998567335243553,
 0.9998567335243553,
 0.9998567335243553,
 0.9998567335243553,
 0.9998567335243553,
 0.9998567335243553,
 0.9998567335243553,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0]

**Here, as a cross-check, we inverted every class label in our sampled dataset: all the 1's became 0 and all the 0's became 1, so the majority class is now 1. We then applied 10-fold CV again and calculated the recall scores for all the odd values of 'k' from 1 to 49. Every score is above 0.999, and from k = 25 onward the recall is exactly 1.0, which means every point of the (new) positive class is predicted correctly. Since that class is the majority, this confirms that our model is heavily biased towards the majority class, which here is 1.**
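The effect of the label inversion can be illustrated without any model at all. In the sketch below, a hypothetical predictor that always outputs the majority class scores a recall of 0 when the rare class is treated as positive and a recall of 1 once the labels are flipped. The 98/2 class split is made up for illustration.

```python
def recall(y_true, y_pred):
    """Recall for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# Toy data: 98 legitimate transactions (0) and 2 frauds (1).
labels = [0] * 98 + [1] * 2
# A predictor that always returns the majority class.
majority_pred = [0] * len(labels)

# Original polarity: fraud is the positive class, and none is ever caught.
recall_original = recall(labels, majority_pred)

# Inverted polarity: flip both the labels and the predictions.
inverted_labels = [1 - y for y in labels]
inverted_pred = [1 - p for p in majority_pred]
recall_inverted = recall(inverted_labels, inverted_pred)

print(recall_original, recall_inverted)
```

A high recall on the inverted labels therefore says nothing about how well frauds are detected; it only measures how often the majority class is predicted.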

In [164]:
best_k = neighbors[CV_Scores.index(max(CV_Scores))]
best_k

25

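**The value of `best_k` carried into the final model is 25, taken from the inverted-label run above (the first 'K' whose mean recall reaches 1.0). On the original labels, by contrast, the cross-validated recall is highest (0.7) at K = 3 and K = 5.**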

In [165]:
from sklearn.metrics import recall_score

KNN_best = KNeighborsClassifier(n_neighbors = best_k, algorithm = 'kd_tree')

KNN_best.fit(X1, Y1)

prediction = KNN_best.predict(XTest)

recallTest = recall_score(YTest, prediction)

print("Recall Score of the knn classifier for best k values of "+str(best_k)+" is: "+str(recallTest))

confusion_matrix(YTest, prediction)

Recall Score of the knn classifier for best k values of 25 is: 0.0


array([[2993,    0],
       [   7,    0]], dtype=int64)

In [170]:
YTest.value_counts()

0    2993
1       7
Name: Class, dtype: int64

**There are 3000 points in total in our test dataset, of which 2993 belong to class label '0' and 7 belong to class label '1'. From the confusion matrix we can see that the value of "True Negative" is 2993, which means that all 2993 points that belong to class label '0' are predicted as '0'. From the same confusion matrix we can see that the value of "False Negative" is 7, which means that all 7 points that belong to class label '1' are also predicted as '0'.**

**In conclusion, our model is nothing but a "Dumb Model": it predicts the majority class for every test point, a consequence of the heavily imbalanced dataset combined with a large 'K'.**

### Task 3: Report the value of performance

In [171]:
# Calculating R square value of our model
from sklearn.metrics import r2_score

r2_score(YTest, prediction)  

-0.002338790511192901

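**The R Square value of our model is slightly negative, which means our predictions are marginally worse than always predicting the mean of the labels. Note that R Square is a regression metric; for this classifier, the recall of 0.0 and the confusion matrix above are the more informative figures.
In conclusion, the performance of our model is poor because of the imbalance in our dataset.**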
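The reported R Square value can be reproduced by hand from the test-set counts alone (2993 points of class 0, 7 of class 1, all predicted as 0), using the definition R² = 1 - SS_res / SS_tot. The sketch below rebuilds the label vectors from those counts rather than from the actual data.

```python
# Test-set labels rebuilt from the confusion matrix: 2993 zeros and 7 ones.
y_true = [0] * 2993 + [1] * 7
# The model predicted class 0 for every point.
y_pred = [0] * len(y_true)

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: squared error of the predictions.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: squared error of always predicting the mean label.
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(r2)  # algebraically equal to -7 / 2993
```

The value is barely below zero because predicting 0 everywhere is almost identical to predicting the mean label, 7/3000, so R Square carries little information here beyond what the confusion matrix already shows.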