# Credit Fraud Training (KNN)

## KNN classifier on credit fraud dataset
Use the K-nearest-neighbors (KNN) method to build a model that detects potential credit card fraud.


### Task 1: Mount your drive
Mount your Google Drive in Colab so that the notebook can access the included .csv file.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Task 2: Load and preprocess datasets
In this program we are using a dataset that has the following features:  

V1 | V2 | ... | V10 | Amount | Class
---|---|---|---|---|---
(float)|(float)|(float)|(float)|(float)|(str)

The first ten features (V1 to V10) are the top PCA values for certain transaction information. Only PCA values are given, in order to protect private information.
The **Amount** feature is the amount of money in that particular transaction, and the **Class** feature takes one of two values, **safe** and **Fraud**.
Each class has 400 examples. The aim is to predict the **Class** feature from all the other features, i.e. to determine which transactions are fraudulent.
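
Because **Class** is stored as a string, it has to be mapped to integers before training. The sketch below shows how scikit-learn's `LabelEncoder` would handle that, assuming the labels in the CSV are spelled exactly `safe` and `Fraud` as above; the short `labels` list is made up for illustration and is not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels using the two class names described above.
labels = ['safe', 'Fraud', 'safe', 'Fraud', 'Fraud']

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and uppercase 'F' sorts before lowercase 's',
# so 'Fraud' maps to 0 and 'safe' maps to 1.
print(list(encoder.classes_))                       # ['Fraud', 'safe']
print(encoded.tolist())                             # [1, 0, 1, 0, 0]
print(encoder.inverse_transform([0, 1]).tolist())   # ['Fraud', 'safe']
```

Under that spelling assumption, a prediction of 0 corresponds to a fraudulent transaction, which matters when reading a confusion matrix or choosing a positive class for precision and recall.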


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

rng = np.random.RandomState(5) # use this if you need to generate a random sample

# Load Credit Card Data
path = '/content/drive/My Drive/Assignment_3/Assignment-3/creditcard_ece570.csv'
data = pd.read_csv(path)
# Create X, y data
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
# Split Train, Test Data (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Learn Preprocessing Transform (fit on the training split only)
scaler = StandardScaler()
labeler = LabelEncoder()
scaler.fit(X_train)
labeler.fit(y_train)
# Apply Transforms
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
y_train = labeler.transform(y_train)
y_test = labeler.transform(y_test)

# Test Statements
print(f'X_train has the shape {X_train.shape}')
print(f'y_train has the shape {y_train.shape}')
print(f'X_test has the shape {X_test.shape}')
print(f'y_test has the shape {y_test.shape}')
print(f'X_train mean is {np.mean(X_train, axis=0)}')
print(f'X_test mean is {np.mean(X_test, axis=0)}')
print(f'Sum of X_train mean is {np.sum(np.mean(X_train, axis=0))}')
print(f'Sum of X_test mean is {np.sum(np.mean(X_test, axis=0))}')

X_train has the shape (640, 11)
y_train has the shape (640,)
X_test has the shape (160, 11)
y_test has the shape (160,)
X_train mean is [-4.44089210e-17 -6.93889390e-18  3.33066907e-17  1.66533454e-17
  1.94289029e-17  1.66533454e-17 -3.46944695e-17 -1.94289029e-17
 -4.44089210e-17  0.00000000e+00  7.63278329e-18]
X_test mean is [ 0.09347451 -0.09182168  0.15245938 -0.19522097  0.10599188  0.23410519
  0.09620048 -0.19714382  0.18255422  0.17135069  0.12558944]
Sum of X_train mean is -5.620504062164857e-17
Sum of X_test mean is 0.6775393249086965
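
The output above shows a training mean of essentially zero while the test mean is not. That is expected when the standardization statistics are learned from the training split only and then reused on the test split. A minimal sketch on synthetic data (the array `X_demo` is a random stand-in with the same shape as the real features, not the credit card data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(5)
# Synthetic stand-in for the 800 x 11 feature matrix (not the real data).
X_demo = rng.normal(loc=3.0, scale=2.0, size=(800, 11))

X_tr, X_te = train_test_split(X_demo, test_size=0.2, random_state=0)

# The mean and standard deviation come from the training split only.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Training columns are centered up to floating-point error.
print(np.abs(X_tr_s.mean(axis=0)).max())
# Test columns are close to zero but not exactly zero,
# because the test split has slightly different statistics.
print(np.abs(X_te_s.mean(axis=0)).max())
```

Fitting the scaler on the full dataset would center both splits, but it would also let test-set statistics influence preprocessing, so the nonzero test mean is the price of keeping the test set unseen.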


### Task 3: Find the optimal KNN estimator
We need to find the optimal hyperparameters of the KNN estimator (the model selection problem) using cross-validation, and then provide a final estimate of the model's generalization performance on the test set.
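
To make the cross-validation step concrete, here is a rough sketch of what is computed for each candidate setting: the training data is split into folds, the model is fit on all but one fold and scored on the held-out fold, and the fold scores are averaged. The data comes from `make_classification` and the values of `k` are arbitrary; both are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the scaled training split.
X_demo, y_demo = make_classification(n_samples=640, n_features=11, random_state=0)

# Mean 5-fold accuracy for a few hypothetical values of k.
cv_means = {}
for k in (1, 5, 9):
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5
    )
    cv_means[k] = fold_scores.mean()
    print(f'k={k}: mean CV accuracy = {cv_means[k]:.3f}')

# Model selection picks the candidate with the highest mean CV score.
best_k = max(cv_means, key=cv_means.get)
print(f'selected k = {best_k}')
```

A grid search repeats this loop over every combination of hyperparameters and keeps the best one, without ever touching the test set.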


In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(5) # use this if you need to generate a random sample

# Set Hyperparameters
grid_params = {
    'n_neighbors': [1,3,5,7,9,11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan', 'chebyshev'],
}
# Grid Search
KNN_GV = GridSearchCV(KNeighborsClassifier(), grid_params, cv = 5)
# Fit Data (training only)
KNN_GV = KNN_GV.fit(X_train, y_train)

print(f'The best parameters are {KNN_GV.best_params_}')
print(f'The best accuracy on the training data is {KNN_GV.score(X_train, y_train)}')
print(f'The best accuracy on the testing data is {KNN_GV.score(X_test, y_test)}')

The best parameters are {'metric': 'manhattan', 'n_neighbors': 7, 'weights': 'uniform'}
The best accuracy on the training data is 0.953125
The best accuracy on the testing data is 0.90625
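
Note that the training accuracy printed above comes from `score(X_train, y_train)`, i.e. the refit best model evaluated on the same data it was fit on. The quantity used to select the hyperparameters is the mean cross-validated accuracy, available as `best_score_`. The sketch below shows both, plus a confusion matrix on the held-out split, using synthetic data and a reduced, hypothetical grid rather than the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 800-example dataset (not the real data).
X_demo, y_demo = make_classification(n_samples=800, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Reduced grid for illustration only.
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7]}, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)        # mean cross-validated accuracy of the best candidate
print(search.score(X_tr, y_tr))  # accuracy of the refit model on its own training data
print(search.score(X_te, y_te))  # accuracy on the held-out test split

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, search.predict(X_te))
print(cm)
```

For a fraud detector, the off-diagonal entries of the confusion matrix are often more informative than overall accuracy, since missed frauds and false alarms usually carry different costs.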
