# CS182. Artificial Intelligence - Final Project

---

## Credit Card Fraud Classification/Amount Prediction

### Presented by Boyuan Sun, Yijun Shen, Shenghao Jiang

In [3]:
# import libraries
import json
import os
import sys
import time
from copy import deepcopy
import collections
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
#models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier as DecisionTree
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [4]:
data = pd.read_csv('creditcard.csv')
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Before Constructing Models : Train-test split and defining predictors
The predictors are the principal components V1 to V28, which were already computed by PCA in the given dataset, together with Amount.
We have to be careful with the train-test split: since the dataset is highly imbalanced, both the training set and the test set must contain enough observations with label 1. We therefore stratify the split on `Class`.
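
As a minimal sketch of why stratification matters, the example below uses synthetic labels rather than the project data. The array size, the 0.2% positive rate, and the names `X_tr`, `X_te`, `y_tr`, `y_te` are assumptions chosen for illustration. With `stratify`, each split should receive roughly its proportional share of the rare label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 20 positives out of 10,000 (0.2%).
# These numbers are illustrative assumptions, not the project data.
y = np.zeros(10_000, dtype=int)
y[:20] = 1
X = np.arange(10_000).reshape(-1, 1)  # dummy single-column feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=182, stratify=y
)

# With stratify=y, the positives are divided in proportion to the split sizes.
print('train positives:', y_tr.sum(), 'of', len(y_tr))
print('test positives: ', y_te.sum(), 'of', len(y_te))
```

Without `stratify`, a random split of so few positives could by chance leave one side with very few of them.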

In [5]:
# Define predictors
predictors = list(data.columns)
predictors.remove('Class')
predictors.remove('Time')

In [6]:
# Train-test split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(data[predictors], data['Class'], test_size =.33, random_state = 182, stratify = data['Class'])

In [7]:
y_train = pd.DataFrame(y_train, columns = ['Class'])
y_test = pd.DataFrame(y_test, columns = ['Class'])

In [8]:
# Ensure that we have some labels of 1 in both training and testing sets
print('In training sets number of label 1: {}'.format(len(np.where(y_train == 1)[0])))
print('In testing sets number of label 1: {}'.format(len(np.where(y_test == 1)[0])))

In training sets number of label 1: 330
In testing sets number of label 1: 162


## 1. Baseline model : zero classifier
We present a baseline model that always predicts label 0, i.e. that no transaction is fraudulent. This is the model we aim to beat, using the models and techniques learned in the course to improve the metric we choose for evaluation.

Train accuracy : $190490/190820 \approx 0.99827$

Test accuracy : $93825/93987 \approx 0.99828$

But since the dataset is highly imbalanced, accuracy alone is misleading, so we also use other metrics, such as the confusion matrix and AUC (area under the ROC curve), to measure the performance of the classifier.
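
The following sketch illustrates the point on toy labels (the ten observations and the variable names are assumptions for illustration, not project data). A classifier that always predicts 0 scores high accuracy on imbalanced labels, while its AUC stays at chance level and its confusion matrix shows that no positive is ever found.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

# Toy labels (an assumption for illustration): 2 positives among 10 observations.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# A constant classifier: predicts label 0 and assigns P(label = 1) = 0 to everyone.
y_pred = np.zeros_like(y_true)
p_fraud = np.zeros(len(y_true))

accuracy = (y_pred == y_true).mean()   # 0.8: looks strong, only reflects the imbalance
auc = roc_auc_score(y_true, p_fraud)   # 0.5: constant scores cannot rank observations
cm = confusion_matrix(y_true, y_pred)  # rows are observed labels, columns are predictions
print(accuracy, auc)
print(cm)
```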

In [9]:
class ZeroClassifier:
    """ Class for zero classifier
    """
    
    def predict(self, X):
        return np.zeros(len(X))  # one label-0 prediction per row of X
    
    def score(self, X, y):
        """
        INPUTS
        ------
        X: predictors
        y: labels
        
        OUTPUTS
        -------
        the accuracy score for y
        """
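        # np.ravel accepts a Series or a one-column DataFrame; list(y) on a
        # DataFrame would iterate over column names, not label values.
        return float(np.mean(np.ravel(y) == 0))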
    
    def predict_proba(self, X):
        """
        INPUTS
        ------
        X: predictors
        
        OUTPUTS
        -------
        probas: probabilities of each label, one row per observation. Since the zero classifier
        always yields 0, the probability of label 0 is 1 and the probability of label 1 is 0.
        
        """
        probas = np.tile([1.0, 0.0], (len(X), 1))  # shape (n_samples, 2)
        return probas

### Accuracy score (might not be a good metric)

In [10]:
# Construct a zero classifier and find the accuracy score.
# Pass the label column as a Series: iterating over a DataFrame yields column names, not values.
zero_classifier = ZeroClassifier()
print('Train accuracy : {:.5f}'.format(zero_classifier.score(X_train, y_train['Class'])))
print('Test accuracy : {:.5f}'.format(zero_classifier.score(X_test, y_test['Class'])))

Train accuracy : 0.99827
Test accuracy : 0.99828


### Confusion matrix and TPR
1. TN (True Negative): the classifier correctly predicts 0 when the observation is 0

2. FP (False Positive): the classifier incorrectly predicts 1 when the observation is 0

3. TP (True Positive): the classifier correctly predicts 1 when the observation is 1

4. FN (False Negative): the classifier incorrectly predicts 0 when the observation is 1

Obs\Pred|0|1
----|----|----
0|TN|FP
1|FN|TP

And the confusion matrix of the zero classifier on the training set is:

Obs\Pred|0|1
----|----|----
0|190490|0
1|330|0
    
We propose the TPR (True Positive Rate) as a good measure in this case. TPR measures how often the model predicts 1 when the actual label is 1. Here we are most concerned with whether the model can identify a credit card fraud when there is one. The TPR of the zero classifier is 0, since it never predicts label 1:
$$TPR = \frac{TP}{TP + FN} = 0$$
In other words, the model never identifies a credit card fraud when there is one.
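
The TPR formula can be sketched in code as follows. The helper `true_positive_rate` and the toy labels are hypothetical, introduced only for illustration. The sketch reads TP and FN from scikit-learn's `confusion_matrix` and cross-checks the result against `recall_score`, since recall is another name for TPR.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

def true_positive_rate(y_true, y_pred):
    """TP / (TP + FN) for binary labels in {0, 1} (hypothetical helper)."""
    # labels=[0, 1] forces a 2x2 matrix even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn)

# Toy labels, assumed for illustration: 4 positives, 3 of them detected.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

tpr = true_positive_rate(y_true, y_pred)
tpr_zero = true_positive_rate(y_true, np.zeros_like(y_true))
print(tpr, recall_score(y_true, y_pred), tpr_zero)
```

On these toy labels the TPR is 0.75 (3 of the 4 positives are found), while an all-zero prediction gives a TPR of 0, as for the zero classifier.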

In [11]:
# confusion matrix format:
# O\Pred   0     1
# B  0     TN    FP  ON
# S  1     FN    TP  OP
#          PN    PP
confusion_matrix(y_train, zero_classifier.predict(X_train))

array([[190490,      0],
       [   330,      0]])

## 2. Logistic Model

In [16]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
logregcv = LogisticRegressionCV(random_state = 123) # The default L-BFGS solver uses an L2 penalty.
logregcv.fit(X_train, y_train['Class']) # pass a 1-D label vector rather than a one-column DataFrame
y_hat_train = logregcv.predict(X_train)
y_hat_test = logregcv.predict(X_test)
print("Train accuracy: ", accuracy_score(y_train, y_hat_train))
print("Test accuracy: ", accuracy_score(y_test, y_hat_test))

Train accuracy:  0.999266324285
Test accuracy:  0.99921265707


In [17]:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_hat_test)
# Label rows (observed) and columns (predicted) so the matrix is readable
conf_df = pd.DataFrame(conf_mat, columns = ['y_hat=0', 'y_hat = 1'], index = ['y=0', 'y=1'])
conf_df

Unnamed: 0,y_hat=0,y_hat = 1
y=0,93809,16
y=1,59,103


Obs\Pred|0|1
----|----|----
0|93809|16
1|59|103
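
To relate this matrix to the TPR criterion introduced for the baseline, the sketch below recomputes the rates from the four reported test-set counts. The variable names are assumptions for illustration. The counts are copied from the table above, so the snippet runs without the dataset.

```python
import numpy as np

# Test-set confusion matrix reported above (rows: observed, columns: predicted).
conf_mat = np.array([[93809, 16],
                     [59, 103]])
tn, fp, fn, tp = conf_mat.ravel()

tpr = tp / (tp + fn)        # share of actual frauds that were flagged
fpr = fp / (fp + tn)        # share of legitimate transactions flagged as fraud
precision = tp / (tp + fp)  # share of fraud flags that were correct
print('TPR: {:.4f}  FPR: {:.6f}  precision: {:.4f}'.format(tpr, fpr, precision))
```

The logistic model flags 103 of the 162 frauds in the test set, a TPR of roughly 0.64, compared with 0 for the zero classifier.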