<a href="https://www.kaggle.com/code/briankhor/credit-card-fraud-prediction?scriptVersionId=110450986" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Credit Card Fraud Prediction

The goal of this work is to apply machine learning techniques to train a predictive model on a large dataset of credit card transactions and detect potentially fraudulent ones. We will employ basic machine learning techniques, models and practices below.

This notebook is a personal learning exercise in applying ML techniques. It is adapted from the sources below, and I claim close-to-zero originality in this work.

- [Predicting Fraud with TensorFlow](http://www.kaggle.com/code/currie32/predicting-fraud-with-tensorflow)
- [Credit Card Fraud Detection using Machine Learning & Python](http://towardsdatascience.com/credit-card-fraud-detection-using-machine-learning-python-5b098d4a8edc)

## Making sense of our data 

First, we import the necessary Python packages for data handling and visualization. We then load our data and clean it by checking for null values and duplicates. Finally, we visualize the basic features of our data.

We start by loading the necessary packages for our work.

In [None]:
import numpy as np 
import pandas as pd 
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.manifold import TSNE

Next, we import the given data set.

In [None]:
data = pd.read_csv('../input/creditcardfraud/creditcard.csv')

We will explore the basic features of the data below. First, we look at the first few rows of the dataset.

In [None]:
data.head()

We look at summary statistics for each column of the data (mean, standard deviation, etc.). Note that V1 - V28 are confidential features that have already been transformed from their original form.

In [None]:
data.describe()

We next look at the transaction distribution. 

In [None]:
Total_transaction = len(data)
normal = len(data[data.Class == 0])
fraudulent = len(data[data.Class == 1])
fraudulent_percentage = round(fraudulent*100/Total_transaction, 2)
print('Total number of transactions: ' + str(Total_transaction))
print('Number of normal transactions: ' + str(normal))
print('Number of fraudulent transactions: ' + str(fraudulent))
print('Percentage of fraudulent transactions: ' + str(fraudulent_percentage))

Here we see that we have a highly imbalanced dataset with only 0.17% fraudulent transactions, which is expected since we don't expect our credit card to be involved in a scam/fraud every single day (otherwise many would worry about the financial security provided by financial institutions!).
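
One consequence of this imbalance is that plain accuracy can look excellent even for a useless model. The sketch below uses made-up counts chosen only to match the 0.17% ratio (they are not the real dataset counts) and a hypothetical "model" that never flags anything.

```python
import numpy as np

# Toy labels with roughly the class ratio quoted above (0.17% fraud).
# The counts are illustrative assumptions, not values from the dataset.
n_total = 100_000
n_fraud = 170
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A hypothetical "model" that always predicts normal (class 0).
y_pred = np.zeros(n_total, dtype=int)

accuracy = (y_true == y_pred).mean()
recall = (y_pred[y_true == 1] == 1).mean()  # fraction of frauds caught

print(f"accuracy        = {accuracy:.4f}")
print(f"recall on fraud = {recall:.4f}")
```

This is why we also look at the F1 score and the confusion matrix later on, rather than accuracy alone.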

Next, we check for any null value.

In [None]:
data.info()

Good! There is no null value present in the data.
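
As a side note, `data.info()` requires reading the non-null counts by eye. A more direct check, sketched here on a small made-up frame (the column values are hypothetical), is to count the nulls per column.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one missing value, for illustration only.
toy = pd.DataFrame({
    "V1": [0.1, np.nan, -0.3],
    "Amount": [10.0, 25.5, 3.2],
    "Class": [0, 0, 1],
})

null_counts = toy.isnull().sum()            # nulls per column
has_nulls = bool(toy.isnull().values.any())  # any null anywhere?

print(null_counts)
print("Any nulls:", has_nulls)
```

The same two calls can be applied to `data` to confirm that every count is zero.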

We will first look at the normal and fraudulent distributions for the 'Time' feature.

In [None]:
ax = plt.subplot()
sns.histplot(data['Time'][data.Class==0], bins=50, color="blue",stat="density")
sns.histplot(data['Time'][data.Class==1], bins=50, color="orange",stat="density")
plt.legend(["Normal", "Fraud"])
ax.set_title('Time')
             
plt.show()

It seems that time is more or less uniformly distributed for both classes. The putative 'peaks' have a density of order $10^{-5}$, so any variation is probably not meaningful, and we will drop the time feature.
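
To put the $10^{-5}$ scale in context, here is a minimal sketch, assuming 'Time' is measured in seconds and spans roughly two days (this is my assumption about the dataset; it is not stated above). With `stat="density"` the bars are normalized to unit area, so a perfectly flat distribution over that window would have a height of about $1/172800 \approx 5.8 \times 10^{-6}$. The samples below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed time window: two days expressed in seconds.
span_seconds = 2 * 24 * 60 * 60

# Synthetic, perfectly uniform "transaction times".
times = rng.uniform(0, span_seconds, size=200_000)

# Density histogram: bar heights are scaled so the total area is 1.
heights, edges = np.histogram(times, bins=50, range=(0, span_seconds), density=True)
flat_density = 1.0 / span_seconds

print(f"expected flat density: {flat_density:.2e}")
print(f"min / max bar height : {heights.min():.2e} / {heights.max():.2e}")
```

So the small absolute values on the y-axis mostly reflect the width of the time window. When comparing the two classes, it is the relative shape of the histograms that carries information.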

We will also check for duplicates (it is possible that duplicate transactions were recorded at slightly different times). We make the simplifying assumption that two rows are duplicates if all other identifying features (V1 - V28 and the amount) are the same, and we remove such duplicates.
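
The effect of this assumption can be seen on a small made-up frame (the values are hypothetical): two rows that differ only in 'Time' survive `drop_duplicates` while the 'Time' column is present, and collapse into one row once it is dropped.

```python
import pandas as pd

# Made-up rows: the first two differ only in 'Time'.
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "V1": [-1.36, -1.36, 1.19],
    "Amount": [149.62, 149.62, 2.69],
    "Class": [0, 0, 0],
})

with_time = toy.drop_duplicates()
without_time = toy.drop(columns=["Time"]).drop_duplicates()

print("rows kept with Time   :", len(with_time))
print("rows kept without Time:", len(without_time))
```

Note that `drop_duplicates` compares all remaining columns, so the 'Class' label also has to match for two rows to count as duplicates.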

In [None]:
data.drop(['Time'], axis=1, inplace=True)
data.drop_duplicates(inplace=True)
data.shape 

Now we look at the distributions of the data for both normal and fraudulent transactions for the rest of the features. 

In [None]:
v_features = data.iloc[:, 0:29].columns

plt.figure(figsize=(12, 29*4))
gs = gridspec.GridSpec(29, 1)

for i, col in enumerate(data[v_features]):
    ax = plt.subplot(gs[i])
    sns.histplot(data[col][data.Class == 0], bins = 50, color="blue", stat="density", kde=True)
    sns.histplot(data[col][data.Class == 1], bins = 50, color="orange", stat="density", kde=True)
    plt.legend(["Normal","Fraud"])
    ax.set_xlabel('')
    ax.set_title('Histogram of feature: ' + str(col))

plt.show()

Here we notice a few features of our data.

1. The amount has a very large range (the difference between max and min), but the values are heavily skewed towards the lower end for both normal and fraudulent transactions. So we need a way to scale this column.

2. We will drop other anonymous features where normal and fraud distributions are similar (V13, V15, V20, V22 - V26, V28).

We will start by scaling the amount column and dropping less distinguishable features.

In [None]:
data.drop(['V13', 'V15', 'V20', 'V22', 'V23', 'V24', 'V25', 'V26', 'V28'], axis = 1, inplace=True)
           
sc = StandardScaler()
amount = data['Amount'].values
data['Amount'] = sc.fit_transform(amount.reshape(-1,1))

Now we are done cleaning and scaling the data. We will move on to training on our dataset.
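
As a side note on the scaling step, `StandardScaler` subtracts the column mean and divides by the (population) standard deviation. The sketch below checks this on a few made-up amounts; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up, skewed amounts (many small values, one very large one).
toy_amount = np.array([1.0, 2.5, 10.0, 99.0, 2500.0]).reshape(-1, 1)

scaled = StandardScaler().fit_transform(toy_amount)

# Same transformation written out by hand: (x - mean) / std, with ddof=0.
manual = (toy_amount - toy_amount.mean()) / toy_amount.std()

print(scaled.ravel())
print("mean:", scaled.mean(), "std:", scaled.std())
```

After scaling, the column has zero mean and unit standard deviation, although the skew itself is still there.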

## Splitting the dataset for train and test set

It is standard practice in ML to split the available dataset into a train set and a test set. The train set is used to fit the parameters of our neural network (or other model), while the test set is held out from training and is used only to determine how well the trained model fares at predicting fraudulent transactions.

The standard practice for a moderately sized dataset (i.e., large but not too large, in the ballpark of 100k to several million data points) is an 80-20 train-test split. There is no hard rule for splitting the dataset, and one is free to alter the fractions slightly.
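
With a rare positive class, a purely random split can leave the test set with slightly more or fewer frauds than expected. One optional variation (not what this notebook does) is to pass `stratify` to `train_test_split` so that both sets keep the same class ratio. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 samples, 20 of them positive (2%).
X_toy = np.arange(1000, dtype="float32").reshape(-1, 1)
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# 80-20 split that preserves the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

print("train shape:", X_tr.shape, "positives:", y_tr.sum())
print("test shape :", X_te.shape, "positives:", y_te.sum())
```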

In [None]:
X = data.drop(['Class'], axis=1).values
Y = data['Class'].values

input_node = X.shape[1]

X = X.astype('float32')
Y = Y.astype('float32').reshape((-1,1))

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=1)

## Running the model using TensorFlow Neural Network

We will first train a neural network using the TensorFlow package. An advantage of TensorFlow is that once we build the forward propagation structure, the backpropagation portion of the network can be worked out with very few lines of code.

Since this is a binary classification problem, we will use a fully connected neural network in which each layer consists of a linear function followed by a ReLU activation function (except the last layer, where we use a sigmoid activation function, as in logistic regression). We will use the Adam optimizer to update the weights.

We set the number of hidden nodes in each layer using a constant ratio. We will use 4 layers: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
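
To make the shapes concrete, here is a plain NumPy sketch of the same layer pattern. It is only an illustration: the weights are random, the batch is made up, and I assume 20 input features (the 19 remaining V columns plus the amount). Inputs are stored as columns, i.e. with shape (features, batch).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features, batch = 20, 5  # assumed sizes for illustration

# Hidden sizes follow a constant ratio of 0.5, starting from 25.
h1 = 25
h2 = round(h1 * 0.5)  # Python rounds 12.5 to the nearest even integer, 12
h3 = round(h2 * 0.5)
layer_sizes = [n_features, h1, h2, h3, 1]

# Random weights of shape (n_out, n_in) and zero biases of shape (n_out, 1).
weights = [rng.normal(0, 0.15, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]

a = rng.normal(size=(n_features, batch))
for i, (W, b) in enumerate(zip(weights, biases)):
    z = W @ a + b                      # LINEAR
    is_last = i == len(weights) - 1
    a = sigmoid(z) if is_last else relu(z)
    print(f"layer {i + 1}: output shape {a.shape}")
```

The final output has one row and one column per example, and each entry is a probability between 0 and 1.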

In [None]:
multiplier = 0.5
hidden_nodes1 = 25
hidden_nodes2 = round(hidden_nodes1*multiplier)
hidden_nodes3 = round(hidden_nodes2*multiplier)

Next we build the weights and biases for each layer, where we write operations between tensors and initialize variables.

In [None]:
def initialize_parameter():
    zero_initializer = tf.zeros_initializer()
    initializer = tf.keras.initializers.TruncatedNormal(stddev=0.15)

    # layer 1
    W1 = tf.Variable(initializer([hidden_nodes1, input_node], dtype=tf.float32))
    b1 = tf.Variable(zero_initializer([hidden_nodes1, 1], dtype=tf.float32))

    # layer 2
    W2 = tf.Variable(initializer([hidden_nodes2, hidden_nodes1], dtype=tf.float32))
    b2 = tf.Variable(zero_initializer([hidden_nodes2, 1], dtype=tf.float32))
                 
    # layer 3
    W3 = tf.Variable(initializer([hidden_nodes3, hidden_nodes2], dtype=tf.float32))
    b3 = tf.Variable(zero_initializer([hidden_nodes3, 1], dtype=tf.float32))
                 
    # layer 4
    W4 = tf.Variable(initializer([1, hidden_nodes3], dtype = tf.float32))
    b4 = tf.Variable(zero_initializer([1], dtype=tf.float32))
    
    return W1, b1, W2, b2, W3, b3, W4, b4

def forward_propagation(X, W1, b1, W2, b2, W3, b3, W4, b4):
    X = tf.cast(X, tf.float32)  # make sure the input is float32
    z1 = tf.math.add(tf.linalg.matmul(W1,X),b1)
    y1 = tf.keras.activations.relu(z1)
    z2 = tf.math.add(tf.linalg.matmul(W2,y1),b2)
    y2 = tf.keras.activations.relu(z2)
    z3 = tf.math.add(tf.linalg.matmul(W3,y2),b3)
    y3 = tf.keras.activations.relu(z3)
    z4 = tf.math.add(tf.linalg.matmul(W4,y3),b4)
    y4 = tf.keras.activations.sigmoid(z4)
    
    return y4

Set the hyperparameters such as number of epochs, learning rate and batch size.

In [None]:
learning_rate = 0.0005
batch_size = 2048
num_epoch = 10

n_train = Y_train.shape[0]

Next we write the cost function.
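
For reference, binary cross-entropy can be written either in terms of probabilities or in terms of raw logits (the values before the sigmoid). The NumPy sketch below, on made-up numbers, shows that the two forms agree, and that feeding probabilities into the logits form gives a different (larger) value. The helper names are my own and not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(p, y, eps=1e-7):
    # Cross-entropy when p is already a probability in (0, 1).
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def bce_from_logits(z, y):
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    return float(np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))))

z = np.array([-3.0, -1.0, 0.5, 4.0])  # made-up logits
y = np.array([0.0, 0.0, 1.0, 1.0])    # made-up labels

loss_probs = bce_from_probs(sigmoid(z), y)
loss_logits = bce_from_logits(z, y)
loss_double = bce_from_logits(sigmoid(z), y)  # sigmoid applied twice

print(f"from probabilities   : {loss_probs:.4f}")
print(f"from logits          : {loss_logits:.4f}")
print(f"probs passed as logits: {loss_double:.4f}")
```

The practical point is that a loss expecting logits should receive the pre-sigmoid values, and a loss expecting probabilities should receive the sigmoid outputs.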

In [None]:
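def compute_cost(y4, Y):
    # y4 holds sigmoid outputs (probabilities), so we use a cross-entropy
    # that expects probabilities rather than raw logits.
    probs = tf.transpose(y4)
    labels = tf.transpose(Y)
    cost = tf.reduce_mean(tf.keras.losses.binary_crossentropy(labels, probs))
    
    return cost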

Now we sort our dataset into batches, initialize the variables, choose the optimizer, write the backpropagation step, and run the training loop.
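
As a rough picture of what batching does, the sketch below slices made-up arrays into mini-batches in plain NumPy. The helper `iterate_minibatches` is hypothetical and only mimics the idea behind `Dataset.batch`; the array sizes are illustrative.

```python
import math
import numpy as np

def iterate_minibatches(X, Y, batch_size):
    # Yield consecutive slices; the last batch may be smaller.
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]

# Made-up arrays standing in for the training data.
X_toy = np.zeros((5000, 20), dtype="float32")
Y_toy = np.zeros((5000, 1), dtype="float32")
batch_size = 2048

sizes = [xb.shape[0] for xb, _ in iterate_minibatches(X_toy, Y_toy, batch_size)]
n_batches = math.ceil(X_toy.shape[0] / batch_size)

print("batch sizes:", sizes)
print("number of batches:", n_batches)
```

Since each mini-batch cost is already a mean over its examples, an epoch-level average is obtained by dividing the summed batch costs by the number of batches.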

In [None]:
x_train = tf.data.Dataset.from_tensor_slices(X_train)
y_train = tf.data.Dataset.from_tensor_slices(Y_train)
x_test = tf.data.Dataset.from_tensor_slices(X_test)
y_test = tf.data.Dataset.from_tensor_slices(Y_test)

In [None]:
dataset = tf.data.Dataset.zip((x_train, y_train))
test_dataset = tf.data.Dataset.zip((x_test, y_test))

optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)

# Binary labels with a single sigmoid output call for BinaryAccuracy
test_accuracy = tf.keras.metrics.BinaryAccuracy()
train_accuracy = tf.keras.metrics.BinaryAccuracy()

minibatches = dataset.batch(batch_size).prefetch(8)
test_minibatches = test_dataset.batch(batch_size).prefetch(8)

costs = []
train_acc = []
test_acc = []

W1, b1, W2, b2, W3, b3, W4, b4 = initialize_parameter()
trainable_variables = [W1, b1, W2, b2, W3, b3, W4, b4]
    
for epoch in range(num_epoch):
        
    epoch_cost = 0.
    num_batches = 0
    train_accuracy.reset_state()
    test_accuracy.reset_state()
        
    for (minibatch_X, minibatch_Y) in minibatches:
        with tf.GradientTape() as tape:
            y4 = forward_propagation(tf.transpose(minibatch_X), W1, b1, W2, b2, W3, b3, W4, b4)
            minibatch_cost = compute_cost(y4, tf.transpose(minibatch_Y))
                
        # y4 has shape (1, batch); transpose it to match the (batch, 1) labels
        train_accuracy.update_state(minibatch_Y, tf.transpose(y4))
        grads = tape.gradient(minibatch_cost, trainable_variables)
        optimizer.apply_gradients(zip(grads, trainable_variables))
        epoch_cost += float(minibatch_cost)
        num_batches += 1

    # Evaluate on the held-out test set (no gradient updates here)
    for (minibatch_X, minibatch_Y) in test_minibatches:
        y4 = forward_propagation(tf.transpose(minibatch_X), W1, b1, W2, b2, W3, b3, W4, b4)
        test_accuracy.update_state(minibatch_Y, tf.transpose(y4))
            
    # Each minibatch cost is already a mean, so average over the batches
    epoch_cost /= num_batches
    costs.append(epoch_cost)
    train_acc.append(float(train_accuracy.result()))
    test_acc.append(float(test_accuracy.result()))
                    
plt.plot(np.squeeze(costs))
plt.ylabel('cost')
plt.xlabel('epoch')
plt.show()

## Comparing machine learning models using prebuilt packages

Having run our model using TensorFlow, we will also train several other ML models on our dataset for fraud detection. This section is mostly adapted from [Credit Card Fraud Detection using Machine Learning & Python](http://towardsdatascience.com/credit-card-fraud-detection-using-machine-learning-python-5b098d4a8edc) and I claim little to no originality. Again, this is personal documentation for me to learn ML techniques.
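
Since every model in this section is fitted and scored in the same way, the repeated steps could be wrapped in a small helper. The sketch below is one possible way to do that; the `evaluate` function and the synthetic data from `make_classification` are my own illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a classifier and return accuracy, F1 score and confusion matrix."""
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_hat),
        "f1": f1_score(y_test, y_hat),
        "confusion_matrix": confusion_matrix(y_test, y_hat, labels=[0, 1]),
    }

# Synthetic, imbalanced toy data (about 5% positives), for illustration only.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

report = evaluate(
    DecisionTreeClassifier(max_depth=4, criterion="entropy", random_state=0),
    X_tr, y_tr, X_te, y_te,
)
print(report)
```

The same helper could then be called with any of the classifiers used below.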

We will first begin by loading and importing Python packages.

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer, FunctionTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, OrdinalEncoder
import statsmodels.formula.api as smf
import statsmodels.tsa as tsa
from sklearn.linear_model import LogisticRegression, LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, AdaBoostClassifier, AdaBoostRegressor
from sklearn.svm import LinearSVC, LinearSVR, SVC, SVR
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

### *Decision Tree*

In [None]:
DT = DecisionTreeClassifier(max_depth=4, criterion='entropy')
DT.fit(X_train, Y_train)
DT_yhat = DT.predict(X_test)

Let's check the accuracy of the decision tree model.

In [None]:
print('Accuracy score of the Decision Tree Model is {}'.format(accuracy_score(Y_test, DT_yhat)))

Checking the F1 score of the decision tree model.

In [None]:
print('F1 score of the Decision Tree Model is {}'.format(f1_score(Y_test, DT_yhat)))

Checking the confusion matrix of the decision tree model.

In [None]:
confusion_matrix(Y_test, DT_yhat, labels=[0,1])

Taking fraud as the positive class, we have 55015 true negatives (normal classified as normal) and 16 false negatives (fraud classified as normal). Out of 28 + 74 = 102 real frauds, the Decision Tree model missed about 16% of the fraudulent transactions.
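
When reading these matrices, it helps to recall the scikit-learn convention: rows are the true classes and columns are the predicted classes. With `labels=[0, 1]`, flattening the matrix therefore gives the counts in the order TN, FP, FN, TP. A minimal sketch on made-up labels (not the model output above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels: 1 = fraud (positive class), 0 = normal.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the flagged transactions, how many are fraud
recall = tp / (tp + fn)     # of the real frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

The fraction of missed frauds is FN / (FN + TP), i.e. one minus the recall.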

### *K-Nearest Neighbors*

In [None]:
KNN = KNeighborsClassifier(n_neighbors = 7)
KNN.fit(X_train, Y_train)
KNN_yhat = KNN.predict(X_test)

Checking the accuracy, F1 score and the confusion matrix of the K-Nearest Neighbors model.

In [None]:
print('Accuracy score of the KNN model is {}'.format(accuracy_score(Y_test, KNN_yhat)))
print('F1 score of the KNN model is {}'.format(f1_score(Y_test, KNN_yhat)))
confusion_matrix(Y_test, KNN_yhat, labels = [0,1])

### *Logistic Regression*

In [None]:
LR = LogisticRegression()
LR.fit(X_train, Y_train)
LR_yhat = LR.predict(X_test)

The accuracy, F1 score and confusion matrix for logistic regression are as follows.

In [None]:
print('Accuracy score of logistic regression is {}'.format(accuracy_score(Y_test, LR_yhat)))
print('F1 score of logistic regression is {}'.format(f1_score(Y_test, LR_yhat)))
confusion_matrix(Y_test, LR_yhat, labels = [0, 1])

Here, we missed 14 cases of fraud and raised false alarms for about 42 normal transactions.

### *Support Vector Machines*

In [None]:
svm = SVC()
svm.fit(X_train, Y_train)
svm_yhat = svm.predict(X_test)

The accuracy, F1 score and confusion matrix of the SVM model are shown below.

In [None]:
print('Accuracy score of the SVM model is {}'.format(accuracy_score(Y_test, svm_yhat)))
print('F1 score of the SVM model is {}'.format(f1_score(Y_test, svm_yhat)))
confusion_matrix(Y_test, svm_yhat, labels=[0,1])

### *Random Forest*

In [None]:
RF = RandomForestClassifier(max_depth = 4)
RF.fit(X_train, Y_train)
RF_yhat = RF.predict(X_test)

The accuracy, F1 score and confusion matrix of the random forest model are shown below.

In [None]:
print('Accuracy score of the random forest model is {}'.format(accuracy_score(Y_test, RF_yhat)))
print('F1 score of the random forest model is {}'.format(f1_score(Y_test, RF_yhat)))
confusion_matrix(Y_test, RF_yhat, labels=[0,1])

### *XGBoost*

In [None]:
XGB = XGBClassifier(max_depth = 4)
XGB.fit(X_train, Y_train)
XGB_yhat = XGB.predict(X_test)

The accuracy, F1 score and confusion matrix of the XGBoost model are shown below.

In [None]:
print('Accuracy score of the XGB model is {}'.format(accuracy_score(Y_test, XGB_yhat)))
print('F1 score of the XGB model is {}'.format(f1_score(Y_test, XGB_yhat)))
confusion_matrix(Y_test, XGB_yhat, labels=[0,1])