# Credit Card Fraud Detection

## Data Preparation

The data is already fairly well prepared, as it was cleaned and ran through PCA (in order to ensure anonymity). 

It still needs to be split before training any models.
It is **very important** to ensure equal proportions of fraudulent transactions in each subset of data (train/validation/test)

In [1]:
# import the needed libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#sklearn will be imported partially when necessary

#read in the data and make a copy of it in case anything goes wrong
path = "C:/Users/ms101/OneDrive/datasets"
credit_data = pd.read_csv(path + "/creditcard.csv")

data = credit_data.copy()


In [2]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


We already checked for missing values during the exploration, one more thing to look for are **duplicates**. These could potentially lead to biased metrics.

In [3]:
data.duplicated().sum()

1081

There are 1081 duplicate entries which should be dropped.

In [4]:
print(data.shape)
data.drop_duplicates(inplace = True)
data.reset_index(inplace = True, drop = True)

(284807, 31)


In [5]:
data.shape

(283726, 31)

In [6]:
assert data.shape == (284807-1081,31)#check if duplicates are removed correctly

### Splitting the data

I will use a stratified shuffle split, which ensures, that the proportions of classes within all subsets are equal.

I will not split of a seperate validation set at this point because I aim to utilize cross-validation.

In [7]:
X = data.drop("Class", axis = 1)
y = data["Class"]

In [8]:
X.shape

(283726, 30)

In [9]:
y.shape

(283726,)

In [11]:
from sklearn.model_selection import StratifiedShuffleSplit

strat_split = StratifiedShuffleSplit(test_size = 0.2, random_state = 13)
for train_index, test_index in strat_split.split(X,y):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]

In [12]:
X_train.shape, y_train.shape

((226980, 30), (226980,))

In [13]:
X_test.shape, y_test.shape

((56746, 30), (56746,))

Now we should check if the proportions of fraudulent cases are similar in both subset.

In [22]:
train_counts = y_train.value_counts()
train_ratio = train_counts[1]/train_counts[0]

In [23]:
test_counts = y_test.value_counts()
test_ratio = test_counts[1]/test_counts[0]

In [25]:
train_ratio.round(5) , test_ratio.round(5)

(0.00167, 0.00168)

They are very similar our data split was a success.

## Model Selection

### Baseline Model

The first model which will act as a baseline for others will be a Logistic Regression with fairly standard parameters.

I also aim to try other "classic" classification models before going into more specialized models for outlier/anomaly detection such as Gaussian Mixture Models.

From all models the most promising ones will be select and tuned further.

I will consider combining multiple models into an ensemble by voting or stacking.

The metrics will be:
- precision
- recall (aiming for a high recall/sensitivity)

To determine good models I will used cross-validation which will again use a stratified approach to ensure similar target-value ratios.



In [27]:
# import the metrics
from sklearn.metrics import confusion_matrix, precision_score, recall_score, precision_recall_curve


The normal `cross_val_score` from sklearn samples randomly when splitting the data into train and validation set. I want to avoid that for the mentioned reasons. Therefore the following function will be used instead. 

*Note: Parts of this code were adapted from Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow by Aurélien Géron)*

In [47]:
def strat_cross_val_score(classifier,X_train, y_train, cv = 5, scoring = "accuracy"):
    """
    Runs n-fold cross-validation with stratified sampling.

    Parameters
    ----------
    classifier : sklearn Classifier to evaluate
    X_train : Training set of features.
    y_train : Training set of the target values.
    cv : n-folds for the cross validation
    scoring: Method of scoring ("accuracy","precision_recall", "roc_auc")

    Returns
    -------
    The score of the chosen metric for each round of cross-validation.

    """
    
    from sklearn.model_selection import StratifiedKFold
    from sklearn.base import clone

    skfolds = StratifiedKFold(n_splits = cv, shuffle = True ,random_state = 13)
    """for train_index, test_index in strat_split.split(X,y):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]"""
    for train_index, test_index in skfolds.split(X_train, y_train):
        clone_clf = clone(classifier)
        X_train_folds, X_test_fold = X_train.iloc[train_index], X_train.iloc[test_index]
        y_train_folds, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]

        clone_clf.fit(X_train_folds, y_train_folds)
        y_pred = clone_clf.predict(X_test_fold)
        #for accuracy
        if scoring == "accuracy":
            n_correct = sum(y_pred == y_test_fold)
            print(n_correct / len(y_pred))
        #for precision and recall
        #if scoring == "precision_recall":
        
        #for roc_auc
        #if scoring == "roc_auc"
        

In [48]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression()

In [50]:
strat_cross_val_score(log_clf, X_train, y_train) #the data needs to be scaled but right now I was only testing the function

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9990748083531589


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9989426381178959


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.998788439510089


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.9989646664904397
0.998656269274826


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
