# Lab: Kaggle Credit Card Fraud Detection

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

**Objectives:** Compare Logistic Regression and Deep NN classifiers on skewed data. The idea is to check whether preprocessing techniques (such as resampling) improve the predictive model when an overwhelming majority class would otherwise dominate it. Learn how to apply cross-validation (CV) for hyper-parameter tuning.

In [1]:
import warnings
warnings.filterwarnings('ignore',category=FutureWarning)
warnings.filterwarnings('ignore',category=DeprecationWarning)
warnings.filterwarnings('ignore',category=Warning)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

### Load the anonymised dataset

The dataset contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the original features and further background information about the data cannot be provided. Features V1, V2, ..., V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount.
 
The last column is the Class:  normal transaction (0),  fraud transaction (1). 

Load the dataset stored in the file *"creditcard.csv"*. 

In [None]:
data = ?

# Confirm that the dimension of the data set is (284807, 31)    
?

#Compute the mean of each column, and observe that the anonymised features V1-V28 have practically 0 mean
?

# show the first few examples (rows) from the dataset 
 ?

#### Normalize the values of Column Amount  

In [None]:
from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
#drop column Time as an irrelevant feature
#drop column Amount, as column normAmount was added
data = data.drop(['Time','Amount'],axis=1)  

# show again the first few examples (rows) from the normalized dataset 
?

#### Compute the number of samples per class


In [None]:
number_records_fraud = ?

number_records_normal = ?

print('Class 1 ( fraud transaction):', number_records_fraud)  # ANSWER: Class 1 ( fraud transaction): 492

print('Class 0 (normal transaction) :', number_records_normal) # ANSWER: Class 0 (normal transaction) : 284315

### The data is highly unbalanced! How to approach this classification problem:

- Collect more data.  Nice strategy but not applicable in this case. 
- Change the performance metric (do not rely only on the Accuracy): compute Precision, Recall, F1_score.
- Resampling the dataset to have an approximate 50-50 ratio:
    - By OVER-sampling => add copies of the under-represented class.
    - By UNDER-sampling => delete instances from the over-represented class.
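
The two resampling strategies in the list above can be sketched on a toy table. This is only an illustration under stated assumptions: the frame `toy`, its `feature` column and the 8-to-2 class split are made up for the example and are not part of the lab data.

```python
import pandas as pd

# Hypothetical toy data: 8 "normal" rows (Class 0) and 2 "fraud" rows (Class 1)
toy = pd.DataFrame({"feature": [float(i) for i in range(10)],
                    "Class": [0] * 8 + [1] * 2})

majority = toy[toy.Class == 0]
minority = toy[toy.Class == 1]

# OVER-sampling: add copies of minority rows (drawn with replacement)
# until both classes have the same number of rows
extra = minority.sample(n=len(majority) - len(minority), replace=True, random_state=0)
over_sampled = pd.concat([toy, extra], ignore_index=True)

# UNDER-sampling: keep only as many majority rows as there are minority rows
kept = majority.sample(n=len(minority), replace=False, random_state=0)
under_sampled = pd.concat([kept, minority], ignore_index=True)

print(over_sampled.Class.value_counts().to_dict())
print(under_sampled.Class.value_counts().to_dict())
```

Over-sampling keeps every original row but repeats minority rows, while under-sampling discards most of the majority class; this lab follows the second route.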
   

First, extract the features in matrix X and the class labels in vector y

In [None]:
X =  ?
y =  ?


####  UNDER-sampling 

Apply UNDER-sampling by randomly selecting x samples from the majority class (0), where x is the total number of records with the minority class (1). 

The under-sampled dataset has a 50/50 class ratio of samples. 

In [None]:
# Picking the indices of the minority (fraud) class
fraud_indices = np.array(data[data.Class == 1].index)

# Picking the indices of the normal class
normal_indices = ?

# Number of data points in the minority (fraud) class
number_records_fraud = len(data[data.Class == 1])


# Out of the normal class indices, randomly select number_records_fraud samples 
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)

# Appending the indices of normal and fraud classes
under_sample_indices = ?

# Under-sampled dataset (the collected indices are index labels, so select them with .loc)
under_sample_data = data.loc[under_sample_indices,:]

#The features in matrix X_undersample, the class labels in vector  y_undersample
X_undersample = ?
y_undersample = ?

# Data class ratio

print("Total # of transactions in resampled data:", ?) #ANSWER:  984

print(" Fraction of normal transactions: ", ?)         #ANSWER:  0.5
print(" Fraction of fraud transactions: ", ?)          #ANSWER:  0.5


### Explanation of random_state

Computers generate random numbers with a pseudo-random number generator: an algorithm that produces a seemingly random sequence of numbers, but one that is fully determined by its starting point and would eventually repeat itself.
The starting point of the generator is known as the seed. When you specify the random_state parameter, you are setting the seed of the random number generator.

Suppose you set random_state = 0. The random number generator might then produce the sequence of integers
0, 19, 11, 2, 34, 5, 23, 24, 0, 1, 89, …

and by fixing random_state=0, you will always see this sequence each time you call your train_test_split function. 

On the other hand, suppose you set random_state=1 and got the following sequence of integers:
91, 18, 11, 34, 34, 5, 19, 18, 0, 0, 1, …

You will always see these random numbers when you set random_state = 1. 
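
The behaviour described above can be checked directly with NumPy's seeded generator. This is a small sketch; the variable names and the choice of drawing five integers below 100 are arbitrary.

```python
import numpy as np

# Two generators started from the same seed produce the same sequence
first_run = np.random.RandomState(0).randint(0, 100, size=5)
second_run = np.random.RandomState(0).randint(0, 100, size=5)

# A generator started from a different seed follows its own sequence
other_seed = np.random.RandomState(1).randint(0, 100, size=5)

print("seed 0, first run :", first_run)
print("seed 0, second run:", second_run)
print("seed 1            :", other_seed)
print("same seed gives the same sequence:", np.array_equal(first_run, second_run))
```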

### Train-test data splitting

Apply *train_test_split* to the whole dataset and to the undersampled dataset, holding out 30% of the samples for testing (test_size = 0.3) and using random_state = 0.
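
As a hedged illustration of how such a split behaves, the sketch below runs `train_test_split` on a made-up array of 10 samples (the names `toy_X` and `toy_y` are hypothetical). It shows the 70/30 sizes and that the same random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0] * 8 + [1] * 2)

# Hold out 30% of the samples for testing
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0)

# Repeating the call with the same random_state gives the same partition
rX_train, rX_test, ry_train, ry_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0)

print("train size:", len(tX_train), "test size:", len(tX_test))
print("identical split:", np.array_equal(tX_test, rX_test))
```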

In [None]:
from sklearn.model_selection import train_test_split

# Whole dataset
X_train, X_test, y_train, y_test = ?

print("Number transactions train dataset: ", ?)   #ANSWER: 199364
print("Number transactions test dataset: ", ?)    #ANSWER:   85443
print("Total number of transactions: ", ?)        #ANSWER:  284807

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = ?

print() 
print("UNDER-SAMPLED DATA:")
print("Number transactions train dataset : ", ?)  #ANSWER:  688

print("Number transactions test dataset: ", ?)    #ANSWER:  296
print("Total number of transactions: ", ?)        #ANSWER:  984


###  MODEL 1: Logistic regression classifier - Undersampled data

- Accuracy = (TP+TN)/total
- Precision = TP/(TP+FP)
- Recall = TP/(TP+FN)

**Our goal is not to miss a fraudulent transaction.** Therefore we are interested in the Recall score, because it measures the fraction of fraudulent transactions that the model captures. Because the data is imbalanced, many fraudulent observations could be predicted as normal, that is, become False Negatives (we predict a normal transaction, but it is in fact a fraudulent one). Recall captures this.

Precision is a less important metric for this problem: predicting that a transaction is fraudulent when it turns out not to be (FP) is not a massive problem compared to the opposite.
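
To see why accuracy alone is misleading here, consider a hypothetical classifier that labels every transaction as normal. The sketch below applies the three formulas above to the class counts of this dataset (492 frauds, 284315 normal transactions). The helper `classification_metrics` is made up for the example, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when the ratio is undefined (zero denominator)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# "Always normal" classifier: every fraud becomes a False Negative
acc, prec, rec = classification_metrics(tp=0, fp=0, tn=284315, fn=492)
print("Accuracy : %.5f" % acc)
print("Precision: %.5f" % prec)
print("Recall   : %.5f" % rec)
```

The accuracy is above 99.8% although not a single fraud is detected, which is exactly what a recall of zero exposes.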

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

### K-fold Cross Validation (CV) to find the best hyper-parameter C of Logistic Regression.  

$C = 1/\lambda$, where $\lambda$ is the regularization parameter.
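
For orientation, one common textbook form of the L2-regularized logistic regression cost is shown below. It is given as an assumption for illustration only: the symbols $m$ (number of samples), $n$ (number of features), $\theta$ (weights) and $h_\theta$ (sigmoid output) are not defined elsewhere in this lab, and the exact scaling used inside scikit-learn differs.

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
$$

Under this reading, a large $C$ means a small $\lambda$ and therefore weak regularization, while a small $C$ penalizes large weights more strongly.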

In [None]:
# Find the best hyper-parameter optimizing for recall
def print_gridsearch_scores(x_train_data,y_train_data):
    c_param_range = [0.01,0.1,1,10,100]

    clf = GridSearchCV(LogisticRegression(), {"C": c_param_range}, cv=5, scoring='recall')
    clf.fit(x_train_data,y_train_data)

    print("Best parameters found on development set:")
    print()
    print(clf.best_params_)

    print("Grid scores on development set:")
    
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    
    #Simultaneous visualization of iterations on different arrays
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
        
    return clf.best_params_["C"]

In [None]:
#Apply function print_gridsearch_scores to get the best C with the Undersampled dataset
best_c = ?

### Model 1.1: Logistic Regression trained and tested with undersampled data


In [None]:
# Use the best C to train LogReg model with undersampled train data and test it with undersampled test data
lr = LogisticRegression(C = best_c)
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)

print('Confusion matrix (undersample test dataset)')
print(cnf_matrix)
print("Recall: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))


### Model 1.2: Logistic Regression trained on under-sampled data and tested with the whole test data

Apply the same approach as above. 


In [None]:
# Use the best C to train LogReg model with undersampled train dataset and test it with whole test dataset

lr = LogisticRegression(C = best_c)

#train on undersampled data
?

#predict whole test data 
y_pred = ?

# Compute and print confusion matrix
?

# Compute and print Recall metric
?


###  ROC curve & AUC

Plot the Receiver Operating Characteristic (ROC) curve and compute the Area Under the ROC Curve (AUC). 
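
Before using the library functions, it may help to see what they compute. The following is a minimal NumPy sketch, not the scikit-learn implementation: it sweeps a decision threshold over four made-up scores, collects the (FPR, TPR) pairs, and integrates them with the trapezoidal rule. The helper names `roc_points` and `trapezoid_area` are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr_list, tpr_list = [0.0], [0.0]
    for thr in np.sort(np.unique(scores))[::-1]:
        predicted_pos = scores >= thr
        tpr_list.append(np.sum(predicted_pos & (y_true == 1)) / n_pos)
        fpr_list.append(np.sum(predicted_pos & (y_true == 0)) / n_neg)
    return np.array(fpr_list), np.array(tpr_list)

def trapezoid_area(x, y):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

# Hypothetical toy labels and classifier scores
toy_fpr, toy_tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_auc = trapezoid_area(toy_fpr, toy_tpr)
print("FPR:", toy_fpr)
print("TPR:", toy_tpr)
print("AUC:", toy_auc)
```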


In [None]:
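# The 'l1' penalty is not supported by the default 'lbfgs' solver, so a compatible solver is set explicitly
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')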
lr.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample_score=lr.decision_function(X_test_undersample.values)

# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test_undersample.values.ravel(),y_pred_undersample_score)

# Compute Area Under the ROC Curve (AUC), it is a scalar 
roc_auc = auc(fpr,tpr)


# Plot ROC
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b',label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.0])
plt.ylim([-0.1,1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#### REMARK
To create the undersampled data, we randomly picked some samples from the majority class. This is a valid technique; however, it doesn't represent the real (huge) population.
For sufficient statistical credibility, it would be useful to repeat the process with different undersampled configurations and check whether the previously chosen parameters are still the most effective. In the end, the idea is to use a wider random representation of the whole dataset and rely on the averaged best parameters.
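
One way to set up the repetition suggested in this remark is sketched below. It assumes a made-up label vector `toy_labels` (95 normal, 5 fraud) and a hypothetical helper `undersample_indices`; each seed yields a different balanced subset. In the lab, the grid search would then be rerun on each subset and the resulting best parameters compared.

```python
import numpy as np

# Hypothetical toy labels: 95 normal (0) and 5 fraud (1) samples
toy_labels = np.array([0] * 95 + [1] * 5)
toy_fraud_idx = np.flatnonzero(toy_labels == 1)
toy_normal_idx = np.flatnonzero(toy_labels == 0)

def undersample_indices(seed):
    """All fraud indices plus an equally sized random draw of normal indices."""
    rng = np.random.RandomState(seed)
    picked = rng.choice(toy_normal_idx, size=len(toy_fraud_idx), replace=False)
    return np.concatenate([toy_fraud_idx, picked])

# Five different undersampled configurations, one per seed
subsets = [undersample_indices(seed) for seed in range(5)]
for seed, idx in enumerate(subsets):
    print("seed", seed, "-> normal samples picked:", sorted(idx[len(toy_fraud_idx):].tolist()))
```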

### MODEL 2: Logistic regression classifier - Skewed data

Now, apply K-fold Cross Validation (CV) to find the best hyper-parameter C with whole train data, as it was done above. 

K-fold is now computationally much more time consuming. 

In [None]:
best_c = ?

Use the best C to train LogReg model with the whole train data and test it with whole test data. 


In [None]:
lr = ?
?
y_pred = ?

# Compute and print confusion matrix
?

# Compute and print Recall metric.
?



### MODEL 3: Deep NN model

In [None]:
from keras.models import Sequential
from keras.layers import Dense
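# NOTE: keras.wrappers.scikit_learn exists only in older Keras/TensorFlow releases;
# on recent installations the separately maintained scikeras package provides KerasClassifier
from keras.wrappers.scikit_learn import KerasClassifier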

In [None]:
# Neural Network (NN) with multiple hidden layers
def network_builder(hidden_dimensions, input_dim):
    
    # create model
    model = Sequential()
    model.add(Dense(hidden_dimensions[0], input_dim=input_dim, kernel_initializer='normal', activation='relu'))
    
    # add multiple hidden layers
    for dimension in hidden_dimensions[1:]:
        model.add(Dense(dimension, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    
    # Compile model. Use the logarithmic loss function and the Adam gradient optimizer.
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

In [None]:
# Find the best hyper-parameter (hidden layer dimensions) optimizing for recall
def print_gridsearch_scores_deep_learning(x_train_data,y_train_data):
      
    # verbose=1 shows an animated progress bar during training
    # verbose=0 does not show anything during training
    
    # choose between 3 options for "hidden_dimensions":
    # [10] - 1 hidden layer with 10 units;
    # [10, 10, 10] - 3 hidden layers with 10 units each;
    # [100, 10] - 2 hidden layers with 100 units in the 1st and 10 units in the 2nd layer;
    

    clf = GridSearchCV(KerasClassifier(build_fn=network_builder, epochs=50, batch_size=128, 
        verbose=0, input_dim=29), {"hidden_dimensions": ([10], [10, 10, 10], [100, 10])}, cv=5, scoring='recall')
    
    clf.fit(x_train_data,y_train_data)

    print("Best parameters found on development set:")
    print()
    print(clf.best_params_)

    print("Grid scores on development set:")
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print( "%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params) )


In [None]:
print_gridsearch_scores_deep_learning(X_train_undersample, y_train_undersample)

In [None]:
# Use the best hidden_dimension to train and test Deep NN model with the under-sample data 

k = KerasClassifier(build_fn=network_builder, epochs=50, batch_size=128,
                     hidden_dimensions=[100, 10], verbose=0, input_dim=29)
k.fit(X_train_undersample,y_train_undersample.values.ravel())
y_pred_undersample = k.predict(X_test_undersample.values)


# Compute and print confusion matrix
?

# Compute and print Recall metric
?


In [None]:
#Test with whole test data 
y_pred = ?

# Compute and print confusion matrix
?

# Compute and print Recall metric
?
