# Detecting Fraudulent Transactions in BankSim
This notebook builds a supervised learning model to detect fraudulent transactions in the BankSim dataset, available at https://www.kaggle.com/ntnu-testimon/banksim1.

The goal is to first create a benchmark model on the intrinsic features provided. The dataset will then be modeled as a graph in a Neo4j database so that graph algorithms can be applied to create network features to feed into the model.

### Implementation steps
The steps for creating the model are as follows:  
1. Preprocess the data so it can be fed into a predictive model
 1. Remove rows with empty values
 2. Normalize feature values
2. Train supervised learning model
 1. Split data into 5 folds, to use for cross validation
 2. Estimate model prediction error using K-fold cross validation
 3. Choose best performing model
 4. Optimize hyperparameters using grid search
3. Measure performance of final optimized model on intrinsic features
4. Create graph data model 
5. Apply graph algorithms to create network features
6. Add network features to preprocessed dataset from step 1
7. Retrain supervised learning model with additional features using same method from step 2
8. Measure performance of model with network features and compare to metrics from step 3
9. Quantify performance gains

## Data Exploration
This section provides a basic exploration of the dataset.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

In [2]:
df = pd.read_csv("./data/bs140513_032310.csv")
df.head()

Unnamed: 0,step,customer,age,gender,zipcodeOri,merchant,zipMerchant,category,amount,fraud
0,0,'C1093826151','4','M','28007','M348934600','28007','es_transportation',4.55,0
1,0,'C352968107','2','M','28007','M348934600','28007','es_transportation',39.68,0
2,0,'C2054744914','4','F','28007','M1823072687','28007','es_transportation',26.89,0
3,0,'C1760612790','3','M','28007','M348934600','28007','es_transportation',17.25,0
4,0,'C757503768','5','M','28007','M348934600','28007','es_transportation',35.72,0


As seen above, the data consists of 10 fields: 9 input features and one label indicating whether the datapoint is fraudulent.

### Number of unique values
The cell below counts the unique values of each feature.

In [3]:
unique_print_str =  ""
for column in df:
    unique_print_str += " |{}: {}| ".format(column, df[column].unique().size)
print('---------- Number of unique values per feature ----------')
print(unique_print_str)

---------- Number of unique values per feature ----------
 |step: 180|  |customer: 4112|  |age: 8|  |gender: 4|  |zipcodeOri: 1|  |merchant: 50|  |zipMerchant: 1|  |category: 15|  |amount: 23767|  |fraud: 2| 


As seen above, the zipcodeOri and zipMerchant features contain only one unique value. 
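
Constant columns like these can also be found programmatically. The following is a minimal sketch on a hypothetical toy frame (the values are illustrative, not read from the dataset), using pandas `nunique()` to flag columns with a single distinct value:

```python
import pandas as pd

# Hypothetical toy frame mimicking a few of the BankSim columns.
toy = pd.DataFrame({
    "zipcodeOri": ["28007", "28007", "28007"],
    "category": ["es_transportation", "es_health", "es_transportation"],
    "amount": [4.55, 39.68, 26.89],
})

# nunique() counts distinct values per column; a count of 1 means the
# column is constant and carries no information for a classifier.
unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```

Applied to `df`, the same two lines would be expected to return the two zip code columns.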

### Number of fraudulent transactions

In [4]:
total = df.shape[0]
normal = df[df.fraud == 0].step.count()
fraudulent = total - normal

print("The total number of datapoints is {}".format(total))
print("The number of non-fraudulent datapoints is {}, equal to {} % of the dataset".format(normal, round(100 * normal / total, 2)))
print("The number of fraudulent datapoints is {}, equal to {} % of the dataset".format(fraudulent, round(100 * fraudulent / total, 2)))


The total number of datapoints is 594643
The number of non-fraudulent datapoints is 587443, equal to 98.79 % of the dataset
The number of fraudulent datapoints is 7200, equal to 1.21 % of the dataset
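
With only 1.21 % of the datapoints labelled fraudulent, accuracy alone says little about a model. As a worked example using the counts above, a trivial classifier that always predicts "not fraud" already reaches the accuracy computed below while detecting no fraud at all:

```python
# Counts taken from the output above.
total = 594643
fraudulent = 7200

# A classifier that always predicts "not fraud" is correct on every
# non-fraudulent transaction and wrong on every fraudulent one.
baseline_accuracy = (total - fraudulent) / total
baseline_recall = 0 / fraudulent  # it never flags a single fraud

print("Baseline accuracy: {:.2%}".format(baseline_accuracy))
print("Baseline recall on fraud: {:.0%}".format(baseline_recall))
```

This is why a class-sensitive metric such as the F-score is reported next to accuracy later in the notebook.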


In [5]:
# split the features and labels
label = df.fraud
features = df.drop('fraud', axis = 1)
features.head()

Unnamed: 0,step,customer,age,gender,zipcodeOri,merchant,zipMerchant,category,amount
0,0,'C1093826151','4','M','28007','M348934600','28007','es_transportation',4.55
1,0,'C352968107','2','M','28007','M348934600','28007','es_transportation',39.68
2,0,'C2054744914','4','F','28007','M1823072687','28007','es_transportation',26.89
3,0,'C1760612790','3','M','28007','M348934600','28007','es_transportation',17.25
4,0,'C757503768','5','M','28007','M348934600','28007','es_transportation',35.72


## Preprocessing
With the basic exploration done, the data must be preprocessed before it can be used as input to the supervised learning models.
### Removing empty values and non-usable features
As a first preprocessing step, empty values in the dataset are handled.

In [6]:
# Count the missing values in each column
df.isnull().sum()

step           0
customer       0
age            0
gender         0
zipcodeOri     0
merchant       0
zipMerchant    0
category       0
amount         0
fraud          0
dtype: int64

As the dataset does not contain any empty values, no rows will be removed from the dataset. 

Secondly, the features step, zipcodeOri, zipMerchant and customer are removed. The zip codes are removed because each contains only one unique value. The customer is removed so that the model does not overfit on customer identifiers and can still make predictions for new customers. The step is removed since this model won't ..

In [7]:
features =  features.drop(['step','zipcodeOri', 'zipMerchant', 'customer'], axis = 1)
features.head()

Unnamed: 0,age,gender,merchant,category,amount
0,'4','M','M348934600','es_transportation',4.55
1,'2','M','M348934600','es_transportation',39.68
2,'4','F','M1823072687','es_transportation',26.89
3,'3','M','M348934600','es_transportation',17.25
4,'5','M','M348934600','es_transportation',35.72


### Normalizing Numerical Features 

In [8]:
from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler, then apply it to the numerical feature only.
# The label column (fraud) is already binary and is left untouched.
scaler = MinMaxScaler()

features[['amount']] = scaler.fit_transform(features[['amount']])

# Show an example of a record with scaling applied
features.head()

Unnamed: 0,age,gender,merchant,category,amount
0,'4','M','M348934600','es_transportation',0.000546
1,'2','M','M348934600','es_transportation',0.004764
2,'4','F','M1823072687','es_transportation',0.003228
3,'3','M','M348934600','es_transportation',0.002071
4,'5','M','M348934600','es_transportation',0.004288
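
The cell above fits the scaler on the full dataset. A common alternative, sketched here in plain Python on hypothetical amounts, is to learn the minimum and maximum from the training part only and reuse them on held-out data, so that no information from the test set leaks into preprocessing. The helper names `fit_min_max` and `apply_min_max` are illustrative and not part of this notebook:

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from the training values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical amounts, split into a training and a held-out part.
train_amounts = [4.55, 39.68, 26.89, 17.25]
test_amounts = [35.72, 50.00]

lo, hi = fit_min_max(train_amounts)
train_scaled = apply_min_max(train_amounts, lo, hi)
test_scaled = apply_min_max(test_amounts, lo, hi)

print(train_scaled)
print(test_scaled)  # held-out values may fall outside [0, 1]
```

With scikit-learn, the same effect is obtained by calling `fit` on the training split and `transform` on the test split.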


### Converting categorical values using one hot encoding

In [9]:
features_final = pd.get_dummies(features)

In [10]:
features_final.head()

Unnamed: 0,amount,age_'0',age_'1',age_'2',age_'3',age_'4',age_'5',age_'6',age_'U',gender_'E',...,category_'es_home',category_'es_hotelservices',category_'es_hyper',category_'es_leisure',category_'es_otherservices',category_'es_sportsandtoys',category_'es_tech',category_'es_transportation',category_'es_travel',category_'es_wellnessandbeauty'
0,0.000546,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0.004764,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0.003228,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0.002071,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0.004288,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [11]:
features_final.shape

(594643, 78)
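
The 78 columns can be cross-checked against the unique-value counts reported in the exploration section, since one-hot encoding creates one column per distinct categorical value:

```python
# Unique-value counts copied from the exploration output earlier in the notebook.
unique_counts = {"age": 8, "gender": 4, "merchant": 50, "category": 15}

# Each categorical value becomes its own column; the numerical
# amount column passes through get_dummies unchanged.
numeric_columns = 1
expected_columns = numeric_columns + sum(unique_counts.values())
print(expected_columns)
```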

## Train Standard Models 
Once all features have been preprocessed, two models (a Random Forest and an SVC) are trained on the standard features and compared using K-fold cross-validation to find the model best suited to the dataset.

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
from modules import train_standard_models as tsm
fold_betas, fold_accuracy = tsm.some_func(features_final, label)

In [None]:
print("---------- Betas Average ----------")
tsm.calculate_average_for_fold(fold_betas, ["SVM", "RF"])
print("---------- Accuracy Averages -----------")
tsm.calculate_average_for_fold(fold_accuracy, ["SVM", "RF"])
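
The internals of the `train_standard_models` module are not shown in this notebook. As a rough sketch of what such a comparison could look like, assuming stratified 5-fold cross-validation, default scikit-learn settings and synthetic data standing in for `features_final` and `label`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for features_final and label.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

models = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["f1", "accuracy"])
    # Average each metric over the five folds.
    results[name] = {
        "f1": scores["test_f1"].mean(),
        "accuracy": scores["test_accuracy"].mean(),
    }
    print(name, results[name])
```

Stratification is a design choice of this sketch; with about 1 % positives, unstratified folds can end up with very few fraud cases.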


### Hyperparameter Optimization
As seen above, the Random Forest outperforms the SVC model in both F1 score and accuracy. Therefore, the hyperparameters of the Random Forest model are optimized using grid search.

In [None]:
from modules import  hyperparameter_optimization as ho
best_clf = ho.opt_and_print(features_final, label)

Optimal parameters: 

In [None]:
best_clf
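
The `hyperparameter_optimization` module is not included here. A minimal sketch of a grid search over a Random Forest with scikit-learn's `GridSearchCV` follows; the parameter grid, the F1 scoring and the synthetic data are assumptions made for illustration, not the values used by `ho.opt_and_print`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for features_final and label.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Hypothetical grid; every combination is evaluated with 5-fold CV.
param_grid = {
    "n_estimators": [25, 50],
    "min_samples_split": [2, 50],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X, y)

# The estimator refitted on all data with the best parameter combination.
best_clf_sketch = search.best_estimator_
print(search.best_params_)
```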

## Network Features
The following network features have been computed for both the customer and the merchant: Degree, PageRank and Community. These network features are added to the dataset and preprocessed before being used to train a Random Forest model.
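
To illustrate what two of these features measure, the sketch below computes degree and PageRank in plain Python on a hypothetical toy graph of customer-merchant transactions. It is a stand-in for the graph algorithms run in Neo4j, not the code used by `network_features`; community detection is omitted, and the damping factor of 0.85 is an assumed conventional value:

```python
from collections import defaultdict

# Hypothetical toy transactions: (customer, merchant) pairs.
transactions = [
    ("C1", "M1"), ("C2", "M1"), ("C3", "M1"),
    ("C3", "M2"), ("C4", "M2"),
]

# Build an undirected adjacency list linking customers and merchants.
neighbors = defaultdict(set)
for customer, merchant in transactions:
    neighbors[customer].add(merchant)
    neighbors[merchant].add(customer)

# Degree: number of distinct counterparties of each node.
degree = {node: len(adj) for node, adj in neighbors.items()}


def pagerank(neighbors, damping=0.85, iterations=100):
    """Power-iteration PageRank on a graph without isolated nodes."""
    n = len(neighbors)
    rank = {node: 1.0 / n for node in neighbors}
    for _ in range(iterations):
        new_rank = {}
        for node in neighbors:
            # Each neighbor passes on an equal share of its current rank.
            incoming = sum(rank[nb] / len(neighbors[nb])
                           for nb in neighbors[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


page_rank = pagerank(neighbors)
print(degree)
print(page_rank)
```

In this toy graph the merchant with the most customers receives both the highest degree and the highest PageRank.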


In [None]:
# Test output
from modules import  network_features as nf
df = nf.get_df("./data/bs140513_032310.csv", "./config.ini")
df.head()

In [None]:
features_graph = df.drop('fraud', axis = 1)

In [None]:
features_graph[['amount', 'merchDegree', 'custDegree', 'custPageRank', 'merchPageRank']].head()

### Preprocessing 
The standard features are preprocessed in the same way as before. The PageRank and Degree of both the customer and the merchant are min-max scaled and their community is one-hot encoded. 

In [None]:
scaler = MinMaxScaler()

# Columns to min-max scale: the amount plus the numerical network features
numeric_cols = ['amount', 'merchDegree', 'custDegree', 'custPageRank', 'merchPageRank']
features_graph[numeric_cols] = scaler.fit_transform(features_graph[numeric_cols])


# Show an example of a record with scaling applied
features_graph.head()

In [None]:
features_graph =  features_graph.drop(['step','zipcodeOri', 'zipMerchant', 'customer'], axis = 1)
features_enhanced = pd.get_dummies(features_graph)

In [None]:
features_enhanced.shape

### Model training: Network Enhanced vs Standard Model
Two models are evaluated using K-fold cross-validation: one trained on the network-enhanced feature set and one on the standard features. The Random Forest models are initialized with the optimal hyperparameters found using grid search above.


In [None]:
fold_betas, fold_accuracy = tsm.other_func(features_final, features_enhanced, label)

In [None]:

print("---------- Betas Average ----------")
tsm.calculate_average_for_fold(fold_betas, ["standard", "enhanced"])
print("---------- Accuracy Averages -----------")
tsm.calculate_average_for_fold(fold_accuracy, ["standard", "enhanced"])

## Result Evaluation 

The results will be evaluated with regards to these three factors: 
* Statistical Accuracy
* Interpretability
* Operational Efficiency

The statistical accuracy will be evaluated by training both the standard and network enhanced model on 100 random train-test splits and creating confidence intervals of the difference in metrics between the two models. 

The interpretability will be evaluated using the feature importance statistic on the Random Forest model. 

The operational efficiency will be evaluated by calculating the average training and prediction time of the two models over the 100 train-test splits.
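
The timing part of this evaluation can be sketched as follows. The code is illustrative only: it uses synthetic data, a single model and 5 splits instead of 100, and does not reproduce `fr.train_models`:

```python
import time
from statistics import mean

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and label.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

fit_times, predict_times = [], []
for seed in range(5):  # the notebook uses 100 splits; 5 keeps the sketch fast
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)

    # Time the training step.
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_times.append(time.perf_counter() - start)

    # Time the prediction step on the held-out split.
    start = time.perf_counter()
    clf.predict(X_test)
    predict_times.append(time.perf_counter() - start)

print("mean fit time: {:.4f} s".format(mean(fit_times)))
print("mean predict time: {:.4f} s".format(mean(predict_times)))
```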



In [None]:
from modules import for_result as fr
# Train 100 models
training_seeds_result = []

for i in range(100):
    training_seeds_result.append(fr.train_models(i, features_final, features_enhanced, label))
    print("----- Training: {} done -----".format(i))

In [None]:
# Save the results (comment this cell out after the first run to avoid overwriting them)
import pickle

with open('training_seeds_result', 'wb') as fp:
    pickle.dump(training_seeds_result, fp)

# Load the results of the 100 training runs back from disk
#with open ('training_seeds_result', 'rb') as fp:
    #training_seeds_result = pickle.load(fp)

In [None]:
fr.print_metric_comparison(training_seeds_result)

### Feature Importance

Using the feature importance statistic of the Random Forest model, the most important features can be plotted.

In [None]:
X_train_enh, X_test_enh, y_train_enh, y_test_enh = train_test_split(features_enhanced, 
                                                    label, 
                                                    test_size = 0.2, 
                                                    random_state = 0)

clf_enh = RandomForestClassifier(max_features='sqrt',min_samples_split=50,n_estimators=100)
clf_enh.fit(X_train_enh, y_train_enh)

In [None]:
from modules import draw
draw.draw_func(clf_enh, features_enhanced, X_train_enh)


As seen in the two plots above, the amount feature is by far the most important. Interestingly, the other four of the top five features are all derived from the graph.
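
The ranking behind such plots comes from the fitted model's `feature_importances_` attribute. A minimal sketch of extracting the top five features, on synthetic data with hypothetical column names rather than `features_enhanced`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names.
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
columns = ["feature_{}".format(i) for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ holds one impurity-based score per input column;
# the scores sum to 1 across all columns.
importances = pd.Series(clf.feature_importances_, index=columns)
top_five = importances.sort_values(ascending=False).head(5)
print(top_five)
```

Impurity-based importances can favor continuous features with many distinct values, which is worth keeping in mind when comparing a numerical feature against one-hot encoded columns.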

### Statistical Significance
To evaluate the statistical significance of the results, the difference in metrics between the two models is calculated and used to create 99% confidence intervals for the improvement of the network-enhanced model. The improvement is considered significant if the interval does not contain 0.
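
A confidence interval of this kind can be sketched with the standard library alone. The example assumes a normal approximation for the mean of the paired differences, which is a simplification and not necessarily what `test.run_test` does, and the scores are hypothetical:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_difference_ci(enhanced, standard, confidence=0.99):
    """Normal-approximation CI for the mean paired difference."""
    diffs = [e - s for e, s in zip(enhanced, standard)]
    center = mean(diffs)
    std_error = stdev(diffs) / sqrt(len(diffs))
    # Two-sided critical value, about 2.576 for a 99% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * std_error, center + z * std_error


# Hypothetical F1 scores of the two models on the same six splits.
standard_f1 = [0.78, 0.80, 0.79, 0.81, 0.77, 0.80]
enhanced_f1 = [0.82, 0.83, 0.82, 0.85, 0.80, 0.84]

low, high = mean_difference_ci(enhanced_f1, standard_f1)
significant = low > 0 or high < 0  # 0 lies outside the interval
print("99% CI: [{:.4f}, {:.4f}], significant: {}".format(low, high, significant))
```

With 100 splits the normal approximation is more reasonable than with the six toy values used here; a t-distribution would give slightly wider intervals for small samples.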
 

In [None]:
training_results_lists = fr.statistic_significance(training_seeds_result)

In [None]:
for val in training_results_lists:
    temp_array = np.array(training_results_lists[val])
    training_results_lists[val] = temp_array

    print("{}: avg: {} std: {}".format(val, temp_array.mean(), temp_array.std()))

TEST

In [None]:
from modules import test
test.run_test(training_results_lists)