# FRAUD DETECTION - MODELLING

This notebook covers the modelling stage. It is the second of two notebooks and builds on the analysis from the first.

### Importing Libraries

In [0]:
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt

from pyspark.sql import functions as f
from pyspark.sql.types import FloatType

from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler, StandardScaler, IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml import Pipeline
# Import only col; importing sum here would shadow Python's built-in sum
from pyspark.sql.functions import col
from pyspark.ml.param import Param, Params

### Importing Data

In [0]:
dataPath = "/FileStore/tables/creditcard.csv"
df = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv(dataPath)

#### Dropping Time Features

In [0]:
# Dropping the Time column
df = df.drop('Time')

#### Splitting Data
We now split the dataset 75-25 into training and test sets.<br>
The target is the Class column, which takes two values: 1 for fraudulent transactions and 0 for non-fraudulent transactions.

In [0]:
df_train, df_test = df.randomSplit([0.75, 0.25], seed=123)

# Caching the data
df_train.cache()
df_test.cache()

print("Number of training rows: " + str(df_train.count()))
print("Number of test rows: " + str(df_test.count()))

Number of training rows: 214074
Number of test rows: 70733
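
As a quick sanity check on the split, the two row counts printed above can be turned into realized fractions. This is a small sketch in plain Python that uses only those two counts; because `randomSplit` assigns rows at random, the realized shares are only approximately 75-25.

```python
# Row counts copied from the output above
n_train = 214074
n_test = 70733
n_total = n_train + n_test

# Realized share of each split
train_share = n_train / n_total
test_share = n_test / n_total

print(f"Total rows:  {n_total}")
print(f"Train share: {train_share:.4f}")
print(f"Test share:  {test_share:.4f}")
```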


### Data Balancing
We use oversampling to reduce the class imbalance.<br>
The alternative would have been undersampling, but because there are so few fraudulent transactions, undersampling would leave a very small dataset for modelling, which would likely lead to weak models.

In [0]:
major_df_train = df_train.filter(f.col("Class") == 0)
minor_df_train = df_train.filter(f.col("Class") == 1)
ratio = int(major_df_train.count()/minor_df_train.count())

In [0]:
# Number of non-fraudulent transactions per fraudulent transaction in the training set
print("In the training set, for every fraudulent transaction there are {} non-fraudulent transactions.".format(ratio))

In the training set, for every fraudulent transaction there are 576 non-fraudulent transactions.


In [0]:
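# Replication factor: each fraudulent row is repeated 70 times
a = range(70)
# Duplicate the minority rows by exploding a 70-element array, then drop the helper column
oversampled_df = minor_df_train.withColumn("dummy", f.explode(f.array([f.lit(x) for x in a]))).drop('dummy')

# Combine the oversampled minority rows with the original majority rows
# (union replaces the deprecated unionAll and behaves the same way)
balanced_df_train = major_df_train.union(oversampled_df)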

In [0]:
balanced_df_train.groupBy("Class").count().withColumn("%", col('count')/balanced_df_train.count()*100).show()

+-----+------+------------------+
|Class| count|                 %|
+-----+------+------------------+
|    0|213703|  89.1644031659803|
|    1| 25970|10.835596834019686|
+-----+------+------------------+
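
The duplication step can be mirrored in plain Python to see how the replication factor drives the class share. This is a minimal sketch, not the Spark code itself: the helper `oversample` and the toy rows are hypothetical, while the counts at the end are taken from the table above.

```python
def oversample(minority_rows, factor):
    """Repeat every minority row `factor` times.

    Plain-Python analogue of exploding an array of length `factor`.
    """
    return [row for row in minority_rows for _ in range(factor)]

# Toy rows with hypothetical values
toy_minority = [{"Amount": 1.0, "Class": 1}, {"Amount": 2.5, "Class": 1}]
toy_oversampled = oversample(toy_minority, 3)
print(len(toy_oversampled))  # 2 rows, 3 copies each

# Counts from the table above
n_major = 213703
n_minor_oversampled = 25970
factor = 70
n_minor_original = n_minor_oversampled // factor

minority_share = n_minor_oversampled / (n_major + n_minor_oversampled)
print(f"Original fraudulent rows: {n_minor_original}")
print(f"Minority share after oversampling: {minority_share:.2%}")

# Replication factor that would bring the classes to roughly 1:1
balancing_factor = round(n_major / n_minor_original)
print(f"Factor for an approximately even split: {balancing_factor}")
```

With a factor of 70 the fraudulent class makes up about 11% of the training data; a factor close to the ratio printed earlier (576) would be needed for an even split.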



### Vectorizing
Now that the data is ready, we need to put all of the features into a single vector-typed column for Spark MLlib. This makes it easy to embed a prediction directly in a DataFrame and makes it clear what is passed into the model and what is not. It also makes it easy to add new features incrementally, simply by adding them to the vector.

In [0]:
nonFeatureCols = ["Class"]
featureCols = [item for item in df.columns if item not in nonFeatureCols]
print(featureCols)

['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']


In [0]:
assembler = (VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features"))

df_train = assembler.transform(balanced_df_train)
df_test = assembler.transform(df_test)

## MODELING

In [0]:
#Creating a Results dataframe to store results
Results = pd.DataFrame(columns =["Model","Recall", "Precision", "F1", "PR_AUC", "ROC_AUC"])
Results

Model,Recall,Precision,F1,PR_AUC,ROC_AUC


### Logistic Regression Model

Using logistic regression to predict fraud.

In [0]:
logis = LogisticRegression(labelCol='Class',featuresCol='features',family='binomial')
logis_paramGrid = (ParamGridBuilder()
  .addGrid(logis.maxIter, [15])
  .addGrid(logis.threshold, [0.95, 0.8, 0.5])
  .build())

logis_pipeline = Pipeline().setStages([logis])

logis_cv = (CrossValidator(numFolds = 5, seed = 1) 
  .setEstimator(logis_pipeline) 
  .setEstimatorParamMaps(logis_paramGrid)
  .setEvaluator(BinaryClassificationEvaluator(metricName = "areaUnderPR").setLabelCol("Class")))
logis_cv_fit = logis_cv.fit(df_train)
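
The grid above varies the decision threshold, that is, the probability above which a transaction is labelled as fraud. The sketch below shows, in plain Python with made-up scores, how raising the threshold trades recall for precision. The function `precision_recall_at` and the sample data are hypothetical and only illustrate the idea.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall of the positive class at a given threshold."""
    # Predict fraud (1) when the score is strictly above the threshold
    predicted = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical predicted fraud probabilities and true labels
scores = [0.99, 0.97, 0.85, 0.70, 0.60, 0.40, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.95, 0.8, 0.5):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

One thing to keep in mind: `BinaryClassificationEvaluator` scores the `rawPrediction` column rather than the thresholded `prediction` column, so an area-under-curve metric summarizes all thresholds at once and is not expected to change with the `threshold` values in the grid.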

In [0]:
# Get performance metrics of the logistic regression model
logis_prediction = logis_cv_fit.transform(df_test)
logis_conf_matrix = logis_prediction.groupBy(['Class', 'prediction']).count().toPandas().pivot(index = 'Class', columns='prediction', values='count')

print("Confusion Matrix: \n",logis_conf_matrix)

logis_precision = logis_conf_matrix[1][1]/(logis_conf_matrix[1].sum())

logis_recall = logis_conf_matrix[1][1]/(logis_conf_matrix.iloc[1].sum())

logis_f1 = 2*logis_recall*logis_precision/(logis_recall + logis_precision)

logis_ROC_AUC = BinaryClassificationEvaluator(labelCol="Class", metricName = "areaUnderROC").evaluate(logis_prediction)

logis_PR_AUC = BinaryClassificationEvaluator(labelCol="Class", metricName = "areaUnderPR").evaluate(logis_prediction)

Results.loc[len(Results.index)] = ["Logistic Regression",logis_recall, logis_precision, logis_f1, logis_PR_AUC, logis_ROC_AUC]

print(Results)

Confusion Matrix: 
 prediction    0.0  1.0
Class                 
0           70560   52
1              19  102
                 Model    Recall  Precision        F1    PR_AUC   ROC_AUC
0  Logistic Regression  0.842975   0.662338  0.741818  0.766738  0.976402
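
The metrics in the table follow directly from the four cells of the confusion matrix. As a worked check in plain Python, using only the counts printed above (rows are the actual class, columns the predicted class):

```python
# Counts read from the logistic regression confusion matrix above
tn, fp = 70560, 52   # actual class 0: predicted 0, predicted 1
fn, tp = 19, 102     # actual class 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # share of predicted fraud that is truly fraud
recall = tp / (tp + fn)      # share of actual fraud that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.6f}")
print(f"Recall:    {recall:.6f}")
print(f"F1:        {f1:.6f}")
```

The same arithmetic applies to the confusion matrices of the other models.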


In [0]:
logis_prediction.groupBy(['Class', 'prediction']).count().toPandas().pivot(index = 'Class', columns='prediction', values='count')

prediction    0.0  1.0
Class                 
0           70560   52
1              19  102


### Random Forest Model

Using random forest to predict fraud.

In [0]:
CV_RF = RandomForestClassifier(labelCol = 'Class', featuresCol = 'features', seed = 4)
rf_paramGrid = (ParamGridBuilder()
  .addGrid(CV_RF.maxDepth, [8])
  .addGrid(CV_RF.numTrees, [30, 60])
  .build())

rf_pipeline = Pipeline().setStages([CV_RF])

rf_cv = (CrossValidator(numFolds = 5, seed = 1) 
  .setEstimator(rf_pipeline) 
  .setEstimatorParamMaps(rf_paramGrid)
  .setEvaluator(BinaryClassificationEvaluator().setLabelCol("Class")))

rf_cv_fit = rf_cv.fit(df_train)

In [0]:
# Get performance metrics of the random forest model
rf_prediction = rf_cv_fit.transform(df_test)
rf_conf_matrix = rf_prediction.groupBy(['Class', 'prediction']).count().toPandas().pivot(index = 'Class', columns='prediction', values='count')

print("Confusion Matrix: \n",rf_conf_matrix)

rf_precision = rf_conf_matrix[1][1]/(rf_conf_matrix[1].sum())

rf_recall = rf_conf_matrix[1][1]/(rf_conf_matrix.iloc[1].sum())

rf_f1 = 2*rf_recall*rf_precision/(rf_recall + rf_precision)

rf_ROC_AUC = BinaryClassificationEvaluator(labelCol="Class", metricName = "areaUnderROC").evaluate(rf_prediction)

rf_PR_AUC = BinaryClassificationEvaluator(labelCol="Class", metricName = "areaUnderPR").evaluate(rf_prediction)

Results.loc[len(Results.index)] = ["Random Forest",rf_recall, rf_precision, rf_f1, rf_PR_AUC, rf_ROC_AUC]

print(Results)

Confusion Matrix: 
 prediction    0.0  1.0
Class                 
0           70590   22
1              21  100
                 Model    Recall  Precision        F1    PR_AUC   ROC_AUC
0  Logistic Regression  0.842975   0.662338  0.741818  0.766738  0.976402
1        Random Forest  0.826446   0.819672  0.823045  0.839597  0.977960


In [0]:
rf_prediction.groupBy(['Class', 'prediction']).count().toPandas().pivot(index = 'Class', columns='prediction', values='count')

prediction    0.0  1.0
Class                 
0           70590   22
1              21  100


### Gradient Boosted Tree

Using GBT to predict fraud.

In [0]:
gbt = GBTClassifier(labelCol = 'Class', featuresCol = 'features', maxDepth = 8)

gbt_fit = gbt.fit(df_train)

In [0]:
# Get performance metrics of the GBT model
gbt_prediction = gbt_fit.transform(df_test)
gbt_conf_matrix = gbt_prediction.groupBy(['Class', 'prediction']).count().toPandas().pivot(index = 'Class', columns='prediction', values='count')

print("Confusion Matrix: \n",gbt_conf_matrix)

gbt_precision = gbt_conf_matrix[1][1]/(gbt_conf_matrix[1].sum())

gbt_recall = gbt_conf_matrix[1][1]/(gbt_conf_matrix.iloc[1].sum())

gbt_f1 = 2*gbt_recall*gbt_precision/(gbt_recall + gbt_precision)

gbt_ROC_AUC = BinaryClassificationEvaluator(labelCol="Class", metricName = "areaUnderROC").evaluate(gbt_prediction)

gbt_PR_AUC = BinaryClassificationEvaluator(labelCol="Class", metricName = "areaUnderPR").evaluate(gbt_prediction)

Results.loc[len(Results.index)] = ["Gradient Boosted Tree",gbt_recall, gbt_precision, gbt_f1, gbt_PR_AUC, gbt_ROC_AUC]

print(Results)

Confusion Matrix: 
 prediction    0.0  1.0
Class                 
0           70565   47
1              29   92
                   Model    Recall  Precision        F1    PR_AUC   ROC_AUC
0    Logistic Regression  0.842975   0.662338  0.741818  0.766738  0.976402
1          Random Forest  0.826446   0.819672  0.823045  0.839597  0.977960
2  Gradient Boosted Tree  0.760331   0.661871  0.707692  0.770578  0.942938


In [0]:
gbt_prediction.groupBy(['Class', 'prediction']).count().toPandas().pivot(index = 'Class', columns='prediction', values='count')

prediction    0.0  1.0
Class                 
0           70565   47
1              29   92


# RESULTS AND CONCLUSIONS

The table below summarizes the performance of the three models on the test set:

In [0]:
display(Results)

Model,Recall,Precision,F1,PR_AUC,ROC_AUC
Logistic Regression,0.8429752066115702,0.6623376623376623,0.7418181818181817,0.7667384417121104,0.976402004575816
Random Forest,0.8264462809917356,0.819672131147541,0.823045267489712,0.8395974902556095,0.9779601060480436
Gradient Boosted Tree,0.7603305785123967,0.6618705035971223,0.7076923076923077,0.7705783729055073,0.9429379058086256


The random forest model performs best on Precision, F1 score, PR_AUC and ROC_AUC. Its recall (0.826) is slightly below that of logistic regression (0.843) but comparable, so we consider it the best model overall. Recall matters in fraud detection because it measures the share of fraudulent transactions that the model actually catches, which makes it an important indicator of our models' predictive strength.
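
The per-metric comparison can be reproduced with a few lines of plain Python. This is a sketch that hard-codes the values from the results table (rounded to six decimals); the dictionary layout and the name `best_per_metric` are illustrative choices, not part of the notebook's pipeline.

```python
# Metric values copied from the results table, rounded to six decimals
results = {
    "Logistic Regression": {
        "Recall": 0.842975, "Precision": 0.662338, "F1": 0.741818,
        "PR_AUC": 0.766738, "ROC_AUC": 0.976402,
    },
    "Random Forest": {
        "Recall": 0.826446, "Precision": 0.819672, "F1": 0.823045,
        "PR_AUC": 0.839597, "ROC_AUC": 0.977960,
    },
    "Gradient Boosted Tree": {
        "Recall": 0.760331, "Precision": 0.661871, "F1": 0.707692,
        "PR_AUC": 0.770578, "ROC_AUC": 0.942938,
    },
}

metrics = ["Recall", "Precision", "F1", "PR_AUC", "ROC_AUC"]

# For each metric, pick the model with the highest value
best_per_metric = {
    metric: max(results, key=lambda model: results[model][metric])
    for metric in metrics
}

for metric, model in best_per_metric.items():
    print(f"{metric}: {model} ({results[model][metric]:.6f})")
```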