# Logistic Regression Model for Fraud Detection using PySpark

## Summary
In this notebook, I would like to share my humble PySpark "skills" with anyone interested. I wanted to use Logistic Regression as the first model and then investigate other models as well. However, on the first try the model reached 100% accuracy, so I stopped working on the data. Let's get started!

In [1]:
!pip install pyspark
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder, StringIndexer, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=78f7d38a9b478f5069248c90a24e4199d487a244bd4dd55eba633eb88f6f733e
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd80

## Data Investigation

Let's see what kind of features we have in the dataset.

In [2]:
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)  # Render DataFrames as formatted tables in the notebook

df = spark.read.csv("../input/online-payments-fraud-detection-dataset/PS_20174392719_1491204439457_log.csv", inferSchema=True, header=True)

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/04/18 21:30:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

In [3]:
# Length of the whole dataset
df.count()

                                                                                

6362620

We see that the dataset consists mostly of numerical values. However, we should investigate the string features to see whether any of them can be helpful for our model.

In [4]:
df.printSchema()

root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)



It seems that type may be useful, since there may be a relation between the payment type and the probability that a transaction is fraudulent. We also have the nameOrig and nameDest columns, which are anonymized. They may be useful too, since fraud may be concentrated on particular customers or on destinations with higher fraud rates than others. I do not want to start with a complicated model, so I will look at the number of distinct values in these columns. If it is small enough, I will add them to my features.

In [5]:
df.show(5)

+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|    type|  amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|
+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|   1| PAYMENT| 9839.64|C1231006815|     170136.0|     160296.36|M1979787155|           0.0|           0.0|      0|             0|
|   1| PAYMENT| 1864.28|C1666544295|      21249.0|      19384.72|M2044282225|           0.0|           0.0|      0|             0|
|   1|TRANSFER|   181.0|C1305486145|        181.0|           0.0| C553264065|           0.0|           0.0|      1|             0|
|   1|CASH_OUT|   181.0| C840083671|        181.0|           0.0|  C38997010|       21182.0|           0.0|      1|             0|
|   1| PAYMENT|11668.14|C2048537720|      41554.0|      29885.86|M1230701703|      

type has only 5 distinct values, so one-hot encoding it will not create too many columns. Therefore, I will include the type column.

In [6]:
df.select("type").distinct()

                                                                                

type
TRANSFER
CASH_IN
CASH_OUT
PAYMENT
DEBIT


nameDest and nameOrig have a very large number of unique values, so, for now, I will discard them from the dataset.

In [7]:
print(df.select("nameDest").distinct().count())
print(df.select("nameOrig").distinct().count())

                                                                                

2722362
6353307
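
To put these counts in perspective, the short sketch below relates them to the 6,362,620 rows counted earlier. It is plain Python arithmetic on the numbers printed above, not part of the original notebook, and the variable names are my own.

```python
# Row count and distinct counts copied from the notebook outputs above
total_rows = 6_362_620
distinct_name_dest = 2_722_362
distinct_name_orig = 6_353_307

# Share of rows that would need their own one-hot column if the ID were encoded
dest_ratio = distinct_name_dest / total_rows
orig_ratio = distinct_name_orig / total_rows

print(f"nameDest: {dest_ratio:.1%} of rows are distinct values")
print(f"nameOrig: {orig_ratio:.1%} of rows are distinct values")
```

nameOrig is almost unique per transaction, so one-hot encoding it would add millions of columns that are each set only about once.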


                                                                                

I also want to look at the step column, which is described as a unit of time. The largest value is 743, so I believe these transactions span 743 time steps. To be honest, I was expecting this column to be the hour of the day. I am not sure whether this column will be important, but it does not need much attention, so I include it as well.

In [8]:
df.select("step").distinct().orderBy("step", ascending=False).show(5)



+----+
|step|
+----+
| 743|
| 742|
| 741|
| 740|
| 739|
+----+
only showing top 5 rows
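
If each step really is one hour (an assumption on my part; the output above does not confirm it), step could be split into a day index and an hour of the day. A minimal sketch in plain Python, where split_step is a hypothetical helper and the choice that step 1 maps to hour 0 of day 0 is also an assumption:

```python
from typing import Tuple

def split_step(step: int) -> Tuple[int, int]:
    """Split a 1-based hourly step into (day, hour_of_day).

    Assumes one step equals one hour and that step 1 is hour 0 of day 0.
    """
    day, hour = divmod(step - 1, 24)
    return day, hour

first = split_step(1)    # smallest step in the data
last = split_step(743)   # largest step shown above
print(first, last)
```

In PySpark the same idea could be written with column arithmetic, for example (F.col("step") - 1) % 24 for the hour, and the result added as an extra feature.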



                                                                                

## Simple Feature Engineering

In [9]:
df = df.select("step", "type", "amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "isFraud")
df.show(5)

+----+--------+--------+-------------+--------------+--------------+--------------+-------+
|step|    type|  amount|oldbalanceOrg|newbalanceOrig|oldbalanceDest|newbalanceDest|isFraud|
+----+--------+--------+-------------+--------------+--------------+--------------+-------+
|   1| PAYMENT| 9839.64|     170136.0|     160296.36|           0.0|           0.0|      0|
|   1| PAYMENT| 1864.28|      21249.0|      19384.72|           0.0|           0.0|      0|
|   1|TRANSFER|   181.0|        181.0|           0.0|           0.0|           0.0|      1|
|   1|CASH_OUT|   181.0|        181.0|           0.0|       21182.0|           0.0|      1|
|   1| PAYMENT|11668.14|      41554.0|      29885.86|           0.0|           0.0|      0|
+----+--------+--------+-------------+--------------+--------------+--------------+-------+
only showing top 5 rows



Before creating a pipeline, we should split the data into train and test sets so that we can evaluate model performance.

In [10]:
train, test = df.randomSplit([0.7, 0.3], seed=5624)
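
As a rough picture of what a weighted random split does, here is a plain-Python sketch: every row is assigned independently, with probability proportional to the weights. This is an illustration only and not Spark's implementation; random_split is a hypothetical helper, and the seed is reused from the cell above merely for familiarity.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # First split whose upper bound exceeds the draw; fall back to the last split
        target = next((i for i, b in enumerate(bounds) if draw < b), len(weights) - 1)
        splits[target].append(row)
    return splits

train_rows, test_rows = random_split(range(10_000), [0.7, 0.3], seed=5624)
print(len(train_rows), len(test_rows))
```

Because rows are assigned one by one, the split sizes are only approximately 70/30, and fixing the seed makes the assignment reproducible.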

We only have one string column to process, so it is easy to handle manually. The code below creates a pipeline that converts the string values into numeric indices and then applies a one-hot encoder. After that, the pipeline assembles the encoded column and the numeric columns into a single feature vector.

Note that we should include StandardScaler in this pipeline as well, since we are going to use Logistic Regression as our model. Without standardization, the optimizer may need more iterations and may have a harder time reaching the global minimum. However, I first wanted to try without standardization, and it works as well. It just takes a little more time...
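
To illustrate what standardization changes, here is a plain-Python sketch that scales two columns from the five sample rows shown earlier to unit standard deviation. As far as I know this mirrors StandardScaler's default behaviour (scale by the sample standard deviation, no centering), but treat it as an illustration rather than Spark's implementation; scale_to_unit_std is a hypothetical helper.

```python
from statistics import stdev

def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation of its column."""
    sd = stdev(values)
    return [v / sd for v in values]

# Values copied from the first five rows printed by df.show(5)
amount = [9839.64, 1864.28, 181.0, 181.0, 11668.14]
oldbalance_org = [170136.0, 21249.0, 181.0, 181.0, 41554.0]

scaled_amount = scale_to_unit_std(amount)
scaled_balance = scale_to_unit_std(oldbalance_org)

# Before scaling the two columns differ by roughly an order of magnitude in spread
print(f"raw std:    amount={stdev(amount):.1f} oldbalanceOrg={stdev(oldbalance_org):.1f}")
print(f"scaled std: amount={stdev(scaled_amount):.1f} oldbalanceOrg={stdev(scaled_balance):.1f}")
```

Spark's LogisticRegression also exposes a standardization parameter which, to my knowledge, defaults to True; that may be why training worked here without an explicit scaler stage.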

In [11]:
string_indexer = [StringIndexer(inputCol="type",
                                outputCol="type" + "_StringIndexer",
                                handleInvalid="skip")]
                        
one_hot_encoder = [OneHotEncoder(inputCols=["type_StringIndexer"],
                                 outputCols=["type_OneHotEncoder"])]

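# NOTE: "isFraud" is the label. Listing it here leaks the target into the feature vector,
# which very likely explains the perfect scores reported below; drop it for a fair evaluation.
assemblerInput = ["step", "amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "isFraud", "type_OneHotEncoder"]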


vector_assembler = VectorAssembler(inputCols=assemblerInput,
                                   outputCol="VectorAssembler_features")
                    
stages = []
stages += string_indexer
stages += one_hot_encoder
stages += [vector_assembler]
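
To make the two encoding stages concrete, the sketch below imitates them in plain Python: categories are indexed by descending frequency, then turned into a one-hot vector that drops the last category. I believe this matches the defaults of StringIndexer and OneHotEncoder, but it is an illustration, not Spark's code; both helper functions are hypothetical and the category frequencies are made up.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: i for i, category in enumerate(ordered)}

def one_hot_drop_last(index, size):
    """One-hot vector of length size - 1; the last category maps to all zeros."""
    vec = [0.0] * (size - 1)
    if index < size - 1:
        vec[index] = 1.0
    return vec

# Toy sample of the five payment types (frequencies invented for illustration)
types = ["CASH_OUT"] * 5 + ["PAYMENT"] * 4 + ["CASH_IN"] * 3 + ["TRANSFER"] * 2 + ["DEBIT"]
mapping = fit_string_index(types)
encoded = {t: one_hot_drop_last(i, len(mapping)) for t, i in mapping.items()}
for t, vec in encoded.items():
    print(t, vec)
```

Five categories therefore occupy four slots. Together with the seven scalar columns in assemblerInput, that gives a feature vector with 11 entries.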

Now let's process the data with our pipeline.

In [12]:
pipeline = Pipeline().setStages(stages)
pipe_model = pipeline.fit(train)

train_data_pipe = pipe_model.transform(train)
test_data_pipe = pipe_model.transform(test)

train_data = train_data_pipe.select(F.col("VectorAssembler_features").alias("features"),
                                    F.col("isFraud").alias("label"))

test_data = test_data_pipe.select(F.col("VectorAssembler_features").alias("features"),
                                  F.col("isFraud").alias("label"))

train_data.show(10)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|(11,[0,1,2,3,4,9]...|    0|
|[1.0,484.57,54224...|    0|
|[1.0,783.31,81503...|    0|
|[1.0,863.08,92907...|    0|
|[1.0,1076.27,3538...|    0|
|[1.0,1271.77,6973...|    0|
|(11,[0,1,2,3,4,9]...|    0|
|[1.0,2643.45,6434...|    0|
|[1.0,2673.64,7688...|    0|
|[1.0,5763.99,1276...|    0|
+--------------------+-----+
only showing top 10 rows



                                                                                

## Logistic Regression Model

It's time to train our model and check its performance.

In [13]:
lr_spark = LogisticRegression().fit(train_data)

print(f"Training Area Under ROC: {lr_spark.summary.areaUnderROC}")
print(f"Training Accuracy: {lr_spark.summary.accuracy}")
pred = lr_spark.evaluate(test_data)

22/04/18 21:32:17 WARN MemoryStore: Not enough space to cache rdd_94_0 in memory! (computed 65.0 MiB so far)
22/04/18 21:32:17 WARN BlockManager: Persisting block rdd_94_0 to disk instead.
22/04/18 21:32:24 WARN BlockManager: Asked to remove block broadcast_66, which does not exist
22/04/18 21:33:25 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_138_1 in memory.
22/04/18 21:33:25 WARN MemoryStore: Not enough space to cache rdd_138_1 in memory! (computed 384.0 B so far)
22/04/18 21:33:25 WARN BlockManager: Block rdd_138_1 could not be removed as it was not found on disk or in memory
22/04/18 21:33:25 WARN BlockManager: Putting block rdd_138_1 failed
22/04/18 21:33:25 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KiB for computing block rdd_138_3 in memory.
22/04/18 21:33:25 WARN MemoryStore: Not enough space to cache rdd_138_3 in memory! (computed 384.0 B so far)
22/04/18 21:33:25 WARN BlockManager: Block rdd_1

Training Area Under ROC: 0.9996301888217659
Training Accuracy: 1.0

In [14]:
print(f"Test Area Under ROC: {pred.areaUnderROC}")
print(f"Test Accuracy: {pred.accuracy}")

                                                                                

Test Area Under ROC: 0.9996125781729113
Test Accuracy: 1.0

In [15]:
# Pair each prediction with its true label (cast to float so both values have a floating-point type)
metrics = MulticlassMetrics(pred.predictions.select("prediction", "label").withColumn("label", F.col("label").cast(FloatType())).rdd.map(tuple))

                                                                                

In [16]:
cm = metrics.confusionMatrix().toArray()

                                                                                

In [17]:
cm

array([[1906191.,       0.],
       [      0.,    2453.]])
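
Accuracy alone can hide a lot on data this imbalanced, so here is a small plain-Python sketch that derives precision, recall, and a trivial baseline from the matrix above. It assumes rows are true classes and columns are predicted classes, with class 0 as legitimate and class 1 as fraud; the variable names are my own.

```python
# Confusion matrix copied from the output above
# (assumed layout: rows = true class, columns = predicted class)
cm = [[1906191.0, 0.0],
      [0.0, 2453.0]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# Accuracy of a trivial model that labels every transaction as legitimate
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"always-legitimate baseline accuracy={baseline_accuracy:.4f}")
```

Only about 0.13% of the test rows are fraud, so even a model that never flags anything would score close to 99.9% accuracy; precision and recall on the fraud class are the more informative numbers.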

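100% test accuracy, not much to say... Cross-validation could be implemented to get a better estimate of real performance, but with over 6 million rows, accuracy this high is unlikely to be a coincidence. One caveat: isFraud was part of the assembled feature vector, so the model could read the label directly from its inputs, and the score should be re-checked with that column removed. Anyway, I would like to see more analysis of the data as well (visualization using PySpark, especially, would be really helpful). Thanks for reading!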

## Fatih Özgür Ardıç