# Fraud Detection Model

K-means clustering is used to flag transactions that are likely to be fraudulent. The working assumption is that fraudulent transactions share similar features and behaviour, so they should fall into the same cluster. The cluster with the highest mean fraud probability is then treated as the fraud cluster, and every transaction in it is treated as fraudulent.

## Data Processing for K-Means Clustering

In [1]:
# Importing pyspark libraries
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import countDistinct
from pyspark.sql.types import IntegerType

# Importing model libraries
from pyspark.ml.feature import OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [2]:
# Create a spark session
spark = (
    SparkSession.builder.appName("BNPL Project")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .getOrCreate()
)

Read in the curated transaction data.

In [3]:
# Read the processed transaction data
sdf = spark.read.parquet("../data/curated/process_data.parquet")

Only transactions with non-null merchant and user fraud probabilities are clustered. Transactions where either probability is missing are assumed to involve users or merchants that are not suspicious, so their probability of being fraudulent is taken to be 0 and they are set aside until after clustering.

In [4]:
# Set aside transactions where either fraud probability is null (assumed not fraudulent)
sdf_fraudless = sdf.filter(F.col("merchant_fraud_probability").isNull() | F.col("user_fraud_probability").isNull())
# Keep transactions where both fraud probabilities are present; these are clustered
sdf_fraud = sdf.filter(F.col("merchant_fraud_probability").isNotNull() & F.col("user_fraud_probability").isNotNull())
sdf = sdf_fraud

Select attributes.

In [5]:
# Cast postcode to integer so it can be used as a numeric feature
sdf = sdf.withColumn('postcode', sdf["postcode"].cast(IntegerType()))
# Attributes used for clustering
features = ['merchant_abn', 'consumer_id', 'dollar_value', 'postcode', 'gender', 'revenue', 'rate', 'category']

Convert the categorical features to integer indices and then one-hot encode them.

In [6]:
# String indexing: map each categorical value to an integer index
stringToNum = StringIndexer(inputCol='gender', outputCol='genderNum')
output_data = stringToNum.fit(sdf).transform(sdf)

stringToNum = StringIndexer(inputCol='revenue', outputCol='revenueNum')
output_data = stringToNum.fit(output_data).transform(output_data)

stringToNum = StringIndexer(inputCol='category', outputCol='categoryNum')
output_data = stringToNum.fit(output_data).transform(output_data)

In [7]:
# One-hot encoding of the indexed columns
encoder = OneHotEncoder(inputCol='genderNum', outputCol='genderVec')
onehotdata = encoder.fit(output_data).transform(output_data)

encoder = OneHotEncoder(inputCol='revenueNum', outputCol='revenueVec')
onehotdata = encoder.fit(onehotdata).transform(onehotdata)

encoder = OneHotEncoder(inputCol='categoryNum', outputCol='categoryVec')
onehotdata = encoder.fit(onehotdata).transform(onehotdata)
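
To make the two encoding steps concrete, here is a minimal plain-Python sketch of what `StringIndexer` followed by `OneHotEncoder` produces under their default settings: labels are indexed by descending frequency and the last category is dropped from the vector. The helper names `fit_string_index` and `one_hot` and the toy `genders` values are hypothetical, and the alphabetical tie-break is an assumption based on recent Spark versions. The actual encoding in this notebook is done by the Spark transformers.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first.

    Assumption: ties are broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

# Toy values, not taken from the dataset
genders = ["Female", "Male", "Female", "Undisclosed", "Male", "Female"]
mapping = fit_string_index(genders)
encoded = [one_hot(mapping[g], len(mapping)) for g in genders]
```

With three distinct labels each row becomes a vector of length two, and the least frequent label is represented by the all-zero vector.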

Convert features to a single vector and standardize.

In [8]:
# Converting to vector
assembler1 = VectorAssembler(
    inputCols=['merchant_abn', 'consumer_id', 'dollar_value', 'postcode', 'genderVec', 'revenueVec', 'rate', 'categoryVec'],
    outputCol='features',
)
result = assembler1.transform(onehotdata)

In [9]:
# standardizing the feature vector
scale = StandardScaler(inputCol='features',outputCol='standardized')
data_scale = scale.fit(result)
data_scale_output = data_scale.transform(result)
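
As a rough illustration of this step: with its default settings, `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean (`withStd=True`, `withMean=False`). The sketch below mimics that on a single toy column in plain Python. The function `scale_to_unit_std` and the values are hypothetical, and mapping a constant column to zeros is an assumption.

```python
from statistics import stdev

def scale_to_unit_std(column):
    """Divide each value by the sample standard deviation (n - 1 denominator); no centring."""
    sd = stdev(column)
    if sd == 0:
        # Assumption: a constant column is mapped to zeros
        return [0.0 for _ in column]
    return [value / sd for value in column]

dollar_values = [10.0, 20.0, 30.0, 40.0]  # toy values, not from the dataset
scaled = scale_to_unit_std(dollar_values)
```

Scaling matters here because K-means relies on Euclidean distance, so a feature with a large numeric range would otherwise dominate the clustering. Skipping the centring step does not change the distances between points.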

## K-means Clustering Model

In [10]:
# Build a K-means model with k = 2 and evaluate it with the silhouette score
evaluator = ClusteringEvaluator(
    predictionCol='prediction',
    featuresCol='standardized',
    metricName='silhouette',
    distanceMeasure='squaredEuclidean',
)
# No seed is set, so cluster labels may differ between runs
KMeans_algo = KMeans(featuresCol='standardized', k=2)
KMeans_fit = KMeans_algo.fit(data_scale_output)
output = KMeans_fit.transform(data_scale_output)
score = evaluator.evaluate(output)
score

0.35692111925566483
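
For reference, the silhouette score averages (b - a) / max(a, b) over all points, where a is the mean distance from a point to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. Values near 1 indicate well-separated clusters and values near 0 indicate overlapping ones. The sketch below is a direct O(n²) computation in plain Python with squared Euclidean distance on toy 2-D points. It is for illustration only and is not how Spark evaluates the metric at scale; the function names and points are hypothetical, and it assumes at least two clusters.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def silhouette(points, labels):
    """Mean silhouette coefficient using squared Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(squared_distance(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(squared_distance(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append(0.0 if denominator == 0 else (b - a) / denominator)
    return sum(scores) / len(scores)

# Two tight, well-separated toy clusters
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_score = silhouette(points, [0, 0, 1, 1])
bad_score = silhouette(points, [0, 1, 0, 1])
```

On this toy data the sensible labelling scores close to 1, while the deliberately mixed labelling scores much lower.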

In [11]:
# calculate mean merchant fraud probability for each cluster
output.groupBy("prediction").mean("merchant_fraud_probability")

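+----------+-------------------------------+
|prediction|avg(merchant_fraud_probability)|
+----------+-------------------------------+
|         1|             0.3058770828390268|
|         0|             0.2936770731958961|
+----------+-------------------------------+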


In [12]:
# calculate mean user fraud probability for each cluster
output.groupBy("prediction").mean("user_fraud_probability")

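+----------+---------------------------+
|prediction|avg(user_fraud_probability)|
+----------+---------------------------+
|         1|         0.1542216049664078|
|         0|         0.1535548175058309|
+----------+---------------------------+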


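The cluster with the higher mean merchant and user fraud probabilities is treated as the fraud cluster, and its transactions are excluded from the ranking system. In the output above that is cluster 1 (0.306 vs 0.294 for merchants, 0.1542 vs 0.1536 for users), whereas the cells below remove cluster 0. K-means cluster labels are arbitrary and may swap between runs when no seed is fixed, so the cluster index used in the filters should be checked against these means.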
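
Because the cluster index is easy to misread, one option is to pick the fraud cluster from the computed means instead of hard-coding it. The helper below is a hypothetical sketch in plain Python, not part of the original notebook. It takes the per-cluster means, copied here from the two outputs above, and returns the cluster that is highest on both measures, raising an error if the two measures disagree.

```python
def pick_fraud_cluster(merchant_means, user_means):
    """Return the cluster with the highest mean on both fraud measures."""
    top_merchant = max(merchant_means, key=merchant_means.get)
    top_user = max(user_means, key=user_means.get)
    if top_merchant != top_user:
        raise ValueError("merchant and user fraud means point to different clusters")
    return top_merchant

# Means copied from the groupBy outputs above (cluster -> mean)
merchant_means = {1: 0.3058770828390268, 0: 0.2936770731958961}
user_means = {1: 0.1542216049664078, 0: 0.1535548175058309}

fraud_cluster = pick_fraud_cluster(merchant_means, user_means)
```

In the notebook, the two dictionaries could be built from the collected rows of the `groupBy` results, and `fraud_cluster` could then replace the literal cluster index in the later filters.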

## Removal of Fraud Transactions

In [13]:
# count the transactions in cluster 0, which is the cluster removed as fraud below
output.filter(F.col("prediction") == 0).count()

149013

In total, 149013 transactions (the size of cluster 0 counted above) will be removed.

In [14]:
# ensure that each transaction has a unique order_id which will be used for joining
print(sdf.select(countDistinct("order_id")))
print(sdf.count())

+------------------------+
|count(DISTINCT order_id)|
+------------------------+
|                  472823|
+------------------------+

472823

In [15]:
# keep only the transactions assigned to cluster 1 (left semi join on order_id)
sdf = sdf.join(output.filter(F.col("prediction") == 1), on=["order_id"], how="leftSemi")
sdf.count()

323810
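
A left semi join keeps the rows of the left DataFrame whose key has a match on the right, without adding any columns from the right side. The plain-Python sketch below mirrors that behaviour on toy records; `left_semi_join` and the sample rows are hypothetical and only illustrate the semantics.

```python
def left_semi_join(left_rows, right_rows, key):
    """Keep left rows whose key value appears in right_rows; only left columns are returned."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]

# Toy records, not taken from the dataset
transactions = [
    {"order_id": "a1", "dollar_value": 12.5},
    {"order_id": "b2", "dollar_value": 80.0},
    {"order_id": "c3", "dollar_value": 33.1},
]
kept_cluster = [
    {"order_id": "a1", "prediction": 1},
    {"order_id": "c3", "prediction": 1},
]

remaining = left_semi_join(transactions, kept_cluster, "order_id")
```

The counts reported in the notebook are consistent with this behaviour: 472823 transactions minus the 149013 in cluster 0 leaves the 323810 shown above.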

In [16]:
# append the transactions that had no fraud probabilities and were set aside earlier
sdf = sdf.union(sdf_fraudless)
sdf.count()

13465143

In [17]:
# save data for ranking
sdf.write.mode('overwrite').parquet('../data/curated/fraudless_data.parquet')