# MODEL

This notebook contains the classification model for the fraud detection algorithm. We assume that the labelled fraud data is a small but representative sample that can be used to train a model and predict fraud for the remaining unlabelled transactions. Because the notebook is large, it is divided into checkpoints so that the Spark session does not run out of memory. To reproduce the results, run all cells before the "Model" heading, then restart the kernel and read from the last saved file.

The notebook tries two approaches:
- Multilayer Perceptron Classifier
- Random Forest Classifier

The RF yielded better results overall, with more predictions falling in a slightly higher probability bucket than the true value. Running the MLP model is not necessary to reproduce the final results; run the RF model instead.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

from pyspark.sql.functions import *
from pyspark.sql.types import DateType
from pyspark.sql import SparkSession, DataFrame

from pyspark.ml import Pipeline
from pyspark.sql.types import FloatType
from pyspark.mllib.evaluation import MulticlassMetrics

from pyspark.ml.classification import MultilayerPerceptronClassifier, RandomForestClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, RegressionEvaluator

In [2]:
# PATH Variables
# Change the path to suit your environment

dir = "../data/"

In [3]:
sp = (
    SparkSession.builder.appName("Model")
    .config("spark.sql.session.timeZone", "+11")
    .config("spark.driver.memory", "10g")
    .config("spark.executor.memory", "10g")
    .config('spark.sql.parquet.cacheMetadata', 'True')
    .getOrCreate()
)
sp



In [4]:
transactions = sp.read.option("inferSchema", True).parquet(dir + "processed/transactions")
merchants = sp.read.option("inferSchema", True).parquet(dir + "processed/merchants")
customers = sp.read.option("inferSchema", True).parquet(dir + "processed/customers")

transactions.show(1)
merchants.show(1)
customers.head(1)

+--------+-------+------------+------------+--------------+-----------+-----------------+-------+----------+-----+---------+
|order_id|user_id|merchant_abn|dollar_value|order_datetime|Natural_var|Potential_Outlier|holiday|dayofmonth|month|dayofweek|
+--------+-------+------------+------------+--------------+-----------+-----------------+-------+----------+-----+---------+
|       3|      3| 60956456424|      136.68|    2021-08-20|          0|                0|      0|        20|    8|        6|
+--------+-------+------------+------------+--------------+-----------+-----------------+-------+----------+-----+---------+
only showing top 1 row

+------------+-------------+--------------+--------+----+---------------+---------------+----------------+-----------------+
|merchant_abn|         name|Earnings_Class|BNPL_Fee|tags|avg_monthly_inc|monthly_entropy|postcode_entropy|          revenue|
+------------+-------------+--------------+--------+----+---------------+---------------+------------

[Row(state='ACT', postcode=200, gender='Female', user_id=71674, Number of individuals lodging an income tax return=5524, Average taxable income or loss=66722, Median taxable income or loss=52958, Proportion with salary or wages=1, Count salary or wages=5009, Average salary or wages=64930, Median salary or wages=55579, Proportion with net rent=1, Count net rent=762, Average net rent=-4289, Median net rent=-2448, Average total income or loss=68991, Median total income or loss=54988, Average total deductions=2244, Median total deductions=872, Proportion with total business income=1, Count total business income=382, Average total business income=56170, Median total business income=18742, Proportion with total business expenses=1, Count total business expenses=343, Average total business expenses=42645, Median total business expenses=8664, Proportion with net tax=1, Count net tax=4586, Average net tax=18805, Median net tax=11482, Count super total accounts balance=7620, Average super total 

### PROCESSING CUSTOMER FRAUD DATA

In [5]:
c_fraud = sp.read.option("inferSchema", True).parquet(dir + "curated/customer_fraud")
c_fraud = c_fraud.withColumn("order_datetime", col("order_datetime").cast(DateType()))
c_fraud.show(2)

+-------+--------------+-----------------+
|user_id|order_datetime|fraud_probability|
+-------+--------------+-----------------+
|   6228|    2021-12-19|         97.62981|
|  21419|    2021-12-10|         99.24738|
+-------+--------------+-----------------+
only showing top 2 rows



In [6]:
c_fraud_full = transactions.join(c_fraud, on=["user_id", "order_datetime"])
c_fraud_full.show(2)

+-------+--------------+--------+------------+------------+-----------+-----------------+-------+----------+-----+---------+-----------------+
|user_id|order_datetime|order_id|merchant_abn|dollar_value|Natural_var|Potential_Outlier|holiday|dayofmonth|month|dayofweek|fraud_probability|
+-------+--------------+--------+------------+------------+-----------+-----------------+-------+----------+-----+---------+-----------------+
|    448|    2021-08-20|    1005| 94380689142|     6263.03|          0|                0|      0|        20|    8|        6|        14.681704|
|   3116|    2021-08-20|    6989| 22248828825|     3958.86|          0|                0|      0|        20|    8|        6|         8.809071|
+-------+--------------+--------+------------+------------+-----------+-----------------+-------+----------+-----+---------+-----------------+
only showing top 2 rows



In [7]:
c_fraud_full.count()

                                                                                

80560

In [7]:
X = c_fraud_full.join(merchants, on="merchant_abn").join(customers, on="user_id")
X.head(1)

[Row(user_id=448, merchant_abn=94380689142, order_datetime=datetime.date(2021, 8, 20), order_id=1005, dollar_value=6263.02978515625, Natural_var=0, Potential_Outlier=0, holiday=0, dayofmonth=20, month=8, dayofweek=6, fraud_probability=14.681703567504883, name='Aliquet Ltd', Earnings_Class='b', BNPL_Fee=3.77, tags=12, avg_monthly_inc=0.0, monthly_entropy=2.710181474685669, postcode_entropy=4.060055732727051, revenue=241562.580078125, state='WA', postcode=6170, gender='Female', Number of individuals lodging an income tax return=4994, Average taxable income or loss=56564, Median taxable income or loss=44772, Proportion with salary or wages=1, Count salary or wages=3916, Average salary or wages=57393, Median salary or wages=49510, Proportion with net rent=1, Count net rent=690, Average net rent=863, Median net rent=255, Average total income or loss=59730, Median total income or loss=47123, Average total deductions=2865, Median total deductions=598, Proportion with total business income=1, 

X is now the full dataset, with transactions, merchants, and customers merged. Next we encode the categorical columns and, after the train/test split, standardize the numerical columns.

### Dropping Columns

In [8]:
X = X.drop("user_id", "merchant_abn", "order_datetime", "order_id", "name", "postcode", "holiday")
X.printSchema()

root
 |-- dollar_value: float (nullable = true)
 |-- Natural_var: integer (nullable = true)
 |-- Potential_Outlier: integer (nullable = true)
 |-- dayofmonth: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- dayofweek: integer (nullable = true)
 |-- fraud_probability: float (nullable = true)
 |-- Earnings_Class: string (nullable = true)
 |-- BNPL_Fee: double (nullable = true)
 |-- tags: integer (nullable = true)
 |-- avg_monthly_inc: float (nullable = true)
 |-- monthly_entropy: float (nullable = true)
 |-- postcode_entropy: float (nullable = true)
 |-- revenue: double (nullable = true)
 |-- state: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- Number of individuals lodging an income tax return: long (nullable = true)
 |-- Average taxable income or loss: long (nullable = true)
 |-- Median taxable income or loss: long (nullable = true)
 |-- Proportion with salary or wages: long (nullable = true)
 |-- Count salary or wages: long (nullable = true)


### Categorize

- dayofmonth
- dayofweek
- month
- tags
- state
- gender
- Earnings_Class

In [9]:
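def category_processing(data: DataFrame, outcome: str):
    """
    Index and one-hot encode the categorical columns.

    Returns the fitted pipeline and the transformed DataFrame.
    `outcome` is currently unused.
    """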
    categories = [
        "dayofmonth",
        "dayofweek",
        "month",
        "tags",
        "state",
        "gender",
        "Earnings_Class"
    ]

    # Pipeline
    indexers = [StringIndexer(inputCol=c, outputCol=c+"_index") for c in categories]
    encoders = [OneHotEncoder(inputCol=c+"_index", outputCol=c+"_encoded") for c in categories]
    transformer = Pipeline(stages=indexers + encoders).fit(data)
    transformed = transformer.transform(data)

    for c in categories:
        transformed = transformed.drop(c).drop(c+"_index")
    return transformer, transformed
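
To make the encoding step concrete, here is a minimal pure-Python sketch of what `StringIndexer` followed by `OneHotEncoder` produces for a single column. It assumes Spark's default settings (labels ordered by descending frequency, last category dropped) and uses made-up state values; `string_index` and `one_hot_drop_last` are hypothetical helper names, not part of the notebook.

```python
from collections import Counter

def string_index(values):
    """Most frequent label gets index 0; ties are broken alphabetically (assumed default)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot_drop_last(index, n_labels):
    """One-hot vector of length n_labels - 1; the last index maps to all zeros."""
    vec = [0.0] * (n_labels - 1)
    if index < n_labels - 1:
        vec[index] = 1.0
    return vec

states = ["NSW", "VIC", "NSW", "QLD", "VIC", "NSW"]  # made-up sample values
mapping = string_index(states)
encoded = [one_hot_drop_last(mapping[s], len(mapping)) for s in states]

print(mapping)     # {'NSW': 0, 'VIC': 1, 'QLD': 2}
print(encoded[3])  # QLD is the last index, so it encodes as [0.0, 0.0]
```

Dropping the last category is why `dayofweek_encoded` has length 6 for seven weekdays in the output further down.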

In [10]:
cat_transformer, category_processed = category_processing(X, "outcome")
category_processed.head(1)

                                                                                

[Row(dollar_value=6263.02978515625, Natural_var=0, Potential_Outlier=0, fraud_probability=14.681703567504883, BNPL_Fee=3.77, avg_monthly_inc=0.0, monthly_entropy=2.710181474685669, postcode_entropy=4.060055732727051, revenue=241562.580078125, Number of individuals lodging an income tax return=4994, Average taxable income or loss=56564, Median taxable income or loss=44772, Proportion with salary or wages=1, Count salary or wages=3916, Average salary or wages=57393, Median salary or wages=49510, Proportion with net rent=1, Count net rent=690, Average net rent=863, Median net rent=255, Average total income or loss=59730, Median total income or loss=47123, Average total deductions=2865, Median total deductions=598, Proportion with total business income=1, Count total business income=457, Average total business income=93034, Median total business income=32873, Proportion with total business expenses=1, Count total business expenses=436, Average total business expenses=76035, Median total bu

### CREATE QUANTILES

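The fraud probabilities are binned into ten equal-width buckets: 0-10%, 10-20%, and so on up to 90-100%. These are fixed-width bins rather than true quantiles, so the buckets hold very different numbers of rows.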
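
As a rough illustration of the bucketing rule, the sketch below mirrors the `Bucketizer` semantics in plain Python: each bucket covers `[lower, upper)`, except the last one, which also includes its upper bound. The helper name `bucket_index` is hypothetical; the probabilities are taken from the sample rows shown earlier.

```python
from bisect import bisect_right

SPLITS = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

def bucket_index(probability, splits=SPLITS):
    """Return the bucket index as a float, like the fraud_buckets column."""
    if not splits[0] <= probability <= splits[-1]:
        raise ValueError(f"{probability} is outside [{splits[0]}, {splits[-1]}]")
    # bisect_right counts the split points <= probability; clamp so that
    # exactly 100 falls in the last bucket instead of opening a new one.
    return float(min(bisect_right(splits, probability) - 1, len(splits) - 2))

print(bucket_index(14.681704))  # 1.0
print(bucket_index(97.62981))   # 9.0
print(bucket_index(100))        # 9.0
```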

In [11]:
from pyspark.ml.feature import Bucketizer

buckets = Bucketizer(splits=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], inputCol="fraud_probability", outputCol="fraud_buckets")
X_bucks = buckets.transform(category_processed).drop("fraud_probability")

X_bucks.head(1)

                                                                                

[Row(dollar_value=6263.02978515625, Natural_var=0, Potential_Outlier=0, BNPL_Fee=3.77, avg_monthly_inc=0.0, monthly_entropy=2.710181474685669, postcode_entropy=4.060055732727051, revenue=241562.580078125, Number of individuals lodging an income tax return=4994, Average taxable income or loss=56564, Median taxable income or loss=44772, Proportion with salary or wages=1, Count salary or wages=3916, Average salary or wages=57393, Median salary or wages=49510, Proportion with net rent=1, Count net rent=690, Average net rent=863, Median net rent=255, Average total income or loss=59730, Median total income or loss=47123, Average total deductions=2865, Median total deductions=598, Proportion with total business income=1, Count total business income=457, Average total business income=93034, Median total business income=32873, Proportion with total business expenses=1, Count total business expenses=436, Average total business expenses=76035, Median total business expenses=20422, Proportion with

In [14]:
X_bucks.groupBy("fraud_buckets").count().orderBy("fraud_buckets").show()


+-------------+-----+
|fraud_buckets|count|
+-------------+-----+
|          0.0|22923|
|          1.0|38611|
|          2.0| 6113|
|          3.0| 1934|
|          4.0|  990|
|          5.0|  576|
|          6.0|  360|
|          7.0|  193|
|          8.0|  102|
|          9.0|   11|
+-------------+-----+



                                                                                

In [13]:
from functools import reduce

In [14]:

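# Oversampling multiplier per bucket (list index = bucket); buckets 0-2 are kept as-is
fractions = [0, 0, 0, 2, 4, 7, 15, 20, 35, 250]

# Oversample the rare buckets 3-9 with replacement
X_adjusted = reduce(
    DataFrame.unionAll,
    [
        X_bucks.filter(X_bucks.fraud_buckets == float(x))
        .sample(withReplacement=True, fraction=float(fractions[x]), seed=69)
        for x in range(3, 10)
    ]
)
# Append the untouched majority buckets 0-2
X_adjusted = reduce(
    DataFrame.unionAll,
    [X_adjusted] + [X_bucks.filter(X_bucks.fraud_buckets == float(x)) for x in range(0, 3)]
)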

X_adjusted.count()

                                                                                

95171

In [15]:
X_adjusted.groupBy("fraud_buckets").count().orderBy("fraud_buckets").show()



+-------------+-----+
|fraud_buckets|count|
+-------------+-----+
|          0.0|22923|
|          1.0|38611|
|          2.0| 6113|
|          3.0| 3910|
|          4.0| 3931|
|          5.0| 4025|
|          6.0| 5450|
|          7.0| 3876|
|          8.0| 3583|
|          9.0| 2749|
+-------------+-----+



                                                                                

### TRAIN TEST SPLIT

In [16]:
train, val, test = X_adjusted.randomSplit([0.7, 0.2, 0.1], seed=69)

#print(train.count())
#print(val.count())
#test.count()

In [17]:
train.write.parquet("../models/train_raw", mode="overwrite")
val.write.parquet("../models/val_raw", mode="overwrite")
test.write.parquet("../models/test_raw", mode="overwrite")



                                                                                

---

### RESTART KERNEL HERE

---

In [5]:
# `mode` is a write option and has no meaning when reading, so it is omitted here
train = sp.read.option("inferSchema", True).parquet("../models/train_raw/")
val = sp.read.option("inferSchema", True).parquet("../models/val_raw/")
test = sp.read.option("inferSchema", True).parquet("../models/test_raw/")

                                                                                

In [21]:
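def process_numerical(data: DataFrame):
    """
    Scale the numerical columns with a StandardScaler fitted on `data`.

    The scaler uses Spark's defaults (withStd=True, withMean=False), so each
    column is divided by its standard deviation without centering. Because the
    scaler is fitted on whatever DataFrame is passed in, each split gets its
    own scaling statistics.
    """
    # Numerical columns to scale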
    columns = ['dollar_value', 'avg_monthly_inc', 'BNPL_Fee',
    'monthly_entropy', 'postcode_entropy', 'revenue', 'Number of individuals lodging an income tax return', 
    'Average taxable income or loss', 'Median taxable income or loss', 'Proportion with salary or wages', 'Count salary or wages', 
    'Average salary or wages', 'Median salary or wages', 'Proportion with net rent', 'Count net rent', 'Average net rent', 
    'Median net rent', 'Average total income or loss', 'Median total income or loss', 'Average total deductions', 
    'Median total deductions', 'Proportion with total business income', 'Count total business income', 
    'Average total business income', 'Median total business income', 'Proportion with total business expenses', 
    'Count total business expenses', 'Average total business expenses', 'Median total business expenses', 
    'Proportion with net tax', 'Count net tax', 'Average net tax', 'Median net tax', 'Count super total accounts balance', 
    'Average super total accounts balance', 'Median super total accounts balance']

    va = VectorAssembler(inputCols=columns, outputCol="to_scale")
    sc = StandardScaler(inputCol="to_scale", outputCol="scaled")

    va_data = va.transform(data)
    data = sc.fit(va_data).transform(va_data)
    
    # Drop other columns
    for c in columns:
        data = data.drop(c)
    return data.drop("to_scale")

In [7]:
train_processed = process_numerical(train)
val_processed = process_numerical(val)
test_processed = process_numerical(test)

train_processed.head(1)
val_processed.head(1)
test_processed.head(1)



                                                                                

[Row(Natural_var=0, Potential_Outlier=0, dayofmonth_encoded=SparseVector(30, {6: 1.0}), dayofweek_encoded=SparseVector(6, {5: 1.0}), month_encoded=SparseVector(11, {5: 1.0}), tags_encoded=SparseVector(24, {17: 1.0}), state_encoded=SparseVector(7, {0: 1.0}), gender_encoded=SparseVector(2, {1: 1.0}), Earnings_Class_encoded=SparseVector(4, {2: 1.0}), fraud_buckets=1.0, scaled=DenseVector([0.0001, 0.01, 1.1552, 11.8533, 5.5901, 0.3913, 1.0881, 4.0182, 6.3873, 0.0, 1.0838, 6.1235, 7.2708, 44.0159, 1.1971, 0.061, -0.0291, 3.9179, 6.5207, 1.2957, 3.8719, 37.2041, 0.9587, 1.9516, 1.9602, 31.132, 0.9613, 1.4767, 0.9168, 0.0, 1.0796, 2.6103, 4.6455, 1.1213, 2.7123, 4.0691]))]
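
Because `process_numerical` is called separately for `train`, `val`, and `test`, each call fits its own scaler. A common alternative is to fit the scaler on the training split only and reuse those statistics for the other splits. The sketch below illustrates the difference in plain Python with made-up numbers, assuming division by the sample standard deviation without centering; `fit_std` and `scale` are hypothetical helpers. In PySpark the same idea would amount to fitting the scaler once on the training data and reusing the fitted model.

```python
from statistics import stdev

def fit_std(column):
    """Corrected sample standard deviation of a column."""
    return stdev(column)

def scale(column, std):
    """Divide by std without centering; guard against a zero std."""
    return [x / std if std else 0.0 for x in column]

train_values = [10.0, 20.0, 30.0, 40.0]    # made-up training column
val_values = [100.0, 200.0, 300.0, 400.0]  # made-up validation column, 10x larger

train_std = fit_std(train_values)
train_scaled = scale(train_values, train_std)
val_shared = scale(val_values, train_std)              # reuse the training statistics
val_separate = scale(val_values, fit_std(val_values))  # refit on validation

print(train_scaled[0], val_separate[0], val_shared[0])
```

With separate scalers, the validation value 100 ends up with the same scaled value as the training value 10, so the model cannot see that the two splits differ in magnitude. With the shared scaler, the factor of 10 is preserved.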

In [8]:
def vectorize(data: DataFrame, outcome: str):
    """
    Function to vectorize all the processed data
    """
    data = data.withColumnRenamed(outcome, "label")
    return VectorAssembler(
        inputCols= [c for c in data.drop("label").columns],
        outputCol="features"
    ).transform(data)

In [9]:
train_vector = vectorize(train_processed, "fraud_buckets")
val_vector = vectorize(val_processed, "fraud_buckets")
test_vector = vectorize(test_processed, "fraud_buckets")

#train_vector.head(1)
#val_vector.head(1)
#test_vector.head(1)

In [10]:
import os

# Make sure the output directory exists
target_dir = "../models/"
os.makedirs(target_dir, exist_ok=True)

train_vector.select("features", "label").write.parquet("../models/train_vector", mode="overwrite")
val_vector.select("features", "label").write.parquet("../models/val_vector", mode="overwrite")
test_vector.select("features", "label").write.parquet("../models/test_vector", mode="overwrite")

                                                                                

---

#### RESTART KERNEL HERE

---

To save memory and increase speed, save the data, restart the kernel, and continue from the cells below. You may need to rerun the setup cells at the top of the notebook (the imports and the Spark session) first. Again, running the MLP is unnecessary; it is included only for reference.

## MODEL

In [4]:
train_vector = sp.read.option("inferSchema", True).parquet("../models/train_vector/")
val_vector = sp.read.option("inferSchema", True).parquet("../models/val_vector/")
test_vector = sp.read.option("inferSchema", True).parquet("../models/test_vector/")

                                                                                

### MULTI LAYER PERCEPTRON CLASSIFIER

In [5]:
inputCount = 122                            # Length of the assembled feature vector
layers = [inputCount, 256, 64, 10]          # Two hidden layers; 10 outputs, one per fraud bucket
model = MultilayerPerceptronClassifier(
    labelCol='label',
    featuresCol='features',
    solver='gd',
    maxIter=100,
    layers=layers,
    blockSize=64,
    seed=69)
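
The input width of 122 can be checked against the vector sizes visible in the earlier cell output: two pass-through integer columns, the seven one-hot encoded vectors, and the 36 scaled numerical columns. A quick check of the arithmetic:

```python
# Sizes of the one-hot vectors, read off the SparseVector(...) output above
encoded_sizes = {
    "dayofmonth_encoded": 30,
    "dayofweek_encoded": 6,
    "month_encoded": 11,
    "tags_encoded": 24,
    "state_encoded": 7,
    "gender_encoded": 2,
    "Earnings_Class_encoded": 4,
}
passthrough_columns = ["Natural_var", "Potential_Outlier"]
n_scaled_columns = 36  # length of the `columns` list in process_numerical

input_count = len(passthrough_columns) + sum(encoded_sizes.values()) + n_scaled_columns
print(input_count)  # 122
```

If the categories or the numerical column list change, the first entry of `layers` has to change with them.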

In [6]:
model_fit = model.fit(train_vector.select("features", "label").dropna())

                                                                                



                                                                                

In [7]:
# train_output = model_fit.transform(train_vector)
val_output = model_fit.transform(val_vector.dropna())
test_output = model_fit.transform(test_vector.dropna())

In [20]:
# metrics = ['weightedPrecision', 'weightedRecall', 'accuracy']
metrics = ["accuracy"]
for metric in metrics:
    evaluator = MulticlassClassificationEvaluator(metricName=metric)
    print('Validation ' + metric + ' = ' + str(evaluator.evaluate(
        val_output.select("prediction", "label"))))
    print('Test ' + metric + ' = ' + str(evaluator.evaluate(
        test_output.select("prediction", "label"))))

Train accuracy = 0.3971909532169782
Train weightedFalsePositiveRate = 0.3971909532169782


#### Determining how good the scores are

In [54]:
val_output.select("label", "rawPrediction", "prediction").show(3)

+-----+--------------------+----------+
|label|       rawPrediction|prediction|
+-----+--------------------+----------+
|  1.0|[1.21844183743563...|       1.0|
|  1.0|[1.27374537812617...|       1.0|
|  1.0|[1.25855328047493...|       1.0|
+-----+--------------------+----------+
only showing top 3 rows



In [61]:
# Signed error (label - prediction); the column is named "MSE" but is not squared
mean_square_error_val_score = val_output.select("label", "prediction").withColumn("MSE" , (col("label") - col("prediction")))
mean_square_error_val_score.show(5)

+-----+----------+---+
|label|prediction|MSE|
+-----+----------+---+
|  1.0|       1.0|0.0|
|  1.0|       1.0|0.0|
|  1.0|       1.0|0.0|
|  1.0|       1.0|0.0|
|  1.0|       1.0|0.0|
+-----+----------+---+
only showing top 5 rows



In [62]:
mean_square_error_val_score.groupBy("MSE").count().show()



+----+-----+
| MSE|count|
+----+-----+
| 0.0| 7617|
|-1.0| 4518|
| 1.0| 1176|
| 3.0|  793|
| 2.0|  791|
| 4.0|  851|
| 5.0| 1104|
| 7.0|  705|
| 6.0|  742|
| 8.0|  537|
+----+-----+



                                                                                

In [63]:
# ((a - b) ** 2) ** 0.5 equals |a - b|, so this column holds the absolute error
mean_square_error_test_score = test_output.select("label", "prediction").withColumn("MSE" , ((col("label") - col("prediction")) ** 2) ** 0.5)
mean_square_error_test_score.show(5)

+-----+----------+---+
|label|prediction|MSE|
+-----+----------+---+
|  1.0|       1.0|0.0|
|  1.0|       1.0|0.0|
|  1.0|       1.0|0.0|
|  1.0|       1.0|0.0|
|  1.0|       1.0|0.0|
+-----+----------+---+
only showing top 5 rows



In [64]:
mean_square_error_test_score.groupBy("MSE").count().show()

+---+-----+
|MSE|count|
+---+-----+
|0.0| 3846|
|1.0| 3013|
|4.0|  418|
|3.0|  448|
|2.0|  369|
|5.0|  563|
|7.0|  389|
|6.0|  391|
|8.0|  246|
+---+-----+



In [79]:
# Count-weighted absolute error per error size; the column sum is the mean absolute error
MSE_count = mean_square_error_test_score.groupBy("MSE").count().groupBy().sum().select("sum(count)")
MSE_test = mean_square_error_test_score.groupBy("MSE").count().withColumn("Weighted MSE", col("MSE") * col("count") / MSE_count.collect()[0]["sum(count)"])
MSE_test.show()


+---+-----+-------------------+
|MSE|count|       Weighted MSE|
+---+-----+-------------------+
|0.0| 3846|                0.0|
|1.0| 3013|0.31116389548693585|
|4.0|  418|0.17267375813281008|
|3.0|  448|0.13879995869048847|
|2.0|  369|0.07621604874522359|
|5.0|  563| 0.2907156872869978|
|7.0|  389| 0.2812144996385418|
|6.0|  391|0.24228028503562946|
|8.0|  246|0.20324279665392958|
+---+-----+-------------------+



In [82]:
MSE_test.groupBy().sum().select("sum(Weighted MSE)").show()

+------------------+
| sum(Weighted MSE)|
+------------------+
|1.7163069296705566|
+------------------+



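Because the error column is computed as ((label - prediction)^2)^0.5, it holds the absolute error, so the weighted sum above is a mean absolute error (MAE) rather than a mean squared error. The MAE between the predicted and true buckets is about 1.7. Since each bucket spans 10 percentage points, predictions are off by roughly 17 percentage points of fraud probability on average, which we consider a good result.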
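
The figure of 1.7 can be recomputed directly from the error counts in the test table above. The sketch below also computes a true mean squared error and its root from the same counts for comparison; these two extra values are derived here and are not printed anywhere in the notebook.

```python
# Absolute bucket error -> number of test rows, copied from the table above
error_counts = {0: 3846, 1: 3013, 2: 369, 3: 448, 4: 418, 5: 563, 6: 391, 7: 389, 8: 246}

total = sum(error_counts.values())
mae = sum(err * n for err, n in error_counts.items()) / total
mse = sum(err ** 2 * n for err, n in error_counts.items()) / total
rmse = mse ** 0.5

print(f"rows = {total}")
print(f"MAE  = {mae:.4f} buckets")   # matches sum(Weighted MSE) above
print(f"MSE  = {mse:.4f} buckets^2")
print(f"RMSE = {rmse:.4f} buckets")
```

The squared error is noticeably larger than the absolute error because the large misses (5 to 8 buckets) dominate once they are squared.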

---

#### RESTART KERNEL AGAIN IF NECESSARY

---

Here, restart the kernel and read both the full dataset and the vectorized train, validation, and test datasets to feed into the RF model.

### RANDOM FOREST CLASSIFIER

In [5]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf.fit(train_vector)
val_pred = model.transform(val_vector)
val_pred.show(1)

                                                                                

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|(122,[11,32,39,53...|  1.0|[27.7622795157378...|[0.27762279515737...|       1.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 1 row



In [12]:
val_pred.select("prediction").distinct().show()

+----------+
|prediction|
+----------+
|       1.0|
|       6.0|
|       5.0|
|       7.0|
|       8.0|
|       9.0|
+----------+



                                                                                

In [6]:
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")

accuracy = evaluator.evaluate(val_pred)
print(accuracy)



0.4674524795582457


                                                                                

In [16]:
test_pred = model.transform(test_vector)

In [17]:
test_accuracy = evaluator.evaluate(test_pred)
test_accuracy

0.4334400495714138

### MSE metrics

In [16]:
# Absolute error (always non-negative)
mean_square_error_val_score = val_pred.select("label", "prediction").withColumn("MSE" , ((col("label") - col("prediction")) ** 2) ** 0.5)
mean_square_error_val_score.groupBy("MSE").count().show()

+---+-----+
|MSE|count|
+---+-----+
|0.0| 8787|
|1.0| 5825|
|4.0|  780|
|3.0|  760|
|2.0|  860|
|5.0|  843|
|7.0|  434|
|6.0|  545|
+---+-----+



                                                                                

In [22]:
# Signed error: negative values mean the predicted bucket is above the true one
mean_square_error_val_score_disp = val_pred.select("label", "prediction").withColumn("MSE" , (col("label") - col("prediction")))
mean_square_error_val_score_disp.groupBy("MSE").count().show()

+----+-----+
| MSE|count|
+----+-----+
| 0.0| 8726|
|-1.0| 4617|
| 1.0| 1210|
|-4.0|    3|
|-2.0|   57|
|-3.0|   13|
| 3.0|  746|
| 2.0|  796|
| 4.0|  764|
| 5.0|  875|
| 7.0|  498|
| 6.0|  529|
+----+-----+
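
The signed errors can be summarised into exact matches, over-predictions (negative values, predicted bucket above the true one), and under-predictions (positive values). A short tally using the counts from the table above:

```python
# Signed error (label - prediction) -> number of validation rows, from the table above
signed_counts = {
    0: 8726, -1: 4617, 1: 1210, -4: 3, -2: 57, -3: 13,
    3: 746, 2: 796, 4: 764, 5: 875, 7: 498, 6: 529,
}

total = sum(signed_counts.values())
exact = signed_counts[0]
over = sum(n for diff, n in signed_counts.items() if diff < 0)   # prediction > label
under = sum(n for diff, n in signed_counts.items() if diff > 0)  # prediction < label
mae = sum(abs(diff) * n for diff, n in signed_counts.items()) / total

print(f"exact = {exact / total:.1%}, over = {over / total:.1%}, under = {under / total:.1%}")
print(f"MAE   = {mae:.4f} buckets")
```

Most over-predictions are off by a single bucket, while the under-predictions are spread across errors of 1 to 7 buckets.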



#### PREDICTING FOR THE ENTIRE DATASET

In [18]:
def vectorize_full(data: DataFrame, drop: list):
    """
    Function to vectorize the processed data, excluding the columns in `drop`
    """
    return VectorAssembler(
        inputCols=[c for c in data.columns if c not in drop],
        outputCol="features"
    ).transform(data)

In [19]:
full = transactions.join(merchants, on="merchant_abn").join(customers, on="user_id")

# X = X.drop("user_id", "merchant_abn", "order_datetime", "order_id", "name", "postcode", "holiday")
full = full.drop("name", "order_id", "holiday")
category_full = cat_transformer.transform(full)

categories = [
    "dayofmonth",
    "dayofweek",
    "month",
    "tags",
    "state",
    "gender",
    "Earnings_Class"
]

for c in categories:
    category_full = category_full.drop(c).drop(c+"_index")

category_full.head()

                                                                                

Row(user_id=3, merchant_abn=60956456424, dollar_value=136.67999267578125, order_datetime=datetime.date(2021, 8, 20), Natural_var=0, Potential_Outlier=0, BNPL_Fee=4.69, avg_monthly_inc=-4.238095283508301, monthly_entropy=2.985382318496704, postcode_entropy=7.979236125946045, revenue=8026969.561502457, postcode=862, Number of individuals lodging an income tax return=1099, Average taxable income or loss=56030, Median taxable income or loss=45125, Proportion with salary or wages=1, Count salary or wages=821, Average salary or wages=56184, Median salary or wages=49641, Proportion with net rent=1, Count net rent=155, Average net rent=950, Median net rent=818, Average total income or loss=58776, Median total income or loss=47279, Average total deductions=2575, Median total deductions=695, Proportion with total business income=1, Count total business income=144, Average total business income=64171, Median total business income=25168, Proportion with total business expenses=1, Count total busin

In [22]:
#print("At bucketing")
#buckets_full = buckets.transform(category_full)
print("At numerical")
numerical_full = process_numerical(category_full)

print("At Vectorize")
full = vectorize_full(numerical_full, ["user_id", "merchant_abn", "order_datetime", "postcode"])

full.head(1)

At numerical


                                                                                

At Vectorize


                                                                                

[Row(user_id=3, merchant_abn=60956456424, order_datetime=datetime.date(2021, 8, 20), Natural_var=0, Potential_Outlier=0, postcode=862, dayofmonth_encoded=SparseVector(30, {17: 1.0}), dayofweek_encoded=SparseVector(6, {0: 1.0}), month_encoded=SparseVector(11, {6: 1.0}), tags_encoded=SparseVector(24, {1: 1.0}), state_encoded=SparseVector(7, {6: 1.0}), gender_encoded=SparseVector(2, {1: 1.0}), Earnings_Class_encoded=SparseVector(4, {1: 1.0}), scaled=DenseVector([0.2962, -0.824, 2.6978, 143.6152, 15.7495, 2.3418, 0.1505, 3.2388, 5.1963, 0.0, 0.1382, 4.646, 5.7305, 49.2032, 0.1392, 0.3585, 0.5033, 3.2237, 5.3209, 1.4794, 3.0258, 42.7062, 0.1964, 1.2211, 2.11, 34.5515, 0.2057, 0.909, 1.0993, 0.0, 0.141, 2.0222, 3.4683, 0.1507, 2.8041, 3.5449]), features=SparseVector(122, {19: 1.0, 32: 1.0, 44: 1.0, 50: 1.0, 79: 1.0, 81: 1.0, 83: 1.0, 86: 0.2962, 87: -0.824, 88: 2.6978, 89: 143.6152, 90: 15.7495, 91: 2.3418, 92: 0.1505, 93: 3.2388, 94: 5.1963, 96: 0.1382, 97: 4.646, 98: 5.7305, 99: 49.2032, 1

In [23]:
full.write.parquet("../models/full_raw", mode="overwrite")

                                                                                

In [8]:
# Full prediction
full = sp.read.option("inferSchema", True).parquet("../models/full_raw/")
full.head(1)

[Row(user_id=3, merchant_abn=60956456424, order_datetime=datetime.date(2021, 8, 20), Natural_var=0, Potential_Outlier=0, postcode=862, dayofmonth_encoded=SparseVector(30, {17: 1.0}), dayofweek_encoded=SparseVector(6, {0: 1.0}), month_encoded=SparseVector(11, {6: 1.0}), tags_encoded=SparseVector(24, {1: 1.0}), state_encoded=SparseVector(7, {6: 1.0}), gender_encoded=SparseVector(2, {1: 1.0}), Earnings_Class_encoded=SparseVector(4, {1: 1.0}), scaled=DenseVector([0.2962, -0.824, 2.6978, 143.6152, 15.7495, 2.3418, 0.1505, 3.2388, 5.1963, 0.0, 0.1382, 4.646, 5.7305, 49.2032, 0.1392, 0.3585, 0.5033, 3.2237, 5.3209, 1.4794, 3.0258, 42.7062, 0.1964, 1.2211, 2.11, 34.5515, 0.2057, 0.909, 1.0993, 0.0, 0.141, 2.0222, 3.4683, 0.1507, 2.8041, 3.5449]), features=SparseVector(122, {19: 1.0, 32: 1.0, 44: 1.0, 50: 1.0, 79: 1.0, 81: 1.0, 83: 1.0, 86: 0.2962, 87: -0.824, 88: 2.6978, 89: 143.6152, 90: 15.7495, 91: 2.3418, 92: 0.1505, 93: 3.2388, 94: 5.1963, 96: 0.1382, 97: 4.646, 98: 5.7305, 99: 49.2032, 1

In [9]:
full_rf_pred = model.transform(full.drop("rawPrediction", "probability", "prediction"))
full_rf_pred.head(1)



[Row(user_id=3, merchant_abn=60956456424, order_datetime=datetime.date(2021, 8, 20), Natural_var=0, Potential_Outlier=0, postcode=862, dayofmonth_encoded=SparseVector(30, {17: 1.0}), dayofweek_encoded=SparseVector(6, {0: 1.0}), month_encoded=SparseVector(11, {6: 1.0}), tags_encoded=SparseVector(24, {1: 1.0}), state_encoded=SparseVector(7, {6: 1.0}), gender_encoded=SparseVector(2, {1: 1.0}), Earnings_Class_encoded=SparseVector(4, {1: 1.0}), scaled=DenseVector([0.2962, -0.824, 2.6978, 143.6152, 15.7495, 2.3418, 0.1505, 3.2388, 5.1963, 0.0, 0.1382, 4.646, 5.7305, 49.2032, 0.1392, 0.3585, 0.5033, 3.2237, 5.3209, 1.4794, 3.0258, 42.7062, 0.1964, 1.2211, 2.11, 34.5515, 0.2057, 0.909, 1.0993, 0.0, 0.141, 2.0222, 3.4683, 0.1507, 2.8041, 3.5449]), features=SparseVector(122, {19: 1.0, 32: 1.0, 44: 1.0, 50: 1.0, 79: 1.0, 81: 1.0, 83: 1.0, 86: 0.2962, 87: -0.824, 88: 2.6978, 89: 143.6152, 90: 15.7495, 91: 2.3418, 92: 0.1505, 93: 3.2388, 94: 5.1963, 96: 0.1382, 97: 4.646, 98: 5.7305, 99: 49.2032, 1

In [10]:
full_rf_pred = full_rf_pred.drop("scaled", "features", "rawPrediction", "probability")
full_rf_pred.head(1)



[Row(user_id=3, merchant_abn=60956456424, order_datetime=datetime.date(2021, 8, 20), Natural_var=0, Potential_Outlier=0, postcode=862, dayofmonth_encoded=SparseVector(30, {17: 1.0}), dayofweek_encoded=SparseVector(6, {0: 1.0}), month_encoded=SparseVector(11, {6: 1.0}), tags_encoded=SparseVector(24, {1: 1.0}), state_encoded=SparseVector(7, {6: 1.0}), gender_encoded=SparseVector(2, {1: 1.0}), Earnings_Class_encoded=SparseVector(4, {1: 1.0}), prediction=1.0)]

In [11]:
full_rf_pred.write.parquet("../models/random_forest_output_full", mode="overwrite")



                                                                                