# Modelling

This notebook trains a Gradient Boosted Trees model to predict whether a future transaction is fraudulent.

---

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # namespaced import avoids shadowing builtins such as sum and max
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [2]:
spark = (
    SparkSession.builder.appName("Modelling")
    .config("spark.sql.repl.eagerEval.enabled", True) 
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.sql.session.timeZone", "Etc/UTC")
    .config("spark.sql.debug.maxToStringFields", 3000)
    .config("spark.network.timeout", "300s")
    .config("spark.driver.maxResultSize", "4g")
    .config("spark.rpc.askTimeout", "300s")
    .config("spark.driver.memory", "8G")
    .config("spark.executor.memory", "8G")
    .getOrCreate()
)


# Read dataset

In [3]:
full_transaction = spark.read.parquet('../data/curated/full_transaction_with_segments')

                                                                                

In [4]:
full_transaction.printSchema()

root
 |-- merchant_abn: long (nullable = true)
 |-- order_id: string (nullable = true)
 |-- take_rate: float (nullable = true)
 |-- merchant_fraud_probability: double (nullable = true)
 |-- transaction_revenue: double (nullable = true)
 |-- BNPL_revenue: double (nullable = true)
 |-- revenue_level_e: integer (nullable = true)
 |-- revenue_level_d: integer (nullable = true)
 |-- revenue_level_c: integer (nullable = true)
 |-- revenue_level_b: integer (nullable = true)
 |-- revenue_level_a: integer (nullable = true)
 |-- category_jewelry: integer (nullable = true)
 |-- category_art: integer (nullable = true)
 |-- category_television: integer (nullable = true)
 |-- category_watch: integer (nullable = true)
 |-- category_cable: integer (nullable = true)
 |-- category_repair: integer (nullable = true)
 |-- category_stock: integer (nullable = true)
 |-- category_flower: integer (nullable = true)
 |-- category_office: integer (nullable = true)
 |-- category_souvenir: integer (nullable = true)

# Drop unnecessary columns

- Columns starting with `category_` are dropped because they were already converted into the columns starting with `merchant_segment_`.
- `postcode` and `hashed_postcode` are dropped because
  - they carry the same information
  - `hashed_postcode` has too many dimensions (6000)
  - the earlier correlation and feature-importance analysis showed they are not important
- `order_datetime` is dropped because it was converted to `order_timestamp`.
- `consumer_is_fraud` and `merchant_is_fraud` are dropped because they are directly related to the label `transaction_is_fraud` and would leak it into the features.
- We also tried keeping `hashed_postcode` and the `category_` columns in the model; the evaluation results were almost the same as for the model without them.

In [5]:
columns_to_drop = [
    'merchant_abn', 'order_id', 'user_id', 'postcode', 'hashed_postcode',
    'consumer_id', 'consumer_is_fraud', 'order_datetime', 'merchant_is_fraud',
] + [c for c in full_transaction.columns if c.startswith('category_')]
data = full_transaction.drop(*columns_to_drop)

# Train the Gradient Boosted Trees model

In [6]:
# assemble features into a single vector column
selected_columns = [col for col in data.columns if col != 'transaction_is_fraud']
assembler = VectorAssembler(inputCols=selected_columns, outputCol='features')
assembled_data = assembler.transform(data)

# scale features
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures')
scaler_model = scaler.fit(assembled_data)
scaled_data = scaler_model.transform(assembled_data)

# split the data into training and test sets
train_data, test_data = scaled_data.randomSplit([0.7, 0.3], seed=42)

# define Gradient Boosted Trees model
gbt = GBTClassifier(labelCol='transaction_is_fraud', 
                    featuresCol='scaledFeatures', maxIter=10, seed=42)

# train the model
gbt_model = gbt.fit(train_data)

# make predictions on the test set
predictions = gbt_model.transform(test_data)
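
The cell above fits the `StandardScaler` on all rows before splitting, so the scaling statistics also see the test rows. A common alternative is to fit on the training split only and reuse those statistics on the test split. The sketch below illustrates the idea in plain Python on made-up numbers. The helper names `fit_std` and `scale` are hypothetical, and it assumes scaling to unit standard deviation without centering, which should correspond to the `StandardScaler` defaults (`withStd=True`, `withMean=False`). Tree-based models are largely insensitive to this kind of per-feature rescaling, so the change is mostly a matter of hygiene rather than an expected change in results.

```python
import statistics


def fit_std(train_values):
    """Learn the scaling statistic (sample standard deviation) from training values only."""
    return statistics.stdev(train_values)


def scale(values, std):
    """Divide each value by the training standard deviation; no centering is applied."""
    return [v / std for v in values]


# Made-up values for a single feature, already split into train and test.
train = [10.0, 20.0, 30.0, 40.0, 50.0]
test = [25.0, 100.0]

train_std = fit_std(train)               # statistic learned from the training split
scaled_train = scale(train, train_std)   # training split has unit standard deviation
scaled_test = scale(test, train_std)     # test split reuses the training statistic

print(f"train std: {train_std:.4f}")
print(f"scaled test values: {[round(v, 4) for v in scaled_test]}")
```

In PySpark the same ordering would mean calling `randomSplit` before `scaler.fit`, fitting on the training DataFrame, and transforming both splits with the fitted scaler model.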

                                                                                

# Evaluations

### ROC AUC Score

The ROC AUC score (area under the Receiver Operating Characteristic curve) measures how well a binary classifier ranks positive cases above negative ones. It ranges from 0 to 1:
- A score of 1 indicates perfect classification.
- A score of 0.5 is no better than random guessing.
- A score of 0 means the ranking is completely inverted, so every case is misclassified.
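
One way to read the score is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes that probability directly on made-up toy data. The function name `pairwise_auc` and the scores are illustrative only; Spark's evaluator derives the area from the ROC curve rather than enumerating pairs, so this is an aid to intuition, not a description of its implementation.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]

perfect = pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9])   # positives always score higher
constant = pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
inverted = pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1])  # positives always score lower

print(perfect, constant, inverted)
```

The three toy inputs correspond to the three reference points in the list above: a perfect ranking, an uninformative one, and a fully inverted one.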

In [7]:
# evaluate the model performance using Binary Classification metrics
binary_evaluator = BinaryClassificationEvaluator(labelCol='transaction_is_fraud', 
                                                 metricName='areaUnderROC')
roc_auc = binary_evaluator.evaluate(predictions)
print(f"ROC AUC Score: {roc_auc}")

ROC AUC Score: 0.9998325969497125


An ROC AUC score of about 0.9998 indicates very high discriminatory power: the model ranks fraudulent transactions above non-fraudulent ones almost perfectly.

### Precision, recall, and F1 score

In [8]:
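# define evaluators for precision, recall, and F1 score
# note: precisionByLabel and recallByLabel score the class given by metricLabel,
# which defaults to 0.0 (non-fraud); pass metricLabel=1.0 to score the fraud class.
# 'f1' is the F1 score averaged over both classes, weighted by class frequency.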
evaluator_precision = MulticlassClassificationEvaluator(labelCol='transaction_is_fraud', 
                                                        predictionCol='prediction', 
                                                        metricName='precisionByLabel')
evaluator_recall = MulticlassClassificationEvaluator(labelCol='transaction_is_fraud', 
                                                     predictionCol='prediction', 
                                                     metricName='recallByLabel')
evaluator_f1 = MulticlassClassificationEvaluator(labelCol='transaction_is_fraud', 
                                                 predictionCol='prediction', 
                                                 metricName='f1')

# evaluate the model using the defined evaluators
precision = evaluator_precision.evaluate(predictions)
recall = evaluator_recall.evaluate(predictions)
f1_score = evaluator_f1.evaluate(predictions)

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1_score}")



Precision: 0.9971343677836155
Recall: 0.9925347345362653
F1 Score: 0.9926324044661374


                                                                                

### Confusion matrix-like summary

In [9]:
predictions.groupBy('transaction_is_fraud', 'prediction').count().show()

+--------------------+----------+-------+
|transaction_is_fraud|prediction|  count|
+--------------------+----------+-------+
|                   1|       0.0|   6917|
|                   0|       0.0|2406861|
|                   1|       1.0| 958213|
|                   0|       1.0|  18103|
+--------------------+----------+-------+
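
The four counts above are enough to recompute precision, recall, and F1 for each class by hand. The plain-Python sketch below does so as a cross-check; the helper name `class_metrics` is hypothetical and the counts are copied from the table. The label 0 (non-fraud) figures reproduce the precision and recall printed in the previous section, which suggests those values describe the non-fraud class, while the frequency-weighted average of the two per-class F1 values reproduces the printed F1 score.

```python
# Counts copied from the groupBy output: (transaction_is_fraud, prediction) -> count
counts = {
    (1, 0): 6_917,      # fraud predicted as non-fraud
    (0, 0): 2_406_861,  # non-fraud predicted as non-fraud
    (1, 1): 958_213,    # fraud predicted as fraud
    (0, 1): 18_103,     # non-fraud predicted as fraud
}


def class_metrics(counts, label):
    """Precision, recall and F1 for one label, treating it as the positive class."""
    other = 1 - label
    tp = counts[(label, label)]
    fp = counts[(other, label)]  # other class predicted as `label`
    fn = counts[(label, other)]  # `label` predicted as the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(counts.values())
weighted_f1 = 0.0
for label in (0, 1):
    precision, recall, f1 = class_metrics(counts, label)
    support = counts[(label, 0)] + counts[(label, 1)]  # rows whose true label is `label`
    weighted_f1 += f1 * support / total
    print(f"label {label}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

For the fraud class (label 1) this gives a precision of roughly 0.98 and a recall of roughly 0.99, slightly lower on precision than the non-fraud figures because 18,103 legitimate transactions are flagged as fraud.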



                                                                                