# Credit Card Fraud Detection
In this notebook, I'll explore this public dataset on Kaggle about [credit card fraud](https://www.kaggle.com/datasets/kartik2112/fraud-detection?resource=download). My objective is to analyze this dataset behavior and create an algorithm to predict if there's a fraud or not.

As for my tool, I'll use **PySpark** to load my data, check the quality and do the exploratory data analysis (EDA). Next, I'll run some classifications algorithms and compare their performance to see which model would be used in a 'deploy phase'.

## Table of Contents  <a name="table_cont"></a>

0. [**Libraries**](#lib)
1. [**Load data**](#load_data)
2. [**Data quality**](#data-quality)
3. [**Feature Engineering**](#features)
4. [**EDA**](#eda)
5. [**Data preparation**](#data_prep)
6. [**ML Models**](#ml_mod)
    - Logistical regression ([Logit](#logit))
    - Random Forest ([RF](#rf))
    - Gradient-Boosted Trees ([GBTs](#gbt))
7. [**Comparing results**](#results)
8. [**Take Aways**](#take_away)


## Libraries <a name="libs"></a>

In [1]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import isnan, when, count, col, year, month, dayofweek, weekofyear, date_format, current_date, date_diff, floor, desc, asc, sum, mean, explode, lower,split
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier, NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator


## Load data  <a name="load_data"></a>

In [2]:
# create spark session
spark = SparkSession.builder.appName('CC_fraud').getOrCreate()

# load train and test datasets
train = spark.read.csv('fraudTrain.csv', header=True, inferSchema=True)
test = spark.read.csv('fraudTest.csv', header=True, inferSchema=True)

In [3]:
print('Train set \n')
train.limit(5).show()
print('Test set')
test.limit(5).show()

Train set 

+---+---------------------+----------------+--------------------+-------------+------+---------+-------+------+--------------------+--------------+-----+-----+-------+---------+--------+--------------------+----------+--------------------+----------+------------------+-----------+--------+
|_c0|trans_date_trans_time|          cc_num|            merchant|     category|   amt|    first|   last|gender|              street|          city|state|  zip|    lat|     long|city_pop|                 job|       dob|           trans_num| unix_time|         merch_lat| merch_long|is_fraud|
+---+---------------------+----------------+--------------------+-------------+------+---------+-------+------+--------------------+--------------+-----+-----+-------+---------+--------+--------------------+----------+--------------------+----------+------------------+-----------+--------+
|  0|  2019-01-01 00:00:18|2703186189652095|fraud_Rippin, Kub...|     misc_net|  4.97| Jennifer|  Banks|     F|    

In [4]:
# dataframes shape
print((train.count(), len(train.columns)))

(1296675, 23)


In [5]:
print((test.count(), len(test.columns)))

(555719, 23)


## Data quality <a name="data_quality"></a>
[Table of contents](#table_cont)

In [6]:
# check if columns have the correct data type
print('Train set data type:')
train.dtypes

## could also have used train.printSchema()

Train set data type:


[('_c0', 'int'),
 ('trans_date_trans_time', 'timestamp'),
 ('cc_num', 'bigint'),
 ('merchant', 'string'),
 ('category', 'string'),
 ('amt', 'double'),
 ('first', 'string'),
 ('last', 'string'),
 ('gender', 'string'),
 ('street', 'string'),
 ('city', 'string'),
 ('state', 'string'),
 ('zip', 'int'),
 ('lat', 'double'),
 ('long', 'double'),
 ('city_pop', 'int'),
 ('job', 'string'),
 ('dob', 'date'),
 ('trans_num', 'string'),
 ('unix_time', 'int'),
 ('merch_lat', 'double'),
 ('merch_long', 'double'),
 ('is_fraud', 'int')]

In [7]:
print('Test set data type:')
test.dtypes

Test set data type:


[('_c0', 'int'),
 ('trans_date_trans_time', 'timestamp'),
 ('cc_num', 'bigint'),
 ('merchant', 'string'),
 ('category', 'string'),
 ('amt', 'double'),
 ('first', 'string'),
 ('last', 'string'),
 ('gender', 'string'),
 ('street', 'string'),
 ('city', 'string'),
 ('state', 'string'),
 ('zip', 'int'),
 ('lat', 'double'),
 ('long', 'double'),
 ('city_pop', 'int'),
 ('job', 'string'),
 ('dob', 'date'),
 ('trans_num', 'string'),
 ('unix_time', 'int'),
 ('merch_lat', 'double'),
 ('merch_long', 'double'),
 ('is_fraud', 'int')]

In [8]:
# separate columns between numeric and strings
numeric_cols = [col[0] for col in train.dtypes if col[1] in ['double','float','int']]
string_cols = [col[0] for col in train.dtypes if col[1] not in ['double','float','int']]

# check for nulls
def count_nulls(df):
  df.select([
      count(when(col(c).isNull() | (col(c) == ""), c)).alias(c) if c in string_cols else
      count(when(col(c).isNull() | isnan(col(c)), c)).alias(c)
      for c in df.columns
  ]).show()

print('Train nulls:')
count_nulls(train)

Train nulls:
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|  0|                    0|     0|       0|       0|  0|    0|   0|     0|     0|   0|    0|  0|  0|   0|       0|  0|  0|        0|        0|        0|         0|       0|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+



In [9]:
print('Test nulls:')
count_nulls(test)

Test nulls:
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|  0|                    0|     0|       0|       0|  0|    0|   0|     0|     0|   0|    0|  0|  0|   0|       0|  0|  0|        0|        0|        0|         0|       0|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+



In [10]:
# check for duplicates
def duplicates (df):
  df.groupBy(df.columns).count().filter("count > 1").show()

duplicates(train)

+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|count|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+



In [11]:
duplicates(test)

+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|count|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+



## Feature Engineering <a name="features"></a>
[Table of contents](#table_cont)

Retail industry know that some dates are better for commerce than others. Those [holidays](https://www.tagalys.com/blog/us-shopping-sales-calendar-2019) have a significant impact on customer behavior, on which they look for deals and increase the chance to fall victim of scams. But not all events are the same, for instance, BlackFriday and Christmas tend to attract more customers due it's increrible deals while other holidays have more modest impact. That's why I'm classifying holidays based on their impact, and therefore creating 'tiers'.

Some details to look out for:
- Deals usually happen during the holiday week, so I'm using the week number to build my variables
- Week number may change due the year, so I'll need to build a flexible way to find the corresponding number
- This is an American dataset. Even if it was created with synthetic data, still follow American behavior and calendar.
- Retail dates:
  - Tier1: Blackfriday (November 4th week), Christmas (last week of december)
  - Tier2: Easter, Independence day (4th of July), Labor day (1st Monday of September), halloween (October last week).
  - No event: the rest of the year

  obs: In the ideal case, I would find the specific week of the year depending on the timestamp. But for this exercise, since I'm only using PySpark to treat the data, I'll create a range of week dates.

In [12]:
def feature_engine(df):
  # extract date, time, create 'age' variable based on dob (date of birth),
  # create 'distance' based on difference between lat,long and merchant_lat,merchant_long
  df = df.withColumn('year', year('trans_date_trans_time'))\
         .withColumn('month', month('trans_date_trans_time'))\
         .withColumn('day_week', dayofweek('trans_date_trans_time'))\
         .withColumn('age', floor(date_diff(current_date(), col('dob'))/365.25))\
         .withColumn('lat_diff', col('lat')-col('merch_lat'))\
         .withColumn('long_diff', col('long')-col('merch_long'))\
         .withColumn('tier1_event', when(weekofyear('trans_date_trans_time').isin([47,48,51,52]), 1).otherwise(0))\
         .withColumn('tier2_event', when(weekofyear('trans_date_trans_time').isin([12,13,14,15,16,17,27,35,36,37,43,44,45]),1).otherwise(0))\
         .withColumn('no_event', when((col('tier1_event')==0) & (col('tier2_event')==0), 1).otherwise(0))
  df = df.drop(*['_c0','trans_date_trans_time','cc_num','first','last','street','dob','trans_num'])

  return df

In [13]:
# crete new dataframes with new features and dropped useless ones
adj_train = feature_engine(train)
adj_test = feature_engine(test)

In [14]:
adj_train.show(2)

+--------------------+-----------+------+------+--------------+-----+-----+-------+---------+--------+--------------------+----------+------------------+-----------+--------+----+-----+--------+---+--------------------+--------------------+-----------+-----------+--------+
|            merchant|   category|   amt|gender|          city|state|  zip|    lat|     long|city_pop|                 job| unix_time|         merch_lat| merch_long|is_fraud|year|month|day_week|age|            lat_diff|           long_diff|tier1_event|tier2_event|no_event|
+--------------------+-----------+------+------+--------------+-----+-----+-------+---------+--------+--------------------+----------+------------------+-----------+--------+----+-----+--------+---+--------------------+--------------------+-----------+-----------+--------+
|fraud_Rippin, Kub...|   misc_net|  4.97|     F|Moravian Falls|   NC|28654|36.0788| -81.1781|    3495|Psychologist, cou...|1325376018|         36.011293| -82.048315|       0|2019

## EDA  <a name="eda"></a>
[Table of contents](#table_cont)

### Fraud

In [15]:
# check distribuition of fraud and non-fraud in my dataset
adj_train.groupby('is_fraud').agg(count('is_fraud'), sum('amt')).show()

+--------+---------------+-------------------+
|is_fraud|count(is_fraud)|           sum(amt)|
+--------+---------------+-------------------+
|       1|           7506| 3988088.6099999975|
|       0|        1289169|8.723434029000089E7|
+--------+---------------+-------------------+



In [16]:
# frauds by year
adj_train.groupby(*['year','is_fraud']).agg(count('is_fraud'), sum('amt'), mean('amt')).show()

+----+--------+---------------+-------------------+-----------------+
|year|is_fraud|count(is_fraud)|           sum(amt)|         avg(amt)|
+----+--------+---------------+-------------------+-----------------+
|2019|       1|           5220| 2767822.8699999964|530.2342662835242|
|2019|       0|         919630|6.221713006000053E7|67.65452416732874|
|2020|       1|           2286|  1220265.740000001| 533.799536307962|
|2020|       0|         369539|2.501721022999985E7|67.69843028746587|
+----+--------+---------------+-------------------+-----------------+



In [17]:
# check distribution of frauds along the year (2019)
adj_train.groupby(*['year','month','is_fraud'])\
         .agg(count('is_fraud'), sum('amt'), mean('amt'))\
         .filter((col('year') == 2019) & (col('is_fraud') == 1))\
         .sort(asc('month')).show()

+----+-----+--------+---------------+------------------+------------------+
|year|month|is_fraud|count(is_fraud)|          sum(amt)|          avg(amt)|
+----+-----+--------+---------------+------------------+------------------+
|2019|    1|       1|            506|261780.38000000015| 517.3525296442691|
|2019|    2|       1|            517|         274051.08| 530.0794584139265|
|2019|    3|       1|            494|237637.59000000003| 481.0477530364373|
|2019|    4|       1|            376| 202067.2899999998| 537.4130053191484|
|2019|    5|       1|            408|210549.11000000002| 516.0517401960784|
|2019|    6|       1|            354|178204.60000000015|503.40282485875747|
|2019|    7|       1|            331|188701.59000000014| 570.0954380664657|
|2019|    8|       1|            382| 203951.1299999999| 533.9034816753924|
|2019|    9|       1|            418|217675.37000000002| 520.7544736842106|
|2019|   10|       1|            454|257739.72000000023|  567.708634361234|
|2019|   11|

In [18]:
# check distribution of frauds along the year (2020)
adj_train.groupby(*['year','month','is_fraud'])\
         .agg(count('is_fraud'), sum('amt'), mean('amt'))\
         .filter((col('year') == 2020) & (col('is_fraud') == 1))\
         .sort(asc('month')).show()

+----+-----+--------+---------------+------------------+------------------+
|year|month|is_fraud|count(is_fraud)|          sum(amt)|          avg(amt)|
+----+-----+--------+---------------+------------------+------------------+
|2020|    1|       1|            343|182595.36000000007| 532.3479883381926|
|2020|    2|       1|            336|183950.10999999993| 547.4705654761902|
|2020|    3|       1|            444|234090.09999999986| 527.2299549549547|
|2020|    4|       1|            302| 152173.9799999999|503.88735099337714|
|2020|    5|       1|            527|287226.37999999995| 545.0215939278936|
|2020|    6|       1|            334|180229.81000000008| 539.6102095808386|
+----+-----+--------+---------------+------------------+------------------+



In [19]:
# check if fraud distribution is the same in test set
adj_test.groupby('is_fraud').count().show()

+--------+------+
|is_fraud| count|
+--------+------+
|       1|  2145|
|       0|553574|
+--------+------+



### Merchant

In [20]:
# unique merchants
adj_train.select('merchant').distinct().count()

693

In [21]:
# main merchants
adj_train.filter(col('is_fraud') == 1)\
         .groupby(*['merchant','is_fraud']).agg(count('is_fraud'), sum('amt'), mean('amt'))\
         .sort(desc('count(is_fraud)')).show(10)

+--------------------+--------+---------------+------------------+------------------+
|            merchant|is_fraud|count(is_fraud)|          sum(amt)|          avg(amt)|
+--------------------+--------+---------------+------------------+------------------+
|  fraud_Rau and Sons|       1|             49|          15299.76|            312.24|
|   fraud_Cormier LLC|       1|             48|          44903.89| 935.4977083333333|
|   fraud_Kozey-Boehm|       1|             48|48189.979999999996|1003.9579166666666|
|fraud_Vandervort-...|       1|             47|14973.760000000002| 318.5906382978724|
|   fraud_Kilback LLC|       1|             47|          13039.28| 277.4314893617021|
|     fraud_Doyle Ltd|       1|             47|14807.910000000002|315.06191489361703|
|      fraud_Kuhn LLC|       1|             44|37166.299999999996| 844.6886363636363|
| fraud_Padberg-Welch|       1|             44|          13713.45| 311.6693181818182|
|    fraud_Terry-Huel|       1|             43|       

In [22]:
# main merchants
adj_train.filter(col('is_fraud') == 1)\
         .groupby(*['merchant','is_fraud']).agg(count('is_fraud'), sum('amt'), mean('amt'))\
         .sort(desc('sum(amt)')).show(10)

+--------------------+--------+---------------+------------------+------------------+
|            merchant|is_fraud|count(is_fraud)|          sum(amt)|          avg(amt)|
+--------------------+--------+---------------+------------------+------------------+
|   fraud_Kozey-Boehm|       1|             48|48189.979999999996|1003.9579166666666|
|   fraud_Cormier LLC|       1|             48|          44903.89| 935.4977083333333|
|      fraud_Jast Ltd|       1|             42|          42560.34|1013.3414285714285|
|    fraud_Terry-Huel|       1|             43|          42356.37| 985.0318604651163|
|   fraud_Goyette Inc|       1|             42|41580.840000000004| 990.0200000000001|
|fraud_Kerluke-Abs...|       1|             41|          40909.57| 997.7943902439024|
|fraud_Schmeler, B...|       1|             41|40143.049999999996| 979.0987804878048|
|fraud_Gleason-Mac...|       1|             40|          39892.84| 997.3209999999999|
|fraud_Kuhic, Bins...|       1|             39|       

In [23]:
# top merchant & CC fraud 2019 by count
adj_train.filter((col('year') == 2019) & (col('is_fraud') == 1))\
         .groupby(*['year','merchant','is_fraud']).agg(count('is_fraud'), sum('amt'), mean('amt'))\
         .sort(desc('count(is_fraud)')).show(10)

+----+--------------------+--------+---------------+------------------+------------------+
|year|            merchant|is_fraud|count(is_fraud)|          sum(amt)|          avg(amt)|
+----+--------------------+--------+---------------+------------------+------------------+
|2019|  fraud_Hudson-Ratke|       1|             37|11690.240000000002| 315.9524324324325|
|2019|   fraud_Kilback LLC|       1|             35|           9816.91|280.48314285714287|
|2019|  fraud_Rau and Sons|       1|             33|          10301.79|312.17545454545456|
|2019|   fraud_Cormier LLC|       1|             32|29125.440000000002| 910.1700000000001|
|2019|fraud_Gleason-Mac...|       1|             32|          32168.85|      1005.2765625|
|2019| fraud_Koepp-Witting|       1|             32| 9795.609999999999|306.11281249999996|
|2019|      fraud_Kuhn LLC|       1|             32|          26354.69|       823.5840625|
|2019|     fraud_Kuhic LLC|       1|             31|          31563.39|1018.1738709677419|

In [24]:
# top merchant & CC fraud 2019 by total spent
adj_train.filter((col('year') == 2019) & (col('is_fraud') == 1))\
         .groupby(*['year','merchant','is_fraud']).agg(count('is_fraud'), sum('amt'), mean('amt'))\
         .sort(desc('sum(amt)')).show(10)

+----+--------------------+--------+---------------+------------------+------------------+
|year|            merchant|is_fraud|count(is_fraud)|          sum(amt)|          avg(amt)|
+----+--------------------+--------+---------------+------------------+------------------+
|2019|fraud_Gleason-Mac...|       1|             32|          32168.85|      1005.2765625|
|2019|     fraud_Kuhic LLC|       1|             31|          31563.39|1018.1738709677419|
|2019|fraud_Boyer-Reichert|       1|             31|31545.469999999998|1017.5958064516128|
|2019|    fraud_Terry-Huel|       1|             31|30626.420000000002| 987.9490322580646|
|2019|   fraud_Kozey-Boehm|       1|             30|30325.079999999998|1010.8359999999999|
|2019|fraud_Towne, Gree...|       1|             30|30236.109999999997|1007.8703333333332|
|2019|   fraud_Cormier LLC|       1|             32|29125.440000000002| 910.1700000000001|
|2019|fraud_Fisher-Scho...|       1|             29|28825.259999999995| 993.9744827586205|

In [25]:
# top merchant & CC fraud 2020 by count
adj_train.filter((col('year') == 2020) & (col('is_fraud') == 1))\
         .groupby(*['year','merchant','is_fraud']).agg(count('is_fraud'), sum('amt'), mean('amt'))\
         .sort(desc('count(is_fraud)')).show(10)

+----+--------------------+--------+---------------+------------------+------------------+
|year|            merchant|is_fraud|count(is_fraud)|          sum(amt)|          avg(amt)|
+----+--------------------+--------+---------------+------------------+------------------+
|2020|     fraud_Doyle Ltd|       1|             23| 7299.240000000002| 317.3582608695653|
|2020|fraud_Kerluke-Abs...|       1|             19|18476.069999999996| 972.4247368421051|
|2020|   fraud_Kozey-Boehm|       1|             18|17864.899999999998| 992.4944444444443|
|2020|     fraud_Kiehn Inc|       1|             16| 4944.869999999999|309.05437499999994|
|2020|fraud_Vandervort-...|       1|             16|           4880.71|        305.044375|
|2020|   fraud_Cormier LLC|       1|             16|          15778.45|        986.153125|
|2020|fraud_Moen, Reing...|       1|             16|           4903.06|         306.44125|
|2020|  fraud_Rau and Sons|       1|             16|           4997.97|        312.373125|

In [26]:
# top merchant & CC fraud 2020 by total spent
adj_train.filter((col('year') == 2020) & (col('is_fraud') == 1))\
         .groupby(*['year','merchant','is_fraud']).agg(count('is_fraud'), sum('amt'), mean('amt'))\
         .sort(desc('sum(amt)')).show(10)

+----+--------------------+--------+---------------+------------------+------------------+
|year|            merchant|is_fraud|count(is_fraud)|          sum(amt)|          avg(amt)|
+----+--------------------+--------+---------------+------------------+------------------+
|2020|fraud_Kerluke-Abs...|       1|             19|18476.069999999996| 972.4247368421051|
|2020|   fraud_Kozey-Boehm|       1|             18|17864.899999999998| 992.4944444444443|
|2020|    fraud_Fisher Inc|       1|             15|          15830.84|1055.3893333333333|
|2020|   fraud_Cormier LLC|       1|             16|          15778.45|        986.153125|
|2020|      fraud_Jast Ltd|       1|             15|          15643.32|          1042.888|
|2020|fraud_Langworth, ...|       1|             16|15590.740000000002| 974.4212500000001|
|2020|    fraud_Schumm PLC|       1|             13|13649.819999999998|1049.9861538461537|
|2020|   fraud_Goyette Inc|       1|             13|          13515.88| 1039.683076923077|

### Category

In [27]:
# unique categories
adj_train.select('category').distinct().count()

14

In [28]:
# main frauded categories
adj_train.groupby(*['category','is_fraud'])\
         .agg(count('is_fraud'), sum('amt'))\
         .filter(col('is_fraud') == 1)\
         .sort(desc('sum(amt)')).show(14)

+--------------+--------+---------------+------------------+
|      category|is_fraud|count(is_fraud)|          sum(amt)|
+--------------+--------+---------------+------------------+
|  shopping_net|       1|           1713|1711723.7099999983|
|  shopping_pos|       1|            843|         739245.09|
|      misc_net|       1|            915| 729266.7599999999|
|   grocery_pos|       1|           1743| 543797.8999999999|
| entertainment|       1|            233|117323.79000000002|
|      misc_pos|       1|            250| 54571.01999999999|
|          home|       1|            198|          50971.66|
|   food_dining|       1|            151|          18131.62|
| gas_transport|       1|            618| 7594.109999999998|
| personal_care|       1|            220|           5757.52|
|     kids_pets|       1|            239|           4331.08|
|health_fitness|       1|            133|           2693.04|
|   grocery_net|       1|            134|           1629.82|
|        travel|       1

In [29]:
# categories & CC fraud 2019
adj_train.groupby(*['year','category','is_fraud'])\
         .count().filter((col('year') == 2019) & (col('is_fraud') == 1))\
        .sort(desc('count')).show(14)

+----+--------------+--------+-----+
|year|      category|is_fraud|count|
+----+--------------+--------+-----+
|2019|   grocery_pos|       1| 1202|
|2019|  shopping_net|       1| 1201|
|2019|      misc_net|       1|  629|
|2019|  shopping_pos|       1|  583|
|2019| gas_transport|       1|  439|
|2019|     kids_pets|       1|  172|
|2019|      misc_pos|       1|  170|
|2019| entertainment|       1|  163|
|2019| personal_care|       1|  152|
|2019|          home|       1|  129|
|2019|   food_dining|       1|  104|
|2019|health_fitness|       1|   96|
|2019|   grocery_net|       1|   94|
|2019|        travel|       1|   86|
+----+--------------+--------+-----+



In [30]:
# Higher amount by cartegory and frauded in 2019
adj_train.groupby(*['year','category','is_fraud']).sum('amt')\
         .filter((col('year') == 2019) & (col('is_fraud') == 1))\
         .sort(desc('sum(amt)')).show(14)

+----+--------------+--------+------------------+
|year|      category|is_fraud|          sum(amt)|
+----+--------------+--------+------------------+
|2019|  shopping_net|       1|1202563.4199999995|
|2019|  shopping_pos|       1| 512650.4299999999|
|2019|      misc_net|       1| 500575.6700000001|
|2019|   grocery_pos|       1|         375168.02|
|2019| entertainment|       1| 82990.86000000002|
|2019|          home|       1|33465.909999999996|
|2019|      misc_pos|       1|31586.319999999992|
|2019|   food_dining|       1|12401.849999999999|
|2019| gas_transport|       1| 5433.089999999998|
|2019| personal_care|       1| 4024.290000000001|
|2019|     kids_pets|       1|3076.3100000000004|
|2019|health_fitness|       1|           1944.62|
|2019|   grocery_net|       1|           1172.03|
|2019|        travel|       1|            770.05|
+----+--------------+--------+------------------+



In [31]:
# Higher amount by cartegory and frauded in 2020
adj_train.groupby(*['year','category','is_fraud'])\
         .sum('amt').filter((col('year') == 2020) & (col('is_fraud') == 1))\
         .sort(desc('sum(amt)')).show(14)

+----+--------------+--------+------------------+
|year|      category|is_fraud|          sum(amt)|
+----+--------------+--------+------------------+
|2020|  shopping_net|       1| 509160.2899999995|
|2020|      misc_net|       1|228691.08999999997|
|2020|  shopping_pos|       1|226594.66000000006|
|2020|   grocery_pos|       1|168629.87999999998|
|2020| entertainment|       1| 34332.92999999999|
|2020|      misc_pos|       1|22984.699999999997|
|2020|          home|       1|          17505.75|
|2020|   food_dining|       1|           5729.77|
|2020| gas_transport|       1|2161.0199999999995|
|2020| personal_care|       1| 1733.229999999999|
|2020|     kids_pets|       1|1254.7699999999998|
|2020|health_fitness|       1| 748.4199999999998|
|2020|   grocery_net|       1|457.78999999999996|
|2020|        travel|       1|281.44000000000005|
+----+--------------+--------+------------------+



### Gender

In [32]:
# gender distribution
adj_train.groupby('gender').count().sort(desc('count')).show()

+------+------+
|gender| count|
+------+------+
|     F|709863|
|     M|586812|
+------+------+



In [33]:
# which gender suffers more with CC fraud
adj_train.groupby(*['gender','is_fraud']).count().sort(asc('count')).show()

+------+--------+------+
|gender|is_fraud| count|
+------+--------+------+
|     F|       1|  3735|
|     M|       1|  3771|
|     M|       0|583041|
|     F|       0|706128|
+------+--------+------+



In [34]:
# which gender spends more with CC fraud
adj_train.groupby(*['gender','is_fraud']).sum('amt')\
         .filter(col('is_fraud')==1).sort(desc('sum(amt)')).show()

+------+--------+------------------+
|gender|is_fraud|          sum(amt)|
+------+--------+------------------+
|     M|       1|2142801.2700000023|
|     F|       1|1845287.3399999982|
+------+--------+------------------+



In [35]:
# how much each gender spended if they suffered fraud (2019)
adj_train.groupby(*['year','category','gender','is_fraud'])\
         .sum('amt').filter((col('year') == 2019) & (col('is_fraud') == 1))\
         .sort(desc('sum(amt)')).show(14)

+----+-------------+------+--------+------------------+
|year|     category|gender|is_fraud|          sum(amt)|
+----+-------------+------+--------+------------------+
|2019| shopping_net|     M|       1| 735598.6099999999|
|2019| shopping_net|     F|       1|466964.80999999994|
|2019|     misc_net|     M|       1|284502.30000000005|
|2019| shopping_pos|     F|       1|         282235.86|
|2019| shopping_pos|     M|       1|230414.56999999995|
|2019|     misc_net|     F|       1|216073.37000000005|
|2019|  grocery_pos|     M|       1|212679.80000000005|
|2019|  grocery_pos|     F|       1|162488.22000000006|
|2019|entertainment|     F|       1| 49913.71000000001|
|2019|entertainment|     M|       1|33077.149999999994|
|2019|     misc_pos|     F|       1|          30448.77|
|2019|         home|     F|       1|          29845.39|
|2019|  food_dining|     F|       1| 9263.490000000002|
|2019|         home|     M|       1|3620.5200000000004|
+----+-------------+------+--------+------------

In [36]:
# how much each gender spended if they suffered fraud (2020)
adj_train.groupby(*['year','category','gender','is_fraud'])\
         .sum('amt').filter((col('year') == 2020) & (col('is_fraud') == 1))\
         .sort(desc('sum(amt)')).show(14)

+----+-------------+------+--------+------------------+
|year|     category|gender|is_fraud|          sum(amt)|
+----+-------------+------+--------+------------------+
|2020| shopping_net|     M|       1| 288958.1199999999|
|2020| shopping_net|     F|       1|220202.17000000013|
|2020|     misc_net|     M|       1|         131204.01|
|2020| shopping_pos|     F|       1|122150.40999999997|
|2020| shopping_pos|     M|       1|104444.24999999997|
|2020|     misc_net|     F|       1| 97487.08000000003|
|2020|  grocery_pos|     M|       1| 91167.47000000002|
|2020|  grocery_pos|     F|       1|          77462.41|
|2020|entertainment|     F|       1|          22900.44|
|2020|     misc_pos|     F|       1|          22635.95|
|2020|         home|     F|       1|16300.210000000001|
|2020|entertainment|     M|       1|11432.490000000002|
|2020|  food_dining|     F|       1|           4634.48|
|2020|gas_transport|     M|       1|           1245.37|
+----+-------------+------+--------+------------

### Age

In [37]:
# age distribution if CC frauded
adj_train.groupby(*['age','is_fraud'])\
         .count().filter(col('is_fraud')==1)\
         .sort(desc('count')).show()

+---+--------+-----+
|age|is_fraud|count|
+---+--------+-----+
| 34|       1|  196|
| 48|       1|  181|
| 31|       1|  179|
| 50|       1|  178|
| 63|       1|  177|
| 38|       1|  174|
| 37|       1|  172|
| 52|       1|  171|
| 57|       1|  167|
| 54|       1|  163|
| 53|       1|  161|
| 27|       1|  156|
| 35|       1|  155|
| 29|       1|  154|
| 61|       1|  153|
| 58|       1|  151|
| 55|       1|  142|
| 68|       1|  141|
| 39|       1|  140|
| 46|       1|  139|
+---+--------+-----+
only showing top 20 rows



In [38]:
## frauded cc age & sex distribution
adj_train.groupby(*['age','gender','is_fraud'])\
         .count().filter(col('is_fraud')==1)\
         .sort(desc('count')).show()

+---+------+--------+-----+
|age|gender|is_fraud|count|
+---+------+--------+-----+
| 34|     M|       1|  146|
| 53|     F|       1|  130|
| 57|     M|       1|  109|
| 52|     F|       1|  107|
| 44|     M|       1|  104|
| 48|     M|       1|  104|
| 55|     M|       1|  103|
| 37|     M|       1|  101|
| 38|     F|       1|   99|
| 58|     M|       1|   97|
| 61|     F|       1|   96|
| 46|     M|       1|   96|
| 31|     F|       1|   94|
| 63|     F|       1|   93|
| 50|     F|       1|   93|
| 39|     F|       1|   92|
| 60|     M|       1|   86|
| 50|     M|       1|   85|
| 35|     M|       1|   85|
| 31|     M|       1|   85|
+---+------+--------+-----+
only showing top 20 rows



In [39]:
# amount spent by age with frauded CC
adj_train.groupby(*['age','is_fraud'])\
         .sum('amt').filter(col('is_fraud')==1)\
         .sort(desc('sum(amt)')).show(5)

+---+--------+------------------+
|age|is_fraud|          sum(amt)|
+---+--------+------------------+
| 34|       1|109248.05999999998|
| 63|       1|102921.00000000001|
| 37|       1|          95389.21|
| 48|       1| 91865.23000000001|
| 29|       1| 90028.45999999999|
+---+--------+------------------+
only showing top 5 rows



In [40]:
# total spend on frauded cc by gender and age
adj_train.groupby(*['age','gender','is_fraud'])\
         .sum('amt').filter(col('is_fraud')==1)\
         .sort(desc('sum(amt)')).show(5)

+---+------+--------+------------------+
|age|gender|is_fraud|          sum(amt)|
+---+------+--------+------------------+
| 34|     M|       1| 86134.79999999999|
| 37|     M|       1| 66199.59999999999|
| 48|     M|       1|64653.619999999995|
| 55|     M|       1|          63821.48|
| 57|     M|       1|58460.399999999994|
+---+------+--------+------------------+
only showing top 5 rows



In [41]:
# average age by gender and amount spent in CC fraud
adj_train.groupby(*['age','gender','is_fraud']).agg(mean('age'),sum('amt'))\
         .filter(col('is_fraud')==1)\
         .sort(desc('sum(amt)')).show(5)

+---+------+--------+--------+------------------+
|age|gender|is_fraud|avg(age)|          sum(amt)|
+---+------+--------+--------+------------------+
| 34|     M|       1|    34.0| 86134.79999999999|
| 37|     M|       1|    37.0| 66199.59999999999|
| 48|     M|       1|    48.0|64653.619999999995|
| 55|     M|       1|    55.0|          63821.48|
| 57|     M|       1|    57.0|58460.399999999994|
+---+------+--------+--------+------------------+
only showing top 5 rows



### Location (State, city, population size)

In [42]:
# unique states
adj_train.select('state').distinct().count()

51

In [43]:
# main states with CC fraud
adj_train.groupby(*['state','is_fraud'])\
         .count().filter(col('is_fraud') == 1)\
         .sort(desc('count')).show(5)

+-----+--------+-----+
|state|is_fraud|count|
+-----+--------+-----+
|   NY|       1|  555|
|   TX|       1|  479|
|   PA|       1|  458|
|   CA|       1|  326|
|   OH|       1|  321|
+-----+--------+-----+
only showing top 5 rows



In [44]:
# total amount by state with CC fraud
adj_train.groupby(*['state','is_fraud'])\
         .sum('amt').filter(col('is_fraud') == 1)\
         .sort(desc('sum(amt)')).show(5)

+-----+--------+------------------+
|state|is_fraud|          sum(amt)|
+-----+--------+------------------+
|   NY|       1|295548.64000000013|
|   TX|       1| 265806.4100000001|
|   PA|       1|244624.67000000004|
|   CA|       1|         170943.92|
|   OH|       1|         168919.98|
+-----+--------+------------------+
only showing top 5 rows



In [45]:
# most populated state and CC fraud amount
adj_train.groupby(*['state','is_fraud'])\
         .sum(*['city_pop','amt'])\
         .filter(col('is_fraud') == 1)\
         .sort(desc('sum(city_pop)')).show(5)

+-----+--------+-------------+------------------+
|state|is_fraud|sum(city_pop)|          sum(amt)|
+-----+--------+-------------+------------------+
|   TX|       1|    206734477| 265806.4100000001|
|   CA|       1|     89649164|         170943.92|
|   NY|       1|     79853780|295548.64000000013|
|   FL|       1|     45649842|         150913.03|
|   MN|       1|     28231300|         112454.39|
+-----+--------+-------------+------------------+
only showing top 5 rows



In [46]:
# most populated city and CC fraud amount
adj_train.groupby(*['state','city','is_fraud'])\
         .sum(*['city_pop','amt'])\
         .filter(col('is_fraud') == 1)\
         .sort(desc('sum(city_pop)')).show(5)

+-----+-------------+--------+-------------+------------------+
|state|         city|is_fraud|sum(city_pop)|          sum(amt)|
+-----+-------------+--------+-------------+------------------+
|   TX|      Houston|       1|    113361300|          21667.21|
|   TX|  San Antonio|       1|     39894925|14536.749999999996|
|   NY|New York City|       1|     36279855|13136.859999999999|
|   TX|       Dallas|       1|     34109667|19747.140000000003|
|   NY|     Brooklyn|       1|     25047000|           7435.38|
+-----+-------------+--------+-------------+------------------+
only showing top 5 rows



In [47]:
# average age per city and CC fraud amount
adj_train.groupby(*['state','city','gender','is_fraud'])\
         .agg(floor(mean('age')), sum('city_pop'), sum('amt'))\
         .filter(col('is_fraud') == 1)\
         .sort(desc('sum(amt)')).show(5)

+-----+-------------+------+--------+---------------+-------------+------------------+
|state|         city|gender|is_fraud|FLOOR(avg(age))|sum(city_pop)|          sum(amt)|
+-----+-------------+------+--------+---------------+-------------+------------------+
|   OK|        Tulsa|     F|       1|             68|     11166498|17470.250000000004|
|   NY|New York City|     F|       1|             60|     36279855|13136.859999999999|
|   TX|  San Antonio|     M|       1|             39|     28724346|13040.649999999996|
|   FL|       Naples|     F|       1|             81|      5520040|13018.669999999998|
|   TX|      Houston|     M|       1|             38|     63947400|12979.279999999999|
+-----+-------------+------+--------+---------------+-------------+------------------+
only showing top 5 rows



### Occupation (job title)

In [48]:
# unique jobs
adj_train.select('job').distinct().count()

494

In [49]:
# job with most frauded
adj_train.groupby(*['job', 'is_fraud'])\
         .count().filter(col('is_fraud') == 1)\
         .sort(desc('count')).show(5)

+--------------------+--------+-----+
|                 job|is_fraud|count|
+--------------------+--------+-----+
|  Materials engineer|       1|   62|
|Trading standards...|       1|   56|
|     Naval architect|       1|   53|
| Exhibition designer|       1|   51|
|Surveyor, land/ge...|       1|   50|
+--------------------+--------+-----+
only showing top 5 rows



In [50]:
# find most common words for jobs, so I can create macro categories
adj_train.withColumn('job_macro', explode(split(lower(col('job')), '[ , ]'))) \
          .groupBy('job_macro') \
          .count() \
          .filter((col('job_macro') != 'and') & (col('job_macro') != ''))\
          .orderBy(desc('count')) \
          .show(20)

+-------------+------+
|    job_macro| count|
+-------------+------+
|     engineer|131756|
|      officer|110915|
|      manager| 61124|
|    scientist| 55878|
|     designer| 52218|
|     surveyor| 49062|
|      teacher| 38126|
| psychologist| 32600|
|     research| 29754|
|       editor| 28725|
|    education| 26624|
|       public| 26116|
|    therapist| 25110|
|   consultant| 24785|
|        chief| 23081|
|    chartered| 19009|
|  development| 17943|
|       health| 17300|
|administrator| 16988|
|   researcher| 16001|
+-------------+------+
only showing top 20 rows



In [51]:
# create occupation clusters based on the most used words for jobs
adj_train = adj_train.withColumn('job_macro', explode(split(lower(col('job')), '[ , ]')))
adj_test = adj_train.withColumn('job_macro', explode(split(lower(col('job')), '[ , ]')))

adj_train = adj_train.withColumn('job_category',
                        when(lower(col('job_macro')).isin(['engineer', 'scientist', 'development', 'researcher', 'research']), 'science')
                        .when(lower(col('job_macro')).isin(['manager', 'chief', 'administrator', 'consultant']), 'business')
                        .when(lower(col('job_macro')).isin(['teacher', 'education']), 'education')
                        .when(lower(col('job_macro')).isin(['health', 'therapist', 'psychologist']), 'healthcare')
                        .when(lower(col('job_macro')).isin(['editor', 'public']), 'comunication')
                        .when(lower(col('job_macro')).isin(['designer', 'surveyor']), 'desing')
                        .otherwise('others'))

# do the same with test set
adj_test = adj_test.withColumn('job_category',
                        when(lower(col('job_macro')).isin(['engineer', 'scientist', 'development', 'researcher', 'research']), 'science')
                        .when(lower(col('job_macro')).isin(['manager', 'chief', 'administrator', 'consultant']), 'business')
                        .when(lower(col('job_macro')).isin(['teacher', 'education']), 'education')
                        .when(lower(col('job_macro')).isin(['health', 'therapist', 'psychologist']), 'healthcare')
                        .when(lower(col('job_macro')).isin(['editor', 'public']), 'comunication')
                        .when(lower(col('job_macro')).isin(['designer', 'surveyor']), 'desing')
                        .otherwise('others'))

adj_train.show(4)

+--------------------+-----------+------+------+--------------+-----+-----+-------+---------+--------+--------------------+----------+------------------+-----------+--------+----+-----+--------+---+--------------------+--------------------+-----------+-----------+--------+------------+------------+
|            merchant|   category|   amt|gender|          city|state|  zip|    lat|     long|city_pop|                 job| unix_time|         merch_lat| merch_long|is_fraud|year|month|day_week|age|            lat_diff|           long_diff|tier1_event|tier2_event|no_event|   job_macro|job_category|
+--------------------+-----------+------+------+--------------+-----+-----+-------+---------+--------+--------------------+----------+------------------+-----------+--------+----+-----+--------+---+--------------------+--------------------+-----------+-----------+--------+------------+------------+
|fraud_Rippin, Kub...|   misc_net|  4.97|     F|Moravian Falls|   NC|28654|36.0788| -81.1781|    349

In [52]:
# fraud by job category
adj_train.groupby(*['job_category', 'is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(7)

+------------+--------+-----+
|job_category|is_fraud|count|
+------------+--------+-----+
|      others|       1|15037|
|     science|       1| 1369|
|    business|       1|  676|
|      desing|       1|  564|
|   education|       1|  456|
|  healthcare|       1|  367|
|comunication|       1|  250|
+------------+--------+-----+



In [53]:
# amount by job category when CC frauded
adj_train.groupby(*['job_category', 'is_fraud']).sum('amt').filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show(7)

+------------+--------+------------------+
|job_category|is_fraud|          sum(amt)|
+------------+--------+------------------+
|      others|       1| 7974167.390000008|
|     science|       1| 718112.9300000002|
|    business|       1|337418.98000000004|
|      desing|       1| 304814.8100000001|
|   education|       1|235717.29999999993|
|  healthcare|       1|          202879.2|
|comunication|       1|130377.24999999999|
+------------+--------+------------------+



### Latitude and Longitude

In [54]:
# avarage difference between customer and merchant latitude
adj_train.groupby('is_fraud').mean('lat_diff').show()

+--------+--------------------+
|is_fraud|       avg(lat_diff)|
+--------+--------------------+
|       1|0.010138428495111993|
|       0| 1.36273459634889E-4|
+--------+--------------------+



In [55]:
# avarage difference between customer and merchant longitude
adj_train.groupby('is_fraud').mean('long_diff').show()

+--------+--------------------+
|is_fraud|      avg(long_diff)|
+--------+--------------------+
|       1|8.399439072599181E-4|
|       0|7.071568078397803E-4|
+--------+--------------------+



### Unix
*Unix timestamp is a way to track time as a running total of seconds. This  ount starts at the Unix Epoch on January 1st, 1970 at UTC. Therefore, the Unix timestamp is merely the number of seconds between a particular date and the Unix Epoch. It should also be pointed out that this point in time technically does not change no matter where you are located on the globe. This is very useful to computer systems for tracking and sorting dated information in dynamic and distributed applications both online and client-side. The reason why Unix timestamps are used by many webmasters is that they can represent all time zones at once.*

In [56]:
# avarage unix time
adj_train.groupby('is_fraud').mean('unix_time').show()

+--------+--------------------+
|is_fraud|      avg(unix_time)|
+--------+--------------------+
|       1|1.3481615599279876E9|
|       0|1.3492567481422484E9|
+--------+--------------------+



### Day and Hour

In [57]:
# is there a spefic time for fraud?
# adj_train.groupby(*['time', 'is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(10)

In [58]:
# amount frauded by hour
# adj_train.groupby(*['time', 'is_fraud']).sum('amt').filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show(10)

In [59]:
# does fraud occur in a specific day of the week?
adj_train.groupby(*['day_week', 'is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(10)

+--------+--------+-----+
|day_week|is_fraud|count|
+--------+--------+-----+
|       7|       1| 3128|
|       2|       1| 2993|
|       1|       1| 2954|
|       6|       1| 2718|
|       5|       1| 2483|
|       3|       1| 2370|
|       4|       1| 2073|
+--------+--------+-----+



In [60]:
# amount frauded by day of week
adj_train.groupby(*['day_week', 'is_fraud']).sum('amt').filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show(10)

+--------+--------+------------------+
|day_week|is_fraud|          sum(amt)|
+--------+--------+------------------+
|       7|       1|1635887.2400000005|
|       1|       1|1616182.6999999993|
|       2|       1|1493520.4499999983|
|       6|       1|1487687.5299999993|
|       5|       1| 1347469.530000001|
|       3|       1|1245697.5300000003|
|       4|       1|        1077042.88|
+--------+--------+------------------+



### Holidays

In [61]:
# tier1 events
adj_train.groupby(*['year','tier1_event', 'is_fraud']).agg(count('is_fraud'), sum('amt')).filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show()

+----+-----------+--------+---------------+------------------+
|year|tier1_event|is_fraud|count(is_fraud)|          sum(amt)|
+----+-----------+--------+---------------+------------------+
|2019|          0|       1|          11976|  6297632.81000001|
|2020|          0|       1|           5566|2964485.1800000058|
|2019|          1|       1|           1177| 641369.8700000012|
+----+-----------+--------+---------------+------------------+



In [62]:
adj_train.groupby(*['year','tier2_event', 'is_fraud']).agg(count('is_fraud'), sum('amt')).filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show()

+----+-----------+--------+---------------+------------------+
|year|tier2_event|is_fraud|count(is_fraud)|          sum(amt)|
+----+-----------+--------+---------------+------------------+
|2019|          0|       1|          10248| 5398151.310000012|
|2020|          0|       1|           4456|2404682.5300000003|
|2019|          1|       1|           2905|1540851.3699999987|
|2020|          1|       1|           1110| 559802.6500000007|
+----+-----------+--------+---------------+------------------+



In [63]:
adj_train.groupby(*['year','no_event', 'is_fraud']).agg(count('is_fraud'), sum('amt')).filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show()

+----+--------+--------+---------------+------------------+
|year|no_event|is_fraud|count(is_fraud)|          sum(amt)|
+----+--------+--------+---------------+------------------+
|2019|       1|       1|           9071|4756781.4400000125|
|2020|       1|       1|           4456|2404682.5300000003|
|2019|       0|       1|           4082|        2182221.24|
|2020|       0|       1|           1110| 559802.6500000007|
+----+--------+--------+---------------+------------------+



### Statistics

In [64]:
# basic dataset statistics
adj_train.summary().show()

+-------+-------------------+-------------+------------------+-------+-------+-------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+------------------+------------------+--------------------+------------------+-----------------+------------------+------------------+--------------------+--------------------+-------------------+-------------------+------------------+---------+------------+
|summary|           merchant|     category|               amt| gender|   city|  state|              zip|              lat|              long|         city_pop|               job|           unix_time|         merch_lat|        merch_long|            is_fraud|              year|            month|          day_week|               age|            lat_diff|           long_diff|        tier1_event|        tier2_event|          no_event|job_macro|job_category|
+-------+-------------------+-------------+------------------+-------+-------+------

In [65]:
adj_train.show(2)

+--------------------+--------+----+------+--------------+-----+-----+-------+--------+--------+--------------------+----------+---------+----------+--------+----+-----+--------+---+------------------+------------------+-----------+-----------+--------+------------+------------+
|            merchant|category| amt|gender|          city|state|  zip|    lat|    long|city_pop|                 job| unix_time|merch_lat|merch_long|is_fraud|year|month|day_week|age|          lat_diff|         long_diff|tier1_event|tier2_event|no_event|   job_macro|job_category|
+--------------------+--------+----+------+--------------+-----+-----+-------+--------+--------+--------------------+----------+---------+----------+--------+----+-----+--------+---+------------------+------------------+-----------+-----------+--------+------------+------------+
|fraud_Rippin, Kub...|misc_net|4.97|     F|Moravian Falls|   NC|28654|36.0788|-81.1781|    3495|Psychologist, cou...|1325376018|36.011293|-82.048315|       0|20

## Data Preparation <a name="data_prep"></a>
[Table of contents](#table_cont)

In [66]:
# select top 5 merchants with the highest amount spent in frauded CC
sus_merchants = adj_train.groupby(*['merchant', 'is_fraud'])\
                         .sum('amt').filter(col('is_fraud') == 1)\
                         .sort(desc('sum(amt)'))\
                         .select('merchant')\
                         .limit(5).collect()

# the issue is that .collect() will get the list of rows, and I want just the name of the merchants
sus_merchants = [row['merchant'] for row in sus_merchants]

In [67]:
# create a column for suspicious merchants
adj_train = adj_train.withColumn('sus_mercht',
                                 when(col('merchant').isin(sus_merchants), 1).otherwise(0))

# do the same for test set
adj_test = adj_test.withColumn('sus_mercht',
                               when(col('merchant').isin(sus_merchants),1).otherwise(0))

In [68]:
# make list for category
category_list = adj_train.select('category').distinct().collect()
category_list = [row['category'] for row in category_list]

# make a list for job_category
job_list = adj_train.select('job_category').distinct().collect()
job_list = [row['job_category'] for row in job_list]

In [69]:
# create loop for category
for cat in category_list:
  adj_train = adj_train.withColumn(f'{cat}', when(col('category') == cat ,1).otherwise(0))
  adj_test = adj_test.withColumn(f'{cat}', when(col('category') == cat ,1).otherwise(0))

# create loop for job_category
for job in job_list:
  adj_train = adj_train.withColumn(f'{job}', when(col('job_category') == job ,1).otherwise(0))
  adj_test = adj_test.withColumn(f'{job}', when(col('job_category') == job ,1).otherwise(0))

# create dummy for gender
adj_train = adj_train.withColumn('female', when(col('gender') == 'F', 1).otherwise(0))
adj_test = adj_test.withColumn('female', when(col('gender') == 'F', 1).otherwise(0))


In [70]:
# drop categorical variables
final_train = adj_train.drop(*['merchant','category', 'gender','city','state','job','job_macro','job_category'])
final_test = adj_test.drop(*['merchant','category', 'gender','city','state','job','job_macro','job_category'])

## ML Models <a name="ml_mod"></a>
[Table of contents](#table_cont)

### Logit <a name="logit"></a>

In [71]:
# dealing with umlabanced dataset
fraud_count = final_train.filter(col('is_fraud') == 1).count()
non_fraud_count = final_train.filter(col('is_fraud') == 0).count()

# calculating weights, it will increase minority class weight and increase penalization for miss classifications
weight_0 = final_train.count() / non_fraud_count
weight_1 = final_train.count() / fraud_count

# add weights to final dataframe
final_train = final_train.withColumn('weights',  when(col('is_fraud') == 1, weight_1).otherwise(weight_0))

In [72]:
# select feature names
features_vec = [col for col in final_train.columns if col not in  ['is_fraud','weights']]

# create vector on training set
assembler = VectorAssembler(inputCols=features_vec, outputCol='features')
final_train = assembler.transform(final_train)

# transform test set into vector
final_test = assembler.transform(final_test)

In [73]:
# create logit model
logreg = LogisticRegression(labelCol='is_fraud', featuresCol='features', weightCol='weights')

# apply to dataset
logit = logreg.fit(final_train)

In [74]:
# predict using test set
logit_pred = logit.transform(final_test)

# area under the ROC Curve
evaluator = BinaryClassificationEvaluator(labelCol='is_fraud', rawPredictionCol="prediction")
auc = evaluator.evaluate(logit_pred)
print(f"AUC: {auc}")

AUC: 0.8189514389268764


In [75]:
# create evaluators for other metrics
precision_evaluator = MulticlassClassificationEvaluator(labelCol="is_fraud", metricName="precisionByLabel")
recall_evaluator = MulticlassClassificationEvaluator(labelCol="is_fraud", metricName="recallByLabel")
f1_evaluator = MulticlassClassificationEvaluator(labelCol="is_fraud", metricName="f1")
accuracy_evaluator = MulticlassClassificationEvaluator(labelCol="is_fraud", metricName="accuracy")

# applying evaluators to logit prediction
precision = precision_evaluator.evaluate(logit_pred, {precision_evaluator.metricLabel: 1})
recall = recall_evaluator.evaluate(logit_pred, {recall_evaluator.metricLabel: 1})
f1 = f1_evaluator.evaluate(logit_pred)
accuracy = accuracy_evaluator.evaluate(logit_pred)

In [76]:
# add results to dataframe to compare with later models
models_results = spark.createDataFrame([
          Row(model= 'Logit', AUC=auc, Precision=precision, Recall=recall, Accuracy=accuracy, F1=f1)
                ])

### Random Forest <a name="rf"></a>

In [77]:
# create rf model
rfreg = RandomForestClassifier(labelCol='is_fraud', featuresCol='features', weightCol='weights')

# apply to dataset
rf = rfreg.fit(final_train)

In [78]:
# predict using test set
rf_pred = rf.transform(final_test)

# area under the ROC Curve
evaluator = BinaryClassificationEvaluator(labelCol='is_fraud', rawPredictionCol="prediction")
auc = evaluator.evaluate(rf_pred)
print(f"AUC: {auc}")

precision = precision_evaluator.evaluate(rf_pred, {precision_evaluator.metricLabel: 1})
recall = recall_evaluator.evaluate(rf_pred, {recall_evaluator.metricLabel: 1})
f1 = f1_evaluator.evaluate(rf_pred)
accuracy = accuracy_evaluator.evaluate(rf_pred)

AUC: 0.8611925330329725


In [79]:
# create new row with random forest result
column_names = models_results.columns
new_row = spark.createDataFrame([('random forest',auc, precision, recall, accuracy, f1)], column_names)

# add new row to DataFrame
models_results = models_results.union(new_row)

### GBT <a name="gbt"></a>

In [80]:
# create gbt model
gbtreg = GBTClassifier(labelCol='is_fraud', featuresCol='features', weightCol='weights')

# apply to dataset
gbt = gbtreg.fit(final_train)

# predict using test set
gbt_pred = gbt.transform(final_test)

# area under the ROC Curve
evaluator = BinaryClassificationEvaluator(labelCol='is_fraud', rawPredictionCol="prediction")
auc = evaluator.evaluate(gbt_pred)
print(f"AUC: {auc}")

precision = precision_evaluator.evaluate(gbt_pred, {precision_evaluator.metricLabel: 1})
recall = recall_evaluator.evaluate(gbt_pred, {recall_evaluator.metricLabel: 1})
f1 = f1_evaluator.evaluate(gbt_pred)
accuracy = accuracy_evaluator.evaluate(gbt_pred)

AUC: 0.9426013580244417


In [81]:
# create new row
new_row = spark.createDataFrame([('GBT',auc, precision, recall, accuracy, f1)], column_names)

# add new row to DataFrame
models_results = models_results.union(new_row)

## Comparing results <a name="results"></a>
[Table of contents](#table_cont)

In [82]:
# extract columns names
feature_names = final_train.columns
# Criando DataFrame com os nomes das features
feature_df = spark.createDataFrame([(name,) for name in feature_names], ["feature"])

# variable impact: logit
logit_importance = logit.coefficients.toArray().tolist()
logit_df = spark.createDataFrame(zip(feature_names, logit_importance), ["Feature", "Logit"])

In [83]:
# variables impact: random forest
rf_importance = rf.featureImportances.toArray().tolist()
rf_df = spark.createDataFrame(zip(feature_names, rf_importance), ["Feature", "Random Forest"])


In [84]:
# variable impact: GBT
gbt_importance = gbt.featureImportances.toArray().tolist()
gbt_df = spark.createDataFrame(zip(feature_names, gbt_importance), ["Feature", "GBT"])


In [85]:
# merge all datasets into one
feature_analysis = feature_df \
    .join(logit_df, "Feature", "left") \
    .join(rf_df, "Feature", "left") \
    .join(gbt_df, "Feature", "left")

In [88]:
feature_analysis.sort(asc('feature')).show(40)
# feature_analysis.write.csv('feature_analysis.csv')

+--------------+--------------------+--------------------+--------------------+
|       feature|               Logit|       Random Forest|                 GBT|
+--------------+--------------------+--------------------+--------------------+
|           age|  0.0547865759739923|4.068352679573005...|1.598730352511529...|
|           amt|0.010736487592143983|  0.6478714875838733|  0.6582547163711452|
|      business| 0.16864632181374198|0.002122762794491649|0.004489183081609636|
|      city_pop|-3.57223858702804...|0.001096013050411571|0.005004322590572377|
|  comunication|-0.18363419828524138|                 0.0|3.218817789884851...|
|      day_week|0.002244139334827049|0.016943167531426405|0.018566831402962745|
|        desing| 0.03851089929327762|1.167282033426925...|                 0.0|
|     education|-0.07432711413572131|1.306525530004939...|5.46131170037336E-19|
| entertainment| 0.23150367921558887|0.026918464298779494| 0.04468117324209432|
|      features|                NULL|   

In [89]:
# check results
models_results.show()
# models_results.write.csv('models_results.csv')

+-------------+------------------+-------------------+------------------+------------------+------------------+
|        model|               AUC|          Precision|            Recall|          Accuracy|                F1|
+-------------+------------------+-------------------+------------------+------------------+------------------+
|        Logit|0.8189514389268764| 0.0395214982343873|0.7454254388414214|0.8916097940724804| 0.937314085602649|
|random forest|0.8611925330329725|0.12157794049739429|0.7547514007557566|0.9663776047369897|0.9782602826511785|
|          GBT|0.9426013580244417|0.12542571724031565|0.9234191470747007|0.9615572089349264|0.9758115354691581|
+-------------+------------------+-------------------+------------------+------------------+------------------+



## Take Aways <a name="take_away"></a>
[Table of contents](#table_cont)

In this notebook I conducted some experiments using 3 different models to classify fraud in CC. This is an unbalanced dataset, which means I'll have to treat this on my models, and when choosing the performance metric.

I feature engineered some variables based on date and important dates for retail. After working at Amazon, I know deals and promotions happen during the week of big retail events (such as Black Friday and Christmas). That's why I chose to create a dummy for transactions made on the week of retail event. Some events doesn't have a fixed week, so I looked for possible week numbers it would be.

Gradient Boosting algorithms outperfor random forest and logit. With almost 95% of probability to classifying a transaction right (AUC 0.942). Due the unbalanced dataset, AUC was the best metric to compare models and account for the miniroty class (is_fraud = 1).

### Optimization
To optimize this project, I suggest:
- Create a loop to identify the exact week for retail events (such as Easter)
- Run a fine tuning on parameters, equivalent of scikit-learn GridSearch. Due to computational limitations this was not possible
- Increase the job_category variable in more areas. This feature had most of the observation and could be due to specific occupations
- Standardize numerical variables and test if logit performs better
- Test other libs, such as XG Boost or CAT boost. Gradient Boosting seems to be outperforming traditional models, but the archetecture is different and could have different results