# Credit Card Fraud Detection
In this notebook, I explore a public Kaggle dataset on [credit card fraud](https://www.kaggle.com/datasets/kartik2112/fraud-detection?resource=download). My objective is to analyze how fraud behaves in this dataset and to build a model that predicts whether a transaction is fraudulent.

As for tooling, I'll use **PySpark** to load the data, check its quality, and do the exploratory data analysis (EDA). Next, I'll run some classification algorithms and compare their performance to see which model would be used in a 'deploy phase'.

## Table of Contents  <a name="table_cont"></a>

0. [**Libraries**](#libs)
1. [**Load data**](#load_data)
2. [**Data quality**](#data_quality)
3. [**Feature Engineering**](#features)
4. [**EDA**](#eda)
5. [**Data preparation**](#data_prep)
6. [**ML Models**](#ml_mod)
    - Logistic regression (Logit)
    - Random Forest
    - Gradient-Boosted Trees (GBTs)
    - Naive Bayes
7. [**Comparing results**](#results)
8. [**Takeaways**](#take_away)


## Libraries <a name="libs"></a>

In [289]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, when, count, col, year, month, dayofweek, date_format, current_date, date_diff, floor, desc, asc, sum, mean, explode, lower, split

## Load data  <a name="load_data"></a>

In [290]:
# create spark session
spark = SparkSession.builder.appName('CC_fraud').getOrCreate()

# load train and test datasets
train = spark.read.csv('fraudTrain.csv', header=True, inferSchema=True)
test = spark.read.csv('fraudTest.csv', header=True, inferSchema=True)

In [291]:
print('Train set \n')
train.limit(5).show()
print('Test set')
test.limit(5).show()

Train set 

+---+---------------------+----------------+--------------------+-------------+------+---------+-------+------+--------------------+--------------+-----+-----+-------+---------+--------+--------------------+----------+--------------------+----------+------------------+-----------+--------+
|_c0|trans_date_trans_time|          cc_num|            merchant|     category|   amt|    first|   last|gender|              street|          city|state|  zip|    lat|     long|city_pop|                 job|       dob|           trans_num| unix_time|         merch_lat| merch_long|is_fraud|
+---+---------------------+----------------+--------------------+-------------+------+---------+-------+------+--------------------+--------------+-----+-----+-------+---------+--------+--------------------+----------+--------------------+----------+------------------+-----------+--------+
|  0|  2019-01-01 00:00:18|2703186189652095|fraud_Rippin, Kub...|     misc_net|  4.97| Jennifer|  Banks|     F|    

In [292]:
# dataframes shape
print((train.count(), len(train.columns)))

(1296675, 23)


In [293]:
print((test.count(), len(test.columns)))

(555719, 23)


## Data quality <a name="data_quality"></a>
[Table of contents](#table_cont)

In [294]:
# check if columns have the correct data type
print('Train set data type:')
train.dtypes

## could also have used train.printSchema()

Train set data type:


[('_c0', 'int'),
 ('trans_date_trans_time', 'timestamp'),
 ('cc_num', 'bigint'),
 ('merchant', 'string'),
 ('category', 'string'),
 ('amt', 'double'),
 ('first', 'string'),
 ('last', 'string'),
 ('gender', 'string'),
 ('street', 'string'),
 ('city', 'string'),
 ('state', 'string'),
 ('zip', 'int'),
 ('lat', 'double'),
 ('long', 'double'),
 ('city_pop', 'int'),
 ('job', 'string'),
 ('dob', 'date'),
 ('trans_num', 'string'),
 ('unix_time', 'int'),
 ('merch_lat', 'double'),
 ('merch_long', 'double'),
 ('is_fraud', 'int')]

In [295]:
print('Test set data type:')
test.dtypes

Test set data type:


[('_c0', 'int'),
 ('trans_date_trans_time', 'timestamp'),
 ('cc_num', 'bigint'),
 ('merchant', 'string'),
 ('category', 'string'),
 ('amt', 'double'),
 ('first', 'string'),
 ('last', 'string'),
 ('gender', 'string'),
 ('street', 'string'),
 ('city', 'string'),
 ('state', 'string'),
 ('zip', 'int'),
 ('lat', 'double'),
 ('long', 'double'),
 ('city_pop', 'int'),
 ('job', 'string'),
 ('dob', 'date'),
 ('trans_num', 'string'),
 ('unix_time', 'int'),
 ('merch_lat', 'double'),
 ('merch_long', 'double'),
 ('is_fraud', 'int')]

In [296]:
# separate columns into numeric and non-numeric (string, date, timestamp and bigint fall into the second group)
numeric_cols = [name for name, dtype in train.dtypes if dtype in ['double','float','int']]
string_cols = [name for name, dtype in train.dtypes if dtype not in ['double','float','int']]

# check for nulls
def count_nulls(df):
  df.select([
      count(when(col(c).isNull() | (col(c) == ""), c)).alias(c) if c in string_cols else
      count(when(col(c).isNull() | isnan(col(c)), c)).alias(c)
      for c in df.columns
  ]).show()

print('Train nulls:')
count_nulls(train)

Train nulls:
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|  0|                    0|     0|       0|       0|  0|    0|   0|     0|     0|   0|    0|  0|  0|   0|       0|  0|  0|        0|        0|        0|         0|       0|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+



In [297]:
print('Test nulls:')
count_nulls(test)

Test nulls:
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+
|  0|                    0|     0|       0|       0|  0|    0|   0|     0|     0|   0|    0|  0|  0|   0|       0|  0|  0|        0|        0|        0|         0|       0|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+



In [298]:
# check for duplicates
def duplicates(df):
  df.groupBy(df.columns).count().filter("count > 1").show()

duplicates(train)

+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|count|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+



In [299]:
duplicates(test)

+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+
|_c0|trans_date_trans_time|cc_num|merchant|category|amt|first|last|gender|street|city|state|zip|lat|long|city_pop|job|dob|trans_num|unix_time|merch_lat|merch_long|is_fraud|count|
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+
+---+---------------------+------+--------+--------+---+-----+----+------+------+----+-----+---+---+----+--------+---+---+---------+---------+---------+----------+--------+-----+



## Feature Engineering <a name="features"></a>
[Table of contents](#table_cont)

In [300]:
def feature_engine(df):
  # extract year, month, day of week and time; create 'age' from dob (date of birth), relative to today's date;
  # create 'lat_diff' and 'long_diff' as the differences between customer and merchant coordinates
  df = df.withColumn('year', year('trans_date_trans_time'))\
         .withColumn('month', month('trans_date_trans_time'))\
         .withColumn('day_week', dayofweek('trans_date_trans_time'))\
         .withColumn('time', date_format('trans_date_trans_time', 'HH:mm:ss'))\
         .withColumn('age', floor(date_diff(current_date(), col('dob'))/365.25))\
         .withColumn('lat_diff', col('lat')-col('merch_lat'))\
         .withColumn('long_diff', col('long')-col('merch_long'))
  df = df.drop(*['_c0','trans_date_trans_time','cc_num','first','last','street','dob','trans_num'])

  return df
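
The comment in `feature_engine` mentions a distance, while the function keeps the latitude and longitude differences as two separate columns. One possible way to combine the two coordinate pairs into a single distance is the haversine (great-circle) formula. The sketch below is plain Python and is not part of the original notebook; the name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration. The coordinates in the last line come from the first training row shown earlier.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# customer and merchant coordinates from the first training row
print(round(haversine_km(36.0788, -81.1781, 36.011293, -82.048315), 1))
```

The same formula could probably be written on Spark columns with `radians`, `sin`, `cos`, `asin` and `sqrt` from `pyspark.sql.functions`, which would avoid a Python UDF.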

In [301]:
# create new dataframes with the new features and without the unused columns
adj_train = feature_engine(train)
adj_test = feature_engine(test)
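
Because `feature_engine` measures `age` against `current_date()`, the value changes with the day the notebook is run. An alternative worth considering is the cardholder's age on the transaction date. The sketch below shows that calculation in plain Python; the helper `age_at` and the two example dates are hypothetical and do not come from the dataset.

```python
from datetime import date


def age_at(dob, on_date):
    """Completed years between a date of birth and a reference date."""
    had_birthday = (on_date.month, on_date.day) >= (dob.month, dob.day)
    return on_date.year - dob.year - (0 if had_birthday else 1)


# hypothetical date of birth and transaction dates
print(age_at(date(1988, 3, 9), date(2019, 1, 1)))  # 30
print(age_at(date(1988, 3, 9), date(2019, 3, 9)))  # 31
```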

In [302]:
adj_train.show(2)

+--------------------+-----------+------+------+--------------+-----+-----+-------+---------+--------+--------------------+----------+------------------+-----------+--------+----+-----+--------+--------+---+--------------------+--------------------+
|            merchant|   category|   amt|gender|          city|state|  zip|    lat|     long|city_pop|                 job| unix_time|         merch_lat| merch_long|is_fraud|year|month|day_week|    time|age|            lat_diff|           long_diff|
+--------------------+-----------+------+------+--------------+-----+-----+-------+---------+--------+--------------------+----------+------------------+-----------+--------+----+-----+--------+--------+---+--------------------+--------------------+
|fraud_Rippin, Kub...|   misc_net|  4.97|     F|Moravian Falls|   NC|28654|36.0788| -81.1781|    3495|Psychologist, cou...|1325376018|         36.011293| -82.048315|       0|2019|    1|       3|00:00:18| 36|  0.0675069999999991|  0.8702150000000017|


## EDA  <a name="eda"></a>
[Table of contents](#table_cont)

### Fraud

In [303]:
# check distribution of fraud and non-fraud in my dataset
adj_train.groupby('is_fraud').count().show()

+--------+-------+
|is_fraud|  count|
+--------+-------+
|       1|   7506|
|       0|1289169|
+--------+-------+



In [304]:
# frauds by year
adj_train.groupby(*['year','is_fraud']).count().show()

+----+--------+------+
|year|is_fraud| count|
+----+--------+------+
|2019|       1|  5220|
|2019|       0|919630|
|2020|       1|  2286|
|2020|       0|369539|
+----+--------+------+



In [305]:
# check distribution of frauds along the year (2019)
adj_train.groupby(*['year','month','is_fraud']).count().filter((col('year') == 2019) & (col('is_fraud') == 1)).sort(asc('month')).show()

+----+-----+--------+-----+
|year|month|is_fraud|count|
+----+-----+--------+-----+
|2019|    1|       1|  506|
|2019|    2|       1|  517|
|2019|    3|       1|  494|
|2019|    4|       1|  376|
|2019|    5|       1|  408|
|2019|    6|       1|  354|
|2019|    7|       1|  331|
|2019|    8|       1|  382|
|2019|    9|       1|  418|
|2019|   10|       1|  454|
|2019|   11|       1|  388|
|2019|   12|       1|  592|
+----+-----+--------+-----+



In [306]:
# check distribution of frauds along the year (2020)
adj_train.groupby(*['year','month','is_fraud']).count().filter((col('year') == 2020) & (col('is_fraud') == 1)).sort(asc('month')).show()

+----+-----+--------+-----+
|year|month|is_fraud|count|
+----+-----+--------+-----+
|2020|    1|       1|  343|
|2020|    2|       1|  336|
|2020|    3|       1|  444|
|2020|    4|       1|  302|
|2020|    5|       1|  527|
|2020|    6|       1|  334|
+----+-----+--------+-----+
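
The 2020 table stops at June, so the yearly totals (5220 vs 2286) are not directly comparable. A per-month average is a fairer comparison. The sketch below uses only the counts copied from the two tables above; `builtins.sum` is used because the notebook's import of `sum` from `pyspark.sql.functions` shadows Python's built-in.

```python
import builtins  # the notebook imports pyspark's sum, which shadows the built-in

# monthly fraud counts copied from the two tables above
frauds_2019 = [506, 517, 494, 376, 408, 354, 331, 382, 418, 454, 388, 592]
frauds_2020 = [343, 336, 444, 302, 527, 334]  # January to June only

monthly_mean = {}
for label, counts in (("2019", frauds_2019), ("2020", frauds_2020)):
    total = builtins.sum(counts)
    monthly_mean[label] = total / len(counts)
    print(label, total, round(monthly_mean[label], 1))
```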



In [307]:
adj_test.groupby('is_fraud').count().show()

+--------+------+
|is_fraud| count|
+--------+------+
|       1|  2145|
|       0|553574|
+--------+------+
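
Both sets are heavily imbalanced, which matters later when choosing metrics and training the classifiers. The sketch below turns the counts from the train and test tables into fraud rates and into per-class weights. The helper names and the "balanced" weighting heuristic are assumptions for illustration, not something the notebook uses.

```python
# class counts copied from the train and test tables above
train_counts = {0: 1289169, 1: 7506}
test_counts = {0: 553574, 1: 2145}


def fraud_rate(counts):
    """Share of rows labelled as fraud."""
    return counts[1] / (counts[0] + counts[1])


def balanced_weights(counts):
    """Per-class weight n_total / (n_classes * n_class), a common heuristic."""
    total = counts[0] + counts[1]
    return {label: total / (len(counts) * n) for label, n in counts.items()}


print(f"train fraud rate: {fraud_rate(train_counts):.4%}")
print(f"test fraud rate:  {fraud_rate(test_counts):.4%}")
print(balanced_weights(train_counts))
```

Weights like these could be stored in a column and passed to Spark ML classifiers that accept a `weightCol` parameter, as one option next to resampling.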



### Merchant

In [308]:
# unique merchants
adj_train.select('merchant').distinct().count()

693

In [309]:
# main merchants
adj_train.groupby(*['merchant','is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(10)

+--------------------+--------+-----+
|            merchant|is_fraud|count|
+--------------------+--------+-----+
|  fraud_Rau and Sons|       1|   49|
|   fraud_Cormier LLC|       1|   48|
|   fraud_Kozey-Boehm|       1|   48|
|fraud_Vandervort-...|       1|   47|
|   fraud_Kilback LLC|       1|   47|
|     fraud_Doyle Ltd|       1|   47|
|      fraud_Kuhn LLC|       1|   44|
| fraud_Padberg-Welch|       1|   44|
|    fraud_Terry-Huel|       1|   43|
|      fraud_Jast Ltd|       1|   42|
+--------------------+--------+-----+
only showing top 10 rows



In [310]:
# top merchant & CC fraud 2019
adj_train.groupby(*['year','merchant','is_fraud']).count().filter((col('year') == 2019) & (col('is_fraud') == 1)).sort(desc('count')).show(10)

+----+--------------------+--------+-----+
|year|            merchant|is_fraud|count|
+----+--------------------+--------+-----+
|2019|  fraud_Hudson-Ratke|       1|   37|
|2019|   fraud_Kilback LLC|       1|   35|
|2019|  fraud_Rau and Sons|       1|   33|
|2019|   fraud_Cormier LLC|       1|   32|
|2019|fraud_Gleason-Mac...|       1|   32|
|2019| fraud_Koepp-Witting|       1|   32|
|2019|      fraud_Kuhn LLC|       1|   32|
|2019|     fraud_Kuhic LLC|       1|   31|
|2019| fraud_Padberg-Welch|       1|   31|
|2019|fraud_Boyer-Reichert|       1|   31|
+----+--------------------+--------+-----+
only showing top 10 rows



In [311]:
# top merchant & CC fraud 2020
adj_train.groupby(*['year','merchant','is_fraud']).count().filter((col('year') == 2020) & (col('is_fraud') == 1)).sort(desc('count')).show(10)

+----+--------------------+--------+-----+
|year|            merchant|is_fraud|count|
+----+--------------------+--------+-----+
|2020|     fraud_Doyle Ltd|       1|   23|
|2020|fraud_Kerluke-Abs...|       1|   19|
|2020|   fraud_Kozey-Boehm|       1|   18|
|2020|     fraud_Kiehn Inc|       1|   16|
|2020|fraud_Vandervort-...|       1|   16|
|2020|   fraud_Cormier LLC|       1|   16|
|2020|fraud_Moen, Reing...|       1|   16|
|2020|  fraud_Rau and Sons|       1|   16|
|2020|fraud_Langworth, ...|       1|   16|
|2020|    fraud_Fisher Inc|       1|   15|
+----+--------------------+--------+-----+
only showing top 10 rows



### Category

In [312]:
# unique categories
adj_train.select('category').distinct().count()

14

In [313]:
# main frauded categories
adj_train.groupby(*['category','is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(14)

+--------------+--------+-----+
|      category|is_fraud|count|
+--------------+--------+-----+
|   grocery_pos|       1| 1743|
|  shopping_net|       1| 1713|
|      misc_net|       1|  915|
|  shopping_pos|       1|  843|
| gas_transport|       1|  618|
|      misc_pos|       1|  250|
|     kids_pets|       1|  239|
| entertainment|       1|  233|
| personal_care|       1|  220|
|          home|       1|  198|
|   food_dining|       1|  151|
|   grocery_net|       1|  134|
|health_fitness|       1|  133|
|        travel|       1|  116|
+--------------+--------+-----+



In [314]:
# categories & CC fraud 2019
adj_train.groupby(*['year','category','is_fraud']).count().filter((col('year') == 2019) & (col('is_fraud') == 1)).sort(desc('count')).show(14)

+----+--------------+--------+-----+
|year|      category|is_fraud|count|
+----+--------------+--------+-----+
|2019|   grocery_pos|       1| 1202|
|2019|  shopping_net|       1| 1201|
|2019|      misc_net|       1|  629|
|2019|  shopping_pos|       1|  583|
|2019| gas_transport|       1|  439|
|2019|     kids_pets|       1|  172|
|2019|      misc_pos|       1|  170|
|2019| entertainment|       1|  163|
|2019| personal_care|       1|  152|
|2019|          home|       1|  129|
|2019|   food_dining|       1|  104|
|2019|health_fitness|       1|   96|
|2019|   grocery_net|       1|   94|
|2019|        travel|       1|   86|
+----+--------------+--------+-----+



In [315]:
# total fraud amount by category in 2019, highest first
adj_train.groupby(*['year','category','is_fraud']).sum('amt').filter((col('year') == 2019) & (col('is_fraud') == 1)).sort(desc('sum(amt)')).show(14)

+----+--------------+--------+------------------+
|year|      category|is_fraud|          sum(amt)|
+----+--------------+--------+------------------+
|2019|  shopping_net|       1|1202563.4199999995|
|2019|  shopping_pos|       1| 512650.4299999999|
|2019|      misc_net|       1| 500575.6700000001|
|2019|   grocery_pos|       1|         375168.02|
|2019| entertainment|       1| 82990.86000000002|
|2019|          home|       1|33465.909999999996|
|2019|      misc_pos|       1|31586.319999999992|
|2019|   food_dining|       1|12401.849999999999|
|2019| gas_transport|       1| 5433.089999999998|
|2019| personal_care|       1| 4024.290000000001|
|2019|     kids_pets|       1|3076.3100000000004|
|2019|health_fitness|       1|           1944.62|
|2019|   grocery_net|       1|           1172.03|
|2019|        travel|       1|            770.05|
+----+--------------+--------+------------------+
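
The count and amount rankings differ: `grocery_pos` has the most fraud cases in 2019 but only the fourth largest amount. Dividing one table by the other gives the average fraud amount per case. The sketch below does this for four categories, with the numbers copied from the 2019 count and amount tables above (amounts rounded to cents).

```python
# 2019 fraud counts and fraud amounts per category, copied from the tables above
count_2019 = {"shopping_net": 1201, "grocery_pos": 1202, "misc_net": 629, "travel": 86}
amount_2019 = {"shopping_net": 1202563.42, "grocery_pos": 375168.02,
               "misc_net": 500575.67, "travel": 770.05}

# average fraud amount per fraud case
mean_amount = {cat: amount_2019[cat] / count_2019[cat] for cat in count_2019}
for cat, value in sorted(mean_amount.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cat:<14}{value:>10.2f}")
```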



In [316]:
# total fraud amount by category in 2020, highest first
adj_train.groupby(*['year','category','is_fraud']).sum('amt').filter((col('year') == 2020) & (col('is_fraud') == 1)).sort(desc('sum(amt)')).show(14)

+----+--------------+--------+------------------+
|year|      category|is_fraud|          sum(amt)|
+----+--------------+--------+------------------+
|2020|  shopping_net|       1| 509160.2899999995|
|2020|      misc_net|       1|228691.08999999997|
|2020|  shopping_pos|       1|226594.66000000006|
|2020|   grocery_pos|       1|168629.87999999998|
|2020| entertainment|       1| 34332.92999999999|
|2020|      misc_pos|       1|22984.699999999997|
|2020|          home|       1|          17505.75|
|2020|   food_dining|       1|           5729.77|
|2020| gas_transport|       1|2161.0199999999995|
|2020| personal_care|       1| 1733.229999999999|
|2020|     kids_pets|       1|1254.7699999999998|
|2020|health_fitness|       1| 748.4199999999998|
|2020|   grocery_net|       1|457.78999999999996|
|2020|        travel|       1|281.44000000000005|
+----+--------------+--------+------------------+



### Gender

In [317]:
# gender distribution
adj_train.groupby('gender').count().sort(desc('count')).show()

+------+------+
|gender| count|
+------+------+
|     F|709863|
|     M|586812|
+------+------+



In [318]:
# which gender suffers more with CC fraud
adj_train.groupby(*['gender','is_fraud']).count().sort(asc('count')).show()

+------+--------+------+
|gender|is_fraud| count|
+------+--------+------+
|     F|       1|  3735|
|     M|       1|  3771|
|     M|       0|583041|
|     F|       0|706128|
+------+--------+------+
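
The absolute fraud counts are almost equal for both genders, but women account for more transactions overall, so the rates differ. The sketch below computes the fraud rate per gender from the counts in the two tables above.

```python
# counts copied from the two gender tables above
total_by_gender = {"F": 709863, "M": 586812}
fraud_by_gender = {"F": 3735, "M": 3771}

# fraud cases divided by all transactions of that gender
rates = {g: fraud_by_gender[g] / total_by_gender[g] for g in total_by_gender}
for gender, rate in rates.items():
    print(f"{gender}: {rate:.3%}")
```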



In [319]:
# which gender spends more with CC fraud
adj_train.groupby(*['gender','is_fraud']).sum('amt').filter(col('is_fraud')==1).sort(desc('sum(amt)')).show()

+------+--------+------------------+
|gender|is_fraud|          sum(amt)|
+------+--------+------------------+
|     M|       1|2142801.2700000023|
|     F|       1|1845287.3399999982|
+------+--------+------------------+



In [320]:
# how much each gender spent when they suffered fraud, by category (2019)
adj_train.groupby(*['year','category','gender','is_fraud']).sum('amt').filter((col('year') == 2019) & (col('is_fraud') == 1)).sort(desc('sum(amt)')).show(14)

+----+-------------+------+--------+------------------+
|year|     category|gender|is_fraud|          sum(amt)|
+----+-------------+------+--------+------------------+
|2019| shopping_net|     M|       1| 735598.6099999999|
|2019| shopping_net|     F|       1|466964.80999999994|
|2019|     misc_net|     M|       1|284502.30000000005|
|2019| shopping_pos|     F|       1|         282235.86|
|2019| shopping_pos|     M|       1|230414.56999999995|
|2019|     misc_net|     F|       1|216073.37000000005|
|2019|  grocery_pos|     M|       1|212679.80000000005|
|2019|  grocery_pos|     F|       1|162488.22000000006|
|2019|entertainment|     F|       1| 49913.71000000001|
|2019|entertainment|     M|       1|33077.149999999994|
|2019|     misc_pos|     F|       1|          30448.77|
|2019|         home|     F|       1|          29845.39|
|2019|  food_dining|     F|       1| 9263.490000000002|
|2019|         home|     M|       1|3620.5200000000004|
+----+-------------+------+--------+------------------+

In [321]:
# how much each gender spent when they suffered fraud, by category (2020)
adj_train.groupby(*['year','category','gender','is_fraud']).sum('amt').filter((col('year') == 2020) & (col('is_fraud') == 1)).sort(desc('sum(amt)')).show(14)

+----+-------------+------+--------+------------------+
|year|     category|gender|is_fraud|          sum(amt)|
+----+-------------+------+--------+------------------+
|2020| shopping_net|     M|       1| 288958.1199999999|
|2020| shopping_net|     F|       1|220202.17000000013|
|2020|     misc_net|     M|       1|         131204.01|
|2020| shopping_pos|     F|       1|122150.40999999997|
|2020| shopping_pos|     M|       1|104444.24999999997|
|2020|     misc_net|     F|       1| 97487.08000000003|
|2020|  grocery_pos|     M|       1| 91167.47000000002|
|2020|  grocery_pos|     F|       1|          77462.41|
|2020|entertainment|     F|       1|          22900.44|
|2020|     misc_pos|     F|       1|          22635.95|
|2020|         home|     F|       1|16300.210000000001|
|2020|entertainment|     M|       1|11432.490000000002|
|2020|  food_dining|     F|       1|           4634.48|
|2020|gas_transport|     M|       1|           1245.37|
+----+-------------+------+--------+------------------+

### Age

In [322]:
# age distribution if CC frauded
adj_train.groupby(*['age','is_fraud']).count().filter(col('is_fraud')==1).sort(desc('count')).show()

+---+--------+-----+
|age|is_fraud|count|
+---+--------+-----+
| 34|       1|  196|
| 50|       1|  187|
| 38|       1|  183|
| 52|       1|  182|
| 48|       1|  181|
| 63|       1|  177|
| 31|       1|  172|
| 57|       1|  167|
| 54|       1|  163|
| 37|       1|  162|
| 27|       1|  156|
| 35|       1|  155|
| 29|       1|  154|
| 58|       1|  151|
| 53|       1|  150|
| 61|       1|  143|
| 43|       1|  143|
| 60|       1|  142|
| 55|       1|  142|
| 68|       1|  141|
+---+--------+-----+
only showing top 20 rows



In [323]:
# age and gender distribution of frauded cards
adj_train.groupby(*['age','gender','is_fraud']).count().filter(col('is_fraud')==1).sort(desc('count')).show()

+---+------+--------+-----+
|age|gender|is_fraud|count|
+---+------+--------+-----+
| 34|     M|       1|  146|
| 53|     F|       1|  119|
| 52|     F|       1|  118|
| 57|     M|       1|  109|
| 38|     F|       1|  108|
| 48|     M|       1|  104|
| 55|     M|       1|  103|
| 58|     M|       1|   97|
| 46|     M|       1|   96|
| 44|     M|       1|   94|
| 50|     M|       1|   94|
| 31|     F|       1|   94|
| 63|     F|       1|   93|
| 50|     F|       1|   93|
| 43|     M|       1|   92|
| 37|     M|       1|   91|
| 60|     M|       1|   86|
| 61|     F|       1|   86|
| 35|     M|       1|   85|
| 63|     M|       1|   84|
+---+------+--------+-----+
only showing top 20 rows



In [324]:
# amount spent by age with frauded CC
adj_train.groupby(*['age','is_fraud']).sum('amt').filter(col('is_fraud')==1).sort(desc('sum(amt)')).show(5)

+---+--------+------------------+
|age|is_fraud|          sum(amt)|
+---+--------+------------------+
| 34|       1|109248.05999999998|
| 63|       1|102921.00000000001|
| 48|       1| 91865.23000000001|
| 29|       1| 90028.45999999999|
| 37|       1|          89043.31|
+---+--------+------------------+
only showing top 5 rows



In [325]:
# total spend on frauded cc by gender and age
adj_train.groupby(*['age','gender','is_fraud']).sum('amt').filter(col('is_fraud')==1).sort(desc('sum(amt)')).show(5)

+---+------+--------+------------------+
|age|gender|is_fraud|          sum(amt)|
+---+------+--------+------------------+
| 34|     M|       1| 86134.79999999999|
| 48|     M|       1|64653.619999999995|
| 55|     M|       1|          63821.48|
| 37|     M|       1| 59853.69999999999|
| 57|     M|       1|58460.399999999994|
+---+------+--------+------------------+
only showing top 5 rows



In [326]:
# fraud amount by age and gender (grouping by 'age' makes avg(age) equal to the age itself)
adj_train.groupby(*['age','gender','is_fraud']).agg(mean('age'),sum('amt')).filter(col('is_fraud')==1).sort(desc('sum(amt)')).show(5)

+---+------+--------+--------+------------------+
|age|gender|is_fraud|avg(age)|          sum(amt)|
+---+------+--------+--------+------------------+
| 34|     M|       1|    34.0| 86134.79999999999|
| 48|     M|       1|    48.0|64653.619999999995|
| 55|     M|       1|    55.0|          63821.48|
| 37|     M|       1|    37.0| 59853.69999999999|
| 57|     M|       1|    57.0|58460.399999999994|
+---+------+--------+--------+------------------+
only showing top 5 rows



### Location (State, city, population size)

In [327]:
# unique states
adj_train.select('state').distinct().count()

51

In [328]:
# main states with CC fraud
adj_train.groupby(*['state','is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(5)

+-----+--------+-----+
|state|is_fraud|count|
+-----+--------+-----+
|   NY|       1|  555|
|   TX|       1|  479|
|   PA|       1|  458|
|   CA|       1|  326|
|   OH|       1|  321|
+-----+--------+-----+
only showing top 5 rows



In [329]:
# total amount by state with CC fraud
adj_train.groupby(*['state','is_fraud']).sum('amt').filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show(5)

+-----+--------+------------------+
|state|is_fraud|          sum(amt)|
+-----+--------+------------------+
|   NY|       1|295548.64000000013|
|   TX|       1| 265806.4100000001|
|   PA|       1|244624.67000000004|
|   CA|       1|         170943.92|
|   OH|       1|         168919.98|
+-----+--------+------------------+
only showing top 5 rows



In [330]:
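# states ranked by summed city_pop over fraud rows, with fraud amount
# (city_pop repeats on every transaction, so the sum is not the state's population)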
adj_train.groupby(*['state','is_fraud']).sum(*['city_pop','amt']).filter(col('is_fraud') == 1).sort(desc('sum(city_pop)')).show(5)

+-----+--------+-------------+------------------+
|state|is_fraud|sum(city_pop)|          sum(amt)|
+-----+--------+-------------+------------------+
|   TX|       1|    206734477| 265806.4100000001|
|   CA|       1|     89649164|         170943.92|
|   NY|       1|     79853780|295548.64000000013|
|   FL|       1|     45649842|         150913.03|
|   MN|       1|     28231300|         112454.39|
+-----+--------+-------------+------------------+
only showing top 5 rows



In [331]:
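# cities ranked by summed city_pop over fraud rows, with fraud amount
# (city_pop repeats on every transaction, so the sum is not the city's population)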
adj_train.groupby(*['state','city','is_fraud']).sum(*['city_pop','amt']).filter(col('is_fraud') == 1).sort(desc('sum(city_pop)')).show(5)

+-----+-------------+--------+-------------+------------------+
|state|         city|is_fraud|sum(city_pop)|          sum(amt)|
+-----+-------------+--------+-------------+------------------+
|   TX|      Houston|       1|    113361300|          21667.21|
|   TX|  San Antonio|       1|     39894925|14536.749999999996|
|   NY|New York City|       1|     36279855|13136.859999999999|
|   TX|       Dallas|       1|     34109667|19747.140000000003|
|   NY|     Brooklyn|       1|     25047000|           7435.38|
+-----+-------------+--------+-------------+------------------+
only showing top 5 rows
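
`city_pop` is repeated on every transaction row, so `sum(city_pop)` grows with the number of fraud rows instead of measuring how populated a place is. One way around this is to keep a single population value per city before aggregating. The sketch below illustrates the difference in plain Python; the rows and population figures are hypothetical.

```python
# hypothetical fraud rows: (state, city, city_pop, amt)
rows = [
    ("TX", "Houston", 2_000_000, 120.0),
    ("TX", "Houston", 2_000_000, 80.0),
    ("TX", "Dallas", 1_000_000, 50.0),
]

naive_pop = 0   # what summing city_pop over all rows gives
city_pop = {}   # one population value per (state, city)
fraud_amt = {}  # total fraud amount per (state, city)
for state, city, pop, amt in rows:
    naive_pop += pop
    city_pop[(state, city)] = pop
    fraud_amt[(state, city)] = fraud_amt.get((state, city), 0.0) + amt

dedup_pop = 0
for pop in city_pop.values():
    dedup_pop += pop

print(naive_pop, dedup_pop)  # 5000000 3000000
print(fraud_amt)
```

In Spark, a similar effect could be reached by aggregating `city_pop` per city with `max` or `first` instead of `sum`.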



In [332]:
# average age per city and CC fraud amount
adj_train.groupby(*['state','city','gender','is_fraud'])\
         .agg(floor(mean('age')), sum('city_pop'), sum('amt'))\
         .filter(col('is_fraud') == 1)\
         .sort(desc('sum(amt)')).show(5)

+-----+-------------+------+--------+---------------+-------------+------------------+
|state|         city|gender|is_fraud|FLOOR(avg(age))|sum(city_pop)|          sum(amt)|
+-----+-------------+------+--------+---------------+-------------+------------------+
|   OK|        Tulsa|     F|       1|             68|     11166498|17470.250000000004|
|   NY|New York City|     F|       1|             60|     36279855|13136.859999999999|
|   TX|  San Antonio|     M|       1|             39|     28724346|13040.649999999996|
|   FL|       Naples|     F|       1|             81|      5520040|13018.669999999998|
|   TX|      Houston|     M|       1|             38|     63947400|12979.279999999999|
+-----+-------------+------+--------+---------------+-------------+------------------+
only showing top 5 rows



### Occupation (job title)

In [333]:
# unique jobs
adj_train.select('job').distinct().count()

494

In [334]:
# jobs with the most fraud cases
adj_train.groupby(*['job', 'is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(5)

+--------------------+--------+-----+
|                 job|is_fraud|count|
+--------------------+--------+-----+
|  Materials engineer|       1|   62|
|Trading standards...|       1|   56|
|     Naval architect|       1|   53|
| Exhibition designer|       1|   51|
|Surveyor, land/ge...|       1|   50|
+--------------------+--------+-----+
only showing top 5 rows



In [335]:
# find most common words for jobs, so I can create macro categories
adj_train.withColumn('job_macro', explode(split(lower(col('job')), '[ , ]'))) \
          .groupBy('job_macro') \
          .count() \
          .filter((col('job_macro') != 'and') & (col('job_macro') != ''))\
          .orderBy(desc('count')) \
          .show(20)

+-------------+------+
|    job_macro| count|
+-------------+------+
|     engineer|131756|
|      officer|110915|
|      manager| 61124|
|    scientist| 55878|
|     designer| 52218|
|     surveyor| 49062|
|      teacher| 38126|
| psychologist| 32600|
|     research| 29754|
|       editor| 28725|
|    education| 26624|
|       public| 26116|
|    therapist| 25110|
|   consultant| 24785|
|        chief| 23081|
|    chartered| 19009|
|  development| 17943|
|       health| 17300|
|administrator| 16988|
|   researcher| 16001|
+-------------+------+
only showing top 20 rows



In [336]:
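# create occupation clusters based on the most used words for jobs
# note: explode() yields one row per word of the job title, so transactions are duplicated from here on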
adj_train = adj_train.withColumn('job_macro', explode(split(lower(col('job')), '[ , ]')))

adj_train = adj_train.withColumn('job_category',
                        when(lower(col('job_macro')).isin(['engineer', 'scientist', 'development', 'researcher', 'research']), 'science')
                        .when(lower(col('job_macro')).isin(['manager', 'chief', 'administrator', 'consultant']), 'business')
                        .when(lower(col('job_macro')).isin(['teacher', 'education']), 'education')
                        .when(lower(col('job_macro')).isin(['health', 'therapist', 'psychologist']), 'healthcare')
                        .when(lower(col('job_macro')).isin(['editor', 'public']), 'comunication')
                        .when(lower(col('job_macro')).isin(['designer', 'surveyor']), 'desing')
                        .otherwise('others'))
adj_train.show(4)

+--------------------+-----------+------+------+--------------+-----+-----+-------+---------+--------+--------------------+----------+------------------+-----------+--------+----+-----+--------+--------+---+--------------------+--------------------+------------+------------+
|            merchant|   category|   amt|gender|          city|state|  zip|    lat|     long|city_pop|                 job| unix_time|         merch_lat| merch_long|is_fraud|year|month|day_week|    time|age|            lat_diff|           long_diff|   job_macro|job_category|
+--------------------+-----------+------+------+--------------+-----+-----+-------+---------+--------+--------------------+----------+------------------+-----------+--------+----+-----+--------+--------+---+--------------------+--------------------+------------+------------+
|fraud_Rippin, Kub...|   misc_net|  4.97|     F|Moravian Falls|   NC|28654|36.0788| -81.1781|    3495|Psychologist, cou...|1325376018|         36.011293| -82.048315|       

In [337]:
# fraud by job category
adj_train.groupby(*['job_category', 'is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(7)

+------------+--------+-----+
|job_category|is_fraud|count|
+------------+--------+-----+
|      others|       1|15037|
|     science|       1| 1369|
|    business|       1|  676|
|      desing|       1|  564|
|   education|       1|  456|
|  healthcare|       1|  367|
|comunication|       1|  250|
+------------+--------+-----+



In [340]:
# total fraud amount by job category
adj_train.groupby(*['job_category', 'is_fraud']).sum('amt').filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show(7)

+------------+--------+------------------+
|job_category|is_fraud|          sum(amt)|
+------------+--------+------------------+
|      others|       1| 7974167.390000008|
|     science|       1| 718112.9300000002|
|    business|       1|337418.98000000004|
|      desing|       1| 304814.8100000001|
|   education|       1|235717.29999999993|
|  healthcare|       1|          202879.2|
|comunication|       1|130377.24999999999|
+------------+--------+------------------+



### Latitude and Longitude

In [346]:
# average difference between customer and merchant latitude
adj_train.groupby('is_fraud').mean('lat_diff').show()

+--------+--------------------+
|is_fraud|       avg(lat_diff)|
+--------+--------------------+
|       1|0.010138428495111993|
|       0| 1.36273459634889E-4|
+--------+--------------------+



In [347]:
# average difference between customer and merchant longitude
adj_train.groupby('is_fraud').mean('long_diff').show()

+--------+--------------------+
|is_fraud|      avg(long_diff)|
+--------+--------------------+
|       1|8.399439072599181E-4|
|       0|7.071568078397803E-4|
+--------+--------------------+
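
The sample row printed in this notebook suggests that `lat_diff` and `long_diff` are signed differences (36.0788 - 36.011293 = 0.0675). Positive and negative values therefore cancel out in the mean. A single unsigned distance may separate the classes more clearly. The sketch below uses the haversine formula, which is not part of the original notebook. `haversine_km` and `with_distance_km` are hypothetical names, and the PySpark version assumes the columns `lat`, `long`, `merch_lat` and `merch_long`.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # clamp to 1.0 to guard against floating-point overshoot
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def with_distance_km(df):
    """Hypothetical PySpark version of the same formula."""
    from pyspark.sql import functions as F
    phi1, phi2 = F.radians(F.col('lat')), F.radians(F.col('merch_lat'))
    dphi = phi2 - phi1
    dlmb = F.radians(F.col('merch_long') - F.col('long'))
    a = F.pow(F.sin(dphi / 2), 2) + F.cos(phi1) * F.cos(phi2) * F.pow(F.sin(dlmb / 2), 2)
    return df.withColumn('distance_km', F.asin(F.sqrt(a)) * (2 * EARTH_RADIUS_KM))


# customer and merchant coordinates from the sample row shown in this notebook
print(round(haversine_km(36.0788, -81.1781, 36.011293, -82.048315), 1))
```

`adj_train.groupby('is_fraud').mean('distance_km')` could then be compared with the two tables above.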



### Unix time

In [348]:
# average unix time
adj_train.groupby('is_fraud').mean('unix_time').show()

+--------+--------------------+
|is_fraud|      avg(unix_time)|
+--------+--------------------+
|       1|1.3481615599279876E9|
|       0|1.3492567481422484E9|
+--------+--------------------+
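
Raw epoch seconds are hard to read, so converting them to dates can make this comparison easier to interpret. The snippet below is a small sketch using only the standard library, and `unix_to_utc` is a hypothetical helper name. It converts the `unix_time` of the sample row printed in this notebook (year 2019, time 00:00:18) and the two class means from the table above.

```python
from datetime import datetime, timezone


def unix_to_utc(ts):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# unix_time of the sample row shown in this notebook
print(unix_to_utc(1325376018))  # 2012-01-01 00:00:18+00:00

# class means copied from the table above
for label, avg in [(1, 1.3481615599279876e9), (0, 1.3492567481422484e9)]:
    print(label, unix_to_utc(avg).date())
```

The sample row converts to a date in 2012 while its `year` column reads 2019, which suggests `unix_time` is shifted relative to the transaction date columns. The two means differ by roughly 1.1 million seconds, or about 12.7 days. In PySpark, `from_unixtime(col('unix_time'))` would give a comparable timestamp string in the session time zone.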



### Day and Hour

In [354]:
# is there a specific time for fraud?
adj_train.groupby(*['time', 'is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(10)

+--------+--------+-----+
|    time|is_fraud|count|
+--------+--------+-----+
|00:22:36|       1|   13|
|23:42:53|       1|   13|
|23:23:34|       1|   13|
|22:02:16|       1|   12|
|22:25:30|       1|   12|
|22:17:47|       1|   12|
|22:43:46|       1|   11|
|03:13:37|       1|   11|
|22:22:44|       1|   11|
|23:40:36|       1|   11|
+--------+--------+-----+
only showing top 10 rows
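
Grouping on the full `HH:mm:ss` value spreads the frauds over thousands of distinct times, so bucketing by hour may show the pattern more clearly. The sketch below re-aggregates only the ten rows printed above, so it is an illustration and not a result for the full dataset. `hour_of` and `with_hour` are hypothetical names, and the PySpark helper assumes `time` is a string column formatted as `HH:mm:ss`.

```python
from collections import Counter

# (time, fraud count) pairs copied from the top-10 output above
top_times = [
    ('00:22:36', 13), ('23:42:53', 13), ('23:23:34', 13), ('22:02:16', 12),
    ('22:25:30', 12), ('22:17:47', 12), ('22:43:46', 11), ('03:13:37', 11),
    ('22:22:44', 11), ('23:40:36', 11),
]


def hour_of(time_str):
    """Extract the hour from an 'HH:mm:ss' string."""
    return int(time_str[:2])


by_hour = Counter()
for t, n in top_times:
    by_hour[hour_of(t)] += n

print(sorted(by_hour.items(), key=lambda kv: -kv[1]))
# [(22, 58), (23, 37), (0, 13), (3, 11)]


def with_hour(df):
    """Hypothetical PySpark helper: adds an integer 'hour' column."""
    from pyspark.sql.functions import col, substring
    return df.withColumn('hour', substring(col('time'), 1, 2).cast('int'))
```

On the full data, `with_hour(adj_train).groupby('hour', 'is_fraud').count()` would give at most 24 buckets per class.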



In [355]:
# total fraud amount by time of day (HH:mm:ss)
adj_train.groupby(*['time', 'is_fraud']).sum('amt').filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show(10)

+--------+--------+-----------------+
|    time|is_fraud|         sum(amt)|
+--------+--------+-----------------+
|22:43:46|       1|         10796.23|
|22:59:35|       1|         10672.18|
|23:15:12|       1|          9457.49|
|22:10:59|       1|9022.580000000002|
|22:20:58|       1|8720.039999999999|
|22:45:52|       1|          8612.68|
|23:42:24|       1|8609.150000000001|
|23:43:08|       1|          8377.88|
|23:49:53|       1|          8366.07|
|23:58:42|       1|          8320.66|
+--------+--------+-----------------+
only showing top 10 rows



In [356]:
# does fraud occur on a specific day of the week?
adj_train.groupby(*['day_week', 'is_fraud']).count().filter(col('is_fraud') == 1).sort(desc('count')).show(10)

+--------+--------+-----+
|day_week|is_fraud|count|
+--------+--------+-----+
|       7|       1| 3128|
|       2|       1| 2993|
|       1|       1| 2954|
|       6|       1| 2718|
|       5|       1| 2483|
|       3|       1| 2370|
|       4|       1| 2073|
+--------+--------+-----+
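
The day numbers are easier to read with names attached. Assuming `day_week` was built with Spark's `dayofweek` (1 = Sunday through 7 = Saturday, which matches the sample row where 2019-01-01, a Tuesday, has `day_week` 3), the sketch below labels the counts from the output above and shows each day's share of the fraud rows. `DAY_NAMES`, `fraud_by_day` and `fraud_rate_by_day` are hypothetical names.

```python
# Spark's dayofweek() numbering: 1 = Sunday ... 7 = Saturday
DAY_NAMES = {1: 'Sunday', 2: 'Monday', 3: 'Tuesday', 4: 'Wednesday',
             5: 'Thursday', 6: 'Friday', 7: 'Saturday'}

# fraud counts copied from the output above
fraud_by_day = {7: 3128, 2: 2993, 1: 2954, 6: 2718, 5: 2483, 3: 2370, 4: 2073}

total = sum(fraud_by_day.values())
for day, n in sorted(fraud_by_day.items(), key=lambda kv: -kv[1]):
    print(f'{DAY_NAMES[day]:<9} {n:>5} {n / total:6.1%}')


def fraud_rate_by_day(df):
    """Hypothetical PySpark helper: share of fraud rows per day of week."""
    from pyspark.sql.functions import mean
    return df.groupby('day_week').agg(mean('is_fraud').alias('fraud_rate'))
```

Raw counts also reflect how many transactions happen on each day, so a rate such as the one returned by `fraud_rate_by_day` may be a fairer comparison between days.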



In [357]:
# total fraud amount by day of week
adj_train.groupby(*['day_week', 'is_fraud']).sum('amt').filter(col('is_fraud') == 1).sort(desc('sum(amt)')).show(10)

+--------+--------+------------------+
|day_week|is_fraud|          sum(amt)|
+--------+--------+------------------+
|       7|       1|1635887.2400000005|
|       1|       1|1616182.6999999993|
|       2|       1|1493520.4499999983|
|       6|       1|1487687.5299999993|
|       5|       1| 1347469.530000001|
|       3|       1|1245697.5300000003|
|       4|       1|        1077042.88|
+--------+--------+------------------+



### Statistics

In [349]:
# basic dataset statistics
adj_train.summary().show()

+-------+-------------------+-------------+------------------+-------+-------+-------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+------------------+------------------+--------------------+------------------+-----------------+------------------+--------+------------------+--------------------+--------------------+---------+------------+
|summary|           merchant|     category|               amt| gender|   city|  state|              zip|              lat|              long|         city_pop|               job|           unix_time|         merch_lat|        merch_long|            is_fraud|              year|            month|          day_week|    time|               age|            lat_diff|           long_diff|job_macro|job_category|
+-------+-------------------+-------------+------------------+-------+-------+-------+-----------------+-----------------+------------------+-----------------+------------------+------

In [338]:
adj_train.show(2)

+--------------------+--------+----+------+--------------+-----+-----+-------+--------+--------+--------------------+----------+---------+----------+--------+----+-----+--------+--------+---+------------------+------------------+------------+------------+
|            merchant|category| amt|gender|          city|state|  zip|    lat|    long|city_pop|                 job| unix_time|merch_lat|merch_long|is_fraud|year|month|day_week|    time|age|          lat_diff|         long_diff|   job_macro|job_category|
+--------------------+--------+----+------+--------------+-----+-----+-------+--------+--------+--------------------+----------+---------+----------+--------+----+-----+--------+--------+---+------------------+------------------+------------+------------+
|fraud_Rippin, Kub...|misc_net|4.97|     F|Moravian Falls|   NC|28654|36.0788|-81.1781|    3495|Psychologist, cou...|1325376018|36.011293|-82.048315|       0|2019|    1|       3|00:00:18| 36|0.0675069999999991|0.8702150000000017|psy

## Data Preparation <a name="data_prep"></a>
[Table of contents](#table_cont)

In [339]:
# drop variables
## ['_c0','trans_date_trans_time','cc_num']

## ML Models <a name="ml_mod"></a>
[Table of contents](#table_cont)

## Comparing results <a name="results"></a>
[Table of contents](#table_cont)

## Take Aways <a name="take_away"></a>
[Table of contents](#table_cont)