# Table of contents

* [Setup](#Setup)
  * [Module imports](#Module-imports)
  * [Load spark session and data](#Load-spark-session-and-data)
  * [Declare globals](#Declare-globals)
* [Distribution of values](#Distribution-of-values)

# Setup

## Module imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, when
import importlib
import numpy as np

In [2]:
from modules.load import read_file
from modules.plots import catplot, countplot, display_images, distribution

## Load spark session and data

In [3]:
spark = SparkSession.builder.appName("Test").getOrCreate()

In [4]:
df = read_file(spark, "data/transactions.ndjson")

In [5]:
# Clean up the "merchantName" column
df = df.withColumn(
    "merchantName",
    when(col("merchantName").contains(" #"), split(col("merchantName"), " #").getItem(0))
                            .when(col("merchantName").contains("Blue Mountain"), "Blue Mountain")
                            .when(col("merchantName").contains("ethanallen.com"), "Ethan Allen")
                            .when(col("merchantName").contains("pottery-barn.com"), "Pottery Barn")
                            .when(col("merchantName").contains("westelm.com"), "West Elm")
                            .when(col("merchantName").contains("williamssonoma"), "Williams Sonoma")
                            .otherwise(col("merchantName"))
)
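
The cleanup rules above are checked in order and the first matching condition wins. A plain-Python mirror of the same chain can help when checking the rules on a few strings before running them on the full DataFrame. This is only a sketch: `normalize_merchant_name` is a hypothetical helper, and the sample names in the usage lines are made up for illustration.

```python
def normalize_merchant_name(name: str) -> str:
    """Mirror the when/otherwise chain in plain Python; the first match wins."""
    if " #" in name:
        # Same idea as split(col, " #").getItem(0): keep the text before the first " #"
        return name.split(" #")[0]
    # (substring, canonical name) pairs, checked in the same order as the Spark chain
    rules = [
        ("Blue Mountain", "Blue Mountain"),
        ("ethanallen.com", "Ethan Allen"),
        ("pottery-barn.com", "Pottery Barn"),
        ("westelm.com", "West Elm"),
        ("williamssonoma", "Williams Sonoma"),
    ]
    for needle, canonical in rules:
        if needle in name:
            return canonical
    return name  # corresponds to otherwise(col("merchantName"))


# Hypothetical inputs, for illustration only
print(normalize_merchant_name("Some Store #1234"))
print(normalize_merchant_name("shop.westelm.com"))
print(normalize_merchant_name("Corner Cafe"))
```

Because the `" #"` rule comes first, a name that contains both `" #"` and one of the later substrings is only split, not mapped to the canonical name.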

## Declare globals

All the column names in the dataset, shown three per row for reference.

In [6]:
# -1 lets NumPy infer the number of rows (the column count must be a multiple of 3)
np.array(df.columns).reshape((-1, 3)).tolist()

[['accountNumber', 'accountOpenDate', 'acqCountry'],
 ['availableMoney', 'cardCVV', 'cardLast4Digits'],
 ['cardPresent', 'creditLimit', 'currentBalance'],
 ['currentExpDate', 'customerId', 'dateOfLastAddressChange'],
 ['enteredCVV', 'expirationDateKeyInMatch', 'isFraud'],
 ['merchantCategoryCode', 'merchantCountryCode', 'merchantName'],
 ['posConditionCode', 'posEntryMode', 'transactionAmount'],
 ['transactionDateTime', 'transactionType', 'creditLimitIndexed'],
 ['merchantNameIndexed', 'acqCountryIndexed', 'merchantCountryCodeIndexed'],
 ['posEntryModeIndexed',
  'posConditionCodeIndexed',
  'merchantCategoryCodeIndexed'],
 ['transactionTypeIndexed', 'creditLimitEncoded', 'merchantNameEncoded'],
 ['acqCountryEncoded', 'merchantCountryCodeEncoded', 'posEntryModeEncoded'],
 ['posConditionCodeEncoded',
  'merchantCategoryCodeEncoded',
  'transactionTypeEncoded']]
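
Reshaping with NumPy requires the number of columns to be a multiple of three. A minimal alternative, assuming only a plain list of names, is to slice the list into rows; `chunked` is a hypothetical helper, and the four names in the usage lines are taken from the listing above.

```python
def chunked(names, size=3):
    """Split a list into consecutive rows of at most `size` items."""
    return [names[i:i + size] for i in range(0, len(names), size)]


columns = ["accountNumber", "accountOpenDate", "acqCountry", "availableMoney"]
rows = chunked(columns)
print(rows)
```

The last row is simply shorter when the list length is not a multiple of `size`.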

### Labels and groups
Assign names to columns grouped by type for labeling purposes

In [7]:
# Define the list of decimal type columns
# (the loop variable is called "name" so it cannot be confused with pyspark's col)
decimal_cols = [name for name, dtype in df.dtypes if "decimal" in dtype]

# Define the list of numerical type columns
numerical_cols = [name for name, dtype in df.dtypes if ("decimal" in dtype) or ("double" in dtype) or ("int" in dtype)]

# Define the list of categorical columns
categorical_cols = [
    "creditLimit",
    "merchantName",
    "acqCountry",
    "merchantCountryCode",
    "posEntryMode",
    "posConditionCode",
    "merchantCategoryCode",
    "transactionType"
]

catkinds = ["strip", "box", "boxen", "violin"]
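
The numerical-column filter above relies on substring tests against the type strings in `df.dtypes`, which is a list of `(name, type string)` pairs. A substring test for `"int"` also matches `bigint` and `smallint`, and it would match an interval type as well. The sketch below compares the base type name against an explicit set instead. The set of type names, the helper `numeric_columns`, and the sample type strings are assumptions for illustration, not values read from the dataset.

```python
# Base type names treated as numeric (an assumed list; adjust as needed)
NUMERIC_TYPES = {"decimal", "double", "float", "int", "bigint", "smallint", "tinyint"}


def numeric_columns(dtypes):
    """Return the names whose base type, e.g. 'decimal' in 'decimal(10,2)', is numeric."""
    return [name for name, dtype in dtypes if dtype.split("(")[0] in NUMERIC_TYPES]


# Stand-in for df.dtypes; the type strings here are assumed
sample_dtypes = [
    ("transactionAmount", "decimal(10,2)"),
    ("merchantName", "string"),
    ("isFraud", "boolean"),
    ("posEntryModeIndexed", "double"),
    ("hypotheticalDuration", "interval"),
]
print(numeric_columns(sample_dtypes))
```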

# Distribution of values

## Single variable plots

In [8]:
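# Plotting is switched off by default; set regenerate_plots = True to redraw the figures.
regenerate_plots = False

if regenerate_plots:
    # Distribution of values of each decimal column, on a log scale
    for column in decimal_cols:
        distribution(df.select(column), column, log_scale=True)
        print(column)
    # Value counts of each categorical column
    for column in categorical_cols:
        countplot(df.select(column), column)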

<div>
<p style="text-align: center;">Plots showing the distributions of the four decimal quantities</p>
<img src="figures/availableMoney_distribution.png" width="250"/>
<img src="figures/creditLimit_distribution.png" width="250"/>
<img src="figures/currentBalance_distribution.png" width="250"/>
<img src="figures/transactionAmount_distribution.png" width="250"/><br>
<p style="text-align: center;">Plots showing the distribution of categorical columns</p>
<img src="figures/creditLimit_barchart.png" width="250"/>
<img src="figures/acqCountry_barchart.png" width="250"/>
<img src="figures/merchantCountryCode_barchart.png" width="250"/>
<img src="figures/posEntryMode_barchart.png" width="250"/><br>
<img src="figures/merchantCategoryCode_barchart.png" width="250"/>
<img src="figures/transactionType_barchart.png" width="250"/>
<img src="figures/merchantName_barchart.png" width="250"/><br>
</div>
Something that is obvious in retrospect is that credit limit is really a categorical variable, even though nothing technically prevents a bank from assigning any non-negative credit limit to a customer. I therefore plotted it twice: once treating the values as decimal and once treating them as categorical.
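
One way to spot such columns up front is to count distinct values: a numeric column with only a handful of levels is a candidate for categorical treatment. The sketch below shows the idea in plain Python. The helper `looks_categorical`, the threshold of 20 levels, and the sample values are all assumptions; in Spark the distinct count could be obtained with `df.select(column).distinct().count()`.

```python
def looks_categorical(values, max_levels=20):
    """True when the column takes at most `max_levels` distinct values."""
    return len(set(values)) <= max_levels


# Made-up values for illustration, not taken from the dataset
credit_limits = [5000, 5000, 2500, 15000, 5000, 2500]
amounts = [round(1.37 * i, 2) for i in range(100)]

print(looks_categorical(credit_limits))
print(looks_categorical(amounts))
```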

## Single variable grouped by fraud/not fraud

In [10]:
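# Plotting is switched off by default; set regenerate_plots = True to redraw the figures.
regenerate_plots = False

if regenerate_plots:
    # Each decimal column against isFraud, once per plot kind
    for column in decimal_cols:
        for kind in catkinds:
            catplot(df.select("isFraud", column), "isFraud", column, kind=kind)

    # Each categorical column as proportions normalised within each isFraud group
    for column in categorical_cols:
        distribution(df.select("isFraud", col(column).cast("string")), column, hue="isFraud", stat="proportion", common_norm=False)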

<div>
<p style="text-align: center;">Box plots of the four decimal columns, grouped by fraud/legit transactions</p>
<img src="figures/availableMoney_boxplot_by_isFraud.png" width="350"/>
<img src="figures/creditLimit_boxplot_by_isFraud.png" width="350"/>
<img src="figures/currentBalance_boxplot_by_isFraud.png" width="350"/>
<img src="figures/transactionAmount_boxplot_by_isFraud.png" width="350"/><br>
<p style="text-align: center;">Boxen plots of the four decimal columns, grouped by fraud/legit transactions</p>
<img src="figures/availableMoney_boxenplot_by_isFraud.png" width="350"/>
<img src="figures/creditLimit_boxenplot_by_isFraud.png" width="350"/>
<img src="figures/currentBalance_boxenplot_by_isFraud.png" width="350"/>
<img src="figures/transactionAmount_boxenplot_by_isFraud.png" width="350"/><br>
<p style="text-align: center;">Strip plots of the four decimal columns, grouped by fraud/legit transactions</p>
<img src="figures/availableMoney_stripplot_by_isFraud.png" width="350"/>
<img src="figures/creditLimit_stripplot_by_isFraud.png" width="350"/>
<img src="figures/currentBalance_stripplot_by_isFraud.png" width="350"/>
<img src="figures/transactionAmount_stripplot_by_isFraud.png" width="350"/><br>
<p style="text-align: center;">Violin plots of the four decimal columns, grouped by fraud/legit transactions</p>
<img src="figures/availableMoney_violinplot_by_isFraud.png" width="350"/>
<img src="figures/creditLimit_violinplot_by_isFraud.png" width="350"/>
<img src="figures/currentBalance_violinplot_by_isFraud.png" width="350"/>
<img src="figures/transactionAmount_violinplot_by_isFraud.png" width="350"/><br>
</div>

1. The values of "availableMoney" are very similar between the legit and fraud groups.
2. "creditLimit" is a little more widely distributed in the fraud group than in the legit group.
3. A greater proportion of legit transactions than of fraudulent ones have "currentBalance" and "transactionAmount" very close to zero.

In general, the distributions of each quantity are not qualitatively different between fraud and legit transactions, but there are quantitative differences.
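
The quantitative differences can be put into numbers with a per-group summary, for example the median and the share of values close to zero. The following is a minimal sketch in plain Python. The helper `summarize_by_group`, the `near_zero` threshold, and the `(isFraud, value)` pairs are assumptions made up for illustration and do not come from the dataset.

```python
from collections import defaultdict
from statistics import median


def summarize_by_group(rows, near_zero=1.0):
    """Per-group median and share of values below `near_zero`."""
    groups = defaultdict(list)
    for is_fraud, value in rows:
        groups[is_fraud].append(value)
    return {
        key: {
            "median": median(vals),
            "share_near_zero": sum(v < near_zero for v in vals) / len(vals),
        }
        for key, vals in groups.items()
    }


# Made-up (isFraud, amount) pairs, for illustration only
rows = [
    (False, 0.0), (False, 0.5), (False, 40.0), (False, 120.0),
    (True, 15.0), (True, 90.0), (True, 250.0), (True, 0.2),
]
summary = summarize_by_group(rows)
print(summary)
```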

## Three-variable analysis
Next I looked at the distribution of numerical variables for the fraud/legit groups, split by categorical variables.

In [2]:
from IPython.display import Image

# Violin plot of transactionAmount by isFraud, split by transactionType
catplot(df.select("isFraud", "transactionAmount", "transactionType"), "isFraud", "transactionAmount", kind="violin", hue="transactionType")
Image(filename="figures/transactionAmount_violinplot_by_isFraud_with_transactionType.png")

The plot above shows that non-fraudulent transactions are concentrated very close to zero in all categories, while fraudulent charges are more spread out. This difference is especially notable for unmarked ("empty") transactions. As a result, the average size of fraudulent transactions is consistently higher across all categories.

In [None]:
catplot(df.select("isFraud", "transactionAmount", "merchantCategoryCode"), "isFraud", "transactionAmount", kind="violin", hue="merchantCategoryCode", aspect=2)
Image(filename="figures/transactionAmount_violinplot_by_isFraud_with_merchantCategoryCode.png")

In [None]:
countplot(df.select("isFraud", "cardPresent"), "isFraud", hue="cardPresent", stat="proportion")
Image(filename="figures/isFraud_barchart_with_cardPresent.png")

In [None]:
# Proportion of cardPresent values among fraudulent transactions only
countplot(df.select("isFraud", "cardPresent").where(df.isFraud == True), "cardPresent", stat="proportion")
Image(filename="figures/cardPresent_barchart.png")

In [None]:
# Proportion of cardPresent values among legitimate transactions only
countplot(df.select("isFraud", "cardPresent").where(df.isFraud == False), "cardPresent", stat="proportion")
Image(filename="figures/cardPresent_barchart.png")

The two plots above show that while legitimate transactions are roughly equally likely to be made with or without the card present, fraudulent transactions happen predominantly without the card present.
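
This comparison rests on proportions computed within each `isFraud` group separately, not across the whole dataset. A minimal sketch of that normalisation in plain Python follows. The helper `within_group_proportions` is hypothetical, and the `(isFraud, cardPresent)` pairs are invented to show the mechanics, not taken from the data.

```python
from collections import Counter


def within_group_proportions(rows):
    """Map each group to {category: share of that group's rows}."""
    pair_counts = Counter(rows)
    group_totals = Counter(group for group, _ in rows)
    result = {}
    for (group, category), n in pair_counts.items():
        result.setdefault(group, {})[category] = n / group_totals[group]
    return result


# Invented (isFraud, cardPresent) pairs, for illustration only
rows = (
    [(False, True)] * 5 + [(False, False)] * 5
    + [(True, False)] * 3 + [(True, True)] * 1
)
props = within_group_proportions(rows)
print(props)
```

Normalising within each group keeps the much smaller fraud group from being dwarfed by the legitimate one, which is what makes the two bar charts comparable.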

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as img
plt.imshow(img.imread("figures/availableMoney_distribution.png"))
plt.axis("off")
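
The figure grids earlier in the notebook are hand-written HTML, and their file names follow the pattern `figures/<column>_<kind>plot_by_isFraud.png`. Under that assumption the markup could be generated instead of typed. `image_grid_html` is a hypothetical helper, sketched here rather than taken from the `modules` package.

```python
def image_grid_html(columns, kind, caption, width=350):
    """Build a centred caption followed by one <img> tag per column."""
    lines = ["<div>", f'<p style="text-align: center;">{caption}</p>']
    for column in columns:
        lines.append(
            f'<img src="figures/{column}_{kind}plot_by_isFraud.png" width="{width}"/>'
        )
    lines.append("</div>")
    return "\n".join(lines)


html = image_grid_html(
    ["availableMoney", "creditLimit"], "box", "Box plots grouped by isFraud"
)
print(html)
```

In a notebook the resulting string could be rendered with `IPython.display.HTML(html)`.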