# Real-Time Credit Card Fraud Detection

## Activity List

This is a dump of proposed activities. Feel absolutely free to create and take
ownership of any activity you may think of!

| Activity | Owner   | Status |
| :-------- | :-------: | :------: |
| EDA - Fraud as a function of hour of day  |   ||
| EDA - Fraud per category |   ||
| EDA - Fraud per vendor | ||
| Feature - # of transactions last 24 hrs  |    ||
| Feature - total amount for card last 24 hrs | ||
| Feature - avg amount for card last 24 hrs | ||
| Feature - # of fraud occurrences for vendor | ||
| ML - pipeline | ||
| ML - streaming | ||
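
The three per-card "last 24 hrs" features in the list share one sliding window, so it may help to pin down their definition before writing the Spark version. The following is a minimal plain-Python sketch, not the project's implementation: the function name `rolling_card_features`, the output keys, and the choice to include the current transaction in its own window are all assumptions, and `unix_time` is assumed to be in seconds.

```python
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60  # 24 hours, assuming unix_time is in seconds


def rolling_card_features(transactions):
    """Return one feature dict per transaction, in input order.

    `transactions` is a sequence of (cc_num, unix_time, amt) tuples.
    The window is (t - 24h, t], so it includes the current transaction
    (an assumption; exclude it if the feature must be strictly historical).
    """
    transactions = list(transactions)

    # group rows by card, remembering each row's original position
    by_card = defaultdict(list)
    for idx, (cc_num, unix_time, amt) in enumerate(transactions):
        by_card[cc_num].append((unix_time, amt, idx))

    features = [None] * len(transactions)
    for rows in by_card.values():
        rows.sort()  # chronological order within the card
        start = 0
        running_total = 0.0
        for end, (t, amt, idx) in enumerate(rows):
            running_total += amt
            # drop transactions that happened 24 hours or more before t
            while rows[start][0] <= t - WINDOW_SECONDS:
                running_total -= rows[start][1]
                start += 1
            count = end - start + 1
            features[idx] = {
                "txn_count_24h": count,
                "amt_total_24h": running_total,
                "amt_avg_24h": running_total / count,
            }
    return features


# toy transactions (made-up values): card "A" has three, card "B" has one
sample = [("A", 0, 10.0), ("A", 3600, 30.0), ("B", 100, 5.0), ("A", 90000, 20.0)]
sample_features = rolling_card_features(sample)
```

In Spark the same idea would likely become a window partitioned by `cc_num`, ordered by a numeric timestamp, with a range-based frame. Note that `unix_time` is loaded as a string in the schema below, so it would need a cast first.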
    

## Basic Imports and Settings

In [None]:
# import modules from pyspark
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *  # note: shadows Python built-ins such as sum, max and round
from pyspark.sql import SQLContext
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

# uncomment the following line if running pyspark from the notebook itself
# spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
sqlContext = SQLContext(spark)

## Loading the Dataset File and Performing Basic Data Type Conversions

Source of the dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection

In [None]:
# define a reusable schema for the dataset (will be useful for the real-time portion)
ccschema = StructType([
    StructField("_c0", IntegerType(), True),
    StructField("trans_date_trans_time", TimestampType(), True),
    StructField("cc_num", StringType(), True),
    StructField("merchant", StringType(), True),
    StructField("category", StringType(), True),
    StructField("amt", DoubleType(), True),
    StructField("first", StringType(), True),
    StructField("last", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("street", StringType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("zip", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("long", DoubleType(), True),
    StructField("city_pop", DoubleType(), True),
    StructField("job", StringType(), True),
    StructField("dob", DateType(), True),
    StructField("trans_num", StringType(), True),
    StructField("unix_time", StringType(), True),
    StructField("merch_lat", DoubleType(), True),
    StructField("merch_long", DoubleType(), True),
    StructField("is_fraud", IntegerType(), True),
])

In [None]:
# read the data from a local folder
cc = (spark.read.csv("fraudTrain.csv", schema=ccschema, header=True))

In [None]:
# read the data in the context of Databricks (same schema as the local read)
# cc = (spark.read
#  .option("header", "true")
#  .schema(ccschema)
#  .csv("s3://group9-ml-project/fraudTrain.csv"))

In [None]:
# use this code to create some sample data from the main dataset for the purpose
# of loading transactions for the real time portion

# cc.limit(10).write.option("header",True).csv("sample")

## Exploratory Data Analysis

### A Look at the Data and its Basic Statistics

In [None]:
# let's look at the first 5 rows
pd.DataFrame(cc.take(5), columns=cc.columns)

In [None]:
# basic statistics
cc.describe().toPandas()

In [None]:
# looking to see if there are null values
cc.select(*(sum(col(c).isNull().cast("int")).alias(c) for c in cc.columns)).toPandas()

### Visualizing the Data Distribution

In [None]:
fig, axs = plt.subplots(4 , 2, figsize=(15, 20))
fig.suptitle('CC Fraud Data Distribution')

for idx, column in enumerate(['amt', 'city_pop', 'lat', 'long', 'merch_lat', 'merch_long', 'unix_time', 'is_fraud']):
    # Show histogram of the column
    bins, counts = cc.select(column).rdd.flatMap(lambda x: x).map(float).histogram(20)
    axs[idx//2][idx%2].set_title(column)
    axs[idx//2][idx%2].hist(bins[:-1], bins=bins, weights=counts)
    
plt.show()

In [None]:
cc.select("category").groupby("category").count().toPandas()

In [None]:
cc.select("gender").groupby("gender").count().toPandas()

### EDA Preliminary Findings

- The data appears to be clean with no missing values
- Heavily skewed features such as `amt` and `city_pop` may benefit from a logarithmic transformation
- The target class (is_fraud) is heavily imbalanced
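
One common way to handle the class imbalance noted above is to weight each row inversely to its class frequency. The sketch below shows the usual "balanced" weighting formula in plain Python; the helper name `balanced_class_weights` and the toy label counts are assumptions for illustration, not values taken from the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    labels = list(labels)
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# toy labels: 2 fraud rows out of 100 (made-up ratio, not the dataset's)
toy_labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(toy_labels)
# each class now contributes the same total weight: 98 * weights[0] == 2 * weights[1]
```

If this route is taken, the weights could be written to a column and passed to the classifier; recent Spark versions expose a `weightCol` parameter on `RandomForestClassifier`, but check the version in use.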

## ML Pipeline Setup

In [None]:
# define logarithmic transformer
from pyspark import keyword_only  # Note: use pyspark.ml.util.keyword_only if Spark < 2.0
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
 
class LogTransformer(Transformer,               # Base class
                     HasInputCol,               # Sets up an inputCol parameter
                     HasOutputCol,              # Sets up an outputCol parameter
                     DefaultParamsReadable,     # Makes parameters readable from file
                     DefaultParamsWritable      # Makes parameters writable from file
                    ):
  
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        """
        Constructor: set values for all Param objects
        """
        super().__init__()
        self._setDefault()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)
  
    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)
  
    # Required if you use Spark >= 3.0
    def setInputCol(self, new_inputCol):
        return self.setParams(inputCol=new_inputCol)
  
    # Required if you use Spark >= 3.0
    def setOutputCol(self, new_outputCol):
        return self.setParams(outputCol=new_outputCol)
  
    def _transform(self, dataset):
        """
        This is the main member function which applies the transform to transform data from the `inputCol` to the `outputCol`
        """
        if not self.isSet("inputCol"):
            raise ValueError(
                "No input column set for the "
                "LogTransformer transformer."
            )
        input_column = self.getInputCol()
        output_column = self.getOutputCol()

        return dataset.withColumn(output_column,
                                  log(col(input_column)))

In [None]:
from pyspark.ml.feature import StringIndexer

# define a transformer to convert string categorical features to numeric indices
inputs = ['merchant', 'category', 'gender', 'city', 'state', 'job']
outputs = ['merchant_idx', 'category_idx', 'gender_idx', 'city_idx', 'state_idx', 'job_idx']
stringIndexer = StringIndexer(inputCols=inputs, outputCols=outputs)


In [None]:
from pyspark.ml.feature import OneHotEncoder

# define a transformer to one-hot encode indexed categorical features
inputs_1hot = ['merchant_idx', 'category_idx', 'city_idx', 'state_idx', 'job_idx']
outputs_1hot = ['merchant_1hot', 'category_1hot', 'city_1hot', 'state_1hot', 'job_1hot']

oneHotEncoder = OneHotEncoder(inputCols=inputs_1hot, outputCols=outputs_1hot)
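
To make the two encoding stages concrete, here is a small plain-Python sketch of what string indexing followed by one-hot encoding produces. It mimics the Spark defaults as I understand them (most frequent label gets index 0, and the last category is dropped from the one-hot vector); the helpers `fit_string_index` and `one_hot` and the toy category values are hypothetical, not part of the pipeline.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if position == index else 0.0 for position in range(size)]


# toy category column (made-up values)
categories = ["grocery_pos", "travel", "grocery_pos", "home", "grocery_pos", "travel"]
mapping = fit_string_index(categories)
encoded = {value: one_hot(idx, len(mapping)) for value, idx in mapping.items()}
```

One thing to keep in mind: with the default `handleInvalid="error"`, `StringIndexer` raises at transform time on labels it did not see during fitting, which can matter for high-cardinality columns such as `merchant` or `city`.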


In [None]:
from pyspark.ml.feature import VectorAssembler

# assemble the prepped features into one single vector.
featureCols = ['amt_log', 'city_pop_log', 'job_1hot', 'state_1hot', 'category_1hot', 'gender_idx']
assembler = (VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features"))

# cc_final = assembler.transform(cc_prepped)

In [None]:
amtTransformer = LogTransformer(inputCol="amt", outputCol="amt_log")
cityPopTransformer = LogTransformer(inputCol="city_pop", outputCol="city_pop_log")

## ML Training and Prediction - RandomForestClassifier

In [None]:
# fix the seed so the split is reproducible between runs
training, test = cc.randomSplit([0.7, 0.3], seed=42)

print(training.count())
print(test.count())

In [None]:
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="is_fraud", seed=42,
    leafCol="leafId")
rf.setFeaturesCol("features")

# define pipeline using previously defined stages
pipeline = Pipeline(stages=[stringIndexer, oneHotEncoder, amtTransformer, cityPopTransformer, assembler, rf])

model = pipeline.fit(training)

In [None]:
preds = model.transform(test) 

In [None]:
pd.DataFrame(preds.take(5), columns=preds.columns)

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Instantiate the evaluator
bce = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                    labelCol="is_fraud",
                                    metricName="areaUnderROC")
                                   
bce.evaluate(preds)
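
`areaUnderROC` can be read as the probability that a randomly chosen fraudulent row receives a higher score than a randomly chosen legitimate one, with ties counting as one half. The sketch below illustrates that reading in plain Python on made-up scores; `auc_by_pairs` is a hypothetical helper, not how Spark computes the metric internally.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (fraud, legitimate) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one row of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_labels = [0, 1, 0, 1]

# a model that gives every row the same score cannot rank anything
constant_auc = auc_by_pairs([0.1, 0.1, 0.1, 0.1], toy_labels)

# a model that scores every fraud row above every legitimate row
perfect_auc = auc_by_pairs([0.1, 0.9, 0.2, 0.8], toy_labels)
```

Under this reading, a value of 0.5 means the scores carry no ranking information about fraud.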

In [None]:
from pyspark.mllib.evaluation import MulticlassMetrics

# create confusion matrix

preds_float = preds \
    .select("prediction", "is_fraud") \
    .withColumn("is_fraud", col("is_fraud").cast(DoubleType())) \
    .orderBy("prediction")

cm = MulticlassMetrics(preds_float.rdd.map(tuple))

# print(cm.confusionMatrix().toArray())

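# show the confusion matrix as a pandas df for clearer presentation
# (rows are the actual classes, columns the predicted classes, both in ascending label order)
pd.DataFrame(cm.confusionMatrix().toArray(),
             columns=["predicted legitimate (0)", "predicted fraud (1)"],
             index=["actual legitimate (0)", "actual fraud (1)"])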

### Results Analysis

With a 100% false negative rate in the confusion matrix (every fraudulent transaction in the test set was missed) and an AUC of 0.5 (no better than random guessing), we obviously have work to do! ;-)
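
For reference, the rates discussed above can be derived from the four confusion-matrix cells as sketched below. The helper `confusion_rates` and the counts are made up for illustration; they are not the notebook's actual results. The example also shows why accuracy alone is misleading on an imbalanced target.

```python
def confusion_rates(tn, fp, fn, tp):
    """Derive common rates from confusion-matrix counts (fraud = positive class)."""
    actual_pos = tp + fn
    actual_neg = tn + fp
    predicted_pos = tp + fp
    return {
        "accuracy": (tp + tn) / (actual_pos + actual_neg),
        "recall": tp / actual_pos if actual_pos else None,
        "false_negative_rate": fn / actual_pos if actual_pos else None,
        # precision is undefined when the model never predicts fraud
        "precision": tp / predicted_pos if predicted_pos else None,
    }


# made-up counts for a model that labels every row as legitimate
rates = confusion_rates(tn=9900, fp=0, fn=100, tp=0)
```

Here accuracy is 99% even though no fraud is caught, so recall and the false negative rate are the more informative numbers to track.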