### Machine Learning Pipeline
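The pipeline trains and compares the following classifiers:
1. Logistic Regression
2. Decision Tree
3. Random Forest
4. Gradient-boosted trees
5. Linear Support Vector Machine
6. Naive Bayes (defined in the code but currently commented out)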

### How will this work?
Here's our recipe for success:
- 1st: load the data
- 2nd: split the data (80/20 approach)
- 3rd: collect the feature columns and assemble them into a single vector column
- 4th: instantiate the models
- 5th: build and run the pipeline
- 6th: apply metrics (we use accuracy, precision, recall and F1-score)
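
For reference, the four metrics in the last step can be written in terms of the confusion-matrix counts for one chosen positive class: true positives $TP$, false positives $FP$, false negatives $FN$ and true negatives $TN$. These are the standard textbook definitions, not anything specific to this notebook:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}
$$

$$
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

Precision, recall and $F_1$ depend on which class is treated as positive, so the same predictions can give very different values for the fraud class and the non-fraud class.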

Documentation:
- https://spark.apache.org/docs/latest/ml-pipeline.html
- https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
- https://www.v7labs.com/blog/f1-score-guide
- https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
- https://datascience-enthusiast.com/Python/PySpark_ML_with_Text_part1.html

### Let's start with the imbalanced dataset


In [24]:
# imports and configure spark session

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, LinearSVC, NaiveBayes
from pyspark.ml.feature import StringIndexer, VectorIndexer, IndexToString, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("xpto") \
    .getOrCreate()
sc = spark.sparkContext

In [25]:
# load the data and show first 5 records
data = spark.read.csv('datasets/creditcard.csv', header=True, inferSchema=True, sep=",")


data.show(5)

+----+------------------+-------------------+----------------+------------------+-------------------+-------------------+-------------------+------------------+------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+------------------+-------------------+--------------------+-------------------+------------------+------------------+------------------+------------------+--------------------+-------------------+------+-----+
|Time|                V1|                 V2|              V3|                V4|                 V5|                 V6|                 V7|                V8|                V9|                V10|               V11|               V12|               V13|               V14|               V15|               V16|               V17|                V18|               V19|                V20|                 V21|                V22|     

In [26]:
# build an approximate 80/20 split (randomSplit samples rows at random, so the sizes are not exact)
train, test = data.randomSplit(weights=[0.8, 0.2], seed=42)
print('Train shape: ', (train.count(), len(train.columns)))
print('Test shape: ', (test.count(), len(test.columns)))

Train shape:  (227766, 31)
Test shape:  (57041, 31)

In [27]:
# get feature columns names
feature_columns = [col for col in data.columns if col != 'Class']
print(feature_columns)
print(len(feature_columns))

['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']
30


In [28]:
# vectorize: assemble the feature columns into a single 'features' vector column
vectorizer = VectorAssembler(inputCols=feature_columns, outputCol="features")
train_vec = vectorizer.transform(train)
test_vec = vectorizer.transform(test)

In [42]:
# instantiate Models

# logistic regression
lr = LogisticRegression(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction',
    maxIter=10,
    regParam=0.3,
    elasticNetParam=0.8
)

# decision tree
dt = DecisionTreeClassifier(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction'
)

# random forest
rf = RandomForestClassifier(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction'
)

# gradient-boosted trees
gbt = GBTClassifier(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction'
)

# linear support vector machines
lsvc = LinearSVC(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction'
)

# naive bayes (left disabled; note that Spark's multinomial NaiveBayes requires non-negative feature values)
#nb = NaiveBayes(
    #featuresCol='features',
    #labelCol='Class',
    #predictionCol='Class_Prediction',
    #smoothing=1.0,
    #modelType="multinomial"
#)

# create list of models
list_of_models = [lr, dt, rf, gbt, lsvc]
list_of_model_names = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient-Boosted Tree', 'Linear Support Vector Machines']

# go through list
for model, model_name in zip(list_of_models, list_of_model_names):

    # print current model
    print('Current model: ', model_name)

    # create a pipeline object
    pipeline = Pipeline(stages=[model])

    # fit pipeline
    pipeline_model = pipeline.fit(train_vec)

    # get predictions on the test set
    test_pred = pipeline_model.transform(test_vec)

    # get accuracy on test set
    accuracy_evaluator = MulticlassClassificationEvaluator(predictionCol='Class_Prediction', labelCol='Class', metricName='accuracy')
    accuracy_score = accuracy_evaluator.evaluate(test_pred)
    print('Accuracy: ', accuracy_score)

    # get precision on test set
    # note: precisionByLabel reports a single class, chosen by metricLabel (default 0.0),
    # so this is the precision for class 0 (not fraud)
    precision_evaluator = MulticlassClassificationEvaluator(predictionCol='Class_Prediction', labelCol='Class', metricName='precisionByLabel')
    precision_score = precision_evaluator.evaluate(test_pred)
    print('Precision: ', precision_score)

    # get recall on test set (recallByLabel, likewise for class 0 by default)
    recall_evaluator = MulticlassClassificationEvaluator(predictionCol='Class_Prediction', labelCol='Class', metricName='recallByLabel')
    recall_score = recall_evaluator.evaluate(test_pred)
    print('Recall: ', recall_score)

    # get f1-score on test set
    # note: 'f1' is the F1 averaged over both classes, weighted by class frequency
    f1_evaluator = MulticlassClassificationEvaluator(predictionCol='Class_Prediction', labelCol='Class', metricName='f1')
    f1_score = f1_evaluator.evaluate(test_pred)
    print('F1-score: ', f1_score)

Current model:  Logistic Regression
Accuracy:  0.9980890938097158
Precision:  0.9980890938097158
Recall:  1.0
F1-score:  0.9971345544782491
Current model:  Decision Tree
Accuracy:  0.9992812187724619
Precision:  0.9993855444953564
Recall:  0.9998946111150144
F1-score:  0.9992261656907501
Current model:  Random Forest
Accuracy:  0.999298750021914
Precision:  0.9995083233827351
Recall:  0.9997892222300288
F1-score:  0.9992710270079342
Current model:  Gradient-Boosted Tree
Accuracy:  0.9993338125208183
Precision:  0.9994732591213962
Recall:  0.9998594814866859
F1-score:  0.9992964887936008
Current model:  Linear Support Vector Machines
Accuracy:  0.9993338125208183
Precision:  0.9995609721831975
Recall:  0.9997716574158645
F1-score:  0.9993144441026857

#### Conclusions
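As expected when working with an imbalanced binary target, every metric looks excellent, but the scores are misleading. More than 99% of the records are labeled '0 - not fraud', so a model can score this well by predicting 0 almost all of the time. Logistic Regression shows the extreme case: its recall is 1.0 and its precision equals its accuracy, which is what happens when every test record is predicted as 0. The precision and recall above are also computed for class 0 (the default `metricLabel` of `precisionByLabel` and `recallByLabel`), and the F1-score is weighted by class frequency, so none of these numbers shows how well the fraud class is detected.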
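
To make the point concrete, here is a minimal pure-Python sketch (not part of the original notebook) of a "classifier" that predicts 0 for every record. It assumes the test set holds 57041 records, as printed above, of which 109 are fraud. The 109 is not stated anywhere in the output; it is inferred from the Logistic Regression accuracy, so treat it as an assumption. The helper names `per_label_metrics` and `weighted_f1` are made up for this illustration, and the weighted F1 is the usual average of per-class F1 scores weighted by class frequency.

```python
def per_label_metrics(labels, preds, target):
    """Return (precision, recall, f1) treating `target` as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == target and p == target)
    fp = sum(1 for y, p in zip(labels, preds) if y != target and p == target)
    fn = sum(1 for y, p in zip(labels, preds) if y == target and p != target)
    # Undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_f1(labels, preds):
    """Average the per-class F1 scores, weighting each class by its share of the labels."""
    total = len(labels)
    return sum(
        per_label_metrics(labels, preds, c)[2] * labels.count(c) / total
        for c in set(labels)
    )


# Assumed test set: 57041 records, 109 of them fraud (inferred, not read from the data).
n_total, n_fraud = 57041, 109
labels = [1] * n_fraud + [0] * (n_total - n_fraud)
preds = [0] * n_total  # baseline: always predict "not fraud"

accuracy = sum(y == p for y, p in zip(labels, preds)) / n_total
precision_0, recall_0, _ = per_label_metrics(labels, preds, 0)
precision_1, recall_1, f1_1 = per_label_metrics(labels, preds, 1)
f1_weighted = weighted_f1(labels, preds)

print('Accuracy: ', accuracy)
print('Precision (class 0): ', precision_0)
print('Recall (class 0): ', recall_0)
print('Weighted F1: ', f1_weighted)
print('Recall (class 1, fraud): ', recall_1)
```

Under these assumptions the baseline reaches about 99.8% accuracy and a class-0 recall of 1.0 while catching no fraud at all (class-1 recall is 0.0). Its accuracy, class-0 precision, class-0 recall and weighted F1 come out very close to the Logistic Regression figures above, which is why metrics computed for the fraud label are the more informative ones on this dataset.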