# 2.0 Machine Learning Pipeline

The pipeline will involve the following models:

- Logistic Regression
- Decision Tree
- Random Forest
- Gradient-Boosted Trees
- Linear Support Vector Machine

Here's our recipe for success:

1. Load the data
2. Balance the classes by undersampling the non-fraud transactions
3. Split the data 80/20 into training and test sets
4. Select feature columns and vectorize them
5. Instantiate models
6. Build and run the pipeline
7. Evaluate with accuracy, precision, recall, and F1-score

Documentation:

- [Apache Spark ML Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html)
- [Decision Tree Classifier](https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier)
- [F1-Score Guide](https://www.v7labs.com/blog/f1-score-guide)
- [Precision and Recall](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall)
- [PySpark ML with Text (Part 1)](https://datascience-enthusiast.com/Python/PySpark_ML_with_Text_part1.html)

In [9]:
# Import libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Import PySpark libraries
from pyspark.sql import Window
import pyspark.sql.types as t
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, GBTClassifier, LinearSVC, NaiveBayes
from pyspark.ml.feature import StringIndexer, VectorIndexer, IndexToString, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Configure Spark Session
spark = SparkSession.builder \
    .appName("Credit Card ML") \
    .config("spark.master", "local") \
    .getOrCreate()
sc = spark.sparkContext

In [3]:
# Load file
df = spark.read.csv('datasets/creditcard.csv', header=True, inferSchema=True, sep=",")

                                                                                

In [4]:
# Select fraud and non-fraud transactions and limit non-fraud transactions to the same number as fraud transactions
fraud_df = df.filter(f.col('Class') == 1)
non_fraud_df = df.filter(f.col('Class') == 0).limit(fraud_df.count())

# Combine fraud and non-fraud transactions and shuffle the data
balanced_df = fraud_df.union(non_fraud_df).orderBy(f.rand())

# Show 10 rows of the shuffled, balanced dataframe
balanced_df.limit(10).toPandas()

                                                                                



                                                                                

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135095.0 | 0.232512 | 0.938944 | -4.64778 | 3.079844 | -1.902655 | -1.041408 | -1.020407 | 0.547069 | -1.10599 | ... | 0.911373 | 1.042929 | 0.999394 | 0.90126 | -0.452093 | 0.192959 | 0.180859 | -0.029315 | 345.0 | 1 |
| 1 | 169142.0 | -1.927883 | 1.125653 | -4.518331 | 1.749293 | -1.566487 | -2.010494 | -0.88285 | 0.697211 | -2.064945 | ... | 0.778584 | -0.319189 | 0.639419 | -0.294885 | 0.537503 | 0.788395 | 0.29268 | 0.147968 | 390.0 | 1 |
| 2 | 160870.0 | -0.644278 | 5.002352 | -8.252739 | 7.756915 | -0.216267 | -2.751496 | -3.358857 | 1.406268 | -4.403852 | ... | 0.587728 | -0.605759 | 0.033746 | -0.75617 | -0.008172 | 0.532772 | 0.66397 | 0.192067 | 0.77 | 1 |
| 3 | 145.0 | -2.420413 | 1.947885 | 0.553646 | 0.983069 | -0.281518 | 2.408958 | -1.401613 | -0.188299 | 0.675878 | ... | 1.213826 | -1.23862 | 0.006927 | -1.724222 | 0.239603 | -0.313703 | -0.188281 | 0.119831 | 6.0 | 0 |
| 4 | 91.0 | -1.822273 | 1.235336 | -0.307804 | -1.821824 | 2.762482 | 3.641499 | -0.344614 | -1.547541 | -0.138239 | ... | 2.080848 | -1.591888 | 0.321636 | 0.889258 | 0.156445 | -0.960611 | -0.035302 | 0.182321 | 15.89 | 0 |
| 5 | 91554.0 | -5.100256 | 3.633442 | -3.843919 | 0.183208 | -1.183997 | 1.602139 | -3.005953 | -8.645038 | 1.285458 | ... | 8.280439 | -2.79715 | 1.090707 | -0.15926 | 0.532156 | -0.497126 | 0.943622 | 0.553581 | 261.22 | 1 |
| 6 | 142280.0 | -1.169203 | 1.863414 | -2.515135 | 5.463681 | -0.297971 | 1.364918 | 0.759219 | -0.118861 | -2.293921 | ... | -0.39309 | -0.708692 | 0.471309 | -0.078616 | -0.544655 | 0.014777 | -0.24093 | -0.781055 | 324.59 | 1 |
| 7 | 95.0 | 1.195572 | 0.258858 | 0.635796 | 0.641257 | -0.395081 | -0.694667 | 0.034086 | -0.124346 | -0.0784 | ... | -0.201249 | -0.516925 | 0.199096 | 0.412552 | 0.122984 | 0.10194 | -0.007846 | 0.020214 | 1.29 | 0 |
| 8 | 303.0 | 1.254258 | 1.218376 | -2.148615 | 1.155957 | 1.813892 | -0.238358 | 0.623888 | -0.060265 | -0.739258 | ... | -0.210083 | -0.463849 | -0.370852 | -1.644707 | 0.96267 | -0.200548 | 0.055746 | 0.071654 | 2.95 | 0 |
| 9 | 307.0 | -2.658288 | -3.014776 | 2.271636 | -1.218204 | 1.546541 | -1.682064 | -1.524913 | 0.088425 | -0.761266 | ... | -0.212736 | -0.587618 | 0.320804 | -0.013227 | -0.233772 | 0.742313 | -0.505256 | 0.394053 | 52.9 | 0 |
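One thing worth checking in this preview is whether `Time` alone separates the two classes. The sketch below is plain Python rather than PySpark and uses only the ten (`Time`, `Class`) pairs displayed above, so it can suggest a pattern but cannot confirm it for the full balanced sample.

```python
# (Time, Class) pairs copied from the ten preview rows shown above
preview = [
    (135095.0, 1), (169142.0, 1), (160870.0, 1), (145.0, 0), (91.0, 0),
    (91554.0, 1), (142280.0, 1), (95.0, 0), (303.0, 0), (307.0, 0),
]

# Group the Time values by class label
non_fraud_times = [time for time, label in preview if label == 0]
fraud_times = [time for time, label in preview if label == 1]

latest_non_fraud = max(non_fraud_times)
earliest_fraud = min(fraud_times)

# True when a single threshold on Time splits the two classes in this preview
separable_by_time = latest_non_fraud < earliest_fraud

print('Latest non-fraud Time:', latest_non_fraud)
print('Earliest fraud Time:', earliest_fraud)
print('Separable by Time alone:', separable_by_time)
```

The same check could be run on the whole sample with `balanced_df.groupBy('Class').agg(f.min('Time'), f.max('Time')).show()`.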


In [5]:
# build a data split: 80/20
train, test = balanced_df.randomSplit(weights=[0.8, 0.2], seed=42)
print('Train shape: ', (train.count(), len(train.columns)))
print('Test shape: ', (test.count(), len(test.columns)))

                                                                                

Train shape:  (825, 31)

Test shape:  (159, 31)
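The split is not exactly 80/20: 825 of 984 rows is about 84%. This is consistent with `randomSplit` treating the weights as per-row sampling probabilities rather than exact quotas. The sketch below is a plain-Python illustration of that idea, not Spark's implementation; `random_split` is a hypothetical helper written for this example.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform random number per row."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds of the normalized weights, e.g. [0.8, 1.0]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # First split whose upper bound exceeds the draw; fall back to the last split
        index = next((i for i, upper in enumerate(bounds) if draw < upper),
                     len(weights) - 1)
        splits[index].append(row)
    return splits


# 984 stand-in rows, the same total as the train and test counts above
train_rows, test_rows = random_split(range(984), [0.8, 0.2], seed=42)
print('Train rows:', len(train_rows))
print('Test rows:', len(test_rows))
```

Every row lands in exactly one split, but the sizes only approximate the requested proportions and change with the seed.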


                                                                                

In [6]:
# get feature column names (every column except the label)
feature_columns = [col for col in balanced_df.columns if col != 'Class']
print(feature_columns)
print(len(feature_columns))

['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']
30


In [7]:
# vectorize
vectorizer = VectorAssembler(inputCols=feature_columns, outputCol="features")
train_vec = vectorizer.transform(train)
test_vec = vectorizer.transform(test)

train_vec.select('Time','Features','Class').show(10)



+----+--------------------+-----+
|Time|            Features|Class|
+----+--------------------+-----+
| 0.0|[0.0,-1.359807133...|    0|
| 0.0|[0.0,1.1918571113...|    0|
| 1.0|[1.0,-0.966271711...|    0|
| 2.0|[2.0,-1.158233093...|    0|
| 2.0|[2.0,-0.425965884...|    0|
| 7.0|[7.0,-0.894286082...|    0|
| 9.0|[9.0,-0.338261752...|    0|
|10.0|[10.0,0.384978215...|    0|
|10.0|[10.0,1.249998742...|    0|
|10.0|[10.0,1.449043781...|    0|
+----+--------------------+-----+
only showing top 10 rows



                                                                                

In [8]:
# instantiate Models

# logistic regression
lr = LogisticRegression(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction',
    maxIter=10,
    regParam=0.3,
    elasticNetParam=0.8
)
# decision tree
dt = DecisionTreeClassifier(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction'
)
# random forest
rf = RandomForestClassifier(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction'
)
# gradient-boosted tree
gbt = GBTClassifier(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction'
)
# linear support vector machines
lsvc = LinearSVC(
    featuresCol='features',
    labelCol='Class',
    predictionCol='Class_Prediction'
)

# create list of models
list_of_models = [lr, dt, rf, gbt, lsvc]
list_of_model_names = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient-Boosted Tree', 'Linear Support Vector Machines']

# go through list
for model, model_name in zip(list_of_models, list_of_model_names):

    # print current model
    print('Current model: ', model_name)

    # create a pipeline object
    pipeline = Pipeline(stages=[model])

    # fit pipeline
    pipeline_model = pipeline.fit(train_vec)

    # get scores on the test set
    test_pred = pipeline_model.transform(test_vec)


    # get accuracy on test set
    accuracy_evaluator = MulticlassClassificationEvaluator(predictionCol='Class_Prediction', labelCol='Class', metricName='accuracy')
    accuracy_score = accuracy_evaluator.evaluate(test_pred)
    print('Accuracy: ', accuracy_score)

    # get precision on test set
    # (precisionByLabel scores one class; the default metricLabel=0.0 is 'not fraud')
    precision_evaluator = MulticlassClassificationEvaluator(predictionCol='Class_Prediction', labelCol='Class', metricName='precisionByLabel')
    precision_score = precision_evaluator.evaluate(test_pred)
    print('Precision: ', precision_score)

    # get recall on test set
    # (recallByLabel also defaults to metricLabel=0.0, the 'not fraud' class)
    recall_evaluator = MulticlassClassificationEvaluator(predictionCol='Class_Prediction', labelCol='Class', metricName='recallByLabel')
    recall_score = recall_evaluator.evaluate(test_pred)
    print('Recall: ', recall_score)

    # get f1-score on test set (weighted average over both classes)
    f1_evaluator = MulticlassClassificationEvaluator(predictionCol='Class_Prediction', labelCol='Class', metricName='f1')
    f1_score = f1_evaluator.evaluate(test_pred)
    print('F1-score: ', f1_score)

Current model:  Logistic Regression
Accuracy:  0.9245283018867925
Precision:  0.8588235294117647
Recall:  1.0
F1-score:  0.9245641270599474

Current model:  Decision Tree
Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1-score:  1.0

Current model:  Random Forest
Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1-score:  1.0

Current model:  Gradient-Boosted Tree
Accuracy:  1.0
Precision:  1.0
Recall:  1.0
F1-score:  1.0

Current model:  Linear Support Vector Machines
Accuracy:  0.9874213836477987
Precision:  0.9733333333333334
Recall:  1.0
F1-score:  0.9874323824379319

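Every model scores very highly, and the three tree-based models reach 1.0 on all four metrics. Class imbalance cannot be the reason: in the full dataset 99% of records have a value of '0 - not fraud', but these models were trained on `balanced_df`, which was undersampled to equal numbers of fraud and non-fraud rows. A more likely explanation is leakage through `Time`. The non-fraud rows were taken with `limit()`, which is not a random sample, and in the ten-row preview above every non-fraud row has a `Time` of at most 307 while every fraud row has a `Time` above 90,000. If that pattern holds for the whole sample, a single split on `Time` separates the classes, so these scores probably overstate how well the models detect fraud. Note also that `precisionByLabel` and `recallByLabel` default to label 0, so the precision and recall printed above describe the non-fraud class.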
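The Logistic Regression figures can be cross-checked by hand. The sketch below is plain Python, not PySpark, and the confusion counts in it are an assumption: the notebook never prints a confusion matrix, so they were inferred as counts that reproduce the printed accuracy, precision, recall, and F1-score on the 159-row test set. It also assumes that `precisionByLabel` and `recallByLabel` use their default `metricLabel` of 0 and that `f1` is the support-weighted average of the per-class F1 values. `per_label_metrics` is a hypothetical helper.

```python
def per_label_metrics(confusion, label):
    """Return (precision, recall, f1) for one label from {(actual, predicted): count}."""
    correct = confusion.get((label, label), 0)
    predicted_as_label = sum(n for (_, pred), n in confusion.items() if pred == label)
    actually_label = sum(n for (actual, _), n in confusion.items() if actual == label)
    precision = correct / predicted_as_label if predicted_as_label else 0.0
    recall = correct / actually_label if actually_label else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Assumed counts, keyed by (actual class, predicted class); inferred, not printed by the notebook
confusion = {(0, 0): 73, (0, 1): 0, (1, 0): 12, (1, 1): 74}
total = sum(confusion.values())

accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / total
precision_0, recall_0, f1_0 = per_label_metrics(confusion, 0)
precision_1, recall_1, f1_1 = per_label_metrics(confusion, 1)

# Support = number of actual rows per class, used to weight the per-class F1 values
support_0 = confusion[(0, 0)] + confusion[(0, 1)]
support_1 = confusion[(1, 0)] + confusion[(1, 1)]
weighted_f1 = (support_0 * f1_0 + support_1 * f1_1) / total

print('Accuracy:', accuracy)
print('Precision (label 0):', precision_0)
print('Recall (label 0):', recall_0)
print('Weighted F1:', weighted_f1)
print('Recall (label 1, fraud):', recall_1)
```

Under these assumed counts, recall for the fraud class (label 1) is 74/86, about 0.86, which the printed recall of 1.0 does not show.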

## 2.2 Feature Selection

## 2.3 Hyperparameter tuning with CrossValidator

In [None]:
rfparamGrid = (ParamGridBuilder()
               .addGrid(rf.maxDepth, [2, 4, 5])
               .addGrid(rf.numTrees, [5, 10, 20, 100])
               .build())

In [None]:
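# Score each candidate by area under the ROC curve (the evaluator's default metric)
evaluatorRF = BinaryClassificationEvaluator(labelCol='Class')

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=rfparamGrid,
                    evaluator=evaluatorRF,
                    numFolds=5)

# Fit on the vectorized training set, not on test_pred, so the test set stays held out
rfcv = cv.fit(train_vec)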
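Before running the search, it helps to know how much work it involves. The sketch below is plain Python, not PySpark. It enumerates the same `maxDepth` and `numTrees` values as the grid above and counts the training runs, assuming `CrossValidator` fits one model per parameter combination per fold and then refits the best combination once on the full input.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder grid above
max_depths = [2, 4, 5]
num_trees = [5, 10, 20, 100]
num_folds = 5

# Every (maxDepth, numTrees) pair is one candidate configuration
grid = [{'maxDepth': depth, 'numTrees': trees}
        for depth, trees in product(max_depths, num_trees)]

# One fit per candidate per fold, plus one final refit of the best candidate
fits_during_search = len(grid) * num_folds
total_fits = fits_during_search + 1

print('Candidates:', len(grid))
print('Fits during search:', fits_during_search)
print('Total fits including final refit:', total_fits)
```

After fitting, `rfcv.bestModel` holds the refitted model and `rfcv.avgMetrics` lists the mean cross-validated score for each candidate in grid order.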