# Project: Fraud Detection 

## 1. Overview

### PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, the provider of the mobile financial service, which currently runs in more than 14 countries around the world. The objective of the project is to predict whether a transaction is fraudulent.

## 2. Preprocess the data

### Libraries

In [None]:
# libraries: mathematical computing
import numpy as np
import pandas as pd

# libraries: scikit-learn and imbalanced-learn
from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification


# libraries: pyspark sql
from pyspark.sql.types import IntegerType, FloatType
from pyspark.sql.window import Window
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from  pyspark.sql.functions import monotonically_increasing_id, desc, row_number

# libraries: pyspark machine learning
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
from pyspark.ml.functions import vector_to_array
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, DecisionTreeClassifier, NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.mllib.stat import Statistics

# libraries: visualization
import seaborn as sb
import matplotlib.pyplot as mpt
import functools
from collections import Counter

In [None]:
# global variables

global df_bank, results 

#### We'll use PySpark to preprocess the data.

In [None]:
# creation of the SparkSession

spark = SparkSession.builder.appName("FraudDetection").getOrCreate()
spark

In [4]:
# spark dataframe 

df = spark.read.csv('fraudDetection.csv', header=True)

                                                                                

In [5]:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled","true")


#### Now, we'll convert the "df" dataframe into a Parquet file using PySpark's write.parquet method. The file will be named "fraudDetection.parquet".

In [6]:
df.write.parquet("fraudDetection.parquet")

24/05/17 00:40:47 WARN MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers

#### Now, we'll read the data back from the Parquet file. Because Parquet is a columnar format, the calculations will generally be faster.

In [7]:
df_bank_par = spark.read.parquet("fraudDetection.parquet")

#### Let's take a look at the first 10 rows of the data.

In [8]:
df_bank_par.show(10)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|
|  35|CASH_IN|244107.21|C1988196004|    647015.19|      891122.4| C877334652|     792091.74|     547984.53|      0|             0|
|  35|CASH_IN| 17849.53|C1469762907|     891122.4|     908971.93| C733481207|     107400.33|       89550.8|      0|             0|
|  35|CASH_IN|204719.93| C842268344|    908971.93|    1113691.86| C702268498|     531408.31|     326688.37|      0|             0|
|  35|CASH_IN|281004.16| C188755315|   1113691.86|    1394696.02|C1358158097|     6

In [9]:
df_bank_par.printSchema()

root
 |-- step: string (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: string (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: string (nullable = true)
 |-- newbalanceOrig: string (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: string (nullable = true)
 |-- newbalanceDest: string (nullable = true)
 |-- isFraud: string (nullable = true)
 |-- isFlaggedFraud: string (nullable = true)
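
#### Every column in the schema above is a string because CSV fields are plain text unless a schema is inferred or supplied when the file is read. The plain-Python sketch below mimics that behaviour with two rows taken from the sample output and shows an explicit cast. The CASTS mapping and the cast_row helper are hypothetical names used only for illustration.

```python
import csv
import io

# Two rows copied from the sample output above, reduced to four columns.
SAMPLE_CSV = (
    "step,type,amount,isFraud\n"
    "35,CASH_IN,312070.89,0\n"
    "35,CASH_IN,244107.21,0\n"
)

# Hypothetical mapping from column name to the Python type we want.
CASTS = {"step": int, "amount": float, "isFraud": int}


def cast_row(row):
    """Convert the text fields of one CSV row to their target types."""
    return {name: CASTS.get(name, str)(value) for name, value in row.items()}


raw_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
raw_types = {name: type(value).__name__ for name, value in raw_rows[0].items()}
typed_rows = [cast_row(row) for row in raw_rows]

print(raw_types)      # every field starts out as text
print(typed_rows[0])  # numeric fields are now int or float
```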



#### There are 11 columns, and all of them were read as strings because the CSV was loaded without schema inference, so the numerical columns will need to be cast later. Let's count the number of records.

In [10]:
print("The total number of registers is:", df_bank_par.count())

The total number of registers is: 6362620


#### We have more than six million transactions in the dataset.

### 2.1 Feature Engineering

#### First, we'll create a new variable, "type2".

In [11]:
### 2.1.1.- creation of a new variable: type2

df_type2 = df_bank_par.withColumn("type2",f.concat(f.substring("nameOrig",1,1),f.substring("nameDest",1,1)))

In [12]:
df_type2.show(5)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|
|  35|CASH_IN|244107.21|C1988196004|    647015.19|      891122.4| C877334652|     792091.74|     547984.53|      0|             0|   CC|
|  35|CASH_IN| 17849.53|C1469762907|     891122.4|     908971.93| C733481207|     107400.33|       89550.8|      0|             0|   CC|
|  35|CASH_IN|204719.93| C842268344|    908971.93|    1113691.86| C702268498|     531408.31|     326688.37|      0|             0|   CC|
|  35|CASH_IN|281004.16| C188755315|   11

#### We've created a new column named "type2", which is composed of the first character of the column "nameOrig" and the first character of the column "nameDest".
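
#### The same derivation can be sketched in plain Python for a single row. The helper name make_type2 is hypothetical, and the second destination identifier is made up purely to illustrate the "CM" value that appears later in the notebook.

```python
def make_type2(name_orig, name_dest):
    """Concatenate the first character of each account identifier."""
    return name_orig[:1] + name_dest[:1]


# Identifiers from the first row shown above.
first_row_type2 = make_type2("C154541954", "C1995182035")
print(first_row_type2)

# "M0000000001" is a made-up identifier used only to illustrate the "CM" case.
made_up_type2 = make_type2("C154541954", "M0000000001")
print(made_up_type2)
```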

In [13]:
### 2.1.2.1.- One Hot Encoding: column "type"

df_type2.show(3)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|
|  35|CASH_IN|244107.21|C1988196004|    647015.19|      891122.4| C877334652|     792091.74|     547984.53|      0|             0|   CC|
|  35|CASH_IN| 17849.53|C1469762907|     891122.4|     908971.93| C733481207|     107400.33|       89550.8|      0|             0|   CC|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+
only showing top 3 rows



#### We'll use some of Spark's machine learning libraries (Spark ML).

In [14]:
### StringIndexer Initialization
### column: type

indexer_type = StringIndexer(inputCol="type",outputCol="types_indexed")
indexerModel_type = indexer_type.fit(df_type2)


                                                                                

In [15]:
### Transform the DataFrame using the fitted StringIndexer model

indexed_df_type2 = indexerModel_type.transform(df_type2)
indexed_df_type2.show(10)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|          2.0|
|  35|CASH_IN|244107.21|C1988196004|    647015.19|      891122.4| C877334652|     792091.74|     547984.53|      0|             0|   CC|          2.0|
|  35|CASH_IN| 17849.53|C1469762907|     891122.4|     908971.93| C733481207|     107400.33|       89550.8|      0|             0|   CC|          2.0|
|  35|CASH_IN|204719.93| C842268344|    908971.93|    1113691.86| C702268498|     531408.31|  

#### Here, we've mapped each category of the "type" column to a numeric index.
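
#### To make the indexing step concrete, the sketch below assigns indexes by descending frequency, which is how StringIndexer orders labels by default. The function name frequency_index, the alphabetical tie-break, and the toy frequencies are assumptions of this sketch, not values taken from the dataset.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; ties are broken alphabetically
    # (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


# Toy column with made-up frequencies, not the real dataset counts.
toy_types = ["CASH_OUT"] * 3 + ["PAYMENT"] * 2 + ["CASH_IN"]
index_by_label = frequency_index(toy_types)
print(index_by_label)
```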

In [16]:
### apply One-Hot-Encoding to the indexed column, that is, 
### "types_indexed"

encoder_type = OneHotEncoder(dropLast=False, inputCol="types_indexed", outputCol="types_onehot")
encoder_type_df = encoder_type.fit(indexed_df_type2).transform(indexed_df_type2)
encoder_type_df.show(truncate=False)


+----+--------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+
|step|type    |amount   |nameOrig   |oldbalanceOrg|newbalanceOrig|nameDest   |oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed|types_onehot |
+----+--------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+
|35  |CASH_IN |312070.89|C154541954 |334944.3     |647015.19     |C1995182035|1030393.8     |718322.91     |0      |0             |CC   |2.0          |(5,[2],[1.0])|
|35  |CASH_IN |244107.21|C1988196004|647015.19    |891122.4      |C877334652 |792091.74     |547984.53     |0      |0             |CC   |2.0          |(5,[2],[1.0])|
|35  |CASH_IN |17849.53 |C1469762907|891122.4     |908971.93     |C733481207 |107400.33     |89550.8       |0      |0             |CC   |2.0          |(5,[2],[1.0])|
|35 

In [17]:
encoder_type_df.printSchema()

root
 |-- step: string (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: string (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: string (nullable = true)
 |-- newbalanceOrig: string (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: string (nullable = true)
 |-- newbalanceDest: string (nullable = true)
 |-- isFraud: string (nullable = true)
 |-- isFlaggedFraud: string (nullable = true)
 |-- type2: string (nullable = true)
 |-- types_indexed: double (nullable = false)
 |-- types_onehot: vector (nullable = true)



In [18]:
encoder_type_df_split = encoder_type_df.select('*',vector_to_array('types_onehot').alias('types_onehot_split'))
encoder_type_df_split.show(5)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed| types_onehot|  types_onehot_split|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|          2.0|(5,[2],[1.0])|[0.0, 0.0, 1.0, 0...|
|  35|CASH_IN|244107.21|C1988196004|    647015.19|      891122.4| C877334652|     792091.74|     547984.53|      0|             0|   CC|          2.0|(5,[2],[1.0])|[0.0, 0.0, 1.0, 0...|
|  35|CASH_IN| 17849.53|C1469762907|     891122.4|     908971.93| C733

In [19]:
### now, we'll split the "types_onehot_split" into five columns, one per category

num_categories = len(encoder_type_df_split.first()['types_onehot_split'])
cols_expanded = [(f.col('types_onehot_split')[i].alias(f"{indexerModel_type.labels[i]}")) for i in range(num_categories)]
type_df = encoder_type_df_split.select('*',*cols_expanded)


In [20]:
type_df.show(100)

+----+--------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+
|step|    type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed| types_onehot|  types_onehot_split|CASH_OUT|PAYMENT|CASH_IN|TRANSFER|DEBIT|
+----+--------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+
|  35| CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|          2.0|(5,[2],[1.0])|[0.0, 0.0, 1.0, 0...|     0.0|    0.0|    1.0|     0.0|  0.0|
|  35| CASH_IN|244107.21|C1988196004|    647015.19|      891122.4| C877334652|     792091.74

#### We've applied One-Hot-Encoding to the column "type", resulting in five new columns:
+ CASH_OUT
+ CASH_IN
+ PAYMENT
+ TRANSFER 
+ DEBIT
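
#### The sparse notation in the output, such as (5,[2],[1.0]), reads as (vector size, non-zero positions, non-zero values). The plain-Python sketch below expands it into the dense array and then into one named column per category. The helper names are hypothetical; the label order is the one shown in the column headers above.

```python
# Label order as it appears in the column headers of the output above.
LABELS = ["CASH_OUT", "PAYMENT", "CASH_IN", "TRANSFER", "DEBIT"]


def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


def one_hot_columns(index, labels):
    """Return one column per label, with 1.0 only at the given index."""
    dense = sparse_to_dense(len(labels), [int(index)], [1.0])
    return dict(zip(labels, dense))


dense_vector = sparse_to_dense(5, [2], [1.0])
cash_in_row = one_hot_columns(2.0, LABELS)
print(dense_vector)
print(cash_in_row)
```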

#### Now, we'll apply the same procedure to the column "type2".

In [21]:
### 2.1.2.2.- One Hot Encoding: column "type2"

type_df.show(5)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed| types_onehot|  types_onehot_split|CASH_OUT|PAYMENT|CASH_IN|TRANSFER|DEBIT|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|          2.0|(5,[2],[1.0])|[0.0, 0.0, 1.0, 0...|     0.0|    0.0|    1.0|     0.0|  0.0|
|  35|CASH_IN|244107.21|C1988196004|    647015.19|      891122.4| C877334652|     792091.74|    

In [22]:
### StringIndexer Initialization
### column: type2

indexer_type = StringIndexer(inputCol="type2",outputCol="types_indexed2")
indexerModel_type = indexer_type.fit(type_df)

                                                                                

In [23]:
### Transform the DataFrame using the fitted StringIndexer model

indexed_df_type = indexerModel_type.transform(type_df)
indexed_df_type.show(10)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+--------------+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed| types_onehot|  types_onehot_split|CASH_OUT|PAYMENT|CASH_IN|TRANSFER|DEBIT|types_indexed2|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+--------------+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|          2.0|(5,[2],[1.0])|[0.0, 0.0, 1.0, 0...|     0.0|    0.0|    1.0|     0.0|  0.0|           0.0|
|  35|CASH_IN|244107.21|C1988196004|

In [24]:
### apply One-Hot-Encoding to the indexed column, that is, 
### "types_indexed2"

encoder_type2 = OneHotEncoder(dropLast=False, inputCol="types_indexed2", outputCol="types_onehot2")
encoder_type2_df = encoder_type2.fit(indexed_df_type).transform(indexed_df_type)
encoder_type2_df.show(truncate=False)

+----+--------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+-------------------------+--------+-------+-------+--------+-----+--------------+-------------+
|step|type    |amount   |nameOrig   |oldbalanceOrg|newbalanceOrig|nameDest   |oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed|types_onehot |types_onehot_split       |CASH_OUT|PAYMENT|CASH_IN|TRANSFER|DEBIT|types_indexed2|types_onehot2|
+----+--------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+-------------------------+--------+-------+-------+--------+-----+--------------+-------------+
|35  |CASH_IN |312070.89|C154541954 |334944.3     |647015.19     |C1995182035|1030393.8     |718322.91     |0      |0             |CC   |2.0          |(5,[2],[1.0])|[0.0, 0.0, 1.0, 0.0, 0.0]|0.0     |0.0    |1.0    |0

In [25]:
encoder_type2_df.printSchema()

root
 |-- step: string (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: string (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: string (nullable = true)
 |-- newbalanceOrig: string (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: string (nullable = true)
 |-- newbalanceDest: string (nullable = true)
 |-- isFraud: string (nullable = true)
 |-- isFlaggedFraud: string (nullable = true)
 |-- type2: string (nullable = true)
 |-- types_indexed: double (nullable = false)
 |-- types_onehot: vector (nullable = true)
 |-- types_onehot_split: array (nullable = false)
 |    |-- element: double (containsNull = false)
 |-- CASH_OUT: double (nullable = true)
 |-- PAYMENT: double (nullable = true)
 |-- CASH_IN: double (nullable = true)
 |-- TRANSFER: double (nullable = true)
 |-- DEBIT: double (nullable = true)
 |-- types_indexed2: double (nullable = false)
 |-- types_onehot2: vector (nullable = true)



In [26]:
encoder_type2_df_split = encoder_type2_df.select('*',vector_to_array('types_onehot2').alias('types_onehot_split2'))
encoder_type2_df_split.show(5)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+--------------+-------------+-------------------+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed| types_onehot|  types_onehot_split|CASH_OUT|PAYMENT|CASH_IN|TRANSFER|DEBIT|types_indexed2|types_onehot2|types_onehot_split2|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+--------------+-------------+-------------------+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|          2.0|(5,[2],[1.0])|[0.0, 0.0, 

In [27]:
### now, we'll split the "types_onehot_split2" into two columns, one per category

num_categories = len(encoder_type2_df_split.first()['types_onehot_split2'])
cols_expanded = [(f.col('types_onehot_split2')[i].alias(f"{indexerModel_type.labels[i]}")) for i in range(num_categories)]
encoder_type2_df_split = encoder_type2_df_split.select('*',*cols_expanded)

In [28]:
encoder_type2_df_split.show(5)

+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+--------------+-------------+-------------------+---+---+
|step|   type|   amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|type2|types_indexed| types_onehot|  types_onehot_split|CASH_OUT|PAYMENT|CASH_IN|TRANSFER|DEBIT|types_indexed2|types_onehot2|types_onehot_split2| CC| CM|
+----+-------+---------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+-----+-------------+-------------+--------------------+--------+-------+-------+--------+-----+--------------+-------------+-------------------+---+---+
|  35|CASH_IN|312070.89| C154541954|     334944.3|     647015.19|C1995182035|     1030393.8|     718322.91|      0|             0|   CC|          2.0|(

#### We've split the "type2" column into two columns based on One-Hot-Encoding. Before eliminating the unnecessary columns, let's check all the columns.

In [29]:
encoder_type2_df_split.printSchema()

root
 |-- step: string (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: string (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: string (nullable = true)
 |-- newbalanceOrig: string (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: string (nullable = true)
 |-- newbalanceDest: string (nullable = true)
 |-- isFraud: string (nullable = true)
 |-- isFlaggedFraud: string (nullable = true)
 |-- type2: string (nullable = true)
 |-- types_indexed: double (nullable = false)
 |-- types_onehot: vector (nullable = true)
 |-- types_onehot_split: array (nullable = false)
 |    |-- element: double (containsNull = false)
 |-- CASH_OUT: double (nullable = true)
 |-- PAYMENT: double (nullable = true)
 |-- CASH_IN: double (nullable = true)
 |-- TRANSFER: double (nullable = true)
 |-- DEBIT: double (nullable = true)
 |-- types_indexed2: double (nullable = false)
 |-- types_onehot2: vector (nullable = true)
 |-- types_onehot_split2

#### Now, we'll eliminate the unnecessary columns:
+ nameOrig
+ nameDest
+ isFlaggedFraud
+ newbalanceDest
+ oldbalanceDest
+ oldbalanceOrg
+ newbalanceOrig 
+ types_indexed
+ types_onehot
+ types_onehot_split
+ types_indexed2
+ types_onehot2
+ types_onehot_split2
+ type
+ type2

In [30]:
df_bank_par = encoder_type2_df_split.drop("nameOrig","nameDest","isFlaggedFraud","newbalanceDest","oldbalanceDest",
                       "oldbalanceOrg","newbalanceOrig","type","types_indexed","types_onehot",
                       "types_onehot_split","type2","types_indexed2","types_onehot2","types_onehot_split2" )
df_bank_par.show(5)

+----+---------+-------+--------+-------+-------+--------+-----+---+---+
|step|   amount|isFraud|CASH_OUT|PAYMENT|CASH_IN|TRANSFER|DEBIT| CC| CM|
+----+---------+-------+--------+-------+-------+--------+-----+---+---+
|  35|312070.89|      0|     0.0|    0.0|    1.0|     0.0|  0.0|1.0|0.0|
|  35|244107.21|      0|     0.0|    0.0|    1.0|     0.0|  0.0|1.0|0.0|
|  35| 17849.53|      0|     0.0|    0.0|    1.0|     0.0|  0.0|1.0|0.0|
|  35|204719.93|      0|     0.0|    0.0|    1.0|     0.0|  0.0|1.0|0.0|
|  35|281004.16|      0|     0.0|    0.0|    1.0|     0.0|  0.0|1.0|0.0|
+----+---------+-------+--------+-------+-------+--------+-----+---+---+
only showing top 5 rows



In [31]:
df_bank_par.count()

6362620

In [32]:
type(df_bank_par)

pyspark.sql.dataframe.DataFrame

#### We can see that the number of records is unchanged.

### 2.2 Data Cleaning

In [33]:
### 2.2.1.- Eliminate duplicates

num_all_rows = df_bank_par.count()
num_all_rows

6362620

In [34]:
# note: despite its name, this variable holds the number of distinct rows
num_duplicated_rows = df_bank_par.distinct().count()


In [None]:
print("The total number of duplicated rows is:", num_all_rows - num_duplicated_rows)

#### We can see that there are 7597 duplicated rows. Let's remove the null values and the duplicated rows from the df_bank_par dataframe.

In [None]:
df_bank_par = df_bank_par.dropna()

df_bank_par = df_bank_par.dropDuplicates()
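
#### The duplicate count (total rows minus distinct rows) and the cleaning step can be mimicked in plain Python on a handful of rows. The helper names and the toy rows are hypothetical and serve only to illustrate the logic.

```python
def count_duplicates(rows):
    """Number of rows that repeat an earlier row exactly."""
    return len(rows) - len(set(rows))


def clean(rows):
    """Drop rows containing None, then drop exact duplicates (order kept)."""
    complete = [row for row in rows if None not in row]
    return list(dict.fromkeys(complete))


# Toy rows of (step, amount, isFraud); the values are made up.
TOY_ROWS = [
    (35, 312070, 0),
    (35, 312070, 0),  # exact duplicate of the first row
    (35, None, 0),    # row with a missing value
    (36, 100, 1),
]

n_duplicates = count_duplicates(TOY_ROWS)
cleaned_rows = clean(TOY_ROWS)
print(n_duplicates)
print(cleaned_rows)
```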

In [None]:
df_bank_par.count()

#### We can see that the duplicated records have been removed because there are fewer records than before. Let's take a look at the "clean" dataset.

In [None]:
df_bank_par.show(10)

## 3. Exploratory Data Analysis (EDA)

### 3.1 Visualization

#### The visualization will be done using a function that leverages the histogram() method of PySpark RDDs.

In [None]:
# definition of the "histogram" function

def histogram(df, col, bins=10, xname=None, yname=None):
    
    '''
    Plot a histogram of the column named col of the Spark dataframe df.
    xname and yname are optional axis labels (xname defaults to col).
    '''
    
    # Calculating histogram in Spark: returns (bucket edges, counts)
    edges, counts = df.select(col).rdd.flatMap(lambda x: x).histogram(bins)
    
    # Buckets are evenly spaced, so one width applies to every bar
    width = edges[1] - edges[0]
    
    # Making a bar plot, each bar starting at the left edge of its bucket
    mpt.bar(edges[:-1], counts, width=width, align='edge')
    mpt.xlabel(xname if xname is not None else col)
    if yname is not None:
        mpt.ylabel(yname)
    mpt.show()
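
#### The bucketing that RDD.histogram(bins) performs with an integer argument (evenly spaced buckets between the minimum and the maximum, with the last bucket closed on the right) can be mimicked in plain Python. The sketch below is an illustration under that assumption; the function names are hypothetical.

```python
def equal_width_histogram(values, bins):
    """Return (edges, counts) for evenly spaced buckets between min and max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical: a single bucket holds everything.
        return [lo, hi], [len(values)]
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins)] + [hi]
    counts = [0] * bins
    for value in values:
        # The last bucket is closed on the right, so the maximum lands in it.
        position = min(int((value - lo) / width), bins - 1)
        counts[position] += 1
    return edges, counts


def bin_centers(edges):
    """Midpoint of each bucket, useful as bar positions when centering bars."""
    return [(left + right) / 2 for left, right in zip(edges, edges[1:])]


toy_edges, toy_counts = equal_width_histogram([1, 2, 3, 4], bins=2)
toy_centers = bin_centers(toy_edges)
print(toy_edges, toy_counts, toy_centers)
```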

#### Some features need to be converted to integers, namely "step", "amount" and "isFraud".

In [None]:
# convert string columns into integer columns

df_bank_par = df_bank_par.withColumn("step",df_bank_par["step"].cast(IntegerType()))

In [None]:
df_bank_par = df_bank_par.withColumn("amount",df_bank_par["amount"].cast(IntegerType()))

In [None]:
df_bank_par = df_bank_par.withColumn("isFraud",df_bank_par["isFraud"].cast(IntegerType()))

In [None]:
df_bank_par.printSchema()

#### We can see that all the features are numeric now ("step", "amount" and "isFraud" are integers, and the one-hot columns are doubles). Therefore, we're able to plot them with the histogram function. That's what we'll do next.

In [None]:
# histogram: "step"

##histogram(df_bank_par, 'step', bins=15, yname='frequency')

In [None]:
# histogram: "amount"

##histogram(df_bank_par, 'amount', bins=15, yname='frequency')

In [None]:
# histogram: "DEBIT"

##histogram(df_bank_par, 'DEBIT', bins=15, yname='frequency')


In [None]:
# histogram: "PAYMENT"

##histogram(df_bank_par, 'PAYMENT', bins=15, yname='frequency')


In [None]:
# histogram: "CASH_OUT"

##histogram(df_bank_par, 'CASH_OUT', bins=15, yname='frequency')


In [None]:
# histogram: "CASH_IN"

##histogram(df_bank_par, 'CASH_IN', bins=15, yname='frequency')


In [None]:
# histogram: "TRANSFER"

##histogram(df_bank_par, 'TRANSFER', bins=15, yname='frequency')


In [None]:
# histogram: "CC"

##histogram(df_bank_par, 'CC', bins=15, yname='frequency')


In [None]:
# histogram: "CM"

##histogram(df_bank_par, 'CM', bins=15, yname='frequency')

In [None]:
# histogram: "isFraud"

##histogram(df_bank_par, 'isFraud', bins=15, yname='frequency')

#### Remember that our label is "isFraud". Its histogram shows that the classes are imbalanced, so we need to perform **Oversampling** of the minority class through ***Data Balancing*** using *pyspark*.
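
#### Random oversampling of the minority class can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's Spark implementation: it keeps every original minority row and adds copies drawn with replacement until the majority:minority ratio reaches the requested value. The function name, the dictionary row format, and the seed are assumptions of this sketch.

```python
import random


def oversample_minority(rows, label_key="isFraud", ratio=1.0, seed=88):
    """Resample label-1 rows with replacement until majority:minority is ratio:1."""
    majority = [row for row in rows if row[label_key] == 0]
    minority = [row for row in rows if row[label_key] == 1]
    target = round(len(majority) / ratio)
    if not minority or target <= len(minority):
        # Nothing to do: no minority rows, or the ratio is already satisfied.
        return list(rows)
    rng = random.Random(seed)
    extra = rng.choices(minority, k=target - len(minority))
    return majority + minority + extra


# Toy data: 8 legitimate rows and 2 fraudulent rows.
toy_rows = [{"isFraud": 0}] * 8 + [{"isFraud": 1}] * 2
balanced = oversample_minority(toy_rows, ratio=1.0)
n_fraud = sum(row["isFraud"] for row in balanced)
n_legit = len(balanced) - n_fraud
print(n_legit, n_fraud)
```

#### Because the added rows are exact copies, removing duplicates afterwards would undo the oversampling.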

### 3.2 Data Balancing

In [None]:
############################################################## Oversampling with PySpark #########################################################

# Create oversampling function
#def oversample_minority(df, ratio=1):
#    '''
#    ratio is the ratio of majority to minority
#    Eg. ratio 1 is equivalent to majority:minority = 1:1
#    ratio 5 is equivalent to majority:minority = 5:1
#    '''
#    minority_count = df.filter(f.col('isFraud')==1).count()
#    majority_count = df.filter(f.col('isFraud')==0).count()
#    
#    balance_ratio = majority_count / minority_count
#    
#    print(f"Initial Majority:Minority ratio is {balance_ratio:.2f}:1")
#    if ratio >= balance_ratio:
#        print("No oversampling of minority was done as the input ratio was more than or equal to the initial ratio.")
#    else:
#        print(f"Oversampling of minority done such that Majority:Minority ratio is {ratio}:1")
#    
#    oversampled_minority = df.filter(f.col('isFraud')==1).sample(withReplacement=True, fraction=(balance_ratio/ratio),seed=88)
#    oversampled_df = df.filter(f.col('isFraud')==0).union(oversampled_minority)
#    
#    return oversampled_df

#oversampled_df = oversample_minority(df_bank_par,ratio=1)

#minority_count = oversampled_df.filter(f.col('isFraud')==1).count()
#majority_count = oversampled_df.filter(f.col('isFraud')==0).count()
#minority_count, majority_count
#oversampled_df = oversampled_df.dropna()
#oversampled_df = oversampled_df.dropDuplicates()
#df_bank_par = oversampled_df


#### To work with this data in pandas, one option is the to_pandas_on_spark method of the PySpark dataframe df_bank_par, which returns a pandas-on-Spark dataframe. Here, we instead read the original file directly with pandas.

In [None]:
# pandas dataframe

df_bank_pandas = pd.read_parquet('fraudDetection.parquet')

In [None]:
type(df_bank_pandas)

In [None]:
df_bank_pandas = pd.read_csv('fraudDetection.csv')

In [None]:
type(df_bank_pandas)

In [None]:
#@title
def procesar_datos():
  global df_banco, resultados
  df_banco=df_bank_pandas.copy()
  # Create the new variable type2 by combining the first letter of the nameOrig and nameDest columns
  df_banco['type2'] = df_banco['nameOrig'].str[0] + df_banco['nameDest'].str[0]


In [None]:
procesar_datos()
df_banco.head(10)

In [None]:
# One-hot encode the type and type2 columns
df_encoded = pd.get_dummies(df_banco, columns=['type', 'type2'], dtype=int)
df_encoded.sample(10)

In [None]:
# List of columns to drop
columns_to_drop = ['nameOrig', 'nameDest', 'isFlaggedFraud', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
# Drop the columns from the DataFrame
df_encoded.drop(columns=columns_to_drop, inplace=True)
# Reset the index
df_encoded.reset_index(drop=True, inplace=True)
df_encoded


In [None]:
# Remove duplicate records and store the result in df_banco
df_banco = df_encoded.drop_duplicates()

In [None]:
# Remove records with null values and reset the index
df_banco.dropna(inplace=True)
df_banco.reset_index(drop=True, inplace=True)

In [None]:
df_banco

In [None]:
df_banco.info()

In [None]:
# Count the values of the isFraud column
conteo_isfraud = df_banco['isFraud'].value_counts()

# Create the vertical bar chart
mpt.figure(figsize=(8, 6))
conteo_isfraud.plot(kind='bar', color=['skyblue', 'salmon'])
mpt.title('Distribution of the isFraud column')
mpt.xlabel('isFraud')
mpt.ylabel('Count')
mpt.xticks([0, 1], ['Not Fraud', 'Fraud'], rotation=0)
mpt.show()

In [None]:
#@title


def balanceo_clases():
    global df_banco, resultados

    # Instantiate SMOTE
    smote = SMOTE(random_state=42)

    # Balance the classes
    X_res, y_res = smote.fit_resample(df_banco.drop(columns=['isFraud']), df_banco['isFraud'])

    # Rebuild the balanced DataFrame
    df_banco = pd.DataFrame(X_res, columns=df_banco.drop(columns=['isFraud']).columns)
    df_banco['isFraud'] = y_res

    # Remove duplicate records
    df_banco.drop_duplicates(inplace=True)
    df_banco.reset_index(drop=True, inplace=True)

# Call the balanceo_clases function
balanceo_clases()

# Print the final result
df_banco
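
As a rough illustration of what `SMOTE` does inside `fit_resample`, the sketch below builds one synthetic minority sample by interpolating between a minority point and one of its minority neighbours. It is a simplified plain-Python approximation: the helper name `synthesize` and the toy points are hypothetical, and the real implementation chooses among the k nearest neighbours rather than a fixed pair.

```python
import random

def synthesize(point, neighbour, rng):
    """Return a synthetic sample on the segment between point and neighbour."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(42)
fraud_a = [1.0, 200.0]  # hypothetical minority samples: (step, amount)
fraud_b = [3.0, 600.0]
new_sample = synthesize(fraud_a, fraud_b, rng)
print(new_sample)
```

Because interpolation produces fractional values, one-hot columns such as `type_TRANSFER` can come out non-binary in synthetic rows. imblearn also provides `SMOTENC` for data that mixes continuous and categorical features, which may be worth considering here.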

In [None]:
type(df_banco)

In [None]:
# Count the values of the isFraud column (sorted by label so the tick labels line up)
conteo_isfraud = df_banco['isFraud'].value_counts().sort_index()

# Create the vertical bar chart
mpt.figure(figsize=(8, 6))
conteo_isfraud.plot(kind='bar', color=['skyblue', 'salmon'])
mpt.title('Distribution of the isFraud column')
mpt.xlabel('isFraud')
mpt.ylabel('Count')
mpt.xticks([0, 1], ['Not Fraud', 'Fraud'], rotation=0)
mpt.show()

In [None]:
type(df_banco)

#### Now we'll convert this pandas DataFrame into a PySpark DataFrame so the remaining steps can run on Spark.

In [None]:
sparkDF = spark.createDataFrame(df_banco)

In [None]:
df_banco.to_parquet('df.parquet')

In [None]:
df_bank_par = spark.read.parquet('df.parquet')

In [None]:
type(df_bank_par)

In [None]:
df_bank_par.show(10)

In [None]:
df_bank_par.printSchema()

In [None]:
df_bank_par.count()

In [None]:
# cast the isFraud column to integer

df_bank_par = df_bank_par.withColumn("isFraud",df_bank_par["isFraud"].cast(IntegerType()))

In [None]:
df_bank_par.printSchema()

In [None]:
class_0 = df_bank_par.filter(f.col("isFraud")==0)
class_1 = df_bank_par.filter(f.col("isFraud")==1)

In [None]:
class_0.count()

In [None]:
class_1.count()

In [None]:
######################################## Convert parquet file into Pandas ##########################

##df_bank_par_pandas = df_bank_par.to_pandas_on_spark()
##df_bank_par_pandas.head(10)
##df_bank_par_pandas.describe()
##type(df_bank_par_pandas)

#### Let's create a function that computes the correlation between the target variable "isFraud" and each feature.

In [None]:
# definition of the function "correlation_df"

def correlation_df(df, target_var, feature_cols, method):
    # assemble the target and the features into a single vector (target first)
    input_cols = [target_var] + feature_cols
    df_cor = df.select(input_cols)
    assembler = VectorAssembler(inputCols=input_cols, outputCol="features")
    df_cor = assembler.transform(df_cor)

    # calculate the correlation matrix
    correlation_matrix = Correlation.corr(df_cor, "features", method=method).head()[0]

    # column 0 is the target, so rows 1..n hold the target-vs-feature correlations
    target_corr_list = [correlation_matrix[i, 0] for i in range(1, len(feature_cols) + 1)]

    # build a DataFrame with feature names and correlation coefficients
    correlation_data = [(feature_cols[i], float(target_corr_list[i])) for i in range(len(feature_cols))]

    result_df = spark.createDataFrame(correlation_data, ["feature", "correlation"])

    result_df = result_df.withColumn("abs_correlation", f.abs("correlation"))

    # return the result
    return result_df


In [None]:
df_bank_par.printSchema()

In [None]:
target = "isFraud"

indep_cols = [x for x in df_bank_par.columns if x not in [target] ]

corr_values_df = correlation_df(df=df_bank_par, target_var= target, feature_cols= indep_cols, method='pearson')

print(f"The correlation between {target} and the other features is: ")

corr_values_df.show()
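
For reference, each value in the `correlation` column is a Pearson coefficient between the target and one feature. The plain-Python sketch below shows the calculation on a toy sample; the helper `pearson` and the four data points are hypothetical and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

is_fraud = [0, 0, 1, 1]            # toy target values
amount = [10.0, 20.0, 30.0, 40.0]  # toy feature values
r = pearson(is_fraud, amount)
print(round(r, 4))
```

A coefficient close to 1 or -1 means a strong linear relationship, which is why the function also reports `abs_correlation` for ranking features.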


In [None]:
df_bank_par.printSchema()

In [None]:
target = "amount"

indep_cols = [x for x in df_bank_par.columns if x not in [target] ]

corr_values_df = correlation_df(df=df_bank_par, target_var= target, feature_cols= indep_cols, method='pearson')

print(f"The correlation between {target} and the other features is: ")

corr_values_df.show()


In [None]:
target = "step"

indep_cols = [x for x in df_bank_par.columns if x not in [target] ]

corr_values_df = correlation_df(df=df_bank_par, target_var= target, feature_cols= indep_cols, method='pearson')

print(f"The correlation between {target} and the other features is: ")

corr_values_df.show()

## 4. Construction of models

## 4.1 train/test split

In [None]:
# fix the seed so the split is reproducible between runs
train, test = df_bank_par.randomSplit([0.7, 0.3], seed=42)

In [None]:
#type(train) , type(test)

#### Let's assemble the feature columns of the "train" and "test" datasets into a single feature vector each, using the VectorAssembler class.

In [None]:
# assemble the train dataset into a single feature vector with VectorAssembler.
# The target column isFraud is left out on purpose: including the label among
# the features would leak the answer to every model.

columns = ['step','amount','type_CASH_OUT','type_PAYMENT','type_CASH_IN','type_TRANSFER','type_DEBIT','type2_CC','type2_CM']

assembler = VectorAssembler(inputCols=columns, outputCol='features')

train = assembler.transform(train)

train.show(10)

In [None]:
# assemble the test dataset with the same assembler so both splits share one feature layout

test = assembler.transform(test)

test.show(10)

In [None]:
type(test)

## 4.2 Models

We'll train several machine learning algorithms, evaluate each of them, and select the best one. We'll start with Random Forest. First, we create the lists that will store the results of the models:

In [None]:
name_model = []

accuracy = []

precision = []

recall = []

auc_roc = []

### 4.2.1 Random Forest

#### Training

In [None]:
# train the model "random forest" (rf)

rf = RandomForestClassifier(featuresCol='features', labelCol='isFraud')
model_RF = rf.fit(train)

In [None]:
type(model_RF)

#### Predictions

In [None]:
# make predictions of the random forest model using the test dataset

predictions = model_RF.transform(test)


In [None]:
type(predictions)

In [None]:
predictions.show(50)

#### We can see that there are three more columns: rawPrediction, probability and prediction. We can clearly compare the actual values and predicted values with the output below:

In [None]:
predictions.select("isFraud","prediction").show(50)

#### At a glance, the predicted values match the actual values, at least for the first fifty records.

#### Evaluation

#### We need to evaluate our random forest machine learning algorithm.

In [None]:
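# MulticlassClassificationEvaluator defaults to the F1 score, so request accuracy explicitly
evaluator = MulticlassClassificationEvaluator(labelCol="isFraud", predictionCol="prediction", metricName="accuracy")
accuracy_ = evaluator.evaluate(predictions)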

In [None]:
type(accuracy_)

In [None]:
print(f"The accuracy is {accuracy_}")

In [None]:
Test_Error = (1 - accuracy_)
print(f"The Test Error is {Test_Error}")

#### Let's check out the Confusion Matrix.

In [None]:
preds_and_labels = predictions.select(["prediction","isFraud"])
preds_and_labels = preds_and_labels.withColumn("isFraud", f.col("isFraud").cast(FloatType())).orderBy("prediction")

In [None]:
preds_and_labels.show(20)

In [None]:
# AUC - ROC

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="isFraud")

auc_rf = evaluator.evaluate(predictions)

# Accuracy, Precision and Recall

metrics = MulticlassClassificationEvaluator(labelCol="isFraud", predictionCol="prediction",)

accuracy_rf = metrics.evaluate(predictions, {metrics.metricName:"accuracy"})

precision_rf = metrics.evaluate(predictions, {metrics.metricName:"weightedPrecision"})

recall_rf = metrics.evaluate(predictions, {metrics.metricName:"weightedRecall"})

# store the results of this model: Random Forest

accuracy.append(accuracy_rf)

precision.append(precision_rf)

recall.append(recall_rf)

auc_roc.append(auc_rf)

# store the name of the model: Random Forest
name_model_ = "Random Forest"

name_model.append(name_model_)



print("AUC-ROC:", auc_rf)

print("Accuracy:", accuracy_rf)

print("Precision:", precision_rf)

print("Recall:", recall_rf)
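
The `weightedPrecision` and `weightedRecall` metrics average the per-class values, weighting each class by its share of the true labels. The plain-Python sketch below reproduces that idea for precision on a small imbalanced sample, to show how a high weighted score can coexist with weak fraud-class recall. The helper `weighted_precision` and the sample pairs are hypothetical, and the sketch is not the Spark implementation.

```python
def weighted_precision(pairs):
    """pairs holds (prediction, label) tuples."""
    total = len(pairs)
    score = 0.0
    for label in sorted({actual for _, actual in pairs}):
        # support: how many records truly belong to this class
        support = sum(1 for _, actual in pairs if actual == label)
        # true labels of every record that was predicted as this class
        predicted = [actual for prediction, actual in pairs if prediction == label]
        hits = sum(1 for actual in predicted if actual == label)
        precision = hits / len(predicted) if predicted else 0.0
        score += precision * support / total
    return score

# 8 legitimate records predicted correctly, 2 fraud records of which 1 is missed
pairs = [(0, 0)] * 8 + [(1, 1), (0, 1)]
wp = weighted_precision(pairs)
fraud_recall = (sum(1 for p, a in pairs if p == 1 and a == 1)
                / sum(1 for _, a in pairs if a == 1))
print(round(wp, 3), fraud_recall)
```

This matters whenever the evaluated data are imbalanced: the weighted score is dominated by the majority class, so per-class precision and recall for `isFraud == 1` are worth reporting alongside it.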

In [None]:
name_model

In [None]:
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))

In [None]:
print("The Confusion Matrix is:")

metrics.confusionMatrix().toArray()

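#### If the confusion matrix shows every test record classified correctly, treat the result with caution. A perfect score can indicate overfitting, and it can also indicate label leakage, for example when the target column `isFraud` ends up inside the feature vector.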
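
To make the layout of the matrix concrete, here is a plain-Python sketch of how a confusion matrix can be built from (prediction, label) pairs and how fraud-class precision and recall follow from it. The helper `confusion_matrix` and the sample pairs are hypothetical. The convention used (rows are actual classes, columns are predicted classes, both in ascending label order) is the one documented for `MulticlassMetrics.confusionMatrix()`.

```python
def confusion_matrix(pairs, labels=(0.0, 1.0)):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for prediction, actual in pairs:
        matrix[index[actual]][index[prediction]] += 1
    return matrix

# hypothetical (prediction, label) pairs
pairs = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 1.0), (0.0, 1.0)]
cm = confusion_matrix(pairs)
tn, fp = cm[0]  # actual class 0: true negatives, false positives
fn, tp = cm[1]  # actual class 1: false negatives, true positives
fraud_precision = tp / (tp + fp)
fraud_recall = tp / (tp + fn)
print(cm, fraud_precision, fraud_recall)
```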

### 4.2.2 Logistic Regression

#### Training

In [None]:
# train the model Logistic Regression (lr)

lr = LogisticRegression(featuresCol='features', labelCol='isFraud')

model_LR = lr.fit(train)

In [None]:
type(model_LR)

#### To better understand the model, we can examine its coefficients and intercept. The values represent the weights assigned to each feature and the bias term, respectively.

In [None]:
coefficients = model_LR.coefficients

intercept = model_LR.intercept

print("Coefficients: ", coefficients)

print("Intercept: ", intercept)
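
As a sketch of how these values are used, binary logistic regression turns the coefficients and the intercept into a fraud probability by applying the sigmoid function to the weighted sum of the features. The function name `predict_probability` and the numbers below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class: sigmoid(w . x + b)."""
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

example_coefficients = [0.5, -1.0]  # hypothetical weights
example_intercept = 0.25            # hypothetical bias term
p = predict_probability([2.0, 1.0], example_coefficients, example_intercept)
print(round(p, 4))
```

A positive coefficient pushes the probability up as its feature grows, and a negative one pushes it down. Since the features here are not standardized, coefficient magnitudes are not directly comparable across columns such as `amount` and the one-hot indicators.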


#### Predictions

In [None]:
# make predictions of the logistic regression model using the test dataset

predictions = model_LR.transform(test)

predictions.show(50)

#### Evaluation

In [None]:
# AUC - ROC

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="isFraud")

auc_lr = evaluator.evaluate(predictions)

# Accuracy, Precision and Recall

metrics = MulticlassClassificationEvaluator(labelCol="isFraud", predictionCol="prediction",)

accuracy_lr = metrics.evaluate(predictions, {metrics.metricName:"accuracy"})

precision_lr = metrics.evaluate(predictions, {metrics.metricName:"weightedPrecision"})

recall_lr = metrics.evaluate(predictions, {metrics.metricName:"weightedRecall"})

# store the results of this model: Logistic Regression

accuracy.append(accuracy_lr)

precision.append(precision_lr)

recall.append(recall_lr)

auc_roc.append(auc_lr)

# store the name of the model: Logistic Regression
name_model_ = "Logistic Regression"

name_model.append(name_model_)


print("AUC-ROC:", auc_lr)

print("Accuracy:", accuracy_lr)

print("Precision:", precision_lr)

print("Recall:", recall_lr)

#### Let's check out the Confusion Matrix.

In [None]:
preds_and_labels = predictions.select(["prediction","isFraud"])
preds_and_labels = preds_and_labels.withColumn("isFraud", f.col("isFraud").cast(FloatType())).orderBy("prediction")

metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
print("The Confusion Matrix is:")

metrics.confusionMatrix().toArray()

### 4.2.3 Decision Tree

#### Training

In [None]:
# train the model Decision Tree (dt)

dt = DecisionTreeClassifier(featuresCol='features', labelCol='isFraud')

model_dt = dt.fit(train)

#### Predictions

In [None]:
# make predictions of the decision tree model using the test dataset

predictions = model_dt.transform(test)

predictions.show(50)

#### Evaluation

In [None]:
# AUC - ROC

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="isFraud")

auc_dt = evaluator.evaluate(predictions)

# Accuracy, Precision and Recall

metrics = MulticlassClassificationEvaluator(labelCol="isFraud", predictionCol="prediction",)

accuracy_dt = metrics.evaluate(predictions, {metrics.metricName:"accuracy"})

precision_dt = metrics.evaluate(predictions, {metrics.metricName:"weightedPrecision"})

recall_dt = metrics.evaluate(predictions, {metrics.metricName:"weightedRecall"})

# store the results of this model: Decision Tree

accuracy.append(accuracy_dt)

precision.append(precision_dt)

recall.append(recall_dt)

auc_roc.append(auc_dt)


# store the name of the model: Decision Tree
name_model_ = "Decision Tree"

name_model.append(name_model_)

print("AUC-ROC:", auc_dt)

print("Accuracy:", accuracy_dt)

print("Precision:", precision_dt)

print("Recall:", recall_dt)

#### Let's check out the Confusion Matrix.

In [None]:
preds_and_labels = predictions.select(["prediction","isFraud"])
preds_and_labels = preds_and_labels.withColumn("isFraud", f.col("isFraud").cast(FloatType())).orderBy("prediction")

In [None]:
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))

In [None]:
print("The Confusion Matrix is:")

metrics.confusionMatrix().toArray()

### 4.2.4 Naive Bayes

#### Training

In [None]:
# train the model Naive Bayes (nb)

nb = NaiveBayes(featuresCol='features', labelCol='isFraud')

model_nb = nb.fit(train)

#### Predictions

In [None]:
# make predictions of the naive bayes model using the test dataset

predictions = model_nb.transform(test)

predictions.show(50)

#### Evaluation

In [None]:
# AUC - ROC

evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="isFraud")

auc_nb = evaluator.evaluate(predictions)

# Accuracy, Precision and Recall

metrics = MulticlassClassificationEvaluator(labelCol="isFraud", predictionCol="prediction",)

accuracy_nb = metrics.evaluate(predictions, {metrics.metricName:"accuracy"})

precision_nb = metrics.evaluate(predictions, {metrics.metricName:"weightedPrecision"})

recall_nb = metrics.evaluate(predictions, {metrics.metricName:"weightedRecall"})

# store the results of this model: Naive Bayes

accuracy.append(accuracy_nb)

precision.append(precision_nb)

recall.append(recall_nb)

auc_roc.append(auc_nb)

# store the name of the model: Naive Bayes
name_model_ = "Naive Bayes"

name_model.append(name_model_)


print("AUC-ROC:", auc_nb)

print("Accuracy:", accuracy_nb)

print("Precision:", precision_nb)

print("Recall:", recall_nb)

#### Let's check out the Confusion Matrix.

In [None]:
preds_and_labels = predictions.select(["prediction","isFraud"])
preds_and_labels = preds_and_labels.withColumn("isFraud", f.col("isFraud").cast(FloatType())).orderBy("prediction")

metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
print("The Confusion Matrix is:")

metrics.confusionMatrix().toArray()

## 4.3 Evaluation and Selection of the model

We'll compare the models using the metrics computed in the previous step and select the one with the best performance. As a first step, let's create a dictionary with the results of every model.

In [None]:
results = {
    'Name_Model': name_model,
    'Accuracy':accuracy,
    'Precision':precision,
    'Recall':recall,
    'AUC_ROC':auc_roc
}
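
One possible way to pick the best model from a dictionary shaped like `results` is to look up the highest score for a chosen metric. The sketch below assumes AUC-ROC is the deciding metric, which is a choice rather than something the data dictates. The helper `best_model` is hypothetical, and the scores are placeholders for illustration only, not results from this notebook.

```python
def best_model(results, metric):
    """Return (model name, score) for the highest value of the given metric."""
    scores = results[metric]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return results['Name_Model'][best_index], scores[best_index]

# placeholder scores for illustration only
example_results = {
    'Name_Model': ['Model A', 'Model B', 'Model C'],
    'AUC_ROC': [0.91, 0.97, 0.88],
}
name, score = best_model(example_results, 'AUC_ROC')
print(name, score)
```

For fraud detection, recall on the fraud class is often weighted more heavily than overall accuracy, so the metric passed to such a helper deserves some thought.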

In [None]:
results

In [None]:
type(results['Accuracy'][2])

#### Now, let's create a pandas DataFrame from the results dictionary.

In [None]:
results_df = pd.DataFrame(results)
results_df.set_index('Name_Model', inplace=True)

In [None]:
results_df.head(5)

In [None]:
results_df.info()

In [None]:
results_df

In [None]:
type(results_df)

#### Let's visualize these results.

In [None]:
# grouped bar chart: one group per model, one bar per metric

results_df.plot(kind='bar', figsize=(12,6), colormap='viridis', rot=0)
mpt.title('Comparison of metrics per model')
mpt.xlabel('Models')
mpt.ylabel('Score')
mpt.legend(title='Metrics')
mpt.tight_layout()
mpt.show()

## 5. Storage

### 5.1 Model

In [None]:
# model: Random Forest

model_RF.save("randomF_model")

# model: Logistic Regression

model_LR.save("logit_model")

# model: Decision Tree

model_dt.save("decisionT_model")

# model: Naive Bayes

model_nb.save("naiveB_model")


### 5.2 Load

In [None]:
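# saved models are loaded with the fitted-model classes, not the estimator classes
from pyspark.ml.classification import (RandomForestClassificationModel, LogisticRegressionModel,
                                       DecisionTreeClassificationModel, NaiveBayesModel)

# model: Random Forest

loaded_model_RF = RandomForestClassificationModel.load("randomF_model")

# model: Logistic Regression

loaded_model_LR = LogisticRegressionModel.load("logit_model")

# model: Decision Tree

loaded_model_DT = DecisionTreeClassificationModel.load("decisionT_model")

# model: Naive Bayes

loaded_model_NB = NaiveBayesModel.load("naiveB_model")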