## CREDIT CARD FRAUD

## 1.0 EDA

First, we import the necessary libraries and configure the Spark session. We then load the dataset and perform exploratory data analysis (EDA).

In [None]:
#Import libraries and configure spark session
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import pandas as pd
from scipy import stats
from scipy.stats import norm

import warnings
warnings.filterwarnings('ignore')

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("xpto") \
    .getOrCreate()
sc = spark.sparkContext

In [None]:
# Load file
df = spark.read.csv('creditcard.csv', header=True, inferSchema=True, sep=",")

# Print Schema
df.printSchema()

The dataset has a total of 31 columns, 28 of which (V1-V28) are components obtained through PCA. These columns are abstract, since we don't have a clear description of what they represent.

In [None]:
df.head()

In [None]:
#Shape of the Dataframe
print((df.count(), len(df.columns)))

After performing the counts above, we conclude that we are going to be working with a dataset that has 284 807 rows.

In [None]:
df.toPandas().info(verbose=True)

In [None]:
df.toPandas().head(5)

In [None]:
#statistics
df.toPandas().describe().T

The maximum value of Time is 172792 seconds, which is equivalent to about 2880 minutes, or approx. 48 hours. In this dataset, Time is the number of seconds elapsed since the first recorded transaction, so the records span roughly two days. We can also conclude that the maximum value for Amount is $25691.16.
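As a quick sanity check, the unit conversion can be reproduced in plain Python. The variable names below are illustrative and are not used elsewhere in the notebook.

```python
# Convert the largest Time value (in seconds) into minutes and hours
max_time_s = 172792
max_time_min = max_time_s / 60
max_time_h = max_time_s / 3600

print(f"{max_time_min:.1f} minutes, {max_time_h:.2f} hours")
```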

## 1.1 CHECKING MISSING VALUES

In [None]:
#Check missing and null data
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

The cell above checks for missing values: each column shows its number of null or NaN entries. The dataset has no missing values.
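For readers less familiar with Spark, the same null-or-NaN count can be sketched in plain Python. This is only an illustration on made-up rows: `count_missing` and `sample_rows` are hypothetical names, not part of the notebook.

```python
import math

def count_missing(rows, columns):
    """Count None or NaN entries per column in a list of dict rows."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[c] += 1
    return counts

# Made-up rows, for illustration only
sample_rows = [
    {"Time": 0.0, "Amount": 10.0},
    {"Time": None, "Amount": 20.0},
    {"Time": 2.0, "Amount": float("nan")},
]

missing = count_missing(sample_rows, ["Time", "Amount"])
print(missing)
```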

## HOW MANY ROWS IN THE DATASET REPRESENT CREDIT CARD FRAUD?

In [None]:
#Number of Frauds and non-frauds
classFreq = df.groupBy("Class").count()
classFreq.show()

In [None]:
total = classFreq.select("count").agg({"count": "sum"}).collect().pop()['sum(count)']
result = classFreq.withColumn('percent', (classFreq['count']/total) * 100)
result.show()

Here we can see that, out of the 284 807 rows available, 492 represent credit card fraud, which translates to approx. 0.17% of the records. With these results, it is safe to say that we have a highly imbalanced dataset.
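The share of fraud records follows directly from the two counts. The short sketch below also shows why plain accuracy is a poor metric here: a hypothetical model that always predicts "non-fraud" would already be right on almost every row.

```python
total_rows = 284807
fraud_rows = 492

# Share of fraudulent transactions, in percent
fraud_pct = fraud_rows / total_rows * 100

# Accuracy of a trivial baseline that labels every transaction as non-fraud
baseline_accuracy = (total_rows - fraud_rows) / total_rows * 100

print(f"Fraud share: {fraud_pct:.3f}%")
print(f"Always-non-fraud accuracy: {baseline_accuracy:.2f}%")
```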

In [None]:
fraud_df = df.select("Class").toPandas()

# Count the number of transactions for each class (fraudulent or not)
fraud_counts = fraud_df['Class'].value_counts()

# Calculate the percentage of each class
fraud_perc = fraud_counts / fraud_counts.sum() * 100

# Create a bar plot with logarithmic y-axis
sns.barplot(x=fraud_counts.index, y=fraud_counts.values)
plt.yscale('log')

# Add percentage labels to the bars
for i, v in enumerate(fraud_counts):
    plt.text(i, v, f"{fraud_perc[i]:.1f}%", ha='center', fontweight='bold')

# Set the y-axis labels and formatter
plt.ylabel('Number of Transactions')
plt.gca().yaxis.set_major_formatter(FuncFormatter(lambda x, _: '{:.0f} K'.format(x/1000)))

# Set the axis labels and title
plt.xlabel('Transaction Class')
plt.xticks([0, 1], ['Non-Fraud', 'Fraud'])
plt.title('Distribution of fraudulent transactions')

plt.show()

In [None]:
# create several histograms to show the distributions of the features
pdf = df.toPandas()  # convert once instead of on every loop iteration
fig = plt.figure(figsize=(25, 15))
subtitle = fig.suptitle("Distribution of Features", fontsize=14, verticalalignment="center")
# only the first 10 numeric columns are plotted; "col_name" avoids shadowing pyspark's col()
for col_name, num in zip(pdf.describe().columns, range(1, 11)):
    ax = fig.add_subplot(3, 4, num)
    ax.hist(pdf[col_name])
    plt.grid(False)
    plt.xticks(rotation=45, fontsize=14)
    plt.yticks(fontsize=15)
    plt.title(str(ax.get_title() or pdf.describe().columns[num - 1]).upper(), fontsize=14)
plt.tight_layout()
subtitle.set_y(0.95)
fig.subplots_adjust(top=0.85, hspace=0.4)
plt.show()

By analysing these plots, we conclude that Time does not follow a normal distribution and that values between 75 000 and 100 000 seconds are the least frequent in the dataset. All of the Vx components shown are normalized, with most occurrences centered around 0. These features also take negative values.

## USER DEFINED FUNCTIONS

First, let's map the values in our "Class" column to "yes" or "no". If the value is 1, we assign it "yes", and if the value is 0, we assign it "no". After defining the function, we use it to create a new column called "IsFraud".

In [None]:
# create the yes/no function
from pyspark.sql.types import StringType  # the return type lives in pyspark.sql.types

y_udf = f.udf(lambda y: "no" if y == 0 else "yes", StringType())

In [None]:
# create the new column for the yes/no values
df = df.withColumn("IsFraud", y_udf('Class'))

df.show(5)

In [None]:
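# create a function to group time values
def udf_multi(time):
    if time is None:  # comparing None with a number would raise a TypeError
        return "NA"
    if time < 50000:
        return "Under 50K s"
    elif time <= 100000:
        return "Between 50K and 100K s"
    elif time > 100000:
        return "Over 100K s"
    else:  # only reached for NaN, where every comparison is False
        return "NA"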

In [None]:
# apply the function to the "Time column"
time_udf = f.udf(udf_multi)
df = df.withColumn('time_udf', time_udf('Time'))

df.show(5)

## APPLYING SOME STATISTICS

In [None]:
from pyspark.sql import Window 
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) 

In [None]:
# build a summary table of Amount statistics per time group
time_group_table = df.select(["time_udf", "Amount"]).\
                         groupBy('time_udf').\
                            agg(
                                f.count("Amount").alias("UserCount"),
                                f.mean("Amount").alias("Amount_Avg"),
                                f.min("Amount").alias("Amount_Min"),
                                f.max("Amount").alias("Amount_Max")).\
                            withColumn("total", f.sum(f.col("UserCount")).over(window)).\
                            withColumn("Percent", f.col("UserCount")*100 / f.col("total")).\
                            drop(f.col("total")).sort(f.desc("Percent"))

In [None]:
time_group_table.toPandas()

Here we computed some statistics with the "Amount" column, divided by the three time groups defined earlier. Now let's plot our results:

In [None]:
sns.barplot(x="time_udf", y="Percent", data=time_group_table.toPandas())

After analysing the graph, we conclude that the largest share of transactions falls in the group where Time is over 100 000 seconds.
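To make the meaning of the Percent column concrete, the sketch below repeats the same grouping logic in plain Python on a few made-up transactions. `time_group`, `toy_transactions` and `summary` are hypothetical names used only for this illustration.

```python
from collections import defaultdict

def time_group(time_s):
    """Assign a Time value (seconds) to one of the three groups used above."""
    if time_s < 50000:
        return "Under 50K s"
    if time_s <= 100000:
        return "Between 50K and 100K s"
    return "Over 100K s"

# Made-up (Time, Amount) pairs, for illustration only
toy_transactions = [(10_000, 5.0), (60_000, 20.0), (120_000, 1.0), (150_000, 99.0)]

amounts_by_group = defaultdict(list)
for time_s, amount in toy_transactions:
    amounts_by_group[time_group(time_s)].append(amount)

total = len(toy_transactions)
summary = {
    group: {
        "count": len(amounts),
        "avg": sum(amounts) / len(amounts),
        "min": min(amounts),
        "max": max(amounts),
        # Percent is the share of transactions in the group, not a share of Amount
        "percent": len(amounts) * 100 / total,
    }
    for group, amounts in amounts_by_group.items()
}

for group, row in summary.items():
    print(group, row)
```

Percent only counts rows per group, so a group can have the largest Percent while holding small amounts.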

In [None]:
# All Transactions
df_aux = df.select("Class", "Amount").toPandas()

# Define amount ranges as (min_amount, max_amount] so that no value falls between two ranges
amount_ranges = [
    {"range": "Transaction Value <= $100", "min_amount": float("-inf"), "max_amount": 100},
    {"range": r"Transaction Value between \$100 and \$2000", "min_amount": 100, "max_amount": 2000},
    {"range": r"Transaction Value between \$2000 and \$5000", "min_amount": 2000, "max_amount": 5000},
    {"range": "Transaction Value > $5000", "min_amount": 5000, "max_amount": df_aux["Amount"].max()}
]

# Create four subplots for different amount ranges
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))

for idx, amount_range in enumerate(amount_ranges):
    row_idx = idx // 2
    col_idx = idx % 2

    # Filter transactions within amount range (lower bound exclusive, upper bound inclusive)
    df_range = df_aux[(df_aux["Amount"] > amount_range["min_amount"]) & (df_aux["Amount"] <= amount_range["max_amount"])]

    # Plot histogram
    sns.histplot(data=df_range, x="Amount", ax=axes[row_idx, col_idx], hue="Class", kde=True)
    axes[row_idx, col_idx].set_title(amount_range["range"])
    axes[row_idx, col_idx].set_ylabel("Number of Transactions")
    axes[row_idx, col_idx].legend(labels=["Fraud", "Non-Fraud"])

fig.suptitle("All Transactions", fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Only fraud transactions
only_fraud = df.filter(df.Class == 1).select("Amount").toPandas()

# Add a palette
aux_pal = ["#ff7f7f", "#ff3c3c"]

# Create two subplots for different amount ranges
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(18, 6))

# Amount range <= 100
sns.histplot(data=only_fraud[only_fraud["Amount"] <= 100], x="Amount", ax=axes[0], kde=True, color=aux_pal[0])
axes[0].set_title("Transaction Amount <= $100")
axes[0].set_ylabel("Number of Transactions")

# Amount range > 100
sns.histplot(data=only_fraud[(only_fraud["Amount"] > 100)], x="Amount", ax=axes[1], kde=True, color=aux_pal[1])
axes[1].set_title("Transaction Amount > $100")
axes[1].set_ylabel("Number of Transactions")

fig.suptitle("Only Fraudulent Transactions", fontsize=16)
plt.tight_layout()
plt.show()

Next, we examine the correlations between the numeric features:

In [None]:

# create a dataframe with only numeric features in order to simplify our correlations
numeric_features = [t[0] for t in df.dtypes if t[1] != 'string']
numeric_features_df = df.select(numeric_features)
numeric_features_df.toPandas().head()

In [None]:
col_names = numeric_features_df.columns
features = numeric_features_df.rdd.map(lambda row: row[0:])

In [None]:
# import Statistics from pyspark.mllib (pandas is already imported above as pd)
from pyspark.mllib.stat import Statistics

In [None]:
# create a correlation matrix
corr_matrix = Statistics.corr(features, method="pearson")
corr_df = pd.DataFrame(corr_matrix)
corr_df.index = col_names
corr_df.columns = col_names
round(corr_df, 2)
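For reference, each entry of the matrix above is a Pearson coefficient. The sketch below computes it by hand for a made-up feature and a binary label. `pearson`, `feature` and `label` are illustrative names, and the numbers are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: low feature values go together with label 1
feature = [-4.0, -3.0, -1.0, 0.5, 1.0, 2.0]
label = [1, 1, 1, 0, 0, 0]

r = pearson(feature, label)
print(f"r = {r:.3f}")
```

A negative coefficient with Class means that lower feature values tend to appear in fraudulent rows, and a coefficient near 0 means no linear relationship.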

Since the dataset has a lot of information, it is best to do a heatmap, to facilitate visualization:

In [None]:
# create the heatmap
sns.heatmap(corr_df)

By looking at our correlation matrix, we arrived at the following conclusions:

    The variables that were a product of PCA are not correlated with each other;
    "Time" seems to be negatively correlated with some of the "Vx" variables;
    "Time" and "Class" seem to have no correlation;
    "Class" seems to be negatively correlated with some of the "Vx" variables, and not correlated at all with others.

## 1.2 IMBALANCED DATA

In [None]:
df1 = df.toPandas()
df1 = df1.sample(frac=1, random_state=42)  # shuffle the rows reproducibly

In [None]:
# the fraud class has 492 rows, so we keep the same number of non-fraud rows
fraud_df = df1.loc[df1['Class'] == 1]
non_fraud_df = df1.loc[df1['Class'] == 0][:492]

normal_distributed_df = pd.concat([fraud_df, non_fraud_df])

# Shuffle dataframe rows
new_df = normal_distributed_df.sample(frac=1, random_state=42)

new_df.shape
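The balancing step above is random undersampling of the majority class. A minimal stdlib sketch of the same idea, on a made-up label list, is shown below. `undersample` and `toy_labels` are hypothetical names, and the function assumes that class 1 is the minority class.

```python
import random

def undersample(labels, seed=42):
    """Return shuffled row indices with as many class-0 rows as class-1 rows.

    Assumes class 1 is the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept_majority = rng.sample(majority, k=len(minority))
    balanced = minority + kept_majority
    rng.shuffle(balanced)
    return balanced

# Made-up labels: 95 non-fraud rows and 5 fraud rows
toy_labels = [0] * 95 + [1] * 5

balanced_idx = undersample(toy_labels)
n_fraud = sum(toy_labels[i] for i in balanced_idx)
print(len(balanced_idx), n_fraud)
```

Undersampling discards most of the majority class, so the balanced subsample is useful for exploration but says little about how often fraud occurs in practice.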

In [None]:
#Plot Distribution of the Classes
print('Distribution of the Classes in the subsample dataset')
print(new_df['Class'].value_counts()/len(new_df))

sns.countplot(x ='Class', data=new_df)
plt.title('Equally Distributed Classes', fontsize=14)
plt.show()

## 1.3 VISUALIZATION WITH BALANCED DATA

In [None]:
#Histogram distribution
for i in new_df.loc[:, ~new_df.columns.isin(['Time','Class','IsFraud','time_udf'])]:
    plt.figure(figsize = [10,5]);
    plt.subplot(1,2,1);
    sns.histplot(new_df [i]);
    plt.title('Distribution of {} values'.format(i) , size = 14);
    plt.xlabel(i , size = 12);
    plt.ylabel("count", size = 12);
    
    plt.subplot(1,2,2);
    sns.boxplot(data = new_df, x = i);
    plt.title('{} boxplot'.format(i) , size = 14);
    plt.xlabel(i , size = 12);
    plt.ylabel("count", size = 12);
    plt.show()

In [None]:
#Normality checking
def is_normal(x, threshold = 0.05):
    # D'Agostino-Pearson test: a p-value above the threshold means normality is not rejected
    k2,p = stats.normaltest(x)
    print(p)
    print(p > threshold)
    print('\n')
    return p > threshold

for name in list(new_df.loc[:, ~new_df.columns.isin(['Time','Class','IsFraud','time_udf'])]):
    is_normal(np.array(new_df[name]))

In [None]:
# Check skewness
new_df.loc[:, ~new_df.columns.isin(['Time','Class','IsFraud','time_udf'])].skew()

In [None]:
# Visualization of the relation between each variable and Class
x = 0
plt.figure(figsize = [18,20]);
for i in new_df.loc[:, ~new_df.columns.isin(['Time','Class','IsFraud','time_udf'])] :
    plt.subplot(8,4,x+1)
    sns.boxplot(data = new_df, x = 'Class', y = i)
    plt.title("Boxplot of {} by Class".format(i), size = 12);
    x = x +1
    plt.subplots_adjust(left=0.1, 
                    bottom=0.1,  
                    right=0.9,  
                    top=0.9,  
                    wspace=0.3,  
                    hspace=0.5) 
plt.show()

In [None]:
# Make sure we use the subsample in our correlation

# "fig" is used instead of "f" to avoid shadowing the pyspark.sql.functions alias
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(24,20))

# df1 (the shuffled full dataset) was already created in section 1.2
df2 = df1.loc[:, ~df1.columns.isin(['Time','IsFraud','time_udf'])]
new_df1 = new_df.loc[:, ~new_df.columns.isin(['Time','IsFraud','time_udf'])]

# Entire DataFrame
corr = df2.corr()
sns.heatmap(corr, cmap='coolwarm_r', annot_kws={'size':20}, ax=ax1)
ax1.set_title("Imbalanced Correlation Matrix", fontsize=14)


sub_sample_corr = new_df1.corr()
sns.heatmap(sub_sample_corr, cmap='coolwarm_r', annot_kws={'size':20}, ax=ax2)
ax2.set_title('SubSample Correlation Matrix ', fontsize=14)
plt.show()

In [None]:
print('Sorted Correlation values with Class:')
# drop the Class row itself; slicing columns[1:] would silently leave out V1
print(new_df1.corr()['Class'].drop('Class').sort_values())

In [None]:
# Make sure we use the subsample in our correlation

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(24,20))  # "fig" avoids shadowing the pyspark alias "f"

sub_sample_corr = new_df1.corr()

sns.heatmap(sub_sample_corr > 0.7, cbar=False, annot_kws={'size':20}, ax=ax1)
ax1.set_title('SubSample Correlation Matrix: Correlation > 0.7', fontsize=14)

sns.heatmap(sub_sample_corr <-0.7,  cbar=False, annot_kws={'size':20}, ax=ax2)
ax2.set_title('SubSample Correlation Matrix: Correlation < -0.7', fontsize=14)
plt.show()

In [None]:
from scipy.stats import pearsonr
from collections import OrderedDict

# Pearson correlation (and p-value) of each feature with the Class label
feature_cols = new_df1.columns[~new_df1.columns.isin(['Time','Class'])]
corrs = OrderedDict([(c, pearsonr(new_df1[c], new_df1['Class'])) for c in feature_cols])
corrs = pd.DataFrame(index = corrs.keys(), data={
        'corr_coef': [res[0] for res in corrs.values()],
        'p_value': [res[1] for res in corrs.values()],
    })

# rank the features by the absolute value of their correlation coefficient
corrs.abs().sort_values(by='corr_coef', ascending=False).rename(columns={
        'corr_coef': 'absolute correlation coefficient'
    })

In [None]:
#Bivariate Analysis
fig, axes = plt.subplots(ncols=4, figsize=(20,4))  # "fig" avoids shadowing the pyspark alias "f"

# Negative Correlations with our Class (The lower our feature value the more likely it will be a fraud transaction)
sns.boxplot(x="Class", y="V14", data=new_df1,  ax=axes[0])
axes[0].set_title('V14 vs Class Negative Correlation')

sns.boxplot(x="Class", y="V12", data=new_df1,  ax=axes[1])
axes[1].set_title('V12 vs Class Negative Correlation')


sns.boxplot(x="Class", y="V10", data=new_df1, ax=axes[2])
axes[2].set_title('V10 vs Class Negative Correlation')


sns.boxplot(x="Class", y="V16", data=new_df1, ax=axes[3])
axes[3].set_title('V16 vs Class Negative Correlation')

plt.show()

In [None]:
#Bivariate Analysis
fig, axes = plt.subplots(ncols=4, figsize=(20,4))  # "fig" avoids shadowing the pyspark alias "f"

# Positive correlations (The higher the feature the probability increases that it will be a fraud transaction)
sns.boxplot(x="Class", y="V4", data=new_df1,  ax=axes[0])
axes[0].set_title('V4 vs Class Positive Correlation')

sns.boxplot(x="Class", y="V11", data=new_df1,  ax=axes[1])
axes[1].set_title('V11 vs Class Positive Correlation')


sns.boxplot(x="Class", y="V2", data=new_df1, ax=axes[2])
axes[2].set_title('V2 vs Class Positive Correlation')


sns.boxplot(x="Class", y="V19", data=new_df1, ax=axes[3])
axes[3].set_title('V19 vs Class Positive Correlation')

plt.show()

In [None]:
#Histogram distribution if is Fraud
new_df1.loc[:, ~new_df1.columns.isin(['Time','Class'])].loc[new_df['Class'] == 1].hist(bins=30, figsize=(10, 10))
plt.subplots_adjust(left=0.1, 
                    bottom=0.1,  
                    right=0.9,  
                    top=0.9,  
                    wspace=0.4,  
                    hspace=0.8) 
plt.show()

In [None]:
from scipy.stats import norm

fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(20, 6))  # "fig" avoids shadowing the pyspark alias "f"

v14_fraud_dist = new_df1['V14'].loc[new_df1['Class'] == 1].values
sns.distplot(v14_fraud_dist,ax=ax1, fit=norm, color='#FB8861')
ax1.set_title('V14 Distribution \n (Fraud Transactions)', fontsize=14)

v12_fraud_dist = new_df1['V12'].loc[new_df1['Class'] == 1].values
sns.distplot(v12_fraud_dist,ax=ax2, fit=norm, color='#56F9BB')
ax2.set_title('V12 Distribution \n (Fraud Transactions)', fontsize=14)


v10_fraud_dist = new_df1['V10'].loc[new_df1['Class'] == 1].values
sns.distplot(v10_fraud_dist,ax=ax3, fit=norm, color='#C5B3F9')
ax3.set_title('V10 Distribution \n (Fraud Transactions)', fontsize=14)

plt.show()

In [None]:
# one density plot per column, each with a fitted normal curve
sns.set(rc={"figure.figsize":(8, 5)})
for i in new_df1.columns:
    fig, ax_hist = plt.subplots()  # a single figure per column
    sns.distplot(new_df1[i], ax=ax_hist, fit=norm, color='#FB8861')
    ax_hist.set(xlabel=i, ylabel='Density')
    fig.tight_layout()

plt.show()

## Outliers

In [None]:
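z = np.abs(stats.zscore(new_df1.loc[:, ~new_df1.columns.isin(['Time','Class'])]))
threshold = 3
# keep only the rows where every feature lies within `threshold` standard deviations of its mean
df1_new = new_df1[(z < threshold).all(axis=1)]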

df1_new.describe().T 
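The filter above keeps a row only if every feature has an absolute z-score below 3. The plain-Python sketch below applies the same rule to made-up two-column data. `zscore_filter` and `rows` are hypothetical names, and the population standard deviation is used, which matches the default of `scipy.stats.zscore`.

```python
import statistics

def zscore_filter(rows, threshold=3.0):
    """Keep the rows whose every column satisfies |z| < threshold.

    Assumes no column is constant (its standard deviation must be non-zero).
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]  # population std (ddof=0)
    return [
        row for row in rows
        if all(abs((v - m) / s) < threshold for v, m, s in zip(row, means, stds))
    ]

# Made-up data: 20 ordinary rows plus one extreme row
rows = [(float(i % 3), 1.0 + (i % 2)) for i in range(20)] + [(100.0, 1.0)]

kept = zscore_filter(rows)
print(len(rows), "->", len(kept))
```

Because a single extreme value in any column removes the whole row, this rule can discard many rows when there are many features.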

In [None]:
new_df1.loc[:, ~new_df1.columns.isin(['Amount','Time','Class', 'IsFraud','time_udf'])].boxplot( figsize=(12,8), vert=False)
plt.title("With outliers", fontsize=14)
plt.show()

In [None]:
df1_new.loc[:, ~df1_new.columns.isin(['Amount','Time','Class'])].boxplot( figsize=(12,8), vert=False)
plt.title("Without outliers", fontsize=14)
plt.show()

In [None]:
# % of data removed :
print(f"Percentage of records removed: {(1 - df1_new.shape[0] / new_df1.shape[0]) * 100:.2f}%, which we consider acceptable")

In [None]:
#Plot Distribution of the Classes
print('Distribution of the Classes in the dataset without outliers')
print(df1_new['Class'].value_counts()/len(df1_new))

sns.countplot(x ='Class', data=df1_new)
plt.title('Balanced Classes without outliers', fontsize=14)
plt.show()

In [None]:
# Visualization of the relation between each variable and Class in a balanced dataset without outliers
x = 0
plt.figure(figsize = [18,20]);
for i in df1_new.loc[:, ~df1_new.columns.isin(['Time','Class'])] :
    plt.subplot(8,4,x+1)
    sns.boxplot(data = df1_new, x = 'Class', y = i)
    plt.title("Boxplot of {} by Class".format(i), size = 12);
    x = x +1
    plt.subplots_adjust(left=0.1, 
                    bottom=0.1,  
                    right=0.9,  
                    top=0.9,  
                    wspace=0.3,  
                    hspace=0.5) 
plt.show()

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20,6))  # "fig" avoids shadowing the pyspark alias "f"
colors = ['#B3F9C5', '#f9c5b3']
# Boxplots with outliers removed
# Feature V14
sns.boxplot(x="Class", y="V14", data=df1_new,ax=ax1, palette=colors)
ax1.set_title("V14 Feature \n Reduction of outliers", fontsize=14)
ax1.annotate('Fewer extreme \n outliers', xy=(0.98, -17.5), xytext=(0, -12),
             arrowprops=dict(facecolor='black'),
             fontsize=14)
# Feature V12
sns.boxplot(x="Class", y="V12", data=df1_new, ax=ax2, palette=colors)
ax2.set_title("V12 Feature \n Reduction of outliers", fontsize=14)
ax2.annotate('Fewer extreme \n outliers', xy=(0.98, -17.3), xytext=(0, -12),
             arrowprops=dict(facecolor='black'),
             fontsize=14)
# Feature V10
sns.boxplot(x="Class", y="V10", data=df1_new, ax=ax3, palette=colors)
ax3.set_title("V10 Feature \n Reduction of outliers", fontsize=14)
ax3.annotate('Fewer extreme \n outliers', xy=(0.95, -16.5), xytext=(0, -12),
             arrowprops=dict(facecolor='black'),
             fontsize=14)
plt.show()

## Machine Learning - EVERYTHING FROM HERE DOWN NEEDS TO BE REVIEWED

In [None]:
# Importing required Spark ML lib methods

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import DenseVector

In [None]:
dfsp = spark.createDataFrame(new_df.loc[:, ~new_df.columns.isin(['IsFraud','time_udf'])])


In [None]:
# Cast every column to double

for column in dfsp.columns:
    # reassign dfsp so that each cast is kept; the original loop discarded all but the last one
    dfsp = dfsp.withColumn(column, dfsp[column].cast("double"))

In [None]:
#Adding index to keep track of the rows even after shuffling

# explicit import: "import *" would shadow Python built-ins such as sum, min, max and round
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
win = Window.orderBy('Time')
dfsp = dfsp.withColumn("idx", row_number().over(win))

In [None]:
dfsp.head()

In [None]:
feature_columns = [col for col in dfsp.columns if col.startswith("V")]
print(feature_columns)

In [None]:
vectorizer = VectorAssembler(inputCols = feature_columns, outputCol="features")
vectorizer.transform(df).select("features", "Class").limit(5).toPandas()

In [None]:
est = RandomForestClassifier()
est.setMaxDepth(5)
est.setLabelCol("Class")

In [None]:
print(est.explainParams())

In [None]:
# Importing required Spark ML lib methods

from pyspark.ml.pipeline import Pipeline, PipelineModel
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import DenseVector

In [None]:
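# Converting the feature columns to a dense vector (required by Spark) and creating label and index columns.
# Positions 0-28 are Time and V1-V28, position 30 is Class and position 31 is idx; Amount (position 29) is not used.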

training_df = dfsp.rdd.map(lambda x: (DenseVector(x[0:29]),x[30],x[31]))

In [None]:
training_df = spark.createDataFrame(training_df,["features","label","index"])

In [None]:
training_df = training_df.select("index","features","label")

In [None]:
# Splitting data into training and testing data

train_data, test_data = training_df.randomSplit([.8,.2],seed=1234)

In [None]:
train_data.groupBy("label").count().show()

In [None]:
test_data.groupBy("label").count().show()

In [None]:
df_train, df_test = df.randomSplit(weights=[0.7, 0.3], seed = 1)

In [None]:
pipeline = Pipeline()
pipeline.setStages([vectorizer, est])
model = pipeline.fit(df_train)

In [None]:

df_test_pred = model.transform(df_test)

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [None]:
evaluator = BinaryClassificationEvaluator()
evaluator.setLabelCol("Class")

In [None]:
evaluator.evaluate(model.transform(df_test))

In [None]:
from pyspark.sql.functions import avg, expr  # explicit imports instead of "import *"

In [None]:
test_accuracy = (df_test_pred
                 .select("Class", "prediction")
                 .withColumn("isEqual", expr("Class == prediction"))
                 .select(avg(expr("cast(isEqual as float)")))
                 .first())

In [None]:
test_accuracy

In [None]:
treeEstimator = DecisionTreeClassifier()
treeEstimator.setImpurity("entropy")
treeEstimator.setLabelCol("Class")

pipeline = Pipeline()
pipeline.setStages([vectorizer, treeEstimator])
model = pipeline.fit(df_train)
evaluator.evaluate(model.transform(df_test))

In [None]:
accuracy_evaluator = MulticlassClassificationEvaluator()
accuracy_evaluator.setLabelCol("Class")
accuracy_evaluator.setMetricName("accuracy")
accuracy_evaluator.evaluate(model.transform(df_test))

In [None]:
f1_evaluator = MulticlassClassificationEvaluator()
f1_evaluator.setLabelCol("Class")
f1_evaluator.setMetricName("f1")
f1_evaluator.evaluate(model.transform(df_test))

## Gradient Boosting Trees Classifier Model

In [None]:
# Creating Gradient Boosting Trees Classifier Model to fit and predict data

gbt = GBTClassifier(featuresCol="features", maxIter=100,maxDepth=8)
model = gbt.fit(train_data)
predictions = model.transform(test_data)

In [None]:
# Checking the count of records classified into each class

predictions.groupBy("prediction").count().show()

In [None]:
# Calculating the area under the ROC curve (the default metric of BinaryClassificationEvaluator, not accuracy)

evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

In [None]:
# Calculating the percentage of fraud records predicted correctly (i.e. the recall of the fraud class)

predictions = predictions.withColumn("fraudPrediction",when((predictions.label==1)&(predictions.prediction==1),1).otherwise(0))
predictions.groupBy("fraudPrediction").count().show()

In [None]:
predictions.groupBy("label").count().show()

In [None]:
accurateFraud = predictions.groupBy("fraudPrediction").count().where(predictions.fraudPrediction==1).head()[1]
totalFraud = predictions.groupBy("label").count().where(predictions.label==1).head()[1]
FraudPredictionAccuracy = (accurateFraud/totalFraud)*100
print("Fraud Prediction Accuracy: ",FraudPredictionAccuracy)

In [None]:
# Calculating Confusion matrix

tp = predictions[(predictions.label == 1) & (predictions.prediction == 1)].count()
tn = predictions[(predictions.label == 0) & (predictions.prediction == 0)].count()
fp = predictions[(predictions.label == 0) & (predictions.prediction == 1)].count()
fn = predictions[(predictions.label == 1) & (predictions.prediction == 0)].count()
print("True Positive: ",tp,"\nTrue Negative: ",tn,"\nFalse Positive: ",fp,"\nFalse Negative: ",fn)
print("Recall: ",tp/(tp+fn))
print("Precision: ", tp/(tp+fp))
print("F1 Score: ",  (2 * (tp/(tp+fp)) * (tp/(tp+fn)) /((tp/(tp+fp)) + (tp/(tp+fn)))))
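The metrics in the cell above divide by counts that can be zero, for example when the model predicts no fraud at all. The sketch below computes the same quantities with guards against division by zero. `classification_metrics` is a hypothetical helper and the counts passed to it are made up, not results from this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Recall, precision, F1 and accuracy from confusion-matrix counts."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "accuracy": accuracy}

# Made-up counts, for illustration only
example = classification_metrics(tp=8, tn=85, fp=2, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

For fraud detection, recall and precision on the fraud class are usually more informative than accuracy, since accuracy is dominated by the non-fraud rows.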