# Business Problem: Predicting Fraudulent Payment Transactions

Fitris Law has an online retail client who is losing over 10% of annual revenue to fraudulent payments via its online payment portal, which is significantly higher than the industry average (5%).

They have asked my firm to develop a model that will identify potentially fraudulent payments and rank these transactions at the end of each weekly period. 

Using historical payment data extracted from their payment system, I can develop a classification model to identify potentially fraudulent transactions (i.e. payments) made by customers.

Although this model is built on payment data for a specified location and contains fields that may not readily be available in other transactional datasets, a similar model can be built for various types of electronic/ACH, credit card, online, P2P transactions, etc.

The source transactional data was extracted from Kaggle contributor: https://www.kaggle.com/turkayavci


In [1]:
#import the necessary libraries

import numpy as np
import pandas as pd
import xlrd
import os
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, \
ExtraTreesClassifier, VotingClassifier, StackingRegressor, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import plot_confusion_matrix, recall_score,\
    accuracy_score, precision_score, f1_score
from sklearn.metrics import classification_report
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer,  make_column_selector as selector
from sklearn.dummy import DummyClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline
from imblearn.over_sampling import SMOTENC
import pickle

import requests
from bs4 import BeautifulSoup

from thefuzz import fuzz, process



## 1. EDA: Exploratory Data Analysis

The first step in the modeling process is to perform exploratory data analysis. This includes:
   - Understanding the total quantity of transactions available
   - Identifying the variables that are available in the data, and whether they are binary, continuous or categorical
   - Identifying whether any null values or NAs exist within the dataset that need to be removed or replaced
   - Performing inferential analysis on certain variables. In this case, I will analyze the level of fraud by category, gender and age group
   - Analyzing the target variable for potential class imbalance
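The checks listed above can be gathered into one small helper. This is a minimal sketch on made-up toy data; `summarize_dataframe` and its "two unique values means binary" rule are illustrative assumptions, not part of the original analysis, and an integer column such as `step` would be labelled continuous by this heuristic.

```python
import pandas as pd

def summarize_dataframe(df: pd.DataFrame, target: str) -> dict:
    """Collect basic EDA facts: size, column kinds, missing values, class balance."""
    kinds = {}
    for col in df.columns:
        n_unique = df[col].nunique(dropna=True)
        if n_unique <= 2:
            kinds[col] = "binary"
        elif pd.api.types.is_numeric_dtype(df[col]):
            kinds[col] = "continuous"
        else:
            kinds[col] = "categorical"
    return {
        "n_rows": len(df),
        "kinds": kinds,
        "n_missing": int(df.isna().sum().sum()),
        "target_share": df[target].value_counts(normalize=True).to_dict(),
    }

# Toy data (made up) with the same kinds of columns as the payment data
toy = pd.DataFrame({
    "category": ["es_food", "es_travel", "es_food", "es_health"],
    "amount": [12.5, 900.0, 8.0, 45.0],
    "fraud": [0, 1, 0, 0],
})
summary = summarize_dataframe(toy, target="fraud")
print(summary)
```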
    

In [2]:
bankdata_df_orig = pd.read_csv("./bs140513_032310_csv.csv")
bankdata_df_orig.head(5)

Unnamed: 0,step,customer,age,gender,zipcodeOri,merchant,zipMerchant,category,amount,fraud
0,0,'C1093826151','4','M','28007','M348934600','28007','es_transportation',4.55,0
1,0,'C352968107','2','M','28007','M348934600','28007','es_transportation',39.68,0
2,0,'C2054744914','4','F','28007','M1823072687','28007','es_transportation',26.89,0
3,0,'C1760612790','3','M','28007','M348934600','28007','es_transportation',17.25,0
4,0,'C757503768','5','M','28007','M348934600','28007','es_transportation',35.72,0


### Data Dictionary:

_Step:_ This feature represents the day from the start of the data aggregation. It covers a period of 6 months.

_Customer:_ Customer id

_zipCodeOrigin:_ The zip code of origin/source.

_Merchant:_ The merchant id

_zipMerchant:_ The merchant zip code

_Age:_ Categorized age
   - 0: <= 18
   - 1: 19-25
   - 2: 26-35
   - 3: 36-45
   - 4: 46-55
   - 5: 56-65
   - 6: > 65
   - U: Unknown

_Gender:_
   - E: Enterprise
   - F: Female
   - M: Male
   - U: Unknown

_Category:_ Category of the purchase

_Amount:_ Amount of the purchase

_Fraud:_ Target - fraudulent(1) or not(0)

Due to the volume of transactions in the dataset and the limited RAM on my personal device, I randomly sampled 100,000 payment transactions from the original source data. The original dataset has 7,200 fraudulent transactions, whereas the sampled dataset used for my model has 1,198. Both the original and sampled datasets contain approximately 1.2% fraudulent transactions.

In [4]:
bankdata_df = bankdata_df_orig.sample(n=100000, random_state=42)
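A plain random sample keeps the fraud rate only approximately. If the rate needed to be preserved exactly, one option would be to sample within each class. The sketch below uses made-up toy data, and `stratified_sample` is a hypothetical helper, not something used in this notebook.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, label: str, frac: float, seed: int = 42) -> pd.DataFrame:
    """Sample the same fraction from every class so the class ratio is preserved."""
    return df.groupby(label, group_keys=False).sample(frac=frac, random_state=seed)

# Toy data (made up): 2% fraud
toy = pd.DataFrame({"amount": range(1000), "fraud": [1] * 20 + [0] * 980})
sampled = stratified_sample(toy, "fraud", frac=0.1)
print(sampled["fraud"].value_counts())
```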

In [5]:
bankdata_df_orig[bankdata_df_orig['fraud'] == 1].count()

step           7200
customer       7200
age            7200
gender         7200
zipcodeOri     7200
merchant       7200
zipMerchant    7200
category       7200
amount         7200
fraud          7200
dtype: int64

In [None]:
bankdata_df[bankdata_df['fraud'] == 1].count()

In [None]:
bankdata_df.info()

In [None]:
bankdata_df['fraud'].value_counts(normalize=True)

In [None]:
bankdata_df['fraud'].value_counts()

Only 1.2% of the payments were identified as fraudulent. The dataset is clearly imbalanced, which I will have to address before building a predictive model.
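To see why the imbalance matters, consider a model that labels every payment as not fraud. The arithmetic below uses the sample counts quoted earlier (100,000 transactions, 1,198 of them fraudulent); the variable names are illustrative.

```python
n_total = 100_000   # sampled transactions
n_fraud = 1_198     # fraudulent transactions in the sample

# A model that always predicts "not fraud" is right on every legitimate payment
correct = n_total - n_fraud
accuracy = correct / n_total

# ...but it never catches a single fraudulent payment
recall = 0 / n_fraud

print(f"accuracy = {accuracy:.4f}, fraud recall = {recall:.1f}")
```

Accuracy alone therefore says little about how well fraud is detected, which is why recall and precision on the fraud class are worth tracking as well.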

In [None]:
bankdata_df.groupby('category')['amount'].sum().sort_values(ascending=False)

In [None]:
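# Select several columns with a list (double brackets); tuple-style selection is not supported by recent pandas
bankdata_df[bankdata_df['fraud']== 1].groupby('category')[['amount','fraud']].sum().sort_values(by='amount',ascending=False)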

In [None]:
bankdata_df[bankdata_df['fraud']== 1].groupby('category')['amount'].mean().sort_values(ascending=False)

In [None]:
bankdata_df[bankdata_df['fraud']== 0].groupby('category')['amount'].mean().sort_values(ascending=False)

In [None]:
bankdata_df[bankdata_df['fraud']== 1]['amount'].sum()

Looking at the tables above, we can make the following observations on the dataset:
- Total fraud across the xxxx transactions was xxxx
- Fraud occurred most often (by transaction count) in the 'Sports and Toys' category
- Fraud was most costly (by total dollar amount) in the travel category
- In every category where fraud occurred, the average fraudulent transaction was larger than the average non-fraudulent transaction
- No fraud was identified in the transportation, food, and contents categories
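The count, total and average of fraud per category can also be read from a single aggregated table. This is a minimal sketch on made-up toy data; the category names other than the `es_` prefix seen in the data preview are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up)
toy = pd.DataFrame({
    "category": ["es_travel", "es_travel", "es_sportsandtoys",
                 "es_sportsandtoys", "es_sportsandtoys", "es_food"],
    "amount": [2000.0, 150.0, 300.0, 250.0, 40.0, 20.0],
    "fraud": [1, 0, 1, 1, 0, 0],
})

# One row per category: number of frauds, total fraud amount, mean fraud amount
fraud_by_category = (
    toy[toy["fraud"] == 1]
    .groupby("category")["amount"]
    .agg(fraud_count="count", fraud_total="sum", fraud_mean="mean")
    .sort_values("fraud_total", ascending=False)
)
print(fraud_by_category)
```

Categories without any fraud (here `es_food`) do not appear in the table at all.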


In [None]:
bankdata_df.isna().sum()

The dataset does not have any null or any NA values, so no replacement or removal is necessary.

In [None]:
plt.figure(figsize=(30,15))
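# barplot shows the mean by default; use a sum so the bars match the "Total Amount" title
sns.barplot(x=bankdata_df.category,y=bankdata_df.amount,estimator=np.sum)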
plt.title("Bar Graph of Total Amount of Spend by Category")

plt.show()

In [None]:
plt.figure(figsize=(30,15))
# Filter once so that x and y come from the same (fraud-only) rows
fraud_df = bankdata_df[bankdata_df['fraud']== 1]
sns.barplot(x=fraud_df.category,y=fraud_df.amount)
plt.title("Bar Graph of Fraudulent Spend by Category")

plt.show()

In [None]:
plt.figure(figsize=(30,10))
sns.scatterplot(x=bankdata_df.step,y=bankdata_df.amount,hue=bankdata_df.fraud)
plt.title("Scatter Plot of Transaction Amount vs Day, Colored by Fraud")
plt.legend()
plt.show()

In [None]:
bankdata_df.describe()

In [None]:
bankdata_df['zipcodeOri'].value_counts()

In [None]:
bankdata_df['zipMerchant'].value_counts()

Since the data contains only one origin zip code and one merchant zip code, these columns carry no predictive information, so I will remove them from the dataframe.

In [6]:
bankdata_df = bankdata_df.drop(columns=['zipcodeOri','zipMerchant'])
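The same idea can be applied without naming the columns by hand. The sketch below drops every column that holds a single value; `drop_constant_columns` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame):
    """Return the frame without single-valued columns, plus the names that were dropped."""
    constant = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
    return df.drop(columns=constant), constant

# Toy data (made up): both zip code columns hold one value only
toy = pd.DataFrame({
    "zipcodeOri": ["28007"] * 3,
    "zipMerchant": ["28007"] * 3,
    "amount": [4.55, 39.68, 26.89],
})
reduced, dropped = drop_constant_columns(toy)
print(dropped)
```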

In [None]:
bankdata_df

In [None]:
bankdata_df['customer'].value_counts()

## 2. Preprocessing the Data

Next, I must preprocess the data to get it ready for the various modeling algorithms applicable for this type of classification problem. Preprocessing steps include:
   - Creating the train and test datasets
   - Applying SMOTENC to help alleviate the target class imbalance. SMOTENC is designed for datasets that contain both continuous and categorical variables.
   - Creating a pipeline that uses a column transformer to apply a standard scaler to continuous variables and one-hot encode categorical variables
   

In [None]:
bankdata_df.info()

In [7]:
subpipe_num = Pipeline(steps=[
    ('ss', StandardScaler())
])


subpipe_cat = Pipeline(steps=[
    ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

CT = ColumnTransformer(transformers = [
    ('subpipe_num', subpipe_num, selector(dtype_include = np.number)),
    ('subpipe_cat', subpipe_cat, selector(dtype_include = object))
], remainder = 'passthrough')

In [8]:
X = bankdata_df.drop(['fraud'],axis=1)
y = bankdata_df['fraud']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [10]:
y_train.value_counts()

0    74126
1      874
Name: fraud, dtype: int64

Because there are so few examples of fraud in the dataset (only 1.2%), I have applied SMOTENC below to oversample the minority class (fraud) of the train dataset until it is 50% the size of the train majority class (i.e. not fraud).

In [11]:
smote_nc = SMOTENC(categorical_features=[1,2,3,4,5],sampling_strategy=.5,random_state=42)
X_resampled, y_resampled = smote_nc.fit_resample(X_train, y_train)


In [12]:
y_resampled.value_counts()

0    74126
1    37063
Name: fraud, dtype: int64

In [13]:
y_resampled.value_counts(normalize = True)

0    0.666667
1    0.333333
Name: fraud, dtype: float64

Now the classes are considerably more balanced, with 37,063 examples of the "fraud" target (33% of the resampled training set).
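The resampled counts follow directly from `sampling_strategy=.5`, which asks for a minority class half the size of the majority class. The short calculation below reproduces the numbers shown above; the variable names are illustrative.

```python
n_majority = 74_126        # "not fraud" rows in the training set
n_minority_before = 874    # "fraud" rows before resampling
sampling_strategy = 0.5    # desired minority / majority ratio

n_minority_after = int(sampling_strategy * n_majority)
n_synthetic = n_minority_after - n_minority_before
minority_share = n_minority_after / (n_majority + n_minority_after)

print(n_minority_after, n_synthetic, round(minority_share, 6))
```

A ratio of 0.5 between the two classes corresponds to a one-third share of the combined total, which is why the normalized counts read 0.667 and 0.333. Most of the fraud rows in the resampled set are synthetic.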

## 3. Modeling, Cross Validation and GridSearching

I use the cross validation class taken from FlatIron lecture 51, and then apply certain types of classification models including:
   - Dummy classifier
   - Logistic Regression
   - KNN
   - Random Forest
   - GradientBoost Classifier

In [14]:
class ModelWithCV():
    '''Structure to save the model and more easily see its crossvalidation'''
    def __init__(self, model, model_name, X, y, cv_now=True):
        self.model = model
        self.name = model_name
        self.X = X
        self.y = y
        # For CV results
        self.cv_results = None
        self.cv_mean = None
        self.cv_median = None
        self.cv_std = None
        #
        if cv_now:
            self.cross_validate()
        
    def cross_validate(self, X=None, y=None, kfolds=10):
        '''
        Perform cross-validation and return results.
        
        Args: 
          X:
            Optional; Training data to perform CV on. Otherwise use X from object
          y:
            Optional; Training data to perform CV on. Otherwise use y from object
          kfolds:
            Optional; Number of folds for CV (default is 10)  
        '''
        
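        # Compare against None explicitly: the truth value of a DataFrame or Series is ambiguous
        cv_X = X if X is not None else self.X
        cv_y = y if y is not None else self.y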

        self.cv_results = cross_val_score(self.model, cv_X, cv_y, cv=kfolds)
        self.cv_mean = np.mean(self.cv_results)
        self.cv_median = np.median(self.cv_results)
        self.cv_std = np.std(self.cv_results)

        
    def print_cv_summary(self):
        cv_summary = (
        f'''CV Results for `{self.name}` model:
            {self.cv_mean:.5f} ± {self.cv_std:.5f} accuracy
        ''')
        print(cv_summary)
    
    def cvmean(self):
        cvmean2 = round(self.cv_mean,3)
        print(cvmean2)
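The class above reports accuracy, which is the default score `cross_val_score` uses for classifiers. For a fraud problem it may be useful to look at several metrics side by side. The sketch below does this with scikit-learn's `cross_validate` on a synthetic imbalanced dataset; the data and the choice of metrics are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data (made up): roughly 5% positive class
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=6, weights=[0.95, 0.05], random_state=42
)

metrics = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(
    LogisticRegression(max_iter=5000), X_toy, y_toy, cv=5, scoring=metrics
)

# One array of per-fold scores per metric, stored under "test_<metric>"
for metric in metrics:
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```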

In [None]:
dummy_model_pipe = Pipeline(steps= [
    ('ct', CT),
    ('dum', DummyClassifier(strategy = 'most_frequent'))
])

dummy_pipe = ModelWithCV(dummy_model_pipe, 'dummy model', X_resampled, y_resampled)
dummy_pipe.print_cv_summary()

In [None]:
logreg_model_pipe = Pipeline([
    ('ct', CT),
    ('logreg', LogisticRegression(random_state=42, max_iter=5000))
])

logreg_pipe = ModelWithCV(logreg_model_pipe,'logreg_model', X_resampled, y_resampled)
logreg_pipe.print_cv_summary()

In [None]:
knn_model_pipe = Pipeline([
    ('ct',CT),
    ('knn',KNeighborsClassifier(leaf_size=10000,n_neighbors=500))
])

knn_pipe = ModelWithCV(knn_model_pipe,'knn_model', X_resampled, y_resampled)
knn_pipe.print_cv_summary()

In [None]:
rfc_model_pipe = Pipeline([
    ('ct',CT),
    ('rfc',RandomForestClassifier(random_state = 42,max_depth=9))
])

rfc_pipe = ModelWithCV(rfc_model_pipe,'rfc_model', X_resampled, y_resampled)
rfc_pipe.print_cv_summary()

In [None]:
gbc_model_pipe = Pipeline([
    ('ct',CT),
    ('gbc',GradientBoostingClassifier(random_state = 42))
])

gbc_pipe = ModelWithCV(gbc_model_pipe,'gbc_model', X_resampled, y_resampled)
gbc_pipe.print_cv_summary()

In [None]:
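logreg_model_pipe.fit(X_resampled, y_resampled)  # cross_val_score only fits clones, so fit before saving
pickle.dump(logreg_model_pipe,open("logreg_fraud_model2.sav",'wb'))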

### Apply Gridsearch to RFC and KNN, which had very high accuracy scores

In [None]:
#Tuning and Cross Validating of RFC. Here we add in some of the important selection criteria for RFC.

rfc_params = {}
rfc_params['rfc__criterion'] = ['gini','entropy']
rfc_params['rfc__min_samples_leaf'] = [5,10,15]
rfc_params['rfc__max_depth'] = [5,7,9]


knn_params = {}
knn_params['knn__n_neighbors']=[300,500,700]
knn_params['knn__weights']=['uniform','distance']
knn_params['knn__algorithm']=['auto']
knn_params['knn__leaf_size']=[5000,10000]
knn_params['knn__p']=[1,2]


logreg_params = {}
logreg_params['logreg__penalty']=['l2','none']
#logreg_params['logreg__solver']= ['lbfgs', 'sag', 'saga']
#'elasticnet', 'l1',

In [None]:

rfc_params

In [None]:
gs_rfc = GridSearchCV(rfc_model_pipe, rfc_params, cv=5, verbose=1)
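Before launching a grid search it helps to know how many models will be trained. The sketch below counts the candidates for the random forest grid with scikit-learn's `ParameterGrid`; the grid is copied from the cell above and `cv_folds` mirrors `cv=5`.

```python
from sklearn.model_selection import ParameterGrid

rfc_params = {
    "rfc__criterion": ["gini", "entropy"],
    "rfc__min_samples_leaf": [5, 10, 15],
    "rfc__max_depth": [5, 7, 9],
}
cv_folds = 5

# Every combination of values is one candidate; each candidate is fit once per fold
n_candidates = len(ParameterGrid(rfc_params))
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full data it was given.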

In [None]:
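# Fit on the resampled training data, consistent with the KNN and logistic regression searches
gs_rfc.fit(X_resampled, y_resampled)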

In [None]:
gs_rfc.best_params_

In [None]:
gs_rfc.best_score_

In [None]:
pd.DataFrame(gs_rfc.cv_results_).head()

In [None]:
# rfc_model_pipe = Pipeline([
#     ('ct',CT),
#     ('rfc',RandomForestClassifier(random_state = 42,max_depth=7))
# ])

# rfc_model_pipe.fit(X_train,y_train)

In [None]:
# rfc_model_pipe.score(X_train,y_train)

In [None]:
# rfc_model_pipe.score(X_test,y_test)

In [None]:
gs_knn = GridSearchCV(knn_model_pipe, knn_params, cv=5, verbose=1)

In [None]:
gs_knn.fit(X_resampled, y_resampled)

In [None]:
gs_knn.best_params_

In [None]:
gs_knn.best_score_

In [None]:
pd.DataFrame(gs_knn.cv_results_).head()

In [None]:
gs_logreg = GridSearchCV(logreg_model_pipe, logreg_params, cv=5, verbose=1)

In [None]:
gs_logreg.fit(X_resampled, y_resampled)

In [None]:
gs_logreg.best_params_

In [None]:
gs_logreg.best_score_

In [None]:
Final_logreg_model_pipe = Pipeline([
    ('ct', CT),
    ('logreg', LogisticRegression(random_state=42, max_iter=5000, penalty='none'))
])

Final_logreg_model_pipe.fit(X_resampled, y_resampled)

In [None]:

pickle.dump(Final_logreg_model_pipe,open("logreg_fraud_model.sav",'wb'))

In [None]:
Final_logreg_model_pipe.score(X_test, y_test)

In [None]:
y_hat = Final_logreg_model_pipe.predict(X_test)

In [None]:
y_hat

In [None]:
print(classification_report(y_test,y_hat))

In [None]:
# Plot a confusion matrix of the actual vs. predicted labels on the test data.
ax = plt.subplot()
sns.set(font_scale=1) # Adjust to fit

# Use the fitted final model: rfc_model_pipe itself was never fitted (CV and grid search fit clones)
plot_confusion_matrix(Final_logreg_model_pipe, X_test, y_test, ax=ax,  cmap = plt.cm.Greys, display_labels=['Not Fraud','Fraud'])

# Labels, title and ticks
label_font = {'size':'18'}  # Adjust to fit
ax.set_xlabel('Predicted labels', fontdict=label_font);
ax.set_ylabel('Observed labels', fontdict=label_font);
ax.grid(False);  # positional argument works across matplotlib versions

title_font = {'size':'20'}  # Adjust to fit
ax.set_title('Confusion Matrix', fontdict=title_font);

plt.savefig('Confusion Matrix')

plt.show()

In [None]:
sample_df = pd.read_csv("./sample_transactions.csv")
sample_df_2 = sample_df[['step','customer','age','gender','merchant','category','amount']]

loaded_model_knn = pickle.load(open("knn_fraud_model.sav", 'rb'))

preds = loaded_model_knn.predict(sample_df_2)
preds_proba = loaded_model_knn.predict_proba(sample_df_2)
df_prediction = pd.DataFrame(zip(preds,preds_proba[:,1]*100),columns=['Fraud Prediction','Probability(%)'])
df_prediction['Potentially Fraudulent?'] = np.where(df_prediction['Fraud Prediction'] == 1, 'Yes', 'No')
df_combined = sample_df_2.join(df_prediction)
# Rank the most likely fraudulent payments first
df_combined.sort_values(by='Probability(%)', ascending=False)
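Since the client wants transactions ranked at the end of each weekly period, the scored payments could be grouped by week before ranking. This is a sketch on made-up scores; it assumes `step` is a day index starting at 0 (so the week is `step // 7`), and the `fraud_probability` column stands in for the model's predicted probability.

```python
import pandas as pd

# Toy scored transactions (made up)
scored = pd.DataFrame({
    "step": [0, 3, 6, 7, 9, 13],
    "customer": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "fraud_probability": [0.10, 0.85, 0.40, 0.95, 0.05, 0.60],
})

# Assumption: step counts days from 0, so integer division by 7 gives the week
scored["week"] = scored["step"] // 7

# Rank 1 is the most suspicious payment within its week
scored["rank_in_week"] = (
    scored.groupby("week")["fraud_probability"]
    .rank(method="first", ascending=False)
    .astype(int)
)

weekly_ranking = scored.sort_values(["week", "rank_in_week"])
print(weekly_ranking)
```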