This notebook contains the work for Step 5 of the Data Science Method. It also contains the prep work which was completed in Step 4:

The Data Science Method

1.Problem Identification

2.Data Wrangling

  
    . Data Collection
    . Data Organization
    . Data Definition
    . Data Cleaning
 

3.Exploratory Data Analysis


    . Build data profile tables and plots
    . Outliers & Anomalies
    . Explore data relationships
    . Identification and creation of features


4.Pre-processing and Training Data Development


    . Create dummy or indicator features for categorical variables
    . Standardize the magnitude of numeric features
    . Split into testing and training datasets
    . Apply scaler to the testing set


5. Modeling

    . Fit Models with Training Data Set
    . Review Model Outcomes — Iterate over additional models as needed.
    . Identify the Final Model

Documentation


    . Review the Results
    . Present and share your findings - storytelling
    . Finalize Code
    . Finalize Documentation


Introduction

In this project, several anomaly detection techniques of sklearn package have been explored to train a machince learning model to detect credict card fraud. Methods such as Local outlier factor and isolation forest algorithm was used to calculate the anomaly scores. These algorithms use a dataset of slightly under 30000 credit card transactions to predict a fradualent transaction.

Before, proceeding with the project, an attempt to briefly describe the anomaly detection and the detection techniques would be made.

What is Anomaly detection?

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. It has many applications in business, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.

Anomalies can be broadly categorized as: Point anomalies: A single instance of data is anomalous if it's too far off from the rest. Business use case- Detecting credit card fraud based on "amount spent."

Contextual anomalies: The abnormality is context specific. This type of anomaly is common in time-series data. Business use case- Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.

Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case- Someone is trying to copy data form a remote machine to a local host unexpectedly, an anomaly that would be flagged as a potential cyber attack.

Anomaly Detection Techniques:

Simple Statistical Methods- Metrics such as distribution, including mean, median, mode, and quantiles could be used to identify outliers since the definition of an anomalous data point is one that deviates by a certain standard deviation from the mean.

Machine Learning-Based Approaches- Density-Based Anomaly Detection-: These include the k-nearest neighbors algorithm, Relative density of data based method known as local outlier factor (LOF) algorithm Clustering-Based Anomaly Detection-: K-means algorithm Support Vector Machine-Based Anomaly Detection Isolation Forest Algorithm

What is Local Outlier Factor algorithm?

LOF algorithm is an unsupervised outlier detection method which computes the local density deviation of a given data point with respect to its neighbors. It considers as outlier samples that have a substantially lower density than their neighbors.

The number of neighbors considered, (parameter n_neighbors) is typically chosen 1) greater than the minimum number of objects a cluster has to contain, so that other objects can be local outliers relative to this cluster, and 2) smaller than the maximum number of close by objects that can potentially be local outliers. In practice, such informations are generally not available, and taking n_neighbors=20 appears to work well in general.

What is Isolation forest algorithm?

Isolation Forest explicitly identifies anomalies instead of profiling normal data points. Isolation Forest, like any tree ensemble method, is built on the basis of decision trees. In these trees, partitions are created by first randomly selecting a feature and then selecting a random split value between the minimum and maximum value of the selected feature.

In principle, outliers are less frequent than regular observations and are different from them in terms of values (they lie further away from the regular observations in the feature space). That is why by using such random partitioning they should be identified closer to the root of the tree (shorter average path length, i.e., the number of edges an observation must pass in the tree going from the root to the terminal node), with fewer splits necessary.

STEP 4: Pre-processing and Training Data Development

In [10]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from pprint import pprint 
from sklearn.metrics import classification_report,accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from pylab import rcParams
from flask import Flask, render_template

#importing saperate functions from pandas
from pandas import read_csv, value_counts

In [11]:
# set options
pd.set_option('display.max_rows', 500)

In [12]:
# load the data saved from step 3
df=pd.read_csv('C:\\Users\\arna_mora\\Springboard\\unit 7\\creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [13]:
df.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

In [14]:
columns = df.columns.tolist()
# Filter the columns to remove data we do not want 
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we are predicting 
target = "Class"
# Define a random state 
state = np.random.RandomState(42)
X = df[columns]
Y = df[target]
X_outliers = state.uniform(low=0, high=1, size=(X.shape[0], X.shape[1]))
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

(283726, 30)
(283726,)


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)

Number transactions X_train dataset:  (198608, 30)
Number transactions y_train dataset:  (198608,)
Number transactions X_test dataset:  (85118, 30)
Number transactions y_test dataset:  (85118,)


Normalizing the amount column. The amount column is not in line with the anonimised features

In [16]:
from sklearn.preprocessing import StandardScaler

df['normAmount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1, 1))
df = df.drop(['Time','Amount'],axis=1)
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Class,normAmount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0,0.2442
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0,-0.342584
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0,1.1589
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0,0.139886
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0,-0.073813


Resampling:

As we mentioned earlier, there are several ways to resample skewed data. Apart from under and over sampling, there is a very popular approach called SMOTE (Synthetic Minority Over-Sampling Technique), which is a combination of oversampling and undersampling, but the oversampling approach is not by replicating minority class but constructing new minority class data instance via an algorithm.

Applying SMOTE with Over Sampling

In [17]:
from imblearn.over_sampling import SMOTE
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_sample(X_train, y_train.ravel())

print('After OverSampling, the shape of X_train: {}'.format(X_train_res.shape))
print('After OverSampling, the shape of y_train: {} \n'.format(y_train_res.shape))

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

Before OverSampling, counts of label '1': 344
Before OverSampling, counts of label '0': 198264 

After OverSampling, the shape of X_train: (396528, 30)
After OverSampling, the shape of y_train: (396528,) 

After OverSampling, counts of label '1': 198264
After OverSampling, counts of label '0': 198264


Model Prediction:

Random Forest Classifier with SMOTE 

In [20]:
# Import the pipeline module we need for this from imblearn
from imblearn.pipeline import Pipeline 
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
# Define which resampling method and which ML model to use in the pipeline

resampling = BorderlineSMOTE(kind='borderline-2',random_state=0) # instead SMOTE(kind='borderline2') 
rf = RandomForestClassifier() 

# Define the pipeline, tell it to combine SMOTE with the Logistic Regression model
pipeline = Pipeline([('SMOTE', resampling), ('Random Forest Classifier', rf)])

# Fit your pipeline onto your training set and obtain predictions by fitting the model onto the test data 
pipeline.fit(X_train, y_train) 
y_predicted = pipeline.predict(X_test)

# Predict probabilities
probs = rf.predict_proba(X_test)
roc_auc = roc_auc_score(y_test, probs[:, 1])
print(accuracy_score(y_test, y_predicted))
print("AUC ROC score: ", roc_auc_score(y_test, probs[:,1]))
# Obtain the results from the classification report and confusion matrix 

print('Classifcation report:\n', classification_report(y_test, y_predicted))
print('Confusion matrix:\n',  confusion_matrix(y_true = y_test, y_pred = y_predicted))

0.9995065673535563
AUC ROC score:  0.9619113499503492
Classifcation report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     84989
           1       0.89      0.77      0.82       129

    accuracy                           1.00     85118
   macro avg       0.95      0.88      0.91     85118
weighted avg       1.00      1.00      1.00     85118

Confusion matrix:
 [[84977    12]
 [   30    99]]


With the Random Forest Classifier with SMOTE Model, we have:

84977 transactions classified as valid and were actually valid(True Positive);

12 transactions classified as fraud but that were really valid(type 1 error);

30 transactions classified as valid but which were fraud (type 2 error);

99 transactions classified as fraud and were actually fraud.

Look at the precision, recall, f1_score .The accuracy looks good.

AUC denotes an excellent classifier(0.96).

Model : Isolation Forest, Local Outlier Factor(LOF) Algorithm

In [23]:
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# define random states
state = 1
fraud = df[df['Class']==1]
Valid = df[df['Class']==0]
outlier_fraction = len(fraud)/float(len(Valid))

# define outlier detection tools to be compared
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)}
#20 is default, but the higher the percentage of outliers in your in your data set the
#higher you're going to want to make this number

In [28]:
# Fit the model
from sklearn.metrics import confusion_matrix
classifier = {"Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)}
plt.figure(figsize=(9, 7))
n_outliers = len(fraud)


for i, (clf_name, clf) in enumerate(classifiers.items()):
    
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    
    # Reshape the prediction values to 0 for valid, 1 for fraud. 
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    
    n_errors = (y_pred != Y).sum()
    
    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
    print('Confusion matrix:\n',  confusion_matrix(y_true = y_test, y_pred = y_predicted))

Isolation Forest: 685
0.9975856988784955
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    283253
           1       0.28      0.28      0.28       473

    accuracy                           1.00    283726
   macro avg       0.64      0.64      0.64    283726
weighted avg       1.00      1.00      1.00    283726

Confusion matrix:
 [[84977    12]
 [   30    99]]
Local Outlier Factor: 944
0.9966728463376638
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    283253
           1       0.00      0.00      0.00       473

    accuracy                           1.00    283726
   macro avg       0.50      0.50      0.50    283726
weighted avg       1.00      1.00      1.00    283726

Confusion matrix:
 [[84977    12]
 [   30    99]]


<Figure size 648x504 with 0 Axes>

Observations :

. Isolation Forest detected 685 errors versus Local Outlier Factor detecting 944 errors.

. Isolation Forest has a 99.75% more accurate than LOF of 99.66%.

. When comparing error precision & recall for 2 models , the Isolation Forest performed much better than the LOF as we can see   that the detection of fraud cases is 28 % versus LOF detection rate of just 0 %.

. So overall Isolation Forest Method performed much better in determining the fraud cases which is around 30%.

. We can also improve on this accuracy by increasing the sample size or use deep learning algorithms however at the cost of       computational expense. We can also use complex anomaly detection models to get better accuracy in determining more fraudulent   cases.

With Isolation Forest, Local Outlier Factor Models, we have:

84977 transactions classified as valid and were actually valid(True Positive);

12 transactions classified as fraud but that were really valid(type 1 error);

30 transactions classified as valid but which were fraud (type 2 error);

99 transactions classified as fraud and were actually fraud.

Look at the precision, recall, f1_score .The accuracy looks good.





Building the XGBoost Model

In [23]:
from xgboost import XGBClassifier
xg = XGBClassifier(random_state=0)
xg.fit(X_train,y_train)
xg.score(X_train,y_train)

1.0

Validating on test data

In [24]:
pred = xg.predict(X_test)

Checking Accuracy

In [31]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_pred,y_test)
cm

array([[84982,    31],
       [    7,    98]], dtype=int64)

In [26]:
from sklearn.metrics import accuracy_score
accuracy_score(pred,y_test)

0.9995653093352758

In [32]:
print (classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     84989
           1       0.93      0.76      0.84       129

    accuracy                           1.00     85118
   macro avg       0.97      0.88      0.92     85118
weighted avg       1.00      1.00      1.00     85118



With the XGboost  Model, we have:

84982 transactions classified as valid and were actually valid(True Positive);

7 transactions classified as fraud but that were really valid(type 1 error);

31 transactions classified as valid but which were fraud (type 2 error);

98 transactions classified as fraud and were actually fraud.

Look at the precision, recall, f1_score .all are higher than from prevois model and also the accuracy is excellent.






Tuning Hyperparameters: 


Machine Learning models tuning is a type of optimization problem. We have a set of hyperparameters and we aim to find the right combination of their values which can help us to find either the minimum or the maximum of a function.

This can be particularly important when comparing how different Machine Learning models performs on a dataset. In fact, it would be unfair for example to compare an SVM model with the best Hyperparameters against a Random Forest model which has not been optimized.

 the following approaches to Hyperparameter optimization will be explained:
 
 . Random Search
 
 . Grid Search

In [29]:
# display parameters of current model
rf = RandomForestClassifier() 

print('Parameters currently in use:\n')
print(rf.get_params())

Parameters currently in use:

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


In [30]:
# Create the random grid
random_grid = {"max_features": ['auto', 'sqrt'],
               "max_depth": [1,10,20,30,40,50,60,70,80,90,100, None],
               "min_samples_leaf": [1,3,10],
               "min_samples_split": [2,5,10],
               "bootstrap": [True, False],
               "n_estimators": [10,100]}
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 3, 10],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [10, 100]}


In [35]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
rf2 = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf2, param_distributions = random_grid, n_iter = 50, cv = 3, verbose=10, random_state=42, n_jobs = -1)

In [None]:
# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   40.9s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:  9.8min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed: 36.7min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed: 46.5min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed: 53.1min


In [None]:
# display best parameters
rf_random.best_params_