#Hello! Welcome! 

In this notebook we're gonna code a toy fraud detector and break it.

Most of the code has been taken from O'Reilly's book Machine Learning and Security.

When I was working my way through the book, I got stuck on chapter 2. The main reason is that they wrote a detector that is simple enough for a beginner to understand and tinker with.

##How does it work and what does it do?

In a nutshell, it takes the payment data and first separates it into two portions: 67% for training and 33% for testing. It uses the column 'label' to perform supervised training and determine the attributes necessary to separate the fraudulent transactions from the legitimate ones.
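
As a rough illustration of that split, here is a minimal sketch on a made-up 100-row frame (the `toy` frame and its values are assumptions for the example, not the real payment data). It only shows how `train_test_split` with `test_size = 0.33` divides the rows and keeps 'label' out of the features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the payment data: 100 rows, every 10th one labeled as fraud
toy = pd.DataFrame({
    "accountAgeDays": range(100),
    "label": [1 if i % 10 == 0 else 0 for i in range(100)],
})

# Features are everything except 'label'; 'label' is the supervised target
X_train, X_test, y_train, y_test = train_test_split(
    toy.drop("label", axis=1), toy["label"], test_size=0.33, random_state=17
)

print(len(X_train), len(X_test))
```

Fixing `random_state` keeps the same rows in each portion from run to run, which is what lets the later experiments be compared against the baseline.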

#Purpose
To determine the minimum number of entries we need to alter to change the predictions.

In [0]:
#Let's import everything we're going to need ahead of time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

#Silence warnings so the output stays readable (this ignores every warning category, not just FutureWarning)
import warnings
warnings.filterwarnings('ignore')
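
If you would rather hide only future warnings and still see everything else, a narrower filter is one option. The sketch below uses only the standard `warnings` module; the two warning messages are made up for the demonstration.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                             # start from "show everything"
    warnings.filterwarnings("ignore", category=FutureWarning)   # then hide only FutureWarning
    warnings.warn("this API will change", FutureWarning)        # hidden by the filter
    warnings.warn("something looks off", UserWarning)           # still recorded

caught_categories = [w.category.__name__ for w in caught]
print(caught_categories)
```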



In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/oreilly-mlsec/book-resources/master/chapter2/datasets/payment_fraud.csv')
#We're going to give the payment methods their own column. 0 for payment type not used. 1 for used.
df = pd.get_dummies(df, columns=['paymentMethod'])
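
To see what that encoding does, here is a small sketch on a hypothetical three-row frame (the `toy` frame is an assumption; only the 'paymentMethod' column name comes from the dataset). Each payment method gets its own column, and exactly one of them is set per row.

```python
import pandas as pd

# Hypothetical three-row frame with the same 'paymentMethod' column name as the dataset
toy = pd.DataFrame({
    "numItems": [1, 2, 1],
    "paymentMethod": ["paypal", "creditcard", "storecredit"],
})

encoded = pd.get_dummies(toy, columns=["paymentMethod"])
dummy_cols = [c for c in encoded.columns if c.startswith("paymentMethod_")]

print(encoded)
```

Depending on the pandas version, the new columns may hold 0/1 integers or True/False values; both work as model inputs.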

In [28]:
#Verify that data is read in properly
df.head()


Unnamed: 0,accountAgeDays,numItems,localTime,paymentMethodAgeDays,label,paymentMethod_creditcard,paymentMethod_paypal,paymentMethod_storecredit
0,29,1,4.745402,28.204861,0,0,1,0
1,725,1,4.742303,0.0,0,0,0,1
2,845,1,4.921318,0.0,0,1,0,0
3,503,1,4.886641,0.0,0,1,0,0
4,2000,1,5.040929,0.0,0,1,0,0


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39221 entries, 0 to 39220
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   accountAgeDays             39221 non-null  int64  
 1   numItems                   39221 non-null  int64  
 2   localTime                  39221 non-null  float64
 3   paymentMethodAgeDays       39221 non-null  float64
 4   label                      39221 non-null  int64  
 5   paymentMethod_creditcard   39221 non-null  uint8  
 6   paymentMethod_paypal       39221 non-null  uint8  
 7   paymentMethod_storecredit  39221 non-null  uint8  
dtypes: float64(2), int64(3), uint8(3)
memory usage: 1.6 MB


Ok, now that we have our libraries and dataset loaded, let's establish a baseline.

In [0]:
X_train, X_test,y_train,y_test = train_test_split( df.drop('label',axis=1), df['label'], test_size = 0.33, random_state = 17)
  

In [31]:

clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
y_pred = clf.predict(X_test)

In [33]:

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

[[12753     0]
 [    0   190]]
1.0


Sweet, this matches the example code's performance. Let's get to work.

In [36]:

df2 = df.copy()
mal_sample = df2.sample(5)
mal_sample['label'] = 1
df2.update(mal_sample)
X_train, X_test,y_train,y_test = train_test_split( df2.drop('label',axis=1), df2['label'], test_size = 0.33, random_state = 17)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)


print(confusion_matrix(y_test, y_pred))


[[12746     5]
 [  178    14]]


What this code does is take 5 random samples from our dataset and change their labels to 1, indicating that they are fraudulent. This could potentially play havoc with the learning.
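
The label flipping can also be wrapped in a small helper. This is a sketch on a made-up frame, and `flip_labels` is a hypothetical name rather than anything from the book; it assigns through `.loc` instead of `update`, which keeps the label column's integer dtype.

```python
import pandas as pd

def flip_labels(frame, n, seed=None):
    """Return a copy of `frame` with `n` randomly chosen rows relabeled as fraud (1)."""
    poisoned = frame.copy()
    chosen = poisoned.sample(n, random_state=seed).index
    poisoned.loc[chosen, "label"] = 1
    return poisoned

# Made-up data: 20 legitimate transactions
toy = pd.DataFrame({"accountAgeDays": range(20), "label": [0] * 20})
poisoned = flip_labels(toy, 5, seed=17)

print(toy["label"].sum(), poisoned["label"].sum())
```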

In [37]:
print(mal_sample)

       accountAgeDays  ...  paymentMethod_storecredit
22947            2000  ...                          0
28729              31  ...                          0
34794             675  ...                          0
37581             962  ...                          0
28088            2000  ...                          0

[5 rows x 8 columns]


#Results
The confusion matrix has the actual class in the rows (not fraud, fraud) and the predicted class in the columns (pred not fraud, pred fraud).

Actually not fraudulent: 12746 predicted not fraudulent, with 5 false positives.

Actually fraudulent: 14 predicted fraudulent, with 178 false negatives.

By changing five potentially non-fraudulent entries to be labeled as fraud, we increased the number of false negatives dramatically, from 0 to 178 (183 misclassified transactions in total, counting the 5 false positives). And among thousands of transactions, 5 altered entries are probably going to slip past investigators.
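
To put numbers on that, here is a short sketch that unpacks the four cells of the matrix printed above and derives a few standard metrics. The variable names are my own; the counts are copied from the output.

```python
# Confusion matrix printed above; rows are actual classes, columns are predicted classes
matrix = [[12746, 5],
          [178, 14]]

(tn, fp), (fn, tp) = matrix

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall = tp / (tp + fn)        # share of actual fraud that was caught
precision = tp / (tp + fp)     # share of fraud flags that were correct

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
```

Accuracy barely moves because fraud is rare, while recall shows how much fraud now goes undetected.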



---

When I reran the experiment I saw better performance. It seems that some entries are more sensitive than others. We should go find them.
Given the random nature of the sampling, the rows will likely differ with each attempt. This unfortunately can give us different results.
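
If you want a run you can repeat exactly, one option is to pass a seed to `sample`. A minimal sketch on a made-up frame (the seed value 42 is arbitrary):

```python
import pandas as pd

toy = pd.DataFrame({"accountAgeDays": range(50), "label": [0] * 50})

# Same seed, same rows: the experiment can be repeated exactly
first_draw = toy.sample(5, random_state=42)
second_draw = toy.sample(5, random_state=42)

print(first_draw.index.tolist())
```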



In [0]:
def find_sensitive_row(iterations):
  #Baseline: train on the untouched data so we have a score to compare against
  df_copy = df.copy()
  X_train, X_test,y_train,y_test = train_test_split( df_copy.drop('label',axis=1), df_copy['label'], test_size = 0.33, random_state = 17)

  clf = LogisticRegression()
  clf.fit(X_train, y_train)

  y_pred = clf.predict(X_test)

  first_score = accuracy_score(y_test, y_pred)
  print("First score: " + str(first_score))

  #Track the single row whose flipped label hurts accuracy the most
  final_score = first_score
  final_sample = None #stays None if no flip lowered the score
  final_matrix = confusion_matrix(y_test, y_pred)
  for i in range(0,iterations):
    #Start from clean data each time so only one label is flipped per trial
    mal_copy = df_copy.copy()
    mal_sample = df_copy.sample(1)
    mal_sample['label'] = 1
    mal_copy.update(mal_sample)
    X_train, X_test,y_train,y_test = train_test_split( mal_copy.drop('label',axis=1), mal_copy['label'], test_size = 0.33, random_state = 17)

    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)

    #Keep this row only if it is worse than anything seen so far
    if (accuracy < final_score):
      final_score = accuracy
      final_sample = mal_sample.copy()
      final_matrix = confusion_matrix(y_test, y_pred)

  print("Final accuracy score: " + str(final_score))
  print(final_matrix)
  return final_sample
      
  

In [43]:
print(find_sensitive_row(iterations = 10))


First score: 1.0
Final accuracy score: 0.9869427489762806
[[12748     5]
 [  164    26]]
       accountAgeDays  ...  paymentMethod_storecredit
24222             843  ...                          0

[1 rows x 8 columns]


Expected behavior test using row 12229

In [44]:
#Given the random nature of the sampling we're going to use a row we found in a previous search
df_test = df.copy()
#Double brackets keep the row as a one-row DataFrame so update() can align it by index
test = df_test.loc[[12229]].copy()
test['label'] = 1
print(test)
df_test.update(test)

       accountAgeDays  ...  paymentMethod_storecredit
12229            2000  ...                          0

[1 rows x 8 columns]


In [45]:
X_train, X_test,y_train,y_test = train_test_split( df_test.drop('label',axis=1), df_test['label'], test_size = 0.33, random_state = 17)
  
clf = LogisticRegression()
clf.fit(X_train, y_train)
  
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test,y_pred))


[[12751     2]
 [  185     5]]


In [46]:
df_test.loc[12229]

accountAgeDays               2000.000000
numItems                        1.000000
localTime                       4.895263
paymentMethodAgeDays          261.367361
label                           1.000000
paymentMethod_creditcard        1.000000
paymentMethod_paypal            0.000000
paymentMethod_storecredit       0.000000
Name: 12229, dtype: float64

Now we know logistic regression is not the most robust ML algo out there. Let's try Support Vector Machines (SVM).

The main flaw of SVM is that it's hella slow. LogRegression takes milliseconds to train here, while SVM took seconds. It only gets worse as your dataset gets larger.
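
If you want to measure this on your own machine, a rough timing sketch could look like the following. It runs on synthetic data from `make_classification` (an assumption, not the payment dataset), `time_fit` is a hypothetical helper, and the absolute numbers will vary by machine and dataset size.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn import svm

# Synthetic stand-in data (an assumption for the example, not the payment dataset)
X, y = make_classification(n_samples=2000, n_features=8, random_state=17)

def time_fit(model):
    """Return the wall-clock seconds taken by a single call to fit()."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

timings = {
    "LogisticRegression": time_fit(LogisticRegression(max_iter=1000)),
    "SVC (linear)": time_fit(svm.SVC(kernel="linear")),
    "SVC (rbf)": time_fit(svm.SVC(kernel="rbf", gamma="auto")),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f}s")
```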

In [19]:
X_train, X_test,y_train,y_test = train_test_split( df.drop('label',axis=1), df['label'], test_size = 0.33, random_state = 17)
clf = svm.SVC(kernel = 'linear', gamma= 'auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test,y_pred))

[[12753     0]
 [    0   190]]


So far so good. Let's change the kernel and see if that makes a difference.

In [20]:
X_train, X_test,y_train,y_test = train_test_split( df.drop('label',axis=1), df['label'], test_size = 0.33, random_state = 17)
clf = svm.SVC(kernel = 'rbf', gamma= 'auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test,y_pred))


[[12753     0]
 [    1   189]]


Excellent! This nearly matches our performance with log regression (one missed fraudulent transaction). Let's see how robust it is. One concern I have is that it's overfitting the data. We're evaluating on a held-out subset of the data, so it shouldn't be. If it performs poorly with new data, however, we need to change how we do things.

In [21]:
#Let's run it with the altered row and see how it does
X_train, X_test,y_train,y_test = train_test_split( df_test.drop('label',axis=1), df['label'], test_size = 0.33, random_state = 17)
clf = svm.SVC(kernel = 'linear', gamma= 'auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test,y_pred))

[[12753     0]
 [    1   189]]


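SVM isn't even fazed by the altered row. One caveat: the cell above draws its labels from df['label'] rather than df_test['label'], so the flipped label on row 12229 never actually reaches the model, and this result says little about robustness. So we will need to go through the dataset and see what we can do to trick SVM.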
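
One way to make sure a flipped label actually reaches the model is to take the features and the labels from the same frame. The sketch below does that on a made-up frame; `split_frame` is a hypothetical helper, and row 7 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_frame(frame, target="label"):
    """Split features and labels that both come from the same frame."""
    return train_test_split(
        frame.drop(target, axis=1), frame[target], test_size=0.33, random_state=17
    )

toy = pd.DataFrame({"accountAgeDays": range(30), "label": [0] * 30})
altered = toy.copy()
altered.loc[7, "label"] = 1          # flip one label in the altered copy only

X_train, X_test, y_train, y_test = split_frame(altered)
labels_seen = pd.concat([y_train, y_test])

print(labels_seen.loc[7])
```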

In [0]:
#Let's turn the malicious alterations into a function.
#Same method applies: take a random sample, set column col to value, update the dataframe with the altered row
def MalAlt(dataframe,col, value): 
  print(dataframe[col].value_counts())
  mal_sample = dataframe.sample(1)
  mal_sample[col] = value
  #This is a reference, not a copy, so update() below also modifies the dataframe that was passed in
  altered_dataframe = dataframe
  altered_dataframe.update(mal_sample)
  print(altered_dataframe[col].value_counts())
  
  return altered_dataframe
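
One thing worth knowing about the function above: `altered_dataframe = dataframe` binds a second name to the same object, and `update` works in place. The sketch below, on a made-up four-row frame, contrasts that with `.copy()`.

```python
import pandas as pd

frame = pd.DataFrame({"label": [0, 0, 0, 0]})
flipped = frame.sample(1, random_state=0).copy()
flipped["label"] = 1

alias = frame            # same object under a second name, not a copy
alias.update(flipped)    # in-place update, so `frame` changes too
alias_changed_original = int(frame["label"].sum()) == 1

fresh = pd.DataFrame({"label": [0, 0, 0, 0]})
independent = fresh.copy()     # a real copy
independent.update(flipped)    # only the copy is altered
copy_left_original_alone = int(fresh["label"].sum()) == 0

print(alias_changed_original, copy_left_original_alone)
```

Whether in-place modification is what you want depends on the experiment: it makes alterations accumulate across cells, while a copy keeps each trial independent.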

In [23]:
df2 = MalAlt(df,'label',1)
#print(df['label'].value_counts())
#mal_sample = df.sample(1)
#mal_sample['label'] = 1
#df2 = df
#df2.update(mal_sample)
#print(df2['label'].value_counts())

0    38661
1      560
Name: label, dtype: int64
0.0    38660
1.0      561
Name: label, dtype: int64


In [24]:
X_train, X_test,y_train,y_test = train_test_split( df2.drop('label',axis=1), df2['label'], test_size = 0.33, random_state = 17)
clf = svm.SVC(kernel = 'linear', gamma= 'auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test,y_pred))

[[12753     0]
 [  190     0]]


Wow. With LogRegression, changing one label was enough to wreck the prediction capabilities of the system. In the run shown above, the linear SVM fares no better: it raises no false positives, but it also misses all 190 fraudulent transactions in the test set. Because the altered row is sampled at random, your run may look different.

I'm feeling bold. Let's change a few more transactions.

In [0]:
print(df['label'].value_counts())
mal_sample = df.sample(5)
mal_sample['label'] = 1
df2 = df #reference, not a copy: update() below also changes df
df2.update(mal_sample)
print(df2['label'].value_counts())

0.0    38660
1.0      561
Name: label, dtype: int64
0.0    38655
1.0      566
Name: label, dtype: int64


In [0]:
X_train, X_test,y_train,y_test = train_test_split( df2.drop('label',axis=1), df2['label'], test_size = 0.33, random_state = 17)
clf = svm.SVC(kernel = 'linear', gamma= 'auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test,y_pred))

[[12752     0]
 [  191     0]]


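Ok, this is no better than before: the model catches 0 of the 191 fraudulent transactions in the test set. That is 0% recall, even though overall accuracy still looks high (about 98.5%) because fraud is so rare.
Though this time the data carries 6 altered labels (the earlier one plus 5 new ones, since df was modified in place) instead of just one.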
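
The gap between accuracy and recall is easy to reproduce. The sketch below rebuilds label arrays that match the matrix above (12752 legitimate and 191 fraudulent test rows, none predicted as fraud) and scores them with scikit-learn; the arrays are reconstructed for illustration, not taken from the real test split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Rebuild labels matching the matrix above: 12752 legitimate and 191 fraudulent test rows
y_true = np.array([0] * 12752 + [1] * 191)
y_pred = np.zeros_like(y_true)       # a model that never predicts fraud

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"accuracy={accuracy:.4f} recall={recall:.4f}")
```

On data this imbalanced, recall on the fraud class is the more telling number for a detector.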

Though I can't help but wonder. The 'linear' kernel gives SVM a straight-line decision boundary, much like logistic regression. Could that perhaps contribute to the poor performance?
Let's try the 'rbf' kernel really quick.


Unfortunately, due to the nature of random sampling you might get different/better results from what I got. So try, try, try again.

In [0]:

X_train, X_test,y_train,y_test = train_test_split( df2.drop('label',axis=1), df2['label'], test_size = 0.33, random_state = 17)
clf = svm.SVC(kernel = 'rbf', gamma= 'auto')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test,y_pred))

[[12752     0]
 [    2   189]]


So changing the kernel completely changed the results.
Why is that?

When the algo is training with the 'linear' kernel, it uses a straight line (a flat hyperplane once there are more than two features) to create the boundary between fraudulent and non-fraudulent data points.

'rbf', on the other hand, can curve the boundary, so it's better able to fit around the data and, in these runs, seems to handle malicious data better.

[visual reference between the kernels](https://scikit-learn.org/stable/_images/sphx_glr_plot_iris_svc_0011.png)
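
To see the straight-versus-curved difference in numbers, here is a small sketch on toy data where one class forms a ring around the other. The `make_circles` data and its parameters are assumptions chosen to make the effect obvious; this is not the payment dataset, and it uses scikit-learn's default `gamma`.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn import svm

# Toy data where one class forms a ring around the other, so no straight line separates them
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=17)

scores = {}
for kernel in ("linear", "rbf"):
    clf = svm.SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)   # accuracy on the held-out points

print(scores)
```

A linear boundary cannot separate a ring from its center, while the 'rbf' kernel can bend around it. That says nothing by itself about robustness to poisoned labels, but it shows why the two kernels can behave so differently on the same data.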

#'linear' kernel 

pros:
* fast
* easy



cons:
* not robust

#'rbf' kernel 

pros:
* held up much better against label flipping in these tests
* with larger datasets likely to be more accurate

cons:
* super slow