## RNN for detecting fraudulent transactions

In [1]:
import pickle
import numpy as np
with open('data.pickle','rb') as load:
    data=pickle.load(load)
with open('le.pickle','rb') as load:
    le=pickle.load(load)
with open('ohe.pickle','rb') as load:
    ohe=pickle.load(load)
with open('train_test_index.pickle','rb') as load:
    train_test_index=pickle.load(load)
with open('feature_final.pickle','rb') as load:
    feature_final=pickle.load(load)



## I will use the transaction date-time feature as the time step.
## Previously, all samples were treated as i.i.d. By including the time step, the data can be viewed as a time series.
## The transaction sequence probably contains sequential patterns, and these may be lost when we train RF and XGBoost.
## To feed sequential data into an RNN, we have to reshape the samples to 3D: (samples, time_step, input_features).
## Let's reshape the samples first!
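As a minimal sketch of what the 3D layout means, here is a toy array of made-up numbers being grouped into windows. The window length of 3 and the array contents are assumptions chosen only for illustration; they are not the values used later in this notebook.

```python
import numpy as np

# Toy data: 6 time-ordered transactions with 4 features each (made-up numbers).
flat = np.arange(24).reshape(6, 4)

time_step = 3  # hypothetical window length, chosen only for this example

# Group consecutive rows into windows: (samples, time_step, input_features)
windows = flat.reshape(-1, time_step, flat.shape[1])

print(windows.shape)
```

Each window keeps the original row order, so the second window starts with the fourth transaction.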

In [2]:
import pandas as pd
data_raw=pd.read_json('transactions.txt',lines=True)
data=pd.concat([data,data_raw['transactionDateTime']],axis=1)
data=pd.concat([data,data_raw['isFraud']],axis=1)
data.head(3)

Unnamed: 0,customerId,acqCountry,cardPresent,merchantCategoryCode,merchantCountryCode,cardCVV,cardLast4Digits,merchantName,posConditionCode,posEntryMode,...,dateOfLastAddressChange_year,dateOfLastAddressChange_month,transactionDateTime_year,transactionDateTime_month,transactionDateTime_time,currentBalance,transactionAmount,creditLimit,transactionDateTime,isFraud
0,733493772,US,False,rideshare,US,492,9184,Lyft,1,5,...,2014,8,2016,1,19,0.0,111.33,5000,2016-01-08T19:04:50,True
1,733493772,US,False,rideshare,US,492,9184,Uber,1,9,...,2014,8,2016,1,22,111.33,24.75,5000,2016-01-09T22:32:39,False
2,733493772,US,False,rideshare,US,492,9184,Lyft,1,5,...,2014,8,2016,1,13,136.08,187.4,5000,2016-01-11T13:36:55,False


## Undersampling
### In the previous notebook I avoided resampling the data, whether oversampling or undersampling.
### In my experience resampling is rarely worth the time, especially undersampling, which throws away a lot of information. I usually just adjust class weights.
### But with the 3D input of an RNN it is difficult to set up class weights, so here I will undersample!
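For reference, one possible alternative to class weights with sequence targets is a weight per time step. The sketch below only builds such a weight array with NumPy. `temporal_weights` is a hypothetical helper, and whether the resulting array can be passed as `sample_weight` to `fit` depends on the Keras version (older releases expect `sample_weight_mode='temporal'` in `compile`), so treat that part as an assumption to verify.

```python
import numpy as np

def temporal_weights(y, pos_weight):
    """Return a (samples, time_step) array: pos_weight where y is 1, else 1.0."""
    y = np.asarray(y).astype(bool)
    return np.where(y, float(pos_weight), 1.0)

# Toy labels: 2 sequences of 3 time steps, one fraud among six transactions.
y_toy = np.array([[0, 0, 1],
                  [0, 0, 0]])

# Weight positives by the negative/positive ratio so both classes
# contribute the same total weight.
ratio = (y_toy == 0).sum() / (y_toy == 1).sum()
w = temporal_weights(y_toy, ratio)
print(w)
```

I still undersample in this notebook; the sketch only shows what the weighting route would need as input.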

In [3]:
data[data['isFraud']==True].shape

(11302, 27)

In [4]:
data_under=data[data['isFraud']==False].sample(11302)
# stack the sampled non-fraud rows on top of all fraud rows
data_under=pd.concat([data_under,data[data['isFraud']==True]])
data_under.shape

(22604, 27)
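The `sample(11302)` call above has no seed, so the selected non-fraud rows change from run to run. Below is a small sketch of seeded, balanced undersampling on a toy label vector. `undersample_indices` and its `seed` parameter are hypothetical names, and the sketch assumes the negative class is the larger one.

```python
import numpy as np

def undersample_indices(labels, seed=0):
    """Return sorted row indices: all positives plus as many random negatives."""
    labels = np.asarray(labels).astype(bool)
    pos = np.flatnonzero(labels)
    neg = np.flatnonzero(~labels)
    rng = np.random.default_rng(seed)
    # Assumes len(neg) >= len(pos); otherwise choice() raises an error.
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    return np.sort(np.concatenate([pos, keep_neg]))

labels = np.array([0, 0, 1, 0, 0, 0, 1, 0])
idx = undersample_indices(labels)
print(idx, labels[idx])
```

With pandas, passing `random_state` to `sample` should give the same kind of reproducibility.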

### Each customer has a different number of transactions. If I sort by transaction date-time, the transactions of each customer end up in time order!

In [5]:
data=data_under.copy()
data=data.sort_values(by='transactionDateTime')

In [6]:
## transactionDateTime can be dropped now
data.drop(['transactionDateTime'],axis=1,inplace=True)

### Divide the entire dataset into pieces, grouped by customer

In [7]:
g=data.groupby('customerId')
cust_iter=iter(g)

### cust is a list of (customerId, DataFrame) pairs; each DataFrame contains all transactions of one customer!

In [8]:
cust=list(cust_iter)

In [9]:
print('the cust has %d customers of dataframe with all the transaction in each dataframe '%len(cust))

the cust has 3468 customers of dataframe with all the transaction in each dataframe 
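Sorting first and grouping afterwards works because grouping keeps the row order inside each group. Here is a pure-Python sketch of the same idea on three made-up records; the customer IDs `"A"` and `"B"` are placeholders.

```python
from collections import defaultdict

# Made-up records: (customerId, transactionDateTime, amount)
records = [
    ("B", "2016-01-09T22:32:39", 24.75),
    ("A", "2016-01-11T13:36:55", 187.40),
    ("A", "2016-01-08T19:04:50", 111.33),
]

# ISO-8601 timestamps sort chronologically as plain strings.
records.sort(key=lambda r: r[1])

by_customer = defaultdict(list)
for cust_id, when, amount in records:
    # Appending in sorted order keeps each customer's list time-ordered.
    by_customer[cust_id].append((when, amount))

print(dict(by_customer))
```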


### I set the time step length to 10 after taking everything into account. For each customer, 10 consecutive transactions form a single sample for the RNN.
### The number of transactions per customer is usually not an exact multiple of 10. To give every sample the same number of time steps, I extend each customer's transactions up to the next multiple of 10.
### In NLP this is called padding, where the added positions hold a dedicated padding token rather than a word from the dictionary. In this task I prefer to fill the added positions with a copy of the customer's last transaction, much like repeating the last word of a sentence before feeding it to the RNN. This should not affect the order information of the transactions much!
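The padding idea can be sketched on a toy array before applying it to the real data. `pad_and_chunk` is a hypothetical helper written for this illustration, and it assumes the input has at least one row.

```python
import numpy as np

def pad_and_chunk(rows, n_steps=10):
    """Repeat the last row until len(rows) is a multiple of n_steps,
    then split the rows into consecutive windows of n_steps."""
    rows = np.asarray(rows)
    n_pad = (-len(rows)) % n_steps  # 0 when the length is already aligned
    if n_pad:
        rows = np.concatenate([rows, np.repeat(rows[-1:], n_pad, axis=0)])
    return rows.reshape(-1, n_steps, rows.shape[1])

toy = np.arange(26).reshape(13, 2)  # 13 transactions, 2 features
windows = pad_and_chunk(toy)
print(windows.shape)
```

With 13 transactions, 7 copies of the last one are appended, giving two windows of 10 time steps each.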

### cust_len holds the number of transactions for each customer

In [10]:
cust_len=np.array([len(i[1]) for i in cust])

### cust_len_align is cust_len rounded up to the next multiple of 10

In [11]:
cust_len_align=np.ceil(np.array(cust_len)/10)*10

### The difference between cust_len_align and cust_len is the number of padded transactions to add

In [12]:
cust_augment=cust_len_align-cust_len
print(cust_augment)

[9. 9. 9. ... 2. 5. 3.]


### Let's generate new samples!

In [13]:
LSTM_sample=[]
from tqdm import tqdm_notebook as tqdm
for i in tqdm(range(len(cust))):
    df=cust[i][1]
    n_pad=int(cust_augment[i])
    if n_pad>0:
        # pad by repeating the customer's last transaction n_pad times
        df=pd.concat([df,df.iloc[[-1]*n_pad]])
    # split into consecutive windows of 10 transactions each
    df_list=np.array_split(df,int(cust_len_align[i]//10))
    array_list=[np.array(chunk) for chunk in df_list]
    LSTM_sample.extend(array_list)
LSTM_sample=np.array(LSTM_sample)
print(LSTM_sample.shape)



(4636, 10, 26)


### Let's take a look at the reorganized samples for the RNN

In [14]:
LSTM_sample

array([[[100547107, 'US', True, ..., 286.07, 2500, True],
        [100547107, 'US', True, ..., 286.07, 2500, True],
        [100547107, 'US', True, ..., 286.07, 2500, True],
        ...,
        [100547107, 'US', True, ..., 286.07, 2500, True],
        [100547107, 'US', True, ..., 286.07, 2500, True],
        [100547107, 'US', True, ..., 286.07, 2500, True]],

       [[100634414, 'US', True, ..., 85.51, 10000, False],
        [100634414, 'US', True, ..., 85.51, 10000, False],
        [100634414, 'US', True, ..., 85.51, 10000, False],
        ...,
        [100634414, 'US', True, ..., 85.51, 10000, False],
        [100634414, 'US', True, ..., 85.51, 10000, False],
        [100634414, 'US', True, ..., 85.51, 10000, False]],

       [[101548993, 'US', True, ..., 73.68, 7500, False],
        [101548993, 'US', True, ..., 73.68, 7500, False],
        [101548993, 'US', True, ..., 73.68, 7500, False],
        ...,
        [101548993, 'US', True, ..., 73.68, 7500, False],
        [101548993, 'US

### Next comes encoding! As before, I will reuse the pipeline I created for the AutoEncoder to preprocess the data quickly!

In [15]:
x=LSTM_sample[:,:,:-1]
y=LSTM_sample[:,:,-1]

In [16]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,
                                               test_size=0.2,
                                               random_state=123,
                                               )

In [17]:
class pipeline:
    """Chain of fitted estimators applied in order: ordinal encoder, one-hot encoder, SVD, scaler."""
    def __init__(self, init_estimator):
        self.estimator_list=[]
        self.estimator_list.extend(init_estimator)
    
    def add_estimator(self,estimator):# the adding order must be ordinal_encoding, one-hot, SVD, normalize 
        self.estimator_list.extend(estimator)
        
    def feature_encoding(self,df,line):
        # the first `line` columns go through the first estimator; the rest are appended unchanged
        if len(self.estimator_list)>=2:
            le_array=self.estimator_list[0].transform(df[:,:line])
            le_array=np.append(le_array,np.array(df[:,line:]),axis=1)
            ohe_sparse=self.estimator_list[1].transform(le_array).toarray()
            return le_array,ohe_sparse
        else:
            raise ValueError('feature_encoding needs at least 2 estimators')

    def feature_processing(self,df,line):
        if len(self.estimator_list)>=4:
            # feature_encoding returns (le_array, ohe_array); only the one-hot part goes to SVD
            _,ohe_sparse=self.feature_encoding(df,line)
            svd_array=self.estimator_list[2].transform(ohe_sparse)
            std_array=self.estimator_list[3].transform(svd_array)
            return std_array
        else:
            raise ValueError('feature_processing needs at least 4 estimators')

In [18]:
pip=pipeline([le,ohe])

In [19]:
x_train_le=[]
x_train_ohe=[]
for i in tqdm(range(len(x_train))):
    le_row,ohe_row=pip.feature_encoding(x_train[i],22)
    x_train_le.append(le_row)  
    x_train_ohe.append(ohe_row)   
              





In [20]:
x_train_le=np.array(x_train_le)
x_train_le.shape

(3708, 10, 25)

In [21]:
x_train_ohe=np.array(x_train_ohe)
x_train_ohe.shape

(3708, 10, 13678)

## Let's train the model!

In [22]:
y_train=y_train.reshape((-1,10,1))
y_train.shape

(3708, 10, 1)

### To feed such high-dimensional data into an RNN, I have two options:
### 1. Use SVD or PCA to reduce the dimension of the one-hot encoded data.
### 2. Use the one-hot encoded data directly, which leads to a very large number of weights in the network.
### Here I will train directly on the one-hot encoded data.
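For completeness, the first option (dimension reduction) could look roughly like the sketch below. It uses plain NumPy SVD on random toy data as a stand-in for the one-hot array; the shapes and the choice of `k = 5` components are assumptions for illustration only. On the real, much larger data a sparse-aware implementation such as scikit-learn's `TruncatedSVD`, fitted on the training set only, would be the more practical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for one-hot data: 8 samples, 10 time steps, 50 binary features.
x3d = (rng.random((8, 10, 50)) < 0.1).astype(float)

n_samples, n_steps, n_features = x3d.shape
k = 5  # hypothetical number of components to keep

# Flatten the time steps into rows, compute the projection,
# then restore the (samples, time_step, features) layout.
x2d = x3d.reshape(-1, n_features)
_, _, vt = np.linalg.svd(x2d, full_matrices=False)
x_reduced = (x2d @ vt[:k].T).reshape(n_samples, n_steps, k)

print(x_reduced.shape)
```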

In [23]:
from keras import backend as K

def recall(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

def precision(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

Using TensorFlow backend.
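To check what these two metric functions compute, here is a NumPy mirror of the same arithmetic on four made-up predictions. The names `recall_np` and `precision_np` are hypothetical, and `eps=1e-7` is assumed to match the default of `K.epsilon()`. Keras evaluates such metrics per batch and averages them, so the values in the training log are only approximations of the dataset-level precision and recall.

```python
import numpy as np

def recall_np(y_true, y_prob, eps=1e-7):
    # true positives: label is 1 and the probability rounds to 1
    tp = np.sum(np.round(np.clip(y_true * y_prob, 0, 1)))
    possible = np.sum(np.round(np.clip(y_true, 0, 1)))
    return tp / (possible + eps)

def precision_np(y_true, y_prob, eps=1e-7):
    tp = np.sum(np.round(np.clip(y_true * y_prob, 0, 1)))
    predicted = np.sum(np.round(np.clip(y_prob, 0, 1)))
    return tp / (predicted + eps)

y_true = np.array([1.0, 1.0, 0.0, 0.0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6])

# One of two frauds is caught; one of three positive predictions is correct.
print(recall_np(y_true, y_prob), precision_np(y_true, y_prob))
```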


In [37]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, TimeDistributed, Dropout
import tensorflow as tf
from keras.callbacks import EarlyStopping, ModelCheckpoint
n_steps = 10
n_inputs = 13678
n_neurons = 10
n_outputs = 1
keep_prob=0.9
lr=0.001
epochs=10
batch_size=300

model=Sequential()
model.add(LSTM(100,return_sequences=True,input_shape=(n_steps,n_inputs)))
model.add(LSTM(50,return_sequences=True))
model.add(LSTM(30,return_sequences=True))
model.add(TimeDistributed(Dense(1,activation='sigmoid')))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy',recall,precision])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_4 (LSTM)                (None, 10, 100)           5511600   
_________________________________________________________________
lstm_5 (LSTM)                (None, 10, 50)            30200     
_________________________________________________________________
lstm_6 (LSTM)                (None, 10, 30)            9720      
_________________________________________________________________
time_distributed_2 (TimeDist (None, 10, 1)             31        
Total params: 5,551,551
Trainable params: 5,551,551
Non-trainable params: 0
_________________________________________________________________
None
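The parameter counts in the summary can be reproduced by hand, which also shows why the one-hot input is so expensive. An LSTM layer has four gates, each with input weights, recurrent weights and a bias. `lstm_params` and `dense_params` are hypothetical helper names used only for this check.

```python
def lstm_params(n_inputs, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (n_inputs + units + 1) * units

def dense_params(n_inputs, units):
    # one weight per input plus one bias, for each output unit
    return (n_inputs + 1) * units

counts = [
    lstm_params(13678, 100),  # first LSTM, fed by the one-hot features
    lstm_params(100, 50),     # second LSTM
    lstm_params(50, 30),      # third LSTM
    dense_params(30, 1),      # time-distributed sigmoid output
]
total = sum(counts)
print(counts, total)
```

Almost all of the weights sit in the first layer, because each of its 100 units is connected to all 13678 input features in every gate.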


In [39]:
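## early stopping
es=EarlyStopping(monitor='val_loss',mode='auto',verbose=1,patience=3)
## save the best model; recall is higher-is-better, so mode should be 'max' (the log below was recorded with mode='min')
mc=ModelCheckpoint('lstm_best.h5',monitor='val_recall',mode='max',verbose=1,save_best_only=True)
## training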
model.fit(x_train_ohe,y_train,epochs=epochs,validation_split=0.3,
          batch_size=batch_size,verbose=2,callbacks=[mc,es],
          ) 

Train on 2595 samples, validate on 1113 samples
Epoch 1/10
 - 16s - loss: 0.6880 - acc: 0.5627 - recall: 0.0775 - precision: 0.5148 - val_loss: 0.6808 - val_acc: 0.5659 - val_recall: 0.0150 - val_precision: 0.4986

Epoch 00001: val_recall improved from inf to 0.01504, saving model to lstm_best.h5
Epoch 2/10
 - 12s - loss: 0.6832 - acc: 0.5642 - recall: 0.0309 - precision: 0.5821 - val_loss: 0.6786 - val_acc: 0.5670 - val_recall: 0.0179 - val_precision: 0.5523

Epoch 00002: val_recall did not improve from 0.01504
Epoch 3/10
 - 12s - loss: 0.6824 - acc: 0.5654 - recall: 0.0138 - precision: 0.7002 - val_loss: 0.6772 - val_acc: 0.5650 - val_recall: 0.0047 - val_precision: 0.4103

Epoch 00003: val_recall improved from 0.01504 to 0.00472, saving model to lstm_best.h5
Epoch 4/10
 - 13s - loss: 0.6808 - acc: 0.5648 - recall: 0.0127 - precision: 0.6676 - val_loss: 0.6813 - val_acc: 0.5710 - val_recall: 0.0525 - val_precision: 0.5700

Epoch 00004: val_recall did not improve from 0.00472
Epoch 5/

<keras.callbacks.History at 0x12c102c7e80>
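The `mode` argument of the checkpoint decides whether a lower or a higher monitored value counts as an improvement. The pure-Python sketch below replays the four `val_recall` values visible in the log above under both settings. `best_epoch` is a hypothetical helper that imitates the save-best-only rule; it is not Keras code.

```python
def best_epoch(values, mode):
    """Return the 1-based epoch that a save-best-only rule would keep last."""
    if mode == "min":
        better = lambda new, old: new < old
    else:
        better = lambda new, old: new > old
    best_val, best_ep = None, None
    for epoch, value in enumerate(values, start=1):
        if best_val is None or better(value, best_val):
            best_val, best_ep = value, epoch
    return best_ep

# val_recall of the four complete epochs shown in the log
val_recall = [0.0150, 0.0179, 0.0047, 0.0525]

print(best_epoch(val_recall, "min"), best_epoch(val_recall, "max"))
```

With `"min"` the kept checkpoint is the epoch with the lowest recall, which matches the "improved from 0.01504 to 0.00472" message in the log; with `"max"` it would be the epoch with the highest recall.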

### Now let's predict on the test set!

In [27]:
x_test.shape

(928, 10, 25)

In [28]:
x_test_ohe=[]
for i in tqdm(range(len(x_test))):
    le_row,ohe_row=pip.feature_encoding(x_test[i],22)
    x_test_ohe.append(ohe_row) 





In [40]:
x_test_ohe=np.array(x_test_ohe)
x_test_ohe.shape

(928, 10, 13678)

In [82]:
y_test_predict=model.predict(x_test_ohe)

In [88]:
y_test_predict=(y_test_predict>0.5)

### Let's see the result!

In [89]:
y_test_predict.shape

(928, 10, 1)

In [90]:
y_test.shape

(9280,)

In [91]:
y_test_predict=y_test_predict.reshape((-1)).astype(int)

In [92]:
y_test_predict

array([0, 0, 0, ..., 1, 1, 1])

In [93]:
y_test=y_test.reshape((-1)).astype(int)

In [94]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_test_predict)

array([[3876, 1295],
       [2718, 1391]], dtype=int64)
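The confusion matrix is easier to read once it is turned into the usual metrics. The sketch below derives them from the numbers printed above, relying on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, columns = prediction.
cm = np.array([[3876, 1295],
               [2718, 1391]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

On this balanced test set the model reaches an accuracy of about 0.57, a precision of about 0.52 and a recall of about 0.34, so it still misses most fraudulent time steps.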

## This is just a demo. For future work, we can try more models: bidirectional RNNs, attention layers on top of the RNN, etc.!