## Credit Card Fraud Detection

In this project, we look at a bank's records of customers applying for a credit card. The variables correspond to the information a customer provides during the application. The goal is to identify customers who may have committed fraud.
Since the self-organizing map (SOM) used here is an unsupervised model, it is trained without a fraud label or target variable. The `Class` column in the data records approval only and is not used for training.

### Import libraries

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Data Preprocessing
There are 15 input columns: the customer ID plus 14 attributes (A1 to A14), some categorical and some continuous. The SOM takes these columns as input and maps each customer onto a two-dimensional output grid of neurons. Each neuron holds a weight vector with one entry per input column, so each neuron has 15 weights, and these weights are initialized from the customers' rows.

In [14]:
data = pd.read_csv('data/credit_card_app.csv')
print(data.shape)
data.head()

(690, 16)


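| | CustomerID | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15776156 | 1 | 22.08 | 11.46 | 2 | 4 | 4 | 1.585 | 0 | 0 | 0 | 1 | 2 | 100 | 1213 | 0 |
| 1 | 15739548 | 0 | 22.67 | 7.0 | 2 | 8 | 4 | 0.165 | 0 | 0 | 0 | 0 | 2 | 160 | 1 | 0 |
| 2 | 15662854 | 0 | 29.58 | 1.75 | 1 | 4 | 4 | 1.25 | 0 | 0 | 0 | 1 | 2 | 280 | 1 | 0 |
| 3 | 15687688 | 0 | 21.67 | 11.5 | 1 | 5 | 3 | 0.0 | 1 | 1 | 11 | 1 | 2 | 0 | 1 | 1 |
| 4 | 15715750 | 1 | 20.17 | 8.17 | 2 | 6 | 4 | 1.96 | 1 | 1 | 14 | 0 | 2 | 60 | 159 | 1 |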


### Dataset Information

In [15]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
CustomerID    690 non-null int64
A1            690 non-null int64
A2            690 non-null float64
A3            690 non-null float64
A4            690 non-null int64
A5            690 non-null int64
A6            690 non-null int64
A7            690 non-null float64
A8            690 non-null int64
A9            690 non-null int64
A10           690 non-null int64
A11           690 non-null int64
A12           690 non-null int64
A13           690 non-null int64
A14           690 non-null int64
Class         690 non-null int64
dtypes: float64(3), int64(13)
memory usage: 86.3 KB
None


### Split the dataset
`y` is the `Class` column, where 1 means the credit card application was approved and 0 means it was not. `X` holds the remaining 15 columns, including `CustomerID`.

In [16]:
X = data.iloc[:,0:-1].values
y = data.iloc[:,-1].values
print(X.shape)
print(y.shape)

(690, 15)
(690,)


### Scale the features

We normalize each feature to the range [0, 1] with min-max scaling.

In [17]:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range=(0,1))
X = sc.fit_transform(X)

In [18]:
print(X[0,:])

[ 0.84268147  1.          0.12526316  0.40928571  0.5         0.23076923
  0.375       0.05561404  0.          0.          0.          1.          0.5
  0.05        0.01212   ]
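
The normalization above can be illustrated with a minimal sketch, assuming the default behaviour of `MinMaxScaler` with `feature_range=(0, 1)`: each column is rescaled as (x - min) / (max - min). The helper name `min_max_scale` and the toy values are hypothetical and serve only as an illustration.

```python
import numpy as np

def min_max_scale(columns):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    columns = np.asarray(columns, dtype=float)
    col_min = columns.min(axis=0)
    col_range = columns.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range[col_range == 0] = 1.0
    return (columns - col_min) / col_range

# Toy values, not taken from the dataset.
toy = np.array([[10.0, 1.0],
                [20.0, 3.0],
                [30.0, 5.0]])
scaled = min_max_scale(toy)
print(scaled)
```

Note that `CustomerID` is the first column of `X`, so it is scaled along with the other features; this is why `X[0, 0]` above is a fraction rather than an ID.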


### Self Organizing Maps from MiniSom library
For each input (customer), the winning neuron is the one whose weight vector is closest to the customer's row in Euclidean distance. A Gaussian neighbourhood function then updates the weights of the winner and of the neurons around it on the grid, pulling them closer to that customer's row.

These steps are repeated for the customers many times, and the neighbourhood radius shrinks gradually as training proceeds. As a result, customers that differ from the bulk of the data end up on neurons that lie far from their neighbours, so potentially fraudulent applications show up as outliers.
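
The update described above can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not MiniSom's actual implementation: the function name `som_step`, the fixed `learning_rate` and `sigma`, and the random data are all hypothetical, and the decay of the radius over time is left out.

```python
import numpy as np

def som_step(weights, x, learning_rate=0.5, sigma=1.0):
    """Apply one SOM update for sample x; returns the winner's grid position."""
    rows, cols, _ = weights.shape
    # 1. Winning neuron: the one whose weight vector is nearest to x.
    dists = np.linalg.norm(weights - x, axis=2)
    winner = np.unravel_index(np.argmin(dists), (rows, cols))
    # 2. Gaussian neighbourhood measured on the grid, centred on the winner.
    gi, gj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    grid_dist_sq = (gi - winner[0]) ** 2 + (gj - winner[1]) ** 2
    influence = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))
    # 3. Pull every neuron towards x, scaled by its neighbourhood influence.
    weights += learning_rate * influence[:, :, None] * (x - weights)
    return winner

rng = np.random.default_rng(0)
weights = rng.random((10, 10, 15))   # 10x10 grid, 15 weights per neuron
x = rng.random(15)                   # one scaled customer row (synthetic)
before = np.linalg.norm(weights - x, axis=2)
winner = som_step(weights, x)
after = np.linalg.norm(weights - x, axis=2)
```

After the step, the winner and its grid neighbours sit closer to `x`, while neurons far away on the grid barely move.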

### MiniSom Implementation
The MiniSom uses the following parameters:

x & y >>>>> the dimensions of the grid. The grid must not be too small, otherwise distinct groups of customers collapse onto the same neurons; a larger grid gives finer resolution.

input_len >>>>> the number of input dimensions (15)

sigma >>>>> the radius of the neighbourhood around the winning neuron

learning_rate >>>>> how much the weights are updated at each iteration. A higher value converges faster but less accurately.
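
As a rough guide for choosing `x` and `y`, a commonly quoted rule of thumb is to use about 5 * sqrt(N) neurons for N samples. The sketch below applies that heuristic; it is an assumption brought in for illustration, not something this notebook relies on, and the helper name `suggested_grid_side` is hypothetical.

```python
import math

def suggested_grid_side(n_samples):
    """Side of a square grid holding roughly 5 * sqrt(n_samples) neurons."""
    n_neurons = 5 * math.sqrt(n_samples)
    return math.ceil(math.sqrt(n_neurons))

side = suggested_grid_side(690)   # 690 rows, as reported by data.shape
print(side)
```

For the 690 rows here the heuristic suggests a grid of about 12 x 12, so a 10 x 10 grid is in the same range.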

In [19]:
from minisom import MiniSom
som = MiniSom(x=10, y=10, input_len=15, sigma=1.0, learning_rate=0.05)
som.random_weights_init(X)
som.train_random(X, num_iteration=500)

In [20]:
%matplotlib notebook

#plt.figure(figsize=(12,8))
plt.bone()                         # Switch the colormap to "bone" (grayscale)
plt.pcolor(som.distance_map().T)   # Shade each cell by its mean distance to neighbouring neurons
plt.colorbar()
markers = ['o', 's']               # Class 0 (not approved) gets a circle, class 1 (approved) a square
colors = ['r', 'g']                # Not approved in red, approved in green
for i, x in enumerate(X):
    w = som.winner(x)              # Grid coordinates of the winning neuron for this customer
    plt.plot(w[0] + 0.5,           # +0.5 centres the marker in its cell
             w[1] + 0.5,
             markers[y[i]],
             markeredgecolor = colors[y[i]],
             markerfacecolor = 'None',
             markersize = 10,
             markeredgewidth = 2)
plt.show()


### Interpretation
The white/light cells have a high mean distance to their neighbouring neurons, so the customers mapped to them are outliers. The green squares are applications that were approved and the red circles are applications that were rejected. Green squares in the light cells mark potential fraud cases among customers whose applications were approved.

### Mapping and Concatenating the Potential Fraud cells

In [27]:
mappings = som.win_map(X)
mappings

defaultdict(list,
            {(0,
              0): [array([ 0.85004144,  1.        ,  0.42481203,  0.34964286,  0.5       ,
                      1.        ,  0.875     ,  0.27929825,  1.        ,  1.        ,
                      0.11940299,  0.        ,  0.5       ,  0.        ,  0.        ]), array([ 0.52695522,  1.        ,  0.41233083,  0.14428571,  0.5       ,
                      0.92307692,  0.875     ,  0.24561404,  1.        ,  1.        ,
                      0.11940299,  0.        ,  0.5       ,  0.16      ,  0.        ]), array([ 0.98109951,  1.        ,  0.36721805,  0.36160714,  0.5       ,
                      1.        ,  0.375     ,  0.0877193 ,  1.        ,  1.        ,
                      0.08955224,  0.        ,  0.5       ,  0.26      ,  0.00196   ]), array([ 0.67228075,  1.        ,  0.12150376,  0.39285714,  0.5       ,
                      1.        ,  0.375     ,  0.01017544,  1.        ,  1.        ,
                      0.08955224,  0.        ,  0.

### Reverse Mapping to get back the customers with "Fraud Applications"

In [28]:
#frauds = np.concatenate((mappings[(5,7)], mappings[(6,7)]), axis=0)
frauds = np.concatenate((mappings[(4,6)], mappings[(4,7)]), axis=0)
frauds = sc.inverse_transform(frauds)
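
The grid coordinates in the cell above are hard-coded, presumably read off the plot, and they change whenever the SOM is retrained. As an alternative, the outlier cells could be picked programmatically. The sketch below is one possible approach under stated assumptions: the helper names `cells_above` and `collect_rows`, the threshold of 0.9, and the demo arrays are all hypothetical.

```python
import numpy as np

def cells_above(distance_map, threshold):
    """Return (row, col) grid positions whose mean distance exceeds threshold."""
    return [tuple(int(v) for v in pos)
            for pos in np.argwhere(distance_map > threshold)]

def collect_rows(mappings, cells):
    """Stack the rows mapped to the given cells, skipping empty cells."""
    groups = [np.asarray(mappings[c]) for c in cells
              if len(mappings.get(c, [])) > 0]
    return np.concatenate(groups, axis=0) if groups else np.empty((0, 0))

# Synthetic stand-ins for som.distance_map() and som.win_map(X).
demo_map = np.array([[0.2, 0.95],
                     [0.4, 0.91]])
demo_mappings = {(0, 1): [np.array([1.0, 2.0])],
                 (1, 1): [np.array([3.0, 4.0]), np.array([5.0, 6.0])]}
outlier_cells = cells_above(demo_map, threshold=0.9)
outlier_rows = collect_rows(demo_mappings, outlier_cells)
print(outlier_cells, outlier_rows.shape)
```

With the objects from this notebook, the same helpers would be called as `collect_rows(mappings, cells_above(som.distance_map(), 0.9))` before applying `sc.inverse_transform`. The threshold is a judgment call and should be checked against the plot.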

### Potential Fraud Customer IDs

In [29]:
frauds_df = pd.DataFrame(data=frauds, columns=data.iloc[:,0:-1].columns)
frauds_df.head()

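| | CustomerID | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15796813.0 | 1.0 | 41.58 | 1.75 | 2.0 | 4.0 | 4.0 | 0.21 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 160.0 | 1.0 |
| 1 | 15731166.0 | 1.0 | 40.92 | 0.835 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 130.0 | 2.0 |
| 2 | 15649379.0 | 1.0 | 42.75 | 3.0 | 2.0 | 3.0 | 5.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 201.0 |
| 3 | 15797246.0 | 1.0 | 23.42 | 0.585 | 2.0 | 8.0 | 8.0 | 0.085 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 180.0 | 1.0 |
| 4 | 15800773.0 | 1.0 | 54.42 | 0.5 | 1.0 | 4.0 | 8.0 | 3.96 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 180.0 | 315.0 |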


### Get the customer IDs of those potential frauds

In [30]:
for i in range(len(frauds)):
    print(int(frauds[i][0]))

15796813
15731166
15649379
15797246
15800773
15729771
15609823
15701687
15706268
15653147
15735572
15751167
15721507
15802106
15812918
15715519
15711249
15784526
15687765
15658504
15757306
15700046
15769356
15778142
15812766
15808223
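
Since the interpretation focuses on approved applications that fall in outlier cells, the flagged IDs can be cross-referenced with the `Class` column. The sketch below shows one way to do this, assuming a frame with `CustomerID` and `Class` columns as in `data`; the helper name `approved_among` and the demo frame are hypothetical.

```python
import pandas as pd

def approved_among(data, customer_ids):
    """Rows of data whose CustomerID is flagged and whose Class is 1 (approved)."""
    flagged = data["CustomerID"].isin(customer_ids)
    return data[flagged & (data["Class"] == 1)]

# Synthetic stand-in for the real dataset.
demo = pd.DataFrame({"CustomerID": [101, 102, 103, 104],
                     "Class":      [1,   0,   1,   0]})
approved_flagged = approved_among(demo, [101, 102, 104])
print(approved_flagged)
```

With the notebook's objects this could be called as `approved_among(data, frauds_df["CustomerID"].round().astype(int))`. Rounding before the integer cast is a precaution, because `inverse_transform` returns floats that may differ from the original IDs by a tiny amount.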
