## Benford's Law
#### Apply Benford's Law to find Unusual behavior

#### To detect top 40 unusual Cardnum(Card number) and Merchnum (Merchant number) respectively by applying Benford’s Law to transaction amount.

**Process**

**Step 1 Data Cleaning**

Remove Merch description that contain "FEDEX", since there are many small amount of transactions starting with 3. Also, keep only “P”, which stands for purchase, in Transtype(transaction type). Finally, keep only the first digit in Amount.

**Step 2 Identify Distribution of Numbers for Each Merchnum and Cardnum Group**

Group by Merchnum and Cardnum respectively. For each resulting group, count the appear time for each number through 0 to 9, and sum the total count as . Sum the count of 1 and 2 as n_low , and total count minus n_low as n_high .

**Step 3 Define R: Max (R, 1/R)**

Since in Benford’s Law, ; if each resulting group follows Benford’s Law, we will expect the answer to R = (1.096*n_low) / n_high should be close to 1. Therefore, we could measure the unusualness by looking at the maximum of  and . Note that, since we don’t want to divide by 0, if either n_low  or n_high  equals to 0, we set it to 1.

**Step 4 Smoothing Formula**

However, in some resulting groups, there are not enough samples. Therefore, in order to take care of those groups, we add a smoothing formula to smooth out the original value . We define: , R_star = (R-1/1+exp^(-t)), where t = (n-15) / 3. Now, we can measure the unusualness by looking at the maximum of R_star and 1/R.

**Step 5 Result**

Identify the top 40 unusual cardnum & merchnum

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as sps 
import matplotlib.pyplot as plt 
import seaborn as sns
import sklearn as skl
from sklearn import preprocessing 
%matplotlib inline

In [5]:
%%time
data = pd.read_csv('card transactions.csv')

CPU times: user 259 ms, sys: 64.5 ms, total: 323 ms
Wall time: 324 ms


In [6]:
data.head()

Unnamed: 0,Recnum,Cardnum,Date,Merchnum,Merch description,Merch state,Merch zip,Transtype,Amount,Fraud,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,1,5142190439,1/1/10,5509006296254,FEDEX SHP 12/23/09 AB#,TN,38118.0,P,3.62,0,,,,,,,,
1,2,5142183973,1/1/10,61003026333,SERVICE MERCHANDISE #81,MA,1803.0,P,31.42,0,,,,,,,,
2,3,5142131721,1/1/10,4503082993600,OFFICE DEPOT #191,MD,20706.0,P,178.49,0,,,,,,,,
3,4,5142148452,1/1/10,5509006296254,FEDEX SHP 12/28/09 AB#,TN,38118.0,P,3.62,0,,,,,,,,
4,5,5142190439,1/1/10,5509006296254,FEDEX SHP 12/23/09 AB#,TN,38118.0,P,3.62,0,,,,,,,,


In [7]:
data = data.drop(columns=['Unnamed: 10',
 'Unnamed: 11',
 'Unnamed: 12',
 'Unnamed: 13',
 'Unnamed: 14',
 'Unnamed: 15',
 'Unnamed: 16',
 'Unnamed: 17',])
# filter P
data = data[data.Transtype == 'P']
print(data.shape)
# remove Merchnum 930090121224 & 5509006296254 (Fedex)
#data = data[data.Merchnum != '930090121224']
#data = data[data.Merchnum != '5509006296254']
data = data[~data['Merch description'].str.contains('FEDEX')]
print(data.shape)
#reset index
data.reset_index(drop=True, inplace = True)

(96398, 10)
(84623, 10)


In [8]:
# keep fisrt digit
one= []
for i in data.Amount:
    one.append(int(str(i)[:1]))
#data['one_digit'] = pd.Series(one)
one = pd.DataFrame(one, columns = ['digit'])
data = data.join(one)

In [9]:
data_card = data.drop(columns=['Recnum',
 'Merchnum',
 'Date',
 'Merch description',
 'Merch state',
 'Merch zip',
 'Transtype',
 'Amount', "Fraud"])
data_card.reset_index(drop=True, inplace = True)
data_card.shape

(84623, 2)

In [10]:
#group by cardnumber & digit
data_card_group = data_card.groupby(['Cardnum','digit'])[['digit']].count()

In [11]:
data_card_group = data_card_group.unstack()
data_card_group.head()

Unnamed: 0_level_0,digit,digit,digit,digit,digit,digit,digit,digit,digit,digit
digit,0,1,2,3,4,5,6,7,8,9
Cardnum,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
5142110002,,1.0,,,,,,,,
5142110081,,,,,2.0,,2.0,,,
5142110313,,2.0,,,1.0,,,,,
5142110402,,1.0,2.0,2.0,2.0,4.0,,,,
5142110434,,1.0,,,,,,,,


In [12]:
data_card_group.reset_index(inplace = True)
data_card_group.head()
# fill na with 0
data_card_group = data_card_group.fillna(0)

In [13]:
data_card_group.head()

Unnamed: 0_level_0,Cardnum,digit,digit,digit,digit,digit,digit,digit,digit,digit,digit
digit,Unnamed: 1_level_1,0,1,2,3,4,5,6,7,8,9
0,5142110002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,5142110081,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0
2,5142110313,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,5142110402,0.0,1.0,2.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0
4,5142110434,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
total= []
for index, row in data_card_group.iterrows():
    total.append(sum(row.digit[i] for i in range(10)))  

count_low = []
for index, row in data_card_group.iterrows():
    count_low.append(row.digit[0] + row.digit[1])

data_card_group['total'] = total
data_card_group['count_low'] = count_low
data_card_group["count_high"] = data_card_group['total'] - data_card_group['count_low']
#data_card_group["r_low"] = round((data_card_group['count_low'] / data_card_group['total']) *100,2)
#data_card_group["r_high"] = round((data_card_group['count_high'] / data_card_group['total']) *100,2)
data_card_group.head()

Unnamed: 0_level_0,Cardnum,digit,digit,digit,digit,digit,digit,digit,digit,digit,digit,total,count_low,count_high
digit,Unnamed: 1_level_1,0,1,2,3,4,5,6,7,8,9,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
0,5142110002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,5142110081,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,4.0,0.0,4.0
2,5142110313,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,1.0
3,5142110402,0.0,1.0,2.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,11.0,1.0,10.0
4,5142110434,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0


In [16]:
#replace 0 with 1
data_card_group = data_card_group.replace({'count_low':{0:1}})
data_card_group = data_card_group.replace({'count_high':{0:1}})
data_card_group.head()

Unnamed: 0_level_0,Cardnum,digit,digit,digit,digit,digit,digit,digit,digit,digit,digit,total,count_low,count_high
digit,Unnamed: 1_level_1,0,1,2,3,4,5,6,7,8,9,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
0,5142110002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
1,5142110081,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,4.0,1.0,4.0
2,5142110313,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,1.0
3,5142110402,0.0,1.0,2.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,11.0,1.0,10.0
4,5142110434,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0


In [17]:
# R and 1/R
data_card_group["R"] = (1.096 * data_card_group["count_low"]) / data_card_group["count_high"]
data_card_group["R_inv"] =  1/data_card_group["R"]

# max(R, 1/R)
R_new = []
for i in range(len(data_card_group.R)):
    R_new.append(max(data_card_group.R.loc[i], data_card_group.R_inv.loc[i]))

data_card_group['R_new'] = R_new
data_card_group.head()

Unnamed: 0_level_0,Cardnum,digit,digit,digit,digit,digit,digit,digit,digit,digit,digit,total,count_low,count_high,R,R_inv,R_new
digit,Unnamed: 1_level_1,0,1,2,3,4,5,6,7,8,9,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
0,5142110002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.096,0.912409,1.096
1,5142110081,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,4.0,1.0,4.0,0.274,3.649635,3.649635
2,5142110313,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,1.0,2.192,0.456204,2.192
3,5142110402,0.0,1.0,2.0,2.0,2.0,4.0,0.0,0.0,0.0,0.0,11.0,1.0,10.0,0.1096,9.124088,9.124088
4,5142110434,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.096,0.912409,1.096


In [18]:
# t / c=3
data_card_group['t'] = (data_card_group['total'] - 15)/ 3
data_card_group.head()

# R star
exp = 2.71828182845904
data_card_group['R_star'] = 1 + ((data_card_group['R_new']-1)/ (1+ exp**(-data_card_group['t'])))
data_card_group['R_star_inv'] = 1/data_card_group['R_star']

R_star_max = []
for i in range(len(data_card_group.R_star)):
    R_star_max.append(max(data_card_group.R_star.loc[i], data_card_group.R_star_inv.loc[i]))

data_card_group['R_star_max'] = R_star_max

data_card_group.head()

Unnamed: 0_level_0,Cardnum,digit,digit,digit,digit,digit,digit,digit,digit,digit,...,total,count_low,count_high,R,R_inv,R_new,t,R_star,R_star_inv,R_star_max
digit,Unnamed: 1_level_1,0,1,2,3,4,5,6,7,8,...,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,5142110002,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.096,0.912409,1.096,-4.666667,1.000894,0.999106,1.000894
1,5142110081,0.0,0.0,0.0,0.0,2.0,0.0,2.0,0.0,0.0,...,4.0,1.0,4.0,0.274,3.649635,3.649635,-3.666667,1.066041,0.938051,1.066041
2,5142110313,0.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,3.0,2.0,1.0,2.192,0.456204,2.192,-4.0,1.02144,0.97901,1.02144
3,5142110402,0.0,1.0,2.0,2.0,2.0,4.0,0.0,0.0,0.0,...,11.0,1.0,10.0,0.1096,9.124088,9.124088,-1.333333,2.694754,0.371091,2.694754
4,5142110434,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,1.096,0.912409,1.096,-4.666667,1.000894,0.999106,1.000894


In [19]:
final_card = data_card_group.sort_values(by=['R_star_max'],ascending=False).head(40)
final_card = final_card[['Cardnum','R_star_max']]
final_card

Unnamed: 0_level_0,Cardnum,R_star_max
digit,Unnamed: 1_level_1,Unnamed: 2_level_1
697,5142194617,33.743794
1074,5142240823,22.266612
540,5142176413,14.740518
725,5142197563,12.682482
828,5142210205,11.720619
271,5142143463,10.588202
284,5142144931,10.028264
1302,5142270003,9.985322
1454,5142288601,9.30656
132,5142125025,7.343066
