# Predicting Merchant Fraudulence for E-commerce Business

### Problem statment : 

Predict if the Merchant is Fraudster or not for an e-commerce client
‘XYZ’ is a large e-commerce company with its operations in several countries. As the online giant grows, so has the number of fraudster merchants are. They deliver counterfeits or, in some cases, nothing at all. Such schemes leave customers duped, and place both legitimate merchants and the company itself in a constant battle to rid the marketplace of scammers. Determining this is also important in budgeting for fraud investigation. It's a well-known problem both to the company and to merchants, which they say hasn't effectively addressed the issue. They are serious about it and want to protect themselves from these fraudulent merchants using technology.

You are expected to create an analytical and modelling framework to predict the Merchant Fraudulency(yes/no) based on the quantitative and qualitative features provided in the dataset 

### Problem Understanding

1. Fraudulent Merchants 
2. Merchant sell counterfeit product or don’t dispatch items
3. Company brand value impacted.
4. Company make loss on refund to customers


### Expectation 

1. Budget required for fraud detection
2. To protect client from fraudulent merchants using client’s technology
3. Build analytical framework to predict Merchant fraudulency
4. Give possible insights from the data about fraudulent merchant

# Loading required Libriaries

In [1]:

#Interface with Operating system to run Python
import os

#Import numerical and pandas libraries
import numpy as np
import pandas as pd


#Preprocessing libraries

from sklearn import preprocessing

#For imputing 
from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer

#To dummify
from sklearn.preprocessing import OneHotEncoder

#To standardize
from sklearn.preprocessing import StandardScaler

#To normalize
from sklearn.preprocessing import MinMaxScaler

#To do Train_test split
from sklearn.model_selection import train_test_split

#To do GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.compose import ColumnTransformer


#To check performance metrics
from sklearn.metrics import accuracy_score

#To create confusion_matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import confusion_matrix, roc_curve, auc

#For plotting - Matplot
import matplotlib.pyplot as plt
%matplotlib notebook

#For plotting - Seaborn
import seaborn as sns

## Read the Data set

In [2]:
#Reading training data set - Merchant details
train_Seller_data = pd.read_csv("train_merchant_data-1561627820784.csv")

In [5]:
train_Seller_data.shape

(54213, 7)

In [10]:
train_Seller_data.describe(include='all')

Unnamed: 0,Ecommerce_Provider_ID,Merchant_ID,Merchant_Registration_Date,Registered_Device_ID,Gender,Age,IP_Address
count,54213.0,54213.0,54213,54213,54213,54213.0,54213
unique,,,54213,51291,2,,52028
top,,,2018-05-23 20:35:24,VIPZYJGMVMXOF,M,,91.161.239.48
freq,,,1,9,31761,,9
mean,1746213.0,200395.176212,,,,33.12224,
std,0.0,115398.486895,,,,8.630091,
min,1746213.0,2.0,,,,18.0,
25%,1746213.0,100997.0,,,,27.0,
50%,1746213.0,200574.0,,,,32.0,
75%,1746213.0,300407.0,,,,39.0,


In [11]:
#Reading Unseen dataset - Merchant details
unseen_seller_data = pd.read_csv("test_merchant_data-1561627903902.csv")

In [12]:
#Reading training data set - Order details
train_Order_data = pd.read_csv("train_order_data-1561627847149.csv")

In [13]:
#Reading Unseen data set - Order details
unseen_Order_data = pd.read_csv("test_order_data-1561627931868.csv")

In [14]:
#Reading train data set target variable 
fraud_train = pd.read_csv("train-1561627878332.csv")

In [15]:
test_tar_data = pd.read_csv("test-1561627952093.csv")

In [16]:
#Combining train dataset
train_Seller_Order_data = pd.merge(train_Seller_data, train_Order_data, on='Merchant_ID', how='outer')

In [17]:
Unseen_Seller_Order_data = pd.merge(unseen_seller_data, unseen_Order_data, on='Merchant_ID', how='outer')

In [18]:
final_data = pd.merge(train_Seller_Order_data, fraud_train, on='Merchant_ID', how='outer')

In [23]:
unseen_data = pd.merge(Unseen_Seller_Order_data, test_tar_data, on='Merchant_ID', how='outer')

In [24]:
ip_boundaries_countries = pd.read_csv("ip_boundaries_countries-1561628631121.csv")

In [25]:
#Define a function to convert IP address to Integer.

def ip_to_int(ip_ser):
    ips = ip_ser.str.split('.', expand=True).astype(np.int16).values
    mults = np.tile(np.array([24, 16, 8, 0]), len(ip_ser)).reshape(ips.shape)
    return np.sum(np.left_shift(ips, mults), axis=1)

In [26]:
#Appling on Train data set

final_data['_ip'] = ip_to_int(final_data.IP_Address)
ip_boundaries_countries[['_ip_range_start','_ip_range_end']] = ip_boundaries_countries.filter(like='ip_address').apply(lambda x: ip_to_int(x))

In [27]:
#Appling on Train data set

unseen_data['_ip'] = ip_to_int(unseen_data.IP_Address)
ip_boundaries_countries[['_ip_range_start','_ip_range_end']] = ip_boundaries_countries.filter(like='ip_address').apply(lambda x: ip_to_int(x))

In [28]:
final_data['x'] = (final_data._ip.apply(lambda x: ip_boundaries_countries.query('_ip_range_start <= @x <= _ip_range_end')
                         .index
                         .values)
              .apply(lambda x: x[0] if len(x) else -1))

In [29]:
unseen_data['x'] = (unseen_data._ip.apply(lambda x: ip_boundaries_countries.query('_ip_range_start <= @x <= _ip_range_end')
                         .index
                         .values)
              .apply(lambda x: x[0] if len(x) else -1))

In [30]:
#Merging Train Data set to bring Country column

known_data=(pd.merge(final_data.drop('_ip',1),
          ip_boundaries_countries.filter(regex=r'^((?!.?ip_range_).*)$'),
          left_on='x',
          right_index=True,
          how='left').drop('x',1))

In [33]:
#Merging Train Data set to bring Country column

unseen_data=(pd.merge(final_data.drop('_ip',1),
          ip_boundaries_countries.filter(regex=r'^((?!.?ip_range_).*)$'),
          left_on='x',
          right_index=True,
          how='left').drop('x',1))

In [34]:
known_data.head()

Unnamed: 0,Ecommerce_Provider_ID,Merchant_ID,Merchant_Registration_Date,Registered_Device_ID,Gender,Age,IP_Address,Customer_ID,Order_ID,Date_of_Order,Order_Value_USD,Order_Source,Order_Payment_Method,Fraudster,lower_bound_ip_address,upper_bound_ip_address,country
0,1746213,50448,2018-05-01 21:15:11,VATQMMZTVOZUT,F,39,48.151.136.76,129697,3b8983a83c7b,2018-07-30 10:59:13,90,SEO,Debit Card,0,48.0.0.0,48.255.255.255,United States
1,1746213,338754,2018-04-14 10:13:00,LJCILLBRQZNKS,M,35,94.9.145.169,117390,34b5eb921228,2018-06-15 11:19:47,98,SEO,Internet Banking,0,94.0.0.0,94.15.255.255,United Kingdom
2,1746213,291127,2018-06-20 07:44:22,JFVHSUGKDAYZV,F,40,58.94.157.121,120162,41a1c86ff08b,2018-08-13 10:06:26,95,SEO,Credit Card,0,58.92.0.0,58.95.255.255,Japan
3,1746213,319919,2018-06-27 01:41:39,WFRXMPLQYXRMY,M,37,193.187.41.186,128228,e8c3ad80d916,2018-07-22 15:46:51,100,Direct,E-wallet,0,193.187.12.0,193.187.43.255,Austria
4,1746213,195911,2018-01-05 00:55:41,GGHKWMSWHCMID,F,27,125.96.20.172,136029,e71ab1f26785,2018-04-16 08:02:44,78,SEO,E-wallet,0,125.96.0.0,125.97.255.255,China


In [35]:
known_data.columns

Index(['Ecommerce_Provider_ID', 'Merchant_ID', 'Merchant_Registration_Date',
       'Registered_Device_ID', 'Gender', 'Age', 'IP_Address', 'Customer_ID',
       'Order_ID', 'Date_of_Order', 'Order_Value_USD', 'Order_Source',
       'Order_Payment_Method', 'Fraudster', 'lower_bound_ip_address',
       'upper_bound_ip_address', 'country'],
      dtype='object')

In [36]:
known_data.dtypes

Ecommerce_Provider_ID          int64
Merchant_ID                    int64
Merchant_Registration_Date    object
Registered_Device_ID          object
Gender                        object
Age                            int64
IP_Address                    object
Customer_ID                    int64
Order_ID                      object
Date_of_Order                 object
Order_Value_USD                int64
Order_Source                  object
Order_Payment_Method          object
Fraudster                      int64
lower_bound_ip_address        object
upper_bound_ip_address        object
country                       object
dtype: object

In [38]:
known_data.shape

(54213, 17)

In [39]:
known_data.describe(include='all')

Unnamed: 0,Ecommerce_Provider_ID,Merchant_ID,Merchant_Registration_Date,Registered_Device_ID,Gender,Age,IP_Address,Customer_ID,Order_ID,Date_of_Order,Order_Value_USD,Order_Source,Order_Payment_Method,Fraudster,lower_bound_ip_address,upper_bound_ip_address,country
count,54213.0,54213.0,54213,54213,54213,54213.0,54213,54213.0,54213,54213,54213.0,54213,54213,54213.0,46402,46402,46402
unique,,,54213,51291,2,,52028,,54213,54161,,3,5,,14914,14914,109
top,,,2018-05-23 20:35:24,VIPZYJGMVMXOF,M,,91.161.239.48,,4443ec2963c7,2018-05-09 11:22:28,,SEO,Credit Card,,12.0.0.0,12.255.255.255,United States
freq,,,1,9,31761,,9,,1,2,,21884,21844,,262,262,20963
mean,1746213.0,200395.176212,,,,33.12224,,137966.285208,,,92.23024,,,0.09269,,,
std,0.0,115398.486895,,,,8.630091,,15563.516156,,,45.673263,,,0.29,,,
min,1746213.0,2.0,,,,18.0,,111234.0,,,22.0,,,0.0,,,
25%,1746213.0,100997.0,,,,27.0,,124471.0,,,55.0,,,0.0,,,
50%,1746213.0,200574.0,,,,32.0,,137864.0,,,88.0,,,0.0,,,
75%,1746213.0,300407.0,,,,39.0,,151405.0,,,122.0,,,0.0,,,


In [40]:
known_data.nunique()

Ecommerce_Provider_ID             1
Merchant_ID                   54213
Merchant_Registration_Date    54213
Registered_Device_ID          51291
Gender                            2
Age                              53
IP_Address                    52028
Customer_ID                   34081
Order_ID                      54213
Date_of_Order                 54161
Order_Value_USD                 116
Order_Source                      3
Order_Payment_Method              5
Fraudster                         2
lower_bound_ip_address        14914
upper_bound_ip_address        14914
country                         109
dtype: int64

In [24]:
known_data.isnull().sum()

Ecommerce_Provider_ID            0
Merchant_ID                      0
Merchant_Registration_Date       0
Registered_Device_ID             0
Gender                           0
Age                              0
IP_Address                       0
Customer_ID                      0
Order_ID                         0
Date_of_Order                    0
Order_Value_USD                  0
Order_Source                     0
Order_Payment_Method             0
Fraudster                        0
lower_bound_ip_address        7811
upper_bound_ip_address        7811
country                       7811
dtype: int64

In [41]:
unnecessary_col =['Ecommerce_Provider_ID', 'Customer_ID', 'lower_bound_ip_address', 'upper_bound_ip_address']
known_data.drop(unnecessary_col, axis=1, inplace=True)

In [42]:
unnecessary_col =['Ecommerce_Provider_ID', 'Customer_ID', 'lower_bound_ip_address', 'upper_bound_ip_address']
unseen_data.drop(unnecessary_col, axis=1, inplace=True)

In [43]:
known_data.columns

Index(['Merchant_ID', 'Merchant_Registration_Date', 'Registered_Device_ID',
       'Gender', 'Age', 'IP_Address', 'Order_ID', 'Date_of_Order',
       'Order_Value_USD', 'Order_Source', 'Order_Payment_Method', 'Fraudster',
       'country'],
      dtype='object')

In [44]:
known_data.dtypes

Merchant_ID                    int64
Merchant_Registration_Date    object
Registered_Device_ID          object
Gender                        object
Age                            int64
IP_Address                    object
Order_ID                      object
Date_of_Order                 object
Order_Value_USD                int64
Order_Source                  object
Order_Payment_Method          object
Fraudster                      int64
country                       object
dtype: object

In [45]:
#Converting to correct categorical columns - Train data set

for col in ['Gender','Fraudster', 'Order_Source', 'Order_Payment_Method', 'Merchant_ID', 'Order_ID', 'Registered_Device_ID', 'IP_Address', 'country']:
    known_data[col] = known_data[col].astype('category')

In [46]:
#Converting to correct categorical columns - Train data set

for col in ['Gender','Fraudster', 'Order_Source', 'Order_Payment_Method', 'Merchant_ID', 'Order_ID', 'Registered_Device_ID', 'IP_Address', 'country']:
    unseen_data[col] = unseen_data[col].astype('category')

In [47]:
known_data.dtypes

Merchant_ID                   category
Merchant_Registration_Date      object
Registered_Device_ID          category
Gender                        category
Age                              int64
IP_Address                    category
Order_ID                      category
Date_of_Order                   object
Order_Value_USD                  int64
Order_Source                  category
Order_Payment_Method          category
Fraudster                     category
country                       category
dtype: object

In [48]:
#Converting to correct Date columns - Train Data set
known_data[['Merchant_Registration_Date','Date_of_Order']] = known_data[['Merchant_Registration_Date','Date_of_Order']].apply(pd.to_datetime)

In [49]:
#Converting to correct Date columns - Train Data set
unseen_data[['Merchant_Registration_Date','Date_of_Order']] = unseen_data[['Merchant_Registration_Date','Date_of_Order']].apply(pd.to_datetime)

In [50]:
for col in ['Merchant_Registration_Date', 'Date_of_Order']:
    known_data[col] = known_data[col].astype('category')

In [51]:
for col in ['Merchant_Registration_Date', 'Date_of_Order']:
    unseen_data[col] = unseen_data[col].astype('category')

In [52]:
cat_attr = list(known_data.select_dtypes("category").columns)
num_attr = list(known_data.columns.difference(cat_attr))


In [53]:
cat_attr.remove('Fraudster')

In [54]:
# Numerical Pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical Pipeline 
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_attr),
        ('cat', categorical_transformer, cat_attr)])

In [40]:
#Model building libraries

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.cluster import KMeans

In [41]:
#Building Logistic regression

clf_logreg = Pipeline(steps=[('preprocessor', preprocessor), 
                           ('classifier', LogisticRegression())])

In [43]:
X, y = known_data.loc[:,known_data.columns!='Fraudster'], known_data.loc[:,'Fraudster']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123) 

In [44]:
clf_logreg.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                             

In [45]:
## Predict
train_predictions = clf_logreg.predict(X_train)
test_predictions = clf_logreg.predict(X_test)

print(clf_logreg.score(X_train, y_train))
print(clf_logreg.score(X_test, y_test))


0.9897230493557142
0.9096778160354156
TRAIN Conf Matrix : 
 [[34417     0]
 [  390  3142]]

TRAIN DATA ACCURACY 0.9897230493557142

Train data f1-score for class '0' 0.941564279292778

Train data f1-score for class '1' 0.941564279292778


--------------------------------------


TEST Conf Matrix : 
 [[14771     0]
 [ 1469    24]]

TEST DATA ACCURACY 0.9096778160354156

Test data f1-score for class '0' 0.031641397495056026

Test data f1-score for class '1' 0.031641397495056026


In [63]:
## Predict
Unseen_data_predictions = clf_logreg.predict(unseen_data)



In [67]:
unique, counts = np.unique(Unseen_data_predictions, return_counts=True)
dict(zip(unique, counts))

{0: 51047, 1: 3166}

In [68]:
unseen_pred_df = pd.DataFrame(Unseen_data_predictions, columns=['Fraudster'])
print(unseen_pred_df.head())

   Fraudster
0          0
1          0
2          0
3          0
4          0


In [47]:
## F1 Score for weighted average
from sklearn.metrics import f1_score

f1_Train = f1_score(y_true=y_train, y_pred = train_predictions, average='weighted')

f1_Test = f1_score(y_true=y_test, y_pred = test_predictions, average='weighted')

In [48]:
print(f1_Train)

0.9894517284218683


In [49]:
print(f1_Test)

0.8680849780477407


In [70]:
final_merged_data = pd.concat([test_tar_data, unseen_pred_df] )
print(final_merged_data.head())


   Fraudster  Merchant_ID
0        NaN      53637.0
1        NaN     243517.0
2        NaN     343640.0
3        NaN      69889.0
4        NaN     125706.0


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """Entry point for launching an IPython kernel.


In [72]:
unique, counts = np.unique(Unseen_data_predictions, return_counts=True)
dict(zip(unique, counts))


{0: 51047, 1: 3166}

In [73]:
pd.DataFrame(Unseen_data_predictions).to_csv("Fraud.csv")

In [74]:
clf = Pipeline(steps=[('preprocessor', preprocessor), 
                      ('GBM',GradientBoostingClassifier())])

In [82]:
X_train.dtypes

Merchant_ID                   category
Merchant_Registration_Date    category
Gender                        category
Age                              int64
IP_Address                    category
Order_ID                      category
Date_of_Order                 category
Order_Value_USD                  int64
Order_Source                  category
Order_Payment_Method          category
country                       category
dtype: object

In [87]:
# Numerical Pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical Pipeline 
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_attr),
        ('cat', categorical_transformer, cat_attr)])

In [None]:
## Predict
gbm_train_predictions = gbm_param_grid.predict(X_train)
gbm_test_predictions = gbm_param_grid.predict(X_test)

print(gbm_param_grid.score(X_train, y_train))
print(gbm_param_grid.score(X_test, y_test))


In [None]:
## F1 Score for weighted average
from sklearn.metrics import f1_score

f1_Train = f1_score(y_true=y_train, y_pred = gbm_train_predictions, average='weighted')

f1_Test = f1_score(y_true=y_test, y_pred = gbm_test_predictions, average='weighted') 

print(f1_Train)
print(f2_Train)

In [None]:
final_output_pred = gbm_grid.predict(unseen_data)

In [None]:
pd.DataFrame(final_output_pred).to_csv("Fraud_2.csv")