<a href="https://colab.research.google.com/github/MANISH-KUMAR-CODES/Analysis-on-Fraud-Detection-In-Ecommerce-company/blob/main/Predict_if_Merchant_is_Fraudster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Problem statement**

‘XYZ’ is a large e-commerce company with operations in several countries. As the online giant has grown, so has
the number of fraudulent merchants. They deliver counterfeits or, in some cases, nothing at all. Such
schemes leave customers duped, and place both legitimate merchants and the company itself in a constant
battle to rid the marketplace of scammers. Identifying fraudulent merchants is also important in budgeting for fraud
investigation. It's a well-known problem both to the company and to merchants, who say the issue hasn't
been effectively addressed. They are serious about it and want to protect themselves from these fraudulent
merchants using technology.

You are expected to create an analytical and modelling framework to predict Merchant Fraudulency (yes/no)
based on the quantitative and qualitative features provided in the dataset, while also answering the other
questions listed below.

#  **About Dataset**

Target attribute: "fraudster" (yes – 1, no – 0)

Train:

   * trainmerchantdata.csv : Merchant Information

   * trainorderdata.csv : Order Information

   * train.csv : Target Label Information

Test:

   * testmerchantdata.csv : Merchant Information

   * testorderdata.csv : Order Information

   * test.csv : Target is not available as it is to be predicted

➔ ipboundariescountries.csv : IP address boundaries for each country
(common to both train and test)

# **Tasks**

### *Model Building:*

You are expected to create an analytical and modelling framework to predict Merchant Fraudulency based
on the quantitative and qualitative features provided in the datasets. You may derive new features from the
existing features and from domain knowledge, which may help improve model performance.
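
As one illustration of deriving a feature, the sketch below computes the time elapsed between a merchant's registration and an order. The column names `Merchant_Registration_Date` and `Date_of_Order` come from the dataset; the tiny inline frame and the new column name `Hours_Since_Registration` are assumptions made for demonstration only, and whether this feature actually helps the model has to be checked on the real data.

```python
import pandas as pd

# Toy rows mimicking the merged merchant/order data (two sample rows only).
df = pd.DataFrame({
    "Merchant_ID": [50448, 338754],
    "Merchant_Registration_Date": ["2018-05-01 21:15:11", "2018-04-14 10:13:00"],
    "Date_of_Order": ["2018-07-30 10:59:13", "2018-06-15 11:19:47"],
})

# Parse the timestamp strings so that they can be subtracted.
for col in ["Merchant_Registration_Date", "Date_of_Order"]:
    df[col] = pd.to_datetime(df[col])

# Hypothetical derived feature: hours between registration and the order.
delta = df["Date_of_Order"] - df["Merchant_Registration_Date"]
df["Hours_Since_Registration"] = delta.dt.total_seconds() / 3600.0

print(df[["Merchant_ID", "Hours_Since_Registration"]])
```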

### *Visualisation task:*

Exploratory Data Analysis using visualizations in R Notebook or Jupyter notebook format. (Use all of the train
data for this task.)

*  List down the insights/patterns observed from the visualizations
*  Explain the impact of the most important attributes on the target attribute, as observed from the
visualizations.

### *Observations:*

Is there any overfitting or underfitting problem? If yes, how do you address it?
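
One common way to approach this question is to compare the score on the training split with the score on a held-out validation split: a large gap suggests overfitting, while low scores on both suggest underfitting. The sketch below is only an illustration on synthetic data from `make_classification`, with a decision tree whose `max_depth` is varied; the data, the split sizes, and the depths are all assumptions and not part of the task.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data (roughly 10% positives).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123)

results = {}
for depth in (1, 5, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=123)
    model.fit(X_tr, y_tr)
    train_f1 = f1_score(y_tr, model.predict(X_tr), pos_label=1, zero_division=0)
    val_f1 = f1_score(y_val, model.predict(X_val), pos_label=1, zero_division=0)
    results[depth] = (train_f1, val_f1)
    print(f"max_depth={depth}: train F1={train_f1:.3f}, validation F1={val_f1:.3f}")
```

If the unrestricted tree scores much higher on the training split than on the validation split, limiting depth or using cross-validated tuning are typical remedies.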

### *Evaluation Metric:*

* Use the ‘F1-score’ of the fraudulent class as the evaluation metric for the classification task, both to
tune the model and for submissions in the tool.
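
For reference, the F1-score of the fraudulent class is the harmonic mean of precision and recall computed with class 1 as the positive label. A minimal sketch with scikit-learn follows; the label vectors are made up purely to show the call.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 = fraudster, 0 = legitimate merchant.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# pos_label=1 scores the fraudulent class only.
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
fraud_f1 = f1_score(y_true, y_pred, pos_label=1)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={fraud_f1:.2f}")
```

The same metric can be passed to `GridSearchCV` through `scoring="f1"`, which also treats label 1 as the positive class for binary targets.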

### *Hints*

Both Python and R provide functions to convert an IP string to numeric format, which makes numeric
comparison easier.
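
In Python, one option is the standard-library `ipaddress` module. The sketch below converts a dotted IPv4 string to an integer and then looks it up in a sorted list of ranges. The ranges and the names `Country_A` and `Country_B` are hypothetical placeholders; the actual layout of ipboundariescountries.csv is not shown here, so the lookup would need to be adapted to its real columns.

```python
import ipaddress
from bisect import bisect_right

def ip_to_int(ip: str) -> int:
    """Convert a dotted IPv4 string such as '48.151.136.76' to an integer."""
    return int(ipaddress.IPv4Address(ip))

# Hypothetical boundaries (lower, upper, country), sorted by lower bound.
boundaries = [
    (ip_to_int("48.0.0.0"), ip_to_int("48.255.255.255"), "Country_A"),
    (ip_to_int("94.0.0.0"), ip_to_int("94.255.255.255"), "Country_B"),
]
lowers = [low for low, _, _ in boundaries]

def lookup_country(ip: str):
    """Return the country whose range contains ip, or None if no range matches."""
    value = ip_to_int(ip)
    idx = bisect_right(lowers, value) - 1  # last range starting at or below value
    if idx >= 0 and value <= boundaries[idx][1]:
        return boundaries[idx][2]
    return None

print(ip_to_int("48.151.136.76"))
print(lookup_country("48.151.136.76"))
```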


In [35]:
# Importing necessary libraries
import numpy as np
import pandas as pd

# 1. Preprocessing Libraries
from sklearn import preprocessing
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder    #Dummification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold    # Hyperparameter tuning and cross-validation
from sklearn.compose import ColumnTransformer


# 2. Algorithm Import
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# 3. Evaluation Library
from sklearn.metrics import confusion_matrix

# 4. Viz Lib
import matplotlib.pyplot as plt 
import seaborn as sns

# 5. Misc Lib
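# !pip install imbalanced-learn
from imblearn.over_sampling import SMOTE    #Data/Class imbalance
import random
random.seed(123)
np.random.seed(123)    # also seed NumPy, which scikit-learn falls back to when random_state is not set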
import warnings
warnings.filterwarnings('ignore')

In [36]:
#Train Data Reading
merchant_train_data = pd.read_csv("/content/drive/MyDrive/merchant data/train_merchant_data-1561627820784.csv", sep = ",", header = 0)
order_train_data = pd.read_csv("/content/drive/MyDrive/merchant data/train_order_data-1561627847149.csv", sep = ",", header = 0)
train_data = pd.read_csv("/content/drive/MyDrive/merchant data/train-1561627878332.csv", sep = ",", header = 0)
print(merchant_train_data.shape)
print(order_train_data.shape)
print(train_data.shape)


(54213, 7)
(54213, 7)
(54213, 2)


In [37]:
#Test Data Reading   
merchant_test_data = pd.read_csv("/content/drive/MyDrive/merchant data/test_merchant_data-1561627903902.csv", sep = ",", header = 0)
order_test_data = pd.read_csv('/content/drive/MyDrive/merchant data/test_order_data-1561627931868.csv', sep = ",", header = 0)
test_data = pd.read_csv("/content/drive/MyDrive/merchant data/test-1561627952093.csv", sep = ",", header = 0)
print(merchant_test_data.shape)
print(order_test_data.shape)
print(test_data.shape)

(13554, 7)
(13554, 7)
(13554, 1)


# Understanding our data

In [38]:
# Quick look at our training data
a = pd.DataFrame(merchant_train_data)
a.head()

Unnamed: 0,Ecommerce_Provider_ID,Merchant_ID,Merchant_Registration_Date,Registered_Device_ID,Gender,Age,IP_Address
0,1746213,50448,2018-05-01 21:15:11,VATQMMZTVOZUT,F,39,48.151.136.76
1,1746213,338754,2018-04-14 10:13:00,LJCILLBRQZNKS,M,35,94.9.145.169
2,1746213,291127,2018-06-20 07:44:22,JFVHSUGKDAYZV,F,40,58.94.157.121
3,1746213,319919,2018-06-27 01:41:39,WFRXMPLQYXRMY,M,37,193.187.41.186
4,1746213,195911,2018-01-05 00:55:41,GGHKWMSWHCMID,F,27,125.96.20.172


In [39]:
b= pd.DataFrame(order_train_data)
b.head()

Unnamed: 0,Customer_ID,Order_ID,Date_of_Order,Order_Value_USD,Order_Source,Order_Payment_Method,Merchant_ID
0,126221,37cea9512f8d,2018-04-29 16:39:26,148,Direct,Credit Card,124231
1,115471,09f12e6efde2,2018-06-16 17:05:40,145,SEO,Credit Card,136178
2,151786,4e69e956e159,2018-10-26 18:00:46,62,Ads,Internet Banking,198611
3,140456,663443aaeb82,2018-12-12 05:41:52,28,SEO,Debit Card,127993
4,114721,99258810c121,2018-09-20 11:06:10,70,Ads,Credit Card,250146


In [40]:
c = pd.DataFrame(train_data)
c.head()

Unnamed: 0,Merchant_ID,Fraudster
0,221592,0
1,316935,1
2,38454,1
3,214437,1
4,296240,1


In [41]:
# Let's merge all three dataframes on the common column (Merchant_ID) to get a complete dataframe
a = a.merge(b,on = 'Merchant_ID', how='outer')
train_df = a.merge(c,on = 'Merchant_ID', how ='outer')
train_df.head()


Unnamed: 0,Ecommerce_Provider_ID,Merchant_ID,Merchant_Registration_Date,Registered_Device_ID,Gender,Age,IP_Address,Customer_ID,Order_ID,Date_of_Order,Order_Value_USD,Order_Source,Order_Payment_Method,Fraudster
0,1746213,50448,2018-05-01 21:15:11,VATQMMZTVOZUT,F,39,48.151.136.76,129697,3b8983a83c7b,2018-07-30 10:59:13,90,SEO,Debit Card,0
1,1746213,338754,2018-04-14 10:13:00,LJCILLBRQZNKS,M,35,94.9.145.169,117390,34b5eb921228,2018-06-15 11:19:47,98,SEO,Internet Banking,0
2,1746213,291127,2018-06-20 07:44:22,JFVHSUGKDAYZV,F,40,58.94.157.121,120162,41a1c86ff08b,2018-08-13 10:06:26,95,SEO,Credit Card,0
3,1746213,319919,2018-06-27 01:41:39,WFRXMPLQYXRMY,M,37,193.187.41.186,128228,e8c3ad80d916,2018-07-22 15:46:51,100,Direct,E-wallet,0
4,1746213,195911,2018-01-05 00:55:41,GGHKWMSWHCMID,F,27,125.96.20.172,136029,e71ab1f26785,2018-04-16 08:02:44,78,SEO,E-wallet,0
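
An outer merge silently keeps unmatched rows and duplicates rows when a key repeats, so it can be worth checking the join. The sketch below uses the `validate` and `indicator` parameters of `DataFrame.merge` on toy frames; it assumes each `Merchant_ID` appears exactly once per table, which the identical row counts above are consistent with but do not prove.

```python
import pandas as pd

# Toy stand-ins for the merchant, order, and label tables.
merchants = pd.DataFrame({"Merchant_ID": [50448, 338754, 291127], "Age": [39, 35, 40]})
orders = pd.DataFrame({"Merchant_ID": [50448, 338754, 291127], "Order_Value_USD": [90, 98, 95]})
labels = pd.DataFrame({"Merchant_ID": [50448, 338754, 291127], "Fraudster": [0, 0, 0]})

# validate raises MergeError if a key repeats; indicator adds a "_merge" column.
merged = merchants.merge(orders, on="Merchant_ID", how="outer",
                         validate="one_to_one", indicator=True)

# Rows present in only one of the two tables, if any.
unmatched = merged[merged["_merge"] != "both"]
print(f"unmatched rows: {len(unmatched)}")

merged = merged.drop(columns="_merge").merge(
    labels, on="Merchant_ID", how="outer", validate="one_to_one")
print(merged.shape)
```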


In [42]:
# Quick look at our test data
a1 = pd.DataFrame(merchant_test_data)
a1.head()

Unnamed: 0,Ecommerce_Provider_ID,Merchant_ID,Merchant_Registration_Date,Registered_Device_ID,Gender,Age,IP_Address
0,1746213,53637,2018-02-11 20:50:29,PTMLBENYMQCTV,F,40,134.162.124.62
1,1746213,243517,2018-04-06 13:19:39,HMCLDZUZPWZRR,M,39,152.76.98.87
2,1746213,343640,2018-01-12 16:47:49,VLGSMAPXISSEJ,M,23,31.202.3.255
3,1746213,69889,2018-02-19 21:58:52,ZINHISBBOKQXT,M,34,12.242.168.185
4,1746213,125706,2018-05-17 15:50:19,WFKEAOTPHTYEO,M,20,26.61.210.47


In [43]:
b1 = pd.DataFrame(order_test_data)
b1.head()

Unnamed: 0,Customer_ID,Order_ID,Date_of_Order,Order_Value_USD,Order_Source,Order_Payment_Method,Merchant_ID
0,157068,a032de091f51,2018-03-09 09:09:42,60,SEO,Credit Card,53637
1,112534,0d563f0606d6,2018-06-13 20:11:19,28,SEO,Credit Card,243517
2,148774,18fb0fa888b6,2018-04-14 04:19:36,75,Ads,Debit Card,343640
3,114528,ed1eb920d721,2018-04-01 13:55:44,98,SEO,Credit Card,69889
4,120940,313dcf962627,2018-05-25 02:08:18,35,SEO,Credit Card,125706


In [44]:
c1 = pd.DataFrame(test_data)
c1.head()

Unnamed: 0,Merchant_ID
0,53637
1,243517
2,343640
3,69889
4,125706


In [45]:
# Let's merge all three dataframes on the common column (Merchant_ID) to get a complete dataframe
a1 = a1.merge(b1,on = 'Merchant_ID', how='outer')
test_df = a1.merge(c1,on = 'Merchant_ID', how ='outer')
test_df.head()

Unnamed: 0,Ecommerce_Provider_ID,Merchant_ID,Merchant_Registration_Date,Registered_Device_ID,Gender,Age,IP_Address,Customer_ID,Order_ID,Date_of_Order,Order_Value_USD,Order_Source,Order_Payment_Method
0,1746213,53637,2018-02-11 20:50:29,PTMLBENYMQCTV,F,40,134.162.124.62,157068,a032de091f51,2018-03-09 09:09:42,60,SEO,Credit Card
1,1746213,243517,2018-04-06 13:19:39,HMCLDZUZPWZRR,M,39,152.76.98.87,112534,0d563f0606d6,2018-06-13 20:11:19,28,SEO,Credit Card
2,1746213,343640,2018-01-12 16:47:49,VLGSMAPXISSEJ,M,23,31.202.3.255,148774,18fb0fa888b6,2018-04-14 04:19:36,75,Ads,Debit Card
3,1746213,69889,2018-02-19 21:58:52,ZINHISBBOKQXT,M,34,12.242.168.185,114528,ed1eb920d721,2018-04-01 13:55:44,98,SEO,Credit Card
4,1746213,125706,2018-05-17 15:50:19,WFKEAOTPHTYEO,M,20,26.61.210.47,120940,313dcf962627,2018-05-25 02:08:18,35,SEO,Credit Card
