# EDA by Visualization

In this notebook we perform exploratory data analysis (EDA) to look for patterns in the data and to determine which column should serve as the label for training supervised models.

## Objectives
This notebook covers two tasks:

* Exploratory Data Analysis
* Determine Training Labels

In [1]:
# Importing required Libraries
import pandas as pd

In [2]:
# load the dataset
Data_path = "Fraud_transations1.pkl"
transactions = pd.read_pickle(Data_path)

In [3]:
transactions

| | TRANSACTION_ID | TX_DATETIME | CUSTOMER_ID | TERMINAL_ID | TX_AMOUNT | TX_TIME_SECONDS | TX_TIME_DAYS | TX_FRAUD | TX_FRAUD_SCENARIO |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-04-01 00:00:31 | 596 | 3156 | 57.16 | 31 | 0 | 0 | 0 |
| 1 | 1 | 2018-04-01 00:02:10 | 4961 | 3412 | 81.51 | 130 | 0 | 0 | 0 |
| 2 | 2 | 2018-04-01 00:07:56 | 2 | 1365 | 146.00 | 476 | 0 | 0 | 0 |
| 3 | 3 | 2018-04-01 00:09:29 | 4128 | 8737 | 64.49 | 569 | 0 | 0 | 0 |
| 4 | 4 | 2018-04-01 00:10:34 | 927 | 9906 | 50.99 | 634 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1754150 | 1754150 | 2018-09-30 23:56:36 | 161 | 655 | 54.24 | 15810996 | 182 | 0 | 0 |
| 1754151 | 1754151 | 2018-09-30 23:57:38 | 4342 | 6181 | 1.23 | 15811058 | 182 | 0 | 0 |
| 1754152 | 1754152 | 2018-09-30 23:58:21 | 618 | 1502 | 6.62 | 15811101 | 182 | 0 | 0 |
| 1754153 | 1754153 | 2018-09-30 23:59:52 | 4056 | 3067 | 55.40 | 15811192 | 182 | 0 | 0 |


In [4]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754155 entries, 0 to 1754154
Data columns (total 9 columns):
 #   Column             Dtype         
---  ------             -----         
 0   TRANSACTION_ID     int64         
 1   TX_DATETIME        datetime64[ns]
 2   CUSTOMER_ID        object        
 3   TERMINAL_ID        object        
 4   TX_AMOUNT          float64       
 5   TX_TIME_SECONDS    object        
 6   TX_TIME_DAYS       object        
 7   TX_FRAUD           int64         
 8   TX_FRAUD_SCENARIO  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 120.4+ MB
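
The `info()` output reports `CUSTOMER_ID`, `TERMINAL_ID`, `TX_TIME_SECONDS` and `TX_TIME_DAYS` as `object`, although the preview shows integer-like values. The following is a minimal sketch of converting such columns with `pd.to_numeric`. It runs on a small hypothetical frame (`sample`, `numeric_cols` and `converted` are illustrative names, not part of the notebook) and assumes every stored value is integer-like.

```python
import pandas as pd

# Hypothetical miniature of the transactions table; values are stored as
# strings to mimic the object dtype reported by info().
sample = pd.DataFrame({
    "CUSTOMER_ID": ["596", "4961", "2"],
    "TERMINAL_ID": ["3156", "3412", "1365"],
    "TX_TIME_SECONDS": ["31", "130", "476"],
    "TX_TIME_DAYS": ["0", "0", "0"],
})

numeric_cols = ["CUSTOMER_ID", "TERMINAL_ID", "TX_TIME_SECONDS", "TX_TIME_DAYS"]

converted = sample.copy()
for col in numeric_cols:
    # The default errors="raise" fails loudly on a non-numeric value,
    # rather than silently turning it into NaN.
    converted[col] = pd.to_numeric(converted[col])

print(converted.dtypes)
```

Numeric columns sort and plot by value rather than lexicographically, which matters for the per-day grouping later in the notebook.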


In [5]:
# Percentage of fraudulent transactions:
transactions["TX_FRAUD"].mean()*100

0.8369271814634397

In [6]:
# Total Number of fraudulent transactions:
transactions.TX_FRAUD.sum()

14681

In [7]:
transactions["TX_FRAUD"].value_counts()

0    1739474
1      14681
Name: TX_FRAUD, dtype: int64

In [8]:
# Number of transactions per fraud scenario (0 = not fraudulent)
transactions["TX_FRAUD_SCENARIO"].value_counts()

0    1739474
2       9077
3       4631
1        973
Name: TX_FRAUD_SCENARIO, dtype: int64

**A total of 14,681 transactions were marked as fraudulent, which is about 0.84% of all transactions.
The three scenario counts (973 + 9,077 + 4,631) add up to exactly 14,681, so in this dataset every fraudulent transaction carries a single scenario label, and scenario 0 marks the non-fraudulent transactions.**
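
Because `TX_FRAUD_SCENARIO` holds one value per row, a cross-tabulation against `TX_FRAUD` shows directly how the two columns relate. The sketch below uses hypothetical toy data (the frame `toy` and its values are made up for illustration) and compares the per-scenario counts with the fraud total.

```python
import pandas as pd

# Hypothetical toy data: scenario 0 means "not fraud", 1 to 3 are fraud scenarios.
toy = pd.DataFrame({
    "TX_FRAUD":          [0, 0, 1, 1, 1, 0, 1],
    "TX_FRAUD_SCENARIO": [0, 0, 1, 2, 2, 0, 3],
})

# Rows: fraud flag; columns: scenario label.
table = pd.crosstab(toy["TX_FRAUD"], toy["TX_FRAUD_SCENARIO"])
print(table)

# Count rows per fraud scenario, leaving out the non-fraud label 0.
per_scenario = toy.loc[toy["TX_FRAUD_SCENARIO"] > 0, "TX_FRAUD_SCENARIO"].value_counts()
total_fraud = int(toy["TX_FRAUD"].sum())

# True when the scenario counts account for every fraudulent row exactly once.
print(per_scenario.sum() == total_fraud)
```

Running the same two steps on `transactions` would confirm whether the scenario counts and the `TX_FRAUD` total agree.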

In [9]:
# Average transaction amount per scenario
transactions.groupby(["TX_FRAUD_SCENARIO"]).TX_AMOUNT.mean()

TX_FRAUD_SCENARIO
0     52.977907
1    235.317071
2     53.808108
3    260.915148
Name: TX_AMOUNT, dtype: float64

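##### Scenarios 1 and 3 have much higher average amounts (about 235 and 261) than non-fraudulent transactions (about 53), while scenario 2 (about 54) is nearly indistinguishable from them on average.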
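
A mean alone can hide skew, so it may help to look at several statistics per scenario at once. This is a minimal sketch using `groupby(...).agg(...)` on hypothetical toy data (the frame `toy` and its amounts are invented for illustration); the same call should work on `transactions`.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the notebook's table.
toy = pd.DataFrame({
    "TX_FRAUD_SCENARIO": [0, 0, 0, 1, 1, 2, 3],
    "TX_AMOUNT": [10.0, 20.0, 30.0, 230.0, 250.0, 40.0, 260.0],
})

# One row per scenario, one column per statistic.
summary = (
    toy.groupby("TX_FRAUD_SCENARIO")["TX_AMOUNT"]
       .agg(["count", "mean", "median", "max"])
)
print(summary)
```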

Let us check how the number of transactions, the number of fraudulent transactions, and the number of compromised cards (customers with at least one fraudulent transaction) vary from day to day.

In [10]:
def get_stats(transactions):
    # Number of transactions per day
    nb_tx_per_day = transactions.groupby(['TX_TIME_DAYS'])['CUSTOMER_ID'].count()
    # Number of fraudulent transactions per day
    nb_fraud_per_day = transactions.groupby(['TX_TIME_DAYS'])['TX_FRAUD'].sum()
    # Number of distinct customers (cards) with at least one fraudulent transaction per day
    nb_fraudcard_per_day = transactions[transactions['TX_FRAUD'] > 0].groupby(['TX_TIME_DAYS']).CUSTOMER_ID.nunique()

    return nb_tx_per_day, nb_fraud_per_day, nb_fraudcard_per_day

nb_tx_per_day, nb_fraud_per_day, nb_fraudcard_per_day = get_stats(transactions)

n_days = len(nb_tx_per_day)

# Stack the three daily series into one long "value" column.
# Note: ignore_index=True discards the TX_TIME_DAYS index of each series.
tx_stats = pd.DataFrame({"value": pd.concat([nb_tx_per_day, nb_fraud_per_day, nb_fraudcard_per_day], ignore_index=True)})

# Label each block of n_days rows with the statistic it came from
tx_stats['stat_type'] = (["nb_tx_per_day"]*n_days + ["nb_fraud_per_day"]*n_days + ["nb_fraudcard_per_day"]*n_days)
tx_stats

| | value | stat_type |
|---|---|---|
| 0 | 9488 | nb_tx_per_day |
| 1 | 9583 | nb_tx_per_day |
| 2 | 9747 | nb_tx_per_day |
| 3 | 9530 | nb_tx_per_day |
| 4 | 9651 | nb_tx_per_day |
| ... | ... | ... |
| 544 | 88 | nb_fraudcard_per_day |
| 545 | 66 | nb_fraudcard_per_day |
| 546 | 65 | nb_fraudcard_per_day |
| 547 | 72 | nb_fraudcard_per_day |
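
The long table above has no day column, because the day was only present as the index of each per-day series. As an alternative, here is a minimal sketch that builds the same kind of long table while keeping `TX_TIME_DAYS` as a regular column, using named aggregation and `melt`. The frame `toy` and the helper column `FRAUD_CUSTOMER_ID` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data: two days, three transactions each.
toy = pd.DataFrame({
    "TX_TIME_DAYS": [0, 0, 0, 1, 1, 1],
    "CUSTOMER_ID":  [1, 2, 2, 1, 3, 3],
    "TX_FRAUD":     [0, 1, 1, 1, 0, 1],
})

# Keep the customer ID only on fraudulent rows; other rows become NaN,
# which nunique() ignores by default.
toy["FRAUD_CUSTOMER_ID"] = toy["CUSTOMER_ID"].where(toy["TX_FRAUD"] > 0)

# One row per day, one column per statistic.
daily = (
    toy.groupby("TX_TIME_DAYS")
       .agg(
           nb_tx_per_day=("CUSTOMER_ID", "count"),
           nb_fraud_per_day=("TX_FRAUD", "sum"),
           nb_fraudcard_per_day=("FRAUD_CUSTOMER_ID", "nunique"),
       )
       .reset_index()
)

# Wide to long: one row per (day, statistic) pair.
tidy = daily.melt(id_vars="TX_TIME_DAYS", var_name="stat_type", value_name="value")
print(tidy)
```

With this layout a day without any fraud still gets a row (with a count of 0) instead of being absent, so the three statistics always have the same number of rows.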


In [11]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [12]:
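sns.set(style='darkgrid', font_scale=1.4)

# tx_stats was built with ignore_index=True, so it has no TX_TIME_DAYS column
# and sns.lineplot raises "Could not interpret value `TX_TIME_DAYS`".
# Rebuild the column from the index of the per-day series; the cast to int is
# needed because TX_TIME_DAYS is stored with object dtype.
tx_stats["TX_TIME_DAYS"] = pd.concat(
    [nb_tx_per_day, nb_fraud_per_day, nb_fraudcard_per_day]
).index.astype(int).to_numpy()

sns_plot = sns.lineplot(x="TX_TIME_DAYS", y="value", hue="stat_type",
                        hue_order=["nb_tx_per_day", "nb_fraud_per_day", "nb_fraudcard_per_day"],
                        data=tx_stats, legend=False)
plt.show()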

In [None]:
# import matplotlib.pyplot as plt
# transactions.hist(bins=100, figsize=(20,10), color='blue')
# plt.show()