# Introduction

* **Why Should We Care About Payment Fraud?**

Payment card fraud is a serious and long-term threat to society, with an economic impact forecast to be $416bn in 2017.

Besides financial losses, criminal enterprises and Organised
Crime Groups (OCGs) have been found to use payment card fraud to fund their activities,
including arms, drugs and terrorism. The activities of these criminals include violence and murder, so individual acts of fraud have a human cost.

Fraud is increasing dramatically with the progression of modern technology and global communication. As a result, fighting
fraud has become an important issue to explore. As the following figure shows, detection and
prevention mechanisms are the main tools used to combat fraud.

In [None]:
from IPython.display import Image
Image("../input/protection-systems-against-fraud/_20190826145415.png")

* **What is Fraud Detection?**

Fraud detection tries to discover and
identify fraudulent activities as they enter the system and report
them to a system administrator.

* **Payment card transaction process**

Multiple participants are involved when a cashless
transaction takes place (see the picture below).


In [None]:
from IPython.display import Image
Image("../input/protection-systems-against-fraud/payment.png")

* **Fraud detection issues and challenges**

The following figure shows the distribution of fraud detection system (FDS) articles
by the issues and challenges they address.

In [None]:
from IPython.display import Image
Image("../input/protection-systems-against-fraud/_20190826153639.png")

1. Concept Drift

Fraud detection is nonstationary because fraud vectors change over
time, so the effectiveness of a fixed Fraud Management System (FMS) decreases over time.

The competition itself also experiences the concept drift problem, as this [discussion](https://www.kaggle.com/c/ieee-fraud-detection/discussion/99993#latest-596023) points out.
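One rough way to look for this kind of drift is to track the fraud rate over consecutive time windows. The sketch below runs on a tiny synthetic frame, not the competition data. The column names `TransactionDT` and `isFraud` match the competition files, but the helper `fraud_rate_by_window`, the one-day window, and the assumption that `TransactionDT` is measured in seconds are mine.

```python
import pandas as pd

def fraud_rate_by_window(df, time_col="TransactionDT", target_col="isFraud",
                         window_seconds=86400):
    """Return the fraud rate for each consecutive time window.

    Hypothetical helper: assumes time_col holds seconds from some reference point.
    """
    # Integer-divide the timestamp to get a window index (0, 1, 2, ...)
    window = (df[time_col] // window_seconds).rename("window")
    return df.groupby(window)[target_col].mean()

# Tiny synthetic example: the fraud rate rises in the second window
toy = pd.DataFrame({
    "TransactionDT": [0, 1000, 50000, 80000, 90000, 100000, 150000, 170000],
    "isFraud":       [0, 0,    0,     1,     1,     0,      1,      1],
})
rates = fraud_rate_by_window(toy)
print(rates)
```

A fraud rate that keeps moving from window to window is only a hint of drift; changes in the feature distributions would need a separate check.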

2. Unbalanced Data

There is a large class imbalance: the Ratio of Genuine to
Fraud (RGF) transactions in real-world transactional datasets
is high. There are considerably fewer fraud transactions than
genuine transactions, which makes classifying them nontrivial. In industry, human
reviewers tend to mistrust, and may ignore, alerts and information from
the FMS if it generates too many false alarms.

I have written a blog post about how to deal with imbalanced data [here](https://medium.com/@haataa/fighting-imbalance-data-set-with-code-examples-f2a3880700a6).
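To make the imbalance concrete, here is a minimal sketch that computes the genuine-to-fraud ratio and a set of inverse-frequency class weights on toy labels. The helper names are hypothetical, and the weighting formula `n_samples / (n_classes * class_count)` is one common heuristic, not something prescribed by the references.

```python
import numpy as np

def genuine_to_fraud_ratio(y):
    """Ratio of genuine (0) to fraud (1) labels."""
    y = np.asarray(y)
    return (y == 0).sum() / (y == 1).sum()

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels: 95 genuine and 5 fraud transactions
y_toy = [0] * 95 + [1] * 5
rgf = genuine_to_fraud_ratio(y_toy)
weights = balanced_class_weights(y_toy)
print(rgf, weights)
```

Weights like these can be passed to many classifiers so that errors on the rare fraud class cost more than errors on the genuine class.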

3. Real Time Detection

The loss due to fraud is incurred at the moment of the transaction
for issuers and merchants. Therefore, to be effective, fraud needs to be
detected in real-time. A real-time FMS is illustrated in the figure below.

In [None]:
from IPython.display import Image
Image("../input/protection-systems-against-fraud/realtime.png")

* **Credit card fraud detection**

Credit card fraud detection mostly relies on pattern recognition: user spending behavior is analyzed automatically.
Customer spending behavior contains information about **the
transaction amount, time gap since last purchase, day of the week,
item category, customer address, etc.** Anomaly-based fraud detection is the most common approach in credit card fraud detection systems:
the cardholder's profile is built by analyzing the cardholder's spending behavior pattern, and any incoming
transaction that is inconsistent with that profile is
considered suspicious.
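As a minimal sketch of the profile idea, the code below builds a per-card profile from past transaction amounts and flags an incoming amount that lies far from the card's mean. Everything here is an assumption for illustration: the column names `card_id` and `amount`, the helpers `build_profiles` and `is_suspicious`, and the z-score threshold of 3. A real system would use many more behavioral features than the amount alone.

```python
import pandas as pd

def build_profiles(history, card_col="card_id", amt_col="amount"):
    """Per-card mean and standard deviation of past transaction amounts."""
    return history.groupby(card_col)[amt_col].agg(["mean", "std"])

def is_suspicious(profiles, card_id, amount, z_threshold=3.0):
    """Flag an amount whose z-score against the card profile is too large."""
    profile = profiles.loc[card_id]  # raises KeyError for an unseen card
    z = abs(amount - profile["mean"]) / profile["std"]
    return bool(z > z_threshold)

# Synthetic history for two hypothetical cards
history = pd.DataFrame({
    "card_id": ["A"] * 8 + ["B"] * 5,
    "amount": [20, 22, 19, 21, 20, 23, 18, 21, 100, 110, 90, 105, 95],
})
profiles = build_profiles(history)

print(is_suspicious(profiles, "A", 500.0))  # far above card A's usual spend
print(is_suspicious(profiles, "A", 22.0))   # in line with card A's history
```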

* **References**

[Fraud detection system: A survey](https://son.ir/wp-content/uploads/2018/10/02cfc86711083c79d23674505833131e2.pdf)

[How Artificial Intelligence and machine learning research impacts payment card fraud detection: A survey and industry benchmark](https://www.sciencedirect.com/science/article/abs/pii/S0952197618301520#fn1)

[Fraud The Facts 2014](http://www.theukcardsassociation.org.uk/wm_documents/Fraud%20The%20Facts%202014.pdf)


# Import Libraries

In [None]:
import gc
import os
import numpy as np
import pandas as pd
import subprocess
import seaborn as sns
import matplotlib.pyplot as plt

# Check File Size 

In [None]:
def check_fsize(dpath,s=30):
    """check file size
    Args:
    dpath: string. Directory path
    s: int. Width the file name is padded to
    
    Returns:
    None
    """
    for f in os.listdir(dpath):
        # File size in MB, built with os.path.join instead of string concatenation
        size_mb = os.path.getsize(os.path.join(dpath, f)) / 1000000
        print(f.ljust(s) + str(round(size_mb, 2)) + 'MB')

In [None]:
check_fsize('../input/ieee-fraud-detection')

In [None]:
def check_fline(fpath):
    """check total number of lines of file for large files
    
    Args:
    fpath: string. file path
    
    Returns:
    None
    
    """
    lines = subprocess.run(['wc', '-l', fpath], stdout=subprocess.PIPE).stdout.decode('utf-8')
    print(lines, end='', flush=True)

In [None]:
fs = ['../input/ieee-fraud-detection/train_transaction.csv',
      '../input/ieee-fraud-detection/train_identity.csv',
      '../input/ieee-fraud-detection/test_transaction.csv',
      '../input/ieee-fraud-detection/test_identity.csv']
# Print the line count of each file
for fpath in fs:
    check_fline(fpath)

# Load Data

In [None]:
# Load training data
df_train_transac = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv')
df_train_identity = pd.read_csv('../input/ieee-fraud-detection/train_identity.csv')

In [None]:
# Show data shape
print(df_train_transac.shape)
print(df_train_identity.shape)

# Show data head
print(df_train_transac.head(2))
print(df_train_identity.head(2))

# Check Feature Type

In [None]:
def count_feature_type(df):
    return df.dtypes.value_counts()

In [None]:
print(count_feature_type(df_train_transac))
print(count_feature_type(df_train_identity))

# Explore Transaction Data
First, take a closer look at the transaction data, starting with the target distribution and the missing values.

## Check Target Distribution

In [None]:
def check_cunique(df,cols):
    """check unique values for each column
    df: data frame. 
    cols: list. The columns of data frame to be counted
    """
    df_nunique = df[cols].nunique().to_frame()
    df_nunique = df_nunique.reset_index().rename(columns={'index': 'feat',0:'nunique'})
    return df_nunique

In [None]:
def feat_value_count(df,colname):
    """value count of each feature
    
    Args
    df: data frame.
    colname: string. Name of the column to be counted
    
    Returns
    df_count: data frame with columns '<colname>_values' and 'counts'.
    """
    # Name the index and the counts explicitly so the output columns do not
    # depend on the pandas version (value_counts naming changed in pandas 2.0)
    df_count = df[colname].value_counts().rename_axis(colname+'_values').reset_index(name='counts')
    return df_count

In [None]:
feat_value_count(df_train_transac,'isFraud')

Obviously this is an imbalanced classification problem. Only about 3.5% of the transactions have target 1 (fraud).
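The fraud share can be read directly from the labels. A minimal sketch on toy labels (not the competition data), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy labels: 7 fraud cases out of 200 transactions
labels = pd.Series([0] * 193 + [1] * 7, name="isFraud")

# normalize=True returns shares instead of raw counts
share = labels.value_counts(normalize=True) * 100
fraud_pct = share.loc[1]
print(round(fraud_pct, 2))
```

On the real data the same two lines would be applied to `df_train_transac['isFraud']`.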

## Check Missing Value 

In [None]:
def check_missing(df,cols=None,axis=0):
    """check missing values per column or per row
    Args
    df: data frame.
    cols: list. List of column names; None means all columns
    axis: int. 0 means per column and 1 means per row
    
    Returns
    missing_info: data frame. Missing count and percentage, sorted by percentage
    """
    if cols is not None:
        df = df[cols]
    is_missing = df.isnull()
    missing_info = is_missing.sum(axis=axis).to_frame().rename(columns={0:'missing_num'})
    missing_info['missing_percent'] = is_missing.mean(axis=axis)*100
    return missing_info.sort_values(by='missing_percent',ascending = False)

In [None]:
df_colmissing = check_missing(df_train_transac,cols=None,axis=0)
df_colmissing.head()

Some columns have a very high missing rate.
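If such columns are later candidates for dropping, they can be listed with a simple threshold. This is only a sketch: the helper `high_missing_columns` and the threshold values are assumptions, and whether to drop a column at all depends on the model.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold=0.9):
    """Return the names of columns whose missing share exceeds threshold."""
    missing_share = df.isnull().mean()
    return missing_share[missing_share > threshold].index.tolist()

# Toy frame: 'b' is 75% missing and 'c' is entirely missing
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})
cols = high_missing_columns(toy, threshold=0.7)
print(cols)
```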

In [None]:
sns.distplot(df_colmissing.missing_percent, kde=False, rug=True)

In [None]:
df_rowmissing = check_missing(df_train_transac,cols=None,axis=1)
df_rowmissing.head()

In [None]:
sns.distplot(df_rowmissing.missing_percent, kde=False, rug=True)

## Check feature unique value

In [None]:
df_nunique = check_cunique(df_train_transac,df_train_transac.select_dtypes(include=['object']).columns)
df_nunique

## Check ProductCD

In [None]:
def compare_cate(df,colname,targetname):
    """check target value difference of given category
    in the case of binary classifications.
    
    Args
    df: data frame.
    colname: string. The column name to be evaluated.
    targetname: string. The column name of the target variable.
    
    Returns
    None
    """
    # calculate aggregate stats
    df_cate = df.groupby([colname])[targetname].agg(['count', 'sum','mean'])
    df_cate.reset_index(inplace=True)
    print(df_cate)
    
    # plot visuals
    f, ax = plt.subplots(figsize=(15, 6))
    ax.tick_params(axis='x',labelrotation=90)
    plt1 = sns.lineplot(x=colname, y="mean", data=df_cate,color="r")

    for tl in ax.get_yticklabels():
        tl.set_color('r')

    ax2 = ax.twinx()
    plt2 = sns.barplot(x=colname, y="count", data=df_cate,
                       ax=ax2,alpha=0.5)
    

In [None]:
compare_cate(df_train_transac,'ProductCD','isFraud')

ProductCD: product code, the product for each transaction.

There are five product categories, and type 'W' is the majority.

However, type 'C' has the highest fraud rate.

In [None]:
compare_cate(df_train_transac,'ProductCD','TransactionAmt')

In [None]:
df_product_aveAmt = df_train_transac.groupby(['ProductCD'])['TransactionAmt'].agg(['mean'])
df_product_aveAmt.reset_index(inplace=True)
df_pdc_Amtratio = pd.merge(df_train_transac[['TransactionID','ProductCD',
                                             'TransactionAmt','isFraud']],
                           df_product_aveAmt,on='ProductCD',how='left')
df_pdc_Amtratio.head()

In [None]:
df_pdc_Amtratio['Amt_ratio'] = df_pdc_Amtratio['TransactionAmt']/df_pdc_Amtratio['mean']

In [None]:
compare_cate(df_pdc_Amtratio,'isFraud','Amt_ratio')

In [None]:
plt.ylim(0, 50)
sns.scatterplot(x='ProductCD',y='Amt_ratio',data=df_pdc_Amtratio,
                hue='isFraud',alpha=0.5)

It looks like fraud transactions have a higher TransactionAmt than the average for their product category. This is more obvious for product 'C'.

## Check card4

In [None]:
compare_cate(df_train_transac,'card4','isFraud')

OK, card4 is the card brand. I have to say that I had never heard of the Discover card before.

The fraud rate of Discover cards is higher than that of the other three brands.

## Check card6

In [None]:
compare_cate(df_train_transac,'card6','isFraud')

Now some background information on the card types in card6.

**Credit cards** allow you to purchase items up to your credit limit. You can repay them within the month to avoid interest charges (if there is a grace period), or you can make smaller payments over a longer period of time which will result in interest charges.

**Charge cards** are similar to credit cards in that they allow you to pay for purchases up to your credit limit. Some charge cards do not have a predetermined credit limit and will approve larger purchases on a case-by-case basis. Charge cards require the balance to be paid back in a short period of time, usually within a month.

**Debit cards** are tied to a bank account from which funds are withdrawn for each purchase. Therefore, you will get a debit card from your financial institution where you have a personal or business checking or savings account.

## Check card1~5

In [None]:
check_cunique(df_train_transac,['card1','card2','card3','card5'])

According to the competition host, card1~card6 are all categorical features. It's a bit odd that these features have so many unique values.

However, it seems reasonable to assume that transactions with the same card1, card2, card3 and card5 values come from the same card. So let's make a card_id variable.

In [None]:
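def make_card_id(df):
    """Build a card identifier by joining card1, card2, card3 and card5."""
    cards_cols = ['card1', 'card2', 'card3', 'card5']
    # Start from the first column, then append the others separated by spaces
    card_id = df[cards_cols[0]].map(str)
    for card in cards_cols[1:]:
        card_id = card_id + ' ' + df[card].map(str)
    return card_id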

In [None]:
df_train_transac['card_id'] = make_card_id(df_train_transac)

In [None]:
df_train_transac['card_TAmt_ratio'] = df_train_transac['TransactionAmt']/df_train_transac.groupby('card_id')['TransactionAmt'].transform('mean')

In [None]:
compare_cate(df_train_transac,'isFraud','card_TAmt_ratio')

It looks like fraud transactions spend more than the card's average.

## Check P_emaildomain

In [None]:
feat_value_count(df_train_transac,'P_emaildomain').head()

P_emaildomain and R_emaildomain are the purchaser and recipient email domains. Interesting: I think the final part of the domain, like .com or .jp, may provide additional information.

But first, let's ignore the .com and .jp suffixes.

In [None]:
df_train_transac['P_emaildomain_clean'] = df_train_transac['P_emaildomain'].str.split('.',expand=True)[0]
compare_cate(df_train_transac,'P_emaildomain_clean','isFraud')

It looks like protonmail has a very high fraud rate.
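To follow up on the idea that the final part of the domain may carry information, here is a minimal sketch that keeps both pieces instead of discarding the suffix. The helper `split_email_domain` and the output column names `provider` and `suffix` are hypothetical, and the domains are toy values.

```python
import pandas as pd

def split_email_domain(domains):
    """Split 'yahoo.co.jp' into provider 'yahoo' and suffix 'co.jp'.

    Domains without a dot get a missing suffix; missing domains stay missing.
    """
    # Group 1: everything before the first dot; group 2 (optional): the rest
    parts = domains.str.extract(r"^([^.]+)(?:\.(.+))?$")
    parts.columns = ["provider", "suffix"]
    return parts

toy_domains = pd.Series(["gmail.com", "yahoo.co.jp", "gmail", None, "protonmail.com"])
email_parts = split_email_domain(toy_domains)
print(email_parts)
```

The `suffix` column could then be compared against `isFraud` in the same way as the cleaned domain.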

## Check R_emaildomain

In [None]:
feat_value_count(df_train_transac,'R_emaildomain').head()

I have to say that some email domains look pretty strange.

In [None]:
df_train_transac['R_emaildomain_clean'] = df_train_transac['R_emaildomain'].str.split('.',expand=True)[0]
compare_cate(df_train_transac,'R_emaildomain_clean','isFraud')

## Check M1

In [None]:
compare_cate(df_train_transac,'M1','isFraud')

M1-M9: match, such as names on card and address. I think this is quite straightforward: a false match ('F') should lead to a higher probability of fraud. However, in the case of M1, M5 and M7, 'F' results in a lower fraud rate. And I have no idea why M4 is different from the others.
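One thing worth keeping in mind when reading these rates: a pandas `groupby` drops rows whose key is missing by default, so transactions with no match flag are not part of the comparison. A minimal sketch that treats missing as its own category, on toy data with a hypothetical helper name:

```python
import pandas as pd

def fraud_rate_including_missing(df, colname, targetname="isFraud"):
    """Count and fraud rate per category, with NaN shown as 'missing'."""
    # Replace NaN by an explicit label so groupby keeps those rows
    filled = df[colname].fillna("missing")
    return df.groupby(filled)[targetname].agg(["count", "mean"])

toy = pd.DataFrame({
    "M1": ["T", "T", "F", None, None, "T"],
    "isFraud": [0, 0, 1, 1, 0, 0],
})
stats = fraud_rate_including_missing(toy, "M1")
print(stats)
```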

## Check addr1

In [None]:
feat_value_count(df_train_transac,'addr1').head()

Notice that addr1 is also a categorical feature. It has 332 unique values.

## Check addr2

In [None]:
feat_value_count(df_train_transac,'addr2').head()

In [None]:
compare_cate(df_train_transac,'addr2','isFraud')

Compared to addr1, which has more than 300 unique values, addr2 has only 74. Perhaps addr2 is a country code and addr1 a province? Also notice that 87 has far more counts than the other values. More interesting is that some places have a 100% fraud rate.
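A 100% fraud rate can simply come from a category with only one or two transactions. A minimal sketch that keeps only categories with enough observations before ranking them. The helper name and the `min_count` values are assumptions, and the data is a toy frame.

```python
import pandas as pd

def fraud_rate_with_min_count(df, colname, targetname="isFraud", min_count=30):
    """Fraud rate per category, restricted to categories with enough rows."""
    stats = df.groupby(colname)[targetname].agg(["count", "mean"])
    # Drop rare categories whose rate is based on too few transactions
    stats = stats[stats["count"] >= min_count]
    return stats.sort_values("mean", ascending=False)

toy = pd.DataFrame({
    "addr2": [87] * 8 + [60] * 4 + [10] * 1,
    "isFraud": [0, 0, 0, 0, 0, 0, 0, 1] + [0, 1, 0, 1] + [1],
})
reliable = fraud_rate_with_min_count(toy, "addr2", min_count=4)
print(reliable)
```

Here the value 10 has a 100% fraud rate but only one transaction, so it is filtered out.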

## Check TransactionAmt

In [None]:
def check_distribution(df,colname):
    """check general feature distribution info and plot histogram
    
    Args
    df: data frame.
    colname: string. The column name to be evaluated.
    
    Returns
    None
    """
    print(df[colname].describe())
    print('Total missing value number: ',df[colname].isnull().sum())
    plt.figure(figsize=(12,5))
    sns.distplot(df[colname].dropna())

In [None]:
def compare_distribution(df,colname,targetname,targetdict):
    """compare the distribution of a column across the two target classes
    in the case of binary classifications.
    
    Args
    df: data frame.
    colname: string. The column name to be evaluated.
    targetname: string. The column name of the target variable.
    targetdict: dict. Value and name of each class.
    
    Returns
    None
    """
    plt.figure(figsize=(12,5))
    keys = list(targetdict.keys())
    plt1 = sns.distplot(df[df[targetname] == keys[0]][colname].dropna(), label=targetdict[keys[0]])
    plt1 = sns.distplot(df[df[targetname] == keys[1]][colname].dropna(), label=targetdict[keys[1]])
    plt1.legend()
    plt1.set_title("%s Distribution by Target"%colname, fontsize=20)
    plt1.set_xlabel(colname, fontsize=18)
    plt1.set_ylabel("Density", fontsize=18)

In [None]:
check_distribution(df_train_transac,'TransactionAmt')

In [None]:
dict= {1:'Fraud',0:'NoFraud'}
compare_distribution(df_train_transac,'TransactionAmt','isFraud',dict)

In [None]:
df_train_transac.groupby(['isFraud'])['TransactionAmt'].agg('describe')

## Check Dist

In [None]:
check_distribution(df_train_transac,'dist1')

In [None]:
compare_distribution(df_train_transac,'dist1','isFraud',dict)

In [None]:
df_train_transac.groupby(['isFraud'])['dist1'].agg('describe')

## Check C1~C14

In [None]:
for i in range(1,15):
    col = 'C%s'%i
    print(df_train_transac.groupby(['isFraud'])[col].agg('describe'))
    compare_distribution(df_train_transac,col,'isFraud',dict)

It looks like there are outlier values.
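Two common ways to tame such outliers before plotting or modelling are clipping at an upper quantile and applying a log transform. The sketch below shows both on a toy series. The helper `clip_upper_quantile` and the quantile level are assumptions, not something the competition data prescribes.

```python
import numpy as np
import pandas as pd

def clip_upper_quantile(series, q=0.99):
    """Cap values above the q-th quantile at that quantile."""
    upper = series.quantile(q)
    return series.clip(upper=upper)

# Toy count feature with one extreme value
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 1000], name="C1")

clipped = clip_upper_quantile(toy, q=0.9)
logged = np.log1p(toy)  # log(1 + x), which is safe for zero counts

print(clipped.max(), logged.max())
```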

## Check D1~D15

In [None]:
for i in range(1,16):
    col = 'D%s'%i
    print(df_train_transac.groupby(['isFraud'])[col].agg('describe'))
    compare_distribution(df_train_transac,col,'isFraud',dict)

# Explore Identity Data

In [None]:
df_train_identity.head()

## Check Missing Value 

In [None]:
df_colmissing = check_missing(df_train_identity,cols=None,axis=0)
print(df_colmissing.head())
df_colmissing.describe()

In [None]:
sns.distplot(df_colmissing.missing_percent, kde=False, rug=True)

In [None]:
df_rowmissing = check_missing(df_train_identity,cols=None,axis=1)
print(df_rowmissing.head())
df_rowmissing.describe()

In [None]:
sns.distplot(df_rowmissing.missing_percent, kde=False, rug=True)

## Check feature unique value

In [None]:
df_nunique = check_cunique(df_train_identity,df_train_identity.select_dtypes(include=['object']).columns)
df_nunique

## Merge Data

In [None]:
df_all = pd.merge(df_train_transac,df_train_identity,on='TransactionID',how='left')

In [None]:
df_all.head()

## Check ID 

In [None]:
for i in range(12,39):
    col = 'id_%s'%(i)
    compare_cate(df_all,col,'isFraud')

id_30 and id_31 need further investigation.

## Clean and Check ID_30

In [None]:
df_all['id_30_system'] = df_all['id_30'].str.split(' ',expand=True)[0]
compare_cate(df_all,'id_30_system','isFraud')

So 'other system' is really suspicious.

## Clean and Check ID_31

In [None]:
# remove version numbers (regex=True is required on recent pandas versions)
df_all['id_31_clean'] = df_all['id_31'].str.replace(r"[0-9.]", "", regex=True)
# collapse browser variants into one family name, in the same order as before;
# .loc avoids chained assignment and na=False skips missing values
browser_families = [('chrome', 'chrome'), ('Samsung', 'Samsung'), ('samsung', 'Samsung'),
                    ('firefox', 'firefox'), ('safari', 'safari'), ('opera', 'opera')]
for pattern, family in browser_families:
    mask = df_all['id_31_clean'].str.contains(pattern, regex=False, na=False)
    df_all.loc[mask, 'id_31_clean'] = family
df_all['id_31_clean'] = df_all['id_31_clean'].str.replace(" ", "", regex=False)

In [None]:
compare_cate(df_all,'id_31_clean','isFraud')

## Check Device Type

In [None]:
compare_cate(df_all,'DeviceType','isFraud')

## Check Device Info

In [None]:
compare_cate(df_all,'DeviceInfo','isFraud')