<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#IEEE-CIS-Fraud-Detection" data-toc-modified-id="IEEE-CIS-Fraud-Detection-1">IEEE-CIS Fraud Detection</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1.1">Imports</a></span></li><li><span><a href="#Reading-in-and-merging-dataframes" data-toc-modified-id="Reading-in-and-merging-dataframes-1.2">Reading in and merging dataframes</a></span></li><li><span><a href="#Data-exploration-and-preprocessing" data-toc-modified-id="Data-exploration-and-preprocessing-1.3">Data exploration and preprocessing</a></span></li><li><span><a href="#Creating-simplified-data-sets-for--experimentation" data-toc-modified-id="Creating-simplified-data-sets-for--experimentation-1.4">Creating simplified data sets for  experimentation</a></span><ul class="toc-item"><li><span><a href="#Random-sampling-of-the-test-set" data-toc-modified-id="Random-sampling-of-the-test-set-1.4.1">Random sampling of the test set</a></span></li><li><span><a href="#Simplfication-of-data-set" data-toc-modified-id="Simplfication-of-data-set-1.4.2">Simplfication of data set</a></span></li></ul></li><li><span><a href="#Data-processing-steps" data-toc-modified-id="Data-processing-steps-1.5">Data processing steps</a></span><ul class="toc-item"><li><span><a href="#Assign-order-to-ordinals" data-toc-modified-id="Assign-order-to-ordinals-1.5.1">Assign order to ordinals</a></span></li></ul></li><li><span><a href="#Handling-dates" data-toc-modified-id="Handling-dates-1.6">Handling dates</a></span></li></ul></li></ul></div>

# IEEE-CIS Fraud Detection

## Imports

In [8]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [9]:
import sys
from zipfile import ZipFile
from pathlib import Path
import pandas as pd
import multiprocessing as mp
import torch
from functools import partial
pd.options.display.max_columns = None

In [10]:
path = Path('/Users/baranserajelahi/Codes/fraud-detection-pytorch-scikit-fastai/data')

In [11]:
Path.BASE_PATH = path

In [5]:
# with ZipFile('ieee-fraud-detection.zip', 'r') as zip_ref:
#     zip_ref.extractall(path/'Data')

## Reading in and merging dataframes

In [6]:
files = [path/'test_identity.csv', 
         path/'test_transaction.csv',
         path/'train_identity.csv',
         path/'train_transaction.csv']

In [None]:
%%time
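def read_data(file):
    """Read one CSV file into a DataFrame."""
    return pd.read_csv(file, low_memory=False)

# pool.map returns results in the same order as `files`,
# so the unpacking below matches the list defined above.
with mp.Pool() as pool:
    test_id, test_tr, train_id, train_tr = pool.map(read_data, files)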

In [None]:
train = pd.merge(train_tr, train_id, on='TransactionID', how='left')
test = pd.merge(test_tr, test_id, on='TransactionID', how='left')
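
A left merge should keep every transaction row and leave the identity columns as NaN where no identity record exists. The sketch below checks that on toy data; the frames `toy_tr` and `toy_id` and their values are made up for illustration, and `validate`/`indicator` are optional `pd.merge` arguments that the cells above do not use.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (made-up values).
toy_tr = pd.DataFrame({"TransactionID": [1, 2, 3],
                       "TransactionAmt": [10.0, 25.5, 7.0]})
toy_id = pd.DataFrame({"TransactionID": [2], "DeviceType": ["mobile"]})

# validate="one_to_one" raises MergeError if TransactionID repeats on either side;
# indicator=True adds a "_merge" column recording where each row came from.
toy = pd.merge(toy_tr, toy_id, on="TransactionID", how="left",
               validate="one_to_one", indicator=True)

# A left join keeps every transaction row, matched or not.
n_rows_kept = len(toy)
n_with_identity = int((toy["_merge"] == "both").sum())
print(n_rows_kept, n_with_identity)
```

The same two counts on the real frames would show how many transactions have identity information at all.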

In [None]:
train.to_csv(path/'train.csv')
test.to_csv(path/'test.csv')

In [12]:
# train = pd.read_csv(path/'train.csv', index_col=[0], low_memory=False)
# test = pd.read_csv(path/'test.csv', index_col=[0], low_memory=False)

## Data exploration and preprocessing

In [15]:
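def nans_by_col(df):
    """Print the length, NaN count, and NaN fraction of each column.

    Returns a dict mapping column name to NaN fraction (0.0 to 1.0).
    """
    p_nan = {}
    print('--Lengths--' + '--NaN Counts--' + '--Percent NaN--')
    for col in df.columns:
        length = len(df[col])
        # count() excludes NaN, so the difference is the number of missing values
        nan_count = length - df[col].count()
        percent_nan = nan_count/length  # a fraction, not multiplied by 100
        p_nan[col] = percent_nan
        print(f'{col}: {length},     {nan_count},     {percent_nan}')
    return p_nan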

In [16]:
p_nan = nans_by_col(train)

--Lengths----NaN Counts----Percent NaN--
TransactionID: 590540,     0,     0.0
isFraud: 590540,     0,     0.0
TransactionDT: 590540,     0,     0.0
TransactionAmt: 590540,     0,     0.0
ProductCD: 590540,     0,     0.0
card1: 590540,     0,     0.0
card2: 590540,     8933,     0.015126833068039422
card3: 590540,     1565,     0.0026501168422122124
card4: 590540,     1577,     0.00267043722694483
card5: 590540,     4259,     0.007212043214684865
card6: 590540,     1571,     0.0026602770345785214
addr1: 590540,     65706,     0.1112642666034477
addr2: 590540,     65706,     0.1112642666034477
dist1: 590540,     352271,     0.596523520845328
dist2: 590540,     552913,     0.9362837403054831
P_emaildomain: 590540,     94456,     0.1599485216920107
R_emaildomain: 590540,     453249,     0.7675161716395164
C1: 590540,     0,     0.0
C2: 590540,     0,     0.0
C3: 590540,     0,     0.0
C4: 590540,     0,     0.0
C5: 590540,     0,     0.0
C6: 590540,     0,     0.0
C7: 590540,     0,     

V155: 590540,     508595,     0.8612371727571375
V156: 590540,     508595,     0.8612371727571375
V157: 590540,     508595,     0.8612371727571375
V158: 590540,     508595,     0.8612371727571375
V159: 590540,     508589,     0.8612270125647712
V160: 590540,     508589,     0.8612270125647712
V161: 590540,     508595,     0.8612371727571375
V162: 590540,     508595,     0.8612371727571375
V163: 590540,     508595,     0.8612371727571375
V164: 590540,     508589,     0.8612270125647712
V165: 590540,     508589,     0.8612270125647712
V166: 590540,     508589,     0.8612270125647712
V167: 590540,     450909,     0.763553696616656
V168: 590540,     450909,     0.763553696616656
V169: 590540,     450721,     0.7632353439225116
V170: 590540,     450721,     0.7632353439225116
V171: 590540,     450721,     0.7632353439225116
V172: 590540,     450909,     0.763553696616656
V173: 590540,     450909,     0.763553696616656
V174: 590540,     450721,     0.7632353439225116
V175: 590540,     450721

V336: 590540,     508189,     0.8605496664070174
V337: 590540,     508189,     0.8605496664070174
V338: 590540,     508189,     0.8605496664070174
V339: 590540,     508189,     0.8605496664070174
id_01: 590540,     446307,     0.7557608290716971
id_02: 590540,     449668,     0.7614522301622244
id_03: 590540,     524216,     0.8876892335828225
id_04: 590540,     524216,     0.8876892335828225
id_05: 590540,     453675,     0.7682375452975243
id_06: 590540,     453675,     0.7682375452975243
id_07: 590540,     585385,     0.9912707013919464
id_08: 590540,     585385,     0.9912707013919464
id_09: 590540,     515614,     0.8731229044603245
id_10: 590540,     515614,     0.8731229044603245
id_11: 590540,     449562,     0.7612727334304196
id_12: 590540,     446307,     0.7557608290716971
id_13: 590540,     463220,     0.7844007179869272
id_14: 590540,     510496,     0.8644562603718631
id_15: 590540,     449555,     0.7612608798726589
id_16: 590540,     461200,     0.7809801198902699
id_1
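
The per-column loop can also be written as a vectorized summary, which is easier to sort and filter than printed text. This is a minimal sketch on a toy frame; `toy` and `nan_summary` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with a known number of missing values per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# isna().sum() counts missing values per column; isna().mean() gives the fraction.
nan_summary = pd.DataFrame({
    "length": len(toy),
    "nan_count": toy.isna().sum(),
    "frac_nan": toy.isna().mean(),
}).sort_values("frac_nan", ascending=False)

print(nan_summary)
```

Sorting by `frac_nan` puts the sparsest columns first, which makes groups of columns that share an identical missing-value count easier to spot.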

In [17]:
train["isFraud"].mean()

0.03499000914417313

About 3.5% of the training transactions are fraudulent, which is a class imbalance of roughly 1 fraudulent transaction to 28 legitimate ones.
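
The imbalance can be quantified directly from the fraud rate printed above. The sketch below only does the arithmetic; `neg_per_pos` is an illustrative name. A ratio like this is sometimes used as a positive-class weight in a loss function, but that is a modeling choice this notebook has not made.

```python
# Fraud rate copied from the output of train["isFraud"].mean() above.
fraud_rate = 0.03499000914417313

# Number of legitimate transactions per fraudulent one.
neg_per_pos = (1 - fraud_rate) / fraud_rate

print(f"{neg_per_pos:.1f} legitimate transactions per fraudulent one")
```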

## Creating simplified data sets for  experimentation

### Random sampling of the test set

Now I will take a 10% random sample of the training set to work with during experimentation.

In [18]:
# randomly sample 10% of the rows without replacement
train_s = train.sample(frac=0.1, axis=0)

In [19]:
assert train_s.columns.shape[0]==train.columns.shape[0] 

In [20]:
train_s.to_csv(path/'train_s.csv')
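
With only about 3.5% positives, a plain random sample can drift from the original fraud rate, and without a seed it is not reproducible. One alternative, sketched here on toy data, is to sample within each class and fix `random_state`. This is an assumption about what might help experimentation, not what the cell above does; `toy` and `toy_s` are made-up names.

```python
import pandas as pd

# Toy data: 10 fraudulent rows and 90 legitimate rows (10% fraud rate).
toy = pd.DataFrame({"isFraud": [1] * 10 + [0] * 90,
                    "TransactionAmt": range(100)})

# Sample 10% of the rows inside each class so the class proportions carry over;
# random_state makes the draw repeatable.
toy_s = toy.groupby("isFraud", group_keys=False).sample(frac=0.1, random_state=42)

print(len(toy_s), toy_s["isFraud"].mean())
```

`DataFrameGroupBy.sample` requires pandas 1.1 or later.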

### Simplfication of data set

In [21]:
# train_s = pd.read_csv(path/'train_s.csv', index_col=[0], low_memory=False)
# test = pd.read_csv(path/'test.csv', index_col=[0], low_memory=False)

In [23]:
train_s.shape, test.shape

((59054, 434), (506691, 433))

In [15]:
train_s.columns

Index(['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt',
       'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5',
       ...
       'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38',
       'DeviceType', 'DeviceInfo'],
      dtype='object', length=434)

To simplify the dataset I will drop every column that records a Vxxx feature. There are 339 such features, many of which are highly correlated.

In [16]:
# anchor the pattern so only V1..V339 match, not any column that merely contains a 'V'
train_ss = train_s.drop(list(train_s.filter(regex = r'^V\d+$')), axis = 1)

In [17]:
assert len(train_ss.columns)==len(train_s.columns) - 339
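
Dropping every Vxxx column is the simplest option. A less drastic alternative would be to drop only the columns that are nearly duplicates of an earlier one. The sketch below shows that idea on synthetic data; the 0.95 threshold, the toy columns, and the name `redundant` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "V1": base,
    "V2": base * 2.0 + rng.normal(scale=0.01, size=200),  # near-duplicate of V1
    "V3": rng.normal(size=200),                            # unrelated to V1
    "DeviceInfo": ["x"] * 200,                             # not a V column
})

# Select only the V columns, then take absolute pairwise correlations.
v_cols = toy.filter(regex=r"^V\d+$").columns
corr = toy[v_cols].corr().abs()

# Keep the strict upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag a column if it is highly correlated with any earlier column.
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(redundant)
```

On the full 339 columns the correlation matrix is small enough to compute, but the high NaN fractions seen earlier mean each pairwise correlation uses only the rows where both columns are present.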

In [18]:
train_ss.columns

Index(['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt',
       'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6',
       'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain',
       'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
       'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8',
       'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4',
       'M5', 'M6', 'M7', 'M8', 'M9', 'id_01', 'id_02', 'id_03', 'id_04',
       'id_05', 'id_06', 'id_07', 'id_08', 'id_09', 'id_10', 'id_11', 'id_12',
       'id_13', 'id_14', 'id_15', 'id_16', 'id_17', 'id_18', 'id_19', 'id_20',
       'id_21', 'id_22', 'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28',
       'id_29', 'id_30', 'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36',
       'id_37', 'id_38', 'DeviceType', 'DeviceInfo'],
      dtype='object')

In [19]:
train_ss.to_csv(path/'train_ss.csv')

## Data processing steps

### Assign order to ordinals

There do not appear to be any identifiable ordinal variables among the categorical variables in the dataset.

## Handling dates

The only datetime information is in the TransactionDT column, which holds a time delta rather than a timestamp. It can be turned into a datetime by choosing a starting point, assuming the units are seconds as some participants on the discussion board indicate ([example](https://www.kaggle.com/nroman/eda-for-cis-fraud-detection)). I will train my first models without doing that.

Next, I need to add datepart features. I think some of them depend on the correct start date being used.
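
A minimal sketch of the conversion, assuming TransactionDT is in seconds. The start date below is a placeholder chosen for illustration, not a value taken from the data, and `TransactionDate`, `dayofweek`, and `hour` are hypothetical column names.

```python
import pandas as pd

# ASSUMPTION: the true reference date is unknown; this value is a placeholder.
START_DATE = pd.Timestamp("2017-12-01")

# Toy deltas in seconds: 1 day, 1 day + 1 hour, 2 days.
toy = pd.DataFrame({"TransactionDT": [86400, 90000, 172800]})

# Interpret the delta as seconds elapsed since the assumed start date.
toy["TransactionDate"] = START_DATE + pd.to_timedelta(toy["TransactionDT"], unit="s")

# Simple date-part features derived from the reconstructed timestamp.
toy["dayofweek"] = toy["TransactionDate"].dt.dayofweek
toy["hour"] = toy["TransactionDate"].dt.hour

print(toy)
```

If the assumed start falls at midnight, hour of day does not depend on which calendar date is chosen, whereas day of week, day of month, and month do. That is one reason the choice of start date matters for some date parts more than others.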