## Loading Files

### .npy file loading time:

CPU times: user 23.7 s, sys: 39.5 s, total: 1min 3s
Wall time: 2min 1s

### .csv file loading time:

CPU times: user 28 s, sys: 8.48 s, total: 36.5 s
Wall time: 38.2 s

### .h5 file loading time:

CPU times: user 359 ms, sys: 2.39 s, total: 2.75 s
Wall time: 4.23 s
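
The timings above come from the full data set. A comparison of this kind can be reproduced on a small scale roughly as follows. This is only a sketch: the synthetic table, the file names, and the `time_call` helper are assumptions made for illustration, and the HDF5 step is skipped when the optional PyTables package is not installed.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_call(func):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Hypothetical stand-in data; the real tables are far larger.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((1000, 5)), columns=list("abcde"))

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    npy_path = os.path.join(tmp, "sample.npy")
    frame.to_csv(csv_path, index=False)
    np.save(npy_path, frame.to_numpy())

    loaded_csv, timings["csv"] = time_call(lambda: pd.read_csv(csv_path))
    loaded_npy, timings["npy"] = time_call(lambda: np.load(npy_path))

    # pandas needs the optional PyTables package for HDF5 files.
    try:
        h5_path = os.path.join(tmp, "sample.h5")
        frame.to_hdf(h5_path, key="data", mode="w")
        _, timings["h5"] = time_call(lambda: pd.read_hdf(h5_path))
    except ImportError:
        pass

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

On a table this small the ranking may differ from the one measured above, so the numbers are only meaningful when the sketch is pointed at the real files.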

In [5]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from IPython.display import display
from datetime import datetime
pd.set_option('display.max_columns', 500)
# pd.set_option('display.max_rows', 1000)

## Read the CSV files and store them as .h5 files

In [12]:
# Convert the training CSV to HDF5; %time measures only the write.
df = pd.read_csv('data/atec_anti_fraud_train.csv')
%time df.to_hdf('data/0_data_train_', key='data', mode='w')
# Same for the test CSV.
df = pd.read_csv('data/atec_anti_fraud_test_a.csv')
%time df.to_hdf('data/0_data_test_a_', key='data_test', mode='w')

CPU times: user 2.16 s, sys: 1.96 s, total: 4.12 s
Wall time: 5.62 s


In [6]:
df_train_ = pd.read_hdf('data/0_data_train_')
df_test_ = pd.read_hdf('data/0_data_test_a_')

## Pre-processing

### Convert datetime

In [8]:
df_train_['date'] = pd.to_datetime(df_train_['date'].astype('str'), format = '%Y%m%d')
df_test_['date'] = pd.to_datetime(df_test_['date'].astype('str'), format = '%Y%m%d')

In [9]:
print(df_train_['date'].max())
print(df_test_['date'].min())

2017-11-05 00:00:00
2018-01-05 00:00:00
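
Once the column is a real datetime, date arithmetic works directly. As a small illustration, the sketch below parses the two boundary dates printed above from their integer form and measures the gap between the last training day and the first test day. The `sample` frame is a hypothetical stand-in for the real `date` column.

```python
import pandas as pd

# Hypothetical sample: the latest training date and the earliest test date,
# stored as integers in YYYYMMDD form like the raw `date` column.
sample = pd.DataFrame({"date": [20171105, 20180105]})

# Same conversion as in the cell above.
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# Subtracting two timestamps gives a Timedelta.
gap = sample["date"].max() - sample["date"].min()
print(gap.days)
```

For these two dates the gap is 61 days, so the test period does not start immediately after the training period ends.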


### Clear NA

The columns that contain no NA values are the same in the training and test data.

In [27]:
df_train_noNA_ = df_train_.loc[:, ~df_train_.isnull().any(axis=0)]
df_test_noNA_ = df_test_.loc[:, ~df_test_.isnull().any(axis=0)]
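
The statement that both data sets share the same NA-free columns can be checked rather than assumed. A minimal sketch on toy frames follows; the column names `f1` to `f3` are hypothetical, and it assumes that `label` is present only in the training data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; `f3` has missing values in both.
train = pd.DataFrame({"f1": [1, 2], "f2": [3, 4], "f3": [np.nan, 5], "label": [0, 1]})
test = pd.DataFrame({"f1": [6, 7], "f2": [8, 9], "f3": [1, np.nan]})

train_cols = train.columns[train.notnull().all(axis=0)]
test_cols = test.columns[test.notnull().all(axis=0)]

# Compare feature columns only (assumption: `label` exists only in train).
only_in_train = set(train_cols) - set(test_cols) - {"label"}
only_in_test = set(test_cols) - set(train_cols)
print(only_in_train, only_in_test)

# Keeping the intersection guarantees both frames expose the same features.
shared = [c for c in train_cols if c in set(test_cols)]
train_clean = train[shared + ["label"]]
test_clean = test[shared]
```

If either difference set is non-empty, filtering each frame independently would produce mismatched feature sets, and selecting the `shared` columns explicitly avoids that.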

### Clear -1 label

In [30]:
df_train_noNA_noMinusOne = df_train_noNA_[df_train_noNA_['label'] != -1]

### Sort by datetime

In [32]:
df_train_sort = df_train_noNA_noMinusOne.sort_values(by='date').reset_index(drop = True)

### Split data

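Split chronologically with a ratio of 0.7: the earliest 70% of the sorted rows form the training set, and the remaining 30% form the held-out test set.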

In [35]:
split_rate = 0.7
split_index = int(len(df_train_sort) * split_rate)

df_train = df_train_sort.iloc[:split_index, :].reset_index(drop = True)
df_test = df_train_sort.iloc[split_index:, :].reset_index(drop = True)

X_train = df_train.drop(['label'], axis=1)
y_train = df_train['label']
X_test = df_test.drop(['label'], axis=1)
y_test = df_test['label']
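
A positional split on date-sorted rows should never place a training row later than a held-out row, but the cut can fall in the middle of a calendar day. The sketch below checks both properties on toy data; the frame and its column `feature` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, already sorted by date.
frame = pd.DataFrame({
    "date": pd.to_datetime(["2017-09-05", "2017-09-06", "2017-09-06",
                            "2017-09-07", "2017-09-08"]),
    "feature": [1, 2, 3, 4, 5],
    "label": [0, 0, 1, 0, 1],
})

split_index = int(len(frame) * 0.7)  # truncates towards zero, giving 3 here
train_part = frame.iloc[:split_index]
test_part = frame.iloc[split_index:]

# No training row should be dated later than any held-out row.
assert train_part["date"].max() <= test_part["date"].min()

# True when the same calendar day appears on both sides of the cut.
boundary_shared = train_part["date"].max() == test_part["date"].min()
print(len(train_part), len(test_part), boundary_shared)
```

When `boundary_shared` is true on the real data, one option is to move the cut to the nearest date boundary so that no day is split between the two sets.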

### Store split data

In [38]:
X_train.to_hdf('data/data_X_train', key='X_sub_train', mode='w')
X_test.to_hdf('data/data_X_test', key='X_sub_test', mode='w')
y_train.to_hdf('data/data_y_train', key='y_sub_train', mode='w')
y_test.to_hdf('data/data_y_test', key='y_sub_test', mode='w')

# Also store processed original training and test data
df_train_noNA_noMinusOne.to_hdf('data/data_train_', key='train_', mode='w')
df_test_noNA_.to_hdf('data/data_test_', key='test_', mode='w')

## Some dummy tests

### Read train and test data

In [42]:
X_train = pd.read_hdf('data/data_X_train')
X_test = pd.read_hdf('data/data_X_test')
y_train = pd.read_hdf('data/data_y_train')
y_test = pd.read_hdf('data/data_y_test')

In [43]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(693004, 20) (693004,)
(297002, 20) (297002,)


### 1.1 Clean NaN

Every row contains at least one NaN.

* Interpolate?
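
Before choosing between interpolation and dropping data, it can help to measure how the missing values are distributed. A minimal sketch on a toy frame with hypothetical columns `a`, `b`, and `c`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: every row has a gap, but column `a` is complete.
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [7.0, np.nan, 9.0],
})

# Fraction of rows that contain at least one NaN.
rows_with_nan = frame.isnull().any(axis=1).mean()

# Fraction of missing values in each column.
nan_share_per_column = frame.isnull().mean()

print(rows_with_nan)
print(nan_share_per_column)
```

When the row fraction is 1.0, dropping incomplete rows would discard the whole table, so the per-column shares are the more useful guide to which columns to drop and which to fill.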

### 1.2 Imbalanced Dataset

In [115]:
print("having risk: ", np.sum(y_train == 1) / y_train.shape[0])
print("No risk: ", np.sum(y_train == 0) / y_train.shape[0])
print("No label: ", np.sum(y_train == -1) / y_train.shape[0])

having risk:  0.0117117207684
No risk:  0.983323543646
No label:  0.0049647355851
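
The same class shares can be obtained in one call with `value_counts(normalize=True)`. The sketch below uses a hypothetical label series with the same encoding as above (1 for risk, 0 for no risk, -1 for no label) and also computes the ratio of negatives to positives among labeled rows.

```python
import pandas as pd

# Hypothetical labels: 1 = risk, 0 = no risk, -1 = no label.
labels = pd.Series([0] * 95 + [1] * 3 + [-1] * 2)

# Share of each label value, indexed by the label itself.
shares = labels.value_counts(normalize=True)
print(shares)

# Ratio of negatives to positives, ignoring unlabeled rows.
labeled = labels[labels != -1]
neg_to_pos = (labeled == 0).sum() / (labeled == 1).sum()
print(neg_to_pos)
```

A large ratio means that plain accuracy says little here: a model that always predicts "no risk" would still score highly.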
