Sources:

https://blog.featurelabs.com/predicting-credit-card-fraud/

https://blog.featurelabs.com/deep-feature-synthesis/

# Load the data

In [1]:
import pandas as pd
import numpy as np
import featuretools as ft

In [2]:
train = pd.read_csv('train.csv', sep='|')
test = pd.read_csv('test.csv', sep='|')

In [3]:
train.head(1)

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud
0,5,1054,54.7,7,0,3,0.027514,0.051898,0.241379,0


In [4]:
train['scannedLineItemsTotal'] = train['scannedLineItemsPerSecond'] * train['totalScanTimeInSeconds']
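
The multiplication above inverts the rate, assuming `scannedLineItemsPerSecond` was originally computed as the item count divided by the scan time. A minimal check using the first row shown by `train.head(1)`; the rounding step is an optional addition and not part of the notebook:

```python
# Values copied from the first training row shown above
total_scan_time_in_seconds = 1054
scanned_line_items_per_second = 0.027514  # displayed value, already rounded

# rate * time recovers the (approximate) number of scanned items
scanned_line_items_total = scanned_line_items_per_second * total_scan_time_in_seconds

# The product is only approximately integral because the displayed rate is rounded
rounded_total = round(scanned_line_items_total)
print(scanned_line_items_total, rounded_total)
```

The rounded value matches the `29.0` shown for this row in the next cell's output.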

In [5]:
train_new = train.copy()  # work on a copy so the drops below leave train untouched
# 7 main features:
#     trustLevel
#     totalScanTimeInSeconds
#     grandTotal
#     lineItemVoids
#     scansWithoutRegistration
#     quantityModifications
#     scannedLineItemsTotal

In [6]:
train_new.drop(columns=['fraud','valuePerSecond','scannedLineItemsPerSecond','lineItemVoidsPerPosition'], inplace=True)
train_new.head()

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsTotal
0,5,1054,54.7,7,0,3,29.0
1,3,108,27.36,5,2,4,14.0
2,3,1516,62.16,3,10,5,13.0
3,6,1791,92.31,8,4,4,29.0
4,5,430,81.53,3,7,2,27.0


# Prepare the Data

featuretools represents data in a standard structure called an entity set, which it also uses to build features. An EntitySet stores entities (database tables), variables (columns in those tables), the relationships between entities, and the data itself.

In [7]:
es = ft.EntitySet("train")

In [8]:
es.entity_from_dataframe(entity_id="train",
                         dataframe=train_new,
                         index="id",
                         make_index=True,  # "id" is not a column of train_new, so create it
                         time_index='totalScanTimeInSeconds')



Entityset: train
  Entities:
    train [Rows: 1879, Columns: 8]
  Relationships:
    No relationships

In [9]:
es['train'].df.head()

Unnamed: 0,id,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsTotal
173,173,6,2,75.74,6,1,1,8.0
1715,1715,1,2,34.07,5,9,4,1.0
1423,1423,6,3,7.68,2,3,0,20.0
1835,1835,3,3,64.94,3,4,3,14.0
1154,1154,5,4,53.65,11,5,4,6.0


# Feature Primitives

A feature primitive is an operation applied to a table or a set of tables to create a feature. Feature primitives fall into two categories:

Aggregation: a function that groups the child rows of each parent and calculates a statistic such as the mean, min, max, or standard deviation across them.

Transformation: an operation applied to one or more columns of a single table.
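
The difference between the two categories can be sketched in plain pandas. The tables `sessions` and `items` below are hypothetical and exist only for illustration; the generated column names merely imitate the style of featuretools and are not produced by it:

```python
import pandas as pd

# Hypothetical parent and child tables, for illustration only
sessions = pd.DataFrame({"session_id": [1, 2]})
items = pd.DataFrame({
    "session_id": [1, 1, 2],
    "price": [2.0, 4.0, 10.0],
    "quantity": [1, 2, 3],
})

# Aggregation: summarise the child rows of each parent (comparable to "mean")
mean_price = items.groupby("session_id")["price"].mean().rename("MEAN(items.price)")
sessions = sessions.join(mean_price, on="session_id")

# Transformation: combine columns within a single table (comparable to "multiply_numeric")
items["price * quantity"] = items["price"] * items["quantity"]

print(sessions)
print(items)
```

The entity set in this notebook holds a single table, so there are no parent and child rows to aggregate over, and only transformation primitives are used later on.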

The feature primitives available in featuretools are listed below.

In [10]:
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100

primitives[primitives['type'] == 'aggregation'].head(10)

Unnamed: 0,name,type,description
0,percent_true,aggregation,Determines the percent of `True` values.
1,mean,aggregation,Computes the average for a list of values.
2,last,aggregation,Determines the last value in a list.
3,num_true,aggregation,Counts the number of `True` values.
4,sum,aggregation,"Calculates the total addition, ignoring `NaN`."
5,std,aggregation,"Computes the dispersion relative to the mean value, ignoring `NaN`."
6,num_unique,aggregation,"Determines the number of distinct values, ignoring `NaN` values."
7,skew,aggregation,Computes the extent to which a distribution differs from a normal distribution.
8,max,aggregation,"Calculates the highest value, ignoring `NaN` values."
9,time_since_last,aggregation,Calculates the time elapsed since the last datetime (in seconds).


In [11]:
primitives[primitives['type'] == 'transform']

Unnamed: 0,name,type,description
20,not_equal,transform,Determines if values in one list are not equal to another list.
21,week,transform,Determines the week of the year from a datetime.
22,cum_sum,transform,Calculates the cumulative sum.
23,divide_numeric,transform,Element-wise division of two lists.
24,longitude,transform,Returns the second tuple value in a list of LatLong tuples.
25,and,transform,Element-wise logical AND of two lists.
26,equal,transform,Determines if values in one list are equal to another list.
27,year,transform,Determines the year value of a datetime.
28,second,transform,Determines the seconds value of a datetime.
29,greater_than,transform,Determines if values in one list are greater than another list.


# Deep Feature Synthesis

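Deep Feature Synthesis (DFS) stacks feature primitives to form features whose "depth" equals the number of primitives stacked. With max_depth = 1, as in the cell below, each new feature applies a single primitive to the original columns.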

In [12]:
# use multiply, add, divide, subtract
default_trans_primitives = ["multiply_numeric", "add_numeric", "divide_numeric", "subtract_numeric"]

# DFS with specified primitives
feature_matrix, features = ft.dfs(entityset=es, target_entity='train',
                                  trans_primitives=default_trans_primitives,
                                  where_primitives=[], seed_features=[],
                                  n_jobs=1, verbose=1,
                                  max_depth=1, features_only=False)  # set max_depth = 1

Built 112 features
Elapsed: 00:02 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks
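
One way to account for the 112 features is to count them by hand: 7 original columns, 21 unordered pairs each for multiply, add, and subtract, and 42 ordered pairs for divide. The sketch below rebuilds such a matrix in plain pandas from two rows copied from the `train_new.head()` output. It assumes that subtraction was generated once per unordered pair, which is consistent with the reported total but is not confirmed by the output above; the column names are illustrative and do not reproduce featuretools' ordering.

```python
from itertools import combinations, permutations

import pandas as pd

# Two rows copied from the train_new.head() output above
base = pd.DataFrame({
    "trustLevel": [5, 3],
    "totalScanTimeInSeconds": [1054, 108],
    "grandTotal": [54.7, 27.36],
    "lineItemVoids": [7, 5],
    "scansWithoutRegistration": [0, 2],
    "quantityModifications": [3, 4],
    "scannedLineItemsTotal": [29.0, 14.0],
})

cols = list(base.columns)
new_columns = {}

# One feature per unordered pair (assumption: subtract is treated like add and multiply)
for a, b in combinations(cols, 2):
    new_columns[f"{a} * {b}"] = base[a] * base[b]
    new_columns[f"{a} + {b}"] = base[a] + base[b]
    new_columns[f"{a} - {b}"] = base[a] - base[b]

# Division is not symmetric, so every ordered pair gets its own feature
for a, b in permutations(cols, 2):
    new_columns[f"{a} / {b}"] = base[a] / base[b]

features_by_hand = pd.concat([base, pd.DataFrame(new_columns)], axis=1)
print(features_by_hand.shape[1])  # 7 + 3 * 21 + 42
```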


In [36]:
features

[<Feature: trustLevel>,
 <Feature: totalScanTimeInSeconds>,
 <Feature: grandTotal>,
 <Feature: lineItemVoids>,
 <Feature: scansWithoutRegistration>,
 <Feature: quantityModifications>,
 <Feature: scannedLineItemsTotal>,
 <Feature: quantityModifications * scannedLineItemsTotal>,
 <Feature: lineItemVoids * totalScanTimeInSeconds>,
 <Feature: scannedLineItemsTotal * trustLevel>,
 <Feature: grandTotal * quantityModifications>,
 <Feature: lineItemVoids * scannedLineItemsTotal>,
 <Feature: quantityModifications * totalScanTimeInSeconds>,
 <Feature: scansWithoutRegistration * totalScanTimeInSeconds>,
 <Feature: scansWithoutRegistration * trustLevel>,
 <Feature: quantityModifications * trustLevel>,
 <Feature: lineItemVoids * scansWithoutRegistration>,
 <Feature: lineItemVoids * quantityModifications>,
 <Feature: grandTotal * scannedLineItemsTotal>,
 <Feature: scannedLineItemsTotal * totalScanTimeInSeconds>,
 <Feature: grandTotal * scansWithoutRegistration>,
 <Feature: lineItemVoids * trustLevel

In [14]:
# one-hot encode any categorical features (the dtypes below show every column is already numeric)
fm_encoded, features_encoded = ft.encode_features(feature_matrix, features)
fm_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1879 entries, 0 to 1878
Columns: 112 entries, trustLevel to scansWithoutRegistration - totalScanTimeInSeconds
dtypes: float64(77), int64(35)
memory usage: 1.6 MB
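
Because `divide_numeric` is among the primitives, the matrix may contain `inf` wherever a denominator is 0; `scansWithoutRegistration`, for instance, is 0 in the first training row. A minimal pandas sketch of counting and masking such values before computing statistics, run on a hypothetical two-row frame `fm` rather than on the real `fm_encoded`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a slice of the feature matrix
fm = pd.DataFrame({
    "lineItemVoids": [7.0, 5.0],
    "scansWithoutRegistration": [0.0, 2.0],
})
fm["lineItemVoids / scansWithoutRegistration"] = (
    fm["lineItemVoids"] / fm["scansWithoutRegistration"]
)

# Count infinite entries produced by division by zero
n_inf = int(np.isinf(fm.to_numpy()).sum())

# Replace them with NaN so that later statistics can skip them
fm_clean = fm.replace([np.inf, -np.inf], np.nan)
print(n_inf)
print(fm_clean)
```

Whether to mask, clip, or drop such columns is a modelling choice that the notebook does not address.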


# Correlation Check

In [15]:
# reload the original file to recover the fraud label that was dropped from train_new
df = pd.read_csv('train.csv', sep='|')
fm_encoded['fraud'] = df['fraud']

In [36]:
# absolute correlation with fraud, sorted in descending order
corr_abs = fm_encoded.corr().abs()
corr_abs[['fraud']].sort_values(by='fraud', ascending=False)

Unnamed: 0,fraud
fraud,1.000000
scannedLineItemsTotal / trustLevel,0.660235
totalScanTimeInSeconds / trustLevel,0.437771
scansWithoutRegistration / trustLevel,0.381772
scannedLineItemsTotal - trustLevel,0.354860
lineItemVoids / trustLevel,0.352699
trustLevel,0.319765
grandTotal / trustLevel,0.309455
scannedLineItemsTotal * totalScanTimeInSeconds,0.309371
scannedLineItemsTotal + scansWithoutRegistration,0.308496
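
The ranking step can be tried in isolation on a toy frame. The data and the column names `strong_signal`, `inverse_signal`, and `noise` below are made up for illustration; the sketch selects the `fraud` column of the correlation matrix as a Series, which also makes it easy to drop the trivial self-correlation:

```python
import pandas as pd

# Made-up data: two informative columns and one uninformative column
toy = pd.DataFrame({
    "strong_signal": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],
    "inverse_signal": [5, 6, 5, 1, 2, 1],
    "noise": [1, 2, 3, 1, 2, 3],
    "fraud": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every column with the label, ranked by magnitude
ranking = (
    toy.corr()["fraud"]
    .drop("fraud")  # remove the label's correlation with itself
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Taking the absolute value ranks negatively correlated features such as `inverse_signal` alongside positively correlated ones, at the cost of hiding the sign.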
