# Non-obvious features

### Debt features

In [1]:
import pandas as pd

In [2]:
transactions = pd.read_csv('onlinefraud.csv')

transactions.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [3]:
transactions['has_debt'] = transactions['amount'] > transactions['oldbalanceOrg']
transactions['debt'] = (transactions['amount'] - transactions['oldbalanceOrg']) * transactions['has_debt']
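To make the two debt columns concrete, here is a minimal sketch on a toy frame. The values are made up; only the column names come from the dataset. It also shows that, under this definition, `debt` matches the shortfall clipped at zero, which is an alternative formulation and not what the cell above uses.

```python
import pandas as pd

# Toy rows (made-up values) with the same column names as the dataset
toy = pd.DataFrame({
    'amount':        [100.0, 250.0, 50.0],
    'oldbalanceOrg': [300.0, 200.0, 50.0],
})

# True when the transaction is larger than the origin's balance before it
toy['has_debt'] = toy['amount'] > toy['oldbalanceOrg']
# Multiplying by the boolean mask zeroes out rows without a shortfall
toy['debt'] = (toy['amount'] - toy['oldbalanceOrg']) * toy['has_debt']

# Alternative formulation (an assumption, not the notebook's code):
# clip the shortfall at zero instead of masking it
toy['debt_clipped'] = (toy['amount'] - toy['oldbalanceOrg']).clip(lower=0)

print(toy)
```

Only the second row has a shortfall here (250 against a balance of 200), so it is the only row with a non-zero `debt`.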

### Features associated with destination account

Let's group the data by destination account and calculate some statistics per group. These can give us non-obvious features.

In [4]:
from tqdm import tqdm
grouped_transactions = transactions.groupby('nameDest')
info_for_group = pd.DataFrame()
info_for_group['origins_count'] = grouped_transactions['nameOrig'].count()
columns_of_interest = [
    'amount',
    'oldbalanceOrg',
    'newbalanceOrig',
    'oldbalanceDest',
    'newbalanceDest',
    'step',
    # remove these columns to avoid linear dependency between columns
    # 'actual_amount_spent',
    # 'actual_amount_received',
    'has_debt'
]

aggregation_methods = ['min', 'max', 'std', 'median', 'mean']
for column in tqdm(columns_of_interest):
    for method in aggregation_methods:
        info_for_group[f'{method}_{column}'] = grouped_transactions[column].aggregate(method)

transactions = pd.merge(
    left=transactions,
    right=info_for_group.reset_index(),
    on='nameDest'
)

100%|██████████| 7/7 [00:24<00:00,  3.43s/it]
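The same per-destination statistics can likely be computed in a single `groupby` pass instead of one pass per column and method. The sketch below uses a toy frame with made-up values and only two of the columns; it is an alternative to the loop above, not a drop-in replacement. It also shows that `count()` counts transactions per destination, whereas `nunique()` would count distinct origin accounts.

```python
import pandas as pd

# Toy data (made-up values); column names follow the dataset
toy = pd.DataFrame({
    'nameOrig': ['A', 'A', 'B'],
    'nameDest': ['C1', 'C1', 'C2'],
    'amount':   [10.0, 30.0, 5.0],
    'step':     [1, 2, 1],
})

columns = ['amount', 'step']
methods = ['min', 'max', 'std', 'median', 'mean']

grouped = toy.groupby('nameDest')

# One pass; the result has a (column, method) MultiIndex on its columns
stats = grouped[columns].agg(methods)
# Flatten to the '<method>_<column>' naming used in the loop above
stats.columns = [f'{method}_{column}' for column, method in stats.columns]

# count() = number of transactions, nunique() = number of distinct origins
stats['origins_count'] = grouped['nameOrig'].count()
stats['unique_origins_count'] = grouped['nameOrig'].nunique()

# how='left' keeps every transaction row and its original order
merged = toy.merge(stats.reset_index(), on='nameDest', how='left')
print(merged)
```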


In [None]:
has_none = transactions.isna().any(axis=0)
has_none[has_none]

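The std columns contain NaN for destinations that have only one transaction, because the sample standard deviation (pandas uses `ddof=1` by default) is undefined for a single observation. Let's fill these values with zero.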

In [None]:
transactions.fillna(0, inplace=True)

In [None]:
has_none = transactions.isna().any(axis=0)
has_none[has_none]
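A short sketch of where the NaN values in the std columns come from, on toy data with made-up values. A group with identical values has a standard deviation of zero, while a group with a single row has no sample standard deviation at all. Passing `ddof=0` is an alternative to `fillna(0)` for this particular case; it is shown only for comparison.

```python
import pandas as pd

# C1 has two identical amounts, C2 has a single transaction
toy = pd.DataFrame({
    'nameDest': ['C1', 'C1', 'C2'],
    'amount':   [10.0, 10.0, 5.0],
})
grouped = toy.groupby('nameDest')['amount']

# Default ddof=1: divides by (n - 1), so a single-row group yields NaN
sample_std = grouped.std()
# ddof=0: divides by n, so a single-row group yields 0.0
population_std = grouped.std(ddof=0)

print(pd.DataFrame({'ddof=1': sample_std, 'ddof=0': population_std}))
```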

Now each transaction carries information about all transactions of its destination account. Therefore, all transactions with the same destination must stay in the same set when the data is split, otherwise statistics computed from test rows would leak into the training rows.

In [None]:
transactions.groupby('nameDest')['isFraud'].std().mean()

This value is not zero, so a single destination can receive both fraudulent and legitimate transactions. Even so, the whole group must stay in one set (train, validation, or test).
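One way to keep each destination in a single set is to split on the unique `nameDest` values rather than on rows. The helper below, `split_by_group`, is a hypothetical name introduced for illustration, and the toy data is made up; it is a minimal sketch, not the split used later in this project.

```python
import numpy as np
import pandas as pd

def split_by_group(frame, group_column, test_fraction=0.2, seed=0):
    """Hypothetical helper: every row of a group lands on the same side."""
    groups = frame[group_column].unique()
    # Shuffle the group labels, not the rows
    shuffled = np.random.default_rng(seed).permutation(groups)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    in_test = frame[group_column].isin(shuffled[:n_test])
    return frame[~in_test], frame[in_test]

toy = pd.DataFrame({
    'nameDest': ['C1', 'C1', 'C2', 'C3', 'C3', 'C4'],
    'isFraud':  [0, 1, 0, 0, 0, 1],
})
train, test = split_by_group(toy, 'nameDest', test_fraction=0.25)

# No destination appears on both sides of the split
shared = set(train['nameDest']) & set(test['nameDest'])
print(shared)
```

scikit-learn provides ready-made group-aware splitters such as `GroupShuffleSplit` and `GroupKFold` in `sklearn.model_selection`, which are probably a better choice when cross-validation is needed.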

### Account type

Account names start with the letter M or C, which may indicate different types of accounts.

In [None]:
transactions['nameDestFirstLetter'] = transactions['nameDest'].str[0]

In [None]:
transactions['nameDest'].str[1:].astype(int).hist()

As we can see, the numeric part of the account name looks like a random variable that doesn't carry any information.

### Money related features

In [None]:
# fraudsters are more likely to use round amounts (no cents, or even no units)
transactions['amount_has_cents'] = (transactions['amount'] % 1 != 0)
transactions['amount_has_units'] = (transactions['amount'] % 10 != 0)
transactions['amount_has_tens'] = (transactions['amount'] % 100 != 0)
transactions['amount_has_hundreds'] = (transactions['amount'] % 1000 != 0)
# fraudsters can try to transfer all available money
transactions['amount_is_equal_to_balance'] = (transactions['amount'] == transactions['oldbalanceOrg'])
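To make the flag semantics concrete: each `amount_has_*` column is true when the amount is not a multiple of 1, 10, 100, or 1000 respectively. The sketch below reproduces the flags on made-up amounts using integer cents, which avoids relying on floating-point remainders. It assumes amounts have at most two decimal places; that holds for the rows shown by `head()` but is not verified for the whole file.

```python
import pandas as pd

# Made-up amounts covering the interesting cases
toy = pd.DataFrame({'amount': [181.0, 9839.64, 2000.0, 0.29]})

# Assumption: at most two decimals, so rounding to integer cents is lossless
cents = (toy['amount'] * 100).round().astype('int64')

toy['amount_has_cents'] = cents % 100 != 0          # not a multiple of 1
toy['amount_has_units'] = cents % 1_000 != 0        # not a multiple of 10
toy['amount_has_tens'] = cents % 10_000 != 0        # not a multiple of 100
toy['amount_has_hundreds'] = cents % 100_000 != 0   # not a multiple of 1000

# The float-based expression from the cell above, for comparison
float_has_hundreds = toy['amount'] % 1000 != 0

print(toy)
```

On these four amounts both variants agree; only 2000.0 is round at every level.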

### Categorical variables encoding

In [None]:
transactions.dtypes

In [None]:
categorical_variables = ['type', 'nameDestFirstLetter']

for variable in categorical_variables:
    transactions[f'{variable}_code'] = pd.Categorical(transactions[variable]).codes
    for value in transactions[variable].unique()[:-1]:
        transactions[f'{variable}={value}'] = transactions[variable] == value
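A toy run of the encoding loop shows what it produces. The rows are made up; the category names are taken from the `head()` output. Two details are worth noting: `pd.Categorical` assigns codes in sorted category order, and `unique()` returns values in order of first appearance, so the indicator column that is left out belongs to whichever value appears last in the data.

```python
import pandas as pd

toy = pd.DataFrame({'type': ['PAYMENT', 'TRANSFER', 'CASH_OUT', 'PAYMENT']})

# Codes follow sorted category order: CASH_OUT=0, PAYMENT=1, TRANSFER=2
toy['type_code'] = pd.Categorical(toy['type']).codes

# One boolean indicator per value, skipping the last value to appear
# (here CASH_OUT) so the indicators are not linearly dependent
for value in toy['type'].unique()[:-1]:
    toy[f'type={value}'] = toy['type'] == value

print(toy)
```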

### Drop extra features

In [None]:
pd.set_option('display.max_columns', 500)
transactions.head()

In [None]:
extra_features = [
    'type',
    # Don't drop this column to be able
    # to split data into train/test/validation with respect to this column
    # 'nameDest',
    'nameOrig',
    'isFlaggedFraud'
]
transactions.drop(columns=extra_features, inplace=True)

### Save results

In [None]:
transactions.to_csv('onlinefraud_with_features.csv', index=False)