# IEEE-CIS Fraud Detection &mdash; LightGBM Split Points

This notebook shows some techniques for snooping on the gradient boosting process used by LightGBM, using its own APIs.

By counting the split points used in the decision trees, we can see the ways the algorithm divides the input space up. This may lead to new insights about what indicates fraud, and may help in smoothing or binning the data to reduce splits that model only noise.

For more info on LightGBM see the [paper by Microsoft][3] or the [LightGBM GitHub repository][4].

For another example of gradient boosting model analysis with XGBoost see the great [xgbfi][2] tool by [Faron][1].

___

We start by building a model...

 [1]: https://www.kaggle.com/mmueller
 [2]: https://github.com/Far0n/xgbfi
 [3]: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf
 [4]: https://github.com/Microsoft/LightGBM


In [194]:
%matplotlib inline
import pandas as pd
import numpy as np
import gc, os, sys, re, time
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [4]:
DTYPE = {
    'TransactionID': 'int32',
    'isFraud': 'int8',
    'TransactionDT': 'int32',
    'TransactionAmt': 'float32',
    'ProductCD': 'category',
    'card1': 'int16',
    'card2': 'float32',
    'card3': 'float32',
    'card4': 'category',
    'card5': 'float32',
    'card6': 'category',
    'addr1': 'float32',
    'addr2': 'float32',
    'dist1': 'float32',
    'dist2': 'float32',
    'P_emaildomain': 'category',
    'R_emaildomain': 'category',
}

IDX = 'TransactionID'
TGT = 'isFraud'

CCOLS = [f'C{i}' for i in range(1, 15)]
DCOLS = [f'D{i}' for i in range(1, 16)]
MCOLS = [f'M{i}' for i in range(1, 10)]
VCOLS = [f'V{i}' for i in range(1, 340)]

DTYPE.update((c, 'float32') for c in CCOLS)
DTYPE.update((c, 'float32') for c in DCOLS)
DTYPE.update((c, 'float32') for c in VCOLS)
DTYPE.update((c, 'category') for c in MCOLS)

IN_DIR = '../input'

NR = None

tran = pd.read_csv(f'{IN_DIR}/train_transaction.csv', index_col=IDX, nrows=NR, dtype=DTYPE)
tran.shape

In [201]:
# utility: encode binary 0/1 columns as bits in a single integer
def encode_bits(binary_df):
    ncols = binary_df.shape[1]
    assert ncols < 64
    return binary_df @ (1 << np.arange(ncols))
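
As a quick sanity check of what `encode_bits` produces, the sketch below applies the same idea to a tiny hand-made frame (the column names `a`, `b`, `c` are made up for illustration). Column *i* contributes `2**i`, so each distinct on/off pattern in a row maps to a distinct integer.

```python
import numpy as np
import pandas as pd

def encode_bits(binary_df):
    # same logic as the utility above: column i is weighted by 2**i
    ncols = binary_df.shape[1]
    assert ncols < 64
    return binary_df @ (1 << np.arange(ncols))

# toy data, not taken from the competition files
toy = pd.DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1], 'c': [0, 0, 1]})
codes = encode_bits(toy > 0)
print(codes.tolist())  # [1, 2, 7]
```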

Add count features...

In [5]:
to_count = tran.columns[2:].tolist()

for c in to_count:
    s = tran[c]
    if hasattr(s, 'cat'):
        s = s.cat.codes
    vc = s.value_counts(dropna=False)
    tran[f'{c}_count'] = s.map(vc).astype(np.int32)
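
The loop above is a frequency encoding: every value is replaced by the number of rows that share it, and `dropna=False` makes sure missing values get a count too. A minimal sketch on a made-up column:

```python
import numpy as np
import pandas as pd

# toy column, values invented for illustration
s = pd.Series([10.0, 20.0, 10.0, np.nan, np.nan])
vc = s.value_counts(dropna=False)      # NaN is counted as its own value
counts = s.map(vc).astype(np.int32)
print(counts.tolist())  # [2, 1, 2, 2, 2]
```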

Add some simple extra features.

In [195]:
tran['TimeInDay'] = tran.TransactionDT % 86400
tran['Cents'] = tran.TransactionAmt % 1
tran['C_bin'] = encode_bits(tran[CCOLS]>0)
tran['D_bin'] = encode_bits(tran[DCOLS].isnull())
tran['M_bin'] = encode_bits(tran[MCOLS].isnull())
tran['addr_bin'] = encode_bits(tran[['addr1','addr2','dist1','dist2']].isnull())
tran['email_bin'] = encode_bits(tran[['R_emaildomain','P_emaildomain']].isnull())

A simple time-based validation split: the first 75% of transactions (by `TransactionDT`) are training data, the rest are validation.

In [198]:
split = tran.TransactionDT.quantile(0.75)
istrain = tran.TransactionDT < split
train_df = tran.loc[istrain]
valid_df = tran.loc[~istrain]
print(train_df.shape, valid_df.shape)

In [180]:
params = {
    'num_leaves': 64,
    'objective': 'binary',
    'min_data_in_leaf': 12,
    'learning_rate': 0.01,
    'feature_fraction': 0.6,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,
    'max_cat_to_onehot': 128,
    'metric': 'auc',
    'num_threads': 8,
    'seed': 42,
}

## Note &mdash; TransactionDT

I will use all columns as features, even `TransactionDT`, which is terrible as a feature: none of the test set values overlap with the training set values. By leaving it in, though, we get to see whether there are any hotspots of `TransactionDT` that are frequently used as split points, which would point to a potential *regime change* or shift in fraud behaviour.

In [10]:
y_tr = train_df[TGT].values
y_va = valid_df[TGT].values

use = [c for c in train_df.columns if c != TGT]
train = train_df[use]
valid = valid_df[use]

dtrain = lgb.Dataset(train, y_tr, params=params)
dvalid = lgb.Dataset(valid, y_va, params=params, reference=dtrain)

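# the early_stopping_rounds / verbose_eval keyword arguments were removed in LightGBM 4.0;
# these callbacks are the equivalent (available since LightGBM 3.3)
clf = lgb.train(params,
                dtrain,
                num_boost_round=3000,
                valid_sets=(dvalid,),
                callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)])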

In [11]:
roc_auc_score(y_va, clf.predict(valid))

Save the model: LightGBM writes the trees in an easy-to-parse text format. (The file isn't used here, but saving it is useful in general.)

In [217]:
_ = clf.save_model('ieee_fraud_lgb_model.txt')

# Booster.dump_model()

The model format returned by `dump_model()` is hierarchical: trees are nested `dict` objects containing `left_child` and `right_child` subtrees. Walking over the trees and summarizing the splits can be done with a short recursive function...

    tree_info  - list of dicts
    (each contains):
        tree_structure
            left_child
            right_child

The `dump_model()` information records the gain at each split (under the `split_gain` key), and we simply re-use that.
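
To make the traversal concrete, here is a small sketch that walks a hand-written dict shaped like the outline above. The toy model and all its numbers are invented; only the key names (`tree_info`, `tree_structure`, `split_feature`, `threshold`, `split_gain`, `left_child`, `right_child`) follow the dump format used in this notebook, and the toy dict carries its own `feature_names` list for convenience. The helper `sum_gain_by_threshold` is a made-up name, and it uses an explicit stack rather than recursion, which is just one possible choice.

```python
from collections import Counter, defaultdict

# hand-written stand-in for clf.dump_model(); all values are invented
toy_model = {
    'feature_names': ['f0', 'f1'],
    'tree_info': [
        {'tree_structure': {
            'split_feature': 0, 'threshold': 3.5, 'split_gain': 10.0,
            'left_child': {'leaf_value': -0.1},
            'right_child': {
                'split_feature': 1, 'threshold': 0.5, 'split_gain': 4.0,
                'left_child': {'leaf_value': 0.0},
                'right_child': {'leaf_value': 0.2}}}},
        {'tree_structure': {
            'split_feature': 0, 'threshold': 3.5, 'split_gain': 2.5,
            'left_child': {'leaf_value': 0.05},
            'right_child': {'leaf_value': -0.05}}},
    ],
}

def sum_gain_by_threshold(model):
    """Sum split gain per (feature name, threshold) over all trees."""
    names = model['feature_names']
    stats = defaultdict(Counter)
    stack = [tree['tree_structure'] for tree in model['tree_info']]
    while stack:
        node = stack.pop()
        if 'split_feature' not in node:
            continue  # leaf node: nothing to record
        stats[names[node['split_feature']]][node['threshold']] += node['split_gain']
        stack.append(node['left_child'])
        stack.append(node['right_child'])
    return stats

stats = sum_gain_by_threshold(toy_model)
print(stats['f0'].most_common(1))  # [(3.5, 12.5)]
```

The threshold 3.5 on `f0` appears in both toy trees, so its gains (10.0 and 2.5) are added together.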

In [156]:
# uncomment to see model structure
# clf.dump_model(num_iteration=2)['tree_info']

In [28]:
# NOTE: lightgbm.Booster has a new get_split_value_histogram API which counts split points used.
# This code pre-dates that, and sums gain instead of counting appearances.
def get_split_point_stats(clf):
    split_points = defaultdict(Counter)
    names = clf.feature_name()

    def visit_node(d):
        if 'tree_info' in d:
            for tree in d['tree_info']: # a list of trees
                visit_node(tree)
        for k in ['tree_structure', 'left_child', 'right_child']:
            if k in d:
                visit_node(d[k])
        if 'split_feature' in d:
            split_points[names[d['split_feature']]][d['threshold']] += d['split_gain']

    visit_node(clf.dump_model())
    return split_points

In [14]:
split_points = get_split_point_stats(clf)

Each feature indexes a `Counter` object in the `split_points` dict. In each `Counter`, the keys are split thresholds and the values are the summed gain. For example, 3.5 is the threshold with the highest total gain for feature `C1`:

In [94]:
split_points['C1'].most_common(5)

Dump all the split point data to an xlsx file (it can be opened with the open-source *OpenOffice* or *[LibreOffice][1]*).

 [1]: https://www.libreoffice.org/download/download/

In [17]:
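# set_column() below is an XlsxWriter worksheet method, so request that engine explicitly
with pd.ExcelWriter('ieee_fraud_split_points.xlsx', engine='xlsxwriter') as writer:
    for feat in use:
        counter = split_points[feat]
        df = pd.Series(counter, name=feat).sort_index().to_frame('GainSum')
        df.to_excel(writer, sheet_name=feat, index_label=feat)

    # widen the first column so long threshold values are readable
    for sheet in writer.sheets.values():
        sheet.set_column(0, 0, 30)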

# Plotting Code

Warning: this only shows the 50 split points with the most gain for each feature, so the x-axis will be a bit nonlinear and some values won't appear. See the xlsx file for all the values.

In [20]:
MAX_SHOW = 50

In [1]:
ADJS = 'abundant:common:ubiquitous:omnipresent:rampant:rife:permeant:widespread:legendary:popular:fashionable:frequent:usual:useful:predominant:recurrent:repetitive:repetitious:marked:prevalent:prevalent:prevalent'.split(':')
COLORS = [
    'tab:blue', 'tab:orange', 'tab:green', 'tab:red', 'tab:purple',
    'tab:brown', 'tab:pink', 'tab:gray', 'tab:olive', 'tab:cyan'
]
GAINS = pd.Series(index=clf.feature_name(), data=clf.feature_importance('gain'))
COUNTS = pd.Series(index=clf.feature_name(), data=clf.feature_importance())
np.random.seed(42)

def plot_it(col):
    counts = split_points[col]
    ser = pd.Series(dict(counts)).sort_values(ascending=False)
    if hasattr(tran[col], 'cat'):
        # remap categories from int -> cat value
        try:
            ser.index = tran[col].cat.categories[ser.index.astype(int)]
        except Exception:
            # e.g. TypeError: Cannot cast Index to dtype <class 'int'>
            # a categorical with many categories and '1||4||7' etc type splits
            # leave it as it is
            pass
    adj = np.random.choice(ADJS)
    display(
        HTML(
            f'<h1 id="plot_{col}">{col}</h1>'
            f'<p>Used {COUNTS[col]} times, total gain is {GAINS[col]}.'
            f'<p>{len(ser)} split point values used. '
            f'Most {adj} is {ser.index[0]} with gain of {ser.values[0]}.'
        )
    )
    ser = ser.head(MAX_SHOW).sort_index()
    ax = ser.plot.bar(title=f'{col} — Split points by gain',
                      rot=90, fontsize=12, figsize=(15,5),
                      width=0.7, color=COLORS)
    plt.show()

# Plots For IEEE Features

All the features with 4 or more distinct split points are shown (limiting the number of plots avoids the "Too many output files (max 500)" error).

## Notes

Most of the split points have long decimal values like `379.00000000000006` - the LightGBM algorithm only sees binned data, so it sets split thresholds as values [halfway between neighbouring bin lower/upper edges][6], but bumped upwards a tiny fraction using `std::nextafter` in the [C++ standard library][5], resulting in strangely precise [floating point format][1] values :)

Zero is checked for using a [kZeroThreshold = 1e-35f][7] variable - this comes out of the model as a split point of 1.0000000180025095e-35 &mdash; a tiny number. When you see that, think *zero*.
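
Both floating point quirks can be reproduced outside LightGBM. The sketch below uses NumPy's `nextafter` as a stand-in for `std::nextafter`, and `np.float32` to mimic the single-precision `1e-35f` literal. It illustrates the arithmetic only and does not call any LightGBM code.

```python
import numpy as np

# smallest double strictly greater than 379.0
bumped = float(np.nextafter(379.0, np.inf))
print(repr(bumped))  # 379.00000000000006

# 1e-35 stored as a 32-bit float, then widened back to a double
zero_threshold = float(np.float32(1e-35))
print(zero_threshold)
```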

Split points for categorical dtypes depend on the `max_cat_to_onehot` parameter, which I have set to 128, so categoricals in this data set are treated with a one-vs-all split. This means `feature==value` in the node split test, instead of the usual `feature<=value`. By default `max_cat_to_onehot` is 4, meaning features with more categories than this use splits based on target statistics, and the resulting split points have values like `1||3||5||7||8||9`, which indicate which category codes go down the *left* branch. (But this is hard to show in bar charts... hence I used *one-vs-all splits*.)
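
A `||`-separated threshold can be turned back into readable category values with a few lines. This is a sketch under assumptions: the column and the threshold string below are invented, and it assumes the integers in the threshold are pandas category codes, which is how `plot_it` in this notebook treats them.

```python
import pandas as pd

# invented example: a categorical column and a many-vs-many split threshold
col = pd.Series(['visa', 'amex', 'discover', 'mastercard'], dtype='category')
threshold = '0||2'

# pandas sorts string categories, so the codes index into that sorted list
left_codes = [int(code) for code in threshold.split('||')]
left_values = col.cat.categories[left_codes].tolist()
print(left_values)  # ['amex', 'mastercard']
```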

LightGBM keeps a separate bin for NaN values and, at every node test, records separately whether that bin goes left or right. This is not shown here (yet! Upvote to make me attempt something!)

## What to Look For

In some ways what we **don't** see is more interesting than what we **do**. As with normal feature importances: if a feature is not used at all it is clearly redundant and should be removed. So seeing low gain (0) is very reliable, but seeing high gain *can* be misleading - it may be fitting noise.

If there is **one prominent peak** it means the feature acts a bit like a boolean, and perhaps would be better fed into the model that way (e.g. seeing a split value of 100, for feature 'foo', you could instead change column 'foo' to `foo<=100`.)  

Similarly if there are **several prominent peaks** it could imply the feature should be discretized/binned in a pre-processing step, as the less-often used split points may just be picking up on noise. See this [interesting old discussion thread on this subject of feature discretization][2].
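
Both suggestions are easy to try in pandas. In the sketch below the column `foo` and the peak values 100 and 250 are hypothetical, standing in for whatever prominent split points the plots reveal. `np.digitize(..., right=True)` is used so that bin boundaries follow the same `feature<=value` convention as the tree splits.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [5.0, 100.0, 101.0, 250.0, 300.0]})  # toy values

# one prominent peak: replace the feature with a boolean test
df['foo_le_100'] = df['foo'] <= 100

# several prominent peaks: keep only the bucket each value falls into
peaks = [100, 250]  # hypothetical split points read off a plot
df['foo_bin'] = np.digitize(df['foo'].to_numpy(), peaks, right=True)
print(df)
```

Missing values are not covered by this toy example and would need to be handled separately before binning.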

For the `TimeInDay` feature you may like to check out [my time series heatmaps notebook][4] that shows the density of transactions over time, and clearly indicates night time.

 [1]: https://en.wikipedia.org/wiki/Double-precision_floating-point_format
 [2]: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/43886
 [3]: https://www.kaggle.com/tilii7
 [4]: https://www.kaggle.com/jtrotman/ieee-fraud-time-series-heatmaps
 [5]: https://en.cppreference.com/w/cpp/numeric/math/nextafter
 [6]: https://github.com/microsoft/LightGBM/blob/master/src/io/bin.cpp
 [7]: https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/meta.h 

In [216]:
for col in use:
    counts = split_points[col]
    if len(counts) >= 4:
        plot_it(col)

# Gain Over Time

This part is more for illustration/teaching about gradient boosting.

As well as counting split points, we can look at how feature gain evolves as trees are added to the model.

In the gradient boosting learning process, each tree adds something to the training set predictions that moves the overall predictions closer to the target. It takes small steps towards lowering the loss function. Early trees are more like standard decision trees, fitting the big patterns. Later trees are more specialised, correcting small deviations, fine-grained wrinkles in the loss function: often little patterns, sometimes noise.
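
The "small steps" idea can be seen in a toy re-implementation. The sketch below is not LightGBM: it is a bare-bones boosting loop for squared error with one-split trees (stumps) on made-up 1-D data, and names such as `fit_stump` are invented here. Each round fits a stump to the current residuals and adds a shrunken copy of it to the predictions, so the training loss creeps downwards.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for thr in np.unique(x)[:-1]:          # dropping the max keeps the right side non-empty
        left = x <= thr
        lv, rv = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - lv) ** 2).sum() + ((residual[~left] - rv) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]

# synthetic data: one big step plus a small wiggle and some noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x > 5, 2.0, 0.0) + 0.3 * np.sin(3 * x) + rng.normal(0, 0.1, 200)

pred = np.full_like(y, y.mean())           # start from a constant prediction
learning_rate = 0.3
losses, thresholds = [], []
for _ in range(30):
    thr, lv, rv = fit_stump(x, y - pred)   # fit the current residuals
    pred += learning_rate * np.where(x <= thr, lv, rv)
    losses.append(float(np.mean((y - pred) ** 2)))
    thresholds.append(float(thr))

print([round(v, 4) for v in losses[:5]], '...', round(losses[-1], 4))
```

Printing `thresholds` alongside `losses` shows which split point each round picked, which is the same kind of information the split point counts above summarise for a real model.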

As features are incorporated into the model in early trees, their predictive power can run out, which is most notable for boolean features; at some point the existing predictions have accounted for all of the variance of the feature and they are no longer used in new trees.

We can see this by looking at gain statistics over time by passing the `iteration` parameter to the `feature_importance()` method.


In [190]:
def make_importances(clf, divisions):
    max_n = clf.num_trees()
    thres = np.arange(1, divisions+1) / divisions

    idx = pd.Index(clf.feature_name(), name='Feature')
    importances = pd.DataFrame(index=idx)

    for t in thres:
        n = int(max_n * t)
        c = f'count_{t*100:.0f}'
        dat = clf.feature_importance(iteration=n)
        importances[c] = pd.Series(dat, index=idx).astype(np.int32)

    for t in thres:
        n = int(max_n * t)
        c = f'gain_{t*100:.0f}'
        dat = clf.feature_importance('gain', iteration=n)
        importances[c] = pd.Series(dat, index=idx).astype(np.float32)

    return importances

In [191]:
importances = make_importances(clf, 5)

`V258` and `V258_count` reach high gain by 20% of the way through and are not used much after that. `card1` and `card2` are still being used with high gain throughout.

In [192]:
importances.sort_values('gain_100', ascending=False).head(10)

In [196]:
importances.to_csv('ieee_fraud_lgb_importances.csv')

The blue bar indicates gain at 40% of the way through the learning process, and red marks the gain at the end.

In [193]:
toplot = importances.sort_values('gain_100').tail(80)
toplot['gain_100'].plot.barh(figsize=(12,20), legend=True, color='red', title='Feature Gain at 40% and 100%')
toplot['gain_40'].plot.barh(figsize=(12,20), legend=True, color='royalblue')

Here we see `V258` and `V258_count` have *run out of steam* early, whilst other features are still gaining in importance...

In [178]:
gaincols = importances.columns[importances.columns.str.startswith('gain')].tolist()
toplot = importances.sort_values('gain_100', ascending=False).head(10).copy()
toplot['gain_0'] = 0
toplot[['gain_0'] + gaincols].T.plot(figsize=(14,6), title='Cumulative Gain')

A different plot: this shows the most used features at the 20% point of training, and how their gain evolves after that...

In [179]:
gaincols = importances.columns[importances.columns.str.startswith('gain')].tolist()
toplot = importances.sort_values('count_20', ascending=False).head(10).copy()
toplot['gain_0'] = 0
toplot[['gain_0'] + gaincols].T.plot(figsize=(14,6), title='Cumulative Gain')
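
Because the `gain_*` columns are cumulative, differencing neighbouring columns gives the gain added in each segment of training, which makes a feature that has *run out of steam* stand out as a row of near-zero values. A minimal sketch on invented numbers (the feature names and gains below are placeholders, not results from this model):

```python
import pandas as pd

# invented cumulative gains at 20%..100% of the trees
cumulative = pd.DataFrame(
    {'gain_20': [900.0, 200.0], 'gain_40': [950.0, 420.0],
     'gain_60': [960.0, 610.0], 'gain_80': [960.0, 790.0],
     'gain_100': [960.0, 1000.0]},
    index=pd.Index(['early_feature', 'steady_feature'], name='Feature'))

# gain added within each fifth of training; the first column is kept as-is
per_segment = cumulative.diff(axis=1)
per_segment['gain_20'] = cumulative['gain_20']
print(per_segment)
```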

# Conclusions

Now we can inspect trained models to see **which points** in the feature space matter for fraud detection... You can build this into your pipeline to help with reducing the resolution of the data in later modelling iterations, and aid further feature engineering.

If this kernel gets enough votes I will apply it to adversarial validation too to see **where** in the feature space the train and test set differ the most. Perhaps this could even be integrated into an *auto-relaxing* function that buckets the data for us in a way that makes the train and test sets more similar, without any tedious manual inspection of plots :)

<font color=red>Update</font>: [adversarial version here][2].

___

A note to any n00bs reading: the original features used here are only a starting point, used just to demonstrate. If (say) `DeviceInfo` of `hi6210sft Build/MRA58K` comes along in the training set and makes a fast burst of transactions (all marked fraud), then appears in the test set but spread out and on many separate days, it does not make sense to predict a high fraud likelihood, simply because of that one feature. Features that capture *event* timing & behaviour are needed :)

For inspiration you should check out [an **extensive** index of **winning** and high ranking Kaggle **solutions** here][1] (and upvote if this helps you find something useful &mdash; I guarantee there are useful links there ;)

 [1]: https://www.kaggle.com/jtrotman/high-ranking-solution-posts
 [2]: https://www.kaggle.com/jtrotman/ieee-fraud-adversarial-lgb-split-points
