<a href="https://colab.research.google.com/github/CptK1ng/dmc2019/blob/alexander_dev/LightGBM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LightGBM

In this notebook I will test the LightGBM classifier on our data.

## Installation

1. `pip install setuptools wheel numpy scipy scikit-learn -U`
2. One of the following:
   - CPU version: `pip install lightgbm`
   - GPU version: `pip install lightgbm --install-option=--gpu`

More details are in the [LightGBM Python package README](https://github.com/Microsoft/LightGBM/tree/master/python-package).

In [0]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn import metrics

## Data Import & Preprocessing
Download our custom train and validation splits:

In [0]:
!wget -nc -q --show-progress https://www.dropbox.com/s/6m8iq9ogpzmu7vx/train_new.csv?dl=1 -O train_new.csv
!wget -nc -q --show-progress https://www.dropbox.com/s/tjpkc45oqn3uv8s/val_new.csv?dl=1 -O val_new.csv

Import data:

In [17]:

df_train_original = pd.read_csv("train_new.csv", sep="|")
df_val_original = pd.read_csv("val_new.csv", sep="|")
df_train_original.head(2)

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud
0,4,828,66.56,7,4,3,0.007246,0.080386,1.166667,0
1,1,1612,31.34,2,4,3,0.008685,0.019442,0.142857,0


Feature Engineering:

In [18]:
def prepareData(df):
  df = df.copy()
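  # number of scanned products; the int cast truncates, so floating-point error can undercount by one
  df['totalLineItems'] = (df['scannedLineItemsPerSecond'] * df['totalScanTimeInSeconds']).astype(int) # np.int was removed in NumPy 1.24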
  df['trustLevel'] = df.trustLevel.astype('category') # needed for automatic detection of categorical features later
  df['fraud'] = df.fraud.astype('category') # needed for automatic detection of categorical features later

  return df

df_train = prepareData(df_train_original)
df_val = prepareData(df_val_original)

df_train.head()

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud,totalLineItems
0,4,828,66.56,7,4,3,0.007246,0.080386,1.166667,0,5
1,1,1612,31.34,2,4,3,0.008685,0.019442,0.142857,0,13
2,3,848,52.37,2,4,0,0.022406,0.061757,0.105263,0,19
3,1,321,76.03,8,7,2,0.071651,0.236854,0.347826,0,22
4,1,660,6.06,3,7,1,0.027273,0.009182,0.166667,0,18
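
The `totalLineItems` values above come from truncating a floating-point product. Because `scannedLineItemsPerSecond` is a stored ratio, the product can land just below the true item count, and the integer cast then drops it by one. The sketch below illustrates this with the first row, assuming the values exactly as displayed in the table (the CSV may hold more digits). Rounding is shown only as a possible alternative, not as what the notebook does.

```python
# Values as displayed in the first row of the table above (display-rounded).
scanned_per_second = 0.007246
total_scan_time = 828

product = scanned_per_second * total_scan_time  # about 5.9997, just below 6

truncated = int(product)  # integer cast: drops the fractional part
rounded = round(product)  # nearest integer

print(truncated, rounded)
```

Truncation gives 5 here, matching the table, while rounding gives 6. Switching `prepareData` to rounding would change this feature, so the training output and scores below would have to be regenerated.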


## Using LightGBM
[Documentation](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html)


### Read Data
[Data Structure API](https://lightgbm.readthedocs.io/en/latest/Python-API.html#data-structure-api)


In [19]:
train_data = lgb.Dataset(df_train.drop('fraud', axis=1), label=df_train['fraud'])
validation_data = train_data.create_valid(df_val.drop('fraud', axis=1), label=df_val['fraud'])
train_data.save_binary('lgb_train_data.bin')

<lightgbm.basic.Dataset at 0x7f5c2fc22748>

### Train
[Parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html), e.g. [metric](https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric) 

In [32]:
lgb_params = {
    'objective': 'binary', # binary classification
#    'device_type': 'gpu', # uncomment to train on the GPU (requires the GPU build of LightGBM)
    'metric': '' # empty string = default metric for the objective (binary_logloss here); predictions are probabilities, so we can apply a custom threshold
}

# train for up to 1000 rounds; stop early if the validation score does not improve for 20 rounds
num_round = 1000
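# LightGBM >= 4.0 no longer accepts early_stopping_rounds in lgb.train; the callback works in older versions too.
# On newer versions, add lgb.log_evaluation(1) to the callbacks to print the per-round scores.
bst = lgb.train(lgb_params, train_data, num_round, valid_sets=[validation_data],
                callbacks=[lgb.early_stopping(stopping_rounds=20)])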

# Alternative: 5-fold CV. Note that lgb.cv returns the evaluation history (a dict), not a Booster:
# cv_results = lgb.cv(lgb_params, train_data, num_round, nfold=5)

# save the best iteration; reload later with bst = lgb.Booster(model_file='lgb_model.txt')
bst.save_model('lgb_model.txt', num_iteration=bst.best_iteration)

[1]	valid_0's binary_logloss: 0.174881
Training until validation scores don't improve for 20 rounds.
[2]	valid_0's binary_logloss: 0.15116
[3]	valid_0's binary_logloss: 0.13541
[4]	valid_0's binary_logloss: 0.124077
[5]	valid_0's binary_logloss: 0.115448
[6]	valid_0's binary_logloss: 0.108517
[7]	valid_0's binary_logloss: 0.103347
[8]	valid_0's binary_logloss: 0.099095
[9]	valid_0's binary_logloss: 0.0953967
[10]	valid_0's binary_logloss: 0.0918439
[11]	valid_0's binary_logloss: 0.0893213
[12]	valid_0's binary_logloss: 0.0866027
[13]	valid_0's binary_logloss: 0.0827592
[14]	valid_0's binary_logloss: 0.0802988
[15]	valid_0's binary_logloss: 0.0782259
[16]	valid_0's binary_logloss: 0.0757554
[17]	valid_0's binary_logloss: 0.0743443
[18]	valid_0's binary_logloss: 0.0715176
[19]	valid_0's binary_logloss: 0.0695524
[20]	valid_0's binary_logloss: 0.0674357
[21]	valid_0's binary_logloss: 0.0653719
[22]	valid_0's binary_logloss: 0.0638559
[23]	valid_0's binary_logloss: 0.0619014
[24]	valid_0's

<lightgbm.basic.Booster at 0x7f5c3c2761d0>

### Test

In [33]:
# test prediction
ypred = bst.predict(df_val.drop('fraud', axis=1), num_iteration=bst.best_iteration)
print(ypred[0:5])

[0.00963915 0.00027916 0.00017954 0.00026539 0.00072681]


## Evaluation
### Convert class probabilities to binary classes
See this [issue comment](https://github.com/CptK1ng/dmc2019/issues/9#issuecomment-485343221) for how the threshold is calculated.

In [34]:
classification_threshold = 25/35

ypred = np.where(ypred <= classification_threshold, 0, 1)

print(ypred[0:5])

[0 0 0 0 0]
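
One way to arrive at `25/35` is a break-even argument on the payoffs used in `score_function` further down (true positive +5, false positive -25, false negative -5, true negative 0): flag a transaction as fraud only when the expected payoff of doing so exceeds that of letting it pass. The sketch below assumes this is the reasoning behind the threshold; the linked issue may argue differently. The constant and function names are illustrative.

```python
# Payoffs per outcome, taken from the matrix in score_function.
GAIN_TP = 5    # fraud correctly flagged
COST_FP = -25  # honest customer wrongly flagged
COST_FN = -5   # fraud missed
GAIN_TN = 0    # honest customer correctly passed


def expected_payoffs(p):
    """Expected payoff of predicting fraud vs. no fraud, given fraud probability p."""
    predict_fraud = p * GAIN_TP + (1 - p) * COST_FP
    predict_clean = p * COST_FN + (1 - p) * GAIN_TN
    return predict_fraud, predict_clean


# Break-even point: p*TP + (1-p)*FP == p*FN + (1-p)*TN
# Solving for p gives (TN - FP) / (TP - FP - FN + TN).
break_even = (GAIN_TN - COST_FP) / (GAIN_TP - COST_FP - COST_FN + GAIN_TN)

print(break_even, 25 / 35)
```

Below this probability, predicting "no fraud" has the higher expected payoff; above it, predicting "fraud" does.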


### Calculate DMC score

In [31]:
def score_function(y_true, y_pred):
  cm = metrics.confusion_matrix(y_true, y_pred) # sklearn gives [[tn,fp],[fn,tp]]
  dmc = np.sum(cm * np.array([[0, -25], [-5, 5]])) # payoff per cell: tn=0, fp=-25, fn=-5, tp=+5
  return (dmc,
          dmc/len(y_pred), # score per sample, comparable across set sizes; the higher the better
          cm.tolist(),
          0 if all(y_pred == 0) else metrics.fbeta_score(y_true, y_pred, beta=0.5172))

score_function(df_val['fraud'].values, ypred)

(35, 0.09308510638297872, [[353, 0], [8, 15]], 0.8988310412553612)
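
The first two entries of this tuple can be checked by hand from the confusion matrix. The plain-Python sketch below redoes that arithmetic. It also computes `best_possible`, which is not part of the notebook: it is the score a perfect classifier would get on this validation set, assuming the payoff matrix in `score_function` is the complete scoring rule.

```python
# Confusion matrix from the output above, in sklearn order [[tn, fp], [fn, tp]].
confusion = [[353, 0], [8, 15]]
# Payoff per cell, as used in score_function.
payoff = [[0, -25], [-5, 5]]

# Element-wise product, summed over all four cells.
dmc = sum(
    count * weight
    for count_row, weight_row in zip(confusion, payoff)
    for count, weight in zip(count_row, weight_row)
)
n_samples = sum(sum(row) for row in confusion)

# Illustrative upper bound: every fraud caught (+5 each), no false alarms.
n_frauds = confusion[1][0] + confusion[1][1]
best_possible = payoff[1][1] * n_frauds

print(dmc, dmc / n_samples, best_possible)
```

The 8 missed frauds cost 40 points and the 15 caught frauds earn 75, giving 35 out of a possible 115 on these 376 samples.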

The model reaches a DMC score of *35* on the validation set (15 frauds caught, 8 missed, no false alarms), which is okay but not outstanding.

This score can probably be improved by tuning the model's [hyperparameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html).