<a href="https://colab.research.google.com/github/CptK1ng/dmc2019/blob/alexander_dev/notebooks/CatBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CatBoost

In this notebook I will test the CatBoost classifier on our data.

## Installation:


1.   `pip install numpy six catboost`

2.   For visualization: `pip install ipywidgets` and `jupyter nbextension enable --py widgetsnbextension`



More details [here](https://catboost.ai/docs/concepts/python-installation.html#python-installation)

In [1]:
!pip install catboost
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn import metrics



## Data Import & Preprocessing
Download our custom train and validation splits:

In [0]:
!wget -nc -q --show-progress https://www.dropbox.com/s/6m8iq9ogpzmu7vx/train_new.csv?dl=1 -O train_new.csv
!wget -nc -q --show-progress https://www.dropbox.com/s/tjpkc45oqn3uv8s/val_new.csv?dl=1 -O val_new.csv

Import data:

In [3]:

df_train_original = pd.read_csv("train_new.csv", sep="|")
df_val_original = pd.read_csv("val_new.csv", sep="|")
df_train_original.head(2)

| Unnamed: 0 | trustLevel | totalScanTimeInSeconds | grandTotal | lineItemVoids | scansWithoutRegistration | quantityModifications | scannedLineItemsPerSecond | valuePerSecond | lineItemVoidsPerPosition | fraud |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 828 | 66.56 | 7 | 4 | 3 | 0.007246 | 0.080386 | 1.166667 | 0 |
| 1 | 1 | 1612 | 31.34 | 2 | 4 | 3 | 0.008685 | 0.019442 | 0.142857 | 0 |


Feature Engineering:

In [4]:
def prepareData(df):
  df = df.copy()
  df['totalLineItems'] = (df['scannedLineItemsPerSecond'] * df['totalScanTimeInSeconds']).astype(int) # number of scanned products; np.int was removed in NumPy 1.24, and astype(int) truncates the product
  df['trustLevel'] = df.trustLevel.astype('category') # needed for automatic detection of categorical features later
  df['fraud'] = df.fraud.astype('category') # needed for automatic detection of categorical features later

  return df

df_train = prepareData(df_train_original)
df_val = prepareData(df_val_original)

df_train.head()

| Unnamed: 0 | trustLevel | totalScanTimeInSeconds | grandTotal | lineItemVoids | scansWithoutRegistration | quantityModifications | scannedLineItemsPerSecond | valuePerSecond | lineItemVoidsPerPosition | fraud | totalLineItems |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 828 | 66.56 | 7 | 4 | 3 | 0.007246 | 0.080386 | 1.166667 | 0 | 5 |
| 1 | 1 | 1612 | 31.34 | 2 | 4 | 3 | 0.008685 | 0.019442 | 0.142857 | 0 | 13 |
| 2 | 3 | 848 | 52.37 | 2 | 4 | 0 | 0.022406 | 0.061757 | 0.105263 | 0 | 19 |
| 3 | 1 | 321 | 76.03 | 8 | 7 | 2 | 0.071651 | 0.236854 | 0.347826 | 0 | 22 |
| 4 | 1 | 660 | 6.06 | 3 | 7 | 1 | 0.027273 | 0.009182 | 0.166667 | 0 | 18 |
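The `totalLineItems` values above are slightly suspicious: `astype(int)` truncates, so a product such as 5.9997 becomes 5 rather than 6. The sketch below compares truncation with rounding on the five rows printed above. It uses the values as displayed (rounded to six decimals), so individual results can differ from the full-precision CSV, and the cross-check against `lineItemVoidsPerPosition` rests on the assumption, suggested by the sample rows but not confirmed here, that this column is `lineItemVoids` divided by the item count.

```python
import numpy as np

# First five training rows as printed above (rounded to 6 decimals)
items_per_second = np.array([0.007246, 0.008685, 0.022406, 0.071651, 0.027273])
scan_time = np.array([828, 1612, 848, 321, 660])
voids = np.array([7, 2, 2, 8, 3])
voids_per_position = np.array([1.166667, 0.142857, 0.105263, 0.347826, 0.166667])

product = items_per_second * scan_time

truncated = product.astype(int)          # drops the fractional part
rounded = np.rint(product).astype(int)   # nearest integer instead

# Independent estimate of the item count
# (assumption: lineItemVoidsPerPosition = lineItemVoids / totalLineItems)
from_voids = np.rint(voids / voids_per_position).astype(int)

print("truncated :", truncated)
print("rounded   :", rounded)
print("from voids:", from_voids)
```

If rounding turns out to be the right choice, the model would have to be retrained, since the scores reported in this notebook were obtained with the truncated feature.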


## Using CatBoost
[Documentation](https://catboost.ai/docs/concepts/python-quickstart.html)


### Read Data
[Pool](https://catboost.ai/docs/concepts/python-reference_pool.html#python-reference_pool)


In [0]:
train_pool = Pool(df_train.drop('fraud', axis=1), label=df_train['fraud'])
validation_pool = Pool(df_val.drop('fraud', axis=1), label=df_val['fraud'])

### Train
[Classifier Parameters](https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list), [loss functions](https://catboost.ai/docs/concepts/loss-functions-classification.html), [fit](https://catboost.ai/docs/concepts/python-reference_catboostclassifier_fit.html)


**TODO**: [Custom metric](https://catboost.ai/docs/concepts/python-usages-examples.html#custom-loss-function-eval-metric)

In [6]:
model = CatBoostClassifier(iterations=70000, # number of boosting iterations; more were tested, but 70,000 is enough
                           depth=2,
                           learning_rate=1, # fixed at 1, not detected automatically; omit the argument to use CatBoost's default
                           loss_function='Logloss',
                           early_stopping_rounds=10, # has no effect here: early stopping needs an eval_set, which fit() below does not receive
                           #one_hot_max_size=6,
                           verbose=10000) # print the training loss every 10,000 iterations
# train the model
model.fit(train_pool) #, eval_set=validation_pool)
model.save_model("catbboostmodel.cbm")

0:	learn: 0.1466787	total: 67ms	remaining: 1h 18m 10s
10000:	learn: 0.0000589	total: 1m	remaining: 6m 3s
20000:	learn: 0.0000571	total: 1m 58s	remaining: 4m 56s
30000:	learn: 0.0000551	total: 2m 57s	remaining: 3m 56s
40000:	learn: 0.0000551	total: 3m 57s	remaining: 2m 57s
50000:	learn: 0.0000551	total: 4m 58s	remaining: 1m 59s
60000:	learn: 0.0000546	total: 5m 59s	remaining: 59.9s
69999:	learn: 0.0000546	total: 7m 1s	remaining: 0us


### Test

In [7]:
# make the prediction using the resulting model
#ypred = model.predict(validation_pool)
ypred_proba = model.predict_proba(validation_pool).T[1]

print(ypred_proba[0:5])

[4.44269916e-06 4.25368357e-12 9.95164769e-13 8.97002963e-07
 1.66799220e-13]
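As a small aside on the indexing used above: `predict_proba` returns one row per sample and one column per class, so `.T[1]` picks the fraud probability. The sketch below uses a made-up array (not model output) to show that this is the same as the more common `[:, 1]` slice.

```python
import numpy as np

# Hypothetical predict_proba-style output: columns are [P(class 0), P(class 1)]
proba = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])

fraud_proba = proba[:, 1]  # probability of the positive class for each sample

# Transposing and taking row 1 selects exactly the same values
same_as_transpose = np.array_equal(fraud_proba, proba.T[1])
print(fraud_proba, same_as_transpose)
```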


## Evaluation
### Convert class probabilities to binary classes
See this [issue](https://github.com/CptK1ng/dmc2019/issues/9#issuecomment-485343221) for how the threshold is calculated.

In [8]:
classification_threshold = 25/35

ypred = np.where(ypred_proba <= classification_threshold, 0, 1)

print(ypred[0:5])

[0 0 0 0 0]
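The value 25/35 can be reproduced from the payoffs used in the DMC score function of this notebook (true positive +5, false positive -25, false negative -5, true negative 0). Flagging a transaction with fraud probability p pays 5p - 25(1 - p) in expectation, not flagging it pays -5p, and the two are equal at p = 25/35. The helper below is a hypothetical illustration of that break-even calculation, not code taken from the linked issue.

```python
from fractions import Fraction

def break_even_threshold(tp, fp, fn, tn):
    """Fraud probability above which flagging has the higher expected payoff."""
    gain_if_fraud = tp - fn    # what flagging wins when the case really is fraud
    loss_if_honest = tn - fp   # what flagging costs when the customer is honest
    return Fraction(loss_if_honest, loss_if_honest + gain_if_fraud)

threshold = break_even_threshold(tp=5, fp=-25, fn=-5, tn=0)

def expected_payoff_flag(p):
    return 5 * p - 25 * (1 - p)

def expected_payoff_pass(p):
    return -5 * p

print(threshold, float(threshold))
```

Because a false alarm costs five times as much as a caught fraud earns, the threshold sits well above 0.5.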


### Calculate DMC score

In [9]:
def score_function(y_true, y_pred):
  cm = metrics.confusion_matrix(y_true, y_pred) # sklearn gives [[tn, fp], [fn, tp]]
  dmc = np.sum(cm * np.array([[0, -25], [-5, 5]])) # payoff per cell: tn 0, fp -25, fn -5, tp +5
  return (dmc,
          dmc/len(y_pred), # comparable relative score, the higher the better
          cm.tolist(),
          0 if all(y_pred == 0) else metrics.fbeta_score(y_true, y_pred, beta=0.5172))

score_function(df_val['fraud'].values, ypred)

(50, 0.13297872340425532, [[352, 1], [4, 19]], 0.9208492139127344)

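On the validation set the model reaches a DMC score of *50* (about 0.13 per transaction): of the 23 fraud cases it catches 19 and misses 4, and it raises 1 false alarm among 353 honest transactions. That is quite good, but not outstanding.

Tuning the model's [hyperparameters](https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list) might improve this score.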
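To put the score of 50 in context, it helps to compare it with two reference points on the same validation set: the best possible score (every fraud caught, no false alarms) and the score of never flagging anyone. The sketch below recomputes the DMC score from the confusion matrix reported above and derives both bounds; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix reported above, in sklearn layout [[tn, fp], [fn, tp]]
cm = np.array([[352, 1], [4, 19]])
payoff = np.array([[0, -25], [-5, 5]])

score = int(np.sum(cm * payoff))
n_fraud = int(cm[1].sum())   # actual fraud cases: fn + tp
n_total = int(cm.sum())

best_case = 5 * n_fraud      # all frauds flagged, no false positives
never_flag = -5 * n_fraud    # every fraud is missed

print("score      :", score)
print("best case  :", best_case)
print("never flag :", never_flag)
print("per sample :", score / n_total)
```

With 23 fraud cases the score can range from -115 (never flagging) up to 115, so the single false positive alone accounts for 25 of the 65 points still missing.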