<a href="https://colab.research.google.com/github/CptK1ng/dmc2019/blob/alexander_dev/notebooks/CatBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CatBoost

In this notebook I test the CatBoost classifier on our data.

## Installation

1. `pip install numpy six catboost`
2. For visualization: `pip install ipywidgets` and `jupyter nbextension enable --py widgetsnbextension`

More details are in the [CatBoost installation guide](https://catboost.ai/docs/concepts/python-installation.html#python-installation).

In [6]:
!pip install catboost
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn import metrics

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/2f/c4/f130237b24efd1941cb685da12496675a90045129b66774751f1bf629dfd/catboost-0.14.2-cp36-none-manylinux1_x86_64.whl (60.6MB)
Installing collected packages: catboost
Successfully installed catboost-0.14.2


## Data Import & Preprocessing
Download our custom train and validation splits:

In [2]:
!wget -nc -q --show-progress "https://www.dropbox.com/s/6m8iq9ogpzmu7vx/train_new.csv?dl=1" -O train_new.csv
!wget -nc -q --show-progress "https://www.dropbox.com/s/tjpkc45oqn3uv8s/val_new.csv?dl=1" -O val_new.csv



Import data:

In [3]:

df_train_original = pd.read_csv("train_new.csv", sep="|")
df_val_original = pd.read_csv("val_new.csv", sep="|")
df_train_original.head(2)

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud
0,4,828,66.56,7,4,3,0.007246,0.080386,1.166667,0
1,1,1612,31.34,2,4,3,0.008685,0.019442,0.142857,0
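
The `Unnamed: 0` column in the output above is most likely the row index that was saved into the CSV; unless it is dropped, it stays in the frame as an ordinary column. A minimal sketch of reading such a file with `index_col=0` follows. The inline string is a hypothetical stand-in for `train_new.csv`, with a subset of the columns and an assumed empty first header field:

```python
import io
import pandas as pd

# Stand-in for train_new.csv (assumption): the empty first header field
# is the row index that was written when the split was saved.
sample_csv = (
    "|trustLevel|totalScanTimeInSeconds|grandTotal|fraud\n"
    "0|4|828|66.56|0\n"
    "1|1|1612|31.34|0\n"
)

# Default read: the saved index becomes a regular column named "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(sample_csv), sep="|")

# index_col=0: the first column is used as the index instead.
df_indexed = pd.read_csv(io.StringIO(sample_csv), sep="|", index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Whether the index carries any signal depends on how the splits were created, so dropping it is the safer default.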


Feature Engineering:

In [4]:
def prepareData(df):
  df = df.copy()
  # Number of scanned products. astype(int) truncates, so floating-point error
  # can leave the result one below the true count; rounding first would avoid that.
  df['totalLineItems'] = (df['scannedLineItemsPerSecond'] * df['totalScanTimeInSeconds']).astype(int)
  df['trustLevel'] = df.trustLevel.astype('category') # needed for automatic detection of categorical features later
  df['fraud'] = df.fraud.astype('category') # needed for automatic detection of categorical features later

  return df

df_train = prepareData(df_train_original)
df_val = prepareData(df_val_original)

df_train.head()

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud,totalLineItems
0,4,828,66.56,7,4,3,0.007246,0.080386,1.166667,0,5
1,1,1612,31.34,2,4,3,0.008685,0.019442,0.142857,0,13
2,3,848,52.37,2,4,0,0.022406,0.061757,0.105263,0,19
3,1,321,76.03,8,7,2,0.071651,0.236854,0.347826,0,22
4,1,660,6.06,3,7,1,0.027273,0.009182,0.166667,0,18
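
Multiplying a rate by a duration rarely gives an exact integer in floating point, so truncating the product can undercount. The sketch below compares truncation with rounding, using three rows with the rate values as displayed above (rounded to six decimals, so the products differ slightly from those computed on the full-precision CSV values):

```python
import pandas as pd

# Rates as displayed in the table above (six decimals), with their scan times.
sample = pd.DataFrame({
    "scannedLineItemsPerSecond": [0.007246, 0.008685, 0.071651],
    "totalScanTimeInSeconds": [828, 1612, 321],
})

product = sample["scannedLineItemsPerSecond"] * sample["totalScanTimeInSeconds"]

truncated = product.astype(int)        # drops the fractional part
rounded = product.round().astype(int)  # nearest integer

print(product.tolist())
print(truncated.tolist())
print(rounded.tolist())
```

If the item count is meant to be exact, rounding before the cast is the more robust choice; switching would change `totalLineItems` for some rows and therefore the trained model.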


## Using CatBoost
[Documentation](https://catboost.ai/docs/concepts/python-quickstart.html)


### Read Data
[Pool](https://catboost.ai/docs/concepts/python-reference_pool.html#python-reference_pool)


In [0]:
train_pool = Pool(df_train.drop('fraud', axis=1), df_train['fraud'])
validation_pool = Pool(df_val.drop('fraud', axis=1), label=df_val['fraud'])

### Train
[Classifier Parameters](https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list), [loss functions](https://catboost.ai/docs/concepts/loss-functions-classification.html), [fit](https://catboost.ai/docs/concepts/python-reference_catboostclassifier_fit.html)

In [41]:
model = CatBoostClassifier(iterations=100000,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
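                           early_stopping_rounds=10, # only takes effect when fit() receives an eval_set
                           verbose=500)
# train the model (no eval_set is passed, so training is not stopped early)
model.fit(train_pool)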

0:	learn: 0.1466787	total: 12.9ms	remaining: 21m 26s
500:	learn: 0.0002541	total: 4.62s	remaining: 15m 16s
1000:	learn: 0.0001290	total: 7.79s	remaining: 12m 50s
1500:	learn: 0.0000932	total: 10.9s	remaining: 11m 55s
2000:	learn: 0.0000804	total: 13.8s	remaining: 11m 16s
2500:	learn: 0.0000763	total: 16.9s	remaining: 10m 58s
3000:	learn: 0.0000701	total: 20.3s	remaining: 10m 57s
3500:	learn: 0.0000695	total: 23.3s	remaining: 10m 41s
4000:	learn: 0.0000672	total: 26.1s	remaining: 10m 26s
4500:	learn: 0.0000658	total: 29s	remaining: 10m 14s
5000:	learn: 0.0000632	total: 31.8s	remaining: 10m 4s
5500:	learn: 0.0000623	total: 34.7s	remaining: 9m 56s
6000:	learn: 0.0000623	total: 37.6s	remaining: 9m 48s
6500:	learn: 0.0000622	total: 40.9s	remaining: 9m 47s
7000:	learn: 0.0000622	total: 44.3s	remaining: 9m 47s
7500:	learn: 0.0000622	total: 47.6s	remaining: 9m 47s
8000:	learn: 0.0000622	total: 50.6s	remaining: 9m 41s
8500:	learn: 0.0000622	total: 53.5s	remaining: 9m 36s
9000:	learn: 0.0000605	

<catboost.core.CatBoostClassifier at 0x7f7299450198>

### Test

In [42]:
# make the prediction using the resulting model
#ypred = model.predict(validation_pool)
ypred_proba = model.predict_proba(validation_pool).T[1]

print(ypred_proba[0:5])

[4.44269916e-06 4.25368357e-12 9.95164769e-13 8.97002963e-07
 1.66799220e-13]


## Evaluation
### Convert class probabilities to binary classes
See this [issue](https://github.com/CptK1ng/dmc2019/issues/9#issuecomment-485343221) for how the threshold is calculated.

In [43]:
classification_threshold = 25/35

ypred = np.where(ypred_proba <= classification_threshold, 0, 1)

print(ypred[0:5])

[0 0 0 0 0]
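
The value 25/35 is consistent with the payoffs used in the DMC score: a false positive costs 25, a false negative costs 5, and a true positive earns 5. As a sketch, assuming `p` is the predicted fraud probability, flagging a transaction only pays off in expectation when `5p - 25(1 - p) > -5p`, which gives `p > 25/35`. The helper `expected_payoff` is introduced here for illustration only:

```python
import numpy as np

# Payoff matrix, rows = true class, columns = predicted class: [[tn, fp], [fn, tp]]
payoff = np.array([[0, -25], [-5, 5]])

def expected_payoff(p, predict_fraud):
    """Expected payoff of a decision when the fraud probability is p."""
    column = 1 if predict_fraud else 0
    return (1 - p) * payoff[0, column] + p * payoff[1, column]

# Break-even probability at which both decisions have the same expected payoff.
gain_if_fraud = payoff[1, 1] - payoff[1, 0]    # flagging a fraud: +5 instead of -5
loss_if_honest = payoff[0, 0] - payoff[0, 1]   # flagging an honest customer: -25 instead of 0
threshold = loss_if_honest / (loss_if_honest + gain_if_fraud)

print(threshold)
print(expected_payoff(0.8, True), expected_payoff(0.8, False))
print(expected_payoff(0.6, True), expected_payoff(0.6, False))
```

This reasoning assumes the predicted probabilities are reasonably well calibrated; if they are not, the best threshold on validation data may differ.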


### Calculate the DMC score

In [44]:
def score_function(y_true, y_pred):
  cm = metrics.confusion_matrix(y_true, y_pred) #sklearn gives [[tn,fp],[fn,tp]]
  dmc = np.sum(cm*np.array([[0, -25],[ -5, 5]])) #payoffs: fp costs 25, fn costs 5, tp earns 5
  return (dmc,
          dmc/len(y_pred), #comparable relative score, the higher the better.
          cm.tolist(),
          0 if all(y_pred == 0) else metrics.fbeta_score(y_true, y_pred, beta=0.5172))

score_function(df_val['fraud'].values, ypred)

(50, 0.13297872340425532, [[352, 1], [4, 19]], 0.9208492139127344)
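
To make the tuple above easier to read, the following sketch recomputes each entry by hand from the reported confusion matrix. It also derives the best possible score on this split, a quantity not reported in the original output, which would be reached by catching every fraud case without a false alarm:

```python
import numpy as np

cm = np.array([[352, 1], [4, 19]])       # [[tn, fp], [fn, tp]] from the output above
payoff = np.array([[0, -25], [-5, 5]])   # payoff per confusion-matrix cell

dmc = int((cm * payoff).sum())                     # 1*(-25) + 4*(-5) + 19*5
relative = dmc / cm.sum()                          # score per transaction
best_possible = int(cm[1].sum() * payoff[1, 1])    # all fraud cases caught, no false positives

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
beta = 0.5172
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(dmc, relative, best_possible)
print(precision, recall, f_beta)
```

The single false positive costs more than the four missed fraud cases combined (25 versus 20), which is why the threshold is set so conservatively.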

The model reaches a DMC score of *50* on the validation split, or about 0.13 per transaction. That is good but not outstanding: catching all 23 fraud cases without a single false positive would score 115.

The score might improve with tuning of the model's [hyperparameters](https://catboost.ai/docs/concepts/python-reference_parameters-list.html#python-reference_parameters-list).