<a href="https://colab.research.google.com/github/CptK1ng/dmc2019/blob/alexander_dev/notebooks/train_without_hightrust.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Remove High Trust Levels from Training and Use a Hard Cut for Prediction
by Alexander

We assume that transactions with trustLevel > 2 are never fraudulent, so there is no reason to train on them.

Our classifier can therefore focus on the interesting rows.

As only trust levels 1 and 2 remain in the training data, we also convert the trustLevel feature to a boolean one.
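The assumption that higher trust levels contain no fraud can be checked before cutting. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are illustrative, not taken from the competition files); the same grouping could be applied to the real training frame.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "trustLevel": [1, 1, 2, 2, 3, 4, 5, 6],
    "fraud":      [1, 0, 1, 0, 0, 0, 0, 0],
})

# Number of fraud cases ("sum") and number of rows ("count") per trust level
fraud_by_trust = toy.groupby("trustLevel")["fraud"].agg(["sum", "count"])
print(fraud_by_trust)

# The hard cut is only safe if no fraud occurs above trustLevel 2
frauds_above_cut = int(toy.loc[toy["trustLevel"] > 2, "fraud"].sum())
print("frauds with trustLevel > 2:", frauds_above_cut)
```

If `frauds_above_cut` were greater than zero on the real data, every such row would become a guaranteed false negative under the hard cut.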

In [0]:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import time
%matplotlib inline

Download our custom dataset splits:

In [0]:
!wget -nc -q --show-progress "https://www.dropbox.com/s/6m8iq9ogpzmu7vx/train_new.csv?dl=1" -O train_new.csv
!wget -nc -q --show-progress "https://www.dropbox.com/s/tjpkc45oqn3uv8s/val_new.csv?dl=1" -O val_new.csv

Import Data:

In [40]:
df_train_original = pd.read_csv("train_new.csv", sep="|")
df_val_original = pd.read_csv("val_new.csv", sep="|")
df_train_original.head(2)

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud
0,4,828,66.56,7,4,3,0.007246,0.080386,1.166667,0
1,1,1612,31.34,2,4,3,0.008685,0.019442,0.142857,0


## Feature Engineering

In [41]:
def prepareData(df):
  df = df.copy()
  df['totalLineItems'] = df['scannedLineItemsPerSecond'] * df['totalScanTimeInSeconds'] # number of scanned products
  df = df[df.trustLevel <= 2]
  df['higherTrust'] = (df.trustLevel == 2)
  df = df.drop('trustLevel', axis=1)
  return df

df_train = prepareData(df_train_original)
df_val = prepareData(df_val_original)

df_train.head(10)

Unnamed: 0,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition,fraud,totalLineItems,higherTrust
1,1612,31.34,2,4,3,0.008685,0.019442,0.142857,0,14.0,False
3,321,76.03,8,7,2,0.071651,0.236854,0.347826,0,23.0,False
4,660,6.06,3,7,1,0.027273,0.009182,0.166667,0,18.0,False
6,871,89.92,2,5,5,0.002296,0.103238,1.0,0,2.0,False
9,1758,19.32,0,7,5,0.016496,0.01099,0.0,1,29.0,False
11,1797,31.25,1,1,3,0.01113,0.01739,0.05,0,20.0,True
13,518,48.65,9,10,3,0.050193,0.093919,0.346154,1,26.0,False
15,827,83.7,5,10,4,0.010883,0.101209,0.555556,0,9.0,False
16,1355,48.87,2,0,4,0.005166,0.036066,0.285714,0,7.0,True
18,150,60.24,4,3,1,0.053333,0.4016,0.5,0,8.0,True


## Split X and Y

In [42]:
# Separate the features (X) from the fraud label (y) for the training and validation sets
df_train_X = df_train.drop('fraud', axis=1)
df_train_y = df_train['fraud']
df_val_X = df_val.drop('fraud', axis=1)
df_val_y = df_val['fraud']

X_train, X_val, y_train, y_val = df_train_X.values, df_val_X.values, df_train_y.values, df_val_y.values
print("Shapes",X_train.shape, X_val.shape, y_train.shape, y_val.shape)

Shapes (538, 10) (141, 10) (538,) (141,)


## Define Scorer

In [0]:
def score_function(y_true, y_pred):
  # sklearn gives [[tn,fp],[fn,tp]]; labels=[0, 1] keeps the 2x2 shape even if only one class occurs
  cm = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1])
  dmc = np.sum(cm*np.array([[0, -25],[ -5, 5]])) # weights: tn 0, fp -25, fn -5, tp +5
  return (0 if all(y_pred == 0) else metrics.fbeta_score(y_true, y_pred, beta=2),
          dmc,
          dmc/len(y_pred), # comparable relative score, the higher the better
          cm.tolist())

## Classify

In [44]:
# Note: scikit-learn >= 1.2 renames the `base_estimator` argument to `estimator`
classifier_adb = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3), n_estimators=500, algorithm='SAMME', random_state=1)

# Fit the model and print the custom scores on the validation set
classifier_adb.fit(X_train, y_train)
y_val_pred = classifier_adb.predict(X_val)
print("AdaBoost", "\t", score_function(y_val, y_val_pred) )

AdaBoost 	 (0.84070796460177, 25, 0.1773049645390071, [[116, 2], [4, 19]])
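The absolute and relative scores in this output can be reproduced by hand from the confusion matrix. The sketch below copies the matrix from the printed result and applies the same weights as `score_function`; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above, in sklearn layout [[tn, fp], [fn, tp]]
cm = np.array([[116, 2], [4, 19]])
# Weights used in score_function: tn = 0, fp = -25, fn = -5, tp = +5
weights = np.array([[0, -25], [-5, 5]])

# 0*116 - 25*2 - 5*4 + 5*19 = 25
dmc = int(np.sum(cm * weights))
# Relative score: absolute score divided by the number of scored rows (141)
relative = dmc / cm.sum()
print(dmc, round(relative, 4))
```

A single false positive costs as much as five correctly caught frauds earn, so the score rewards cautious fraud predictions.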


**But wait!** We assumed that all rows with trustLevel > 2 are non-fraud and cut them off. To get a score for the whole validation set, we need to add them back to our predicted values as non-fraud predictions.


In [45]:
df_val_pred = df_val_X[[]].copy() # empty frame that only keeps the index (row ids)
df_val_pred["fraud_pred"] = y_val_pred # attach the predictions to that index
df_val_pred = df_val_original[[]].join(df_val_pred) # add rows for the indices we did not predict (NaN there)
df_val_pred = df_val_pred.fillna(0) # replace NaN with 0 (= non-fraud)
y_val_pred_corrected = df_val_pred["fraud_pred"].values.astype(int) # 1-D integer array, as the scorer expects
print("AdaBoost", "\t", score_function(df_val_original['fraud'].values, y_val_pred_corrected) )

AdaBoost 	 (0.84070796460177, 25, 0.06648936170212766, [[351, 2], [4, 19]])
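The hard cut and the re-insertion of the cut-off rows can also be wrapped into one helper. This is a minimal sketch, not the code used for the result above: `predict_with_hard_cut`, the toy frame, and `dummy_predict` (a stand-in for a fitted classifier's `predict`) are hypothetical names introduced for illustration.

```python
import pandas as pd

def predict_with_hard_cut(df, predict_fn, max_trust=2):
    """Predict only for rows with trustLevel <= max_trust; all other rows get 0."""
    low_trust = df[df["trustLevel"] <= max_trust]
    preds = pd.Series(predict_fn(low_trust), index=low_trust.index, dtype=int)
    # reindex restores the full index; cut-off rows are filled with 0 (non-fraud)
    return preds.reindex(df.index, fill_value=0)

# Toy data, made up for illustration only
toy = pd.DataFrame(
    {"trustLevel": [1, 4, 2, 6, 1], "lineItemVoids": [9, 1, 2, 8, 0]},
    index=[10, 11, 12, 13, 14],
)

# Hypothetical stand-in for classifier.predict: flags rows with many voids
def dummy_predict(rows):
    return (rows["lineItemVoids"] > 5).to_numpy().astype(int)

full_pred = predict_with_hard_cut(toy, dummy_predict)
print(full_pred.tolist())
```

Row 13 has many voids but a trust level above the cut, so it is predicted as non-fraud without ever reaching the model.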


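As we can see, this unfortunately did not improve the prediction on our validation set. Adding the cut-off rows back leaves the F2 score (0.84) and the absolute score (25) unchanged, because all re-added rows are true negatives. Only the relative score drops, from 0.177 to 0.066, since the same total is now divided by all 376 validation rows instead of 141.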