# Linear SVC using Standard SVC Object with Min Max Scaler (enable predict_proba)
## Hyperparameter Fine-Tuning with RandomizedSearchCV
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [1]:
import os
# the path may need to be adjusted
os.chdir("{}/..".format(os.getcwd()))
os.getcwd()
import pandas as pd
import numpy as np
from sklearn import model_selection, linear_model, metrics, svm
from scipy.stats import uniform, randint
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RandomizedSearchCV, train_test_split, cross_validate

In [2]:
%run notebooks/utils.ipynb

## Loading Data
Load the whole training set and add the column `totalScannedItems` by calling `add_new_features`, which is defined in `utils.ipynb`.

In [3]:
df = pd.read_csv("data/train.csv", sep="|")
df = add_new_features(df)
sum_frauds, sum_non_frauds  = len(df[df.fraud == 1]), len(df[df.fraud == 0])
train_y = df.fraud
train_x = df.drop(columns=['fraud'])

## Load validation data


In [4]:
# Train Data for validation
df_fit = pd.read_csv("data/train_new.csv", sep="|")
df_fit = add_new_features(df_fit)
df_fit_y = df_fit.fraud
df_fit_x = df_fit.drop(columns=['fraud'])

# Validation Data
df_val = pd.read_csv("data/val_new.csv", sep="|")
df_val = add_new_features(df_val)
df_val_y = df_val.fraud
df_val_x = df_val.drop(columns=['fraud'])

In [5]:
def scale_df(df, scaler):
    """Fit `scaler` on `df` and return the scaled values as a new DataFrame.

    The column names are kept; the index is reset to 0..n-1.
    """
    scaled = scaler.fit_transform(df)
    return pd.DataFrame(scaled, columns=df.columns)

In [6]:
from sklearn.preprocessing import MinMaxScaler
train_c_norm = scale_df(train_x, MinMaxScaler())
train_test = scale_df(df_fit_x, MinMaxScaler())
val_test = scale_df(df_val_x, MinMaxScaler())
train_test['fraud'] = df_fit.fraud
val_test['fraud'] = df_val.fraud
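
The cell above fits a separate `MinMaxScaler` on each DataFrame, so the training and validation features are not mapped with the same minimum and maximum. A common alternative, sketched here on made-up numbers, is to fit the scaler on the training split only and reuse it for the validation split. The toy frames and their values are assumptions for illustration; this is not what produced the scores reported in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the training and validation features (made-up values).
toy_train = pd.DataFrame({"grandTotal": [10.0, 20.0, 30.0],
                          "lineItemVoids": [0.0, 5.0, 10.0]})
toy_val = pd.DataFrame({"grandTotal": [15.0, 40.0],
                        "lineItemVoids": [2.5, 10.0]})

scaler = MinMaxScaler()
# Learn the minimum and maximum from the training split only.
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train), columns=toy_train.columns)
# Reuse the same minimum and maximum; values outside the training range
# end up outside [0, 1].
val_scaled = pd.DataFrame(scaler.transform(toy_val), columns=toy_val.columns)

print(train_scaled)
print(val_scaled)
```

With this variant, a validation value larger than anything seen in training is mapped above 1 instead of being squeezed back into the unit interval.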


## Scoring functions
Define the scores that the hyperparameter search object should track.

In [7]:
scoring = {'AUC': 'roc_auc', 'FBeta': metrics.make_scorer(metrics.fbeta_score, beta=0.5172)}
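
With `beta` below 1, the F-beta score weights precision more heavily than recall. A small sketch on made-up labels (the arrays are illustrative and not taken from the data set) compares the library result with the usual formula:

```python
import numpy as np
from sklearn import metrics

beta = 0.5172  # same value as in the scoring dictionary above

# Made-up labels: 3 true positives, 1 false positive, 2 false negatives.
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

precision = metrics.precision_score(y_true, y_pred)  # 3 / (3 + 1)
recall = metrics.recall_score(y_true, y_pred)        # 3 / (3 + 2)

# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
manual = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
library = metrics.fbeta_score(y_true, y_pred, beta=beta)

print(precision, recall, manual, library)
```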

## Defining the parameters which should be tuned
To choose the hyperparameters to tune, look at the parameter list in the documentation of the classifier, here the [SVC documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). Use `randint` for integer values and `uniform` for float values.

You can also run a grid search over a single parameter to get a feeling for a good interval. If you only want to try a few fixed values, pass a list, as done for the `shrinking` parameter below.

**Note: For classifiers that run without GPU support, you can probably set the parameter `n_jobs=-1` to use all processors. `SVC` itself has no `n_jobs` parameter, so here the parallelism comes from `RandomizedSearchCV`.**

In [8]:
params = {
    "tol": uniform(1e-7, 1e-2),  # samples from [1e-7, 1e-7 + 1e-2]; SVC default is 1e-3
    "C": uniform(0.0, 80.0),     # samples from [0, 80]
    "shrinking": [True, False]
}
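
`scipy.stats.uniform(loc, scale)` draws from the interval `[loc, loc + scale]`, so the second argument is a width rather than an upper bound. A minimal sketch of how candidates are sampled from such a dictionary, using `ParameterSampler` from scikit-learn (the seed and the number of draws are arbitrary choices):

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

params = {
    "tol": uniform(1e-7, 1e-2),   # interval [1e-7, 1e-7 + 1e-2]
    "C": uniform(0.0, 80.0),      # interval [0, 80]
    "shrinking": [True, False],   # lists are sampled uniformly
}

# Draw five candidate settings from the distributions above.
candidates = list(ParameterSampler(params, n_iter=5, random_state=42))
for candidate in candidates:
    print(candidate)
```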

## Creating a classifier with some default values
Not all parameters of a classifier should be fine-tuned. For an SVM, for example, the `kernel` parameter should be set manually. In the case of XGBoost, things like the objective, the booster and the tree method should not be tuned. The choice of parameters depends on the specific classifier.

In [9]:
default_svm = svm.SVC(kernel="linear", probability=True, cache_size=8000,  verbose=0, random_state=None)
default_svm.get_params().keys()

dict_keys(['verbose', 'tol', 'C', 'probability', 'gamma', 'decision_function_shape', 'degree', 'coef0', 'kernel', 'cache_size', 'random_state', 'shrinking', 'max_iter', 'class_weight'])
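
Setting `probability=True` is what makes `predict_proba` usable on an `SVC`; the probability estimates are calibrated with an additional internal cross-validation, which slows training down. A minimal sketch on synthetic data (the data set from `make_classification` is an assumption for illustration):

```python
from sklearn import svm
from sklearn.datasets import make_classification

# Small synthetic binary problem, only for demonstration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = svm.SVC(kernel="linear", probability=True, random_state=0)
clf.fit(X, y)

# One row per sample, one column per class; each row sums to 1.
proba = clf.predict_proba(X[:3])
print(proba)
```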

## Defining some RandomizedSearchCV parameters
- `cv`: number of cross-validation folds. 3 was the default in the scikit-learn version used here (newer versions default to 5). This is enough and should not be changed.
- `param_distributions`: the `params` dictionary defined two cells above.
- `scoring`: the scorers defined under "Scoring functions". AUC and FBeta are currently the best choices.
- `return_train_score`: does not affect the hyperparameter search.
- `refit`: the name of the score, as given in the `scoring` dictionary, that is used to select and refit the best estimator.
- `n_jobs`: -1 to use all CPUs.
- `n_iter`: depends on the number of parameters. For 9 parameters, I suggest a value above 20k. For fewer parameters, 10k could be a good value.

Further information: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

In [10]:
search = RandomizedSearchCV(default_svm, scoring=scoring, param_distributions=params, random_state=42, n_iter=5000,
                            cv=3, verbose=1, n_jobs=-1, return_train_score=True,refit='AUC')
search.fit(train_c_norm, train_y)
results = search.cv_results_

Fitting 3 folds for each of 5000 candidates, totalling 15000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 470 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 1470 tasks      | elapsed:    9.5s
[Parallel(n_jobs=-1)]: Done 2870 tasks      | elapsed:   17.1s
[Parallel(n_jobs=-1)]: Done 4670 tasks      | elapsed:   26.9s
[Parallel(n_jobs=-1)]: Done 6870 tasks      | elapsed:   39.1s
[Parallel(n_jobs=-1)]: Done 9470 tasks      | elapsed:   53.3s
[Parallel(n_jobs=-1)]: Done 12470 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 15000 out of 15000 | elapsed:  1.4min finished
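
With multi-metric scoring, `cv_results_` holds one set of entries per scorer name, for example `mean_test_AUC` and `rank_test_AUC`. A scaled-down sketch on synthetic data (the data set, `n_iter=5` and the range for `C` are assumptions chosen to keep it fast) shows how the results can be turned into a DataFrame and ranked:

```python
import pandas as pd
from scipy.stats import uniform
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary problem, only for demonstration.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

scoring = {"AUC": "roc_auc",
           "FBeta": metrics.make_scorer(metrics.fbeta_score, beta=0.5172)}
search = RandomizedSearchCV(
    svm.SVC(kernel="linear"),
    param_distributions={"C": uniform(0.1, 10.0)},
    scoring=scoring, refit="AUC", n_iter=5, cv=3, random_state=42,
)
search.fit(X, y)

# One row per sampled candidate, sorted by the metric used for refitting.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_AUC")
print(results[["param_C", "mean_test_AUC", "mean_test_FBeta"]])
print(search.best_params_)
```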


In [11]:
scorings = {"DMC" : own_scorer, "DMC_Norm" : own_scorer_normalized}
best_svm = search.best_estimator_  # refit with the parameters that scored best on AUC
result_dict = test_classification(best_svm, df_train=train_test, df_val=val_test)

Results Fix Split: 
DMC Score: 45  ---  Normalized DMC Score: 0.1196808510638298, 

Results Cross Validation: 
DMC Score: 69.0  ---  Normalized DMC Score: 0.1835886524822695 
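
`own_scorer` and `own_scorer_normalized` are defined in `utils.ipynb` and are not shown here. As a rough sketch of what such a score might look like, the following assumes a payoff of +5 for a detected fraud, -25 for a false alarm, -5 for a missed fraud and 0 otherwise, and assumes that the normalized variant divides by the number of samples. The function names `dmc_score` and `dmc_score_normalized` are hypothetical, and the labels are made up.

```python
def dmc_score(y_true, y_pred):
    """Hypothetical monetary score; the payoff values are assumptions."""
    payoff = {(1, 1): 5, (0, 1): -25, (1, 0): -5, (0, 0): 0}
    return sum(payoff[(int(t), int(p))] for t, p in zip(y_true, y_pred))


def dmc_score_normalized(y_true, y_pred):
    """Score per sample, assuming normalization by the number of samples."""
    return dmc_score(y_true, y_pred) / len(y_true)


# Made-up labels: 2 detected frauds, 1 false alarm, 1 missed fraud,
# 4 correctly classified non-frauds.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
print(dmc_score(y_true, y_pred), dmc_score_normalized(y_true, y_pred))
```

Under these assumed payoffs a single false alarm cancels out five detected frauds, which is one reason to favor precision in the F-beta scorer.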


## Show false predictions

In [12]:
res_df = result_dict['dataframe']
res_df[(res_df.prediction != res_df.fraud)]

| Unnamed: 0 | trustLevel | totalScanTimeInSeconds | grandTotal | lineItemVoids | scansWithoutRegistration | quantityModifications | scannedLineItemsPerSecond | valuePerSecond | lineItemVoidsPerPosition | totalScannedItems | fraud | prediction | probablity |
|-----|-----|----------|----------|----------|-----|-----|----------|----------|----------|----------|---|---|-------|
| 97  | 0.4 | 0.952591 | 0.785865 | 0.818182 | 0.8 | 0.8 | 0.011107 | 0.02021  | 0.03     | 1.0      | 0 | 1 | 0.523 |
| 134 | 0.0 | 0.471885 | 0.418833 | 0.272727 | 1.0 | 0.8 | 0.01646  | 0.021582 | 0.013636 | 0.724138 | 0 | 1 | 0.506 |
| 312 | 0.2 | 0.796582 | 0.398354 | 0.545455 | 0.9 | 0.2 | 0.011046 | 0.012213 | 0.024    | 0.827586 | 1 | 0 | 0.608 |
| 353 | 0.0 | 0.5      | 0.727638 | 0.454545 | 0.9 | 0.6 | 0.014808 | 0.035433 | 0.02381  | 0.689655 | 1 | 0 | 0.57  |
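
The misclassified rows can be split further into false positives and false negatives. A small sketch using only the `fraud`, `prediction` and `probablity` values of the four rows shown above (the row labels are kept as the index; `probablity` is the column name as it appears in the output):

```python
import pandas as pd

# The four misclassified rows from the output above, reduced to three columns.
wrong = pd.DataFrame(
    {"fraud": [0, 0, 1, 1],
     "prediction": [1, 1, 0, 0],
     "probablity": [0.523, 0.506, 0.608, 0.57]},
    index=[97, 134, 312, 353],
)

# Non-frauds flagged as fraud, and frauds that were missed.
false_positives = wrong[(wrong.fraud == 0) & (wrong.prediction == 1)]
false_negatives = wrong[(wrong.fraud == 1) & (wrong.prediction == 0)]

print(len(false_positives), len(false_negatives))
print(wrong.probablity.max())
```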


## Best Estimator

In [13]:
search.best_estimator_

SVC(C=56.98164719395536, cache_size=8000, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=True, random_state=None,
  shrinking=True, tol=0.002848504943774676, verbose=0)

## Results

| Test      | DMC | DMC Normalized     |
|-----------|-----|--------------------|
| Train/Val | 45  | 0.1196808510638298 |
| Cross Val | 69  | 0.1835886524822695 |



These results were achieved with the following estimator settings:

``` python
SVC(C=56.98164719395536, cache_size=8000, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=True, random_state=None,
  shrinking=True, tol=0.002848504943774676, verbose=0)
```     
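
To reuse these settings without rerunning the search, the estimator can be rebuilt from the values that differ from the defaults. This is only a sketch: `gamma` is left out because it has no effect with a linear kernel, and the `'auto_deprecated'` value in the printout is specific to the scikit-learn version used at the time. The variable name `best_svm` is an arbitrary choice.

```python
from sklearn import svm

# Rebuild the reported estimator from its non-default settings.
best_svm = svm.SVC(
    kernel="linear",
    C=56.98164719395536,
    tol=0.002848504943774676,
    shrinking=True,
    probability=True,
    cache_size=8000,
)
print(best_svm.get_params()["C"])
```

The rebuilt estimator is unfitted and still has to be trained with `fit` on the scaled training data before it can predict.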
