# Prediction of "Sachkonto" (G/L Account) from a Few Columns, Based on Training Data from SAP BSAK

### Possible improvements
7. Imbalanced data set: find a strategy to counter this.
8. Use CatBoost instead of XGBoost, just for comparison.
9. Optimize the model with "AutoML" or some form of (Bayesian) hyperparameter search (possibly better than grid search).
10. Consider using scikit-learn's `ColumnTransformer` and `Pipeline` instead of doing everything manually; this would make the entire pipeline testable.
11. Split data preparation into a separate notebook.
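
Item 10 could look roughly like the following. This is only a minimal sketch under assumptions: the toy rows and account numbers are made up, the categorical columns are assumed to be ordinal-encoded, and `DecisionTreeClassifier` is a stand-in for the XGBoost model to keep the example light on dependencies.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up); the column names mirror the features used in this notebook.
toy = pd.DataFrame({
    "Buchungskreis": ["1000", "1000", "2000", "2000", "1000", "2000"],
    "Lieferant": ["A", "B", "A", "C", "B", "C"],
    "Steuerkennzeichen": ["V1", "V0", "V1", "V0", "V0", "V0"],
    "Sachkonto": ["400000", "410000", "400000", "410000", "410000", "410000"],
})
categorical_columns = ["Buchungskreis", "Lieferant", "Steuerkennzeichen"]

# Encode the categorical columns; categories unseen during fit are mapped to -1.
preprocess = ColumnTransformer(
    transformers=[
        (
            "categorical",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            categorical_columns,
        ),
    ],
    remainder="drop",
)

# Preprocessing and model are fitted and applied together as one object.
pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", DecisionTreeClassifier(random_state=42)),  # stand-in for XGBClassifier
])

pipeline.fit(toy[categorical_columns], toy["Sachkonto"])
predictions = pipeline.predict(toy[categorical_columns])
print(predictions)
```

Because the encoders live inside the pipeline, the same object can be cross-validated, unit-tested, and applied to raw rows without repeating the manual encoding steps.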


# Imports

In [1]:
import pandas as pd
import numpy as np
from utils_bsak import printSamplesFromSaktos
from utils_bsak import is_date_column, is_decimal_column, convert_column_decimal2float

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from utils_bsak import target_min_value_records

import joblib

import xgboost as xgb


# Load Preprocessed Data & Encoders

#### Files for Reading Preprocessed Data and Writing the ONNX Model

In [2]:
# filenames for ONNX model

onnx_model_name = "Sachkonto_stratified" # name of the exported ONNX model that is going to be generated
path_model= "models/model_" + onnx_model_name + ".onnx"

# files with preprocessed data

folder_preprocessed_data = "../data_preprocessed/"
file_joblib_data = 'Data_' + onnx_model_name + '.pkl'
path_joblib_data = folder_preprocessed_data + file_joblib_data # path for dumping preprocessed data



file_joblib_onnx_params = 'OnnxParams_' + onnx_model_name + '.pkl'
path_joblib_onnx_params = folder_preprocessed_data + file_joblib_onnx_params

### Load Data

In [None]:
# Code snippet: load preprocessed data & encoders
data = joblib.load(path_joblib_data)

X_train = data["X_train"]
X_test = data["X_test"]
y_train = data["y_train"]
y_test = data["y_test"]
column_encoders = data["column_encoders"]
target_dict = data["target_dict"]
Steuerkennzeichen_dict = data["Steuerkennzeichen_dict"]

## Use Boruta to find the most relevant features in the dataset

In [4]:
#Use Boruta after the split to avoid data leakage

# for notebook control:
apply_Boruta = False
#apply_Boruta = True

if(apply_Boruta):

    from boruta import BorutaPy

    # Estimator for Boruta; it must expose feature_importances_.
    # Note: despite the variable name, this is an XGBoost classifier, not a random forest.
    rf_model = xgb.XGBClassifier(n_jobs=-1, verbosity=0)

    boruta_selector = BorutaPy(rf_model, n_estimators='auto', random_state=42)
    boruta_selector.fit(X_train.values, y_train)

    # Check the results
    selected_features = X_train.columns[boruta_selector.support_].tolist()
    print("--------------------------------")
    print("Selected Features:", selected_features)

    # Optional: Features that were rejected
    rejected_features = X_train.columns[~boruta_selector.support_].tolist()
    print("Rejected Features:", rejected_features)
    print("--------------------------------")

### All datasets combined Boruta selected_features (runtime: 21m 40s):
Selected Features: ['Buchungskreis', 'Lieferant', 'Position', 'WÃ¤hrung', 'Belegart', 'Buchungsperiode', 'Steuerkennzeichen', 'Betrag', 'Funktionale WÃ¤hrung', 'Zahlungsbedingung', 'Tage 1', 'Skontoprozentsatz 1', 'Skontobasis', 'Skontobetrag', 'Skontobetrag.1', 'Zahlweg', 'Zahlungssperre', 'Hausbank', 'Partnerbanktyp', 'Steuerkennzeichen.1', 'Steuerkennzeichen.2', 'HW-2-Betrag', 'Skontobetrag HW2', 'ReferenzschlÃ¼ssel 2', 'WÃ¤hrung Hauptbuch', 'Betrag Hauptbuch', 'Profitcenter', 'Position im Sender System', 'Währung', 'Funktionale Währung', 'Währung Hauptbuch']

# Train XGBoost Model

## Define a factory-function for the model

In [5]:
# define a factory-function for the model - to always work with the same type of model (... could be implemented as singleton...)
def create_Model():
    return xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')

## Define Selected Features

Using the results from the Boruta analysis, define different sets of columns to train on. <br>
Since this is a demonstrator, I tried different features to prove flexibility. <br>
Which columns to use depends on the front-end application in which the model is going to be used. <br>

In [None]:

from sklearn.model_selection import train_test_split

# Using the results from the Boruta-analysis - define different sets of columns to train on.
# Since this is a demonstrator I tried to use different features to prove flexibility.
# Which columns to use is dependent on the front-end application in which the model is going to be used.

# all combined:
#all_combined =  ['Buchungskreis', 'Lieferant', 'Position'] # Accuracy (in %): 96.62 +/- 2.62 --- "Position" is weird - how can this be relevant?!
all_combined =  ['Buchungskreis', 'Lieferant', 'Steuerkennzeichen']


## Train on Selected Features

In [None]:
# choose one of the feature selections:
final_seleced_features =  all_combined

X_train = X_train[final_seleced_features]
X_test = X_test[final_seleced_features]

# automatic feature selection from Boruta selected_features:
""" if selected_features and len(selected_features) > 3:
    X_train = X_train[selected_features[:3]]
    X_test = X_test[selected_features[:3]]
    print("--------------------------------")
    print(f"selected_features[:3]: {selected_features[:3]}")
    print("--------------------------------")
else:
    X_train = X_train[selected_features]
    X_test = X_test[selected_features]
    print("--------------------------------")
    print(f"selected_features: {selected_features}") 
    print("--------------------------------") """

print("--------------------------------")
print(f"Features the model is trained on: {final_seleced_features}")
print("--------------------------------")
print(f"X_train.shape: {X_train.shape}")
print(f"X_test.shape : {X_test.shape}")
print("--------------------------------")
print(f"X_train.head(3):{X_train.head(3)}")
print(f"X_test.head(3) : {X_test.head(3)}")
print("--------------------------------")

model = create_Model()
model.fit(X_train, y_train)


# Model Quality Assessment

## Simple Run Accuracy

In [None]:
# make predictions for test data
print("--------------------------------")
print(f"We have {y_test.shape[0]} rows of test-data.")
print("--------------------------------")

y_pred = model.predict(X_test)
""" 
print("--------------------------------")
print(y_pred[:10])
print("--------------------------------")
#predictions = [round(value) for value in y_pred] """

# evaluate predictions
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
#accuracy = accuracy_score(y_test, predictions)
print("--------------------------------")
print("Simple One-Run Accuracy: %.2f%%" % (accuracy * 100.0))
print("--------------------------------")

## Check scikit-learn's classification_report:

In [None]:
from sklearn.metrics import classification_report

# Evaluate the model with precision, recall, and F1-score
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=target_label_encoder.classes_.astype(str)))
print("--------------------------------")
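
To make the columns of the classification report easier to read, the sketch below recomputes precision, recall, F1-score, and support per class with plain NumPy. The function name `per_class_report` and the toy labels are hypothetical and only serve as illustration; labels are assumed to be integer-encoded.

```python
import numpy as np

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label that occurs."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    report = {}
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = int(np.sum((y_pred == label) & (y_true == label)))  # correctly predicted as label
        fp = int(np.sum((y_pred == label) & (y_true != label)))  # wrongly predicted as label
        fn = int(np.sum((y_pred != label) & (y_true == label)))  # label that was missed
        # precision: how many of the predictions for this label were right
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        # recall: how many of the true members of this label were found
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        # F1: harmonic mean of precision and recall
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        # support: number of true samples of this label
        report[int(label)] = (precision, recall, f1, tp + fn)
    return report

# Toy labels (made up): class 2 is rare and never predicted.
toy_true = [0, 0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 0]
toy_report = per_class_report(toy_true, toy_pred)
for label, (precision, recall, f1, support) in toy_report.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

In the toy example the rare class 2 gets precision, recall, and F1 of 0 although most predictions overall are correct, which is the kind of detail that plain accuracy hides.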

## Accuracy with Cross-Validation

In [None]:
import warnings

warnings.simplefilter('ignore', UserWarning)

from sklearn.model_selection import cross_val_score, KFold

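# df, target and y are not defined in this notebook (data preparation lives elsewhere),
# so cross-validate on the loaded splits instead. Assumption: y_train / y_test are
# already label-encoded integers, as required by model.fit above.
X = pd.concat([X_train, X_test], axis=0)
y_trans = np.concatenate([np.asarray(y_train), np.asarray(y_test)]).astype(int)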
print("----------- Check encoding ---------------------")
print(y_trans[:10])
print("--------------------------------")

results = cross_val_score(model, X, y_trans, cv=5)
print("--------------------------------")
print(f"Accuracy (in %): {results.mean() * 100:.2f} +/- {results.std() * 100:.2f}")
print("--------------------------------")

## Confusion Matrices

### Straight Confusion Matrix

In [None]:
from utils_bsak import plot_confusion_matrix

y_pred = model.predict(X_test)

plot_confusion_matrix(y_test=y_test, y_pred=y_pred, labels=target_label_encoder.classes_)

### Top-k Confusion Matrix

In [None]:
from utils_bsak import plot_top_k_confusion_matrix

k = 3
y_pred_prob = model.predict_proba(X_test)
plot_top_k_confusion_matrix(y_test=y_test, y_pred_prob=y_pred_prob, labels=target_label_encoder.classes_, top_k=k, show_off_top_k_info=False)
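
The idea behind a top-k evaluation is that a prediction counts as correct if the true class is among the k classes with the highest predicted probability. A minimal NumPy sketch of the corresponding accuracy follows; `top_k_accuracy` and the toy probabilities are hypothetical, and the class labels are assumed to be encoded as 0..n-1 so that they match the columns of `predict_proba`.

```python
import numpy as np

def top_k_accuracy(y_true, y_prob, k=3):
    """Fraction of rows whose true class is among the k most probable classes."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Column indices of the k largest probabilities per row (order inside the top k is irrelevant).
    top_k = np.argsort(y_prob, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())

# Toy probabilities (made up): 4 samples, 4 classes.
toy_prob = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.35, 0.20, 0.05],
    [0.30, 0.20, 0.40, 0.10],
])
toy_true = np.array([0, 1, 3, 1])

print("top-1 accuracy:", top_k_accuracy(toy_true, toy_prob, k=1))
print("top-3 accuracy:", top_k_accuracy(toy_true, toy_prob, k=3))
```

For a front end that shows several suggested accounts, the top-k figure may be the more relevant one, since the user only needs the right account to appear in the short list.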

# Export/ Convert Model to ONNX

## Model Conversion and Saving

In [None]:
import xgboost as xgb
from onnxmltools import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType

# rename the columns of X to make Onnx conversion possible:
X_old_columns = { f"f{i}" : col for i, col in enumerate(X_train.columns)}
X_train.columns = [f"f{i}" for i in range(X_train.shape[1])]

# Create DMatrix objects (XGBoost's internal data container) with enable_categorical set
dtrain = xgb.DMatrix(data=X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(data=X_test, label=y_test, enable_categorical=True)

# Convert the model to ONNX format
initial_type = [('float_input', FloatTensorType([None, X_train.shape[1]]))]
onnx_model = convert_xgboost(model, initial_types=initial_type)

# Save the ONNX model to the file given by path_model:

with open(path_model, "wb") as f:
    f.write(onnx_model.SerializeToString())

print("------------------------------------------------------------------------")
print(f"model name: {onnx_model_name}")
print("------------------------------------------------------------------------")



## Write ONNX Model Parameters

In [None]:
import joblib, onnxmltools, xgboost

onnx_parameters = {
    "onnx_model_name" : onnx_model_name,
    "trained_features" : final_seleced_features,
    "xgboost_version" : xgboost.__version__,
    "onnxmltools_version" : onnxmltools.__version__,
}

joblib.dump(onnx_parameters, path_joblib_onnx_params)

['data_preprocessed/OnnxParams_Sachkonto_stratified_All3.pkl']

In [None]:
import xgboost
print(xgboost.__version__)

#Answer:
# 1.4.2

2.1.1


In [None]:
import onnxmltools
print(onnxmltools.__version__)

# Answer:
# 1.7.0

1.13.0


Working combination of versions of xgboost and onnxmltools:

xgboost : 1.4.2 <br>
onnxmltools : 1.7.0
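
Since the conversion depends on this version combination, a small check at the start of the notebook can flag deviations early. The sketch below is one possible way to do that; the helper names `installed_versions` and `find_version_mismatches` are hypothetical, and the expected versions are the ones listed above.

```python
from importlib import metadata

def installed_versions(packages):
    """Look up installed versions; packages that are not installed map to None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

def find_version_mismatches(expected, installed):
    """Return {package: (expected, installed)} for every package whose version differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in expected.items()
        if installed.get(name) != wanted
    }

# Version combination reported as working in this notebook.
known_good = {"xgboost": "1.4.2", "onnxmltools": "1.7.0"}

mismatches = find_version_mismatches(known_good, installed_versions(known_good))
for name, (wanted, found) in mismatches.items():
    print(f"{name}: expected {wanted}, found {found}")
```

The same comparison could be run against the versions stored in the ONNX parameter file when the model is loaded elsewhere.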

# TO DO : 
+ explain Classification Report
+ deal with imbalance

#### Note: Strategies to Address Imbalance 

+ Cost-Sensitive Learning: Assign higher misclassification costs to the minority class, encouraging the model to consider it more seriously.
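+ Alternative Splitting Criteria: Use metrics like the Hellinger distance instead of traditional ones like information gain. The Hellinger distance measures how far apart two probability distributions are; as a split criterion it compares the class-conditional distributions of a feature, so it does not depend on the class priors and is therefore less sensitive to skewed class distributions.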
+ Sampling Techniques: Balance the dataset by oversampling the minority class or undersampling the majority class, or by using wrapper frameworks that combine sampling with the splitting metric.
+ Adjusted Evaluation Metrics: Accuracy alone is misleading in imbalanced settings. Instead, prioritize metrics like precision, recall, and F1-score to assess the model’s performance on the minority class more accurately.
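
As a starting point for cost-sensitive learning, each training row can be weighted inversely to the frequency of its class, so that every class contributes the same total weight. The sketch below uses the common "balanced" heuristic `n_samples / (n_classes * class_count)`; the function name and the toy labels are hypothetical.

```python
import numpy as np

def balanced_sample_weights(y):
    """Per-sample weights so that every class carries the same total weight."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # Rare classes get large weights, frequent classes get small ones.
    class_weight = len(y) / (len(classes) * counts)
    weight_by_class = dict(zip(classes.tolist(), class_weight.tolist()))
    return np.array([weight_by_class[label] for label in y.tolist()])

# Toy labels (made up): class 0 dominates, class 2 is rare.
toy_y = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = balanced_sample_weights(toy_y)
print(np.round(toy_weights, 3))
```

Such weights could then be passed to the classifier through the `sample_weight` argument of `fit`, for example `model.fit(X_train, y_train, sample_weight=balanced_sample_weights(y_train))`; whether this actually improves the minority classes here would still need to be checked with the per-class metrics.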