# Payment Scoring

*... a tutorial for students at the FHNW, written by [Andreas Martin, PhD](https://andreasmartin.ch).*

|[![deepnote](https://deepnote.com/buttons/launch-in-deepnote-small.svg)](https://deepnote.com/launch?url=https%3A%2F%2Fgithub.com%2FAI4BP%2Fainotes%2Fblob%2Fmain%2Fipynb%2Forder-approval-process%2Forder-approval.ipynb)|[![Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4BP/ainotes/blob/main/ipynb/order-approval-process/order-approval.ipynb)|[![Gitpod](https://img.shields.io/badge/Gitpod-Run%20in%20VS%20Code-908a85?logo=gitpod)](https://gitpod.io/#https://github.com/AI4BP/ainotes/)|[![GitHub.dev](https://img.shields.io/badge/github.dev-Open%20in%20VS%20Code-908a85?logo=github)](https://github.dev/AI4BP/ainotes/blob/main/ipynb/order-approval-process/order-approval.ipynb)|
|-|-|-|-|

## 0. Initialization Configuration
The following cell sets up the configuration: `url_data` points to the data files and `url_modelling` points to the BPMN/DMN models.

In [None]:
url_github = "https://raw.githubusercontent.com/AI4BP/ainotes/main"
project_name = "order-approval-process"
url_data = f"{url_github}/data/{project_name}"
url_modelling = f"{url_github}/modelling/{project_name}"

## 1. Load the CSV File
Load the CSV file from GitHub into the *data* variable by using [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). Pandas is a data analysis and manipulation library; it is used here and in the following steps, up to the point where the dataset is separated.

In [None]:
import pandas

data = pandas.read_csv(f"{url_data}/payment-scoring.csv", sep=",")

data

## 2. Map Categories to Numbers
To feed that data into our ML model, we need to convert and map the categorical strings to numbers.

In [None]:
legal_entity = {"private": 0, "juristical": 1}
data.legal_entity = [legal_entity[item] for item in data.legal_entity]
payment_method = {"invoice": 0, "cash": 1, "creditcard": 2, "prepayment": 3, "twint": 4}
data.payment_method = [payment_method[item] for item in data.payment_method]
scoring = {"green": 0, "orange": 1, "red": 2}
data.scoring = [scoring[item] for item in data.scoring]

data
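The list comprehensions above raise a `KeyError` as soon as the CSV contains a label that is missing from the dictionary. As an alternative, the following minimal sketch uses `Series.map` and checks for unmapped labels explicitly. The `sample` frame is hypothetical stand-in data, not the real CSV content.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from payment-scoring.csv.
sample = pd.DataFrame({"scoring": ["green", "red", "orange", "green"]})
scoring_map = {"green": 0, "orange": 1, "red": 2}

# Series.map returns NaN for any label that is missing from the dictionary.
mapped = sample["scoring"].map(scoring_map)

# Fail early, with a readable message, if a label was not covered.
unmapped = sample.loc[mapped.isna(), "scoring"].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped labels: {list(unmapped)}")

sample["scoring"] = mapped.astype(int)
print(sample["scoring"].tolist())
```

The same pattern should apply to `legal_entity` and `payment_method`.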

## 3. Data Segregation and Feature Selection
For further processing, we need to separate the features (**X**) from the target column `scoring` (**y**) as follows.

In [None]:
X_data = data.drop("scoring", axis=1)
y_data = data.scoring

print("X Data:\n", X_data)
print("y Data:\n", y_data)

## 4. Data Partitioning
We partition the dataset into a training set and a testing set (here 60% and 40%) so that the model can be evaluated on data it has not seen during training.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.40, random_state=None
)

print("X_train:\n", X_train)
print("y_train:\n", y_train)
print("X_test:\n", X_test)
print("y_test:\n", y_test)
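With `random_state=None`, every run produces a different split, so the metrics change from run to run. The following sketch, on hypothetical toy data, shows two optional arguments of `train_test_split` that may help: a fixed `random_state` for reproducibility and `stratify` to keep the class proportions similar in both partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, three classes (10 x 0, 6 x 1, 4 x 2).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 6 + [2] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.40,
    random_state=42,  # fixed seed: the same split on every run
    stratify=y_toy,  # keep class proportions similar in both partitions
)

# Class counts in the training and the testing partition.
print("train:", np.bincount(y_tr))
print("test: ", np.bincount(y_te))
```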

## 5. Initialize Learner
Now we are going to initialize the learner for our classification problem.

In [None]:
from sklearn.linear_model import LogisticRegression

# multi_class="auto" was the default and is deprecated in recent scikit-learn releases, so it is omitted.
model = LogisticRegression(max_iter=1000)

print("Model: ", model)

### 🚧 Other Classification Models

In [None]:
# from sklearn.tree import DecisionTreeClassifier
# model = DecisionTreeClassifier()

In [None]:
# from sklearn.svm import SVC
# model = SVC()

In [None]:
# from sklearn.naive_bayes import GaussianNB
# model = GaussianNB()

In [None]:
# from sklearn.neighbors import KNeighborsClassifier
# model = KNeighborsClassifier()

> ‼️ You can only initialize one model per run of the pipeline.
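One possible way to respect this constraint without commenting cells in and out is to keep the candidates in a dictionary and pick one by name. This is only a sketch: the dictionary keys and the variable names `candidates`, `model_name` and `chosen` are hypothetical and not part of the original notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical registry; the keys are arbitrary names chosen for this sketch.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(),
    "svc": SVC(),
    "nb": GaussianNB(),
    "knn": KNeighborsClassifier(),
}

model_name = "logreg"  # change this single value to switch the learner
chosen = candidates[model_name]

print("Model: ", chosen)
```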

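## 6. Tune Class Weights
Class weights control how strongly errors on each class are penalized during training. In the following cell, the weights for the three scoring classes are set manually.
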
In [None]:
model.class_weight = {0: 1.0, 1: 0.5, 2: 0.25}

print("Model: ", model)
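As a reference point for hand-picked weights, sklearn can compute "balanced" weights that are inversely proportional to the class frequencies. The sketch below uses a hypothetical label vector and also checks which of the classifiers mentioned in this tutorial expose a `class_weight` parameter at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 x class 0, 3 x class 1, 1 x class 2.
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_toy)

# "balanced" computes n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
balanced = {int(c): float(w) for c, w in zip(classes, weights)}
print("Balanced weights:", balanced)

# Not every estimator accepts class weights; check the parameter list first.
for est in (LogisticRegression(), GaussianNB(), KNeighborsClassifier()):
    supported = "class_weight" in est.get_params()
    print(type(est).__name__, "supports class_weight:", supported)
```

For estimators without this parameter, assigning `model.class_weight` has no effect on training.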

### 🚧 Grid Search

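Instead of choosing the weights by hand, we can use grid search with 5-fold cross-validation to find the best weight for the last class; classes that are not listed in the dictionary keep the default weight of 1.0.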

In [None]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.utils.multiclass import is_multilabel

weights = np.linspace(0.0, 0.99, 100)
class_weight = {
    "class_weight": [{len(np.unique(y_data)) - 1: 1.0 - x} for x in weights]
}

grid = GridSearchCV(
    estimator=model,
    param_grid=class_weight,
    # Other scorers to try: accuracy, precision, average_precision,
    # roc_auc_ovr, roc_auc_ovo, roc_auc_ovo_weighted
    # y_data is multiclass (not multilabel), so "f1_micro" is used here.
    scoring="roc_auc" if is_multilabel(y_data) else "f1_micro",
    n_jobs=-1,
    cv=5,
    verbose=2,
)

grid.fit(X_train, y_train)

print(f"Class weight: {grid.best_params_['class_weight']}")
print(f"Best cross-validated score ({grid.scoring}): {grid.best_score_}")
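Note that the grid search does not change `model` itself. The following self-contained sketch, using hypothetical synthetic data and a much smaller grid, shows how the best configuration could be retrieved from the fitted search object through `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic three-class data standing in for X_train / y_train.
X_toy, y_toy = make_classification(
    n_samples=150,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Small grid: only the weight of class 2 varies; other classes default to 1.0.
param_grid = {"class_weight": [{2: w} for w in (0.25, 0.5, 1.0)]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1_micro",
    cv=5,
)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is refitted on all data passed to fit.
tuned_model = search.best_estimator_
print("Best class weight:", search.best_params_["class_weight"])
print("Best cross-validated f1_micro:", round(search.best_score_, 3))
```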

## 7. Train Model
Now we can train the configured model on the training set by using the sklearn `fit` method.

In [None]:
model.fit(X_train, y_train)

## 8. Make Predictions

After training, we can use our testing set to make predictions by using the `predict` method of sklearn. With the predictions, we can then calculate performance metrics.

In [None]:
y_pred = model.predict(X_test)

print("Predictions (y): ", y_pred)
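Besides hard class labels, `LogisticRegression` can also return class probabilities through `predict_proba`, which may be useful to see how confident a scoring decision is. The sketch below uses hypothetical synthetic data rather than the payment dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic three-class data standing in for the payment data.
X_toy, y_toy = make_classification(
    n_samples=120,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# One row per sample, one column per class (ordered as in clf.classes_).
proba = clf.predict_proba(X_toy[:5])
labels = clf.predict(X_toy[:5])

for label, p in zip(labels, proba):
    print("predicted:", label, "probabilities:", np.round(p, 3))
```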

## 9. Scoring
Using various scoring metrics, we can examine how well the trained model performs on the test set.

### 9.1 Precision, Recall, F1 and Accuracy
In the following, several sklearn functions are used to compute the overall (micro-averaged) precision, recall and F1 as well as the overall accuracy.

In [None]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score

RS = recall_score(y_test, y_pred, average="micro")
print(f"Recall Score (RS): {100 * RS:.2f}%")

PS = precision_score(y_test, y_pred, average="micro")
print(f"Precision Score (PS): {100 * PS:.2f}%")

F1 = f1_score(y_test, y_pred, average="micro")
print(f"F1: {100 * F1:.2f}%")

AS = accuracy_score(y_test, y_pred)
print(f"Accuracy Score (AS): {100 * AS:.2f}%")
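In a single-label multiclass problem, micro-averaged precision, recall and F1 all coincide with accuracy, so the four numbers above are expected to be identical. The sketch below illustrates this on hypothetical labels and contrasts it with `average="macro"`, which weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true labels and predictions for a three-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 0, 1, 1, 2, 2, 2]

acc = accuracy_score(y_true, y_hat)
f1_micro = f1_score(y_true, y_hat, average="micro")  # pooled over all samples
f1_macro = f1_score(y_true, y_hat, average="macro")  # unweighted mean over classes

print(f"Accuracy: {100 * acc:.2f}%")
print(f"F1 (micro): {100 * f1_micro:.2f}%")
print(f"F1 (macro): {100 * f1_macro:.2f}%")
```

Macro averaging is usually more informative when the classes are imbalanced, because a weak minority class lowers the score visibly.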

The `classification_report` function of sklearn generates a report with precision, recall and F1 for each class or label.

In [None]:
from sklearn.metrics import classification_report

CR = classification_report(y_test, y_pred)

print("Classification Report (CR):\n", CR)

### 9.2 Confusion Matrix
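In the following, a confusion matrix is generated by using the `confusion_matrix` function of sklearn. Each row corresponds to a true class and each column to a predicted class.

> In binary classification problems, the confusion matrix consists of the number of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) predictions. In a multiclass problem such as this one, the diagonal holds the correct predictions per class.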

In [None]:
from sklearn.metrics import confusion_matrix

CM = confusion_matrix(y_test, y_pred)

print("Confusion Matrix (CM):\n", CM)
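For more than two classes, TP, FP, FN and TN can still be derived per class from the matrix in a one-vs-rest manner. The following minimal sketch uses hypothetical labels and relies on the sklearn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for a three-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 0, 1, 1, 2, 2, 2]

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

tp = np.diag(cm)  # correctly predicted samples per class
fp = cm.sum(axis=0) - tp  # predicted as the class, but belonging to another
fn = cm.sum(axis=1) - tp  # belonging to the class, but predicted as another
tn = cm.sum() - (tp + fp + fn)  # everything else

for cls in range(len(tp)):
    print(f"class {cls}: TP={tp[cls]} FP={fp[cls]} FN={fn[cls]} TN={tn[cls]}")
```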

The generated confusion matrix can be plotted with `ConfusionMatrixDisplay` of sklearn.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

disp = ConfusionMatrixDisplay(confusion_matrix=CM, display_labels=model.classes_)
disp.plot(values_format="")
plt.show()
