
## Payment Delay Training 

This notebook trains a model that predicts payment delays. We load the preprocessed data, drop rows without a known delay, harmonize the column data types and encode categorical variables. We then tune an XGBoost regressor with Optuna, select the best hyperparameters and fit the final model. Models are compared on the mean squared error (MSE); R² and the mean absolute percentage error (MAPE) are computed as additional metrics.

### Install and Import Packages
The first cell below installs the additional packages this notebook needs (`xgboost`, `optuna`, `optuna-dashboard`) and restarts the Python process so that they can be imported. The second cell imports everything used in the rest of the notebook.

In [0]:
%pip install xgboost
%pip install optuna
%pip install optuna-dashboard
%restart_python

In [0]:
import mlflow
from mlflow.models import infer_signature
from mlflow.client import MlflowClient
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
import optuna
import optuna.visualization as ov
from sklearn.model_selection import train_test_split
from pathlib import Path
import pandas as pd

### Set Up Spark Session and Consume Data Product
To isolate the created data assets, we work in a dedicated catalog in Databricks and a schema within that catalog. Please replace `<CATALOG_NAME>` and `<SCHEMA_NAME>` with the values that match your use case and group. You can find the correct names in the **Unity Catalog** by looking for catalog and schema names of the form `uc_XXX` and `grpX`.

Please note:
We adapted the code here to match our use case. Therefore, some lines are commented out because they are not needed here. However, they can be useful for future applications.

In [0]:
%sql
-- CREATE CATALOG IF NOT EXISTS <CATALOG_NAME>;
SET CATALOG uc_delayed_payments;
-- CREATE SCHEMA IF NOT EXISTS <SCHEMA_NAME>;
USE SCHEMA grp01;

### Prepare Data
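Replace `<TABLE_NAME>` with the name of the table that we created with the data preparation notebook. Additionally, set `<SEED_PARAMETER>` to a fixed integer so that the sample is reproducible. The cell keeps only rows with a known `delay`, drops the columns `ClearingDate` and `NetDueDate`, and draws a 25% sample.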

In [0]:
data = (spark.read.table("prepared_accounting_document").
    where(col("delay").isNotNull()).
    drop("ClearingDate", "NetDueDate").
    sample(0.25, seed=42))

### Create Training Data
In the following step we create a train-test split with `delay` as the target column. We also check the data types and adjust them where necessary. Once the data is ready, we can continue with the actual model training.

Please adjust the code by setting the values for `<SPLIT>`. We aim for an 80% / 20% train-test split. Also, set `<SEED_PARAMETER>` to a fixed integer.

In [0]:
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

In [0]:
train_df, test_df = data.drop("CompanyCode", "AccountingDocument", "FiscalYear", "AccountingDocumentItem").randomSplit([0.8, 0.2], seed=42)

The next cell converts the Spark DataFrames `train_df` and `test_df` into Pandas DataFrames.
It separates the target variable "delay" from the features for both training and testing datasets.

In [0]:
train_target = train_df.select("delay").toPandas()
train_data = train_df.drop("delay").toPandas()
test_target = test_df.select("delay").toPandas()
test_data = test_df.drop("delay").toPandas()

The function `infer_column_dtype` returns one of the strings `numeric`, `datetime`, `boolean` or `string`, depending on what the values of the column can be parsed as. The checks run in this order, and the first one that succeeds determines the result. Please adjust the return statements accordingly by replacing `<DTYPE>`.

In [0]:
def infer_column_dtype(series):
    """Classify a column as 'numeric', 'datetime', 'boolean' or 'string'."""
    values = series.dropna()

    # Try to convert to numeric
    try:
        pd.to_numeric(values)
        return 'numeric'
    except Exception:
        pass

    # Try to convert to datetime
    # (infer_datetime_format is deprecated in pandas 2.x and no longer needed)
    try:
        pd.to_datetime(values, errors='raise')
        return 'datetime'
    except Exception:
        pass

    # If all unique values are 'True' or 'False' like
    lower_vals = set(str(v).strip().lower() for v in values.unique())
    if lower_vals <= {'true', 'false', '1', '0'}:
        return 'boolean'

    return 'string'
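
Because the numeric check runs first, text columns that hold only `0`/`1` flags are classified as `numeric`, and the boolean branch is only reached for values such as `true`/`false`. The following small sketch illustrates this with two made-up series (`flags` and `words` are hypothetical and not part of the dataset):

```python
import pandas as pd

# Strings that look like 0/1 flags parse as numbers, so a numeric check
# that runs first claims them before any boolean check is reached.
flags = pd.Series(["0", "1", "1"], dtype=object)
parsed = pd.to_numeric(flags)
print(parsed.tolist())

# Words such as "true"/"false" cannot be parsed as numbers,
# so pd.to_numeric raises and the next check gets its turn.
words = pd.Series(["true", "false"], dtype=object)
try:
    pd.to_numeric(words)
    words_are_numeric = True
except (ValueError, TypeError):
    words_are_numeric = False
print(words_are_numeric)
```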


In [0]:
def cast_inferred_dtypes(df):
    """Cast every column of df in place to the dtype inferred for it."""
    bool_map = {'true': True, 'false': False, '1': True, '0': False}
    # Use `column` as loop variable: `col` would shadow pyspark's col() imported above
    for column in df.columns:
        inferred = infer_column_dtype(df[column])
        if inferred == 'numeric':
            df[column] = pd.to_numeric(df[column], errors='coerce')
        elif inferred == 'datetime':
            df[column] = pd.to_datetime(df[column], errors='coerce')
        elif inferred == 'boolean':
            # astype('bool') would turn every non-empty string, including 'false', into True,
            # so map the normalized text explicitly; missing values become <NA>
            normalized = df[column].astype(str).str.strip().str.lower()
            df[column] = normalized.map(bool_map).astype('boolean')
        else:
            df[column] = df[column].astype('category')

# apply to train data
cast_inferred_dtypes(train_data)

# apply to test data
cast_inferred_dtypes(test_data)
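
One detail of the boolean conversion is easy to miss: casting text values with `astype(bool)` does not parse them, because every non-empty string is truthy. The sketch below uses a made-up series (`raw` is hypothetical) to compare the plain cast with an explicit mapping of the normalized text:

```python
import pandas as pd

raw = pd.Series(["true", "False", " false ", "TRUE"], dtype=object)

# Plain cast: any non-empty string is truthy, so "false" also becomes True.
naive = raw.astype(bool)

# Explicit mapping: normalize the text, then look up the intended value.
bool_map = {"true": True, "false": False, "1": True, "0": False}
mapped = raw.str.strip().str.lower().map(bool_map)

print(naive.tolist())
print(mapped.tolist())
```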

## Model Training
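The function `model_training` is the objective that Optuna calls once per trial. It performs the following steps:

1. Copy the training data and convert unsupported data types
2. Set the model parameters suggested for the trial
3. Create a train-validation split
4. Train the model
5. Evaluate the model's performance on the validation dataset
6. Return the validation MSE, which Optuna uses to fine-tune the parameters

Running the code may take a while: the study runs 50 trials, and each trial fits up to 1,500 trees.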
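
The three metrics computed during evaluation behave quite differently. As a minimal sketch with made-up delays (the arrays `y_true` and `y_pred` are hypothetical), note in particular that MAPE divides by the true value, so a single payment with a delay of exactly zero days can dominate it:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error

# Hypothetical delays in days; the first invoice was paid exactly on time.
y_true = np.array([0.0, 10.0, 20.0, 30.0])
y_pred = np.array([1.0, 12.0, 18.0, 33.0])

mse = mean_squared_error(y_true, y_pred)                # mean of the squared errors
r2 = r2_score(y_true, y_pred)                           # share of variance explained
mape = mean_absolute_percentage_error(y_true, y_pred)   # explodes because of the zero delay

# MAPE restricted to rows with a non-zero true delay, for comparison.
nonzero = y_true != 0
mape_nonzero = mean_absolute_percentage_error(y_true[nonzero], y_pred[nonzero])

print(f"MSE={mse:.3f}  R2={r2:.3f}  MAPE={mape:.3e}  MAPE(non-zero)={mape_nonzero:.3f}")
```

If the real data contains many on-time payments, MSE is therefore likely the more stable quantity to optimize, which matches its use as the study objective here.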

In [0]:
mlflow.xgboost.autolog(log_input_examples=True)

Using the prepared dataset, we tune the hyperparameters of an XGBoost regressor with the Optuna `TPESampler`. The search space for the hyperparameters is configured in `params`.

In [0]:
def model_training(trial: optuna.trial.Trial, X: pd.DataFrame, y: pd.Series):
    # Convert unsupported data types (work on a copy so the caller's frame stays untouched)
    X = X.copy()
    # Use `column` as loop variable: `col` would shadow pyspark's col() imported above
    for column in X.select_dtypes(include=['category', 'datetime64[ns, UTC]']).columns:
        if X[column].dtype.name == 'category':
            X[column] = X[column].cat.codes
        elif X[column].dtype.name == 'datetime64[ns, UTC]':
            X[column] = X[column].astype('int64')

    params = {
        "booster": "gbtree",
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-2, 3e-1, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 200, 1500),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "gamma": trial.suggest_float("gamma", 0.0, 5.0),
        "min_child_weight": trial.suggest_float("min_child_weight", 1.0, 10.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 1.0),
        "random_state": 42,
    }

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = XGBRegressor(**params, enable_categorical=True, n_jobs=-1, early_stopping_rounds=10)
    model.fit(
        X_train,
        y_train,
        eval_set=[(X_valid, y_valid)],
        verbose=False
    )
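    preds = model.predict(X_valid)
    mse_score = mean_squared_error(y_valid, preds)
    mape_score = mean_absolute_percentage_error(y_valid, preds)
    r2_metric = r2_score(y_valid, preds)
    # Keep the secondary metrics with the trial so they can be inspected later
    trial.set_user_attr("mape", float(mape_score))
    trial.set_user_attr("r2", float(r2_metric))
    # The study minimizes the validation MSE
    return mse_score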

mlflow.set_registry_uri("databricks-uc")

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=42),
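    # Note: a pruner only acts on trials that report intermediate values via
    # trial.report(); model_training does not, so no trial is pruned here.
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),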
)
study.optimize(lambda trial: model_training(trial, train_data, train_target), n_trials=50)
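
The `learning_rate` range is searched with `log=True`, which means the range is treated on a logarithmic scale. The following NumPy sketch only illustrates what that scale implies for the same bounds; it is not Optuna's implementation, and the TPE sampler additionally adapts its proposals to earlier trials:

```python
import numpy as np

rng = np.random.default_rng(42)
low, high = 1e-2, 3e-1  # same bounds as the learning_rate range

# Log-uniform: draw uniformly between log(low) and log(high), then exponentiate.
log_uniform = np.exp(rng.uniform(np.log(low), np.log(high), size=10_000))
# Plain uniform draw over the same interval, for comparison.
uniform = rng.uniform(low, high, size=10_000)

geometric_mid = np.sqrt(low * high)  # roughly 0.055
share_log = (log_uniform < geometric_mid).mean()
share_uniform = (uniform < geometric_mid).mean()

print(f"below {geometric_mid:.3f}: log-uniform {share_log:.2f}, uniform {share_uniform:.2f}")
```

On the logarithmic scale about half of the draws fall below roughly 0.055, whereas a plain uniform draw would spend most of its budget on large learning rates.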

## Show Plots
Next we will have a look at the optimization plots and check which of the parameters work best for our use case.

In [0]:
%matplotlib inline

In [0]:
ov.plot_optimization_history(study)

In [0]:
ov.plot_param_importances(study)

In [0]:
ov.plot_parallel_coordinate(study)

## Retrieve Optimal Parameters
To retrieve the best parameters for our final model you can run the following code and check the output.

In [0]:
print("Number of finished trials:", len(study.trials))
print("Best MSE:", study.best_value)
print("Best params:")
for k, v in study.best_params.items():
    print(f"  {k}: {v}")

## Fit Model
Now we will fit the model using the preferred parameters and finally register the model. We will set the alias `prod` for the registered model.

In [0]:
# Convert datetime column to numerical format
# train_data['__TIMESTAMP'] = train_data['__TIMESTAMP'].astype('int64')
# test_data['__TIMESTAMP'] = test_data['__TIMESTAMP'].astype('int64')

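# Encode categorical columns as integer codes. The test data reuses the training
# categories so that the same value maps to the same code in both frames;
# values that were not seen in training are encoded as -1.
categorical_columns = train_data.select_dtypes(include=['category']).columns
for column in categorical_columns:
    categories = train_data[column].cat.categories
    test_data[column] = pd.Categorical(test_data[column], categories=categories).codes
    train_data[column] = train_data[column].cat.codes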

with mlflow.start_run(run_name="Delay_Prediction_Training") as run:
    final_xgbmodel = XGBRegressor(
        **study.best_params,
        enable_categorical=True,
        n_jobs=-1
    )
    final_xgbmodel.fit(
        train_data,
        train_target,
        verbose=False
    )
    predictions = final_xgbmodel.predict(test_data)
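    mse = mean_squared_error(test_target, predictions)
    r2_metric = r2_score(test_target, predictions)
    mape_score = mean_absolute_percentage_error(test_target, predictions)
    # Record the test metrics with the run; otherwise they are only held in variables
    mlflow.log_metrics({"test_mse": mse, "test_r2": r2_metric, "test_mape": mape_score})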
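
When categorical columns are turned into integer codes, the codes are only comparable between training and test data if both use the same list of categories. The sketch below uses two made-up columns (`train_col` and `test_col` are hypothetical) to show how independently derived codes can disagree, and how reusing the training categories resolves this:

```python
import pandas as pd

train_col = pd.Series(["DE", "FR", "US"], dtype="category")
test_col = pd.Series(["US", "DE"], dtype="category")

# Codes assigned in the training column: DE -> 0, FR -> 1, US -> 2
train_codes = dict(zip(train_col, train_col.cat.codes))

# Independent encoding: the test column numbers only its own categories,
# so "US" receives code 1 here although it is 2 in the training column.
independent = test_col.cat.codes.tolist()

# Aligned encoding: reuse the training categories for the test column.
aligned = test_col.cat.set_categories(train_col.cat.categories).cat.codes.tolist()

print(train_codes, independent, aligned)
```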

In [0]:
trained_model = mlflow.register_model(f"runs:/{run.info.run_id}/model", "delay_prediction")

Please set the alias of the registered model version to `prod` by replacing the value `<ALIAS>` accordingly.

In [0]:
mlflow_client = mlflow.MlflowClient()
mlflow_client.set_registered_model_alias(name=trained_model.name, alias="prod", version=trained_model.version)
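
Once the alias is set, consumers can refer to the model by alias instead of by version number, using MLflow's `models:/<name>@<alias>` URI form. As a small sketch, the helper `alias_model_uri` below is hypothetical and only builds that string; depending on the registry configuration, the model name may need to be fully qualified with catalog and schema, which is an assumption to verify in your workspace:

```python
def alias_model_uri(model_name: str, alias: str) -> str:
    """Build an MLflow model URI that points at a registered-model alias."""
    return f"models:/{model_name}@{alias}"

uri = alias_model_uri("delay_prediction", "prod")
print(uri)

# In a workspace with MLflow configured, this URI could then be passed to
# mlflow.pyfunc.load_model(uri) to load whichever version carries the alias.
```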