## Payment Delay Data Preparation

This notebook covers the setup and cleaning of the data. We will prepare the data by handling missing values, removing duplicates, and updating data types. Additionally, we will normalize the data, encode categorical variables, and create new features. This is followed by a visualization of the distribution to identify any patterns or anomalies.

Here are the steps of this exercise:

1. Install and import packages
2. Load data from data product `Entry View Journal Entry`
3. Data Preparation
4. Persist prepared data

### 1. Install and import packages

In the next few cells, we will install and import the required packages.

In [0]:
%pip install databricks-feature-engineering
%pip install ydata-profiling
%restart_python

In [0]:
import mlflow
import pandas as pd
from databricks.feature_engineering import FeatureEngineeringClient
from pyspark.sql import SparkSession
# Note: sum, min, and max imported here shadow the Python built-ins of the same name
from pyspark.sql.functions import (
    col, date_trunc, datediff, explode, expr, lit, max, min,
    row_number, sequence, sum, to_date, when,
)
from pyspark.sql.window import Window
from ydata_profiling import ProfileReport

# &#x270D;
Please replace the values `<CATALOG_NAME>` and `<SCHEMA_NAME>` with the values that match your use case and group. You can find the correct names in the **Unity Catalog**: look for the catalog and schema names that follow the patterns `uc_XXX` and `grpX`.

In [0]:
%sql
-- CREATE CATALOG IF NOT EXISTS uc_delayed_payments;
SET CATALOG uc_delayed_payments;
CREATE SCHEMA IF NOT EXISTS grp01;
USE SCHEMA grp01;

### 2. Load data from data product `Entry View Journal Entry`

# &#x270D;
Replace the value `<DELTA_SHARE_TABLE_PATH>` with the full path name of the table `operationalacctgdocitem` in the data product `Entry View Journal Entry`. You'll find the correct full path name by navigating to **Unity Catalog** --> **Delta Shares Received**.
Hint: the path name follows the pattern `share_name`.`schema_name`.`table_name`.

![operationalacctgdocitem.png](../../images/operationalacctgdocitem.png)
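
As a small illustration of the three-part naming pattern, the following plain-Python sketch splits a full path name into its parts and rejects names that do not have exactly three non-empty components. The helper `split_table_path` is hypothetical and only meant for checking a path before passing it to `spark.read.table`; it is not part of any Databricks API.

```python
def split_table_path(full_name):
    """Split 'share_name.schema_name.table_name' into its three parts."""
    parts = full_name.split(".")
    # A valid path has exactly three non-empty components
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected share.schema.table, got {full_name!r}")
    return tuple(parts)


share, schema, table = split_table_path(
    "bdc_share_journal_entry.entryviewjournalentry.operationalacctgdocitem"
)
print(share, schema, table)
```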

In [0]:
data = spark.read.table("bdc_share_journal_entry.entryviewjournalentry.operationalacctgdocitem")

### 3. Data Preparation
# &#x270D;
Use the following code to drop unnecessary columns. Add the following columns to the list of columns to drop (`<COLUMNS_TO_DROP>`):
- CashDiscount2DueDate
- CashDiscount1DueDate
- DocumentDate
- PostingDate
- TaxDeterminationDate
- AcctgDocItmCstmsClearanceDate
- DueCalculationBaseDate
- AssetValueDate
- ValueDate
- ClearingCreationDate


In [0]:
# create list of columns that need to be dropped
to_drop = ["CashDiscount2DueDate", "CashDiscount1DueDate", "DocumentDate", "PostingDate", "TaxDeterminationDate", "AcctgDocItmCstmsClearanceDate", "DueCalculationBaseDate", "AssetValueDate", "ValueDate", "ClearingCreationDate"]

selected_data = data.drop(*to_drop)

Now adjust the following code to replace empty strings with `None`.

In [0]:
transactional_data = selected_data.replace('', None)

In the following cell, we define a helper function that drops all columns containing only null values.

In [0]:
def drop_fully_null_columns(df, but_keep_these=None):
    """Drops DataFrame columns that are fully null
    (i.e. the maximum value is null)

    Arguments:
        df {spark DataFrame} -- spark dataframe
        but_keep_these {list} -- list of columns to keep without checking for nulls

    Returns:
        spark DataFrame -- dataframe with fully null columns removed
    """
    # avoid a mutable default argument
    but_keep_these = but_keep_these or []

    # skip checking some columns
    cols_to_check = [c for c in df.columns if c not in but_keep_these]
    if not cols_to_check:
        return df

    # max() ignores nulls, so the result is None only if every value in the column is null
    max_per_column = df.select(*cols_to_check).groupby().agg(*[max(c).alias(c) for c in cols_to_check]).take(1)[0]
    cols_to_drop = [c for c, max_value in max_per_column.asDict().items() if max_value is None]
    return df.drop(*cols_to_drop)
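
To make the idea behind the helper easier to follow, here is a minimal plain-Python sketch of the same logic on a list of dictionaries instead of a Spark DataFrame. A column is dropped only if its value is `None` in every row. The function `fully_null_columns` and the column names `AssetValueNote` and `Amount` are made up for illustration.

```python
def fully_null_columns(rows, keep=()):
    """Return the column names whose value is None in every row."""
    columns = list(rows[0].keys()) if rows else []
    return [
        c for c in columns
        if c not in keep and all(row[c] is None for row in rows)
    ]


# Illustrative sample data, not taken from the data product
rows = [
    {"Customer": "C1", "AssetValueNote": None, "Amount": 10},
    {"Customer": None, "AssetValueNote": None, "Amount": 25},
]

null_cols = fully_null_columns(rows)
cleaned = [{k: v for k, v in row.items() if k not in null_cols} for row in rows]
print(null_cols)
print(cleaned)
```

Note that `Customer` is kept even though one of its values is `None`, because the column is only partially null.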

# &#x270D;
Set the primary keys by adding the following columns to the list of primary keys `<PRIMARY_KEYS>`:
- CompanyCode
- AccountingDocument
- FiscalYear
- AccountingDocumentItem

In [0]:
primary_key = ["CompanyCode", "AccountingDocument", "FiscalYear", "AccountingDocumentItem"]
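
A primary key is only useful if the combination of these columns identifies each row uniquely. The following plain-Python sketch shows one way to look for duplicate key combinations. The helper `duplicate_keys` and the sample values are assumptions for illustration; on the real data, a similar check could be done with a Spark `groupBy` on the key columns.

```python
from collections import Counter

primary_key = ["CompanyCode", "AccountingDocument", "FiscalYear", "AccountingDocumentItem"]


def duplicate_keys(rows, key_columns):
    """Return every key combination that occurs more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Made-up sample rows; the third row repeats the key of the first
rows = [
    {"CompanyCode": "1010", "AccountingDocument": "100000001", "FiscalYear": "2024", "AccountingDocumentItem": "001"},
    {"CompanyCode": "1010", "AccountingDocument": "100000001", "FiscalYear": "2024", "AccountingDocumentItem": "002"},
    {"CompanyCode": "1010", "AccountingDocument": "100000001", "FiscalYear": "2024", "AccountingDocumentItem": "001"},
]

dupes = duplicate_keys(rows, primary_key)
print(dupes)
```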

In [0]:
transactional_data = transactional_data.where(col("Customer").isNotNull())

In [0]:
transactional_data = drop_fully_null_columns(transactional_data)
converted_data_types = {column: col(column).cast('integer').alias(column) for column, column_dtype in transactional_data.dtypes if column_dtype == 'boolean'}
if converted_data_types:
    transactional_data = transactional_data.withColumns(converted_data_types)
replace_string_values = {column: f"No{column}" for column, col_dtypes in transactional_data.dtypes if col_dtypes == "string" and column not in primary_key}
prepared_transactional_data = transactional_data.fillna(replace_string_values)
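
The cell above does two things: it casts boolean columns to integers and fills missing values in string columns with a placeholder of the form `No<ColumnName>`, while leaving the primary key columns untouched. The plain-Python sketch below mirrors that logic for a single row. The helper `prepare_row` and the columns `PaymentTerms` and `IsCleared` are hypothetical and only serve as an illustration.

```python
primary_key = ["CompanyCode", "AccountingDocument", "FiscalYear", "AccountingDocumentItem"]


def prepare_row(row, bool_columns, string_columns):
    """Cast booleans to integers and fill missing strings with 'No<Column>'."""
    prepared = dict(row)
    for c in bool_columns:
        if prepared[c] is not None:
            prepared[c] = int(prepared[c])
    for c in string_columns:
        # Primary key columns never receive a placeholder
        if c not in primary_key and prepared[c] is None:
            prepared[c] = f"No{c}"
    return prepared


row = {
    "CompanyCode": None,     # primary key column: stays None
    "PaymentTerms": None,    # hypothetical string column
    "IsCleared": True,       # hypothetical boolean column
}
prepared = prepare_row(
    row,
    bool_columns=["IsCleared"],
    string_columns=["CompanyCode", "PaymentTerms"],
)
print(prepared)
```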

#### Filter Data 
# &#x270D;
We only need the rows that have a value in the column `Customer`. Please adjust the following code so that rows with the value `NoCustomer` are filtered out, by replacing the value `<SET_FILTER>`.

In [0]:
prepared_transactional_data = prepared_transactional_data.where(col("Customer") != "NoCustomer")

#### Calculate Delayed Days
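To create a delay prediction dataset, we calculate the payment delay for each row as the number of days between `NetDueDate` and `ClearingDate`. Payments cleared on or before the due date get a delay of `0`.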

In [0]:
# Calculate the delay in days between ClearingDate and NetDueDate
delay_prediction_dataset = prepared_transactional_data.\
    withColumn("delay", datediff("ClearingDate", "NetDueDate")).\
    withColumn("delay", when(col("delay") < 0, 0).otherwise(col("delay")))
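
The delay logic can be checked on a few dates with the standard library. This is a minimal sketch, not the Spark code itself: the function `delay_in_days` is a hypothetical helper that returns `None` when one of the dates is missing, which matches how `datediff` yields null for null inputs.

```python
from datetime import date


def delay_in_days(clearing_date, net_due_date):
    """Days between due date and clearing date, with early payments set to 0."""
    if clearing_date is None or net_due_date is None:
        return None  # no delay can be computed without both dates
    days = (clearing_date - net_due_date).days
    return days if days > 0 else 0


late = delay_in_days(date(2024, 3, 15), date(2024, 3, 10))
early = delay_in_days(date(2024, 3, 5), date(2024, 3, 10))
open_item = delay_in_days(None, date(2024, 3, 10))
print(late, early, open_item)
```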

# &#x270D;
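For the column `delay`, we need the values to be greater than or equal to `0`. Adjust the following code to match this requirement by changing the value of `<SET_FILTER>`. Because negative delays were already set to `0` in the previous cell, this filter mainly removes rows where `delay` is null, for example items without a `ClearingDate`.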

In [0]:
delay_prediction_dataset = delay_prediction_dataset.where(col("delay") >= 0)

### 4. Persist prepared data
# &#x270D;
Once the data preparation is finished, we store the results as a Delta table in Unity Catalog in Databricks.
To do so, replace `<DELAY_DATASET>` with the name of the dataset that we have just created above and want to persist.

In [0]:
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")
fe_client = FeatureEngineeringClient()

In [0]:
fe_client.create_table(
    name="prepared_accounting_document",
    primary_keys=primary_key,
    schema=delay_prediction_dataset.schema,
    description="Prepared Accounting document item data product for payment delay forecasting"
)

In [0]:
fe_client.write_table(
    name="prepared_accounting_document",
    df=delay_prediction_dataset,
    mode="merge"
)
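
The following plain-Python sketch illustrates what a merge by primary key means, assuming that `mode="merge"` behaves as an upsert: rows whose key already exists are updated, and rows with a new key are added. The helper `merge_rows` and the shortened two-column key are assumptions for illustration and do not reflect how the Feature Engineering client is implemented.

```python
def merge_rows(existing, incoming, key_columns):
    """Upsert incoming rows into existing rows, matching on the key columns."""
    table = {tuple(r[c] for c in key_columns): r for r in existing}
    for r in incoming:
        # Same key: overwrite the old row; new key: add the row
        table[tuple(r[c] for c in key_columns)] = r
    return list(table.values())


# Shortened key and made-up values for readability
key = ["AccountingDocument", "AccountingDocumentItem"]
existing = [
    {"AccountingDocument": "100000001", "AccountingDocumentItem": "001", "delay": 0},
    {"AccountingDocument": "100000001", "AccountingDocumentItem": "002", "delay": 3},
]
incoming = [
    {"AccountingDocument": "100000001", "AccountingDocumentItem": "002", "delay": 7},
    {"AccountingDocument": "100000002", "AccountingDocumentItem": "001", "delay": 1},
]

merged = merge_rows(existing, incoming, key)
print(merged)
```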

Please double-check in the Unity Catalog that the table has been created correctly and is stored in your schema.

![prepared_accounting_document.png](../../images/prepared_accounting_document.png)