## Cash Liquidity Training
In this notebook we use the prepared time series data to train a forecasting model for `cashflow`. We work with the [statsforecast](https://nixtlaverse.nixtla.io/statsforecast/index.html) library, which is optimized for speed and scalability.

The notebook walks through the following steps:

1. Install and import packages
2. Load prepared time series table
3. Train forecasting model
4. Evaluate the logged performance in MLflow

### 1. Install and import packages
The next notebook cell installs all packages this notebook needs, so that the results are reproducible:
- mlflow: tracks our ML model and its training run.
- statsforecast: a high-performance Python package for univariate time series forecasting with statistical and econometric models. It is optimized for speed and scalability, which makes it suitable for both production environments and benchmarking.
- mlflavors: provides the MLflow flavor used to log the statsforecast model.

In [0]:
%pip install mlflow
%pip install statsforecast
%pip install mlflavors
%restart_python

In [0]:
import pandas as pd
from statsforecast.core import StatsForecast
from statsforecast.models import ADIDA, AutoARIMA, CrostonOptimized, AutoETS, AutoCES, ARCH, AutoMFLES, AutoTheta
from delta import *
import mlflow
from mlflow.models import infer_signature
import mlflavors
from mlflow.client import MlflowClient
from utilsforecast.losses import rmse, mae
from utilsforecast.evaluation import evaluate
import os
import pickle
from pathlib import Path
import re
from functools import reduce
from pyspark.sql.functions import date_trunc, col, avg

Please replace the values `<CATALOG_NAME>` and `<SCHEMA_NAME>` with the values that match your use case and group. You can find the correct names in the **Unity Catalog**: look for catalog and schema names of the form `uc_XXX` and `grpX`.

In [0]:
%sql
-- CREATE CATALOG IF NOT EXISTS <CATALOG_NAME>;
SET CATALOG uc_cash_liquidity_forecast;
CREATE SCHEMA IF NOT EXISTS grp01;
USE SCHEMA grp01;

### 2. Load prepared time series table
We load the prepared table generated by our Data Preparation scripts. Please replace the value `<PREPARED_TABLE_NAME>` with the prepared cashflow table name.

In [0]:
data = spark.read.table("prepared_cash_flow_time_series")


In [0]:
data = data.withColumn("y", data["y"].cast("float"))


### 3. Train forecasting model

The **statsforecast** library is a high-performance Python package designed for univariate time series forecasting using statistical and econometric models. It's developed by [Nixtla](https://nixtlaverse.nixtla.io/statsforecast/index.html) and is optimized for speed and scalability, making it suitable for both production environments and benchmarking.

Furthermore, it is compatible with Spark and integrates well with MLflow, which we use to log and track the training execution.

**Models Included**:

AutoARIMA, AutoETS, AutoCES, Theta, MSTL (Multiple Seasonalities)



**Functionality**:
- Probabilistic forecasting with confidence intervals
- Support for exogenous variables and static covariates
- Anomaly detection and cross-validation
- Familiar .fit() and .predict() syntax like scikit-learn

For this exercise we set the `<LENGTH>` to the value `6`.

In [0]:
FORECAST_LENGTH = 6

In [0]:
unique_date = data.select("ds").distinct().orderBy("ds")
test_date = unique_date.tail(FORECAST_LENGTH)
test_date = spark.createDataFrame(test_date)
train_date = unique_date.join(test_date, "ds", "leftanti")

In [0]:
train_data = data.join(train_date, "ds", "inner")
test_data = data.join(test_date, "ds", "inner")
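
The split above holds out the last `FORECAST_LENGTH` distinct dates as the test window and keeps everything earlier for training. As a minimal sketch of the same logic in plain Python (the toy rows and company codes below are hypothetical and only illustrate the idea, they are not the workshop data):

```python
from datetime import date

FORECAST_LENGTH = 6  # same horizon as in the notebook

# Hypothetical toy rows (CompanyCode, ds, y): two companies, Jan to Oct 2024.
rows = [
    (code, date(2024, month, 1), float(100 * month))
    for code in ("1000", "2000")
    for month in range(1, 11)
]

# Distinct, sorted dates; the last FORECAST_LENGTH of them form the test window.
unique_dates = sorted({ds for _, ds, _ in rows})
test_dates = set(unique_dates[-FORECAST_LENGTH:])

# Counterpart of the left-anti join (train) and the inner join (test).
train_rows = [r for r in rows if r[1] not in test_dates]
test_rows = [r for r in rows if r[1] in test_dates]

print(len(train_rows), len(test_rows))
```

Because the cut is made on dates rather than on rows, every series is split at the same point in time, so no test-period observation of any company ends up in the training data.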

In [0]:
stats_forecast = StatsForecast(
    models=[
        ADIDA(),
        AutoARIMA(),
        CrostonOptimized(),
        AutoETS(),
        AutoCES(),
        ARCH(),
        AutoMFLES(
            test_size=FORECAST_LENGTH, 
        ),
        AutoTheta(),
    ],
    freq="MS",
    n_jobs=-1,
    verbose=True,
)
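
The `freq="MS"` setting is the pandas alias for month-start frequency, so a forecast with `h=6` covers the six month starts that follow the last training date of each series. A small standard-library sketch of that horizon, assuming a hypothetical last training date of 2024-09-01 (the helper `month_starts_after` is illustrative and not part of statsforecast):

```python
from datetime import date

def month_starts_after(last_train: date, horizon: int) -> list[date]:
    """Return the `horizon` month-start dates following `last_train`."""
    dates = []
    year, month = last_train.year, last_train.month
    for _ in range(horizon):
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
        dates.append(date(year, month, 1))
    return dates

# Hypothetical last training month; the notebook derives it from the data.
horizon_dates = month_starts_after(date(2024, 9, 1), 6)
for d in horizon_dates:
    print(d.isoformat())
```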

In [0]:
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

Please replace the values `<DATE_COLUMN>`, `<VALUE_COLUMN>`, and `<ID_COLUMN>` with the actual column names from your dataset.

> Hint: Those are the same columns that we created in the data preparation notebook.

> Remark: The training takes approximately 2 minutes to run.

To name the trained model, we log it via MLflow and replace `<MODEL_NAME>` with the value `statsforecast`.

In [0]:
with mlflow.start_run() as run:
    
    # Replace <DATE_COLUMN>, <VALUE_COLUMN>, and <ID_COLUMN> with the actual column names
    # from your dataset (the same columns we created in the data preparation notebook).
    DATE_COLUMN = 'ds'
    ID_COLUMN = 'CompanyCode'
    VALUE_COLUMN = 'y'

    prediction = stats_forecast.forecast(df=train_data, h=FORECAST_LENGTH, id_col=ID_COLUMN)
    prediction = prediction.withColumn(DATE_COLUMN, date_trunc("MM", col(DATE_COLUMN)))
    model_list = prediction.drop(ID_COLUMN, DATE_COLUMN).columns
    prediction_eval = prediction.join(test_data, [DATE_COLUMN, ID_COLUMN], "inner")
    eval_df = evaluate(prediction_eval, metrics=[rmse, mae], models=model_list, id_col=ID_COLUMN, time_col=DATE_COLUMN, target_col=VALUE_COLUMN)
    # Average each model's score across all series, one row per metric.
    grouped_eval = eval_df.groupBy("metric").agg(
        avg(col("ADIDA")).alias("ADIDA"),
        avg(col("AutoARIMA")).alias("AutoARIMA"),
        avg(col("CrostonOptimized")).alias("CrostonOptimized"),
        avg(col("AutoETS")).alias("AutoETS"),
        avg(col("CES")).alias("CES"),
        avg(col("ARCH(1)")).alias("ARCH(1)"),
        avg(col("AutoMFLES")).alias("AutoMFLES"),
        avg(col("AutoTheta")).alias("AutoTheta"),
    )
    grouped_eval_list = grouped_eval.toPandas().to_dict(orient="records")
    for element in grouped_eval_list:
        metric_name = element.pop("metric")
        metrics = {f"{metric_name}_{key}": value for key, value in element.items()}
        mlflow.log_metrics(metrics)
    input_example = train_data.head(1)
    input_example_df = pd.DataFrame(input_example, columns=train_data.columns)
    prediction_df = pd.DataFrame(prediction.head(1), columns=prediction.columns)
    signature = infer_signature(input_example_df, prediction_df)

    # when model is logged, it will be also persisted in the unity catalog
    model = mlflavors.statsforecast.log_model(
        statsforecast_model=stats_forecast,
        artifact_path="models",
        input_example=input_example_df,
        signature=signature,
        registered_model_name="statsforecast"
    )
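
The evaluation in the cell above computes RMSE and MAE per series and model, averages them across series, and logs each value under a `<metric>_<model>` key. A minimal sketch of that calculation in plain Python, using made-up actuals and forecasts (the numbers, the company codes, and the local `rmse`/`mae` helpers are assumptions for illustration, not the `utilsforecast` implementations):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of one series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error of one series."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical test-window values per CompanyCode.
actuals = {"1000": [10.0, 12.0, 14.0], "2000": [5.0, 5.0, 5.0]}
forecasts = {
    "AutoARIMA": {"1000": [11.0, 12.0, 13.0], "2000": [5.0, 6.0, 4.0]},
    "AutoETS": {"1000": [10.0, 10.0, 10.0], "2000": [5.0, 5.0, 5.0]},
}

# Score each series, average across series, and key the result as <metric>_<model>.
metrics = {}
for metric_name, metric_fn in (("rmse", rmse), ("mae", mae)):
    for model_name, per_series in forecasts.items():
        scores = [metric_fn(actuals[sid], preds) for sid, preds in per_series.items()]
        metrics[f"{metric_name}_{model_name}"] = sum(scores) / len(scores)

for key, value in sorted(metrics.items()):
    print(f"{key}: {value:.4f}")
```

Averaging per-series scores gives every company the same weight regardless of how large its cash flows are, which is worth keeping in mind when comparing models on series of very different scale.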


After the cell has run successfully, you should find the trained model `statsforecast` in the Unity Catalog under the schema you created.

In [0]:
mlflow_client = MlflowClient()
mlflow_client.set_registered_model_alias(
    name="statsforecast",  # registered model name
    alias="prod",       # alias name
    version=model.registered_model_version  # version number
)
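
Once the `prod` alias is set, downstream notebooks can refer to the model by alias instead of by version number. MLflow addresses aliased models with URIs of the form `models:/<name>@<alias>`. The sketch below only builds such a URI; it assumes the model is addressed by its three-level Unity Catalog name, and the helper `model_alias_uri` is hypothetical:

```python
def model_alias_uri(catalog: str, schema: str, model_name: str, alias: str) -> str:
    """Build an MLflow model URI that points at a registered-model alias."""
    return f"models:/{catalog}.{schema}.{model_name}@{alias}"

# Catalog and schema as used earlier in this notebook; adjust to your group.
uri = model_alias_uri("uc_cash_liquidity_forecast", "grp01", "statsforecast", "prod")
print(uri)

# In a Databricks notebook the URI could then be passed to a loader, for example:
#   loaded_model = mlflow.pyfunc.load_model(uri)
# (assumes the logged model exposes a pyfunc flavor).
```

Pointing consumers at the alias means a retrained version can be promoted by moving `prod` to the new version, without changing any loading code.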

### 4. Evaluate the logged performance in MLflow

Since we have logged the execution using MLflow, we can get a more detailed analysis of the training run itself:

1. In the cell above that contains `mlflow.start_run(...)`, click `See Performance` to list all executed statements.
2. Click one of the statements to get additional information such as *memory consumption*, *number of rows read*, *number of rows returned*, etc.
3. Click **`query profile`**, which shows:
    - Execution plan (steps taken to run the query)
    - Time spent on each operation
    - CPU, memory, and I/O usage
    - Bottlenecks or inefficient joins/scans
  


![](../../images/see_performance.png)

The query profile is a very useful tool for:
- Performance tuning: Identify slow steps and optimize them.
- Cost control: Reduce compute time and resource usage.
- Debugging: Understand why a query fails or returns unexpected results.

![](../../images/query_profile.jpg)