## Cash Liquidity Training
In this notebook we use the prepared time series data to train a forecasting model for `cashflow`. We work with the [neuralforecast](https://nixtlaverse.nixtla.io/neuralforecast/docs/getting-started/introduction.html) library, which provides neural network-based models for time series forecasting.

The notebook covers the following steps:

1. Install and import packages
2. Load prepared time series table
3. Prepare test and train data
4. Evaluate performance with `query profile`
5. Apply hyperparameter optimization
6. Train forecasting model

### 1. Install and import packages
The following cells install and import all packages needed for this notebook. To keep the results reproducible, the following packages are installed:
- **mlflow**: tracks our ML model.
- **neuralforecast**: a comprehensive suite of neural network-based models for time series forecasting. It is designed to be scalable, user-friendly, and highly performant, making it suitable for both researchers and practitioners.

In [0]:
%pip install mlflow
%pip install neuralforecast
%restart_python

In [0]:
import pandas as pd
from neuralforecast.core import NeuralForecast
from neuralforecast.auto import AutoNHITS
from neuralforecast.models import NHITS
from neuralforecast.utils import PredictionIntervals
from delta import *
import mlflow
from mlflow.models import infer_signature
from mlflow.client import MlflowClient
from utilsforecast.losses import rmse, mae
from utilsforecast.evaluation import evaluate
import os
import pickle
from pathlib import Path
import re
from functools import reduce
from pyspark.sql.functions import date_trunc, col, avg

# &#x270D;
Please replace the values `<CATALOG_NAME>` and `<SCHEMA_NAME>` with the values that match your use case and group. You can find the correct names in the **Unity Catalog**: look for catalog and schema names of the form `uc_XXX` and `grpX`.

In [0]:
%sql
-- CREATE CATALOG IF NOT EXISTS <CATALOG_NAME>;
SET CATALOG <CATALOG_NAME>;
-- CREATE SCHEMA IF NOT EXISTS <SCHEMA_NAME>;
USE SCHEMA <SCHEMA_NAME>;

### 2. Load prepared time series table
# &#x270D;
We load the prepared table generated by our Data Preparation scripts. Please replace the value `<PREPARED_TABLE_NAME>` with the prepared cashflow table name.

In [0]:
data = spark.read.table("<PREPARED_TABLE_NAME>")


### 3. Prepare test and train data

In [0]:
# data type casting
data = data.withColumn("y", data["y"].cast("float"))


# &#x270D;
For this exercise we set the `<LENGTH>` to the value `6`.

In [0]:
FORECAST_LENGTH = <LENGTH>

In [0]:
unique_date = data.select("ds").distinct().orderBy("ds")
test_date = unique_date.tail(FORECAST_LENGTH)
test_date = spark.createDataFrame(test_date)
train_date = unique_date.join(test_date, "ds", "leftanti")

In [0]:
train_data = data.join(train_date, "ds", "inner")
test_data = data.join(test_date, "ds", "inner")

In [0]:
train_data_df = train_data.toPandas()
train_data_df["ds"] = pd.to_datetime(train_data_df["ds"])
test_data_df = test_data.toPandas()
test_data_df["ds"] = pd.to_datetime(test_data_df["ds"])
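
The cells above hold out the last `FORECAST_LENGTH` distinct timestamps as test data and keep everything earlier for training. The same logic can be sketched in plain Python, which may help to check the behaviour without a Spark session. The helper name `split_by_last_dates` and the toy rows are assumptions made for illustration only.

```python
from datetime import date


def split_by_last_dates(rows, n_test):
    """Hold out the rows whose date is among the last n_test distinct dates."""
    if n_test <= 0:
        raise ValueError("n_test must be a positive integer")
    unique_dates = sorted({row["ds"] for row in rows})
    test_dates = set(unique_dates[-n_test:])
    train = [row for row in rows if row["ds"] not in test_dates]
    test = [row for row in rows if row["ds"] in test_dates]
    return train, test


# Toy data (illustrative only): two company codes, eight monthly observations each.
rows = [
    {"CompanyCode": code, "ds": date(2024, month, 1), "y": float(month)}
    for code in ("1000", "2000")
    for month in range(1, 9)
]

train_rows, test_rows = split_by_last_dates(rows, n_test=6)
print(len(train_rows), len(test_rows))
```

Splitting on distinct dates rather than on row counts keeps the test window identical for every series, even when several company codes share the same timestamps.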

### 4. Evaluate performance with `query profile`

Since we have logged the execution using MLflow, we can get a more detailed analysis of the execution of the training itself:

1. In the cell above, click `See Performance` to list all executed statements.
2. Click on one of the listed statements.
3. Click on **`query profile`**, which will show:
    - Execution plan (steps taken to run the query)
    - Time spent on each operation
    - CPU, memory, and I/O usage
    - Bottlenecks or inefficient joins/scans
  


This is where you find the performance trace:

![see_performance.png](../../images/see_performance.png)

The query profile is a very useful tool for:
- Performance tuning: Identify slow steps and optimize them.
- Cost control: Reduce compute time and resource usage.
- Debugging: Understand why a query fails or returns unexpected results.

Here is an example of the `query profile`:

![query_profile.png](../../images/query_profile.png)

### 5. Apply hyperparameter optimization

#### model fitting

In [0]:
mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

In [0]:
mlflow.pytorch.autolog(checkpoint=False, log_every_n_epoch=100, log_datasets=True, log_models=True)

In [0]:
nf = NeuralForecast(
    models=[AutoNHITS(
        h=FORECAST_LENGTH, backend="optuna", num_samples=5)],
    freq="MS")

Remark: The model fitting will take **approximately 12 min**.

In [0]:
nf.fit(train_data_df, id_col="CompanyCode")
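
The argument `num_samples=5` asks the tuner to try five candidate configurations and keep the one with the lowest validation loss. The sketch below illustrates that idea with a simplified random search. It is a stand-in, not Optuna's sampler or the search space used by `AutoNHITS`; the function names, the toy loss, and the two sampled parameters are assumptions.

```python
import random


def toy_validation_loss(config):
    """Stand-in for 'train a model with this config and return its validation loss'."""
    return (config["learning_rate"] - 0.01) ** 2 + abs(config["input_size"] - 12) / 100


def random_search(num_samples, seed=0):
    """Sample num_samples configurations and return all trials plus the best one."""
    rng = random.Random(seed)
    trials = []
    for number in range(num_samples):
        config = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform draw
            "input_size": rng.choice([6, 12, 18, 24]),
        }
        trials.append(
            {"number": number, "params": config, "value": toy_validation_loss(config)}
        )
    best = min(trials, key=lambda trial: trial["value"])
    return trials, best


trials, best = random_search(num_samples=5)
print(best["params"], round(best["value"], 5))
```

A larger `num_samples` explores more of the search space at the cost of a proportionally longer run time, since every trial trains a model.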

#### retrieve best parameters for training

In [0]:
results = nf.models[0].results.trials_dataframe()
results = results[results["state"] == "COMPLETE"].sort_values(by="value", ascending=True)

In [0]:
results

In [0]:
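# Look up the best configuration by column name instead of by position,
# so the lookup does not depend on the number of tuned parameters.
best_params = results.iloc[0]["user_attrs_ALL_PARAMS"]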
display(best_params)

### 6. Train forecasting model

#### NeuralForecast
For the time series training, we use the library [NeuralForecast](https://nixtlaverse.nixtla.io/neuralforecast/docs/getting-started/introduction.html), an open-source Python library developed by Nixtla. Its key features are:

**Model Variety**: Includes over 30 state-of-the-art models such as:

- Classical: MLP, RNN, LSTM, GRU
- Advanced: NBEATS, NHITS, DeepAR, TFT, Informer, Autoformer, PatchTST, TimeLLM, StemGNN, and more.

**Usability**:
- Scikit-learn-like API (.fit() and .predict())
- Compatible with other Nixtla libraries like StatsForecast and MLForecast
- Built-in support for visualization and data wrangling via utilsforecast and coreforecast

**Forecasting Capabilities**:

- Probabilistic Forecasting: Supports quantile losses and parametric distributions
- Interpretability: Tools to analyze trend, seasonality, and exogenous components
- Exogenous Variables: Handles static, historical, and future covariates



**Performance**:
- Parallelized and distributed training
- Integration with tools like Ray and Optuna for hyperparameter tuning
- Transfer learning support for forecasting with limited historical data

In [0]:
class NeuralForecastPyFunc(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        import cloudpickle
        with open(context.artifacts["nf_path"], "rb") as f:
            self.nf = cloudpickle.load(f)

    def predict(self, context, model_input):
        # Expect a pandas DataFrame with NeuralForecast's standard columns
        if not isinstance(model_input, pd.DataFrame):
            raise TypeError("model_input must be a pandas DataFrame")
        return self.nf.predict(df=model_input, level=[90])
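
The wrapper above follows a two-step contract: `load_context` restores a pickled object from an artifact path, and `predict` validates the input before delegating to that object. A minimal sketch of the same contract without MLflow or NeuralForecast is shown below. The class `PickledModelWrapper`, the dictionary used as a "model", and the `SimpleNamespace` context are stand-ins invented for illustration.

```python
import pickle
import tempfile
from pathlib import Path
from types import SimpleNamespace


class PickledModelWrapper:
    """Mimics the load_context/predict contract without depending on MLflow."""

    def load_context(self, context):
        # Restore the pickled object from the path stored under "nf_path".
        with open(context.artifacts["nf_path"], "rb") as f:
            self.model = pickle.load(f)

    def predict(self, context, model_input):
        # Reject unexpected input types early, as the wrapper above does.
        if not isinstance(model_input, list):
            raise TypeError("model_input must be a list of numbers")
        return [self.model["scale"] * x + self.model["offset"] for x in model_input]


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "toy_model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump({"scale": 2.0, "offset": 1.0}, f)  # toy "model"
    fake_context = SimpleNamespace(artifacts={"nf_path": str(model_path)})
    wrapper = PickledModelWrapper()
    wrapper.load_context(fake_context)
    result = wrapper.predict(fake_context, [1.0, 2.0])

print(result)
```

Keeping deserialization in `load_context` means the object is loaded once when the model is served, not on every call to `predict`.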

In [0]:
df = data.toPandas()
df["ds"] = pd.to_datetime(df["ds"])

In [0]:
mlflow.pytorch.autolog(checkpoint=False, log_every_n_epoch=100, log_datasets=True, log_models=False)

In [0]:
prediction_intervals = PredictionIntervals()
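
`PredictionIntervals` lets the model return an interval around each point forecast, here requested later with `level=[90]`. The library describes its approach as conformal prediction. The sketch below shows the basic idea only: a symmetric interval built from absolute calibration errors. It is a simplified illustration under that assumption, not the library's implementation, and the function name and the error values are made up.

```python
import math


def conformal_interval(point_forecast, calibration_errors, level=90):
    """Return (lower, upper) using a quantile of absolute calibration errors."""
    scores = sorted(abs(error) for error in calibration_errors)
    n = len(scores)
    # Finite-sample corrected rank, capped at the largest available score.
    rank = min(n, math.ceil((n + 1) * level / 100))
    half_width = scores[rank - 1]
    return point_forecast - half_width, point_forecast + half_width


# Made-up errors (actual minus forecast) from earlier validation windows.
errors = [-3.0, 1.0, -2.0, 4.0, 0.5, -1.5, 2.5, -0.5, 1.0, 3.0]
lower, upper = conformal_interval(100.0, errors, level=90)
print(lower, upper)
```

The interval width depends only on past forecast errors, so it needs no distributional assumption about the cash flow values themselves.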

# &#x270D;
Please replace the variable `<ID_COLUMN>` with the value `CompanyCode` and the variable `<MODEL_NAME>` with the value `neuralforecast_nhits`.

Remark: The training takes **approximately 5 min**.

In [0]:
with mlflow.start_run() as run:
    final_model = NeuralForecast(
        models=[NHITS(**best_params)], freq="MS"
    )
    final_model.fit(df, id_col="<ID_COLUMN>", prediction_intervals=prediction_intervals)
    prediction = final_model.predict(train_data_df, level=[90])
    x_example = train_data_df.head(5)
    signature = infer_signature(x_example, prediction)
    model_path = "neuralforecast_model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(final_model, f)
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=NeuralForecastPyFunc(),
        artifacts={"nf_path": model_path},
        input_example=x_example,
        signature=signature
    )
    model = mlflow.register_model(f"runs:/{run.info.run_id}/model", "<MODEL_NAME>")

# &#x270D;
We assign an alias to the latest trained model so that it is easier to find and load in the next exercise.
To do so, please replace the variable `<ALIAS_NAME>` with the value `prod`.

In [0]:
mlflow_client = MlflowClient()
mlflow_client.set_registered_model_alias(
    name=model.name,
    alias="<ALIAS_NAME>",
    version=model.version  # version number
)
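
Once an alias is set, MLflow can resolve a model through a URI of the form `models:/<name>@<alias>` instead of a fixed version number. The sketch below only builds such a URI. The helper `model_uri_for_alias` and the catalog and schema names are hypothetical, and it assumes the model is addressed by its three-level Unity Catalog name.

```python
def model_uri_for_alias(catalog, schema, model_name, alias):
    """Build a 'models:/<catalog>.<schema>.<model>@<alias>' URI."""
    return f"models:/{catalog}.{schema}.{model_name}@{alias}"


# Placeholder catalog and schema names, for illustration only.
uri = model_uri_for_alias("my_catalog", "my_schema", "neuralforecast_nhits", "prod")
print(uri)

# In a Databricks notebook the model could then be loaded with, for example:
# loaded_model = mlflow.pyfunc.load_model(uri)
```

Because the alias can be moved to a newer version later, code that loads by alias does not need to change when the model is retrained.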

When executed successfully, you should find the trained model `neuralforecast_nhits` in the Unity Catalog under your schema.

#### visualizing training with prediction data

In [0]:
from utilsforecast.plotting import plot_series

In [0]:
# Plot predictions
plot_series(train_data_df, prediction, id_col="<ID_COLUMN>")
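
The imports at the top include `rmse` and `mae`, although no cell above calls them. As a reminder of what these two metrics compute, here is a minimal plain-Python sketch with made-up numbers. The function names are chosen for illustration and are not the `utilsforecast` API.

```python
import math


def mean_absolute_error(actual, forecast):
    """Average of the absolute differences between actual and forecast values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def root_mean_squared_error(actual, forecast):
    """Square root of the average squared difference; penalizes large errors more."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# Made-up values, for illustration only.
actual = [100.0, 120.0, 90.0]
forecast = [110.0, 115.0, 95.0]

mae_value = mean_absolute_error(actual, forecast)
rmse_value = root_mean_squared_error(actual, forecast)
print(round(mae_value, 3), round(rmse_value, 3))
```

Note that a fair comparison against `test_data_df` would require a model that has not seen the test period during training, which is not the case for the final model fitted on the full data above.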