In [1]:
import pandas as pd
import plotly.express as px
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.compose import ForecastingPipeline
from sktime.forecasting.exp_smoothing import ExponentialSmoothing

# Time series functions

This notebook walks through the functions we use in the Taipy Forecast application. Whether or not you want to dive into time series, you can run each function here to get a general idea of what it does.

In this chapter, we use the [sktime](https://www.sktime.net/en/stable/) library to call time-series forecasting models (that come from other libraries).

## Opening the parquet files

To load parquet files as pandas DataFrames, we use the `read_parquet` function. Let's load the sales data.

In [2]:
df_sales = pd.read_parquet("../src/data/sales.parquet")
df_sales.head(2)

Unnamed: 0,date,day,product,type,name,color,style,customer,birth,generation,gender,unit_price,items,sales
0,2010-12-29,We,BK-R93R-62,Road,"Road-150 Red, 62",Red,U,AW00021768,1952,Boomers,M,3578.27,1,3578.27
1,2010-12-29,We,BK-M82S-44,Mountain,"Mountain-100 Silver, 44",Silver,U,AW00028389,1970,Gen X,F,3399.99,1,3399.99


## Selecting subsets and preparing data

Our app allows selecting subsets of the `df_sales` DataFrame. We create a function named `filter_dataframe` for this. This way, we can run the time series forecasting for a particular group or segment of the population (we can filter by generation, gender and product type). In this example, we'll make a forecast for "Gen X".

We also preprocess the resulting DataFrame with the `prepare_data` function. Time series models require DataFrames with two columns:
* A column that holds the time values (like a date or a timestamp)
* A column with the target values we want to predict

In our case, we want to predict sales by date. The `prepare_data` function groups all the lines of the (filtered) DataFrame by date, so we have one line per date for a given target. The `target` parameter lets us select the values we want to predict: sales (in monetary units) or items (number of bikes sold, regardless of the price). The two have different practical uses: sales have a higher financial impact, while items are more relevant for managing stock.

In [3]:
def filter_dataframe(df, gender_forecast, generation_forecast, product_forecast):
    # Each filter is skipped when its value is "All"
    if gender_forecast != "All":
        df = df[df["gender"] == gender_forecast]
    if generation_forecast != "All":
        df = df[df["generation"] == generation_forecast]
    if product_forecast != "All":
        # The product filter matches the "type" column (e.g. "Road", "Mountain")
        df = df[df["type"] == product_forecast]
    df = df.reset_index(drop=True)
    return df

In [4]:
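def prepare_data(df, target):
    # One row per date: sum the target (e.g. "sales" or "items") over all lines of that date
    df_agg = df.groupby("date")[target].sum().reset_index()
    df_agg["date"] = pd.to_datetime(df_agg["date"])
    df_agg = df_agg.set_index("date")
    # Relabel the rows as consecutive days starting at the first date, which gives
    # the index a daily frequency. Note: this assumes no calendar day is missing;
    # if some days have no sales, later rows end up with shifted dates.
    df_agg.index = pd.date_range(
        start=df_agg.index.min(), periods=len(df_agg), freq="D"
    )
    return df_agg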
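
The aggregation step in `prepare_data` can be tried on a tiny example. The sketch below uses made-up values (the `toy` DataFrame is hypothetical, not part of the sales data) to show the group-by-date sum, and then lists the calendar days that have no sales at all. Because `prepare_data` relabels rows as consecutive days, it can be worth running this kind of check on real data first.

```python
import pandas as pd

# Toy sales lines (hypothetical values): two sales on Jan 1, none on Jan 2, one on Jan 3
toy = pd.DataFrame(
    {
        "date": ["2011-01-01", "2011-01-01", "2011-01-03"],
        "sales": [100.0, 50.0, 70.0],
    }
)

# Same aggregation as in prepare_data: one row per date, summing the target
toy_agg = toy.groupby("date")["sales"].sum().reset_index()
toy_agg["date"] = pd.to_datetime(toy_agg["date"])
toy_agg = toy_agg.set_index("date")

# Calendar days between the first and last date that have no row
full_range = pd.date_range(toy_agg.index.min(), toy_agg.index.max(), freq="D")
missing_days = full_range.difference(toy_agg.index)

print(toy_agg)
print(missing_days)  # a non-empty result means relabeling would shift dates
```

If `missing_days` is not empty, one possible alternative (not what this notebook does) is to reindex on `full_range` and fill the gaps with zeros before fitting.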

Let's call our functions to see how the prepared data looks:

In [5]:
df_sales_gen_x = filter_dataframe(
    df_sales, gender_forecast="All", generation_forecast="Gen X", product_forecast="All"
)
df_sales_gen_x.head(2)

Unnamed: 0,date,day,product,type,name,color,style,customer,birth,generation,gender,unit_price,items,sales
0,2010-12-29,We,BK-M82S-44,Mountain,"Mountain-100 Silver, 44",Silver,U,AW00028389,1970,Gen X,F,3399.99,1,3399.99
1,2010-12-29,We,BK-M82S-44,Mountain,"Mountain-100 Silver, 44",Silver,U,AW00011003,1973,Gen X,F,3399.99,1,3399.99


In [6]:
prepared_data = prepare_data(df_sales_gen_x, "sales")
prepared_data.head()

Unnamed: 0,sales
2010-12-29,6799.98
2010-12-30,10353.25
2010-12-31,7855.6382
2011-01-01,3578.27
2011-01-02,3578.27


In [7]:
prepared_data

Unnamed: 0,sales
2010-12-29,6799.9800
2010-12-30,10353.2500
2010-12-31,7855.6382
2011-01-01,3578.2700
2011-01-02,3578.2700
...,...
2013-11-28,40525.5300
2013-11-29,30914.6200
2013-11-30,45078.0900
2013-12-01,34517.2300


## Creating the SKTime pipeline

SKTime acts like a universal adapter: it brings many time series models from various libraries together behind a single API. For a model like the one we use here, SKTime is a wrapper, so the underlying library still does the actual computation.

In our example, we'll be using the "Exponential Smoothing" model, which is [a mathematical model to predict time series](https://en.wikipedia.org/wiki/Exponential_smoothing).

You can think of this setup in three layers:

1. The core statistical method: the time-series equation that defines how the forecast is calculated (the mathematical model)
2. A Python library that implements this model. In our case, it's the `statsmodels` package and its [`ExponentialSmoothing`](https://www.statsmodels.org/dev/generated/statsmodels.tsa.holtwinters.ExponentialSmoothing.html#statsmodels.tsa.holtwinters.ExponentialSmoothing) class
3. SKTime, which wraps the `statsmodels` exponential smoothing and provides a consistent, uniform API, so we can work with time series models as we would with scikit-learn. SKTime provides a [dedicated class](https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.forecasting.exp_smoothing.ExponentialSmoothing.html) for this

**Important**: Each time series model is different. SKTime standardizes the workflow, not the models or their parameters.

Let's break down the functions we need to create the Pipeline:

* `create_pipeline`: This function creates the actual forecasting pipeline. It uses SKTime’s `ForecastingPipeline`, with a single step: the `ExponentialSmoothing` model. We configure this model with:

  * Additive trend: Assumes that the trend (i.e., the increase or decrease over time) adds a fixed amount to the forecast at each step.
  * Multiplicative seasonality: Assumes that the seasonal pattern (like highs and lows due to yearly cycles) multiplies the forecast. This is useful when seasonal effects grow with the level of the series, like retail sales that peak during holidays while also increasing year over year.

* `fit_and_forecast_future`: This is where the work happens:
  * It fits the model to the historical data
  * The `number_of_days` parameter sets how many days into the future we want to predict
  * It returns the forecasted values for those future dates

* `compute_confidence_intervals`: Forecasts are never perfect; they come with uncertainty. This function calculates a basic band around the forecast:
  * It uses the standard deviation of the forecasted values and a multiplier (e.g. 1.96, the multiplier associated with ~95% confidence under a normal distribution)
  * The band has the same width at every date. It is a rough visual guide to where the true values may fall, not a statistical interval estimated by the model.

* `create_forecast_dataframe`: This wraps everything into a single DataFrame that is easy to plot in our Taipy app. It includes:
  * Forecasted values
  * Confidence interval upper and lower bounds
  * The corresponding future dates
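
To make the difference between an additive trend and multiplicative seasonality concrete, here is a small numeric sketch. All numbers (`level`, `slope`, `seasonal_factors`) are invented for illustration, and the code only builds a synthetic series. It is not how the Exponential Smoothing model is fitted.

```python
import numpy as np

# Hypothetical numbers chosen only for illustration
level = 100.0  # starting level of the series
slope = 10.0   # additive trend: the series gains 10 at each step
seasonal_factors = np.array([0.8, 1.0, 1.2, 1.0])  # multiplicative pattern, period 4

steps = np.arange(8)
trend_line = level + slope * steps  # 100, 110, ..., 170
factors = seasonal_factors[steps % len(seasonal_factors)]
series = trend_line * factors  # the seasonal effect scales with the level

# Seasonal swing (peak minus trough) within each cycle of 4 steps
swing_first_cycle = series[:4].max() - series[:4].min()
swing_second_cycle = series[4:].max() - series[4:].min()
print(swing_first_cycle, swing_second_cycle)
```

The swing of the second cycle is larger than the first, even though the seasonal factors are identical. With additive seasonality, the swing would stay the same in every cycle.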

### Model Hyperparameters



- **`trend="add"`**:  We're assuming that sales increase or decrease by a fixed amount over time. This is a good fit when the growth is roughly linear rather than percentage-based. It also tends to be more stable, especially when working with smaller datasets (which is our case).
    
- **`seasonal="mul"`**: This captures recurring seasonal effects that *scale with the data*. For example, if sales spike every quarter and those spikes get bigger each year, a multiplicative pattern captures that better than an additive one.
    
- **`sp=91`**: `sp` stands for **seasonal period**. Since our data is **daily**, a value of 91 is about 3 months, so this is **quarterly seasonality**. This tells the model to look for patterns that repeat every 91 days.
    
- **`initialization_method="estimated"`**: This allows the model to estimate the initial level, trend, and seasonal components automatically, which is helpful when we don't want to specify those manually.
    
- **`use_boxcox=True`**: This one applies a [Box-Cox transformation](https://www.statisticshowto.com/probability-and-statistics/normal-distributions/box-cox-transformation/) to stabilize variance. This can make patterns easier to model, especially if the scale of the data changes over time.
    
- **`optimized=True`**: It enables automatic tuning of the smoothing parameters using maximum likelihood estimation.
    
- **`method="L-BFGS-B"`**: This is the optimization algorithm used to find the best-fitting parameters. It’s efficient and works well with constrained problems like ours.
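
As an illustration of what the Box-Cox transformation does, here is a minimal sketch for a fixed lambda. The `box_cox` helper is hypothetical and written only for this example. With `use_boxcox=True`, the model picks the lambda value itself, so the values used below (0 and 0.5) are assumptions.

```python
import numpy as np


def box_cox(y, lam):
    """Box-Cox transform for strictly positive values and a fixed lambda."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)  # the lambda = 0 case is the natural log
    return (y**lam - 1.0) / lam


# A few daily sales values taken from the prepared data shown earlier
daily_sales = np.array([3578.27, 6799.98, 10353.25, 45078.09])

log_sales = box_cox(daily_sales, 0)
sqrt_like_sales = box_cox(daily_sales, 0.5)

# Ratio between the largest and smallest value, before and after the transform
print(daily_sales.max() / daily_sales.min())
print(log_sales.max() / log_sales.min())
print(sqrt_like_sales.max() / sqrt_like_sales.min())
```

The ratios shrink after the transformation: large values are pulled closer to small ones, which is what stabilizes the variance.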

In [8]:
def create_pipeline():
    pipeline = ForecastingPipeline(
        steps=[
            (
                "model",
                ExponentialSmoothing(
                    trend="add",
                    seasonal="mul",
                    sp=91,  # sp means "seasonal period"
                    initialization_method="estimated",
                    use_boxcox=True,
                    optimized=True,
                    method="L-BFGS-B",
                ),
            )
        ]
    )
    return pipeline

### Train the Model and Predict Future Values

The `fit_and_forecast_future` function is our most important function. It works in six steps:

1. **Create the pipeline**  
    It calls `create_pipeline()` to get our configured SKTime forecasting pipeline using Exponential Smoothing.
    
2. **Prepare the target series**  
    It selects the time series column (`target`) from the input DataFrame. This is the data the model will learn from.
    
3. **Fit the model**  
    It fits the pipeline to the historical data (`y`). This is where the model learns the patterns, like trend, seasonality, and more.
    
4. **Create the future timeline**  
    It calculates a list of future dates to forecast based on the `number_of_days` argument.  
    For example, if the last date in your data is `"2023-12-31"` and you set `number_of_days=30`, the function creates a `DatetimeIndex` from `2024-01-01` to `2024-01-30`.
    
5. **Build a forecasting horizon**  
    SKTime needs a special object called `ForecastingHorizon` to know **when** to make predictions. We provide the list of future dates as an *absolute* time index (`is_relative=False`).
    
6. **Generate predictions**  
    Finally, it calls `.predict()` on the pipeline using the future horizon and returns two things:
    
    - `y_future`: the predicted values for the future
        
    - `future_dates`: the timeline of those predicted values
        

This function is the core of the forecasting workflow: it is where the model learns from history and projects it into the future.
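
Step 4 can be checked on its own with pandas, without fitting any model. The sketch below reuses the example date from the list above; `last_date` is hard-coded here, whereas the real function reads it from the data.

```python
import pandas as pd

last_date = pd.Timestamp("2023-12-31")  # example value; the function uses y.index[-1]
number_of_days = 30

# Start the day after the last observation and generate one entry per day
future_dates = pd.date_range(
    start=last_date + pd.Timedelta(days=1), periods=number_of_days, freq="D"
)

print(future_dates[0].date(), future_dates[-1].date(), len(future_dates))
# prints: 2024-01-01 2024-01-30 30
```

These are the dates passed to `ForecastingHorizon` with `is_relative=False`. A horizon can also be expressed as relative step counts (1, 2, ..., n) with `is_relative=True`; this notebook uses absolute dates.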

In [9]:
def fit_and_forecast_future(df_agg, target, number_of_days):
    pipeline = create_pipeline()
    y = df_agg[target]
    pipeline.fit(y)

    last_date = y.index[-1]
    future_dates = pd.date_range(
        start=last_date + pd.Timedelta(days=1), periods=number_of_days, freq="D"
    )
    fh_future = ForecastingHorizon(future_dates, is_relative=False)
    y_future = pipeline.predict(fh_future)

    return y_future, future_dates

In [10]:
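def compute_confidence_intervals(y_pred, sigma=1.96):
    # Rough, constant-width band: forecast +/- sigma * (std of the forecast values).
    # This is not a prediction interval estimated by the model.
    pred_values = y_pred.values
    std_dev = pred_values.std()  # NumPy std, population formula (ddof=0)
    conf_min = pred_values - sigma * std_dev
    conf_max = pred_values + sigma * std_dev
    return conf_min, conf_max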

In [11]:
def create_forecast_dataframe(future_dates, y_future, conf_min, conf_max):
    forecast_df = pd.DataFrame(
        {
            "date": future_dates,
            "forecast": y_future.values,
            "conf_min": conf_min,
            "conf_max": conf_max,
        }
    )
    return forecast_df

Let's create our prediction DataFrame! We'll predict the next 40 days (imagining that we're ending the year 2013).

In [12]:
forecast_days = 40

In [13]:
y_future, future_dates = fit_and_forecast_future(prepared_data, "sales", forecast_days)
conf_min, conf_max = compute_confidence_intervals(y_future)
forecast_df = create_forecast_dataframe(future_dates, y_future, conf_min, conf_max)

In [14]:
forecast_df.sample(5)

Unnamed: 0,date,forecast,conf_min,conf_max
24,2013-12-27,34909.822592,26497.606684,43322.038501
35,2014-01-07,39898.596742,31486.380833,48310.812651
39,2014-01-11,35047.903321,26635.687412,43460.119229
12,2013-12-15,30444.394149,22032.17824,38856.610058
9,2013-12-12,33941.297352,25529.081444,42353.513261
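
One property of this band is visible in the sample above: since `compute_confidence_intervals` uses a single standard deviation for the whole horizon, the distance between `conf_min` and `conf_max` is the same in every row. The short check below copies the five sampled rows by hand, so the arrays are only an excerpt of `forecast_df`.

```python
import numpy as np

# conf_min and conf_max copied from the five sampled rows above
conf_min = np.array(
    [26497.606684, 31486.380833, 26635.687412, 22032.17824, 25529.081444]
)
conf_max = np.array(
    [43322.038501, 48310.812651, 43460.119229, 38856.610058, 42353.513261]
)

band_width = conf_max - conf_min
implied_std = band_width / (2 * 1.96)  # width = 2 * sigma * std_dev, with sigma = 1.96

print(band_width.round(2))   # identical in every row
print(implied_std.round(2))  # the single std_dev used for the whole horizon
```

A band like this does not widen as the forecast moves further into the future, so it should be read as a visual aid rather than as a measure of growing uncertainty.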


## Plotting our DataFrame

In [15]:
def plot_forecast(df_agg, target, forecast_df):
    df_agg_reset = df_agg.reset_index().rename(columns={"index": "date"})
    fig = px.line(
        df_agg_reset, x="date", y=target, title=f"{target.capitalize()} Forecast"
    )
    fig.add_scatter(
        x=forecast_df["date"], y=forecast_df["forecast"], mode="lines", name="Forecast"
    )
    fig.add_scatter(
        x=forecast_df["date"],
        y=forecast_df["conf_min"],
        mode="lines",
        name="Lower Bound",
        line=dict(dash="dash"),
    )
    fig.add_scatter(
        x=forecast_df["date"],
        y=forecast_df["conf_max"],
        mode="lines",
        name="Upper Bound",
        line=dict(dash="dash"),
    )
    return fig

In [16]:
plot_forecast(prepared_data, "sales", forecast_df)