ValueError for backtesting_forecaster when interval is provided #177

tkaraouzene · 2022-07-18T11:20:25Z

Hi!

I'm trying to use your backtesting_forecaster, and when I ask for intervals it raises a ValueError.

When no intervals are requested, everything works perfectly:

[in]:

if __name__ == "__main__":
    import pandas as pd
    from skforecast.ForecasterAutoreg import ForecasterAutoreg
    from skforecast.model_selection import backtesting_forecaster
    from sklearn.ensemble import RandomForestRegressor

    y_train = pd.Series([479.157, 478.475, 481.205, 492.467, 490.42, 508.166, 523.182,
                         499.634, 495.88, 494.174, 494.174, 490.078, 490.078, 495.539,
                         488.713, 485.3, 493.491, 492.126, 493.832, 485.983, 481.887,
                         474.379, 433.084, 456.633, 477.451, 468.919, 484.959, 471.99,
                         486.324, 498.61, 517.381, 485.3, 480.864, 485.983, 484.276,
                         490.761, 490.078, 494.515, 495.88, 493.15, 491.443, 490.42,
                         485.3, 485.3, 486.665, 467.895, 441.616, 469.601, 477.11,
                         486.324, 485.3, 489.054, 494.856, 513.968, 544.683, 557.31,
                         574.374, 603.383, 617.034, 621.812, 627.273, 612.598, 598.605,
                         610.891, 598.605, 563.112, 542.635, 536.492, 499.634, 456.633,
                         431.037, 453.903, 464.141, 454.244, 456.633, 476.768, 495.88,
                         523.524, 537.516, 577.787, 600.994, 616.693, 631.71, 636.487,
                         621.471, 635.805, 625.908, 616.011, 581.2, 565.842, 553.556,
                         570.279, 514.992, 483.253, 460.046, 469.26, 475.745, 478.816,
                         482.57, 506.801, 510.896])



    backtesting_forecaster(
        forecaster=ForecasterAutoreg(regressor=RandomForestRegressor(random_state=42), lags=10),
        y=y_train,
        steps=24,
        metric="mean_absolute_percentage_error",
        initial_train_size=14,
        n_boot=50,
    )

[out]:

(array([0.07647964]),
           pred
 14   493.40924
 15   493.17717
 16   492.99968
 17   492.98603
 18   492.69932
 ..         ...
 96   492.98603
 97   492.98603
 98   492.98603
 99   492.98603
 100  492.98603
 
 [87 rows x 1 columns])

However, asking for intervals leads to a ValueError:

[in]:

if __name__ == "__main__":
    import pandas as pd
    from skforecast.ForecasterAutoreg import ForecasterAutoreg
    from skforecast.model_selection import backtesting_forecaster
    from sklearn.ensemble import RandomForestRegressor

    y_train = pd.Series([479.157, 478.475, 481.205, 492.467, 490.42, 508.166, 523.182,
                         499.634, 495.88, 494.174, 494.174, 490.078, 490.078, 495.539,
                         488.713, 485.3, 493.491, 492.126, 493.832, 485.983, 481.887,
                         474.379, 433.084, 456.633, 477.451, 468.919, 484.959, 471.99,
                         486.324, 498.61, 517.381, 485.3, 480.864, 485.983, 484.276,
                         490.761, 490.078, 494.515, 495.88, 493.15, 491.443, 490.42,
                         485.3, 485.3, 486.665, 467.895, 441.616, 469.601, 477.11,
                         486.324, 485.3, 489.054, 494.856, 513.968, 544.683, 557.31,
                         574.374, 603.383, 617.034, 621.812, 627.273, 612.598, 598.605,
                         610.891, 598.605, 563.112, 542.635, 536.492, 499.634, 456.633,
                         431.037, 453.903, 464.141, 454.244, 456.633, 476.768, 495.88,
                         523.524, 537.516, 577.787, 600.994, 616.693, 631.71, 636.487,
                         621.471, 635.805, 625.908, 616.011, 581.2, 565.842, 553.556,
                         570.279, 514.992, 483.253, 460.046, 469.26, 475.745, 478.816,
                         482.57, 506.801, 510.896])



    backtesting_forecaster(
        forecaster=ForecasterAutoreg(regressor=RandomForestRegressor(random_state=42), lags=10),
        y=y_train,
        steps=24,
        metric="mean_absolute_percentage_error",
        initial_train_size=14,
        interval=[95],
        n_boot=50,
    )

[out]:

Traceback (most recent call last):
  File "...\lib\site-packages\IPython\core\interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-9cd2f471a2e7>", line 22, in <cell line: 22>
    backtesting_forecaster(
  File "...\lib\site-packages\skforecast\model_selection\model_selection.py", line 925, in backtesting_forecaster
    metric_value, backtest_predictions = _backtesting_forecaster_no_refit(
  File "...\lib\site-packages\skforecast\model_selection\model_selection.py", line 705, in _backtesting_forecaster_no_refit
    pred = forecaster.predict_interval(
  File "...\lib\site-packages\skforecast\ForecasterAutoreg\ForecasterAutoreg.py", line 757, in predict_interval
    predictions = pd.DataFrame(
  File "...\lib\site-packages\pandas\core\frame.py", line 694, in __init__
    mgr = ndarray_to_mgr(
  File "...\lib\site-packages\pandas\core\internals\construction.py", line 351, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "...\dev\lib\site-packages\pandas\core\internals\construction.py", line 422, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (24, 2), indices imply (24, 3)


tkaraouzene · 2022-07-18T11:29:40Z

As far as I understand, the error comes from this file: skforecast/ForecasterAutoreg/ForecasterAutoreg.py

There you can see the following lines:

        predictions_interval = self._estimate_boot_interval(
                                    steps       = steps,
                                    last_window = copy(last_window_values_original),
                                    exog        = copy(exog_values_original),
                                    interval    = interval,
                                    n_boot      = n_boot,
                                    random_state = random_state,
                                    in_sample_residuals = in_sample_residuals
                                )
        
        predictions = np.column_stack((predictions, predictions_interval))

        predictions = pd.DataFrame(
                        data = predictions,
                        index = expand_index(
                                    index = last_window_index,
                                    steps = steps
                                ),
                        columns = ['pred', 'lower_bound', 'upper_bound']
                      )

according to your documentation:

"""
Returns 
-------
prediction_interval : numpy ndarray, shape (steps, 2)
    Interval estimated for each prediction by bootstrapping:
         first column = lower bound of the interval.
         second column= upper bound interval of the interval.
"""

prediction_interval should have a shape of (24, 2); however, in my case it has a shape of (24, 1).

I think _estimate_boot_interval should return np.column_stack([-prediction_interval, prediction_interval]) instead of prediction_interval.

Do you agree with that, or is there something I don't understand?
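The shape mismatch in the traceback can be reproduced in isolation with NumPy. This is a minimal sketch, assuming the bootstrap step computes its bounds with `np.percentile(..., q=interval, axis=1)` followed by a transpose; that detail and the names `boot_predictions` and `interval_shape` are assumptions made for illustration, not taken from the library source.

```python
import numpy as np

steps, n_boot = 24, 50
rng = np.random.default_rng(42)

# Stand-ins for the point predictions and the bootstrapped predictions.
predictions = rng.normal(loc=490.0, scale=5.0, size=steps)
boot_predictions = rng.normal(loc=490.0, scale=5.0, size=(steps, n_boot))


def interval_shape(interval):
    """Return the shape of the per-step percentile array for `interval`."""
    bounds = np.percentile(boot_predictions, q=interval, axis=1).transpose()
    return bounds.shape


# One percentile yields one column, so stacking with `pred` gives only 2 columns.
one_value = np.percentile(boot_predictions, q=[95], axis=1).transpose()
stacked = np.column_stack((predictions, one_value))
print(interval_shape([95]), stacked.shape)  # (24, 1) (24, 2)

# Two percentiles yield the lower and upper columns that three column names need.
two_values = np.percentile(boot_predictions, q=[2.5, 97.5], axis=1).transpose()
print(np.column_stack((predictions, two_values)).shape)  # (24, 3)
```

Under that assumption, a one-element `interval` produces a (24, 2) array, which cannot be labelled with the three columns 'pred', 'lower_bound' and 'upper_bound'.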

JoaquinAmatRodrigo · 2022-07-18T12:48:02Z

Hi @tkaraouzene,
The argument interval must be a sequence with the lower and upper percentiles of the interval. For example, if you want a 95% interval, you should pass interval=[2.5, 97.5].

@JavierEscobarOrtiz I think we should explain this better in the docs.
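To make the percentile convention concrete, here is a small sketch that converts a central coverage level into the two-element sequence described above. The helper name `coverage_to_percentiles` is hypothetical and is not part of skforecast.

```python
def coverage_to_percentiles(coverage: float) -> list[float]:
    """Convert a central coverage in percent (e.g. 95) to [lower, upper] percentiles."""
    if not 0 < coverage < 100:
        raise ValueError(
            f"coverage must be strictly between 0 and 100, got {coverage}."
        )
    # Split the probability mass left outside the interval equally between both tails.
    tail = (100 - coverage) / 2
    return [tail, 100 - tail]


print(coverage_to_percentiles(95))  # [2.5, 97.5]
print(coverage_to_percentiles(80))  # [10.0, 90.0]
```

The returned list could then be passed as the interval argument in the failing call, in place of [95].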

tkaraouzene · 2022-07-18T16:45:47Z

Hi !

Thanks for this information.
I also suggest adding some validation to help the user in case of wrong input.
My suggestion is below:

    from typing import Optional


    def check_interval(interval: Optional[list[float]]) -> None:
        """Check that the provided interval sequence is valid.

        Parameters
        ----------
        interval : list[float], optional
            Lower and upper percentiles of the prediction interval, for example
            `[2.5, 97.5]`. Both must be between 0 and 100 inclusive. If `None`,
            no checks are made.
        """
        if interval is None:
            return

        if len(interval) != 2:
            raise ValueError(
                "Interval sequence should contain exactly 2 values, "
                "respectively the lower and upper interval bounds."
            )

        lower_bound, upper_bound = interval

        if lower_bound >= upper_bound:
            raise ValueError(
                f"Lower interval bound ({lower_bound}) has to be strictly smaller "
                f"than upper interval bound ({upper_bound})."
            )

        if lower_bound < 0:
            raise ValueError(f"Lower interval bound ({lower_bound}) has to be >= 0.")

        if upper_bound > 100:
            raise ValueError(f"Upper interval bound ({upper_bound}) has to be <= 100.")

JoaquinAmatRodrigo · 2022-07-19T07:54:02Z

I like this check, @tkaraouzene. We will add it in the new release.
Thanks for the suggestion!

tkaraouzene · 2022-07-20T07:35:03Z

Hi @JoaquinAmatRodrigo !

If you want, I can open a pull request.

JoaquinAmatRodrigo · 2022-07-20T07:48:03Z

Hi @tkaraouzene,
I just added part of your code to the development branch 0.5.x.
I am planning to mention you in the release notes. Is that fine with you?

tkaraouzene · 2022-07-20T09:09:29Z

Hi @JoaquinAmatRodrigo !

> I just added part of your code to the development branch 0.5.x

I didn't see that you had added the code, so I made a PR (I saw your code thanks to the merge conflicts) 🤦‍♂️...
I'll let you reject the PR.

> I am planning to mention you in the information of the release. Is it fine for you?

Yes, perfect, thank you.

JavierEscobarOrtiz · 2022-07-21T07:24:45Z

Hello @tkaraouzene,

We are closing this issue. Thank you very much for your help.

You can see it mentioned in the changelog:

[0.5.0] - [Dev]

Added

Functions _backtesting_forecaster_verbose, random_search_forecaster, _evaluate_grid_hyperparameters, bayesian_search_forecaster, _bayesian_search_optuna and _bayesian_search_skopt in model_selection.
Created ForecasterAutoregMultiSeries class for multi time series forecasting.
Created module model_selection_multiseries. Functions: _backtesting_forecaster_multiseries_refit, _backtesting_forecaster_multiseries_no_refit, backtesting_forecaster_multiseries, grid_search_forecaster_multiseries, random_search_forecaster_multiseries and _evaluate_grid_hyperparameters_multiseries.
Function _check_interval in utils. (suggested by Thomas Karaouzene https://github.com/tkaraouzene)

...

JavierEscobarOrtiz added enhancement New feature or request good first issue Good for newcomers labels Jul 19, 2022

tkaraouzene mentioned this issue Jul 20, 2022

Add check_check for ForecasterAutoreg.predict_interval #180

Closed

JavierEscobarOrtiz closed this as completed Jul 21, 2022
