ValueError for backtesting_forecaster when interval is provided #177

Closed
tkaraouzene opened this issue Jul 18, 2022 · 8 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@tkaraouzene

Hi!

I'm trying to use your backtesting_forecaster, and when I ask for intervals it raises a ValueError.

When no intervals are requested, everything works perfectly:

[in]:

if __name__ == "__main__":
    import pandas as pd
    from skforecast.ForecasterAutoreg import ForecasterAutoreg
    from skforecast.model_selection import backtesting_forecaster
    from sklearn.ensemble import RandomForestRegressor

    y_train = pd.Series([479.157, 478.475, 481.205, 492.467, 490.42, 508.166, 523.182,
                         499.634, 495.88, 494.174, 494.174, 490.078, 490.078, 495.539,
                         488.713, 485.3, 493.491, 492.126, 493.832, 485.983, 481.887,
                         474.379, 433.084, 456.633, 477.451, 468.919, 484.959, 471.99,
                         486.324, 498.61, 517.381, 485.3, 480.864, 485.983, 484.276,
                         490.761, 490.078, 494.515, 495.88, 493.15, 491.443, 490.42,
                         485.3, 485.3, 486.665, 467.895, 441.616, 469.601, 477.11,
                         486.324, 485.3, 489.054, 494.856, 513.968, 544.683, 557.31,
                         574.374, 603.383, 617.034, 621.812, 627.273, 612.598, 598.605,
                         610.891, 598.605, 563.112, 542.635, 536.492, 499.634, 456.633,
                         431.037, 453.903, 464.141, 454.244, 456.633, 476.768, 495.88,
                         523.524, 537.516, 577.787, 600.994, 616.693, 631.71, 636.487,
                         621.471, 635.805, 625.908, 616.011, 581.2, 565.842, 553.556,
                         570.279, 514.992, 483.253, 460.046, 469.26, 475.745, 478.816,
                         482.57, 506.801, 510.896])



    backtesting_forecaster(
        forecaster=ForecasterAutoreg(regressor=RandomForestRegressor(random_state=42), lags=10),
        y=y_train,
        steps=24,
        metric="mean_absolute_percentage_error",
        initial_train_size=14,
        n_boot=50,
    )

[out]:

(array([0.07647964]),
           pred
 14   493.40924
 15   493.17717
 16   492.99968
 17   492.98603
 18   492.69932
 ..         ...
 96   492.98603
 97   492.98603
 98   492.98603
 99   492.98603
 100  492.98603
 
 [87 rows x 1 columns])

However, asking for intervals leads to a ValueError:

[in]:

if __name__ == "__main__":
    import pandas as pd
    from skforecast.ForecasterAutoreg import ForecasterAutoreg
    from skforecast.model_selection import backtesting_forecaster
    from sklearn.ensemble import RandomForestRegressor

    y_train = pd.Series([479.157, 478.475, 481.205, 492.467, 490.42, 508.166, 523.182,
                         499.634, 495.88, 494.174, 494.174, 490.078, 490.078, 495.539,
                         488.713, 485.3, 493.491, 492.126, 493.832, 485.983, 481.887,
                         474.379, 433.084, 456.633, 477.451, 468.919, 484.959, 471.99,
                         486.324, 498.61, 517.381, 485.3, 480.864, 485.983, 484.276,
                         490.761, 490.078, 494.515, 495.88, 493.15, 491.443, 490.42,
                         485.3, 485.3, 486.665, 467.895, 441.616, 469.601, 477.11,
                         486.324, 485.3, 489.054, 494.856, 513.968, 544.683, 557.31,
                         574.374, 603.383, 617.034, 621.812, 627.273, 612.598, 598.605,
                         610.891, 598.605, 563.112, 542.635, 536.492, 499.634, 456.633,
                         431.037, 453.903, 464.141, 454.244, 456.633, 476.768, 495.88,
                         523.524, 537.516, 577.787, 600.994, 616.693, 631.71, 636.487,
                         621.471, 635.805, 625.908, 616.011, 581.2, 565.842, 553.556,
                         570.279, 514.992, 483.253, 460.046, 469.26, 475.745, 478.816,
                         482.57, 506.801, 510.896])



    backtesting_forecaster(
        forecaster=ForecasterAutoreg(regressor=RandomForestRegressor(random_state=42), lags=10),
        y=y_train,
        steps=24,
        metric="mean_absolute_percentage_error",
        initial_train_size=14,
        interval=[95],
        n_boot=50,
    )

[out]:

Traceback (most recent call last):
  File "...\lib\site-packages\IPython\core\interactiveshell.py", line 3398, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-9cd2f471a2e7>", line 22, in <cell line: 22>
    backtesting_forecaster(
  File "...\lib\site-packages\skforecast\model_selection\model_selection.py", line 925, in backtesting_forecaster
    metric_value, backtest_predictions = _backtesting_forecaster_no_refit(
  File "...\lib\site-packages\skforecast\model_selection\model_selection.py", line 705, in _backtesting_forecaster_no_refit
    pred = forecaster.predict_interval(
  File "...\lib\site-packages\skforecast\ForecasterAutoreg\ForecasterAutoreg.py", line 757, in predict_interval
    predictions = pd.DataFrame(
  File "...\lib\site-packages\pandas\core\frame.py", line 694, in __init__
    mgr = ndarray_to_mgr(
  File "...\lib\site-packages\pandas\core\internals\construction.py", line 351, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "...\dev\lib\site-packages\pandas\core\internals\construction.py", line 422, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (24, 2), indices imply (24, 3)
@tkaraouzene
Author

As far as I understand, the error comes from this file: skforecast/ForecasterAutoreg/ForecasterAutoreg.py (.nox/dev/Lib/site-packages/skforecast/ForecasterAutoreg/ForecasterAutoreg.py)

There you can see the following lines:

        predictions_interval = self._estimate_boot_interval(
                                    steps       = steps,
                                    last_window = copy(last_window_values_original),
                                    exog        = copy(exog_values_original),
                                    interval    = interval,
                                    n_boot      = n_boot,
                                    random_state = random_state,
                                    in_sample_residuals = in_sample_residuals
                                )
        
        predictions = np.column_stack((predictions, predictions_interval))

        predictions = pd.DataFrame(
                        data = predictions,
                        index = expand_index(
                                    index = last_window_index,
                                    steps = steps
                                ),
                        columns = ['pred', 'lower_bound', 'upper_bound']
                      )

According to your documentation:

"""
Returns 
-------
prediction_interval : numpy ndarray, shape (steps, 2)
    Interval estimated for each prediction by bootstrapping:
         first column = lower bound of the interval.
         second column= upper bound interval of the interval.
"""

prediction_interval should have a shape of (24, 2); however, in my case it has a shape of (24, 1).

I think _estimate_boot_interval should return np.column_stack([-prediction_interval, prediction_interval]) instead of prediction_interval.

Do you agree with that, or is there something I don't understand?
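
For reference, the shape mismatch in the traceback can be reproduced with NumPy alone. The sketch below assumes the interval bounds are computed as percentiles of bootstrapped predictions (an assumption about the internals, not taken from the skforecast source). The names `boot_predictions`, `point_predictions`, and `stack_with_interval` are hypothetical stand-ins. Under that assumption, a one-element `interval` yields a single bound column, so the stacked array has 2 columns where 3 column names are expected.

```python
import numpy as np

steps, n_boot = 24, 50
rng = np.random.default_rng(42)

# Hypothetical stand-in for bootstrapped predictions:
# one row per forecast step, one column per bootstrap draw.
boot_predictions = rng.normal(loc=490.0, scale=10.0, size=(steps, n_boot))
point_predictions = boot_predictions.mean(axis=1)


def stack_with_interval(interval):
    """Stack point predictions with one column per requested percentile."""
    # np.percentile returns shape (len(interval), steps); transpose to (steps, len(interval)).
    bounds = np.percentile(boot_predictions, q=interval, axis=1).transpose()
    return np.column_stack((point_predictions, bounds))


one_percentile = stack_with_interval([95])          # a single percentile
two_percentiles = stack_with_interval([2.5, 97.5])  # lower and upper percentiles

print(one_percentile.shape)   # (24, 2): too few columns for ['pred', 'lower_bound', 'upper_bound']
print(two_percentiles.shape)  # (24, 3)
```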

@JoaquinAmatRodrigo
Owner

Hi @tkaraouzene,
The argument interval must be a sequence with the lower and upper limits of the interval, expressed as percentiles. For example, if you want a 95% interval, you should pass interval=[2.5, 97.5].

@JavierEscobarOrtiz I think we should explain this better in the docs.
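
As a minimal sketch of that convention, the helper below converts a central coverage level into the two percentiles the `interval` argument expects. `coverage_to_percentiles` is a hypothetical name introduced here for illustration; it is not part of skforecast.

```python
def coverage_to_percentiles(coverage: float) -> list[float]:
    """Convert a central coverage level (in percent) into [lower, upper] percentiles."""
    if not 0 < coverage < 100:
        raise ValueError(f"coverage ({coverage}) must be strictly between 0 and 100.")
    # Split the uncovered probability mass equally between both tails.
    tail = (100 - coverage) / 2
    return [tail, 100 - tail]


interval = coverage_to_percentiles(95)
print(interval)  # [2.5, 97.5]
```

The resulting list could then be passed as `interval=interval` in the `backtesting_forecaster` call from the first comment, in place of `interval=[95]`.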

@tkaraouzene
Author

Hi!

Thanks for this information.
I also suggest adding some validation, which could help the user in case of wrong input.
Below is my suggestion:

    from typing import Optional


    def check_interval(interval: Optional[list[float]]) -> None:
        """Check that the provided prediction interval sequence is valid.

        Parameters
        ----------
        interval : list[float], optional
            Confidence of the prediction interval estimated. Sequence of two
            percentiles (lower, upper), each between 0 and 100 inclusive.
            If `None`, no checks are made.
        """
        if interval is not None:

            # Exactly two percentiles are required: lower and upper bound.
            if len(interval) != 2:
                raise ValueError(
                    "Interval sequence should contain exactly 2 values, "
                    "respectively lower and upper interval bounds."
                )

            lower_bound = interval[0]
            upper_bound = interval[1]

            # The bounds must be ordered and must not coincide.
            if lower_bound >= upper_bound:
                raise ValueError(
                    f"Lower interval bound ({lower_bound}) has to be strictly smaller than upper interval bound "
                    f"({upper_bound})."
                )

            # Percentiles are only defined on [0, 100].
            if lower_bound < 0:
                raise ValueError(f"Lower interval bound ({lower_bound}) has to be >= 0.")

            if upper_bound > 100:
                raise ValueError(f"Upper interval bound ({upper_bound}) has to be <= 100.")
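
To illustrate how such a check would behave on the inputs discussed in this thread, here is a small self-contained sketch. It uses a condensed restatement of the proposed function (not the version later added to skforecast), and `is_valid` is a hypothetical helper used only to collect results.

```python
from typing import Optional, Sequence


def check_interval(interval: Optional[Sequence[float]]) -> None:
    """Condensed restatement of the check proposed above."""
    if interval is None:
        return
    if len(interval) != 2:
        raise ValueError("Interval sequence should contain exactly 2 values.")
    lower_bound, upper_bound = interval
    if lower_bound >= upper_bound:
        raise ValueError(
            f"Lower bound ({lower_bound}) has to be strictly smaller than "
            f"upper bound ({upper_bound})."
        )
    if lower_bound < 0 or upper_bound > 100:
        raise ValueError("Interval bounds have to lie between 0 and 100.")


def is_valid(interval: Optional[Sequence[float]]) -> bool:
    """Return True if check_interval accepts the input, False if it raises."""
    try:
        check_interval(interval)
    except ValueError:
        return False
    return True


candidates = ([95], [2.5, 97.5], [97.5, 2.5], [-1, 50], None)
results = {str(candidate): is_valid(candidate) for candidate in candidates}
for candidate, valid in results.items():
    print(f"{candidate}: {'accepted' if valid else 'rejected'}")
```

With this check in place, `interval=[95]` from the original report would be rejected up front with a descriptive message instead of failing later with a shape error.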

@JoaquinAmatRodrigo
Owner

I like this check, @tkaraouzene. We will add it in the new release.
Thanks for the suggestion!

@JavierEscobarOrtiz added the enhancement and good first issue labels on Jul 19, 2022
@tkaraouzene
Author

Hi @JoaquinAmatRodrigo!

If you want, I can open a pull request.

@JoaquinAmatRodrigo
Owner

Hi @tkaraouzene,
I just added part of your code to the development branch 0.5.x
I am planning to mention you in the information of the release. Is it fine for you?

@tkaraouzene
Author

tkaraouzene commented Jul 20, 2022

Hi @JoaquinAmatRodrigo!

> I just added part of your code to the development branch 0.5.x

I hadn't seen that you had added the code, so I opened a PR (I only noticed your code because of the merge conflicts) 🤦‍♂️...
Feel free to reject the PR.

> I am planning to mention you in the information of the release. Is it fine for you?

Yes, perfect, thank you.

@JavierEscobarOrtiz
Collaborator

Hello @tkaraouzene,

We are closing this issue. Thank you very much for your help.

You can see it mentioned in the changelog:

[0.5.0] - [Dev]

Added

  • Functions _backtesting_forecaster_verbose, random_search_forecaster, _evaluate_grid_hyperparameters, bayesian_search_forecaster, _bayesian_search_optuna and _bayesian_search_skopt in model_selection.

  • Created ForecasterAutoregMultiSeries class for multi time series forecasting.

  • Created module model_selection_multiseries. Functions: _backtesting_forecaster_multiseries_refit, _backtesting_forecaster_multiseries_no_refit, backtesting_forecaster_multiseries, grid_search_forecaster_multiseries, random_search_forecaster_multiseries and _evaluate_grid_hyperparameters_multiseries.

  • Function _check_interval in utils. (suggested by Thomas Karaouzene https://github.com/tkaraouzene)

...
