Native Time Series and Forecasting Support (Sequence Learning) #49

Open
andrewdalpino opened this issue Nov 11, 2019 · 10 comments

@andrewdalpino
Member

andrewdalpino commented Nov 11, 2019

Time series analysis is a popular machine learning technique for forecasting trends of time-dependent variables such as stock price, GDP, and quarterly sales. Given the popularity (#35, #38, #40) and current lack of tooling within the PHP ecosystem, I propose adding native time series support as well as a new type of estimator class for forecasting time series datasets. This includes the following ...

  1. A data structure extending Dataset for time series datasets that includes an additional index for timestamps
  2. An additional estimator type "Forecaster" to predict the next k values in a series

There should be no need to modify any of the public interfaces to integrate these features into the current architecture.

Proposed initial Forecaster implementations:

  • ARIMA - AutoRegressive Integrated Moving Average (univariate)
  • VARMAX - Vector AutoRegressive Moving Average with eXogenous regressors (multivariate)

Open to comments

@andrewdalpino andrewdalpino added the enhancement New feature or request label Nov 11, 2019
@andrewdalpino andrewdalpino self-assigned this Nov 11, 2019
@andrewdalpino andrewdalpino added this to Backlog in Roadmap via automation Nov 11, 2019
@andrewdalpino andrewdalpino moved this from Backlog to In progress in Roadmap Nov 11, 2019
@BasvanH
Contributor

BasvanH commented Nov 12, 2019

Yes, I would very much like those additions to the library. Thank you!

@andrewdalpino
Member Author

andrewdalpino commented Nov 13, 2019

Thanks for the input @BasvanH

Expanding on the aforementioned design outline ...

The TimeSeries dataset object will have additional sorting, filtering, etc. methods that operate on the timestamp column. These will be similar to how Labeled provides additional methods that operate on labels. The timestamp column will allow either homogeneous integer or DateTime object elements.
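To make the idea concrete, here is a minimal standalone sketch of a dataset that carries a parallel timestamp index and exposes sorting and filtering on it. Everything in it is hypothetical: the class name `TimeSeriesSketch` and its methods are illustrative and are not part of Rubix ML, and normalizing DateTime objects to Unix timestamps is just one assumed way of keeping the index homogeneous.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch only: samples plus a parallel timestamp index.
 * This is not the Rubix ML Dataset API.
 */
final class TimeSeriesSketch
{
    /** @var list<list<int|float>> */
    private array $samples;

    /** @var list<int> */
    private array $timestamps;

    public function __construct(array $samples, array $timestamps)
    {
        if (count($samples) !== count($timestamps)) {
            throw new InvalidArgumentException('Each sample needs exactly one timestamp.');
        }

        $this->samples = array_values($samples);

        // Assumption: DateTime objects are normalized to integer Unix timestamps.
        $this->timestamps = array_map(
            static fn ($t): int => $t instanceof DateTimeInterface ? $t->getTimestamp() : (int) $t,
            array_values($timestamps)
        );
    }

    public function samples(): array
    {
        return $this->samples;
    }

    public function timestamps(): array
    {
        return $this->timestamps;
    }

    /** Number of variables per sample (1 = univariate). */
    public function numColumns(): int
    {
        return count($this->samples[0] ?? []);
    }

    /** Return a new dataset ordered by ascending timestamp. */
    public function sortByTimestamp(): self
    {
        $order = array_keys($this->timestamps);

        usort($order, fn (int $a, int $b): int => $this->timestamps[$a] <=> $this->timestamps[$b]);

        return $this->pick($order);
    }

    /** Return a new dataset keeping only rows whose timestamp passes the predicate. */
    public function filterByTimestamp(callable $predicate): self
    {
        return $this->pick(array_keys(array_filter($this->timestamps, $predicate)));
    }

    /** Build a new dataset from the given row offsets, in the given order. */
    private function pick(array $offsets): self
    {
        $samples = $timestamps = [];

        foreach ($offsets as $i) {
            $samples[] = $this->samples[$i];
            $timestamps[] = $this->timestamps[$i];
        }

        return new self($samples, $timestamps);
    }
}
```

For example, `(new TimeSeriesSketch([[3.0], [1.0]], [30, 10]))->sortByTimestamp()` would return a new object whose rows are in chronological order, leaving the original untouched.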

Since time series estimation often differs between the univariate and multivariate cases, the TimeSeries dataset object will handle both, simply by keeping track of the number of target variables (as already accomplished using the numColumns() method on the Dataset class). For example, a univariate TimeSeries dataset object has a single column, whereas a multivariate one has more than one column. It will be the responsibility of the estimator to check whether the incoming dataset is compatible.
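As a rough illustration of the estimator-side compatibility check, a univariate forecaster could guard its training method along these lines. The function names `countColumns()` and `assertUnivariate()` are hypothetical helpers written for this sketch and operate on plain arrays rather than on any real Dataset object.

```php
<?php

declare(strict_types=1);

/** Hypothetical helper: number of variables per sample, taken from the first row. */
function countColumns(array $samples): int
{
    return isset($samples[0]) ? count($samples[0]) : 0;
}

/** Hypothetical guard that a univariate forecaster could run before training. */
function assertUnivariate(array $samples): void
{
    $n = countColumns($samples);

    if ($n !== 1) {
        throw new InvalidArgumentException("Univariate forecaster expects 1 column, $n given.");
    }
}
```

A multivariate forecaster would do the opposite and reject datasets with fewer than two columns.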

As mentioned previously, the public Estimator API will not change with the introduction of the new estimator type. In the case of forecasters, the output of the predict() method will be the estimate of the next value given the last value in a series. The interpretation of the dataset is therefore slightly different at inference than during training, in which the dataset is interpreted as both contiguous and atomic. During inference, each sample will be considered independently, and its value will be interpreted as either the empirical or theoretical last value of a time series that the user would like to start inferring from. Since forecasters are estimators at heart, they benefit from all the additional tooling such as meta-estimators and the cross-validation framework.

In addition, we will add the Forecaster interface, allowing estimators to implement the forecast() method which, unlike predict(), will estimate the next k values starting at a given offset. It is assumed that most forecaster types will implement the Forecaster interface, as prediction (as defined above) is only a special case of forecasting where k=1. There are currently two prototypes for the forecast() method signature to consider. The first borrows the idea of start and end from the statsmodels library (see their predict API). The second uses the timestamp of the TimeSeries dataset object as the start and then outputs the next k subsequent values. The differences look like this ...

public function forecast(TimeSeries $dataset, $start, $end) : array

vs.

public function forecast(TimeSeries $dataset, int $k) : array

So far I personally prefer the latter case.

As with the Learner, Probabilistic, and Ranking interfaces, the Forecaster interface will also include the forecastSample() method to handle inference on single samples at a time.
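The k-step signature and the "prediction is forecasting with k=1" relationship could look roughly like the sketch below. It is only an illustration under several assumptions: the interface and class names are hypothetical, the series is passed as a plain array of floats instead of a TimeSeries object, forecastSample() is left out, and the drift method (extrapolating the average change between the first and last observation) is a simple textbook baseline swapped in for ARIMA or VARMAX.

```php
<?php

declare(strict_types=1);

/** Hypothetical interface mirroring the second (k-step) signature. */
interface ForecasterSketch
{
    /**
     * @param list<float> $series observed values in chronological order
     * @return list<float> the next $k estimated values
     */
    public function forecast(array $series, int $k): array;
}

/** Drift baseline: extend the line through the first and last observation. */
final class DriftForecaster implements ForecasterSketch
{
    public function forecast(array $series, int $k): array
    {
        $n = count($series);

        if ($n < 2) {
            throw new InvalidArgumentException('At least 2 observations are required.');
        }

        if ($k < 1) {
            throw new InvalidArgumentException('k must be at least 1.');
        }

        $last = $series[$n - 1];

        // Average change per step over the observed series.
        $slope = ($last - $series[0]) / ($n - 1);

        $forecasts = [];

        for ($h = 1; $h <= $k; ++$h) {
            $forecasts[] = $last + $h * $slope;
        }

        return $forecasts;
    }

    /** Prediction as the special case of forecasting where k = 1. */
    public function predict(array $series): float
    {
        return $this->forecast($series, 1)[0];
    }
}
```

One argument for the `int $k` form is visible here: the caller never has to translate between timestamps and offsets, because the forecast always starts right after the last observation.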

Open to comments

@andrewdalpino andrewdalpino moved this from In progress to Backlog in Roadmap Feb 4, 2020
@andrewdalpino
Member Author

Update:

Since we are in a feature freeze, this enhancement will be moved over to the Extras package for the time being and may be integrated into the main package after

Roadmap automation moved this from Backlog to Completed Apr 11, 2020
@andrewdalpino andrewdalpino removed this from Completed in Roadmap Apr 11, 2020
@LasseRafn

Hi! Sorry for commenting on a closed issue.

The comment said that it's moved to the Extras package, which is understandable. However, does that mean the idea will be moved there, or is it already there?

Regardless, I much appreciate all the hard work that has been put into RubixML; I'm just curious. 😄

@Rello

Rello commented Mar 17, 2021

Hello,
I would also like to know the status here.
I would like to test forecasting for an idea on my side.

Thank you

@andrewdalpino
Member Author

andrewdalpino commented Mar 17, 2021

Hello @LasseRafn and @Rello, thanks for commenting. I'll give an update and we'll reopen this issue to keep the discussion going.

We haven't got around to implementing time series in ML or Extras yet. Although we have plenty of research planned in regard to sequence learning, we have no immediate plans to implement features at this time. Having said that, we're seeing an uptick in contributions, so it's possible that someone from the community can take on this effort.

@andrewdalpino andrewdalpino reopened this Mar 17, 2021
@andrewdalpino andrewdalpino changed the title Native Time Series and Forecasting Support Native Time Series and Forecasting Support (Sequence Learning) Mar 17, 2021
@mindaugasdi

Could a simpler sequence implementation be faster to implement first?

For example, given this dataset:

[0,1,1,1,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,1,1,1,1,1,1,1,1,0,1,0,0,0,0]

I see in this data that 1 is more likely to be followed by 1, and 0 is more likely to be followed by 0. The more 1s or 0s there are in a row, the more likely the next value is to be the same. Maybe there are other patterns too. If a human can see this pattern, maybe ML could too (and state the confidence).
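One simple way to quantify the first observation is a first-order Markov chain: count how often each value is followed by each other value, then normalize the counts into probabilities. The sketch below is only an illustration with hypothetical function names (`transitionProbabilities()`, `predictNext()`), and it is not a Rubix ML feature. It also captures only the "same value tends to repeat" pattern; the run-length effect would need a longer context, such as a higher-order chain.

```php
<?php

declare(strict_types=1);

/**
 * Estimate P(next = b | current = a) from observed adjacent pairs.
 *
 * @param list<int> $sequence
 * @return array<int, array<int, float>> probabilities indexed as [from][to]
 */
function transitionProbabilities(array $sequence): array
{
    $counts = [];

    // Count every adjacent (from, to) pair in the sequence.
    for ($i = 0, $n = count($sequence) - 1; $i < $n; ++$i) {
        $from = $sequence[$i];
        $to = $sequence[$i + 1];

        $counts[$from][$to] = ($counts[$from][$to] ?? 0) + 1;
    }

    $probabilities = [];

    // Normalize each row of counts so that it sums to 1.
    foreach ($counts as $from => $row) {
        $total = (float) array_sum($row);

        foreach ($row as $to => $count) {
            $probabilities[$from][$to] = $count / $total;
        }
    }

    return $probabilities;
}

/**
 * Most likely next value after the last observation, with its estimated probability.
 *
 * @param list<int> $sequence
 * @return array{state: int, confidence: float}
 */
function predictNext(array $sequence): array
{
    if ($sequence === []) {
        throw new InvalidArgumentException('Sequence must not be empty.');
    }

    $last = $sequence[count($sequence) - 1];

    $row = transitionProbabilities($sequence)[$last] ?? [];

    if ($row === []) {
        throw new RuntimeException('No transitions were observed from the last value.');
    }

    // Sort by probability, highest first, keeping the state as the key.
    arsort($row);

    $state = array_key_first($row);

    return ['state' => $state, 'confidence' => $row[$state]];
}
```

Calling `predictNext()` on the sequence above would return the most frequent successor of its final value together with the observed relative frequency, which serves as a rough confidence.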

@itrack

itrack commented Sep 13, 2022

Hi guys!
Any news about this feature?

Thank you!

@andrewdalpino
Member Author

Hi @itrack. There's still talk about implementing VAR (vector autoregression) and LSTM. Nothing material has come about yet though. It's not that there's not enough want for sequence learning but that we really don't have the resources right now. Hopefully, we can attract more interest from the community.

@ThomasW69

Are there any new developments here in the meantime? I would also be interested in time series forecasting.


7 participants