# mlforecast

> Scalable machine learning based time series forecasting.

**mlforecast** is a framework to perform time series forecasting using machine learning models, with the option to scale to massive amounts of data using remote clusters.

![CI](https://img.shields.io/github/workflow/status/Nixtla/mlforecast/CI)
![Python](https://img.shields.io/pypi/pyversions/mlforecast)
![PyPi](https://img.shields.io/pypi/v/mlforecast)
![License](https://img.shields.io/github/license/Nixtla/mlforecast)

## Install

`pip install mlforecast`

### Optional dependencies
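If you want more functionality you can instead use `pip install "mlforecast[extra1,extra2,...]"` (quote the argument and leave out spaces so that your shell passes it to pip as a single requirement). The current extra dependencies are: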

* **aws**: adds the functionality to use S3 as the storage in the CLI.
* **cli**: includes the validations necessary to use the CLI.
* **distributed**: installs [dask](https://dask.org/) to perform distributed training. Note that you'll also need to install either [lightgbm](https://github.com/microsoft/LightGBM/tree/master/python-package) or [xgboost](https://xgboost.readthedocs.io/en/latest/install.html#python).

For example, if you want to perform distributed training through the CLI using S3 as your storage you'll need all three extras, which you can get using: `pip install "mlforecast[aws,cli,distributed]"`.

## How to use

### Programmatic API

In [None]:
#hide
import os
import shutil
from pathlib import Path

from IPython.display import display, Markdown


os.chdir('..')


def display_df(df):
    display(Markdown(df.to_markdown()))

Store your time series in a pandas dataframe with an index named **unique_id** that identifies each time series, a column **ds** that contains the datestamps and a column **y** with the values.

In [None]:
from mlforecast.utils import generate_daily_series

series = generate_daily_series(20)
display_df(series.head())
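
If you are bringing your own data instead of using `generate_daily_series`, the layout described above can be built by hand. The following is a minimal sketch; the series ids, dates and values are made up for illustration.

```python
import pandas as pd

# Two hypothetical series in long format: one row per (series, date) pair.
raw = pd.DataFrame({
    'unique_id': ['store_1'] * 3 + ['store_2'] * 3,
    'ds': list(pd.date_range('2021-01-01', periods=3, freq='D')) * 2,
    'y': [10.0, 12.0, 11.0, 3.0, 4.0, 5.0],
})

# Sort by series and date for readability, then move unique_id to the index
# so that only ds and y remain as columns.
own_series = raw.sort_values(['unique_id', 'ds']).set_index('unique_id')
print(own_series)
```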

Then define your flow configuration. This includes lags, transformations on the lags and date features. The lag transformations are defined as [numba](http://numba.pydata.org/) *jitted* functions that transform an array. If they take additional arguments, supply a tuple (`transform_func`, `arg1`, `arg2`, ...).

In [None]:
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean

flow_config = dict(
    lags=[7, 14],
    lag_transforms={
        1: [expanding_mean],
        7: [(rolling_mean, 7), (rolling_mean, 14)]
    },
    date_features=['dayofweek', 'month']
)
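
To get an intuition for what the entries in `lag_transforms` represent, here is a rough pandas equivalent for a single toy series. This is only an illustration: the library computes these features with the numba functions from `window_ops`, the windows are shortened here so the output fits on screen, and the handling of missing values at the start of the series may differ.

```python
import pandas as pd

y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Analogue of `1: [expanding_mean]`: shift the target by one step,
# then take the mean of everything seen so far.
expanding_mean_lag1 = y.shift(1).expanding().mean()

# Analogue of `7: [(rolling_mean, 7)]` with lag 2 and window 2:
# shift the target by two steps, then average over a sliding window.
rolling_mean_lag2_w2 = y.shift(2).rolling(2).mean()

print(pd.DataFrame({
    'y': y,
    'expanding_mean_lag1': expanding_mean_lag1,
    'rolling_mean_lag2_w2': rolling_mean_lag2_w2,
}))
```

Because each transformation is applied to a lagged copy of the target, the feature for a given row only uses values observed before that row.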

Next define a model. If you want to use the local interface this can be any regressor that follows the scikit-learn API. For distributed training there are `LGBMForecast` and `XGBForecast`.

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

Now instantiate your forecast object with the model and the flow configuration. There are two types of forecasters: `Forecast`, which is local, and `DistributedForecast`, which performs the whole process in a distributed way.

In [None]:
from mlforecast.forecast import Forecast

fcst = Forecast(model, flow_config)

To compute the features and train the model on them, call `.fit` on your `Forecast` object.

In [None]:
fcst.fit(series)

To get the forecasts for the next 14 days, call `.predict(14)` on the forecaster. Each prediction is appended to the target and the features are recomputed from the updated series to produce the next one.

In [None]:
predictions = fcst.predict(14)

display_df(predictions.head())
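
The recursive strategy described above can be sketched in a few lines of plain Python. This is not the library's implementation: `recursive_forecast` and `mean_of_lags` are hypothetical names, and the "model" is a toy function that averages its lag features.

```python
import numpy as np


def recursive_forecast(history, predict_one, lags, horizon):
    """Forecast `horizon` steps, feeding each prediction back into the history."""
    values = list(history)
    preds = []
    for _ in range(horizon):
        # Recompute the lag features from the most recent values,
        # which include any predictions made so far.
        features = np.array([values[-lag] for lag in lags])
        next_value = float(predict_one(features))
        preds.append(next_value)
        values.append(next_value)  # update the target with the prediction
    return preds


def mean_of_lags(features):
    # Toy stand-in for a trained regressor.
    return features.mean()


history = [1.0, 2.0, 3.0, 4.0]
preds = recursive_forecast(history, mean_of_lags, lags=[1, 2], horizon=3)
print(preds)  # [3.5, 3.75, 3.625]
```

Since later steps depend on earlier predictions rather than on observed values, errors can accumulate as the horizon grows.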

### CLI

If you want to compute quick baselines, avoid some boilerplate, or simply prefer working with CLIs, you can use the `mlforecast` binary with a configuration file like the following:

In [None]:
!cat sample_configs/local.yaml

The configuration is validated using `FlowConfig`.

This configuration uses the data in `data.prefix/data.input` for training and writes the results to `data.prefix/data.output`, both in the format given by `data.format`.
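
As a sketch of how those keys combine into paths, assume a `data` section with the hypothetical values below (your own values come from the YAML file):

```python
from pathlib import Path

# Hypothetical values for the `data` section of the configuration.
data_config = {
    'prefix': 'data',
    'input': 'train',
    'output': 'outputs',
    'format': 'parquet',
}

prefix = Path(data_config['prefix'])
input_path = prefix / data_config['input']    # where the training data is read from
output_path = prefix / data_config['output']  # where the results are written

print(input_path.as_posix(), output_path.as_posix(), data_config['format'])
```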

In [None]:
#hide
data_path = Path('data')
data_path.mkdir()
series.to_parquet(data_path/'train')

In [None]:
!mlforecast sample_configs/local.yaml

In [None]:
list((data_path/'outputs').iterdir())

In [None]:
#hide
shutil.rmtree(data_path)