# Run Fugue + Nixtla

## Import and Modularize

Following the Ray experiment, we split the work into two steps: generating the data and forecasting.

In [None]:
import argparse
import os
from time import time

import pandas as pd
from statsforecast.utils import generate_series
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, MSTL

In [None]:
def generate_data(n):
    # Generate n synthetic series and cast unique_id to int
    df = generate_series(n_series=n, seed=1).reset_index()
    return df.assign(unique_id=df.unique_id.astype(int))

# schema: *-y+AutoARIMA:double
def forecast(df: pd.DataFrame) -> pd.DataFrame:
    # Fit AutoARIMA (season length 7, daily frequency) and forecast 7 steps ahead
    tdf = df.set_index("unique_id")
    model = StatsForecast(models=[AutoARIMA(season_length=7)], freq="D", n_jobs=1)
    return model.forecast(h=7, df=tdf).reset_index()

Now test them:

In [None]:
forecast(generate_data(1))

Unnamed: 0,unique_id,ds,AutoARIMA
0,0,2000-03-28,3.215056
1,0,2000-03-29,4.273245
2,0,2000-03-30,5.277228
3,0,2000-03-31,6.247396
4,0,2000-04-01,0.258311
5,0,2000-04-02,1.303178
6,0,2000-04-03,2.218654
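
The `# schema: *-y+AutoARIMA:double` comment above `forecast` is a Fugue schema hint that describes the output columns in terms of the input columns. The sketch below is only a toy model of how that hint reads (keep every input column, drop `y`, add `AutoARIMA` as a double); the helper `apply_schema_hint` is hypothetical and is not part of Fugue.

```python
def apply_schema_hint(input_columns, drop, add):
    """Toy reading of a hint such as '*-y+AutoARIMA:double' (hypothetical helper, not Fugue's parser)."""
    # '*' keeps all input columns, '-name' removes one of them
    kept = [c for c in input_columns if c not in drop]
    # '+name:type' appends a column if it is not already present
    return kept + [name for name, _type in add if name not in kept]


input_columns = ["unique_id", "ds", "y"]
output_columns = apply_schema_hint(
    input_columns, drop={"y"}, add=[("AutoARIMA", "double")]
)
print(output_columns)
```

The resulting column list matches the columns of the forecast output shown above (ignoring the row index).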


## Using Fugue Transform

Now we rewrite the unit test using Fugue's `transform` function and specify the logical partition key.

This step is still unit-testable because both the input and the output are pandas DataFrames:

In [None]:
from fugue import transform

transform(generate_data(2), forecast, partition="unique_id", as_local=True)

Unnamed: 0,unique_id,ds,AutoARIMA
0,0,2000-03-28,3.2514
1,0,2000-03-29,4.272859
2,0,2000-03-30,5.271996
3,0,2000-03-31,6.208647
4,0,2000-04-01,0.268886
5,0,2000-04-02,1.238485
6,0,2000-04-03,2.380695
7,1,2000-10-12,1.321963
8,1,2000-10-13,2.221669
9,1,2000-10-14,3.264195
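
Conceptually, partitioning by `unique_id` means the function is called once per series and the results are combined. The following is a minimal pandas-only sketch of that idea, not Fugue's implementation; `naive_forecast` and `transform_by_key` are hypothetical names, and the naive model stands in for `forecast` so the example runs without StatsForecast.

```python
import pandas as pd


def naive_forecast(df: pd.DataFrame, h: int = 2) -> pd.DataFrame:
    # Hypothetical stand-in for `forecast`: repeat the last observed value h times
    last = df.sort_values("ds").iloc[-1]
    future = pd.date_range(last["ds"], periods=h + 1, freq="D")[1:]
    return pd.DataFrame(
        {"unique_id": int(last["unique_id"]), "ds": future, "naive": float(last["y"])}
    )


def transform_by_key(df: pd.DataFrame, fn, key: str) -> pd.DataFrame:
    # Call fn on each logical partition (one group per key value), then concatenate
    parts = [fn(group) for _, group in df.groupby(key, sort=True)]
    return pd.concat(parts, ignore_index=True)


data = pd.DataFrame(
    {
        "unique_id": [0, 0, 0, 1, 1],
        "ds": pd.to_datetime(
            ["2000-01-01", "2000-01-02", "2000-01-03", "2000-02-01", "2000-02-02"]
        ),
        "y": [1.0, 2.0, 3.0, 10.0, 20.0],
    }
)
result = transform_by_key(data, naive_forecast, "unique_id")
print(result)
```

Because each call only sees one series, the per-series function can be tested on a plain pandas DataFrame before it is handed to a distributed engine.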


### Test on Spark

We only need to add the `engine` parameter to move the computation to the Spark cluster on Databricks (`spark` is the SparkSession).

We also add `num` as a partition parameter to control the balance between overhead and load balancing:

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [None]:
%%time
transform(generate_data(20), forecast, partition={"num":500, "by":"unique_id"}, as_local=True, engine=spark)

CPU times: user 243 ms, sys: 28.6 ms, total: 271 ms
Wall time: 1min


Unnamed: 0,unique_id,ds,AutoARIMA
0,0,2000-03-28,3.269367
1,0,2000-03-29,4.248864
2,0,2000-03-30,5.251064
3,0,2000-03-31,6.324217
4,0,2000-04-01,0.240028
...,...,...,...
135,19,2001-02-13,2.280188
136,19,2001-02-14,3.224672
137,19,2001-02-15,4.286341
138,19,2001-02-16,5.206712
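
To make the trade-off behind `num` concrete, here is a toy sketch that spreads distinct keys over partitions in round-robin order. It is an assumption-laden illustration only: `assign_partitions` is a hypothetical helper, and real engines such as Spark or Dask choose their own key-to-partition assignment. The point it shows is that all rows of one key stay together, so no more partitions than distinct keys can hold data.

```python
from collections import Counter


def assign_partitions(keys, num):
    """Toy round-robin assignment of distinct keys to partitions (hypothetical, not Fugue's logic)."""
    distinct = sorted(set(keys))
    # Rows sharing a key stay together, so at most len(distinct) partitions are non-empty
    n = max(1, min(num, len(distinct)))
    return {key: i % n for i, key in enumerate(distinct)}


keys = range(20)  # 20 series, as in generate_data(20)
for num in (500, 4):
    sizes = Counter(assign_partitions(keys, num).values())
    print(f"num={num}: {len(sizes)} non-empty partitions, largest holds {max(sizes.values())} keys")
```

In this toy model a large `num` gives many small tasks (more scheduling overhead, finer load balancing), while a small `num` gives fewer, larger tasks.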


### Test on Dask

The only difference is the value of the `engine` parameter:

In [None]:
%%time
transform(generate_data(10000), forecast, partition={"num":500, "by":"unique_id"}, as_local=True, engine="dask")

Unnamed: 0,unique_id,ds,auto_arima
0,28,2000-09-22,5.808166
1,28,2000-09-23,2.667787
2,28,2000-09-24,1.198313
3,28,2000-09-25,0.240285
4,28,2000-09-26,0.109615
...,...,...,...
69995,9917,2000-06-14,1.498216
69996,9917,2000-06-15,1.182815
69997,9917,2000-06-16,0.933812
69998,9917,2000-06-17,0.737228


In [None]:
sdf = spark.createDataFrame(generate_data(4))



In [None]:
# Import the Fugue integration so that StatsForecast can work on the Spark DataFrame
import statsforecast.distributed.fugue

model = StatsForecast(models=[AutoARIMA(season_length=7)], freq="D", n_jobs=1)
model.forecast(h=7, df=sdf).toPandas()


Unnamed: 0,unique_id,ds,AutoARIMA
0,0,2000-03-28,3.271822
1,0,2000-03-29,4.271764
2,0,2000-03-30,5.208725
3,0,2000-03-31,6.266395
4,0,2000-04-01,0.2383
5,0,2000-04-02,1.378324
6,0,2000-04-03,2.274187
7,1,2000-10-12,1.220782
8,1,2000-10-13,2.263074
9,1,2000-10-14,3.267686


In [None]:
import fugue.api as fa

with fa.engine_context(spark):
    print(model.forecast(h=7, df=generate_data(4)).toPandas())

    unique_id         ds  AutoARIMA
0           0 2000-03-28   3.271822
1           0 2000-03-29   4.271764
2           0 2000-03-30   5.208725
3           0 2000-03-31   6.266395
4           0 2000-04-01   0.238300
5           0 2000-04-02   1.378324
6           0 2000-04-03   2.274187
7           1 2000-10-12   1.220782
8           1 2000-10-13   2.263074
9           1 2000-10-14   3.267686
10          1 2000-10-15   4.238875
11          1 2000-10-16   5.275468
12          1 2000-10-17   6.250961
13          1 2000-10-18   0.336281
14          2 2001-03-22   6.285254
15          2 2001-03-23   0.258897
16          2 2001-03-24   1.262251
17          2 2001-03-25   2.276786
18          2 2001-03-26   3.264612
19          2 2001-03-27   4.239303
20          2 2001-03-28   5.280932
21          3 2000-05-02   2.265761
22          3 2000-05-03   3.199819
23          3 2000-05-04   4.217481
24          3 2000-05-05   5.250148
25          3 2000-05-06   6.224373
26          3 2000-05-07   0

