# How-to on Spark: Forecasting
> Run TimeGPT in a distributed fashion on top of Spark.

`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` reads the input DataFrame and uses the corresponding engine. For example, if the input is a Spark DataFrame, `TimeGPT` will use the existing Spark session to run the forecast.
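To make the idea of "use the engine that matches the input" concrete, here is a minimal sketch of type-based dispatch. It is not how `nixtlats` or Fugue actually select an engine; `guess_engine` and its module-name lookup table are hypothetical and only illustrate the concept.

```python
def guess_engine(df) -> str:
    """Guess a backend name from the top-level module of the object's class.

    Hypothetical helper for illustration only; the real engine
    selection is handled by the library and may work differently.
    """
    root = type(df).__module__.split(".")[0]
    known = {"pyspark": "spark", "dask": "dask", "ray": "ray", "pandas": "pandas"}
    return known.get(root, "unknown")


# Stand-in class that only mimics the module path of a Spark DataFrame,
# so the sketch runs without Spark installed.
FakeSparkDataFrame = type("DataFrame", (), {"__module__": "pyspark.sql.dataframe"})

print(guess_engine(FakeSparkDataFrame()))
print(guess_engine([1, 2, 3]))
```

The point of the sketch is that the caller never names the engine: the type of the DataFrame passed in is enough to decide where the work runs.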


In [None]:
#| hide
from nixtlats.utils import colab_badge

In [None]:
#| echo: false
colab_badge('docs/how-to-guides/0_distributed_fcst_spark')

[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/how-to-guides/0_distributed_fcst_spark.ipynb)

## Installation

As long as Spark is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Spark cluster, make sure the `nixtlats` library is installed across all the workers.
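A quick way to confirm that a package is importable in a given Python environment is to ask the import system for its spec. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, not part of `nixtlats`.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every listed package is importable in this environment.
print(missing_packages(["nixtlats"]))
```

Run as-is, this only inspects the environment of the current process (typically the driver). To check the workers, the same function would have to be executed inside a Spark task, which is not shown here.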

## Executing on Spark

To distribute the forecasts on Spark, pass a Spark DataFrame instead of a pandas DataFrame.

In [None]:
#| hide
import os

import pandas as pd
from dotenv import load_dotenv

load_dotenv()

True

Instantiate the `NixtlaClient` class.

In [None]:
from nixtlats import NixtlaClient

In [None]:
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'
)
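The comment in the cell above notes that the key defaults to the `NIXTLA_API_KEY` environment variable. The following sketch shows that fallback pattern in isolation; `resolve_api_key` is a hypothetical helper written for illustration and does not reproduce the client's internal logic.

```python
import os


def resolve_api_key(explicit=None, env_var="NIXTLA_API_KEY"):
    """Prefer an explicitly passed key, otherwise fall back to the environment.

    Hypothetical helper: it mirrors the documented default, not the
    library's actual implementation.
    """
    key = explicit if explicit is not None else os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key given and {env_var} is not set.")
    return key
```

Keeping the key in the environment (for example through a `.env` file loaded with `load_dotenv()`) avoids hard-coding it in the notebook.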

In [None]:
#| hide
nixtla_client = NixtlaClient()

Use Spark as an engine.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/01 03:34:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Forecast

In [None]:
url_df = 'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
spark_df = spark.createDataFrame(pd.read_csv(url_df))
spark_df.show(5)

                                                                                

+---------+-------------------+-----+
|unique_id|                 ds|    y|
+---------+-------------------+-----+
|       BE|2016-12-01 00:00:00| 72.0|
|       BE|2016-12-01 01:00:00| 65.8|
|       BE|2016-12-01 02:00:00|59.99|
|       BE|2016-12-01 03:00:00|50.69|
|       BE|2016-12-01 04:00:00|52.58|
+---------+-------------------+-----+
only showing top 5 rows



In [None]:
fcst_df = nixtla_client.forecast(spark_df, h=12)
fcst_df.show(5)

INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
INFO:nixtlats.timegpt:Inferred freq: H
INFO:nixtlats.timegpt:Calling Forecast Endpoint...


+---------+-------------------+------------------+
|unique_id|                 ds|           TimeGPT|
+---------+-------------------+------------------+
|       FR|2016-12-31 00:00:00|62.130218505859375|
|       FR|2016-12-31 01:00:00|56.890830993652344|
|       FR|2016-12-31 02:00:00| 52.23155212402344|
|       FR|2016-12-31 03:00:00| 48.88866424560547|
|       FR|2016-12-31 04:00:00| 46.49836730957031|
+---------+-------------------+------------------+
only showing top 5 rows
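With `h=12` and an inferred hourly frequency, each series gets 12 forecast rows that continue right after its last observation. The sketch below reproduces those timestamps with the standard library. The last observed timestamp (`2016-12-30 23:00:00`) is an assumption inferred from the first forecast row shown above, and `future_timestamps` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def future_timestamps(last_observed, h, step=timedelta(hours=1)):
    """Return the h timestamps that follow last_observed at a fixed step."""
    return [last_observed + step * i for i in range(1, h + 1)]


# Assumed last observation, inferred from the forecast starting at 2016-12-31 00:00:00.
last = datetime(2016, 12, 30, 23)
stamps = future_timestamps(last, h=12)
print(stamps[0], stamps[-1], len(stamps))
```

Under that assumption the horizon runs from `2016-12-31 00:00:00` to `2016-12-31 11:00:00` for every `unique_id`.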



                                                                                

In [None]:
#| hide
from fastcore.test import test_fail

In [None]:
#| hide
# test different results for different models
fcst_df_1 = fcst_df.toPandas()
fcst_df_2 = nixtla_client.forecast(spark_df, h=12, model='timegpt-1-long-horizon')
fcst_df_2 = fcst_df_2.toPandas()
test_fail(
    lambda: pd.testing.assert_frame_equal(fcst_df_1[['TimeGPT']], fcst_df_2[['TimeGPT']]),
    contains='(column name="TimeGPT") are different'
)


### Forecast with exogenous variables

Exogenous variables, or external factors, are crucial in time series forecasting because they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlates with the time series you are forecasting.

For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.

To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.
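The pairing described above amounts to a join on the series identifier and the timestamp. Here is a minimal sketch in plain Python, using the ice cream example; the shop name, dates, and numbers are made up for illustration, and `attach_exogenous` is a hypothetical helper.

```python
# Made-up illustrative values, not real data.
sales = [
    ("shop_1", "2024-07-01", 120.0),
    ("shop_1", "2024-07-02", 150.0),
]
temperature = {
    ("shop_1", "2024-07-01"): 24.5,
    ("shop_1", "2024-07-02"): 29.0,
}


def attach_exogenous(series_rows, exog, name):
    """Pair every (unique_id, ds, y) row with its exogenous value."""
    paired = []
    for unique_id, ds, y in series_rows:
        key = (unique_id, ds)
        if key not in exog:
            # Every observation needs a matching external value.
            raise KeyError(f"Missing {name} for {key}")
        paired.append({"unique_id": unique_id, "ds": ds, "y": y, name: exog[key]})
    return paired


rows = attach_exogenous(sales, temperature, "temperature")
print(rows[0])
```

The result has the same shape as the dataset used in this guide: one row per `unique_id` and `ds`, with the target `y` followed by the exogenous columns.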

Let's see an example.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')
spark_df = spark.createDataFrame(df)
spark_df.show(5)

+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+
|unique_id|                 ds|    y|Exogenous1|Exogenous2|day_0|day_1|day_2|day_3|day_4|day_5|day_6|
+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+
|       BE|2016-12-01 00:00:00| 72.0|   61507.0|   71066.0|  0.0|  0.0|  0.0|  1.0|  0.0|  0.0|  0.0|
|       BE|2016-12-01 01:00:00| 65.8|   59528.0|   67311.0|  0.0|  0.0|  0.0|  1.0|  0.0|  0.0|  0.0|
|       BE|2016-12-01 02:00:00|59.99|   58812.0|   67470.0|  0.0|  0.0|  0.0|  1.0|  0.0|  0.0|  0.0|
|       BE|2016-12-01 03:00:00|50.69|   57676.0|   64529.0|  0.0|  0.0|  0.0|  1.0|  0.0|  0.0|  0.0|
|       BE|2016-12-01 04:00:00|52.58|   56804.0|   62773.0|  0.0|  0.0|  0.0|  1.0|  0.0|  0.0|  0.0|
+---------+-------------------+-----+----------+----------+-----+-----+-----+-----+-----+-----+-----+
only showing top 5 rows



To produce forecasts, we also have to provide the future values of the exogenous variables. Let's read that dataset. Because we want to predict 24 steps ahead, each `unique_id` must have 24 observations.

In [None]:
future_ex_vars_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-future-ex-vars.csv')
spark_future_ex_vars_df = spark.createDataFrame(future_ex_vars_df)
spark_future_ex_vars_df.show(5)

+---------+-------------------+----------+----------+-----+-----+-----+-----+-----+-----+-----+
|unique_id|                 ds|Exogenous1|Exogenous2|day_0|day_1|day_2|day_3|day_4|day_5|day_6|
+---------+-------------------+----------+----------+-----+-----+-----+-----+-----+-----+-----+
|       BE|2016-12-31 00:00:00|   64108.0|   70318.0|  0.0|  0.0|  0.0|  0.0|  0.0|  1.0|  0.0|
|       BE|2016-12-31 01:00:00|   62492.0|   67898.0|  0.0|  0.0|  0.0|  0.0|  0.0|  1.0|  0.0|
|       BE|2016-12-31 02:00:00|   61571.0|   68379.0|  0.0|  0.0|  0.0|  0.0|  0.0|  1.0|  0.0|
|       BE|2016-12-31 03:00:00|   60381.0|   64972.0|  0.0|  0.0|  0.0|  0.0|  0.0|  1.0|  0.0|
|       BE|2016-12-31 04:00:00|   60298.0|   62900.0|  0.0|  0.0|  0.0|  0.0|  0.0|  1.0|  0.0|
+---------+-------------------+----------+----------+-----+-----+-----+-----+-----+-----+-----+
only showing top 5 rows
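Since each series must have exactly `h` future rows, it can be worth checking the counts before calling the API. The sketch below does this on plain `(unique_id, ds)` tuples; `check_future_rows` is a hypothetical helper, and the demo rows are synthetic placeholders rather than the dataset above.

```python
from collections import Counter


def check_future_rows(rows, h):
    """Return {unique_id: count} for every series whose row count differs from h."""
    counts = Counter(unique_id for unique_id, _ in rows)
    return {uid: n for uid, n in counts.items() if n != h}


# Synthetic demo: "BE" has 24 future rows, "FR" is one short.
demo = [("BE", i) for i in range(24)] + [("FR", i) for i in range(23)]
print(check_future_rows(demo, h=24))
```

On a Spark DataFrame, the per-series counts could be obtained with `groupBy("unique_id").count()` and compared against `h` in the same way.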



Let's call the `forecast` method, adding this information:

In [None]:
timegpt_fcst_ex_vars_df = nixtla_client.forecast(df=spark_df, X_df=spark_future_ex_vars_df, h=24, level=[80, 90])
timegpt_fcst_ex_vars_df.show(5)

INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
INFO:nixtlats.timegpt:Preprocessing dataframes...
INFO:nixtlats.timegpt:Inferred freq: H
INFO:nixtlats.timegpt:Inferred freq: H
INFO:nixtlats.timegpt:Using the following exogenous variables: Exogenous1, Exogenous2, day_0, day_1, day_2, day_3, day_4, day_5, day_6
INFO:nixtlats.timegpt:Calling Forecast Endpoint...
INFO:nixtlats.timegpt:Using the following exogenous variables: Exogenous1, Exogenous2, day_0, day_1, day_2, day_3, day_4, day_5, day_6
INFO:nixtlats.timegpt:Calling Forecast Endpoint...

+---------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|unique_id|                 ds|           TimeGPT|     TimeGPT-lo-90|     TimeGPT-lo-80|    TimeGPT-hi-80|    TimeGPT-hi-90|
+---------+-------------------+------------------+------------------+------------------+-----------------+-----------------+
|       FR|2016-12-31 00:00:00| 59.39155162090687| 54.47111514324573| 56.13039408916859|62.65270915264515|  64.311988098568|
|       FR|2016-12-31 01:00:00|  60.1843929541434|56.167005220683926|56.778585672649264|63.59020023563754|64.20178068760288|
|       FR|2016-12-31 02:00:00| 58.12912691907976| 53.55469655256365| 55.23512607984636|61.02312775831316|62.70355728559587|
|       FR|2016-12-31 03:00:00|53.825965179940155| 46.31002742817014| 50.66449432422726|56.98743603565305|61.34190293171017|
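|       FR|2016-12-31 04:00:00|  47.6941769331486| 38.21902702317546| 42.94538668046305|52.44296718583414|57.16932684312174|
+---------+-------------------+------------------+------------------+------------------+-----------------+-----------------+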
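Passing `level=[80, 90]` adds a lower and an upper bound column per level, as seen in the output above. The sketch below rebuilds those column names and checks, on the first row of that output, that the intervals are nested around the point forecast. `interval_columns` and `is_nested` are hypothetical helpers, and the naming pattern is taken from the table above rather than from the library's source.

```python
def interval_columns(levels, model="TimeGPT"):
    """Build bound column names ordered from widest lower bound to widest upper bound."""
    ordered = sorted(levels)
    lows = [f"{model}-lo-{level}" for level in reversed(ordered)]
    highs = [f"{model}-hi-{level}" for level in ordered]
    return lows + highs


def is_nested(row, levels, model="TimeGPT"):
    """Check lo-90 <= lo-80 <= point <= hi-80 <= hi-90 (for the given levels)."""
    cols = interval_columns(levels, model)
    half = len(cols) // 2
    values = [row[c] for c in cols[:half]] + [row[model]] + [row[c] for c in cols[half:]]
    return all(a <= b for a, b in zip(values, values[1:]))


# First forecast row from the output above.
first_row = {
    "TimeGPT": 59.39155162090687,
    "TimeGPT-lo-90": 54.47111514324573,
    "TimeGPT-lo-80": 56.13039408916859,
    "TimeGPT-hi-80": 62.65270915264515,
    "TimeGPT-hi-90": 64.311988098568,
}
print(interval_columns([80, 90]))
print(is_nested(first_row, [80, 90]))
```

A higher level gives a wider interval, so the 90% bounds should always enclose the 80% bounds.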


                                                                                

In [None]:
spark.stop()