# How to run cross validation on Spark
> Run TimeGPT in a distributed fashion on top of Spark.

`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. For example, if the input is a Spark DataFrame, `TimeGPT` will use the existing Spark session to run the forecast.


In [None]:
#| hide
from nixtlats.utils import colab_badge

In [None]:
#| echo: false
colab_badge('docs/how-to-guides/1_distributed_cv_spark')

## Installation

As long as Spark is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Spark cluster, make sure the `nixtlats` library is installed across all the workers.
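
As a minimal sketch of such a check, the snippet below tests whether a package can be imported in the current Python environment. `is_installed` is a hypothetical helper, not part of `nixtlats`, and it only inspects the interpreter it runs in. On a cluster, the same function would need to be executed on each worker (for example inside a Spark task), which is not shown here.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be imported in this environment.

    Hypothetical helper for illustration; it checks only the local interpreter.
    """
    return importlib.util.find_spec(package_name) is not None


# Report whether nixtlats is importable where this code runs.
status = "available" if is_installed("nixtlats") else "missing"
print(f"nixtlats is {status} in this environment")
```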

## Executing on Spark

To run cross validation distributed on Spark, pass in a Spark DataFrame instead of a pandas DataFrame.

In [None]:
#| hide
import os

import pandas as pd
from dotenv import load_dotenv

load_dotenv()

Instantiate the `NixtlaClient` class.

In [None]:
from nixtlats import NixtlaClient

In [None]:
nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key = 'my_api_key_provided_by_nixtla'
)

In [None]:
#| hide
nixtla_client = NixtlaClient()

Create a Spark session, or reuse the existing one, to use Spark as the engine.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

### Cross validation

In [None]:
url_df = 'https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv'
spark_df = spark.createDataFrame(pd.read_csv(url_df))
spark_df.show(5)

In [None]:
fcst_df = nixtla_client.cross_validation(spark_df, h=12, n_windows=5, step_size=2)
fcst_df.show(5)
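
To make the roles of `h`, `n_windows`, and `step_size` concrete, here is a small pure-Python sketch of how rolling evaluation windows can be laid out over a series. It assumes the last window ends at the final observation and each earlier window is shifted back by `step_size` points. Whether TimeGPT places its windows exactly this way is an assumption, and `cv_windows` is a hypothetical helper, not part of `nixtlats`.

```python
def cv_windows(n_obs: int, h: int, n_windows: int, step_size: int):
    """Return (test_start, test_end) index pairs for rolling windows.

    Assumed layout: the last window ends at n_obs, and each earlier
    window ends step_size points before the next one. Everything before
    test_start is treated as the training history for that window.
    """
    windows = []
    for i in range(n_windows):
        offset = (n_windows - 1 - i) * step_size
        test_end = n_obs - offset
        test_start = test_end - h
        if test_start <= 0:
            raise ValueError("series too short for the requested windows")
        windows.append((test_start, test_end))
    return windows


# Same settings as the call above, on a toy series of 100 observations.
windows = cv_windows(n_obs=100, h=12, n_windows=5, step_size=2)
for test_start, test_end in windows:
    print(f"train on [0, {test_start}), evaluate on [{test_start}, {test_end})")
```

With a `step_size` smaller than `h`, consecutive evaluation windows overlap, so the same timestamp can be forecast from several cutoffs.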

### Cross validation with exogenous variables

Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.

For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.

To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.

Let's see an example.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')
spark_df = spark.createDataFrame(df)
spark_df.show(5)
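
If the external data lives in a separate table, the pairing described above can be sketched with a pandas merge on the timestamp column. The data, the `temperature` column, and the `store_1` identifier below are made up for illustration; only the `unique_id`, `ds`, `y` layout follows the datasets used in this guide.

```python
import pandas as pd

timestamps = pd.to_datetime(
    ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]
)

# Toy target series (hypothetical values).
sales = pd.DataFrame(
    {"unique_id": ["store_1"] * 3, "ds": timestamps, "y": [10.0, 12.0, 9.0]}
)

# Toy external data observed at the same timestamps (hypothetical values).
weather = pd.DataFrame({"ds": timestamps, "temperature": [21.5, 22.0, 20.8]})

# Attach one exogenous value to every point of the series.
# validate="many_to_one" fails loudly if a timestamp appears twice in weather.
paired = sales.merge(weather, on="ds", how="left", validate="many_to_one")

# Rows with no matching external data would show up as NaN.
missing = int(paired["temperature"].isna().sum())
print(paired)
print(f"rows without exogenous data: {missing}")
```

The resulting pandas DataFrame could then be passed to `spark.createDataFrame`, as done above for the prepared dataset.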

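Now call the `cross_validation` method on this DataFrame, whose extra columns carry the exogenous variables. The `level` argument requests 80% and 90% prediction intervals: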

In [None]:
timegpt_cv_ex_vars_df = nixtla_client.cross_validation(
    df=spark_df,
    h=48, 
    level=[80, 90],
    n_windows=5,
)
timegpt_cv_ex_vars_df.show(5)
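
One way to summarize such results is to compute the error and the interval coverage per window. The sketch below uses a small hand-made pandas DataFrame rather than real output. The column names (`cutoff`, `y`, `TimeGPT`, `TimeGPT-lo-80`, `TimeGPT-hi-80`) are assumptions about the output layout and should be checked against the actual result, which is a Spark DataFrame and would first need to be converted, for example with `toPandas()`.

```python
import pandas as pd

# Hand-made stand-in for cross-validation output (assumed column names).
cv = pd.DataFrame(
    {
        "cutoff": ["w1", "w1", "w2", "w2"],
        "y": [10.0, 12.0, 9.0, 15.0],
        "TimeGPT": [10.5, 11.0, 9.5, 12.0],
        "TimeGPT-lo-80": [9.0, 10.0, 8.0, 10.0],
        "TimeGPT-hi-80": [12.0, 11.5, 11.0, 14.0],
    }
)

# Mean absolute error of the point forecast, per window.
mae = (cv["y"] - cv["TimeGPT"]).abs().groupby(cv["cutoff"]).mean()

# Share of actual values that fall inside the 80% interval, per window.
inside = cv["y"].between(cv["TimeGPT-lo-80"], cv["TimeGPT-hi-80"])
coverage = inside.groupby(cv["cutoff"]).mean()

print(pd.DataFrame({"mae": mae, "coverage_80": coverage}))
```

A coverage far below 0.8 on real data would suggest the 80% intervals are too narrow for that series.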

In [None]:
spark.stop()