# How to Run Cross Validation on Dask
> Run TimeGPT in a distributed fashion on top of Dask.

`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. It inspects the type of the input DataFrame and uses the corresponding engine. For example, if the input is a Dask DataFrame, `TimeGPT` uses the existing Dask session to run the forecast.


In [None]:
#| hide
from nixtlats.utils import colab_badge

In [None]:
#| echo: false
colab_badge('docs/how-to-guides/3_distributed_cv_dask')

[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/how-to-guides/3_distributed_cv_dask.ipynb)

## Installation

[Dask](https://www.dask.org/get-started) is an open source parallel computing library for Python. As long as Dask is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Dask cluster, make sure the `nixtlats` library is installed across all the workers.

In addition to Dask, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Dask using pip. 

In [None]:
%%capture 
pip install "fugue[dask]"

## Executing on Dask

First, instantiate the `NixtlaClient` class. To do this, you need an API key provided by Nixtla. If you don't have one yet, you can request one in the [Nixtla documentation](https://docs.nixtla.io/).

There are different ways to set your API key. Here, we will set it up as an environment variable. Please refer to this [tutorial](https://docs.nixtla.io/docs/setting_up_your_authentication_api_key) to learn more.
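As a minimal sketch of the environment-variable approach, the snippet below checks that a key is present before any client is created, so a missing key fails early with a clear message. The helper name `require_api_key` is hypothetical and not part of `nixtlats`; `NIXTLA_API_KEY` is the variable name the client falls back to by default.

```python
import os


def require_api_key(env=None, name="NIXTLA_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    Hypothetical helper for illustration only; not part of nixtlats.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before creating the client."
        )
    return key


# Demonstrated with a plain dict so the example does not depend on your shell.
demo_key = require_api_key({"NIXTLA_API_KEY": "my-secret-key"})
print(demo_key)
```

In a real session you would call `require_api_key()` with no arguments so that it reads from `os.environ`.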

In [None]:
#| hide
from dotenv import load_dotenv

load_dotenv()

True

In [None]:
from nixtlats import NixtlaClient

nixtla_client = NixtlaClient() # defaults to os.environ.get("NIXTLA_API_KEY")

### Cross validation

Time series cross validation is a method to check how well a model would have performed in the past. It uses a moving window over historical data to make predictions for the next period. After each prediction, the window moves ahead and the process keeps going until it covers all the data. The `NixtlaClient` class allows you to perform cross validation on top of Dask.
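To make the moving window concrete, here is a small sketch of how cutoff timestamps can be laid out for a rolling-origin scheme. It assumes the last window ends at the final timestamp of the series and that earlier cutoffs step back by `step_size` periods. The function `rolling_cutoffs` and the synthetic hourly index are illustrative assumptions, not the library's internal implementation.

```python
import pandas as pd

HOUR = pd.Timedelta(hours=1)


def rolling_cutoffs(last_ds, h, n_windows, step_size, step=HOUR):
    """Return the cutoffs of `n_windows` rolling windows, oldest first.

    Hypothetical helper: the last window forecasts the final `h` periods,
    and each earlier cutoff lies `step_size` periods before the next one.
    """
    last_cutoff = last_ds - h * step
    return [last_cutoff - i * step_size * step for i in reversed(range(n_windows))]


# Synthetic hourly index (an assumption for illustration): 30 days of data.
ds = pd.date_range("2016-12-01", periods=720, freq=HOUR)

cutoffs = rolling_cutoffs(ds[-1], h=12, n_windows=5, step_size=2)
for cutoff in cutoffs:
    # Each window predicts the `h` timestamps that follow its cutoff.
    horizon = ds[(ds > cutoff) & (ds <= cutoff + 12 * HOUR)]
    print(cutoff, "->", horizon[0], "to", horizon[-1])
```

With `step_size` smaller than `h`, consecutive windows overlap, so the same timestamp can be forecast from several cutoffs.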

Start by loading the data using `pandas` and then convert it to a Dask DataFrame. 

In [None]:
import pandas as pd 

df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')
df.head()

Unnamed: 0,unique_id,ds,y
0,BE,2016-12-01 00:00:00,72.0
1,BE,2016-12-01 01:00:00,65.8
2,BE,2016-12-01 02:00:00,59.99
3,BE,2016-12-01 03:00:00,50.69
4,BE,2016-12-01 04:00:00,52.58


In [None]:
import dask.dataframe as dd

dask_df = dd.from_pandas(df, npartitions=2)

Now call the cross-validation method from the `NixtlaClient` class with the Dask DataFrame. 

In [None]:
fcst_df = nixtla_client.cross_validation(dask_df, h=12, freq="H", n_windows=5, step_size=2)
fcst_df.head()

INFO:distributed.http.proxy:To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
Perhaps you already have a cluster running?
Hosting the HTTP server on port 55623 instead
INFO:distributed.scheduler:State start
INFO:distributed.scheduler:  Scheduler at:     tcp://127.0.0.1:55624
INFO:distributed.scheduler:  dashboard at:  http://127.0.0.1:55623/status
INFO:distributed.scheduler:Registering Worker plugin shuffle
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:55627'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:55628'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:55629'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:55630'
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:55637', name: 0, status: init, memory: 0, processing: 0>
INFO:distributed.scheduler:Starting worker compute stream, tcp://127.0.0.1:55637
INFO:distributed.core

Unnamed: 0,unique_id,ds,cutoff,TimeGPT
0,FR,2016-12-30 04:00:00,2016-12-30 03:00:00,44.893745
1,FR,2016-12-30 05:00:00,2016-12-30 03:00:00,46.05793
2,FR,2016-12-30 06:00:00,2016-12-30 03:00:00,48.790077
3,FR,2016-12-30 07:00:00,2016-12-30 03:00:00,54.397026
4,FR,2016-12-30 08:00:00,2016-12-30 03:00:00,57.593002
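One common next step is to score each window against the observed values. The sketch below joins forecasts to actuals on `unique_id` and `ds` and computes the mean absolute error per cutoff. All numbers are synthetic and chosen for easy arithmetic. If your result is a Dask DataFrame, you would typically call `.compute()` first to obtain a pandas DataFrame.

```python
import pandas as pd

# Synthetic cross-validation output with the same columns as shown above.
cv = pd.DataFrame({
    "unique_id": ["FR"] * 4,
    "ds": pd.to_datetime([
        "2016-12-30 04:00", "2016-12-30 05:00",
        "2016-12-30 06:00", "2016-12-30 07:00",
    ]),
    "cutoff": pd.to_datetime(
        ["2016-12-30 03:00"] * 2 + ["2016-12-30 05:00"] * 2
    ),
    "TimeGPT": [45.0, 46.0, 49.0, 54.0],
})

# Synthetic observed values for the same timestamps.
actuals = pd.DataFrame({
    "unique_id": ["FR"] * 4,
    "ds": cv["ds"],
    "y": [44.0, 48.0, 50.0, 51.0],
})

# Attach the observed value to every forecast row, then average per window.
scored = cv.merge(actuals, on=["unique_id", "ds"], how="left")
scored["abs_error"] = (scored["y"] - scored["TimeGPT"]).abs()
mae_per_window = scored.groupby(["unique_id", "cutoff"])["abs_error"].mean()
print(mae_per_window)
```

Grouping by `cutoff` keeps the windows separate, which helps reveal whether accuracy changes over time.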


### Cross validation with exogenous variables

Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.

For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.

To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.
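As a rough sketch of that pairing step, the snippet below joins a hypothetical `temperature` table to a target series on `unique_id` and `ds`, then adds day-of-week indicator columns named `day_0` through `day_6`. The temperature values are made up for illustration, and this is only one possible way to build such features.

```python
import pandas as pd

HOUR = pd.Timedelta(hours=1)
timestamps = pd.date_range("2016-12-01", periods=4, freq=HOUR)

# Target series in the long format used throughout this guide.
series = pd.DataFrame({
    "unique_id": ["BE"] * 4,
    "ds": timestamps,
    "y": [72.0, 65.8, 59.99, 50.69],
})

# Hypothetical external data: one value per series and timestamp.
temperature = pd.DataFrame({
    "unique_id": ["BE"] * 4,
    "ds": timestamps,
    "temperature": [3.1, 2.8, 2.5, 2.4],
})

# Pair every observation with its external value.
paired = series.merge(temperature, on=["unique_id", "ds"], how="left")

# One-hot encode the day of the week (Monday=0 ... Sunday=6).
dow = pd.get_dummies(paired["ds"].dt.dayofweek, prefix="day").astype(float)
dow = dow.reindex(columns=[f"day_{i}" for i in range(7)], fill_value=0.0)
paired = pd.concat([paired, dow], axis=1)
print(paired.head())
```

A left join keeps every row of the target series, so any timestamp without external data shows up as a missing value that you can detect before forecasting.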

Let's see an example. First we will load the data using `pandas` and convert it to a Dask DataFrame.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')
df.head()

Unnamed: 0,unique_id,ds,y,Exogenous1,Exogenous2,day_0,day_1,day_2,day_3,day_4,day_5,day_6
0,BE,2016-12-01 00:00:00,72.0,61507.0,71066.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,BE,2016-12-01 01:00:00,65.8,59528.0,67311.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,BE,2016-12-01 02:00:00,59.99,58812.0,67470.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,BE,2016-12-01 03:00:00,50.69,57676.0,64529.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,BE,2016-12-01 04:00:00,52.58,56804.0,62773.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [None]:
dask_df = dd.from_pandas(df, npartitions=2)

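Now call the `cross_validation` method. The extra columns in the DataFrame (`Exogenous1`, `Exogenous2`, and `day_0` through `day_6`) are picked up as exogenous variables, as the log output confirms. Here we also request 80% and 90% prediction intervals through `level`: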

In [None]:
cv_ex_vars_df = nixtla_client.cross_validation(
    df=dask_df,
    h=48, 
    freq='H',
    level=[80, 90],
    n_windows=5,
)
cv_ex_vars_df.head()



INFO:nixtlats.nixtla_client:Validating inputs...
INFO:nixtlats.nixtla_client:Validating inputs...
INFO:nixtlats.nixtla_client:Preprocessing dataframes...
INFO:nixtlats.nixtla_client:Using the following exogenous variables: Exogenous1, Exogenous2, day_0, day_1, day_2, day_3, day_4, day_5, day_6
INFO:nixtlats.nixtla_client:Calling Forecast Endpoint...
INFO:nixtlats.nixtla_client:Validating inputs...
INFO:nixtlats.nixtla_client:Validating inputs...
INFO:nixtlats.nixtla_client:Preprocessing dataframes...
INFO:nixtlats.nixtla_client:Using the following exogenous variables: Exogenous1, Exogenous2, day_0, day_1, day_2, day_3, day_4, day_5, day_6
INFO:nixtlats.nixtla_client:Calling Forecast Endpoint...
INFO:nixtlats.nixtla_client:Validating inputs...
INFO:nixtlats.nixtla_client:Validating inputs...
INFO:nixtlats.nixtla_client:Preprocessing dataframes...
INFO:nixtlats.nixtla_client:Using the following exogenous variables: Exogenous1, Exogenous2, day_0, day_1, day_2, day_3, day_4, day_5, day_6

Unnamed: 0,unique_id,ds,cutoff,TimeGPT,TimeGPT-lo-90,TimeGPT-lo-80,TimeGPT-hi-80,TimeGPT-hi-90
0,FR,2016-12-21 00:00:00,2016-12-20 23:00:00,66.397483,62.037771,63.289465,69.505501,70.757195
1,FR,2016-12-21 01:00:00,2016-12-20 23:00:00,63.718419,59.770956,61.168329,66.268508,67.665882
2,FR,2016-12-21 02:00:00,2016-12-20 23:00:00,61.137844,58.881849,59.515674,62.760015,63.39384
3,FR,2016-12-21 03:00:00,2016-12-20 23:00:00,55.774907,53.04736,53.220712,58.329103,58.502455
4,FR,2016-12-21 04:00:00,2016-12-20 23:00:00,48.803787,44.101765,44.58028,53.027293,53.505808
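The interval columns can be checked in a similar spirit. The sketch below computes the empirical coverage of the 80% interval, that is, the share of observed values that fall between `TimeGPT-lo-80` and `TimeGPT-hi-80`. The interval bounds are copied from the rows above, while the `y` values are hypothetical and only serve to illustrate the calculation.

```python
import pandas as pd

# Interval bounds taken from the output above; `y` is made up for illustration.
intervals = pd.DataFrame({
    "TimeGPT-lo-80": [63.289465, 61.168329, 59.515674, 53.220712, 44.58028],
    "TimeGPT-hi-80": [69.505501, 66.268508, 62.760015, 58.329103, 53.027293],
    "y": [65.0, 67.0, 60.0, 55.0, 54.0],
})

# True where the observed value lies inside the interval (bounds included).
inside = intervals["y"].between(
    intervals["TimeGPT-lo-80"], intervals["TimeGPT-hi-80"]
)
coverage = inside.mean()
print(f"Empirical coverage of the 80% interval: {coverage:.0%}")
```

Over many windows and series, a well-calibrated 80% interval should contain roughly 80% of the observed values; a handful of rows like this is far too small a sample to judge calibration.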
