# How-to on Ray: Cross Validation
> Run TimeGPT in a distributed fashion on top of Ray.

`TimeGPT` works on top of Spark, Dask, and Ray through Fugue. `TimeGPT` will read the input DataFrame and use the corresponding engine. For example, if the input is a Ray DataFrame, `TimeGPT` will use the existing Ray session to run the forecast.


[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nixtla/nixtla/blob/main/nbs/docs/how-to-guides/5_distributed_cv_ray.ipynb)

## Installation

[Ray](https://www.ray.io/) is an open source unified compute framework to scale Python workloads. As long as Ray is installed and configured, `TimeGPT` will be able to use it. If executing on a distributed Ray cluster, make sure the `nixtlats` library is installed across all the workers.

In addition to Ray, you'll also need to have [Fugue](https://fugue-tutorials.readthedocs.io/) installed. Fugue provides an easy-to-use interface for distributed computing that lets users execute Python code on top of Spark, Dask and Ray. You can install Fugue for Ray using pip. 

In [None]:
%%capture
pip install "fugue[ray]"
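Before running anything on a cluster, it can help to confirm that the required packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the check only covers the process it runs in, so on a distributed cluster it would have to be repeated on each worker.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names of the three libraries this guide relies on.
required = ["ray", "fugue", "nixtlats"]
missing = missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are installed.")
```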

## Executing on Ray

First, instantiate the `NixtlaClient` class. To do this, you will need an API key provided by Nixtla. If you don't have one already, you can request one [here](https://docs.nixtla.io/).

There are different ways to set your API key. Here, we will set it up as an environment variable. Please refer to this [tutorial](https://docs.nixtla.io/docs/setting_up_your_authentication_api_key) to learn more.

In [None]:
from nixtlats import NixtlaClient

nixtla_client = NixtlaClient() # defaults to os.environ.get("NIXTLA_API_KEY")
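Since the client reads the key from the `NIXTLA_API_KEY` environment variable, a quick check that the variable is set can surface a missing key earlier. This is a small sketch, not part of the `nixtlats` API; the helper name `api_key_is_set` is hypothetical.

```python
import os


def api_key_is_set(env=os.environ, name="NIXTLA_API_KEY"):
    """Return True if the variable exists and is not empty or whitespace."""
    return bool(env.get(name, "").strip())


if not api_key_is_set():
    print("NIXTLA_API_KEY is not set; export it before creating the client.")
```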

Next, start a Ray cluster. The example below creates a local cluster whose head node has 2 CPUs and connects to it.

In [None]:
import ray
from ray.cluster_utils import Cluster

ray_cluster = Cluster(
    initialize_head=True,
    head_node_args={"num_cpus": 2}
)
ray.init(address=ray_cluster.address, ignore_reinit_error=True)

2024-04-23 17:13:47,386	INFO utils.py:108 -- Overwriting previous Ray address (127.0.0.1:58344). Running ray.init() on this node will now connect to the new instance at 127.0.0.1:65331. To override this behavior, pass address=127.0.0.1:58344 to ray.init().
2024-04-23 17:13:47,388	INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 127.0.0.1:65331...
2024-04-23 17:13:47,392	INFO worker.py:1621 -- Connected to Ray cluster.


Python version: 3.10.13
Ray version: 2.6.2


### Cross validation

Time series cross validation is a method to check how well a model would have performed in the past. It uses a moving window over historical data to make predictions for the next period. After each prediction, the window moves ahead and the process continues until the requested number of windows has been evaluated. `TimeGPT` allows you to perform cross validation on top of Ray.

After starting Ray, load a pandas DataFrame and then convert it to a Ray dataset. 

In [None]:
import pandas as pd 

df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short.csv')
df.head()

Unnamed: 0,unique_id,ds,y
0,BE,2016-12-01 00:00:00,72.0
1,BE,2016-12-01 01:00:00,65.8
2,BE,2016-12-01 02:00:00,59.99
3,BE,2016-12-01 03:00:00,50.69
4,BE,2016-12-01 04:00:00,52.58


In [None]:
ray_df = ray.data.from_pandas(df)

Now call the `cross_validation` method from the `NixtlaClient` class with the Ray dataset.

In [None]:
fcst_df = nixtla_client.cross_validation(ray_df, h=12, freq='H', n_windows=5, step_size=2)

2024-04-23 17:15:11,091	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Repartition]
2024-04-23 17:15:11,092	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-04-23 17:15:11,093	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


2024-04-23 17:15:11,608	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(add_coarse_key)] -> LimitOperator[limit=1]
2024-04-23 17:15:11,609	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-04-23 17:15:11,609	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


2024-04-23 17:15:11,655	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(add_coarse_key)] -> AllToAllOperator[Sort] -> TaskPoolMapOperator[MapBatches(group_fn)]
2024-04-23 17:15:11,655	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-04-23 17:15:11,656	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


[2m[36m(MapBatches(add_coarse_key) pid=53095)[0m   return transform_pyarrow.concat(tables)
[2m[36m(reduce pid=53095)[0m   ret = concat(blocks)
[2m[36m(MapBatches(group_fn) pid=53095)[0m INFO:nixtlats.nixtla_client:Validating inputs...
[2m[36m(MapBatches(group_fn) pid=53095)[0m INFO:nixtlats.nixtla_client:Validating inputs...
[2m[36m(MapBatches(group_fn) pid=53095)[0m INFO:nixtlats.nixtla_client:Preprocessing dataframes...
[2m[36m(MapBatches(group_fn) pid=53094)[0m INFO:nixtlats.nixtla_client:Calling Forecast Endpoint...
[2m[36m(MapBatches(add_coarse_key) pid=53094)[0m   return transform_pyarrow.concat(tables)[32m [repeated 3x across cluster][0m
[2m[36m(reduce pid=53094)[0m   ret = concat(blocks)
[2m[36m(MapBatches(group_fn) pid=53094)[0m INFO:nixtlats.nixtla_client:Validating inputs...[32m [repeated 4x across cluster][0m
[2m[36m(MapBatches(group_fn) pid=53094)[0m INFO:nixtlats.nixtla_client:Preprocessing dataframes...[32m [repeated 2x across cluster]

In [None]:
fcst_df.to_pandas().head()

Unnamed: 0,unique_id,ds,cutoff,TimeGPT
0,DE,2017-12-30 04:00:00,2017-12-30 03:00:00,12.175045
1,DE,2017-12-30 05:00:00,2017-12-30 03:00:00,13.225025
2,DE,2017-12-30 06:00:00,2017-12-30 03:00:00,14.233379
3,DE,2017-12-30 07:00:00,2017-12-30 03:00:00,18.126492
4,DE,2017-12-30 08:00:00,2017-12-30 03:00:00,19.505131
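The `cutoff` column marks the last timestamp each window was allowed to see. To make the roles of `h`, `n_windows` and `step_size` concrete, here is a rough sketch of how the cutoffs of a rolling evaluation could be derived. It rests on two assumptions that the output above does not state: the series ends at 2017-12-30 23:00:00, and the last window forecasts exactly the final `h` points. The helper `cv_cutoffs` is hypothetical and is not part of `nixtlats`.

```python
from datetime import datetime, timedelta


def cv_cutoffs(last_ds, h, n_windows, step_size=None, freq=timedelta(hours=1)):
    """Return the cutoff timestamp of each window, oldest first."""
    # Assumption: when no step size is given, consecutive windows are h steps apart.
    step = h if step_size is None else step_size
    # Assumption: the most recent window forecasts the final h observations.
    last_cutoff = last_ds - h * freq
    return [last_cutoff - (n_windows - 1 - i) * step * freq for i in range(n_windows)]


# Same settings as the call above; the series end date is an assumption.
cutoffs = cv_cutoffs(datetime(2017, 12, 30, 23), h=12, n_windows=5, step_size=2)
for cutoff in cutoffs:
    print(cutoff)
```

Under these assumptions the earliest cutoff is 2017-12-30 03:00:00, which agrees with the first rows of the output above, and each later cutoff is 2 hours further along.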


### Cross validation with exogenous variables

Exogenous variables or external factors are crucial in time series forecasting as they provide additional information that might influence the prediction. These variables could include holiday markers, marketing spending, weather data, or any other external data that correlate with the time series data you are forecasting.

For example, if you're forecasting ice cream sales, temperature data could serve as a useful exogenous variable. On hotter days, ice cream sales may increase.

To incorporate exogenous variables in TimeGPT, you'll need to pair each point in your time series data with the corresponding external data.

Let's see an example. Note that the data is first loaded with `pandas` and then converted to a Ray dataset.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/electricity-short-with-ex-vars.csv')
df.head()

Unnamed: 0,unique_id,ds,y,Exogenous1,Exogenous2,day_0,day_1,day_2,day_3,day_4,day_5,day_6
0,BE,2016-12-01 00:00:00,72.0,61507.0,71066.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,BE,2016-12-01 01:00:00,65.8,59528.0,67311.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,BE,2016-12-01 02:00:00,59.99,58812.0,67470.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,BE,2016-12-01 03:00:00,50.69,57676.0,64529.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,BE,2016-12-01 04:00:00,52.58,56804.0,62773.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
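The dataset above already has its exogenous columns aligned with each timestamp. If the external data lived in a separate table instead, the pairing could be sketched with a `pandas` merge on `unique_id` and `ds`. The `temperature` column and its values below are hypothetical and serve only to illustrate the join.

```python
import pandas as pd

timestamps = pd.to_datetime(
    ["2016-12-01 00:00:00", "2016-12-01 01:00:00", "2016-12-01 02:00:00"]
)

# Target series: the first three BE observations shown above.
target = pd.DataFrame(
    {"unique_id": ["BE"] * 3, "ds": timestamps, "y": [72.0, 65.8, 59.99]}
)

# Hypothetical exogenous table keyed by the same series id and timestamp.
exog = pd.DataFrame(
    {"unique_id": ["BE"] * 3, "ds": timestamps, "temperature": [4.1, 3.8, 3.5]}
)

# validate="one_to_one" raises if either table has duplicate (unique_id, ds) keys.
paired = target.merge(exog, on=["unique_id", "ds"], how="left", validate="one_to_one")

# Every target row should have found its exogenous value.
assert paired["temperature"].notna().all()
print(paired)
```

A left join keeps every row of the target series, so any timestamp without matching external data shows up as a missing value rather than silently disappearing.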


In [None]:
ray_df = ray.data.from_pandas(df)

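Now call the `cross_validation` method with this dataset. The `level` argument requests 80% and 90% prediction intervals, which appear as the `TimeGPT-lo-*` and `TimeGPT-hi-*` columns in the output: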

In [None]:
cv_ex_vars_df = nixtla_client.cross_validation(
    df=ray_df,
    h=48, 
    freq='H',
    level=[80, 90],
    n_windows=5,
)

2024-04-23 17:16:47,672	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Repartition]
2024-04-23 17:16:47,673	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-04-23 17:16:47,674	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


2024-04-23 17:16:47,715	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(add_coarse_key)] -> LimitOperator[limit=1]
2024-04-23 17:16:47,716	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-04-23 17:16:47,716	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


2024-04-23 17:16:47,800	INFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(add_coarse_key)] -> AllToAllOperator[Sort] -> TaskPoolMapOperator[MapBatches(group_fn)]
2024-04-23 17:16:47,801	INFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=True, actor_locality_enabled=True, verbose_progress=False)
2024-04-23 17:16:47,801	INFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`


In [None]:
cv_ex_vars_df.to_pandas().head()

Unnamed: 0,unique_id,ds,cutoff,TimeGPT,TimeGPT-lo-90,TimeGPT-lo-80,TimeGPT-hi-80,TimeGPT-hi-90
0,DE,2017-12-21 00:00:00,2017-12-20 23:00:00,36.616544,32.499393,32.857875,40.375214,40.733695
1,DE,2017-12-21 01:00:00,2017-12-20 23:00:00,33.457679,28.431474,29.033569,37.881789,38.483884
2,DE,2017-12-21 02:00:00,2017-12-20 23:00:00,33.057284,26.491331,27.538233,38.576335,39.623238
3,DE,2017-12-21 03:00:00,2017-12-20 23:00:00,32.649935,23.811479,26.895411,38.404459,41.488391
4,DE,2017-12-21 04:00:00,2017-12-20 23:00:00,34.146899,23.650726,27.622537,40.671262,44.643072
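One quick sanity check on interval output like this is that the bounds are nested: the 90% interval should contain the 80% interval, which in turn should contain the point forecast. The sketch below runs that check on the five rows shown above, copied by hand into plain Python; the helper `is_nested` is hypothetical.

```python
# (TimeGPT, lo-90, lo-80, hi-80, hi-90) for the five rows shown above.
rows = [
    (36.616544, 32.499393, 32.857875, 40.375214, 40.733695),
    (33.457679, 28.431474, 29.033569, 37.881789, 38.483884),
    (33.057284, 26.491331, 27.538233, 38.576335, 39.623238),
    (32.649935, 23.811479, 26.895411, 38.404459, 41.488391),
    (34.146899, 23.650726, 27.622537, 40.671262, 44.643072),
]


def is_nested(point, lo90, lo80, hi80, hi90):
    """True when the 90% band contains the 80% band, which contains the forecast."""
    return lo90 <= lo80 <= point <= hi80 <= hi90


nested = [is_nested(*row) for row in rows]
widths_80 = [hi80 - lo80 for _, _, lo80, hi80, _ in rows]
widths_90 = [hi90 - lo90 for _, lo90, _, _, hi90 in rows]

print("All intervals nested:", all(nested))
```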


Don't forget to stop Ray once you're done. 

In [None]:
ray.shutdown()