# "Forecasting with Privacy at Scale (Nixtla & YData)"
> "Make your time series data private and then forecast it with Deep Learning models"

- toc: true
- branch: main
- badges: true
- comments: true
- author: Federico Garza
- categories: [machine learning, forecasting]
- image: images/nixtla_logo.png

## Introduction

In this post, we explain how to use [nixtlats](https://github.com/Nixtla/nixtlats) and [ydata-synthetic](https://github.com/ydataai/ydata-synthetic), two Python libraries, to address data privacy in the context of time series forecasting. We build a deep learning forecasting pipeline without direct access to the original data and show that data anonymization has minimal impact on model performance.

## Motivation

In the last decade, neural network-based forecasting methods have become ubiquitous in large-scale forecasting applications, transcending industry boundaries into academia, as they have redefined the state of the art in many practical tasks such as demand planning, electricity load forecasting, reverse logistics, and weather forecasting, as well as in forecasting competitions like the M4 and M5.

However, data privacy is a recurring problem for those who want to create forecasts. In many applications, the user does not want the model to have access to the actual data, particularly if model training is done in the cloud or otherwise outside one's own infrastructure. This constraint severely limits practice, since it prevents scaling models to large datasets on available cloud platforms.

This post shows how to solve this problem using `nixtlats` and `ydata-synthetic`. First, the user anonymizes the data using `ydata-synthetic`; then state-of-the-art neural forecasting algorithms are trained with `nixtlats` without accessing the original data. Once trained, the model can be sent to the owner of the original data, who performs inference within the security of their own infrastructure.

We then evaluate both models and show that the predictions of the model trained on private data perform about as well as those of the model trained on the original data.

## Libraries

Both `nixtlats` and `ydata-synthetic` are available on [PyPI](https://pypi.org/project/nixtlats/), so you can install them with `pip install nixtlats` and `pip install ydata-synthetic`.

In [1]:
#%gist gistname: libraries-nixtla-ydata.py
import random
from itertools import product

import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch as t
import matplotlib.pyplot as plt
from pytorch_lightning import loggers as pl_loggers
from pytorch_lightning import seed_everything

#Nixtla libraries
from nixtlats.data.datasets.m4 import M4, M4Info, M4Evaluation
from nixtlats.data.tsdataset import TimeSeriesDataset
from nixtlats.data.tsloader import TimeSeriesLoader
from nixtlats.models.esrnn.esrnn import ESRNN

# YData libraries
from ydata_synthetic.synthesizers import ModelParameters
from ydata_synthetic.synthesizers.timeseries import TimeGAN



## Data

We will show how to anonymize data, forecast it using state-of-the-art deep learning models, and generate useful non-anonymized forecasts using data from the M4 competition, one of the best-known time series forecasting competitions. In particular, we will use the Yearly data and show that training models on anonymized data still yields similar performance. The `nixtlats` library provides useful functions to download this data.

In [2]:
#%gist gistname: data-yearly-nixtla-ydata.py
group = M4Info['Yearly']
Y_df, _, S_df = M4.load(directory='data', group=group.name)

In this example, we will use the first 1,000 Yearly time series.

In [3]:
#%gist gistname: data-subset-nixtla-ydata.py
uids = Y_df['unique_id'].unique()[:1_000]
Y_df = Y_df.query('unique_id in @uids')

The `M4.load` method returns the train and test sets together in a single dataframe, so we need to split them. The library also provides a wide variety of other datasets; [see the documentation](https://nixtla.github.io/nixtlats).

In [4]:
#%gist gistname: data-remove-test-nixtla-ydata.py
Y_df_test = Y_df.groupby('unique_id').tail(group.horizon).copy()
Y_df_train = Y_df.drop(Y_df_test.index)

To avoid leakage, we set the test values to zero.

In [5]:
#%gist gistname: data-test-zerp-nixtla-ydata.py
Y_df_test.loc[:, 'y'] = 0

`nixtlats` requires a dummy test set to make forecasts, so we combine the training data with the zero-valued test data.

In [6]:
#%gist gistname: data-full-nixtla-ydata.py upload: both
Y_df_full = pd.concat([Y_df_train, Y_df_test]).sort_values(['unique_id', 'ds'], ignore_index=True)
Y_df_full.head()

| | unique_id | ds | y |
|---|---|---|---|
| 0 | Y1 | 1 | 5172.1 |
| 1 | Y1 | 2 | 5133.5 |
| 2 | Y1 | 3 | 5186.9 |
| 3 | Y1 | 4 | 5084.6 |
| 4 | Y1 | 5 | 5182.0 |
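
To make the split-and-pad steps above concrete, here is a minimal sketch on a toy dataframe. The names `toy`, `toy_train`, `toy_test`, `toy_full` and the value of `horizon` are made up for illustration; only pandas is needed, not `nixtlats`.

```python
import pandas as pd

# Toy long-format frame: two series with five observations each.
toy = pd.DataFrame({
    'unique_id': ['A'] * 5 + ['B'] * 5,
    'ds': list(range(1, 6)) * 2,
    'y': [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
horizon = 2  # stands in for group.horizon

# The last `horizon` rows of every series form the test set.
toy_test = toy.groupby('unique_id').tail(horizon).copy()
toy_train = toy.drop(toy_test.index)

# Hide the true test values, then append the placeholder rows to the training data.
toy_test['y'] = 0.0
toy_full = pd.concat([toy_train, toy_test]).sort_values(['unique_id', 'ds'], ignore_index=True)

print(toy_full)
```

Each series keeps its full length in `toy_full`, but the last `horizon` values are zeros, so no real test observation can reach the model.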


## Pipeline

### Creating private data using ydata-synthetic

In this section, we anonymize the training data `Y_df_train` using the `TimeGAN` model from `ydata-synthetic`. You can learn more about `TimeGAN` in the post [Synthetic Time-Series Data: A GAN approach](https://towardsdatascience.com/synthetic-time-series-data-a-gan-approach-869a984f2239).

In [7]:
#%gist gistname: pipe-dataset-nixtla-ydata.py
train_ts_dataset = TimeSeriesDataset(Y_df=Y_df_train,
                                     input_size=4,
                                     output_size=group.horizon)
y_np = train_ts_dataset.ts_tensor[:,0].cpu().numpy().T

In [8]:
#%gist gistname: pipe-gan-args-nixtla-ydata.py
seq_len, n_seq = y_np.shape
hidden_dim = 24
gamma = 1

noise_dim = 32
dim = 128
batch_size = 1

log_step = 100
learning_rate = 5e-4

gan_args = ModelParameters(batch_size=batch_size,
                           lr=learning_rate,
                           noise_dim=noise_dim,
                           layers_dim=dim)

The following lines train the `TimeGAN` model:

In [9]:
#%gist gistname: train-gan-nixtla-ydata.py
synth = TimeGAN(model_parameters=gan_args, hidden_dim=hidden_dim, seq_len=seq_len, n_seq=n_seq, gamma=gamma)
synth.train([y_np], train_steps=50)


In [10]:
#%gist gistname: get-gan-sample-nixtla-ydata.py
synth_data = synth.sample(1)[0][::-1]



The object `synth_data` now contains an anonymized version of the training data. To use it with `nixtlats`, we need to transform `synth_data` into a pandas dataframe, which can easily be done with the following lines.

In [11]:
#%gist gistname: get-synth-training-nixtla-ydata.py upload: both
Y_df_train_synth = pd.DataFrame(synth_data.T, index=uids).rename_axis('unique_id')
Y_df_train_synth = Y_df_train_synth.stack().rename_axis(['unique_id', 'ds']).rename('y').reset_index()
Y_df_train_synth['ds'] = Y_df_train_synth['ds'].astype('object') 

Y_df_train_synth.head()

| | unique_id | ds | y |
|---|---|---|---|
| 0 | Y1 | 0 | 0.855188 |
| 1 | Y1 | 1 | 0.855188 |
| 2 | Y1 | 2 | 0.855188 |
| 3 | Y1 | 3 | 0.855188 |
| 4 | Y1 | 4 | 0.855188 |
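
The wide-to-long reshaping above can be hard to follow on the full data. The sketch below repeats the same pandas calls on a tiny array. `toy_synth` and `toy_uids` are hypothetical stand-ins for `synth_data` and `uids`, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `synth_data`: 3 time steps (rows) x 2 series (columns).
toy_synth = np.array([[0.1, 0.7],
                      [0.2, 0.8],
                      [0.3, 0.9]])
toy_uids = ['Y1', 'Y2']

# After transposing: one row per series, one column per time step.
wide_df = pd.DataFrame(toy_synth.T, index=toy_uids).rename_axis('unique_id')

# Stack the time-step columns into rows to obtain the long format.
long_df = wide_df.stack().rename_axis(['unique_id', 'ds']).rename('y').reset_index()

print(long_df)
```

The result has one row per (series, time step) pair with the columns `unique_id`, `ds` and `y`, which is the layout used throughout this post.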


### Training Deep Learning model using nixtlats

In this section, we use the anonymized dataset to train the ESRNN model, the winner of the M4 competition. This is a hybrid model: it first fits each time series locally through an Exponential Smoothing model and then trains the levels using a Recurrent Neural Network. You can learn more about this model in the post [Forecasting in Python with the ESRNN model](https://medium.com/analytics-vidhya/forecasting-in-python-with-esrnn-model-75f7fae1d242).
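
To give an intuition for the exponential smoothing part, the following is a simplified, non-seasonal sketch of how a per-series level can be computed and used to rescale a series. It is not the `nixtlats` implementation: the function name `smoothed_levels` and the fixed `alpha` are assumptions made for illustration, whereas ESRNN learns its smoothing parameters during training.

```python
import numpy as np

def smoothed_levels(y, alpha):
    """Simple exponential smoothing: l_t = alpha * y_t + (1 - alpha) * l_{t-1}."""
    levels = np.empty(len(y), dtype=float)
    levels[0] = y[0]  # initialize the level with the first observation
    for i in range(1, len(y)):
        levels[i] = alpha * y[i] + (1 - alpha) * levels[i - 1]
    return levels

y = np.array([100.0, 110.0, 120.0, 130.0])
levels = smoothed_levels(y, alpha=0.5)  # levels: 100, 105, 112.5, 121.25

# Dividing by the level puts series of very different magnitudes on a
# comparable scale before they are passed to a shared network.
normalized = y / levels

print(levels)
print(normalized)
```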

Model training follows standard deep learning practice. First, a `Dataset` must be instantiated. The `TimeSeriesDataset` class returns the complete series in each iteration, which is useful for recurrent models such as ESRNN. To instantiate it, the class receives the target series `Y_df` as a pandas dataframe with columns `unique_id`, `ds`, and `y`. Additionally, temporal exogenous variables `X_df` and static variables `S_df` can be included. In this case, we only use static variables, as in the original model.
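
Since the dataset expects the `unique_id`, `ds` and `y` columns, a quick check before instantiating it can save a confusing error later. The helper `missing_columns` below is hypothetical (it is not part of `nixtlats`) and the toy values are invented.

```python
import pandas as pd

REQUIRED_COLUMNS = ['unique_id', 'ds', 'y']

def missing_columns(df):
    """Return the required long-format columns that `df` lacks (hypothetical helper)."""
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]

# Toy target frame in the expected long format.
toy_Y_df = pd.DataFrame({
    'unique_id': ['Y1', 'Y1', 'Y2', 'Y2'],
    'ds': [1, 2, 1, 2],
    'y': [10.0, 11.0, 20.0, 21.0],
})

print(missing_columns(toy_Y_df))                     # nothing missing
print(missing_columns(toy_Y_df.drop(columns='y')))   # 'y' is reported as missing
```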

In [12]:
#%gist gistname: synth-dataset-nixtla-ydata.py
train_ts_dataset_synth = TimeSeriesDataset(Y_df=Y_df_train_synth, S_df=S_df,
                                           input_size=4,
                                           output_size=group.horizon)

In [13]:
#%gist gistname: synth-loader-nixtla-ydata.py
train_ts_loader_synth = TimeSeriesLoader(dataset=train_ts_dataset_synth,
                                         batch_size=16,
                                         shuffle=False)

Next, we define the ESRNN model included in `nixtlats` as follows:

In [14]:
#%gist gistname: synth-model-nixtla-ydata.py
model_synth = ESRNN(n_series=group.n_ts,
                    n_x=0, n_s=1,
                    sample_freq=1,
                    input_size=4,
                    output_size=group.horizon,
                    learning_rate=0.0025,
                    lr_scheduler_step_size=6,
                    lr_decay=0.08,
                    per_series_lr_multip=0.8,
                    gradient_clipping_threshold=20,
                    rnn_weight_decay=0,
                    level_variability_penalty=50,
                    testing_percentile=50,
                    training_percentile=50,
                    cell_type='GRU',
                    state_hsize=30,
                    dilations=[[1, 2], [2, 6]],
                    add_nl_layer=True,
                    loss='SMYL',
                    val_loss='SMAPE',
                    seasonality=[])

We can then train it as follows:

In [15]:
#%gist gistname: trainer-nixtla-ydata.py
trainer = pl.Trainer(max_epochs=15,
                     progress_bar_refresh_rate=10, 
                     deterministic=True)



In [16]:
#%gist gistname: fit-synth-model-nixtla-ydata.py
trainer.fit(model_synth, train_ts_loader_synth)


  | Name  | Type   | Params
---------------------------------
0 | model | _ESRNN | 44.2 K
---------------------------------
44.2 K    Trainable params
0         Non-trainable params
44.2 K    Total params
0.177     Total estimated model params size (MB)

#### Model trained with real data

To check that both approaches give similar results, in this section we train the model with the original data.

In [17]:
#%gist gistname: real-dataset-nixtla-ydata.py
train_ts_dataset = TimeSeriesDataset(Y_df=Y_df_train, S_df=S_df,
                                     input_size=4,
                                     output_size=group.horizon)

In [18]:
#%gist gistname: real-loader-nixtla-ydata.py
train_ts_loader = TimeSeriesLoader(dataset=train_ts_dataset,
                                   batch_size=16,
                                   shuffle=False)

In [19]:
#%gist gistname: real-model-nixtla-ydata.py
model = ESRNN(n_series=group.n_ts,
              n_x=0, n_s=1,
              sample_freq=1,
              input_size=4,
              output_size=group.horizon,
              learning_rate=0.0025,
              lr_scheduler_step_size=6,
              lr_decay=0.08,
              per_series_lr_multip=0.8,
              gradient_clipping_threshold=20,
              rnn_weight_decay=0,
              level_variability_penalty=50,
              testing_percentile=50,
              training_percentile=50,
              cell_type='GRU',
              state_hsize=30,
              dilations=[[1, 2], [2, 6]],
              add_nl_layer=True,
              loss='SMYL',
              val_loss='SMAPE',
              seasonality=[])

We can then train it as follows:

In [20]:
#%gist gistname: fit-real-model-nixtla-ydata.py
trainer.fit(model, train_ts_loader)


  | Name  | Type   | Params
---------------------------------
0 | model | _ESRNN | 44.2 K
---------------------------------
44.2 K    Trainable params
0         Non-trainable params
44.2 K    Total params
0.177     Total estimated model params size (MB)

### Comparing forecasts

Finally, we use the original data to make forecasts with both models: `model_synth`, trained with synthetic data, and `model`, trained with the original data. First, we define the test dataset and loader.

In [21]:
#%gist gistname: test-dataset-nixtla-ydata.py
test_ts_dataset = TimeSeriesDataset(Y_df=Y_df_full, S_df=S_df,
                                    input_size=4,
                                    output_size=group.horizon)

In [22]:
#%gist gistname: test-loader-nixtla-ydata.py
test_ts_loader = TimeSeriesLoader(dataset=test_ts_dataset,
                                  batch_size=1024,
                                  eq_batch_size=False,
                                  shuffle=False)

The following lines obtain forecasts with the model trained on synthetic data:

In [23]:
#%gist gistname: synth-forecasts-nixtla-ydata.py
outputs_model_synth = trainer.predict(model_synth, test_ts_loader)
_, y_hat_model_synth, _ = zip(*outputs_model_synth)
y_hat_model_synth = t.cat([y_hat_[:, -1] for y_hat_ in y_hat_model_synth]).cpu().numpy()


Likewise, the following lines obtain forecasts with the model trained with real data:

In [24]:
#%gist gistname: real-forecasts-nixtla-ydata.py
outputs_model = trainer.predict(model, test_ts_loader)
_, y_hat_model, _ = zip(*outputs_model)
y_hat_model = t.cat([y_hat_[:, -1] for y_hat_ in y_hat_model]).cpu().numpy()


Now we compare the performance of both models against the real values using the Mean Absolute Percentage Error (MAPE) and its symmetric version (SMAPE). `nixtlats` provides functions to do that easily.
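
For reference, the two metrics can be sketched in plain NumPy using their common textbook definitions. The function names `mape_sketch` and `smape_sketch` are made up here, and the `nixtlats` functions may differ in details such as the handling of zeros, weights, or scaling.

```python
import numpy as np

def mape_sketch(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return 100 * np.mean(np.abs(y - y_hat) / np.abs(y))

def smape_sketch(y, y_hat):
    """Symmetric MAPE, in percent, bounded between 0 and 200."""
    return 200 * np.mean(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

# Toy values: both forecasts are off by 10% of the true value.
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

print(mape_sketch(y_true, y_pred))
print(smape_sketch(y_true, y_pred))
```

MAPE divides each error by the true value only, so it penalizes over-forecasts and under-forecasts differently; SMAPE divides by the sum of the absolute true and predicted values, which makes the two cases more comparable.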

In [25]:
#%gist gistname: test-set-eval-nixtla-ydata.py
# Reload the original data to recover the true test values
Y_df, _, S_df = M4.load(directory='data', group=group.name)
Y_df_test = Y_df.groupby('unique_id').tail(group.horizon).copy()
Y_df_test = Y_df_test.query('unique_id in @uids')
# Re-index each horizon as 0..h-1 so that unstacking yields one row per series
Y_df_test['ds'] = Y_df_test.groupby('unique_id')['ds'].transform(lambda x: np.arange(len(x)))
y_test = Y_df_test.set_index(['unique_id', 'ds']).unstack().values

In [26]:
#%gist gistname: evaluation-nixtla-ydata.py upload: both
from nixtlats.losses.numpy import mape, smape

metrics_dict = {'MAPE': [mape(y_test, y_hat_model),
                         mape(y_test, y_hat_model_synth)],
                'SMAPE': [smape(y_test, y_hat_model),
                          smape(y_test, y_hat_model_synth)]}

results = pd.DataFrame(metrics_dict, index=['Real', 'Synthetic'])

results

| | MAPE | SMAPE |
|---|---|---|
| Real | 27.939907 | 19.686992 |
| Synthetic | 25.93798 | 19.924676 |


As we can see, the model trained with synthetic data generated with `ydata-synthetic` even produces better forecasts in terms of MAPE (25.94 versus 27.94), while its SMAPE is only marginally worse (19.92 versus 19.69).

## Conclusion

Data privacy is a common concern for many large companies. In this post, we showed a complete pipeline for anonymizing data and using it to train state-of-the-art deep learning models. As we saw, performance is barely affected, and for some metrics it is even better.