# Time-series synthetic data 
### A generation and evaluation example with **Clearbox Engine**

This notebook walks you through the time-series synthetic data generation and evaluation process with **Clearbox Engine**.

You can run this notebook on Google Colab or on your local machine.<br> 
In the latter case, we highly recommend creating a dedicated virtual environment.

<div class="alert alert-secondary">
To run this notebook on Google Colab, make sure you change the runtime to <strong>GPU</strong><br>
<hr>
<strong>Runtime</strong> --> <strong>Change Runtime Type</strong> <br>
and set <strong>Hardware Accelerator</strong> to "<strong>GPU</strong>"
</div>
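Once the runtime is set, it can help to confirm which backend JAX actually selected, since the library reports falling back to CPU when no CUDA-enabled `jaxlib` is installed. The snippet below is a minimal sketch: `detect_backend` is a hypothetical helper (not part of Clearbox Engine), and it only relies on `jax.default_backend()`.

```python
def detect_backend() -> str:
    """Return the name of JAX's default backend, or 'unavailable' if JAX is missing."""
    try:
        import jax  # imported lazily so the check also works before installation
    except ImportError:
        return "unavailable"
    return jax.default_backend()


backend = detect_backend()
print(f"JAX backend: {backend}")
```

If this reports a CPU backend on a GPU machine, training still works but will be slower.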

In [None]:
# Install the library and its dependencies

%pip install clearbox-synthetic-kit

In [70]:
# Import necessary dependencies
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from clearbox_synthetic.utils import Dataset, Preprocessor

from clearbox_synthetic.generation import TimeSeriesEngine

## 0. Data import and preparation

In [3]:

train_dataset = Dataset.from_csv('./data/daily_delhi_climate/DailyDelhiClimateTrain.csv')

### Data pre-processing
Datasets are pre-processed with the **Preprocessor** class, which prepares data for the subsequent steps.

In [4]:
# Add a time-series id column built from the year and month of each date, as "yyyymm"
train_dataset.data['id'] = train_dataset.data['date'].apply(lambda x: ''.join(x.split('-')[0:2]))
prepro = Preprocessor(train_dataset, time_index='id', meta_columns=['date'])
X_train, meta = prepro.transform(train_dataset.data)

ts_id = train_dataset.data['id'].unique().shape[0]
print(f"Time series id found in the dataset: {ts_id}")
print(f"Number of time series channels found in the dataset: {prepro.n_time_features}")
print(f"Max time series length: {prepro.max_sequence_length}")

Time series id found in the dataset: 49
Number of time series channels found in the dataset: 4
Max time series length: 32
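To picture what these numbers mean, the sketch below groups a toy table by the same "yyyymm" key and stacks the groups into a 3D array of shape (number of series, max length, number of channels). It is only an illustration with made-up values: the zero-padding step is an assumption, and this is not the **Preprocessor** implementation.

```python
import numpy as np
import pandas as pd

# Toy daily data spanning two months (hypothetical values)
toy = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-02-01", "2013-02-02"],
    "meantemp": [10.0, 7.4, 7.2, 16.0, 16.4],
    "humidity": [84.5, 92.0, 87.0, 77.2, 67.0],
})

# Same "yyyymm" key as in the cell above
toy["id"] = toy["date"].str.split("-").str[:2].str.join("")

channels = ["meantemp", "humidity"]
groups = [g[channels].to_numpy() for _, g in toy.groupby("id", sort=True)]
max_len = max(len(g) for g in groups)

# Zero-pad every series to max_len (assumed padding scheme)
padded = np.zeros((len(groups), max_len, len(channels)))
for k, g in enumerate(groups):
    padded[k, : len(g)] = g

print(padded.shape)  # (2, 3, 2)
```

Here the two months become two series, the longest has three steps, and there are two channels, which mirrors the 49 ids, 4 channels and length 32 reported above.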


## 1. Synthetic Data Generation

In [6]:
# Initializing the time series generator

engine = TimeSeriesEngine(
    layers_size=[40],
    feature_sizes=prepro.n_time_features, # number of features
    max_sequence_length=prepro.max_sequence_length, # max time series length
    num_heads=4
)

An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.


(1, 32, 4)
(1, 32, 4)


In [7]:
import jax.numpy as jnp

def count_parameters(params_dict):
    total_count = 0
    for key, value in params_dict.items():
        if isinstance(value, dict):  # If the value is another dictionary, recurse
            total_count += count_parameters(value)
        elif isinstance(value, jnp.ndarray):  # If the value is an array, count parameters
            total_count += value.size
    return total_count

# Count the total number of parameters

total_params = count_parameters(engine.params['encoder'])
print("Number of parameters (encoder):", total_params)
total_params = count_parameters(engine.params['decoder'])
print("Number of parameters (decoder):", total_params)

Number of parameters (encoder): 26996
Number of parameters (decoder): 25120
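As a sanity check on the counting logic, the same recursion can be run on a small nested dictionary whose size is easy to compute by hand. This is a sketch with a hypothetical parameter layout (`toy_params` is not the engine's actual structure), using NumPy arrays in place of JAX arrays.

```python
import numpy as np


def count_parameters(params):
    """Recursively sum the sizes of all arrays in a nested dict."""
    total = 0
    for value in params.values():
        if isinstance(value, dict):
            total += count_parameters(value)
        elif hasattr(value, "size"):  # NumPy and JAX arrays both expose .size
            total += int(value.size)
    return total


# Hypothetical dense layer: 4 inputs -> 40 units, plus a bias vector
toy_params = {"dense": {"kernel": np.zeros((4, 40)), "bias": np.zeros(40)}}
print(count_parameters(toy_params))  # 4 * 40 + 40 = 200
```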


In [8]:
# Start the training of the time-series synthetic data generator

engine.fit(X_train, epochs=5000, learning_rate=0.00001)

Engine fitting in progress:   0%|                                                           | 0/5000 [00:00<?, ?epoch/s]

(49, 32, 4)
(49, 32, 4)
(1000, 32, 4)
(1000, 32, 4)


Engine fitting in progress: 100%|███████████████████████████| 5000/5000 [00:58<00:00, 85.21epoch/s, Train loss=9506.455]


Next, we reconstruct the generated data in the original format. In the current version of the library this reconstruction has to be done manually; we are working to automate it.


In [82]:
def generate_series(N):
    # Generate N synthetic time series by decoding samples drawn from a Gaussian distribution.
    # The second dimension is the size of each sampled input vector.
    synth_data = engine.decode(np.random.randn(N, engine.architecture['layers_size'][0]))
    # Pick N ids from the training data to label the generated series
    indices = train_dataset.data[prepro.time_index].sample(N, replace=False).values
    df = prepro.reverse_transform(synth_data)
    dfs = []
    # Create a dataframe with the original schema, one generated series at a time
    for i in tqdm(range(df.shape[0])):
        x_i = df.iloc[i]
        time_series = []
        for feat_name in prepro.time_columns:
            # Collect all columns belonging to this channel (matched by name)
            time_series.append(x_i[[j for j in df.columns if feat_name in j]].values)
        # Rows are time steps, columns are channels
        df_i = pd.DataFrame(np.array(time_series).T)
        df_i.columns = prepro.time_columns

        df_i[prepro.time_index] = indices[i]
        dfs.append(df_i)
    return pd.concat(dfs, axis=0)
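The reshaping step inside the loop can be tried in isolation on a toy row. The sketch below assumes wide columns named `<channel>_<step>` (a hypothetical naming scheme, the real output of `reverse_transform` may differ) and matches each column by its exact channel prefix rather than by substring, which avoids collisions if one channel name is contained in another.

```python
import numpy as np
import pandas as pd

time_columns = ["meantemp", "humidity"]  # channel names

# Hypothetical wide row: one column per (channel, step)
wide = pd.DataFrame(
    [[10.0, 11.0, 12.0, 80.0, 81.0, 82.0]],
    columns=["meantemp_0", "meantemp_1", "meantemp_2",
             "humidity_0", "humidity_1", "humidity_2"],
)

row = wide.iloc[0]
# For each channel, gather its columns in order of appearance
series = [
    row[[c for c in wide.columns if c.rsplit("_", 1)[0] == name]].to_numpy()
    for name in time_columns
]

# Transpose so that rows are time steps and columns are channels
long_df = pd.DataFrame(np.array(series).T, columns=time_columns)
long_df["id"] = "201301"  # illustrative series id
print(long_df)
```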

In [83]:
df = generate_series(10)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 520.24it/s]


Unnamed: 0,meantemp,humidity,wind_speed,meanpressure,id
0,11.718544,70.293396,6.718538,1077.253906,201603
1,13.921609,46.658714,3.141367,876.932434,201603
2,16.129503,68.830276,5.397058,922.216675,201603
3,17.365017,63.537136,7.632838,1095.543823,201603
4,18.268507,70.532715,6.232273,990.137939,201603
...,...,...,...,...,...
27,26.111961,58.067581,1.958867,1042.641357,201402
28,26.497934,56.300514,7.197012,1161.903076,201402
29,28.986565,58.019032,9.456677,951.868347,201402
30,26.214050,54.053692,7.622200,1023.886047,201402
