In [1]:
import sys
from pathlib import Path

script_dir = Path().resolve()
root_dir = (script_dir.parent)
sys.path.append(str(root_dir))

import pandas as pd
import numpy as np

from endata.datasets.timeseries_dataset import TimeSeriesDataset
from endata.trainer import Trainer



## Training a model from scratch ##

To train your own model from scratch, use the `Trainer` class: define a custom dataset, create a `Trainer` object with it, and call the `Trainer`'s `fit()` method.

## Writing a custom dataset ##

When creating a custom time series dataset class for use with EnData, the class must inherit from the provided `TimeSeriesDataset` base class, which provides a modular framework for handling wide-format time series data. Custom implementations only need to assign a name to `self.name` and implement `_preprocess_data`, an abstract method of the base class. This method should return the data as a clean wide-format DataFrame with the structure outlined below.

### Responsibilities of `_preprocess_data`

- Preprocess raw input data into a DataFrame that satisfies the expected structure.
- Ensure time series columns contain arrays of the correct sequence length (`seq_len`).
- Add any additional columns, such as entity identifiers or context variables.
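
The responsibilities above can be checked with a small validation helper before handing a frame to the dataset class. This is a minimal sketch and not part of EnData: the function name `check_wide_format` and the column name `load` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def check_wide_format(df: pd.DataFrame, ts_cols: list, seq_len: int) -> list:
    """Return a list of readable problems; an empty list means the frame looks valid."""
    problems = []
    for col in ts_cols:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        # Every cell of a time series column should be an array of length seq_len.
        for idx, value in df[col].items():
            if not isinstance(value, np.ndarray):
                problems.append(f"{col}[{idx}]: expected np.ndarray, got {type(value).__name__}")
            elif len(value) != seq_len:
                problems.append(f"{col}[{idx}]: length {len(value)} != seq_len {seq_len}")
    return problems


# Hypothetical toy frames: the second one has a sequence that is too short.
good = pd.DataFrame({"load": [np.zeros(4), np.ones(4)]})
bad = pd.DataFrame({"load": [np.zeros(4), np.ones(3)]})

print(check_wide_format(good, ["load"], seq_len=4))
print(check_wide_format(bad, ["load"], seq_len=4))
```

Collecting all problems in a list, instead of raising on the first one, makes it easier to clean a raw dataset in one pass.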

### Benefits of the Base Class

- **Normalization and Scaling:** Automatically handles standardization and min-max scaling.
- **Context Variables:** Provides support for encoding and managing context variables.
- **Time Series Merging and Splitting:** Facilitates operations to merge multiple time series columns into a single multidimensional array and split them back when needed.
- **Data Transformation:** Includes functions for inverse transformations to revert normalized data to its original scale.
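
To make the merging, scaling, and inverse-transformation ideas concrete, here is a minimal NumPy sketch. It is not EnData's implementation: the statistics are computed on a single sample for brevity, whereas a real dataset class would presumably compute them across the whole dataset, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 16
col1 = rng.random(seq_len) * 10.0  # first time series dimension
col2 = rng.random(seq_len) + 5.0   # second time series dimension

# Merge: two 1-D arrays of length seq_len become one (seq_len, 2) array.
merged = np.stack([col1, col2], axis=-1)

# Standardize each dimension, then min-max scale it to [0, 1].
mean, std = merged.mean(axis=0), merged.std(axis=0)
standardized = (merged - mean) / std
z_min, z_max = standardized.min(axis=0), standardized.max(axis=0)
scaled = (standardized - z_min) / (z_max - z_min)

# Inverse transform: undo the two steps in reverse order.
restored = (scaled * (z_max - z_min) + z_min) * std + mean

# Split: recover the per-dimension 1-D arrays.
restored_col1, restored_col2 = restored[:, 0], restored[:, 1]

print(merged.shape)
print(np.allclose(restored_col1, col1), np.allclose(restored_col2, col2))
```

The inverse only works if the statistics (`mean`, `std`, `z_min`, `z_max`) used in the forward pass are kept, which is why they need to be stored alongside the dataset.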

---

### Expected Input DataFrame Structure

The input to the `TimeSeriesDataset` class must adhere to the following structure:

| **Column Name**    | **Description** |
|--------------------|-----------------|
| `time_series_col1` | Arrays of length `seq_len` (after preprocessing) holding the first dimension of the time series. |
| `time_series_col2` | Arrays of length `seq_len` (after preprocessing) holding the second dimension of the time series. |
| `entity_column`    | A unique identifier for each entity (e.g., user, household, or device ID). |
| `context_var1`     | An optional static context variable, categorical or continuous. |
| `context_var2`     | Further optional static context variables. |

- The `time_series_column_names` parameter specifies which columns are part of the time series.
- The `entity_column_name` parameter identifies the column containing unique entity IDs.
- The `context_var_column_names` parameter defines additional context variables.
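
Raw data often arrives in long format, with one row per entity and time step. The sketch below shows one way to reshape such data into the wide format described above using plain pandas. The column names `entity_id` and `step` and the toy values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

seq_len = 4  # kept small for readability

# Hypothetical long-format input: one row per (entity, time step).
long_df = pd.DataFrame({
    "entity_id": [1] * seq_len + [2] * seq_len,
    "step": list(range(seq_len)) * 2,
    "time_series_col1": np.arange(8, dtype=float),
    "time_series_col2": np.arange(8, dtype=float) * 10,
    "context_var": ["a"] * seq_len + ["b"] * seq_len,
})

# Build one wide row per entity, storing each series as a NumPy array.
rows = []
for entity_id, group in long_df.sort_values("step").groupby("entity_id"):
    rows.append({
        "entity_id": entity_id,
        "time_series_col1": group["time_series_col1"].to_numpy(),
        "time_series_col2": group["time_series_col2"].to_numpy(),
        "context_var": group["context_var"].iloc[0],  # static per entity
    })
wide_df = pd.DataFrame(rows)

print(wide_df)
```

An explicit loop is used instead of `groupby(...).agg(...)` because aggregation functions that return arrays are not reliably supported across pandas versions.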

---

In [2]:
class CustomTimeSeriesDataset(TimeSeriesDataset):
    """
    A custom TimeSeriesDataset implementation for handling toy data.

    Input data structure:
    - time_series_col1, time_series_col2: Time series data with arrays of length seq_len.
    - context_var: Categorical or numeric context variable.
    """
    def __init__(
        self,
        data: pd.DataFrame,
        seq_len: int = 16,
        normalize: bool = True,
        scale: bool = True,
    ):
        time_series_column_names = ["time_series_col1", "time_series_col2"]
        context_var_column_names = ["context_var"]

        super().__init__(
            data=data,
            time_series_column_names=time_series_column_names,
            context_var_column_names=context_var_column_names,
            seq_len=seq_len,
            normalize=normalize,
            scale=scale,
        )

    def _preprocess_data(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocesses the raw input data to ensure it conforms to the expected format.

        - Ensures time series columns contain arrays of length seq_len.
        - Ensures all required columns are present.

        Args:
            data (pd.DataFrame): The raw input data.

        Returns:
            pd.DataFrame: The preprocessed data.
        """
        required_columns = ["time_series_col1", "time_series_col2", "context_var"]
        for col in required_columns:
            if col not in data.columns:
                raise ValueError(f"Missing required column: {col}")

        def to_sequence(x, col):
            # Lists become column vectors of shape (n, 1); arrays are kept as-is.
            if isinstance(x, list):
                x = np.array(x).reshape(-1, 1)
            # Raise (rather than return) the error so bad rows fail loudly.
            if not isinstance(x, np.ndarray):
                raise ValueError(f"Invalid data in {col}")
            if len(x) < self.seq_len:
                raise ValueError(f"Sequence too short in {col}")
            # Truncate longer sequences to seq_len.
            return x[: self.seq_len]

        data = data.copy()  # avoid mutating the caller's DataFrame
        for col in ["time_series_col1", "time_series_col2"]:
            data[col] = data[col].apply(to_sequence, args=(col,))
        return data

Now that we have defined our dataset class, let's create some artificial time series columns and a context variable to make up our dataset:

In [3]:
data = pd.DataFrame({
    "time_series_col1": [np.random.rand(16) for _ in range(100)],
    "time_series_col2": [np.random.rand(16) for _ in range(100)],
    "context_var": np.random.choice(["a", "b", "c"], size=100).tolist(),
})

custom_dataset = CustomTimeSeriesDataset(data)
custom_dataset.data

[EnData] Loaded normalizer from /home/fuest/.cache/endata/checkpoints/custom/normalizer/custom_normalizer_dim2.ckpt


|     | index | context_var | timeseries | is_frequency_rare | cluster | is_pattern_rare | is_rare |
|-----|-------|-------------|------------|-------------------|---------|-----------------|---------|
| 0   | 0     | 1           | [[0.48750117, 0.1542439], [0.41293383, 0.23400... | False | 7 | True | False |
| 1   | 1     | 2           | [[0.1933689, 0.53490204], [0.7364248, 0.898635... | False | 9 | True | False |
| 2   | 2     | 0           | [[0.9472752, 0.2833785], [0.79172343, 0.356586... | True  | 3 | True | True  |
| 3   | 3     | 0           | [[0.67579883, 0.65443283], [0.09132707, 0.3540... | True  | 5 | True | True  |
| 4   | 4     | 0           | [[0.981122, 0.7548456], [0.22297916, 0.934267]... | True  | 8 | True | True  |
| ... | ...   | ...         | ...        | ...               | ...     | ...             | ...     |
| 95  | 95    | 2           | [[0.066371, 0.29005468], [0.4312763, 0.5725938... | False | 7 | True | False |
| 96  | 96    | 1           | [[0.44981784, 0.8007116], [0.9998199, 0.101008... | False | 1 | True | False |
| 97  | 97    | 1           | [[0.7390621, 0.80842996], [0.7798741, 0.276360... | False | 2 | True | False |
| 98  | 98    | 2           | [[0.071502745, 0.48175776], [0.7330775, 0.3269... | False | 7 | True | False |


We will now create a `Trainer` object by passing the name of the desired model and the dataset object. To start training, simply call `Trainer.fit()`.

In [7]:
trainer = Trainer(
    model_name="acgan",
    dataset=custom_dataset,
    overrides=["trainer.max_epochs=5", "trainer.strategy=auto"],
)
trainer.fit()

Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name           | Type              | Params | Mode 
-------------------------------------------------------------
0 | context_module | ContextModule     | 4.3 K  | train
1 | generator      | Generator         | 309 K  | train
2 | discriminator  | Discriminator     | 167 K  | train
3 | adv_loss       | BCEWithLogitsLoss | 0      | train
4 | aux_loss       | CrossEntropyLoss  | 0      | train
-------------------------------------------------------------
477 K     Trainable params
0         Non-trainable para

Epoch 4: 100%|██████████| 1/1 [00:00<00:00,  3.51it/s, loss_G=1.830, loss_D=3.580]

`Trainer.fit` stopped: `max_epochs=5` reached.


Epoch 4: 100%|██████████| 1/1 [00:00<00:00,  3.10it/s, loss_G=1.830, loss_D=3.580]


<endata.trainer.Trainer at 0x7f5e744a87c0>
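
The `overrides` strings passed to the `Trainer` use a dotted `key=value` form. How EnData parses them internally is not shown here; the sketch below only illustrates the general idea of applying such strings to a nested configuration. The function `apply_overrides` and the default values are hypothetical.

```python
import copy


def apply_overrides(config: dict, overrides: list) -> dict:
    """Apply 'a.b.c=value' strings to a nested dict and return a new dict."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        # Interpret integers where possible; keep everything else as a string.
        node[leaf] = int(raw_value) if raw_value.lstrip("-").isdigit() else raw_value
    return result


# Hypothetical defaults, not EnData's actual configuration.
defaults = {"trainer": {"max_epochs": 100, "strategy": "ddp"}}
config = apply_overrides(defaults, ["trainer.max_epochs=5", "trainer.strategy=auto"])
print(config)  # {'trainer': {'max_epochs': 5, 'strategy': 'auto'}}
```

This matches the log above in spirit: training stopped once `max_epochs=5` was reached.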

Once training is complete, we can create a data generator object that has access to the trained model and dataset information. There is no need to load a trained model separately: simply set the context variables and call the `DataGenerator`'s `generate()` method.

In [8]:
data_generator = trainer.get_data_generator()

In [9]:
data_generator.set_context(context_var=2)
generated_df = data_generator.generate(n=100)
generated_df

|     | context_var | time_series_col1 | time_series_col2 |
|-----|-------------|------------------|------------------|
| 0   | 2           | [0.45727977, 0.51897794, 0.4798854, 0.42708677... | [0.470251, 0.60801685, 0.44749513, 0.5431756, ... |
| 1   | 2           | [0.48344016, 0.5245678, 0.47357607, 0.531312, ... | [0.4844045, 0.52361476, 0.44142634, 0.55366766... |
| 2   | 2           | [0.44399303, 0.54520196, 0.42349514, 0.4442703... | [0.47629273, 0.5668736, 0.4106424, 0.5360027, ... |
| 3   | 2           | [0.45557648, 0.49937442, 0.5163582, 0.53112465... | [0.4393615, 0.49636224, 0.4633952, 0.45261684,... |
| 4   | 2           | [0.4618596, 0.5549065, 0.45492694, 0.46001238,... | [0.48802534, 0.51868093, 0.4877114, 0.4838679,... |
| ... | ...         | ...              | ...              |
| 95  | 2           | [0.45156628, 0.5218126, 0.51553077, 0.4835453,... | [0.5041389, 0.53898656, 0.47410166, 0.46987182... |
| 96  | 2           | [0.43952155, 0.55379754, 0.5375207, 0.4894367,... | [0.47052127, 0.54430103, 0.42144778, 0.4910470... |
| 97  | 2           | [0.46058822, 0.48529935, 0.50417596, 0.5024766... | [0.49035966, 0.53197294, 0.5194386, 0.50335693... |
| 98  | 2           | [0.45818165, 0.5586115, 0.49885723, 0.4629985,... | [0.46143523, 0.46299663, 0.4316833, 0.5535009,... |
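
The context value passed to `set_context` is an integer code, while the original `context_var` column held the labels `"a"`, `"b"`, and `"c"`. The exact mapping EnData uses is not shown in this notebook. The sketch below illustrates one common scheme (sorted labels numbered from 0) purely as an assumption; check the dataset's own mapping before relying on a particular code.

```python
labels = ["b", "a", "c", "a", "b"]

# Assumed scheme: sort the distinct labels and number them from 0.
label_to_code = {label: code for code, label in enumerate(sorted(set(labels)))}
code_to_label = {code: label for label, code in label_to_code.items()}

encoded = [label_to_code[label] for label in labels]

print(label_to_code)     # {'a': 0, 'b': 1, 'c': 2}
print(encoded)           # [1, 0, 2, 0, 1]
print(code_to_label[2])  # c
```

Under this assumed scheme, `context_var=2` would correspond to the label `"c"`.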


## Model Evaluation ##

The `Evaluator` class provides functionality to assess the quality of generated data compared to the original training data. It computes various metrics including:

- Distribution similarity between real and generated data
- Utility metrics
- Context-FID
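
One of the reported metrics is DTW (dynamic time warping), which compares two sequences while allowing them to be stretched or shifted in time. The following is a textbook sketch of the DTW distance in pure Python with an absolute-difference cost. It is meant to convey what the number measures and is not EnData's implementation, which may differ in cost function and normalization.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


real = [0.0, 1.0, 2.0, 1.0, 0.0]
shifted = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, delayed by one step

print(dtw_distance(real, shifted))            # 0.0
print(dtw_distance([0.0, 0.0], [1.0, 1.0]))   # 2.0
```

A delayed copy of a sequence has distance 0, whereas a pointwise comparison would penalize the shift. Lower values indicate that generated sequences follow shapes similar to the real ones.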

To evaluate the trained model, we can use the `evaluate()` method of the `Trainer`. This method accepts:

- A dataset to evaluate against (typically the training dataset)
- Optional evaluation configuration parameters

The evaluation results provide insights into how well the model captures the underlying data patterns and maintains the relationship with context variables.

Let's run an evaluation on our trained model:

In [10]:
results = trainer.evaluate()
metadata = results["metadata"]
metrics = results["metrics"]

Training Discriminative Score Model: 100%|██████████| 2000/2000 [00:06<00:00, 301.42it/s]
Training Predictive Score Model: 100%|██████████| 5000/5000 [00:16<00:00, 305.58it/s]


In [11]:
metrics

{'DTW': {'mean': np.float64(1.416781231229178),
  'std': np.float64(0.15105484522843415)},
 'MMD': {'mean': np.float64(0.015422067693394749),
  'std': np.float64(0.010529945421156697)},
 'Context_FID': np.float64(3.890161812010505),
 'Disc_Score': np.float64(0.5),
 'Pred_Score': 0.254741563051939}
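
The nested metrics dictionary is easier to log or compare across runs once it is flattened. The helper below is a small sketch and not part of EnData: `flatten_metrics` is a hypothetical name, and the example values are rounded copies of the output above.

```python
def flatten_metrics(metrics: dict, sep: str = ".") -> dict:
    """Turn {'DTW': {'mean': x}} into {'DTW.mean': x}, converting values to plain floats."""
    flat = {}
    for name, value in metrics.items():
        if isinstance(value, dict):
            for sub_name, sub_value in flatten_metrics(value, sep).items():
                flat[f"{name}{sep}{sub_name}"] = sub_value
        else:
            flat[name] = float(value)  # also converts np.float64 to float
    return flat


# Values copied (rounded) from the metrics output above.
example_metrics = {
    "DTW": {"mean": 1.4168, "std": 0.1511},
    "MMD": {"mean": 0.0154, "std": 0.0105},
    "Context_FID": 3.8902,
    "Disc_Score": 0.5,
    "Pred_Score": 0.2547,
}

for key, value in flatten_metrics(example_metrics).items():
    print(f"{key:<12} {value:.4f}")
```

The flat dictionary can be passed directly to `pd.Series` or appended as one row of a results table.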