# Benchmark Datasets

In this guide we will explain the format in which the datasets used by the
DeepEcho benchmarking framework are stored, how to use them to test a
DeepEcho model and how to create a new dataset with your own data.

## DeepEcho Input Format

DeepEcho models work with time series datasets passed as `pandas.DataFrames` with the
following format:

### Entities and `entity_id`:

* The datasets may contain one or more `entity_columns` that form an `entity_id` that
  relates groups of rows to external, abstract, `entities`.
* The rows associated with each `entity_id` form a time series sequence, where order
  of the rows matters and where inter-row dependencies exist.
* However, the rows of different `entities` are completely independent from each other.
* If a dataset does not contain `entity_columns`, the complete dataset is interpreted
  as a single time series sequence.
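
The grouping described above can be sketched in plain Python. This is only an illustration and not DeepEcho code: the `split_into_sequences` helper is hypothetical, and the rows are represented as a list of dictionaries instead of a `pandas.DataFrame`.

```python
from collections import defaultdict


def split_into_sequences(rows, entity_columns):
    """Group rows into one sequence per entity_id, preserving row order."""
    if not entity_columns:
        # No entity columns: the whole dataset is a single sequence.
        return {(): list(rows)}

    sequences = defaultdict(list)
    for row in rows:
        # The entity_id is the tuple of values of all the entity columns.
        entity_id = tuple(row[column] for column in entity_columns)
        sequences[entity_id].append(row)

    return dict(sequences)


# Toy rows with a single entity column, `store_id`.
rows = [
    {'store_id': 68608, 'total_sales': 736.19},
    {'store_id': 68608, 'total_sales': 777.31},
    {'store_id': 47226, 'total_sales': 2750.94},
]

by_store = split_into_sequences(rows, ['store_id'])
single = split_into_sequences(rows, [])
```

Here `by_store` holds two independent sequences, one per store, while `single` holds all the rows as one sequence.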

### Context

* The datasets may have one or more `context_columns`. `context_columns` are expected
  to be constant within each `entity_id`, and they provide contextual information that
  conditions the properties of the associated time series.
* These `context_columns` will be used by DeepEcho models to learn how to generate
  time series with different properties based on the values of these columns.
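
Since `context_columns` are expected to be constant within each `entity_id`, it can be useful to check this before training. The following is a minimal sketch of such a check in plain Python. The `find_varying_context` helper is hypothetical and is not part of DeepEcho.

```python
def find_varying_context(rows, entity_columns, context_columns):
    """Return the (entity_id, column) pairs whose context value is not constant."""
    seen = {}  # (entity_id, column) -> first value seen for that pair
    varying = set()
    for row in rows:
        entity_id = tuple(row[column] for column in entity_columns)
        for column in context_columns:
            key = (entity_id, column)
            first = seen.setdefault(key, row[column])
            if row[column] != first:
                varying.add(key)

    return sorted(varying)


rows = [
    {'store_id': 68608, 'region': 'New York'},
    {'store_id': 68608, 'region': 'New York'},
    {'store_id': 47226, 'region': 'California'},
    {'store_id': 47226, 'region': 'Nevada'},  # deliberately inconsistent
]

problems = find_varying_context(rows, ['store_id'], ['region'])
```

In this toy example only the second store is reported, because its `region` value changes between rows.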

### Sequence Index

* The datasets may contain a `sequence_index` column which indicates the order in which
  the rows of each time series must be learned.
* The `sequence_index` column can be of any type that can be sorted, including numerical
  values such as integers or floats, or datetimes.
* If the `sequence_index` is not present on a dataset, the rows are assumed to be already
  given in the right order.

> **NOTE**: The `sequence_index` column is only used to sort the rows and is then
> dropped before learning the data, which means that the synthetically generated
> sequences will not contain this column.
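
The sort-then-drop behavior described in this section can be illustrated with a small plain-Python sketch. The `prepare_sequence` helper is hypothetical: it mirrors the description above and is not how DeepEcho itself is implemented.

```python
from datetime import date


def prepare_sequence(rows, sequence_index=None):
    """Sort rows by sequence_index (if any) and drop that column."""
    if sequence_index is None:
        # No sequence index: rows are assumed to be in the right order already.
        return [dict(row) for row in rows]

    ordered = sorted(rows, key=lambda row: row[sequence_index])
    return [
        {column: value for column, value in row.items() if column != sequence_index}
        for row in ordered
    ]


# Rows of a single entity, given out of order.
rows = [
    {'date': date(2020, 6, 3), 'total_sales': 921.22},
    {'date': date(2020, 6, 1), 'total_sales': 736.19},
    {'date': date(2020, 6, 2), 'total_sales': 777.31},
]

prepared = prepare_sequence(rows, sequence_index='date')
```

After this step the rows are in chronological order and no longer carry the `date` column, which is why the sampled data does not contain it either.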

### Data Columns

* The dataset will contain an arbitrary number of additional columns which can be `numerical`
  or `categorical`, and which will be what the models learn to replicate.

### DeepEcho Benchmark Dataset Format

The DeepEcho Benchmark framework is prepared to load datasets containing the
information explained above by reading from a folder that contains:

* A CSV file with the values from all the required columns, if they exist:
    * The `entity_columns`
    * The `context_columns`
    * The `sequence_index`
    * The `data_columns`
* A `metadata.json` with the [SDV Metadata]() format, with the following properties:
    * A `tables` entry which should contain at least one table in it.
    * A `table` entry within the `tables` dictionary, which by default should be named
      exactly like the dataset (it can optionally have a different name).
    * Within that `table`, there are the following additional entries:
        * `path`: Path to the CSV file, relative to the `metadata.json` file
          within the dataset folder. In most cases, this is just the CSV file name.
        * `entity_columns`: List containing the names of the `entity_columns`.
          It can be empty.
        * `sequence_index`: Name of the column that acts as the `sequence_index`.
          It can be null, or not exist at all.
        * `deepecho_version`: Version of the DeepEcho dataset format.
        
Here is an example of what a very simple `metadata.json` file will look like for a
dataset named `my_dataset` with the following columns:

* `id`: Column that acts as the `entity_id` for this dataset.
* `timestamp`: Column that acts as the `sequence_index` for this dataset.
* `float_value`: Float value that we want to replicate.
* `categorical_value`: Categorical value that we want to replicate.

```json
{
    "tables": {
        "my_dataset": {
            "path": "my_dataset.csv",
            "fields": {
                "id": {
                    "type": "numerical",
                    "subtype": "integer"
                },
                "timestamp": {
                    "type": "datetime"
                },
                "float_value": {
                    "type": "numerical",
                    "subtype": "float"
                },
                "categorical_value": {
                    "type": "categorical"
                }
            },
            "entity_columns": [
                "id"
            ],
            "sequence_index": "timestamp",
            "deepecho_version": "0.1.1"
        }
    }
}
```
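
Before using a hand-written `metadata.json`, it may help to check it against the properties listed above. The following is a minimal sketch using only the standard library. The `check_metadata` helper is hypothetical, and it only covers the rules described in this guide, not the complete SDV Metadata format.

```python
import json

EXAMPLE_METADATA = """
{
    "tables": {
        "my_dataset": {
            "path": "my_dataset.csv",
            "fields": {
                "id": {"type": "numerical", "subtype": "integer"},
                "timestamp": {"type": "datetime"},
                "float_value": {"type": "numerical", "subtype": "float"},
                "categorical_value": {"type": "categorical"}
            },
            "entity_columns": ["id"],
            "sequence_index": "timestamp",
            "deepecho_version": "0.1.1"
        }
    }
}
"""


def check_metadata(metadata):
    """Return a list of problems found in a DeepEcho-style metadata dict."""
    tables = metadata.get('tables') or {}
    if not tables:
        return ['"tables" must contain at least one table']

    problems = []
    for name, table in tables.items():
        fields = table.get('fields', {})
        if 'path' not in table:
            problems.append(f'{name}: missing "path"')
        for column in table.get('entity_columns', []):
            if column not in fields:
                problems.append(f'{name}: unknown entity column {column!r}')
        # sequence_index may be null or absent altogether.
        sequence_index = table.get('sequence_index')
        if sequence_index is not None and sequence_index not in fields:
            problems.append(f'{name}: unknown sequence_index {sequence_index!r}')

    return problems


metadata = json.loads(EXAMPLE_METADATA)
problems = check_metadata(metadata)
```

For the example above `problems` is an empty list. Each violation found is reported as one message.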

### Generating a dataset from a `pandas.DataFrame`

The DeepEcho benchmark framework provides a utility function to facilitate the
creation of new datasets from `pandas.DataFrames`.

Let's try to load the DeepEcho Demo data and store it as a valid dataset.

In [1]:
from deepecho.demo import load_demo

data = load_demo()

This call returns the same `pandas.DataFrame` that is used in the DeepEcho Quickstart:

In [2]:
data.head(10)

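|   | date       | store_id | region     | day_of_week | total_sales | nb_customers |
|---|------------|----------|------------|-------------|-------------|--------------|
| 0 | 2020-06-01 | 68608    | New York   | 0           | 736.19      | 43           |
| 1 | 2020-06-02 | 68608    | New York   | 1           | 777.31      | 45           |
| 2 | 2020-06-03 | 68608    | New York   | 2           | 921.22      | 54           |
| 3 | 2020-06-04 | 68608    | New York   | 3           | 1085.69     | 63           |
| 4 | 2020-06-05 | 68608    | New York   | 4           | 1476.3      | 86           |
| 5 | 2020-06-06 | 68608    | New York   | 5           | 2463.12     | 144          |
| 6 | 2020-06-07 | 68608    | New York   | 6           | 1579.1      | 92           |
| 7 | 2020-06-01 | 47226    | California | 0           | 2750.94     | 161          |
| 8 | 2020-06-02 | 47226    | California | 1           | 2853.73     | 167          |
| 9 | 2020-06-03 | 47226    | California | 2           | 2915.41     | 171          |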


This table has 6 columns:

- `date`: Column that acts as the `sequence_index`.
- `store_id`: Column that acts as `entity_id`.
- `region`: Column that acts as `context`.
- `day_of_week`, `total_sales` and `nb_customers`: Columns that we want to learn and replicate.

In order to create a valid DeepEcho Dataset with this data, we can use the
`deepecho.benchmark.dataset.make_dataset` function, passing the name of the
dataset, the data, the `entity_columns` and the `sequence_index`, which in
this case is `'date'`.

Additionally, we can pass the path where we want the dataset to be created.
Here we pass `'.'` (which is the default value) to create the dataset in the
current folder:

In [4]:
from deepecho.benchmark.dataset import make_dataset

make_dataset(
    name='sunglasses',
    data=data,
    entity_columns=['store_id'],
    sequence_index='date',
    datasets_path='.',
)

As we can see, a `sunglasses` folder has been generated in our current working directory,
containing both the `sunglasses.csv` and `metadata.json` files explained above:

In [5]:
!tree ./sunglasses/

./sunglasses/
├── metadata.json
└── sunglasses.csv

0 directories, 2 files
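
To make the folder layout more concrete, here is a rough standard-library sketch that writes the same two files by hand. It is not the `make_dataset` implementation: the `write_dataset_folder` helper is hypothetical, the field types are given explicitly instead of being inferred from the data, and the `deepecho_version` entry is left out.

```python
import csv
import json
import os
import tempfile


def write_dataset_folder(name, rows, fields, entity_columns, sequence_index, datasets_path):
    """Write <datasets_path>/<name>/ with a CSV file and a metadata.json file."""
    folder = os.path.join(datasets_path, name)
    os.makedirs(folder, exist_ok=True)

    csv_name = name + '.csv'
    with open(os.path.join(folder, csv_name), 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=list(fields))
        writer.writeheader()
        writer.writerows(rows)

    metadata = {
        'tables': {
            name: {
                'fields': fields,
                'path': csv_name,  # relative to metadata.json
                'entity_columns': entity_columns,
                'sequence_index': sequence_index,
            }
        }
    }
    with open(os.path.join(folder, 'metadata.json'), 'w') as metadata_file:
        json.dump(metadata, metadata_file, indent=4)

    return folder


rows = [
    {'date': '2020-06-01', 'store_id': 68608, 'total_sales': 736.19},
    {'date': '2020-06-02', 'store_id': 68608, 'total_sales': 777.31},
]
fields = {
    'date': {'type': 'categorical'},
    'store_id': {'type': 'numerical', 'subtype': 'integer'},
    'total_sales': {'type': 'numerical', 'subtype': 'float'},
}

# Write into a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as datasets_path:
    folder = write_dataset_folder(
        'sunglasses', rows, fields, ['store_id'], 'date', datasets_path
    )
    created = sorted(os.listdir(folder))
```

In practice `make_dataset` is the recommended way to create a dataset. The sketch only shows which files end up in the folder and how `path` relates them.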


And we can see the contents of our `metadata.json` file:

In [6]:
!cat ./sunglasses/metadata.json

{
    "tables": {
        "sunglasses": {
            "fields": {
                "date": {
                    "type": "categorical"
                },
                "store_id": {
                    "type": "numerical",
                    "subtype": "integer"
                },
                "region": {
                    "type": "categorical"
                },
                "day_of_week": {
                    "type": "numerical",
                    "subtype": "integer"
                },
                "total_sales": {
                    "type": "numerical",
                    "subtype": "float"
                },
                "nb_customers": {
                    "type": "numerical",
                    "subtype": "integer"
                }
            },
            "path": "sunglasses.csv",
            "entity_columns": [
                "store_id"
            ],
            "sequence_index": "date",
            "deepecho_version"

## The `Dataset` class

The DeepEcho benchmark framework also provides a class `Dataset` to load
and work with datasets stored in the format explained above.

In order to load the dataset that we just created, all we need to do is create an instance
of the class `deepecho.benchmark.Dataset`, passing the path to the dataset folder.

In [7]:
from deepecho.benchmark import Dataset

dataset = Dataset('./sunglasses')

This will load the data and its properties and have it ready to use by DeepEcho
in the following attributes:

- `data`: The table data, loaded and ready to be learned.
- `entity_columns`: The names of the entity columns of this dataset.
- `context_columns`: The names of the context columns of this dataset.
- `sequence_index`: The name of the sequence index of this dataset.
- `model_columns`: The names of the columns that will be learned.
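
One way to read the relationship between these attributes is that the `model_columns` are whatever remains once the entity, context and sequence index columns are set aside. The sketch below illustrates that reading with the demo column names. It is an assumption made for illustration, with a hypothetical `get_model_columns` helper, and not the actual `Dataset` code.

```python
def get_model_columns(columns, entity_columns, context_columns, sequence_index=None):
    """Return the columns left after removing entity, context and index columns."""
    excluded = set(entity_columns) | set(context_columns)
    if sequence_index is not None:
        excluded.add(sequence_index)

    # Keep the original column order.
    return [column for column in columns if column not in excluded]


columns = ['date', 'store_id', 'region', 'day_of_week', 'total_sales', 'nb_customers']
model_columns = get_model_columns(
    columns,
    entity_columns=['store_id'],
    context_columns=['region'],
    sequence_index='date',
)
```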

In [8]:
dataset

Dataset('sunglasses')

In [9]:
dataset.data.head()

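|   | date       | store_id | region   | day_of_week | total_sales | nb_customers |
|---|------------|----------|----------|-------------|-------------|--------------|
| 0 | 2020-06-01 | 68608    | New York | 0           | 736.19      | 43           |
| 1 | 2020-06-02 | 68608    | New York | 1           | 777.31      | 45           |
| 2 | 2020-06-03 | 68608    | New York | 2           | 921.22      | 54           |
| 3 | 2020-06-04 | 68608    | New York | 3           | 1085.69     | 63           |
| 4 | 2020-06-05 | 68608    | New York | 4           | 1476.3      | 86           |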


In [10]:
dataset.entity_columns

['store_id']

In [11]:
dataset.context_columns

['region']

In [12]:
dataset.sequence_index

'date'

In [13]:
dataset.model_columns

['day_of_week', 'total_sales', 'nb_customers']

Additionally, it provides a `describe` method that will return some
basic information about the dataset characteristics.

In [14]:
dataset.describe()

entities            100
entity_columns        1
context_columns       1
model_columns         3
max_sequence_len      7
min_sequence_len      7
dtype: int64
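
The figures in this summary are simple counts, which the following plain-Python sketch reproduces on a handful of toy rows. The `describe_rows` helper is hypothetical and only approximates what `describe` reports. It works on a list of dictionaries and assumes at least one row.

```python
from collections import Counter


def describe_rows(rows, entity_columns, context_columns, model_columns):
    """Compute the same kind of summary that `describe` reports."""
    # Count how many rows each entity_id has.
    lengths = Counter(
        tuple(row[column] for column in entity_columns) for row in rows
    )
    return {
        'entities': len(lengths),
        'entity_columns': len(entity_columns),
        'context_columns': len(context_columns),
        'model_columns': len(model_columns),
        'max_sequence_len': max(lengths.values()),
        'min_sequence_len': min(lengths.values()),
    }


rows = [
    {'store_id': 68608, 'region': 'New York', 'total_sales': 736.19},
    {'store_id': 68608, 'region': 'New York', 'total_sales': 777.31},
    {'store_id': 68608, 'region': 'New York', 'total_sales': 921.22},
    {'store_id': 47226, 'region': 'California', 'total_sales': 2750.94},
    {'store_id': 47226, 'region': 'California', 'total_sales': 2853.73},
]

summary = describe_rows(rows, ['store_id'], ['region'], ['total_sales'])
```

With these five rows there are two entities, the longest sequence has three rows and the shortest has two.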

## Loading the DeepEcho datasets

The DeepEcho benchmark framework has a collection of datasets which are used
to evaluate the DeepEcho models at every release.

You can see the complete list of datasets and their properties by calling the
`get_datasets_list` function:

In [15]:
from deepecho.benchmark import get_datasets_list

get_datasets_list().head()

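|   | dataset            | size_in_kb | entities | entity_columns | context_columns | data_columns | max_sequence_len | min_sequence_len |
|---|--------------------|------------|----------|----------------|-----------------|--------------|------------------|------------------|
| 0 | Libras             | 108739     | 360      | 1              | 1               | 4            | 45               | 45               |
| 1 | AtrialFibrillation | 111019     | 30       | 1              | 1               | 4            | 640              | 640              |
| 2 | BasicMotions       | 196062     | 80       | 1              | 1               | 8            | 100              | 100              |
| 3 | ERing              | 223502     | 300      | 1              | 1               | 6            | 65               | 65               |
| 4 | RacketSports       | 235392     | 303      | 1              | 1               | 8            | 30               | 30               |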


These datasets can easily be loaded using our `Dataset` class by simply passing their
name to it.

Let's load the first dataset from the list, `Libras`:

In [16]:
libras = Dataset('Libras')

In [17]:
libras.describe()

entities            360
entity_columns        1
context_columns       1
model_columns         2
max_sequence_len     45
min_sequence_len     45
dtype: int64

In [18]:
libras.data.head()

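|   | e_id | dim_0   | dim_1   | ml_class |
|---|------|---------|---------|----------|
| 0 | 0    | 0.67892 | 0.27315 | 1        |
| 1 | 0    | 0.68085 | 0.27315 | 1        |
| 2 | 0    | 0.68085 | 0.27315 | 1        |
| 3 | 0    | 0.68085 | 0.27315 | 1        |
| 4 | 0    | 0.67892 | 0.26852 | 1        |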


## Using a `Dataset` with `DeepEcho`

Once we have loaded a `dataset` instance, we can use a DeepEcho model
to learn it and generate synthetic versions of it.

Let's try to use the `PARModel` class on the `Libras` dataset that we just loaded.

In order to do this, we will first need to create an instance of the model
with the desired hyperparameters.

In [19]:
from deepecho import PARModel

model = PARModel(epochs=512)

PARModel(epochs=512, sample_size=1, cuda='cuda', verbose=True) instance created


And then we can fit the model by passing the `data`, the `entity_columns`,
the `context_columns` and the `sequence_index` from our `dataset`.

In [20]:
model.fit(
    libras.data,
    entity_columns=libras.entity_columns,
    context_columns=libras.context_columns,
    sequence_index=libras.sequence_index
)

Epoch 512 | Loss -0.06662452965974808: 100%|██████████| 512/512 [05:52<00:00,  1.45it/s]   


And finally create new versions of our dataset:

In [21]:
sampled = model.sample(num_entities=5)

100%|██████████| 5/5 [00:00<00:00,  8.97it/s]


In [22]:
sampled.head()

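|   | e_id | dim_0     | dim_1    | ml_class |
|---|------|-----------|----------|----------|
| 0 | 0.0  | -0.014832 | 0.303777 | 15       |
| 1 | 0.0  | -0.110731 | 0.434711 | 15       |
| 2 | 0.0  | 0.195686  | 0.459823 | 15       |
| 3 | 0.0  | 0.317443  | 0.50665  | 15       |
| 4 | 0.0  | 0.446434  | 0.525029 | 15       |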
