Synthetic Time-Series

Generate synthetic time-series using generative adversarial networks. This project provides an end-to-end system for generating time-series datasets with database support, training scripts running on compute clusters, post-training model registration, and interactive model inference with time-series visualization.



Software Architecture

[Software architecture diagram]


Docker Structure

[Docker structure diagram]


How It Works

  1. Create a dataset on the TS Generation page. The dataset is sent to the API, which saves it in the MongoDB database along with the configuration parameters used.
  2. From the TS Database page, we query the API and automatically fetch all datasets available in the database. We can then inspect and interact with the visualized datasets (time-series).
  3. We're now ready to initiate a training session by submitting a train job to the Ray cluster. In addition to the training functions, we're required to set a model name and the name of the dataset we want. The training script submits the job to Ray, which runs it, saves each model after training, and finally loops through all of the trials and registers the best one in the database. While the job is running, we can inspect the progression of each trial in MLflow.
  4. As the TS Operations page loads, we fetch all registered models from the model registry. By selecting the model we want, we can send an inference request to the API with a given model name, version, and inference parameters. The request prompts the API to load the registered model from the MLflow model registry (or use a locally cached version). The API then runs a forward pass on the provided data and returns a prediction response. Finally, the UI application processes the response and renders an interactive visualization of the prediction (a client-side sketch follows this list).
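
Below is a minimal client-side sketch of such an inference request. The endpoint path, payload fields, and response shape are assumptions for illustration, not the repository's actual API contract (see synthetic_data/api for that):

    # Sketch of an inference request to the backend API.
    # NOTE: the endpoint path and payload fields are illustrative assumptions.
    import os

    import requests

    BACKEND_HOST = os.getenv("BACKEND_HOST", "backend")
    BACKEND_PORT = os.getenv("BACKEND_PORT", "8502")

    payload = {
        "model_name": "C-GAN",          # hypothetical registered model name
        "version": 1,
        "params": {"num_samples": 10},  # hypothetical inference parameters
    }

    response = requests.post(
        f"http://{BACKEND_HOST}:{BACKEND_PORT}/inference", json=payload, timeout=30
    )
    response.raise_for_status()
    prediction = response.json()  # rendered as an interactive plot by the UI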

User Interface

HOME

[Screenshot: HOME page]


TS Generation

[Screenshot: TS Generation page]


TS Database

[Screenshot: TS Database page]


TS Operations

[Screenshot: TS Operations page]

File structure

.
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml
├── README.md
├── setup.py
└── synthetic_data
    ├── api
    │   └── *.py
    ├── app
    │   ├── *.py
    │   └── pages
    │       └── *.py
    ├── common
    │   └── *.py
    └── mlops
        ├── datasets
        │   └── *.py
        ├── models
        │   └── *.py
        ├── tools
        │   └── *.py
        ├── train_*.py
        ├── train_*.sh
        └── transforms
            └── *.py

Prerequisites

  • Docker Engine (see step 1 below)
  • Python 3.7 with virtualenv (only for local training; see Training)

Usage

  1. Follow the instructions for installing Docker Engine.

  2. Clone the repository

    git clone git@github.com:ML4ITS/synthetic-data.git
    cd synthetic-data
  3. Create an environment file (.env) with the following credentials:

    # Hostname of service/server running Ray ML (aka. the Ray compute cluster) 
    COMPUTATION_HOST=<REPLACE>
    # Port of service/server running Ray ML (aka. the Ray compute cluster) 
    RAY_PORT=<REPLACE>
    
    # Hostname of service/server running MLflow
    APPLICATION_HOST=<REPLACE>
    # Port of service/server running MLflow
    MODELREG_PORT=<REPLACE>
    
    # Select the name of your database
    DATABASE_NAME=<REPLACE>
    
    # Protect the database with your username & password
    DATABASE_USERNAME=<REPLACE>
    DATABASE_PASSWORD=<REPLACE>
    
    # Hostname of database (aka. the name of the container when running Docker)
    DATABASE_HOST=mongodb
    DATABASE_PORT=27017
    
    # Hostname of service/server running the API (aka. the name of the container when running Docker)
    BACKEND_HOST=backend
    BACKEND_PORT=8502

    These credentials are stored in the .env file and are picked up by dotenv to populate the various config classes when running the application (a loading sketch follows the list below).
    When interfacing with Ray Tune / MLflow, we use the underlying server configuration:

      import os

      class ServerConfig:

          @property
          def APPLICATION_HOST(self):
              return os.getenv("APPLICATION_HOST")

          @property
          def COMPUTATION_HOST(self):
              return os.getenv("COMPUTATION_HOST")

    (see config.py for more details)

  4. Run the following command to start the application:

      docker-compose up --build -d
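
Regarding step 3: a minimal sketch of how the credentials are picked up at runtime, assuming the python-dotenv package (the actual wiring lives in config.py):

    # Sketch: load .env so os.getenv(...) resolves the credentials.
    # Assumes the python-dotenv package and the ServerConfig class above.
    from dotenv import load_dotenv

    load_dotenv()  # reads .env from the current working directory

    config = ServerConfig()
    print(config.COMPUTATION_HOST)  # hostname of the Ray compute cluster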

Training

  1. Create a virtual environment and install the dependencies

    NOTE: local development requires Python 3.7 because of the timesynth library

      virtualenv venv -p python3.7
      source venv/bin/activate
      pip install -e .
  2. Run one of the following shell scripts to train a C-GAN or WGAN-GP model (a rough sketch of the underlying job flow follows below):

    NOTE: adjust training parameters as needed inside their respective *.py files

      sh synthetic_data/mlops/train_cgan.sh
      # or
      sh synthetic_data/mlops/train_gan_gp.sh
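
As a rough sketch of what these scripts do behind the scenes (cf. step 3 in How It Works), assuming ray[tune] and mlflow are installed; the hyperparameters, metric name, and model URI below are illustrative, not the scripts' actual values:

    # Sketch of the train-job flow: connect to the Ray cluster, run the
    # trials, then register the best one in the MLflow model registry.
    import os

    import mlflow
    import ray
    from ray import tune

    def trainable(config):
        # ... build the dataset and model, train, and report a metric ...
        tune.report(g_loss=0.0)  # placeholder metric

    ray.init(address=f"ray://{os.getenv('COMPUTATION_HOST')}:{os.getenv('RAY_PORT')}")
    mlflow.set_tracking_uri(
        f"http://{os.getenv('APPLICATION_HOST')}:{os.getenv('MODELREG_PORT')}"
    )

    analysis = tune.run(trainable, config={"lr": tune.grid_search([1e-4, 2e-4])})
    best = analysis.get_best_trial(metric="g_loss", mode="min")

    # Map the best trial to its logged model and register it
    # (the model URI below is illustrative).
    mlflow.register_model(f"runs:/{best.trial_id}/model", "C-GAN")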

Evaluation

The following performance indications and visualizations are based on two models: the WGAN-GP model and the C-GAN model, both trained on the same datasets. The WGAN-GP model was trained with a learning rate of 0.0002, a batch size of 128, and a total of 1000 epochs (or 9990050 global steps). The C-GAN model was trained with a learning rate of 0.0002, a batch size of 128, and a total of 300 epochs (or 3000050 global steps).

---

In deep learning, GANs, as opposed to e.g. object detectors, don't really have a direct, straightforward way of measuring performance. Where object detectors can rely on intersection over union as a simple and easy evaluation metric for bounding box accuracy, GAN models are much more difficult to guide and interpret in terms of training and evaluation.

Commonly, GAN models are used for image generation, a task that is not directly related to time-series generation. For image generation, popular evaluation metrics such as the Inception Score (IS) and the Fréchet Inception Distance (FID) have been used to evaluate the performance of GAN models. Both of these metrics rely on a pre-trained image classifier (e.g. Inception-v3) developed for the 2D domain. This leaves us with the challenge of evaluating the performance of GAN models for time-series generation, as we're working in the 1D domain.
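
For reference, the Fréchet distance itself is domain-agnostic: given feature embeddings from some (hypothetical) 1-D feature extractor, it could be computed along these lines:

    # Sketch: Fréchet distance between two sets of feature embeddings.
    # A suitable 1-D time-series feature extractor (the hard part) is
    # assumed to exist and produce the (n_samples, n_features) arrays.
    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
        mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        cov_r = np.cov(feats_real, rowvar=False)
        cov_f = np.cov(feats_fake, rowvar=False)
        covmean = linalg.sqrtm(cov_r @ cov_f)
        if np.iscomplexobj(covmean):  # drop tiny imaginary parts from sqrtm
            covmean = covmean.real
        return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))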

---

Efforts have been made to evaluate the performance of GAN models in terms of time-series generation. The analysis.ipynb notebook shows various experiments and evaluations, such as average cosine similarity scoring, t-SNE, PCA, and latent-space interpolation. By looking at a few of them, we can evaluate the models visually.

The most straightforward way to evaluate performance visually is to generate a handful of samples (e.g. 10) and compare them side by side with the real time-series data, as in the figures below.

WGAN-GP: Randomly sampled original data vs. randomly generated data

[WGAN-GP: real vs. generated samples]

C-GAN: Sequentially sampled original data vs. conditionally generated data

[C-GAN: real vs. conditionally generated samples]
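
A minimal sketch of producing such a side-by-side comparison, assuming the real and generated samples are available as NumPy arrays of shape (n, seq_len):

    # Sketch: plot n real sequences next to n generated ones.
    import matplotlib.pyplot as plt
    import numpy as np

    def plot_real_vs_fake(real: np.ndarray, fake: np.ndarray) -> None:
        n = len(real)
        fig, axes = plt.subplots(n, 2, figsize=(10, 2 * n), sharex=True)
        axes[0, 0].set_title("Real")
        axes[0, 1].set_title("Generated")
        for i in range(n):
            axes[i, 0].plot(real[i])
            axes[i, 1].plot(fake[i])
        fig.tight_layout()
        plt.show()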

t-SNE and PCA

t-SNE and PCA are two slightly less common ways of evaluating performance, but they can help uncover insights by displaying clusters visually. For instance, using PCA, we can sample the original and generated data per class, cluster them in pairs, and visualize the distributions as seen below.

C-GAN: Condition on 1-10 Hz

[C-GAN: PCA clusters per condition, 1-10 Hz]
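
A minimal sketch of this per-class projection, assuming samples and integer class labels as NumPy arrays:

    # Sketch: project real and generated samples into 2-D with PCA,
    # then scatter-plot them per class to compare the distributions.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.decomposition import PCA

    def plot_pca_pairs(real, fake, real_labels, fake_labels):
        pca = PCA(n_components=2).fit(real)  # fit on the real data only
        real_2d, fake_2d = pca.transform(real), pca.transform(fake)
        for cls in np.unique(real_labels):
            plt.scatter(*real_2d[real_labels == cls].T, marker="o", label=f"real {cls} Hz")
            plt.scatter(*fake_2d[fake_labels == cls].T, marker="x", label=f"fake {cls} Hz")
        plt.legend()
        plt.show()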

---

Latent-space exploration

To investigate how the models generalize as they are presented with various latent-space inputs, we can, by interpolation, manipulate the inputs to discover different but similar sequence generations. The examples below show output sequences based on a given latent space from 10 different noise distributions, with 200 spherical linear interpolation (slerp) steps between each one. For the setup and how to perform these interpolations, see the slerp.ipynb notebook.
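
For reference, slerp between two latent vectors can be sketched as follows with NumPy (the notebook may differ in its details):

    # Sketch: spherical linear interpolation (slerp) between two latent vectors.
    import numpy as np

    def slerp(t: float, v0: np.ndarray, v1: np.ndarray) -> np.ndarray:
        """Interpolate along the arc between v0 and v1, with t in [0, 1]."""
        omega = np.arccos(
            np.clip(np.dot(v0 / np.linalg.norm(v0), v1 / np.linalg.norm(v1)), -1.0, 1.0)
        )
        if np.isclose(omega, 0.0):  # (near-)parallel vectors: fall back to lerp
            return (1.0 - t) * v0 + t * v1
        return (np.sin((1.0 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

    z0, z1 = np.random.randn(100), np.random.randn(100)  # latent dim is illustrative
    steps = [slerp(t, z0, z1) for t in np.linspace(0.0, 1.0, 200)]  # 200 steps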

Both models were trained on the same multi-harmonic dataset, consisting of 10 000 time series evenly distributed between 1-10 Hz. Using conditions/labels, we can manipulate (embed) the input latent space to make the generator output desired frequencies.
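
A hedged PyTorch sketch of one common way to realize this conditioning is shown below; the embedding size and concatenation scheme are assumptions for illustration, not necessarily what this repository's C-GAN does:

    # Sketch: condition the generator by embedding the class label and
    # concatenating it with the latent noise vector. Sizes are illustrative.
    import torch
    import torch.nn as nn

    class ConditionalInput(nn.Module):
        def __init__(self, n_classes: int = 10, embed_dim: int = 10):
            super().__init__()
            self.embed = nn.Embedding(n_classes, embed_dim)

        def forward(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
            return torch.cat([z, self.embed(labels)], dim=1)  # (batch, latent+embed)

    cond = ConditionalInput()
    z = torch.randn(16, 100)              # batch of latent noise vectors
    labels = torch.randint(0, 10, (16,))  # class indices for the 1-10 Hz conditions
    gen_input = cond(z, labels)           # fed to the generator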

WGAN-GP: latent space interpolation (slerp)

[WGAN-GP: latent space interpolation (slerp)]


C-GAN: latent space interpolation (slerp)

[C-GAN: latent space interpolation (slerp)]

Future suggestions

  • Create a unified model-registration method for the training scripts
  • Migrate model trainers to PyTorch Lightning
  • Refactor the LSTM to train with the new MultiHarmonicDataset
  • Implement IS and FID using some kind of 1-D classifier (e.g. an Inception-v3-like model, but for 1-D)
  • Experiment with training/evaluating models using TSTR or TRTS (train on synthetic, test on real, and vice versa).
  • Experiment with other datasets (e.g. synthetic, real, etc.)
  • Experiment with other evaluation metrics for GANs.
  • Experiment with subtracting some Gaussian component and looking at the residuals.
  • Refactor certain components of the application, and remove unnecessary/unused methods.
  • Experiment with different types of generated time-series (e.g. Gaussian process, pseudo-periodic, auto-regressive, etc.)
