
In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
cd drive/MyDrive/FYP

/content/drive/MyDrive/FYP


In [6]:
!git clone https://github.com/HarisNaveed17/TimeGAN.git

Cloning into 'TimeGAN'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 163 (delta 63), reused 98 (delta 43), pack-reused 44
Receiving objects: 100% (163/163), 2.08 MiB | 12.26 MiB/s, done.
Resolving deltas: 100% (74/74), done.


# TimeGAN Tutorial

## Time-series Generative Adversarial Networks

- Paper: Jinsung Yoon, Daniel Jarrett, Mihaela van der Schaar, "Time-series Generative Adversarial Networks," Neural Information Processing Systems (NeurIPS), 2019.

- Paper link: https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks

- Last updated: April 24th, 2020

- Code author: Jinsung Yoon (jsyoon0823@gmail.com)

This notebook is a user guide to generating synthetic time-series data with the TimeGAN framework. The Stock, Energy, and Sine datasets serve as examples.

### Prerequisite
Clone https://github.com/jsyoon0823/timeGAN.git to the current directory.

## Necessary packages and function calls

- timegan: Synthetic time-series data generation module
- data_loading: loading and preprocessing for 2 real datasets and 1 synthetic dataset
- metrics: 
    - discriminative_metrics: classify real data from synthetic data
    - predictive_metrics: train on synthetic, test on real
    - visualization: PCA and tSNE analyses

In [None]:
cd 

/home/haris/TimeGAN


In [None]:
## Necessary packages
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import warnings
import pandas as pd
warnings.filterwarnings("ignore")

# 1. TimeGAN model
from timegan import timegan
# 2. Data loading
from data_loading import real_data_loading, sine_data_generation
# 3. Metrics
from metrics.discriminative_metrics import discriminative_score_metrics
from metrics.predictive_metrics import predictive_score_metrics
from metrics.visualization_metrics import visualization

## Data Loading

Load original dataset and preprocess the loaded data.

- data_name: stock, energy, or sine
- seq_len: sequence length of the time-series data
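
As a rough illustration of what this preprocessing step involves, the sketch below min-max scales each feature to [0, 1] and cuts the series into overlapping windows of length `seq_len`. The function name `make_windows` and the toy array are hypothetical, and the repository's `real_data_loading` may differ in details such as row ordering, shuffling, and the exact number of windows it produces.

```python
import numpy as np

def make_windows(data, seq_len):
    """Min-max scale a (time, features) array and slice it into windows.

    Hypothetical helper for illustration; not the repository's loader.
    """
    data = np.asarray(data, dtype=float)
    dat_min = data.min(axis=0)
    dat_max = data.max(axis=0)
    # Small epsilon avoids division by zero for constant columns.
    scaled = (data - dat_min) / (dat_max - dat_min + 1e-7)
    # One window per valid start index.
    windows = [scaled[i:i + seq_len] for i in range(len(scaled) - seq_len + 1)]
    return windows, dat_min, dat_max

# Toy series with 5 time steps and 2 features (made-up values).
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
windows, dat_min, dat_max = make_windows(toy, seq_len=3)
print(len(windows), windows[0].shape)
```

Returning the per-feature minimum and maximum alongside the windows makes it possible to map generated values back to the original units later.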

In [None]:
## Data loading
data_name = 'tester'
seq_len = 3

if data_name in ['stock', 'energy','tester']:
  ori_data, dat_min, dat_max = real_data_loading(data_name, seq_len)
elif data_name == 'sine':
  # Set number of samples and its dimensions
  no, dim = 10000, 5
  ori_data = sine_data_generation(no, seq_len, dim)
    
print(data_name + ' dataset is ready.')

Mixed data: [array([[0.35897436, 0.13327607],
       [0.35897436, 0.        ],
       [0.17948718, 0.72874091]]), array([[0.17948718, 0.69245969],
       [0.12820513, 0.84943882],
       [0.12820513, 0.65494963]]), array([[0.02564103, 0.84664275],
       [0.02564103, 0.8839505 ],
       [0.02564103, 0.99999996]]), array([[0.99999999, 0.14796881],
       [0.69230769, 0.29374782],
       [0.69230769, 0.45012165]]), array([[0.07692308, 0.62791545],
       [0.02564103, 0.84664275],
       [0.02564103, 0.8839505 ]]), array([[0.07692308, 0.66972547],
       [0.07692308, 0.62791545],
       [0.02564103, 0.84664275]]), array([[0.69230769, 0.29374782],
       [0.69230769, 0.45012165],
       [0.69230769, 0.37906319]]), array([[0.02564103, 0.8839505 ],
       [0.02564103, 0.99999996],
       [0.        , 0.76300762]]), array([[0.07692308, 0.70128629],
       [0.07692308, 0.66972547],
       [0.07692308, 0.62791545]]), array([[0.12820513, 0.65494963],
       [0.12820513, 0.80161169],
       [0.07

## Set network parameters

TimeGAN network parameters should be optimized for different datasets.

- module: gru, lstm, or lstmLN
- hidden_dim: hidden dimensions
- num_layer: number of layers
- iterations: number of training iterations
- batch_size: the number of samples in each batch
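
Because a misspelled key or an unsupported module name may only surface once training starts, it can help to check the dictionary first. The sketch below is one possible check; `check_parameters` is a hypothetical helper that is not part of the TimeGAN code, and the key names are assumed to match those used in the code cell that builds `parameters`.

```python
ALLOWED_MODULES = {"gru", "lstm", "lstmLN"}
INT_KEYS = ("hidden_dim", "num_layer", "iterations", "batch_size")

def check_parameters(parameters):
    """Raise ValueError if the parameter dictionary looks malformed.

    Hypothetical helper for illustration; not part of the TimeGAN code.
    """
    if parameters.get("module") not in ALLOWED_MODULES:
        raise ValueError("module must be one of " + ", ".join(sorted(ALLOWED_MODULES)))
    for key in INT_KEYS:
        value = parameters.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(key + " must be a positive integer")
    return parameters

params = check_parameters({
    "module": "gru",
    "hidden_dim": 24,
    "num_layer": 3,
    "iterations": 5000,
    "batch_size": 6,
})
```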

In [None]:
## Network parameters
parameters = dict()

parameters['module'] = 'gru' 
parameters['hidden_dim'] = 24
parameters['num_layer'] = 3
parameters['iterations'] = 5000
parameters['batch_size'] = 6

## Run TimeGAN for synthetic time-series data generation

TimeGAN uses the original data and network parameters to return the generated synthetic data.

In [None]:
# Run TimeGAN
generated_data = timegan(ori_data, parameters)   
print('Finish Synthetic Data Generation')

Start Embedding Network Training
step: 0/5000, e_loss: 0.3605
step: 1000/5000, e_loss: 0.0788
step: 2000/5000, e_loss: 0.0414
step: 3000/5000, e_loss: 0.0303
step: 4000/5000, e_loss: 0.0227
Finish Embedding Network Training
Start Training with Supervised Loss Only
step: 0/5000, s_loss: 0.3179
step: 1000/5000, s_loss: 0.1762
step: 2000/5000, s_loss: 0.1197
step: 3000/5000, s_loss: 0.1152
step: 4000/5000, s_loss: 0.0987
Finish Training with Supervised Loss Only
Start Joint Training
step: 0/5000, d_loss: 2.0645, g_loss_u: 0.7093, g_loss_s: 0.1236, g_loss_v: 0.4818, e_loss_t0: 0.0378


In [None]:
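print(generated_data.shape)
# Flatten (samples, seq_len, features) into one row per time step with 2 feature columns
gen = generated_data.reshape((-1, 2))
# Column names must match the ones rescaled below
df = pd.DataFrame(gen, columns=['Values_1', 'Values_2'])
# Undo the min-max normalization: x = x_scaled * (max - min) + min
renorm = dat_max - dat_min
df['Values_1'] = (df['Values_1']*renorm[0])+dat_min[0]
df['Values_2'] = (df['Values_2']*renorm[1])+dat_min[1]
df.head(10)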


(24, 1, 2)


|   | Values_1 | Values_2 |
|---|----------|----------|
| 0 | 4.410329 | 0.927868 |
| 1 | 2.734207 | 1.104205 |
| 2 | 3.740629 | 1.005147 |
| 3 | 4.921349 | 0.843515 |
| 4 | 5.077711 | 0.81022 |
| 5 | 5.461843 | 0.709613 |
| 6 | 5.154601 | 0.792275 |
| 7 | 2.736253 | 1.103991 |
| 8 | 3.154627 | 1.062184 |
| 9 | 2.90147 | 1.087053 |
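
The rescaling in the cell above inverts min-max normalization, x = x_scaled * (max - min) + min. A minimal round-trip sketch on made-up numbers shows that the inverse recovers the original values; the array and its contents are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Made-up raw values: 3 time steps, 2 features.
raw = np.array([[2.0, 0.5], [4.0, 1.0], [6.0, 1.5]])
dat_min, dat_max = raw.min(axis=0), raw.max(axis=0)

# Forward: map each column to [0, 1].
scaled = (raw - dat_min) / (dat_max - dat_min)
# Inverse: map back to the original units.
restored = scaled * (dat_max - dat_min) + dat_min

print(np.allclose(restored, raw))
```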


In [None]:
df.to_csv('gentoydata2.csv')

## Evaluate the generated data

### 1. Discriminative score

This metric measures how well a post-hoc RNN classifier can distinguish original data from synthetic data. The reported score is |classification accuracy - 0.5|, so lower is better.

- metric_iteration: the number of iterations for metric computation.
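
To make the score definition concrete, the sketch below computes |accuracy - 0.5| from a handful of made-up labels. The function name and the label arrays are hypothetical; the real metric obtains its predictions from a trained RNN classifier rather than from fixed lists.

```python
import numpy as np

def discriminative_score_from_labels(y_true, y_pred):
    """Return |accuracy - 0.5| for binary real-vs-synthetic labels."""
    acc = np.mean(np.asarray(y_true) == np.asarray(y_pred))
    return float(abs(acc - 0.5))

# 1 = real, 0 = synthetic (made-up labels and predictions).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # 6 of 8 correct, accuracy 0.75
score = discriminative_score_from_labels(y_true, y_pred)
print(score)  # 0.25
```

A score of 0 means the classifier does no better than chance, while 0.5 means real and synthetic samples are perfectly separable.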

In [None]:
metric_iteration = 5

discriminative_score = list()
for _ in range(metric_iteration):
  temp_disc = discriminative_score_metrics(ori_data, generated_data)
  discriminative_score.append(temp_disc)

print('Discriminative score: ' + str(np.round(np.mean(discriminative_score), 4)))

### 2. Predictive score

This metric evaluates prediction performance in the train-on-synthetic, test-on-real (TSTR) setting. Specifically, a post-hoc RNN is trained on the synthetic data to predict one step ahead, and its mean absolute error (MAE) on the original data is reported.
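
The sketch below illustrates the one-step-ahead setup and the MAE computation on a tiny made-up sequence. A naive persistence forecast (predict that the next value equals the current one) stands in for the trained RNN, so the numbers show only how the error is measured, not how the predictor is built.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# One-step-ahead setup: inputs are steps 0..T-2, targets are steps 1..T-1.
seq = np.array([0.1, 0.2, 0.4, 0.7])  # made-up normalized values
inputs, targets = seq[:-1], seq[1:]

# Persistence forecast as a placeholder for the model's predictions.
preds = inputs
mae = mean_absolute_error(targets, preds)
print(round(mae, 4))
```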

In [None]:
predictive_score = list()
for _ in range(metric_iteration):
  temp_pred = predictive_score_metrics(ori_data, generated_data)
  predictive_score.append(temp_pred)
    
print('Predictive score: ' + str(np.round(np.mean(predictive_score), 4)))

### 3. Visualization

We visualize the original and synthetic data distributions using PCA and t-SNE.
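
For intuition about the PCA view, the sketch below fits principal components on placeholder "real" sequences and projects both sets onto the same two axes, so their point clouds can be compared in one plot. It uses plain NumPy and random data; `pca_project` is a hypothetical helper, and the repository's `visualization` function may reduce each sequence differently before projecting.

```python
import numpy as np

def pca_project(reference, other, n_components=2):
    """Fit PCA on `reference` rows and project both arrays onto it."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal directions of the centered reference data.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    components = vt[:n_components]
    return (reference - mean) @ components.T, (other - mean) @ components.T

rng = np.random.default_rng(0)
real = rng.random((50, 6, 3))   # placeholder real data: (samples, seq_len, features)
synth = rng.random((50, 6, 3))  # placeholder synthetic data, same shape

# Flatten each sequence into one row so that every sample becomes a point.
real_2d, synth_2d = pca_project(real.reshape(50, -1), synth.reshape(50, -1))
print(real_2d.shape, synth_2d.shape)
```

If the synthetic points overlap the real points in the shared projection, the generator has captured at least the dominant directions of variation in the data.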

In [None]:
visualization(ori_data, generated_data, 'pca')
visualization(ori_data, generated_data, 'tsne')