# TimeGAN 

## Time-series Generative Adversarial Networks

- Paper: Jinsung Yoon, Daniel Jarrett, Mihaela van der Schaar, "Time-series Generative Adversarial Networks," Neural Information Processing Systems (NeurIPS), 2019.

- Paper link: https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks

- Code author: Jinsung Yoon (jsyoon0823@gmail.com)

## Required packages and function calls

- timegan: Synthetic time-series data generation module
- data_loading: dataset loading and preprocessing (this notebook uses `patient_data_loading` and `sine_data_generation`)
- metrics:
    - discriminative_metrics: distinguish real data from synthetic data with a classifier
    - predictive_metrics: train on synthetic, test on real
    - visualization_metrics: PCA and t-SNE analyses

In [None]:
!pip install -r requirements.txt
!pip install openpyxl

In [None]:
## Necessary packages
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np 
import warnings
warnings.filterwarnings("ignore")

# 1. TimeGAN model
from timegan import timegan
# 2. Data loading
from data_loading import patient_data_loading, sine_data_generation
# 3. Metrics
from metrics.discriminative_metrics import discriminative_score_metrics
from metrics.predictive_metrics import predictive_score_metrics
from metrics.visualization_metrics import visualization

## Data Loading

Load the original dataset and preprocess it.

- data_name: patient_HighTrain or sine
- seq_len: sequence length of the time-series data

In [None]:
## Data loading
data_name = 'patient_HighTrain'
seq_len = 88

if data_name == 'patient_HighTrain':
  folder_path = 'data/patient_HighTrain'
  ori_data = patient_data_loading(folder_path, seq_len)
elif data_name == 'sine':
  # Set the number of samples and the feature dimension
  no, dim = 10000, 5
  ori_data = sine_data_generation(no, seq_len, dim)
else:
  # Fail early instead of hitting a NameError on ori_data below
  raise ValueError('Unknown data_name: ' + data_name)

print('Number of sequences:', len(ori_data))
print('Each sequence shape:', ori_data[0].shape)    
print(data_name + ' dataset is ready.')
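
To make the expected input format concrete, the sketch below builds a small sine dataset in the layout that the print statements above assume: a list of arrays, each of shape (seq_len, dim). The function name `make_sine_sequences`, the seed argument, and the frequency and phase ranges are illustrative assumptions, not the implementation in `data_loading`.

```python
import numpy as np

def make_sine_sequences(no, seq_len, dim, seed=0):
    """Hypothetical stand-in generator: returns `no` arrays of shape (seq_len, dim)."""
    rng = np.random.default_rng(seed)
    steps = np.arange(seq_len)[:, None]            # column vector, shape (seq_len, 1)
    data = []
    for _ in range(no):
        freq = rng.uniform(0.0, 0.1, size=dim)     # one frequency per feature (assumed range)
        phase = rng.uniform(0.0, 0.1, size=dim)    # one phase per feature (assumed range)
        seq = np.sin(freq * steps + phase)         # broadcasts to (seq_len, dim)
        data.append((seq + 1.0) * 0.5)             # rescale from [-1, 1] to [0, 1]
    return data

demo_data = make_sine_sequences(no=8, seq_len=24, dim=5)
print('Number of sequences:', len(demo_data))
print('Each sequence shape:', demo_data[0].shape)
```

Any loader that returns data in this layout should be usable in the cells that follow, provided every sequence has the same length and feature count.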

## Set network parameters

TimeGAN network parameters should be optimized for different datasets.

- module: gru, lstm, or lstmLN
- hidden_dim: hidden dimensions
- num_layer: number of layers
- iterations: number of training iterations
- batch_size: the number of samples in each batch

In [None]:
## Network parameters
parameters = dict()

parameters['module'] = 'gru' 
parameters['hidden_dim'] = 24
parameters['num_layer'] = 3
parameters['iterations'] = 5000
parameters['batch_size'] = 32
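
Because a misspelled key (for example `iteration` instead of `iterations`) would only surface once training starts, a quick check of the dictionary can save time. The helper below is a hypothetical sketch, not part of the `timegan` module; the set of required keys and allowed `module` values is taken from the list above.

```python
def check_timegan_parameters(parameters):
    """Hypothetical sanity check; raises ValueError on missing keys or invalid values."""
    required = ('module', 'hidden_dim', 'num_layer', 'iterations', 'batch_size')
    missing = [key for key in required if key not in parameters]
    if missing:
        raise ValueError('Missing parameters: ' + ', '.join(missing))
    if parameters['module'] not in ('gru', 'lstm', 'lstmLN'):
        raise ValueError('Unknown module: ' + str(parameters['module']))
    for key in ('hidden_dim', 'num_layer', 'iterations', 'batch_size'):
        if not isinstance(parameters[key], int) or parameters[key] <= 0:
            raise ValueError(key + ' must be a positive integer')
    return True

# Example values mirror the cell above; the variable name is illustrative.
example_parameters = {'module': 'gru', 'hidden_dim': 24, 'num_layer': 3,
                      'iterations': 5000, 'batch_size': 32}
print(check_timegan_parameters(example_parameters))
```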

## Run TimeGAN for synthetic time-series data generation

TimeGAN uses the original data and network parameters to return the generated synthetic data.

In [None]:
import time

# Start the timer
start_time = time.time()

In [None]:
# Run TimeGAN
generated_data = timegan(ori_data, parameters)   
print('Finish Synthetic Data Generation')

## Evaluate the generated data

### 1. Discriminative score

This score evaluates how well a post-hoc RNN classifier can distinguish original data from synthetic data. The output is |classification accuracy - 0.5|, so a lower value means the synthetic data is harder to tell apart from the original.

- metric_iteration: the number of iterations for metric computation.

In [None]:
metric_iteration = 5

discriminative_score = list()
for _ in range(metric_iteration):
  temp_disc = discriminative_score_metrics(ori_data, generated_data)
  discriminative_score.append(temp_disc)

print('Discriminative score: ' + str(np.round(np.mean(discriminative_score), 4)))
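
The definition |classification accuracy - 0.5| can be illustrated on toy numbers. The sketch below is not the code in `discriminative_metrics`: the function name, the 0.5 decision threshold, and the toy labels and probabilities are assumptions chosen only to show the arithmetic.

```python
import numpy as np

def discriminative_score_from_predictions(labels, probabilities, threshold=0.5):
    """Return |accuracy - 0.5| for binary real (1) vs. synthetic (0) predictions."""
    labels = np.asarray(labels)
    predictions = (np.asarray(probabilities) > threshold).astype(int)
    accuracy = np.mean(predictions == labels)
    return float(np.abs(accuracy - 0.5))

# Toy classifier outputs: 6 of the 8 samples are classified correctly.
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_probs = np.array([0.9, 0.6, 0.4, 0.7, 0.2, 0.6, 0.1, 0.3])
toy_score = discriminative_score_from_predictions(toy_labels, toy_probs)
print('Toy discriminative score:', toy_score)
```

An accuracy of 0.75 gives a score of 0.25, while a classifier that does no better than chance (accuracy 0.5) gives 0.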

### 2. Predictive score

This score evaluates prediction performance in the train-on-synthetic, test-on-real setting. More specifically, a post-hoc RNN is trained to predict one step ahead, and performance is reported as mean absolute error (MAE).

In [None]:
predictive_score = list()
for _ in range(metric_iteration):
  temp_pred = predictive_score_metrics(ori_data, generated_data)
  predictive_score.append(temp_pred)

print('Predictive score: ' + str(np.round(np.mean(predictive_score), 4)))
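
To show what a one-step-ahead MAE measures, the sketch below scores a simple persistence baseline (predict that the next value equals the current one) on toy sequences. It is not the post-hoc RNN in `predictive_metrics`: `one_step_mae`, `persistence`, and the toy data are hypothetical. In the train-on-synthetic, test-on-real setting, `predict_fn` would be a model fitted on the generated data and evaluated on the original data.

```python
import numpy as np

def one_step_mae(sequences, predict_fn):
    """Mean absolute error of one-step-ahead predictions, averaged over sequences."""
    errors = []
    for seq in sequences:
        seq = np.asarray(seq)
        inputs, targets = seq[:-1], seq[1:]        # step t is used to predict step t + 1
        errors.append(np.mean(np.abs(predict_fn(inputs) - targets)))
    return float(np.mean(errors))

def persistence(inputs):
    """Baseline predictor: the next value equals the current value."""
    return inputs

toy_sequences = [np.array([[0.0], [0.1], [0.3]]),
                 np.array([[1.0], [0.8], [0.8]])]
toy_mae = one_step_mae(toy_sequences, persistence)
print('Toy one-step-ahead MAE:', round(toy_mae, 4))
```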

### 3. Visualization

We visualize the distributions of the original and synthetic data using PCA and t-SNE.

In [None]:
visualization(ori_data, generated_data, 'pca')
visualization(ori_data, generated_data, 'tsne')
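
As a rough picture of what a PCA comparison involves, the sketch below fits two principal components on toy "original" data and projects both datasets onto them, using only NumPy. It makes two assumptions that may not match `visualization_metrics`: each sequence is reduced to a vector by averaging over the feature axis, and the components are fitted on the original data only. The helper names and random toy arrays are hypothetical.

```python
import numpy as np

def fit_pca(flat, n_components=2):
    """Return the mean and top principal directions of an (n, d) array."""
    mean = flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(flat, mean, components):
    """Project an (n, d) array onto previously fitted components."""
    return (flat - mean) @ components.T

rng = np.random.default_rng(0)
toy_ori = rng.random((20, 24, 5))       # (samples, seq_len, dim)
toy_gen = rng.random((20, 24, 5))

# Assumed preprocessing: average over features so each sequence is a length-24 vector.
flat_ori = toy_ori.mean(axis=2)
flat_gen = toy_gen.mean(axis=2)

mean, components = fit_pca(flat_ori)
ori_2d = project(flat_ori, mean, components)
gen_2d = project(flat_gen, mean, components)
print(ori_2d.shape, gen_2d.shape)
```

Plotting `ori_2d` and `gen_2d` in the same scatter plot shows whether the synthetic points cover the same region as the original ones.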

## Save the generated data, limiting the number of samples written to the folder

In [None]:
import os
import pandas as pd

# Folder containing the original Excel files
input_folder = r'C:\your_path\data\patient_HighTrain'

# Folder to save the generated results
save_folder = r'C:\your_path\HR_Train_107_TimeGAN_HighRisk'
os.makedirs(save_folder, exist_ok=True)

# Get a list of all .xlsx files
excel_files = [f for f in os.listdir(input_folder) if f.lower().endswith('.xlsx')]

# Limit to 39 generated samples instead of using all TimeGAN-generated data.
# Afterward, copy the original data to obtain HR × 39 samples.
samples_to_save = generated_data[:39]

# Note: the same generated samples are written once per input file name.
for file in excel_files:
    original_filename = os.path.splitext(file)[0]

    # Save each generated sample to a separate Excel file, including column headers
    for idx, sample in enumerate(samples_to_save, start=1):
        df = pd.DataFrame(
            sample,
            columns=['Time', 'Brachial Data', 'Carotid Diameter', 'blood velocity']
        )
        out_name = f"{original_filename}_{idx}.xlsx"
        out_path = os.path.join(save_folder, out_name)
        df.to_excel(out_path, index=False)

    print(
        f"{len(samples_to_save)} files have been saved from "
        f"`{original_filename}` to `{save_folder}`"
    )


In [None]:
end_time = time.time()
print(f"\nTotal time taken for the entire process: {end_time - start_time:.2f} seconds")