# Generating Synthetic Data using PARSynthesizer

## Introduction
PARSynthesizer is a model from the Synthetic Data Vault (SDV) library that generates synthetic time series data from multi-sequence data.
It trains a deep learning model on the real sequences and then generates brand-new entities, each with its own brand-new sequence.

## Installation

Install the SDV library.

In [1]:
%pip install sdv

Collecting sdv
  Downloading sdv-1.17.1-py3-none-any.whl.metadata (13 kB)

## Data Preparation

### Import Modules

In [2]:
# Import relevant modules
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load Dataset

In [4]:
# Load the CSV file from the mounted Google Drive folder
real_data = pd.read_csv('./drive/MyDrive/DVAE11/Real_data.csv')

In [5]:
real_data.head()

Unnamed: 0,SID,timestamp,harsh_event,acc_x,acc_y,acc_z,gyro_x,gyro_y,gyro_z,mag_x,mag_y,mag_z,event_class
0,1,22-3-2023 20:0:4:187775,safe,-0.465592,-0.934335,0.47318,-0.058358,5e-05,0.003445,0.375,0.666667,0.375,0
1,1,22-3-2023 20:0:4:584860,safe,-0.385455,-1.33741,0.135064,-0.072915,-0.003698,0.000831,0.375,0.666667,0.375,0
2,1,22-3-2023 20:0:4:586927,safe,-0.083945,-1.622691,-1.184497,-0.094686,-0.00455,-0.001229,0.25,1.0,0.499999,0
3,1,22-3-2023 20:0:4:594926,safe,-0.241811,-1.258516,-0.890717,-0.076667,0.002451,-0.005478,0.25,1.0,0.499999,0
4,1,22-3-2023 20:0:4:602926,safe,-0.682277,-0.546198,0.420734,-0.047071,0.002969,0.002524,0.124999,0.888889,0.375,0
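
The `timestamp` values in the preview use unpadded fields and a colon before the fractional part (for example `22-3-2023 20:0:4:187775`), so they need an explicit format string to be parsed. A minimal sketch, assuming day-month-year order (the day value 22 makes the sample rows unambiguous) and that the last field is microseconds; the helper name `parse_timestamp` is hypothetical:

```python
from datetime import datetime

# Assumed layout: day-month-year hour:minute:second:microsecond
TIMESTAMP_FORMAT = "%d-%m-%Y %H:%M:%S:%f"

def parse_timestamp(text):
    """Parse one timestamp string written in the format shown in the preview."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)

# The first two timestamps from the preview above
first = parse_timestamp("22-3-2023 20:0:4:187775")
second = parse_timestamp("22-3-2023 20:0:4:584860")

print(first.isoformat())
print((second - first).total_seconds())  # gap between consecutive rows, in seconds
```

The gap between rows is not constant in the preview, which is one reason an explicit sequence index can matter for this dataset.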


### Train Test Split

In [6]:
x = real_data.iloc[:, :-1]
y = real_data.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=real_data['SID'], random_state=0)

In [7]:
train_real_dataset = pd.concat([x_train, y_train], axis=1)
test_real_dataset = pd.concat([x_test, y_test], axis=1)

In [8]:
train_real_dataset.to_csv('./train_real_dataset.csv', index=False)
test_real_dataset.to_csv('./test_real_dataset.csv', index=False)

In [None]:
count = (train_real_dataset['SID'] == 9654).sum()
print(count)

40
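
The count above covers a single sequence. The same idea can be applied to every `SID` at once to see how many rows each sequence kept after the split. A minimal sketch in plain Python so that it runs without the dataset; the toy rows and the helper name `rows_per_sequence` are hypothetical:

```python
from collections import Counter

def rows_per_sequence(rows, sequence_key="SID"):
    """Count how many rows belong to each sequence ID."""
    return Counter(row[sequence_key] for row in rows)

# Toy stand-in for the training split (not the real data)
toy_train = [
    {"SID": 1, "acc_x": -0.47},
    {"SID": 1, "acc_x": -0.39},
    {"SID": 2, "acc_x": 0.12},
]

counts = rows_per_sequence(toy_train)
print(dict(counts))
print(min(counts.values()))  # length of the shortest sequence in the split
```

On the real DataFrame, `train_real_dataset['SID'].value_counts()` gives the same per-sequence counts.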


### Metadata

Before creating synthetic data, the inputs must be in the right format:

- **Data**, a pandas DataFrame object containing the actual data. (SDV's multi-table synthesizers instead take a dictionary that maps every table name to a DataFrame; PARSynthesizer is fitted on a single table.)

- **Metadata**, a Metadata object that describes your table. It includes the data type of each column, keys, and other identifiers.

#### Load Metadata from JSON
`Metadata.load_from_json` reads a previously saved metadata description from a JSON file.

In [9]:
from sdv.metadata import Metadata
metadata = Metadata.load_from_json(filepath='./drive/MyDrive/DVAE11/metadata_v1.json')
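
The contents of `metadata_v1.json` are not shown in this notebook. As a rough illustration only, a metadata file for one sequential table might resemble the dictionary below. The table name `driving_data`, the subset of columns, the chosen sdtypes, and the overall layout are assumptions and should be checked against a file written by SDV itself (for example with `metadata.save_to_json`):

```python
import json
import os
import tempfile

# Hypothetical metadata for illustration; NOT the contents of metadata_v1.json
example_metadata = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "driving_data": {  # assumed table name
            "columns": {
                "SID": {"sdtype": "id"},
                "timestamp": {
                    "sdtype": "datetime",
                    "datetime_format": "%d-%m-%Y %H:%M:%S:%f",  # assumed format
                },
                "harsh_event": {"sdtype": "categorical"},
                "acc_x": {"sdtype": "numerical"},
            },
            "sequence_key": "SID",
            "sequence_index": "timestamp",
        }
    },
    "relationships": [],
}

# Round-trip through a JSON file to show the on-disk form
path = os.path.join(tempfile.mkdtemp(), "example_metadata.json")
with open(path, "w") as handle:
    json.dump(example_metadata, handle, indent=4)

with open(path) as handle:
    loaded = json.load(handle)

table = loaded["tables"]["driving_data"]
print(table["sequence_key"], table["sequence_index"])
```

Inspecting the JSON this way is a quick check that the sequence key and sequence index are actually declared before training.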

### Adding multi-sequence information
- **Sequence key** is an identifier for each sequence and is usually an ID column. *If you don't supply a sequence key, the SDV assumes that your table only contains a single sequence.*

- **Sequence index** determines the spacing between the rows in a sequence. Use this if you have an explicit index such as a timestamp. If you don't supply a sequence index, the SDV assumes there is equal spacing of an unknown unit.

- **Context column** is a column whose value never varies within a sequence. There is no need to declare the sequence key as a context column: it is already guaranteed not to vary within a sequence, because it defines what a sequence is.
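
Because a context column must stay constant within each sequence, it can be worth checking a candidate column before declaring it. A minimal sketch in plain Python; the helper name `sequences_with_varying_context`, the toy rows, and the `harsh` label are hypothetical (the preview only shows the value `safe`):

```python
from collections import defaultdict

def sequences_with_varying_context(rows, sequence_key, context_column):
    """Return the sequence IDs whose context value changes within the sequence."""
    values_seen = defaultdict(set)
    for row in rows:
        values_seen[row[sequence_key]].add(row[context_column])
    return sorted(sid for sid, values in values_seen.items() if len(values) > 1)

toy_rows = [
    {"SID": 1, "harsh_event": "safe"},
    {"SID": 1, "harsh_event": "safe"},
    {"SID": 2, "harsh_event": "safe"},
    {"SID": 2, "harsh_event": "harsh"},  # hypothetical label; varies within SID 2
]

violations = sequences_with_varying_context(toy_rows, "SID", "harsh_event")
print(violations)
```

An empty list means the column is constant within every sequence and so qualifies as a context column; any IDs returned point to sequences that break the rule.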

## Modeling

In [12]:
from sdv.sequential import PARSynthesizer

# Step 1: Create the synthesizer
synthesizer = PARSynthesizer(
    metadata,
    enforce_min_max_values=True,      # keep synthetic values within the min/max observed in the real data
    enforce_rounding=False,           # do not round synthetic values to the real data's number of decimal digits
    locales=['en_US'],                # any PII columns are generated for the locales provided here
    context_columns=['harsh_event'],  # columns whose values do not vary within a sequence
    epochs=5,                         # number of training epochs for the underlying neural network
    cuda=True,                        # use a GPU through CUDA if one is available
    verbose=True,                     # print the progress of each epoch
)

PARSynthesizer also accepts two optional parameters that are not set above:

- **sample_size**: The number of candidate samples to draw before choosing and returning the one that maximizes the likelihood. Defaults to 1.

- **segment_size**: Cuts each training sequence into segments of this many data points. For example, with `segment_size=10` each segment contains 10 data points. Defaults to `None`, which means the sequences are not cut into segments.
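
The effect of `segment_size` can be pictured as slicing each sequence into consecutive chunks. The sketch below is an illustration of that idea only, not SDV's implementation; the helper name `cut_into_segments` is hypothetical, and keeping a shorter final remainder is an assumption, since how the library treats it is not stated here:

```python
def cut_into_segments(sequence, segment_size=None):
    """Split one sequence into consecutive chunks of segment_size rows."""
    if segment_size is None:
        return [sequence]  # no segmentation: the whole sequence is one segment
    if segment_size <= 0:
        raise ValueError("segment_size must be a positive integer")
    return [
        sequence[start:start + segment_size]
        for start in range(0, len(sequence), segment_size)
    ]

sequence = list(range(25))  # stand-in for 25 time steps of one sequence
segments = cut_into_segments(sequence, segment_size=10)
print([len(segment) for segment in segments])
```

Shorter segments give the model more, smaller training examples, at the cost of hiding patterns that span more rows than one segment.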

### Train Model

In [13]:
# Train the synthesizer on the training split
synthesizer.fit(train_real_dataset)

Loss (0.003):  20%|██        | 1/5 [1:31:39<6:06:37, 5499.37s/it]


KeyboardInterrupt: 

In [None]:
synthesizer.save(filepath='./drive/MyDrive/DVAE11/5EP_Synth.pkl')

### Load Synthesizer

In [None]:
from sdv.sequential import PARSynthesizer

synthesizer = PARSynthesizer.load(
    filepath='./drive/MyDrive/DVAE11/40EP/40EP_Synth_80Train.pkl'
)

### Sample Synthetic Data

In [None]:
# Generate synthetic data
synthetic_data = synthesizer.sample(num_sequences=30859, sequence_length=10)
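
With a fixed `sequence_length`, the sampled table should contain `num_sequences * sequence_length` rows. A minimal sketch of that arithmetic and of a per-sequence length check, in plain Python; the helper names and the toy rows are hypothetical, and the check assumes each `SID` value in the output identifies exactly one sequence:

```python
from collections import Counter

def expected_row_count(num_sequences, sequence_length):
    """Total rows when every sequence has the same fixed length."""
    return num_sequences * sequence_length

def all_sequences_have_length(rows, sequence_key, sequence_length):
    """Check that every sequence ID appears exactly sequence_length times."""
    counts = Counter(row[sequence_key] for row in rows)
    return all(count == sequence_length for count in counts.values())

# Values taken from the sample() call above
print(expected_row_count(30859, 10))

# Toy stand-in for sampled output: 3 sequences of 10 rows each
toy_sample = [{"SID": sid} for sid in range(3) for _ in range(10)]
print(all_sequences_have_length(toy_sample, "SID", 10))
```

On the real output, comparing `len(synthetic_data)` with the expected row count is a quick sanity check before saving.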

In [None]:
# Save the synthetic data to a CSV file
synthetic_data.to_csv('./drive/MyDrive/DVAE11/Synth.csv', index=False)