# Synthetic Data Vault (SDV)
This notebook uses the Synthetic Data Vault (SDV) to create sequential synthetic data. It builds on the [tutorial from SDV](https://colab.research.google.com/drive/1cT4-jFK2Bxe93QudC_CwHq_yVCcNcxal?usp=sharing) and on the notebook on DataSynthesizer, both of which inspired the approach taken here.

In [1]:
# import relevant modules
import sdv
import pandas as pd
import numpy as np
import os

# Load data

In [2]:
data = pd.read_csv('CMAPSS/train_FD001.txt', sep=" ", header=None)

# drop last two columns with N/A values
data = data.iloc[:, :-2]

# rename columns according to readme.txt
col_names = ["unit-nr", "timecycle", "ops-set1", "ops-set2", "ops-set3"]
for i in range(1,22):
    col_names.append(f"sens-{i}")
data.columns = col_names
data.to_csv('CMAPSS/train_FD001_pre.csv', index=False)

data_length = len(data)

# display data
# data
# data.columns
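
The two all-N/A columns dropped above most likely come from trailing spaces at the end of each line: with `sep=" "`, every single space counts as a delimiter, so two trailing spaces produce two empty fields. The sketch below reproduces this on a small in-memory sample. The sample text is made up, and the trailing-space explanation is an assumption about the raw file rather than something checked here.

```python
import io

import pandas as pd

# Hypothetical two-line sample that mimics a file whose lines end in two spaces.
sample = "1 1 -0.0007 518.67  \n1 2 0.0019 518.67  \n"

# A single-space separator treats each trailing space as a delimiter,
# so two empty (NaN) columns appear at the end.
raw = pd.read_csv(io.StringIO(sample), sep=" ", header=None)

# A whitespace regex collapses runs of spaces and ignores the trailing ones.
clean = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(raw.shape)
print(clean.shape)
```

If the raw file behaves like this sample, reading it with `sep=r"\s+"` would make the `iloc[:, :-2]` step unnecessary; keeping the original approach is equally valid as long as the number of dropped columns is checked.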

# Create synthetic data

In [3]:
# Create metadata
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()

# load data
metadata.detect_from_dataframe(data=data)

# set unit-nr as sequence key, timecycle as sequence index
metadata.update_column(
    column_name='unit-nr',
    sdtype='id')
metadata.set_sequence_key('unit-nr')
metadata.set_sequence_index('timecycle')

# show metadata
# metadata

In [4]:
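"""
No context_columns are passed in this example.
A context column holds a value that stays fixed within a sequence
but differs between sequences, and no column in this data does that:
a few columns (e.g. ops-set3, sens-1) are constant over the whole
dataset, and the others change from one time cycle to the next.
"""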
from sdv.sequential import PARSynthesizer

# create synthesizer
synthesizer = PARSynthesizer(
    metadata,
    epochs=100,
    verbose=True
    )

synthesizer.fit(data)

Epoch 100 | Loss -0.44507795572280884: 100%|██| 100/100 [02:05<00:00,  1.25s/it]
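
Whether a dataset has context columns can be checked rather than assumed. The sketch below counts distinct values per sequence and keeps only columns that are fixed within every sequence yet differ between sequences. The toy frame and the `site` column are hypothetical and only serve to illustrate the check.

```python
import pandas as pd

# Toy frame (hypothetical values): "site" is fixed per unit, "sens-2" varies
# within a unit, and "ops-set3" is the same everywhere.
toy = pd.DataFrame({
    "unit-nr":   [1, 1, 1, 2, 2],
    "timecycle": [1, 2, 3, 1, 2],
    "site":      ["A", "A", "A", "B", "B"],
    "ops-set3":  [100.0, 100.0, 100.0, 100.0, 100.0],
    "sens-2":    [641.8, 642.1, 642.4, 642.0, 642.3],
})

value_cols = [col for col in toy.columns if col != "unit-nr"]

# Number of distinct values each column takes within each sequence.
per_unit = toy.groupby("unit-nr")[value_cols].nunique()

# True for columns that are constant within every sequence.
fixed_within = per_unit.eq(1).all()

# True for columns that take more than one value over the whole frame.
varies_overall = toy[value_cols].nunique().gt(1)

# Only columns that satisfy both conditions carry per-sequence information.
context_candidates = [
    col for col in value_cols if fixed_within[col] and varies_overall[col]
]
print(context_candidates)
```

Any column found this way could be passed to `PARSynthesizer` through its `context_columns` argument. Globally constant columns pass the first check but not the second, which is why they are filtered out.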


In [5]:
# count the unique unit-nrs
n_unitnrs = data['unit-nr'].nunique()

# sample as many synthetic sequences as there are units in the original data
synthetic_data = synthesizer.sample(num_sequences=n_unitnrs)
synthetic_data

# export SyntData to folder
synthetic_data.to_csv('./CMAPSS/Synthetic/SDV_FD001.csv')

100%|█████████████████████████████████████████| 100/100 [01:32<00:00,  1.08it/s]


# Analysis
We have now created our first synthetic dataset. Next, we analyse it by comparing its summary statistics and distributions with those of the original data.

In [6]:
data_description = data.describe()
data_description

statistic,unit-nr,timecycle,ops-set1,ops-set2,ops-set3,sens-1,sens-2,sens-3,sens-4,sens-5,...,sens-12,sens-13,sens-14,sens-15,sens-16,sens-17,sens-18,sens-19,sens-20,sens-21
count,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,...,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0
mean,51.506568,108.807862,-9e-06,2e-06,100.0,518.67,642.680934,1590.523119,1408.933782,14.62,...,521.41347,2388.096152,8143.752722,8.442146,0.03,393.210654,2388.0,100.0,38.816271,23.289705
std,29.227633,68.88099,0.002187,0.000293,0.0,0.0,0.500053,6.13115,9.000605,1.7764e-15,...,0.737553,0.071919,19.076176,0.037505,1.3878120000000003e-17,1.548763,0.0,0.0,0.180746,0.108251
min,1.0,1.0,-0.0087,-0.0006,100.0,518.67,641.21,1571.04,1382.25,14.62,...,518.69,2387.88,8099.94,8.3249,0.03,388.0,2388.0,100.0,38.14,22.8942
25%,26.0,52.0,-0.0015,-0.0002,100.0,518.67,642.325,1586.26,1402.36,14.62,...,520.96,2388.04,8133.245,8.4149,0.03,392.0,2388.0,100.0,38.7,23.2218
50%,52.0,104.0,0.0,0.0,100.0,518.67,642.64,1590.1,1408.04,14.62,...,521.48,2388.09,8140.54,8.4389,0.03,393.0,2388.0,100.0,38.83,23.2979
75%,77.0,156.0,0.0015,0.0003,100.0,518.67,643.0,1594.38,1414.555,14.62,...,521.95,2388.14,8148.31,8.4656,0.03,394.0,2388.0,100.0,38.95,23.3668
max,100.0,362.0,0.0087,0.0006,100.0,518.67,644.53,1616.91,1441.49,14.62,...,523.38,2388.56,8293.72,8.5848,0.03,400.0,2388.0,100.0,39.43,23.6184


In [7]:
syn_description = synthetic_data.describe()
syn_description

statistic,unit-nr,timecycle,ops-set1,ops-set2,ops-set3,sens-1,sens-2,sens-3,sens-4,sens-5,...,sens-12,sens-13,sens-14,sens-15,sens-16,sens-17,sens-18,sens-19,sens-20,sens-21
count,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,...,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0,23687.0
mean,48598.832693,133.089374,-5.6e-05,5e-06,100.0,518.67,642.733602,1591.36925,1410.287548,14.62,...,521.333383,2388.103602,8145.846806,8.447381,0.03,393.395449,2388.0,100.0,38.79179,23.277042
std,26249.294173,89.670899,0.001056,0.000152,0.0,0.0,0.273815,3.243974,5.392316,3.552789e-15,...,0.429735,0.037836,13.00718,0.020914,1.7347600000000002e-17,0.91814,0.0,0.0,0.103855,0.060912
min,476.0,1.0,-0.004213,-0.0006,100.0,518.67,641.706398,1578.471903,1390.592358,14.62,...,519.617555,2387.959022,8099.94,8.355087,0.03,390.0,2388.0,100.0,38.397895,23.021687
25%,28574.0,60.0,-0.000738,-9.4e-05,100.0,518.67,642.561008,1589.314865,1406.742171,14.62,...,521.04576,2388.079819,8137.454104,8.433732,0.03,393.0,2388.0,100.0,38.722322,23.236924
50%,48479.0,119.0,-9e-06,2e-06,100.0,518.67,642.714209,1591.186161,1410.121169,14.62,...,521.350893,2388.101983,8145.204948,8.446768,0.03,393.0,2388.0,100.0,38.794869,23.280143
75%,68964.0,194.0,0.000618,0.000104,100.0,518.67,642.914956,1593.532483,1413.907929,14.62,...,521.611186,2388.128643,8154.564755,8.461495,0.03,394.0,2388.0,100.0,38.859839,23.316176
max,99711.0,362.0,0.004178,0.0006,100.0,518.67,643.810067,1606.807363,1432.680089,14.62,...,522.969677,2388.277698,8207.24846,8.53634,0.03,397.0,2388.0,100.0,39.207847,23.512199
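
The row counts of the two tables differ (20631 real rows versus 23687 synthetic rows) even though the same number of sequences was requested, so the sequences themselves must differ in length. A minimal sketch for comparing sequence lengths follows. The helper name `sequence_lengths` and the toy frames are hypothetical.

```python
import pandas as pd


def sequence_lengths(df: pd.DataFrame, key: str = "unit-nr") -> pd.Series:
    """Return the number of rows (time steps) in each sequence."""
    return df.groupby(key).size()


# Toy stand-ins for the real and synthetic frames (hypothetical values).
toy_real = pd.DataFrame({"unit-nr": [1, 1, 1, 2, 2]})
toy_syn = pd.DataFrame({"unit-nr": [10, 10, 10, 10, 20, 20, 20]})

# Summary statistics of the sequence lengths, side by side.
summary = pd.DataFrame({
    "real": sequence_lengths(toy_real).describe(),
    "synthetic": sequence_lengths(toy_syn).describe(),
})
print(summary.loc[["count", "mean", "min", "max"]])
```

In the notebook, the same helper could be applied to `data` and `synthetic_data` directly.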


What stands out immediately is that the synthetic unit-nrs do not run from 1 to 100. This is expected: unit-nr is declared as an `id` column in the metadata, so SDV generates new identifier values instead of reproducing the original ones ([source](https://colab.research.google.com/drive/1YLk2uwn8yrSRPy0soEeJwu8Hdk_tGTlE#scrollTo=IjvfTOMVKQot)).
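
If a downstream step expects consecutive unit numbers starting at 1, as in the original file, the generated keys can be relabelled after sampling. The sketch below does this with `pd.factorize` on a toy frame. The frame and the name `relabelled` are hypothetical, and the relabelling is an optional post-processing step, not something SDV does itself.

```python
import pandas as pd

# Toy stand-in for sampled output: sequence keys are arbitrary integers.
synthetic_toy = pd.DataFrame({
    "unit-nr":   [48598, 48598, 476, 476, 476, 99711],
    "timecycle": [1, 2, 1, 2, 3, 1],
})

# factorize assigns codes 0, 1, 2, ... in order of first appearance;
# adding 1 gives consecutive unit numbers starting at 1.
codes, original_keys = pd.factorize(synthetic_toy["unit-nr"])
relabelled = synthetic_toy.assign(**{"unit-nr": codes + 1})

print(relabelled["unit-nr"].tolist())  # [1, 1, 2, 2, 2, 3]
```

`original_keys` keeps the generated identifiers in the same order, so the mapping can be reversed if needed.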

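The two descriptions look broadly similar: most means are close to those of the original data. The standard deviations of the sensor and operational-setting columns, however, are noticeably smaller in the synthetic data (for example, 0.27 versus 0.50 for sens-2), so the synthetic values are less spread out. It would therefore be interesting to test the SDV data on an ML task.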
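
The comparison can be made less dependent on reading two tables by computing a per-column ratio of standard deviations. The sketch below shows one way to do that. The helper `std_ratio` and the toy frames are hypothetical, and a ratio below 1 means the synthetic column is less spread out than the real one.

```python
import pandas as pd


def std_ratio(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.Series:
    """Synthetic-to-real ratio of the standard deviation, per shared numeric column."""
    cols = real.select_dtypes("number").columns.intersection(synthetic.columns)
    return synthetic[cols].std() / real[cols].std()


# Toy frames (hypothetical values): column "a" is half as spread out in the
# synthetic frame, column "b" has the same spread in both.
real_toy = pd.DataFrame({"a": [0.0, 2.0, 4.0], "b": [1.0, 2.0, 3.0]})
syn_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 6.0, 7.0]})

ratio = std_ratio(real_toy, syn_toy)
print(ratio)
```

Columns that are constant in both frames give 0 / 0, which pandas reports as NaN, so those entries can be ignored when reading the result.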

## SDV evaluation

In [8]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    data,
    synthetic_data,
    metadata
)

# with 128 epochs:
"""Overall Quality Score: 81.07%

Properties:
- Column Shapes: 82.68%
- Column Pair Trends: 79.45%
"""

Generating report ...
(1/2) Evaluating Column Shapes: : 100%|████████| 26/26 [00:00<00:00, 103.28it/s]
(2/2) Evaluating Column Pair Trends: : 100%|█| 325/325 [00:00<00:00, 453.22it/s]

Overall Quality Score: 81.28%

Properties:
- Column Shapes: 83.11%
- Column Pair Trends: 79.45%



# Testing properties