# Synthetic Data Vault (SDV)
In this notebook, we use Synthetic Data Vault (SDV) to create sequential synthetic data. The notebook builds on the [tutorial from SDV](https://colab.research.google.com/drive/1cT4-jFK2Bxe93QudC_CwHq_yVCcNcxal?usp=sharing) and on the notebook on DataSynthesizer.

Both served as inspiration for creating synthetic data with SDV in this manner.

In [55]:
# import relevant modules
import sdv
import pandas as pd
import numpy as np
import os

# Load data

In [56]:
data = pd.read_csv('CMAPSS/train_FD001.txt', sep=" ", header=None)

# drop last two columns with N/A values
data = data.iloc[:, :-2]

# rename columns according to readme.txt
col_names = ["unit-nr", "timecycle", "ops-set1", "ops-set2", "ops-set3"]
for i in range(1,22):
    col_names.append(f"sens-{i}")
data.columns = col_names
data.to_csv('CMAPSS/train_FD001_pre.csv', index=False)

data_length = len(data)

# display data
# data
# data.columns
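
The two dropped columns are most likely a parsing artifact. With `sep=" "`, every single space counts as a delimiter, so if each line of the raw file ends in two trailing spaces (an assumption about the file layout, not something checked above), two empty fields appear at the end of every row. A minimal sketch on a made-up two-line sample:

```python
import io
import pandas as pd

# Hypothetical two-row sample; each line ends with two trailing spaces
# (assumed to mimic the layout of train_FD001.txt).
sample = "1 1 -0.0007  \n1 2 0.0019  \n"

raw = pd.read_csv(io.StringIO(sample), sep=" ", header=None)
n_raw_columns = raw.shape[1]  # three real fields plus the empty trailing ones

# the empty trailing fields are parsed as columns that contain only NaN
trailing_all_nan = bool(raw.iloc[:, -2:].isna().all().all())

# dropping the last two columns leaves only the real fields
cleaned = raw.iloc[:, :-2]
print(n_raw_columns, trailing_all_nan, cleaned.shape)
```

Checking `raw.iloc[:, -2:].isna().all()` before dropping is a cheap guard against silently removing real sensor columns if the file layout ever differs.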

# Create synthetic data

In [76]:
# Create metadata
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()

# detect column types from the dataframe
metadata.detect_from_dataframe(data=data)

# set unit-nr as sequence key, timecycle as sequence index
metadata.update_column(
    column_name='unit-nr',
    sdtype='id')
metadata.set_sequence_key('unit-nr')
metadata.set_sequence_index('timecycle')

# show metadata
# metadata

{
    "sequence_index": "timecycle",
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
    "sequence_key": "unit-nr",
    "columns": {
        "unit-nr": {
            "sdtype": "id"
        },
        "timecycle": {
            "sdtype": "numerical"
        },
        "ops-set1": {
            "sdtype": "numerical"
        },
        "ops-set2": {
            "sdtype": "numerical"
        },
        "ops-set3": {
            "sdtype": "numerical"
        },
        "sens-1": {
            "sdtype": "numerical"
        },
        "sens-2": {
            "sdtype": "numerical"
        },
        "sens-3": {
            "sdtype": "numerical"
        },
        "sens-4": {
            "sdtype": "numerical"
        },
        "sens-5": {
            "sdtype": "numerical"
        },
        "sens-6": {
            "sdtype": "numerical"
        },
        "sens-7": {
            "sdtype": "numerical"
        },
        "sens-8": {
            "sdtype": "numerical"
        },
        "sens-9": {
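
Before relying on a sequence key and sequence index, it can help to confirm that the table really has that structure: each (key, index) pair should occur once, and the index should not decrease within a sequence. The helper below is a hypothetical sanity check written with plain pandas on toy values; it is not part of SDV.

```python
import pandas as pd

def check_sequence_structure(df, sequence_key, sequence_index):
    """Return (no_duplicate_steps, sorted_within_sequence) for a long-format table."""
    # each (key, index) pair should identify exactly one row
    no_duplicate_steps = not df.duplicated(subset=[sequence_key, sequence_index]).any()
    # the index should be non-decreasing inside every sequence
    sorted_within_sequence = bool(
        df.groupby(sequence_key)[sequence_index]
        .apply(lambda s: s.is_monotonic_increasing)
        .all()
    )
    return bool(no_duplicate_steps), sorted_within_sequence

# toy data with the same column names as the notebook (values are made up)
toy = pd.DataFrame({
    "unit-nr": [1, 1, 1, 2, 2],
    "timecycle": [1, 2, 3, 1, 2],
    "sens-2": [641.8, 642.1, 642.3, 641.9, 642.0],
})
result = check_sequence_structure(toy, "unit-nr", "timecycle")

# the same table with a repeated time step in unit 1
toy_bad = toy.assign(timecycle=[1, 1, 3, 1, 2])
result_bad = check_sequence_structure(toy_bad, "unit-nr", "timecycle")
print(result, result_bad)
```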

In [68]:
"""
In this example, we do not pass context_columns.
Context columns are columns whose value stays fixed
within a sequence; here, every column is modeled as
varying over time.
"""
from sdv.sequential import PARSynthesizer

# create synthesizer
synthesizer = PARSynthesizer(
    metadata,
    epochs=100,
    verbose=True
    )

synthesizer.fit(data)

Epoch 100 | Loss -0.3859042823314667: 100%|█████████████████████████████████████████████████████████████████████████████████████| 100/100 [02:00<00:00,  1.20s/it]
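
Whether a column qualifies as a context column can be checked directly: it has to take a single value inside every sequence. The summary statistics further down report a standard deviation of 0 for some columns (for example `ops-set3`), so such a check may be worth running. The helper name below is hypothetical and the data is a toy example; whether to pass the resulting columns as `context_columns` remains a modeling decision.

```python
import pandas as pd

def constant_within_sequence(df, sequence_key):
    """List the columns that take a single value inside every sequence."""
    per_sequence_unique = df.groupby(sequence_key).nunique()
    is_constant = (per_sequence_unique <= 1).all(axis=0)
    # skip the key itself in case the pandas version keeps it in the result
    return [
        col for col in per_sequence_unique.columns
        if col != sequence_key and is_constant[col]
    ]

# toy data with the notebook's column names (values are made up)
toy = pd.DataFrame({
    "unit-nr": [1, 1, 2, 2],
    "timecycle": [1, 2, 1, 2],
    "ops-set3": [100.0, 100.0, 100.0, 100.0],
    "sens-2": [641.8, 642.1, 641.9, 642.0],
})
candidates = constant_within_sequence(toy, "unit-nr")
print(candidates)
```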


In [79]:
# count the distinct unit-nrs
n_unitnrs = data['unit-nr'].nunique()

# create synthetic data with the same number of sequences
synthetic_data = synthesizer.sample(num_sequences=n_unitnrs)

# export the synthetic data to the output folder
synthetic_data.to_csv('./CMAPSS/Synthetic/SDV_FD001.csv', index=False)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [01:14<00:00,  1.34it/s]


# Analysis
Cool! We've created our first synthetic data. Next, we want to analyse it. More specifically, we want to compare the summary statistics and distributions of the synthetic data with those of the original data.

In [66]:
data_description = data.describe()
data_description

statistic,unit-nr,timecycle,ops-set1,ops-set2,ops-set3,sens-1,sens-2,sens-3,sens-4,sens-5,...,sens-12,sens-13,sens-14,sens-15,sens-16,sens-17,sens-18,sens-19,sens-20,sens-21
count,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,...,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0
mean,51.506568,108.807862,-9e-06,2e-06,100.0,518.67,642.680934,1590.523119,1408.933782,14.62,...,521.41347,2388.096152,8143.752722,8.442146,0.03,393.210654,2388.0,100.0,38.816271,23.289705
std,29.227633,68.88099,0.002187,0.000293,0.0,0.0,0.500053,6.13115,9.000605,1.7764e-15,...,0.737553,0.071919,19.076176,0.037505,1.3878120000000003e-17,1.548763,0.0,0.0,0.180746,0.108251
min,1.0,1.0,-0.0087,-0.0006,100.0,518.67,641.21,1571.04,1382.25,14.62,...,518.69,2387.88,8099.94,8.3249,0.03,388.0,2388.0,100.0,38.14,22.8942
25%,26.0,52.0,-0.0015,-0.0002,100.0,518.67,642.325,1586.26,1402.36,14.62,...,520.96,2388.04,8133.245,8.4149,0.03,392.0,2388.0,100.0,38.7,23.2218
50%,52.0,104.0,0.0,0.0,100.0,518.67,642.64,1590.1,1408.04,14.62,...,521.48,2388.09,8140.54,8.4389,0.03,393.0,2388.0,100.0,38.83,23.2979
75%,77.0,156.0,0.0015,0.0003,100.0,518.67,643.0,1594.38,1414.555,14.62,...,521.95,2388.14,8148.31,8.4656,0.03,394.0,2388.0,100.0,38.95,23.3668
max,100.0,362.0,0.0087,0.0006,100.0,518.67,644.53,1616.91,1441.49,14.62,...,523.38,2388.56,8293.72,8.5848,0.03,400.0,2388.0,100.0,39.43,23.6184


In [70]:
syn_description = synthetic_data.describe()
syn_description

statistic,unit-nr,timecycle,ops-set1,ops-set2,ops-set3,sens-1,sens-2,sens-3,sens-4,sens-5,...,sens-12,sens-13,sens-14,sens-15,sens-16,sens-17,sens-18,sens-19,sens-20,sens-21
count,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,...,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0,19784.0
mean,48366.006723,107.840275,-1e-06,-1e-06,100.0,518.67,642.734102,1591.348618,1410.190599,14.62,...,521.316496,2388.103319,8146.060829,8.446897,0.03,393.372776,2388.0,100.0,38.788067,23.276434
std,25728.643523,71.993391,0.001117,0.000151,0.0,0.0,0.288573,3.649032,5.564546,5.329205e-15,...,0.450343,0.040756,14.581111,0.022261,1.04086e-17,0.970728,0.0,0.0,0.106076,0.064826
min,476.0,1.0,-0.004504,-0.0006,100.0,518.67,641.527526,1574.61267,1389.388329,14.62,...,519.331791,2387.918358,8099.94,8.362546,0.03,389.0,2388.0,100.0,38.347425,23.02294
25%,29238.0,50.0,-0.000708,-0.000101,100.0,518.67,642.552837,1588.98963,1406.658385,14.62,...,521.017962,2388.076989,8136.460805,8.432665,0.03,393.0,2388.0,100.0,38.717716,23.233896
50%,48479.0,99.0,-9e-06,2e-06,100.0,518.67,642.715664,1591.197809,1409.915698,14.62,...,521.33381,2388.102487,8145.501508,8.445801,0.03,393.0,2388.0,100.0,38.792765,23.282295
75%,68964.0,152.0,0.00071,9.9e-05,100.0,518.67,642.922927,1593.745277,1413.84487,14.62,...,521.607305,2388.130084,8155.730166,8.461635,0.03,394.0,2388.0,100.0,38.856996,23.316711
max,99711.0,362.0,0.004412,0.0006,100.0,518.67,643.826244,1605.153378,1431.740267,14.62,...,523.009164,2388.279736,8204.292511,8.534215,0.03,397.0,2388.0,100.0,39.263419,23.552514


What stands out immediately is that the synthetic unit-nrs do not run from 1 to 100. This is due to the very nature of SDV: it will never create instances that overlap with the original data ([source](https://colab.research.google.com/drive/1YLk2uwn8yrSRPy0soEeJwu8Hdk_tGTlE#scrollTo=IjvfTOMVKQot)). This behaviour is therefore expected.

The descriptions look rather similar: most means are close to those of the original data, although the standard deviations of the synthetic data are consistently smaller (for example, sens-2: 0.29 versus 0.50). It would therefore be interesting to test the SDV data on an ML task.
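
Comparing two `describe()` tables by eye is error-prone, so the comparison can be condensed into one small table per column. The sketch below uses a hypothetical helper and made-up toy values, not the notebook's data; it reports the difference in means and the ratio of standard deviations.

```python
import pandas as pd

def compare_descriptions(real, synthetic, columns):
    """Per column: synthetic mean minus real mean, and synthetic std over real std."""
    real_desc = real[columns].describe()
    syn_desc = synthetic[columns].describe()
    return pd.DataFrame({
        "mean_diff": syn_desc.loc["mean"] - real_desc.loc["mean"],
        # columns with a real std of 0 yield NaN or inf here and need separate handling
        "std_ratio": syn_desc.loc["std"] / real_desc.loc["std"],
    })

# toy values: same mean, but the synthetic column is half as spread out
real_toy = pd.DataFrame({"sens-2": [1.0, 2.0, 3.0, 4.0, 5.0]})
syn_toy = pd.DataFrame({"sens-2": [2.0, 2.5, 3.0, 3.5, 4.0]})

summary = compare_descriptions(real_toy, syn_toy, ["sens-2"])
print(summary)
```

A `std_ratio` well below 1 indicates that the synthetic column varies less than the original one.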

## SDV evaluation

In [65]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    data,
    synthetic_data,
    metadata
)

# with 128 epochs:
# Overall Quality Score: 81.07%
#
# Properties:
# - Column Shapes: 82.68%
# - Column Pair Trends: 79.45%

Generating report ...
(1/2) Evaluating Column Shapes: : 100%|███████████████████████████████████████████████████████████████████████████████████████████| 26/26 [00:00<00:00, 86.00it/s]
(2/2) Evaluating Column Pair Trends: : 100%|███████████████████████████████████████████████████████████████████████████████████| 325/325 [00:00<00:00, 471.16it/s]

Overall Quality Score: 80.23%

Properties:
- Column Shapes: 80.95%
- Column Pair Trends: 79.5%



These results are quite promising, and a lot better than the results from DataSynthesizer. However, we see that ....
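
To make the scores easier to interpret: as far as we understand the SDMetrics documentation, the Column Shapes score for numerical columns is based on the KSComplement metric (one minus the Kolmogorov-Smirnov statistic), and the overall score is the average of the two property scores. The sketch below is our own numpy illustration under those assumptions, not the library's implementation.

```python
import numpy as np

def ks_complement(real, synthetic):
    """1 minus the two-sample Kolmogorov-Smirnov statistic (1.0 = identical ECDFs)."""
    real = np.sort(np.asarray(real, dtype=float))
    synthetic = np.sort(np.asarray(synthetic, dtype=float))
    # evaluate both empirical CDFs at every observed value
    grid = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_syn = np.searchsorted(synthetic, grid, side="right") / len(synthetic)
    return 1.0 - float(np.max(np.abs(cdf_real - cdf_syn)))

same = ks_complement([0, 1, 2], [0, 1, 2])         # identical samples
disjoint = ks_complement([0, 1, 2], [10, 11, 12])  # samples that do not overlap

# property scores reported above, in percent
column_shapes, column_pair_trends = 80.95, 79.5
overall = (column_shapes + column_pair_trends) / 2
print(same, disjoint, overall)
```

The averaged value agrees with the reported overall quality score of 80.23% up to rounding.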

# Testing properties

In [74]:
# count rows per unit-nr for units 1-5 in both datasets
# (synthetic unit-nrs are newly generated ids, so no synthetic rows match here)
for i in range(1, 6):
    df_syn = synthetic_data.loc[synthetic_data['unit-nr'] == i]
    df_cmapss = data.loc[data['unit-nr'] == i]

    print(f"Number of data points for unit-nr {i} for CMAPSS: {len(df_cmapss)}. For synthetic data: {len(df_syn)}")

print("\nThe number of unique values per unit number:")

# count distinct timecycle values per unit-nr
for i in range(1, 6):
    df_syn = synthetic_data.loc[synthetic_data['unit-nr'] == i].drop_duplicates(subset='timecycle')
    df_cmapss = data.loc[data['unit-nr'] == i].drop_duplicates(subset='timecycle')

    print(f"Number of unique values in timestamp for unit-nr {i} for CMAPSS: {len(df_cmapss)}. For synthetic data: {len(df_syn)}")

df_syn

Number of data points for unit-nr 1 for CMAPSS: 192. For synthetic data: 0
Number of data points for unit-nr 2 for CMAPSS: 287. For synthetic data: 0
Number of data points for unit-nr 3 for CMAPSS: 179. For synthetic data: 0
Number of data points for unit-nr 4 for CMAPSS: 189. For synthetic data: 0
Number of data points for unit-nr 5 for CMAPSS: 269. For synthetic data: 0

The number of unique values per unit number:
Number of unique values in timestamp for unit-nr 1 for CMAPSS: 192. For synthetic data: 0
Number of unique values in timestamp for unit-nr 2 for CMAPSS: 287. For synthetic data: 0
Number of unique values in timestamp for unit-nr 3 for CMAPSS: 179. For synthetic data: 0
Number of unique values in timestamp for unit-nr 4 for CMAPSS: 189. For synthetic data: 0
Number of unique values in timestamp for unit-nr 5 for CMAPSS: 269. For synthetic data: 0


index,unit-nr,timecycle,ops-set1,ops-set2,ops-set3,sens-1,sens-2,sens-3,sens-4,sens-5,...,sens-12,sens-13,sens-14,sens-15,sens-16,sens-17,sens-18,sens-19,sens-20,sens-21
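
Because the synthetic unit-nrs do not coincide with the original ones, selecting by `unit-nr == i` returns no synthetic rows. One possible alternative is to compare properties that do not depend on the id values, such as the distribution of sequence lengths. The sketch below uses a hypothetical helper and made-up toy data.

```python
import pandas as pd

def sequence_lengths(df, sequence_key):
    """Number of rows per sequence, sorted so that the id values do not matter."""
    return df.groupby(sequence_key).size().sort_values().reset_index(drop=True)

# toy data: the synthetic ids differ from the real ones, as in the notebook
real_toy = pd.DataFrame({
    "unit-nr": [1, 1, 1, 2, 2],
    "timecycle": [1, 2, 3, 1, 2],
})
syn_toy = pd.DataFrame({
    "unit-nr": [48479, 48479, 476, 476, 476],
    "timecycle": [1, 2, 1, 2, 3],
})

real_lengths = sequence_lengths(real_toy, "unit-nr")
syn_lengths = sequence_lengths(syn_toy, "unit-nr")

# side-by-side view of the sorted sequence lengths
comparison = pd.DataFrame({"real": real_lengths, "synthetic": syn_lengths})
print(comparison)
```

Applied to `data` and `synthetic_data`, the same helper would show whether the synthetic sequences are about as long as the original ones, independent of how the ids are numbered.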
