# Synthetic Data Vault (SDV)
In this notebook, we use Synthetic Data Vault (SDV) to generate synthetic sequential data. The notebook builds on the [tutorial from SDV](https://colab.research.google.com/drive/1cT4-jFK2Bxe93QudC_CwHq_yVCcNcxal?usp=sharing) and on the DataSynthesizer notebook.

In [1]:
# import relevant modules
import sdv
import pandas as pd
import numpy as np
import os

# Load data

In [2]:
data = pd.read_csv('CMAPSS/train_FD001.txt', sep=" ", header=None)

# drop last two columns with N/A values
data = data.iloc[:, :-2]

# rename columns according to readme.txt
col_names = ["unit_nr", "timecycle", "ops_set1", "ops_set2", "ops_set3"]
for i in range(1,22):
    col_names.append(f"sens_{i}")
data.columns = col_names

# Compute Remaining Useful Life (RUL) for every row, per engine (unit_nr)
def add_remaining_useful_life(df):
    # Get the total number of cycles for each unit
    grouped_by_unit = df.groupby(by="unit_nr")
    max_cycle = grouped_by_unit["timecycle"].max()
    
    # Merge the max cycle back into the original frame
    result_frame = df.merge(max_cycle.to_frame(name='max_cycle'), left_on='unit_nr', right_index=True)
    
    # Calculate remaining useful life for each row
    remaining_useful_life = result_frame["max_cycle"] - result_frame["timecycle"]
    result_frame["RUL"] = remaining_useful_life
    
    # drop max_cycle as it's no longer needed
    result_frame = result_frame.drop("max_cycle", axis=1)
    return result_frame

data = add_remaining_useful_life(data)

data.to_csv('CMAPSS/train_FD001_pre.csv', index=False)

data_length = len(data)

# display data
data
data.columns

Index(['unit_nr', 'timecycle', 'ops_set1', 'ops_set2', 'ops_set3', 'sens_1',
       'sens_2', 'sens_3', 'sens_4', 'sens_5', 'sens_6', 'sens_7', 'sens_8',
       'sens_9', 'sens_10', 'sens_11', 'sens_12', 'sens_13', 'sens_14',
       'sens_15', 'sens_16', 'sens_17', 'sens_18', 'sens_19', 'sens_20',
       'sens_21', 'RUL'],
      dtype='object')
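As a quick sanity check of the RUL definition used above (last cycle of the engine minus the current cycle), the same quantity can be computed on a toy frame. This is only a sketch: the toy values are made up, and it uses `groupby(...).transform("max")` as an equivalent alternative to the merge in `add_remaining_useful_life`.

```python
import pandas as pd

# Toy frame (made-up values): unit 1 runs for 3 cycles, unit 2 for 2 cycles
toy = pd.DataFrame({
    "unit_nr":   [1, 1, 1, 2, 2],
    "timecycle": [1, 2, 3, 1, 2],
})

# Broadcast each unit's last cycle to all of its rows, then subtract
max_cycle = toy.groupby("unit_nr")["timecycle"].transform("max")
toy["RUL"] = max_cycle - toy["timecycle"]

rul_values = toy["RUL"].tolist()
print(rul_values)  # [2, 1, 0, 1, 0]
```

The last row of every engine gets RUL 0, which matches the minimum of 0 that the RUL column should have after preprocessing.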

# Create synthetic data

In [3]:
# Create metadata
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()

# detect column types from the dataframe
metadata.detect_from_dataframe(data=data)

# set unit-nr as sequence key, timecycle as sequence index
metadata.update_column(
    column_name='unit_nr',
    sdtype='id')
metadata.set_sequence_key('unit_nr')
metadata.set_sequence_index('timecycle')

# show metadata
# metadata
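Declaring `unit_nr` as sequence key and `timecycle` as sequence index assumes that every (key, index) pair occurs once and that the index increases within each sequence. The helper below is a hypothetical pandas-only check of those assumptions, shown on toy data; it is not part of SDV.

```python
import pandas as pd

def check_sequences(df, key="unit_nr", index="timecycle"):
    """Return simple structural checks for sequential data (hypothetical helper)."""
    # No (key, index) pair may occur twice
    unique_pairs = not df.duplicated(subset=[key, index]).any()
    # Within each sequence the index should strictly increase in row order
    increasing = df.groupby(key)[index].apply(
        lambda s: s.is_monotonic_increasing and s.is_unique
    ).all()
    return {"unique_pairs": bool(unique_pairs), "increasing": bool(increasing)}

# Made-up examples: one well-formed frame, one with cycles out of order
toy_ok = pd.DataFrame({"unit_nr": [1, 1, 2, 2], "timecycle": [1, 2, 1, 2]})
toy_bad = pd.DataFrame({"unit_nr": [1, 1, 1], "timecycle": [1, 3, 2]})

check_ok = check_sequences(toy_ok)
check_bad = check_sequences(toy_bad)
print(check_ok)
print(check_bad)
```

In the notebook, `check_sequences(data)` could be run before fitting to confirm the CMAPSS frame has this shape.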

In [4]:
"""
In this example, we do not pass context_columns.
Context columns are columns whose value stays fixed
within a sequence; here every remaining column is
modelled as varying over time.

"""
from sdv.sequential import PARSynthesizer

# create synthesizer
synthesizer = PARSynthesizer(
    metadata,
    epochs=100,
    verbose=True
    )

synthesizer.fit(data)

Epoch 100 | Loss -0.4249756932258606: 100%|███| 100/100 [02:08<00:00,  1.29s/it]
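Whether `context_columns` is needed can be checked from the data rather than assumed. The sketch below uses a hypothetical helper and made-up toy values (the `engine_type` column does not exist in CMAPSS). It separates columns that are constant over the whole table from columns that are constant only within each sequence; the latter would be the natural candidates for `context_columns`.

```python
import pandas as pd

def constant_columns(df, key="unit_nr"):
    """Split columns into globally constant and per-sequence constant (hypothetical helper)."""
    value_cols = [c for c in df.columns if c != key]
    # Columns with a single distinct value over the whole table
    n_unique_global = df[value_cols].nunique()
    globally_constant = n_unique_global[n_unique_global == 1].index.tolist()
    # Columns with a single distinct value inside every sequence
    n_unique_per_seq = df.groupby(key)[value_cols].nunique()
    per_seq_constant = [
        c for c in value_cols
        if (n_unique_per_seq[c] == 1).all() and c not in globally_constant
    ]
    return globally_constant, per_seq_constant

# Toy data: engine_type is an invented column used only for illustration
toy = pd.DataFrame({
    "unit_nr":     [1, 1, 2, 2],
    "ops_set3":    [100.0, 100.0, 100.0, 100.0],
    "engine_type": ["A", "A", "B", "B"],
    "sens_2":      [641.8, 642.1, 642.0, 642.5],
})

globally_constant, per_seq_constant = constant_columns(toy)
print(globally_constant)  # ['ops_set3']
print(per_seq_constant)   # ['engine_type']
```

Globally constant columns carry no information for the model either way, so they are reported separately here.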


In [5]:
# find the number of unique unit_nr values (engines)
n_unitnrs = data['unit_nr'].nunique()

# sample the same number of synthetic sequences
synthetic_data = synthesizer.sample(num_sequences=n_unitnrs)
synthetic_data

# export the synthetic data to the Synthetic folder
synthetic_data.to_csv('./CMAPSS/Synthetic/SDV_FD001_RUL.csv')

100%|█████████████████████████████████████████| 100/100 [01:11<00:00,  1.41it/s]
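Sampling the same number of sequences does not by itself guarantee the same number of rows, because the sampled sequences need not have the same lengths as the original ones. A minimal sketch for comparing sequence lengths is shown below on made-up toy frames, with `sequence_lengths` as a hypothetical helper. In the notebook it would be called on `data` and `synthetic_data`.

```python
import pandas as pd

def sequence_lengths(df, key="unit_nr"):
    """Number of rows per sequence, indexed by the sequence key."""
    return df.groupby(key).size()

# Toy frames (made-up values); synthetic keys are arbitrary new identifiers
real_toy = pd.DataFrame({"unit_nr": [1, 1, 1, 2, 2]})
syn_toy = pd.DataFrame({"unit_nr": [901, 901, 777, 777, 777, 777]})

real_len = sequence_lengths(real_toy)
syn_len = sequence_lengths(syn_toy)

# Side-by-side summary statistics of the two length distributions
summary = pd.DataFrame({
    "real": real_len.describe(),
    "synthetic": syn_len.describe(),
})
print(summary)
```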


# Analysis
Cool! We've created our first synthetic data. Next, we analyse it by comparing the summary statistics and distributions of the synthetic data with those of the original data.

In [6]:
data_description = data.describe()
data_description

statistic,unit_nr,timecycle,ops_set1,ops_set2,ops_set3,sens_1,sens_2,sens_3,sens_4,sens_5,...,sens_13,sens_14,sens_15,sens_16,sens_17,sens_18,sens_19,sens_20,sens_21,RUL
count,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,...,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0
mean,51.506568,108.807862,-9e-06,2e-06,100.0,518.67,642.680934,1590.523119,1408.933782,14.62,...,2388.096152,8143.752722,8.442146,0.03,393.210654,2388.0,100.0,38.816271,23.289705,107.807862
std,29.227633,68.88099,0.002187,0.000293,0.0,0.0,0.500053,6.13115,9.000605,1.7764e-15,...,0.071919,19.076176,0.037505,1.3878120000000003e-17,1.548763,0.0,0.0,0.180746,0.108251,68.88099
min,1.0,1.0,-0.0087,-0.0006,100.0,518.67,641.21,1571.04,1382.25,14.62,...,2387.88,8099.94,8.3249,0.03,388.0,2388.0,100.0,38.14,22.8942,0.0
25%,26.0,52.0,-0.0015,-0.0002,100.0,518.67,642.325,1586.26,1402.36,14.62,...,2388.04,8133.245,8.4149,0.03,392.0,2388.0,100.0,38.7,23.2218,51.0
50%,52.0,104.0,0.0,0.0,100.0,518.67,642.64,1590.1,1408.04,14.62,...,2388.09,8140.54,8.4389,0.03,393.0,2388.0,100.0,38.83,23.2979,103.0
75%,77.0,156.0,0.0015,0.0003,100.0,518.67,643.0,1594.38,1414.555,14.62,...,2388.14,8148.31,8.4656,0.03,394.0,2388.0,100.0,38.95,23.3668,155.0
max,100.0,362.0,0.0087,0.0006,100.0,518.67,644.53,1616.91,1441.49,14.62,...,2388.56,8293.72,8.5848,0.03,400.0,2388.0,100.0,39.43,23.6184,361.0


In [7]:
syn_description = synthetic_data.describe()
syn_description

statistic,unit_nr,timecycle,ops_set1,ops_set2,ops_set3,sens_1,sens_2,sens_3,sens_4,sens_5,...,sens_13,sens_14,sens_15,sens_16,sens_17,sens_18,sens_19,sens_20,sens_21,RUL
count,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,...,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0,18564.0
mean,52300.959491,101.884292,-5e-06,4e-06,100.0,518.67,642.752021,1591.245154,1410.282246,14.62,...,2388.101472,8146.460691,8.446927,0.03,393.414727,2388.0,100.0,38.796656,23.273122,98.710946
std,30224.243654,69.72395,0.001011,0.000149,0.0,0.0,0.285639,3.321873,5.296097,3.552809e-15,...,0.035969,11.956419,0.020248,1.0408620000000001e-17,0.917487,0.0,0.0,0.101342,0.059943,33.933783
min,297.0,1.0,-0.00394,-0.00059,100.0,518.67,641.606238,1577.999235,1390.250104,14.62,...,2387.948887,8099.94,8.363616,0.03,390.0,2388.0,100.0,38.366522,23.052963,0.0
25%,20604.0,47.0,-0.000668,-9.3e-05,100.0,518.67,642.563708,1589.167107,1406.85393,14.62,...,2388.080701,8138.860253,8.43549,0.03,393.0,2388.0,100.0,38.730902,23.233326,76.0
50%,57382.0,93.0,-9e-06,2e-06,100.0,518.67,642.74131,1590.986394,1410.020844,14.62,...,2388.096152,8145.760799,8.443472,0.03,393.0,2388.0,100.0,38.804221,23.27696,100.0
75%,75683.0,141.0,0.00066,0.0001,100.0,518.67,642.940499,1593.400952,1413.857345,14.62,...,2388.123623,8154.34823,8.459784,0.03,394.0,2388.0,100.0,38.859594,23.311051,120.0
max,99455.0,362.0,0.004212,0.000584,100.0,518.67,643.919006,1606.10494,1431.071379,14.62,...,2388.285619,8209.460559,8.541943,0.03,397.0,2388.0,100.0,39.212197,23.503433,234.0


What immediately stands out is that the synthetic `unit_nr` values do not run from 1 to 100. This follows from how SDV works: it will never create instances that overlap with the original data ([source](https://colab.research.google.com/drive/1YLk2uwn8yrSRPy0soEeJwu8Hdk_tGTlE#scrollTo=IjvfTOMVKQot)), so each synthetic sequence receives a new identifier. This behaviour is therefore expected.

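The two descriptions look broadly similar: most means are close to those of the original data. The standard deviations, however, are noticeably smaller in the synthetic data (for example 0.29 versus 0.50 for `sens_2` and 33.9 versus 68.9 for `RUL`), and the synthetic set contains 18,564 rows against 20,631 in the original. It would therefore be interesting to test the SDV data on an ML task.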
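The comparison of the two `describe()` tables can be made explicit by computing, per column, the difference in means and the ratio of standard deviations. The sketch below uses a hypothetical helper and rebuilds two columns from the values printed above. In the notebook, `data_description` and `syn_description` could be passed directly.

```python
import pandas as pd

def compare_descriptions(real_desc, syn_desc):
    """Per-column mean difference and std ratio (synthetic relative to real)."""
    cols = real_desc.columns.intersection(syn_desc.columns)
    return pd.DataFrame({
        "mean_diff": syn_desc.loc["mean", cols] - real_desc.loc["mean", cols],
        # Columns with zero std in the real data yield NaN or inf here
        "std_ratio": syn_desc.loc["std", cols] / real_desc.loc["std", cols],
    })

# Values copied from the describe() outputs above (sens_2 and RUL only)
real_desc = pd.DataFrame(
    {"sens_2": [642.680934, 0.500053], "RUL": [107.807862, 68.88099]},
    index=["mean", "std"],
)
syn_desc = pd.DataFrame(
    {"sens_2": [642.752021, 0.285639], "RUL": [98.710946, 33.933783]},
    index=["mean", "std"],
)

comparison = compare_descriptions(real_desc, syn_desc)
print(comparison.round(3))
```

A `std_ratio` well below 1 indicates that the synthetic column is less spread out than the original one.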

## SDV evaluation

In [8]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    data,
    synthetic_data,
    metadata
)

# reference result from an earlier run with 128 epochs:
"""Overall Quality Score: 81.07%

Properties:
- Column Shapes: 82.68%
- Column Pair Trends: 79.45%
"""

Generating report ...
(1/2) Evaluating Column Shapes: : 100%|████████| 27/27 [00:00<00:00, 122.04it/s]
(2/2) Evaluating Column Pair Trends: : 100%|█| 351/351 [00:00<00:00, 510.20it/s]

Overall Quality Score: 80.91%

Properties:
- Column Shapes: 82.84%
- Column Pair Trends: 78.98%


'Overall Quality Score: 81.07%\n\nProperties:\n- Column Shapes: 82.68%\n- Column Pair Trends: 79.45%\n'
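To give some intuition for the Column Shapes score: as far as I understand the SDMetrics documentation, numerical columns are scored with KSComplement, which is one minus the two-sample Kolmogorov-Smirnov statistic. The function below is a hand-written NumPy sketch of that idea on made-up samples. It is an assumption about the metric, not the library's implementation.

```python
import numpy as np

def ks_complement(real, synthetic):
    """1 - max distance between the two empirical CDFs (sketch, not SDMetrics code)."""
    real = np.sort(np.asarray(real, dtype=float))
    synthetic = np.sort(np.asarray(synthetic, dtype=float))
    # Evaluate both empirical CDFs at every observed value
    grid = np.concatenate([real, synthetic])
    cdf_real = np.searchsorted(real, grid, side="right") / len(real)
    cdf_syn = np.searchsorted(synthetic, grid, side="right") / len(synthetic)
    return 1.0 - float(np.max(np.abs(cdf_real - cdf_syn)))

score_same = ks_complement([1, 2, 3], [1, 2, 3])        # identical samples
score_disjoint = ks_complement([1, 2, 3], [10, 11, 12])  # no overlap at all
score_narrow = ks_complement([0, 1, 2, 3], [1, 1, 2, 2])  # same centre, less spread
print(score_same, score_disjoint, score_narrow)  # 1.0 0.0 0.75
```

The third case shows how a synthetic column with the right centre but too little spread still loses score. The reported overall score is consistent with the mean of the two property scores: (82.84 + 78.98) / 2 = 80.91.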

# Testing properties