# Synthetic Data Vault: DeepEcho (SDV)
In this notebook, we use DeepEcho, the sequential-data library of the Synthetic Data Vault (SDV) project, to create sequential synthetic data. The notebook builds on [the README on GitHub](https://github.com/sdv-dev/DeepEcho) and the notebook on DataSynthesizer.

In [1]:
# import relevant modules
import sdv
import pandas as pd
import numpy as np
import os

from deepecho import PARModel

# Load data

In [2]:
data = pd.read_csv('CMAPSS/train_FD001.txt', sep=" ", header=None)

# drop last two columns with N/A values
data = data.iloc[:, :-2]

# rename columns according to readme.txt
col_names = ["unit_nr", "timecycle", "ops_set1", "ops_set2", "ops_set3"]
for i in range(1,22):
    col_names.append(f"sens_{i}")
data.columns = col_names

# Compute the Remaining Useful Life (RUL) of every row, per engine (unit_nr)
def add_remaining_useful_life(df):
    # Get the total number of cycles for each unit
    grouped_by_unit = df.groupby(by="unit_nr")
    max_cycle = grouped_by_unit["timecycle"].max()
    
    # Merge the max cycle back into the original frame
    result_frame = df.merge(max_cycle.to_frame(name='max_cycle'), left_on='unit_nr', right_index=True)
    
    # Calculate remaining useful life for each row
    remaining_useful_life = result_frame["max_cycle"] - result_frame["timecycle"]
    result_frame["RUL"] = remaining_useful_life
    
    # drop max_cycle as it's no longer needed
    result_frame = result_frame.drop("max_cycle", axis=1)
    return result_frame

data = add_remaining_useful_life(data)

data.to_csv('CMAPSS/train_FD001_pre.csv', index=False)

data_length = len(data)

# display data
data
data.columns

Index(['unit_nr', 'timecycle', 'ops_set1', 'ops_set2', 'ops_set3', 'sens_1',
       'sens_2', 'sens_3', 'sens_4', 'sens_5', 'sens_6', 'sens_7', 'sens_8',
       'sens_9', 'sens_10', 'sens_11', 'sens_12', 'sens_13', 'sens_14',
       'sens_15', 'sens_16', 'sens_17', 'sens_18', 'sens_19', 'sens_20',
       'sens_21', 'RUL'],
      dtype='object')
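
To make the RUL definition concrete, here is a minimal sketch on a made-up frame (the name `toy` and its values are hypothetical, not part of the CMAPSS data). It uses `groupby().transform("max")` as a compact alternative to the merge in `add_remaining_useful_life`; the result should be the same: the last cycle of each unit gets an RUL of 0.

```python
import pandas as pd

# Toy frame (made-up values): unit 1 runs for 3 cycles, unit 2 for 2 cycles
toy = pd.DataFrame({
    "unit_nr":   [1, 1, 1, 2, 2],
    "timecycle": [1, 2, 3, 1, 2],
})

# Broadcast each unit's last cycle back onto its rows, then subtract
max_cycle = toy.groupby("unit_nr")["timecycle"].transform("max")
toy["RUL"] = max_cycle - toy["timecycle"]

print(toy["RUL"].tolist())  # [2, 1, 0, 1, 0]
```

This matches the real data shown later in the notebook, where the minimum RUL is 0 and the maximum RUL (361) is one less than the longest run (362 cycles).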

# Create synthetic data

In [3]:
# Define data types for all the columns
data_types = {
    'ops_set1': 'continuous',
    'ops_set2': 'continuous',
    'ops_set3': 'continuous',
    'sens_1': 'continuous',
    'sens_2': 'continuous',
    'sens_3': 'continuous',
    'sens_4': 'continuous',
    'sens_5': 'continuous',
    'sens_6': 'continuous',
    'sens_7': 'continuous',
    'sens_8': 'continuous',
    'sens_9': 'continuous',
    'sens_10': 'continuous',
    'sens_11': 'continuous',
    'sens_12': 'continuous',
    'sens_13': 'continuous',
    'sens_14': 'continuous',
    'sens_15': 'continuous',
    'sens_16': 'continuous',
    'sens_17': 'continuous',
    'sens_18': 'continuous',
    'sens_19': 'continuous',
    'sens_20': 'continuous',
    'sens_21': 'continuous',
    'RUL': 'continuous',
}

model = PARModel(cuda=False)

model.fit(
    data=data,
    entity_columns=['unit_nr'],
    data_types=data_types,
    sequence_index='timecycle'
)

Epoch 128 | Loss -0.30394765734672546: 100%|██| 128/128 [02:27<00:00,  1.16s/it]
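
Since every feature is declared `'continuous'`, the hand-written `data_types` dictionary could also be generated from the column names. The following is only a sketch of that idea; `feature_cols` is a helper name introduced here for illustration.

```python
# Feature columns: 3 operational settings, 21 sensors, plus the RUL target.
# unit_nr (entity column) and timecycle (sequence index) are left out,
# as in the hand-written dictionary.
feature_cols = (
    [f"ops_set{i}" for i in range(1, 4)]
    + [f"sens_{i}" for i in range(1, 22)]
    + ["RUL"]
)

# Treat every feature as continuous
data_types = {col: "continuous" for col in feature_cols}

print(len(data_types))  # 25
```

Generating the mapping keeps it in sync with the column naming scheme and avoids typos in the 25 near-identical entries.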


In [4]:
# count the unique unit numbers in the original data
n_unitnrs = len(data['unit_nr'].unique())

# sample synthetic data with the same number of units (entities)
synthetic_data = model.sample(num_entities=n_unitnrs)
synthetic_data

# export the synthetic data to the output folder
synthetic_data.to_csv('./CMAPSS/Synthetic/DeepEcho_FD001_RUL.csv')

100%|█████████████████████████████████████████| 100/100 [01:37<00:00,  1.02it/s]


# Analysis
Cool! We've created our first synthetic data set. Next, we analyse it: more specifically, we compare the summary statistics and distributions of the synthetic data with those of the original data.

In [5]:
data_description = data.describe()
data_description

statistic,unit_nr,timecycle,ops_set1,ops_set2,ops_set3,sens_1,sens_2,sens_3,sens_4,sens_5,...,sens_13,sens_14,sens_15,sens_16,sens_17,sens_18,sens_19,sens_20,sens_21,RUL
count,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,...,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0
mean,51.506568,108.807862,-9e-06,2e-06,100.0,518.67,642.680934,1590.523119,1408.933782,14.62,...,2388.096152,8143.752722,8.442146,0.03,393.210654,2388.0,100.0,38.816271,23.289705,107.807862
std,29.227633,68.88099,0.002187,0.000293,0.0,0.0,0.500053,6.13115,9.000605,1.7764e-15,...,0.071919,19.076176,0.037505,1.3878120000000003e-17,1.548763,0.0,0.0,0.180746,0.108251,68.88099
min,1.0,1.0,-0.0087,-0.0006,100.0,518.67,641.21,1571.04,1382.25,14.62,...,2387.88,8099.94,8.3249,0.03,388.0,2388.0,100.0,38.14,22.8942,0.0
25%,26.0,52.0,-0.0015,-0.0002,100.0,518.67,642.325,1586.26,1402.36,14.62,...,2388.04,8133.245,8.4149,0.03,392.0,2388.0,100.0,38.7,23.2218,51.0
50%,52.0,104.0,0.0,0.0,100.0,518.67,642.64,1590.1,1408.04,14.62,...,2388.09,8140.54,8.4389,0.03,393.0,2388.0,100.0,38.83,23.2979,103.0
75%,77.0,156.0,0.0015,0.0003,100.0,518.67,643.0,1594.38,1414.555,14.62,...,2388.14,8148.31,8.4656,0.03,394.0,2388.0,100.0,38.95,23.3668,155.0
max,100.0,362.0,0.0087,0.0006,100.0,518.67,644.53,1616.91,1441.49,14.62,...,2388.56,8293.72,8.5848,0.03,400.0,2388.0,100.0,39.43,23.6184,361.0


In [6]:
syn_description = synthetic_data.describe()
syn_description

statistic,unit_nr,ops_set1,ops_set2,ops_set3,sens_1,sens_2,sens_3,sens_4,sens_5,sens_6,...,sens_13,sens_14,sens_15,sens_16,sens_17,sens_18,sens_19,sens_20,sens_21,RUL
count,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,...,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0,25726.0
mean,47.635505,-1.1e-05,4e-06,100.0,518.67,642.748479,1591.333339,1410.355164,14.62,21.60984,...,2388.104428,8145.877752,8.446775,0.03,393.454274,2388.0,100.0,38.785953,23.276097,97.707901
std,28.024097,0.000787,0.000121,0.0,0.0,0.22484,2.69435,4.213793,2.089053e-15,0.000154,...,0.030959,10.912809,0.017731,7.420229e-18,0.754804,0.0,0.0,0.084871,0.049593,26.241508
min,0.0,-0.003924,-0.000489,100.0,518.67,641.824374,1578.126899,1390.247604,14.62,21.608105,...,2387.978509,8098.106224,8.374475,0.03,389.087827,2388.0,100.0,38.437401,23.083539,-6.20959
25%,23.0,-0.000531,-7.5e-05,100.0,518.67,642.600519,1589.61632,1407.665279,14.62,21.609743,...,2388.08433,8138.717385,8.435333,0.03,392.960433,2388.0,100.0,38.729131,23.243036,79.951807
50%,47.0,-9e-06,2e-06,100.0,518.67,642.74314,1591.195889,1410.198928,14.62,21.609835,...,2388.102837,8145.578991,8.445837,0.03,393.438813,2388.0,100.0,38.788992,23.27748,98.245332
75%,71.0,0.000509,8.4e-05,100.0,518.67,642.898767,1593.125153,1413.186997,14.62,21.609938,...,2388.124967,8153.056555,8.458469,0.03,393.962771,2388.0,100.0,38.839134,23.308334,114.660051
max,99.0,0.003518,0.000479,100.0,518.67,643.744416,1604.222572,1426.556646,14.62,21.611221,...,2388.238684,8191.202522,8.51995,0.03,396.83137,2388.0,100.0,39.147816,23.481895,231.517175
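
The row counts differ: the original data has 20631 rows, while the synthetic data has 25726 rows for the 100 units that were requested. This suggests that the sampled sequences are, on average, longer than the real ones. A minimal sketch of how to inspect this is given below; `sequence_lengths`, `real_toy` and `syn_toy` are hypothetical names, and the toy frames are made-up stand-ins for `data` and `synthetic_data`.

```python
import pandas as pd

def sequence_lengths(df, entity_col="unit_nr"):
    """Return the number of rows (time steps) per entity."""
    return df.groupby(entity_col).size()

# Made-up stand-ins for the real and the synthetic frame
real_toy = pd.DataFrame({"unit_nr": [1, 1, 1, 2, 2]})
syn_toy = pd.DataFrame({"unit_nr": [0, 0, 1, 1, 1, 1]})

real_len = sequence_lengths(real_toy)
syn_len = sequence_lengths(syn_toy)

# Compare shortest, average and longest sequence in both frames
print(real_len.describe()[["min", "mean", "max"]])
print(syn_len.describe()[["min", "mean", "max"]])
```

In the notebook, the same function would be called as `sequence_lengths(data)` and `sequence_lengths(synthetic_data)`.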


What immediately stands out is that the synthetic unit numbers do not run from 1 to 100 as in the original data; they range from 0 to 99. This follows from how SDV works: it does not create instances that overlap with the original data ([source](https://colab.research.google.com/drive/1YLk2uwn8yrSRPy0soEeJwu8Hdk_tGTlE#scrollTo=IjvfTOMVKQot)). This behaviour is therefore to be expected.

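The descriptions look broadly similar: most means are close to those of the original data. The standard deviations of the synthetic data are noticeably smaller, however (for example, 0.22 versus 0.50 for sens_2 and 26.2 versus 68.9 for RUL), and the synthetic RUL has a negative minimum (-6.2), which cannot occur in the original data. It would therefore be interesting to test the SDV data on an ML task.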
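
The visual comparison of the two descriptions can be made more systematic by dividing the synthetic statistics by the original ones. The sketch below assumes two `describe()`-style frames; `compare_descriptions` is a hypothetical helper, and the small example frames only contain the sens_2 and RUL values copied from the tables above.

```python
import pandas as pd

def compare_descriptions(real_desc, syn_desc, stats=("mean", "std")):
    """Ratio of synthetic to real summary statistics for shared columns.

    A ratio close to 1 means the statistic is well preserved. Columns
    with a zero statistic in the real data yield NaN or inf.
    """
    shared = real_desc.columns.intersection(syn_desc.columns)
    rows = list(stats)
    ratio = syn_desc.loc[rows, shared] / real_desc.loc[rows, shared]
    return ratio.T  # one row per column, one column per statistic

# Values copied from the two describe() outputs (sens_2 and RUL only)
real_desc = pd.DataFrame(
    {"sens_2": [642.680934, 0.500053], "RUL": [107.807862, 68.88099]},
    index=["mean", "std"],
)
syn_desc = pd.DataFrame(
    {"sens_2": [642.748479, 0.22484], "RUL": [97.707901, 26.241508]},
    index=["mean", "std"],
)

ratios = compare_descriptions(real_desc, syn_desc)
print(ratios.round(2))
```

In the notebook, the call would be `compare_descriptions(data_description, syn_description)`. Columns that are constant in the original data (such as sens_1) have a standard deviation of zero and are best read from the mean ratio only.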

# Testing properties