# Synthefy Time Series Synthesis API: A Comprehensive Guide

This notebook demonstrates how to use the Synthesis Stream API to generate synthetic time series and how to use the output from the Synthetic Data Agent.

## Getting Started

First, let's set up our environment and import the necessary libraries.


In [1]:
import json
from typing import Dict, List, Optional, Tuple

import httpx
import numpy as np
import pandas as pd
from tqdm import tqdm

## Loading and Preparing the Data

In this example, we'll work with the PPG+Dalia dataset containing physiological signals (BVP and ECG) and heart rate measurements. We'll demonstrate how to:
1. Load and preprocess the data
2. Create windows for time series analysis
3. Generate synthetic data using the Synthefy API

### Note: the method below relies on the `custom_split` column of the input data. The uploaded data, not Synthefy, defines the train/val/test split.

In [2]:
# Read the dataset
df = pd.read_parquet("/home/raimi/synthefy-package/data.parquet")

group_label_column = "subject"
# Let's use subjects 1-10 for this example
df = df[
    df[group_label_column].isin(
        ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9", "S10"]
    )
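].copy()  # copy so that adding a column below does not write to a slice of the original frame

# Split data into train/validation/test sets (S1-S8: train, S9: val, S10: test)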
df["custom_split"] = df[group_label_column].map(
    lambda x: "train"
    if x in ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
    else "val"
    if x in ["S9"]
    else "test"
)

df.head()

Unnamed: 0,BVP,activity,ACC_wrist_idx_0,ACC_wrist_idx_1,ACC_wrist_idx_2,EDA_wrist,TEMP_wrist,ACC_chest_idx_0,ACC_chest_idx_1,ACC_chest_idx_2,...,RESP_chest,WEIGHT,Gender,AGE,HEIGHT,SKIN,SPORT,subject,heart_rate,custom_split
0,7.28,0.0,-0.765625,-0.078125,0.671875,4.722437,32.13,0.869504,-0.083094,-0.324201,...,1.048783,78.0,m,34,182.0,3,6,S1,49.611369,train
1,6.33,0.0,-0.765625,-0.078125,0.671875,4.722437,32.13,0.850155,-0.065887,-0.373781,...,5.169502,78.0,m,34,182.0,3,6,S1,49.611369,train
2,5.46,0.0,-0.765625,-0.078125,0.65625,4.722437,32.13,0.853612,-0.067946,-0.362127,...,4.234545,78.0,m,34,182.0,3,6,S1,49.611369,train
3,4.6,0.0,-0.765625,-0.078125,0.65625,4.722437,32.13,0.853015,-0.066775,-0.364364,...,4.915926,78.0,m,34,182.0,3,6,S1,49.611369,train
4,3.74,0.0,-0.765625,-0.078125,0.671875,4.722437,32.13,0.856592,-0.069934,-0.359636,...,4.536579,78.0,m,34,182.0,3,6,S1,49.611369,train


In [3]:
df.tail()

Unnamed: 0,BVP,activity,ACC_wrist_idx_0,ACC_wrist_idx_1,ACC_wrist_idx_2,EDA_wrist,TEMP_wrist,ACC_chest_idx_0,ACC_chest_idx_1,ACC_chest_idx_2,...,RESP_chest,WEIGHT,Gender,AGE,HEIGHT,SKIN,SPORT,subject,heart_rate,custom_split
5535611,-44.11,0.0,-0.84375,0.5,-0.078125,0.215518,33.97,0.895363,0.005929,-0.120425,...,-0.642989,56.0,f,55,164.0,4,5,S10,85.709338,test
5535612,-53.25,0.0,-0.90625,0.609375,-0.078125,0.215518,33.97,0.897526,-0.013646,-0.132039,...,-0.350993,56.0,f,55,164.0,4,5,S10,85.709338,test
5535613,-59.69,0.0,-0.90625,0.609375,-0.078125,0.215518,33.97,0.918993,-0.033857,-0.137682,...,-0.893012,56.0,f,55,164.0,4,5,S10,85.709338,test
5535614,-63.97,0.0,-1.078125,0.640625,-0.078125,0.215518,33.97,0.931857,-0.04229,-0.133806,...,-0.315626,56.0,f,55,164.0,4,5,S10,85.709338,test
5535615,-66.35,0.0,-1.078125,0.640625,-0.078125,0.215518,33.97,0.947258,-0.051514,-0.140313,...,-1.512757,56.0,f,55,164.0,4,5,S10,85.709338,test
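
Because the split is taken from the uploaded data, it can help to validate the `custom_split` column before uploading. The sketch below is one way to do that on a toy frame. The helper name `check_custom_split` is hypothetical, and the rule that each subject should appear in exactly one split is an assumption made for this example, not a documented Synthefy requirement.

```python
import pandas as pd


def check_custom_split(frame, group_col="subject", split_col="custom_split"):
    """Return a {group: split} mapping, raising if the split column looks wrong.

    Hypothetical helper: the allowed labels mirror the ones used in this notebook.
    """
    allowed = {"train", "val", "test"}
    found = set(frame[split_col].unique())
    if not found <= allowed:
        raise ValueError(f"Unexpected split labels: {found - allowed}")

    # Assumption: a group (subject) should not be spread over several splits
    splits_per_group = frame.groupby(group_col)[split_col].nunique()
    leaking = splits_per_group[splits_per_group > 1]
    if not leaking.empty:
        raise ValueError(f"Groups in more than one split: {list(leaking.index)}")

    return frame.groupby(group_col)[split_col].first().to_dict()


# Toy frame with the same two columns as the real data
toy = pd.DataFrame(
    {
        "subject": ["S1", "S1", "S9", "S10"],
        "custom_split": ["train", "train", "val", "test"],
    }
)
split_by_subject = check_custom_split(toy)
print(split_by_subject)
```

On the real data, the call would be `check_custom_split(df, group_col=group_label_column)`.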


## Creating Time Series Windows

For time series analysis, we need to create windows of data. Each window represents a sequence of measurements that we can use for training or synthesis. Let's define a function to create these windows.

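Note: to use your own data, adjust `WINDOW_SIZE`, `STRIDE`, `feature_columns`, and `target_column` in the cell below to match your dataset.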

In [4]:
WINDOW_SIZE = 512
STRIDE = 512
feature_columns = ["BVP", "ECG"]
target_column = "heart_rate"

def create_windows(
    df_for_windows: pd.DataFrame,
    window_size: int,
    stride: int,
    target_column: str,
    feature_columns: List[str],
    train_or_val_or_test: Optional[str] = None,
    should_check_same_subject: bool = True,
) -> Tuple[np.ndarray, np.ndarray]:
    """Create windows from time series data.

    Args:
        df_for_windows: Input DataFrame
        window_size: Size of each window
        stride: Step size between windows
        target_column: Column to predict
        feature_columns: Columns to use as features
        train_or_val_or_test: Optional split to filter by
        should_check_same_subject: Whether to keep only windows whose first and
            last rows belong to the same subject (set this to False if
            use_label_col_as_discrete_metadata is False in the preprocess config)

    Returns:
        Tuple of (X, y) where X contains feature windows and y contains the
        mean target value of each window
    """
    if train_or_val_or_test is not None:
        df_for_windows = df_for_windows[df_for_windows["custom_split"] == train_or_val_or_test]

    # Precompute valid window indices
    valid_indices = []
    for i in tqdm(range(0, len(df_for_windows) - window_size + 1, stride)):
        start_index = i
        end_index = i + window_size - 1
        if should_check_same_subject:
            # Windows should only include data from the same subject
            if df_for_windows.iloc[start_index]["subject"] == df_for_windows.iloc[end_index]["subject"]:
                valid_indices.append((start_index, end_index + 1))
        else:
            valid_indices.append((start_index, end_index + 1))

    # Convert to numpy arrays for faster slicing
    feature_data = df_for_windows[feature_columns].values
    target_data = df_for_windows[target_column].values

    # Build one feature window and one mean target value per valid index pair
    X = np.array([feature_data[start:end] for start, end in valid_indices])
    y = np.array([target_data[start:end].mean() for start, end in valid_indices])
    return X, y
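
The loop above calls `.iloc` twice per window, which can be slow on large frames. As a rough alternative, the sketch below builds the feature windows with NumPy's `sliding_window_view` and applies the same first-row/last-row subject check with array indexing. The function name `create_windows_vectorized` and the toy arrays are illustrative only. It assumes the input has at least `window_size` rows, since `sliding_window_view` raises otherwise.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def create_windows_vectorized(feature_data, group_ids, window_size, stride):
    """Return feature windows of shape (n_windows, window_size, n_features).

    Hypothetical helper that mirrors the subject check in create_windows:
    a window is kept only if its first and last rows share a group id.
    """
    # Candidate windows, one every `stride` rows: (n_candidates, n_features, window_size)
    windows = sliding_window_view(feature_data, window_size, axis=0)[::stride]
    starts = np.arange(0, len(feature_data) - window_size + 1, stride)
    same_group = group_ids[starts] == group_ids[starts + window_size - 1]
    # Move the window axis in front of the feature axis
    return windows[same_group].transpose(0, 2, 1)


# Toy data: 16 rows, 2 features, two subjects with 10 and 6 rows
toy_features = np.arange(32).reshape(16, 2)
toy_groups = np.array(["S1"] * 10 + ["S2"] * 6)
toy_windows = create_windows_vectorized(toy_features, toy_groups, window_size=4, stride=4)
print(toy_windows.shape)
```

With real data, `feature_data` would be `df[feature_columns].values` and `group_ids` would be `df["subject"].values`.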

## Creating Training, Validation, and Test Sets

Let's create our training, validation, and test sets using the window creation function.

In [5]:
# Create windows for each split
X_train, y_train = create_windows(
    df_for_windows=df,
    window_size=WINDOW_SIZE,
    stride=STRIDE,
    target_column=target_column,
    feature_columns=feature_columns,
    train_or_val_or_test="train",
)

X_val, y_val = create_windows(
    df_for_windows=df,
    window_size=WINDOW_SIZE,
    stride=STRIDE,
    target_column=target_column,
    feature_columns=feature_columns,
    train_or_val_or_test="val",
)

X_test, y_test = create_windows(
    df_for_windows=df,
    window_size=WINDOW_SIZE,
    stride=STRIDE,
    target_column=target_column,
    feature_columns=feature_columns,
    train_or_val_or_test="test",
)

print(f"Training set shape: {X_train.shape}, {y_train.shape}")
print(f"Validation set shape: {X_val.shape}, {y_val.shape}")
print(f"Test set shape: {X_test.shape}, {y_test.shape}")

100%|██████████| 8410/8410 [00:00<00:00, 14812.83it/s]
100%|██████████| 1070/1070 [00:00<00:00, 14997.18it/s]
100%|██████████| 1331/1331 [00:00<00:00, 15052.07it/s]

Training set shape: (8403, 512, 2), (8403,)
Validation set shape: (1070, 512, 2), (1070,)
Test set shape: (1331, 512, 2), (1331,)
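
The training progress bar reports 8410 candidate windows, while the resulting array holds 8403. That gap would be consistent with one discarded window at each of the seven boundaries between the eight training subjects, since `create_windows` drops any window whose first and last rows belong to different subjects. The toy sketch below reproduces this effect in plain Python. The helper `count_windows` and the group lengths are made up for illustration.

```python
def count_windows(group_lengths, window_size, stride):
    """Count candidate windows and those that stay within a single group.

    Hypothetical helper: groups are laid out back to back, as subjects are
    in the concatenated DataFrame.
    """
    # Label every row with the index of the group it belongs to
    labels = [g for g, n in enumerate(group_lengths) for _ in range(n)]
    starts = range(0, len(labels) - window_size + 1, stride)
    candidates = len(starts)
    valid = sum(labels[s] == labels[s + window_size - 1] for s in starts)
    return candidates, valid


# Two groups of 10 and 6 rows: the window covering rows 8-11 crosses the boundary
candidates, valid = count_windows([10, 6], window_size=4, stride=4)
print(candidates, valid)  # 4 3
```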





## Method 1: Using the API for Data Synthesis

Now, let's demonstrate how to use the Synthefy API to generate synthetic data. We'll focus on generating synthetic data for Subject 1.

To find the values for `TRAINING_JOB_ID` and `USER_ID`, go to https://prod.synthefy.com/home/model-apis and select your dataset and model.
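
If you prefer not to paste credentials into the notebook, one option is to read them from environment variables and build the endpoint path from them. The sketch below shows this pattern. The environment variable names and both helper functions are hypothetical, and the path template is copied from the configuration cell in this notebook.

```python
import os


def read_setting(name, default="XXXX-XXXX"):
    """Read a setting from the environment, falling back to a placeholder."""
    return os.environ.get(name, default)


def build_stream_endpoint(task, user_id, dataset_name, training_job_id):
    """Build the relative stream endpoint path used in this notebook."""
    return f"/api/{task}/{user_id}/{dataset_name}/{training_job_id}/stream"


# Hypothetical variable names; choose whatever fits your environment
example_endpoint = build_stream_endpoint(
    task="synthesis",
    user_id=read_setting("SYNTHEFY_USER_ID"),
    dataset_name="ppg_hr",
    training_job_id=read_setting("SYNTHEFY_TRAINING_JOB_ID"),
)
print(example_endpoint)
```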

In [None]:
# API Configuration
DATASET_NAME = "ppg_hr"
TASK = "synthesis"
TRAINING_JOB_ID = "XXXX-XXXX" # Replace with your training job ID
USER_ID = "XXXX-XXXX" # Replace with your user ID
X_API_KEY = "XXXX-XXXX"  # Replace with your API key
BASE_URL = "https://prod.synthefy.com"

client = httpx.Client(base_url=BASE_URL, timeout=30.0)
ENDPOINT = f"/api/{TASK}/{USER_ID}/{DATASET_NAME}/{TRAINING_JOB_ID}/stream"

# Prepare data for synthesis
df_for_synthetic = df[df["subject"] == "S1"]
num_synthetic_samples = 10

# Generate synthetic data
synthetic_X_windows = []
synthetic_y_windows = []

for i in tqdm(range(num_synthetic_samples)):
    # Take a window of data
    df_for_synthetic_tmp = (
        df_for_synthetic.iloc[i * WINDOW_SIZE : (i + 1) * WINDOW_SIZE]
        .reset_index(drop=True)
        .ffill()
    )

    # Prepare request
    json_for_request = json.loads(df_for_synthetic_tmp.to_json())

    # Make API call
    response = client.post(
        ENDPOINT,
        json=json_for_request,
        headers={"X-API-Key": X_API_KEY},
    )

    if response.status_code != 200:
        raise Exception(f"Error: {response.status_code} - {response.text}")

    # Process response
    pred_df = pd.DataFrame(response.json())
    pred_df = pred_df[
        [f"{feature_column}_synthetic" for feature_column in feature_columns]
    ]

    synthetic_X_windows.append(pred_df.values)
    synthetic_y_windows.append(df_for_synthetic_tmp[target_column].mean())

100%|██████████| 10/10 [01:12<00:00,  7.22s/it]
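
To make the request body more concrete, the sketch below shows what `json.loads(df.to_json())` produces for a tiny two-row frame: a dictionary keyed by column name, where each column maps row labels (as strings) to values. The mock response is an assumption for illustration only. It simply mirrors the layout that the cell above expects, with one `<feature>_synthetic` column per feature.

```python
import json

import pandas as pd

# Tiny stand-in for one window of real data
window_df = pd.DataFrame({"BVP": [7.28, 6.33], "heart_rate": [49.6, 49.6]})

# Column-oriented payload: {column: {row_label: value}}
payload = json.loads(window_df.to_json())
print(payload)

# Mock response (assumed layout, not captured from the real API)
mock_response = {"BVP_synthetic": {"0": 7.1, "1": 6.4}}
pred_df = pd.DataFrame(mock_response)
synthetic_window = pred_df[["BVP_synthetic"]].values
print(synthetic_window.shape)
```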


## Combining Real and Synthetic Data

Now that we have generated synthetic data, let's combine it with our real training data to create an enhanced training set.

In [7]:
# Combine real and synthetic data
real_and_synthetic_X_train_windows = np.concatenate(
    [X_train, synthetic_X_windows], axis=0
)
real_and_synthetic_y_train_windows = np.concatenate(
    [y_train, synthetic_y_windows], axis=0
)

print(f"Original training set shape: {X_train.shape}, {y_train.shape}")
print(f"Enhanced training set shape: {real_and_synthetic_X_train_windows.shape}, {real_and_synthetic_y_train_windows.shape}")

# This data can now be flattened or otherwise processed for model training

Original training set shape: (8403, 512, 2), (8403,)
Enhanced training set shape: (8413, 512, 2), (8413,)
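
When mixing real and synthetic windows, it can be useful to check that the shapes agree and to remember which rows are synthetic, for example to weight or filter them later. The sketch below is one possible way to do this on toy arrays. The helper `append_synthetic` and the `is_synthetic` mask are not part of the Synthefy API.

```python
import numpy as np


def append_synthetic(X_real, y_real, synthetic_X, synthetic_y):
    """Append synthetic windows to real ones and return a provenance mask."""
    X_syn = np.stack(synthetic_X)  # (n_synthetic, window_size, n_features)
    y_syn = np.asarray(synthetic_y, dtype=float)
    if X_syn.shape[1:] != X_real.shape[1:]:
        raise ValueError(f"Window shape mismatch: {X_syn.shape[1:]} vs {X_real.shape[1:]}")

    X_all = np.concatenate([X_real, X_syn], axis=0)
    y_all = np.concatenate([y_real, y_syn], axis=0)
    # True for rows that came from the synthetic lists
    is_synthetic = np.concatenate(
        [np.zeros(len(X_real), dtype=bool), np.ones(len(X_syn), dtype=bool)]
    )
    return X_all, y_all, is_synthetic


# Toy data: 5 real windows and 3 synthetic windows of shape (4, 2)
toy_X_real = np.zeros((5, 4, 2))
toy_y_real = np.zeros(5)
toy_synthetic_X = [np.ones((4, 2)) for _ in range(3)]
toy_synthetic_y = [1.0, 2.0, 3.0]

X_all, y_all, is_synthetic = append_synthetic(
    toy_X_real, toy_y_real, toy_synthetic_X, toy_synthetic_y
)
print(X_all.shape, y_all.shape, int(is_synthetic.sum()))
```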


## Method 2: Using the Synthetic Data Agent Output

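Alternatively, you can use the output of the Synthetic Data Agent. In this example, the agent's output is a zip archive containing `synthetic_data.parquet`, which we extract, load, and window with the same `create_windows` function used for the real data.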

In [8]:
# Load synthetic data from agent output
import zipfile

subject_1_synthetic_data_path = "/home/raimi/subject-1-synthetic_data.zip"

# Extract the zip file
with zipfile.ZipFile(subject_1_synthetic_data_path, "r") as zip_ref:
    zip_ref.extractall("/home/raimi/subject-1-synthetic_data")

# Load the synthetic data
df_subject_1_synthetic = pd.read_parquet(
    "/home/raimi/subject-1-synthetic_data/synthetic_data.parquet"
)

# Create windows from synthetic data
method_2_synthetic_X_windows, method_2_synthetic_y_windows = create_windows(
    df_for_windows=df_subject_1_synthetic,
    window_size=WINDOW_SIZE,
    stride=STRIDE,
    target_column=target_column,
    feature_columns=feature_columns,
    train_or_val_or_test=None,
    should_check_same_subject=False,
)

print(f"Method 2 Synthetic Data shape: {method_2_synthetic_X_windows.shape}, {method_2_synthetic_y_windows.shape}")

100%|██████████| 10/10 [00:00<00:00, 249660.95it/s]

Method 2 Synthetic Data shape: (10, 512, 2), (10,)
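
Before extracting an archive, it may be worth checking what it contains. The sketch below lists the parquet members of a zip file using only the standard library. It builds a small in-memory archive as a stand-in, so the file names are placeholders rather than the actual layout of the agent's output.

```python
import io
import zipfile


def list_parquet_members(archive):
    """Return the names of .parquet members in a zip archive (path or file object)."""
    with zipfile.ZipFile(archive, "r") as zip_ref:
        return [name for name in zip_ref.namelist() if name.endswith(".parquet")]


# Build a toy archive in memory so the example does not touch the file system
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_ref:
    zip_ref.writestr("synthetic_data.parquet", b"placeholder")
    zip_ref.writestr("README.txt", b"placeholder")
buffer.seek(0)

members = list_parquet_members(buffer)
print(members)
```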





In [9]:
# Combine real and synthetic data
real_and_synthetic_X_train_windows_method_2 = np.concatenate(
    [X_train, method_2_synthetic_X_windows], axis=0
)
real_and_synthetic_y_train_windows_method_2 = np.concatenate(
    [y_train, method_2_synthetic_y_windows], axis=0
)

print(f"Original training set shape: {X_train.shape}, {y_train.shape}")
print(f"Enhanced training set shape: {real_and_synthetic_X_train_windows_method_2.shape}, {real_and_synthetic_y_train_windows_method_2.shape}")

# This data can now be flattened or otherwise processed for model training

Original training set shape: (8403, 512, 2), (8403,)
Enhanced training set shape: (8413, 512, 2), (8413,)


## Preparing Data for Model Training

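Finally, let's prepare our data for model training by flattening each window into a single feature vector. In the printed labels, TRTR refers to training on real data only, and TRSTR refers to training on real plus synthetic data; both are evaluated on real data.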

In [10]:
# Flatten the data for model training
X_train_flattened = X_train.reshape(X_train.shape[0], -1)
real_and_synthetic_X_train_flattened = real_and_synthetic_X_train_windows.reshape(
    real_and_synthetic_X_train_windows.shape[0], -1
)
real_and_synthetic_X_train_flattened_method_2 = real_and_synthetic_X_train_windows_method_2.reshape(
    real_and_synthetic_X_train_windows_method_2.shape[0], -1
)
X_val_flattened = X_val.reshape(X_val.shape[0], -1)
X_test_flattened = X_test.reshape(X_test.shape[0], -1)

print(f"Train Data for TRTR: {X_train_flattened.shape=}, {y_train.shape=}")
print(f"Train Data for TRSTR - method 1: {real_and_synthetic_X_train_flattened.shape=}, {real_and_synthetic_y_train_windows.shape=}")
print(f"Train Data for TRSTR - method 2: {real_and_synthetic_X_train_flattened_method_2.shape=}, {real_and_synthetic_y_train_windows_method_2.shape=}")
print(f"Validation Data: {X_val_flattened.shape=}, {y_val.shape=}")
print(f"Test Data: {X_test_flattened.shape=}, {y_test.shape=}")

Train Data for TRTR: X_train_flattened.shape=(8403, 1024), y_train.shape=(8403,)
Train Data for TRSTR - method 1: real_and_synthetic_X_train_flattened.shape=(8413, 1024), real_and_synthetic_y_train_windows.shape=(8413,)
Train Data for TRSTR - method 2: real_and_synthetic_X_train_flattened_method_2.shape=(8413, 1024), real_and_synthetic_y_train_windows_method_2.shape=(8413,)
Validation Data: X_val_flattened.shape=(1070, 1024), y_val.shape=(1070,)
Test Data: X_test_flattened.shape=(1331, 1024), y_test.shape=(1331,)
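
As a rough illustration of how the flattened arrays could be compared, the sketch below fits a least-squares linear model once on toy "real" data (TRTR) and once on toy "real plus synthetic" data (TRSTR), then scores both on held-out toy data. The linear model is only a stand-in chosen to keep the example dependency-free, and all data here is randomly generated, so the printed numbers say nothing about the PPG dataset.

```python
import numpy as np


def add_bias(X):
    """Append a constant column so the model can learn an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_linear(X, y):
    """Fit a least-squares linear model and return its weights."""
    weights, *_ = np.linalg.lstsq(add_bias(X), y, rcond=None)
    return weights


def mean_absolute_error(weights, X, y):
    return float(np.mean(np.abs(add_bias(X) @ weights - y)))


rng = np.random.default_rng(0)
true_weights = rng.normal(size=8)


def make_toy(n_rows):
    """Generate toy flattened windows and noisy targets."""
    X = rng.normal(size=(n_rows, 8))
    y = X @ true_weights + rng.normal(scale=0.1, size=n_rows)
    return X, y


X_real, y_real = make_toy(40)
X_syn, y_syn = make_toy(10)
X_eval, y_eval = make_toy(20)

trtr_weights = fit_linear(X_real, y_real)
trstr_weights = fit_linear(
    np.concatenate([X_real, X_syn]), np.concatenate([y_real, y_syn])
)

trtr_mae = mean_absolute_error(trtr_weights, X_eval, y_eval)
trstr_mae = mean_absolute_error(trstr_weights, X_eval, y_eval)
print(f"TRTR MAE: {trtr_mae:.3f}, TRSTR MAE: {trstr_mae:.3f}")
```

With the notebook's arrays, the same functions would take `X_train_flattened` or `real_and_synthetic_X_train_flattened` for fitting and `X_val_flattened` or `X_test_flattened` for scoring.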
