This tutorial walks you through onboarding a dataset for synthetic data generation with the Rockfish Onboarding Module.

We will cover the following:

- Preparing your dataset for synthetic data generation.
- Using the Rockfish Recommendation Engine to automatically determine the most suitable model for training, along with the key configurations and settings required for successful onboarding.
- Generating synthetic data and evaluating it with the Rockfish Synthetic Data Assessor, which helps you improve the quality of your synthetic datasets.


### Install and Import Rockfish SDK


In [1]:
%%capture
%pip install -U 'rockfish[labs]' -f 'https://packages.rockfish.ai'

In [2]:
import rockfish as rf
import rockfish.actions as ra
from rockfish.labs.dataset_properties import (
    DatasetPropertyExtractor,
    FieldType,
    EncoderType,
)
from rockfish.labs.steps import Recommender
from rockfish.labs.metrics import marginal_dist_score

### Connect to the Rockfish Platform

❗❗ Replace API_KEY and API_URL with your own Rockfish API key and API URL.


In [3]:
api_key = "API_KEY"
api_url = "API_URL"

conn = rf.Connection.remote(api_url, api_key)
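If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `ROCKFISH_API_URL` and `ROCKFISH_API_KEY` and the helper `load_rockfish_credentials()` are assumptions made for this example, not names defined by the Rockfish SDK.

```python
import os


def load_rockfish_credentials(env=None):
    """Return (api_url, api_key) read from environment variables.

    ROCKFISH_API_URL and ROCKFISH_API_KEY are hypothetical names chosen
    for this sketch; the Rockfish SDK does not define them.
    """
    env = os.environ if env is None else env
    names = ("ROCKFISH_API_URL", "ROCKFISH_API_KEY")
    # Fail early with a clear message instead of connecting with empty values.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ROCKFISH_API_URL"], env["ROCKFISH_API_KEY"]
```

The returned pair can then be passed to `rf.Connection.remote(api_url, api_key)` exactly as in the cell above.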

### Load the Dataset

Rockfish also supports ingesting other data formats; refer to the documentation for more details.


In [4]:
%%capture
!wget --no-clobber https://docs.rockfish.ai/tutorials/finance.csv
dataset = rf.Dataset.from_csv("finance", "finance.csv")



In [5]:
dataset.to_pandas()

Unnamed: 0,customer,age,gender,merchant,category,amount,fraud,timestamp
0,C1093826151,4,M,M348934600,transportation,4.55,0,2023-01-01
1,C575345520,2,F,M348934600,transportation,76.67,0,2023-01-01
2,C1787537369,2,M,M1823072687,transportation,48.02,0,2023-01-01
3,C1732307957,5,F,M348934600,transportation,55.06,0,2023-01-01
4,C842799656,1,F,M348934600,transportation,25.62,0,2023-01-01
...,...,...,...,...,...,...,...,...
49995,C1971105040,3,M,M348934600,transportation,67.91,0,2023-01-20
49996,C51444479,3,M,M348934600,transportation,32.27,0,2023-01-20
49997,C1096642744,5,M,M1535107174,wellnessandbeauty,149.70,0,2023-01-20
49998,C1166683343,2,F,M1823072687,transportation,24.78,0,2023-01-20


### Onboard the dataset onto Rockfish

The onboarding workflow is a good starting point to get to a synthetic version of your dataset quickly.

To get good synthetic data, provide domain-specific information about your dataset, such as the session key and metadata fields passed to `DatasetPropertyExtractor` below. This helps Rockfish's Recommendation Engine tailor the workflow to your specific needs.
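To make "session key" and "metadata fields" more concrete, here is a small plain-Python sketch. It assumes that a metadata field is one whose value stays constant within a session (here, per customer); that reading, the helper `constant_within_session()`, and the toy rows are all assumptions for illustration and are not part of the Rockfish SDK.

```python
from collections import defaultdict

# Toy rows that reuse some finance.csv column names (illustrative values only).
rows = [
    {"customer": "C1", "age": "4", "gender": "M", "category": "transportation", "amount": 4.55},
    {"customer": "C1", "age": "4", "gender": "M", "category": "food", "amount": 12.10},
    {"customer": "C2", "age": "2", "gender": "F", "category": "transportation", "amount": 76.67},
    {"customer": "C2", "age": "2", "gender": "F", "category": "transportation", "amount": 20.00},
]


def constant_within_session(rows, session_key):
    """Return the fields whose value never changes inside any one session."""
    seen = defaultdict(lambda: defaultdict(set))  # session -> field -> values
    for row in rows:
        for field, value in row.items():
            if field != session_key:
                seen[row[session_key]][field].add(value)
    fields = [f for f in rows[0] if f != session_key]
    return [f for f in fields if all(len(seen[s][f]) == 1 for s in seen)]


metadata_candidates = constant_within_session(rows, session_key="customer")
print(metadata_candidates)  # ['age', 'gender']
```

A check like this only suggests candidates; which fields to declare as metadata remains a domain decision.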


In [6]:
dataset_properties = DatasetPropertyExtractor(
    dataset,
    session_key="customer",
    metadata_fields=["age", "gender"],
    additional_property_keys=["association_rules"],
).extract()
recommender_output = Recommender(dataset_properties).run()
print(recommender_output.report)

# _________________________________________________________________________
#
# RECOMMENDED CONFIGURATIONS
#
# (Remove or change any actions or configurations that are inappropriate
#  for your use case, or add missing ones)
# _________________________________________________________________________


We detected a timeseries dataset with the following properties:
Dimensions of dataset: (50000 x 8)
Metadata fields: ['age', 'gender']
Measurement fields: ['merchant', 'category', 'fraud', 'amount']
Timestamp field: timestamp
Session key field: customer
Number of sessions: 3765

# _________________________________________________________________________
#
# ~~~~~ Pre-processing recommendations ~~~~~
# _________________________________________________________________________


merge the following associated fields: [['merchant', 'category']]

# _________________________________________________________________________
#
# ~~~~~ Model recommendations ~~~~~
# __________________________________
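The report above recommends merging `merchant` and `category` as associated fields. The tutorial does not say how that association is detected, so the following is only a rough sketch of one way to gauge how strongly one categorical field predicts another: for each merchant, compute the share of its rows that fall in its most common category. The helper `rule_confidence()` is hypothetical and is not the Recommendation Engine's method.

```python
from collections import Counter, defaultdict


def rule_confidence(pairs):
    """Map each left-hand value to the share of its rows that carry
    its most common right-hand value (1.0 means a perfect mapping)."""
    by_left = defaultdict(Counter)
    for left, right in pairs:
        by_left[left][right] += 1
    return {
        left: counts.most_common(1)[0][1] / sum(counts.values())
        for left, counts in by_left.items()
    }


# (merchant, category) pairs copied from the dataset preview shown earlier.
pairs = [
    ("M348934600", "transportation"),
    ("M348934600", "transportation"),
    ("M1823072687", "transportation"),
    ("M348934600", "transportation"),
    ("M1535107174", "wellnessandbeauty"),
]
confidence = rule_confidence(pairs)
print(confidence)
```

Values close to 1.0 for every merchant would indicate that the category is largely determined by the merchant, which is the kind of pair where treating the two fields as one can make sense.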

#### Run the recommended workflow to get a synthetic dataset


In [7]:
rec_actions = recommender_output.actions
save = ra.DatasetSave(name="synthetic")

# use recommended actions in a Rockfish workflow
builder = rf.WorkflowBuilder()
builder.add_path(dataset, *rec_actions, save)

# run the Rockfish workflow
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")

Workflow: 3Gv9c3qdJMW8XzIlW9KDjj


View logs for the running workflow:


In [8]:
async for log in workflow.logs():
    print(log)

2025-02-07T01:06:09Z dataset-load: INFO Downloading dataset '3MJPgpwtdI105YLiyf5sL1'
2025-02-07T01:06:09Z dataset-load: INFO Downloaded dataset '3MJPgpwtdI105YLiyf5sL1' with 50000 rows
2025-02-07T01:06:10Z train-time-gan: WARN The `sessions` parameter is deprecated in the train config, set it in the generate configuration.
2025-02-07T01:06:15Z train-time-gan: INFO Starting DG training job
2025-02-07T01:06:26Z train-time-gan: INFO Epoch 1 completed.
2025-02-07T01:06:37Z train-time-gan: INFO Epoch 2 completed.
2025-02-07T01:06:48Z train-time-gan: INFO Epoch 3 completed.
2025-02-07T01:06:59Z train-time-gan: INFO Epoch 4 completed.
2025-02-07T01:07:10Z train-time-gan: INFO Epoch 5 completed.
2025-02-07T01:07:21Z train-time-gan: INFO Epoch 6 completed.
2025-02-07T01:07:32Z train-time-gan: INFO Epoch 7 completed.
2025-02-07T01:07:43Z train-time-gan: INFO Epoch 8 completed.
2025-02-07T01:07:53Z train-time-gan: INFO Epoch 9 completed.
2025-02-07T01:08:05Z train-time-gan: INFO Epoch 10 complete

Download and view the synthetic dataset locally:


In [12]:
syn = None
async for sds in workflow.datasets():
    syn = await sds.to_local(conn)
syn.to_pandas()

Unnamed: 0,timestamp,amount,age,gender,fraud,merchant,category,session_key
0,2023-01-01 12:57:08.160,82.365322,3,F,1,M1748431652,wellnessandbeauty,0.0
1,2023-01-01 08:34:43.703,59.190336,1,M,0,M348934600,transportation,1.0
2,2023-01-02 18:22:12.907,54.033554,1,M,0,M348934600,transportation,1.0
3,2023-01-04 07:21:54.332,47.782675,1,M,0,M348934600,transportation,1.0
4,2023-01-05 23:00:27.081,45.394337,1,M,0,M348934600,transportation,1.0
...,...,...,...,...,...,...,...,...
2557,2023-01-16 18:46:23.662,34.619232,2,M,0,M348934600,transportation,199.0
2558,2023-01-18 10:25:54.740,34.352470,2,M,0,M348934600,transportation,199.0
2559,2023-01-20 01:56:54.299,34.192381,2,M,0,M348934600,transportation,199.0
2560,2023-01-21 16:25:18.600,34.115688,2,M,0,M348934600,transportation,199.0


### Evaluate the synthetic dataset


In [13]:
# Helper function `get_fidelity_score()` that calculates the marginal distribution score.

import copy


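def get_fidelity_score(source, source_dataset_properties, syn):
    # Work on copies so the caller's datasets are left unchanged.
    source = copy.deepcopy(source)
    syn = copy.deepcopy(syn)

    # Drop the session key from the source dataset before scoring.
    columns_to_drop = [source_dataset_properties.session_key]
    source.table = source.table.drop_columns(columns_to_drop)

    # The synthetic dataset stores its session key in a "session_key" column.
    columns_to_drop = ["session_key"]
    syn.table = syn.table.drop_columns(columns_to_drop)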

    categorical_measurements = source_dataset_properties.filter_fields(
        ftype=FieldType.MEASUREMENT, etype=EncoderType.CATEGORICAL
    )

    return marginal_dist_score(
        source,
        syn,
        metadata=source_dataset_properties.metadata_fields,
        other_categorical=categorical_measurements,
    )

In [15]:
get_fidelity_score(
    source=dataset, source_dataset_properties=dataset_properties, syn=syn
)

0.7060827985906382
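For intuition about what a marginal distribution comparison measures, the sketch below compares the frequency of each category in one field between a source and a synthetic column, using one minus the total variation distance. This is a swapped-in, simplified metric for illustration; it is an assumption and not the definition of `marginal_dist_score`, and `marginal_similarity()` is a hypothetical helper.

```python
from collections import Counter


def marginal_similarity(source_values, synthetic_values):
    """Return 1 - total variation distance between two categorical marginals.

    1.0 means identical category frequencies; 0.0 means no shared categories.
    Both inputs must be non-empty.
    """
    src, syn = Counter(source_values), Counter(synthetic_values)
    n_src, n_syn = sum(src.values()), sum(syn.values())
    categories = set(src) | set(syn)
    # Counter returns 0 for categories missing from one side.
    tvd = 0.5 * sum(abs(src[c] / n_src - syn[c] / n_syn) for c in categories)
    return 1.0 - tvd


source_gender = ["M", "F", "M", "F"]     # toy values
synthetic_gender = ["M", "M", "M", "F"]  # toy values
print(marginal_similarity(source_gender, synthetic_gender))  # 0.75
```

A per-field score like this can help locate which columns pull an aggregate fidelity score down.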

### Next Steps

As you just saw, the onboarding workflow is a good starting point to get to a synthetic dataset quickly.

You can now modify this workflow according to your requirements to get your final synthetic dataset!

The following pages in the Rockfish documentation will be useful for this purpose:

1. Adding more steps (i.e. Rockfish actions) to a Rockfish workflow: https://docs142.rockfish.ai/sdk-overview.html#actions-and-workflows
2. Hyperparameters you can change to improve the performance of Rockfish models: https://docs142.rockfish.ai/models.html
3. Using more metrics and plots to evaluate your synthetic dataset: https://docs142.rockfish.ai/data-eval.html
