This tutorial walks you through onboarding a dataset for synthetic data generation using the Rockfish Onboarding Module.

We will cover the following:

- Preparing your dataset for synthetic data generation.
- Using the Rockfish Recommendation Engine to automatically select a suitable model for training, along with the key configurations and settings required for onboarding.
- Generating synthetic data and evaluating it with the Rockfish Synthetic Data Assessor, which helps you improve the quality of your synthetic datasets.


### Install and Import Rockfish SDK


In [None]:
%%capture
%pip install -U 'rockfish[labs]' -f 'https://docs.rockfish.ai/packages/index.html'

In [None]:
import rockfish as rf
import rockfish.actions as ra
from rockfish.labs.dataset_properties import (
    DatasetPropertyExtractor,
    FieldType,
    EncoderType,
)
from rockfish.labs.steps import Recommender
from rockfish.labs.metrics import marginal_dist_score
from rockfish.labs.sda import SDA

### Connect to the Rockfish Platform

❗❗ Replace `API_KEY` and `API_URL` with your own values.


In [None]:
api_key = "API_KEY"
api_url = "API_URL"

conn = rf.Connection.remote(api_url, api_key)
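
Hardcoding an API key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming you export two environment variables before starting the notebook: the names `ROCKFISH_API_URL` and `ROCKFISH_API_KEY` and the helper `load_credentials` are hypothetical choices for this example and are not defined by the Rockfish SDK.

```python
import os


def load_credentials(env=None):
    """Return (api_url, api_key) read from environment variables.

    ROCKFISH_API_URL and ROCKFISH_API_KEY are hypothetical names chosen
    for this sketch; the Rockfish SDK does not define them.
    """
    env = os.environ if env is None else env
    required = ("ROCKFISH_API_URL", "ROCKFISH_API_KEY")
    # Fail early with a clear message instead of sending an empty key.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["ROCKFISH_API_URL"], env["ROCKFISH_API_KEY"]
```

The returned pair can then be passed to `rf.Connection.remote(api_url, api_key)` exactly as in the cell above.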

### Load the Dataset

Other data formats are also supported; refer to the documentation for details.


In [None]:
%%capture
!wget --no-clobber https://docs.rockfish.ai/tutorials/finance.csv
dataset = rf.Dataset.from_csv("finance", "finance.csv")

In [None]:
dataset.to_pandas()

Unnamed: 0,customer,age,gender,merchant,category,amount,fraud,timestamp
0,C1093826151,4,M,M348934600,transportation,4.55,0,2023-01-01
1,C575345520,2,F,M348934600,transportation,76.67,0,2023-01-01
2,C1787537369,2,M,M1823072687,transportation,48.02,0,2023-01-01
3,C1732307957,5,F,M348934600,transportation,55.06,0,2023-01-01
4,C842799656,1,F,M348934600,transportation,25.62,0,2023-01-01
...,...,...,...,...,...,...,...,...
49995,C1971105040,3,M,M348934600,transportation,67.91,0,2023-01-20
49996,C51444479,3,M,M348934600,transportation,32.27,0,2023-01-20
49997,C1096642744,5,M,M1535107174,wellnessandbeauty,149.70,0,2023-01-20
49998,C1166683343,2,F,M1823072687,transportation,24.78,0,2023-01-20


### Onboard the dataset onto Rockfish

The onboarding workflow is a good starting point to get to a synthetic version of your dataset quickly.

To ensure optimal synthetic data generation, it's crucial to provide domain-specific information related to your dataset. This helps Rockfish’s Recommendation Engine tailor the workflow to your specific needs.
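
In this tutorial, the domain-specific information is the session key (`customer`) and the metadata fields (`age`, `gender`). Before onboarding, it can help to sanity-check those choices. The sketch below assumes that metadata fields are meant to stay constant within a session, and flags sessions where they do not. The helper `find_inconsistent_metadata` is hypothetical and uses only the standard library; the third sample row is made up to demonstrate a violation.

```python
from collections import defaultdict


def find_inconsistent_metadata(rows, session_key, metadata_fields):
    """Return {session: {field: sorted distinct values}} for every session
    in which a metadata field takes more than one value."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for field in metadata_fields:
            seen[row[session_key]][field].add(row[field])

    problems = {}
    for session, fields in seen.items():
        bad = {f: sorted(values) for f, values in fields.items() if len(values) > 1}
        if bad:
            problems[session] = bad
    return problems


rows = [
    {"customer": "C1093826151", "age": "4", "gender": "M", "amount": 4.55},
    {"customer": "C575345520", "age": "2", "gender": "F", "amount": 76.67},
    # Made-up row: same customer as above but a different age.
    {"customer": "C575345520", "age": "3", "gender": "F", "amount": 10.00},
]
problems = find_inconsistent_metadata(rows, "customer", ["age", "gender"])
print(problems)  # {'C575345520': {'age': ['2', '3']}}
```

An empty result suggests the chosen fields behave like per-session attributes. A non-empty result suggests a field may be better treated as a measurement.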


In [None]:
dataset_properties = DatasetPropertyExtractor(
    dataset,
    session_key="customer",
    metadata_fields=["age", "gender"],
    additional_property_keys=["association_rules"],
).extract()
recommender_output = Recommender(dataset_properties).run()
print(recommender_output.report)

# _________________________________________________________________________
#
# RECOMMENDED CONFIGURATIONS
#
# (Remove or change any actions or configurations that are inappropriate
#  for your use case, or add missing ones)
# _________________________________________________________________________


We detected a timeseries dataset with the following properties:
Dimensions of dataset: (50000 x 8)
Metadata fields: ['age', 'gender']
Measurement fields: ['merchant', 'amount', 'category', 'fraud']
Timestamp field: timestamp
Session key field: customer
Number of sessions: 3765

# _________________________________________________________________________
#
# ~~~~~ Pre-processing recommendations ~~~~~
# _________________________________________________________________________



# _________________________________________________________________________
#
# ~~~~~ Model recommendations ~~~~~
# _________________________________________________________________________


We recommend using the T

#### Run the recommended workflow to get a synthetic dataset


In [None]:
rec_actions = recommender_output.actions
save = ra.DatasetSave({"name": "synthetic"})

# use recommended actions in a Rockfish workflow
builder = rf.WorkflowBuilder()
builder.add_path(dataset, *rec_actions, save)

# run the Rockfish workflow
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")

Workflow: DrJV04kpYwNU9KrNQ0YkS


View logs for the running workflow:


In [None]:
async for log in workflow.logs():
    print(log)

2024-08-15T17:15:47Z dataset-load: INFO Loading dataset '7uDOtiGmtdbZaJvid64CB5' with 50000 rows
2024-08-15T17:15:53Z train-time-gan: INFO Starting DG training job
2024-08-15T17:16:39Z train-time-gan: INFO Epoch 1 completed.
2024-08-15T17:17:25Z train-time-gan: INFO Epoch 2 completed.
2024-08-15T17:18:14Z train-time-gan: INFO Epoch 3 completed.
2024-08-15T17:19:01Z train-time-gan: INFO Epoch 4 completed.
2024-08-15T17:19:45Z train-time-gan: INFO Epoch 5 completed.
2024-08-15T17:20:29Z train-time-gan: INFO Epoch 6 completed.
2024-08-15T17:21:16Z train-time-gan: INFO Epoch 7 completed.
2024-08-15T17:22:04Z train-time-gan: INFO Epoch 8 completed.
2024-08-15T17:22:52Z train-time-gan: INFO Epoch 9 completed.
2024-08-15T17:23:38Z train-time-gan: INFO Epoch 10 completed.
2024-08-15T17:23:44Z train-time-gan: INFO Training completed. Uploaded model 1e9ec457-5b2b-11ef-8a5e-066bb3496981
2024-08-15T17:23:44Z generate-time-gan: INFO Downloading model with model_id='1e9ec457-5b2b-11ef-8a5e-066bb3496

Download and view the synthetic dataset locally:


In [None]:
syn = None
async for sds in workflow.datasets():
    syn = await sds.to_local(conn)
syn.to_pandas()

Unnamed: 0,timestamp,amount,age,gender,merchant,category,fraud,session_key
0,2023-01-02 18:00:41.577,83.476025,5,M,M1823072687,transportation,0,0.0
1,2023-01-04 00:51:08.292,76.195101,5,M,M348934600,transportation,0,0.0
2,2023-01-05 05:12:23.132,67.294532,5,M,M1823072687,transportation,0,0.0
3,2023-01-06 06:55:59.081,60.182416,5,M,M1823072687,transportation,0,0.0
4,2023-01-07 04:27:41.559,57.635280,5,M,M1823072687,transportation,0,0.0
...,...,...,...,...,...,...,...,...
2937,2023-01-10 09:11:36.664,120.864914,1,F,M348934600,transportation,0,199.0
2938,2023-01-10 19:03:19.399,121.203457,1,F,M348934600,transportation,0,199.0
2939,2023-01-11 06:36:25.967,117.667182,1,F,M348934600,transportation,0,199.0
2940,2023-01-11 22:20:04.651,111.463604,1,F,M348934600,transportation,0,199.0


### Evaluate the synthetic dataset


In [None]:
# Helper function `get_fidelity_score()`: computes the marginal distribution score.

import copy


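def get_fidelity_score(source, source_dataset_properties, syn):
    """Return the marginal distribution score between source and synthetic data."""
    # Work on copies so the caller's datasets are left unmodified.
    source = copy.deepcopy(source)
    syn = copy.deepcopy(syn)

    # Drop the session key column from the source dataset before scoring.
    columns_to_drop = [source_dataset_properties.session_key]
    source.table = source.table.drop_columns(columns_to_drop)

    # The synthetic dataset names its session column "session_key".
    columns_to_drop = ["session_key"]
    syn.table = syn.table.drop_columns(columns_to_drop)

    # Categorical measurement fields are passed to the metric explicitly.
    categorical_measurements = source_dataset_properties.filter_fields(
        ftype=FieldType.MEASUREMENT, etype=EncoderType.CATEGORICAL
    )

    return marginal_dist_score(
        source,
        syn,
        metadata=source_dataset_properties.metadata_fields,
        other_categorical=categorical_measurements,
    )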

In [None]:
get_fidelity_score(
    source=dataset, source_dataset_properties=dataset_properties, syn=syn
)

0.6797544933180476
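
To build intuition for what a marginal distribution comparison measures, here is a minimal sketch that compares one categorical column in the source and synthetic data using total variation distance. This is an illustrative stand-in and is not claimed to be the formula `marginal_dist_score` uses; the function name `categorical_similarity` and the sample values are hypothetical.

```python
from collections import Counter


def categorical_similarity(source_values, synthetic_values):
    """Return 1 - total variation distance between two categorical samples.

    1.0 means the marginal distributions are identical; 0.0 means they
    share no categories at all.
    """
    if not source_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    p = Counter(source_values)
    q = Counter(synthetic_values)
    n_p, n_q = len(source_values), len(synthetic_values)
    # Counter returns 0 for categories missing from one of the samples.
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))
    return 1.0 - tvd


# Hypothetical samples: 80/20 split in the source, 90/10 in the synthetic data.
source_category = ["transportation"] * 8 + ["wellnessandbeauty"] * 2
synthetic_category = ["transportation"] * 9 + ["wellnessandbeauty"] * 1
score = categorical_similarity(source_category, synthetic_category)
print(round(score, 2))  # 0.9
```

A per-column view like this can help locate which fields contribute most to a low overall score.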

### Next Steps

As you just saw, the onboarding workflow is a good starting point to get to a synthetic dataset quickly.

You can now modify this workflow according to your requirements to get your final synthetic dataset!

The following pages in the Rockfish documentation will be useful for this purpose:

1. Adding more steps (i.e. Rockfish actions) to a Rockfish workflow: https://docs142.rockfish.ai/sdk-overview.html#actions-and-workflows
2. Hyperparameters you can change to improve the performance of Rockfish models: https://docs142.rockfish.ai/models.html
3. Using more metrics and plots to evaluate your synthetic dataset: https://docs142.rockfish.ai/data-eval.html
