In this tutorial, we walk through setting up an end-to-end, always-on workflow that continuously ingests data.

We will cover the following:

- Preparing your dataset for synthetic data generation.
- Using the Rockfish Recommendation Engine to automatically determine the most suitable model for training, along with the key configurations and settings required for successful onboarding.
- Generating and then evaluating synthetic data using the Rockfish Synthetic Data Assessor, which will help you improve the quality of your synthetic datasets.
- Setting up an always-on workflow using the settings generated from the onboarding process.
- Applying custom labels to the models that are trained by the workflow.
- Searching for a previously trained model in Rockfish's model store.
- Using the model to generate synthetic data.


### Install and Import Rockfish SDK


In [1]:
%%capture
%pip install -U 'rockfish[labs]' -f 'https://packages.rockfish.ai'

In [2]:
import rockfish as rf
import rockfish.actions as ra
from rockfish.labs.dataset_properties import (
    DatasetPropertyExtractor,
    FieldType,
    EncoderType,
)
from rockfish.labs.steps import Recommender
from rockfish.labs.metrics import marginal_dist_score
from rockfish.labs.sda import SDA

import time

### Connect to the Rockfish Platform

❗❗ Replace `API_KEY` with your API key and, if needed, `api_url` with the URL of your Rockfish deployment.


In [3]:
api_key = "API_KEY"
api_url = "https://api.rockfish.ai"

conn = rf.Connection.remote(api_url, api_key)

# 1. Onboard the dataset onto Rockfish


### Load the Dataset

Other data formats are also supported; refer to the documentation for details.


In [4]:
%%capture
!wget --no-clobber https://docs.rockfish.ai/tutorials/finance-1.csv
dataset = rf.Dataset.from_csv("finance", "finance-1.csv")



In [5]:
dataset.to_pandas()

Unnamed: 0,customer,age,gender,merchant,category,amount,fraud,timestamp
0,C100045114,4,M,M348934600,transportation,35.13,0,2023-01-01 00:00:00.000000000
1,C100045114,4,M,M348934600,transportation,27.63,0,2023-01-01 08:00:00.000000000
2,C100045114,4,M,M348934600,transportation,13.46,0,2023-01-01 16:00:00.000000000
3,C100045114,4,M,M348934600,transportation,28.86,0,2023-01-02 00:00:00.000000000
4,C100045114,4,M,M151143676,barsandrestaurants,64.99,0,2023-01-02 08:00:00.000000000
...,...,...,...,...,...,...,...,...
8774,C1335108214,3,M,M348934600,transportation,62.58,0,2023-01-09 15:53:05.114045618
8775,C1335108214,3,M,M348934600,transportation,11.99,0,2023-01-09 23:53:05.114045618
8776,C1335108214,3,M,M348934600,transportation,18.85,0,2023-01-10 07:53:05.114045618
8777,C1335108214,3,M,M348934600,transportation,41.86,0,2023-01-10 15:53:05.114045618


### Onboard the dataset onto Rockfish

The onboarding workflow is a good starting point to get to a synthetic version of your dataset quickly.

To ensure optimal synthetic data generation, it's crucial to provide domain-specific information related to your dataset. This helps Rockfish’s Recommendation Engine tailor the workflow to your specific needs.
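
As a rough guide when choosing these values: the session key groups the rows that belong to one entity (here, a customer), and this sketch assumes that metadata fields are meant to be per-session attributes that do not change within a session. The check below is plain Python rather than part of the Rockfish SDK, the helper name `constant_within_session` is hypothetical, and the rows are a small subset copied from the dataset preview above.

```python
from collections import defaultdict

# A few rows copied from the dataset preview above (subset of columns).
rows = [
    {"customer": "C100045114", "age": "4", "gender": "M", "amount": "35.13"},
    {"customer": "C100045114", "age": "4", "gender": "M", "amount": "27.63"},
    {"customer": "C1335108214", "age": "3", "gender": "M", "amount": "62.58"},
    {"customer": "C1335108214", "age": "3", "gender": "M", "amount": "11.99"},
]


def constant_within_session(rows, session_key, field):
    """Return True if `field` takes a single value inside every session."""
    values_per_session = defaultdict(set)
    for row in rows:
        values_per_session[row[session_key]].add(row[field])
    return all(len(values) == 1 for values in values_per_session.values())


# Fields that stay constant per customer are candidates for metadata;
# fields that vary from row to row behave like measurements.
for field in ("age", "gender", "amount"):
    print(field, constant_within_session(rows, "customer", field))
```

On these rows, `age` and `gender` are constant per customer while `amount` is not, which matches the choice of `metadata_fields=["age", "gender"]` in the next cell.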


In [6]:
dataset_properties = DatasetPropertyExtractor(
    dataset,
    session_key="customer",
    metadata_fields=["age", "gender"],
).extract()
recommender_output = Recommender(dataset_properties).run()
print(recommender_output.report)

# _________________________________________________________________________
#
# RECOMMENDED CONFIGURATIONS
#
# (Remove or change any actions or configurations that are inappropriate
#  for your use case, or add missing ones)
# _________________________________________________________________________


We detected a timeseries dataset with the following properties:
Dimensions of dataset: (8779 x 8)
Metadata fields: ['age', 'gender']
Measurement fields: ['merchant', 'category', 'amount', 'fraud']
Timestamp field: timestamp
Session key field: customer
Number of sessions: 658

# _________________________________________________________________________
#
# ~~~~~ Pre-processing recommendations ~~~~~
# _________________________________________________________________________



# _________________________________________________________________________
#
# ~~~~~ Model recommendations ~~~~~
# _________________________________________________________________________


We recommend using the Tim

#### Run the recommended workflow to get a synthetic dataset


In [7]:
rec_actions = recommender_output.actions
save = ra.DatasetSave(name="synthetic")

# use recommended actions in a Rockfish workflow
builder = rf.WorkflowBuilder()
builder.add_path(dataset, *rec_actions, save)

# run the Rockfish workflow
pre_workflow = await builder.start(conn)
print(f"Workflow: {pre_workflow.id()}")

Workflow: 2EryeAqlG2x2PHFOol5q5f


View logs for the running workflow:


In [8]:
async for log in pre_workflow.logs():
    print(log)

2025-02-07T01:02:35Z dataset-load: INFO Loading dataset '5yz5eoxezE3DpVcJbLr0pb' with 8779 rows
2025-02-07T01:02:35Z train-time-gan: WARN Unsafe time cast on timestamp
2025-02-07T01:02:36Z train-time-gan: INFO Starting DG training job
2025-02-07T01:02:43Z train-time-gan: INFO Epoch 1 completed.
2025-02-07T01:02:49Z train-time-gan: INFO Epoch 2 completed.
2025-02-07T01:02:56Z train-time-gan: INFO Epoch 3 completed.
2025-02-07T01:03:02Z train-time-gan: INFO Epoch 4 completed.
2025-02-07T01:03:09Z train-time-gan: INFO Epoch 5 completed.
2025-02-07T01:03:16Z train-time-gan: INFO Epoch 6 completed.
2025-02-07T01:03:22Z train-time-gan: INFO Epoch 7 completed.
2025-02-07T01:03:28Z train-time-gan: INFO Epoch 8 completed.
2025-02-07T01:03:34Z train-time-gan: INFO Epoch 9 completed.
2025-02-07T01:03:40Z train-time-gan: INFO Epoch 10 completed.
2025-02-07T01:03:42Z train-time-gan: INFO Training completed. Uploaded model 5f1d0c7d-e4ef-11ef-8b30-b6c89c2b7480
2025-02-07T01:03:42Z generate-time-gan: 

Download and view the synthetic dataset locally:


In [9]:
syn = await pre_workflow.datasets().last()
syn = await syn.to_local(conn)
syn.to_pandas()

Unnamed: 0,timestamp,amount,age,gender,merchant,category,fraud,session_key
0,2023-01-05 15:14:19.306,16.613481,4,M,M85975013,home,0,0.0
1,2023-01-05 23:53:21.843,33.693912,U,M,M348934600,otherservices,0,1.0
2,2023-01-06 05:47:57.156,30.031166,U,M,M348934600,transportation,0,1.0
3,2023-01-06 11:56:29.138,26.455817,U,M,M348934600,transportation,0,1.0
4,2023-01-06 18:44:22.751,22.902034,U,M,M348934600,transportation,0,1.0
...,...,...,...,...,...,...,...,...
3519,2023-01-06 19:29:35.328,1512.302483,3,M,M1823072687,transportation,0,199.0
3520,2023-01-07 03:43:06.071,1402.310192,3,M,M1823072687,transportation,0,199.0
3521,2023-01-07 11:21:29.069,1294.213933,3,M,M1823072687,transportation,0,199.0
3522,2023-01-07 18:16:48.542,1157.108533,3,M,M1823072687,transportation,0,199.0


### Evaluate the synthetic dataset


In [10]:
# Helper function `get_fidelity_score()` to calculate the marginal distribution score

import copy


def get_fidelity_score(source, source_dataset_properties, syn):
    source = copy.deepcopy(source)
    syn = copy.deepcopy(syn)

    columns_to_drop = [source_dataset_properties.session_key]
    source.table = source.table.drop_columns(columns_to_drop)

    columns_to_drop = ["session_key"]
    syn.table = syn.table.drop_columns(columns_to_drop)

    categorical_measurements = source_dataset_properties.filter_fields(
        ftype=FieldType.MEASUREMENT, etype=EncoderType.CATEGORICAL
    )

    return marginal_dist_score(
        source,
        syn,
        metadata=source_dataset_properties.metadata_fields,
        other_categorical=categorical_measurements,
    )

In [11]:
get_fidelity_score(
    source=dataset, source_dataset_properties=dataset_properties, syn=syn
)

0.7292203197330803
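
The exact formula behind `marginal_dist_score` is not described in this tutorial. For intuition only, one common way to compare the marginal distribution of a single categorical field in two datasets is one minus the total variation distance between the category frequencies. The sketch below illustrates that idea; it is not Rockfish's implementation, the function name `categorical_similarity` is hypothetical, and the values are toy data.

```python
from collections import Counter


def categorical_similarity(real_values, synthetic_values):
    """Return 1 - total variation distance between two non-empty samples."""
    real_counts = Counter(real_values)
    syn_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(syn_counts)
    n_real = len(real_values)
    n_syn = len(synthetic_values)
    # Counter returns 0 for categories that are missing from one sample.
    tvd = 0.5 * sum(
        abs(real_counts[c] / n_real - syn_counts[c] / n_syn) for c in categories
    )
    return 1.0 - tvd


# Toy values for illustration only, not taken from the tutorial datasets.
real = ["transportation"] * 8 + ["home"] * 2
synthetic = ["transportation"] * 6 + ["home"] * 2 + ["otherservices"] * 2

print(categorical_similarity(real, real))       # identical samples
print(categorical_similarity(real, synthetic))  # some mass moved to a new category
```

Identical samples score 1.0, and the score drops as category frequencies diverge, so under this measure a higher value means the synthetic marginal is closer to the real one.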

### Since the actions look good, we can use them for setting up the always-on workflow.


In [12]:
rec_actions

[<rockfish.actions.dg.TrainTimeGAN at 0x1767987d0>,
 <rockfish.actions.dg.GenerateTimeGAN at 0x1771bca40>]

In [13]:
train_actions = rec_actions[:-1]
generate_actions = rec_actions[-1:]

# 2. Set up an always-on workflow for continuous data ingestion


### Employ the DatastreamLoad action to keep the workflow always on


In [14]:
# reduce the batch size for the following ingested data stream as the batch
# size should be smaller than the number of sessions in the dataset
train_actions[0].config().doppelganger.batch_size = 14
stream = ra.DatastreamLoad()

builder = rf.WorkflowBuilder()
builder.add(stream, alias="input")
builder.add_path(*train_actions, parents=["input"], alias="train_actions")
ingest_workflow = await builder.start(conn)
print(f"Ingestion Workflow ID: {ingest_workflow.id()}")

Ingestion Workflow ID: 1HD4Hj5x8lBvoPgUSCTalW
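
The comment in the cell above notes that the batch size should be smaller than the number of sessions in the data. A minimal way to check this before sending a file, sketched in plain Python with a hypothetical helper `check_batch_size` (not an SDK function), is to count the distinct session keys:

```python
def check_batch_size(session_ids, batch_size):
    """Return the number of distinct sessions, or raise if batch_size is too large."""
    n_sessions = len(set(session_ids))
    if batch_size >= n_sessions:
        raise ValueError(
            f"batch_size={batch_size} must be smaller than the "
            f"{n_sessions} sessions in this data"
        )
    return n_sessions


# Toy session keys; the customer IDs are copied from the dataset preview.
session_ids = [
    "C100045114",
    "C100045114",
    "C1335108214",
    "C1335108214",
    "C1335108214",
]
print(check_batch_size(session_ids, batch_size=1))
```

With a real file, `session_ids` would be the values of the session key column (`customer` in this tutorial).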


### Write the data files to the workflow stream

- Each input is a dataset.
- Each output is a trained model saved to the model store.


### Download the sample files for the datastream workflow


In [15]:
%%capture
!wget --no-clobber https://docs.rockfish.ai/tutorials/finance-2.csv
!wget --no-clobber https://docs.rockfish.ai/tutorials/finance-3.csv
!wget --no-clobber https://docs.rockfish.ai/tutorials/finance-4.csv



### Replace the workflow ID with the ID of the workflow that was just set up.

This also allows you to run the data-ingestion service in an independent process.


In [16]:
# Retrieve the running workflow by its ID; there is no need to rebuild it
workflow_id = ingest_workflow.id()  # or paste a workflow ID string here
workflow = await conn.get_workflow(workflow_id)

In [17]:
import asyncio

for file_num in range(2, 5):
    data = rf.Dataset.from_csv("finance", f"finance-{file_num}.csv")
    print(f"Writing finance-{file_num} to datastream...")
    # "input" is the alias given to the datastream when the workflow was built
    await workflow.write_datastream("input", data)
    # pause between files without blocking the notebook's event loop
    await asyncio.sleep(10)
# close the stream once all files have been written
await workflow.close_datastream("input")

Writing finance-2 to datastream...
Writing finance-3 to datastream...
Writing finance-4 to datastream...
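
If the ingestion loop is moved into an independent process, it can help to separate the loop logic from the SDK calls. The sketch below is one possible structure, not a Rockfish API: `feed_stream` is a hypothetical helper that receives the write and close operations as async callables, and the demo uses fake callables in place of `workflow.write_datastream` and `workflow.close_datastream`. It is written as a plain script; inside a notebook, use `await _demo()` instead of `asyncio.run(_demo())`.

```python
import asyncio


async def feed_stream(write, close, batches, pause_seconds=0.0):
    """Send each batch through `write`, pausing in between, then call `close`."""
    sent = 0
    try:
        for batch in batches:
            await write(batch)
            sent += 1
            # Non-blocking pause between batches.
            await asyncio.sleep(pause_seconds)
    finally:
        # Close the stream even if a write fails part-way through.
        await close()
    return sent


async def _demo():
    received = []
    closed = []

    # Stand-ins for the real datastream calls, for illustration only.
    async def fake_write(batch):
        received.append(batch)

    async def fake_close():
        closed.append(True)

    batches = ["finance-2", "finance-3", "finance-4"]
    sent = await feed_stream(fake_write, fake_close, batches)
    return sent, received, closed


sent, received, closed = asyncio.run(_demo())
print(sent, received, closed)
```

Closing in a `finally` block is a design choice here: it means a failed write does not leave the stream open. Whether that is the desired behavior depends on how you want the workflow to react to partial input.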


In [18]:
# check the status of the workflow
async for log in workflow.logs():
    print(log)

2025-02-07T01:03:47Z train_actions: WARN Unsafe time cast on timestamp
2025-02-07T01:03:48Z train_actions: INFO Starting DG training job
2025-02-07T01:03:57Z train_actions: INFO Epoch 1 completed.
2025-02-07T01:04:06Z train_actions: INFO Epoch 2 completed.
2025-02-07T01:04:16Z train_actions: INFO Epoch 3 completed.
2025-02-07T01:03:47Z input: INFO Read batch with 8850 rows from datastream
2025-02-07T01:03:58Z input: INFO Read batch with 9080 rows from datastream
2025-02-07T01:04:08Z input: INFO Read batch with 6979 rows from datastream
2025-02-07T01:04:25Z train_actions: INFO Epoch 4 completed.
2025-02-07T01:04:36Z train_actions: INFO Epoch 5 completed.
2025-02-07T01:04:45Z train_actions: INFO Epoch 6 completed.
2025-02-07T01:04:55Z train_actions: INFO Epoch 7 completed.
2025-02-07T01:05:04Z train_actions: INFO Epoch 8 completed.
2025-02-07T01:05:13Z train_actions: INFO Epoch 9 completed.
2025-02-07T01:05:21Z train_actions: INFO Epoch 10 completed.
2025-02-07T01:05:22Z train_actions: I

### Optional: Add custom labels to the models that are generated

These labels can be used later to filter models based on custom parameters.


In [19]:
# one label per model; this indexing assumes the workflow trained at most three models
usage = ["experimental", "staging", "production"]
i = 0
async for model in conn.list_models(labels={"workflow_id": workflow_id}):
    await model.add_labels(conn, usage=usage[i])
    i += 1
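
The loop above indexes `usage` with a counter, so it would raise an `IndexError` if the query returned more models than there are labels. A minimal sketch of a safer pairing, using hypothetical model IDs and a hypothetical helper `assign_labels` in plain Python:

```python
def assign_labels(model_ids, labels):
    """Pair models with labels in order; return the pairs and any unlabeled models."""
    # zip stops at the shorter sequence, so extra models are never indexed.
    pairs = dict(zip(model_ids, labels))
    unlabeled = list(model_ids[len(labels):])
    return pairs, unlabeled


usage = ["experimental", "staging", "production"]
model_ids = ["model-a", "model-b", "model-c", "model-d"]  # hypothetical IDs

pairs, unlabeled = assign_labels(model_ids, usage)
print(pairs)
print(unlabeled)
```

`zip` does not work directly on an async iterator such as the one returned by `conn.list_models(...)`, so in the notebook the same idea would be expressed by stopping or skipping once the counter reaches `len(usage)`.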

# 3. Generate synthetic data using the trained model


### Provide query parameters to the model store search to retrieve matching models

Use this when the trained models were tagged earlier. Every model also carries a default `workflow_id` label, set to the ID of the workflow that trained it.


In [20]:
async for model in conn.list_models(labels={"usage": "production"}):
    print(model)

Model(id='f9756bb9-e4ef-11ef-8b30-b6c89c2b7480', labels={'usage': 'production', 'workflow_id': '1HD4Hj5x8lBvoPgUSCTalW'}, create_time=datetime.datetime(2025, 2, 7, 1, 8, tzinfo=datetime.timezone.utc), size_bytes=16009216)


### Select a model from the list of queried models and fetch it from remote


In [21]:
model = await rf.Model.from_id(
    conn,
    model.id,  # replace with the ID of the model chosen from the query results
)
print(model)

Model(id='f9756bb9-e4ef-11ef-8b30-b6c89c2b7480', labels={'usage': 'production', 'workflow_id': '1HD4Hj5x8lBvoPgUSCTalW'}, create_time=datetime.datetime(2025, 2, 7, 1, 8, tzinfo=datetime.timezone.utc), size_bytes=16009216)
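
The cell above passes `model.id`, where `model` is whatever the preceding `async for` loop left behind, that is, the last model it yielded. When a query returns several models, it may be clearer to choose one explicitly, for example the most recently created. The sketch below shows that selection on plain dictionaries with hypothetical IDs; the printed `Model` output shows a `create_time` value, but how it is accessed on the SDK object is an assumption not confirmed here.

```python
from datetime import datetime, timezone

# Hypothetical query results, reduced to the fields needed for selection.
candidates = [
    {
        "id": "model-old",
        "create_time": datetime(2025, 2, 7, 1, 5, tzinfo=timezone.utc),
    },
    {
        "id": "model-new",
        "create_time": datetime(2025, 2, 7, 1, 8, tzinfo=timezone.utc),
    },
]

# Pick the candidate with the latest creation time.
newest = max(candidates, key=lambda m: m["create_time"])
print(newest["id"])  # model-new
```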


### Provide the model and the synthesis config to a workflow to generate a synthetic dataset as the output


In [22]:
builder = rf.WorkflowBuilder()
builder.add(model)
builder.add(*generate_actions, parents=[model], alias="gen")
builder.add(ra.DatasetSave(name="syn_data"), parents=["gen"])
workflow = await builder.start(conn)
print(f"Workflow ID: {workflow.id()}")

Workflow ID: 5rjOlzD7ZzFg3PHSluBqkk


In [23]:
async for log in workflow.logs():
    print(log)

2025-02-07T01:08:02Z gen: INFO Downloading model with model_id='f9756bb9-e4ef-11ef-8b30-b6c89c2b7480'...
2025-02-07T01:08:02Z gen: INFO Generating 200 sessions...
2025-02-07T01:08:03Z dataset-save: INFO using field 'session_key' to concatenate tables
2025-02-07T01:08:03Z dataset-save: INFO Saved dataset '5EK43L3e7lZkt5RvMHkfSI' with 3436 rows


In [25]:
syn_data = await workflow.datasets().concat(conn)
syn_data.to_pandas()

Unnamed: 0,timestamp,amount,age,gender,merchant,category,fraud,session_key
0,2023-01-16 10:29:30.608,5769.000968,2,M,M348934600,otherservices,0,0.0
1,2023-01-16 19:42:28.983,5639.624898,2,M,M348934600,transportation,0,0.0
2,2023-01-17 06:36:53.880,5116.755126,2,M,M348934600,transportation,0,0.0
3,2023-01-17 19:10:35.828,4433.340920,2,M,M348934600,transportation,0,0.0
4,2023-01-18 08:47:56.532,3812.356856,2,M,M348934600,transportation,0,0.0
...,...,...,...,...,...,...,...,...
3431,2023-01-26 17:32:19.900,3096.980100,2,M,M348934600,transportation,0,199.0
3432,2023-01-27 06:28:56.743,3049.270808,2,M,M348934600,transportation,0,199.0
3433,2023-01-27 19:22:46.724,3031.661770,2,M,M348934600,transportation,0,199.0
3434,2023-01-28 08:12:34.148,3010.777257,2,M,M348934600,transportation,0,199.0
