In this tutorial, we walk through setting up an end-to-end, always-on workflow that continuously ingests data.

We will cover the following:

* Preparing your dataset for synthetic data generation.
* Using the Rockfish Recommendation Engine to automatically determine the most suitable model for training, along with the key configurations and settings required for successful onboarding.
* Generating and then evaluating synthetic data using the Rockfish Synthetic Data Assessor, which helps you improve the quality of your synthetic datasets.
* Setting up an always-on workflow using the settings generated from the onboarding process.
* Applying custom labels to the models that are trained by the workflow.
* Searching for a previously trained model in Rockfish's model store.
* Using the model to generate synthetic data.

### Install and Import Rockfish SDK

In [None]:
%%capture
%pip install -U 'rockfish[labs]==0.22.0' -f 'https://docs142.rockfish.ai/packages/index.html'

In [None]:
import rockfish as rf
import rockfish.actions as ra
from rockfish.labs.dataset_properties import DatasetPropertyExtractor, FieldType, EncoderType
from rockfish.labs.steps import Recommender
from rockfish.labs.metrics import marginal_dist_score
from rockfish.labs.sda import SDA

import time

### Connect to the Rockfish Platform

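❗❗ Replace `API_KEY` with your API key and make sure the API URL is configured. Note that the cell below builds the connection with `rf.Connection.from_env()`, which takes its settings from the environment rather than from the `api_key` variable.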

In [None]:
api_key = "API_KEY"

conn = rf.Connection.from_env()

# 1. Onboard the dataset onto Rockfish

### Load the Dataset

Other data formats are also supported; refer to the documentation for details.

In [None]:
%%capture
!wget --no-clobber 'https://docs142.rockfish.ai/tutorials/finance.csv'
dataset = rf.Dataset.from_csv("finance", "finance.csv")

In [None]:
dataset.to_pandas()

### Onboard the dataset onto Rockfish

The onboarding workflow is a good starting point to get to a synthetic version of your dataset quickly.

To ensure optimal synthetic data generation, it's crucial to provide domain-specific information related to your dataset. This helps Rockfish’s Recommendation Engine tailor the workflow to your specific needs.

In [None]:
dataset_properties = DatasetPropertyExtractor(
    dataset,
    session_key="customer",
    metadata_fields=["age", "gender"],
    additional_property_keys=["association_rules"]
).extract()
recommender_output = Recommender(dataset_properties).run()
print(recommender_output.report)
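Before choosing `session_key` and `metadata_fields`, it can help to check that the candidate metadata fields really are session-level attributes. The sketch below assumes that a metadata field should hold a single value within each session; that is an assumption made for illustration, not a documented Rockfish requirement. The helper `inconsistent_metadata` and the sample rows are hypothetical and do not use the Rockfish SDK.

```python
from collections import defaultdict


def inconsistent_metadata(rows, session_key, metadata_fields):
    """Return {session: {field: values}} for metadata fields that vary within a session."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for field in metadata_fields:
            seen[row[session_key]][field].add(row[field])

    problems = {}
    for session, fields in seen.items():
        # Assumption: a metadata field should have exactly one value per session.
        bad = {name: values for name, values in fields.items() if len(values) > 1}
        if bad:
            problems[session] = bad
    return problems


# Hypothetical rows shaped like the finance data: one row per transaction.
rows = [
    {"customer": "C1", "age": "3", "gender": "F", "amount": "12.5"},
    {"customer": "C1", "age": "3", "gender": "F", "amount": "7.0"},
    {"customer": "C2", "age": "4", "gender": "M", "amount": "3.2"},
    {"customer": "C2", "age": "5", "gender": "M", "amount": "9.9"},
]

problems = inconsistent_metadata(rows, "customer", ["age", "gender"])
print(problems)
```

In this sample, customer `C2` is reported because `age` takes two values within the same session. With a real dataset, the rows could come from `dataset.to_pandas().to_dict("records")`.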

#### Run the recommended workflow to get a synthetic dataset

In [None]:
rec_actions = recommender_output.actions
save = ra.DatasetSave({"name": "synthetic"})

# use recommended actions in a Rockfish workflow
builder = rf.WorkflowBuilder()
builder.add_path(dataset, *rec_actions, save)

# run the Rockfish workflow
workflow = await builder.start(conn)
print(f"Workflow: {workflow.id()}")

View logs for the running workflow:

In [None]:
async for log in workflow.logs():
    print(log)

Download and view the synthetic dataset locally:

In [None]:
syn = await workflow.datasets().last()
syn = await syn.to_local(conn)
syn.to_pandas()

### Evaluate the synthetic dataset

In [None]:
#@title ##### Define a helper function `get_fidelity_score()` to calculate the marginal distribution score:

import copy

def get_fidelity_score(source, source_dataset_properties, syn):
    source = copy.deepcopy(source)
    syn = copy.deepcopy(syn)

    columns_to_drop = [source_dataset_properties.session_key]
    source.table = source.table.drop_columns(columns_to_drop)

    columns_to_drop = ["session_key"]
    syn.table = syn.table.drop_columns(columns_to_drop)

    categorical_measurements = source_dataset_properties.filter_fields(
        ftype=FieldType.MEASUREMENT, etype=EncoderType.CATEGORICAL
    )

    return marginal_dist_score(
        source,
        syn,
        metadata=source_dataset_properties.metadata_fields,
        other_categorical=categorical_measurements,
    )

In [None]:
get_fidelity_score(
    source=dataset,
    source_dataset_properties=dataset_properties,
    syn=syn
)
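To build intuition for what a marginal distribution score measures, here is a minimal sketch that compares one categorical column between a real and a synthetic sample. It uses one minus the total variation distance, which is a common choice but not necessarily what `marginal_dist_score` computes internally; the function name and sample values are hypothetical.

```python
from collections import Counter


def categorical_marginal_similarity(real_values, syn_values):
    """Return 1 - total variation distance between two empirical distributions."""
    if not real_values or not syn_values:
        raise ValueError("both samples must be non-empty")

    real_counts, syn_counts = Counter(real_values), Counter(syn_values)
    n_real, n_syn = len(real_values), len(syn_values)
    categories = set(real_counts) | set(syn_counts)

    # Total variation distance: half the sum of absolute frequency differences.
    tvd = 0.5 * sum(
        abs(real_counts[c] / n_real - syn_counts[c] / n_syn) for c in categories
    )
    return 1.0 - tvd


real = ["F", "F", "M", "M"]
syn = ["F", "M", "M", "M"]
score = categorical_marginal_similarity(real, syn)
print(score)
```

A score of 1.0 means the two columns have identical category frequencies, and 0.0 means they share no categories at all.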

### Reuse the recommended actions for the always-on workflow

Since the actions look good, we can use them to set up the always-on workflow.

In [None]:
rec_actions

In [None]:
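# All but the last recommended action are used for training
train_actions = rec_actions[:-1]
# The last recommended action is kept (as a one-element list) for generation
generate_actions = rec_actions[-1:]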
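The split above relies on plain list slicing. The sketch below shows the same slicing on placeholder strings; the action names are hypothetical stand-ins, and the real contents of `rec_actions` depend on what the Recommender returns.

```python
# Placeholder strings stand in for Rockfish action objects; the names are hypothetical.
rec_actions = ["preprocess", "train", "generate"]

train_actions = rec_actions[:-1]     # every action except the last
generate_actions = rec_actions[-1:]  # a one-element list holding the last action

print(train_actions)     # ['preprocess', 'train']
print(generate_actions)  # ['generate']
```

Slicing with `[-1:]` rather than indexing with `[-1]` keeps `generate_actions` a list, so it can later be unpacked with `*generate_actions`.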

# 2. Set up an always-on workflow for continuous data ingestion

### Use the `DatastreamLoad` action to keep the workflow always on

In [None]:
stream = ra.DatastreamLoad()

builder = rf.WorkflowBuilder()
builder.add(stream, alias="input")
builder.add_path(*train_actions, parents=["input"], alias="train_actions")
workflow = await builder.start(conn)
print(f'Workflow ID: {workflow.id()}')

 
### Write the data files to the workflow stream

- Each input is a dataset.
- Each output is a trained model stored in the model store.

### Download the sample files for the datastream workflow

In [None]:
%%capture
!wget --no-clobber https://docs142.rockfish.ai/tutorials/finance-1.csv
!wget --no-clobber https://docs142.rockfish.ai/tutorials/finance-2.csv
!wget --no-clobber https://docs142.rockfish.ai/tutorials/finance-3.csv
!wget --no-clobber https://docs142.rockfish.ai/tutorials/finance-4.csv

### Replace the workflow ID with the ID of the workflow that was just set up

Fetching the workflow by ID also lets you run the data-ingestion service in a separate process.

In [None]:
workflow_id = 'workflow ID here'
workflow = await conn.get_workflow(workflow_id)

In [None]:
# Write finance-1.csv through finance-4.csv to the "input" datastream
for file_num in range(1, 5):
    data = rf.Dataset.from_csv('finance', f'finance-{file_num}.csv')
    print(f'Writing finance-{file_num} to datastream...')
    await workflow.write_datastream("input", data)
    time.sleep(10)  # pause between writes
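If the ingestion loop runs alongside other coroutines, a blocking `time.sleep` stalls the event loop between writes. The sketch below shows one possible non-blocking variant using `asyncio.sleep`. The helper `write_files` and the stub `fake_write` are hypothetical; in the tutorial, the stub would be replaced by a coroutine that loads the CSV and calls `workflow.write_datastream("input", data)`.

```python
import asyncio


async def write_files(write, paths, delay_s=0.0):
    """Await `write(path)` for each path in order, pausing between writes."""
    written = []
    for path in paths:
        await write(path)
        written.append(path)
        # asyncio.sleep yields to the event loop instead of blocking it.
        await asyncio.sleep(delay_s)
    return written


async def fake_write(path):
    # Stand-in for loading the file and writing it to the datastream.
    print(f"Writing {path} to datastream...")


paths = [f"finance-{n}.csv" for n in range(1, 5)]
written = asyncio.run(write_files(fake_write, paths))
```

Inside a notebook, where an event loop is already running, use `written = await write_files(fake_write, paths)` instead of `asyncio.run(...)`.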

### Optional: Add custom labels to the models that are generated

These labels can be used later to filter models based on custom parameters.

In [None]:
usage = ['experimental', 'staging', 'production', 'improvement']
i = 0
async for model in conn.list_models(labels={'workflow_id': workflow_id}):
    if i >= len(usage):
        break  # more models than labels; leave the rest unlabeled
    await model.add_labels(conn, usage=usage[i])
    i += 1

# 3. Generate synthetic data using the trained model

 
### Query the model store by label to find matching models

This works if the trained models were tagged earlier. Every model also carries a default `workflow_id` label, which holds the ID of the workflow that trained it.

In [3]:
async for model in conn.list_models(labels={'usage': 'production'}):
    print(model)

Model(id='6a841e23-60da-11ef-b0af-b21fd42e2041', labels={'customer': 'hbo', 'month': 'june', 'workflow_id': '51AoIAxZw3sNM99H2s83wD'}, create_time=datetime.datetime(2024, 8, 22, 23, 1, 7, tzinfo=datetime.timezone.utc), size_bytes=26864128)
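When a label query returns several models, a simple rule for choosing one is to take the most recently created match. The sketch below illustrates that rule on a hypothetical `ModelInfo` stand-in that mirrors the fields printed above (`id`, `labels`, `create_time`); it is not the SDK's `Model` class, and the IDs and timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ModelInfo:
    # Hypothetical stand-in mirroring the printed model fields; not the SDK class.
    id: str
    labels: dict
    create_time: datetime


def newest_matching(models, **wanted):
    """Return the newest model whose labels contain every wanted pair, or None."""
    matches = [
        m for m in models
        if all(m.labels.get(key) == value for key, value in wanted.items())
    ]
    return max(matches, key=lambda m: m.create_time, default=None)


models = [
    ModelInfo("m1", {"usage": "production"}, datetime(2024, 8, 22, tzinfo=timezone.utc)),
    ModelInfo("m2", {"usage": "staging"}, datetime(2024, 8, 24, tzinfo=timezone.utc)),
    ModelInfo("m3", {"usage": "production"}, datetime(2024, 8, 23, tzinfo=timezone.utc)),
]

chosen = newest_matching(models, usage="production")
print(chosen.id)
```

With real results, the models collected from `conn.list_models(...)` could be passed in directly, provided they expose `labels` and `create_time` attributes as the printed output suggests.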


 
### Select a model from the list of queried models and fetch it from remote

In [4]:
model = await rf.Model.from_id(conn, 'model id here of the filtered model after querying')
print(model)

 
### Provide the model and the synthesis config to a workflow to generate a synthetic dataset as the output

In [7]:
builder = rf.WorkflowBuilder()
builder.add(model)
builder.add(*generate_actions, parents=[model], alias='gen')
builder.add(ra.DatasetSave(name='syn_data'), parents=['gen'])
workflow = await builder.start(conn)
print(f'Workflow ID: {workflow.id()}')

Workflow ID: 7azJsUA2r6fovNks6hF93O


In [8]:
syn_data = await workflow.datasets().concat(conn)

In [9]:
syn_data.to_pandas()

Unnamed: 0,sessionStartTimeMs,pageStartTimeMs,lifeSessionSessionDurationMs,lifeSessionPageLoadDurationMs,lifeSessionUserActiveTimeMs,lifeSessionNetworkRequestFailureDurationMs,lifeSessionNetworkRequestFailureCount,lifeSessionEventCount,lifeSessionUserEventCount,intvSessionDurationMs,...,asn,connType,netSpeed,browser,lifeSessionAppCrashCount,lifeSessionPageLoadSuccessCount,lifeSessionPageLoadAttemptCount,intvPageLoadSuccessCount,intvNetworkRequestFailureCount,session_key
0,2024-04-26 05:13:32.883,9751,2973442,1526,3270148,2000,1,150,12,12878,...,1136,15,Cellular,Chrome;Chrome 123.0.0.0,0,6,7,0,1,0.0
1,2024-04-26 04:11:49.457,13790,8978073,4109,119458,2127,0,-8,143,52064,...,50266,14,Cable/DSL,Native App;Native App,0,8,1,1,1,1.0
2,2024-04-26 07:10:24.584,14159,8665729,4109,-125857,2208,0,5,143,55803,...,50266,14,Cable/DSL,Native App;Native App,0,8,1,1,1,1.0
3,2024-04-26 10:07:55.914,14145,8581321,4109,-134053,2268,0,12,143,57064,...,50266,14,Cable/DSL,Native App;Native App,0,8,1,1,1,1.0
4,2024-04-26 13:07:30.336,14344,8334301,4109,-279604,2326,0,12,143,59130,...,50266,14,Cable/DSL,Native App;Native App,0,8,5,1,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10436,2024-05-14 08:13:02.236,4934,4988734,59,97064,2120,1,199,15,33786,...,31615,16,Cable/DSL,Chrome;Chrome 123.0.0.0,0,2,5,0,6,181.0
10437,2024-05-14 11:01:53.791,4951,4976269,56,94949,2119,1,198,15,33606,...,31615,16,Cable/DSL,Chrome;Chrome 123.0.0.0,0,2,5,0,6,181.0
10438,2024-05-14 13:50:44.880,4883,4994239,58,104530,2133,1,199,15,33431,...,31615,16,Cable/DSL,Chrome;Chrome 123.0.0.0,0,2,5,0,6,181.0
10439,2024-05-14 16:39:32.423,5024,4824629,58,70594,2120,1,198,15,33202,...,31615,16,Cable/DSL,Chrome;Chrome 123.0.0.0,0,2,5,0,6,181.0
