# Getting Started with Synthetic Data

*... a tutorial for students at FHNW, written by [Andreas Martin, PhD](https://andreasmartin.ch).*

|[![deepnote](https://deepnote.com/buttons/launch-in-deepnote-small.svg)](https://deepnote.com/launch?url=https%3A%2F%2Fgithub.com%2FAI4BP%2Fainotes%2Fblob%2Fmain%2Fsynthetic-data-getting-started%2Fipynb%2Fpayment-decision-data-generation.ipynb)|[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4BP/ainotes/blob/main/synthetic-data-getting-started/ipynb/payment-decision-data-generation.ipynb)|[![Gitpod](https://img.shields.io/badge/Gitpod-Run%20in%20VS%20Code-908a85?logo=gitpod)](https://gitpod.io/#https://github.com/AI4BP/ainotes/)|[![GitHub.dev](https://img.shields.io/badge/github.dev-Open%20in%20VS%20Code-908a85?logo=github)](https://github.dev/AI4BP/ainotes/blob/main/synthetic-data-getting-started/ipynb/payment-decision-data-generation.ipynb)|
|-|-|-|-|

This tutorial shows a pragmatic way to generate synthetic data for business decisions. It uses NumPy and pandas, two de facto standard Python libraries, to turn domain knowledge into plausible data and to write that data to a CSV file. The generated data is fed into a knowledge-engineered decision model, which labels each record with a decision.

## Use Case
The use case is a decision in a payment process, shown in Fig a: making sure that the right payment method is used.

![](https://github.com/AI4BP/ainotes/raw/main/synthetic-data-getting-started/ipynb/images/Payment-Decision-Overview.png)

**Fig a**: payment process

Because data is missing, a knowledge engineer has codified the rules from experience in a DMN decision table, shown in Fig b.

![](https://github.com/AI4BP/ainotes/raw/main/synthetic-data-getting-started/ipynb/images/Payment-Decision-Definition.png)

**Fig b**: payment decision definition

### 🚧 Main Task
The task now is to generate plausible synthetic data, run it through the decision table and end up with a synthetic data set for training a machine learning model.

## 0. Initialization Configuration
The following cell sets the working directory and locates the model folder. If the folder does not exist locally, the GitHub URL is used instead.

In [1]:
import os

working_dir = os.path.normpath(os.getcwd()+"/../")
url_github = "https://raw.githubusercontent.com/AI4BP/ainotes/main"
root_dir = os.path.normpath(os.getcwd()+"/../../")
model_folder = "synthetic-data-getting-started/modelling"
model_folder = f"{root_dir}/{model_folder}" if os.path.exists(f"{root_dir}/{model_folder}") else f"{url_github}/{model_folder}"
print(model_folder)

/work/synthetic-data-getting-started/modelling


## 1. Controlled-Random Data Generation
The aim here is to create an input data set of a given size (`size_dataset`). The values are drawn randomly, but the draw is controlled by plausible distributions based on empirical knowledge.

The considerations for a plausible distribution are:
- the payments are between CHF 1 and CHF 10'000
- 50 % of the payments range up to CHF 2'000
- 40 % of the payments are between CHF 2'001 and CHF 7'000
- 10 % of the payments are over CHF 7'000
- the means of payment is 50 % invoice, 40 % cash and 10 % card
- 70 % of payments are business and only 30 % private

> An important note: Please choose a small size of the dataset first (`size_dataset = 500`), otherwise the decision engine may have performance problems.
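
The amount distribution can also be read as a two-step draw: first pick a band according to its share, then pick an amount uniformly inside that band. The following is a minimal sketch of that reading, not the code used in this tutorial. It assumes NumPy's `Generator` API, and the names `bands`, `band_index` and `amounts` as well as the seed are chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen only to make the sketch reproducible
size_dataset = 500

# Amount bands as (lowest CHF, highest CHF, share of payments), taken from the list above.
bands = [(1, 2000, 0.5), (2001, 7000, 0.4), (7001, 10000, 0.1)]
shares = [share for _, _, share in bands]

# Step 1: pick a band for every payment according to the shares.
band_index = rng.choice(len(bands), size=size_dataset, p=shares)

# Step 2: draw a whole-franc amount uniformly inside the chosen band (bounds inclusive).
lows = np.array([low for low, _, _ in bands])[band_index]
highs = np.array([high for _, high, _ in bands])[band_index]
amounts = rng.integers(lows, highs, endpoint=True)

print(amounts[:5], amounts.min(), amounts.max())
```

Each amount inside a band ends up with the probability "share divided by band width", which is the same per-value probability as a single weighted draw over all 10'000 amounts.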

In [2]:
size_dataset = 500

In [3]:
import numpy as np
import pandas as pd

# Fall back to a default size if the previous cell was not run.
try: size_dataset
except NameError: size_dataset = 500

# Probability per amount: 50 % spread over CHF 1-2'000, 40 % over CHF 2'001-7'000, 10 % over CHF 7'001-10'000.
amount_probabilities = np.concatenate([
    np.repeat(0.5/2000, 2000),
    np.repeat(0.4/5000, 5000),
    np.repeat(0.1/3000, 3000),
])

df = pd.DataFrame(columns=['Amount', 'Means', 'Business'])
df['Amount'] = np.random.choice(np.arange(1,10001), size=size_dataset, p=amount_probabilities)
df['Means'] = np.random.choice(["Invoice", "Cash", "Card"], size=size_dataset, p=[0.50, 0.40, 0.10])
df['Business'] = np.random.choice([True, False], size=size_dataset, p=[0.70, 0.30])

df

| | Amount | Means | Business |
|-|-|-|-|
| 0 | 1145 | Invoice | True |
| 1 | 1030 | Card | False |
| 2 | 2659 | Cash | True |
| 3 | 431 | Invoice | True |
| 4 | 1103 | Invoice | True |
| ... | ... | ... | ... |
| 495 | 246 | Invoice | True |
| 496 | 1416 | Invoice | True |
| 497 | 305 | Cash | True |
| 498 | 666 | Cash | False |
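
One way to check that the generated data follows the intended distribution is to compare the empirical shares with the targets. The sketch below builds its own stand-in data set (called `check_df` here, with an arbitrary seed) using the same target shares, then summarizes it with `pandas.cut` and `value_counts`. The names and the seed are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility
size_dataset = 500

# Stand-in data drawn with the same target shares as described in the tutorial.
probabilities = np.concatenate([
    np.full(2000, 0.5 / 2000),
    np.full(5000, 0.4 / 5000),
    np.full(3000, 0.1 / 3000),
])
check_df = pd.DataFrame({
    "Amount": rng.choice(np.arange(1, 10001), size=size_dataset, p=probabilities),
    "Means": rng.choice(["Invoice", "Cash", "Card"], size=size_dataset, p=[0.5, 0.4, 0.1]),
    "Business": rng.choice([True, False], size=size_dataset, p=[0.7, 0.3]),
})

# Share of payments per amount band (right edge of each bin is inclusive).
amount_bands = pd.cut(
    check_df["Amount"],
    bins=[0, 2000, 7000, 10000],
    labels=["<= 2000", "2001-7000", "> 7000"],
)
amount_shares = amount_bands.value_counts(normalize=True).sort_index()

# Share per means of payment and share of business payments.
means_shares = check_df["Means"].value_counts(normalize=True)
business_share = check_df["Business"].mean()

print(amount_shares, means_shares, business_share, sep="\n\n")
```

With only 500 rows the shares will deviate from the targets by a few percentage points, which is expected for a random draw.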


## 2. Multi-Instance BPMN Rule Task for Data Generation
The next step uses a BPMN model (see Fig 2) that instantiates a DMN rule task, and thus the decision table, once per entry of the data set and returns the decisions.

![](https://github.com/AI4BP/ainotes/raw/main/synthetic-data-getting-started/ipynb/images/Synthetic-Data-Gen.png)

**Fig 2**: synthetic data generation process
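
Conceptually, the multi-instance rule task evaluates the same decision once per input row and collects the results in input order. The sketch below mimics that pattern locally. Note that `toy_decision` is a hypothetical placeholder rule invented for this illustration; it is NOT the DMN table from Fig b, and the three input rows are sample values.

```python
import pandas as pd


def toy_decision(amount, means, business):
    """Hypothetical placeholder rule, NOT the DMN decision table from Fig b."""
    if means == "Cash" and amount > 2000:
        return "reject"
    if business and amount > 1000:
        return "verify"
    return "approve"


inputs = pd.DataFrame({
    "Amount": [431, 1145, 2659],
    "Means": ["Invoice", "Invoice", "Cash"],
    "Business": [True, True, True],
})

# One "instance" per row: evaluate the rule and keep the results in input order.
decisions = [
    toy_decision(row.Amount, row.Means, row.Business)
    for row in inputs.itertuples(index=False)
]

# Attach the collected decisions as a new column, row by row.
labelled = inputs.assign(decision=decisions)
print(labelled)
```

In the tutorial the evaluation happens inside the workflow engine rather than in Python, so only the per-row structure carries over.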

### BPMN and DMN Deployment
First, the BPMN and the corresponding DMN model are deployed to the engine.

> Please set your `tenant_id` first!

In [4]:
tenant_id = 'showcase'

In [5]:
import requests

try: tenant_id
except NameError: tenant_id = "showcase"

engine_endpoint = "https://digibp.herokuapp.com/engine-rest"

request_data = {
    "tenant-id": tenant_id,  # please change the tenant-id
}

# Open both model files in a with-block so that they are closed after the upload.
with open(f"{model_folder}/Synthetic-Data-Gen.camunda.bpmn", "rb") as bpmn_file, \
     open(f"{model_folder}/Payment-Decision-Definition.camunda.dmn", "rb") as dmn_file:
    request_files = {
        "Synthetic-Data-Gen": bpmn_file,
        "Payment-Decision-Definition": dmn_file
    }
    response = requests.post(f"{engine_endpoint}/deployment/create", files=request_files, data=request_data)

response.raise_for_status()  # stop here if the engine rejected the deployment
deployment_id = response.json()["id"]

print(deployment_id)

774b2af4-52e0-11ed-af91-429b0122ece7
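
The initialization cell may set `model_folder` to a GitHub URL, but the built-in `open()` only reads local paths. The sketch below shows one possible way to read a model file from either kind of location and to prepare the upload as `(file name, content)` tuples, a form that `requests` accepts for its `files` argument. The helpers `read_model` and `build_request_files` are hypothetical names introduced here, and the demo writes a dummy file rather than a real BPMN model.

```python
import tempfile
import urllib.request
from pathlib import Path


def read_model(location):
    """Return the raw bytes of a model file given a local path or an http(s) URL."""
    if location.startswith(("http://", "https://")):
        with urllib.request.urlopen(location) as response:
            return response.read()
    return Path(location).read_bytes()


def build_request_files(model_folder, file_names):
    """Map each resource name to a (file name, content) tuple for a multipart upload."""
    return {
        name.split(".")[0]: (name, read_model(f"{model_folder}/{name}"))
        for name in file_names
    }


# Demo with a throwaway local file; the content is a dummy, not a real BPMN model.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "Synthetic-Data-Gen.camunda.bpmn").write_bytes(b"<definitions/>")
    demo_files = build_request_files(folder, ["Synthetic-Data-Gen.camunda.bpmn"])

print(demo_files)
```

The resulting dictionary could then be passed as `files=` to `requests.post`, in place of the open file handles.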


### Start Generation Process
Next, the process is instantiated and the data set is passed to it as a collection. The id (`payment_decision`) of the decision table is passed as well.
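
As an alternative to assembling the collection row by row, the records can be serialized in one step with `DataFrame.to_json`, which also converts NumPy values to plain JSON types. The sketch below is an illustration under that assumption: the two rows in `sample` are example values, and the payload layout mirrors the variables used in this tutorial.

```python
import json

import pandas as pd

sample = pd.DataFrame({
    "Amount": [1145, 1030],
    "Means": ["Invoice", "Card"],
    "Business": [True, False],
})

# Lower-case the column names to get the keys amount/means/business,
# then let pandas produce one JSON object per row.
records = json.loads(sample.rename(columns=str.lower).to_json(orient="records"))

payload = json.dumps({
    "variables": {
        # The collection travels as a JSON document serialized into a string.
        "collection": {"value": json.dumps({"collection": records}), "type": "json"},
        "decision": {"value": "payment_decision", "type": "String"},
    }
})

print(payload)
```

Decoding the `value` string again with `json.loads` is a quick way to confirm that the nested document is valid JSON before sending it.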

In [6]:
import requests
import json

url = f"{engine_endpoint}/process-definition/key/Synthetic_Data_Gen/tenant-id/{tenant_id}/start"

records = []

for index, row in df.iterrows():
    # Cast to plain Python types so that json.dumps can always serialize the values.
    records.append({
        "amount": int(row['Amount']),
        "means": str(row['Means']),
        "business": bool(row['Business'])
    })

payload = json.dumps({
  "variables": {
    "collection": {
      # The collection is passed as a JSON document serialized into a string.
      "value": json.dumps({"collection": records}),
      "type": "json"
    },
    "decision": {
      "value": "payment_decision",
      "type": "String"
    }
  }
})
headers = {
  'Content-Type': 'application/json',
  'Accept': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)

{"links":[{"method":"GET","href":"https://digibp.herokuapp.com/engine-rest/process-instance/779a837d-52e0-11ed-af91-429b0122ece7","rel":"self"}],"id":"779a837d-52e0-11ed-af91-429b0122ece7","definitionId":"Synthetic_Data_Gen:5:7751e1b7-52e0-11ed-af91-429b0122ece7","businessKey":null,"caseInstanceId":null,"ended":false,"suspended":false,"tenantId":"showcase"}


### Subscribe for Generation Result
As you might have seen in the process in Fig 2, some transaction boundaries are defined as a performance optimization, so that the generation does not block the entire workflow engine. The process therefore continues asynchronously, and we need to retrieve the result using a worker subscription. This is done in the following:
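
To make the idea of a worker subscription more concrete, the sketch below polls for a task and completes it using only the standard library. It assumes that the subscription is backed by the Camunda 7 external task REST endpoints `fetchAndLock` and `complete`; whether the `cam` client used in this tutorial works exactly this way is an assumption. The function names, the worker id and the timing values are hypothetical.

```python
import json
import time
import urllib.request


def post_json(url, body):
    """POST a JSON body and return the decoded JSON response, or None if it is empty."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        raw = response.read()
    return json.loads(raw) if raw else None


def wait_for_task(engine_endpoint, topic, tenant_id, worker_id="notebook-worker",
                  attempts=30, pause_seconds=2.0, send=post_json):
    """Poll fetchAndLock until a task for the topic is available and return it."""
    body = {
        "workerId": worker_id,
        "maxTasks": 1,
        "topics": [{"topicName": topic, "lockDuration": 60000, "tenantIdIn": [tenant_id]}],
    }
    for _ in range(attempts):
        tasks = send(f"{engine_endpoint}/external-task/fetchAndLock", body)
        if tasks:
            return tasks[0]
        time.sleep(pause_seconds)  # nothing yet: wait before asking again
    raise TimeoutError(f"no task on topic '{topic}' after {attempts} attempts")


def complete_task(engine_endpoint, task_id, worker_id="notebook-worker", send=post_json):
    """Report the locked task as finished so that the process can continue."""
    send(f"{engine_endpoint}/external-task/{task_id}/complete", {"workerId": worker_id})
```

A call such as `wait_for_task(engine_endpoint, "synthetic_data_gen_done", tenant_id)` would return the task whose variables carry the result. The `send` parameter exists so that the polling logic can be tried out with a stand-in function instead of a live engine.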

In [7]:
import httpimport
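with httpimport.github_repo('DigiBP', 'digibp-camunda-external-python-task', 'cam'):
    import cam

worker = cam.Client(engine_endpoint)

# Subscribe to the topic on which the process reports that the generation is done.
response = worker.subscribe("synthetic_data_gen_done", None, tenant_id)

taskid = str(response[0]['id'])

# The decisions arrive as a JSON document whose 'values' list holds one decision per input row.
data = json.loads(response[0]['variables']['combinedResult']['value'])['values']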

df = df.assign(decision=data)

worker.complete(taskid)

df

polling subscription: synthetic_data_gen_done
200

204


| | Amount | Means | Business | decision |
|-|-|-|-|-|
| 0 | 1145 | Invoice | True | verify |
| 1 | 1030 | Card | False | approve |
| 2 | 2659 | Cash | True | reject |
| 3 | 431 | Invoice | True | approve |
| 4 | 1103 | Invoice | True | verify |
| ... | ... | ... | ... | ... |
| 495 | 246 | Invoice | True | approve |
| 496 | 1416 | Invoice | True | verify |
| 497 | 305 | Cash | True | approve |
| 498 | 666 | Cash | False | approve |


## Adding Noise
A realistic data set typically contains noise: in reality, some decisions were made by mistake, and some were deliberate exceptions. For this reason, it can make sense to add a small percentage of plausible noise.

The noise considerations are the following:
- we add random decisions to a certain (small) percentage of randomly selected rows
- the decisions are 50% approve, 40% verify and 10% reject
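
The two considerations above can be packaged into a small helper that picks distinct rows and overwrites their decision. This is a minimal sketch, not the tutorial's own cell: `add_label_noise`, its parameters and the all-"approve" demo data are assumptions made for illustration. Because a random decision can coincide with the original one, fewer labels may actually change than rows are selected.

```python
import numpy as np
import pandas as pd


def add_label_noise(df, fraction, rng,
                    labels=("approve", "verify", "reject"),
                    weights=(0.5, 0.4, 0.1)):
    """Return a noisy copy of df and the row labels whose decision was redrawn."""
    noisy = df.copy()
    noise_size = int(len(noisy) * fraction)
    # Pick distinct rows from the whole data set.
    rows = rng.choice(noisy.index.to_numpy(), size=noise_size, replace=False)
    # Overwrite their decision with a random one drawn by the given weights.
    noisy.loc[rows, "decision"] = rng.choice(labels, size=noise_size, p=weights)
    return noisy, rows


rng = np.random.default_rng(seed=7)  # arbitrary seed, only for reproducibility
clean = pd.DataFrame({"decision": ["approve"] * 200})

noisy, rows = add_label_noise(clean, fraction=0.05, rng=rng)
changed = int((noisy["decision"] != clean["decision"]).sum())

print(f"{len(rows)} rows selected, {changed} decisions actually changed")
```

Working on a copy keeps the clean labels available, which makes it easy to measure how much noise was really introduced.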

In [8]:
noise_percentage = 1

In [9]:
import numpy as np
import pandas as pd

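# Convert the percentage into a fraction under a new name,
# so that re-running this cell does not divide the value again.
try: noise_fraction = noise_percentage/100
except NameError: noise_fraction = 0.01

noise_size = int(len(df)*noise_fraction)
df_noise = pd.DataFrame()
# Select the affected rows from the whole data set, each row at most once.
df_noise['index'] = np.random.choice(df.index, size=noise_size, replace=False)
df_noise['decision'] = np.random.choice(["approve", "verify", "reject"], size=noise_size, p=[0.50, 0.40, 0.10])

for index, row in df_noise.iterrows():
    df.loc[row['index'],'decision'] = row['decision']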

df

| | Amount | Means | Business | decision |
|-|-|-|-|-|
| 0 | 1145 | Invoice | True | approve |
| 1 | 1030 | Card | False | approve |
| 2 | 2659 | Cash | True | reject |
| 3 | 431 | Invoice | True | approve |
| 4 | 1103 | Invoice | True | approve |
| ... | ... | ... | ... | ... |
| 495 | 246 | Invoice | True | approve |
| 496 | 1416 | Invoice | True | verify |
| 497 | 305 | Cash | True | approve |
| 498 | 666 | Cash | False | approve |


## 3. Persist Synthetic Data
🎉 Now, finally, we can generate the `.csv` file with our plausible synthetic data.

In [10]:
os.makedirs(f"{working_dir}/generated", exist_ok=True)  # create the target folder if it is missing
df.to_csv(f"{working_dir}/generated/payment_decision_file.csv", index=False)
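
Before handing the file to a training step, it can be worth reading it back once to confirm that columns and values survive the round trip. The sketch below does this in a temporary folder with a three-row stand-in frame; the frame `result` and its values are illustrative assumptions, not the generated data set.

```python
import tempfile
from pathlib import Path

import pandas as pd

result = pd.DataFrame({
    "Amount": [1145, 1030, 2659],
    "Means": ["Invoice", "Card", "Cash"],
    "Business": [True, False, True],
    "decision": ["verify", "approve", "reject"],
})

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp) / "generated"
    output_dir.mkdir(parents=True, exist_ok=True)  # create the folder before writing
    csv_path = output_dir / "payment_decision_file.csv"

    # index=False keeps the row numbers out of the file.
    result.to_csv(csv_path, index=False)
    restored = pd.read_csv(csv_path)

print(restored.dtypes)
print(restored)
```

Writing with `index=False` is what prevents an extra unnamed index column from appearing when the file is loaded again.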
