# Generate synthetic data for AI/ML Feature Pipelines tutorial

This notebook generates sample data for illustrating the workflow of implementing feature pipelines for AI/ML with Data Distiller. These pipelines transform data from Experience Platform datasets into a feature dataset that can be used to train and score a propensity model in an external ML environment. We will generate the sample data in the following steps:

- [Setup](#setup)
- [1. Create Experience Event schema and dataset](#1-create-experience-event-schema-and-dataset)
- [2. Create Profile schema and dataset](#2-create-profile-schema-and-dataset)
- [3. Statistical simulation of Profiles and Experience Events](#3-statistical-simulation-of-profiles-and-experience-events)
- [4. Ingest synthetic data into AEP dataset](#4-ingest-synthetic-data-into-aep-dataset)


## Setup

Before running anything, install the libraries this notebook requires. All of them are publicly available, and the latest versions should work fine.

In [None]:
%pip install mmh3
%pip install rstr
%pip install aepp
%pip install pygresql
%pip install numpy
%pip install mimesis

This notebook requires some configuration parameters to properly authenticate to your Adobe Experience Platform instance. Please follow the instructions in the [**README**](../README.md) to gather the necessary configuration parameters and prepare the [config.ini](../conf/config.ini) file with the specific values for your environment.

The next cell looks for your configuration file under your **ADOBE_HOME** path and reads the configuration values used throughout this notebook. If necessary, modify `config_path` and/or `config_file` to reflect the location of your config file.

In [12]:
import os
from configparser import ConfigParser
import aepp

os.environ["ADOBE_HOME"] = os.path.dirname(os.getcwd())

if "ADOBE_HOME" not in os.environ:
    raise Exception("ADOBE_HOME environment variable needs to be set.")

config = ConfigParser()
config_file = "config.ini"
config_path = os.path.join(os.environ["ADOBE_HOME"], "conf", config_file)

if not os.path.exists(config_path):
    raise Exception(f"Looking for configuration under {config_path} but config not found, please verify path")

config.read(config_path)

aepp.configure(
  org_id=config.get("Platform", "ims_org_id"),
  tech_id=config.get("Platform", "tech_acct_id"), 
  secret=config.get("Platform", "client_secret"),
  scopes=config.get("Platform", "scopes"),
  client_id=config.get("Platform", "client_id"),
  environment=config.get("Platform", "environment"),
  sandbox=config.get("Platform", "sandbox_name")
)
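
As a rough guide to what the cell above expects, the sketch below parses a sample configuration and checks that the `Platform` keys read by the `config.get` calls are present. The key names are taken from this notebook; every value is a placeholder, and `REQUIRED_PLATFORM_KEYS` and `missing_platform_keys` are illustrative names, not part of `aepp`. The `Data` section is where later cells store the IDs of the resources they create.

```python
from configparser import ConfigParser

# Placeholder values only; replace them with the values gathered via the README.
SAMPLE_CONFIG = """
[Platform]
ims_org_id = <your-ims-org-id>
tech_acct_id = <your-technical-account-id>
client_secret = <your-client-secret>
scopes = <your-scopes>
client_id = <your-client-id>
environment = prod
sandbox_name = <your-sandbox-name>

[Data]
"""

# Keys that the aepp.configure(...) call in this notebook reads from [Platform].
REQUIRED_PLATFORM_KEYS = [
    "ims_org_id",
    "tech_acct_id",
    "client_secret",
    "scopes",
    "client_id",
    "environment",
    "sandbox_name",
]


def missing_platform_keys(parser):
    """Return the required [Platform] keys that the parsed config lacks."""
    if not parser.has_section("Platform"):
        return list(REQUIRED_PLATFORM_KEYS)
    return [k for k in REQUIRED_PLATFORM_KEYS if not parser.has_option("Platform", k)]


sample = ConfigParser()
sample.read_string(SAMPLE_CONFIG)
print(missing_platform_keys(sample))  # an empty list means nothing is missing
```

Running a check like this before calling `aepp.configure` turns a missing key into a readable list instead of a `NoOptionError` raised part-way through the call.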

To keep the resources created by this notebook unique, we include your local username in each resource title to avoid conflicts.

In [13]:
import re
username = os.getlogin()
unique_id = re.sub("[^0-9a-zA-Z]+", "_", username)
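
`os.getlogin()` raises `OSError` when the process has no controlling terminal, which can happen on some hosted notebook servers. A minimal fallback sketch, assuming `getpass.getuser()` is an acceptable substitute in your environment (`sanitized_username` is a hypothetical helper name):

```python
import getpass
import os
import re


def sanitized_username(raw=None):
    """Return a username with each run of non-alphanumeric characters replaced by '_'."""
    if raw is None:
        try:
            raw = os.getlogin()
        except OSError:
            # No controlling terminal: fall back to the environment-based lookup.
            raw = getpass.getuser()
    return re.sub("[^0-9a-zA-Z]+", "_", raw)


print(sanitized_username("jane.doe@example"))  # jane_doe_example
```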

Helper function to generate link to resource in the UI:

In [14]:
def get_ui_link(tenant_id, resource_type, resource_id):
    environment = config.get("Platform", "environment")
    sandbox_name = config.get("Platform", "sandbox_name")
    if environment == "prod":
        prefix = "https://experience.adobe.com"
    else:
        prefix = f"https://experience-{environment}.adobe.com"
    return f"{prefix}/#/@{tenant_id}/sname:{sandbox_name}/platform/{resource_type}/{resource_id}"
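
To show the shape of the links this helper produces, here is a standalone sketch that takes the environment and sandbox as arguments instead of reading them from `config`. The tenant `acme`, the sandbox `dev`, and the schema ID are made-up values, and `build_ui_link` is an illustrative name.

```python
from urllib.parse import quote


def build_ui_link(environment, sandbox_name, tenant_id, resource_type, resource_id):
    """Build a UI link using the same URL layout as get_ui_link."""
    if environment == "prod":
        prefix = "https://experience.adobe.com"
    else:
        prefix = f"https://experience-{environment}.adobe.com"
    return f"{prefix}/#/@{tenant_id}/sname:{sandbox_name}/platform/{resource_type}/{resource_id}"


# Schema and field group IDs are URLs themselves, so they are percent-encoded first.
schema_id = "https://ns.adobe.com/acme/schemas/abc123"
link = build_ui_link("prod", "dev", "acme", "schema/mixin/browse", quote(schema_id, safe=""))
print(link)
```

Dataset IDs are plain strings, so they can be passed as `resource_id` without encoding.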

# 1. Create Experience Event schema and dataset

We will now create the schema to support our synthetic event data. We need a few field groups which will be included in the synthetic event schema:
- A custom field group with an identity field which we'll call `synth_id`
- Direct Marketing information
- Web details

## 1.1 Create connection to XDM Schema Registry

We first instantiate a connection to the Schema Registry API, then retrieve the name of the sandbox we're working in and the tenant ID, both of which we'll reference below.

In [15]:
from aepp import schema
schema_conn = schema.Schema()
print(f"Sandbox: {schema_conn.sandbox}")
tenant_id = schema_conn.getTenantId()
print(f"Tenant ID: {tenant_id}")

Sandbox: laa-e2e
Tenant ID: aemonacpprodcampaign


## Create User ID field group for Experience Event Schema

We need to create a custom field group with a `synth_id` field to go in the experience event schema. Other schema fields will come from standard field groups that we will include when creating the schema.

First we'll define a helper function to gracefully handle cases where the CMLE User ID field group has already been created:

In [16]:
from aepp import utils

def createFieldGroupifnotExists(
        title: str, 
        data: dict, 
        config_path: str,
        utils_conn: utils.Utils = utils.Utils(),
        schema_conn: schema.Schema = schema.Schema()
):
    existing = utils_conn.check_if_exists('Data', 'fieldgroup_id', config_path)
    if existing:
        print(f"'{title}' already exists, retrieving existing field group")
        return schema_conn.getFieldGroup(existing)
    else:
        return schema_conn.createFieldGroup(data)
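
The helper above follows a get-or-create pattern keyed on the config file, which is what makes the notebook safe to re-run. The sketch below shows the same idea with only the standard library. It is a stand-in for illustration, not how the `aepp` utilities work internally; `get_or_create` and the `fg-001` style IDs are hypothetical.

```python
import os
import tempfile
from configparser import ConfigParser


def get_or_create(section, key, config_path, create_fn):
    """Return (id, created): reuse the ID stored in the config, else create and save it."""
    parser = ConfigParser()
    parser.read(config_path)  # a missing file is silently ignored
    existing = parser.get(section, key, fallback="")
    if existing:
        return existing, False
    new_id = create_fn()
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, key, new_id)
    with open(config_path, "w") as fh:
        parser.write(fh)
    return new_id, True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.ini")
    first = get_or_create("Data", "fieldgroup_id", path, lambda: "fg-001")
    second = get_or_create("Data", "fieldgroup_id", path, lambda: "fg-002")

print(first)   # ('fg-001', True)
print(second)  # ('fg-001', False)
```

The second call never invokes its creator, because the first call already persisted an ID under the same key.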

Create (or get) the `CMLE User ID` field group, and update the config file with the field group ID.

In [20]:
fieldgroup_title = f"[Synthetic Data] Exp Event User ID (created by {username})"
fieldgroup_data = {
    "type": "object",
    "title": fieldgroup_title,
    "description": "This field group is used to identify the user to whom an experience event belongs.",
    "allOf": [{
        "$ref": "#/definitions/customFields"
    }],
    "meta:containerId": "tenant",
    "meta:resourceType": "mixins",
    "meta:xdmType": "object",
    "definitions": {
        "customFields": {
            "type": "object",
            "properties": {
                f"_{tenant_id}": {
                    "type": "object",
                    "properties": {
                        "synth_id": {
                            "title": "Synthetic User ID",
                            "description": "Person identifier for synthetic event data for AI/ML feature pipelines tutorial",
                            "type": "string"
                        }
                    }
                }
            }
        }
    },
    "meta:intendedToExtend": ["https://ns.adobe.com/xdm/context/experienceevent"]
}

fieldgroup_res = createFieldGroupifnotExists(fieldgroup_title, fieldgroup_data, config_path)
fieldgroup_id = fieldgroup_res['$id']
print(f"User ID field group ID: {fieldgroup_id}")

# update config file and object
utils_conn = utils.Utils()
utils_conn.save_field_in_config('Data', 'fieldgroup_id', fieldgroup_id, config_path)
config.read(config_path)

# Get link to field group in AEP UI
import urllib.parse
fieldgroup_link = get_ui_link(tenant_id, "schema/mixin/browse", urllib.parse.quote(fieldgroup_id, safe="a"))
print(f"View field group in UI: {fieldgroup_link}")

User ID field group ID: https://ns.adobe.com/aemonacpprodcampaign/mixins/ef239be06873c2a8d280c597128dc2094d85e941fef63100
View field group in UI: https://experience.adobe.com/#/@aemonacpprodcampaign/sname:laa-e2e/platform/schema/mixin/browse/https%3A%2F%2Fns.adobe.com%2Faemonacpprodcampaign%2Fmixins%2Fef239be06873c2a8d280c597128dc2094d85e941fef63100


## 1.2 Compose Experience Event schema

Now we'll create the experience event schema from our custom field group and the following standard field groups:
- Direct Marketing Details
- Web Details

As with the field group, we'll first define a helper function to gracefully handle cases where the Experience Event schema has already been created:

In [17]:
def createSchemaifnotExists(
        title: str, 
        fieldGroups: list[str], 
        config_field: str,
        config_path: str,
        description: str = "",
        utils_conn: utils.Utils = utils.Utils(), 
        schema_conn: schema.Schema = schema.Schema()
):
    existing = utils_conn.check_if_exists('Data', config_field, config_path)
    if existing:
        print(f"'{title}' already exists, retrieving existing schema")
        return schema_conn.getSchema(existing)
    else:
        if config_field == "events_schema":
            return schema_conn.createExperienceEventSchema(
                name=title,
                fieldGroupIds=fieldGroups,
                description=description
            )
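        elif config_field == "profile_schema":
            return schema_conn.createProfileSchema(
                name=title,
                fieldGroupIds=fieldGroups,
                description=description
            )
        else:
            # Fail loudly instead of silently returning None for an unknown field
            raise ValueError(f"Unsupported config_field: {config_field!r}")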

Create (or get) the Experience Event schema and update the config file with its ID. We'll capture the ID and AltID in variables that we can reference further down, and generate a link to the schema in the AEP UI that you can use to verify that the schema was created.

In [21]:
schema_ee_title = f"[Synthetic Data] Experience Event schema (created by {username})"
schema_ee_fgs = [
    fieldgroup_id,
    "https://ns.adobe.com/xdm/context/experienceevent-directmarketing",
    "https://ns.adobe.com/xdm/context/experienceevent-web"
]
schema_ee_desc = "Profile Schema generated by CMLE for synthetic events"

schema_ee_res = createSchemaifnotExists(
    title=schema_ee_title,
    fieldGroups=schema_ee_fgs,
    config_field='events_schema',
    config_path=config_path,
    description=schema_ee_desc
)
schema_ee_id = schema_ee_res['$id']
schema_ee_altId = schema_ee_res["meta:altId"]
print(f"EE Schema ID: {schema_ee_id}")
print(f"EE Schema Alt ID: {schema_ee_altId}")

# update config object
utils_conn.save_field_in_config('Data', 'events_schema', schema_ee_id, config_path)
config.read(config_path)

schema_ee_link = get_ui_link(tenant_id, "schema/mixin/browse", urllib.parse.quote(schema_ee_id, safe="a"))
print(f"View EE schema in UI: {schema_ee_link}")

EE Schema ID: https://ns.adobe.com/aemonacpprodcampaign/schemas/1b485af0567eb1110b09ee9b2d43d21601b24694e4f8d303
EE Schema Alt ID: _aemonacpprodcampaign.schemas.1b485af0567eb1110b09ee9b2d43d21601b24694e4f8d303
View EE schema in UI: https://experience.adobe.com/#/@aemonacpprodcampaign/sname:laa-e2e/platform/schema/mixin/browse/https%3A%2F%2Fns.adobe.com%2Faemonacpprodcampaign%2Fschemas%2F1b485af0567eb1110b09ee9b2d43d21601b24694e4f8d303


We need to set `synth_id` as the primary ID for the events schema with ECID as the namespace. We do this by creating an identity descriptor in the schema registry:

In [22]:
identity_type = "ECID"
identity_data_events = {
            "@type": "xdm:descriptorIdentity",
            "xdm:sourceSchema": schema_ee_id,
            "xdm:sourceVersion": 1,
            "xdm:sourceProperty": f"/_{tenant_id}/synth_id",
            "xdm:namespace": identity_type,
            "xdm:property": "xdm:id",
            "xdm:isPrimary": True
        }
identity_dsc_ee_res = schema_conn.createDescriptor(descriptorObj=identity_data_events)
identity_dsc_ee_res

{'@id': '53b467f407ea07162ee337024cc413cd9603f0927d580359',
 '@type': 'xdm:descriptorIdentity',
 'xdm:sourceSchema': 'https://ns.adobe.com/aemonacpprodcampaign/schemas/1b485af0567eb1110b09ee9b2d43d21601b24694e4f8d303',
 'xdm:sourceVersion': 1,
 'xdm:sourceProperty': '/_aemonacpprodcampaign/synth_id',
 'imsOrg': '906E3A095DC834230A495FD6@AdobeOrg',
 'version': '1',
 'xdm:namespace': 'ECID',
 'xdm:property': 'xdm:id',
 'xdm:isPrimary': True,
 'meta:containerId': '5523fc81-ee25-4546-a3fc-81ee25554627',
 'meta:sandboxId': '5523fc81-ee25-4546-a3fc-81ee25554627',
 'meta:sandboxType': 'production'}
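
The identity descriptors used in this notebook differ only in the source schema and the source property, so the payload can be built by a small helper. This is a sketch: `identity_descriptor` is a hypothetical name, and the schema IDs and the `acme` tenant are made up. The resulting dictionary would be passed to `schema_conn.createDescriptor` as in the cell above.

```python
def identity_descriptor(schema_id, source_property, namespace="ECID", primary=True):
    """Build the payload for an identity descriptor on one schema field."""
    return {
        "@type": "xdm:descriptorIdentity",
        "xdm:sourceSchema": schema_id,
        "xdm:sourceVersion": 1,
        "xdm:sourceProperty": source_property,
        "xdm:namespace": namespace,
        "xdm:property": "xdm:id",
        "xdm:isPrimary": primary,
    }


# Custom fields live under the tenant namespace; standard fields sit at the root.
events_descriptor = identity_descriptor(
    "https://ns.adobe.com/acme/schemas/abc123", "/_acme/synth_id"
)
profile_descriptor = identity_descriptor(
    "https://ns.adobe.com/acme/schemas/def456", "/personID"
)
print(events_descriptor["xdm:sourceProperty"], profile_descriptor["xdm:sourceProperty"])
```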

With a primary identity set, we can now enable the events schema for the Profile service.

In [23]:
enable_ee_res = schema_conn.enableSchemaForRealTime(schema_ee_altId)
enable_ee_res

{'$id': 'https://ns.adobe.com/aemonacpprodcampaign/schemas/1b485af0567eb1110b09ee9b2d43d21601b24694e4f8d303',
 'meta:altId': '_aemonacpprodcampaign.schemas.1b485af0567eb1110b09ee9b2d43d21601b24694e4f8d303',
 'meta:resourceType': 'schemas',
 'version': '1.1',
 'title': '[Synthetic Data] Experience Event schema (created by jeremypage)',
 'type': 'object',
 'description': 'Profile Schema generated by CMLE for synthetic events',
 'allOf': [{'$ref': 'https://ns.adobe.com/xdm/context/experienceevent',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/aemonacpprodcampaign/mixins/ef239be06873c2a8d280c597128dc2094d85e941fef63100',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/context/experienceevent-directmarketing',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/context/experienceevent-web',
   'type': 'object',
   'meta:xdmType': 'object'}],
 'refs': ['https://ns.adobe.com/xdm/

## 1.3 Create Experience Event dataset

With a schema defined for our events data, we'll now create a dataset to hold the data. As before, we will first define a helper function that checks whether the Experience Events dataset has already been created (specifically whether the events_dataset field is populated in the config) before attempting to create the new dataset. The function will return the ID of the events dataset if it already exists, otherwise it creates the new dataset and returns its ID.

In [24]:
from aepp import catalog

def createDatasetifNotExists(
        name: str, 
        schemaId: str,
        config_field: str,
        config_path: str,
        profile: bool = False,
        utils_conn: utils.Utils = utils.Utils(),
        cat_conn: catalog.Catalog = catalog.Catalog()
):
    
    existing = utils_conn.check_if_exists('Data', config_field, config_path)
    if existing:
        print(f"'{name}' already exists, retrieving existing dataset")
        return existing
    else:
        new_dataset = cat_conn.createDataSets(
            name=name,
            schemaId=schemaId,
            profileEnabled=profile,
            identityEnabled=profile
        )
        return new_dataset[0].split("/")[-1]
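
The last line of the helper extracts the bare dataset ID from a Catalog reference. The sketch below assumes the response is a list of strings of the form `@/dataSets/<id>`, which is the shape the Catalog calls print elsewhere in this notebook; `dataset_id_from_reference` is an illustrative name.

```python
def dataset_id_from_reference(reference):
    """Return the final path segment of a reference such as '@/dataSets/<id>'."""
    return reference.split("/")[-1]


response = ["@/dataSets/654b20fe82b01e28d381630a"]  # shape taken from this notebook's outputs
print(dataset_id_from_reference(response[0]))  # 654b20fe82b01e28d381630a
```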

Create the Experience Event dataset and update the config file with the dataset ID.

In [25]:
dataset_ee_name = f"[Synthetic Data] Experience Event dataset (created by {username})"
dataset_ee_id = createDatasetifNotExists(
    name=dataset_ee_name,
    schemaId=schema_ee_id,
    config_field="events_dataset",
    config_path=config_path,
    profile=True)
print(f"EE Dataset ID: {dataset_ee_id}")

# update config object
utils_conn.save_field_in_config('Data', 'events_dataset', dataset_ee_id, config_path)
config.read(config_path)

dataset_ee_link = get_ui_link(tenant_id, "dataset/browse", dataset_ee_id)
print(f"View EE Dataset in UI: {dataset_ee_link}")

EE Dataset ID: 654b20fe82b01e28d381630a
View EE Dataset in UI: https://experience.adobe.com/#/@aemonacpprodcampaign/sname:laa-e2e/platform/dataset/browse/654b20fe82b01e28d381630a


Enable the events dataset for Profile
<div class="alert alert-block alert-warning">
<b>Note:</b> After running this step, open the dataset link above in the UI. If the Profile toggle is not enabled, turn it on manually.
</div>

In [26]:
cat_conn = catalog.Catalog()
cat_conn.enableDatasetProfile(dataset_ee_id)

['@/dataSets/654b20fe82b01e28d381630a']

# 2. Create Profile schema and dataset

The Profile schema will include the following field groups:
- Loyalty Details
- Personal Contact Details
- Demographic Details
- User Account Details

## 2.1 Create Profile schema

In [27]:
# Set schema parameters
schema_profile_title = f"[Synthetic Data] Profile Schema (created by {username})"
schema_profile_fgs = [
    'https://ns.adobe.com/xdm/mixins/profile/profile-loyalty-details',
    'https://ns.adobe.com/xdm/context/profile-personal-details',
    'https://ns.adobe.com/xdm/context/profile-person-details',
    'https://ns.adobe.com/xdm/mixins/profile/profile-user-account-details'
]
schema_profile_desc = "Profile Schema generated by CMLE"

schema_profile_res = createSchemaifnotExists(
    title=schema_profile_title,
    fieldGroups=schema_profile_fgs,
    config_field="profile_schema",
    config_path=config_path,
    description=schema_profile_desc
)
schema_profile_id = schema_profile_res['$id']
schema_profile_altId = schema_profile_res["meta:altId"]
print(f"Profile Schema ID: {schema_profile_id}")
print(f"Profile Schema Alt ID: {schema_profile_altId}")

# update config object
utils_conn.save_field_in_config('Data', 'profile_schema', schema_profile_id, config_path)
config.read(config_path)

schema_profile_link = get_ui_link(tenant_id, "schema/mixin/browse", urllib.parse.quote(schema_profile_id, safe="a"))
print(f"View Profile schema in UI: {schema_profile_link}")

Profile Schema ID: https://ns.adobe.com/aemonacpprodcampaign/schemas/1f444d0fa751cdd0e377d6ea48c7ff67cc821545ee75e89a
Profile Schema Alt ID: _aemonacpprodcampaign.schemas.1f444d0fa751cdd0e377d6ea48c7ff67cc821545ee75e89a
View Profile schema in UI: https://experience.adobe.com/#/@aemonacpprodcampaign/sname:laa-e2e/platform/schema/mixin/browse/https%3A%2F%2Fns.adobe.com%2Faemonacpprodcampaign%2Fschemas%2F1f444d0fa751cdd0e377d6ea48c7ff67cc821545ee75e89a


Set `personID` as the primary ID for the schema with ECID as the namespace

In [28]:
identity_type = "ECID"
identity_data_profiles = {
            "@type": "xdm:descriptorIdentity",
            "xdm:sourceSchema": schema_profile_id,
            "xdm:sourceVersion": 1,
            "xdm:sourceProperty": f"/personID",
            "xdm:namespace": identity_type,
            "xdm:property": "xdm:id",
            "xdm:isPrimary": True
        }
identity_dsc_profile_res = schema_conn.createDescriptor(descriptorObj=identity_data_profiles)
identity_dsc_profile_res

{'@id': '3afba8ada66bdcd1dc4cb925cfdc6ef59cb0e8747374a3cf',
 '@type': 'xdm:descriptorIdentity',
 'xdm:sourceSchema': 'https://ns.adobe.com/aemonacpprodcampaign/schemas/1f444d0fa751cdd0e377d6ea48c7ff67cc821545ee75e89a',
 'xdm:sourceVersion': 1,
 'xdm:sourceProperty': '/personID',
 'imsOrg': '906E3A095DC834230A495FD6@AdobeOrg',
 'version': '1',
 'xdm:namespace': 'ECID',
 'xdm:property': 'xdm:id',
 'xdm:isPrimary': True,
 'meta:containerId': '5523fc81-ee25-4546-a3fc-81ee25554627',
 'meta:sandboxId': '5523fc81-ee25-4546-a3fc-81ee25554627',
 'meta:sandboxType': 'production'}

Enable the profile schema for Profile

In [29]:
enable_profile_res = schema_conn.enableSchemaForRealTime(schema_profile_altId)
enable_profile_res

{'$id': 'https://ns.adobe.com/aemonacpprodcampaign/schemas/1f444d0fa751cdd0e377d6ea48c7ff67cc821545ee75e89a',
 'meta:altId': '_aemonacpprodcampaign.schemas.1f444d0fa751cdd0e377d6ea48c7ff67cc821545ee75e89a',
 'meta:resourceType': 'schemas',
 'version': '1.1',
 'title': '[Synthetic Data] Profile Schema (created by jeremypage)',
 'type': 'object',
 'description': 'Profile Schema generated by CMLE',
 'allOf': [{'$ref': 'https://ns.adobe.com/xdm/context/profile',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/mixins/profile/profile-loyalty-details',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/context/profile-personal-details',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/context/profile-person-details',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/mixins/profile/profile-user-account-details',
   'type': 'object',
   'meta:

## 2.2 Create Profile dataset

Create the Profile dataset and update the config file with the dataset ID.

In [30]:
dataset_profile_name = f"[Synthetic Data] Profile dataset (created by {username})"
dataset_profile_id = createDatasetifNotExists(
    name=dataset_profile_name,
    schemaId=schema_profile_id,
    config_field="profile_dataset",
    config_path=config_path,
    profile=True)
print(f"Profile dataset ID: {dataset_profile_id}")

# update config object
utils_conn.save_field_in_config('Data', 'profile_dataset', dataset_profile_id, config_path)
config.read(config_path)

dataset_profile_link = get_ui_link(tenant_id, "dataset/browse", dataset_profile_id)
print(f"View Profile Dataset in UI: {dataset_profile_link}")

Profile dataset ID: 654b217132adab28d2b5e5ef
View Profile Dataset in UI: https://experience.adobe.com/#/@aemonacpprodcampaign/sname:laa-e2e/platform/dataset/browse/654b217132adab28d2b5e5ef


Enable dataset for Profile
<div class="alert alert-block alert-warning">
<b>Note:</b> After running this step, open the dataset link above in the UI. If the Profile toggle is not enabled, turn it on manually.
</div>

In [31]:
cat_conn.enableDatasetProfile(dataset_profile_id)

['@/dataSets/654b217132adab28d2b5e5ef']

# 3. Statistical simulation of Profiles and Experience Events

We will set up a statistical simulation to generate Experience Event data that can be used to illustrate the end-to-end flow of creating a propensity model that predicts subscriptions to a brand's paid service.

We will use the standard `web.formFilledOut` event type to represent the subscription conversions that the brand wants to predict, and generate simulated sequences of various types of experience events along with the target subscription conversions that will be used to train a propensity model.

## 3.1 Event types and their contribution to propensity


In [32]:
import random, string
import uuid
from datetime import timedelta
import mmh3
from random import randrange

First, we'll define some events, their frequencies, and the dependencies between them.

In [33]:
advertising_events = {
 
    #eventType          : (weeklyAverageOccurrence, propensityDelta, [(field_to_replace, value)], timeInHoursFromDependent)
    "advertising.clicks": (0.01,                    0.002,            [("advertising/clicks/value", 1.0)], 0.5) , 
    "advertising.impressions": (0.1, 0.001, [("advertising/impressions/value", 1.0)], 0),

    "web.webpagedetails.pageViews": (0.1, 0.005, [("web/webPageDetails/pageViews/value", 1.0)], 0.1),
    "web.webinteraction.linkClicks": (0.05, 0.005, [("web/webInteraction/linkClicks/value", 1.0)], 0.1),
   
    
    "commerce.productViews": (0.05, 0.005, [("commerce/productViews/value", 1.0)], 0.2),
    "commerce.purchases": (0.01, 0.1, [("commerce/purchases/value", 1.0)], 1),
    
    
    "decisioning.propositionDisplay": (0.05, 0.005, [("_experience/decisioning/propositionEventType/display", 1)], 0.1),
    "decisioning.propositionInteract": (0.01, 0.1, [("_experience/decisioning/propositionEventType/interact", 1)], 0.05),
    "decisioning.propositionDismiss": (0.01, -0.2, [("_experience/decisioning/propositionEventType/dismiss", 1)], 0.05),

    
    "directMarketing.emailOpened": (0.2, 0.02, [("directMarketing/opens/value", 1.0)], 24),
    "directMarketing.emailClicked": (0.05, 0.1, [("directMarketing/clicks/value", 1.0)], 0.5),
    "directMarketing.emailSent": (0.5, 0.005, [("directMarketing/sends/value", 1.0)], 0),
    
    "web.formFilledOut": (0.0, 0.0, [("web/webPageDetails/name", "subscriptionForm")], 0),

}

event_dependencies = {
    "advertising.impressions": ["advertising.clicks"],
    "directMarketing.emailSent": ["directMarketing.emailOpened"],
    "directMarketing.emailOpened": ["directMarketing.emailClicked"],
    "directMarketing.emailClicked": ["web.webpagedetails.pageViews"],
    "web.webpagedetails.pageViews": ["web.webinteraction.linkClicks", "commerce.productViews", "decisioning.propositionDisplay"],
    "commerce.productViews": ["commerce.purchases"],
    "decisioning.propositionDisplay": ["decisioning.propositionInteract", "decisioning.propositionDismiss"]
    
}
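
The dependency map reads as "an event of the key type can trigger events of the listed types". To see which event types a single base event can ultimately lead to, the map can be walked breadth-first. This is only an inspection aid, not part of the simulation; `downstream_events` and `dependencies_copy` are illustrative names, and the map is repeated so the sketch runs on its own.

```python
from collections import deque

# Copy of event_dependencies from the cell above.
dependencies_copy = {
    "advertising.impressions": ["advertising.clicks"],
    "directMarketing.emailSent": ["directMarketing.emailOpened"],
    "directMarketing.emailOpened": ["directMarketing.emailClicked"],
    "directMarketing.emailClicked": ["web.webpagedetails.pageViews"],
    "web.webpagedetails.pageViews": [
        "web.webinteraction.linkClicks",
        "commerce.productViews",
        "decisioning.propositionDisplay",
    ],
    "commerce.productViews": ["commerce.purchases"],
    "decisioning.propositionDisplay": [
        "decisioning.propositionInteract",
        "decisioning.propositionDismiss",
    ],
}


def downstream_events(root, dependencies):
    """Return every event type reachable from root, in breadth-first order."""
    seen, order, queue = {root}, [], deque([root])
    while queue:
        current = queue.popleft()
        for child in dependencies.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


chain = downstream_events("directMarketing.emailSent", dependencies_copy)
print(len(chain), chain)
```

An email send can reach nine other event types, including `commerce.purchases`, whereas an ad impression only leads to `advertising.clicks`.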

Next, define a helper function for assigning random dates to the events later in the simulation.

In [34]:
import numpy as np
from datetime import datetime

def random_date(start, end):
    """
    This function will return a random datetime between two datetime 
    objects.
    """
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)
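
A quick usage sketch for `random_date`, repeated here with the standard `random` module so it runs standalone. It samples at one-second resolution from the half-open interval `[start, end)`, and `randrange` raises `ValueError` if the window is shorter than one second. The fixed start date and the seed are arbitrary choices for repeatability.

```python
import random
from datetime import datetime, timedelta


def random_date(start, end):
    """Return a random datetime in [start, end) at one-second resolution."""
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    return start + timedelta(seconds=random.randrange(int_delta))


random.seed(0)  # seeding only makes the sketch repeatable
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(weeks=10)
samples = [random_date(window_start, window_end) for _ in range(1000)]
print(min(samples), max(samples))
```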

## 3.2 Event generation process

### 3.2.1 Define the logic for generating raw event data

In [35]:

def create_data_for_n_users(n_users, first_user):
  
  N_USERS = n_users
  FIRST_USER = first_user
  
  N_WEEKS = 10
  GLOBAL_START_DATE = datetime.now() - timedelta(weeks=12)
  GLOBAL_END_DATE = GLOBAL_START_DATE + timedelta(weeks=N_WEEKS)

  events = []

  for user in range(N_USERS):
        user_id = FIRST_USER + user
        user_events = []
        base_events = {}
        for event_type in ["advertising.impressions", "web.webpagedetails.pageViews", "directMarketing.emailSent"]:
            n_events = np.random.poisson(advertising_events[event_type][0] * N_WEEKS)
            times = []
            for _ in range(n_events):
                #times.append(random_date(GLOBAL_START_DATE, GLOBAL_END_DATE)
                times.append(random_date(GLOBAL_START_DATE, GLOBAL_END_DATE).isoformat())

            base_events[event_type] = times

        for event_type, dependent_event_types in event_dependencies.items():

            if event_type in base_events:
                #for each originating event
                for event_time in base_events[event_type]:
                    #Look for possible later on events
                    for dependent_event in dependent_event_types:
                                n_events = np.random.poisson(advertising_events[dependent_event][0] * N_WEEKS)
                                times = []
                                for _ in range(n_events):
                                    #times.append(event_time + timedelta(hours = np.random.exponential(advertising_events[dependent_event][3])))
                                    new_time = datetime.fromisoformat(event_time) + timedelta(hours = np.random.exponential(advertising_events[dependent_event][3]))
                                    times.append(new_time.isoformat())
                                # Accumulate across originating events instead of overwriting earlier ones
                                base_events.setdefault(dependent_event, []).extend(times)


        for event_type, times in base_events.items():
            for time in times:
                user_events.append({"synth_id": user_id, "eventType": event_type, "timestamp": time})

        user_events = sorted(user_events, key = lambda x: (x["synth_id"], x["timestamp"]))


        cumulative_probability = 0.001
        subscribed = False
        for event in user_events:
            cumulative_probability = min(1.0, max(cumulative_probability + advertising_events[event["eventType"]][1], 0))
            event["subscriptionPropensity"] = cumulative_probability
            if not subscribed and "directMarketing" not in event["eventType"] and "advertising" not in event["eventType"]:
                subscribed = np.random.binomial(1, cumulative_probability) > 0
                if subscribed:
                    subscriptiontime = (datetime.fromisoformat(event["timestamp"]) + timedelta(seconds = 60)).isoformat()
                    #subscriptiontime = event["timestamp"] + timedelta(seconds = 60)
                    user_events.append({"synth_id": user_id, "eventType": "web.formFilledOut",  "timestamp": subscriptiontime})
            event["subscribed"] = subscribed
        user_events = sorted(user_events, key = lambda x: (x["synth_id"], x["timestamp"]))

        events = events + user_events
  return events
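
To make the propensity bookkeeping in `create_data_for_n_users` concrete, the sketch below replays the clamped running sum for one hand-picked event sequence. The deltas are copied from `advertising_events`; `PROPENSITY_DELTA`, `propensity_trace`, and the sample journey are illustrative names only, and the random subscription draw is left out.

```python
# propensityDelta values copied from advertising_events for the events used below.
PROPENSITY_DELTA = {
    "directMarketing.emailSent": 0.005,
    "directMarketing.emailOpened": 0.02,
    "directMarketing.emailClicked": 0.1,
    "web.webpagedetails.pageViews": 0.005,
    "decisioning.propositionDismiss": -0.2,
}


def propensity_trace(event_types, start=0.001):
    """Return the running propensity after each event, clamped to [0, 1]."""
    trace, current = [], start
    for event_type in event_types:
        current = min(1.0, max(current + PROPENSITY_DELTA[event_type], 0.0))
        trace.append(round(current, 6))
    return trace


journey = [
    "directMarketing.emailSent",
    "directMarketing.emailOpened",
    "directMarketing.emailClicked",
    "web.webpagedetails.pageViews",
    "decisioning.propositionDismiss",
]
print(propensity_trace(journey))  # [0.006, 0.026, 0.126, 0.131, 0.0]
```

The final dismissal would push the sum to -0.069, so the clamp resets the propensity to zero. In the simulation, a subscription draw is only attempted on events that are neither `directMarketing` nor `advertising` events, which here means the page view and the dismissal.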

### 3.2.2 Define functions to translate raw event data into XDM format that can be ingested into our events dataset

First some helper functions to generate ECID values:

In [36]:
def normalize_ecid(ecid_part):
    ecid_part_str = str(abs(ecid_part))
    if len(ecid_part_str) != 19:
        ecid_part_str = "".join([str(x) for x in range(
            0, 19 - len(ecid_part_str))]) + ecid_part_str
    return ecid_part_str

def get_ecid(user_id):
    """
    The ECID must be two valid 19 digit longs concatenated together
    """
    email = f"synthetic-user-{user_id}@adobe.com"
    ecidpart1, ecidpart2 = mmh3.hash64(email)
    ecid1, ecid2 = (normalize_ecid(ecidpart1), normalize_ecid(ecidpart2))
    return ecid1 + ecid2
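
The two halves returned by `mmh3.hash64` are signed 64-bit integers, so their absolute values have at most 19 decimal digits. `normalize_ecid` pads shorter values with the digit sequence 0, 1, 2, and so on. The sketch below is a zero-padding variant offered as an alternative, not the notebook's method: it always yields exactly 19 characters, including for very small values where digit-sequence padding would run past 19. `pad_to_19_digits` and `ecid_from_parts` are hypothetical names, and the sample inputs are arbitrary.

```python
def pad_to_19_digits(value):
    """Zero-pad the absolute value of an integer to exactly 19 digits."""
    return str(abs(value)).zfill(19)


def ecid_from_parts(part1, part2):
    """Concatenate two 19-digit strings into a 38-digit identifier."""
    return pad_to_19_digits(part1) + pad_to_19_digits(part2)


example = ecid_from_parts(-1234567890123456789, 42)
print(example, len(example))
```

Switching the padding scheme changes the generated IDs, so events and profiles must be generated with the same helper to keep their identities aligned.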

Next, define functions that generate an XDM JSON-formatted event payload for each raw event. We'll define two functions, one for transforming email-related events and another for transforming web-related events. A final function, `create_xdm_event`, then dispatches each event to the appropriate one of the two.

In [37]:
# Define the data that goes into an email event payload
def create_email_event(user_id, event_type, timestamp):
  """
  Creates the XDM payload for the various types of email events
  """
  
  if event_type == "directMarketing.emailSent":
    directMarketing = {"emailDelivered": {"value": 1.0}, 
                       "sends": {"value": 1.0}, 
                       "emailVisitorID": user_id,
                       "hashedEmail": ''.join(random.choices(string.ascii_letters + string.digits, k=10)),
                       "messageID": str(uuid.uuid4()),
                      }
  elif event_type == "directMarketing.emailOpened":
    directMarketing = {"offerOpens": {"value": 1.0}, 
                     "opens": {"value": 1.0}, 
                     "emailVisitorID": user_id,
                     "messageID": str(uuid.uuid4()),
                    }
  elif event_type == "directMarketing.emailClicked":
    directMarketing = {"clicks": {"value": 1.0}, 
                     "offerOpens": {"value": 1.0}, 
                     "emailVisitorID": user_id,
                     "messageID": str(uuid.uuid4()),
                    }
  return {
    "directMarketing": directMarketing,
    "web": None,
    "_id": str(uuid.uuid4()),
    "eventMergeId": None,
    "eventType": event_type,
    f"_{tenant_id}": {"synth_id":get_ecid(user_id)},
    "producedBy": "databricks-synthetic",
    "timestamp": timestamp
  }

In [45]:
# Define the data that goes into a web event payload 
def create_web_event(user_id, event_type, timestamp):
  """
  Creates the XDM payload for the various types of web events
  """
  url = f"http://www.{''.join(random.choices(string.ascii_letters + string.digits, k=5))}.com"
  ref_url = f"http://www.{''.join(random.choices(string.ascii_letters + string.digits, k=5))}.com"
  name = ''.join(random.choices(string.ascii_letters + string.digits, k=5))
  isHomePage = random.choice([True, False])
  server = ''.join(random.choices(string.ascii_letters + string.digits, k=10))
  site_section = ''.join(random.choices(string.ascii_letters, k=2))
  view_name = ''.join(random.choices(string.ascii_letters, k=3))
  region = ''.join(random.choices(string.ascii_letters + string.digits, k=5))
  interaction_type = random.choice(["download", "exit", "other"])
  web_referrer = random.choice(["internal", "external", "search_engine", "email", "social", "unknown", "usenet", "typed_bookmarked"])
  base_web = {"webInteraction": {"linkClicks": {"value": 0.0}, 
                                 "URL": url, 
                                 "name": name,
                                "region": region,
                                "type": interaction_type},
              "webPageDetails": {"pageViews": {"value": 1.0},
                                 "URL": url,
                                 "isErrorPage": False,
                                 #"isHomepage": isHomePage,
                                 "name": name,
                                 "server": server,
                                 "siteSection": site_section,
                                 "viewName": view_name
                                },
              "webReferrer": {
                "URL": ref_url,
                "type": web_referrer
              }
             }
  if event_type in ["advertising.clicks", "commerce.purchases", "web.webinteraction.linkClicks", "web.formFilledOut", 
                   "decisioning.propositionInteract", "decisioning.propositionDismiss"]:
    base_web["webInteraction"]["linkClicks"]["value"] = 1.0

  return {
    "directMarketing": None,
    "web": base_web,
    "_id": str(uuid.uuid4()),
    "eventMergeId": None,
    "eventType": event_type,
    f"_{tenant_id}": {"synth_id":get_ecid(user_id)},
    "producedBy": "databricks-synthetic",
    "timestamp": timestamp
  }

In [38]:
    
def create_xdm_event(user_id, event_type, timestamp):
  """
  The final 'event factory' method that converts an event into an XDM event
  """
  if "directMarketing" in event_type:
    return create_email_event(user_id, event_type, timestamp)
  else: 
    return create_web_event(user_id, event_type, timestamp)
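
The routing rule in `create_xdm_event` can be checked in isolation. The following is a minimal sketch, assuming two stand-in factories (`fake_email_event` and `fake_web_event` are hypothetical and only record which branch ran); the real `create_email_event` and `create_web_event` build full XDM payloads. The event type `directMarketing.emailSent` is used here only as an assumed example of an email event.

```python
def fake_email_event(user_id, event_type, timestamp):
    # Hypothetical stand-in for create_email_event: only records which branch ran
    return {"branch": "email", "eventType": event_type}


def fake_web_event(user_id, event_type, timestamp):
    # Hypothetical stand-in for create_web_event
    return {"branch": "web", "eventType": event_type}


def route_event(user_id, event_type, timestamp):
    # Same rule as create_xdm_event: a substring test on the event type
    if "directMarketing" in event_type:
        return fake_email_event(user_id, event_type, timestamp)
    return fake_web_event(user_id, event_type, timestamp)


sample_types = ["directMarketing.emailSent", "commerce.purchases", "web.formFilledOut"]
routed = {t: route_event(0, t, "2023-01-01T00:00:00Z")["branch"] for t in sample_types}
print(routed)
```

Any event type that does not contain the string `directMarketing` falls through to the web branch, so new non-email event types need no change to the dispatcher.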

And finally, we define a function that combines the above functions to generate a batch of events corresponding to *n* users:

In [39]:
def createEventsBatch(n_users, first_user):
    batch_events = create_data_for_n_users(n_users, first_user)
    batch_data = [create_xdm_event(x["synth_id"], x["eventType"], x["timestamp"]) for x in batch_events]
    return batch_data

## 3.3 Profile generation

We'll define a similar (but much simpler!) function that generates a batch of profiles in XDM JSON format, which we can use to populate the Profile dataset.

In [40]:
import mimesis
import time
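def createProfilesBatch(n_users, first_user):
    """
    Generates a batch of n_users synthetic profiles in XDM JSON format,
    for user IDs starting at first_user
    """
    N_USERS = n_users
    FIRST_USER = first_user
    # Time-based key for the mimesis "increment" counter used below,
    # so that the counter of each batch starts again at 1
    u = 'u' + str(int(time.time()))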

    field = mimesis.Field(mimesis.Locale.EN)
    profile_schema = mimesis.Schema(
        schema=lambda: {
            "personID": get_ecid(FIRST_USER + field("increment", accumulator=u) - 1),
            "person": {
                "name": {
                    "firstName": field("first_name"),
                    "lastName": field("last_name")
                },
                "gender": field("choice", items=['male', 'female', 'not_specified'])
            },
            "personalEmail": {
                "address": field("email", domains=["emailsim.io"]),
            },
            "mobilePhone": {
                "number": field("telephone", mask="###-###-####")
            },
            "homeAddress": {
                "street1": field("address"),
                "city": field("city"),
                "state": field("state", abbr=True),
                "postalCode": field("postal_code")
            },
            "loyalty": {
                "loyaltyID": [field("integer_number", start=5000000, end=6000000)],
                "tier": field("choice", items=["diamond", "platinum", "gold", "silver", "member"]),
                "points": field("integer_number", start=0, end=1000000), 
                "joinDate": field("datetime", start=2000, end=2023).strftime("%Y-%m-%dT%H:%M:%SZ")
            }
        },
        iterations=N_USERS
    )
    return profile_schema.create()

# 4. Ingest synthetic data into AEP dataset

We'll now use the functions defined above to simulate sequences of Experience Events and Profile records for a number of users, then ingest the simulated data into the corresponding datasets we created above.

For each batch, we will:
1. Generate a batch of events using the `createEventsBatch` function, and ingest the batch into the events dataset (using the `ingestBatch` helper function defined below)
2. Generate a batch of profiles using the `createProfilesBatch` function, and ingest the batch into the profile dataset

First we'll create a connection to the AEP batch ingestion API:

In [41]:
from aepp import ingestion
ingest_conn = ingestion.DataIngestion()

Define a helper function to combine the steps involved in ingesting a batch of data into a dataset:

In [42]:
def ingestBatch(
        ingest_conn: ingestion.DataIngestion,
        dataset_id: str,
        data: list[dict]):
    # Initialize batch creation
    batch_res = ingest_conn.createBatch(
        datasetId = dataset_id
    )
    batch_id = batch_res["id"]
    # Upload data
    file_path = f"batch-synthetic-{batch_id}"
    ingest_conn.uploadSmallFile(
        batchId = batch_id,
        datasetId = dataset_id,
        filePath = file_path,
        data = data
    )
    # Complete the batch
    ingest_conn.uploadSmallFileFinish(
        batchId = batch_id
    )
    return batch_id
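
Before sending data to AEP, it can help to dry-run the create, upload and complete sequence locally. The following is a minimal sketch, assuming an in-memory stand-in for the connection: `FakeIngestion` and `ingest_batch_dry_run` are hypothetical names, and the fake only mimics the three methods and the `"id"` key that `ingestBatch` relies on. It does not reproduce the real API responses.

```python
import uuid


class FakeIngestion:
    """Hypothetical in-memory stand-in for the three methods ingestBatch calls."""

    def __init__(self):
        self.calls = []   # ordered record of the method calls
        self.files = {}   # (batch id, file path) -> uploaded records

    def createBatch(self, datasetId):
        self.calls.append("createBatch")
        # Assumption: only the "id" key of the response is needed
        return {"id": uuid.uuid4().hex}

    def uploadSmallFile(self, batchId, datasetId, filePath, data):
        self.calls.append("uploadSmallFile")
        self.files[(batchId, filePath)] = data

    def uploadSmallFileFinish(self, batchId):
        self.calls.append("uploadSmallFileFinish")


def ingest_batch_dry_run(conn, dataset_id, data):
    # Same three steps as ingestBatch: create the batch, upload, complete
    batch_id = conn.createBatch(datasetId=dataset_id)["id"]
    file_path = f"batch-synthetic-{batch_id}"
    conn.uploadSmallFile(
        batchId=batch_id, datasetId=dataset_id, filePath=file_path, data=data
    )
    conn.uploadSmallFileFinish(batchId=batch_id)
    return batch_id


fake = FakeIngestion()
records = [{"_id": "a"}, {"_id": "b"}]
dry_run_id = ingest_batch_dry_run(fake, "dataset-123", records)
print(fake.calls)
```

Because the fake exposes the same method names and keyword arguments, it could also be passed to `ingestBatch` itself to inspect the generated records without creating any batches.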

Define another function that encapsulates the process described above for each batch. If you want to populate only the events dataset or only the profile dataset, leave the ID of the other dataset as `None`.

In [43]:

def ingestSyntheticBatches(
        ingest_conn: ingestion.DataIngestion,
        n_users: int = 10000,
        n_batches: int = 10,
        event_dataset_id: str = None,
        profile_dataset_id: str = None
):
    if event_dataset_id is None and profile_dataset_id is None:
        # ValueError is the conventional exception for invalid argument values
        raise ValueError('At least one of "event_dataset_id" or "profile_dataset_id" must be provided')
    event_batch_ids = []
    profile_batch_ids = []
    for b in range(n_batches):
        first_user = b * n_users
        if event_dataset_id is not None:
            event_batch = createEventsBatch(n_users, first_user)
            event_batch_id = ingestBatch(ingest_conn, event_dataset_id, event_batch)
            print(f"Processing events batch {b + 1}/{n_batches} with ID {event_batch_id}")
            event_batch_ids.append(event_batch_id)
        if profile_dataset_id is not None:
            profile_batch = createProfilesBatch(n_users, first_user)
            profile_batch_id = ingestBatch(ingest_conn, profile_dataset_id, profile_batch)
            print(f"Processing profiles batch {b + 1}/{n_batches} with ID {profile_batch_id}")
            profile_batch_ids.append(profile_batch_id)
    return (event_batch_ids, profile_batch_ids)
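
Since each batch starts at `first_user = b * n_users`, the batches cover disjoint, contiguous ranges of user indices. The following is a minimal sketch of that arithmetic, assuming each generator emits exactly `n_users` consecutive indices starting at `first_user` (the helper `batch_user_ranges` is hypothetical and not part of the notebook):

```python
def batch_user_ranges(n_users, n_batches):
    # Batch b covers user indices [b * n_users, (b + 1) * n_users)
    return [range(b * n_users, (b + 1) * n_users) for b in range(n_batches)]


ranges = batch_user_ranges(n_users=10000, n_batches=10)
all_ids = [i for r in ranges for i in r]

print(ranges[0], ranges[-1])
print(len(all_ids), len(set(all_ids)))
```

With the default values this yields 100,000 distinct user indices, and the same index range is used for events and profiles within a batch, which is what lets the two datasets be joined on the synthetic ID.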


Finally, we'll use the `ingestSyntheticBatches` function to generate and ingest data for the desired number of batches and batch size:

In [46]:
num_batches = 10
batch_size = 10000

dataset_ee_id = config.get('Data', 'events_dataset')
dataset_profile_id = config.get('Data', 'profile_dataset')

event_batches, profile_batches = ingestSyntheticBatches(
    ingest_conn=ingest_conn,
    n_users=batch_size,
    n_batches=num_batches,
    event_dataset_id=dataset_ee_id,
    profile_dataset_id=dataset_profile_id
)
print(event_batches)
print(profile_batches)

Processing events batch 1/10 with ID 01HEPQS4TRKVBM5FT0YYA00RCQ
Processing profiles batch 1/10 with ID 01HEPQSG321B7XP8C56CN2VRVF
Processing events batch 2/10 with ID 01HEPQSXQ0YCRD3TFWVMRDG84S
Processing profiles batch 2/10 with ID 01HEPQT7ZX89FY0JFE39PHQT05
Processing events batch 3/10 with ID 01HEPQTMDD73W6D689TP9W77MV
Processing profiles batch 3/10 with ID 01HEPQTZS26DW5CX2RXWQS6Y0D
Processing events batch 4/10 with ID 01HEPQVCFRXW0Y3N6AW68EX690
Processing profiles batch 4/10 with ID 01HEPQW06ACF4DN91872V1C3G3
Processing events batch 5/10 with ID 01HEPQWDF6737GKGVX9Z5R641E
Processing profiles batch 5/10 with ID 01HEPQWWSKD79KVV4XM8Q8012W
Processing events batch 6/10 with ID 01HEPQXBQX6J71WDCEC8RK2ZCM
Processing profiles batch 6/10 with ID 01HEPQXNDW1EKDXYG2QBYSB7PV
Processing events batch 7/10 with ID 01HEPQY261RAYH587VQ3QSDFZP
Processing profiles batch 7/10 with ID 01HEPQYCEPVGNR9DQET26RK67J
Processing events batch 8/10 with ID 01HEPQYSAKGNN3NWTXBP5NP06F
Processing profiles batch 

**Note**: Batches are ingested asynchronously in AEP. Depending on how your AEP organization has been provisioned, it may take some time for all the data generated here to become available in your datasets. You can check the ingestion status of all your batches in [the dataset page of your AEP UI](https://experience.adobe.com/#/@TENANT/sname:SANDBOX/platform/dataset/browse/DATASETID).
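
The link above is a template in which `TENANT`, `SANDBOX` and `DATASETID` must be replaced with your own values. As a minimal sketch, the hypothetical helper below fills in that template; the argument values shown are placeholders, not real identifiers.

```python
def dataset_page_url(tenant, sandbox, dataset_id):
    # Fills the TENANT, SANDBOX and DATASETID placeholders of the UI link
    return (
        "https://experience.adobe.com/#/"
        f"@{tenant}/sname:{sandbox}/platform/dataset/browse/{dataset_id}"
    )


# Placeholder values for illustration only
example_url = dataset_page_url("mytenant", "prod", "abc123")
print(example_url)
```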

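You can also check the ingestion status from the notebook by running the following cell, which polls the Catalog API every 30 seconds until no batches of the profile dataset remain in the `staging` status: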

In [48]:
from aepp import catalog
import time
cat_conn = catalog.Catalog()

all_ingested = False
while not all_ingested:
  incomplete_batches = cat_conn.getBatches(
    limit=min(100, num_batches),
    n_results=num_batches,
    output="dataframe",
    dataSet=dataset_profile_id,
    status="staging"
  )
  
  num_incomplete_batches = len(incomplete_batches)
  if num_incomplete_batches == 0:
    print("All batches have been ingested")
    all_ingested = True
  else:
    print(f"Remaining batches being ingested: {num_incomplete_batches}")
    time.sleep(30)

All batches have been ingested
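
The polling loop above runs until no batches are left in staging, so it never exits if a batch stays in that status. The following is a minimal sketch of the same loop with a timeout, assuming a caller-supplied function that returns the number of pending batches (`wait_until_done` and its parameters are hypothetical names; the demo uses canned counts instead of the Catalog API):

```python
import time


def wait_until_done(count_pending, poll_seconds=30, timeout_seconds=3600):
    """Poll count_pending() until it returns 0 or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        pending = count_pending()
        if pending == 0:
            print("All batches have been ingested")
            return True
        if time.monotonic() >= deadline:
            print(f"Timed out with {pending} batches still pending")
            return False
        print(f"Remaining batches being ingested: {pending}")
        time.sleep(poll_seconds)


# Demo with canned counts (no API calls): 3 pending, then 1, then none
canned_counts = iter([3, 1, 0])
finished = wait_until_done(lambda: next(canned_counts), poll_seconds=0, timeout_seconds=5)

# A check that never reaches zero gives up once the timeout has elapsed
timed_out = wait_until_done(lambda: 1, poll_seconds=0, timeout_seconds=0)
```

In the notebook, `count_pending` could wrap the same `cat_conn.getBatches(...)` call used above and return the length of its result, for the profile dataset, the events dataset, or both.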
