This notebook generates sample data that can be used to illustrate the Cloud ML Ecosystem workflow for using AEP Data for machine learning use cases with external ML tools and platforms. We will generate sample data with the following steps:

- [Setup](#setup)
- [1. Create Experience Event schema and dataset](#1-create-experience-event-schema-and-dataset)
- [2. Create Profile schema and dataset](#2-create-profile-schema-and-dataset)
- [3. Statistical simulation of Profiles and Experience Events](#3-statistical-simulation-of-profiles-and-experience-events)
- [4. Ingest synthetic data into AEP dataset](#4-ingest-synthetic-data-into-aep-dataset)



# Setup

Before we run anything, make sure to install the following required libraries for this notebook. They are all publicly available libraries and the latest version should work fine.

In [None]:
%pip install mmh3
%pip install rstr
%pip install aepp
%pip install pygresql
%pip install mimesis
%pip install numpy

This notebook requires some configuration data to authenticate to your Adobe Experience Platform instance. You should be able to find all the values required below by following the Setup section of the **README**.

The next cell will be looking for your configuration file under your **ADOBE_HOME** path to fetch the values used throughout this notebook. See more details in the Setup section of the **README** to understand how to create your configuration file.

In [29]:
import os
from configparser import ConfigParser
import aepp

# Use the parent of the working directory as ADOBE_HOME unless the variable is already set
os.environ.setdefault("ADOBE_HOME", os.path.dirname(os.getcwd()))

config = ConfigParser()
config_file = "cmle_gov_config.ini"
config_path = os.path.join(os.environ["ADOBE_HOME"], "conf", config_file)

if not os.path.exists(config_path):
    raise Exception(f"Looking for configuration under {config_path} but config not found, please verify path")

config.read(config_path)

aepp.configure(
  org_id=config.get("Platform", "ims_org_id"),
  tech_id=config.get("Authentication", "tech_acct_id"), 
  secret=config.get("Authentication", "client_secret"),
  scopes=config.get("Authentication", "scopes"),
  client_id=config.get("Authentication", "client_id"),
  environment=config.get("Platform", "environment"),
  sandbox=config.get("Platform", "sandbox_name")
)
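
For reference, the `config.get` calls above imply a configuration file shaped roughly like the one below. This is a minimal sketch: the section and key names are taken from the cell above, while every value is a placeholder and the exact layout described in the README may differ.

```python
from configparser import ConfigParser

# Placeholder values only; real values come from your own AEP project.
SAMPLE_CONFIG = """
[Platform]
ims_org_id = EXAMPLE_ORG_ID@AdobeOrg
environment = prod
sandbox_name = my-sandbox

[Authentication]
tech_acct_id = EXAMPLE_TECH_ACCOUNT_ID
client_id = EXAMPLE_CLIENT_ID
client_secret = EXAMPLE_CLIENT_SECRET
scopes = example_scope_a,example_scope_b
"""

sample = ConfigParser()
sample.read_string(SAMPLE_CONFIG)

# Check that every key the notebook reads is present before calling aepp.configure
required = {
    "Platform": ["ims_org_id", "environment", "sandbox_name"],
    "Authentication": ["tech_acct_id", "client_id", "client_secret", "scopes"],
}
missing = [
    f"{section}.{key}"
    for section, keys in required.items()
    for key in keys
    if not sample.has_option(section, key)
]
print("Missing keys:", missing)
```

Running a check like this against your own file before `aepp.configure` tends to give a clearer error than a failed `config.get` call.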

To ensure uniqueness of resources created as part of this notebook, we are using your local username to include in each of the resource titles to avoid conflicts.

In [30]:
import re
username = os.getlogin()
unique_id = re.sub("[^0-9a-zA-Z]+", "_", username)
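
To illustrate what the substitution above does, the sketch below applies the same regular expression to a few made-up usernames. The `sanitize` helper name and the sample inputs are hypothetical.

```python
import re

def sanitize(name: str) -> str:
    # Collapse every run of non-alphanumeric characters into a single underscore
    return re.sub("[^0-9a-zA-Z]+", "_", name)

for raw in ["jane.doe", "DOMAIN\\jsmith", "user name@host"]:
    print(raw, "->", sanitize(raw))
```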

Function to generate link to resource in the UI:

In [31]:
def get_ui_link(tenant_id, resource_type, resource_id):
    environment = config.get("Platform", "environment")
    sandbox_name = config.get("Platform", "sandbox_name")
    if environment == "prod":
        prefix = "https://experience.adobe.com"
    else:
        prefix = f"https://experience-{environment}.adobe.com"
    return f"{prefix}/#/@{tenant_id}/sname:{sandbox_name}/platform/{resource_type}/{resource_id}"
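
The sketch below shows the URL layout that `get_ui_link` produces, using a standalone variant that takes the environment and sandbox name as arguments instead of reading them from `config`. The `build_ui_link` name and all sample values are hypothetical. Schema and field group IDs are themselves URLs, so they are percent-encoded before being placed in the link, as done later in this notebook.

```python
from urllib.parse import quote

def build_ui_link(environment, sandbox_name, tenant_id, resource_type, resource_id):
    # Same URL layout as get_ui_link, with the config values passed in explicitly
    if environment == "prod":
        prefix = "https://experience.adobe.com"
    else:
        prefix = f"https://experience-{environment}.adobe.com"
    return f"{prefix}/#/@{tenant_id}/sname:{sandbox_name}/platform/{resource_type}/{resource_id}"

# Dataset IDs can be used as-is
dataset_link = build_ui_link("prod", "my-sandbox", "mytenant", "dataset/browse", "abc123")
print(dataset_link)

# Schema and field group IDs are URLs, so they are percent-encoded first
schema_id = "https://ns.adobe.com/mytenant/schemas/abc123"
schema_link = build_ui_link("stage", "my-sandbox", "mytenant", "schema/mixin/browse", quote(schema_id, safe=""))
print(schema_link)
```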

# 1. Create Experience Event schema and dataset

We will now create the schema to support our synthetic data. We need a few fields which will be included in the synthetic event data:
- Direct Marketing information
- Web details

## 1.1 Create connection to XDM Schema Registry

In [32]:
from aepp import schema
schema_conn = schema.Schema()
schema_conn.sandbox
tenant_id = schema_conn.getTenantId()
tenant_id

'cloudmlecosystem'

## Create User ID field group for Experience Event Schema

We need to create a custom field group with a "userId" field to go in the experience event schema. Other schema fields will come from standard field groups we will include when creating the schema.

First we'll define some utility functions to gracefully handle cases where the User ID field group has already been created:

In [33]:
def getFieldGroupbyTitle(schema_conn: schema.Schema, title: str):
    fieldgroups = schema_conn.getFieldGroups()
    match = list(filter(lambda d: d['title'] == title, fieldgroups))
    if len(match) == 1:
        return match[0]
    else:
        return None

In [34]:
def createFieldGroupifnotExists(schema_conn: schema.Schema, title: str, data: dict):
    existing = getFieldGroupbyTitle(schema_conn, title)
    if existing:
        print(f"'{title}' already exists, retrieving existing field group")
        return existing
    else:
        return schema_conn.createFieldGroup(data)
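
The two helpers above implement a get-or-create pattern, which makes the cell safe to re-run. The sketch below demonstrates the same pattern against a hypothetical in-memory registry (`FakeRegistry` is not part of `aepp`), so the idempotent behaviour can be checked without calling the Schema Registry API.

```python
class FakeRegistry:
    """Hypothetical in-memory stand-in for a schema registry connection."""

    def __init__(self):
        self.items = []
        self.create_calls = 0

    def list_items(self):
        return list(self.items)

    def create_item(self, data):
        self.create_calls += 1
        item = {"$id": f"fake-id-{len(self.items) + 1}", **data}
        self.items.append(item)
        return item


def get_by_title(registry, title):
    # Mirror of the lookup above: return the item only on a single exact match
    matches = [item for item in registry.list_items() if item["title"] == title]
    return matches[0] if len(matches) == 1 else None


def create_if_not_exists(registry, title, data):
    existing = get_by_title(registry, title)
    if existing:
        return existing
    return registry.create_item(data)


registry = FakeRegistry()
first = create_if_not_exists(registry, "User ID", {"title": "User ID"})
second = create_if_not_exists(registry, "User ID", {"title": "User ID"})
print(first["$id"] == second["$id"], registry.create_calls)
```

The second call returns the item created by the first, so only one create request is issued.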

Create the User ID field group (or retrieve the field group ID if it already exists):

In [35]:
fieldgroup_title = f"[CMLE Synthetic Data] Exp Event User ID (created by {username})"
fieldgroup_data = {
    "type": "object",
    "title": fieldgroup_title,
    "description": "This field group is used to identify the user to whom an experience event belongs.",
    "allOf": [{
        "$ref": "#/definitions/customFields"
    }],
    "meta:containerId": "tenant",
    "meta:resourceType": "mixins",
    "meta:xdmType": "object",
    "definitions": {
        "customFields": {
            "type": "object",
            "properties": {
                f"_{tenant_id}": {
                    "type": "object",
                    "properties": {
                        "userId": {
                            "title": "User ID",
                            "description": "This refers to the user having a propensity towards an outcome.",
                            "type": "string"
                        }
                    }
                }
            }
        }
    },
    "meta:intendedToExtend": ["https://ns.adobe.com/xdm/context/experienceevent"]
}

In [36]:
fieldgroup_res = createFieldGroupifnotExists(schema_conn, fieldgroup_title, fieldgroup_data)
fieldgroup_id = fieldgroup_res['$id']
print(f"User ID field group ID: {fieldgroup_id}")

# Get link to field group in AEP UI
import urllib.parse
fieldgroup_link = get_ui_link(tenant_id, "schema/mixin/browse", urllib.parse.quote(fieldgroup_id, safe=""))
print(f"View field group in UI: {fieldgroup_link}")

User ID field group ID: https://ns.adobe.com/cloudmlecosystem/mixins/7db99ef15ef3c4dbd140abb1ee4bf7f189bd06aac2be5b3c
View field group in UI: https://experience-stage.adobe.com/#/@cloudmlecosystem/sname:cmle-governance/platform/schema/mixin/browse/https%3A%2F%2Fns.adobe.com%2Fcloudmlecosystem%2Fmixins%2F7db99ef15ef3c4dbd140abb1ee4bf7f189bd06aac2be5b3c


## 1.2 Compose Experience Event schema

Now we'll create the experience event schema from our custom field group and the following standard field groups:
- Direct Marketing Details
- Web Details

First we'll define some utility functions to gracefully handle cases where the Experience Event schema has already been created:

In [37]:
def getSchemabyTitle(schema_conn: schema.Schema, title: str):
    schemas = schema_conn.getSchemas()
    # Handle case where no schemas have been created
    if 'results' in schemas: 
        return None
    # Filter schemas list for matching title
    match = list(filter(lambda d: d['title'] == title, schemas))
    # XDM schema titles must be unique, so 'match' will have exactly 1 element if a schema
    # with the same title already exists
    if len(match) == 1:
        return match[0]
    else:
        return None

In [38]:
def createSchemaifnotExists(schema_conn: schema.Schema, title: str, fieldGroups: list[str], description: str = "", type: str = "event"):
    existing = getSchemabyTitle(schema_conn, title)
    if existing:
        print(f"'{title}' already exists, retrieving existing schema")
        return existing
    else:
        if type == "event":
            return schema_conn.createExperienceEventSchema(
                name=title,
                fieldGroupIds=fieldGroups,
                description=description
            )
        elif type =="profile":
            return schema_conn.createProfileSchema(
                name=title,
                fieldGroupIds=fieldGroups,
                description=description
            )
        else:
            raise ValueError('"type" must be "event" (default) or "profile"')


Create the Experience Event schema (or retrieve the ID and Alt ID if the schema already exists):

In [39]:
schema_ee_title = f"[CMLE Synthetic Data] Experience Event schema (created by {username})"
schema_ee_fgs = [
    fieldgroup_id,
    "https://ns.adobe.com/xdm/context/experienceevent-directmarketing",
    "https://ns.adobe.com/xdm/context/experienceevent-web"
]
schema_ee_desc = "Profile Schema generated by CMLE for synthetic events"

In [40]:
schema_ee_res = createSchemaifnotExists(
    schema_conn=schema_conn,
    title=schema_ee_title,
    fieldGroups=schema_ee_fgs,
    description=schema_ee_desc
)
schema_ee_id = schema_ee_res['$id']
schema_ee_altId = schema_ee_res["meta:altId"]
print(f"EE Schema ID: {schema_ee_id}")
print(f"EE Schema Alt ID: {schema_ee_altId}")

schema_ee_link = get_ui_link(tenant_id, "schema/mixin/browse", urllib.parse.quote(schema_ee_id, safe=""))
print(f"View EE schema in UI: {schema_ee_link}")

EE Schema ID: https://ns.adobe.com/cloudmlecosystem/schemas/12f4248650750b5ba80644ec9737e73135b90ee1f1ef316d
EE Schema Alt ID: _cloudmlecosystem.schemas.12f4248650750b5ba80644ec9737e73135b90ee1f1ef316d
View EE schema in UI: https://experience-stage.adobe.com/#/@cloudmlecosystem/sname:cmle-governance/platform/schema/mixin/browse/https%3A%2F%2Fns.adobe.com%2Fcloudmlecosystem%2Fschemas%2F12f4248650750b5ba80644ec9737e73135b90ee1f1ef316d


Set "userId" as the primary ID for the schema with ECID as the namespace

In [41]:
identity_type = "ECID"
identity_desc_data = {
    "@type": "xdm:descriptorIdentity",
    "xdm:sourceSchema": schema_ee_id,
    "xdm:sourceVersion": 1,
    "xdm:sourceProperty": f"/_{tenant_id}/userId",
    "xdm:namespace": identity_type,
    "xdm:property": "xdm:id",
    "xdm:isPrimary": True
  }
identity_dsc_ee_res = schema_conn.createDescriptor(
    descriptorObj = identity_desc_data
)
identity_dsc_ee_res

{'@id': '495d45b81df4df10c7486e69b81c1a243ae90514487254ea',
 '@type': 'xdm:descriptorIdentity',
 'xdm:sourceSchema': 'https://ns.adobe.com/cloudmlecosystem/schemas/12f4248650750b5ba80644ec9737e73135b90ee1f1ef316d',
 'xdm:sourceVersion': 1,
 'xdm:sourceProperty': '/_cloudmlecosystem/userId',
 'imsOrg': '3ADF23C463D98F640A494032@AdobeOrg',
 'version': '1',
 'xdm:namespace': 'ECID',
 'xdm:property': 'xdm:id',
 'xdm:isPrimary': True,
 'meta:containerId': '97e9e135-cb1e-49df-a9e1-35cb1e29dfe5',
 'meta:sandboxId': '97e9e135-cb1e-49df-a9e1-35cb1e29dfe5',
 'meta:sandboxType': 'production'}
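
The same descriptor shape is used again for the Profile schema, so the payload could be produced by a small helper. This is a minimal sketch: `build_identity_descriptor` is a hypothetical name, the field names mirror the dictionary above, and the tenant and schema IDs are placeholders.

```python
def build_identity_descriptor(schema_id, source_property, namespace="ECID", is_primary=True):
    # Field names mirror the descriptor payload used in this notebook
    return {
        "@type": "xdm:descriptorIdentity",
        "xdm:sourceSchema": schema_id,
        "xdm:sourceVersion": 1,
        "xdm:sourceProperty": source_property,
        "xdm:namespace": namespace,
        "xdm:property": "xdm:id",
        "xdm:isPrimary": is_primary,
    }

example_tenant = "mytenant"  # placeholder tenant ID
descriptor = build_identity_descriptor(
    schema_id=f"https://ns.adobe.com/{example_tenant}/schemas/abc123",
    source_property=f"/_{example_tenant}/userId",
)
print(descriptor["xdm:sourceProperty"])
```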

Enable EE schema for Profile

In [42]:
enable_ee_res = schema_conn.enableSchemaForRealTime(schema_ee_altId)
enable_ee_res

{'$id': 'https://ns.adobe.com/cloudmlecosystem/schemas/12f4248650750b5ba80644ec9737e73135b90ee1f1ef316d',
 'meta:altId': '_cloudmlecosystem.schemas.12f4248650750b5ba80644ec9737e73135b90ee1f1ef316d',
 'meta:resourceType': 'schemas',
 'version': '1.1',
 'title': '[CMLE Synthetic Data] Experience Event schema (created by jeremypage)',
 'type': 'object',
 'description': 'Profile Schema generated by CMLE for synthetic events',
 'allOf': [{'$ref': 'https://ns.adobe.com/xdm/context/experienceevent',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/cloudmlecosystem/mixins/7db99ef15ef3c4dbd140abb1ee4bf7f189bd06aac2be5b3c',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/context/experienceevent-directmarketing',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/context/experienceevent-web',
   'type': 'object',
   'meta:xdmType': 'object'}],
 'refs': ['https://ns.adobe.com/cloudmlecos

## 1.3 Create Experience Event dataset

First, create a connection to the Catalog API

In [43]:
from aepp import catalog
cat_conn = catalog.Catalog()

Define some utility functions to gracefully handle cases where the Experience Event dataset has already been created:

In [44]:
def getDatasetbyName(cat_conn: catalog.Catalog, name: str):
    datasets = cat_conn.getDataSets()
    match = {k:v for k, v in datasets.items() if v['name'] == name}
    if match:
        return list(match.keys())[0]
    else:
        return None

In [45]:
def createDatasetifNotExists(cat_conn: catalog.Catalog, name: str, schemaId: str):
    existing = getDatasetbyName(cat_conn=cat_conn, name=name)
    if existing:
        return existing
    else:
        dataset = cat_conn.createDataSets(name=name, schemaId=schemaId)
        return dataset[0].split("/")[-1]
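
The helpers above rely on two data shapes: a dictionary of datasets keyed by dataset ID, and a reference string from which the ID is taken as the last path segment. The sketch below exercises both on plain sample data. The sample dictionary is made up, and it is an assumption that `createDataSets` returns references in the same `@/dataSets/<id>` form that `enableDatasetProfile` prints later in this notebook.

```python
# Sample data shaped like the structures the helpers above expect (values are made up)
datasets = {
    "id-001": {"name": "Events dataset"},
    "id-002": {"name": "Profile dataset"},
}

def find_dataset_id(datasets, name):
    # Return the first dataset ID whose name matches, or None
    return next((ds_id for ds_id, ds in datasets.items() if ds["name"] == name), None)

def id_from_reference(reference):
    # "@/dataSets/<id>" -> "<id>"
    return reference.split("/")[-1]

print(find_dataset_id(datasets, "Profile dataset"))
print(find_dataset_id(datasets, "Missing dataset"))
print(id_from_reference("@/dataSets/64f8b18c3b27a8289ea9287d"))
```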

Create the Experience Event dataset

In [46]:
dataset_ee_name = f"[CMLE Synthetic Data] Experience Event dataset (created by {username})"
dataset_ee_id = createDatasetifNotExists(cat_conn=cat_conn, name=dataset_ee_name, schemaId=schema_ee_id)
print(f"EE Dataset ID: {dataset_ee_id}")

dataset_ee_link = get_ui_link(tenant_id, "dataset/browse", dataset_ee_id)
print(f"View EE Dataset in UI: {dataset_ee_link}")


EE Dataset ID: 64f8b18c3b27a8289ea9287d
View EE Dataset in UI: https://experience-stage.adobe.com/#/@cloudmlecosystem/sname:cmle-governance/platform/dataset/browse/64f8b18c3b27a8289ea9287d


Enable dataset for Profile
<div class="alert alert-block alert-warning">
<b>Note:</b> After running this step, open the dataset link printed above in the UI. If the Profile toggle is not enabled, switch it on manually.
</div>

In [47]:
cat_conn.enableDatasetProfile(dataset_ee_id)

['@/dataSets/64f8b18c3b27a8289ea9287d']

# 2. Create Profile schema and dataset

The Profile schema will include the following field groups:
- Loyalty Details
- Personal Contact Details
- Demographic Details
- User Account Details

## 2.1 Create Profile schema

In [48]:
# Set schema parameters
schema_profile_title = f"[CMLE Synthetic Data] Profile Schema (created by {username})"
schema_profile_fgs = [
    'https://ns.adobe.com/xdm/mixins/profile/profile-loyalty-details',
    'https://ns.adobe.com/xdm/context/profile-personal-details',
    'https://ns.adobe.com/xdm/context/profile-person-details',
    'https://ns.adobe.com/xdm/mixins/profile/profile-user-account-details'
]
schema_profile_desc = "Profile Schema generated by CMLE"

In [49]:
schema_profile_res = createSchemaifnotExists(
    schema_conn=schema_conn,
    title=schema_profile_title,
    fieldGroups=schema_profile_fgs,
    description=schema_profile_desc,
    type="profile"
)
schema_profile_id = schema_profile_res['$id']
schema_profile_altId = schema_profile_res["meta:altId"]
print(f"Profile Schema ID: {schema_profile_id}")
print(f"Profile Schema Alt ID: {schema_profile_altId}")

schema_profile_link = get_ui_link(tenant_id, "schema/mixin/browse", urllib.parse.quote(schema_profile_id, safe=""))
print(f"View Profile schema in UI: {schema_profile_link}")

Profile Schema ID: https://ns.adobe.com/cloudmlecosystem/schemas/f415f7d964337d192cd4b53a29fd0c07a5eea100031223ec
Profile Schema Alt ID: _cloudmlecosystem.schemas.f415f7d964337d192cd4b53a29fd0c07a5eea100031223ec
View Profile schema in UI: https://experience-stage.adobe.com/#/@cloudmlecosystem/sname:cmle-governance/platform/schema/mixin/browse/https%3A%2F%2Fns.adobe.com%2Fcloudmlecosystem%2Fschemas%2Ff415f7d964337d192cd4b53a29fd0c07a5eea100031223ec


Set "personID" as the primary ID for the schema with ECID as the namespace

In [50]:
identity_type = "ECID"
identity_desc_data = {
    "@type": "xdm:descriptorIdentity",
    "xdm:sourceSchema": schema_profile_id,
    "xdm:sourceVersion": 1,
    "xdm:sourceProperty": f"/personID",
    "xdm:namespace": identity_type,
    "xdm:property": "xdm:id",
    "xdm:isPrimary": True
  }
identity_dsc_profile_res = schema_conn.createDescriptor(
    descriptorObj = identity_desc_data
)
identity_dsc_profile_res

{'@id': 'acc392e12f4849792fcf5ca1aaaae5453a2056e233aab030',
 '@type': 'xdm:descriptorIdentity',
 'xdm:sourceSchema': 'https://ns.adobe.com/cloudmlecosystem/schemas/f415f7d964337d192cd4b53a29fd0c07a5eea100031223ec',
 'xdm:sourceVersion': 1,
 'xdm:sourceProperty': '/personID',
 'imsOrg': '3ADF23C463D98F640A494032@AdobeOrg',
 'version': '1',
 'xdm:namespace': 'ECID',
 'xdm:property': 'xdm:id',
 'xdm:isPrimary': True,
 'meta:containerId': '97e9e135-cb1e-49df-a9e1-35cb1e29dfe5',
 'meta:sandboxId': '97e9e135-cb1e-49df-a9e1-35cb1e29dfe5',
 'meta:sandboxType': 'production'}

Enable Profile schema for Profile

In [51]:
enable_profile_res = schema_conn.enableSchemaForRealTime(schema_profile_altId)
enable_profile_res

{'$id': 'https://ns.adobe.com/cloudmlecosystem/schemas/f415f7d964337d192cd4b53a29fd0c07a5eea100031223ec',
 'meta:altId': '_cloudmlecosystem.schemas.f415f7d964337d192cd4b53a29fd0c07a5eea100031223ec',
 'meta:resourceType': 'schemas',
 'version': '1.1',
 'title': '[CMLE Synthetic Data] Profile Schema (created by jeremypage)',
 'type': 'object',
 'description': 'Profile Schema generated by CMLE',
 'allOf': [{'$ref': 'https://ns.adobe.com/xdm/context/profile',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/mixins/profile/profile-loyalty-details',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/context/profile-personal-details',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/context/profile-person-details',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/xdm/mixins/profile/profile-user-account-details',
   'type': 'object',
   'meta:xdm

## 2.2 Create Profile dataset

Create the Profile dataset

In [52]:
dataset_profile_name = f"[CMLE Synthetic Data] Profile dataset (created by {username})"
dataset_profile_id = createDatasetifNotExists(cat_conn=cat_conn, name=dataset_profile_name, schemaId=schema_profile_id)
print(f"Profile Dataset ID: {dataset_profile_id}")

dataset_profile_link = get_ui_link(tenant_id, "dataset/browse", dataset_profile_id)
print(f"View Profile Dataset in UI: {dataset_profile_link}")


Profile Dataset ID: 64f8b1d4b2464f289ec5cca4
View Profile Dataset in UI: https://experience-stage.adobe.com/#/@cloudmlecosystem/sname:cmle-governance/platform/dataset/browse/64f8b1d4b2464f289ec5cca4


Enable dataset for Profile
<div class="alert alert-block alert-warning">
<b>Note:</b> After running this step, open the dataset link printed above in the UI. If the Profile toggle is not enabled, switch it on manually.
</div>

In [53]:
cat_conn.enableDatasetProfile(dataset_profile_id)

['@/dataSets/64f8b1d4b2464f289ec5cca4']

# 3. Statistical simulation of Profiles and Experience Events

We will set up a statistical simulation to generate Experience Event data that can be used to illustrate the end-to-end flow of creating a propensity model to predict subscriptions to a brand's paid service.

We will use the standard `web.formFilledOut` event type to represent the subscription conversions that the brand wants to predict, and generate simulated sequences of various types of experience events along with the target subscription conversions that will be used to train a propensity model.

## 3.1 Event types and their contribution to propensity


In [54]:
import random, string
import uuid
from datetime import timedelta
import mmh3
from random import randrange

Define some events and dependencies between the events

In [55]:
advertising_events = {
 
    #eventType          : (weeklyAverageOccurrence, propensityDelta, [(field_to_replace, value)], timeInHoursFromDependent)
    "advertising.clicks": (0.01,                    0.002,            [("advertising/clicks/value", 1.0)], 0.5) , 
    "advertising.impressions": (0.1, 0.001, [("advertising/impressions/value", 1.0)], 0),

    "web.webpagedetails.pageViews": (0.1, 0.005, [("web/webPageDetails/pageViews/value", 1.0)], 0.1),
    "web.webinteraction.linkClicks": (0.05, 0.005, [("web/webInteraction/linkClicks/value", 1.0)], 0.1),
   
    
    "commerce.productViews": (0.05, 0.005, [("commerce/productViews/value", 1.0)], 0.2),
    "commerce.purchases": (0.01, 0.1, [("commerce/purchases/value", 1.0)], 1),
    
    
    "decisioning.propositionDisplay": (0.05, 0.005, [("_experience/decisioning/propositionEventType/display", 1)], 0.1),
    "decisioning.propositionInteract": (0.01, 0.1, [("_experience/decisioning/propositionEventType/interact", 1)], 0.05),
    "decisioning.propositionDismiss": (0.01, -0.2, [("_experience/decisioning/propositionEventType/dismiss", 1)], 0.05),

    
    "directMarketing.emailOpened": (0.2, 0.02, [("directMarketing/opens/value", 1.0)], 24),
    "directMarketing.emailClicked": (0.05, 0.1, [("directMarketing/clicks/value", 1.0)], 0.5),
    "directMarketing.emailSent": (0.5, 0.005, [("directMarketing/sends/value", 1.0)], 0),
    
    "web.formFilledOut": (0.0, 0.0, [("web/webPageDetails/name", "subscriptionForm")], 0),

}

event_dependencies = {
    "advertising.impressions": ["advertising.clicks"],
    "directMarketing.emailSent": ["directMarketing.emailOpened"],
    "directMarketing.emailOpened": ["directMarketing.emailClicked"],
    "directMarketing.emailClicked": ["web.webpagedetails.pageViews"],
    "web.webpagedetails.pageViews": ["web.webinteraction.linkClicks", "commerce.productViews", "decisioning.propositionDisplay"],
    "commerce.productViews": ["commerce.purchases"],
    "decisioning.propositionDisplay": ["decisioning.propositionInteract", "decisioning.propositionDismiss"]
    
}
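
To make the dependency map easier to read, the sketch below walks it breadth-first and lists every event type that can follow from a given starting event. The `downstream_events` helper is hypothetical and is not used by the simulation itself; the map is a copy of the one defined above.

```python
from collections import deque

# Copy of the dependency map defined above
event_dependencies = {
    "advertising.impressions": ["advertising.clicks"],
    "directMarketing.emailSent": ["directMarketing.emailOpened"],
    "directMarketing.emailOpened": ["directMarketing.emailClicked"],
    "directMarketing.emailClicked": ["web.webpagedetails.pageViews"],
    "web.webpagedetails.pageViews": ["web.webinteraction.linkClicks", "commerce.productViews", "decisioning.propositionDisplay"],
    "commerce.productViews": ["commerce.purchases"],
    "decisioning.propositionDisplay": ["decisioning.propositionInteract", "decisioning.propositionDismiss"],
}

def downstream_events(start, dependencies):
    """Breadth-first walk returning every event type reachable from `start`."""
    seen = []
    queue = deque(dependencies.get(start, []))
    while queue:
        event_type = queue.popleft()
        if event_type in seen:
            continue
        seen.append(event_type)
        queue.extend(dependencies.get(event_type, []))
    return seen

for event_type in downstream_events("directMarketing.emailSent", event_dependencies):
    print(event_type)
```

Starting from `directMarketing.emailSent`, the walk reaches the email, web, commerce, and decisioning events, while the advertising events form their own short chain.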

Define utility functions that will be used to implement the event simulation

In [56]:
import numpy as np
from datetime import datetime

def random_date(start, end):
    """
    This function will return a random datetime between two datetime 
    objects.
    """
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    random_second = randrange(int_delta)
    return start + timedelta(seconds=random_second)
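
A quick way to sanity-check `random_date` is to draw a few samples and confirm they fall inside the requested window. The sketch below repeats the function so that it runs on its own; the fixed seed and the sample dates are arbitrary choices.

```python
import random
from datetime import datetime, timedelta

def random_date(start, end):
    # Same logic as above: pick a whole-second offset inside [start, end)
    delta = end - start
    int_delta = (delta.days * 24 * 60 * 60) + delta.seconds
    return start + timedelta(seconds=random.randrange(int_delta))

random.seed(0)  # fixed seed so the sketch is repeatable
start = datetime(2023, 1, 1)
end = start + timedelta(weeks=10)
samples = [random_date(start, end) for _ in range(5)]
print(all(start <= s < end for s in samples))
```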

## 3.2 Event generation process

In [57]:

def create_data_for_n_users(n_users, first_user):
  
  N_USERS = n_users
  FIRST_USER = first_user
  
  N_WEEKS = 10
  GLOBAL_START_DATE = datetime.now() - timedelta(weeks=12)
  GLOBAL_END_DATE = GLOBAL_START_DATE + timedelta(weeks=N_WEEKS)

  events = []

  for user in range(N_USERS):
        user_id = FIRST_USER + user
        user_events = []
        base_events = {}
        for event_type in ["advertising.impressions", "web.webpagedetails.pageViews", "directMarketing.emailSent"]:
            n_events = np.random.poisson(advertising_events[event_type][0] * N_WEEKS)
            times = []
            for _ in range(n_events):
                #times.append(random_date(GLOBAL_START_DATE, GLOBAL_END_DATE)
                times.append(random_date(GLOBAL_START_DATE, GLOBAL_END_DATE).isoformat())

            base_events[event_type] = times

        for event_type, dependent_event_types in event_dependencies.items():

            if event_type in base_events:
                #for each originating event
                for event_time in base_events[event_type]:
                    #Look for possible later on events
                    for dependent_event in dependent_event_types:
                                n_events = np.random.poisson(advertising_events[dependent_event][0] * N_WEEKS)
                                times = []
                                for _ in range(n_events):
                                    #times.append(event_time + timedelta(hours = np.random.exponential(advertising_events[dependent_event][3])))
                                    new_time = datetime.fromisoformat(event_time) + timedelta(hours = np.random.exponential(advertising_events[dependent_event][3]))
                                    times.append(new_time.isoformat())
                                base_events[dependent_event] = times


        for event_type, times in base_events.items():
            for time in times:
                user_events.append({"userId": user_id, "eventType": event_type, "timestamp": time})

        user_events = sorted(user_events, key = lambda x: (x["userId"], x["timestamp"]))


        cumulative_probability = 0.001
        subscribed = False
        for event in user_events:
            cumulative_probability = min(1.0, max(cumulative_probability + advertising_events[event["eventType"]][1], 0))
            event["subscriptionPropensity"] = cumulative_probability
            if not subscribed and "directMarketing" not in event["eventType"] and "advertising" not in event["eventType"]:
                subscribed = np.random.binomial(1, cumulative_probability) > 0
                if subscribed:
                    subscriptiontime = (datetime.fromisoformat(event["timestamp"]) + timedelta(seconds = 60)).isoformat()
                    #subscriptiontime = event["timestamp"] + timedelta(seconds = 60)
                    user_events.append({"userId": user_id, "eventType": "web.formFilledOut",  "timestamp": subscriptiontime})
            event["subscribed"] = subscribed
        user_events = sorted(user_events, key = lambda x: (x["userId"], x["timestamp"]))

        events = events + user_events
  return events
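
The last part of the function above keeps a running subscription propensity for each user: it starts at 0.001, adds the `propensityDelta` of every event in time order, and clamps the result to the range 0 to 1. On events that are neither email nor advertising events, it then draws a Bernoulli sample with that probability to decide whether the user subscribes. The sketch below isolates the accumulation step only; `propensity_trajectory` is a hypothetical helper and the event sequence is an arbitrary example.

```python
def propensity_trajectory(deltas, base=0.001):
    """Running subscription propensity, clamped to [0, 1] after each event."""
    trajectory = []
    p = base
    for delta in deltas:
        p = min(1.0, max(p + delta, 0))
        trajectory.append(p)
    return trajectory

# propensityDelta values taken from the event table above
deltas = [
    0.005,  # directMarketing.emailSent
    0.02,   # directMarketing.emailOpened
    0.1,    # directMarketing.emailClicked
    0.005,  # web.webpagedetails.pageViews
    0.005,  # decisioning.propositionDisplay
    -0.2,   # decisioning.propositionDismiss
]
trajectory = propensity_trajectory(deltas)
print([round(p, 3) for p in trajectory])
```

In this example the dismissed proposition pushes the running value below zero, so the clamp resets the propensity to 0.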

In [58]:
def normalize_ecid(ecid_part):
    ecid_part_str = str(abs(ecid_part))
    if len(ecid_part_str) != 19:
        ecid_part_str = "".join([str(x) for x in range(
            0, 19 - len(ecid_part_str))]) + ecid_part_str
    return ecid_part_str
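
The padding in `normalize_ecid` is easy to misread: it prepends the digits `0`, `1`, `2`, and so on, not a run of zeros. The sketch below copies the helper and runs it on a few hand-picked integers (not real hash values) to show the effect. It also shows that an input of 8 digits or fewer yields more than 19 characters, because the pad then includes the two-character string `10`; this should be rare for 64-bit hash values.

```python
def normalize_ecid(ecid_part):
    # Copy of the helper above
    ecid_part_str = str(abs(ecid_part))
    if len(ecid_part_str) != 19:
        ecid_part_str = "".join([str(x) for x in range(
            0, 19 - len(ecid_part_str))]) + ecid_part_str
    return ecid_part_str

print(normalize_ecid(-1234567890123456789))  # already 19 digits, only the sign is dropped
print(normalize_ecid(123456789012345678))    # 18 digits, padded with "0"
print(normalize_ecid(12345678901234567))     # 17 digits, padded with "01"
print(len(normalize_ecid(12345678)))         # 8 digits: the pad contains "10", so the length exceeds 19
```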

In [59]:

def get_ecid(user_id):
    """
    The ECID must be two valid 19 digit longs concatenated together
    """
    email = f"synthetic-user-{user_id}@adobe.com"
    ecidpart1, ecidpart2 = mmh3.hash64(email)
    ecid1, ecid2 = (normalize_ecid(ecidpart1), normalize_ecid(ecidpart2))
    return ecid1 + ecid2

In [60]:
# Define the data that goes into an email event payload
def create_email_event(user_id, event_type, timestamp):
  """
  Combines previous methods to create various types of email events
  """
  
  if event_type == "directMarketing.emailSent":
    directMarketing = {"emailDelivered": {"value": 1.0}, 
                       "sends": {"value": 1.0}, 
                       "emailVisitorID": user_id,
                       "hashedEmail": ''.join(random.choices(string.ascii_letters + string.digits, k=10)),
                       "messageID": str(uuid.uuid4()),
                      }
  elif event_type == "directMarketing.emailOpened":
    directMarketing = {"offerOpens": {"value": 1.0}, 
                     "opens": {"value": 1.0}, 
                     "emailVisitorID": user_id,
                     "messageID": str(uuid.uuid4()),
                    }
  elif event_type == "directMarketing.emailClicked":
    directMarketing = {"clicks": {"value": 1.0}, 
                     "offerOpens": {"value": 1.0}, 
                     "emailVisitorID": user_id,
                     "messageID": str(uuid.uuid4()),
                    }
  return {
    "directMarketing": directMarketing,
    "web": None,
    "_id": str(uuid.uuid4()),
    "eventMergeId": None,
    "eventType": event_type,
    f"_{tenant_id}": {"userId":get_ecid(user_id)},
    "producedBy": "databricks-synthetic",
    "timestamp": timestamp
  }

In [61]:
# Define the data that goes into a web event payload 
def create_web_event(user_id, event_type, timestamp):
  """
  Combines previous methods to create various types of web events
  """
  url = f"http://www.{''.join(random.choices(string.ascii_letters + string.digits, k=5))}.com"
  ref_url = f"http://www.{''.join(random.choices(string.ascii_letters + string.digits, k=5))}.com"
  name = ''.join(random.choices(string.ascii_letters + string.digits, k=5))
  isHomePage = random.choice([True, False])
  server = ''.join(random.choices(string.ascii_letters + string.digits, k=10))
  site_section = ''.join(random.choices(string.ascii_letters, k=2))
  view_name = ''.join(random.choices(string.ascii_letters, k=3))
  region = ''.join(random.choices(string.ascii_letters + string.digits, k=5))
  interaction_type = random.choice(["download", "exit", "other"])
  web_referrer = random.choice(["internal", "external", "search_engine", "email", "social", "unknown", "usenet", "typed_bookmarked"])
  base_web = {"webInteraction": {"linkClicks": {"value": 0.0}, 
                                 "URL": url, 
                                 "name": name,
                                "region": region,
                                "type": interaction_type},
              "webPageDetails": {"pageViews": {"value": 1.0},
                                 "URL": url,
                                 "isErrorPage": False,
                                 #"isHomepage": isHomePage,
                                 "name": name,
                                 "server": server,
                                 "siteSection": site_section,
                                 "viewName": view_name
                                },
              "webReferrer": {
                "URL": ref_url,
                "type": web_referrer
              }
             }
  if event_type in ["advertising.clicks", "commerce.purchases", "web.webinteraction.linkClicks", "web.formFilledOut", 
                   "decisioning.propositionInteract", "decisioning.propositionDismiss"]:
    base_web["webInteraction"]["linkClicks"]["value"] = 1.0

  return {
    "directMarketing": None,
    "web": base_web,
    "_id": str(uuid.uuid4()),
    "eventMergeId": None,
    "eventType": event_type,
    f"_{tenant_id}": {"userId":get_ecid(user_id)},
    "producedBy": "databricks-synthetic",
    "timestamp": timestamp
  }

In [62]:
    
def create_xdm_event(user_id, event_type, timestamp):
  """
  The final 'event factory' method that converts an event into an XDM event
  """
  if "directMarketing" in event_type:
    return create_email_event(user_id, event_type, timestamp)
  else: 
    return create_web_event(user_id, event_type, timestamp)

In [63]:
def createEventsBatch(n_users, first_user):
    batch_events = create_data_for_n_users(n_users, first_user)
    batch_data = [create_xdm_event(x["userId"], x["eventType"], x["timestamp"]) for x in batch_events]
    return batch_data

## 3.3 Profile generation

The following function generates a set of profiles for populating the Profile dataset

In [74]:
from mimesis import Schema, Field, Locale
import time
def createProfilesBatch(n_users, first_user):

    N_USERS = n_users
    FIRST_USER = first_user
    u = 'u' + str(int(time.time()))

    field = Field(Locale.EN)
    profile_schema = Schema(
        schema=lambda: {
            "personID": get_ecid(FIRST_USER + field("increment", accumulator=u) - 1),
            "person": {
                "name": {
                    "firstName": field("first_name"),
                    "lastName": field("last_name")
                },
                "gender": field("choice", items=['male', 'female', 'not_specified'])
            },
            "personalEmail": {
                "address": field("email", domains=["emailsim.io"]),
            },
            "mobilePhone": {
                "number": field("telephone", mask="###-###-####")
            },
            "homeAddress": {
                "street1": field("address"),
                "city": field("city"),
                "state": field("state", abbr=True),
                "postalCode": field("postal_code")
            },
            "loyalty": {
                "loyaltyID": [field("integer_number", start=5000000, end=6000000)],
                "tier": field("choice", items=["diamond", "platinum", "gold", "silver", "member"]),
                "points": field("integer_number", start=0, end=1000000), 
                "joinDate": field("datetime", start=2000, end=2023).strftime("%Y-%m-%dT%H:%M:%SZ")
            }
        },
        iterations=N_USERS
    )
    return profile_schema.create()

# 4. Ingest synthetic data into AEP dataset

We'll now use the functions defined above to simulate sequences of Experience Events for a number of users, then ingest the simulated event data into the Experience Event dataset we created above.

For each batch, we will:
1. Initialize a batch to ingest to our Experience Event dataset
2. Generate a sequence of simulated events using the `create_data_for_n_users` function
3. Format the events into XDM Experience Event payloads using the `create_xdm_event` function
4. Add the synthetic data to the batch
5. Close the batch

First create a connection to the AEP batch ingestion API:

In [76]:
from aepp import ingestion
ingest_conn = ingestion.DataIngestion()

In [79]:
def ingestBatch(
        ingest_conn: ingestion.DataIngestion,
        dataset_id: str,
        data: list[dict]):
    # Initialize batch creation
    batch_res = ingest_conn.createBatch(
        datasetId = dataset_id
    )
    batch_id = batch_res["id"]
    # Upload data
    file_path = f"batch-synthetic-{batch_id}"
    ingest_conn.uploadSmallFile(
        batchId = batch_id,
        datasetId = dataset_id,
        filePath = file_path,
        data = data
    )
    # Complete the batch
    ingest_conn.uploadSmallFileFinish(
        batchId = batch_id
    )
    return batch_id
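
`ingestBatch` sends each batch in a single `uploadSmallFile` call, so the payload has to stay within whatever size the small-file upload accepts. The sketch below shows one way to estimate the JSON size of a list of records and split it into smaller lists before ingesting. The helper names and the 256 MB constant are assumptions for illustration (check the limit against Adobe's batch ingestion documentation), and the estimate assumes the records are serialized as a plain JSON array, which may differ from what `aepp` does internally.

```python
import json

# Assumed limit for a single small-file upload; verify it for your AEP organization.
ASSUMED_SMALL_FILE_LIMIT_BYTES = 256 * 1024 * 1024


def estimate_payload_bytes(payload) -> int:
    """Size in bytes of the payload once serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def chunk_records(records: list, max_bytes: int = ASSUMED_SMALL_FILE_LIMIT_BYTES) -> list:
    """Greedily group records into lists whose estimated JSON size stays under max_bytes.

    A single record larger than max_bytes still gets a chunk of its own.
    """
    chunks = []
    current = []
    current_size = 2  # opening and closing brackets of the JSON array
    for record in records:
        # Add 2 bytes for the ", " separator; this slightly overestimates the size.
        record_size = estimate_payload_bytes(record) + 2
        if current and current_size + record_size > max_bytes:
            chunks.append(current)
            current, current_size = [], 2
        current.append(record)
        current_size += record_size
    if current:
        chunks.append(current)
    return chunks
```

Each list returned by `chunk_records` could then be passed to `ingestBatch` as its own batch.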

In [80]:
def ingestSyntheticBatches(
        ingest_conn: ingestion.DataIngestion,
        n_users: int = 10000,
        n_batches: int = 10,
        event_dataset_id: str = None,
        profile_dataset_id: str = None
):
    if event_dataset_id is None and profile_dataset_id is None:
        raise ValueError('At least one of "event_dataset_id" or "profile_dataset_id" must be provided')
    event_batch_ids = []
    profile_batch_ids = []
    for b in range(n_batches):
        first_user = b * n_users
        if event_dataset_id is not None:
            event_batch = createEventsBatch(n_users, first_user)
            event_batch_id = ingestBatch(ingest_conn, event_dataset_id, event_batch)
            print(f"Processing events batch {b + 1}/{n_batches} with ID {event_batch_id}")
            event_batch_ids.append(event_batch_id)
        if profile_dataset_id is not None:
            profile_batch = createProfilesBatch(n_users, first_user)
            profile_batch_id = ingestBatch(ingest_conn, profile_dataset_id, profile_batch)
            print(f"Processing profiles batch {b + 1}/{n_batches} with ID {profile_batch_id}")
            profile_batch_ids.append(profile_batch_id)
    return (event_batch_ids, profile_batch_ids)
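
In `ingestSyntheticBatches`, `first_user = b * n_users` gives every batch its own block of user indexes, and the same offset is passed to both the event and the profile generators. The sketch below makes those index blocks explicit. `batch_user_ranges` is a hypothetical helper that is not part of the notebook, and it assumes the generators number their users consecutively starting at `first_user`, as the profile generator above does.

```python
def batch_user_ranges(n_users: int, n_batches: int) -> list:
    """Return the block of user indexes covered by each batch.

    Batch b covers indexes b * n_users through (b + 1) * n_users - 1,
    mirroring `first_user = b * n_users` in the ingestion loop.
    """
    return [range(b * n_users, (b + 1) * n_users) for b in range(n_batches)]


# Example: 3 batches of 4 users each
for b, user_range in enumerate(batch_user_ranges(4, 3)):
    print(f"batch {b + 1}: users {user_range.start} to {user_range.stop - 1}")
```

Because the blocks do not overlap, no user index is generated twice across batches, and the events and profiles of a given batch refer to the same indexes.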


Then run the steps described above to generate and ingest simulated events and profiles for the desired number of batches.

In [81]:
num_batches = 10
batch_size = 10000

event_batches, profile_batches = ingestSyntheticBatches(
    ingest_conn=ingest_conn,
    n_users=batch_size,
    n_batches=num_batches,
    event_dataset_id=dataset_ee_id,
    profile_dataset_id=dataset_profile_id
)
print(event_batches)
print(profile_batches)

Processing events batch 1/10 with ID 01H9NQYBHNJR37GZG8ZCW3PB9X
Processing profiles batch 1/10 with ID 01H9NQYW04QGPT9M2S9HV5BNED
Processing events batch 2/10 with ID 01H9NQZ9B59PJ3CFXJY8VSZCXM
Processing profiles batch 2/10 with ID 01H9NQZNZ3P7Y90719YY74XFR5
Processing events batch 3/10 with ID 01H9NR03458719JQWPSH3PQ2FP
Processing profiles batch 3/10 with ID 01H9NR0NW6E3ATRYJ10R2P8MF8
Processing events batch 4/10 with ID 01H9NR14H46W9GY70X2V0SG4SY
Processing profiles batch 4/10 with ID 01H9NR1JZBNS0CCDM5QFKMKPRF
Processing events batch 5/10 with ID 01H9NR20MA4NP475QF6DN3N4GA
Processing profiles batch 5/10 with ID 01H9NR2FBQWPA9TTYHT7C1Q33A
Processing events batch 6/10 with ID 01H9NR2X34BH7F18Y44BNKGPHC
Processing profiles batch 6/10 with ID 01H9NR39F8ZFH0SSAQJ8E25JBV
Processing events batch 7/10 with ID 01H9NR3QSPCC8WNYNCDPBJZHTR
Processing profiles batch 7/10 with ID 01H9NR45CQFH1V9QX7TEAYR0ZC
Processing events batch 8/10 with ID 01H9NR4KWVFCH3S6N0EZJTKNXH
Processing profiles batch 

**Note**: Batches are ingested asynchronously in AEP. Depending on how your AEP organization has been provisioned, it may take some time for all the data generated here to become available in your dataset. You can check the ingestion status of all your batches on [the dataset page of your AEP UI](https://experience.adobe.com/#/@TENANT/sname:SANDBOX/platform/dataset/browse/DATASETID).
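
The link above contains three placeholders: `TENANT`, `SANDBOX` and `DATASETID`. As a small convenience, the sketch below fills them in to build the URL for a given dataset. `dataset_ui_url` is a hypothetical helper, and the argument values in the example are made up.

```python
def dataset_ui_url(tenant: str, sandbox: str, dataset_id: str) -> str:
    """Fill the TENANT, SANDBOX and DATASETID placeholders of the dataset page URL."""
    return (
        "https://experience.adobe.com/"
        f"#/@{tenant}/sname:{sandbox}/platform/dataset/browse/{dataset_id}"
    )


# Example with made-up values; replace them with your own tenant, sandbox and dataset ID.
print(dataset_ui_url("mytenant", "mysandbox", "0123456789abcdef"))
```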

You can also check the ingestion status from the notebook by running the following cell:

In [84]:
from aepp import catalog
import time
cat_conn = catalog.Catalog()

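# Poll the catalog every 30 seconds until no batch of the profile dataset is still in "staging".
# Only `dataset_profile_id` is checked here; repeat with `dataset_ee_id` to track the event batches.
all_ingested = False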
while not all_ingested:
  incomplete_batches = cat_conn.getBatches(
    limit=min(100, num_batches),
    n_results=num_batches,
    output="dataframe",
    dataSet=dataset_profile_id,
    status="staging"
  )
  
  num_incomplete_batches = len(incomplete_batches)
  if num_incomplete_batches == 0:
    print("All batches have been ingested")
    all_ingested = True
  else:
    print(f"Remaining batches being ingested: {num_incomplete_batches}")
    time.sleep(30)

All batches have been ingested
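
The loop above keeps polling for as long as any batch remains in the `staging` status. If you would rather give up after a fixed amount of time, the sketch below wraps the same polling pattern in a function with a timeout. `wait_for_ingestion` and its parameters are hypothetical and not part of `aepp`; the function only needs a callable that returns the number of batches still pending.

```python
import time
from typing import Callable


def wait_for_ingestion(
    count_pending: Callable[[], int],
    poll_seconds: float = 30,
    timeout_seconds: float = 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until count_pending() returns 0 or the timeout expires.

    Returns True if nothing is pending anymore, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        pending = count_pending()
        if pending == 0:
            print("All batches have been ingested")
            return True
        if time.monotonic() >= deadline:
            print(f"Timed out with {pending} batches still pending")
            return False
        print(f"Remaining batches being ingested: {pending}")
        sleep(poll_seconds)
```

In this notebook, `count_pending` could be a lambda returning `len(...)` of the same `cat_conn.getBatches(...)` call used in the cell above.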
