# Scope of Notebook

The goal of this notebook is to showcase how you can use, in your own environment, a pre-trained model along with some profile data extracted from the Adobe Experience Platform to generate propensity scores and ingest those back to enrich the Unified Profile.

![Workflow](../media/CMLE-SageMaker-Notebooks-Week4-Workflow.png)

We'll go through several steps:
- **Reading the featurized data** from the Amazon S3 bucket
- Generating the **scores**
- Creating a **target dataset**
- Creating a **dataflow** to deliver data in the right format to that dataset.

# Setup

This notebook requires some configuration data to properly authenticate to your Adobe Experience Platform instance. You should be able to find all the required values by following the Setup section of the **README**.

A later cell in this section looks for your configuration file under your **ADOBE_HOME** path to fetch the values used throughout this notebook. See the Setup section of the **README** for details on how to create your configuration file.

In [2]:
!apt update -y
!apt install software-properties-common -y
!apt install default-jdk -y

Hit:1 http://deb.debian.org/debian bullseye InRelease
Get:2 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:3 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Fetched 92.4 kB in 0s (369 kB/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
34 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
software-properties-common is already the newest version (0.96.20.2-2.1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
default-jdk is already the newest version (2:1.11-72).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


In [2]:
!java -version

openjdk version "11.0.20" 2023-07-18
OpenJDK Runtime Environment (build 11.0.20+8-post-Debian-1deb11u1)
OpenJDK 64-Bit Server VM (build 11.0.20+8-post-Debian-1deb11u1, mixed mode, sharing)


In [3]:
from pyspark import SparkConf
from pyspark.sql import SparkSession
import sagemaker_pyspark

conf = (SparkConf()
        .set("spark.driver.extraClassPath", ":".join(sagemaker_pyspark.classpath_jars())))

print(str(sagemaker_pyspark.classpath_jars()))

spark = (
    SparkSession
    .builder
    .config(conf=conf)
    .config("spark.jars", "../../jars/hadoop-aws-3.3.6.jar,../../jars/hadoop-common-3.3.6.jar")
    .appName("schema_test")
    .getOrCreate()
)
print(spark.version)

['/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/aws-java-sdk-bundle-1.11.901.jar', '/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/aws-java-sdk-core-1.12.262.jar', '/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/aws-java-sdk-kms-1.12.262.jar', '/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/aws-java-sdk-s3-1.12.262.jar', '/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/aws-java-sdk-sagemaker-1.12.262.jar', '/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/aws-java-sdk-sagemakerruntime-1.12.262.jar', '/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/aws-java-sdk-sts-1.12.262.jar', '/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/hadoop-aws-3.3.1.jar', '/opt/conda/lib/python3.10/site-packages/sagemaker_pyspark/jars/sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar']
23/09/28 05:53:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using buil

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/09/28 05:53:23 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
3.3.0


![config_file_week4](images/config_file_week4.png)

In [4]:
import os
from configparser import ConfigParser

# Point ADOBE_HOME at the repository root unless it is already set in the environment
os.environ.setdefault("ADOBE_HOME", "../../")

config = ConfigParser()
config_path = os.path.join(os.environ["ADOBE_HOME"], "conf", "config.ini")

if not os.path.exists(config_path):
    raise Exception(f"Looking for configuration under {config_path} but config not found, please verify path")

config.read(config_path)

ims_org_id = config.get("Platform", "ims_org_id")
sandbox_name = config.get("Platform", "sandbox_name")
environment = config.get("Platform", "environment")
client_id = config.get("Authentication", "client_id")
client_secret = config.get("Authentication", "client_secret")
scopes = config.get("Authentication", "scopes")
tech_account_id = config.get("Authentication", "tech_acct_id")
dataset_id = config.get("Platform", "dataset_id")
featurized_dataset_id = config.get("Platform", "featurized_dataset_id")
export_path = config.get("Cloud", "export_path")
import_path = config.get("Cloud", "import_path")
data_format = config.get("Cloud", "data_format")
compression_type = config.get("Cloud", "compression_type")
model_name = config.get("Cloud", "model_name")
s3_bucket_name = config.get("AWS","s3_bucket_name")
s3_prefix = config.get("AWS","s3_prefix")
cfn_stack_id = config.get("AWS", "cfn_stack_id")

if not s3_bucket_name or not s3_prefix or not cfn_stack_id:
    raise Exception("Please make sure the above fields s3_bucket_name, s3_prefix, cfn_stack_id are all populated with valid values in config.ini under the AWS section")
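
For reference, the keys read above imply a `config.ini` laid out roughly as follows. This is a minimal sketch: the section and key names are the ones the cell above reads, but every value is a made-up placeholder, and the `missing_keys` helper is hypothetical. The sample is parsed from a string so it can be checked without touching your real file.

```python
from configparser import ConfigParser

# Placeholder values only; section and key names mirror the config.get calls above.
SAMPLE_CONFIG = """
[Platform]
ims_org_id = EXAMPLE_ORG@AdobeOrg
sandbox_name = my-sandbox
environment = prod
dataset_id = placeholder-dataset-id
featurized_dataset_id = placeholder-featurized-id

[Authentication]
client_id = placeholder-client-id
client_secret = placeholder-secret
scopes = placeholder-scopes
tech_acct_id = placeholder-tech-account

[Cloud]
export_path = placeholder/export
import_path = placeholder/import
data_format = parquet
compression_type = gzip
model_name = cmle_propensity_model

[AWS]
s3_bucket_name = my-bucket
s3_prefix = my-prefix
cfn_stack_id = my-stack
"""

# Keys the notebook reads, grouped by section.
REQUIRED = {
    "Platform": ["ims_org_id", "sandbox_name", "environment",
                 "dataset_id", "featurized_dataset_id"],
    "Authentication": ["client_id", "client_secret", "scopes", "tech_acct_id"],
    "Cloud": ["export_path", "import_path", "data_format",
              "compression_type", "model_name"],
    "AWS": ["s3_bucket_name", "s3_prefix", "cfn_stack_id"],
}


def missing_keys(parser, required=REQUIRED):
    """Return 'Section.key' for every required key that is absent or empty."""
    missing = []
    for section, keys in required.items():
        for key in keys:
            if not parser.has_option(section, key) or not parser.get(section, key).strip():
                missing.append(f"{section}.{key}")
    return missing


sample = ConfigParser()
sample.read_string(SAMPLE_CONFIG)
print(missing_keys(sample))
```

Running the same check against your own parsed configuration gives a list of everything that still needs to be filled in, rather than failing on the first missing key.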


Some utility functions that will be used throughout this notebook:

In [5]:
def get_ui_link(tenant_id, resource_type, resource_id):
    if environment == "prod":
        prefix = f"https://experience.adobe.com"
    else:
        prefix = f"https://experience-{environment}.adobe.com"
    return f"{prefix}/#/@{tenant_id}/sname:{sandbox_name}/platform/{resource_type}/{resource_id}"

In [6]:
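import json
import re

NOTEBOOK_METADATA_FILE = "/opt/ml/metadata/resource-metadata.json"

# Fall back to a generic name when the SageMaker metadata file is not available
username = "unknown_user"
if os.path.exists(NOTEBOOK_METADATA_FILE):
    with open(NOTEBOOK_METADATA_FILE, "rb") as f:
        username = json.load(f).get("UserProfileName", username)

# Keep only alphanumeric characters so the name is safe to embed in resource titles
username = re.sub("[^0-9a-zA-Z]+", "_", username)
print(f"Username: {username}")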

Username: amuiaws
Unique ID: amuiaws


Before going further, make sure to install the following libraries required by this notebook. They are all publicly available; the cell below pins specific versions, although the latest versions should work fine.

In [7]:
!pip install mlflow==2.7.1
!pip install aepp==0.3.1-7


We'll be using the [aepp Python library](https://github.com/pitchmuc/aepp) here to interact with AEP APIs and create a schema and dataset suitable for ingesting our propensity scores further down the line. This library simply provides a programmatic interface around the REST APIs, but all these steps could be completed similarly using the raw APIs directly or even in the UI. For more information on the underlying APIs please see [the API reference guide](https://developer.adobe.com/experience-platform-apis/).

Before any calls can take place, we need to configure the library and set up authentication credentials. For this you'll need the following pieces of information. For details on how to get them, please refer to the `Setup` section of the **README**:
- Client ID
- Client secret
- Technical account ID

In [8]:
import aepp

aepp.configure(
  org_id=ims_org_id,
  tech_id=tech_account_id, 
  secret=client_secret,
  scopes=scopes,
  client_id=client_id,
  environment=environment,
  sandbox=sandbox_name
)

# 1. Generating Propensity Scores Using the Trained Model

## 1.1 Reading the Featurized Data from S3

In the Week2Notebook, the featurized dataset was written to Amazon S3. Then, the Week3Notebook read a sampled portion of the dataset to train our model. At this point, we want to score all of the profiles, so we need to read everything.

The featurized data exported to Amazon S3 is stored under a path of the form **\$S3PREFIX**/**\$DATASETID**/**exportTime=\$EXPORTTIME**. We know the dataset ID, which is in your config under `featurized_dataset_id`, so we are only missing the export time to know what to read. To get it, we can simply list the files in S3 and read the value from the path.

Now we use some Python libraries to authenticate and issue listing commands so we can get the paths and extract the time from it.

In [9]:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket(s3_bucket_name)
objects = bucket.objects.filter(
    Prefix=f'{s3_prefix}/{featurized_dataset_id}',
    MaxKeys=1
)

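# Keep the last object returned by the listing and fail early if nothing matched
obj = None
for candidate in objects:
    obj = candidate
if obj is None:
    raise Exception(f"No objects found under {s3_prefix}/{featurized_dataset_id}")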

print(obj.key)

parts = obj.key.split('/')
for part in parts:
    if part.startswith('exportTime'):
        export_time = part.split('=')[1]

print(f'Using featurized data export time of {export_time}')

take2/6504a2f1abf9b128d3f90e5d/exportTime=20230919232925/part-00000-tid-1963095377847182754-6035fc41-0864-422f-9085-22345d077206-3292655-1-c000.gz.parquet
Using featurized data export time of 20230919232925
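
The path parsing above can be isolated into a small helper that is easy to check without S3 access. This is a minimal sketch: `extract_export_time` is a hypothetical name, the sample key uses a shortened, made-up file name, and the `YYYYMMDDHHMMSS` timestamp format is an assumption inferred from the value printed above.

```python
from datetime import datetime


def extract_export_time(key):
    """Return the value of the first 'exportTime=...' path segment, or None."""
    for part in key.split("/"):
        if part.startswith("exportTime="):
            return part.split("=", 1)[1]
    return None


# Sample key with the same layout as the listing above (file name shortened).
sample_key = "take2/6504a2f1abf9b128d3f90e5d/exportTime=20230919232925/part-00000.gz.parquet"
export_time_str = extract_export_time(sample_key)

# Assumption: the value is a timestamp formatted as YYYYMMDDHHMMSS.
export_dt = datetime.strptime(export_time_str, "%Y%m%d%H%M%S")
print(export_time_str, export_dt.isoformat())
```

Returning `None` when no `exportTime=` segment is present makes a missing export explicit, instead of leaving the variable undefined.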


In [10]:
parquet_s3a_path = f's3a://{s3_bucket_name}/{obj.key}'
print(parquet_s3a_path)
df = spark.read.parquet(parquet_s3a_path)
df.printSchema()

s3a://mui-aep-testing/take2/6504a2f1abf9b128d3f90e5d/exportTime=20230919232925/part-00000-tid-1963095377847182754-6035fc41-0864-422f-9085-22345d077206-3292655-1-c000.gz.parquet
23/09/28 05:55:23 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties


                                                                                

root
 |-- userId: string (nullable = true)
 |-- eventType: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- subscriptionOccurred: long (nullable = true)
 |-- emailsReceived: long (nullable = true)
 |-- emailsOpened: long (nullable = true)
 |-- emailsClicked: long (nullable = true)
 |-- productsViewed: long (nullable = true)
 |-- propositionInteracts: long (nullable = true)
 |-- propositionDismissed: long (nullable = true)
 |-- webLinkClicks: long (nullable = true)
 |-- minutes_since_emailSent: integer (nullable = true)
 |-- minutes_since_emailOpened: integer (nullable = true)
 |-- minutes_since_emailClick: integer (nullable = true)
 |-- minutes_since_productView: integer (nullable = true)
 |-- minutes_since_propositionInteract: integer (nullable = true)
 |-- minutes_since_propositionDismiss: integer (nullable = true)
 |-- minutes_since_linkClick: integer (nullable = true)
 |-- random_row_number_for_user: integer (nullable = true)



We can verify it matches what we had written out in the second weekly assignment:

In [11]:
df.count()

97418

And also do a sanity check on the data to make sure it looks good:

In [18]:
df.show()

+--------------------+--------------------+--------------------+--------------------+--------------+------------+-------------+--------------+--------------------+--------------------+-------------+-----------------------+-------------------------+------------------------+-------------------------+---------------------------------+--------------------------------+-----------------------+--------------------------+
|              userId|           eventType|           timestamp|subscriptionOccurred|emailsReceived|emailsOpened|emailsClicked|productsViewed|propositionInteracts|propositionDismissed|webLinkClicks|minutes_since_emailSent|minutes_since_emailOpened|minutes_since_emailClick|minutes_since_productView|minutes_since_propositionInteract|minutes_since_propositionDismiss|minutes_since_linkClick|random_row_number_for_user|
+--------------------+--------------------+--------------------+--------------------+--------------+------------+-------------+--------------+--------------------+-

                                                                                

In [19]:
df = df.fillna(0)
df.show()

+--------------------+--------------------+--------------------+--------------------+--------------+------------+-------------+--------------+--------------------+--------------------+-------------+-----------------------+-------------------------+------------------------+-------------------------+---------------------------------+--------------------------------+-----------------------+--------------------------+
|              userId|           eventType|           timestamp|subscriptionOccurred|emailsReceived|emailsOpened|emailsClicked|productsViewed|propositionInteracts|propositionDismissed|webLinkClicks|minutes_since_emailSent|minutes_since_emailOpened|minutes_since_emailClick|minutes_since_productView|minutes_since_propositionInteract|minutes_since_propositionDismiss|minutes_since_linkClick|random_row_number_for_user|
+--------------------+--------------------+--------------------+--------------------+--------------+------------+-------------+--------------+--------------------+-

## 1.2 Scoring the Profiles

For scoring we need 2 things:
1. The **data** to score.
2. The **trained model** that will be used to do the scoring.

We just created a dataframe containing the first one, and in the previous weekly assignment we created a production model that can operate on this data, so let's fetch this model from our model hub and turn it into a Spark UDF so it can interact with our data easily:

In [20]:
print(model_name)
import mlflow 
mlflow.set_tracking_uri("file:///root/mlruns")


cmle_propensity_model


In [21]:
import mlflow.pyfunc

model_udf = mlflow.pyfunc.spark_udf(spark, f"models:/{model_name}/production")

Downloading artifacts:   0%|          | 0/5 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

2023/09/28 05:58:00 INFO mlflow.models.flavor_backend_registry: Selected backend for flavor 'python_function'


Now we're ready to apply our trained model on top of the entire dataset. For that, we need to give this UDF its inputs - which in our case are all the columns that the model needs to operate on. We can get that easily as the Spark dataframe contains metadata about its columns:

In [22]:
from pyspark.sql.functions import struct
 
# Apply the model to the new data
udf_inputs = struct(*(df.columns))
 
df_scored = df.withColumn(
  "prediction",
  model_udf(udf_inputs)
)
df_scored.printSchema()

root
 |-- userId: string (nullable = true)
 |-- eventType: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- subscriptionOccurred: long (nullable = true)
 |-- emailsReceived: long (nullable = true)
 |-- emailsOpened: long (nullable = true)
 |-- emailsClicked: long (nullable = true)
 |-- productsViewed: long (nullable = true)
 |-- propositionInteracts: long (nullable = true)
 |-- propositionDismissed: long (nullable = true)
 |-- webLinkClicks: long (nullable = true)
 |-- minutes_since_emailSent: integer (nullable = true)
 |-- minutes_since_emailOpened: integer (nullable = true)
 |-- minutes_since_emailClick: integer (nullable = true)
 |-- minutes_since_productView: integer (nullable = true)
 |-- minutes_since_propositionInteract: integer (nullable = true)
 |-- minutes_since_propositionDismiss: integer (nullable = true)
 |-- minutes_since_linkClick: integer (nullable = true)
 |-- random_row_number_for_user: integer (nullable = true)
 |-- prediction: array (nullabl

If we look at the data we should see a new column called `prediction` which corresponds to the score generated by the model for this particular profile based on all the features computed earlier.

In [23]:
df_scored.show()

[Stage 8:>                                                          (0 + 1) / 1]

+--------------------+--------------------+--------------------+--------------------+--------------+------------+-------------+--------------+--------------------+--------------------+-------------+-----------------------+-------------------------+------------------------+-------------------------+---------------------------------+--------------------------------+-----------------------+--------------------------+--------------------+
|              userId|           eventType|           timestamp|subscriptionOccurred|emailsReceived|emailsOpened|emailsClicked|productsViewed|propositionInteracts|propositionDismissed|webLinkClicks|minutes_since_emailSent|minutes_since_emailOpened|minutes_since_emailClick|minutes_since_productView|minutes_since_propositionInteract|minutes_since_propositionDismiss|minutes_since_linkClick|random_row_number_for_user|          prediction|
+--------------------+--------------------+--------------------+--------------------+--------------+------------+---------

                                                                                

To bring the scored profiles back into Adobe Experience Platform, we don't need to bring back all the features. In fact, we only really need 2 columns:
- The user ID, so we know in the Unified Profile to which profile this row corresponds.
- The score for this user ID.

In [24]:
from pyspark.sql.functions import udf, col, lit, create_map, array, struct, current_timestamp

from itertools import chain

df_to_ingest = df_scored.select(
  "userId",
  "prediction"
).cache()
df_to_ingest.printSchema()

root
 |-- userId: string (nullable = true)
 |-- prediction: array (nullable = true)
 |    |-- element: double (containsNull = true)



In [25]:
df_to_ingest.count()

                                                                                

97418

In [26]:
df_to_ingest.show()

+--------------------+--------------------+
|              userId|          prediction|
+--------------------+--------------------+
|39532772609378079...|[0.8389126519099904]|
|61601964786234821...|[0.03970397082988...|
|88220432841898108...|[0.03763855055100...|
|68649542653677654...|[0.00719929588625...|
|66226762425417586...|[0.8516564014277057]|
|31630450781799875...|[0.8280758828013567]|
|45127514250829205...|[0.00433036866033...|
|37753065076778385...|[0.00294472852455...|
|70183182020082786...|[0.3205237097396433]|
|86727598932352157...|[0.00420483596433...|
|46721051801070672...|[0.00591178580714...|
|48091565267902934...|[0.07270872958279...|
|74877651972137620...|[0.08988605383861...|
|71768820913809895...|[0.3767206724141176]|
|81842598888358053...|[0.07497042665872...|
|85356416789643443...|[0.00582912375956...|
|48740060473731326...|[0.06772841662926...|
|52268593109347095...|[0.4175620051662752]|
|03170541752185975...|[0.00348903382015...|
|17034210353855217...|  [0.55972

At this point we have the scored profiles, which is exactly what we need to bring back into Adobe Experience Platform. We're not quite ready to write the results yet, though; a bit of setup needs to happen first:
- We need to create and configure a destination **dataset** in Adobe Experience Platform where our data will end up.
- We need to set up a **data flow** that will take this data, convert it into an XDM format, and deliver it to this dataset.

# 2. Bringing the Scores back into Unified Profile

## 2.1 Create ingestion schema and dataset

The first step is to define where this propensity data we are creating as the output of our model should end up in the Unified Profile. We need to create a few entities for that:
- A **fieldgroup** that will define the XDM for where propensity scores should be stored.
- A **schema** based on that field group that will tie it back to the concept of profile.
- A **dataset** based on that schema that will hold the data.

The structure itself is pretty simple; we just need 2 fields:
- The **propensity** itself as a decimal number.
- The **user ID** to which this propensity score relates.

Let's put that in practice and create the field group. Note that because we are creating custom fields here, they need to be nested under the tenant ID corresponding to your organization.

In [27]:
from aepp import schema

schema_conn = schema.Schema()

tenant_id = schema_conn.getTenantId()
tenant_id

'exchangesandboxbravo'

In [24]:
fieldgroup_res = schema_conn.createFieldGroup({
  "type": "object",
  "title": f"[CMLE][Week4] Fieldgroup for user propensity (created by {username})",
  "description": "This mixin is used to define a propensity score that can be assigned to a given profile.",
  "allOf": [{
    "$ref": "#/definitions/customFields"
  }],
  "meta:containerId": "tenant",
  "meta:resourceType": "mixins",
  "meta:xdmType": "object",
  "definitions": {
    "customFields": {
      "type": "object",
      "properties": {
        f"_{tenant_id}": {
          "type": "object",
          "properties": {
            "propensity": {
              "title": "Propensity",
              "description": "This refers to the propensity of a user towards an outcome.",
              "type": "number"
            },
            "cmle_id": {
              "title": "CMLE User ID",
              "description": "This refers to the user having a propensity towards an outcome.",
              "type": "string"
            }
          }
        }
      }
    }
  },
  "meta:intendedToExtend": ["https://ns.adobe.com/xdm/context/profile"]
})

fieldgroup_id = fieldgroup_res["$id"]
fieldgroup_id

'https://ns.adobe.com/exchangesandboxbravo/mixins/c4fd541a164db4552ea79a4befcd9223fd05029664a5ec29'

In [30]:
print(fieldgroup_res)

{'$id': 'https://ns.adobe.com/exchangesandboxbravo/mixins/c4fd541a164db4552ea79a4befcd9223fd05029664a5ec29', 'meta:altId': '_exchangesandboxbravo.mixins.c4fd541a164db4552ea79a4befcd9223fd05029664a5ec29', 'meta:resourceType': 'mixins', 'version': '1.0', 'title': '[CMLE][Week4] Fieldgroup for user propensity (created by amuiaws)', 'type': 'object', 'description': 'This mixin is used to define a propensity score that can be assigned to a given profile.', 'definitions': {'customFields': {'type': 'object', 'properties': {'_exchangesandboxbravo': {'type': 'object', 'properties': {'propensity': {'title': 'Propensity', 'description': 'This refers to the propensity of a user towards an outcome.', 'type': 'number', 'meta:xdmType': 'number'}, 'cmle_id': {'title': 'CMLE User ID', 'description': 'This refers to the user having a propensity towards an outcome.', 'type': 'string', 'meta:xdmType': 'string'}}, 'meta:xdmType': 'object'}}, 'meta:xdmType': 'object'}}, 'allOf': [{'$ref': '#/definitions/cus

With this field group ID, we can create a brand new schema that includes the field group and is marked for profiles.

In [26]:
schema_res = schema_conn.createProfileSchema(
  name=f"[CMLE][Week4] Schema for user propensity ingestion (created by {username})",
  mixinIds=[
    fieldgroup_id
  ],
  description="Schema generated by CMLE for user propensity score ingestion",
)

schema_id = schema_res["$id"]
schema_alt_id = schema_res["meta:altId"]

print(f"Schema ID: {schema_id}")
print(f"Schema Alt ID: {schema_alt_id}")

Schema ID: https://ns.adobe.com/exchangesandboxbravo/schemas/b301e4b7423ff9bbfbf6a54dc3a4150b0411bb59932dfdf1
Schema Alt ID: _exchangesandboxbravo.schemas.b301e4b7423ff9bbfbf6a54dc3a4150b0411bb59932dfdf1


Because we eventually intend for these scores to end up in the Unified Profile, we need to specify which field of the schema corresponds to an identity so it can resolve the corresponding profile. In our case, the `cmle_id` field holds the user ID, which is an ECID, so we mark it as such.

In [27]:
identity_type = "ECID"
descriptor_res = schema_conn.createDescriptor(
  descriptorObj = {
    "@type": "xdm:descriptorIdentity",
    "xdm:sourceSchema": schema_id,
    "xdm:sourceVersion": 1,
    "xdm:sourceProperty": f"/_{tenant_id}/cmle_id",
    "xdm:namespace": identity_type,
    "xdm:property": "xdm:id",
    "xdm:isPrimary": True
  }
)
descriptor_res

{'@id': '40cf57d831d77e9015656aa0c470822b03c5b3ffb5fc89aa',
 '@type': 'xdm:descriptorIdentity',
 'xdm:sourceSchema': 'https://ns.adobe.com/exchangesandboxbravo/schemas/b301e4b7423ff9bbfbf6a54dc3a4150b0411bb59932dfdf1',
 'xdm:sourceVersion': 1,
 'xdm:sourceProperty': '/_exchangesandboxbravo/cmle_id',
 'imsOrg': 'EFE243245DB9D3DD0A495E80@AdobeOrg',
 'version': '1',
 'xdm:namespace': 'ECID',
 'xdm:property': 'xdm:id',
 'xdm:isPrimary': True,
 'meta:containerId': 'e0b17021-6875-424f-b170-216875124f2d',
 'meta:sandboxId': 'e0b17021-6875-424f-b170-216875124f2d',
 'meta:sandboxType': 'development'}

And of course that schema needs to be enabled for Unified Profile consumption, so it can be added to the profile union schema.

In [28]:
enable_res = schema_conn.enableSchemaForRealTime(schema_alt_id)
enable_res

{'$id': 'https://ns.adobe.com/exchangesandboxbravo/schemas/b301e4b7423ff9bbfbf6a54dc3a4150b0411bb59932dfdf1',
 'meta:altId': '_exchangesandboxbravo.schemas.b301e4b7423ff9bbfbf6a54dc3a4150b0411bb59932dfdf1',
 'meta:resourceType': 'schemas',
 'version': '1.1',
 'title': '[CMLE][Week4] Schema for user propensity ingestion (created by amuiaws)',
 'type': 'object',
 'description': 'Schema generated by CMLE for user propensity score ingestion',
 'allOf': [{'$ref': 'https://ns.adobe.com/xdm/context/profile',
   'type': 'object',
   'meta:xdmType': 'object'},
  {'$ref': 'https://ns.adobe.com/exchangesandboxbravo/mixins/c4fd541a164db4552ea79a4befcd9223fd05029664a5ec29',
   'type': 'object',
   'meta:xdmType': 'object'}],
 'refs': ['https://ns.adobe.com/exchangesandboxbravo/mixins/c4fd541a164db4552ea79a4befcd9223fd05029664a5ec29',
  'https://ns.adobe.com/xdm/context/profile'],
 'imsOrg': 'EFE243245DB9D3DD0A495E80@AdobeOrg',
 'additionalInfo': {'numberOfIdentities': 1,
  'numberOfRelationShips': 

At this point we're ready to create the dataset that will hold our propensity scores. This dataset is based on the schema we just created and nothing more.

In [29]:
from aepp import catalog

cat_conn = catalog.Catalog()

ingestion_dataset_res = cat_conn.createDataSets(
  name=f"[CMLE][Week4] Dataset for user propensity ingestion (created by {username})",
  schemaId=schema_id
)

ingestion_dataset_id = ingestion_dataset_res[0].split("/")[-1]
ingestion_dataset_id

'65146eaa232a2f28d2561e69'

And similarly that dataset needs to be enabled for Unified Profile consumption, so that any batch of data written to this dataset is automatically picked up and processed to insert into the individual profiles and create new fragments.

In [30]:
# TODO: this is currently failing due to invalid content type, need to fix in aepp, see https://github.com/pitchmuc/aepp/issues/15
# for now just enable in the UI...
cat_conn.enableDatasetProfile(ingestion_dataset_id)

['@/dataSets/65146eaa232a2f28d2561e69']

You should be able to see your dataset in the UI at the link below, and it should match the new schema created as shown in the following screenshot.

In [31]:
ingestion_dataset_link = get_ui_link(tenant_id, "dataset/browse", ingestion_dataset_id)
print(f"Dataset ID {ingestion_dataset_id} available under {ingestion_dataset_link}")

Dataset ID 65146eaa232a2f28d2561e69 available under https://experience.adobe.com/#/@exchangesandboxbravo/sname:amanzonpaymentssandboxae/platform/dataset/browse/65146eaa232a2f28d2561e69


![Dataset](../media/CMLE-Notebooks-Week4-ScoringDataset.png)

With the ingestion dataset defined, let's stage the scored data in S3 so it can be copied back to Adobe Experience Platform by the ingestion Data Flow, which you'll create in Section 2.2. First, define the write path in S3.

In [None]:
from datetime import datetime

scoring_export_time = datetime.utcnow().strftime('%Y%m%d%H%M%S')
protocol = "s3a"
output_path = f"{protocol}://{s3_bucket_name}/{s3_prefix}/{import_path}/{ingestion_dataset_id}/exportTime={scoring_export_time}/"
output_path

Now write the dataframe to the defined output path. The output is written in CSV format and includes a header row, so the column names can be used in the mapping configuration when creating the ingestion Data Flow.

In [None]:
df_to_ingest \
  .withColumn("prediction", col("prediction").getItem(0)) \
  .write \
  .option("header", True) \
  .format("csv") \
  .save(output_path)
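
The `getItem(0)` call matters because the `prediction` column is an array holding a single score, and Spark's CSV writer does not support array columns. To illustrate the shape of the resulting file without Spark, here is a minimal standard-library sketch; the `rows_to_csv` helper and the sample rows are hypothetical and stand in for the dataframe contents.

```python
import csv
import io


def rows_to_csv(rows):
    """Write (userId, [score]) rows as CSV with a header, keeping only the first score."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["userId", "prediction"])
    for user_id, prediction in rows:
        # Take the first element of the array, as getItem(0) does; empty if there is none
        score = prediction[0] if prediction else ""
        writer.writerow([user_id, score])
    return buffer.getvalue()


# Made-up rows with the same two columns as df_to_ingest
sample_rows = [("user-a", [0.25]), ("user-b", [0.75])]
print(rows_to_csv(sample_rows))
```

The header row carries the column names `userId` and `prediction`, which is what the mapping step of the Data Flow refers to.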

Confirm the write to S3 was successful, and store the object filename to be used when creating the ingestion Data Flow.

In [None]:
bucket = s3.Bucket(s3_bucket_name)
objects = bucket.objects.filter(
    Prefix=f'{s3_prefix}/{import_path}/{ingestion_dataset_id}/exportTime={scoring_export_time}/part',
    MaxKeys=1
)

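# Keep the last object returned by the listing and fail early if nothing was written
obj = None
for candidate in objects:
    obj = candidate
if obj is None:
    raise Exception("No CSV part file found under the scoring output path")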

print(obj.key)

parts = obj.key.split('/')
file = parts[-1]  # capture file name

print(f'Found written CSV file in S3: {file}')

## 2.2 Setup ingestion data flow

Now that all the dataset and schema setup is completed, we're ready to define our Data Flow. The Data Flow defines the contract between the source and destination dataset.

For the purposes of this notebook we will be using [Amazon S3](https://experienceleague.adobe.com/docs/experience-platform/sources/connectors/cloud-storage/s3.html?lang=en) as the source filesystem under which the scoring results are written. We use it as a delivery mechanism for the scored data, but this step can be customized to deliver this data from any cloud storage filesystem.

To set up the delivery pipeline, we'll be using the [Flow Service for Source](https://experienceleague.adobe.com/docs/experience-platform/sources/api-tutorials/create/cloud-storage/s3.html?lang=en), which will be responsible for picking up the scored data from S3 and delivering it to Adobe Experience Platform. There are a few steps involved:
- Creating a **source connection**.
- Creating a **target connection**.
- Creating a **transformation**.
- Creating a **data flow**.

For that, again we use `aepp` to abstract all the APIs:

In [None]:
from aepp import flowservice

flow_conn = flowservice.FlowService()
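
Conceptually, the **transformation** step maps the CSV columns written earlier (`userId`, `prediction`) onto the XDM fields defined in Section 2.1 (`cmle_id`, `propensity`, both nested under the tenant namespace). The sketch below only illustrates that idea in plain Python: the `source`/`destination` key names and both helper functions are hypothetical and are not the payload format expected by the Flow Service or `aepp`.

```python
def build_mapping(tenant_id):
    """Describe which flat CSV column feeds which nested XDM field (illustrative format)."""
    return [
        {"source": "userId", "destination": f"_{tenant_id}.cmle_id"},
        {"source": "prediction", "destination": f"_{tenant_id}.propensity"},
    ]


def apply_mapping(row, mapping):
    """Build a nested record from a flat row using dotted destination paths."""
    record = {}
    for entry in mapping:
        *parents, leaf = entry["destination"].split(".")
        target = record
        for name in parents:
            # Create intermediate objects such as '_mytenant' on first use
            target = target.setdefault(name, {})
        target[leaf] = row[entry["source"]]
    return record


# Made-up tenant ID and row values, for illustration only
sample_row = {"userId": "12345", "prediction": 0.84}
print(apply_mapping(sample_row, build_mapping("mytenant")))
```

Both fields end up under a single `_<tenant_id>` object, which matches how the custom fields are declared in the field group.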

The **source connection** is responsible for connecting to your cloud storage account (in our case, Amazon S3) so that the resulting Data Flow knows where the data needs to be picked up from.

For reference, here is a list of all the connection specs available for the most popular cloud storage accounts (these IDs are global across every single customer account and sandbox):

| Cloud Storage Type    | Connection Spec ID                   | Connection Spec Name
|-----------------------|--------------------------------------|----------------------
| Amazon S3             | ecadc60c-7455-4d87-84dc-2a0e293d997b | amazon-s3
| Azure Blob Storage    | d771e9c1-4f26-40dc-8617-ce58c4b53702 | google-adwords
| Azure Data Lake       | b3ba5556-48be-44b7-8b85-ff2b69b46dc4 | adls-gen2
| Data Landing Zone     | 26f526f2-58f4-4712-961d-e41bf1ccc0e8 | landing-zone
| Google Cloud Storage  | 32e8f412-cdf7-464c-9885-78184cb113fd | google-cloud
| SFTP                  | b7bf2577-4520-42c9-bae9-cad01560f7bc | sftp

In [None]:
import boto3
from botocore.exceptions import ClientError

cfn = boto3.client('cloudformation')
secrets_manager = boto3.client('secretsmanager')

access_key = None
secret_name = None
try:
    # Validate the CloudFormation stack ID and read its outputs
    print('Validating CloudFormation ID')
    response = cfn.describe_stacks(StackName=cfn_stack_id)
    print('Stack found')
    for output in response['Stacks'][0]['Outputs']:
        if output['OutputKey'] == 'DataFlowUserAccessKey':
            access_key = output['OutputValue']
            print(f'Found access key: {access_key}')
        if output['OutputKey'] == 'DataFlowUserSecretKey':
            secret_name = output['OutputValue']
            print(f'Found secret stored in Secrets Manager: {secret_name}')
except ClientError:
    # CloudFormation reports a missing or invalid stack as a ClientError
    print(f'Could not describe stack from provided stack ID: {cfn_stack_id}')
    raise

if access_key is None or secret_name is None:
    raise Exception('Stack outputs DataFlowUserAccessKey and DataFlowUserSecretKey are both required')

# Fetch the secret key; it is deliberately not printed so it does not end up in the notebook output
secret_key = secrets_manager.get_secret_value(SecretId=secret_name)['SecretString']
print(f"ACCESS: {access_key}")
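
The loop over the stack outputs can also be written as a dictionary lookup, which makes a missing output easier to detect. This is a minimal sketch on a made-up response: `stack_outputs_to_dict` is a hypothetical helper, and the sample values are placeholders shaped like the `describe_stacks` result used above.

```python
def stack_outputs_to_dict(describe_stacks_response):
    """Flatten the first stack's Outputs list into {OutputKey: OutputValue}."""
    stacks = describe_stacks_response.get("Stacks", [])
    if not stacks:
        return {}
    return {o["OutputKey"]: o["OutputValue"] for o in stacks[0].get("Outputs", [])}


# Made-up response with placeholder values
sample_response = {
    "Stacks": [{
        "Outputs": [
            {"OutputKey": "DataFlowUserAccessKey", "OutputValue": "EXAMPLEACCESSKEY"},
            {"OutputKey": "DataFlowUserSecretKey", "OutputValue": "example-secret-name"},
        ]
    }]
}

outputs_by_key = stack_outputs_to_dict(sample_response)
print(outputs_by_key.get("DataFlowUserAccessKey"))
print(outputs_by_key.get("SomeMissingOutput"))
```

With `dict.get`, an output that the stack does not define comes back as `None`, so it can be checked explicitly before calling Secrets Manager.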

In [189]:
connection_spec_id = "ecadc60c-7455-4d87-84dc-2a0e293d997b"
base_res = flow_conn.createConnection(data={
        "name": f"[CMLE] [Week4] Base Connection to S3 created by {username}",
        "auth": {
            "specName": "Access Key",
            "params": {
                "s3AccessKey": access_key,
                "s3SecretKey": secret_key,
                "bucketName": s3_bucket_name,
                "folderPath": s3_prefix
            }
        },
        "connectionSpec": {
            "id": connection_spec_id,
            "version": "1.0"
        }
    }
)

In [190]:
base_connection_id = base_res["id"]
print(base_connection_id)
print(ingestion_dataset_id)

18f28422-7e8b-410a-8513-bbda3e8fb8a5
65146eaa232a2f28d2561e69


In [181]:
print(schema_id)

https://ns.adobe.com/exchangesandboxbravo/schemas/b301e4b7423ff9bbfbf6a54dc3a4150b0411bb59932dfdf1


In [192]:
source_res = flow_conn.createSourceConnection({
  "name": "[CMLE][Week4] S3 source connection for propensity scores",
  "baseConnectionId": base_connection_id,
  "data": {
      "format": "delimited"
  },
  "params": {
    "path": f"/{s3_bucket_name}/{s3_prefix}/{import_path}/{ingestion_dataset_id}/exportTime={scoring_export_time}/{file}",
    "type": "file"
  },
  "connectionSpec": {
      "id": connection_spec_id,
      "version": "1.0"
  }
})

source_connection_id = source_res["id"]
source_connection_id

'240f9615-8614-4578-af2f-68cc872bf9fd'
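The `path` parameter of the source connection above is assembled from several variables, and a stray or missing slash in any of them produces a path that does not match the exported file. The sketch below shows one way to build it defensively. `build_source_path` is a hypothetical helper, the argument values are placeholders, and unlike the f-string in the cell it drops empty segments instead of emitting `//`.

```python
def build_source_path(bucket, prefix, import_path, dataset_id, export_time, file_name):
    """Assemble the object path using the same layout as the source connection cell."""
    parts = [bucket, prefix, import_path, dataset_id, f"exportTime={export_time}", file_name]
    # Strip stray slashes so that "cmle/" and "cmle" produce the same path,
    # and drop segments that are empty after stripping
    cleaned = [str(p).strip("/") for p in parts if str(p).strip("/")]
    return "/" + "/".join(cleaned)


# Placeholder values for illustration only
example_path = build_source_path(
    bucket="my-bucket",
    prefix="cmle/",
    import_path="scores",
    dataset_id="65146eaa232a2f28d2561e69",
    export_time="20230927",
    file_name="part-0000.csv",
)
print(example_path)
```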

The **target connection** is responsible for connecting to your Adobe Experience Platform dataset so that the resulting Data Flow will know where the data needs to be written. Because we already created our ingestion dataset in the previous section, we can simply tie it to that dataset ID and the corresponding schema.

In [172]:
print(ingestion_dataset_id)

65146eaa232a2f28d2561e69


In [193]:
target_res = flow_conn.createTargetConnectionDataLake(
  name="[CMLE][Week4] User Propensity Target Connection",
  datasetId=ingestion_dataset_id,
  schemaId=schema_id
)

target_connection_id = target_res["id"]
target_connection_id

'1c578ed8-3285-49d8-8ae1-17808df5a72b'

We're still missing one step. If you look back to the previous cells, this is what we have as the schema of our scored dataframe:
- `userId`
- `prediction`

And this is what we have as the schema of our ingestion dataset:
- `_$TENANTID.userid`
- `_$TENANTID.propensity`

Although the correspondence may look obvious to us, we still need to tell the platform which source field maps to which destination field. This can be achieved using the [Data Prep service](https://experienceleague.adobe.com/docs/experience-platform/data-prep/home.html), which lets you specify a set of **transformations** that map one field to another. In our case the transformation is simple: we only need to match the fields without changing their values, but the service supports far more extensive transformations if needed.

In [194]:
from aepp import dataprep

dataprep_conn = dataprep.DataPrep()

In [195]:
mapping_res = dataprep_conn.createMappingSet(
  schemaId=schema_id,
  validate=True,
  mappingList=[
    {
      "sourceType": "ATTRIBUTE",
      "source": "prediction",
      "destination": f"_{tenant_id}.propensity"
    },
    {
      "sourceType": "ATTRIBUTE",
      "source": "userId",
      "destination": f"_{tenant_id}.cmle_id"
    }
  ]
)

mapping_id = mapping_res["id"]
mapping_id

'4217643de91547f79eb584df60f03223'
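To make the effect of this mapping set concrete, the sketch below applies a list of mappings with the same shape as `mappingList` to a single scored row, entirely locally. It is an illustration of the field renaming only, not how the Data Prep service is implemented. `apply_mappings`, the tenant ID, and the example row are all hypothetical.

```python
def apply_mappings(record, mappings):
    """Copy each source attribute to its dotted destination path in a new nested dict."""
    result = {}
    for mapping in mappings:
        if mapping["source"] not in record:
            continue  # this sketch simply skips attributes that are absent
        *parents, leaf = mapping["destination"].split(".")
        node = result
        for key in parents:
            # Create intermediate objects such as the tenant namespace on demand
            node = node.setdefault(key, {})
        node[leaf] = record[mapping["source"]]
    return result


example_tenant_id = "mytenant"  # hypothetical tenant ID
example_mappings = [
    {"sourceType": "ATTRIBUTE", "source": "prediction",
     "destination": f"_{example_tenant_id}.propensity"},
    {"sourceType": "ATTRIBUTE", "source": "userId",
     "destination": f"_{example_tenant_id}.cmle_id"},
]
example_row = {"userId": "user-123", "prediction": 0.87}  # made-up scored row

mapped_row = apply_mappings(example_row, example_mappings)
print(mapped_row)
```

Each flat column ends up nested under the tenant namespace, which is the structure the ingestion dataset's schema expects.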

At this point we have everything we need to create a **Data Flow**. A data flow is the "recipe" that describes where the data comes from and where it should end up. We can also specify how often it checks for new data, but this interval currently cannot be lower than 15 minutes for platform stability reasons. A data flow is tied to a flow spec ID, which contains the instructions for transferring data in an optimized way between a source and a destination.

For reference, here is a list of all the flow specs available for the most popular cloud storage accounts (these IDs are global across every single customer account and sandbox):

| Cloud Storage Type    | Flow Spec ID                         | Flow Spec Name
|-----------------------|--------------------------------------|------------------
| Amazon S3             | 9753525b-82c7-4dce-8a9b-5ccfce2b9876 | CloudStorageToAEP
| Azure Blob Storage    | 9753525b-82c7-4dce-8a9b-5ccfce2b9876 | CloudStorageToAEP
| Azure Data Lake       | 9753525b-82c7-4dce-8a9b-5ccfce2b9876 | CloudStorageToAEP
| Data Landing Zone     | 9753525b-82c7-4dce-8a9b-5ccfce2b9876 | CloudStorageToAEP
| Google Cloud Storage  | 9753525b-82c7-4dce-8a9b-5ccfce2b9876 | CloudStorageToAEP
| SFTP                  | 9753525b-82c7-4dce-8a9b-5ccfce2b9876 | CloudStorageToAEP

In [196]:
flow_spec = flow_conn.getFlowSpecs("name==CloudStorageToAEP")
flow_spec_id = flow_spec[0]["id"]
flow_spec_id

'9753525b-82c7-4dce-8a9b-5ccfce2b9876'

In [197]:
import time

# TODO: cleanup in aepp, first param should not be required
flow_res = flow_conn.createFlow(flow_spec_id, obj={
  "name": f"[CMLE][Week4] S3 to AEP for user propensity (created by {username})",
  "flowSpec": {
      "id": flow_spec_id,
      "version": "1.0"
  },
  "sourceConnectionIds": [
      source_connection_id
  ],
  "targetConnectionIds": [
      target_connection_id
  ],
  "transformations": [
      {
          "name": "Mapping",
          "params": {
              "mappingId": mapping_id,
              "mappingVersion": 0
          }
      }
  ],
  "scheduleParams": {
      "startTime": str(int(time.time())),
      "frequency": "minute",
      "interval": "15"
  }
})
dataflow_id = flow_res["id"]
dataflow_id

'02b51a26-68ae-4ed2-9f6a-f6da70625c49'

Note that the name of the transformation has to be set to `Mapping` or the job will fail.
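The 15-minute lower bound on the schedule mentioned earlier in this section can be checked locally before calling `createFlow`. The following is a minimal sketch: `build_schedule_params` is a hypothetical helper that returns a dictionary in the same shape as the `scheduleParams` used above, and it only covers the `minute` frequency used in this notebook.

```python
import time


def build_schedule_params(interval_minutes=15, start_time=None):
    """Build a scheduleParams dict in the same shape as the createFlow cell."""
    if interval_minutes < 15:
        raise ValueError("interval must be at least 15 minutes")
    if start_time is None:
        start_time = int(time.time())  # epoch seconds, as in the cell above
    return {
        "startTime": str(int(start_time)),
        "frequency": "minute",
        "interval": str(interval_minutes),
    }


# Fixed start time so that the example is reproducible
example_schedule = build_schedule_params(interval_minutes=30, start_time=1700000000)
print(example_schedule)
```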

You should be able to see your Data Flow in the UI at the link below. Because it runs on a schedule, you may already see some executions depending on when you check; a run is listed even when there was no data to process, as shown in the screenshot below.

In [143]:
dataflow_link = get_ui_link(tenant_id, "source/dataflows", dataflow_id)
print(f"Data Flow created as ID {dataflow_id} available under {dataflow_link}")

Data Flow created as ID d6e52117-5d73-470d-9d18-d35d28964f4d available under https://experience.adobe.com/#/@exchangesandboxbravo/sname:amanzonpaymentssandboxae/platform/source/dataflows/d6e52117-5d73-470d-9d18-d35d28964f4d


![Source Dataflow](../media/CMLE-Notebooks-Week4-Dataflow.png)

Note: To switch to a different cloud storage, update the `flow_spec_id` variable above to the matching value from the flow spec table earlier in this section. You can use the flow spec name from that table to look up the ID.

## 2.3 Ingest the scored users into the Unified Profile

Because the Data Flow is executed asynchronously every 15 minutes, it may take a few minutes before the data is ingested into the dataset. The cell below polls the runs until one has completed, then prints some summary statistics for it.

In [None]:
import time

# TODO: handle that more gracefully in aepp
finished = False
while not finished:
  try:
      runs = flow_conn.getRuns(prop=f"flowId=={dataflow_id}")
      if not runs:
          # An empty result means no run has been recorded yet: keep waiting
          raise LookupError("no runs returned yet")
      for run in runs:
          run_id = run["id"]
          run_started_at = run["metrics"]["durationSummary"]["startedAtUTC"]
          run_ended_at = run["metrics"]["durationSummary"]["completedAtUTC"]
          run_duration_secs = (run_ended_at - run_started_at) / 1000
          run_size_mb = run["metrics"]["sizeSummary"]["outputBytes"] / 1024. / 1024.
          run_num_rows = run["metrics"]["recordSummary"]["outputRecordCount"]
          run_num_files = run["metrics"]["fileSummary"]["outputFileCount"]
          print(f"Run ID {run_id} completed with: duration={run_duration_secs} secs; size={run_size_mb} MB; num_rows={run_num_rows}; num_files={run_num_files}")
      finished = True
  except (LookupError, TypeError):
      # Metrics are missing (KeyError) until a run completes, so poll again later
      print(f"No runs completed yet for flow {dataflow_id}")
      time.sleep(60)
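The metric extraction in the polling loop above can be separated from the waiting logic, which makes it easier to test. The sketch below is one way to do that, under stated assumptions: `summarize_run` is a hypothetical helper, the sample run is made up, and its field names simply mirror the keys read in the cell above.

```python
def summarize_run(run):
    """Return summary statistics for a completed run, or None if metrics are incomplete."""
    try:
        metrics = run["metrics"]
        started = metrics["durationSummary"]["startedAtUTC"]
        ended = metrics["durationSummary"]["completedAtUTC"]
        return {
            "id": run["id"],
            # Divided by 1000 as in the cell above, which treats the timestamps as milliseconds
            "duration_secs": (ended - started) / 1000,
            "size_mb": metrics["sizeSummary"]["outputBytes"] / 1024 / 1024,
            "num_rows": metrics["recordSummary"]["outputRecordCount"],
            "num_files": metrics["fileSummary"]["outputFileCount"],
        }
    except KeyError:
        # A run that has not completed yet lacks some of these keys
        return None


# Made-up run for illustration only
sample_run = {
    "id": "run-0001",
    "metrics": {
        "durationSummary": {"startedAtUTC": 1700000000000, "completedAtUTC": 1700000042000},
        "sizeSummary": {"outputBytes": 2097152},
        "recordSummary": {"outputRecordCount": 1000},
        "fileSummary": {"outputFileCount": 1},
    },
}

sample_summary = summarize_run(sample_run)
print(sample_summary)
print(summarize_run({"id": "run-0002", "metrics": {}}))
```

Returning `None` for an incomplete run lets the caller decide whether to sleep and retry, rather than relying on an exception for control flow.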

Once this is done, you should be able to go back to your dataset at the same link as before and see a batch created successfully in it. For that batch, the ingested records should also show up under **Existing Profile Fragments**, which means they have been ingested into the Unified Profile successfully.

![Ingestion](../media/CMLE-Notebooks-Week4-Ingestion.png)

## 2.4 Storing the scoring dataset ID in the configuration

Now that everything is working, we just need to save the `ingestion_dataset_id` variable in the original configuration file so that we can refer to it in the following weekly assignment. To do that, execute the code below:

In [73]:
print(config_path)
print(config)

../../conf/config.ini
<configparser.ConfigParser object at 0x7fbe965027a0>


In [71]:
config.set("Platform", "scoring_dataset_id", ingestion_dataset_id)

with open(config_path, "w") as configfile:
    config.write(configfile)
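To confirm that the value was persisted, the file can be read back with a fresh parser. The sketch below demonstrates the round trip on a temporary file so that it does not touch your real configuration. `save_config_value` and `load_config_value` are hypothetical helpers, and the dataset ID is the one printed earlier in this notebook.

```python
import configparser
import os
import tempfile


def save_config_value(path, section, key, value):
    """Set section.key = value in an INI file, creating the section if needed."""
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, key, str(value))
    with open(path, "w") as configfile:
        parser.write(configfile)


def load_config_value(path, section, key):
    """Read section.key back from disk using a fresh parser."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser.get(section, key)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.ini")
    save_config_value(demo_path, "Platform", "scoring_dataset_id", "65146eaa232a2f28d2561e69")
    reloaded_id = load_config_value(demo_path, "Platform", "scoring_dataset_id")

print(reloaded_id)
```

Keep in mind that `configparser` does not preserve comments when it rewrites a file, so any comments in the original configuration file are lost on save.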