# Export training data

Now that our featurized data is in a dataset, we need to export it to an external cloud storage filesystem where the ML model training and scoring will be performed.

For the purposes of this notebook we will be using the [Data Landing Zone (DLZ)](https://experienceleague.adobe.com/docs/experience-platform/sources/api-tutorials/create/cloud-storage/data-landing-zone.html?lang=en) as the filesystem. Every Adobe Experience Platform customer has a DLZ already set up as an [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs) container. We'll be using that as a delivery mechanism for the featurized data, but this step can be customized to deliver this data to any cloud storage filesystem.

To set up the delivery pipeline, we'll be using the [Flow Service for Destinations](https://developer.adobe.com/experience-platform-apis/references/destinations/), which will be responsible for picking up the featurized data and dumping it into the DLZ. There are a few steps involved:

- [Setup](#setup)
- [1. Create the source connection](#1-create-the-source-connection)
- [2. Create the target connection](#2-create-the-target-connection)
- [3. Create the data flow](#3-create-the-data-flow)
- [4. Execute the data flow](#4-execute-the-data-flow)

# Setup

This notebook requires some configuration parameters to properly authenticate to your Adobe Experience Platform instance. Please follow the instructions in the [**README**](../README.md) to gather the necessary configuration parameters and prepare the [config.ini](../conf/config.ini) file with the specific values for your environment.

The next cell will be looking for your configuration file under your **ADOBE_HOME** path to fetch the configuration values that will be used for this notebook. If necessary, modify the `config_path` and/or the `config_file` name to reflect the location of your config file. 

In [20]:
import os
from configparser import ConfigParser
import aepp

os.environ["ADOBE_HOME"] = os.path.dirname(os.getcwd())

if "ADOBE_HOME" not in os.environ:
    raise Exception("ADOBE_HOME environment variable needs to be set.")

config = ConfigParser()
config_file = "config.ini"
# Look for the config file under $ADOBE_HOME/conf; adjust if yours lives elsewhere.
config_path = os.path.join(os.environ["ADOBE_HOME"], "conf", config_file)

if not os.path.exists(config_path):
    raise Exception(f"Looking for configuration under {config_path} but config not found, please verify path")

config.read(config_path)

export_path = config.get("Cloud", "export_path")
import_path = config.get("Cloud", "import_path")
data_format = config.get("Cloud", "data_format")

aepp.configure(
  org_id=config.get("Platform", "ims_org_id"),
  tech_id=config.get("Platform", "tech_acct_id"), 
  secret=config.get("Platform", "client_secret"),
  scopes=config.get("Platform", "scopes"),
  client_id=config.get("Platform", "client_id"),
  environment=config.get("Platform", "environment"),
  sandbox=config.get("Platform", "sandbox_name")
)
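
As a rough illustration of what the cell above expects, the sketch below parses a sample configuration and reports any missing keys. The section and key names are the ones read elsewhere in this notebook; every value is a hypothetical placeholder (only `cmle/egress` and `parquet` echo values seen in this notebook), and `find_missing_keys` is a helper invented for this example, not part of `aepp`.

```python
from configparser import ConfigParser

# Placeholder values only; replace them with the values for your environment.
SAMPLE_CONFIG = """
[Platform]
ims_org_id = <your-ims-org-id>
tech_acct_id = <your-technical-account-id>
client_secret = <your-client-secret>
scopes = <your-scopes>
client_id = <your-client-id>
environment = prod
sandbox_name = <your-sandbox-name>

[Cloud]
export_path = cmle/egress
import_path = <your-import-path>
data_format = parquet
compression_type = <your-compression-type>

[Data]
featurized_dataset_id = <your-featurized-dataset-id>
"""

# Keys that the cells in this notebook read with config.get(...).
REQUIRED_KEYS = {
    "Platform": ["ims_org_id", "tech_acct_id", "client_secret", "scopes",
                 "client_id", "environment", "sandbox_name"],
    "Cloud": ["export_path", "import_path", "data_format", "compression_type"],
    "Data": ["featurized_dataset_id"],
}


def find_missing_keys(config):
    """Return 'Section.key' for every required key absent from the config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            # has_option is False when either the section or the key is missing.
            if not config.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing


sample = ConfigParser()
sample.read_string(SAMPLE_CONFIG)
print(find_missing_keys(sample))
```

Running a check like this before calling `aepp.configure` can surface a missing key with a clearer message than the `NoOptionError` that `config.get` would otherwise raise.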

To keep the resources created by this notebook unique, we include your local username in each resource title to avoid conflicts.

In [4]:
import re
username = os.getlogin()
unique_id = re.sub("[^0-9a-zA-Z]+", "_", username)
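
To make the effect of that regular expression concrete, here is a small sketch that applies the same pattern to two made-up usernames. The function name `make_unique_id` and the sample usernames are hypothetical.

```python
import re


def make_unique_id(username):
    """Collapse every run of non-alphanumeric characters into one underscore."""
    return re.sub("[^0-9a-zA-Z]+", "_", username)


print(make_unique_id("jane.doe@example"))  # jane_doe_example
print(make_unique_id("DOMAIN\\jdoe"))      # DOMAIN_jdoe
```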

The following helper function generates a link to a resource in the UI:

In [5]:
def get_ui_link(tenant_id, resource_type, resource_id):
    environment = config.get("Platform", "environment")
    sandbox_name = config.get("Platform", "sandbox_name")
    if environment == "prod":
        prefix = "https://experience.adobe.com"
    else:
        prefix = f"https://experience-{environment}.adobe.com"
    return f"{prefix}/#/@{tenant_id}/sname:{sandbox_name}/platform/{resource_type}/{resource_id}"
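
For illustration, the sketch below is a standalone variant of the helper that takes the environment and sandbox name as arguments instead of reading them from `config`, so it can be tried without a configuration file. The name `build_ui_link` is hypothetical, and `"stage"` is only an assumed example of a non-production environment value; the production inputs are taken from the sample output later in this notebook.

```python
def build_ui_link(environment, sandbox_name, tenant_id, resource_type, resource_id):
    """Build a UI link with the same URL pattern as get_ui_link."""
    if environment == "prod":
        prefix = "https://experience.adobe.com"
    else:
        # Non-production environments are embedded in the host name.
        prefix = f"https://experience-{environment}.adobe.com"
    return f"{prefix}/#/@{tenant_id}/sname:{sandbox_name}/platform/{resource_type}/{resource_id}"


print(build_ui_link(
    "prod", "laa-e2e", "aemonacpprodcampaign",
    "destination/browse", "40a89499-90f1-4988-a38e-55b19d8ce091",
))
```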

# 1. Create the Source Connection

The source connection is responsible for configuring the connection to your Adobe Experience Platform dataset so that the resulting flow will know exactly where to look for the data and in what format.

We will use `aepp` to make requests to the Flow Service APIs:

In [6]:
from aepp import flowservice
flow_conn = flowservice.FlowService()

created_dataset_id = config.get("Data", "featurized_dataset_id")

source_res = flow_conn.createSourceConnectionDataLake(
    name=f"[AIML-FP] Featurized Dataset source connection created by {username}",
    dataset_ids=[created_dataset_id],
    format="parquet"
)
source_connection_id = source_res["id"]
source_connection_id

'512c504a-6531-42c1-8e44-064ccf640eb3'

# 2. Create the Target Connection

The target connection is responsible for connecting to the destination filesystem. In our case, we want to connect to the DLZ and specify in what format the data will be stored, as well as the type of compression.

Before we can create it, however, we need to create a base connection to the DLZ. A base connection is just an instance of a connection spec that details how one authenticates to a particular destination. In our case, because we're using the DLZ, which is a known entity internal to Adobe, we can just reference the standard DLZ connection spec ID and create an empty base connection.

For reference, here is a list of all the connection specs available for the most popular cloud storage accounts (these IDs are global across every single customer account and sandbox):

| Cloud Storage Type    | Connection Spec ID                   |
|-----------------------|--------------------------------------|
| Amazon S3             | 4fce964d-3f37-408f-9778-e597338a21ee |
| Azure Blob Storage    | 6d6b59bf-fb58-4107-9064-4d246c0e5bb2 |
| Azure Data Lake       | be2c3209-53bc-47e7-ab25-145db8b873e1 |
| Data Landing Zone     | 10440537-2a7b-4583-ac39-ed38d4b848e8 |
| Google Cloud Storage  | c5d93acb-ea8b-4b14-8f53-02138444ae99 |
| SFTP                  | 36965a81-b1c6-401b-99f8-22508f1e6a26 |

In [7]:
# TODO: implement in aepp a way to abstract that
connection_spec_id = "10440537-2a7b-4583-ac39-ed38d4b848e8"
base_connection_res = flow_conn.createConnection(data={
    "name": f"[AIML-FP] Base Connection to DLZ created by {username}",
    "auth": None,
    "connectionSpec": {
        "id": connection_spec_id,
        "version": "1.0"
    }
})
base_connection_id = base_connection_res["id"]
base_connection_id

'e76a9429-4ea2-4b8f-8704-d0c1e371d6d7'

With that base connection, we're ready to create the target connection that points to our DLZ directory. We will configure the target connection using the parameters specified in the `[Cloud]` section of the `config.ini` file:
- compression_type
- data_format
- export_path

In [8]:
# TODO: implement in aepp a way to abstract that
target_res = flow_conn.createTargetConnection(
    data={
        "name": f"[AIML-FP] Data Landing Zone target connection created by {username}",
        "baseConnectionId": base_connection_id,
        "params": {
            "mode": "Server-to-server",
            "compression": config.get("Cloud", "compression_type"),
            "datasetFileType": config.get("Cloud", "data_format"),
            "path": config.get("Cloud", "export_path")
        },
        "connectionSpec": {
            "id": connection_spec_id,
            "version": "1.0"
        }
    }
)

target_connection_id = target_res["id"]
target_connection_id

'd6f1ba66-5c20-4db3-ab57-29160ac38282'

<div class="alert alert-block alert-warning">
<b>Note:</b> 
    
If you would like to switch to a different cloud storage, you need to update the `connection_spec_id` variable above to the matching value in the table mentioned earlier in this section.
</div>

# 3. Create the Data Flow

Now that we have the source and target connections set up, we can construct the data flow. A data flow is the "recipe" that describes where the data comes from and where it should end up. We can also specify how often the flow checks for new data; this interval currently cannot be lower than 3 hours for platform stability reasons. A data flow is tied to a flow spec ID, which contains the instructions for transferring data in an optimized way between a source and a destination.

For reference, here is a list of all the flow specs available for the most popular cloud storage accounts (these IDs are global across every single customer account and sandbox):

| Cloud Storage Type    | Flow Spec ID                         |
|-----------------------|--------------------------------------|
| Amazon S3             | 269ba276-16fc-47db-92b0-c1049a3c131f |
| Azure Blob Storage    | 95bd8965-fc8a-4119-b9c3-944c2c2df6d2 |
| Azure Data Lake       | 17be2013-2549-41ce-96e7-a70363bec293 |
| Data Landing Zone     | cd2fc47e-e838-4f38-a581-8fff2f99b63a |
| Google Cloud Storage  | 585c15c4-6cbf-4126-8f87-e26bff78b657 |
| SFTP                  | 354d6aad-4754-46e4-a576-1b384561c440 |


There are two options for executing the data flow:
1. If you do not want to wait, you can do an **ad hoc run** to execute it instantly, as described in Section 4.
2. **Wait for the first scheduled run**. We set it to run every 3 hours, so you may need to wait up to 3 hours.

The code below configures the data flow for an ad hoc run by default. If you prefer to run the data flow on a recurring schedule:
1. Change the value of `on_schedule` to `True` before executing the cell below.
2. Wait up to 3 hours, then execute the cells in Section 4.

In [9]:
import time

on_schedule = False
if on_schedule:
    schedule_params = {
        "interval": 3,
        "timeUnit": "hour",
        "startTime": int(time.time())
    }
else:
    schedule_params = {
        "interval": 1,
        "timeUnit": "day",
        "startTime": int(time.time() + 60*60*24*365) # Start the schedule far in the future
    }


flow_spec_id = "cd2fc47e-e838-4f38-a581-8fff2f99b63a"
flow_obj = {
    "name": f"[AIML-FP] Flow for Featurized Dataset to DLZ created by {username}",
    "flowSpec": {
        "id": flow_spec_id,
        "version": "1.0"
    },
    "sourceConnectionIds": [
        source_connection_id
    ],
    "targetConnectionIds": [
        target_connection_id
    ],
    "transformations": [],
    "scheduleParams": schedule_params
}
flow_res = flow_conn.createFlow(
    obj = flow_obj,
    flow_spec_id = flow_spec_id
)
dataflow_id = flow_res["id"]
dataflow_id

'40a89499-90f1-4988-a38e-55b19d8ce091'
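
The scheduling logic in the cell above can be isolated into a small function, which makes the two modes easier to compare. This is only a sketch: `build_schedule_params` and its `now` argument are hypothetical names introduced here, and the returned dictionaries simply mirror the ones built in the cell.

```python
import time

ONE_YEAR_SECS = 60 * 60 * 24 * 365


def build_schedule_params(on_schedule, now=None):
    """Return schedule parameters for a recurring or an ad hoc data flow.

    `now` is a Unix timestamp in seconds; it defaults to the current time.
    """
    now = int(time.time()) if now is None else int(now)
    if on_schedule:
        # Recurring mode: run every 3 hours, starting immediately.
        return {"interval": 3, "timeUnit": "hour", "startTime": now}
    # Ad hoc mode: push the first scheduled run a year into the future so the
    # flow only executes when it is triggered explicitly.
    return {"interval": 1, "timeUnit": "day", "startTime": now + ONE_YEAR_SECS}


print(build_schedule_params(on_schedule=True, now=1_700_000_000))
print(build_schedule_params(on_schedule=False, now=1_700_000_000))
```

Passing `now` explicitly keeps the function deterministic, which is convenient when checking the values before creating a flow.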

<div class="alert alert-block alert-warning">
<b>Note:</b> 

If you would like to switch to a different cloud storage, you need to update the `flow_spec_id` variable above to the matching value in the table mentioned earlier in this section.
</div>
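
Because the connection spec ID and the flow spec ID have to be switched together, one option is to keep both reference tables in a single lookup. The sketch below copies the IDs from the two tables in this notebook; `STORAGE_SPECS` and `get_spec_ids` are hypothetical names. Storage types other than the DLZ will likely also need credentials in the base connection, which this sketch does not cover.

```python
# IDs copied from the connection spec and flow spec tables in this notebook.
STORAGE_SPECS = {
    "Amazon S3": {
        "connection_spec_id": "4fce964d-3f37-408f-9778-e597338a21ee",
        "flow_spec_id": "269ba276-16fc-47db-92b0-c1049a3c131f",
    },
    "Azure Blob Storage": {
        "connection_spec_id": "6d6b59bf-fb58-4107-9064-4d246c0e5bb2",
        "flow_spec_id": "95bd8965-fc8a-4119-b9c3-944c2c2df6d2",
    },
    "Azure Data Lake": {
        "connection_spec_id": "be2c3209-53bc-47e7-ab25-145db8b873e1",
        "flow_spec_id": "17be2013-2549-41ce-96e7-a70363bec293",
    },
    "Data Landing Zone": {
        "connection_spec_id": "10440537-2a7b-4583-ac39-ed38d4b848e8",
        "flow_spec_id": "cd2fc47e-e838-4f38-a581-8fff2f99b63a",
    },
    "Google Cloud Storage": {
        "connection_spec_id": "c5d93acb-ea8b-4b14-8f53-02138444ae99",
        "flow_spec_id": "585c15c4-6cbf-4126-8f87-e26bff78b657",
    },
    "SFTP": {
        "connection_spec_id": "36965a81-b1c6-401b-99f8-22508f1e6a26",
        "flow_spec_id": "354d6aad-4754-46e4-a576-1b384561c440",
    },
}


def get_spec_ids(storage_type):
    """Return (connection_spec_id, flow_spec_id) for a storage type."""
    try:
        specs = STORAGE_SPECS[storage_type]
    except KeyError:
        options = ", ".join(sorted(STORAGE_SPECS))
        raise ValueError(f"Unknown storage type {storage_type!r}; expected one of: {options}")
    return specs["connection_spec_id"], specs["flow_spec_id"]


connection_spec_id, flow_spec_id = get_spec_ids("Data Landing Zone")
print(connection_spec_id, flow_spec_id)
```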

After you create the data flow, you should be able to see it in the UI, where you can monitor its executions, runtimes and overall lifecycle. The cell below prints a link to it, and the screenshot shows what to expect.

In [10]:
from aepp import schema
tenant_id = schema.Schema().getTenantId()

dataflow_link = get_ui_link(tenant_id, "destination/browse", dataflow_id)
print(f"Data Flow created as ID {dataflow_id} available under {dataflow_link}")

Data Flow created as ID 40a89499-90f1-4988-a38e-55b19d8ce091 available under https://experience.adobe.com/#/@aemonacpprodcampaign/sname:laa-e2e/platform/destination/browse/40a89499-90f1-4988-a38e-55b19d8ce091


![Dataflow](./media/CMLE-Notebooks-Week2-Dataflow.png)

# 4. Execute the Data Flow

At this point we've created our Data Flow, but it has not executed yet. Follow the instructions for the option you selected in Section 3:
- If you do not want to wait, you can do an **ad hoc run** to execute it instantly.
- Otherwise, **wait for the first scheduled run**. We set it to run every 3 hours, so you may need to wait up to 3 hours.

The cell below shows how to trigger an ad hoc run (the first option). If you selected the second option, skip that cell and wait up to 3 hours before executing the cells after it.

<div class="alert alert-block alert-warning">
<b>Note:</b> Please wait at least 10 minutes after creating the dataflow before triggering the next cell, otherwise the job might not execute at all.
</div>

### Trigger an ad hoc run

In [12]:
# TODO: use new functionality in aepp when it is released
from aepp import connector

connector = connector.AdobeRequest(
  config_object=aepp.config.config_object,
  header=aepp.config.header,
  loggingEnabled=False,
  logger=None,
)

endpoint = aepp.config.endpoints["global"] + "/data/core/activation/disflowprovider/adhocrun"

payload = {
    "activationInfo": {
        "destinations": [
            {
                "flowId": dataflow_id, 
                "datasets": [
                    {"id": created_dataset_id}
                ]
            }
        ]
    }
}

connector.header.update({"Accept":"application/vnd.adobe.adhoc.dataset.activation+json; version=1"})
activation_res = connector.postData(endpoint=endpoint, data=payload)
activation_res

{'destinations': [{'datasets': [{'id': '654b3ddd7dc4ee28d32c8c6e',
     'statusURL': 'https://platform.adobe.io/data/foundation/flowservice/runs/e27898e4-7896-4ec9-b5cb-6ac13bc23a6c',
     'flowId': '40a89499-90f1-4988-a38e-55b19d8ce091'}]}]}
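
The response above carries a `statusURL` per dataset, and its last path segment appears to be the run ID (it matches the run ID reported later in this notebook). The sketch below extracts it, assuming the response always has the shape shown above; `extract_run_ids` is a hypothetical helper, not an `aepp` function.

```python
def extract_run_ids(activation_res):
    """Map each dataset ID to the run ID at the end of its statusURL."""
    run_ids = {}
    for destination in activation_res.get("destinations", []):
        for dataset in destination.get("datasets", []):
            # The run ID is the final path segment of the status URL.
            run_ids[dataset["id"]] = dataset["statusURL"].rstrip("/").rsplit("/", 1)[-1]
    return run_ids


# Sample response copied from the output above.
sample_res = {
    "destinations": [{
        "datasets": [{
            "id": "654b3ddd7dc4ee28d32c8c6e",
            "statusURL": "https://platform.adobe.io/data/foundation/flowservice/runs/e27898e4-7896-4ec9-b5cb-6ac13bc23a6c",
            "flowId": "40a89499-90f1-4988-a38e-55b19d8ce091",
        }]
    }]
}
print(extract_run_ids(sample_res))
```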

<div class="alert alert-block alert-warning">
<b>Note:</b> 

If you see an error such as `Invalid parameter: Flow for id 93790efa-645b-4400-8afe-b6f135734656 is incorrect. Error is [Adhoc run can not be executed for Flow spec=cd2fc47e-e838-4f38-a581-8fff2f99b63a.`, it means your cloud storage is not yet enabled for exporting datasets. Please reach out to your Adobe contact to have it enabled.
</div>

<div class="alert alert-block alert-warning">
<b>Note:</b> 

If you see an error such as `Invalid parameter: Following order ID(s) are not ready for dataset export, please wait for 10 minutes and retry.`, it means you need to wait a few minutes and retry.
</div>

<div class="alert alert-block alert-warning">
<b>Note:</b> 

If you get a message saying a run already exists, it means that either this dataset has already been exported on schedule, or you've already done an ad hoc export.
</div>

Now we can check that our Data Flow actually executes. The following cell polls every 30 seconds until a completed run appears.

In [13]:
import time

# TODO: handle that more gracefully in aepp
finished = False
while not finished:
    try:
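        runs = flow_conn.getRuns(prop=f"flowId=={dataflow_id}")
        if not runs:
            # An empty result means no run is registered yet, so keep polling.
            raise LookupError("no runs returned yet")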
        for run in runs:
            run_id = run["id"]
            run_started_at = run["metrics"]["durationSummary"]["startedAtUTC"]
            run_ended_at = run["metrics"]["durationSummary"]["completedAtUTC"]
            run_duration_secs = (run_ended_at - run_started_at) / 1000
            run_size_mb = run["metrics"]["sizeSummary"]["outputBytes"] / 1024. / 1024.
            run_num_rows = run["metrics"]["recordSummary"]["outputRecordCount"]
            run_num_files = run["metrics"]["fileSummary"]["outputFileCount"]
            print(f"Run ID {run_id} completed with: duration={run_duration_secs} secs; size={run_size_mb} MB; num_rows={run_num_rows}; num_files={run_num_files}")
        finished = True
    except Exception as e:
        print(f"No runs completed yet for flow {dataflow_id}")
        time.sleep(30)

No runs completed yet for flow 40a89499-90f1-4988-a38e-55b19d8ce091
Run ID e27898e4-7896-4ec9-b5cb-6ac13bc23a6c completed with: duration=14.767 secs; size=3.0555238723754883 MB; num_rows=99923; num_files=12
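
The polling cell above loops until a run shows up, with no upper bound on the wait. A minimal sketch of a bounded variant follows. It assumes that a run which has not finished yet lacks some of the `metrics` keys and therefore raises `KeyError`, which the notebook does not confirm; `summarize_run` and `wait_for_runs` are hypothetical helpers, and the `fetch_runs` callable stands in for the `flow_conn.getRuns` call.

```python
import time


def summarize_run(run):
    """Extract the same metrics the polling cell prints for one run."""
    metrics = run["metrics"]
    duration = metrics["durationSummary"]
    return {
        "id": run["id"],
        # Timestamps are in milliseconds, so divide by 1000 to get seconds.
        "duration_secs": (duration["completedAtUTC"] - duration["startedAtUTC"]) / 1000,
        "size_mb": metrics["sizeSummary"]["outputBytes"] / 1024 / 1024,
        "num_rows": metrics["recordSummary"]["outputRecordCount"],
        "num_files": metrics["fileSummary"]["outputFileCount"],
    }


def wait_for_runs(fetch_runs, timeout_secs=1800, poll_secs=30, sleep=time.sleep):
    """Poll fetch_runs() until it returns completed runs or the timeout expires."""
    deadline = time.monotonic() + timeout_secs
    while True:
        try:
            runs = fetch_runs()
            if runs:
                return [summarize_run(run) for run in runs]
        except KeyError:
            # Assumption: metrics are not populated until the run has completed.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No completed run after {timeout_secs} seconds")
        sleep(poll_secs)


# In the notebook this could be called as:
# wait_for_runs(lambda: flow_conn.getRuns(prop=f"flowId=={dataflow_id}"))
```

Injecting `fetch_runs` and `sleep` keeps the helper independent of `aepp`, so it can be exercised with canned responses before pointing it at a real flow.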


Now that a run of our Data Flow has executed successfully, we're all set! We can do a sanity check to verify that the data indeed made its way into the DLZ. For that, we recommend setting up [Azure Storage Explorer](https://azure.microsoft.com/en-us/products/storage/storage-explorer) to connect to your DLZ container using [this guide](https://experienceleague.adobe.com/docs/experience-platform/destinations/catalog/cloud-storage/data-landing-zone.html?lang=en). The code below retrieves the SAS URL you need to connect:

In [16]:
# TODO: use functionality in aepp once released
from aepp import connector

connector = connector.AdobeRequest(
  config_object=aepp.config.config_object,
  header=aepp.config.header,
  loggingEnabled=False,
  logger=None,
)

endpoint = aepp.config.endpoints["global"] + "/data/foundation/connectors/landingzone/credentials"

dlz_credentials = connector.getData(endpoint=endpoint, params={
  "type": "dlz_destination"
})
dlz_container = dlz_credentials["containerName"]
dlz_sas_token = dlz_credentials["SASToken"]
dlz_storage_account = dlz_credentials["storageAccountName"]
dlz_sas_uri = dlz_credentials["SASUri"]
print(f"DLZ container: {dlz_container}")
print(f"DLZ storage account: {dlz_storage_account}")
print(f"DLZ SAS URL: {dlz_sas_uri}")

DLZ container: dlz-destination
DLZ storage account: sndbxdtlndh8ia8em3oyoh69
DLZ SAS URL: https://sndbxdtlndh8ia8em3oyoh69.blob.core.windows.net/dlz-destination?sv=2020-10-02&si=dlz-c9c43e14-26f9-4e2d-a1df-4e6f51e718f3&sr=c&sp=rl&sig=lU9zXuPnjqOeuheouud%2FKPBgi%2F4v4MbhUb%2B8Cv1xxKw%3D
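
To see what a SAS URL like the one above grants before pasting it into a tool, its parts can be inspected with the standard library. The sketch below uses a made-up URL with a redacted signature; `describe_sas_url` is a hypothetical helper. In Azure SAS tokens, `sp` lists the permissions (`r` for read, `l` for list), `sr` is the resource type (`c` for container), and `sv` is the storage service version.

```python
from urllib.parse import parse_qs, urlsplit


def describe_sas_url(sas_url):
    """Summarize the non-secret parts of an Azure Blob Storage SAS URL."""
    parts = urlsplit(sas_url)
    query = parse_qs(parts.query)
    return {
        # The storage account is the first label of the host name.
        "storage_account": parts.netloc.split(".")[0],
        "container": parts.path.strip("/"),
        "permissions": query.get("sp", [""])[0],
        "resource": query.get("sr", [""])[0],
        "service_version": query.get("sv", [""])[0],
    }


# Hypothetical URL with the same structure as the output above.
example_url = (
    "https://examplestorageaccount.blob.core.windows.net/dlz-destination"
    "?sv=2020-10-02&si=dlz-example-policy&sr=c&sp=rl&sig=REDACTED"
)
print(describe_sas_url(example_url))
```

The helper deliberately leaves out the `sig` parameter, since the signature is the secret part of the URL and is best kept out of logs and shared notebooks.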


Once set up, you should be able to see your featurized data as a set of Parquet files under the following directory structure: `cmle/egress/$DATASETID/exportTime=$TIMESTAMP` (see the screenshot below).

In [23]:
print(f"Featurized data in DLZ should be available under {export_path}/{created_dataset_id}")

Featurized data in DLZ should be available under cmle/egress/654b3ddd7dc4ee28d32c8c6e


![DLZ](./media/CMLE-Notebooks-Week2-ExportedDataset.png)

### Next: Train a propensity model