# Scope of Notebook

This notebook lets you plug your featurized dataset from the previous week into an ML model; in this case we use a random forest. You will then be able to store the trained model in MLflow and calculate performance characteristics of the model, such as AUC and accuracy. The advanced section of this notebook outlines how to retrieve the best set of hyperparameters to certify a model for production use.

![ml-model-train](../media/CMLE-Notebooks-Week3-Workflow.png)

# Setup

This notebook requires some configuration data to properly authenticate to your Adobe Experience Platform instance. You should be able to find all the required values by following the Setup section of the **README**.

The next cell looks for your configuration file under your **ADOBE_HOME** path to fetch the values used throughout this notebook. See the Setup section of the **README** for details on how to create your configuration file.

In [0]:
import os
from configparser import ConfigParser
  
config = ConfigParser()
config_path = os.path.join(os.environ["ADOBE_HOME"], "conf", "config.ini")
if not os.path.exists(config_path):
  raise Exception(f"Looking for configuration under {config_path} but config not found, please verify path")
config.read(config_path)
  
ims_org_id = config.get("Platform", "ims_org_id")
sandbox_name = config.get("Platform", "sandbox_name")
environment = config.get("Platform", "environment")
client_id = config.get("Authentication", "client_id")
client_secret = config.get("Authentication", "client_secret")
scopes = config.get("Authentication", "scopes")
dataset_id = config.get("Platform", "dataset_id")
featurized_dataset_id = config.get("Platform", "featurized_dataset_id")
export_path = config.get("Cloud", "export_path")
import_path = config.get("Cloud", "import_path")
data_format = config.get("Cloud", "data_format")
compression_type = config.get("Cloud", "compression_type")
model_name = config.get("Cloud", "model_name")
datarobot_key = config.get("DataRobot", 'datarobot_key')
datarobot_endpoint = config.get("DataRobot", 'datarobot_endpoint')
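
Because a missing option only surfaces as an exception at the `config.get` call that needs it, it can help to check the whole file up front. The following is a minimal sketch, assuming the section and key names read in the cell above; `REQUIRED_KEYS` and `find_missing_keys` are hypothetical helpers written for illustration, and the sample values are placeholders.

```python
from configparser import ConfigParser

# Sections and keys read by this notebook (taken from the cell above).
REQUIRED_KEYS = {
    "Platform": ["ims_org_id", "sandbox_name", "environment",
                 "dataset_id", "featurized_dataset_id"],
    "Authentication": ["client_id", "client_secret", "scopes"],
    "Cloud": ["export_path", "import_path", "data_format",
              "compression_type", "model_name"],
    "DataRobot": ["datarobot_key", "datarobot_endpoint"],
}


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return 'Section.key' strings for every required option that is absent."""
    missing = []
    for section, keys in required.items():
        for key in keys:
            # has_option returns False when the section itself is missing
            if not config.has_option(section, key):
                missing.append(f"{section}.{key}")
    return missing


# Hypothetical in-memory configuration that lacks the DataRobot section.
sample = ConfigParser()
sample.read_string("""
[Platform]
ims_org_id = PLACEHOLDER
sandbox_name = PLACEHOLDER
environment = PLACEHOLDER
dataset_id = PLACEHOLDER
featurized_dataset_id = PLACEHOLDER

[Authentication]
client_id = PLACEHOLDER
client_secret = PLACEHOLDER
scopes = PLACEHOLDER

[Cloud]
export_path = PLACEHOLDER
import_path = PLACEHOLDER
data_format = PLACEHOLDER
compression_type = PLACEHOLDER
model_name = PLACEHOLDER
""")

print(find_missing_keys(sample))
```

In the notebook itself you would call `find_missing_keys(config)` right after `config.read(config_path)` and stop if the returned list is not empty.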

In [0]:
import re

username = dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
unique_id = re.sub("[^0-9a-zA-Z]+", "_", username)

print(f"Username: {username}")
print(f"Unique ID: {unique_id}")

Username: cmenguy@adobe.com
Unique ID: cmenguy_adobe_com


Before we run anything, make sure to install the following required libraries for this notebook. They are all publicly available libraries and the latest version should work fine.

In [0]:
%pip install aepp
%pip install adlfs
%pip install s3fs
%pip install fsspec
%pip install datarobot

Collecting aepp
  Downloading aepp-0.2.9-py3-none-any.whl (120 kB)
Collecting pathlib2
  Downloading pathlib2-2.3.7.post1-py2.py3-none-any.w

Before any calls can take place, we need to configure the library and set up authentication credentials. For this you'll need the following pieces of information. For details on how to get them, refer to the `Setup` section of the **README**:
- Client ID
- Client secret

In [0]:
import aepp

aepp.configure(
  environment=environment,
  sandbox=sandbox_name,
  org_id=ims_org_id,
  scopes=scopes, 
  secret=client_secret,
  client_id=client_id
)

# 1. Running a model on AEP data

In the previous week we generated our featurized data in the Data Landing Zone under the `dlz-destination` container. We can now read it so we can use it to train our ML model. Because this data can be pretty big, we want to first read it via a Spark dataframe, so we can then use a sample of it for training.

The featurized data exported into the Data Landing Zone is stored under a path of the form **cmle/egress/$DATASETID/exportTime=$EXPORTTIME**. We know the dataset ID, which is in your config under `featurized_dataset_id`, so we're only missing the export time to know what to read. To get it, we can list the files in the DLZ and extract the value. The first step is to retrieve the credentials for the DLZ destination container:

In [0]:
from aepp import flowservice

flow_conn = flowservice.FlowService()
credentials = flow_conn.getLandingZoneCredential(dlz_type='dlz_destination')

Now we use some Python libraries to authenticate and issue listing commands, so we can get the paths and extract the export time from them.

In [0]:
import fsspec
from fsspec import AbstractFileSystem

def getDLZFSPath(credentials: dict):
    # dlzProvider is a string, so test for each substring rather than a list
    provider = credentials.get('dlzProvider', '')
    if 'Amazon' in provider or 's3' in provider:
        aws_credentials = {
            'key' : credentials['credentials']['awsAccessKeyId'],
            'secret' : credentials['credentials']['awsSecretAccessKey'],
            'token' : credentials['credentials']['awsSessionToken']
        }
        return fsspec.filesystem('s3', **aws_credentials), credentials['dlzPath']['bucketName']
    else:
        abs_credentials = {
            'account_name' : credentials['storageAccountName'],
            'sas_token' : credentials['SASToken']
        }
        return fsspec.filesystem('abfss', **abs_credentials), credentials['containerName']
    
def getDLZDataPath(credentials):
    # Same provider check as above: AWS credentials carry a dlzProvider string
    provider = credentials.get('dlzProvider', '')
    if 'Amazon' in provider or 's3' in provider:
        aws_bucket = credentials['dlzPath']['bucketName']
        dlz_folder = credentials['dlzPath']['dlzFolder']
        # No "$" here: f-strings interpolate with braces only
        return f"s3a://{aws_bucket}/{dlz_folder}/"
    else:
        dlz_storage_account = credentials['storageAccountName']
        dlz_container = credentials['containerName']
        return f"abfss://{dlz_container}@{dlz_storage_account}.dfs.core.windows.net/"


def get_export_time(fs: AbstractFileSystem, container_name: str, base_path: str, dataset_id: str):
  featurized_data_base_path = f"{container_name}/{base_path}/{dataset_id}"
  # Sort the listing so the last entry is the most recent exportTime=... folder
  featurized_data_export_paths = sorted(fs.ls(featurized_data_base_path, detail=False))
  
  if len(featurized_data_export_paths) == 0:
    raise Exception(f"Found no exports for featurized data from dataset ID {dataset_id} under path {featurized_data_base_path}")
  elif len(featurized_data_export_paths) > 1:
    print(f"Found {len(featurized_data_export_paths)} exports from dataset ID {dataset_id} under path {featurized_data_base_path}, using most recent one")
  
  featurized_data_export_path = featurized_data_export_paths[-1]
  # The last path segment looks like exportTime=<timestamp>; keep only the timestamp
  featurized_data_export_time = featurized_data_export_path.strip().split("/")[-1].split("=")[-1]
  return featurized_data_export_time


fs, container = getDLZFSPath(credentials)


export_time = get_export_time(fs, container, export_path, featurized_dataset_id)
print(f"Using featurized data export time of {export_time}")

Using featurized data export time of 20230401140556
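
The path handling in `get_export_time` can be exercised without any cloud access. The sketch below works on plain strings, assuming the `exportTime=$EXPORTTIME` path format described earlier and a fixed-width timestamp like the one printed above, so that string order matches chronological order. `parse_export_time` and `latest_export_time` are hypothetical helpers, and `DATASET_ID` is a placeholder.

```python
def parse_export_time(path):
    """Extract the value after 'exportTime=' from the last segment of a path."""
    last_segment = path.strip().rstrip("/").split("/")[-1]
    key, _, value = last_segment.partition("=")
    if key != "exportTime" or not value:
        raise ValueError(f"Path does not end with an exportTime=... segment: {path}")
    return value


def latest_export_time(paths):
    """Pick the most recent export.

    Assumes fixed-width timestamps (e.g. 20230401140556), which sort
    chronologically when compared as strings.
    """
    return max(parse_export_time(p) for p in paths)


# Placeholder paths shaped like the DLZ listing
paths = [
    "dlz-destination/cmle/egress/DATASET_ID/exportTime=20230401140556",
    "dlz-destination/cmle/egress/DATASET_ID/exportTime=20230328091500",
]
print(latest_export_time(paths))
```

Taking the maximum rather than the last list element avoids relying on the order in which the file system returns its listing.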


At this point we're ready to read the data. We're using Spark since the data could be pretty large and we have not sampled it yet.
Depending on the provisioned account, the Landing Zone is configured to use either Azure or AWS.

On Azure, the following properties are used to authenticate with a SAS token:
- `fs.azure.account.auth.type.$ACCOUNT.dfs.core.windows.net` should be set to `SAS`.
- `fs.azure.sas.token.provider.type.$ACCOUNT.dfs.core.windows.net` should be set to `org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider`.
- `fs.azure.sas.fixed.token.$ACCOUNT.dfs.core.windows.net` should be set to the SAS token retrieved earlier.

On AWS, the following properties are used to access data stored in S3:
- `fs.s3a.access.key` and `spark.hadoop.fs.s3a.access.key` should be set to the S3 access key.
- `fs.s3a.secret.key` and `spark.hadoop.fs.s3a.secret.key` should be set to the S3 secret key.
- `fs.s3a.session.token` and `spark.hadoop.fs.s3a.session.token` should be set to the S3 session token.
- `fs.s3a.aws.credentials.provider` and `spark.hadoop.fs.s3a.aws.credentials.provider` should be set to `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider`.
- `fs.s3.impl` and `spark.hadoop.fs.s3.impl` should be set to `org.apache.hadoop.fs.s3a.S3AFileSystem`.


These properties are derived from the Landing Zone credentials. The following utility function sets them up:

In [None]:
def configureSparkSessionAndGetPath(credentials):
    # dlzProvider is a string, so test for each substring rather than a list
    provider = credentials.get('dlzProvider', '')
    if 'Amazon' in provider or 's3' in provider:
        aws_key = credentials['credentials']['awsAccessKeyId']
        aws_secret = credentials['credentials']['awsSecretAccessKey']
        aws_token = credentials['credentials']['awsSessionToken']
        aws_bucket = credentials['dlzPath']['bucketName']
        dlz_folder = credentials['dlzPath']['dlzFolder']
        spark.conf.set("fs.s3a.access.key", aws_key)
        spark.conf.set("fs.s3a.secret.key", aws_secret)
        spark.conf.set("fs.s3a.session.token", aws_token)
        spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        spark.conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        spark.conf.set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        spark.conf.set("spark.hadoop.fs.s3a.access.key", aws_key)
        spark.conf.set("spark.hadoop.fs.s3a.secret.key", aws_secret)
        # Set the spark.hadoop variant of the session token (the plain key is set above)
        spark.conf.set("spark.hadoop.fs.s3a.session.token", aws_token)
        # No "$" here: f-strings interpolate with braces only
        return f"s3a://{aws_bucket}/{dlz_folder}/"
    else:
        dlz_storage_account = credentials['storageAccountName']
        dlz_sas_token = credentials['SASToken']
        dlz_container = credentials['containerName']
        spark.conf.set(f"fs.azure.account.auth.type.{dlz_storage_account}.dfs.core.windows.net", "SAS")
        spark.conf.set(f"fs.azure.sas.token.provider.type.{dlz_storage_account}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
        spark.conf.set(f"fs.azure.sas.fixed.token.{dlz_storage_account}.dfs.core.windows.net", dlz_sas_token)
        return f"abfss://{dlz_container}@{dlz_storage_account}.dfs.core.windows.net/"
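
To review the properties before they are applied to the session, the mapping can be built as a plain dictionary first. This is a minimal sketch, assuming the credential fields used in the function above and the property names listed earlier; `build_dlz_spark_conf` is a hypothetical helper, and the example credentials are placeholders.

```python
def build_dlz_spark_conf(credentials):
    """Return the Spark/Hadoop properties for the DLZ without applying them."""
    provider = credentials.get("dlzProvider", "")
    if "Amazon" in provider or "s3" in provider:
        aws = credentials["credentials"]
        props = {
            "fs.s3a.access.key": aws["awsAccessKeyId"],
            "fs.s3a.secret.key": aws["awsSecretAccessKey"],
            "fs.s3a.session.token": aws["awsSessionToken"],
            "fs.s3a.aws.credentials.provider":
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
            "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        }
        # Mirror every property under the spark.hadoop. prefix
        mirrored = {f"spark.hadoop.{name}": value for name, value in props.items()}
        props.update(mirrored)
        return props
    suffix = f"{credentials['storageAccountName']}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "SAS",
        f"fs.azure.sas.token.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        f"fs.azure.sas.fixed.token.{suffix}": credentials["SASToken"],
    }


# Placeholder Azure-style credentials, for illustration only
azure_example = {
    "storageAccountName": "myaccount",
    "containerName": "dlz-destination",
    "SASToken": "PLACEHOLDER",
}
for name in build_dlz_spark_conf(azure_example):
    print(name)
```

On a cluster, the result could then be applied with `for name, value in build_dlz_spark_conf(credentials).items(): spark.conf.set(name, value)`. Keeping the mapping separate from the session makes it easier to spot a missing or duplicated key.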

In [0]:
# init spark session for provisioned DLZ and get the base path (s3a://bucket_name/folder/ or abfss://container@account/)
cloud_base_path = configureSparkSessionAndGetPath(credentials)

input_path = cloud_base_path + f"{export_path}/{featurized_dataset_id}/exportTime={export_time}/"

# Let's put that in practice and create a Spark dataframe containing the entire featurized data:
df = spark.read.parquet(input_path)
df.printSchema()

root
 |-- userId: string (nullable = true)
 |-- eventType: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- subscriptionOccurred: long (nullable = true)
 |-- emailsReceived: long (nullable = true)
 |-- emailsOpened: long (nullable = true)
 |-- emailsClicked: long (nullable = true)
 |-- productsViewed: long (nullable = true)
 |-- propositionInteracts: long (nullable = true)
 |-- propositionDismissed: long (nullable = true)
 |-- webLinkClicks: long (nullable = true)
 |-- minutes_since_emailSent: integer (nullable = true)
 |-- minutes_since_emailOpened: integer (nullable = true)
 |-- minutes_since_emailClick: integer (nullable = true)
 |-- minutes_since_productView: integer (nullable = true)
 |-- minutes_since_propositionInteract: integer (nullable = true)
 |-- minutes_since_propositionDismiss: integer (nullable = true)
 |-- minutes_since_linkClick: integer (nullable = true)
 |-- random_row_number_for_user: integer (nullable = true)



We can then sample it to keep only a portion of the data for training before we bring the data in memory for use in the `scikit-learn` library. Here we're just going to use a sampling ratio of 50%, but you are welcome to use a bigger or smaller ratio. We use sampling **without** replacement to ensure the same profiles don't get picked up multiple times.

In [0]:
sampling_ratio = 0.5
df = df.sample(withReplacement=False, fraction=sampling_ratio)

df_train = df.toPandas()

## 1.1 Creating baseline models in DataRobot

Before doing any ML we can look at summary statistics to understand the structure of the data, and what kind of algorithm(s) might be suited to solve the problem.

In [0]:
df.describe()

Out[10]: DataFrame[summary: string, userId: string, eventType: string, subscriptionOccurred: string, emailsReceived: string, emailsOpened: string, emailsClicked: string, productsViewed: string, propositionInteracts: string, propositionDismissed: string, webLinkClicks: string, minutes_since_emailSent: string, minutes_since_emailOpened: string, minutes_since_emailClick: string, minutes_since_productView: string, minutes_since_propositionInteract: string, minutes_since_propositionDismiss: string, minutes_since_linkClick: string, random_row_number_for_user: string]

To keep the model name unique we append the username to the model name:

In [0]:
model_name = f"{model_name}_{unique_id}"

In order to feed data to our model, we need to do a few preparation steps:
- Separate the target variable (which in our case is whether a subscription occurred or not) from the other variables.
- Split the data into a training and test set so we can evaluate our model performance down the line.
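
When several rows belong to the same `userId`, splitting rows at random can leak one user's behavior from the training set into the test set. The sketch below illustrates both preparation steps on plain Python data: separating the target, and keeping each user on one side of the split by hashing the user ID. `split_features_target` and `assign_partition` are hypothetical helpers for illustration only; this is not how DataRobot implements its partitioning.

```python
import hashlib


def split_features_target(rows, target="subscriptionOccurred"):
    """Separate the target column from the remaining feature columns."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


def assign_partition(user_id, holdout_pct=20):
    """Deterministically map a user to 'holdout' or 'train' from a hash of the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "holdout" if bucket < holdout_pct else "train"


# Made-up rows using column names from the schema above
rows = [
    {"userId": "user-1", "emailsOpened": 3, "subscriptionOccurred": 1},
    {"userId": "user-1", "emailsOpened": 0, "subscriptionOccurred": 0},
    {"userId": "user-2", "emailsOpened": 5, "subscriptionOccurred": 0},
]

X, y = split_features_target(rows)
partitions = [assign_partition(row["userId"]) for row in rows]
print(y)
print(partitions)
```

Because the partition depends only on the user ID, both rows for `user-1` always land in the same partition, whichever one that is.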


## 1.2 Connect to DataRobot

Read more about different options for connecting to DataRobot from the client: https://docs.datarobot.com/en/docs/api/api-quickstart/api-qs.html

**To connect to DataRobot,** we just need to provide our API Token (found in Developer Tools) and the endpoint.

The endpoint for VPC installs will be different. The format will follow:
**https://{datarobot.example.com}/api/v2** <br>
See On-premise section in: https://docs.datarobot.com/en/docs/api/api-quickstart/index.html#retrieve-the-api-endpoint

In [None]:
import datarobot as dr

dr.Client(
    token=datarobot_key, 
    endpoint=datarobot_endpoint
)

### Upload dataset to DataRobot

In [None]:
new_dataset = dr.Dataset.create_from_in_memory_data(data_frame=df_train)
# Update the dataset name in the AI Catalog
new_dataset.modify(name=model_name)

# Output the new dataset ID
print("Dataset ID of our new AI Catalog dataset: " + new_dataset.id)

Every dataset in the AI Catalog is given a unique dataset ID. This can be found in the URL as well as the dataset metadata in the UI. For example, the dataset url 'https://app.datarobot.com/ai-catalog/64cfc12417441cd3242e99ec' contains the dataset ID: 64cfc12417441cd3242e99ec. The ID can be used to retrieve the AI Catalog dataset object with the API.
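
If you only have the URL at hand, the ID can be pulled out with a small parser. This is a sketch under the assumption that the ID is the 24-character hexadecimal segment right after `/ai-catalog/`, as in the example URL above; `dataset_id_from_url` is a hypothetical helper, not part of the DataRobot client.

```python
import re
from urllib.parse import urlparse


def dataset_id_from_url(url):
    """Return the ID segment that follows '/ai-catalog/' in a dataset URL."""
    path = urlparse(url).path
    # Assumption: IDs are 24 lowercase hexadecimal characters, as in the example
    match = re.search(r"/ai-catalog/([0-9a-f]{24})(?:/|$)", path)
    if match is None:
        raise ValueError(f"No AI Catalog dataset ID found in URL: {url}")
    return match.group(1)


catalog_dataset_id = dataset_id_from_url(
    "https://app.datarobot.com/ai-catalog/64cfc12417441cd3242e99ec"
)
print(catalog_dataset_id)
```

With a connected client, the extracted ID could then be passed to `dr.Dataset.get(catalog_dataset_id)` to retrieve the dataset object.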

## 1.3 Create a DataRobot project from an AI Catalog dataset

Ref: https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.28.2/reference/modeling/project.html#create-a-project

We can create DataRobot projects directly from:

* A dataset in AI Catalog (using the dataset's ID in DataRobot)
* A pandas dataframe (no need to write it back to a data source or disk)
* Data sources

Note: Each created project is associated with a unique Project ID. To use the API for downstream analytics, we can use the project ID to pull the project of interest.

In [None]:
project = dr.Project.create_from_dataset(
    dataset_id=new_dataset.id,
    project_name=model_name,
    max_wait=600,
)

# Quick link to the DataRobot project you just created
print(
    "DataRobot Project URL: " + project.get_uri()
)
print("Project ID: " + project.id)

# Group partitioning keeps all rows of a given userId in the same partition
partitioning = dr.GroupCV(partition_key_cols=["userId"], reps=5, holdout_pct=20)
# featurelist = project.create_featurelist('modelling', ['subscription'])

# Note: these options are defined here but not passed to analyze_and_model below
advanced_options = dr.AdvancedOptions(
    feature_discovery_supervised_feature_reduction=False,
)

project.analyze_and_model(
    target='subscriptionOccurred',
    mode=dr.AUTOPILOT_MODE.QUICK,
    # featurelist_id=featurelist.id,
    partitioning_method=partitioning,
    worker_count=-1,  # -1 uses all available workers
    max_wait=600*600,
)

# Autopilot can take a while; wait up to 4 hours for it to finish
project.wait_for_autopilot(timeout=60*60*4)
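
Once Autopilot has finished, you may want to compare the resulting models on a metric such as AUC before deploying one. The sketch below ranks model-like objects by a chosen metric and partition. It assumes each model exposes a `metrics` dictionary keyed first by metric name and then by partition name (for example `"crossValidation"`), so check that layout against your client version. `rank_models` is a hypothetical helper, and the stand-in objects and scores are made up for illustration.

```python
from types import SimpleNamespace


def rank_models(models, metric="AUC", partition="crossValidation"):
    """Sort models by a metric score, highest first, skipping unscored models."""
    scored = [
        (m.metrics[metric][partition], m)
        for m in models
        if m.metrics.get(metric, {}).get(partition) is not None
    ]
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [m for _, m in ordered]


# Stand-in objects shaped like the assumption above; with a live project
# you would pass project.get_models() instead.
fake_models = [
    SimpleNamespace(model_type="A", metrics={"AUC": {"crossValidation": 0.71}}),
    SimpleNamespace(model_type="B", metrics={"AUC": {"crossValidation": None}}),
    SimpleNamespace(model_type="C", metrics={"AUC": {"crossValidation": 0.78}}),
]

for m in rank_models(fake_models):
    print(m.model_type, m.metrics["AUC"]["crossValidation"])
```

Sorting highest-first suits AUC and accuracy; for an error metric such as log loss, the order would need to be reversed.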

### Deploy Best Model


In [None]:

# Get the first available prediction server for the deployment
prediction_server = dr.PredictionServer.list()[0]

# Create a new deployment from the model recommended for deployment
rec = dr.ModelRecommendation.get(
    project.id,
    recommendation_type=dr.enums.RECOMMENDED_MODEL_TYPE.RECOMMENDED_FOR_DEPLOYMENT,
)

model = rec.get_model()

deployment = dr.Deployment.create_from_learning_model(
    model.id,
    label='Adobe_Subscription',
    description='Adobe_Subscription',
    default_prediction_server_id=prediction_server.id,
)

deployment_id = deployment.id
# Display created deployment
print(deployment)
print(deployment.id)


## 1.4 Saving the DataRobot deployment ID to configuration

Now that we have everything working, we just need to save the `deployment_id` variable to the original configuration file, so we can refer to it in the following weekly assignments. To do that, execute the code below:

In [0]:
config.set("DataRobot", "datarobot_deployment_id", deployment_id)

with open(config_path, "w") as configfile:
    config.write(configfile)
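
The write can be checked with a round trip against a temporary file. This is a minimal sketch using only the standard library; `save_option` is a hypothetical helper and `PLACEHOLDER_ID` stands in for a real deployment ID. Unlike a bare `config.set`, it creates the section first if the file does not contain it yet.

```python
import os
import tempfile
from configparser import ConfigParser


def save_option(config_path, section, key, value):
    """Set one option in an INI file, creating the section if needed."""
    config = ConfigParser()
    config.read(config_path)  # a missing file is silently ignored
    if not config.has_section(section):
        config.add_section(section)
    config.set(section, key, value)
    with open(config_path, "w") as configfile:
        config.write(configfile)


# Round trip against a temporary file, using a placeholder deployment ID
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.ini")
    save_option(path, "DataRobot", "datarobot_deployment_id", "PLACEHOLDER_ID")
    check = ConfigParser()
    check.read(path)
    saved_value = check.get("DataRobot", "datarobot_deployment_id")

print(saved_value)
```

Note that `ConfigParser.write` rewrites the whole file and does not preserve comments, so keep a copy of the original if you annotated it by hand.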