# Scope of Notebook

This notebook lets you plug the featurized dataset from the previous week into an ML model. We will use the Watsonx AutoAI tool, which automates several steps of the model-building process: data preparation, feature engineering, model selection, and hyperparameter optimization. AutoAI analyzes the data, selects the most suitable algorithms, and generates multiple model pipelines, which it evaluates and ranks on performance metrics such as accuracy and precision. At the end, AutoAI recommends the best model according to the selected scoring metric; however, you are free to choose whichever model you believe is most suitable.

# Setup

This notebook requires some configuration data to properly authenticate to your Adobe Experience Platform instance. You should be able to find all the required values by following the Setup section of the watsonx/README.

The next cell will be looking for your configuration file to fetch the values used throughout this notebook. See more details in the Setup section of the watsonx/README to understand how to create your configuration file.
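
As an illustration only, the sketch below parses a configuration with the sections and keys that the cells in this notebook read. Every value is a made-up placeholder (including `import_path`, `data_format` and `compression_type`); the real values come from the Setup section of the watsonx/README.

```python
from configparser import ConfigParser

# All values below are hypothetical placeholders, not real credentials.
SAMPLE_CONFIG = """
[Platform]
ims_org_id = EXAMPLE@AdobeOrg
sandbox_name = prod
environment = prod
dataset_id = <dataset-id>
featurized_dataset_id = <featurized-dataset-id>

[Authentication]
client_id = <client-id>
client_secret = <client-secret>
scopes = <scopes>

[Cloud]
export_path = cmle/egress
import_path = cmle/ingress
data_format = parquet
compression_type = gzip
model_name = <model-name>

[Watsonx]
watson_username = <username>
watson_apikey = <api-key>
"""

sample_config = ConfigParser()
sample_config.read_string(SAMPLE_CONFIG)

# Keys that the notebook reads from each section.
required_keys = {
    "Platform": ["ims_org_id", "sandbox_name", "environment",
                 "dataset_id", "featurized_dataset_id"],
    "Authentication": ["client_id", "client_secret", "scopes"],
    "Cloud": ["export_path", "import_path", "data_format",
              "compression_type", "model_name"],
    "Watsonx": ["watson_username", "watson_apikey"],
}

# Collect every (section, key) pair that is absent from the file.
missing = [
    (section, key)
    for section, keys in required_keys.items()
    for key in keys
    if not sample_config.has_option(section, key)
]
print("Missing keys:", missing)
```

Running a check like this before the rest of the notebook can surface a misspelled key early instead of failing halfway through a later cell.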

In [24]:
!pip install project-lib



In [25]:
from project_lib import Project
from configparser import ConfigParser
import io

project = Project.access()
config_file = project.get_file('config.ini')

config = ConfigParser()
config.read_string(config_file.read().decode('utf-8'))

ims_org_id = config.get("Platform", "ims_org_id")
sandbox_name = config.get("Platform", "sandbox_name")
environment = config.get("Platform", "environment")
client_id = config.get("Authentication", "client_id")
client_secret = config.get("Authentication", "client_secret")
scopes = config.get("Authentication", "scopes")
dataset_id = config.get("Platform", "dataset_id")
featurized_dataset_id = config.get("Platform", "featurized_dataset_id")
export_path = config.get("Cloud", "export_path")
import_path = config.get("Cloud", "import_path")
data_format = config.get("Cloud", "data_format")
compression_type = config.get("Cloud", "compression_type")
model_name = config.get("Cloud", "model_name")

watson_username = config.get("Watsonx", "watson_username")
watson_apikey = config.get("Watsonx", "watson_apikey")

Now let's initialize the APIClient so we can interact with the platform.

In [26]:
wml_credentials = {
    "instance_id": "openshift",
    "version": "4.8",
    "url": "https://cpd-cpd-instance.apps.p712zf6h.eastus2.aroapp.io",
    "username": watson_username,
    "apikey": watson_apikey
}

In [27]:
from ibm_watsonx_ai import APIClient

project_id = project.get_metadata()['metadata']['guid']
print("Project ID:", project_id)    

client = APIClient(wml_credentials)
client.set.default_project(project_id)

Project ID: 506b7b7a-ecf6-454f-8931-6d1aab37044f


'SUCCESS'

To keep the resources created by this notebook unique, we include your system-provisioned username in each resource title to avoid conflicts.
We recommend supplying a more readable one so you can easily identify the resources this notebook creates in AEP.

In [28]:
import re

username = watson_username  # supply your custom one, e.g. foo@bar.com
# Build the ID from `username` so that a custom value above actually takes effect
unique_id = re.sub("[^0-9a-zA-Z]+", "_", username)

print(f"Username: {username}")
print(f"Unique ID: {unique_id}")

Username: mndymuqvx34peqwqz-ydi68gcdn1kj9ugzqs-towtum
Unique ID: mndymuqvx34peqwqz_ydi68gcdn1kj9ugzqs_towtum
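
To show what the substitution does with a custom username, here is a small standalone sketch. The helper name `make_unique_id` is hypothetical and only wraps the same regular expression used in the cell above.

```python
import re


def make_unique_id(name: str) -> str:
    """Collapse every run of non-alphanumeric characters into one underscore."""
    return re.sub("[^0-9a-zA-Z]+", "_", name)


print(make_unique_id("foo@bar.com"))  # foo_bar_com
print(make_unique_id("a--b  c"))      # a_b_c
```

Because runs of special characters collapse into a single underscore, two different usernames can map to the same ID (for example `a-b` and `a_b`), which is worth keeping in mind when picking a custom value.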


Before we run anything, make sure to install the following required libraries for this notebook. 
They are all publicly available libraries and the latest version should work fine.

In [29]:
!pip install aepp
!pip install adlfs
!pip install s3fs
!pip install fsspec
!pip install pyarrow 
!pip install pandas



Before any calls can take place, we need to configure the library and set up authentication credentials. For this you'll need the following pieces of information. For details on how to obtain them, please refer to the Setup section of the README:

* Client ID
* Client secret

In [30]:
import aepp

aepp.configure(
  environment=environment,
  sandbox=sandbox_name,
  org_id=ims_org_id,
  secret=client_secret,
  scopes=scopes,
  client_id=client_id
)


# 1. Running a model on AEP data

In the previous week we generated our featurized data in the Data Landing Zone under the dlz-destination container. We can now read it so we can use it to train our ML model. Because this data can be pretty big, we want to first read it via a Spark dataframe, so we can then use a sample of it for training.

The featurized data exported into the Data Landing Zone is stored under a path of the form `cmle/egress/DATASETID/exportTime=EXPORTTIME`. We know the dataset ID, which is in your config under `featurized_dataset_id`, so we are only missing the export time. To find it, we can list the files in the DLZ and read the value from the path. The first step is to retrieve the credentials for the DLZ destination container:


In [31]:
from aepp import flowservice

flow_conn = flowservice.FlowService()
credentials = flow_conn.getLandingZoneCredential(dlz_type='dlz_destination')

Now we use some Python libraries to authenticate and issue listing commands so we can get the paths and extract the time from it.

In [32]:
import fsspec
from fsspec import AbstractFileSystem

def getDLZFSPath(credentials: dict):
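    # Match the provider by substring; testing a list against a string with `in` raises a TypeError
    if 'dlzProvider' in credentials and any(p in credentials['dlzProvider'] for p in ('Amazon', 's3')):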
        aws_credentials = {
            'key' : credentials['credentials']['awsAccessKeyId'],
            'secret' : credentials['credentials']['awsSecretAccessKey'],
            'token' : credentials['credentials']['awsSessionToken']
        }
        return fsspec.filesystem('s3', **aws_credentials), credentials['dlzPath']['bucketName']
    else:
        abs_credentials = {
            'account_name' : credentials['storageAccountName'],
            'sas_token' : credentials['SASToken']
        }
        return fsspec.filesystem('abfss', **abs_credentials), credentials['containerName']
    
def getDLZDataPath(credentials):
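    # Match the provider by substring; testing a list against a string with `in` raises a TypeError
    if 'dlzProvider' in credentials and any(p in credentials['dlzProvider'] for p in ('Amazon', 's3')):
        aws_bucket = credentials['dlzPath']['bucketName']
        dlz_folder = credentials['dlzPath']['dlzFolder']
        # No `$` before the braces: this is a Python f-string, not a shell template
        return f"s3a://{aws_bucket}/{dlz_folder}/"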
    else:
        dlz_storage_account = credentials['storageAccountName']
        dlz_container = credentials['containerName']
        return f"abfss://{dlz_container}@{dlz_storage_account}.dfs.core.windows.net/"


def get_export_time(fs: AbstractFileSystem, container_name: str, base_path: str, dataset_id: str):
    featurized_data_base_path = f"{container_name}/{base_path}/{dataset_id}"
    # Sort the listing so that the last entry is the most recent exportTime=... folder
    featurized_data_export_paths = sorted(fs.ls(featurized_data_base_path))

    if len(featurized_data_export_paths) == 0:
        raise Exception(f"Found no exports for featurized data from dataset ID {dataset_id} under path {featurized_data_base_path}")
    elif len(featurized_data_export_paths) > 1:
        print(f"Found {len(featurized_data_export_paths)} exports from dataset ID {dataset_id} under path {featurized_data_base_path}, using most recent one")

    # The last path segment looks like exportTime=<value>; keep only the value
    featurized_data_export_path = featurized_data_export_paths[-1]
    featurized_data_export_time = featurized_data_export_path.strip().split("/")[-1].split("=")[-1]
    return featurized_data_export_time


fs, container = getDLZFSPath(credentials)


export_time = get_export_time(fs, container, export_path, featurized_dataset_id)
print(f"Using featurized data export time of {export_time}")

Using featurized data export time of 20240506204134
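
To make the path handling concrete, the sketch below extracts the export time from a folder path and picks the most recent export. The paths are hypothetical, and the assumption that the value is a `YYYYMMDDHHMMSS` timestamp is inferred only from the sample value printed above.

```python
from datetime import datetime


def parse_export_time(path: str) -> datetime:
    """Parse the trailing exportTime=<value> segment of a DLZ export path."""
    folder = path.rstrip("/").split("/")[-1]
    key, _, value = folder.partition("=")
    if key != "exportTime":
        raise ValueError(f"Not an export folder: {path}")
    # Assumed format: YYYYMMDDHHMMSS (inferred from the sample value above).
    return datetime.strptime(value, "%Y%m%d%H%M%S")


# Hypothetical listing, deliberately not in chronological order.
paths = [
    "dlz-destination/cmle/egress/DATASETID/exportTime=20240429101500",
    "dlz-destination/cmle/egress/DATASETID/exportTime=20240506204134",
    "dlz-destination/cmle/egress/DATASETID/exportTime=20240501000000",
]

latest = max(paths, key=parse_export_time)
print(latest)
```

Selecting with `max` on the parsed timestamp does not depend on the order in which the file system returns the listing.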


Now we will pull the data from the DLZ and store it locally as a project asset. The following code reads the partitioned data so we can feed it into AutoAI later.
At the time of writing, AutoAI supports only the CSV and XLSX formats, so we convert the Parquet files to CSV.

In [14]:
import pandas as pd

data_path = f"{container}/{export_path}/{featurized_dataset_id}/exportTime={export_time}/"

EXPORTED_FILENAME = f"dlz_exported_{featurized_dataset_id}_data_merged.csv"

dfs = []
for file_name in fs.ls(data_path):
    # Read only the Parquet partitions; skip any other files in the export folder
    if file_name.endswith('.parquet'):
        with fs.open(file_name) as parquet_file:
            dfs.append(pd.read_parquet(parquet_file))

combined_df = pd.concat(dfs)

print(combined_df['userId'].nunique())

project.save_data(EXPORTED_FILENAME, combined_df.to_csv(index=False))

100000


{'file_name': 'dlz_exported_66018d8312377d2c68545bac_data_merged.csv',
 'message': 'File saved to project storage.',
 'asset_id': '3633fd31-621f-4c64-9004-9806dc7e1604'}

Create a DataConnection for the asset saved above and confirm the data structure.

In [33]:
from ibm_watsonx_ai.helpers import DataConnection

# Find the asset by name
asset_id = None
for asset in project.get_assets():
    if asset['name']== EXPORTED_FILENAME:
        asset_id = asset['asset_id']
        break

if asset_id:
    print("Asset ID:", asset_id)
else:
    print("File not found.")
    
    
trainig_data_connection = DataConnection(data_asset_id = asset_id)

trainig_data_connection.read().info()

Asset ID: 3633fd31-621f-4c64-9004-9806dc7e1604
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399624 entries, 0 to 399623
Data columns (total 19 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   userId                             399624 non-null  object 
 1   eventType                          399624 non-null  object 
 2   timestamp                          399624 non-null  object 
 3   subscriptionOccurred               399624 non-null  int64  
 4   emailsReceived                     399624 non-null  int64  
 5   emailsOpened                       399624 non-null  int64  
 6   emailsClicked                      399624 non-null  int64  
 7   productsViewed                     399624 non-null  int64  
 8   propositionInteracts               399624 non-null  int64  
 9   propositionDismissed               399624 non-null  int64  
 10  webLinkClicks                      399624 non-null  int64

# 2. Initializing AutoAI

Let's now create an instance of AutoAI.
We will select several out-of-the-box classification algorithms that come with AutoAI, specify the list of features we are interested in as well as the label we are trying to predict, and finally choose the scoring metric.

In `train_sample_columns_index_list` we list the indexes of the columns we are interested in (the label is also listed; selected columns are shown in italics):

0   userId                            
1   eventType                         
2   timestamp                         
*3   subscriptionOccurred*              
*4   emailsReceived*                    
*5   emailsOpened*                      
*6   emailsClicked*                     
*7   productsViewed*                    
*8   propositionInteracts*              
*9   propositionDismissed*              
*10  webLinkClicks*                     
*11  minutes_since_emailSent*           
*12  minutes_since_emailOpened*         
*13  minutes_since_emailClick*          
*14  minutes_since_productView*         
*15  minutes_since_propositionInteract* 
*16  minutes_since_propositionDismiss*  
17  minutes_since_linkClick           
18  random_row_number_for_user        
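
Rather than counting indexes by hand, the list can be derived from the column names. The sketch below is one way to do it; the `excluded` set simply names the columns that are not in italics above.

```python
# Column names in the order shown in the listing above.
columns = [
    "userId", "eventType", "timestamp", "subscriptionOccurred",
    "emailsReceived", "emailsOpened", "emailsClicked", "productsViewed",
    "propositionInteracts", "propositionDismissed", "webLinkClicks",
    "minutes_since_emailSent", "minutes_since_emailOpened",
    "minutes_since_emailClick", "minutes_since_productView",
    "minutes_since_propositionInteract", "minutes_since_propositionDismiss",
    "minutes_since_linkClick", "random_row_number_for_user",
]

# Columns left out of training (the ones not in italics above).
excluded = {
    "userId", "eventType", "timestamp",
    "minutes_since_linkClick", "random_row_number_for_user",
}

# Keep the positional index of every column that is not excluded.
selected_indexes = [i for i, name in enumerate(columns) if name not in excluded]
print(selected_indexes)
```

Deriving the indexes this way keeps the list correct if the column order of the exported dataset changes.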

In `include_only_estimators` we list the classifiers that are built into (supported by) AutoAI:

'RandomForestClassifierEstimator', 'DecisionTreeClassifierEstimator', 'LogisticRegressionEstimator', 'ExtraTreesClassifierEstimator', 'XGBClassifierEstimator', 
'SnapDecisionTreeClassifierEstimator', 'SnapRandomForestClassifierEstimator', 'SnapLogisticRegressionEstimator', 'SnapSVMClassifierEstimator', 'GradientBoostingClassifierEstimator'

For `scoring` we select 'roc_auc'.
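
For intuition about the ROC AUC metric, here is a minimal pure-Python sketch. It is not how AutoAI computes the score internally; it only illustrates the definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because the metric depends only on how examples are ranked, it is less sensitive to class imbalance than plain accuracy, which may be one reason to prefer it for a label such as `subscriptionOccurred`.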

In [34]:


from ibm_watsonx_ai.experiment import AutoAI
experiment = AutoAI(wml_credentials, project_id=project_id)

pipeline_optimizer = experiment.optimizer(
    name="Cloud ML Watson (merged csv)",
    prediction_type='binary',
    prediction_column='subscriptionOccurred', # label we are trying to predict
    holdout_size=0.15,
    scoring='roc_auc',
    csv_separator=',',
    random_state=33,
    max_number_of_estimators=2,
    include_only_estimators=['RandomForestClassifierEstimator', 'DecisionTreeClassifierEstimator', 'LogisticRegressionEstimator', 'ExtraTreesClassifierEstimator', 'XGBClassifierEstimator', 'SnapDecisionTreeClassifierEstimator', 'SnapRandomForestClassifierEstimator', 'SnapLogisticRegressionEstimator', 'SnapSVMClassifierEstimator', 'GradientBoostingClassifierEstimator'],
    text_processing=False,
    train_sample_columns_index_list=[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], #features indexes we are interested in
    positive_label=1,
    drop_duplicates=True,
    outliers_columns=[],
    include_batched_ensemble_estimators=[],
    feature_selector_mode='auto'
)

run_details = pipeline_optimizer.fit(
            training_data_reference=[trainig_data_connection],
            background_mode=False)


pipeline_optimizer.get_run_status()



Training job f1027639-a3e9-4be6-83b0-96d9b6c84118 completed: 100%|████████| [15:11<00:00,  9.11s/it]


'completed'

### Once the pipeline has run, let's check which parameters were used for the AutoAI run

In [35]:
pipeline_optimizer.get_params()

{'name': 'Cloud ML Watson (merged csv)',
 'desc': '',
 'prediction_type': 'binary',
 'prediction_column': 'subscriptionOccurred',
 'prediction_columns': None,
 'timestamp_column_name': None,
 'scoring': 'roc_auc',
 'holdout_size': 0.15,
 'max_num_daub_ensembles': 2,
 't_shirt_size': 'm',
 'train_sample_rows_test_size': None,
 'include_only_estimators': [<ClassificationAlgorithms.RF: 'RandomForestClassifier'>,
  <ClassificationAlgorithms.DT: 'DecisionTreeClassifier'>,
  <ClassificationAlgorithms.LR: 'LogisticRegression'>,
  <ClassificationAlgorithms.EX_TREES: 'ExtraTreesClassifier'>,
  <ClassificationAlgorithms.XGB: 'XGBClassifier'>,
  <ClassificationAlgorithms.SnapDT: 'SnapDecisionTreeClassifier'>,
  <ClassificationAlgorithms.SnapRF: 'SnapRandomForestClassifier'>,
  <ClassificationAlgorithms.SnapLR: 'SnapLogisticRegression'>,
  <ClassificationAlgorithms.SnapSVM: 'SnapSVMClassifier'>,
  <ClassificationAlgorithms.GB: 'GradientBoostingClassifier'>],
 'include_batched_ensemble_estimators':

Let's pull some statistics from the run and get the best pipeline (based on the selected scoring metric).

In [36]:
summary = pipeline_optimizer.summary()
summary

| Pipeline Name | Enhancements | Estimator | training_roc_auc_(optimized) | holdout_average_precision | holdout_log_loss | training_accuracy | holdout_roc_auc | training_balanced_accuracy | training_f1 | holdout_precision | training_average_precision | training_log_loss | holdout_recall | training_precision | holdout_accuracy | holdout_balanced_accuracy | training_recall | holdout_f1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline_6 | | SnapRandomForestClassifier | 0.972468 | 0.86418 | 0.15738 | 0.938013 | 0.973661 | 0.8605 | 0.783847 | 0.823204 | 0.862405 | 0.152513 | 0.752369 | 0.82115 | 0.938659 | 0.861938 | 0.7498 | 0.786194 |
| Pipeline_7 | HPO | SnapRandomForestClassifier | 0.972468 | 0.86418 | 0.15738 | 0.938013 | 0.973661 | 0.8605 | 0.783847 | 0.823204 | 0.862405 | 0.152513 | 0.752369 | 0.82115 | 0.938659 | 0.861938 | 0.7498 | 0.786194 |
| Pipeline_1 | | RandomForestClassifier | 0.972538 | 0.872363 | 0.221596 | 0.90125 | 0.973283 | 0.91563 | 0.739743 | 0.618596 | 0.855574 | 0.223885 | 0.938611 | 0.61154 | 0.904049 | 0.918283 | 0.936167 | 0.745721 |
| Pipeline_2 | HPO | RandomForestClassifier | 0.972538 | 0.872363 | 0.221596 | 0.90125 | 0.973283 | 0.91563 | 0.739743 | 0.618596 | 0.855574 | 0.223885 | 0.938611 | 0.61154 | 0.904049 | 0.918283 | 0.936167 | 0.745721 |
| Pipeline_8 | HPO, FE | SnapRandomForestClassifier | 0.971145 | 0.866939 | 0.145073 | 0.937508 | 0.971802 | 0.855412 | 0.77979 | 0.83563 | 0.853281 | 0.150104 | 0.755483 | 0.826404 | 0.941071 | 0.86464 | 0.738165 | 0.793538 |
| Pipeline_9 | HPO, FE, HPO | SnapRandomForestClassifier | 0.971145 | 0.866939 | 0.145073 | 0.937508 | 0.971802 | 0.855412 | 0.77979 | 0.83563 | 0.853281 | 0.150104 | 0.755483 | 0.826404 | 0.941071 | 0.86464 | 0.738165 | 0.793538 |
| Pipeline_10 | HPO, FE, HPO, Ensemble | BatchedTreeEnsembleClassifier(SnapRandomForest... | 0.971145 | 0.866939 | 0.145073 | 0.937508 | 0.971802 | 0.855412 | 0.77979 | 0.83563 | 0.853281 | 0.150104 | 0.755483 | 0.826404 | 0.941071 | 0.86464 | 0.738165 | 0.793538 |
| Pipeline_3 | HPO, FE | RandomForestClassifier | 0.969668 | 0.858676 | 0.219277 | 0.901414 | 0.971014 | 0.912265 | 0.738317 | 0.619926 | 0.839282 | 0.225546 | 0.929526 | 0.613155 | 0.90401 | 0.914518 | 0.927762 | 0.743795 |
| Pipeline_4 | HPO, FE, HPO | RandomForestClassifier | 0.969668 | 0.858676 | 0.219277 | 0.901414 | 0.971014 | 0.912265 | 0.738317 | 0.619926 | 0.839282 | 0.225546 | 0.929526 | 0.613155 | 0.90401 | 0.914518 | 0.927762 | 0.743795 |
| Pipeline_5 | HPO, FE, HPO, Ensemble | BatchedTreeEnsembleClassifier(RandomForestClas... | 0.969668 | 0.858676 | 0.219277 | 0.901414 | 0.971014 | 0.912265 | 0.738317 | 0.619926 | 0.839282 | 0.225546 | 0.929526 | 0.613155 | 0.90401 | 0.914518 | 0.927762 | 0.743795 |
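
The ranking can depend on which metric you look at. The sketch below copies four rows and two columns from the summary above into plain dictionaries (the shortened key `training_roc_auc` stands for `training_roc_auc_(optimized)`) and picks the top pipeline under each metric. It only illustrates the comparison and makes no claim about how AutoAI orders pipelines internally.

```python
# Values copied from the summary above (one row per distinct result).
rows = [
    {"pipeline": "Pipeline_6", "training_roc_auc": 0.972468, "holdout_roc_auc": 0.973661},
    {"pipeline": "Pipeline_1", "training_roc_auc": 0.972538, "holdout_roc_auc": 0.973283},
    {"pipeline": "Pipeline_8", "training_roc_auc": 0.971145, "holdout_roc_auc": 0.971802},
    {"pipeline": "Pipeline_3", "training_roc_auc": 0.969668, "holdout_roc_auc": 0.971014},
]


def best_by(rows, metric):
    """Return the name of the pipeline with the highest value of `metric`."""
    return max(rows, key=lambda row: row[metric])["pipeline"]


print("Best on holdout: ", best_by(rows, "holdout_roc_auc"))
print("Best on training:", best_by(rows, "training_roc_auc"))
```

Here the two metrics disagree by a small margin, which is a reminder to compare pipelines on the holdout columns as well as on the optimized training score.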


Let's list the feature importance.

In [37]:
pipeline_optimizer.get_pipeline_details()['features_importance']

| Feature | features_importance |
|---|---|
| minutes_since_emailClick | 0.3113 |
| minutes_since_emailOpened | 0.1422 |
| minutes_since_linkClick | 0.137 |
| emailsClicked | 0.1155 |
| minutes_since_productView | 0.1117 |
| minutes_since_emailSent | 0.0881 |
| emailsOpened | 0.0447 |
| emailsReceived | 0.0125 |
| productsViewed | 0.0122 |
| webLinkClicks | 0.0092 |
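
One way to read these numbers is to ask how many features are needed to cover a given share of the total importance. The sketch below uses the values listed above; the 0.8 threshold and the helper name `top_features` are arbitrary choices for illustration.

```python
# Importance values copied from the output above.
importance = {
    "minutes_since_emailClick": 0.3113,
    "minutes_since_emailOpened": 0.1422,
    "minutes_since_linkClick": 0.137,
    "emailsClicked": 0.1155,
    "minutes_since_productView": 0.1117,
    "minutes_since_emailSent": 0.0881,
    "emailsOpened": 0.0447,
    "emailsReceived": 0.0125,
    "productsViewed": 0.0122,
    "webLinkClicks": 0.0092,
}


def top_features(importance, threshold):
    """Return the most important features whose cumulative importance reaches `threshold`."""
    chosen, total = [], 0.0
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    for name, value in ranked:
        chosen.append(name)
        total += value
        if total >= threshold:
            break
    return chosen


print(top_features(importance, 0.8))
```

With these values, the five highest-ranked features are enough to pass the 0.8 mark, and all of them except `emailsClicked` are recency features (`minutes_since_...`).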



Let's see the accuracy scores of the pipelines on the holdout data.


In [None]:
import pandas as pd
pd.options.plotting.backend = "plotly"

summary.holdout_accuracy.plot()

![holdout_accuracy](images/holdout_accuracy.png)

Now let's visualize the confusion matrix.

In [39]:
pipeline_optimizer.get_pipeline_details()['confusion_matrix']

| true_class | fn | fp | tn | tp |
|---|---|---|---|---|
| 0.0 | 1245 | 1908 | 5797 | 42451 |
| 1.0 | 1908 | 1245 | 42451 | 5797 |
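
The holdout metrics can be recomputed from these counts. The sketch below uses the row for `true_class` 1.0 (the positive label) and the standard definitions of precision, recall, accuracy and F1; the results appear to line up with the `holdout_precision`, `holdout_recall`, `holdout_accuracy` and `holdout_f1` values reported for the top pipeline in the summary.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Counts taken from the row for true_class 1.0 above.
metrics = classification_metrics(tp=5797, fp=1245, tn=42451, fn=1908)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Recall is noticeably lower than accuracy here, which reflects the class imbalance: most holdout rows are negatives, so accuracy alone overstates how well subscribers are identified.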


### Let's pull the best pipeline and visualize it

In [None]:
best_pipeline = pipeline_optimizer.get_pipeline()
best_pipeline.visualize()

![best_pipeline](images/best_pipeline.png)

# 3. Store best pipeline model in repository for future use

We will first convert it to scikit-learn so we can later use it in a Spark cluster.

In [None]:
sklearn_pipeline = pipeline_optimizer.get_pipeline(astype='sklearn')

sklearn_pipeline

![sklearn_pipeline](images/sklearn_pipeline.png)

Let's push the pipeline to the repository.

In [42]:
software_spec_uid = client.software_specifications.get_id_by_name("runtime-23.1-py3.10")
software_spec_uid

# Define metadata for storing the model
metadata = {
    client.repository.ModelMetaNames.NAME: "CMLE Watson AutoAI Model",
    client.repository.ModelMetaNames.TYPE: 'scikit-learn_1.1',
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid
}

# Store the model in WML repository
stored_model_details = client.repository.store_model(
    model=sklearn_pipeline, 
    meta_props=metadata
)

model_uid = client.repository.get_model_id(stored_model_details)
print("Model UID:", model_uid)

Model UID: a9ed5637-f4cf-4446-abcd-6510344762ba


Now that we have everything working, we just need to save the model ID (as `model_id`) in the original configuration file, so we can refer to it in the following weekly assignments. To do that, execute the code below:

In [43]:
config.set("Watsonx", "model_id", model_uid)
config_string = io.StringIO()
config.write(config_string)
project.save_data(file_name="config.ini", data=config_string.getvalue(), overwrite=True)

{'file_name': 'config.ini',
 'message': 'File saved to project storage.',
 'asset_id': '3b0346fe-24e6-42cf-9f33-38d3b9bc87a7'}
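
The write-back step can be tried in isolation. The sketch below repeats the same `ConfigParser` round trip on an in-memory string with placeholder values, without touching project storage. The `has_section` guard is an addition for illustration: `ConfigParser.set` raises `NoSectionError` when the section does not exist yet.

```python
import io
from configparser import ConfigParser

# Placeholder configuration; the real file has more sections and keys.
cfg = ConfigParser()
cfg.read_string("[Watsonx]\nwatson_username = <username>\n")

# Guard against a missing section before setting the new key.
if not cfg.has_section("Watsonx"):
    cfg.add_section("Watsonx")
cfg.set("Watsonx", "model_id", "<model-id>")

# Serialize to a string, as done before saving to project storage.
buffer = io.StringIO()
cfg.write(buffer)
serialized = buffer.getvalue()

# Parse the serialized text again to confirm that nothing was lost.
reloaded = ConfigParser()
reloaded.read_string(serialized)
print(reloaded.get("Watsonx", "model_id"))
print(reloaded.get("Watsonx", "watson_username"))
```

Reading the serialized text back is a cheap way to confirm that existing keys survive the update before the file is overwritten.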