
# AMC Custom Models: SageMaker Notebook Interface

This notebook serves as a comprehensive guide for leveraging the AMC Custom Models APIs. It provides Python methods to simplify:

1. Creating and managing ML Input Channels
2. Initiating and monitoring Trained Models
3. Creating and managing Inference Jobs
4. Creating Modeled Datasets and uploading inference results

---


In [4]:
# Import necessary libraries and the handler script
from acm_api_handler import MlInputChannel, TrainedModels, TrainedModelInferenceJobs, ManageModeledDatasets

In [5]:
# Initialize handler classes
ml_channel_handler = MlInputChannel()
trained_model_handler = TrainedModels()
inference_job_handler = TrainedModelInferenceJobs()
modeled_dataset_handler = ManageModeledDatasets()

In [6]:
# Define the Configured Model Algorithm Association (CMAA) ARN for the custom lookalike (LAL) model
configured_model_algorithm_associations = "<Configured Model Algorithm Association ARN>"


## 1. ML Input Channels

ML Input Channels are used to prepare input datasets for training and inferencing. Each ML Input Channel materializes the results of an SQL query executed on an AMC-ACR collaboration. The query results (CSV files) can then be accessed by your model containers during training and inferencing.

#### Usage
- ML Input Channels are created by providing an AMC SQL query and an associated time window to the AMC API. Each ML Input Channel stores the query results (data) so that they can later be used for training and inferencing.
- The AMC SQL Query to create an ML Input Channel does not require aggregation (GROUP BY) in the final SELECT statement.
- A single training job can accept up to 20 ML Input Channels. Queries over long time periods can be split into multiple ML Input Channels. 
- ML Input Channel resources have a TTL of 28 days. After 28 days have passed, the ML Input Channel will become INACTIVE, which is reflected by the status field in the API response. An INACTIVE ML Input Channel cannot be leveraged for training or inferencing. 
- Each ML Input Channel can be associated with a single Configured Model Algorithm Association (CMAA).
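
As a rough illustration of splitting a long query period across several ML Input Channels, the sketch below divides one time range into contiguous windows. `split_time_window` is a hypothetical helper, not part of the handler script. It assumes timestamps in the `YYYY-MM-DDTHH:MM:SS` form used in this notebook and treats window ends as inclusive, which you should confirm against the API before relying on it.

```python
from datetime import datetime, timedelta

MAX_CHANNELS_PER_TRAINING_JOB = 20  # limit stated in the usage notes above


def split_time_window(start, end, parts):
    """Split [start, end] into `parts` contiguous windows of near-equal length.

    Returns a list of (time_start, time_end) strings, one pair per
    ML Input Channel you intend to create.
    """
    if not 1 <= parts <= MAX_CHANNELS_PER_TRAINING_JOB:
        raise ValueError("parts must be between 1 and 20")
    start_dt = datetime.fromisoformat(start)
    end_dt = datetime.fromisoformat(end)
    if end_dt <= start_dt:
        raise ValueError("end must be later than start")

    step = (end_dt - start_dt) / parts
    windows = []
    for i in range(parts):
        window_start = start_dt + i * step
        if i == parts - 1:
            # The last window ends exactly at the requested end time.
            window_end = end_dt
        else:
            # End one second before the next window starts (assumes inclusive ends).
            window_end = start_dt + (i + 1) * step - timedelta(seconds=1)
        windows.append((
            window_start.isoformat(timespec="seconds"),
            window_end.isoformat(timespec="seconds"),
        ))
    return windows
```

Each returned pair can be passed as `time_start` and `time_end` when creating one ML Input Channel per window.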

#### Tips 
- Trained Models and Inference Jobs intended for audience creation must use only the `*_for_audiences` variants of AMC data sources.
- When you specify `enforceUserLevelTargeting=True` in the API call, AMC checks whether the ML Input Channel being created is allowed for audience creation.

### 1.1 Creating ML Input Channels

You can use the `create_ml_input_channel` method to generate input datasets for training or inference.


In [7]:
# Template: Create an ML Input Channel
try:
    ml_input_channel_id = ml_channel_handler.create_ml_input_channel(
        name="Sample Input Channel",
        model_arn=["<CMAA_ARN>"],
        channel_name="train",
        sql_query="<SQL Query>",
        time_start="<Query Start Date>",  # e.g. "2024-01-01T00:00:00"
        time_end="<Query End Date>",  # e.g. "2024-01-31T23:59:59"
        user_level_targeting=True,  # True or False
    )['mlInputChannelId']
    print("Created ML Input Channel:", ml_input_channel_id)

except Exception as e:
    print(f"An error occurred while creating the input channel: {e}")

### 1.2 Status of ML Input Channels

You can use the `get_ml_input_channel` method to get the details and status of an ML Input Channel.

In [None]:
ml_channel_handler.get_ml_input_channel(ml_input_channel_id)

### 1.3 List All ML Input Channels

You can use the `list_all_ml_input_channels` method to get the details and status of all ML Input Channels.

In [None]:
# To list all ml input channels
ml_channel_handler.list_all_ml_input_channels()

In [None]:
# To list all active ml input channels
ml_channel_handler.list_all_ml_input_channels(status='ACTIVE')

### 1.4 List Details of Most Recent ML Input Channels

You can use the `list_most_recent_ml_input_channels` method to get the details and status of the top x most recent ML Input Channels.

In [None]:
# To list top x ml input channels
ml_channel_handler.list_most_recent_ml_input_channels(top=3)

In [None]:
# To list top x active ml input channels
ml_channel_handler.list_most_recent_ml_input_channels(top=3, status='ACTIVE')

### 1.5 Good to Proceed to next step

You can use the `is_ready` method to check whether the ML Input Channel is ready and you can proceed to the next step.

In [None]:
# Check whether the ML Input Channel is ready
ml_channel_handler.is_ready(ml_input_channel_id)
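
Because resources take time to become ready, a polling loop can save repeated manual checks. The sketch below is one possible approach; `wait_until_ready` is a hypothetical helper (not part of the handler script), and it assumes `is_ready` returns a truthy value once the resource is ready.

```python
import time


def wait_until_ready(check, timeout_seconds=3600, poll_seconds=60, sleep=time.sleep):
    """Call `check()` repeatedly until it returns a truthy value.

    Returns True as soon as `check()` succeeds, or False once roughly
    `timeout_seconds` of waiting has elapsed without success.
    `sleep` is injectable so the loop can be tested without real delays.
    """
    waited = 0
    while True:
        if check():
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

For example, `wait_until_ready(lambda: ml_channel_handler.is_ready(ml_input_channel_id))` would poll once a minute for up to an hour.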


## 2. Training a Model

The Trained Model resource encapsulates a model training job. Creating a Trained Model kicks off a training job in AWS Clean Rooms with the specified model algorithm and ML Input Channels (materialized training datasets).

#### Usage
- Before you initiate a training job, ensure you have associated your model algorithm to the AMC-ACR collaboration. 
- Once training is complete, serialized model weights must be written to the /opt/ml/model directory. The artifacts stored in this directory can be accessed later during inferencing (e.g., to load your model into memory).
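
A minimal sketch of persisting and reloading model artifacts follows. It assumes the weights are JSON-serializable and uses a hypothetical file name, `weights.json`; real models typically use their framework's own serialization format. The `model_dir` parameter exists only so the functions can be exercised outside a container.

```python
import json
import os

MODEL_DIR = "/opt/ml/model"  # directory named in the usage note above


def save_weights(weights, model_dir=MODEL_DIR, filename="weights.json"):
    """Write model weights to the model directory at the end of training."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(weights, f)
    return path


def load_weights(model_dir=MODEL_DIR, filename="weights.json"):
    """Read the weights back, e.g. when the inference container starts."""
    with open(os.path.join(model_dir, filename), encoding="utf-8") as f:
        return json.load(f)
```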

#### Tips
- AWS Clean Rooms leverages the SageMaker [CreateTrainingJob API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html). As a best practice, it is recommended to test your training containers on SageMaker using synthetic data from AMC sandbox. 
- Each training job can accept up to 20 ML Input Channels. Queries over large time windows can be split into multiple ML Input Channels. 
- Trained Models can only be created with ML Input Channel(s) which are in ACTIVE state. 
- Trained Models have a TTL of 30 days. AMC automatically deletes Trained Model resources once the TTL has expired.
- Model evaluation can be performed during the training job execution. A typical approach is to (1) set aside a portion of the training data for testing, (2) use the trained model to make predictions on the test data, and (3) log custom metrics (e.g., RMSE, AUC) in the container. If you set up a service role for Clean Rooms to push custom metrics to CloudWatch, you can access the custom metrics in CloudWatch once the training job is complete.
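
The evaluation approach described above can be sketched with the standard library alone. The helper names (`train_test_split`, `rmse`, `log_metric`) are illustrative, and printing `name=value` lines is only one possible logging convention; how metrics reach CloudWatch depends on your container and role setup.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows deterministically and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared_error / len(y_true))


def log_metric(name, value):
    """Emit a metric line to stdout (assumed logging convention)."""
    print(f"{name}={value:.6f}")
```

Inside a training script, you would fit on the first returned list, predict on the second, and call `log_metric("rmse", rmse(labels, predictions))`.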

### 2.1 Create Trained Model
The `create_trained_model` method triggers model training using specified input channels and parameters.


In [None]:
# Template: Train a model
try:
    trained_model_id = trained_model_handler.create_trained_model(
        name="Sample Trained Model",
        model_arn="<CMAA_ARN>",
        input_channels=[ml_input_channel_id], #[ml_input_channel_id_1, ml_input_channel_id_2, ....] 
        instance_type="ml.m5.xlarge",
        instance_count=1,
        vol_size=10,
        timeout=123400
    )['trainedModelId']
    print("Trained Model ID:", trained_model_id)
    
except Exception as e:
    print(f"An error occurred while creating the trained model: {e}")

### 2.2 Status of Trained Model

You can use the `get_trained_model` method to get details about the status of Trained Model.

In [None]:
trained_model_handler.get_trained_model(trained_model_id)

### 2.3 List All Trained Models

You can use the `list_all_trained_models` method to get details about the status of all Trained Models.

In [None]:
# To list all trained models
trained_model_handler.list_all_trained_models()

In [None]:
# To list all successful trained models
trained_model_handler.list_all_trained_models(status='SUCCEEDED')

### 2.4 List Details of Most Recent Trained Models

You can use the `list_most_recent_trained_models` method to get the details and status of the top x most recent Trained Models.

In [None]:
# To list top x trained models
trained_model_handler.list_most_recent_trained_models(top=3)

In [None]:
# To list top x successful trained models
trained_model_handler.list_most_recent_trained_models(top=3, status='SUCCEEDED')

### 2.5 Good to Proceed to next step

You can use the `is_ready` method to check whether model training is complete and you can proceed to the next step.

In [None]:
# Check whether the Trained Model is ready
trained_model_handler.is_ready(trained_model_id)


## 3. Running Inference

Inference Jobs leverage a Trained Model to generate predictions (inference) on input data (provided via ML Input Channel). Once an inference job is complete, the results can be uploaded to a new AMC dataset and leveraged for audience creation. 

[Inferencing](https://docs.aws.amazon.com/clean-rooms/latest/userguide/run-inference-jobs.html) in AWS Clean Rooms ML leverages the [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) API. The AMC Custom Models QuickStart guide includes instructions for setting up a basic inferencing container. As described in the SageMaker guide for [creating your own inference container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html), the container must host a web server that responds to incoming requests. Each incoming request contains a batch of data from the ML Input Channel in CSV format. The inferencing container should load the trained model parameters from **/opt/ml/model** (saved during training) and respond to inferencing requests by appending columns generated by the model.

#### Usage
- Inference jobs accept an ML Input Channel, Trained Model, and CMA (inference container) as input. The purpose of an inference job is to augment the input data (ML Input Channel) with additional columns containing the model-generated predictions. 
- The inference job should produce CSV files containing a user_id column, and additional columns which represent the model-generated predictions. The dataset generated during inference can then be uploaded to a new AMC dataset for audience creation.
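
To make the container contract concrete, here is a minimal sketch of an inference web server built on Python's standard library. It answers the `/ping` health check and `/invocations` requests on port 8080, as SageMaker expects of custom inference containers. Everything else is an assumption for illustration: the placeholder `predict` function, the `prediction` column name, and the premise that each request body includes a CSV header row (adjust the parsing if your batches arrive without one).

```python
import csv
import io
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(row):
    """Placeholder model: replace with a real prediction from loaded weights."""
    return 0.5


def add_predictions(csv_text, predict_fn=predict, column="prediction"):
    """Append one model-generated column to every row of a headered CSV payload."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + [column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = predict_fn(row)
        writer.writerow(row)
    return out.getvalue()


class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health check endpoint.
        if self.path == "/ping":
            self.send_response(200)
            self.end_headers()
        else:
            self.send_error(404)

    def do_POST(self):
        # Inference endpoint: CSV in, CSV with an extra column out.
        if self.path != "/invocations":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        payload = add_predictions(body).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the server; call this from the container entry point."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

The CSV transformation lives in `add_predictions` so it can be tested without starting a server. A production container would usually sit behind a sturdier server than `http.server`.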

#### Tips

- Each Trained Model Inference Job accepts a single ML Input Channel (SageMaker limitation). To generate inference results on large datasets, it is recommended to split the input data into batches (multiple ML Input Channels) and create multiple Trained Model Inference Jobs. The results from each inference job can then be loaded into AMC for audience creation.
- See the AMC Custom Models QuickStart guides for a sample inferencing container.
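
One way to organize batched inference is to prepare one set of job arguments per ML Input Channel before submitting anything. `plan_inference_jobs` below is a hypothetical helper. Its keyword names mirror the `create_inference_job` call used in this notebook, and the name prefix is an arbitrary choice.

```python
def plan_inference_jobs(input_channel_ids, trained_model_id, model_arn,
                        instance_type="ml.m5.xlarge", name_prefix="Inference Batch"):
    """Build one keyword-argument dict per ML Input Channel.

    Each dict describes a single inference job, since a job accepts
    exactly one ML Input Channel.
    """
    if not input_channel_ids:
        raise ValueError("at least one ML Input Channel ID is required")
    return [
        {
            "name": f"{name_prefix} {i}",
            "model_arn": model_arn,
            "trained_model_id": trained_model_id,
            "input_channel_id": channel_id,
            "instance_type": instance_type,
        }
        for i, channel_id in enumerate(input_channel_ids, start=1)
    ]
```

Each dict can then be submitted with `inference_job_handler.create_inference_job(**job_kwargs)`, collecting the returned job IDs for later upload.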

### 3.1 Create Inference Job

Use the `create_inference_job` method to generate predictions based on trained models.


In [None]:
# Template: Run an inference job
try:
    trainedModelInferenceJobId = inference_job_handler.create_inference_job(
        name="Sample Inference Job",
        model_arn="<CMAA_ARN>",
        trained_model_id=trained_model_id,
        input_channel_id=ml_input_channel_id,
        instance_type="ml.m5.xlarge"
    )['trainedModelInferenceJobId']
    print("Inference Job ID:", trainedModelInferenceJobId)
    
except Exception as e:
    print(f"An error occurred while creating the inference job: {e}")

### 3.2 Status of Inference Job

You can use the `get_inference_job` method to get details about the status of Inference Job.

In [None]:
inference_job_handler.get_inference_job(trainedModelInferenceJobId)

### 3.3 List All Inference Jobs

You can use the `list_all_inference_jobs` method to get the details and status of all Inference Jobs.

In [None]:
# To list all inference jobs
inference_job_handler.list_all_inference_jobs()

In [None]:
# To list all active inference jobs
inference_job_handler.list_all_inference_jobs(status='ACTIVE')

### 3.4 List Details of Most Recent Inference Jobs

You can pass the `top` argument to the `list_all_inference_jobs` method, as in the cells below, to get the details and status of the top x most recent Inference Jobs.

In [None]:
# To list top x inference jobs
inference_job_handler.list_all_inference_jobs(top=3)

In [None]:
# To list top x active inference jobs
inference_job_handler.list_all_inference_jobs(top=3, status='ACTIVE')

### 3.5 Good to Proceed to next step

You can use the `is_ready` method to check whether the Inference Job is complete and you can proceed to the next step.

In [None]:
# Check whether the Inference Job is complete
inference_job_handler.is_ready(trainedModelInferenceJobId)


## 4. Creating Modeled Dataset

Modeled datasets are used to store the results of inferences (model predictions) generated by an Inference Job. Once the inference results are uploaded to a Modeled Dataset, the AMC audiences UI and API can be leveraged to create custom model-based audiences (rule-based and lookalikes) for activation on Amazon DSP and Sponsored Ads.

#### Usage

- Start by creating a Modeled Dataset which matches the schema of the CSV files generated by your inference job(s). 
- The Modeled Dataset must contain an identity column to store user_id values, and can optionally contain any number of additional columns from the inference job output.
- For example, say a model generates a conversion_propensity column with float values between [0, 1]. The conversion_propensity column can be uploaded to the Modeled Dataset, enabling audience creation with users above a certain propensity value. 
- Once the Modeled Dataset is created, you will create Modeled Dataset Upload Jobs to export results of an inference job to the dataset. Each upload job accepts a single inference job.
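
Before creating upload jobs, it can help to compare the dataset schema with the header of an inference output file. The sketch below assumes the schema uses the `tableColumns`, `columnName`, and `userIdColumn` keys shown in this notebook's dataset template; `compare_schema_to_csv` is a hypothetical helper.

```python
import csv
import io


def compare_schema_to_csv(schema, csv_text):
    """Report column mismatches between a dataset schema and a CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    columns = schema["tableColumns"]
    schema_names = [col["columnName"] for col in columns]
    return {
        # Columns declared in the schema but absent from the file.
        "missing_from_csv": [name for name in schema_names if name not in header],
        # Columns present in the file but not declared in the schema.
        "missing_from_schema": [name for name in header if name not in schema_names],
        # Columns flagged as holding user IDs (exactly one is expected).
        "user_id_columns": [col["columnName"] for col in columns if col.get("userIdColumn")],
    }
```

An empty `missing_from_csv` list together with a single entry in `user_id_columns` suggests the file and schema line up.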

#### Tips

- The schema for the Modeled Dataset should match the column headers in the file (e.g., CSV) generated during inference. 
- Measurement queries on Modeled Datasets are currently not supported. Modeled Datasets can only be used in the AMC Audiences UI and API.
- Inference results uploaded to a Modeled Dataset are persisted for 30 days. After 30 days, the uploaded data is removed (cannot be used for audience creation). To avoid audience depletion, inference results should be re-uploaded to the dataset every 30 days at minimum. 
- Data deletion at the record level is currently not supported. In order to remove data from AMC, you can delete the Modeled Dataset.
- Upload jobs support two update strategies: 
    - **Full Replace:** Replace the entire contents of the dataset with the results of a given inference job. 
    - **Additive:** Add the contents of the inference job to the current contents of the dataset.
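
Since uploaded inference results are removed after 30 days, a small date check can flag when a re-upload is due. This is a sketch under two assumptions: that the retention period is counted from the upload time, and that a three-day safety margin is appropriate. Both helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # retention period stated in the tips above


def refresh_deadline(last_upload, retention_days=RETENTION_DAYS):
    """Return the time at which data from `last_upload` is expected to expire."""
    return last_upload + timedelta(days=retention_days)


def needs_refresh(last_upload, now=None, retention_days=RETENTION_DAYS,
                  safety_margin_days=3):
    """Return True once the safety margin before the deadline has been reached."""
    now = now or datetime.now(timezone.utc)
    cutoff = refresh_deadline(last_upload, retention_days) - timedelta(days=safety_margin_days)
    return now >= cutoff
```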
    
### 4.1 Create Modeled Dataset

Use the `create_dataset` method to generate a new modeled dataset in AMC.

In [None]:

# Template: Create Modeled Dataset
try:
    modeled_dataset_id = modeled_dataset_handler.create_dataset(
        name="Sample_Modeled_Dataset",
        description="Sample Modeled Dataset Description",
        schema= {
            "tableColumns": [ 
                # Required: one column containing AMC user IDs. 
                { 
                    "columnName": "user_id", 
                    "description": "string", 
                    "userIdColumn" : True, 
                    "dataType" : "string" 
                },

                # Optional: additional column(s) containing inference values
                { 
                    "columnName": "string", 
                    "description": "string", 
                    "userIdColumn" : False, 
                    "dataType" : "string" 
                }
            ]
        }
    )['modeledDatasetId']
    print("Modeled Dataset ID:", modeled_dataset_id)
    
except Exception as e:
    print(f"An error occurred while creating the modeled dataset: {e}")



### 4.2 Status of Modeled Dataset 

You can use the `get_dataset` method to get details about the status of Modeled Dataset.

In [None]:
modeled_dataset_handler.get_dataset(modeled_dataset_id)

### 4.3 List All Modeled Datasets

You can use the `list_all_datasets` method to get the details and status of all Modeled Datasets.

In [None]:
# To list all modeled datasets
modeled_dataset_handler.list_all_datasets()

In [None]:
# To list all active modeled datasets
modeled_dataset_handler.list_all_datasets(status='ACTIVE')

### 4.4 List Details of Most Recent Modeled Datasets

You can use the `list_most_recent_datasets` method to get the details and status of the top x most recent Modeled Datasets.

In [None]:
# To list top x modeled datasets
modeled_dataset_handler.list_most_recent_datasets(top=3)

In [None]:
# To list top x active modeled datasets
modeled_dataset_handler.list_most_recent_datasets(top=3, status='ACTIVE')

### 4.5 Delete the Modeled Dataset (Coming Soon)

You can use the `delete_dataset` method to delete the Modeled Dataset Table.

In [None]:
# To delete the modeled dataset
modeled_dataset_handler.delete_dataset(modeled_dataset_id)

### 4.6 Good to Proceed to next step

You can use the `is_ready` method to check whether the Modeled Dataset is ready and you can proceed to the next step.

In [None]:
# Check whether the Modeled Dataset is ready
modeled_dataset_handler.is_ready(modeled_dataset_id)


### 4.7 Create Modeled Dataset Upload

Use the `create_dataset_upload` method to generate a new modeled dataset Upload in AMC.

In [None]:

# Template: Create Modeled Dataset Upload
try:
    modeledDatasetUploadJobId = modeled_dataset_handler.create_dataset_upload(
        modeled_dataset_id="string",
        inference_job_ids="string",  # or pass trained_model_ids="string" instead
        update_strategy="FULL_REPLACE",  # one of "FULL_REPLACE", "ADDITIVE"
    )['modeledDatasetUploadJobId']
    print("Modeled Dataset Upload ID:", modeledDatasetUploadJobId)

except Exception as e:
    print(f"An error occurred while creating the modeled dataset upload: {e}")



### 4.8 Status of Modeled Dataset Upload

You can use the `get_dataset_upload` method to get details about the status of Modeled Dataset Upload.

In [None]:
modeled_dataset_handler.get_dataset_upload(modeledDatasetUploadJobId)

### 4.9 List All Modeled Dataset Uploads

You can use the `list_all_dataset_uploads` method to get the details and status of all Modeled Dataset Uploads.

In [None]:
# To list all modeled dataset uploads
modeled_dataset_handler.list_all_dataset_uploads()

In [None]:
# To list all active modeled dataset uploads
modeled_dataset_handler.list_all_dataset_uploads(status='ACTIVE')

### 4.10 List Details of Most Recent Modeled Dataset Uploads

You can use the `list_most_recent_dataset_uploads` method to get the details and status of the top x most recent Modeled Dataset Upload Jobs.

In [None]:
# To list top x modeled dataset uploads
modeled_dataset_handler.list_most_recent_dataset_uploads(top=3)

In [None]:
# To list top x active modeled dataset uploads
modeled_dataset_handler.list_most_recent_dataset_uploads(top=3, status='ACTIVE')

### 4.11 Good to Proceed to next step

You can use the `is_ready` method to check whether the Modeled Dataset Upload is complete and you can proceed to the next step.

In [None]:
# Check whether the Modeled Dataset Upload is complete
modeled_dataset_handler.is_ready(modeledDatasetUploadJobId, type='Upload')