# Save to S3 with a SageMaker Processing Job

<div class="alert alert-info"> 💡 <strong> Quick Start </strong>
To save your processed data to S3, select the Run menu above and click <strong>Run all cells</strong>. 
<strong><a style="color: #0397a7 " href="#Job-Status-&-S3-Output-Location">
    <u>View the status of the export job and the output S3 location</u></a>.
</strong>
</div>


This notebook executes your Data Wrangler Flow `gsml-nyc-taxi-full-etl-test-3-custompyspark.flow` on the entire dataset using a SageMaker 
Processing Job and will save the processed data to S3.

This notebook saves data from the step `Custom Code`. To save from a different step, go to Data Wrangler 
to select a new step to export. 

---

## Contents

1. [Inputs and Outputs](#Inputs-and-Outputs)
1. [Run Processing Job](#Run-Processing-Job)
   1. [Job Configurations](#Job-Configurations)
   1. [Create Processing Job](#Create-Processing-Job)
   1. [Job Status & S3 Output Location](#Job-Status-&-S3-Output-Location)
1. [Optional Next Steps](#(Optional)Next-Steps)
    1. [Load Processed Data into Pandas](#(Optional)-Load-Processed-Data-into-Pandas)
    1. [Train a model with SageMaker](#(Optional)Train-a-model-with-SageMaker)
---

# Inputs and Outputs

The settings below configure the inputs and outputs for the flow export.

<div class="alert alert-info"> 💡 <strong> Configurable Settings </strong>

In <b>Input - Source</b> you can configure the data sources that Data Wrangler uses as input:

1. For S3 sources, configure the source attribute that points to the input S3 prefixes
2. For all other sources, configure attributes such as query_string and database in the source's 
<b>DatasetDefinition</b> object.

If you modify the inputs, the provided data must have the same schema and format as the data used in the Flow. 
Re-execute the cells in this section after changing the settings of any data source.

Parametrized data sources are ignored when creating ProcessingInputs; the job reads them directly from the source.
Network isolation is not supported for parametrized data sources.
</div>

In [2]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.dataset_definition.inputs import AthenaDatasetDefinition, DatasetDefinition, RedshiftDatasetDefinition

data_sources = []

## Input - S3 Source: ride-info

In [3]:
data_sources.append(ProcessingInput(
    source="s3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year/ride-info/", # You can override this to point to another dataset on S3
    destination="/opt/ml/processing/ride-info",
    input_name="ride-info",
    s3_data_type="S3Prefix",
    s3_input_mode="File",
    s3_data_distribution_type="FullyReplicated"
))

## Input - S3 Source: ride-fare

In [4]:
data_sources.append(ProcessingInput(
    source="s3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year/ride-fare/", # You can override this to point to another dataset on S3
    destination="/opt/ml/processing/ride-fare",
    input_name="ride-fare",
    s3_data_type="S3Prefix",
    s3_input_mode="File",
    s3_data_distribution_type="FullyReplicated"
))

## Output: S3 settings

<div class="alert alert-info"> 💡 <strong> Configurable Settings </strong>

1. <b>bucket</b>: the S3 bucket where Data Wrangler saves the output. The default bucket from 
the SageMaker notebook session is used unless you override it. 
2. <b>flow_export_id</b>: a randomly generated export ID. The export ID must be unique so that the results do not 
conflict with other flow exports.
3. <b>s3_output_prefix</b>: the directory name in your bucket where your data will be saved.
</div>

In [5]:
import time
import uuid
import boto3
import sagemaker

# Sagemaker session
sess = sagemaker.Session()

region = boto3.Session().region_name

# You can configure this with your own bucket name, e.g.
# bucket = "my-bucket"
bucket = sess.default_bucket()
print(f"Data Wrangler export storage bucket: {bucket}")

# unique flow export ID
# flow_export_id = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"
flow_export_id = f"{time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"
flow_export_name = f"flow-{flow_export_id}"

Data Wrangler export storage bucket: sagemaker-us-east-1-079002598131


Below are the inputs required by the SageMaker Python SDK to launch a processing job.

In [6]:
# # Output name is auto-generated from the select node's ID + output name from the flow file.
# # THIS IS THE DEFAULT GENERATE BY DATAWRANGLER
# # KHOI: this is only for one of the Transform node, Creating training set
# output_name = "842a9df0-a299-4625-8f8a-ffca58929650.default"

# s3_folder = 'gsml-nyc-taxi-full-etl-ml-test-3-custompyspark-export-s3-via-notebook'
# s3_output_prefix = f"{s3_folder}/export-{flow_export_name}/output"
# s3_output_base_path = f"s3://{bucket}/{s3_output_prefix}"
# print(f"Processing output base path: {s3_output_base_path}\nThe final output location will contain additional subdirectories.")

# processing_job_output = ProcessingOutput(
#     output_name=output_name,
#     source="/opt/ml/processing/output",
#     destination=s3_output_base_path,
#     s3_upload_mode="EndOfJob"
# )

In [7]:
#! KHOI'S MODIFICATION:

# Output name is auto-generated from the select node's ID + output name from the flow file.
s3_folder = 'gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook'
s3_output_prefix = f"{s3_folder}/export-{flow_export_name}/output"
s3_output_base_path = f"s3://{bucket}/{s3_output_prefix}"

# IMPORTANT: update these values to match the `node_id` of each transform you want to export.
# To find the IDs, open the flow file in the Flow Editor (right click on flow file and Open With Editor).
splitting_data_mapping = {
    'training': '842a9df0-a299-4625-8f8a-ffca58929650',
    'validation': '052787ea-75d7-4079-be2a-3d436a45b0c8', 
}

processing_job_outputs = []
for set_name, transform_id in splitting_data_mapping.items():
    output_name = f'{transform_id}.default'
    s3_output_prefix_set = f"{s3_output_prefix}/{set_name}"
    s3_output_base_path_set = f"s3://{bucket}/{s3_output_prefix_set}"
    print(f"{set_name.title()} Processing output base path: {s3_output_base_path_set}\nThe final output location will contain additional subdirectories.")

    processing_job_outputs.append(ProcessingOutput(output_name=output_name,
                                                   source=f"/opt/ml/processing/output/{set_name}",
                                                   destination=s3_output_base_path_set,
                                                   s3_upload_mode="EndOfJob"))

Training Processing output base path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/training
The final output location will contain additional subdirectories.
Validation Processing output base path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/validation
The final output location will contain additional subdirectories.


## Upload Flow to S3

To use the Data Wrangler flow as an input to the processing job, first upload your flow file to Amazon S3.

In [8]:
import os
import json
import boto3

# name of the flow file which should exist in the current notebook working directory
flow_file_name = "gsml-nyc-taxi-full-etl-test-3-custompyspark.flow"

# Load .flow file from current notebook working directory 
!echo "Loading flow file from current notebook working directory: $PWD"

with open(flow_file_name) as f:
    flow = json.load(f)

# Upload flow to S3
s3_client = boto3.client("s3")
flow_file_prefix_and_name = f"{s3_folder}/data_wrangler_flows/{flow_export_name}.flow"
s3_client.upload_file(flow_file_name, bucket, flow_file_prefix_and_name, ExtraArgs={"ServerSideEncryption": "aws:kms"})

flow_s3_uri = f"s3://{bucket}/{flow_file_prefix_and_name}"

print(f"Data Wrangler flow {flow_file_name} uploaded to {flow_s3_uri}")

Loading flow file from current notebook working directory: /root/data-science-on-aws.xgboost
Data Wrangler flow gsml-nyc-taxi-full-etl-test-3-custompyspark.flow uploaded to s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/data_wrangler_flows/flow-2023-03-02-03-32-10-53926e35.flow


The Data Wrangler flow is also provided to the Processing Job as an input source, which is configured below.

In [9]:
## Input - Flow: gsml-nyc-taxi-full-etl-test-3-custompyspark.flow
flow_input = ProcessingInput(
    source=flow_s3_uri,
    destination="/opt/ml/processing/flow",
    input_name="flow",
    s3_data_type="S3Prefix",
    s3_input_mode="File",
    s3_data_distribution_type="FullyReplicated"
)
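
The `node_id` values used in `splitting_data_mapping` can also be read from the flow file itself instead of the Flow Editor. The sketch below assumes that the parsed `.flow` JSON has a top-level `nodes` list whose entries carry a `node_id`, a `type`, and an `outputs` list with one `name` per output. That layout is an assumption about the flow format, and `list_flow_outputs` and `sample_flow` are hypothetical names used only for illustration.

```python
from typing import Dict, List


def list_flow_outputs(flow: Dict) -> List[Dict]:
    """Return one record per node output found in a parsed .flow document."""
    records = []
    for node in flow.get("nodes", []):
        for output in node.get("outputs", []):
            records.append({
                "node_id": node["node_id"],
                "type": node.get("type"),
                # Same "<node_id>.<output name>" convention as output_name above.
                "output_name": f"{node['node_id']}.{output['name']}",
            })
    return records


# Hypothetical, heavily trimmed flow document (assumed layout, not a real export).
sample_flow = {
    "nodes": [
        {"node_id": "842a9df0-a299-4625-8f8a-ffca58929650",
         "type": "TRANSFORM", "outputs": [{"name": "default"}]},
        {"node_id": "052787ea-75d7-4079-be2a-3d436a45b0c8",
         "type": "TRANSFORM", "outputs": [{"name": "default"}]},
    ]
}

for record in list_flow_outputs(sample_flow):
    print(record["type"], record["output_name"])
```

In this notebook the same helper could be applied to the `flow` dictionary loaded earlier, for example `list_flow_outputs(flow)`, to cross-check the IDs before launching the job.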

# Run Processing Job 
## Job Configurations

<div class="alert alert-info"> 💡 <strong> Configurable Settings </strong>

You can configure the following settings for Processing Jobs. If you change any configuration, 
re-execute this cell and all cells below it by selecting the Run menu above and clicking 
<b>Run Selected Cells and All Below</b>.

1. IAM role for executing the processing job. 
2. A unique name of the processing job. Give a unique name every time you re-execute processing jobs
3. Data Wrangler Container URL.
4. Instance count, instance type and storage volume size in GB.
5. Content type for each output. Data Wrangler supports CSV as default and Parquet.
6. Network Isolation settings
7. KMS key to encrypt output data
</div>

In [10]:
from sagemaker import image_uris

# IAM role for executing the processing job.
iam_role = sagemaker.get_execution_role()

# Unique processing job name. Give a unique name every time you re-execute processing jobs.
processing_job_name = f"data-wrangler-flow-processing-{flow_export_id}"

# Data Wrangler Container URL.
container_uri = "663277389841.dkr.ecr.us-east-1.amazonaws.com/sagemaker-data-wrangler-container:2.x"
# Pinned Data Wrangler Container URL.
container_uri_pinned = "663277389841.dkr.ecr.us-east-1.amazonaws.com/sagemaker-data-wrangler-container:2.1.2"

# Processing Job Instance count and instance type.
instance_count = 6
instance_type = "ml.m5.24xlarge"

# Size in GB of the EBS volume to use for storing data during processing.
# !KHOI: must change this to at least 200GB for full dataset
volume_size_in_gb = 200


# Content type for each output. Data Wrangler supports CSV as default and Parquet.
output_content_type = "Parquet"

# Delimiter to use for the output if the output content type is CSV. Uncomment to set.
# delimiter = ","

# Compression to use for the output. Uncomment to set.
# compression = "gzip"

# Configuration for partitioning the output. Uncomment to set.
# "num_partitions" sets the number of partitions/files written in the output.
# "partition_by" sets the column names to partition the output by.
# partition_config = {
#     "num_partitions": 1,
#     "partition_by": ["column_name_1", "column_name_2"],
# }

# Network Isolation mode; default is off.
enable_network_isolation = False

# List of tags to be passed to the processing job.
user_tags = []

# Output configuration used as processing job container arguments. Only applies when writing to S3.
# Uncomment to set additional configurations.
# output_config = {
#     output_name: {
#         "content_type": output_content_type,
#         # "delimiter": delimiter,
#         # "compression": compression,
#         # "partition_config": partition_config,
#     }
# }

# KHOI: has to create separate output_config for each output:
output_configs = []
for set_name, transform_id in splitting_data_mapping.items():
    output_name = f'{transform_id}.default'
    output_configs.append({
        output_name: {
            "content_type": output_content_type,
            # "delimiter": delimiter,
            # "compression": compression,
            # "partition_config": partition_config,
        }
    })

# Refit configuration determines whether Data Wrangler refits the trainable parameters on the entire dataset. 
# When True, the processing job relearns the parameters and outputs a new flow file.
# You can specify the name of the output flow file under 'output_flow'.
# Note: There are length constraints on the container arguments (max 256 characters).
refit_trained_params = {
    "refit": False,
    "output_flow": f"data-wrangler-flow-processing-{flow_export_id}.flow"
}

# KMS key for per object encryption; default is None.
kms_key = None
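
Because each output gets its own `--output-config` argument and the note above mentions a 256-character limit on container arguments, it can help to build and check the argument strings before starting the job. The following is a minimal sketch under that assumption; `build_output_arguments` and `MAX_ARGUMENT_LENGTH` are hypothetical names, not part of the SageMaker SDK.

```python
import json
from typing import Dict, List, Optional

# Limit taken from the note in the cell above; treated here as an assumption.
MAX_ARGUMENT_LENGTH = 256


def build_output_arguments(
    output_names: List[str],
    content_type: str = "Parquet",
    partition_config: Optional[Dict] = None,
) -> List[str]:
    """Build one --output-config argument per output and check its length."""
    arguments = []
    for name in output_names:
        settings = {"content_type": content_type}
        if partition_config is not None:
            settings["partition_config"] = partition_config
        argument = f"--output-config '{json.dumps({name: settings})}'"
        if len(argument) > MAX_ARGUMENT_LENGTH:
            raise ValueError(
                f"Argument for {name} is {len(argument)} characters, "
                f"over the assumed limit of {MAX_ARGUMENT_LENGTH}."
            )
        arguments.append(argument)
    return arguments


example_arguments = build_output_arguments([
    "842a9df0-a299-4625-8f8a-ffca58929650.default",
    "052787ea-75d7-4079-be2a-3d436a45b0c8.default",
])
for argument in example_arguments:
    print(len(argument), argument)
```

Failing early in the notebook is cheaper than discovering an over-long argument when the processing job is rejected.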

### (Optional) Configure Spark Cluster Driver Memory

In [11]:
# The Spark memory configuration. Change to specify the driver and executor memory in MB for the Spark cluster during processing.
driver_memory_in_mb = 2048
executor_memory_in_mb = 55742

config = json.dumps({
    "Classification": "spark-defaults",
    "Properties": {
        "spark.driver.memory": f"{driver_memory_in_mb}m",
        "spark.executor.memory": f"{executor_memory_in_mb}m"
    }
})

config_file = f"config-{flow_export_id}.json"
with open(config_file, "w") as f:
    f.write(config)

config_s3_path = f"{s3_folder}/spark_configuration/{processing_job_name}/configuration.json"
config_s3_uri = f"s3://{bucket}/{config_s3_path}"
s3_client.upload_file(config_file, bucket, config_s3_path, ExtraArgs={"ServerSideEncryption": "aws:kms"})
print(f"Spark Config file uploaded to {config_s3_uri}")
os.remove(config_file)

# Provide the Spark config file to the processing job to apply the memory settings above. Uncomment to set.
# data_sources.append(ProcessingInput(
#     source=config_s3_uri,
#     destination="/opt/ml/processing/input/conf",
#     input_name="spark-config",
#     s3_data_type="S3Prefix",
#     s3_input_mode="File",
#     s3_data_distribution_type="FullyReplicated"
# ))

Spark Config file uploaded to s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/spark_configuration/data-wrangler-flow-processing-2023-03-02-03-32-10-53926e35/configuration.json


## Create Processing Job

To launch a Processing Job, use the SageMaker Python SDK to create a `Processor` object.

In [12]:
# Setup processing job network configuration
from sagemaker.network import NetworkConfig

network_config = NetworkConfig(
        enable_network_isolation=enable_network_isolation,
        security_group_ids=None,
        subnets=None
    )

In [13]:
from sagemaker.processing import Processor

processor = Processor(
    role=iam_role,
    image_uri=container_uri,
    instance_count=instance_count,
    instance_type=instance_type,
    volume_size_in_gb=volume_size_in_gb,
    network_config=network_config,
    sagemaker_session=sess,
    output_kms_key=kms_key,
    tags=user_tags
)

# # Start Job
# processor.run(
#     inputs=[flow_input] + data_sources, 
#     outputs=[processing_job_output],
#     arguments=[f"--output-config '{json.dumps(output_config)}'"] + [f"--refit-trained-params '{json.dumps(refit_trained_params)}'"],
#     wait=False,
#     logs=False,
#     job_name=processing_job_name
# )

## Job Status & S3 Output Location

The cells below start the processing job and wait for it to finish. If it finishes successfully, the raw parameters used by the 
Processing Job are printed.
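
Waiting for a job amounts to polling its status until it reaches a terminal state. The sketch below illustrates that idea only; it is not how the SDK implements `wait_for_processing_job`. `wait_for_status` is a hypothetical helper, and the `describe` callable is assumed to return a dictionary with a `ProcessingJobStatus` key, as the SageMaker `DescribeProcessingJob` response does.

```python
import time
from typing import Callable, Dict

# Statuses after which a processing job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_status(
    describe: Callable[[], Dict],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> Dict:
    """Poll describe() until the job is terminal; raise unless it completed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = describe()
        status = description["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            if status != "Completed":
                reason = description.get("FailureReason", "no reason given")
                raise RuntimeError(f"Job ended with status {status}: {reason}")
            return description
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status} after {timeout_seconds} s")
        time.sleep(poll_seconds)


# Stand-in for a real describe call, so the sketch runs without AWS access.
fake_responses = iter([
    {"ProcessingJobStatus": "InProgress"},
    {"ProcessingJobStatus": "InProgress"},
    {"ProcessingJobStatus": "Completed"},
])
result = wait_for_status(lambda: next(fake_responses), poll_seconds=0)
print(result["ProcessingJobStatus"])
```

Against a real job, `describe` could wrap `boto3.client("sagemaker").describe_processing_job(ProcessingJobName=processing_job_name)`.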

To prevent data of different processing jobs and different output nodes from being overwritten or combined, 
Data Wrangler uses the name of the processing job and the name of the output to write the output.
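
The path layout described above can be sketched as a small helper. It mirrors the string formatting used later in this notebook, where the output name's dot is replaced by a slash; `final_output_path` is a hypothetical name, and the bucket, folder, and job name in the example are placeholders.

```python
def final_output_path(base_path: str, job_name: str, output_name: str) -> str:
    """Compose <base path>/<job name>/<node id>/<output> for one flow output."""
    # "node-id.default" becomes "node-id/default" in the S3 key.
    return f"{base_path.rstrip('/')}/{job_name}/{output_name.replace('.', '/')}"


example_path = final_output_path(
    "s3://example-bucket/example-folder/output/training",  # placeholder base path
    "data-wrangler-flow-processing-example",               # placeholder job name
    "842a9df0-a299-4625-8f8a-ffca58929650.default",
)
print(example_path)
```

Because the job name is part of the path, two runs with different job names write to different prefixes even when they share the same base path.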

In [14]:
import sys
print(sys.version)

3.7.10 (default, Jun  4 2021, 14:48:32) 
[GCC 7.5.0]


In [15]:
print(f'S3 base path: {s3_output_base_path}')
print(f'Processing job name: {processing_job_name}')

S3 base path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output
Processing job name: data-wrangler-flow-processing-2023-03-02-03-32-10-53926e35


In [16]:
%%time

# KHOI:
output_args = [f"--output-config '{json.dumps(output_config)}'" for output_config in output_configs]

# Start Job
processor.run(
    inputs=[flow_input] + data_sources, 
    outputs=processing_job_outputs,
    arguments=output_args + [f"--refit-trained-params '{json.dumps(refit_trained_params)}'"],
    wait=False,
    logs=False,
    job_name=processing_job_name
)

# Status
training_set_specific_path = f'{list(splitting_data_mapping.values())[0]}.default'
s3_job_results_training_path = f"{s3_output_base_path}/{list(splitting_data_mapping.keys())[0]}/{processing_job_name}/{training_set_specific_path.replace('.', '/')}"
print(f"Job results (training set) are saved to S3 path: {s3_job_results_training_path}")
validation_set_specific_path = f'{list(splitting_data_mapping.values())[1]}.default'
s3_job_results_validation_path = f"{s3_output_base_path}/{list(splitting_data_mapping.keys())[1]}/{processing_job_name}/{validation_set_specific_path.replace('.', '/')}"
print(f"Job results (validation set) are saved to S3 path: {s3_job_results_validation_path}")

job_result = sess.wait_for_processing_job(processing_job_name)
job_result

INFO:sagemaker:Creating processing-job with name data-wrangler-flow-processing-2023-03-02-03-32-10-53926e35


Job results (training set) are saved to S3 path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/training/data-wrangler-flow-processing-2023-03-02-03-32-10-53926e35/842a9df0-a299-4625-8f8a-ffca58929650/default
Job results (validation set) are saved to S3 path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/validation/data-wrangler-flow-processing-2023-03-02-03-32-10-53926e35/052787ea-75d7-4079-be2a-3d436a45b0c8/default
............................................................................................................................................................................!
CPU times: user 749 ms, sys: 92.7 ms, total: 842 ms
Wall time: 14min 24s


{'ProcessingInputs': [{'InputName': 'flow',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/data_wrangler_flows/flow-2023-03-02-03-32-10-53926e35.flow',
    'LocalPath': '/opt/ml/processing/flow',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}},
  {'InputName': 'ride-info',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year/ride-info/',
    'LocalPath': '/opt/ml/processing/ride-info',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}},
  {'InputName': 'ride-fare',
   'AppManaged': False,
   'S3Input': {'S3Uri': 's3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year/ride-fare/',
    'LocalPath': '/opt/ml/processing/ride-fare',
    'S3DataType': 

## (Optional)Next Steps

Now that data is available on S3 you can use other SageMaker components that take S3 URIs as input such as 
SageMaker Training, Built-in Algorithms, etc. Similarly you can load the dataset into a Pandas dataframe 
in this notebook for further inspection and work. The examples below show how to do both of these steps.

The optional steps below run only when `run_optional_steps` is set to `True`. Set it to `False` to stop 
the notebook here when using Run All Cells.

In [17]:
run_optional_steps = True

In [18]:
# This will stop the below cells from executing if "Run All Cells" was used on the notebook.
if not run_optional_steps:
    raise SystemExit("Stop here. Do not automatically execute optional steps.")

### (Optional) Load Processed Data into Pandas

We use the [AWS SDK for pandas library](https://github.com/awslabs/aws-sdk-pandas) to load the exported 
dataset into a pandas DataFrame and preview the first 10,000 rows.

To turn on automated visualizations and data insights for your pandas DataFrame, import the sagemaker_datawrangler library.

In [19]:
!pip install -q awswrangler pandas
import awswrangler as wr

# Import sagemaker_datawrangler to show visualizations and automated data
# quality insights, and export code to prepare data in a pandas data frame.
try:
    import sagemaker_datawrangler
except ImportError:
    print("sagemaker_datawrangler is not imported. Change your kernel to the Data Science 3.0 Kernel Image and try again.")
    pass

[notice] A new release of pip is available: 23.0 -> 23.0.1
[notice] To update, run: pip install --upgrade pip


In [20]:
chunksize = 10000

# KHOI: we are loading the training set here only
if output_content_type.upper() == "CSV":
    dfs = wr.s3.read_csv(s3_job_results_training_path, chunksize=chunksize)
elif output_content_type.upper() == "PARQUET":
    dfs = wr.s3.read_parquet(s3_job_results_training_path, chunked=chunksize)
else:
    # Fail fast: `dfs` would otherwise be undefined in the lines below.
    raise ValueError(f"Unexpected output content type {output_content_type}")

df = next(dfs)
df

INFO:root:{"event_type": "ganymede.initialization", "event_status": "start", "app_context": {"ganymede_version": "0.3.8", "app_metadata": {"AppType": "KernelGateway", "DomainId": "d-iik9aga3atel", "UserProfileName": "demo", "ResourceArn": "arn:aws:sagemaker:us-east-1:079002598131:app/d-iik9aga3atel/demo/KernelGateway/datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395", "ResourceName": "datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395", "AppImageVersion": "", "Region": "us-east-1", "AccountId": "079002598131"}}}
INFO:root:DataFrame size: row_count = 10000, column_count = 11.
INFO:root:Computing on the top 10000 rows of the DataFrame.
INFO:root:Toggled to the sagemaker_datawrangler view.


      total_amount      ride_id_0  passenger_count  trip_distance  \
0        26.250000  2345053812090                1           7.98   
1         9.740000  3616363168001                1           0.97   
2        20.700001  2345053812210                1           8.03   
3        13.200000  1692217947308                2           2.20   
4        12.500000  1692217947853                1           2.70   
...            ...            ...              ...            ...   
9995      6.860000  3624953193593                2           0.91   
9996     14.100000  2731599453867                3           3.46   
9997     12.740000  3624953193717                4           2.76   
9998     16.459999  1520420564183                1           2.35   
9999      8.200000  3624953194196                1           1.90   

      rate_code_id  payment_type  fare_amount  extra  mta_tax  tip_amount  \
0                1             1    20.100000    0.5      0.5        5.15   
1                

INFO:root:{"event_type": "ganymede.initialization", "event_status": "success", "app_context": {"ganymede_version": "0.3.8", "app_metadata": {"AppType": "KernelGateway", "DomainId": "d-iik9aga3atel", "UserProfileName": "demo", "ResourceArn": "arn:aws:sagemaker:us-east-1:079002598131:app/d-iik9aga3atel/demo/KernelGateway/datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395", "ResourceName": "datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395", "AppImageVersion": "", "Region": "us-east-1", "AccountId": "079002598131"}}, "metadata": {"latency": 0.14584922790527344}}


## (Optional)Train a model with SageMaker
Now that the data has been processed, you may want to train a model on it. The following example uses 
a popular algorithm, XGBoost. For more information on algorithms available in 
SageMaker, see [Getting Started with SageMaker Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). 
Note that the XGBoost objective (for example binary classification, regression, or multiclass classification), 
the other hyperparameters, and the content_type used below may not suit your output data and may need to change 
before you can train a proper model. Furthermore, for CSV training, the algorithm assumes that the target 
variable is in the first column. For more information on SageMaker XGBoost, 
see https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html.
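
For CSV output, the target-first requirement can be met by reordering columns before writing. The sketch below is one way to do that with the standard library; it also omits the header row, which SageMaker XGBoost does not expect in CSV training data. `target_first_csv` is a hypothetical helper, and treating `total_amount` as the target is an assumption made for illustration.

```python
import csv
import io
from typing import Dict, Iterable


def target_first_csv(rows: Iterable[Dict], target: str) -> str:
    """Serialize rows as header-less CSV with the target column first."""
    rows = list(rows)
    if not rows:
        return ""
    # Keep the remaining columns in their original order.
    feature_names = [name for name in rows[0] if name != target]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[target]] + [row[name] for name in feature_names])
    return buffer.getvalue()


# Made-up sample rows using column names from the taxi dataset.
sample_rows = [
    {"trip_distance": 7.98, "passenger_count": 1, "total_amount": 26.25},
    {"trip_distance": 0.97, "passenger_count": 1, "total_amount": 9.74},
]
csv_text = target_first_csv(sample_rows, target="total_amount")
print(csv_text)
```

This notebook exports Parquet, so the reordering matters only if `output_content_type` is switched to CSV.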


### Set Training Data and Validation Data paths
We set the training and validation input data paths from the output of the Data Wrangler processing job.

In [21]:
s3_training_input_path = s3_job_results_training_path
print(f"Training input data path: {s3_training_input_path}")

s3_validation_input_path = s3_job_results_validation_path
print(f"Validation input data path: {s3_validation_input_path}")

Training input data path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/training/data-wrangler-flow-processing-2023-03-02-03-32-10-53926e35/842a9df0-a299-4625-8f8a-ffca58929650/default
Validation input data path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/validation/data-wrangler-flow-processing-2023-03-02-03-32-10-53926e35/052787ea-75d7-4079-be2a-3d436a45b0c8/default


### Configure the algorithm and training job

The cell below sets the Training Job hyperparameters and input channels. For more information on XGBoost hyperparameters, 
see https://xgboost.readthedocs.io/en/latest/parameter.html.

In [22]:
region = boto3.Session().region_name
# KHOI: switched to get newer XGBoost image: default was 1.2-1
container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

# KHOI: set an output path where the trained model will be saved
s3_training_job_output_prefix = f"{s3_folder}/built-in-xgboost"
training_job_output_path = 's3://{}/{}/{}/output'.format(bucket, s3_training_job_output_prefix, 'nyc-taxi-full-built-in-xgboost')

hyperparameters = {
    "max_depth":"5",
    "objective": "reg:squarederror",
    "num_round": "10",
}
train_content_type = (
    "application/x-parquet" if output_content_type.upper() == "PARQUET"
    else "text/csv"
)
train_input = sagemaker.inputs.TrainingInput(
    s3_data=s3_training_input_path,
    content_type=train_content_type,
    distribution='ShardedByS3Key',  # testing
)

# KHOI: add validation input
validation_content_type = (
    "application/x-parquet" if output_content_type.upper() == "PARQUET"
    else "text/csv"
)
validation_input = sagemaker.inputs.TrainingInput(
    s3_data=s3_validation_input_path,
    content_type=validation_content_type,
    distribution='ShardedByS3Key',   # testing
)

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


### Start the Training Job

The Training Job is configured with the SageMaker Python SDK `Estimator`, which is then fit on 
the training and validation data from the Processing Job that was run earlier.

In [23]:
# %%time

# # KHOI: training job without ShardedByS3Key
# estimator = sagemaker.estimator.Estimator(
#     container,
#     iam_role,
#     hyperparameters=hyperparameters,
#     instance_count=6,
#     instance_type="ml.m5.24xlarge",
#     volume_size=200,    # KHOI: must change this or will get diskspace error, default is 30GB
# )
# # estimator.fit({"train": train_input})
# # KHOI: add validation input
# estimator.fit({"train": train_input, 'validation': validation_input})

In [24]:
!{sys.executable} -m pip install --upgrade sagemaker

Collecting sagemaker
  Downloading sagemaker-2.135.1.tar.gz (673 kB)
     674.0/674.0 kB 10.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting importlib-metadata<5.0,>=1.4.0
  Using cached importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.135.1-py2.py3-none-any.whl size=911593 sha256=4c35e553a3e7382d596ca919780225357acddd1c42235d7c0d936a4f1c3a2ae9
  Stored in directory: /root/.cache/pip/wheels/fe/8d/13/4676f5847fd7b702de26744c89720edb2c1e94ff829dab21ed
Successfully built sagemaker
Installing collected packages: importlib-metadata, sagemaker
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 6.0.0
    Uninstalling importlib-metadata-6.0.0:
      Successfully unin

In [25]:
import time
import sagemaker

experiment_name = f"gsml-nyc-taxi-full-built-in-xgboost-{time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())}"
print(experiment_name)

gsml-nyc-taxi-full-built-in-xgboost-2023-03-02-03-46-46


In [26]:
run_name = f"experiment-run-{time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())}"
print(run_name)

experiment-run-2023-03-02-03-46-46


In [30]:
# KHOI: rerun training job with "ShardedByS3Key"
estimator = sagemaker.estimator.Estimator(
    container,
    iam_role,
    hyperparameters=hyperparameters,
    instance_count=6,
    instance_type="ml.m5.24xlarge",
    # volume_size=200,    # KHOI: must change this or will get diskspace error, default is 30GB,
)


In [None]:
%%time

# KHOI: add validation input
with sagemaker.experiments.load_run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sagemaker.session.Session()) as run:
# with sagemaker.experiments.Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sagemaker.session.Session()) as run:
# with sagemaker.experiments.Run() as run:
    # run.experiment_config = experiment_config
    # run.log_parameter(
        # {"num_train_samples": len(train_input), "num_validation_samples": len(validation_input)},
        
    # )
    # run.log_parameters()

    # run.log_metric(name='Khoi', value='0.1')

    # run.log_metric(name=metric_type+":loss", value=loss, step=epoch)
    # run.log_metric(name=metric_type+":accuracy", value=accuracy, step=epoch)
    # run.log_confusion_matrix(target, pred, "Confusion-Matrix-Test-Data")
    # Log a metric over the course of a run at each epoch
    # run.log_metric(name="test:loss", value=loss, step=epoch)

    # Define values for the parameters to log
    # run.log_parameter("batch_size", batch_size)
    # run.log_parameter("epochs", epochs)
    # run.log_parameter("dropout", 0.5)
    
    estimator.fit({"train": train_input, 'validation': validation_input},
                  # experiment_config=experiment_config
                  # experiment_config=run.experiment_config
                  
                 )


#     score = estimator.evaluate(x_test, y_test, verbose=0)
#     print("Test loss:", score[0])
#     print("Test accuracy:", score[1])
    
#     # Define metrics to log
#     run.log_metric(name = "Final Test Loss", value = score[0])
#     run.log_metric(name = "Final Test Loss", value = score[1])

INFO:sagemaker.experiments.run:The run (experiment-run-2023-03-02-03-46-46) under experiment (gsml-nyc-taxi-full-built-in-xgboost-2023-03-02-03-46-46) already exists. Loading it. Note: sagemaker.experiments.load_run is recommended to use when the desired run already exists.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-03-02-03-58-02-063


2023-03-02 03:58:02 Starting - Starting the training job...
2023-03-02 03:58:18 Starting - Preparing the instances for training......
2023-03-02 03:59:21 Downloading - Downloading input data.........
2023-03-02 04:01:03 Training - Training image download completed. Training in progress..
[2023-03-02 04:01:04.378 ip-10-0-237-228.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-03-02:04:01:04:INFO] Imported framework sagemaker_xgboost_container.training
[2023-03-02:04:01:04:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2023-03-02:04:01:04:INFO] No GPUs detected (normal if no gpus installed)
[2023-03-02:04:01:04:INFO] Running XGBoost Sagemaker in algorithm mode
[2023-03-02:04:01:04:INFO] Determined 0 GPU(s) available on the instance.
[2023-03-02:04:01:04:INFO] files path: /opt/ml/input/data/train
[2023-03-02 04:01:04.394 ip-10-0

Now that you have a trained model there are a number of different things you can do. 
For more details on training with SageMaker, please see 
https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html.

In [None]:
# No tuning job is run in this notebook, so take the name of the
# training job started by `estimator.fit()` above.
training_job_name = estimator.latest_training_job.name

In [None]:
%matplotlib inline

# Ref: https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html
# Uses the `training_job_name` of the job started by `estimator.fit()` above
from sagemaker.analytics import TrainingJobAnalytics

metric_name = "validation:rmse"

metrics_dataframe = TrainingJobAnalytics(
    training_job_name=training_job_name, metric_names=[metric_name]
).dataframe()
# DataFrame.plot returns a matplotlib Axes object
ax = metrics_dataframe.plot(
    kind="line", figsize=(12, 5), x="timestamp", y="value", style="b.", legend=False
)
ax.set_ylabel(metric_name);



In [None]:
xgboost_nyctaxi_full_experiment = sagemaker.experiment.Experiment.create(experiment_name=experiment_name, 
                                              description="GSML test", 
                                              sagemaker_boto_client=boto3.client('sagemaker'))

trial = sagemaker.trial.Trial.create(trial_name=run_name, 
                     experiment_name=xgboost_nyctaxi_full_experiment.experiment_name,
                     sagemaker_boto_client=boto3.client('sagemaker'))

# container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker XGBoost container: {} ({})'.format(container, region))


### TESTING 2023-02-28
This section uses the standalone SageMaker Experiments SDK (`sagemaker-experiments`), which is no longer recommended: https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-additional-sdk.html


In [None]:
!{sys.executable} -m pip install sagemaker-experiments

In [None]:
from time import strftime

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

# role = sagemaker.get_execution_role()
# sm_sess = sagemaker.session.Session()



In [None]:
create_date = strftime("%Y-%m-%d-%H-%M-%S")
demo_experiment = Experiment.create(experiment_name = "DEMO-{}".format(create_date),
                                    description = "Demo experiment",
                                    tags = [{'Key': 'demo-experiments', 'Value': 'demo1'}])

demo_trial = Trial.create(trial_name = "DEMO-{}".format(create_date),
                          experiment_name = demo_experiment.experiment_name,
                          tags = [{'Key': 'demo-trials', 'Value': 'demo1'}])

In [None]:
# KHOI: rerun training job with "ShardedByS3Key"
estimator = sagemaker.estimator.Estimator(
    container,
    iam_role,
    hyperparameters=hyperparameters,
    instance_count=6,
    instance_type="ml.m5.24xlarge",
    volume_size=200,    # KHOI: must change this or will get diskspace error, default is 30GB,
)

estimator.fit({"train": train_input, 'validation': validation_input},
              # experiment_config=experiment_config
              experiment_config={
                  # "ExperimentName"
                  "TrialName" : demo_trial.trial_name,
                  "TrialComponentDisplayName" : "TrainingJob",
              })