# Save to S3 with a SageMaker Processing Job

<div class="alert alert-info"> 💡 <strong> Quick Start </strong>
To save your processed data to S3, select the Run menu above and click <strong>Run all cells</strong>. 
<strong><a style="color: #0397a7 " href="#Job-Status-&-S3-Output-Location">
    <u>View the status of the export job and the output S3 location</u></a>.
</strong>
</div>


This notebook executes your Data Wrangler Flow `gsml-nyc-taxi-full-etl-test-3-custompyspark.flow` on the entire dataset using a SageMaker 
Processing Job and saves the processed data to S3.

This notebook saves data from the step `Custom Code`. To save from a different step, go to Data Wrangler 
to select a new step to export. 

---

## Contents

1. [Inputs and Outputs](#Inputs-and-Outputs)
1. [Run Processing Job](#Run-Processing-Job)
   1. [Job Configurations](#Job-Configurations)
   1. [Create Processing Job](#Create-Processing-Job)
   1. [Job Status & S3 Output Location](#Job-Status-&-S3-Output-Location)
1. [Optional Next Steps](#(Optional)Next-Steps)
    1. [Load Processed Data into Pandas](#(Optional)-Load-Processed-Data-into-Pandas)
    1. [Train a model with SageMaker](#Train-a-model-with-SageMaker)

---

# Inputs and Outputs

The settings below configure the inputs and outputs for the flow export.

<div class="alert alert-info"> 💡 <strong> Configurable Settings </strong>

In <b>Input - Source</b> you can configure the data sources that Data Wrangler will use as input.

1. For S3 sources, configure the <code>source</code> attribute that points to the input S3 prefixes.
2. For all other sources, configure attributes such as <code>query_string</code> and <code>database</code> in the source's 
<b>DatasetDefinition</b> object.

If you modify the inputs, the provided data must have the same schema and format as the data used in the Flow. 
Re-execute the cells in this section after modifying the settings of any data source.

Parametrized data sources are ignored when creating ProcessingInputs; the job reads directly from the source instead.
Network isolation is not supported for parametrized data sources.
</div>

In [2]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.dataset_definition.inputs import AthenaDatasetDefinition, DatasetDefinition, RedshiftDatasetDefinition

data_sources = []

## Input - S3 Source: ride-info

In [3]:
data_sources.append(ProcessingInput(
    source="s3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year/ride-info/", # You can override this to point to another dataset on S3
    destination="/opt/ml/processing/ride-info",
    input_name="ride-info",
    s3_data_type="S3Prefix",
    s3_data_distribution_type="FullyReplicated"
))

## Input - S3 Source: ride-fare

In [4]:
data_sources.append(ProcessingInput(
    source="s3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year/ride-fare/", # You can override this to point to another dataset on S3
    destination="/opt/ml/processing/ride-fare",
    input_name="ride-fare",
    s3_data_type="S3Prefix",
    s3_data_distribution_type="FullyReplicated"
))

## Output: S3 settings

<div class="alert alert-info"> 💡 <strong> Configurable Settings </strong>

1. <b>bucket</b>: The S3 bucket where Data Wrangler will save the output. By default, the bucket from 
the SageMaker notebook session is used. 
2. <b>flow_export_id</b>: A randomly generated export ID. The export ID must be unique to ensure the results do not 
conflict with other flow exports.
3. <b>s3_output_prefix</b>: The directory name in your bucket where your data will be saved.
</div>

In [5]:
import time
import uuid
import boto3
import sagemaker

# SageMaker session
sess = sagemaker.Session()

region = boto3.Session().region_name

# You can configure this with your own bucket name, e.g.
# bucket = "my-bucket"
bucket = sess.default_bucket()
print(f"Data Wrangler export storage bucket: {bucket}")

# unique flow export ID
# flow_export_id = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"
flow_export_id = f"{time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"
flow_export_name = f"flow-{flow_export_id}"

Data Wrangler export storage bucket: sagemaker-us-east-1-079002598131


Below are the inputs required by the SageMaker Python SDK to launch a processing job.

In [6]:
# # Output name is auto-generated from the select node's ID + output name from the flow file.
# # THIS IS THE DEFAULT GENERATED BY DATA WRANGLER
# # KHOI: this covers only one of the Transform nodes (the one creating the training set)
# output_name = "842a9df0-a299-4625-8f8a-ffca58929650.default"

# s3_folder = 'gsml-nyc-taxi-full-etl-ml-test-3-custompyspark-export-s3-via-notebook'
# s3_output_prefix = f"{s3_folder}/export-{flow_export_name}/output"
# s3_output_base_path = f"s3://{bucket}/{s3_output_prefix}"
# print(f"Processing output base path: {s3_output_base_path}\nThe final output location will contain additional subdirectories.")

# processing_job_output = ProcessingOutput(
#     output_name=output_name,
#     source="/opt/ml/processing/output",
#     destination=s3_output_base_path,
#     s3_upload_mode="EndOfJob"
# )

In [7]:
# Output name is auto-generated from the selected node's ID + output name from the flow file.
s3_folder = 'gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook'
s3_output_prefix = f"{s3_folder}/export-{flow_export_name}/output"
s3_output_base_path = f"s3://{bucket}/{s3_output_prefix}"

# Important: set these to the `node_id` of each transformation, as shown in the Flow Editor (right-click the flow file and choose Open With > Editor)
splitting_data_mapping = {
    'training': '842a9df0-a299-4625-8f8a-ffca58929650',
    'validation': '052787ea-75d7-4079-be2a-3d436a45b0c8', 
}

processing_job_outputs = []
for set_name, transform_id in splitting_data_mapping.items():
    output_name = f'{transform_id}.default'
    s3_output_prefix_set = f"{s3_output_prefix}/{set_name}"
    s3_output_base_path_set = f"s3://{bucket}/{s3_output_prefix_set}"
    print(f"{set_name.title()} Processing output base path: {s3_output_base_path_set}\nThe final output location will contain additional subdirectories.")

    processing_job_outputs.append(ProcessingOutput(output_name=output_name,
                                                   source=f"/opt/ml/processing/output/{set_name}",
                                                   destination=s3_output_base_path_set,
                                                   s3_upload_mode="EndOfJob"))

Training Processing output base path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-05-18-35-55-5b413e79/output/training
The final output location will contain additional subdirectories.
Validation Processing output base path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-05-18-35-55-5b413e79/output/validation
The final output location will contain additional subdirectories.
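
The `node_id` values in `splitting_data_mapping` can also be read from the flow file itself instead of being copied by hand from the Flow Editor. The following is a minimal sketch, assuming the `.flow` JSON has a top-level `nodes` list whose entries carry `node_id`, `type`, and `outputs` keys. The sample flow, its short node IDs, and its operator names are hypothetical stand-ins for the real file, and `list_output_names` is an illustrative helper.

```python
import json

# Hypothetical stand-in for the contents of a .flow file. The key names
# ("nodes", "node_id", "type", "operator", "outputs") are assumptions.
sample_flow = json.loads("""
{
  "metadata": {"version": 1},
  "nodes": [
    {"node_id": "11111111-aaaa", "type": "SOURCE",
     "operator": "example.source", "outputs": [{"name": "default"}]},
    {"node_id": "22222222-bbbb", "type": "TRANSFORM",
     "operator": "example.custom_pyspark", "outputs": [{"name": "default"}]}
  ]
}
""")


def list_output_names(flow, node_type="TRANSFORM"):
    """Return '<node_id>.<output name>' for every node of the given type."""
    names = []
    for node in flow.get("nodes", []):
        if node.get("type") != node_type:
            continue
        for output in node.get("outputs", []):
            names.append(f"{node['node_id']}.{output['name']}")
    return names


print(list_output_names(sample_flow))
```

With a real flow, the dictionary returned by `json.load` on the flow file would take the place of `sample_flow`.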


## Upload Flow to S3

To use the Data Wrangler flow as an input to the processing job, first upload your flow file to Amazon S3.

In [8]:
import os
import json
import boto3

# name of the flow file which should exist in the current notebook working directory
flow_file_name = "gsml-nyc-taxi-full-etl-test-3-custompyspark.flow"

# Load .flow file from current notebook working directory 
!echo "Loading flow file from current notebook working directory: $PWD"

with open(flow_file_name) as f:
    flow = json.load(f)

# Upload flow to S3
s3_client = boto3.client("s3")
flow_file_prefix_and_name = f"{s3_folder}/data_wrangler_flows/{flow_export_name}.flow"
s3_client.upload_file(flow_file_name, bucket, flow_file_prefix_and_name, ExtraArgs={"ServerSideEncryption": "aws:kms"})

flow_s3_uri = f"s3://{bucket}/{flow_file_prefix_and_name}"

print(f"Data Wrangler flow {flow_file_name} uploaded to {flow_s3_uri}")

Loading flow file from current notebook working directory: /root/data-science-on-aws.xgboost
Data Wrangler flow gsml-nyc-taxi-full-etl-test-3-custompyspark.flow uploaded to s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/data_wrangler_flows/flow-2023-03-05-18-35-55-5b413e79.flow


The Data Wrangler Flow is also provided to the Processing Job as an input source which we configure below.

In [9]:
## Input - Flow: gsml-nyc-taxi-full-etl-test-3-custompyspark.flow
flow_input = ProcessingInput(
    source=flow_s3_uri,
    destination="/opt/ml/processing/flow",
    input_name="flow",
    s3_data_type="S3Prefix",
    s3_data_distribution_type="FullyReplicated"
)

# Run Processing Job 
## Job Configurations

<div class="alert alert-info"> 💡 <strong> Configurable Settings </strong>

You can configure the following settings for Processing Jobs. If you change any configuration, 
re-execute this cell and all cells below it by selecting the Run menu above and clicking 
<b>Run Selected Cells and All Below</b>.

1. IAM role for executing the processing job.
2. A unique name for the processing job. Use a new name every time you re-execute the processing job.
3. Data Wrangler container URL.
4. Instance count, instance type, and storage volume size in GB.
5. Content type for each output. Data Wrangler supports CSV (the default) and Parquet.
6. Network isolation settings.
7. KMS key to encrypt output data.
</div>

In [10]:
from sagemaker import image_uris

# IAM role for executing the processing job.
iam_role = sagemaker.get_execution_role()

# Unique processing job name. Give a unique name every time you re-execute processing jobs.
processing_job_name = f"data-wrangler-flow-processing-{flow_export_id}"

# Data Wrangler Container URL.
container_uri = "663277389841.dkr.ecr.us-east-1.amazonaws.com/sagemaker-data-wrangler-container:2.x"
# Pinned Data Wrangler Container URL.
container_uri_pinned = "663277389841.dkr.ecr.us-east-1.amazonaws.com/sagemaker-data-wrangler-container:2.1.2"

# Processing Job Instance count and instance type.
instance_count = 6
instance_type = "ml.m5.24xlarge"

# Size in GB of the EBS volume to use for storing data during processing.
# !KHOI: must change this to at least 200GB for full dataset
volume_size_in_gb = 200


# Content type for each output. Data Wrangler supports CSV as default and Parquet.
output_content_type = "Parquet"

# Delimiter to use for the output if the output content type is CSV. Uncomment to set.
# delimiter = ","

# Compression to use for the output. Uncomment to set.
# compression = "gzip"

# Configuration for partitioning the output. Uncomment to set.
# "num_partitions" sets the number of partitions/files written in the output.
# "partition_by" sets the column names to partition the output by.
# partition_config = {
#     "num_partitions": 1,
#     "partition_by": ["column_name_1", "column_name_2"],
# }

# Network Isolation mode; default is off.
enable_network_isolation = False

# List of tags to be passed to the processing job.
user_tags = []

# Output configuration used as processing job container arguments. Only applies when writing to S3.
# Uncomment to set additional configurations.
# output_config = {
#     output_name: {
#         "content_type": output_content_type,
#         # "delimiter": delimiter,
#         # "compression": compression,
#         # "partition_config": partition_config,
#     }
# }

# KHOI: has to create separate output_config for each output:
output_configs = []
for set_name, transform_id in splitting_data_mapping.items():
    output_name = f'{transform_id}.default'
    output_configs.append({
        output_name: {
            "content_type": output_content_type,
            # "delimiter": delimiter,
            # "compression": compression,
            # "partition_config": partition_config,
        }
    })

# Refit configuration determines whether Data Wrangler refits the trainable parameters on the entire dataset. 
# When True, the processing job relearns the parameters and outputs a new flow file.
# You can specify the name of the output flow file under 'output_flow'.
# Note: There are length constraints on the container arguments (max 256 characters).
refit_trained_params = {
    "refit": False,
    "output_flow": f"data-wrangler-flow-processing-{flow_export_id}.flow"
}

# KMS key for per object encryption; default is None.
kms_key = None
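
The 256-character limit on container arguments noted in the comments above is easy to exceed once an output configuration grows. The following is a rough sketch of building one `--output-config` argument per output and checking its length before the job is submitted. The helper name `build_output_args` and the sample output names are hypothetical, and the limit is taken from the note above rather than verified independently.

```python
import json

MAX_ARGUMENT_LENGTH = 256  # limit quoted in the note above


def build_output_args(output_names, content_type="Parquet"):
    """Build one --output-config argument per output and validate its length."""
    args = []
    for output_name in output_names:
        output_config = {output_name: {"content_type": content_type}}
        arg = f"--output-config '{json.dumps(output_config)}'"
        if len(arg) > MAX_ARGUMENT_LENGTH:
            raise ValueError(
                f"Argument for {output_name} is {len(arg)} characters; "
                f"the limit is {MAX_ARGUMENT_LENGTH}."
            )
        args.append(arg)
    return args


# Hypothetical output names in the '<node_id>.default' form used above.
example_args = build_output_args(["node-a.default", "node-b.default"])
for arg in example_args:
    print(len(arg), arg)
```

Failing early in the notebook is cheaper than discovering an oversized argument after the processing job has been submitted.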

### (Optional) Configure Spark Cluster Driver Memory

In [11]:
# The Spark memory configuration. Change to specify the driver and executor memory in MB for the Spark cluster during processing.
driver_memory_in_mb = 2048
executor_memory_in_mb = 55742

config = json.dumps({
    "Classification": "spark-defaults",
    "Properties": {
        "spark.driver.memory": f"{driver_memory_in_mb}m",
        "spark.executor.memory": f"{executor_memory_in_mb}m"
    }
})

config_file = f"config-{flow_export_id}.json"
with open(config_file, "w") as f:
    f.write(config)

config_s3_path = f"{s3_folder}/spark_configuration/{processing_job_name}/configuration.json"
config_s3_uri = f"s3://{bucket}/{config_s3_path}"
s3_client.upload_file(config_file, bucket, config_s3_path, ExtraArgs={"ServerSideEncryption": "aws:kms"})
print(f"Spark Config file uploaded to {config_s3_uri}")
os.remove(config_file)

# Provide the Spark config file to the processing job to set the cluster driver and executor memory. Uncomment to set.
# data_sources.append(ProcessingInput(
#     source=config_s3_uri,
#     destination="/opt/ml/processing/input/conf",
#     input_name="spark-config",
#     s3_data_type="S3Prefix",
#     s3_input_mode="File",
#     s3_data_distribution_type="FullyReplicated"
# ))

Spark Config file uploaded to s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/spark_configuration/data-wrangler-flow-processing-2023-03-05-18-35-55-5b413e79/configuration.json


## Create Processing Job

To launch a Processing Job, you will use the SageMaker Python SDK to create a Processor object.

In [12]:
# Setup processing job network configuration
from sagemaker.network import NetworkConfig

network_config = NetworkConfig(
        enable_network_isolation=enable_network_isolation,
        security_group_ids=None,
        subnets=None
    )

In [13]:
from sagemaker.processing import Processor

processor = Processor(
    role=iam_role,
    image_uri=container_uri,
    instance_count=instance_count,
    instance_type=instance_type,
    volume_size_in_gb=volume_size_in_gb,
    network_config=network_config,
    sagemaker_session=sess,
    output_kms_key=kms_key,
    tags=user_tags
)

# # Start Job
# processor.run(
#     inputs=[flow_input] + data_sources, 
#     outputs=[processing_job_output],
#     arguments=[f"--output-config '{json.dumps(output_config)}'"] + [f"--refit-trained-params '{json.dumps(refit_trained_params)}'"],
#     wait=False,
#     logs=False,
#     job_name=processing_job_name
# )

## Job Status & S3 Output Location

The cells below wait for the processing job to finish. If it finishes successfully, the raw parameters used by the 
Processing Job are printed.

To prevent data of different processing jobs and different output nodes from being overwritten or combined, 
Data Wrangler uses the name of the processing job and the name of the output to write the output.
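
As a rough illustration of that naming scheme, the sketch below composes a final output location from the base path, the processing job name, and the output name, with the `.` in the output name treated as a path separator. The layout is inferred from the path-building code in this notebook rather than from a documented guarantee, and `final_output_path`, the bucket, and the job name are hypothetical.

```python
def final_output_path(base_path, job_name, output_name):
    """Join base path, job name, and output name ('<node_id>.<output>')."""
    # The '.' between node ID and output name becomes a directory separator.
    return f"{base_path.rstrip('/')}/{job_name}/{output_name.replace('.', '/')}"


print(
    final_output_path(
        "s3://my-bucket/export/output/training",  # hypothetical base path
        "data-wrangler-flow-processing-example",  # hypothetical job name
        "842a9df0-a299-4625-8f8a-ffca58929650.default",
    )
)
```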

In [14]:
import sys
print(sys.version)

3.7.10 (default, Jun  4 2021, 14:48:32) 
[GCC 7.5.0]


In [15]:
print(f'S3 base path: {s3_output_base_path}')
print(f'Processing job name: {processing_job_name}')

S3 base path: s3://sagemaker-us-east-1-079002598131/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-05-18-35-55-5b413e79/output
Processing job name: data-wrangler-flow-processing-2023-03-05-18-35-55-5b413e79


In [16]:
# output_args = [f"--output-config '{json.dumps(output_config)}'" for output_config in output_configs]

# # Start Job
# processor.run(
#     inputs=[flow_input] + data_sources, 
#     outputs=processing_job_outputs,
#     arguments=output_args + [f"--refit-trained-params '{json.dumps(refit_trained_params)}'"],
#     wait=False,
#     logs=False,
#     job_name=processing_job_name
# )

# # Status
# training_set_specific_path = f'{list(splitting_data_mapping.values())[0]}.default'
# s3_job_results_training_path = f"{s3_output_base_path}/{list(splitting_data_mapping.keys())[0]}/{processing_job_name}/{training_set_specific_path.replace('.', '/')}"
# print(f"Job results (training set) are saved to S3 path: {s3_job_results_training_path}")
# validation_set_specific_path = f'{list(splitting_data_mapping.values())[1]}.default'
# s3_job_results_validation_path = f"{s3_output_base_path}/{list(splitting_data_mapping.keys())[1]}/{processing_job_name}/{validation_set_specific_path.replace('.', '/')}"
# print(f"Job results (validation set) are saved to S3 path: {s3_job_results_validation_path}")

# job_result = sess.wait_for_processing_job(processing_job_name)
# job_result

# TODO:  Figure out why Data Wrangler is not splitting properly

## Train a model with SageMaker
Now that the data has been processed, you may want to train a model using the data. The following shows an 
example of doing so using a popular algorithm, XGBoost. For more information on algorithms available in 
SageMaker, see [Getting Started with SageMaker Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html). 
Note that the XGBoost objective (one of 'binary', 'regression', or 'multiclass'), the other 
hyperparameters, and the content_type used below may not suit your output data and may need to be changed to 
train a proper model. Furthermore, for CSV training, the algorithm assumes that the target 
variable is in the first column. For more information on SageMaker XGBoost, 
see https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html.
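
Because the built-in algorithm reads the first CSV column as the target, exported CSV data may need its columns reordered before training. The following is a minimal sketch using only the standard library. The column names are hypothetical, `move_target_first` is an illustrative helper, and the header row is dropped on the assumption that the training container expects headerless CSV.

```python
import csv
import io


def move_target_first(csv_text, target):
    """Return headerless CSV text with the target column moved to the front."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    target_idx = header.index(target)  # raises ValueError if the column is missing
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        reordered = [row[target_idx]] + row[:target_idx] + row[target_idx + 1:]
        writer.writerow(reordered)
    return out.getvalue()


# Hypothetical sample with the target ("total_amount") in the last column.
sample = "trip_distance,passenger_count,total_amount\n1.5,2,9.80\n3.2,1,15.30\n"
print(move_target_first(sample, "total_amount"))
```

For Parquet output, as configured in this notebook, the same reordering would be done on the columns of the exported dataset rather than on CSV text.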


### Set Training Data and Validation Data paths
We set the training and validation input data paths from the output of the Data Wrangler processing job.

In [17]:
# train_set_s3_uri = job_result['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']
# validation_set_s3_uri = job_result['ProcessingOutputConfig']['Outputs'][1]['S3Output']['S3Uri']

train_set_s3_uri = 's3://dsoaws/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/training/'
validation_set_s3_uri = 's3://dsoaws/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/validation/'

print(f"Training input data path: {train_set_s3_uri}")
print(f"Validation input data path: {validation_set_s3_uri}")

Training input data path: s3://dsoaws/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/training/
Validation input data path: s3://dsoaws/gsml-nyc-taxi-full-etl-ml-test-4-custompyspark-export-s3-via-notebook/export-flow-2023-03-02-03-32-10-53926e35/output/validation/


In [18]:
train_content_type = (
    "application/x-parquet" if output_content_type.upper() == "PARQUET"
    else "text/csv"
)
train_input = sagemaker.inputs.TrainingInput(
    s3_data=train_set_s3_uri,
    content_type=train_content_type,
    distribution='ShardedByS3Key',
    input_mode='FastFile'
)

validation_content_type = (
    "application/x-parquet" if output_content_type.upper() == "PARQUET"
    else "text/csv"
)
validation_input = sagemaker.inputs.TrainingInput(
    s3_data=validation_set_s3_uri,
    content_type=validation_content_type,
    distribution='ShardedByS3Key',
    input_mode='FastFile'    
)

### Configure the algorithm and training job

The cell below sets the Training Job hyperparameters. For more information on XGBoost hyperparameters, 
see https://xgboost.readthedocs.io/en/latest/parameter.html.

In [19]:
region = boto3.Session().region_name
container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

s3_training_job_output_prefix = f"{s3_folder}/built-in-xgboost"
training_job_output_path = 's3://{}/{}/{}/output'.format(bucket, s3_training_job_output_prefix, 'nyc-taxi-full-built-in-xgboost')

hyperparameters = {
    "eta": "0.2",
    "gamma": "4",
    "max_depth": "5",
    "min_child_weight": "6",
    "num_round": "50",
    "objective": "reg:squarederror",
    "subsample": "0.7"
}
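
The `objective` value has to match the prediction task. The sketch below shows one way the three problem types mentioned earlier might map to XGBoost objectives. The objective strings are standard XGBoost names, but the helper `xgboost_objective` and the choice of `multi:softmax` over `multi:softprob` are illustrative assumptions, not part of the exported notebook.

```python
def xgboost_objective(problem_type, num_class=None):
    """Return hyperparameter overrides for a given problem type."""
    if problem_type == "regression":
        return {"objective": "reg:squarederror"}
    if problem_type == "binary":
        return {"objective": "binary:logistic"}
    if problem_type == "multiclass":
        if not num_class or num_class < 2:
            raise ValueError("multiclass requires num_class >= 2")
        # Values are kept as strings, matching the hyperparameters dict above.
        return {"objective": "multi:softmax", "num_class": str(num_class)}
    raise ValueError(f"Unknown problem type: {problem_type}")


# Hypothetical base settings merged with the task-specific overrides.
base = {"eta": "0.2", "max_depth": "5", "num_round": "50"}
print({**base, **xgboost_objective("regression")})
```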

### Start the Training Job

The Training Job configuration is set using the SageMaker Python SDK Estimator, which is then fit using 
the training data from the Processing Job that was run earlier.

In [20]:
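estimator = sagemaker.estimator.Estimator(
    container,
    iam_role,
    hyperparameters=hyperparameters,
    instance_count=6,
    instance_type="ml.m5.24xlarge",
    volume_size=200,
    # Write model artifacts to the output path defined in the previous cell
    output_path=training_job_output_path,
)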

In [22]:
import time
import sagemaker

experiment_name = f"gsml-nyc-taxi-full-built-in-xgboost-{time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())}"
print(experiment_name)

run_name = f"experiment-run-{time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())}"
print(run_name)

#with sagemaker.experiments.load_run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sagemaker.session.Session()) as run:

training_job_results = estimator.fit({"train": train_input, 
                                      #'validation': validation_input}
                                     })
print(training_job_results)
    
#     # Define metrics to log
#     run.log_metric(name = "Final Test Loss", value = score[0])
#     run.log_metric(name = "Final Test Loss", value = score[1])

INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-03-05-18-38-56-790


gsml-nyc-taxi-full-built-in-xgboost-2023-03-05-18-38-56
experiment-run-2023-03-05-18-38-56
2023-03-05 18:38:57 Starting - Starting the training job...
2023-03-05 18:39:12 Starting - Preparing the instances for training......
2023-03-05 18:40:12 Downloading - Downloading input data...
2023-03-05 18:40:48 Training - Training image download completed. Training in progress....
[2023-03-05 18:41:15.419 ip-10-2-91-7.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-03-05:18:41:15:INFO] Imported framework sagemaker_xgboost_container.training
[2023-03-05:18:41:15:INFO] Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
[2023-03-05:18:41:15:INFO] No GPUs detected (normal if no gpus installed)
[2023-03-05:18:41:15:INFO] Running XGBoost Sagemaker in algorithm mode
[2023-03-05:18:41:15:INFO] Determined 0 GPU(s) available on the instance.
[2023-03-05:18:41:15: