# Machine Learning Automated Workflows

#### First, we'll enable step caching. With step caching, SageMaker checks for a previous successful execution of a step that was called with the same arguments and reuses its outputs instead of re-executing the step. Consider using step caching to avoid unnecessary work and cost. For example, if the second step (model training) in your pipeline fails, you can restart the pipeline without re-running the data preparation step, provided that step has not changed. The expire_after value sets how long a cached result remains eligible for reuse. Caching is configured as follows:

In [None]:
from sagemaker.workflow.steps import CacheConfig

cache_config = CacheConfig(enable_caching=True, expire_after="T360m")
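
#### To make the caching behavior concrete, the following is a minimal, standard-library sketch of the idea: results are keyed by the step name and its arguments, and a cached result is reused only while it is younger than the expiry window. This is an illustration only, not how SageMaker implements caching. The StepCache class, the prepare_data function, and the reading of the expiry window as 360 minutes are all assumptions made for this example.

```python
import hashlib
import json
import time


class StepCache:
    """Toy illustration of argument-keyed step caching with expiry (hypothetical)."""

    def __init__(self, expire_after_seconds):
        self.expire_after_seconds = expire_after_seconds
        self._entries = {}  # cache key -> (timestamp, result)

    @staticmethod
    def _key(step_name, arguments):
        # Identical step name and arguments always produce the same key.
        payload = json.dumps({"step": step_name, "args": arguments}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, step_name, arguments, func, now=None):
        """Return (result, reused). Only successful runs are stored."""
        now = time.time() if now is None else now
        key = self._key(step_name, arguments)
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.expire_after_seconds:
            return hit[1], True  # cache hit: skip re-execution
        result = func(**arguments)  # if this raises, nothing is cached
        self._entries[key] = (now, result)
        return result, False


calls = []


def prepare_data(prefix):
    # Hypothetical stand-in for an expensive data preparation step.
    calls.append(prefix)
    return f"{prefix}/prepared-data"


cache = StepCache(expire_after_seconds=360 * 60)  # assumed 360-minute window
first = cache.run("DataPreparation", {"prefix": "weather"}, prepare_data, now=0)
second = cache.run("DataPreparation", {"prefix": "weather"}, prepare_data, now=60)
third = cache.run("DataPreparation", {"prefix": "weather"}, prepare_data, now=360 * 60 + 1)
```

#### In this sketch, the second call reuses the first result because the arguments match and the entry has not expired, while the third call runs again because the window has passed.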

#### Next, we'll define the runtime arguments using the get_run_args method. Here, we call it on the Spark processor that was configured previously and pass the parameters that identify the inputs (raw weather data), the outputs (train, test, and validation datasets), and the additional arguments that the data processing script accepts. The data processing script, preprocess.py, is a slightly modified version of the processing script used in Chapter 4, Data Preparation at Scale Using Amazon SageMaker Data Wrangler and Processing. The call looks as follows:

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

run_args = pyspark_processor.get_run_args(
    "preprocess.py",
    submit_jars=["s3://crawler-public/json/serde/json-serde.jar"],
    spark_event_logs_s3_uri=spark_event_logs_s3_uri,
    configuration=configuration,
    outputs=[
        ProcessingOutput(output_name="validation", destination=validation_data_out, source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="train", destination=train_data_out, source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", destination=test_data_out, source="/opt/ml/processing/test"),
    ],
    arguments=[
        '--s3_input_bucket', s3_bucket,
        '--s3_input_key_prefix', s3_prefix_parquet,
        '--s3_output_bucket', s3_bucket,
        '--s3_output_key_prefix', s3_output_prefix+'/prepared-data/'+timestamp
    ]
)
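
#### The flags in the arguments list are delivered to preprocess.py as command-line arguments. The script itself is not shown here, so the following is only a sketch of how such a script might read them with argparse. The flag names come from the call above; the parse_args helper, the sample bucket and prefix values, and the way the input URI is assembled are assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the four flags passed to the processing script (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Weather data preprocessing")
    parser.add_argument("--s3_input_bucket", type=str, required=True)
    parser.add_argument("--s3_input_key_prefix", type=str, required=True)
    parser.add_argument("--s3_output_bucket", type=str, required=True)
    parser.add_argument("--s3_output_key_prefix", type=str, required=True)
    # argv=None makes argparse fall back to sys.argv[1:] when run as a script.
    return parser.parse_args(argv)


# Sample values only; in the pipeline these come from the arguments list.
args = parse_args([
    "--s3_input_bucket", "example-bucket",
    "--s3_input_key_prefix", "weather/parquet",
    "--s3_output_bucket", "example-bucket",
    "--s3_output_key_prefix", "weather/prepared-data/2021-01-01",
])

# Assumed convention: bucket and key prefix are joined into an S3 URI.
input_uri = f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}"
output_uri = f"s3://{args.s3_output_bucket}/{args.s3_output_key_prefix}"
```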

#### Next, we'll use the runtime arguments to configure the actual SageMaker Pipelines step for our data preprocessing task. Notice that the step is built from the values we configured previously and will execute as part of the pipeline:

In [None]:
from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name="DataPreparation",
    processor=pyspark_processor,
    inputs=run_args.inputs,
    outputs=run_args.outputs,
    job_arguments=run_args.arguments,
    code="modelbuild/pipelines/preprocess.py",
    # Reuse results of a previous successful run with identical arguments.
    cache_config=cache_config,
)

#### First, we'll configure the SageMaker training job, as follows: