# Machine Learning Automated Workflows

#### First, we'll enable step caching. Step caching tells SageMaker to check for a previous successful execution of a step that was called with the same arguments. If one exists, SageMaker reuses that execution's outputs instead of re-executing the step. You should consider using step caching to avoid unnecessary work and cost. For example, if the second step (model training) in your pipeline fails, you can start the pipeline again without re-executing the data preparation step, provided that step has not changed. Caching is configured as follows:

In [None]:
from sagemaker.workflow.steps import CacheConfig

cache_config = CacheConfig(enable_caching=True, expire_after="T360m")
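
#### To make the caching behavior concrete, here is a minimal conceptual sketch in plain Python. It is not how SageMaker implements caching internally. The names run_step, cache_key, and prepare_data are hypothetical, and the sketch assumes that "T360m" is read as a 360-minute expiry window:

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

# Hypothetical in-memory record of past successful step runs:
# cache key -> (finish time, step outputs).
_previous_runs = {}


def cache_key(step_name, arguments):
    """Build a deterministic key from the step name and its arguments."""
    payload = json.dumps({"step": step_name, "args": arguments}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(step_name, arguments, execute, expire_after, now=None):
    """Reuse a recent successful result for identical arguments, else execute."""
    now = now or datetime.now(timezone.utc)
    key = cache_key(step_name, arguments)
    hit = _previous_runs.get(key)
    if hit is not None and now - hit[0] <= expire_after:
        return hit[1], True  # cache hit: the step is not re-executed
    outputs = execute(arguments)
    _previous_runs[key] = (now, outputs)
    return outputs, False


calls = []  # records every real execution of the step


def prepare_data(arguments):
    """Stand-in for an expensive data preparation job."""
    calls.append(arguments)
    return {"train": arguments["prefix"] + "/train"}


args = {"prefix": "prepared-data"}
six_hours = timedelta(minutes=360)
first, first_cached = run_step("DataPreparation", args, prepare_data, six_hours)
second, second_cached = run_step("DataPreparation", args, prepare_data, six_hours)
```

#### The second call returns the stored outputs without invoking prepare_data again, which is the saving that step caching is meant to provide when a later step fails and the pipeline is restarted.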

#### Next, we'll define the runtime arguments using the get_run_args method. We call it on the Spark processor that was previously configured, passing the parameters that identify the inputs (raw weather data), the outputs (train, test, and validation datasets), and the additional arguments that the data processing script accepts. The data processing script, preprocess.py, is a slightly modified version of the processing script used in Chapter 4, Data Preparation at Scale Using Amazon SageMaker Data Wrangler and Processing. Refer to the following code:

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

run_args = pyspark_processor.get_run_args(
    "preprocess.py",
    submit_jars=["s3://crawler-public/json/serde/json-serde.jar"],
    spark_event_logs_s3_uri=spark_event_logs_s3_uri,
    configuration=configuration,
    outputs=[
        ProcessingOutput(output_name="validation", destination=validation_data_out, source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="train", destination=train_data_out, source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", destination=test_data_out, source="/opt/ml/processing/test"),
    ],
    arguments=[
        '--s3_input_bucket', s3_bucket,
        '--s3_input_key_prefix', s3_prefix_parquet,
        '--s3_output_bucket', s3_bucket,
        '--s3_output_key_prefix', s3_output_prefix+'/prepared-data/'+timestamp
    ]
)
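
#### The arguments list above is handed to the processing script as command-line flags. The real preprocess.py is not shown in this section, so the following is only a sketch of how such a script might read those four flags with argparse. The parse_args helper and the example bucket and prefix values are hypothetical:

```python
import argparse


def parse_args(argv=None):
    """Parse the four flags passed through the `arguments` list."""
    parser = argparse.ArgumentParser(description="Weather data preprocessing (sketch)")
    parser.add_argument("--s3_input_bucket", type=str, required=True)
    parser.add_argument("--s3_input_key_prefix", type=str, required=True)
    parser.add_argument("--s3_output_bucket", type=str, required=True)
    parser.add_argument("--s3_output_key_prefix", type=str, required=True)
    return parser.parse_args(argv)


# Hypothetical values standing in for s3_bucket, s3_prefix_parquet, and so on.
example = parse_args([
    "--s3_input_bucket", "example-bucket",
    "--s3_input_key_prefix", "weather/parquet",
    "--s3_output_bucket", "example-bucket",
    "--s3_output_key_prefix", "weather/prepared-data/2021-01-01",
])

# The script can then assemble the S3 locations it reads from and writes to.
input_uri = "s3://{}/{}".format(example.s3_input_bucket, example.s3_input_key_prefix)
output_uri = "s3://{}/{}".format(example.s3_output_bucket, example.s3_output_key_prefix)
```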

#### Next, we'll use the runtime parameters to configure the actual SageMaker Pipelines step for our data preprocessing tasks. You'll notice we're using all of the parameters we configured previously to build the step that will execute as part of the pipeline:

In [None]:
from sagemaker.workflow.steps import ProcessingStep

step_process = ProcessingStep(
    name="DataPreparation",
    processor=pyspark_processor,
    inputs=run_args.inputs,
    outputs=run_args.outputs,
    job_arguments=run_args.arguments,
    code="modelbuild/pipelines/preprocess.py",
)

#### To set up model training, we'll first configure the SageMaker training job, as follows:

In [None]:
# initialize hyperparameters
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "5",
}

# set an output path where the trained model will be saved
m_prefix = 'pipeline/model'
output_path = 's3://{}/{}/{}/output'.format(s3_bucket, m_prefix, 'xgboost')
# look up the URI of the built-in XGBoost container image for this region.
# the third argument is the container version; change it to suit your needs.
image_uri = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
# construct a SageMaker estimator that calls the xgboost-container
xgb_estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.12xlarge',
    volume_size=200,  # size of the attached storage volume, in GB
    output_path=output_path,
)

#### Next, we'll configure the SageMaker Pipelines step that will be used to execute your model training task. For this, we'll use the built-in training step (https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-training). This tells Pipelines this step will be a SageMaker training job. Figure 12.6 shows the high-level inputs and outputs/artifacts that a Training step will expect:

#### We previously configured the estimator, so we will now use that estimator combined with the other inputs shown in Figure 12.6 to set up our Pipelines step:

In [None]:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(
    name="ModelTrain",
    estimator=xgb_estimator,
    cache_config=cache_config,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)

## Model evaluation step

#### In this step, you'll configure a SageMaker processing job that will be used to evaluate your trained model using the model artifact produced from the training step in combination with your processing code and configuration:

#### First, we'll configure the SageMaker processing job starting with ScriptProcessor. We will use this to execute a simple evaluation script, as follows:

In [None]:
from sagemaker.processing import ScriptProcessor

script_eval = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=processing_instance_type,
    instance_count=1,
    base_job_name="script-weather-eval",
    role=role,
)

#### Next, we'll configure the SageMaker Pipelines step that will be used to execute your model evaluation tasks. For this, we'll use the built-in Processing step (https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing). This tells Pipelines this step will be a SageMaker processing job. 

#### We previously configured the processor, so we will now use that processor combined with the other inputs shown in Figure 12.7 to set up our Pipelines step. To do this, we'll first set up the property file that stores the output of our processing job (in this case, the model evaluation metrics). Then, we'll configure the ProcessingStep definition as follows:

In [None]:
from sagemaker.workflow.properties import PropertyFile

evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

step_eval = ProcessingStep(
    name="WeatherEval",
    processor=script_eval,
    cache_config=cache_config,
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="modelbuild/pipelines/evaluation.py",
    property_files=[evaluation_report],
)
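
#### The evaluation script itself is not shown in this section. As a rough sketch of what it needs to produce, the following plain-Python code computes an MSE and writes it to evaluation.json using the nesting regression_metrics, mse, value, which is the path the conditional step reads. The helper names and sample values are hypothetical, and a temporary directory stands in for /opt/ml/processing/evaluation:

```python
import json
import os
import tempfile


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between labels and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def write_evaluation_report(y_true, y_pred, output_dir):
    """Write evaluation.json in the layout the property file points at."""
    report = {
        "regression_metrics": {
            "mse": {"value": mean_squared_error(y_true, y_pred)},
        }
    }
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "evaluation.json")
    with open(path, "w") as f:
        json.dump(report, f)
    return path


# Hypothetical labels and predictions; a real script would load the model
# from /opt/ml/processing/model and score the data in /opt/ml/processing/test.
output_dir = tempfile.mkdtemp()
report_path = write_evaluation_report([3.0, 5.0, 2.0], [2.0, 5.0, 4.0], output_dir)

with open(report_path) as f:
    saved_report = json.load(f)
```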

## Conditional step

#### In this step, you'll configure a built-in conditional step that determines whether to proceed to the next step in the pipeline based on the results of the previous model evaluation step. A conditional step requires a list of conditions that must be true, together with the list of steps to execute when they are.

#### In this case, we're going to set up a condition using the mean squared error (MSE) metric. If the metric is less than or equal to the threshold (6.0 in the following code), the pipeline proceeds with the steps listed in the if_steps parameter: registering the model, and then creating the model that packages it for deployment. You can optionally specify else_steps to indicate the steps to perform if the condition is not true. In this case, we leave else_steps empty, so the pipeline simply ends if the condition is not true:

In [None]:
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import (
    ConditionStep,
    JsonGet
)

cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step=step_eval,
        property_file=evaluation_report,
        json_path="regression_metrics.mse.value"
    ),
    right=6.0
)

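# step_register is defined in the "Register model step(s)" section, and
# step_create_model is not shown in this section; both must be defined
# before this cell is run.
step_cond = ConditionStep(
    name="MSECondition",
    conditions=[cond_lte],
    if_steps=[step_register, step_create_model],
    else_steps=[],
)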
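
#### To illustrate what the condition evaluates, here is a small plain-Python sketch that walks the same dotted path through a report and applies the same less-than-or-equal comparison against 6.0. It only mimics the logic and is not the SageMaker implementation. The json_get and should_register helpers and the sample reports are hypothetical:

```python
def json_get(document, json_path):
    """Follow a dotted path such as 'a.b.c' through nested dictionaries."""
    value = document
    for key in json_path.split("."):
        value = value[key]
    return value


def should_register(report, threshold=6.0):
    """Mirror the condition: proceed only if MSE <= threshold."""
    return json_get(report, "regression_metrics.mse.value") <= threshold


# Hypothetical evaluation reports, one on each side of the threshold.
good_report = {"regression_metrics": {"mse": {"value": 4.2}}}
bad_report = {"regression_metrics": {"mse": {"value": 9.8}}}

proceed_for_good = should_register(good_report)  # if_steps would run
proceed_for_bad = should_register(bad_report)    # else_steps would run (none here)
```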

## Register model step(s)

#### In this final step, you'll package the model and configure a built-in register model (https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-register-model) step that will register your model to a model package group in SageMaker model registry. As seen in Figure 12.9, the inputs we'll use to register the model contain information about the packaged model, such as the model version, estimator, and S3 location of the model artifact. This information, when combined with additional information such as model metrics and inference specifications, is used to register the model version:

#### This step will use data from the prior steps in the pipeline to register the model and centrally store key metadata about this specific model version. In addition, you'll see an approval_status parameter. This parameter can be used to trigger downstream deployment processes (these will be discussed in more detail under SageMaker Projects):

In [None]:
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.step_collections import RegisterModel

model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri="{}/evaluation.json".format(
            step_eval.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
        ),
        content_type="application/json",
    )
)

step_register = RegisterModel(
    name="RegisterModel",
    estimator=xgb_estimator,  # the estimator configured for the training step
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.t2.medium", "ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    model_metrics=model_metrics,
)