# End-to-End Machine Learning Pipeline with Amazon SageMaker

The purpose of this notebook is to deploy an end-to-end machine learning pipeline with Amazon SageMaker. We will work with the [Adult Census Income dataset](https://www.kaggle.com/uciml/adult-census-income). The target variable is 'income', a binary variable that indicates whether a person earns more than 50K or not. For the training step, we will use a SageMaker XGBoost container image.

To replicate the results in your SageMaker Studio, you can clone the repository. 
Once you have this notebook and the data in your SageMaker Studio, it is time to start!

In [24]:
import boto3
import sagemaker


region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
default_bucket = sagemaker_session.default_bucket() # the default S3 bucket where you will store everything 
model_package_group_name = "AdultModelPackageGroupName"

In this cell we will copy the data from the local path (the data shown in the side bar) to an S3 bucket.

In [26]:
local_path = "data/adult.csv" # local path where you have the data

s3 = boto3.resource("s3")
base_uri = f"s3://{default_bucket}/adult"

# This line copies your data in the local path to a default S3 bucket
input_data_uri = sagemaker.s3.S3Uploader.upload(
    local_path=local_path,
    desired_s3_uri=base_uri,
) 

s3://sagemaker-us-east-1-634368063255/adult/adult.csv
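
The URI printed above is the destination prefix followed by the local file name. As a minimal sketch of that pattern, using only the standard library, the hypothetical helper `expected_s3_uri` below (not part of the SageMaker SDK) predicts where a file should land before you upload it. It only mirrors the pattern seen in the output and assumes a single file is uploaded:

```python
import posixpath


def expected_s3_uri(bucket: str, prefix: str, local_path: str) -> str:
    """Return the S3 URI a single uploaded file is expected to get.

    Hypothetical helper for illustration: it mirrors the
    "<destination prefix>/<file name>" pattern seen in the cell output.
    """
    # Normalise Windows separators so basename works on any platform.
    file_name = posixpath.basename(local_path.replace("\\", "/"))
    return f"s3://{bucket}/{prefix.strip('/')}/{file_name}"


print(expected_s3_uri("sagemaker-us-east-1-634368063255", "adult", "data/adult.csv"))
```

Comparing this value with `input_data_uri` is a cheap sanity check that the upload went to the prefix you intended.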


Now we will specify some parameters that will be useful when setting up the pipeline. Since we will handle a fair number of variables, it is good practice to use the search bar (Ctrl+F) to check what a variable does or where it came from.

In [27]:
from sagemaker.workflow.parameters import (
    ParameterInteger,
    ParameterString,
)


processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)


processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.m5.xlarge"
)

training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.xlarge")


model_approval_status = ParameterString(
    name="ModelApprovalStatus", default_value="PendingManualApproval"
)

input_data = ParameterString(
    name="InputData",
    default_value=input_data_uri,
)
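
Each parameter above carries a default value that can be overridden for a single execution by passing a dictionary to `pipeline.start(parameters=...)`. The sketch below is one possible way to guard against typos in parameter names before starting a run. The helper `build_overrides` and the `DECLARED_DEFAULTS` dictionary are hypothetical names introduced here for illustration, and the `InputData` default is shown as a placeholder:

```python
# Names and defaults mirroring the parameters declared in the cell above.
DECLARED_DEFAULTS = {
    "ProcessingInstanceCount": 1,
    "ProcessingInstanceType": "ml.m5.xlarge",
    "TrainingInstanceType": "ml.m5.xlarge",
    "ModelApprovalStatus": "PendingManualApproval",
    "InputData": "<input_data_uri>",  # placeholder for the uploaded dataset URI
}


def build_overrides(**overrides):
    """Return the overrides as a dict, rejecting names that were never declared."""
    unknown = sorted(set(overrides) - set(DECLARED_DEFAULTS))
    if unknown:
        raise KeyError(f"Unknown pipeline parameter(s): {unknown}")
    return dict(overrides)


overrides = build_overrides(TrainingInstanceType="ml.m5.2xlarge")
print(overrides)
# Later, once the pipeline object exists:
# execution = pipeline.start(parameters=overrides)
```

Parameters that are not overridden keep their default values.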


# Preprocessing step

Now it's time for the first step. We will create a preprocessing script and store it in a folder called 'adult'.

In [28]:
!mkdir -p adult

In [29]:
%%writefile adult/preprocessing.py

import pandas as pd
import numpy as np


if __name__ == "__main__":
    
    base_dir = "/opt/ml/processing"
    # read data
    df = pd.read_csv(f"{base_dir}/input/adult.csv", sep=",", 
                     error_bad_lines=False, engine='python') # skip malformed rows instead of raising an error

    # encode the target as 0 (<=50K) and 1 (>50K)
    df['income'].replace(['<=50K','>50K'],[0,1], inplace=True) 

    # drop useless variables
    df = df.drop('fnlwgt', axis=1)
    df = df.drop('education.num', axis=1)

    # Drop rows with missing data
    df = df.loc[ (df['workclass'] != '?') & (df['occupation'] != '?') & (df['native.country']!= '?')]

    # split data into dependent and independent variables
    X = df.drop('income', axis=1)
    y = df['income']

    # split independent variables into continuous and categorical variables
    X_continous  = X[['age', 'capital.gain', 'capital.loss', 'hours.per.week']]

    X_categorical = X[['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race',
                    'sex', 'native.country']]


    # One hot encoding
    X_encoded = pd.get_dummies(X_categorical)

    # Concatenate both continuous and encoded sets:
    X = pd.concat([X_continous, X_encoded],axis=1)

    y = y.to_numpy().reshape(y.shape[0],1)
    X = X.to_numpy()


    dataset = np.concatenate((y, X), axis=1)

    np.random.shuffle(dataset)
    
    # Split into train (60%), validation (10%) and test (30%) datasets
    train, validation, test = np.split(dataset, [int(.6*len(dataset)), int(.7*len(dataset))])

    # Save the data
    pd.DataFrame(train).to_csv(f"{base_dir}/train/train.csv", header=False, index=False)
    pd.DataFrame(validation).to_csv(f"{base_dir}/validation/validation.csv", header=False, index=False)
    pd.DataFrame(test).to_csv(f"{base_dir}/test/test.csv", header=False, index=False)


Overwriting adult/preprocessing.py
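
The script cuts the shuffled dataset at 60% and 70% of its length, which yields a 60/10/30 split between train, validation and test. The sketch below reproduces only that index arithmetic in plain Python, so the resulting sizes can be checked without loading the data. `split_sizes` is a hypothetical helper, not part of the script:

```python
def split_sizes(n_rows, cut_fractions=(0.6, 0.7)):
    """Return the partition sizes obtained by cutting n_rows at the given fractions.

    Mirrors the cut points passed to np.split in the preprocessing script:
    int(0.6 * n) and int(0.7 * n).
    """
    cuts = [int(fraction * n_rows) for fraction in cut_fractions]
    bounds = [0, *cuts, n_rows]
    return [end - start for start, end in zip(bounds, bounds[1:])]


train_rows, validation_rows, test_rows = split_sizes(1000)
print(train_rows, validation_rows, test_rows)  # prints: 600 100 300
```

Note that the validation set is the smallest of the three. Changing the second cut point (for example to 0.8) would trade test rows for validation rows.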


Now we can create an SKLearnProcessor instance:

In [30]:
from sagemaker.sklearn.processing import SKLearnProcessor


framework_version = "0.23-1"

sklearn_processor = SKLearnProcessor(
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name="sklearn-adult-process",
    role=role,
)

Then we pass the processor and the code into a ProcessingStep. We also specify the input and output paths. 
NOTE: /opt/ml/processing is just a default path in the processing container. You can take a quick look at the diagram [here](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) to understand this better.

In [31]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep


step_process = ProcessingStep(
    name="adultProcess",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ], # copies the data from the S3 bucket to the container
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ], # outputs are stored in S3 (example: sagemaker-us-east-<>/<base_job_name>/output/train/train.csv)
    code="adult/preprocessing.py",
)
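
The step above maps one input folder and three output folders under /opt/ml/processing. If you want to try the preprocessing logic on your own machine first, one option is to recreate that folder layout locally. The sketch below does this with the standard library only. `make_local_processing_dirs` is a hypothetical helper, and it assumes you would then point the script's `base_dir` at the returned root, since the script hard-codes the container path:

```python
import tempfile
from pathlib import Path


def make_local_processing_dirs(root):
    """Create input/train/validation/test under root, mimicking /opt/ml/processing."""
    base = Path(root)
    paths = {name: base / name for name in ("input", "train", "validation", "test")}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Demonstrate on a throwaway directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_root:
    local_dirs = make_local_processing_dirs(tmp_root)
    print(sorted(path.name for path in Path(tmp_root).iterdir()))
```

In the real processing job these folders are provided by SageMaker, so the helper is only useful for local experiments.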

In [32]:
from sagemaker.estimator import Estimator


model_path = f"s3://{default_bucket}/adultTrain"


image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)
xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    role=role,
)
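xgb_train.set_hyperparameters(
    objective="binary:logistic",  # binary classification; the model outputs a probability
    num_round=50,  # number of boosting rounds
    max_depth=5,  # maximum depth of each tree
    # gamma=4,
    # min_child_weight=6,
    subsample=1,  # fraction of training rows sampled per tree (1 = use all rows)
    silent=0,  # 0 prints running messages
    # newly added settings
    eval_metric="logloss",  # metric evaluated on the validation data
    eta=0.3,  # learning rate
    # reg_lambda=10,
)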
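
With `objective="binary:logistic"` the model outputs a probability obtained by passing a raw score through the logistic (sigmoid) function, and `eval_metric="logloss"` scores those probabilities against the true labels. The sketch below works through both formulas in plain Python on made-up numbers. It illustrates the definitions and is not how XGBoost computes them internally:

```python
import math


def sigmoid(margin):
    """Map a raw score (margin) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def log_loss(labels, probabilities):
    """Average negative log-likelihood of binary labels under the predictions."""
    total = 0.0
    for label, prob in zip(labels, probabilities):
        total += -(label * math.log(prob) + (1 - label) * math.log(1 - prob))
    return total / len(labels)


print(sigmoid(0.0))  # prints: 0.5
print(round(log_loss([1, 0], [0.9, 0.1]), 4))  # prints: 0.1054
```

Confident and correct predictions drive the log loss toward 0, while confident and wrong predictions are penalised heavily.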

In [33]:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep


step_train = TrainingStep(
    name="adultTrain",
    estimator=xgb_train,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",    
        ),
        # step_process is the preprocessing step; here we take its output named "train"
        
        "validation": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "validation"
            ].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)

In [34]:
%%writefile adult/evaluation.py

import json
import pathlib
import pickle
import tarfile

# import joblib
import numpy as np
import pandas as pd
import xgboost


from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

if __name__ == "__main__":
    # load the model
    model_path = "/opt/ml/processing/model/model.tar.gz"
    with tarfile.open(model_path) as tar:
        tar.extractall(path=".")

    with open("xgboost-model", "rb") as model_file:
        model = pickle.load(model_file)

    # load the test set
    test_path = "/opt/ml/processing/test/test.csv"
    df = pd.read_csv(test_path, header=None)
    

    y_test = df.iloc[:, 0].to_numpy()
    df.drop(df.columns[0], axis=1, inplace=True)
    
    X_test = xgboost.DMatrix(df.values)
    

    predictions = model.predict(X_test)
    predictions = np.where(predictions > 0.5, 1, 0 )
 
    acc = accuracy_score(y_test, predictions)
    std = np.std(y_test - predictions)
    report_dict = {
        "regression_metrics": {
            "accuracy": {
                "value": acc,
                "standard_deviation": std
            },
        },
    }

    output_dir = "/opt/ml/processing/evaluation"
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)
    # creates the output directory, including any missing parent directories

    evaluation_path = f"{output_dir}/evaluation.json"
    with open(evaluation_path, "w") as f:
        f.write(json.dumps(report_dict))

Overwriting adult/evaluation.py
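
The evaluation script turns predicted probabilities into class labels with a 0.5 threshold, then reports the accuracy and the standard deviation of the prediction errors. The sketch below reproduces that logic on a few made-up values using only the standard library. `evaluate` is a hypothetical helper, and `statistics.pstdev` stands in for `np.std`, which also computes the population standard deviation by default:

```python
import statistics


def evaluate(y_true, probabilities, threshold=0.5):
    """Threshold probabilities, then compute accuracy and the std of the errors."""
    predictions = [1 if prob > threshold else 0 for prob in probabilities]
    correct = sum(int(label == pred) for label, pred in zip(y_true, predictions))
    errors = [label - pred for label, pred in zip(y_true, predictions)]
    return {
        "accuracy": correct / len(y_true),
        "standard_deviation": statistics.pstdev(errors),
    }


report = evaluate([1, 0, 1, 1], [0.8, 0.4, 0.3, 0.9])
print(report["accuracy"])  # prints: 0.75
```

In this toy example the third row is predicted as 0 although its label is 1, which is the single error behind the 0.75 accuracy.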


In [35]:
from sagemaker.processing import ScriptProcessor


script_eval = ScriptProcessor(
    image_uri=image_uri,
    command=["python3"],
    instance_type=processing_instance_type,
    instance_count=1,
    base_job_name="script-adult-eval",
    role=role,
)

In [36]:
from sagemaker.workflow.properties import PropertyFile

# You use property files to store information from the output of a processing step. 
# This is particularly useful when analyzing the results of a processing step to decide how a conditional
# step should be executed.


evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)
# The path parameter is the name of the JSON file that the property file is saved to
# output_name must match the output_name of the ProcessingOutput that you define in your processing step.



step_eval = ProcessingStep(
    name="adultEval",
    processor=script_eval,
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="adult/evaluation.py",
    property_files=[evaluation_report],
)

In [37]:
from sagemaker.model import Model


model = Model(
    image_uri=image_uri,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts, # The S3 location of a SageMaker model data .tar.gz file
    sagemaker_session=sagemaker_session,
    role=role,
)

In [38]:
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.steps import CreateModelStep


inputs = CreateModelInput(
    instance_type="ml.m5.large",
    accelerator_type="ml.eia1.medium",
)
step_create_model = CreateModelStep(
    name="adultCreateModel",
    model=model,
    inputs=inputs,
)

In [39]:
from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.step_collections import RegisterModel


model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri="{}/evaluation.json".format(
            step_eval.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
        ),
        content_type="application/json",
    )
)
step_register = RegisterModel(
    name="adultRegisterModel",
    estimator=xgb_train,
    model_data=step_train.properties.ModelArtifacts.S3ModelArtifacts, # The S3 uri to the model data .tar.gz file
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.t2.medium", "ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    model_metrics=model_metrics,
)
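
Because the model package is registered with the status `PendingManualApproval`, someone has to approve it later before it can be used downstream. One way to do that programmatically is the `update_model_package` call of the boto3 SageMaker client. The sketch below wraps that call in a small function. The name `approve_model_package` is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS credentials:

```python
def approve_model_package(sagemaker_client, model_package_arn):
    """Set the approval status of a registered model package to "Approved".

    sagemaker_client is expected to behave like boto3.client("sagemaker").
    model_package_arn is the ARN reported for the registration step.
    """
    return sagemaker_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
    )


# Example usage (requires AWS credentials and a real model package ARN):
# import boto3
# approve_model_package(boto3.client("sagemaker"), "<model package ARN>")
```

The same change can be made by hand in SageMaker Studio. The scripted version is mainly useful when approval is part of an automated review process.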

In [40]:
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.condition_step import (
    ConditionStep,
    JsonGet,
)


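# NOTE: accuracy always lies in [0, 1], so "accuracy <= 6.0" is always true and the
# if_steps below run on every execution. To gate registration on model quality, use
# ConditionGreaterThanOrEqualTo (also in sagemaker.workflow.conditions) with a real threshold.
cond_lte = ConditionLessThanOrEqualTo(
    left=JsonGet(
        step=step_eval,
        property_file=evaluation_report,
        json_path="regression_metrics.accuracy.value",
    ),
    right=6.0,
)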

step_cond = ConditionStep(
    name="adultAccCond",
    conditions=[cond_lte],
    if_steps=[step_register, step_create_model],
    else_steps=[],
)

The class JsonGet has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
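
The condition reads one value from evaluation.json by following the dotted path `regression_metrics.accuracy.value` and compares it with a constant. The sketch below imitates that lookup in plain Python, using the accuracy values that this notebook's evaluation step produced. `json_get` is a hypothetical stand-in for the lookup and the 0.8 threshold is an arbitrary illustration, not a recommendation:

```python
from functools import reduce


def json_get(document, dotted_path):
    """Walk a nested dict by following a dot-separated path of keys."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), document)


report = {
    "regression_metrics": {
        "accuracy": {
            "value": 0.8693778318046192,
            "standard_deviation": 0.3585934376502634,
        }
    }
}

accuracy = json_get(report, "regression_metrics.accuracy.value")
print(accuracy <= 6.0)  # True for any accuracy, since accuracy never exceeds 1
print(accuracy >= 0.8)  # True here; 0.8 is only an illustrative threshold
```

The first comparison shows why a "less than or equal to 6.0" check never blocks registration when the metric is an accuracy.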


In [41]:
from sagemaker.workflow.pipeline import Pipeline


pipeline_name = "adultPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_type,
        processing_instance_count,
        training_instance_type,
        model_approval_status,
        input_data,
    ],
    steps=[step_process, step_train, step_eval, step_cond],
)

In [42]:
import json


definition = json.loads(pipeline.definition())
definition

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [{'Name': 'ProcessingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.m5.xlarge'},
  {'Name': 'ProcessingInstanceCount', 'Type': 'Integer', 'DefaultValue': 1},
  {'Name': 'TrainingInstanceType',
   'Type': 'String',
   'DefaultValue': 'ml.m5.xlarge'},
  {'Name': 'ModelApprovalStatus',
   'Type': 'String',
   'DefaultValue': 'PendingManualApproval'},
  {'Name': 'InputData',
   'Type': 'String',
   'DefaultValue': 's3://sagemaker-us-east-1-634368063255/adult/adult.csv'}],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'adultProcess',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': {'Get': 'Parameters.ProcessingInstanceType'},
      'InstanceCount': {'Get': 'Parameters.ProcessingInstanceCount'},
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri
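
The full definition is long, so it can help to print only the name and type of each step. The sketch below does that for any dictionary shaped like the one above. `summarize_steps` is a hypothetical helper, and `sample_definition` is a trimmed-down illustration rather than the real output:

```python
def summarize_steps(definition):
    """Return (name, type) pairs for every step in a pipeline definition dict."""
    return [(step["Name"], step["Type"]) for step in definition.get("Steps", [])]


# Trimmed-down illustration of the structure; the real definition holds more fields.
sample_definition = {
    "Version": "2020-12-01",
    "Steps": [
        {"Name": "adultProcess", "Type": "Processing"},
        {"Name": "adultTrain", "Type": "Training"},
    ],
}

for name, step_type in summarize_steps(sample_definition):
    print(f"{name}: {step_type}")
```

In the notebook you could call `summarize_steps(definition)` on the dictionary loaded from `pipeline.definition()`.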

In [43]:
pipeline.upsert(role_arn=role)

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


{'PipelineArn': 'arn:aws:sagemaker:us-east-1:634368063255:pipeline/adultpipeline',
 'ResponseMetadata': {'RequestId': '0e19ec29-25f6-4a05-a0e4-426101b3c235',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0e19ec29-25f6-4a05-a0e4-426101b3c235',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '81',
   'date': 'Sat, 18 Sep 2021 14:17:21 GMT'},
  'RetryAttempts': 0}}

In [44]:
execution = pipeline.start()

In [45]:
execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:634368063255:pipeline/adultpipeline',
 'PipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:634368063255:pipeline/adultpipeline/execution/vjfr0w8t5x11',
 'PipelineExecutionDisplayName': 'execution-1631974642407',
 'PipelineExecutionStatus': 'Executing',
 'CreationTime': datetime.datetime(2021, 9, 18, 14, 17, 22, 241000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2021, 9, 18, 14, 17, 22, 241000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:634368063255:user-profile/d-dceowp1z5rza/default-1631865696300',
  'UserProfileName': 'default-1631865696300',
  'DomainId': 'd-dceowp1z5rza'},
 'LastModifiedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:634368063255:user-profile/d-dceowp1z5rza/default-1631865696300',
  'UserProfileName': 'default-1631865696300',
  'DomainId': 'd-dceowp1z5rza'},
 'ResponseMetadata': {'RequestId': '03b79ab5-0dc8-44b3-ae69-a49f07a5918b',
  'HTTPStatusCode': 200,
  'HTT

In [46]:
execution.wait()

In [47]:
execution.list_steps()

[{'StepName': 'adultCreateModel',
  'StartTime': datetime.datetime(2021, 9, 18, 14, 30, 9, 498000, tzinfo=tzlocal()),
  'EndTime': datetime.datetime(2021, 9, 18, 14, 30, 10, 467000, tzinfo=tzlocal()),
  'StepStatus': 'Succeeded',
  'Metadata': {'Model': {'Arn': 'arn:aws:sagemaker:us-east-1:634368063255:model/pipelines-vjfr0w8t5x11-adultcreatemodel-by2ys8pkun'}}},
 {'StepName': 'adultRegisterModel',
  'StartTime': datetime.datetime(2021, 9, 18, 14, 30, 9, 417000, tzinfo=tzlocal()),
  'EndTime': datetime.datetime(2021, 9, 18, 14, 30, 10, 983000, tzinfo=tzlocal()),
  'StepStatus': 'Succeeded',
  'Metadata': {'RegisterModel': {'Arn': 'arn:aws:sagemaker:us-east-1:634368063255:model-package/adultmodelpackagegroupname/2'}}},
 {'StepName': 'adultAccCond',
  'StartTime': datetime.datetime(2021, 9, 18, 14, 30, 8, 428000, tzinfo=tzlocal()),
  'EndTime': datetime.datetime(2021, 9, 18, 14, 30, 8, 937000, tzinfo=tzlocal()),
  'StepStatus': 'Succeeded',
  'Metadata': {'Condition': {'Outcome': 'True'}
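
The list returned by `execution.list_steps()` is verbose, and usually only the status of each step matters. The sketch below condenses a list of that shape into a name-to-status mapping and picks out failed steps. Both helpers are hypothetical, and `sample_steps` keeps only the two fields that are used:

```python
def step_statuses(steps):
    """Map each step name to its status."""
    return {step["StepName"]: step["StepStatus"] for step in steps}


def failed_steps(steps):
    """Return the names of the steps whose status is "Failed"."""
    return [step["StepName"] for step in steps if step["StepStatus"] == "Failed"]


# Reduced version of the output above, keeping only the fields used here.
sample_steps = [
    {"StepName": "adultCreateModel", "StepStatus": "Succeeded"},
    {"StepName": "adultRegisterModel", "StepStatus": "Succeeded"},
    {"StepName": "adultAccCond", "StepStatus": "Succeeded"},
]

print(step_statuses(sample_steps))
print(failed_steps(sample_steps))  # prints: []
```

In the notebook the same helpers can be applied directly to `execution.list_steps()`.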

In [48]:
from pprint import pprint


evaluation_json = sagemaker.s3.S3Downloader.read_file(
    "{}/evaluation.json".format(
        step_eval.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
    )
)
pprint(json.loads(evaluation_json))

{'regression_metrics': {'accuracy': {'standard_deviation': 0.3585934376502634,
                                     'value': 0.8693778318046192}}}


In [49]:
import time
from sagemaker.lineage.visualizer import LineageTableVisualizer


viz = LineageTableVisualizer(sagemaker.session.Session())
for execution_step in reversed(execution.list_steps()):
    print(execution_step)
    display(viz.show(pipeline_execution_step=execution_step))
    time.sleep(5)

{'StepName': 'adultProcess', 'StartTime': datetime.datetime(2021, 9, 18, 14, 17, 23, 747000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2021, 9, 18, 14, 21, 49, 338000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'Metadata': {'ProcessingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:634368063255:processing-job/pipelines-vjfr0w8t5x11-adultprocess-lkbxqemamb'}}}


Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,s3://...14-17-21-639/input/code/preprocessing.py,Input,DataSet,ContributedTo,artifact
1,s3://...r-us-east-1-634368063255/adult/adult.csv,Input,DataSet,ContributedTo,artifact
2,68331...om/sagemaker-scikit-learn:0.23-1-cpu-py3,Input,Image,ContributedTo,artifact
3,s3://...cess-2021-09-18-14-17-20-880/output/test,Output,DataSet,Produced,artifact
4,s3://...021-09-18-14-17-20-880/output/validation,Output,DataSet,Produced,artifact
5,s3://...ess-2021-09-18-14-17-20-880/output/train,Output,DataSet,Produced,artifact


{'StepName': 'adultTrain', 'StartTime': datetime.datetime(2021, 9, 18, 14, 21, 49, 591000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2021, 9, 18, 14, 25, 44, 427000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'Metadata': {'TrainingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:634368063255:training-job/pipelines-vjfr0w8t5x11-adulttrain-z8msbmyltz'}}}


Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,s3://...021-09-18-14-17-20-880/output/validation,Input,DataSet,ContributedTo,artifact
1,s3://...ess-2021-09-18-14-17-20-880/output/train,Input,DataSet,ContributedTo,artifact
2,68331...naws.com/sagemaker-xgboost:1.0-1-cpu-py3,Input,Image,ContributedTo,artifact
3,s3://...dultTrain-Z8MSbMyltZ/output/model.tar.gz,Output,Model,Produced,artifact


{'StepName': 'adultEval', 'StartTime': datetime.datetime(2021, 9, 18, 14, 25, 45, 146000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2021, 9, 18, 14, 30, 8, 231000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'Metadata': {'ProcessingJob': {'Arn': 'arn:aws:sagemaker:us-east-1:634368063255:processing-job/pipelines-vjfr0w8t5x11-adulteval-uxra51fegc'}}}


Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,s3://...18-14-17-21-704/input/code/evaluation.py,Input,DataSet,ContributedTo,artifact
1,s3://...cess-2021-09-18-14-17-20-880/output/test,Input,DataSet,ContributedTo,artifact
2,s3://...dultTrain-Z8MSbMyltZ/output/model.tar.gz,Input,Model,ContributedTo,artifact
3,68331...naws.com/sagemaker-xgboost:1.0-1-cpu-py3,Input,Image,ContributedTo,artifact
4,s3://...021-09-18-14-17-20-502/output/evaluation,Output,DataSet,Produced,artifact


{'StepName': 'adultAccCond', 'StartTime': datetime.datetime(2021, 9, 18, 14, 30, 8, 428000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2021, 9, 18, 14, 30, 8, 937000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'Metadata': {'Condition': {'Outcome': 'True'}}}


None

{'StepName': 'adultRegisterModel', 'StartTime': datetime.datetime(2021, 9, 18, 14, 30, 9, 417000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2021, 9, 18, 14, 30, 10, 983000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'Metadata': {'RegisterModel': {'Arn': 'arn:aws:sagemaker:us-east-1:634368063255:model-package/adultmodelpackagegroupname/2'}}}


Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,s3://...dultTrain-Z8MSbMyltZ/output/model.tar.gz,Input,Model,ContributedTo,artifact
1,68331...naws.com/sagemaker-xgboost:1.0-1-cpu-py3,Input,Image,ContributedTo,artifact
2,adultmodelpackagegroupname-2-PendingManualAppr...,Input,Approval,ContributedTo,action
3,AdultModelPackageGroupName-1631957603-aws-mode...,Output,ModelGroup,AssociatedWith,context


{'StepName': 'adultCreateModel', 'StartTime': datetime.datetime(2021, 9, 18, 14, 30, 9, 498000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2021, 9, 18, 14, 30, 10, 467000, tzinfo=tzlocal()), 'StepStatus': 'Succeeded', 'Metadata': {'Model': {'Arn': 'arn:aws:sagemaker:us-east-1:634368063255:model/pipelines-vjfr0w8t5x11-adultcreatemodel-by2ys8pkun'}}}


None