# Milestone 3: SageMaker Pipeline
### By Lakshmi Devesh Kumar

### Business Context

Fashion Haven has developed a predictive model that estimates sales revenue from its advertising campaigns across three media sources (TV, Newspaper, Radio). To scale this work and make data-driven decisions more accessible throughout the organization, the company now wants an automated machine-learning pipeline for sales revenue prediction.

The goal of this assignment is to design and implement a machine-learning pipeline that automates the entire process of data preprocessing, model training, and evaluation. The automated pipeline should take in raw data containing information on advertising campaigns and sales revenue across various stores and media sources. It should then transform the data, select the most relevant features, train the model, and evaluate its performance.

By creating an automated machine-learning pipeline, Fashion Haven aims to streamline the process of sales revenue prediction, making it accessible to various departments within the organization. This automation will save time and effort for data scientists and other stakeholders, allowing them to focus on higher-value tasks and strategic decision-making.

### Setup

**General Imports**

In [2]:
import logging
import json

**SageMaker Imports**

In [3]:
import sagemaker
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.properties import PropertyFile
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.processing import ScriptProcessor
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.inputs import TrainingInput

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


**SageMaker Authentication**

In [4]:
pipeline_session = sagemaker.Session(default_bucket="sagemaker-us-east-1-583418938145")

In [5]:
try:
    aws_role = sagemaker.get_execution_role()
except ValueError:
    # Every later cell needs aws_role, so fail here rather than with a NameError downstream
    print('Local configuration is not complete; use SageMaker Studio')
    raise

In [6]:
print(f"AWS execution role associated with the account {aws_role}")
print(f"Default bucket associated with the account: {pipeline_session.default_bucket()}")
print(f"Default boto region associated with the account: {pipeline_session.boto_region_name}")

AWS execution role associated with the account arn:aws:iam::583418938145:role/service-role/AmazonSageMaker-ExecutionRole-20240611T124233
Default bucket associated with the account: sagemaker-us-east-1-583418938145
Default boto region associated with the account: us-east-1


### SageMaker Pipelines

#### Step 1: Preprocessing

In [7]:
input_data_uri = 's3://sagemaker-us-east-1-583418938145/milestone3/processed_data.csv'

In [8]:
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    instance_type="ml.m4.xlarge",
    instance_count=1,
    base_job_name="milestone3-processed-data-process",
    role=aws_role,
    sagemaker_session=pipeline_session
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


In [9]:
# Processing step: runs transform.py on the input CSV and exposes "train" and "test" outputs

step_process = ProcessingStep(
    name="newspaperProcess",
    processor=sklearn_processor,
    inputs=[
        ProcessingInput(source=input_data_uri, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test")
    ],
    code="transform.py",
)

In [10]:
step_process.inputs

[<sagemaker.processing.ProcessingInput at 0x7fe0b403b880>]

In [11]:
step_process.outputs

[<sagemaker.processing.ProcessingOutput at 0x7fe0b403b4c0>,
 <sagemaker.processing.ProcessingOutput at 0x7fe0b403a650>]
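
The processing step runs `transform.py`, which is not shown in this notebook. The sketch below is one hypothetical shape such a script could take, using only the standard library: it reads the CSV mounted under `/opt/ml/processing/input` and writes the two splits to the `train` and `test` directories declared in the `ProcessingOutput` entries. The function name `split_csv`, the 80/20 split, the fixed seed, the header row, and the output file names `train.csv` and `test.csv` are all assumptions made for illustration.

```python
import csv
import random
from pathlib import Path


def split_csv(base_dir="/opt/ml/processing", test_fraction=0.2, seed=42):
    """Split the first CSV under <base_dir>/input into train/test files.

    Hypothetical helper: the real transform.py may do something different.
    """
    base = Path(base_dir)
    source = sorted((base / "input").glob("*.csv"))[0]

    with source.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # assumes the file has a header row
        rows = list(reader)

    # Shuffle reproducibly, then carve off the test rows
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    splits = {"test": rows[:n_test], "train": rows[n_test:]}

    # Write each split to the directory SageMaker uploads for that output
    for name, subset in splits.items():
        out_dir = base / name
        out_dir.mkdir(parents=True, exist_ok=True)
        with (out_dir / f"{name}.csv").open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)

    return {name: len(subset) for name, subset in splits.items()}
```

Inside the processing container the script would simply call `split_csv()` with the default paths. Passing a different `base_dir` makes it possible to try the logic locally before running the pipeline.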

#### Step 2: Training

In [12]:
sklearn_estimator = SKLearn(
    entry_point="dt.py",
    framework_version="1.0-1",
    role=aws_role,
    sagemaker_session=pipeline_session,
    instance_type="ml.m4.xlarge",
    instance_count=1,
    volume_size=1
)

In [13]:
step_train = TrainingStep(
    name="newspaperTrain",
    estimator=sklearn_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri,
            content_type="text/csv"
        ),
        "test": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            content_type="text/csv"
        )
    }
)

In [14]:
step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri

{'_step': <sagemaker.workflow.steps.ProcessingStep object at 0x7fe0b403a740>, 'step_name': 'newspaperProcess', 'path': "ProcessingOutputConfig.Outputs['train'].S3Output.S3Uri", '_shape_names': ['S3Uri'], '__str__': 'S3Uri'}
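
The training step passes two channels, `train` and `test`, to the `dt.py` entry point, which is not shown here. In script mode the SageMaker training container exposes each channel through an `SM_CHANNEL_<NAME>` environment variable and the model output directory through `SM_MODEL_DIR`. The sketch below shows how an entry point might pick these up; the function name `parse_args` and the argument names are assumptions, not the contents of the actual `dt.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Resolve channel and model directories for a script-mode entry point."""
    parser = argparse.ArgumentParser()
    # Fallbacks are the standard in-container locations
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--test",
        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"),
    )
    # parse_known_args ignores extra flags (for example hyperparameters)
    args, _unknown = parser.parse_known_args(argv)
    return args
```

The script would then read its CSV files from `args.train` and `args.test`, and save the fitted model under `args.model_dir` so that SageMaker packages it as the model artifact.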

#### Step 3: Evaluation

In [15]:
# Replace with your desired region
session_region = 'us-east-1'

# Get the scikit-learn image URI for the given region
sklearn_image_uri = sagemaker.image_uris.retrieve(
    framework='sklearn',
    version='1.0-1',
    region=session_region
)

# Print the retrieved scikit-learn image URI
print(sklearn_image_uri)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.


683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3


In [16]:
# Create an instance of the ScriptProcessor class from the sagemaker.processing module
script_eval = ScriptProcessor(
    image_uri=sklearn_image_uri,
    command=["python"],
    instance_type='ml.m4.xlarge',
    instance_count=1,
    base_job_name="script-processed-data-eval",
    role=aws_role,
    sagemaker_session=pipeline_session
)

In [17]:
# Create an instance of the PropertyFile class from the sagemaker.workflow.properties module

evaluation_report = PropertyFile(
    name="EvaluationReport",
    output_name="evaluation",
    path="evaluation.json"
)

In [18]:
# Create an instance of the ProcessingStep class from the sagemaker.workflow.steps module for evaluation

step_eval = ProcessingStep(
    name="newspaperEval",
    processor=script_eval,
    inputs=[
        ProcessingInput(
            source=step_train.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model"
        ),
        ProcessingInput(
            source=step_process.properties.ProcessingOutputConfig.Outputs[
                "test"
            ].S3Output.S3Uri,
            destination="/opt/ml/processing/test"
        )
    ],
    outputs=[
        ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation"),
    ],
    code="evaluation.py",
    property_files=[evaluation_report]
)
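
For the `PropertyFile` to resolve, `evaluation.py` has to write a file named `evaluation.json` into `/opt/ml/processing/evaluation`, the source directory of the `evaluation` output. The script itself is not shown in this notebook, so the sketch below is only a hypothetical illustration of that last step. The function name `write_evaluation_report`, the choice of mean squared error as the metric, and the `regression_metrics` layout of the JSON are assumptions.

```python
import json
from pathlib import Path


def write_evaluation_report(y_true, y_pred,
                            output_dir="/opt/ml/processing/evaluation"):
    """Compute MSE and write it to <output_dir>/evaluation.json."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    # Mean squared error over paired actual/predicted values
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

    # Assumed report layout; any valid JSON structure would work
    report = {"regression_metrics": {"mse": {"value": mse}}}

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "evaluation.json").write_text(json.dumps(report))
    return report
```

With a report in this shape, a later pipeline step could read the metric through the property file, for example with `JsonGet` from `sagemaker.workflow.functions` and a path such as `regression_metrics.mse.value`.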

#### Step 4: Assembling the pipeline

In [19]:
# Create an instance of the Pipeline class from the sagemaker.workflow.pipeline module

pipeline = Pipeline(
    name="newspaperpipeline",
    steps=[step_process, step_train, step_eval],
    sagemaker_session=pipeline_session
)

In [20]:
json.loads(pipeline.definition())



{'Version': '2020-12-01',
 'Metadata': {},
 'Parameters': [],
 'PipelineExperimentConfig': {'ExperimentName': {'Get': 'Execution.PipelineName'},
  'TrialName': {'Get': 'Execution.PipelineExecutionId'}},
 'Steps': [{'Name': 'newspaperProcess',
   'Type': 'Processing',
   'Arguments': {'ProcessingResources': {'ClusterConfig': {'InstanceType': 'ml.m4.xlarge',
      'InstanceCount': 1,
      'VolumeSizeInGB': 30}},
    'AppSpecification': {'ImageUri': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3',
     'ContainerEntrypoint': ['python3',
      '/opt/ml/processing/input/code/transform.py']},
    'RoleArn': 'arn:aws:iam::583418938145:role/service-role/AmazonSageMaker-ExecutionRole-20240611T124233',
    'ProcessingInputs': [{'InputName': 'input-1',
      'AppManaged': False,
      'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-583418938145/milestone3/processed_data.csv',
       'LocalPath': '/opt/ml/processing/input',
       'S3DataType': 'S3Prefix',
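
The full definition is long, so a compact summary can be easier to check than the raw dictionary. The sketch below walks the parsed definition and lists each step's name, type, and the upstream steps it references. It assumes that step-to-step references are serialized as `{'Get': 'Steps.<step name>.<property path>'}`, in the same style as the `'Get'` entries visible in the output above. The helper names `find_step_refs` and `summarize_steps` are hypothetical.

```python
def find_step_refs(node):
    """Yield names of steps referenced via {'Get': 'Steps.<name>...'} entries."""
    if isinstance(node, dict):
        ref = node.get("Get")
        if isinstance(ref, str) and ref.startswith("Steps."):
            # 'Steps.<name>.<path>' -> '<name>'
            yield ref.split(".")[1]
        for value in node.values():
            yield from find_step_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_step_refs(item)


def summarize_steps(definition):
    """Return (name, type, upstream step names) for every step."""
    return [
        (
            step["Name"],
            step["Type"],
            sorted(set(find_step_refs(step.get("Arguments", {})))),
        )
        for step in definition.get("Steps", [])
    ]
```

Calling `summarize_steps(json.loads(pipeline.definition()))` should then show whether the training and evaluation steps depend on the steps they are expected to.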
     

In [21]:
# Upsert (update or insert) the pipeline: create it in SageMaker if it does not exist, otherwise update it, using the aws_role execution role

pipeline.upsert(role_arn=aws_role)



{'PipelineArn': 'arn:aws:sagemaker:us-east-1:583418938145:pipeline/newspaperpipeline',
 'ResponseMetadata': {'RequestId': '18b97956-f0c9-46d6-8753-2bfef473fe9c',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '18b97956-f0c9-46d6-8753-2bfef473fe9c',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '85',
   'date': 'Sun, 18 Aug 2024 19:48:26 GMT'},
  'RetryAttempts': 0}}

In [22]:
execution = pipeline.start()

In [23]:
# Check the execution's current status with describe() (a point-in-time snapshot, not a blocking wait)

execution.describe()

{'PipelineArn': 'arn:aws:sagemaker:us-east-1:583418938145:pipeline/newspaperpipeline',
 'PipelineExecutionArn': 'arn:aws:sagemaker:us-east-1:583418938145:pipeline/newspaperpipeline/execution/tu51t6j1psso',
 'PipelineExecutionDisplayName': 'execution-1724009388157',
 'PipelineExecutionStatus': 'Succeeded',
 'PipelineExperimentConfig': {'ExperimentName': 'newspaperpipeline',
  'TrialName': 'tu51t6j1psso'},
 'CreationTime': datetime.datetime(2024, 8, 18, 19, 29, 48, 101000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2024, 8, 18, 19, 42, 33, 175000, tzinfo=tzlocal()),
 'CreatedBy': {'UserProfileArn': 'arn:aws:sagemaker:us-east-1:583418938145:user-profile/d-diygkqr8zm5i/gl-demo-userv14062024',
  'UserProfileName': 'gl-demo-userv14062024',
  'DomainId': 'd-diygkqr8zm5i',
  'IamIdentity': {'Arn': 'arn:aws:sts::583418938145:assumed-role/AmazonSageMaker-ExecutionRole-20240611T124233/SageMaker',
   'PrincipalId': 'AROAYPVT2QMQUS3K5LR3H:SageMaker'}},
 'LastModifiedBy': {'UserProfil
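
Because `describe()` returns only a snapshot, following an execution through to completion means calling it repeatedly. The sketch below polls any callable that returns a dictionary with a `PipelineExecutionStatus` key, such as `execution.describe`, until the status is terminal. The function name `wait_for_pipeline`, the 30-second interval, and the 60-poll limit are assumptions chosen for illustration.

```python
import time

# Statuses after which a pipeline execution no longer changes
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def wait_for_pipeline(describe, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll describe() until the execution reaches a terminal status."""
    for _ in range(max_polls):
        status = describe()["PipelineExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Execution still running after {max_polls} polls")
```

In the notebook this would be called as `wait_for_pipeline(execution.describe)`. The execution object also offers `execution.wait()` for blocking until completion and `execution.list_steps()` for per-step status, which may be enough on their own.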