### Assignment 2.3. Custom Processor Instance: FrameworkProcessor

* <span style="padding:4px; background-color: #f2a68a; color: #000;"><strong>Assignment 2.3</strong></span> We used an instance of [`SKLearnProcessor`](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) to run the preprocessing script, but it offers no way to add dependencies to the processing container. Modify the code to use an instance of [`FrameworkProcessor`](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor) instead. This class lets you specify a directory with a `requirements.txt` file listing the dependencies to install in the target container before the processing job starts.

**Steps**:

0. Set up the basics for the pipeline to run: imports and previous code
1. Create requirements.txt
2. Create Dockerfile
3. Build Docker image
4. Upload Docker image to Amazon ECR
5. **Define FrameworkProcessor**
6. **Run processing job**
7. Check results

In [25]:
%load_ext autoreload
%autoreload 2

import sys
from pathlib import Path

import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CacheConfig, ProcessingStep
from sagemaker.workflow.pipeline_definition_config import PipelineDefinitionConfig
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.sklearn.estimator import SKLearn

BUCKET = "mlschool-davidorti"

region = boto3.Session().region_name
pipeline_session = PipelineSession(default_bucket=BUCKET)
sagemaker_session = sagemaker.session.Session()
role = sagemaker.get_execution_role()
pipeline_definition_config = PipelineDefinitionConfig(use_custom_job_prefix=True)

# Make the local code directory importable from the notebook
ASSIGNMENT_CODE_DIR = Path("code")
if str(ASSIGNMENT_CODE_DIR) not in sys.path:
    sys.path.append(str(ASSIGNMENT_CODE_DIR))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [9]:
sys.path

['/root/ml.school/penguins/assignments/assignment_23',
 '/usr/local/lib/python38.zip',
 '/usr/local/lib/python3.8',
 '/usr/local/lib/python3.8/lib-dynload',
 '',
 '/usr/local/lib/python3.8/site-packages',
 '/usr/local/lib/python3.8/site-packages/IPython/extensions',
 '/root/.ipython',
 'code']

1. First, create the `requirements.txt` file so that the new processing container can install a few dependencies at startup. The package list below is only there to test the mechanism.

In [11]:
%%writefile {ASSIGNMENT_CODE_DIR}/requirements.txt

# Core libraries
numpy==1.21.0
pandas==1.3.0
matplotlib==3.4.2
scipy==1.7.0

# Machine Learning libraries
scikit-learn==0.24.2
# tensorflow==2.5.0
# keras==2.4.3

# Deep Learning Visualization
# tensorboard==2.5.0

# Natural Language Processing
# nltk==3.6.2
# spacy==3.1.0

# Data visualization
seaborn==0.11.1
# plotly==5.1.0

Writing code/requirements.txt
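
Every active line in the file above is pinned with `==`, so a quick local check before submitting the job can catch a typo or an unpinned entry. The following is a minimal sketch that is not part of the original notebook. The helper name `parse_pinned_requirements` is hypothetical, and it only understands the simple `name==version` form used here.

```python
def parse_pinned_requirements(text):
    """Return {package: version} for every active `name==version` line.

    Blank lines and lines starting with `#` are skipped. Any other line
    raises ValueError, so unpinned or malformed entries are caught early.
    """
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if not sep or not name.strip() or not version.strip():
            raise ValueError(f"Not a pinned requirement: {line!r}")
        pins[name.strip()] = version.strip()
    return pins


# A shortened copy of the file above, used only to demonstrate the helper.
sample = """
# Core libraries
numpy==1.21.0
pandas==1.3.0
# tensorflow==2.5.0
seaborn==0.11.1
"""
pins = parse_pinned_requirements(sample)
print(pins)
```

In the notebook, the same helper could be called on `(ASSIGNMENT_CODE_DIR / "requirements.txt").read_text()`.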


2. Create the Dockerfile

In [18]:
sklearn_image_uri = image_uris.retrieve(
    framework='sklearn',
    region=region,
    version='0.23-1',
    image_scope='training'
)
sklearn_image_uri

'141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3'
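
The returned URI follows the usual ECR layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. Splitting it into parts makes it easier to see which registry hosts the base image, instead of treating the string as opaque. This is a small sketch under that assumption; the helper name `split_ecr_uri` is hypothetical and it only handles the standard `amazonaws.com` form.

```python
import re

# Assumed layout: <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def split_ecr_uri(uri):
    """Split an ECR image URI into account, region, repository and tag."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognised ECR image URI: {uri!r}")
    return match.groupdict()


parts = split_ecr_uri(
    "141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
)
print(parts)
```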

In [19]:
%%writefile {ASSIGNMENT_CODE_DIR}/Dockerfile

FROM 141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3

COPY code/requirements.txt /opt/ml/processing/requirements.txt

RUN pip install -r /opt/ml/processing/requirements.txt

Overwriting code/Dockerfile


3 & 4. Not tested or used in the end, as I just went with a prebuilt image from AWS's registry, so the Dockerfile from step 2 was never built.
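
For reference, steps 3 and 4 would typically come down to three shell commands: log in to ECR, build the image, and push it. The sketch below only assembles those commands as strings and does not run them, so it is just as untested against AWS as the steps themselves. The helper name, the account ID `123456789012`, and the repository name `penguins-processing` are hypothetical placeholders, and the target repository is assumed to exist already. The `-f code/Dockerfile .` form is chosen because the Dockerfile copies `code/requirements.txt`, which implies a build context one level above `code`.

```python
def ecr_build_and_push_commands(account_id, region, repository, tag="latest",
                                dockerfile="code/Dockerfile", context="."):
    """Return the shell commands to build an image and push it to ECR."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image = f"{registry}/{repository}:{tag}"
    return [
        # Authenticate the local Docker client against the target registry
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        # Build from the parent of `code` so that COPY code/... resolves
        f"docker build -f {dockerfile} -t {image} {context}",
        # Upload the tagged image
        f"docker push {image}",
    ]


# Placeholder values, not taken from a real account
commands = ecr_build_and_push_commands(
    account_id="123456789012",
    region="eu-west-1",
    repository="penguins-processing",
)
for command in commands:
    print(command)
```

Pulling the base image named in the `FROM` line may additionally require a `docker login` against the registry that hosts it.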

5. Create the `FrameworkProcessor`

For reference, this is the previously used `SKLearnProcessor`:

```python3
processor = SKLearnProcessor(
    base_job_name="penguins-preprocessing",
    framework_version="0.23-1",
    instance_type="ml.t3.medium",
    instance_count=1,
    role=role,
    sagemaker_session=pipeline_session
)
```

Copy the preprocessing script into the local `code` directory as well:

In [13]:
!cp ../../code/preprocessor.py ./code/preprocessor.py

In [14]:
!ls code

Dockerfile  preprocessor.py  requirements.txt


In [16]:
S3_LOCATION = f"s3://{BUCKET}/penguins"

dataset_location = ParameterString(
    name="dataset_location",
    default_value=f"{S3_LOCATION}/data.csv",
)

cache_config = CacheConfig(
    enable_caching=True, 
    expire_after="15d"
)

Define the Processor, ProcessingStep and Pipeline:

In [27]:
# processor
sklearn_version = '0.23-1'
processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version=sklearn_version,
    base_job_name="penguins-preprocessing-assignment23",
    image_uri=sklearn_image_uri,
    instance_count=1,
    instance_type="ml.t3.medium",
    role=role,
    sagemaker_session=pipeline_session
)

# processing step
preprocess_data_step = ProcessingStep(
    name="preprocess-data-assignment23",
    step_args=processor.run(
        code=f"{ASSIGNMENT_CODE_DIR}/preprocessor.py",
        inputs=[
            ProcessingInput(source=dataset_location, destination="/opt/ml/processing/input"),  
        ],
        outputs=[
            ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
            ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
            ProcessingOutput(output_name="pipeline", source="/opt/ml/processing/pipeline"),
            ProcessingOutput(output_name="classes", source="/opt/ml/processing/classes"),
            ProcessingOutput(output_name="train-baseline", source="/opt/ml/processing/train-baseline"),
            ProcessingOutput(output_name="test-baseline", source="/opt/ml/processing/test-baseline"),
        ]
    ),
    cache_config=cache_config
)

# pipeline
session2_pipeline_assignment23 = Pipeline(
    name="penguins-session2-pipeline-assignment23",
    parameters=[dataset_location],
    steps=[preprocess_data_step],
    pipeline_definition_config=pipeline_definition_config,
    sagemaker_session=pipeline_session
)

# upsert: update if already exists, insert if it doesn't exist yet
session2_pipeline_assignment23.upsert(role_arn=role)

INFO:sagemaker.processing:Uploaded None to s3://mlschool-davidorti/penguins-session2-pipeline-assignment23/code/7297b4b06c23db2e733e9a063268df16/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://mlschool-davidorti/penguins-session2-pipeline-assignment23/code/7f0e78074facc1d2d05c3b54b0d9f907/runproc.sh


Using provided s3_resource



{'PipelineArn': 'arn:aws:sagemaker:eu-west-1:833724363691:pipeline/penguins-session2-pipeline-assignment23',
 'ResponseMetadata': {'RequestId': 'f0c25882-c311-4227-aaeb-6f6d07495e21',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'f0c25882-c311-4227-aaeb-6f6d07495e21',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '107',
   'date': 'Fri, 29 Sep 2023 10:37:28 GMT'},
  'RetryAttempts': 0}}
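
The log line `Uploaded None to ...` suggests that no `source_dir` was passed to `processor.run`, in which case the `requirements.txt` written in step 1 is probably not shipped with the job. `FrameworkProcessor.run` accepts a `source_dir` argument. When it is set, `code` is given relative to that directory, and a `requirements.txt` at its root should be installed before the script runs. The sketch below only assembles and checks those two arguments on a throwaway directory. The helper name `framework_run_kwargs` is hypothetical and not part of the SageMaker SDK.

```python
import tempfile
from pathlib import Path


def framework_run_kwargs(source_dir, entry_point="preprocessor.py"):
    """Build the `code`/`source_dir` arguments for FrameworkProcessor.run.

    Raises FileNotFoundError when the entry point or requirements.txt is
    missing, since the job would then start without them.
    """
    source_dir = Path(source_dir)
    for required in (entry_point, "requirements.txt"):
        if not (source_dir / required).is_file():
            raise FileNotFoundError(f"{required} not found in {source_dir}")
    # `code` is relative to `source_dir`, not to the notebook directory
    return {"code": entry_point, "source_dir": str(source_dir)}


# Demonstration on a temporary directory standing in for `code/`
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "preprocessor.py").write_text("print('preprocessing')\n")
    Path(tmp, "requirements.txt").write_text("numpy==1.21.0\n")
    run_kwargs = framework_run_kwargs(tmp)
    print(run_kwargs)
```

In the notebook, the result of `framework_run_kwargs(ASSIGNMENT_CODE_DIR)` could be unpacked into `processor.run` together with the same `inputs` and `outputs` as before.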

6. Run pipeline

In [28]:
session2_pipeline_assignment23.start()

_PipelineExecution(arn='arn:aws:sagemaker:eu-west-1:833724363691:pipeline/penguins-session2-pipeline-assignment23/execution/m94e2v3jwaqv', sagemaker_session=<sagemaker.workflow.pipeline_context.PipelineSession object at 0x7f9868d80fa0>)
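
Step 7 (checking the results) is not shown in this notebook. One option is to keep the object returned by `start()`, call its `wait()` method, and then inspect `list_steps()`, which returns a list of dictionaries with keys such as `StepName` and `StepStatus`. The sketch below condenses such a list into a name-to-status mapping. The helper name `summarize_steps` is hypothetical, and the sample data is made up to match that shape; it is not real output from this pipeline.

```python
def summarize_steps(steps):
    """Map each step name to its status, appending the failure reason if any."""
    summary = {}
    for step in steps:
        status = step.get("StepStatus", "Unknown")
        reason = step.get("FailureReason")
        summary[step["StepName"]] = f"{status}: {reason}" if reason else status
    return summary


# Hypothetical sample shaped like a list_steps() result; not real output.
sample_steps = [
    {"StepName": "preprocess-data-assignment23", "StepStatus": "Succeeded"},
]
step_summary = summarize_steps(sample_steps)
print(step_summary)
```

With a real execution, this would be called as `summarize_steps(execution.list_steps())` after `execution.wait()` returns.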