
## Task 4: Perform data processing with your own container

In the previous notebook, you used Amazon SageMaker Processing and the scikit-learn built-in container for data processing.

In this notebook, you set up the environment needed to run a scikit-learn script with your own processing container.

You create your own Docker image, build your processing container, and use the **ScriptProcessor** class from the Amazon SageMaker Python SDK to run a scikit-learn preprocessing script within the container.

Finally, you validate the data processing results saved in Amazon Simple Storage Service (Amazon S3).

### Task 4.1: Set up the environment

In this task, you import the required libraries and dependencies.

You also locate the lab's Amazon S3 bucket, which stores the outputs from the processing job, and get the execution role used to run the SageMaker Processing job.

In [None]:
#install-dependencies
import logging
import boto3
import sagemaker
import pandas as pd

sagemaker_logger = logging.getLogger("sagemaker")
sagemaker_logger.setLevel(logging.INFO)
sagemaker_logger.addHandler(logging.StreamHandler())

#Execution role to run the SageMaker Processing job
role = sagemaker.get_execution_role()
print("SageMaker Execution Role: ", role)

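#S3 bucket used to read the scikit-learn processing script and write processing job outputs
s3 = boto3.resource('s3')
bucket = None
for candidate in s3.buckets.all():
    if 'labdatabucket' in candidate.name:
        bucket = candidate.name
if bucket is None:
    raise RuntimeError("No S3 bucket with 'labdatabucket' in its name was found")
print("Bucket: ", bucket)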

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
SageMaker Execution Role:  arn:aws:iam::593048189292:role/SageMakerStudioRole
Bucket:  labstack-585a2104-a869-485b-84fd-41b-labdatabucket-b2yq7koihjmz


### Task 4.2: Create a processing container

In this task, you define and create a scikit-learn container from a Dockerfile.

### Task 4.3: Create a Dockerfile

In this task, you create a Docker directory and add the Dockerfile used to create the processing container. Because you are creating a scikit-learn container, you install pandas and scikit-learn.

In [None]:
%mkdir docker

In [None]:
%%writefile docker/Dockerfile
FROM public.ecr.aws/docker/library/python:3.10-slim-bullseye

RUN pip3 install joblib threadpoolctl pandas numpy scikit-learn==1.4.2
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

Writing docker/Dockerfile


### Task 4.4: Build the container image

In this task, you create a custom container image using the Amazon SageMaker Studio image build command line interface (CLI).

By using the Amazon SageMaker Studio image build CLI, you can build Amazon SageMaker compatible Docker images directly from your SageMaker Studio environments.

Install the SageMaker Studio image build package:

In [None]:
%pip install sagemaker-studio-image-build

Collecting sagemaker-studio-image-build
  Downloading sagemaker_studio_image_build-0.6.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sagemaker-studio-image-build
  Building wheel for sagemaker-studio-image-build (setup.py) ... done
  Created wheel for sagemaker-studio-image-build: filename=sagemaker_studio_image_build-0.6.0-py3-none-any.whl size=13454 sha256=69163d06c0070e289a01ebe9cff24543dd8893a47a66e9d39e8126c92ecd7a5b
  Stored in directory: /root/.cache/pip/wheels/69/7b/d1/1318b1530ee5322c9be00f206badea11b5148626ef58c0e0dc
Successfully built sagemaker-studio-image-build
Installing collected packages: sagemaker-studio-image-build
Successfully installed sagemaker-studio-image-build-0.6.0
Note: you may need to restart the kernel to use updated packages.


Replace the system copy of the libstdc++ library with the version from the conda environment by running the following commands.

In [None]:
%%sh
rm /usr/lib/x86_64-linux-gnu/libstdc++.so.6
cp /opt/conda/lib/libstdc++.so.6 /usr/lib/x86_64-linux-gnu/libstdc++.so.6

Navigate to the directory that contains your Dockerfile and run the sm-docker build command. This command automatically logs build output and returns the **Image URI** of your Docker image. This takes approximately 2 minutes to complete.

In [None]:
%%sh

cd docker

sm-docker build .

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
Created ECR repository sagemaker-studio-d-qpqmw0qemocd
...................[Container] 2024/09/08 12:29:16.607920 Running on CodeBuild On-demand

[Container] 2024/09/08 12:29:16.607934 Waiting for agent ping
[Container] 2024/09/08 12:29:19.820442 Waiting for DOWNLOAD_SOURCE
[Container] 2024/09/08 12:29:20.107903 Phase is DOWNLOAD_SOURCE
[Container] 2024/09/08 12:29:20.140613 CODEBUILD_SRC_DIR=/codebuild/output/src2434841079/src
[Container] 2024/09/08 12:29:20.141212 YAML location is /codebuild/output/src2434841079/src/buildspec.yml
[Container] 2024/09/08 12:29:20.144740 Setting HTTP client timeout to higher timeout for S3 source
[Container] 2024/09/08 12:29:20.144959 Processing environment variables
[Container] 2024/09/08 12:29:20.184277 No runtime version selected in buildspec.
[Container] 20

Next, copy the **Image URI** and paste it into a text editor of your choice.
You use this **Image URI** to create a **ScriptProcessor** instance.
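
As an alternative to copying the value by hand, the URI can be pulled out of the build log text. The following is a minimal sketch, assuming the log contains a line of the form `Image URI: <uri>`. The helper name `extract_image_uri` and the sample log are hypothetical, and the account ID, Region, repository, and tag in the sample are placeholders.

```python
import re

# Assumed log format: a line such as
# "Image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>"
IMAGE_URI_LINE = re.compile(r"^Image URI:\s*(\S+)\s*$", re.MULTILINE)


def extract_image_uri(build_log: str) -> str:
    """Return the last 'Image URI: ...' value found in the build log text."""
    matches = IMAGE_URI_LINE.findall(build_log)
    if not matches:
        raise ValueError("No 'Image URI:' line found in the build log")
    return matches[-1]


# Hypothetical log excerpt; every value in it is a placeholder.
sample_log = """\
[Container] Phase complete: POST_BUILD State: SUCCEEDED
Image URI: 123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo:latest
"""

image_uri = extract_image_uri(sample_log)
print(image_uri)
```

To apply this to a real build, the log text would first need to be captured, for example with `subprocess.run(["sm-docker", "build", "."], cwd="docker", capture_output=True, text=True).stdout`.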

### Task 4.5: Run the SageMaker processing job

In this task, you use the same preprocessed dataset from the previous notebook.

In [None]:
#import-data
shape = pd.read_csv("data/adult_data.csv", header=None)
shape.sample(5)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 610 | 33 | Local-gov | 217304 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 258 | 39 | Private | 281768 | HS-grad | 9 | Never-married | Other-service | Not-in-family | Black | Female | 0 | 0 | 40 | United-States | <=50K |
| 707 | 31 | Private | 83912 | Bachelors | 13 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 0 | 0 | 25 | Mexico | <=50K |
| 28 | 23 | Private | 134446 | HS-grad | 9 | Separated | Machine-op-inspct | Unmarried | Black | Male | 0 | 0 | 54 | United-States | <=50K |
| 539 | 52 | Private | 165001 | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |




You then use the SageMaker ScriptProcessor class to define and run a processing script as a processing job. Refer to [SageMaker ScriptProcessor](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor) for more information about this class.

To create a ScriptProcessor instance, you configure the following parameters:
- **base_job_name**: Prefix for the processing job name
- **command**: Command to run, in addition to any command-line flags
- **image_uri**: URI of the Docker image to use for the processing jobs
- **role**: SageMaker execution role
- **instance_count**: Number of instances to run the processing job
- **instance_type**: Type of Amazon Elastic Compute Cloud (Amazon EC2) instance used for the processing job

In the following code, replace **REPLACE_IMAGE_URI** with the URI from your text editor.

In [None]:
#sagemaker-script-processor
from sagemaker.processing import ScriptProcessor

# create a ScriptProcessor
script_processor = ScriptProcessor(
    base_job_name="own-processing-container",
    command=["python3"],
    image_uri="REPLACE_IMAGE_URI",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
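
SageMaker expects **image_uri** to be an Amazon Elastic Container Registry (Amazon ECR) path. The job output later in this notebook shows a ValidationException for a `public.ecr.aws` base image URI. The following sketch checks the pasted value before the job is submitted. The regular expression and the helper name `looks_like_ecr_uri` are assumptions for illustration. This is a format check only, not the validation that SageMaker itself performs.

```python
import re

# Assumed shape of a private Amazon ECR image URI:
# <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repository>[:tag | @sha256:digest]
ECR_URI = re.compile(
    r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com(\.cn)?"
    r"/[^\s:@]+(:[\w.-]+|@sha256:[0-9a-f]{64})?$"
)


def looks_like_ecr_uri(image_uri: str) -> bool:
    """Return True if the string has the shape of a private ECR image URI."""
    return ECR_URI.match(image_uri) is not None


candidates = [
    # Placeholder account ID, Region, and repository name
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo:latest",
    # The Dockerfile's base image, which is not a private ECR path
    "public.ecr.aws/docker/library/python:3.10-slim-bullseye",
    # The unreplaced placeholder from the cell above
    "REPLACE_IMAGE_URI",
]
results = {uri: looks_like_ecr_uri(uri) for uri in candidates}
for uri, ok in results.items():
    print(uri, "->", ok)
```

Only the first candidate passes the check. The value to paste is the URI returned by the build command, not the `FROM` image in the Dockerfile.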


Next, you use the ScriptProcessor.run() method to run the **sklearn_preprocessing.py** script as a processing job. This is the same script that you used in Task 3, but you are now running it on a custom container built from a base image. Refer to [ScriptProcessor.run()](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ScriptProcessor.run) for more information about this method.

For running the processing job, you configure the following parameters:
- **code**: Path of the preprocessing script
- **inputs**: Path of input data for the preprocessing script (Amazon S3 input location)
- **outputs**: Path of output for the preprocessing script (Amazon S3 output location)
- **arguments**: Command-line arguments to the preprocessing script (such as train test split ratio)

The processing job takes approximately 4–5 minutes to complete.
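
The **arguments** list is passed to the script on its command line. The **sklearn_preprocessing.py** script is not shown in this notebook, so the following is only a sketch of how such a script might read the flag, assuming it uses `argparse`. The function name `parse_args` and the default value of 0.3 are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags that the processing job passes to the script."""
    parser = argparse.ArgumentParser()
    # The default of 0.3 is an illustrative assumption, not taken from the lab script.
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    return parser.parse_args(argv)


# Same argument list as the one passed to ScriptProcessor.run()
args = parse_args(["--train-test-split-ratio", "0.2"])
print("Test split ratio:", args.train_test_split_ratio)
```

`argparse` converts the dashes in the flag name to underscores, so the value is read as `args.train_test_split_ratio`.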

In [None]:
#processing-job
import os
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Amazon S3 path prefix
input_raw_data_prefix = "data/input"
output_preprocessed_data_prefix = "data/output"

# Run the processing job
script_processor.run(
    code="sklearn_preprocessing.py",
    inputs=[ProcessingInput(source="s3://" + os.path.join(bucket, input_raw_data_prefix, "adult_data.csv"),
                            destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data",
                         source="/opt/ml/processing/train",
                         destination="s3://" + os.path.join(bucket, output_preprocessed_data_prefix, "train")),
        ProcessingOutput(output_name="test_data",
                         source="/opt/ml/processing/test",
                         destination="s3://" + os.path.join(bucket, output_preprocessed_data_prefix, "test")),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
    logs=False,
    wait=False,
)
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)

Creating processing-job with name own-processing-container-2024-09-08-12-37-50-133
INFO:sagemaker:Creating processing-job with name own-processing-container-2024-09-08-12-37-50-133
Please check the troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html#sagemaker-python-sdk-troubleshooting-create-processing-job
ERROR:sagemaker:Please check the troubleshooting guide for common errors: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-python-sdk-troubleshooting.html#sagemaker-python-sdk-troubleshooting-create-processing-job


ClientError: An error occurred (ValidationException) when calling the CreateProcessingJob operation: Invalid image URI public.ecr.aws/docker/library/python:3.10-slim-bullseye. Please provide a valid Amazon Elastic Container Registry path of the Docker image to run.

### Task 4.6: Validate the data processing results

Validate the output of the processing job that you ran by looking at the first five rows of the train and test output datasets.
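
Because the job was started with `wait=False`, the output files exist only after the job finishes. The following sketch polls the job status until it reaches a terminal state. The helper name `wait_for_processing_job` and its parameters are hypothetical. `describe` is assumed to be a zero-argument callable that returns the job description dictionary, such as `script_processor.jobs[-1].describe`. The demonstration uses fake responses so that it runs without AWS access.

```python
import time

# Terminal values of ProcessingJobStatus in the job description
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe, poll_seconds=30, timeout_seconds=900,
                            sleep=time.sleep):
    """Poll describe() until the job reaches a terminal state; return that state."""
    waited = 0
    while True:
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still '{status}' after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake job descriptions standing in for real describe() responses
fake_responses = iter([
    {"ProcessingJobStatus": "InProgress"},
    {"ProcessingJobStatus": "InProgress"},
    {"ProcessingJobStatus": "Completed"},
])
final_status = wait_for_processing_job(
    lambda: next(fake_responses), poll_seconds=0, sleep=lambda seconds: None
)
print("Final status:", final_status)
```

In the notebook, the call would be `wait_for_processing_job(script_processor.jobs[-1].describe)`. The job object in the SageMaker Python SDK also has a `wait()` method that blocks until the job finishes.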

In [None]:
#view-train-dataset
print("Top 5 rows from s3://{}/{}/train/".format(bucket, output_preprocessed_data_prefix))
!aws s3 cp --quiet s3://$bucket/$output_preprocessed_data_prefix/train/train_features.csv - | head -n5

In [None]:
#view-test-dataset
print("Top 5 rows from s3://{}/{}/test/".format(bucket, output_preprocessed_data_prefix))
!aws s3 cp --quiet s3://$bucket/$output_preprocessed_data_prefix/test/test_features.csv - | head -n5

### Conclusion

Congratulations! You have successfully built your own processing container and used SageMaker Processing to run the processing job.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.