# Project Title: Image Classification Using AWS SageMaker

This notebook lists all the steps that you need to complete this project. You will need to complete all the TODOs in this notebook as well as in the README and the two Python scripts included with the starter code.


**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.

**Note:** This notebook has a bunch of code and markdown cells with TODOs that you have to complete. These are meant to be helpful guidelines for you to finish your project while meeting the requirements in the project rubrics. Feel free to change the order of these TODOs and use more than one TODO code cell to do all your tasks.

In [44]:
# TODO: Install any packages that you might need
# For instance, you will need the smdebug package
!pip install smdebug
# Install torch for working with PyTorch models
!pip install torch torchvision
# Install tqdm for progress bars
!pip install tqdm
# Install Pillow for image processing
!pip install Pillow
# Install numpy for numerical operations
!pip install numpy
# argparse ships with the Python 3 standard library; this installs an old PyPI backport and is not strictly needed
!pip install argparse
# Install boto3 for S3 interactions
!pip install boto3
# Install bokeh for interactive visualizations (smdebug is already installed above)
!pip install bokeh

Collecting argparse
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


In [45]:
# TODO: Import any packages that you might need
# For instance you will need Boto3 and Sagemaker
import os
import boto3
import IPython
import sagemaker
# Importing SageMaker components for role management, hyperparameter tuning, and model deployment
from sagemaker import get_execution_role
from sagemaker.tuner import CategoricalParameter, ContinuousParameter, IntegerParameter, HyperparameterTuner
from sagemaker.pytorch import PyTorch
# Debugger and Profiler configurations for SageMaker
from sagemaker.debugger import Rule, DebuggerHookConfig, TensorBoardOutputConfig, CollectionConfig, ProfilerRule, rule_configs, ProfilerConfig, FrameworkProfile
# Analytics for hyperparameter tuning jobs
from sagemaker.analytics import HyperparameterTuningJobAnalytics
# Importing specific PyTorch model and predictor from SageMaker
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor
# Importing SageMaker Debugger and Profiler related tools
from smdebug.core.modes import ModeKeys
from smdebug.trials import create_trial
# Utilities for visualizing training job performance and analysis
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
# For plotting and visualizations
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot
# For image processing
from PIL import Image
import io

## Dataset
TODO: Explain what dataset you are using for this project. Maybe even give a small overview of the classes, class distributions, etc. that can help anyone not familiar with the dataset get a better understanding of it.
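
To gather numbers for that overview, a small standard-library sketch can count images per class in each split. It assumes the layout `dogImages/<split>/<class>/<image>`, matching the `dogImages/test/003.Airedale_terrier/...` path used later in this notebook. The function names and the set of image extensions are illustrative choices, not part of the starter code.

```python
from collections import Counter
from pathlib import Path

# Illustrative assumption: which file extensions count as images.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def class_distribution(root):
    """Return {split: Counter({class_name: n_images})} for a root/<split>/<class>/ layout."""
    dist = {}
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts = Counter()
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            # Count only regular files whose extension looks like an image.
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTS
            )
        dist[split_dir.name] = counts
    return dist


def summarize(dist):
    """Print class count, image total, and smallest/largest class size per split."""
    for split, counts in dist.items():
        total = sum(counts.values())
        print(f"{split}: {len(counts)} classes, {total} images")
        if counts:
            print(f"  smallest class: {min(counts.values())}, largest class: {max(counts.values())}")
```

After extraction, `summarize(class_distribution("dogImages"))` prints a per-split summary, and `class_distribution("dogImages")["test"].most_common(5)` lists the best-represented classes in a split.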

In [53]:
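# Command to download and unzip the data
!wget --no-check-certificate https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
# Note: `unzip -f` only freshens files that already exist on disk, so on a fresh instance it
# extracts nothing (hence the bare "Archive:" line below); the next cell does the actual extraction.
!unzip -f dogImages.zip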

--2024-10-27 10:30:56--  https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.193.16, 52.219.113.120, 52.219.220.232, ...
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.193.16|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 1132023110 (1.1G) [application/zip]
Saving to: ‘dogImages.zip’


2024-10-27 10:31:20 (46.0 MB/s) - ‘dogImages.zip’ saved [1132023110/1132023110]

Archive:  dogImages.zip


In [64]:
import zipfile

with zipfile.ZipFile('dogImages.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

In [5]:
import sagemaker
import os
# Initialize a SageMaker session to manage resources
session = sagemaker.Session()
# Get the default S3 bucket for the session
bucket = session.default_bucket()
print("Default Bucket: {}".format(bucket))
# Retrieve the AWS region for the session
region = session.boto_region_name
print("AWS Region: {}".format(region))
# Get the IAM role that SageMaker will use to access resources
role = sagemaker.get_execution_role()
print("RoleArn: {}".format(role))
# Define a prefix for organizing uploaded data in S3
prefix = "dogImages"  
# Upload the local dogImages directory to S3
inputs = session.upload_data(path="./dogImages", bucket=bucket, key_prefix=prefix)
print(f"input: {inputs}")

Default Bucket: sagemaker-us-east-1-637163362320
AWS Region: us-east-1
RoleArn: arn:aws:iam::637163362320:role/service-role/AmazonSageMaker-ExecutionRole-20241026T151604
input: s3://sagemaker-us-east-1-637163362320/dogImages


## Hyperparameter Tuning
**TODO:** This is the part where you will fine-tune a pretrained model with hyperparameter tuning. Remember that you have to tune a minimum of two hyperparameters. However, you are encouraged to tune more. You are also encouraged to explain why you chose to tune those particular hyperparameters and their ranges.

**Note:** You will need to use the `hpo.py` script to perform hyperparameter tuning.
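
One way to reason about the learning-rate range (0.001 to 0.1 in the next cell) is to check how uniformly drawn samples spread across it. The sketch below computes, analytically, the share of draws that land below 0.01 when sampling uniformly on a linear scale versus a logarithmic scale. It illustrates why log scaling is common for learning rates; whether it matters for this search is a judgment call. If you want that behaviour, `ContinuousParameter` accepts `scaling_type="Logarithmic"`.

```python
import math


def fraction_below(threshold, low, high, scale="linear"):
    """Share of a uniform draw over [low, high] that lands below threshold.

    scale="linear" samples uniformly in the value itself;
    scale="log" samples uniformly in log(value).
    """
    if scale == "linear":
        return (threshold - low) / (high - low)
    if scale == "log":
        return (math.log(threshold) - math.log(low)) / (math.log(high) - math.log(low))
    raise ValueError(f"unknown scale: {scale!r}")


for scale in ("linear", "log"):
    share = fraction_below(0.01, 0.001, 0.1, scale)
    print(f"{scale:>6}: {share:.1%} of samples below 0.01")
```

On a linear scale only about 9% of the range lies in the lowest decade (0.001 to 0.01), while on a log scale each decade gets half of the draws.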

In [42]:
# TODO: Declare your HP ranges, metrics etc.
from sagemaker.tuner import ContinuousParameter, IntegerParameter, CategoricalParameter

# Define hyperparameter ranges for tuning
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.001, 0.1),  
    "batch_size": CategoricalParameter([16, 64]), 
    "epochs": IntegerParameter(5, 10),  
}
# Objective: the accuracy that hpo.py prints on its "Test Loss: ... Accuracy: ...%" log line.
# SageMaker reads it from the job logs with the regex below, so the regex must match the script's output exactly.
objective_metric_name = "Accuracy"
objective_type = "Maximize"
metric_definitions = [{"Name": "Accuracy", "Regex": "Test Loss: .*? Accuracy: ([0-9\\.]+)%"}] 
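
Because the tuner only sees what the regex captures, it is worth checking the pattern locally before launching jobs. This sketch applies the same pattern (written as a raw string) to hypothetical log lines; the real format is whatever `hpo.py` prints. Taking the last match mirrors a per-epoch log where the final line is the finished model's score; it is not a claim about how SageMaker selects among multiple matches.

```python
import re

# Same pattern as metric_definitions above; "[0-9\\.]+" in a normal string equals r"[0-9\.]+".
ACCURACY_REGEX = r"Test Loss: .*? Accuracy: ([0-9\.]+)%"


def extract_accuracy(log_text):
    """Return the last accuracy value matched in log_text, or None if nothing matches."""
    matches = re.findall(ACCURACY_REGEX, log_text)
    return float(matches[-1]) if matches else None


# Hypothetical log lines, for illustration only.
sample_log = (
    "Epoch 1\n"
    "Test Loss: 0.0213, Accuracy: 71.4%\n"
    "Test Loss: 0.0187, Accuracy: 78.5%\n"
)
print(extract_accuracy(sample_log))
```

A line such as `Accuracy: 78.5 %` (with a space before the percent sign) would not match, and the tuner would then receive no objective value.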

In [43]:
# TODO: Create estimators for your HPs
# Define the PyTorch estimator with your configuration
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner

estimator = PyTorch(
    entry_point="hpo.py",
    role=sagemaker.get_execution_role(),
    py_version='py36',
    framework_version="1.8",
    instance_count=1,
    instance_type="ml.g4dn.xlarge"
)
# Configure the tuner to run 6 training jobs in total, with at most 2 running in parallel
tuner = HyperparameterTuner(
    estimator = estimator,
    objective_metric_name = objective_metric_name,
    early_stopping_type = "Auto",
    hyperparameter_ranges = hyperparameter_ranges,
    metric_definitions = metric_definitions,
    max_jobs=6,
    max_parallel_jobs=2,
    objective_type = objective_type,
)

In [44]:
# TODO: Fit your HP Tuner
tuner.fit({'training': inputs, 'validation': inputs})

No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config


..........!


In [45]:
# TODO: Get the best estimators and the best HPs
# Retrieve the best estimator from the tuning job
best_estimator = tuner.best_estimator()
# Get the hyperparameters of the best trained model
best_hyperparameters = best_estimator.hyperparameters()
print(best_hyperparameters)  


2024-10-27 15:44:32 Starting - Preparing the instances for training
2024-10-27 15:44:32 Downloading - Downloading the training image
2024-10-27 15:44:32 Training - Training image download completed. Training in progress.
2024-10-27 15:44:32 Uploading - Uploading generated training model
2024-10-27 15:44:32 Completed - Resource reused by training job: pytorch-training-241027-1519-003-db3d7d00
{'_tuning_objective_metric': '"Accuracy"', 'batch_size': '"64"', 'epochs': '9', 'lr': '0.001506843104990693', 'sagemaker_container_log_level': '20', 'sagemaker_estimator_class_name': '"PyTorch"', 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"', 'sagemaker_job_name': '"pytorch-training-2024-10-27-15-19-28-505"', 'sagemaker_program': '"hpo.py"', 'sagemaker_region': '"us-east-1"', 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-637163362320/pytorch-training-2024-10-27-15-19-28-505/source/sourcedir.tar.gz"'}


## Model Profiling and Debugging
TODO: Using the best hyperparameters, create and fine-tune a new model

**Note:** You will need to use the `train_model.py` script to perform model profiling and debugging.

In [49]:
print(best_estimator.hyperparameters())

{'_tuning_objective_metric': '"Accuracy"', 'batch_size': '"64"', 'epochs': '9', 'lr': '0.001506843104990693', 'sagemaker_container_log_level': '20', 'sagemaker_estimator_class_name': '"PyTorch"', 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"', 'sagemaker_job_name': '"pytorch-training-2024-10-27-15-19-28-505"', 'sagemaker_program': '"hpo.py"', 'sagemaker_region': '"us-east-1"', 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-637163362320/pytorch-training-2024-10-27-15-19-28-505/source/sourcedir.tar.gz"'}


In [50]:
# TODO: Set up debugging and profiling rules and hooks 

from sagemaker.debugger import DebuggerHookConfig, Rule, CollectionConfig
from sagemaker.debugger import rule_configs

# Extract the best hyperparameters. Values come back JSON-encoded (e.g. '"64"' for batch_size), so strip the quotes before casting
best_hyperparameters = {
    "batch-size": int(best_estimator.hyperparameters()["batch_size"].replace('"', "")),   
    "epochs": int(best_estimator.hyperparameters()["epochs"]),   
    "lr": float(best_estimator.hyperparameters()["lr"]),  
}

print(best_hyperparameters)

{'batch-size': 64, 'epochs': 9, 'lr': 0.001506843104990693}
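
The cell above strips quotes by hand for each key. A more general sketch decodes every tuned value with `json.loads`, skips SageMaker's bookkeeping entries, and optionally renames keys, as the cell above does for `batch-size`. The helper name and the rename mapping are illustrative, not part of the SageMaker SDK.

```python
import json


def decode_hyperparameters(raw, rename=None):
    """Decode SageMaker's JSON-encoded hyperparameter strings into Python values."""
    rename = rename or {}
    decoded = {}
    for key, value in raw.items():
        # Skip bookkeeping entries such as sagemaker_program or _tuning_objective_metric.
        if key.startswith(("sagemaker_", "_")):
            continue
        try:
            value = json.loads(value)  # '"64"' -> '64', '9' -> 9, '0.0015' -> 0.0015
        except (TypeError, json.JSONDecodeError):
            pass  # leave values that are not JSON text unchanged
        if isinstance(value, str):
            # Categorical values decode to strings; try int, then float.
            for cast in (int, float):
                try:
                    value = cast(value)
                    break
                except ValueError:
                    continue
        decoded[rename.get(key, key)] = value
    return decoded


# Subset of the best_estimator.hyperparameters() output shown earlier.
raw = {
    "_tuning_objective_metric": '"Accuracy"',
    "batch_size": '"64"',
    "epochs": "9",
    "lr": "0.001506843104990693",
    "sagemaker_program": '"hpo.py"',
}
print(decode_hyperparameters(raw, rename={"batch_size": "batch-size"}))
```

In the notebook the call would be `decode_hyperparameters(best_estimator.hyperparameters(), rename={"batch_size": "batch-size"})`.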


In [53]:
# Define Debugger rules (vanishing gradients, overfitting, overtraining, poor weight initialization) and a ProfilerReport rule
from sagemaker.debugger import DebuggerHookConfig, Rule, CollectionConfig, ProfilerConfig, FrameworkProfile
from sagemaker.debugger import rule_configs, ProfilerRule
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

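# Sample system metrics every 500 ms and profile framework operations for 10 steps
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)

# Save Debugger tensors every 100 training steps and every 10 evaluation steps
debugger_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "100", "eval.save_interval": "10"}
)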

Framework profiling will be deprecated from tensorflow 2.12 and pytorch 2.0 in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
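
The `save_interval` values above decide how many points a Debugger loss curve will have. This sketch estimates which training steps get saved, assuming steps are counted continuously across epochs and a step is saved when its number is a multiple of the interval. `num_samples=6400` is a hypothetical training-set size; substitute the real size of your training split.

```python
import math


def saved_steps(num_samples, batch_size, epochs, save_interval):
    """Return the 0-based training steps that would be saved.

    Assumption: steps are counted continuously across epochs and a step is
    saved when step % save_interval == 0.
    """
    steps_per_epoch = math.ceil(num_samples / batch_size)  # last batch may be partial
    total_steps = steps_per_epoch * epochs
    return list(range(0, total_steps, save_interval))


# Hypothetical training-set size; batch size and epochs match best_hyperparameters above.
steps = saved_steps(num_samples=6400, batch_size=64, epochs=9, save_interval=100)
print(len(steps), steps[:3])
```

Under these assumptions, roughly one training step per epoch is saved, so a loss curve would have only about nine points; a smaller `train.save_interval` gives a denser curve at the cost of more data written to S3.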


In [56]:
# TODO: Create and fit an estimator
estimator = PyTorch(
    entry_point="train_model.py",
    framework_version="1.8",
    py_version="py36",
    role=role,
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    hyperparameters=best_hyperparameters,
    rules=rules,
    profiler_config=profiler_config,
    debugger_hook_config=debugger_config,
)

In [59]:
# Fit an estimator
estimator.fit({"training": inputs}, wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2024-10-27-17-40-03-028


2024-10-27 17:40:06 Starting - Starting the training job...
2024-10-27 17:40:35 Starting - Preparing the instances for training
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
...
2024-10-27 17:40:55 Downloading - Downloading input data......
2024-10-27 17:41:55 Downloading - Downloading the training image..................
2024-10-27 17:44:55 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-10-27 17:44:55,547 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-10-27 17:44:55,580 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-10-27 17:44:55,584 sagemaker_pytorch_container.training INFO     Invoking user training script.
2024-1

In [77]:
# TODO: Plot a debugging output.
training_job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=training_job_name)

print(training_job_name)
print(region)
print(description)

pytorch-training-2024-10-27-17-40-03-028
us-east-1
{'TrainingJobName': 'pytorch-training-2024-10-27-17-40-03-028', 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:637163362320:training-job/pytorch-training-2024-10-27-17-40-03-028', 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-east-1-637163362320/pytorch-training-2024-10-27-17-40-03-028/output/model.tar.gz'}, 'TrainingJobStatus': 'Completed', 'SecondaryStatus': 'Completed', 'HyperParameters': {'batch-size': '64', 'epochs': '9', 'lr': '0.001506843104990693', 'sagemaker_container_log_level': '20', 'sagemaker_job_name': '"pytorch-training-2024-10-27-17-40-03-028"', 'sagemaker_program': '"train_model.py"', 'sagemaker_region': '"us-east-1"', 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-637163362320/pytorch-training-2024-10-27-17-40-03-028/source/sourcedir.tar.gz"'}, 'AlgorithmSpecification': {'TrainingImage': '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8-gpu-py36', 'TrainingInputMode': 'File', 'En
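
The full `description` dict is hard to scan. DescribeTrainingJob responses include `DebugRuleEvaluationStatuses` and `ProfilerRuleEvaluationStatuses` lists, and a small helper can pull out each rule's status (for example `NoIssuesFound` or `IssuesFound`), which is a starting point for the anomaly questions that follow. The helper names are illustrative.

```python
def rule_statuses(description):
    """Map rule name -> evaluation status from a DescribeTrainingJob response dict."""
    statuses = {}
    for key in ("DebugRuleEvaluationStatuses", "ProfilerRuleEvaluationStatuses"):
        for entry in description.get(key, []):
            statuses[entry["RuleConfigurationName"]] = entry["RuleEvaluationStatus"]
    return statuses


def flagged_rules(description):
    """Names of rules whose evaluation reported IssuesFound."""
    return [name for name, status in rule_statuses(description).items()
            if status == "IssuesFound"]
```

With the `description` from the cell above, `print(rule_statuses(description))` shows every rule at once and `flagged_rules(description)` lists only the ones worth investigating. Rules still marked `InProgress` may need another `describe_training_job` call a few minutes later.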

**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?

In [80]:
# TODO: Display the profiler output

rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"profiler output {rule_output_path}")

profiler output s3://sagemaker-us-east-1-637163362320/pytorch-training-2024-10-27-17-40-03-028/rule-output
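
Printing the S3 prefix does not display the report itself. One option is to copy the rule output locally first, for example with `!aws s3 cp {rule_output_path} ./rule-output --recursive` in a notebook cell (`./rule-output` is an assumed local path). The ProfilerReport rule typically writes an HTML file named `profiler-report.html` under a `profiler-output` folder; the helper below searches the local copy for it, so it does not depend on the exact nesting.

```python
from pathlib import Path


def find_profiler_reports(local_dir):
    """Return every profiler-report.html under local_dir, sorted by path."""
    return sorted(Path(local_dir).rglob("profiler-report.html"))
```

For example, `reports = find_profiler_reports("rule-output")` followed by `IPython.display.IFrame(src=str(reports[0]), width=1000, height=600)` is one way to embed the first report in the notebook.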


## Model Deployment

In [81]:
# TODO: Deploy your model to an endpoint

# Specify instance type and count for deployment
instance_type = 'ml.m5.large'  
instance_count = 1  

predictor = estimator.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type
)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-east-1-637163362320/pytorch-training-2024-10-27-17-40-03-028/output/model.tar.gz), script artifact (s3://sagemaker-us-east-1-637163362320/pytorch-training-2024-10-27-17-40-03-028/source/sourcedir.tar.gz), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-east-1-637163362320/pytorch-training-2024-10-27-18-24-52-100/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: pytorch-training-2024-10-27-18-24-52-100
INFO:sagemaker:Creating endpoint-config with name pytorch-training-2024-10-27-18-24-52-100
INFO:sagemaker:Creating endpoint with name pytorch-training-2024-10-27-18-24-52-100


-------!

In [None]:
import boto3
import numpy as np
from PIL import Image
import io
from sagemaker.predictor import Predictor

# Define your SageMaker endpoint name
endpoint_name = 'pytorch-training-2024-10-27-18-24-52-100'

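# Initialize a Predictor for the endpoint. With no serializer given, the bytes are sent unchanged as
# application/octet-stream, so the model's inference handler must accept that content type
predictor = Predictor(endpoint_name=endpoint_name)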

# Initialize S3 client
s3 = boto3.client('s3')

# Define the S3 bucket and image path
bucket = 'sagemaker-us-east-1-637163362320'
image_key = 'dogImages/test/003.Airedale_terrier/Airedale_terrier_00166.jpg'

# Download the image from S3
s3.download_file(bucket, image_key, 'temp_image.jpg')

# Load the image, force 3-channel RGB (JPEG cannot store RGBA or palette images), and resize to 224x224
image = Image.open('temp_image.jpg').convert('RGB')
image = image.resize((224, 224))

# Convert the image to a byte array
buffered = io.BytesIO()
image.save(buffered, format="JPEG")
img_byte_array = buffered.getvalue()

# Run the prediction
response = predictor.predict(img_byte_array)

# Print the response
print(response)
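
The plain `Predictor` above returns the endpoint's raw response bytes. Assuming the inference handler returns a JSON list of class scores (one row of logits per image, which is common for PyTorch handlers but not confirmed by this notebook) and that labels follow sorted folder names, as torchvision's `ImageFolder` assigns them, the scores can be mapped to a breed name. The helper name is illustrative.

```python
import json


def decode_prediction(response_bytes, class_names):
    """Turn a JSON list of scores (one row per image) into (index, class_name) tuples."""
    scores = json.loads(response_bytes)
    if scores and not isinstance(scores[0], list):
        scores = [scores]  # single image returned without a batch dimension
    results = []
    for row in scores:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over class scores
        results.append((best, class_names[best]))
    return results
```

For example, `class_names = sorted(os.listdir("dogImages/test"))` followed by `decode_prediction(response, class_names)` should report `003.Airedale_terrier` for the test image above if the model classifies it correctly. This assumes every split contains all class folders and no hidden files.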

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()